Research on Segmentation and Classiﬁcation of Heart Sound Signals Based on Deep Learning

Abstract: The heart sound signal is one of the signals that reflect the health of the heart. Research on the heart sound signal contributes to the early diagnosis and prevention of cardiovascular diseases. As a commonly used deep learning network, the convolutional neural network (CNN) has been widely applied to images. In this paper, a method of analyzing heart sounds with CNNs is studied. First, the original data set was preprocessed; then the heart sounds were segmented with a U-net built on a deep CNN; finally, the heart sounds were classified with a CNN. Data from the 2016 PhysioNet/CinC Challenge were used for algorithm validation, with the following results. In segmentation, the overall accuracy was 0.991, the accuracy for the first heart sound was 0.991, for the systolic period 0.996, for the second heart sound 0.996, and for the diastolic period 0.997, and the average accuracy was 0.995. In classification, the accuracy was 0.964, the sensitivity was 0.781, and the specificity was 0.873. These results show that CNN-based deep learning performs well in the segmentation and classification of heart sound signals.


Introduction
The heart sound results from myocardial movement and the opening and closing of the valves; it is greatly affected by the hemodynamics and electrical activity of the heart muscle [1]. In the early stage of cardiovascular disease, heart sound auscultation, as a means of preliminary screening for cardiovascular diseases, can help differentiate abnormal signals from normal heart sound signals and, therefore, provide effective information for the auxiliary diagnosis of cardiovascular diseases. Any dysfunction or anatomical defect of the heart can be reflected in the timing, frequency spectrum, and morphological characteristics of the heart sound [2]. Although the electrocardiogram (ECG) signal contains a lot of physiological information on the cardiovascular system, it cannot reveal a lesion in the early stage of cardiovascular disease, when the lesion is not yet pronounced; heart sounds, however, can reflect a lesion at this early stage. Therefore, heart sound signals contain very important physiological information, and the study of heart sound signals has very important clinical value for the early diagnosis of cardiovascular diseases. The segmentation and classification of heart sound signals are currently the most commonly used methods for studying the heart sound signal.
Heart sound segmentation, as a common method of heart sound signal processing, aims to divide the heart sound cycle into four corresponding states, and it is also an important processing step for heart sound classification [3]. A heart sound cycle of a normal adult mainly includes the first heart sound (S1), the systolic period (sys), the second heart sound (S2), and the diastolic period (dia), as shown in Figure 1.
Figure 1. An example of a normal heart sound that includes two heart sound cycles; each cycle consists of the following heart sound components: S1, sys, S2, and dia.
Heart sound classification determines whether a heart sound is normal or abnormal. Classification methods can be divided into those without segmentation and those that include segmentation. Classification without segmentation extracts features from the entire heart sound after preprocessing, and the classifier is trained and applied using these features; in recent years, Hamidi, Arora, and Yaseen et al. have conducted related research [28-30]. Classification including segmentation extracts the features of S1, sys, S2, and dia based on the segmentation of the heart sound, and a new feature set is formed by combining these per-state features with other features of the entire heart sound; the classifier is then trained on this new feature set. In this respect, Pedro Narváez, Kucharski, and Li Fan et al. have done related work [31-33]. Compared with classification without segmentation, classification that includes segmentation provides the state marks of the heart sound, which enables clinicians to locate the abnormal part of the heart sound, such as a diastolic or systolic murmur, and helps to further determine the position of the heart valve responsible for the disease. "Classification of Heart Sound Recordings: the PhysioNet/Computing in Cardiology Challenge 2016" was a competition for heart sound classification [34], aiming to encourage the development of algorithms that classify heart sound recordings and identify whether the subject of a recording should be referred for an expert diagnosis. PhysioNet provides a basic method for heart sound segmentation and a large number of heart sound recordings, which were widely used in the 2016 competition and in subsequent research. Among the participants in the competition, Potes et al. used a segmentation algorithm to classify heart sounds and obtained first place [25]. However, in subsequent research on heart sound signals, Renna et al. pointed out that a more complex classifier can improve classification only to a limited extent, and that an improved segmentation algorithm may be the best way to obtain a more significant improvement in heart sound classification. They applied CNNs to heart sound segmentation and obtained good experimental results, but they did not further discuss whether the segmentation network leads to good classification performance [19]. In 2020, Khan et al. also studied heart sound signals. They compared the classification results of segmented and unsegmented heart sound signals and concluded that using segmented heart sound signals contributes to better classification. However, in their experiments they used an improved empirical wavelet transform and normalized Shannon average energy to preprocess and automatically segment the signals, identifying only the systolic and diastolic intervals rather than the four states [35].
Therefore, this paper studies a method of heart sound segmentation using a deep CNN, and further combines the segmentation network with heart sound classification. First, the heart sound was preprocessed; then the signal was segmented by a multichannel deep CNN, and finally classified by a CNN classifier and an Adaboost classifier. Heart sound segmentation, as a necessary stage of heart sound signal analysis, does not require the duration of each state to be known in advance, and directly uses the deep CNN to learn the sound features that minimize segmentation errors, which is the focus of this paper. Considering the increasing number of cardiovascular diseases and the existing shortage of medical resources, we intend to apply this study in practice through an auxiliary diagnosis system consisting of hardware and software, as shown in Figure 2. The hardware mainly includes electronic stethoscopes (a simple electronic stethoscope and a professional electronic stethoscope), and the software includes recording and analysis software on computers and mobile phones. The simple electronic stethoscope could be available in every household, just like a clinical thermometer. When the body is abnormal, the device is used to collect the signal and a preliminary diagnosis is obtained on the mobile phone. If the signal is abnormal, the person goes to the hospital for treatment. In the hospital, the doctor collects the signal with professional equipment, analyzes it on the computer (just like an ECG), and arranges further examination.
Figure 2. The diagram of the auxiliary diagnosis system: preliminary screening and professional diagnosis.

Pre-Processing of Signal
The heart sound signal is preprocessed to produce data that meet the requirements of the model input, as shown in Figure 3. First, the data are normalized to eliminate the influence on amplitude of differences in acquisition technology and auscultation location. This paper used normalization to reduce the differences between recordings as much as possible, as expressed in Equation (1).
where xr(n) represents the original heart sound signal, x(n) represents the normalized heart sound signal, and xr(min) and xr(max) are the minimum and maximum values of the original signal, respectively. Then, second-order high-pass and low-pass Butterworth filters with cut-off frequencies of 25 Hz and 400 Hz were used to filter the normalized signal, and spike removal and feature extraction were applied to the filtered signal. Four methods were used to extract features: (1) Hilbert envelope; (2) homomorphic envelogram; (3) wavelet envelope; (4) power spectral density envelope.
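As a rough illustration of this pipeline, the following Python sketch (assuming NumPy and SciPy; the function names, the homomorphic low-pass cut-off, and other details are our assumptions rather than the authors' code) normalizes a recording, applies the 25 Hz and 400 Hz Butterworth filters, and computes two of the four envelopes before down-sampling them to 50 Hz:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, resample_poly

def normalize(x):
    # One plausible form of Equation (1): min-max scaling using xr(min) and xr(max).
    return (x - x.min()) / (x.max() - x.min())

def bandpass(x, fs, low=25.0, high=400.0, order=2):
    # Second-order Butterworth high-pass (25 Hz) followed by low-pass (400 Hz).
    b_hp, a_hp = butter(order, low / (fs / 2), btype="highpass")
    b_lp, a_lp = butter(order, high / (fs / 2), btype="lowpass")
    return filtfilt(b_lp, a_lp, filtfilt(b_hp, a_hp, x))

def hilbert_envelope(x):
    # Hilbert envelope: magnitude of the analytic signal.
    return np.abs(hilbert(x))

def homomorphic_envelogram(x, fs, lpf_hz=8.0, order=1):
    # Homomorphic envelogram: low-pass filter the log-envelope, then exponentiate.
    # The 8 Hz cut-off is an assumption, not a value from the paper.
    b, a = butter(order, lpf_hz / (fs / 2), btype="lowpass")
    return np.exp(filtfilt(b, a, np.log(np.abs(hilbert(x)) + 1e-10)))

fs = 1000                                  # segmentation recordings stored at 1 kHz
raw = np.random.randn(10 * fs)             # stand-in for a real 10 s recording
x = bandpass(normalize(raw), fs)           # normalize first, then filter
feats = np.stack([hilbert_envelope(x), homomorphic_envelogram(x, fs)])
feats = resample_poly(feats, up=50, down=fs, axis=-1)   # envelopes down-sampled to 50 Hz
feats = (feats - feats.mean(-1, keepdims=True)) / feats.std(-1, keepdims=True)
print(feats.shape)                         # (2, 500)
```

The wavelet and power spectral density envelopes would be computed analogously and stacked as additional feature channels.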

Finally, the obtained features were down-sampled to 50 Hz, and all down-sampled features were normalized to zero mean and unit variance [9]. From each feature envelope obtained, fixed-length segments were extracted according to the overlapping step length of adjacent data, which can be expressed in Equation (2).
where xk is the feature of the original length of the data, Xk represents the feature of the extracted fixed-length data, k = 1, 2, 3, 4, n = 1, 2, ..., (8N − τ − 8)/(7τ − 8), N represents the total length of the data, and τ is the fixed length of the data input to the model.
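A minimal sketch of the window extraction behind Equation (2); the exact hop between adjacent windows is not unambiguous from the text, so it is left as a parameter and set to τ/8 here purely for illustration:

```python
import numpy as np

def extract_windows(feature, tau, hop):
    """Slice a feature envelope of length N into overlapping windows of fixed
    length tau (cf. Eq. (2)); `hop` is the shift between adjacent windows."""
    n_windows = (len(feature) - tau) // hop + 1
    return np.stack([feature[n * hop : n * hop + tau] for n in range(n_windows)])

# Illustrative use: tau = 512 samples at 50 Hz (10.24 s). The hop implied by the
# paper's "overlapping step of 1/8 of the fixed length" is read here as tau // 8;
# this is an assumption about the exact overlap scheme.
env = np.random.randn(3000)                 # stand-in for one 50 Hz envelope
windows = extract_windows(env, tau=512, hop=512 // 8)
print(windows.shape)                        # (n, 512)
```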

Segmentation of Signal
Heart sound segmentation, which divides the heart sound into four periods (S1, systolic, S2, and diastolic) based on the features obtained through preprocessing, was the focus of this paper. This paper adopted a U-net implementation based on the deep CNN, as shown in Figure 4, and the network structure is shown in Figure 5.
The mathematical model of the convolution of each layer of the network in this architecture can be expressed in Equation (3), where Ai,j represents the element in the i-th row and j-th column of the matrix, the elements of matrix A correspond to the weights of the different filters for the different feature inputs of the ℓ-th layer, N represents the dimension of the input feature space of the ℓ-th layer, k represents the dimension of the output feature space of the ℓ-th layer, and Wi,j and Zi,j represent the input and output of the ℓ-th convolutional layer, respectively. The network can be understood as down-sampling, up-sampling, and splicing fusion. Specifically, down-sampling implemented convolution operations of different degrees and learned features at different levels; as the depth of the network increased, the learned features were converted from low-dimensional to high-dimensional. Up-sampling implemented deconvolution: after the deep features were learned, the data length was gradually restored to the size of the original input. The splicing fusion combined the features of different dimensions learned during down-sampling with the data recovered during up-sampling, realizing the fusion of features at different scales. Finally, the network output was followed by a softmax activation function, and the output was the state sequence corresponding to the heart sound.
It should be noted that the outputs of all convolutional layers of the network were zero-padded to ensure that the data dimension does not change after convolution, as shown in Figure 6.
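For concreteness, the following PyTorch sketch shows a depth-5, one-dimensional U-net of the kind described, with zero-padded convolutions so the temporal length is preserved, max-pooling of size 2, and channel widths growing from 8 to 128; the exact layer arrangement is an assumption, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two zero-padded 1-D convolutions (kernel 9, stride 1) with ReLU,
    so the temporal length is unchanged."""
    def __init__(self, in_ch, out_ch, k=9):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, k, padding=k // 2), nn.ReLU(),
            nn.Conv1d(out_ch, out_ch, k, padding=k // 2), nn.ReLU(),
        )
    def forward(self, x):
        return self.block(x)

class UNet1D(nn.Module):
    """Sketch of a depth-5 1-D U-net: max-pool down-sampling, transposed-conv
    up-sampling, and skip connections that concatenate encoder features."""
    def __init__(self, in_ch=4, n_states=4, widths=(8, 16, 32, 64, 128)):
        super().__init__()
        self.down = nn.ModuleList()
        ch = in_ch
        for w in widths:
            self.down.append(ConvBlock(ch, w))
            ch = w
        self.pool = nn.MaxPool1d(2)
        self.up, self.dec = nn.ModuleList(), nn.ModuleList()
        for w in reversed(widths[:-1]):
            self.up.append(nn.ConvTranspose1d(ch, w, kernel_size=2, stride=2))
            self.dec.append(ConvBlock(2 * w, w))
            ch = w
        self.head = nn.Conv1d(ch, n_states, kernel_size=1)
    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.down):
            x = enc(x)
            if i < len(self.down) - 1:
                skips.append(x)
                x = self.pool(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return torch.softmax(self.head(x), dim=1)  # per-sample state probabilities

model = UNet1D()
probs = model(torch.randn(2, 4, 512))   # batch of 2, four envelopes of length 512
print(probs.shape)                      # torch.Size([2, 4, 512])
```

Each output channel corresponds to one of the four states (S1, sys, S2, dia), so a cross-entropy loss over the per-sample state probabilities can be minimized during training.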

Classification
Heart sound segmentation provides the state labels of the heart sound. There are many ways to classify heart sounds on the basis of segmentation; this study chose the Adaboost classifier and the CNN classifier. Adaboost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers to form a stronger final classifier (strong classifier). The CNN classifier was implemented by constructing a network of a series of convolutional layers and a multi-layer perceptron (MLP). In the convolutional layers, sufficient features were obtained by controlling the number of convolution kernels, and the obtained features were fed into the MLP network to classify the heart sounds. The process is shown in Figure 7.
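A minimal PyTorch sketch of such a CNN classifier, with stacked convolutional layers followed by an MLP head; the channel counts, and the assumption that the segmentation output is concatenated with the envelopes as input channels, are illustrative only:

```python
import torch
import torch.nn as nn

class HeartSoundCNN(nn.Module):
    """Sketch of a CNN classifier: stacked 1-D convolutions extract features
    from the segmented (state-labelled) input, and an MLP head outputs
    normal / abnormal. Channel and layer counts are illustrative."""
    def __init__(self, in_ch=8, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_ch, 16, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=9, padding=4), nn.ReLU(), nn.AdaptiveAvgPool1d(8),
        )
        self.mlp = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 8, 64), nn.ReLU(), nn.Linear(64, n_classes),
        )
    def forward(self, x):
        return self.mlp(self.features(x))

# Illustrative input: the four envelopes concatenated with the four per-sample
# state probabilities produced by the segmentation network (8 channels, length 512).
logits = HeartSoundCNN()(torch.randn(2, 8, 512))
print(logits.shape)  # torch.Size([2, 2])
```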

Network Model
To conduct a comprehensive study of the characteristics of heart sound signals, this paper established a segmentation and classification model structure based on CNNs. The input layer consisted of the four features extracted during pre-processing. The four features with a fixed length of 512 were selected as the input of the multi-channel CNN, divided into the corresponding states by the segmentation network, and then passed to the CNN classifier to complete the classification. The network structure consisted of three parts: pre-processing, a U-net segmentation network based on the deep CNN, and a CNN classifier. The overall flow chart is shown in Figure 8.

Data Sources
The data used in this paper were from the 2016 PhysioNet/CinC Challenge database [29], which provided segmentation and classification annotations. It is currently the world's largest public heart sound data set. Among them, a total of 792 pieces of data with segmentation annotations were stored at a sampling frequency of 1 kHz, and a total of 301 pieces of heart sounds with classification annotations were stored at a sampling frequency of 2 kHz.

Data Pre-processing
Referring to the signal pre-processing described above, the fixed length values in this paper were 64, 128, 256, and 512. The overlapping step of two adjacent segments was 1/8 of the fixed length. Table 1 lists the number of data segments for each fixed length, and Figure 9 shows the envelopes extracted from a heart sound. In order to improve the effective utilization of the data and increase the robustness of the model, this paper used 10-fold cross-validation.
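A sketch of the 10-fold cross-validation loop, assuming scikit-learn; the array shapes are placeholders for the windowed feature data, not the actual data set:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical windowed data: n segments of length 512 with 4 feature channels.
X = np.random.randn(711, 4, 512)
y = np.random.randint(0, 2, size=711)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Train the segmentation / classification model on the training folds,
    # evaluate on the held-out fold, and average the metrics over the 10 folds.
    print(f"fold {fold}: {len(train_idx)} train, {len(test_idx)} test segments")
```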


Evaluation Index
To analyze the results of the model, this paper used the following six indicators: overall accuracy (PA), accuracy of each state (CPA), average accuracy of each state (MPA), accuracy (Acc), sensitivity (Se), and specificity (Sp). Among them, PA, CPA, and MPA were used to evaluate the segmentation performance, while Acc, Se, and Sp were used to evaluate the classification performance. The relevant indicators are calculated as follows.
where TP represents the number of normal recordings classified as normal, FP represents the number of abnormal recordings classified as normal, TN represents the number of abnormal recordings classified as abnormal, and FN represents the number of normal recordings classified as abnormal.
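The paper's equations are not reproduced in this extract; the standard forms consistent with the definitions above would be as follows (treat these as the usual definitions rather than the authors' exact expressions):

```latex
% Hedged reconstruction: standard classification and segmentation accuracy metrics.
\begin{align*}
\mathrm{Acc} &= \frac{TP+TN}{TP+TN+FP+FN}, &
\mathrm{Se}  &= \frac{TP}{TP+FN}, &
\mathrm{Sp}  &= \frac{TN}{TN+FP},\\[4pt]
\mathrm{PA}  &= \frac{\sum_{i} n_{ii}}{\sum_{i}\sum_{j} n_{ij}}, &
\mathrm{CPA}_i &= \frac{n_{ii}}{\sum_{j} n_{ij}}, &
\mathrm{MPA} &= \frac{1}{4}\sum_{i=1}^{4}\mathrm{CPA}_i,
\end{align*}
% where n_{ij} is the number of samples whose true state is i and whose predicted
% state is j, with i and j ranging over the four states S1, sys, S2, and dia.
```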

Figure 9. An example of the envelopes extracted from a heart sound signal with a fixed length, containing the following four heart sound features: homomorphic envelogram, Hilbert envelope, wavelet envelope, and PSD (power spectral density) envelope.


Development Environment
The experimental configuration was as follows: an Intel Core i3-3220 @ 3.30 GHz CPU, 8 GB RAM, and a GTX 550 graphics card. Python 3.7 was selected as the development platform, with PyTorch as the back-end.

Impact of Fixed Length on Performance Indicators
During the experiment, this study set up input signals of four lengths to train and test the model. The signal lengths were 64, 128, 256, and 512 (corresponding to 1.28 s, 2.56 s, 5.12 s, and 10.24 s, respectively). Table 2 shows the results. It was found that the best results were obtained when the length was set to 512: the segmentation accuracy was 0.994, the accuracies of the four states S1, sys, S2, and dia were 0.986, 0.993, 0.994, and 0.996, respectively, and the average accuracy was 0.992. From these results, it can be seen that performance in all aspects improved significantly at a length of 512. In addition, Table 1 shows that when the length was set to 512 there were only 711 data segments in total, significantly fewer than for the other lengths. However, Ronneberger et al. pointed out that the U-net network also performs well on small data sets [36]. Therefore, the input data length of the model was set to 512.
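The stated correspondence between sample lengths and durations follows directly from the 50 Hz envelope rate obtained after down-sampling:

```latex
\[
\frac{64}{50\,\mathrm{Hz}} = 1.28\,\mathrm{s},\qquad
\frac{128}{50\,\mathrm{Hz}} = 2.56\,\mathrm{s},\qquad
\frac{256}{50\,\mathrm{Hz}} = 5.12\,\mathrm{s},\qquad
\frac{512}{50\,\mathrm{Hz}} = 10.24\,\mathrm{s}.
\]
```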

Impact of Optimizer on Performance Indicators
Choosing a suitable optimizer and an optimal learning rate can improve the training speed and classification accuracy of the model [37]. This paper selected several commonly used optimizers, namely Adam, RMSprop (Root Mean Square Prop), Adagrad, and SGD (Stochastic Gradient Descent), to train the model. Table 3 shows the experimental results under the different optimizers. From these results, it can be seen that the segmentation effect was best when the Adam optimizer was used with the learning rate set to 0.0001.
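A brief PyTorch sketch of the optimizer configurations compared; only the Adam learning rate of 0.0001 is taken from the paper, the remaining settings are defaults used here for illustration:

```python
import torch

model = torch.nn.Conv1d(4, 4, kernel_size=9, padding=4)  # stand-in for the U-net

# The optimizers compared in Table 3; the Adam learning rate of 1e-4 is the value
# reported as best, while the other settings are PyTorch defaults / illustrative.
optimizers = {
    "Adam": torch.optim.Adam(model.parameters(), lr=1e-4),
    "RMSprop": torch.optim.RMSprop(model.parameters()),
    "Adagrad": torch.optim.Adagrad(model.parameters()),
    "SGD": torch.optim.SGD(model.parameters(), lr=0.01),
}
```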

Impact of U-net Depth on Performance Indicators
The U-net network is built on the basis of CNNs, and an increase in the number of network layers means an increase in the number of convolutional layers. A deeper network can obtain higher-dimensional features. This paper set the depth of the U-net network to 5, 6, 7, and 8. Table 4 shows the experimental results under the different U-net depths. As these results show, a good segmentation effect was obtained for all of the depths considered. However, considering computer memory and training time, the depth was set to 5.

Impact of Convolution Kernel Sizes on Performance Indicators
Considering that the size of the convolution kernel has a considerable influence on classification performance and computation speed [37], four different convolution kernels were tested after the basic structure of the network and the length of the input signal were determined, as shown in Table 5. It was found that changing the convolution kernel has some influence on the results. When the convolution kernel was set to 9 × 4 or 31 × 4, the best segmentation effect was obtained. However, considering the effect of a larger convolution kernel on computation speed, the model's convolution kernel was set to 9 × 4, so as not to waste computing resources.


Determination of Segmentation Model Parameters
From the above experimental results, it can be seen that, among all the parameters considered, the size of the convolution kernel and the depth of the network have little influence on the segmentation effect, while the length of the input data and the selected optimizer have a greater influence on the result, as shown in Figure 10. Finally, the segmentation model parameters were determined, as shown in Table 6: the input length was 512, the Adam optimizer was chosen, the network depth was 5, and the convolution kernel was 9 × 4. In the down-sampling path, the number of filters in the convolutional layers increased successively from 8 to 128, and in the up-sampling path the number of filters was reduced from 128 to 4; the convolution kernel of each layer was 9 × 4, the stride of the convolutional layers was 1, the pooling size was 2, and the learning rate was 0.0001. Figure 11 shows the segmentation result for a normal heart sound.
Figure 11. The segmentation results of an example of a normal heart sound that includes two heart sound cycles; each cycle consists of the following heart sound components: S1, sys, S2, and dia.

Application of Segmentation Model in Classification
In this paper, the Adaboost classifier and the CNN classifier were applied on the basis of segmentation to classify heart sounds, and the classification results are shown in Table 7. It can be clearly seen from Figure 12 that the results of the CNN classifier are significantly better than those of the Adaboost classifier in sensitivity, specificity, and accuracy.


Discussion
In this paper, a CNN was used to segment the heart sound signal, and the segmentation was further applied to classification. In the study of heart sound segmentation, referring to the process of image segmentation, a U-net network composed of deep CNNs was used to segment the heart sound. Furthermore, the CNN classifier was used to classify the segmented heart sounds as normal or abnormal. For segmentation, we discussed the impact of data length, network depth, convolution kernel size, and optimizer on the segmentation results. The fixed-length experiments show that increasing the data length can improve the segmentation accuracy. As Table 2 shows, the amount of data decreases as the fixed length increases. Ronneberger et al. pointed out that the U-net network can also perform well on small data sets [36], although a smaller amount of data reduces the credibility of the optimized model. When the data length was set to 512, the best segmentation results were obtained. The influence of the amount of fixed-length data on the result needs to be further explored with more data. Increasing the network depth can effectively improve the performance of the network, which is consistent with the results obtained by Krizhevsky and Simonyan et al. [38,39]. However, during the experiments it was found that increasing the network depth increases the complexity of the model, meaning that the number of model parameters grows substantially. Too many parameters consume a lot of computer memory and training time, and on the basis of an already good segmentation effect it is not worthwhile to spend a lot of memory and training time on a marginal improvement in segmentation accuracy, which is why the network depth was set to 5. The selection of the optimizer has a greater impact on the segmentation results. When the SGD optimizer was selected, the segmentation results were poor; when the Adam optimizer was selected, the segmentation effect improved to a certain extent, and it performed best among the selected optimizers. However, the study by Keskar et al. [40] suggests that starting from the most basic SGD optimizer and gradually adding optimization terms (such as first-order and second-order momentum) tailored to the research object can also improve the model. For heart sound signals, this approach could be considered in order to further explore suitable optimizers and improve the performance of the model.

Conclusions
This paper proposed a method of using CNNs to study heart sound signals, involving both segmentation and classification. For segmentation, a U-net composed of deep CNNs was applied; the relevant parameters of the model were determined, and a model that segments heart sounds well was trained by optimizing the network structure and comparing the segmentation results under different optimizers and input data lengths. For classification, the trained segmentation model was used to segment the heart sounds, the Adaboost classifier and the CNN classifier were then used to classify them, and the classification results were compared to select the better classifier. Without knowing the duration of each state in advance, the segmentation model we trained achieved an overall accuracy of 0.996; the accuracies of S1, sys, S2, and dia were 0.991, 0.996, 0.996, and 0.997, respectively, and the average accuracy was 0.995. In the subsequent classification, the CNN classifier achieved an accuracy of 0.964, a sensitivity of 0.781, and a specificity of 0.873. Therefore, a preliminary conclusion can be drawn that CNNs, as a basic deep learning network structure, can be applied to the study of heart sounds, and we believe they will play an important role in future combinations of heart sound analysis and deep learning. In addition, considering that changes in heart sounds in different periods are often accompanied by different types of cardiovascular disease, and given the advent of the big data era, the rapid development of artificial intelligence, and the increasing incidence of heart disease, future work can analyze different types of diseases, study heart sound signals in different periods, and even investigate specific diseases. In the future, we will focus on the study of heart sound signals in different periods and expand the classification of normal and abnormal heart sounds to the screening of specific diseases in specific periods. At the same time, we will look for more opportunities to collaborate with clinicians to collect more heart sound data and to optimize the model structure.