We hypothesized that the accuracy of murmur detection models that can determine the current state of the heart will increase as the feature information in phonocardiography becomes more diverse. We conducted research using four features: spectrograms, S1 and S2, the complements of S1 and S2, and envelopes. We transformed the one-dimensional PCG data into a two-dimensional spectrogram and used it as input to the CNN. Spectrograms provide important diagnostic information related to the cardiac state that is not clearly visible in the temporal domain of one-dimensional data by visually representing the frequency and temporal components of the signal, thereby improving the accuracy of prediction algorithms.
2.1.1. Log-Mel Spectrogram
The log-mel spectrogram is a feature extraction technique that takes into account human auditory characteristics. It performs a short-time Fourier transform (STFT) and applies mel-scale filters to obtain the power for each frequency band, which is then converted to a logarithmic scale. To determine what information is present in the original signal, the signal must be decomposed into its frequency components using a Fourier transform. However, applying a Fourier transform to the entire signal discards the temporal information and leaves only frequency information. To preserve the temporal information, the Fourier transform is instead applied to short, successive segments of the signal, which is known as the STFT. The number of data points sampled per second is called the “sampling rate”, the process of dividing the signal into short intervals is called “framing”, and the size of each frame is referred to as the “window size”. Because each frame is cut off abruptly at its edges, the result resembles a sound being truncated. Therefore, a Hamming window is applied to each frame so that it tapers off smoothly, giving more weight to the center of the frame. Since the Hamming window attenuates the information at the ends of each frame, the frames are made to overlap so that this information is not lost. The step between the starts of consecutive frames is called the “hop length”, so adjacent frames overlap by the window size minus the hop length, and the STFT spectrogram is generated based on the window size and hop length. The mel spectrogram [15] is calculated by converting the power for each frequency band of the generated STFT spectrogram to the mel scale using Equation (1),
$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \quad (1)$$
where $f$ represents frequency in Hz and $m$ is the corresponding mel value. The log-mel spectrogram is obtained by taking the logarithm of the power of the mel spectrogram and converting it to dB using Equation (2),
$$S_{\mathrm{dB}} = 10 \log_{10}\left(\frac{S}{\mathrm{ref}}\right) \quad (2)$$
where $S$ is the power of the mel spectrogram and $\mathrm{ref}$ is a reference value that $S$ is scaled relative to.
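The framing, windowing, mel conversion, and dB scaling described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this study: the window size, hop length, number of mel bands, the triangular filterbank, and all function names are assumptions chosen for the example, and the mel conversion assumes the common form m = 2595 log10(1 + f/700).

```python
import numpy as np

def hz_to_mel(f):
    """Hz -> mel, assuming m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def stft_power(x, win_size=256, hop=64):
    """Frame the signal, apply a Hamming window, return |FFT|^2 per frame."""
    window = np.hamming(win_size)
    n_frames = 1 + (len(x) - win_size) // hop
    frames = np.stack([x[i * hop: i * hop + win_size] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (frames, win_size // 2 + 1)

def mel_filterbank(sr, win_size, n_mels=32):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    fft_freqs = np.fft.rfftfreq(win_size, d=1.0 / sr)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(x, sr, win_size=256, hop=64, n_mels=32, ref=1.0, eps=1e-10):
    """Mel-band power converted to dB relative to `ref` (eps avoids log of zero)."""
    power = stft_power(x, win_size, hop)
    mel = power @ mel_filterbank(sr, win_size, n_mels).T  # (frames, n_mels)
    return 10.0 * np.log10(np.maximum(mel, eps) / ref)

# Illustrative input only: one second of a 100 Hz tone sampled at 4000 Hz.
sr = 4000
t = np.arange(sr) / sr
demo = log_mel_spectrogram(np.sin(2 * np.pi * 100 * t), sr)
print(demo.shape)  # (59, 32): 59 frames by 32 mel bands
```

With a window size of 256 and a hop length of 64, adjacent frames share 192 samples, which is the overlap that compensates for the tapering of the Hamming window.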
2.1.2. Peak Interval
Peaks refer to the regions of a signal where the amplitude or energy is high, and they are an important clue for inferring the characteristics of a sound. In this paper, the peak interval is defined as the distance between peaks in the PCG signal. Numerous medical studies and previous PhysioNet competitions have utilized the R-R interval, which represents the distance between peaks in an ECG signal, as a tool for evaluating anomalies in cardiac disease [16,17,18]. Since the R-R interval in an ECG effectively represents heart rate variability (HRV) [17,18], an additional feature is required for PCGs to express HRV. Thus, the “peak interval”, which corresponds to the R-R interval, was devised for representing HRV in PCG signals.
Heart murmurs arise during systole and diastole and generate additional waveforms corresponding to this noise. Therefore, we hypothesized that patients with heart disease would have more peaks due to abnormal heart murmurs during systole and diastole, and that more peaks would result in shorter peak intervals. To confirm this hypothesis, we compared the average peak interval between subjects with murmurs and those without murmurs in the PhysioNet Challenge 2022 dataset. We found that the interval was approximately 49% longer in the absence of murmurs, as illustrated in Figure 2; that is, patients with murmurs exhibited a higher number of peaks during the same time period than those without murmurs. Peak detection can be easily performed on raw audio using the “ecg_peaks” function from the Python library “Neurokit2” [19]. Although we intended to use the peak interval values as a sequence, it was difficult to calculate the exact peak intervals due to noise, so we used the average peak interval instead of the sequence form.
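The average peak interval can be illustrated with a short sketch. Note that this example swaps in `scipy.signal.find_peaks` as a generic peak detector so that it runs without extra dependencies; it is not the “ecg_peaks” routine used in this study, and the height and minimum-distance settings, the function name, and the synthetic impulse trains are assumptions made purely for illustration.

```python
import numpy as np
from scipy.signal import find_peaks

def mean_peak_interval(x, sr, min_height=0.5, min_distance_s=0.05):
    """Average time in seconds between successive peaks of |x|.

    ASSUMPTION: a generic height/distance peak picker stands in for the
    NeuroKit2 detector mentioned in the text. Returns NaN if fewer than
    two peaks are found.
    """
    distance = max(1, int(min_distance_s * sr))
    peaks, _ = find_peaks(np.abs(x), height=min_height, distance=distance)
    if len(peaks) < 2:
        return float("nan")
    return float(np.mean(np.diff(peaks)) / sr)

# Synthetic illustration (not data from the study): 4 s at 4000 Hz.
sr = 4000
clean = np.zeros(4 * sr)
clean[1000::2000] = 1.0          # one peak every 0.5 s
with_extra = clean.copy()
with_extra[2000:15000:2000] = 0.8  # additional peaks between the originals

interval_clean = mean_peak_interval(clean, sr)
interval_extra = mean_peak_interval(with_extra, sr)
print(interval_clean, interval_extra)  # 0.5 0.25
```

The second signal contains extra peaks between the original ones, so its average peak interval is shorter, which mirrors the direction of the effect hypothesized for recordings with murmurs.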
2.1.3. Boundary Detection for Fundamental Heart Sounds and Heart Murmurs
The heart sound is a mechanical activity signal of the heart. The fundamental heart sounds (FHSs) consist of two components, S1 and S2, as illustrated in Figure 3. S1 and S2 are the most basic heart sounds, while heart murmurs are sounds occurring during both systole and diastole. The presence of heart murmurs can be indicative of cardiac issues, although they may also be occasionally heard in healthy children and young adults (Akkarapol et al., 2012 [20]). FHSs and heart murmurs are essential elements when analyzing PCG signals to diagnose various heart diseases. However, in the PhysioNet Challenge 2022 dataset, FHS information is not provided, resulting in the lack of crucial characteristic information to judge the heart’s condition. To improve the performance of the model in automatically identifying heart diseases, FHS feature information would be necessary. Therefore, we referred to the time–frequency-domain approach proposed by Ghosh and Ponnalagu (2019) [21] for automated FHS detection using PCG signals. The boundaries of the heart cycle, S1, S2, and the boundaries of systole and diastole murmurs contained in the extracted heart cycle were detected. We then applied a log-mel spectrogram to each detected boundary signal to extract spectrogram features for identifying heart diseases.
The process of detecting FHS boundaries and applying spectrograms was performed in six steps: (1) amplitude normalization, (2) applying a Butterworth low-pass filter, (3) creating a PCG signal envelope, (4) selecting a threshold, (5) detecting boundaries, and (6) applying a log-mel spectrogram. This process is demonstrated in Figure 4.
First, according to Ghosh and Ponnalagu (2019) [21], PCG signals vary in amplitude depending on patient factors such as physiology, gender, and age, making it difficult to predict the dynamic range of the signal. Therefore, normalization was applied to the signal data to transform the range of amplitudes to between −1 and 1. The signal data were normalized according to Equation (3),
$$x_{\mathrm{norm}}(n) = \frac{x(n)}{\max_{n} \left| x(n) \right|} \quad (3)$$
where $n = 1, 2, \ldots, N$, and $N$ represents the total number of samples. $x_{\mathrm{norm}}(n)$ represents the normalized signal.
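As a small worked example of this step, the sketch below assumes that the normalization divides every sample by the largest absolute amplitude, which is consistent with the stated output range of −1 to 1. The function name and the handling of an all-zero signal are assumptions for illustration.

```python
import numpy as np

def normalize_amplitude(x):
    """Scale a signal into [-1, 1] by its largest absolute sample.

    ASSUMPTION: max-abs scaling; an all-zero signal is returned unchanged
    to avoid division by zero.
    """
    x = np.asarray(x, dtype=float)
    peak = np.max(np.abs(x))
    if peak == 0:
        return x.copy()
    return x / peak

normalized = normalize_amplitude([2.0, -4.0, 1.0])
print(normalized)  # [ 0.5  -1.    0.25]
```

Because only a single scale factor is applied, the shape of the waveform is preserved and recordings with different gains become comparable.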
Second, to eliminate high-frequency components, we preprocessed the normalized signal using a low-pass Butterworth filter with a 150 Hz cutoff frequency [21]. This cutoff frequency, a hyperparameter, was selected after considering the trade-off between effectively reducing high-frequency noise and preserving vital cardiac sound information. Ghosh and Ponnalagu (2019) [21] applied the Stockwell transform to remove heart murmurs. In this study, heart murmur information was an important factor for classifying normal and abnormal heart sounds. Therefore, the Stockwell transform was not performed.
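A low-pass Butterworth filter with a 150 Hz cutoff can be sketched with SciPy as follows. The filter order and the use of zero-phase (forward-backward) filtering are assumptions for this example, since the text specifies only the filter type and the cutoff frequency; the function name and the two-tone test signal are likewise illustrative.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def lowpass_butterworth(x, sr=4000, cutoff_hz=150.0, order=4):
    """Low-pass Butterworth filter applied forward and backward.

    ASSUMPTIONS: order=4 and zero-phase filtering are illustrative choices.
    """
    sos = butter(order, cutoff_hz, btype="low", fs=sr, output="sos")
    return sosfiltfilt(sos, x)

# Illustrative input: a 50 Hz component plus an 800 Hz component.
sr = 4000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 50 * t)
high = np.sin(2 * np.pi * 800 * t)
filtered = lowpass_butterworth(low + high, sr)

# Away from the edges, the output stays close to the 50 Hz component alone.
residual = np.max(np.abs(filtered[500:-500] - low[500:-500]))
print(residual < 0.05)  # True
```

Forward-backward filtering is used here so that the filter introduces no phase shift, which keeps the timing of onsets intact for the later boundary detection.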
Third, since the PCG signal is sampled at 4000 Hz, its amplitude fluctuates rapidly in the time domain. This rapid amplitude variation makes it difficult to detect the onset and end of the heart sounds S1 and S2. Therefore, it is necessary to detect the envelope of the PCG signal to minimize this variation [22]. The envelope was extracted using the signal.hilbert function of the scipy library (this function returns the analytic signal, whose magnitude is the envelope). The graphical representation of the extracted envelope is shown in step three in Figure 4. Since the PCG signal is almost symmetric around the horizontal axis of zero, we considered only the positive part of the signal for computational efficiency.
Fourth, in order to automatically detect the boundaries of cardiac sound activity, a threshold needs to be determined, and selecting an appropriate threshold value is one of the most important tasks. In this study, the threshold value was calculated with Equation (4) from the signal envelope, where N represents the total number of samples. The fourth step in Figure 4 shows the FHS boundary detection, represented by red dashed lines, based on the calculated threshold value.
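Steps three and four can be illustrated as follows. The envelope is taken as the magnitude of the analytic signal returned by `scipy.signal.hilbert`. The threshold rule in this sketch, the mean of the envelope over all N samples, is only an assumption used to make the example concrete and should not be read as Equation (4); the function names and the amplitude-modulated test tone are also illustrative.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(x):
    """Envelope as the magnitude of the analytic signal."""
    return np.abs(hilbert(x))

def mean_threshold(envelope):
    """ASSUMPTION: threshold = mean of the envelope over all N samples.

    This is a placeholder rule for illustration, not the threshold
    formula used in the study.
    """
    return float(np.mean(envelope))

# Illustrative input: a 50 Hz carrier with a slow 2 Hz amplitude modulation.
sr = 4000
t = np.arange(sr) / sr
amplitude = 0.5 + 0.4 * np.sin(2 * np.pi * 2 * t)
x = amplitude * np.sin(2 * np.pi * 50 * t)

envelope = hilbert_envelope(x)
threshold = mean_threshold(envelope)
print(np.allclose(envelope, amplitude, atol=1e-6))  # True
```

In this synthetic case the envelope recovers the slow modulation and removes the rapid oscillation of the carrier, which is the property that makes onsets and offsets easier to locate.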
Fifth, the signal was separated at the boundaries detected with the previously calculated threshold, which yields the S1 and S2 segments as well as the boundaries of systole and diastole. In Figure 4, step five shows the separation of the FHS boundary (represented by the red dashed lines) based on the threshold calculated in step four. FHS boundary values of 1 correspond to the normal heart sounds (namely, S1 and S2), while values of 0 represent features associated with noise occurring during systole and diastole.
Sixth, the log-mel spectrogram was applied to the two separated signal envelope features. Step six in
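The separation in step five can be sketched as a binary mask over the envelope. In this minimal example, samples whose envelope exceeds the threshold are labeled 1 (FHS activity) and all others 0, and the signal is split into the two complementary parts. The strict "greater than" comparison, the function names, and the toy values are assumptions for illustration.

```python
import numpy as np

def activity_mask(envelope, threshold):
    """1 where the envelope exceeds the threshold, 0 elsewhere."""
    return (np.asarray(envelope) > threshold).astype(int)

def segment_bounds(mask):
    """(start, end) index pairs, end exclusive, for each run of 1s."""
    padded = np.concatenate(([0], mask, [0]))
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)
    ends = np.flatnonzero(change == -1)
    return list(zip(starts.tolist(), ends.tolist()))

def split_by_mask(x, mask):
    """Return the masked part (mask == 1) and its complement (mask == 0)."""
    x = np.asarray(x, dtype=float)
    return x * mask, x * (1 - mask)

# Toy values for illustration only.
toy_envelope = np.array([0.1, 0.9, 0.8, 0.1, 0.2, 0.7, 0.1])
mask = activity_mask(toy_envelope, threshold=0.5)
bounds = segment_bounds(mask)
fhs_part, complement_part = split_by_mask(toy_envelope, mask)
print(mask.tolist())  # [0, 1, 1, 0, 0, 1, 0]
print(bounds)         # [(1, 3), (5, 6)]
```

Each of the two parts has the same length as the input, so a spectrogram computed from either one remains aligned in time with the original recording.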
Figure 5 shows the new features considered in this study. The first and second columns represent the spectrograms of S1 and S2 and of systolic and diastolic murmurs, respectively, obtained from step six shown in Figure 4. The third column shows the log-mel spectrogram features extracted from the envelope feature in step three shown in Figure 4.