Phonocardiography Signals Compression with Deep Convolutional Autoencoder for Telecare Applications

Phonocardiography (PCG) signals, which can be recorded with electronic stethoscopes, play an essential role in detecting heart valve abnormalities and assisting in the diagnosis of heart disease. However, transmitting these PCG signals to remote sites for telecare applications consumes considerable bandwidth. This paper presents a deep convolutional autoencoder to compress PCG signals. At the encoder side, seven convolutional layers compress the PCG signals, collected from patients in rural areas, into feature maps. At the decoder side, doctors at the remote hospital use another seven convolutional layers to decompress the feature maps and reconstruct the original PCG signals. To confirm the effectiveness of our method, we used an openly accessible dataset from PhysioNet. The achievable compression ratio (CR) is 32 when the percent root-mean-square difference (PRD) is less than 5%.


Introduction
According to the American Heart Association report, cardiovascular disease (CVD) is the leading global cause of death, and it is expected to cause more than 22.2 million deaths by 2030 [1]. Auscultation is one of the most significant methods for CVD monitoring [2]. Moreover, heart sounds can be used to diagnose heart diseases or to evaluate a person's physiological condition [3,4]. Electronic stethoscopes exploit the vibrations caused by heartbeats to graphically record the heart sounds as phonocardiography (PCG) signals [5]. PCG signals provide a non-invasive means of detecting heart valve abnormalities and assisting in diagnosing heart disease. An efficient signal compression method is necessary because of the vast amount of data generated by long-term PCG monitoring. Moreover, telecare, which uses telematics to transmit medical information, has recently attracted much attention, especially for telemedical problems in rural areas [6]. For example, people can wear singlets with embedded wearable electrocardiography (ECG) sensors to collect ECG data for telecare purposes [7]. Likewise, for people who live in rural areas with insufficient medical resources, we can use medical electronic stethoscopes to collect PCG signals and transmit them to doctors at a remote hospital. For the smart healthcare ecosystem, healthcare data privacy and security and data storage are among the most challenging issues and opportunities [8,9]. Most current medical tests are performed in medical institutions, so many patients need to travel back and forth between their residences and hospitals to have their health conditions checked, which wastes the patients' time. Therefore, the development of telecare is a critical issue today.
Telecare has created a new mode of medical communication that enables synchronized or asynchronous interaction between physicians and patients, overcoming spatiotemporal barriers, improving medical quality, and increasing convenience. In this work, we focus on the compression of PCG signals using a deep convolutional autoencoder. Furthermore, we consider the impact of communication link quality [10] and floating-point to fixed-point conversion issues. Our proposed method achieves a compression ratio of 32 with a percent root-mean-square difference of less than 4.8% at a word length of 10. By inspecting the model complexity, we show that our deep autoencoder can be implemented with existing lightweight deep learning frameworks on a smartphone. The remainder of this paper is organized as follows. Section 2 briefly describes related work on PCG compression problems and background on deep CNNs. Section 3 details our proposed deep convolutional autoencoder. Section 4 reports the experiments that validate the effectiveness of our work on PCG compression problems. Section 5 discusses model complexity issues. Conclusions are drawn in Section 6.

Sound Compression
For heart sound compression problems, the conventional performance metrics are the compression ratio (CR) and the percent root-mean-square difference (PRD) [11]. CR is defined as follows:

$$\mathrm{CR} = \frac{B_0}{B_1}, \tag{1}$$

where $B_0$ and $B_1$ are the data sizes before and after compression, respectively. PRD can be calculated as follows:

$$\mathrm{PRD} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \hat{x}_i)^2}{\sum_{i=1}^{N} (x_i - \mu_x)^2}} \times 100\%, \tag{2}$$

where $x_i$ is the $i$-th sample in the original signal, $\hat{x}_i$ denotes the corresponding reconstructed sample, and $\mu_x$ is the sample mean of the $N$ data samples of the original signal. Note that the mean value of the original signal must be removed to obtain an unbiased PRD [12]. Audio compression can be lossless or lossy. It has been shown that lossless compression algorithms rarely obtain a CR larger than 3, while lossy compression algorithms allow attainable CRs of 12 and higher [13]. For example, the CR of the free lossless audio codec (FLAC) is only about 1.94 [14], and the CR of lossless ECG compression is 2.56 [15]. Additionally, it has been reported that medical professionals feel the necessity of a high CR and can tolerate a PRD as high as 5% [16]. Thus, our design aims to attain a high CR while keeping the PRD below 5%. The pioneering work on PCG signal compression was reported in [17]. The authors proposed using wavelet transform or wavelet packet transform methods to compress the original heart sound signals by coefficient thresholding. Their method can further be combined with lossless compression methods, such as Huffman coding, to increase the compression ratio. A thresholding method based on wavelet energy was reported to reduce the loss caused by compression: at a PRD of 5.82%, the average CR is about 40 using the wavelet transform [18]. In [19], the authors proposed a better wavelet transform method with a tunable Q-factor [20], in which the optimal value of Q can be found using a genetic algorithm.
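As a minimal illustration of the two metrics defined above, the following plain-Python sketch computes CR and the mean-removed (unbiased) PRD; the function names are ours, not from the paper:

```python
import math

def compression_ratio(b0, b1):
    """CR = B0 / B1: data size before compression over data size after."""
    return b0 / b1

def prd(original, reconstructed):
    """Percent root-mean-square difference with mean removal.

    PRD = sqrt(sum((x_i - xhat_i)^2) / sum((x_i - mu_x)^2)) * 100
    """
    n = len(original)
    mu = sum(original) / n
    num = sum((x - xh) ** 2 for x, xh in zip(original, reconstructed))
    den = sum((x - mu) ** 2 for x in original)
    return math.sqrt(num / den) * 100.0

# A perfect reconstruction gives PRD = 0; the paper's operating point
# (6000-sample segments compressed to 187 samples) gives CR just above 32.
x = [0.1, 0.5, 0.9, 0.4]
assert prd(x, x) == 0.0
assert compression_ratio(6000, 187) > 32
```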
In [16], the authors exploited the repetition patterns embedded in PCG signals to eliminate their redundancy. After decomposing the PCG signals into the time-frequency domain, the authors proposed clustering the decomposed data to build a dictionary during the training phase. This dictionary is then used to obtain the optimal codebook. They achieved an average CR of 17.4 at an average PRD of 4.87% for PCG signals. For fetal PCG signals, the authors of [21] reported that the achievable CR was less than 7.5 when the required PRD was less than 2%, using compression techniques based on the Discrete Cosine Transform (DCT) and the Discrete Wavelet Transform (DWT).

Convolutional Neural Networks
Because audio signals can be stored as one-dimensional (1D) signals, 1D convolutional neural networks (1D CNNs) have been applied to solve many practical problems that traditional signal processing approaches can hardly tackle, such as ventricular heartbeat detection [22], speaker recognition [23], speech emotion recognition [24], and environmental sound classification [25]. Beyond audio signals, 1D CNNs have been applied to various scenarios, such as human respiration pattern recognition [26], anomaly detection for industrial control systems [27], and contact-less paraparesis detection [28]. In [29], the authors proposed combining a conventional 2D CNN with an autoencoder for electroencephalogram (EEG) signal compression problems. The autoencoder aims to reconstruct its input based on the neural network in an unsupervised learning manner [30].
Let x be the input to a convolution layer of length N and h be the kernel of size K. Let y be the output of the non-causal 1D convolution between x and h. The n-th element of y can be expressed as follows [31]:

$$y(n) = \sum_{k=0}^{K-1} x(ns + k)\, h(k), \tag{3}$$

where $x(n)$ and $h(n)$ represent the $n$-th elements of x and h, respectively, and $s$ denotes the stride for shifting the kernel window. Note that the length of the output vector y is not equal to the length of the input vector x; this can be avoided by zero-padding the input vector. In each CNN layer, 1D forward propagation can be expressed as follows:

$$x_k^{\ell} = b_k^{\ell} + \sum_{i} \mathrm{conv1D}\!\left(w_{ik}^{\ell-1}, s_i^{\ell-1}\right), \tag{4}$$

where $x_k^{\ell}$ denotes the input of the $k$-th neuron at the $\ell$-th layer of the CNN; $b_k^{\ell}$ represents the corresponding bias; $s_i^{\ell-1}$ is the output of the $i$-th neuron at the $(\ell-1)$-th layer; $w_{ik}^{\ell-1}$ is the kernel from the $i$-th neuron at the $(\ell-1)$-th layer to the $k$-th neuron at the $\ell$-th layer; and $\mathrm{conv1D}(\cdot,\cdot)$ performs the 1D convolution operation described in (3). The output of the $k$-th neuron at the $\ell$-th layer, $y_k^{\ell}$, can be expressed as follows:

$$y_k^{\ell} = f\!\left(x_k^{\ell}\right), \tag{5}$$

where $f(\cdot)$ is the non-linear activation function. Possible activation functions include the sigmoid function, the hyperbolic tangent, and the rectified linear unit (ReLU). Note that $y_k^{\ell}$ may be further processed by a pooling operation to create downsampled or pooled feature maps, a summarized version of the features detected in the input. Two standard pooling functions are average pooling and maximum pooling (MaxPool), which calculate the average and maximum values, respectively, for each patch on the feature map. For example, MaxPool with a factor of P can be expressed as follows:

$$y(n) = \max_{0 \le p < P} x(nP + p). \tag{6}$$
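The convolution, activation, and pooling operations above can be sketched in plain Python as a didactic single-channel version (function names are our own):

```python
def conv1d(x, h, stride=1):
    """Valid (non-causal) 1D convolution: y(n) = sum_k x(n*s + k) * h(k)."""
    K = len(h)
    n_out = (len(x) - K) // stride + 1
    return [sum(x[n * stride + k] * h[k] for k in range(K))
            for n in range(n_out)]

def relu(v):
    """Rectified linear unit applied elementwise."""
    return [max(0.0, t) for t in v]

def maxpool1d(x, p):
    """MaxPool with factor P: y(n) = max of x[nP .. nP+P-1]."""
    return [max(x[n * p:(n + 1) * p]) for n in range(len(x) // p)]

# A [1, 0, -1] kernel acts as a simple difference filter.
assert conv1d([1, 2, 3, 4], [1, 0, -1]) == [-2, -2]
assert maxpool1d([1, 3, 2, 5, 0, 7], 2) == [3, 5, 7]
```

Note how the valid convolution shortens the signal from 4 samples to 2, which is why zero padding is needed when the layer must preserve length.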

Autoencoder
Autoencoders have been widely used in data denoising [32][33][34] and data compression [35] applications. An autoencoder comprises an encoder part and a decoder part. The encoder translates an input vector x of length N into a hidden representation y of length M. The transform can be expressed as follows:

$$\mathbf{y} = f(\mathbf{W}\mathbf{x} + \mathbf{b}), \tag{7}$$

where W denotes the weight matrix of size M × N, b represents the bias vector of size M × 1, and $f(\cdot)$ denotes the transformation function, which is non-linear in general. Note that, for data compression purposes, N ≫ M holds. On the other hand, the decoder translates the hidden representation y into a reconstructed representation $\hat{\mathbf{x}}$ with the same length as x. The demapping process can be expressed as follows:

$$\hat{\mathbf{x}} = g(\mathbf{W}'\mathbf{y} + \mathbf{b}'), \tag{8}$$

where W′ denotes the weight matrix of size N × M, b′ represents the bias vector of size N × 1, and $g(\cdot)$ denotes the demapping function, which is non-linear in general. The parameters of the autoencoder are adjusted by minimizing the following cost function J:

$$J = \frac{1}{L} \sum_{l=1}^{L} \left\| \mathbf{x}^{(l)} - \hat{\mathbf{x}}^{(l)} \right\|^2, \tag{9}$$

where L denotes the size of the training dataset.
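A minimal sketch of the encoder mapping, decoder demapping, and cost function above, written in plain Python with explicit loops (the function names and the identity-activation example are our own; a trained model would use learned weights and non-linear activations):

```python
def affine(W, x, b):
    """Compute W @ x + b with W given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def encode(W, b, x, f):
    """y = f(W x + b); W is M x N with M << N for compression."""
    return [f(v) for v in affine(W, x, b)]

def decode(Wp, bp, y, g):
    """xhat = g(W' y + b'); W' is N x M."""
    return [g(v) for v in affine(Wp, y, bp)]

def cost(xs, xhats):
    """J = (1/L) * sum over l of ||x^(l) - xhat^(l)||^2."""
    L = len(xs)
    return sum(sum((a - c) ** 2 for a, c in zip(x, xh))
               for x, xh in zip(xs, xhats)) / L

# Toy example with identity activations: compress 2 samples into 1.
ident = lambda v: v
y = encode([[0.5, 0.5]], [0.0], [2.0, 4.0], ident)      # [3.0]
xhat = decode([[1.0], [1.0]], [0.0, 0.0], y, ident)     # [3.0, 3.0]
assert cost([[2.0, 4.0]], [xhat]) == 2.0
```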

Data Pre-Processing
Before applying the measured data into the proposed deep convolutional autoencoder model, the data is pre-processed as follows:

1.
Normalization: the data are normalized such that all values are mapped into the range [0, 1].

2.
Segmentation: we divide the heart sound into several fixed-length segments. The time duration of each segment is 3 s, which corresponds to 6000 samples at a sampling rate of 2 kHz. Note that each segment contains roughly four or five cardiac cycles.

3.
Overlapping: we apply a sliding window to the segmented data to implement data augmentation. Except for the first segment, each segment overlaps its previous segment by 93.33%, i.e., each segment has 400 new samples and keeps 5600 samples from the previous segment. It has been reported that such a time-shift-based data augmentation method helps prevent overfitting [36] when training CNN networks [37].
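The pre-processing steps above can be sketched as follows (a plain-Python illustration; the default segment length of 6000 and hop of 400 follow the text, while the function names are ours):

```python
def normalize(x):
    """Step 1: map all values into the range [0, 1]."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

def segment(x, seg_len=6000, hop=400):
    """Steps 2-3: sliding-window segmentation. Each segment shares
    seg_len - hop samples with the previous one (here 5600 of 6000,
    i.e., 93.33% overlap)."""
    return [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, hop)]

# Small example: seg_len=6, hop=2 -> consecutive segments share 4 samples.
segs = segment(list(range(10)), seg_len=6, hop=2)
assert len(segs) == 3
assert segs[1][:4] == segs[0][2:]
```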

Feature Selection
It has been reported that the hidden semi-Markov model (HSMM) heart sound segmentation method can correctly decompose a cardiac cycle into four parts: the S1, systole, S2, and diastole periods [38]. We propose using the S1 and S2 signals as the features to train the deep convolutional autoencoder. Thus, for each segment, we null all samples corresponding to the systole and diastole periods and keep the other samples.
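Assuming the S1/S2 sample intervals are provided by an HSMM-style segmenter (as in [38]), the nulling step might look like the following sketch; the interval format and the function name are our assumptions:

```python
def mask_segment(seg, keep_intervals):
    """Keep only samples inside the given (start, end) S1/S2 intervals
    from an HSMM-style segmenter; null everything else (systole/diastole)."""
    out = [0.0] * len(seg)
    for start, end in keep_intervals:
        out[start:end] = seg[start:end]
    return out

# Keep samples 1..2, null the rest.
assert mask_segment([1.0, 2.0, 3.0, 4.0, 5.0], [(1, 3)]) == \
    [0.0, 2.0, 3.0, 0.0, 0.0]
```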

Proposed Network Model for the Deep Convolutional Autoencoder
Figure 1 depicts the system model in this work. With the aid of medical professionals, such as nurses, patients who live in rural areas can regularly collect their PCG signals using electronic stethoscopes; an encoder then continuously compresses the PCG signals before transmitting the compressed data to the remote site via a noisy communication link. At the remote site, doctors use a decoder to continuously decompress the received compressed PCG signals, so that they can virtually auscultate the patients' heart sounds without meeting the patients in person. Inspired by pioneering works on deep neural network design [39,40], we combine one to two convolutional layers with one max-pooling layer as the basic unit of our network architecture at the encoder side; at the decoder side, we combine one to two convolutional layers with one upsampling layer as the basic unit. Empirically, we keep stacking the basic units until overfitting occurs to determine the depth of our network model. The details of the encoder and decoder are listed in Tables 1 and 2, respectively. For the encoder, we feed one segment, which contains 6000 samples, into the first layer. After this convolutional operation with 12 1D filters, each with a kernel size of three and zero padding, the output size is 6000 × 12; that is, 12 features are extracted from one segment by the first convolutional layer. Subsequently, the one-dimensional max-pooling operation reduces the signal length by a factor of four, i.e., the output of the MaxPool 1D layer is 1500 × 12. After the consecutive convolution and max-pooling operations, the final layer outputs a vector of length 187. This means the encoder compresses each segment of 6000 samples into a feature map of 187 samples. Note that one batch normalization (Batch Norm) layer was used in the encoder to maintain a stable distribution of activation values throughout training and alleviate the internal covariate shift problem [41].
Table 2 lists the decoder's detailed architecture, which decompresses each feature map of 187 samples to reconstruct the corresponding PCG segment of 6000 samples. Note that, in addition to the convolutional and up-sampling operations, the last layer is a fully connected (Dense) layer that reconstructs the original segment. The total numbers of trainable parameters for the encoder and decoder are 180,945 and 18,124,705, respectively, which implies that the computational cost at the encoder is much lower than that at the decoder. Note that the PCG signals can be continuously partitioned into segments and then compressed with the proposed encoder. The remote side then continuously receives each compressed feature map. After decoding, the decompressed segments can be concatenated to recover the original PCG signals. Thus, even though each segment contains roughly only four or five cardiac cycles, which may be insufficient for diagnostic or monitoring purposes, our segment-by-segment signal processing approach can achieve a good CR and a low PRD without compromising continuity for long-term monitoring.
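The length bookkeeping through the encoder can be checked with a short sketch: zero-padded convolutions preserve length, and each max-pooling divides it (with flooring). The pooling factors (4, 4, 2) are inferred by us from the stated sizes 6000 → 1500 → 375 → 187, so they are an assumption rather than a quote of Table 1:

```python
def encoder_output_length(n, pool_factors=(4, 4, 2)):
    """Zero-padded conv layers keep the length; each MaxPool layer
    divides it by its factor with integer flooring (375 // 2 == 187)."""
    for p in pool_factors:
        n //= p
    return n

assert encoder_output_length(6000) == 187   # CR = 6000 / 187, about 32
```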

Dataset
We used an openly accessible dataset provided by PhysioNet [42], which comprises nine heart sound databases collected by various organizations. Owing to its large number of tested subjects and long recording time, we chose the Dalian University of Technology heart sounds database (DLUTHSDB) from these nine databases as our experimental dataset. In this dataset, the subjects included 174 healthy volunteers and 335 patients with coronary artery disease. PhysioNet resamples the original waveforms at 2 kHz. Figure 2 illustrates three recorded heart sounds from the DLUTHSDB database.

Results
The experiments were conducted on the TensorFlow framework. First, we evaluate the impact of the segment length on the resulting PRD and the computational complexity and determine a suitable segment length for our proposed model. Next, we try to increase the CR under a tolerable PRD. Third, we consider fixed-point and communication link quality issues for practical implementation.

Segment Length Evaluation
During the training phase, the batch size is chosen as 60, and the number of epochs is set to 200. Meanwhile, we use the conventional adaptive moments (Adam) optimizer [43] to obtain better results. The main parameters associated with the Adam optimizer are a learning rate of 0.01, hyper-parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and a weight decay factor $\lambda = 10^{-5}$. Because the audio files in the dataset differ in length, we first divided them into segments and then randomly split the segments into approximately 80% for training, 10% for validation, and 10% for testing.
We fix the CR at 27 and evaluate the resulting PRD under various segment lengths (see Figure 3). Note that a segment length of 2000 or 3000 samples contains roughly two or three cardiac cycles; a segment length of 6000 samples contains roughly four or five cardiac cycles; and a segment length of 8000 samples contains roughly six or seven cardiac cycles. The evaluation results are listed in Table 3, from which we observe that a longer segment length leads to a higher computational cost, and the segment length of 6000 achieves the lowest PRD. Thus, we set the segment length to 6000 samples.
As shown in Figure 4, we observe that for the time duration of 3 s (segment length of 6000), the compressed signals contain fewer details than for the time duration of 4 s (segment length of 8000) (see Figure 4a,c). However, comparisons between the original data and the reconstructed signals reveal that the time duration of 3 s (segment length of 6000) yields a lower PRD (see Figure 4b,d-f). Thus, we infer that more details in a compressed signal do not necessarily lead to better PRD performance. This could explain why a network model with a larger number of trainable parameters fails to produce better training results.

Compression Ratio Evaluation
It has been shown that higher values of CR result in larger values of PRD. Given that the desired PRD is less than 5%, we evaluate the best attainable CR in this section.
During the training phase, the batch size is chosen as 60, and the number of epochs is set to 400. We fix the segment length at 6000, and the total number of segments is about $7.2 \times 10^3$. The dataset was randomly split into approximately 80% for training, 10% for validation, and 10% for testing. Meanwhile, we use the conventional Adam optimizer, as described in Section 4.2.1. The evaluation results are shown in Table 4. Although the PRD corresponding to CR = 36 is less than the target PRD (5%), once the fixed-point computation issue is taken into account, the resulting PRD may no longer meet the target. Thus, we choose a CR of 32 in the rest of this paper.

Fixed-Point Issues
We consider fixed-point operation for the proposed network architecture to reduce the computational cost of floating-point arithmetic. Table 5 lists the maximum and minimum values of both the weights and biases for each layer at the encoder and decoder sides. Note that layers 2, 6, and 9 are max-pooling layers at the encoder side, and layers 4, 7, and 9 are up-sampling layers at the decoder side. Table 6 shows the evaluation results for the fixed-point cases. Based on these results, we choose a word length of 10 in the rest of this paper, assigning one bit for the sign, two bits for the integer part, and the remaining seven bits for the fractional part.
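A sketch of the quantizer implied by this format (1 sign bit, 2 integer bits, 7 fractional bits in a 10-bit word); the rounding and saturation policy below is our assumption, as the paper does not state it:

```python
def quantize_q(value, int_bits=2, frac_bits=7):
    """Quantize to a signed fixed-point format: 1 sign bit + int_bits
    integer bits + frac_bits fractional bits (a 10-bit word here).
    Values outside the representable range saturate (assumed policy)."""
    scale = 1 << frac_bits                    # 2^7 = 128 steps per unit
    lo = -(1 << (int_bits + frac_bits))       # most negative code: -512
    hi = (1 << (int_bits + frac_bits)) - 1    # most positive code: +511
    code = max(lo, min(hi, round(value * scale)))
    return code / scale

# Exactly representable values pass through; large values saturate.
assert quantize_q(0.5) == 0.5
assert quantize_q(100.0) == 511 / 128   # clamps near +4.0
assert quantize_q(-100.0) == -4.0
```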

Communication Link Quality Issues
Besides the segment length, the CR, and the fixed-point factors, we further consider the impact of communication link quality on the resulting PRD. We model the link quality simply through an artificially induced bit error rate (BER) on the communication link and compare our proposed method with the conventional DCT method. Since the CRs for the DCT method and our method differ, we also report a performance metric called the "quality score" (QS), defined as the ratio between the CR and the PRD, i.e., QS = CR/PRD; the QS index is useful when the values of CR for different compression methods differ [44]. The evaluation results can be found in Table 7. To have a fair comparison, we set the CR to 15 for the DCT method, so that its resulting PRD is close to 5% in the simulation. As shown in the last row of Table 7, when there is no transmission error, i.e., BER = 0, the DCT method achieves a PRD of 5.48% at a CR of 15, while our method attains a PRD of 4.792% at a CR of 32. We observe that the PRD of the DCT compression method degrades sharply when the BER exceeds $10^{-3}$, whereas our proposed method degrades smoothly as the communication link becomes noisier. Similar results can be observed from the QS index. This implies that our method has better immunity against transmission errors. When communication errors are ignored, our method reaches a slightly lower PRD and a significantly higher CR than the DCT-based compression method; that is, our proposed method exhibits a higher QS than the conventional DCT method. However, the computational cost of the DCT method is lower than that of our method.
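The artificially induced BER can be emulated by independently flipping each bit of the transmitted fixed-point codes with probability equal to the BER. This is a sketch under our own assumption of an independent bit-flip channel; the paper does not specify its exact error model:

```python
import random

def flip_bits(codes, word_len, ber, rng):
    """Flip each of the word_len bits of every code independently
    with probability `ber` (memoryless binary symmetric channel)."""
    out = []
    for c in codes:
        for bit in range(word_len):
            if rng.random() < ber:
                c ^= 1 << bit
        out.append(c)
    return out

# BER = 0 leaves the stream untouched; BER = 1 inverts every bit.
assert flip_bits([0b1010, 0b0001], 4, 0.0, random.Random(1)) == [0b1010, 0b0001]
assert flip_bits([0b1010], 4, 1.0, random.Random(1)) == [0b0101]
```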

Discussion
For the telecare application, it is desirable that the hardware cost and computational complexity of the proposed method be affordable for people in rural areas. We use the number of multiply-accumulate (MAC) operations as the performance metric to evaluate the model complexity. In our proposed model, the computation-intensive operations are mainly in the convolutional layers and the fully connected layer. For a convolutional layer with kernel size K, C_out filters, output length W_out, and C_in input channels, the resulting numbers of MACs at the encoder and decoder sides are listed in Tables 8 and 9, respectively. Moreover, the fully connected layer at the decoder side has 2992 inputs and 6000 outputs, i.e., 17,952,000 MACs. In summary, the number of GMACs is around 0.08 and 0.06 at the encoder and decoder sides, respectively. Besides, we use a tool called "SCALE-Sim" to evaluate the number of cycles required on the "Eyeriss" chip [45]. For the encoder, the seven "Conv 1D" layers listed in Table 8 require 6016, 22,960, 145,386, 118,491, 279,700, 21,088, and 3906 cycles, respectively, for a total of about $5.98 \times 10^5$ cycles. The ratio of the number of cycles required at the encoder of our method to that of MobileUNet, which runs MobileNet [46,47] on U-Net [48], is about one third (see Table 10). It has been shown that MobileUNet achieves real-time deep learning in mobile applications [49]. Thus, the model complexity of our proposed model is affordable for a consumer-grade smartphone.
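The MAC counts above follow directly from the layer dimensions. As a sketch, using the formula for a 1D convolutional layer and the stated 2992-input, 6000-output Dense layer (helper names are ours):

```python
def conv1d_macs(kernel_size, c_in, c_out, w_out):
    """MACs for one 1D conv layer: K * C_in * C_out * W_out
    (one K*C_in dot product per filter per output position)."""
    return kernel_size * c_in * c_out * w_out

def dense_macs(n_in, n_out):
    """MACs for a fully connected layer: one multiply per weight."""
    return n_in * n_out

# The decoder's Dense layer (2992 inputs, 6000 outputs) matches the
# 17,952,000 MACs stated in the text.
assert dense_macs(2992, 6000) == 17_952_000
```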

Conclusions
In this paper, we have proposed a convolutional autoencoder architecture for the compression of 1D PCG signals. With an input segment length of 6000 samples, the achievable CR is 32 under the constraint that the corresponding PRD is less than 5%. The proposed method is more robust than the DCT-based compression method when the impact of link quality is considered: as the BER rose from $10^{-4}$ to $10^{-3}$, the DCT method increased the PRD by about 7.6%, whereas our proposed method increased the PRD by only about 0.3%. We also considered the fixed-point issue: compared with the floating-point representation, the resulting PRD increased only slightly, by about 0.2%, with a word length of 10 bits in the Q3.7 format. Our future work will jointly consider the denoising and data compression tasks for PCG signals using the deep convolutional autoencoder.