Novel Data Augmentation Employing Multivariate Gaussian Distribution for Neural Network-Based Blood Pressure Estimation

In this paper, we propose a novel data augmentation technique employing a multivariate Gaussian distribution (DA-MGD) for neural network (NN)-based blood pressure (BP) estimation. The technique incorporates the relationships between the features of a multi-dimensional feature vector and thereby describes correlated real-valued random variables. To verify the proposed algorithm against the conventional algorithm, we compare the results in terms of mean error (ME) with standard deviation and Pearson correlation, using a database (DB) contributed by 110 subjects that includes the systolic BP (SBP), diastolic BP (DBP), photoplethysmography (PPG) signal, and electrocardiography (ECG) signal. Each subject was measured three times (or six times), and the PPG and ECG signals were recorded for 20 s in each measurement. To compare the BP estimation (BPE) performance of the data augmentation algorithms, we train the BPE model using a two-stage system called the stacked NN. Because the proposed algorithm expresses the correlation between the features more faithfully than the conventional algorithm, its errors are lower, which shows the superiority of our approach.


Introduction
Blood pressure (BP) is an essential factor in diagnosing health conditions, and periodic BP monitoring is important for healthcare. To this end, BP measurement algorithms based on polynomial regression, support vector machines, and artificial neural networks (NNs) have been studied extensively [1,2]. More recently, deep learning [3] has been applied to improve estimation performance and has achieved superior accuracy.
However, the collection of biological data such as BP is usually limited, because building a large database (DB) that contains the data together with expert-verified labels is very costly. Since the advantage of a deep neural network (DNN), which works well with a large DB, is lost with a small DB, a DNN model trained on a small training DB has a critical weakness [4][5][6][7][8][9][10]. To address this weakness, deep learning techniques that train a model with limited labeled data, such as the Siamese network and few-shot learning, have been studied extensively [11][12][13][14][15][16][17]. However, because these techniques employ image-based features, they are not suited to our task, which uses signal-based features. An augmentation algorithm that creates pseudo data for the training DB is therefore required. Previous studies on BP estimation (BPE) have used a bootstrap algorithm to augment the training DB [3,18,19].
However, the bootstrap algorithm does not work well on a DB that lacks diversity, and the pseudo data it creates do not properly express the characteristics of, and the correlations between, the multi-dimensional features. Consequently, even when the training DB is sufficiently augmented, the performance improvement is significantly limited because part of the pseudo data has a negative impact on DNN training [3]. In addition, previous studies do not properly consider the characteristics of a DB in which each subject is measured several times.
Thus, in this paper, we first review the conventional data augmentation algorithm used to improve the limited performance of an NN model trained with a small DB. Then, to overcome the weakness of the conventional algorithm, we propose a novel data augmentation algorithm based on the multivariate Gaussian distribution (DA-MGD) for NN-based BPE. Pseudo data drawn from a multivariate Gaussian distribution resemble the characteristics of the real data much more closely than pseudo data generated by the conventional method. Specifically, because the relationship between the features of the pseudo data and the reference BP is represented by the multivariate Gaussian distribution, the pseudo data are constructed more effectively than with the conventional method. As a result, the proposed algorithm augments the training DB more effectively than the previous augmentation algorithm, and the NN-based model can estimate the BP better than the original method.
The remainder of this paper is organized as follows. The NN-based BPE employing the bootstrap algorithm is described in detail in Section 2. The proposed algorithm is described in detail in Section 3. Section 4 shows the results, and a discussion is provided in Section 5. Finally, Section 6 concludes the paper.

NN-Based BPE
We describe the NN-based BPE employing the bootstrap algorithm proposed in [3]. To estimate the BP, which comprises the systolic BP (SBP) and diastolic BP (DBP), the photoplethysmography (PPG) and electrocardiography (ECG) signals are collected using a smart wristwatch with embedded sensors. However, because the raw PPG and ECG signals are contaminated by noise, the peak points that are needed to extract the features cannot be detected precisely.
To mitigate this problem, a pre-processing step removes the noise using second-order Butterworth filters [20]. Specifically, the PPG signal passes through a second-order Butterworth band-pass filter (pass band: 0.5 Hz to 11 Hz) [21,22], and the ECG signal passes through a second-order Butterworth low-pass filter (cut-off: 30 Hz) [23].
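The pre-processing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 500 Hz sampling rate is taken from the sensor description later in the paper, while the use of SciPy, zero-phase (forward-backward) filtering, and the function names `filter_ppg` and `filter_ecg` are assumptions. Note that zero-phase filtering applies the filter twice, so the effective attenuation is steeper than that of a single second-order pass.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 500.0  # sampling rate in Hz (from the sensor description in the paper)

def filter_ppg(ppg, fs=FS):
    """Butterworth band-pass (order parameter 2), pass band 0.5 Hz to 11 Hz."""
    sos = butter(2, [0.5, 11.0], btype="bandpass", fs=fs, output="sos")
    # Zero-phase filtering is an assumption; the paper does not state it.
    return sosfiltfilt(sos, ppg)

def filter_ecg(ecg, fs=FS):
    """Second-order Butterworth low-pass with a 30 Hz cut-off."""
    sos = butter(2, 30.0, btype="lowpass", fs=fs, output="sos")
    return sosfiltfilt(sos, ecg)

# Synthetic 20 s recording: a 1.2 Hz pulse-like wave plus 60 Hz interference.
t = np.arange(0.0, 20.0, 1.0 / FS)
clean = np.sin(2 * np.pi * 1.2 * t)
noisy = clean + 0.5 * np.sin(2 * np.pi * 60.0 * t)

filtered = filter_ppg(noisy)
mse_before = float(np.mean((noisy - clean) ** 2))
mse_after = float(np.mean((filtered - clean) ** 2))
```

On this synthetic signal the 60 Hz component lies far outside the pass band, so `mse_after` is much smaller than `mse_before`.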

Feature Extraction
To estimate the SBP and DBP, the features are extracted from the PPG and ECG signals after pre-processing. First, the peak point (PP) P_PPG and valley point (VP) V_PPG of the PPG signal and the R-peak point (RPP) P_ECG of the ECG signal are detected. The authors in [3] then extract the features that are related to the SBP and DBP, as listed in Table 1 [1,24,25,26].
Here, S_PPG(n) is the point at which the slope of the PPG signal in the positive direction reaches its maximum, n is the index of the points, and d is the distance between the fingertip and the heart. Because measuring this distance is inconvenient, the authors in [3] substitute half of the subject's height for it. We also use body information related to the BP, namely gender, age, height, and weight, as features [27,28]. Finally, the extracted feature vectors pass through an Nth-order median filter.

NN Training and BPE
Once the feature vectors are extracted, the NN model is trained to estimate the SBP and DBP in two major stages: a parameter initialization stage and a fine-tuning stage. The feature vector is first normalized using the means and standard deviations (SDs) of its elements [29]. The exponential linear unit (ELU) f(t) [30] is used as the activation function of the hidden layers:

$$f(t) = \begin{cases} t, & t > 0 \\ \alpha \left( e^{t} - 1 \right), & t \le 0 \end{cases}$$

where t is the argument of the ELU function and α is its parameter, which is set to 1. We then update the parameters, namely the weights and biases of each layer, according to the minimum mean square error (MMSE) [31] between the estimated BP and the reference BP. The MMSE serves as the error function E of the NN and is computed over mini-batches:

$$E = \frac{1}{K} \sum_{k=1}^{K} \left( \hat{Y}_k\!\left(w^{(l)}, b^{(l)}\right) - Y_k \right)^{2}$$

where k is the index within the mini-batch, K is the mini-batch size, and \hat{Y}_k(w^{(l)}, b^{(l)}) and Y_k denote the estimated BP and target BP, respectively. Also, w^{(l)} and b^{(l)} are the weight and bias of the lth layer. The parameters of each hidden layer are updated repeatedly using the learning rate λ [31]:

$$w^{(l)} \leftarrow w^{(l)} - \lambda \frac{\partial E}{\partial w^{(l)}}, \qquad b^{(l)} \leftarrow b^{(l)} - \lambda \frac{\partial E}{\partial b^{(l)}}$$

We set the values of the SBP and DBP as the target vectors and obtain the NN models that estimate the SBP and DBP. To estimate the SBP and DBP using the NN models, we employ the weights and biases such that

$$\hat{Y} = w_{\mathrm{out}}\, f\!\left( w_{2}\, f\!\left( w_{1} D + b_{1} \right) + b_{2} \right) + b_{\mathrm{out}}$$

where w_1, w_2, and w_out are the weights of each layer, b_1, b_2, and b_out are the biases of each layer, and D denotes the feature vector. Finally, the output of the NN model is de-normalized into the unit of the SBP and DBP (mmHg) using the pre-computed mean and SD [24].

To further improve the estimation performance, we employ a two-stage system based on the stacked NN [32]. The two-stage system has a cascade structure in which the stacked NN is connected to the first NN model, as shown in Figure 1.
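The training procedure above (normalization, ELU hidden layers, mean-square error over a mini-batch, gradient update with learning rate λ, and de-normalization) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the layer sizes, the weight initialization, the synthetic data, and all function names are hypothetical, and samples are stored as rows, so the products are written `D @ w` rather than `w D`.

```python
import numpy as np

def elu(t, alpha=1.0):
    """ELU: t for t > 0, alpha * (exp(t) - 1) otherwise."""
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, t, alpha * np.expm1(np.minimum(t, 0.0)))

def elu_grad(t, alpha=1.0):
    """Derivative of the ELU with respect to its argument."""
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, 1.0, alpha * np.exp(np.minimum(t, 0.0)))

def init_params(n_in, n_h1, n_h2, rng):
    """Random initial weights and zero biases (assumed initialization)."""
    def layer(a, b):
        return rng.normal(0.0, 1.0 / np.sqrt(a), size=(a, b)), np.zeros(b)
    w1, b1 = layer(n_in, n_h1)
    w2, b2 = layer(n_h1, n_h2)
    w_out, b_out = layer(n_h2, 1)
    return {"w1": w1, "b1": b1, "w2": w2, "b2": b2,
            "w_out": w_out, "b_out": b_out}

def forward(D, p):
    """Two ELU hidden layers followed by a linear output layer."""
    z1 = D @ p["w1"] + p["b1"]
    z2 = elu(z1) @ p["w2"] + p["b2"]
    y_hat = elu(z2) @ p["w_out"] + p["b_out"]
    return z1, z2, y_hat

def train_step(D, Y, p, lr):
    """One gradient-descent update on the mean-square error; returns E."""
    z1, z2, y_hat = forward(D, p)
    h1, h2 = elu(z1), elu(z2)
    E = float(np.mean((y_hat - Y) ** 2))
    d_out = 2.0 * (y_hat - Y) / Y.size            # dE / d y_hat
    grads = {"w_out": h2.T @ d_out, "b_out": d_out.sum(axis=0)}
    d2 = (d_out @ p["w_out"].T) * elu_grad(z2)    # back-propagate to layer 2
    grads["w2"], grads["b2"] = h1.T @ d2, d2.sum(axis=0)
    d1 = (d2 @ p["w2"].T) * elu_grad(z1)          # back-propagate to layer 1
    grads["w1"], grads["b1"] = D.T @ d1, d1.sum(axis=0)
    for name in p:
        p[name] -= lr * grads[name]               # w <- w - lambda * dE/dw
    return E

# Synthetic data standing in for feature vectors and reference BP (mmHg).
rng = np.random.default_rng(0)
raw = rng.normal(size=(64, 8))
bp = 110.0 + 12.0 * raw[:, :1] + rng.normal(scale=2.0, size=(64, 1))

D = (raw - raw.mean(axis=0)) / raw.std(axis=0)    # normalize the features
bp_mean, bp_sd = bp.mean(), bp.std()
Y = (bp - bp_mean) / bp_sd                        # normalize the target

params = init_params(8, 16, 16, rng)
losses = [train_step(D, Y, params, lr=0.01) for _ in range(300)]
estimate_mmhg = forward(D, params)[2] * bp_sd + bp_mean   # de-normalize
```

The last line mirrors the de-normalization step: the network output is scaled back to mmHg with the pre-computed mean and SD of the training targets.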
The input features of the stacked NN model consist of the extracted features and the BP estimated by the first NN. The added feature acts as a major feature that helps the stacked model estimate the BP more precisely than the first NN. The stacked NN model is trained with the same procedure as the first NN, as described from Equation (1) onward. To estimate the BP using the two-stage system, the first NN model estimates the BP from the extracted features, and this estimate is then concatenated with the extracted features to construct the input of the stacked NN. Finally, after the stacked NN model estimates the BP from the constructed input as in Equations (4) and (5), the estimated value is de-normalized using the mean and SD.
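The construction of the stacked NN input can be illustrated with a short sketch. To keep the example small, an ordinary least-squares model stands in for each NN stage; this substitution, the synthetic data, and the helper names are assumptions made only for illustration.

```python
import numpy as np

def build_stacked_input(features, first_stage_bp):
    """Concatenate the extracted features with the first-stage BP estimate."""
    features = np.asarray(features, dtype=float)
    bp = np.asarray(first_stage_bp, dtype=float).reshape(len(features), -1)
    return np.hstack([features, bp])

def fit_linear(X, y):
    """Least-squares stand-in for training one NN stage (not the paper's model)."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    A = np.hstack([X, np.ones((len(X), 1))])
    return A @ coef

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))                           # normalized features
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=50)

stage1 = fit_linear(X, y)
est1 = predict_linear(stage1, X)                       # first-stage estimate
X2 = build_stacked_input(X, est1)                      # features + estimate
stage2 = fit_linear(X2, y)
est2 = predict_linear(stage2, X2)                      # second-stage estimate
```

The second stage sees one more input column than the first, holding the first-stage estimate.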

Conventional Data Augmentation Algorithm
Because the NN works well only with a sufficient training DB, the data augmentation algorithm plays a major role in the final performance. Specifically, previous works propose a bootstrap algorithm [3,18,19] that creates pseudo data to enlarge the training DB substantially. As shown in Figure 2, the actual data in the training DB are randomly divided into multiple groups, and statistics such as the mean and SD of the features are calculated for each group. The pseudo data are then generated randomly from a normal distribution with this mean and SD. The features of the pseudo data are created independently of one another, and the reference BP of each pseudo feature vector is determined by the BP of the group (interested readers are referred to [3] for further information).
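The following sketch illustrates the idea of the conventional approach as summarized above: random grouping, per-feature mean and SD, and independent normal sampling. The details of [3] are not reproduced here, so the group sizes, the use of the group mean as the reference BP, and the function name `bootstrap_augment` are assumptions.

```python
import numpy as np

def bootstrap_augment(X, y, n_groups, n_pseudo_per_group, rng):
    """Per-group, per-feature independent normal sampling (illustrative)."""
    idx = rng.permutation(len(X))
    pseudo_X, pseudo_y = [], []
    for g in np.array_split(idx, n_groups):
        mu = X[g].mean(axis=0)
        sd = X[g].std(axis=0, ddof=1)
        # Each feature is drawn on its own, so cross-feature correlation is lost.
        pseudo_X.append(rng.normal(mu, sd, size=(n_pseudo_per_group, X.shape[1])))
        # Assumption: the group's mean BP serves as the reference BP.
        pseudo_y.append(np.full(n_pseudo_per_group, y[g].mean()))
    return np.vstack(pseudo_X), np.concatenate(pseudo_y)

# Two strongly correlated synthetic features.
rng = np.random.default_rng(2)
a = rng.normal(size=200)
X = np.column_stack([a, 0.9 * a + 0.3 * rng.normal(size=200)])
y = 110.0 + 10.0 * a

pseudo_X, pseudo_y = bootstrap_augment(X, y, n_groups=4,
                                       n_pseudo_per_group=500, rng=rng)
corr_real = float(np.corrcoef(X, rowvar=False)[0, 1])
corr_pseudo = float(np.corrcoef(pseudo_X, rowvar=False)[0, 1])
```

In this example `corr_real` is high while `corr_pseudo` is close to zero, which is the loss of correlation that motivates the proposed method.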

Method
To improve the performance of the NN when the DB is insufficient, we propose a novel data augmentation algorithm that uses a pseudo data generator based on the multivariate Gaussian distribution, as displayed in Figure 2. Because the multivariate Gaussian distribution generalizes the univariate normal distribution to two or more variables, it can represent the distribution of a random vector of correlated variables in which each element has a univariate Gaussian distribution. The multivariate Gaussian distribution (MGD) f_MGD is formulated as follows [33]:

$$f_{\mathrm{MGD}}(X) = \frac{1}{\sqrt{(2\pi)^{k} \lvert \Sigma \rvert}} \exp\!\left( -\frac{1}{2} (X - \mu)^{\top} \Sigma^{-1} (X - \mu) \right), \qquad \Sigma_{i,j} = \mathrm{E}\!\left[ (X_i - \mu_i)(X_j - \mu_j) \right]$$

where X denotes the multi-dimensional features, µ and Σ are the mean vector and covariance matrix, respectively, k denotes the feature dimension, and i and j are indices of the feature dimension. The pseudo data are generated by sampling from the MGD with the mean and covariance obtained from the actual data.

To make the pseudo data similar to the actual data, which were collected three to six times for each subject, we first generate pseudo subjects consisting of the BP and body information (height, weight, age, and gender) and then create eight pseudo data for each pseudo subject. First, a pseudo subject is created from only the BP and body information. After that, we create pseudo feature vectors that contain all the features used in the proposed algorithm, including the BP and body information. By comparing the pseudo feature vectors with the body information and BP of the pseudo subject, more refined, high-quality pseudo data can be obtained. Creating the pseudo subject step by step in this way has the effect of purifying the created pseudo feature vectors. Eight pseudo data per subject were chosen because this is slightly more than six, the largest number of actual measurements per subject.
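Sampling from an MGD fitted to actual data can be sketched as below. The data are synthetic and the function names `fit_mgd` and `sample_mgd` are hypothetical; the point of the example is only that the sampled pseudo data keep the correlation structure of the data they were fitted to.

```python
import numpy as np

def fit_mgd(data):
    """Estimate the mean vector and covariance matrix from actual data."""
    mu = data.mean(axis=0)
    sigma = np.cov(data, rowvar=False)
    return mu, sigma

def sample_mgd(mu, sigma, n, rng):
    """Draw n pseudo samples from the fitted multivariate Gaussian."""
    return rng.multivariate_normal(mu, sigma, size=n)

# Synthetic "actual" data with three correlated columns.
rng = np.random.default_rng(3)
base = rng.normal(size=(300, 1))
real = np.hstack([
    110.0 + 12.0 * base + rng.normal(scale=4.0, size=(300, 1)),
    166.0 + 7.0 * base + rng.normal(scale=5.0, size=(300, 1)),
    65.0 + 10.0 * base + rng.normal(scale=8.0, size=(300, 1)),
])

mu, sigma = fit_mgd(real)
pseudo = sample_mgd(mu, sigma, n=5000, rng=rng)

corr_real = np.corrcoef(real, rowvar=False)
corr_pseudo = np.corrcoef(pseudo, rowvar=False)
```

Unlike per-feature independent sampling, the off-diagonal entries of `corr_pseudo` stay close to those of `corr_real`.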
As shown in Figure 3, the pseudo subjects are generated by the MGD after extracting the mean and covariance of the BP and body information in the training DB. Because the generated pseudo subjects may include outliers such as abnormal body information, these outliers are removed: a pseudo subject is discarded when its height is less than 149 cm or more than 195 cm, its weight is less than 30 kg or more than 150 kg, or its age is less than 20 years. To develop the feature vectors for the pseudo subjects, we generate pseudo feature vectors after extracting the mean vector and covariance matrix of the BP, feature vector, and body information in the training DB using Equation (6). The pseudo feature vector includes the body information so that it can be matched to a pseudo subject, and both the pseudo subject and the pseudo feature vector include the reference BP so that the reference BP of the pseudo data can be determined. Finally, to match the pseudo feature vectors to each pseudo subject, we perform Algorithm 1 (matching the pseudo subject with the pseudo feature).
In Algorithm 1, S_j denotes the vector of the jth pseudo subject, which consists of the BP and body information, and F is the matrix of candidate pseudo features that satisfy the conditions in Equation (9). H, W, A, G, and B are the height, weight, age, gender, and BP, respectively. Also, j is the index of the pseudo subject and J is the total number of pseudo subjects. In addition, m is the index of the pseudo data F_{j,m} for the pseudo subject; only the body information of F_{j,m} is replaced by the body information of S_j. P is the dimension of the signal-based features. Finally, T_H, T_W, T_A, and T_B denote the thresholds of height (cm), weight (kg), age, and BP (mmHg), respectively. When training the NN model, the pseudo data are combined with the real data.
Finally, as shown in Figure 3, after the training DB is augmented by the proposed algorithm, we train the NN model for estimating the SBP and DBP using the method described in Section 2. In addition, after the feature extraction is performed on the smart wristwatch, the extracted feature vector is transmitted via Bluetooth to the connected smartphone, which estimates the SBP and DBP using the NN parameters.
To verify the DA-MGD algorithm against the bootstrap algorithm, we trained the NN-based BPE model with the augmented training DB. For the bootstrap algorithm, the training DB was augmented by factors of 5 and 20 before training the NN model. For the proposed algorithm, 50 and 100 pseudo subjects were additionally created. T_H, T_W, T_A, and T_B were set to 5, 5, 5, and 10, respectively. Because the loop in the code of the algorithm runs indefinitely when the thresholds are too small, we empirically set the thresholds of height, weight, and age to 5 so that the algorithm runs smoothly. The threshold of the BP was set to 10 because a difference of 10 mmHg can occur between trials of the actual BP measurement.
We compared the NN models trained with the data augmentation algorithms against the baseline NN model trained without data augmentation. The number of hidden layers was set to three, and the numbers of hidden units in the layers were set to 128, 256, and 128, respectively. The number of hidden layers and units was determined empirically as the configuration with the best performance in our experiments. We used the same learning parameters in all experiments so that the data augmentation algorithms could be compared fairly. To alleviate overfitting, we employed dropout (rate 0.2) and L1 regularization.
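Because the body of Algorithm 1 is not reproduced in this text, the sketch below is only one plausible reading of the description: candidates are drawn from the MGD, kept when their height, weight, age, and BP lie within the thresholds of the pseudo subject, and then overwritten with the subject's body information. The column layouts, the decision not to condition on gender, the guard `max_rounds`, the synthetic mean and covariance, and all names are assumptions.

```python
import numpy as np

# Assumed layouts (hypothetical, chosen for this sketch):
#   pseudo subject  S_j = [H, W, A, G, B]
#   pseudo feature  F   = [P signal-based features..., H, W, A, G, B]

def remove_outliers(subjects):
    """Drop pseudo subjects with implausible height, weight, or age."""
    H, W, A = subjects[:, 0], subjects[:, 1], subjects[:, 2]
    keep = (H >= 149) & (H <= 195) & (W >= 30) & (W <= 150) & (A >= 20)
    return subjects[keep]

def match_subject(S_j, draw_candidates, thresholds,
                  n_per_subject=8, max_rounds=100):
    """Collect n_per_subject candidate features that agree with S_j."""
    tol = np.asarray(thresholds, dtype=float)     # T_H, T_W, T_A, T_B
    target = S_j[[0, 1, 2, 4]]                    # H, W, A, B of the subject
    kept, n_kept = [], 0
    for _ in range(max_rounds):                   # guard against endless loops
        F = draw_candidates()
        diff = np.abs(F[:, [-5, -4, -3, -1]] - target)
        ok = np.all(diff < tol, axis=1)
        kept.append(F[ok])
        n_kept += int(ok.sum())
        if n_kept >= n_per_subject:
            break
    else:
        raise RuntimeError("thresholds too small: not enough candidates matched")
    F_j = np.vstack(kept)[:n_per_subject].copy()
    F_j[:, -5:-1] = S_j[:4]                       # replace body information only
    return F_j

rng = np.random.default_rng(0)
P = 3                                             # signal-based feature count
# Body and SBP statistics follow the DB summary in the paper; the signal
# feature statistics and the 0.3 correlation are made up for illustration.
mu = np.array([0.0, 0.0, 0.0, 166.3, 65.3, 36.7, 0.35, 106.8])
sd = np.array([1.0, 1.0, 1.0, 9.0, 13.3, 10.5, 0.48, 12.6])
corr = np.full((P + 5, P + 5), 0.3) + 0.7 * np.eye(P + 5)
cov = corr * np.outer(sd, sd)

subjects = np.array([
    [170.0, 70.0, 40.0, 1.0, 110.0],    # plausible subject
    [140.0, 50.0, 30.0, 0.0, 100.0],    # too short
    [180.0, 160.0, 30.0, 1.0, 120.0],   # too heavy
    [165.0, 60.0, 18.0, 0.0, 105.0],    # too young
])
subjects = remove_outliers(subjects)

def draw():
    return rng.multivariate_normal(mu, cov, size=2000)

matched = [match_subject(S, draw, thresholds=(5, 5, 5, 10)) for S in subjects]
```

The `max_rounds` guard reflects the remark in the text that the matching loop never terminates when the thresholds are too small; here the sketch raises an error instead of looping forever.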

Statistics
To compare the DA-MGD algorithm with the bootstrap algorithm, we compared the NN-based BPE results obtained with each of the two algorithms against the reference BP. To evaluate the results, we adopted the mean error (ME) with its SD and the Pearson correlation coefficient (r-value) between the estimated BP and the reference BP. All statistical analyses were performed using MATLAB R2019b and IBM SPSS ver. 21.0 [34] (IBM Corp., Armonk, NY, USA).
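These three quantities can be computed as in the following sketch. The sign convention (estimated minus reference), the use of the sample SD, and the function name `bp_metrics` are assumptions; the numbers are invented for illustration and are not results from the paper.

```python
import numpy as np

def bp_metrics(estimated, reference):
    """Return the mean error, the SD of the error, and the Pearson r-value."""
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    err = estimated - reference          # assumed sign convention
    me = float(err.mean())
    sd = float(err.std(ddof=1))          # sample SD (assumption)
    r = float(np.corrcoef(estimated, reference)[0, 1])
    return me, sd, r

reference = [110.0, 120.0, 130.0, 100.0]   # illustrative values in mmHg
estimated = [112.0, 118.0, 133.0, 101.0]
me, sd, r = bp_metrics(estimated, reference)
# Errors are [2, -2, 3, 1], so me is 1.0 and sd is sqrt(14 / 3).
```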

Data Collection Protocol and Data Sets
This research was approved by a local research ethics committee, and every participant signed informed consent before the measurement. For this experiment, we used a smart wristwatch (InBody smart wristwatch, InBody Corp., Seoul, Korea) with an embedded ECG sensor (device: AD8233, sampling rate: 500 Hz) and PPG sensor (device: ADPD174GGI, sampling rate: 500 Hz, two green light-emitting diodes). The DB and labels were collected using the wristwatch and a mercury sphygmomanometer (Desk type 0320, Baumanometer, New York, NY, USA), whose error limit was ±3 mmHg. To obtain the reference BP, noninvasive BP monitoring was performed under the guidance of a nurse: the SBP and DBP were measured with the mercury sphygmomanometer while the subject wore the smart wristwatch.
However, because the PPG signal cannot be obtained with the wristwatch while the subject wears the cuff of the mercury sphygmomanometer, the reference BP (SBP and DBP) and the signals cannot be measured simultaneously. Thus, the BP was measured four times (or seven times), and the PPG and ECG signals (20 s) were recorded during the rest time between consecutive BP measurements. The BP assigned to each PPG and ECG recording was determined by averaging the BP values measured immediately before and after it. Therefore, each entry of the DB contains the two 20 s signals and the corresponding average SBP and DBP.
We collected the DB from 110 subjects (mean ± SD, height: 166.3 ± 9.0 cm, weight: 65.3 ± 13.3 kg, age: 36.7 ± 10.5 years, SBP: 106.8 ± 12.6 mmHg, DBP: 67.1 ± 10.2 mmHg, and gender (male/female, %): 35/65). Specifically, the data for 61 subjects were collected three times per subject on the left arm, and the data for the remaining subjects were collected three times per subject on each of the left and right arms. The PPG and ECG signals were collected simultaneously for 20 s. To evaluate the proposed algorithm, the 110 subjects were divided into four groups; two of the four groups were used as the training DB, and the remaining two groups were used as the test DB, which is not included in learning. Because the number of subjects is not a multiple of four, the four groups contained 27, 27, 28, and 28 subjects, respectively.
To form the groups, the average SBP of each subject was calculated, and the subjects were sorted by average SBP in ascending order. Each subject was then assigned to one of the four groups in order. However, because our DB contained very few hypertension (SBP > 130 mmHg) and hypotension (SBP < 85 mmHg) data, these data were included in the training DB for reasonable learning.
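As a small worked example of this labeling scheme, four cuff readings yield three signal recordings, each labeled with the average of the readings taken before and after it. The readings below are invented, and the function name `reference_bp` is hypothetical.

```python
def reference_bp(cuff_readings):
    """Average each pair of consecutive cuff readings (before and after)."""
    return [(a + b) / 2 for a, b in zip(cuff_readings, cuff_readings[1:])]

sbp_cuff = [110, 114, 112, 108]    # four cuff readings in mmHg (illustrative)
sbp_ref = reference_bp(sbp_cuff)   # [112.0, 113.0, 110.0]
```

With seven cuff readings the same function returns six reference values, which matches the six-measurement case.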
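One way to realize this grouping is sketched below: subjects are sorted by average SBP, assigned to the four groups in round-robin order, and any hypertension or hypotension subject is forced into the training DB. The round-robin assignment, the choice of which two groups form the training DB, the synthetic SBP values, and the function name `split_subjects` are assumptions.

```python
import numpy as np

def split_subjects(mean_sbp, train_groups=(0, 2), low=85.0, high=130.0):
    """Return the group index (0 to 3) and a training mask for each subject."""
    mean_sbp = np.asarray(mean_sbp, dtype=float)
    order = np.argsort(mean_sbp, kind="stable")   # ascending average SBP
    group = np.empty(len(mean_sbp), dtype=int)
    group[order] = np.arange(len(mean_sbp)) % 4   # assign groups in order
    is_train = np.isin(group, train_groups)
    # Hypertension and hypotension subjects always go to the training DB.
    is_train |= (mean_sbp > high) | (mean_sbp < low)
    return group, is_train

rng = np.random.default_rng(4)
mean_sbp = rng.normal(106.8, 12.6, size=110)      # synthetic average SBP
group, is_train = split_subjects(mean_sbp)
group_sizes = sorted(np.bincount(group, minlength=4).tolist())
```

For 110 subjects this yields two groups of 27 and two groups of 28, and each group spans the whole SBP range because neighbors in the sorted order fall into different groups.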

Data Augmentation
In practice, the bootstrap algorithm created the pseudo features independently of one another. As a result, as shown in Figures 4-6, each feature vector created by the proposed algorithm properly represented the reference BP, whereas the results of the conventional algorithm did not represent the correlation between the feature vector and the reference BP well. Our experimental results are summarized in Tables 2 and 3. The proposed algorithm showed better performance than the bootstrap algorithm in terms of ME ± SD and r-value. With the proposed algorithm, the average performance in terms of the SD of the error (SDE) improved by 23% and 11% for SBP and DBP, respectively, whereas with the conventional algorithm it decreased by 5% and 13%. Likewise, the average r-value of the proposed algorithm increased by 18% and 16% for SBP and DBP, respectively, whereas the average r-value of the conventional algorithm decreased by 2% and 11%. The decrease in the r-value can be attributed to the adverse effect of the data generated by the conventional algorithm.

Discussion
Because medical data such as BP are usually limited in quantity, it is difficult to take advantage of the NN, which shows promising performance when large amounts of data are used. In practice, measurements are obtained with expensive machinery and labels are the result of time-consuming analysis by human experts, so it is difficult to collect sufficient expert-verified labeled data for BP measurement.
To address this problem, data augmentation techniques have been used to increase the size of the training DB, and the bootstrap algorithm is one of the representative techniques. However, whereas previous methods such as the bootstrap algorithm do not properly represent the correlations between multi-dimensional features in estimating the BP, the proposed algorithm expresses the correlation between the features effectively. For this reason, the proposed algorithm showed better performance than the conventional algorithm. The merit of the MGD-based generator is illustrated in Figure 7. As shown in the figure, the MGD-based generator creates the pseudo data that most closely resemble the actual data. Specifically, when we employ the MGD, the distribution of the pseudo data is similar to that of the original data, while the distribution of the bootstrap data is quite different. In addition, because the MGD-based generator also creates pseudo data in regions where the actual data are sparse, the pseudo data are more diverse than those of the bootstrap algorithm while maintaining the correlation between features. When we compared the results for 50 pseudo subjects with those for 100 pseudo subjects, the average performance of the two experiments was similar. Thus, 50 pseudo subjects were sufficient to improve the performance of the NN-based BPE; in the case of 100 pseudo subjects, a very small amount of unnecessary data may have been added. Finally, because our DB contains only a small amount of hypertension data, additional data need to be collected in future work to evaluate the performance for hypertension.

Conclusions
In this paper, we proposed a novel data augmentation algorithm for NN-based BP estimation using a smart wristwatch. The proposed data augmentation algorithm based on the MGD created the pseudo data properly while maintaining the relationship between the features, whereas the conventional algorithm cannot properly express this relationship.
For this reason, the performance of the proposed data augmentation algorithm was better than that of the bootstrap algorithm. The results therefore show that the proposed algorithm effectively alleviates the performance limitation of an NN model trained with a small training DB.