Multitask Siamese Network for Remote Photoplethysmography and Respiration Estimation

Heart and respiration rates are important vital signs for assessing a person's health condition. To estimate these vital signs accurately, we propose a multitask Siamese network model (MTS) that combines the advantages of the Siamese network and the multitask learning architecture. The MTS model was trained on images of the cheek (including the nose and mouth) and forehead areas, sharing the same parameters between the Siamese branches in order to extract features carrying the heart and respiratory information. The proposed model was constructed with a small number of parameters and yielded a high vital-sign-prediction accuracy, comparable to that of the single-task learning model; furthermore, it outperformed the conventional multitask learning model. As a result, the MTS model can simultaneously predict the heart and respiratory signals while reducing the number of parameters by a factor of 16, with mean absolute errors of 2.84 for heart rate and 4.21 for respiration rate. Owing to its light weight, the model would be advantageous for implementing vital-sign monitoring on an edge device such as a mobile phone or a small portable device.


Introduction
In 2020, the coronavirus disease 2019 (COVID-19) pandemic introduced many changes in our lives. Given its very high infection rate, the number of infected people increases rapidly, and hospital accommodation facilities often become insufficient. Thus, patients are often required to self-quarantine at home and monitor their own conditions, yet some patients lack the knowledge required for these self-checks. The importance of non-face-to-face healthcare monitoring has therefore been emphasized [1,2], and heart rate (HR) and respiration rate (RR) are vital signs that could allow for the monitoring of post-COVID-19 infections. In an experiment involving 2745 subjects with COVID-19, the HR and RR abruptly increased when the disease symptoms began to appear [3]. This study confirmed that HR and RR could be early indicators of the development of COVID-19 [1,4]. Furthermore, HR is used to detect heart disease. According to the World Health Organization (WHO) 2019 report on human deaths caused by diseases worldwide, heart diseases rank first and chronic obstructive pulmonary disease (COPD) ranks third [3]. Heart diseases with high mortality rates are even more dangerous because they rarely show symptoms and are thus often not treated in a timely manner, which strengthens the importance of the constant monitoring of heart signals [5]. The MTTS-CAN is a model that implements on-device, contactless vital measurements and simultaneously predicts the rPPG and respiratory signals [26]; however, it is associated with complicated preprocessing. Our proposed multitask Siamese (MTS) model has fewer parameters than the MTTS-CAN and requires simpler preprocessing.
We propose the multitask Siamese (MTS) model, which can simultaneously estimate the remote PPG and respiratory signals using the same input images as the single-task Siamese network. The MTS model combines the advantages of the MTTS-CAN and Siamese network models to perform multiple tasks with high efficiency. Using the suggested model, the rPPG and respiratory signals are simultaneously estimated with higher accuracy and fewer parameters compared with the conventional multitask learning model. The proposed MTS model is computationally lightweight owing to the reduced number of model parameters, which could be advantageous for implementing on-device learning and testing on a small edge device such as a mobile phone or a portable device [25][26][27]. This could realize a continuous and convenient HR- and RR-monitoring system in our daily lives and can be expected to improve the health conditions of patients with cardiac and respiratory disorders.
Section 2 addresses the overall architecture of the MTS model with a detailed description, and Section 3 describes the experimental method used to evaluate the performance of the MTS model. In Section 4, the experimental results of the MTS model are compared with those of other remote PPG and respiratory models. In Section 5, we discuss why the performance of the MTS model is better than that of the single-task model (Siamese network with the convolutional block attention module (CBAM)). Finally, the conclusions are presented in Section 6.

Algorithm
Inspired by the Siamese rPPG network [25,28] and the MTTS-CAN [26], we propose the MTS, whose overall flowchart is illustrated in Figure 1. The structure of the MTTS-CAN implements the multitask learning process to learn more than one task at the same time, that is, the simultaneous prediction of the PPG and respiratory signals. In the original MTTS-CAN architecture, the input images of the current and previous frames must be provided together for training, and attention networks are also included, resulting in a complicated preprocessing step that causes a computational burden. On the contrary, Siamese networks require simpler image preprocessing: the original Siamese rPPG network comprises a simple image-preprocessing step with low computational complexity. In addition, the accuracy and stability of the rPPG prediction are improved owing to the weight sharing between the two branches of the Siamese networks. The MTS model is designed based on the structures of the Siamese rPPG network and the MTTS-CAN model and is expected to possess the advantages of both. A major disadvantage of conventional Siamese networks is that they require a longer training time than typical deep neural networks, since two paired branches must be trained during the learning process. The implementation of multitask learning in the Siamese networks addresses this long-training issue, because two tasks are learned simultaneously over a duration similar to that of the single-task model. In addition, the simultaneous learning of two tasks has the great advantage of reducing the number of model parameters.
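The weight sharing between the two branches can be sketched minimally as follows. This is a toy NumPy illustration with made-up shapes and a single shared dense projection standing in for the shared CNN layers; it is not the paper's actual layer configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared parameters: in a Siamese network, BOTH branches use the same weights.
# Shapes here are illustrative, not the MTS model's real layer sizes.
W = rng.standard_normal((64, 16))   # one shared projection
b = np.zeros(16)

def branch(x):
    """One Siamese branch: the shared projection followed by a leaky ReLU."""
    z = x @ W + b
    return np.where(z > 0, z, 0.01 * z)

forehead = rng.standard_normal((1, 64))  # flattened forehead-ROI features
cheek = rng.standard_normal((1, 64))     # flattened cheek-ROI features

# The two branch outputs are merged by elementwise addition (the Add layer).
merged = branch(forehead) + branch(cheek)
print(merged.shape)  # (1, 16)
```

Because `W` and `b` are the only trainable parameters, doubling the branches does not double the parameter count, which is the point of the weight-sharing design.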
The MTS model in Figure 1 predicts cardiac (PPG) and respiratory signals using facial video streams, whereby two regions of interest (ROIs), the forehead and the cheek, are separately selected as the inputs to the Siamese network. The cheek and forehead are optimal areas for capturing blood-flow changes and have often been used for rPPG predictions [27]. In particular, blood-volume pulses are usually obtained from the cheeks and forehead [25]. The proposed MTS model has a multi-input, multi-output structure, whereby the two inputs from the cheek and forehead separately enter the two branches of the Siamese network.
After the CNN layers, the outputs of the forehead and cheek streams are merged in the Add layer. The output of the Add layer yields the predicted PPG signal, and the respiratory signal is obtained through the addition of a dense layer, as shown in Figure 2. A dense layer was not added after the PPG output because doing so degraded the performance. Additionally, a dropout layer was applied to reduce the probability of overfitting [29]. The activation function is the leaky rectified linear unit (leaky ReLU) [30], formulated in Equation (1), where α is the slope of the negative part and generally has a value of 0.01. The standard ReLU outputs 0 for negative inputs and passes positive inputs through unchanged, which greatly reduces multiplications and thereby the probability of the vanishing- and exploding-gradient problems. However, with ReLU, once all the gradients in a layer become 0, learning no longer proceeds in any subsequent layer (the knockout, or dying-ReLU, problem). As shown in Figure 3, the leaky ReLU prevents this by outputting the input value multiplied by 0.01 for negative inputs [30].
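For reference, the leaky ReLU of Equation (1) reduces to a one-line NumPy function (a minimal sketch; the function name is ours):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU (Equation (1)): positive inputs pass through unchanged,
    negative inputs are scaled by the small slope alpha (typically 0.01)."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # negatives scaled by 0.01; zero and positives unchanged
```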
The CBAM [31] is an attention module used in image-classification and object-recognition models and has shown significant improvements in classification performance with few parameters. The CBAM effectively highlights and suppresses intermediate features using a channel attention module and a spatial attention module [31]. Here, the CBAM is applied to the estimation of the PPG and respiratory signals from facial video streams, highlighting the significant features of the images and thereby increasing the learning effect with few parameters.
We added a 1 × 1 convolution layer, as used in GoogLeNet, for each convolution layer [32]. The 1 × 1 convolution layer adjusts the number of channels in order to design deep networks with fewer computations and time resources. Using the 1 × 1 convolution layer, the MTS model effectively reduces the number of parameters and becomes more lightweight, making it usable in various environments.
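The parameter saving from the 1 × 1 convolution can be illustrated with a quick count. The channel sizes below are made up for illustration and are not the MTS model's actual ones:

```python
# Parameter counts illustrating why a 1x1 convolution reduces model size.
def conv_params(k, c_in, c_out):
    """Weights of a k x k convolution layer (bias terms omitted for clarity)."""
    return k * k * c_in * c_out

# Direct 3x3 convolution: 256 -> 256 channels.
direct = conv_params(3, 256, 256)                               # 589,824

# Bottleneck: a 1x1 layer reduces 256 -> 64, then a 3x3 maps 64 -> 256.
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 256)  # 163,840

print(direct, bottleneck, direct / bottleneck)  # roughly 3.6x fewer parameters
```

The same output channels are produced either way; the 1 × 1 layer simply shrinks the channel dimension that the expensive 3 × 3 kernels must operate over.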

Methods
We compared the proposed MTS model with the MTTS-CAN, the original single-task Siamese rPPG network [25], and the Siamese network with the CBAM and a 1 × 1 convolution layer. The results of the MTTS-CAN and the Siamese network with the CBAM were produced using the COHFACE dataset, and those of the original Siamese rPPG network were obtained from [25], which also used the COHFACE dataset.

Dataset
COHFACE is a remote photoplethysmography dataset presented to enable proposed algorithms to be evaluated in a standard and principled manner [33]; compared with some other datasets, it includes more realistic conditions. The dataset comprises 160 videos from 40 subjects recorded at 20 FPS with a resolution of 640 × 480 pixels. Each video was acquired for approximately 1 min using a Logitech HD C525 camera and shows the subject's face and upper body. In addition, the PPG and respiration signals were synchronized with the video; a Thought Technology device and the BioGraph Infiniti software (Thought Technology, Ltd., Montreal, QC, Canada) were used for the analysis [25,34]. Among the four video streams obtained for each subject, in two streams light evenly entered the scene to clarify the facial images, while in the other two streams the faces were captured under natural light in a relatively dark environment [22]. The training, validation, and test datasets were divided at a ratio of 3:1:1 for the cross-validation of the model.

Pre/Postprocessing
To fix the input size of the model, the cheek and forehead images of the subjects in the dataset were extracted using the dlib's face recognition library [35]. As shown in Figure 4, the forehead area was determined from the top of the head to the eyebrows and the cheek area from the tip of the nose to the chin. The areas were set in the first frame of the video and fixed throughout the entire 600 frames. Considering the ambient noise in the beginning frames of the video, 600 frames were extracted from the middle of the video streams. The true PPG and respiratory signals corresponding to each frame were obtained from the dataset.
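The fixed-ROI cropping described above can be sketched as follows. The box coordinates are hypothetical placeholders (the paper derives the regions from dlib landmarks); only the fixed-box array slicing over all 600 frames is illustrated:

```python
import numpy as np

# Hypothetical ROI boxes (y0, y1, x0, x1), fixed from the first frame.
# The paper uses dlib landmarks; these coordinates are made up.
FOREHEAD = (40, 100, 180, 420)   # top of the head down to the eyebrows
CHEEK = (220, 400, 160, 440)     # tip of the nose down to the chin

def crop_rois(video):
    """Crop both ROIs from every frame; the boxes stay fixed for all frames."""
    f = video[:, FOREHEAD[0]:FOREHEAD[1], FOREHEAD[2]:FOREHEAD[3]]
    c = video[:, CHEEK[0]:CHEEK[1], CHEEK[2]:CHEEK[3]]
    return f, c

# 600 frames of 480x640 RGB, as in the COHFACE extraction step
# (a single blank frame broadcast to 600 frames, to avoid a large allocation).
video = np.broadcast_to(np.zeros((480, 640, 3), dtype=np.uint8),
                        (600, 480, 640, 3))
forehead, cheek = crop_rois(video)
print(forehead.shape, cheek.shape)  # (600, 60, 240, 3) (600, 180, 280, 3)
```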

After the estimation of the PPG and respiration signals using the deep-learning models, the HR was calculated from the PPG signal using a fourth-order Butterworth bandpass filter, while the RR was estimated from the respiration signal using a second-order Butterworth bandpass filter. The cutoff frequencies of the filters were (0.66 Hz, 3.3 Hz) for HR and (0.1 Hz, 0.4 Hz) for RR. The HR (or RR) can be extracted from the distance between peak points: to derive the final HR and RR, the peaks of the filtered PPG and respiratory signals were detected, and the means of the intervals between the peaks were calculated using Equation (2) [5].
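A minimal sketch of the rate computation in the spirit of Equation (2) (which is not reproduced in this excerpt) follows. The simple local-maximum peak detector and the function name are our own stand-ins; in practice, the signals would first pass through the Butterworth filters described above:

```python
import numpy as np

FS = 20.0  # COHFACE video frame rate (frames per second)

def rate_from_peaks(signal, fs=FS):
    """Estimate a rate (beats or breaths per minute) from the mean interval
    between detected peaks. A strict local-maximum test stands in for a
    real peak detector; returns NaN if fewer than two peaks are found."""
    peaks = np.flatnonzero(
        (signal[1:-1] > signal[:-2]) & (signal[1:-1] > signal[2:])) + 1
    if len(peaks) < 2:
        return float("nan")
    mean_interval_s = np.mean(np.diff(peaks)) / fs
    return 60.0 / mean_interval_s

# Synthetic 1.2 Hz "PPG" (i.e., 72 bpm) sampled at 20 FPS for 30 s.
t = np.arange(0, 30, 1 / FS)
ppg = np.sin(2 * np.pi * 1.2 * t)
print(rate_from_peaks(ppg))  # close to 72
```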
where p_i denotes the ith time instance of a peak in the PPG or respiratory signal, and N is the total number of peaks.

Environment and Evaluation
We used an NVIDIA GeForce RTX 2060 graphics processing unit to train the model, which was implemented in the TensorFlow framework [36]. The loss function used to train the model was the Pearson correlation coefficient expressed by Equation (3).
where x_i and y_i are the ground-truth and predicted values of size N, and x̄ and ȳ are their mean values. The Pearson correlation r(x, y) takes a value between −1 and 1. The optimizer for the model training was Adam with a learning rate of 0.0001, β1 = 0.9, and β2 = 0.999 (β1 and β2 are the exponential decay rates for the moment estimates) [37]. The loss function in Equation (4) was applied to train the deep-learning models.
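The Pearson correlation of Equation (3) can be written compactly as below. The loss form 1 − r is an assumption on our part (a common choice that turns correlation maximization into loss minimization), since Equation (4) itself is not reproduced in this excerpt:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient r(x, y), as in Equation (3)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def correlation_loss(y_true, y_pred):
    """Training loss based on the Pearson correlation: 1 - r, so that
    maximizing the correlation minimizes the loss (an assumed form)."""
    return 1.0 - pearson_r(y_true, y_pred)

t = np.linspace(0, 2 * np.pi, 200)
print(correlation_loss(np.sin(t), np.sin(t)))   # ~0 (perfectly correlated)
print(correlation_loss(np.sin(t), -np.sin(t)))  # ~2 (perfectly anticorrelated)
```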
To evaluate the performance of the models, the metrics of the Pearson correlation coefficient (R), mean absolute error (MAE), and root-mean-square error (RMSE) were calculated, as described in Equations (5)-(7).
where X̂_i denotes the HR (or RR) estimated from the predicted PPG (or respiration) signal, X_i is the HR (or RR) obtained from the ground-truth PPG (or respiration) signal, and the overbars denote the averages of the predicted and real HRs (or RRs), respectively. We trained the MTS model within 250 epochs with a batch size of one and dropout rates of 0.25, 0.5, and 0.6. The single-task Siamese network was trained under the same conditions as the MTS. The MTTS-CAN was trained for 30 epochs with a batch size of two, since these settings yielded the best test results.
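The evaluation metrics of Equations (6) and (7) reduce to a few NumPy lines (the HR values below are made-up examples, not results from the paper):

```python
import numpy as np

def mae(x_true, x_pred):
    """Mean absolute error, as in Equation (6)."""
    return np.mean(np.abs(x_true - x_pred))

def rmse(x_true, x_pred):
    """Root-mean-square error, as in Equation (7)."""
    return np.sqrt(np.mean((x_true - x_pred) ** 2))

hr_true = np.array([70.0, 75.0, 80.0, 72.0])   # made-up ground-truth HRs
hr_pred = np.array([68.0, 78.0, 79.0, 75.0])   # made-up predicted HRs
print(mae(hr_true, hr_pred))   # 2.25
print(rmse(hr_true, hr_pred))  # ~2.40
```

The Pearson correlation R of Equation (5) is the same statistic used for the training loss in Equation (3), applied here to the derived HRs or RRs instead of the raw signals.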

Results
To evaluate the performance of the proposed MTS model, it was compared with the existing Siamese rPPG network [25], the Siamese network with the CBAM, and the MTTS-CAN [26]. The Siamese network with the CBAM adds the CBAM attention mechanism to the existing Siamese rPPG network and reduces the model parameters with a 1 × 1 convolution layer [32]; this follows MobileNet, which was successfully made lightweight for deep learning on edge devices, where the 1 × 1 convolution has a computational advantage over the standard convolutional layer and thus lightens the model [34]. These Siamese network models are single-task learning models for only one task. As a benchmark multitask learning model, the MTTS-CAN was compared, trained within 30 epochs using a batch size of two. Table 1 lists the benchmark test results, including the proposed model, the Siamese network with the CBAM, and the other conventional single-task learning models. The Siamese network with the CBAM in Table 1 is a single-task model for HR extraction, improved over the Siamese rPPG network [25] through the CBAM and the 1 × 1 convolution layer. Given that the Siamese network with the CBAM is a single-task model, learning the HR and RR separately was attempted; unfortunately, the Siamese network-based models could not learn the respiratory signals owing to their original designs for the PPG signals, and thus their RR results could not be included in the table. This point is revisited in the Discussion. Table 2 lists the HR and RR performance measurements, comparing the conventional multitask model with our proposed MTS. Table 3 lists the numbers of parameters of the benchmarked models. The MTS model yielded an HR performance comparable to that of the single-task models with fewer parameters.
In particular, the R value of the MTS was higher than that of the original Siamese network, even though the number of parameters was 16 times lower. The Siamese network with the CBAM produced a performance similar to that of the MTS with a considerably reduced number of parameters compared with the original model. The number of parameters of the Siamese network with the CBAM was slightly smaller than that of the MTS, but only because it is a single-task model. In addition, the MTS model also outperformed the multitask learning model, the MTTS-CAN, in the predictions of HR and RR while requiring fewer parameters.

Table 3. The number of parameters of the multitask Siamese network model (MTS) compared with the other models for the predictions of the heart rates and respiration rates.

Model # of Parameters
Siamese rPPG network [25] 11

The correlations between the predictions and their ground truths (real HRs) obtained using the proposed MTS and the conventional multitask model, the MTTS-CAN, are displayed as scatter plots in Figure 5. Figure 5a shows that the HR predicted using the MTS was highly correlated with the real HR (R² = 0.94). Conversely, Figure 5b shows the correlation results of the HR for the MTTS-CAN model, for which the R² value was 0.08; that is, the MTS describes the HR data much better than the MTTS-CAN. These results demonstrate that the proposed MTS model clearly outperforms the conventional multitask learning model, the MTTS-CAN. Although the R² value of the respiratory rate was lower than that of the HR, the MTS model significantly reduced the MAE and RMSE of the predicted respiratory signals compared with those of the MTTS-CAN. Figure 6 illustrates the predicted PPG and respiratory signals of four subjects with their ground truths (PPG and respiratory signals).
Although the predictions of the PPG signals appear more accurate than those of the respiratory signals, the similar peak patterns of their predictions are noticeable in all figures, thus resulting in reasonable predictions of HR and RR. The results in Figures 4 and 5 demonstrate that the proposed MTS model could produce a similar performance to that of the original Siamese rPPG network with an extremely small number of parameters. It is also noted that the MTS model could simultaneously yield reasonable predictions of the RR and HR through its multitask learning architecture, while only the HR could be estimated using the original Siamese rPPG network.
It has been demonstrated that the proposed MTS model outperforms the single-task Siamese network with the CBAM. For a more rigorous investigation of the learning processes of the MTS and the single-task Siamese network model, their learning curves were generated and are illustrated with their loss values in Figure 7. As can be seen in Figure 7c, the MTS has only one learning curve for the predictions of both the PPG and respiratory signals owing to its multitask learning architecture.
Consistent with the results in Table 1, the single-task Siamese network model produced a continuously increasing validation loss as the epochs progressed (see the blue line in Figure 7b). Because the single-task model is designed only for the prediction of the PPG signal, its networks may need to be completely redesigned to learn the respiratory signal. Conversely, the MTS succeeded in simultaneously learning the respiratory signal as well as the PPG signal with a small number of parameters.

Discussion

Previous studies on the estimation of RR using video cameras mainly relied on the monitoring of abdomen or chest movements [41,42]. This approach requires an additional camera in applications where both HR and RR must be simultaneously monitored using noncontact visual sensors, because it deals with completely different areas of the body than those used for HR estimation. The proposed multitask learning model could eliminate this additional equipment, as well as the additional model that estimates only the RR.
The proposed MTS model was successfully trained using the facial videos together with the PPG and respiratory signals, yielding accurate estimations of the HR and RR. Previous studies [11,12] have reported that PPG signals can contain respiratory information, and Nakajima et al. [43] demonstrated the extraction of respiratory rates from PPG signals. Thus, the accurate estimation of the rPPG from the facial videos using the proposed model could capture the respiratory features as well as the heartbeat information. Additionally, because the MTS model was trained using the facial videos together with respiratory signals recorded from the chest, it may have learned to monitor the movement of facial parts such as the nostrils to extract the respiratory information. In future research, we will investigate the relationship between the movement of specific facial parts and respiration to verify this.
The single-task Siamese rPPG network produced superior HR-estimation results in terms of the MAE and RMSE compared with the proposed MTS model. This performance difference is understandable given that the Siamese rPPG network was designed only for HR estimation. However, the MAE and RMSE of the MTS remain at meaningful levels of less than 5, as can be seen in Table 4, which shows the benchmark results of various rPPG-prediction models using the COHFACE dataset. In addition, the correlation coefficient of the MTS was higher than that of the Siamese rPPG network. Considering the numbers of parameters shown in Table 3, the MTS model is 16 times smaller than the single-task learning model, which is a significant advantage of the proposed multitask learning model, especially for edge devices conducting multiple tasks. The MTTS-CAN is also designed to reduce the model size and, as shown in Table 3, is 12 times smaller than the Siamese rPPG network; despite this, the proposed MTS model contains even fewer parameters than the MTTS-CAN while significantly improving the estimation performance of HR and RR, as demonstrated in Table 2.

Table 4. Benchmark performance of the various rPPG-prediction models on the COHFACE dataset [25]. 2SR, CHROM, and LiCVPR are traditional signal-processing-based methods, while HR-CNN and Two-stream are data-driven, machine-learning-based algorithms.

The proposed model selects regions of interest (ROIs) in the facial image before the training process, for which the cheeks and forehead were chosen. Recently, a study compared the results of rPPG model training using selected ROIs and the entire face area, reporting that the model trained using the entire area outperformed the other [45]. They demonstrated that skin-color changes in all facial areas would be helpful for yielding accurate rPPG signals.
For future work, this model will be trained using the full facial images, and we will try to decrease the model size while dealing with the increased input data size.

Heart rate and respiratory rate were successfully predicted through the Siamese network. Heart-rate information has a significant effect on the prediction of respiratory rate and helps improve its accuracy [11,12]; however, there is a lack of evidence that respiratory-rate information helps predict heart rate.
We applied an ROI-detection method to the facial area to train the model, extracting the cheek and forehead areas as the regions of interest for the training dataset. A recent study investigated the performance of rPPG estimation corresponding to different areas of the face, including the entire face [49]. It reported that the more facial area the model learns from, the better the rPPG-prediction performance it can produce, owing to the additional information on the skin-color changes and movements of the subjects. Therefore, further research is needed to determine whether it is better to use the entire face or to extract and use specific parts [50].
The COHFACE dataset is an open dataset recorded in a well-controlled environment; the experiments were conducted with the subjects' movements and ambient noise controlled as much as possible. However, our goal is to predict heart rates well in daily life, so noisy data must be analyzed in future studies. Previous studies have shown that NIR cameras are robust to color changes and movement [51,52]; in future research, we may consider combining an NIR camera with our MTS model.

Conclusions
We proposed a multitask Siamese network model that simultaneously estimates the PPG and respiratory signals from human facial video streams using only a camera. We applied the Siamese network and the CBAM to the multitask learning architecture with a 1 × 1 convolution layer to considerably reduce the number of parameters in the model while concurrently increasing its performance. It was demonstrated that the proposed model outperformed both the single-task model and the conventional multitask learning model for RR estimation, and exhibited an HR-prediction performance comparable to that of the single-task model without increasing the number of model parameters. The MTS is a lightweight model with few parameters and could be suitable for on-device learning architectures.