3D Convolutional Neural Networks for Remote Pulse Rate Measurement and Mapping from Facial Video

Abstract: Remote pulse rate measurement from facial video has gained particular attention over the last few years. Research exhibits significant advancements and demonstrates that common video cameras are reliable devices that can be employed to measure a large set of biomedical parameters without any contact with the subject. A new framework for measuring and mapping pulse rate from video is presented in this pilot study. The method, which relies on convolutional 3D networks, is fully automatic and does not require any special image preprocessing. In addition, the network ensures concurrent mapping by producing a prediction for each local group of pixels. A particular training procedure that employs only synthetic data is proposed. Preliminary results demonstrate that this convolutional 3D network can effectively extract pulse rate from video without the need for any processing of frames. The trained model was compared with other state-of-the-art methods on public data. Results exhibit significant agreement between estimated and ground-truth measurements: the root mean square error computed from pulse rate values assessed with the convolutional 3D network is equal to 8.64 bpm, whereas it exceeds 10 bpm for the other state-of-the-art methods. Robustness to natural motion and improved performance are the two main avenues that will be considered in future work.


Introduction
The domain of physiological signal measurement using contactless devices has gained vast attention. Research exhibits significant advancements over the last few years and demonstrates that standard video cameras are reliable devices that can be employed to measure a large set of biomedical parameters without any contact with the subject. Nevertheless, and despite important advancements, the most recent methods are still not ready to satisfy real-world applications. The main challenge consists in improving robustness to natural motion that produces undesirable noise and artifacts in the measurements. This issue is common to most systems that record and analyze images to sense vital signs and biomedical parameters. In the era of ubiquitous computing where mobile devices (smartphones, laptops, tablets, etc.) are omnipresent, cameras and webcams are sensors that are already available and, thus, that are particularly interesting for unobtrusively measuring vital signs.
Photoplethysmography (PPG) and ballistocardiography (BCG) are the two main principles for measuring pulse rate in video streams recorded by a camera. Ballistocardiography [1] relates to the observation of small body displacements [2] that appear during systole (cardiac contraction). BCG is frequently measured on sitting subjects to minimize unintentional movements. Motion associated with heartbeats or breathing phases is not noticeable by the naked eye but can be measured from video streams using computer vision [3] and video magnification [4,5] techniques.
Photoplethysmography [6] consists in indirect observation of blood volume variations by measuring absorption and reflection of light on skin tissues [7]. These fluctuations in volume are periodic and produced at each heartbeat: the volume of blood increases during systole (cardiac contraction) and decreases during diastole (cardiac relaxation). It must be emphasized that the definition of the principle is still discussed today: light variations that are remotely measured by the camera might be related to elastic deformations of the capillary bed, caused by a rise in capillary density that compresses tissues during systole, rather than to a direct observation of changes in the cross-section of pulsatile arteries [8]. Several biomedical parameters can be computed from PPG signals: blood oxygen level, also known as peripheral oxygen saturation (SpO2) [9,10], breathing rate [11][12][13], blood pressure by pulse transit time estimation [14,15], peripheral vasomotor activity [16,17], and vascular occlusion [18]. Imaging PPG has also been used to identify living skin in images [19,20].
Motion is the main limitation of PPG and BCG methods. BCG methods present two advantages over PPG methods: they work even when the skin is not visible and are not affected by variations in lighting conditions. BCG methods are, however, more affected by natural motion than PPG methods and are more prone to noise and artifacts when larger distances are considered [21]. Remote PPG has been far more exploited over the last years than BCG. Applications cover mixed reality [22], newborn health monitoring [23], physiological measurements of drivers [24], automatic skin detection and segmentation [19], and face anti-spoofing [25].
The recent advent of deep learning in computer vision showed that conventional two-stage models (handmade feature extraction and classifier learning) can largely be outperformed by representation-learning models that can learn a hierarchy of features, from low-level ones to high-level ones [26]. These models can be trained with (supervised learning) or without (unsupervised learning) labeled data. The systems yield competitive performance in object recognition [27], semantic segmentation [28], human action/pose recognition [29,30], natural language processing [31], and audio classification and speech recognition [32,33]. Deep learning approaches have also been employed in healthcare, bioinformatics, and genomics for analyzing biomedical data, DNA sequences, and medical images [34,35].
In this pilot study, we introduce an automated method for measuring pulse rate from video recordings using a representation-learning approach: 3D convolutional neural networks. The video, considered here as a consistent ensemble of frames, is fed directly into the neural network and no prior image processing (e.g., automatic face detection and tracking) is required. Simultaneous mapping of relevant PPG pixels, and consequently skin pixels, is additionally provided by the system. The volume of video data that is both uncompressed and labeled (with reference pulse rate values) being very limited, a synthetic generator of pseudo-PPG videos is proposed to train the models (Figure 1).
The remainder of the manuscript is organized as follows. Section 2 presents an overview of studies that relate to imaging photoplethysmography and remote pulse-rate measurement from videos. The materials and methods are presented in Section 3. Experimental results are presented and discussed in Section 4, just before Section 5, where conclusions are derived from the potential and limits of the methods developed in this work.

Imaging Photoplethysmography
Relevant surveys in this area of research have been published over the past few years. They are either dedicated to the measurement of cardiorespiratory signals from non-contact technologies [21,36] or specifically oriented towards imaging photoplethysmography (iPPG) [2,6,37].
The first measurements of PPG signals from facial video streams recorded by a standard camera were by Takano et al. [38] and Verkruysse et al. [39] in 2007 and 2008, respectively. The authors proposed a method that detects light intensity fluctuations on the face from a set of predefined regions of interest. This technique has been employed on monochromatic (Takano et al.) and color image sequences (Verkruysse et al.). PPG signals are simply formed by averaging the intensity of pixels included in the region of interest.

Video Recording
Imaging photoplethysmography has mainly been measured with conventional three-band (red-green-blue, RGB) cameras [2]. Monochromatic [8,38,40] and near-infrared [41] sensors were also employed in fundamental and early-stage research. McDuff et al. demonstrated that five-band cameras outperform traditional RGB cameras in the context of biomedical parameter measurement from image sequences [42]. The researchers also showed that video compression has a negative impact on PPG measurement (decrease in signal-to-noise ratio) [43]. Image resolution and sampling frequencies vary greatly even if 640 × 480 and 30 frames per second are commonly adopted in practice [6]. Illumination parameters must be considered when PPG signals are measured from image sequences. The methods work with natural and/or artificial lighting. They are, however, more effective when illumination is homogeneous and diffuse.

Image Processing
The face is the most exploited region of interest (ROI) [2]. Different automatic face-tracking algorithms have been exploited over the past years. Poh et al. proposed to resize the bounding box provided by the Viola-Jones face detector [44]. Bousefsaf et al. proposed to select only skin pixels by prior skin detection [45] and to define custom sub-regions from the face lightness distribution [46]. The cheeks and forehead correspond to other custom regions that were particularly tracked [46,47] using deformable model fitting. Block-based (or grid) methods that ensure spatial subdivision were proposed to increase the signal-to-noise ratio by retaining only the most relevant cells [48].
Some authors proposed to work with color spaces different from the standard RGB. Color spaces like L*u*v* or L*a*b* (developed by the International Commission on Illumination) that allow luminance-chrominance representation were particularly employed [45]. Spatial averaging is ultimately performed to transform 2D images into a 1D signal, each image being transformed to a scalar value.
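The spatial averaging step described above can be illustrated with a minimal sketch. The function name `spatial_average` and the toy 2 × 2 frames are hypothetical, and plain Python lists are used only to keep the example dependency-free; an array library would be the usual choice in practice.

```python
def spatial_average(frames):
    """Collapse each 2D frame (a list of pixel rows) to its mean intensity."""
    signal = []
    for frame in frames:
        total = sum(sum(row) for row in frame)
        count = sum(len(row) for row in frame)
        signal.append(total / count)
    return signal


# Three tiny 2x2 "frames" standing in for a region of interest (made-up values).
roi_frames = [
    [[10, 12], [14, 16]],
    [[11, 13], [15, 17]],
    [[9, 11], [13, 15]],
]
raw_ppg = spatial_average(roi_frames)  # one scalar per frame
```

Each frame yields one scalar, so a sequence of T frames becomes a raw PPG signal of T samples.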

Signal Processing
Independent component analysis, a blind source separation technique initially employed by Poh et al. [44], aims to remove artifacts and noise by separating the fluctuations caused by the pulse from raw PPG signals. De Haan et al. developed different color transformations to improve pulse extraction: blood volume pulse signature (PBV), chrominance signal combination (CHROM), spatial subspace rotation (S2R), and plane orthogonal to skin (POS) [49].
Different band-pass filtering techniques have been previously employed to remove artifacts and noise from raw PPG signals. Poh et al. used detrending operations to refine the pulse signal by removing irrelevant trends [50] and thus improve beat detection. The PPG signal can be filtered from its Fourier transform representation [2] or wavelet transform representation [45]. The latter allows filtering of artifacts and noise without any drastic impact on the pulse amplitude [16]. Ultimately, biomedical parameters like pulse rate, vasomotor activity (pulse amplitude), breathing rate, oxygen saturation, and pulse transit time can be computed from filtered PPG signals [2,6,21]. Stress level can also be assessed from some of these parameters [51,52].

Machine Learning
Research that covers the development of machine learning models dedicated to PPG signal measurement or biomedical parameter assessment from video streams is quite rare. Supervised machine learning techniques like linear regression and k-nearest neighbors showed better results than methods based on blind source separation [53]. The trained models are user-dependent. Support vector machine models have also been proposed to detect heart beats [54] and assess pulse rate [55].
Hsu et al. were the first to employ a standard deep convolutional neural network architecture (VGG with 15 layers). The network is trained to predict pulse rate based on the time-frequency representation of processed PPG signals [56]. Chen et al. proposed DeepPhys [57] and DeepMag [58], two deep convolutional network architectures trained to respectively predict the pulse wave and magnify color/motion variations produced by the periodic changes in blood flow. The convolutional layers are guided using attention masks to ensure the robust estimation of PPG signals under lighting fluctuation and motion. Chaichulee et al. proposed a deep convolutional neural network architecture to robustly segment skin regions and assess vital signs [59]. Špetlík et al. proposed a two-stage deep convolutional neural network (CNN) architecture [60] with an extractor stage that takes temporal sequences and outputs a signal. The latter is then fed to a heart rate estimator that predicts the pulse rate. Niu et al. employed spatiotemporal maps to train a pulse rate estimator with transfer learning [61] using both synthetic and real video data.
Contact pulse signals can also be identified using restricted Boltzmann machines and deep belief networks [62]. Deep recurrent neural network architectures, and in particular multilayer long short-term memory (LSTM), can be trained to predict arterial blood pressure from contact PPG and electrocardiogram signals [63].

3D Convolutional Networks
Convolutional neural networks (CNNs) correspond to a particular category of models dedicated to feature extraction from 2D inputs (e.g., images). In a CNN, different trainable filters followed by pooling operations are applied to input images [26]. CNNs are quite invariant to pose and lighting variations. Learning models can be trained using supervised or unsupervised approaches. They yield competitive performance in several applications, including object recognition [27], semantic segmentation [28], human action/pose recognition [29,30], natural language processing [31], and audio classification and speech recognition [32,33].
In several applications, like video surveillance, action recognition, and scene analysis, video streams are analyzed instead of simple 2D frames. Thus, 3D CNN models have been developed and employed to extract both spatial and temporal features from video streams by performing 3D convolutions [64,65]. Motion, by nature present in multiple adjacent frames, is thereby captured by 3D CNNs. More complex architectures (3D CNNs with long-term temporal convolutions) were recently proposed to capture video representations at full temporal scale [66].
Other neural network architectures dedicated to spatiotemporal data analysis were recently proposed, the majority incorporating recurrent neural networks (RNN). Graham et al. proposed drift neural networks [67], a particular architecture that merges deep CNN with a randomly initialized echo state network (the latter can be assimilated to an unconventional RNN). Convolutional gated recurrent units [68] were employed to ensure temporal reasoning (respect of the temporal order of frames). Karpathy et al. studied the relevance of CNNs paired with different temporal fusion strategies [29]. Spatiotemporal CNNs [69] and temporal segment networks [70], which combine a sparse sampling strategy and aggregation functions to enhance modeling of long-range information, have recently been proposed. Visual features combined with long short-term memory (LSTM) were also introduced by Donahue et al. [71].

Datasets
The training of complex machine learning models that comprise a large number of variables is particularly cumbersome. Here, model architecture and the selection of relevant data are crucial considerations [26]. If they are not chosen properly, training may produce a model that overfits or underfits, which often leads to poor predictions on new, unseen data [72].
Few datasets that comprise high-quality and uncompressed facial recordings with reference physiological measurements (e.g., heart rate from an electrocardiograph or pulse rate from a finger oximeter) are currently available. Compression, even at a low factor, has a significant impact on the measurement. Highly compressed video streams lead to low-quality and corrupted PPG signals [43].
MAHNOB-HCI [73] is a dataset that contains a large number of facial videos along with electrocardiographic recordings. The video streams are, however, compressed. PURE, a dataset introduced in [47] by Stricker et al., contains 60 one-minute videos along with pulse signals recorded with a finger oximeter. The video streams have not been compressed. Like PURE, the COHFACE dataset [74] also contains one-minute videos along with pulse oximeter recordings. The video streams are, however, highly compressed. More recently, Bobbia et al. proposed the UBFC-RPPG video dataset [20], which contains 43 uncompressed videos along with finger oximeter signals. The participants played a time-sensitive mathematical game that supposedly raises their pulse rate. Špetlík et al. proposed ECG-Fitness [60], a dataset that contains 204 uncompressed videos. Electrocardiograms were simultaneously recorded.
In the context of automatic learning of artificial intelligence models, the databases presented above are limited: only a small volume of data is available and the datasets include notable differences (e.g., sampling frequency, compression, and image resolution). Thus, preprocessing operations (e.g., spatial and temporal resampling) are required to unify the data.
In the following, we propose a new strategy dedicated to the simulation of synthetic PPG videos. The pulse signal is first approximated using data that are fitted to real iPPG signals.

Synthetic Data Generation
In this section, we develop a process dedicated to the generation of synthetic iPPG video streams. A suitable amount of synthetic data ensures proper training and validation of machine learning models that contain a very large number of intrinsic variables. This kind of procedure has already been employed in astronomy for the determination of galaxy morphology [75] and the detection of gravitational waves [76], and in bioinformatics for automatic genetic variant annotation [77] and feature extraction from functional magnetic resonance images [78].
The procedure has five steps. A waveform model, fitted to real iPPG pulse waves using Fourier series, is employed to construct a generic wave (see Figures 2a and 3). A two-second signal is produced from this waveform (Figure 2b), and a linear, quadratic, or cubic tendency is added (Figure 2c). Note that both amplitude and frequency are controlled. The unidimensional pulse signal is then transformed to a video using vector repetition (Figure 2d). Random noise is independently added to each image of the video stream (Figure 2e). This step reproduces natural fluctuations due to camera noise that randomly appear in images.
Figure 4. (b) Results of the curve-fitting procedure (periodic and non-periodic models were tested). The RMSE is computed for each of the 62 pulse waves and its corresponding fitted-to-data approximation. The statistics for each tested method (each histogram bin) indicate that Fourier series and sum of sines are the most relevant models. (c) Each method (e.g., Fourier series with n = 1) contains 62 sets of a_i and b_i coefficients (one set of coefficients per PPG wave). We thus choose to average the coefficients to produce a unique set of a_i and b_i coefficients for each method. Fourier series for n = 2 corresponds to the method that gives the lowest RMSE.

Modeling iPPG Waveforms
We approximated the pulse waveform measured over skin in video streams with curve models fitted to data. To this end, 62 clear PPG waves were extracted from the UBFC-RPPG dataset [20] (a typical excerpt is presented in Figure 4a).
Curves generated from periodic and non-periodic models were then fitted to data (Figure 3). Models used to approximate PPG waveforms are presented in Table 1. Goodness of fit was evaluated using the fit standard error, also known as the root mean square error (RMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n-m}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2} \qquad (1)$$
where y is a PPG pulse wave extracted from the UBFC-RPPG dataset, and ŷ is its corresponding fitted-to-data approximation. n corresponds to the number of samples in y, and m is the number of fitted coefficients in the model (n − m is the residual degrees of freedom).
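The fit standard error described above can be computed with a few lines of code. This is a sketch under the assumption that the usual residual-degrees-of-freedom definition applies (sum of squared residuals divided by n − m); the function name and the sample values are hypothetical.

```python
import math


def fit_standard_error(y, y_hat, m):
    """RMSE with n - m residual degrees of freedom (m fitted coefficients)."""
    n = len(y)
    if n != len(y_hat):
        raise ValueError("y and y_hat must have the same length")
    if n <= m:
        raise ValueError("need more samples than fitted coefficients")
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # sum of squared residuals
    return math.sqrt(sse / (n - m))


# Made-up pulse wave and fitted approximation (5 samples, 1 coefficient).
wave = [0.0, 0.5, 1.0, 0.5, 0.0]
fitted = [0.1, 0.4, 0.9, 0.6, 0.1]
rmse = fit_standard_error(wave, fitted, m=1)
```

Dividing by n − m rather than n penalizes models with many coefficients, which matters when comparing fits with different numbers of terms.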
An RMSE value was computed for each pulse wave and each method (here, a method refers to a model with a given n value, such as the polynomial model with n = 5). Statistics (mean and standard deviation) are presented in Figure 4b. They show that the Fourier series and sum of sines models are the most relevant models. As expected, the RMSE decreases (goodness of fit increases) as we increase the number of terms (n) and therefore the number of coefficients. These models present a particular advantage: they fit periodic functions and can therefore be used to generate periodic functions.
Each method (e.g., Fourier series with n = 1) currently includes 62 sets of a_i and b_i coefficients (one set of coefficients per PPG wave). We averaged the corresponding coefficients to get a unique model per method. As stated in the previous paragraph, only Fourier series and sum of sines models were considered. We then recomputed the RMSE between the 2 × 8 average models and the 62 pulse waves extracted from the UBFC-RPPG dataset. Related results are presented in Figure 4c. Independently of the model, n = 2 is the best choice (lowest RMSE). Overall, Fourier series for n = 2 is the method that gives the lowest RMSE. Because the two models are quite similar (see equations in Table 1), the error difference between them for n = 2 is slight (average model RMSE for Fourier series: 0.12; average model RMSE for sum of sines model: 0.13). We can also observe that the RMSE increases as we increase n, probably because of overfitting that restricts generalization.

Table 1. Models used to approximate PPG waveforms (columns: Model Name, Fits Periodic Functions?, Number of Coefficients).
σ corresponds to the scaling factor and µ to the mean value. We scaled the signal between 0 and 1. ω = 2πf with 0.9 ≤ f ≤ 4 Hz at intervals of 2.5/60 Hz, and 0 ≤ x ≤ 2 s. The time sampling was set to 30 Hz. This value corresponds to the typical number of frames per second delivered by standard cameras. Thus, y corresponds to a vector that integrates 60 (2 s × 30 Hz) scalars. Values of a_i and b_i coefficients (Equation (2)) are presented in Table 2. The phase of the produced signal is randomly shifted using a uniform distribution.
[...] (Equation (2)) corresponds to a linear vector instead of a scalar value. The slope (curve parameters) was randomly selected with a uniform distribution. The results of this procedure are illustrated in Figure 5 for the case of a linear tendency. Figure 5a shows an excerpt of a raw PPG signal (taken from subject #1, UBFC-RPPG dataset). Figure 5b depicts a simulated signal (dotted blue line) with its associated trend (dashed crimson line). The resulting signal is presented in a solid black line.
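The signal-generation step can be sketched as follows. This is only an illustration under stated assumptions: the Fourier coefficients below are hypothetical placeholders (the fitted values belong to Table 2), the scaling is assumed to be a min-max normalization, and the slope range is invented for the example. The sampling rate, duration, pulse rate grid, random phase, and linear tendency follow the description above.

```python
import math
import random

FS = 30.0                        # sampling rate in Hz (from the text)
DURATION = 2.0                   # signal length in seconds (from the text)
N_SAMPLES = int(FS * DURATION)   # 60 samples

# HYPOTHETICAL Fourier coefficients for n = 2 (illustration only).
# The fitted values belong to Table 2 and are not reproduced here.
A0 = 0.0
A = (0.5, 0.1)    # cosine terms a_1, a_2
B = (0.3, 0.05)   # sine terms b_1, b_2


def pulse_rate_grid():
    """Pulse rates from 55 to 240 bpm in steps of 2.5 bpm (75 values)."""
    return [55.0 + 2.5 * k for k in range(75)]


def synthetic_pulse(rate_bpm, rng, max_slope=0.5):
    """Two-second pseudo-PPG signal with random phase and a linear trend."""
    omega = 2.0 * math.pi * rate_bpm / 60.0    # angular frequency in rad/s
    phase = rng.uniform(0.0, 2.0 * math.pi)    # random phase shift
    times = [i / FS for i in range(N_SAMPLES)]

    wave = []
    for t in times:
        theta = omega * t + phase
        value = A0
        for order, (a, b) in enumerate(zip(A, B), start=1):
            value += a * math.cos(order * theta) + b * math.sin(order * theta)
        wave.append(value)

    # Min-max scaling to [0, 1] (assumed form of the scaling step).
    low, high = min(wave), max(wave)
    wave = [(v - low) / (high - low) for v in wave]

    # Linear tendency with a uniformly drawn slope (the range is an assumption).
    slope = rng.uniform(-max_slope, max_slope)
    return [v + slope * t for v, t in zip(wave, times)]


rng = random.Random(0)
signal = synthetic_pulse(75.0, rng)
```

Quadratic or cubic tendencies could be added in the same way by replacing the linear term with a higher-order polynomial in t.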

From 1D (Signal) to 3D (Video)
We transformed the unidimensional pulse signal y into a video stream using vector repetition (Figure 2d). At this stage, all the pixels in a frame share a unique value. This value gradually rises and falls as time progresses. The amplitude (α in Equation (3)) was randomly selected with a uniform distribution. The video corresponds to a volume v whose shape is 25 × 25 × 60.

Addition of Noise
Random noise was independently added to each frame (Figure 2e). This step reproduces natural fluctuations due to camera noise that randomly appear on images. The noise (ν in Equation (3)) was added to each pixel of a frame using a normal (Gaussian) distribution (mean: 0.5, standard deviation: 0.25). Performing a simple spatial averaging operation [39] on these small video patches produces synthetic PPG signals (see Figure 5c,d for typical examples) that are quite similar to realistic ones (Figure 5a).
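The last two steps (vector repetition and noise addition) can be sketched together. The sketch assumes that the volume is formed as α·y plus per-pixel Gaussian noise, which is one reading of the description of Equation (3); the amplitude range and the stand-in sine signal are hypothetical. The noise parameters and the 25 × 25 × 60 patch size come from the text.

```python
import math
import random


def signal_to_video(signal, alpha, rng, height=25, width=25,
                    noise_mean=0.5, noise_std=0.25):
    """Repeat each sample over a height x width frame and add Gaussian noise.

    Returns video[t][row][col] (time-first layout, chosen for convenience).
    """
    return [
        [[alpha * sample + rng.gauss(noise_mean, noise_std) for _ in range(width)]
         for _ in range(height)]
        for sample in signal
    ]


def spatial_average(video):
    """Mean pixel value of every frame, i.e. the recovered 1D signal."""
    return [sum(sum(row) for row in frame) / (len(frame) * len(frame[0]))
            for frame in video]


rng = random.Random(42)
# Stand-in pulse signal: 60 samples of a 1.25 Hz (75 bpm) sine scaled to [0, 1].
signal = [0.5 + 0.5 * math.sin(2.0 * math.pi * 1.25 * i / 30.0) for i in range(60)]
alpha = rng.uniform(0.1, 1.0)   # HYPOTHETICAL amplitude range
video = signal_to_video(signal, alpha, rng)
recovered = spatial_average(video)
```

Averaging over 625 pixels reduces the noise standard deviation by a factor of 25, which is why the recovered signal closely follows the scaled input plus the noise mean.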

3D CNN for Automatic Pulse Rate Estimation
A 3D CNN classifier structure was developed for both feature extraction and classification of unprocessed video streams. The CNN acts as a feature extractor. Its final activations feed two dense layers (multilayer perceptron) that are used to classify pulse rate. The neural network was implemented and trained in Python using the TensorFlow and Keras frameworks. All of the predicted data and statistics were processed with MATLAB.

Network Architecture
The complete architecture is presented in Figure 6. The convolution operations are performed by 32 3D filters (or kernels) of size 58 × 20 × 20. A 3D max pooling operation with a pool size of 2 × 2 × 2 follows the convolutional layer. The rectified linear unit (ReLU) is employed as the activation function. The CNN part of the network structure can be formalized as a set of three operations, namely, convolution (Equation (4)), pooling (Equation (5)), and non-linear activation (Equation (6)). In Equation (4), x = [x_1, x_2, ..., x_n] corresponds to the convolutional layer inputs, i.e., a batch of synthetic video streams v (see Equation (3)), W is the weight matrix (learnable filters), and ∗ corresponds to the convolution operator. In Equation (5), pool denotes the 3D max pooling operation. An additional dropout operation (Equation (7)) has been introduced to regularize the CNN. Dropout regularization has proven to be very effective against overfitting [79]. In Equation (7), r is a binary vector whose elements are randomly drawn. We randomly dropped out 20% of the total number of units in the convolutional layer.
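As a hedged reading of Equations (4) to (7), the four operations can be written as follows. The symbols b (bias), c, p, a, and ã, as well as the Bernoulli notation for the dropout mask, are assumptions made for this sketch and are not the authors' notation; only x, W, pool, and r come from the description above.

```latex
\begin{align}
c &= W \ast x + b \tag{4} \\
p &= \operatorname{pool}(c) \tag{5} \\
a &= \max(0,\, p) \tag{6} \\
\tilde{a} &= r \odot a, \qquad r_j \sim \operatorname{Bernoulli}(1 - 0.2) \tag{7}
\end{align}
```

Here ⊙ denotes the element-wise product, so each unit is kept with probability 0.8 and zeroed with probability 0.2 during training.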
The final activations of the CNN are then flattened and passed to a multilayer perceptron with a hidden layer that includes 512 neurons. The hidden layer is connected to the 76 output neurons: 75 for the pulse rates (55 to 240 bpm at regular intervals of 2.5 bpm) plus an extra "No PPG" class trained using synthetic videos of camera noise and illumination fluctuations. The activation functions for the first and second (output) dense layers are, respectively, ReLU and softmax functions. As for the convolutional layer, a dropout operation (fraction: 20%) is introduced to improve regularization.
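The layer sizes described in this section can be checked with simple shape arithmetic. The sketch below assumes a single input channel, a stride of 1, no padding in the convolution, and that the kernel axes are ordered as (frames, height, width); none of these details are stated explicitly in the text, so the resulting numbers are indicative only.

```python
def conv_valid(shape, kernel):
    """Output shape of a stride-1 convolution without padding (assumed)."""
    return tuple(s - k + 1 for s, k in zip(shape, kernel))


def max_pool(shape, size):
    """Output shape of a non-overlapping max pooling operation."""
    return tuple(s // p for s, p in zip(shape, size))


INPUT_SHAPE = (60, 25, 25)   # frames, height, width (axis order is an assumption)
KERNEL = (58, 20, 20)        # kernel size from the text
FILTERS = 32
HIDDEN = 512

conv_out = conv_valid(INPUT_SHAPE, KERNEL)
pool_out = max_pool(conv_out, (2, 2, 2))
flat_size = FILTERS * pool_out[0] * pool_out[1] * pool_out[2]

# 75 pulse rate classes (55 to 240 bpm, step 2.5) plus the "No PPG" class.
classes = [55.0 + 2.5 * k for k in range(75)] + ["No PPG"]

# Learnable parameters per layer (weights plus biases), one input channel assumed.
params = {
    "conv3d": FILTERS * KERNEL[0] * KERNEL[1] * KERNEL[2] + FILTERS,
    "dense_hidden": flat_size * HIDDEN + HIDDEN,
    "dense_output": HIDDEN * len(classes) + len(classes),
}
```

Under these assumptions, the very large kernel leaves only a small feature volume after pooling, so most of the learnable parameters sit in the convolutional layer itself.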

Learning the Model
The backpropagation algorithm is currently the standard training method [79]. It is based on gradient descent to update the learnable parameters. The Adam optimizer [80] was selected with an initial learning rate of 10⁻³. All weights were randomly initialized using the method proposed by Glorot and Bengio [81]. Biases were initialized to zero.
Each video was centered around zero by removing the mean value. Training was carried out by successively launching batches of 15,200 video patches (200 video patches in each of 76 classes). Thus, each batch updated the weights of the network according to an input tensor of size 15,200 × 25 × 25 × 60. New synthetic video patches were generated before passing a new batch to the network. We chose to pass a given batch of data to the network a single time (1 iteration). The number of epochs (which is the same as the number of batches because iteration = 1) was set to 5000.
We reserved a full batch of data for validation. We used this set to monitor the training process by stopping the procedure with an early-stopping criterion based on overfitting detection. We used categorical cross-entropy as a loss function. In practice, we observed that the training procedure converged well before the 5000th epoch, the decrease in validation loss becoming negligible (Figure 7a). Validation accuracy was greater than training accuracy, both being greater than 0.9 (Figure 7b). Dropout regularization presumably caused this particularity. These metrics are, of course, completely virtual since the model learned only from synthetic data. We next present pulse rate estimations that were computed on real PPG videos using the 3D CNN model trained on synthetic data.

Pulse Rate Prediction
The learned model produces a prediction for a volume of 25 × 25 × 60 pixels. The synthetic data generator used to train the model does not incorporate a stage ensuring that the frames in the 25 × 25 patches contain pixels that are naturally arranged and ordered as in real video streams. We therefore chose to break the coherent structure of pixels before predicting the pulse rate by shuffling the pixel positions. Note that only the green channel [39] was processed by the model. Maps of predictions were formed by computing a prediction for each group of pixels in the video stream. The procedure predicts and then shifts the input volume with a constant spatial step of 1 pixel (with overlapping). Typical prediction maps computed from the first 60 frames of subjects #1, #33, and #42 (UBFC-RPPG dataset) are presented in Figure 8. Blue pixels correspond to regions where no distinct PPG variations were identified (e.g., background and hair), while the other colors refer to properly predicted pulse rate values. It is important to emphasize that the network gives a score for each class (all pulse rate values plus a "No PPG" class) and that only the class that presents the highest score is saved and presented in these maps.
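The sliding-window mapping procedure can be sketched as follows. The trained network is replaced by a hypothetical stub (`stub_predict`), the class index used for "No PPG" is an assumption, and the sketch assumes that the same random permutation of pixel positions is applied to every frame of a patch so that each pixel keeps its own temporal signal; the text does not specify this detail.

```python
import random

PATCH = 25          # spatial patch size from the text
NO_PPG_CLASS = 75   # assumed index of the "No PPG" class


def shuffle_pixels(patch_frames, rng):
    """Apply one random permutation of pixel positions to every frame."""
    order = list(range(PATCH * PATCH))
    rng.shuffle(order)
    shuffled = []
    for frame in patch_frames:
        flat = [pixel for row in frame for pixel in row]
        mixed = [flat[i] for i in order]
        shuffled.append([mixed[r * PATCH:(r + 1) * PATCH] for r in range(PATCH)])
    return shuffled


def prediction_map(video, predict, rng, step=1):
    """Slide a PATCH x PATCH window over video[t][row][col] (green channel).

    Returns a 2D list with one predicted class index per window position.
    """
    height, width = len(video[0]), len(video[0][0])
    result = []
    for top in range(0, height - PATCH + 1, step):
        row_out = []
        for left in range(0, width - PATCH + 1, step):
            patch = [[row[left:left + PATCH] for row in frame[top:top + PATCH]]
                     for frame in video]
            row_out.append(predict(shuffle_pixels(patch, rng)))
        result.append(row_out)
    return result


def stub_predict(patch):
    """Placeholder for the trained 3D CNN: always answers 'No PPG'."""
    return NO_PPG_CLASS


# Tiny blank video: 60 frames of 27 x 28 pixels gives a 3 x 4 prediction map.
video = [[[0.0] * 28 for _ in range(27)] for _ in range(60)]
pmap = prediction_map(video, stub_predict, random.Random(1))
```

With a step of 1 pixel, a frame of H × W pixels yields (H − 24) × (W − 24) overlapping predictions, which is what makes the resulting map dense.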

Figure 8. Panels: image sequence; predictions map; pulse rate (bpm).
We can visually observe that the majority of pulse pixels are located in relevant regions like the cheeks and forehead. These areas contain significant PPG signal-to-noise ratios. The maps are somewhat similar to those presented in [82].
The right-hand illustrations in Figure 8 present the histograms computed from the maps of predictions. They exhibit a dominant pulse rate (main peak) of 85, 122.5, and 82.5 bpm for subjects #1, #25, and #42, respectively. The corresponding ground-truth pulse rates for these examples are, respectively, 90, 122, and 81 bpm. The histograms have been normalized so that their total energy is equal to 1. Only the bins that correspond to pulse rates are presented. The final pulse rate was computed by aggregating all of the bins in the histogram of predictions using a weighted average operation:
$$PR = \frac{\sum_{f} f\,\delta(f)}{\sum_{f} \delta(f)} \qquad (8)$$
where PR corresponds to the pulse rate value outputted by the method, f to the frequency (55 to 240 bpm at regular intervals of 2.5 bpm), and δ(f) to the amplitude (number of pixels) of a bin.
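The weighted average described above amounts to a few lines of code. In this sketch the histogram is represented as a dictionary mapping each pulse rate bin (in bpm) to its number of pixels; the function name and the example counts are hypothetical.

```python
def aggregate_pulse_rate(histogram):
    """Weighted average of pulse rate bins, weighted by their pixel counts."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("histogram contains no pulse rate predictions")
    return sum(rate * count for rate, count in histogram.items()) / total


# Made-up histogram: most pixels vote for 85 bpm, with symmetric neighbours.
example_histogram = {82.5: 100, 85.0: 300, 87.5: 100}
pulse_rate = aggregate_pulse_rate(example_histogram)
```

Because the counts are divided by their sum, the result is the same whether or not the histogram has been normalized beforehand.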

Results and Discussion
The UBFC-RPPG dataset [20] was selected to assess the performance of the neural network presented in Section 3. From the initial dataset, we manually removed image sequences in which the participant presented no distinct PPG signal (due particularly to wide head movements) or in which the ground-truth signal was corrupted. In total, 1312 pulse rate values from 15 participants were extracted from the initial dataset.
The benchmark methods (presented hereafter) operate more efficiently with prior skin detection, robust face tracking, or when pixels of interest are segmented beforehand [46]. Therefore, and in order to provide fair comparisons, the forehead or the cheeks (when the forehead was covered with hair) were manually selected as regions of interest.

Evaluation Metrics and Methods
In this section, we detail the metrics and methods employed for evaluating the performance of the neural network. We selected the mean of pulse rate error (ME), standard deviation of pulse rate error (STDE), mean absolute error (MAE, see Equation (9)), and RMSE (Equation (10)), along with Bland-Altman plots to quantify the level of agreement between the estimated and ground-truth pulse rate values.
Here, the pulse rate estimated from the image sequence is denoted PR (Equation (8)), and the ground-truth pulse rate is denoted as PR_gt. Pulse rate values estimated with the 3D CNN were also compared with four other state-of-the-art methods: GREEN, ICA, CHROM, and POS.
These four methods were implemented using iPhys, an open toolbox released by McDuff and Blackford [83]. The red, green, and blue signals were interpolated with a shape-preserving cubic function to 30 Hz before launching the methods. After computing their respective PPG signals, the four benchmark methods share a common procedure: we processed the signal with a 3rd-order Butterworth filter with cutoff frequencies set to [0.667, 4] Hz, which correspond to [40, 240] bpm. The signal was then interpolated with a cubic spline function at a frequency of 256 Hz to refine peaks. Beat-to-beat pulse rate values were finally computed from the interbeat intervals.
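The agreement metrics listed at the start of this section (ME, STDE, MAE, RMSE, and the Bland-Altman limits of agreement) can be sketched as below. The sketch assumes that the error is defined as estimate minus ground truth and that STDE is the sample standard deviation (n − 1 in the denominator); the function name and the example values are hypothetical.

```python
import math


def agreement_metrics(estimated, reference):
    """Error statistics between estimated and ground-truth pulse rates (bpm)."""
    if len(estimated) != len(reference) or len(estimated) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    errors = [e - r for e, r in zip(estimated, reference)]
    n = len(errors)
    me = sum(errors) / n                                          # mean error (bias)
    stde = math.sqrt(sum((x - me) ** 2 for x in errors) / (n - 1))
    mae = sum(abs(x) for x in errors) / n
    rmse = math.sqrt(sum(x * x for x in errors) / n)
    # Bland-Altman 95% limits of agreement: bias +/- 1.96 standard deviations.
    limits = (me - 1.96 * stde, me + 1.96 * stde)
    return {"ME": me, "STDE": stde, "MAE": mae, "RMSE": rmse, "LoA": limits}


# Made-up estimates and ground-truth values in bpm.
metrics = agreement_metrics([72.0, 80.0, 91.0, 65.0], [70.0, 82.0, 90.0, 66.0])
```

In a Bland-Altman plot, the ME value gives the horizontal bias line and the two LoA values give the limits of agreement.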

Results Analysis
General results are summarized in Table 3, while a typical excerpt is presented in Figure 9. The Bland-Altman plots presented in Figure 10 show the differences between estimates and ground-truth measurements, plotted against the ground-truth measurements. Means are represented by dash-dot lines and 95% limits of agreement (±1.96 SD) by dashed lines. Note that each ME value in Table 3 corresponds to a dash-dot line in the Bland-Altman representations. The results presented in Table 3 exhibit significant agreement between the estimated and ground-truth measurements: the RMSE computed from the pulse rate values assessed with the 3D CNN is lower than for the other methods. The MAE is, however, the lowest for POS, which is overall the most relevant benchmark method. Figure 9a,b presents the estimation for subject #31. Apart from a couple of erroneous beats at the beginning of the POS series, both methods performed well. In addition, and consistent with previous findings, GREEN, which is in fact the most straightforward method, produced noisy PPG signals and thus pulse rate series full of artifacts. GREEN presents the largest RMSE and standard deviation of pulse rate error. Surprisingly, metrics computed from CHROM estimations are very close to those of GREEN. This contrasts with the findings of Bobbia et al. [20], in which CHROM was even superior to POS. It is worth mentioning that the results cannot be directly compared because image processing techniques like skin detection and super-pixels were used in their work. These methods improved the signal-to-noise ratio and reduced noise and artifacts in the PPG signals.
The Bland-Altman plots confirm the metrics presented in Table 3. The distribution is far wider for the GREEN, ICA, and CHROM methods than for 3D CNN and POS. Visually, it can be observed that the distribution is more concentrated for POS (Figure 10d) than for 3D CNN (Figure 10e). From this observation, we can conclude that POS estimations are overall more accurate, while the method we propose presents fewer irrelevant beats (which appear as outliers in the figures). Excluding these artifacts from the pulse rate series with a dedicated filtering technique [50] would presumably confirm this observation. This kind of procedure has not been included because the main objective consisted in assessing the relevance of direct beat-to-beat pulse rate values.
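The agreement metrics and Bland-Altman quantities discussed above can be computed as in the following sketch. It assumes that ME denotes the mean of the differences (estimate minus ground truth) and that the standard deviation is the sample standard deviation. The function name `agreement_metrics` is hypothetical.

```python
import numpy as np


def agreement_metrics(estimated, reference):
    """Compute ME, SD, MAE, RMSE and 95% limits of agreement (all in bpm)."""
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    error = estimated - reference          # differences shown on the Bland-Altman plot

    me = error.mean()                      # mean error (dash-dot line)
    sd = error.std(ddof=1)                 # sample standard deviation of the error
    return {
        "ME": me,
        "SD": sd,
        "MAE": np.abs(error).mean(),
        "RMSE": np.sqrt((error ** 2).mean()),
        "LoA": (me - 1.96 * sd, me + 1.96 * sd),  # dashed lines
    }
```

Because the RMSE squares each difference, a few outlying beats weigh more heavily on it than on the MAE, which is one way a method can rank first on one metric and second on the other.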
Training and prediction were executed on a computer equipped with an Intel Xeon E5-1607 v4 CPU and an NVIDIA GeForce GTX 1080 Ti GPU. Without any software or hardware optimization, estimating a pulse rate value from a 25 × 25 × 60 video patch takes 4 ms.

Improving the Network Architecture
The model achieved valuable results given the shallow network architecture proposed in this work. Recent deep learning models adopted or built for blood volume pulse [57] or pulse rate [61] measurement from videos exhibit significant results, in particular for image sequences that contain wide head movements. The shallow architecture proposed in this work may not compete with these deep models.
The main objective of this pilot study was to assess the limits and potential of 3D CNN in the context of PPG measurement from image sequences. We therefore envisage improving the network architecture in order to get more promising results. One of the main avenues consists in increasing the number of hidden (i.e., 3D CNN) layers and integrating optimized distributed gradient boosting (XGBoost). XGBoost is a widely used method that achieves state-of-the-art results in many machine learning challenges [84].
An iterative, grid-based procedure should be developed to assess the impact of network architecture (e.g., number of layers, number of filters per convolutional layer, and number of neurons per dense layer) and hyper-parameters on performance. This procedure could provide an objective and automatic way of selecting the network architecture that achieves the highest performance.
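A grid-based selection procedure of this kind could be organized as in the sketch below. The search space, the parameter names, and the `dummy_evaluate` function are purely hypothetical placeholders. In practice, `evaluate` would train the network for a given configuration and return a validation score.

```python
from itertools import product

# Hypothetical search space; none of these values come from the study.
grid = {
    "n_conv_layers": [1, 2, 3],
    "n_filters": [16, 32, 64],
    "n_dense_units": [128, 256, 512],
    "learning_rate": [1e-2, 1e-3],
}


def grid_search(grid, evaluate):
    """Evaluate every configuration; return (best_config, best_score, results)."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        results.append((config, evaluate(config)))
    best_config, best_score = max(results, key=lambda item: item[1])
    return best_config, best_score, results


def dummy_evaluate(config):
    # Placeholder score standing in for "train the model, return validation accuracy".
    return -abs(config["n_conv_layers"] - 2) - abs(config["n_filters"] - 32) / 32
```

The exhaustive grid grows multiplicatively with each added parameter (54 configurations here), so a coarse grid or a random subset may be preferable when each evaluation requires a full training run.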
Only video patches of size 25 × 25 × 60 were analyzed by the neural network. Varying these values or adopting a spatiotemporal multiscale approach [66] should be investigated in future work. In addition, the learned pulse rate values were sampled with a constant 2.5 bpm step. Increasing the resolution of the model by reducing this interval could be particularly interesting.
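The effect of the sampling step on the set of learned pulse rate values can be illustrated with a short sketch. It assumes the 55 to 240 bpm range used by the synthetic generator and ignores any additional class reserved for patches without PPG information. The function names are hypothetical.

```python
import numpy as np


def pulse_rate_classes(low=55.0, high=240.0, step=2.5):
    """Return the pulse rate values (bpm) covered by the classifier."""
    n_classes = int(round((high - low) / step)) + 1
    return low + step * np.arange(n_classes)


def nearest_class(pr_bpm, classes):
    """Return the class value closest to a given pulse rate."""
    return float(classes[np.argmin(np.abs(classes - pr_bpm))])
```

With a 2.5 bpm step, the grid contains 75 values and the quantization error is at most 1.25 bpm. Halving the step to 1.25 bpm yields 149 values, which halves this error at the cost of a larger output layer.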

Maps Convergence during Training
Figure 11 presents some prediction maps associated with temporary models that were generated throughout training. The maps were computed from the first 60 frames of subject #1. The associated histograms are presented in the bottom row.
We can visually observe that the pixels labeled with pulse rates converged into relevant areas (i.e., forehead and cheeks) as the neural network learned, while pixels that contain no PPG information (blue pixels, e.g., from the background) were properly identified. In addition, the maps at the beginning of the learning procedure (Figure 11b,c) contain disparate pulse rate values, resulting in high-entropy histograms. After several epochs (Figure 11d,e), the entropy is lower because the maps contain one or two prevailing pulse rate values. The histogram entropy, computed using Shannon's formula, can here be treated as a confidence index: the lower the entropy, the better the prediction.
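The entropy-based confidence index can be computed from the histogram of predictions as in the following sketch. The base-2 logarithm (entropy in bits) is an assumption, as is the function name.

```python
import numpy as np


def histogram_entropy(counts):
    """Shannon entropy (bits) of a histogram given as bin counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()   # normalize counts to probabilities
    p = p[p > 0]                # empty bins contribute nothing (0 * log 0 := 0)
    return float(-(p * np.log2(p)).sum())
```

A map dominated by a single pulse rate value gives an entropy close to 0, whereas predictions spread evenly over four values give 2 bits.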

Non-Stationary Signals and Motion
The UBFC-RPPG dataset contains videos during which the participants played a time-sensitive mathematical game that supposedly raised their pulse rate. High pulse rate values can indeed be observed on the Bland-Altman plots (Figure 10). In addition to natural motion (head movements), the game tends to drastically change the PPG signal frequency over time. These signals are thus predominantly non-stationary. This has not been considered in the method we propose in this study, since the synthetic generator (Section 3.2) produces only stationary signals.
Non-stationary PPG signals have an impact on the histogram distribution computed from the prediction maps. Figure 12 presents a typical example: the more non-stationary the PPG signal contained in a video patch (Figure 12b, bottom row), the more bins are populated in the histogram of predictions (Figure 12c, bottom row), which increases entropy and reduces the pulse rate prediction accuracy. Strong and wide movements were not considered in this work: except for the trends (Section 3.2.3), the network has not been trained with data that include motion artifacts. In addition, natural variations of the trend observed in an image sequence may not follow a linear, quadratic, or cubic order. We believe that there is room for improvement and plan to investigate further in this direction. If a deeper architecture is adopted, transfer learning and/or fine-tuning approaches could be viable options for increasing performance and, presumably, handling motion [61].

Other Future Developments
Quality of the predicted maps: We plan to cross each map of predictions with a skin mask in order to assess the relevance of the detection. Metrics like precision, which is expected to be high, and recall could be used for this purpose.
Only the class (pulse rate value) that presents the highest score was saved and presented in the prediction maps. The network, however, attributes to each class a score, and we believe that this information could be exploited to improve prediction performance.
Currently, the method accepts as input only a single channel. We therefore did not consider color as we processed only the green channel. We plan to enhance the network and compare the impact of color on performance, especially in terms of pulse rate accuracy and artifact removal.
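The planned comparison between a map of predictions and a skin mask could be evaluated as in the sketch below. It assumes that both inputs are boolean arrays of identical shape, where `pulse_detected` marks locations labeled with a pulse rate and `skin_mask` marks skin pixels. Both names are hypothetical.

```python
import numpy as np


def map_precision_recall(pulse_detected, skin_mask):
    """Precision and recall of pulse detections with respect to a skin mask."""
    pulse_detected = np.asarray(pulse_detected, dtype=bool)
    skin_mask = np.asarray(skin_mask, dtype=bool)

    tp = np.sum(pulse_detected & skin_mask)    # pulse found on skin
    fp = np.sum(pulse_detected & ~skin_mask)   # pulse found outside skin
    fn = np.sum(~pulse_detected & skin_mask)   # skin without detected pulse

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return float(precision), float(recall)
```

A high precision with a moderate recall would match the expectation stated above: locations labeled with a pulse rate fall on skin, but not every skin region carries a detectable PPG signal.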

Conclusions
The main objective of the pilot study presented in this article was to assess the potential and limits of 3D convolutional neural networks dedicated to the estimation of pulse rate from video streams. The results show that this solution is promising in this particular context, despite the shallow network architecture. We envisage comparing the proposed method with other deep learning architectures developed to measure blood volume pulse or pulse rate from facial videos.
There is room for improvement here: adding more convolutional layers to the network is the principal avenue that must be investigated next. Of course, only a limited number of 3D CNN layers can be added because of the computational burden that compromises training. The impact of other types of layers, like recurrent neural networks, should also be investigated. A multiresolution approach could be of interest in order to overcome varying image resolution and distance. Both spatial and temporal resolutions can be considered with this kind of approach.

Figure 1. (top) Conventional approach: image processing operations are applied to the video stream to detect pixels or regions of interest (ROIs). The signal is traditionally computed using a spatial averaging operation over the ROI before being processed with spectral or temporal filters. Finally, biomedical parameters like pulse rate are estimated from this signal. (bottom) The approach we propose consists in training an artificial intelligence model using only synthetic data. The input corresponds to a video stream (image sequence). The model predicts a pulse rate for each video patch (25 × 25 pixels over 60 frames) and thus produces a map of predictions instead of a single estimation.

Figure 2. Flowchart of the synthetic imaging photoplethysmography (iPPG) video generator approach. (a) A realistic pulse model approximated with Fourier series (sum of sine and cosine functions, see Figure 3 and Table 1) serves as the base waveform. (b) A two-second signal is produced from this waveform. (c) A linear, quadratic, or cubic tendency is added to the signal. (d) Videos are generated by repeating the signal for each pixel. (e) Random noise is independently added to each image of the video stream. This step reproduces natural fluctuations (due to camera noise) that randomly appear on images. Note that illustrations below blocks (d,e) have been magnified.
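The five steps of this flowchart can be sketched as follows. This is a minimal illustration, not the generator used in the study. The two-harmonic Fourier coefficients are placeholders and are not the fitted values of Table 2. The trend amplitude, the Gaussian noise model, the 30 fps frame rate (60 frames over two seconds), and the (frames, height, width) axis order are likewise assumptions.

```python
import numpy as np


def synthetic_patch(pr_bpm, fs=30.0, n_frames=60, size=25,
                    trend_order=1, noise_std=0.1, rng=None):
    """Generate a synthetic iPPG video patch of shape (n_frames, size, size)."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(n_frames) / fs
    w = 2.0 * np.pi * pr_bpm / 60.0

    # (a, b) Two-second pulse signal from a Fourier series.
    # Placeholder (cosine, sine) coefficients per harmonic, NOT those of Table 2.
    harmonics = [(1.0, 0.5), (0.3, 0.2)]
    pulse = sum(a * np.cos(k * w * t) + b * np.sin(k * w * t)
                for k, (a, b) in enumerate(harmonics, start=1))

    # (c) Linear (1), quadratic (2) or cubic (3) trend with a random amplitude.
    trend = rng.uniform(-1.0, 1.0) * (t / t[-1]) ** trend_order
    signal = pulse + trend

    # (d) Repeat the same signal for every pixel of the patch.
    video = np.broadcast_to(signal[:, None, None], (n_frames, size, size)).copy()

    # (e) Add independent random noise to every pixel of every frame.
    return video + rng.normal(0.0, noise_std, size=video.shape)
```

Averaging such a patch over its spatial dimensions recovers a noisy version of the one-dimensional signal, in the same way that a spatial average over a ROI is used for real recordings.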

Figure 3. Fitting real iPPG signals to determine the best pulse wave model. Different models are tested. Fourier series were chosen because they present the lowest root mean square error (RMSE, see Figure 4 for details). They also have the advantage of fitting periodic signals.

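Table 1 (partial): model, equation, number of coefficients.
Weibull: $y = a b x^{b-1} e^{-a x^{b}}$; 2 coefficients.
Sum of sines: $y = \sum_{i=1}^{n} a_i \sin(b_i x + c_i)$, where $n$ is the number of terms, $1 \le n \le 8$; $3n$ coefficients.
Power series: $y = a x^{b}$ and $y = a x^{b} + c$; 2 and 3 coefficients.

3.2.2. Signal Formation
Two-second pulse signals were generated by varying ω between 55 and 240 bpm (0.9 and 4 Hz) at regular intervals of 2.5 bpm for the Fourier series model: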

Figure 5. Comparison between signals produced by the synthetic generator and a real PPG signal. (a) Excerpt of subject #1 raw PPG signal taken from the UBFC-RPPG dataset. Pixels from the forehead area (green channel) have been spatially averaged to compute the signal. (b) Dotted blue line: simulated PPG signal (output of Figure 2b). Dashed crimson line: linear trend. Solid black line: signal combined with the trend. (c,d) Signals outputted by the generator with two different noise factors. They were computed using a spatial averaging operation over all the pixels of the synthesized frames. The produced signals closely resemble the real PPG signal presented in panel (a).

Figure 6. Model architecture. The network integrates a 3D convolution layer (blue) with its associated 3D pooling layer (green). The stream converges to two fully connected (dense) layers (orange).

Figure 7. Learning metrics: loss (a) and accuracy (b) for training and validation.

Figure 8. Maps of predictions and their relative histograms. The model produces a prediction for each group of 25 × 25 pixels in the video stream. Blue pixels correspond to regions where no distinct PPG variations were identified, while the other colors refer to properly predicted pulse rate values. Top row: data and results for subject #1 (first 60 frames; histogram main peak: 85 bpm; ground truth: 90 bpm). Middle row: data and results for subject #33 (first 60 frames; histogram main peak: 122.5 bpm; ground truth: 122 bpm). Bottom row: data and results for subject #42 (first 60 frames; histogram main peak: 82.5 bpm; ground truth: 81 bpm).
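The histogram main peak reported in this caption can be extracted from a map of predictions as in the sketch below. It assumes that the map stores one pulse rate value (bpm) per location and that locations without PPG information are encoded with a dedicated label, here 0. Both the encoding and the function name are hypothetical.

```python
import numpy as np


def main_peak(prediction_map, no_ppg_label=0.0):
    """Return the most frequent predicted pulse rate, or None if none was found."""
    values = np.asarray(prediction_map, dtype=float).ravel()
    values = values[values != no_ppg_label]   # discard locations without PPG
    if values.size == 0:
        return None
    classes, counts = np.unique(values, return_counts=True)
    return float(classes[np.argmax(counts)])  # mode of the histogram
```

When several values share the highest count, this sketch returns the lowest one. A real pipeline might instead weight the neighboring bins or use the class scores.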

Figure 9. Typical examples of pulse rate assessment by (a) the method we propose and (b) the plane orthogonal to skin tone (POS) algorithm. Here, the two methods present relevant estimations. The pulse rate values presented in these figures were computed from the video of subject #31.

Figure 10. Beat-to-beat Bland-Altman plots showing the differences in pulse rate between video and ground-truth measurements, plotted against ground-truth measurements. Means are represented by dash-dot lines and 95% limits of agreement (±1.96 SD) by dashed lines.

Figure 11. Prediction maps (subject #1, first 60 frames) for different training epochs. (a) Video stream excerpt (face close-up). Map for (b) epoch #25, (c) epoch #45, (d) epoch #100, (e) epoch #1000. The corresponding normalized histograms are presented in the bottom row. We can visually observe that the pixels labeled with pulse rates converge into relevant areas, while pixels that contain no PPG information are properly identified.

Figure 12. Impact of a non-stationary signal on its relative histogram of predictions. (a) Prediction maps from subject #40 (region of interest: forehead). (b) PPG signals computed with the GREEN method. (c) Histograms of predictions. Top row: the frequency of the PPG signal is quite constant. Bottom row: the frequency of the PPG signal rises as time advances.

Table 1. Fitting models used to approximate the pulse waveform.

Table 2. Fourier series coefficients computed after data fitting.

Table 3. Performance of pulse rate measurement for selected UBFC-RPPG image sequences. ME: