Intelligent Remote Photoplethysmography-Based Methods for Heart Rate Estimation from Face Videos: A Survey

Abstract: Over the last few years, a rich amount of research has been conducted on remote vital sign monitoring of the human body. Remote photoplethysmography (rPPG) is a camera-based, unobtrusive technology that allows continuous monitoring of changes in vital signs and thereby helps to diagnose and treat diseases earlier and more effectively. Recent advances in computer vision and its extensive applications have led to rPPG being in high demand. This paper presents a survey of different remote photoplethysmography methods and investigates all facets of heart rate analysis. We explore the challenges of video-based rPPG methods and relate them to recent advancements in the literature. We discuss the gaps within the literature and offer suggestions for future directions.


Introduction
Heart rate (HR) is an indicator of a person's total cardiac output and a prospective clinical diagnostic tool. The gold standard for cardiac measurement is the electrocardiogram (ECG), which measures the electrical activity of the heart through sensors (called electrodes) attached to the skin. These electrodes are connected by wires to an ECG recording machine, and this type of contact measurement is appropriate for clinical settings. Another method is photoplethysmography (PPG), an optical and non-invasive technique that detects changes in the blood volume pulse (BVP) in peripheral blood vessels via contact sensors attached to anatomical locations such as the wrists, fingers, and toes. Commercial wearable devices such as fitness trackers and smartwatches make use of this principle, where a sensor emits light onto the skin and measures the reflected light intensity, which varies with the optical absorption of blood [1]. Even though these methods are non-invasive, they require skin contact, which can cause discomfort, especially in neonates [2] and in elderly care.
However, in recent years, remote measurement of HR from face images and videos, by analyzing tiny color variations or body movements, has become a prominent research topic [3]. This is a practical application of PPG technology in a completely non-invasive, contactless manner and is referred to as remote photoplethysmography (rPPG). It can predict not only the heart rate but also other vital information, such as heart rate variability and blood pressure, thereby allowing the inference of mental stress [4], variations in cardiovascular functions, quality of sleep [5], and drowsiness [6]. The advent of the digital camera brought this remote method to the masses, and remote heart rate monitoring applications have spread across a range of fields.
The objectives of this survey are as follows:
• To discuss rPPG measurement using signal processing methods as well as its recent advancement in deep learning;
• To provide insight into the challenges of rPPG and offer some suggestions for future directions.

Remote Photoplethysmography
Photoplethysmography (PPG) is a noninvasive optical technique that is used to detect volumetric changes in blood in the microvascular bed of tissue [21]. This method relies on the principle that the optical absorption of human skin varies with the blood volume pulse (BVP), which reflects the amount of blood flowing through the tissues with each heartbeat. Human skin has three layers: the epidermis with capillaries, the dermis with arterioles, and the hypodermis with arteries [22]. When the skin is exposed to light of a specific wavelength, the epidermis and dermis layers scatter light, whereas the hypodermis diffuses light [23,24]. In accordance with Lambert's law of light intensity [25], the light reflected from the skin can be described in terms of diffusion and scattering.
Remote photoplethysmography (rPPG) is a contactless measurement that makes use of the PPG principle. It relies on a camera to measure changes in the red, green, and blue light reflected from the skin, expressed as the contrast between specular and diffuse reflections, as demonstrated in Figure 1. Images or videos of human skin under ambient light sources or with dedicated illumination are recorded and processed to recover the plethysmography signal from which physiological parameters are extracted. The diffuse reflection component carries the PPG information as it diffuses through the skin, whereas the specular reflection component is the one scattered by the surface of the skin. Even though the specular component has no pulse information, the total reflected light observed by the camera depends on the relative contribution of both components.
In essence, the changes in blood volume during a cardiac cycle cause minute color changes on the skin. Although these changes are invisible to the human eye, they can be captured by optical sensors. Accurate measurement of these changes generates a plethysmography signal, from which vital signs of the body such as the heart rate, heart rate variability, and respiration rate can be measured.
Figure 1. Illustration of remote photoplethysmography from face videos. This method quantifies the contrast between specular and diffuse reflection components and measures the changes in red, blue, and green light reflected from the skin due to blood flow. A physiological signal is obtained using a suitable computational approach, from which vital information is predicted.

Remote Methods for HR Detection
A camera is capable of capturing the subtle pulsation of human skin due to blood circulation and can produce red, green, and blue raw signal traces by sampling different regions of the optical spectrum. These raw signals are then processed to obtain a plethysmography signal which contains physiological information. The existing remote PPG methods for HR measurement from human face videos can be classified as shown in Figure 2.
Figure 2. Classification of computational techniques used in remote photoplethysmography for recovering physiological information from videos. From the input video stream, each frame is analyzed with different computational approaches to obtain vital information. The signal processing method is an unsupervised approach to processing input video frames, and it is classified into motion-based and color intensity-based methods. The learning-based approach is the recent trend in technology, and from the perspective of workflow, it can be classified into a supervised (hybrid) approach and end-to-end learning.
The motion-based method for detecting the heart rate (HR) originated from the ballistocardiogram [26], which explains the relation between cardiac output and the amplitude of human body movements. Later, heart rate measurement using the ballistocardiography (BCG) motion of the head with a wearable device was explained in [27]. Subsequently, the possibility of heart rate detection from face videos by measuring the subtle head motion caused by the influx of blood at each beat was shown in [28]. In this method, a combination of principal component analysis (PCA) and filtering was used to identify the individual beats, and the method was evaluated on 18 subjects. A Viola-Jones face detector [29] was used for region of interest (ROI) detection.
A motion-based method using a single ROI followed by independent component analysis (ICA) was explained in [30]. The technological improvements using BCG methods were scrutinized in [31], and it was concluded that more studies were needed to mitigate motion artifact challenges. Although these motion-based methods are invariant to illumination, voluntary head motion and complex facial expressions can degrade their reliability.
In this paper, we focus on color intensity-based methods because of their increasing attention in the literature, since they enable heart rate detection from a simple camera with ambient light as an illumination source. These methods detect heart rates from camera recordings with the help of different image and signal processing techniques. The possibility of non-contact physiological computation using a thermal camera was introduced in [32], and it was demonstrated that plethysmography signals could be measured from the human face from simple consumer-level camera recordings with ambient light conditions [33]. Since then, a substantial amount of research has been conducted in remote photoplethysmography. The rPPG methods can be split into two categories according to the previous works: signal processing-based methods and learning-based methods.

Signal Processing Methods
This method is a color intensity-based approach to measuring PPG from face videos. First, a region of interest (ROI) of each frame of the input video is detected, and then the red, green, and blue channels are spatially averaged to form raw signal traces. These traces are then processed by different signal processing techniques to recover the physiological signal. The entire process can be divided into three stages, and an overview of the general steps in the signal processing-based approach for recovering the heart rate is illustrated in Figure 3: (1) pre-processing; (2) signal extraction; (3) heart rate estimation (post-processing).

Pre-Processing

Face Detection and ROI Tracking
Since heart rate detection is based on photoplethysmography signals, which are derived from imperceptible skin color variations caused by pulsatile flow, it is essential to pre-process the video frames. The process starts with the extraction of the face and the localization of the measurement region of interest (ROI) in each video frame. In some of the previous works, face detection was performed manually, with the subject standing stock-still. However, most works have performed face detection automatically using the Viola-Jones algorithm [29], which is based on a machine learning approach and provides a bounding box of the subject's face as a result. This algorithm is a benchmark in rPPG methods, as it possesses a high detection rate and is available in the computer vision libraries of OpenCV and MATLAB.
Other popular algorithms used for face detection are active appearance models (AAM), a statistical model that provides facial landmarks [34], dlib [35], MTCNN [36], and the Kanade-Lucas-Tomasi approach [37,38], which makes limited assumptions about the image and possesses high accuracy.
Figure 3. General steps in the signal processing-based approach for recovering the heart rate. (a) Pre-processing is needed to obtain red, green, and blue traces from input video frames. This stage includes face tracking and ROI detection. (b) Signal extraction is performed using different signal processing algorithms, and it includes a filtering process to obtain a good quality physiological signal. (c) Heart rate estimation is the final step, where the physiological signal is processed using peak detection or frequency analysis to obtain the required vital information.
Selecting a suitable region of interest (ROI) is the next challenging step, as it has a direct impact on the accuracy and reliability of the general algorithm. ROI detection finds a set of pixels that has the most significant PPG information, and these pixels are spatially averaged to obtain the plethysmography signal [39].
Several studies have explained that the quality of the ROI has a direct influence on the quality of the signal. Heart rate estimation utilizing the whole face has been proposed in some previous works, although movements in the eye area may cause artifacts. Due to the high amount of light absorption, the skin regions with capillaries produce a strong signal [40]. However, many researchers selected the forehead and cheeks [41,42,43] as the most significant ROI areas, as they are less susceptible to muscle movements compared with other regions of the face. Table 1 summarizes the different methods of face selection and ROI detection. The authors of [44] proposed that the forehead and cheeks would be computationally efficient ROIs. They divided the face into different regions, analyzed them, and assessed their quality by using evaluation metrics.

Raw Signal Trace Extraction
To obtain the raw signal traces, the detected ROIs were separated into RGB channels. Then, the three channels were averaged spatially over all the pixels to obtain the red, green, and blue signal traces. Subsequent processing would be performed on these raw traces.
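The spatial averaging step described above can be sketched in a few lines. This is a minimal illustration rather than the procedure of any cited work: the function name `raw_rgb_traces`, the fixed rectangular ROI, and the R, G, B channel order are assumptions made for the example (some libraries load frames in B, G, R order instead).

```python
import numpy as np

def raw_rgb_traces(frames, roi):
    """Spatially average each color channel inside the ROI for every frame.

    frames: array of shape (T, H, W, 3); channels assumed to be in R, G, B order.
    roi: (top, bottom, left, right) pixel bounds, assumed fixed for all frames.
    Returns an array of shape (T, 3) with the red, green, and blue traces.
    """
    top, bottom, left, right = roi
    patch = frames[:, top:bottom, left:right, :].astype(np.float64)
    # Average over the two spatial axes, keeping time and channel.
    return patch.mean(axis=(1, 2))

# Synthetic example: 4 frames of 8x8 pixels whose green level rises by one per frame.
frames = np.zeros((4, 8, 8, 3), dtype=np.uint8)
frames[..., 0] = 100
frames[..., 2] = 20
for t in range(4):
    frames[t, :, :, 1] = 50 + t

traces = raw_rgb_traces(frames, roi=(2, 6, 2, 6))
print(traces.shape)  # one row per frame, one column per channel
```

In practice the ROI would be updated per frame by the face tracker, so the fixed bounds used here are only a simplification.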

Signal Extraction
This stage includes filtering and dimensionality reduction. The raw signal obtained from the ROI may contain unwanted noise due to motion or illumination. To remove the noise, a filtering process is performed on the raw RGB traces, thereby increasing the signal-to-noise ratio (SNR). An increased SNR value provides a good quality plethysmography signal.

Filtering
Filtering is the process in which digital filters are applied to the raw signal traces based on prior knowledge of HR frequencies. Before applying dimensionality reduction, a filtering process is performed on the raw signals to achieve a good signal-to-noise ratio. A frequency band of 0.7-4 Hz is normally selected, which corresponds to an HR of 42-240 beats per minute [45]. The filtered signal can be directly used for plethysmography signal detection [46]. According to [47], the green channel signal carries more PPG information compared with the other channels. However, the red and blue channels also carry some complementary information. In the green channel approach, the filtered green channel component is taken for further processing to obtain a PPG signal. It uses the spatially averaged pixel values of the green trace and normalizes them. Then, it performs an FFT to transform the signal from the time domain to the frequency domain and calculates the power spectral density (PSD) distribution.
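The band selection and the green channel approach described above can be sketched as follows. This is a simplified illustration under stated assumptions: an ideal FFT-mask band-pass stands in for whatever digital filter a given work uses, and the function names, the 30 Hz frame rate, and the synthetic green trace are all hypothetical.

```python
import numpy as np

def bandpass_fft(trace, fs, low_hz=0.7, high_hz=4.0):
    """Ideal band-pass: zero all spectral content outside [low_hz, high_hz]."""
    x = np.asarray(trace, dtype=np.float64)
    spectrum = np.fft.rfft(x - x.mean())
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=x.size)

def dominant_bpm(trace, fs, low_hz=0.7, high_hz=4.0):
    """Normalize the trace, compute its power spectrum, and return the
    in-band frequency with the highest power, converted to beats per minute."""
    x = np.asarray(trace, dtype=np.float64)
    x = (x - x.mean()) / (x.std() + 1e-12)
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    return 60.0 * freqs[band][np.argmax(power[band])]

# Hypothetical green trace: a 1.2 Hz pulse on top of a strong 0.1 Hz drift.
fs = 30.0
t = np.arange(600) / fs
green = 100.0 + 0.5 * np.sin(2 * np.pi * 1.2 * t) + 2.0 * np.sin(2 * np.pi * 0.1 * t)

filtered = bandpass_fft(green, fs)  # drift and offset removed
bpm = dominant_bpm(green, fs)       # peak searched only inside 0.7-4 Hz
print(round(bpm, 1))
```

The band limits map to beats per minute by multiplying by 60, which is how 0.7 Hz and 4 Hz become 42 and 240 beats per minute.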

Dimensionality Reduction
Dimensionality reduction methods are used to reduce the dimensionality of the raw signals to obtain a more accurate and robust PPG signal. The major classifications of rPPG methods are based on how they extract plethysmography signals from the raw traces, and the signal extraction methods can be classified broadly into three categories [48]. A PPG signal is considered a one-dimensional signal that can be represented as a weighted linear combination of the raw signals, although the weights are difficult to estimate [49]. Blind source separation (BSS) algorithms were introduced in [50]; their purpose is to separate the desired PPG signal from noise and artifacts based on statistical independence and correlation. Principal component analysis (PCA) and independent component analysis (ICA) are typical BSS techniques that are widely applied for dimensionality reduction.
An ICA-based algorithm was explained in [51] as an optimal combination of the raw signals, in which the raw signals are separated into independent non-Gaussian channels. In this method, the authors took the second component produced by ICA to be the periodic one and used it for further processing. Several authors adopted this method in their works.
Principal component analysis (PCA) was proposed in [52], whose authors claimed that their approach is as effective as ICA and may lead to the same result in some applications. Later, different methods for rPPG were investigated in [53], which favored ICA as yielding better accuracy and reliability. This BSS approach was further investigated and adopted in the literature [54], which also explained the performance limits of ICA-based techniques. In the BSS method, the raw signal traces are combined, and the most periodic independent signal is selected as the PPG signal. The main drawback of this method is that it does not account for motion in the given periodic signal. Thus, the major limitation of BSS can be concluded to be motion intolerance.
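The BSS idea of combining the raw traces and keeping the most periodic component can be sketched with PCA, which is straightforward to write with NumPy alone. This is an illustrative sketch, not the procedure of [51] or [52]: the periodicity score (share of total power held by the strongest in-band spectral peak), the function names, and the synthetic mixing matrix are assumptions.

```python
import numpy as np

def pca_components(traces):
    """Project standardized RGB traces of shape (T, 3) onto their principal axes."""
    x = np.asarray(traces, dtype=np.float64)
    x = x - x.mean(axis=0)
    x = x / (x.std(axis=0) + 1e-12)
    # SVD of the centered data; u * s holds the component time series.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u * s

def band_peak(component, fs, low_hz=0.7, high_hz=4.0):
    """Return (peak frequency in Hz, share of total power held by that peak)."""
    power = np.abs(np.fft.rfft(component)) ** 2
    freqs = np.fft.rfftfreq(component.size, d=1.0 / fs)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    idx = np.argmax(power[band])
    share = power[band][idx] / (power[1:].sum() + 1e-12)
    return freqs[band][idx], share

def most_periodic_component(components, fs):
    """Select the component with the most dominant in-band spectral peak."""
    results = [band_peak(components[:, i], fs) for i in range(components.shape[1])]
    best = max(range(len(results)), key=lambda i: results[i][1])
    return best, results[best][0]

# Hypothetical mixture: a 1.5 Hz pulse and a 0.1 Hz drift mixed into R, G, B.
fs = 30.0
t = np.arange(600) / fs
rng = np.random.default_rng(0)
pulse = np.sin(2 * np.pi * 1.5 * t)
drift = np.sin(2 * np.pi * 0.1 * t)
mix = np.array([[0.3, 1.0], [0.8, 0.5], [0.4, 0.9]])  # assumed channel weights
traces = np.stack([pulse, drift], axis=1) @ mix.T
traces = traces + 0.05 * rng.standard_normal(traces.shape)

components = pca_components(traces)
index, peak_hz = most_periodic_component(components, fs)
hr_bpm = 60.0 * peak_hz
```

Note that the selection criterion looks only at periodicity, so a periodic motion inside the pass band would be picked up just as readily as the pulse, which is the motion intolerance discussed above.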
Chrominance-based (CHROM) algorithms [55], which belong to the model-based approach, mitigate the subject motion issues in the BSS algorithm. The authors proposed a method in which the RGB pixels in each frame of the input video are identified using a color filter method and claimed that white illumination is successfully eliminated by the proposed skin tone standardization approach. CHROM eliminates the specular reflection component by using a color difference chrominance signal while taking advantage of the BSS method. However, neither method considers illumination, even though it is a significant noise source in the recovered signal.
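As a rough illustration of the chrominance idea, the sketch below uses the commonly cited CHROM projection with skin tone standardization: two chrominance signals X = 3R - 2G and Y = 1.5R + G - 1.5B are built from temporally normalized channels, band-passed, and combined as S = X - αY with α = σ(X)/σ(Y). It processes a single window (practical implementations usually work on overlapping windows), and the synthetic traces, pulse weights, and function names are assumptions made for the example.

```python
import numpy as np

def bandpass_fft(x, fs, low_hz=0.7, high_hz=4.0):
    """Ideal band-pass implemented as an FFT mask (a stand-in for a real filter)."""
    x = np.asarray(x, dtype=np.float64)
    spectrum = np.fft.rfft(x - x.mean())
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=x.size)

def chrom_pulse(traces, fs):
    """Single-window chrominance combination of RGB traces of shape (T, 3)."""
    rgb = np.asarray(traces, dtype=np.float64)
    norm = rgb / rgb.mean(axis=0)  # temporal normalization per channel
    r, g, b = norm[:, 0], norm[:, 1], norm[:, 2]
    xf = bandpass_fft(3.0 * r - 2.0 * g, fs)
    yf = bandpass_fft(1.5 * r + g - 1.5 * b, fs)
    alpha = xf.std() / (yf.std() + 1e-12)
    return xf - alpha * yf

def peak_bpm(x, fs, low_hz=0.7, high_hz=4.0):
    """Frequency of the strongest in-band spectral peak, in beats per minute."""
    x = np.asarray(x, dtype=np.float64)
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    return 60.0 * freqs[band][np.argmax(power[band])]

# Hypothetical traces: a weak 1.25 Hz pulse with channel-dependent strength and
# a stronger 2 Hz intensity disturbance that scales all channels equally.
fs = 30.0
t = np.arange(600) / fs
pulse = np.sin(2 * np.pi * 1.25 * t)
disturbance = np.sin(2 * np.pi * 2.0 * t)
mean_rgb = np.array([180.0, 120.0, 90.0])
pulse_weights = np.array([0.33, 0.77, 0.53])  # assumed relative pulse amplitudes
traces = mean_rgb * (1.0 + 0.01 * pulse[:, None] * pulse_weights
                     + 0.05 * disturbance[:, None])

green_bpm = peak_bpm(traces[:, 1], fs)            # dominated by the disturbance
chrom_bpm = peak_bpm(chrom_pulse(traces, fs), fs)  # disturbance largely cancelled
```

In this synthetic setup the disturbance affects both chrominance signals almost equally, so it nearly cancels in the difference, while the pulse enters them with opposite signs and is reinforced.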
To overcome this, the spatial subspace rotation (2SR) method was proposed in [56], which exploits the benefits of statistical measurement of multiple pixel sensors in a camera. This method is performed in both the spatial and temporal domains. First, a subspace of skin pixels is constructed, and then the rotation angle between the frames is measured to determine the PPG information. The authors claimed that the 2SR method outperformed ICA and CHROM.

Heart Rate Estimation
The heart rate (HR) is evaluated from the recovered PPG signal either by peak detection or by frequency analysis. In the peak detection approach, individual peaks are used to extract the heart rate. Later, the authors of [57] showed physiological measurements using five-band camera sensors. Based on the error range of reliable methods of heart rate detection, the medically tolerable error is set to three beats per minute (BPM), at which the accuracy of the rPPG method is considered to be the same as that of traditional contact methods. A photoplethysmography signal is considered a time-varying intensity signal. In the time domain, the heart rate (HR) is the inverse of the average time difference between two consecutive beats of the resulting physiological signal. In the frequency domain, however, the HR is taken as the frequency with the highest power in the spectrum of the physiological signal. The instantaneous HR can be calculated by measuring the beat-to-beat HR; this is more informative, but it requires accurate peak detection.
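The time-domain estimate described above can be sketched as follows. The peak detector here is a deliberately simple assumption (local maxima above the mean, separated by a minimum gap derived from a 240 beats per minute upper limit) and is not the detector of any cited work; the function names are hypothetical.

```python
import numpy as np

def find_peaks_simple(signal, fs, max_bpm=240.0):
    """Indices of local maxima above the mean, at least one minimum beat period apart."""
    x = np.asarray(signal, dtype=np.float64)
    threshold = x.mean()
    min_gap = int(fs * 60.0 / max_bpm)  # shortest plausible beat spacing in samples
    peaks = []
    for i in range(1, x.size - 1):
        is_local_max = x[i] > x[i - 1] and x[i] >= x[i + 1]
        if is_local_max and x[i] > threshold:
            if not peaks or i - peaks[-1] >= min_gap:
                peaks.append(i)
    return np.array(peaks)

def hr_from_peaks(peaks, fs):
    """Mean HR from the average inter-beat interval, plus beat-to-beat HR."""
    ibi = np.diff(peaks) / fs  # inter-beat intervals in seconds
    mean_hr = 60.0 / ibi.mean()
    instantaneous_hr = 60.0 / ibi
    return mean_hr, instantaneous_hr

# Hypothetical clean pulse signal at 1.2 Hz sampled at 30 frames per second.
fs = 30.0
t = np.arange(300) / fs
ppg = np.cos(2 * np.pi * 1.2 * t)

peaks = find_peaks_simple(ppg, fs)
mean_hr, instantaneous_hr = hr_from_peaks(peaks, fs)
```

On a noisy recovered signal, spurious or missed peaks directly distort the inter-beat intervals, which is why the beat-to-beat estimate depends so heavily on accurate peak detection.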
An automated method has been proposed to detect the peak-to-peak time between the systolic and diastolic inflection points using the second-order derivative of the recovered signal. An analysis of HR detection methods based on the variations of the inter-beat intervals was performed in [58]. A short-time Fourier transform (STFT) method for HR detection was proposed in [59], and it is more effective when the heart rate pattern changes rapidly. A predictive model was also developed using workout video frames, and it would be more useful in real-time scenarios.
However, frequency analysis is the most commonly adopted method in the literature. In this method, the extracted PPG signal is converted to the frequency domain using an FFT [60] or DCT [61], and Welch's method is used for power spectral density estimation. The strongest periodic component within the frequency band is considered the one carrying the PPG information, and it is used to compute the main heart rate over a particular period. Later, the authors of [62] introduced a generative adversarial network (GAN), a deep learning-based technique, to learn rPPG noise impacts. An analysis of some of the relevant signal processing-based rPPG methods can be found in Table 1.
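A minimal version of Welch's method, followed by the in-band peak search described above, might look like the sketch below. The segment length, the 50% overlap, the Hann window, and the function names are assumptions chosen for illustration; the one-sided scaling factor is omitted because it does not change the peak location.

```python
import numpy as np

def welch_psd(signal, fs, seg_len=256, overlap=0.5):
    """Average the windowed periodograms of overlapping segments.

    Assumes the signal holds at least seg_len samples.
    """
    x = np.asarray(signal, dtype=np.float64)
    x = x - x.mean()
    step = int(seg_len * (1.0 - overlap))
    window = np.hanning(seg_len)
    scale = fs * (window ** 2).sum()
    periodograms = []
    for start in range(0, x.size - seg_len + 1, step):
        segment = x[start:start + seg_len] * window
        periodograms.append(np.abs(np.fft.rfft(segment)) ** 2 / scale)
    freqs = np.fft.rfftfreq(seg_len, d=1.0 / fs)
    return freqs, np.mean(periodograms, axis=0)

def hr_from_psd(freqs, psd, low_hz=0.7, high_hz=4.0):
    """Heart rate in beats per minute from the strongest in-band PSD peak."""
    band = (freqs >= low_hz) & (freqs <= high_hz)
    return 60.0 * freqs[band][np.argmax(psd[band])]

# Hypothetical recovered signal: a 1.3 Hz pulse (78 beats per minute) with noise.
fs = 30.0
t = np.arange(900) / fs
rng = np.random.default_rng(1)
ppg = np.sin(2 * np.pi * 1.3 * t) + 0.2 * rng.standard_normal(t.size)

freqs, psd = welch_psd(ppg, fs)
bpm = hr_from_psd(freqs, psd)
resolution_bpm = 60.0 * fs / 256  # spacing between adjacent frequency bins
```

Averaging over segments reduces the variance of the spectral estimate at the cost of frequency resolution, so the estimate is only accurate to within about one frequency bin.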

Learning-Based Methods
Signal processing-based rPPG methods were explained in the previous sections. Recent trends in the literature include learning-based rPPG measurements. Their major benefit is that they can detect the heart rate directly from video input, with the system learning the rPPG mechanism from the beginning. Learning-based techniques can be divided into two categories for better understanding: supervised learning methods and end-to-end learning methods. An illustration of the workflow can be seen in Figure 4. With the supervised learning approach, feature extraction has to be performed manually, whereas deep learning methods extract features directly from the input video without any human intervention.

Supervised Learning Methods
This method is a combination of both the manual and learning-based approaches, in which the preprocessing part is performed manually and the result feeds into the learning networks. The motivation to develop this algorithm is to mitigate the issues of signal processing-based methods, and it was a successful strategy to a certain extent.
A machine learning approach was proposed in [63] to improve the accuracy of the conventional method; it evaluated and compared the ICA method with two machine learning techniques, linear regression and the k-nearest neighbor (kNN) classifier, in a controlled situation. Linear regression models the relationship between a dependent variable and explanatory variables, whereas kNN is a learning-based approach [64] that finds the training instances closest to a given test instance. The kNN takes the average heart rate of the k-nearest neighbors, and the results showed that it outperformed the ICA method. Later on, more advanced machine learning techniques such as convolutional neural networks [65,66] and temporal neural networks were proposed. A two-layer LSTM was explained in [67], which showed that noise signals can preserve functional signals. Synthetic signals were used to train the model, and the results were analyzed on a public domain database. A feature extraction stream can be observed in [68], which learned a robust feature representation and developed a complementary stream to extract reliable vital signals. A unified neural network was reported for estimating the HR, and performance analysis was performed using the COHFACE dataset.
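The kNN step described above, averaging the heart rate of the k nearest training instances, can be sketched as follows. The feature representation (a single hypothetical spectral feature per video), the Euclidean distance, and the toy training data are assumptions for illustration and are not taken from [63].

```python
import numpy as np

def knn_predict_hr(train_features, train_hr, query, k=3):
    """Average the heart rates of the k training instances closest to the query."""
    train_features = np.asarray(train_features, dtype=np.float64)
    train_hr = np.asarray(train_hr, dtype=np.float64)
    distances = np.linalg.norm(train_features - np.asarray(query, dtype=np.float64), axis=1)
    nearest = np.argsort(distances)[:k]
    return train_hr[nearest].mean()

# Hypothetical training set: one feature per video (e.g., a dominant frequency
# in Hz taken from a separated component) with its reference heart rate.
train_features = [[1.0], [1.1], [1.2], [2.0], [2.1]]
train_hr = [60.0, 66.0, 72.0, 120.0, 126.0]

predicted_hr = knn_predict_hr(train_features, train_hr, query=[1.05], k=3)
```

Because the prediction is an average of stored labels, it can never fall outside the range of heart rates seen during training, which is one reason such models need representative data.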
A single-photon avalanche diode (SPAD) camera-based method was introduced in [69]. It provided a hybrid method that analyzed the frame stream with a neural network followed by signal processing techniques for HR detection, and it showed its effectiveness under unrestrained illumination. A deep HR method was proposed in [70], whose authors explained a machine learning approach with a front-end and a back-end component. The front end learns independently from training video samples, whereas the back end is a fully connected neural network for HR estimation, and the method was evaluated on two different datasets.
A Siamese rPPG network [71] proposed feature learning from two facial regions simultaneously. A two-branch model was trained jointly, and the results were evaluated on three benchmark datasets and shown to surpass the results of the existing methods.

End-to-End Learning-Based Approach
With the emergence of end-to-end deep learning methods, extensive opportunities are opening up for performing tasks more efficiently. The first end-to-end learning model, DeepPhys, was introduced in [72]; it is based on a convolutional attention network (CAN) and enables spatiotemporal visualization of the signals. This paper proposed a skin reflection model that is exceptionally robust under different illumination conditions. Since it is an end-to-end system, the intermediate steps of the state-of-the-art methods could be removed successfully. The authors evaluated the proposed method on three different datasets and showed that it surpassed the state-of-the-art approaches.
Subsequently, SynRhythm, an unsupervised learning-based approach, was proposed in [73] for HR estimation. Two successive convolutional neural networks (CNNs) are used to extract the blood volume pulse from a sequence of images and thereby the heart rate. RhythmNet [74] exploits a CNN and gated recurrent units to form a spatiotemporal representation. The VIPL-HR database [75], containing 2378 visible light videos, was introduced to study the algorithm's robustness to motion and illumination variance. Nonetheless, the compression artifact challenge has yet to be investigated. More recently, a deep spatiotemporal network for recovering the HR from videos was proposed in [76], which used the MAHNOB-HCI and OBF databases for experiments. The results were evaluated and compared with the RNN- and 3DCNN-based PhysNet algorithms and showed better performance. Three signal processing methods, including the CHROM and POS methods, were replicated, and the results were compared with the proposed algorithm. The main advantage is that the proposed algorithm allows HRV features to be obtained, and it would be a beneficial method in realistic situations.
A two-step convolutional network was introduced in [77], which was trained by alternating optimization, and the results were validated on three publicly available datasets as well as on a newly collected dataset of 204 fitness-themed videos. However, compression is still a challenging scenario. The authors of [78] proposed a transfer learning strategy that learns from a limited number of face videos and pretrains a deep HR estimator on synthetic rhythm signals. This algorithm uses a sine function to represent the periodic part of the synthetic signal and limits its frequency in order to overcome challenges such as the need for a large volume of training data and illumination variation. Even though the proposed approach was effective compared with the state-of-the-art methods, it still needs a large database for more accurate HR estimation.
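To make the idea of synthetic rhythm signals concrete, the sketch below generates a band-limited sine with a random phase and additive noise, labeled by its heart rate. It is only a guess at the general shape of such a generator: the noise model, the 0.7-4 Hz frequency limit, the default parameters, and the function name are assumptions and are not taken from [78].

```python
import numpy as np

def synthetic_rhythm(hr_bpm, fs=30.0, duration_s=10.0, noise_std=0.1, seed=None):
    """Generate a noisy sine whose frequency corresponds to the given heart rate.

    The frequency is restricted to an assumed plausible HR band of 0.7-4 Hz.
    Returns the time axis and the synthetic signal.
    """
    freq_hz = hr_bpm / 60.0
    if not 0.7 <= freq_hz <= 4.0:
        raise ValueError("heart rate outside the assumed 42-240 bpm range")
    rng = np.random.default_rng(seed)
    n = int(round(fs * duration_s))
    t = np.arange(n) / fs
    phase = rng.uniform(0.0, 2.0 * np.pi)  # random phase per sample signal
    periodic = np.sin(2.0 * np.pi * freq_hz * t + phase)
    noise = rng.normal(0.0, noise_std, n)
    return t, periodic + noise

# One synthetic training example labeled with 90 beats per minute.
t, signal = synthetic_rhythm(90.0, seed=0)

# Sanity check: the spectral peak should sit at the labeled frequency.
freqs = np.fft.rfftfreq(signal.size, d=1.0 / 30.0)
peak_hz = freqs[np.argmax(np.abs(np.fft.rfft(signal - signal.mean())))]
```

Since labels come for free with generated signals, a network can be pretrained on many such examples before being fine-tuned on the limited set of real face videos.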
A neural architecture called AutoHR was proposed in [79], which evaluated the convolution difference in the spatial domain. Subsequently, the authors of [80] performed a comparative evaluation and showed that the learning-based method achieved better performance than the signal processing methods. They also showed a low error rate, which makes learning-based methods applicable in real-time scenarios. Some relevant papers on deep learning-based rPPG can be found in Table 2. An end-to-end three-dimensional (3D) spatiotemporal convolutional network was introduced in [81], which used a multi-hierarchical feature fusion-based attention module. It efficiently minimized the impact of motion and noise. Two publicly available datasets were used for evaluation, and it reconstructed the physiological signals accurately.
A three-domain segment network, ETA-rPPG Net, was illustrated in [82] along with a time domain attention module that used a convolutional kernel. A two-part loss function was proposed for supervised training, and it could effectively reduce the noise interference from illumination variation. However, despite showing better results, more robust models in low-constraint environments are still needed.
A major drawback of the learning-based approach is the large amount of data needed for training the network to achieve robustness and accuracy. To overcome this difficulty, the authors of [83] proposed an approach to training a deep HR estimator from synthetic PPG signals and a limited number of available face videos. The authors showed the effectiveness of their approach using public datasets and explained the effectiveness of extracting the HR from face videos without video processing.
Later, the authors of [84] came up with a meta-learning approach (Meta-rPPG) that focuses on using a synthetic gradient generator; it requires several transductive inference steps and achieves greater accuracy than the state-of-the-art methods. A metaphysical model that works well with supervised and unsupervised models was proposed in [85] and evaluated on two different datasets. However, the performance degraded for subjects with darker skin tones. This paper demonstrated better performance than the state-of-the-art approaches. Even though it outperformed the state-of-the-art signal processing methods, it still needs manual feature extraction. The main challenges that still need to be mitigated are the following:

• It requires a large volume of training data;
• Poor performance under realistic conditions;
• Low accuracy due to compression;
• Complexity due to intermediate steps.
An end-to-end model using undercomplete independent component analysis, U-LMA, was proposed in [85] and tested under three scenarios to estimate the nonlinear cumulative density function (CDF). Another skin segmentation method was introduced in [86] to process low-resolution inputs; it makes use of depth-wise convolutional layers to localize skin pixels. The authors demonstrated better real-time performance on a small IoT device.

Datasets
To evaluate rPPG algorithms, most authors used privately recorded datasets which are not publicly available. DEAP is a multimodal dataset which was put forward in [87] for human emotion analysis. The authors made the dataset publicly available with the physiological signals of 32 participants and 40 videos. Later, in [88], MAHNOB-HCI, a dataset with a large collection of modalities, was recorded and made open to the public. Its high synchronization accuracy makes this database beneficial for researchers who need to assess their methods and algorithms on challenging databases. The authors of [89] analyzed and evaluated different public datasets. They also introduced a cleaner PPG set with a collection of ground truth peaks for 13 major datasets to overcome the noise and miscalculations in public datasets.
Practically, the main challenge regarding datasets is the lack of publicly available datasets under realistic conditions. Most of the papers in the literature were assessed on privately owned databases, which makes it difficult to generalize the algorithms. Selections of datasets that are publicly available are shown in Table 3.

Challenges
The preceding sections explained different approaches to HR detection from face videos. From the literature, it is clear that learning-based methods are robust and flexible and work better in practical applications. Since remote photoplethysmography is a camera-based technology, certain challenges such as skin melanin tone, illumination conditions, subject motion, and compression impacts need to be addressed for accurate measurement of heart rates. In the literature, we can find different works carried out to overcome these challenges. Deep learning networks can overcome these limitations to an extent by training on large datasets.
The influence of compression schemes and motion in different video formats, together with the quality loss caused by compression artifacts, was investigated in [90], which addressed the compression problem in detail and reported a significant decrease in the performance of rPPG algorithms as compression increased. The authors observed that compression degrades the accuracy of the measured physiological signals in real-time processing. Since most of the datasets were recorded under laboratory settings with good conditions, the rPPG-based HR gives better results compared with the traditional contact-based techniques. However, in real-world applications, video compression is inevitable, as it helps to reduce storage, transmission time, and bandwidth. Videos captured with commercial cameras are processed with different compression codecs and bitrates, and so the frames obtained from a camera are significantly affected by compression artifacts. Since compression plays an important role in signal detection, the impact of compression artifacts remains an open problem, and only a few studies have been carried out in this area.
In [91], the authors explained the types of compression artifacts and proposed a single-channel framework to reduce the effects of compression. They claimed that the red and blue color components are the ones most affected by video compression due to the low bit rate. The authors of [85] developed a STVEN autoencoder to convert video from one bit rate to another. They performed an image enhancement procedure to overcome the compression effects. Subsequently, the authors of [92] proposed a deep super-resolution network for low-resolution video, which enhanced the rPPG method in compressed video, and conducted a performance analysis at varying compression levels and in different formats. The authors proposed an approach to recover PPG signals from compressed videos rather than enhancing them and also evaluated the effects of compression on different skin types. However, the authors did not consider the effects of compression from motion.
To sum up, video compression degrades the quality of PPG measurements, since rPPG relies upon subtle changes in the signal from the camera. However, compression does not noticeably affect the perceived quality of the videos, as it is typically optimized for visual quality. Since the remote methods consider minute changes in the signal, it is important to develop methods that can mitigate compression loss. Other significant gaps can be seen in the data and privacy concerns.

Data Implication
Different datasets contain different amounts of motion, which makes it difficult to generalize an algorithm. It is important to have benchmarks to evaluate the efficiency of different approaches [93]. A public benchmark dataset, Vicar Vision, has been developed to overcome the reproducibility problem in rPPG research, and it defines the illumination and motion challenges. Nevertheless, no benchmark dataset is yet available that fully addresses the challenges in the rPPG environment. Another issue is skin tone: skin with a greater amount of melanin absorbs more light than other skin types, and thus the pixels may become saturated. This results in a weaker physiological signal measurement.
Most of the datasets contain lighter skin tone participants because they were collected from European countries and the United States of America. A meta-analysis method explained the significant drop in performance for darker skin tones. To study the impact, the authors combined three datasets with different participants and concluded that datasets with better representation are needed for more accurate vital sign measurements using rPPG. The skin tone biases in the rPPG environment were investigated in [94], and a physically driven approach was proposed in [95].

Privacy Concern
Since this is a camera-based technology, there is a potential risk in terms of ethics and the privacy of the subject. Researchers have proposed innovative methods to mitigate this concern. The Privacy-Phys model was proposed in [96] based on a pretrained 3D CNN model. A novel algorithm, PulseEdit, was proposed in [97] to edit the physiological signal in facial videos and thereby protect the subject's privacy from disclosure.

Conclusions
In this paper, we performed a critical review of different remote photoplethysmography methods for heart rate detection from facial videos. This survey also aids in highlighting the advantages and disadvantages of different techniques and approaches to HR detection. Additionally, we observed the impact of compression artifacts on rPPG methods and reviewed some works that took video compression into account. A significant research gap can be seen in the literature for taking compression into consideration. Another crucial challenge that needs to be addressed is the performance gap between skin color tones, as this plays a key role in real-time scenarios. We hope that recent advancements in neural networks can help to mitigate the current issues. In our future work, we would like to develop some hybrid approaches to increase the accuracy and investigate the possibilities of advancing remote methods by using neural models to alleviate the existing challenges.