1. Introduction
Remote photoplethysmography (rPPG) enables non-contact estimation of physiological signals—most notably heart rate—using only standard video cameras. By tracking subtle color or reflectance fluctuations on the skin surface caused by blood-volume changes, rPPG provides an unobtrusive alternative to contact-based sensors and has become increasingly relevant for health monitoring, human–computer interaction, and affective computing. Numerous signal-extraction algorithms have been proposed, including chrominance-based methods (CHROM [1], POS [2]), projection-based techniques (PBV [3]), and blind source separation approaches such as PCA and ICA [4]. The green-channel baseline remains a foundational method in color-based rPPG [5]. However, the reported performance of these algorithms varies widely across datasets, sensor modalities, illumination conditions, and subject motion, making it challenging to assess their true robustness.
Classical machine-learning regressors—such as Random Forest [6], XGBoost [7], Support Vector Regression [8], and Linear Regression [9,10]—remain commonly used in rPPG research due to their simplicity, interpretability, and suitability for small datasets. Yet the interaction between rPPG extraction methods, handcrafted features, and model choice is still insufficiently explored. Many prior studies evaluate only a single dataset or a narrow selection of algorithms, leaving open questions regarding cross-dataset reliability, feature relevance, and the generalizability of classical pipelines for physiological estimation.
To address these gaps, this study presents a unified benchmarking framework that standardizes ROI extraction, signal preprocessing, rPPG estimation, feature computation, and label generation across four publicly available datasets: UBFC-rPPG Part 1 and Part 2 [11], VicarPPG-2 [12], and IMVIA-NIR [13]. These datasets collectively span low-cost RGB recordings, high-frame-rate RGB videos with controlled motion and stress conditions, and near-infrared (NIR) facial recordings designed to evaluate rPPG performance under monochromatic illumination where color cues are absent. Within this unified pipeline, we benchmark five rPPG extraction methods (Green [5], POS [2], CHROM [1], PBV [3], and PCA/ICA [4] for NIR) and four machine-learning regressors using MAE, RMSE, and R², complemented by permutation feature importance for model interpretability.
By comparing algorithmic rPPG methods and regression models under consistent conditions across both RGB and NIR modalities, this work establishes reliable baselines, identifies the most robust combinations of features and regressors, and highlights limitations inherent to small-sample classical machine-learning pipelines. The results provide a foundation for future research, including deep-learning architectures, hybrid algorithmic–learned models, and multimodal sensor-fusion approaches for remote physiological monitoring.
2. Benchmarking Datasets
2.1. UBFC-rPPG Part 1 Dataset
The UBFC-rPPG Part 1 dataset is a publicly available facial video collection designed for evaluating remote photoplethysmography (rPPG) methods [11]. It includes 6 subjects with 7 recordings captured at 30 fps using a Logitech webcam. Each recording is approximately two minutes long, synchronized with fingertip pulse oximeter measurements to provide ground-truth heart rate [11].
2.2. UBFC-rPPG Part 2 Dataset
The UBFC-rPPG Part 2 dataset extends the original collection to 42 subjects and 42 recordings, each about two minutes long [11]. Videos were captured with a low-cost Logitech C920 HD webcam at 30 fps and 640 × 480 RGB resolution. Ground-truth heart-rate signals were obtained using a fingertip pulse oximeter (CMS50E, Contec Medical Systems Co., Ltd., Qinhuangdao, China). During recordings, subjects performed a fast, mentally engaging math task to emulate realistic computer-usage scenarios. This dataset provides tightly synchronized video and physiological signals, making it suitable for benchmarking ROI-selection strategies, rPPG signal-extraction algorithms, and heart-rate estimation methods [2].
2.3. VicarPPG-2 Dataset
The VicarPPG-2 dataset is a high-frame-rate, multi-modal benchmark for evaluating heart-rate and short-term HRV estimation [12]. It contains 10 subjects recorded at 60 fps using a Logitech Brio webcam (Logitech International S.A., Lausanne, Switzerland) under controlled indoor lighting, with simultaneous ECG (250 Hz) and finger-PPG (60 Hz) ground truth. Each participant completed four 5-min conditions—baseline resting, structured head-movement tasks, a Stroop-based stress game, and post-workout recovery—totaling 200 min of synchronized video and physiological signals. The dataset is notable for long continuous recordings, realistic motion and stress variations, and dual-sensor ground truth, making it ideal for benchmarking modern rPPG and HRV algorithms [12].
2.4. IMVIA-NIR
The IMVIA-NIR dataset is a specialized public resource specifically curated for advancing research and benchmarking methods in remote photoplethysmography (rPPG) [13]. Designed to address the scarcity of publicly available resources in the Near-Infrared (NIR) vision domain, the dataset features 20 videos collected from 10 diverse subjects in an indoor setting. Data acquisition was performed using an IDS UI-3240ML-NIR-GL camera (IDS Imaging Development Systems GmbH, Obersulm, Germany) and an 850 nm NIR LED light source (OSRAM Opto Semiconductors GmbH, Regensburg, Germany), with videos recorded at 20 frames per second at a 1280 × 1024 resolution. A key feature of the dataset is its structure into two challenging subsets: one where subjects were static (‘still’) and another where subjects were speaking (‘talking’), the latter introducing significant motion artifacts. For ground-truth physiological data, the dataset provides synchronized Blood Volume Pulse (BVP) signals recorded at 64 Hz using an Empatica E4 watch (Empatica Inc., Boston, MA, USA), making it a valuable tool for training and evaluating robust rPPG algorithms under motion and NIR imaging conditions [13].
The characteristics and specifications of the data used for evaluation are summarized in Table 1, which provides an overview of the benchmark datasets used in this study.
3. Methodology
The proposed processing pipeline unifies ROI extraction, signal construction, preprocessing, rPPG estimation, feature extraction, and label generation across all RGB and NIR [11,12,13] datasets, while maintaining dataset-specific training and evaluation in order to enable fair and comparable benchmarking rather than cross-dataset generalization. For the RGB recordings, facial regions were detected using a Haar cascade, and the mean red, green, and blue intensities were extracted from each frame to form three temporal color traces. In addition to the chrominance-based rPPG algorithms, the green channel was also included as a standalone method due to its well-established high signal-to-noise ratio for pulsatile analysis [5]. In contrast, the NIR dataset provides single-channel grayscale videos; therefore, each detected facial ROI was divided into a fixed 3 × 3 grid, producing nine spatially distinct intensity signals per frame. These patch-based signals were later decomposed using PCA and ICA to obtain candidate pulsatile components. All signals—whether derived from RGB channels, the green channel, or NIR patches—were subsequently scaled, bandpass-filtered, and segmented with 50% overlap to ensure a unified preprocessing pipeline. rPPG estimation was performed using algorithmic chrominance-based methods (POS, CHROM, PBV), the standalone green-channel trace, and blind source separation methods (PCA/ICA) for the NIR recordings. The resulting waveforms were used to compute a standardized set of handcrafted temporal, spectral, and nonlinear features, while synchronized PPG signals provided per-segment heart-rate labels via peak counting. Finally, four machine-learning models—Random Forest, XGBoost, Support Vector Regression, and Linear Regression—were trained on these features, with permutation feature importance (PFI) used to assess and refine the contribution of each feature to the regression performance.
To clarify the scope of evaluation, all models were trained and tested within each dataset independently, and no cross-dataset or cross-modality train–test transfer experiments (e.g., training on UBFC and testing on Vicar) were performed. Accordingly, the reported results reflect method performance under standardized processing conditions across datasets, rather than model generalization or transferability across domains or sensing modalities.
3.1. ROI Extraction and Signal Construction
3.1.1. RGB Datasets
For each video, facial regions were detected using a Haar cascade classifier for frontal faces [14]. To ensure temporal consistency of the region of interest (ROI) and to reduce frame-to-frame jitter, the detected face bounding boxes were stabilized using a median-based approach: bounding box coordinates (position and size) were aggregated across frames, and the median bounding box was selected and applied consistently to all frames of the video. Videos were retained only if successful face detection was achieved in all but at most five frames, ensuring reliable ROI extraction; occasional missed detections within this tolerance were considered negligible and did not introduce meaningful discontinuities in the extracted signals. From each frame within the stabilized ROI, the mean intensity values of the red, green, and blue channels were computed, forming three temporal traces corresponding to R, G, and B. These traces were stored as .npy files to ensure uniform processing across datasets.
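The median-based stabilization described above can be sketched as follows. This is a minimal illustration assuming per-frame detections are available as `(x, y, w, h)` tuples; `stabilize_bbox` is a hypothetical helper name, not the paper's code:

```python
import numpy as np

def stabilize_bbox(boxes):
    """Median-stabilize per-frame face detections.

    `boxes` is an (N, 4) array of per-frame (x, y, w, h) bounding
    boxes; the coordinate-wise median box is returned and applied
    identically to every frame of the video.
    """
    boxes = np.asarray(boxes, dtype=float)
    med = np.median(boxes, axis=0)
    return np.rint(med).astype(int)  # one fixed (x, y, w, h) per video
```

Because the median is robust to outliers, a handful of spurious detections (e.g., a briefly mis-sized box) does not shift the stabilized ROI.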
3.1.2. NIR Dataset
For the NIR recordings, the videos are provided in single-channel grayscale format [13]. A frontal-face Haar cascade classifier [14] was then applied to locate the facial region in each frame. Whenever a face was detected, the corresponding bounding box was extracted and used as the region of interest.
Unlike the RGB datasets, where a single mean value per color channel was computed, the grayscale face ROI in the NIR dataset was divided into a fixed 3 × 3 grid, producing nine equally sized spatial patches. For each patch, the mean pixel intensity was computed, resulting in a nine-dimensional feature vector that captures coarse spatial variations across the face. This patch-based design provides richer information in the absence of color channels and helps preserve local reflectance differences that are relevant for rPPG estimation.
For every video, all nine-patch intensity vectors were concatenated over time, yielding a matrix of shape T × 9 (one nine-dimensional patch vector per frame), where T denotes the number of frames. These matrices were saved as .npy files to maintain consistency with the storage format used for the RGB datasets and to enable uniform downstream processing.
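The per-frame patch decomposition can be sketched as follows, assuming the ROI is available as a NumPy grayscale array (`patch_means` is an illustrative helper, not the paper's implementation):

```python
import numpy as np

def patch_means(roi, grid=3):
    """Mean intensity of each cell of a fixed grid over a grayscale ROI.

    Patch sizes come from integer division of the ROI height and width,
    so a few trailing rows/columns may be discarded, keeping all patches
    identically sized as described in the paper.
    """
    h, w = roi.shape
    ph, pw = h // grid, w // grid
    means = [roi[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw].mean()
             for i in range(grid) for j in range(grid)]
    return np.array(means)  # nine-dimensional vector for grid=3
```

Stacking one such vector per frame yields the T × 9 matrix used downstream by PCA/ICA.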
3.2. Signal Preprocessing
3.2.1. RGB Datasets
Each color channel is first amplitude-scaled from its original intensity range to a normalized range to reduce variability caused by illumination differences and to standardize subsequent filtering operations. Each channel was then processed using a third-order Butterworth bandpass filter with cutoff frequencies of 0.7–3.0 Hz, covering the physiological heart-rate range of approximately 42–180 bpm. This filtering step suppresses slow illumination drift and high-frequency noise while preserving the dominant cardiac oscillations.
To increase the number of training samples and improve robustness, the filtered RGB signals were segmented using a sliding window with 50% overlap. This ensures equal-length temporal samples and provides more data points for training the regression models.
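A minimal sketch of this preprocessing step, using SciPy for the Butterworth design (function name and the 10 s window default are illustrative; actual window lengths are dataset-specific):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_channel(x, fs, low=0.7, high=3.0, win_s=10.0):
    """Band-pass a color trace and cut it into 50%-overlapping windows.

    Third-order Butterworth, 0.7-3.0 Hz (~42-180 bpm), applied with
    `filtfilt` for zero phase; `win_s` is the segment duration in
    seconds (e.g., 10 s at fs = 30 Hz gives 300-sample segments).
    """
    b, a = butter(3, [low / (fs / 2), high / (fs / 2)], btype="band")
    y = filtfilt(b, a, x)
    n = int(win_s * fs)
    hop = n // 2  # 50% overlap
    segs = [y[s:s + n] for s in range(0, len(y) - n + 1, hop)]
    return np.array(segs)
```

The zero-phase `filtfilt` choice avoids shifting the pulse waveform relative to the synchronized PPG labels.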
3.2.2. NIR Dataset
For the NIR recordings, the raw grayscale facial signals were first decomposed using PCA and ICA, as described in the following section. After extracting the principal and independent components, the resulting waveforms underwent the same preprocessing steps as the RGB datasets: amplitude scaling to a normalized range, Butterworth bandpass filtering [15] (0.7–3.0 Hz), and segmentation using a sliding window with 50% overlap. This ensures that all datasets, regardless of modality, follow a consistent preprocessing pipeline.
3.3. rPPG Method Computation
3.3.1. RGB Datasets
The rPPG windowing procedure is independent of the segmentation length. Specifically, within each segment, all rPPG methods employed a fixed internal window length of 1.6 s, corresponding to 48 samples at 30 Hz (UBFC-rPPG Part 1/2) and 96 samples at 60 Hz (VicarPPG-2). Adjacent windows overlapped by 50%, and the resulting window-level pulse estimates were combined using Hann-weighted overlap-add reconstruction to form a continuous waveform within each segment.
For each window, RGB values are mean-normalized and the local covariance matrix Σ is computed with ridge regularization to ensure numerical stability. Specifically, a small constant ε is added to the diagonal of Σ prior to inversion.
POS Method
The POS algorithm follows the projection-based formulation proposed in [16] and relies on lightweight temporal normalization without variance equalization across channels. For each window, the RGB segment is first mean-normalized by dividing each channel by its temporal mean and subtracting 1. The normalized segment C_n = (R_n, G_n, B_n) is then projected onto two chrominance directions,

S_1 = G_n − B_n,  S_2 = −2R_n + G_n + B_n. (1)

A balancing coefficient,

α = σ(S_1)/σ(S_2), (2)

is computed per window, and the instantaneous pulse estimate is formed as

p = S_1 + α·S_2. (3)

Each windowed waveform is subsequently zero-centered and standardized, multiplied by a Hann window, and accumulated via overlap-add. After reconstruction, the POS signal is globally mean-centered and normalized to unit variance.
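The per-window POS computation can be sketched as follows (an illustrative NumPy version of the steps described above; the small 1e-12 guards against division by zero and are not part of the original formulation):

```python
import numpy as np

def pos_window(rgb):
    """POS pulse estimate for one window.

    `rgb` is a (3, n) array of mean R, G, B traces. Channels are
    temporally normalized (divide by mean, subtract 1), projected onto
    the two POS chrominance directions, and recombined with the
    per-window balancing coefficient alpha = std(S1)/std(S2).
    """
    c = rgb / rgb.mean(axis=1, keepdims=True) - 1.0  # temporal normalization
    s1 = c[1] - c[2]                                  # G - B
    s2 = -2.0 * c[0] + c[1] + c[2]                    # -2R + G + B
    alpha = s1.std() / (s2.std() + 1e-12)
    p = s1 + alpha * s2
    return (p - p.mean()) / (p.std() + 1e-12)         # zero-centered, unit variance
```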
CHROM Method
The CHROM algorithm is implemented following the chrominance-based formulation introduced in [1]. Although POS and CHROM are historically presented with different projection definitions, both methods can be expressed within a unified linear chrominance-projection framework. In this work, a common projection notation is adopted to emphasize implementation consistency across rPPG methods rather than to imply algorithmic equivalence.
Unlike POS, which applies only per-window mean normalization prior to projection, CHROM introduces stronger illumination compensation. The RGB recording is first globally mean-normalized once over the entire signal. Subsequently, for each window, the RGB channels are zero-centered and standardized to equalize variances across channels, thereby improving robustness to illumination drift and large-scale color imbalance.
After applying the CHROM-specific chrominance projection and linear combination—expressed here using the unified notation of Equations (1)–(3)—each window is standardized, Hann-weighted, and reconstructed via overlap-add. The final waveform is then globally mean-centered and normalized to unit variance. While the mathematical expressions are written using the same symbols as POS for clarity, the defining distinction between the two methods lies in their normalization strategies, consistent with the original CHROM formulation.
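A per-window CHROM sketch, using the chrominance coefficients (3, −2) and (1.5, 1, −1.5) from the original publication; the per-window standardization mirrors the normalization strategy described above (the global mean normalization over the full recording is omitted for brevity):

```python
import numpy as np

def chrom_window(rgb):
    """CHROM pulse estimate for one window ((3, n) mean R, G, B traces).

    Channels are zero-centered and variance-equalized before the
    chrominance projection, reflecting CHROM's stronger normalization
    relative to POS.
    """
    c = (rgb - rgb.mean(axis=1, keepdims=True)) / (rgb.std(axis=1, keepdims=True) + 1e-12)
    r, g, b = c
    xs = 3.0 * r - 2.0 * g            # X chrominance component
    ys = 1.5 * r + g - 1.5 * b        # Y chrominance component
    alpha = xs.std() / (ys.std() + 1e-12)
    h = xs - alpha * ys
    return (h - h.mean()) / (h.std() + 1e-12)
```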
PBV Method
The PBV algorithm constructs a data-driven projection based on a predefined skin-tone direction [3]. For each window, RGB values are mean-normalized and the local covariance matrix Σ is computed with ridge regularization to ensure numerical stability. Specifically, a small constant ε is added to the diagonal of Σ prior to inversion. Let p_bv denote the normalized empirical skin-tone vector. The PBV projection vector is obtained as

z = (Σ + εI)⁻¹ p_bv / ‖(Σ + εI)⁻¹ p_bv‖. (4)

The rPPG waveform is then computed by projecting the normalized RGB segment C_n onto z:

p = zᵀ C_n. (5)
Each window is standardized, Hann-weighted, and combined using overlap-add. A third-order Butterworth bandpass filter (0.7–3 Hz) is applied to the reconstructed signal, which is then globally standardized.
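A per-window PBV sketch in NumPy. The skin-tone signature vector below is illustrative only (the true values are camera- and illumination-dependent), and the ridge constant is an assumed default:

```python
import numpy as np

def pbv_window(rgb, pbv=(0.33, 0.77, 0.53), eps=1e-6):
    """PBV pulse estimate for one window ((3, n) mean R, G, B traces).

    Solves z ∝ (Σ + eps·I)^{-1} p_bv on the mean-normalized window and
    projects the window onto z; `pbv` is an assumed skin-tone direction.
    """
    c = rgb / rgb.mean(axis=1, keepdims=True) - 1.0  # mean-normalize channels
    p = np.asarray(pbv, dtype=float)
    p = p / np.linalg.norm(p)
    cov = np.cov(c) + eps * np.eye(3)                 # ridge-regularized covariance
    z = np.linalg.solve(cov, p)
    z = z / np.linalg.norm(z)
    s = z @ c                                         # project onto z
    return (s - s.mean()) / (s.std() + 1e-12)
```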
Overlap-Add Reconstruction
All three methods employ the same reconstruction strategy. Let w(t) denote the Hann window and p_k(t) the normalized pulse estimate for window k. With H the hop size (50% of the window length), the final waveform is given by

h(t) = [Σ_k w(t − kH)·p_k(t − kH)] / [Σ_k w(t − kH) + ε], (6)

where ε is a small constant used to avoid division by zero in samples not fully covered by overlapping windows. The reconstructed rPPG signals are globally normalized prior to feature extraction.
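The Hann-weighted overlap-add reconstruction can be sketched as (illustrative NumPy version of the strategy described above):

```python
import numpy as np

def overlap_add(windows, hop, eps=1e-8):
    """Hann-weighted overlap-add of normalized window-level pulse estimates.

    `windows` is a (K, n) array of per-window waveforms spaced `hop`
    samples apart (50% overlap when hop = n // 2); the accumulated Hann
    weights renormalize the sum, with `eps` guarding samples that are
    not fully covered by overlapping windows.
    """
    k, n = windows.shape
    total = hop * (k - 1) + n
    out = np.zeros(total)
    wsum = np.zeros(total)
    hann = np.hanning(n)
    for i in range(k):
        s = i * hop
        out[s:s + n] += hann * windows[i]
        wsum[s:s + n] += hann
    return out / (wsum + eps)
```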
Green Channel
The green channel is widely used in rPPG because hemoglobin absorbs light most strongly in the green region of the visible spectrum [5]. As a result, pulsatile blood-volume changes produce the highest signal-to-noise ratio (SNR) in the green channel compared to the red or blue channels. This makes the green trace particularly effective for extracting cardiac-related oscillations, even under varying illumination. Therefore, in addition to the three rPPG methods, the extracted and filtered green-channel segments were also included as input to the machine-learning models.
3.3.2. NIR Dataset
For the NIR recordings, the videos are provided in single-channel grayscale format [13]. A frontal-face Haar cascade classifier [14] was applied to locate the facial region in each frame. Whenever a face was detected, the corresponding bounding box was extracted and used as the region of interest (ROI).
Unlike the RGB datasets, where a single mean value per color channel was computed, the grayscale face ROI in the NIR dataset was uniformly partitioned into a fixed 3 × 3 grid in pixel space, producing nine equally sized rectangular patches after ROI cropping. The patch dimensions were determined by integer division of the ROI height and width, ensuring that all patches have identical pixel size. For each patch, the mean pixel intensity was computed, resulting in a nine-dimensional feature vector that captures coarse spatial intensity variations across the face.
For every video, the nine-patch intensity vectors were concatenated over time, yielding a matrix of shape T × 9 (one nine-dimensional patch vector per frame), where T denotes the number of frames. No normalization was applied at the pixel or patch level prior to extraction. Instead, standardization was performed at the feature level as part of the preprocessing pipeline before applying PCA or ICA, ensuring comparable scaling across patch-based temporal signals while preserving their relative spatial intensity relationships.
These matrices were saved as .npy files to maintain consistency with the storage format used for the RGB datasets and to enable uniform downstream processing.
Post-Processing
As described in the Signal Preprocessing section, the selected PCA- or ICA-derived waveform was then subjected to the same preprocessing steps used for the RGB datasets. The signal was first amplitude-scaled to a normalized range and filtered using a third-order Butterworth bandpass filter (0.7–3 Hz) to isolate the cardiac frequency band. The filtered waveform was subsequently segmented into fixed-length windows using a 50% overlapping sliding window, ensuring consistency with the RGB preprocessing pipeline. This uniform post-processing procedure allows the NIR-derived rPPG signals to be directly comparable to those extracted from the RGB datasets.
3.4. Feature Extraction
For each segmented rPPG waveform, a total of fifteen handcrafted features were extracted. These features were computed independently for the green channel, CHROM, POS, and PBV signals, as well as the PCA- and ICA-derived NIR signals. The feature set included five time-domain descriptors, five frequency-domain descriptors, and five nonlinear dynamical features, described as follows.
(1) Time-Domain Features
Five standard statistical features were computed directly from the amplitude distribution of the signal segment:
Mean of the waveform.
Variance as a measure of amplitude dispersion.
Skewness, characterizing waveform asymmetry.
Kurtosis, describing the heaviness of the signal tails.
Lag-1 autocorrelation, computed by normalizing the autocorrelation of the zero-mean signal at a one-sample lag.
These metrics capture the fundamental statistical structure of the temporal waveform and its linear dependencies.
(2) Frequency-Domain Features
Frequency-domain characteristics were computed using the periodogram power spectral density (PSD). After normalizing the PSD to unit total power, the following features were extracted:
Dominant frequency, corresponding to the PSD peak.
Dominant power, i.e., the PSD magnitude at the dominant frequency.
Spectral centroid, the power-weighted mean frequency.
Spectral entropy, quantifying spectral flatness.
Spectral bandwidth, computed as the standard deviation of the spectrum around the centroid.
These features capture oscillatory behavior relevant to heart-rate estimation, including periodicity strength and spectral complexity.
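The five frequency-domain descriptors can be sketched as follows, using SciPy's periodogram (function and key names are illustrative; the small 1e-12 constants guard the log and division and are not part of the feature definitions):

```python
import numpy as np
from scipy.signal import periodogram

def spectral_features(x, fs):
    """Frequency-domain descriptors from a unit-power periodogram.

    Returns dominant frequency, dominant power, spectral centroid,
    spectral entropy, and spectral bandwidth for one rPPG segment.
    """
    f, psd = periodogram(x, fs=fs)
    psd = psd / (psd.sum() + 1e-12)       # normalize to unit total power
    i = int(np.argmax(psd))
    centroid = float(np.sum(f * psd))     # power-weighted mean frequency
    entropy = float(-np.sum(psd * np.log2(psd + 1e-12)))
    bandwidth = float(np.sqrt(np.sum(((f - centroid) ** 2) * psd)))
    return {"dom_freq": float(f[i]), "dom_power": float(psd[i]),
            "centroid": centroid, "entropy": entropy, "bandwidth": bandwidth}
```

For a clean pulse at 1.5 Hz (90 bpm), the dominant frequency lands on the 1.5 Hz bin and nearly all normalized power concentrates there.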
(3) Nonlinear Features
To characterize dynamical and complexity-related properties of the rPPG signal, five nonlinear features were extracted:
Hjorth activity, equivalent to the signal variance [19].
Hjorth mobility, describing the mean frequency of the signal based on first derivatives [19].
Hjorth complexity, measuring the change in frequency content over time [19].
Sample entropy, quantifying the irregularity and unpredictability of the waveform [20].
Permutation entropy, a complexity measure based on ordinal pattern statistics [21].
These nonlinear descriptors provide sensitivity to subtle changes in waveform morphology and dynamical structure beyond linear measures.
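The three Hjorth descriptors can be computed from first and second differences (a minimal NumPy sketch; sample entropy and permutation entropy are omitted for brevity, and the 1e-12 guards are assumptions):

```python
import numpy as np

def hjorth(x):
    """Hjorth activity, mobility, and complexity of a 1-D signal.

    Activity is the variance; mobility is the square root of the
    variance ratio of the first difference to the signal; complexity is
    the mobility of the first difference relative to that of the signal.
    """
    d1 = np.diff(x)
    d2 = np.diff(d1)
    activity = np.var(x)
    mobility = np.sqrt(np.var(d1) / (activity + 1e-12))
    complexity = np.sqrt(np.var(d2) / (np.var(d1) + 1e-12)) / (mobility + 1e-12)
    return activity, mobility, complexity
```

For a pure sinusoid, complexity is close to 1 (a single spectral line), which is why it is sensitive to waveform-morphology changes in noisier segments.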
(4) Standardization
For machine-learning models requiring scale-invariant inputs (SVR and Linear Regression), all features were standardized using z-score normalization computed from the training set only. The z-score of a feature value x was computed as

z = (x − μ)/σ, (7)

where μ and σ denote the mean and standard deviation of the feature within the training set. Feature scaling was not applied to tree-based models (Random Forest and XGBoost), which are inherently insensitive to feature magnitude.
3.5. Heart-Rate Label Construction
For each segmented video sample, a corresponding ground-truth heart-rate label was computed from the synchronized PPG signal. Because the four datasets provide ground-truth PPG in different formats, two approaches were used.
(1) Peak-Based Labeling for UBFC-rPPG Part 1, UBFC-rPPG Part 2, and IMVIA-NIR
For the first three datasets, the raw PPG waveforms do not include explicit annotations for systolic peaks. Therefore, peak detection was performed using the NeuroKit2 library [22], which provides a validated implementation of the Elgendi peak detection pipeline. Prior to peak detection, each PPG segment was band-pass filtered using a third-order zero-phase IIR Butterworth filter with cut-off frequencies of 0.5–8 Hz to suppress baseline wander and high-frequency noise while preserving cardiac pulsations. Systolic peaks were then identified using the Elgendi method, which employs adaptive thresholding based on moving-average envelopes of the squared signal, thereby reducing sensitivity to amplitude fluctuations and motion artifacts. Segments with insufficient or failed peak detections were retained, as peak-count–based estimation inherently reflects signal quality degradation; however, short detection gaps (on the order of a few frames) were negligible relative to the segment duration and did not materially affect heart-rate estimation.
Let N_p denote the number of PPG peaks in a segment and T the segment duration in minutes. The heart rate (in beats per minute) for that segment was computed as:

HR = N_p / T. (8)

This procedure yields an effective heart-rate estimate aligned with the same temporal window used for rPPG feature extraction.
It is worth noting that for short temporal windows, particularly the 10 s segments used in the UBFC datasets, peak-count–based heart-rate estimation is subject to quantization effects. In this setting, a single missed or spurious peak corresponds to a discrete HR change of approximately ±6 BPM, which can manifest as isolated error spikes. While alternative labeling strategies—such as computing HR from the average RR (or PP) interval—can reduce this quantization error when reliable inter-beat annotations are available, peak counting was intentionally adopted in this work to maintain methodological consistency across datasets and to reflect realistic short-window rPPG operating conditions. The impact of this quantization effect is therefore acknowledged and considered when interpreting results on short segments.
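The quantization effect described above follows directly from the peak-count labeling rule and can be illustrated with a few lines (`hr_from_peaks` is a hypothetical helper name):

```python
def hr_from_peaks(n_peaks, seg_seconds):
    """Per-segment heart rate in bpm from a raw peak count.

    Equivalent to HR = N_p / T with T expressed in minutes.
    """
    return n_peaks / (seg_seconds / 60.0)

# In a 10 s window, one missed or spurious peak shifts the label by
# 60 / 10 = 6 bpm, which explains the isolated error spikes on short
# segments discussed above.
step = hr_from_peaks(13, 10) - hr_from_peaks(12, 10)
```

A 20 s window (as used for IMVIA-NIR and VicarPPG-2) halves this step to 3 bpm, which is one reason longer segments yield smoother labels.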
(2) Labeling for the VicarPPG-2 Dataset
The VicarPPG-2 dataset includes PPG signals with pre-annotated systolic peak indicators provided for every sample [12]. In this case, peak detection was not required. Instead, the number of peaks inside each segmented PPG window was directly counted using the supplied annotations. The heart-rate label for each segment was then computed identically to the previous datasets using Equation (8).
(3) Summary
In all datasets, ground-truth labels were constructed by peak counting over the exact temporal extent of each segment, ensuring temporal alignment between the rPPG features and the reference heart-rate values. This peak-counting approach avoids reliance on precomputed dataset-level heart-rate values and allows per-segment labeling consistent with the segmentation scheme used throughout the pipeline.
3.6. Data Augmentation
Given the limited size of the UBFC-rPPG Part 1 and IMVIA-NIR datasets, data augmentation was performed by adding zero-mean Gaussian noise to each extracted rPPG segment [23]. The noise standard deviation was fixed to a constant value and applied uniformly across all segments, independent of the underlying signal variance. Feature extraction was then applied to the augmented segments using the same procedure as for the original data. The ground-truth labels remained unchanged, as they were derived directly from the reference sensor recordings.
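The augmentation step amounts to a few lines; the `sigma` and `seed` values below are illustrative assumptions (the paper fixes a constant noise standard deviation, but the exact value is not reproduced here):

```python
import numpy as np

def augment_segments(segments, sigma=0.05, seed=42):
    """Add zero-mean Gaussian noise to each rPPG segment.

    The same `sigma` is applied to every segment regardless of its
    variance, and labels are reused unchanged for the augmented copies,
    as described in the text.
    """
    rng = np.random.default_rng(seed)
    segments = np.asarray(segments, dtype=float)
    return segments + rng.normal(0.0, sigma, size=segments.shape)
```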
3.7. Machine-Learning Regression and Evaluation
The resulting feature set was divided into training and testing subsets using an 80/20 split. This split was implemented such that video segments originating from the same recording never appeared in both the training and testing sets, thereby preventing temporal leakage and ensuring a realistic evaluation of generalization performance. Four regression models were evaluated: Random Forest Regressor [6], XGBoost Regressor [7], Support Vector Regression (SVR) [8], and Linear Regression [9,10].
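The recording-level split can be sketched as follows (a minimal NumPy version; scikit-learn's GroupShuffleSplit offers equivalent functionality, and `grouped_split` is a hypothetical helper name):

```python
import numpy as np

def grouped_split(groups, test_frac=0.2, seed=42):
    """80/20 split at the recording level.

    `groups` assigns each segment to its source recording; unique
    recordings are shuffled and held out whole, so segments from one
    video never appear in both the training and testing sets.
    """
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    ids = rng.permutation(np.unique(groups))
    n_test = max(1, int(round(test_frac * len(ids))))
    test_ids = set(ids[:n_test].tolist())
    test_mask = np.array([g in test_ids for g in groups])
    return ~test_mask, test_mask  # boolean train/test masks over segments
```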
To ensure full repeatability of the benchmarking framework, all model hyperparameters, random seeds, and software versions were explicitly fixed and reported. The evaluated models were configured as follows:
Random Forest Regressor: n_estimators = 200, max_depth = None, random_state = 42, n_jobs = -1.
Support Vector Regression (SVR): RBF kernel with C = 10 and epsilon = 0.1. All remaining parameters were kept at scikit-learn default values, including gamma = “scale”.
Linear Regression: Default scikit-learn configuration.
XGBoost Regressor: n_estimators = 300, learning_rate = 0.05, max_depth = 5, random_state = 42, n_jobs = -1. All other parameters were left at XGBoost default values.
The random seed (random_state = 42) was fixed for all stochastic models to ensure deterministic behavior across runs.
Each model was trained independently for every rPPG extraction method, and performance was evaluated using Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R²).
To assess feature relevance, permutation feature importance (PFI) was computed for each trained model using repeated random permutations on a held-out test set. Feature importance was quantified as the mean change in prediction error (negative mean squared error) over multiple permutation repeats. Features exhibiting consistently negative importance—indicating that their permutation led to slight performance improvements—were interpreted as weak, noisy, or redundant and were removed in a controlled manner, after which models were retrained to assess performance changes.
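The permutation-importance procedure can be sketched model-agnostically as follows (a minimal NumPy version reporting the mean increase in test MSE; scikit-learn's `permutation_importance` provides an equivalent utility):

```python
import numpy as np

def permutation_importance(predict, X, y, repeats=10, seed=42):
    """Permutation feature importance as mean increase in held-out MSE.

    `predict` is any fitted model's prediction function; each column of
    the test matrix X is shuffled `repeats` times and the average rise
    in MSE over the unpermuted baseline is reported. Negative values
    flag weak, noisy, or redundant features, which are candidates for
    removal before retraining.
    """
    rng = np.random.default_rng(seed)
    base = np.mean((predict(X) - y) ** 2)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break the feature-target association
            imp[j] += np.mean((predict(Xp) - y) ** 2) - base
    return imp / repeats
```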
All experiments were conducted using the following software environment:
Python: 3.13.5
NumPy: 2.1.3
SciPy: 1.15.3
pandas: 2.2.3
scikit-learn: 1.6.1
XGBoost: 3.1.1
4. Results
The performance of all rPPG extraction methods, feature sets, and regression models was evaluated across the four datasets using MAE, RMSE, and R². To ensure fair and comparable evaluation, each dataset was processed using its native segmentation duration, while differences in frame rate were explicitly accounted for by reporting the corresponding segment lengths in samples. Specifically, 10-s segments were used for UBFC-rPPG Part 1 and Part 2, corresponding to 300 samples each. For the IMVIA-NIR dataset, 20-s segments were used, corresponding to 400 samples. For the VicarPPG-2 dataset, both 20-s and 25-s segments were evaluated, corresponding to 1200 and 1500 samples, respectively.
Table 2, Table 3, Table 4, Table 5 and Table 6 summarize the results obtained using both the full 15-feature set and the PFI-reduced subsets. Immediately after each table, we list the positive-mean PFI features in descending order for the CHROM method under the two best-performing models overall—Random Forest and XGBoost. These ranked feature lists correspond exactly to the “PFI-Reduced Features” columns shown in the tables and provide a transparent view of which features contributed most strongly to the improved performance obtained with permutation-based feature selection. Across datasets, clear performance differences emerged between rPPG methods and regression models, while permutation-based feature reduction influenced performance to varying degrees. The following tables present these quantitative comparisons in detail.
RF: Dominant Frequency, Skewness, Sample Entropy
XGB: Dominant Frequency, Skewness, Mean, Sample Entropy, Autocorrelation (lag = 1), Spectral Bandwidth, Dominant Power, Spectral Centroid
RF: Dominant Frequency, Dominant Power, Sample Entropy, Permutation Entropy, Spectral Entropy, Skewness, Mean, Kurtosis, Hjorth Activity
XGB: Dominant Frequency, Dominant Power, Spectral Centroid, Autocorrelation (lag = 1), Sample Entropy, Spectral Entropy, Variance, Hjorth Complexity, Skewness
RF: Skewness, Permutation Entropy, Mean, Sample Entropy, Variance
XGB: Skewness, Variance, Mean, Kurtosis, Dominant Power, Spectral Bandwidth, Sample Entropy
RF: Dominant Frequency, Spectral Centroid, Dominant Power, Kurtosis, Skewness, Hjorth Activity, Variance, Autocorrelation (lag = 1)
XGB: Dominant Frequency, Spectral Centroid, Skewness, Variance, Kurtosis, Dominant Power, Autocorrelation (lag = 1)
RF: Dominant Frequency, Spectral Centroid, Sample Entropy, Skewness, Autocorrelation (lag = 1), Hjorth Activity, Hjorth Mobility, Variance, Hjorth Complexity
XGB: Dominant Frequency, Spectral Centroid, Variance, Autocorrelation (lag = 1), Skewness, Spectral Entropy, Sample Entropy
5. Discussion
The results across all four datasets consistently demonstrate that the CHROM algorithm is the most reliable and robust rPPG extraction method in the context of classical machine-learning–based heart-rate regression. Across UBFC-rPPG Part 1, UBFC-rPPG Part 2, and VicarPPG-2, CHROM outperformed POS, PBV, and the standalone green-channel signal for nearly every model and evaluation metric. This is particularly evident for the tree-based regressors (Random Forest and XGBoost), where CHROM achieved the lowest MAE and RMSE and the highest R² values. The superior performance of CHROM can be attributed to its stronger illumination compensation and variance-normalization procedures, which stabilize the chrominance components under varying lighting and skin-tone conditions. These results are consistent with previous findings that attribute CHROM’s robustness to its built-in color-balance normalization, making it particularly effective on RGB recordings with realistic illumination variation.
In contrast to the RGB datasets, the IMVIA-NIR recordings required a different strategy for rPPG extraction due to their single-channel grayscale nature. Here, the PCA-based approach proved substantially more effective than ICA across all models. PCA combined with spatial patch decomposition leveraged the structured variation present in the grid of facial patches, consistently recovering components with strong cardiac periodicity. ICA, while theoretically capable of isolating independent sources, tended to introduce instability and noise amplification due to the limited dimensionality and the absence of color information. The clear superiority of PCA highlights the importance of spatial redundancy when working with NIR data, where color cues are absent and the pulsatile variations must instead be extracted from subtle spatial reflectance modulations.
Negative R² values indicate that a model performs worse than a constant predictor equal to the mean ground-truth heart rate, reflecting cases where the extracted rPPG signal contains insufficient physiological information for reliable estimation. The strongly negative R² values observed for some classical methods (e.g., Green, PBV, and POS on VicarPPG-2) are primarily caused by severe motion and illumination artifacts that degrade signal quality rather than by deficiencies in the regression models themselves. Since all methods were evaluated under identical conditions, relative performance comparisons remain meaningful even when absolute R² values are negative, and simple baselines (mean-HR and dominant-frequency HR) help contextualize when learning-based models provide benefit beyond trivial estimators.
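This convention can be made concrete with a toy example (the numbers are illustrative, not taken from the datasets): by construction, the mean-HR baseline scores exactly zero, and an estimator that is worse than it scores below zero.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

hr_true = np.array([62.0, 70.0, 78.0, 85.0, 91.0])       # ground-truth HR (bpm)
mean_baseline = np.full_like(hr_true, hr_true.mean())    # trivial mean-HR predictor
noisy_model = np.array([90.0, 55.0, 95.0, 60.0, 70.0])   # degraded rPPG-based estimate

r2_mean = r2_score(hr_true, mean_baseline)   # 0.0 by construction
r2_noisy = r2_score(hr_true, noisy_model)    # negative: worse than the mean baseline
```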
Despite these encouraging results, several challenges inherent to the datasets must be acknowledged. Most recordings—particularly UBFC-rPPG Part 1 and IMVIA-NIR—contain a very limited number of subjects and only a few minutes of video per participant. Even with segment-level augmentation based on Gaussian noise, the overall diversity of physiological states, lighting conditions, and motion patterns remains limited. This restricts the capacity of classical machine-learning models to generalize beyond the specific subjects and recording configurations present in each dataset. Furthermore, because we ensured strict separation of training and test segments by enforcing that no segments from the same video appear in both sets, cross-segment temporal leakage was eliminated at the cost of increased regression difficulty. This conservative split strategy likely contributed to the relatively low or sometimes negative R² values observed for several models, especially those constrained by linear assumptions (e.g., Linear Regression).
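A video-level split of this kind can be enforced with a grouped splitter. The following sketch uses placeholder segment data and scikit-learn's `GroupShuffleSplit`; it illustrates the leakage guarantee rather than reproducing the study's actual partitioning code.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical segment table: each segment carries the ID of its source video
video_ids = np.array(["v01"] * 6 + ["v02"] * 6 + ["v03"] * 6)
X = np.arange(len(video_ids)).reshape(-1, 1)  # placeholder features

gss = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=video_ids))

# No video contributes segments to both sides, eliminating cross-segment leakage
leakage_free = set(video_ids[train_idx]).isdisjoint(video_ids[test_idx])
```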
The limitations of classical machine-learning models themselves also played a significant role. Although Random Forest and XGBoost captured nonlinear dependencies effectively, they lack temporal modeling capabilities and must rely entirely on handcrafted features. Physiological signals such as rPPG waveforms contain rich temporal structure and nonlinear oscillatory patterns that cannot be fully represented by summary statistics alone, regardless of how carefully the features are curated. This limitation is particularly evident in datasets with more motion or illumination variability, where handcrafted features can fail to encode subtle temporal cues essential for stable pulse extraction.
Permutation Feature Importance (PFI) provided additional insight into which features most strongly supported accurate predictions. Across datasets, the dominant frequency, skewness, sample entropy, spectral centroid, and autocorrelation emerged repeatedly among the highest-ranked features for the CHROM method. These results suggest that both frequency-domain periodicity (e.g., dominant frequency, dominant power) and nonlinear dynamical descriptors (e.g., entropy measures, Hjorth parameters) play a central role in capturing the quality of the pulsatile signal extracted by CHROM and PCA. Moreover, removing negatively contributing features via PFI-based selection often improved performance, indicating that some handcrafted descriptors introduced noise rather than providing discriminative value.
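The PFI procedure itself is straightforward to reproduce with scikit-learn's `permutation_importance`. The sketch below uses synthetic data in which one feature mimics an informative dominant-frequency descriptor and the other is pure noise; feature names and data are hypothetical, and for simplicity importance is scored on the fitted data rather than a held-out split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
dom_freq = rng.uniform(0.8, 2.5, n)           # informative "dominant frequency" (Hz)
noise = rng.normal(size=n)                    # uninformative descriptor
X = np.column_stack([dom_freq, noise])
y = dom_freq * 60.0 + rng.normal(scale=2.0, size=n)  # HR label in bpm

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
# Shuffling the informative column collapses R², so its importance dominates;
# near-zero (or negative) scores flag features that could be pruned, as in the text
importances = result.importances_mean
```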
6. Conclusions
This study demonstrates that, despite their simplicity and interpretability, classical machine-learning pipelines exhibit inherent limitations when applied to physiological waveform regression—particularly under the small-sample conditions that characterize current rPPG datasets. Nevertheless, two consistent trends emerged across all evaluated datasets. In the RGB modality, CHROM consistently achieved the strongest performance, confirming the effectiveness of chrominance-based normalization for mitigating illumination variability and recovering robust pulse signals. In the NIR modality, the PCA-based approach substantially outperformed ICA, indicating that spatial patch decomposition combined with principal-component selection is a reliable strategy for extracting pulsatile information from grayscale recordings.
Beyond classical machine learning, the proposed framework provides a structured foundation for future methodological extensions. By standardizing ROI extraction, signal preprocessing, segmentation, and label generation across modalities, the pipeline can be directly extended to deep-learning architectures or hybrid algorithmic–learning approaches. In particular, algorithmically extracted rPPG signals (e.g., CHROM or PCA outputs) may serve as structured inputs to learning-based models, while permutation feature-importance analysis can guide feature selection or architectural design in data-limited settings.
Several promising directions emerge for future work. First, the NIR modality could be enriched by incorporating additional grayscale-specific rPPG extraction strategies, such as local reflectance-based or motion-aware methods. Second, more advanced data-augmentation techniques—spanning temporal perturbations and scenario-level variations—could be explored to improve generalization in small-sample regimes. Third, explicit temporal modeling, including lightweight autoregressive models or learned temporal encoders, may further exploit the sequential structure of rPPG signals beyond fixed-window feature extraction. Finally, multimodal fusion strategies combining RGB and NIR information at the feature or decision level represent a natural extension of the unimodal baselines established in this work.
The strict separation of training and testing segments adopted in this study further highlights the need for larger, more diverse rPPG datasets to support robust generalization for both classical and learning-based models. Overall, the results confirm that algorithmic rPPG methods combined with structured preprocessing constitute a strong and interpretable baseline for advancing future research in remote physiological monitoring.
Finally, the proposed benchmarking framework supports sustainability by improving the reliability and reproducibility of camera-based heart-rate monitoring. By reducing dependence on specialized medical hardware and enabling scalable, non-contact sensing, such standardized evaluation pipelines contribute to accessible health-monitoring solutions and broader deployment of early physiological risk-detection systems.
The practical utility of this framework is exemplified in two primary domains. First, in telehealth and remote patient monitoring, standardized evaluation ensures that rPPG-derived vitals achieve the clinical consistency necessary for diagnostic support, allowing healthcare providers to monitor chronic conditions via existing consumer devices. This software-centric approach reduces the environmental footprint associated with the manufacturing and disposal of short-lifecycle medical wearables. Second, in occupational safety and automotive monitoring, the framework facilitates the deployment of non-intrusive systems capable of detecting physiological stress or fatigue. By providing a validated, reproducible pathway for these technologies, the framework supports a sustainable health infrastructure that prioritizes proactive risk detection through pervasive, low-power sensing rather than resource-intensive clinical interventions.