CHROM-Y: Illumination-Adaptive Robust Remote Photoplethysmography Through 2D Chrominance–Luminance Fusion and Convolutional Neural Networks

Javidh, Mohammed; Shah, Ruchi; Uma, Mohan; Prabhu, Sethuramalingam; Jeyavathana, Rajendran Beaulah

doi:10.3390/signals6040072

by Mohammed Javidh ¹, Ruchi Shah ¹, Mohan Uma ¹,*, Sethuramalingam Prabhu ² and Rajendran Beaulah Jeyavathana

¹ Department of Computational Intelligence, SRM Institute of Science and Technology, Kattankulathur, Chennai 603203, India

² Department of Mechanical Engineering, SRM Institute of Science and Technology, Kattankulathur, Chennai 603203, India

* Author to whom correspondence should be addressed.

Signals 2025, 6(4), 72; https://doi.org/10.3390/signals6040072 (registering DOI)

Submission received: 27 October 2025 / Revised: 23 November 2025 / Accepted: 4 December 2025 / Published: 9 December 2025

Abstract

Remote photoplethysmography (rPPG) enables non-contact heart rate estimation but remains highly sensitive to illumination variation and dataset-dependent factors. This study proposes CHROM-Y, a robust 2D feature representation that combines chrominance (Ω, Φ) with luminance (Y) to improve physiological signal extraction under varying lighting conditions. The proposed features were evaluated using U-Net, ResNet-18, and VGG16 for heart rate estimation and waveform reconstruction on the UBFC-rPPG and BhRPPG datasets. On UBFC-rPPG, U-Net with CHROM-Y achieved the best performance with a Peak MAE of 3.62 bpm and RMSE of 6.67 bpm, while ablation experiments confirmed the importance of the Y-channel, showing degradation of up to 41.14% in MAE when removed. Although waveform reconstruction demonstrated low Pearson correlation, dominant frequency preservation enabled reliable frequency-based HR estimation. Cross-dataset evaluation revealed reduced generalization (MAE up to 13.33 bpm and RMSE up to 22.80 bpm), highlighting sensitivity to domain shifts. However, fine-tuning U-Net on BhRPPG produced consistent improvements across low, medium, and high illumination levels, with performance gains of 11.18–29.47% in MAE and 12.48–27.94% in RMSE, indicating improved adaptability to illumination variations.

Keywords:

rPPG; featurization; non-contact heartrate; 2D CNN; UBFC-rPPG dataset; chrominance feature

1. Introduction

Vital signs such as heart rate (HR), oxygen saturation, and respiratory rate (RR) are crucial parameters for monitoring an individual’s health status. Photoplethysmography (PPG) is a noninvasive technique used to capture clinical parameters such as oxygen saturation, heart rate, blood pressure, cardiac output, and respiration [1,2]. PPG is the fundamental technology behind pulse oximeters, which, apart from measuring oxygen levels, also output a periodic wave referred to as the PPG waveform [3]. This waveform is the light sensor’s reading of how much light is absorbed by body tissue (generally the finger or toe) under green light: the tissue absorbs a portion of the light and the rest is reflected to the sensor, whose reading over time forms the PPG waveform. The absorbed light is strongly associated with the cardiac, respiratory, and autonomic systems [3]. The PPG waveform is therefore not an arbitrary periodic wave; it holds valuable information on various health parameters. One notable property is its frequency spectrum, which matches the heartbeat frequency spectrum [4], so a simple Fourier transform of the PPG signal can reveal the subject’s heart rate. This match arises because the blood density, which alters light absorbance, varies over the cardiac cycle [5]. The PPG waveform can be used in many healthcare applications. Lu S et al. [6] demonstrated the association between ECG heart rate variability (HRV) and PPG heart rate variability. PPG of the finger and toe, together with ECG, can measure pulse transit time (PTT), from which blood pressure can be derived [7]. PPG applications are not limited to these health parameters; they also include sleep health, mental health and stress monitoring, diabetes management, pain assessment, anesthesia monitoring, exercise and sports monitoring, and vascular health assessment [8].

Remote photoplethysmography (rPPG) attempts to capture the reflected light with a consumer-grade camera, where ambient illumination is the major light source and the face is the subject. The face carries many blood vessels lying at shallow depths, which allow strong light absorbance and a good PPG yield that can be captured by a camera placed a few meters away [9]. The forehead and cheeks in particular provide a strong rPPG signal [9,10]. A video recording is a series of images (generally 30 images per second); each image is a 2D matrix of red, green, and blue values (referred to as RGB). When a live human face is recorded, subtle variations over time in the R, G, and B colors are observed. Various factors, including light absorbance, cause these color variations over a region of the face. The subtle color change in facial videos can be imagined as a superposition of waves, each caused by a different factor; rPPG is one of these waves. The primary objective of any rPPG-related project is to extract only the rPPG component from the raw signal. A few of the major wave components, such as those caused by the cardiac cycle and respiration, are referred to here as rPPG components. Components other than rPPG are stronger, and some of them are highly sensitive to small changes in the environment. Motion variation, illumination variation, and facial deformation are among the significant components of the reflected light [10], which makes the rPPG component comparatively weak. Components other than rPPG can be treated as noise, as they only degrade the quality of the extracted signal. Various techniques have been introduced to reduce this noise. One of the most widely used in the rPPG domain is the band-pass filter. The heart rate of a live human lies in the range of 0.75–4 Hz [5]. Each heartbeat creates a spike in the PPG waveform, which then gradually drops until the next beat [4]. Therefore, a band-pass filter of 0.75–4 Hz retains only the periodic portion of the signal within this frequency range. This technique eliminates a significant part of the noise but does not remove it entirely. Several algorithms have been introduced to reduce the noise further; they are discussed later in this paper.
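The link between the 0.75–4 Hz band and heart rate can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `estimate_hr_bpm`, the toy trace, and the choice of a plain FFT power spectrum are assumptions made only for illustration.

```python
import numpy as np

def estimate_hr_bpm(signal, fs=30.0, band=(0.75, 4.0)):
    """Return the dominant in-band frequency of `signal`, in beats per minute."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                               # remove the DC offset
    spectrum = np.abs(np.fft.rfft(x)) ** 2         # power spectrum
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)    # frequency of each bin in Hz
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    peak_hz = freqs[in_band][np.argmax(spectrum[in_band])]
    return 60.0 * peak_hz

# Toy trace (hypothetical): a 1.2 Hz pulse plus a stronger 0.1 Hz drift,
# 10 s sampled at 30 frames per second.
t = np.arange(300) / 30.0
trace = np.sin(2 * np.pi * 1.2 * t) + 2.0 * np.sin(2 * np.pi * 0.1 * t)
hr = estimate_hr_bpm(trace)
```

Although the drift in the toy trace has twice the amplitude of the pulse, it lies outside the band and is ignored, so the estimate follows the 1.2 Hz component (72 bpm).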

Simple statistical techniques might not be enough to extract rPPG components, which may follow a more sophisticated pattern. Deep learning (DL) methods are well suited to modeling complex non-linear associations [11], so DL models were the next step in rPPG signal extraction [12]. The central question for DL-based methods is how to represent the RGB color variation, that is, what the input to the model should be. One straightforward approach is to provide the raw ‘RGB color over time’ as the input, with the expectation that the model will capture the hidden pattern. However, there is a trade-off between input complexity and model size: when the input is simple, a larger model may be necessary to capture nuances, whereas more complex inputs can allow a smaller model to perform effectively [13]. Here, complex input refers to pre-processed input, in which part of the work has already been done. Because large datasets are not available, a smaller model with complex (pre-processed) inputs is the practical choice. The rPPG signal is a temporal feature; depending on the pre-processing, it can become a spatial–temporal feature, and CNNs are generally good at handling spatial–temporal features [14]. In this paper, we focus on CNNs and spatial–temporal features. We also propose a novel featurization technique that is robust against illumination artifacts. Illumination artifacts are by-products of noise from light sources, facial deformation, and facial motion, so a model that is robust against illumination artifacts is robust against varying light sources, facial deformation, and facial motion. Featurization refers to the process of pre-processing raw RGB traces and representing them in a format that the DL model accepts.

Ze Yang et al. [15] studied various deep learning-based rPPG models (DeepPhys [12], rPPGNet [16], and PhysNet [17]) and concluded that traditional methods (CHROM [18], POS [19], and GREEN [10]) performed better than DL methods under different illumination conditions. This study indicates that existing DL models fail to capture illumination-related features. These DL methods are trained with raw signals, with the expectation that the model will learn all the sophisticated patterns; no high-level featurization is performed. Traditional methods, in contrast, perform pre-processing operations based on assumptions drawn from domain knowledge; in particular, they assume that some linear transformation of RGB reveals the rPPG signal. We attempted to address these limitations of existing methods by introducing CHROM-Y, a novel rPPG feature image built on top of the CHROM method. We trained and tested ResNet-18 [13], VGG [20], and U-Net [21] on the UBFC-rPPG dataset [22] using our feature image.

CNNs can process 1D, 2D, and even 3D inputs, and the convolution operation enables them to capture spatial features in those inputs. For CNNs applied to rPPG problems, a core design question is how to represent color signals as 2D or 3D images. We focused only on 2D image representations; the same architecture can perform better with a different 2D image representation. ‘PPG Maps’ [23] and ‘MST Maps’ [24] are examples of such 2D representations. Some portions of a face give a good rPPG yield, but this varies from subject to subject [25]. Most of the good ROIs lie within the cheeks and forehead; therefore, we chose the cheeks and forehead collectively as a single ROI.

In the pre-processing phase, ROI detection is a computationally expensive task. Mediapipe [26] is widely regarded as an effective framework for real-time region of interest (ROI) detection on faces and other body parts, owing to its optimized design and efficient processing, and we used it for ROI processing for that reason. Clinical applications of rPPG sometimes demand real-time results, in which case a computationally heavy process is a setback; it is therefore important to keep processing time low and every module in the pipeline lightweight. In this paper, we did not analyze processing speed. Han et al. [27] proposed a three-stage method to compress deep learning models to fit into embedded systems. This could serve as the basis for a new study assessing the impact of rPPG model compression on processing efficiency and real-time performance, particularly in clinical applications where computational resources are often limited. By reducing the computational load while maintaining accuracy, compression can facilitate real-time health monitoring on devices such as mobile phones or embedded systems. Future studies should explore the trade-offs between compression rate, computational efficiency, and the accuracy of rPPG signal extraction to ensure robust performance in clinical and consumer-grade applications.

Our proposed method addresses some of the current limitations in rPPG analysis, particularly the challenges posed by environmental factors such as variable lighting conditions. By introducing a novel feature extraction technique, we aim to improve the accuracy and robustness of rPPG signal extraction even under non-ideal conditions. Future work will explore the integration of compressed models to optimize processing efficiency and to enable implementation on embedded systems, making our approach more suitable for real-time clinical applications and expanding the usability of rPPG technology in various healthcare settings.

Related Works

Remote photoplethysmography (rPPG) aims to capture the subtle color variation in reflected light with a consumer-grade camera, but the reflected light consists of multiple components caused by various factors. The dichromatic reflection model (DRM) [28] provides a framework for understanding how light interacts with complex surfaces like human skin. The fundamental idea of the DRM is that reflected light is composed of two major components, a diffuse component and a specular component, each of which can be further broken down into several sub-components. In the context of skin, a portion of the light penetrates the skin and interacts with melanin, hemoglobin, and other constituents. These interactions cause scattering and absorption, and the light that re-emerges is the diffuse component. The light that does not penetrate bounces back from the skin’s surface and is called the specular component. The smoothness, hydration, and oil content of the skin affect the intensity and appearance of the specular reflection. Different wavelengths of light interact differently with the skin’s deeper layers; for example, the green channel shows the highest absorbance due to its sensitivity to hemoglobin. As a result, subtracting the green channel from the other color channels, or taking some linear combination of color channels, can effectively minimize the specular component and isolate the diffuse reflection, where the rPPG information lies. Several early studies were based on this idea [10,18,29].

Verkruysse et al. [10] exploited the fact that the green channel holds strong rPPG components; they eliminated non-periodic components with Butterworth filters and then performed a spectral analysis to extract HR and RR. Poh et al. [29] proposed Independent Component Analysis (ICA) of all three color channels, and this work was further improved with detrending filters in [30]. ICA-based methods assume that the sources of the reflected components are statistically independent, which might not always hold, especially under extreme noise conditions. Similarly, Lewandowska et al. [31] used principal component analysis (PCA) to eliminate noise. Blind Source Separation (BSS)-based methods are not good at eliminating periodic motion artifacts [18]. The human heart rate lies in the range of 0.75–4 Hz (or 45–240 bpm) [5], and the frequency of the PPG waveform is the same as the heartbeat frequency [4]. A band-pass filter with a pass band of 0.75–4 Hz removes non-periodic and implausible components, but it does not eliminate noise entirely; band-pass filters are used in [10,18,29,30,31]. Spatial averaging is the most widely used technique to suppress small motion artifacts: compared to a single pixel value, the mean value over a group of pixels significantly improves the SNR [10]. The SNR may improve further when pixels are chosen from the skin region or ROIs (the sources of rPPG). Spatial averaging, or average pooling, suppresses motion artifacts and facial deformation within an ROI; the larger the ROI’s area, the better the noise cancelation.

De Haan et al. [18] proposed the CHROM algorithm, which uses two orthogonal chrominance signals (two different linear combinations of RGB) that effectively extract rPPG components. The algorithm also addresses non-white illumination by standardizing skin tones using normalized RGB values based on a fixed reference, ensuring consistency across lighting conditions; this is achieved by normalizing each channel and applying standardized weights. Those two chrominances were then divided, leaving behind linear combinations of RGB. In this paper, we used this algorithm for featurization.

Conventional methods for rPPG can be broadly divided into two categories: single-color methods and multi-color methods. Because of its high sensitivity to blood volume changes, single-color methods typically focus on the green channel. These methods employ signal processing techniques such as detrending, smoothing, and band-pass filtering to extract physiological signals, as seen in [10,32,33]. In contrast, multi-color methods leverage multiple color channels, often all three (red, green, and blue), to enhance accuracy and noise reduction. Some approaches within this category use BSS for noise elimination, with [29,30] utilizing ICA and [31] implementing PCA. Additionally, studies such as [18,34] propose combining multiple color channels linearly to improve signal extraction.

Deep learning models are good at generalization [11]. Subtle color variation within the pixels of an ROI is the only useful information for rPPG signal extraction. Featurization is the process of converting raw data into a structured format that can be effectively utilized by deep learning models. This transformation is crucial because handling unprocessed inputs often demands complex models, which require extensive datasets for effective training. Large-scale data is particularly important for complex deep learning networks to learn meaningful patterns and achieve high performance [35]. Models with fewer parameters cannot capture sophisticated patterns hidden in the raw input, and this problem can be mitigated with pre-processed features. RhythmNet [36] proposed a 2D spatial–temporal feature image on YUV color channels; the authors tested different color spaces, including RGB, HSV, and YCrCb, and concluded that YUV performed best, with the lowest RMSE, as it retained rPPG information more effectively than the other color spaces. Temporal features (raw YUV signals) from multiple non-overlapping ROIs were stacked to form an image of size (n, m, 3), in which each row corresponds to one ROI. J. Wu et al. [23] proposed a “Multi-scale spatial–temporal representation of PPG”. Their feature image was similar to that of [36], but unlike RhythmNet, they utilized all possible sets of ROIs, and each row of their 2D image corresponded to some non-null subset of ROIs. X. Niu et al. [24] used a similar approach but included two color spaces (RGB and YUV) in their feature image. Both refs. [23,24] build their feature images from raw RGB and YUV colors, expecting the CNN to discover hidden patterns. R. Song et al. [37] constructed a feature image in which each row is a pre-processed signal from a different, overlapping time interval. Instead of raw color traces, they pre-processed the raw signals with the CHROM [18] algorithm. Overlapping intervals are effective in minimizing the impact of brief, high-intensity noise: even if one interval is heavily affected by noise, the other intervals can still offer reliable information, ensuring more accurate overall results [38]. Although refs. [23,24] focus on face liveness detection, similar methods are also found in heart rate estimation tasks in [39,40]. References [12,16,17] utilized attention mechanisms with CNN, LSTM, and RNN architectures. Deep learning methods can thus be broadly classified into two categories: image-based [23,24,36,37,40] and attention-based [12,16,17].

The rPPG waveform contains information on many vital parameters such as HR, RR, and SpO2. Different DL models can be trained to predict each parameter separately, but the optimal approach is to have one DL model for rPPG signal reconstruction and lightweight models to predict all possible parameters from it. M. H. Chowdhury et al. [41] used an encoder–decoder model for rPPG waveform estimation, and heart rate was estimated from the peak value of the Power Spectral Density (PSD). Pearson’s Correlation Coefficient (PCC), Dynamic Time Warping Distance (DTW), and RMSE were used to analyze the model performance. For a heart rate estimation task, mean absolute error (MAE) is enough to analyze model performance, but for a waveform estimation task, wave-similarity metrics are appropriate. M. Das et al. [42] utilized a scalogram of RGB traces as a feature image to estimate the waveform. In rPPG, the wavelet transform decomposes signals into multiple scales, providing both time and frequency information; it is utilized in [42,43,44].

Early approaches utilized color channels and statistical tools such as ICA and PCA to capture blood volume changes, focusing on minimizing noise and isolating diffuse reflections. However, these methods had limitations under complex noise conditions. Modern deep learning models, such as those using spatial–temporal features and attention mechanisms, have shown better performance. B.-F. Wu et al. [45] analyzed the CHROM algorithm under fluctuating brightness and concluded that sudden illumination fluctuation degrades SNR and increases the error in HR estimation. Z. Yang et al. [15] investigated conventional methods and deep learning models under different illumination; the performance of deep learning models was poorer than that of conventional methods under variable lighting conditions.
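The evaluation metrics named above can be written down compactly. The following Python sketch uses their standard definitions; the function names and the toy signals are illustrative assumptions and do not come from the cited works.

```python
import numpy as np

def mae(pred, true):
    """Mean absolute error."""
    diff = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
    return float(np.mean(np.abs(diff)))

def rmse(pred, true):
    """Root mean squared error."""
    diff = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def pearson(a, b):
    """Pearson's correlation coefficient between two equal-length waveforms."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))

# Toy waveforms (hypothetical): same 1.2 Hz frequency, quarter-period phase shift.
t = np.arange(128) / 30.0
reference = np.sin(2 * np.pi * 1.2 * t)
shifted = np.sin(2 * np.pi * 1.2 * t + np.pi / 2)
pcc_same = pearson(reference, reference)
pcc_shifted = pearson(reference, shifted)
```

In this toy case the shifted waveform has the same dominant frequency as the reference, yet its Pearson correlation is close to zero. This is one reason why wave-similarity metrics and frequency-based HR error can give different pictures of the same model.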

2. Proposed Methodology

rPPG Features

Under good lighting conditions, the light reflected off the skin is composed of various components caused by multiple factors. The rPPG waveform is one of these components, as expressed in (1), where [rppg] represents the rPPG component and [\hat{x}] represents all other components or noise. The rPPG waveform can be effectively reconstructed from the rPPG components and can therefore be represented as a function of [rppg] (2). As we are using an RGB camera (Logitech C920 HD Pro (LOGITECH ASIA PACIFIC Ltd., Hong Kong, China)) for video recording, the rPPG component is embedded in the RGB channels. Therefore, the rPPG waveform can be represented as a function of the RGB color channels (3).

reflected_light = f ([rppg], [\hat{x}])

(1)

rPPG_wave = f([rppg])

(2)

rPPG_wave = f(R, G, B)

(3)

In the CHROM algorithm, two chrominances, “X = R − G and Y = 0.5R + 0.5G − B”, were utilized [18]. X and Y can effectively eliminate components that are similar in all color channels; most of the artifacts are similar across all channels, whereas the rPPG components are strong in green and weak in the other colors. X and Y therefore effectively capture [rppg]. The authors of [18] also proposed the correction factors Ω and Φ, which ensure consistency under non-white light sources ((4) and (5)). Therefore, Equation (3) can be simplified to a function of the variables Ω and Φ (6).

Ω = 3R − 2G

(4)

Φ = 1.5R + G − 1.5B

(5)

rPPG_wave = f(Ω, Φ)

(6)

The simplified function (6) with Ω and Φ as independent variables is robust against motion artifacts, colored light sources, and different skin tones. Thus, the pair (Ω, Φ) is a strong feature. The intensity of each RGB color channel is directly proportional to the reflected brightness, represented as the Y component in the YCrCb color space. Each channel can thus be modeled as a function of brightness (Y) and additional components ‘[ŷ]’ (7), while the chrominance components remain independent of brightness [18]. We hypothesize that Ω and Φ require the inclusion of the brightness (Y) component to be complete.

C_i = f(Y, [ŷ])

(7)

Fluctuations in brightness contribute to errors in heart rate prediction [45]. Deep learning models relying on raw RGB signals struggle under varying illumination conditions [15], supporting our assertion. Therefore, the features (R, G, B) and (Ω, Φ) are not resilient against varying brightness conditions. To address the issue, we propose a feature ‘F’ as some function of (Ω, Φ, Y) (in (8) and (9)).

F = f(Ω, Φ, Y)

(8)

rPPG_wave = f(F)

(9)

(i) Feature Image:

In this study, CNN models are employed for heart rate estimation, requiring an effective image representation of F in (8). The forehead and both cheeks are selected as regions of interest (ROIs) due to their suitability as sources of PPG signals [9]. The Mediapipe framework in Python 3.10 is utilized for ROI mask extraction. When a video is treated as a sequence of images, the mean values of the R, G, B, and Y color channels are computed within the defined ROIs for each image.

V_{t}^{C} = \frac{1}{N} \sum_{r \in \mathrm{ROI}} \sum_{(i, j) \in r} C_{i, j, t}

(10)

Equation (10) computes the mean intensity value of a specific color channel C ∈ (R, G, B, Y) for the tth frame within a given region of interest (ROI). Here, N represents the total number of pixels in all ROIs, and summation is performed over each ROI ‘r’. For each ROI, the formula sums the intensity values of color C at each pixel (i, j). The result is then normalized by N to obtain the mean intensity. This method is commonly used in applications like remote photoplethysmography (rPPG) to analyze color variations in the skin region over time. A sequence of 128 consecutive frames is used for constructing a single feature image. For each C ∈ (R, G, B, Y), an array of length 128 is obtained, and a band-pass filter with a window of (0.75–4 Hz) is applied, followed by division by the mean values of the array.
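The spatial averaging in Equation (10) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `roi_channel_means` is hypothetical, and a single static boolean mask covering all ROIs is assumed, whereas a real pipeline would recompute the mask for every frame.

```python
import numpy as np

def roi_channel_means(frames, roi_mask):
    """Mean R, G, B over all ROI pixels of each frame, as in Eq. (10).

    frames:   (T, H, W, 3) RGB video.
    roi_mask: (H, W) boolean array, True inside any ROI (assumed static here).
    Returns a (T, 3) array of per-frame channel means.
    """
    frames = np.asarray(frames, dtype=float)
    mask = np.asarray(roi_mask, dtype=bool)
    n_pixels = mask.sum()                      # N in Eq. (10)
    if n_pixels == 0:
        raise ValueError("ROI mask is empty")
    # Boolean indexing flattens the ROI pixels: shape (T, N, 3).
    return frames[:, mask, :].sum(axis=1) / n_pixels

# Toy video (hypothetical): two 4 x 4 frames, ROI in the top-left 2 x 2 corner.
frames = np.zeros((2, 4, 4, 3))
frames[1, :2, :2] = [30.0, 60.0, 90.0]
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
means = roi_channel_means(frames, mask)
```

Each column of the result is one color trace over time; a window of 128 consecutive rows gives the arrays that are filtered and normalized next.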

C_{j} = \mathrm{BandPass}_{0.75\text{–}4\,\mathrm{Hz}} \left( (C_{j})_{r} \right), \quad C \in \{R, G, B\}

(11)

C_{j} = \frac{C_{j}}{\mu (C_{j})}

(12)

\Omega = \mathrm{MinMax}_{0\text{–}1} (3 R - 2 G)

(13)

\Phi = \mathrm{MinMax}_{0\text{–}1} (1.5 R + G - 1.5 B)

(14)

Y = \mathrm{MinMax}_{0\text{–}1} (0.299 R + 0.587 G + 0.114 B)

(15)
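Equations (11)–(15) can be illustrated with the following Python sketch, which rests on several stated assumptions. The band-pass filter is an ideal FFT mask rather than whatever filter design the authors used, and the division in Equation (12) is taken over the mean of the raw trace, because a band-passed trace has a mean close to zero. All function names are hypothetical.

```python
import numpy as np

def bandpass_fft(x, fs=30.0, low=0.75, high=4.0):
    """Ideal band-pass (an assumption): zero every FFT bin outside [low, high] Hz."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(spec, n=len(x))

def minmax01(x):
    """Scale an array to the range [0, 1]; a constant array maps to zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

def chrom_y_signals(rgb, fs=30.0):
    """rgb: (T, 3) ROI-mean traces. Returns (omega, phi, y), each in [0, 1]."""
    rgb = np.asarray(rgb, dtype=float)
    raw_mean = rgb.mean(axis=0)        # assumption: mean of the raw, unfiltered trace
    filtered = np.stack([bandpass_fft(rgb[:, c], fs) for c in range(3)], axis=1)
    r, g, b = (filtered / raw_mean).T  # Eq. (11) followed by Eq. (12)
    omega = minmax01(3.0 * r - 2.0 * g)                # Eq. (13)
    phi = minmax01(1.5 * r + g - 1.5 * b)              # Eq. (14)
    y = minmax01(0.299 * r + 0.587 * g + 0.114 * b)    # Eq. (15)
    return omega, phi, y

# Toy traces (hypothetical): a 1.5 Hz pulse riding on constant skin color.
t = np.arange(128) / 30.0
pulse = np.sin(2 * np.pi * 1.5 * t)
rgb = np.stack([100 + 0.5 * pulse, 120 + 1.0 * pulse, 90 + 0.3 * pulse], axis=1)
omega, phi, y = chrom_y_signals(rgb)
```

The three returned arrays have length 128 and are the signals that get unfolded into the three image channels.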

The process begins with three arrays, Ω, Φ, and Y, each of length 128, which are individually normalized to the range [0–1]. For each array, a 2D image of size 64 × 64 is created by mapping the values in the array to the rows of the image: the first row corresponds to indices [0 to 64], the second row to indices [1 to 65], and the jth row corresponds to values from indices [j to j + 64]. This process generates three separate images, one for each signal (Ω, Φ, and Y), and these images are then merged to form a single composite image of size 3 × 64 × 64, as described by Equation (16). Figure 1 explains the overall process.

I_{k} = \begin{pmatrix} S_{k, 0} & S_{k, 1} & \dots & S_{k, 64} \\ S_{k, 1} & S_{k, 2} & \dots & S_{k, 65} \\ \vdots & \vdots & \ddots & \vdots \\ S_{k, j} & S_{k, j + 1} & \dots & S_{k, j + 64} \\ \vdots & \vdots & \ddots & \vdots \\ S_{k, 64} & S_{k, 65} & \dots & S_{k, 127} \end{pmatrix}

(16)
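The row-shifted mapping of Equation (16) can be sketched in Python as follows. To obtain exactly 64 × 64 values, this sketch assumes half-open index ranges, so that row j holds samples j to j + 63 for j = 0, …, 63; this reading and the function names are assumptions, not something the text states explicitly.

```python
import numpy as np

def signal_to_image(signal, size=64):
    """Row j holds samples j .. j + size - 1 of a 1D signal (assumed indexing)."""
    s = np.asarray(signal, dtype=float)
    if s.size < 2 * size - 1:
        raise ValueError("signal too short for a size x size image")
    return np.stack([s[j:j + size] for j in range(size)])

def chrom_y_image(omega, phi, y, size=64):
    """Stack the three per-signal images into one (3, size, size) array."""
    return np.stack([signal_to_image(ch, size) for ch in (omega, phi, y)])

# A ramp makes the layout easy to inspect: entry (row, col) equals row + col.
ramp = np.arange(128)
image = chrom_y_image(ramp, ramp, ramp)
```

Because each row is the previous row shifted by one sample, values are constant along anti-diagonals, which is what turns a periodic signal into a diagonal stripe pattern that a 2D CNN can pick up.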

Figure 1 showcases the CHROM-Y feature image generation pipeline, which forms the core of the proposed method. It begins with the identification of regions of interest (ROIs), specifically the forehead and cheeks, using MediaPipe version 0.10.20. These regions are chosen due to their suitability for capturing robust rPPG signals. Chrominance signals (Ω, Φ) are extracted using the CHROM algorithm, and the luminance signal (Y) is derived from the YCrCb color space. These signals undergo band-pass filtering (0.75–4 Hz) to focus on the frequency range corresponding to heart rate. The signals are then normalized and converted into a 2D representation by mapping time-domain data to spatial dimensions, forming a 3-channel (Ω, Φ, Y) image. This spatial–temporal image serves as input for deep learning models to predict heart rate and reconstruct the rPPG waveform.

(ii) Sampling:

To construct a single feature image, 128 consecutive frames from a video are required. In this study, feature images were created from windows of 128 successive frames taken with a sliding window with a stride of 2 frames. This approach captures the temporal dynamics while reducing redundancy between adjacent samples. For a video consisting of F frames, this results in \left\lfloor \frac{F - 128}{2} \right\rfloor + 1 unique feature images, where each new image corresponds to a window shifted by 2 frames from the previous one. Figure 2 illustrates this process.
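As a quick check of the window count, the following Python sketch enumerates the window start indices directly. The function names and the 1800-frame example (one minute at 30 fps) are illustrative assumptions.

```python
def count_windows(num_frames, window=128, stride=2):
    """Number of full windows of `window` frames at the given stride."""
    if num_frames < window:
        return 0
    return (num_frames - window) // stride + 1

def window_starts(num_frames, window=128, stride=2):
    """Start index of every full window."""
    return list(range(0, num_frames - window + 1, stride))

# Hypothetical one-minute video at 30 frames per second.
n_images = count_windows(1800)
starts = window_starts(1800)
```

For 1800 frames this gives 837 windows, the last of which starts at frame 1672 and ends at frame 1799.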

Figure 2 explains the sampling process for feature image generation using a sliding window technique. A sequence of 128 consecutive frames from a video is used to construct a single feature image, capturing temporal information. By shifting the window by two frames, a new feature image is created, maximizing the temporal data available for model training. This method ensures better inclusion of temporal patterns within the dataset.

3. Training and Evaluation

In this section, the proposed features are trained and tested with several CNN models on different tasks. We trained ResNet-18, VGG-16, and U-Net models for heart rate prediction and rPPG waveform estimation using the publicly available UBFC-rPPG dataset [22]. The experiments are split into two parts, one for each task. The dataset consists of two sub-datasets: one includes controlled motions and the other includes realistic motions, as a real-life scenario involves natural head movements. We used both sub-datasets for training and testing. The dataset comprises 50 subjects/videos of around 1 min each. The videos were captured using a Logitech C920 HD Pro (LOGITECH ASIA PACIFIC Ltd., Hong Kong, China) webcam at a resolution of 640 × 480 and a frame rate of 30 fps, recorded in an uncompressed 8-bit RGB format. Ground truth PPG data, including both the PPG waveform and heart rates, were collected using a CMS50E transmissive pulse oximeter (Qinhuangdao Contec Xinjia Medical Technology Co., Ltd., Qinhuangdao, China). Data preprocessing and training were carried out separately on a T4 GPU in the Google Colab environment, and preprocessing generated around 35k CHROM-Y feature images. For 2.17% of the images, the ground truth values were defective or not recorded properly; these samples were removed to avoid outliers. The ground truth PPG waveform was recorded at different sample rates. Linear interpolation was applied to resample both the PPG waveform and the corresponding HR signal to a uniform sampling rate of 30 samples per second, matching the video frame rate. This interpolation ensured alignment between the video frames and the physiological signals. Consequently, for each feature image constructed from 128 frames, a corresponding ground truth PPG waveform of length 128 and 128 HR samples were obtained. The average of these HR samples was then used as the consolidated heart rate value associated with that image.
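The resampling and labeling step can be sketched with NumPy's linear interpolation. The function name `resample_to_frames` and the toy oximeter readings are assumptions made for illustration; the source does not specify how timestamps were stored.

```python
import numpy as np

def resample_to_frames(sample_times, values, num_frames, fps=30.0):
    """Linearly interpolate ground-truth samples onto the video frame times."""
    frame_times = np.arange(num_frames) / fps
    return np.interp(frame_times, sample_times, values)

# Toy ground truth (hypothetical): HR readings at 0 s, 1 s and 2 s.
hr_at_frames = resample_to_frames([0.0, 1.0, 2.0], [60.0, 90.0, 60.0], num_frames=61)

# Label for one feature image: the mean HR over its frames
# (a 31-frame window is used here only because the toy signal is short).
window_label = hr_at_frames[0:31].mean()
```

With real data the same call would be applied to both the PPG waveform and the HR signal, and each 128-frame window would take the mean of its 128 resampled HR values as its label.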

3.1. Dataset Processing

Among the 50 subjects of the UBFC-rPPG dataset, 42 were randomly selected for training and the remaining 8 were reserved for testing, resulting in approximately 35,000 feature samples for the training set and 5000 feature samples for the testing set. Splitting by subject ensures that there is no data leakage between the two sets. Testing was performed in several small batches of unseen data, with each batch consisting of 100 images. For all the training batches, Kernel Density Estimation (KDE) was plotted to analyze the model performance under all possible conditions. This shows the performance of a model even in the presence of outliers, which cannot be neglected for clinical applications, and gives a range of errors together with their likelihood. The peak value of the KDE distribution was considered the consolidated performance error of the model.

3.2. Synthetic Data Generation

To reduce overfitting and improve the generalization capability of the proposed models, a synthetic data generation framework was adopted. The deep learning architectures employed in this work (ResNet18, VGG16, and U-Net) are relatively heavy and require a large volume of training samples to learn reliable feature representations. However, the UBFC-rPPG dataset provides data from a limited number of subjects, which restricts the diversity of training samples. To compensate for this limitation, synthetic rPPG signals and their corresponding feature images were generated and used during the pretraining phase. The synthetic pulse waveform was designed to mimic physiological pulsatile behavior and was modeled as a combination of sinusoidal components:

s(t) = A sin(2π f t + φ) + 0.2A sin(4π f t + φ/2) + ε(t)

(17)

where A denotes amplitude (varied from 0.6 to 1.2), f represents the heart rate in Hz (derived from randomly sampled BPM values between 55 and 130), ϕ is a random phase offset, and ϵ(t) represents additive Gaussian noise. The signal was sampled at 30 frames per second, producing 128 temporal samples per window (approximately 4.27 s), consistent with the acquisition rate of the UBFC-rPPG dataset. In addition to the pulse signal, a synthetic illumination variation signal was generated to simulate slow lighting fluctuations and sensor drift. This illumination trace consisted of low-frequency sinusoidal components combined with cumulative random noise, providing gradual intensity variations across the temporal window.
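A minimal sketch of Equation (17) and of the illumination trace is given below. The amplitude range, BPM range, sampling rate, and window length follow the text; the noise standard deviation, the slow-frequency range, the drift step size, and the function names are assumptions made for illustration and are not taken from the source.

```python
import math
import random

FPS = 30          # sampling rate stated in the text
N_SAMPLES = 128   # samples per window

def synthetic_pulse(rng, noise_std=0.05):
    """Generate one window of Eq. (17); noise_std is an assumed value."""
    A = rng.uniform(0.6, 1.2)               # amplitude range from the text
    bpm = rng.uniform(55, 130)              # heart-rate range from the text
    f = bpm / 60.0                          # heart rate in Hz
    phi = rng.uniform(0.0, 2.0 * math.pi)   # random phase offset
    signal = []
    for n in range(N_SAMPLES):
        t = n / FPS
        s = (A * math.sin(2 * math.pi * f * t + phi)
             + 0.2 * A * math.sin(4 * math.pi * f * t + phi / 2)
             + rng.gauss(0.0, noise_std))
        signal.append(s)
    return signal, bpm

def synthetic_illumination(rng, drift_std=0.01):
    """Slow sinusoid plus cumulative random noise; all parameters are assumed."""
    f_slow = rng.uniform(0.05, 0.3)         # assumed low-frequency band in Hz
    phase = rng.uniform(0.0, 2.0 * math.pi)
    drift, trace = 0.0, []
    for n in range(N_SAMPLES):
        drift += rng.gauss(0.0, drift_std)  # random walk models sensor drift
        trace.append(math.sin(2 * math.pi * f_slow * n / FPS + phase) + drift)
    return trace
```

The second sinusoid in `synthetic_pulse` is the first harmonic at twice the heart rate with 20% of the fundamental amplitude, which is what gives the waveform its asymmetric, pulse-like shape.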

Each synthetic feature sample was represented as a 64 × 64 × 3 image, where the three channels were constructed as follows: Channel 1 (Pulse channel) was generated directly from the synthetic pulse signal and transformed into a 2D representation using a sliding window approach, where consecutive segments of length 64 were stacked row-wise. Channel 2 (Mixed channel) was formed by linearly combining the pulse and illumination signals using a weighted sum (0.7 × pulse + 0.3 × illumination) in an attempt to mimic interference among rPPG components. Channel 3 (Illumination channel) was derived solely from the illumination trace and similarly converted into a 2D image using the same sliding window mapping. Figure 3 shows a few samples of generated synthetic images.
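The channel construction can be illustrated with the sketch below. The source does not state the sliding-window stride; a stride of one sample is assumed here because it maps a 128-sample signal onto exactly 64 rows of length 64. The helper names are hypothetical.

```python
def to_image_channel(signal, size=64, stride=1):
    """Stack `size` sliding segments of length `size` row-wise (stride is assumed)."""
    rows = [signal[i * stride: i * stride + size] for i in range(size)]
    if any(len(row) != size for row in rows):
        raise ValueError("signal too short for the requested size and stride")
    return rows

def build_feature(pulse, illumination, size=64):
    """Return a size x size x 3 nested list: pulse, mixed, illumination channels."""
    mixed = [0.7 * p + 0.3 * q for p, q in zip(pulse, illumination)]  # weights from the text
    channels = [to_image_channel(x, size) for x in (pulse, mixed, illumination)]
    return [[[channels[c][r][k] for c in range(3)]
             for k in range(size)]
            for r in range(size)]
```

With this mapping, row r of each channel is the signal shifted by r samples, so the periodic pulse appears as diagonal stripes whose spacing depends on the heart rate.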

3.3. Training and Testing Procedure

Approximately 35,000 feature images from the UBFC-rPPG dataset were used for training, and 5000 images were reserved for testing. Initially, all models were trained using 35,000 synthetic feature images. After this stage, the models were further trained using UBFC data, where the training set consisted of 50% synthetic and 50% real UBFC samples. This two-stage training was used to allow the models to learn general patterns from synthetic data and adapt to real rPPG characteristics from UBFC. For testing, the 5000 UBFC images were evaluated using a repeated random sampling approach. In each test run, 100 samples were randomly selected from the test set and used for evaluation. This process was repeated 1000 times to capture performance variations across different sample combinations. For each evaluation metric, a distribution of values was obtained from the 1000 test runs. From this distribution, the minimum, maximum, and peak values were extracted. The peak value, obtained from the Kernel Density Estimation (KDE) curve, represents the most frequent performance value and is treated as the representative performance for comparison between models. In addition, KDE plots for selected metrics were generated to visualize the distribution of model performance.

This approach allows analysis of both the common and extreme performance cases, providing a clear view of model behavior under different test sample selections.
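The repeated sampling procedure and the KDE peak can be sketched as below, using MAE as the metric. The Gaussian kernel, the normal-reference bandwidth rule, the evaluation grid, and sampling without replacement inside each run are assumptions; the source does not specify how the KDE was computed.

```python
import math
import random
import statistics

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def kde_peak(values, grid_points=200):
    """Location of the maximum of a Gaussian KDE over `values`."""
    sd = statistics.stdev(values)
    if sd == 0.0:
        return values[0]                        # all runs gave the same value
    h = 1.06 * sd * len(values) ** (-0.2)       # normal-reference bandwidth (assumed)
    lo, hi = min(values), max(values)
    best_x, best_density = lo, -1.0
    for g in range(grid_points):
        x = lo + (hi - lo) * g / (grid_points - 1)
        density = sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)
        if density > best_density:
            best_x, best_density = x, density
    return best_x

def repeated_evaluation(pred, true, metric=mae, runs=1000, batch=100, seed=0):
    """Score `runs` random subsets of `batch` samples; return (min, max, KDE peak)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        pick = rng.sample(range(len(pred)), batch)
        scores.append(metric([pred[i] for i in pick], [true[i] for i in pick]))
    return min(scores), max(scores), kde_peak(scores)
```

Passing `metric=rmse` gives the corresponding RMSE summary. The three returned values correspond to the minimum, maximum, and peak entries described in the text.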

3.4. Heart-Rate Estimation Using CNN

Three convolutional neural network (CNN) architectures, ResNet-18, U-Net, and VGG16, were investigated for heart rate prediction in beats per minute (bpm). The models were trained using mean squared error (MSE) as the loss function and evaluated using two performance metrics: mean absolute error (MAE) and root mean square error (RMSE) in bpm. A total of 35,000 images were utilized for training over 75 epochs. The Kernel Density Estimation (KDE) of MAE and RMSE for all three models is presented in Figure 4, indicating that U-Net performed the best according to all evaluation metrics. Additionally, Figure 5 illustrates the error bar plot of MAE, which serves as an alternative to the KDE plot: VGG demonstrated the worst performance, ResNet performed slightly better, and U-Net achieved the highest accuracy. Table 1 provides a summary of the minimum, maximum, and most likely errors (peak of KDE), while Table 2 presents a comparative analysis of the proposed models against existing models, including ETA-rPPGNet [46], DeepPhys [12], rPPGNet [16], and PhysNet [17], all evaluated on the same dataset.

Figure 4 provides a comparative analysis of the models’ heart rate estimation performance through Kernel Density Estimation (KDE) plots of mean absolute error (MAE) and root mean square error (RMSE), and suggests that U-Net performed best on both metrics. In the MAE plot, the U-Net distribution peaks at the lowest error, demonstrating better predictions. VGG and ResNet show slightly broader and shifted distributions, reflecting moderate reliability. Similarly, the RMSE plot reveals U-Net’s ability to limit large errors effectively, whereas VGG and ResNet exhibit only moderate performance, confirming their comparatively poor reliability.

Figure 5 presents an error bar plot of the mean absolute error (MAE) ranges for heart rate estimation, allowing a direct comparison of the models.

Table 2 presents a comparative analysis of CHROM-Y with existing deep learning-based heart rate estimation methods, including ETA-rPPGNet, DeepPhys, rPPGNet, and PhysNet, all evaluated on the same dataset. While state-of-the-art models such as ETA-rPPGNet and PhysNet report lower error values, the results also highlight the performance variation within the proposed CHROM-Y framework across different architectures. Among the proposed models, CHROM-Y with U-Net achieved the best performance, recording an MAE of 3.62 bpm and RMSE of 6.67 bpm, outperforming both the VGG and ResNet variants. CHROM-Y with ResNet and VGG exhibited higher error rates, with MAE values of 8.91 bpm and 9.06 bpm, respectively, indicating that architectural choice significantly influences performance when using the same feature representation. Although the proposed models did not surpass all existing methods, the results demonstrate that integrating CHROM-Y features with an appropriate architecture, particularly U-Net, led to competitive performance and highlights the importance of combining domain-specific feature engineering with suitable network design for effective heart rate estimation.

3.5. rPPG Waveform Estimation Using CNN

The same network architectures were employed for the waveform estimation task with a modified output layer to produce a one-dimensional array of length 128, corresponding to the predicted PPG waveform. Prior to training, the ground truth waveforms were normalized to a fixed range to stabilize learning. Since the clinical relevance of PPG signals lies primarily in their shape and temporal pattern rather than their absolute amplitude, this normalization does not affect physiological interpretation but does help improve convergence and prediction stability. To evaluate the similarity between the predicted and ground truth waveforms, multiple complementary metrics were used to capture both temporal and spectral characteristics; the following metrics were employed:

Pearson correlation coefficient, which measures the linear similarity between the predicted and reference waveforms and reflects how well the temporal structure is preserved.
Waveform MAE, which quantifies the absolute amplitude deviation between the two signals in the time domain.
DCT distance, computed as the L2 distance between the Discrete Cosine Transform coefficients of the predicted and ground truth waveforms, representing spectral similarity.
Peak Frequency MAE, derived from the dominant peak obtained using the Fast Fourier Transform (FFT), which evaluates the difference in the primary frequency component (heart rate) between the predicted and reference signals.

For Peak Frequency MAE, the dominant frequency was extracted from the FFT spectrum of both signals, and the mean absolute difference between the two strongest frequencies (converted to bpm) was computed. This metric specifically assesses the accuracy of heart rate-related frequency estimation independent of waveform amplitude scaling. The quantitative analysis of waveform similarity under these metrics is presented in Table 3, where the minimum, maximum, and peak values (obtained from KDE distributions across repeated test runs) are tabulated for each model. The results indicate that U-Net achieves the best performance in HR estimation, while the traditional CHROM algorithm achieved better Pearson correlation, waveform MAE, and DCT distance.
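The four waveform metrics can be sketched with the standard library as follows. The unnormalized DCT-II, the exclusion of only the DC bin when searching for the dominant frequency, and the use of a direct DFT in place of an FFT routine are assumptions made to keep the example self-contained; the source does not give these details.

```python
import cmath
import math

FPS = 30  # sampling rate of the 128-sample waveforms

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def waveform_mae(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def dct2(x):
    """Unnormalized DCT-II (normalization is an assumption)."""
    n = len(x)
    return [sum(v * math.cos(math.pi * (t + 0.5) * k / n) for t, v in enumerate(x))
            for k in range(n)]

def dct_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(dct2(x), dct2(y))))

def peak_bpm(x, fps=FPS):
    """Dominant non-DC frequency of x, in beats per minute, via a direct DFT."""
    n = len(x)
    mean = sum(x) / n
    best_k, best_mag = 1, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum((v - mean) * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(x))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return 60.0 * best_k * fps / n

def peak_frequency_mae(preds, refs, fps=FPS):
    return sum(abs(peak_bpm(p, fps) - peak_bpm(r, fps))
               for p, r in zip(preds, refs)) / len(preds)
```

Note that with 128 samples at 30 fps the frequency bins are spaced 30/128 Hz apart, which is about 14.06 bpm, unless zero-padding or peak interpolation is applied. The source does not say whether either was used.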

4. Cross-Dataset Evaluation

To assess the generalizability and real-world robustness of the proposed CHROM-Y framework, a cross-dataset evaluation was conducted using the BhRPPG [15] dataset, following a strict no-fine-tuning protocol. In this experiment, the deep learning models ResNet-18, VGG16, and U-Net were trained exclusively on a combination of synthetic data and the UBFC-rPPG dataset and subsequently tested on the BhRPPG dataset without any additional retraining or domain adaptation. This setup evaluated the ability of the models to transfer learned rPPG representations across datasets with different acquisition conditions, subject demographics, and illumination profiles. The cross-dataset evaluation was performed for both core tasks: heart rate (HR) estimation and rPPG waveform reconstruction. All three models were evaluated under the three lighting conditions provided in the BhRPPG dataset: low illumination, medium illumination, and high illumination. This segmentation allowed analysis of model robustness under varying lighting intensities, which is critical for remote physiological monitoring in real-world scenarios.

The BhRPPG recordings have a nominal frame rate of 30 fps; under low lighting conditions, however, the effective frame rate dropped to approximately 20 fps. To maintain consistency with the UBFC training pipeline, all BhRPPG videos were processed using the same CHROM-Y feature extraction methodology. Feature images were constructed using non-overlapping windows of 128 frames, ensuring that each available video sequence contributed maximally to the evaluation.

Evaluation

For each lighting condition and task, the models were evaluated using performance metrics including mean absolute error (MAE) and root mean square error (RMSE) for HR estimation and Pearson correlation, DCT distance, and Peak Frequency MAE for waveform similarity. Table 4 provides a comprehensive comparison of model performance across lighting conditions.

Observation: The cross-dataset evaluation in Table 5 on the BhRPPG dataset reveals a clear degradation in performance across all three models, U-Net, VGG16, and ResNet-18, indicating limited generalization capability when models trained on synthetic and UBFC data are exposed to unseen recording conditions.

HR Estimation Performance: In the HR estimation task, all models exhibited relatively high MAE and RMSE values across illumination conditions. Under low lighting, MAE values ranged from 8.58 bpm (ResNet) to 8.82 bpm (VGG), while RMSE values varied between 10.0 bpm (ResNet) and 11.70 bpm (U-Net). Performance further deteriorated under medium lighting, where U-Net showed the worst results with MAE = 13.33 bpm and RMSE = 22.80 bpm, reflecting significant instability. Even under high illumination, which typically benefits rPPG extraction, errors remained substantial, with RMSE values exceeding 14 bpm for both VGG and ResNet models. These results indicate that the temporal and spectral characteristics learned during training on UBFC and synthetic data do not effectively transfer to BhRPPG, which differs in camera quality, frame rate consistency, lighting distribution, and skin reflectance dynamics.

Waveform Reconstruction Performance: The waveform reconstruction task further confirms the lack of robust generalization. Pearson correlation values across all lighting conditions remained near zero for every model, with values ranging from −0.002 to 0.03, indicating almost no linear similarity between predicted and ground truth PPG waveforms. This implies that the reconstructed waveforms failed to capture both morphological structure and temporal coherence. Similarly, the MAE (waveform) values remained consistently high, concentrated around 4.8–4.9, while the DCT distance metric remained around 5.5 for all models and lighting conditions, suggesting persistent spectral dissimilarity. The Peak Frequency MAE values also reached concerning levels, especially under low illumination, with U-Net and ResNet showing errors of up to 14.20 bpm, reinforcing the conclusion that heart rate frequency estimation is unreliable in this cross-dataset scenario.

Possible Root Causes: The poor cross-dataset performance can primarily be attributed to the domain shift between the training datasets and BhRPPG. Although CHROM-Y improves illumination robustness within a dataset (as discussed in the ablation study), the spectral and chromatic distributions in BhRPPG appear substantially different, and the effective frame rate drops to about 20 fps under low illumination. The models therefore appear to have learned dataset-specific features rather than general features that transfer across datasets. Furthermore, the absence of brightness augmentation during training likely reduced robustness to intensity variation. Prior work in the BhRPPG study [15] demonstrated that incorporating brightness augmentation improves cross-dataset generalization. Integrating similar augmentation strategies into the CHROM-Y pipeline may enhance resilience to illumination variability.

5. Ablation Study

To further investigate the contribution of the luminance component (Y) to the proposed CHROM-Y representation, an ablation study was conducted by selectively removing the Y-channel and evaluating its impact on model performance across datasets. This analysis aims to verify whether incorporating luminance information provides a meaningful advantage over purely chrominance-based representations. Initially, all three deep learning models, U-Net, ResNet-18, and VGG-16, were trained using the full CHROM-Y representation on the UBFC-rPPG dataset, as discussed in the previous section. The proposed CHROM-Y feature combines the chrominance signals (Ω, Φ) with the luminance signal (Y). For the ablation experiment, the third channel corresponding to the Y trace was explicitly suppressed by setting it to a zero matrix, while preserving the Ω and Φ channels unchanged in the feature image. This modification was applied consistently across all data sources, including the synthetic dataset and UBFC-rPPG dataset.
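The channel suppression used for the ablation can be illustrated with a short sketch. It assumes the feature image is stored as a height × width × 3 nested list with channels ordered (Ω, Φ, Y); the function name is hypothetical.

```python
def suppress_luminance(image, y_index=2):
    """Return a copy of an H x W x 3 feature image with the Y channel set to zero."""
    return [[[0.0 if c == y_index else value for c, value in enumerate(pixel)]
             for pixel in row]
            for row in image]
```

Zeroing the channel rather than dropping it keeps the input shape at three channels, so the same network architectures can be trained without modification.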

Models were trained on synthetic data and the UBFC-rPPG dataset in this modified configuration and evaluated on UBFC-rPPG. To analyze the role of luminance information, we compared models trained with and without the Y-channel on the HR estimation task. The results are tabulated in Table 6.

Observations and Analysis

The ablation results demonstrate that the inclusion of the luminance channel (Y) improves the HR estimation accuracy across the tested architectures. The most significant improvement was observed for the U-Net model, where the Peak MAE decreased from 6.15 bpm (without Y) to 3.62 bpm (with Y), corresponding to a performance gain of 41.14%, as shown in Table 7. Similarly, the Peak RMSE decreased from 8.32 bpm to 6.67 bpm, reflecting a 19.83% improvement. This gain suggests that U-Net benefitted strongly from the additional luminance cue, likely due to its encoder–decoder structure, which efficiently exploits spatial intensity variations. VGG-16 and ResNet-18 exhibited moderate improvement, with gains of 8% and 2% in MAE, respectively. Although these gains were smaller than U-Net’s, they still confirm that luminance information contributes positively to model performance. Overall, the ablation study validates that the Y-channel plays a role in enhancing performance, especially for architectures sensitive to intensity variations. Although Ω, Φ, and Y can all be computed from raw RGB traces, the significant performance improvement in U-Net also suggests that providing models with data preprocessed using domain knowledge helps them improve their accuracy, especially when only limited training data are available.

The improvement/degradation of model performance is captured with the given metric below:

\text{Relative Accuracy Change (\%)} = \left(\frac{\text{Accuracy}_{\text{new}} - \text{Accuracy}_{\text{old}}}{\text{Accuracy}_{\text{old}}}\right) \times 100

(18)

where a positive value implies performance improvement (error reduction) and a negative value implies performance degradation (error increase). This formulation ensures a normalized and interpretable measure of the contribution of luminance information.
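As a worked check of Equation (18) against the reported U-Net figures, the sketch below computes the relative change from error values. Since MAE and RMSE are errors, it assumes that an improvement is measured as the error reduction relative to the baseline error; this sign convention is inferred from the reported percentages rather than stated explicitly, and the function name is hypothetical.

```python
def relative_improvement(old_error, new_error):
    """Percentage error reduction relative to the baseline (positive = improvement)."""
    return (old_error - new_error) / old_error * 100.0

# Peak MAE and Peak RMSE of U-Net without and with the Y-channel, from the text.
mae_gain = relative_improvement(6.15, 3.62)
rmse_gain = relative_improvement(8.32, 6.67)
```

Under this convention, `mae_gain` and `rmse_gain` round to the 41.14% and 19.83% quoted in the ablation results.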

6. Fine-Tuning on BhRPPG

To analyze the performance of CHROM-Y under different illumination conditions, the model pretrained on UBFC was further fine-tuned with a relatively small training sample from the BhRPPG dataset. Fine-tuning was performed exclusively on the U-Net architecture, as it consistently outperformed ResNet-18 and VGG16 on the UBFC dataset in heart rate estimation tasks. The BhRPPG dataset consists of recordings under three distinct illumination conditions (low, medium, and high brightness), each contributing equally to the dataset distribution. A 50% overlapping windowing strategy (stride = 64) was used for sampling, which yielded 490 CHROM-Y feature images per illumination level. From these, 400 images were used for training and 90 images for testing, resulting in approximately 1200 training images and 270 testing images across all three illumination conditions. This split corresponds to an approximate 80%:20% ratio and ensures balanced representation of each illumination level in both the training and testing sets.
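The effect of the stride on the number of feature images can be sketched as follows. The 1800-frame recording length used in the example is hypothetical (1 min at 30 fps) and is not taken from the BhRPPG description; the function name is also a placeholder.

```python
def window_starts(n_frames, window=128, stride=64):
    """Start indices of all full windows of length `window` taken every `stride` frames."""
    if n_frames < window:
        return []
    return list(range(0, n_frames - window + 1, stride))

# Hypothetical 1800-frame recording.
non_overlapping = window_starts(1800, stride=128)  # stride equal to the window length
half_overlapping = window_starts(1800, stride=64)  # 50% overlap
```

For this example, halving the stride raises the count from 14 to 27 windows, which is why the overlapping strategy is useful when only a small amount of fine-tuning data is available. For the split itself, 400 of 490 images is about 81.6%, which the text rounds to 80%:20%.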

The U-Net model, initially trained on UBFC-rPPG, was subsequently fine-tuned on the BhRPPG training images with all layers marked as trainable, starting from the UBFC-pretrained weights and leaving the network architecture unchanged. Performance was then evaluated on the BhRPPG test set and directly compared with the original cross-dataset evaluation results (UBFC to BhRPPG). A comparison of fine-tuned performance versus cross-dataset evaluation is presented in Table 8; Equation (18) was used to quantify the improvement or degradation in the model’s performance.

Fine-tuning consistently improved performance across all illumination levels. In terms of MAE, reductions of 11.18% (low), 14.48% (medium), and 29.47% (high) were observed, while RMSE improved by 12.48%, 19.74%, and 27.94%, respectively. The improvement under all illumination conditions suggests that the feature representation and the model can adapt to different illumination levels. The most significant improvement occurred under high brightness, indicating that fine-tuning enables better adaptation to illumination extremes. Although the absolute errors remain higher than in intra-dataset evaluations, these results indicate that domain-specific fine-tuning substantially mitigates cross-dataset performance degradation and enhances the robustness of U-Net for real-world rPPG estimation scenarios.

7. Discussions

7.1. Advantages

The most significant strength of the proposed approach lies not in model complexity but in the use of well-processed, domain-driven input features. The CHROM-Y representation, which integrates chrominance signals (Ω, Φ) with luminance (Y), demonstrates that carefully engineered features can achieve meaningful performance. On the UBFC dataset, U-Net combined with CHROM-Y achieved a Peak MAE of 3.62 bpm and RMSE of 6.67 bpm, substantially outperforming VGG16 and ResNet-18, confirming that structured, physiologically grounded preprocessing plays a dominant role in performance.

CHROM (Ω, Φ) features are inherently resilient to skin tone variations, minor motion artifacts, and small illumination inconsistencies due to their chrominance-based formulation [18]. The inclusion of the luminance (Y) channel further enhances robustness against fluctuating brightness. Although rapid illumination fluctuation could not be explicitly tested due to dataset limitations, partial validation through BhRPPG fine-tuning demonstrated consistent improvement across all illumination levels. Fine-tuning yielded performance gains of 11.18–29.47% in MAE and 12.48–27.94% in RMSE, irrespective of whether the illumination was low, medium, or high, indicating that the model adapts effectively to different brightness conditions when trained with CHROM-Y features. An important insight is that lighter models achieved competitive or superior performance compared to heavier architectures, reinforcing the notion that performance gains can be achieved through intelligent feature fusion rather than brute-force network depth. This supports a scalable and computationally efficient pathway for rPPG systems, where specialized features are fused to target specific artifacts (illumination, motion, and skin reflectance), allowing small models to reach practical levels of accuracy.

7.2. Limitations

Despite the demonstrated advantages of the CHROM-Y representation, the proposed approach remains limited in handling strong domain shifts and extreme lighting dynamics. Cross-dataset evaluation exposed significant degradations in performance, with MAE values rising to 8.58–13.33 bpm and RMSE reaching 22.80 bpm. Pearson correlation values for waveform reconstruction remained near zero, indicating weak temporal fidelity despite successful dominant frequency estimation. These results reveal that while CHROM-Y improves robustness to moderate illumination variations, it is insufficient to fully address real-world conditions. A fundamental limitation of the proposed feature design lies in its restricted generalization capability. Although the chrominance components (Ω, Φ) are theoretically more resilient to dataset-specific variations such as skin tone, minor motion, and camera characteristics, the inclusion of the luminance (Y) trace introduces sensitivity to dataset-dependent illumination patterns. This is evidenced by the poor cross-dataset metrics, which indicate that the feature representation, while effective within controlled environments, is not fully dataset-independent.

Additionally, the current study lacks controlled experiments simulating abrupt illumination changes (e.g., sudden flashes or fluctuating ambient lighting). Although fine-tuning on BhRPPG provided partial validation, this gap limits the ability to conclusively validate claims of insensitivity to extreme brightness variation. The absence of rapid-light-transition scenarios restricts the assessment of the approach under the realistic, uncontrolled conditions commonly encountered in practical deployments.

7.3. Future Directions

A primary future focus should be the enhancement of cross-dataset generalization capability. Since the current CHROM-Y representation shows sensitivity to dataset-specific illumination patterns, particularly through the Y-channel, future work should explore illumination-normalized or illumination-invariant luminance formulations. This includes adaptive luminance normalization strategies and dynamic scaling mechanisms that reduce dependency on dataset-specific brightness distributions. To address the limited adaptability of models to unseen domains, domain adaptation and transfer learning techniques should be incorporated. Approaches such as adversarial domain adaptation, feature alignment, and multi-domain training pipelines can help models recalibrate to varying acquisition environments. Training on diverse datasets that encompass broader lighting, camera types, and subject demographics will further improve robustness.

Another essential direction is the inclusion of controlled rapid illumination fluctuation experiments. Future studies should simulate realistic dynamic lighting conditions such as sudden brightness changes, flickering light sources, and shadow transitions to more accurately evaluate the model’s resilience. Integrating synthetic brightness perturbation and temporal illumination noise during training can help develop models that are resilient to real-world lighting instability.

Given that lightweight models demonstrated competitive performance, future work should also explore artifact-specific feature fusion frameworks, where multiple processed features are combined, each targeting specific disturbances such as motion artifacts, illumination drift, or camera noise. This approach could enable small, efficient models to achieve higher accuracy without excessive computational overhead.

In summary, advancing this research requires a shift toward domain-adaptive feature engineering, dataset-independent luminance features, and diversified training environments, enabling the development of scalable rPPG systems.

8. Conclusions

This work presented CHROM-Y, a robust 2D featurization strategy that integrates chrominance (Ω, Φ) and luminance (Y) components to enhance remote photoplethysmography performance under varying illumination conditions. Through extensive evaluation using three deep learning architectures (U-Net, ResNet-18, and VGG16) across heart rate estimation and waveform reconstruction tasks, the study demonstrates that performance is strongly influenced by the quality of input representation rather than model depth alone. On the UBFC dataset, U-Net achieved the best performance, with a Peak MAE of 3.62 bpm and RMSE of 6.67 bpm, while the ablation study confirmed the critical role of luminance, with the removal of the Y-channel leading to significant performance degradation. Although waveform reconstruction exhibited weak temporal correlation across datasets, the models—particularly U-Net—consistently preserved dominant frequency components, enabling reliable frequency-based heart rate estimation even when point-wise waveform similarity was low. Cross-dataset evaluation highlighted the limitations of the approach, with MAE values reaching up to 13.33 bpm and RMSE peaking at 22.80 bpm, revealing sensitivity to domain shift and dataset-specific illumination characteristics. However, fine-tuning on the BhRPPG dataset resulted in consistent improvements across all illumination levels, confirming that the model can adapt when exposed to domain-relevant data.

Overall, the findings emphasize that well-processed, physiologically informed features such as CHROM-Y can significantly enhance model performance, allowing lightweight architectures to achieve competitive accuracy. While the approach does not yet achieve state-of-the-art results and remains limited under extreme illumination dynamics, it establishes a strong foundation for future rPPG systems. Continued research focusing on dataset-independent luminance modeling, controlled illumination testing, and domain adaptation strategies is essential to further improve robustness and enable reliable deployment in real-world scenarios.

Author Contributions

M.J. contributed to the conceptualization of the study, implementation of the proposed method, and drafting of the manuscript. M.U. supervised the research, provided critical feedback, and assisted in refining the manuscript in machine learning. S.P. was responsible for data acquisition, preprocessing, and statistical analysis. M.J. and R.S. contributed to the development of CNN architectures and R.B.J. conducted the experimental validation. All authors have read and agreed to the published version of the manuscript.

Funding

It is certified on behalf of the corresponding author that the present research is funded by IndiaAI Fellowship, Ministry of Electronics and Information Technology, Govt. of India, Funding No.:INDAI/4/2025-INDAI, Dated: 22 January 2025.

Institutional Review Board Statement

The study utilized the publicly available UBFC-rPPG and BH-rPPG datasets. The creators of these datasets obtained the necessary ethics approval and consent. No new human subject data was collected for this research.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in UBFC-rPPG and BH-rPPG datasets. All data generated or analyzed during this study are included in this article.

Acknowledgments

The authors wish to thank R. Macwan and Y. Benezeth from the University of Bourgogne Franche-Comté for providing the UBFC-rPPG dataset, and Ze Yang, Haofei Wang, and Feng Lu from Beihang University, Beijing, for providing the BH-rPPG dataset. These datasets were essential for the development and evaluation of our rPPG estimation method, and their availability greatly contributed to this research. Regarding the portrait of the participant labeled ‘subject-1’ from the UBFC-rPPG dataset, the subject’s consent is provided in the dataset. We also thank the Department of Computational Intelligence, SRM Institute of Science and Technology, Kattankulathur.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Allen, J. Photoplethysmography and its application in clinical physiological measurement. Physiol. Meas. 2007, 28, R1–R39. [Google Scholar] [CrossRef] [PubMed]
Hertzman, A.B. Photoelectric Plethysmography of the Fingers and Toes in Man. Proc. Soc. Exp. Biol. Med. 1937, 37, 529–534. [Google Scholar] [CrossRef]
Alian, A.A.; Shelley, K.H. Photoplethysmography. Best Pract. Res. Clin. Anaesthesiol. 2014, 28, 395–406. [Google Scholar] [CrossRef]
Nitzan, M.; Babchenko, A.; Khanokh, B.; Landau, D. The variability of the photoplethysmographic signal—A potential method for the evaluation of the autonomic nervous system. Physiol Meas. 1998, 19, 93–102. [Google Scholar] [CrossRef] [PubMed]
Lee, R.J.; Sivakumar, S.; Lim, K.H. Review on remote heart rate measurements using photoplethysmography. Multimed. Tools Appl. 2023, 83, 44699–44728. [Google Scholar] [CrossRef]
Lu, S.; Zhao, H.; Ju, K.; Shin, K.; Lee, M.; Shelley, K.; Chon, K.H. Can Photoplethysmography Variability Serve as an Alternative Approach to Obtain Heart Rate Variability Information? J. Clin. Monit. Comput. 2008, 22, 23–29. [Google Scholar] [CrossRef]
Yoon, Y.; Cho, J.H.; Yoon, G. Non-constrained Blood Pressure Monitoring Using ECG and PPG for Personal Healthcare. J. Med. Syst. 2009, 33, 261–266. [Google Scholar] [CrossRef]
Loh, H.W.; Xu, S.; Faust, O.; Ooi, C.P.; Barua, P.D.; Chakraborty, S.; Tan, R.-S.; Molinari, F.; Acharya, U.R. Application of photoplethysmography signals for healthcare systems: An in-depth review. Comput. Methods Programs Biomed. 2022, 216, 106677. [Google Scholar] [CrossRef] [PubMed]
Kim, D.-Y.; Lee, K.; Sohn, C.-B. Assessment of ROI Selection for Facial Video-Based rPPG. Sensors 2021, 21, 7923. [Google Scholar] [CrossRef] [PubMed]
Verkruysse, W.; Svaasand, L.O.; Nelson, J.S. Remote plethysmographic imaging using ambient light. Opt. Express 2008, 16, 21434–21445. [Google Scholar] [CrossRef] [PubMed]
LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
Chen, W.; McDuff, D. Deepphys: Video-based physiological measurement using convolutional attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Malmö, Sweden, 8–13 September 2018; pp. 349–365. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Tran, D.; Wang, H.; Torresani, L. Learning Spatio-Temporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
Yang, Z.; Wang, H.; Lu, F. Assessment of Deep Learning-Based Heart Rate Estimation Using Remote Photoplethysmography Under Different Illuminations. IEEE Trans. Hum.-Mach. Syst. 2022, 52, 1236–1246.
Yu, Z.; Peng, W.; Li, X.; Hong, X.; Zhao, G. Remote heart rate measurement from highly compressed facial videos: An end-to-end deep learning solution with video enhancement. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 151–160.
Yu, Z.; Li, X.; Zhao, G. Remote photoplethysmograph signal measurement from facial videos using spatiotemporal networks. In Proceedings of the British Machine Vision Conference (BMVC), Cardiff, UK, 9–12 September 2019.
de Haan, G.; Jeanne, V. Robust Pulse Rate From Chrominance-Based rPPG. IEEE Trans. Biomed. Eng. 2013, 60, 2878–2886.
Wang, W.; den Brinker, A.C.; Stuijk, S.; de Haan, G. Algorithmic Principles of Remote PPG. IEEE Trans. Biomed. Eng. 2016, 64, 1479–1491.
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. 2015. Available online: https://arxiv.org/abs/1409.1556 (accessed on 26 October 2025).
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. 2015. Available online: https://arxiv.org/abs/1505.04597 (accessed on 26 October 2025).
Bobbia, S.; Macwan, R.; Benezeth, Y.; Mansouri, A.; Dubois, J. Unsupervised skin tissue segmentation for remote photoplethysmography. Pattern Recognit. Lett. 2017, 124, 82–90.
Wu, J.; Zhu, Y.; Jiang, X.; Liu, Y.; Lin, J. Local attention and long-distance interaction of rPPG for deepfake detection. Vis. Comput. 2024, 40, 1083–1094.
Niu, X.; Yu, Z.; Han, H.; Li, X.; Shan, S.; Zhao, G. Video-based remote physiological measurement via cross-verified feature disentangling. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part II. Springer: Berlin/Heidelberg, Germany, 2020; Volume 16, pp. 295–310.
Tulyakov, S.; Alameda-Pineda, X.; Ricci, E.; Yin, L.; Cohn, J.F.; Sebe, N. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2396–2404.
Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.G.; Lee, J.; et al. Mediapipe: A Framework for Building Perception Pipelines. arXiv 2019, arXiv:1906.08172.
Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization, and Huffman Coding. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
Tominaga, S. Dichromatic reflection models for a variety of materials. Color Res. Appl. 1994, 19, 277–285.
Poh, M.-Z.; McDuff, D.J.; Picard, R.W. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Opt. Express 2010, 18, 10762–10774.
Poh, M.-Z.; McDuff, D.J.; Picard, R.W. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Trans. Biomed. Eng. 2011, 58, 7–11.
Lewandowska, M.; Rumiński, J.; Kocejko, T.; Nowak, J. Measuring pulse rate with a webcam—A non-contact method for evaluating cardiac activity. In Proceedings of the 2011 Federated Conference on Computer Science and Information Systems (FedCSIS), Szczecin, Poland, 18–21 September 2011; pp. 405–410.
McDuff, D.; Gontarek, S.; Picard, R.W. Improvements in remote cardiopulmonary measurement using a five band digital camera. IEEE Trans. Biomed. Eng. 2014, 61, 2593–2601.
Lam, A.; Kuno, Y. Robust heart rate measurement from video using select random patches. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3640–3648.
Huelsbusch, M. Ein Bildgestuetztes, Funktionelles Verfahren zur Optoelektronischen Erfassung der Hautperfusion. Ph.D. Dissertation, Fakultaet fuer Elektrotechnik und Informationstechnik, RWTH Aachen University, Aachen, Germany, 28 January 2008; p. 70.
Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828.
Niu, X.; Shan, S.; Han, H.; Chen, X. RhythmNet: End-to-End Heart Rate Estimation From Face via Spatial-Temporal Representation. IEEE Trans. Image Process. 2020, 29, 2409–2423.
Song, R.; Zhang, S.; Li, C.; Zhang, Y.; Cheng, J.; Chen, X. Heart Rate Estimation From Facial Videos Using a Spatiotemporal Representation With Convolutional Neural Networks. IEEE Trans. Instrum. Meas. 2020, 69, 7411–7421.
Birla, L.; Gupta, P.; Kumar, S. SUNRISE: Improving 3D Mask Face Anti-Spoofing for Short Videos Using Pre-Emptive Split and Merge. IEEE Trans. Dependable Secur. Comput. 2023, 20, 1927–1940.
Tarassenko, L.; Villarroel, M.; Guazzi, A.; Jorge, J.; Clifton, D.A.; Pugh, C. Non-contact video-based vital sign monitoring using ambient light and auto-regressive models. Physiol. Meas. 2014, 35, 807–831.
Hu, M.; Qian, F.; Wang, X.; He, L.; Guo, D.; Ren, F. Robust Heart Rate Estimation With Spatial-Temporal Attention Network From Facial Videos. IEEE Trans. Cogn. Dev. Syst. 2022, 14, 639–647.
Chowdhury, M.H.; Chowdhury, M.E.; Reaz, M.B.I.; Ali, S.H.M.; Rakhtala, S.M.; Murugappan, M.; Mahmud, S.; Shuzan, N.I.; Bakar, A.A.A.; Bin Shapiai, M.I.; et al. LGI-rPPG-Net: A shallow encoder-decoder model for rPPG signal estimation from facial video streams. Biomed. Signal Process. Control 2024, 89, 105687.
Das, M.; Bhuyan, M.K.; Sharma, L.N. Time-Frequency Learning Framework for rPPG Signal Estimation Using Scalogram-Based Feature Map of Facial Video Data. IEEE Trans. Instrum. Meas. 2023, 72, 4007710.
Bousefsaf, F.; Maaoui, C.; Pruski, A. Continuous wavelet filtering on webcam photoplethysmographic signals to remotely assess the instantaneous heart rate. Biomed. Signal Process. Control 2013, 8, 568–574.
Yu, Z.; Cai, R.; Li, Z.; Yang, W.; Shi, J.; Kot, A.C. Benchmarking Joint Face Spoofing and Forgery Detection With Visual and Physiological Cues. IEEE Trans. Dependable Secur. Comput. 2024, 21, 4327–4342.
Wu, B.-F.; Wu, Y.-C.; Chou, Y.-W. A Compensation Network With Error Mapping for Robust Remote Photoplethysmography in Noise-Heavy Conditions. IEEE Trans. Instrum. Meas. 2022, 71, 4000211.
Hu, M.; Qian, F.; Guo, D.; Wang, X.; He, L.; Ren, F. ETA-rPPGNet: Effective Time-Domain Attention Network for Remote Heart Rate Measurement. IEEE Trans. Instrum. Meas. 2021, 70, 2506212.

Figure 1. The process of feature image generation.

Figure 2. Feature image generation using a sliding window with a stride of 2.

Figure 3. Randomly chosen samples of synthetic data.

Figure 4. KDE plots of the U-Net, VGG, and ResNet-18 models. (a) Distribution of MAE. (b) Distribution of RMSE.

Figure 5. Range of MAE errors across models with error bars.

Table 1. Evaluation of models for the heart rate estimation task: range (Min, Peak, Max) of MAE and RMSE.

Feature	Model	MAE (bpm)			RMSE (bpm)
		Min	Peak	Max	Min	Peak	Max
CHROM		4.95	6.78	9.96	6.48	9.94	17.53
CHROM-Y	U-Net	2.43	3.62	5.51	4.84	6.67	9.48
	VGG	6.91	9.06	11.22	9.91	12.18	14.19
	ResNet	7.19	8.91	10.70	10.03	11.54	13.52

Table 2. Comparative analysis of CHROM-Y with other deep learning-based methods.

No.	Model	MAE (bpm)	RMSE (bpm)
1	ETA-rPPGNet [46]	1.46	3.97
2	DeepPhys [12]	3.71	5.27
3	rPPGNet [16]	3.24	4.97
4	PhysNet [17]	2.33	3.04
5	Ours (ResNet)	8.91	11.54
6	Ours (VGG)	9.06	12.18
7	Ours (U-Net)	3.62	6.67

Table 3. Evaluation of models for the PPG waveform estimation task: range (Min, Peak, Max) of the similarity indices and waveform MAE, and of the MAE of the HR estimated from the dominant frequency (in bpm).

Feature	Model	Pearson Correlation			MAE (Waveform)			DCT			Peak Freq MAE (bpm)
		Min	Peak	Max	Min	Peak	Max	Min	Peak	Max	Min	Peak	Max
CHROM		0.72	0.8	0.93	2.09	3.91	10.12	1.03	1.23	1.87	4.95	6.78	9.96
CHROM-Y	U-Net	−0.146	−0.01	0.168	4.50	4.77	5.0	3.43	4.66	4.94	0.56	3.12	8.154
	VGG	−0.084	−0.001	0.070	4.42	4.55	4.71	5.32	5.45	5.60	19.12	24.58	30.35
	ResNet	−0.099	0.001	0.099	4.41	4.56	4.702	5.31	5.46	5.62	9.84	15.00	19.5
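
The "Peak Freq MAE" columns report the error of a heart rate read off the dominant frequency of the reconstructed waveform. A minimal sketch of that idea follows, using only the Python standard library. The function names, the 0.7–4.0 Hz search band, and the 30 Hz sampling rate are illustrative assumptions and are not taken from the paper.

```python
import cmath
import math


def dominant_frequency_bpm(signal, fs, lo_hz=0.7, hi_hz=4.0):
    """Return the strongest spectral peak inside [lo_hz, hi_hz], in beats per minute.

    The band limits are an assumed plausible heart-rate range, not the paper's setting.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC component
    best_k, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n  # frequency of DFT bin k in Hz
        if not (lo_hz <= freq <= hi_hz):
            continue
        # Naive DFT coefficient for bin k (adequate for short windows)
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    if best_k is None:
        raise ValueError("no frequency bin falls inside the search band")
    return 60.0 * best_k * fs / n


def mean_absolute_error(estimates, references):
    """Mean absolute difference between estimated and reference heart rates (bpm)."""
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(estimates)


# Hypothetical 10 s window at 30 frames per second carrying a 1.2 Hz pulse
fs = 30.0
window = [math.sin(2 * math.pi * 1.2 * t / fs) for t in range(300)]
print(dominant_frequency_bpm(window, fs))
```

Because only the location of the spectral peak matters here, a waveform can correlate poorly with the reference sample by sample and still yield a usable heart rate, which is consistent with the low Pearson values and comparatively small peak-frequency errors reported for U-Net.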

Table 4. Cross-dataset evaluation: HR estimation task on the Bh-rPPG dataset. The table compares the CHROM-Y-based deep learning models under three illumination conditions (low, medium, and high) using MAE and RMSE, illustrating how each architecture responds to varying lighting in an unseen-dataset setting.

Feature	Model	MAE (bpm)			RMSE (bpm)
		Low	Medium	High	Low	Medium	High
CHROM-Y	U-Net	8.59	13.33	11.54	11.70	22.80	17.33
	VGG	8.82	10.56	9.02	10.23	11.82	14.26
	ResNet	8.58	10.70	8.98	10.0	18.03	14.34

Table 5. Cross-dataset evaluation: PPG wave reconstruction.

Feature	Model	Pearson Correlation			MAE (Waveform)			DCT			Peak Freq MAE (bpm)
		Low	Mid	High	Low	Mid	High	Low	Mid	High	Low	Mid	High
CHROM-Y	U-Net	0.013	−0.002	0.016	4.89	4.84	4.87	5.60	5.52	5.5	13.67	13.32	14.01
	VGG	−0.0001	0.02	0.03	4.84	4.82	4.82	5.54	5.51	5.52	12.48	15.27	13.54
	ResNet	0.017	0.001	0.01	4.85	4.84	4.83	5.55	5.52	5.55	14.20	13.91	12.99

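Table 6. HR estimation task without the Y channel (ablation); the Peak values correspond to the "Without Y" columns of Table 7.

Feature	Model	MAE (bpm)			RMSE (bpm)
		Min	Peak	Max	Min	Peak	Max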
CHROM-Y	U-Net	4.32	6.15	7.20	6.48	8.32	10.87
	VGG	7.01	9.89	11.98	10.01	12.99	15.09
	ResNet	8.03	9.12	11.70	9.32	12.22	14.10

Table 7. Ablation study: Peak MAE and Peak RMSE of each model with and without the Y channel.

Model	Peak MAE (bpm)			Peak RMSE (bpm)
	With Y	Without Y	Improvement (%)	With Y	Without Y	Improvement (%)
U-Net	3.62	6.15	+41.14%	6.67	8.32	+19.83%
VGG	9.06	9.89	+8.39%	12.18	12.99	+6.24%
ResNet	8.91	9.12	+2.30%	11.54	12.22	+5.56%

Table 8. Impact of U-Net fine-tuning on the BhRPPG dataset under different illumination levels. The table gives a quantitative comparison of the U-Net model's heart rate estimation performance on the BhRPPG dataset before and after fine-tuning. The evaluation is conducted across three illumination conditions (low, medium, high), using mean absolute error (MAE) and root mean square error (RMSE) as performance metrics. The percentage improvement indicates the relative performance gain achieved through domain-specific fine-tuning.

U-Net Model	MAE (bpm)			RMSE (bpm)
Brightness	No Fine-Tuning	Fine-Tuning	Improvement (%)	No Fine-Tuning	Fine-Tuning	Improvement (%)
Low	8.59	7.63	11.18%	11.70	10.24	12.48%
Medium	13.33	11.40	14.48%	22.80	18.30	19.74%
High	11.54	8.14	29.47%	17.33	12.49	27.94%
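
The percentage columns in Tables 7 and 8 appear to be relative error reductions, that is, (baseline − improved) / baseline × 100. The short sketch below recomputes them from the tabulated values under that assumption; the function name and the dictionary layout are illustrative and not part of the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage reduction in error relative to the baseline error."""
    if baseline <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline - improved) / baseline


# MAE values (bpm) copied from Table 8: (no fine-tuning, fine-tuning)
table8_mae = {
    "Low": (8.59, 7.63),
    "Medium": (13.33, 11.40),
    "High": (11.54, 8.14),
}

for level, (before, after) in table8_mae.items():
    print(f"{level}: {relative_improvement(before, after):.2f}%")

# Table 7, U-Net Peak MAE: error without Y is the baseline, error with Y is the improved value
print(f"U-Net ablation: {relative_improvement(6.15, 3.62):.2f}%")
```

Recomputing from the rounded table entries can differ from the published percentages in the last digit, presumably because the published figures were derived from unrounded errors.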

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Javidh, M.; Shah, R.; Uma, M.; Prabhu, S.; Jeyavathana, R.B. CHROM-Y: Illumination-Adaptive Robust Remote Photoplethysmography Through 2D Chrominance–Luminance Fusion and Convolutional Neural Networks. Signals 2025, 6, 72. https://doi.org/10.3390/signals6040072
