Remote Photoplethysmography Using Triple-Head Spatio-Temporal Transformer with Reaction-Driven Gating and Illumination Separation
Abstract
1. Introduction
- We introduce Reaction-Driven Gating (RDG), a dynamic spatial masking mechanism guided by facial blendshapes.
- We propose Dynamic Anchor Locking (DAL) to estimate illumination noise using a background reference token.
- We design an Adaptive Frequency Window (AFW) to dynamically adjust bandpass filtering based on facial activity.
2. Related Work
2.1. Traditional rPPG Signal Processing Methods
2.2. Deep Learning–Based rPPG Methods
2.3. Transformer-Based rPPG Architectures
| Category | Method | Key Methodology/Technical Contribution | MAE (bpm) ↓ | Ground Truth Device |
|---|---|---|---|---|
| Traditional | CHROM [12] | Chrominance-based projection: Uses a linear combination of R, G, B channels to eliminate motion artifacts based on a skin reflection model. | 4.06 | CMS50E Pulse Oximeter |
| Traditional | POS [13] | Plane-Orthogonal-to-Skin: Projects color signals onto a plane orthogonal to the skin tone to maximize pulse SNR. | 4.08 | CMS50E Pulse Oximeter |
| Deep Learning | DeepPhys [14] | 2D-CNN: First end-to-end model using normalized frame differences and a motion-representation branch. | 6.25 | CMS50E Pulse Oximeter |
| Deep Learning | TS-CAN [15] | Multi-task 2D-CNN: Employs Temporal Shift Modules and attention to capture motion efficiently without 3D convolutions. | 1.7 | CMS50E Pulse Oximeter |
| Deep Learning | PhysNet [6] | 3D-CNN: Uses spatio-temporal convolutions to extract heart rate and BVP signals directly from video volumes. | 2.95 | CMS50E Pulse Oximeter |
| Transformer | PhysFormer [16] | ViT-based: Uses spatio-temporal self-attention to model long-range dependencies in the pulse signal. | 0.52 | CMS50E Pulse Oximeter |
| Transformer | RhythmFormer [19] | Temporal Transformer: Specifically targets the periodic rhythm of the BVP signal using constrained attention. | 0.5 | CMS50E Pulse Oximeter |
2.4. Robust rPPG Estimation in Real-World Conditions
3. Methodology
3.1. Architecture Overview
3.2. Dynamic Background Anchor Selection
- Temporal Variance (Vart): measures the stability of a region over time.
- Spatial Gradient (Grad): ensures that the anchor is a uniform surface (like a wall) rather than a textured object. It is defined using the L2 norm of the Sobel operators Gx and Gy:
- Stability Index: The final score Sa is a weighted minimization objective:
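The three criteria above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the candidate boxes, the equal weights `w_var` and `w_grad`, and the per-region averaging are assumptions introduced here, since the exact weighting of the stability score is not given in this section. The Sobel responses are computed with shifted slices so that the block needs no image library.

```python
import numpy as np

def sobel_gradient_magnitude(gray):
    """Per-pixel L2 norm of the Sobel responses Gx and Gy (edge-padded)."""
    p = np.pad(gray.astype(np.float64), 1, mode="edge")
    # 3x3 Sobel kernels written out as sums of shifted slices.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.sqrt(gx ** 2 + gy ** 2)

def select_anchor(frames, boxes, w_var=0.5, w_grad=0.5):
    """Pick the background box with the lowest stability score.

    frames: (T, H, W) grayscale clip.
    boxes:  list of (y0, y1, x0, x1) candidate regions (hypothetical input).
    w_var, w_grad: assumed weights; the paper's values are not stated here.
    """
    scores = []
    for y0, y1, x0, x1 in boxes:
        region = frames[:, y0:y1, x0:x1].astype(np.float64)
        # Temporal variance: per-pixel variance over time, averaged over the box.
        var_t = region.var(axis=0).mean()
        # Spatial gradient: mean Sobel magnitude of the time-averaged box.
        grad = sobel_gradient_magnitude(region.mean(axis=0)).mean()
        scores.append(w_var * var_t + w_grad * grad)
    return int(np.argmin(scores)), scores

# Usage: a flat "wall" on the left, a noisy textured area on the right.
rng = np.random.default_rng(0)
clip = np.full((10, 16, 32), 100.0)
clip[:, :, 16:] += rng.normal(0, 20, size=(10, 16, 16))
best, scores = select_anchor(clip, [(0, 16, 0, 16), (0, 16, 16, 32)])
```

Under these assumptions the flat, static region receives the lowest score and is chosen as the anchor, which matches the intent of preferring a uniform surface such as a wall.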
3.3. Spatio-Temporal Transformer Encoding
3.4. Reaction-Driven Spatial Gating
3.5. Illumination Modeling and Physiological Signal Reconstruction
3.6. Adaptive Frequency Window
3.7. Multi-Head Objective Functions and Optimization
3.7.1. Physiology Loss (Lphys): Hybrid Time-Frequency Supervision
3.7.2. Reaction Loss (Lreact): Anatomical Distillation
3.7.3. Illumination Loss (Lillum): Background Reference
3.7.4. Orthogonality Loss (Lortho): Latent Disentanglement
3.8. Lightweight Auxiliary Heads
4. Experimental Results
4.1. Datasets
- UBFC-rPPG [23]: Contains recordings of 42 participants playing a stress-inducing math game, a protocol designed to induce high heart rate variability (HRV) and natural facial expressions. Videos are recorded with a Logitech C920 HD Pro webcam at 640 × 480 resolution and 30 fps. The capture environment is indoors with good illumination and a varying amount of sunlight. All videos have negligible compression (220 Mbps) to preserve subtle physiological variations.
- VIPL-HR [38]: A multi-modal dataset containing 2378 videos from 107 subjects. It covers 9 scenarios, including varied head movements, and three video sensors (webcam, smartphone, and NIR). RGB videos are captured using a Logitech C310 color camera (960 × 720 at 25 fps) and a Huawei P9 camera (1920 × 1080 at 30 fps). The environments include dark and bright illumination with moderate compression. VIPL-HR contains substantial motion variation, sensor diversity, and illumination inconsistency, making physiological reconstruction significantly more challenging than on UBFC-rPPG.
- COHFACE [39]: Contains 40 subjects spanning different age groups and skin tones. Each subject is recorded in four video sequences under two experimental conditions: clean (controlled lab illumination) and natural (lights off, only sunlight entering the lab). Videos are recorded with a Logitech HD C525 webcam at 640 × 480 resolution and 20 fps. All videos are stored at a high compression rate (250 kbps), which introduces compression artifacts that can hide subtle skin color variations and make accurate rPPG estimation harder.
4.2. Implementation Details
- Stage 1 (Epochs 1–10: Spatial Prior Initialization).
- Stage 2 (Epochs 11–25: Physiological Signal Learning).
- Stage 3 (Epochs 26–100: Full Multi-task Refinement).
4.3. Results and Discussion
4.4. Ablation Study
4.4.1. Baseline Model (Mbase)
4.4.2. Impact of Reaction-Driven Gating (Mreact)
- Detection Accuracy and Temporal Jitter: MediaPipe provides sub-pixel landmark localization. However, extreme head rotations (>45 degrees) or severe occlusions (e.g., hands covering the face) cause detection failures. Moreover, raw blendshape coefficients exhibit frame-to-frame jitter due to video compression and landmark tracking noise. To reduce this jitter, a moving-average filter is applied across the temporal window before the gating mask is computed. Additionally, the soft-gating formulation (Equation (5)) produces gradual mask changes rather than binary switching, making the system robust to small coefficient changes.
- Anatomical Mapping: The mapping matrix A (52 blendshapes to an 8 × 8 spatial grid) requires manual definition of which facial regions each blendshape affects. This is an approximation; in reality, muscle activations have diffuse effects. We mitigate this limitation by: (a) using a coarse 8 × 8 grid (64 patches total), so spatial precision requirements are modest, and (b) allowing the transformer to learn residual adjustments through end-to-end training, as the gating mask is a soft weight, not a hard segmentation.
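The two mitigation steps discussed above, temporal smoothing of blendshape coefficients and soft gating through the mapping matrix A, can be sketched as follows. This is an illustrative sketch only: Equation (5) is not reproduced in this section, so the sigmoid form, the `sharpness` and `threshold` parameters, the window length of 5, and the convention that high facial activity lowers a patch weight are all assumptions.

```python
import numpy as np

def moving_average(coeffs, window=5):
    """Smooth (T, 52) blendshape coefficients over time with a centered box filter.

    `window` is assumed odd so that the output keeps length T.
    """
    kernel = np.ones(window) / window
    pad = window // 2
    padded = np.pad(coeffs, ((pad, pad), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid")
         for k in range(coeffs.shape[1])],
        axis=1,
    )

def soft_gate(coeffs, A, sharpness=8.0, threshold=0.3):
    """Map smoothed coefficients (T, 52) to soft patch weights (T, 8, 8).

    A: (52, 64) blendshape-to-patch mapping (manually defined in the paper).
    Assumed form: a decreasing sigmoid, so active patches get weights near 0
    and quiet patches get weights near 1, with no hard switching.
    """
    activity = coeffs @ A                      # (T, 64) per-patch activity
    gate = 1.0 / (1.0 + np.exp(sharpness * (activity - threshold)))
    return gate.reshape(-1, 8, 8)

# Usage: one jittery blendshape mapped to a single patch.
raw = np.zeros((20, 52))
raw[::2, 0] = 1.0                              # frame-to-frame jitter
A = np.zeros((52, 64))
A[0, 5] = 1.0                                  # blendshape 0 affects patch 5
gates = soft_gate(moving_average(raw), A)
```

Because the gate is a smooth function of the filtered coefficients, a small change in a coefficient produces only a small change in the patch weight, which is the robustness property the text attributes to soft gating.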
4.4.3. Impact of Illumination Decoupling (Millum)
4.4.4. Full TH-STT Synergy (Mfull Without/with AFW)
4.4.5. Impact of Temporal Window Size
4.4.6. Adaptive Frequency Window Parameter
4.4.7. Sensitivity Analysis and Parameter Selection
- Stage 1: The initial objective was to establish a stable spatial gating mask in the first 10 epochs. We fixed the reaction coefficient λreact = 1.0 and performed a grid search for the physiological coefficient λphys between 0.1 and 0.6. As shown in Figure 4a, the Mask MAE reaches its minimum at λphys = 0.2. Beyond this value, physiological gradients begin to dominate the latent space and the accuracy of the gating mask degrades.
- Stage 2: In the second stage, we prioritized signal extraction by fixing λphys = 1.0 and searching for the optimal λreact value. The selection was based on finding the balance point between signal purity and gating stability. As depicted in Figure 4b, the highest Pearson correlation was achieved at λreact = 0.3, ensuring the model preserves its learned spatial priors while maximizing BVP signal quality.
- Stage 3: The final stage introduced the illumination and orthogonality coefficients. Our analysis in Figure 4c shows that the RMSE is lowest at a coefficient value of 0.2. Raising the illumination weight λillum above 0.2 lets the gradients of the illumination head dominate training: the head overfits to reconstructing light variations, effectively treating the subtle BVP signal as noise and causing a sharp rise in final error.
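The staged coefficients described above can be collected into a small schedule function. This is a sketch under stated assumptions, not the authors' training code: the epoch boundaries follow the three stages in Section 4.2 and the values 0.2, 0.3, and 0.2 come from the text, but setting the illumination and orthogonality weights to zero before Stage 3, carrying the Stage 2 values of λphys and λreact into Stage 3, and using 0.2 for the orthogonality weight as well are assumptions.

```python
def loss_weights(epoch):
    """Return loss coefficients for a 1-indexed epoch (1..100)."""
    if not 1 <= epoch <= 100:
        raise ValueError("epoch must be in 1..100")
    if epoch <= 10:
        # Stage 1: spatial prior initialization, reaction loss dominates.
        return {"react": 1.0, "phys": 0.2, "illum": 0.0, "ortho": 0.0}
    if epoch <= 25:
        # Stage 2: physiological signal learning.
        return {"react": 0.3, "phys": 1.0, "illum": 0.0, "ortho": 0.0}
    # Stage 3: full multi-task refinement.
    # Assumption: react/phys are carried over and ortho shares the 0.2 value.
    return {"react": 0.3, "phys": 1.0, "illum": 0.2, "ortho": 0.2}

def total_loss(losses, epoch):
    """Weighted sum of per-head losses, given as a dict with the same keys."""
    weights = loss_weights(epoch)
    return sum(weights[name] * losses[name] for name in weights)

# Usage: the same raw losses are weighted differently in each stage.
raw_losses = {"react": 1.0, "phys": 1.0, "illum": 1.0, "ortho": 1.0}
stage_totals = [total_loss(raw_losses, e) for e in (5, 20, 60)]
```

Keeping the schedule in one function makes it easy to rerun the sensitivity analysis by changing a single coefficient per stage.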
4.5. Computational Complexity
4.6. Error Distribution and Agreement Analysis
4.7. Failure and Challenging Case Analysis
- (a) Transient Rapid Movement in Low Illumination: In Figure 7a, a sharp transient spike is observed near frame 250. This corresponds to a sudden, high-magnitude head movement under low-light conditions. Such movements are hard to handle because the targeted skin region and its light reflection both change abruptly.
- (b) Continuous Movement and Phase Drift: Figure 7b demonstrates the effect of sustained motion in low illumination. The predicted signal maintains the correct periodicity but exhibits phase drift, falling out of alignment with some of the ground-truth peaks. This suggests that continuous motion in compressed video streams can cause the Factorized Attention mechanism to struggle with maintaining exact peak correspondence.
- (c) HRV Tracking under Compounded Noise: In Figure 7c, the model faces a combination of camera motion, low illumination, and high heart rate variability (HRV). The ground truth shows rapid changes in the inter-beat interval. TH-STT tracks the general trend, but the compounded noise produces a damping effect in which some peak magnitudes are slightly underestimated.
4.8. Generalization Analysis
- With Dynamic Anchor Locking (DAL), the model learns to isolate lighting noise, which helps the illumination head adapt quickly to the lighting conditions of a new dataset.
- The Reaction Head relies on a standardized set of blendshapes. Since human facial muscle movements are anatomically universal, a dynamic spatial gate learned on one group of people can be applied to a completely different population in another dataset.
- The Orthogonality Loss (Lortho) keeps the physiological latent space disentangled from noise. When moving to a new dataset, the model is not confused by new types of background noise because it has already learned that the BVP signal should be environment-independent.
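One common way to express the disentanglement idea behind an orthogonality loss is to penalize the squared cosine similarity between paired latent vectors. The sketch below uses that formulation as an assumption; the paper's exact definition of Lortho is not given in this section, and the names `z_phys` and `z_illum` are hypothetical.

```python
import numpy as np

def orthogonality_loss(z_phys, z_illum, eps=1e-8):
    """Mean squared cosine similarity between paired latent vectors.

    z_phys, z_illum: (B, D) arrays of physiological and illumination features.
    The loss is 0 when every pair is orthogonal and approaches 1 when the
    two heads encode the same direction.
    """
    zp = z_phys / (np.linalg.norm(z_phys, axis=1, keepdims=True) + eps)
    zi = z_illum / (np.linalg.norm(z_illum, axis=1, keepdims=True) + eps)
    cos = np.sum(zp * zi, axis=1)
    return float(np.mean(cos ** 2))

# Usage: orthogonal features give no penalty, identical features a large one.
a = np.array([[1.0, 0.0], [0.0, 2.0]])
b = np.array([[0.0, 3.0], [1.0, 0.0]])
loss_orthogonal = orthogonality_loss(a, b)
loss_identical = orthogonality_loss(a, a)
```

Squaring the cosine makes the penalty insensitive to sign, so anti-correlated features are discouraged as strongly as correlated ones.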
4.9. Detailed Noise Robustness Comparison
4.10. Research Limitations and Future Directions
4.10.1. Limitations
- Compression Sensitivity: High video compression rates can degrade the signal-to-noise ratio. While the DAL head mitigates environmental noise, the loss of subtle intensity variations due to quantization in heavily compressed streams remains a challenge for capturing sub-millimeter physiological changes. Additionally, heavy compression impairs the reaction head's ability to detect the facial reactions used to weight the face ROIs.
- Extreme Pose Variations: While the RDG head effectively manages standard facial expressions and minor rotations, extreme rigid motion (e.g., head turns exceeding 45°) can lead to temporary phase drift as spatial tokens lose correspondence with the underlying tissue.
- Computational Complexity for Edge Deployment: The current implementation uses a ViT-Small backbone. Although the proposed model is efficient on workstation-class GPUs, its computational overhead and memory footprint may limit real-time execution on mobile devices or embedded systems without further optimization.
- Unstable Camera: Some limitations cannot be solved within the current framework. A stable camera position is crucial: moving objects can be tracked, but compensating for an unstable camera is extremely difficult.
4.10.2. Future Work
- Lightweight Architectures: We aim to explore the transition from ViT-Small to a highly optimized ViT-Tiny or MobileViT backbone. We plan to use knowledge distillation to transfer the rich spatio-temporal features of TH-STT into a lighter model with far fewer parameters.
- Edge-Device Optimization: Future iterations will apply TensorRT or CoreML optimizations to take advantage of hardware acceleration on mobile processing units. This would allow the system to operate in real time on standard smartphones and make it viable for telemedicine applications.
- Multi-Parameter Physiological Sensing: We plan to extend the triple-head model to estimate additional biometrics such as respiratory rate (RR) and oxygen saturation (SpO2).
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Mendelson, Y.; Ochs, B.D. Noninvasive Pulse Oximetry Utilizing Skin Reflectance Photoplethysmography. IEEE Trans. Biomed. Eng. 1988, 35, 798–805.
2. Allen, J. Photoplethysmography and Its Application in Clinical Physiological Measurement. Physiol. Meas. 2007, 28, R1.
3. Poh, M.-Z.; McDuff, D.J.; Picard, R.W. Non-Contact, Automated Cardiac Pulse Measurements Using Video Imaging and Blind Source Separation. Opt. Express 2010, 18, 10762–10774.
4. Verkruysse, W.; Svaasand, L.O.; Nelson, J.S. Remote Plethysmographic Imaging Using Ambient Light. Opt. Express 2008, 16, 21434.
5. Niu, X.; Zhao, X.; Han, H.; Das, A.; Dantcheva, A.; Shan, S.; Chen, X. Robust Remote Heart Rate Estimation from Face Utilizing Spatial-Temporal Attention. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 14–18 May 2019; pp. 1–8.
6. Yu, Z.; Li, X.; Zhao, G. Remote Photoplethysmograph Signal Measurement from Facial Videos Using Spatio-Temporal Networks. arXiv 2019, arXiv:1905.02419.
7. McDuff, D.; Gontarek, S.; Picard, R.W. Improvements in Remote Cardiopulmonary Measurement Using a Five Band Digital Camera. IEEE Trans. Biomed. Eng. 2014, 61, 2593–2601.
8. Seo, H.; Kim, S.; Lee, E.C. Estimation of Respiratory Signals from Remote Photoplethysmography of RGB Facial Videos. Electronics 2025, 14, 2152.
9. Wioleta, S. Using Physiological Signals for Emotion Recognition. In Proceedings of the 2013 6th International Conference on Human System Interactions (HSI), Sopot, Poland, 6–8 June 2013; pp. 556–561.
10. Cheng, C.-H.; Wong, K.-L.; Chin, J.-W.; Chan, T.-T.; So, R.H.Y. Deep Learning Methods for Remote Heart Rate Measurement: A Review and Future Research Agenda. Sensors 2021, 21, 6296.
11. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
12. de Haan, G.; Jeanne, V. Robust Pulse Rate from Chrominance-Based rPPG. IEEE Trans. Biomed. Eng. 2013, 60, 2878–2886.
13. Wang, W.; den Brinker, A.C.; Stuijk, S.; de Haan, G. Algorithmic Principles of Remote PPG. IEEE Trans. Biomed. Eng. 2017, 64, 1479–1491.
14. Chen, W.; McDuff, D. DeepPhys: Video-Based Physiological Measurement Using Convolutional Attention Networks. In Computer Vision–ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11206, pp. 356–373. ISBN 978-3-030-01215-1.
15. Liu, X.; Fromm, J.; Patel, S.; McDuff, D. Multi-Task Temporal Shift Attention Networks for On-Device Contactless Vitals Measurement. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 19400–19411.
16. Yu, Z.; Shen, Y.; Shi, J.; Zhao, H.; Torr, P.; Zhao, G. PhysFormer: Facial Video-Based Physiological Measurement with Temporal Difference Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 4176–4186.
17. Yu, Z.; Shen, Y.; Shi, J.; Zhao, H.; Cui, Y.; Zhang, J.; Torr, P.; Zhao, G. PhysFormer++: Facial Video-Based Physiological Measurement with SlowFast Temporal Difference Transformer. Int. J. Comput. Vis. 2023, 131, 1307–1330.
18. Li, J.; Guo, S.; Tang, L.; Cui, C.; Kong, L.; Yang, X. VidFormer: A Novel End-to-End Framework Fused by 3DCNN and Transformer for Video-Based Remote Physiological Measurement. arXiv 2025, arXiv:2501.01691.
19. Zou, B.; Guo, Z.; Chen, J.; Zhuo, J.; Huang, W.; Ma, H. RhythmFormer: Extracting Patterned rPPG Signals Based on Periodic Sparse Attention. Pattern Recognit. 2025, 164, 111511.
20. Savic, M.; Zhao, G. RS-rPPG: Robust Self-Supervised Learning for rPPG. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), Istanbul, Turkiye, 27 May 2024; pp. 1–10.
21. Li, X.; Chen, J.; Zhao, G.; Pietikainen, M. Remote Heart Rate Measurement From Face Videos Under Realistic Situations. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 4264–4271.
22. Shao, H.; Luo, L.; Qian, J.; Yan, M.; Chen, S.; Yang, J. Remote Photoplethysmography in Real-World and Extreme Lighting Scenarios. arXiv 2025, arXiv:2503.11465.
23. Bobbia, S.; Macwan, R.; Benezeth, Y.; Mansouri, A.; Dubois, J. Unsupervised Skin Tissue Segmentation for Remote Photoplethysmography. Pattern Recognit. Lett. 2019, 124, 82–90.
24. Hu, R.; Singh, A. UniT: Multimodal Multitask Learning with a Unified Transformer. arXiv 2021, arXiv:2102.10772.
25. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A Video Vision Transformer. arXiv 2021, arXiv:2103.15691.
26. Wang, Y.; Chen, X.; Cao, L.; Huang, W.; Sun, F.; Wang, Y. Multimodal Token Fusion for Vision Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 12176–12185.
27. Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; Korhonen, A., Traum, D., Màrquez, L., Eds.; Association for Computational Linguistics: Florence, Italy, July 2019; pp. 6558–6569.
28. Channel Attention and Spatial Attention (Woo et al., 2018). (A) Channel. Available online: https://www.researchgate.net/figure/Channel-attention-and-spatial-attention-Woo-et-al-2018-A-Channel-Attention-B_fig7_381589258 (accessed on 16 March 2026).
29. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollar, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 15979–15988.
30. Bandara, W.G.C.; Patel, N.; Gholami, A.; Nikkhah, M.; Agrawal, M.; Patel, V.M. AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14507–14517.
31. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.-L.; Yong, M.G.; Lee, J.; et al. MediaPipe: A Framework for Building Perception Pipelines. arXiv 2019, arXiv:1906.08172.
32. Liehr, P. Uncovering a Hidden Language: The Effects of Listening and Talking on Blood Pressure and Heart Rate. Arch. Psychiatry Nurs. 1992, 6, 306–311.
33. Shookster, D.; Lindsey, B.; Cortes, N.; Martin, J.R. Accuracy of Commonly Used Age-Predicted Maximal Heart Rate Equations. Int. J. Exerc. Sci. 2020, 13, 1242–1250.
34. Levenson, R.W.; Ekman, P.; Friesen, W.V. Voluntary Facial Action Generates Emotion-Specific Autonomic Nervous System Activity. Psychophysiology 1990, 27, 363–384.
35. Convex Optimization. Available online: https://www.cambridge.org/universitypress/subjects/statistics-probability/optimization-or-and-risk/convex-optimization (accessed on 13 March 2026).
36. Zhang, Z.; Luo, P.; Loy, C.C.; Tang, X. Facial Landmark Detection by Deep Multi-Task Learning. In Proceedings of the Computer Vision–ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 94–108.
37. Niu, X.; Yu, Z.; Han, H.; Li, X.; Shan, S.; Zhao, G. Video-Based Remote Physiological Measurement via Cross-Verified Feature Disentangling. In Proceedings of the Computer Vision–ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 295–310.
38. Niu, X.; Han, H.; Shan, S.; Chen, X. VIPL-HR: A Multi-Modal Database for Pulse Estimation from Less-Constrained Face Video. In Proceedings of the Computer Vision–ACCV 2018; Jawahar, C.V., Li, H., Mori, G., Schindler, K., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 562–576.
39. Heusch, G.; Anjos, A.; Marcel, S. A Reproducible Study on Remote Heart Rate Measurement. arXiv 2017, arXiv:1709.00962.







| Model | Architecture | Noise Robustness Mechanism | Robustness Type |
|---|---|---|---|
| PhysFormer [16] | Dense Transformer | Temporal Difference (TD) features | Implicit |
| RhythmFormer [19] | Hierarchical periodic Transformer | Periodic Query/Refinement | Implicit |
| VidFormer [18] | Global Temporal Attention + CNN | Spatio-temporal redundancy | Implicit |
| Shao et al. [22] | Swin Transformer | Global Interference Sharing & Background Ref. | Explicit |
| TH-STT (Ours) | Sparse ST-Transformer | RDG, DAL Anchor, & AFW | Hybrid (Implicit + Explicit) |
| Architecture | Motion Handling | Illumination Handling | Post-Processing | Integration Logic |
|---|---|---|---|---|
| MTTS-CAN/TS-CAN [15] | Temporal Shift | Global Average | Static Bandpass (0.75–2.5 Hz) | Passive (Multi-task Loss) |
| PhysNet [6] | 3D-Convolutions | Normalized Input | Static Bandpass | Passive (Feature-Fusion) |
| Shao et al. [22] | Landmark-based STMap | Global Sharing/Background | Static Bandpass | Softmax Similarity |
| VidFormer [18] | Dense ST-Attention | Generic Attention | Static Bandpass | End-to-End Mapping |
| TH-STT (Ours) | RDG (Reaction) | DAL (Anchor) | AFW (Adaptive) | Active (Closed-Loop) |
| Dataset | Videos | Subjects | Main Challenge | Illumination Source | Compression | Ground Truth |
|---|---|---|---|---|---|---|
| UBFC-rPPG | 42 | 42 | Spontaneous reaction | Indoor | No compression (data rate ~220 Mbps) | CMS50E (PPG) |
| VIPL-HR | 2378 | 107 | Sensor and scale diversity, motion | Indoor/dark/bright | Moderate (data rate ~5.17 Mbps) | BVP/ECG |
| COHFACE | 160 | 40 | Illumination, compression rate | Lab/Natural | High (data rate ~250 kbps) | BVP |
| Method | UBFC MAE (bpm) ↓ | UBFC RMSE (bpm) ↓ | VIPL-HR MAE (bpm) ↓ | VIPL-HR RMSE (bpm) ↓ | COHFACE MAE (bpm) ↓ | COHFACE RMSE (bpm) ↓ |
|---|---|---|---|---|---|---|
| CHROM [12] | 4.06 | 8.83 | 11.37 | 16.99 | 3.82 | 6.8 |
| POS [13] | 4.08 | 7.62 | 10.8 | 14.8 | 3.14 | 10.57 |
| TS-CAN [15] | 1.70 | 2.72 | - | - | - | - |
| DeepPhys [14] | 6.25 | 10.81 | 11.04 | 13.82 | 6.56 | 13.84 |
| PhysNet [6] | 2.95 | 3.67 | 10.8 | 14.8 | 5.38 | 10.76 |
| PhysFormer [16] | 0.52 | 0.71 | 4.97 | 7.79 | - | - |
| PhysFormer++ [17] | 0.51 | 0.69 | 4.88 | 7.62 | - | - |
| RhythmFormer [19] | 0.5 | 0.87 | 4.51 | 7.98 | 1.17 | 3.36 |
| RS-rPPG [20] | - | - | 5.98 | 10.5 | - | - |
| TH-STT (Ours) | 0.43 | 0.63 | 4.65 | 7.23 | 1.08 | 3.15 |
| Configuration | UBFC MAE (bpm) ↓ |
|---|---|
| Baseline (Shared Backbone) | 0.82 |
| +Reaction Head (RDG) | 0.53 |
| +Illumination Head (DAL) | 0.70 |
| Full TH-STT (Without AFW) | 0.437 |
| Full TH-STT (With AFW) | 0.42 |
| Window (frames) | RMSE (bpm) ↓ | FLOPs (G) |
|---|---|---|
| 64 | 8.62 | 32.7 |
| 96 | 7.73 | 46.5 |
| 128 | 7.23 | 61.1 |
| 160 | 7.09 | 83.7 |
| Method | Type | Parameters (M) | FLOPs (M) / Frame | Time / Frame (ms) | RMSE (bpm) ↓ |
|---|---|---|---|---|---|
| DeepPhys | 2D-CNN | 3.91 | 744 | 0.231 | 13.8 |
| TS-CAN | 2D + shift | 3.91 | 744 | 0.230 | 14.59 |
| PhysNet | 3D-CNN | 0.78 | 429 | 0.126 | 14.8 |
| PhysFormer | Transformer | 7.38 | 293 | 0.141 | 7.79 |
| PhysFormer++ | Transformer | 9.79 | 311 | 0.145 | 7.62 |
| RhythmFormer | Transformer | 3.25 | 240 | 0.182 | 7.98 |
| TH-STT (Ours) | Transformer | 14.8 | 476 | 0.219 | 7.23 |
| Method | MAE (bpm) ↓ | RMSE (bpm) ↓ | Pearson (ρ) ↑ |
|---|---|---|---|
| PhysFormer++ | 1.81 | 3.75 | 0.92 |
| RhythmFormer | 1.72 | 3.45 | 0.93 |
| TH-STT (Ours) | 1.54 | 3.17 | 0.92 |
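For reference, the three metrics reported in the table above follow their standard definitions and can be computed as in this short sketch. The function name `hr_metrics` and the example heart-rate values are illustrative and not taken from the paper.

```python
import numpy as np

def hr_metrics(hr_pred, hr_true):
    """Return (MAE, RMSE, Pearson rho) between predicted and reference HR in bpm."""
    pred = np.asarray(hr_pred, dtype=np.float64)
    true = np.asarray(hr_true, dtype=np.float64)
    err = pred - true
    mae = float(np.mean(np.abs(err)))            # mean absolute error
    rmse = float(np.sqrt(np.mean(err ** 2)))     # root mean squared error
    rho = float(np.corrcoef(pred, true)[0, 1])   # Pearson correlation
    return mae, rmse, rho

# Usage with made-up per-video heart rates (bpm).
mae, rmse, rho = hr_metrics([70, 80, 90], [72, 80, 88])
```

MAE and RMSE are lower-is-better error measures in bpm, while the Pearson coefficient measures how well the predictions follow the trend of the reference values regardless of a constant offset or scale.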
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Mehrez, A.; Alsammak, A.; El-Mashad, S.Y. Remote Photoplethysmography Using Triple-Head Spatio-Temporal Transformer with Reaction-Driven Gating and Illumination Separation. Sensors 2026, 26, 3490. https://doi.org/10.3390/s26113490

