Search Results (146)

Search Parameters:
Keywords = temporal video segmentation

20 pages, 34702 KB  
Article
rePPG: Relighting Photoplethysmography Signal to Video
by Seunghyun Kim, Yeongje Park, Byeongseon An and Eui Chul Lee
Biomimetics 2026, 11(4), 230; https://doi.org/10.3390/biomimetics11040230 - 1 Apr 2026
Viewed by 352
Abstract
Remote photoplethysmography (rPPG) extracts physiological signals from facial videos by analyzing subtle skin color variations caused by blood flow. While this technology enables contactless health monitoring, it also raises privacy concerns because facial videos reveal both identity and sensitive biometric information. Existing privacy-preserving techniques, such as blurring or pixelation, degrade visual quality and are unsuitable for practical rPPG applications. This paper presents rePPG, a framework that inserts a desired rPPG signal into facial videos while preserving the original facial appearance. The proposed method disentangles facial appearance and physiological features, enabling replacement of the physiological signal without altering facial identity or visual quality. Skin segmentation restricts modifications to skin regions, and a cycle-consistency mechanism ensures that the injected rPPG signal can be reliably recovered from the generated video. Importantly, the extracted rPPG signals are evaluated against the injected target physiological signals rather than the subject’s original physiological state, ensuring that the evaluation measures signal rewriting accuracy. Experiments on the PURE and UBFC datasets show that rePPG successfully embeds target PPG signals, achieving 1.10 BPM MAE and 95.00% PTE6 on PURE while preserving visual quality (PSNR 24.61 dB, SSIM 0.638). Heart rate metrics are computed using a 5-second temporal window to ensure a consistent evaluation protocol.
(This article belongs to the Special Issue Bio-Inspired Signal Processing on Image and Audio Data)
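
The reported MAE and PTE6 are window-level heart-rate metrics. A minimal sketch, assuming FFT peak picking over non-overlapping 5 s windows, of how such metrics are commonly computed from a predicted and a target PPG trace; the function names and the 0.7–4 Hz band are illustrative choices, not taken from the paper:

```python
import numpy as np

def hr_from_window(sig, fs):
    """Estimate heart rate (BPM) as the dominant FFT peak in the 0.7-4 Hz band."""
    sig = sig - np.mean(sig)
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fs)
    power = np.abs(np.fft.rfft(sig)) ** 2
    band = (freqs >= 0.7) & (freqs <= 4.0)   # ~42-240 BPM
    return 60.0 * freqs[band][np.argmax(power[band])]

def mae_and_pte6(pred, target, fs=30, win_s=5):
    """MAE (BPM) and PTE6 (% of windows within 6 BPM) over non-overlapping windows."""
    n = int(win_s * fs)
    errors = []
    for start in range(0, min(len(pred), len(target)) - n + 1, n):
        hp = hr_from_window(pred[start:start + n], fs)
        ht = hr_from_window(target[start:start + n], fs)
        errors.append(abs(hp - ht))
    errors = np.array(errors)
    return errors.mean(), 100.0 * np.mean(errors <= 6.0)
```

PTE6 here is read as the percentage of windows whose heart-rate error is at most 6 BPM, matched against the injected target signal rather than the subject's own physiology.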

6 pages, 372 KB  
Proceeding Paper
Performance Analysis of Hammer Throwers Integrating Inertial Measurement Unit and IoT
by Li-Chun Yu and Hao-Lun Huang
Eng. Proc. 2026, 134(1), 24; https://doi.org/10.3390/engproc2026134024 - 31 Mar 2026
Viewed by 163
Abstract
Hammer throw is a complex discipline requiring strength, refined technique, and precise inter-segmental coordination. We developed an IoT-enabled system with inertial measurement units (IMUs) to provide objective, real-time analytics for coaches and athletes. IMUs were mounted on the hip, knee, and ankle to capture tri-axial acceleration and angular velocity during the throwing action. Data were streamed wirelessly and processed to extract rotation rate profiles, joint coordination metrics, and temporal events (winds, turns, and release). Two collegiate athletes performed 10 throws, and the results were compared with video-based analysis. The IMU system captured finer-grained variations in angular velocity and acceleration during rapid rotation phases and achieved an accuracy of 93.5% in classifying higher- and lower-quality throws using cross-validated models. The developed system enables quantitative feedback and continuous progress tracking in training. The demonstrated feasibility of IMU + IoT integration for hammer throw performance analysis provides a foundation for AI-assisted, on-field decision support.
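
A minimal sketch of the kind of pipeline described: per-throw summary features from tri-axial IMU signals fed to a cross-validated classifier. The feature set and the random-forest choice are assumptions for illustration, not the paper's exact model:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def throw_features(acc, gyro):
    """Summary statistics for one throw; acc/gyro are (T, 3) arrays from one IMU."""
    a_mag = np.linalg.norm(acc, axis=1)
    g_mag = np.linalg.norm(gyro, axis=1)
    return np.array([a_mag.max(), a_mag.mean(), a_mag.std(),
                     g_mag.max(), g_mag.mean(), g_mag.std()])

# X: one feature row per throw (features possibly concatenated across the
# hip, knee, and ankle IMUs); y: 1 = higher-quality throw, 0 = lower-quality,
# with labels derived from video-based analysis.
def evaluate(X, y):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)  # cross-validated accuracy
    return scores.mean()
```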

21 pages, 743 KB  
Article
BEATSCORE: Beat-Synchronous Contrastive Alignment and Event-Centric Grading for Long-Term Sports Assessment
by Lijie Wang, Jianyong Zhu, Houlei Wang and Xiaochao Li
Sensors 2026, 26(7), 2157; https://doi.org/10.3390/s26072157 - 31 Mar 2026
Viewed by 213
Abstract
Long-term sports assessment is a challenging task in video understanding, since it requires judging subtle movement variations over minutes and evaluating action–music coordination. However, in many sporting events the background music is only weakly related to the performed movements, and the cues that matter for synchrony are often temporal and structural, such as small phase or tempo deviations that occur around decisive moments, rather than semantic correspondences between audio content and action categories. Prior approaches typically rely on implicit cross-modal fusion over dense sequences to learn such weak associations, which can smooth out near-miss misalignment and become brittle under tempo or phase shifts. To address this issue, we propose BEATSCORE, a beat-guided audio–visual learning framework that explicitly models action–music alignment at the beat level and performs event-centric sparse grading for long videos. In our framework, we first convert audio and motion into beat-synchronous tokens, enabling direct comparison on a unified rhythmic timeline. We then introduce a beat-level contrastive objective with near-offset hard negatives to sharpen sensitivity to misalignment. To handle the sparsity of decisive moments, we further design an event proposal and grading module that scores a small set of key segments and aggregates them via learnable multiple-instance pooling into a final assessment score. We evaluate BEATSCORE on public long-term sports benchmarks to demonstrate improved accuracy with competitive efficiency.
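
A toy sketch of a beat-level contrastive objective with near-offset hard negatives, assuming both modalities have already been converted to beat-synchronous token sequences; the temperature, offset set, and use of torch.roll are illustrative assumptions, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def beat_contrastive_loss(audio_tok, motion_tok, offsets=(1, 2), tau=0.07):
    """InfoNCE over beat-synchronous tokens (N, D) from one video.
    Positive: audio/motion tokens of the same beat. Negatives: motion tokens
    shifted by a few beats (near-offset hard negatives)."""
    a = F.normalize(audio_tok, dim=-1)
    m = F.normalize(motion_tok, dim=-1)
    pos = (a * m).sum(-1) / tau                       # (N,) aligned similarity
    negs = []
    for k in offsets:
        negs.append((a * torch.roll(m, k, dims=0)).sum(-1) / tau)
        negs.append((a * torch.roll(m, -k, dims=0)).sum(-1) / tau)
    logits = torch.stack([pos] + negs, dim=-1)        # (N, 1 + 2*len(offsets))
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # column 0 = positive
    return F.cross_entropy(logits, labels)
```

Restricting negatives to small beat offsets is what makes the loss sensitive to near-miss phase and tempo deviations rather than coarse semantic mismatch.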

10 pages, 873 KB  
Proceeding Paper
Utilizing Residual Network 50 Convolutional Neural Network Architecture for Enhanced Philippine Regional Language Classification on Jetson Orin Nano
by John Paul T. Cruz, Aaron B. Abadiano, FP O. Sangilan, Emmy Grace T. Requillo and Roben C. Juanatas
Eng. Proc. 2026, 134(1), 2; https://doi.org/10.3390/engproc2026134002 - 26 Mar 2026
Viewed by 286
Abstract
Visual speech recognition systems encounter significant challenges in multilingual nations such as the Philippines, where numerous regional languages, including Cebuano and Ilocano, feature distinct phonetic-visual characteristics. Deep learning models such as the Lip Reading Network and the Lightweight Crowd Segmentation Network have demonstrated strong performance with 3D Convolutional Neural Networks (CNNs). However, their substantial computational requirements restrict deployment on portable edge devices. We introduce a more efficient alternative that integrates a 2D Residual Network 50 architecture with a Long Short-Term Memory network and Connectionist Temporal Classification for lip-reading classification of Philippine regional languages. The proposed model is deployed on the Jetson Orin Nano, a high-performance edge device optimized for real-time inference through Compute Unified Device Architecture acceleration. Using a dataset of 2000 annotated videos encompassing 10 lexicons each for Cebuano and Ilocano, the model’s effectiveness was evaluated. The model achieved a regional language classification accuracy of 90%, with lexicon-level accuracies of 74% for Cebuano and 66% for Ilocano. This work represents a step toward developing accessible and scalable communication aids for deaf communities in linguistically diverse environments, leveraging transfer learning on pretrained models.
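
A compact PyTorch sketch of the described 2D ResNet-50 + LSTM + CTC pipeline; hidden sizes and the CTC wiring are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class LipReader(nn.Module):
    """2D ResNet-50 per-frame encoder -> LSTM -> CTC head over a lexicon vocabulary."""
    def __init__(self, vocab_size):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")   # transfer learning
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.lstm = nn.LSTM(2048, 256, batch_first=True)
        self.fc = nn.Linear(256, vocab_size + 1)       # +1 for the CTC blank

    def forward(self, frames):                          # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).flatten(1)  # (B*T, 2048)
        out, _ = self.lstm(feats.view(b, t, -1))
        return self.fc(out).log_softmax(-1)             # (B, T, vocab+1)

# CTC training step (targets are lexicon index sequences):
# loss = nn.CTCLoss(blank=vocab_size)(logits.permute(1, 0, 2), targets,
#                                     input_lengths, target_lengths)
```

The 2D backbone processes frames independently, which is the source of the efficiency gain over 3D CNNs on an edge device like the Jetson Orin Nano.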

26 pages, 6958 KB  
Article
A Method for Industrial Smoke Video Semantic Segmentation Using DeffNet with Inter-Frame Adaptive Variable Step Size Based on Fuzzy Control
by Jiantao Yang and Hui Liu
Sensors 2026, 26(6), 1949; https://doi.org/10.3390/s26061949 - 20 Mar 2026
Viewed by 225
Abstract
Segmenting non-rigid objects such as smoke in video requires effective utilization of temporal information, which remains challenging due to their irregular deformation and complex appearance variations. Based on our previously proposed DeffNet for industrial fumes video segmentation, this paper presents a novel adaptive frame selection algorithm that employs fuzzy logic control to dynamically optimize the temporal processing step size for the specific task of industrial smoke video segmentation. Our method quantifies inter-frame variation using the Structural Similarity Index (SSIM) and Normalized Cross-Correlation (NCC) as inputs to a fuzzy inference system. Gaussian membership functions, shaped via K-means clustering, and a five-rule fuzzy system are designed to determine the optimal step size, maximizing informative dynamic feature extraction while minimizing redundant computation. As a lightweight front-end module, the algorithm integrates seamlessly into the existing DeffNet segmentation framework without reconstructing the network architecture. Extensive experiments on a dedicated industrial smoke video dataset demonstrate that our approach effectively improves the segmentation performance of DeffNet, achieving 84.27% Intersection over Union (IoU) while maintaining a high inference speed of 39.71 FPS. This work provides an efficient and scene-specific solution for temporal modeling in industrial smoke non-rigid object segmentation and offers a practical strategy for improving DeffNet in real-time industrial smoke monitoring.
(This article belongs to the Special Issue AI-Based Visual Sensing for Object Detection)
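
A toy version of the described fuzzy step-size control: inter-frame SSIM and NCC drive Gaussian memberships whose weighted consequents yield the frame step. The membership centers, the two-input averaging, and the three-rule table are simplified assumptions; the paper shapes its memberships with K-means and uses a five-rule system:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def ncc(f1, f2):
    """Normalized cross-correlation between two grayscale float frames."""
    a, b = f1 - f1.mean(), f2 - f2.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def gauss(x, c, s):
    """Gaussian membership value of x for a set centered at c with width s."""
    return np.exp(-0.5 * ((x - c) / s) ** 2)

def fuzzy_step(f1, f2, max_step=8):
    """Toy fuzzy inference: high inter-frame similarity -> larger step
    (skip redundant frames); low similarity -> step of 1."""
    s = 0.5 * (ssim(f1, f2, data_range=f1.max() - f1.min()) + ncc(f1, f2))
    low, med, high = gauss(s, 0.3, 0.15), gauss(s, 0.6, 0.15), gauss(s, 0.9, 0.15)
    # centroid-style defuzzification: weighted average of rule consequents
    step = (low * 1 + med * (max_step // 2) + high * max_step) / (low + med + high)
    return max(1, int(round(step)))
```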

25 pages, 9628 KB  
Article
Real-Time Endoscopic Video Enhancement via Degradation Representation Estimation and Propagation
by Handing Xu, Zhenguo Nie, Tairan Peng and Xin-Jun Liu
J. Imaging 2026, 12(3), 134; https://doi.org/10.3390/jimaging12030134 - 16 Mar 2026
Viewed by 369
Abstract
Endoscopic images are often degraded by uneven illumination, motion blur, and tissue occlusion, which obscure critical anatomical details and complicate surgical manipulation. This issue is particularly pronounced in single-port endoscopic surgery, where the imaging capability of the camera is further constrained by limited working space. While deep learning-based enhancement methods have demonstrated impressive performance, most existing approaches remain too computationally demanding for real-time surgical use. To address this challenge, we propose an efficient stepwise endoscopic image enhancement framework that introduces an implicit degradation representation as an intermediate feature to guide the enhancement module toward high-quality results. The framework further exploits the temporal continuity of endoscopic videos, based on the assumption that image degradation evolves smoothly over short time intervals. Accordingly, high-quality degradation representations are estimated only on key frames at fixed intervals, while the representations for the remaining frames are obtained through fast inter-frame propagation, thereby significantly improving computational efficiency while maintaining enhancement quality. Experimental results demonstrate that our method achieves an excellent balance between enhancement quality and computational efficiency. Further evaluation on the downstream segmentation task suggests that our method substantially enhances the understanding of the surgical scene, validating that implicit degradation representation learning and propagation offer a practical pathway for real-time clinical application.
(This article belongs to the Section Medical Imaging)
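
A schematic sketch of the key-frame estimation / inter-frame propagation loop, with estimator, propagator, and enhancer as placeholder callables standing in for the paper's modules; the fixed interval of 8 is an arbitrary illustrative value:

```python
import torch

def enhance_video(frames, estimator, propagator, enhancer, key_interval=8):
    """Estimate a degradation representation only on key frames; for other
    frames, propagate the previous representation cheaply (illustrative)."""
    outputs, rep = [], None
    for i, frame in enumerate(frames):          # frame: (1, 3, H, W) tensor
        if i % key_interval == 0 or rep is None:
            rep = estimator(frame)              # expensive, key frames only
        else:
            rep = propagator(rep, frame)        # fast inter-frame update
        outputs.append(enhancer(frame, rep))    # representation-guided enhancement
    return outputs
```

The smooth-degradation assumption is what lets the cheap propagator replace the full estimator on most frames.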

19 pages, 1198 KB  
Article
GSMTNet: Dual-Stream Video Anomaly Detection via Gated Spatio-Temporal Graph and Multi-Scale Temporal Learning
by Di Jiang, Huicheng Lai, Guxue Gao, Dan Ma and Liejun Wang
Electronics 2026, 15(6), 1200; https://doi.org/10.3390/electronics15061200 - 13 Mar 2026
Viewed by 304
Abstract
Video Anomaly Detection aims to identify video segments containing abnormal events. Detecting such anomalies relies heavily on temporal modeling, particularly when anomalies exhibit only subtle deviations from normal events. However, most existing methods inadequately model the heterogeneity in spatiotemporal relationships, especially the dynamic interactions between human pose and video appearance. To address this, we propose GSMTNet, a dual-stream heterogeneous unsupervised network integrating gated spatio-temporal graph convolution and multi-scale temporal learning. First, we introduce a dynamic graph structure learning module, which leverages gated spatio-temporal graph convolutions with manifold transformations to model latent spatial relationships via human pose graphs. This is coupled with a normalizing flow-based density estimation module to model the probability distribution of normal samples in a latent space. Second, we design a hybrid dilated temporal module that employs multi-scale temporal feature learning to simultaneously capture long- and short-term dependencies, thereby enhancing the separability between normal patterns and potential deviations. Finally, we propose a dual-stream fusion module to hierarchically integrate features learned from pose graphs and raw video sequences, followed by a prediction head that computes anomaly scores from the fused features. Extensive experiments demonstrate state-of-the-art performance, achieving 86.81% AUC on ShanghaiTech and 70.43% on UBnormal, outperforming existing methods in rare anomaly scenarios.
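
A minimal PyTorch sketch of a hybrid dilated temporal module of the kind described: parallel dilated 1D convolutions fused by a 1×1 convolution. Channel counts and dilation rates are illustrative assumptions, not GSMTNet's exact design:

```python
import torch
import torch.nn as nn

class HybridDilatedTemporal(nn.Module):
    """Parallel 1D convolutions with increasing dilation capture short- and
    long-term dependencies; branch outputs are fused by a 1x1 convolution."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=d, dilation=d) for d in dilations)
        self.fuse = nn.Conv1d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):                       # x: (B, C, T) feature sequence
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```

With kernel size 3, setting padding equal to the dilation keeps every branch's output the same length as the input, so the concatenation is well-formed.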

23 pages, 2115 KB  
Review
Artificial Intelligence in Cardiovascular Imaging: From Automated Acquisition to Precision Diagnostics and Clinical Decision Support
by Minodora Teodoru, Alexandra-Kristine Tonch-Cerbu, Dragoș Cozma, Cristina Văcărescu, Raluca-Daria Mitea, Florina Batâr, Horea-Laurentiu Onea, Florin-Leontin Lazăr and Alina Camelia Cătană
Med. Sci. 2026, 14(1), 132; https://doi.org/10.3390/medsci14010132 - 11 Mar 2026
Viewed by 501
Abstract
Cardiovascular imaging is a cornerstone of modern cardiology, yet its clinical impact is limited by operator dependence, inter-observer variability, time-consuming workflows, and unequal access to advanced expertise. Artificial intelligence (AI), particularly machine learning and deep learning, offers new opportunities to overcome these limitations. This review aims to summarize current and emerging AI applications in cardiovascular imaging and to evaluate their potential clinical value in precision diagnostics and decision support. This narrative review synthesizes clinically relevant literature on AI applications across major cardiovascular imaging modalities, including echocardiography, cardiovascular magnetic resonance, cardiac computed tomography, and nuclear cardiology. Evidence was analyzed with a focus on AI-enabled acquisition support, image segmentation, quantitative and functional assessment, workflow automation, and risk stratification, alongside key methodological and implementation considerations. Across imaging modalities, AI-driven approaches have demonstrated improved reproducibility, efficiency, and scalability of cardiovascular imaging workflows. Automated algorithms reduce operator dependence, facilitate standardized extraction of imaging biomarkers, and support advanced functional assessment and prognostic stratification. Recent developments in video-based, temporal, and multimodal models further expand AI capabilities from technical automation toward integrated disease phenotyping and personalized clinical decision support. However, translation into routine practice remains limited by heterogeneous datasets, insufficient external validation, algorithmic bias, limited interpretability, and challenges related to regulatory approval and workflow integration. Artificial intelligence has the potential to reshape cardiovascular imaging into a more efficient, reproducible, and patient-centered precision medicine tool. Real-world clinical impact will depend on outcome-driven evaluation, robust external validation, multimodal data integration, and human-in-the-loop implementation strategies that ensure safe, equitable, and clinically meaningful adoption.
(This article belongs to the Special Issue Artificial Intelligence (AI) in Cardiovascular Medicine)

27 pages, 1334 KB  
Article
ETR: Event-Centric Temporal Reasoning for Question-Conditioned Video Question Answering
by Lingmin Pan, Ziyi Gao, Yueming Zhu, Fuchen Chen, Chengyuan Zhang, Dan Yin, Yong Cai, Siqiao Tan and Lei Zhu
Mathematics 2026, 14(5), 913; https://doi.org/10.3390/math14050913 - 7 Mar 2026
Viewed by 432
Abstract
Video Question Answering (VideoQA) requires a deep understanding of dynamic video content, integrating spatial reasoning, temporal dependencies, and language comprehension. Existing methods often struggle with long or semantically complex videos due to the lack of question-guided keyframe weight adjustment and the absence of question-aligned cross-modal description generation. To address these challenges, we propose ETR (Event-centric Temporal Reasoning), an adaptive framework for VideoQA. ETR introduces three key mechanisms: (i) a hierarchical weight adjustment selector to identify questions requiring event-centric temporal reasoning; (ii) a T-Route that segments videos into semantically coherent events and dynamically adjusts keyframe weights with question intent; and (iii) a question-conditioned prompting strategy that focuses on key objects to generate textual prompts aligned with a question’s semantics. This hierarchical and adaptive design effectively balances visual and textual information, enhances temporal reasoning, and improves object-centric alignment. Experiments on two datasets demonstrate that ETR achieves competitive performance in fine-grained, question-aware VideoQA.
(This article belongs to the Special Issue Structural Networks for Image Application)
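
A small sketch of question-guided keyframe weighting, one plausible reading of the mechanism described: frame embeddings are weighted by softmaxed similarity to the question embedding. The temperature and the pooling step are illustrative assumptions, not ETR's actual T-Route:

```python
import torch
import torch.nn.functional as F

def question_keyframe_weights(frame_emb, question_emb, tau=0.1):
    """Weight frames by similarity to the question embedding so that
    reasoning focuses on question-relevant events.
    frame_emb: (T, D) per-frame embeddings; question_emb: (D,)."""
    sims = F.cosine_similarity(frame_emb, question_emb.unsqueeze(0), dim=-1)
    weights = torch.softmax(sims / tau, dim=0)           # (T,) keyframe weights
    pooled = (weights.unsqueeze(-1) * frame_emb).sum(0)  # question-aware video vector
    return weights, pooled
```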

18 pages, 1637 KB  
Article
Spatio-Temporal Capsule Networks for Weakly Supervised Surveillance Video Anomaly Detection
by Mohammed Iqbal Dohan Almurumudhe and Olivér Hornyák
Appl. Sci. 2026, 16(5), 2567; https://doi.org/10.3390/app16052567 - 7 Mar 2026
Viewed by 311
Abstract
Real surveillance systems require weakly supervised video anomaly detection because long untrimmed videos do not always have accurate temporal labels. Models must label a video as normal or abnormal and also localize sparse anomalous segments using only video-level supervision. In this paper, we introduce ST-CapsNet, a spatio-temporal capsule network that enhances weakly supervised localization of anomalies by using a structured representation and temporal agreement. Each video is divided into 32 segments encoded with 512-dimensional 3D CNN (Convolutional Neural Network) features. Primary capsules encode segment patterns as vectors, and temporal capsules are formed by dynamic routing over time, allowing related abnormal segments to support a common event representation. Training follows a multiple-instance learning formulation combining a bag-level BCE (Binary Cross-Entropy) loss, a ranking loss separating abnormal from normal bags, and smoothness and sparsity regularization to enforce temporal consistency and sparse event behavior. Experiments on the weakly supervised FAST (Focused and Accelerated Subset Training) split of UCF-Crime demonstrate that ST-CapsNet outperforms strong baselines. The findings indicate that capsule routing is an effective component for temporal reasoning in weakly supervised surveillance anomaly detection.
(This article belongs to the Section Computing and Artificial Intelligence)
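
A minimal sketch of the described multiple-instance objective over 32 per-segment anomaly scores; the top-1 bag score, margin value, and regularizer weights are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def wsvad_loss(scores_abn, scores_nrm, lam_s=8e-4, lam_sp=8e-3, margin=1.0):
    """Weakly supervised MIL objective over per-segment sigmoid anomaly scores
    (B, 32): bag-level BCE on the max score, a ranking margin between abnormal
    and normal bags, plus temporal smoothness and sparsity regularizers."""
    top_abn = scores_abn.max(1).values          # bag score of abnormal videos
    top_nrm = scores_nrm.max(1).values          # bag score of normal videos
    bce = F.binary_cross_entropy(top_abn, torch.ones_like(top_abn)) + \
          F.binary_cross_entropy(top_nrm, torch.zeros_like(top_nrm))
    rank = F.relu(margin - top_abn + top_nrm).mean()
    smooth = ((scores_abn[:, 1:] - scores_abn[:, :-1]) ** 2).sum(1).mean()
    sparse = scores_abn.sum(1).mean()
    return bce + rank + lam_s * smooth + lam_sp * sparse
```

The smoothness term penalizes jitter between adjacent segments while the sparsity term encodes the prior that anomalies occupy only a few of the 32 segments.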

22 pages, 3288 KB  
Article
An Intelligent Real-Time System for Sentence-Level Recognition of Continuous Saudi Sign Language Using Landmark-Based Temporal Modeling
by Adel BenAbdennour, Mohammed Mukhtar, Osama Almolike, Bilal A. Khawaja and Abdulmajeed M. Alenezi
Sensors 2026, 26(5), 1652; https://doi.org/10.3390/s26051652 - 5 Mar 2026
Viewed by 447
Abstract
A persistent challenge for Deaf and Hard-of-Hearing individuals is the communication gap between sign language users and the hearing community, particularly in regions with limited automated translation resources. In Saudi Arabia, this gap is amplified by the reliance on Saudi Sign Language (SSL) and the scarcity of real-time, sentence-level translation systems. This paper presents a real-time system for sentence-level recognition of continuous SSL and direct mapping to natural spoken Arabic. The proposed system operates end-to-end on live video streams or pre-recorded content, extracting spatio-temporal landmark features using the MediaPipe Holistic framework. For classification, the input feature vector consists of 225 features derived from hand and body pose landmarks. These features are processed by a Bidirectional Long Short-Term Memory (BiLSTM) network trained on the ArabSign (ArSL) dataset to perform direct sentence-level classification over a vocabulary of 50 continuous Arabic sign language sentences, supported by an idle-based segmentation mechanism that enables natural, uninterrupted signing. Experimental evaluation demonstrates robust generalization: under a Leave-One-Signer-Out (LOSO) cross-validation protocol, the model attains a mean sentence-level accuracy of 94.2%, outperforming the fixed signer-independent split baseline of 92.07%, while maintaining real-time performance suitable for interactive use. To enhance linguistic fluency, an optional post-recognition refinement stage is incorporated using a large language model (LLM), followed by text-to-speech synthesis to produce audible Arabic output; this refinement operates strictly as post-processing and is not included in the reported recognition accuracy metrics. The results demonstrate that direct sentence-level modeling, combined with landmark-based feature extraction and real-time segmentation, provides an effective and practical solution for continuous SSL sentence recognition in real time.
(This article belongs to the Special Issue Sensor Systems for Gesture Recognition (3rd Edition))
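
A sketch of assembling the stated 225-dimensional per-frame vector (33 pose + 2 × 21 hand landmarks, each with x, y, z, so 99 + 126 = 225) from MediaPipe Holistic output, followed by a BiLSTM sentence classifier; the hidden size and last-step readout are assumptions for illustration:

```python
import numpy as np
import torch.nn as nn

def landmark_vector(results):
    """Flatten MediaPipe Holistic landmarks into the 225-d frame feature:
    33 pose + 2x21 hand landmarks, (x, y, z) each; zeros when a part is missing."""
    def flat(lms, n):
        if lms is None:
            return np.zeros(n * 3)
        return np.array([[p.x, p.y, p.z] for p in lms.landmark]).flatten()
    return np.concatenate([flat(results.pose_landmarks, 33),
                           flat(results.left_hand_landmarks, 21),
                           flat(results.right_hand_landmarks, 21)])  # (225,)

class SentenceClassifier(nn.Module):
    """BiLSTM over per-frame landmark vectors -> one of 50 sentences."""
    def __init__(self, n_sentences=50):
        super().__init__()
        self.lstm = nn.LSTM(225, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(256, n_sentences)   # 2 x 128 bidirectional states

    def forward(self, x):              # x: (B, T, 225), T set by idle segmentation
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])     # classify from the final time step
```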

21 pages, 1754 KB  
Article
Analysis of the Consensual Pupillary Reflex Using Blue LED Step Light and Automated Image Segmentation
by Edyson R. Torres-Centeno, Erwin J. Sacoto-Cabrera, Roger Jesus Coaquira-Castillo, L. Walter Utrilla Mego, Miguel A. Castillo-Guevara, Yesenia Concha-Ramos and Edison Moreno-Cardenas
Computers 2026, 15(3), 160; https://doi.org/10.3390/computers15030160 - 3 Mar 2026
Viewed by 434
Abstract
This study evaluates the dynamics of the human pupillary reflex in response to a stepped blue light stimulus (465 nm) in young adults residing at high altitude (3400 m above sea level). High-resolution video sequences of three participants were analyzed using four classical image segmentation techniques: K-Means, Otsu, fixed binary threshold, and multi-channel RGB threshold. Rather than proposing new algorithms, this work evaluates the technical feasibility and stability of computationally lightweight segmentation approaches under controlled lighting conditions and with low-cost hardware constraints. Among the methods evaluated, fixed binary thresholding showed stable temporal behavior and minimal computational complexity within the experimental setup. The results show a consistent contraction–plateau–recovery pattern across all participants, with representative contraction, stabilization, and recovery times of 1.89 s, 0.41 s, and 2.33 s, respectively. Although limited by the small sample size, these findings support the feasibility of implementing simplified segmentation strategies for pupillometry in resource-limited settings.
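
A minimal OpenCV sketch of the fixed-binary-threshold approach the study found most stable: pixels darker than a fixed gray level are segmented as pupil and counted per frame. The threshold value and the assumption of a tightly cropped eye region are illustrative:

```python
import cv2
import numpy as np

def pupil_area_trace(video_path, thresh=40):
    """Track pupil area per frame with a fixed binary threshold: the pupil is
    the darkest region, so pixels below `thresh` are segmented and counted."""
    cap = cv2.VideoCapture(video_path)
    areas = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY_INV)
        areas.append(int(np.count_nonzero(mask)))   # pupil area in pixels
    cap.release()
    return areas  # contraction/plateau/recovery timings are read off this trace
```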

22 pages, 5005 KB  
Article
Behavioral Engagement in VR-Based Sign Language Learning: Visual Attention as a Predictor of Performance and Temporal Dynamics
by Davide Traini, José Manuel Alcalde-Llergo, Mariana Buenestado-Fernández, Domenico Ursino and Enrique Yeguas-Bolívar
Multimodal Technol. Interact. 2026, 10(3), 23; https://doi.org/10.3390/mti10030023 - 2 Mar 2026
Viewed by 583
Abstract
Understanding how learners engage with immersive sign language training environments is essential for advancing virtual reality-based education and inclusion. This study analyzes behavioral engagement in SONAR, a virtual reality application designed for sign language training and validation. We focus on three automatically derived engagement indicators (Visual Attention (VA), Video Replay Frequency (VRF), and Post-Playback Viewing Time (PPVT)) and examine their relationship with learning performance in a sample of 117 university students. Participants completed a self-paced Training phase with 12 sign language instructional videos, followed by a Validation quiz assessing retention. We employed Pearson correlation analysis to examine the relationships between engagement indicators and quiz performance, followed by binomial Generalized Linear Model (GLM) regression to assess their joint predictive contributions. Additionally, we conducted temporal analysis by aggregating moment-to-moment VA traces across all learners to characterize engagement dynamics during the learning session. Results show that VA exhibits a strong positive correlation with quiz performance (r = 0.76), followed by PPVT (r = 0.66), whereas VRF shows no meaningful association. A binomial GLM confirms that VA and PPVT are significant predictors of learning success, jointly explaining a substantial proportion of performance variance (pseudo-R² = 0.83). Going beyond outcome-oriented analysis, we characterize temporal engagement patterns by aggregating moment-to-moment VA traces across all learners. The temporal profile reveals distinct attention peaks aligned with informationally dense segments of both training and validation videos, as well as phase-specific engagement dynamics, including initial acclimatization, oscillatory attention cycles during learning, and pronounced attentional peaks during assessment. Together, these findings highlight the central role of sustained and strategically allocated visual attention in VR-based sign language learning and demonstrate the value of behavioral trace data for understanding and predicting learner engagement in immersive environments.
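
A sketch of the reported statistical pipeline: Pearson correlations followed by a binomial GLM over the three indicators, assuming per-learner success/failure counts from the quiz; all variable names are illustrative:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import pearsonr

# va, vrf, ppvt: per-learner engagement indicators (1-D arrays);
# successes, failures: per-learner quiz item counts (1-D float arrays).
def engagement_analysis(va, vrf, ppvt, successes, failures):
    accuracy = successes / (successes + failures)
    for name, x in [("VA", va), ("VRF", vrf), ("PPVT", ppvt)]:
        r, p = pearsonr(x, accuracy)
        print(f"{name}: r = {r:.2f}, p = {p:.3g}")
    # binomial GLM with (successes, failures) as the two-column response
    X = sm.add_constant(np.column_stack([va, vrf, ppvt]))
    glm = sm.GLM(np.column_stack([successes, failures]), X,
                 family=sm.families.Binomial()).fit()
    return glm.summary()
```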

16 pages, 1079 KB  
Article
TDA-Phys: Temporal Difference Adaptation of Video Foundation Model for Remote Photoplethysmography
by Wei Chen, Yinghao Ding, Kunze Bu, Ming Yu and Hang Wu
Appl. Sci. 2026, 16(4), 2038; https://doi.org/10.3390/app16042038 - 19 Feb 2026
Viewed by 378
Abstract
Remote photoplethysmography (rPPG) enables noncontact estimation of vital signs, particularly heart rate, by analyzing subtle periodic skin color variations in facial videos. While deep learning has advanced rPPG signal extraction, existing methods rely on carefully designed task-specific architectures that are costly to develop and generalize poorly. In this work, we demonstrate that the general video foundation model VideoMAE v2 can be effectively adapted to the rPPG signal regression task by introducing only a lightweight adapter, without modifying its pretrained backbone. We freeze the entire VideoMAE v2 encoder and introduce a Temporal Difference Convolutional Adapter to capture the subtle interframe intensity differences. To address the mismatch between VideoMAE v2's short input window (16 frames) and the long temporal context typically required for robust rPPG extraction (e.g., 160 frames), we adopt an overlapping sliding window strategy for segmented inference and reconstruct the full signal through weighted temporal aggregation. On the COHFACE and UBFC-rPPG datasets, our method achieves mean absolute errors (MAEs) of 0.90 and 1.55, reducing the error by more than 55% and 42%, respectively, compared to PhysFormer (2.00 and 2.70). Furthermore, on challenging real-world datasets such as BUAA-MIHR, which features strong illumination variations, and VIPL-HR, which involves significant head movements, our approach achieves MAEs of 6.68 and 8.23, respectively, despite incorporating no task-specific robustness modules. These results demonstrate stable rPPG signal recovery and validate the feasibility of leveraging general video foundation models for physiological signal perception.
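
A minimal sketch of the described overlapping sliding-window inference with weighted temporal aggregation, reconstructing the full trace by tapered overlap-add; the Hann taper and stride are illustrative assumptions, with `model` standing in for the adapted VideoMAE v2 predictor:

```python
import numpy as np

def sliding_window_rppg(video, model, win=16, stride=8):
    """Run a short-window model over overlapping clips and reconstruct the
    full rPPG trace by Hann-weighted overlap-add."""
    T = len(video)
    signal = np.zeros(T)
    weight = np.zeros(T)
    taper = np.hanning(win) + 1e-3        # favor window centers at overlaps
    for s in range(0, T - win + 1, stride):
        pred = model(video[s:s + win])    # (win,) rPPG segment from the model
        signal[s:s + win] += taper * pred
        weight[s:s + win] += taper
    return signal / np.maximum(weight, 1e-8)
```

Overlap-add with a taper suppresses boundary artifacts that would otherwise appear every 16 frames at the window seams.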

27 pages, 7440 KB  
Article
3D Road Defect Mapping via Differentiable Neural Rendering and Multi-Frame Semantic Fusion in Bird’s-Eye-View Space
by Hongjia Xing and Feng Yang
J. Imaging 2026, 12(2), 83; https://doi.org/10.3390/jimaging12020083 - 15 Feb 2026
Viewed by 395
Abstract
Road defect detection is essential for traffic safety and infrastructure maintenance. Existing automated methods based on 2D image analysis lack spatial context and cannot provide the accurate 3D localization required for maintenance planning. We propose a novel framework for road defect mapping from monocular video sequences by integrating differentiable Bird’s-Eye-View (BEV) mesh representation, semantic filtering, and multi-frame temporal fusion. Our differentiable mesh-based BEV representation enables efficient scene reconstruction from sparse observations through MLP-based optimization. The semantic filtering strategy leverages road surface segmentation to eliminate off-road false positives, reducing detection errors by 33.7%. Multi-frame fusion with ray-casting projection and exponential moving average update accumulates defect observations across frames while maintaining 3D geometric consistency. Experimental results demonstrate that our framework produces geometrically consistent BEV defect maps with superior accuracy compared to single-frame 2D methods, effectively handling occlusions, motion blur, and varying illumination conditions.
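
A minimal sketch of the exponential-moving-average fusion step described for multi-frame accumulation, assuming per-frame defect probability maps already projected and aligned in BEV space via ray casting; alpha is an illustrative smoothing factor:

```python
import numpy as np

def ema_fuse(bev_maps, alpha=0.2):
    """Fuse per-frame BEV defect probability maps with an exponential moving
    average, suppressing single-frame false positives while retaining
    persistent defects. bev_maps: iterable of aligned (H, W) arrays in [0, 1]."""
    fused = None
    for m in bev_maps:
        fused = m.copy() if fused is None else (1 - alpha) * fused + alpha * m
    return fused
```

Because transient detections (motion blur, occlusion) appear in only a few frames, the EMA pulls their fused probability toward zero while repeated observations of a real defect keep it high.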
