Search Results (154)

Search Parameters:
Keywords = video pattern recognition

34 pages, 5288 KB  
Article
A Video-Based Mobile Palmprint Dataset and an Illumination-Robust Deep Learning Architecture for Unconstrained Environments
by Betül Koşmaz Sünnetci, Özkan Bingöl, Eyüp Gedikli, Murat Ekinci, Ramazan Özgür Doğan, Salih Türk and Nihan Güngör
Appl. Sci. 2025, 15(21), 11368; https://doi.org/10.3390/app152111368 - 23 Oct 2025
Viewed by 175
Abstract
The widespread adoption of mobile devices has made secure and user-friendly biometric authentication critical. However, widely used modalities such as fingerprint and facial recognition show limited robustness under uncontrolled illumination and on heterogeneous devices. In contrast, palmprint recognition offers strong potential because of its rich textural patterns and high discriminative power. This study addresses the limitations of laboratory-based datasets that fail to capture real-world challenges. We introduce MPW-180, a novel dataset comprising videos of 180 participants recorded on their own smartphones in everyday environments. By systematically incorporating diverse illumination conditions (with and without flash) and natural free-hand movements, MPW-180 is the first dataset to adopt a bring-your-own-device paradigm, providing a realistic benchmark for evaluating generalization in mobile biometric models. In addition, we propose PalmWildNet, an SE-block-enhanced deep learning architecture trained with Triplet Loss and a cross-illumination sampling strategy. The experimental results show that conventional methods suffer over 50% performance degradation under cross-illumination conditions. In contrast, our method reduces the Equal Error Rate to 1–2% while maintaining an accuracy above 97%. These findings demonstrate that the proposed framework not only tolerates illumination variability but also learns robust illumination-invariant representations, making it well-suited for mobile biometric authentication. Full article
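
The exact PalmWildNet design is not reproduced here; as a rough illustration of two ingredients the abstract names, a squeeze-and-excitation (SE) channel gate and a triplet loss over cross-illumination samples, consider the following PyTorch sketch. Layer sizes, the 128-dimensional embedding, and the sampling comment are assumptions, not the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweights feature channels with a learned gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):                                 # x: (N, C, H, W)
        s = x.mean(dim=(2, 3))                            # squeeze: global average pool
        s = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))  # excitation weights
        return x * s[:, :, None, None]                    # rescale channels

se = SEBlock(64)
print(se(torch.randn(2, 64, 32, 32)).shape)               # torch.Size([2, 64, 32, 32])

# Cross-illumination triplet: anchor and positive are the same palm under
# different lighting (e.g. flash vs. no flash), the negative is another palm.
# The exact sampling rule and embedding size are assumptions.
triplet_loss = nn.TripletMarginLoss(margin=0.2)
anchor, positive, negative = (torch.randn(8, 128) for _ in range(3))
print(triplet_loss(anchor, positive, negative).item())
```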

18 pages, 2231 KB  
Article
VFGF: Virtual Frame-Augmented Guided Prediction Framework for Long-Term Egocentric Activity Forecasting
by Xiangdong Long, Shuqing Wang and Yong Chen
Sensors 2025, 25(18), 5644; https://doi.org/10.3390/s25185644 - 10 Sep 2025
Viewed by 667
Abstract
Accurately predicting future activities in egocentric (first-person) videos is a challenging yet essential task, requiring robust object recognition and reliable forecasting of action patterns. However, the limited number of observable frames in such videos often lacks critical semantic context, making long-term predictions particularly difficult. Traditional approaches, especially those based on recurrent neural networks, tend to suffer from cumulative error propagation over extended time steps, leading to degraded performance. To address these challenges, this paper introduces a novel framework, Virtual Frame-Augmented Guided Forecasting (VFGF), designed specifically for long-term egocentric activity prediction. The VFGF framework enhances semantic continuity by generating and incorporating virtual frames into the observable sequence. These synthetic frames fill the temporal and contextual gaps caused by rapid changes in activity or environmental conditions. In addition, we propose a Feature Guidance Module that integrates anticipated activity-relevant features into the recursive prediction process, guiding the model toward more accurate and contextually coherent inferences. Extensive experiments on the EPIC-Kitchens dataset demonstrate that VFGF, with its interpolation-based temporal smoothing and feature-guided strategies, significantly improves long-term activity prediction accuracy. Specifically, VFGF achieves a state-of-the-art Top-5 accuracy of 44.11% at a 0.25 s prediction horizon. Moreover, it maintains competitive performance across a range of long-term forecasting intervals, highlighting its robustness and establishing a strong foundation for future research in egocentric activity prediction. Full article
(This article belongs to the Special Issue Computer Vision-Based Human Activity Recognition)
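
The virtual-frame idea can be pictured as interpolation between observed representations. Below is a minimal sketch assuming linear interpolation of per-frame feature vectors; the paper's actual virtual-frame generator is presumably learned and more elaborate, and the function name and shapes are illustrative.

```python
import torch

def insert_virtual_frames(features: torch.Tensor, n_virtual: int = 1) -> torch.Tensor:
    """Densify a sequence of per-frame features (T, D) by linearly
    interpolating n_virtual synthetic frames between each observed pair."""
    out = [features[0]]
    for prev, nxt in zip(features[:-1], features[1:]):
        for k in range(1, n_virtual + 1):
            alpha = k / (n_virtual + 1)
            out.append((1 - alpha) * prev + alpha * nxt)   # virtual frame
        out.append(nxt)
    return torch.stack(out)

observed = torch.randn(8, 256)            # 8 observed frames, 256-d features
augmented = insert_virtual_frames(observed, n_virtual=2)
print(augmented.shape)                    # torch.Size([22, 256])
```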

21 pages, 37523 KB  
Article
S_T_Mamba: A Novel Jinnan Calf Diarrhea Behavior Recognition Model Based on Sequence Tree Mamba
by Wangli Hao, Yakui Xue, Hao Shu, Bingxue Lv, Hanwei Li, Meng Han, Yanhong Liu and Fuzhong Li
Animals 2025, 15(18), 2646; https://doi.org/10.3390/ani15182646 - 9 Sep 2025
Viewed by 358
Abstract
The efficient and precise recognition of diarrhea-related behaviors in Jinnan calves is crucial for ensuring their healthy development. Nevertheless, conventional behavior recognition techniques are often limited by a notable decline in performance when distinguishing between similar behavioral patterns. This paper proposes a novel behavior recognition model for Jinnan calf diarrhea, named S_T_Mamba (Sequence Tree Mamba). Specifically, S_T_Mamba incorporates a sequence processing strategy and a tree state space module (TreeSSM). The sequence processing strategy utilizes frame sequences as inputs to capture the temporal dependencies underlying the video. Additionally, the tree state space module is designed to extract and aggregate long-range pixel association features from video frames, enabling the effective recognition of subtle distinctions between similar behaviors. Therefore, the proposed model significantly enhances the performance of calf diarrhea behavior recognition. Experimental results indicate that the S_T_Mamba model achieves state-of-the-art performance in Jinnan calf diarrhea behavior recognition. Specifically, S_T_Mamba achieves 99.78% accuracy, outperforming existing popular models by 0.59% to 1.99%. Full article
(This article belongs to the Section Animal System and Management)

28 pages, 2673 KB  
Article
AI Anomaly-Based Deepfake Detection Using Customized Mahalanobis Distance and Head Pose with Facial Landmarks
by Cosmina-Mihaela Rosca and Adrian Stancu
Appl. Sci. 2025, 15(17), 9574; https://doi.org/10.3390/app15179574 - 30 Aug 2025
Viewed by 1281
Abstract
The development of artificial intelligence has inevitably led to the growth of deepfake images, videos, human voices, etc. Deepfake detection is essential, especially when deepfakes are used for unethical and illegal purposes. This study presents a novel approach to image deepfake detection by introducing the Custom-Made Facial Recognition Algorithm (CMFRA), which employs four distinct features to differentiate between authentic and deepfake images. The proposed method combines facial landmark detection with advanced statistical analysis, integrating mean Mahalanobis distance and three head pose coordinates (yaw, pitch, and roll). The landmarks are extracted using the Google Vision API. This multi-feature approach assesses facial structure and orientation, capturing subtle inconsistencies indicative of deepfake manipulations. A key innovation of this work is introducing the mean Mahalanobis distance as a core feature for quantifying spatial relationships between facial landmarks. The research also emphasizes anomaly analysis by focusing solely on authentic facial data to establish a baseline for natural facial characteristics. The anomaly detection model recognizes when a face has been modified without requiring extensive training on deepfake samples. The process is implemented by analyzing deviations from this established pattern. The CMFRA demonstrated a detection accuracy of 90%. The proposed algorithm distinguishes between authentic and deepfake images under varied conditions. Full article
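
As a rough illustration of a mean Mahalanobis-distance feature, the sketch below scores a face's landmarks against statistics estimated from authentic faces only. The landmark count, the per-landmark covariance model, and the decision threshold are assumptions, not the CMFRA definition.

```python
import numpy as np

def mean_mahalanobis(landmarks, mu, cov_inv):
    """Average, over landmarks, of each 2-D landmark's Mahalanobis distance
    from its authentic-face distribution (one inverse 2x2 covariance per
    landmark). The paper's exact formulation may differ."""
    diffs = landmarks - mu                                   # (L, 2)
    d2 = np.einsum('li,lij,lj->l', diffs, cov_inv, diffs)    # per-landmark distance^2
    return float(np.sqrt(d2).mean())

rng = np.random.default_rng(0)
L = 34                                                       # landmark count (assumed)
authentic = rng.normal(size=(500, L, 2))                     # authentic faces only
mu = authentic.mean(axis=0)                                  # (L, 2) baseline means
cov_inv = np.stack([np.linalg.inv(np.cov(authentic[:, l].T)) for l in range(L)])

test_face = rng.normal(size=(L, 2))
score = mean_mahalanobis(test_face, mu, cov_inv)
print(score, score > 2.0)        # threshold is illustrative, not from the paper
```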

21 pages, 3474 KB  
Article
DFF: Sequential Dual-Branch Feature Fusion for Maritime Radar Object Detection and Tracking via Video Processing
by Donghui Li, Yu Xia, Fei Cheng, Cheng Ji, Jielu Yan, Weizhi Xian, Xuekai Wei, Mingliang Zhou and Yi Qin
Appl. Sci. 2025, 15(16), 9179; https://doi.org/10.3390/app15169179 - 20 Aug 2025
Viewed by 524
Abstract
Robust maritime radar object detection and tracking in maritime clutter environments is critical for maritime safety and security. Conventional Constant False Alarm Rate (CFAR) detectors have limited performance in processing complex-valued radar echoes, especially in complex scenarios where phase information is critical and in the real-time processing of successive echo pulses, while existing deep learning methods usually lack native support for complex-valued data and have inherent shortcomings in real-time performance compared to conventional methods. To overcome these limitations, we propose a dual-branch sequence feature fusion (DFF) detector designed specifically for complex-valued continuous sea-clutter signals, drawing on commonly used methods in video pattern recognition. The DFF employs dual parallel complex-valued U-Net branches to extract multilevel spatiotemporal features from distance profiles and Doppler features from distance–Doppler spectrograms, preserving the critical phase–amplitude relationship. Subsequently, the sequential feature-extraction module (SFEM) captures the temporal dependence in both feature streams. Next, the Adaptive Weight Learning (AWL) module dynamically fuses these multimodal features by learning modality-specific weights. Finally, the detection module generates the object localisation output. Extensive evaluations on the IPIX and SDRDSP datasets show that DFF performs well. On SDRDSP, DFF achieves 98.76% accuracy and 68.75% in F1 score, which significantly outperforms traditional CFAR methods and state-of-the-art deep learning models in terms of detection accuracy and false alarm rate (FAR). These results validate the effectiveness of DFF for reliable maritime object detection in complex clutter environments through multimodal feature fusion and sequence-dependent modelling. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
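
The Adaptive Weight Learning idea, learning per-sample weights to fuse two feature streams, can be pictured as in the sketch below. The gating form, dimensions, and class name are illustrative assumptions rather than the published module.

```python
import torch
import torch.nn as nn

class AdaptiveWeightFusion(nn.Module):
    """Learn per-sample weights for two feature streams (the range-profile and
    range-Doppler branches in the paper) and fuse them by a weighted sum."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 2)          # predicts one weight per stream

    def forward(self, feat_a, feat_b):             # each: (N, D)
        w = torch.softmax(self.gate(torch.cat([feat_a, feat_b], dim=1)), dim=1)
        return w[:, :1] * feat_a + w[:, 1:] * feat_b

fusion = AdaptiveWeightFusion(dim=64)
fused = fusion(torch.randn(4, 64), torch.randn(4, 64))
print(fused.shape)                                 # torch.Size([4, 64])
```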

34 pages, 3909 KB  
Article
UWB Radar-Based Human Activity Recognition via EWT–Hilbert Spectral Videos and Dual-Path Deep Learning
by Hui-Sup Cho and Young-Jin Park
Electronics 2025, 14(16), 3264; https://doi.org/10.3390/electronics14163264 - 17 Aug 2025
Viewed by 1074
Abstract
Ultrawideband (UWB) radar has emerged as a compelling solution for noncontact human activity recognition. This study proposes a novel framework that leverages adaptive signal decomposition and video-based deep learning to classify human motions with high accuracy using a single UWB radar. The raw radar signals were processed by empirical wavelet transform (EWT) to isolate the dominant frequency components in a data-driven manner. These components were further analyzed using the Hilbert transform to produce time–frequency spectra that capture motion-specific signatures through subtle phase variations. Instead of treating each spectrum as an isolated image, the resulting sequence was organized into a temporally coherent video, capturing spatial and temporal motion dynamics. The video data were used to train the SlowFast network—a dual-path deep learning model optimized for video-based action recognition. The proposed system achieved an average classification accuracy exceeding 99% across five representative human actions. The experimental results confirmed that the EWT–Hilbert-based preprocessing enhanced feature distinctiveness, while the SlowFast architecture enabled efficient and accurate learning of motion patterns. The proposed framework is intuitive, computationally efficient, and scalable, demonstrating strong potential for deployment in real-world scenarios such as smart healthcare, ambient-assisted living, and privacy-sensitive surveillance environments. Full article
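
A minimal sketch of the Hilbert step, assuming the EWT decomposition has already produced a band-limited component: the analytic signal yields the instantaneous amplitude and frequency from which time-frequency frames can be assembled. The signal parameters are synthetic and only illustrate the mechanics.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_spectrum_frame(component: np.ndarray, fs: float):
    """Instantaneous amplitude and frequency of one EWT component
    (the EWT decomposition itself is assumed to happen upstream)."""
    analytic = hilbert(component)
    amplitude = np.abs(analytic)
    phase = np.unwrap(np.angle(analytic))
    inst_freq = np.diff(phase) * fs / (2 * np.pi)    # Hz, length N-1
    return amplitude, inst_freq

fs = 1000.0
t = np.arange(0, 1, 1 / fs)
component = np.sin(2 * np.pi * 50 * t) * (1 + 0.5 * np.sin(2 * np.pi * 2 * t))
amp, freq = hilbert_spectrum_frame(component, fs)
print(amp.shape, freq.mean())                        # instantaneous frequency ~= 50 Hz
```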

31 pages, 4668 KB  
Article
BLE Signal Processing and Machine Learning for Indoor Behavior Classification
by Yi-Shiun Lee, Yong-Yi Fanjiang, Chi-Huang Hung and Yung-Shiang Huang
Sensors 2025, 25(14), 4496; https://doi.org/10.3390/s25144496 - 19 Jul 2025
Viewed by 1130
Abstract
Smart home technology enhances the quality of life, particularly with respect to in-home care and health monitoring. While video-based methods provide accurate behavior analysis, privacy concerns drive interest in non-visual alternatives. This study proposes a Bluetooth Low Energy (BLE)-enabled indoor positioning and behavior recognition system, integrating machine learning techniques to support sustainable and privacy-preserving health monitoring. Key optimizations include: (1) a vertically mounted Data Collection Unit (DCU) for improved height positioning, (2) synchronized data collection to reduce discrepancies, (3) Kalman filtering to smooth RSSI signals, and (4) AI-based RSSI analysis for enhanced behavior recognition. Experiments in a real home environment used a smart wristband to assess BLE signal variations across different activities (standing, sitting, lying down). The results show that the proposed system reliably tracks user locations and identifies behavior patterns. This research supports elderly care, remote health monitoring, and non-invasive behavior analysis, providing a privacy-preserving solution for smart healthcare applications. Full article
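
The RSSI smoothing step can be illustrated with a scalar Kalman filter under a constant-level model; this is a generic sketch, and the noise variances below are placeholders, not values from the study.

```python
import numpy as np

def kalman_smooth_rssi(rssi, process_var=1e-3, meas_var=4.0):
    """1-D Kalman filter over an RSSI stream (constant-level model).
    Noise variances are illustrative; tune them per device and environment."""
    x, p = rssi[0], 1.0               # state estimate and its variance
    smoothed = []
    for z in rssi:
        p += process_var              # predict
        k = p / (p + meas_var)        # Kalman gain
        x += k * (z - x)              # update with measurement z
        p *= (1 - k)
        smoothed.append(x)
    return np.array(smoothed)

rng = np.random.default_rng(1)
raw = -60 + 2.0 * rng.normal(size=50)       # noisy RSSI around -60 dBm
print(kalman_smooth_rssi(raw)[-5:])
```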

20 pages, 5700 KB  
Article
Multimodal Personality Recognition Using Self-Attention-Based Fusion of Audio, Visual, and Text Features
by Hyeonuk Bhin and Jongsuk Choi
Electronics 2025, 14(14), 2837; https://doi.org/10.3390/electronics14142837 - 15 Jul 2025
Viewed by 1457
Abstract
Personality is a fundamental psychological trait that exerts a long-term influence on human behavior patterns and social interactions. Automatic personality recognition (APR) has exhibited increasing importance across various domains, including Human–Robot Interaction (HRI), personalized services, and psychological assessments. In this study, we propose a multimodal personality recognition model that classifies the Big Five personality traits by extracting features from three heterogeneous sources: audio processed using Wav2Vec2, video represented as Skeleton Landmark time series, and text encoded through Bidirectional Encoder Representations from Transformers (BERT) and Doc2Vec embeddings. Each modality is handled through an independent Self-Attention block that highlights salient temporal information, and these representations are then summarized and integrated using a late fusion approach to effectively reflect both the inter-modal complementarity and cross-modal interactions. Compared to traditional recurrent neural network (RNN)-based multimodal models and unimodal classifiers, the proposed model achieves an improvement of up to 12 percent in the F1-score. It also maintains a high prediction accuracy and robustness under limited input conditions. Furthermore, a visualization based on t-distributed Stochastic Neighbor Embedding (t-SNE) demonstrates clear distributional separation across the personality classes, enhancing the interpretability of the model and providing insights into the structural characteristics of its latent representations. To support real-time deployment, a lightweight thread-based processing architecture is implemented, ensuring computational efficiency. By leveraging deep learning-based feature extraction and the Self-Attention mechanism, we present a novel personality recognition framework that balances performance with interpretability. The proposed approach establishes a strong foundation for practical applications in HRI, counseling, education, and other interactive systems that require personalized adaptation. Full article
(This article belongs to the Special Issue Explainable Machine Learning and Data Mining)
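
A minimal sketch of the per-modality Self-Attention plus late-fusion idea follows, with placeholder feature dimensions standing in for the Wav2Vec2, skeleton-landmark, and BERT/Doc2Vec streams. The block design, pooling, and classifier head are assumptions, not the published model.

```python
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Self-attention over one modality's time series, then mean pooling."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # x: (N, T, D)
        h, _ = self.attn(x, x, x)
        return h.mean(dim=1)                       # (N, D) summary

class LateFusionAPR(nn.Module):
    """Late fusion of audio, skeleton and text summaries into Big Five logits."""
    def __init__(self, dims=(768, 64, 768), n_traits=5):
        super().__init__()
        self.encoders = nn.ModuleList([ModalityEncoder(d) for d in dims])
        self.head = nn.Linear(sum(dims), n_traits)

    def forward(self, audio, skeleton, text):
        parts = [enc(x) for enc, x in zip(self.encoders, (audio, skeleton, text))]
        return self.head(torch.cat(parts, dim=1))  # one logit per trait

model = LateFusionAPR()
logits = model(torch.randn(2, 100, 768), torch.randn(2, 100, 64), torch.randn(2, 20, 768))
print(logits.shape)                                # torch.Size([2, 5])
```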

29 pages, 366 KB  
Article
Video-Driven Artificial Intelligence for Predictive Modelling of Antimicrobial Peptide Generation: Literature Review on Advances and Challenges
by Jielu Yan, Zhengli Chen, Jianxiu Cai, Weizhi Xian, Xuekai Wei, Yi Qin and Yifan Li
Appl. Sci. 2025, 15(13), 7363; https://doi.org/10.3390/app15137363 - 30 Jun 2025
Viewed by 1472
Abstract
There is a need to review how video-based methodologies and advanced computer vision algorithms can facilitate the development of antimicrobial peptide (AMP) generation models, elucidate structural and functional patterns, and enhance the generative power of in silico pipelines. AMPs have drawn significant interest as promising therapeutic agents because of their broad-spectrum efficacy, low resistance profile, and membrane-disrupting mechanisms. However, traditional discovery methods are hindered by high costs, lengthy synthesis processes, and difficulty in accessing the extensive chemical space involved in AMP research. Recent advances in artificial intelligence, especially machine learning (ML), deep learning (DL), and pattern recognition, offer game-changing opportunities to accelerate AMP design and validation. By integrating video analysis with computational modelling, researchers can visualise and quantify AMP–microbe interactions at unprecedented levels of detail, thereby informing both experimental design and the refinement of predictive algorithms. This review provides a comprehensive overview of these emerging techniques, highlights major breakthroughs, addresses critical challenges, and ultimately emphasises the powerful synergy between video-driven pattern recognition, AI-based modelling, and experimental validation in the pursuit of next-generation antimicrobial strategies. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

13 pages, 302 KB  
Article
Structured Comparison Approach in Remote Interprofessional Education: Enhancing Role Clarity and Collaborative Identity Through Video-Based Reflection
by Ysuhisa Nakamura, Kazuko Ando, Kyoko Otani, Mayumi Yoshikawa and Ayako Furuzawa
Educ. Sci. 2025, 15(6), 687; https://doi.org/10.3390/educsci15060687 - 1 Jun 2025
Viewed by 905
Abstract
Background: Remote interprofessional education (IPE) often fails to reflect cognitive differences across disciplines, limiting students’ ability to form collaborative identities and clarify professional roles. Prior research has focused on attitudinal change, with less attention to reflective and epistemological learning. Objective: This study aimed to examine whether making profession-specific reasoning patterns visible through structured comparison and guided reflection in a brief, remote IPE intervention could enhance students’ understanding of disciplinary perspectives and their interprofessional competence. Methods: Final-year Japanese university students in OT and MHSW participated in a two-session remote IPE intervention. In the first session, the students independently assessed a video case using the International Classification of Functioning, Disability and Health (ICF) core sets. In the second, the profession-specific results were aggregated, visualized, and shared as feedback to prompt guided reflection. A convergent mixed-methods design was used, combining pre-/post-surveys using the Readiness for Interprofessional Learning Scale (RIPLS) and the Interdisciplinary Education Perception Scale (IEPS), with qualitative analysis of the written reflections. Results: Significant improvement was observed in the IEPS Competence and Autonomy subscale among the MHSW students with low baseline scores. No significant changes were found in the RIPLS scores. Thematic analysis indicated increased awareness of disciplinary perspectives and recognition of complementary roles. Conclusions: A structured comparison approach can promote role clarity, reflective learning, and interprofessional identity in brief, remote IPE settings. Full article

25 pages, 9742 KB  
Article
Autism Spectrum Disorder Detection Using Skeleton-Based Body Movement Analysis via Dual-Stream Deep Learning
by Jungpil Shin, Abu Saleh Musa Miah, Manato Kakizaki, Najmul Hassan and Yoichi Tomioka
Electronics 2025, 14(11), 2231; https://doi.org/10.3390/electronics14112231 - 30 May 2025
Viewed by 1613
Abstract
Autism Spectrum Disorder (ASD) poses significant challenges in diagnosis due to its diverse symptomatology and the complexity of early detection. Atypical gait and gesture patterns, prominent behavioural markers of ASD, hold immense potential for facilitating early intervention and optimising treatment outcomes. These patterns can be efficiently and non-intrusively captured using modern computational techniques, making them valuable for ASD recognition. Various types of research have been conducted to detect ASD through deep learning, including facial feature analysis, eye gaze analysis, and movement and gesture analysis. In this study, we optimise a dual-stream architecture that combines image classification and skeleton recognition models to analyse video data for body motion analysis. The first stream processes Skepxels (spatial representations derived from skeleton data) using ConvNeXt-Base, a robust image recognition model that efficiently captures aggregated spatial embeddings. The second stream encodes angular features, embedding relative joint angles into the skeleton sequence and extracting spatiotemporal dynamics using the Multi-Scale Graph 3D Convolutional Network (MSG3D), a combination of Graph Convolutional Networks (GCNs) and Temporal Convolutional Networks (TCNs). We replace the ViT model from the original architecture with ConvNeXt-Base to evaluate the efficacy of CNN-based models in capturing gesture-related features for ASD detection. Additionally, we experimented with a Stack Transformer in the second stream instead of MSG3D but found it to result in lower accuracy, thus highlighting the importance of GCN-based models for motion analysis. The integration of these two streams ensures comprehensive feature extraction, capturing both global and detailed motion patterns. A pairwise Euclidean distance loss is employed during training to enhance the consistency and robustness of feature representations. The results from our experiments demonstrate that the two-stream approach, combining ConvNeXt-Base and MSG3D, offers a promising method for effective autism detection. This approach not only enhances accuracy but also contributes valuable insights into optimising deep learning models for gesture-based recognition. By integrating image classification and skeleton recognition, we can better capture both global and detailed motion patterns, which are crucial for improving early ASD diagnosis and intervention strategies. Full article
(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, 4th Edition)
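
One plausible reading of the pairwise Euclidean distance loss is a penalty on the distance between the two streams' embeddings of the same clip, sketched below; the authors' exact formulation may differ, and the embedding size is an assumption.

```python
import torch
import torch.nn.functional as F

def pairwise_euclidean_loss(stream_a: torch.Tensor, stream_b: torch.Tensor) -> torch.Tensor:
    """Encourage the image-branch (Skepxel/ConvNeXt) and skeleton-branch (MSG3D)
    embeddings of the same clip to agree by penalising their Euclidean distance."""
    return F.pairwise_distance(stream_a, stream_b, p=2).mean()

a = torch.randn(16, 256, requires_grad=True)   # image-stream embeddings (assumed 256-d)
b = torch.randn(16, 256, requires_grad=True)   # skeleton-stream embeddings
loss = pairwise_euclidean_loss(a, b)
loss.backward()
print(loss.item())
```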

13 pages, 8546 KB  
Article
AiWatch: A Distributed Video Surveillance System Using Artificial Intelligence and Digital Twins Technologies
by Alessio Ferone, Antonio Maratea, Francesco Camastra, Angelo Ciaramella, Antonino Staiano, Marco Lettiero, Angelo Polizio, Francesco Lombardi and Antonio Junior Spoleto
Technologies 2025, 13(5), 195; https://doi.org/10.3390/technologies13050195 - 10 May 2025
Viewed by 1845
Abstract
The primary purpose of video surveillance is to monitor public indoor areas or the boundaries of secure facilities to safeguard them against theft, unauthorized access, fire, and various other potential threats. Security cameras, equipped with integrated video surveillance systems, are strategically placed throughout critical locations on the premises, allowing security personnel to observe all areas for specific behaviors that may signal an emergency or a situation requiring intervention. A significant challenge arises from the fact that individuals cannot maintain focus on multiple screens simultaneously, which can result in the oversight of crucial incidents. In this regard, artificial intelligence (AI) video analytics has become increasingly prominent, driven by numerous practical applications that include object identification, detection of unusual behavior patterns, facial recognition, and traffic management. Recent advancements in this technology have led to enhanced functionality, remarkable accuracy, and reduced costs for consumers. There is a noticeable trend towards upgrading security frameworks by incorporating AI into pre-existing video surveillance systems, thus leading to modern video surveillance that leverages video analytics, enabling the detection and reporting of anomalies within mere seconds, thereby transforming it into a proactive security solution. In this context, the AiWatch system introduces digital twin (DT) technology in a modern video surveillance architecture to facilitate advanced analytics through the aggregation of data from various sources. By exploiting AI and DT to analyze the different sources, it is possible to derive deeper insights applicable at higher decision levels. This approach allows for the evaluation of the effects and outcomes of actions by examining different scenarios, hence yielding more robust decisions. Full article

16 pages, 1756 KB  
Article
Multi-Scale Parallel Enhancement Module with Cross-Hierarchy Interaction for Video Emotion Recognition
by Lianqi Zhang, Yuan Sun, Jiansheng Guan, Shaobo Kang, Jiangyin Huang and Xungao Zhong
Electronics 2025, 14(9), 1886; https://doi.org/10.3390/electronics14091886 - 6 May 2025
Viewed by 588
Abstract
Video emotion recognition faces significant challenges due to the strong spatiotemporal coupling of dynamic expressions and the substantial variations in cross-scale motion patterns (e.g., subtle facial micro-expressions versus large-scale body gestures). Traditional methods, constrained by limited receptive fields, often fail to effectively balance multi-scale correlations between local cues (e.g., transient facial muscle movements) and global semantic patterns (e.g., full-body gestures). To address this, we propose an enhanced attention module integrating multi-dilated convolution and dynamic feature weighting, aimed at improving spatiotemporal emotion feature extraction. Building upon conventional attention mechanisms, the module introduces a multi-branch parallel architecture. Convolutional kernels with varying dilation rates (1, 3, 5) are designed to hierarchically capture the cross-scale spatiotemporal features of low-scale facial micro-motion units (e.g., brief lip tightening), mid-scale composite expression patterns (e.g., furrowed brows combined with cheek raising), and high-scale limb motion trajectories (e.g., sustained arm-crossing). A dynamic feature adapter is further incorporated to enable context-aware adaptive fusion of multi-source heterogeneous features. We conducted extensive ablation studies and experiments on popular benchmark datasets such as the VideoEmotion-8 and Ekman-6 datasets. Experiments demonstrate that the proposed method enhances joint modeling of low-scale cues (e.g., fragmented facial muscle dynamics) and high-scale semantic patterns (e.g., emotion-coherent body language), achieving stronger cross-database generalization. Full article
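
The multi-branch dilated-convolution idea with rates 1, 3, and 5 can be sketched as follows; the channel count and the 1x1 fusion convolution are illustrative choices, not the published module.

```python
import torch
import torch.nn as nn

class MultiDilationBranch(nn.Module):
    """Parallel 3x3 convolutions with dilation rates 1, 3 and 5, fused by a
    learned 1x1 convolution (a sketch of the multi-branch idea only)."""
    def __init__(self, channels=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 3, 5)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):                          # x: (N, C, H, W)
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(feats)

module = MultiDilationBranch()
print(module(torch.randn(2, 64, 56, 56)).shape)    # torch.Size([2, 64, 56, 56])
```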

10 pages, 1379 KB  
Proceeding Paper
Recognizing Human Emotions Through Body Posture Dynamics Using Deep Neural Networks
by Arunnehru Jawaharlalnehru, Thalapathiraj Sambandham and Dhanasekar Ravikumar
Eng. Proc. 2025, 87(1), 49; https://doi.org/10.3390/engproc2025087049 - 16 Apr 2025
Viewed by 1841
Abstract
Body posture dynamics have garnered significant attention in recent years due to their critical role in understanding the emotional states conveyed through human movements during social interactions. Emotions are typically expressed through facial expressions, voice, gait, posture, and overall body dynamics. Among these, body posture provides subtle yet essential cues about emotional states. However, predicting an individual’s gait and posture dynamics poses challenges, given the complexity of human body movement, which involves numerous degrees of freedom compared to facial expressions. Moreover, unlike static facial expressions, body dynamics are inherently fluid and continuously evolving. This paper presents an effective method for recognizing 17 micro-emotions by analyzing kinematic features from the GEMEP dataset using video-based motion capture. We specifically focus on upper body posture dynamics (skeleton points and angles), capturing movement patterns and their dynamic range over time. Our approach addresses the complexity of recognizing emotions from posture and gait by focusing on key elements of kinematic gesture analysis. The experimental results demonstrate the effectiveness of the proposed model, achieving a high accuracy rate of 91.48% for angle metric + DNN and 93.89% for distance + DNN on the GEMEP dataset using a deep neural network (DNN). These findings highlight the potential for our model to advance posture-based emotion recognition, particularly in applications where the distance and angle dynamics of the human body are key indicators of emotional states. Full article
(This article belongs to the Proceedings of The 5th International Electronic Conference on Applied Sciences)
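
A minimal sketch of the kind of angle and distance features such a pipeline computes from skeleton points; the joint set, the 2-D coordinates, and the feature choices are assumptions, not the paper's exact descriptors.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at joint b formed by points a-b-c, e.g. the elbow
    angle from shoulder, elbow and wrist coordinates."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))

def pairwise_distances(joints):
    """Upper-triangular inter-joint distances for one frame of shape (J, 2)."""
    diffs = joints[:, None, :] - joints[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)
    return d[np.triu_indices(len(joints), k=1)]

frame = np.random.rand(10, 2)          # 10 upper-body joints (x, y), illustrative
print(joint_angle(frame[0], frame[1], frame[2]), pairwise_distances(frame).shape)
```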

20 pages, 2777 KB  
Article
Video Human Action Recognition Based on Motion-Tempo Learning and Feedback Attention
by Yalong Liu, Chengwu Liang, Songqi Jiang and Peiwang Zhu
Appl. Sci. 2025, 15(8), 4186; https://doi.org/10.3390/app15084186 - 10 Apr 2025
Viewed by 943
Abstract
In video human action-recognition tasks, motion tempo describes the dynamic patterns and temporal scales of human motion. Different categories of actions are typically composed of sub-actions with varying motion tempos. Effectively capturing sub-actions with different motion tempos and distinguishing category-specific sub-actions are crucial for improving action-recognition performance. Convolutional Neural Network (CNN)-based methods have attempted to address this challenge by embedding feedforward attention modules to enhance the action’s dynamic representation learning. However, feedforward attention modules rely only on local information from low-level features, lacking contextual information to generate attention weights. Therefore, we propose a Sub-action Motion information Enhancement Network (SMEN) based on motion-tempo learning and feedback attention, which consists of the Multi-Granularity Adaptive Fusion Module (MgAFM) and Feedback Attention-Guided Module (FAGM). MgAFM enhances the model’s ability to capture crucial sub-action intrinsic information by extracting and adaptively fusing motion dynamic features at different granularities. FAGM leverages high-level features that contain contextual information in a feedback manner to guide low-level features in generating attention weights, enhancing the model’s ability to extract more discriminative spatio-temporal and channel-wise features. Experiments are conducted on three datasets, and the proposed SMEN achieves top-1 accuracies of 52.4% and 63.3% on the Something-Something V1 and V2 datasets, respectively, and 76.9% on the Kinetics-400 dataset. Ablation studies, evaluations, and visualizations demonstrate that the proposed SMEN is effective for sub-action motion tempo and representation learning, and outperforms the compared methods for video action recognition. Full article
(This article belongs to the Special Issue Research on Machine Learning in Computer Vision)
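
One way to picture feedback attention is channel weights on low-level features predicted from high-level context, as in the sketch below; this is a simplified reading of the abstract, and the class name and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class FeedbackChannelAttention(nn.Module):
    """Channel attention on low-level features whose weights are predicted
    from high-level (context-rich) features, in the spirit of the feedback
    guidance described above."""
    def __init__(self, low_channels=64, high_channels=256):
        super().__init__()
        self.to_weights = nn.Sequential(
            nn.Linear(high_channels, low_channels), nn.Sigmoid()
        )

    def forward(self, low, high):              # low: (N, Cl, H, W), high: (N, Ch, h, w)
        context = high.mean(dim=(2, 3))        # global context vector from high level
        w = self.to_weights(context)           # per-channel weights fed back
        return low * w[:, :, None, None]

attn = FeedbackChannelAttention()
out = attn(torch.randn(2, 64, 56, 56), torch.randn(2, 256, 7, 7))
print(out.shape)                               # torch.Size([2, 64, 56, 56])
```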
