Search Results (31)

Search Parameters:
Keywords = Video Swin Transformer

25 pages, 4660 KB  
Article
Dual-Stream Former: A Dual-Branch Transformer Architecture for Visual Speech Recognition
by Sanghun Jeon, Jieun Lee and Yong-Ju Lee
AI 2025, 6(9), 222; https://doi.org/10.3390/ai6090222 - 9 Sep 2025
Viewed by 877
Abstract
This study proposes Dual-Stream Former, a novel architecture that integrates a Video Swin Transformer and a Conformer to address the challenges of visual speech recognition (VSR). The model captures spatiotemporal dependencies, achieving a state-of-the-art character error rate (CER) of 3.46%, surpassing traditional convolutional neural network (CNN)-based models, such as 3D-CNN + DenseNet-121 (CER: 5.31%), and transformer-based alternatives, such as vision transformers (CER: 4.05%). The Video Swin Transformer captures multiscale spatial representations with high computational efficiency, whereas the Conformer back-end enhances temporal modeling across diverse phoneme categories. Evaluation on a high-resolution dataset comprising 740,000 utterances across 185 classes highlighted the effectiveness of the model in addressing visually confusing phonemes, such as diphthongs (/ai/, /au/) and labiodental sounds (/f/, /v/). Dual-Stream Former achieved phoneme recognition error rates of 10.39% for diphthongs and 9.25% for labiodental sounds, outperforming CNN-based architectures by more than 6%. Although the model’s large parameter count (168.6 M) poses resource challenges, its hierarchical design ensures scalability. Future work will explore lightweight adaptations and multimodal extensions to increase deployment feasibility. These findings underscore the transformative potential of Dual-Stream Former for advancing VSR applications such as silent communication and assistive technologies by achieving unparalleled precision and robustness in diverse settings. Full article
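
As a concrete illustration of this front-end/back-end split, the sketch below pairs a Video Swin Transformer feature extractor with a Conformer temporal model and a CTC-style character head. It assumes torchvision >= 0.14 (swin3d_t) and torchaudio >= 0.12 (Conformer); the chunk-level pooling, layer sizes, and head are simplifications for illustration, not the authors' exact Dual-Stream Former design.

```python
# Hedged sketch: Video Swin front-end + Conformer back-end for visual speech recognition.
import torch
import torch.nn as nn
from torchvision.models.video import swin3d_t     # torchvision >= 0.14
from torchaudio.models import Conformer           # torchaudio >= 0.12

class VideoSwinConformerSketch(nn.Module):
    def __init__(self, num_chars: int, chunk: int = 8):
        super().__init__()
        self.chunk = chunk
        self.backbone = swin3d_t(weights=None)          # spatiotemporal feature extractor
        feat_dim = self.backbone.head.in_features       # 768 for the tiny variant
        self.backbone.head = nn.Identity()              # keep pooled features, drop the classifier
        self.temporal = Conformer(input_dim=feat_dim, num_heads=4, ffn_dim=1024,
                                  num_layers=4, depthwise_conv_kernel_size=31)
        self.char_head = nn.Linear(feat_dim, num_chars + 1)   # +1 for a CTC blank symbol

    def forward(self, video: torch.Tensor) -> torch.Tensor:   # video: (B, C, T, H, W)
        b, c, t, h, w = video.shape
        clips = video.unfold(2, self.chunk, self.chunk)        # (B, C, N, H, W, chunk)
        n = clips.shape[2]
        clips = clips.permute(0, 2, 1, 5, 3, 4).reshape(b * n, c, self.chunk, h, w)
        feats = self.backbone(clips).reshape(b, n, -1)         # one feature vector per chunk
        out, _ = self.temporal(feats, torch.full((b,), n))     # Conformer models the chunk sequence
        return self.char_head(out)                             # (B, N, num_chars + 1) logits

logits = VideoSwinConformerSketch(num_chars=40)(torch.randn(1, 3, 16, 112, 112))
```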

17 pages, 7292 KB  
Article
QP-Adaptive Dual-Path Residual Integrated Frequency Transformer for Data-Driven In-Loop Filter in VVC
by Cheng-Hsuan Yeh, Chi-Ting Ni, Kuan-Yu Huang, Zheng-Wei Wu, Cheng-Pin Peng and Pei-Yin Chen
Sensors 2025, 25(13), 4234; https://doi.org/10.3390/s25134234 - 7 Jul 2025
Viewed by 540
Abstract
As AI-enabled embedded systems such as smart TVs and edge devices demand efficient video processing, Versatile Video Coding (VVC/H.266) becomes essential for bandwidth-constrained Multimedia Internet of Things (M-IoT) applications. However, its block-based coding often introduces compression artifacts. While CNN-based methods effectively reduce these artifacts, maintaining robust performance across varying quantization parameters (QPs) remains challenging. Recent QP-adaptive designs like QA-Filter show promise but are still limited. This paper proposes DRIFT, a QP-adaptive in-loop filtering network for VVC. DRIFT combines a lightweight frequency fusion CNN (LFFCNN) for local enhancement and a Swin Transformer-based global skip connection for capturing long-range dependencies. LFFCNN leverages octave convolution and introduces a novel residual block (FFRB) that integrates multiscale extraction, QP adaptivity, frequency fusion, and spatial-channel attention. A QP estimator (QPE) is further introduced to mitigate double enhancement in inter-coded frames. Experimental results demonstrate that DRIFT achieves BD rate reductions of 6.56% (intra) and 4.83% (inter), with an up to 10.90% gain on the BasketballDrill sequence. Additionally, LFFCNN reduces the model size by 32% while slightly improving the coding performance over QA-Filter. Full article
(This article belongs to the Special Issue Multimodal Sensing Technologies for IoT and AI-Enabled Systems)
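
As a rough illustration of QP adaptivity, the sketch below conditions a plain residual filtering block on the quantization parameter by predicting per-channel scale and shift terms from QP. It is a hypothetical stand-in for the FFRB described above: octave convolution, frequency fusion, spatial-channel attention, and the Swin-based global skip connection are all omitted.

```python
# Minimal QP-conditioned residual block (illustrative stand-in, not the paper's FFRB).
import torch
import torch.nn as nn

class QPConditionedResBlock(nn.Module):
    def __init__(self, channels: int = 64, qp_embed_dim: int = 16):
        super().__init__()
        self.qp_embed = nn.Sequential(nn.Linear(1, qp_embed_dim), nn.ReLU(),
                                      nn.Linear(qp_embed_dim, 2 * channels))
        self.body = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x: torch.Tensor, qp: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) frame features; qp: (B,) quantization parameters (0..63 in VVC)
        scale, shift = self.qp_embed(qp.float().unsqueeze(1) / 63.0).chunk(2, dim=1)
        y = self.body(x)
        y = y * (1 + scale[:, :, None, None]) + shift[:, :, None, None]   # QP modulates the residual
        return x + y                                                      # residual keeps the filter stable

out = QPConditionedResBlock()(torch.randn(2, 64, 32, 32), torch.tensor([22, 37]))
```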

18 pages, 1727 KB  
Article
Meta-Learning Approach for Adaptive Anomaly Detection from Multi-Scenario Video Surveillance
by Deepak Kumar Singh, Dibakar Raj Pant, Ganesh Gautam and Bhanu Shrestha
Appl. Sci. 2025, 15(12), 6687; https://doi.org/10.3390/app15126687 - 13 Jun 2025
Viewed by 1772
Abstract
Video surveillance is widely used in areas such as roads, malls, education, industry, retail, parks, bus stands, and restaurants, each presenting distinct anomaly patterns that demand specialized detection strategies. Adapting anomaly detection models to new camera viewpoints or environmental variations within the same scenario remains a significant challenge, and extending them to entirely different surveillance environments or scenarios often requires extensive retraining, which is both resource-intensive and time-consuming. To overcome these limitations, a video anomaly detector is proposed that leverages a meta-learning framework for faster adaptation to new concepts, using a Swin Transformer for feature extraction. Training and evaluation use the MSAD (multi-scenario anomaly detection) dataset, a high-resolution anomaly detection dataset covering 14 scenarios from multiple camera views that includes diverse motion patterns and challenging variations such as changing lighting and weather conditions, offering a robust foundation for training advanced anomaly detection models. Experiments validate the effectiveness of the proposed framework, which integrates model-agnostic meta-learning (MAML) with a ten-shot, one-query adaptation strategy. Leveraging the Swin Transformer as a spatial feature extractor, the model captures rich hierarchical representations from surveillance videos. This combination enables rapid generalization to novel viewpoints within the same scenario and maintains competitive performance when deployed in entirely new environments. These results highlight the strength of MAML in few-shot learning settings and demonstrate its potential for scalable anomaly detection across diverse surveillance scenarios. Full article
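
A compact sketch of the adaptation loop described above: a lightweight anomaly head on top of precomputed Swin Transformer clip features is adapted per task with a few inner gradient steps (ten-shot support, one query), and the query loss drives the outer meta-update. The feature dimension, learning rates, and task sampling are assumptions for illustration.

```python
# Hedged MAML-style sketch for a few-shot anomaly head over precomputed Swin clip features.
import torch
import torch.nn as nn
import torch.nn.functional as F

def inner_adapt(head, support_x, support_y, lr=0.01, steps=1):
    """One or more inner-loop gradient steps; returns task-adapted ('fast') weights."""
    params = {k: v.clone() for k, v in head.named_parameters()}
    for _ in range(steps):
        loss = F.cross_entropy(F.linear(support_x, params["weight"], params["bias"]), support_y)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {k: v - lr * g for (k, v), g in zip(params.items(), grads)}
    return params

head = nn.Linear(768, 2)                       # normal vs. anomalous, assuming 768-d Swin features
meta_opt = torch.optim.Adam(head.parameters(), lr=1e-3)

def meta_step(tasks):                          # tasks: list of (support_x, support_y, query_x, query_y)
    meta_opt.zero_grad()
    meta_loss = 0.0
    for sx, sy, qx, qy in tasks:
        fast = inner_adapt(head, sx, sy)       # ten-shot support set per task
        meta_loss = meta_loss + F.cross_entropy(F.linear(qx, fast["weight"], fast["bias"]), qy)
    (meta_loss / len(tasks)).backward()        # second-order gradients flow through inner_adapt
    meta_opt.step()

tasks = [(torch.randn(10, 768), torch.randint(0, 2, (10,)),
          torch.randn(1, 768), torch.randint(0, 2, (1,))) for _ in range(4)]
meta_step(tasks)
```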

21 pages, 6048 KB  
Article
GenConViT: Deepfake Video Detection Using Generative Convolutional Vision Transformer
by Deressa Wodajo Deressa, Hannes Mareen, Peter Lambert, Solomon Atnafu, Zahid Akhtar and Glenn Van Wallendael
Appl. Sci. 2025, 15(12), 6622; https://doi.org/10.3390/app15126622 - 12 Jun 2025
Cited by 2 | Viewed by 2701
Abstract
Deepfakes have raised significant concerns due to their potential to spread false information and compromise the integrity of digital media. Current deepfake detection models often struggle to generalize across a diverse range of deepfake generation techniques and video content. In this work, we propose a Generative Convolutional Vision Transformer (GenConViT) for deepfake video detection. Our model combines ConvNeXt and Swin Transformer models for feature extraction, and it utilizes an Autoencoder and Variational Autoencoder to learn from latent data distributions. By learning from the visual artifacts and latent data distribution, GenConViT achieves an improved performance in detecting a wide range of deepfake videos. The model is trained and evaluated on DFDC, FF++, TM, DeepfakeTIMIT, and Celeb-DF (v2) datasets. The proposed GenConViT model demonstrates strong performance in deepfake video detection, achieving high accuracy across the tested datasets. While our model shows promising results in deepfake video detection by leveraging visual and latent features, we demonstrate that further work is needed to improve its generalizability when encountering out-of-distribution data. Our model provides an effective solution for identifying a wide range of fake videos while preserving the integrity of media. Full article

21 pages, 512 KB  
Article
Enhancing Sign Language Recognition Performance Through Coverage-Based Dynamic Clip Generation
by Taewan Kim and Bongjae Kim
Appl. Sci. 2025, 15(11), 6372; https://doi.org/10.3390/app15116372 - 5 Jun 2025
Viewed by 941
Abstract
Sign Language Recognition (SLR) has made substantial progress through advances in deep learning and video-based action recognition. Conventional SLR systems typically segment input videos into a fixed number of clips (e.g., five clips per video), regardless of the video’s actual length, to meet the fixed-length input requirements of deep learning models. While this approach simplifies model design and training, it fails to account for temporal variations inherent in sign language videos. Specifically, applying a fixed number of clips to videos of varying lengths can lead to significant information loss: longer videos suffer from excessive frame skipping, causing the model to miss critical gestural cues, whereas shorter videos require frame duplication, introducing temporal redundancy that distorts motion dynamics. To address these limitations, we propose a dynamic clip generation method that adaptively adjusts the number of clips during inference based on a novel coverage metric. This metric quantifies how effectively a clip selection captures the temporal information in a given video, enabling the system to maintain both temporal fidelity and computational efficiency. Experimental results on benchmark SLR datasets using multiple models, including 3D CNNs, R(2+1)D, Video Swin Transformer, and Multiscale Vision Transformers, demonstrate that our method consistently outperforms fixed clip generation methods. Notably, our approach achieves 98.67% accuracy with the Video Swin Transformer while reducing inference time by 28.57%. These findings highlight the effectiveness of coverage-based dynamic clip generation in improving both accuracy and efficiency, particularly for videos with high temporal variability. Full article
(This article belongs to the Topic Applied Computing and Machine Intelligence (ACMI))
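
The adaptive control flow can be pictured with the toy sketch below, which picks the smallest number of evenly spaced clips whose sampled frames reach a target coverage of the video. The coverage definition used here (fraction of source frames touched by any clip) is a hypothetical stand-in for the paper's metric.

```python
# Illustrative coverage-based dynamic clip selection (coverage definition is an assumption).
def frame_coverage(num_frames: int, num_clips: int, clip_len: int = 16) -> float:
    covered = set()
    for c in range(num_clips):
        start = round(c * max(num_frames - clip_len, 0) / max(num_clips - 1, 1))
        covered.update(range(start, min(start + clip_len, num_frames)))
    return len(covered) / num_frames

def choose_num_clips(num_frames: int, target: float = 0.9, max_clips: int = 16) -> int:
    for n in range(1, max_clips + 1):
        if frame_coverage(num_frames, n) >= target:
            return n                        # smallest clip count reaching the coverage target
    return max_clips

# A short 40-frame sign needs few clips; a 300-frame sign needs many more.
print(choose_num_clips(40), choose_num_clips(300))
```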

26 pages, 5404 KB  
Article
Real-Time Coronary Artery Dominance Classification from Angiographic Images Using Advanced Deep Video Architectures
by Hasan Ali Akyürek
Diagnostics 2025, 15(10), 1186; https://doi.org/10.3390/diagnostics15101186 - 8 May 2025
Viewed by 1016
Abstract
Background/Objectives: The automatic identification of coronary artery dominance holds critical importance for clinical decision-making in cardiovascular medicine, influencing diagnosis, treatment planning, and risk stratification. Traditional classification methods rely on the manual visual interpretation of coronary angiograms. However, current deep learning approaches typically classify right and left coronary artery angiograms separately. This study aims to develop and evaluate an integrated video-based deep learning framework for classifying coronary dominance without distinguishing between RCA and LCA angiograms. Methods: Three advanced video-based deep learning models—Temporal Segment Networks (TSNs), Video Swin Transformer (VST), and VideoMAEv2—were implemented using the MMAction2 framework. These models were trained and evaluated on a large dataset derived from a publicly available source. The integrated approach processes entire angiographic video sequences, eliminating the need for separate RCA and LCA identification during preprocessing. Results: The proposed framework demonstrated strong performance in classifying coronary dominance. The best test accuracies achieved using TSNs, Video Swin Transformer, and VideoMAEv2 were 87.86%, 92.12%, and 92.89%, respectively. Transformer-based models showed superior accuracy compared to convolution-based methods, highlighting their effectiveness in capturing spatial–temporal patterns in angiographic videos. Conclusions: This study introduces a unified video-based deep learning approach for coronary dominance classification, eliminating manual arterial branch separation and reducing preprocessing complexity. The results indicate that transformer-based models, particularly VideoMAEv2, offer highly accurate and clinically feasible solutions, contributing to the development of objective and automated diagnostic tools in cardiovascular imaging. Full article
(This article belongs to the Special Issue Cardiovascular Imaging)

19 pages, 3140 KB  
Article
Fast Algorithm for Depth Map Intra-Frame Coding 3D-HEVC Based on Swin Transformer and Multi-Branch Network
by Fengqin Wang, Yangang Du and Qiuwen Zhang
Electronics 2025, 14(9), 1703; https://doi.org/10.3390/electronics14091703 - 22 Apr 2025
Cited by 1 | Viewed by 539
Abstract
Three-Dimensional High-Efficiency Video Coding (3D-HEVC) efficiently compresses 3D video by incorporating depth map encoding techniques. However, the quadtree partitioning of depth map coding units (CUs) greatly increases computational complexity, contributing to over 90% of the total encoding time. To overcome the limitations of existing methods in complex edge modeling and partitioning efficiency, this paper presents Swin-Hier Net, a hierarchical CU partitioning prediction model based on the Swin Transformer. First, a multi-branch feature fusion architecture is designed: the Swin Transformer’s shifted window attention mechanism extracts global contextual features, lightweight CNNs capture local texture details, and traditional edge/variance features enhance multi-scale representation. Second, a recursive hierarchical decision mechanism dynamically activates sub-CU prediction branches based on the partitioning probability of parent nodes, ensuring strict compliance with the HEVC standard quadtree syntax. Additionally, a hybrid pooling strategy and dilated convolutions improve edge feature retention. Experiments on 3D-HEVC standard test sequences show that, compared to exhaustive traversal methods, the proposed algorithm reduces encoding time by 72.7% on average, lowers the BD-Rate by 1.16%, improves CU partitioning accuracy to 94.5%, and maintains a synthesized view PSNR of 38.68 dB (baseline: 38.72 dB). The model seamlessly integrates into the HTM encoder, offering an efficient solution for real-time 3D video applications. Full article
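
The recursive decision mechanism, in which sub-CU branches are evaluated only when the parent CU is predicted to split, can be sketched as follows. The split predictor here is an arbitrary texture-variance placeholder, not the Swin-Hier Net model itself, and rate-distortion checks are omitted.

```python
# Sketch of recursive quadtree CU partitioning gated by a learned split probability.
import numpy as np

def partition_cu(predict_split, cu, size, min_size=8, threshold=0.5):
    """Return a nested quadtree: a leaf (no split) or four recursively partitioned children."""
    if size <= min_size:
        return {"size": size, "split": False}
    if predict_split(cu, size) < threshold:        # parent unlikely to split: prune all sub-branches
        return {"size": size, "split": False}
    half = size // 2
    children = [partition_cu(predict_split, cu[dy:dy + half, dx:dx + half], half, min_size, threshold)
                for dy in (0, half) for dx in (0, half)]
    return {"size": size, "split": True, "children": children}

# Placeholder predictor: "split if the depth block is textured"; a learned model would go here.
tree = partition_cu(lambda block, s: float(block.std() > 10),
                    np.random.randint(0, 255, (64, 64)), size=64)
```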

17 pages, 3439 KB  
Article
A Novel Approach for Visual Speech Recognition Using the Partition-Time Masking and Swin Transformer 3D Convolutional Model
by Xiangliang Zhang, Yu Hu, Xiangzhi Liu, Yu Gu, Tong Li, Jibin Yin and Tao Liu
Sensors 2025, 25(8), 2366; https://doi.org/10.3390/s25082366 - 8 Apr 2025
Cited by 3 | Viewed by 1338
Abstract
Visual speech recognition is a technology that relies on visual information, offering unique advantages in noisy environments or when communicating with individuals with speech impairments. However, this technology still faces challenges, such as limited generalization ability due to different speech habits, high recognition error rates caused by confusable phonemes, and difficulties adapting to complex lighting conditions and facial occlusions. This paper proposes a lip-reading data augmentation method, Partition-Time Masking (PTM), to address these challenges and improve lip-reading models’ performance and generalization ability. Applying nonlinear transformations to the training data enhances the model’s generalization ability when handling diverse speakers and environmental conditions. A lip-reading recognition model architecture, Swin Transformer and 3D Convolution (ST3D), was designed to overcome the limitations of traditional lip-reading models that use ResNet-based front-end feature extraction networks; combining the Swin Transformer with 3D convolution improves performance. To validate the effectiveness of the Partition-Time Masking data augmentation method, experiments were conducted on the LRW video dataset using the DC-TCN model, achieving a peak accuracy of 92.15%. The ST3D model was validated on the LRW and LRW1000 video datasets, achieving a maximum accuracy of 56.1% on LRW1000 and 91.8% on LRW, outperforming current mainstream lip-reading models and demonstrating superior performance on challenging, easily confused samples. Full article
(This article belongs to the Special Issue Sensors for Biomechanical and Rehabilitation Engineering)
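
Going by the method's name, the augmentation can be approximated as masking a short random span inside each temporal partition of the clip, as sketched below; the paper's exact PTM policy (number of partitions, mask lengths, fill values) may differ.

```python
# Hedged approximation of a partition-time masking augmentation for a mouth-region clip.
import torch

def partition_time_mask(video: torch.Tensor, partitions: int = 4, max_mask: int = 3) -> torch.Tensor:
    """video: (T, C, H, W); returns a copy with one zeroed span per temporal partition."""
    out = video.clone()
    t = video.shape[0]
    bounds = torch.linspace(0, t, partitions + 1).long()
    for i in range(partitions):
        lo, hi = bounds[i].item(), bounds[i + 1].item()
        if hi - lo <= 1:
            continue
        span = int(torch.randint(1, min(max_mask, hi - lo) + 1, (1,)))
        start = int(torch.randint(lo, hi - span + 1, (1,)))
        out[start:start + span] = 0.0             # mask a contiguous span inside this partition
    return out

aug = partition_time_mask(torch.randn(29, 3, 88, 88))   # e.g. an LRW-style 29-frame mouth clip
```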

18 pages, 9658 KB  
Article
Swin-Panda: Behavior Recognition for Giant Pandas Based on Local Fine-Grained and Spatiotemporal Displacement Features
by Xinyu Yi, Han Su, Peng Min, Mengnan He, Yimin Han, Gai Luo, Pengcheng Wu, Qingyue Min, Rong Hou and Peng Chen
Diversity 2025, 17(2), 139; https://doi.org/10.3390/d17020139 - 19 Feb 2025
Viewed by 1026
Abstract
The giant panda, a rare and iconic species endemic to China, has attracted significant attention from both domestic and international researchers due to its crucial ecological role, unique cultural value, and distinct evolutionary history. While substantial progress has been made in the field of individual identification, behavior recognition remains underdeveloped, facing challenges such as the lack of dynamic temporal features and insufficient extraction of behavioral characteristics. To address these challenges, we propose the Swin-Panda model, which leverages transfer learning based on the Video Swin Transformer architecture within the mmaction2 framework. In addition, we introduce two novel modules: the Comprehensive Perception Auxiliary Module and the Spatiotemporal Shift Attention Module. These modules facilitate the extraction of local and spatiotemporal information, allowing the model to more effectively capture the behavioral and movement patterns of giant pandas. Experimental results on the PACV-8 dataset demonstrate that our model achieves an accuracy of 88.02%, outperforming several benchmark models. This approach significantly enhances behavior recognition accuracy, thereby contributing to the advancement of panda welfare and species conservation efforts. Full article

16 pages, 2668 KB  
Article
Localization of Capsule Endoscope in Alimentary Tract by Computer-Aided Analysis of Endoscopic Images
by Ruiyao Zhang, Boyuan Peng, Yiyang Liu, Xinkai Liu, Jie Huang, Kohei Suzuki, Yuki Nakajima, Daiki Nemoto, Kazutomo Togashi and Xin Zhu
Sensors 2025, 25(3), 746; https://doi.org/10.3390/s25030746 - 26 Jan 2025
Cited by 2 | Viewed by 1358
Abstract
Capsule endoscopy is a common method for detecting digestive diseases. The location of a capsule endoscope should be constantly monitored through a visual inspection of the endoscopic images by medical staff to confirm the examination’s progress. In this study, we proposed a computer-aided analysis (CADx) method for the localization of a capsule endoscope. At first, a classifier based on a Swin Transformer was proposed to classify each frame of the capsule endoscopy videos into images of the stomach, small intestine, and large intestine, respectively. Then, a K-means algorithm was used to correct outliers in the classification results. Finally, a localization algorithm was proposed to determine the position of the capsule endoscope in the alimentary tract. The proposed method was developed and validated using videos of 204 consecutive cases. The proposed CADx, based on a Swin Transformer, showed a precision of 93.46%, 97.28%, and 98.68% for the classification of endoscopic images recorded in the stomach, small intestine, and large intestine, respectively. Compared with the landmarks identified by endoscopists, the proposed method demonstrated an average transition time error of 16.2 s to locate the intersection of the stomach and small intestine, as well as 13.5 s to locate that of the small intestine and the large intestine, based on the 20 validation videos with an average length of 3261.8 s. The proposed method accurately localizes the capsule endoscope in the alimentary tract and may replace the laborious real-time visual inspection in capsule endoscopic examinations. Full article
(This article belongs to the Special Issue Advances in Optical Sensing, Instrumentation and Systems: 2nd Edition)
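
The post-processing stage can be illustrated as follows: per-frame organ predictions are smoothed to suppress isolated misclassifications, and the two organ-transition times are then read off. A sliding majority vote is used here as a simple stand-in for the paper's K-means-based outlier correction.

```python
# Sketch: smooth per-frame organ labels, then locate the two transition times.
import numpy as np

def locate_transitions(labels: np.ndarray, fps: float, win: int = 51):
    """labels: per-frame classes (0 = stomach, 1 = small intestine, 2 = large intestine)."""
    half = win // 2
    padded = np.pad(labels, half, mode="edge")
    smoothed = np.array([np.bincount(padded[i:i + win], minlength=3).argmax()
                         for i in range(labels.size)])            # majority vote per window
    t_small = np.argmax(smoothed == 1) / fps      # first frame classified as small intestine
    t_large = np.argmax(smoothed == 2) / fps      # first frame classified as large intestine
    return smoothed, t_small, t_large

labels = np.array([0] * 300 + [1] * 900 + [2] * 600)      # synthetic prediction sequence
_, t1, t2 = locate_transitions(labels, fps=2.0)
print(t1, t2)                                             # 150.0 600.0 seconds
```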

22 pages, 11079 KB  
Article
Hybrid 3D Convolutional–Transformer Model for Detecting Stereotypical Motor Movements in Autistic Children During Pre-Meltdown Crisis
by Salma Kammoun Jarraya and Marwa Masmoudi
Appl. Sci. 2024, 14(23), 11458; https://doi.org/10.3390/app142311458 - 9 Dec 2024
Viewed by 1245
Abstract
Computer vision using deep learning algorithms has served numerous human activity identification applications, particularly those linked to safety and security. However, even though autistic children are frequently exposed to danger as a result of their activities, many computer vision experts have shown little interest in their safety. Several autistic children show severe challenging behaviors such as the Meltdown Crisis, which is characterized by hostile behaviors and loss of control. This study aims to introduce a monitoring system capable of predicting the Meltdown Crisis condition early and alerting the children’s parents or caregivers before the situation escalates. For this endeavor, the suggested system combines a pre-trained Vision Transformer (ViT) model (Swin-3D-b) with a Residual Network (ResNet) architecture to extract robust features from video sequences and to learn the spatial and temporal characteristics of the Stereotyped Motor Movements (SMMs) made by autistic children at the beginning of the Meltdown Crisis state, referred to as the Pre-Meltdown Crisis state. The evaluation was conducted using the MeltdownCrisis dataset, which contains realistic scenarios of autistic children’s behaviors in the Pre-Meltdown Crisis state, with data from the Normal state serving as the negative class. The proposed model achieved a classification accuracy of 92%. Full article

15 pages, 23802 KB  
Article
Vision-Based Prediction of Flashover Using Transformers and Convolutional Long Short-Term Memory Model
by M. Hamed Mozaffari, Yuchuan Li, Niloofar Hooshyaripour and Yoon Ko
Electronics 2024, 13(23), 4776; https://doi.org/10.3390/electronics13234776 - 3 Dec 2024
Cited by 1 | Viewed by 1209
Abstract
The prediction of fire growth is crucial for effective firefighting and rescue operations. Recent advancements in vision-based techniques using RGB vision and infrared (IR) thermal imaging data, coupled with artificial intelligence and deep learning techniques, have shown promising solutions to be applied in the detection of fire and the prediction of its behavior. This study introduces the use of Convolutional Long Short-term Memory (ConvLSTM) network models for predicting room fire growth by analyzing spatiotemporal IR thermal imaging data acquired from full-scale room fire tests. Our findings revealed that SwinLSTM, an enhanced version of ConvLSTM combined with transformers (a deep learning architecture based on a new mechanism called multi-head attention) for computer vision purposes, can be used for the prediction of room fire flashover occurrence. Notably, transformer-based ConvLSTM deep learning models, such as SwinLSTM, demonstrate superior prediction capability, which suggests a new vision-based smart solution for future fire growth prediction tasks. The main focus of this work is to perform a feasibility study on the use of a pure vision-based deep learning model for analysis of future video data to anticipate behavior of fire growth in room fire incidents. Full article
(This article belongs to the Special Issue Deep Learning for Computer Vision Application)
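
For readers unfamiliar with the recurrent building block, a minimal plain ConvLSTM cell is sketched below; SwinLSTM replaces the convolutional gating with Swin Transformer blocks, which is not reproduced here.

```python
# Minimal ConvLSTM cell: carries spatial hidden state across successive IR frames.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch: int, hidden_ch: int, kernel: int = 3):
        super().__init__()
        self.hidden_ch = hidden_ch
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, kernel, padding=kernel // 2)

    def forward(self, x, state=None):              # x: (B, C, H, W) one thermal frame
        b, _, h, w = x.shape
        if state is None:
            state = (x.new_zeros(b, self.hidden_ch, h, w), x.new_zeros(b, self.hidden_ch, h, w))
        h_prev, c_prev = state
        i, f, o, g = self.gates(torch.cat([x, h_prev], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h_new = torch.sigmoid(o) * torch.tanh(c)
        return h_new, (h_new, c)

cell, state = ConvLSTMCell(1, 16), None
for frame in torch.randn(10, 2, 1, 64, 64):        # 10 frames, batch of 2, single IR channel
    out, state = cell(frame, state)
```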

15 pages, 2137 KB  
Article
Research on Abnormal State Detection of CZ Silicon Single Crystal Based on Multimodal Fusion
by Lei Jiang, Haotan Wei and Ding Liu
Sensors 2024, 24(21), 6819; https://doi.org/10.3390/s24216819 - 23 Oct 2024
Viewed by 1171
Abstract
The Czochralski method is the primary technique for single-crystal silicon production. However, anomalous states such as crystal loss, twisting, swinging, and squareness frequently occur during crystal growth, adversely affecting product quality and production efficiency. To address this challenge, we propose an enhanced multimodal fusion classification model for detecting and categorizing these four anomalous states. Our model initially transforms one-dimensional signals (diameter, temperature, and pulling speed) into time–frequency domain images via continuous wavelet transform. These images are then processed using a Dense-ECA-SwinTransformer network for feature extraction. Concurrently, meniscus images and inter-frame difference images are obtained from the growth system’s meniscus video feed. These visual inputs are fused at the channel level and subsequently processed through a ConvNeXt network for feature extraction. Finally, the time–frequency domain features are combined with the meniscus image features and fed into fully connected layers for multi-class classification. The experimental results show that the method effectively detects the various abnormal states, helping staff make more accurate judgments and formulate targeted responses to each abnormal state, thereby improving production efficiency, saving production resources, and protecting the equipment. Full article
(This article belongs to the Special Issue Feature Papers in Physical Sensors 2024)
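
The first preprocessing step, turning a one-dimensional process signal into a time-frequency image, can be reproduced with PyWavelets as sketched below; the sampling rate, scales, and mother wavelet are assumptions, and the downstream Dense-ECA-SwinTransformer/ConvNeXt fusion is omitted.

```python
# Continuous wavelet transform of a 1-D process signal into a scalogram image (requires PyWavelets).
import numpy as np
import pywt

fs = 10.0                                      # assumed sampling rate in Hz
t = np.arange(0, 60, 1 / fs)                   # one minute of a simulated diameter signal
signal = np.sin(2 * np.pi * 0.2 * t) + 0.1 * np.random.randn(t.size)

scales = np.arange(1, 65)                      # 64 scales -> 64-row scalogram
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)
scalogram = np.abs(coeffs)                     # (64, len(signal)) image fed to the CNN branch
print(scalogram.shape)
```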

21 pages, 6287 KB  
Article
Spatiotemporal Sensitive Network for Non-Contact Heart Rate Prediction from Facial Videos
by Liying Su, Yitao Wang, Dezhao Zhai, Yuping Shi, Yinghao Ding, Guohua Gao, Qinwei Li, Ming Yu and Hang Wu
Appl. Sci. 2024, 14(20), 9551; https://doi.org/10.3390/app14209551 - 19 Oct 2024
Cited by 1 | Viewed by 1574
Abstract
Heart rate (HR) is an important indicator reflecting the overall physical and mental health of the human body, playing a crucial role in diagnosing cardiovascular and neurological diseases. Recent research has revealed that variations in the light absorption of human skin captured through facial video over the cardiac cycle, due to changes in blood volume, can be utilized for non-contact HR estimation. However, most existing methods rely on single-modal video sources (such as RGB or NIR), which often yield suboptimal results due to noise and the limitations of a single information source. To overcome these challenges, this paper proposes a multimodal information fusion architecture named the spatiotemporal sensitive network (SS-Net) for non-contact heart rate estimation. Firstly, spatiotemporal feature maps are utilized to extract physiological signals from RGB and NIR videos effectively. Next, a spatiotemporal sensitive (SS) module is introduced to extract useful physiological signal information from both RGB and NIR spatiotemporal maps. Finally, a multi-level spatiotemporal context fusion (MLSC) module is designed to fuse and complement information between the visible light and infrared modalities. Then, different levels of fused features are refined in task-specific branches to predict both remote photoplethysmography (rPPG) signals and heart rate (HR) signals. Experiments conducted on three datasets demonstrate that the proposed SS-Net achieves superior performance compared to existing methods. Full article
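
One common way to build such a spatiotemporal map is to average each colour channel over a grid of facial regions per frame and stack the traces over time, as sketched below; SS-Net's actual map construction and its SS/MLSC modules are not reproduced here.

```python
# Build a region-averaged spatiotemporal map from a cropped face video (common rPPG preprocessing).
import numpy as np

def spatiotemporal_map(frames: np.ndarray, grid: int = 5) -> np.ndarray:
    """frames: (T, H, W, C); returns a map of shape (grid*grid*C, T)."""
    t, h, w, c = frames.shape
    rows = []
    for gy in range(grid):
        for gx in range(grid):
            patch = frames[:, gy * h // grid:(gy + 1) * h // grid,
                              gx * w // grid:(gx + 1) * w // grid, :]
            rows.append(patch.mean(axis=(1, 2)))       # (T, C) mean colour trace for this region
    return np.concatenate(rows, axis=1).T              # (grid*grid*C, T)

st_map = spatiotemporal_map(np.random.rand(150, 120, 120, 3))   # 150 RGB face crops
print(st_map.shape)                                             # (75, 150)
```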

14 pages, 1847 KB  
Article
A Dynamic Position Embedding-Based Model for Student Classroom Complete Meta-Action Recognition
by Zhaoyu Shou, Xiaohu Yuan, Dongxu Li, Jianwen Mo, Huibing Zhang, Jingwei Zhang and Ziyong Wu
Sensors 2024, 24(16), 5371; https://doi.org/10.3390/s24165371 - 20 Aug 2024
Cited by 3 | Viewed by 1597
Abstract
The precise recognition of entire classroom meta-actions is a crucial challenge for the tailored adaptive interpretation of student behavior, given the intricacy of these actions. This paper proposes a Dynamic Position Embedding-based Model for Student Classroom Complete Meta-Action Recognition (DPE-SAR) based on the Video Swin Transformer. The model utilizes a dynamic positional embedding technique to perform conditional positional encoding. Additionally, it incorporates a deep convolutional network to improve the parsing ability of the spatial structure of meta-actions. The full attention mechanism of ViT3D is used to extract the potential spatial features of actions and capture the global spatial–temporal information of meta-actions. The proposed model exhibits exceptional performance compared to baseline models in action recognition as observed in evaluations on public datasets and smart classroom meta-action recognition datasets. The experimental results confirm the superiority of the model in meta-action recognition. Full article
(This article belongs to the Section Intelligent Sensors)
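
Dynamic (conditional) positional encoding is commonly realized by generating the position signal from the tokens themselves with a depthwise convolution over the token grid; a 3-D video variant is sketched below as an illustration, not as the exact DPE-SAR module.

```python
# Conditional positional embedding for video tokens via a depthwise 3-D convolution.
import torch
import torch.nn as nn

class ConditionalPosEmbed3D(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise conv

    def forward(self, tokens: torch.Tensor, t: int, h: int, w: int) -> torch.Tensor:
        b, n, d = tokens.shape                         # tokens: (B, T*H*W, dim)
        grid = tokens.transpose(1, 2).reshape(b, d, t, h, w)
        pos = self.proj(grid).flatten(2).transpose(1, 2)
        return tokens + pos                            # the position signal depends on the input itself

tok = torch.randn(2, 8 * 7 * 7, 96)                    # 8x7x7 token grid, 96-dim embeddings
tok = ConditionalPosEmbed3D(96)(tok, t=8, h=7, w=7)
```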
