Search Results (34)

Search Parameters:
Keywords = video vision transformer (ViViT)

20 pages, 4569 KB  
Article
Lightweight Vision Transformer for Frame-Level Ergonomic Posture Classification in Industrial Workflows
by Luca Cruciata, Salvatore Contino, Marianna Ciccarelli, Roberto Pirrone, Leonardo Mostarda, Alessandra Papetti and Marco Piangerelli
Sensors 2025, 25(15), 4750; https://doi.org/10.3390/s25154750 - 1 Aug 2025
Viewed by 485
Abstract
Work-related musculoskeletal disorders (WMSDs) are a leading concern in industrial ergonomics, often stemming from sustained non-neutral postures and repetitive tasks. This paper presents a vision-based framework for real-time, frame-level ergonomic risk classification using a lightweight Vision Transformer (ViT). The proposed system operates directly on raw RGB images without requiring skeleton reconstruction, joint angle estimation, or image segmentation. A single ViT model simultaneously classifies eight anatomical regions, enabling efficient multi-label posture assessment. Training is supervised using a multimodal dataset acquired from synchronized RGB video and full-body inertial motion capture, with ergonomic risk labels derived from RULA scores computed on joint kinematics. The system is validated on realistic, simulated industrial tasks that include common challenges such as occlusion and posture variability. Experimental results show that the ViT model achieves state-of-the-art performance, with F1-scores exceeding 0.99 and AUC values above 0.996 across all regions. Compared to previous CNN-based systems, the proposed model improves classification accuracy and generalizability while reducing complexity and enabling real-time inference on edge devices. These findings demonstrate the model’s potential for unobtrusive, scalable ergonomic risk monitoring in real-world manufacturing environments. Full article
(This article belongs to the Special Issue Secure and Decentralised IoT Systems)
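As an illustration of the frame-level, multi-label setup described above, the following sketch pairs a shared ViT backbone with one risk logit per anatomical region. It is a minimal reading of the abstract, not the authors' implementation: the timm backbone name, the binary per-region labels, and the input size are assumptions.

```python
# Minimal sketch of a frame-level, multi-label posture classifier: one shared
# ViT backbone, one risk logit per anatomical region. Backbone choice, binary
# labels, and input size are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn
import timm


class MultiRegionPostureViT(nn.Module):
    def __init__(self, num_regions: int = 8, backbone: str = "vit_tiny_patch16_224"):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of logits.
        self.backbone = timm.create_model(backbone, pretrained=False, num_classes=0)
        self.head = nn.Linear(self.backbone.num_features, num_regions)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, 224, 224) raw RGB; output: (batch, num_regions) logits.
        return self.head(self.backbone(frames))


model = MultiRegionPostureViT()
logits = model(torch.randn(2, 3, 224, 224))
# Multi-label training would pair these logits with BCEWithLogitsLoss
# against per-region risk labels derived from RULA scores.
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (2, 8)).float())
```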

21 pages, 7528 KB  
Article
A Fine-Tuning Method via Adaptive Symmetric Fusion and Multi-Graph Aggregation for Human Pose Estimation
by Yinliang Shi, Zhaonian Liu, Bin Jiang, Tianqi Dai and Yuanfeng Lian
Symmetry 2025, 17(7), 1098; https://doi.org/10.3390/sym17071098 - 9 Jul 2025
Viewed by 383
Abstract
Human Pose Estimation (HPE) aims to accurately locate the positions of human key points in images or videos. However, the performance of HPE is often significantly reduced in practical application scenarios due to environmental interference. To address this challenge, we propose a ladder side-tuning method for the Vision Transformer (ViT) pre-trained model based on multi-path feature fusion to improve the accuracy of HPE in highly interfering environments. First, we extract the global features, frequency features and multi-scale spatial features through the ViT pre-trained model, the discrete wavelet convolutional network and the atrous spatial pyramid pooling (ASPP) network, respectively. By comprehensively capturing the information of the human body and the environment, the ability of the model to analyze local details, textures, and spatial information is enhanced. In order to efficiently fuse these features, we devise an adaptive symmetric feature fusion strategy, which dynamically adjusts the intensity of feature fusion according to the similarity among features to achieve the optimal fusion effect. Finally, a multi-graph feature aggregation method is developed. We construct graph structures of different features and deeply explore the subtle differences among the features based on the dual fusion mechanism of points and edges to ensure information integrity. The experimental results demonstrate that our method achieves 4.3% and 4.2% improvements in the AP metric on the MS COCO dataset and a custom high-interference dataset, respectively, compared with HRNet. This highlights its superiority for human pose estimation tasks in both general and interfering environments. Full article
(This article belongs to the Special Issue Symmetry and Asymmetry in Computer Vision and Graphics)
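The "adaptive symmetric feature fusion" idea can be pictured as similarity-gated mixing of two branch features. The sketch below is one hedged interpretation, not the paper's formulation; the gating rule and the feature dimension are assumptions.

```python
# Minimal sketch of similarity-gated feature fusion: the more two feature
# vectors agree (cosine similarity), the more strongly they are averaged;
# otherwise an element-wise maximum dominates. This is an illustrative reading
# of "adaptive symmetric fusion", not the paper's exact formulation.
import torch
import torch.nn.functional as F


def adaptive_fuse(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    # feat_a, feat_b: (batch, dim) features from two branches (e.g., ViT vs. ASPP).
    sim = F.cosine_similarity(feat_a, feat_b, dim=-1, eps=1e-8)   # (batch,)
    alpha = 0.5 * (sim + 1.0)                                     # map [-1, 1] to [0, 1]
    alpha = alpha.unsqueeze(-1)                                   # (batch, 1)
    # Symmetric in its two inputs: swapping feat_a and feat_b gives the same output.
    return alpha * (feat_a + feat_b) / 2 + (1 - alpha) * torch.maximum(feat_a, feat_b)


fused = adaptive_fuse(torch.randn(4, 256), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 256])
```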

21 pages, 6048 KB  
Article
GenConViT: Deepfake Video Detection Using Generative Convolutional Vision Transformer
by Deressa Wodajo Deressa, Hannes Mareen, Peter Lambert, Solomon Atnafu, Zahid Akhtar and Glenn Van Wallendael
Appl. Sci. 2025, 15(12), 6622; https://doi.org/10.3390/app15126622 - 12 Jun 2025
Viewed by 1758
Abstract
Deepfakes have raised significant concerns due to their potential to spread false information and compromise the integrity of digital media. Current deepfake detection models often struggle to generalize across a diverse range of deepfake generation techniques and video content. In this work, we propose a Generative Convolutional Vision Transformer (GenConViT) for deepfake video detection. Our model combines ConvNeXt and Swin Transformer models for feature extraction, and it utilizes an Autoencoder and Variational Autoencoder to learn from latent data distributions. By learning from the visual artifacts and latent data distribution, GenConViT achieves an improved performance in detecting a wide range of deepfake videos. The model is trained and evaluated on DFDC, FF++, TM, DeepfakeTIMIT, and Celeb-DF (v2) datasets. The proposed GenConViT model demonstrates strong performance in deepfake video detection, achieving high accuracy across the tested datasets. While our model shows promising results in deepfake video detection by leveraging visual and latent features, we demonstrate that further work is needed to improve its generalizability when encountering out-of-distribution data. Our model provides an effective solution for identifying a wide range of fake videos while preserving the integrity of media. Full article

23 pages, 1894 KB  
Article
ViViT-Prob: A Radar Echo Extrapolation Model Based on Video Vision Transformer and Spatiotemporal Sparse Attention
by Yunan Qiu, Bingjian Lu, Wenrui Xiong, Zhenyu Lu, Le Sun and Yingjie Cui
Remote Sens. 2025, 17(12), 1966; https://doi.org/10.3390/rs17121966 - 6 Jun 2025
Viewed by 572
Abstract
Weather radar, as a crucial component of remote sensing data, plays a vital role in convective weather forecasting through radar echo extrapolation techniques. To address the limitations of existing deep learning methods in radar echo extrapolation, this paper proposes a radar echo extrapolation model based on video vision transformer and spatiotemporal sparse attention (ViViT-Prob). The model takes historical sequences as input and initially maps them into a fixed-dimensional vector space through 3D convolutional patch encoding. Subsequently, a multi-head spatiotemporal fusion module with sparse attention encodes these vectors, effectively capturing spatiotemporal relationships between different regions in the sequences. The sparse constraint enables better utilization of data structural information, enhanced focus on critical regions, and reduced computational complexity. Finally, a parallel output decoder generates all time-step predictions simultaneously, which are then mapped back to the prediction space through a deconvolution module to reconstruct high-resolution images. Our experimental results on the Moving MNIST and real radar echo datasets demonstrate that the proposed model achieves superior performance in spatiotemporal sequence prediction and improves the prediction accuracy while maintaining structural consistency in radar echo extrapolation tasks, providing an effective solution for short-term precipitation forecasting. Full article
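The 3D convolutional patch encoding step that maps an echo sequence into a fixed-dimensional token space can be sketched as a tubelet embedding. The tubelet sizes, embedding width, and input resolution below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of 3D-convolutional patch (tubelet) encoding for a radar echo
# sequence: a Conv3d with stride == kernel size splits the video into
# non-overlapping space-time patches and projects each one to embed_dim.
import torch
import torch.nn as nn


class TubeletEmbedding(nn.Module):
    def __init__(self, in_ch=1, embed_dim=256, t_patch=2, s_patch=16):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, embed_dim,
                              kernel_size=(t_patch, s_patch, s_patch),
                              stride=(t_patch, s_patch, s_patch))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width), e.g., 10 past radar echoes.
        tokens = self.proj(x)                      # (B, D, T', H', W')
        return tokens.flatten(2).transpose(1, 2)   # (B, T'*H'*W', D) token sequence


tokens = TubeletEmbedding()(torch.randn(2, 1, 10, 128, 128))
print(tokens.shape)  # torch.Size([2, 320, 256])
```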

13 pages, 354 KB  
Article
Enhanced Cleft Lip and Palate Classification Using SigLIP 2: A Comparative Study with Vision Transformers and Siamese Networks
by Oraphan Nantha, Benjaporn Sathanarugsawait and Prasong Praneetpolgrang
Appl. Sci. 2025, 15(9), 4766; https://doi.org/10.3390/app15094766 - 25 Apr 2025
Viewed by 1966
Abstract
This paper extends our previous work on cleft lip and/or palate (CL/P) classification, which employed vision transformers (ViTs) and Siamese neural networks. We now integrate SigLIP 2, a state-of-the-art multilingual vision–language model, for feature extraction, replacing the previously utilized BiomedCLIP. SigLIP 2 offers enhanced semantic understanding, improved localization capabilities, and multilingual support, potentially leading to more robust feature representations for CL/P classification. We hypothesize that SigLIP 2’s superior feature extraction will improve the classification accuracy of CL/P types (bilateral, unilateral, and palate-only) from the UltraSuite CLEFT dataset, a collection of ultrasound video sequences capturing tongue movements during speech with synchronized audio recordings. A comparative analysis is conducted, evaluating the performance of our original ViT-Siamese network model (using BiomedCLIP) against a new model leveraging SigLIP 2 for feature extraction. Performance is assessed using accuracy, precision, recall, and F1 score, demonstrating the impact of SigLIP 2 on CL/P classification. The new model achieves statistically significant improvements in overall accuracy (86.6% vs. 82.76%) and F1 scores for all cleft types. We discuss the computational efficiency and practical implications of employing SigLIP 2 in a clinical setting, highlighting its potential for earlier and more accurate diagnosis, personalized treatment planning, and broader applicability across diverse populations. The results demonstrate the significant potential of advanced vision–language models, such as SigLIP 2, to enhance AI-powered medical diagnostics. Full article

27 pages, 10045 KB  
Article
Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding
by Mohammed Elhenawy, Huthaifa I. Ashqar, Andry Rakotonirainy, Taqwa I. Alhadidi, Ahmed Jaber and Mohammad Abu Tami
Electronics 2025, 14(7), 1282; https://doi.org/10.3390/electronics14071282 - 24 Mar 2025
Cited by 3 | Viewed by 3497
Abstract
Scene understanding is essential for enhancing driver safety, generating human-centric explanations for Automated Vehicle (AV) decisions, and leveraging Artificial Intelligence (AI) for retrospective driving video analysis. This study developed a dynamic scene retrieval system using Contrastive Language–Image Pretraining (CLIP) models, which can be optimized for real-time deployment on edge devices. The proposed system outperforms state-of-the-art in-context learning methods, including the zero-shot capabilities of GPT-4o, particularly in complex scenarios. By conducting frame-level analyses on the Honda Scenes Dataset, which contains a collection of about 80 h of annotated driving videos capturing diverse real-world road and weather conditions, our study highlights the robustness of CLIP models in learning visual concepts from natural language supervision. The results also showed that fine-tuning the CLIP models, such as ViT-L/14 (Vision Transformer) and ViT-B/32, significantly improved scene classification, achieving a top F1-score of 91.1%. These results demonstrate the ability of the system to deliver rapid and precise scene recognition, which can be used to meet the critical requirements of advanced driver assistance systems (ADASs). This study shows the potential of CLIP models to provide scalable and efficient frameworks for dynamic scene understanding and classification. Furthermore, this work lays the groundwork for advanced autonomous vehicle technologies by fostering a deeper understanding of driver behavior, road conditions, and safety-critical scenarios, marking a significant step toward smarter, safer, and more context-aware autonomous driving systems. Full article
(This article belongs to the Special Issue Intelligent Transportation Systems and Sustainable Smart Cities)
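For readers unfamiliar with CLIP-style scene classification, the sketch below shows plain zero-shot classification with the open-source clip package, scoring an image against text prompts. It does not reproduce the authors' fine-tuned ViT-L/14 pipeline; the scene prompts and the random stand-in frame are placeholders rather than the Honda Scenes taxonomy.

```python
# Minimal sketch of CLIP-based scene classification by image-text similarity,
# using the open-source "clip" package. Zero-shot only (no fine-tuning).
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative prompts; the real label set would come from the dataset annotations.
scene_prompts = ["a photo of a highway in clear weather",
                 "a photo of an urban intersection at night",
                 "a photo of a road in heavy rain"]

# Stand-in for a decoded video frame (replace with a real frame in practice).
frame = Image.fromarray((np.random.rand(480, 640, 3) * 255).astype(np.uint8))

image = preprocess(frame).unsqueeze(0).to(device)
text = clip.tokenize(scene_prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(scene_prompts[probs.argmax().item()])
```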

15 pages, 2030 KB  
Article
Transformer-Based Student Engagement Recognition Using Few-Shot Learning
by Wejdan Alarefah, Salma Kammoun Jarraya and Nihal Abuzinadah
Computers 2025, 14(3), 109; https://doi.org/10.3390/computers14030109 - 18 Mar 2025
Viewed by 1557
Abstract
Improving the recognition of online learning engagement is a critical issue in educational information technology, due to the complexities of student behavior and varying assessment standards. Additionally, the scarcity of publicly available datasets for engagement recognition exacerbates this challenge. The majority of existing methods for detecting student engagement necessitate significant amounts of annotated data to capture variations in behaviors and interaction patterns. To address these limitations, we investigate few-shot learning (FSL) techniques to reduce the dependency on extensive training data. Transformer-based models have shown strong results for video-based facial recognition tasks, breaking new ground for understanding complex patterns. In this research, we propose an innovative FSL model that employs a prototypical network with the vision transformer (ViT) model pre-trained on a face recognition dataset (e.g., MS1MV2) for spatial feature extraction, followed by an LSTM layer for temporal feature extraction. This approach effectively addresses the challenges of limited labeled data in engagement recognition. Our proposed approach achieves state-of-the-art performance on the EngageNet dataset, demonstrating its efficacy and potential in advancing engagement recognition research. Full article
(This article belongs to the Special Issue Advanced Image Processing and Computer Vision)
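The prototypical-network step at the core of this FSL approach can be sketched in a few lines: class prototypes are the means of support embeddings, and queries are scored by distance to each prototype. The embedding network (ViT plus LSTM over frames) is abstracted into random vectors here, and the episode sizes are assumptions.

```python
# Minimal sketch of prototypical-network classification for one few-shot episode.
import torch


def prototypical_logits(support: torch.Tensor, support_labels: torch.Tensor,
                        query: torch.Tensor, num_classes: int) -> torch.Tensor:
    # support: (n_support, dim), query: (n_query, dim) video-level embeddings.
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in range(num_classes)])   # (C, dim)
    dists = torch.cdist(query, prototypes)                    # (n_query, C)
    return -dists  # higher logit = closer to that class prototype


# Toy 4-way, 5-shot episode with 64-dim embeddings (all values illustrative).
support = torch.randn(20, 64)
labels = torch.arange(4).repeat_interleave(5)
query = torch.randn(8, 64)
print(prototypical_logits(support, labels, query, num_classes=4).argmax(dim=1))
```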

15 pages, 11262 KB  
Article
Fiber Sensing in the 6G Era: Vision Transformers for ϕ-OTDR-Based Road-Traffic Monitoring
by Robson A. Colares, Leticia Rittner, Evandro Conforti and Darli A. A. Mello
Appl. Sci. 2025, 15(6), 3170; https://doi.org/10.3390/app15063170 - 14 Mar 2025
Viewed by 760
Abstract
This article adds to the emergent body of research that examines the potential of 6G as a platform that can combine wired and wireless sensing modalities. We apply vision transformers (ViTs) in a distributed fiber-optic sensing system to evaluate road traffic parameters in smart cities. Convolutional neural networks (CNNs) are also assessed for benchmarking. The experimental setup is based on a direct-detection phase-sensitive optical time-domain reflectometer (ϕ-OTDR) implemented using a narrow linewidth source. The monitored fibers are buried on the university campus, creating a smart city environment. Backscattered traces are consolidated into space–time matrices, illustrating traffic patterns and enabling analysis through image processing algorithms. The ground truth is established by traffic parameters obtained by processing video camera images monitoring the same street using the YOLOv8 model. The results indicate that ViTs outperform CNNs for estimating the number of vehicles and the mean vehicle speed. While a ViT necessitates a significantly larger number of parameters, its complexity is similar to that of a CNN when considering multiply–accumulate operations and random access memory usage. The processed dataset has been made publicly available for benchmarking. Full article
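The consolidation of backscattered traces into space-time matrices can be sketched as stacking per-pulse profiles and normalizing per fiber position. The trace length, pulse count, and normalization below are assumptions, not the paper's acquisition parameters.

```python
# Minimal sketch of building a phi-OTDR space-time matrix (time x fiber position)
# that can then be fed to a ViT or CNN as an image.
import numpy as np


def traces_to_spacetime(traces):
    # traces: list of 1-D backscatter profiles, one per interrogation pulse.
    matrix = np.stack(traces, axis=0).astype(np.float32)   # (time, position)
    # Per-position normalization highlights vibration-induced variations
    # (e.g., passing vehicles) over the static backscatter baseline.
    matrix -= matrix.mean(axis=0, keepdims=True)
    matrix /= matrix.std(axis=0, keepdims=True) + 1e-6
    return matrix


fake_traces = [np.random.rand(2048) for _ in range(512)]    # 512 pulses x 2048 samples
image = traces_to_spacetime(fake_traces)
print(image.shape)  # (512, 2048)
```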

17 pages, 4741 KB  
Article
Liquid Level Detection of Polytetrafluoroethylene Emulsion Rotary Vibrating Screen Device Based on TransResNet
by Wenwu Liu, Xianghui Fan, Meng Liu, Hang Li, Jiang Du and Nianbo Liu
Electronics 2025, 14(5), 913; https://doi.org/10.3390/electronics14050913 - 25 Feb 2025
Cited by 1 | Viewed by 636
Abstract
The precise real-time detection of polytetrafluoroethylene (PTFE) emulsion rotary vibration sieve levels is critical for improving production efficiency, ensuring product quality, and safeguarding personnel safety. This research presents a deep-learning-oriented video surveillance model for the intelligent level detection of vibrating screens, waste drums, and emulsion outlets, effectively addressing the limitations of traditional methods. With the introduction of TransResNet, which combines Vision Transformer (ViT) with ResNet, we can utilize the advantages of both approaches. ViT has excellent global information capture capability for processing image features, while ResNet excels in local feature extraction. The combined model effectively recognizes level changes in complex backgrounds, enhancing overall detection performance. During model training, synthetic data generation is used to alleviate the label scarcity problem and generate synthetic images under different liquid level states to further enrich the training dataset, solve the issue of unequal data distribution, and enhance the model’s capacity to generalize. In order to validate the efficacy of our proposed model, we carried out a performance test with real-world data obtained from a material production site. The test results show that the model achieves 96%, 99%, and 99% accuracy at the three test points (the vibrating screen, waste drum, and emulsion outlet), respectively. These results not only prove the efficiency of the model but also highlight its significant value in practical applications. Full article
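A minimal way to picture the ViT-plus-ResNet combination is to concatenate pooled features from both backbones before a small classifier, as in the sketch below. The timm backbones, fusion by concatenation, and three-class head are assumptions; the paper's TransResNet wiring may differ.

```python
# Minimal sketch of combining a ViT branch (global context) with a ResNet branch
# (local detail) for liquid-level classification.
import torch
import torch.nn as nn
import timm


class ViTResNetFusion(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        # num_classes=0 -> both backbones return pooled feature vectors.
        self.vit = timm.create_model("vit_small_patch16_224", pretrained=False, num_classes=0)
        self.resnet = timm.create_model("resnet18", pretrained=False, num_classes=0)
        self.classifier = nn.Linear(self.vit.num_features + self.resnet.num_features,
                                    num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.vit(x), self.resnet(x)], dim=-1)
        return self.classifier(fused)


logits = ViTResNetFusion()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 3])
```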

17 pages, 3294 KB  
Article
Hybrid Neural Network Models to Estimate Vital Signs from Facial Videos
by Yufeng Zheng
BioMedInformatics 2025, 5(1), 6; https://doi.org/10.3390/biomedinformatics5010006 - 22 Jan 2025
Cited by 2 | Viewed by 1847
Abstract
Introduction: Remote health monitoring plays a crucial role in telehealth services and the effective management of patients, which can be enhanced by vital sign prediction from facial videos. Facial videos are easily captured through various imaging devices like phone cameras, webcams, or surveillance systems. Methods: This study introduces a hybrid deep learning model aimed at estimating heart rate (HR), blood oxygen saturation level (SpO2), and blood pressure (BP) from facial videos. The hybrid model integrates convolutional neural network (CNN), convolutional long short-term memory (convLSTM), and video vision transformer (ViViT) architectures to ensure comprehensive analysis. Given the temporal variability of HR and BP, emphasis is placed on temporal resolution during feature extraction. The CNN processes video frames one by one, while convLSTM and ViViT handle sequences of frames. These high-resolution temporal features are fused to predict HR, BP, and SpO2, capturing their dynamic variations effectively. Results: The dataset encompasses 891 subjects of diverse races and ages, and preprocessing includes facial detection and data normalization. Experimental results demonstrate high accuracies in predicting HR, SpO2, and BP using the proposed hybrid models. Discussion: Facial images can be easily captured using smartphones, which offers an economical and convenient solution for vital sign monitoring, particularly beneficial for elderly individuals or during outbreaks of contagious diseases like COVID-19. The proposed models were validated on only one dataset; however, dataset characteristics (size, representation, diversity, balance, and processing) play an important role in any data-driven model, including ours. Conclusions: Through experiments, we observed the hybrid model’s efficacy in predicting vital signs such as HR, SpO2, SBP, and DBP, along with demographic variables like sex and age. There is potential for extending the hybrid model to estimate additional vital signs such as body temperature and respiration rate. Full article
(This article belongs to the Section Applied Biomedical Data Science)
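The hybrid idea of fusing a per-frame CNN branch with sequence-level temporal branches can be sketched as below. This is a simplified stand-in, not the authors' model: an LSTM and a small Transformer encoder substitute for convLSTM and ViViT, and all sizes and the four regression targets (HR, SpO2, SBP, DBP) are assumptions.

```python
# Minimal sketch of a multi-branch video model that fuses per-frame and
# sequence-level temporal features to regress vital signs.
import torch
import torch.nn as nn


class HybridVitalSignNet(nn.Module):
    def __init__(self, feat_dim: int = 128, num_outputs: int = 4):
        super().__init__()
        self.frame_cnn = nn.Sequential(                      # per-frame spatial features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.temporal_rnn = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        encoder_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                                   batch_first=True)
        self.temporal_tf = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.regressor = nn.Linear(2 * feat_dim, num_outputs)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, 3, H, W) facial clips.
        b, t = video.shape[:2]
        frames = self.frame_cnn(video.flatten(0, 1)).view(b, t, -1)   # (B, T, D)
        rnn_out, _ = self.temporal_rnn(frames)
        tf_out = self.temporal_tf(frames)
        fused = torch.cat([rnn_out[:, -1], tf_out.mean(dim=1)], dim=-1)
        return self.regressor(fused)                                  # (B, num_outputs)


preds = HybridVitalSignNet()(torch.randn(2, 16, 3, 64, 64))
print(preds.shape)  # torch.Size([2, 4])
```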

22 pages, 11079 KB  
Article
Hybrid 3D Convolutional–Transformer Model for Detecting Stereotypical Motor Movements in Autistic Children During Pre-Meltdown Crisis
by Salma Kammoun Jarraya and Marwa Masmoudi
Appl. Sci. 2024, 14(23), 11458; https://doi.org/10.3390/app142311458 - 9 Dec 2024
Viewed by 1130
Abstract
Computer vision using deep learning algorithms has served numerous human activity identification applications, particularly those linked to safety and security. However, even though autistic children are frequently exposed to danger as a result of their activities, many computer vision experts have shown little interest in their safety. Several autistic children show severe challenging behaviors such as the Meltdown Crisis, which is characterized by hostile behaviors and loss of control. This study aims to introduce a monitoring system capable of predicting the Meltdown Crisis condition early and alerting the children’s parents or caregivers before entering more difficult settings. For this endeavor, the suggested system was constructed using a combination of a pre-trained Vision Transformer (ViT) model (Swin-3D-b) and a Residual Network (ResNet) architecture to extract robust features from video sequences and learn the spatial and temporal features of the Stereotyped Motor Movements (SMMs) made by autistic children at the beginning of the Meltdown Crisis state, which is referred to as the Pre-Meltdown Crisis state. The evaluation was conducted using the MeltdownCrisis dataset, which contains realistic scenarios of autistic children’s behaviors in the Pre-Meltdown Crisis state, with data from the Normal state serving as the negative class. Our proposed model achieved a high classification accuracy of 92%. Full article

22 pages, 25110 KB  
Article
Depth-Based Intervention Detection in the Neonatal Intensive Care Unit Using Vision Transformers
by Zein Hajj-Ali, Yasmina Souley Dosso, Kim Greenwood, JoAnn Harrold and James R. Green
Sensors 2024, 24(23), 7753; https://doi.org/10.3390/s24237753 - 4 Dec 2024
Cited by 1 | Viewed by 1146
Abstract
Depth cameras can provide an effective, noncontact, and privacy-preserving means to monitor patients in the Neonatal Intensive Care Unit (NICU). Clinical interventions and routine care events can disrupt video-based patient monitoring. Automatically detecting these periods can decrease the time required for hand-annotating recordings, which is needed for system development. Moreover, the automatic detection can be used in the future for real-time or retrospective intervention event classification. An intervention detection method based solely on depth data was developed using a vision transformer (ViT) model utilizing real-world data from patients in the NICU. Multiple design parameters were investigated, including encoding of depth data and perspective transform to account for nonoptimal camera placement. The best-performing model utilized ∼85 M trainable parameters, leveraged both perspective transform and HHA (Horizontal disparity, Height above ground, and Angle with gravity) encoding, and achieved a sensitivity of 85.6%, a precision of 89.8%, and an F1-Score of 87.6%. Full article
(This article belongs to the Special Issue Machine Learning and Image-Based Smart Sensing and Applications)
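The perspective-transform design choice can be illustrated with a plain OpenCV warp that rectifies an obliquely mounted camera's view of the region of interest. The corner coordinates and output size below are placeholders, and the HHA encoding step is not shown.

```python
# Minimal sketch of perspective-transform pre-processing: warp a depth frame so
# the region of interest appears as if viewed from directly overhead, compensating
# for nonoptimal camera placement.
import cv2
import numpy as np

depth = (np.random.rand(480, 640) * 255).astype(np.uint8)   # stand-in depth frame

# Corners of the region of interest in the original (oblique) view, and where
# they should land in the rectified top-down view. All values are illustrative.
src = np.float32([[120, 80], [560, 60], [600, 420], [90, 440]])
dst = np.float32([[0, 0], [512, 0], [512, 512], [0, 512]])

matrix = cv2.getPerspectiveTransform(src, dst)
rectified = cv2.warpPerspective(depth, matrix, (512, 512))
print(rectified.shape)  # (512, 512)
```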

24 pages, 5511 KB  
Article
Severity Classification of Parkinson’s Disease via Synthesis of Energy Skeleton Images from Videos Produced in Uncontrolled Environments
by Nejib Ben Hadj-Alouane, Arav Dhoot, Monia Turki-Hadj Alouane and Vinod Pangracious
Diagnostics 2024, 14(23), 2685; https://doi.org/10.3390/diagnostics14232685 - 28 Nov 2024
Cited by 1 | Viewed by 1233
Abstract
Background/Objectives: Parkinson’s Disease (PD) is a prevalent neurodegenerative disorder affecting millions worldwide, primarily marked by motor and non-motor symptoms due to the degeneration of dopamine-producing neurons. Despite the absence of a cure, current treatments focus on symptom management, often relying on pharmacotherapy and surgical interventions. Early diagnosis remains a critical challenge, particularly in underserved areas, as existing diagnostic protocols lack standardization and accessibility. This paper proposes a novel framework for the diagnosis and severity classification of PD using video data captured in uncontrolled environments. Methods: Leveraging deep learning techniques, our approach synthesizes Skeleton Energy Images (SEIs) from gait sequences and employs three advanced models—a Convolutional Neural Network (CNN), a Residual Network (ResNet), and a Vision Transformer (ViT)—to analyze these images. Our methodology allows for the accurate detection of PD and differentiation of its severity without requiring specialized equipment or professional oversight. The dataset used consists of labeled videos capturing the early stages of the disease, facilitating the potential for timely intervention. Results: The models performed accurately during the training phase: the ViT and ResNet models achieved accuracies above 99%, while the CNN five-layer model achieved a lower accuracy of 90%. During the test phase, only the best-performing models from the training experiments were tested. The ResNet-18 model achieved 100% test accuracy, while the ViT and CNN five-layer models achieved 99.96% and 96.40%, respectively. Conclusions: The results demonstrate high accuracy, highlighting the framework’s capabilities, and in particular the effectiveness of the workflow used for generating the SEI images. Given the nature of the dataset used, the proposed framework stands to function as a cost-effective and accessible tool for early PD detection in various healthcare settings. This study contributes to the advancement of mobile health technologies, aiming to enhance early diagnosis and monitoring of Parkinson’s Disease. Full article
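The Skeleton Energy Image synthesis can be pictured as a temporal average of per-frame skeleton masks over a gait sequence, analogous to a gait energy image. The sketch below assumes the masks have already been rasterized from pose estimates; sizes and the normalization are illustrative.

```python
# Minimal sketch of a Skeleton Energy Image (SEI): per-frame binary skeleton masks
# from a gait sequence are averaged into a single grayscale image that summarizes
# the motion. Mask extraction (pose estimation + rasterization) is assumed done.
import numpy as np


def skeleton_energy_image(skeleton_masks: np.ndarray) -> np.ndarray:
    # skeleton_masks: (frames, height, width) binary masks of the rendered skeleton.
    energy = skeleton_masks.astype(np.float32).mean(axis=0)   # temporal average
    return (255 * energy / max(energy.max(), 1e-6)).astype(np.uint8)


masks = (np.random.rand(60, 128, 64) > 0.9)   # a fake 60-frame gait cycle
sei = skeleton_energy_image(masks)
print(sei.shape, sei.dtype)  # (128, 64) uint8
```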

18 pages, 5410 KB  
Article
Transformer and Adaptive Threshold Sliding Window for Improving Violence Detection in Videos
by Fernando J. Rendón-Segador, Juan A. Álvarez-García and Luis M. Soria-Morillo
Sensors 2024, 24(16), 5429; https://doi.org/10.3390/s24165429 - 22 Aug 2024
Viewed by 2018
Abstract
This paper presents a comprehensive approach to detect violent events in videos by combining CrimeNet, a Vision Transformer (ViT) model with structured neural learning and adversarial regularization, with an adaptive threshold sliding window model based on the Transformer architecture. CrimeNet demonstrates exceptional performance on all datasets (XD-Violence, UCF-Crime, NTU-CCTV Fights, UBI-Fights, Real Life Violence Situations, MediEval, RWF-2000, Hockey Fights, Violent Flows, Surveillance Camera Fights, and Movies Fight), achieving high AUC ROC and AUC PR values (up to 99% and 100%, respectively). However, the generalization of CrimeNet in cross-dataset experiments posed some problems, resulting in a 20–30% decrease in performance; for instance, training on UCF-Crime and testing on XD-Violence yielded an AUC ROC of 70.20%. The sliding window model with adaptive thresholding effectively solves these problems by automatically adjusting the violence detection threshold, resulting in a substantial improvement in detection accuracy. By applying the sliding window model as post-processing to CrimeNet results, we were able to improve detection accuracy by 10% to 15% in cross-dataset experiments. Future lines of research include improving generalization, addressing data imbalance, exploring multimodal representations, testing in real-world applications, and extending the approach to complex human interactions. Full article
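The adaptive-threshold sliding-window post-processing can be sketched as scanning per-frame violence scores with a window and flagging segments whose mean exceeds a clip-adaptive threshold. The window length and the mean-plus-k-sigma rule below are assumptions, not the paper's Transformer-based model.

```python
# Minimal sketch of sliding-window post-processing with a threshold adapted to the
# clip's own score statistics rather than a fixed global cutoff.
import numpy as np


def adaptive_sliding_window(scores: np.ndarray, window: int = 16, k: float = 1.0):
    # scores: per-frame violence probabilities from an upstream detector.
    threshold = scores.mean() + k * scores.std()
    flagged = []
    for start in range(0, len(scores) - window + 1):
        if scores[start:start + window].mean() > threshold:
            flagged.append((start, start + window))
    return flagged


# Fake per-frame scores: a quiet clip with a violent burst in the last 50 frames.
frame_scores = np.concatenate([0.2 * np.random.rand(150), 0.7 + 0.3 * np.random.rand(50)])
print(adaptive_sliding_window(frame_scores)[:3])
```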

15 pages, 23607 KB  
Article
Enhancing Image Copy Detection through Dynamic Augmentation and Efficient Sampling with Minimal Data
by Mohamed Fawzy, Noha S. Tawfik and Sherine Nagy Saleh
Electronics 2024, 13(16), 3125; https://doi.org/10.3390/electronics13163125 - 7 Aug 2024
Viewed by 1794
Abstract
Social networks have become deeply integrated into our daily lives, leading to an increase in image sharing across different platforms. Simultaneously, the existence of robust and user-friendly media editors not only facilitates artistic innovation, but also raises concerns regarding the ease of creating misleading media. This highlights the need for developing new advanced techniques for the image copy detection task, which involves evaluating whether photos or videos originate from the same source. This research introduces a novel application of the Vision Transformer (ViT) model to the image copy detection task on the DISC21 dataset. Our approach involves innovative strategic sampling of the extensive DISC21 training set using K-means clustering to achieve a representative subset. Additionally, we employ complex augmentation pipelines of varying intensity during training. Our methodology follows the instance discrimination concept, where the Vision Transformer model is used as a classifier to map different augmentations of the same image to the same class. Next, the trained ViT model extracts descriptors of original and manipulated images, which subsequently undergo post-processing to reduce dimensionality. Our best-performing model, tested on a refined query set of 10K augmented images from the DISC21 dataset, attained a state-of-the-art micro-average precision of 0.79, demonstrating the effectiveness and innovation of our approach. Full article
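The K-means-based strategic sampling can be illustrated by clustering image descriptors and keeping the image nearest each centroid as the representative subset. Descriptor extraction is assumed to have happened upstream, and the descriptor dimension and cluster count below are placeholders.

```python
# Minimal sketch of selecting a representative training subset: cluster image
# descriptors with K-means and keep the image closest to each cluster centre.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

descriptors = np.random.rand(5000, 512).astype(np.float32)   # stand-in image features

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(descriptors)
# Index of the descriptor nearest to each centroid = one representative per cluster.
closest_idx, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, descriptors)
subset_indices = np.unique(closest_idx)
print(subset_indices.shape)  # up to (100,)
```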
