Search Results (66)

Search Parameters:
Keywords = video text detection

14 pages, 6060 KiB  
Article
Text Typing Using Blink-to-Alphabet Tree for Patients with Neuro-Locomotor Disabilities
by Seungho Lee and Sangkon Lee
Sensors 2025, 25(15), 4555; https://doi.org/10.3390/s25154555 - 23 Jul 2025
Viewed by 294
Abstract
Lou Gehrig’s disease, also known as ALS, is a progressive neurodegenerative condition that weakens muscles and can lead to paralysis as it progresses. For patients with severe paralysis, eye-tracking devices such as an eye mouse enable communication. However, the equipment is expensive, and the calibration process is difficult and frustrating for patients. To alleviate this problem, we propose a simple and efficient method for typing text intuitively with graphical guidance on the screen. Specifically, the method detects patients’ eye blinks in video frames to navigate through three sequential steps, narrowing the choices from 9 letters, to 3 letters, and finally to a single letter (from a 26-letter alphabet). In this way, a patient can type a letter of the alphabet with a minimum of three and a maximum of nine blinks. The proposed method integrates a large language model (LLM) API to further accelerate text input and correct typographical errors, spacing, and capitalization. Experiments with ten participants demonstrate that the proposed method significantly outperforms three state-of-the-art methods in both typing speed and typing accuracy, without requiring any calibration process.
(This article belongs to the Section Biomedical Sensors)
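The three-step narrowing described in this abstract maps naturally onto a 3×3×3 decision tree over the 26-letter alphabet. Below is a minimal Python sketch of that selection logic as we read it from the abstract; the grouping of letters and the convention that 1–3 blinks pick the first, second, or third group at each step are our assumptions, not details taken from the paper.

```python
import string

# Hypothetical reconstruction of the blink-to-alphabet tree: three rounds,
# each resolved by 1-3 blinks, narrowing 26 letters -> 9 -> 3 -> 1
# (so a letter costs between 3 and 9 blinks in total).

ALPHABET = string.ascii_uppercase  # 26 letters

def chunk(seq, n_groups):
    """Split seq into n_groups roughly equal consecutive groups."""
    size = -(-len(seq) // n_groups)  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def select_letter(blinks_per_step):
    """blinks_per_step: three counts, each in 1..3, one per narrowing step."""
    candidates = list(ALPHABET)
    for blinks in blinks_per_step:
        groups = chunk(candidates, 3)
        candidates = groups[min(blinks, len(groups)) - 1]
    return candidates[0]

# Example: 2 blinks, then 1 blink, then 3 blinks selects the letter "L".
print(select_letter([2, 1, 3]))
```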

24 pages, 3409 KiB  
Article
DepressionMIGNN: A Multiple-Instance Learning-Based Depression Detection Model with Graph Neural Networks
by Shiwen Zhao, Yunze Zhang, Yikai Su, Kaifeng Su, Jiemin Liu, Tao Wang and Shiqi Yu
Sensors 2025, 25(14), 4520; https://doi.org/10.3390/s25144520 - 21 Jul 2025
Viewed by 484
Abstract
The global prevalence of depression necessitates the application of technological solutions, particularly sensor-based systems, to augment scarce resources for early diagnostic purposes. In this study, we use benchmark datasets that contain multimodal data including video, audio, and transcribed text. To address depression detection as a chronic long-term disorder reflected by temporal behavioral patterns, we propose a novel framework that segments videos into utterance-level instances using GRU for contextual representation, and then constructs graphs where utterance embeddings serve as nodes connected through dual relationships capturing both chronological development and intermittent relevant information. Graph neural networks are employed to learn multi-dimensional edge relationships and align multimodal representations across different temporal dependencies. Our approach achieves superior performance with an MAE of 5.25 and RMSE of 6.75 on AVEC2014, and CCC of 0.554 and RMSE of 4.61 on AVEC2019, demonstrating significant improvements over existing methods that focus primarily on momentary expressions.
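As a rough illustration of the utterance-graph idea in this abstract, the sketch below builds chronological edges between GRU-contextualised utterance embeddings and regresses a video-level score with two graph-convolution layers. It is a hedged approximation: the feature dimensions, the use of GCNConv from PyTorch Geometric, and the simple bidirectional chronological edges stand in for the paper's dual edge relationships and multi-dimensional edge learning.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv  # assumes PyTorch Geometric is installed

class UtteranceGraphRegressor(nn.Module):
    """Hypothetical sketch: GRU-contextualised utterance embeddings become graph
    nodes, chronological edges link neighbouring utterances, and a GNN regresses
    a video-level depression score."""

    def __init__(self, feat_dim=128, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.gnn1 = GCNConv(hidden_dim, hidden_dim)
        self.gnn2 = GCNConv(hidden_dim, hidden_dim)
        self.readout = nn.Linear(hidden_dim, 1)

    def forward(self, utterances):
        # utterances: (1, num_utterances, feat_dim) multimodal features per utterance
        ctx, _ = self.gru(utterances)      # contextual node features
        x = ctx.squeeze(0)                 # (num_utterances, hidden_dim)
        n = x.size(0)
        # Bidirectional chronological edges i <-> i+1, one simple example of the
        # "dual relationships" mentioned in the abstract.
        src, dst = torch.arange(n - 1), torch.arange(1, n)
        edge_index = torch.stack([torch.cat([src, dst]), torch.cat([dst, src])])
        h = torch.relu(self.gnn1(x, edge_index))
        h = torch.relu(self.gnn2(h, edge_index))
        return self.readout(h.mean(dim=0))  # video-level score

score = UtteranceGraphRegressor()(torch.randn(1, 12, 128))  # 12 utterances
```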

13 pages, 665 KiB  
Review
Emerging Technologies for Injury Identification in Sports Settings: A Systematic Review
by Luke Canavan Dignam, Lisa Ryan, Michael McCann and Ed Daly
Appl. Sci. 2025, 15(14), 7874; https://doi.org/10.3390/app15147874 - 14 Jul 2025
Viewed by 478
Abstract
Sport injury recognition is rapidly evolving with the integration of new emerging technologies. This systematic review aims to identify and evaluate technologies capable of detecting injuries during sports participation. A comprehensive search of PubMed, SPORTDiscus, Web of Science, and ScienceDirect was conducted following the PRISMA 2020 guidelines. The review was registered on PROSPERO (CRD42024608964). Inclusion criteria focused on prospective studies involving athletes of all ages, evaluating tools used to identify injuries in sports settings. The review included research between 2014 and 2024; retrospective, conceptual, and fatigue-focused studies were excluded. Risk of bias was assessed using the Critical Appraisal Skills Programme (CASP) tool. Of 4283 records screened, 70 full-text articles were assessed, with 21 studies meeting the final inclusion criteria. The technologies were grouped into advanced imaging (Magnetic Resonance Imaging (MRI), Diffusion Tensor Imaging (DTI), and Quantitative Susceptibility Mapping (QSM)), biomarkers (i.e., Neurofilament Light (NfL), Tau protein, Glial Fibrillary Acidic Protein (GFAP), salivary microRNAs, and Immunoglobulin A (IgA)), and sideline assessments (i.e., the King–Devick test, KD-Eye Tracking, the modified Balance Error Scoring System (mBESS), DETECT, ImPACT structured video analysis, and Instrumented Mouth Guards (iMGs)), which demonstrated feasibility for immediate sideline identification of injury. Future research should improve methodological rigour through larger, more diverse samples and controlled designs in real-world testing environments. Following this guidance, the application of emerging technologies may assist medical staff, coaches, and national governing bodies in identifying injuries in sports settings and providing real-time assessment.
(This article belongs to the Special Issue Sports Injuries: Prevention and Rehabilitation)

22 pages, 7735 KiB  
Article
Visual Perception of Peripheral Screen Elements: The Impact of Text and Background Colors
by Snježana Ivančić Valenko, Marko Čačić, Ivana Žiljak Stanimirović and Anja Zorko
Appl. Sci. 2025, 15(14), 7636; https://doi.org/10.3390/app15147636 - 8 Jul 2025
Viewed by 434
Abstract
Visual perception of screen elements depends on their color, font, and position in the user interface design. Objects in the central part of the screen are perceived more easily than those in the peripheral areas. However, the peripheral space is valuable for applications like advertising and promotion and should not be overlooked. Optimizing the design of elements in this area can improve user attention to peripheral visual stimuli during focused tasks. This study aims to evaluate how different combinations of text and background color affect the visibility of moving textual stimuli in the peripheral areas of the screen, while attention is focused on a central task. This study investigates how background color, combined with white or black text, affects the attention of participants. It also identifies which background color makes a specific word most noticeable in the peripheral part of the screen. We designed quizzes to present stimuli with black or white text on various background colors in the peripheral regions of the screen. The background colors tested were blue, red, yellow, green, white, and black. While saturation and brightness were kept constant, the color tone was varied. Among ten combinations of background and text color, we aimed to determine the most noticeable combination in the peripheral part of the screen. The combination of white text on a blue background resulted in the shortest detection time (1.376 s), while black text on a white background achieved the highest accuracy rate at 79%. The results offer valuable insights for improving peripheral text visibility in user interfaces across various visual communication domains such as video games, television content, and websites, where peripheral information must remain noticeable despite centrally focused user attention and complex viewing conditions.

21 pages, 4777 KiB  
Article
Harnessing Semantic and Trajectory Analysis for Real-Time Pedestrian Panic Detection in Crowded Micro-Road Networks
by Rongyong Zhao, Lingchen Han, Yuxin Cai, Bingyu Wei, Arifur Rahman, Cuiling Li and Yunlong Ma
Appl. Sci. 2025, 15(10), 5394; https://doi.org/10.3390/app15105394 - 12 May 2025
Viewed by 437
Abstract
Pedestrian panic behavior is a primary cause of overcrowding and stampede accidents in public micro-road network areas with high pedestrian density. However, reliably detecting such behaviors remains challenging due to their inherent complexity, variability, and stochastic nature. Current detection models often rely on single-modality features, which limits their effectiveness in complex and dynamic crowd scenarios. To overcome these limitations, this study proposes a contour-driven multimodal framework that first employs a CNN (CDNet) to estimate density maps and, by analyzing steep contour gradients, automatically delineates a candidate panic zone. Within these potential panic zones, pedestrian trajectories are analyzed through LSTM networks to capture irregular movements, such as counterflow and nonlinear wandering behaviors. Concurrently, semantic recognition based on Transformer models is utilized to identify verbal distress cues extracted through Baidu AI’s real-time speech-to-text conversion. The three embeddings are fused through a lightweight attention-enhanced MLP, enabling end-to-end inference at 40 FPS on a single GPU. To evaluate branch robustness under streaming conditions, the UCF Crowd dataset (150 videos without panic labels) is processed frame-by-frame at 25 FPS solely for density assessment, whereas full panic detection is validated on 30 real Itaewon-Stampede videos and 160 SUMO/Unity simulated emergencies that include explicit panic annotations. The proposed system achieves 91.7% accuracy and 88.2% F1 on the Itaewon set, outperforming all single- or dual-modality baselines and offering a deployable solution for proactive crowd safety monitoring in transport hubs, festivals, and other high-risk venues.
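The fusion step named in this abstract (three branch embeddings combined by a lightweight attention-enhanced MLP) can be sketched as follows. This is an illustrative PyTorch reading of the abstract only: the 256-dimensional embeddings, the softmax attention over the three modalities, and the single-logit output are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class AttentionFusionMLP(nn.Module):
    """Hedged sketch of an attention-enhanced MLP fusing the density-contour,
    trajectory (LSTM), and speech-semantics embeddings into a panic logit."""

    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.Linear(dim, 1)          # scores each modality embedding
        self.mlp = nn.Sequential(
            nn.Linear(dim, 128), nn.ReLU(),
            nn.Linear(128, 1),                 # panic / no-panic logit
        )

    def forward(self, density_emb, traj_emb, sem_emb):
        feats = torch.stack([density_emb, traj_emb, sem_emb], dim=1)  # (batch, 3, dim)
        weights = torch.softmax(self.attn(feats), dim=1)              # (batch, 3, 1)
        fused = (weights * feats).sum(dim=1)                          # weighted sum
        return self.mlp(fused)

# Example with random branch embeddings for a batch of 4 clips.
model = AttentionFusionMLP()
logit = model(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
```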

24 pages, 10867 KiB  
Article
Machine Learning-Based Smartphone Grip Posture Image Recognition and Classification
by Dohoon Kwon, Xin Cui, Yejin Lee, Younggeun Choi, Aditya Subramani Murugan, Eunsik Kim and Heecheon You
Appl. Sci. 2025, 15(9), 5020; https://doi.org/10.3390/app15095020 - 30 Apr 2025
Viewed by 726
Abstract
Uncomfortable smartphone grip postures resulting from inappropriate user interface design can degrade smartphone usability. This study aims to develop a classification model for smartphone grip postures by detecting the positions of the hand and fingers on smartphones using machine learning techniques. Seventy participants (35 males and 35 females, average age 38.5 ± 12.2 years) with varying hand sizes participated in the smartphone grip posture experiment. The participants performed four tasks (making calls, listening to music, sending text messages, and web browsing) using nine smartphone mock-ups of different sizes, while cameras positioned above and below their hands recorded their usage. A total of 3278 grip posture images were extracted from the recorded videos and preprocessed using a skin color and hand contour detection model. The grip postures were categorized into seven types, and three models (MobileNetV2, Inception V3, and ResNet-50), along with an ensemble model, were used for classification. The ensemble-based classification model achieved an accuracy of 95.9%, higher than the individual models: MobileNetV2 (90.6%), ResNet-50 (94.2%), and Inception V3 (85.9%). The classification model developed in this study can efficiently analyze grip postures, thereby improving usability in the development of smartphones and other electronic devices.
(This article belongs to the Special Issue Novel Approaches and Applications in Ergonomic Design III)
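A soft-voting ensemble over the three CNNs named in the abstract could look like the sketch below, which averages the softmax outputs of untrained torchvision backbones with a 7-class head (one per grip posture). The averaging rule, input size, and untrained weights are assumptions; the authors may weight or combine the models differently.

```python
import torch
from torchvision import models

NUM_CLASSES = 7  # seven grip-posture categories reported in the abstract

# Untrained backbones with a 7-class head; the authors' trained weights are not used here.
backbones = [
    models.mobilenet_v2(num_classes=NUM_CLASSES),
    models.resnet50(num_classes=NUM_CLASSES),
    models.inception_v3(num_classes=NUM_CLASSES),  # expects 299x299 inputs
]

def ensemble_predict(image_batch):
    """Soft voting: average the softmax outputs of the three CNNs, then take argmax."""
    probs = []
    with torch.no_grad():
        for net in backbones:
            net.eval()
            probs.append(torch.softmax(net(image_batch), dim=1))
    return torch.stack(probs).mean(dim=0).argmax(dim=1)

# Example: a batch of four preprocessed grip-posture images.
predicted_classes = ensemble_predict(torch.randn(4, 3, 299, 299))
```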

27 pages, 2001 KiB  
Review
Recent Research Progress of Graph Neural Networks in Computer Vision
by Zhiyong Jia, Chuang Wang, Yang Wang, Xinrui Gao, Bingtao Li, Lifeng Yin and Huayue Chen
Electronics 2025, 14(9), 1742; https://doi.org/10.3390/electronics14091742 - 24 Apr 2025
Cited by 2 | Viewed by 3215
Abstract
Graph neural networks (GNNs) have demonstrated significant potential in the field of computer vision in recent years, particularly in handling non-Euclidean data and capturing complex spatial and semantic relationships. This paper provides a comprehensive review of the latest research on GNNs in computer vision, with a focus on their applications in image processing, video analysis, and multimodal data fusion. First, we briefly introduce common GNN models, such as graph convolutional networks (GCN) and graph attention networks (GAT), and analyze their advantages in image and video data processing. Subsequently, this paper delves into the applications of GNNs in tasks such as object detection, image segmentation, and video action recognition, particularly in capturing inter-region dependencies and spatiotemporal dynamics. Finally, the paper discusses the applications of GNNs in multimodal data fusion tasks such as image–text matching and cross-modal retrieval, and highlights the main challenges faced by GNNs in computer vision, including computational complexity, dynamic graph modeling, heterogeneous graph processing, and interpretability issues. This paper provides a comprehensive understanding of the applications of GNNs in computer vision for both academia and industry and envisions future research directions.
(This article belongs to the Special Issue AI Synergy: Vision, Language, and Modality)

18 pages, 3278 KiB  
Article
Efficient Detection of Mind Wandering During Reading Aloud Using Blinks, Pitch Frequency, and Reading Rate
by Amir Rabinovitch, Eden Ben Baruch, Maor Siton, Nuphar Avital, Menahem Yeari and Dror Malka
AI 2025, 6(4), 83; https://doi.org/10.3390/ai6040083 - 18 Apr 2025
Cited by 2 | Viewed by 1049
Abstract
Mind wandering is a common issue among schoolchildren and academic students, often undermining the quality of learning and teaching effectiveness. Current detection methods mainly rely on eye trackers and electrodermal activity (EDA) sensors, focusing on external indicators such as facial movements but neglecting voice detection. These methods are often cumbersome, uncomfortable for participants, and invasive, requiring specialized, expensive equipment that disrupts the natural learning environment. To overcome these challenges, a new algorithm has been developed to detect mind wandering during reading aloud. Based on external indicators like the blink rate, pitch frequency, and reading rate, the algorithm integrates these three criteria to ensure the accurate detection of mind wandering using only a standard computer camera and microphone, making it easy to implement and widely accessible. An experiment with ten participants validated this approach. Participants read aloud a text of 1304 words while the algorithm, incorporating the Viola–Jones model for face and eye detection and pitch-frequency analysis, monitored for signs of mind wandering. A voice activity detection (VAD) technique was also used to recognize human speech. The algorithm achieved 76% accuracy in predicting mind wandering during specific text segments, demonstrating the feasibility of using noninvasive physiological indicators. This method offers a practical, non-intrusive solution for detecting mind wandering through video and audio data, making it suitable for educational settings. Its ability to integrate seamlessly into classrooms holds promise for enhancing student concentration, improving the teacher–student dynamic, and boosting overall teaching effectiveness. By leveraging standard, accessible technology, this approach could pave the way for more personalized, technology-enhanced education systems.
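To make the three-criterion integration concrete, here is a hedged sketch of one way the blink rate, pitch variability, and reading rate could be combined into a per-segment decision. The thresholds, the direction of each comparison, and the two-out-of-three voting rule are illustrative assumptions; the paper's actual algorithm (with Viola–Jones blink detection and VAD-based speech segmentation) is more involved.

```python
# Hypothetical combination of the three indicators named in the abstract into a
# per-segment mind-wandering flag. Threshold values are placeholders, not the
# paper's calibrated parameters.

def mind_wandering_flag(blink_rate_hz, pitch_std_hz, words_per_min,
                        blink_thr=0.5, pitch_thr=15.0, rate_thr=120.0):
    """Return True when at least two of the three indicators deviate."""
    indicators = [
        blink_rate_hz > blink_thr,   # assumed: elevated blinking
        pitch_std_hz < pitch_thr,    # assumed: flattened pitch variation
        words_per_min < rate_thr,    # assumed: slowed reading rate
    ]
    return sum(indicators) >= 2

# Example segment: high blink rate and flat pitch, but normal reading speed.
print(mind_wandering_flag(0.7, 12.0, 150.0))  # True (two of three indicators fire)
```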

17 pages, 2758 KiB  
Article
History-Aware Multimodal Instruction-Oriented Policies for Navigation Tasks
by Renas Mukhametzianov and Hidetaka Nambo
AI 2025, 6(4), 75; https://doi.org/10.3390/ai6040075 - 11 Apr 2025
Viewed by 796
Abstract
The rise of large-scale language models and multimodal transformers has enabled instruction-based policies, such as vision-and-language navigation. To leverage their general world knowledge, we propose multimodal annotations for action options and support selection from a dynamic, describable action space. Our framework employs a multimodal transformer that processes front-facing camera images, light detection and ranging (LIDAR) sensor point clouds, and tasks as textual instructions to produce a history-aware decision policy for mobile robot navigation. Our approach leverages a pretrained vision–language encoder and integrates it with a custom causal generative pretrained transformer (GPT) decoder to predict action sequences within a state–action history. We propose a trainable attention score mechanism to efficiently select the most suitable action from a variable set of possible options. Action options are text–image pairs and are encoded using the same multimodal encoder employed for environment states. This approach of annotating and dynamically selecting actions is applicable to broader multidomain decision-making tasks. We compared two baseline models, ViLT (vision-and-language transformer) and FLAVA (foundational language and vision alignment), and found that FLAVA achieves superior performance within the constraints of 8 GB video memory usage in the training phase. Experiments were conducted in both simulated and real-world environments using our custom datasets for instructed task completion episodes, demonstrating strong prediction accuracy. These results highlight the potential of multimodal, dynamic action spaces for instruction-based robot navigation and beyond.
(This article belongs to the Section AI in Autonomous Systems)
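The trainable attention score over a variable-size set of multimodally annotated action options can be sketched as a scaled dot-product between a history query and per-option keys. The dimensions below (768-d embeddings, a typical vision-language encoder output size) and the single-layer projections are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ActionSelector(nn.Module):
    """Hedged sketch of a trainable attention score over a variable number of
    encoded action options (text-image pairs), given a state-action history."""

    def __init__(self, state_dim=768, action_dim=768, score_dim=256):
        super().__init__()
        self.q = nn.Linear(state_dim, score_dim)   # query from history embedding
        self.k = nn.Linear(action_dim, score_dim)  # keys from encoded action options

    def forward(self, history_emb, option_embs):
        # history_emb: (batch, state_dim); option_embs: (batch, n_options, action_dim)
        q = self.q(history_emb).unsqueeze(1)             # (batch, 1, score_dim)
        k = self.k(option_embs)                          # (batch, n, score_dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5     # scaled dot-product scores
        return scores.softmax(dim=-1)                    # probability per option

selector = ActionSelector()
probs = selector(torch.randn(2, 768), torch.randn(2, 5, 768))  # 5 candidate actions
chosen_action = probs.argmax(dim=-1)
```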

17 pages, 5771 KiB  
Article
RelVid: Relational Learning with Vision-Language Models for Weakly Video Anomaly Detection
by Jingxin Wang, Guohan Li, Jiaqi Liu, Zhengyi Xu, Xinrong Chen and Jianming Wei
Sensors 2025, 25(7), 2037; https://doi.org/10.3390/s25072037 - 25 Mar 2025
Viewed by 1266
Abstract
Weakly supervised video anomaly detection aims to identify abnormal events in video sequences without requiring frame-level supervision, which is a challenging task in computer vision. Traditional methods typically rely on low-level visual features with weak supervision from a single backbone branch, which often struggles to capture the distinctive characteristics of different categories. This limitation reduces their adaptability to real-world scenarios. In real-world situations, the boundary between normal and abnormal events is often unclear and context-dependent. For example, running on a track may be considered normal, but running on a busy road could be deemed abnormal. To address these challenges, RelVid is introduced as a novel framework that improves anomaly detection by expanding the relative feature gap between classes extracted from a single backbone branch. The key innovation of RelVid lies in the integration of auxiliary tasks, which guide the model to learn more discriminative features, significantly boosting the model’s performance. These auxiliary tasks—including text-based anomaly detection and feature reconstruction learning—act as additional supervision, helping the model capture subtle differences and anomalies that are often difficult to detect in weakly supervised settings. In addition, RelVid incorporates two other components: class activation feature learning for improved feature discrimination and a temporal attention module for capturing sequential dependencies. This approach enhances the model’s robustness and accuracy, enabling it to better handle complex and ambiguous scenarios. Evaluations on two widely used benchmark datasets, UCF-Crime and XD-Violence, demonstrate the effectiveness of RelVid. Compared to state-of-the-art methods, RelVid achieves superior performance in both detection accuracy and robustness.
(This article belongs to the Section Intelligent Sensors)

20 pages, 20407 KiB  
Article
VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection
by Andrea Appiani and Cigdem Beyan
Information 2025, 16(3), 233; https://doi.org/10.3390/info16030233 - 16 Mar 2025
Viewed by 1721
Abstract
Voice activity detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments focusing on the upper body of an individual, while the text encoder processes textual descriptions generated by a Generative Large Multimodal Model, i.e., the Large Language and Vision Assistant (LLaVA). Subsequently, embeddings from these encoders are fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks showcases the superior performance of our method compared to existing visual VAD approaches. Notably, our approach outperforms several audio-visual methods despite its simplicity and without requiring pretraining on extensive audio-visual datasets.
(This article belongs to the Special Issue Application of Machine Learning in Human Activity Recognition)
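A minimal sketch of the described pipeline is given below, assuming the OpenAI clip package: a CLIP ViT-B/32 image embedding of the upper-body crop and a CLIP text embedding of the LLaVA-generated description are concatenated and passed to a small fusion network that outputs a speaking/not-speaking logit. The fusion head's layer sizes are our assumption, and the LLaVA captioning step itself is left outside the snippet.

```python
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package; usage here is a hedged sketch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Fusion head with illustrative dimensions: ViT-B/32 embeddings are 512-d each.
fusion = nn.Sequential(
    nn.Linear(512 * 2, 256), nn.ReLU(),
    nn.Linear(256, 1),                 # speaking / not-speaking logit
).to(device)

def vad_logit(upper_body_image, llava_description):
    """upper_body_image: PIL image of the person's upper body;
    llava_description: caption text produced by LLaVA for that segment."""
    with torch.no_grad():
        img = preprocess(upper_body_image).unsqueeze(0).to(device)
        txt = clip.tokenize([llava_description]).to(device)
        v = model.encode_image(img).float()   # CLIP visual embedding
        t = model.encode_text(txt).float()    # CLIP text embedding
    return fusion(torch.cat([v, t], dim=-1))
```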

14 pages, 687 KiB  
Article
Artificial Intelligence and Data Literacy in Rural Schools’ Teaching Practices: Knowledge, Use, and Challenges
by Marta López Costa
Educ. Sci. 2025, 15(3), 352; https://doi.org/10.3390/educsci15030352 - 12 Mar 2025
Cited by 1 | Viewed by 3095
Abstract
This study explores the implementation of artificial intelligence (AI) and data literacy in rural Catalan schools by analyzing teacher knowledge, use, and perceptions. Data were collected from a representative sample of teachers at these schools to examine their understanding of AI and data literacy, how they utilize these technologies, and their perspectives on their applications. The results indicate that although over half of the teachers reported moderate to high AI knowledge, classroom implementation remains limited. Teachers primarily employed AI for text generation and content detection, with less frequent use of video generation or simulations. Common applications included lesson planning and material creation. Concerns centered on ethical implications, academic integrity, and a potential reduction in students’ critical thinking skills. This study reveals a moderate level of AI and data literacy knowledge among teachers in Catalan rural schools, contrasting with its limited practical application in the classroom. Teachers mainly use AI for text generation and content detection. Regarding data literacy, teachers demonstrated knowledge but lacked practical skills. These findings reveal a disconnect between theoretical AI knowledge and its practical application in the classroom, emphasizing the need for enhanced training and support to facilitate effective AI integration in education.

14 pages, 10252 KiB  
Article
A New Log-Transform Histogram Equalization Technique for Deep Learning-Based Document Forgery Detection
by Yong-Yeol Bae, Dae-Jea Cho and Ki-Hyun Jung
Symmetry 2025, 17(3), 395; https://doi.org/10.3390/sym17030395 - 5 Mar 2025
Cited by 2 | Viewed by 1247
Abstract
Recent advancements in image processing technology have positively impacted fields such as image, document, and video production. However, the negative implications of these advancements have also grown, with document image manipulation being a prominent issue. Document image manipulation involves the forgery or alteration of documents such as receipts, invoices, and various certificates and confirmations. The use of such manipulated documents can cause significant economic and social disruption. To prevent these issues, various methods for detecting forged document images are being researched, with recent proposals focused on deep learning techniques. An essential step in using deep learning to detect manipulated documents is to enhance the distinctive features of document images before inputting them into the model; doing so is crucial for achieving high accuracy. One important characteristic of document images is their inherent symmetrical patterns, such as consistent text alignment, structural balance, and uniform pixel distribution. This study investigates document forgery detection through a symmetry-aware approach: by focusing on the symmetric structures found in document layouts and pixel distributions, the proposed LTHE technique enhances feature extraction in deep learning-based models. Accordingly, this study proposes a new image enhancement technique, based on the results of three general-purpose CNN models, to enhance the characteristics of document images and achieve high accuracy in deep learning-based forgery detection. The proposed LTHE (Log-Transform Histogram Equalization) technique raises low pixel values through a log transformation and increases image contrast through histogram equalization, making image features more prominent. Experimental results show that the proposed LTHE technique achieves higher accuracy than other enhancement methods, indicating its potential to aid the development of deep learning-based forgery detection algorithms in the future.
(This article belongs to the Special Issue Symmetry in Image Processing: Novel Topics and Advancements)
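The enhancement step itself is simple to sketch: a log transform lifts low pixel values, then histogram equalization stretches the contrast of the lifted image. The NumPy/OpenCV code below is a hedged reading of the abstract; the scaling constant, the use of cv2.equalizeHist, and the example file path are our choices, and the authors' exact formulation may differ.

```python
import cv2
import numpy as np

def lthe(gray_document):
    """Sketch of log-transform + histogram equalization on an 8-bit grayscale image."""
    img = gray_document.astype(np.float32)
    c = 255.0 / np.log1p(img.max())            # scale so the output stays in [0, 255]
    log_img = (c * np.log1p(img)).astype(np.uint8)  # lift low pixel values
    return cv2.equalizeHist(log_img)           # stretch contrast of the lifted image

# Example (hypothetical path): enhance a document scan before CNN-based forgery detection.
doc = cv2.imread("receipt.png", cv2.IMREAD_GRAYSCALE)
enhanced = lthe(doc)
```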

16 pages, 7264 KiB  
Article
Video Description Generation Method Based on Contrastive Language–Image Pre-Training Combined Retrieval-Augmented and Multi-Scale Semantic Guidance
by Liang Wang, Yingying Hu, Zhouyong Xia, Enru Chen, Meiqing Jiao, Mengxue Zhang and Jun Wang
Electronics 2025, 14(2), 299; https://doi.org/10.3390/electronics14020299 - 13 Jan 2025
Cited by 1 | Viewed by 914
Abstract
To address the limitations of CLIP’s text encoder and the limited interaction between CLIP’s two towers, a CLIP-based video description model, RAMSG, is proposed that combines retrieval augmentation with multi-scale semantic guidance. First, RAMSG uses CLIP’s visual and text encoder modules to perform cross-modal retrieval and extract relevant text as a supervisory signal. Then, the intrinsic order of the semantic words is obtained by a semantic detector and ranker. Finally, the global and local semantic guidance of the multi-scale semantic guidance module is used to improve the video descriptions generated by the decoder module. Experimental results on the video description datasets MSR-VTT and VATEX show that the RAMSG model achieves significant improvements on several performance indicators compared with other work, and that the additional text semantics obtained through the video–text matching task greatly improve the performance of the model.
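The retrieval-augmentation step (using CLIP's two encoders to pull relevant text as a supervisory signal) could be sketched as a cross-modal top-k search, as below. The caption corpus, the frame-averaging strategy, and k are illustrative assumptions, assuming the OpenAI clip package; the semantic detector, ranker, and multi-scale guidance modules are not shown.

```python
import torch
import clip  # OpenAI CLIP package; hedged usage

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def retrieve_captions(frames, corpus, k=5):
    """frames: list of PIL images sampled from the video;
    corpus: list of candidate caption strings to retrieve from."""
    with torch.no_grad():
        imgs = torch.stack([preprocess(f) for f in frames]).to(device)
        video_emb = model.encode_image(imgs).float().mean(0, keepdim=True)   # mean frame embedding
        text_emb = model.encode_text(clip.tokenize(corpus).to(device)).float()
        video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        sims = (video_emb @ text_emb.T).squeeze(0)     # cosine similarity per caption
    top = sims.topk(min(k, len(corpus))).indices.tolist()
    return [corpus[i] for i in top]                    # retrieved supervisory text
```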

27 pages, 3711 KiB  
Article
An IoT Framework for Assessing the Correlation Between Sentiment-Analyzed Texts and Facial Emotional Expressions
by Sebastian-Ioan Petruc, Razvan Bogdan, Marian-Emanuel Ionascu, Sergiu Nimara and Marius Marcu
Electronics 2025, 14(1), 118; https://doi.org/10.3390/electronics14010118 - 30 Dec 2024
Cited by 1 | Viewed by 816
Abstract
Emotion monitoring technologies that detect facial expressions have gained considerable attention in psychological and social research due to their ability to provide objective emotional measurements. However, this paper addresses a gap in the literature concerning the correlation between emotional facial responses and sentiment analysis of written texts, developing a system capable of recognizing real-time emotional responses. The system uses a Raspberry Pi 4 and a Pi Camera module to perform real-time video capture and facial expression analysis with the DeepFace version 0.0.80 model, while sentiment analysis of texts is performed with Afinn version 0.1.0. Secure user authentication and a real-time database were implemented with Firebase. Although suitable for assessing psycho-emotional health in test takers, the system also provides valuable insights into the strong compatibility between the sentiment analysis performed on texts and the monitored facial emotional response, computing a “compatibility” parameter for each testing session. The framework provides an example of a new methodology for comparing different machine learning models, contributing to the enhancement of machine learning models’ efficiency and accuracy.
(This article belongs to the Special Issue Artificial Intelligence in Vision Modelling)
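Since the abstract names DeepFace 0.0.80 and Afinn, a hedged sketch of a per-frame comparison might look like the following: the sign of the Afinn text score is checked against the valence of the dominant facial emotion reported by DeepFace. The valence grouping of emotions and the binary compatibility rule are our assumptions; the paper's session-level "compatibility" parameter is likely computed differently.

```python
from deepface import DeepFace   # facial expression analysis (version 0.0.80 in the paper)
from afinn import Afinn         # lexicon-based sentiment scoring

afinn = Afinn()
POSITIVE = {"happy", "surprise"}                    # assumed valence grouping
NEGATIVE = {"angry", "disgust", "fear", "sad"}

def frame_text_compatibility(frame_path, text):
    """Return 1 if the facial valence and text-sentiment sign agree, else 0."""
    analysis = DeepFace.analyze(img_path=frame_path, actions=["emotion"])
    emotion = analysis[0]["dominant_emotion"]
    face_valence = 1 if emotion in POSITIVE else -1 if emotion in NEGATIVE else 0
    text_score = afinn.score(text)
    text_valence = 1 if text_score > 0 else -1 if text_score < 0 else 0
    return int(face_valence == text_valence)

# Example (hypothetical frame path and stimulus sentence):
# frame_text_compatibility("frame_0042.jpg", "I really enjoyed this part of the test.")
```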
