Sensors

Research

22 pages, 1588 KB

Open Access Article

Generative Sign-Description Prompts with Multi-Positive Contrastive Learning for Sign Language Recognition

by Siyu Liang, Yunan Li, Wentian Xin, Huizhou Chen, Xujie Liu, Kang Liu and Qiguang Miao

Sensors 2025, 25(19), 5957; https://doi.org/10.3390/s25195957 - 24 Sep 2025

Viewed by 523

Abstract

While sign language combines sequential hand motions with concurrent non-manual cues (e.g., mouth shapes and head tilts), current recognition systems lack multimodal annotation methods capable of capturing their hierarchical semantics. To bridge this gap, we propose GSP-MC, the first method integrating generative large language models into sign language recognition. It leverages retrieval-augmented generation with domain-specific large language models and expert-validated corpora to produce precise multipart descriptions. A dual-encoder architecture bidirectionally aligns hierarchical skeleton features with multi-level text descriptions (global, synonym, part) through probabilistic matching. The approach combines global and part-level losses with KL divergence optimization, ensuring robust alignment across relevant text-skeleton pairs while capturing sign semantics and detailed dynamics. Experiments demonstrate state-of-the-art performance, achieving 97.1% accuracy on the Chinese SLR500 (surpassing SSRL’s 96.9%) and 97.07% on the Turkish AUTSL (exceeding SML’s 96.85%), confirming cross-lingual potential for inclusive communication technologies.

(This article belongs to the Special Issue Multimodal Perception Modeling Based on Advanced Computational Technologies)
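As an illustration of the multi-positive, KL-based alignment that the abstract above mentions, the following is a minimal sketch and not the authors' implementation. It assumes one skeleton embedding is compared against several text descriptions, that the target distribution is uniform over the descriptions marked as positive, and that similarities are cosine-like scores. The function names, the temperature value, and the uniform target are all assumptions made for this example.

```python
import math


def log_softmax(scores, temperature):
    """Numerically stable log-softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def multi_positive_kl(similarities, positive_mask, temperature=0.1):
    """KL(target || predicted) for one sample against N candidate texts.

    similarities:  similarity of the sample to each candidate text
    positive_mask: 1 for texts that describe the sample, 0 otherwise
    The target spreads probability uniformly over all positives
    (an assumption of this sketch).
    """
    n_pos = sum(positive_mask)
    if n_pos == 0:
        raise ValueError("at least one positive text is required")
    log_pred = log_softmax(similarities, temperature)
    loss = 0.0
    for is_pos, lp in zip(positive_mask, log_pred):
        if is_pos:
            t = 1.0 / n_pos
            loss += t * (math.log(t) - lp)
    return loss


# Two positive descriptions (e.g. global and part-level) and one negative.
aligned = multi_positive_kl([0.9, 0.8, 0.1], [1, 1, 0])
misaligned = multi_positive_kl([0.1, 0.2, 0.9], [1, 1, 0])
print(aligned < misaligned)
```

The loss shrinks as the predicted distribution concentrates on the positive descriptions, which is the behaviour a multi-positive objective is meant to reward.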

19 pages, 2435 KB

Open Access Article

Image Sensor-Supported Multimodal Attention Modeling for Educational Intelligence

by Yanlin Chen, Yingqiu Yang, Zeyu Lan, Xinyuan Chen, Haoyuan Zhan, Lingxi Yu and Yan Zhan

Sensors 2025, 25(18), 5640; https://doi.org/10.3390/s25185640 - 10 Sep 2025

Viewed by 451

Abstract

To address the limitations of low fusion efficiency and insufficient personalization in multimodal perception for educational intelligence, a novel deep learning framework is proposed that integrates image sensor data with textual and contextual information through a cross-modal attention mechanism. The architecture employs a cross-modal alignment module to achieve fine-grained semantic correspondence between visual features captured by image sensors and associated textual elements, followed by a personalized feedback generator that incorporates learner background and task context embeddings to produce adaptive educational guidance. A cognitive weakness highlighter is introduced to enhance the discriminability of task-relevant features, enabling explicit localization and interpretation of conceptual gaps. Experiments show the proposed method outperforms conventional fusion and unimodal baselines with 92.37% accuracy, 91.28% recall, and 90.84% precision. Cross-task and noise-robustness tests confirm its stability, while ablation studies highlight the fusion module’s +4.2% accuracy gain and the attention mechanism’s +3.8% recall and +3.5% precision improvements. These results establish the proposed method as a transferable, high-performance solution for next-generation adaptive learning systems, offering precise, explainable, and context-aware feedback grounded in advanced multimodal perception modeling.

(This article belongs to the Special Issue Multimodal Perception Modeling Based on Advanced Computational Technologies)

15 pages, 1943 KB

Open Access Article

Multimodal Latent Representation Learning for Video Moment Retrieval

by Jinkwon Hwang, Mingyu Jeon and Junyeong Kim

Sensors 2025, 25(14), 4528; https://doi.org/10.3390/s25144528 - 21 Jul 2025

Viewed by 1027

Abstract

The rise of artificial intelligence (AI) has revolutionized the processing and analysis of video sensor data, driving advancements in areas such as surveillance, autonomous driving, and personalized content recommendations. However, leveraging video data presents unique challenges, particularly in the time-intensive feature extraction process required for model training. This challenge is intensified in research environments lacking advanced hardware resources like GPUs. We propose a new method called the multimodal latent representation learning framework (MLRL) to address these limitations. MLRL enhances the performance of downstream tasks by conducting additional representation learning on pre-extracted features. By integrating and augmenting multimodal data, our method effectively predicts latent representations, leveraging pre-extracted features to reduce model training time and improve task performance. We validate the efficacy of MLRL on the video moment retrieval task using the QVHighlights dataset, benchmarking against the QD-DETR model. Our results demonstrate significant improvements, highlighting the potential of MLRL to streamline video data processing by leveraging pre-extracted features to bypass the time-consuming extraction process of raw sensor data and enhance model accuracy in various sensor-based applications.

(This article belongs to the Special Issue Multimodal Perception Modeling Based on Advanced Computational Technologies)

26 pages, 8022 KB

Open Access Article

Toward a Recognition System for Mexican Sign Language: Arm Movement Detection

by Gabriela Hilario-Acuapan, Keny Ordaz-Hernández, Mario Castelán and Ismael Lopez-Juarez

Sensors 2025, 25(12), 3636; https://doi.org/10.3390/s25123636 - 10 Jun 2025

Cited by 1 | Viewed by 1113

Abstract

This paper describes ongoing work on a recognition system for Mexican Sign Language (LSM). We propose a general sign decomposition that is divided into three parts, i.e., hand configuration (HC), arm movement (AM), and non-hand gestures (NHGs). This paper focuses on the AM features and reports the approach created to analyze visual patterns in arm joint movements (wrists, shoulders, and elbows). For this research, a proprietary dataset, one that does not limit the recognition of arm movements, was developed with active participation from the deaf community and LSM experts. We analyzed two case studies involving three sign subsets. For each sign, the pose was extracted to generate shapes of the joint paths during the arm movements, and these shapes were fed to a CNN classifier. YOLOv8 was used for both pose estimation and visual pattern classification. The proposed approach, based on pose estimation, shows promising results for constructing CNN models to classify a wide range of signs.

(This article belongs to the Special Issue Multimodal Perception Modeling Based on Advanced Computational Technologies)

30 pages, 845 KB

Open Access Article

A Multimodal Deep Learning Approach for Legal English Learning in Intelligent Educational Systems

by Yanlin Chen, Chenjia Huang, Shumiao Gao, Yifan Lyu, Xinyuan Chen, Shen Liu, Dat Bao and Chunli Lv

Sensors 2025, 25(11), 3397; https://doi.org/10.3390/s25113397 - 28 May 2025

Viewed by 1193

Abstract

With the development of artificial intelligence and intelligent sensor technologies, traditional legal English teaching approaches have faced numerous challenges in handling multimodal inputs and complex reasoning tasks. In response to these challenges, a cross-modal legal English question-answering system based on visual and acoustic sensor inputs was proposed, integrating image, text, and speech information and adopting a unified vision–language–speech encoding mechanism coupled with dynamic attention modeling to effectively enhance learners’ understanding and expressive abilities in legal contexts. The system exhibited superior performance across multiple experimental evaluations. In the assessment of question-answering accuracy, the proposed method achieved the best results across BLEU, ROUGE, Precision, Recall, and Accuracy, with an Accuracy of 0.87, Precision of 0.88, and Recall of 0.85, clearly outperforming the traditional ASR+SVM classifier, image-retrieval-based QA model, and unimodal BERT QA system. In the analysis of multimodal matching performance, the proposed method achieved optimal results in Matching Accuracy, Recall@1, Recall@5, and MRR, with a Matching Accuracy of 0.85, surpassing mainstream cross-modal models such as VisualBERT, LXMERT, and CLIP. The user study further verified the system’s practical effectiveness in real teaching environments, with learners’ understanding improvement reaching 0.78, expression improvement reaching 0.75, and satisfaction score reaching 0.88, significantly outperforming traditional teaching methods and unimodal systems. The experimental results fully demonstrate that the proposed cross-modal legal English question-answering system not only exhibits significant advantages in multimodal feature alignment and deep reasoning modeling but also shows substantial potential in enhancing learners’ comprehensive capabilities and learning experiences.

(This article belongs to the Special Issue Multimodal Perception Modeling Based on Advanced Computational Technologies)
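For readers unfamiliar with the matching metrics named in the abstract above, Recall@k and mean reciprocal rank (MRR) can be sketched as follows. This is a generic illustration under the assumption that each query has exactly one relevant item; the function names and the toy data are hypothetical and do not come from the article.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant item appears in the top-k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant item (1-based), or 0.0 if it is absent."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(queries, k_values=(1, 5)):
    """Average Recall@k and MRR over (ranked_ids, relevant_id) pairs."""
    n = len(queries)
    if n == 0:
        raise ValueError("queries must not be empty")
    report = {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in queries) / n
        for k in k_values
    }
    report["mrr"] = sum(reciprocal_rank(r, rel) for r, rel in queries) / n
    return report


# Toy example: three queries whose single relevant item is "a".
toy_queries = [
    (["a", "b", "c"], "a"),  # relevant item ranked first
    (["b", "a", "c"], "a"),  # relevant item ranked second
    (["c", "b", "d"], "a"),  # relevant item not retrieved
]
print(evaluate(toy_queries))
```

In the toy data, one of three queries ranks the relevant item first and two of three retrieve it at all, so Recall@1 is 1/3, Recall@5 is 2/3, and MRR is (1 + 1/2 + 0) / 3 = 0.5.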

20 pages, 2148 KB

Open Access Article

Analysis of Pleasure and Displeasure in Harmony Between Colored Light and Fragrance by the Left and Right OFC Response Differences

by Toshinori Oba, Midori Tanaka and Takahiko Horiuchi

Sensors 2025, 25(7), 2230; https://doi.org/10.3390/s25072230 - 2 Apr 2025

Cited by 1 | Viewed by 791

Abstract

Daily actions are influenced by sensory information. Several studies have investigated the multisensory integration of multiple sensory modalities, known as crossmodal perception. Recently, visual–olfactory crossmodal perception has been studied using objective physiological measures rather than subjective evaluations. This study focused on sensing in the orbitofrontal cortex (OFC), which responds to visual and olfactory stimuli, and may serve as a physiological indicator of perception. Using near-infrared spectroscopy (NIRS), we analyzed the emotions evoked by combinations of colored light and fragrance with a particular focus on the lateralization of brain function. We selected pleasant and unpleasant fragrances from some essential oils, paired with colored lights that were perceived as either harmonious or disharmonious with the fragrances. NIRS measurements were conducted under the four following conditions: fragrance-only, colored light-only, harmonious crossmodal, and disharmonious crossmodal presentations. The results showed that the left OFC was activated during the crossmodal presentation of a harmonious color with a pleasant fragrance, thereby evoking pleasant emotions. In contrast, during the crossmodal presentation of a disharmonious color with an unpleasant fragrance, the right OFC was activated, suggesting increased displeasure. Additionally, the lateralization of brain function between the left and right OFC may be influenced by ‘pleasure–displeasure’ and ‘crossmodal perception–multimodal perception’.

(This article belongs to the Special Issue Multimodal Perception Modeling Based on Advanced Computational Technologies)

16 pages, 3396 KB

Open Access Article

Parameter-Efficient Adaptation of Large Vision–Language Models for Video Memorability Prediction

by Iván Martín-Fernández, Sergio Esteban-Romero, Fernando Fernández-Martínez and Manuel Gil-Martín

Sensors 2025, 25(6), 1661; https://doi.org/10.3390/s25061661 - 7 Mar 2025

Viewed by 2158

Abstract

The accurate modelling of video memorability, or the intrinsic properties that render a piece of audiovisual content more likely to be remembered, will facilitate the development of automatic systems that are more efficient in retrieving, classifying and generating impactful media. Recent studies have indicated a strong correlation between the visual semantics of video and its memorability. This underscores the importance of developing advanced visual comprehension abilities to enhance model performance. Large Vision–Language Models (LVLMs) have been shown to exhibit exceptional proficiency in generalist, high-level semantic comprehension of images and video, owing to their large-scale multimodal pre-training. This work makes use of the vast generalist knowledge of LVLMs and explores efficient adaptation techniques with a view to utilising them as memorability predictors. In particular, the Quantized Low-Rank Adaptation (QLoRA) technique is employed to fine-tune the Qwen-VL model with memorability-related data extracted from the Memento10k dataset. In light of existing research, we propose a particular methodology that transforms Qwen-VL from a language model to a memorability score regressor. Furthermore, we consider the influence of selecting appropriate LoRA hyperparameters, a design aspect that has been insufficiently studied. We validate the LoRA rank and alpha hyperparameters using 5-Fold Cross-Validation and evaluate our best configuration on the official testing portion of the Memento10k dataset, obtaining a state-of-the-art Spearman Rank Correlation Coefficient (SRCC) of 0.744. Consequently, this work represents a significant advancement in modelling video memorability through high-level semantic understanding.

(This article belongs to the Special Issue Multimodal Perception Modeling Based on Advanced Computational Technologies)
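The Spearman Rank Correlation Coefficient reported in the abstract above measures how well predicted scores preserve the ordering of ground-truth scores. A minimal standard-library sketch of the metric follows; it is a generic definition (Pearson correlation of average ranks), not the authors' evaluation code, and the example scores are invented for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical ground-truth and predicted memorability scores for four videos.
truth = [0.9, 0.7, 0.8, 0.6]
predicted = [0.85, 0.75, 0.70, 0.65]
print(round(spearman(truth, predicted), 3))  # → 0.8
```

Because only ranks enter the computation, any monotone rescaling of the predictions leaves the coefficient unchanged, which is why the metric suits a regressor whose absolute score scale may differ from the annotations.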

17 pages, 2351 KB

Open Access Article

Extending Anxiety Detection from Multimodal Wearables in Controlled Conditions to Real-World Environments

by Abdulrahman Alkurdi, Maxine He, Jonathan Cerna, Jean Clore, Richard Sowers, Elizabeth T. Hsiao-Wecksler and Manuel E. Hernandez

Sensors 2025, 25(4), 1241; https://doi.org/10.3390/s25041241 - 18 Feb 2025

Cited by 2 | Viewed by 2782

Abstract

This study quantitatively evaluated whether and how machine learning (ML) models built from data collected under controlled conditions can fit real-world conditions. This study focused on feature-based models using wearable technology from real-world data collected from young adults, so as to provide insights into the models’ robustness and the specific challenges posed by diverse environmental noise. Feature-based models, particularly XGBoost and Decision Trees, demonstrated considerable resilience, maintaining higher accuracy and reliability across different noise levels. This investigation included an in-depth analysis of transfer learning, highlighting its potential and limitations in adapting models developed from standard datasets, like WESAD, to complex real-world scenarios. Moreover, this study analyzed the distributed feature importance across various physiological signals, such as electrodermal activity (EDA) and electrocardiography (ECG), considering their vulnerability to environmental factors. It was found that integrating multiple physiological data types could significantly enhance model robustness. The results underscored the need for a nuanced understanding of signal contributions to model efficacy, suggesting that feature-based models showed much promise in practical applications.

(This article belongs to the Special Issue Multimodal Perception Modeling Based on Advanced Computational Technologies)

