Search Results (610)

Search Parameters:
Keywords = Character Recognition

27 pages, 4945 KB  
Article
A Robust Framework for Coffee Bean Package Label Recognition: Integrating Image Enhancement with Vision–Language OCR Models
by Thi-Thu-Huong Le, Yeonjeong Hwang, Ahmada Yusril Kadiptya, JunYoung Son and Howon Kim
Sensors 2025, 25(20), 6484; https://doi.org/10.3390/s25206484 - 20 Oct 2025
Abstract
Text recognition on coffee bean package labels is of great importance for product tracking and brand verification, but it poses a challenge due to variations in image quality, packaging materials, and environmental conditions. In this paper, we propose a pipeline that combines several image enhancement techniques with an Optical Character Recognition (OCR) model based on vision–language (VL) Qwen VL variants, conditioned by structured prompts. To facilitate evaluation, we construct a coffee bean package image set containing two subsets, a low-resolution (LRCB) and a high-resolution (HRCB) coffee bean image set, covering multiple real-world challenges. These cases involve various packaging types (bottles and bags), label sides (front and back), rotation, and different illumination. To address the image quality problem, we design a dedicated preprocessing pipeline for package label scenarios. We develop and evaluate four Qwen-VL OCR variants with prompt engineering, compared against four baselines: DocTR, PaddleOCR, EasyOCR, and Tesseract. An extensive comparison using various metrics, including Levenshtein distance, cosine similarity, the Jaccard index, exact match, BLEU score, and ROUGE scores (ROUGE-1, ROUGE-2, and ROUGE-L), shows significant improvements over the baselines. In addition, a validation test on the public POIE dataset demonstrates how well the framework generalizes, confirming its practicality and reliability for label recognition.
(This article belongs to the Special Issue Digital Imaging Processing, Sensing, and Object Recognition)
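
As context for the evaluation, the metrics listed above are standard string-similarity measures between OCR output and ground-truth label text. Below is a minimal Python sketch of three of them (Levenshtein distance, a Jaccard index over whitespace tokens, and exact match); the example strings are hypothetical, and in practice BLEU and ROUGE would come from an established library rather than being hand-rolled.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Jaccard index over whitespace-separated tokens."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def exact_match(a: str, b: str) -> bool:
    return a == b

# Hypothetical OCR prediction vs. ground-truth label text
pred, truth = "Arabica 250g", "Arabica 250 g"
print(levenshtein(pred, truth), jaccard(pred, truth), exact_match(pred, truth))
```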

20 pages, 7756 KB  
Article
A Novel System for the Characterization of Bark Macroscopic Morphology for Central European Woody Species
by László Zoltán and Márton Korda
Forests 2025, 16(10), 1586; https://doi.org/10.3390/f16101586 - 15 Oct 2025
Viewed by 497
Abstract
Accurate identification of deciduous woody species in winter is challenging, and misidentification can lead to ecological and management damage. This study aims to substantiate a diagnostic system for woody species based on macromorphological bark characters. First, we reviewed the literature on bark-based species identification to assess existing approaches and their limitations. Building on this, we identified informative macromorphological features of bark through both literature analysis and our own experience. These characters cover all developmental phases, including twigs, young bark, and mature bark, and are supported by new diagnostic terminology. Using this framework, we compiled a character set for 115 Central European woody taxa, providing practical, primarily qualitative traits that can be applied directly in the field. Finally, we developed and tested “Single-access Keys” as an alternative to conventional dichotomous keys, demonstrating their effectiveness in enabling flexible and rapid species recognition, even under atypical conditions or when only partial observations are possible. Our results highlight the value of bark macromorphology as a diagnostic tool and emphasize its potential for advancing thematic identification keys, as well as digital applications in forestry, taxonomy, and ecological monitoring.
(This article belongs to the Section Forest Biodiversity)

21 pages, 2783 KB  
Article
Deep Learning-Based Eye-Writing Recognition with Improved Preprocessing and Data Augmentation Techniques
by Kota Suzuki, Abu Saleh Musa Miah and Jungpil Shin
Sensors 2025, 25(20), 6325; https://doi.org/10.3390/s25206325 - 13 Oct 2025
Viewed by 295
Abstract
Eye-tracking technology enables communication for individuals with muscle control difficulties, making it a valuable assistive tool. Traditional systems rely on electrooculography (EOG) or infrared devices, which are accurate but costly and invasive. While vision-based systems offer a more accessible alternative, they have not been extensively explored for eye-writing recognition. Additionally, the natural instability of eye movements and variations in writing styles result in inconsistent signal lengths, which reduces recognition accuracy and limits the practical use of eye-writing systems. To address these challenges, we propose a novel vision-based eye-writing recognition approach that utilizes a webcam-captured dataset. A key contribution of our approach is the introduction of a Discrete Fourier Transform (DFT)-based length normalization method that standardizes the length of each eye-writing sample while preserving essential spectral characteristics. This ensures uniformity in input lengths and improves both efficiency and robustness. Moreover, we integrate a hybrid deep learning model that combines 1D Convolutional Neural Networks (CNN) and Temporal Convolutional Networks (TCN) to jointly capture spatial and temporal features of eye-writing. To further improve model robustness, we incorporate data augmentation and initial-point normalization techniques. The proposed system was evaluated using our new webcam-captured Arabic numbers dataset and two existing benchmark datasets, with leave-one-subject-out (LOSO) cross-validation. The model achieved accuracies of 97.68% on the new dataset, 94.48% on the Japanese Katakana dataset, and 98.70% on the EOG-captured Arabic numbers dataset—outperforming existing systems. This work provides an efficient eye-writing recognition system, featuring robust preprocessing techniques, a hybrid deep learning model, and a new webcam-captured dataset.
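
The DFT-based length normalization described above can be illustrated with a short sketch: each variable-length gaze trajectory is resampled to a fixed number of points in the frequency domain, which preserves the overall spectral shape of the stroke. This uses SciPy's Fourier-domain resampling as a stand-in for the paper's exact formulation, and the target length of 256 is an assumption.

```python
import numpy as np
from scipy.signal import resample  # Fourier-domain (DFT-based) resampling

TARGET_LEN = 256  # assumed fixed length; the paper's value may differ

def normalize_length(trajectory: np.ndarray) -> np.ndarray:
    """Resample an (n_samples, 2) gaze trajectory of (x, y) points to
    TARGET_LEN samples via the DFT, preserving its spectral shape."""
    return resample(trajectory, TARGET_LEN, axis=0)

# Hypothetical eye-writing samples of different lengths
short = np.cumsum(np.random.randn(143, 2), axis=0)
long_ = np.cumsum(np.random.randn(987, 2), axis=0)
print(normalize_length(short).shape, normalize_length(long_).shape)  # (256, 2) twice
```
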
17 pages, 2289 KB  
Article
Aging-Aware Character Recognition with E-Textile Inputs
by Juncong Lin, Yujun Rong, Yao Cheng and Chenkang He
Electronics 2025, 14(19), 3964; https://doi.org/10.3390/electronics14193964 - 9 Oct 2025
Viewed by 264
Abstract
E-textiles, textiles integrated with conductive sensors, allow users to freely utilize any area of the body in a convenient and comfortable manner. Interactions with e-textiles are therefore attracting more and more attention, especially for text input. However, the functional aging of e-textiles affects the characteristics and even the quality of the captured signal, presenting serious challenges for character recognition. This paper focuses on studying the behavior of e-textile functional aging and alleviating its impact on text input with an unsupervised domain adaptation technique named A2TEXT (aging-aware e-textile-based text input). We first designed a deep kernel-based two-sample test method to validate the impact of functional aging on handwriting with e-textile input. Based on that, we introduced a Gabor domain adaptation technique, which adopts a novel Gabor orientation filter for feature extraction under an adversarial domain adaptation framework. We demonstrated superior performance compared to traditional models in four different transfer tasks, validating the effectiveness of our work.
(This article belongs to the Special Issue End User Applications for Virtual, Augmented, and Mixed Reality)
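
For background, a Gabor orientation filter is a sinusoid modulated by a Gaussian envelope and is selective for strokes at a particular angle, which is what makes it useful for handwriting features. A minimal NumPy sketch of such a filter bank follows; the parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gabor_kernel(size: int = 21, sigma: float = 4.0, theta: float = 0.0,
                 lambd: float = 8.0, gamma: float = 0.5, psi: float = 0.0):
    """Real part of a 2-D Gabor filter tuned to orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t**2 + (gamma * y_t)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * x_t / lambd + psi)
    return envelope * carrier

# A small orientation bank: 8 directions spanning 180 degrees
bank = [gabor_kernel(theta=k * np.pi / 8) for k in range(8)]
print(len(bank), bank[0].shape)  # 8 (21, 21)
```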

16 pages, 7184 KB  
Article
Towards Robust Scene Text Recognition: A Dual Correction Mechanism with Deformable Alignment
by Yajiao Feng and Changlu Li
Electronics 2025, 14(19), 3968; https://doi.org/10.3390/electronics14193968 - 9 Oct 2025
Viewed by 361
Abstract
Scene Text Recognition (STR) faces significant challenges under complex degradation conditions, such as distortion, occlusion, and semantic ambiguity. Most existing methods rely heavily on language priors for correction, but effectively constructing language rules remains a complex problem. This paper addresses two key challenges: (1) the over-correction behavior of language models, particularly on semantically deficient input, which can result in both recognition errors and loss of critical information; and (2) character misalignment in visual features, which affects recognition accuracy. To address these problems, we propose a Deformable-Alignment-based Dual Correction Mechanism (DADCM) for STR. Our method includes the following key components: (1) a visually guided and language-assisted correction strategy, in which a dynamic confidence threshold controls the degree of language model intervention; (2) a visual backbone network called SCRTNet, which enhances key text regions through a channel attention module (SENet) and applies deformable convolution (DCNv4) in deep layers to better model distorted or curved text; and (3) a deformable alignment module (DAM), which combines Gumbel-Softmax-based anchor sampling and geometry-aware self-attention to improve character alignment. Experiments on multiple benchmark datasets demonstrate the superiority of our approach; in particular, on the Union14M-Benchmark, recognition accuracy surpasses previous methods by 1.1%, 1.6%, 3.0%, and 1.3% on the Curved, Multi-Oriented, Contextless, and General subsets, respectively.
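
The confidence-gated correction idea can be illustrated with a small sketch: keep the visual branch's character when its confidence clears a threshold, and fall back to the language model's suggestion otherwise. The names and the fixed threshold here are illustrative assumptions (the paper's threshold is dynamic), not the paper's implementation.

```python
from typing import List, Tuple

def dual_correct(visual: List[Tuple[str, float]], lm_suggestion: str,
                 threshold: float = 0.85) -> str:
    """Merge per-character visual predictions (char, confidence) with a
    language-model correction, trusting vision above the threshold."""
    assert len(visual) == len(lm_suggestion)
    out = []
    for (v_char, conf), lm_char in zip(visual, lm_suggestion):
        out.append(v_char if conf >= threshold else lm_char)
    return "".join(out)

# Hypothetical example: vision is unsure about the third character
visual = [("S", 0.97), ("H", 0.91), ("0", 0.42), ("P", 0.95)]
print(dual_correct(visual, "SHOP"))  # -> "SHOP"
```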

18 pages, 864 KB  
Article
Enhanced Semantic BERT for Named Entity Recognition in Education
by Ping Huang, Huijuan Zhu, Ying Wang, Lili Dai and Lei Zheng
Electronics 2025, 14(19), 3951; https://doi.org/10.3390/electronics14193951 - 7 Oct 2025
Viewed by 289
Abstract
To address technical challenges in educational-domain named entity recognition (NER), such as ambiguous entity boundaries and difficulties with nested entity identification, this study proposes an enhanced semantic BERT model (ES-BERT). The model adopts an education-domain, vocabulary-assisted semantic enhancement strategy that (1) applies the term frequency–inverse document frequency (TF-IDF) algorithm to weight domain-specific terms, and (2) fuses the weighted lexical information with character-level features, enabling BERT to generate enriched, domain-aware, character–word hybrid representations. A complete bidirectional long short-term memory–conditional random field (BiLSTM-CRF) recognition framework was established, and a novel focal loss-based joint training method was introduced to optimize the process. The experimental design employed a three-phase validation protocol: (1) in a comparative evaluation using 5-fold cross-validation on our proprietary computer-education dataset, the proposed ES-BERT model yielded a precision of 90.38%, higher than that of the baseline models; (2) ablation studies confirmed the contribution of domain-vocabulary enhancement to performance improvement; and (3) cross-domain experiments on the 2016 knowledge base question answering and resume benchmark datasets demonstrated outstanding precision of 98.41% and 96.75%, respectively, verifying the model’s transfer-learning capability. These comprehensive experimental results substantiate that ES-BERT not only effectively resolves domain-specific NER challenges in education but also exhibits remarkable cross-domain adaptability.
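
The TF-IDF weighting step can be sketched independently of the model: terms that are distinctive for a sentence receive higher weights, which the strategy then uses to scale the lexical features fused with BERT's character-level representations. The mini-corpus below is a hypothetical stand-in for the education-domain vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus of education-domain sentences
corpus = [
    "binary search tree traversal order",
    "binary classification with logistic regression",
    "tree-based models and search heuristics",
]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(corpus)

# Per-term weights for the first sentence: terms shared across the corpus
# (e.g., "binary") score lower than sentence-specific ones ("traversal").
weights = dict(zip(vec.get_feature_names_out(), tfidf.toarray()[0]))
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    if weight > 0:
        print(f"{term:>14s}  {weight:.3f}")
```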

20 pages, 552 KB  
Article
Trust in Stories: A Reader Response Study of (Un)Reliability in Akutagawa’s “In a Grove”
by Inge van de Ven
Literature 2025, 5(4), 24; https://doi.org/10.3390/literature5040024 - 30 Sep 2025
Viewed by 316
Abstract
For this article, we reviewed and synthesized narratological theories on reliability and unreliability and used them as the basis for an exploratory study, examining how real readers respond to a literary short story that contains several unreliable or conflicting narrative accounts. The story we selected is “In a Grove” by Ryūnosuke Akutagawa (orig. 藪の中/Yabu no naka) from 1922 in the English translation by Jay Rubin from 2007. To investigate how readers evaluate trustworthiness in narrative contexts, we combined quantitative and qualitative methods. We analyzed correlations between reading habits (i.e., Author Recognition Test), cognitive traits (e.g., Need for Cognition; Epistemic Trust), and trust attributions to characters while also examining how narrative sequencing and character-specific reasons for (dis)trust shaped participants’ judgments. This mixed-methods approach allows us to situate narrative trust as a context-sensitive, interpretive process rather than a stable individual disposition.
(This article belongs to the Special Issue Literary Experiments with Cognition)

14 pages, 3652 KB  
Article
Enhancing Mobility for the Blind: An AI-Powered Bus Route Recognition System
by Shehzaib Shafique, Gian Luca Bailo, Monica Gori, Giulio Sciortino and Alessio Del Bue
Algorithms 2025, 18(10), 616; https://doi.org/10.3390/a18100616 - 30 Sep 2025
Viewed by 261
Abstract
Vision is a critical component of daily life, and its loss significantly hinders an individual’s ability to navigate, particularly when using public transportation systems. To address this challenge, this paper introduces a novel approach for accurately identifying bus route numbers and destinations, designed to assist visually impaired individuals in navigating urban transit networks. Our system integrates object detection, image enhancement, and Optical Character Recognition (OCR) technologies to achieve reliable and precise recognition of bus information. We employ a custom-trained You Only Look Once version 8 (YOLOv8) model to isolate the front portion of buses as the region of interest (ROI), effectively eliminating irrelevant text and advertisements that often lead to errors. To further enhance accuracy, we utilize the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) to improve image resolution, significantly boosting the confidence of the OCR process. Additionally, a post-processing step involving a pre-defined list of bus routes and the Levenshtein algorithm corrects potential errors in text recognition, ensuring reliable identification of bus numbers and destinations. Tested on a dataset of 120 images featuring diverse bus routes and challenging conditions such as poor lighting, reflections, and motion blur, our system achieved an accuracy rate of 95%. This performance surpasses existing methods and demonstrates the system’s potential for real-world application. By providing a robust and adaptable solution, our work aims to enhance public transit accessibility, empowering visually impaired individuals to navigate cities with greater independence and confidence.
(This article belongs to the Section Combinatorial Optimization, Graph, and Network Algorithms)
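
The post-processing step, snapping noisy OCR output onto the nearest entry in a known route list, can be sketched as follows. The paper uses the Levenshtein algorithm; this sketch substitutes Python's difflib matcher as a stand-in, and the route list is hypothetical.

```python
import difflib
from typing import Optional

# Hypothetical pre-defined route list
KNOWN_ROUTES = ["42 CENTRAL STATION", "13 AIRPORT", "7 HARBOUR", "42 UNIVERSITY"]

def correct_route(ocr_text: str) -> Optional[str]:
    """Return the closest known route, or None if nothing is plausible."""
    matches = difflib.get_close_matches(ocr_text.upper(), KNOWN_ROUTES,
                                        n=1, cutoff=0.6)
    return matches[0] if matches else None

print(correct_route("42 CENTRAI STATI0N"))  # -> "42 CENTRAL STATION"
```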

22 pages, 365 KB  
Article
Development of a Fully Autonomous Offline Assistive System for Visually Impaired Individuals: A Privacy-First Approach
by Fitsum Yebeka Mekonnen, Mohammad F. Al Bataineh, Dana Abu Abdoun, Ahmed Serag, Kena Teshale Tamiru, Winner Abula and Simon Darota
Sensors 2025, 25(19), 6006; https://doi.org/10.3390/s25196006 - 29 Sep 2025
Viewed by 571
Abstract
Visual impairment affects millions worldwide, creating significant barriers to environmental interaction and independence. Existing assistive technologies often rely on cloud-based processing, raising privacy concerns and limiting accessibility in resource-constrained environments. This paper explores the integration and potential of open-source AI models in developing a fully offline assistive system that can be locally set up and operated to support visually impaired individuals. Built on a Raspberry Pi 5, the system combines real-time object detection (YOLOv8), optical character recognition (Tesseract), face recognition with voice-guided registration, and offline voice command control (VOSK), delivering hands-free multimodal interaction without dependence on cloud infrastructure. Audio feedback is generated using Piper for real-time environmental awareness. Designed to prioritize user privacy, low latency, and affordability, the platform demonstrates that effective assistive functionality can be achieved using only open-source tools on low-power edge hardware. Evaluation results in controlled conditions show 75–90% detection and recognition accuracies, with sub-second response times, confirming the feasibility of deploying such systems in privacy-sensitive or resource-constrained environments.
(This article belongs to the Section Biomedical Sensors)

19 pages, 537 KB  
Article
Tracking the Impact of Age and Dimensional Shifts on Situation Model Updating During Narrative Text Comprehension
by César Campos-Rojas and Romualdo Ibáñez-Orellana
J. Eye Mov. Res. 2025, 18(5), 48; https://doi.org/10.3390/jemr18050048 - 26 Sep 2025
Viewed by 272
Abstract
Studies on the relationship between age and situation model updating during narrative text reading have mainly used response or reading times. This study enhances previous measures (working memory, recognition probes, and comprehension) by incorporating eye-tracking techniques to compare situation model updating between young and older Chilean adults. The study included 82 participants (40 older adults and 42 young adults) who read two narrative texts under three conditions (no shift, spatial shift, and character shift) using a between-subject (age) and within-subject (dimensional change) design. The results show that, while differences in working memory capacity were observed between the groups, these differences did not impact situation model comprehension. Younger adults performed better in recognition tests regardless of updating conditions. Eye-tracking data showed increased fixation times for dimensional shifts and longer reading times in older adults, with no interaction between age and dimensional shifts.

22 pages, 1250 KB  
Article
Entity Span Suffix Classification for Nested Chinese Named Entity Recognition
by Jianfeng Deng, Ruitong Zhao, Wei Ye and Suhong Zheng
Information 2025, 16(10), 822; https://doi.org/10.3390/info16100822 - 23 Sep 2025
Viewed by 329
Abstract
Named entity recognition (NER) is one of the fundamental tasks in building knowledge graphs. For some domain-specific corpora, the text descriptions exhibit limited standardization, and some entity structures are nested. Existing entity recognition methods suffer from problems such as word-matching noise interference and difficulty in distinguishing different entity labels for the same character in sequence label prediction. This paper proposes a span-based, feature-reuse stacked bidirectional long short-term memory (BiLSTM) nested named entity recognition (SFRSN) model, which transforms entity recognition from sequence prediction into a problem of entity span suffix category classification. First, character feature embeddings are generated through bidirectional encoder representations from transformers (BERT). Second, a feature-reuse stacked BiLSTM is proposed to obtain deep context features while alleviating deep network degradation. Third, span features are obtained through a dilated convolutional neural network (DCNN), and a single-tail selection function is introduced to obtain the classification feature of the entity span suffix, with the aim of reducing training parameters. Fourth, a global feature gated attention mechanism is proposed, integrating span features and span suffix classification features to achieve span suffix classification. Experimental results on four Chinese domain-specific datasets demonstrate the effectiveness of our approach: SFRSN achieves micro-F1 scores of 83.34% on OntoNotes, 73.27% on Weibo, 96.90% on Resume, and 86.77% on a supply chain management dataset, a maximum improvement of 1.55%, 4.94%, 2.48%, and 3.47% over state-of-the-art baselines, respectively. These results demonstrate the effectiveness of the model in addressing nested entities and entity label ambiguity.
(This article belongs to the Section Artificial Intelligence)
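
Span-based NER sidesteps per-token label conflicts by scoring candidate spans directly; a nested mention is simply a second span that overlaps the first. A toy sketch of candidate-span enumeration follows; the classifier is stubbed out and the maximum span length is an assumption.

```python
from typing import Callable, List, Tuple

MAX_SPAN = 8  # assumed upper bound on entity length, in characters

def enumerate_spans(text: str) -> List[Tuple[int, int, str]]:
    """All candidate (start, end, substring) spans up to MAX_SPAN chars."""
    return [(i, j, text[i:j])
            for i in range(len(text))
            for j in range(i + 1, min(i + MAX_SPAN, len(text)) + 1)]

def extract(text: str, classify: Callable[[str], str]) -> list:
    """Keep spans whose (stubbed) classifier assigns a non-O category;
    overlapping hits naturally cover nested entities."""
    return [(s, e, span, label) for s, e, span in enumerate_spans(text)
            if (label := classify(span)) != "O"]

# Stub classifier for illustration only: "北京大学" (ORG) nests "北京" (LOC)
demo = lambda span: "ORG" if span == "北京大学" else ("LOC" if span == "北京" else "O")
print(extract("北京大学位于北京", demo))
```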

18 pages, 1694 KB  
Article
FAIR-Net: A Fuzzy Autoencoder and Interpretable Rule-Based Network for Ancient Chinese Character Recognition
by Yanling Ge, Yunmeng Zhang and Seok-Beom Roh
Sensors 2025, 25(18), 5928; https://doi.org/10.3390/s25185928 - 22 Sep 2025
Viewed by 373
Abstract
Ancient Chinese scripts—including oracle bone carvings, bronze inscriptions, stone steles, Dunhuang scrolls, and bamboo slips—are rich in historical value but often degraded due to centuries of erosion, damage, and stylistic variability. These issues severely hinder manual transcription and render conventional OCR techniques inadequate, as they are typically trained on modern printed or handwritten text and lack interpretability. To tackle these challenges, we propose FAIR-Net, a hybrid architecture that combines the unsupervised feature learning capacity of a deep autoencoder with the semantic transparency of a fuzzy rule-based classifier. In FAIR-Net, the deep autoencoder first compresses high-resolution character images into low-dimensional, noise-robust embeddings. These embeddings are then passed into a Fuzzy Neural Network (FNN), whose hidden layer leverages Fuzzy C-Means (FCM) clustering to model soft membership degrees and generate human-readable fuzzy rules. The output layer uses Iteratively Reweighted Least Squares Estimation (IRLSE) combined with a Softmax function to produce probabilistic predictions, with all weights constrained as linear mappings to maintain model transparency. We evaluate FAIR-Net on CASIA-HWDB1.0, HWDB1.1, and ICDAR 2013 CompetitionDB, where it achieves a recognition accuracy of 97.91%, significantly outperforming baseline CNNs (p < 0.01, Cohen’s d > 0.8) while maintaining the tightest confidence interval (96.88–98.94%) and lowest standard deviation (±1.03%). Additionally, FAIR-Net reduces inference time to 25 s, improving processing efficiency by 41.9% over AlexNet and up to 98.9% over CNN-Fujitsu, while preserving >97.5% accuracy across evaluations. To further assess generalization to historical scripts, FAIR-Net was tested on the Ancient Chinese Character Dataset (9233 classes; 979,907 images), achieving 83.25% accuracy—slightly higher than ResNet101 but 2.49% lower than SwinT-v2-small—while reducing training time by over 5.5× compared to transformer-based baselines. Fuzzy rule visualization confirms enhanced robustness to glyph ambiguities and erosion. Overall, FAIR-Net provides a practical, interpretable, and highly efficient solution for the digitization and preservation of ancient Chinese character corpora.
(This article belongs to the Section Sensing and Imaging)
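
The fuzzy layer's soft memberships follow the standard Fuzzy C-Means update: an embedding close to a cluster centre gets membership near 1, while distant embeddings share graded membership across clusters. A minimal sketch of that membership computation follows; the fuzzifier m = 2 and the toy data are assumptions, not values from the paper.

```python
import numpy as np

def fcm_memberships(X: np.ndarray, centers: np.ndarray, m: float = 2.0) -> np.ndarray:
    """Soft membership u[i, k] of each sample i in each cluster k, using the
    standard FCM formula u_ik = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1))."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)  # (n, c)
    d = np.maximum(d, 1e-12)  # avoid division by zero at a centre
    ratios = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratios.sum(axis=2)

# Toy 2-D "embeddings" and two cluster centres
X = np.array([[0.1, 0.0], [0.9, 1.0], [0.5, 0.5]])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
print(fcm_memberships(X, centers).round(3))  # each row sums to 1
```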

16 pages, 261 KB  
Article
Naming as Narrative Strategy: Semiotic Inversion and Cultural Authenticity in Yemeni Television Drama
by Elham Alzain and Faiz Algobaei
Genealogy 2025, 9(3), 99; https://doi.org/10.3390/genealogy9030099 - 17 Sep 2025
Viewed by 638
Abstract
This study investigates the semiotic and cultural functions of character naming in the Yemeni television series Duroob al-Marjalah (Branching Paths of Manhood) (2024–2025). It applies onomastic theory and Barthesian semiotics to examine how Yemeni screenwriters employ names as narrative and ideological tools. A purposive sample of ten central characters was selected from a Yemeni drama series for qualitative analysis. Each name was examined for linguistic structure, semantic meaning, intertextual associations, and socio-cultural alignment. Semiotic interpretation followed Barthes’ signifier–signified–myth model to decode narrative and cultural symbolism. The findings indicate that character names function as multifaceted semiotic tools, conveying heritage, while occasionally employing stylization for satire or fostering empathy through cultural resonance. However, many lack grounding in Yemeni naming conventions, creating a tension between narrative dramatization and socio-onomastic realism. The results suggest that while Yemeni screenwriters show partial awareness of naming as a cultural and narrative tool, the creative process often privileges thematic resonance over ethnographic accuracy. This research contributes to onomastic theory, Arabic media studies, and semiotic analysis by evidencing how localized naming practices—or their absence—shape identity construction, world-building, and cultural recognition in regional television drama.
11 pages, 1005 KB  
Proceeding Paper
Multimodal Fusion for Enhanced Human–Computer Interaction
by Ajay Sharma, Isha Batra, Shamneesh Sharma and Anggy Pradiftha Junfithrana
Eng. Proc. 2025, 107(1), 81; https://doi.org/10.3390/engproc2025107081 - 10 Sep 2025
Viewed by 597
Abstract
Our paper introduces a novel virtual mouse driven by gesture detection, eye tracking, and voice recognition. The system uses computer vision and machine learning to let users command and control the mouse pointer through eye motions, voice commands, or hand gestures. Its main goal is to provide an easy and engaging interface for users who want a more natural, hands-free approach to interacting with their computers, as well as for those with impairments that limit their bodily motions, such as paralysis. The system improves accessibility and usability by combining many input modalities, thereby providing a flexible answer for numerous users. While the speech recognition function permits hands-free operation via voice instructions, the eye-tracking component detects and responds to the user’s gaze, providing precise cursor control. Gesture recognition complements these features by letting users execute mouse operations with simple hand movements. This technology not only enhances the user experience for people with impairments but also marks a major development in human–computer interaction. It shows how computer vision and machine learning may be used to provide more inclusive and flexible user interfaces, improving the accessibility and efficiency of computer usage for everyone.

25 pages, 4660 KB  
Article
Dual-Stream Former: A Dual-Branch Transformer Architecture for Visual Speech Recognition
by Sanghun Jeon, Jieun Lee and Yong-Ju Lee
AI 2025, 6(9), 222; https://doi.org/10.3390/ai6090222 - 9 Sep 2025
Viewed by 1184
Abstract
This study proposes Dual-Stream Former, a novel architecture that integrates a Video Swin Transformer and a Conformer, designed to address the challenges of visual speech recognition (VSR). The model captures spatiotemporal dependencies, achieving a state-of-the-art character error rate (CER) of 3.46%, surpassing traditional convolutional neural network (CNN)-based models, such as 3D-CNN + DenseNet-121 (CER: 5.31%), and transformer-based alternatives, such as vision transformers (CER: 4.05%). The Video Swin Transformer captures multiscale spatial representations with high computational efficiency, whereas the Conformer back-end enhances temporal modeling across diverse phoneme categories. Evaluation on a high-resolution dataset comprising 740,000 utterances across 185 classes highlighted the effectiveness of the model in addressing visually confusable phonemes, such as diphthongs (/ai/, /au/) and labiodental sounds (/f/, /v/). Dual-Stream Former achieved phoneme recognition error rates of 10.39% for diphthongs and 9.25% for labiodental sounds, surpassing CNN-based architectures by more than 6%. Although the model’s large parameter count (168.6 M) poses resource challenges, its hierarchical design ensures scalability. Future work will explore lightweight adaptations and multimodal extensions to increase deployment feasibility. These findings underscore the transformative potential of Dual-Stream Former for advancing VSR applications such as silent communication and assistive technologies by achieving unparalleled precision and robustness in diverse settings.