Search Results (19)

Search Parameters:
Keywords = language trained apes

17 pages, 3726 KB  
Article
LEAD-Net: Semantic-Enhanced Anomaly Feature Learning for Substation Equipment Defect Detection
by Linghao Zhang, Junwei Kuang, Yufei Teng, Siyu Xiang, Lin Li and Yingjie Zhou
Processes 2025, 13(8), 2341; https://doi.org/10.3390/pr13082341 - 23 Jul 2025
Viewed by 568
Abstract
Substation equipment defect detection is a critical aspect of ensuring the reliability and stability of modern power grids. However, existing deep-learning-based detection methods often face significant challenges in real-world deployment, primarily due to low detection accuracy and inconsistent anomaly definitions across different substation environments. To address these limitations, this paper proposes the Language-Guided Enhanced Anomaly Power Equipment Detection Network (LEAD-Net), a novel framework that leverages text-guided learning during training to significantly improve defect detection performance. Unlike traditional methods, LEAD-Net integrates textual descriptions of defects, such as historical maintenance records or inspection reports, as auxiliary guidance during training. A key innovation is the Language-Guided Anomaly Feature Enhancement Module (LAFEM), which refines channel attention using these text features. Crucially, LEAD-Net operates solely on image data during inference, ensuring practical applicability. Experiments on a real-world substation dataset, comprising 8307 image–text pairs and encompassing a diverse range of defect categories encountered in operational substation environments, demonstrate that LEAD-Net significantly outperforms state-of-the-art object detection methods (Faster R-CNN, YOLOv9, DETR, and Deformable DETR), achieving a mean Average Precision (mAP) of 79.51%. Ablation studies confirm the contributions of both LAFEM and the training-time text guidance. The results highlight the effectiveness and novelty of using training-time defect descriptions to enhance visual anomaly detection without requiring text input at inference. Full article
(This article belongs to the Special Issue Smart Optimization Techniques for Microgrid Management)
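As a rough illustration of the text-guided channel attention idea behind a module like LAFEM, the sketch below gates a visual feature map with weights computed from a pooled visual descriptor and a text embedding. All names, shapes, and the fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): text-conditioned channel attention,
# in the spirit of a language-guided anomaly feature enhancement module.
import torch
import torch.nn as nn

class TextGuidedChannelAttention(nn.Module):
    def __init__(self, channels: int, text_dim: int, reduction: int = 16):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, channels)        # map text embedding to channel space
        self.gate = nn.Sequential(                            # SE-style bottleneck over pooled + text features
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) visual features; text_emb: (B, D) defect-description embedding
        pooled = feat.mean(dim=(2, 3))                         # global average pooling -> (B, C)
        t = self.text_proj(text_emb)                           # (B, C)
        weights = self.gate(torch.cat([pooled, t], dim=1))     # (B, C) channel weights in [0, 1]
        return feat * weights.unsqueeze(-1).unsqueeze(-1)      # reweight channels

# During image-only inference the text branch could be dropped or replaced with a
# learned constant; how LEAD-Net actually handles this is not specified here.
x, txt = torch.randn(2, 256, 32, 32), torch.randn(2, 512)
print(TextGuidedChannelAttention(256, 512)(x, txt).shape)  # torch.Size([2, 256, 32, 32])
```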

28 pages, 2518 KB  
Article
Enhancing Keyword Spotting via NLP-Based Re-Ranking: Leveraging Semantic Relevance Feedback in the Handwritten Domain
by Stergios Papazis, Angelos P. Giotis and Christophoros Nikou
Electronics 2025, 14(14), 2900; https://doi.org/10.3390/electronics14142900 - 20 Jul 2025
Viewed by 1058
Abstract
Handwritten Keyword Spotting (KWS) remains a challenging task, particularly in segmentation-free scenarios where word images must be retrieved and ranked based on their similarity to a query without relying on prior page-level segmentation. Traditional KWS methods primarily focus on visual similarity, often overlooking the underlying semantic relationships between words. In this work, we propose a novel NLP-driven re-ranking approach that refines the initial ranked lists produced by state-of-the-art KWS models. By leveraging semantic embeddings from pre-trained BERT-like Large Language Models (LLMs, e.g., RoBERTa, MPNet, and MiniLM), we introduce a relevance feedback mechanism that improves both verbatim and semantic keyword spotting. Our framework operates in two stages: (1) projecting retrieved word image transcriptions into a semantic space via LLMs and (2) re-ranking the retrieval list using a weighted combination of semantic and exact relevance scores based on pairwise similarities with the query. We evaluate our approach on the widely used George Washington (GW) and IAM collections using two cutting-edge segmentation-free KWS models, which are further integrated into our proposed pipeline. Our results show consistent gains in Mean Average Precision (mAP), with improvements of up to 2.3% (from 94.3% to 96.6%) on GW and 3% (from 79.15% to 82.12%) on IAM. Even when mAP gains are smaller, qualitative improvements emerge: semantically relevant but inexact matches are retrieved more frequently without compromising exact match recall. We further examine the effect of fine-tuning transformer-based OCR (TrOCR) models on historical GW data to align textual and visual features more effectively. Overall, our findings suggest that semantic feedback can enhance retrieval effectiveness in KWS pipelines, paving the way for lightweight hybrid vision-language approaches in handwritten document analysis. Full article
(This article belongs to the Special Issue AI Synergy: Vision, Language, and Modality)
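The re-ranking step described above amounts to fusing the original retrieval score with a semantic similarity from a sentence encoder. The sketch below illustrates that mechanism only; the model name, the weight alpha, and the function shape are assumptions, not the paper's pipeline.

```python
# Illustrative sketch (not the paper's code): re-rank a KWS retrieval list by mixing the
# original (verbatim/visual) score with a semantic score from a sentence-embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

def rerank(query: str, transcriptions: list[str], base_scores: np.ndarray, alpha: float = 0.5):
    model = SentenceTransformer("all-MiniLM-L6-v2")            # any BERT-like sentence encoder
    emb = model.encode([query] + transcriptions, normalize_embeddings=True)
    semantic = emb[1:] @ emb[0]                                # cosine similarity to the query
    fused = alpha * semantic + (1.0 - alpha) * base_scores     # weighted relevance combination
    order = np.argsort(-fused)                                 # best match first
    return order, fused[order]

order, scores = rerank("washington", ["washington", "presidency", "table"], np.array([0.9, 0.2, 0.4]))
print(order, scores)
```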

21 pages, 3826 KB  
Article
UAV-OVD: Open-Vocabulary Object Detection in UAV Imagery via Multi-Level Text-Guided Decoding
by Lijie Tao, Guoting Wei, Zhuo Wang, Zhaoshuai Qi, Ying Li and Haokui Zhang
Drones 2025, 9(7), 495; https://doi.org/10.3390/drones9070495 - 14 Jul 2025
Viewed by 1862
Abstract
Object detection in drone-captured imagery has attracted significant attention due to its wide range of real-world applications, including surveillance, disaster response, and environmental monitoring. The majority of existing methods are developed under closed-set assumptions; although some recent studies have begun to explore open-vocabulary or open-world detection, their application to UAV imagery remains limited and underexplored. In this paper, we address this limitation by exploring the relationship between images and textual semantics to extend object detection in UAV imagery to an open-vocabulary setting. We propose a novel and efficient detector named Unmanned Aerial Vehicle Open-Vocabulary Detector (UAV-OVD), specifically designed for drone-captured scenes. To facilitate open-vocabulary object detection, we propose improvements from three complementary perspectives. First, at the training level, we design a region–text contrastive loss to replace the conventional classification loss, allowing the model to align visual regions with textual descriptions beyond fixed category sets. Second, at the structural level, building on this, we introduce a multi-level text-guided fusion decoder that integrates visual features across multiple spatial scales under language guidance, thereby improving overall detection performance and enhancing the representation and perception of small objects. Finally, from the data perspective, we enrich the original dataset with synonym-augmented category labels, enabling more flexible and semantically expressive supervision. Experiments conducted on two widely used benchmark datasets demonstrate that our approach achieves significant improvements in both mAP and Recall. For instance, for Zero-Shot Detection on xView, UAV-OVD achieves 9.9 mAP and 67.3 Recall, 1.1 and 25.6 points higher than those of YOLO-World. In terms of speed, UAV-OVD achieves 53.8 FPS, nearly twice as fast as YOLO-World and five times faster than DetrReg, demonstrating its strong potential for real-time open-vocabulary detection in UAV imagery. Full article
(This article belongs to the Special Issue Applications of UVs in Digital Photogrammetry and Image Processing)
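A region–text contrastive loss of the kind mentioned above can be written as a softmax over category-name embeddings for each region. The sketch below is an assumption-based illustration of that general objective, not the authors' implementation.

```python
# Minimal sketch: region embeddings are pulled toward the embedding of their ground-truth
# category name and pushed away from the other category texts. Shapes are placeholders.
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_emb, text_emb, labels, temperature=0.07):
    # region_emb: (N, D) embeddings of assigned regions
    # text_emb:   (K, D) embeddings of the K category names (possibly synonym-augmented)
    # labels:     (N,) index of the matching category for each region
    region_emb = F.normalize(region_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = region_emb @ text_emb.t() / temperature   # (N, K) similarity logits
    return F.cross_entropy(logits, labels)             # softmax over categories per region

loss = region_text_contrastive_loss(torch.randn(8, 256), torch.randn(20, 256), torch.randint(0, 20, (8,)))
print(loss.item())
```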

25 pages, 1669 KB  
Article
Zero-Shot Infrared Domain Adaptation for Pedestrian Re-Identification via Deep Learning
by Xu Zhang, Yinghui Liu, Liangchen Guo and Huadong Sun
Electronics 2025, 14(14), 2784; https://doi.org/10.3390/electronics14142784 - 10 Jul 2025
Viewed by 749
Abstract
In computer vision, the performance of detectors trained under optimal lighting conditions is significantly impaired when applied to infrared domains due to the scarcity of labeled infrared target domain data and the inherent degradation in infrared image quality. Progress in cross-domain pedestrian re-identification is hindered by the lack of labeled infrared image data. To address the degradation of pedestrian recognition in infrared environments, we propose a framework for zero-shot infrared domain adaptation. This integrated approach is designed to mitigate the challenges of pedestrian recognition in infrared domains while enabling zero-shot domain adaptation. Specifically, an advanced reflectance representation learning module and an exchange–re-decomposition–coherence process are employed to learn illumination invariance and to enhance the model's effectiveness, respectively. Additionally, the CLIP (Contrastive Language–Image Pretraining) image encoder and DINO (Distillation with No Labels) are fused for feature extraction, improving model performance under infrared conditions and enhancing its generalization capability. To further improve model performance, we introduce the Non-Local Attention (NLA) module, the Instance-based Weighted Part Attention (IWPA) module, and the Multi-head Self-Attention module. The NLA module captures global feature dependencies, particularly long-range feature relationships, effectively mitigating issues such as blurred or missing image information in feature degradation scenarios. The IWPA module focuses on localized regions to enhance model accuracy in complex backgrounds and unevenly lit scenes. Meanwhile, the Multi-head Self-Attention module captures long-range dependencies between cross-modal features, further strengthening environmental understanding and scene modeling. The key innovation of this work lies in the skillful combination and application of existing technologies to new domains, overcoming the challenges posed by vision in infrared environments. Experimental results on the SYSU-MM01 dataset show that, under the single-shot setting, Rank-1 Accuracy (Rank-1) and mean Average Precision (mAP) values of 37.97% and 37.25%, respectively, were achieved, while in the multi-shot setting, values of 34.96% and 34.14% were attained. Full article
(This article belongs to the Special Issue Deep Learning in Image Processing and Computer Vision)
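The Non-Local Attention idea referenced above is built on the standard non-local (self-attention) block, sketched below with illustrative layer sizes; this is not the paper's exact configuration.

```python
# Sketch of a standard non-local block: every spatial position attends to every other,
# capturing long-range dependencies in the feature map.
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)   # query projection
        self.phi = nn.Conv2d(channels, inter, 1)     # key projection
        self.g = nn.Conv2d(channels, inter, 1)       # value projection
        self.out = nn.Conv2d(inter, channels, 1)     # restore channel count

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)           # (B, HW, C')
        k = self.phi(x).flatten(2)                             # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)               # (B, HW, C')
        attn = torch.softmax(q @ k / (k.shape[1] ** 0.5), -1)  # (B, HW, HW) long-range dependencies
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)    # (B, C', H, W)
        return x + self.out(y)                                 # residual connection

print(NonLocalBlock(64)(torch.randn(2, 64, 16, 16)).shape)  # torch.Size([2, 64, 16, 16])
```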

32 pages, 10515 KB  
Article
E-CLIP: An Enhanced CLIP-Based Visual Language Model for Fruit Detection and Recognition
by Yi Zhang, Yang Shao, Chen Tang, Zhenqing Liu, Zhengda Li, Ruifang Zhai, Hui Peng and Peng Song
Agriculture 2025, 15(11), 1173; https://doi.org/10.3390/agriculture15111173 - 29 May 2025
Cited by 1 | Viewed by 1204
Abstract
With the progress of agricultural modernization, intelligent fruit harvesting is gaining importance. While fruit detection and recognition are essential for robotic harvesting, existing methods suffer from limited generalizability, including adapting to complex environments and handling new fruit varieties. This problem stems from their reliance on unimodal visual data, which creates a semantic gap between image features and contextual understanding. To solve these issues, this study proposes a multi-modal fruit detection and recognition framework based on visual language models (VLMs). By integrating multi-modal information, the proposed model enhances robustness and generalization across diverse environmental conditions and fruit types. The framework accepts natural language instructions as input, facilitating effective human–machine interaction. Through its core module, Enhanced Contrastive Language–Image Pre-Training (E-CLIP), which employs image–image and image–text contrastive learning mechanisms, the framework achieves robust recognition of various fruit types and their maturity levels. Experimental results demonstrate the excellent performance of the model, achieving an F1 score of 0.752, and an mAP@0.5 of 0.791. The model also exhibits robustness under occlusion and varying illumination conditions, attaining a zero-shot mAP@0.5 of 0.626 for unseen fruits. In addition, the system operates at an inference speed of 54.82 FPS, effectively balancing speed and accuracy, and shows practical potential for smart agriculture. This research provides new insights and methods for the practical application of smart agriculture. Full article
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)
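The underlying mechanism E-CLIP builds on is CLIP-style image–text matching, which already supports zero-shot recognition of fruit types and maturity levels from text prompts. The sketch below shows only that generic mechanism with an off-the-shelf model; the model name, prompts, and image path are assumptions, and E-CLIP itself adds image–image contrast and fine-tuning not shown here.

```python
# Hedged sketch: zero-shot fruit/maturity recognition by image-text similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a ripe tomato", "a photo of an unripe tomato", "a photo of a ripe strawberry"]
image = Image.open("orchard_crop.jpg")  # placeholder path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)  # similarity over the prompt set
print(dict(zip(prompts, probs[0].tolist())))
```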

26 pages, 3709 KB  
Article
Evaluation of Prompt Engineering on the Performance of a Large Language Model in Document Information Extraction
by Lun-Chi Chen, Hsin-Tzu Weng, Mayuresh Sunil Pardeshi, Chien-Ming Chen, Ruey-Kai Sheu and Kai-Chih Pai
Electronics 2025, 14(11), 2145; https://doi.org/10.3390/electronics14112145 - 24 May 2025
Cited by 1 | Viewed by 7446
Abstract
The accelerated digitization of documentation, including paper invoices and receipts, has heightened the need for precise and expeditious information management. Manually capturing this data has become infeasible because the process is laborious and time-consuming. This paper proposes a low-training-cost, prompt-based applied key information extraction (applied KIE) pipeline that combines Amazon Textract with Automatic Prompt Engineer (APE) on top of large language models (LLMs). A series of experiments were conducted to evaluate the proposed approach, with the results indicating an average precision of 95.5% and a document information extraction accuracy of 91.5% on SROIE (a widely used English dataset), and an average precision of 97.15% and a document information extraction accuracy of 85.29% on an invoice dataset from a Taiwanese shipping company. Full article
(This article belongs to the Special Issue Techniques and Applications in Prompt Engineering and Generative AI)
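The core KIE step can be pictured as: OCR text goes into a field-extraction prompt and the LLM reply is parsed as structured output. The sketch below is illustrative only; `llm_complete` is a placeholder for any completion endpoint, the field names follow the SROIE convention, and the prompt wording is an assumption rather than the APE-optimized prompt.

```python
# Illustrative prompt-based key information extraction with a stubbed LLM call.
import json

FIELDS = ["company", "date", "address", "total"]  # SROIE-style target keys

def build_prompt(ocr_text: str) -> str:
    return (
        "Extract the following fields from the receipt text and answer with JSON only.\n"
        f"Fields: {', '.join(FIELDS)}\n\n"
        f"Receipt text:\n{ocr_text}\n"
    )

def extract_fields(ocr_text: str, llm_complete) -> dict:
    reply = llm_complete(build_prompt(ocr_text))      # e.g. a call to an LLM API
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}
    return {k: data.get(k, "") for k in FIELDS}       # keep only the expected keys

# Usage with a dummy LLM stub:
stub = lambda p: '{"company": "ACME MART", "date": "2024-05-01", "address": "12 Main St", "total": "19.90"}'
print(extract_fields("ACME MART 12 Main St ... TOTAL 19.90", stub))
```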

22 pages, 7640 KB  
Article
Bilingual Sign Language Recognition: A YOLOv11-Based Model for Bangla and English Alphabets
by Nawshin Navin, Fahmid Al Farid, Raiyen Z. Rakin, Sadman S. Tanzim, Mashrur Rahman, Shakila Rahman, Jia Uddin and Hezerul Abdul Karim
J. Imaging 2025, 11(5), 134; https://doi.org/10.3390/jimaging11050134 - 27 Apr 2025
Cited by 4 | Viewed by 3138
Abstract
Communication through sign language effectively helps both hearing- and speaking-impaired individuals connect. However, there are problems with interlingual communication between Bangla Sign Language (BdSL) and American Sign Language (ASL) due to the absence of a unified system. This study introduces a detection system that incorporates both sign languages to improve communication for their users. We developed and tested a deep learning-based sign-language detection system that can recognize BdSL and ASL alphabets concurrently in real time. The approach uses a YOLOv11 object detection architecture trained on an open-source dataset of 9556 images containing 64 different letter signs from both languages. Data preprocessing was applied to enhance the performance of the model. Evaluation criteria, including precision, recall, and mAP, were also computed to evaluate the model. The performance analysis of the proposed method shows a precision of 99.12% and an average recall of 99.63% after 30 epochs. The studies show that the proposed model outperforms current techniques in sign language recognition (SLR) and can be used in assistive communication technologies and human–computer interaction systems. Full article
(This article belongs to the Section Computer Vision and Pattern Recognition)
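For context, training and evaluating a YOLOv11-class detector on such a dataset typically looks like the sketch below, assuming the Ultralytics YOLO interface; the dataset YAML path, checkpoint name, and epoch count are placeholders, not the authors' setup.

```python
# Hedged sketch: train a detector on a combined BdSL + ASL alphabet dataset and
# read back precision/recall-style metrics (mAP@0.5 and mAP@0.5:0.95).
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                                      # small YOLO11 checkpoint
model.train(data="bdsl_asl_64cls.yaml", epochs=30, imgsz=640)   # 64 letter classes, per the abstract
metrics = model.val()                                           # evaluation on the validation split
print(metrics.box.map50, metrics.box.map)
```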

15 pages, 4959 KB  
Article
Image–Text Person Re-Identification with Transformer-Based Modal Fusion
by Xin Li, Hubo Guo, Meiling Zhang and Bo Fu
Electronics 2025, 14(3), 525; https://doi.org/10.3390/electronics14030525 - 28 Jan 2025
Cited by 2 | Viewed by 2951
Abstract
Existing person re-identification methods utilizing CLIP (Contrastive Language-Image Pre-training) mostly suffer from coarse-grained alignment issues. This is primarily due to the original design intention of the CLIP model, which aims at broad and global alignment between images and texts to support a wide range of image–text matching tasks. However, in the specific domain of person re-identification, local features and fine-grained information are equally important in addition to global features. This paper proposes an innovative modal fusion approach, aiming to precisely locate the most prominent pedestrian information in images by combining visual features extracted by the ResNet-50 model with text representations generated by a text encoder. This method leverages the cross-attention mechanism of the Transformer Decoder to enable text features to dynamically guide visual features, enhancing the ability to identify and locate the target pedestrian. Experiments conducted on four public datasets, namely MSMT17, Market1501, DukeMTMC, and Occluded-Duke, demonstrate that our method outperforms the baseline network by 5.4%, 2.7%, 2.6%, and 9.2% in mAP, and by 4.3%, 1.7%, 2.7%, and 11.8% in Rank-1, respectively. This method exhibits excellent performance and provides new research insights for the task of person re-identification. Full article
(This article belongs to the Special Issue Deep Learning-Based Image Restoration and Object Identification)
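The modal-fusion step described above can be sketched as a Transformer decoder layer in which visual tokens attend to text tokens, so the description guides which image regions are emphasized. The configuration below (token counts, dimensions, pooling) is an assumption for illustration, not the paper's code.

```python
# Minimal sketch: cross-attention fusion of ResNet-style visual tokens with text tokens.
import torch
import torch.nn as nn

d_model = 256
decoder = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)

visual_tokens = torch.randn(4, 49, d_model)   # e.g. a 7x7 ResNet-50 feature map projected to d_model
text_tokens = torch.randn(4, 16, d_model)     # encoded textual description of the pedestrian

fused = decoder(tgt=visual_tokens, memory=text_tokens)  # text features guide visual features
pedestrian_embedding = fused.mean(dim=1)                # pooled representation for matching / ranking
print(pedestrian_embedding.shape)  # torch.Size([4, 256])
```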

17 pages, 1000 KB  
Article
Zero-Shot Day–Night Domain Adaptation for Face Detection Based on DAl-CLIP-Dino
by Huadong Sun, Yinghui Liu, Ziyang Chen and Pengyi Zhang
Electronics 2025, 14(1), 143; https://doi.org/10.3390/electronics14010143 - 1 Jan 2025
Viewed by 1809
Abstract
Two challenges in computer vision (CV) related to face detection are the difficulty of data acquisition in the target domain and the degradation of image quality. Especially in low-light situations, images with poor visibility are difficult to label, so detectors trained under well-lit conditions exhibit reduced performance in low-light environments. Conventional image enhancement and object detection techniques are unable to resolve the inherent difficulties in collecting and labeling low-light images. The Dark-Illuminated Network with Contrastive Language–Image Pretraining (CLIP) and Self-Supervised Vision Transformer (Dino), abbreviated as DAl-CLIP-Dino, is proposed to address the degradation of object detection performance in low-light environments and achieve zero-shot day–night domain adaptation. Specifically, an advanced reflectance representation learning module (which leverages Retinex decomposition to extract reflectance and illumination features from both low-light and well-lit images) and an interchange–redecomposition coherence process (which performs a second decomposition on the reconstructed images after the exchange to generate a second round of reflectance and illumination predictions while validating their consistency using a redecomposition consistency loss) are employed to achieve illumination invariance and enhance model performance. The ViT-based image encoder of CLIP and Dino are integrated for feature extraction, improving performance under extreme lighting conditions and enhancing generalization capability. Our model achieves a mean average precision (mAP) of 29.6% for face detection on the DARK FACE dataset, outperforming other models in zero-shot domain adaptation for face detection. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
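The redecomposition-consistency idea described above boils down to requiring that the second round of reflectance and illumination predictions agree with the first. The sketch below illustrates one plausible form of that objective; the L1 distance, the equal weighting, and the tensor shapes are assumptions, and the decomposition network itself is not shown.

```python
# Sketch of a redecomposition-consistency objective for a low-light / well-lit image pair.
import torch
import torch.nn.functional as F

def redecomposition_consistency_loss(R1_low, L1_low, R2_low, L2_low,
                                     R1_well, L1_well, R2_well, L2_well):
    # R*: reflectance maps, L*: illumination maps; "1" = first decomposition,
    # "2" = decomposition of the exchanged-and-reconstructed images.
    refl = F.l1_loss(R2_low, R1_low) + F.l1_loss(R2_well, R1_well)
    illum = F.l1_loss(L2_low, L1_low) + F.l1_loss(L2_well, L1_well)
    return refl + illum

loss = redecomposition_consistency_loss(*[torch.rand(2, 3, 64, 64) for _ in range(8)])
print(loss.item())
```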

43 pages, 973 KB  
Review
Interventions for School-Aged Children with Auditory Processing Disorder: A Scoping Review
by Jacynthe Bigras, Josée Lagacé, Ahmed El Mawazini and Héloïse Lessard-Dostie
Healthcare 2024, 12(12), 1161; https://doi.org/10.3390/healthcare12121161 - 7 Jun 2024
Cited by 3 | Viewed by 7513
Abstract
(1) Background: Auditory processing (AP) disorder is associated with learning difficulties and poses challenges to school-aged children in their daily activities. This scoping review identifies interventions and provides audiologists with protocol insights and outcome measures. (2) Methods: A systematic search of both peer-reviewed and grey literature (January 2006 to August 2023) covered ten databases. Studies included had the following characteristics: (i) published in French or English; (ii) participants were school-aged, and had a normal audiogram, AP difficulties or disorder, and no cognitive, developmental, congenital or neurological disorder (with the exception of learning, attention, and language disabilities); (iii) were intervention studies or systematic reviews. (3) Results: Forty-two studies were included, and they predominantly featured auditory training (AT), addressing spatial processing, dichotic listening, temporal processing and listening to speech in noise. Some interventions included cognitive or language training, assistive devices or hearing aids. Outcome measures listed included electrophysiological, AP, cognitive and language measures and questionnaires addressed to parents, teachers or the participants. (4) Conclusions: Most interventions focused on bottom-up approaches, particularly AT. A limited number of top-down approaches were observed. The compiled tools underscore the need for research on metric responsiveness and point to the inadequate consideration given to understanding how children perceive change. Full article
(This article belongs to the Special Issue Auditory Processing Disorder: A Forgotten Hearing Impairment)

14 pages, 2944 KB  
Article
Animal Pose Estimation Based on Contrastive Learning with Dynamic Conditional Prompts
by Xiaoling Hu and Chang Liu
Animals 2024, 14(12), 1712; https://doi.org/10.3390/ani14121712 - 7 Jun 2024
Cited by 3 | Viewed by 2419
Abstract
Traditional animal pose estimation techniques based on images face significant hurdles, including scarce training data, costly data annotation, and challenges posed by non-rigid deformation. Addressing these issues, we proposed dynamic conditional prompts for the prior knowledge of animal poses in language modalities. Then, we utilized a multimodal (language–image) collaborative training and contrastive learning model to estimate animal poses. Our method leverages text prompt templates and image feature conditional tokens to construct dynamic conditional prompts that integrate rich linguistic prior knowledge in depth. The text prompts highlight key points and relevant descriptions of animal poses, enhancing their representation in the learning process. Meanwhile, transformed via a fully connected non-linear network, image feature conditional tokens efficiently embed the image features into these prompts. The resultant context vector, derived from the fusion of the text prompt template and the image feature conditional token, generates a dynamic conditional prompt for each input sample. By utilizing a contrastive language–image pre-training model, our approach effectively synchronizes and strengthens the training interactions between image and text features, resulting in an improvement to the precision of key-point localization and overall animal pose estimation accuracy. The experimental results show that language–image contrastive learning based on dynamic conditional prompts enhances the average accuracy of animal pose estimation on the AP-10K and Animal Pose datasets. Full article
(This article belongs to the Section Animal System and Management)
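The dynamic conditional prompt described above combines a learnable text prompt template with an image-feature conditional token so that each sample gets its own context vector. The sketch below illustrates that construction with assumed names and sizes; it is not the paper's implementation.

```python
# Illustrative sketch: shared prompt context tokens shifted by an image-conditioned token.
import torch
import torch.nn as nn

class DynamicPrompt(nn.Module):
    def __init__(self, n_ctx: int = 8, ctx_dim: int = 512, img_dim: int = 768):
        super().__init__()
        self.context = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)  # prompt template tokens
        self.meta_net = nn.Sequential(                                   # image-feature conditional token
            nn.Linear(img_dim, ctx_dim // 4), nn.ReLU(inplace=True),
            nn.Linear(ctx_dim // 4, ctx_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (B, img_dim) -> per-sample prompt tokens (B, n_ctx, ctx_dim)
        bias = self.meta_net(image_features).unsqueeze(1)   # (B, 1, ctx_dim) conditional token
        return self.context.unsqueeze(0) + bias             # each sample gets its own prompt context

prompts = DynamicPrompt()(torch.randn(4, 768))
print(prompts.shape)  # torch.Size([4, 8, 512])
```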

16 pages, 941 KB  
Article
Continual Pre-Training of Language Models for Concept Prerequisite Learning with Graph Neural Networks
by Xin Tang, Kunjia Liu, Hao Xu, Weidong Xiao and Zhen Tan
Mathematics 2023, 11(12), 2780; https://doi.org/10.3390/math11122780 - 20 Jun 2023
Cited by 5 | Viewed by 3170
Abstract
Prerequisite chains are crucial to acquiring new knowledge efficiently. Many studies have been devoted to automatically identifying the prerequisite relationships between concepts from educational data. Though effective to some extent, these methods have neglected two key factors: most works have failed to utilize domain-related knowledge to enhance pre-trained language models, thus making the textual representation of concepts less effective; they also ignore the fusion of semantic information and structural information formed by existing prerequisites. We propose a two-stage concept prerequisite learning model (TCPL), to integrate the above factors. In the first stage, we designed two continual pre-training tasks for domain-adaptive and task-specific enhancement, to obtain better textual representation. In the second stage, to leverage the complementary effects of the semantic and structural information, we optimized the encoder of the resource–concept graph and the pre-trained language model simultaneously, with hinge loss as an auxiliary training objective. Extensive experiments conducted on three public datasets demonstrated the effectiveness of the proposed approach. Our proposed model improved by 7.9%, 6.7%, 5.6%, and 8.4% on ACC, F1, AP, and AUC on average, compared to the state-of-the-art methods. Full article
(This article belongs to the Topic Data Science and Knowledge Discovery)
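The auxiliary hinge loss mentioned above can be written as a pairwise margin objective: a true prerequisite pair should score higher than a corrupted pair by some margin. The scoring function and margin value below are assumptions for illustration, not TCPL's exact formulation.

```python
# Sketch of a pairwise hinge (margin) loss for prerequisite prediction.
import torch
import torch.nn.functional as F

def prerequisite_hinge_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor, margin: float = 1.0):
    # pos_scores / neg_scores: (N,) scores for true and corrupted (concept_a, concept_b) pairs
    return F.relu(margin - pos_scores + neg_scores).mean()

loss = prerequisite_hinge_loss(torch.tensor([2.1, 1.3, 0.7]), torch.tensor([0.4, 1.0, 0.9]))
print(loss.item())
```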

16 pages, 7026 KB  
Article
Borno-Net: A Real-Time Bengali Sign-Character Detection and Sentence Generation System Using Quantized Yolov4-Tiny and LSTMs
by Nasima Begum, Rashik Rahman, Nusrat Jahan, Saqib Sizan Khan, Tanjina Helaly, Ashraful Haque and Nipa Khatun
Appl. Sci. 2023, 13(9), 5219; https://doi.org/10.3390/app13095219 - 22 Apr 2023
Cited by 9 | Viewed by 3695
Abstract
Sign language is the most commonly used form of communication for persons with disabilities who have hearing or speech difficulties. However, persons without hearing impairment often cannot understand these signs. As a consequence, persons with disabilities experience difficulties while expressing their emotions or needs. Thus, a sign character detection and text generation system is necessary to mitigate this issue. In this paper, we propose an end-to-end system that can detect Bengali sign characters from input images or video frames and generate meaningful sentences. The proposed system consists of two phases. In the first phase, a quantization technique for the YoloV4-Tiny detection model is proposed for detecting 49 different sign characters, including 36 Bengali alphabet characters, 10 numeric characters, and 3 special characters. Here, the detection model localizes hand signs and predicts the corresponding character. The second phase generates text from the characters predicted by the detection model. A Long Short-Term Memory (LSTM) model is utilized to generate meaningful text from the character signs detected in the previous phase. To train the proposed system, the BdSL 49 dataset is used, which has approximately 14,745 images of 49 different classes. The proposed quantized YoloV4-Tiny model achieves a mAP of 99.7%, and the proposed language model achieves an overall accuracy of 99.12%. In addition, a performance comparison among the YoloV4, YoloV4-Tiny, and YoloV7 models is provided in this research. Full article
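The second phase can be pictured as a character-level LSTM that turns the detector's character sequence into text by predicting the next symbol. The sketch below uses placeholder vocabulary and layer sizes rather than the paper's configuration.

```python
# Minimal sketch: LSTM over detected sign-character indices, predicting next-character logits.
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size: int = 49, embed: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)       # next-character logits

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (B, T) indices of detected sign characters
        h, _ = self.lstm(self.embed(char_ids))
        return self.head(h)                             # (B, T, vocab_size)

detected = torch.randint(0, 49, (1, 12))                # e.g. 12 characters from the detector
print(CharLSTM()(detected).shape)  # torch.Size([1, 12, 49])
```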

10 pages, 551 KB  
Article
MSRN and Multi-Headed Attention Mechanism for Language Identification
by Ailing Zeng, Mijit Ablimit and Askar Hamdulla
Information 2023, 14(1), 17; https://doi.org/10.3390/info14010017 - 28 Dec 2022
Viewed by 2266
Abstract
With the popularity of the mobile internet, people all over the world can easily create and publish diverse media content such as multilingual and multi-dialectal audio and video. Language or dialect identification (LID) is therefore increasingly important for practical applications such as multilingual and cross-lingual processing, serving as the front end of subsequent tasks such as speech recognition and voice identification. This paper proposes a neural network framework based on a multiscale residual network (MSRN) and multi-headed self-attention (MHSA). The model uses the MSRN to extract language spectrogram features and uses MHSA to filter useful features and suppress irrelevant ones. Training and test sets are constructed from both the "Common Voice" and "Oriental Language Recognition" (AP17-OLR) datasets. The experimental results show that this model effectively improves the accuracy and robustness of LID compared to other methods. Full article
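The attention stage can be sketched as multi-head self-attention over frame-level features (as an MSRN-style front end would produce), followed by pooling and a language classifier. Dimensions and the number of languages below are placeholders, not the paper's values.

```python
# Sketch: multi-head self-attention over spectrogram frames for language identification.
import torch
import torch.nn as nn

class AttentivePoolingLID(nn.Module):
    def __init__(self, feat_dim: int = 256, n_heads: int = 4, n_langs: int = 10):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(feat_dim, n_langs)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, feat_dim) frame-level features from the CNN front end
        attended, _ = self.mhsa(frames, frames, frames)   # emphasize useful frames, suppress the rest
        return self.classifier(attended.mean(dim=1))      # utterance-level language logits

print(AttentivePoolingLID()(torch.randn(2, 300, 256)).shape)  # torch.Size([2, 10])
```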

18 pages, 404 KB  
Article
Predicting Academic Performance: Analysis of Students’ Mental Health Condition from Social Media Interactions
by Md. Saddam Hossain Mukta, Salekul Islam, Swakkhar Shatabda, Mohammed Eunus Ali and Akib Zaman
Behav. Sci. 2022, 12(4), 87; https://doi.org/10.3390/bs12040087 - 23 Mar 2022
Cited by 14 | Viewed by 13263
Abstract
Social media have become an indispensable part of people's daily lives. Research suggests that interactions on social media partly exhibit individuals' personality, sentiment, and behavior. In this study, we examine the association between students' mental health and psychological attributes derived from social media interactions and academic performance. We build a classification model where students' psychological attributes and mental health issues will be predicted from their social media interactions. Then, students' academic performance will be identified from the psychological attributes and mental health issues predicted in the previous step. Firstly, we select samples using a judgmental sampling technique and collect the textual content from students' Facebook news feeds. Then, we derive feature vectors using MPNet (Masked and Permuted Pre-training for Language Understanding), which is one of the latest pre-trained sentence transformer models. Secondly, we find two different levels of correlations: (i) users' social media usage and their psychological attributes and mental health status and (ii) users' psychological attributes and mental health status and their academic performance. Thirdly, we build a two-level hybrid model to predict academic performance (i.e., Grade Point Average (GPA)) from students' Facebook posts: (1) from Facebook posts to mental health and psychological attributes using a regression model (SM-MP model) and (2) from psychological and mental attributes to academic performance using a classifier model (MP-AP model). Later, we conduct an evaluation study using real-life samples to validate the performance of the model and compare it with baseline models (i.e., Linguistic Inquiry and Word Count (LIWC) and Empath). Our model shows strong performance with a micro-average F-score of 0.94 and an AUC-ROC score of 0.95. Finally, we build an ensemble model by combining both the psychological attribute and mental health models and find that our combined model outperforms the independent models. Full article
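The two-level pipeline described above can be sketched as MPNet sentence embeddings feeding a regression stage (SM-MP) whose outputs feed a classifier stage (MP-AP). The data below is synthetic and the estimators are illustrative stand-ins, not the paper's exact models.

```python
# Hedged sketch of the two-stage (posts -> attributes -> GPA band) pipeline with toy data.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.multioutput import MultiOutputRegressor

posts = ["Feeling exhausted before finals.", "Great study session with friends today!"]
encoder = SentenceTransformer("all-mpnet-base-v2")           # MPNet-based sentence transformer
X = encoder.encode(posts)                                    # (n_samples, 768) feature vectors

# Stage 1 (SM-MP): posts -> psychological / mental-health attribute scores (synthetic targets)
y_attrs = np.array([[0.8, 0.3, 0.6], [0.2, 0.7, 0.4]])
sm_mp = MultiOutputRegressor(Ridge()).fit(X, y_attrs)

# Stage 2 (MP-AP): attribute scores -> GPA class (synthetic labels)
y_gpa = np.array([0, 1])
mp_ap = LogisticRegression().fit(sm_mp.predict(X), y_gpa)

print(mp_ap.predict(sm_mp.predict(encoder.encode(["Too anxious to focus lately."]))))
```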