Search Results (293)

Search Parameters:
Keywords = hierarchical semantic feature

31 pages, 5190 KB  
Article
MDF-YOLO: A Hölder-Based Regularity-Guided Multi-Domain Fusion Detection Model for Indoor Objects
by Fengkai Luan, Jiaxing Yang and Hu Zhang
Fractal Fract. 2025, 9(10), 673; https://doi.org/10.3390/fractalfract9100673 - 18 Oct 2025
Abstract
With the rise of embodied agents and indoor service robots, object detection has become a critical component supporting semantic mapping, path planning, and human–robot interaction. However, indoor scenes often present challenges such as severe occlusion, large-scale variations, small and densely packed objects, and complex textures, which cause existing methods to struggle with both robustness and accuracy. This paper proposes MDF-YOLO, a multi-domain fusion detection framework based on Hölder regularity guidance. In the backbone, neck, and feature recovery stages, the framework introduces the CrossGrid Memory Block (CGMB), the Hölder-Based Regularity Guidance–Hierarchical Context Aggregation (HG-HCA) module, and the Frequency-Guided Residual Block (FGRB), achieving complementary feature modeling across the state space, spatial domain, and frequency domain. In particular, the HG-HCA module uses the Hölder regularity map as a guiding signal to dynamically balance the macro and micro paths, achieving adaptive coordination between global consistency and local discriminability. Experimental results show that MDF-YOLO significantly outperforms mainstream detectors in metrics such as mAP@0.5, mAP@0.75, and mAP@0.5:0.95, achieving values of 0.7158, 0.6117, and 0.5814, respectively, while maintaining near real-time inference efficiency in terms of FPS and latency. Ablation studies further validate the independent and synergistic contributions of CGMB, HG-HCA, and FGRB in improving small-object detection, occlusion handling, and cross-scale robustness. This study demonstrates the potential of Hölder regularity and multi-domain fusion modeling in object detection, offering new insights for efficient visual modeling in complex indoor environments.
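
To make the regularity-guided macro/micro balancing described above more concrete, here is a minimal PyTorch sketch of how a per-pixel regularity map could gate two feature paths; the class name, shapes, and gating form are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming the regularity map is a single-channel tensor in [0, 1].
import torch
import torch.nn as nn

class RegularityGuidedGate(nn.Module):
    """Blend a global ('macro') and a local ('micro') feature path using a
    per-pixel guidance map, loosely following the Hölder-regularity guidance idea."""
    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv turns the single-channel regularity map into per-channel gates in [0, 1].
        self.to_gate = nn.Sequential(nn.Conv2d(1, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, macro: torch.Tensor, micro: torch.Tensor, regularity: torch.Tensor) -> torch.Tensor:
        # macro, micro: (B, C, H, W); regularity: (B, 1, H, W)
        g = self.to_gate(regularity)
        # Smooth (regular) regions lean on the macro path, irregular ones on the micro path.
        return g * macro + (1.0 - g) * micro

# usage
x_macro = torch.randn(2, 64, 32, 32)
x_micro = torch.randn(2, 64, 32, 32)
reg_map = torch.rand(2, 1, 32, 32)
print(RegularityGuidedGate(64)(x_macro, x_micro, reg_map).shape)  # torch.Size([2, 64, 32, 32])
```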

42 pages, 104137 KB  
Article
A Hierarchical Absolute Visual Localization System for Low-Altitude Drones in GNSS-Denied Environments
by Qing Zhou, Haochen Tang, Zhaoxiang Zhang, Yuelei Xu, Feng Xiao and Yulong Jia
Remote Sens. 2025, 17(20), 3470; https://doi.org/10.3390/rs17203470 - 17 Oct 2025
Abstract
Current drone navigation systems primarily rely on Global Navigation Satellite Systems (GNSSs), but their signals are susceptible to interference, spoofing, or suppression in complex environments, leading to degraded positioning performance or even failure. To enhance the positioning accuracy and robustness of low-altitude drones in satellite-denied environments, this paper investigates an absolute visual localization solution. This method achieves precise localization by matching real-time images with reference images that carry absolute position information. To address the insufficient feature generalization caused by the complex and variable nature of ground scenes, an image retrieval algorithm is proposed that fuses shallow spatial features with deep semantic features and applies generalized average pooling to enhance feature representation. To tackle the registration errors caused by differences in perspective and scale between images, an image registration algorithm based on cyclic consistency matching is designed, incorporating a reprojection error loss function, a multi-scale feature fusion mechanism, and a structural reparameterization strategy to improve matching accuracy and inference efficiency. Based on the above methods, a hierarchical absolute visual localization system is constructed, achieving coarse localization through image retrieval and fine localization through image registration, while also integrating IMU prior correction and a sliding-window update strategy to mitigate the effects of scale and rotation differences. The system is implemented on the ROS platform and experimentally validated in a real-world environment. The results show that the localization success rates for the h, s, v, and w trajectories are 95.02%, 64.50%, 64.84%, and 91.09%, respectively. Compared to similar algorithms, it demonstrates higher accuracy and better adaptability to complex scenarios. These results indicate that the proposed technology can achieve high-precision and robust absolute visual localization without the need for initial conditions, highlighting its potential for application in GNSS-denied environments.
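
The "generalized average pooling" used for the retrieval descriptor is commonly realized as generalized-mean (GeM) pooling; the sketch below shows one standard formulation, with the exponent p and epsilon chosen for illustration rather than taken from the paper.

```python
# Minimal sketch of GeM pooling; p -> 1 recovers average pooling, large p approaches max pooling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)  # learnable pooling exponent
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> (B, C) global descriptor
        x = x.clamp(min=self.eps).pow(self.p)
        return F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p).flatten(1)

feat = torch.randn(4, 512, 16, 16)
desc = F.normalize(GeM()(feat), dim=1)  # L2-normalized descriptor for image retrieval
print(desc.shape)  # torch.Size([4, 512])
```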

17 pages, 2475 KB  
Article
YOLO-LMTB: A Lightweight Detection Model for Multi-Scale Tea Buds in Agriculture
by Guofeng Xia, Yanchuan Guo, Qihang Wei, Yiwen Cen, Loujing Feng and Yang Yu
Sensors 2025, 25(20), 6400; https://doi.org/10.3390/s25206400 - 16 Oct 2025
Abstract
Tea bud targets are typically located in complex environments characterized by multi-scale variations, high density, and strong color resemblance to the background, which pose significant challenges for rapid and accurate detection. To address these issues, this study presents YOLO-LMTB, a lightweight multi-scale detection model based on the YOLOv11n architecture. First, a Multi-scale Edge-Refinement Context Aggregator (MERCA) module is proposed to replace the original C3k2 block in the backbone. MERCA captures multi-scale contextual features through hierarchical receptive field collaboration and refines edge details, thereby significantly improving the perception of fine structures in tea buds. Furthermore, a Dynamic Hyperbolic Token Statistics Transformer (DHTST) module is developed to replace the original PSA block. This module dynamically adjusts feature responses and statistical measures through attention weighting with learnable threshold parameters, effectively enhancing discriminative features while suppressing background interference. Additionally, a Bidirectional Feature Pyramid Network (BiFPN) is introduced to replace the original network structure, enabling the adaptive fusion of semantically rich and spatially precise features via bidirectional cross-scale connections while reducing computational complexity. On the self-built tea bud dataset, experimental results demonstrate that, compared to the original model, YOLO-LMTB achieves a 2.9% improvement in precision (P), along with increases of 1.6% and 2.0% in mAP50 and mAP50-95, respectively. Simultaneously, the number of parameters is reduced by 28.3% and the model size by 22.6%. To further validate the effectiveness of the improvement scheme, experiments were also conducted on public datasets. The results demonstrate that each enhancement module boosts the model’s detection performance and exhibits strong generalization capabilities. The model not only excels in multi-scale tea bud detection but also offers a valuable reference for reducing computational complexity, thereby providing a technical foundation for the practical application of intelligent tea-picking systems.
(This article belongs to the Section Smart Agriculture)
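
The BiFPN neck mentioned above rests on a simple building block: a learnable, fast-normalized weighted sum of feature maps. The sketch below shows that fusion node under assumed channel counts and a two-input case; it is an illustration of the general BiFPN idea, not YOLO-LMTB's code.

```python
# Minimal sketch of a fast-normalized weighted fusion node in the spirit of BiFPN.
import torch
import torch.nn as nn

class WeightedFuse(nn.Module):
    def __init__(self, channels: int, n_inputs: int = 2, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # one learnable weight per input branch
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, inputs):
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)            # fast normalized fusion weights
        fused = sum(wi * xi for wi, xi in zip(w, inputs))
        return self.conv(fused)

p4 = torch.randn(1, 64, 40, 40)
p5_up = torch.randn(1, 64, 40, 40)  # coarser level, assumed already upsampled to the same size
print(WeightedFuse(64)([p4, p5_up]).shape)  # torch.Size([1, 64, 40, 40])
```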

14 pages, 1149 KB  
Article
Modality Information Aggregation Graph Attention Network with Adversarial Training for Multi-Modal Knowledge Graph Completion
by Hankiz Yilahun, Elyar Aili, Seyyare Imam and Askar Hamdulla
Information 2025, 16(10), 907; https://doi.org/10.3390/info16100907 - 16 Oct 2025
Abstract
Multi-modal knowledge graph completion (MMKGC) aims to complete knowledge graphs by integrating structural information with multi-modal (e.g., visual, textual, and numerical) features and leveraging cross-modal reasoning within a unified semantic space to infer and supplement missing factual knowledge. Current MMKGC methods have advanced in integrating multi-modal information but have overlooked the imbalance in modality importance for target entities. Treating all modalities equally dilutes critical semantics and amplifies irrelevant information, which in turn limits the semantic understanding and predictive performance of the model. To address these limitations, we propose a modality information aggregation graph attention network with adversarial training for multi-modal knowledge graph completion (MIAGAT-AT). MIAGAT-AT focuses on hierarchically modeling complex cross-modal interactions. By combining the multi-head attention mechanism with modality-specific projection methods, it precisely captures global semantic dependencies and dynamically adjusts the weight of modality embeddings according to the importance of each modality, thereby optimizing cross-modal information fusion. Moreover, through the use of random noise and multi-layer residual blocks, the adversarial training generates high-quality multi-modal feature representations, effectively enhancing information from imbalanced modalities. Experimental results demonstrate that our approach significantly outperforms 18 existing baselines and sets a strong performance benchmark across three distinct datasets.
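
The core mechanism here, modality-specific projections followed by importance-weighted aggregation, can be illustrated with a small attention pooling sketch; the module name, dimensions, and three-modality setup are assumptions for illustration, not MIAGAT-AT's implementation.

```python
# Minimal sketch: weight each modality embedding by a learned importance score before fusing.
import torch
import torch.nn as nn

class ModalityAggregator(nn.Module):
    def __init__(self, dim: int, n_modalities: int = 3):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_modalities)])  # modality-specific projections
        self.score = nn.Linear(dim, 1)

    def forward(self, modal_embs):  # list of (B, dim): e.g. visual, textual, numerical
        h = torch.stack([p(e) for p, e in zip(self.proj, modal_embs)], dim=1)  # (B, M, dim)
        alpha = torch.softmax(self.score(torch.tanh(h)), dim=1)                # (B, M, 1) modality importance
        return (alpha * h).sum(dim=1)                                          # (B, dim) fused entity embedding

embs = [torch.randn(8, 256) for _ in range(3)]
print(ModalityAggregator(256)(embs).shape)  # torch.Size([8, 256])
```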

31 pages, 3160 KB  
Article
Multimodal Image Segmentation with Dynamic Adaptive Window and Cross-Scale Fusion for Heterogeneous Data Environments
by Qianping He, Meng Wu, Pengchang Zhang, Lu Wang and Quanbin Shi
Appl. Sci. 2025, 15(19), 10813; https://doi.org/10.3390/app151910813 - 8 Oct 2025
Abstract
Multi-modal image segmentation is a key task in fields such as urban planning, infrastructure monitoring, and environmental analysis. However, it remains challenging due to complex scenes, varying object scales, and the integration of heterogeneous data sources (such as RGB, depth maps, and infrared). To address these challenges, we propose a novel multi-modal segmentation framework, DyFuseNet, which features dynamic adaptive windows and cross-scale feature fusion capabilities. This framework consists of three key components: (1) the Dynamic Window Module (DWM), which uses dynamic partitioning and continuous position bias to adaptively adjust window sizes, thereby improving the representation of irregular and fine-grained objects; (2) Scale Context Attention (SCA), a hierarchical mechanism that associates local details with global semantics in a coarse-to-fine manner, enhancing segmentation accuracy in low-texture or occluded regions; and (3) the Hierarchical Adaptive Fusion Architecture (HAFA), which aligns and fuses features from multiple modalities through shallow synchronization and deep channel attention, effectively balancing complementarity and redundancy. Evaluated on benchmark datasets (such as ISPRS Vaihingen and Potsdam), DyFuseNet achieved state-of-the-art performance, with mean Intersection over Union (mIoU) scores of 80.40% and 80.85%, surpassing MFTransNet by 1.91% and 1.77%, respectively. The model also demonstrated strong robustness in challenging scenes (such as building edges and shadowed objects), achieving an average F1 score of 85% while maintaining high efficiency (26.19 GFLOPs, 30.09 FPS), making it suitable for real-time deployment. This work presents a practical, versatile, and computationally efficient solution for multi-modal image analysis, with potential applications beyond remote sensing, including smart monitoring, industrial inspection, and multi-source data fusion tasks.
(This article belongs to the Special Issue Signal and Image Processing: From Theory to Applications: 2nd Edition)
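
As a rough picture of the "deep channel attention" style of modality fusion described for HAFA, the sketch below reweights the concatenated channels of two modality streams before reducing them; layer choices, reduction ratio, and the two-stream setup are illustrative assumptions, not DyFuseNet's code.

```python
# Minimal sketch of channel-attention fusion of two modality streams (e.g., RGB and depth).
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, (2 * channels) // reduction),
            nn.ReLU(inplace=True),
            nn.Linear((2 * channels) // reduction, 2 * channels),
            nn.Sigmoid(),
        )
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, aux], dim=1)               # (B, 2C, H, W)
        w = self.mlp(x.mean(dim=(2, 3)))               # global pooling -> channel weights (B, 2C)
        x = x * w.unsqueeze(-1).unsqueeze(-1)          # reweight channels of both streams
        return self.reduce(x)                          # (B, C, H, W) fused feature

fused = ChannelAttentionFusion(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```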

24 pages, 3017 KB  
Article
Tree-Guided Transformer for Sensor-Based Ecological Image Feature Extraction and Multitarget Recognition in Agricultural Systems
by Yiqiang Sun, Zigang Huang, Linfeng Yang, Zihuan Wang, Mingzhuo Ruan, Jingchao Suo and Shuo Yan
Sensors 2025, 25(19), 6206; https://doi.org/10.3390/s25196206 - 7 Oct 2025
Abstract
Farmland ecosystems present complex pest–predator co-occurrence patterns, posing significant challenges for image-based multitarget recognition and ecological modeling in sensor-driven computer vision tasks. To address these issues, this study introduces a tree-guided Transformer framework enhanced with a knowledge-augmented co-attention mechanism, enabling effective feature extraction from sensor-acquired images. A hierarchical ecological taxonomy (Phylum–Family–Species) guides prompt-driven semantic reasoning, while an ecological knowledge graph enriches visual representations by embedding co-occurrence priors. A multimodal dataset containing 60 pest and predator categories with annotated images and semantic descriptions was constructed for evaluation. Experimental results demonstrate that the proposed method achieves 90.4% precision, 86.7% recall, and an 88.5% F1-score in image classification, along with 82.3% hierarchical accuracy. In detection tasks, it attains 91.6% precision and 86.3% mAP@50, with 80.5% co-occurrence accuracy. For hierarchical reasoning and knowledge-enhanced tasks, F1-scores reach 88.5% and 89.7%, respectively. These results highlight the framework’s strong capability to extract structured, semantically aligned image features under real-world sensor conditions, offering an interpretable and generalizable approach for intelligent agricultural monitoring.
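
One way to read the reported "hierarchical accuracy" is as taxonomy-aware scoring, where a prediction earns partial credit for matching the true label at coarser levels of the Phylum–Family–Species tree. The sketch below shows that reading with a toy taxonomy and arbitrary level weights; it is an assumption about the metric, not the paper's definition.

```python
# Minimal sketch of a taxonomy-aware accuracy; taxonomy and weights are illustrative only.
TAXONOMY = {  # species -> (phylum, family)
    "aphid":    ("Arthropoda", "Aphididae"),
    "ladybird": ("Arthropoda", "Coccinellidae"),
    "spider":   ("Arthropoda", "Araneidae"),
}

def hierarchical_accuracy(preds, labels, weights=(0.2, 0.3, 0.5)):
    """weights = credit for matching at the (phylum, family, species) level."""
    total = 0.0
    for p, y in zip(preds, labels):
        score = 0.0
        if TAXONOMY[p][0] == TAXONOMY[y][0]:
            score += weights[0]
        if TAXONOMY[p][1] == TAXONOMY[y][1]:
            score += weights[1]
        if p == y:
            score += weights[2]
        total += score
    return total / len(labels)

print(hierarchical_accuracy(["aphid", "ladybird"], ["aphid", "spider"]))  # 0.6 = (1.0 + 0.2) / 2
```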

20 pages, 59706 KB  
Article
Learning Hierarchically Consistent Disentanglement with Multi-Channel Augmentation for Public Security-Oriented Sketch Person Re-Identification
by Yu Ye, Zhihong Sun and Jun Chen
Sensors 2025, 25(19), 6155; https://doi.org/10.3390/s25196155 - 4 Oct 2025
Abstract
Sketch re-identification (Re-ID) aims to retrieve pedestrian photographs in a gallery dataset using a query sketch drawn by professionals, which is crucial for criminal investigations and missing person searches in the field of public security. The main challenge of this task lies in bridging the significant modality gap between sketches and photos while extracting discriminative modality-invariant features. However, information asymmetry between sketches and RGB photographs, particularly the difference in color information, severely interferes with cross-modal matching. To address this challenge, we propose a novel network architecture that integrates multi-channel augmentation with hierarchically consistent disentanglement learning. Specifically, a multi-channel augmentation module is developed to mitigate the interference of color bias in cross-modal matching. Furthermore, a modality-disentangled prototype (MDP) module is introduced to decompose pedestrian representations at the feature level into modality-invariant structural prototypes and modality-specific appearance prototypes. Additionally, a cross-layer decoupling consistency constraint is designed to ensure the semantic coherence of disentangled prototypes across different network layers and to improve the stability of the whole decoupling process. Extensive experimental results on two public datasets demonstrate the superiority of our proposed approach over state-of-the-art methods.
(This article belongs to the Special Issue Advances in Security for Emerging Intelligent Systems)
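
To illustrate what a color-bias-suppressing augmentation can look like, the sketch below randomly replaces an RGB photo with a single color channel (or its grayscale) replicated across three channels; this is one plausible interpretation of multi-channel augmentation, not the paper's exact scheme.

```python
# Minimal sketch of a colour-suppressing channel augmentation for photo inputs.
import random
import torch

def random_channel_augment(img: torch.Tensor) -> torch.Tensor:
    """img: (3, H, W) RGB tensor. Returns a 3-channel tensor with colour cues suppressed."""
    choice = random.choice(["r", "g", "b", "gray", "keep"])
    if choice == "keep":
        return img
    if choice == "gray":
        gray = 0.299 * img[0] + 0.587 * img[1] + 0.114 * img[2]
        return gray.unsqueeze(0).expand(3, -1, -1).clone()
    idx = {"r": 0, "g": 1, "b": 2}[choice]
    return img[idx].unsqueeze(0).expand(3, -1, -1).clone()

photo = torch.rand(3, 256, 128)
print(random_channel_augment(photo).shape)  # torch.Size([3, 256, 128])
```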

17 pages, 2114 KB  
Article
Omni-Refinement Attention Network for Lane Detection
by Boyuan Zhang, Lanchun Zhang, Tianbo Wang, Yingjun Wei, Ziyan Chen and Bin Cao
Sensors 2025, 25(19), 6150; https://doi.org/10.3390/s25196150 - 4 Oct 2025
Abstract
Lane detection is a fundamental component of perception systems in autonomous driving. Despite significant progress in this area, existing methods still face challenges in complex scenarios such as adverse weather, occlusions, and curved roads. These situations typically demand the integration of both global semantic context and local visual features to predict lane position and shape. This paper presents ORANet, an enhanced lane detection framework built upon the baseline CLRNet. ORANet incorporates two novel modules: Enhanced Coordinate Attention (EnCA) and Channel–Spatial Shuffle Attention (CSSA). EnCA models long-range lane structures while effectively capturing global semantic information, whereas CSSA strengthens the precise extraction of local features and provides optimized inputs for EnCA. These components operate in hierarchical synergy, establishing a complete enhancement pathway from refined local feature extraction to efficient global feature fusion. The experimental results demonstrate that ORANet achieves greater performance stability than CLRNet in complex roadway scenarios. Notably, under shadow conditions, ORANet achieves an F1 score improvement of nearly 3% over CLRNet. These results highlight the potential of ORANet for reliable lane detection in real-world autonomous driving environments.
(This article belongs to the Section Vehicular Sensing)
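
Shuffle-style attention modules such as the CSSA named above typically rely on a channel-shuffle operation to mix information across channel groups. The sketch below shows that standard operation; the group count and the assumption that CSSA uses it in this exact form are illustrative.

```python
# Minimal sketch of the channel-shuffle operation used by shuffle-attention style modules.
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    b, c, h, w = x.shape
    assert c % groups == 0
    # (B, G, C/G, H, W) -> swap the two channel axes -> flatten back to (B, C, H, W)
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

x = torch.arange(8).float().view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())  # [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0]
```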

35 pages, 7343 KB  
Article
A Hybrid Deep Learning and Knowledge Graph Approach for Intelligent Image Indexing and Retrieval
by Mohamed Hamroun and Damien Sauveron
Appl. Sci. 2025, 15(19), 10591; https://doi.org/10.3390/app151910591 - 30 Sep 2025
Abstract
Technological advancements have enabled users to digitize and store an unlimited number of multimedia documents, including images and videos. However, the heterogeneous nature of multimedia content poses significant challenges for efficient indexing and retrieval. Traditional approaches focus primarily on visual features, often neglecting the semantic context, which limits retrieval efficiency. This paper proposes a hybrid deep learning and knowledge graph approach for intelligent image indexing and retrieval. By integrating deep learning models such as EfficientNet and Vision Transformer (ViT) with structured knowledge graphs, the proposed framework enhances semantic understanding and retrieval performance. The methodology incorporates feature extraction, concept classification, and hierarchical knowledge graph structuring to facilitate effective multimedia retrieval. Experimental results on benchmark datasets, including TRECVID, Corel, and MSCOCO, demonstrate significant improvements in precision, robustness, and query expansion effectiveness. The findings highlight the potential of combining deep learning with knowledge graphs to bridge the semantic gap and optimize multimedia indexing and retrieval.
(This article belongs to the Special Issue Application of Deep Learning and Big Data Processing)

32 pages, 9638 KB  
Article
MSSA: A Multi-Scale Semantic-Aware Method for Remote Sensing Image–Text Retrieval
by Yun Liao, Zongxiao Hu, Fangwei Jin, Junhui Liu, Nan Chen, Jiayi Lv and Qing Duan
Remote Sens. 2025, 17(19), 3341; https://doi.org/10.3390/rs17193341 - 30 Sep 2025
Abstract
In recent years, the convenience and information extraction potential offered by Remote Sensing Image–Text Retrieval (RSITR) have made it a significant focus of research in remote sensing (RS) knowledge services. Current mainstream methods for RSITR generally align fused image features at multiple scales with textual features, focusing primarily on the local information of RS images while neglecting potential semantic information. This results in insufficient alignment in the cross-modal semantic space. To overcome this limitation, we propose a Multi-Scale Semantic-Aware Remote Sensing Image–Text Retrieval method (MSSA). This method introduces Progressive Spatial Channel Joint Attention (PSCJA), which enhances the expressive capability of multi-scale image features through Window-Region-Global Progressive Attention (WRGPA) and Segmented Channel Attention (SCA). Additionally, the Image-Guided Text Attention (IGTA) mechanism dynamically adjusts textual attention weights based on visual context. Furthermore, the Cross-Modal Semantic Extraction (CMSE) module incorporates learnable semantic tokens at each scale, enabling attention interaction between multi-scale features of different modalities and capturing hierarchical semantic associations. This multi-scale semantic-guided retrieval method ensures cross-modal semantic consistency, significantly improving the accuracy of cross-modal retrieval in RS. MSSA demonstrates superior retrieval accuracy in experiments across three baseline datasets, achieving new state-of-the-art performance.
(This article belongs to the Section Remote Sensing Image Processing)
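
The image-guided text attention idea, reweighting text tokens from visual context, can be sketched as a single scaled dot-product step; the class name, projections, and dimensions below are assumptions for illustration, not MSSA's implementation of IGTA.

```python
# Minimal sketch: a pooled visual feature scores each text token and reweights it.
import torch
import torch.nn as nn

class ImageGuidedTextAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # projects the pooled image feature into a query
        self.k = nn.Linear(dim, dim)  # projects each text token into a key

    def forward(self, img_feat: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, D) pooled visual context; text_tokens: (B, T, D)
        q = self.q(img_feat).unsqueeze(1)                                    # (B, 1, D)
        k = self.k(text_tokens)                                              # (B, T, D)
        attn = torch.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)   # (B, T) token weights
        weighted = text_tokens * attn.unsqueeze(-1)                          # image-aware reweighting
        return weighted.sum(dim=1)                                           # (B, D) text embedding

out = ImageGuidedTextAttention(256)(torch.randn(2, 256), torch.randn(2, 20, 256))
print(out.shape)  # torch.Size([2, 256])
```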

23 pages, 18084 KB  
Article
WetSegNet: An Edge-Guided Multi-Scale Feature Interaction Network for Wetland Classification
by Li Chen, Shaogang Xia, Xun Liu, Zhan Xie, Haohong Chen, Feiyu Long, Yehong Wu and Meng Zhang
Remote Sens. 2025, 17(19), 3330; https://doi.org/10.3390/rs17193330 - 29 Sep 2025
Abstract
Wetlands play a crucial role in climate regulation, pollutant filtration, and biodiversity conservation. Accurate wetland classification from high-resolution remote sensing imagery is pivotal for the scientific management, ecological monitoring, and sustainable development of these ecosystems. However, the intricate spatial details in such imagery pose significant challenges to conventional interpretation techniques, necessitating precise boundary extraction and multi-scale contextual modeling. In this study, we propose WetSegNet, an edge-guided Multi-Scale Feature Interaction network for wetland classification, which integrates a convolutional neural network (CNN) and a Swin Transformer within a U-Net architecture to combine local texture perception with global semantic comprehension. Specifically, the framework incorporates two novel components: (1) a Multi-Scale Feature Interaction (MFI) module employing cross-attention mechanisms to mitigate semantic discrepancies between encoder–decoder features, and (2) a Multi-Feature Fusion (MFF) module that hierarchically enhances boundary delineation through edge-guided spatial attention (EGA). Experimental validation on GF-2 satellite imagery of the Dongting Lake wetlands demonstrates that WetSegNet achieves state-of-the-art performance, with an overall accuracy (OA) of 90.81% and a Kappa coefficient of 0.88. Notably, it achieves classification accuracies exceeding 90% for water, sedge, and reed habitats, surpassing the baseline U-Net by 3.3% in overall accuracy and 0.05 in Kappa. The proposed model effectively addresses heterogeneous wetland classification challenges, validating its capability to reconcile local and global feature representations.
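
To give a concrete shape to the edge-guided spatial attention (EGA) described above, the sketch below turns an edge probability map into a spatial gate that boosts boundary pixels in a decoder feature; the layer choices and residual form are illustrative assumptions, not WetSegNet's code.

```python
# Minimal sketch of an edge-guided spatial attention step.
import torch
import torch.nn as nn

class EdgeGuidedAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, edge_map: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) decoder feature; edge_map: (B, 1, H, W) edge probabilities in [0, 1]
        a = self.gate(edge_map)
        return feat + feat * a  # residual form: boundary regions get boosted, others pass through

out = EdgeGuidedAttention(48)(torch.randn(2, 48, 64, 64), torch.rand(2, 1, 64, 64))
print(out.shape)  # torch.Size([2, 48, 64, 64])
```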

32 pages, 2754 KB  
Article
Critical Thinking Writing Assessment in Middle School Language: Logic Chain Extraction and Expert Score Correlation Test Using BERT-CNN Hybrid Model
by Yao Wu and Qin-Hua Zheng
Appl. Sci. 2025, 15(19), 10504; https://doi.org/10.3390/app151910504 - 28 Sep 2025
Abstract
Critical thinking, as a crucial component of 21st-century core competencies, poses significant challenges for effective assessment in educational evaluation. This study proposes an automated assessment method for critical thinking in middle school Chinese language based on a Bidirectional Encoder Representations from Transformers–Convolutional Neural Network (BERT-CNN) hybrid model, achieving multi-dimensional quantitative assessment of students’ critical thinking performance in writing through the synergy of deep semantic encoding and local feature extraction. The research constructs an annotated dataset containing 4827 argumentative essays from three middle school grades, employing expert scoring across nine dimensions of the Paul–Elder framework, and designs three types of logic chain extraction algorithms: argument–evidence mapping, causal reasoning chains, and rebuttal–support structures. Experimental results demonstrate that the BERT-CNN hybrid model achieves a Pearson correlation coefficient of 0.872 in overall assessment tasks and an average F1 score of 0.770 in logic chain recognition tasks, outperforming the traditional baseline methods tested in our experiments. Ablation experiments confirm the hierarchical contributions of semantic features (31.2%), syntactic features (24.1%), and logical markers (18.9%), while revealing the model’s limitations in assessing higher-order cognitive dimensions. The findings provide a feasible technical solution for the intelligent assessment of critical thinking, offering significant theoretical value and practical implications for advancing educational evaluation reform and personalized instruction.
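
The "CNN over deep semantic encoding" half of a BERT-CNN hybrid is usually a set of 1-D convolutions scanning contextual token embeddings. The sketch below shows that head with random stand-ins for BERT outputs; all sizes and the regression target are assumptions, not the paper's configuration.

```python
# Minimal sketch of a TextCNN head over contextual token embeddings.
import torch
import torch.nn as nn

class TextCNNHead(nn.Module):
    def __init__(self, emb_dim: int = 768, n_filters: int = 128, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k, padding=k // 2) for k in kernel_sizes]
        )
        self.scorer = nn.Linear(n_filters * len(kernel_sizes), 1)  # e.g., an overall essay score

    def forward(self, token_embs: torch.Tensor) -> torch.Tensor:
        # token_embs: (B, T, D), e.g. an encoder's last hidden states
        x = token_embs.transpose(1, 2)                               # (B, D, T) for Conv1d
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.scorer(torch.cat(pooled, dim=1)).squeeze(-1)     # (B,) predicted scores

scores = TextCNNHead()(torch.randn(4, 200, 768))
print(scores.shape)  # torch.Size([4])
```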

21 pages, 2380 KB  
Article
Edge-Embedded Multi-Feature Fusion Network for Automatic Checkout
by Jicai Li, Meng Zhu and Honge Ren
J. Imaging 2025, 11(10), 337; https://doi.org/10.3390/jimaging11100337 - 27 Sep 2025
Abstract
The Automatic Checkout (ACO) task aims to accurately generate complete shopping lists from checkout images. Severe product occlusions, numerous categories, and cluttered layouts impose high demands on detection models’ robustness and generalization. To address these challenges, we propose the Edge-Embedded Multi-Feature Fusion Network (E2MF2Net), which jointly optimizes synthetic image generation and feature modeling. We introduce the Hierarchical Mask-Guided Composition (HMGC) strategy to select natural product poses based on mask compactness, incorporating geometric priors and occlusion tolerance to produce photorealistic, structurally coherent synthetic images. Mask-structure supervision further enhances boundary and spatial awareness. Architecturally, the Edge-Embedded Enhancement Module (E3) embeds salient structural cues to explicitly capture boundary details and facilitate cross-layer edge propagation, while the Multi-Feature Fusion Module (MFF) integrates multi-scale semantic cues, improving feature discriminability. Experiments on the RPC dataset demonstrate that E2MF2Net outperforms state-of-the-art methods, achieving checkout accuracies (cAcc) of 98.52%, 97.95%, 96.52%, and 97.62% in the Easy, Medium, Hard, and Average modes, respectively. Notably, it improves by 3.63 percentage points in the heavily occluded Hard mode and exhibits strong robustness and adaptability in incremental learning and domain generalization scenarios.
(This article belongs to the Section Computer Vision and Pattern Recognition)
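
The checkout accuracy (cAcc) metric reported above is commonly defined on the RPC benchmark as an exact-match criterion: an image counts as correct only if the predicted shopping list matches the ground truth for every category and count. The sketch below reflects that common reading, not the authors' evaluation code.

```python
# Minimal sketch of checkout accuracy (cAcc) as an exact-match rate over shopping lists.
from collections import Counter

def checkout_accuracy(predictions, ground_truths):
    """Both arguments are lists of per-image lists of predicted/true product category IDs."""
    correct = sum(Counter(pred) == Counter(gt) for pred, gt in zip(predictions, ground_truths))
    return correct / len(ground_truths)

preds = [["cola", "cola", "chips"], ["soap"]]
gts   = [["cola", "chips", "cola"], ["soap", "soap"]]
print(checkout_accuracy(preds, gts))  # 0.5: the first list matches exactly, the second misses one item
```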

22 pages, 2395 KB  
Article
Multimodal Alignment and Hierarchical Fusion Network for Multimodal Sentiment Analysis
by Jiasheng Huang, Huan Li and Xinyue Mo
Electronics 2025, 14(19), 3828; https://doi.org/10.3390/electronics14193828 - 26 Sep 2025
Abstract
The widespread emergence of multimodal data on social platforms has presented new opportunities for sentiment analysis. However, previous studies have often overlooked the issue of detail loss during modal interaction fusion and exhibit limitations in addressing semantic alignment challenges and the sensitivity of modalities to noise. To enhance analytical accuracy, a novel model named MAHFNet is proposed. The architecture is composed of three main components. First, an attention-guided gated interaction alignment module is developed to model the semantic interaction between text and image using a gated network and a cross-modal attention mechanism, and a contrastive learning mechanism is introduced to encourage the aggregation of semantically aligned image–text pairs. Second, an intra-modality emotion extraction module is designed to extract local emotional features within each modality, compensating for detail loss during interaction fusion. Third, the intra-modal local emotion features and cross-modal interaction features are fed into a hierarchical gated fusion module, where the local features are fused through a cross-gated mechanism that dynamically adjusts the contribution of each modality while suppressing modality-specific noise; the fusion results and cross-modal interaction features are then further fused by a multi-scale attention gating module to capture hierarchical dependencies between local and global emotional information, enhancing the model’s ability to perceive and integrate emotional cues across multiple semantic levels. Finally, extensive experiments were conducted on three public multimodal sentiment datasets, with results demonstrating that the proposed model outperforms existing methods across multiple evaluation metrics. Specifically, on the TumEmo dataset, our model achieves improvements of 2.55% in ACC and 2.63% in F1 score over the second-best method; on the HFM dataset, the gains are 0.56% in ACC and 0.9% in F1 score; and on the MVSA-S dataset, 0.03% in ACC and 1.26% in F1 score. These findings collectively validate the overall effectiveness of the proposed model.
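
The cross-gated mechanism described above, where each modality's contribution is scaled by a gate computed from the other modality, can be sketched in a few lines; the class name, dimensions, and two-modality setup are illustrative assumptions, not MAHFNet's implementation.

```python
# Minimal sketch of cross-gated fusion between a text and an image embedding.
import torch
import torch.nn as nn

class CrossGatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate_t = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # gate for text, driven by image
        self.gate_v = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # gate for image, driven by text
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text, image: (B, D) modality embeddings
        gated_text = text * self.gate_t(image)    # damp text channels the image deems unreliable
        gated_image = image * self.gate_v(text)   # and vice versa
        return self.out(torch.cat([gated_text, gated_image], dim=-1))

fused = CrossGatedFusion(256)(torch.randn(8, 256), torch.randn(8, 256))
print(fused.shape)  # torch.Size([8, 256])
```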

22 pages, 1588 KB  
Article
Generative Sign-Description Prompts with Multi-Positive Contrastive Learning for Sign Language Recognition
by Siyu Liang, Yunan Li, Wentian Xin, Huizhou Chen, Xujie Liu, Kang Liu and Qiguang Miao
Sensors 2025, 25(19), 5957; https://doi.org/10.3390/s25195957 - 24 Sep 2025
Abstract
While sign language combines sequential hand motions with concurrent non-manual cues (e.g., mouth shapes and head tilts), current recognition systems lack multimodal annotation methods capable of capturing their hierarchical semantics. To bridge this gap, we propose GSP-MC, the first method to integrate generative large language models into sign language recognition. It leverages retrieval-augmented generation with domain-specific large language models and expert-validated corpora to produce precise multi-part descriptions. A dual-encoder architecture bidirectionally aligns hierarchical skeleton features with multi-level text descriptions (global, synonym, part) through probabilistic matching. The approach combines global and part-level losses with KL divergence optimization, ensuring robust alignment across relevant text–skeleton pairs while capturing sign semantics and detailed dynamics. Experiments demonstrate state-of-the-art performance, achieving 97.1% accuracy on the Chinese SLR500 dataset (surpassing SSRL’s 96.9%) and 97.07% on the Turkish AUTSL dataset (exceeding SML’s 96.85%), confirming cross-lingual potential for inclusive communication technologies.
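
A common way to realize multi-positive contrastive alignment with KL divergence is to push the softmax over skeleton-text similarities toward a target distribution spread over all positive descriptions of the same sign. The sketch below shows that simplified reading; the function name, temperature, and uniform target are assumptions, not the paper's exact loss.

```python
# Minimal sketch of a multi-positive contrastive objective trained with KL divergence.
import torch
import torch.nn.functional as F

def multi_positive_contrastive(skel, text, pos_mask, temperature: float = 0.07):
    # skel: (B, D) skeleton embeddings; text: (N, D) text embeddings;
    # pos_mask: (B, N) with 1 where text j describes sample i (e.g., global/synonym/part descriptions).
    skel = F.normalize(skel, dim=-1)
    text = F.normalize(text, dim=-1)
    logits = skel @ text.t() / temperature                   # (B, N) scaled cosine similarities
    log_pred = F.log_softmax(logits, dim=-1)
    target = pos_mask / pos_mask.sum(dim=-1, keepdim=True)   # uniform probability over the positives
    return F.kl_div(log_pred, target, reduction="batchmean")

skel = torch.randn(2, 128)
text = torch.randn(6, 128)
pos_mask = torch.tensor([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=torch.float)
print(multi_positive_contrastive(skel, text, pos_mask).item())
```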
