Search Results (50)

Search Parameters:
Keywords = visual–semantic contrastive learning

23 pages, 2744 KiB  
Article
CASF: Correlation-Alignment and Significance-Aware Fusion for Multimodal Named Entity Recognition
by Hui Li, Yunshi Tao, Huan Wang, Zhe Wang and Qingzheng Liu
Algorithms 2025, 18(8), 511; https://doi.org/10.3390/a18080511 - 14 Aug 2025
Viewed by 101
Abstract
As the content of social media platforms grows richer, Multimodal Named Entity Recognition (MNER) faces the dual challenges of fusing heterogeneous features and recognizing entities accurately. To address the key problems of mismatched distributions of textual and visual information, insufficient feature alignment, and noise-contaminated fusion, this paper proposes CASF-MNER, a multimodal named entity recognition model based on a dual-stream Transformer. First, it applies cross-modal cross-attention to visual and textual features, building a bidirectional interaction mechanism between single-layer features that models higher-order semantic correlations and aligns the modalities by their cross-relevance. Second, it constructs a dynamic saliency-perception mechanism for multimodal features based on multiscale pooling, together with an entropy-weighting strategy over the global feature distribution that adaptively suppresses redundant noise and strengthens key features. Third, it establishes a deep semantic fusion method built on a hybrid isomorphic model, designing a progressive cross-modal interaction structure combined with contrastive learning to achieve global fusion in the deep semantic space and to optimize representational consistency. Experimental results show that CASF-MNER achieves excellent performance on both the Twitter-2015 and Twitter-2017 public datasets, verifying the effectiveness of the proposed method. Full article
(This article belongs to the Section Algorithms for Multidisciplinary Applications)
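A minimal sketch (not the authors' implementation) of the kind of bidirectional cross-modal cross-attention the abstract describes; the module name, dimensions, and toy inputs are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Text attends to image features and image attends to text features,
    a generic stand-in for the correlation-alignment idea in the abstract."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor):
        # text_feats: (B, L_t, D), image_feats: (B, L_v, D)
        text_aligned, _ = self.txt2img(text_feats, image_feats, image_feats)
        image_aligned, _ = self.img2txt(image_feats, text_feats, text_feats)
        return text_aligned, image_aligned

# toy usage
txt = torch.randn(2, 12, 256)   # token features
img = torch.randn(2, 49, 256)   # patch features
t_out, v_out = BidirectionalCrossAttention()(txt, img)
print(t_out.shape, v_out.shape)
```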

20 pages, 983 KiB  
Article
A Library-Oriented Large Language Model Approach to Cross-Lingual and Cross-Modal Document Retrieval
by Wang Yi, Xiahuan Cai, Hongtao Ma, Zhengjie Fu and Yan Zhan
Electronics 2025, 14(15), 3145; https://doi.org/10.3390/electronics14153145 - 7 Aug 2025
Viewed by 379
Abstract
Under the growing demand for processing multimodal and cross-lingual information, traditional retrieval systems have encountered substantial limitations when handling heterogeneous inputs such as images, textual layouts, and multilingual language expressions. To address these challenges, a unified retrieval framework has been proposed, which integrates visual features from images, layout-aware optical character recognition (OCR) text, and bilingual semantic representations in Chinese and English. This framework aims to construct a shared semantic embedding space that mitigates semantic discrepancies across modalities and resolves inconsistencies in cross-lingual mappings. The architecture incorporates three main components: a visual encoder, a structure-aware OCR module, and a multilingual Transformer. Furthermore, a joint contrastive learning loss has been introduced to enhance alignment across both modalities and languages. The proposed method has been evaluated on three core tasks: a single-modality retrieval task from image → OCR, a cross-lingual retrieval task between Chinese and English, and a joint multimodal retrieval task involving image, OCR, and language inputs. Experimental results demonstrate that, in the joint multimodal setting, the proposed model achieved a Precision@10 of 0.693, Recall@10 of 0.684, nDCG@10 of 0.672, and F1@10 of 0.685, substantially outperforming established baselines such as CLIP, LayoutLMv3, and UNITER. Ablation studies revealed that removing either the structure-aware OCR module or the cross-lingual alignment mechanism resulted in a decrease in mean reciprocal rank (MRR) to 0.561, thereby confirming the critical role of these components in reinforcing semantic consistency across modalities. This study highlights the powerful potential of large language models in multimodal semantic fusion and retrieval tasks, providing robust solutions for large-scale semantic understanding and application scenarios in multilingual and multimodal contexts. Full article
(This article belongs to the Section Artificial Intelligence)
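A small sketch of how the retrieval metrics reported in the abstract (Precision@10, Recall@10, nDCG@10) are conventionally computed for one query; the function and example IDs are illustrative, not taken from the paper:

```python
import numpy as np

def precision_recall_ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Standard retrieval metrics at cut-off k for a single query.
    ranked_ids: ids ordered by decreasing similarity; relevant_ids: set of gold ids."""
    top_k = ranked_ids[:k]
    hits = [1.0 if doc in relevant_ids else 0.0 for doc in top_k]
    precision = sum(hits) / k
    recall = sum(hits) / max(len(relevant_ids), 1)
    dcg = sum(h / np.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return precision, recall, ndcg

print(precision_recall_ndcg_at_k(["d3", "d1", "d7", "d2"], {"d1", "d2"}, k=4))
```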

30 pages, 37977 KiB  
Article
Text-Guided Visual Representation Optimization for Sensor-Acquired Video Temporal Grounding
by Yun Tian, Xiaobo Guo, Jinsong Wang and Xinyue Liang
Sensors 2025, 25(15), 4704; https://doi.org/10.3390/s25154704 - 30 Jul 2025
Viewed by 342
Abstract
Video temporal grounding (VTG) aims to localize a semantically relevant temporal segment within an untrimmed video based on a natural language query. The task continues to face challenges arising from cross-modal semantic misalignment, which is largely attributed to redundant visual content in sensor-acquired video streams, linguistic ambiguity, and discrepancies in modality-specific representations. Most existing approaches rely on intra-modal feature modeling, processing video and text independently throughout the representation learning stage. However, this isolation undermines semantic alignment by neglecting the potential of cross-modal interactions. In practice, a natural language query typically corresponds to spatiotemporal content in video signals collected through camera-based sensing systems, encompassing a particular sequence of frames and its associated salient subregions. We propose a text-guided visual representation optimization framework tailored to enhance semantic interpretation over video signals captured by visual sensors. This framework leverages textual information to focus on spatiotemporal video content, thereby narrowing the cross-modal gap. Built upon the unified cross-modal embedding space provided by CLIP, our model leverages video data from sensing devices to structure representations and introduces two dedicated modules to semantically refine visual representations across spatial and temporal dimensions. First, we design a Spatial Visual Representation Optimization (SVRO) module to learn spatial information within intra-frames. It selects salient patches related to the text, capturing more fine-grained visual details. Second, we introduce a Temporal Visual Representation Optimization (TVRO) module to learn temporal relations from inter-frames. Temporal triplet loss is employed in TVRO to enhance attention on text-relevant frames and capture clip semantics. Additionally, a self-supervised contrastive loss is introduced at the clip–text level to improve inter-clip discrimination by maximizing semantic variance during training. Experiments on Charades-STA, ActivityNet Captions, and TACoS, widely used benchmark datasets, demonstrate that our method outperforms state-of-the-art methods across multiple metrics. Full article
(This article belongs to the Section Sensing and Imaging)
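A minimal sketch, under assumed shapes and names, of a triplet objective in the spirit of the temporal triplet loss mentioned in the abstract: the text query is pulled toward a relevant clip embedding and pushed away from an irrelevant one. This is a generic formulation, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def temporal_triplet_loss(query, pos_clip, neg_clip, margin=0.2):
    """Cosine-distance triplet loss over query/clip embeddings."""
    query, pos_clip, neg_clip = (F.normalize(x, dim=-1) for x in (query, pos_clip, neg_clip))
    d_pos = 1.0 - (query * pos_clip).sum(dim=-1)   # distance to text-relevant clip
    d_neg = 1.0 - (query * neg_clip).sum(dim=-1)   # distance to irrelevant clip
    return F.relu(d_pos - d_neg + margin).mean()

q, p, n = (torch.randn(8, 512) for _ in range(3))  # toy embeddings
print(temporal_triplet_loss(q, p, n))
```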

16 pages, 2370 KiB  
Article
SemABC: Semantic-Guided Adaptive Bias Calibration for Generative Zero-Shot Point Cloud Segmentation
by Yuyun Wei and Meng Qi
Appl. Sci. 2025, 15(15), 8359; https://doi.org/10.3390/app15158359 - 27 Jul 2025
Viewed by 399
Abstract
Due to the limited quantity and high cost of high-quality three-dimensional annotations, generalized zero-shot point cloud segmentation aims to transfer knowledge from seen classes to unseen classes by leveraging semantic correlations. Existing generative point cloud semantic segmentation approaches rely on generators trained on seen classes to synthesize visual features for unseen classes, helping the segmentation model generalize, but this often leads to a bias toward seen classes. To address this issue, we propose a semantic-guided adaptive bias calibration approach with a dual-branch network architecture. The network consists of a novel visual–semantic fusion branch alongside the primary segmentation branch to suppress the bias toward seen classes. Specifically, the visual–semantic branch exploits the visual–semantic relevance of the synthetic features of unseen classes to provide auxiliary predictions. Furthermore, we introduce an adaptive bias calibration module that dynamically integrates the predictions from both the main and auxiliary branches to achieve unbiased segmentation results. Extensive experiments on standard benchmarks demonstrate that our approach significantly outperforms state-of-the-art methods on both seen and unseen classes, validating its effectiveness. Full article
(This article belongs to the Special Issue Applications of Artificial Intelligence in Industrial Engineering)
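A hypothetical sketch of how two prediction branches can be blended with a learned, point-wise gate, in the spirit of the adaptive bias calibration module described above; the module name and shapes are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class AdaptiveBiasCalibration(nn.Module):
    """Blend main-branch and auxiliary-branch class logits with a
    point-wise weight predicted from both sets of logits."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * num_classes, 1), nn.Sigmoid())

    def forward(self, main_logits, aux_logits):
        # main_logits, aux_logits: (N_points, num_classes)
        w = self.gate(torch.cat([main_logits, aux_logits], dim=-1))  # (N_points, 1)
        return w * main_logits + (1.0 - w) * aux_logits

calib = AdaptiveBiasCalibration(num_classes=20)
fused = calib(torch.randn(4096, 20), torch.randn(4096, 20))
print(fused.shape)  # torch.Size([4096, 20])
```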

21 pages, 5527 KiB  
Article
SGNet: A Structure-Guided Network with Dual-Domain Boundary Enhancement and Semantic Fusion for Skin Lesion Segmentation
by Haijiao Yun, Qingyu Du, Ziqing Han, Mingjing Li, Le Yang, Xinyang Liu, Chao Wang and Weitian Ma
Sensors 2025, 25(15), 4652; https://doi.org/10.3390/s25154652 - 27 Jul 2025
Viewed by 398
Abstract
Segmentation of skin lesions in dermoscopic images is critical for the accurate diagnosis of skin cancers, particularly malignant melanoma, yet it is hindered by irregular lesion shapes, blurred boundaries, low contrast, and artifacts such as hair interference. Conventional deep learning methods, typically based on UNet or Transformer architectures, often fall short in fully exploiting lesion features and incur high computational costs, compromising precise lesion delineation. To overcome these challenges, we propose SGNet, a structure-guided network integrating a hybrid CNN–Mamba framework for robust skin lesion segmentation. SGNet employs the Visual Mamba (VMamba) encoder to efficiently extract multi-scale features, followed by the Dual-Domain Boundary Enhancer (DDBE), which refines boundary representations and suppresses noise through spatial and frequency-domain processing. The Semantic-Texture Fusion Unit (STFU) adaptively integrates low-level texture with high-level semantic features, while the Structure-Aware Guidance Module (SAGM) generates coarse segmentation maps to provide global structural guidance. The Guided Multi-Scale Refiner (GMSR) further optimizes boundary details through a multi-scale semantic attention mechanism. Comprehensive experiments on the ISIC2017, ISIC2018, and PH2 datasets demonstrate SGNet’s superior performance, with average improvements of 3.30% in mean Intersection over Union (mIoU) and 1.77% in Dice Similarity Coefficient (DSC) compared to state-of-the-art methods. Ablation studies confirm the effectiveness of each component, highlighting SGNet’s exceptional accuracy and robust generalization for computer-aided dermatological diagnosis. Full article
(This article belongs to the Section Biomedical Sensors)
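For reference, a short sketch of the two evaluation metrics the abstract reports (DSC and IoU/mIoU) for binary lesion masks; the toy masks are illustrative:

```python
import numpy as np

def dice_and_iou(pred: np.ndarray, gt: np.ndarray):
    """Dice Similarity Coefficient and Intersection over Union for boolean masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)
    iou = inter / (union + 1e-8)
    return dice, iou

pred = np.zeros((128, 128), dtype=bool); pred[30:90, 30:90] = True
gt = np.zeros((128, 128), dtype=bool); gt[40:100, 40:100] = True
print(dice_and_iou(pred, gt))
```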

22 pages, 2485 KiB  
Article
Infrared and Visible Image Fusion Using a State-Space Adversarial Model with Cross-Modal Dependency Learning
by Qingqing Hu, Yiran Peng, KinTak U and Siyuan Zhao
Mathematics 2025, 13(15), 2333; https://doi.org/10.3390/math13152333 - 22 Jul 2025
Viewed by 317
Abstract
Infrared and visible image fusion plays a critical role in multimodal perception systems, particularly under challenging conditions such as low illumination, occlusion, or complex backgrounds. However, existing approaches often struggle with global feature modelling, cross-modal dependency learning, and preserving structural details in the fused images. In this paper, we propose a novel adversarial fusion framework driven by a state-space modelling paradigm to address these limitations. In the feature extraction phase, a computationally efficient state-space model is utilized to capture global semantic context from both infrared and visible inputs. A cross-modality state-space architecture is then introduced in the fusion phase to model long-range dependencies between heterogeneous features effectively. Finally, a multi-class discriminator, trained under an adversarial learning scheme, enhances the structural fidelity and detail consistency of the fused output. Extensive experiments conducted on publicly available infrared–visible fusion datasets demonstrate that the proposed method achieves superior performance in terms of information retention, contrast enhancement, and visual realism. The results confirm the robustness and generalizability of our framework for complex scene understanding and downstream tasks such as object detection under adverse conditions. Full article
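One common way a multi-class adversarial objective of the kind sketched in the abstract can be set up, shown here as a toy three-way discriminator (infrared / visible / fused). This is an illustrative assumption, not the paper's architecture or loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiClassDiscriminator(nn.Module):
    """Toy 3-way discriminator: 0 = infrared, 1 = visible, 2 = fused."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 3))

    def forward(self, x):
        return self.net(x)

disc = MultiClassDiscriminator()
ir, vis, fused = (torch.randn(4, 1, 64, 64) for _ in range(3))
# discriminator step: identify each source correctly
logits = torch.cat([disc(ir), disc(vis), disc(fused)])
labels = torch.tensor([0] * 4 + [1] * 4 + [2] * 4)
d_loss = F.cross_entropy(logits, labels)
# fusion-network (generator) step: make the fused image resemble both sources
g_loss = 0.5 * (F.cross_entropy(disc(fused), torch.zeros(4, dtype=torch.long))
                + F.cross_entropy(disc(fused), torch.ones(4, dtype=torch.long)))
print(d_loss.item(), g_loss.item())
```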

32 pages, 2740 KiB  
Article
Vision-Based Navigation and Perception for Autonomous Robots: Sensors, SLAM, Control Strategies, and Cross-Domain Applications—A Review
by Eder A. Rodríguez-Martínez, Wendy Flores-Fuentes, Farouk Achakir, Oleg Sergiyenko and Fabian N. Murrieta-Rico
Eng 2025, 6(7), 153; https://doi.org/10.3390/eng6070153 - 7 Jul 2025
Viewed by 1983
Abstract
Camera-centric perception has matured into a cornerstone of modern autonomy, from self-driving cars and factory cobots to underwater and planetary exploration. This review synthesizes more than a decade of progress in vision-based robotic navigation through an engineering lens, charting the full pipeline from sensing to deployment. We first examine the expanding sensor palette—monocular and multi-camera rigs, stereo and RGB-D devices, LiDAR–camera hybrids, event cameras, and infrared systems—highlighting the complementary operating envelopes and the rise of learning-based depth inference. The advances in visual localization and mapping are then analyzed, contrasting sparse and dense SLAM approaches, as well as monocular, stereo, and visual–inertial formulations. Additional topics include loop closure, semantic mapping, and LiDAR–visual–inertial fusion, which enables drift-free operation in dynamic environments. Building on these foundations, we review the navigation and control strategies, spanning classical planning, reinforcement and imitation learning, hybrid topological–metric memories, and emerging visual language guidance. Application case studies—autonomous driving, industrial manipulation, autonomous underwater vehicles, planetary rovers, aerial drones, and humanoids—demonstrate how tailored sensor suites and algorithms meet domain-specific constraints. Finally, the future research trajectories are distilled: generative AI for synthetic training data and scene completion; high-density 3D perception with solid-state LiDAR and neural implicit representations; event-based vision for ultra-fast control; and human-centric autonomy in next-generation robots. By providing a unified taxonomy, a comparative analysis, and engineering guidelines, this review aims to inform researchers and practitioners designing robust, scalable, vision-driven robotic systems. Full article
(This article belongs to the Special Issue Interdisciplinary Insights in Engineering Research)

20 pages, 10186 KiB  
Article
SC-CoSF: Self-Correcting Collaborative and Co-Training for Image Fusion and Semantic Segmentation
by Dongrui Yang, Lihong Qiao and Yucheng Shu
Sensors 2025, 25(12), 3575; https://doi.org/10.3390/s25123575 - 6 Jun 2025
Viewed by 526
Abstract
Multimodal image fusion and semantic segmentation play pivotal roles in autonomous driving and robotic systems, yet their inherent interdependence remains underexplored. To address this gap and overcome performance bottlenecks, we propose SC-CoSF, a novel coupled framework that jointly optimizes these tasks through synergistic learning. Our approach replaces traditional duplex encoders with a weight-sharing CNN encoder, implicitly aligning multimodal features while reducing parameter overhead. The core innovation lies in our Self-correction and Collaboration Fusion Module (Sc-CFM), which integrates (1) a Self-correction Long-Range Relationship Branch (Sc-LRB) to strengthen global semantic modeling, (2) a Self-correction Fine-Grained Branch (Sc-FGB) for enhanced visual detail retention through local feature aggregation, and (3) a Dual-branch Collaborative Recalibration (DCR) mechanism for cross-task feature refinement. This design preserves critical edge textures and color contrasts for segmentation while leveraging segmentation-derived spatial priors to guide fusion. We further introduce the Interactive Context Recovery Mamba Decoder (ICRM) to restore lost long-range dependencies during the upsampling process; meanwhile, we propose the Region Adaptive Weighted Reconstruction Decoder (ReAW), which is mainly used to reduce feature redundancy in image fusion tasks. End-to-end joint training enables gradient propagation across all task branches via shared parameters, exploiting inter-task consistency for superior performance. Experiments demonstrate significant improvements over independently optimized baselines in both fusion quality and segmentation accuracy. Full article
(This article belongs to the Section Sensing and Imaging)
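A minimal skeleton, under assumed toy shapes, of the joint-training idea above: a shared encoder feeds both a fusion head and a segmentation head, and a single combined loss back-propagates through the shared weights. It is a sketch of the general pattern, not the SC-CoSF model:

```python
import torch
import torch.nn as nn

shared_encoder = nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.ReLU())
fusion_head = nn.Conv2d(32, 1, 3, padding=1)   # reconstructs a fused image
seg_head = nn.Conv2d(32, 9, 3, padding=1)      # predicts 9 semantic classes

pair = torch.randn(2, 2, 64, 64)               # stacked IR + grayscale visible (toy)
target_fused = torch.randn(2, 1, 64, 64)
target_labels = torch.randint(0, 9, (2, 64, 64))

feats = shared_encoder(pair)
loss = (nn.functional.l1_loss(fusion_head(feats), target_fused)
        + nn.functional.cross_entropy(seg_head(feats), target_labels))
loss.backward()   # gradients from both tasks reach the shared encoder
print(loss.item())
```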

18 pages, 775 KiB  
Article
The Role of the Visual Versus Verbal Modality in Learning Novel Verbs
by Maria Luisa Lorusso, Laura Pigazzini, Laura Zampini, Michele Burigo, Martina Caccia, Anna Milani and Massimo Molteni
Children 2025, 12(6), 722; https://doi.org/10.3390/children12060722 - 31 May 2025
Viewed by 475
Abstract
Background/Objectives: Verbs are considered to be more abstract than nouns, as they represent actions, states, and events, which are less tangible, more flexible in their meaning and thus less univocally specified. It has been suggested that children acquire abstract concepts based on their linguistic contexts of use, making use of semantic and syntactic cues. By contrast, according to theories of embodied cognition, conceptual knowledge is based on physical and perceptual interaction with the world. The present study investigates whether the verbal and the visual modality produce similar or different results in the processes of construction and reactivation of novel verbs, corresponding to new compositional abstract concepts, in children of different ages. In Experiment 1, the acquisition of the concept was determined based on the quality of verbal explanation; in Experiment 2, participants were asked to decide whether a visual representation fitted the concept or not. Thus, response modality could be either explicit or implicit, and either congruent or incongruent with respect to learning modality. Methods: In Experiment 1, 100 children from grade 1 to 5 were asked to explain the meaning of verbs introduced via verbal or visual instances. In Experiment 2, 15 children aged 8 to 10 had to judge pictures as (not) being examples of previously verbally or visually presented novel verbs. Results: The results of Experiment 1 show more accurate explanations after verbal presentation across all grades. In Experiment 2, verbal presentation was no longer associated with more accurate matching responses, but rather with slower decision times. Conclusions: Modality congruence, explicitness and linguistic (semantic and syntactic) factors were all shown to play a role, which is discussed in a developmental perspective. Full article

32 pages, 10515 KiB  
Article
E-CLIP: An Enhanced CLIP-Based Visual Language Model for Fruit Detection and Recognition
by Yi Zhang, Yang Shao, Chen Tang, Zhenqing Liu, Zhengda Li, Ruifang Zhai, Hui Peng and Peng Song
Agriculture 2025, 15(11), 1173; https://doi.org/10.3390/agriculture15111173 - 29 May 2025
Viewed by 666
Abstract
With the progress of agricultural modernization, intelligent fruit harvesting is gaining importance. While fruit detection and recognition are essential for robotic harvesting, existing methods suffer from limited generalizability, including adapting to complex environments and handling new fruit varieties. This problem stems from their reliance on unimodal visual data, which creates a semantic gap between image features and contextual understanding. To solve these issues, this study proposes a multi-modal fruit detection and recognition framework based on visual language models (VLMs). By integrating multi-modal information, the proposed model enhances robustness and generalization across diverse environmental conditions and fruit types. The framework accepts natural language instructions as input, facilitating effective human–machine interaction. Through its core module, Enhanced Contrastive Language–Image Pre-Training (E-CLIP), which employs image–image and image–text contrastive learning mechanisms, the framework achieves robust recognition of various fruit types and their maturity levels. Experimental results demonstrate the excellent performance of the model, achieving an F1 score of 0.752, and an mAP@0.5 of 0.791. The model also exhibits robustness under occlusion and varying illumination conditions, attaining a zero-shot mAP@0.5 of 0.626 for unseen fruits. In addition, the system operates at an inference speed of 54.82 FPS, effectively balancing speed and accuracy, and shows practical potential for smart agriculture. This research provides new insights and methods for the practical application of smart agriculture. Full article
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)
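A compact sketch of the two contrastive signals named in the abstract (image–text and image–image), using a standard symmetric InfoNCE formulation over a batch of paired embeddings; the embedding sources and dimensions are assumed for illustration:

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE: row i of `a` matches row i of `b`."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

img = torch.randn(16, 512)       # image embeddings
txt = torch.randn(16, 512)       # matching text embeddings
img_aug = torch.randn(16, 512)   # embeddings of augmented views of the same images
loss = info_nce(img, txt) + info_nce(img, img_aug)
print(loss.item())
```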

28 pages, 5257 KiB  
Article
Comparative Evaluation of Sequential Neural Network (GRU, LSTM, Transformer) Within Siamese Networks for Enhanced Job–Candidate Matching in Applied Recruitment Systems
by Mateusz Łępicki, Tomasz Latkowski, Izabella Antoniuk, Michał Bukowski, Bartosz Świderski, Grzegorz Baranik, Bogusz Nowak, Robert Zakowicz, Łukasz Dobrakowski, Bogdan Act and Jarosław Kurek
Appl. Sci. 2025, 15(11), 5988; https://doi.org/10.3390/app15115988 - 26 May 2025
Viewed by 888
Abstract
Job–candidate matching is pivotal in recruitment, yet traditional manual or keyword-based methods can be laborious and prone to missing qualified candidates. In this study, we introduce the first Siamese framework that systematically contrasts GRU, LSTM, and Transformer sequential heads on top of a multilingual Sentence Transformer backbone, which is trained end-to-end with triplet loss on real-world recruitment data. This combination captures both long-range dependencies across document segments and global semantics, representing a substantial advance over approaches that rely solely on static embeddings. We compare the three heads using ranking metrics such as Top-K accuracy and Mean Reciprocal Rank (MRR). The Transformer-based model yields the best overall performance, with an MRR of 0.979 and a Top-100 accuracy of 87.20% on the test set. Visualization of learned embeddings (t-SNE) shows that self-attention more effectively clusters matching texts and separates them from irrelevant ones. These findings underscore the potential of combining multilingual base embeddings with specialized sequential layers to reduce manual screening efforts and improve recruitment efficiency. Full article
(This article belongs to the Special Issue Innovations in Artificial Neural Network Applications)
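For context, a short sketch of how the ranking metrics used above (MRR and Top-K accuracy) are typically computed from a job-to-candidate similarity matrix; the matrix and the assumption that the i-th job matches the i-th candidate are illustrative:

```python
import numpy as np

def mrr_and_topk(sim: np.ndarray, k: int = 100):
    """Mean Reciprocal Rank and Top-K accuracy; sim[i, j] scores job i vs. candidate j."""
    order = np.argsort(-sim, axis=1)                            # candidates ranked per job
    ranks = np.argmax(order == np.arange(sim.shape[0])[:, None], axis=1) + 1
    return (1.0 / ranks).mean(), (ranks <= k).mean()

sim = np.random.rand(500, 500)
np.fill_diagonal(sim, sim.diagonal() + 0.5)                     # make true pairs score higher (toy)
print(mrr_and_topk(sim, k=100))
```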

19 pages, 1636 KiB  
Article
Scene Graph and Natural Language-Based Semantic Image Retrieval Using Vision Sensor Data
by Jaehoon Kim and Byoung Chul Ko
Sensors 2025, 25(11), 3252; https://doi.org/10.3390/s25113252 - 22 May 2025
Viewed by 1010
Abstract
Text-based image retrieval is one of the most common approaches for searching images acquired from vision sensors such as cameras. However, this method suffers from limitations in retrieval accuracy, particularly when the query contains limited information or involves previously unseen sentences. These challenges arise because keyword-based matching fails to adequately capture contextual and semantic meanings. To address these limitations, we propose a novel approach that transforms sentences and images into semantic graphs and scene graphs, enabling a quantitative comparison between them. Specifically, we utilize a graph neural network (GNN) to learn features of nodes and edges and generate graph embeddings, enabling image retrieval through natural language queries without relying on additional image metadata. We introduce a contrastive GNN-based framework that matches semantic graphs with scene graphs to retrieve semantically similar images. In addition, we incorporate a hard negative mining strategy, allowing the model to effectively learn from more challenging negative samples. The experimental results on the Visual Genome dataset show that the proposed method achieves a top nDCG@50 score of 0.745, improving retrieval performance by approximately 7.7 percentage points compared to random sampling with full graphs. This confirms that the model effectively retrieves semantically relevant images by structurally interpreting complex scenes. Full article
(This article belongs to the Special Issue Vision Sensors for Object Detection and Tracking)
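A minimal sketch of in-batch hard negative mining of the kind mentioned above: for each semantic-graph query, the negative is the most similar non-matching scene-graph embedding. The function and embeddings are illustrative assumptions, not the paper's GNN:

```python
import torch
import torch.nn.functional as F

def hard_negative_triplet_loss(query_emb, graph_emb, margin: float = 0.3):
    """Triplet loss where each query's negative is its hardest in-batch mismatch."""
    q, g = F.normalize(query_emb, dim=-1), F.normalize(graph_emb, dim=-1)
    sim = q @ g.t()                                # (B, B) cosine similarities
    pos = sim.diag()                               # matching pairs on the diagonal
    sim = sim - torch.eye(sim.size(0)) * 10.0      # exclude positives from the max
    hardest_neg = sim.max(dim=1).values
    return F.relu(hardest_neg - pos + margin).mean()

loss = hard_negative_triplet_loss(torch.randn(32, 256), torch.randn(32, 256))
print(loss.item())
```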

28 pages, 9332 KiB  
Article
Contrastive Learning-Based Cross-Modal Fusion for Product Form Imagery Recognition: A Case Study on New Energy Vehicle Front-End Design
by Yutong Zhang, Jiantao Wu, Li Sun and Guoan Yang
Sustainability 2025, 17(10), 4432; https://doi.org/10.3390/su17104432 - 13 May 2025
Viewed by 659
Abstract
Fine-grained feature extraction and affective semantic mapping remain significant challenges in product form analysis. To address these issues, this study proposes a contrastive learning-based cross-modal fusion approach for product form imagery recognition, using the front-end design of new energy vehicles (NEVs) as a case study. The proposed method first employs the Biterm Topic Model (BTM) and Analytic Hierarchy Process (AHP) to extract thematic patterns and compute weight distributions from consumer review texts, thereby identifying key imagery style labels. These labels are then leveraged for image annotation, facilitating the construction of a multimodal dataset. Next, ResNet-50 and Transformer architectures serve as the image and text encoders, respectively, to extract and represent multimodal features. To ensure effective alignment and deep fusion of textual and visual representations in a shared embedding space, a contrastive learning mechanism is introduced, optimizing cosine similarity between positive and negative sample pairs. Finally, a fully connected multilayer network is integrated at the output of the Transformer and ResNet with Contrastive Learning (TRCL) model to enhance classification accuracy and reliability. Comparative experiments against various deep convolutional neural networks (DCNNs) demonstrate that the TRCL model effectively integrates semantic and visual information, significantly improving the accuracy and robustness of complex product form imagery recognition. These findings suggest that the proposed method holds substantial potential for large-scale product appearance evaluation and affective cognition research. Moreover, this data-driven fusion underpins sustainable product form design by streamlining evaluation and optimizing resource use. Full article
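A small sketch of a pairwise cosine-similarity contrastive objective over a shared embedding space, as a generic stand-in for the positive/negative-pair optimization described above; the encoders, dimensions, and targets are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Matched (image, style-label text) pairs get target +1, mismatched pairs -1;
# nn.CosineEmbeddingLoss pushes cosine similarity up or down accordingly.
criterion = nn.CosineEmbeddingLoss(margin=0.2)

image_emb = F.normalize(torch.randn(8, 256), dim=-1)   # e.g. projected image features
text_emb = F.normalize(torch.randn(8, 256), dim=-1)    # e.g. projected text features
targets = torch.tensor([1, 1, 1, 1, -1, -1, -1, -1])   # first half matched, rest mismatched

loss = criterion(image_emb, text_emb, targets)
print(loss.item())
```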

20 pages, 12809 KiB  
Article
Visual Prompt Learning of Foundation Models for Post-Disaster Damage Evaluation
by Fei Zhao, Chengcui Zhang, Runlin Zhang and Tianyang Wang
Remote Sens. 2025, 17(10), 1664; https://doi.org/10.3390/rs17101664 - 8 May 2025
Viewed by 743
Abstract
In response to the urgent need for rapid and precise post-disaster damage evaluation, this study introduces the Visual Prompt Damage Evaluation (ViPDE) framework, a novel contrastive learning-based approach that leverages the embedded knowledge within the Segment Anything Model (SAM) and pairs of remote sensing images to enhance building damage assessment. In this framework, we propose a learnable cascaded Visual Prompt Generator (VPG) that provides semantic visual prompts, guiding SAM to effectively analyze pre- and post-disaster image pairs and construct a nuanced representation of the affected areas at different stages. By keeping the foundation model’s parameters frozen, ViPDE significantly enhances training efficiency compared with traditional full-model fine-tuning methods. This parameter-efficient approach reduces computational costs and accelerates deployment in emergency scenarios. Moreover, our model demonstrates robustness across diverse disaster types and geographic locations. Beyond mere binary assessments, our model distinguishes damage levels with a finer granularity, categorizing them on a scale from 1 (no damage) to 4 (destroyed). Extensive experiments validate the effectiveness of ViPDE, showcasing its superior performance over existing methods. Comparative evaluations demonstrate that ViPDE achieves an F1 score of 0.7014. This foundation model-based approach sets a new benchmark in disaster management. It also pioneers a new practical architectural paradigm for foundation model-based contrastive learning focused on specific objects of interest. Full article
(This article belongs to the Special Issue Advanced Satellite Remote Sensing for Geohazards)
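A toy sketch of the parameter-efficient setup described above, where the foundation model stays frozen and only a small prompt-generator module receives gradients. The backbone here is a stand-in, not the actual Segment Anything Model, and the objective is a placeholder:

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(64, 64, 3, padding=1))    # frozen "foundation" encoder
prompt_generator = nn.Conv2d(64, 64, 1)                      # learnable visual prompts (toy)

for p in backbone.parameters():
    p.requires_grad_(False)                                   # freeze the backbone

optimizer = torch.optim.AdamW(prompt_generator.parameters(), lr=1e-4)

pre, post = torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128)
feats = backbone(post) - backbone(pre)                        # crude pre/post change features
loss = prompt_generator(feats).mean()                         # placeholder objective
loss.backward()
optimizer.step()
print("trainable parameters:", sum(p.numel() for p in prompt_generator.parameters()))
```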

18 pages, 7130 KiB  
Article
Improving Cerebrovascular Imaging with Deep Learning: Semantic Segmentation for Time-of-Flight Magnetic Resonance Angiography Maximum Intensity Projection Image Enhancement
by Tomonari Yamada, Takaaki Yoshimura, Shota Ichikawa and Hiroyuki Sugimori
Appl. Sci. 2025, 15(6), 3034; https://doi.org/10.3390/app15063034 - 11 Mar 2025
Viewed by 1003
Abstract
Magnetic Resonance Angiography (MRA) is widely used for cerebrovascular assessment, with Time-of-Flight (TOF) MRA being a common non-contrast imaging technique. However, maximum intensity projection (MIP) images generated from TOF-MRA often include non-essential vascular structures such as external carotid branches, requiring manual editing for accurate visualization of intracranial arteries. This study proposes a deep learning-based semantic segmentation approach to automate the removal of these structures, enhancing MIP image clarity while reducing manual workload. Using DeepLab v3+, a convolutional neural network model optimized for segmentation accuracy, the method achieved an average Dice Similarity Coefficient (DSC) of 0.9615 and an Intersection over Union (IoU) of 0.9261 across five-fold cross-validation. The developed system processed MRA datasets at an average speed of 16.61 frames per second, demonstrating real-time feasibility. A dedicated software tool was implemented to apply the segmentation model directly to DICOM images, enabling fully automated MIP image generation. While the model effectively removed most external carotid structures, further refinement is needed to improve venous structure suppression. These results indicate that deep learning can provide an efficient and reliable approach for automated cerebrovascular image processing, with potential applications in clinical workflows and neurovascular disease diagnosis. Full article
(This article belongs to the Special Issue MR-Based Neuroimaging)
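A brief sketch of the post-processing step the abstract automates: applying a vessel-keeping mask to a TOF-MRA volume and then taking a maximum intensity projection. Array shapes and the random toy data are assumptions:

```python
import numpy as np

def masked_mip(volume: np.ndarray, vessel_mask: np.ndarray, axis: int = 0):
    """MIP of a TOF-MRA volume after zeroing voxels the segmentation marked
    as non-essential (e.g. external carotid branches)."""
    cleaned = np.where(vessel_mask, volume, 0)
    return cleaned.max(axis=axis)

volume = np.random.rand(160, 256, 256).astype(np.float32)   # slices x H x W (toy)
mask = np.random.rand(160, 256, 256) > 0.3                   # keep-voxel mask from the CNN
mip = masked_mip(volume, mask)
print(mip.shape)  # (256, 256)
```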