Search Results (73)

Search Parameters:
Keywords = cross-modal transfer

33 pages, 15612 KiB  
Article
A Personalized Multimodal Federated Learning Framework for Skin Cancer Diagnosis
by Shuhuan Fan, Awais Ahmed, Xiaoyang Zeng, Rui Xi and Mengshu Hou
Electronics 2025, 14(14), 2880; https://doi.org/10.3390/electronics14142880 - 18 Jul 2025
Viewed by 317
Abstract
Skin cancer is one of the most prevalent forms of cancer worldwide, and early and accurate diagnosis critically impacts patient outcomes. Given the sensitive nature of medical data and its fragmented distribution across institutions (data silos), privacy-preserving collaborative learning is essential to enable knowledge-sharing without compromising patient confidentiality. While federated learning (FL) offers a promising solution, existing methods struggle with heterogeneous and missing modalities across institutions, which reduce diagnostic accuracy. To address these challenges, we propose an effective and flexible Personalized Multimodal Federated Learning framework (PMM-FL), which enables efficient cross-client knowledge transfer while maintaining personalized performance under heterogeneous and incomplete modality conditions. Our study contains three key contributions: (1) A hierarchical aggregation strategy that decouples multi-module aggregation from local deployment via global modular-separated aggregation and local client fine-tuning. Unlike conventional FL (which synchronizes all parameters in each round), our method adopts a frequency-adaptive synchronization mechanism, updating parameters based on their stability and functional roles. (2) A multimodal fusion approach based on multitask learning, integrating learnable modality imputation and attention-based feature fusion to handle missing modalities. (3) A custom dataset combining multi-year International Skin Imaging Collaboration (ISIC) challenge data (2018–2024) to ensure comprehensive coverage of diverse skin cancer types. We evaluate PMM-FL through diverse experiment settings, demonstrating its effectiveness in heterogeneous and incomplete modality federated learning settings: it achieves 92.32% diagnostic accuracy, with only a 2% accuracy drop under 30% modality missingness and a 32.9% reduction in communication overhead compared with baseline FL methods. Full article
(This article belongs to the Special Issue Multimodal Learning and Transfer Learning)
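
The frequency-adaptive synchronization idea in contribution (1) can be illustrated with a short sketch; the toy model, module names, and per-module schedule below are illustrative assumptions, not the authors' PMM-FL implementation.

```python
# Illustrative sketch only: stable/shared modules are averaged across clients
# every round, while other blocks sync less often and personalized heads stay
# local. The model, module names, and schedule are assumptions.
import copy
import torch
import torch.nn as nn

class ToyClientModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(32, 16)   # shared representation (synced often)
        self.fusion = nn.Linear(16, 16)    # fusion block (synced occasionally)
        self.head = nn.Linear(16, 7)       # personalized classifier (kept local)

    def forward(self, x):
        return self.head(torch.relu(self.fusion(torch.relu(self.encoder(x)))))

# Hypothetical per-module synchronization schedule (in rounds); None = never averaged.
SYNC_EVERY = {"encoder": 1, "fusion": 5, "head": None}

def federated_round(clients, rnd):
    """Average only the parameter groups scheduled for this round (FedAvg-style)."""
    for module_name, every in SYNC_EVERY.items():
        if every is None or rnd % every != 0:
            continue  # this module stays local this round
        states = [getattr(c, module_name).state_dict() for c in clients]
        avg = {k: torch.stack([s[k] for s in states]).mean(dim=0) for k in states[0]}
        for c in clients:
            getattr(c, module_name).load_state_dict(copy.deepcopy(avg))

clients = [ToyClientModel() for _ in range(4)]
for rnd in range(1, 11):
    # ... local training on each client's (possibly incomplete-modality) data ...
    federated_round(clients, rnd)
```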

17 pages, 3612 KiB  
Article
MPVT: An Efficient Multi-Modal Prompt Vision Tracker for Visual Target Tracking
by Jianyu Xie, Yan Fu, Junlin Zhou, Tianxiang He, Xiaopeng Wang, Yuke Fang and Duanbing Chen
Appl. Sci. 2025, 15(14), 7967; https://doi.org/10.3390/app15147967 - 17 Jul 2025
Viewed by 248
Abstract
Visual target tracking is a fundamental task in computer vision. Combining multi-modal information with tracking leverages complementary information, which improves the precision and robustness of trackers. Traditional multi-modal tracking methods typically employ a full fine-tuning scheme, i.e., fine-tuning pre-trained single-modal models to multi-modal tasks. However, this approach suffers from low transfer learning efficiency, catastrophic forgetting, and high cross-task deployment costs. To address these issues, we propose an efficient model named multi-modal prompt vision tracker (MPVT) based on an efficient prompt-tuning paradigm. Three key components are involved in the model: a decoupled input enhancement module, a dynamic adaptive prompt fusion module, and a fully connected head network module. The decoupled input enhancement module enhances input representations via positional and type embedding. The dynamic adaptive prompt fusion module achieves efficient prompt tuning and multi-modal interaction using scaled convolution and low-rank cross-modal attention mechanisms. The fully connected head network module addresses the shortcomings of traditional convolutional head networks such as inductive biases. Experimental results from RGB-T, RGB-D, and RGB-E scenarios show that MPVT outperforms state-of-the-art methods. Moreover, MPVT can save 43.8% GPU memory usage and reduce training time by 62.9% compared with a full-parameter fine-tuning model. Full article
(This article belongs to the Special Issue Advanced Technologies Applied for Object Detection and Tracking)
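
The low-rank cross-modal attention used for prompt fusion is the kind of block sketched below; dimensions, names, and the residual design are assumptions rather than the published MPVT code.

```python
# Hedged sketch: frozen RGB-stream tokens attend to an auxiliary modality
# (thermal / depth / event) through rank-r projections, and the result is added
# back as a prompt. Sizes and the rank are arbitrary choices for illustration.
import torch
import torch.nn as nn

class LowRankCrossModalAttention(nn.Module):
    def __init__(self, dim=256, rank=16):
        super().__init__()
        self.q = nn.Linear(dim, rank, bias=False)    # rank-r query projection
        self.k = nn.Linear(dim, rank, bias=False)    # rank-r key projection
        self.v = nn.Linear(dim, rank, bias=False)
        self.out = nn.Linear(rank, dim, bias=False)  # project back to model dim
        self.scale = rank ** -0.5

    def forward(self, rgb_tokens, aux_tokens):
        # rgb_tokens: (B, N, dim); aux_tokens: (B, M, dim)
        attn = torch.softmax(
            self.q(rgb_tokens) @ self.k(aux_tokens).transpose(-2, -1) * self.scale,
            dim=-1)                                   # (B, N, M)
        fused = self.out(attn @ self.v(aux_tokens))   # (B, N, dim)
        return rgb_tokens + fused                     # residual prompt on the frozen stream

x_rgb = torch.randn(2, 196, 256)
x_aux = torch.randn(2, 196, 256)
print(LowRankCrossModalAttention()(x_rgb, x_aux).shape)  # torch.Size([2, 196, 256])
```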

17 pages, 1913 KiB  
Article
CropSTS: A Remote Sensing Foundation Model for Cropland Classification with Decoupled Spatiotemporal Attention
by Jian Yan, Xingfa Gu and Yuxing Chen
Remote Sens. 2025, 17(14), 2481; https://doi.org/10.3390/rs17142481 - 17 Jul 2025
Viewed by 379
Abstract
Recent progress in geospatial foundation models (GFMs) has demonstrated strong generalization capabilities for remote sensing downstream tasks. However, existing GFMs still struggle with fine-grained cropland classification due to ambiguous field boundaries, insufficient and low-efficient temporal modeling, and limited cross-regional adaptability. In this paper, we propose CropSTS, a remote sensing foundation model designed with a decoupled temporal–spatial attention architecture, specifically tailored for the temporal dynamics of cropland remote sensing data. To efficiently pre-train the model under limited labeled data, we employ a hybrid framework combining joint-embedding predictive architecture with knowledge distillation from web-scale foundation models. Despite being trained on a small dataset and using a compact model, CropSTS achieves state-of-the-art performance on the PASTIS-R benchmark in terms of mIoU and F1-score. Our results validate that structural optimization for temporal encoding and cross-modal knowledge transfer constitute effective strategies for advancing GFM design in agricultural remote sensing. Full article
(This article belongs to the Special Issue Advanced AI Technology for Remote Sensing Analysis)
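
A decoupled spatiotemporal attention block of the sort the abstract describes might look like the following sketch; the tensor layout and sizes are assumptions, not the CropSTS architecture.

```python
# Minimal sketch: for a satellite image time series, self-attention runs along
# the temporal axis per patch, then along the spatial axis per time step,
# instead of jointly over all time-space tokens.
import torch
import torch.nn as nn

class DecoupledSTBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, T, P, C) -- batch, time steps, spatial patches, channels
        B, T, P, C = x.shape
        # Temporal attention: each patch attends across its own time series.
        xt = x.permute(0, 2, 1, 3).reshape(B * P, T, C)
        xt = xt + self.temporal_attn(self.norm1(xt), self.norm1(xt), self.norm1(xt))[0]
        x = xt.reshape(B, P, T, C).permute(0, 2, 1, 3)
        # Spatial attention: each time step attends across patches.
        xs = x.reshape(B * T, P, C)
        xs = xs + self.spatial_attn(self.norm2(xs), self.norm2(xs), self.norm2(xs))[0]
        return xs.reshape(B, T, P, C)

x = torch.randn(2, 12, 64, 64)   # 12 acquisition dates, 64 patches, 64 channels
print(DecoupledSTBlock()(x).shape)
```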

23 pages, 7732 KiB  
Article
Vocabulary Retention Under Multimodal Coupling Strength Index (MCSI): Insights from Eye Tracking
by Qiyue Tang and Chen Chen
Appl. Sci. 2025, 15(14), 7645; https://doi.org/10.3390/app15147645 - 8 Jul 2025
Viewed by 210
Abstract
This eye-tracking investigation employed a 2 × 2 experimental design to examine multimodal lexical encoding processes. Eighty participants were systematically assigned to four conditions: Group A (text-only), Group B (text + image), Group C (text + sound), and Group D (text + image + sound). The results demonstrated significantly superior recall accuracy in Group D (92.00%) compared with the other conditions (Group B: 82.07%; Group C: 76.00%; Group A: 59.60%; p < 0.001), confirming robust audiovisual synergy. The novel Multimodal Coupling Strength Index (MCSI) dynamically quantified crossmodal integration efficacy through eye-tracking metrics (Attentional Synchronization Coefficient, ASC; Saccade Duration–Fixation Duration differential, SD-FD), revealing significantly stronger coupling in audiovisual conditions (C/D: 0.71; B/D: 0.54). Crucially, the established MCSI provides a transferable diagnostic framework for evaluating multimodal integration efficiency in learning environments. Full article

18 pages, 1568 KiB  
Article
Coupling of Temporal-Check-All-That-Apply and Nose-Space Analysis to Investigate the In Vivo Flavor Perception of Extra Virgin Olive Oil and Carriers’ Impact
by Danny Cliceri, Iuliia Khomenko, Franco Biasioli, Flavia Gasperi and Eugenio Aprea
Foods 2025, 14(13), 2343; https://doi.org/10.3390/foods14132343 - 1 Jul 2025
Viewed by 320
Abstract
The perceived quality of extra virgin olive oil (EVOO) arises from the multisensory integration of multimodal stimuli, primarily driven by non-volatile and volatile organic compounds (VOCs). Given that EVOO is frequently consumed in combination with other foods, cross-modal interactions, encompassing both internal and external elements, play a crucial role in shaping its sensory perception. A more realistic representation of EVOO perception can be achieved by considering these cross-modal effects and their temporal dynamics. This study employed dynamic sensory and instrumental techniques to investigate the product-related mechanisms that influence EVOO flavor perception. Ten trained panelists (mean age = 41.5 years; 50% female) evaluated two EVOO samples under two consumption conditions: alone and accompanied by a solid carrier (bread or chickpeas). Temporal Check-All-That-Apply (TCATA) and nose-space analysis using Proton-Transfer-Reaction Time-of-Flight Mass Spectrometry (PTR-ToF-MS) were conducted simultaneously. Sensory descriptors and mass spectral peaks were analyzed through temporal curve indices (Area Under the Curve, Maximum Citation/Concentration, Time to Maximum), which were then used to construct multi-dimensional sensory and VOC release maps. Findings revealed that the composition and texture of the food carriers had a greater influence on temporal flavor perception than the variability in VOCs released by the different EVOO samples. These results underscore the importance of considering cross-modal sensory interactions when predicting EVOO flavor perception. The carriers modulated both the perception and VOC release, with effects dependent on their specific composition and texture. This methodological approach enabled a deeper understanding of the dynamic relationship between VOC release and EVOO sensory experience. Full article
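
The temporal curve indices named above (area under the curve, maximum citation, time to maximum) reduce to a few lines of arithmetic; the TCATA citation curve below is invented purely for illustration and is not data from the study.

```python
# Toy example: temporal curve indices for one descriptor's citation curve.
import numpy as np

time_s = np.arange(0, 61, 5)                          # evaluation time points (s)
citation = np.array([0.0, 0.1, 0.4, 0.7, 0.8, 0.75,   # proportion of panelists
                     0.6, 0.5, 0.35, 0.2, 0.1, 0.05, 0.0])  # citing the descriptor

# Area Under the Curve via the trapezoid rule (proportion * seconds).
auc = np.sum((citation[1:] + citation[:-1]) / 2 * np.diff(time_s))
c_max = citation.max()                    # Maximum citation proportion
t_max = time_s[citation.argmax()]         # Time to maximum (s)

print(f"AUC = {auc:.2f}, Cmax = {c_max:.2f}, Tmax = {t_max} s")
```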

24 pages, 1664 KiB  
Review
A Comprehensive Review of Multimodal Emotion Recognition: Techniques, Challenges, and Future Directions
by You Wu, Qingwei Mi and Tianhan Gao
Biomimetics 2025, 10(7), 418; https://doi.org/10.3390/biomimetics10070418 - 27 Jun 2025
Viewed by 1639
Abstract
This paper presents a comprehensive review of multimodal emotion recognition (MER), a process that integrates multiple data modalities such as speech, visual, and text to identify human emotions. Grounded in biomimetics, the survey frames MER as a bio-inspired sensing paradigm that emulates the way humans seamlessly fuse multisensory cues to communicate affect, thereby transferring principles from living systems to engineered solutions. By leveraging various modalities, MER systems offer a richer and more robust analysis of emotional states compared to unimodal approaches. The review covers the general structure of MER systems, feature extraction techniques, and multimodal information fusion strategies, highlighting key advancements and milestones. Additionally, it addresses the research challenges and open issues in MER, including lightweight models, cross-corpus generalizability, and the incorporation of additional modalities. The paper concludes by discussing future directions aimed at improving the accuracy, explainability, and practicality of MER systems for real-world applications. Full article
(This article belongs to the Special Issue Intelligent Human–Robot Interaction: 4th Edition)

14 pages, 4768 KiB  
Article
Deep Learning with Transfer Learning on Digital Breast Tomosynthesis: A Radiomics-Based Model for Predicting Breast Cancer Risk
by Francesca Galati, Roberto Maroncelli, Chiara De Nardo, Lucia Testa, Gloria Barcaroli, Veronica Rizzo, Giuliana Moffa and Federica Pediconi
Diagnostics 2025, 15(13), 1631; https://doi.org/10.3390/diagnostics15131631 - 26 Jun 2025
Viewed by 446
Abstract
Background: Digital breast tomosynthesis (DBT) is a valuable imaging modality for breast cancer detection; however, its interpretation remains time-consuming and subject to inter-reader variability. This study aimed to develop and evaluate two deep learning (DL) models based on transfer learning for the binary classification of breast lesions (benign vs. malignant) using DBT images to support clinical decision-making and risk stratification. Methods: In this retrospective monocentric study, 184 patients with histologically or clinically confirmed benign (107 cases, 41.8%) or malignant (77 cases, 58.2%) breast lesions were included. Each case underwent DBT with a single lesion manually segmented for radiomic analysis. Two convolutional neural network (CNN) architectures—ResNet50 and DenseNet201—were trained using transfer learning from ImageNet weights. A 10-fold cross-validation strategy with ensemble voting was applied. Model performance was evaluated through ROC–AUC, accuracy, sensitivity, specificity, PPV, and NPV. Results: The ResNet50 model outperformed DenseNet201 across most metrics. On the internal testing set, ResNet50 achieved a ROC–AUC of 63%, accuracy of 60%, sensitivity of 39%, and specificity of 75%. The DenseNet201 model yielded a lower ROC–AUC of 55%, accuracy of 55%, and sensitivity of 24%. Both models demonstrated relatively high specificity, indicating potential utility in ruling out malignancy, though sensitivity remained suboptimal. Conclusions: This study demonstrates the feasibility of using transfer learning-based DL models for lesion classification on DBT. While the overall performance was moderate, the results highlight both the potential and current limitations of AI in breast imaging. Further studies and approaches are warranted to enhance model robustness and clinical applicability. Full article
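
A generic version of the transfer-learning setup described (ResNet50 initialized from ImageNet weights and re-headed for benign-vs-malignant classification) is sketched below; the freezing policy, hyperparameters, and toy inputs are assumptions, not the authors' pipeline.

```python
# Minimal transfer-learning sketch: pretrained backbone, new two-class head.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the pretrained backbone; train only the new classification head.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)   # benign vs malignant

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One toy training step on random tensors standing in for DBT slices.
x = torch.randn(4, 3, 224, 224)
y = torch.randint(0, 2, (4,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(f"toy loss: {loss.item():.3f}")
```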

28 pages, 11793 KiB  
Article
Unsupervised Multimodal UAV Image Registration via Style Transfer and Cascade Network
by Xiaoye Bi, Rongkai Qie, Chengyang Tao, Zhaoxiang Zhang and Yuelei Xu
Remote Sens. 2025, 17(13), 2160; https://doi.org/10.3390/rs17132160 - 24 Jun 2025
Cited by 1 | Viewed by 388
Abstract
Cross-modal image registration for unmanned aerial vehicle (UAV) platforms presents significant challenges due to large-scale deformations, distinct imaging mechanisms, and pronounced modality discrepancies. This paper proposes a novel multi-scale cascaded registration network based on style transfer that achieves superior performance: up to 67% reduction in mean squared error (from 0.0106 to 0.0068), 9.27% enhancement in normalized cross-correlation, 26% improvement in local normalized cross-correlation, and 8% increase in mutual information compared to state-of-the-art methods. The architecture integrates a cross-modal style transfer network (CSTNet) that transforms visible images into pseudo-infrared representations to unify modality characteristics, and a multi-scale cascaded registration network (MCRNet) that performs progressive spatial alignment across multiple resolution scales using diffeomorphic deformation modeling to ensure smooth and invertible transformations. A self-supervised learning paradigm based on image reconstruction eliminates reliance on manually annotated data while maintaining registration accuracy through synthetic deformation generation. Extensive experiments on the LLVIP dataset demonstrate the method’s robustness under challenging conditions involving large-scale transformations, with ablation studies confirming that style transfer contributes 28% MSE improvement and diffeomorphic registration prevents 10.6% performance degradation. The proposed approach provides a robust solution for cross-modal image registration in dynamic UAV environments, offering significant implications for downstream applications such as target detection, tracking, and surveillance. Full article
(This article belongs to the Special Issue Advances in Deep Learning Approaches: UAV Data Analysis)
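
The spatial-warping step at the heart of such registration networks can be sketched as follows; the displacement field here is random, whereas the paper's cascade predicts it at multiple scales under diffeomorphic (smooth, invertible) constraints.

```python
# Sketch: apply a dense displacement field to a moving image via grid_sample.
import torch
import torch.nn.functional as F

def warp(moving, flow):
    """moving: (B, C, H, W); flow: (B, 2, H, W) displacements in pixels (x, y)."""
    B, _, H, W = moving.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # identity grid
    new = grid + flow                                          # displaced grid
    # Normalize to [-1, 1] as grid_sample expects, then reorder to (B, H, W, 2).
    new[:, 0] = 2 * new[:, 0] / (W - 1) - 1
    new[:, 1] = 2 * new[:, 1] / (H - 1) - 1
    return F.grid_sample(moving, new.permute(0, 2, 3, 1), align_corners=True)

moving = torch.rand(1, 1, 64, 64)           # e.g. a pseudo-infrared image
flow = torch.randn(1, 2, 64, 64) * 2.0      # toy displacement field (pixels)
print(warp(moving, flow).shape)             # torch.Size([1, 1, 64, 64])
```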

20 pages, 3598 KiB  
Article
Transfer Learning Model for Crack Detection in Side Slopes Based on Crack-Net
by Na Li, Yilong Zhang, Qing Zhang and Shaoguang Zhu
Appl. Sci. 2025, 15(13), 6951; https://doi.org/10.3390/app15136951 - 20 Jun 2025
Viewed by 400
Abstract
Accurate detection of slope cracks plays a crucial role in early landslide disaster warning; however, traditional approaches often struggle to identify fine and irregular cracks. This study introduces a novel deep learning model, Crack-Net, which leverages a multi-modal feature fusion mechanism and is developed using transfer learning. To resolve the blurred representation of small-scale cracks, a nonlinear frequency-domain mapping module is employed to decouple amplitude and phase information, while a cross-domain attention mechanism facilitates adaptive feature fusion. In addition, a deep feature fusion module integrating deformable convolution and a dual attention mechanism is embedded within the encoder–decoder architecture to enhance multi-scale feature interactions and preserve crack topology. The model is pre-trained on the CrackVision12K dataset and fine-tuned on a custom dataset of slope cracks, effectively addressing performance degradation in small-sample scenarios. Experimental results show that Crack-Net achieves an average accuracy of 92.1%, outperforming existing models such as DeepLabV3 and CrackFormer by 9.4% and 5.4%, respectively. Furthermore, the use of transfer learning improves the average precision by 1.6%, highlighting the model’s strong generalization capability and practical effectiveness in real-world slope crack detection. Full article

24 pages, 5869 KiB  
Article
On Data Selection and Regularization for Underdetermined Vibro-Acoustic Source Identification
by Laixu Jiang, Jingqiao Liu, Xin Jiang and Yuezhao Pang
Sensors 2025, 25(12), 3767; https://doi.org/10.3390/s25123767 - 16 Jun 2025
Viewed by 364
Abstract
The number of hologram points in near-field acoustical holography (NAH) for a vibro-acoustic system plays a vital role in conditioning the transfer function between the source and measuring points. The requirement for many overdetermined hologram points for extended sources to obtain high accuracy poses a problem for the practical applications of NAH. Furthermore, overdetermination does not generally ensure enhanced accuracy, stability, and convergence, owing to the problem of rank deficiency. To achieve satisfactory reconstruction accuracy with underdetermined hologram data, the best practice for choosing hologram points and regularization methods is determined by comparing cross-linked sets of data-sorting and regularization methods. Three typical data selection and treatment methods are compared: iterative discarding of the most dependent data, monitoring singular value changes during the data reduction process, and zero padding in the patch holography technique. To test the regularization method for inverse conditioning, which is used together with the data selection method, the Tikhonov method, Bayesian regularization, and the data compression method are compared. The inverse equivalent source method is chosen as the holography method, and a numerical test is conducted with a point-excited thin plate. The simulation results show that selecting hologram points using the effective independence method, combined with regularization via compressed sensing, significantly reduces the reconstruction error and enhances the modal assurance criterion value. The experimental results also support the proposed best practice for inverting underdetermined hologram data by integrating the NAH data selection and regularization techniques. Full article
(This article belongs to the Section Fault Diagnosis & Sensors)
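
As a toy illustration of why regularization matters when inverting an underdetermined, ill-conditioned transfer matrix, the sketch below contrasts an unregularized pseudo-inverse with Tikhonov regularization (one of the methods compared); the sizes, noise level, and regularization weight are arbitrary, and this is not the paper's equivalent-source code or its recommended best-practice combination.

```python
# Toy underdetermined inverse problem: noise blows up through tiny singular
# values in the unregularized inverse; Tikhonov damps those directions.
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 40                                   # fewer hologram points than sources
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -6, m)                       # rapidly decaying singular values
G = U @ np.diag(s) @ V[:m, :]                   # ill-conditioned transfer matrix

q_true = np.zeros(n)
q_true[[3, 17, 28]] = [1.0, -0.5, 0.8]          # a few active equivalent sources
p = G @ q_true + 1e-4 * rng.standard_normal(m)  # noisy hologram pressures

q_pinv = np.linalg.pinv(G) @ p                  # unregularized minimum-norm inverse
lam = 1e-4                                      # Tikhonov: (G^T G + lam I) q = G^T p
q_tik = np.linalg.solve(G.T @ G + lam * np.eye(n), G.T @ p)

for name, q in [("pinv", q_pinv), ("Tikhonov", q_tik)]:
    err = np.linalg.norm(q - q_true) / np.linalg.norm(q_true)
    print(f"{name:9s} relative reconstruction error = {err:.2f}")
```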

15 pages, 7136 KiB  
Article
Source-Free Domain Adaptation for Cross-Modality Abdominal Multi-Organ Segmentation Challenges
by Xiyu Zhang, Xu Chen, Yang Wang, Dongliang Liu and Yifeng Hong
Information 2025, 16(6), 460; https://doi.org/10.3390/info16060460 - 29 May 2025
Viewed by 422
Abstract
Abdominal organ segmentation in CT images is crucial for accurate diagnosis, treatment planning, and condition monitoring. However, the annotation process is often hindered by challenges such as low contrast, artifacts, and complex organ structures. While unsupervised domain adaptation (UDA) has shown promise in addressing these issues by transferring knowledge from a different modality (source domain), its reliance on both source and target data during training presents a practical challenge in many clinical settings due to data privacy concerns. This study aims to develop a cross-modality abdominal multi-organ segmentation model for label-free CT (target domain) data, leveraging knowledge solely from a pre-trained source domain (MRI) model without accessing the source data. To achieve this, we generate source-like images from target-domain images using a one-way image translation approach with the pre-trained model. These synthesized images preserve the anatomical structure of the target, enabling segmentation predictions from the pre-trained model. To further enhance segmentation accuracy, particularly for organ boundaries and small contours, we introduce an auxiliary translation module with an image decoder and multi-level discriminator. The results demonstrate significant improvements across several performance metrics, including the Dice similarity coefficient (DSC) and average symmetric surface distance (ASSD), highlighting the effectiveness of the proposed method. Full article

24 pages, 1224 KiB  
Article
MDFormer: Transformer-Based Multimodal Fusion for Robust Chest Disease Diagnosis
by Xinlong Liu, Fei Pan, Hainan Song, Siyi Cao, Chunping Li and Tanshi Li
Electronics 2025, 14(10), 1926; https://doi.org/10.3390/electronics14101926 - 9 May 2025
Viewed by 1014
Abstract
With the increasing richness of medical images and clinical data, abundant data support is provided for multimodal chest disease diagnosis methods. However, traditional multimodal fusion methods are often relatively simple, leading to insufficient exploitation of crossmodal complementary advantages. At the same time, existing multimodal chest disease diagnosis methods usually focus on two modalities, and their scalability is poor when extended to three or more modalities. Moreover, in practical clinical scenarios, missing modality problems often arise due to equipment limitations or incomplete data acquisition. To address these issues, this paper proposes a novel multimodal chest disease classification model, MDFormer. This model designs a crossmodal attention fusion mechanism, MFAttention, and combines it with the Transformer architecture to construct a multimodal fusion module, MFTrans, which effectively integrates medical imaging, clinical text, and vital signs data. When extended to multiple modalities, MFTrans significantly reduces model parameters. At the same time, this paper also proposes a two-stage masked enhancement classification and contrastive learning training framework, MECCL, which significantly improves the model’s robustness and transferability. Experimental results show that MDFormer achieves a classification precision of 0.8 on the MIMIC dataset, and when 50% of the modality data are missing, the AUC can reach 85% of that of the complete data, outperforming models that did not use two-stage training. Full article
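
The masked attention pattern that lets a single fusion module run when a modality is missing can be sketched as below; the dimensions, modality order, and pooling choice are assumptions rather than the MDFormer design.

```python
# Hedged sketch: tokens from three modalities are fused by attention, and a key
# padding mask excludes the tokens of any missing modality.
import torch
import torch.nn as nn

class ThreeModalityFusion(nn.Module):
    def __init__(self, dim=128, heads=4, num_classes=5):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))   # learnable query token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, image_tok, text_tok, vitals_tok, missing_mask):
        # *_tok: (B, N_i, dim); missing_mask: (B, total_tokens), True = ignore.
        tokens = torch.cat([image_tok, text_tok, vitals_tok], dim=1)
        q = self.cls.expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(q, tokens, tokens, key_padding_mask=missing_mask)
        return self.head(fused.squeeze(1))

B, dim = 2, 128
img, txt, vit = torch.randn(B, 49, dim), torch.randn(B, 32, dim), torch.randn(B, 8, dim)
mask = torch.zeros(B, 49 + 32 + 8, dtype=torch.bool)
mask[1, 49:49 + 32] = True      # the second sample's clinical text is missing
print(ThreeModalityFusion()(img, txt, vit, mask).shape)   # torch.Size([2, 5])
```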

15 pages, 3296 KiB  
Article
Leveraging Pre-Trained GPT Models for Equipment Remaining Useful Life Prognostics
by Haoliang Cui, Xiansheng Guo and Liyang Yu
Electronics 2025, 14(7), 1265; https://doi.org/10.3390/electronics14071265 - 23 Mar 2025
Viewed by 666
Abstract
Remaining Useful Life (RUL) prediction is crucial for optimizing predictive maintenance and resource management in industrial machinery. However, existing methods struggle with rigid spatiotemporal feature fusion, difficulty in capturing long-term dependencies, and poor performance on small datasets. To address these challenges, we propose a GPT-based RUL prediction model that enhances feature integration flexibility while leveraging few-shot learning and cross-modal knowledge transfer for improved accuracy in both data-rich and data-limited scenarios. Experiments on the NASA N-CMAPSS dataset show that our model outperforms state-of-the-art methods across multiple metrics, enabling more precise maintenance, cost optimization, and sustainable operations. Full article

27 pages, 19799 KiB  
Article
Video Temporal Grounding with Multi-Model Collaborative Learning
by Yun Tian, Xiaobo Guo, Jinsong Wang, Bin Li and Shoujun Zhou
Appl. Sci. 2025, 15(6), 3072; https://doi.org/10.3390/app15063072 - 12 Mar 2025
Cited by 3 | Viewed by 1184
Abstract
Given an untrimmed video and a natural language query, the video temporal grounding task aims to accurately locate the target segment within the video. Functioning as a critical conduit between computer vision and natural language processing, this task holds profound importance in advancing video comprehension. Current research predominantly centers on enhancing the performance of individual models, thereby overlooking the extensive possibilities afforded by multi-model synergy. While knowledge flow methods have been adopted for multi-model and cross-modal collaborative learning, several critical concerns persist, including the unidirectional transfer of knowledge, low-quality pseudo-label generation, and gradient conflicts inherent in cooperative training. To address these issues, this research proposes a Multi-Model Collaborative Learning (MMCL) framework. By incorporating a bidirectional knowledge transfer paradigm, the MMCL framework empowers models to engage in collaborative learning through the interchange of pseudo-labels. Concurrently, the mechanism for generating pseudo-labels is optimized using the CLIP model’s prior knowledge, bolstering both the accuracy and coherence of these labels while efficiently discarding extraneous temporal fragments. The framework also integrates an iterative training algorithm for multi-model collaboration, mitigating gradient conflicts through alternate optimization and achieving a dynamic balance between collaborative and independent learning. Empirical evaluations across multiple benchmark datasets indicate that the MMCL framework markedly elevates the performance of video temporal grounding models, exceeding existing state-of-the-art approaches in terms of mIoU and Rank@1. Concurrently, the framework accommodates both homogeneous and heterogeneous model configurations, demonstrating its broad versatility and adaptability. This investigation furnishes an effective avenue for multi-model collaborative learning in video temporal grounding, bolstering efficient knowledge dissemination and charting novel pathways in the domain of video comprehension. Full article
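
The bidirectional pseudo-label exchange at the core of MMCL can be reduced to a toy sketch with two classifiers on random features; the thresholds and models are assumptions, and the real framework operates on video-query pairs with CLIP-filtered pseudo-labels and gradient-conflict-aware alternation.

```python
# Toy sketch of bidirectional pseudo-label exchange: each model labels the
# unlabeled pool, confident predictions are handed to the *other* model, and
# training alternates between the two.
import torch
import torch.nn as nn
import torch.nn.functional as F

def confident_pseudo_labels(model, x, threshold=0.9):
    """Return (inputs, labels) where the model's softmax confidence is high."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
        conf, labels = probs.max(dim=-1)
    keep = conf >= threshold
    return x[keep], labels[keep]

def train_on(model, optimizer, x, y):
    if len(x) == 0:
        return  # nothing confident enough this step
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

model_a = nn.Linear(16, 3)
model_b = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 3))
opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)
unlabeled = torch.randn(256, 16)

for step in range(10):
    # A teaches B, then B teaches A (alternating optimization).
    xb, yb = confident_pseudo_labels(model_a, unlabeled)
    train_on(model_b, opt_b, xb, yb)
    xa, ya = confident_pseudo_labels(model_b, unlabeled)
    train_on(model_a, opt_a, xa, ya)
```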

24 pages, 2586 KiB  
Article
Deep Multi-Modal Skin-Imaging-Based Information-Switching Network for Skin Lesion Recognition
by Yingzhe Yu, Huiqiong Jia, Li Zhang, Suling Xu, Xiaoxia Zhu, Jiucun Wang, Fangfang Wang, Lianyi Han, Haoqiang Jiang, Qiongyan Zhou and Chao Xin
Bioengineering 2025, 12(3), 282; https://doi.org/10.3390/bioengineering12030282 - 12 Mar 2025
Cited by 1 | Viewed by 1561
Abstract
The rising prevalence of skin lesions places a heavy burden on global health resources and necessitates an early and precise diagnosis for successful treatment. The diagnostic potential of recent multi-modal skin lesion detection algorithms is limited because they ignore dynamic interactions and information sharing across modalities at various feature scales. To address this, we propose a deep learning framework, Multi-Modal Skin-Imaging-based Information-Switching Network (MDSIS-Net), for end-to-end skin lesion recognition. MDSIS-Net extracts intra-modality features using transfer learning in a multi-scale fully shared convolutional neural network and introduces an innovative information-switching module. A cross-attention mechanism dynamically calibrates and integrates features across modalities to improve inter-modality associations and feature representation in this module. MDSIS-Net is tested on clinical disfiguring dermatosis data and the public Derm7pt melanoma dataset. A Visually Intelligent System for Image Analysis (VISIA) captures five modalities: spots, red marks, ultraviolet (UV) spots, porphyrins, and brown spots for disfiguring dermatosis. The model performs better than existing approaches with an mAP of 0.967, accuracy of 0.960, precision of 0.935, recall of 0.960, and f1-score of 0.947. Using clinical and dermoscopic pictures from the Derm7pt dataset, MDSIS-Net outperforms current benchmarks for melanoma, with an mAP of 0.877, accuracy of 0.907, precision of 0.911, recall of 0.815, and f1-score of 0.851. The model’s interpretability is proven by Grad-CAM heatmaps correlating with clinical diagnostic focus areas. In conclusion, our deep multi-modal information-switching model enhances skin lesion identification by capturing relationship features and fine-grained details across multi-modal images, improving both accuracy and interpretability. This work advances clinical decision making and lays a foundation for future developments in skin lesion diagnosis and treatment. Full article
(This article belongs to the Special Issue Artificial Intelligence for Skin Diseases Classification)