Search Results (263)

Search Parameters:
Keywords = cross-modal learning network

22 pages, 1809 KB  
Article
Semantic-Aware Co-Parallel Network for Cross-Scene Hyperspectral Image Classification
by Xiaohui Li, Chenyang Jin, Yuntao Tang, Kai Xing and Xiaodong Yu
Sensors 2025, 25(21), 6688; https://doi.org/10.3390/s25216688 (registering DOI) - 1 Nov 2025
Abstract
Cross-scene classification of hyperspectral images poses significant challenges due to the lack of a priori knowledge and the differences in data distribution across scenes. While traditional studies have made limited use of a priori knowledge from other modalities, recent advancements in pre-trained large-scale language-vision models have shown strong performance on various downstream tasks, highlighting the potential of cross-modal assisted learning. In this paper, we propose a Semantic-aware Collaborative Parallel Network (SCPNet) to mitigate the impact of data distribution differences by incorporating linguistic modalities to assist in learning cross-domain invariant representations of hyperspectral images. SCPNet uses a parallel architecture consisting of a spatial–spectral feature extraction module and a multiscale feature extraction module, designed to capture rich image information during the feature extraction phase. The extracted features are then mapped into an optimized semantic space, where improved supervised contrastive learning clusters image features from the same category together while separating those from different categories. The semantic space bridges the gap between visual and linguistic modalities, enabling the model to mine cross-domain invariant representations from the linguistic modality. Experimental results demonstrate that SCPNet significantly outperforms existing methods on three publicly available datasets, confirming its effectiveness for cross-scene hyperspectral image classification tasks. Full article
(This article belongs to the Special Issue Remote Sensing Image Processing, Analysis and Application)
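As a concrete illustration of the supervised contrastive clustering step described in the abstract, the sketch below shows a minimal supervised contrastive loss that pulls same-class embeddings together and pushes different classes apart. It is a generic formulation under assumed tensor shapes, not SCPNet's implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """features: (N, D) embeddings; labels: (N,) integer class ids."""
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature                 # pairwise cosine similarities
    n = features.size(0)
    not_self = ~torch.eye(n, dtype=torch.bool, device=features.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    # log-softmax over every other sample for each anchor
    exp_sim = torch.exp(sim) * not_self.float()
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-12)
    # average log-likelihood of same-class pairs (anchors without positives are skipped)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    mean_log_prob_pos = (pos_mask.float() * log_prob).sum(dim=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```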
31 pages, 34773 KB  
Article
Learning Domain-Invariant Representations for Event-Based Motion Segmentation: An Unsupervised Domain Adaptation Approach
by Mohammed Jeryo and Ahad Harati
J. Imaging 2025, 11(11), 377; https://doi.org/10.3390/jimaging11110377 - 27 Oct 2025
Viewed by 217
Abstract
Event cameras provide microsecond temporal resolution, high dynamic range, and low latency by asynchronously capturing per-pixel luminance changes, thereby introducing a novel sensing paradigm. These advantages render them well-suited for high-speed applications such as autonomous vehicles and dynamic environments. Nevertheless, the sparsity of event data and the absence of dense annotations are significant obstacles to supervised learning for motion segmentation from event streams. Domain adaptation is also challenging due to the considerable domain shift from intensity images. To address these challenges, we propose a two-phase cross-modality adaptation framework that translates motion segmentation knowledge from labeled RGB-flow data to unlabeled event streams. A dual-branch encoder extracts modality-specific motion and appearance features from RGB and optical flow in the source domain. Using reconstruction networks, event voxel grids are converted into pseudo-image and pseudo-flow modalities in the target domain. These modalities are subsequently re-encoded using frozen RGB-trained encoders. Multi-level consistency losses are implemented on features, predictions, and outputs to enforce domain alignment. Our design enables the model to acquire domain-invariant, semantically rich features through the use of shallow architectures, thereby reducing training costs and facilitating real-time inference with a lightweight prediction path. The proposed architecture, alongside the utilized hybrid loss function, effectively bridges the domain and modality gaps. We evaluate our method on two challenging benchmarks: EVIMO2, which incorporates real-world dynamics, high-speed motion, illumination variation, and multiple independently moving objects; and MOD++, which features complex object dynamics, collisions, and dense 1 kHz supervision in synthetic scenes. The proposed UDA framework achieves 83.1% and 79.4% accuracy on EVIMO2 and MOD++, respectively, outperforming existing state-of-the-art approaches, such as EV-Transfer and SHOT, by up to 3.6%. It is also lighter and faster, and delivers higher mIoU and F1 scores. Full article
(This article belongs to the Section Image and Video Processing)
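For readers unfamiliar with the event voxel grids mentioned above, the following is a minimal sketch of one common way to build such a grid from raw events, with bilinear weighting over the two nearest temporal bins; the bin count and event layout are assumptions, not the paper's preprocessing code.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """events: (N, 4) array of (t, x, y, polarity). Returns a (num_bins, H, W) grid,
    spreading each event over its two nearest temporal bins (bilinear in time)."""
    t = events[:, 0].astype(np.float64)
    x = events[:, 1].astype(int)
    y = events[:, 2].astype(int)
    pol = np.where(events[:, 3] > 0, 1.0, -1.0)
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    # normalize timestamps to [0, num_bins - 1]
    t = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1)
    t0 = np.floor(t).astype(int)
    frac = t - t0
    np.add.at(grid, (t0, y, x), pol * (1.0 - frac))
    np.add.at(grid, (np.clip(t0 + 1, 0, num_bins - 1), y, x), pol * frac)
    return grid
```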
37 pages, 10732 KB  
Review
Advances on Multimodal Remote Sensing Foundation Models for Earth Observation Downstream Tasks: A Survey
by Guoqing Zhou, Lihuang Qian and Paolo Gamba
Remote Sens. 2025, 17(21), 3532; https://doi.org/10.3390/rs17213532 - 24 Oct 2025
Viewed by 401
Abstract
Remote sensing foundation models (RSFMs) have demonstrated excellent feature extraction and reasoning capabilities under the self-supervised learning paradigm of “unlabeled datasets—model pre-training—downstream tasks”. These models achieve superior accuracy and performance compared to existing models across numerous open benchmark datasets. However, when confronted with multimodal data, such as optical, LiDAR, SAR, text, video, and audio, RSFMs exhibit limitations in cross-modal generalization and multi-task learning. Although several reviews have addressed RSFMs, there is currently no comprehensive survey dedicated to vision–X (vision, language, audio, position) multimodal RSFMs (MM-RSFMs). To address this gap, this article provides a systematic review of MM-RSFMs from a novel perspective. Firstly, the key technologies underlying MM-RSFMs are reviewed and analyzed, and the available multimodal RS pre-training datasets are summarized. Then, recent advances in MM-RSFMs are classified according to the development of backbone networks and cross-modal interaction methods of vision–X, such as vision–vision, vision–language, vision–audio, vision–position, and vision–language–audio. Finally, potential challenges are analyzed, and perspectives for MM-RSFMs are outlined. This survey reveals that current MM-RSFMs face the following key challenges: (1) a scarcity of high-quality multimodal datasets, (2) limited capability for multimodal feature extraction, (3) weak cross-task generalization, (4) absence of unified evaluation criteria, and (5) insufficient security measures. Full article
(This article belongs to the Section AI Remote Sensing)
25 pages, 2557 KB  
Article
Modality-Resilient Multimodal Industrial Anomaly Detection via Cross-Modal Knowledge Transfer and Dynamic Edge-Preserving Voxelization
by Jiahui Xu, Jian Yuan, Mingrui Yang and Weishu Yan
Sensors 2025, 25(21), 6529; https://doi.org/10.3390/s25216529 - 23 Oct 2025
Viewed by 448
Abstract
Achieving high-precision anomaly detection with incomplete sensor data is a critical challenge in industrial automation and intelligent manufacturing. This incompleteness often results from sensor failures, environmental interference, occlusions, or acquisition cost constraints. This study explicitly targets both types of incompleteness commonly encountered in industrial multimodal inspection: (i) incomplete sensor data within a given modality, such as partial point cloud loss or image degradation, and (ii) incomplete modalities, where one sensing channel (RGB or 3D) is entirely unavailable. By jointly addressing intra-modal incompleteness and cross-modal absence within a unified cross-distillation framework, our approach enhances anomaly detection robustness under both conditions. First, a teacher–student cross-modal distillation mechanism enables robust feature learning from both RGB and 3D modalities, allowing the student network to accurately detect anomalies even when a modality is missing during inference. Second, a dynamic voxel resolution adjustment with edge-retention strategy alleviates the computational burden of 3D point cloud processing while preserving crucial geometric features. By jointly enhancing robustness to missing modalities and improving computational efficiency, our method offers a resilient and practical solution for anomaly detection in real-world manufacturing scenarios. Extensive experiments demonstrate that the proposed method achieves both high robustness and efficiency across multiple industrial scenarios, establishing new state-of-the-art performance that surpasses existing approaches in both accuracy and speed. This method provides a robust solution for high-precision perception under complex detection conditions, significantly enhancing the feasibility of deploying anomaly detection systems in real industrial environments. Full article
(This article belongs to the Section Industrial Sensors)
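A minimal sketch of the teacher–student cross-modal distillation idea: a student that sees a single modality is trained to match the fused multimodal teacher at both the feature and prediction levels. Loss weights and tensor shapes are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn.functional as F

def cross_distillation_loss(student_feat, student_logits,
                            teacher_feat, teacher_logits, T=2.0, alpha=0.5):
    """Student sees one modality (e.g., RGB only); teacher was trained on RGB + 3D."""
    # feature-level hint loss against the fused teacher representation
    loss_feat = F.mse_loss(student_feat, teacher_feat.detach())
    # response-level distillation on temperature-softened logits
    loss_kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                       F.softmax(teacher_logits.detach() / T, dim=1),
                       reduction="batchmean") * (T * T)
    return alpha * loss_feat + (1 - alpha) * loss_kd
```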
23 pages, 6498 KB  
Article
A Cross-Modal Deep Feature Fusion Framework Based on Ensemble Learning for Land Use Classification
by Xiaohuan Wu, Houji Qi, Keli Wang, Yikun Liu and Yang Wang
ISPRS Int. J. Geo-Inf. 2025, 14(11), 411; https://doi.org/10.3390/ijgi14110411 - 23 Oct 2025
Viewed by 394
Abstract
Land use classification based on multi-modal data fusion has gained significant attention due to its potential to capture the complex characteristics of urban environments. However, effectively extracting and integrating discriminative features derived from heterogeneous geospatial data remain challenging. This study proposes an ensemble learning framework for land use classification by fusing cross-modal deep features from both physical and socioeconomic perspectives. Specifically, the framework utilizes the Masked Autoencoder (MAE) to extract global spatial dependencies from remote sensing imagery and applies long short-term memory (LSTM) networks to model spatial distribution patterns of points of interest (POIs) based on type co-occurrence. Furthermore, we employ inter-modal contrastive learning to enhance the representation of physical and socioeconomic features. To verify the superiority of the ensemble learning framework, we apply it to map the land use distribution of Beijing. By coupling various physical and socioeconomic features, the framework achieves an average accuracy of 84.33%, surpassing several comparative baseline methods. Furthermore, the framework demonstrates comparable performance when applied to a Shenzhen dataset, confirming its robustness and generalizability. The findings highlight the importance of fully extracting and effectively integrating multi-source deep features in land use classification, providing a robust solution for urban planning and sustainable development. Full article
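The LSTM branch over POI type co-occurrence can be pictured with the small sketch below, which embeds a parcel's sequence of POI type ids and keeps the final hidden state as the socioeconomic feature. Vocabulary size, embedding width, and sequence construction are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class POISequenceEncoder(nn.Module):
    """Embeds a parcel's sequence of POI type ids and keeps the last hidden state."""
    def __init__(self, num_poi_types=128, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(num_poi_types, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, poi_type_ids):            # (B, L) integer POI type ids
        x = self.embed(poi_type_ids)             # (B, L, E)
        _, (h_n, _) = self.lstm(x)                # h_n: (1, B, H)
        return h_n.squeeze(0)                     # (B, H) socioeconomic embedding
```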
71 pages, 9523 KB  
Article
Neural Network IDS/IPS Intrusion Detection and Prevention System with Adaptive Online Training to Improve Corporate Network Cybersecurity, Evidence Recording, and Interaction with Law Enforcement Agencies
by Serhii Vladov, Victoria Vysotska, Svitlana Vashchenko, Serhii Bolvinov, Serhii Glubochenko, Andrii Repchonok, Maksym Korniienko, Mariia Nazarkevych and Ruslan Herasymchuk
Big Data Cogn. Comput. 2025, 9(11), 267; https://doi.org/10.3390/bdcc9110267 - 22 Oct 2025
Viewed by 324
Abstract
This article examines the problems of reliable online detection and IDS/IPS intrusion prevention in dynamic corporate networks, where traditional signature-based methods fail to keep pace with new and evolving attacks, and streaming data is susceptible to drift and targeted “poisoning” of the training dataset. As a solution, we propose a hybrid neural network system with adaptive online training, a formal minimax false-positive control framework, and a set of robustness mechanisms (a Huber model, pruned learning rate, DRO, a gradient-norm regularizer, and prioritized replay). In practice, the system combines modal encoders for traffic, logs, and metrics; a temporal GNN for entity correlation; a variational module for uncertainty assessment; a differentiable symbolic unit for logical rules; an RL agent for incident prioritization; and an NLG module for explanations and the preparation of forensically relevant artifacts. The applied components are connected via a cognitive layer (cross-modal fusion memory), a Bayesian neural network fuser, and a single multi-task loss function. The practical implementation includes the pipeline “novelty detection → active labelling → incremental supervised update” and chain-of-custody mechanisms for evidential fitness. A significant improvement in quality has been experimentally demonstrated: the developed system achieves an ROC AUC of 0.96, an F1-score of 0.95, and a significantly lower FPR compared to basic architectures (MLP, CNN, and LSTM). In applied validation tasks, detection rates of ≈92–94% and resistance to distribution drift are noted. Full article
(This article belongs to the Special Issue Internet Intelligence for Cybersecurity)
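A minimal sketch of the “novelty detection → active labelling → incremental supervised update” loop mentioned above, using predictive entropy as the novelty score; the threshold, labelling hook, and model interface are illustrative assumptions rather than the authors' pipeline.

```python
import torch
import torch.nn.functional as F

def predictive_entropy(logits):
    p = F.softmax(logits, dim=1)
    return -(p * torch.log(p + 1e-12)).sum(dim=1)

def incremental_step(model, optimizer, batch_x, entropy_threshold, label_fn):
    """Flag high-entropy flows as novel, obtain labels for them, and fine-tune once."""
    model.eval()
    with torch.no_grad():
        scores = predictive_entropy(model(batch_x))
    novel = scores > entropy_threshold            # candidates for analyst review
    if novel.any():
        x_new = batch_x[novel]
        y_new = label_fn(x_new)                    # active labelling (human or rule-based)
        model.train()
        optimizer.zero_grad()
        F.cross_entropy(model(x_new), y_new).backward()
        optimizer.step()
    return novel
```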
24 pages, 2308 KB  
Review
Review on Application of Machine Vision-Based Intelligent Algorithms in Gear Defect Detection
by Dehai Zhang, Shengmao Zhou, Yujuan Zheng and Xiaoguang Xu
Processes 2025, 13(10), 3370; https://doi.org/10.3390/pr13103370 - 21 Oct 2025
Viewed by 547
Abstract
Gear defect detection directly affects the operational reliability of critical equipment in fields such as automotive and aerospace. Gear defect detection technology based on machine vision, leveraging the advantages of non-contact measurement, high efficiency, and cost-effectiveness, has become a key support for quality control in intelligent manufacturing. However, it still faces challenges including difficulties in semantic alignment of multimodal data, the imbalance between real-time detection requirements and computational resources, and poor model generalization in few-shot scenarios. This paper takes the paradigm evolution of gear defect detection technology as the main line, systematically reviews its development from traditional image processing to deep learning, and focuses on the innovative application of intelligent algorithms. A research framework of “technical bottleneck-breakthrough path-application verification” is constructed: for the problem of multimodal fusion, the cross-modal feature alignment mechanism based on Transformer networks is deeply analyzed, clarifying its technical path of realizing joint embedding of visual and vibration signals by establishing global correlation mapping; for resource constraints, the performance of lightweight models such as MobileNet and ShuffleNet is quantitatively compared, verifying that these models reduce parameters by 40–60% while maintaining the mean Average Precision essentially unchanged; for small-sample scenarios, few-shot generation models based on contrastive learning are systematically organized, confirming that their accuracy in the 10-shot scenario can reach 90% of that of fully supervised models, thus enhancing generalization ability. Future research can focus on the collaboration between few-shot generation and physical simulation, edge-cloud dynamic scheduling, defect evolution modeling driven by multiphysics fields, and standardization of explainable artificial intelligence. The aim is to construct a gear detection system with autonomous perception capabilities, promoting the development of industrial quality inspection toward high-precision, high-robustness, and low-cost intelligence. Full article
19 pages, 4569 KB  
Article
NeuroNet-AD: A Multimodal Deep Learning Framework for Multiclass Alzheimer’s Disease Diagnosis
by Saeka Rahman, Md Motiur Rahman, Smriti Bhatt, Raji Sundararajan and Miad Faezipour
Bioengineering 2025, 12(10), 1107; https://doi.org/10.3390/bioengineering12101107 - 15 Oct 2025
Viewed by 618
Abstract
Alzheimer’s disease (AD) is the most prevalent form of dementia. This disease significantly impacts cognitive functions and daily activities. Early and accurate diagnosis of AD, including the preliminary stage of mild cognitive impairment (MCI), is critical for effective patient care and treatment development. Although advancements in deep learning (DL) and machine learning (ML) models improve diagnostic precision, the lack of large datasets limits further enhancements, necessitating the use of complementary data. Existing convolutional neural networks (CNNs) effectively process visual features but struggle to fuse multimodal data effectively for AD diagnosis. To address these challenges, we propose NeuroNet-AD, a novel multimodal CNN framework designed to enhance AD classification accuracy. NeuroNet-AD integrates Magnetic Resonance Imaging (MRI) images with clinical text-based metadata, including psychological test scores, demographic information, and genetic biomarkers. In NeuroNet-AD, we incorporate Convolutional Block Attention Modules (CBAMs) within the ResNet-18 backbone, enabling the model to focus on the most informative spatial and channel-wise features. We introduce an attention computation and multimodal fusion module, named Meta Guided Cross Attention (MGCA), which facilitates effective cross-modal alignment between images and meta-features through a multi-head attention mechanism. Additionally, we employ an ensemble-based feature selection strategy to identify the most discriminative features from the textual data, improving model generalization and performance. We evaluate NeuroNet-AD on the Alzheimer’s Disease Neuroimaging Initiative (ADNI1) dataset using subject-level 5-fold cross-validation and a held-out test set to ensure robustness. NeuroNet-AD achieved 98.68% accuracy in multiclass classification of normal control (NC), MCI, and AD and 99.13% accuracy in the binary setting (NC vs. AD) on the ADNI dataset, outperforming state-of-the-art models. External validation on the OASIS-3 dataset further confirmed the model’s generalization ability, achieving 94.10% accuracy in the multiclass setting and 98.67% accuracy in the binary setting, despite variations in demographics and acquisition protocols. Further extensive evaluation studies demonstrate the effectiveness of each component of NeuroNet-AD in improving the performance. Full article
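The meta-guided cross-attention fusion can be sketched as image tokens querying clinical metadata tokens through standard multi-head attention with a residual update, as below; layer sizes and token layouts are assumptions, not the NeuroNet-AD code.

```python
import torch
import torch.nn as nn

class MetaGuidedCrossAttention(nn.Module):
    """Image tokens attend to clinical/meta tokens, then a residual update."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, meta_tokens):
        # img_tokens: (B, N_img, D) image features; meta_tokens: (B, N_meta, D) projected metadata
        fused, _ = self.attn(query=img_tokens, key=meta_tokens, value=meta_tokens)
        return self.norm(img_tokens + fused)       # cross-modal residual update
```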
20 pages, 5086 KB  
Article
A Multi-Modal Attention Fusion Framework for Road Connectivity Enhancement in Remote Sensing Imagery
by Yongqi Yuan, Yong Cheng, Bo Pan, Ge Jin, De Yu, Mengjie Ye and Qian Zhang
Mathematics 2025, 13(20), 3266; https://doi.org/10.3390/math13203266 - 13 Oct 2025
Viewed by 403
Abstract
Ensuring the structural continuity and completeness of road networks in high-resolution remote sensing imagery remains a major challenge for current deep learning methods, especially under conditions of occlusion caused by vegetation, buildings, or shadows. To address this, we propose a novel post-processing enhancement framework that improves the connectivity and accuracy of initial road extraction results produced by any segmentation model. The method employs a dual-stream encoder architecture, which jointly processes RGB images and preliminary road masks to obtain complementary spatial and semantic information. A core component is the MAF (Multi-Modal Attention Fusion) module, designed to capture fine-grained, long-range, and cross-scale dependencies between image and mask features. This fusion leads to the restoration of fragmented road segments, the suppression of noise, and overall improvement in road completeness. Experiments on benchmark datasets (DeepGlobe and Massachusetts) demonstrate substantial gains in precision, recall, F1-score, and mIoU, confirming the framework’s effectiveness and generalization ability in real-world scenarios. Full article
(This article belongs to the Special Issue Mathematical Methods for Machine Learning and Computer Vision)
18 pages, 4337 KB  
Article
A Transformer-Based Multimodal Fusion Network for Emotion Recognition Using EEG and Facial Expressions in Hearing-Impaired Subjects
by Shuni Feng, Qingzhou Wu, Kailin Zhang and Yu Song
Sensors 2025, 25(20), 6278; https://doi.org/10.3390/s25206278 - 10 Oct 2025
Viewed by 640
Abstract
Hearing-impaired people face challenges in expressing and perceiving emotions, and traditional single-modal emotion recognition methods demonstrate limited effectiveness in complex environments. To enhance recognition performance, this paper proposes a multimodal multi-head attention fusion neural network (MMHA-FNN). This method utilizes differential entropy (DE) and bilinear interpolation features as inputs, learning the spatial–temporal characteristics of brain regions through an MBConv-based module. By incorporating the Transformer-based multi-head self-attention mechanism, we dynamically model the dependencies between EEG and facial expression features, enabling adaptive weighting and deep interaction of cross-modal characteristics. The experiments involved a four-class classification task on the MED-HI dataset (15 subjects, 300 trials). The taxonomy included happy, sad, fear, and calmness, where ‘calmness’ corresponds to a low-arousal neutral state as defined in the MED-HI protocol. Results indicate that the proposed method achieved an average accuracy of 81.14%, significantly outperforming feature concatenation (71.02%) and decision-layer fusion (69.45%). This study demonstrates the complementary nature of EEG and facial expressions in emotion recognition among hearing-impaired individuals and validates the effectiveness of feature-layer interaction fusion based on attention mechanisms in enhancing emotion recognition performance. Full article
(This article belongs to the Section Biomedical Sensors)
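The differential entropy (DE) features mentioned above are commonly computed per channel and frequency band under a Gaussian assumption, DE = 0.5·ln(2πeσ²); the sketch below illustrates this with assumed band edges and sampling rate, not the MED-HI preprocessing pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def band_de(eeg, fs, low, high, order=4):
    """eeg: (channels, samples). Returns per-channel DE for one frequency band."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, eeg, axis=1)
    var = filtered.var(axis=1)
    return 0.5 * np.log(2 * np.pi * np.e * var)    # Gaussian differential entropy

# Example with assumed bands for a 250 Hz recording:
# bands = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}
# features = np.stack([band_de(eeg, 250, lo, hi) for lo, hi in bands.values()], axis=1)
```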
20 pages, 59706 KB  
Article
Learning Hierarchically Consistent Disentanglement with Multi-Channel Augmentation for Public Security-Oriented Sketch Person Re-Identification
by Yu Ye, Zhihong Sun and Jun Chen
Sensors 2025, 25(19), 6155; https://doi.org/10.3390/s25196155 - 4 Oct 2025
Viewed by 445
Abstract
Sketch re-identification (Re-ID) aims to retrieve pedestrian photographs in the gallery dataset by a query sketch image drawn by professionals, which is crucial for criminal investigations and missing person searches in the field of public security. The main challenge of this task lies in bridging the significant modality gap between sketches and photos while extracting discriminative modality-invariant features. However, information asymmetry between sketches and RGB photographs, particularly the differences in color information, severely interferes with cross-modal matching processes. To address this challenge, we propose a novel network architecture that integrates multi-channel augmentation with hierarchically consistent disentanglement learning. Specifically, a multi-channel augmentation module is developed to mitigate the interference of color bias in cross-modal matching. Furthermore, a modality-disentangled prototype (MDP) module is introduced to decompose pedestrian representations at the feature level into modality-invariant structural prototypes and modality-specific appearance prototypes. Additionally, a cross-layer decoupling consistency constraint is designed to ensure the semantic coherence of disentangled prototypes across different network layers and to improve the stability of the whole decoupling process. Extensive experimental results on two public datasets demonstrate the superiority of our proposed approach over state-of-the-art methods. Full article
(This article belongs to the Special Issue Advances in Security for Emerging Intelligent Systems)
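One simple way to realize a multi-channel augmentation that suppresses color bias is to randomly replace an RGB photo with a single channel (or its grayscale) replicated three times, as sketched below; the exact channel strategy in the paper may differ.

```python
import random
import torch

def random_channel_augment(img):
    """img: (3, H, W) RGB tensor in [0, 1]; returns a 3-channel, color-suppressed variant."""
    choice = random.choice(["r", "g", "b", "gray", "none"])
    if choice == "none":
        return img
    if choice == "gray":
        gray = 0.299 * img[0] + 0.587 * img[1] + 0.114 * img[2]
        return gray.unsqueeze(0).repeat(3, 1, 1)
    idx = {"r": 0, "g": 1, "b": 2}[choice]
    return img[idx].unsqueeze(0).repeat(3, 1, 1)   # single channel replicated
```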
15 pages, 2373 KB  
Article
LLM-Empowered Kolmogorov-Arnold Frequency Learning for Time Series Forecasting in Power Systems
by Zheng Yang, Yang Yu, Shanshan Lin and Yue Zhang
Mathematics 2025, 13(19), 3149; https://doi.org/10.3390/math13193149 - 2 Oct 2025
Viewed by 393
Abstract
With the rapid evolution of artificial intelligence technologies in power systems, data-driven time-series forecasting has become instrumental in enhancing the stability and reliability of power systems, allowing operators to anticipate demand fluctuations and optimize energy distribution. Despite the notable progress made by current methods, they are still hindered by two major limitations: most existing models are relatively small in architecture, failing to fully leverage the potential of large-scale models, and they are based on fixed nonlinear mapping functions that cannot adequately capture complex patterns, leading to information loss. To this end, an LLM-Empowered Kolmogorov–Arnold frequency learning (LKFL) is proposed for time series forecasting in power systems, which consists of LLM-based prompt representation learning, KAN-based frequency representation learning, and entropy-oriented cross-modal fusion. Specifically, LKFL first transforms multivariable time-series data into text prompts and leverages a pre-trained LLM to extract semantic-rich prompt representations. It then applies Fast Fourier Transform to convert the time-series data into the frequency domain and employs Kolmogorov–Arnold networks (KAN) to capture multi-scale periodic structures and complex frequency characteristics. Finally, LKFL integrates the prompt and frequency representations through an entropy-oriented cross-modal fusion strategy, which minimizes the semantic gap between different modalities and ensures full integration of complementary information. This comprehensive approach enables LKFL to achieve superior forecasting performance in power systems. Extensive evaluations on five benchmarks verify that LKFL sets a new standard for time-series forecasting in power systems compared with baseline methods. Full article
(This article belongs to the Special Issue Artificial Intelligence and Data Science, 2nd Edition)
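The frequency branch can be illustrated with a small sketch that moves a multivariate series into the frequency domain with an FFT and keeps amplitude and phase of the low-frequency bins, which a KAN (or any network) could then consume; window length and feature choice are assumptions, not LKFL's exact design.

```python
import torch

def frequency_features(x, top_k=16):
    """x: (B, L, C) multivariate series. Returns (B, top_k, C, 2) amplitude/phase
    of the lowest top_k non-DC frequency bins."""
    spec = torch.fft.rfft(x, dim=1)                # (B, L // 2 + 1, C), complex
    spec = spec[:, 1:top_k + 1, :]                 # drop DC, keep low frequencies
    return torch.stack([spec.abs(), spec.angle()], dim=-1)
```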
25 pages, 8881 KB  
Article
Evaluating Machine Learning Techniques for Brain Tumor Detection with Emphasis on Few-Shot Learning Using MAML
by Soham Sanjay Vaidya, Raja Hashim Ali, Shan Faiz, Iftikhar Ahmed and Talha Ali Khan
Algorithms 2025, 18(10), 624; https://doi.org/10.3390/a18100624 - 2 Oct 2025
Viewed by 409
Abstract
Accurate brain tumor classification from MRI is often constrained by limited labeled data. We systematically compare conventional machine learning, deep learning, and few-shot learning (FSL) for four classes (glioma, meningioma, pituitary, no tumor) using a standardized pipeline. Models are trained on the Kaggle Brain Tumor MRI Dataset and evaluated across dataset regimes (100%→10%). We further test generalization on BraTS and quantify robustness to resolution changes, acquisition noise, and modality shift (T1→FLAIR). To support clinical trust, we add visual explanations (Grad-CAM/saliency) and report per-class results (confusion matrices). A fairness-aligned protocol (shared splits, optimizer, early stopping) and a complexity analysis (parameters/FLOPs) enable balanced comparison. With full data, Convolutional Neural Networks (CNNs)/Residual Networks (ResNets) perform strongly but degrade with 10% data; Model-Agnostic Meta-Learning (MAML) retains competitive performance (AUC-ROC ≥ 0.9595 at 10%). Under cross-dataset validation (BraTS), FSL—particularly MAML—shows smaller performance drops than CNN/ResNet. Variability tests reveal FSL’s relative robustness to down-resolution and noise, although modality shift remains challenging for all models. Interpretability maps confirm correct activations on tumor regions in true positives and explain systematic errors (e.g., “no tumor”→pituitary). Conclusion: FSL provides accurate, data-efficient, and comparatively robust tumor classification under distribution shift. The added per-class analysis, interpretability, and complexity metrics strengthen clinical relevance and transparency. Full article
(This article belongs to the Special Issue Machine Learning Models and Algorithms for Image Processing)
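For readers unfamiliar with MAML, the sketch below shows one meta-update: a single inner gradient step on each task's support set, followed by an outer update from the query-set losses. It is a generic functional-style implementation with assumed interfaces, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def maml_step(model, tasks, meta_optimizer, inner_lr=0.01):
    """tasks: list of (x_support, y_support, x_query, y_query) tensors."""
    meta_optimizer.zero_grad()
    meta_loss = 0.0
    for x_s, y_s, x_q, y_q in tasks:
        params = dict(model.named_parameters())
        names, tensors = list(params.keys()), list(params.values())
        # inner loop: one adaptation step on the support set
        support_loss = F.cross_entropy(torch.func.functional_call(model, params, (x_s,)), y_s)
        grads = torch.autograd.grad(support_loss, tensors, create_graph=True)
        adapted = {n: p - inner_lr * g for n, p, g in zip(names, tensors, grads)}
        # outer loop: evaluate the adapted parameters on the query set
        meta_loss = meta_loss + F.cross_entropy(
            torch.func.functional_call(model, adapted, (x_q,)), y_q)
    (meta_loss / max(len(tasks), 1)).backward()
    meta_optimizer.step()
```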
29 pages, 3761 KB  
Article
An Adaptive Transfer Learning Framework for Multimodal Autism Spectrum Disorder Diagnosis
by Wajeeha Malik, Muhammad Abuzar Fahiem, Jawad Khan, Younhyun Jung and Fahad Alturise
Life 2025, 15(10), 1524; https://doi.org/10.3390/life15101524 - 26 Sep 2025
Viewed by 675
Abstract
Autism Spectrum Disorder (ASD) is a complex neurodevelopmental condition with diverse behavioral, genetic, and structural characteristics. Due to its heterogeneous nature, early diagnosis of ASD is challenging, and conventional unimodal approaches often fail to capture cross-modal dependencies. To address this, this study introduces an adaptive multimodal fusion framework that integrates behavioral, genetic, and structural MRI (sMRI) data, addressing the limitations of unimodal approaches. Each modality undergoes a dedicated preprocessing and feature optimization phase. For behavioral data, an ensemble of classifiers using a stacking technique and an attention mechanism is applied for feature extraction, achieving an accuracy of 95.5%. The genetic data is analyzed using Gradient Boosting, which attained a classification accuracy of 86.6%. For the sMRI data, a Hybrid Convolutional Neural Network–Graph Neural Network (Hybrid-CNN-GNN) architecture is proposed, demonstrating strong performance with an accuracy of 96.32%, surpassing existing methods. To unify these modalities, they are fused using an adaptive late fusion strategy implemented with a Multilayer Perceptron (MLP), where adaptive weighting adjusts each modality’s contribution based on validation performance. The integrated framework addresses the limitations of unimodal approaches by creating a unified diagnostic model. The transfer learning framework achieves superior diagnostic accuracy (98.7%) compared to unimodal baselines, demonstrating strong generalization across heterogeneous datasets and offering a promising step toward reliable, multimodal ASD diagnosis. Full article
(This article belongs to the Special Issue Advanced Machine Learning for Disease Prediction and Prevention)
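The adaptive late fusion step can be pictured as weighting each modality's class probabilities before a small MLP produces the final decision; in the sketch below the weights are derived from the reported unimodal validation accuracies, which is an assumed weighting rule, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class AdaptiveLateFusion(nn.Module):
    """Weights per-modality class probabilities (here by validation accuracy) before an MLP."""
    def __init__(self, num_modalities=3, num_classes=2, val_accuracies=(0.955, 0.866, 0.9632)):
        super().__init__()
        w = torch.tensor(val_accuracies, dtype=torch.float32)
        self.register_buffer("weights", w / w.sum())   # assumed weighting rule
        self.mlp = nn.Sequential(
            nn.Linear(num_modalities * num_classes, 64), nn.ReLU(),
            nn.Linear(64, num_classes))

    def forward(self, probs_list):
        # probs_list: list of (B, num_classes) probabilities from the unimodal models
        weighted = [w * p for w, p in zip(self.weights, probs_list)]
        return self.mlp(torch.cat(weighted, dim=1))
```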
22 pages, 2395 KB  
Article
Multimodal Alignment and Hierarchical Fusion Network for Multimodal Sentiment Analysis
by Jiasheng Huang, Huan Li and Xinyue Mo
Electronics 2025, 14(19), 3828; https://doi.org/10.3390/electronics14193828 - 26 Sep 2025
Viewed by 981
Abstract
The widespread emergence of multimodal data on social platforms has presented new opportunities for sentiment analysis. However, previous studies have often overlooked the issue of detail loss during modal interaction fusion. They also exhibit limitations in addressing semantic alignment challenges and the sensitivity of modalities to noise. To enhance analytical accuracy, a novel model named MAHFNet is proposed. The proposed architecture is composed of three main components. Firstly, an attention-guided gated interaction alignment module is developed for modeling the semantic interaction between text and image using a gated network and a cross-modal attention mechanism. Next, a contrastive learning mechanism is introduced to encourage the aggregation of semantically aligned image-text pairs. Subsequently, an intra-modality emotion extraction module is designed to extract local emotional features within each modality. This module serves to compensate for detail loss during interaction fusion. The intra-modal local emotion features and cross-modal interaction features are then fed into a hierarchical gated fusion module, where the local features are fused through a cross-gated mechanism to dynamically adjust the contribution of each modality while suppressing modality-specific noise. Then, the fusion results and cross-modal interaction features are further fused using a multi-scale attention gating module to capture hierarchical dependencies between local and global emotional information, thereby enhancing the model’s ability to perceive and integrate emotional cues across multiple semantic levels. Finally, extensive experiments have been conducted on three public multimodal sentiment datasets, with results demonstrating that the proposed model outperforms existing methods across multiple evaluation metrics. Specifically, on the TumEmo dataset, our model achieves improvements of 2.55% in ACC and 2.63% in F1 score compared to the second-best method. On the HFM dataset, these gains reach 0.56% in ACC and 0.9% in F1 score, respectively. On the MVSA-S dataset, these gains reach 0.03% in ACC and 1.26% in F1 score. These findings collectively validate the overall effectiveness of the proposed model. Full article
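A minimal sketch of a cross-gated fusion step of the kind described above: each modality's features are re-weighted by a gate computed from the other modality before being combined. Dimensions and layer choices are illustrative assumptions, not MAHFNet's implementation.

```python
import torch
import torch.nn as nn

class CrossGatedFusion(nn.Module):
    """Each modality's local features are gated by the other modality, then combined."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate_text = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_image = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, text_feat, image_feat):       # (B, D) each
        gated_text = text_feat * self.gate_text(image_feat)     # image gates text
        gated_image = image_feat * self.gate_image(text_feat)   # text gates image
        return self.proj(torch.cat([gated_text, gated_image], dim=1))
```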