Search Results (231)

Search Parameters:
Keywords = dual-modality attention

29 pages, 13806 KB  
Article
DCAM-DETR: Dual Cross-Attention Mamba Detection Transformer for RGB–Infrared Anti-UAV Detection
by Zemin Qin and Yuheng Li
Information 2026, 17(1), 103; https://doi.org/10.3390/info17010103 - 19 Jan 2026
Viewed by 123
Abstract
The proliferation of unmanned aerial vehicles (UAVs) poses escalating security threats across critical infrastructures, necessitating robust real-time detection systems. Existing vision-based methods predominantly rely on single-modality data and exhibit significant performance degradation under challenging scenarios. To address these limitations, we propose DCAM-DETR, a novel multimodal detection framework that fuses RGB and thermal infrared modalities through an enhanced RT-DETR architecture integrated with state space models. Our approach introduces four innovations: (1) a MobileMamba backbone leveraging selective state space models for efficient long-range dependency modeling with linear complexity O(n); (2) Cross-Dimensional Attention (CDA) and Cross-Path Attention (CPA) modules capturing intermodal correlations across spatial and channel dimensions; (3) an Adaptive Feature Fusion Module (AFFM) dynamically calibrating multimodal feature contributions; and (4) a Dual-Attention Decoupling Module (DADM) enhancing detection head discrimination for small targets. Experiments on Anti-UAV300 demonstrate state-of-the-art performance with 94.7% mAP@0.5 and 78.3% mAP@0.5:0.95 at 42 FPS. Extended evaluations on FLIR-ADAS and KAIST datasets validate the generalization capacity across diverse scenarios. Full article
(This article belongs to the Special Issue Computer Vision for Security Applications, 2nd Edition)
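As an illustration of the cross-modal attention idea described in the abstract, here is a minimal PyTorch sketch of channel-level cross-attention between RGB and infrared feature maps; the module and parameter names are hypothetical, and this is not the paper's CDA/CPA implementation.

```python
import torch
import torch.nn as nn

class ChannelCrossAttention(nn.Module):
    """Minimal channel-wise cross-attention between two modality feature maps.
    Illustrative only; not the DCAM-DETR CDA/CPA module."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, kernel_size=1)
        self.key = nn.Conv2d(channels, channels, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb.shape
        q = self.query(rgb).flatten(2)                    # (B, C, HW): queries from RGB
        k = self.key(ir).flatten(2)                       # (B, C, HW): keys from infrared
        v = self.value(ir).flatten(2)                     # (B, C, HW): values from infrared
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # (B, C, C) channel affinity
        fused = (attn @ v).view(b, c, h, w)               # infrared content re-weighted per RGB channel
        return rgb + fused                                # residual fusion

# toy usage with random feature maps
rgb_feat = torch.randn(2, 64, 32, 32)
ir_feat = torch.randn(2, 64, 32, 32)
print(ChannelCrossAttention(64)(rgb_feat, ir_feat).shape)  # torch.Size([2, 64, 32, 32])
```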

23 pages, 2211 KB  
Article
BEMF-Net: A Boundary-Enhanced Multi-Scale Feature Fusion Network
by Jiayi Zhang, Chao Xu and Zhengping Li
Electronics 2026, 15(2), 430; https://doi.org/10.3390/electronics15020430 - 19 Jan 2026
Viewed by 125
Abstract
The elevated morbidity and mortality of kidney cancer make precise, automated segmentation of kidneys and tumors essential for supporting clinical diagnosis and guiding surgical interventions. Recently, kidney tumor segmentation has been significantly advanced by deep learning. However, persistent challenges include the fuzzy boundaries of kidney tumors, large variations in the location and size of kidneys and renal tumors, and the strikingly similar textural characteristics of malignant lesions and the surrounding renal parenchyma. To overcome these constraints, this study introduces a boundary-enhanced multi-scale feature fusion network (BEMF-Net) for endoscopic image segmentation of kidney tumors. The network incorporates a boundary-selective attention module (BSA) to resolve boundary ambiguity and recover more accurate tumor contours. Furthermore, we introduce a multi-scale feature fusion attention module (MFA) that handles four distinct feature hierarchies captured by the encoder, allowing the model to accommodate the wide size variations observed in kidney tumors. Finally, a hybrid cross-modal attention module (HCA) completes the design: its dual-branch structure combines a Transformer and a CNN, integrating global contextual relationships with fine-grained local patterns. On the Re-TMRS dataset, our approach achieved mDice and mIoU scores of 91.2% and 85.7%, confirming superior segmentation quality and generalization compared with leading existing methods. Full article
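To make the multi-scale fusion idea concrete, the sketch below aligns four encoder feature hierarchies to a common resolution and re-weights them with channel attention; it is a generic illustration under assumed channel counts, not the published MFA module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionAttention(nn.Module):
    """Fuse four encoder feature hierarchies with channel attention.
    Illustrative sketch; not the BEMF-Net MFA module."""
    def __init__(self, in_channels, out_channels: int = 64):
        super().__init__()
        self.reduce = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(4 * out_channels, out_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, 4 * out_channels, 1), nn.Sigmoid(),
        )
        self.project = nn.Conv2d(4 * out_channels, out_channels, 1)

    def forward(self, feats):
        size = feats[0].shape[-2:]                      # target: highest-resolution map
        aligned = [F.interpolate(r(f), size=size, mode="bilinear", align_corners=False)
                   for r, f in zip(self.reduce, feats)]
        x = torch.cat(aligned, dim=1)                   # stack all scales along channels
        x = x * self.se(x)                              # re-weight each scale's channels
        return self.project(x)

feats = [torch.randn(1, c, s, s) for c, s in [(32, 64), (64, 32), (128, 16), (256, 8)]]
print(MultiScaleFusionAttention([32, 64, 128, 256])(feats).shape)  # torch.Size([1, 64, 64, 64])
```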

37 pages, 1276 KB  
Review
Versatility of Transcranial Magnetic Stimulation: A Review of Diagnostic and Therapeutic Applications
by Massimo Pascuzzi, Nika Naeini, Adam Dorich, Marco D’Angelo, Jiwon Kim, Jean-Francois Nankoo, Naaz Desai and Robert Chen
Brain Sci. 2026, 16(1), 101; https://doi.org/10.3390/brainsci16010101 - 17 Jan 2026
Viewed by 324
Abstract
Transcranial magnetic stimulation (TMS) is a non-invasive neuromodulation technique that utilizes magnetic fields to induce cortical electric currents, enabling both the measurement and modulation of neuronal activity. Initially developed as a diagnostic tool, TMS now serves dual roles in clinical neurology, offering insight into neurophysiological dysfunctions and the therapeutic modulation of abnormal cortical excitability. This review examines key TMS outcome measures, including motor thresholds (MT), input–output (I/O) curves, cortical silent periods (CSP), and paired-pulse paradigms such as short-interval intracortical inhibition (SICI), short-interval intracortical facilitation (SICF), intracortical facilitation (ICF), long-interval intracortical inhibition (LICI), interhemispheric inhibition (IHI), and short-latency afferent inhibition (SAI). These biomarkers reflect underlying neurotransmitter systems and can aid in differentiating neurological conditions. Diagnostic applications of TMS are explored in Parkinson’s disease (PD), dystonia, essential tremor (ET), Alzheimer’s disease (AD), and mild cognitive impairment (MCI). Each condition displays characteristic neurophysiological profiles, highlighting the potential for TMS-derived biomarkers in early or differential diagnosis. Therapeutically, repetitive TMS (rTMS) has shown promise in modulating cortical circuits and improving motor and cognitive symptoms. High- and low-frequency stimulation protocols have demonstrated efficacy in PD, dystonia, ET, AD, and MCI, targeting the specific cortical regions implicated in each disorder. Moreover, the successful application of TMS in differentiating and treating AD and MCI underscores its clinical utility and translational potential across neurodegenerative conditions. As research advances, increased attention and investment in TMS could facilitate similar diagnostic and therapeutic breakthroughs for other neurological disorders that currently lack robust tools for early detection and effective intervention. This review also underscores the importance of maintaining standardized TMS protocols. By highlighting inconsistencies and variability in outcomes across studies, we emphasize that careful methodological design is critical for ensuring the reproducibility, comparability, and reliable interpretation of TMS findings. In summary, this review emphasizes the value of TMS as a distinctive, non-invasive approach to probing brain function and highlights its considerable promise as both a diagnostic and therapeutic modality in neurology—roles that are often considered separately. Full article

23 pages, 2992 KB  
Article
Key-Value Mapping-Based Text-to-Image Diffusion Model Backdoor Attacks
by Lujia Chai, Yang Hou, Guozhao Liao and Qiuling Yue
Algorithms 2026, 19(1), 74; https://doi.org/10.3390/a19010074 - 15 Jan 2026
Viewed by 121
Abstract
Text-to-image (T2I) generation, a core component of generative artificial intelligence (AI), is increasingly important for creative industries and human–computer interaction. Despite impressive progress in realism and diversity, diffusion models still exhibit critical security blind spots, particularly in the Transformer key-value mapping mechanism that underpins cross-modal alignment. Existing backdoor attacks often rely on large-scale data poisoning or extensive fine-tuning, leading to low efficiency and limited stealth. To address these challenges, we propose two efficient backdoor attack methods, AttnBackdoor and SemBackdoor, grounded in the Transformer’s key-value storage principle. AttnBackdoor injects precise mappings between trigger prompts and target instances by fine-tuning the key-value projection matrices in U-Net cross-attention layers (≈5% of parameters). SemBackdoor establishes semantic-level mappings by editing the text encoder’s MLP projection matrix (≈0.3% of parameters). Both approaches achieve high attack success rates (>90%), with SemBackdoor reaching 98.6% and AttnBackdoor 97.2%. They also reduce parameter updates and training time by 1–2 orders of magnitude compared to prior work while preserving benign generation quality. Our findings reveal dual vulnerabilities at the visual and semantic levels and provide a foundation for developing next-generation defenses for secure generative AI. Full article
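The key mechanism is updating only the key-value projection matrices of cross-attention layers. The sketch below shows how such parameter targeting can be expressed generically in PyTorch; the "to_k"/"to_v" naming follows a common cross-attention convention and is an assumption here, and the toy model is not a diffusion U-Net.

```python
import torch.nn as nn

def select_kv_parameters(model: nn.Module, patterns=("to_k", "to_v")) -> list:
    """Freeze every parameter except the key/value projection weights.
    Illustrative of the parameter-targeting idea only, not the paper's code."""
    trainable = []
    for name, param in model.named_parameters():
        if any(p in name for p in patterns):
            param.requires_grad = True
            trainable.append(param)
        else:
            param.requires_grad = False
    return trainable

# toy stand-in for one cross-attention layer
class ToyCrossAttention(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)   # text-conditioned keys
        self.to_v = nn.Linear(dim, dim)   # text-conditioned values
        self.to_out = nn.Linear(dim, dim)

model = nn.Sequential(ToyCrossAttention(), ToyCrossAttention())
params = select_kv_parameters(model)
total = sum(p.numel() for p in model.parameters())
tuned = sum(p.numel() for p in params)
print(f"tuning {tuned}/{total} parameters ({100 * tuned / total:.1f}%)")
```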

20 pages, 5073 KB  
Article
SAWGAN-BDCMA: A Self-Attention Wasserstein GAN and Bidirectional Cross-Modal Attention Framework for Multimodal Emotion Recognition
by Ning Zhang, Shiwei Su, Haozhe Zhang, Hantong Yang, Runfang Hao and Kun Yang
Sensors 2026, 26(2), 582; https://doi.org/10.3390/s26020582 - 15 Jan 2026
Viewed by 148
Abstract
Emotion recognition from physiological signals is pivotal for advancing human–computer interaction, yet unimodal pipelines frequently underperform due to limited information, constrained data diversity, and suboptimal cross-modal fusion. Addressing these limitations, the Self-Attention Wasserstein Generative Adversarial Network with Bidirectional Cross-Modal Attention (SAWGAN-BDCMA) framework is proposed. This framework reorganizes the learning process around three complementary components: (1) a Self-Attention Wasserstein GAN (SAWGAN) that synthesizes high-quality Electroencephalography (EEG) and Photoplethysmography (PPG) to expand diversity and alleviate distributional imbalance; (2) a dual-branch architecture that distills discriminative spatiotemporal representations within each modality; and (3) a Bidirectional Cross-Modal Attention (BDCMA) mechanism that enables deep two-way interaction and adaptive weighting for robust fusion. Evaluated on the DEAP and ECSMP datasets, SAWGAN-BDCMA significantly outperforms multiple contemporary methods, achieving 94.25% accuracy for binary and 87.93% for quaternary classification on DEAP. Furthermore, it attains 97.49% accuracy for six-class emotion recognition on the ECSMP dataset. Compared with state-of-the-art multimodal approaches, the proposed framework achieves an accuracy improvement ranging from 0.57% to 14.01% across various tasks. These findings offer a robust solution to the long-standing challenges of data scarcity and modal imbalance, providing a profound theoretical and technical foundation for fine-grained emotion recognition and intelligent human–computer collaboration. Full article
(This article belongs to the Special Issue Advanced Signal Processing for Affective Computing)
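As a concrete reference point for the bidirectional cross-modal attention described above, the following minimal PyTorch sketch lets EEG and PPG sequences attend to each other and fuses the pooled results; dimensions and pooling are illustrative assumptions, not the BDCMA module.

```python
import torch
import torch.nn as nn

class BidirectionalCrossModalAttention(nn.Module):
    """EEG attends to PPG and PPG attends to EEG, then both are fused.
    Minimal sketch; not the published BDCMA module."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.eeg_to_ppg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ppg_to_eeg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, eeg: torch.Tensor, ppg: torch.Tensor) -> torch.Tensor:
        eeg_enh, _ = self.eeg_to_ppg(query=eeg, key=ppg, value=ppg)   # EEG enriched by PPG context
        ppg_enh, _ = self.ppg_to_eeg(query=ppg, key=eeg, value=eeg)   # PPG enriched by EEG context
        pooled = torch.cat([eeg_enh.mean(dim=1), ppg_enh.mean(dim=1)], dim=-1)
        return self.fuse(pooled)                                      # joint emotion representation

eeg = torch.randn(8, 128, 64)   # (batch, time, features)
ppg = torch.randn(8, 32, 64)
print(BidirectionalCrossModalAttention()(eeg, ppg).shape)  # torch.Size([8, 64])
```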

29 pages, 4179 KB  
Article
Ontology-Enhanced Deep Learning for Early Detection of Date Palm Diseases in Smart Farming Systems
by Naglaa E. Ghannam, H. Mancy, Asmaa Mohamed Fathy and Esraa A. Mahareek
AgriEngineering 2026, 8(1), 29; https://doi.org/10.3390/agriengineering8010029 - 13 Jan 2026
Viewed by 258
Abstract
Early and accurate date palm disease detection is key to the sustainability of smart farming ecosystems. In this paper, we introduce DoST-DPD, a new Dual-Stream Transformer architecture for multimodal disease diagnosis utilizing RGB, thermal, and NIR imaging. In contrast with standard deep learning approaches, our model receives ontology-based semantic supervision (via per-dataset OWL ontologies), enabling knowledge injection through SPARQL-driven reasoning during training. This structured knowledge layer not only improves multimodal feature correspondence but also enforces label consistency, improving generalization performance, particularly in early disease diagnosis. We tested the proposed method on a comprehensive set of five benchmarks (PlantVillage, PlantDoc, Figshare, Mendeley, and Kaggle Date Palm) together with domain-specific ontologies. An ablation study validates the effectiveness of ontology supervision, which consistently improves performance across Accuracy, Precision, Recall, F1-Score, and AUC. We achieve state-of-the-art performance over widely recognized baselines (PlantXViT, Multi-ViT, ERCP-Net, and ResNet), with DoST-DPD achieving the highest Accuracy of 99.3% and AUC of 98.2% on the PlantVillage dataset. In addition, ontology-driven attention maps and semantic consistency contribute to high interpretability and robustness across multiple crops and imaging modalities. This work presents a scalable roadmap for ontology-integrated AI systems in agriculture and illustrates how structured semantic reasoning can directly benefit multimodal plant disease detection systems. The proposed model demonstrates competitive performance across multiple datasets and highlights the unique advantage of integrating ontology-guided supervision in multimodal crop disease detection. Full article
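One simple way to express ontology-style label-consistency supervision is a penalty that keeps leaf-class predictions consistent with a parent-class head via a precomputed leaf-to-parent mapping (which could be extracted offline from an OWL ontology). The sketch below illustrates that idea only; the two-head design and the mapping are assumptions, not the DoST-DPD implementation.

```python
import torch
import torch.nn.functional as F

def ontology_consistency_loss(leaf_logits: torch.Tensor,
                              parent_logits: torch.Tensor,
                              leaf_to_parent: torch.Tensor) -> torch.Tensor:
    """Penalize leaf-class predictions whose implied parent distribution
    disagrees with the parent head. Illustrative sketch only."""
    leaf_probs = F.softmax(leaf_logits, dim=-1)                     # (B, num_leaf)
    num_parent = parent_logits.shape[-1]
    # aggregate leaf probabilities into their ontology parent classes
    implied = leaf_probs.new_zeros(leaf_probs.shape[0], num_parent)
    implied.index_add_(1, leaf_to_parent, leaf_probs)               # (B, num_parent)
    parent_log_probs = F.log_softmax(parent_logits, dim=-1)
    return F.kl_div(parent_log_probs, implied, reduction="batchmean")

leaf_logits = torch.randn(4, 10)                    # e.g. 10 disease classes
parent_logits = torch.randn(4, 3)                   # e.g. 3 ontology super-classes
leaf_to_parent = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])  # hypothetical mapping
print(ontology_consistency_loss(leaf_logits, parent_logits, leaf_to_parent))
```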

24 pages, 5237 KB  
Article
DCA-UNet: A Cross-Modal Ginkgo Crown Recognition Method Based on Multi-Source Data
by Yunzhi Guo, Yang Yu, Yan Li, Mengyuan Chen, Wenwen Kong, Yunpeng Zhao and Fei Liu
Plants 2026, 15(2), 249; https://doi.org/10.3390/plants15020249 - 13 Jan 2026
Viewed by 255
Abstract
Wild ginkgo, as an endangered species, holds significant value for genetic resource conservation, yet its practical applications face numerous challenges. Traditional field surveys are inefficient in mountainous mixed forests, while satellite remote sensing is limited by spatial resolution. Current deep learning approaches relying on single-source data or merely simple multi-source fusion fail to fully exploit the available information, leading to suboptimal recognition performance. This study presents a multimodal ginkgo crown dataset comprising RGB and multispectral images acquired by a UAV platform. To achieve precise crown segmentation with these data, we propose a novel dual-branch dynamic weighting fusion network, termed dual-branch cross-modal attention-enhanced UNet (DCA-UNet). We design a dual-branch encoder (DBE) with a two-stream architecture for independent feature extraction from each modality. We further develop a cross-modal interaction fusion module (CIF), employing cross-modal attention and learnable dynamic weights to boost multi-source information fusion. Additionally, we introduce an attention-enhanced decoder (AED) that combines progressive upsampling with a hybrid channel-spatial attention mechanism, thereby effectively utilizing multi-scale features and enhancing boundary semantic consistency. Evaluation on the ginkgo dataset demonstrates that DCA-UNet achieves a segmentation performance of 93.42% IoU (Intersection over Union), 96.82% PA (Pixel Accuracy), 96.38% Precision, and 96.60% F1-score. These results outperform the differential feature attention fusion network (DFAFNet) by 12.19%, 6.37%, 4.62%, and 6.95%, respectively, and surpass the single-modality baselines (RGB or multispectral) in all metrics. Superior performance on cross-flight-altitude data further validates the model’s strong generalization capability and robustness in complex scenarios. These results demonstrate the superiority of DCA-UNet in UAV-based multimodal ginkgo crown recognition, offering a reliable and efficient solution for monitoring wild endangered tree species. Full article
(This article belongs to the Special Issue Advanced Remote Sensing and AI Techniques in Agriculture and Forestry)
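The learnable dynamic weighting mentioned for the CIF module can be illustrated with a two-modality fusion layer whose mixing weights are trained end-to-end; the sketch below is a minimal stand-in, not the paper's module.

```python
import torch
import torch.nn as nn

class DynamicWeightFusion(nn.Module):
    """Fuse RGB and multispectral branch features with learnable softmax weights.
    Minimal sketch of dynamic-weighting fusion, not the DCA-UNet CIF module."""
    def __init__(self, channels: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))        # one weight per modality
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, rgb: torch.Tensor, ms: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.logits, dim=0)             # weights sum to 1, learned end-to-end
        fused = w[0] * rgb + w[1] * ms
        return self.refine(fused)

rgb = torch.randn(2, 32, 64, 64)
ms = torch.randn(2, 32, 64, 64)
print(DynamicWeightFusion(32)(rgb, ms).shape)  # torch.Size([2, 32, 64, 64])
```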

21 pages, 2506 KB  
Article
Collaborative Dispatch of Power–Transportation Coupled Networks Based on Physics-Informed Priors
by Zhizeng Kou, Yingli Wei, Shiyan Luan, Yungang Wu, Hancong Guo, Bochao Yang and Su Su
Electronics 2026, 15(2), 343; https://doi.org/10.3390/electronics15020343 - 13 Jan 2026
Viewed by 136
Abstract
Under China’s “dual-carbon” strategic goals and the advancement of smart city development, the rapid adoption of electric vehicles (EVs) has deepened the spatiotemporal coupling between transportation networks and distribution grids, posing new challenges for integrated energy systems. To address this, we propose a collaborative optimization framework for power–transportation coupled networks that integrates multi-modal data with physical priors. The framework constructs a joint feature space from traffic flow, pedestrian density, charging behavior, and grid operating states, and employs hypergraph modeling—guided by power flow balance and traffic flow conservation principles—to capture high-order cross-domain coupling. For prediction, spatiotemporal graph convolution combined with physics-informed attention significantly improves the accuracy of EV charging load forecasting. For optimization, a hierarchical multi-agent strategy integrating federated learning and the Alternating Direction Method of Multipliers (ADMM) enables privacy-preserving, distributed charging load scheduling. Case studies conducted on a 69-node distribution network using real traffic and charging data demonstrate that the proposed method reduces the grid’s peak–valley difference by 20.16%, reduces system operating costs by approximately 25%, and outperforms mainstream baseline models in prediction accuracy, algorithm convergence speed, and long-term operational stability. This work provides a practical and scalable technical pathway for the deep integration of energy and transportation systems in future smart cities. Full article
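The distributed scheduling step relies on ADMM-style consensus. The toy sketch below shows consensus ADMM with quadratic local costs and no grid constraints, purely to illustrate the update structure; it is not the paper's hierarchical federated scheme.

```python
import numpy as np

def admm_consensus_schedule(targets, rho=1.0, iters=100):
    """Agents agree on a shared charging profile via consensus ADMM.
    Toy sketch with quadratic local costs only."""
    targets = np.asarray(targets, dtype=float)     # each agent's preferred profile
    x = targets.copy()                             # local schedules
    z = targets.mean(axis=0)                       # consensus schedule
    u = np.zeros_like(x)                           # scaled dual variables
    for _ in range(iters):
        # local step: argmin_x ||x - target||^2 + (rho/2)||x - z + u||^2
        x = (2 * targets + rho * (z - u)) / (2 + rho)
        z = (x + u).mean(axis=0)                   # global averaging step
        u = u + x - z                              # dual update
    return z

prefs = np.array([[3.0, 1.0, 0.5], [1.0, 2.0, 2.0], [0.5, 0.5, 3.0]])  # kW per time slot
print(admm_consensus_schedule(prefs))             # converges to the per-slot average
```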

31 pages, 4778 KB  
Article
ESCFM-YOLO: Lightweight Dual-Stream Architecture for Real-Time Small-Scale Fire Smoke Detection on Edge Devices
by Jong-Chan Park, Myeongjun Kim, Sang-Min Choi and Gun-Woo Kim
Appl. Sci. 2026, 16(2), 778; https://doi.org/10.3390/app16020778 - 12 Jan 2026
Viewed by 146
Abstract
Early detection of small-scale fires is crucial for minimizing damage and enabling rapid emergency response. While recent deep learning-based fire detection systems have achieved high accuracy, they still face three key challenges: (1) limited deployability in resource-constrained edge environments due to high computational costs, (2) performance degradation caused by feature interference when jointly learning flame and smoke features in a single backbone, and (3) low sensitivity to small flames and thin smoke in the initial stages. To address these issues, we propose a lightweight dual-stream fire detection architecture based on YOLOv5n, which learns flame and smoke features separately to improve both accuracy and efficiency under strict edge constraints. The proposed method integrates two specialized attention modules: ESCFM++, which enhances spatial and channel discrimination for sharp boundaries and local flame structures (flame), and ESCFM-RS, which captures low-contrast, diffuse smoke patterns through depthwise convolutions and residual scaling (smoke). On the D-Fire dataset, the flame detector achieved 74.5% mAP@50 with only 1.89 M parameters, while the smoke detector achieved 89.2% mAP@50. When deployed on an NVIDIA Jetson Xavier NX (NVIDIA Corporation, Santa Clara, CA, USA), the system achieved 59.7 FPS (single-stream) and 28.3 FPS (dual-stream) with GPU utilization below 90% and power consumption under 17 W. Under identical on-device conditions, it outperforms YOLOv9t and YOLOv12n by 36–62% in FPS and 0.7–2.0% in detection accuracy. We further validate deployment via outdoor day/night long-range live-stream tests on the Jetson using our flame detector, showing reliable capture of small, distant flames that appear as tiny cues on the screen, particularly in challenging daytime scenes. Overall, these results demonstrate that modality-specific stream specialization and ESCFM attention reduce feature interference while improving detection accuracy and computational efficiency for real-time edge-device fire monitoring. Full article
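The smoke-branch attention is described as combining depthwise convolutions with residual scaling; a minimal PyTorch sketch of that pattern follows, with all names and details illustrative assumptions rather than the ESCFM-RS code.

```python
import torch
import torch.nn as nn

class DepthwiseResidualAttention(nn.Module):
    """Depthwise-conv spatial/channel gate with learnable residual scaling.
    Illustrative sketch in the spirit of the smoke-branch module."""
    def __init__(self, channels: int):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # depthwise
        self.pw = nn.Conv2d(channels, channels, 1)                               # pointwise
        self.gate = nn.Sigmoid()
        self.scale = nn.Parameter(torch.tensor(0.1))   # residual scaling, starts small

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.gate(self.pw(self.dw(x)))           # per-location, per-channel gate
        return x + self.scale * (x * attn)              # scaled residual re-weighting

x = torch.randn(1, 16, 40, 40)
print(DepthwiseResidualAttention(16)(x).shape)  # torch.Size([1, 16, 40, 40])
```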

31 pages, 10745 KB  
Article
CNN-GCN Coordinated Multimodal Frequency Network for Hyperspectral Image and LiDAR Classification
by Haibin Wu, Haoran Lv, Aili Wang, Siqi Yan, Gabor Molnar, Liang Yu and Minhui Wang
Remote Sens. 2026, 18(2), 216; https://doi.org/10.3390/rs18020216 - 9 Jan 2026
Viewed by 221
Abstract
The existing multimodal image classification methods often suffer from several key limitations: difficulty in effectively balancing local detail and global topological relationships in hyperspectral image (HSI) feature extraction; insufficient multi-scale characterization of terrain features from light detection and ranging (LiDAR) elevation data; and neglect of deep inter-modal interactions in traditional fusion methods, often accompanied by high computational complexity. To address these issues, this paper proposes a comprehensive deep learning framework combining a convolutional neural network (CNN), a graph convolutional network (GCN), and the wavelet transform for the joint classification of HSI and LiDAR data, including several novel components: a Spectral Graph Mixer Block (SGMB), in which a CNN branch captures fine-grained spectral–spatial features by multi-scale convolutions while a parallel GCN branch models long-range contextual features through an enhanced gated graph network, so that local detail and global topological features are extracted from HSI data simultaneously; a Spatial Coordinate Block (SCB) to enhance spatial awareness and improve the perception of object contours and distribution patterns; a Multi-Scale Elevation Feature Extraction Block (MSFE) for capturing terrain representations across varying scales; and a Bidirectional Frequency Attention Encoder (BiFAE) to enable efficient and deep interaction between multimodal features. These modules are designed to work in concert, forming a cohesive end-to-end framework that not only achieves a more effective balance between local details and global contexts but also enables deep yet computationally efficient interaction across features, significantly strengthening the discriminability and robustness of the learned representation. To evaluate the proposed method, we conducted experiments on three multimodal remote sensing datasets: Houston2013, Augsburg, and Trento. Quantitative results demonstrate that our framework outperforms state-of-the-art methods, achieving OA values of 98.93%, 88.05%, and 99.59% on the respective datasets. Full article
(This article belongs to the Section AI Remote Sensing)
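The dual-path idea of pairing a CNN branch (local detail) with a GCN branch (long-range graph context) can be sketched in a few lines; the block below operates on generic node features with a row-normalized adjacency and is not the SGMB implementation.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Parallel CNN (local) and GCN (graph-context) branches over node features.
    Simplified sketch, not the SGMB implementation."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv1d(in_dim, out_dim, 3, padding=1), nn.ReLU())
        self.gcn_weight = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (nodes, in_dim) pixel/superpixel spectra; adj: (nodes, nodes) graph
        local = self.cnn(x.t().unsqueeze(0)).squeeze(0).t()     # (nodes, out_dim)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        context = self.gcn_weight((adj / deg) @ x)              # row-normalized aggregation
        return torch.relu(local + context)                      # fuse local and global paths

x = torch.randn(50, 30)                    # 50 graph nodes, 30 spectral bands
adj = (torch.rand(50, 50) > 0.8).float()
print(DualPathBlock(30, 16)(x, adj).shape)  # torch.Size([50, 16])
```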

20 pages, 3945 KB  
Article
Dual-Modal Mixture-of-KAN Network for Lithium-Ion Battery State-of-Health Estimation Using Early Charging Data
by Yun Wang, Ziyang Zhang and Fan Zhang
Energies 2026, 19(2), 335; https://doi.org/10.3390/en19020335 - 9 Jan 2026
Viewed by 241
Abstract
Accurate estimation of the state of health (SOH) of lithium-ion batteries is crucial for the safe operation of electric vehicles and energy storage systems. However, most existing methods rely on complete charging curves or manual feature engineering, making them difficult to adapt to practical scenarios where only limited charging segments are available. To fully exploit degradation information from limited charging data, this paper proposes a dual-modal mixture of Kolmogorov–Arnold network (DM-MoKAN) for lithium-ion battery SOH estimation using only early-stage constant-current charging voltage data. The proposed method incorporates three synergistic modules: an image branch, a sequence branch, and a dual-modal fusion regression module. The image branch converts one-dimensional voltage sequences into two-dimensional Gramian Angular Difference Field (GADF) images and extracts spatial degradation features through a lightweight network integrating Ghost convolution and efficient channel attention (ECA). The sequence branch employs a patch-based Transformer encoder to directly model local patterns and long-range dependencies in the raw voltage sequence. The dual-modal fusion module concatenates features from both branches and feeds them into a MoKAN regression head composed of multiple KAN experts and a gating network for adaptive nonlinear mapping to SOH. Experimental results demonstrate that DM-MoKAN outperforms various baseline methods on both Oxford and NASA datasets, achieving average RMSE/MAE of 0.28%/0.19% and 0.89%/0.71%, respectively. Ablation experiments further verify the effective contributions of the dual-modal fusion strategy, ECA attention mechanism, and MoKAN regression head to estimation performance improvement. Full article
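The image branch converts 1-D voltage segments into Gramian Angular Difference Field images. The standard GADF construction is shown below (rescale to [-1, 1], map to polar angles, take the sine of pairwise angle differences); the paper's exact preprocessing may differ.

```python
import numpy as np

def gramian_angular_difference_field(x: np.ndarray) -> np.ndarray:
    """Convert a 1-D charging-voltage segment into a 2-D GADF image."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    x_scaled = 2 * (x - x_min) / (x_max - x_min + 1e-12) - 1.0   # rescale to [-1, 1]
    phi = np.arccos(np.clip(x_scaled, -1.0, 1.0))                 # polar-coordinate angles
    return np.sin(phi[:, None] - phi[None, :])                    # GADF[i, j] = sin(phi_i - phi_j)

voltage = np.linspace(3.6, 4.0, 64)          # toy constant-current charging segment
image = gramian_angular_difference_field(voltage)
print(image.shape)                            # (64, 64), ready for a 2-D CNN branch
```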

20 pages, 2862 KB  
Article
Image–Text Multimodal Sentiment Analysis Algorithm Based on Curriculum Learning and Attention Mechanisms
by Yifan Chang, Zhuoxin Li, Youxiang Ruan and Guangqiang Yin
Big Data Cogn. Comput. 2026, 10(1), 23; https://doi.org/10.3390/bdcc10010023 - 7 Jan 2026
Viewed by 293
Abstract
With the rapid development of mobile internet technology, the explosive growth of image–text multimodal data generated by social networking platforms has provided rich practical scenarios and theoretical research value for multimodal sentiment analysis. However, existing methods generally suffer from inefficient modal interaction and imperfect sentiment aggregation mechanisms, particularly an over-reliance on visual modalities, leading to an imbalance in cross-modal semantic correlation modeling. To address these issues, this paper proposes a sentiment analysis algorithm for image–text modalities based on curriculum learning and attention mechanisms. The algorithm introduces the concept of curriculum learning, fully considering the negative impact of irrelevant images in image–text data on overall sentiment analysis, effectively suppressing interference from irrelevant visual information without requiring manual data cleaning. Meanwhile, the algorithm designs a dual-stage attention architecture—first capturing cross-modal correlation features via cross-modal attention, then introducing an attention bottleneck strategy to compress redundant information flow, achieving efficient feature fusion by constraining intra-modal attention dimensions. Finally, extensive experiments were conducted on two public datasets, demonstrating that the proposed method outperforms existing approaches in sentiment prediction performance. Full article
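The attention-bottleneck strategy can be illustrated by routing image-text interaction through a handful of learnable bottleneck tokens, so redundant cross-modal information is compressed before classification; the sketch below is a generic illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AttentionBottleneckFusion(nn.Module):
    """Compress image-text interaction through a few bottleneck tokens.
    Illustrative sketch of the bottleneck idea only."""
    def __init__(self, dim: int = 64, num_bottleneck: int = 4, heads: int = 4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, num_bottleneck, dim) * 0.02)
        self.collect = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classify = nn.Linear(dim, 3)     # e.g. negative / neutral / positive

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        tokens = torch.cat([text_tokens, image_tokens], dim=1)       # all modality tokens
        b = self.bottleneck.expand(tokens.shape[0], -1, -1)
        fused, _ = self.collect(query=b, key=tokens, value=tokens)   # compress through bottleneck
        return self.classify(fused.mean(dim=1))                      # sentiment logits

text = torch.randn(2, 20, 64)
image = torch.randn(2, 49, 64)
print(AttentionBottleneckFusion()(text, image).shape)  # torch.Size([2, 3])
```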

22 pages, 5960 KB  
Article
JFDet: Joint Fusion and Detection for Multimodal Remote Sensing Imagery
by Wenhao Xu and You Yang
Remote Sens. 2026, 18(1), 176; https://doi.org/10.3390/rs18010176 - 5 Jan 2026
Viewed by 231
Abstract
Multimodal remote sensing imagery, such as visible and infrared data, offers crucial complementary information that is vital for time-sensitive emergency applications like search and rescue or disaster monitoring, where robust detection under adverse conditions is essential. However, existing methods’ object detection performance is often suboptimal due to task-independent fusion and inherent modality inconsistency. To address these issues, we propose a joint fusion and detection approach for multimodal remote sensing imagery (JFDet). First, a gradient-enhanced residual module (GERM) is introduced to combine dense feature connections with gradient residual pathways, effectively enhancing structural representation and fine-grained texture details in fused images. For robust detection, we introduce a second-order channel attention (SOCA) mechanism and design a multi-scale contextual feature-encoding (MCFE) module to capture higher-order semantic dependencies, enrich multi-scale contextual information, and thereby improve the recognition of small and variably scaled objects. Furthermore, a dual-loss feedback strategy propagates detection loss to the fusion network, enabling adaptive synergy between low-level fusion and high-level detection. Experiments on the VEDAI and FLIR-ADAS datasets demonstrate that the proposed detection-driven fusion framework significantly improves both fusion quality and detection accuracy compared with state-of-the-art methods, highlighting its effectiveness and high potential for mission-critical multimodal remote sensing and time-sensitive applications. Full article
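The dual-loss feedback strategy amounts to letting detection gradients flow back into the fusion network. The toy sketch below uses placeholder fusion and detection losses and an assumed weighting factor, purely to show the gradient path; it is not JFDet's training code.

```python
import torch
import torch.nn as nn

# toy stand-ins: a fusion net producing a fused image and a detector scoring it
fusion_net = nn.Conv2d(2, 1, 3, padding=1)       # fuses stacked visible+infrared channels
detector = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                         nn.Flatten(), nn.Linear(8, 1))

visible = torch.randn(4, 1, 64, 64)
infrared = torch.randn(4, 1, 64, 64)
target_score = torch.rand(4, 1)

fused = fusion_net(torch.cat([visible, infrared], dim=1))
fusion_loss = ((fused - visible) ** 2).mean() + ((fused - infrared) ** 2).mean()  # keep fused image close to both inputs
detection_loss = nn.functional.binary_cross_entropy_with_logits(detector(fused), target_score)

# dual-loss feedback: detection gradients flow back through the fusion network
total_loss = fusion_loss + 0.5 * detection_loss   # weighting factor is an assumption
total_loss.backward()
print(fusion_net.weight.grad.abs().mean() > 0)    # fusion weights receive detection feedback: tensor(True)
```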

17 pages, 779 KB  
Article
Geometry Diagram Parsing and Reasoning Based on Deep Semantic Fusion
by Pengpeng Jian, Xuhui Zhang, Lei Wu, Bin Ma and Wangyang Hong
Symmetry 2026, 18(1), 92; https://doi.org/10.3390/sym18010092 - 4 Jan 2026
Viewed by 314
Abstract
Effective Automated Geometric Problem Solving (AGP) requires a deep integration of visual perception and textual comprehension. To address this, we propose a dual-stream fusion model that injects deep semantic understanding from a Pre-trained Language Model (PLM) into the geometric diagram parsing pipeline. Our core innovation is a Semantic-Guided Cross-Attention (SGCA) mechanism, which uses the global semantic intent of the problem text to direct attention toward key visual primitives. This yields context-enriched visual representations that serve as inputs to a Graph Neural Network (GNN), enabling relational reasoning that is not only perception-driven but also context-aware. By explicitly bridging the semantic gap between text and diagrams, our approach delivers more robust and accurate predictions. To the best of our knowledge, this is the first study to introduce a semantic-guided cross-attention mechanism into geometric diagram parsing, establishing a new paradigm that effectively addresses the cross-modal semantic gap and achieves state-of-the-art performance. This is particularly effective for parsing problems involving geometric symmetries, where textual cues often clarify or define symmetrical relationships not obvious from the diagram alone. Full article
(This article belongs to the Special Issue Symmetry and Asymmetry in Human-Computer Interaction)
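The semantic-guided cross-attention idea, using a global text embedding as the query over visual-primitive features, can be sketched as follows; dimensions, the broadcast step, and names are assumptions, not the SGCA module.

```python
import torch
import torch.nn as nn

class SemanticGuidedCrossAttention(nn.Module):
    """Use the problem text's global embedding to re-weight visual-primitive features.
    Minimal sketch of the guidance idea, not the SGCA module."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, primitives: torch.Tensor, text_global: torch.Tensor):
        # primitives: (B, N, dim) features of points/lines/circles from the parser
        # text_global: (B, dim) sentence embedding from a pre-trained language model
        query = text_global.unsqueeze(1)                         # one semantic query token
        context, weights = self.attn(query, primitives, primitives)
        # broadcast the semantic context back onto every primitive (context-enriched features)
        return primitives + context.expand_as(primitives), weights

prims = torch.randn(2, 12, 64)
text = torch.randn(2, 64)
enriched, w = SemanticGuidedCrossAttention()(prims, text)
print(enriched.shape, w.shape)   # torch.Size([2, 12, 64]) torch.Size([2, 1, 12])
```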

17 pages, 6410 KB  
Article
IESS-FusionNet: Physiologically Inspired EEG-EMG Fusion with Linear Recurrent Attention for Infantile Epileptic Spasms Syndrome Detection
by Junyuan Feng, Zhenzhen Liu, Linlin Shen, Xiaoling Luo, Yan Chen, Lin Li and Tian Zhang
Bioengineering 2026, 13(1), 57; https://doi.org/10.3390/bioengineering13010057 - 31 Dec 2025
Viewed by 462
Abstract
Infantile Epileptic Spasms Syndrome (IESS) is a devastating epileptic encephalopathy of infancy that carries a high risk of lifelong neurodevelopmental disability. Timely diagnosis is critical, as every week of delay in effective treatment is associated with worse cognitive outcomes. Although synchronized electroencephalogram (EEG) and surface electromyography (EMG) recordings capture both the electrophysiological and motor signatures of spasms, accurate automated detection remains challenging due to the non-stationary nature of the signals and the absence of physiologically plausible inter-modal fusion in current deep learning approaches. We introduce IESS-FusionNet, an end-to-end dual-stream framework specifically designed for accurate, real-time IESS detection from simultaneous EEG and EMG. Each modality is processed by a dedicated Unimodal Encoder that hierarchically integrates Continuous Wavelet Transform, Spatio-Temporal Convolution, and Bidirectional Mamba to efficiently extract frequency-specific, spatially structured, local and long-range temporal features within a compact module. A novel Cross Time-Mixing module, built upon the linear recurrent attention of the Receptance Weighted Key Value (RWKV) architecture, subsequently performs efficient, time-decaying, bidirectional cross-modal integration that explicitly respects the causal and physiological properties of cortico-muscular coupling during spasms. Evaluated on an in-house clinical dataset of synchronized EEG-EMG recordings from infants with confirmed IESS, IESS-FusionNet achieves 89.5% accuracy, 90.7% specificity, and 88.3% sensitivity, significantly outperforming recent unimodal and multimodal baselines. Comprehensive ablation studies validate the contribution of each component, while the proposed cross-modal fusion requires approximately 60% fewer parameters than equivalent quadratic cross-attention mechanisms, making it suitable for real-time clinical deployment. IESS-FusionNet delivers an accurate, computationally efficient solution with physiologically inspired cross-modal fusion for the automated detection of infantile epileptic spasms, offering promise for future clinical applications in reducing diagnostic delay. Full article
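A linear-time, time-decaying cross-modal mixing step can be written as a forward and backward recurrence over the EMG stream, which is then fused with the EEG stream. The sketch below is a simplified stand-in for the RWKV-style Cross Time-Mixing module; the decay parameterization and fusion are assumptions.

```python
import torch
import torch.nn as nn

class TimeDecayCrossMixing(nn.Module):
    """Bidirectional, exponentially time-decaying aggregation of EMG context into
    the EEG stream, computed as a linear recurrence (O(T)). Simplified sketch only."""
    def __init__(self, dim: int):
        super().__init__()
        self.decay_logit = nn.Parameter(torch.zeros(dim))   # per-channel decay in (0, 1)
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, eeg: torch.Tensor, emg: torch.Tensor) -> torch.Tensor:
        # eeg, emg: (B, T, dim), time-aligned recordings
        decay = torch.sigmoid(self.decay_logit)
        fwd = torch.zeros_like(emg[:, 0])
        bwd = torch.zeros_like(emg[:, 0])
        fwd_states, bwd_states = [], [None] * emg.shape[1]
        for t in range(emg.shape[1]):                         # forward recurrence
            fwd = decay * fwd + emg[:, t]
            fwd_states.append(fwd)
        for t in reversed(range(emg.shape[1])):               # backward recurrence
            bwd = decay * bwd + emg[:, t]
            bwd_states[t] = bwd
        # context[t] = sum_s decay^|t-s| * emg[s]; subtract emg to avoid double-counting s = t
        context = torch.stack(fwd_states, 1) + torch.stack(bwd_states, 1) - emg
        return self.mix(torch.cat([eeg, context], dim=-1))    # EEG fused with decayed EMG context

eeg = torch.randn(2, 100, 32)
emg = torch.randn(2, 100, 32)
print(TimeDecayCrossMixing(32)(eeg, emg).shape)  # torch.Size([2, 100, 32])
```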
