Search Results (122)

Search Parameters:
Keywords = multimodal and multiscale fusion

30 pages, 23104 KB  
Article
MSAFNet: Multi-Modal Marine Aquaculture Segmentation via Spatial–Frequency Adaptive Fusion
by Guolong Wu and Yimin Lu
Remote Sens. 2025, 17(20), 3425; https://doi.org/10.3390/rs17203425 - 13 Oct 2025
Viewed by 223
Abstract
Accurate mapping of marine aquaculture areas is critical for environmental management, marine ecosystem protection, and sustainable resource utilization. However, remote sensing imagery based on single-sensor modalities has inherent limitations when extracting aquaculture zones in complex marine environments. To address this challenge, we constructed a multi-modal dataset from five Chinese coastal regions using cloud detection methods and developed the Multi-modal Spatial–Frequency Adaptive Fusion Network (MSAFNet) for optical-radar data fusion. MSAFNet employs a dual-path architecture utilizing a Multi-scale Dual-path Feature Module (MDFM) that combines CNN and Transformer capabilities to extract multi-scale features. Additionally, it implements a Dynamic Frequency Domain Adaptive Fusion Module (DFAFM) to achieve deep integration of multi-modal features in both spatial and frequency domains, effectively leveraging the complementary advantages of different sensor data. Results demonstrate that MSAFNet achieves 76.93% mean intersection over union (mIoU), 86.96% mean F1 score (mF1), and 93.26% mean Kappa coefficient (mKappa) in extracting floating raft aquaculture (FRA) and cage aquaculture (CA), significantly outperforming existing methods. Applied to China’s coastal waters, the model generated nearshore aquaculture distribution maps for 2020, demonstrating its generalization capability and practical value in complex marine environments. This approach provides reliable technical support for marine resource management and ecological monitoring. Full article
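The abstract describes fusing optical and SAR features in both the spatial and frequency domains. A minimal illustrative sketch of what such a fusion can look like (shapes, the weighting scheme, and the module name are assumptions, not the paper's DFAFM):

```python
import torch
import torch.nn as nn

class ToyFrequencyFusion(nn.Module):
    """Illustrative spatial-frequency fusion of two modality feature maps."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial_gate = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # spectral mixing weight

    def forward(self, optical: torch.Tensor, sar: torch.Tensor) -> torch.Tensor:
        # spatial-domain fusion: 1x1 conv over the concatenated features
        spatial = self.spatial_gate(torch.cat([optical, sar], dim=1))
        # frequency-domain fusion: blend amplitude spectra, keep the optical phase
        f_opt, f_sar = torch.fft.fft2(optical), torch.fft.fft2(sar)
        amp = self.alpha * f_opt.abs() + (1 - self.alpha) * f_sar.abs()
        spectral = torch.fft.ifft2(torch.polar(amp, f_opt.angle())).real
        return spatial + spectral

x_opt = torch.randn(1, 16, 64, 64)   # toy optical features
x_sar = torch.randn(1, 16, 64, 64)   # toy SAR features
print(ToyFrequencyFusion(16)(x_opt, x_sar).shape)  # torch.Size([1, 16, 64, 64])
```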
19 pages, 1648 KB  
Article
Modality-Enhanced Multimodal Integrated Fusion Attention Model for Sentiment Analysis
by Zhenwei Zhang, Wenyan Wu, Tao Yuan and Guang Feng
Appl. Sci. 2025, 15(19), 10825; https://doi.org/10.3390/app151910825 - 9 Oct 2025
Viewed by 614
Abstract
Multimodal sentiment analysis aims to utilize multisource information such as text, speech and vision to more comprehensively and accurately identify an individual’s emotional state. However, existing methods still face challenges in practical applications, including modality heterogeneity, insufficient expressive power of non-verbal modalities, and low fusion efficiency. To address these issues, this paper proposes a Modality Enhanced Multimodal Integration Model (MEMMI). First, a modality enhancement module is designed to leverage the semantic guidance capability of the text modality, enhancing the feature representation of non-verbal modalities through a multihead attention mechanism and a dynamic routing strategy. Second, a gated fusion mechanism is introduced to selectively inject speech and visual information into the dominant text modality, enabling robust information completion and noise suppression. Finally, a combined attention fusion module is constructed to synchronously fuse information from all three modalities within a unified architecture, while a multiscale encoder is used to capture feature representations at different semantic levels. Experimental results on three benchmark datasets (CMU-MOSEI, CMU-MOSI, and CH-SIMS) demonstrate the superiority of the proposed model. On CMU-MOSI, it achieves an Acc-7 of 45.91, with binary accuracy/F1 of 82.86/84.60, MAE of 0.734, and Corr of 0.790, outperforming TFN and MulT by a large margin. On CMU-MOSEI, the model reaches an Acc-7 of 54.17, Acc-2/F1 of 83.69/86.02, MAE of 0.526, and Corr of 0.779, surpassing all baselines, including ALMT. On CH-SIMS, it further achieves 41.88, 66.52, and 77.68 in Acc-5/Acc-3/Acc-2, with F1 of 77.85, MAE of 0.450, and Corr of 0.594, establishing new state-of-the-art performance across datasets. Ablation studies further validate the effectiveness of each module in enhancing modality representation and fusion efficiency. Full article
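A minimal sketch of the gated injection the abstract describes, where audio and visual features are selectively added to a dominant text representation (dimensions and the exact gating form are assumptions):

```python
import torch
import torch.nn as nn

class ToyGatedInjection(nn.Module):
    """Gate how much non-verbal information flows into the text representation."""
    def __init__(self, d: int):
        super().__init__()
        self.proj_a = nn.Linear(d, d)   # project audio into the text space
        self.proj_v = nn.Linear(d, d)   # project vision into the text space
        self.gate_a = nn.Linear(2 * d, d)
        self.gate_v = nn.Linear(2 * d, d)

    def forward(self, text, audio, vision):
        g_a = torch.sigmoid(self.gate_a(torch.cat([text, audio], dim=-1)))
        g_v = torch.sigmoid(self.gate_v(torch.cat([text, vision], dim=-1)))
        # text stays dominant; gated residuals inject complementary cues
        return text + g_a * self.proj_a(audio) + g_v * self.proj_v(vision)

t, a, v = (torch.randn(4, 128) for _ in range(3))
print(ToyGatedInjection(128)(t, a, v).shape)  # torch.Size([4, 128])
```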
21 pages, 1094 KB  
Article
Dynamic Equivalence of Active Distribution Network: Multiscale and Multimodal Fusion Deep Learning Method with Automatic Parameter Tuning
by Wenhao Wang, Zhaoxi Liu, Fengzhe Dai and Huan Quan
Mathematics 2025, 13(19), 3213; https://doi.org/10.3390/math13193213 - 7 Oct 2025
Viewed by 294
Abstract
Dynamic equivalence of active distribution networks (ADNs) is emerging as one of the most important issues for backbone network security analysis due to the high penetration of distributed generations (DGs) and electric vehicles (EVs). The multiscale and multimodal fusion deep learning (MMFDL) method proposed in this paper contains two modalities. One is a CNN + attention module that simulates the Newton–Raphson power flow calculation (NRPFC) to extract the important features of a power system under disturbance, motivated by the similarities between NRPFC and convolutional network computation. The other is a long short-term memory (LSTM) + fully connected (FC) module for load modeling, based on the fact that LSTM + FC can represent a load's differential-algebraic equations (DAEs). Moreover, to better capture the relationship between voltage and power, the multiscale fusion method is used to aggregate load modeling models with different voltage input sizes, which are combined with the CNN + attention module to form MMFDL and represent the dynamic behaviors of ADNs. Then, the Kepler optimization algorithm (KOA) is applied to automatically tune the adjustable parameters of MMFDL (called KOA-MMFDL), especially the numbers of LSTM and FC hidden layers, which are important for load modeling and for which no prior knowledge is available. The performance of the proposed method was evaluated on different electric power systems and various disturbance scenarios. The error analysis shows that the proposed method can accurately represent the dynamic response of ADNs. In addition, comparative experiments verified that the proposed method is more robust and generalizable than other advanced non-mechanism methods. Full article
(This article belongs to the Section C2: Dynamical Systems)
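As an illustration of the LSTM + FC component the abstract attributes to load modeling, a minimal sketch mapping a voltage sequence to an active/reactive power response (layer sizes and input layout are assumptions, not the tuned KOA-MMFDL configuration):

```python
import torch
import torch.nn as nn

class ToyLoadModel(nn.Module):
    """LSTM + fully connected layers mapping a voltage sequence to a P/Q response."""
    def __init__(self, hidden: int = 32, layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                            num_layers=layers, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                nn.Linear(hidden, 2))  # active/reactive power

    def forward(self, voltage_seq):            # (batch, time, 1)
        out, _ = self.lstm(voltage_seq)
        return self.fc(out)                    # (batch, time, 2)

v = torch.randn(8, 100, 1)                     # 100 time steps of bus voltage
print(ToyLoadModel()(v).shape)                 # torch.Size([8, 100, 2])
```

In KOA-MMFDL the hidden sizes and layer counts would be selected by the Kepler optimization algorithm rather than fixed as above.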
15 pages, 3389 KB  
Article
Photovoltaic Decomposition Method Based on Multi-Scale Modeling and Multi-Feature Fusion
by Zhiheng Xu, Peidong Chen, Ran Cheng, Yao Duan, Qiang Luo, Huahui Zhang, Zhenning Pan and Wencong Xiao
Energies 2025, 18(19), 5271; https://doi.org/10.3390/en18195271 - 4 Oct 2025
Viewed by 310
Abstract
Deep learning-based Non-Intrusive Load Monitoring (NILM) methods have been widely applied to residential load identification. However, photovoltaic (PV) loads exhibit strong non-stationarity, high dependence on weather conditions, and strong coupling with multi-source data, which limit the accuracy and generalization of existing models. To address these challenges, this paper proposes a multi-scale and multi-feature fusion framework for PV disaggregation, consisting of three modules: Multi-Scale Time Series Decomposition (MTD), Multi-Feature Fusion (MFF), and Temporal Attention Decomposition (TAD). These modules jointly capture short-term fluctuations, long-term trends, and deep dependencies across multi-source features. Experiments were conducted on real residential datasets from southern China. Results show that, compared with representative baselines such as SGN-Conv and MAT-Conv, the proposed method reduces MAE by over 60% and SAE by nearly 70% for some users, and it achieves more than 45% error reduction in cross-user tests. These findings demonstrate that the proposed approach significantly enhances both accuracy and generalization in PV load disaggregation. Full article
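A minimal NumPy sketch of the kind of multi-scale decomposition the MTD module implies, splitting a net-load series into trends at several window sizes plus a residual (the window choices and decomposition rule are assumptions):

```python
import numpy as np

def multiscale_decompose(series: np.ndarray, windows=(5, 15, 60)):
    """Return moving-average trend components at several scales and the residual."""
    residual = series.astype(float).copy()
    components = {}
    for w in windows:
        kernel = np.ones(w) / w
        trend = np.convolve(residual, kernel, mode="same")
        components[f"trend_w{w}"] = trend
        residual = residual - trend
    components["residual"] = residual
    return components

t = np.arange(720)
net_load = 2.0 * np.sin(2 * np.pi * t / 144) + 0.3 * np.random.randn(t.size)
parts = multiscale_decompose(net_load)
print({k: round(float(v.std()), 3) for k, v in parts.items()})
```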
21 pages, 899 KB  
Article
Gated Fusion Networks for Multi-Modal Violence Detection
by Bilal Ahmad, Mustaqeem Khan and Muhammad Sajjad
AI 2025, 6(10), 259; https://doi.org/10.3390/ai6100259 - 3 Oct 2025
Viewed by 403
Abstract
Public safety and security require an effective monitoring system to detect violence through visual, audio, and motion data. However, current methods often fail to utilize the complementary benefits of visual and auditory modalities, thereby reducing their overall effectiveness. To enhance violence detection, we present a novel multimodal method in this paper that leverages motion, audio, and visual information from the input to recognize violence. We designed a framework comprising two specialized components, a gated fusion module and a multi-scale transformer, which together enable the efficient detection of violence in multimodal data. To ensure seamless and effective integration of features, the gated fusion module dynamically adjusts the contribution of each modality. At the same time, a multi-modal transformer utilizes multiple instance learning (MIL) to identify violent behaviors more accurately from input data by capturing complex temporal correlations. Our model fully integrates multi-modal information using these techniques, improving the accuracy of violence detection. Our approach outperforms state-of-the-art methods with an accuracy of 86.85% on the XD-Violence dataset, demonstrating the potential of multi-modal fusion in detecting violence. Full article
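A minimal sketch of the multiple instance learning step mentioned in the abstract, where a video-level violence score is pooled from per-segment scores (the top-k pooling rule is an assumption):

```python
import torch

def mil_video_score(segment_scores: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Bag-level score = mean of the k highest per-segment scores.

    segment_scores: (batch, num_segments) scores in [0, 1].
    """
    k = min(k, segment_scores.shape[1])
    topk, _ = segment_scores.topk(k, dim=1)
    return topk.mean(dim=1)                  # (batch,)

scores = torch.rand(2, 32)                   # 32 segments per clip
print(mil_video_score(scores))               # one violence score per video
```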
18 pages, 1856 KB  
Article
A Uniform Multi-Modal Feature Extraction and Adaptive Local–Global Feature Fusion Structure for RGB-X Marine Animal Segmentation
by Yue Jiang, Yan Gao, Yifei Wang, Yue Wang, Hong Yu and Yuanshan Lin
Electronics 2025, 14(19), 3927; https://doi.org/10.3390/electronics14193927 - 2 Oct 2025
Viewed by 256
Abstract
Marine animal segmentation aims at segmenting marine animals in complex ocean scenes, which plays an important role in underwater intelligence research. Due to the complexity of underwater scenes, relying solely on a single RGB image or learning from a specific combination of multi-modal information may not be very effective. Therefore, we propose a uniform multi-modal feature extraction and adaptive local–global feature fusion structure for RGB-X marine animal segmentation. It is applicable to various settings such as RGB-D (RGB+depth) and RGB-O (RGB+optical flow) marine animal segmentation. Specifically, we first fine-tune the SAM encoder using parallel LoRA and adapters to separately extract RGB information and auxiliary information. Then, the Adaptive Local–Global Feature Fusion (ALGFF) module is proposed to progressively fuse multi-modal and multi-scale features in a simple and dynamic way. Experimental results on both RGB-D and RGB-O datasets demonstrate that our model achieves superior performance in underwater scene segmentation tasks. Full article
(This article belongs to the Special Issue Recent Advances in Efficient Image and Video Processing)
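A minimal sketch of the LoRA idea mentioned in the abstract: a frozen pretrained linear layer plus a trainable low-rank update (rank and scaling are assumptions; SAM's actual encoder is not shown):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # pretrained weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(256, 256))
print(layer(torch.randn(2, 256)).shape)       # torch.Size([2, 256])
```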
23 pages, 4303 KB  
Article
LMCSleepNet: A Lightweight Multi-Channel Sleep Staging Model Based on Wavelet Transform and Multi-Scale Convolutions
by Jiayi Yang, Yuanyuan Chen, Tingting Yu and Ying Zhang
Sensors 2025, 25(19), 6065; https://doi.org/10.3390/s25196065 - 2 Oct 2025
Viewed by 234
Abstract
Sleep staging is crucial for assessing sleep quality, which contributes to sleep monitoring and the diagnosis of sleep disorders. Although existing sleep staging methods achieve high classification performance, two major challenges remain: (1) the ability to effectively extract salient features from multi-channel sleep data remains limited; (2) excessive model parameters hinder efficiency improvements. To address these challenges, this work proposes a lightweight multi-channel sleep staging network (LMCSleepNet). LMCSleepNet is composed of four modules. The first module enhances frequency domain features through continuous wavelet transform. The second module extracts time–frequency features using multi-scale convolutions. The third module optimizes ResNet18 with depthwise separable convolutions to reduce parameters. The fourth module improves spatial correlation using the Convolutional Block Attention Module (CBAM). On the public datasets SleepEDF-20 and SleepEDF-78, LMCSleepNet achieved classification accuracies of 88.2% (κ = 0.84, MF1 = 82.4%) and 84.1% (κ = 0.77, MF1 = 77.7%), respectively, while reducing model parameters to 1.49 M. Furthermore, experiments validated the influence of the number of temporal sampling points in wavelet time–frequency maps and of the fusion strategy of the multi-scale dilated convolution module on classification performance (accuracy, Cohen’s kappa, and macro-average F1-score). LMCSleepNet is an efficient lightweight model for extracting and integrating multimodal features from multichannel polysomnography (PSG) data, which facilitates its application in resource-constrained scenarios. Full article
(This article belongs to the Section Biomedical Sensors)
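The abstract credits much of the parameter reduction to depthwise separable convolutions. A small sketch comparing parameter counts against a standard convolution (channel counts are arbitrary, not the paper's):

```python
import torch.nn as nn

def param_count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

c_in, c_out, k = 64, 128, 3
standard = nn.Conv2d(c_in, c_out, k, padding=1)
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in),  # depthwise
    nn.Conv2d(c_in, c_out, kernel_size=1),             # pointwise
)
print(param_count(standard), param_count(separable))
# 73856 vs 8960 parameters for the same receptive field
```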
20 pages, 2545 KB  
Article
LG-UNet Based Segmentation and Survival Prediction of Nasopharyngeal Carcinoma Using Multimodal MRI Imaging
by Yuhao Yang, Junhao Wen, Tianyi Wu, Jinrang Dong, Yunfei Xia and Yu Zhang
Bioengineering 2025, 12(10), 1051; https://doi.org/10.3390/bioengineering12101051 - 29 Sep 2025
Viewed by 429
Abstract
Image segmentation and survival prediction for nasopharyngeal carcinoma (NPC) are crucial for clinical diagnosis and treatment decisions. This study presents an improved 3D-UNet-based model for NPC gross tumor volume (GTV) segmentation, referred to as LG-UNet. The encoder introduces deep strip convolution and channel attention mechanisms to enhance feature extraction while avoiding spatial feature loss and anisotropic constraints. The decoder incorporates Dynamic Large Convolutional Kernel (DLCK) and Global Feature Fusion (GFF) modules to capture multi-scale features and integrate global contextual information, enabling precise segmentation of the GTV in NPC MRI images. Risk prediction is performed on the segmented multi-modal MRI images using the Lung-Net model, with the output risk factors combined with clinical data in a Cox model to predict metastatic probabilities for NPC lesions. Experimental results on 442 NPC MRI scans from Sun Yat-sen University Cancer Center showed a DSC of 0.8223, an accuracy of 0.8235, a recall of 0.8297, and an HD95 of 1.6807 mm. Compared to the baseline model, the DSC improved by 7.73%, accuracy increased by 4.52%, and recall improved by 3.40%. The combined model’s risk prediction achieved a C-index of 0.756 and a 5-year AUC of 0.789. This model can serve as an auxiliary tool for clinical decision-making in NPC. Full article
(This article belongs to the Section Biosignal Processing)
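The reported C-index of 0.756 measures how often the predicted risks order patients correctly. A minimal sketch of its computation on toy data (variable names and values are illustrative only):

```python
import numpy as np

def concordance_index(time, event, risk):
    """Fraction of comparable pairs whose predicted risks are correctly ordered."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # a pair is comparable if subject i had the event before time[j]
            if event[i] == 1 and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

t = np.array([12, 30, 24, 60, 45.0])   # follow-up months
e = np.array([1, 0, 1, 0, 1])          # 1 = metastasis observed
r = np.array([0.9, 0.5, 0.3, 0.1, 0.4])
print(concordance_index(t, e, r))      # 0.75
```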
22 pages, 2395 KB  
Article
Multimodal Alignment and Hierarchical Fusion Network for Multimodal Sentiment Analysis
by Jiasheng Huang, Huan Li and Xinyue Mo
Electronics 2025, 14(19), 3828; https://doi.org/10.3390/electronics14193828 - 26 Sep 2025
Viewed by 700
Abstract
The widespread emergence of multimodal data on social platforms has presented new opportunities for sentiment analysis. However, previous studies have often overlooked the issue of detail loss during modal interaction fusion. They also exhibit limitations in addressing semantic alignment challenges and the sensitivity of modalities to noise. To enhance analytical accuracy, a novel model named MAHFNet is proposed. The proposed architecture is composed of three main components. Firstly, an attention-guided gated interaction alignment module is developed for modeling the semantic interaction between text and image using a gated network and a cross-modal attention mechanism. Next, a contrastive learning mechanism is introduced to encourage the aggregation of semantically aligned image-text pairs. Subsequently, an intra-modality emotion extraction module is designed to extract local emotional features within each modality. This module serves to compensate for detail loss during interaction fusion. The intra-modal local emotion features and cross-modal interaction features are then fed into a hierarchical gated fusion module, where the local features are fused through a cross-gated mechanism to dynamically adjust the contribution of each modality while suppressing modality-specific noise. Then, the fusion results and cross-modal interaction features are further fused using a multi-scale attention gating module to capture hierarchical dependencies between local and global emotional information, thereby enhancing the model’s ability to perceive and integrate emotional cues across multiple semantic levels. Finally, extensive experiments have been conducted on three public multimodal sentiment datasets, with results demonstrating that the proposed model outperforms existing methods across multiple evaluation metrics. Specifically, on the TumEmo dataset, our model achieves improvements of 2.55% in ACC and 2.63% in F1 score compared to the second-best method. On the HFM dataset, these gains reach 0.56% in ACC and 0.9% in F1 score, respectively. On the MVSA-S dataset, these gains reach 0.03% in ACC and 1.26% in F1 score. These findings collectively validate the overall effectiveness of the proposed model. Full article
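A minimal sketch of the contrastive objective used to pull semantically aligned image-text pairs together, as the abstract describes (the symmetric InfoNCE form and temperature are assumptions):

```python
import torch
import torch.nn.functional as F

def image_text_contrastive(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs sit on the diagonal."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.T / temperature            # (batch, batch) similarities
    targets = torch.arange(t.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

txt, img = torch.randn(8, 256), torch.randn(8, 256)
print(image_text_contrastive(txt, img).item())
```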
18 pages, 4817 KB  
Article
A Multimodal Deep Learning Framework for Accurate Wildfire Segmentation Using RGB and Thermal Imagery
by Tao Yue, Hong Huang, Qingyang Wang, Bo Song and Yun Chen
Appl. Sci. 2025, 15(18), 10268; https://doi.org/10.3390/app151810268 - 21 Sep 2025
Viewed by 489
Abstract
Wildfires pose serious threats to ecosystems, human life, and climate stability, underscoring the urgent need for accurate monitoring. Traditional approaches based on either optical or thermal imagery often fail under challenging conditions such as lighting interference, varying data sources, or small-scale flames, as they do not account for the hierarchical nature of feature representations. To overcome these limitations, we propose BFCNet, a multimodal deep learning framework that integrates visible (RGB) and thermal infrared (TIR) imagery for accurate wildfire segmentation. The framework incorporates edge-guided supervision and multilevel fusion to capture fine fire boundaries while exploiting complementary information from both modalities. To assess its effectiveness, we constructed a multi-scale flame segmentation dataset and validated the method across diverse conditions, including different data sources, lighting environments, and five flame size categories ranging from small to large. Experimental results show that BFCNet achieves an IoU of 88.25% and an F1 score of 93.76%, outperforming both single-modality and existing multimodal approaches across all evaluation tasks. These results demonstrate the potential of multimodal deep learning to enhance wildfire monitoring, offering practical value for disaster management, ecological protection, and the deployment of autonomous aerial surveillance systems. Full article
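The reported IoU of 88.25% and F1 of 93.76% follow directly from pixel-level counts. A small sketch of both metrics on toy binary masks (masks are illustrative only):

```python
import numpy as np

def iou_and_f1(pred: np.ndarray, target: np.ndarray):
    """Binary-mask IoU and F1 (Dice) from true/false positive counts."""
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1

pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), dtype=bool);   gt[15:45, 12:42] = True
print(tuple(round(float(x), 3) for x in iou_and_f1(pred, gt)))
```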
22 pages, 4086 KB  
Article
Bidirectional Dynamic Adaptation: Mutual Learning with Cross-Network Feature Rectification for Urban Segmentation
by Jiawen Zhang and Ning Chen
Appl. Sci. 2025, 15(18), 10000; https://doi.org/10.3390/app151810000 - 12 Sep 2025
Viewed by 425
Abstract
Semantic segmentation of urban scenes from red–green–blue and thermal infrared imagery enables per-pixel categorization, delivering precise environmental understanding for autonomous driving and urban planning. However, existing methods suffer from inefficient fusion and insufficient boundary accuracy due to modal differences. To address these challenges, we propose a bidirectional dynamic adaptation framework with two complementary networks. The modality-aware network uses dual attention and multi-scale feature integration to balance modal contributions adaptively, improving intra-class semantic consistency and reducing modal disparities. The edge-texture guidance network applies pixel-level and feature-level weighting with Sobel and Gabor filters to enhance inter-class boundary discrimination, improving detail and boundary precision. Furthermore, the framework redefines multi-modal synergy using an adaptive cross-modal mutual learning mechanism. This mechanism employs information-driven dynamic alignment and probability-guided semantic consistency to overcome the fixed constraints of traditional mutual learning. This cohesive orchestration enhances multi-modal fusion efficiency and boundary delineation accuracy. Extensive experiments on the MFNet and PST900 datasets demonstrate the framework’s superior performance in urban road, vehicle, and pedestrian segmentation, surpassing state-of-the-art approaches. Full article
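A minimal sketch of the edge-guided weighting idea the abstract describes: a Sobel gradient magnitude map used as a per-pixel weight that emphasizes boundaries (pure NumPy; the paper's exact pixel-level and feature-level weighting is not shown):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive 'same' 3x3 filtering, adequate for this demo."""
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def edge_weight_map(gray: np.ndarray, base: float = 1.0, gain: float = 4.0):
    """Per-pixel loss weights that grow with Sobel gradient magnitude."""
    gx, gy = filter3x3(gray, SOBEL_X), filter3x3(gray, SOBEL_Y)
    mag = np.hypot(gx, gy)
    mag = mag / (mag.max() + 1e-8)
    return base + gain * mag                 # boundary pixels weighted ~5x

img = np.zeros((32, 32)); img[:, 16:] = 1.0  # vertical step edge
print(edge_weight_map(img)[16, 14:19])       # [1. 5. 5. 1. 1.]
```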
22 pages, 3585 KB  
Article
A Novel 3D U-Net–Vision Transformer Hybrid with Multi-Scale Fusion for Precision Multimodal Brain Tumor Segmentation in 3D MRI
by Fathia Ghribi and Fayçal Hamdaoui
Electronics 2025, 14(18), 3604; https://doi.org/10.3390/electronics14183604 - 11 Sep 2025
Viewed by 752
Abstract
In recent years, segmentation for medical applications using Magnetic Resonance Imaging (MRI) has received increasing attention. In particular, brain tumor segmentation from MRI is a crucial and challenging task for accurate diagnosis, treatment planning, and patient monitoring. With the rapid development of deep learning methods, significant improvements have been made in medical image segmentation. Convolutional Neural Networks (CNNs), such as U-Net, have shown excellent performance in capturing local spatial features. However, these models cannot explicitly capture long-range dependencies. Vision Transformers (ViTs) have therefore emerged as an alternative segmentation approach, as they can exploit long-range correlations through multi-head self-attention (MSA). Despite their effectiveness, ViTs require large annotated datasets and may compromise fine-grained spatial details. To address these problems, we propose a novel hybrid approach for brain tumor segmentation that combines a 3D U-Net with a 3D Vision Transformer (ViT3D), aiming to jointly exploit local feature extraction and global context modeling. Additionally, we developed an effective fusion method that uses upsampling and convolutional refinement to improve multi-scale feature integration. Unlike traditional fusion approaches, our method explicitly refines spatial details while maintaining global dependencies, improving the quality of tumor border delineation. We evaluated our approach on the BraTS 2020 dataset, achieving a global accuracy score of 99.56%, an average Dice similarity coefficient (DSC) of 77.43% (the mean across the three tumor subregions), with individual Dice scores of 84.35% for the whole tumor (WT), 80.97% for the tumor core (TC), and 66.97% for the enhancing tumor (ET), and an average Intersection over Union (IoU) of 71.69%. These extensive experimental results demonstrate that our model not only localizes tumors with high accuracy and robustness but also outperforms a selection of current state-of-the-art methods, including U-Net, SwinUnet, M-Unet, and others. Full article
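The quoted average DSC of 77.43% is simply the arithmetic mean of the three subregion scores. A small sketch of per-class Dice and that mean (toy volumes; the WT/TC/ET label scheme is the standard BraTS convention):

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice similarity coefficient for one binary class."""
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + 1e-8)

# Mean over the reported subregion scores reproduces the quoted average DSC.
print(round(float(np.mean([84.35, 80.97, 66.97])), 2))   # 77.43

# Toy volumetric example for one subregion.
pred = np.zeros((16, 16, 16), dtype=bool); pred[4:12, 4:12, 4:12] = True
gt = np.zeros((16, 16, 16), dtype=bool);   gt[5:13, 5:13, 5:13] = True
print(round(dice(pred, gt), 3))
```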
25 pages, 7560 KB  
Article
RTMF-Net: A Dual-Modal Feature-Aware Fusion Network for Dense Forest Object Detection
by Xiaotan Wei, Zhensong Li, Yutong Wang and Shiliang Zhu
Sensors 2025, 25(18), 5631; https://doi.org/10.3390/s25185631 - 10 Sep 2025
Viewed by 427
Abstract
Multimodal remote sensing object detection has gained increasing attention due to its ability to leverage complementary information from different sensing modalities, particularly visible (RGB) and thermal infrared (TIR) imagery. However, existing methods typically depend on deep, computationally intensive backbones and complex fusion strategies, limiting their suitability for real-time applications. To address these challenges, we propose a lightweight and efficient detection framework named RGB-TIR Multimodal Fusion Network (RTMF-Net), which introduces innovations in both the backbone architecture and the fusion mechanism. Specifically, RTMF-Net adopts a dual-stream structure with modality-specific enhancement modules tailored to the characteristics of RGB and TIR data. The visible-light branch integrates a Convolutional Enhancement Fusion Block (CEFBlock) to improve multi-scale semantic representation with low computational overhead, while the thermal branch employs a Dual-Laplacian Enhancement Block (DLEBlock) to enhance frequency-domain structural features and weak texture cues. To further improve cross-modal feature interaction, a Weighted Denoising Fusion Module is designed, incorporating an Enhanced Fusion Attention (EFA) mechanism that adaptively suppresses redundant information and emphasizes salient object regions. Additionally, a Shape-Aware Intersection over Union (SA-IoU) loss function is proposed to improve localization robustness by introducing an aspect ratio penalty into the traditional IoU metric. Extensive experiments conducted on the ODinMJ and LLVIP multimodal datasets demonstrate that RTMF-Net achieves competitive performance, with mean Average Precision (mAP) scores of 98.7% and 95.7%, respectively, while maintaining a lightweight structure of only 4.3M parameters and 11.6 GFLOPs. These results confirm the effectiveness of RTMF-Net in achieving a favorable balance between accuracy and efficiency, making it well-suited for real-time remote sensing applications. Full article
(This article belongs to the Section Sensing and Imaging)
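A minimal sketch of what an IoU loss with an aspect-ratio penalty can look like (the penalty term below is a common CIoU-style form and is an assumption, not necessarily the paper's SA-IoU):

```python
import math
import torch

def iou_aspect_loss(pred, target, eps=1e-7):
    """IoU loss plus an aspect-ratio penalty for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # aspect-ratio penalty: squared difference of arctan(width/height)
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    return (1 - iou + v).mean()

p = torch.tensor([[10., 10., 50., 30.]])
t = torch.tensor([[12., 8., 48., 36.]])
print(iou_aspect_loss(p, t).item())
```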
16 pages, 846 KB  
Article
MMKT: Multimodal Sentiment Analysis Model Based on Knowledge-Enhanced and Text-Guided Learning
by Chengkai Shi and Yunhua Zhang
Appl. Sci. 2025, 15(17), 9815; https://doi.org/10.3390/app15179815 - 7 Sep 2025
Viewed by 776
Abstract
Multimodal Sentiment Analysis (MSA) aims to predict subjective human emotions by leveraging multimodal information. However, existing research inadequately utilizes explicit sentiment semantic information at the lexical level in text and overlooks noise interference from non-dominant modalities, such as irrelevant movements in visual modalities and background noise in audio modalities. To address these issues, we propose a multimodal sentiment analysis model based on knowledge enhancement and text-guided learning (MMKT). The model constructs a sentiment knowledge graph for the textual modality using the SenticNet knowledge base. This graph directly annotates word-level sentiment polarity, strengthening the model’s understanding of emotional vocabulary. Furthermore, global sentiment knowledge features are generated through graph embedding computations to enhance the multimodal fusion process. Simultaneously, a dynamic text-guided learning approach is introduced, which dynamically leverages multi-scale textual features to actively suppress redundant or conflicting information in visual and audio modalities, thereby generating purer cross-modal representations. Finally, concatenated textual features, cross-modal features, and knowledge features are utilized for sentiment prediction. Experimental results on the CMU-MOSEI and Twitter2019 datasets demonstrate the superior performance of the MMKT model. Full article
31 pages, 3129 KB  
Review
A Review on Gas Pipeline Leak Detection: Acoustic-Based, OGI-Based, and Multimodal Fusion Methods
by Yankun Gong, Chao Bao, Zhengxi He, Yifan Jian, Xiaoye Wang, Haineng Huang and Xintai Song
Information 2025, 16(9), 731; https://doi.org/10.3390/info16090731 - 25 Aug 2025
Cited by 1 | Viewed by 1413
Abstract
Pipelines play a vital role in material transportation within industrial settings. This review synthesizes detection technologies for early-stage small gas leaks from pipelines in the industrial sector, with a focus on acoustic-based methods, optical gas imaging (OGI), and multimodal fusion approaches. It encompasses detection principles, inherent challenges, mitigation strategies, and the state of the art (SOTA). Small leaks refer to low flow leakage originating from defects with apertures at millimeter or submillimeter scales, posing significant detection difficulties. Acoustic detection leverages the acoustic wave signals generated by gas leaks for non-contact monitoring, offering advantages such as rapid response and broad coverage. However, its susceptibility to environmental noise interference often triggers false alarms. This limitation can be mitigated through time-frequency analysis, multi-sensor fusion, and deep-learning algorithms—effectively enhancing leak signals, suppressing background noise, and thereby improving the system’s detection robustness and accuracy. OGI utilizes infrared imaging technology to visualize leakage gas and is applicable to the detection of various polar gases. Its primary limitations include low image resolution, low contrast, and interference from complex backgrounds. Mitigation techniques involve background subtraction, optical flow estimation, fully convolutional neural networks (FCNNs), and vision transformers (ViTs), which enhance image contrast and extract multi-scale features to boost detection precision. Multimodal fusion technology integrates data from diverse sensors, such as acoustic and optical devices. Key challenges lie in achieving spatiotemporal synchronization across multiple sensors and effectively fusing heterogeneous data streams. Current methodologies primarily utilize decision-level fusion and feature-level fusion techniques. Decision-level fusion offers high flexibility and ease of implementation but lacks inter-feature interaction; it is less effective than feature-level fusion when correlations exist between heterogeneous features. Feature-level fusion amalgamates data from different modalities during the feature extraction phase, generating a unified cross-modal representation that effectively resolves inter-modal heterogeneity. In conclusion, we posit that multimodal fusion holds significant potential for further enhancing detection accuracy beyond the capabilities of existing single-modality technologies and is poised to become a major focus of future research in this domain. Full article
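To make the review's distinction concrete, a minimal sketch contrasting decision-level fusion (combine per-modality probabilities) with feature-level fusion (combine features before one classifier); shapes, layers, and feature names are assumptions:

```python
import torch
import torch.nn as nn

acoustic_feat = torch.randn(4, 64)      # e.g., a toy acoustic embedding
optical_feat = torch.randn(4, 128)      # e.g., a toy OGI frame embedding

# Decision-level fusion: each modality has its own classifier, outputs are averaged.
acoustic_head = nn.Linear(64, 2)
optical_head = nn.Linear(128, 2)
p_acoustic = acoustic_head(acoustic_feat).softmax(dim=-1)
p_optical = optical_head(optical_feat).softmax(dim=-1)
p_decision = 0.5 * (p_acoustic + p_optical)

# Feature-level fusion: concatenate modality features, then one joint classifier,
# so the classifier can model interactions between heterogeneous features.
joint_head = nn.Sequential(nn.Linear(64 + 128, 32), nn.ReLU(), nn.Linear(32, 2))
p_feature = joint_head(torch.cat([acoustic_feat, optical_feat], dim=-1)).softmax(dim=-1)

print(p_decision.shape, p_feature.shape)  # both (4, 2): leak / no-leak probabilities
```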