Search Results (37)

Search Parameters:
Keywords = RGB-T fusion

17 pages, 91001 KiB  
Article
PONet: A Compact RGB-IR Fusion Network for Vehicle Detection on OrangePi AIpro
by Junyu Huang, Jialing Lian, Fangyu Cao, Jiawei Chen, Renbo Luo, Jinxin Yang and Qian Shi
Remote Sens. 2025, 17(15), 2650; https://doi.org/10.3390/rs17152650 - 30 Jul 2025
Viewed by 172
Abstract
Multi-modal object detection that fuses RGB (Red-Green-Blue) and infrared (IR) data has emerged as an effective approach for addressing challenging visual conditions such as low illumination, occlusion, and adverse weather. However, most existing multi-modal detectors prioritize accuracy while neglecting computational efficiency, making them unsuitable for deployment on resource-constrained edge devices. To address this limitation, we propose PONet, a lightweight and efficient multi-modal vehicle detection network tailored for real-time edge inference. PONet incorporates Polarized Self-Attention to improve feature adaptability and representation with minimal computational overhead. In addition, a novel fusion module is introduced to effectively integrate RGB and IR modalities while preserving efficiency. Experimental results on the VEDAI dataset demonstrate that PONet achieves a competitive detection accuracy of 82.2% mAP@0.5 while sustaining a throughput of 34 FPS on the OrangePi AIpro 20T device. With only 3.76 M parameters and 10.2 GFLOPs (Giga Floating Point Operations), PONet offers a practical solution for edge-oriented remote sensing applications requiring a balance between detection precision and computational cost.
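The attention-gated RGB-IR fusion that PONet describes can be pictured with a minimal PyTorch sketch. The block below uses simple squeeze-and-excite-style channel gates as a stand-in for Polarized Self-Attention, whose actual design differs; the class name, channel count, and feature shapes are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class RGBIRFusion(nn.Module):
    """Toy fusion block: one channel gate per modality, then a weighted sum."""
    def __init__(self, channels: int):
        super().__init__()
        def gate():
            # Squeeze-and-excite style gate, a simplified stand-in for the
            # paper's Polarized Self-Attention.
            return nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )
        self.gate_rgb, self.gate_ir = gate(), gate()
        self.project = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb_feat, ir_feat):
        fused = self.gate_rgb(rgb_feat) * rgb_feat + self.gate_ir(ir_feat) * ir_feat
        return self.project(fused)

# Fuse 64-channel feature maps from the two backbone branches.
rgb, ir = torch.randn(1, 64, 80, 80), torch.randn(1, 64, 80, 80)
print(RGBIRFusion(64)(rgb, ir).shape)  # torch.Size([1, 64, 80, 80])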

24 pages, 12286 KiB  
Article
A UAV-Based Multi-Scenario RGB-Thermal Dataset and Fusion Model for Enhanced Forest Fire Detection
by Yalin Zhang, Xue Rui and Weiguo Song
Remote Sens. 2025, 17(15), 2593; https://doi.org/10.3390/rs17152593 - 25 Jul 2025
Viewed by 363
Abstract
UAVs are essential for forest fire detection due to vast forest areas and the inaccessibility of high-risk zones, enabling rapid long-range inspection and detailed close-range surveillance. However, aerial photography faces challenges like multi-scale target recognition and complex scenario adaptation (e.g., deformation, occlusion, lighting variations). RGB-Thermal fusion methods integrate visible-light texture and thermal infrared temperature features effectively, but current approaches are constrained by limited datasets and insufficient exploitation of cross-modal complementary information, ignoring cross-level feature interaction. To address data scarcity in wildfire scenarios, we construct a time-synchronized, multi-scene, multi-angle aerial RGB-Thermal dataset (RGBT-3M) with “Smoke–Fire–Person” annotations, with the modalities aligned via the M-RIFT method. We then propose a CP-YOLOv11-MF fusion detection model based on the advanced YOLOv11 framework, which learns the heterogeneous, complementary features of each modality in a progressive manner. Experimental validation proves the superiority of our method, with a precision of 92.5%, a recall of 93.5%, a mAP50 of 96.3%, and a mAP50-95 of 62.9%. The model’s RGB-Thermal fusion capability enhances early fire detection, offering a benchmark dataset and methodological advancement for intelligent forest conservation, with implications for AI-driven ecological protection.
(This article belongs to the Special Issue Advances in Spectral Imagery and Methods for Fire and Smoke Detection)

14 pages, 2370 KiB  
Article
DP-AMF: Depth-Prior–Guided Adaptive Multi-Modal and Global–Local Fusion for Single-View 3D Reconstruction
by Luoxi Zhang, Chun Xie and Itaru Kitahara
J. Imaging 2025, 11(7), 246; https://doi.org/10.3390/jimaging11070246 - 21 Jul 2025
Viewed by 291
Abstract
Single-view 3D reconstruction remains fundamentally ill-posed, as a single RGB image lacks scale and depth cues, often yielding ambiguous results under occlusion or in texture-poor regions. We propose DP-AMF, a novel Depth-Prior–Guided Adaptive Multi-Modal and Global–Local Fusion framework that integrates high-fidelity depth priors—generated offline by the MARIGOLD diffusion-based estimator and cached to avoid extra training cost—with hierarchical local features from ResNet-32/ResNet-18 and semantic global features from DINO-ViT. A learnable fusion module dynamically adjusts per-channel weights to balance these modalities according to local texture and occlusion, and an implicit signed-distance field decoder reconstructs the final mesh. Extensive experiments on 3D-FRONT and Pix3D demonstrate that DP-AMF reduces Chamfer Distance by 7.64%, increases F-Score by 2.81%, and boosts Normal Consistency by 5.88% compared to strong baselines, while qualitative results show sharper edges and more complete geometry in challenging scenes. DP-AMF achieves these gains without substantially increasing model size or inference time, offering a robust and effective solution for complex single-view reconstruction tasks.
(This article belongs to the Section AI in Imaging)
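The learnable fusion module described above, per-channel weights balancing depth-prior, local, and global features, can be sketched as a softmax-gated blend. This is a minimal illustration that assumes the three streams have already been projected to a common shape; the class name and layer choices are hypothetical, and the paper's actual module is richer.

import torch
import torch.nn as nn

class ChannelWeightedFusion(nn.Module):
    """Predicts per-channel weights for three feature streams and blends them."""
    def __init__(self, channels: int):
        super().__init__()
        # One weight per stream per channel, normalized with a softmax over streams.
        self.weight_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(3 * channels, 3 * channels, kernel_size=1),
        )
        self.channels = channels

    def forward(self, depth_feat, local_feat, global_feat):
        stacked = torch.cat([depth_feat, local_feat, global_feat], dim=1)
        w = self.weight_head(stacked)                        # (B, 3C, 1, 1)
        w = w.view(-1, 3, self.channels, 1, 1).softmax(dim=1)
        return (w[:, 0] * depth_feat
                + w[:, 1] * local_feat
                + w[:, 2] * global_feat)

f = torch.randn(2, 128, 32, 32)
print(ChannelWeightedFusion(128)(f, f, f).shape)  # torch.Size([2, 128, 32, 32])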

17 pages, 3612 KiB  
Article
MPVT: An Efficient Multi-Modal Prompt Vision Tracker for Visual Target Tracking
by Jianyu Xie, Yan Fu, Junlin Zhou, Tianxiang He, Xiaopeng Wang, Yuke Fang and Duanbing Chen
Appl. Sci. 2025, 15(14), 7967; https://doi.org/10.3390/app15147967 - 17 Jul 2025
Viewed by 246
Abstract
Visual target tracking is a fundamental task in computer vision. Combining multi-modal information with tracking leverages complementary information, which improves the precision and robustness of trackers. Traditional multi-modal tracking methods typically employ a full fine-tuning scheme, i.e., fine-tuning pre-trained single-modal models to multi-modal tasks. However, this approach suffers from low transfer learning efficiency, catastrophic forgetting, and high cross-task deployment costs. To address these issues, we propose an efficient model named multi-modal prompt vision tracker (MPVT) based on an efficient prompt-tuning paradigm. Three key components are involved in the model: a decoupled input enhancement module, a dynamic adaptive prompt fusion module, and a fully connected head network module. The decoupled input enhancement module enhances input representations via positional and type embedding. The dynamic adaptive prompt fusion module achieves efficient prompt tuning and multi-modal interaction using scaled convolution and low-rank cross-modal attention mechanisms. The fully connected head network module addresses the shortcomings of traditional convolutional head networks such as inductive biases. Experimental results from RGB-T, RGB-D, and RGB-E scenarios show that MPVT outperforms state-of-the-art methods. Moreover, MPVT can save 43.8% GPU memory usage and reduce training time by 62.9% compared with a full-parameter fine-tuning model.
(This article belongs to the Special Issue Advanced Technologies Applied for Object Detection and Tracking)
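The low-rank cross-modal attention mentioned in the abstract can be sketched as a cross-attention layer whose projections pass through a small-rank bottleneck, which keeps the number of tuned parameters low, the point of prompt tuning. The dimensions, rank, and names below are assumptions, not the authors' code.

import torch
import torch.nn as nn

class LowRankCrossAttention(nn.Module):
    """Cross-attention whose Q/K/V projections go through a low-rank bottleneck,
    keeping the number of trainable prompt parameters small."""
    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.q = nn.Linear(dim, rank, bias=False)    # queries from the RGB tokens
        self.k = nn.Linear(dim, rank, bias=False)    # keys from the auxiliary modality
        self.v = nn.Linear(dim, rank, bias=False)
        self.out = nn.Linear(rank, dim, bias=False)  # project back to the model width
        self.scale = rank ** -0.5

    def forward(self, rgb_tokens, aux_tokens):
        q, k, v = self.q(rgb_tokens), self.k(aux_tokens), self.v(aux_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return rgb_tokens + self.out(attn @ v)       # residual prompt injection

rgb = torch.randn(1, 196, 768)   # ViT patch tokens
aux = torch.randn(1, 196, 768)   # thermal / depth / event tokens
print(LowRankCrossAttention(768)(rgb, aux).shape)  # torch.Size([1, 196, 768])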

21 pages, 4147 KiB  
Article
AgriFusionNet: A Lightweight Deep Learning Model for Multisource Plant Disease Diagnosis
by Saleh Albahli
Agriculture 2025, 15(14), 1523; https://doi.org/10.3390/agriculture15141523 - 15 Jul 2025
Viewed by 455
Abstract
Timely and accurate identification of plant diseases is critical to mitigating crop losses and enhancing yield in precision agriculture. This paper proposes AgriFusionNet, a lightweight and efficient deep learning model designed to diagnose plant diseases using multimodal data sources. The framework integrates RGB and multispectral drone imagery with IoT-based environmental sensor data (e.g., temperature, humidity, soil moisture), recorded over six months across multiple agricultural zones. Built on the EfficientNetV2-B4 backbone, AgriFusionNet incorporates Fused-MBConv blocks and Swish activation to improve gradient flow, capture fine-grained disease patterns, and reduce inference latency. The model was evaluated using a comprehensive dataset composed of real-world and benchmarked samples, showing superior performance with 94.3% classification accuracy, 28.5 ms inference time, and a 30% reduction in model parameters compared to state-of-the-art models such as Vision Transformers and InceptionV4. Extensive comparisons with both traditional machine learning and advanced deep learning methods underscore its robustness, generalization, and suitability for deployment on edge devices. Ablation studies and confusion matrix analyses further confirm its diagnostic precision, even in visually ambiguous cases. The proposed framework offers a scalable, practical solution for real-time crop health monitoring, contributing toward smart and sustainable agricultural ecosystems.
(This article belongs to the Special Issue Computational, AI and IT Solutions Helping Agriculture)
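Fused-MBConv, the EfficientNetV2 block named above, replaces the usual 1x1 expansion plus depthwise 3x3 with a single full 3x3 expansion convolution followed by Swish (SiLU) and a 1x1 projection. The sketch below omits squeeze-and-excitation and stochastic depth and uses illustrative channel sizes; it is a simplified rendition, not the paper's model.

import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Simplified Fused-MBConv: 3x3 expansion conv + Swish (SiLU), 1x1 projection,
    residual connection when shapes match."""
    def __init__(self, in_ch: int, out_ch: int, expand: int = 4, stride: int = 1):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.SiLU(inplace=True),                  # SiLU == Swish
            nn.Conv2d(mid, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.use_residual = stride == 1 and in_ch == out_ch

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y

x = torch.randn(1, 48, 56, 56)
print(FusedMBConv(48, 48)(x).shape)  # torch.Size([1, 48, 56, 56])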

21 pages, 12122 KiB  
Article
RA3T: An Innovative Region-Aligned 3D Transformer for Self-Supervised Sim-to-Real Adaptation in Low-Altitude UAV Vision
by Xingrao Ma, Jie Xie, Di Shao, Aiting Yao and Chengzu Dong
Electronics 2025, 14(14), 2797; https://doi.org/10.3390/electronics14142797 - 11 Jul 2025
Viewed by 280
Abstract
Low-altitude unmanned aerial vehicle (UAV) vision is critically hindered by the Sim-to-Real Gap, where models trained exclusively on simulation data degrade under real-world variations in lighting, texture, and weather. To address this problem, we propose RA3T (Region-Aligned 3D Transformer), a novel self-supervised framework that enables robust Sim-to-Real adaptation. Specifically, we first develop a dual-branch strategy for self-supervised feature learning, integrating Masked Autoencoders and contrastive learning. This approach extracts domain-invariant representations from unlabeled simulated imagery to enhance robustness against occlusion while reducing annotation dependency. Leveraging these learned features, we then introduce a 3D Transformer fusion module that unifies multi-view RGB and LiDAR point clouds through cross-modal attention. By explicitly modeling spatial layouts and height differentials, this component significantly improves recognition of small and occluded targets in complex low-altitude environments. To address persistent fine-grained domain shifts, we finally design region-level adversarial calibration that deploys local discriminators on partitioned feature maps. This mechanism directly aligns texture, shadow, and illumination discrepancies which challenge conventional global alignment methods. Extensive experiments on UAV benchmarks VisDrone and DOTA demonstrate the effectiveness of RA3T. The framework achieves +5.1% mAP on VisDrone and +7.4% mAP on DOTA over the 2D adversarial baseline, particularly on small objects and sparse occlusions, while maintaining real-time performance of 17 FPS at 1024 × 1024 resolution on an RTX 4080 GPU. Visual analysis confirms that the synergistic integration of 3D geometric encoding and local adversarial alignment effectively mitigates domain gaps caused by uneven illumination and perspective variations, establishing an efficient pathway for simulation-to-reality UAV perception.
(This article belongs to the Special Issue Innovative Technologies and Services for Unmanned Aerial Vehicles)
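The Masked Autoencoder branch in the dual-branch strategy hinges on random patch masking: only a small visible subset of patch tokens is encoded, and the rest must be reconstructed. Below is a minimal sketch of that masking step only; the mask ratio, token shapes, and function name are assumptions, and the contrastive branch and decoder are omitted.

import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens (MAE-style); return the kept tokens,
    the binary mask, and the indices needed to restore the original order."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                         # random score per patch
    ids_shuffle = noise.argsort(dim=1)               # low score -> kept
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)                  # 0 = visible, 1 = masked
    return kept, mask, ids_restore

patches = torch.randn(2, 196, 768)                   # simulated UAV image patch tokens
kept, mask, _ = random_masking(patches)
print(kept.shape, int(mask.sum(dim=1)[0]))           # torch.Size([2, 49, 768]) 147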

23 pages, 2463 KiB  
Article
MCDet: Target-Aware Fusion for RGB-T Fire Detection
by Yuezhu Xu, He Wang, Yuan Bi, Guohao Nie and Xingmei Wang
Forests 2025, 16(7), 1088; https://doi.org/10.3390/f16071088 - 30 Jun 2025
Viewed by 320
Abstract
Forest fire detection is vital for ecological conservation and disaster management. Existing visual detection methods exhibit instability in smoke-obscured or illumination-variable environments. Although multimodal fusion has demonstrated potential, effectively resolving inconsistencies in smoke features across diverse modalities remains a significant challenge. This issue stems from the inherent ambiguity between regions characterized by high temperatures in infrared imagery and those with elevated brightness levels in visible-light imaging systems. In this paper, we propose MCDet, an RGB-T forest fire detection framework incorporating target-aware fusion. To alleviate cross-modal feature ambiguity, we design a Multidimensional Representation Collaborative Fusion module (MRCF), which constructs global feature interactions via a state-space model and enhances local detail perception through deformable convolution. Then, a content-guided attention network (CGAN) is introduced to aggregate multidimensional features via a dynamic gating mechanism. Building upon this foundation, the integration of WIoU further suppresses vegetation occlusion and illumination interference on a holistic level, thereby reducing the false detection rate. Evaluated on three forest fire datasets and one pedestrian dataset, MCDet achieves a mean detection accuracy of 77.5%, surpassing advanced methods. This performance makes MCDet a practical solution to enhance early warning system reliability.
(This article belongs to the Special Issue Advanced Technologies for Forest Fire Detection and Monitoring)

21 pages, 5214 KiB  
Article
YOLO-SAR: An Enhanced Multi-Scale Ship Detection Method in Low-Light Environments
by Zihang Xiong, Mei Wang, Ruixiang Kan and Jiayu Zhang
Appl. Sci. 2025, 15(13), 7288; https://doi.org/10.3390/app15137288 - 28 Jun 2025
Viewed by 339
Abstract
Nowadays, object detection has become increasingly crucial in various Internet-of-Things (IoT) systems, and ship detection is an essential component of this field. In low-illumination scenes, traditional ship detection algorithms often struggle due to poor visibility and blurred details in RGB video streams. To address this weakness, we create the Lowship dataset and propose the YOLO-SAR framework, which is based on the You Only Look Once (YOLO) architecture. For ship detection in such challenging conditions, the main contributions of this work are as follows: (i) a low-illumination image-enhancement module that adaptively improves multi-scale feature perception in low-illumination scenes; (ii) receptive-field attention convolution to compensate for weak long-range modeling; and (iii) an Adaptively Spatial Feature Fusion head to refine the multi-scale learning of ship features. Experiments show that our method achieves 92.9% precision and raises mAP@0.5 to 93.8%, outperforming mainstream approaches. These state-of-the-art results confirm the significant practical value of our approach.
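The Adaptively Spatial Feature Fusion head in contribution (iii) follows the general ASFF idea: learn a per-pixel softmax weight for each scale's feature map and blend them. The sketch below assumes the maps have already been resized to a common resolution and uses illustrative channel counts; it is not the authors' implementation.

import torch
import torch.nn as nn

class AdaptiveSpatialFusion(nn.Module):
    """Learns a per-pixel softmax weight for each (already resized) scale-specific
    feature map and blends them."""
    def __init__(self, channels: int, n_levels: int = 3):
        super().__init__()
        self.weight_conv = nn.Conv2d(n_levels * channels, n_levels, kernel_size=1)

    def forward(self, feats):                        # list of (B, C, H, W), same size
        w = self.weight_conv(torch.cat(feats, dim=1))          # (B, L, H, W)
        w = torch.softmax(w, dim=1)
        return sum(w[:, i:i + 1] * f for i, f in enumerate(feats))

feats = [torch.randn(1, 256, 40, 40) for _ in range(3)]
print(AdaptiveSpatialFusion(256)(feats).shape)       # torch.Size([1, 256, 40, 40])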

25 pages, 3449 KiB  
Article
CSANet: Context–Spatial Awareness Network for RGB-T Urban Scene Understanding
by Ruixiang Li, Zhen Wang, Jianxin Guo and Chuanlei Zhang
J. Imaging 2025, 11(6), 188; https://doi.org/10.3390/jimaging11060188 - 9 Jun 2025
Viewed by 831
Abstract
Semantic segmentation plays a critical role in understanding complex urban environments, particularly for autonomous driving applications. However, existing approaches face significant challenges under low-light and adverse weather conditions. To address these limitations, we propose CSANet (Context Spatial Awareness Network), a novel framework that effectively integrates RGB and thermal infrared (TIR) modalities. CSANet employs an efficient encoder to extract complementary local and global features, while a hierarchical fusion strategy is adopted to selectively integrate visual and semantic information. Notably, the Channel–Spatial Cross-Fusion Module (CSCFM) enhances local details by fusing multi-modal features, and the Multi-Head Fusion Module (MHFM) captures global dependencies and calibrates multi-modal information. Furthermore, the Spatial Coordinate Attention Mechanism (SCAM) improves object localization accuracy in complex urban scenes. Evaluations on benchmark datasets (MFNet and PST900) demonstrate that CSANet achieves state-of-the-art performance, significantly advancing RGB-T semantic segmentation.

39 pages, 3695 KiB  
Article
Fast Identification and Detection Algorithm for Maneuverable Unmanned Aircraft Based on Multimodal Data Fusion
by Tian Luan, Shixiong Zhou, Yicheng Zhang and Weijun Pan
Mathematics 2025, 13(11), 1825; https://doi.org/10.3390/math13111825 - 30 May 2025
Viewed by 811
Abstract
To address the critical challenges of insufficient monitoring capabilities and vulnerable defense systems against drones in regional airports, this study proposes a multi-source data fusion framework for rapid UAV detection. Building upon the YOLO v11 architecture, we develop an enhanced model incorporating four key innovations: (1) A dual-path RGB-IR fusion architecture that exploits complementary multi-modal data; (2) C3k2-DATB dynamic attention modules for enhanced feature extraction and semantic perception; (3) A bilevel routing attention mechanism with agent queries (BRSA) for precise target localization; (4) A semantic-detail injection (SDI) module coupled with windmill-shaped convolutional detection heads (PCHead) and Wasserstein Distance loss to expand receptive fields and accelerate convergence. Experimental results demonstrate superior performance with 99.3% mAP@50 (17.4% improvement over baseline YOLOv11), while maintaining lightweight characteristics (2.54M parameters, 7.8 GFLOPS). For practical deployment, we further enhance tracking robustness through an improved BoT-SORT algorithm within an interactive multiple model framework, achieving 91.3% MOTA and 93.0% IDF1 under low-light conditions. This integrated solution provides cost-effective, high-precision drone surveillance for resource-constrained airports.
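The Wasserstein Distance loss mentioned in innovation (4) is commonly instantiated for small objects as a normalized Gaussian Wasserstein distance, which treats each box as a 2-D Gaussian and compares centers and half-sizes. Whether the paper uses exactly this form is not stated, so the sketch below, including the constant c, is an assumption about one standard formulation rather than the authors' loss.

import torch

def nwd_loss(pred, target, c: float = 12.8):
    """Normalized Gaussian Wasserstein distance between axis-aligned boxes given as
    (cx, cy, w, h); returns 1 - similarity as a loss. c is a dataset-dependent scale."""
    d2 = ((pred[..., 0] - target[..., 0]) ** 2            # center x
          + (pred[..., 1] - target[..., 1]) ** 2          # center y
          + ((pred[..., 2] - target[..., 2]) / 2) ** 2    # half width
          + ((pred[..., 3] - target[..., 3]) / 2) ** 2)   # half height
    similarity = torch.exp(-torch.sqrt(d2) / c)
    return (1.0 - similarity).mean()

pred = torch.tensor([[50.0, 50.0, 10.0, 10.0]])
gt = torch.tensor([[52.0, 49.0, 12.0, 9.0]])
print(nwd_loss(pred, gt))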

30 pages, 10008 KiB  
Article
Integrating Stride Attention and Cross-Modality Fusion for UAV-Based Detection of Drought, Pest, and Disease Stress in Croplands
by Yan Li, Yaze Wu, Wuxiong Wang, Huiyu Jin, Xiaohan Wu, Jinyuan Liu, Chen Hu and Chunli Lv
Agronomy 2025, 15(5), 1199; https://doi.org/10.3390/agronomy15051199 - 15 May 2025
Viewed by 595
Abstract
Timely and accurate detection of agricultural disasters is crucial for ensuring food security and enhancing post-disaster response efficiency. This paper proposes a deployable UAV-based multimodal agricultural disaster detection framework that integrates multispectral and RGB imagery to simultaneously capture the spectral responses and spatial structural features of affected crop regions. To this end, we design an innovative stride–cross-attention mechanism, in which stride attention is utilized for efficient spatial feature extraction, while cross-attention facilitates semantic fusion between heterogeneous modalities. The experimental data were collected from representative wheat and maize fields in Inner Mongolia, using UAVs equipped with synchronized multispectral (red, green, blue, red edge, near-infrared) and high-resolution RGB sensors. Through a combination of image preprocessing, geometric correction, and various augmentation strategies (e.g., MixUp, CutMix, GridMask, RandAugment), the quality and diversity of the training samples were significantly enhanced. The model trained on the constructed dataset achieved an accuracy of 93.2%, an F1 score of 92.7%, a precision of 93.5%, and a recall of 92.4%, substantially outperforming mainstream models such as ResNet50, EfficientNet-B0, and ViT across multiple evaluation metrics. Ablation studies further validated the critical role of the stride attention and cross-attention modules in performance improvement. This study demonstrates that the integration of lightweight attention mechanisms with multimodal UAV remote sensing imagery enables efficient, accurate, and scalable agricultural disaster detection under complex field conditions.
(This article belongs to the Special Issue New Trends in Agricultural UAV Application—2nd Edition)
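Stride attention, used above for efficient spatial feature extraction, is generally realized by subsampling the keys and values with a strided projection so that full-resolution queries attend to a much smaller token set. The sketch below shows that pattern; the stride, shapes, and class name are illustrative and may differ from the paper's design.

import torch
import torch.nn as nn

class StrideAttention(nn.Module):
    """Self-attention over feature-map pixels in which keys/values are spatially
    subsampled by a strided convolution, cutting the attention cost by stride**2."""
    def __init__(self, dim: int, stride: int = 4):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, kernel_size=1)
        self.kv = nn.Conv2d(dim, 2 * dim, kernel_size=stride, stride=stride)
        self.scale = dim ** -0.5

    def forward(self, x):                                   # (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)            # (B, HW, C)
        k, v = self.kv(x).flatten(2).transpose(1, 2).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, C, H, W)

x = torch.randn(1, 64, 32, 32)
print(StrideAttention(64)(x).shape)  # torch.Size([1, 64, 32, 32])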

25 pages, 8140 KiB  
Article
RDCRNet: RGB-T Object Detection Network Based on Cross-Modal Representation Model
by Yubin Li, Weida Zhan, Yichun Jiang and Jinxin Guo
Entropy 2025, 27(4), 442; https://doi.org/10.3390/e27040442 - 19 Apr 2025
Cited by 1 | Viewed by 743
Abstract
RGB-thermal object detection harnesses complementary information from visible and thermal modalities to enhance detection robustness in challenging environments, particularly under low-light conditions. However, existing approaches suffer from limitations due to their heavy dependence on precisely registered data and insufficient handling of cross-modal distribution disparities. This paper presents RDCRNet, a novel framework incorporating a Cross-Modal Representation Model to effectively address these challenges. The proposed network features a Cross-Modal Feature Remapping Module that aligns modality distributions through statistical normalization and learnable correction parameters, significantly reducing feature discrepancies between modalities. A Cross-Modal Refinement and Interaction Module enables sophisticated bidirectional information exchange via trinity refinement for intra-modal context modeling and cross-attention mechanisms for unaligned feature fusion. Multiscale detection capability is enhanced through a Cross-Scale Feature Integration Module, improving detection performance across various object sizes. To overcome the inherent data scarcity in RGB-T detection, we introduce a self-supervised pretraining strategy that combines masked reconstruction with adversarial learning and semantic consistency loss, effectively leveraging both aligned and unaligned RGB-T samples. Extensive experiments demonstrate that RDCRNet achieves state-of-the-art performance on multiple benchmark datasets while maintaining high computational and storage efficiency, validating its superiority and practical effectiveness in real-world applications.
(This article belongs to the Topic Color Image Processing: Models and Methods (CIP: MM))
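The Cross-Modal Feature Remapping idea, statistical normalization plus learnable correction parameters, can be sketched in the spirit of adaptive instance normalization: standardize the thermal features per channel, rescale them with the RGB statistics, and let learnable parameters correct the result. Shapes and names below are assumptions, not the paper's module.

import torch
import torch.nn as nn

class FeatureRemap(nn.Module):
    """Aligns the thermal feature distribution to the RGB one: standardize per
    channel, then rescale with the RGB statistics plus learnable corrections."""
    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))   # learnable correction
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.eps = eps

    def forward(self, thermal, rgb):
        t_mu = thermal.mean((2, 3), keepdim=True)
        t_std = thermal.std((2, 3), keepdim=True)
        r_mu = rgb.mean((2, 3), keepdim=True)
        r_std = rgb.std((2, 3), keepdim=True)
        normed = (thermal - t_mu) / (t_std + self.eps)
        return self.gamma * (normed * r_std + r_mu) + self.beta

t, r = torch.randn(1, 64, 48, 48), torch.randn(1, 64, 48, 48) * 3 + 1
print(FeatureRemap(64)(t, r).shape)  # torch.Size([1, 64, 48, 48])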

12 pages, 1902 KiB  
Article
Edge-Supervised Attention-Aware Fusion Network for RGB-T Semantic Segmentation
by Ming Wang, Zhongjie Zhu, Yuer Wang, Renwei Tu, Jiuxing Weng and Xianchao Yu
Electronics 2025, 14(8), 1489; https://doi.org/10.3390/electronics14081489 - 8 Apr 2025
Viewed by 670
Abstract
To address the limitations in the efficiency of modality feature fusion in existing RGB-T semantic segmentation methods, which restrict segmentation performance, this paper proposes an edge-supervised attention-aware algorithm to enhance segmentation capabilities. Firstly, we design a feature fusion module incorporating channel and spatial attention mechanisms to achieve effective complementation and enhancement of RGB-T features. Secondly, we introduce an edge-aware refinement module that processes low-level modality features using global and local attention mechanisms, obtaining fine-grained feature information through element-wise multiplication. Building on this, we design a parallel structure of dilated convolutions to extract multi-scale detail information. Additionally, an EdgeHead is introduced after the edge-aware refinement module, with edge supervision applied to further enhance edge detail capture. Finally, the optimized fused features are fed into a decoder to complete the RGB-T semantic segmentation task. Experimental results demonstrate that our algorithm achieves mean Intersection over Union (mIoU) scores of 58.52% and 85.38% on the MFNet and PST900 datasets, respectively, significantly improving the accuracy of RGB-T semantic segmentation.
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
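The parallel structure of dilated convolutions used for multi-scale detail extraction can be sketched as a few 3x3 branches with different dilation rates whose outputs are concatenated and merged. The dilation rates, channel sizes, and class name below are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class ParallelDilatedConv(nn.Module):
    """Runs 3x3 convolutions with several dilation rates in parallel and merges
    the branches, enlarging the receptive field without losing resolution."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.merge = nn.Conv2d(len(dilations) * channels, channels, kernel_size=1)

    def forward(self, x):
        return self.merge(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 64, 60, 80)
print(ParallelDilatedConv(64)(x).shape)  # torch.Size([1, 64, 60, 80])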

31 pages, 15440 KiB  
Article
Enhancing Mirror and Glass Detection in Multimodal Images Based on Mathematical and Physical Methods
by Jiyuan Qiu and Chen Jiang
Mathematics 2025, 13(5), 747; https://doi.org/10.3390/math13050747 - 25 Feb 2025
Viewed by 1057
Abstract
The detection of mirrors and glass, which possess unique optical surface properties, has garnered significant attention in recent years. Due to their reflective and transparent nature, these surfaces are often difficult to distinguish from their surrounding environments, posing substantial challenges even for advanced deep learning models tasked with performing such detection. Current research primarily relies on complex network models that learn and fuse different modalities of images, such as RGB, depth, and thermal, to achieve mirror and glass detection. However, these approaches often overlook the inherent limitations in the raw data caused by sensor deficiencies when facing mirrors and glass surfaces. To address this issue, we applied mathematical and physical methods, such as three-point plane determination and steady-state heat conduction in two-dimensional planes, along with an RGB enhancement module, to reconstruct RGB, depth, and thermal data for mirrors and glass in two publicly available datasets: an RGB-D mirror detection dataset and an RGB-T glass detection dataset. Additionally, we synthesized four enhanced and ideal datasets. Furthermore, we propose a double weight Mamba fusion network (DWMFNet) that strengthens the model’s global perception of image information by extracting low-level clue weights and high-level contextual weights from the input data using the prior fusion feature extraction module (PFFE) and the deep fusion feature guidance module (DFFG). This is complemented by the Mamba module, which efficiently captures long-range dependencies, facilitating information complementarity between multi-modal features. Extensive experiments demonstrate that our data enhancement method significantly improves the model’s capability in detecting mirrors and glass surfaces.
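Three-point plane determination, used here to repair depth data on mirror surfaces, is plain vector geometry: the plane normal is the cross product of two edge vectors, and the missing depth at any location follows from the plane equation. The sketch below shows only that geometric step with made-up sample points; it is not the paper's reconstruction pipeline.

import numpy as np

def plane_from_points(p1, p2, p3):
    """Return (normal, d) of the plane n.x + d = 0 through three 3-D points."""
    p1, p2, p3 = map(np.asarray, (p1, p2, p3))
    normal = np.cross(p2 - p1, p3 - p1)
    normal = normal / np.linalg.norm(normal)
    return normal, -normal.dot(p1)

def depth_on_plane(normal, d, x, y):
    """Solve n.(x, y, z) + d = 0 for z: the reconstructed depth at planar coords (x, y)."""
    nx, ny, nz = normal
    return -(nx * x + ny * y + d) / nz

n, d = plane_from_points((0, 0, 2.0), (1, 0, 2.1), (0, 1, 2.2))
print(depth_on_plane(n, d, 0.5, 0.5))   # 2.15: depth of the plane point above (0.5, 0.5)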

18 pages, 14931 KiB  
Article
Wavelet-Driven Multi-Band Feature Fusion for RGB-T Salient Object Detection
by Jianxun Zhao, Xin Wen, Yu He, Xiaowei Yang and Kechen Song
Sensors 2024, 24(24), 8159; https://doi.org/10.3390/s24248159 - 20 Dec 2024
Cited by 1 | Viewed by 1312
Abstract
RGB-T salient object detection (SOD) has received considerable attention in the field of computer vision. Although existing methods have achieved notable detection performance in certain scenarios, challenges remain. Many methods fail to fully utilize high-frequency and low-frequency features during information interaction among different scale features, limiting detection performance. To address this issue, we propose a method for RGB-T salient object detection that enhances performance through wavelet transform and channel-wise attention fusion. Through feature differentiation, we effectively extract spatial characteristics of the target, enhancing the detection capability for global context and fine-grained details. First, input features are passed through the channel-wise criss-cross module (CCM) for cross-modal information fusion, adaptively adjusting the importance of features to generate rich fusion information. Subsequently, the multi-scale fusion information is input into the feature selection wavelet transform module (FSW), which selects beneficial low-frequency and high-frequency features to improve feature aggregation performance and achieves higher segmentation accuracy through long-distance connections. Extensive experiments demonstrate that our method outperforms 22 state-of-the-art methods.
(This article belongs to the Special Issue Multi-Modal Image Processing Methods, Systems, and Applications)
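The wavelet step in the FSW module rests on splitting a feature map into one low-frequency band and three high-frequency detail bands. A single-level 2-D Haar decomposition, the simplest such split, is sketched below; the paper's choice of wavelet and its feature-selection mechanism may differ.

import torch

def haar_decompose(x: torch.Tensor):
    """Single-level 2-D Haar transform of a feature map (B, C, H, W) with even H, W:
    returns the low-frequency band LL and the three high-frequency detail bands."""
    a = x[..., 0::2, 0::2]   # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]   # top-right
    c = x[..., 1::2, 0::2]   # bottom-left
    d = x[..., 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2          # low-frequency approximation
    d1 = (a + b - c - d) / 2          # detail band
    d2 = (a - b + c - d) / 2          # detail band
    d3 = (a - b - c + d) / 2          # detail band
    return ll, (d1, d2, d3)

x = torch.randn(1, 32, 64, 64)
ll, highs = haar_decompose(x)
print(ll.shape, highs[0].shape)  # both torch.Size([1, 32, 32, 32])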
