MDPI - Publisher of Open Access Journals

25 pages, 2042 KB

Open Access Article

MSAFusion: A Lightweight Multispectral Pedestrian Detection Network with Multi-Scale and Adaptive Feature Fusion

by Yang Song, Xin Zuo, Chenyu Qu, Qiang Qian and Dengbiao Jiang

J. Imaging 2026, 12(6), 246; https://doi.org/10.3390/jimaging12060246 - 30 May 2026

Viewed by 262

Robust multispectral pedestrian detection remains challenging in complex environments such as those with low illumination, strong thermal contrast, and background clutter. Although RGB–thermal fusion provides complementary cues, lightweight detectors often suffer from unstable feature representation across scales and insufficient control over modality-biased responses during fusion, which can degrade localization accuracy and weaken the detection of small or distant pedestrians. To address these issues, we develop a lightweight stage-wise RGB–thermal fusion pipeline that integrates pre-fusion feature refinement, cross-modal interaction, and post-fusion adaptive recalibration. Specifically, a Multi-scale Feature Refinement (MSFR) module is proposed at the mid-level to enhance modality-specific representations by jointly modeling local details and contextual information, thereby reducing scale-sensitive noise before interaction. An established Cross-Modality Fusion Transformer (CFT) is then adopted to promote semantic correspondence between RGB and thermal features. After interaction, an Adaptive Feature Recalibration (AFR) module is introduced to suppress background-dominated and modality-biased responses through lightweight channel-wise adjustment. Extensive experiments on three public RGB–thermal benchmarks, including the pedestrian-focused KAIST and LLVIP datasets together with the FLIR-aligned road-scene benchmark, demonstrate that the proposed method achieves a favorable accuracy–efficiency trade-off, delivering consistent improvements over the lightweight baseline while maintaining a compact architecture and real-time inference capability. Full article

(This article belongs to the Section Color, Multi-spectral, and Hyperspectral Imaging)

31 pages, 9142 KB

Open Access Article

GMD-YOLO: A Dual-Modality Framework with Multi-Scale Enhancement and Adaptive Fusion for PV Fault Detection

by Zhichao Lin, Xiuling Wang and Yuyang Guo

Sensors 2026, 26(11), 3394; https://doi.org/10.3390/s26113394 - 27 May 2026

Viewed by 372

Abstract

Photovoltaic (PV) module faults, such as hotspots, diode short circuits, occlusions, and shadows, degrade power generation efficiency and safety. Existing manual inspection and single-modality methods show limited robustness under complex conditions, especially with illumination variations and weak thermal responses, while most deep learning approaches fail to exploit the complementarity of visible and infrared modalities. To address this issue, a dual-modality visible–infrared fusion framework based on YOLO11 is proposed, integrating a multi-scale pyramid pooling and dilated convolution module (MSPPD), a gradient-aware fusion module (GAFusion), and a dynamic convolution and element-wise scaling detection head (Detect-DEhead). GAFusion enhances cross-modal structural consistency and reduces feature misalignment and information loss during fusion by introducing gradient-aware feature interaction. Shape-IoU loss is employed to improve localization accuracy. The proposed method improves mean average precision (mAP)@0.5 from 86.7% to 88.1%, while reducing parameters, computational cost, and model size from 4.3 M to 3.7 M, 11.42 GFLOPs to 9.37 GFLOPs, and 9.1 MB to 7.9 MB, respectively. With Shape-IoU, mAP@0.5 reaches 88.4%, and recall increases from 78.5% to 84.9%. Experiments on the FLIR Thermal dataset achieve gains of 2.2%, 1.6%, and 2.7% in precision, recall, and mAP@0.5. The method achieves an effective trade-off between accuracy and efficiency for intelligent PV module inspection. Full article
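The reductions reported in this abstract can be restated as relative changes. The short calculation below is only a reading aid that uses the figures quoted above; the helper name `relative_change` and the dictionary labels are illustrative and do not come from the paper.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (baseline, proposed) pairs exactly as quoted in the abstract.
reported = {
    "parameters (M)": (4.3, 3.7),
    "compute (GFLOPs)": (11.42, 9.37),
    "model size (MB)": (9.1, 7.9),
}

changes = {name: relative_change(b, a) for name, (b, a) in reported.items()}

for name, pct in changes.items():
    # A negative value means the proposed model is smaller or cheaper.
    print(f"{name}: {pct:+.1f}%")
```

Under these figures the parameter count, compute, and model size drop by roughly 14%, 18%, and 13%, respectively, while mAP@0.5 rises by 1.4 percentage points (86.7% to 88.1%).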

(This article belongs to the Section Fault Diagnosis & Sensors)

17 pages, 25138 KB

Open Access Article

Deep Learning for Low-Light Vision: An Efficient Infrared–Visible Fusion Approach

by Jiajie Lu, Viviana Desantis, Marco Brando Mario Paracchini and Marco Marcon

Appl. Sci. 2026, 16(10), 4737; https://doi.org/10.3390/app16104737 - 10 May 2026

Viewed by 303

Abstract

Low-light enhancement technologies are of great significance for visual driver assistance applications and autonomous driving systems. Infrared vision can improve nighttime visibility but also faces challenges of low resolution and lack of color information. This paper presents a unified framework for RGB-guided infrared super-resolution and infrared-visible fusion that achieves high-resolution output under limited computational resources. Our approach employs a U-Net architecture with novel triple-grouped window attention (TGWA) encoding that captures global dependencies through grouped attention while reducing computational overhead, and adaptive multi-dilated convolutional (AMDC) decoding that adaptively selects optimal dilation rates using mixture-of-experts-inspired routing. Experiments on multiple datasets achieve competitive super-resolution and fusion results with minimal computational complexity, while real-world downstream object detection validation confirms robust performance in challenging nighttime scenarios. Quantitatively, the proposed method achieves 28.744 dB/0.872 SSIM on PBVS24 and 31.424 dB/0.882 SSIM on HDRT-Night for 8× infrared super-resolution, reaches competitive fusion quality on both MSRS and HDRT-Night, and attains 69.4% mAP@0.5 in downstream object detection on FLIR_aligned, while requiring only 1.12 M parameters and 85.44 G FLOPs. This work provides new possibilities for seeing clearly in the dark. Full article

(This article belongs to the Special Issue Recent Advances in Hyperspectral Imaging Technology)

23 pages, 10261 KB

Open Access Article

A Method for Lightweight Pedestrian and Vehicle Detection for Unmanned Ground Vehicles in Open Environments

by Xulong Zhang, Hong Jiang, Dong Han, Hai Guo, Xiangfeng Zhang and Kaige Sun

Machines 2026, 14(5), 527; https://doi.org/10.3390/machines14050527 - 8 May 2026

Cited by 1 | Viewed by 354

Abstract

In open environments, lightweight pedestrian and vehicle detection models deployed on edge platforms of Unmanned Ground Vehicles (UGVs) often struggle to balance detection accuracy and inference efficiency when facing complex backgrounds, distant small targets, and occluded objects. To address this, we propose a lightweight object detection model based on YOLO11n, named UGV-Net. This model enhances feature interaction and global context modeling capabilities by introducing the C3k2_PS module, employs Dysample dynamic upsampling to achieve content-aware feature reconstruction, and designs an LSDECD detection head to reduce multi-scale prediction redundancy and computational overhead, thereby balancing detection accuracy and inference efficiency. Compared with the baseline model YOLO11n, UGV-Net improves F1 score, mAP50, and mAP50:95 by 2.19%, 2.26%, and 2.10%, respectively, on the KITTI dataset, while reducing GFLOPs from 6.3 to 4.9 and the number of parameters from 2.58M to 2.41M. Similarly, on the SODA10M and FLIR datasets, the F1 score improves by 1.71% and 1.41%, and mAP50 improves by 1.67% and 2.39%, respectively, demonstrating excellent detection accuracy and generalization ability. Furthermore, experiments on the Jetson Orin Nano platform verify that UGV-Net achieves robust real-time detection performance, making it an efficient, reliable, and lightweight solution for UGV perception in open environments. Full article

(This article belongs to the Section Robotics, Mechatronics and Intelligent Machines)

27 pages, 17739 KB

Open Access Article

3D Radiometric Thermography Mosaics with Low-Cost Mobile Sensor Stack

by Scott McAvoy, Jonathan Klingspon, Adrian Tong, Eric Lo, Nathan Hui, Maurizio Seracini, Dominique Rissolo, Neal Driscoll and Falko Kuester

Remote Sens. 2026, 18(9), 1335; https://doi.org/10.3390/rs18091335 - 27 Apr 2026

Viewed by 469

Abstract

Infrared thermography provides key information for a wide range of diagnostic applications within built and natural environments. As thermal states are changing with ambient conditions, it is important to deploy thermal imaging systems and operators opportunistically. It is therefore an attractive proposition to make these systems more affordable and accessible. Low-cost thermal sensors generally produce low-resolution outputs. To increase data density across large subjects, diagnosticians may create image mosaics from multiple overlapping thermographs. The registration of individual inputs into large mosaics is aided by the acquisition of additional sensor data (photographs and depthmaps), which can provide critical spatial references. In many cases, the materials inherent to the modern built environment present challenges to traditional data registration workflows between multiple sensor streams. Mobile devices offer an opportunity to innovate in the creation of these mosaics, integrating rapid geospatial mapping functionality with radiometric thermography within a 3D context. In this paper the authors evaluate the FLIR One Pro thermal camera module along with iOS/iPhone specific rapid mapping capabilities, and present a methodology: (1) introducing a workflow for the integration of short-range (within 0.3–5 m capture distance) iPhone mobile sensor data into modeling pipelines; (2) introducing a calibration model enabling effective registration and fusion of multi-modal inputs from the iPhone mobile sensor stack and FLIR One thermographic module; and (3) detailing an alternative open-source methodology for the evaluation and translation of thermographic imagery for multi-sensor fusion. The end product of this pipeline is a 3D radiometric thermographic mosaic: a spatially continuous, textured surface model in which hundreds of individual low-resolution thermographs are fused into a single queryable output retaining full 16-bit temperature values at every point. 
All datasets have been made openly available and the two case studies used in this paper have been made accessible at full resolution for interactive 3D online viewing. Full article

(This article belongs to the Special Issue Remote Sensing for 2D/3D Mapping)

29 pages, 10248 KB

Open Access Article

Fs2PA: A Full-Scale Feature Synergistic Perception Architecture for Vehicular Infrared Object Detection via Physical Priors and Semantic Constraints

by Boxuan Pei, Leyuan Wu, Xiaoyan Zheng, Chao Zhou and Dingxiang Wang

Sensors 2026, 26(7), 2257; https://doi.org/10.3390/s26072257 - 6 Apr 2026

Cited by 1 | Viewed by 434

Abstract

Vehicular infrared object detection is a key technology supporting autonomous driving systems to achieve all-weather environmental perception. However, infrared images inherently lack texture, resulting in blurred object contours. Additionally, deep network propagation severely erodes and loses feature information of distant tiny objects. To address the above issues, this study proposes a Full-Scale Feature Synergistic Perception Architecture for vehicular infrared object detection. This architecture first designs a Gradient-Informed Attention module, which initializes convolution kernels through physical gradient operators to inject geometric prior information into the network, enhancing the model’s perception capability of blurred object boundaries. Secondly, it constructs a Full-Scale Feature Pyramid containing a $P_{2}$ high-resolution feature layer to effectively recover the geometric detail features of distant tiny objects. Finally, it proposes a Scale-Aware Shared Head, which relies on a cross-scale parameter sharing mechanism to achieve extreme parameter compression, and simultaneously introduces deep semantic information to form strong constraints, suppressing noise interference in shallow features. Experimental results on the FLIR v2 and M3FD datasets show that the proposed architecture exhibits excellent detection performance. On FLIR v2, it raises mAP@50 to 64.06% (6.51% relative gain vs. YOLOv11) while maintaining 547 FPS inference speed, achieving an optimal accuracy–efficiency balance. Full article
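The idea of initializing convolution kernels from gradient operators can be illustrated with a small sketch. The abstract does not say which operator is used, so the Sobel kernels below are an assumption, and `correlate_valid` and the toy `patch` are hypothetical names chosen for this example rather than anything from the paper.

```python
from typing import List

Matrix = List[List[float]]

# Sobel kernels (an assumed choice of "physical gradient operator").
SOBEL_X: Matrix = [[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]]
SOBEL_Y: Matrix = [[-1.0, -2.0, -1.0], [0.0, 0.0, 0.0], [1.0, 2.0, 1.0]]


def correlate_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide `kernel` over `image` without padding (cross-correlation,
    which is what a deep-learning convolution layer computes)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Toy 4x5 patch: cool background (0) on the left, warm object (1) on the right.
patch: Matrix = [[0.0, 0.0, 0.0, 1.0, 1.0] for _ in range(4)]

gx = correlate_valid(patch, SOBEL_X)  # responds to the vertical boundary
gy = correlate_valid(patch, SOBEL_Y)  # no horizontal boundary, so all zeros
```

A kernel initialized this way responds to the object boundary before any training has taken place, which is one plausible reading of injecting a geometric prior into the network.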

(This article belongs to the Special Issue AI Agent Driven Sensing, Data Acquisition, and Signal Processing Methods in Autonomous Driving)

32 pages, 11735 KB

Open Access Article

GEM-YOLO: A Lightweight and Real-Time RGBT Object Detector with Gated Multimodal Fusion

by Lijuan Wang, Zuchao Bao and Dongming Lu

Sensors 2026, 26(7), 2035; https://doi.org/10.3390/s26072035 - 25 Mar 2026

Viewed by 981

Abstract

Red–Green–Blue–Thermal (RGBT) object detection is critical for robust all-weather perception. However, deploying dual-stream networks on resource-constrained edge devices is severely hindered by insufficiently adaptive multimodal fusion, the loss of small-object features during downsampling, and substantial computational overhead. To address these challenges, we propose GEM-YOLO, a real-time and lightweight RGBT detector. Specifically, an Adaptive Multimodal Gated Fusion Mechanism (GFM) is designed to dynamically calibrate modality weights and suppress noise. Furthermore, Space-to-Depth (SPD) convolutions are integrated into the backbone to achieve lossless downsampling, preventing the feature collapse of small targets. Finally, a lightweight Ghost-Neck is constructed using Ghost modules and GSConv to eliminate computational redundancy. Extensive experiments on the Forward-Looking Infrared (FLIR) and Multi-Modal Multispectral Fusion Dataset (M3FD) datasets demonstrate the effectiveness of the proposed method. With only 7.58 Giga Floating-Point Operations (GFLOPs) and 3.44 million parameters (M), GEM-YOLO reduces the computational cost by 18.6% relative to the dual-stream YOLOv11n baseline. Concurrently, it achieves competitive mean Average Precision at IoU = 0.5 (mAP@50) scores of 82.8% and 69.0% on FLIR and M3FD, respectively, with more evident gains on small-target localization. In practice, GEM-YOLO maintains competitive detection performance while keeping computational overhead low, making it promising for real-time multispectral perception on resource-constrained edge platforms. Full article
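The space-to-depth rearrangement mentioned in this abstract can be sketched in a few lines. This is a minimal illustration of the general operation, not the authors' implementation: the function name `space_to_depth`, the `[row][col][channel]` layout, and the channel ordering are all assumptions, and the convolution that would normally follow the rearrangement is omitted.

```python
from typing import List

Tensor3D = List[List[List[float]]]  # indexed as [row][col][channel]


def space_to_depth(x: Tensor3D, block: int = 2) -> Tensor3D:
    """Move each `block` x `block` spatial neighbourhood into the channel axis.

    An H x W x C input becomes (H/block) x (W/block) x (C * block**2).
    Every input value is kept, unlike strided convolution or pooling.
    """
    h, w = len(x), len(x[0])
    if h % block or w % block:
        raise ValueError("height and width must be divisible by block")
    out: Tensor3D = []
    for i in range(0, h, block):
        row = []
        for j in range(0, w, block):
            cell: List[float] = []
            for di in range(block):
                for dj in range(block):
                    cell.extend(x[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out


# 4x4 single-channel map whose values encode their own position (0..15).
feature = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
packed = space_to_depth(feature)  # 2x2 map with 4 channels
```

Here `packed[0][0]` holds the four values of the top-left 2×2 neighbourhood, so the spatial resolution is halved while every original value survives in the channel dimension.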

(This article belongs to the Special Issue Advanced Sensor Technologies for Multimodal Decision-Making)

14 pages, 3141 KB

Open Access Article

Enhanced Real-Time Detector for Industrial Vision-Based Corn Impurity Detection

by Xiao Zhang, Yuhang Bian, Xiangdong Li, Haoze Yu, Dong Li and Min Wu

Foods 2026, 15(6), 1065; https://doi.org/10.3390/foods15061065 - 18 Mar 2026

Viewed by 375

Abstract

The effective cleaning of corn prior to storage is crucial for ensuring grain quality and safety. Traditional Convolutional Neural Network (CNN)-based detection methods often struggle to maintain accuracy in scenarios with dense occlusions. Furthermore, limitations in image quality and feature representation hinder their generalization to diverse impurity types. To address these challenges, this paper proposes an enhanced real-time detector transformer model named RT-DETR-CD (Real-Time Detector Transformer with Convolution and Dynamic Upsampling) for corn impurity detection based on industrial vision. This approach integrates Receptive Field Attention Convolutions (RFAConv) to enhance sensitivity to local texture details and employs the dynamic upsampling operator DySample to restore high-frequency edge information. Additionally, a novel Inner-Shape-IoU loss function is introduced to accelerate bounding box regression for objects with varying aspect ratios. Images were captured using FLIR industrial cameras under controllable annular LED illumination. Experiments on a self-built dataset demonstrate that the proposed model achieves a 4.7% improvement in mean average precision (mAP) and operates at 68 frames per second (FPS), outperforming the original RT-DETR model in both accuracy and speed. This work provides a practical solution for real-time, high-precision impurity detection on grain processing lines. Full article

(This article belongs to the Section Food Analytical Methods)

18 pages, 1878 KB

Open Access Article

Recognition Task-Based Detection Score: A Task-Oriented Evaluation Metric for Infrared Image Colorization

by Hao Wang, Jiaming Cai, Yao Hu, Chenglong Zhang and Qun Hao

Sensors 2026, 26(6), 1807; https://doi.org/10.3390/s26061807 - 13 Mar 2026

Viewed by 431

Abstract

Infrared image colorization has gained widespread attention in recent years as an important means of enhancing image visibility and semantic expression. However, existing evaluation methods mostly rely on pixel-level differences or feature distribution distances, failing to comprehensively reflect the usability of colorization results in practical tasks. To address this, we propose a task-oriented colorization quality evaluation metric called Recognition-Task based Detection Score (RDS), which uses the recognition accuracy of object detection models on colorized images as a proxy indicator to measure their actual performance in downstream tasks, thereby achieving consistency between image quality assessment and task performance. RDS incorporates three key characteristics in its design: enhancing position robustness through the matching mechanism of object detection tasks, providing fine-grained interpretability through category-level accuracy calculation, and achieving task adjustability through flexible category division strategies. Systematic experiments conducted on both NIR–RGB and FLIR-5C datasets demonstrate that RDS maintains good subjective–objective consistency with traditional metrics under standard registration conditions, exhibits superior stability under registration error scenarios, and possesses fine-grained interpretability and task adjustability that traditional metrics lack. RDS maintains a 5.7% improvement in discriminative Score Gap under misalignment while PSNR degrades by 69.8%, and flexible category merging raises TIC-CGAN’s RDS from 76.05% to 96.45% on unseen scenes, providing more practically valuable criteria for the evaluation and optimization of infrared colorization models. Full article

(This article belongs to the Special Issue AI-Based Visual Sensing for Object Detection)

21 pages, 4639 KB

Open Access Article

Deep Learning-Based Real-Time Vehicle Tire and Tank Temperature Monitoring Using Thermal Cameras

by Yaoyao Hu, Jiaxin Li, Chuanyi Ma, Shuai Cheng, Ruolin Zheng and Xingang Zhang

Appl. Sci. 2026, 16(6), 2656; https://doi.org/10.3390/app16062656 - 11 Mar 2026

Viewed by 502

Abstract

Ensuring the driving safety of hazardous chemical vehicles is a critical priority. High temperatures in tires and tanks can lead to catastrophic accidents, including fires and road damage, particularly in bridge and tunnel sections. Therefore, the purpose of this study is to utilize deep learning to obtain the temperature of vehicle tires and tanks in real time. We constructed a comprehensive dataset by combining the FLIR infrared vehicle dataset, the SPT visible tire dataset, and self-collected thermal video frames captured in various environments. State-of-the-art object detection models, including different scales of YOLOv8, YOLOv9, and YOLOv10, were evaluated for the multi-target detection of vehicles, tires, and tanks. Comparative analysis reveals that the YOLOv8-L model optimized with the GIoU loss function delivers the best performance. Specifically, it achieves a mean Average Precision (mAP) of 97.9% with an average inference time of 6.9 ms per frame, effectively balancing accuracy and real-time efficiency. Finally, by mapping the detection bounding boxes to the radiometric temperature matrix, the system achieves precise, real-time temperature monitoring of the vehicle components. Full article
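The final step described here, mapping a detection box onto a radiometric temperature matrix, might look like the sketch below. It assumes the matrix is a per-pixel grid in degrees Celsius aligned with the detector's image coordinates; the function name `box_temperature`, the `(x1, y1, x2, y2)` box convention, and the sample values are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def box_temperature(
    temps: List[List[float]], box: Tuple[int, int, int, int]
) -> Tuple[float, float]:
    """Return (max, mean) temperature inside a pixel box (x1, y1, x2, y2).

    x2 and y2 are exclusive; the box is clamped to the matrix bounds.
    """
    x1, y1, x2, y2 = box
    height, width = len(temps), len(temps[0])
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    values = [temps[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    if not values:
        raise ValueError("box does not overlap the temperature matrix")
    return max(values), sum(values) / len(values)


# Hypothetical 4x5 temperature matrix with a warm 2x2 region (e.g. a tire).
temps = [
    [20.0, 20.0, 21.0, 21.0, 20.0],
    [20.0, 55.0, 60.0, 21.0, 20.0],
    [20.0, 58.0, 62.0, 21.0, 20.0],
    [20.0, 20.0, 21.0, 21.0, 20.0],
]

hot_max, hot_mean = box_temperature(temps, (1, 1, 3, 3))
```

Reporting both the maximum and the mean is a design choice for this sketch: the maximum flags hotspots, while the mean is less sensitive to single noisy pixels.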

24 pages, 11192 KB

Open Access Article

FCAT: Frequency-Domain Cross-Attention for All-Weather Multispectral Object Detection in Low-Altitude UAV Security Inspection of Urban and Industrial Areas

by Kewei Li, Ziyi Zhong, Ziyue Luo, Haishan Tian, Kui Wang, Han Jiang, Deyuan Xiang and Weiwei Tang

Remote Sens. 2026, 18(5), 826; https://doi.org/10.3390/rs18050826 - 7 Mar 2026

Cited by 1 | Viewed by 609

Abstract

UAVs are widely used for all-weather, round-the-clock security inspections in urban and industrial areas. However, pure visible-light systems fail at night or in adverse weather conditions, while pure infrared methods are limited by thermal noise, low spatial resolutions, and high false alarm rates. Multispectral images render the task of object detection highly reliable and robust by providing complementary target feature information. This study suggests a frequency-based cross-attention transformer (FCAT) for multispectral object detection as a solution to this issue. This approach collects cross-modal complementary characteristics, effectively learns and integrates global contextual information via the cross-attention mechanism, and greatly increases multispectral object detection accuracy. At the same time, spatial-domain features are mapped to the frequency domain via the Fourier transform, and the scaled dot product attention is estimated via element-wise product operations, which break through the limitation of traditional spatial-domain matrix multiplication and effectively reduce the computational cost of the model. Additionally, this study independently builds a multi-scene multi-time climate visible–infrared dataset (OPVM-VIRD), which contains 20,025 target instances, to address the issue of the lack of all-weather cross-spectral data in object detection tasks from the perspective of UAVs. Experimental findings from the OPVM-VIRD, M3FD, and FLIR datasets demonstrate that our proposed approach outperforms prevailing state-of-the-art multispectral object detection algorithms on public benchmarks, while the FCAT model achieves an mAP50 score of 94.7% on our custom-built dataset—10.8% higher than ICAF. At the same time, the number of FCAT parameters is 85.26 M, which is significantly lower than that of mainstream models, such as ICAF. 
Therefore, the FCAT is a change detection strategy with strong model generalization abilities, and it has important application value in the all-day and all-weather security patrol of cities and industrial parks carried out by UAVs. Full article

(This article belongs to the Section Remote Sensing Image Processing)

29 pages, 13806 KB

Open Access Article

DCAM-DETR: Dual Cross-Attention Mamba Detection Transformer for RGB–Infrared Anti-UAV Detection

by Zemin Qin and Yuheng Li

Information 2026, 17(1), 103; https://doi.org/10.3390/info17010103 - 19 Jan 2026

Cited by 3 | Viewed by 1416

Abstract

The proliferation of unmanned aerial vehicles (UAVs) poses escalating security threats across critical infrastructures, necessitating robust real-time detection systems. Existing vision-based methods predominantly rely on single-modality data and exhibit significant performance degradation under challenging scenarios. To address these limitations, we propose DCAM-DETR, a novel multimodal detection framework that fuses RGB and thermal infrared modalities through an enhanced RT-DETR architecture integrated with state space models. Our approach introduces four innovations: (1) a MobileMamba backbone leveraging selective state space models for efficient long-range dependency modeling with linear complexity $O(n)$; (2) Cross-Dimensional Attention (CDA) and Cross-Path Attention (CPA) modules capturing intermodal correlations across spatial and channel dimensions; (3) an Adaptive Feature Fusion Module (AFFM) dynamically calibrating multimodal feature contributions; and (4) a Dual-Attention Decoupling Module (DADM) enhancing detection head discrimination for small targets. Experiments on Anti-UAV300 demonstrate state-of-the-art performance with 94.7% mAP@0.5 and 78.3% mAP@0.5:0.95 at 42 FPS. Extended evaluations on FLIR-ADAS and KAIST datasets validate the generalization capacity across diverse scenarios. Full article

(This article belongs to the Special Issue Computer Vision for Security Applications, 2nd Edition)

22 pages, 5960 KB

Open Access Article

JFDet: Joint Fusion and Detection for Multimodal Remote Sensing Imagery

by Wenhao Xu and You Yang

Remote Sens. 2026, 18(1), 176; https://doi.org/10.3390/rs18010176 - 5 Jan 2026

Viewed by 922

Abstract

Multimodal remote sensing imagery, such as visible and infrared data, offers crucial complementary information that is vital for time-sensitive emergency applications like search and rescue or disaster monitoring, where robust detection under adverse conditions is essential. However, existing methods’ object detection performance is often suboptimal due to task-independent fusion and inherent modality inconsistency. To address this issue, we propose a joint fusion and detection approach for multimodal remote sensing imagery (JFDet). First, a gradient-enhanced residual module (GERM) is introduced to combine dense feature connections with gradient residual pathways, effectively enhancing structural representation and fine-grained texture details in fused images. For robust detection, we introduce a second-order channel attention (SOCA) mechanism and design a multi-scale contextual feature-encoding (MCFE) module to capture higher-order semantic dependencies, enrich multi-scale contextual information, and thereby improve the recognition of small and variably scaled objects. Furthermore, a dual-loss feedback strategy propagates detection loss to the fusion network, enabling adaptive synergy between low-level fusion and high-level detection. Experiments on the VEDAI and FLIR-ADAS datasets demonstrate that the proposed detection-driven fusion framework significantly improves both fusion quality and detection accuracy compared with state-of-the-art methods, highlighting its effectiveness and high potential for mission-critical multimodal remote sensing and time-sensitive application. Full article

(This article belongs to the Special Issue GeoAI and EO Big Data Driven Advances in Earth Environmental Science (Second Edition))

29 pages, 5902 KB

Open Access Article

MSLCP-DETR: A Multi-Scale Linear Attention and Sparse Fusion Framework for Infrared Small Target Detection in Vehicle-Mounted Systems

by Fu Li, Meimei Zhu, Ming Zhao, Yuxin Sun and Wangyu Wu

Mathematics 2026, 14(1), 67; https://doi.org/10.3390/math14010067 - 24 Dec 2025

Viewed by 890

Abstract

Detecting small infrared targets in vehicle-mounted systems remains challenging due to weak thermal radiation, cross-scale feature loss, and dynamic background interference. To address these issues, this paper proposes MSLCP-DETR, an enhanced RT-DETR-based framework that integrates multi-scale linear attention and sparse fusion mechanisms. The model introduces three novel components: a Multi-Scale Linear Attention Encoder (MSLA-AIFI), which combines multi-branch depth-wise convolution with linear attention to efficiently capture cross-scale features while reducing computational complexity; a Cross-Scale Small Object Feature Optimization module (CSOFO), which enhances the localization of small targets in dense scenes through spatial rearrangement and dynamic modeling; and a Pyramid Sparse Transformer (PST), which replaces traditional dense fusion with a dual-branch sparse attention mechanism to improve both accuracy and real-time performance. Extensive experiments on the M3FD and FLIR datasets demonstrate that MSLCP-DETR achieves an excellent balance between accuracy and efficiency, with its precision, mAP@50, and mAP@50:95 reaching 90.3%, 79.5%, and 86.0%, respectively. Ablation studies and visual analysis further validate the effectiveness of the proposed modules and the overall design strategy. Full article

(This article belongs to the Special Issue Advanced Methods and Applications with Deep Learning in Object Recognition)

18 pages, 3213 KB

Open Access Article

YOLOv7-tiny-CR: A Causal Intervention Framework for Infrared Small Target Detection with Feature Debiasing

by Honglong Wang and Lihui Sun

Appl. Sci. 2025, 15(24), 13008; https://doi.org/10.3390/app152413008 - 10 Dec 2025

Cited by 1 | Viewed by 642

Abstract

The performance of infrared small target detection is often hindered by spurious correlations learned between features and labels. To address this feature bias at its root, this paper proposes a debiased detection framework grounded in causal reasoning. Built upon the YOLOv7-tiny architecture, the framework introduces a three-stage debiasing mechanism. First, a Structural Causal Model (SCM) is adopted to disentangle causal features from non-causal image cues. Second, a Causal Attention Mechanism (CAM) is embedded into the backbone, where a causality-guided feature weighting strategy enhances the model’s focus on semantically critical target characteristics. Finally, a Causal Intervention (CI) module is incorporated into the neck, leveraging backdoor adjustments to suppress spurious causal links induced by contextual confounders. Extensive experiments on the public FLIR_ADASv2 dataset demonstrate notable gains in feature discriminability, with improvements of 2.9% in mAP@50 and 2.7% in mAP@50:95 compared to the baseline. These results verify that the proposed framework effectively mitigates feature bias and enhances generalization capability, outperforming the baseline by a substantial margin. Full article

(This article belongs to the Special Issue Object Detection Technology—2nd Edition)

Search Results (59)
