Search Results (587)

Search Parameters:
Keywords = RGB fusion

22 pages, 3892 KB  
Article
Structure-Aware Progressive Multi-Modal Fusion Network for RGB-T Crack Segmentation
by Zhengrong Yuan, Xin Ding, Xinhong Xia, Yibin He, Hui Fang, Bo Yang and Wei Fu
J. Imaging 2025, 11(11), 384; https://doi.org/10.3390/jimaging11110384 (registering DOI) - 1 Nov 2025
Abstract
Crack segmentation in images plays a pivotal role in the monitoring of structural surfaces, serving as a fundamental technique for assessing structural integrity. However, existing methods that rely solely on RGB images exhibit high sensitivity to light conditions, which significantly restricts their adaptability in complex environmental scenarios. To address this, we propose a structure-aware progressive multi-modal fusion network (SPMFNet) for RGB-thermal (RGB-T) crack segmentation. The main idea is to integrate complementary information from RGB and thermal images and incorporate structural priors (edge information) to achieve accurate segmentation. Here, to better fuse multi-layer features from different modalities, a progressive multi-modal fusion strategy is designed. In the shallow encoder layers, two gate control attention (GCA) modules are introduced to dynamically regulate the fusion process through a gating mechanism, allowing the network to adaptively integrate modality-specific structural details based on the input. In the deeper layers, two attention feature fusion (AFF) modules are employed to enhance semantic consistency by leveraging both local and global attention, thereby facilitating the effective interaction and complementarity of high-level multi-modal features. In addition, edge prior information is introduced to encourage the predicted crack regions to preserve structural integrity, which is constrained by a joint loss of edge-guided loss, multi-scale focal loss, and adaptive fusion loss. Experimental results on publicly available RGB-T crack detection datasets demonstrate that the proposed method outperforms both classical and advanced approaches, verifying the effectiveness of the progressive fusion strategy and the utilization of the structural prior. Full article
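As a rough illustration of the gating idea this abstract describes, the sketch below fuses RGB and thermal feature maps with a sigmoid gate predicted from both modalities. It is a minimal PyTorch sketch under assumed tensor shapes, not the authors' GCA or AFF modules.

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    """Illustrative gated fusion of RGB and thermal feature maps.

    A sigmoid gate, predicted from both modalities, decides per pixel how much
    of each modality to keep (loosely inspired by the gate control attention
    idea in the abstract; not the authors' implementation).
    """
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_rgb: torch.Tensor, feat_thermal: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([feat_rgb, feat_thermal], dim=1))  # (B, C, H, W) in [0, 1]
        return g * feat_rgb + (1.0 - g) * feat_thermal

# Example: fuse two 64-channel feature maps.
fusion = GateFusion(channels=64)
fused = fusion(torch.randn(1, 64, 128, 128), torch.randn(1, 64, 128, 128))
print(fused.shape)  # torch.Size([1, 64, 128, 128])
```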

29 pages, 10715 KB  
Article
LIVEMOS-G: A High Throughput Gantry Monitoring System with Multi-Source Imaging and Environmental Sensing for Large-Scale Commercial Rabbit Farming
by Yutong Han, Tai Wei, Zhaowang Chen, Hongying Wang, Liangju Wang, Congyan Li, Xiuli Mei, Liangde Kuang and Jianjun Gong
Animals 2025, 15(21), 3177; https://doi.org/10.3390/ani15213177 (registering DOI) - 31 Oct 2025
Abstract
The rising global demand for high-quality animal protein has driven the development of advanced technologies in high-density livestock farming. Rabbits, with their rapid growth, high reproductive efficiency, and excellent feed conversion, play an important role in modern animal agriculture. However, large-scale rabbit farming poses challenges in timely health inspection and environmental monitoring. Traditional manual inspections are labor-intensive, error-prone, and inefficient for real-time management. To address these issues, we propose Livestock Environmental Monitoring System–Gantry (LIVEMOS-G), an intelligent gantry-based monitoring system tailored for large-scale rabbit farms. Inspired by plant phenotyping platforms, the system integrates a three-axis motion module with multi-source imaging (RGB, depth, near-infrared, thermal infrared) and an environmental sensing module. It autonomously conducts inspections around the farm, capturing multi-angle, high-resolution images and real-time environmental data without disturbing the rabbits. Key environmental parameters are collected accurately and compared with welfare standards. After training on an original dataset containing 2325 sets of images (each set includes an RGB, NIR, TIR, and depth image), the system is able to detect dead rabbits using a fusion-based object detection model during inspections. LIVEMOS-G offers a scalable, non-intrusive solution for intelligent livestock inspection, contributing to enhanced biosecurity, animal welfare, and data-driven management in high-density, modern rabbit farms. It also shows the potential to be extended to other species, contributing to the sustainable development of the animal farming industry as a whole. Full article
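One plausible way to feed the four image sources into a single fusion-based detector is early channel stacking, sketched below. The normalization and channel order are assumptions for illustration; the paper's actual detection model is not reproduced here.

```python
import numpy as np

def stack_modalities(rgb, nir, tir, depth):
    """Stack co-registered RGB, NIR, TIR and depth frames into one
    multi-channel array for a fusion-based detector.

    Assumed shapes for illustration: rgb (H, W, 3), the other modalities
    (H, W), all frames already resized to a common grid.
    """
    def normalize(x):
        x = x.astype(np.float32)
        return (x - x.min()) / (x.max() - x.min() + 1e-6)

    channels = [normalize(rgb[..., i]) for i in range(3)]
    channels += [normalize(nir), normalize(tir), normalize(depth)]
    return np.stack(channels, axis=-1)  # (H, W, 6)

h, w = 480, 640
fused = stack_modalities(np.random.rand(h, w, 3), np.random.rand(h, w),
                         np.random.rand(h, w), np.random.rand(h, w))
print(fused.shape)  # (480, 640, 6)
```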

16 pages, 579 KB  
Article
IGSMNet: Ingredient-Guided Semantic Modeling Network for Food Nutrition Estimation
by Donglin Zhang, Weixiang Shi, Boyuan Ma, Weiqing Min and Xiao-Jun Wu
Foods 2025, 14(21), 3697; https://doi.org/10.3390/foods14213697 - 30 Oct 2025
Viewed by 296
Abstract
In recent years, food nutrition estimation has received growing attention due to its critical role in dietary analysis and public health. Traditional nutrition assessment methods often rely on manual measurements and expert knowledge, which are time-consuming and not easily scalable. With the advancement of computer vision, RGB-based methods have been proposed, and more recently, RGB-D-based approaches have further improved performance by incorporating depth information to capture spatial cues. While these methods have shown promising results, they still face challenges in complex food scenes, such as limited ability to distinguish visually similar items with different ingredients and insufficient modeling of spatial or semantic relationships. To solve these issues, we propose an Ingredient-Guided Semantic Modeling Network (IGSMNet) for food nutrition estimation. The method introduces an ingredient-guided module that encodes ingredient information using a pre-trained language model and aligns it with visual features via cross-modal attention. At the same time, an internal semantic modeling component is designed to enhance structural understanding through dynamic positional encoding and localized attention, allowing for fine-grained relational reasoning. On the Nutrition5k dataset, our method achieves PMAE values of 12.2% for Calories, 9.4% for Mass, 19.1% for Fat, 18.3% for Carb, and 16.0% for Protein. These results demonstrate that our IGSMNet consistently outperforms existing baselines, validating its effectiveness. Full article
(This article belongs to the Section Food Nutrition)
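The PMAE figures quoted above are percentage mean absolute errors. A common way to compute this metric on Nutrition5k-style data is the MAE normalized by the mean ground-truth value, as in the small sketch below (whether IGSMNet uses exactly this normalization is an assumption).

```python
import numpy as np

def pmae(pred: np.ndarray, target: np.ndarray) -> float:
    """Percentage MAE: mean absolute error normalized by the mean target value.

    This is the usual Nutrition5k-style definition; whether the paper uses
    exactly this normalization is an assumption here.
    """
    return float(np.mean(np.abs(pred - target)) / np.mean(target) * 100.0)

# Toy example with made-up calorie predictions (kcal).
true_kcal = np.array([250.0, 480.0, 130.0, 620.0])
pred_kcal = np.array([270.0, 450.0, 150.0, 590.0])
print(f"PMAE = {pmae(pred_kcal, true_kcal):.1f}%")  # ~6.8%
```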

22 pages, 6682 KB  
Article
Multimodal Fire Salient Object Detection for Unregistered Data in Real-World Scenarios
by Ning Sun, Jianmeng Zhou, Kai Hu, Chen Wei, Zihao Wang and Lipeng Song
Fire 2025, 8(11), 415; https://doi.org/10.3390/fire8110415 - 26 Oct 2025
Viewed by 715
Abstract
In real-world fire scenarios, complex lighting conditions and smoke interference significantly challenge the accuracy and robustness of traditional fire detection systems. Fusion of complementary modalities, such as visible light (RGB) and infrared (IR), is essential to enhance detection robustness. However, spatial shifts and geometric distortions occur in multi-modal image pairs collected by multi-source sensors due to installation deviations and inconsistent intrinsic parameters. Existing multi-modal fire detection frameworks typically depend on pre-registered data and therefore struggle to handle modal misalignment in practical deployment. To overcome this limitation, we propose an end-to-end multi-modal Fire Salient Object Detection framework capable of dynamically fusing cross-modal features without pre-registration. Specifically, the Channel Cross-enhancement Module (CCM) facilitates semantic interaction across modalities in salient regions, suppressing noise from spatial misalignment. The Deformable Alignment Module (DAM) achieves adaptive correction of geometric deviations through cascaded deformation compensation and dynamic offset learning. For validation, we constructed an unregistered indoor fire dataset (Indoor-Fire) covering common fire scenarios. Generalizability was further evaluated on an outdoor dataset (RGB-T Wildfire). To fully validate the effectiveness of the method in complex building fire scenarios, we also conducted experiments on the Fire in Historic Buildings dataset. Experimental results demonstrate that the F1-score reaches 83% on both datasets, with the IoU maintained above 70%. Notably, while maintaining high accuracy, the number of parameters (91.91 M) is only 28.1% of the second-best SACNet (327 M). This method provides a robust solution for unaligned or weakly aligned modal fusion caused by sensor differences and is highly suitable for deployment in intelligent firefighting systems. Full article
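A rough sketch of channel-level cross enhancement between unregistered RGB and IR features: each modality is re-weighted by a channel-attention vector computed from the other, which does not rely on pixel-accurate alignment. Module structure and names are illustrative assumptions, not the paper's CCM.

```python
import torch
import torch.nn as nn

class ChannelCrossEnhance(nn.Module):
    """Illustrative channel-level cross enhancement between RGB and IR features.

    Each modality is re-weighted by a channel attention vector computed from
    the other modality. Channel attention avoids relying on exact pixel
    alignment, which matters for unregistered image pairs.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        def mlp():
            return nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),
            )
        self.attn_from_ir = mlp()   # attention for RGB, driven by IR
        self.attn_from_rgb = mlp()  # attention for IR, driven by RGB

    def forward(self, feat_rgb, feat_ir):
        return feat_rgb * self.attn_from_ir(feat_ir), feat_ir * self.attn_from_rgb(feat_rgb)

m = ChannelCrossEnhance(32)
rgb_out, ir_out = m(torch.randn(2, 32, 64, 64), torch.randn(2, 32, 64, 64))
print(rgb_out.shape, ir_out.shape)
```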

25 pages, 18310 KB  
Article
A Multimodal Fusion Method for Weld Seam Extraction Under Arc Light and Fume Interference
by Lei Cai and Han Zhao
J. Manuf. Mater. Process. 2025, 9(11), 350; https://doi.org/10.3390/jmmp9110350 - 26 Oct 2025
Viewed by 232
Abstract
During the Gas Metal Arc Welding (GMAW) process, intense arc light and dense fumes cause local overexposure in RGB images and data loss in point clouds, which severely compromises the extraction accuracy of circular closed-curve weld seams. To address this challenge, this paper proposes a multimodal fusion method for weld seam extraction under arc light and fume interference. The method begins by constructing a weld seam edge feature extraction (WSEF) module based on a synergistic fusion network, which achieves precise localization of the weld contour by coupling image arc light-removal and semantic segmentation tasks. Subsequently, an image-to-point cloud mapping-guided Local Point Cloud Feature extraction (LPCF) module was designed, incorporating the Shuffle Attention mechanism to enhance robustness against noise and occlusion. Building upon this, a cross-modal attention-driven multimodal feature fusion (MFF) module integrates 2D edge features with 3D structural information to generate a spatially consistent and detail-rich fused point cloud. Finally, a hierarchical trajectory reconstruction and smoothing method is employed to achieve high-precision reconstruction of the closed weld seam path. The experimental results demonstrate that under severe arc light and fume interference, the proposed method achieves a Root Mean Square Error below 0.6 mm, a maximum error not exceeding 1.2 mm, and a processing time under 5 s. Its performance significantly surpasses that of existing methods, showcasing excellent accuracy and robustness. Full article
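The image-to-point-cloud mapping step can be pictured as a standard pinhole projection that carries 2D weld-edge labels onto 3D points, as in the generic sketch below. Intrinsics and point values are toy numbers; the paper's LPCF module does considerably more.

```python
import numpy as np

def project_points_to_image(points_xyz, K):
    """Project 3D points (camera frame) onto the image plane with a pinhole model.

    points_xyz: (N, 3) array in the RGB camera frame; K: 3x3 intrinsic matrix.
    Returns integer pixel coordinates and a mask of points in front of the camera.
    This is the generic mapping one would use to transfer 2D weld-edge labels
    onto a point cloud.
    """
    z = points_xyz[:, 2]
    valid = z > 1e-6
    uvw = (K @ points_xyz.T).T          # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3]       # perspective divide
    return np.round(uv).astype(int), valid

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
pts = np.array([[0.05, -0.02, 0.5], [0.10, 0.03, 0.6]])
uv, valid = project_points_to_image(pts, K)
print(uv, valid)   # [[380 216] [420 270]] [ True  True]
```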

14 pages, 7476 KB  
Article
Development of 3D-Stacked 1Megapixel Dual-Time-Gated SPAD Image Sensor with Simultaneous Dual Image Output Architecture for Efficient Sensor Fusion
by Kazuma Chida, Kazuhiro Morimoto, Naoki Isoda, Hiroshi Sekine, Tomoya Sasago, Yu Maehashi, Satoru Mikajiri, Kenzo Tojima, Mahito Shinohara, Ayman T. Abdelghafar, Hiroyuki Tsuchiya, Kazuma Inoue, Satoshi Omodani, Alice Ehara, Junji Iwata, Tetsuya Itano, Yasushi Matsuno, Katsuhito Sakurai and Takeshi Ichikawa
Sensors 2025, 25(21), 6563; https://doi.org/10.3390/s25216563 - 24 Oct 2025
Viewed by 351
Abstract
Sensor fusion is crucial in numerous imaging and sensing applications. Integrating data from multiple sensors with different field-of-view, resolution, and frame timing poses substantial computational overhead. Time-gated single-photon avalanche diode (SPAD) image sensors have been developed to support multiple sensing modalities and mitigate this issue, but mismatched frame timing remains a challenge. Dual-time-gated SPAD image sensors, which can capture dual images simultaneously, have also been developed. However, the reported sensors suffered from medium-to-large pixel pitch, limited resolution, and inability to independently control the exposure time of the dual images, which restricts their applicability. In this paper, we introduce a 5 µm-pitch, 3D-backside-illuminated (BSI) 1Megapixel dual-time-gated SPAD image sensor enabling a simultaneous output of dual images. The developed SPAD image sensor is verified to operate as an RGB-Depth (RGB-D) sensor without complex image alignment. In addition, a novel high dynamic range (HDR) technique, utilizing pileup effect with two parallel in-pixel memories, is validated for dynamic range extension in 2D imaging, achieving a dynamic range of 119.5 dB. The proposed architecture provides dual image output with the same field-of-view, resolution, and frame timing, and is promising for efficient sensor fusion. Full article
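For orientation, imaging dynamic range is conventionally quoted as 20·log10 of the max-to-min detectable signal ratio, so the reported 119.5 dB corresponds to a ratio of roughly 9.4 × 10^5. The short computation below just unpacks that figure; it is an interpretation of the number, not a statement from the paper.

```python
import math

# Dynamic range in imaging is usually quoted as 20*log10(max/min detectable signal).
ratio = 10 ** (119.5 / 20)
print(f"{ratio:.3e}")  # ~9.441e+05

def dynamic_range_db(max_signal: float, min_signal: float) -> float:
    return 20.0 * math.log10(max_signal / min_signal)

print(f"{dynamic_range_db(9.44e5, 1.0):.1f} dB")  # ~119.5 dB
```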

37 pages, 14970 KB  
Article
Research on Strawberry Visual Recognition and 3D Localization Based on Lightweight RAFS-YOLO and RGB-D Camera
by Kaixuan Li, Xinyuan Wei, Qiang Wang and Wuping Zhang
Agriculture 2025, 15(21), 2212; https://doi.org/10.3390/agriculture15212212 - 24 Oct 2025
Viewed by 386
Abstract
Improving the accuracy and real-time performance of strawberry recognition and localization algorithms remains a major challenge in intelligent harvesting. To address this, this study presents an integrated approach for strawberry maturity detection and 3D localization that combines a lightweight deep learning model with an RGB-D camera. Built upon the YOLOv11 framework, an enhanced RAFS-YOLO model is developed, incorporating three core modules to strengthen multi-scale feature fusion and spatial modeling capabilities. Specifically, the CRA module enhances spatial relationship perception through cross-layer attention, the HSFPN module performs hierarchical semantic filtering to suppress redundant features, and the DySample module dynamically optimizes the upsampling process to improve computational efficiency. By integrating the trained model with RGB-D depth data, the method achieves precise 3D localization of strawberries through coordinate mapping based on detection box centers. Experimental results indicate that RAFS-YOLO surpasses YOLOv11n, improving precision, recall, and mAP@50 by 4.2%, 3.8%, and 2.0%, respectively, while reducing parameters by 36.8% and computational cost by 23.8%. The 3D localization attains millimeter-level precision, with average RMSE values ranging from 0.21 to 0.31 cm across all axes. Overall, the proposed approach achieves a balance between detection accuracy, model efficiency, and localization precision, providing a reliable perception framework for intelligent strawberry-picking robots. Full article
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)
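The "3D localization through coordinate mapping based on detection box centers" amounts to back-projecting the box-centre pixel with its aligned depth value through the pinhole model. A generic sketch with made-up intrinsics follows; it is not the authors' exact pipeline.

```python
import numpy as np

def localize_from_box(box_xyxy, depth_map, K):
    """Back-project the centre of a detection box to 3D camera coordinates.

    box_xyxy: (x1, y1, x2, y2) in pixels; depth_map: (H, W) metric depth from
    the RGB-D camera aligned to the colour image; K: 3x3 intrinsic matrix.
    """
    x1, y1, x2, y2 = box_xyxy
    u, v = int((x1 + x2) / 2), int((y1 + y2) / 2)
    z = float(depth_map[v, u])                      # depth at the box centre (m)
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

K = np.array([[610.0, 0.0, 320.0], [0.0, 610.0, 240.0], [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 0.35)                    # 35 cm everywhere (toy data)
print(localize_from_box((300, 200, 340, 260), depth, K))
```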

25 pages, 8387 KB  
Article
HFF-Net: An Efficient Hierarchical Feature Fusion Network for High-Quality Depth Completion
by Yi Han, Mao Tian, Qiaosheng Li and Wuyang Shan
ISPRS Int. J. Geo-Inf. 2025, 14(11), 412; https://doi.org/10.3390/ijgi14110412 - 23 Oct 2025
Viewed by 335
Abstract
Depth completion aims to achieve high-quality dense depth prediction from a pair of synchronized sparse depth map and RGB image, and it plays an important role in many intelligent applications, including urban mapping, scene understanding, autonomous driving, and augmented reality. Although the existing convolutional neural network (CNN)-based deep learning architectures have obtained state-of-the-art depth completion results, depth ambiguities in large areas with extremely sparse depth measurements remain a challenge. To address this problem, an efficient hierarchical feature fusion network (HFF-Net) is proposed for producing complete and accurate depth completion results. The key components of HFF-Net are the hierarchical depth completion architecture for predicting a robust initial depth map, and the multi-level spatial propagation network (MLSPN) for progressively refining the predicted initial depth map in a coarse-to-fine manner to generate a high-quality depth completion result. Firstly, the hierarchical feature extraction subnetwork is adopted to extract multi-scale feature maps. Secondly, the hierarchical depth completion architecture that incorporates a hierarchical feature fusion module and a progressive depth rectification module is utilized to generate an accurate and reliable initial depth map. Finally, the MLSPN-based depth map refinement subnetwork is adopted, which progressively refines the initial depth map utilizing multi-level affinity weights to achieve a state-of-the-art depth completion result. Extensive experiments were undertaken on two widely used public datasets, i.e., the KITTI depth completion and NYUv2 datasets, to validate the performance of HFF-Net. The comprehensive experimental results indicate that HFF-Net produces robust depth completion results on both datasets. Full article
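The MLSPN refinement stage builds on spatial propagation: a coarse depth map is iteratively replaced by affinity-weighted mixtures of neighbouring depths. The sketch below shows only a basic single-level recurrence with assumed tensor shapes; the paper's multi-level affinities and architecture are not reproduced.

```python
import torch
import torch.nn.functional as F

def propagate_depth(depth, affinity, iterations: int = 3):
    """One illustrative spatial-propagation loop (CSPN-style).

    depth:    (B, 1, H, W) coarse depth prediction.
    affinity: (B, 8, H, W) learned weights for the 8 neighbours of each pixel.
    Each step mixes every pixel with its neighbours according to the affinities.
    """
    # Normalize so neighbour weights sum to 0.9; the remainder keeps the centre value.
    affinity = F.softmax(affinity, dim=1) * 0.9
    center_weight = 1.0 - affinity.sum(dim=1, keepdim=True)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    for _ in range(iterations):
        padded = F.pad(depth, (1, 1, 1, 1), mode="replicate")
        neigh = torch.cat(
            [padded[:, :, 1 + dy: 1 + dy + depth.shape[2], 1 + dx: 1 + dx + depth.shape[3]]
             for dy, dx in offsets], dim=1)                      # (B, 8, H, W)
        depth = center_weight * depth + (affinity * neigh).sum(dim=1, keepdim=True)
    return depth

refined = propagate_depth(torch.rand(1, 1, 64, 64), torch.randn(1, 8, 64, 64))
print(refined.shape)  # torch.Size([1, 1, 64, 64])
```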

25 pages, 6045 KB  
Article
Energy-Aware Sensor Fusion Architecture for Autonomous Channel Robot Navigation in Constrained Environments
by Mohamed Shili, Hicham Chaoui and Khaled Nouri
Sensors 2025, 25(21), 6524; https://doi.org/10.3390/s25216524 - 23 Oct 2025
Cited by 1 | Viewed by 492
Abstract
Navigating autonomous robots in confined channels is inherently challenging due to limited space, dynamic obstacles, and energy constraints. Existing sensor fusion strategies often consume excessive power because all sensors remain active regardless of environmental conditions. This paper presents an energy-aware adaptive sensor fusion framework for channel robots that deploys RGB cameras, laser range finders, and IMU sensors according to environmental complexity. Sensor data are fused using an adaptive Extended Kalman Filter (EKF), which selectively integrates multi-sensor information to maintain high navigation accuracy while minimizing energy consumption. An energy management module dynamically adjusts sensor activation and computational load, enabling significant reductions in power consumption while preserving navigation reliability. The proposed system is implemented on a low-power microcontroller and evaluated through simulations and prototype testing in constrained channel environments. Results show a 35% reduction in energy consumption with minimal impact on navigation performance, demonstrating the framework’s effectiveness for long-duration autonomous operations in pipelines, sewers, and industrial ducts. Full article
(This article belongs to the Section Sensors and Robotics)
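A toy version of the adaptive idea: a standard Kalman measurement update applied only to the sensors that a complexity-dependent policy keeps awake. The state, sensor set, thresholds, and noise values below are illustrative assumptions, not the paper's models.

```python
import numpy as np

def ekf_update(x, P, z, H, R):
    """Standard (extended) Kalman filter measurement update for a linear(ized) H."""
    y = z - H @ x                                  # innovation
    S = H @ P @ H.T + R                            # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

def select_sensors(complexity: float) -> list:
    """Toy activation policy: keep only the IMU in simple stretches, add the
    laser scanner when obstacles appear, and wake the camera only in cluttered
    sections. Thresholds and sensor set are illustrative assumptions."""
    active = ["imu"]
    if complexity > 0.3:
        active.append("laser")
    if complexity > 0.7:
        active.append("camera")
    return active

# 2D position state, identity measurement models for brevity.
x, P = np.zeros(2), np.eye(2)
measurements = {"imu": np.array([0.1, 0.0]),
                "laser": np.array([0.12, -0.01]),
                "camera": np.array([0.11, 0.02])}
for name in select_sensors(complexity=0.5):       # only imu + laser are fused
    x, P = ekf_update(x, P, measurements[name], H=np.eye(2), R=0.05 * np.eye(2))
print(x)
```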

22 pages, 6012 KB  
Article
Assessment of Individual Tree Crown Detection Based on Dual-Seasonal RGB Images Captured from an Unmanned Aerial Vehicle
by Shichao Yu, Kunpeng Cui, Kai Xia, Yixiang Wang, Haolin Liu and Susu Deng
Forests 2025, 16(10), 1614; https://doi.org/10.3390/f16101614 - 21 Oct 2025
Viewed by 211
Abstract
Unmanned aerial vehicle (UAV)-captured RGB imagery, with high spatial resolution and ease of acquisition, is increasingly applied to individual tree crown detection (ITCD). However, ITCD in dense subtropical forests remains challenging due to overlapping crowns, variable crown size, and similar spectral responses between neighbouring crowns. This paper investigates to what extent the ITCD accuracy can be improved by using dual-seasonal UAV-captured RGB imagery in different subtropical forest types: urban broadleaved, planted coniferous, and mixed coniferous–broadleaved forests. A modified YOLOv8 model was employed to fuse the features extracted from dual-seasonal images and perform the ITCD task. Results show that dual-seasonal imagery consistently outperformed single-seasonal datasets, with the greatest improvement in mixed forests, where the F1 score range increased from 56.3%–60.7% (single-seasonal datasets) to 69.1%–74.5% (dual-seasonal datasets) and the AP value range increased from 57.2%–61.5% to 70.1%–72.8%. Furthermore, performance fluctuations were smaller for dual-seasonal datasets than for single-seasonal datasets. Finally, our experiments demonstrate that the modified YOLOv8 model, which fuses features extracted from dual-seasonal images within a dual-branch module, outperformed both the original YOLOv8 model with channel-wise stacked dual-seasonal inputs and the Faster R-CNN model with a dual-branch module. The experimental results confirm the advantages of using dual-seasonal imagery for ITCD, as well as the critical role of model feature extraction and fusion strategies in enhancing ITCD accuracy. Full article
(This article belongs to the Section Forest Inventory, Modeling and Remote Sensing)

25 pages, 8305 KB  
Article
SAHI-Tuned YOLOv5 for UAV Detection of TM-62 Anti-Tank Landmines: Small-Object, Occlusion-Robust, Real-Time Pipeline
by Dejan Dodić, Vuk Vujović, Srđan Jovković, Nikola Milutinović and Mitko Trpkoski
Computers 2025, 14(10), 448; https://doi.org/10.3390/computers14100448 - 21 Oct 2025
Viewed by 324
Abstract
Anti-tank landmines endanger post-conflict recovery. Detecting camouflaged TM-62 landmines in low-altitude unmanned aerial vehicle (UAV) imagery is challenging because targets occupy few pixels and are low-contrast and often occluded. We introduce a single-class anti-tank dataset and a YOLOv5 pipeline augmented with a SAHI-based small-object stage and Weighted Boxes Fusion. The evaluation combines COCO metrics with an operational operating point (score = 0.25; IoU = 0.50) and stratifies by object size and occlusion. On a held-out test partition representative of UAV acquisition, the baseline YOLOv5 attains mAP@0.50:0.95 = 0.553 and AP@0.50 = 0.851. With tuned SAHI (768 px tiles, 40% overlap) plus fusion, performance rises to mAP@0.50:0.95 = 0.685 and AP@0.50 = 0.935—ΔmAP = +0.132 (+23.9% rel.) and ΔAP@0.50 = +0.084 (+9.9% rel.). At the operating point, precision = 0.94 and recall = 0.89 (F1 = 0.914), implying a 58.4% reduction in missed detections versus a non-optimized SAHI baseline and a +14.3 AP@0.50 gain on the small/occluded subset. Ablations attribute gains to tile size, overlap, and fusion, which boost recall on low-pixel, occluded landmines without inflating false positives. The pipeline sustains real-time UAV throughput and supports actionable triage for humanitarian demining, as well as motivating RGB–thermal fusion and cross-season/-domain adaptation. Full article
(This article belongs to the Special Issue Advanced Image Processing and Computer Vision (2nd Edition))
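The "768 px tiles, 40% overlap" setting corresponds to slicing windows like those generated below before per-tile inference; tile detections are then mapped back to full-frame coordinates and merged (the paper uses Weighted Boxes Fusion for that step). This is a generic tiling sketch, not SAHI's internal code.

```python
def tile_windows(img_w: int, img_h: int, tile: int = 768, overlap: float = 0.40):
    """Generate (x1, y1, x2, y2) slicing windows with a fixed overlap ratio."""
    stride = max(1, int(tile * (1.0 - overlap)))
    xs = list(range(0, max(img_w - tile, 0) + 1, stride))
    ys = list(range(0, max(img_h - tile, 0) + 1, stride))
    # Make sure the right/bottom borders are covered.
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)
    return [(x, y, min(x + tile, img_w), min(y + tile, img_h)) for y in ys for x in xs]

windows = tile_windows(4000, 3000)          # e.g. a 12 MP UAV frame
print(len(windows), windows[0], windows[-1])
```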

24 pages, 4921 KB  
Article
YOLOv11-DCFNet: A Robust Dual-Modal Fusion Method for Infrared and Visible Road Crack Detection in Weak- or No-Light Illumination Environments
by Xinbao Chen, Yaohui Zhang, Junqi Lei, Lelin Li, Lifang Liu and Dongshui Zhang
Remote Sens. 2025, 17(20), 3488; https://doi.org/10.3390/rs17203488 - 20 Oct 2025
Viewed by 336
Abstract
Road cracks represent a significant challenge that impacts the long-term performance and safety of transportation infrastructure. Early identification of these cracks is crucial for effective road maintenance management. However, traditional crack recognition methods that rely on visible light images often experience substantial performance degradation in weak-light environments, such as at night or within tunnels. This degradation is characterized by blurred or deficient image textures, indistinct target edges, and reduced detection accuracy, which hinders the ability to achieve reliable all-weather target detection. To address these challenges, this study introduces a dual-modal crack detection method named YOLOv11-DCFNet. This method is based on an enhanced YOLOv11 architecture and incorporates a Cross-Modality Fusion Transformer (CFT) module. It establishes a dual-branch feature extraction structure that utilizes both infrared and visible light within the original YOLOv11 framework, effectively leveraging the high contrast capabilities of thermal infrared images to detect cracks under weak- or no-light conditions. The experimental results demonstrate that the proposed YOLOv11-DCFNet method significantly outperforms the single-modal model (YOLOv11-RGB) in both weak-light and no-light scenarios. Under weak-light conditions, the fusion model effectively utilizes the weak texture features of RGB images alongside the thermal radiation information from infrared (IR) images. This leads to an improvement in Precision from 83.8% to 95.3%, Recall from 81.5% to 90.5%, mAP@0.5 from 84.9% to 92.9%, and mAP@0.5:0.95 from 41.7% to 56.3%, thereby enhancing both detection accuracy and quality. In no-light conditions, the RGB single modality performs poorly due to the absence of visible light information, with an mAP@0.5 of only 67.5%. However, by incorporating IR thermal radiation features, the fusion model enhances Precision, Recall, and mAP@0.5 to 95.3%, 90.5%, and 92.9%, respectively, maintaining high detection accuracy and stability even in extreme no-light environments. The results of this study indicate that YOLOv11-DCFNet exhibits strong robustness and generalization ability across various low illumination conditions, providing effective technical support for night-time road maintenance and crack monitoring systems. Full article
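A sketch of transformer-style cross-modality fusion in the spirit of a CFT module: RGB feature tokens attend to IR tokens and vice versa, and the attended features are added back as residuals. This mirrors the general idea only at a high level; layer counts, normalization, and token construction in the paper differ.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Illustrative cross-attention between RGB and IR feature tokens."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.rgb_from_ir = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ir_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb_tokens, ir_tokens):
        # rgb_tokens, ir_tokens: (B, N, dim) flattened feature-map tokens
        rgb_out, _ = self.rgb_from_ir(rgb_tokens, ir_tokens, ir_tokens)
        ir_out, _ = self.ir_from_rgb(ir_tokens, rgb_tokens, rgb_tokens)
        return rgb_tokens + rgb_out, ir_tokens + ir_out

fuse = CrossModalAttention(dim=256)
rgb, ir = torch.randn(1, 400, 256), torch.randn(1, 400, 256)
rgb_f, ir_f = fuse(rgb, ir)
print(rgb_f.shape, ir_f.shape)
```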

20 pages, 5086 KB  
Article
A Multi-Modal Attention Fusion Framework for Road Connectivity Enhancement in Remote Sensing Imagery
by Yongqi Yuan, Yong Cheng, Bo Pan, Ge Jin, De Yu, Mengjie Ye and Qian Zhang
Mathematics 2025, 13(20), 3266; https://doi.org/10.3390/math13203266 - 13 Oct 2025
Viewed by 403
Abstract
Ensuring the structural continuity and completeness of road networks in high-resolution remote sensing imagery remains a major challenge for current deep learning methods, especially under conditions of occlusion caused by vegetation, buildings, or shadows. To address this, we propose a novel post-processing enhancement framework that improves the connectivity and accuracy of initial road extraction results produced by any segmentation model. The method employs a dual-stream encoder architecture, which jointly processes RGB images and preliminary road masks to obtain complementary spatial and semantic information. A core component is the MAF (Multi-Modal Attention Fusion) module, designed to capture fine-grained, long-range, and cross-scale dependencies between image and mask features. This fusion leads to the restoration of fragmented road segments, the suppression of noise, and overall improvement in road completeness. Experiments on benchmark datasets (DeepGlobe and Massachusetts) demonstrate substantial gains in precision, recall, F1-score, and mIoU, confirming the framework’s effectiveness and generalization ability in real-world scenarios. Full article
(This article belongs to the Special Issue Mathematical Methods for Machine Learning and Computer Vision)

19 pages, 762 KB  
Article
TMRGBT-D2D: A Temporal Misaligned RGB-Thermal Dataset for Drone-to-Drone Target Detection
by Hexiang Hao, Yueping Peng, Zecong Ye, Baixuan Han, Wei Tang, Wenchao Kang, Xuekai Zhang, Qilong Li and Wenchao Liu
Drones 2025, 9(10), 694; https://doi.org/10.3390/drones9100694 - 10 Oct 2025
Viewed by 554
Abstract
In the field of drone-to-drone detection tasks, the issue of fusing temporal information with infrared and visible light data for detection has been rarely studied. This paper presents the first temporal misaligned rgb-thermal dataset for drone-to-drone target detection, named TMRGBT-D2D. The dataset covers various lighting conditions (i.e., high-light scenes captured during the day, medium-light and low-light scenes captured at night, with night scenes accounting for 38.8% of all data), different scenes (sky, forests, buildings, construction sites, playgrounds, roads, etc.), different seasons, and different locations, consisting of a total of 42,624 images organized into sequential frames extracted from 19 RGB-T video pairs. Each frame in the dataset has been meticulously annotated, with a total of 94,323 annotations. Except for drones that cannot be identified under extreme conditions, infrared and visible light annotations are one-to-one corresponding. This dataset presents various challenges, including small object detection (the average size of objects in visible light images is approximately 0.02% of the image area), motion blur caused by fast movement, and detection issues arising from imaging differences between different modalities. To our knowledge, this is the first temporal misaligned rgb-thermal dataset for drone-to-drone target detection, providing convenience for research into rgb-thermal image fusion and the development of drone target detection. Full article
(This article belongs to the Special Issue Detection, Identification and Tracking of UAVs and Drones)

18 pages, 5377 KB  
Article
M3ENet: A Multi-Modal Fusion Network for Efficient Micro-Expression Recognition
by Ke Zhao, Xuanyu Liu and Guangqian Yang
Sensors 2025, 25(20), 6276; https://doi.org/10.3390/s25206276 - 10 Oct 2025
Viewed by 481
Abstract
Micro-expression recognition (MER) aims to detect brief and subtle facial movements that reveal suppressed emotions, discerning authentic emotional responses in scenarios such as visitor experience analysis in museum settings. However, it remains a highly challenging task due to the fleeting duration, low intensity, and limited availability of annotated data. Most existing approaches rely solely on either appearance or motion cues, thereby restricting their ability to capture expressive information fully. To overcome these limitations, we propose a lightweight multi-modal fusion network, termed M3ENet, which integrates both motion and appearance cues through early-stage feature fusion. Specifically, our model extracts horizontal, vertical, and strain-based optical flow between the onset and apex frames, alongside RGB images from the onset, apex, and offset frames. These inputs are processed by two modality-specific subnetworks, whose features are fused to exploit complementary information for robust classification. To improve generalization in low data regimes, we employ targeted data augmentation and adopt focal loss to mitigate class imbalance. Extensive experiments on five benchmark datasets, including CASME I, CASME II, CAS(ME)2, SAMM, and MMEW, demonstrate that M3ENet achieves state-of-the-art performance with high efficiency. Ablation studies and Grad-CAM visualizations further confirm the effectiveness and interpretability of the proposed architecture. Full article
(This article belongs to the Special Issue AI-Based Computer Vision Sensors & Systems—2nd Edition)
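Focal loss, mentioned above for handling class imbalance, down-weights easy examples by a factor (1 − p_t)^γ. A minimal multi-class version is sketched below with the commonly used default hyper-parameters, which are not necessarily the values chosen for M3ENet.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma: float = 2.0, alpha: float = 0.25):
    """Multi-class focal loss: cross-entropy scaled by (1 - p_t)^gamma so that
    easy, well-classified samples contribute less and rare classes are not
    drowned out."""
    ce = F.cross_entropy(logits, targets, reduction="none")   # per-sample -log p_t
    p_t = torch.exp(-ce)
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

logits = torch.randn(8, 5)            # batch of 8, five emotion classes
targets = torch.randint(0, 5, (8,))
print(focal_loss(logits, targets).item())
```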
