Search Results (7,952)

Search Parameters:
Keywords = image fusion

25 pages, 8205 KB  
Article
Forest Road Extraction via Optimized DeepLabv3+ and Multi-Temporal Remote Sensing for Wildfire Emergency Response
by Zhuoran Gao, Ziyang Li, Weiyuan Yao, Tingtao Zhang, Shi Qiu and Zhaoyan Liu
Appl. Sci. 2026, 16(7), 3228; https://doi.org/10.3390/app16073228 (registering DOI) - 26 Mar 2026
Abstract
Forest fires occur frequently in China; however, the complex terrain and incomplete road networks severely constrain ground rescue efficiency. Accurate forest road information is essential for optimizing emergency response and deploying rescue forces. Existing road extraction algorithms are primarily designed for urban environments and exhibit limited efficacy in forest scenarios due to dense canopy, complex background interference, and the specific characteristics of forest roads. To address this gap, this study proposes a forest road extraction method based on an enhanced DeepLabv3+ model using multi-temporal, high-resolution satellite imagery. Specifically, a Multi-Scale Channel Attention (MCSA) mechanism is embedded in the skip connections to suppress background interference, while strip pooling is integrated into the Atrous Spatial Pyramid Pooling (ASPP) module to better capture slender road features. A composite Focal-Dice loss function is also constructed to mitigate sample imbalance. Finally, by applying the model to multi-temporal remote sensing images, a fusion strategy is introduced to integrate multi-seasonal road masks, enhancing overall accuracy and topological integrity. Experimental results show that the proposed method achieves a precision of 54.1%, an F1-Score of 59.3%, and an IoU of 41.8%, effectively enhancing road continuity and providing robust technical support for fire-rescue decision-making. Full article
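The composite Focal-Dice loss mentioned in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration of the general technique for binary road/background segmentation; the blending weight and focal hyperparameters below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def focal_dice_loss(probs, targets, alpha=0.25, gamma=2.0, w_focal=0.5, eps=1e-7):
    """Composite Focal-Dice loss for binary segmentation.

    probs:   predicted foreground probabilities in [0, 1]
    targets: binary ground-truth mask of the same shape
    Hyperparameters (alpha, gamma, w_focal) are illustrative defaults.
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    # Focal term: down-weights easy pixels via the (1 - p_t)^gamma factor.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    focal = -(alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)).mean()
    # Dice term: directly optimizes region overlap, robust to the heavy
    # class imbalance of thin roads against large backgrounds.
    inter = (probs * targets).sum()
    dice = 1.0 - (2.0 * inter + eps) / (probs.sum() + targets.sum() + eps)
    return w_focal * focal + (1.0 - w_focal) * dice
```

A near-perfect prediction yields a loss close to zero, while a poor one is penalized by both terms.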

24 pages, 19222 KB  
Article
LID-YOLO: A Lightweight Network for Insulator Defect Detection in Complex Weather Scenarios
by Yangyang Cao, Shuo Jin and Yang Liu
Energies 2026, 19(7), 1640; https://doi.org/10.3390/en19071640 - 26 Mar 2026
Abstract
Ensuring the structural reliability of power transmission networks is a fundamental prerequisite for the stable operation of modern energy systems. To address the challenges posed by complex weather interference and the small scale of insulator defects during power line inspections, this paper proposes LID-YOLO, a lightweight insulator defect detection network. First, to mitigate image feature degradation caused by weather interference, we design the C3k2-CDGC module. By leveraging the input-adaptive characteristics of dynamic convolution and the spatial preservation properties of coordinate attention, this module enhances feature extraction capabilities and robustness in complex weather scenarios. Second, to address the detection challenges arising from the significant scale disparity between insulators and defects, we propose Detect-LSEAM, a detection head featuring an asymmetric decoupled architecture. This design facilitates multi-scale feature fusion while minimizing computational redundancy. Subsequently, we develop the NWD-MPDIoU hybrid loss function to balance the weights between distribution metrics and geometric constraints dynamically. This effectively mitigates gradient instability arising from boundary ambiguity and the minute size of insulator defects. Finally, we construct a synthetic multi-weather condition insulator defect dataset for training and validation. Compared to the baseline, LID-YOLO improves precision, recall, and mAP@0.5 by 1.7%, 3.6%, and 4.2%, respectively. With only 2.76 M parameters and 6.2 G FLOPs, it effectively maintains the lightweight advantage of the baseline, achieving an optimal balance between detection accuracy and computational efficiency for insulator inspections under complex weather conditions. This lightweight and robust framework provides a reliable algorithmic foundation for automated grid monitoring, supporting the continuous and resilient operation of modern energy systems. Full article
(This article belongs to the Section F: Electrical Engineering)
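The motivation for the NWD half of the hybrid loss above is that plain IoU collapses to zero for tiny, slightly misaligned boxes. A sketch of the Normalized Wasserstein Distance between two boxes modeled as 2-D Gaussians, next to standard IoU for comparison; the scale constant `c` is a dataset-dependent placeholder, and the paper's MPDIoU term is omitted here.

```python
import math

def iou(box_a, box_b):
    """Standard IoU for (cx, cy, w, h) boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def wasserstein_nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance between (cx, cy, w, h) boxes.

    Each box is treated as a 2-D Gaussian; the squared 2-Wasserstein
    distance between them has the closed form below. `c` normalizes the
    distance to (0, 1] and is a placeholder value here.
    """
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)
```

For two adjacent 4-pixel boxes the IoU is exactly zero (no gradient signal), while the NWD still reports high similarity and degrades smoothly with distance.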

31 pages, 9441 KB  
Article
Quantitative Microstructure Characterization in Additively Manufactured Nickel Alloy 625 Using Image Segmentation and Deep Learning
by Tuğrul Özel, Sijie Ding, Amit Ramasubramanian, Franco Pieri and Doruk Eskicorapci
Machines 2026, 14(4), 366; https://doi.org/10.3390/machines14040366 - 26 Mar 2026
Abstract
Laser Powder Bed Fusion for metals (PBF-LB/M) is a complex additive manufacturing process in which metal powder is selectively melted layer-by-layer to fabricate 3D parts. Process parameters critically influence the resulting microstructure in nickel alloys, with features such as melt pool marks, grain size and orientation, porosity, and cracks serving as key process signatures. These features are typically analyzed post-process to identify suboptimal conditions. This research aims to develop automated post-process measurement and analysis techniques using image processing, pattern recognition, and statistical learning to correlate process parameters with part quality. Optical microscopy images of build surfaces are analyzed using machine learning algorithms to evaluate porosity, grain size, and relative density in fabricated test coupons. Effect plots are generated to identify trends related to increasing energy density. A novel deep learning approach based on Mask R-CNN is used to detect and segment melt pool regions in optical microscopy images. From the segmented regions, melt pool dimensions—such as width, depth, and area—are extracted using bounding geometry coordinates. Manually labeled images (Type I and Type II) are used to train the model. A comparison between ResNet-50 and ResNet-101 backbones shows that the ResNet-50-based model (Model 2) achieves superior performance, with lower training loss (0.1781 vs. 0.1907) and validation loss (8.6140 vs. 9.4228). Quantitative evaluation using the Jaccard index, precision, and recall metrics shows that the ResNet-101 backbone outperforms ResNet-50, achieving about 4% higher mean Intersection-over-Union, with values of 0.85 for Type I and 0.82 for Type II melt pools, where Type I is detected more accurately due to its more regular morphology and clearer boundaries. 
By extending Faster R-CNNs with a mask prediction branch, the method allows for precise melt pool measurements, providing valuable insights into process quality and dimensional accuracy, and aiding in the detection of defects in PBF-LB-fabricated parts. Full article
(This article belongs to the Special Issue Artificial Intelligence in Mechanical Engineering Applications)
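The measurement step described above — extracting melt pool width, depth, and area from a segmented region's bounding geometry — reduces to simple mask arithmetic once the instance mask is available. A sketch, assuming a binary mask from the segmentation model; the calibration factor is a hypothetical stand-in for the microscope's pixel scale.

```python
import numpy as np

def melt_pool_dimensions(mask, microns_per_pixel=1.0):
    """Measure a segmented melt-pool mask (binary 2-D array).

    Width and depth come from the bounding geometry of the nonzero
    pixels; area from the pixel count. `microns_per_pixel` is an
    illustrative calibration factor, not a value from the paper.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # empty mask: nothing detected
    width = (xs.max() - xs.min() + 1) * microns_per_pixel
    depth = (ys.max() - ys.min() + 1) * microns_per_pixel
    area = mask.sum() * microns_per_pixel ** 2
    return {"width": float(width), "depth": float(depth), "area": float(area)}
```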

33 pages, 24295 KB  
Article
HDCGAN+: A Low-Illumination UAV Remote Sensing Image Enhancement and Evaluation Method Based on WPID
by Kelly Chen Ke, Min Sun, Xinyi Wang, Dong Liu and Hanjun Yang
Remote Sens. 2026, 18(7), 999; https://doi.org/10.3390/rs18070999 (registering DOI) - 26 Mar 2026
Abstract
Remote sensing images acquired by UAVs under nighttime or low-illumination conditions suffer from insufficient illumination, leading to degraded image quality, detail loss, and noise, which restrict their application in public security and disaster emergency scenarios. Although existing machine learning-based enhancement methods can recover part of the missing information, they often cause color distortion and texture inconsistency. This study proposes an improved low-illumination image enhancement method based on a Weakly Paired Image Dataset (WPID), combining the Hierarchical Deep Convolutional Generative Adversarial Network (HDCGAN) with a low-rank image fusion strategy to enhance the quality of low-illumination UAV remote sensing images. First, YCbCr color channel separation is applied to preserve color information from visible images. Then, a Low-Rank Representation Fusion Network (LRRNet) is employed to perform structure-aware fusion between thermal infrared (TIR) and visible images, thereby enabling effective preservation of structural details and realistic color appearance. Furthermore, a weakly paired training mechanism is incorporated into HDCGAN to enhance detail restoration and structural fidelity. To achieve objective evaluation, a structural consistency assessment framework is constructed based on semantic segmentation results from the Segment Anything Model (SAM). Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches in both visual quality and application-oriented evaluation metrics. Full article
(This article belongs to the Section Remote Sensing Image Processing)
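The YCbCr separation step above rests on a simple idea: enhance luminance while leaving chrominance untouched, so color is preserved. A sketch using the BT.601 analog coefficients (the paper's exact conversion and the LRRNet fusion that produces the enhanced luma are not reproduced here).

```python
import numpy as np

KR, KB = 0.299, 0.114          # BT.601 luma coefficients
KG = 1.0 - KR - KB

def rgb_to_ycbcr(rgb):
    """RGB in [0, 1] -> (Y, Cb, Cr) with chroma centered at 0."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = KR * r + KG * g + KB * b
    cb = (b - y) / (2.0 * (1.0 - KB))
    cr = (r - y) / (2.0 * (1.0 - KR))
    return np.stack([y, cb, cr], axis=-1)

def ycbcr_to_rgb(ycc):
    """Exact inverse of rgb_to_ycbcr."""
    y, cb, cr = ycc[..., 0], ycc[..., 1], ycc[..., 2]
    r = y + 2.0 * (1.0 - KR) * cr
    b = y + 2.0 * (1.0 - KB) * cb
    g = (y - KR * r - KB * b) / KG
    return np.stack([r, g, b], axis=-1)

def enhance_preserving_color(rgb, enhanced_luma):
    """Swap in an enhanced luminance channel, keeping Cb/Cr chroma."""
    ycc = rgb_to_ycbcr(rgb)
    ycc[..., 0] = enhanced_luma
    return ycbcr_to_rgb(ycc)
```

The round trip is algebraically exact, and replacing Y leaves the chroma planes bit-identical, which is what keeps the color appearance realistic.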
35 pages, 4321 KB  
Article
Syncretic Grad-CAM Integrated ViT-CNN Hybrids with Inherent Explainability for Early Thyroid Cancer Diagnosis from Ultrasound
by Ahmed Y. Alhafdhi, Gibrael Abosamra and Abdulrhman M. Alshareef
Diagnostics 2026, 16(7), 999; https://doi.org/10.3390/diagnostics16070999 (registering DOI) - 26 Mar 2026
Abstract
Background/Objectives: Accurate detection of thyroid cancer using ultrasound remains a challenge, as malignant nodules can be microscopic and heterogeneous, easily confused with point clusters and borderline-featured tissues. Current studies in deep learning demonstrate good performance with convolutional neural networks (CNNs) and clustering; however, many approaches focus on local tissue and provide limited, non-quantitative interpretation, reducing clinical confidence. This study proposes an integrated framework combining enhanced convolutional feature encoders (DenseNet169 and VGG19) with an enhanced vision transformer (ViT-E) to integrate local feature and global relational context during learning, rather than delayed integration. Methods: The proposed framework integrates enhanced convolutional feature encoders (DenseNet169 and VGG19) with an enhanced vision transformer (ViT-E), enabling simultaneous learning of local feature representations and global relational context. This design allows feature fusion during the learning stage instead of delayed integration, aiming to improve diagnostic performance and interpretability in thyroid ultrasound image analysis. Results: The best-performing model, ViT-E–DenseNet169, achieved 98.5% accuracy, 98.9% sensitivity, 99.15% specificity, and 97.35% AUC, surpassing the robust basic hybrid model (CNN–XGBoost/ANN) and existing systems. A second contribution is improved interpretability, moving from mere illustration to validation. Gradient-weighted class activation mapping (Grad-CAM) maps demonstrated distinct and clinically understandable concentration patterns across various thyroid cancers: precise intralesional concentration for high-confidence malignancies (PTC = 0.968), edge/interface concentration for capsule risk patterns (PTC = 0.957), and broader-field activation consistent with infiltration concerns (PTC = 0.984), while benign scans showed low and diffuse activation (PTC = 0.002). 
Spatial audits reinforced this behavior (IoU/PAP: 0.72/91%, 0.65/78%, 0.58/62%). Conclusions: The integrated ViT-E–DenseNet169 framework provides highly accurate thyroid cancer detection while offering clinically meaningful interpretability through Grad-CAM-based spatial validation, supporting improved confidence in AI-assisted ultrasound diagnosis. Full article
(This article belongs to the Special Issue Deep Learning Techniques for Medical Image Analysis)
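The Grad-CAM maps discussed above come from a standard weighting step: the class-score gradients are global-average-pooled into per-channel weights, which combine the activations into a heatmap. A sketch of that plain Grad-CAM computation (not the paper's syncretic variant), given a layer's activations and gradients as arrays.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Plain Grad-CAM heatmap.

    feature_maps: (C, H, W) activations of a conv layer
    gradients:    (C, H, W) d(class score)/d(activations)
    The channel weights are the global-average-pooled gradients; the
    map is the ReLU of the weighted sum, normalized to [0, 1].
    """
    weights = gradients.mean(axis=(1, 2))              # (C,)
    cam = np.tensordot(weights, feature_maps, axes=1)  # (H, W)
    cam = np.maximum(cam, 0.0)                         # keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```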
16 pages, 2156 KB  
Article
Research on Pedestrian Detection Method Based on Dual-Branch YOLOv8 Network of Visible Light and Infrared Images
by Zhuomin He and Xuewen Chen
World Electr. Veh. J. 2026, 17(4), 177; https://doi.org/10.3390/wevj17040177 - 26 Mar 2026
Abstract
In complex traffic environments such as low light, strong glare, occlusion, and nighttime, systems that rely on a single visible-light sensor for pedestrian detection suffer from low detection accuracy and poor robustness. Based on the YOLOv8 convolutional network, this paper adopts a dual-branch structure to process visible-light and infrared images simultaneously, fully utilizing feature information at different scales to effectively detect pedestrians in complex and changeable environments. To address insufficient interaction of modal feature information and fixed fusion weights, a cross-modal feature interaction and enhancement mechanism was introduced. A modal-channel interaction block (MCI-Block) was designed, in which residual connection structures and weight interaction achieve feature enhancement and filter out noise. A dynamic weighted feature fusion strategy is introduced to adaptively adjust the contribution ratio of each modality's features during fusion, enhancing the discrimination of key pedestrian regions. The network was trained and tested on the visible-light and infrared pedestrian detection datasets LLVIP and KAIST, and both the baseline dual-branch model and the proposed model were further verified in real traffic scenarios. The results show that the dual-branch YOLOv8 network constructed in this paper reliably improves pedestrian detection performance in complex traffic environments, including precision, recall, and mAP@0.5, thereby improving the robustness of pedestrian detection. Full article
(This article belongs to the Section Vehicle and Transportation Systems)
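The dynamic weighted fusion idea above — letting the network decide how much each modality contributes per input — can be sketched as a softmax over two modality weights. In the paper the weights are predicted by the network; here they arrive as plain logits, so this is an illustration of the mechanism, not the paper's module.

```python
import numpy as np

def dynamic_weighted_fusion(feat_vis, feat_ir, logit_vis, logit_ir):
    """Fuse visible and infrared feature maps with adaptive weights.

    logit_vis / logit_ir stand in for per-sample weights a network
    would predict; a softmax normalizes them so the two modality
    contributions always sum to one.
    """
    logits = np.array([logit_vis, logit_ir], dtype=float)
    e = np.exp(logits - logits.max())   # numerically stable softmax
    a = e / e.sum()
    return a[0] * feat_vis + a[1] * feat_ir
```

With equal logits the fusion is a plain average; pushing the infrared logit up (e.g. at night) shifts the output toward the infrared branch.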

17 pages, 3275 KB  
Article
3D Reconstruction Method for GM-APD Array LiDAR Based on Intensity Image Guidance
by Ye Liu, Kehao Chi, Ruikai Xue and Genghua Huang
Photonics 2026, 13(4), 323; https://doi.org/10.3390/photonics13040323 - 26 Mar 2026
Abstract
Geiger-mode avalanche photodiode (GM-APD) array light detection and ranging (LiDAR) has significant advantages in low-light scenes due to its single-photon-level detection sensitivity. However, it is susceptible to noise, which leads to a decrease in target localization accuracy. Traditional methods rely on long-term accumulation to distinguish signal photons from noise photons, making it difficult to achieve efficient processing, especially in scenarios with sparse echo photons and low signal-to-noise ratio (SNR), where performance is limited. To quickly and accurately obtain three-dimensional (3D) information of the target under such extreme conditions, this paper proposes a method for target detection and temporal window depth estimation based on intensity information guidance. First, noise suppression is performed on the intensity image according to its statistical characteristics, and an outlier detection mechanism based on neighborhood sparsity is introduced to remove outliers, thereby completing the target detection. Next, by exploiting the spatial continuity and reflectivity similarity of the target, local fusion of photon data within the target neighborhood is performed to construct highly consistent “superpixels”. Finally, according to the distribution difference between signal photons and noise photons on the time axis, temporal window screening is applied to the superpixels to extract depth information, and empty pixels are filled using a convex segmentation method to achieve depth estimation of the target. The experimental results demonstrate that under conditions of low photon counts and strong noise, the proposed method significantly outperforms traditional and existing methods in target recovery and depth estimation by effectively integrating target intensity information. Furthermore, this method achieves faster reconstruction speed, enabling high-precision and high-efficiency 3D target reconstruction. Full article
(This article belongs to the Special Issue Advances in Photon-Counting Imaging and Sensing)
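The temporal-window screening above exploits the fact that signal photons cluster near the true return time while background noise is roughly uniform, so the densest temporal window marks the target. A minimal sketch of that screening step for one pixel or superpixel (window size and bin width are illustrative parameters).

```python
import numpy as np

def temporal_window_depth(timestamps, bin_width, window_bins=3):
    """Estimate the target return time from noisy photon timestamps.

    Histograms the timestamps and slides a window of `window_bins`
    consecutive bins over the counts; the densest window is taken as
    the signal, and its center is returned (same units as the input).
    """
    if len(timestamps) == 0:
        return None
    edges = np.arange(0.0, max(timestamps) + bin_width, bin_width)
    counts, _ = np.histogram(timestamps, bins=edges)
    kernel = np.ones(window_bins)
    sums = np.convolve(counts, kernel, mode="valid")  # sliding-window sums
    start = int(np.argmax(sums))
    return edges[start] + window_bins * bin_width / 2.0
```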

24 pages, 4289 KB  
Article
Floor Plan Generation of Existing Buildings Based on Deep Learning and Stereo Vision
by Dejiang Wang and Taoyu Peng
Buildings 2026, 16(7), 1310; https://doi.org/10.3390/buildings16071310 - 26 Mar 2026
Abstract
The reinforcement and renovation of existing buildings constitute an important component of the future development of the civil engineering industry. Such projects typically require the original construction drawings of the building. However, for older structures, the original paper-based drawings may be damaged or lost. Moreover, traditional manual surveying and mapping methods are time-consuming, labor-intensive, and limited in accuracy. To address these issues, this paper proposes a floor plan generation method for existing buildings that integrates deep learning and stereo vision based on a fusion of synthetic and real data. First, collaborative modeling and automated rendering between a large language model and Blender are implemented based on the Model Context Protocol (MCP), enabling indoor scene modeling and image acquisition to construct a synthetic dataset containing structural components such as doors, windows, and walls. Meanwhile, manually annotated real indoor images are incorporated. Synthetic and real data are mixed in different proportions to form multiple dataset configurations for model training and validation. Subsequently, the SegFormer model is employed to perform semantic segmentation of indoor components. Combined with stereo camera calibration results, disparity computation is conducted to extract the three-dimensional spatial coordinates of component corner points. On this basis, the architectural floor plan is generated according to the spatial geometric relationships among structural components. Experimental results demonstrate that the proposed method effectively reduces the need for manual annotation and on-site measurement, providing an efficient technical solution for indoor floor plan generation of existing buildings. Full article
(This article belongs to the Topic Application of Smart Technologies in Buildings)
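The step above that turns disparity into 3-D corner coordinates is the standard stereo back-projection: depth follows from the focal length, baseline, and disparity, and the lateral coordinates from the pinhole model. A sketch with example intrinsics (the values are illustrative, not calibration results from the paper).

```python
def disparity_to_xyz(u, v, disparity, fx, fy, cx, cy, baseline):
    """Back-project a pixel (u, v) with stereo disparity to camera
    coordinates: Z = fx * B / d, then X and Y via the pinhole model.
    fx, fy, cx, cy come from calibration; baseline is the distance
    between the two camera centers (same units as the output).
    """
    z = fx * baseline / disparity
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return x, y, z
```

Corner points recovered this way can then be related geometrically to assemble the floor plan.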

18 pages, 21058 KB  
Article
MSSA-Net: Multi-Modal Structural and Semantic-Adaptive Network for Low-Light Image Enhancement
by Tianxiang Chen, Xiaoyi Wang, Tongshun Zhang and Qiuzhan Zhou
Sensors 2026, 26(7), 2059; https://doi.org/10.3390/s26072059 - 25 Mar 2026
Abstract
Low-light image enhancement (LLIE) remains challenging due to severe degradation of high-frequency structures and semantic ambiguity under extreme darkness. Although existing methods achieve satisfactory brightness recovery, they often suffer from structural inconsistency and semantic drift, as diverse scenes are typically processed with uniform enhancement strategies or static text prompts. To address these issues, we propose a Multi-Modal Structural and Semantic-Adaptive Network (MSSA-Net) under a structure-anchored paradigm. First, we design a Multi-Scale Self-Refinement Block (MSRB) to enhance degraded visible representations through multi-scale feature extraction and progressive refinement. Meanwhile, a pseudo-infrared structural prior derived from the input image is introduced to provide noise-insensitive geometric cues. These cues are extracted via a Structure-Guided Cross-Attention (SGCA) module to produce structure-dominant features. The refined visible features and structural features are then adaptively integrated through an adaptive residual fusion (ARF) module to achieve balanced restoration. Furthermore, we develop a Large Multi-modal Model (LMM)-Driven Scene-Adaptive Attention mechanism that generates instance-aware scene tags from a coarse preview and injects semantic embeddings into visual features. Extensive experiments demonstrate that MSSA-Net improves structural fidelity, brightness recovery, and semantic naturalness across multiple benchmarks. Full article
(This article belongs to the Special Issue AI-Based Computer Vision Sensors & Systems—2nd Edition)
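The adaptive residual fusion idea above can be read as a gated residual: the structural stream is injected on top of the visible stream, with a learned per-channel gate deciding how much. This is an ARF-style sketch under that reading, not the paper's exact module; `gate_logits` stands in for parameters the network would learn.

```python
import numpy as np

def adaptive_residual_fusion(feat_visible, feat_structural, gate_logits):
    """Gated residual fusion of (C, H, W) feature maps.

    A per-channel sigmoid gate controls how much structure-dominant
    signal is added to the refined visible features.
    """
    g = 1.0 / (1.0 + np.exp(-np.asarray(gate_logits, dtype=float)))
    return feat_visible + g[:, None, None] * feat_structural
```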

36 pages, 1944 KB  
Article
EMAF-Net: A Lightweight Single-Stage Detector for 13-Class Object Detection in Agricultural Rural Road Scenes
by Zhixin Yao, Chunjiang Zhao, Yunjie Zhao, Xiaoyi Liu, Tuo Sun and Taihong Zhang
Sensors 2026, 26(7), 2055; https://doi.org/10.3390/s26072055 - 25 Mar 2026
Abstract
Rural road perception for agricultural machinery automation faces challenges including complex backgrounds, drastic lighting and weather variations, frequent occlusions, and high densities of small objects with significant scale variations. These factors make conventional detectors prone to missed detections and misclassifications. To address these issues, a 4K rural road dataset with 4771 images is constructed. The dataset covers 13 object categories and includes diverse day/night conditions and multiple weather scenarios on both structured and unstructured roads. EMAF-Net, a lightweight single-stage detector based on YOLOv4-P6, is proposed. The backbone integrates an EMHA module combining EfficientNet-B1 with multi-head self-attention (MHSA) for enhanced global context modeling while preserving efficient local feature extraction. The neck adopts an Improved ASPP and a bidirectional FPN to achieve robust multi-scale feature fusion and expanded receptive fields. Meanwhile, CIoU loss is used to optimize bounding box regression accuracy. The experimental results demonstrate that EMAF-Net achieves an mAP@0.5 of 64.05% and an mAP@0.5:0.95 of 48.95% on a rural road dataset. At the same time, it maintains a lightweight design with 18.3 M parameters and a computational complexity of 38.5 GFLOPs. Ablation studies confirm the EMHA module contributes a 6.22% mAP@0.5 improvement, validating EMAF-Net’s effectiveness for real-time rural road perception in autonomous agricultural systems. Full article
(This article belongs to the Section Smart Agriculture)
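The CIoU loss used above for bounding-box regression augments 1 - IoU with a normalized center-distance term and an aspect-ratio consistency term. A self-contained sketch of the standard formulation (Zheng et al.), for (cx, cy, w, h) boxes:

```python
import math

def ciou_loss(box_a, box_b, eps=1e-9):
    """CIoU loss: 1 - IoU + center-distance and aspect-ratio penalties."""
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    ax1, ay1, ax2, ay2 = cxa - wa / 2, cya - ha / 2, cxa + wa / 2, cya + ha / 2
    bx1, by1, bx2, by2 = cxb - wb / 2, cyb - hb / 2, cxb + wb / 2, cyb + hb / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = wa * ha + wb * hb - inter + eps
    iou = inter / union
    # Squared center distance over the squared diagonal of the
    # smallest enclosing box.
    rho2 = (cxa - cxb) ** 2 + (cya - cyb) ** 2
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1.0 - (iou - rho2 / c2 - alpha * v)
```

Unlike plain IoU loss, the center-distance term keeps gradients informative even when boxes do not overlap at all.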
25 pages, 3612 KB  
Article
CrtNet: A Cross-Model Residual Transformer Network for Structure-Guided Remote Sensing Scene Classification
by Chaoran Chen, Tianyuan Zhu, Tao Cui, Dalin Li, Adriano Tavares, Yanchun Liang and Yanheng Liu
Electronics 2026, 15(7), 1366; https://doi.org/10.3390/electronics15071366 - 25 Mar 2026
Abstract
Accurate remote sensing scene classification is essential for large-scale Earth observation but remains challenging due to significant inter-class similarity and complex spatial layouts in medium- and low-resolution imagery. Conventional convolutional neural networks (CNNs) effectively capture local structural patterns but struggle to model long-range semantic dependencies, whereas Vision Transformers excel at global context modeling yet often show reduced sensitivity to fine-grained spatial structures. To address these limitations, we propose CrtNet, a structure-aware Cross-Model Residual Transformer Network that establishes a dual-stream collaborative architecture integrating convolutional structural representations with Transformer-based semantic modeling through gated residual cross-model interactions. In this framework, a convolutional branch first extracts stable local structural features with strong spatial inductive biases. These features are continuously injected into the Transformer encoding process via residual cross-model connections, enabling persistent structural guidance during global attention modeling. In addition, a sample-adaptive dynamic gating mechanism is introduced to flexibly balance structural and semantic features during prediction. Extensive experiments conducted on two public remote sensing benchmarks, EuroSAT and UCM, demonstrate that CrtNet consistently outperforms representative CNN-based, Transformer-based, and hybrid state-of-the-art models, particularly in visually ambiguous scene categories. Full article
(This article belongs to the Special Issue Computer Vision and Machine Learning: Real-World Applications)
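The sample-adaptive dynamic gating above can be sketched as a scalar gate predicted from the sample's pooled features that blends the two branches' predictions. This is an illustrative reading, not CrtNet's exact parameterization; the linear gate map (`w_gate`, `b_gate`) stands in for learned parameters.

```python
import numpy as np

def gated_branch_blend(logits_cnn, logits_vit, feat, w_gate, b_gate):
    """Blend structural (CNN) and semantic (Transformer) predictions.

    A per-sample scalar gate in (0, 1) is predicted from pooled
    features `feat` by a linear map, then used as a convex weight
    between the two branches' class logits.
    """
    g = 1.0 / (1.0 + np.exp(-(feat @ w_gate + b_gate)))
    return g * logits_cnn + (1.0 - g) * logits_vit
```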

26 pages, 16104 KB  
Article
Multi-Slot Attention with State Guidance for Egocentric Robotic Manipulation
by Sofanit Wubeshet Beyene and Ji-Hyeong Han
Electronics 2026, 15(7), 1365; https://doi.org/10.3390/electronics15071365 - 25 Mar 2026
Abstract
Visual perception is fundamental to robotic manipulation for recognizing objects, goals, and contextual details. Third-person cameras provide global views but can miss contact-rich interactions and require calibration. Wrist-mounted egocentric cameras reduce these limitations but introduce occlusion, motion blur, and partial observability, which complicate visuomotor learning. Furthermore, existing perception modules that rely solely on pixels or fuse imagery with proprioception as flat vectors do not explicitly model structured scene representations in dynamic egocentric views. To address these challenges, a multi-slot attention fusion encoder for egocentric manipulation is introduced. Learnable slot queries extract localized visual features from image tokens, and Feature-wise Linear Modulation (FiLM) conditions each slot on the robot’s joint states, producing a structured slot-based latent representation that adapts to viewpoint and configuration changes without requiring object labels or external camera priors. The resulting structured slot-based latent representation is used as input to a Soft Actor–Critic (SAC) agent, which achieves a higher mean cumulative return than pixel-only CNN/DrQ and state-only baselines on a ManiSkill3 egocentric manipulation task. Probing experiments and real-camera evaluation further show that the learned representation remains stable under egocentric viewpoint shifts and partial occlusions, indicating robustness in practical manipulation settings. Full article
(This article belongs to the Section Artificial Intelligence)
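The FiLM conditioning above scales and shifts each feature channel by parameters predicted from the robot's joint state. A sketch: here a plain linear map stands in for the paper's conditioning network, and slot features are a (num_slots, dim) array.

```python
import numpy as np

def film_condition(slot_features, joint_state, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM-condition slot features on the joint state.

    A linear map (a stand-in for a learned conditioning network)
    predicts per-channel scale gamma and shift beta from the joint
    state; every slot is then modulated as gamma * slots + beta.
    """
    gamma = joint_state @ w_gamma + b_gamma   # (dim,)
    beta = joint_state @ w_beta + b_beta      # (dim,)
    return slot_features * gamma + beta
```

With gamma = 1 and beta = 0 the modulation is the identity; nonzero joint-state-dependent parameters let the same slots specialize to the current configuration.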

28 pages, 105542 KB  
Article
Underwater Image Enhancement via HSV-CS Representation and Perception-Driven Adaptive Fusion
by Fengxu Guan, Tong Guo and Yuzhu Zhang
Remote Sens. 2026, 18(7), 986; https://doi.org/10.3390/rs18070986 - 25 Mar 2026
Abstract
Underwater images often suffer from color distortion and low contrast, severely limiting the reliability of visual perception systems. Existing methods struggle to balance enhancement quality and computational efficiency. To address this issue, we propose PCF-Net (Perception-driven Color Fusion Network), a lightweight dual-branch network for underwater image enhancement based on a stable HSV-CS (Hue-Saturation-Value with sine–cosine transformation) color-space representation. Specifically, a sine–cosine transformation is introduced to construct a stable HSV-CS color space, effectively avoiding hue discontinuities at boundary regions in conventional HSV representations. To compensate for underwater degradation, a Color-Bias-Aware module and a Value-Confidence module are designed to adaptively correct color distortion and luminance degradation. Furthermore, a lightweight Channel-Spatial Adaptive Gated Fusion module dynamically aggregates features from the RGB and HSV-CS branches in a perception-driven manner. The overall architecture incorporates multi-branch re-parameterizable convolutions, significantly reducing computational cost while preserving strong representational capacity. Extensive experiments on underwater image enhancement benchmarks, including UIEB and RUIE, demonstrate that PCF-Net achieves state-of-the-art performance in terms of PSNR, SSIM, and UIQM, along with visually superior color correction and contrast enhancement. With only 0.17 M parameters, the proposed model runs at 118.6 FPS on an RTX 3090 and 35.3 FPS on a Jetson Orin Nano at a resolution of 512 × 512, making it well suited for resource-constrained real-time underwater vision applications. Full article
(This article belongs to the Special Issue Deep Learning for Remote Sensing Image Enhancement)
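The abstract's key trick, avoiding hue wrap-around by encoding the hue angle as a sine–cosine pair, can be illustrated with a minimal sketch (function names are ours, not from the paper): hues of 359° and 1° are perceptually adjacent but numerically far apart, while their points on the unit circle are close.

```python
import numpy as np

def hue_to_cs(h_deg):
    """Encode hue angles (degrees) as (sin, cos) pairs on the unit circle.

    Hues of 359 deg and 1 deg are perceptually adjacent but numerically far
    apart; the (sin, cos) representation maps both to nearby points,
    removing the wrap-around discontinuity at the 0/360 boundary.
    """
    h_rad = np.deg2rad(np.asarray(h_deg, dtype=np.float64))
    return np.sin(h_rad), np.cos(h_rad)

def cs_to_hue(s, c):
    """Decode a (sin, cos) pair back to a hue angle in [0, 360)."""
    return np.rad2deg(np.arctan2(s, c)) % 360.0
```

In this encoding, the Euclidean distance between encoded hues reflects perceptual proximity, which is what makes the representation stable for a learned network.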

24 pages, 8565 KB  
Article
Rethinking Data Leakage in Patch-Based Hyperspectral Image Classification with Traditional Deep Networks
by Kaizhe Zhan, Zijie Huo, Zhijian Yin and Zhen Yang
Remote Sens. 2026, 18(7), 979; https://doi.org/10.3390/rs18070979 - 25 Mar 2026
Abstract
Hyperspectral image (HSI) processing techniques have become increasingly important in many fields such as agriculture, environmental monitoring, and mining. However, the number of annotated samples in existing hyperspectral datasets is limited, and most hyperspectral classification models are trained on patch data: each pixel to be classified is taken as the patch center, and a fixed-size neighborhood of surrounding pixels is extracted to form the patch. During model training, we found that the patches of most training samples overlap spatially with test-set pixels. This inevitably leads to data leakage, resulting in artificially inflated classification performance. To solve this problem, we develop a method of replacing overlapping pixels (ROP) in patch data, in which training pixels of the same class replace any test pixels that fall inside the overlapping region of a training patch. Furthermore, a multiple feature extraction and fusion (MFEF) module is proposed to enhance the model's capacity to extract spectral–spatial features from the reconstructed patches. Results on five publicly available HSI datasets demonstrate that the proposed resolving data leakage network (RDLNet) provides competitive classification results on patch data reconstructed with the ROP strategy, outperforming existing state-of-the-art (SOTA) methods. Full article
(This article belongs to the Section Remote Sensing Image Processing)
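The ROP idea described in the abstract can be sketched as follows. This is a toy illustration only: the array names, the random same-class replacement policy, and the omission of image-border handling are our assumptions, not details from the paper.

```python
import numpy as np

def replace_overlapping_pixels(cube, labels, train_mask, test_mask,
                               center, patch_size=9):
    """Toy sketch of ROP: extract a training patch around `center` and
    replace any test-set pixels inside it with a randomly chosen training
    pixel of the same class, so no test spectra leak into training patches.

    cube:                  (H, W, B) hyperspectral cube
    labels:                (H, W) integer class map
    train_mask, test_mask: (H, W) boolean masks over pixels
    center:                (row, col) of the training pixel
    """
    r = patch_size // 2
    y, x = center  # assumed at least r pixels from every border
    patch = cube[y - r:y + r + 1, x - r:x + r + 1].copy()
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            py, px = y + dy, x + dx
            if test_mask[py, px]:
                # pool of training pixels sharing the test pixel's class
                ys, xs = np.where(train_mask & (labels == labels[py, px]))
                j = np.random.randint(len(ys))
                patch[dy + r, dx + r] = cube[ys[j], xs[j]]
    return patch
```

A training loop would apply this to every training patch once, before any epochs run, so that the network never observes a test spectrum.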

25 pages, 3612 KB  
Article
Learning Modality Complementarity for RGB-D Salient Object Detection via Dynamic Neural Network
by Yuanhao Li, Jia Song, Chenglizhao Chen and Xinyu Liu
Electronics 2026, 15(7), 1361; https://doi.org/10.3390/electronics15071361 - 25 Mar 2026
Abstract
RGB-D salient object detection (RGB-D SOD) aims to accurately localize and segment visually salient objects by jointly leveraging RGB images and depth maps. Some existing methods rely on static fusion strategies with fixed paths and weights, which treat all regions equally and fail to capture the varying importance of different regions and modalities. Although attention-based methods alleviate the limitations of static fusion by assigning adaptive weights to different regions and modalities, the quality of RGB and depth data may degrade in real-world scenarios due to sensor noise, illumination changes, or environmental interference. These attention-based methods often overlook inter-modality quality differences and complementarity, making them prone to over-relying on a single modality, which can lead to noise introduction, feature conflicts, and performance degradation. To address these limitations, this paper proposes a novel dynamic feature routing and fusion framework for RGB-D SOD, which adaptively adjusts the fusion strategy according to the quality of the input modalities. To enable modality quality awareness, the proposed method characterizes the modality complementarity between RGB and depth features in a task-driven manner inspired by information-theoretic principles. We introduce a task-relevance scoring function, integrated with a mutual information estimator, that quantifies this complementarity and emphasizes task-relevant features while suppressing redundancy. A dynamic routing module is then designed to perform feature selection guided by the captured complementarity. In addition, we propose a novel cross-modal fusion module that adaptively fuses the features selected by the dynamic routing module, effectively enhancing complementary representations while suppressing redundant features and noise interference.
Extensive experiments conducted on seven public RGB-D SOD benchmark datasets demonstrate that the proposed method consistently achieves competitive performance, outperforming existing methods by an average of approximately 1% across multiple evaluation metrics. Notably, in challenging scenarios with severe modality quality degradation, the proposed method outperforms existing best-performing methods by up to 1.8%, demonstrating strong robustness against cluttered backgrounds, complex object structures, and diverse object scales. Overall, the proposed dynamic fusion framework provides a novel solution to modality quality imbalance in RGB-D salient object detection. Full article
(This article belongs to the Section Artificial Intelligence)
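The general idea of quality-aware fusion, down-weighting a degraded modality instead of mixing with fixed 50/50 weights, can be sketched as below. This is not the paper's module: the feature-energy quality proxy and the per-location softmax gate are illustrative assumptions of ours.

```python
import numpy as np

def gated_fusion(rgb_feat, depth_feat):
    """Illustrative quality-aware gate for two (C, H, W) feature maps.

    A per-location softmax over scalar quality scores decides how much
    each modality contributes, so a low-quality modality (here proxied
    by low feature energy) is adaptively down-weighted.
    """
    # crude quality proxy: feature energy per spatial position
    e_rgb = np.linalg.norm(rgb_feat, axis=0, keepdims=True)    # (1, H, W)
    e_depth = np.linalg.norm(depth_feat, axis=0, keepdims=True)
    scores = np.stack([e_rgb, e_depth], axis=0)                # (2, 1, H, W)
    # numerically stable softmax over the two modalities
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)
    return weights[0] * rgb_feat + weights[1] * depth_feat
```

In the paper's framework this gate would additionally be driven by the learned task-relevance and mutual-information scores rather than raw feature energy.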
