MDPI - Publisher of Open Access Journals

21 pages, 2363 KB

Open Access Article

Fusion of RGB and LiDAR Modalities for Building Footprint Extraction Using High-Resolution Aerial Imagery

by Norbert Serbán, Péter Enyedi, Péter Burai and Balázs Harangi

Remote Sens. 2026, 18(12), 2049; https://doi.org/10.3390/rs18122049 (registering DOI) - 21 Jun 2026

This paper presents a novel approach for fusing RGB and LiDAR inputs for semantic segmentation. Accurate building detection is required in scenarios such as urban planning and environmental monitoring. The two main data sources for building segmentation are RGB aerial images and LiDAR point clouds covering the selected area. Each source has its own well-established segmentation techniques; however, few architectures combine the two inputs, even though extracting complementary features from the two domains can yield an enhanced segmentation map. The authors created a semantic segmentation model that takes both the aerial RGB image and the LiDAR point cloud as input. The network first processes the point cloud and forwards its projection to a modified U-Net-based architecture, which fuses the features extracted from the 3D input with those extracted from the 2D input at each level of the decoder. To train and test the model, the authors used a dataset containing more than 3000 images and their corresponding 3D point clouds from three different areas of Hungary. The paper reports that this approach provides significantly better segmentation accuracy than traditional RGB-only and point-cloud-only segmentation models and their ensembles.
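
The abstract does not say how the point cloud is projected before it reaches the 2D network. As a hedged illustration only, one common choice is a top-down raster that keeps the highest LiDAR return per grid cell. The function name `rasterize_max_height`, the grid parameters, and the `nodata` fill value are all assumptions made for this sketch and are not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple


def rasterize_max_height(
    points: Sequence[Tuple[float, float, float]],
    x_min: float,
    y_min: float,
    cell_size: float,
    n_cols: int,
    n_rows: int,
    nodata: float = 0.0,
) -> List[List[float]]:
    """Project (x, y, z) points onto a 2D grid, keeping the highest z per cell.

    Hypothetical helper: the projection used in the paper is not described
    in the abstract, so this is only one plausible top-down projection.
    """
    grid: List[List[Optional[float]]] = [[None] * n_cols for _ in range(n_rows)]
    for x, y, z in points:
        col = int((x - x_min) // cell_size)
        row = int((y - y_min) // cell_size)
        # Points that fall outside the raster extent are ignored.
        if 0 <= col < n_cols and 0 <= row < n_rows:
            current = grid[row][col]
            if current is None or z > current:
                grid[row][col] = z
    # Cells without any LiDAR return receive the nodata value.
    return [[nodata if v is None else v for v in row] for row in grid]
```

The resulting raster can be aligned with the RGB tile pixel by pixel, which is what makes level-wise feature fusion in an encoder and decoder design possible.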

(This article belongs to the Section AI Remote Sensing)

25 pages, 6003 KB

Open Access Article

Multi-Scale Feature Fusion for Intelligent Recognition of Tunnel Face Fractures

by Qiang Gong, Jiaying Fan, Ning Zhang, Hongliang Liu, Xinbo Jiang, Changyuan Chen, Wenfeng Tu and Yuxue Chen

Appl. Sci. 2026, 16(12), 6182; https://doi.org/10.3390/app16126182 - 18 Jun 2026

Viewed by 168

Abstract

Accurate recognition of fractures on tunnel faces is essential for evaluating surrounding-rock integrity and ensuring excavation safety, yet it remains difficult because fracture traces are slender, irregular, discontinuous, and easily obscured by complex rock textures and illumination variability. This study proposes MF-DeepLabv3+, an enhanced DeepLabv3+-based semantic segmentation framework for tunnel-face fracture identification and geometric characterization. Unlike existing attention-based DeepLab variants that mainly enhance global feature representation, MF-DeepLabv3+ is specifically designed for thin and discontinuous tunnel-face fracture segmentation by integrating a Multi-Scale Cross Attention module for multi-receptive-field feature interaction, a Feature Smoothing Module for noise suppression and fracture-continuity enhancement, and a lightweight MobileNetV2 backbone for improved computational efficiency. A dataset of 2153 annotated images collected from the Qingdao Jiaozhou Bay Second Subsea Tunnel and the Yantai Urban Rapid Road Tunnel was established for training and evaluation. Considering the strong class imbalance between fracture and background pixels, Accuracy is reported only as an auxiliary metric, while mAP, mIoU, per-class IoU, and fracture-specific Precision, Recall, and F1-score are emphasized to provide a more reliable assessment of segmentation performance. Comparative and ablation experiments show that MF-DeepLabv3+ achieved 82.56% mAP and 62.99% mIoU, with an auxiliary Accuracy of 92.47%. Compared with the original DeepLabv3+ baseline, the proposed model achieved a substantial improvement in mAP and a modest improvement in mIoU, indicating enhanced fracture recognition capability and slightly improved region-level overlap, at the cost of a moderate increase in computation. Fracture grouping and post-processing were further performed using edge detection, Hough transform, connected-component analysis, and fitted-line geometry to estimate fracture length and width. The proposed method therefore enables more reliable tunnel-face fracture recognition and provides quantitative geometric information for engineering assessment and geological interpretation.
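
To make the point about class imbalance concrete, the sketch below computes the metrics named in the abstract from pixel counts for a two-class (fracture versus background) problem. The function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Pixel-level metrics for a fracture (positive) vs. background task."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # Per-class IoU: intersection over union of predicted and true pixels.
    iou_fracture = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    iou_background = tn / (tn + fp + fn) if (tn + fp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "iou": iou_fracture,
        "iou_background": iou_background,
        "miou": (iou_fracture + iou_background) / 2,
    }
```

With illustrative counts of 200 fracture pixels out of 10,000, a model that predicts only background (`tp=0, fp=0, fn=200, tn=9800`) reaches an accuracy of 0.98 while its fracture recall and IoU are both 0 and its mIoU is 0.49, which is why accuracy alone is a weak indicator here.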

29 pages, 7383 KB

Open Access Article

A Lightweight Transformer-Based Network for Image Deraining with Feature-Wise Attention and Cross-Level Feature Refinement

by Baozhu Li, Wanci Dai and Chao He

Appl. Sci. 2026, 16(12), 6108; https://doi.org/10.3390/app16126108 - 17 Jun 2026

Viewed by 187

Abstract

The aim in single-image deraining tasks is to remove rain streaks from degraded images while preserving scene structures and fine details. However, existing deep learning-based methods often face a trade-off between restoration quality and computational efficiency, and many models struggle to capture hierarchical information effectively under complex rain conditions. To address these limitations, we propose a lightweight cross-gated hierarchical transformer for image deraining. The proposed network adopts a five-stage encoder–decoder architecture with Multi-head Feature-wise Attention (MFA) to efficiently model channel-wise dependencies while reducing the computational burden associated with conventional self-attention. In addition, an Enhanced Gated Depthwise Feed-Forward Network (EGDFN) is introduced to obtain refined feature representations with improved efficiency, and a Cross-Level Feature Refinement (CLFR) module is designed to enhance information exchange between corresponding encoder and decoder stages, thereby strengthening hierarchical feature integration and preserving structural details. The network is trained using a single SSIM-based loss, which enhances the structural fidelity of the restored results. Extensive experiments on four synthetic datasets, two real-world datasets, and a downstream semantic segmentation benchmark demonstrate that the proposed method consistently achieves strong restoration performance, producing cleaner outputs with sharper details and improved effectiveness for subsequent vision tasks. Full article
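
The abstract mentions training with a single SSIM-based loss but does not give its exact form. The sketch below shows a simplified version under stated assumptions: it evaluates the standard SSIM formula once over the whole image instead of over local windows, and uses the conventional constants `k1 = 0.01` and `k2 = 0.03`. The names `global_ssim` and `ssim_loss` are hypothetical.

```python
from typing import Sequence


def global_ssim(
    x: Sequence[float],
    y: Sequence[float],
    data_range: float = 1.0,
    k1: float = 0.01,
    k2: float = 0.03,
) -> float:
    """SSIM computed from global statistics of two flattened images.

    Simplification: practical SSIM implementations average this quantity
    over local windows; a single global window is used here for brevity.
    """
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def ssim_loss(restored: Sequence[float], clean: Sequence[float]) -> float:
    """Loss that is 0 for identical images and grows as structure diverges."""
    return 1.0 - global_ssim(restored, clean)
```

Because the loss depends on means, variances, and covariance rather than per-pixel differences, it rewards structural agreement between the restored and clean images.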

(This article belongs to the Special Issue Advanced Technologies in Image Processing, Analysis, and Machine Vision)

16 pages, 6626 KB

Open Access Data Descriptor

A High-Resolution Multi-Temporal Remote Sensing Dataset for Levee-like Feature Segmentation in Arid Regions

by Osman Ilniyaz, Qingwu Hu, Hao Lu and Kaisar Ahmat

Data 2026, 11(6), 146; https://doi.org/10.3390/data11060146 - 16 Jun 2026

Viewed by 191

Abstract

Levee-like features are critical for water regulation in arid regions, but their automated extraction from remote sensing imagery remains challenging due to the scarcity of high-resolution labeled datasets. This data descriptor introduces a high-resolution remote sensing image dataset for semantic segmentation of levee-like features. The dataset covers 11 regions across Xinjiang and Gansu Province in northwestern China. It includes 459 single-phase base images with a spatial resolution of 0.50 m, as well as multi-temporal images of the same regions captured at different times. All annotations were manually drawn in polygon mode using the LabelMe tool and converted into YOLO format label files. The dataset adopts a strict strategy to prevent data leakage: first, training, validation and test sets are divided based on single-phase images, and then multi-temporal images are allocated to the corresponding data subsets according to their spatial locations. The dataset has been publicly released on the ScienceDB platform under the CC BY 4.0 license. YOLO and U-Net segmentation experiments on the test set achieved promising results, demonstrating its usability for levee-like feature segmentation. This dataset can provide fundamental data support for research on levee-like feature extraction, remote sensing change detection, and cross-region model transfer learning. Full article
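
The two-step split described above can be sketched as follows. This is a minimal illustration, assuming each single-phase base image is identified by a location ID and each multi-temporal image records the location it revisits. The function name `split_without_leakage`, the ID scheme, and the 70/15/15 ratios are assumptions, since the abstract does not state them.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_without_leakage(
    base_locations: Sequence[str],
    temporal_images: Dict[str, str],
    ratios: Tuple[float, float, float] = (0.7, 0.15, 0.15),
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Split base images first, then route multi-temporal images by location.

    base_locations: IDs of single-phase base images (one per location).
    temporal_images: maps each multi-temporal image ID to its location ID.
    """
    locations = sorted(base_locations)
    random.Random(seed).shuffle(locations)
    n_train = int(len(locations) * ratios[0])
    n_val = int(len(locations) * ratios[1])

    # Step 1: assign every location to exactly one subset.
    subset_of: Dict[str, str] = {}
    for index, location in enumerate(locations):
        if index < n_train:
            subset_of[location] = "train"
        elif index < n_train + n_val:
            subset_of[location] = "val"
        else:
            subset_of[location] = "test"

    splits: Dict[str, List[str]] = {"train": [], "val": [], "test": []}
    for location in locations:
        splits[subset_of[location]].append(location)

    # Step 2: a multi-temporal image follows the subset of its location,
    # so the same ground area never appears in two different subsets.
    for image_id, location in temporal_images.items():
        splits[subset_of[location]].append(image_id)
    return splits
```

Splitting by location before adding the revisit images is what prevents a model from being tested on an area it has already seen at a different acquisition time.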

(This article belongs to the Section Spatial Data Science for Environment and Earth)

36 pages, 32050 KB

Open Access Article

Semantic Segmentation of Pegmatite Dikes in High-Resolution Remote Sensing Imagery Using GAD-UNet++ in the Yilanlike Area, South Tianshan

by Zirui Wu, Chuan Chen, Yuanjun Yu, Yong Tian, Jian Yu and Fang Xia

Remote Sens. 2026, 18(12), 1988; https://doi.org/10.3390/rs18121988 - 15 Jun 2026

Viewed by 200

Abstract

Pegmatite dikes are important prospecting indicators for rare-metal deposits, whereas traditional methods for pegmatite dike identification are constrained by the limited capability of human visual interpretation to capture information from remote sensing imagery, resulting in low identification accuracy and efficiency. In recent years, global research on semantic segmentation of different surface features and remote sensing-based mineral exploration using deep learning methods and high-resolution remote sensing imagery has made significant progress; however, studies on surface-exposed geological bodies such as pegmatite dikes remain highly insufficient. To address the key problem of efficiently identifying pegmatite dikes in remote sensing imagery, this study proposes an improved model based on UNet++, termed GAD-UNet++. In the field of remote sensing geology, this study constructed a pegmatite dike semantic segmentation dataset based on high-resolution RGB imagery by using 0.66 m RGB imagery for visual delineation and ZY1F hyperspectral data for spectral constraint and label refinement; on this basis, semantic segmentation of surface pegmatite dikes in the Yilanlike area of the South Tianshan Mountains, Xinjiang, was conducted using RGB remote sensing image patches as model input. Specifically, because pegmatite dikes are small targets characterized by slender structures, indistinct boundaries, and sparse regional distribution, this study introduced a lightweight feature extraction structure (GhostNetV2) and a long-range dependency attention module (DFC) at the encoder stage, and further incorporated the Coordinate Attention module (CA) to enhance spatial localization and boundary representation of the targets. Finally, focal cross-entropy loss and a deep supervision strategy were adopted to improve the accuracy of semantic information extraction for pegmatite dikes, as well as the training stability and segmentation accuracy under class-imbalance conditions. 
The results show that the proposed model achieved an mIoU of 93.11% and an F1-score of 94.95% on the test set. Compared with existing semantic segmentation models, the proposed model achieved superior performance in both identification accuracy and computational efficiency for pegmatite dikes. In addition, this study delineated 18 potential pegmatite dike enrichment zones in the Yilanlike area, providing technical support for remote sensing-based rare-metal prospecting and geological interpretation in the study area. Full article

(This article belongs to the Section Remote Sensing in Geology, Geomorphology and Hydrology)

30 pages, 7012 KB

Open Access Article

TerrainFormer: World Model-Guided Decision Transformer for Autonomous Off-Road Navigation

by Yongzhi Yang and Kenneth Ricks

Sensors 2026, 26(12), 3795; https://doi.org/10.3390/s26123795 - 14 Jun 2026

Viewed by 406

Abstract

Autonomous navigation in unstructured off-road environments presents fundamental challenges due to terrain heterogeneity, the absence of structured road markings, and the necessity for real-time traversability reasoning from raw sensory observations. We present TerrainFormer, a hierarchical framework that integrates a world model for terrain dynamics prediction with a temporal decision transformer for action selection. Our methodology employs a two-phase training paradigm: (1) self-supervised world model pretraining on LiDAR point clouds to learn terrain representations encompassing traversability, elevation, and semantic segmentation; (2) behavioral cloning of the decision transformer conditioned on frozen world model features with temporally derived goal directions. The world model processes raw 3D LiDAR point clouds through a PointPillars encoder for real-time bird’s-eye-view (BEV) projection, followed by a Vision Transformer backbone that produces latent terrain representations. A principal contribution is our cross-dataset generalization paradigm: the world model is trained on separate datasets while the decision transformer is trained on separate sequences, ensuring zero data overlap between training phases. We introduce automatic goal direction computation from vehicle pose trajectories, enabling the model to learn directionally conditioned navigation policies. To address the class imbalance inherent in off-road driving data, we employ focal loss with inverse-frequency class weighting and action-chunk supervision. Experimental evaluation on the RELLIS-3D dataset achieves 87.31% test accuracy with 0.7948 macro F1 across all 12 action classes. The world model’s predicted future frames produce only a 0.79% accuracy drop versus ground-truth observations, with 98.82% action agreement, demonstrating effective cross-dataset generalization for real-time off-road navigation. Full article
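
The combination of focal loss and inverse-frequency class weighting mentioned above can be illustrated for a single sample. This is a sketch under assumptions: the weight normalization `total / (num_classes * count)` and the focusing parameter `gamma = 2.0` are common defaults, not values confirmed by the abstract, and both function names are hypothetical.

```python
import math
from typing import List, Sequence


def inverse_frequency_weights(class_counts: Sequence[int]) -> List[float]:
    """Weight each class by total / (num_classes * count); counts must be > 0."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [total / (num_classes * count) for count in class_counts]


def focal_loss(
    probs: Sequence[float],
    target: int,
    class_weights: Sequence[float],
    gamma: float = 2.0,
    eps: float = 1e-12,
) -> float:
    """Class-weighted focal loss for one sample.

    probs: predicted class probabilities (for example a softmax output).
    target: index of the true class.
    """
    p_t = max(probs[target], eps)  # guard against log(0)
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)
```

Setting `gamma = 0` reduces the expression to ordinary class-weighted cross-entropy, which makes the effect of the focusing term easy to isolate in an ablation.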

(This article belongs to the Special Issue Intelligent Sensors for Smart and Autonomous Vehicles: 2nd Edition)

29 pages, 31575 KB

Open Access Article

DCA-DeepLab: Dual-Coordinate Attention DeepLab with Adaptive Focal Loss for Cotton Growth Semantic Segmentation from UAV Remote Sensing Images

by Liruizhi Jia, Jiazhan Gao, Zuolong Li, Heng Shi and Jihong Zhu

Drones 2026, 10(6), 456; https://doi.org/10.3390/drones10060456 - 11 Jun 2026

Viewed by 291

Abstract

UAV remote sensing provides centimetre-level imagery for fine-grained cotton growth monitoring, yet existing segmentation models face three challenges: cotton fields exhibit a pronounced row and column structure that standard convolutions struggle to capture; conventional decoders fuse features statically, suppressing fine boundary cues; and the pixel-level class distribution is severely imbalanced. We present DCA-DeepLab, built on DeepLabv3+ with three task-specific components: a Dual-Coordinate Attention Gating (DCAG) module that decouples horizontal and vertical dependencies to encode row and column structures; a Multi-Scale Attention-Guided Modulated Feature Merging (MSAM-MFM) module that reweights semantic and detail features at each location; and an adaptive pixel-level modulated focal loss (APMFL), which focuses training on hard, minority-class pixels. We construct a cotton growth dataset of 11,745 UAV patches with four semantic classes. On this dataset and the public LoveDA benchmark, DCA-DeepLab attained the highest mIoU among the compared methods (51.74% and 51.71%), exceeding the strongest cotton baseline by 1.10 percentage points. Relative to DeepLabv3+, the Vigorous and Sparse minority-class IoUs improved by 3.51 and 1.91 percentage points, respectively, and Vigorous recall rose from 51.85% to 60.04%, with only 3.9% more parameters. These results show that encoding directional structure and adaptively balancing class contributions benefits fine-grained UAV crop segmentation. Full article

(This article belongs to the Section Drones in Agriculture and Forestry)

18 pages, 2629 KB

Open Access Article

Dual-Guided Semi-Supervised Semantic Segmentation for Citrus Quality Evaluation

by Xufeng Xu, Ruokai Guo, Kai Guo, Zetong Li, Zichao Wei and Xiuqin Rao

Foods 2026, 15(11), 2029; https://doi.org/10.3390/foods15112029 - 5 Jun 2026

Viewed by 276

Abstract

Automated defect detection in precision agriculture serves as a critical technology for enhancing the quality of agricultural products. Although supervised-only semantic segmentation has demonstrated remarkable performance in citrus surface defect detection, it relies heavily on training with large-scale labeled data, which results in prohibitive acquisition costs. Semi-supervised learning mitigates reliance on labeled data by generating pseudo-labels. However, existing semi-supervised segmentation methods still face challenges. On the one hand, the instability of pseudo-labels and the propagation of noise can mislead the training of semi-supervised models. On the other hand, due to the lack of semantic constraints in feature learning, models often suffer from insufficient feature discriminability when handling complex samples, such as citrus surface defects characterized by similar textures and blurred boundaries. Therefore, this study proposes UP-ETS, a dual-guided semi-supervised semantic segmentation model based on the Mean Teacher–Student framework, specifically designed for the segmentation of complex citrus surface defects. UP-ETS employs Uncertainty Estimation (UE) based on Kullback–Leibler (KL) divergence to quantify the prediction discrepancy between the teacher and student models on blurred and ambiguous pixels. This mechanism guides the model to dynamically adjust weights, thereby reducing noise propagation and enhancing pseudo-label stability under complex citrus surface textures. Prototype Contrastive Learning (PCL) is utilized to align pixel-level features of difficult samples with class prototypes, optimizing the feature discriminability for complex citrus surfaces. Experimental results demonstrate that the UP-ETS model exhibits superior semi-supervised segmentation performance. Notably, at a labeled data ratio of only 1/16, the Dice score improved from 85.57% to 87.76% compared to the supervised-only baseline. Furthermore, the model shows significant performance enhancements in segmenting difficult samples, such as small targets, complex boundaries, and blurred regions. The results of ablation studies and t-SNE visualization demonstrate the effectiveness of the proposed UE and PCL. These two methods synergistically guide the model to construct a feature space that is better structured and highly discriminative. In addition, UP-ETS outperforms various representative semi-supervised segmentation models in terms of segmentation performance, parameter count, and inference speed. In cross-dataset validation, the model exhibits robust generalization capabilities, achieving performance comparable to supervised-only methods trained on the full augmented dataset. Consequently, the framework introduced in this study effectively mitigates the heavy dependency on annotated datasets, providing significant practical value for agricultural deployment.
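
As a rough illustration of KL-based uncertainty weighting, the sketch below computes the divergence between teacher and student class probabilities at one pixel and maps it to a weight for the pseudo-label loss. Two choices here are assumptions and not details from the abstract: the direction of the divergence (teacher relative to student) and the `exp(-KL)` mapping, which is one form used in uncertainty-rectified pseudo-labeling. Both function names are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(
    p: Sequence[float], q: Sequence[float], eps: float = 1e-8
) -> float:
    """KL(p || q) between two discrete class distributions at one pixel."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def uncertainty_weight(
    teacher_probs: Sequence[float], student_probs: Sequence[float]
) -> float:
    """Down-weight pixels where teacher and student predictions disagree.

    Returns 1.0 when the two distributions match and decays toward 0
    as their divergence grows.
    """
    return math.exp(-kl_divergence(teacher_probs, student_probs))
```

Multiplying each pixel's pseudo-label loss by this weight lets confident, consistent pixels dominate training while ambiguous boundary pixels contribute less.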

(This article belongs to the Section Food Engineering and Technology)

36 pages, 10912 KB

Open Access Article

Waterbody Extraction from the Perspective of RGB+X Semantic Segmentation

by Zhechen Yang, Wangrui Zhang, Qi Zhang, Zongbao Hong, Danjie Cheng, Qiao Xu, Yan Meng, Yangjie Sun and Yuxuan Liu

Remote Sens. 2026, 18(11), 1824; https://doi.org/10.3390/rs18111824 - 3 Jun 2026

Viewed by 390

Abstract

Waterbody extraction is of great significance for water resource investigation and monitoring. In addition to RGB bands, most common satellite images have a near-infrared (NIR) band. By combining these RGB-NIR bands, certain water, vegetation, and shadow indices can be calculated. The near-infrared band and these indices are very similar to the X modality in RGB+X data (common examples include RGB-D and RGB-Thermal). However, at present, no studies have thoroughly examined multimodal feature fusion from the RGB+X perspective in order to extract waterbodies with high precision. As a result, existing algorithms do not fully utilize satellite image information and have limited generalization ability. To overcome this limitation, we propose a dual-complexity backbone for waterbody extraction from the perspective of RGB+X data semantic segmentation. Its complex Transformer branch is used to extract RGB modality features, while its simple CNN branch is used to extract X modality features. This network structure can effectively capture multimodal, global, and local features in remote sensing images. It can also fully leverage the fact that the scale of RGB image datasets in computer vision is significantly larger than that of remote sensing waterbody extraction datasets. If a large pretrained model is used in the RGB branch, it is unnecessary to freeze the weights. Instead, both branches can be trained jointly, allowing the RGB branch to better adapt to the remote sensing waterbody extraction task without raising concerns that fine-tuning might undermine the pretrained model’s strong representation capability. We also propose two X modality configurations with strong generalization performance. To fully fuse multimodal features, we design a hybrid fusion module combining a CNN and a cross-attention mechanism. To integrate the multi-scale features, we employ a multi-scale Transformer structure in the RGB branch and design a multi-scale decoder. 
Our algorithm achieves state-of-the-art performance on the GID-5 dataset and competitive performance on the S1S2-Water dataset. Furthermore, it significantly outperforms existing methods in cross-dataset zero-shot transfer between the two datasets, with IoU/F1-score gains of 26.08%/27.33% on GID-5 and 38.74%/31.37% on S1S2-Water over previous SOTA methods. Our processing paradigm of modeling RGB-NIR remote sensing images as RGB+X data shows potential for generalization to other multi-modal remote sensing tasks. The dual-complexity backbone we design also has potential to be extended to other tasks that transfer large pretrained RGB models to remote sensing imagery with RGB-NIR four bands or even more spectral bands. We have open-sourced the code and trained models used in this research. Full article
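
The abstract does not list which indices form the X modality. As a hedged example, the sketch below computes two standard ones from per-pixel reflectances, the Normalized Difference Water Index (green versus NIR) and the Normalized Difference Vegetation Index (NIR versus red), and stacks them with the NIR band. The function `build_x_modality` and its channel order are purely illustrative and are not the paper's configuration.

```python
from typing import List


def ndwi(green: float, nir: float, eps: float = 1e-12) -> float:
    """Normalized Difference Water Index: (G - NIR) / (G + NIR)."""
    return (green - nir) / (green + nir + eps)


def ndvi(red: float, nir: float, eps: float = 1e-12) -> float:
    """Normalized Difference Vegetation Index: (NIR - R) / (NIR + R)."""
    return (nir - red) / (nir + red + eps)


def build_x_modality(red: float, green: float, nir: float) -> List[float]:
    """Stack NIR and two indices into a per-pixel X vector (illustrative)."""
    return [nir, ndwi(green, nir), ndvi(red, nir)]
```

Open water typically reflects little NIR, so its NDWI tends to be positive, while healthy vegetation reflects NIR strongly and tends toward negative NDWI and high NDVI. That contrast is what makes such channels useful as an auxiliary input alongside RGB.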

(This article belongs to the Special Issue Foundation Model-Based Multi-Modal Data Fusion in Remote Sensing)

23 pages, 2981 KB

Open Access Article

Hybrid Transformer Model with Augmentation for Kidney Tumor Segmentation

by Rajagopal Kumaraswamy, V. Sheeja Kumari, N. Muthuvairavan Pillai, R. H. Aswathy, Vijayalakshmi Ramakumar and Indra Neel Pulidindi

Computers 2026, 15(6), 359; https://doi.org/10.3390/computers15060359 - 2 Jun 2026

Viewed by 240

Abstract

Precise segmentation of kidney tumors in medical images is crucial for diagnosis, treatment planning, and prognosis assessment. In this work, we present a hybrid deep learning model that combines the strengths of the U-Net and Swin Transformer architectures to enhance segmentation performance. Although U-Net offers strong spatial localization thanks to its hierarchical encoder–decoder structure, it struggles to capture global context. The Swin Transformer, in contrast, captures long-range dependencies and assists in local detail extraction, but its attention pooling may smear fine boundary details. These complementary limitations motivate our hybrid integration: we extend U-Net with Swin Transformer blocks in the encoder path of the backbone to perform multi-scale semantic feature extraction efficiently while preserving structural consistency. We trained and cross-validated the model on the publicly available Kidney Tumor Segmentation Challenge 2021 (KiTS21) dataset with extensive data augmentation and custom loss functions to address class imbalance and boundary ambiguity. Experiments demonstrated that the hybrid model outperformed standalone models trained with a similar multi-task learning objective, including U-Net, the Swin Transformer, and other baseline architectures, in terms of the average Dice similarity coefficient (DSC), intersection over union (IoU), and Hausdorff distance. The proposed model achieved a DSC of 0.91, an IoU of 0.87, a PR-AUC of 0.89, and an overall voxel-wise accuracy of 98%, demonstrating robust and precise kidney tumor segmentation across varying tumor sizes and shapes. Moreover, the integrated solution is more robust and generalizes better, particularly in challenging cases with diverse anatomical variations. These findings demonstrate the power of Transformer-based hybrid models for medical image segmentation. Our results have positive implications for the design of computer-aided diagnostic systems and for other common medical imaging tasks beyond organ-specific or pathology-focused segmentation.
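
For reference, the two overlap metrics reported above can be computed from a pair of binary masks as sketched below. The function name `dice_and_iou` and the convention of returning 1.0 when both masks are empty are assumptions for this illustration.

```python
from typing import Sequence, Tuple


def dice_and_iou(
    pred: Sequence[int], target: Sequence[int]
) -> Tuple[float, float]:
    """Dice similarity coefficient and IoU for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_count = sum(1 for p in pred if p)
    target_count = sum(1 for t in target if t)
    union = pred_count + target_count - intersection
    # Convention (assumed here): two empty masks count as a perfect match.
    dice = (
        2 * intersection / (pred_count + target_count)
        if (pred_count + target_count)
        else 1.0
    )
    iou = intersection / union if union else 1.0
    return dice, iou
```

For a single mask pair the two metrics are linked by IoU = DSC / (2 - DSC), so they rank predictions identically. Values averaged over many cases need not satisfy this identity.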

(This article belongs to the Special Issue AI in Bioinformatics)

23 pages, 7625 KB

Open Access Article

MultiDecNet: An Ensemble-Based Semantic Segmentation Architecture for Urban Scene Understanding

by Büşra Emek Soylu and Mehmet Serdar Güzel

Information 2026, 17(6), 540; https://doi.org/10.3390/info17060540 - 1 Jun 2026

Viewed by 308

Abstract

Semantic segmentation is a fundamental task in computer vision that aims to assign a categorical label to each pixel in an image, facilitating dense and detailed scene understanding. This pixel-level classification is especially crucial in autonomous driving, where accurate environmental perception is vital for dependable object detection and safe decision-making. In this study, we propose MultiDecNet, a novel multi-decoder semantic segmentation framework designed to capture both macroscopic scene layouts and fine-grained spatial boundaries in complex urban environments. Drawing inspiration from classical networks, MultiDecNet incorporates a parallel dual-branch decoding strategy that simultaneously leverages the multi-scale context modeling of the Pyramid Pooling Module (PPM) and the structural refinement capabilities of Atrous Spatial Pyramid Pooling (ASPP). To explore the impact of modern backbone representations, we structurally modernize the feature extraction pipeline by introducing the contemporary ConvNeXt convolutional architecture as an alternative to traditional ResNet101 backbones. We extensively evaluate and compare the baseline configurations alongside our proposed MultiDecNet using both ResNet101 and ConvNeXt-Large backbones on the benchmark Cityscapes dataset. The quantitative assessments demonstrate that the MultiDecNet architecture consistently provides highly competitive performance within the scope of this comparative study, with the MultiDecNet-ConvNeXt variant achieving favorable overall scores among the evaluated methods. Furthermore, a granular, class-wise IoU and training dynamics analysis reveals that while traditional networks retain competitive boundaries for localized minority targets, the modern ConvNeXt backbone ensures faster convergence stability and balanced contextual mastery over large-scale driving layouts. 
Ultimately, these findings offer critical insights into architectural synergy and backbone selection, presenting a robust, scalable, and well-balanced solution for advanced autonomous navigation systems. Full article
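
The pooling step at the heart of a Pyramid Pooling Module can be sketched for a single-channel feature map as follows. The bin sizes `(1, 2, 3, 6)` follow the commonly used PSPNet setting and are an assumption with respect to MultiDecNet, as are the function names. A full module would also apply a 1x1 convolution to each pooled map, upsample it, and concatenate it with the input features, which this sketch omits.

```python
from typing import Dict, List, Sequence

Grid = List[List[float]]


def adaptive_avg_pool(feature: Sequence[Sequence[float]], bins: int) -> Grid:
    """Average-pool a 2D map into a bins x bins grid of region means."""
    height, width = len(feature), len(feature[0])
    pooled: Grid = []
    for i in range(bins):
        # Region bounds: floor for the start, ceiling for the end.
        r0 = (i * height) // bins
        r1 = ((i + 1) * height + bins - 1) // bins
        row: List[float] = []
        for j in range(bins):
            c0 = (j * width) // bins
            c1 = ((j + 1) * width + bins - 1) // bins
            cells = [feature[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pyramid_pool(
    feature: Sequence[Sequence[float]],
    bin_sizes: Sequence[int] = (1, 2, 3, 6),
) -> Dict[int, Grid]:
    """Pool the same map at several scales, from global to fine context."""
    return {bins: adaptive_avg_pool(feature, bins) for bins in bin_sizes}
```

The coarsest level summarizes the whole scene in one value, while finer levels keep progressively more spatial layout. Atrous Spatial Pyramid Pooling pursues a similar multi-scale goal with dilated convolutions instead of pooling.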

(This article belongs to the Special Issue Computer Vision for Security Applications, 2nd Edition)

34 pages, 3154 KB

Open Access Article

PF-CMNet: Progressive Frequency-Aware Cross-Modal Network with Missing-Modality Distillation for 3D Brain Tumor Segmentation

by Haokun Wang, Shuyi Wang, Yuqi Li, Xinrong Miao and Chenyi Cao

Brain Sci. 2026, 16(6), 588; https://doi.org/10.3390/brainsci16060588 - 29 May 2026

Viewed by 180

Abstract

Background/Objectives: Accurate automatic segmentation of multimodal magnetic resonance imaging (MRI) is essential for neurosurgical planning and image-guided procedures. However, existing three-dimensional segmentation models often struggle with low lesion-to-tissue contrast, ambiguous tumor boundaries, small enhancing tumor regions, and performance degradation caused by missing imaging modalities. This study aimed to develop a robust segmentation framework that improves cross-modal representation learning, boundary recovery, and segmentation performance under incomplete-input conditions. Methods: We propose PF-CMNet, a Progressive Frequency-Aware Cross-Modal Network with Missing-Modality Distillation for three-dimensional brain tumor segmentation. The network introduces a Cross-Modal Selective Frequency Attention module in the early encoder stage to model modality-specific frequency responses and spatially adaptive cross-modal correlations. A Progressive Cross-Scale Detail Fusion decoder is further employed to aggregate multilevel semantic features and refine high-resolution boundary details. To enhance robustness under missing-modality conditions, a teacher–student distillation strategy transfers full-modality predictions and shallow feature knowledge to a student network trained with random modality dropout. Results: On the MSD Task01_BrainTumour dataset, PF-CMNet achieved an average Dice score of 84.3%, with Dice scores of 79.6%, 82.8%, and 90.4% for enhancing tumor, tumor core, and whole tumor, respectively. On the BraTS2021 dataset, the model achieved an average Dice score of 88.2% and the lowest average 95th percentile Hausdorff distance among the compared methods. 
In predefined complete-modality absence stress tests, where unavailable MRI sequences were zero-masked to model the absence of input modalities rather than partial image degradation, the distilled model maintained average Dice scores of 78.64%, 82.58%, 58.39%, 82.03%, and 79.29% when FLAIR, T1, T1ce, T2, and T1 + T2 were unavailable, respectively. Conclusions: PF-CMNet provides a unified framework for multimodal brain tumor segmentation, improving full-modality segmentation accuracy, boundary consistency, and robustness to incomplete MRI inputs while maintaining a favorable accuracy–efficiency trade-off.
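
The two evaluation ideas in this abstract, the Dice score and zero-masking of an unavailable MRI sequence, can be illustrated with a minimal sketch. This is not the PF-CMNet code: the function names, the flat-list mask representation, and the dictionary of modality channels are assumptions made for illustration only.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two flat binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def zero_mask_modalities(volume, missing):
    """Return a copy of `volume` with every modality named in `missing` set to zeros.

    `volume` is a hypothetical dict mapping a modality name (e.g. "T1ce")
    to a flat list of voxel intensities.
    """
    return {
        name: [0.0] * len(voxels) if name in missing else list(voxels)
        for name, voxels in volume.items()
    }


if __name__ == "__main__":
    print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
    scan = {"FLAIR": [0.5, 0.2], "T1": [0.1, 0.9]}
    print(zero_mask_modalities(scan, {"T1"}))
```

Zeroing a whole channel, as opposed to adding noise or blur, models a sequence that was never acquired, which is the distinction the abstract draws between modality absence and partial image degradation.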

(This article belongs to the Section Computational Neuroscience, Neuroinformatics, and Neurocomputing)

20 pages, 6540 KB

Open AccessArticle

SRDFNet: Semantic Refinement and Differential Features for High-Resolution Change Detection

by Wenbo Zhao, Donghua Lu, Yingjun Zhao and Keyue Chen

Sensors 2026, 26(11), 3427; https://doi.org/10.3390/s26113427 - 28 May 2026

Viewed by 345

Abstract

To address misclassification and reduced accuracy in semantic change detection caused by class imbalance and variable object sizes, this paper improves BGSNet and proposes a new change detection network, SRDFNet (Semantic Refinement and Differential Features). Building on BGSNet’s framework, SRDFNet introduces three complementary modules: (1) a hierarchical graph module (HGM) that converts multi-scale feature maps into compact semantic graph nodes and uses graph attention for intra-layer and cross-level semantic interaction to enhance topological relationship perception; by compacting multi-scale features into semantic nodes, the HGM mitigates the effects of class imbalance; (2) a difference enhancement (DE) module that extracts multi-receptive-field difference information from bi-temporal concatenated features via multi-scale parallel convolution branches; (3) a semantic refinement (SR) module that performs lightweight residual refinement on bi-temporal semantic features to improve segmentation accuracy. The DE and SR modules mitigate the degradation in semantic segmentation accuracy caused by variable object sizes. SRDFNet is trained and tested alongside BGSNet and three other models on the SECOND and HRSCD datasets. On the SECOND dataset, SRDFNet achieves 87.64% (OA), 70.31% (mIoU), 20.36% (SeK), 60.25% (F1) and 65.27% (recall), gains of 1.34%, 0.73%, 1.44%, 0.81% and 2.72%, respectively, over BGSNet. On the HRSCD dataset, SRDFNet achieves 98.13% (OA), 52.67% (mIoU), 73.77% (SeK), 88.86% (F1) and 88.18% (recall), ranking first among the four methods, with gains of 3.96%, 3.93%, 9.69%, 2.33% and 4.00%, respectively, over BGSNet.
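
Two of the indicators reported above, overall accuracy (OA) and mean Intersection over Union (mIoU), are both derived from a class confusion matrix. The sketch below shows the standard definitions under stated assumptions: labels are integer class ids in flat lists, the function names are hypothetical, and the paper's exact evaluation protocol (including SeK) is not reproduced.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of pixels on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def mean_iou(matrix):
    """Average of TP / (TP + FP + FN) over classes that occur at least once."""
    ious = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(row[c] for row in matrix) - tp
        denom = tp + fp + fn
        if denom:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
    print(overall_accuracy(cm), mean_iou(cm))
```

Because mIoU averages per-class scores, a rare class weighs as much as a dominant one, which is why OA can stay high on an imbalanced dataset while mIoU remains much lower.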

(This article belongs to the Section Remote Sensors)

24 pages, 20331 KB

Open AccessArticle

Fine-Grained Perception and Spatial Heterogeneity Analysis of Streetscapes Within Beijing’s 5th Ring Road Based on a Multi-Task Fine-Tuning Framework

by Yuhe Hu, Haiming Qin, Nan Chen, Linhe Song, Shuo Wang and Weiqi Zhou

Sustainability 2026, 18(11), 5256; https://doi.org/10.3390/su18115256 - 23 May 2026

Viewed by 324

Abstract

Deep learning-powered Street View Imagery (SVI) analytics provides a critical mechanism for smart city perception within the framework of Sustainable Development Goal 11 (SDG 11), effectively bridging the gap left by traditional remote sensing in fine-grained street-level observation. Over the years, deep learning-based semantic segmentation of urban streetscapes has become the dominant paradigm. However, when scaling to megacity measurements, current research faces the dual bottlenecks of “computational redundancy” and the “geographical domain shift” caused by the blind application of pre-trained models based on Western datasets. To address these challenges, this study is the first to systematically quantify the performance trade-off between Multi-Task Learning (MTL) and Single-Task Learning (STL) in megacity scenarios. Using this as a baseline, we constructed and validated a “low-computation, high-robustness” framework for streetscape semantic perception and spatial measurement. Relying on an integrated ResNeXt101-FPN MTL architecture and an ultra-low-cost fine-tuning strategy to overcome geographical domain shift, we extracted and analyzed the spatial heterogeneity of five core semantic elements—vegetation, sky, building, road, and vehicle—across the road network within Beijing’s 5th Ring Road. The results indicate the following: (1) We explicitly defined the computation-accuracy trade-off of MTL and STL in megacity perception. While utilizing only 1/5 of the parameters of STL, the MTL framework achieved a 5.34-fold increase in inference speed with a negligible 0.1% loss in overall mean Intersection over Union (mIoU); however, a 27.13% decrease in boundary segmentation accuracy was observed. (2) We established a low-cost, localized correction paradigm to overcome domain shift. 
Fine-tuning at minimal annotation cost (only 200 local images) improved cross-domain adaptability, boosting the overall mIoU by 8.92% and mitigating the geographical domain shift problem. (3) Multi-dimensional measurement and spatial analysis revealed a significant spatial decoupling pattern in Beijing’s streetscapes. The visual proportion of vegetation exhibited a pronounced “north-high, south-low” spatial differentiation, whereas built environment elements (e.g., building and road) displayed a typical “center-periphery” concentric gradient. This pattern reflects the spatial inequality of urban street greenery resources and the monocentric development characteristics of the built environment. The proposed framework therefore serves as a low-cost, AI-driven computational paradigm for smart city perception in resource-constrained regions. Furthermore, the revealed spatial heterogeneity offers data-driven insights for formulating sustainable urban renewal policies aligned with SDG 11.

(This article belongs to the Special Issue Leveraging AI and Deep Learning for Smart Cities: Challenges, Opportunities, and Applications to Sustainable Development)

22 pages, 3271 KB

Open AccessArticle

TextureCLIP: Cross-Dataset Zero-Shot Texture Anomaly Segmentation with Triadic Descriptive Prompting

by Xin Peng Ooi and Seong G. Kong

Electronics 2026, 15(10), 2220; https://doi.org/10.3390/electronics15102220 - 21 May 2026

Viewed by 284

Abstract

Texture anomaly segmentation aims to localize irregularities on textured surfaces, a task critical for industrial quality control. Supervised methods require extensive labeled data, while unsupervised approaches often struggle to generalize to unseen target domains. Recent zero-shot methods based on vision-language models such as Contrastive Language-Image Pretraining (CLIP) enable anomaly detection through text prompts without target-domain training data. However, existing approaches typically rely on generic prompts and show limited sensitivity to fine-grained texture variations. To address these limitations, we propose TextureCLIP, a cross-dataset zero-shot framework with auxiliary training for texture anomaly segmentation. The framework is trained on source texture data from the MVTec AD texture subset using annotated source-domain samples and directly evaluated on six unseen target datasets without access to target-domain training images, annotations, or fine-tuning. The proposed Triadic Descriptive Prompting (TriDP) integrates normal prompts, generic anomaly prompts, and descriptive anomaly prompts to provide complementary semantic cues for improved cross-domain generalization. To enhance spatial sensitivity, Dual Attention Modules (DAMs) are incorporated into the CLIP image encoder to refine local feature representations. In addition, Softmax-Weighted Averaging (SMWA) aggregates multiple anomaly cues by emphasizing the prompt responses with higher similarity scores. Experimental results demonstrate that TextureCLIP achieves strong and consistent performance across diverse texture datasets, attaining 67.06% AP and 65.69% F1-max, with improvements of 5.17 and 2.66 percentage points over the competitive baselines, respectively.
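
The abstract describes Softmax-Weighted Averaging only at a high level, so the following is a sketch of one plausible reading rather than the authors' implementation: the similarity scores of several anomaly prompts at a single location are combined with weights given by a softmax over those same scores. The `temperature` parameter and the function names are assumptions introduced for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def softmax_weighted_average(scores, temperature=1.0):
    """Combine per-prompt similarity scores, weighting higher scores more.

    `temperature` is a hypothetical knob: small values approach the maximum
    score, large values approach the plain arithmetic mean.
    """
    weights = softmax(scores, temperature)
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    prompt_scores = [0.2, 0.8]  # similarities for two anomaly prompts at one pixel
    print(softmax_weighted_average(prompt_scores))
```

Under this reading the aggregate always lies between the plain mean and the maximum of the prompt responses, so one strongly matching descriptive prompt is not diluted by several weakly matching ones.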

(This article belongs to the Section Artificial Intelligence)
