Search Results (82)

Search Parameters:
Keywords = SAM semantic segmentation

33 pages, 172200 KB  
Article
HDCGAN+: A Low-Illumination UAV Remote Sensing Image Enhancement and Evaluation Method Based on WPID
by Kelly Chen Ke, Min Sun, Xinyi Wang, Dong Liu and Hanjun Yang
Remote Sens. 2026, 18(7), 999; https://doi.org/10.3390/rs18070999 (registering DOI) - 26 Mar 2026
Abstract
Remote sensing images acquired by UAVs under nighttime or low-illumination conditions suffer from insufficient illumination, leading to degraded image quality, detail loss, and noise, which restrict their application in public security and disaster emergency scenarios. Although existing machine learning-based enhancement methods can recover part of the missing information, they often cause color distortion and texture inconsistency. This study proposes an improved low-illumination image enhancement method based on a Weakly Paired Image Dataset (WPID), combining the Hierarchical Deep Convolutional Generative Adversarial Network (HDCGAN) with a low-rank image fusion strategy to enhance the quality of low-illumination UAV remote sensing images. First, YCbCr color channel separation is applied to preserve color information from visible images. Then, a Low-Rank Representation Fusion Network (LRRNet) is employed to perform structure-aware fusion between thermal infrared (TIR) and visible images, thereby enabling effective preservation of structural details and realistic color appearance. Furthermore, a weakly paired training mechanism is incorporated into HDCGAN to enhance detail restoration and structural fidelity. To achieve objective evaluation, a structural consistency assessment framework is constructed based on semantic segmentation results from the Segment Anything Model (SAM). Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches in both visual quality and application-oriented evaluation metrics.
(This article belongs to the Section Remote Sensing Image Processing)
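For reference, a minimal sketch of the YCbCr color-separation step mentioned in the abstract, using OpenCV. The learned HDCGAN/LRRNet enhancement is not reproduced; `enhance_luma` is a hypothetical placeholder for it, and the input path is illustrative.

```python
# Sketch: enhance only the luminance channel so chrominance (color) is preserved.
# OpenCV uses the YCrCb channel order; the paper refers to the same YCbCr space.
import cv2
import numpy as np

def enhance_luma(y: np.ndarray) -> np.ndarray:
    # Placeholder: simple gamma lift standing in for the learned GAN-based enhancement.
    return np.clip(255.0 * (y / 255.0) ** 0.6, 0, 255).astype(np.uint8)

def enhance_low_light(bgr: np.ndarray) -> np.ndarray:
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    y_enhanced = enhance_luma(y)
    # Recombine the enhanced luminance with the original chrominance channels.
    return cv2.cvtColor(cv2.merge([y_enhanced, cr, cb]), cv2.COLOR_YCrCb2BGR)

if __name__ == "__main__":
    img = cv2.imread("low_light_uav.png")  # hypothetical input path
    cv2.imwrite("enhanced.png", enhance_low_light(img))
```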

25 pages, 3105 KB  
Article
SAMS-Net: A Stage-Decoupled Semantic Segmentation Network for Forest Fire Detection
by Yuxin Tan, Jiazhe An, Yabin Wang, Zhun Li, Jia Gao and Fuxing Yu
Appl. Sci. 2026, 16(7), 3144; https://doi.org/10.3390/app16073144 - 24 Mar 2026
Viewed by 35
Abstract
High-precision and real-time monitoring of forest fires is a critical requirement in disaster prevention and mitigation. During fire evolution, significant stage-wise variations occur, which make it difficult for conventional semantic segmentation models to simultaneously achieve robust multi-scale feature extraction and strong interference resistance. To address this issue, this paper proposes a stage-aware multi-head segmentation network, termed SAMS-Net. The proposed network decouples fire-stage recognition from pixel-level segmentation and employs a Hard-Switch Routing mechanism to dynamically activate the stage-specific decoder that matches the current fire phase during inference, while pruning irrelevant branches to reduce computational redundancy. Experimental results show that SAMS-Net achieves 76.16% mIoU, 81.30% Dice, and 90.31% PA, outperforming mainstream segmentation models such as FCN, U-Net++, DeepLabV3, and YOLOv9-Seg. In challenging stages, particularly the early and recession phases, the segmentation performance improves by more than 10% compared with the second-best model. Meanwhile, the proposed method maintains high accuracy with a real-time inference speed of 75.8 FPS. These results support the effectiveness of SAMS-Net for flame-and-ember foreground segmentation on the constructed multi-stage forest-fire benchmark dataset. Broader generalization across independent datasets and real-world deployment scenarios will be further investigated in future work.
(This article belongs to the Section Computing and Artificial Intelligence)
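For readers unfamiliar with the reported metrics, here is how mIoU, Dice, and pixel accuracy (PA) are conventionally computed from a per-class confusion matrix. This is a generic sketch of the standard definitions, not the authors' evaluation code.

```python
# mIoU, mean Dice, and pixel accuracy from an integer label map pair.
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    mask = (gt >= 0) & (gt < num_classes)
    return np.bincount(
        num_classes * gt[mask].astype(int) + pred[mask].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)

def segmentation_metrics(cm: np.ndarray):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as class c but labelled otherwise
    fn = cm.sum(axis=1) - tp   # labelled class c but predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)
    dice = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    pixel_acc = tp.sum() / cm.sum()
    return iou.mean(), dice.mean(), pixel_acc  # mIoU, mean Dice, PA
```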

25 pages, 13685 KB  
Article
Vision and Language Reference for a Segment Anything Model for Few-Shot Segmentation
by Kosuke Sakurai, Ryotaro Shimizu and Masayuki Goto
J. Imaging 2026, 12(4), 143; https://doi.org/10.3390/jimaging12040143 - 24 Mar 2026
Viewed by 143
Abstract
Segment Anything Model (SAM)-based few-shot segmentation models traditionally rely solely on annotated reference images as prompts, which inherently limits their accuracy due to an over-reliance on visual cues and a lack of semantic context. This reliance leads to incorrect segmentation, where visually similar objects from different categories are incorrectly identified as the target object. We propose Vision and Language Reference Prompt into SAM (VLP-SAM), a novel few-shot segmentation model that integrates both visual information of reference images and semantic information of text labels into SAM. VLP-SAM introduces a vision-language model (VLM) with pixel–text matching into the prompt encoder for SAM, effectively leveraging textual semantic consistency while preserving SAM’s extensive segmentation knowledge. By incorporating task-specific structures such as an attention mask, our model achieves superior few-shot segmentation performance with only 1.4 M learnable parameters. Evaluations on PASCAL-5i and COCO-20i datasets demonstrate that VLP-SAM significantly outperforms previous methods by 6.8% and 9.3% in mIoU, respectively. Furthermore, VLP-SAM exhibits strong generalization across unseen objects and cross-domain scenarios, highlighting the robustness provided by textual semantic guidance. This study offers an effective and scalable framework for few-shot segmentation with multimodal prompts.
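To illustrate the pixel–text matching idea at its simplest: cosine similarity between a text-label embedding and per-pixel visual embeddings yields a dense map that a prompt encoder can consume alongside reference-image features. The tensor shapes and the pooling of the result are assumptions; this is not the VLP-SAM implementation.

```python
# Dense pixel-text similarity as a prompt signal (generic sketch).
import torch
import torch.nn.functional as F

def pixel_text_similarity(pixel_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """pixel_feats: [B, C, H, W] from a VLM image encoder; text_emb: [B, C] for the class label."""
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    # Per-pixel cosine similarity in [-1, 1], returned as a one-channel map.
    return torch.einsum("bchw,bc->bhw", pixel_feats, text_emb).unsqueeze(1)

# Example: a [1, 1, 64, 64] similarity map for a single image and text label.
sim = pixel_text_similarity(torch.randn(1, 512, 64, 64), torch.randn(1, 512))
```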

25 pages, 8404 KB  
Article
Ladder-Side-Tuning of Visual Foundation Model for City-Scale Individual Tree Detection from High-Resolution Remote Sensing Images
by Chen Huang, Ying Ding, Kun Xiao, Rong Liu and Ying Sun
Remote Sens. 2026, 18(5), 819; https://doi.org/10.3390/rs18050819 - 6 Mar 2026
Viewed by 196
Abstract
Accurate detection of individual trees is essential for urban forest management and ecological assessment, yet remains challenging due to the heterogeneous backgrounds, variable sizes of tree crowns, and significant variations across urban scenarios. To address these issues, we propose Tree-SAM, a city-scale individual tree detection architecture built upon the visual foundation model Segment Anything Model (SAM) and equipped with three task-specific modules, i.e., Cross-Correlation Feature Backbone (CCFB), Hierarchical Instance Aggregation Neck (HIAN), and Context-Aware Adaptation Head (CAAH). These modules synergistically fuse general semantics with fine-grained structural cues, enable multi-scale feature aggregation, and adaptively refine predictions based on specific scene contexts. On the GZ-Tree Crown dataset, Tree-SAM achieves F1-scores of 0.762, 0.732, and 0.830, with corresponding AP@50 values of 0.478, 0.454, and 0.526 in the forest, mixed, and urban scenarios, respectively, consistently ranking first across all scenes and demonstrating strong adaptability to diverse intra-city landscapes. Additional evaluations on the BAMFORESTS dataset and the SZ-Dataset further confirm its robustness across varied geographic contexts. Tree-SAM provides a reliable, automated framework for large-scale urban tree mapping, supporting urban forest management, carbon stock estimation, and ecological assessment.
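The F1 and AP@50 figures above are instance-detection metrics; a simplified sketch of the standard greedy IoU matching behind detection F1 at an IoU threshold of 0.5 follows. It is a generic illustration, not the authors' evaluation protocol, and assumes predictions are pre-sorted by confidence.

```python
# Greedy matching of predicted boxes to ground-truth boxes at an IoU threshold.
import numpy as np

def box_iou(a, b):
    ax1, ay1, ax2, ay2 = a; bx1, by1, bx2, by2 = b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def detection_f1(preds, gts, iou_thr=0.5):
    matched, tp = set(), 0
    for p in preds:  # preds sorted by confidence, highest first
        ious = [box_iou(p, g) if i not in matched else 0.0 for i, g in enumerate(gts)]
        if ious and max(ious) >= iou_thr:
            matched.add(int(np.argmax(ious)))
            tp += 1
    precision = tp / max(len(preds), 1)
    recall = tp / max(len(gts), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```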

18 pages, 9422 KB  
Article
A SAM2-Driven RGB-T Annotation Pipeline with Thermal-Guided Refinement for Semantic Segmentation in Search-and-Rescue Scenes
by Andrés Salas-Espinales, Ricardo Vázquez-Martín and Anthony Mandow
Modelling 2026, 7(2), 50; https://doi.org/10.3390/modelling7020050 - 4 Mar 2026
Viewed by 377
Abstract
High-quality RGB–thermal infrared (RGB-T) semantic segmentation datasets are crucial for search-and-rescue (SAR) applications, yet their development is hindered by the scarcity of annotated ground-truth and by the challenges of thermal-camera calibration, which typically depends on heated targets with limited geometric definition. Recent approaches focus on using semantic segmentation annotation tools and transferring RGB masks to multi-spectral data, but they do not fully address the need for robust cross-modal geometric validation, quality control, or human-in-the-loop reliability assessment in RGB-T segmentation. To fill this gap, we propose a validated cross-modal annotation pipeline that combines deep correspondence matching, geometric transformation (affine or homography) of RGB-T pairs, and quantitative alignment validation. Our RGB-T pipeline integrates a semi-automatic annotation pipeline based on the Segment Anything Model 2 (SAM2) in Label Studio, with guided human refinement, and incorporates quantitative cost and quality control via inter-annotator agreement before being used in downstream model training. Results across three annotators show that the proposed approach reduces annotation time by 36% while achieving high annotation quality (mean IoU = 74.9%) and strong inter-annotator agreement (mean pixel accuracy = 74.3%, Cohen’s κ = 65%). The proposed pipeline was applied to annotate a SAR-oriented RGB-T dataset comprising 306 image pairs, which was then used to train two SOTA RGB-T segmentation models. These findings demonstrate the practical value of the proposed methodology and establish a reproducible framework for generating reliable RGB-T semantic segmentation datasets, complementing and extending recent multispectral auto-labeling approaches.
(This article belongs to the Section Modelling in Artificial Intelligence)
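The inter-annotator agreement figures reported above (pixel accuracy and Cohen's κ) follow standard definitions; a generic sketch of how they can be computed between two annotators' label maps is given below. This is not the authors' code.

```python
# Per-pixel agreement and Cohen's kappa between two integer label maps.
import numpy as np

def annotator_agreement(a: np.ndarray, b: np.ndarray, num_classes: int):
    """a, b: non-negative integer label maps of identical shape from two annotators."""
    a, b = a.ravel(), b.ravel()
    po = float(np.mean(a == b))  # observed agreement = pixel accuracy
    # Expected chance agreement from each annotator's label frequencies.
    pa = np.bincount(a, minlength=num_classes) / a.size
    pb = np.bincount(b, minlength=num_classes) / b.size
    pe = float(np.sum(pa * pb))
    kappa = (po - pe) / (1.0 - pe) if pe < 1.0 else 1.0
    return po, kappa
```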

30 pages, 50903 KB  
Article
A Realistic Instance-Level Data Augmentation Method for Small-Object Detection Based on Scene Understanding
by Chuwei Li, Zhilong Zhang, Ping Zhong and Jun He
Remote Sens. 2026, 18(4), 647; https://doi.org/10.3390/rs18040647 - 20 Feb 2026
Viewed by 400
Abstract
Instance-level data augmentation methods, exemplified by “copy-paste”, serve as a conventional strategy for improving the performance of small object detectors. The core idea involves leveraging background redundancy by compositing object instances with suitable backgrounds—drawn either from the same image or from different images—to increase both the quantity and diversity of training samples. However, existing methods often struggle with mismatches in background, scale, illumination, and viewpoint between instances and backgrounds. More critically, their predominant reliance on background information, without a joint understanding of instance-background characteristics, results in augmented images lacking visual realism. Empirical studies have demonstrated that such unrealistic images not only fail to improve detection performance but can even be detrimental. To tackle this problem, we propose a scene-understanding-driven approach that systematically addresses these mismatches via joint instance-background understanding. This is achieved through a unified framework that integrates image inpainting, image tagging, open-set object detection, the Segment Anything Model (SAM), and pose estimation to jointly model instance attributes, background semantics, and their interrelationships, thereby abandoning the random operation paradigm of existing methods and synthesizing highly realistic augmented images while preserving data diversity. On the VisDrone dataset, our method improves the mAP@0.5:0.95 and mAP@0.5 of the baseline detector by 1.6% and 2.2%, respectively. Both quantitative gains and qualitative visualizations confirm that the systematic resolution of these mismatches directly translates into significantly higher visual realism and detection performance improvements.
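For context, the basic "copy-paste" compositing step that the paper's scene-understanding pipeline refines (with background, scale, illumination, and viewpoint checks) can be sketched as follows. All names are illustrative, and only the naive paste is shown.

```python
# Naive copy-paste: composite a masked instance crop onto a background image.
import numpy as np

def paste_instance(background: np.ndarray, instance: np.ndarray,
                   mask: np.ndarray, top: int, left: int) -> np.ndarray:
    """background: [H, W, 3]; instance: [h, w, 3]; mask: [h, w] binary; (top, left): paste origin."""
    out = background.copy()
    h, w = mask.shape
    region = out[top:top + h, left:left + w]          # view into the output image
    region[mask.astype(bool)] = instance[mask.astype(bool)]
    return out
```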

25 pages, 15267 KB  
Article
3D Semantic Map Reconstruction for Orchard Environments Using Multi-Sensor Fusion
by Quanchao Wang, Yiheng Chen, Jiaxiang Li, Yongxing Chen and Hongjun Wang
Agriculture 2026, 16(4), 455; https://doi.org/10.3390/agriculture16040455 - 15 Feb 2026
Viewed by 594
Abstract
Semantic point cloud maps play a pivotal role in smart agriculture. They provide not only core three-dimensional data for orchard management but also empower robots with environmental perception, enabling safer and more efficient navigation and planning. However, traditional point cloud maps primarily model surrounding obstacles from a geometric perspective, failing to capture distinctions and characteristics between individual obstacles. In contrast, semantic maps encompass semantic information and even topological relationships among objects in the environment. Furthermore, existing semantic map construction methods are predominantly vision-based, making them ill-suited to handle rapid lighting changes in agricultural settings that can cause positioning failures. Therefore, this paper proposes a positioning and semantic map reconstruction method tailored for orchards. It integrates visual, LiDAR, and inertial sensors to obtain high-precision pose and point cloud maps. By combining open-vocabulary detection and semantic segmentation models, it projects detected two-dimensional semantic information onto the three-dimensional point cloud, ultimately generating a point cloud map enriched with semantic information. The resulting 2D occupancy grid map is utilized for robotic motion planning. Experimental results demonstrate that on a custom dataset, the proposed method achieves 74.33% mIoU for semantic segmentation accuracy, 12.4% relative error for fruit recall rate, and 0.038803 m mean translation error for localization. The deployed semantic segmentation network Fast-SAM achieves a processing time of 13.36 ms per frame. These results demonstrate that the proposed method combines high accuracy with real-time performance in semantic map reconstruction. This exploratory work provides theoretical and technical references for future research on more precise localization and more complete semantic mapping, offering broad application prospects and providing key technological support for intelligent agriculture.
(This article belongs to the Special Issue Advances in Robotic Systems for Precision Orchard Operations)
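The core geometric step of projecting 2D semantic labels onto a 3D point cloud with a pinhole camera model can be sketched as below. The intrinsics matrix and variable names are illustrative assumptions; this is not the authors' implementation.

```python
# Assign a 2D semantic label to each 3D point visible in the camera frame.
import numpy as np

def label_points(points_cam: np.ndarray, label_map: np.ndarray, K: np.ndarray) -> np.ndarray:
    """points_cam: [N, 3] points in the camera frame; label_map: [H, W] int labels; K: 3x3 intrinsics."""
    labels = np.full(len(points_cam), -1, dtype=int)   # -1 = no label assigned
    z = points_cam[:, 2]
    valid = z > 0                                       # points in front of the camera
    uv = (K @ points_cam[valid].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).round().astype(int)   # perspective division -> pixel coords
    h, w = label_map.shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    idx = np.flatnonzero(valid)[inside]
    labels[idx] = label_map[uv[inside, 1], uv[inside, 0]]
    return labels
```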

24 pages, 4394 KB  
Article
A Code-Conforming Computer Vision Framework for Visual Inspection of Reinforced and Prestressed Concrete Bridges
by Giuseppe Santarsiero, Valentina Picciano, Nicola Ventricelli and Angelo Masi
Sensors 2026, 26(4), 1242; https://doi.org/10.3390/s26041242 - 14 Feb 2026
Viewed by 380
Abstract
The assessment of structural degradation in reinforced concrete bridges is a crucial task for infrastructure maintenance and safety. Traditional inspection methods are often time-consuming and dependent on expert interpretation and weather conditions. This study explores the potential of artificial intelligence to support inspectors in the detection of typical deterioration patterns in reinforced (RC) and prestressed concrete (PRC) bridges, developing the VIADUCT (Visual Inspection and Automated Damage Understanding by Computer vision Techniques) software tool. Unlike previous studies, which focus only on a limited variety of possible defects (e.g., cracks, water stains), this study aims to train a deep learning model able to recognise a larger range of defects, such as those foreseen by the current Italian code for the assessment of existing bridges. The methodology relies on the YOLOv8n object detection model, which was trained, validated, and tested using a dataset of 1045 wide-angle or detailed photographs taken during routine inspections. Because these kinds of images are challenging for object detection algorithms (they include large parts of the background), multimodal attention mechanisms were implemented in the Graphical User Interface (GUI) through the semantic segmentation of the bridge surface using both SAM and a U-Net model, as well as a tile reduction approach. These attention mechanisms allow the object detection model to focus on the relevant portions of the image (i.e., the bridge), while suppressing background information. Despite the limitation of the small dataset used for training, results showed promising detection capabilities and precision. Furthermore, VIADUCT is ready to accept and use newer and more efficient versions of the object detection model as soon as they become available.
(This article belongs to the Section Sensing and Imaging)
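A minimal illustration of the background-suppression idea described above: pixels outside the bridge-surface mask are zeroed before detection, so the detector attends only to the structure. The mask source (SAM or U-Net) and the ultralytics call in the comment are assumptions, not the VIADUCT implementation.

```python
# Suppress background pixels using a binary bridge-surface mask before detection.
import numpy as np

def suppress_background(image: np.ndarray, bridge_mask: np.ndarray) -> np.ndarray:
    """image: [H, W, 3]; bridge_mask: [H, W] binary mask (1 = bridge surface)."""
    return image * bridge_mask[..., None].astype(image.dtype)

# Hypothetical usage with a YOLOv8 model (ultralytics):
# from ultralytics import YOLO
# model = YOLO("yolov8n.pt")
# results = model(suppress_background(img, mask))
```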

21 pages, 18124 KB  
Article
Integrating Dynamic Representation and Multi-Priors for Transnasal Intubation via Visual Foundation Model
by Jinyu Liu, Yang Zhou, Ruoyi Hao, Mingying Li, Yang Zhang and Hongliang Ren
Bioengineering 2026, 13(2), 217; https://doi.org/10.3390/bioengineering13020217 - 13 Feb 2026
Viewed by 443
Abstract
Accurate and real-time glottis localization is critical for ensuring intraoperative oxygenation and patient safety during nasotracheal intubation. However, representative foundation models exemplified by the Segment Anything Model exhibit notable limitations in medical applications, stemming from their rigid attention mechanisms, feature space misalignment, and insufficient generalization to complex glottal anatomies. To address these challenges, we propose Glottis-SAM, a lightweight and task-adaptive segmentation framework that integrates dynamic representation learning with multi-prior contextual modeling. Specifically, we introduce a hierarchical low-rank adaptation strategy that enables efficient fine-tuning of visual foundation models by preserving geometric priors while significantly reducing computational overhead. To further enhance semantic fusion and generalization, we design a feature aggregation module with dual-path dynamic feature pyramids, which enables complementary optimization from local textures to global semantic structures under varying anatomical conditions. Extensive experiments on three diverse datasets demonstrate that Glottis-SAM achieves state-of-the-art segmentation accuracy with 72.6% mDice, a compact 55.2 MB model size, and 44.3 FPS inference speed on clinical data. These results highlight the model’s robustness, efficiency, and potential for deployment in visual guidance systems for nasotracheal intubation.
(This article belongs to the Section Biosignal Processing)

18 pages, 4326 KB  
Article
DCS: A Zero-Shot Anomaly Detection Framework with DINO-CLIP-SAM Integration
by Yan Wan, Yingqi Lang and Li Yao
Appl. Sci. 2026, 16(4), 1836; https://doi.org/10.3390/app16041836 - 12 Feb 2026
Viewed by 510
Abstract
Recent progress in foundation models such as CLIP and SAM has shown great potential for zero-shot anomaly detection. However, existing methods usually rely on generic descriptions such as “abnormal”, whose semantic coverage is insufficient to express fine-grained anomaly semantics. In addition, CLIP primarily performs global-level alignment and struggles to accurately locate minor defects, while the segmentation quality of SAM is highly dependent on prompt constraints. To solve these problems, we propose DCS, a unified framework that integrates Grounding DINO, CLIP, and SAM through three key innovations. First, we introduce FinePrompt for adaptive learning, which significantly enhances the modeling of anomaly semantics by building a fine-grained anomaly description library and adopting learnable text embeddings. Second, we design an Adaptive Dual-path Cross-modal Interaction (ADCI) module to achieve more effective cross-modal information exchange through dual-path fusion. Finally, we propose a Box-Point Prompt Combiner (BPPC), which combines the box priors provided by DINO with the point prompts generated by CLIP to guide SAM toward finer and more complete segmentation results. Extensive experiments demonstrate the effectiveness of our method: on the MVTec-AD and VisA datasets, DCS achieves state-of-the-art zero-shot anomaly detection results.
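In the spirit of the Box-Point Prompt Combiner, a sketch of querying SAM with both a box prior and a point prompt is shown below, using the public `segment_anything` predictor API. The checkpoint path, placeholder image, box, and point coordinates are illustrative assumptions; this is not the DCS implementation.

```python
# Combine a box prior (e.g., from Grounding DINO) with a point prompt when querying SAM.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # hypothetical checkpoint path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder RGB image
predictor.set_image(image)

masks, scores, _ = predictor.predict(
    box=np.array([120, 80, 260, 210]),      # XYXY box prior
    point_coords=np.array([[190, 140]]),    # point prompt (e.g., from a CLIP anomaly map)
    point_labels=np.array([1]),             # 1 = foreground point
    multimask_output=False,
)
```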

26 pages, 46620 KB  
Article
PolyGeom: Geometry-Aware Graph Transformer for Building Polygon Extraction in Remote Sensing Images
by Hongbin Pei, Huiyuan Li, Xufan Hou, Bo Yang and Huiqi Deng
Remote Sens. 2026, 18(4), 551; https://doi.org/10.3390/rs18040551 - 9 Feb 2026
Viewed by 375
Abstract
Building polygon extraction is a critical task in remote sensing analysis and a fundamental component of modern urban management. Conventional segmentation-based methods often suffer from geometric distortions during the conversion from masks to polygons. End-to-end polygon prediction approaches (e.g., PolyWorld) alleviate this issue by directly predicting building polygons; however, existing PolyWorld-like methods remain limited in accurate corner vertex detection and polygon reasoning due to insufficient representation learning, particularly for geometry. In this work, we propose PolyGeom, an end-to-end framework equipped with a geometry-aware graph transformer for accurate and robust building polygon extraction. PolyGeom employs the Segment Anything Model (SAM) as its backbone to leverage large-scale pretrained features, thereby capturing both local and global semantics. Moreover, we propose a geometry-aware graph transformer that explicitly models the geometry of building polygons, facilitating more reliable polygon reasoning. Extensive experiments on three challenging benchmarks, the CrowdAI, WHU, and BONAI datasets, demonstrate that PolyGeom consistently outperforms existing methods in terms of building detection accuracy, topology correctness, and geometry alignment. Ablation studies further validate the effectiveness of the two key proposed designs in building polygon extraction.

25 pages, 62812 KB  
Article
From Prompts to Self-Prompts: Parameter-Efficient Multi-Label Remote Sensing via Mask-Guided Classification
by Ge Qu, Xiongwei Guan, Fei Wen and Xinyu Zou
Remote Sens. 2026, 18(3), 518; https://doi.org/10.3390/rs18030518 - 5 Feb 2026
Viewed by 329
Abstract
Multi-label remote sensing scene classification (MLRSSC) requires autonomous discovery of all relevant land-cover categories without human guidance. Conventional expert classifiers return only label vectors without spatial evidence, while foundation segmenters (e.g., SAM, RemoteSAM) remain passively dependent on external prompts—misaligned with autonomous interpretation. We introduce SAFI-XRS, a parameter-efficient self-prompted framework that transforms passive prompting into active scene parsing. By training only <2% of a 332M-parameter segmenter (∼2.4M parameters), SAFI-XRS generates class-aligned queries from images via a Semantic Query Generator (SQR), replacing external prompts with self-generated conditioning. A Mask-Guided Classifier (MGC) aggregates spatial evidence into label confidences, enabling mask-based explainability. Experiments on UCM-ML, DFC15-ML, and AID-ML show SAFI-XRS surpasses text-prompted foundation segmenters (+3.9/+3.8 mAP on balanced datasets) while achieving 6.8× parameter efficiency compared to expert models, validating a practical path toward autonomous, explainable RS scene understanding.
(This article belongs to the Section AI Remote Sensing)
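The mask-guided classification idea, aggregating per-class spatial evidence into image-level label confidences, can be sketched generically as below. The tensor shapes and the top-k pooling choice are assumptions, not the SAFI-XRS implementation.

```python
# Pool per-class mask probabilities into multi-label confidences.
import torch

def masks_to_label_confidences(mask_probs: torch.Tensor, top_frac: float = 0.01) -> torch.Tensor:
    """mask_probs: [B, C, H, W] per-class probabilities; returns [B, C] label confidences."""
    b, c, h, w = mask_probs.shape
    flat = mask_probs.reshape(b, c, h * w)
    k = max(1, int(top_frac * h * w))
    # Average the top-k most confident pixels per class, which is robust to small objects.
    return flat.topk(k, dim=-1).values.mean(dim=-1)
```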

22 pages, 11216 KB  
Article
A Multi-Scale Remote Sensing Image Change Detection Network Based on Vision Foundation Model
by Shenbo Liu, Dongxue Zhao and Lijun Tang
Remote Sens. 2026, 18(3), 506; https://doi.org/10.3390/rs18030506 - 4 Feb 2026
Viewed by 565
Abstract
As a key technology in the intelligent interpretation of remote sensing, remote sensing image change detection aims to automatically identify surface changes from images of the same area acquired at different times. Although vision foundation models have demonstrated outstanding capabilities in image feature representation, their inherent patch-based processing and global attention mechanisms limit their effectiveness in perceiving multi-scale targets. To address this, we propose a multi-scale remote sensing image change detection network based on a vision foundation model, termed SAM-MSCD. This network integrates an efficient parameter fine-tuning strategy with a cross-temporal multi-scale feature fusion mechanism, significantly improving change perception accuracy in complex scenarios. Specifically, the Low-Rank Adaptation mechanism is adopted for parameter-efficient fine-tuning of the Segment Anything Model (SAM) image encoder, adapting it for the remote sensing change detection task. A bi-temporal feature interaction module (BIM) is designed to enhance the semantic alignment and the modeling of change relationships between feature maps from different time phases. Furthermore, a change feature enhancement module (CFEM) is proposed to fuse and highlight differential information from different levels, achieving precise capture of multi-scale changes. Comprehensive experimental results on four public remote sensing change detection datasets, namely LEVIR-CD, WHU-CD, NJDS, and MSRS-CD, demonstrate that SAM-MSCD surpasses current state-of-the-art (SOTA) methods on several key evaluation metrics, including the F1-score and Intersection over Union (IoU), indicating its broad prospects for practical application.
(This article belongs to the Section AI Remote Sensing)
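Low-Rank Adaptation (LoRA), the parameter-efficient fine-tuning mechanism applied above to the SAM image encoder, freezes a pretrained weight and learns only a low-rank update. A generic PyTorch sketch follows; layer names in the usage comment are illustrative, and this is not the SAM-MSCD code.

```python
# LoRA applied to a frozen linear layer: y = W x + (alpha / r) * B A x.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # frozen pretrained weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example (illustrative attribute name): wrap the qkv projection of one attention block.
# block.attn.qkv = LoRALinear(block.attn.qkv, rank=8)
```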

20 pages, 17064 KB  
Article
PriorSAM-DBNet: A SAM-Prior-Enhanced Dual-Branch Network for Efficient Semantic Segmentation of High-Resolution Remote Sensing Images
by Qiwei Zhang, Yisong Wang, Ning Li, Quanwen Jiang and Yong He
Sensors 2026, 26(2), 749; https://doi.org/10.3390/s26020749 - 22 Jan 2026
Viewed by 420
Abstract
Semantic segmentation of high-resolution remote sensing imagery is a critical technology for the intelligent interpretation of sensor data, supporting automated environmental monitoring and urban sensing systems. However, processing data from dense urban scenarios remains challenging due to sensor signal occlusions (e.g., shadows) and the complexity of parsing multi-scale targets from optical sensors. Existing approaches often exhibit a trade-off between the accuracy of global semantic modeling and the precision of complex boundary recognition. While the Segment Anything Model (SAM) offers powerful zero-shot structural priors, its direct application to remote sensing is hindered by domain gaps and the lack of inherent semantic categorization. To address these limitations, we propose a dual-branch cooperative network, PriorSAM-DBNet. The main branch employs a Densely Connected Swin (DC-Swin) Transformer to capture cross-scale global features via a hierarchical shifted window attention mechanism. The auxiliary branch leverages SAM’s zero-shot capability to exploit structural universality, generating object-boundary masks as robust signal priors while bypassing semantic domain shifts. Crucially, we introduce a parameter-efficient Scaled Subsampling Projection (SSP) module that employs a weight-sharing mechanism to align cross-modal features, freezing the massive SAM backbone to ensure computational viability for practical sensor applications. Furthermore, a novel Attentive Cross-Modal Fusion (ACMF) module is designed to dynamically resolve semantic ambiguities by calibrating the global context with local structural priors. Extensive experiments on the ISPRS Vaihingen, Potsdam, and LoveDA-Urban datasets demonstrate that PriorSAM-DBNet outperforms state-of-the-art approaches. By fine-tuning only 0.91 million parameters in the auxiliary branch, our method achieves mIoU scores of 82.50%, 85.59%, and 53.36%, respectively. The proposed framework offers a scalable, high-precision solution for remote sensing semantic segmentation, particularly effective for disaster emergency response where rapid feature recognition from sensor streams is paramount.
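A sketch of turning SAM's zero-shot masks into a single-channel object-boundary prior, the kind of structural signal the auxiliary branch described above consumes, is given below. It uses the public `SamAutomaticMaskGenerator` API; the checkpoint path is illustrative, and this is not the PriorSAM-DBNet implementation.

```python
# Build an object-boundary prior map from SAM's automatically generated masks.
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # hypothetical checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam)

def boundary_prior(image: np.ndarray) -> np.ndarray:
    """image: HxWx3 uint8 RGB; returns an HxW float map marking mask boundaries."""
    masks = mask_generator.generate(image)
    prior = np.zeros(image.shape[:2], dtype=np.float32)
    for m in masks:
        seg = m["segmentation"]                      # boolean HxW mask
        edges = np.zeros_like(seg)
        edges[:-1, :] |= seg[:-1, :] != seg[1:, :]   # vertical transitions
        edges[:, :-1] |= seg[:, :-1] != seg[:, 1:]   # horizontal transitions
        prior = np.maximum(prior, edges.astype(np.float32))
    return prior
```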

25 pages, 4064 KB  
Article
Application of CNN and Vision Transformer Models for Classifying Crowns in Pine Plantations Affected by Diplodia Shoot Blight
by Mingzhu Wang, Christine Stone and Angus J. Carnegie
Forests 2026, 17(1), 108; https://doi.org/10.3390/f17010108 - 13 Jan 2026
Viewed by 403
Abstract
Diplodia shoot blight, caused by an opportunistic fungal pathogen, affects many conifer species and has a global distribution. Depending on the duration and severity of the disease, affected needles appear yellow (chlorotic) for a brief period before becoming red or brown in colour. These symptoms can occur on individual branches or over the entire crown. Aerial sketch-mapping or the manual interpretation of aerial photography for tree health surveys are labour-intensive and subjective. Recently, however, the application of deep learning (DL) techniques to detect and classify tree crowns in high-spatial-resolution imagery has gained significant attention. This study evaluated two complementary DL approaches for the detection and classification of Pinus radiata trees infected with diplodia shoot blight across five geographically dispersed sites with varying topographies over two acquisition years: (1) object detection using YOLOv12 combined with the Segment Anything Model (SAM) and (2) pixel-level semantic segmentation using U-Net, SegFormer, and EVitNet. The three damage classes for the object detection approach were ‘yellow’, ‘red-brown’ (both whole-crown discolouration) and ‘dead tops’ (partially discoloured crowns), while for the semantic segmentation the three classes were yellow, red-brown, and background. The YOLOv12m model achieved an overall mAP50 score of 0.766 and mAP50–95 of 0.447 across all three classes, with red-brown crowns demonstrating the highest detection accuracy (mAP50: 0.918, F1 score: 0.851). For semantic segmentation models, SegFormer showed the strongest performance (IoU of 0.662 for red-brown and 0.542 for yellow) but at the cost of the longest training time, while EVitNet offered the most cost-effective solution, achieving accuracy comparable to SegFormer with superior training efficiency thanks to its lighter architecture. The accurate identification and classification of crown damage symptoms support the calibration and validation of satellite-based monitoring systems and assist in the prioritisation of ground-based diagnosis or management interventions.
(This article belongs to the Section Forest Health)
