MDPI - Publisher of Open Access Journals

31 pages, 3694 KB

Open Access Article

Transformer-Based Individual Tree Crown Detection from Canopy Height Models with Cross-Domain and Self-Supervised Pretraining

by Josué Gourde, Baoxin Hu and Qian Li

Remote Sens. 2026, 18(11), 1674; https://doi.org/10.3390/rs18111674 - 22 May 2026

Abstract

Individual tree crown (ITC) detection from remotely sensed data is fundamental to forest inventory and ecological monitoring, but deep learning approaches remain constrained by limited labelled training data. We systematically evaluate three transformer detectors (the Detection Transformer (DETR), Deformable DETR, and DETR with Improved DeNoising Anchor Boxes (DINO)) paired with two backbones, ImageNet-pretrained ResNet-50 and a Masked Autoencoder (MAE) pretrained on unlabelled Canopy Height Model (CHM) data. These are benchmarked against a classical local maximum and watershed pipeline and Faster R-CNN across four test sets spanning boreal, temperate mixed-wood, and diverse North American forest types at 0.25–1.0 m resolution. Spatially held-out test regions with a one-patch buffer band eliminate sliding-window leakage; headline configurations are reported as mean ± standard deviation across three random seeds. With multi-resolution MAE pretraining, the practical lower bound for non-degenerate single-dataset transformer detection lies between ∼200 and ∼1200 patches. Without MAE pretraining, DETR fails at every dataset size we tested. Multi-dataset joint training reaches $F_{1} = 0.84 \pm 0.01$ on the boreal test set and 0.45–0.68 across the temperate mixed-wood and NEON test sets, while Faster R-CNN narrowly wins on the smallest training pool. Standard DETR with ResNet-50 collapses regardless of the length of the training schedule, but the same architecture with an MAE backbone reaches $F_{1} = 0.83 \pm 0.01$ at that schedule, showing that DETR’s reported instability is conditional on the combination of backbone initialization and training budget rather than architectural. Resolution and backbone interact: ResNet-50 wins at 0.25 m, and MAE wins at 0.5–1.0 m, consistent with the eight-pixel MAE patch matching crown scale only at coarser resolutions.
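The abstract reports scores as mean ± standard deviation of F1 across three random seeds. The sketch below illustrates that reporting convention only. The per-seed counts are hypothetical and not taken from the article, and the use of the sample (n − 1) standard deviation is an assumption, since the abstract does not say which estimator was used.

```python
from statistics import mean, stdev


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)


# Hypothetical (TP, FP, FN) detection counts for three seeds.
seed_counts = [(80, 20, 20), (82, 18, 18), (78, 22, 22)]
scores = [f1_score(*counts) for counts in seed_counts]
m, s = summarize(scores)
print(f"F1 = {m:.2f} ± {s:.2f}")
```

With only three seeds the sample standard deviation is itself a rough estimate, so small differences between configurations should be read with that in mind.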

(This article belongs to the Special Issue AI-Driven Forestry Remote Sensing: Datasets, Models, Analysis and Applications)

28 pages, 15951 KB

Open Access Article

Local–Global Aware Concept Bottleneck Models for Interpretable Image Classification

by Ci Liu, Zijie Lin and Chen Tang

Sensors 2026, 26(6), 1833; https://doi.org/10.3390/s26061833 - 14 Mar 2026

Viewed by 636

Abstract

Concept Bottleneck Models facilitate interpretable image classification by predicting human-understandable concepts prior to class labels. However, when constructed upon CLIP, they exhibit unreliable concept scores stemming from CLIP’s global representation bias and insufficient region-level sensitivity, which severely constrain their effectiveness in sensor-driven applications like remote sensing and medical imaging where localized visual evidence is critical. To mitigate this, we propose the Local–Global Aware Concept Bottleneck Model (LGA-CBM), which improves concept prediction through a training-free refinement pipeline. Building on initial CLIP-derived concept scores, LGA-CBM incorporates three key components: a Dual Masking Guided Concept Score Refinement (DMCSR) module that exploits attention weights to strengthen region–concept alignment; a Local-to-Global Concept Reidentification (L2GCR) strategy to harmonize local and global activations; and a Similar Concepts Correction Mechanism (SCCM) integrating Grounding DINO for fine-grained disambiguation. A sparse linear layer then maps the refined concepts to class labels, enabling highly interpretable classification with minimal concept usage. Experiments across six benchmark datasets demonstrate that LGA-CBM consistently achieves state-of-the-art performance in both accuracy and interpretability, producing explanations that align closely with human cognition.

(This article belongs to the Special Issue AI for Emerging Image-Based Sensor Applications)

23 pages, 10939 KB

Open Access Article

Virtual Try-on-Based Data Augmentation for Robust Person Re-Identification in Emergency Surveillance Scenarios

by Pei Wang, Jiaming Liu, Yuyao Cao and Hui Zhang

Fire 2026, 9(3), 116; https://doi.org/10.3390/fire9030116 - 5 Mar 2026

Viewed by 905

Abstract

Person Re-identification (Re-ID) plays an important role in dynamic evacuation path planning and safety monitoring. However, rapid appearance changes and limited long-term surveillance data significantly degrade model robustness in emergency scenarios. To address this issue, a virtual try-on-based data augmentation framework is proposed for person Re-ID. A prompt-based automatic clothing mask generation (PACMG) module integrating Grounding DINO and the Segment Anything Model (SAM) is developed to improve clothing mask accuracy under low-resolution, occlusion, and complex background conditions. A tiered augmentation strategy is further designed to alleviate identity-level imbalance. Experimental results demonstrate that the proposed method increases the clothing replacement validity rate from 52% to 73.61% while preserving identity consistency and distribution stability, as verified through multi-level analyses. When the augmented data are incorporated into the training set, consistent improvements in Rank-1 accuracy and mAP are observed on a ResNet-50-based person Re-ID benchmark. These results indicate that the augmented data enhance robustness to appearance variation, providing practical support for robust person tracking in evacuation scenarios.

(This article belongs to the Special Issue Fire Safety Technology and Intelligent Evacuation)

16 pages, 18592 KB

Open Access Article

A Framework for Nuclei and Overlapping Cytoplasm Segmentation with MaskDino and Hausdorff Distance

by Baocan Zhang, Xiaolu Jiang, Wei Zhao and Shixiao Xiao

Symmetry 2026, 18(2), 218; https://doi.org/10.3390/sym18020218 - 23 Jan 2026

Viewed by 566

Abstract

Accurate segmentation of nuclei and cytoplasm in cervical cytology images plays a pivotal role in characterizing cellular morphology. The primary challenge is to precisely delineate boundaries within densely clustered cells, which is complicated by low-contrast edges and irregular morphologies. This paper introduces a novel framework combining the MaskDino architecture with a Hausdorff distance loss, enhanced by a two-phase training strategy. The method begins by employing MaskDino for precise nucleus segmentation. Building on this foundation, the framework then enhances cytoplasmic boundary detection in cellular clusters by incorporating a Hausdorff distance loss, with weight transfer initialization ensuring feature consistency across tasks. The symmetry between the nucleus and cytoplasm serves as a key morphological indicator for cell assessment, and our method provides a reliable basis for such analysis. Extensive experiments demonstrate that our method achieves state-of-the-art cytoplasm segmentation results on the ISBI2014 dataset, with absolute improvements of 2.9% in DSC, 1.6% in TPRp and 2.0% in FNRo. The performance of nucleus segmentation is better than the average level. These results validate the proposed framework’s effectiveness for improving cervical cancer screening through robust cellular segmentation.

(This article belongs to the Section Computer)

21 pages, 2785 KB

Open Access Article

Multimodal-Based Selective De-Identification Framework

by Dae-Jin Kim

Electronics 2025, 14(19), 3896; https://doi.org/10.3390/electronics14193896 - 30 Sep 2025

Viewed by 982

Abstract

Selective de-identification is a key technology for protecting sensitive objects in visual data while preserving meaningful information. This study proposes a framework that leverages text prompt-based zero-shot and referring object detection techniques to accurately identify and selectively de-identify sensitive objects without relying on predefined classes. By utilizing state-of-the-art models such as GroundingDINO, objects are detected based on natural language prompts, and de-identification (via blurring or masking) is applied only to the corresponding regions, thereby minimizing information loss while achieving a high level of privacy protection. Experimental results demonstrate that the proposed method outperforms conventional batch de-identification approaches in terms of scalability and flexibility.
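The region-restricted masking step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's implementation: the boxes are assumed to come from some prompt-based detector, and the `mask_regions` helper, the half-open `(x0, y0, x1, y1)` box format, and the grayscale nested-list image are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel ranges


def mask_regions(image: List[List[int]], boxes: Sequence[Box], fill: int = 0):
    """Return a copy of a 2-D grayscale image with every box filled by `fill`.

    Pixels outside the boxes are left untouched, which is the point of
    selective (as opposed to whole-frame) de-identification.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [row[:] for row in image]  # do not modify the caller's image
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image bounds before filling.
        x0, x1 = max(0, x0), min(width, x1)
        y0, y1 = max(0, y0), min(height, y1)
        for y in range(y0, y1):
            for x in range(x0, x1):
                out[y][x] = fill
    return out


# Hypothetical 4x4 image and one detected box covering the centre 2x2 block.
frame = [[9] * 4 for _ in range(4)]
masked = mask_regions(frame, [(1, 1, 3, 3)])
for row in masked:
    print(row)
```

Replacing the constant fill with a local blur would follow the same pattern: compute the blurred values only for pixels inside the clipped box.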

(This article belongs to the Special Issue Recent Advances in Security and Privacy for Multimedia Systems)

29 pages, 6246 KB

Open Access Article

DASeg: A Domain-Adaptive Segmentation Pipeline Using Vision Foundation Models—Earthquake Damage Detection Use Case

by Huili Huang, Andrew Zhang, Danrong Zhang, Max Mahdi Roozbahani and James David Frost

Remote Sens. 2025, 17(16), 2812; https://doi.org/10.3390/rs17162812 - 14 Aug 2025

Viewed by 2045

Abstract

Limited labeled imagery and tight response windows hinder accurate damage quantification for post-disaster assessment. The objective of this study is to develop and evaluate a deep learning-based Domain-Adaptive Segmentation (DASeg) workflow to detect post-disaster damage using limited information available shortly after an event. DASeg unifies three Vision Foundation Models in an automatic workflow: fine-tuned DINOv2 supplies attention-based point prompts, fine-tuned Grounding DINO yields open-set box prompts, and a frozen Segment Anything Model (SAM) generates the final masks. In the earthquake-focused case study DASeg-Quake, the pipeline boosts mean Intersection over Union (mIoU) by 9.52% over prior work and 2.10% over state-of-the-art supervised baselines. In a zero-shot setting, DASeg-Quake achieves an mIoU of 75.03% for geo-damage analysis, closely matching expert-level annotations. These results show that DASeg achieves superior workflow enhancement in infrastructure damage segmentation without needing pixel-level annotation, providing a practical solution for early-stage disaster response.
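For readers unfamiliar with the mIoU metric quoted above, a minimal sketch of one common way to compute it is given here. The flat label lists and the choice to skip classes absent from both prediction and ground truth are assumptions of this example. Evaluation protocols differ on that point, and the abstract does not state which one the article follows.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean Intersection over Union across classes, for flat per-pixel labels.

    Classes that appear in neither `pred` nor `target` are skipped so they
    do not contribute an undefined 0/0 term.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 4-pixel example with two classes (0 = background, 1 = damage).
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(f"mIoU = {mean_iou(pred, target, num_classes=2):.4f}")
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.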

(This article belongs to the Special Issue Machine Learning at the Object: Fine-Grained Extraction and Analysis in Remote Sensing)

17 pages, 2934 KB

Open Access Article

An Improved Small Target Segmentation Model Based on Mask Dino

by Jun Yang, Xu Chen, Yun Guan, Yixuan Hu and Gang Ge

Appl. Sci. 2025, 15(4), 1832; https://doi.org/10.3390/app15041832 - 11 Feb 2025

Viewed by 7560

Abstract

To address the issue of low segmentation accuracy for small objects in the Mask Dino segmentation method, we propose an improved small object segmentation model called FFMask Dino. Initially, we introduce scaled cosine attention and the log-cpb method into the Swin Transformer backbone network. Subsequently, by adjusting the network structure, we enhance the feature extraction process, which helps the model maintain generalization across different datasets and reduces the risk of overfitting. Lastly, we propose the FFPN module to optimize the pathways for feature fusion and transmission. The improved FPN reduces unnecessary computations, accelerates model inference speed, and integrates multi-scale feature details and high-level semantic information to complement object features, thereby enhancing model segmentation accuracy. Experimental results demonstrate that the improved segmentation model achieves a mean Intersection over Union (mIoU) of 42.15% on the ADE20K dataset for semantic segmentation tasks, representing a 0.96% increase compared to the Mask Dino method. On the COCO dataset for instance segmentation tasks, with the Swin Transformer backbone, the Mask AP and Box AP are 47.10 and 52.60, respectively, showing improvements of 1% and 1.3% over the Mask Dino method. With the ResNet-50 backbone, the Mask AP and Box AP are 40.00 and 44.10, respectively, with improvements of 0.5% and 0.9% over the Mask Dino method. For the COCO dataset’s panoptic segmentation tasks, with the Swin Transformer backbone, the PQ is 54.95, showing a 0.4% increase over the Mask Dino method. With the ResNet-50 backbone, the PQ is 46.93, showing a 0.9% increase over the Mask Dino method. These results effectively demonstrate the improved accuracy and precision of Mask Dino in segmenting small objects across various segmentation tasks.

25 pages, 7107 KB

Open Access Article

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

by Nermeen Abou Baker, David Rohrschneider and Uwe Handmann

Mach. Learn. Knowl. Extr. 2024, 6(4), 2783-2807; https://doi.org/10.3390/make6040133 - 2 Dec 2024

Cited by 9 | Viewed by 11440

Abstract

Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of-the-art results across numerous tasks. However, the substantial increase in parameters introduces a need for parameter-efficient training strategies. Despite significant advancements, limited research has explored parameter-efficient fine-tuning (PEFT) methods in the context of transformer-based models for instance segmentation. Addressing this gap, this study investigates the effectiveness of PEFT methods, specifically adapters and Low-Rank Adaptation (LoRA), applied to two models across four benchmark datasets. Integrating sequentially arranged adapter modules and applying LoRA to deformable attention (explored here for the first time) achieves competitive performance while fine-tuning only about 1–6% of model parameters, a marked improvement over the 40–55% required in traditional fine-tuning. Key findings indicate that using 2–3 adapters per transformer block offers an optimal balance of performance and efficiency. Furthermore, LoRA exhibits strong parameter efficiency when applied to deformable attention, and in certain cases surpasses adapter configurations. These results show that the impact of PEFT techniques varies based on dataset complexity and model architecture, underscoring the importance of context-specific tuning. Overall, this work demonstrates the potential of PEFT to enable scalable, customizable, and computationally efficient transfer learning for instance segmentation tasks.
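To show why LoRA trains so few parameters, the sketch below counts trainable weights for a single linear layer. It uses the standard LoRA formulation, in which a frozen dense weight W (d_out × d_in) is augmented by a trainable low-rank product B·A with B of size d_out × r and A of size r × d_in. The layer sizes and rank are hypothetical, and the per-layer fraction printed here is not comparable to the whole-model 1–6% figure in the abstract, which also depends on which layers are adapted.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Compare a dense d_out x d_in weight with its rank-`rank` LoRA update.

    Returns (dense_params, lora_params, lora_params / dense_params).
    The LoRA scaling factor is a scalar hyperparameter and adds no weights.
    """
    if rank <= 0 or rank > min(d_in, d_out):
        raise ValueError("rank must be in 1..min(d_in, d_out)")
    dense = d_in * d_out            # frozen weight W
    lora = rank * (d_in + d_out)    # trainable factors A (r x d_in) and B (d_out x r)
    return dense, lora, lora / dense


# Hypothetical projection layer: 256 -> 256 with rank 8.
dense, lora, frac = lora_param_counts(d_in=256, d_out=256, rank=8)
print(f"dense: {dense}, LoRA: {lora}, trainable fraction: {frac:.2%}")
```

Because the LoRA count grows linearly in the layer width while the dense count grows quadratically, the trainable fraction shrinks as layers get wider for a fixed rank.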

(This article belongs to the Section Learning)

16 pages, 8101 KB

Open Access Article

Visual Prompt Selection Framework for Real-Time Object Detection and Interactive Segmentation in Augmented Reality Applications

by Eungyeol Song, Doeun Oh and Beom-Seok Oh

Appl. Sci. 2024, 14(22), 10502; https://doi.org/10.3390/app142210502 - 14 Nov 2024

Cited by 1 | Viewed by 3137

Abstract

This study presents a novel visual prompt selection framework for augmented reality (AR) applications that integrates advanced object detection and image segmentation techniques. The framework is designed to enhance user interactions and improve the accuracy of foreground–background separation in AR environments, making AR experiences more immersive and precise. We evaluated six state-of-the-art object detectors (DETR, DINO, CoDETR, YOLOv5, YOLOv8, and YOLO-NAS) in combination with a prompt segmentation model using the DAVIS 2017 validation dataset. The results show that the combination of YOLO-NAS-L and SAM achieved the best performance with a J&F score of 70%, while DINO-scale4-swin had the lowest score of 57.5%. This 12.5-percentage-point performance gap highlights the significant contribution of user-provided regions of interest (ROIs) to segmentation outcomes, emphasizing the importance of interactive user input in enhancing accuracy. Our framework supports fast prompt processing and accurate mask generation, allowing users to refine digital overlays interactively, thereby improving both the quality of AR experiences and overall user satisfaction. Additionally, the framework enables the automatic detection of moving objects, providing a more efficient alternative to traditional manual selection interfaces in AR devices. This capability is particularly valuable in dynamic AR scenarios, where seamless user interaction is crucial.

(This article belongs to the Section Robotics and Automation)

19 pages, 10946 KB

Open Access, Editor’s Choice, Article

Crop Growth Analysis Using Automatic Annotations and Transfer Learning in Multi-Date Aerial Images and Ortho-Mosaics

by Shubham Rana, Salvatore Gerbino, Ehsan Akbari Sekehravani, Mario Brandon Russo and Petronia Carillo

Agronomy 2024, 14(9), 2052; https://doi.org/10.3390/agronomy14092052 - 7 Sep 2024

Cited by 18 | Viewed by 3948

Abstract

Growth monitoring of crops is a crucial aspect of precision agriculture, essential for optimal yield prediction and resource allocation. Traditional crop growth monitoring methods are labor-intensive and prone to errors. This study introduces an automated segmentation pipeline utilizing multi-date aerial images and ortho-mosaics to monitor the growth of cauliflower crops (Brassica oleracea var. botrytis) using an object-based image analysis approach. The methodology employs YOLOv8, a Grounding Detection Transformer with Improved Denoising Anchor Boxes (DINO), and the Segment Anything Model (SAM) for automatic annotation and segmentation. The YOLOv8 model was trained using aerial image datasets, which then facilitated the training of the Grounded Segment Anything Model framework. This approach generated automatic annotations and segmentation masks, classifying crop rows for temporal monitoring and growth estimation. The study’s findings utilized a multi-modal monitoring approach to highlight the efficiency of this automated system in providing accurate crop growth analysis, promoting informed decision-making in crop management and sustainable agricultural practices. The results indicate consistent and comparable growth patterns between aerial images and ortho-mosaics, with significant periods of rapid expansion and minor fluctuations over time. The results also indicate a correlation between the time and the method of observation, which opens the possibility of integrating such techniques to increase the accuracy of crop growth monitoring based on automatically derived temporal crop row segmentation masks.

24 pages, 5085 KB

Open Access Article

Personalized Text-to-Image Model Enhancement Strategies: SOD Preprocessing and CNN Local Feature Integration

by Mujung Kim, Jisang Yoo and Soonchul Kwon

Electronics 2023, 12(22), 4707; https://doi.org/10.3390/electronics12224707 - 19 Nov 2023

Cited by 3 | Viewed by 3117

Abstract

Recent advancements in text-to-image models have been substantial, generating new images based on personalized datasets. However, even within a single category, such as furniture, where the structures vary and the patterns are not uniform, the ability of the generated images to preserve the detailed information of the input images remains unsatisfactory. This study introduces a novel method to enhance the quality of the results produced by text-to-image models. The method utilizes mask preprocessing with an image pyramid-based salient object detection model, incorporates visual information into input prompts using concept image embeddings and a CNN local feature extractor, and includes a filtering process based on similarity measures. When using this approach, we observed both visual and quantitative improvements in CLIP text alignment and DINO metrics, suggesting that the generated images more closely follow the text prompts and more accurately reflect the input image’s details. The significance of this research lies in addressing one of the prevailing challenges in the field of personalized image generation: enhancing the capability to consistently and accurately represent the detailed characteristics of input images in the output. This method enables more realistic visualizations through textual prompts enhanced with visual information, additional local features, and unnecessary area removal using a SOD mask; it can also be beneficial in fields that prioritize the accuracy of visual data.

(This article belongs to the Special Issue Deep Learning-Based Computer Vision: Technologies and Applications)

24 pages, 9298 KB

Open Access Article

High-Quality Object Detection Method for UAV Images Based on Improved DINO and Masked Image Modeling

by Wanjie Lu, Chaoyang Niu, Chaozhen Lan, Wei Liu, Shiju Wang, Junming Yu and Tao Hu

Remote Sens. 2023, 15(19), 4740; https://doi.org/10.3390/rs15194740 - 28 Sep 2023

Cited by 14 | Viewed by 4813

Abstract

The extensive application of unmanned aerial vehicle (UAV) technology has increased academic interest in object detection algorithms for UAV images. Nevertheless, these algorithms present issues such as low accuracy, inadequate stability, and insufficient pre-training model utilization. Therefore, a high-quality object detection method based on a performance-improved object detection baseline and pretraining algorithm is proposed. To fully extract global and local feature information, a hybrid backbone based on the combination of convolutional neural network (CNN) and vision transformer (ViT) is constructed using an excellent object detection method as the baseline network for feature extraction. This backbone is then combined with a more stable and generalizable optimizer to obtain high-quality object detection results. Because the domain gap between natural and UAV aerial photography scenes hinders the application of mainstream pre-training models to downstream UAV image object detection tasks, this study applies the masked image modeling (MIM) method to aerospace remote sensing datasets with a lower volume than mainstream natural scene datasets to produce a pre-training model for the proposed method and further improve UAV image object detection accuracy. Experimental results for two UAV imagery datasets show that the proposed method achieves better object detection performance compared to state-of-the-art (SOTA) methods with fewer pre-training datasets and parameters.

(This article belongs to the Special Issue Advances and Challenges on Multisource Remote Sensing Image Fusion: Datasets, New Technologies, and Applications)

13 pages, 3147 KB

Open Access Article

Image-Based Vehicle Classification by Synergizing Features from Supervised and Self-Supervised Learning Paradigms

by Shihan Ma and Jidong J. Yang

Eng 2023, 4(1), 444-456; https://doi.org/10.3390/eng4010027 - 1 Feb 2023

Cited by 8 | Viewed by 2902

Abstract

This paper introduces a novel approach to leveraging features learned from both supervised and self-supervised paradigms to improve image classification tasks, specifically vehicle classification. Two state-of-the-art self-supervised learning methods, DINO and data2vec, were evaluated and compared for their representation learning of vehicle images. The former contrasts local and global views while the latter uses masked prediction on multiple layered representations. In the latter case, supervised learning is employed to finetune a pretrained YOLOR object detector for detecting vehicle wheels, from which definitive wheel positional features are retrieved. The representations learned from these self-supervised learning methods were combined with the wheel positional features for the vehicle classification task. Particularly, a random wheel masking strategy was utilized to finetune the previously learned representations in harmony with the wheel positional features during the training of the classifier. Our experiments show that the data2vec-distilled representations, which are consistent with our wheel masking strategy, outperformed the DINO counterpart, resulting in a Top-1 classification accuracy of 97.2% for classifying the 13 vehicle classes defined by the Federal Highway Administration.

(This article belongs to the Special Issue Feature Papers in Eng 2022)

Search Results (13)
