Search Results (7)

Search Parameters: Keywords = open-vocabulary semantic segmentation

29 pages, 166274 KB  
Article
Bridging Vision Foundation and Vision–Language Models for Open-Vocabulary Semantic Segmentation of UAV Imagery
by Fan Li, Zhaoxiang Zhang, Xuanbin Wang, Xuan Wang and Yuelei Xu
Remote Sens. 2025, 17(22), 3704; https://doi.org/10.3390/rs17223704 - 13 Nov 2025
Viewed by 1066
Abstract
Open-vocabulary semantic segmentation (OVSS) is of critical importance for unmanned aerial vehicle (UAV) imagery, as UAV scenes are highly dynamic and characterized by diverse, unpredictable object categories. Current OVSS approaches mainly rely on the zero-shot capabilities of vision–language models (VLMs), but their image-level pretraining objectives yield ambiguous spatial relationships and coarse-grained feature representations, resulting in suboptimal performance in UAV scenes. In this work, we propose a novel hybrid framework for OVSS in UAV imagery, named HOSU, which leverages the priors of vision foundation models to unleash the potential of vision–language models in representing complex spatial distributions and capturing fine-grained small-object details in UAV scenes. Specifically, we propose a distribution-aware fine-tuning method that aligns CLIP with DINOv2 across intra- and inter-region feature distributions, enhancing the capacity of CLIP to model complex scene semantics and capture fine-grained details critical for UAV imagery. Meanwhile, we propose a text-guided multi-level regularization mechanism that leverages the text embeddings of CLIP to impose semantic constraints on the visual features, preventing their drift from the original semantic space during fine-tuning and ensuring stable vision–language correspondence. Finally, to address the pervasive occlusion in UAV imagery, we propose a mask-based feature consistency strategy that enables the model to learn stable representations, remaining robust against viewpoint-induced occlusions. Extensive experiments across four training settings on six UAV datasets demonstrate that our approach consistently achieves state-of-the-art performance compared with previous methods, while comprehensive ablation studies and analyses further validate its effectiveness.
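The abstract does not spell out these objectives in formula form, so the following is a minimal PyTorch sketch of how a combined objective of this shape could look: one term aligning CLIP patch-feature similarity structure with DINOv2, one term tying visual features to frozen CLIP text embeddings, and one term enforcing consistency under masking. All function names, loss formulations, and weightings are assumptions for illustration, not the authors' HOSU implementation.

```python
# Illustrative sketch only: plausible stand-ins for the three losses named in the
# abstract, using generic tensor shapes rather than the authors' code.
import torch
import torch.nn.functional as F

def distribution_alignment_loss(clip_feats, dino_feats):
    """Align the pairwise (inter-patch) similarity structure of CLIP with DINOv2."""
    clip_sim = F.normalize(clip_feats, dim=-1) @ F.normalize(clip_feats, dim=-1).T
    dino_sim = F.normalize(dino_feats, dim=-1) @ F.normalize(dino_feats, dim=-1).T
    return F.mse_loss(clip_sim, dino_sim)

def text_regularization_loss(visual_feats, frozen_text_embeds):
    """Keep fine-tuned visual features anchored to the frozen CLIP text space."""
    logits = F.normalize(visual_feats, dim=-1) @ F.normalize(frozen_text_embeds, dim=-1).T
    probs = logits.softmax(dim=-1)
    # Entropy penalty: encourage a peaky, stable vision-language correspondence.
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

def mask_consistency_loss(feats_full, feats_masked):
    """Features from an occluded (masked) view should match the full view."""
    return 1 - F.cosine_similarity(feats_full, feats_masked, dim=-1).mean()

# Toy shapes: 196 patch tokens, 512-dim CLIP features, 768-dim DINOv2 features.
clip_patches = torch.randn(196, 512)
dino_patches = torch.randn(196, 768)
text_embeds = torch.randn(20, 512)                     # 20 class prompts
masked_patches = clip_patches + 0.05 * torch.randn_like(clip_patches)

loss = (distribution_alignment_loss(clip_patches, dino_patches)
        + 0.5 * text_regularization_loss(clip_patches, text_embeds)
        + 0.5 * mask_consistency_loss(clip_patches, masked_patches))
print(loss.item())
```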

14 pages, 2231 KB  
Article
OpenMamba: Introducing State Space Models to Open-Vocabulary Semantic Segmentation
by Viktor Ungur and Călin-Adrian Popa
Appl. Sci. 2025, 15(16), 9087; https://doi.org/10.3390/app15169087 - 18 Aug 2025
Viewed by 3028
Abstract
Open-vocabulary semantic segmentation aims to label each pixel of an image based on text descriptions provided at inference time. Recent approaches to this task require two stages: the first uses a mask generator to produce mask proposals, while the second classifies each segment using a pre-trained vision–language model such as CLIP. However, since CLIP is pre-trained on natural images, the model struggles with segmentation masks because of their abstract nature. In this paper, we introduce OpenMamba, a novel approach that creates high-level guidance maps to assist in extracting CLIP features within the masked regions for classification. The high-level guidance maps are generated by leveraging both visual and textual modalities and by introducing State Space Duality (SSD) as an efficient way to tackle the open-vocabulary semantic segmentation task. We also propose a new matching technique for the mask proposals, based on IoU with a dynamic threshold conditioned on mask quality, and we introduce a contrastive loss to ensure that similar mask proposals obtain similar CLIP embeddings. Comprehensive experiments across open-vocabulary benchmarks show that our method achieves superior performance compared to other approaches while reducing memory consumption.
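As a rough illustration of the matching idea (not the OpenMamba code), the sketch below matches binary mask proposals to a ground-truth mask by IoU, with a threshold that relaxes as a proposal's predicted quality increases; `base_thr`, the quality scaling, and the quality scores themselves are assumed placeholders.

```python
# Illustrative sketch only: quality-conditioned IoU matching of mask proposals.
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def match_proposals(proposals, qualities, gt_mask, base_thr=0.5):
    """Match mask proposals to a ground-truth mask using a quality-conditioned threshold."""
    matched = []
    for mask, q in zip(proposals, qualities):
        thr = base_thr * (1.0 - 0.3 * q)   # higher predicted quality -> more lenient threshold
        if mask_iou(mask, gt_mask) >= thr:
            matched.append(mask)
    return matched

# Toy example with 8x8 binary masks.
gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True
good = np.zeros_like(gt); good[2:6, 2:5] = True
bad = np.zeros_like(gt); bad[0:2, 0:2] = True
print(len(match_proposals([good, bad], [0.9, 0.9], gt)))  # -> 1
```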

20 pages, 110802 KB  
Article
Toward High-Resolution UAV Imagery Open-Vocabulary Semantic Segmentation
by Zimo Chen, Yuxiang Xie and Yingmei Wei
Drones 2025, 9(7), 470; https://doi.org/10.3390/drones9070470 - 1 Jul 2025
Viewed by 1830
Abstract
Unmanned Aerial Vehicle (UAV) image semantic segmentation faces challenges in recognizing novel categories due to closed-set training paradigms and the high cost of annotation. While open-vocabulary semantic segmentation (OVSS) leverages vision-language models like CLIP to enable flexible class recognition, existing methods are limited to low-resolution images, hindering their applicability to high-resolution UAV data. Current adaptations—downsampling, cropping, or modifying CLIP—compromise either detail preservation, global context, or computational efficiency. To address these limitations, we propose HR-Seg, the first high-resolution OVSS framework for UAV imagery, which effectively integrates global context from downsampled images with local details from cropped sub-images through a novel cost-volume architecture. We introduce a detail-enhanced encoder with multi-scale embedding and a detail-aware decoder for progressive mask refinement, specifically designed to handle objects of varying sizes in aerial imagery. We evaluated existing OVSS methods alongside HR-Seg, training on the VDD dataset and testing across three benchmarks: VDD, UDD, and UAVid. HR-Seg achieved superior performance with mIoU scores of 89.38, 73.67, and 55.23, respectively, outperforming all compared state-of-the-art OVSS approaches. These results demonstrate HR-Seg’s exceptional capability in processing high-resolution UAV imagery.
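HR-Seg itself is not reproduced here, but the general pattern of fusing a downsampled global view with native-resolution crops can be sketched as follows; `segment` is a stand-in for any per-pixel classifier, and a simple logit average takes the place of the paper's learned cost-volume fusion.

```python
# Illustrative sketch only: generic global + local fusion for high-resolution inference.
import torch
import torch.nn.functional as F

def segment(image: torch.Tensor, num_classes: int = 8) -> torch.Tensor:
    """Placeholder segmentation head: returns per-pixel class logits."""
    b, c, h, w = image.shape
    return torch.randn(b, num_classes, h, w)

def hires_inference(image, crop=512, num_classes=8):
    b, c, H, W = image.shape
    # Global branch: downsample, segment, then upsample logits back to full resolution.
    global_logits = F.interpolate(
        segment(F.interpolate(image, size=(crop, crop)), num_classes),
        size=(H, W), mode="bilinear", align_corners=False)
    # Local branch: segment each crop at native resolution and paste logits back.
    local_logits = torch.zeros_like(global_logits)
    for y in range(0, H, crop):
        for x in range(0, W, crop):
            tile = image[:, :, y:y+crop, x:x+crop]
            local_logits[:, :, y:y+tile.shape[2], x:x+tile.shape[3]] = segment(tile, num_classes)
    # Simple average in place of a learned cost-volume fusion.
    return (global_logits + local_logits) / 2

pred = hires_inference(torch.randn(1, 3, 1024, 1536)).argmax(dim=1)
print(pred.shape)  # torch.Size([1, 1024, 1536])
```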

28 pages, 40407 KB  
Article
FreeMix: Open-Vocabulary Domain Generalization of Remote-Sensing Images for Semantic Segmentation
by Jingyi Wu, Jingye Shi, Zeyong Zhao, Ziyang Liu and Ruicong Zhi
Remote Sens. 2025, 17(8), 1357; https://doi.org/10.3390/rs17081357 - 11 Apr 2025
Viewed by 2714
Abstract
In this study, we present a novel concept termed open-vocabulary domain generalization (OVDG), which we investigate within the context of semantic segmentation. OVDG is more difficult than conventional domain generalization, yet more practical: it jointly considers (1) recognizing both base and novel classes and (2) generalizing to unseen domains. In OVDG, only the labels of base classes and the images from source domains are available to learn a robust model; the model can then be applied directly to images from novel classes and target domains. In this paper, we propose a dual-branch FreeMix module that implements the OVDG task effectively in a universal framework, comprising a base segmentation branch (BSB) and an entity segmentation branch (ESB). First, the entity mask is introduced as a novel concept for segmentation generalization, and semantic logits are learned for both the base mask and the entity mask, enhancing the diversity and completeness of masks for both base and novel classes. Second, FreeMix utilizes self-supervised pretraining on large-scale remote-sensing data (RS_SSL) to extract domain-agnostic visual features for decoding masks and semantic logits. Third, a training strategy called dataset-aware sampling (DAS) is introduced for multi-source domain learning, aimed at improving overall performance. In summary, RS_SSL, ESB, and DAS significantly improve the generalization ability of the model at both the class level and the domain level. Experiments demonstrate that our method produces state-of-the-art results for OVDG on several remote-sensing semantic-segmentation datasets, including Potsdam, GID5, DeepGlobe, and URUR.
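The abstract leaves dataset-aware sampling (DAS) undefined beyond its purpose, so the snippet below is only one plausible reading: each training step draws its batch from a single source dataset, with the source chosen in proportion to dataset size. The dataset names and the proportional weighting are assumptions, not the FreeMix implementation.

```python
# Illustrative sketch only: one possible form of dataset-aware sampling.
import random

def dataset_aware_batches(datasets: dict, batch_size: int, steps: int):
    """Yield (name, batch) pairs; each batch comes from one source dataset only."""
    names = list(datasets)
    weights = [len(datasets[n]) for n in names]   # sample datasets proportionally to their size
    for _ in range(steps):
        name = random.choices(names, weights=weights, k=1)[0]
        batch = random.sample(datasets[name], k=min(batch_size, len(datasets[name])))
        yield name, batch

# Toy usage with three hypothetical remote-sensing sources.
sources = {"Potsdam": list(range(100)), "GID5": list(range(60)), "DeepGlobe": list(range(40))}
for name, batch in dataset_aware_batches(sources, batch_size=4, steps=3):
    print(name, batch)
```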

25 pages, 31509 KB  
Article
Expanding Open-Vocabulary Understanding for UAV Aerial Imagery: A Vision–Language Framework to Semantic Segmentation
by Bangju Huang, Junhui Li, Wuyang Luan, Jintao Tan, Chenglong Li and Longyang Huang
Drones 2025, 9(2), 155; https://doi.org/10.3390/drones9020155 - 19 Feb 2025
Cited by 1 | Viewed by 2535
Abstract
The open-vocabulary understanding of UAV aerial images plays a crucial role in enhancing the intelligence level of remote sensing applications such as disaster assessment, precision agriculture, and urban planning. In this paper, we propose an innovative open-vocabulary model for UAV images, which combines vision–language methods to achieve efficient recognition and segmentation of unseen categories by generating multi-view image descriptions and extracting features. To enhance the generalization ability and robustness of the model, we adopt Mixup to blend multiple UAV images, generating more diverse and representative training data. To address the limitations of existing open-vocabulary models in UAV image analysis, we leverage the GPT model to generate accurate and professional text descriptions of aerial images, ensuring contextual relevance and precision. The image encoder utilizes a U-Net with Mamba architecture to extract key point information through edge detection and partition pooling, further improving the effectiveness of feature representation. The text encoder employs a fine-tuned BERT model to convert text descriptions of UAV images into feature vectors. Three key loss functions are designed: a generalization loss to balance old and new category scores, a semantic segmentation loss to evaluate model performance on UAV image segmentation tasks, and a triplet loss to enhance the model’s ability to distinguish features. A comprehensive loss function integrates these terms to ensure robust performance in complex UAV segmentation tasks. Experimental results demonstrate that the proposed method has significant advantages in handling unseen categories and achieving high accuracy in UAV image segmentation tasks, showcasing its potential for practical applications in diverse aerial imagery scenarios.
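The three loss terms can be composed in many ways; the sketch below is a generic version using a cross-entropy segmentation loss, PyTorch's built-in triplet margin loss, and a simple old-vs-new class balancing term. The weights, class splits, and `generalization_loss` formulation are illustrative assumptions rather than the paper's definitions.

```python
# Illustrative sketch only: a generic composition of the three losses named in the abstract.
import torch
import torch.nn.functional as F

def generalization_loss(class_logits, old_class_ids, new_class_ids):
    """Penalize the gap between mean scores on old (seen) and new (unseen) classes."""
    old = class_logits[:, old_class_ids].mean()
    new = class_logits[:, new_class_ids].mean()
    return (old - new).abs()

# Toy tensors: 2 images, 6 classes, 32x32 logits, and 128-dim embedding triplets.
logits = torch.randn(2, 6, 32, 32)
target = torch.randint(0, 6, (2, 32, 32))
anchor, positive, negative = torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128)

seg_loss = F.cross_entropy(logits, target)
tri_loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
gen_loss = generalization_loss(logits.mean(dim=(2, 3)), old_class_ids=[0, 1, 2], new_class_ids=[3, 4, 5])

total = seg_loss + 0.5 * tri_loss + 0.1 * gen_loss   # assumed weighting
print(total.item())
```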

20 pages, 18444 KB  
Article
Exploration of an Open Vocabulary Model on Semantic Segmentation for Street Scene Imagery
by Zichao Zeng and Jan Boehm
ISPRS Int. J. Geo-Inf. 2024, 13(5), 153; https://doi.org/10.3390/ijgi13050153 - 5 May 2024
Cited by 4 | Viewed by 5921
Abstract
This study investigates the efficacy of an open-vocabulary, multi-modal foundation model for the semantic segmentation of images from complex urban street scenes. Unlike traditional models reliant on predefined category sets, Grounded SAM uses arbitrary textual inputs for category definition, offering enhanced flexibility and adaptability. The model’s performance was evaluated across single and multiple category tasks using the benchmark datasets Cityscapes, BDD100K, GTA5, and KITTI. The study focused on the impact of textual input refinement and the challenges of classifying visually similar categories. Results indicate strong performance in single-category segmentation but highlighted difficulties in multi-category scenarios, particularly with categories bearing close textual or visual resemblances. Adjustments in textual prompts significantly improved detection accuracy, though challenges persisted in distinguishing between visually similar objects such as buses and trains. Comparative analysis with state-of-the-art models revealed Grounded SAM’s competitive performance, particularly notable given its direct inference capability without extensive dataset-specific training. This feature is advantageous for resource-limited applications. The study concludes that while open-vocabulary models such as Grounded SAM mark a significant advancement in semantic segmentation, further improvements in integrating image and text processing are essential for better performance in complex scenarios.
(This article belongs to the Special Issue Advances in AI-Driven Geospatial Analysis and Data Generation)

21 pages, 822 KB  
Article
Smart Contract Generation Assisted by AI-Based Word Segmentation
by Yu Tong, Weiming Tan, Jingzhi Guo, Bingqing Shen, Peng Qin and Shuaihe Zhuo
Appl. Sci. 2022, 12(9), 4773; https://doi.org/10.3390/app12094773 - 9 May 2022
Cited by 14 | Viewed by 5690
Abstract
In the last decade, blockchain smart contracts emerged as an automated, decentralized, traceable, and immutable medium of value exchange. Nevertheless, existing blockchain smart contracts are not compatible with legal contracts. The automatic execution of a legal contract written in natural language is an open research question that could extend the blockchain ecosystem and inspire next-era business paradigms. In this paper, we propose an AI-assisted Smart Contract Generation (AIASCG) framework that allows contracting parties in heterogeneous contexts and different languages to collaboratively negotiate and draft contract clauses. AIASCG provides a universal representation of contracts through the machine natural language (MNL) as the common understanding of the contract obligations. We compare the design of AIASCG with existing smart contract generation approaches to present its novelty. The main contribution of AIASCG is to address an issue in our previously proposed smart contract generation framework: for sentences written in natural language, that framework requires editors to manually split sentences into semantically meaningful words. We propose an AI-based automatic word segmentation technique called Separation Inference (SpIn) to automate this splitting. SpIn serves as the core component in AIASCG that accurately recommends the intermediate MNL outputs from a natural language sentence, tremendously reducing the manual effort in contract generation. SpIn is evaluated from a robustness and human-satisfaction point of view to demonstrate its effectiveness. In the robustness evaluation, SpIn achieves state-of-the-art F1 scores and Recall of Out-of-Vocabulary (R_OOV) words on multiple word segmentation tasks. In the human evaluation, participants judged that for 88.67% of sentences, automatic word segmentation saves 80–100% of the time.
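SpIn itself is not available from this listing, but the kind of F1 score quoted for word segmentation can be illustrated with a small boundary-based metric; the example sentence and the boundary-F1 formulation are assumptions used only to show how such scores are computed, not the paper's evaluation code.

```python
# Illustrative sketch only: boundary-based F1 for comparing a predicted segmentation
# against a gold-standard segmentation of the same character sequence.
def boundaries(words):
    """Return the set of character offsets where word boundaries fall."""
    out, pos = set(), 0
    for w in words:
        pos += len(w)
        out.add(pos)
    return out

def segmentation_f1(predicted, gold):
    p, g = boundaries(predicted), boundaries(gold)
    tp = len(p & g)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = ["smart", "contract", "generation"]
pred = ["smartcontract", "generation"]
print(round(segmentation_f1(pred, gold), 3))  # 0.8
```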