UGFF-VLM: Uncertainty-Guided and Frequency-Fused Vision-Language Model for Remote Sensing Farmland Segmentation

Tan, Kai; Wu, Yanlan; Yang, Hui; Ma, Xiaoshuang

doi:10.3390/rs18020282

¹ Institutes of Physical Science and Information Technology, Anhui University, Hefei 230601, China

² School of Artificial Intelligence, Anhui University, Hefei 230601, China

³ State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology, Hefei 230601, China

⁴ Engineering Research Center of Autonomous Unmanned System Technology, Ministry of Education, Hefei 230601, China

⁵ Engineering Research Center for Unmanned System and Intelligent Technology, Hefei 230601, China

⁶ School of Resources and Environmental Engineering, Anhui University, Hefei 230601, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(2), 282; https://doi.org/10.3390/rs18020282

Submission received: 4 December 2025 / Revised: 5 January 2026 / Accepted: 14 January 2026 / Published: 15 January 2026

(This article belongs to the Special Issue Advanced AI Technology for Remote Sensing Analysis)

Highlights

What are the main findings?

We propose UGFF-VLM, a vision-language model for farmland segmentation that introduces Uncertainty-Guided Adaptive Alignment (UGAA) to dynamically estimate text-visual correspondence confidence and Frequency-Enhanced Cross-Modal Fusion (FECF) to preserve high-frequency boundary details, enabling robust and precise farmland extraction.
The proposed UGFF-VLM method outperforms existing state-of-the-art methods in remote sensing farmland segmentation, as verified across multiple diverse agricultural regions in China.

What are the implications of the main findings?

UGFF-VLM offers a solution to reduce false positives and missed detections in vision-language model-based farmland extraction under variable seasonal and imaging conditions.
The combination of uncertainty-guided cross-modal fusion and frequency-domain feature enhancement provides a promising direction for improving segmentation accuracy in complex agricultural landscapes with fragmented and irregularly shaped parcels.

Abstract

Vision-language models can leverage natural language descriptions to encode stable farmland characteristics, providing a new paradigm for farmland extraction. However, existing methods suffer from ambiguous text-visual alignment and from the loss of high-frequency boundary details during fusion. To address these problems, this article uses the semantic prior knowledge provided by textual descriptions in vision-language models to improve the recognition of polymorphic farmland features, and proposes an Uncertainty-Guided and Frequency-Fused Vision-Language Model (UGFF-VLM) for remote sensing farmland extraction. UGFF-VLM builds on the semantic representation ability of vision-language models and integrates two components: an Uncertainty-Guided Adaptive Alignment (UGAA) module that dynamically adjusts cross-modal fusion based on alignment confidence, and a Frequency-Enhanced Cross-Modal Fusion (FECF) mechanism that preserves high-frequency boundary details in the frequency domain. Experimental results on the FarmSeg-VL dataset show that the proposed method performs well and stably, achieving the highest mIoU across diverse geographical environments while showing significant improvements in boundary precision and robustness against false positives. UGFF-VLM therefore mitigates the recognition confusion and poor generalization that farmland feature polymorphism causes in purely vision-based models, and it enhances boundary segmentation accuracy, providing a reliable method for the precise delineation of agricultural parcels in diverse landscapes.

Keywords:

vision-language model; remote sensing; farmland segmentation

1. Introduction

As arable land constitutes a critical yet finite resource for human survival and development, the ability to precisely delineate farmland boundaries enables effective monitoring of cultivated area dynamics, assessment of crop insurance claims, and implementation of sustainable agricultural practices [1]. Remote sensing, with its inherent advantages of wide-area coverage, rapid acquisition, and periodic observation, enables efficient and accurate extraction of farmland information, thereby providing critical support for agricultural monitoring, resource management, and policy formulation [2]. However, precise farmland extraction from remote sensing imagery still faces numerous challenges, primarily due to heterogeneous land cover, diverse crop types, and variations in growth stages.

With the advancement of artificial intelligence technologies and their increasing application in remote sensing information extraction [3,4,5], deep learning-based farmland extraction methods have become mainstream [6,7,8]. Early research directly applied deep learning methods from computer vision to farmland extraction tasks, such as Fully Convolutional Networks (FCN) [9], U-Net [10], and the DeepLab series [11]. These methods, however, struggled to capture multi-scale contextual information effectively and failed to model the complex spatial relationships within farmland parcels adequately. To address these limitations, methods specifically designed for farmland characteristics were subsequently proposed. CNN-LSTM architectures were developed to leverage temporal information for farmland identification [12]. Attention mechanism-based methods emerged to enhance multi-scale feature extraction and capture spatial relationships [13]. Additionally, deformable convolution-based approaches were introduced for handling farmland boundaries of varying scales and deformations [14]. Despite these advances, farmland exhibits polymorphism in visual characteristics because its internal features vary with planting structure, crop phenology, and growth conditions. Consequently, vision-based models require training samples that cover diverse crop growth stages and phenological features. Yet the high diversity of samples and significant differences in feature distributions can further induce cognitive hallucinations. As a result, current purely vision-based models suffer from serious problems of false positives, missed detections, and insufficient generalization when performing farmland extraction.

Although farmland internal features exhibit certain polymorphism, farmland possesses stable characteristics that distinguish it from surrounding regions. Specifically, farmland parcels typically present characteristic spatial distribution patterns, either concentrated contiguous arrangements or dispersed configurations across the landscape. The shape of farmland exhibits identifiable geometric properties such as blocky or scaly patterns, with internal roads serving as natural boundary markers regardless of whether they are curved or straight. The terrain context, including flat plains or terraced slopes, provides consistent topographic cues. Furthermore, the surrounding environment offers distinctive contextual information, including water bodies such as scattered ponds or irrigation channels, and vegetation elements like scattered trees or shelterbelts around field margins. These external properties remain relatively stable regardless of internal crop variations, seasonal phenological changes, or growth stage differences. Therefore, by incorporating such prior knowledge, the accuracy of farmland extraction and model generalization can be further enhanced. Vision-language models can leverage natural language descriptions to encode these stable farmland characteristics, providing a new paradigm for farmland extraction. Recent advances have also explored unified architectures integrating open-vocabulary detection and segmentation through text prompts, such as YoloE [15], demonstrating the potential of vision-language alignment for real-time object recognition. FSVLM [16] pioneered this direction by combining semantic segmentation models with multimodal large language models using an “embedding as mask” paradigm, enabling text-guided farmland identification and significantly improving extraction accuracy. 
Building upon this, FarmSeg_VLM [17] addresses limited textual supervision through Image-Text Spatial Alignment strategies with Multi-Label Background Priors, further enhancing boundary delineation capabilities. Despite these advances, existing VLM-based farmland segmentation methods face two critical challenges. First, the alignment between textual descriptions and their visual manifestations remains ambiguous. Farmland characteristics described in text, such as “scattered trees around the farmland” or “dispersed field distribution”, correspond to highly variable visual patterns depending on environmental conditions, imaging seasons, and viewing scales, yet current approaches apply uniform fusion weights without accounting for this reliability variability. Second, precise boundary delineation requires preserving high-frequency visual details that are vulnerable to being smoothed during conventional spatial-domain fusion operations, particularly for narrow features like field ridges and vegetation strips that separate adjacent parcels.

To address these limitations, we propose a vision-language model for farmland segmentation that integrates multimodal understanding with dense prediction capabilities. The model employs LLaVA [18] as the multimodal understanding backbone, which incorporates a CLIP [19] vision encoder for vision-language alignment and a large language model for textual reasoning over farmland descriptions. For dense visual feature extraction, we utilize DINOv3 [20] as a separate visual backbone to capture fine-grained spatial details through self-supervised hierarchical feature representations from multiple intermediate layers. These multi-scale visual features are subsequently processed by a DPT-based [21] decoder architecture that transforms the hierarchical representations into dense pixel-wise predictions for parcel segmentation masks. To enhance cross-modal fusion between textual semantics and visual features, we introduce two novel mechanisms integrated within the decoder. First, we develop an Uncertainty-Guided Adaptive Alignment (UGAA) module that dynamically estimates the confidence of text-visual correspondence and adaptively modulates fusion strength based on alignment reliability. This mechanism enables robust performance when textual descriptions mention abstract properties or when visual features exhibit ambiguity due to environmental similarity between farmland and surrounding regions. Second, we propose a Frequency-Enhanced Cross-Modal Fusion (FECF) mechanism that performs feature fusion in the frequency domain, decomposing features into magnitude and phase components to enable selective enhancement of boundary-relevant high-frequency details while preserving semantic low-frequency information. Additionally, we replace the conventional single-layer upsampling with a Progressive Upsampling strategy that gradually refines features through multiple stages for improved boundary precision.

The main contributions of this work are as follows:

(1) We propose a vision-language model specifically designed for farmland parcel segmentation. The model leverages LLaVA as the multimodal backbone for vision-language understanding, DINOv3 as the visual feature extractor for capturing fine-grained spatial details, and a DPT-based decoder with Progressive Upsampling for generating high-quality segmentation masks.
(2) We introduce a UGAA module that dynamically estimates the alignment confidence between textual and visual modalities. By learning to predict uncertainty scores and generating adaptive channel-wise scaling factors, this module enables robust cross-modal fusion even when text-visual correspondence is ambiguous due to variable visual patterns across different seasons and imaging conditions.
(3) We propose an FECF mechanism that performs feature fusion in the frequency domain through the Fast Fourier Transform. By decomposing features into magnitude and phase components with learnable fusion weights, this module enables selective enhancement of high-frequency boundary details while preserving semantic low-frequency information, thereby improving field boundary delineation.

2. Related Work

2.1. Remote Sensing Farmland Segmentation

Deep learning-based methods have made substantial progress in semantic segmentation of farmland from remote sensing images. Traditional CNN-based approaches such as FCN and U-Net have served as foundational architectures for farmland segmentation, with multi-scale feature extraction and attention mechanisms becoming increasingly prevalent. Pan et al. [22] conducted comparative experiments on various approaches, with MAENet [23] employing dual-pooling efficient channel attention (DPECA) to enhance feature representation. Pyramid scene parsing networks [24] and other CNN-based methods provide baseline approaches for semantic segmentation. Recent advances recognize that single-source imagery constrains model performance, with approaches like GLCANet [25] combining CNNs and Transformers through a global-local information mining module to leverage multi-source data. Furthermore, explicit edge extraction has proven valuable as an auxiliary task for farmland boundary delineation, with DE2S2N [26] proposing a dual-branch architecture that fuses farmland and road segmentation via morphological operations, while DFBNet [27] introduces a three-branch architecture that simultaneously extracts detail, deep semantic, and boundary features. These foundational and intermediate approaches progressively advanced farmland segmentation through hierarchical feature representation, multi-source data integration, and explicit boundary constraints.

To address boundary segmentation and diverse farmland characteristics, recent architectures employ multitask learning and deformable convolutions. MDE-UNet [14] utilizes Deformable ConvNets V2 to adaptively adjust receptive fields for farmland boundaries of varying scales and deformations, segmenting deterministic, fuzzy, and raw boundaries as separate specialized tasks. CTMENet [28] advances this direction by combining a hybrid CNN-Transformer backbone with channel gating units (CGUs) and a specialized edge loss function based on instance IoU calculations in an edge-aware multitask network framework. The emergence of Vision Transformers has enabled more comprehensive farmland understanding through hybrid architectures that combine local CNN feature extraction with transformer-based global context modeling. Dense-feature overlay fusion modules [29] and EGCM-UNet [30] incorporating Mamba [31] for long-range dependency modeling have demonstrated effectiveness, with Mamba models providing innovative sequence modeling capabilities. FL-DBENet exemplifies this approach by integrating SAM’s [32] powerful edge detection capabilities with SegFormer’s lightweight multi-scale feature extraction and Low-Rank Adaptation (LoRA) [33]. Semantic segmentation frameworks [34] further contribute to the evolution of farmland segmentation methods.

Most recently, vision-language models (VLMs) have opened new paradigms for farmland segmentation by integrating linguistic understanding with visual perception. FSVLM [16] pioneers this direction by combining a semantic segmentation model with multimodal large language models using an “embedding as mask” paradigm, enabling text-guided farmland identification. Building upon this foundation, FarmSeg_VLM [17] addresses the challenge of limited textual supervision through Image-Text Spatial Alignment strategies with Multi-Label Background Priors (ITSA_MLBP) and an Image-Text Alignment Adapter (ITAA), which enriches sparse textual prompts with dense visual information for improved boundary delineation. These specialized applications demonstrate how vision-language understanding can enhance farmland segmentation by leveraging complementary strengths of visual and linguistic modalities to address temporal and spatial heterogeneity challenges.

2.2. Vision-Language Models for Remote Sensing

The emergence of large vision-language models (VLMs) has demonstrated remarkable capabilities in understanding natural images through instruction-tuning paradigms. However, general-domain VLMs encounter significant performance degradation in remote sensing scenarios due to fundamental domain disparities, including diverse object scales, varying spatial resolutions, and complex scene layouts characteristic of remote sensing imagery. To address these challenges, domain-specific remote sensing VLMs have been developed to leverage specialized knowledge for RS interpretation tasks. RemoteCLIP [35] adapts CLIP for remote sensing through data scaling and continual pretraining, achieving strong zero-shot and retrieval performance. GeoChat [36] introduces grounded vision-language understanding for RS through visual grounding capabilities. RSGPT [37] proposes a comprehensive RS vision-language model with accompanying benchmarks for systematic evaluation. SkySenseGPT [38] contributes fine-grained instruction tuning datasets tailored for RS understanding, while LHRS-Bot [39] enhances remote sensing capabilities by incorporating volunteer geographic information (VGI) into multimodal language models. RS-LLaVA [40] demonstrates effectiveness in joint captioning and question answering tasks through efficient architectural adaptation. Building upon these foundations, SkyEyeGPT [41] proposes a unified framework capable of handling multi-granularity vision-language tasks spanning image-level, region-level, and video-level understanding, demonstrating that effective remote sensing VLM design does not necessitate complex architectures when coupled with appropriate instruction tuning through a two-stage approach combining image-text alignment and multi-task conversation fine-tuning. H2RSVLM [42] further advances reliability by addressing hallucination and trustworthiness concerns in remote sensing applications.

Architectural adaptations for remote sensing VLMs have explored diverse efficient fine-tuning strategies to balance performance and computational constraints. RSGPT [43] adopts the InstructBLIP framework with selective fine-tuning, where only the Q-Former network and linear projection layers are trained while the vision encoder and language model remain frozen, enabling effective visual-linguistic alignment. In contrast, RS-LLaVA employs low-rank adaptation (LoRA) for efficient fine-tuning, allowing selective modification of specific layers while preserving the majority of model parameters, thus significantly reducing computational costs. Despite these advances, fundamental challenges persist in remote sensing VLMs, particularly the hallucination phenomenon where models generate inaccurate or fabricated information. Co-LLaVA [44] addresses this through model collaboration strategies that combine lightweight and full models to enhance response reliability. Visual grounding in remote sensing presents additional complexity, demanding unified support for multiple annotation formats including horizontal bounding boxes, oriented bounding boxes, and segmentation masks. GeoGround [45] tackles this challenge through a Text-Mask paradigm that textualizes pixel-level masks for unified training, combined with hybrid supervision incorporating prompt-assisted and geometry-guided learning to achieve robust visual grounding across diverse annotation types. Beyond general-purpose remote sensing understanding, VLMs have also been adapted for specialized agricultural applications such as farmland segmentation [16,17], demonstrating the versatility of vision-language frameworks across diverse remote sensing interpretation tasks.

Beyond image-level understanding tasks, visual grounding that localizes referred objects through natural language has recently gained attention in remote sensing and aerial scenarios. Zhan et al. [46] introduced the RSVG task and constructed the DIOR-RSVG dataset, proposing a multi-granularity visual language fusion module to address scale variation and cluttered background challenges in RS imagery. For 3D localization, Mono3DVG [47] pioneered monocular 3D visual grounding using descriptions with both appearance and geometry information, demonstrating that language guidance can enhance spatial reasoning in single RGB images. More recently, UAV-SVG [48] extended visual grounding to aerial video sequences, proposing the SAVG-DETR framework to handle small objects and complex camera motion in drone footage. These visual grounding approaches share a common goal with our work: achieving precise text-visual alignment for accurate object localization. While they focus on bounding box regression, our UGFF-VLM addresses pixel-level segmentation with uncertainty-guided fusion and frequency-domain enhancement, targeting the specific challenges of farmland boundary delineation. These architectural innovations and training strategies represent progressive steps toward robust and reliable remote sensing vision-language models capable of addressing the unique challenges of remote sensing interpretation.

3. Methodology

3.1. Overall Architecture

Our proposed framework is a vision-language model specifically designed for farmland parcel segmentation, integrating multiple advanced components into a unified architecture as illustrated in Figure 1. Farmland parcels present unique challenges including irregular boundary shapes, subtle visual differences between adjacent fields, and high variability due to seasonal changes and agricultural practices, making vision-language integration particularly valuable for leveraging both visual appearance and textual semantic cues. The model builds upon LLaVA as the multi-modal understanding backbone, which combines a pre-trained large language model with visual encoders to enable joint reasoning over textual descriptions and visual content. For visual feature extraction, we employ dual visual encoders serving complementary roles. The CLIP vision encoder, integrated within the LLaVA framework, processes input images to establish vision-language alignment and enables the language model to understand farmland-related visual patterns. In parallel, DINOv3, a self-supervised Vision Transformer, captures fine-grained spatial details of field boundaries and vegetation patterns through hierarchical feature representations extracted from multiple intermediate layers. The DINOv3 backbone generates multi-scale visual features particularly sensitive to farmland geometric and textural characteristics, which are subsequently projected and fused through a feature pyramid structure, then processed by a DPT-based decoder architecture that transforms the hierarchical representations into dense pixel-wise predictions for parcel segmentation masks. The textual descriptions encoding crop types, field shapes, and spatial distribution patterns are processed through a tokenizer and fed into the large language model, where Low-Rank Adaptation (LoRA) is employed to efficiently fine-tune the model for farmland-specific concepts while keeping the majority of pre-trained weights frozen. 
As illustrated in Figure 1, “Frozen” denotes the pre-trained parameters that remain fixed during training, including the CLIP vision encoder and DINOv3 backbone, which preserve generalizable visual representations. “Trainable” indicates the parameters that are updated during training, including the LoRA adapters in the language model, the cross-modal fusion modules (UGAA and FECF), and the segmentation decoder, which are optimized for farmland-specific segmentation.

To facilitate cross-modal integration, the textual features extracted by the language model are fused with visual features through a two-stage process within the Cross Modal Fusion module. The UGAA module projects text embeddings into the visual feature space and generates confidence-weighted text-guided feature maps through adaptive channel-wise scaling. These features are then fused with DINOv3 visual features in the frequency domain via the FECF mechanism, which decomposes both modalities into magnitude and phase components for selective enhancement before feeding into the DPT-based decoder. The Large Language Model within the LLaVA framework encodes farmland-related textual descriptions into rich semantic embeddings that capture spatial distribution patterns, shape characteristics, and contextual information, while performing cross-modal reasoning to establish correspondences between linguistic concepts and visual patterns. For parameter-efficient fine-tuning, Low-Rank Adaptation (LoRA) is applied to the query and value projection matrices of the self-attention layers in the LLM. As indicated in Figure 1, the pre-trained weights of the CLIP vision encoder, DINOv3 backbone, and the majority of LLM parameters remain frozen, while only the LoRA adapters (with rank r = 8), the cross-modal fusion modules, and the segmentation decoder are updated during training, significantly reducing trainable parameters while enabling effective adaptation to farmland segmentation.
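The low-rank update that LoRA adds to a frozen projection can be illustrated with a small NumPy sketch. Only the rank r = 8 comes from the text; the hidden size, the scaling factor `alpha`, the initialization, and the function name `lora_forward` are assumptions made for illustration and are not the authors' configuration.

```python
import numpy as np

def lora_forward(x, W_frozen, A, B, alpha=16.0):
    """Apply a frozen projection plus a low-rank LoRA update.

    x:        (n, d_in) input activations
    W_frozen: (d_out, d_in) pre-trained weight, never updated
    A:        (r, d_in) trainable down-projection
    B:        (d_out, r) trainable up-projection
    alpha:    scaling hyperparameter (assumed value, not from the paper)
    """
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)           # weight update of rank at most r
    return x @ (W_frozen + delta).T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 8                  # d_in and d_out are illustrative sizes
W_q = rng.standard_normal((d_out, d_in))    # stands in for a frozen query projection
A = rng.standard_normal((r, d_in)) * 0.01   # small random initialization
B = np.zeros((d_out, r))                    # zero initialization: update starts at zero
x = rng.standard_normal((4, d_in))

y0 = lora_forward(x, W_q, A, B)             # equals the frozen projection at initialization
trainable = A.size + B.size                 # parameters updated during fine-tuning
frozen = W_q.size                           # parameters left untouched
```

With these illustrative sizes the adapter holds 1,024 trainable values next to 4,096 frozen ones; the gap widens as the hidden size grows, because the adapter scales linearly with the hidden size while the full matrix scales quadratically.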

The key innovations of our architecture lie in two novel cross-modal fusion mechanisms integrated within the Cross Modal Fusion module, specifically addressing farmland segmentation challenges. First, we introduce an Uncertainty-Guided Adaptive Alignment (UGAA) module that estimates the alignment confidence between textual and visual modalities, enabling adaptive fusion strength adjustment based on the reliability of text-visual correspondence. This module is particularly important for handling ambiguous farmland characteristics where textual descriptions may correspond to variable visual patterns across different seasons and imaging conditions. The UGAA module learns to predict uncertainty scores and generates adaptive scaling factors that modulate the contribution of text-guided features to the final predictions. Second, we propose a Frequency-Enhanced Cross-Modal Fusion (FECF) mechanism that performs feature fusion in the frequency domain rather than solely in the spatial domain. By decomposing features into magnitude and phase components through Fast Fourier Transform, the FECF module enables enhanced interaction between text and visual modalities, particularly preserving high-frequency boundary details critical for precise field boundary delineation such as narrow vegetation strips or curved parcel edges. The projected text embeddings from the language model are integrated with DINOv3 visual features through both UGAA and FECF modules, where the confidence-weighted and frequency-aware fusion ensures robust performance even when text-visual alignment is ambiguous. Additionally, we replace the conventional single-layer upsampling with a Progressive Upsampling strategy that gradually refines features through multiple stages, inspired by SAM’s multi-scale design. 
The entire framework is trained end-to-end with a combined loss function including cross-entropy loss for language modeling, binary cross-entropy loss and Dice loss for mask prediction, enabling simultaneous optimization of vision-language understanding and segmentation quality.
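A minimal sketch of how the three loss terms named above might be combined is given below. The text does not state the loss weights, so `w_lm`, `w_bce`, and `w_dice` are hypothetical and default to 1.0; the language-modeling cross-entropy is passed in as a precomputed scalar, and all function names are illustrative.

```python
import numpy as np

def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)))

def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss: 1 minus the overlap ratio between prediction and mask."""
    inter = np.sum(prob * target)
    denom = np.sum(prob) + np.sum(target)
    return float(1.0 - (2.0 * inter + eps) / (denom + eps))

def combined_loss(lm_ce, prob, target, w_lm=1.0, w_bce=1.0, w_dice=1.0):
    """Weighted sum of the three terms; the weights are assumptions, not from the paper."""
    return w_lm * lm_ce + w_bce * bce_loss(prob, target) + w_dice * dice_loss(prob, target)

# Toy 2 x 4 farmland mask: left half is farmland, right half is background.
target = np.array([[1, 1, 0, 0], [1, 1, 0, 0]], dtype=float)
good = np.where(target == 1, 0.9, 0.1)      # confident and correct prediction
bad = 1.0 - good                            # confident and wrong prediction

loss_good = combined_loss(0.5, good, target)
loss_bad = combined_loss(0.5, bad, target)
```

The Dice term responds to region overlap while the binary cross-entropy term penalizes each pixel independently, which is a common reason for pairing the two in mask prediction.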

3.2. Uncertainty-Guided Adaptive Alignment

A fundamental challenge in farmland parcel segmentation is the ambiguous relationship between textual descriptions and their visual manifestations. Unlike well-defined object categories, farmland characteristics like “scattered trees around the farmland” or “dispersed distribution of fields” correspond to highly variable visual patterns depending on tree density, field fragmentation, and viewing scale. Certain textual cues may be semantically relevant but visually weak, such as “flat terrain” which provides contextual understanding but lacks distinctive visual features. Traditional fusion approaches applying uniform weighting fail to account for this reliability variability. As shown in Figure 2, we propose UGAA to dynamically estimate the confidence of text-visual correspondence and adaptively modulate fusion strength. Specifically, the module first evaluates the alignment reliability between textual descriptions and visual patterns through a learnable uncertainty estimator. When the text-visual correlation is high (e.g., distinctive descriptions like “terraced fields with clear boundaries”), UGAA assigns higher confidence scores, resulting in stronger text-guided feature modulation. Conversely, when the correlation is low (e.g., abstract descriptions like “flat terrain”), UGAA increases uncertainty estimation and reduces the contribution of text-guided features, allowing the model to rely more on visual features for segmentation. The adaptive scaling factors generated by UGAA are then applied channel-wise to modulate the projected text features before spatial expansion and fusion with visual representations.

Given a text embedding encoding farmland descriptions and visual features capturing agricultural landscape patterns, UGAA first estimates alignment uncertainty through a learnable estimator:

u = σ(W_{u}^{(2)} \cdot \mathrm{ReLU}(W_{u}^{(1)} t + b_{u}^{(1)}) + b_{u}^{(2)})    (1)

where W_{u}^{(1)} and W_{u}^{(2)} are learnable matrices capturing the relationship between textual features and alignment reliability, b_{u}^{(1)} and b_{u}^{(2)} are the corresponding bias terms, t is the text embedding, and σ is the sigmoid function. High uncertainty occurs when textual descriptions mention abstract properties or when visual features exhibit ambiguity due to similar spectral responses across crop types. The alignment confidence is:

c = 1 - u    (2)
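Eqs. (1) and (2) can be sketched in NumPy as a two-layer estimator followed by a complement. The dimensions, the random weights, and the choice of a single scalar uncertainty per description are assumptions for illustration; the text does not specify the output dimension of the estimator.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alignment_confidence(t, W1, b1, W2, b2):
    """Return (u, c): uncertainty from Eq. (1) and confidence from Eq. (2)."""
    hidden = np.maximum(0.0, W1 @ t + b1)   # ReLU(W_u^(1) t + b_u^(1))
    u = sigmoid(W2 @ hidden + b2)           # Eq. (1): uncertainty in (0, 1)
    c = 1.0 - u                             # Eq. (2): alignment confidence
    return u, c

rng = np.random.default_rng(0)
d_text, d_hidden = 16, 8                    # illustrative sizes, not from the paper
t = rng.standard_normal(d_text)             # stands in for a text embedding
W1 = rng.standard_normal((d_hidden, d_text)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.standard_normal((1, d_hidden)) * 0.1   # scalar output is an assumption
b2 = np.zeros(1)

u, c = alignment_confidence(t, W1, b1, W2, b2)
```

Because the sigmoid bounds u to the open interval (0, 1), the confidence c stays in the same interval, so text guidance is attenuated but never fully switched off or amplified by this term alone.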

To enable channel-level adaptive modulation, recognizing that different visual channels respond to different farmland characteristics with varying text-visual alignment, we compute adaptive scaling factors:

s = σ(τ \cdot W_{s}^{(2)} \cdot \mathrm{ReLU}(W_{s}^{(1)} t)) ⊙ c    (3)

where W_{s}^{(1)} and W_{s}^{(2)} are learnable projection matrices, τ is a learnable temperature parameter, and ⊙ denotes element-wise multiplication. The scaling vector s \in \mathbb{R}^{C} assigns different weights to feature channels based on their semantic relevance to the textual description.
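The channel-wise scaling of Eq. (3) can be sketched as follows. The channel count, the hidden size, the fixed value of τ, and the scalar confidence values are assumptions chosen for illustration; in the model τ is learned and c comes from Eq. (2).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_scaling(t, Ws1, Ws2, tau, c):
    """Eq. (3): per-channel gates in (0, 1), damped by the alignment confidence c."""
    hidden = np.maximum(0.0, Ws1 @ t)       # ReLU(W_s^(1) t)
    gates = sigmoid(tau * (Ws2 @ hidden))   # temperature-scaled channel gates
    return gates * c                        # broadcast of c over the C channels

rng = np.random.default_rng(1)
d_text, d_hidden, C = 16, 8, 32             # illustrative sizes, not from the paper
t = rng.standard_normal(d_text)
Ws1 = rng.standard_normal((d_hidden, d_text)) * 0.1
Ws2 = rng.standard_normal((C, d_hidden)) * 0.1

s_conf = adaptive_scaling(t, Ws1, Ws2, tau=1.0, c=0.9)    # distinctive description
s_unsure = adaptive_scaling(t, Ws1, Ws2, tau=1.0, c=0.2)  # abstract description
```

For the same text embedding, lowering the confidence shrinks every channel weight by the same ratio, which is how a visually weak cue such as "flat terrain" would contribute less to the fused features under this reading of Eq. (3).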

The text embedding is projected into the visual feature space:

t_{v} = W_{p}^{(2)} \cdot \mathrm{ReLU}(W_{p}^{(1)} t)    (4)

where W_{p}^{(1)} and W_{p}^{(2)} transform the text embedding into a feature vector compatible with visual feature dimensions. This projected text feature is spatially expanded and modulated by the confidence-aware adaptive scaling factors:

T = (s \otimes 1_{H \times W}) ⊙ (t_{v} \otimes 1_{H \times W})    (5)

where \otimes denotes outer product broadcasting and 1_{H \times W} is a matrix of ones with spatial dimensions matching the visual features. The resulting text-guided feature map adaptively emphasizes spatial regions and feature channels where textual farmland descriptions provide reliable guidance for identifying parcel boundaries.
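The projection and spatial expansion of Eqs. (4) and (5) can be sketched with NumPy broadcasting. All sizes, the random weights, and the function name `text_guided_map` are illustrative assumptions; the scaling vector s would come from Eq. (3) and is drawn at random here.

```python
import numpy as np

def text_guided_map(t, Wp1, Wp2, s, H, W):
    """Eqs. (4)-(5): project text into C channels, then expand to a (C, H, W) map."""
    t_v = Wp2 @ np.maximum(0.0, Wp1 @ t)    # Eq. (4): projected text feature, shape (C,)
    ones = np.ones((H, W))                  # 1_{H x W}
    s_map = s[:, None, None] * ones         # s expanded over the spatial grid
    t_map = t_v[:, None, None] * ones       # t_v expanded over the spatial grid
    return s_map * t_map                    # Eq. (5): element-wise product

rng = np.random.default_rng(2)
d_text, d_hidden, C, H, W = 16, 8, 4, 5, 6  # illustrative sizes, not from the paper
t = rng.standard_normal(d_text)
Wp1 = rng.standard_normal((d_hidden, d_text)) * 0.1
Wp2 = rng.standard_normal((C, d_hidden)) * 0.1
s = rng.uniform(0.1, 0.9, size=C)           # stands in for the output of Eq. (3)

T = text_guided_map(t, Wp1, Wp2, s, H, W)
```

Under Eq. (5) as written, each channel of T holds a single value repeated over the H x W grid, so in this sketch any spatial selectivity has to arise later, when T is fused with the spatially varying visual features.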

3.3. Frequency-Enhanced Cross-Modal Fusion

Accurate farmland parcel segmentation critically depends on precise delineation of field boundaries, which often manifest as high-frequency visual details in satellite imagery. These boundaries may consist of narrow vegetation strips separating adjacent fields, subtle elevation changes along terrace edges, thin dirt roads defining parcel perimeters, or abrupt transitions in crop reflectance at field margins. However, these high-frequency boundary features are vulnerable to being smoothed or lost during spatial-domain fusion operations. Moreover, textual descriptions contain frequency-specific semantic cues such as “blocky and scaly field shapes” implying strong high-frequency geometric patterns, while “dispersed distribution” relates to low-frequency spatial arrangement. To address this, we propose FECF (Figure 3) to decompose features into spectral components and perform text-guided fusion in the frequency domain, enabling selective enhancement of boundary-relevant high-frequency details.

Given the spatial visual features v and the text-guided features T from UGAA, we first transform both into the frequency domain using the two-dimensional Fast Fourier Transform (FFT):

F_{v} = \mathrm{FFT}_{2} (v), \quad F_{T} = \mathrm{FFT}_{2} (T)

(6)

where

\mathrm{FFT}_{2} (\cdot)

results in complex-valued frequency representations

F_{v}, F_{T} \in \mathbb{C}^{C \times H \times W}

. Low-frequency components correspond to overall field patterns, while high-frequency components capture sharp transitions and boundary details critical for parcel delineation. The frequency representations are decomposed into magnitude and phase:

M_{v} = | F_{v} |, ϕ_{v} = ∠ F_{v}

(7)

M_{T} = | F_{T} |, ϕ_{T} = ∠ F_{T}

(8)

where

| \cdot |

and

∠

denote magnitude and phase extraction operations. To adaptively weight different frequency components based on their relevance to farmland boundary segmentation, we compute frequency-aware weights:

{\bar{M}}_{v} = \frac{1}{H W} \sum_{h, w} M_{v} [:, h, w]

(9)

ω = σ (W_{f} \cdot {\bar{M}}_{v})

(10)

where

{\bar{M}}_{v}

represents channel-wise average magnitude across spatial frequencies, and

W_{f}

is a learnable weight matrix identifying frequency bands containing discriminative boundary information. The magnitude fusion balances preservation of original visual boundary details with text-guided semantic enhancements:

M_{\mathrm{fused}} = α_{m} M_{v} + ω β M_{T}

(11)

where

α_{m}

and

β

are learnable parameters. For phase fusion, we adopt a conservative strategy to prevent distortion that could introduce artifacts in boundary predictions:

Δ ϕ = ϕ_{T} - ϕ_{v}

(12)

Δ ϕ_{\mathrm{wrapped}} = \mathrm{atan2} (\sin (Δ ϕ), \cos (Δ ϕ))

(13)

ϕ_{\mathrm{fused}} = ϕ_{v} + α_{P} Δ ϕ_{\mathrm{wrapped}}

(14)

where

\mathrm{atan2}

wraps the phase difference to

[- π, π]

, and

α_{P}

is a learnable parameter. The fused frequency representation is reconstructed:

F_{\mathrm{fused}} = M_{\mathrm{fused}} \cdot e^{i ϕ_{\mathrm{fused}}}

(15)

Finally, features are transformed back to the spatial domain and combined via residual connection:

v_{\mathrm{enhanced}} = \mathrm{IFFT}_{2} (F_{\mathrm{fused}}) + v

(16)

where

\mathrm{IFFT}_{2}

denotes inverse FFT. The residual connection preserves critical high-frequency boundary details, and the enhanced features are fed into the DPT decoder for accurate farmland parcel mask prediction.
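The frequency-domain steps in Eqs. (6)–(16) can be traced on a single channel with the minimal Python sketch below. It is an illustration under stated assumptions rather than the released implementation: a naive discrete Fourier transform stands in for an optimized FFT, the sigmoid is assumed for σ, the shape of W_f is assumed to be C × C, the real part of the inverse transform is kept before the residual addition (the text does not say how a complex result is handled), and all function names and parameter values are hypothetical.

```python
import cmath
import math


def dft2(x, inverse=False):
    """Naive 2-D DFT (or inverse DFT) of an H x W grid; adequate for toy sizes."""
    H, W = len(x), len(x[0])
    sign = 1.0 if inverse else -1.0
    scale = 1.0 / (H * W) if inverse else 1.0
    out = []
    for k in range(H):
        row = []
        for l in range(W):
            acc = 0j
            for h in range(H):
                for w in range(W):
                    angle = sign * 2.0 * math.pi * (k * h / H + l * w / W)
                    acc += x[h][w] * cmath.exp(1j * angle)
            row.append(acc * scale)
        out.append(row)
    return out


def mean_magnitude(v):
    """Eq. (9) for one channel: average spectral magnitude over all (h, w)."""
    F = dft2(v)
    return sum(abs(z) for row in F for z in row) / (len(v) * len(v[0]))


def frequency_weights(M_bar, W_f):
    """Eq. (10): omega = sigmoid(W_f . M_bar), one weight per channel."""
    return [1.0 / (1.0 + math.exp(-sum(w * m for w, m in zip(row, M_bar))))
            for row in W_f]


def wrap_phase(delta):
    """Eq. (13): wrap a phase difference into [-pi, pi] using atan2."""
    return math.atan2(math.sin(delta), math.cos(delta))


def fecf_channel(v, T, omega, alpha_m, beta, alpha_p):
    """Apply Eqs. (6)-(16) to one channel; v and T are H x W real grids."""
    H, W = len(v), len(v[0])
    F_v, F_T = dft2(v), dft2(T)                                   # Eq. (6)
    fused = []
    for k in range(H):
        row = []
        for l in range(W):
            M_v, phi_v = abs(F_v[k][l]), cmath.phase(F_v[k][l])   # Eq. (7)
            M_T, phi_T = abs(F_T[k][l]), cmath.phase(F_T[k][l])   # Eq. (8)
            M_f = alpha_m * M_v + omega * beta * M_T              # Eq. (11)
            d = wrap_phase(phi_T - phi_v)                         # Eqs. (12)-(13)
            phi_f = phi_v + alpha_p * d                           # Eq. (14)
            row.append(cmath.rect(M_f, phi_f))                    # Eq. (15)
        fused.append(row)
    spatial = dft2(fused, inverse=True)
    # Eq. (16): inverse transform plus residual; keeping .real is an assumption.
    return [[spatial[h][w].real + v[h][w] for w in range(W)] for h in range(H)]


# Hypothetical single-channel example (C = 1, so W_f is 1 x 1).
v = [[1.0, 2.0], [3.0, 4.0]]
T = [[0.5, 0.1], [0.2, 0.9]]
omega = frequency_weights([mean_magnitude(v)], [[0.1]])[0]
enhanced = fecf_channel(v, T, omega, alpha_m=1.0, beta=0.5, alpha_p=0.1)
```

Two sanity checks follow from the equations. Setting β = 0 and α_P = 0 with α_m = 1 leaves the spectrum of v unchanged, so the output equals 2v because of the residual term. Feeding the same grid as both v and T yields a zero phase difference, so only the magnitudes are rescaled.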

4. Experiments

4.1. Datasets and Evaluation Metrics

In this work, we use FarmSeg-VL [49], the first fine-grained image–text dataset specifically constructed for spatiotemporal farmland segmentation. FarmSeg-VL provides rich language-based descriptions that explicitly encode farmland shape, spatial distribution, phenological states, surrounding environmental elements, and regional topographic characteristics, addressing the limitations of conventional label-only remote sensing datasets in modeling spatial relationships and seasonal dynamics. Meanwhile, the textual descriptions in FarmSeg-VL explicitly specify the imaging month and season for each sample, covering all four seasons (spring, summer, autumn, and winter) across different agricultural phenological stages. This seasonal information enables the model to learn season-aware features and ensures comprehensive evaluation across varying temporal conditions. The dataset is built through a semi-automatic annotation pipeline to ensure high semantic quality and efficient caption generation. It covers eight major agricultural regions across China, spans approximately 4300 km², includes imagery from all four seasons, and offers a spatial resolution ranging from 0.5 m to 2 m. All images are cropped into 512 × 512 patches to retain detailed spatial structures such as field boundaries and vegetation textures. The dataset contains 15,821 training samples, 4512 validation samples, and 2272 test samples, with test samples drawn from the Northeast China Plain, the Huang-Huai-Hai Plain, the Northern Arid and Semi-Arid Region, the Loess Plateau, the Yangtze River Middle and Lower Reaches Plain, South China, the Sichuan Basin, and the Yungui Plateau. FarmSeg-VL serves as a comprehensive benchmark for evaluating both traditional deep learning methods and modern vision–language models in farmland segmentation.

In this study, four widely adopted metrics are employed to comprehensively assess the performance of our model on the farmland segmentation task. Specifically, we use Pixel Accuracy (ACC), Mean Intersection over Union (mIoU), Mean Dice coefficient (mDice), and Recall, as they capture different aspects of segmentation quality, from pixel-level correctness to region-level consistency and class-wise recognition capability. Together, these metrics provide a balanced and reliable evaluation of the model’s effectiveness. The mathematical definitions of these metrics are given as follows:

\mathrm{ACC} = \frac{TP + TN}{TP + FP + FN + TN}

(17)

\mathrm{mIoU} = \frac{1}{N} \sum_{i = 1}^{N} \frac{TP_{i}}{TP_{i} + FP_{i} + FN_{i}}

(18)

\mathrm{mDice} = \frac{1}{N} \sum_{i = 1}^{N} \frac{2 \times TP_{i}}{2 \times TP_{i} + FP_{i} + FN_{i}}

(19)

\mathrm{Recall} = \frac{TP}{TP + FN}

(20)
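For concreteness, Eqs. (17)–(20) can be evaluated on flattened label maps as in the minimal Python sketch below. Several choices are assumptions made for illustration: the labels are binary (0 for background, 1 for farmland), ACC and Recall are computed with farmland as the positive class, mIoU and mDice average over both classes, and every class is assumed to occur so that no denominator is zero. The function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def confusion(pred: Sequence[int], truth: Sequence[int], cls: int) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for one class over flattened label maps."""
    tp = sum(p == cls and t == cls for p, t in zip(pred, truth))
    fp = sum(p == cls and t != cls for p, t in zip(pred, truth))
    fn = sum(p != cls and t == cls for p, t in zip(pred, truth))
    tn = len(pred) - tp - fp - fn
    return tp, fp, fn, tn


def segmentation_metrics(pred: Sequence[int], truth: Sequence[int],
                         classes: Sequence[int] = (0, 1),
                         positive: int = 1) -> Dict[str, float]:
    """Compute ACC, mIoU, mDice, and Recall following Eqs. (17)-(20)."""
    tp, fp, fn, tn = confusion(pred, truth, positive)
    acc = (tp + tn) / (tp + fp + fn + tn)                   # Eq. (17)
    recall = tp / (tp + fn)                                 # Eq. (20)
    ious, dices = [], []
    for c in classes:
        tp_c, fp_c, fn_c, _ = confusion(pred, truth, c)
        ious.append(tp_c / (tp_c + fp_c + fn_c))            # summand of Eq. (18)
        dices.append(2 * tp_c / (2 * tp_c + fp_c + fn_c))   # summand of Eq. (19)
    return {
        "ACC": acc,
        "mIoU": sum(ious) / len(ious),
        "mDice": sum(dices) / len(dices),
        "Recall": recall,
    }


# Toy example: eight pixels, one missed farmland pixel and one false alarm.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(pred, truth)
```

In the toy example the farmland class has TP = 2, FP = 1, and FN = 1, so its IoU is 2/4, while the background class has an IoU of 4/6; mIoU is the mean of the two.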

4.2. Implementation Details

All experiments are conducted on NVIDIA A6000 GPUs using DeepSpeed for distributed training and memory optimization. The multimodal backbone is initialized from the LLaVA-Llama-2-13B-Chat-Lightning-Preview model, while the visual encoder is based on the DINOv3-ViT-H/16 variant, loaded from its official pretrained checkpoint. During training, the DINOv3 vision tower and the multimodal projection layers remain frozen, and only the language model and segmentation-related modules, including the mask decoder and text-guided fusion layers, are updated. To enable parameter-efficient tuning, LoRA adapters with rank r = 16, scaling factor α = 32, and dropout rate 0.1 are injected into selected linear layers of the language model. The AdamW optimizer is employed with a learning rate of 3 × 10⁻⁴, weight decay of 0.01, and momentum parameters β₁ = 0.9 and β₂ = 0.95. A linear warm-up strategy is applied for the first 100 iterations, followed by a scheduled decay using DeepSpeed’s WarmupDecayLR. Training is run for 10 epochs, each consisting of 1000 optimization steps, using a micro-batch size of 2 on each GPU and gradient accumulation over 10 iterations. The input images are resized to 1024 × 1024 pixels, and the maximum text sequence length is set to 512 tokens. Mixed-precision training is enabled through BF16 to improve computational efficiency, with gradient clipping set to 1.0 for training stability. The overall training objective combines cross-entropy loss for language modeling, binary cross-entropy loss for mask prediction, and a Dice loss term, weighted by 1.0, 2.0, and 0.5, respectively. Data loading follows a hybrid sampling strategy that mixes segmentation samples with explanatory text according to predefined sampling rates. During validation, full-resolution predictions are generated and saved for qualitative analysis. Model selection is based on validation mIoU, and all compared methods follow identical data partitions to ensure fair and consistent evaluation.
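The weighted training objective described above can be sketched as follows. Only the three weights (1.0, 2.0, and 0.5) come from the text; the specific per-pixel binary cross-entropy and soft Dice formulations, the smoothing constants, and the function names are assumptions for illustration, and the language-modeling cross-entropy is passed in as a precomputed number.

```python
import math
from typing import Sequence


def bce_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over pixels (probabilities clamped for stability)."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def dice_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-6) -> float:
    """Soft Dice loss: 1 - (2 * sum(p * y) + eps) / (sum(p) + sum(y) + eps)."""
    inter = sum(p * y for p, y in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def total_loss(ce_text: float, probs: Sequence[float], targets: Sequence[int],
               w_ce: float = 1.0, w_bce: float = 2.0, w_dice: float = 0.5) -> float:
    """Weighted sum of language-model CE, mask BCE, and mask Dice losses."""
    return (w_ce * ce_text
            + w_bce * bce_loss(probs, targets)
            + w_dice * dice_loss(probs, targets))


# Hypothetical values: a text loss of 0.3 and a four-pixel mask prediction.
loss = total_loss(0.3, probs=[0.9, 0.2, 0.8, 0.1], targets=[1, 0, 1, 0])
```

With these weights the mask binary cross-entropy term contributes twice as strongly as the language-modeling term, and the Dice term half as strongly.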

4.3. Comparisons with Other Methods

To comprehensively evaluate the effectiveness of our proposed approach, we compare it with a diverse set of segmentation baselines that cover general semantic segmentation models, remote-sensing-oriented architectures, and vision–language models tailored for farmland understanding. Among them, DeepLab-v3 [50] represents typical semantic segmentation frameworks widely adopted across natural-image tasks, providing strong baselines for assessing the general perceptual and feature-aggregation capabilities of our model. DCSwin [51], UNetFormer [52], DOCNet [53] and LOGCAN++ [54] are specifically designed for remote sensing imagery and thus offer competitive comparisons under domain-specific conditions such as complex terrain, spatial heterogeneity, and high-resolution farmland structures. Finally, FSVLM [16] is a vision-language segmentation model tailored for farmland, which integrates SAM’s segmentation capabilities with the LLaVA multimodal framework using an “embedding as mask” paradigm, enabling explicit integration of textual knowledge related to farmland attributes and spatiotemporal patterns. By comparing with these methods spanning general, domain-specific, and multimodal paradigms, we demonstrate that our model not only maintains competitive segmentation performance in broad segmentation scenarios but also excels in capturing farmland-specific structural characteristics and leveraging language-driven priors, highlighting its advantages in both robustness and semantic understanding.

Tables 1–8 present the quantitative comparison results across eight geographical regions in China. Our proposed method achieves the best mIoU in seven out of eight regions, demonstrating consistent competitiveness over existing approaches. Specifically, our model attains mIoU scores of 87.22%, 77.38%, 86.44%, 84.07%, 84.09%, 91.28%, 95.08%, and 94.95% across the Yangtze River Middle and Lower Reaches Plain, South China Areas, Sichuan Basin, Yungui Plateau, Northern Arid and Semi-arid Region, Northeast China Plain, Loess Plateau, and Huang-Huai-Hai Plain, respectively. The most substantial improvements are observed in the northern agricultural regions, where our model surpasses the second-best method by 1.20% mIoU on the Loess Plateau and 1.54% mIoU on the Huang-Huai-Hai Plain. These regions are characterized by large-scale regular farmland parcels with clear geometric boundaries, where our FECF mechanism effectively preserves high-frequency boundary details during feature fusion. Notably, in the Sichuan Basin, FSVLM achieves the best performance (86.52% mIoU), slightly outperforming our method (86.44% mIoU) by only 0.08%. This marginal difference suggests that both vision-language approaches demonstrate comparable effectiveness in this region, and the performance gap falls within the range of experimental variance.

Among the compared methods, UNetFormer and LOGCAN++ consistently rank as strong competitors, achieving competitive performance particularly in northern regions with regular farmland patterns. UNetFormer achieves the second-best mIoU in the Loess Plateau (93.88%) and Northeast China Plain (91.19%), while LOGCAN++ performs well in the Northern Arid and Semi-arid Region (83.72% mIoU). However, these methods show performance degradation in southern regions characterized by fragmented and irregularly shaped parcels, such as the South China Areas where UNetFormer achieves 76.71% mIoU and DOCNet achieves 76.94% mIoU compared to our 77.38%. DeepLab-v3, as a general semantic segmentation baseline, shows relatively lower performance across most regions, achieving 78.27% mIoU in the Yangtze River Middle and Lower Reaches Plain, 62.29% mIoU in the South China Areas, and 84.56% mIoU in the Huang-Huai-Hai Plain, indicating its limited capacity to handle the diverse and complex agricultural landscapes without domain-specific adaptations. The vision-language baseline FSVLM shows variable performance across regions: while achieving the best results in the Sichuan Basin (86.52% mIoU), it underperforms in other regions such as the Yangtze River Middle and Lower Reaches Plain (84.14% mIoU) and South China Areas (74.52% mIoU). DCSwin shows considerable variation across regions, performing adequately in the Huang-Huai-Hai Plain (88.96% mIoU) but struggling in the South China Areas (64.38% mIoU), indicating limited adaptability to diverse agricultural landscapes. In contrast, our proposed method demonstrates superior robustness and consistency across all eight geographical regions with diverse terrain characteristics. The integration of UGAA enables our model to dynamically adjust the fusion strength between textual and visual modalities based on alignment confidence, while the FECF mechanism effectively preserves boundary-relevant high-frequency details. 
These two mechanisms work synergistically to address the challenges of text-visual correspondence ambiguity and boundary precision that limit the performance of existing methods, resulting in consistently high performance across both northern regions with regular field patterns and southern regions with fragmented, irregularly shaped parcels.

To complement the quantitative evaluation, we provide qualitative comparisons through visualization of segmentation results across representative samples from all eight regions, as illustrated in Figure 4 and Figure 5. The yellow boxes in these figures highlight false positive regions where methods incorrectly classify non-farmland areas as farmland. The southern agricultural regions, characterized by small-scale and fragmented farmland parcels embedded within complex surrounding environments, pose significant challenges for accurate segmentation (Figure 4). In the Yangtze River Middle and Lower Reaches Plain samples (row a), DeepLab-v3, UNetFormer, and LOGCAN++ produce similar false positive predictions in areas adjacent to farmland boundaries in the first sample, whereas our method correctly distinguishes these ambiguous regions through UGAA. In the second sample, most compared methods fail to extract the farmland parcels entirely; only FSVLM and our method successfully segment the farmland regions, with FSVLM still producing fragmented false positive artifacts. The South China Areas samples (row b) reveal that all compared methods except ours suffer from either false positive predictions or incomplete extraction, while our model leverages the UGAA module to achieve complete and accurate farmland extraction by effectively differentiating farmland from visually similar surrounding environments. In the Sichuan Basin samples (row c), although FSVLM achieves the highest quantitative accuracy in this region, it fails to completely extract farmland parcels in the first sample under complex surrounding conditions, whereas our method maintains robust extraction capability. Both FSVLM and our method achieve consistently strong performance in the second sample. 
The Yungui Plateau samples (row d) demonstrate that other methods either generate false positive predictions or produce incomplete extraction results, whereas our method achieves accurate and complete farmland segmentation by dynamically adjusting the fusion strength between textual and visual modalities based on alignment confidence.

The northern agricultural regions present distinct challenges, primarily involving large-scale farmland parcels where precise boundary delineation and prevention of boundary adhesion become critical (Figure 5). In the Northern Arid and Semi-arid Region samples (row a), our method achieves precise boundary delineation while maintaining clear separation between adjacent farmland parcels. In contrast, DeepLab-v3, UNetFormer, LOGCAN++, and DOCNet suffer from boundary adhesion where neighboring parcels are incorrectly merged, while FSVLM produces severe fragmentation artifacts. The Northeast China Plain samples (row b) demonstrate that our method not only accurately segments the boundaries of large-scale farmland parcels but also successfully extracts small farmland regions at the edges that other methods fail to detect. The Loess Plateau samples (row c) present particularly challenging scenarios where sparse vegetation and arid soil conditions cause farmland and non-farmland regions to exhibit similar visual characteristics. Other methods produce noticeable boundary adhesion or misalignment in areas with ambiguous land cover transitions, whereas our method successfully separates adjacent parcels with precise boundaries. The Huang-Huai-Hai Plain samples (row d) further confirm that our method achieves accurate boundary segmentation, while some competing methods incorrectly classify adjacent non-farmland areas as farmland. These qualitative observations corroborate the quantitative results, demonstrating that the proposed UGAA and FECF mechanisms effectively enhance both farmland recognition accuracy and boundary precision, particularly in preventing boundary adhesion between adjacent parcels and detecting small farmland regions across diverse geographical and agricultural conditions.

4.4. Ablation Experiments

To validate the effectiveness of each proposed component, we conduct comprehensive ablation experiments on the FarmSeg-VL test set. Specifically, we evaluate the contribution of the UGAA module and the FECF mechanism by progressively adding them to the baseline model. The baseline model consists of the LLaVA multimodal backbone, DINOv3 visual encoder, and DPT-based decoder without our proposed cross-modal fusion modules.

The quantitative results of our ablation study are summarized in Table 9. The baseline model achieves an mIoU of 87.69%, Accuracy of 93.46%, mDice of 93.43%, and Recall of 93.42%. When incorporating only the FECF mechanism, the model shows a modest improvement, with mIoU increasing to 88.03% (+0.34%), indicating that frequency-domain fusion helps preserve boundary-relevant high-frequency details during cross-modal feature integration. Incorporating only the UGAA module yields a more substantial improvement, achieving an mIoU of 89.01% (+1.32%), Accuracy of 94.20%, mDice of 94.18%, and Recall of 94.15%. This gain supports our hypothesis that dynamically estimating text-visual alignment confidence and adaptively modulating fusion strength is crucial for handling the inherent ambiguity in farmland descriptions. When both modules are combined, the full model achieves the best performance with an mIoU of 89.66% (+1.97%), demonstrating that UGAA and FECF provide complementary benefits: UGAA handles semantic alignment uncertainty, while FECF preserves geometric boundary details. While the numerical improvements may appear modest, in fine-grained farmland segmentation tasks even small gains in mIoU correspond to substantial improvements in boundary precision, as evidenced by the qualitative results in Figure 6. Moreover, as also shown in Table 9, the additional computational cost introduced by UGAA and FECF is minimal. The full model requires only 68.2 M trainable parameters, a modest increase of 5.7 M (9.1%) over the 62.5 M baseline, while the TFLOPs remain virtually unchanged at 15.86. This negligible increase in computational complexity reflects the lightweight design of both modules.
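The reported gains and parameter overhead can be re-derived from the figures quoted above with a few lines of Python. The snippet uses only numbers stated in the text; the variable names are illustrative, and the gains are differences in mIoU percentage points.

```python
# mIoU values (%) quoted in the ablation discussion.
miou = {"baseline": 87.69, "FECF only": 88.03, "UGAA only": 89.01, "full": 89.66}

# Gain of each configuration over the baseline, in percentage points.
gains = {name: round(value - miou["baseline"], 2)
         for name, value in miou.items() if name != "baseline"}

# Sum of the two single-module gains, for comparison with the full model.
additive = round(gains["FECF only"] + gains["UGAA only"], 2)

# Trainable parameters in millions, as quoted in the text.
params_base, params_full = 62.5, 68.2
extra = round(params_full - params_base, 1)
overhead_pct = round(100 * extra / params_base, 1)
```

The full-model gain of 1.97 points exceeds the 1.66 points obtained by adding the two single-module gains, which is consistent with the statement that the modules are complementary.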

To further validate our quantitative findings, we provide qualitative comparisons of segmentation results across different model configurations in Figure 6. The first row demonstrates the effectiveness of our FECF module in boundary delineation. As highlighted by the yellow boxes, the baseline model and UGAA-only variant produce boundaries with noticeable irregularities and imprecise edges at field margins. In contrast, the configurations incorporating FECF (both FECF-only and the full model) achieve significantly sharper and more accurate boundary predictions. This improvement can be attributed to the frequency-domain decomposition in FECF, which explicitly preserves high-frequency boundary details that are often smoothed during conventional spatial-domain fusion operations. The narrow vegetation strips and subtle elevation changes along field edges are better captured through selective enhancement of boundary-relevant spectral components.

The second row in Figure 6 illustrates the advantage of our UGAA module in extracting farmland regions when visual features exhibit similarity with surrounding areas. The highlighted yellow regions show areas where farmland parcels share similar spectral and textural characteristics with adjacent regions, making accurate extraction challenging based on visual features alone. The baseline model and FECF-only variant fail to extract these visually similar farmland areas, resulting in incomplete segmentation with missing parcels. However, the UGAA-equipped configurations (both UGAA-only and the full model) successfully identify and extract these challenging farmland regions by leveraging text-guided semantic understanding. When visual features alone are insufficient to distinguish farmland from surrounding areas, the UGAA module appropriately increases the contribution of textual guidance based on the estimated alignment confidence, enabling the model to recognize farmland parcels that would otherwise be missed due to visual ambiguity.

To gain deeper insight into the internal mechanism, we visualize the feature representations at different stages of our decoder architecture in Figure 7. Columns three to six display the four-dimensional feature maps output by the decoder before our cross-modal fusion modules, which serve as the foundation for dense prediction. These features capture multi-scale spatial information but exhibit limited discriminability at boundary regions and areas with similar visual appearances. The last column presents the enhanced feature maps after processing through both the UGAA and FECF modules. The enhanced features demonstrate substantially improved boundary definition, with clearer separation between adjacent farmland parcels and more distinct activation patterns at field edges. Furthermore, the enhanced features show better consistency in homogeneous farmland regions while maintaining sharp transitions at boundaries, indicating that our dual-module design successfully addresses both semantic alignment challenges and geometric detail preservation. The comparison confirms that our proposed modules effectively transform the raw decoder features into more discriminative representations that are better suited for accurate farmland parcel segmentation.

To investigate the sensitivity of UGAA to text prompt variations and analyze the contribution of the language model branch, we conduct experiments under three conditions, including detailed descriptions from the FarmSeg-VL dataset, simple task instructions (“Segment the farmland.”), and no text prompt where we remove the entire language model branch and retain only the DINOv3 encoder and decoder for training. As shown in Table 10, the model with detailed prompts achieves 89.66% mIoU, outperforming simple prompts (89.12%) by 0.54% and the no-prompt variant (87.46%) by 2.20%. Similar trends are observed across other metrics, with detailed prompts yielding improvements of 1.17% in Accuracy, 1.25% in mDice, and 0.88% in Recall compared to the no-prompt baseline. The performance gap between detailed and simple prompts remains relatively modest (0.54% mIoU), while the gap between simple prompts and no prompt is more substantial (1.66% mIoU), demonstrating that the language model branch provides significant semantic guidance for farmland segmentation.

To provide deeper insights into how prompt complexity affects segmentation quality, we present qualitative comparisons in Figure 8. Three representative cases illustrate the benefits of detailed prompts. In the first case, descriptions mentioning “farmland with greenhouses” enable correct identification of greenhouse-covered agricultural areas that simple prompts fail to recognize. In the second case, shape-related descriptions help distinguish farmland from visually similar non-agricultural regions, reducing false positive predictions. In the third case, descriptions specifying road conditions between farmlands produce clearer boundary delineation, avoiding the over-segmentation observed with simple prompts.

The results from the no-prompt variant provide important insights into the role of visual features when text information is unavailable. When the language model branch is completely removed, the model relies entirely on DINOv3 visual features for segmentation. The 87.46% mIoU achieved by this vision-only baseline demonstrates that DINOv3 provides a strong foundation for farmland recognition. However, the 2.20% improvement from adding detailed text prompts indicates that linguistic guidance enhances the model’s ability to disambiguate challenging cases. This behavior can be attributed to the UGAA module’s uncertainty estimation mechanism. When textual descriptions lack discriminative semantic information, the module estimates higher uncertainty scores and consequently reduces the fusion weight of text-guided features, allowing the model to rely more heavily on visual features from DINOv3. This adaptive behavior ensures stable segmentation performance even when text descriptions are ambiguous or uninformative.

These ablation results collectively demonstrate that both UGAA and FECF contribute essential and complementary capabilities to our framework. UGAA provides robust handling of text-visual alignment uncertainty prevalent in farmland descriptions, while FECF ensures precise preservation of boundary details critical for accurate parcel delineation. The synergistic combination of these two modules enables our full model to achieve superior segmentation performance across diverse agricultural landscapes.

4.5. Limitation Analysis

While UGFF-VLM demonstrates strong performance across diverse agricultural regions, we identify a notable limitation in urban-rural fringe areas where farmland and built-up regions are interspersed, as illustrated in Figure 9. In such peri-urban landscapes, the model faces challenges from multiple factors: small farmland parcels are fragmented by roads, buildings, and other infrastructure; vegetation in residential courtyards and gardens shares similar spectral characteristics with crops; and the complex spatial arrangement increases boundary ambiguity. As shown in Figure 9a–c, these conditions lead to false positive predictions where non-agricultural green spaces are misclassified as farmland, boundary merging between adjacent micro-parcels, and irregular segmentation where farmland edges blend with surrounding vegetation.

These failure cases suggest that the current model struggles with the high heterogeneity and fragmentation characteristic of urban-rural transition zones. Future work could address this limitation by incorporating spatial context reasoning about land use patterns, integrating building footprint information as auxiliary constraints, and developing specific training strategies for peri-urban agricultural landscapes.

5. Conclusions

To address the cognitive confusion and poor generalization of current purely vision-based deep learning models for remote sensing farmland extraction, which are caused by the polymorphism of farmland arising from changes in planting structure and growth status as well as by the influence of different geographical environments, this article proposes an Uncertainty-Guided and Frequency-Fused Vision-Language Model (UGFF-VLM) to achieve high-precision farmland extraction. The proposed UGFF-VLM builds upon the LLaVA framework with a DINOv3 encoder as the visual backbone and incorporates two key innovations for cross-modal fusion. First, the UGAA module dynamically estimates the confidence of text-visual correspondence and adaptively modulates fusion strength based on alignment reliability. This mechanism addresses the inherent ambiguity in farmland descriptions where textual cues such as “scattered trees” or “dispersed distribution” correspond to highly variable visual patterns depending on environmental conditions and imaging seasons. By learning to predict uncertainty scores and generating adaptive channel-wise scaling factors, UGAA enables robust performance even when certain textual descriptions are semantically relevant but visually weak, or when visual features exhibit ambiguity due to similar spectral responses between farmland and surrounding regions. Second, the FECF mechanism performs feature fusion in the frequency domain through the Fast Fourier Transform, decomposing features into magnitude and phase components with learnable fusion weights. This design enables selective enhancement of high-frequency boundary details critical for precise field delineation, such as narrow vegetation strips and curved parcel edges, while preserving semantic low-frequency information for overall field pattern recognition.

Comprehensive experiments demonstrate the effectiveness and superiority of the proposed method. In comparison experiments on the FarmSeg-VL dataset across eight major agricultural regions in China, UGFF-VLM achieves the highest mIoU in seven out of eight regions, with scores of 87.22%, 77.38%, 86.44%, 84.07%, 84.09%, 91.28%, 95.08%, and 94.95% in the Yangtze River Middle and Lower Reaches Plain, South China Areas, Sichuan Basin, Yungui Plateau, Northern Arid and Semi-arid Region, Northeast China Plain, Loess Plateau, and Huang-Huai-Hai Plain, respectively. The most substantial improvements are observed in northern agricultural regions characterized by large-scale regular farmland parcels, where the model surpasses the second-best method by 1.20% mIoU on the Loess Plateau and 1.54% mIoU on the Huang-Huai-Hai Plain. Compared with general semantic segmentation baselines such as DeepLab-v3, remote sensing specialized methods including UNetFormer, LOGCAN++, and DOCNet, as well as the vision-language baseline FSVLM, our method demonstrates superior robustness and consistency across diverse geographical landscapes. Qualitative analysis reveals that the proposed method effectively reduces false positives in southern regions with fragmented farmland and complex surrounding environments, while simultaneously preventing boundary adhesion and maintaining precise delineation in northern regions with large-scale regular parcels. Ablation studies further validate the contribution of each proposed component, showing that the baseline model achieves 87.69% mIoU, the FECF-only variant achieves 88.03% (+0.34%), the UGAA-only variant achieves 89.01% (+1.32%), and the full model combining both modules achieves 89.66% (+1.97%). These results confirm that UGAA and FECF provide complementary benefits, with UGAA handling semantic alignment uncertainty and FECF preserving geometric boundary details. 
Therefore, the proposed UGFF-VLM not only mitigates the issues of recognition confusion and poor generalization in purely vision-based models caused by farmland feature polymorphism but also effectively enhances boundary segmentation accuracy, providing a reliable method for precise agricultural parcel delineation across diverse geographical landscapes and seasonal variations.

Author Contributions

Conceptualization, data curation, formal analysis, methodology, validation, visualization, writing—original draft, writing—review and editing, K.T.; investigation, project administration, resources, software, funding acquisition, supervision, writing—review and editing, Y.W.; conceptualization, validation, supervision, writing—review and editing, H.Y.; writing—review & editing, visualization, supervision, X.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Anhui Province Ecological Environment Science and Technology Project under Grant 2024HB001 and by the Anhui Provincial Water Conservancy Science and Technology Plan Project under Grant slkj202501-10.

Data Availability Statement

The source code, pre-trained models, and configuration files that support the findings of this study will be publicly released in a dedicated repository upon the official publication of this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UGFF-VLM	Uncertainty-Guided and Frequency-Fused Vision-Language Model
UGAA	Uncertainty-Guided Adaptive Alignment
FECF	Frequency-Enhanced Cross-Modal Fusion
DINO	Distillation with No Labels
LLaVA	Large Language and Vision Assistant
DPT	Dense Prediction Transformer

Figure 1. Overall architecture of our model.

Figure 2. Architecture of the Uncertainty-Guided Adaptive Alignment (UGAA) module. (a) uncertainty estimation, (b) adaptive scaling, (c) text feature projection.

Figure 3. Architecture of the Frequency-Enhanced Cross-Modal Fusion (FECF) mechanism.

Figure 4. The segmentation maps across four geographical regions in China representing different seasonal conditions. Rows (a–d) show samples from: (a) Yangtze River Middle and Lower Reaches Plain (Spring), (b) South China Areas (Winter), (c) Sichuan Basin (Summer), and (d) Yungui Plateau (Winter). Each row displays the input image, ground truth, and predictions from different methods. Yellow boxes indicate false positive regions.

Figure 5. The segmentation maps across four geographical regions in China representing different seasonal conditions. Rows (a–d) show samples from: (a) Northern Arid and Semi-arid Region (Autumn), (b) Northeast China Plain (Summer), (c) Loess Plateau (Autumn), and (d) Huang-Huai-Hai Plain (Summer). Each row displays the input image, ground truth, and predictions from different methods. Yellow boxes indicate false positive regions.

Figure 6. Visual comparison of segmentation results under different ablation configurations. From left to right, each column shows the input image, ground truth, baseline prediction, prediction with FECF only, prediction with UGAA only, and prediction with both modules (full model). Yellow boxes highlight regions of interest discussed in the text.

Figure 7. Visualization of feature maps from the decoder. Columns three to six show the four-channel feature maps output by the DPT decoder before cross-modal fusion, and the last column shows the corresponding enhanced feature maps after processing through the UGAA and FECF modules.

Figure 8. Qualitative comparison of segmentation results under different text prompt conditions. Yellow boxes highlight regions with notable differences.

Figure 9. Failure cases of UGFF-VLM in urban-rural fringe areas. (a) Northeast China Plain, (b) Sichuan Basin, and (c) South China. Red regions indicate model predictions.

Table 1. Segmentation results in the Yangtze River Middle and Lower Reaches Plain. Bold values indicate the best performance.

Model	mIOU	Acc	mDice	Recall
DeepLab-v3	78.27	88.59	87.72	88.59
DCSwin	80.17	89.00	88.90	88.81
UnetFormer	86.81	92.56	92.48	92.40
DOCNet	86.49	92.35	92.29	92.24
LOGCAN++	86.88	93.14	92.94	92.75
PixelLM	80.82	90.20	89.31	89.28
LaSagna	83.22	91.53	90.79	90.64
FSVLM	84.14	92.07	91.33	91.39
Ours	87.22	93.70	93.14	93.01

Table 2. Segmentation results in the South China Areas. Bold values indicate the best performance.

Model	mIOU	Acc	mDice	Recall
DeepLab-v3	62.29	71.85	74.13	71.85
DCSwin	64.38	72.93	75.95	80.99
UnetFormer	76.71	86.61	86.77	86.94
DOCNet	76.94	86.54	86.61	87.82
LOGCAN++	74.63	82.07	84.47	87.60
PixelLM	71.36	89.89	82.10	81.84
LaSagna	74.07	91.27	84.13	84.80
FSVLM	74.52	81.48	84.45	85.23
Ours	77.38	92.54	86.53	86.94

Table 3. Segmentation results in the Sichuan Basin. Bold values indicate the best performance.

Model	mIOU	Acc	mDice	Recall
DeepLab-v3	76.82	89.68	86.66	89.68
DCSwin	80.28	88.65	88.83	89.02
UnetFormer	85.62	91.36	91.32	91.27
DOCNet	85.58	92.48	92.12	91.77
LOGCAN++	85.87	92.83	92.29	92.08
PixelLM	84.45	93.14	91.43	91.24
LaSagna	85.51	93.66	92.07	91.89
FSVLM	86.52	94.18	92.67	92.85
Ours	86.44	94.05	92.62	92.16

Table 4. Segmentation results in the Yungui Plateau. Bold values indicate the best performance.

Model	mIOU	Acc	mDice	Recall
DeepLab-v3	71.50	82.82	83.25	82.82
DCSwin	75.28	85.70	85.82	85.96
UnetFormer	83.30	90.86	90.63	90.43
DOCNet	83.22	91.30	91.00	90.77
LOGCAN++	82.87	91.02	90.61	90.32
PixelLM	76.52	87.18	86.64	86.76
LaSagna	79.62	89.04	88.61	88.61
FSVLM	81.44	90.11	89.73	89.69
Ours	84.07	91.60	91.32	91.13

Table 5. Segmentation results in the Northern Arid and Semi-arid Region. Bold values indicate the best performance.

Model	mIOU	Acc	mDice	Recall
DeepLab-v3	70.40	82.59	82.64	82.59
DCSwin	76.80	86.97	86.88	86.98
UnetFormer	83.53	91.04	91.31	91.53
DOCNet	83.02	90.89	90.91	90.93
LOGCAN++	83.72	91.16	91.14	91.13
PixelLM	79.14	88.37	88.36	88.39
LaSagna	83.02	90.74	90.72	90.77
FSVLM	82.70	90.53	90.53	90.52
Ours	84.09	91.38	91.36	91.49

Table 6. Segmentation results in the Northeast China Plain. Bold values indicate the best performance.

Model	mIOU	Acc	mDice	Recall
DeepLab-v3	78.60	93.99	87.84	86.75
DCSwin	83.32	90.21	90.81	91.50
UnetFormer	91.19	95.07	95.37	95.59
DOCNet	90.30	94.44	94.87	95.36
LOGCAN++	89.92	94.23	94.66	95.14
PixelLM	85.88	93.16	92.35	92.32
LaSagna	89.15	94.85	94.23	94.29
FSVLM	89.75	95.15	94.57	94.56
Ours	91.28	95.92	95.42	95.55

Table 7. Segmentation results in the Loess Plateau. Bold values indicate the best performance.

Model	mIOU	Acc	mDice	Recall
DeepLab-v3	70.74	83.07	82.86	83.07
DCSwin	85.85	92.28	92.38	92.75
UnetFormer	93.88	96.87	96.84	96.83
DOCNet	92.26	96.01	95.97	95.96
LOGCAN++	92.98	96.40	96.36	96.35
PixelLM	86.50	92.77	92.76	92.76
LaSagna	90.68	95.11	95.11	95.11
FSVLM	91.82	95.74	95.73	95.78
Ours	95.08	97.48	97.48	97.50

Table 8. Segmentation results in the Huang-Huai-Hai Plain. Bold values indicate the best performance.

Model	mIOU	Acc	mDice	Recall
DeepLab-v3	84.56	91.71	91.59	91.71
DCSwin	88.96	94.04	94.13	94.23
UnetFormer	93.41	95.35	95.58	95.84
DOCNet	93.40	96.53	95.58	95.63
LOGCAN++	93.06	96.32	96.40	96.48
PixelLM	88.11	94.11	93.65	93.79
LaSagna	90.79	95.51	95.16	95.27
FSVLM	91.70	95.97	95.66	95.72
Ours	94.95	96.71	96.45	96.59
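
The four columns reported in Tables 1–8 can all be derived from a pixel-level confusion matrix. The Python sketch below shows one common way to do this for a binary farmland/background mask; it is not the authors' evaluation code. It assumes that mIOU, mDice, and Recall are averaged over the two classes and that Acc is overall pixel accuracy, which the tables alone do not confirm. The function names and the toy masks are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def binary_confusion(pred: Sequence[int], truth: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count TP, FP, FN, TN for flattened 0/1 masks (1 = farmland)."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _ratio(num: float, den: float) -> float:
    """Safe division: an empty denominator yields 0.0 instead of an error."""
    return num / den if den else 0.0


def segmentation_metrics(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """Return mIOU, Acc, mDice, and Recall in percent.

    Assumption: class-wise scores are averaged over farmland and background.
    """
    tp, fp, fn, tn = binary_confusion(pred, truth)

    # Per-class scores; background swaps the roles of TP and TN.
    iou = (_ratio(tp, tp + fp + fn), _ratio(tn, tn + fn + fp))
    dice = (_ratio(2 * tp, 2 * tp + fp + fn), _ratio(2 * tn, 2 * tn + fn + fp))
    recall = (_ratio(tp, tp + fn), _ratio(tn, tn + fp))

    return {
        "mIOU": 100.0 * sum(iou) / 2,
        "Acc": 100.0 * _ratio(tp + tn, tp + fp + fn + tn),
        "mDice": 100.0 * sum(dice) / 2,
        "Recall": 100.0 * sum(recall) / 2,
    }


# Toy example: eight pixels, flattened row by row.
toy_pred = [1, 1, 0, 0, 1, 0, 0, 0]
toy_truth = [1, 0, 0, 0, 1, 1, 0, 0]
for name, value in segmentation_metrics(toy_pred, toy_truth).items():
    print(f"{name}: {value:.2f}")
```

In practice the counts would be accumulated over every image in a regional test split before the ratios are taken, rather than averaging per-image scores.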

Table 9. Ablation study results on the FarmSeg-VL test set. The symbols × and √ indicate whether the corresponding module is excluded or included in the model configuration, respectively. Bold values indicate the best performance.

UGAA	FECF	mIOU	Acc	mDice	Recall	Parameters	TFLOPs
×	×	87.69	93.46	93.43	93.42	62.5 M	15.86
×	√	88.03	93.68	93.63	93.84	63.8 M	15.86
√	×	89.01	94.20	94.18	94.15	66.9 M	15.86
√	√	89.66	94.57	94.55	94.52	68.2 M	15.86
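
To make the contribution of each module easier to read off Table 9, the short Python sketch below computes the gain of every configuration over the baseline row. The values are transcribed from the table; the dictionary layout and the helper name are hypothetical and only serve this illustration.

```python
# Values transcribed from Table 9; keys are (UGAA enabled, FECF enabled).
ABLATION = {
    (False, False): {"mIOU": 87.69, "Acc": 93.46, "params_M": 62.5},
    (False, True): {"mIOU": 88.03, "Acc": 93.68, "params_M": 63.8},
    (True, False): {"mIOU": 89.01, "Acc": 94.20, "params_M": 66.9},
    (True, True): {"mIOU": 89.66, "Acc": 94.57, "params_M": 68.2},
}

BASELINE = (False, False)


def gain_over_baseline(config, metric):
    """Difference between a configuration and the baseline, rounded to 2 decimals."""
    return round(ABLATION[config][metric] - ABLATION[BASELINE][metric], 2)


for config in [(False, True), (True, False), (True, True)]:
    ugaa, fecf = config
    label = " + ".join(n for n, on in (("UGAA", ugaa), ("FECF", fecf)) if on)
    print(
        f"{label}: "
        f"mIOU +{gain_over_baseline(config, 'mIOU'):.2f}, "
        f"Acc +{gain_over_baseline(config, 'Acc'):.2f}, "
        f"parameters +{gain_over_baseline(config, 'params_M'):.1f} M"
    )
```

From the table values, the mIOU gain is 0.34 for FECF alone, 1.32 for UGAA alone, and 1.97 for both modules together, so the combined gain exceeds the sum of the individual gains (1.66) by about 0.31 points, at a cost of 5.7 M additional parameters.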

Table 10. Ablation study on text prompt sensitivity. “No” denotes removal of the language model branch (DINOv3 encoder and decoder only), “Simple” denotes a minimal task instruction, and “Detailed” denotes the full descriptions from the FarmSeg-VL dataset. Bold values indicate the best performance.

Text Prompt	mIOU	Acc	mDice	Recall
No	87.46	93.40	93.30	93.64
Simple	89.12	94.29	94.25	94.21
Detailed	89.66	94.57	94.55	94.52
