WS-DINO: A DINOv2-Based Weed Segmentation Method with Feature Priors and Spatial Fusion

Zhou, Hongsheng; Liu, Jiangping; Wu, Rigeng; Zhao, Baoping

doi:10.3390/agriculture16101105


by Hongsheng Zhou ^1,†, Jiangping Liu ^1,2,*,†, Rigeng Wu ^1 and Baoping Zhao ^3

^1 College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China

^2 Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, Hohhot 010030, China

^3 College of Agriculture, Inner Mongolia Agricultural University, Hohhot 010018, China

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Agriculture 2026, 16(10), 1105; https://doi.org/10.3390/agriculture16101105

Submission received: 7 April 2026 / Revised: 5 May 2026 / Accepted: 8 May 2026 / Published: 18 May 2026

(This article belongs to the Topic Digital Agriculture, Smart Farming and Crop Monitoring)

Abstract

Weed segmentation is a fundamental task in precision agriculture, essential for targeted intervention and sustainable farming. However, achieving accurate segmentation remains challenging due to the high visual similarity between weeds and crops, as well as the ambiguous, fine-grained boundaries often present in complex field environments. To address this, we present WS-DINO, a novel weed segmentation network built upon the DINOv2 vision foundation model. Our framework introduces two key innovations: (1) a Feature Prior Module that leverages a Canny-guided refinement process to extract and inject fine-grained cues related to weed texture, morphology, and boundaries into specific blocks of the Vision Transformer; and (2) a Spatial Feature Fusion Module that leverages convolutional layers to generate multi-scale spatial features, which are then fused with the semantically rich token features from DINOv2, effectively compensating for the Transformer’s limitations in capturing local spatial details. Comprehensive evaluation on the public PhenoBench dataset shows that WS-DINO achieves an mIoU of 88.67% and outperforms the evaluated benchmark methods. Moreover, on the challenging MotionBlurred dataset, WS-DINO reaches 88.75% mIoU, showing stable performance under motion blur and degraded visual conditions.

Keywords: weed segmentation; DINOv2; feature prior module; spatial feature fusion module; precision agriculture

1. Introduction

Precision agriculture has emerged as a key paradigm for improving crop productivity and sustainability by leveraging advanced sensing and intelligent decision-making technologies [1]. Among various agricultural tasks, weed segmentation plays a crucial role in enabling site-specific weed management, reducing excessive herbicide use, and promoting environmentally friendly farming practices [2]. With the rapid development of unmanned aerial vehicles (UAVs) and high-resolution imaging systems, large-scale field data can now be efficiently collected, providing new opportunities for automated weed detection and segmentation [3]. However, accurately segmenting weeds from complex agricultural scenes remains a highly challenging problem.

The primary difficulty of weed segmentation lies in the inherent characteristics of field environments. First, weeds exhibit highly diverse and complex morphologies, including small-scale objects, slender structures, and clustered distributions, which make them difficult to capture using standard feature representations [4]. Second, weeds often share strong visual similarities with crops in terms of color, texture, and shape, leading to severe inter-class ambiguity [5]. Third, practical field conditions frequently involve occlusion, overlap, and background interference, further complicating the accurate delineation of weed boundaries [6]. As a result, achieving precise and robust weed segmentation, particularly at fine-grained object boundaries, remains an open challenge.

In response to these challenges, a variety of deep learning-based methods have been proposed for weed segmentation. Convolutional neural networks (CNNs) have demonstrated strong capabilities in capturing local spatial features and have been widely adopted in this domain. For instance, Liao et al. [7] proposed SC-Net, which introduces a strip multilevel convolution block to capture strip-like features and improve the segmentation accuracy of slender leaves. Janneh et al. [8] designed a multi-level feature re-weighted module to enhance contextual feature representations of crops and weeds, alleviating semantic inconsistency caused by complex backgrounds. In addition, Castellano et al. [9] improved the Lawin-Transformer by incorporating multispectral data, thereby enhancing the generalization capability of the model. Furthermore, Wei et al. [10] proposed HCT-Net, a hybrid CNN–Transformer architecture that combines local feature extraction with global attention to improve plant morphology recognition. More recently, Liu et al. [11] developed a real-time multi-branch segmentation network with semantic-guided fusion to strengthen feature interaction between crops and weeds while balancing accuracy and efficiency. Thiagarajan et al. [12] systematically explored encoder-decoder models for precision weed detection and demonstrated their effectiveness for semantic segmentation in complex field environments. Tao et al. [13] further investigated UAV-based crop–weed segmentation and showed that an ensemble integrating Transformer, DINOv2, and Mamba architectures can deliver strong robustness under image degradation and field disturbances. Finally, Lu et al. [14] addressed large-scene UAV orthomosaic analysis by combining image enhancement and deep instance segmentation, improving weed delineation and field-scale mapping performance.

Despite these advances, existing methods still have critical limitations when applied to real-world weed segmentation. First, most approaches lack an explicit mechanism for modeling fine-grained priors and therefore struggle to capture fine structural details such as slender stems, thin leaves, and branching tips, which often leads to blurred or incomplete object boundaries [15]. Second, the high visual similarity between weeds and crops in color, texture, and morphology frequently causes severe inter-class confusion that purely data-driven feature learning cannot resolve effectively.

Recently, Vision Transformers (ViTs) have shown remarkable potential owing to their ability to model long-range dependencies through self-attention. In particular, DINOv2, a self-supervised vision foundation model trained by self-distillation on large-scale unlabeled images, has demonstrated strong generalization and representation capabilities across diverse visual tasks [16,17,18]. Its rich and transferable general-purpose features make it a promising backbone for downstream segmentation tasks [19]. Its main strengths lie in robust semantic representation and cross-domain generalization, whereas its patch-based token representation is less effective at preserving precise local boundaries. Consequently, directly applying DINOv2 to weed segmentation remains suboptimal, because it lacks the explicit fine-grained feature priors and spatial sensitivity that are crucial for accurately delineating complex weed boundaries.

To overcome these limitations, we propose WS-DINO, an improved DINOv2-based framework for precise weed segmentation. In addition to architectural enhancements, we adopt a parameter-efficient fine-tuning strategy based on low-rank adaptation to specialize the pretrained backbone for weed segmentation while keeping the number of trainable parameters small [16,20]. To the best of our knowledge, this is the first work that explicitly integrates feature priors and multi-scale spatial features into a DINOv2-based architecture for weed segmentation, effectively filling the gap in leveraging fine-grained and spatial cues within vision foundation models for this task. Our approach enhances the representation capability of DINOv2 by explicitly integrating feature and spatial priors into the Transformer architecture. Specifically, the Feature Prior Module (FPM) uses a Canny-guided refinement process to extract texture, morphology, and boundary cues and injects them into selected Transformer blocks via learnable cross-attention. In addition, the Spatial Feature Fusion Module (SFFM) encodes multi-scale spatial features through a convolutional encoder and fuses them with semantically rich Transformer tokens, improving local detail modeling while preserving global reasoning capability.

Extensive experiments on the public PhenoBench dataset show that WS-DINO reaches an mIoU of 88.67% and outperforms the evaluated mainstream segmentation methods. These results validate the effectiveness of combining explicit feature priors and multi-scale spatial features with foundation models for addressing the challenges of weed segmentation in complex agricultural environments. In summary, the main contributions of this work are as follows:

(1) We propose WS-DINO, a novel DINOv2-based framework that enhances weed segmentation by integrating structural and spatial priors into a Vision Transformer architecture.

(2) We design a Feature Prior Module to explicitly inject fine-grained feature priors into Transformer blocks, improving texture perception, structural awareness, and boundary-sensitive segmentation performance.

(3) We introduce a Spatial Feature Fusion Module to combine multi-scale convolutional features with Transformer representations, enabling more robust feature learning in complex scenes.

(4) We demonstrate through extensive experiments that the proposed method achieves superior performance on a challenging benchmark dataset, highlighting its effectiveness for precision agriculture applications.

2. Materials and Methods

2.1. Datasets

2.1.1. PhenoBench Dataset

The PhenoBench dataset [21] is used in this study to evaluate weed segmentation performance. The images were collected with a DJI M600 unmanned aerial vehicle equipped with a high-resolution PhaseOne iXM-100 (PhaseOne, Frederiksberg, Denmark) camera over a sugar beet experimental field at the University of Bonn, Germany, and the annotated data were publicly released in 2024. The aerial platform captures RGB images at a very fine ground sampling distance of approximately 1 mm per pixel, enabling fine-grained observation of plant structures and detailed scene characteristics. The dataset contains 2872 images, each with a spatial resolution of 1024 × 1024 pixels, and is officially divided into 1407 training images, 772 validation images, and 693 test images. In total, it includes 24,558 crop instances and 16,358 weed instances, providing a diverse and challenging benchmark for segmentation tasks. The images were collected in real agricultural environments and exhibit substantial variability in illumination, plant growth stages, and spatial distributions. Moreover, the dataset presents several inherent challenges, including high visual similarity between crops and weeds, complex background conditions, and frequent occlusion and overlap among plant instances. These characteristics make PhenoBench a representative and demanding dataset for evaluating the robustness and generalization capability of weed segmentation methods.

2.1.2. MotionBlurred Dataset

To further evaluate the robustness and generalization capability of the proposed method, we employ the MotionBlurred dataset [22]. This dataset is collected using a DJI Mavic 2 Pro (DJI, Shenzhen, China) unmanned aerial vehicle over a sorghum experimental field located in southern Germany, and the public dataset was released in 2022. The UAV captures high-resolution RGB images with a ground sampling distance of approximately 1 mm per pixel, enabling detailed observation of plant structures under real field conditions. The dataset consists of 19 images acquired during different growth stages of sorghum, each with a spatial resolution of 5472 × 3648 pixels. Among them, 12 images are annotated and used for model training, while the remaining 7 images are reserved for testing. Compared to standard datasets, MotionBlurred presents a more challenging scenario due to the presence of motion blur artifacts. These artifacts are primarily caused by platform instability induced by environmental factors such as wind during data acquisition. The motion blur effect significantly degrades image quality, leading to ambiguous and weakened object boundaries, reduced texture clarity, and loss of fine structural details. This poses substantial challenges for accurate weed segmentation, particularly for boundary delineation and small or slender objects. Therefore, this dataset serves as a rigorous benchmark to assess the model’s ability to generalize under degraded visual conditions. Models are required to maintain robust feature extraction and boundary perception capabilities despite the presence of blur-induced distortions.

2.2. Weed Segmentation Model

We propose WS-DINO, an improved DINOv2-based framework for precise weed segmentation, which enhances the capability of vision foundation models by incorporating explicit structural and spatial priors. The overall architecture follows an encoder–decoder paradigm, where a pretrained Vision Transformer (ViT-S/14) from DINOv2 serves as the backbone encoder, and a lightweight segmentation head is employed for dense prediction. To adapt the pretrained backbone to the weed segmentation task without updating all its parameters, we employ Low-Rank Adaptation (LoRA) [20]. Specifically, the original backbone weights are kept frozen, and trainable low-rank matrices are injected into the linear projections of the Transformer blocks, including the query, key, value, and output projections in self-attention as well as the feed-forward linear layers. For a weight matrix $W \in \mathbb{R}^{d_{in} \times d_{out}}$, LoRA re-parameterizes its update as

$$W' = W + AB, \quad A \in \mathbb{R}^{d_{in} \times r}, \quad B \in \mathbb{R}^{r \times d_{out}}, \quad r \ll \min(d_{in}, d_{out}), \tag{1}$$

so that only $A$ and $B$ are optimized during fine-tuning. In our implementation, we optimize the LoRA parameters together with the proposed task-specific modules and the segmentation head, while keeping the original DINOv2 backbone weights fixed.

As illustrated in Figure 1, the WS-DINO framework consists of three main components: (1) a DINOv2-based Transformer encoder, (2) a Feature Prior Module (FPM) for fine-grained cue modeling, and (3) a Spatial Feature Fusion Module (SFFM) for multi-scale representation learning. These components work collaboratively to address the challenges of fine-grained weed segmentation in complex agricultural environments.
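The low-rank re-parameterization in Equation (1) can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the width 384 matches the ViT-S embedding size, but the rank `r = 8`, the zero initialization of `B`, and the helper names are hypothetical choices that the text does not specify.

```python
import numpy as np

def lora_effective_weight(W, A, B):
    """Return W' = W + A @ B for a frozen W and trainable low-rank factors."""
    d_in, d_out = W.shape
    r = A.shape[1]
    assert A.shape == (d_in, r) and B.shape == (r, d_out)
    return W + A @ B

def count_params(d_in, d_out, r):
    """Trainable parameters: full fine-tuning vs. the two low-rank factors."""
    return d_in * d_out, r * (d_in + d_out)

rng = np.random.default_rng(0)
d_in, d_out, r = 384, 384, 8                 # r = 8 is a hypothetical rank
W = rng.standard_normal((d_in, d_out))       # frozen pretrained weight
A = rng.standard_normal((d_in, r)) * 0.01    # trainable factor
B = np.zeros((r, d_out))                     # trainable factor; zero init (assumed) keeps W' == W at start

W_prime = lora_effective_weight(W, A, B)
full, low_rank = count_params(d_in, d_out, r)
print(full, low_rank)                        # 147456 6144
```

With these illustrative sizes, one adapted projection trains 6144 parameters instead of 147,456, which is the sense in which the number of trainable parameters stays small.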

Given an input image, the DINOv2 encoder first extracts high-level semantic representations in the form of token embeddings. Although Vision Transformers have demonstrated strong capability in modeling global contextual dependencies, they often lack sensitivity to fine-grained local structures such as object boundaries, texture variations, and thin patterns, which are critical in weed segmentation tasks [23]. To enhance the model’s sensitivity to these subtle details, the proposed FPM injects fine-grained feature prior information into selected Transformer blocks. A Canny-guided refinement process helps emphasize informative local patterns, while subsequent convolutional operations encode feature cues related to texture, morphology, and boundary structures, making the module particularly suitable for distinguishing morphologically similar objects [24,25]. Specifically, the ViT backbone is divided into multiple stages, and feature priors are integrated into the early block of each stage, enabling progressive refinement of fine-grained representations throughout the network. Notably, the pretrained Transformer parameters are kept frozen, while only the lightweight prior integration layers are optimized, ensuring efficient adaptation with minimal computational overhead. Meanwhile, previous studies have shown that combining convolutional operations with Transformer architectures can effectively compensate for their respective limitations, where CNNs excel at capturing local spatial details while Transformers are more effective in modeling global context [26,27]. To this end, we introduce the SFFM to extract and fuse multi-scale spatial features. A convolutional spatial encoder is applied to the input image to generate hierarchical feature maps at different resolutions, which helps preserve fine-grained spatial information that may be lost in pure Transformer representations. 
In parallel, intermediate Transformer tokens from different stages are reshaped into spatial feature maps. These semantic features are then aligned and fused with the corresponding multi-scale spatial features, forming a set of enriched representations that combine global contextual information with fine-grained spatial details. Finally, the fused multi-scale features are fed into a SegFormer-based segmentation head to produce dense pixel-wise predictions. By jointly leveraging global semantic representations, explicit feature priors, and multi-scale spatial information, WS-DINO achieves accurate and robust weed segmentation, particularly in scenarios with complex structures, high inter-class similarity, and ambiguous boundaries.

2.2.1. Feature Prior Module

Accurate delineation of weeds requires not only boundary sensitivity but also robust modeling of fine-grained cues such as texture, morphology, slender structures, and local shape variations. However, Vision Transformers (ViTs) primarily focus on global context modeling and lack explicit mechanisms to capture such detailed local characteristics. To address this limitation, we propose a Feature Prior Module (FPM) that explicitly extracts and injects fine-grained feature prior information into the Transformer encoder, as illustrated in Figure 2.

Given an input image $I \in \mathbb{R}^{H \times W \times 3}$, we first derive a Canny-guided structural map and use it to modulate the input, yielding a feature-enhanced image:

$$I_{c} = I \odot C(I), \tag{2}$$

where $C(\cdot)$ denotes the Canny edge detection function and $\odot$ denotes element-wise multiplication. In all experiments, the Canny operator uses fixed low and high thresholds of 100 and 200, respectively, which were selected empirically to provide stable structural guidance under typical field conditions. This operation suppresses less informative regions and emphasizes local structures that are highly relevant to weed discrimination, providing a guidance signal for subsequent fine-grained feature extraction. Inspired by prior edge-enhanced segmentation designs and local-texture enhancement strategies [28,29], we employ a series of convolution and smoothing operations to extract enriched feature responses. Specifically, we define a $1 \times 1$ convolution as $F_{1 \times 1}(\cdot)$ and a $3 \times 3$ average pooling as $P_{3 \times 3}(\cdot)$. The feature prior extraction process can be formulated as:

$$F = F_{1 \times 1}\Big(P_{3 \times 3}\big(F_{1 \times 1}(I_{c})\big) - P_{3 \times 3}\big(P_{3 \times 3}(F_{1 \times 1}(I_{c}))\big)\Big), \tag{3}$$

where $F \in \mathbb{R}^{C \times H \times W}$ represents the refined fine-grained feature map. Based on this, we apply global average pooling over $F$ to encode it into a compact feature prior vector $f \in \mathbb{R}^{C}$, which summarizes fine-grained information of the input image, including texture, morphology, and boundary-related structures.
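As a rough illustration of Equations (2) and (3) followed by global average pooling, the NumPy sketch below treats the Canny map as a precomputed input (one possible source is OpenCV's `cv2.Canny(gray, 100, 200)` rescaled to 0/1). The stride-1 pooling with edge padding, the channel-first layout, the random weights, and all function names are assumptions made for the example, not details given in the text.

```python
import numpy as np

def avg_pool3x3(x):
    """3x3 average pooling, stride 1, edge padding (assumed); x is (C, H, W)."""
    C, H, W = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros((C, H, W))
    for dy in range(3):
        for dx in range(3):
            out += p[:, dy:dy + H, dx:dx + W]
    return out / 9.0

def conv1x1(x, weight):
    """1x1 convolution as channel mixing; weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def feature_prior(image, edge_map, w_in, w_out):
    """image: (3, H, W) floats; edge_map: (H, W) binary map from a Canny detector."""
    i_c = image * edge_map[None, :, :]          # Eq. (2): element-wise modulation
    x = conv1x1(i_c, w_in)                      # F_1x1(I_c)
    smooth1 = avg_pool3x3(x)                    # P_3x3(.)
    smooth2 = avg_pool3x3(smooth1)              # P_3x3(P_3x3(.))
    fmap = conv1x1(smooth1 - smooth2, w_out)    # Eq. (3)
    return fmap, fmap.mean(axis=(1, 2))         # feature map F and prior vector f

rng = np.random.default_rng(0)
image = rng.random((3, 16, 16))
edge_map = (rng.random((16, 16)) > 0.7).astype(float)  # stand-in for a real Canny map
w_in = rng.standard_normal((8, 3))                     # hypothetical C = 8 channels
w_out = rng.standard_normal((8, 8))
fmap, f = feature_prior(image, edge_map, w_in, w_out)
print(fmap.shape, f.shape)                             # (8, 16, 16) (8,)
```

The difference of the once-smoothed and twice-smoothed responses acts like a band-pass filter: it vanishes on flat regions and responds where local structure changes, which is consistent with the stated goal of emphasizing texture and boundary cues.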

As shown in Figure 3, the output feature maps generated by the proposed FPM exhibit strong responses over discriminative local regions of weeds and crops. This visualization further indicates that the module captures fine-grained texture details while preserving structurally relevant patterns, which benefits accurate class discrimination in complex agricultural scenes.

We incorporate the extracted feature prior into the Transformer using a cross-attention-based integration mechanism [30,31]:

$$f' = \mathrm{MLP}(f), \quad T' = T + \mathrm{CrossAttn}(T, f', f'), \tag{4}$$

where $T \in \mathbb{R}^{N \times C}$ denotes the token embeddings from a ViT block and $f' \in \mathbb{R}^{C}$ is the projected feature prior token. In the cross-attention term, the query is derived from $T$, while the key and value are both given by $f'$, enabling adaptive injection of fine-grained cues into the token representations.
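A minimal single-head sketch of Equation (4) is given below. The one-layer ReLU MLP, the separate projection matrices `wq`, `wk`, `wv`, `wo`, and the toy sizes are hypothetical; the text does not state the number of heads or the MLP depth.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inject_prior(T, f, mlp_w, wq, wk, wv, wo):
    """T' = T + CrossAttn(T, f', f') with a single prior token f' = MLP(f)."""
    f_prime = np.maximum(f @ mlp_w, 0.0)[None, :]    # (1, C); one-layer ReLU MLP (assumed)
    q = T @ wq                                       # (N, C) queries from the ViT tokens
    k = f_prime @ wk                                 # (1, C) key from the prior token
    v = f_prime @ wv                                 # (1, C) value from the prior token
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))    # (N, 1) attention weights
    return T + (attn @ v) @ wo, attn                 # residual update and weights

rng = np.random.default_rng(0)
N, C = 6, 4                                          # toy sizes for illustration
T = rng.standard_normal((N, C))
f = rng.standard_normal(C)
mlp_w, wq, wk, wv, wo = (rng.standard_normal((C, C)) for _ in range(5))
T_new, attn = inject_prior(T, f, mlp_w, wq, wk, wv, wo)
print(T_new.shape, attn.ravel())
```

One property of this simplified form is worth noting: with a single key and value token, the softmax runs over one entry, so every attention weight equals 1 and each token receives the same projected prior. Designs with several prior tokens or a learned gate would distribute the prior differently across tokens.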

For an effective trade-off between segmentation performance and computational efficiency, we divide the ViT backbone into four stages, each containing three Transformer blocks. The proposed FPM is applied only to the first block of each stage, enabling progressive refinement of fine-grained information throughout the network. Moreover, all parameters of the pre-trained ViT blocks are kept frozen, while only the lightweight MLP and cross-attention layers in the FPM are optimized. By explicitly incorporating feature priors into the feature learning process, the proposed module enhances the model’s ability to capture texture cues, morphological patterns, and object boundaries, thereby improving segmentation accuracy in challenging agricultural scenarios. As further quantified in the computational expense analysis, this design introduces only a moderate computational overhead relative to the accuracy gain achieved by the full framework.

2.2.2. Spatial Feature Fusion Module

Although the DINOv2 ViT encoder provides strong global semantic representations, its patch-based tokenization can weaken spatial precision and local detail modeling, which is crucial for boundary-sensitive weed segmentation. Motivated by prior hybrid CNN–Transformer designs [32,33], we propose a Spatial Feature Fusion Module (SFFM) that explicitly injects multi-scale spatial priors extracted by a lightweight convolutional encoder and fuses them with intermediate ViT token features to form a pyramid of enriched representations.

As illustrated in Figure 2, a lightweight Spatial Prior Encoder is applied to the input image to produce a four-level feature pyramid $\{D_{i}\}_{i=1}^{4}$. The encoder follows a compact convolutional architecture that progressively aggregates local spatial patterns while preserving fine-grained details, providing multi-scale spatial cues for subsequent fusion. In parallel, we divide the ViT backbone into multiple stages and extract patch token embeddings from the last block of selected stages. For each selected hidden state $H_{k} \in \mathbb{R}^{B \times (m + 1) \times C}$, we retain the patch tokens, rearrange them into a 2D feature map, and bilinearly interpolate it to match the spatial resolution of the corresponding convolutional prior, yielding $V_{k} \in \mathbb{R}^{B \times C \times \frac{H}{2^{k+1}} \times \frac{W}{2^{k+1}}}$. The aligned semantic feature $V_{k}$ is then fused with the spatial prior $D_{k}$ through channel concatenation followed by a $1 \times 1$ convolution and normalization to obtain the multi-scale fused representations:

$$F_{k} = \mathrm{BN}\big(F_{1 \times 1}([V_{k}; D_{k}])\big), \quad k \in \{1, 2, 3, 4\}. \tag{5}$$

Subsequently, the fused pyramid $\{F_{1}, F_{2}, F_{3}, F_{4}\}$ is fed into a SegFormer lightweight decoder head. Each level is first projected to a unified embedding dimension using $1 \times 1$ convolutions, resized to the highest resolution, and concatenated for feature fusion. The fused representation is then mapped to pixel-wise logits and upsampled to the original image size, producing the final segmentation prediction.
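The token-to-map alignment and the fusion step of Equation (5) can be sketched for a single pyramid level as follows. Several simplifications are assumed here and are not taken from the text: nearest-neighbour resizing stands in for bilinear interpolation, batch normalization omits the learnable affine terms, the class token is assumed to be the first token, and the channel counts of the spatial prior and the output are hypothetical.

```python
import numpy as np

def tokens_to_map(hidden, grid):
    """Drop the class token and reshape (B, m+1, C) patch tokens to (B, C, grid, grid)."""
    B, n, C = hidden.shape
    assert n == grid * grid + 1
    patches = hidden[:, 1:, :]                        # assumes the class token comes first
    return patches.reshape(B, grid, grid, C).transpose(0, 3, 1, 2)

def resize_nearest(x, size):
    """Nearest-neighbour resize of a square map to (size, size); stand-in for bilinear."""
    g = x.shape[-1]
    idx = np.arange(size) * g // size
    return x[:, :, idx][:, :, :, idx]

def fuse(V, D, weight, eps=1e-5):
    """Eq. (5): channel concatenation, 1x1 convolution, batch normalization."""
    cat = np.concatenate([V, D], axis=1)              # (B, Cv + Cd, h, w)
    y = np.einsum("oc,bchw->bohw", weight, cat)       # 1x1 convolution
    mean = y.mean(axis=(0, 2, 3), keepdims=True)
    var = y.var(axis=(0, 2, 3), keepdims=True)
    return (y - mean) / np.sqrt(var + eps)            # BN without affine parameters

rng = np.random.default_rng(0)
hidden = rng.standard_normal((1, 16 * 16 + 1, 384))   # 224 / 14 = 16 patches per side
V1 = resize_nearest(tokens_to_map(hidden, 16), 224 // 4)   # k = 1, so H / 2^(k+1) = 56
D1 = rng.standard_normal((1, 16, 56, 56))             # hypothetical 16-channel spatial prior
F1 = fuse(V1, D1, rng.standard_normal((32, 384 + 16)))
print(V1.shape, F1.shape)                             # (1, 384, 56, 56) (1, 32, 56, 56)
```

For a 224 × 224 input, the 16 × 16 token grid has to be upsampled to 56 × 56 at the first level, which is why the interpolation step is needed before concatenation.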

2.2.3. Loss Function

We supervise WS-DINO with a boundary-aware objective that combines standard pixel-wise cross-entropy with explicit edge consistency [34]. Given the model prediction $P$ and the ground-truth label map $Y$, we compute boundary magnitude maps using Sobel operators $S_{x}$ and $S_{y}$:

$$B_{pred} = \sqrt{(P * S_{x})^{2} + (P * S_{y})^{2}}, \tag{6}$$

$$B_{gt} = \sqrt{(Y * S_{x})^{2} + (Y * S_{y})^{2}}, \tag{7}$$

where $*$ denotes 2D convolution applied channel-wise. The final loss is defined as

$$\mathcal{L}_{boundary} = \mathcal{L}_{CE}(P, Y) + \kappa \, \| B_{pred} - B_{gt} \|_{2}^{2}, \tag{8}$$

where $\kappa$ controls the strength of the boundary supervision. In all experiments, we set $\kappa = 0.5$. Preliminary tuning indicated that performance remained reasonably stable within a moderate range around this value, while excessively large weights tended to over-emphasize local edges and slightly weaken regional consistency. This formulation encourages accurate region classification while explicitly penalizing boundary mismatch, which improves the delineation of thin structures and ambiguous contours in challenging field scenes.
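Equations (6) to (8) can be illustrated with the NumPy sketch below. It makes several assumptions that the text leaves open: `P` is taken to be the softmax probabilities, `Y` is one-hot encoded, the squared norm is averaged over elements rather than summed, and edge padding is used at image borders. The filter is applied as a cross-correlation, which yields the same gradient magnitude as a true convolution for Sobel kernels.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(x, kernel):
    """Apply a 3x3 kernel to each channel of x (C, H, W) with edge padding (assumed)."""
    C, H, W = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros((C, H, W))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * p[:, dy:dy + H, dx:dx + W]
    return out

def boundary_map(x):
    """Eqs. (6)-(7): channel-wise Sobel gradient magnitude."""
    return np.sqrt(filter3x3(x, SOBEL_X) ** 2 + filter3x3(x, SOBEL_Y) ** 2)

def boundary_aware_loss(logits, labels, kappa=0.5):
    """Eq. (8). logits: (K, H, W); labels: (H, W) integer class ids."""
    K = logits.shape[0]
    z = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = z / z.sum(axis=0, keepdims=True)                  # P as softmax probabilities
    onehot = np.eye(K)[labels].transpose(2, 0, 1)             # Y as (K, H, W)
    ce = -np.mean(np.log((probs * onehot).sum(axis=0) + 1e-12))
    edge = np.mean((boundary_map(probs) - boundary_map(onehot)) ** 2)
    return ce + kappa * edge

labels = np.zeros((8, 8), dtype=int)
labels[:, 4:] = 1
labels[6:, :] = 2
confident_logits = 50.0 * np.eye(3)[labels].transpose(2, 0, 1)
loss_good = boundary_aware_loss(confident_logits, labels)     # near zero
loss_flat = boundary_aware_loss(np.zeros((3, 8, 8)), labels)  # uniform prediction, no edges
print(loss_good, loss_flat)
```

A uniform prediction has no gradients at all, so the boundary term penalizes every missing class border in addition to the cross-entropy term.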

3. Results

3.1. Model Evaluation Metrics

We quantitatively evaluate segmentation performance using five standard metrics: Intersection over Union (IoU), mean Intersection over Union (mIoU), Precision, Recall, and F1-score. These metrics reflect pixel-level classification accuracy and the capability of the model to delineate fine weed boundaries in complex field imagery.

IoU measures the overlap between the predicted mask and the ground-truth region for a given semantic class:

$$\mathrm{IoU} = \frac{|\hat{Y} \cap Y|}{|\hat{Y} \cup Y|} = \frac{TP}{TP + FP + FN}, \tag{9}$$

where $\hat{Y}$ denotes the predicted region, $Y$ denotes the ground truth, and $TP$, $FP$, and $FN$ represent true positives, false positives, and false negatives, respectively. The overall mIoU is computed by averaging IoU over all semantic categories (crop, weed, and soil):

$$\mathrm{mIoU} = \frac{\mathrm{IoU}_{crop} + \mathrm{IoU}_{weed} + \mathrm{IoU}_{soil}}{3}. \tag{10}$$

Precision and Recall are defined as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \tag{11}$$

and the F1-score is computed as the harmonic mean of Precision and Recall:

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{12}$$
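Equations (9) to (12) translate directly into a few lines of Python. In the sketch below, the pixel counts are hypothetical, while the three class-wise IoU values are those reported in this article for WS-DINO on PhenoBench. The functions assume non-empty classes and do not guard against division by zero.

```python
def iou(tp, fp, fn):
    """Eq. (9): intersection over union from pixel counts."""
    return tp / (tp + fp + fn)

def precision_recall_f1(tp, fp, fn):
    """Eqs. (11)-(12)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_iou(per_class_iou):
    """Eq. (10): unweighted mean over classes."""
    return sum(per_class_iou) / len(per_class_iou)

# Hypothetical pixel counts for one class
tp, fp, fn = 80, 10, 10
class_iou = iou(tp, fp, fn)                      # 80 / 100 = 0.8
p, r, f1 = precision_recall_f1(tp, fp, fn)       # each equals 8/9 here

# Class-wise IoU (%) reported for WS-DINO on PhenoBench: background, crop, weed
phenobench_miou = mean_iou([99.50, 95.82, 70.68])
print(round(phenobench_miou, 2))                 # 88.67
```

Because mIoU is an unweighted mean, the weed class contributes one third of the score even though it covers far fewer pixels than the background, which is why gains in Weed IoU move the overall figure noticeably.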

3.2. Experimental Settings

All experiments are conducted on a single NVIDIA RTX 3090 GPU. The implementation is based on PyTorch 2.0 with CUDA 11.8 and Python 3.8. We optimize the network using Adam with an initial learning rate of $6 \times 10^{-5}$ and a batch size of 16. A compound learning-rate schedule is adopted, where the learning rate is linearly warmed up during the first five epochs and then decayed using a polynomial policy with a power of 1.0. The model is trained for approximately 200 epochs until convergence. These settings were selected based on preliminary experiments, prior practice in ViT-based dense prediction, and the memory constraints of a single RTX 3090 GPU. In particular, the learning rate and batch size were chosen to maintain stable LoRA-based fine-tuning, while the warm-up and polynomial decay schedule improved convergence stability.
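The warm-up and polynomial decay schedule described above could look like the following sketch. The base learning rate, five warm-up epochs, power of 1.0, and 200 epochs come from the text; updating once per epoch, starting the warm-up at one fifth of the base rate, and decaying toward zero at the final epoch are assumptions made for illustration.

```python
def learning_rate(epoch, base_lr=6e-5, warmup_epochs=5, total_epochs=200, power=1.0):
    """Linear warm-up followed by polynomial decay, evaluated once per epoch."""
    if epoch < warmup_epochs:
        # Ramp from base_lr / warmup_epochs up to base_lr (assumed starting point)
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * (1.0 - progress) ** power

schedule = [learning_rate(e) for e in range(200)]
print(schedule[0], schedule[4], schedule[-1])
```

With a power of 1.0 the polynomial policy reduces to a straight line, so after warm-up the learning rate falls by the same amount every epoch.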

For fair and consistent evaluation across datasets, we adopt a sliding-window strategy for both training and testing. For WS-DINO, images are cropped into $224 \times 224$ patches to satisfy the input-size constraint of the DINOv2 ViT-S/14 backbone. For comparison, we select representative CNN-based and Transformer-based segmentation baselines, including U-Net [35], DeepLabV3 [36], SegNet [37], Swin-UNet [38], and SegFormer-B0 [39]. These baselines were chosen because they are widely used, architecturally representative, and cover both classical convolutional and recent Transformer-style segmentation paradigms. For these baseline models, we use the same sliding-window protocol and crop images into $256 \times 256$ patches according to their standard input settings. All datasets use the same data augmentation pipeline, including random scaling, random cropping, random horizontal flipping, and photometric distortion.
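A sliding-window cropping scheme of this kind can be sketched as below. The window sizes 224 and 256 and the 1024 × 1024 PhenoBench image size come from the text; the stride and the handling of the last window (shifting it back so it stays inside the image) are assumptions, since the overlap policy is not stated.

```python
def window_origins(length, window, stride):
    """Start offsets that cover [0, length) with fixed-size windows."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] + window < length:
        starts.append(length - window)    # shift the last window back to stay in bounds
    return starts

def sliding_windows(height, width, window=224, stride=224):
    """Yield (top, left, bottom, right) crop boxes over an image."""
    for top in window_origins(height, window, stride):
        for left in window_origins(width, window, stride):
            yield top, left, top + window, left + window

boxes_224 = list(sliding_windows(1024, 1024, window=224, stride=224))
boxes_256 = list(sliding_windows(1024, 1024, window=256, stride=256))
print(len(boxes_224), len(boxes_256))     # 25 16
```

Since 1024 is not a multiple of 224, the 224-pixel windows need a fifth, overlapping window per axis, whereas 256-pixel windows tile the image exactly. At test time, predictions in overlapping regions would have to be merged, for example by averaging logits.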

3.3. Comparisons with Benchmarks on Segmentation Datasets

We evaluate WS-DINO against a series of representative convolutional and Transformer-based segmentation methods, including U-Net [35], DeepLabV3 [36], SegNet [37], Swin-UNet [38], and SegFormer-B0 [39], under the same experimental settings. The quantitative results on the PhenoBench dataset are summarized in Table 1.

Overall, WS-DINO achieves the best performance across all evaluation metrics, reaching 88.67% mIoU and 93.77% F1-score, demonstrating its superiority in both segmentation accuracy and robustness. From a class-wise perspective, background regions are relatively easy to segment, with all methods achieving saturated Bg IoU values close to 99%. In contrast, distinguishing weeds from crops remains the primary challenge due to their high visual similarity and structural complexity. WS-DINO consistently outperforms all baselines across all categories, achieving 99.50% Bg IoU, 95.82% Crop IoU, and 70.68% Weed IoU. Compared with the strongest baseline SegFormer-B0, WS-DINO improves mIoU by 1.90 percentage points. The most significant improvement is observed in the weed class, indicating that the proposed structural guidance and spatial feature fusion are particularly effective in handling thin structures, occlusions, and ambiguous boundaries. These gains are especially visible when weed regions are fragmented, slender, or partially occluded, where purely semantic features are often insufficient for precise delineation. Figure 4 presents qualitative comparisons on PhenoBench. Compared with baseline methods, WS-DINO produces cleaner segmentation masks with more complete weed regions and sharper object boundaries, while effectively suppressing false positives in background and crop areas. In challenging scenarios involving fragmented leaves and cluttered soil textures, WS-DINO better preserves fine-grained weed structures and avoids misclassifying weed pixels as crops, which is consistent with the quantitative improvements reported in Table 1.

Robustness under degraded visual conditions is further assessed on a motion-blurred dataset, where blurred edges and low-frequency textures substantially increase the difficulty of boundary localization. As shown in Table 2, WS-DINO again achieves the best overall performance, with 88.75% mIoU and 93.92% F1-score. In terms of class-wise IoU, WS-DINO attains 99.35% Bg IoU, 85.55% Crop IoU, and 81.34% Weed IoU. Compared with the best-performing baseline Swin-UNet, WS-DINO improves mIoU by 1.73 percentage points. It is worth noting that although SegFormer-B0 achieves competitive performance on the crop class, its Weed IoU drops significantly to 75.75%, suggesting that accurately delineating weeds under motion blur remains a challenging problem. This trend suggests that explicit structural priors help stabilize local responses when blur weakens texture contrast and boundary sharpness. In contrast, WS-DINO maintains strong performance across all classes, highlighting its superior robustness and generalization capability. Figure 5 further illustrates qualitative results on the motion-blurred dataset. Figure 6 and Figure 7 summarize the overall advantage of WS-DINO across mIoU, Precision, and Recall on the PhenoBench and MotionBlurred datasets, respectively. WS-DINO demonstrates a clear advantage in preserving thin weed structures and producing more coherent and consistent boundaries under blur, further validating the effectiveness of the proposed design.

3.4. Ablation Studies

The contribution of each proposed component is analyzed through ablation studies on the PhenoBench dataset. Specifically, we evaluate four configurations: (1) the baseline DINOv2 model, (2) DINOv2 with the Feature Prior Module (FPM), (3) DINOv2 with the Spatial Feature Fusion Module (SFFM), and (4) the full WS-DINO framework integrating both modules. The quantitative results are reported in Table 3.

Starting from the baseline, DINOv2 achieves an mIoU of 86.36%, with a Weed IoU of 64.46%, indicating that although the pretrained model provides strong semantic representations, it still struggles to accurately capture fine-grained structures and distinguish visually similar categories. After introducing the Feature Prior Module (FPM), the performance improves significantly, with mIoU increasing to 88.15% (+1.79) and Weed IoU rising to 69.46% (+5.00). This notable gain demonstrates that incorporating explicit feature priors effectively enhances the model’s ability to capture texture cues, morphological patterns, and object boundaries, which are particularly important for segmenting thin and complex weed regions. Similarly, integrating the Spatial Feature Fusion Module (SFFM) also leads to substantial improvements. The model achieves an mIoU of 88.04% (+1.68) and a Weed IoU of 69.15% (+4.69). This confirms that fusing multi-scale spatial features with Transformer representations helps recover local details that are otherwise insufficiently modeled by the pure ViT backbone. When both modules are jointly applied, the full WS-DINO model achieves the best performance, reaching 88.67% mIoU and 70.68% Weed IoU. Compared with the baseline, this corresponds to gains of +2.31 and +6.22 percentage points, respectively. The consistent improvement across all metrics indicates that FPM and SFFM are complementary, with FPM enhancing fine-grained feature perception and SFFM enriching spatial representation. Notably, the improvements are most pronounced in the weed class, which is the most challenging category due to its complex morphology, small-scale structures, and high similarity to crops. This further validates that the proposed design effectively addresses the core difficulties of weed segmentation.
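The gains quoted in this ablation can be reproduced directly from Table 3. The following sketch is only a re-reading of the tabulated numbers, not an additional experiment; the configuration labels and the `gains` helper are hypothetical names introduced for illustration.

```python
# (Weed IoU, mIoU) in %, copied from Table 3.
ABLATION = {
    "baseline":  (64.46, 86.36),
    "+FPM":      (69.46, 88.15),
    "+SFFM":     (69.15, 88.04),
    "+FPM+SFFM": (70.68, 88.67),
}


def gains(config):
    """Return (Weed IoU gain, mIoU gain) over the baseline, in percentage points."""
    base_weed, base_miou = ABLATION["baseline"]
    weed, miou = ABLATION[config]
    return round(weed - base_weed, 2), round(miou - base_miou, 2)


fpm = gains("+FPM")         # Feature Prior Module alone
sffm = gains("+SFFM")       # Spatial Feature Fusion Module alone
full = gains("+FPM+SFFM")   # both modules together

# Sum of the two single-module mIoU gains, for comparison with the joint gain.
sum_of_single_miou = round(fpm[1] + sffm[1], 2)

print("FPM:", fpm, "SFFM:", sffm, "full:", full)
print("sum of single-module mIoU gains:", sum_of_single_miou)
```

The joint mIoU gain (2.31) exceeds either single-module gain but is smaller than their sum (3.47), so the two modules appear to recover partly overlapping information while each still contributes something the other does not.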

3.5. Computational Expense

In this section, we analyze the computational efficiency of WS-DINO in terms of parameter count and inference speed (FPS), as summarized in Table 4. These metrics are critical for evaluating the practicality of deploying segmentation models in real-world precision agriculture scenarios, particularly for UAV-based or robotic weeding systems where both accuracy and efficiency are required. The comparison also clarifies in what sense the proposed task-specific modules are lightweight: relative to the accuracy gain they deliver, not in the sense that the overall model is the smallest among the baselines.

In terms of model parameters, WS-DINO contains 24.29 M parameters, which is comparable to U-Net (24.21 M) and smaller than Swin-UNet (36.10 M), SegNet (27.20 M), and DeepLabV3 (55.28 M). WS-DINO has a considerably larger parameter count than the lightweight Transformer-based SegFormer-B0 (3.82 M); this increase is mainly attributed to the incorporation of additional modules for structural and spatial feature enhancement. The parameter size of WS-DINO nevertheless remains within a moderate range for the evaluated methods.

Regarding inference speed, WS-DINO achieves 34.10 FPS, which ranks fourth among the six evaluated methods. It is slower than SegFormer-B0 (77.25 FPS), U-Net (54.32 FPS), and SegNet (45.15 FPS), but faster than Swin-UNet (28.32 FPS) and DeepLabV3 (32.63 FPS). This indicates that WS-DINO maintains reasonable efficiency despite its enhanced feature modeling capability.

Overall, WS-DINO offers a favorable trade-off between accuracy and computational cost. Although it is neither the smallest nor the fastest model, its moderate parameter size and inference speed, combined with the highest segmentation accuracy among the evaluated methods, make it well suited to practical applications in complex agricultural environments where both precision and robustness are essential.
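For deployment planning, frame rates are often easier to reason about as per-frame latency. The sketch below converts the Table 4 values, assuming that FPS was measured for single images processed one after another, so that latency is simply 1000 / FPS in milliseconds; the measurement hardware and input resolution are not restated here.

```python
# (parameters in millions, inference speed in FPS), copied from Table 4.
TABLE4 = {
    "WS-DINO":      (24.29, 34.10),
    "SegFormer-B0": (3.82, 77.25),
    "Swin-UNet":    (36.10, 28.32),
    "SegNet":       (27.20, 45.15),
    "DeepLabV3":    (55.28, 32.63),
    "U-Net":        (24.21, 54.32),
}


def latency_ms(fps):
    """Per-frame latency in milliseconds, assuming sequential single-image inference."""
    return 1000.0 / fps


# Rank the methods from fastest to slowest.
by_speed = sorted(TABLE4, key=lambda name: TABLE4[name][1], reverse=True)

for rank, name in enumerate(by_speed, start=1):
    params, fps = TABLE4[name]
    print(f"{rank}. {name:<13} {params:6.2f} M  {fps:6.2f} FPS  "
          f"{latency_ms(fps):6.2f} ms/frame")
```

Under this assumption WS-DINO needs roughly 29 ms per frame and ranks fourth of six by speed, which leaves some headroom below a 30 FPS budget (about 33 ms per frame) on the benchmark hardware.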

4. Discussion

Accurate weed segmentation in real-world agricultural environments remains a challenging yet crucial task for enabling precision farming practices such as site-specific spraying and autonomous weeding. In this work, we propose WS-DINO, which leverages the strong representation capability of a vision foundation model and enhances it with explicit structural and spatial priors. On both the PhenoBench and MotionBlurred datasets, WS-DINO outperforms the convolutional and Transformer-based baselines in mIoU, with the largest margins in Weed IoU, indicating that edge-aware guidance and multi-scale spatial information effectively improve weed segmentation.

Compared with conventional CNN-based methods such as U-Net and DeepLabV3, WS-DINO benefits from the global modeling capability of Transformers, enabling it to better capture long-range contextual dependencies. At the same time, unlike standard Vision Transformer-based models (e.g., Swin-UNet and SegFormer), our method explicitly addresses their limitations in capturing fine-grained local structures. The Feature Prior Module improves sensitivity to object boundaries for thin and irregular weeds, while the Spatial Feature Fusion Module supplements global representations with multi-scale spatial details, leading to more accurate and coherent segmentation results. This combination allows WS-DINO to achieve superior performance in complex scenarios involving occlusions, cluttered backgrounds, and ambiguous boundaries. Furthermore, the results on the motion-blurred dataset demonstrate that WS-DINO maintains strong robustness under image degradation. In particular, the notable improvement in Weed IoU suggests that the proposed structural guidance mechanism effectively mitigates the negative impact of blurred edges and low-frequency textures. This indicates that integrating explicit priors into foundation models is a promising direction for improving generalization in challenging real-world conditions.

Despite these encouraging results, several limitations remain. First, although WS-DINO achieves superior segmentation accuracy, its computational cost is higher than that of lightweight models such as SegFormer-B0, which may limit its deployment on resource-constrained edge devices. Second, while the proposed edge prior enhances boundary awareness, its effectiveness may be reduced in extremely low-quality images where edge information is severely degraded or noisy. In addition, because the Canny-guided structural prior relies on fixed thresholds, substantial illumination changes or low-contrast scenes may affect edge completeness and introduce less reliable guidance. Third, similar to most existing weed segmentation approaches, this work focuses on coarse-grained classification (crop, weed, and background), without distinguishing between different weed species, which may limit its applicability in more fine-grained agricultural management scenarios.
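The fixed-threshold limitation noted above can be made concrete. One widely used heuristic, which is not part of WS-DINO as described in this paper, derives the two Canny hysteresis thresholds from the median image intensity so that they follow changes in overall brightness. The sketch below is an illustration under that assumption; the function name and the `sigma` parameter are hypothetical.

```python
from statistics import median


def auto_canny_thresholds(pixels, sigma=0.33):
    """Return (low, high) hysteresis thresholds derived from the median intensity.

    `pixels` is a flat sequence of 8-bit grayscale values. `sigma` is a
    hypothetical tuning parameter that sets the width of the band around
    the median; it is not a value taken from the paper.
    """
    m = median(pixels)
    low = round(max(0.0, (1.0 - sigma) * m))
    high = round(min(255.0, (1.0 + sigma) * m))
    return low, high


# Two toy intensity samples standing in for a bright and a dim field image.
bright = [200, 210, 190, 205, 195]
dim = [40, 50, 45, 35, 55]

print("bright:", auto_canny_thresholds(bright))
print("dim:", auto_canny_thresholds(dim))
```

The resulting pair could be passed as the two threshold arguments of an edge detector such as OpenCV's `cv2.Canny`. Whether such adaptive thresholds would actually improve the Feature Prior Module under illumination changes is untested here and would need to be evaluated experimentally.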

In future work, we plan to address these limitations in three ways. First, we aim to develop a lightweight version of WS-DINO to improve computational efficiency and facilitate real-time deployment on embedded platforms such as agricultural robots and UAVs. Second, we will explore more robust structural priors and adaptive feature learning strategies to further enhance performance under extreme imaging conditions. Third, we will extend the current framework to fine-grained weed species segmentation and multi-class recognition, which could provide more precise decision support for intelligent agricultural systems.

5. Conclusions

In this paper, we proposed WS-DINO, a DINOv2-based weed segmentation framework designed to address the challenges of high inter-class similarity, ambiguous boundaries, and fine-grained structural variation in agricultural scenes. By introducing a Feature Prior Module to enhance the modeling of texture, morphology, and boundary-related cues, together with a Spatial Feature Fusion Module to recover multi-scale local details, the proposed method effectively compensates for the limitations of standard Vision Transformers in fine-grained segmentation tasks. In addition, the adoption of LoRA enables parameter-efficient adaptation of the pretrained foundation model to the weed segmentation task.

Extensive experiments on the PhenoBench and MotionBlurred datasets demonstrate that WS-DINO achieves superior performance over representative CNN-based and Transformer-based baselines. In particular, WS-DINO attains the best overall segmentation accuracy on PhenoBench and maintains strong robustness under motion blur, with especially notable improvements in the weed category, which is the most challenging class in practical field scenarios. These results confirm that integrating explicit feature priors and multi-scale spatial representations into vision foundation models is an effective strategy for precise and robust weed segmentation. Overall, WS-DINO provides a promising solution for intelligent precision agriculture and offers a useful foundation for future research on robust, fine-grained crop–weed understanding.

Author Contributions

H.Z. and J.L. contributed equally to this work. Conceptualization, H.Z., J.L. and B.Z.; methodology, J.L.; software, H.Z., J.L. and R.W.; validation, H.Z. and J.L.; formal analysis, H.Z.; investigation, H.Z.; resources, J.L.; data curation, H.Z.; writing—original draft preparation, H.Z.; writing—review and editing, J.L., R.W. and B.Z.; visualization, H.Z.; supervision, J.L., R.W. and B.Z.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development and Achievement Transformation Project of Inner Mongolia Autonomous Region (2025YFHH0278); the Sub-project of the National Key Research and Development Program of China (2023YFD1600702-04); the Natural Science Foundation Project of Inner Mongolia Autonomous Region (2025MS06054, 2025MS06046, 2025MS06020); the Scientific Research Start-up Project for Excellent Doctoral Talents Introduction (NDYB2023-32); and the Innovation Start-up Support Program for Returned Overseas Students in Inner Mongolia Autonomous Region (202223).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data were derived from public domain resources. The data presented in this study are available in the PhenoBench repository at https://doi.org/10.1109/TPAMI.2024.3419548 (PhenoBench Version 1.1.0, released on 3 December 2023, accessed in July 2025) and in the Mendeley Data repository for the UAV-based weed segmentation dataset for sorghum fields at https://doi.org/10.17632/4hh45vkp38.4 (Version 5, released on 24 July 2023, accessed in October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Saini, A.K.; Yadav, A.K.; Dhiraj. A Comprehensive review on technological breakthroughs in precision agriculture: IoT and emerging data analytics. Eur. J. Agron. 2025, 163, 127440.
Singh, P.; Zhao, B.; Shi, Y. Computer Vision for Site-Specific Weed Management in Precision Agriculture: A Review. Agriculture 2025, 15, 2296.
Sandoval-Pillajo, L.; García-Santillán, I.; Pusdá-Chulde, M.; Giret, A. Weed detection based on deep learning from UAV imagery: A review. Smart Agric. Technol. 2025, 12, 101147.
Syed, A.; Chen, B.; Abbasi, A.A.; Butt, S.A.; Fang, X. MSEA-Net: Multi-Scale and Edge-Aware Network for Weed Segmentation. AgriEngineering 2025, 7, 103.
Gao, J.; Tan, F.; Li, X. EDM-UNet: An Edge-Enhanced and Attention-Guided Model for UAV-Based Weed Segmentation in Soybean Fields. Agriculture 2025, 15, 2575.
Yang, Q.; Ye, Y.; Gu, L.; Wu, Y. MSFCA-Net: A Multi-Scale Feature Convolutional Attention Network for Segmenting Crops and Weeds in the Field. Agriculture 2023, 13, 1176.
Liao, J.; Chen, M.; Zhang, K.; Zhou, H.; Zou, Y.; Xiong, W.; Zhang, S.; Kuang, F.; Zhu, D. SC-Net: A new strip convolutional network model for rice seedling and weed segmentation in paddy field. Comput. Electron. Agric. 2024, 220, 108862.
Janneh, L.L.; Zhang, Y.; Cui, Z.; Yang, Y. Multi-level feature re-weighted fusion for the semantic segmentation of crops and weeds. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101545.
Castellano, G.; De Marinis, P.; Vessio, G. Weed mapping in multispectral drone imagery using lightweight vision transformers. Neurocomputing 2023, 562, 126914.
Wei, Y.; Feng, Y.; Zu, D.; Zhang, X. A hybrid CNN-transformer network: Accurate and efficient semantic segmentation of crops and weeds on resource-constrained embedded devices. Crop Prot. 2025, 188, 107018.
Liu, Y.; Liu, M.; Wang, L.; Ma, H.; Zhang, M. Real-time semantic segmentation network for crops and weeds based on multi-branch structure. IET Comput. Vis. 2024, 18, 1313–1324.
Thiagarajan, S.; Vijayalakshmi, A.; Grace, G.H. Weed detection in precision agriculture: Leveraging encoder-decoder models for semantic segmentation. J. Ambient Intell. Humaniz. Comput. 2024, 15, 3547–3561.
Tao, J.; Qiao, Q.; Song, J.; Sun, S.; Chen, Y.; Wu, Q.; Liu, Y.; Xue, F.; Wu, H.; Zhao, F. Deep Learning-Driven Automatic Segmentation of Weeds and Crops in UAV Imagery. Sensors 2025, 25, 6576.
Lu, C.; Gehring, K.; Kopfinger, S.; Bernhardt, H.; Beck, M.; Walther, S.; Ebertseder, T.; Minceva, M.; Hu, Y.; Yu, K. Weed instance segmentation from UAV Orthomosaic Images based on Deep Learning. Smart Agric. Technol. 2025, 11, 100966.
Zhang, J.; Cao, S.; Xu, B.; Li, Y.; Jia, W.; Wu, T.; Lu, H.; Hu, W.; Han, Z. DepthCropSeg++: Scaling a Crop Segmentation Foundation Model With Depth-Labeled Data. IEEE J. Sel. Top. Signal Process. 2026, 20, 129–141.
Espejo-Garcia, B.; Güldenring, R.; Nalpantidis, L.; Fountas, S. Foundation vision models in agriculture: DINOv2, LoRA and knowledge distillation for disease and weed identification. Comput. Electron. Agric. 2025, 239, 110900.
Li, W.; Zhu, L.; Liu, J. PL-DINO: An Improved Transformer-Based Method for Plant Leaf Disease Detection. Agriculture 2024, 14, 691.
Picón, A.; Eguskiza, I.; Mugica, D.; Romero, J.; Jimenez, C.J.; White, E.M.; Do-Lago-Junqueira, G.; Klukas, C.; Navarra-Mestre, R. Robust MultiSpecies Agricultural Segmentation Across Devices, Seasons, and Sensors Using Hierarchical DINOv2 Models. arXiv 2025, arXiv:2508.07514v2.
Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2024, arXiv:2304.07193.
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations, Online, 25–29 April 2022.
Weyler, J.; Magistri, F.; Marks, E.; Chong, Y.L.; Sodano, M.; Roggiolani, G.; Chebrolu, N.; Stachniss, C.; Behley, J. PhenoBench—A Large Dataset and Benchmarks for Semantic Image Interpretation in the Agricultural Domain. IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) 2024, 46, 9583–9594.
Genze, N.; Ajekwe, R.; Güreli, Z.; Haselbeck, F.; Grieb, M.; Grimm, D.G. Deep learning-based early weed segmentation using motion blurred UAV images of sorghum fields. Comput. Electron. Agric. 2022, 202, 107388.
Azad, R.; Kazerouni, A.; Azad, B.; Khodapanah Aghdam, E.; Velichko, Y.; Bagci, U.; Merhof, D. Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2023; Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda-Mahmood, T., Taylor, R., Eds.; Springer: Cham, Switzerland, 2023; pp. 736–746.
Shi, B.; Zhu, W.P.; Swamy, M. SGDC: Structurally-Guided Dynamic Convolution for Medical Image Segmentation. arXiv 2026, arXiv:2602.23496.
Suárez, P.L.; Sappa, A.D. Edge-Aware Camouflaged Object Detection. In Proceedings of the Computer Analysis of Images and Patterns; Castrillón-Santana, M., Travieso-González, C.M., Deniz Suarez, O., Freire-Obregón, D., Hernández-Sosa, D., Lorenzo-Navarro, J., Santana, O.J., Eds.; Springer: Cham, Switzerland, 2026; pp. 197–208.
Wang, H.; Chen, X.; Zhang, T.; Xu, Z.; Li, J. CCTNet: Coupled CNN and Transformer Network for Crop Segmentation of Remote Sensing Images. Remote Sens. 2022, 14, 1956.
Silva, L.; Drews, P.; de Bem, R. Soybean Weeds Segmentation Using VT-Net: A Convolutional-Transformer Model. In Proceedings of the 2023 36th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Rio Grande, Brazil, 6–9 November 2023; pp. 127–132.
Cheng, X.; Huang, S.; Liao, B.; Wang, Y.; Luo, X. BG-Net: Boundary-guidance network for object consistency maintaining in semantic segmentation. Vis. Comput. 2024, 40, 373–391.
Bui, N.T.; Hoang, D.H.; Nguyen, Q.T.; Tran, M.T.; Le, N. MEGANet: Multi-Scale Edge-Guided Attention Network for Weak Boundary Polyp Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 4–8 January 2024; pp. 7985–7994.
Yu, H.; Fu, T.; Li, B.; Xue, X. EAFormer: Scene Text Segmentation with Edge-Aware Transformers. In Proceedings of the Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; pp. 410–427.
Zhu, D.; Huang, X.; Huang, H.; Cheng, Q.; Huang, Z.; Shao, Z. ChangeViT: Unleashing plain vision transformers for change detection in remote sensing images. Pattern Recognit. 2026, 172, 112539.
Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578.
Li, R.; Yu, B.; Zhang, B.; Ma, H.; Qin, Y.; Lv, X.; Yan, S. Lightweight CNN–Transformer Hybrid Network with Contrastive Learning for Few-Shot Noxious Weed Recognition. Horticulturae 2025, 11, 1236.
Kervadec, H.; Bouchtiba, J.; Desrosiers, C.; Granger, E.; Dolz, J.; Ben Ayed, I. Boundary loss for highly unbalanced segmentation. In Proceedings of the 2nd International Conference on Medical Imaging with Deep Learning, London, UK, 8–10 July 2019; Proceedings of Machine Learning Research; Cardoso, M.J., Feragen, A., Glocker, B., Konukoglu, E., Oguz, I., Unal, G., Vercauteren, T., Eds.; PMLR: Cambridge, MA, USA, 2019; Volume 102, pp. 285–296.
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241.
Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
Cao, H.; Wang, Y.; He, D.; Wang, J.; Wang, H.; Miao, W. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Proceedings of the Computer Vision—ECCV 2022 Workshops; Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S., Eds.; Springer: Cham, Switzerland, 2022; pp. 205–218.
Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203.

Figure 1. Overall framework of WS-DINO. Snowflake and flame symbols indicate frozen and trainable parameters, respectively.

Figure 2. Illustration of the Feature Prior Module (FPM), including Canny-guided feature extraction, feature refinement, and feature prior encoding.

Figure 3. Visualization of output feature maps produced by the Feature Prior Module. The highlighted responses indicate that the proposed module effectively captures fine-grained texture details of weeds and crops.

Figure 4. Qualitative comparisons on the PhenoBench dataset. White boxes highlight visually challenging regions for segmentation.

Figure 5. Qualitative comparisons on the motion-blurred dataset. White boxes highlight visually challenging regions for segmentation.

Figure 6. Radar plot comparison of benchmark methods on the PhenoBench dataset using three metrics: mIoU, Precision, and Recall. The mIoU axis is placed at the top, forming a triangular radar chart without polygon filling.

Figure 7. Radar plot comparison of benchmark methods on the MotionBlurred dataset using three metrics: mIoU, Precision, and Recall. The mIoU axis is placed at the top, forming a triangular radar chart without polygon filling.

Table 1. Quantitative comparisons with benchmark segmentation methods on the PhenoBench dataset. All metrics are reported in %.

Method	Bg IoU (%)	Crop IoU (%)	Weed IoU (%)	mIoU (%)	F1-Score (%)	Precision (%)	Recall (%)
U-Net	99.24	95.48	63.47	86.06	91.53	90.67	92.41
DeepLabV3	99.19	95.19	63.50	85.96	90.96	92.27	89.68
SegNet	99.56	95.32	62.74	85.87	91.27	90.22	92.35
Swin-UNet	99.02	95.26	65.61	86.63	92.28	90.51	94.13
SegFormer-B0	99.21	95.06	66.05	86.77	92.82	91.61	94.06
WS-DINO	99.50	95.82	70.68	88.67	93.77	92.31	95.28

Note: the highest performance under each evaluation metric is highlighted in bold.

Table 2. Quantitative comparisons with benchmark segmentation methods on the MotionBlurred dataset. All metrics are reported in %.

Method	Bg IoU (%)	Crop IoU (%)	Weed IoU (%)	mIoU (%)	F1-Score (%)	Precision (%)	Recall (%)
U-Net	99.24	83.54	77.44	86.74	91.76	89.28	94.39
DeepLabV3	99.18	82.23	75.56	85.66	91.97	90.26	93.75
SegNet	99.27	83.72	77.49	86.83	92.69	91.36	94.05
Swin-UNet	99.20	83.65	78.20	87.02	92.84	92.91	92.78
SegFormer-B0	98.91	84.76	75.75	86.47	92.82	91.61	94.06
WS-DINO	99.35	85.55	81.34	88.75	93.92	93.13	94.72

Note: the highest performance under each evaluation metric is highlighted in bold.

Table 3. Ablation studies on the PhenoBench dataset. All metrics are reported in %.

Baseline	FPM	SFFM	Bg IoU (%)	Crop IoU (%)	Weed IoU (%)	mIoU (%)	F1-Score (%)	Precision (%)	Recall (%)
✔			99.23	95.39	64.46	86.36	91.88	91.25	92.53
✔	✔		99.26	95.72	69.46	88.15	93.14	91.94	94.43
✔		✔	99.27	95.69	69.15	88.04	93.06	92.25	93.91
✔	✔	✔	99.50	95.82	70.68	88.67	93.77	92.31	95.28

Note: ✔ indicates that the corresponding component is enabled; the best performance under each evaluation metric is highlighted in bold.

Table 4. Comparison of computational expense among benchmark segmentation methods in terms of parameter count and inference speed.

	WS-DINO	SegFormer-B0	Swin-UNet	SegNet	DeepLabV3	U-Net
Parameters (M)	24.29	3.82	36.10	27.20	55.28	24.21
Inference Speed (FPS)	34.10	77.25	28.32	45.15	32.63	54.32

Note: the best values are highlighted in bold (lowest parameter count and highest inference speed).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhou, H.; Liu, J.; Wu, R.; Zhao, B. WS-DINO: A DINOv2-Based Weed Segmentation Method with Feature Priors and Spatial Fusion. Agriculture 2026, 16, 1105. https://doi.org/10.3390/agriculture16101105

