AgriEngineering
  • Article
  • Open Access

15 November 2025

Image Prompt Adapter-Based Stable Diffusion for Enhanced Multi-Class Weed Generation and Detection

Department of Biosystems & Agricultural Engineering, Michigan State University, East Lansing, MI 48824, USA
* Author to whom correspondence should be addressed.
AgriEngineering 2025, 7(11), 389; https://doi.org/10.3390/agriengineering7110389

Abstract

The curation of large-scale, diverse datasets for robust weed detection is extremely time-consuming and resource-intensive in practice. Generative artificial intelligence (AI) opens up opportunities for image generation to supplement real-world image acquisition and annotation efforts. However, it is not a trivial task to generate high-quality, multi-class weed images that capture the nuances and variations in visual representations for enhanced weed detection. This study presents a novel investigation of advanced stable diffusion (SD) integrated with a module with image prompt capability, IP-Adapter, for weed image generation. Using the IP-Adapter-based model, two image feature encoders, CLIP (contrastive language image pre-training) and BioCLIP (a vision foundation model for biological images), were utilized to generate weed instances, which were then inserted into existing weed images. Image generation and weed detection experiments were conducted on a 10-class weed dataset captured in vegetable fields. The perceptual quality of generated images was assessed in terms of Fréchet Inception Distance (FID) and Inception Score (IS). YOLOv11 (You Only Look Once version 11) models were trained for weed detection, achieving an average improvement of 1.26% in mAP@50:95 when combining inserted weed instances with real ones in training, compared to using original images alone. Both the weed dataset and software programs in this study will be made publicly available. This study offers valuable perspectives on the use of IP-Adapter-based SD for weed image generation and weed detection.

1. Introduction

Weeds pose a major biological challenge to global agricultural production, significantly affecting crop productivity and threatening food security. Among various crop stressors, weeds cause an average of 34% of yield losses, more than animal pests (18%) and pathogens (16%) []. If left uncontrolled, weeds can lead to complete yield loss []. Globally, weed infestations result in an annual economic loss of $100 billion in crop production [], with the loss exceeding $34 billion in the United States alone []. The introduction of transgenic crops engineered for herbicide resistance in the 1990s brought a major shift in weed management, making herbicide application the primary strategy for many crops []. However, excessive reliance on herbicides has led to the selection of adaptive weed populations and the evolution of herbicide resistance [], significantly increasing the difficulty of herbicide application and overall management costs. At the time of writing, 539 cases of herbicide-resistant weeds of 273 species have been documented globally []. Moreover, the lack of new herbicides and the absence of novel herbicide modes of action [] exacerbate the challenges of weed management, especially as herbicide-resistant weed biotypes continue to grow []. These challenges underscore the compelling demand for herbicide-reduced or non-chemical weeding methods to promote sustainable crop production.
Machine vision technology offers a promising solution for precision weeding, enabling accurate weed localization and precise herbicide application to reduce usage [,,]. It is also important to implement non-herbicide weeding methods, such as mechanical weeding [,] and laser weeding []. However, many studies have been conducted in structured or controlled environments, and these weeding systems still struggle to reliably identify weeds and crops in dynamic field conditions, potentially causing detrimental environmental effects and crop damage [,]. Robust weed detection depends on large-scale, diverse datasets for model development that capture the biological variability of plants, compounded by variations in environmental conditions (e.g., field lighting, soil types). Although numerous labeled weed datasets have been created in recent years [,], they are still limited in terms of image/instance number, species diversity, and representation across varying field conditions and plant growth stages. A major barrier to curating such large datasets lies in the data collection and annotation process, which is time-consuming and labor-intensive, especially given the domain expertise required for weed identification.
The rapid advancement of generative artificial intelligence (AI) technologies has provided novel means to produce high-fidelity, high-resolution synthetic images that closely resemble real-world ones [,]. The generative AI methods are promising for generating realistic weed images to supplement existing datasets without the need to curate additional real-world samples. Diffusion Models (DMs) are a class of state-of-the-art probabilistic generative models designed to learn and represent data distributions []. Trained on large-scale image–text datasets, they can generate high-quality images from textual prompts and noise inputs []. However, these models incur high computational costs during training and face challenges in generating fine details of specific subjects across diverse contexts []. Stable Diffusion (SD) [], a large-scale Latent Diffusion Model (LDM), applies diffusion in the latent space of robust pre-trained autoencoders rather than directly in pixel space. It thus enables more stable, controlled information flow through neural layers, supports high-fidelity image generation, and, compared to pixel-based diffusion models, significantly reduces computational costs in training and inference processes. The impressive performance of SD in image generation has attracted widespread attention.
Despite operating in latent space, full fine-tuning of SD remains non-trivial because of the model’s complexity, with weights occupying over 7 GB of memory. To address this, several parameter-efficient methods, each with distinct advantages, have been proposed to ease fine-tuning, including Low-Rank Adaptation (LoRA) [], DreamBooth [], ControlNet [], T2I-Adapter [], and Image Prompt Adapter (IP-Adapter) []. DreamBooth fine-tunes parts of SD (e.g., UNet and the text encoder) to encode subject-specific identities from just 3 to 5 reference images. It encodes visual details into a unique identifier used in text prompts for image generation. LoRA inserts trainable low-rank matrices into attention layers while freezing the base model, enabling more efficient fine-tuning. It supports multiple subjects per class but may struggle with complex concepts from different classes. ControlNet introduces spatial conditioning to Stable Diffusion using paired inputs such as depth maps or bounding boxes. However, its ability to precisely control object placement and support multiple classes remains limited. T2I-Adapter offers a lightweight alternative to ControlNet, training only an external adapter module. It runs faster but provides weaker control and lower spatial fidelity. IP-Adapter enables image-based conditioning in SD by integrating a CLIP vision encoder [] to extract features from reference images. Unlike text-only or spatial-control methods, it provides direct visual guidance and supports generalization across diverse concepts, styles, and classes. A single IP-Adapter offers high-fidelity control and can handle multiple classes without requiring separate training for each. These advancements have facilitated the development of practical tools such as Automatic1111 WebUI (https://github.com/AUTOMATIC1111/stable-diffusion-webui (accessed on 10 January 2025)), Artbreeder (https://www.artbreeder.com (accessed on 17 January 2025)), and Leonardo.Ai (https://leonardo.ai (accessed on 20 January 2025)).
While SD has shown promise for weed image generation, it remains a challenge to effectively apply the technique and select appropriate parameter-efficient methods, particularly to ensure controllable generation and improved weed detection performance. Moreno et al. (2023) [] applied SD to generate an artificial weed dataset to supplement the original training set. In a three-species weed detection task, they achieved a 6% to 8.1% improvement in mAP@50 using YOLOv8-large trained on a combination of real and synthetic images. However, the authors only used center-cropped images containing a single weed instance to simplify the generation process and ensure compatibility with their color-based auto-annotation algorithm. Additionally, each weed species was trained separately. These constraints may limit the generalizability of their pipeline to real-world scenarios, where multiple weed instances of different species commonly appear within a single field of view. In our previous study [], ControlNet-augmented SD (v1.5) was utilized to generate weed images, achieving a 1.4% improvement in mAP@50:95 across 10 weed species. The approach used bounding boxes as spatial hints, eliminating the need to crop single-weed images as in []. However, generating weeds of each species by the ControlNet-based approach still required a separate model to prevent feature mixing, limiting scalability and necessitating the exclusion of other species from the training images. Additionally, the generated weed instances did not always align precisely with the provided bounding boxes, highlighting the challenges of image-level generation involving multiple weed instances.
IP-Adapter [] is a parameter-efficient extension of SD, enabling high-fidelity, image-guided generation through visual conditioning. Its key innovation is the use of decoupled cross-attention, which allows the model to independently process both visual and textual inputs within the U-Net. Instead of replacing text prompts, IP-Adapter introduces a parallel attention path that injects image features into multiple layers of the U-Net. This allows the model to maintain global control from text while enhancing fine details such as texture, pose, and color from the image, making it potentially well-suited for generating weed instances under varying conditions. Only the adapter layers are trainable, while the rest of the model remains frozen, which significantly reduces computational cost and lowers the risk of overfitting []. Compared to the image-level generation of ControlNet, IP-Adapter permits directly generating high-fidelity weed instances, which can then be seamlessly inserted into existing images, offering greater flexibility. Moreover, due to the strong guidance provided by image prompts, a single IP-Adapter model can support instance generation across multiple weed species.
Building directly on our prior work on weed image generation [], this study aimed to utilize IP-Adapter-augmented SD for improved multi-class weed image generation and weed detection, representing the first effort to leverage this advanced generative AI technique with image prompt capability for weed generation. Based on a 10-class weed dataset, this study was accomplished through three specific objectives, i.e., to: (1) propose an efficient pipeline for generating synthetic weed instances and inserting them into existing weed images, (2) assess the perceptual quality of the generated weed images, and (3) evaluate the efficacy of the images with generated weed instances for weed detection.

2. Materials and Methods

2.1. Dataset Curation

To remain consistent with our previous study [], the same three-season weed dataset [] (https://doi.org/10.5281/zenodo.14861516) was used in this study, comprising a total of 8436 images and 27,963 annotated instances across 10 weed classes. The images were collected over three consecutive years, including 4704 in 2021, 1948 in 2022, and 1784 in 2023. The images in 2021 were sourced from the CottonWeedDet12 dataset [,], while the images in 2022 were derived from the two-season dataset presented in []. The images in 2023 were acquired using a mobile platform operating under natural field conditions. Data collection took place across diverse field sites in Mississippi and Michigan. Further details on image acquisition and annotation can be found in [].

2.2. IP-Adapter-Based Stable Diffusion for Weed Generation

Stable Diffusion (SD) leverages a pre-trained autoencoder to encode images into latent representations and employs a U-Net-based denoising network conditioned on text embeddings, typically from a CLIP text encoder [], to generate high-resolution, semantically meaningful images guided by textual prompts []. However, as mentioned earlier, full fine-tuning of SD is time-consuming and prone to overfitting when training data is limited, due to the model’s large size exceeding 7 GB of memory. To address this and ensure high-quality generation of complex subjects that are difficult to describe through text alone, Ye et al. (2023) [] introduced a modular and efficient mechanism called IP-Adapter with image prompt capability, which incorporates a small number of trainable parameters into a pre-trained SD model. By enabling image-based conditioning, IP-Adapter bypasses the limitations and ambiguities of text prompts through direct guidance from visual features. Compared to the ControlNet-based SD in our previous study [], which uses spatial localization hints such as bounding boxes, IP-Adapter offers greater flexibility in representing fine-grained and visually complex subjects. Additionally, unlike the ControlNet-based SD that requires a separate model for generating images of each category to avoid feature mixing, a single IP-Adapter model can generalize across multiple categories, making it more scalable and efficient for generating multi-class weeds. Figure 1 shows the architectural framework of the IP-Adapter-based SD.
Figure 1. Architectural framework of IP-Adapter-based Stable Diffusion. In IP-Adapter training, blue blocks are frozen, while red blocks remain trainable. CLIP [] and LN denote contrastive language image pre-training and layer normalization, respectively.
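To make the decoupled cross-attention in Figure 1 concrete, the following minimal PyTorch sketch (not the authors' implementation) shows a shared query from the U-Net latents attending to text and image contexts through two parallel attention branches whose outputs are summed; the tensor sizes, head count, and scale value are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Minimal sketch of IP-Adapter-style decoupled cross-attention.

    A shared query attends separately to text and image contexts; in the actual
    adapter, only the image-branch projections are trainable while the base
    model stays frozen, mirroring the frozen/trainable split in Figure 1.
    """

    def __init__(self, dim: int, ctx_dim: int, num_heads: int = 8, scale: float = 1.0):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(dim, num_heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.attn_image = nn.MultiheadAttention(dim, num_heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.scale = scale  # weight on the image-prompt branch

    def forward(self, latent, text_ctx, image_ctx):
        # Shared query (U-Net latent tokens) attends to text and image contexts independently.
        out_text, _ = self.attn_text(latent, text_ctx, text_ctx)
        out_image, _ = self.attn_image(latent, image_ctx, image_ctx)
        return out_text + self.scale * out_image

# Toy shapes: 2 samples, 64 latent tokens of width 320, 77 text tokens and 4 image tokens of width 768.
layer = DecoupledCrossAttention(dim=320, ctx_dim=768)
z = torch.randn(2, 64, 320)
text = torch.randn(2, 77, 768)
img = torch.randn(2, 4, 768)
print(layer(z, text, img).shape)  # torch.Size([2, 64, 320])
```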
Given the challenges of image-level generation involving multiple weeds within a field scene [], this study instead focused on generating individual weeds separately via the IP-Adapter, inserting them into real images for precise location control. A species-agnostic text prompt was used to provide general semantic context: “Please generate a top-down illustration of a field with a realistic depiction of plants under various lighting conditions”. Species-specific visual information was supplied through a reference weed instance image. In addition to the default CLIP image encoder used in IP-Adapter, BioCLIP [] was included for comparison. BioCLIP is a specialized variant of CLIP, a vision foundation model trained on a large set of 10.4 million biological images spanning 454,000 species, with the potential to provide more representative embeddings for weed images.
In this study, the IP-Adapter-based SD models were trained for 50 epochs with a batch size of 8, which was found to be effective for model convergence through empirical testing. The classifier-free guidance (CFG) scale ranges from 0 to 20 and was set to 7.5 during training and 2.0 during inference. Higher CFG values enforce stronger adherence to text prompts, while lower values encourage diversity. Since this work employed identical short prompts across weed species and relied mainly on IP-Adapter image guidance, relatively low CFG values were applied in both stages. Strength was set to 1.0 for training and 0.8 for inference (range: 0–1) to allow adequate variation, with 50 denoising steps used for training and 20 for image generation. Other hyperparameters, such as the learning rate (1 × 10−4), were set to the default values in []. All training and testing experiments were conducted using the PyTorch framework (v2.1.0) on a Windows 11 PC with an i9-10900X processor (256 GB RAM) and an NVIDIA RTX A6000 GPU (48 GB memory).
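For illustration, the snippet below sketches how weed instances could be generated with the Hugging Face diffusers IP-Adapter integration under the inference settings reported above (CFG of 2.0, 20 steps, 512 × 512, image-prompt scale of 0.8). The base model identifier, adapter weight file, and reference-image path are placeholders rather than the exact assets used in this study.

```python
import torch
from diffusers import AutoPipelineForText2Image
from PIL import Image

# Load a Stable Diffusion v1.5 pipeline and attach a pretrained IP-Adapter.
# Model/weight identifiers are illustrative; the study trains its own adapter weights.
pipe = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.8)  # analogous to the 0.8 strength used at inference

reference = Image.open("eclipta_instance.jpg")  # hypothetical reference weed instance

image = pipe(
    prompt=("Please generate a top-down illustration of a field with a realistic "
            "depiction of plants under various lighting conditions"),
    ip_adapter_image=reference,
    guidance_scale=2.0,       # low CFG: rely mainly on the image prompt
    num_inference_steps=20,
    height=512,
    width=512,
).images[0]
image.save("generated_instance.png")
```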
Figure 2 depicts the overall weed generation pipeline, which includes mask creation, weed instance generation, and instance insertion. Masks designate the insertion locations for weed generation, with circular masks chosen to reflect the typical top-down view of most weed species, as weeds naturally spread outward in all directions. To reduce complexity, up to four masks of sizes comparable to the existing weeds were placed, ensuring minimal or no overlap with existing instances, particularly in densely packed weed images. To correspond with the downstream weed detection task, only the training set, resized to a maximum resolution of 1024 × 1024 pixels, was used for weed generation. Here, the mask diameter was set to range from 64 to 512 pixels, covering approximately 90% of real weed sizes across 10 species, with an added 10-pixel padding to ensure smooth edge blending. Subsequently, the IP-Adapter was operated at a resolution of 512 × 512 pixels, offering a good balance between image quality and generation speed. The generated instances were then scaled to match the corresponding mask size, and the mask regions in the target image were painted using the generated weeds. Finally, the painted areas were seamlessly integrated back into their original positions using Gaussian-based edge blending.
Figure 2. Pipeline of weed generation by the IP-Adapter-based Stable Diffusion. The IP-Adapter is capable of blending the original instance background with that of the target image, while also adjusting the instance’s aspect ratio to match the mask. The figure on the right shows additional examples of generated instances from various weed species.
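The insertion step in Figure 2 can be sketched as a simple alpha blend between the resized instance and the target image using a Gaussian-feathered circular mask. The kernel size and boundary handling below are assumptions, since only Gaussian-based edge blending is specified in the text.

```python
import cv2
import numpy as np

def insert_instance(target_bgr, instance_bgr, center, diameter, feather=21):
    """Paste a generated weed instance into a circular mask region with
    Gaussian-feathered edges (a sketch of the blending step in Figure 2).
    Assumes the mask lies fully inside the target image."""
    h, w = target_bgr.shape[:2]
    # Build a circular mask at the requested location and size.
    mask = np.zeros((h, w), dtype=np.float32)
    cv2.circle(mask, center, diameter // 2, 1.0, thickness=-1)
    # Feather the mask edge with a Gaussian blur so the instance blends smoothly.
    mask = cv2.GaussianBlur(mask, (feather, feather), 0)

    # Scale the generated instance to the mask size and place it on a copy of the target.
    inst = cv2.resize(instance_bgr, (diameter, diameter))
    canvas = target_bgr.copy()
    x0, y0 = center[0] - diameter // 2, center[1] - diameter // 2
    canvas[y0:y0 + diameter, x0:x0 + diameter] = inst

    # Alpha-blend the pasted canvas and the original target using the feathered mask.
    alpha = mask[..., None]
    blended = alpha * canvas.astype(np.float32) + (1 - alpha) * target_bgr.astype(np.float32)
    return blended.astype(np.uint8)

# Example with hypothetical file names:
# out = insert_instance(cv2.imread("field.jpg"), cv2.imread("gen_weed.png"),
#                       center=(600, 400), diameter=256)
```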
Following the weed instance insertion, effective annotation of the inserted weeds is required. A simple yet effective auto-annotation method was used. Since insertion is controlled, the weed species is already known. However, the mask may not precisely align with the generated instance due to generation variability and the 10-pixel padding added earlier. To refine the bounding box, a color-based approach was first applied to the images with the inserted weeds to detect green regions. However, overlapping leaves from other instances could interfere with accurate annotation by enlarging the green area, as shown in the left-center part of Figure 2. To improve the annotation, a weed detection model trained on the original training set images was used to predict instances in the inserted images. Predictions within a mask with confidence above 25% were used as annotations; otherwise, the green area identified by the color-based approach was used as the annotation.
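A sketch of the color-based fallback, used when the detector's prediction inside a mask falls below the 25% confidence threshold, is given below: green pixels are thresholded in HSV within the padded mask region and their bounding box is taken as the annotation. The HSV bounds are illustrative and would require tuning to the dataset and lighting.

```python
import cv2
import numpy as np

def green_bbox(image_bgr, mask_xyxy, h_range=(35, 85), s_min=60, v_min=40):
    """Fallback annotation: bounding box of green pixels inside a mask region.

    `mask_xyxy` is the padded mask region (x1, y1, x2, y2) in pixels; the HSV
    thresholds are illustrative defaults, not values from the paper.
    """
    x1, y1, x2, y2 = mask_xyxy
    roi = image_bgr[y1:y2, x1:x2]
    hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    green = cv2.inRange(hsv, (h_range[0], s_min, v_min), (h_range[1], 255, 255))
    ys, xs = np.nonzero(green)
    if len(xs) == 0:
        return mask_xyxy  # nothing green detected: keep the mask box itself
    # Convert the tight green-pixel box back to full-image coordinates.
    return (x1 + xs.min(), y1 + ys.min(), x1 + xs.max(), y1 + ys.max())
```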
It is noted that there is significant flexibility in scheduling weed instance generation and insertion. Figure 3 illustrates various strategies for inserting synthetic weeds. Different weed species can be placed into the same target locations while maintaining original lighting conditions, and the background and viewpoint can be adjusted to match the target image. Multiple weed instances can also be inserted into a single target image, with insertion sizes determined by the mask size. This is particularly efficient for late-season or well-managed fields where only a few weeds are present during data collection. Additionally, for near-perfect insertion, weed instances can be matched to target images with similar backgrounds or identical soil types. Although this was not the focus of the current study, it appears promising for creating highly natural synthetic images and remains an interesting direction for future work. To simplify the generation process, this study randomly matched weed instances with target images, allowing multiple mask locations per image for insertion, and focused primarily on generating more samples for sparse species. Further details are provided below in Section 3.
Figure 3. Examples of inserted weed instances: the top row shows different species inserted into the same location; the bottom-left illustrates multiple instances inserted into a single image; the bottom-right presents the insertion of a weed instance (highlighted with the yellow dashed box) into a target image containing many weeds of the same species of Lambsquarters.

2.3. YOLOv11 for Weed Detection

Among the latest, most powerful real-time object detectors, YOLOv11 [], featuring a more efficient architecture and advanced attention mechanisms compared to earlier versions of YOLO detectors, was selected for developing weed detection models due to its strong balance of efficiency, accuracy, and ease of implementation. Specifically, the YOLOv11-Large (YOLOv11l) variant was trained in three experimental groups for comparison: (1) trained only on original weed instances in the training set, (2) trained with additional synthetic weed instances generated by IP-Adapter using CLIP, and (3) trained with synthetic weed instances generated by IP-Adapter using BioCLIP instead of CLIP. The weed detection performance was assessed in terms of metrics including mAP@50 and mAP@50:95 [], as detailed in the following section. The goal was to evaluate the impact of incorporating synthetic instances relative to the number of real-world images per weed class, as illustrated in Figure 4. For model development and evaluation, the weed dataset was randomly split into training, validation, and test sets (75%:5%:20%) with three replications, as in our previous work [], and averaged performance metrics were used for model assessment and comparison.
Figure 4. Weed detection pipeline enhanced by the IP-Adapter-based Stable Diffusion (SD).
Input images were resized to 1024 × 1024 pixels for model training, providing adequate resolution to detect weeds of various sizes while offering a practical balance for training efficiency, with minimal loss of weed-specific features compared to the original higher-resolution images (over 10 million pixels). The YOLOv11l models were trained for 48 epochs with a batch size of 4, using default hyperparameters as specified in the official implementation []. Each modeling replication contained approximately 21,000 weed instances in total. To avoid occluding original weeds and to maintain natural placement without dense packing, approximately 12,000 synthetic instances, about 57% of the total number of training images, were inserted into the existing training set. In addition, for weed detection, a small portion of generated weeds that appeared blurred or deviated significantly from the original reference instances was replaced by directly copying and pasting the original instances. This strategy was applied to synthetic instances with confidence scores below 10%, an empirical threshold derived from preliminary experiments. Approximately 2000 instances (17% of the synthetic set) were removed in each replication, while the majority of synthetic instances remained. This preliminary filtering and replacement mechanism may be substituted with an improved approach in future work.
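As an illustration, the Ultralytics API can reproduce the training configuration reported above (1024-pixel input, 48 epochs, batch size of 4, otherwise default hyperparameters); the dataset configuration file name below is a placeholder.

```python
from ultralytics import YOLO

# Train YOLOv11-Large on one replication of the (real + synthetic) training split.
# "weeds10.yaml" is a placeholder dataset config listing the 10 weed classes.
model = YOLO("yolo11l.pt")
results = model.train(
    data="weeds10.yaml",
    imgsz=1024,
    epochs=48,
    batch=4,
)

# Evaluate on the held-out test split; metrics include mAP@50 and mAP@50:95.
metrics = model.val(data="weeds10.yaml", split="test", imgsz=1024)
print(metrics.box.map50, metrics.box.map)  # mAP@50, mAP@50:95
```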

2.4. Performance Metrics

The perceptual quality of the generated weed images was quantitatively assessed using two widely adopted evaluation metrics: Fréchet Inception Distance (FID) [] and Inception Score (IS) []. These metrics are designed to capture different aspects of generative performance. FID primarily evaluates image fidelity by measuring how closely the generated images resemble real ones in a learned feature space, while IS emphasizes diversity by assessing the variety and semantic clarity of the generated images, without requiring access to real image references. A high IS typically indicates both image quality and class diversity. Both FID and IS rely on feature representations extracted from the ImageNet-pretrained Inception-v3 model [].
To compute FID, features from the last pooling layer of Inception-v3 are extracted for both real and generated images, modeled as multivariate Gaussians. The Fréchet distance (Wasserstein-2) between these distributions is then computed []:
$$\mathrm{FID}(X, G) = \lVert \mu_X - \mu_G \rVert^2 + \mathrm{Tr}\!\left( C_X + C_G - 2\left( C_X C_G \right)^{1/2} \right)$$
where $\mu_X$ and $\mu_G$ are the mean feature vectors, and $C_X$ and $C_G$ are the covariance matrices of the real and generated images, respectively; $\mathrm{Tr}$ denotes the trace of a matrix, and $\lVert \cdot \rVert^2$ is the squared Euclidean distance.
The IS evaluates the Kullback–Leibler (KL) divergence between the conditional class distribution $P(y \mid x)$ and the marginal distribution $P(y)$ over generated data:
$$\mathrm{IS} = \exp\!\left( \mathbb{E}_x \left[ D_{\mathrm{KL}}\!\left( P(y \mid x) \,\Vert\, P(y) \right) \right] \right)$$
where $D_{\mathrm{KL}}$ denotes the KL divergence, and $\mathbb{E}_x$ represents the expectation over the generated images.
Weed detection performance was evaluated using mean Average Precision (mAP) metrics. Specifically, mAP@50 refers to mAP computed at an Intersection over Union (IoU) threshold of 0.5, while mAP@50:95 represents the average mAP over IoU thresholds ranging from 0.50 to 0.95 in 0.05 increments. Both mAP@50 and mAP@50:95 were employed to assess detection accuracy. The general formula for mAP is given as follows:
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i$$
where $N$ (set to 10 in this study) denotes the total number of weed classes, and $\mathrm{AP}_i$ is the average precision for the $i$-th class.
The FID and IS metrics were computed using the Torchmetrics library (v1.3.1), while mAP values were calculated using the Ultralytics library (v8.3.67) [], the official implementation of YOLOv11.
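For reference, the sketch below shows how FID and IS can be computed with TorchMetrics on batches of real and synthetic instance crops; the random uint8 tensors stand in for actual image batches.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Both metrics use Inception-v3 features; inputs are uint8 image batches (N, 3, H, W).
fid = FrechetInceptionDistance(feature=2048)
inception = InceptionScore()

real_batch = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)       # placeholder real instances
synthetic_batch = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)  # placeholder generated instances

fid.update(real_batch, real=True)
fid.update(synthetic_batch, real=False)
inception.update(synthetic_batch)

print("FID:", fid.compute().item())
is_mean, is_std = inception.compute()
print("IS:", is_mean.item(), "+/-", is_std.item())
```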

3. Results

3.1. General Overview

Figure 5 shows the number of original real images and bounding boxes per weed class in the training set, along with the synthetic instance number and rebalanced total instance counts after incorporating synthetic weed instances. As noted in [], the real (three-season) dataset exhibits significant variations in field backgrounds, lighting conditions, and plant growth stages, which could complicate effective diffusion model training. A clear class imbalance exists in the training set, with Lambsquarters being the dominant class at 9777 instances, which may bias the performance of weed detection models toward this species. Although this study does not explicitly focus on addressing class imbalance, an inverse-proportional instance rebalancing strategy was applied to generate synthetic weed instances, prioritizing underrepresented species. For example, Eclipta instances increased from 223 to 3397, and Goosegrass from 453 to 2411, contributing to a more balanced training set, as shown in Figure 5. The green bars represent synthetic instances for each class, which are inserted into real training images, while the red bars reflect the total instances (original plus synthetic).
Figure 5. Distributions of instance counts (bars) and image numbers (dotted curve) in the training set for one replication used in this study. The other two replications have similar distributions.
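One simple way to implement the inverse-proportional rebalancing described above is sketched below; the per-class counts and the total synthetic budget are illustrative, and the exact allocation rule used in this study may differ.

```python
def allocate_synthetic(real_counts: dict, budget: int) -> dict:
    """Distribute a synthetic-instance budget inversely proportional to real counts.

    real_counts: real instances per class; budget: total synthetic instances to generate.
    Classes with fewer real instances receive proportionally more synthetic ones.
    """
    inv = {cls: 1.0 / n for cls, n in real_counts.items()}
    total_inv = sum(inv.values())
    return {cls: round(budget * w / total_inv) for cls, w in inv.items()}

# Illustrative counts for three classes mentioned in the text (others omitted):
real = {"Eclipta": 223, "Goosegrass": 453, "Lambsquarters": 9777}
print(allocate_synthetic(real, budget=12000))
```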

3.2. Quality of Generated Weeds

BioCLIP [] was utilized as a biology-specific encoder to replace the default CLIP-based IP-Adapter for comparison. Figure 6 shows t-SNE visualizations of real-world weed instances encoded by CLIP (ViT-Huge/14) and BioCLIP (ViT-Base/16). The t-SNE (t-distributed Stochastic Neighbor Embedding) [] is a nonlinear dimensionality reduction technique commonly used for visualizing high-dimensional features. While CLIP is built on the larger ViT-Huge architecture, BioCLIP, with a comparable patch size for token extraction, shows enhanced representational strength and improved class distinction through tighter feature clustering. Certain species that were previously entangled, such as Carpetweed (Red) and Purslane (Blue), became more distinguishable in the BioCLIP feature space. However, some species, such as Lambsquarters (Green) and Palmer Amaranth (Cyan), remain mixed. Overall, BioCLIP may facilitate the generation of more species-specific and visually distinctive weed instances.
Figure 6. t-SNE (t-distributed stochastic neighbor embedding) visualization for weed features extracted by CLIP (left) or BioCLIP (right) encoders on real-world weed instances.
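The feature visualization in Figure 6 can be approximated with the sketch below, which embeds cropped weed instances using an open_clip image encoder and projects the features with scikit-learn's t-SNE; the model identifiers and file paths are assumptions (a BioCLIP checkpoint could be loaded analogously via its hub name).

```python
import numpy as np
import torch
import open_clip
from PIL import Image
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load a CLIP ViT-H/14 image encoder (identifiers are assumed, not from the paper);
# a BioCLIP checkpoint could be substituted, e.g., "hf-hub:imageomics/bioclip".
model, _, preprocess = open_clip.create_model_and_transforms("ViT-H-14", pretrained="laion2b_s32b_b79k")
model.eval()

def embed(paths):
    feats = []
    with torch.no_grad():
        for p in paths:
            img = preprocess(Image.open(p)).unsqueeze(0)
            feats.append(model.encode_image(img).squeeze(0).numpy())
    return np.stack(feats)

# `instance_paths` and `labels` are placeholders for cropped weed instances and class indices.
instance_paths, labels = ["inst_0.jpg", "inst_1.jpg"], [0, 1]
features = embed(instance_paths)
xy = TSNE(n_components=2, perplexity=min(30, len(features) - 1)).fit_transform(features)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10", s=8)
plt.savefig("tsne_clip.png", dpi=200)
```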
Since only the weed instances are generated while the target images remain real, the quality of synthetic weed instances was evaluated by directly applying FID and IS metrics to the synthetic instances, differing from the image-level assessment used for ControlNet-generated images []. Figure 7 shows the FID and IS for each weed species. For the original CLIP, the average FID and IS were 15.90 and 3.49, respectively, while BioCLIP yielded 18.67 and 3.89. The BioCLIP-based model often showed points in the top-right corner relative to the CLIP counterpart on the FID-IS plot, reflecting higher IS but worse FID performance. The higher IS indicates improved diversity and perceptual quality of generated images, although at the cost of degraded FID, suggesting reduced resemblance to real weed instances. Based on these two metrics, weeds in the bottom-left of the FID-IS plot (e.g., Lambsquarters, Palmer Amaranth) showed strong resemblance to real weeds but limited diversity. Those in the top-right (e.g., Goosegrass, Spotted Spurge) had greater diversity but lower resemblance. Weed classes in the middle-right (e.g., Eclipta, Ragweed) represented a balanced trade-off between resemblance and diversity.
Figure 7. Image quality of weed species, generated by IP-Adapter with original CLIP or BioCLIP, was evaluated using Fréchet Inception Distance (FID) and Inception Score (IS). Lower FID indicates higher resemblance to real images, and higher IS reflects greater quality and diversity. The arrows in the braces indicate the directions that lead to improved results.
Figure 8 shows weed instances generated by the IP-Adapter using the default CLIP encoder and the BioCLIP encoder. Although BioCLIP demonstrated superior clustering performance and different FID and IS compared to the default CLIP, this discrepancy was not clearly reflected in the visual appearance of the generated weed instances. One possible explanation is that the FID and IS metrics may not reliably capture the true visual quality of generated weed instances. This limitation stems from their reliance on the Inception v3 model pretrained on ImageNet, where only about 2% of the classes are plant-related, and none are weed-specific. Consequently, Inception v3 may struggle to extract meaningful features for evaluating weed images, potentially misrepresenting their perceptual quality. Similar issues have been observed in [], where FID scores contradicted human ratings. In the future, more systematic and biology-specific metrics should be developed to evaluate the quality of generated weed images. To improve visual quality assessment for generated weed images, future research could explore more accurate and representative metrics, such as fine-tuning Inception v3 on weed images or evaluating the CLIP-embedding-based Conditional Maximum Mean Discrepancy (CMMD) proposed in [].
Figure 8. Examples of weed instances generated by the IP-Adapter using the default CLIP (left) and BioCLIP (right) encoders.

3.3. Weed Detection

Three groups of modeling experiments were conducted for weed detection. Figure 9 illustrates the visual comparison between traditional copy-paste data augmentation and the IP-Adapter-generated weed instances. In the copy-paste method, a randomly selected bounding box of a weed instance was directly pasted onto a random position in the target image, a technique proven effective for enhancing object detection. However, this approach often introduces unnatural visual artifacts that conflict with human perception, even though it can easily increase the number of instances as the copy-paste ratio (i.e., the probability of applying copy-paste at each training step) rises. In contrast, the IP-Adapter provides a more natural blending, potentially leading to better performance. Consequently, copy-paste augmentation with two different ratios (0.5 and 1.0) was conducted as a comparison to the IP-Adapter method.
Figure 9. Examples of copy-pasted images (top) and IP-Adapter-inserted images (bottom). IP-Adapter is capable of merging background contexts between the inserted instance and the target image, while also adjusting the instance’s original aspect ratio to conform to the spatial size of the target region. A more similar background enables smoother insertion.
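For clarity, the copy-paste baseline described above can be sketched as a bounding-box-level operation applied with probability p at each training step; whether the study used a custom routine or a library hyperparameter is not specified, so the function below is illustrative.

```python
import random

def copy_paste_bbox(image, boxes, labels, p=0.5):
    """Bounding-box-level copy-paste for the baseline comparison (a sketch).

    With probability `p`, one existing box is cropped and hard-pasted at a random
    location, and the new box/label are appended. `boxes` are (x1, y1, x2, y2) pixels.
    """
    if not boxes or random.random() > p:
        return image, boxes, labels
    h, w = image.shape[:2]
    i = random.randrange(len(boxes))
    x1, y1, x2, y2 = map(int, boxes[i])
    patch = image[y1:y2, x1:x2].copy()
    ph, pw = patch.shape[:2]
    nx, ny = random.randint(0, w - pw), random.randint(0, h - ph)
    out = image.copy()
    out[ny:ny + ph, nx:nx + pw] = patch  # hard paste: no blending, hence the visual artifacts noted above
    return out, boxes + [(nx, ny, nx + pw, ny + ph)], labels + [labels[i]]
```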
Table 1 and Table 2 show the weed detection results based on real and synthetic instances. The baseline detection model “R only”, trained solely on real weeds, achieved an overall accuracy of 86.77% in mAP@50:95 and 94.80% in mAP@50, consistent with previous results on the same dataset []. Compared to this baseline, incorporating a 50% probability of adding copy-pasted weeds led to modest improvements, achieving 87.30% (+0.53%) in mAP@50:95 and 95.10% (+0.30%) in mAP@50. Increasing the addition probability to 100% yielded comparable but slightly lower gains (87.27%, +0.50% in mAP@50:95 and 94.90%, +0.10% in mAP@50), indicating that merely adding more instances via copy-paste may not guarantee consistent performance enhancements. In contrast, using IP-Adapter-generated weed instances produced more substantial improvements, with models utilizing CLIP and BioCLIP achieving 88.03% (+1.26%) and 87.90% (+1.22%) in mAP@50:95, and 95.30% (+0.50%) and 95.17% (+0.37%) in mAP@50, respectively. In addition, p-values were calculated at a significance level of 5%. The p-values for mAP@50:95 indicate significant differences between “R only,” “R+CP,” and “R+S.” In contrast, no significant differences were observed between using 0.5 versus 1.0 copy-paste, or between CLIP and BioCLIP. This is consistent with the similar performance achieved by the two encoders; although BioCLIP presents superior clustering ability, both encoders yielded comparable gains. This outcome may be attributed to the fact that images generated by BioCLIP, despite possessing enhanced representational capacity, still differ from real images, belonging to a visually similar but distinct domain. Therefore, the improved clustering performance of BioCLIP does not necessarily imply that its generated images are closer to real images than those produced by CLIP. Alternatively, it is also possible that the YOLO model lacks mechanisms, such as specialized attention modules for synthetic images, required to fully exploit the enhanced representational power of BioCLIP.
Table 1. Weed detection results of mAP@50:95 based on real and synthetic instances. “R only” represents the original images. “R+CP” indicates that copy-paste augmentation was applied to enhance the original dataset, with a 0.5 or 1.0 probability at each training step. “R+S” integrates synthetic weed instances into the original images, encoded using CLIP or BioCLIP. The best result for each weed class is shown in bold.
Table 2. Weed detection results of mAP@50 based on real and synthetic instances. “R only” represents the original images. “R+CP” indicates that copy-paste augmentation was applied to enhance the original dataset, with a 0.5 or 1.0 probability at each training step. “R+S” integrates synthetic weed instances into the original images, encoded using CLIP or BioCLIP. The best result for each weed class is shown in bold.
At the individual class level, Lambsquarters, Purslane, and Ragweed each achieved over a 1.5% improvement in mAP@50:95 with both CLIP- and BioCLIP-based IP-Adapter (e.g., +1.56%, +1.60%, and +1.86% with CLIP, respectively), with Ragweed also exhibiting a 1.0% increase in mAP@50. These results suggest that IP-Adapter-generated instances are beneficial for weed species with varying amounts of real instances, with Ragweed, characterized by complex leaf structures (thin and branched leaves), showing greater improvements, likely due to better blending between the synthetic instances and the target images. Another notable observation was found for Eclipta, where the CLIP-based IP-Adapter achieved over a 1.0% improvement (+1.24%), while BioCLIP resulted in minimal improvement (+0.14%). This discrepancy may be attributed to CLIP-generated samples more closely resembling real instances, whereas BioCLIP produced more diverse variations, potentially introducing a domain gap for this weed species with few real instances. Overall, an average improvement of 1.26% in mAP@50:95 was achieved by the IP-Adapter, and the enhancement for certain weed species was more significant. These results confirm the positive contribution of IP-Adapter-based synthetic instances to improving weed detection performance.
Figure 10 shows examples of weed detection results based on models trained using only real images and those trained with an integrated set of real and synthetic weed instances (by IP-Adapter). Although in many cases the two models produced similar detection outcomes (as reflected by the close results in Table 1), the inclusion of synthetic weeds led to improved predictions for certain challenging cases. Models trained on the augmented dataset with IP-Adapter-generated instances not only exhibited higher confidence scores but also showed notable improvements, particularly for small weeds, weed clusters, and images with complex backgrounds such as shadows, which could be especially beneficial for precision weeding applications.
Figure 10. Examples of weed detection in test images using YOLOv11l (large). The top images show results from models trained only on real images, while the bottom part shows the detection by the models trained on inserted images, the IP-Adapter encoded by CLIP.

4. Discussion

The previous study using ControlNet-added SD [] achieved a 1.4% gain in mAP@50:95 but required training separate models for each weed class to avoid feature mixing. In contrast, all 10 weed species in this study could be generated using a single IP-Adapter-based SD model, which yielded a comparable improvement of 1.26% on the same dataset. On the other hand, the automatic labeling process employed in our study may introduce annotation errors despite the filtration procedures applied. Such errors could potentially affect the final performance of the weed detection model. The accuracy may be further improved in future work by developing effective methods to quantify annotation errors in synthetic images and to optimize the automated labeling process. The IP-Adapter approach significantly reduced the total model training time from about 9 days to less than 20 h on an NVIDIA RTX A6000 GPU (48 GB). Compared to training time, the inference times of both ControlNet and IP-Adapter are much shorter. Although ControlNet is slightly slower than IP-Adapter, both can generate a 512 × 512 image in less than 3 s with 20 steps, and only a few hours are required to generate thousands of images. This efficiency was made possible by leveraging the high level of detail provided through image prompting, which streamlined the generation process by requiring the training of only one model instead of ten. Furthermore, instance-level weed generation by IP-Adapter is inherently easier than full image-level generation by ControlNet. Moreover, arbitrary background images can be utilized to greatly expand the range of scenarios, including new fields, varied lighting conditions, weather-induced soil changes, and even potted plants, rather than being restricted to the original weed image backgrounds present in the generative model’s training set.
Compared with other state-of-the-art image generation models, such as GigaGAN [], R3GAN [], and SoftVQ-VAE [], generative adversarial networks (GANs) still struggle with diversity and the inherent instability of the training process []. Although variational autoencoders (VAEs) are widely adopted within diffusion models, standalone VAEs generally lag behind both diffusion- and GAN-based models in terms of realism []. Furthermore, the ControlNet or IP-Adapter-based Stable Diffusion pipeline enables precise spatial control of the generated instances, a capability that GANs and VAEs generally lack.
Numerous publicly available weed datasets have been developed through various imaging efforts. Although large-scale datasets with over 100,000 images, comparable to COCO [], are rare in weed detection, dozens of small- to medium-sized datasets have been made available. However, as noted by [], these datasets are often constrained by limited field diversity and a relatively small number of annotated instances. To address these gaps, image synthesis has emerged as a powerful strategy, offering a scalable and flexible alternative to traditional methods of data collection and curation for improving weed detection capabilities. The IP-Adapter approach enables weed datasets to be effectively decomposed into background images and individual weed instances, with backgrounds subsequently painted to fill the regions where the original instances were removed. It enables the generation of high-quality, multi-class synthetic weeds by blending real weed instances into new backgrounds. Therefore, this method could facilitate the integration of existing small- to medium-scale weed datasets, each containing hundreds to thousands of images. When combined, these datasets could generate millions of synthetic images, enabling the construction of large-scale datasets for model training. Further research is needed to fully explore and validate the benefits of this approach. In addition, the dataset used in this study does not include crops, which limits its applicability to practical precision weeding scenarios where crops and weeds coexist. To address this, we have gradually added crop images from vegetable fields (e.g., lettuce, radish, and beet) to our dataset during the 2024 and 2025 seasons [], which can be incorporated into future studies. This promising IP-Adapter-based image generation approach could also be applied to enhance detection performance in practical fields containing both weeds and crops.
A general prompt was used to describe the overall scenario for generating different classes of weeds, as shown in Figure 1. However, more refined prompts would likely be preferable for producing weeds with specific properties, such as lighting conditions, growth stages, or leaf numbers. This would require more dedicated efforts in prompt design, potentially employing techniques such as prompt engineering []. Additionally, future upgrades could involve generating videos [] that capture various weed growth stages, with key frames extracted for use in weed detection training. On the other hand, because the insertion mask has already defined the placement of synthetic weeds, there is no need for post-filtering steps, such as those used by [], to remove low-quality synthetic images or misaligned weeds. Furthermore, although the BioCLIP encoder, trained on a large biological image dataset encompassing plants, animals, and fungi, demonstrated superior clustering performance on weed species compared to the default CLIP encoder used in the IP-Adapter, this improvement did not translate into better weed detection performance. This suggests that advanced feature clustering alone may not directly improve detection accuracy, highlighting the need for more straightforward and accurate image quality metrics, such as the conditional distribution-based CMMD proposed in [], rather than relying solely on FID and IS, to better assess the relationship between synthetic image quality and task performance in weed detection.
The deployment of machine vision-based weed detection models integrated with practical weeding systems is critical for ensuring successful application in real-world agricultural environments. In this study, the 10-class weed dataset focused on detecting individual weed species, without including crop images. While species-level identification is valuable, practical precision weeding often initially depends on distinguishing crops from non-crops to avoid crop injury, particularly when weeds emerge close to desirable plants. If crops are not included during model training, the model may incorrectly classify crops as weed species with similar visual features, rather than ignoring them, potentially confusing the weeding system and increasing the risk of damaging crops. For many mechanical weeding systems [], early designs relied on simple inter-row or height-based intra-row weeding strategies, which utilized height differences between crops and weeds. More recent systems have incorporated vision capabilities to differentiate between crops and weeds [,], labeled as two categories of “crop” and “weed”. However, a species-blind approach often leads to ineffective or unnecessary interventions, and bioherbicides typically target only a narrow range of weed species []. As weeding technology evolves, intra-row weeders with species-level weed detection will become increasingly important, enabling targeted, species-specific control strategies. For example, different chemicals specifically target grasses and broadleaf weeds, respectively; thus, accurate identification is crucial for optimizing herbicide applications []. Meanwhile, although precise crop/weed detection is becoming feasible, dedicated actuation systems, integrated with detection models for species-specific weed control, are still in the early stages of development [], highlighting a key area for future research and the integration of advanced vision and control technologies.

5. Conclusions

This study presents a novel investigation of an IP-Adapter-based SD approach to weed generation, which enables the direct generation of weed instances that are then inserted into target background images rather than generating complete images. This method offers greater flexibility and higher instance fidelity by leveraging direct image prompting with reference weed instance images. A single IP-Adapter model was capable of producing instances of 10 weed species, maintaining high fidelity and aligning the aspect ratios of reference instances with target images. The synthetic weed instances achieved average FIDs and ISs of 15.90 and 3.49, respectively, using a CLIP encoder, and 18.67 and 3.89, respectively, using BioCLIP. Notably, BioCLIP demonstrated superior clustering performance across the 10 weed species. Augmenting the training set with synthetic weed instances, alongside real images, improved weed detection performance, yielding an overall gain of 1.26% in mAP@50:95 compared to the baseline model trained only on real images. The proposed method also outperformed traditional copy-paste data augmentation, which achieved only a 0.53% improvement in mAP@50:95. Furthermore, the annotation of synthetic weed instances was automated to streamline their integration into the detection model training pipeline. Further efforts are needed to fully leverage generative modeling for greater detection improvements.

Author Contributions

B.D.: writing—original draft, investigation, formal analysis, software; Y.L.: writing—original draft, review and editing, conceptualization, supervision, funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research was in part supported by the Discretionary Funding Initiative of Michigan State University, the Farm Innovation Grant of the Michigan Department of Agriculture and Rural Development (MDARD), and the MDARD Specialty Crop Block Grant Program through the Michigan Vegetable Council.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare there is no conflict of interest in this manuscript.

References

  1. Oerke, E.C. Crop losses to pests. J. Agric. Sci. 2006, 144, 31–43. [Google Scholar] [CrossRef]
  2. Chauhan, B.S. Grand challenges in weed management. Front. Agron. 2020, 1, 3. [Google Scholar] [CrossRef]
  3. Esposito, M.; Crimaldi, M.; Cirillo, V.; Sarghini, F.; Maggio, A. Drone and sensor technology for sustainable weed management: A review. Chem. Biol. Technol. Agric. 2021, 8, 18. [Google Scholar] [CrossRef]
  4. Pimentel, D.; Lach, L.; Zuniga, R.; Morrison, D. Environmental and economic costs of nonindigenous species in the United States. BioScience 2000, 50, 53–65. [Google Scholar] [CrossRef]
  5. Duke, S.O. Perspectives on transgenic, herbicide-resistant crops in the United States almost 20 years after introduction. Pest Manag. Sci. 2015, 71, 652–657. [Google Scholar] [CrossRef]
  6. Délye, C.; Jasieniuk, M.; Le Corre, V. Deciphering the evolution of herbicide resistance in weeds. Trends Genet. 2013, 29, 649–658. [Google Scholar] [CrossRef]
  7. Heap, I. The International Herbicide-Resistant Weed Database. 2025. Available online: www.weedscience.org (accessed on 14 November 2025).
  8. Heap, I.; Duke, S.O. Overview of glyphosate-resistant weeds worldwide. Pest Manag. Sci. 2018, 74, 1040–1049. [Google Scholar] [CrossRef]
  9. Westwood, J.H.; Charudattan, R.; Duke, S.O.; Fennimore, S.A.; Marrone, P.; Slaughter, D.C.; Swanton, C.; Zollinger, R. Weed management in 2050: Perspectives on the future of weed science. Weed Sci. 2018, 66, 275–285. [Google Scholar] [CrossRef]
  10. Gerhards, R.; Andujar Sanchez, D.; Hamouz, P.; Peteinatos, G.G.; Christensen, S.; Fernandez-Quintanilla, C. Advances in site-specific weed management in agriculture—A review. Weed Res. 2022, 62, 123–133. [Google Scholar] [CrossRef]
  11. Upadhyay, A.; Sunil, G.C.; Zhang, Y.; Koparan, C.; Sun, X. Development and evaluation of a machine vision and deep learning-based smart sprayer system for site-specific weed management in row crops: An edge computing approach. J. Agric. Food Res. 2024, 18, 101331. [Google Scholar] [CrossRef]
  12. Deng, B.; Lu, Y.; Brainard, D. Improvements and evaluation of a smart sprayer prototype for weed control in vegetable crops. In 2025 ASABE Annual International Meeting Paper No. 2500549; American Society of Agricultural and Biological Engineers: St. Joseph, MI, USA, 2025. [Google Scholar] [CrossRef]
  13. Machleb, J.; Peteinatos, G.G.; Kollenda, B.L.; Andújar, D.; Gerhards, R. Sensor-based mechanical weed control: Present state and prospects. Comput. Electron. Agric. 2020, 176, 105638. [Google Scholar] [CrossRef]
  14. Xiang, M.; Qu, M.; Wang, G.; Ma, Z.; Chen, X.; Zhou, Z.; Qi, J.; Gao, X.; Li, H.; Jia, H. Crop detection technologies, mechanical weeding executive parts and working performance of intelligent mechanical weeding: A review. Front. Plant Sci. 2024, 15, 1361002. [Google Scholar] [CrossRef] [PubMed]
  15. Yaseen, M.U.; Long, J.M. Laser weeding technology in cropping systems: A comprehensive review. Agronomy 2024, 14, 2253. [Google Scholar] [CrossRef]
  16. Lati, R.N.; Siemens, M.C.; Rachuy, J.S.; Fennimore, S.A. Intrarow weed removal in broccoli and transplanted lettuce with an intelligent cultivator. Weed Technol. 2016, 30, 655–663. [Google Scholar] [CrossRef]
  17. Pai, D.G.; Kamath, R.; Balachandra, M. Deep learning techniques for weed detection in agricultural environments: A comprehensive review. IEEE Access 2024, 12, 113193–113214. [Google Scholar] [CrossRef]
  18. Lu, Y.; Young, S. A survey of public datasets for computer vision tasks in precision agriculture. Comput. Electron. Agric. 2020, 178, 105760. [Google Scholar] [CrossRef]
  19. Deng, B.; Lu, Y.; Xu, J. Weed database development: An updated survey of public weed datasets and cross-season weed detection adaptation. Ecol. Inform. 2024, 81, 102546. [Google Scholar] [CrossRef]
  20. Lu, Y.; Chen, D.; Olaniyi, E.; Huang, Y. Generative adversarial networks (GANs) for image augmentation in agriculture: A systematic review. Comput. Electron. Agric. 2022, 200, 107208. [Google Scholar] [CrossRef]
  21. Raut, G.; Singh, A. Generative AI in vision: A survey on models, metrics and applications. arXiv 2024, arXiv:2402.16369. [Google Scholar] [CrossRef]
  22. Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
  23. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar] [CrossRef]
  24. Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869. [Google Scholar] [CrossRef] [PubMed]
  25. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  26. Ryu, S. Low-Rank Adaptation for Fast Text-To-Image Diffusion Fine-Tuning, version 0.0.1; [Computer software]; 2023. GitHub. Available online: https://github.com/cloneofsimo/lora (accessed on 17 January 2025).
  27. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. [Google Scholar]
  28. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3836–3847. [Google Scholar]
  29. Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. Proc. AAAI Conf. Artif. Intell. 2024, 38, 4296–4304. [Google Scholar] [CrossRef]
  30. Ye, H.; Zhang, J.; Liu, S.; Han, X.; Yang, W. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv 2023, arXiv:2308.06721. [Google Scholar] [CrossRef]
  31. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  32. Moreno, H.; Gómez, A.; Altares-López, S.; Ribeiro, A.; Andújar, D. Analysis of stable diffusion-derived fake weeds performance for training convolutional neural networks. Comput. Electron. Agric. 2023, 214, 108324. [Google Scholar] [CrossRef]
  33. Deng, B.; Lu, Y. Weed image augmentation by ControlNet-added stable diffusion for multi-class weed detection. Comput. Electron. Agric. 2025, 232, 110123. [Google Scholar] [CrossRef]
  34. Lu, Y. 3SeasonWeedDet10: A three-season, 10-class dataset for benchmarking AI models for robust weed detection [Data set]. Zenodo 2025. [Google Scholar] [CrossRef]
  35. Lu, Y. CottonWeedDet12: A 12-class weed dataset of cotton production systems for benchmarking AI models for weed detection [Data set]. Zenodo 2023. [Google Scholar] [CrossRef]
  36. Dang, F.; Chen, D.; Lu, Y.; Li, Z. YOLOWeeds: A novel benchmark of YOLO object detectors for multi-class weed detection in cotton production systems. Comput. Electron. Agric. 2023, 205, 107655. [Google Scholar] [CrossRef]
  37. Stevens, S.; Wu, J.; Thompson, M.J.; Campolongo, E.G.; Song, C.H.; Carlyn, D.E.; Dong, L.; Dahdul, W.M.; Stewart, C.; Berger-Wolf, T.; et al. Bioclip: A vision foundation model for the tree of life. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19412–19424. [Google Scholar]
  38. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics, Version 11.0.0; [Computer Software]; 2024. GitHub. Available online: https://github.com/ultralytics/ultralytics (accessed on 25 January 2025).
  39. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  40. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training GANs. In Proceedings of the 30th Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  41. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  42. Dowson, D.C.; Landau, B. The Fréchet distance between multivariate normal distributions. J. Multivar. Anal. 1982, 12, 450–455. [Google Scholar] [CrossRef]
  43. Hinton, G.E.; Roweis, S. Stochastic neighbor embedding. In Proceedings of the 16th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–14 December 2002; pp. 857–864. [Google Scholar]
  44. Jayasumana, S.; Ramalingam, S.; Veit, A.; Glasner, D.; Chakrabarti, A.; Kumar, S. Rethinking fid: Towards a better evaluation metric for image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 9307–9315. [Google Scholar]
  45. Kang, M.; Zhu, J.Y.; Zhang, R.; Park, J.; Shechtman, E.; Paris, S.; Park, T. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10124–10134. [Google Scholar]
  46. Huang, N.; Gokaslan, A.; Kuleshov, V.; Tompkin, J. The gan is dead; long live the gan! a modern gan baseline. Adv. Neural Inf. Process. Syst. 2024, 37, 44177–44215. [Google Scholar]
  47. Chen, H.; Wang, Z.; Li, X.; Sun, X.; Chen, F.; Liu, J.; Wang, J.; Raj, B.; Liu, Z.; Barsoum, E. Softvq-vae: Efficient 1-dimensional continuous tokenizer. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–12 June 2025; pp. 28358–28370. [Google Scholar]
  48. Chakraborty, T.; KS, U.R.; Naik, S.M.; Panja, M.; Manvitha, B. Ten years of generative adversarial nets (GANs): A survey of the state-of-the-art. Mach. Learn. Sci. Technol. 2024, 5, 011001. [Google Scholar] [CrossRef]
  49. Wang, X.; He, Z.; Peng, X. Artificial-intelligence-generated content with diffusion models: A literature review. Mathematics 2024, 12, 977. [Google Scholar] [CrossRef]
  50. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. Computer Vision—ECCV 2014. In Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2014; Volume 8693. [Google Scholar] [CrossRef]
  51. Amatriain, X. Prompt design and engineering: Introduction and advanced methods. arXiv 2024, arXiv:2401.14423. [Google Scholar] [CrossRef]
  52. Xing, Z.; Feng, Q.; Chen, H.; Dai, Q.; Hu, H.; Xu, H.; Wu, Z.; Jiang, Y.G. A survey on video diffusion models. ACM Comput. Surv. 2024, 57, 1–42. [Google Scholar] [CrossRef]
  53. Gatkal, N.R.; Nalawade, S.M.; Shelke, M.S.; Sahni, R.K.; Walunj, A.A.; Kadam, P.B.; Ali, M. Review of cutting-edge weed management strategy in agricultural systems. Int. J. Agric. Biol. Eng. 2025, 18, 25–42. [Google Scholar] [CrossRef]
  54. FarmWise. Titan: AI-Powered Mechanical Weeding Robot. 2021. Available online: https://farmwise.io/ (accessed on 14 November 2025).
  55. Patel, D.; Gandhi, M.; Shankaranarayanan, H.; Darji, A.D. Design of an autonomous agriculture robot for real-time weed detection using CNN. In Advances in VLSI and Embedded Systems; Lecture Notes in Electrical Engineering; Darji, A.D., Joshi, D., Joshi, A., Sheriff, R., Eds.; Springer: Singapore, 2022; Volume 962. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
