VL-PAW: A Vision–Language Dataset for Pear, Apple and Weed
Abstract
1. Introduction
- VL-PAW dataset: A vision–language dataset for agriculture consisting of 3.9K image–caption pairs. The images were collected from earlier research studies, and the captions were annotated by field experts and OpenAI ChatGPT [10] (Section 3.1).
- VLM fine-tuning: Fine-tuning only the projection head of a CLIP model on VL-PAW yields the best performance in both few-shot and full-dataset settings, and achieves results on par with a fully trained ViT (Section 4.3); a minimal fine-tuning sketch follows this list.
- Benefit of descriptive captions: Using detailed captions enhances performance, particularly for fine-grained tasks such as fruit inspection (Section 4.4).
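To make the projection-head strategy concrete, the following is a minimal sketch, assuming the Hugging Face transformers implementation of CLIP and the public openai/clip-vit-base-patch32 checkpoint; the paper's exact backbone, optimizer, learning rate, and choice of which projection layer(s) to unfreeze are not stated in this outline, so those details are illustrative assumptions.

```python
# Minimal sketch: freeze the CLIP backbone and train only the projection head(s).
# Assumptions (not from the paper): Hugging Face `transformers` CLIP, the
# `openai/clip-vit-base-patch32` checkpoint, AdamW, and a 1e-4 learning rate.
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Freeze every parameter, then re-enable only the projection layers that map
# the encoder outputs into the shared embedding space.
for param in model.parameters():
    param.requires_grad = False
for param in model.visual_projection.parameters():
    param.requires_grad = True
for param in model.text_projection.parameters():
    param.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(f"{sum(p.numel() for p in trainable):,} trainable parameters")
```

Training would then proceed with the contrastive objective described in Section 3.3.2 while both encoders remain frozen.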
2. Related Work
3. Materials and Methods
3.1. VL-PAW (Vision–Language Dataset for Pear, Apple, and Weed)
- CNU Pear Disease dataset: This dataset addresses disease classification in pears and includes 1000 images divided into four classes. It covers normal samples (400 images) and three disease categories: black spot (200 samples), moth disease (178 samples), and pear cicada disease (222 samples). The dataset emphasizes the challenge of distinguishing between healthy and diseased pears and among different diseases with overlapping visual features.
- Surface Subtle Apple Defects dataset: This subset is designed for detecting surface defects on Fuji apples. It contains 800 images distributed across two defect types: pest damage (393 samples) and scratches (407 samples). The dataset captures high-resolution apple surface images, reflecting real-world challenges in identifying small and fine-grained defects that are critical for industrial quality control.
- CNU Weeds dataset: This subset focuses on weed classification, consisting of images annotated at both family and class levels. It includes five families with 21 weed species, each represented by 100 samples, totaling 2100 images. The dataset features subtle inter-class variations, such as those within the Amaranthaceae family (e.g., common pigweed, slender amaranth, smooth pigweed), making it ideal for fine-grained classification tasks. Additionally, it includes diverse families such as Scrophulariaceae, Chenopodiaceae, Convolvulaceae and Asteraceae, ensuring a broad coverage of weed species.
“No response other information, only response 5 different sentences describing the physical properties of the object: color, size, shape, surface, appearance, texture, etc. Each sentence must follow this format: ‘object of cls class which has [context]’. [context] should be described details of the image. Each sentence separated by a new line.”
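The paper does not specify how this prompt was issued to ChatGPT [10]. The sketch below is only an illustration of one way to automate caption generation; the OpenAI Python SDK, the gpt-4o model name, the base64-encoded JPEG input, and the {cls} placeholder substitution are all assumptions rather than details from the paper.

```python
# Illustrative sketch of sending the captioning prompt together with an image.
# Assumptions (not from the paper): OpenAI Python SDK, `gpt-4o`, base64 JPEG input.
import base64
from openai import OpenAI

# "{cls}" replaces the paper's "cls" placeholder so the class name can be filled in.
PROMPT = (
    "No response other information, only response 5 different sentences "
    "describing the physical properties of the object: color, size, shape, "
    "surface, appearance, texture, etc. Each sentence must follow this format: "
    "'object of {cls} class which has [context]'. [context] should be described "
    "details of the image. Each sentence separated by a new line."
)

def generate_captions(image_path: str, cls_name: str) -> list[str]:
    """Return five candidate captions for one labeled image."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT.format(cls=cls_name)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().splitlines()
```

In practice, the generated sentences would still be reviewed by the field experts mentioned above before being added to the dataset.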
3.2. Dataset Assessment
3.3. Modeling
3.3.1. Architecture of CLIP
- Image Encoder: The image encoder is typically a Convolutional Neural Network (CNN) or a Vision Transformer (ViT). It maps an image $I$ into an image embedding $v \in \mathbb{R}^d$, where $d$ is the dimensionality of the embedding.
- Text Encoder: The text encoder is based on a Transformer architecture. It encodes a textual description $T$ into a text embedding $t \in \mathbb{R}^d$, sharing the same dimensionality $d$ as the image embeddings. A minimal encoding sketch follows this list.
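The sketch below embeds one image and one caption into the shared $d$-dimensional space, assuming the Hugging Face transformers CLIP implementation and the openai/clip-vit-base-patch32 checkpoint; the file path and caption are hypothetical and merely mimic the dataset's caption format.

```python
# Minimal sketch: embed one image and one caption with CLIP's two encoders.
# The checkpoint is an assumed public one; the paper's backbone may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("pear.jpg")  # hypothetical sample image
caption = "object of black spot class which has dark circular lesions on the skin"  # hypothetical caption

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    v = model.get_image_features(pixel_values=inputs["pixel_values"])          # shape (1, d)
    t = model.get_text_features(input_ids=inputs["input_ids"],
                                attention_mask=inputs["attention_mask"])       # shape (1, d)

# Normalize to unit vectors and compute the cosine similarity, as in Section 3.3.2.
v = v / v.norm(dim=-1, keepdim=True)
t = t / t.norm(dim=-1, keepdim=True)
print((v @ t.T).item())
```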
3.3.2. Training Objective
- Embedding Space Alignment: Given a batch of $N$ image–text pairs $\{(I_i, T_i)\}_{i=1}^{N}$, the image encoder produces embeddings $\{v_i\}_{i=1}^{N}$ and the text encoder produces embeddings $\{t_i\}_{i=1}^{N}$. These embeddings are normalized to unit vectors: $\hat{v}_i = v_i / \lVert v_i \rVert$ and $\hat{t}_i = t_i / \lVert t_i \rVert$.
- Similarity score: The similarity between the $i$th image and the $j$th text embedding is measured using the cosine similarity $s_{ij} = \hat{v}_i^{\top} \hat{t}_j$.
- Contrastive loss: The model optimizes a symmetric cross-entropy loss across the image-to-text and text-to-image directions. The loss for a batch is given as $\mathcal{L} = \frac{1}{2}(\mathcal{L}_{\mathrm{i2t}} + \mathcal{L}_{\mathrm{t2i}})$, where $\mathcal{L}_{\mathrm{i2t}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)}$, $\mathcal{L}_{\mathrm{t2i}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)}$, and $\tau$ is a learnable temperature parameter. A minimal PyTorch-style sketch of this loss follows this list.
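The sketch below mirrors the symmetric loss above; the fixed temperature value is an illustrative assumption (CLIP learns it as a parameter), and the variable names follow the notation of this subsection.

```python
# Sketch of the symmetric contrastive (InfoNCE-style) loss over a batch.
# v, t: (N, d) image and text embeddings; tau: temperature (learnable in CLIP).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(v: torch.Tensor, t: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    v = F.normalize(v, dim=-1)            # unit-norm image embeddings, \hat{v}_i
    t = F.normalize(t, dim=-1)            # unit-norm text embeddings,  \hat{t}_j
    logits = (v @ t.T) / tau              # s_ij / tau, shape (N, N)
    targets = torch.arange(v.size(0), device=v.device)  # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)          # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)        # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random embeddings:
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```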
3.3.3. Text-Based Image Retrieval
4. Results
4.1. Settings
4.2. Zero-Shot Prediction
4.3. Few-Shot and Full Fine-Tuning
4.4. Effects of Descriptive Captions
4.5. Text-Based Image Retrieval
4.6. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-language models for vision tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644. [Google Scholar] [CrossRef] [PubMed]
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Li, L.; Li, J.; Chen, D.; Pu, L.; Yao, H.; Huang, Y. VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture. arXiv 2025, arXiv:2504.13365. [Google Scholar]
- Yu, P.; Lin, B. A Framework for Agricultural Intelligent Analysis Based on a Visual Language Large Model. Appl. Sci. 2024, 14, 8350. [Google Scholar] [CrossRef]
- Arshad, M.A.; Jubery, T.Z.; Roy, T.; Nassiri, R.; Singh, A.K.; Singh, A.; Hegde, C.; Ganapathysubramanian, B.; Balu, A.; Krishnamurthy, A.; et al. Leveraging Vision Language Models for Specialized Agricultural Tasks. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; IEEE: New York, NY, USA, 2025; pp. 6320–6329. [Google Scholar]
- Nawaz, U.; Awais, M.; Gani, H.; Naseer, M.; Khan, F.; Khan, S.; Anwer, R.M. AgriCLIP: Adapting CLIP for Agriculture and Livestock via Domain-Specialized Cross-Model Alignment. arXiv 2024, arXiv:2410.01407. [Google Scholar]
- Gauba, A.; Pi, I.; Man, Y.; Pang, Z.; Adve, V.S.; Wang, Y.X. AgMMU: A Comprehensive Agricultural Multimodal Understanding and Reasoning Benchmark. arXiv 2025, arXiv:2504.10568. [Google Scholar]
- Awais, M.; Alharthi, A.H.S.A.; Kumar, A.; Cholakkal, H.; Anwer, R.M. AgroGPT: Efficient Agricultural Vision-Language Model with Expert Tuning. arXiv 2024, arXiv:2410.08405. [Google Scholar]
- Yang, X.; Gao, J.; Xue, W.; Alexandersson, E. Pllama: An open-source large language model for plant science. arXiv 2024, arXiv:2401.01600. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Girdhar, R.; El-Nouby, A.; Liu, Z.; Singh, M.; Alwala, K.V.; Joulin, A.; Misra, I. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15180–15190. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning; PMLR: New York, NY, USA, 2021; pp. 8748–8763. [Google Scholar]
- Schuhmann, C.; Vencu, R.; Beaumont, R.; Kaczmarczyk, R.; Mullis, C.; Katta, A.; Coombes, T.; Jitsev, J.; Komatsuzaki, A. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv 2021, arXiv:2111.02114. [Google Scholar]
- Arshad, M.A.; Jubery, T.Z.; Roy, T.; Nassiri, R.; Singh, A.K.; Singh, A.; Hegde, C.; Ganapathysubramanian, B.; Balu, A.; Krishnamurthy, A.; et al. AgEval: A Benchmark for Zero-Shot and Few-Shot Plant Stress Phenotyping with Multimodal LLMs. arXiv 2024, arXiv:2407.19617. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213. [Google Scholar]
- Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652. [Google Scholar]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning; PMLR: New York, NY, USA, 2021; pp. 4904–4916. [Google Scholar]
- Lee, J.H.; Vo, H.T.; Kwon, G.J.; Kim, H.G.; Kim, J.Y. Multi-Camera-Based Sorting System for Surface Defects of Apples. Sensors 2023, 23, 3968. [Google Scholar] [CrossRef]
- Yu, J.; Wang, Z.; Vasudevan, V.; Yeung, L.; Seyedhosseini, M.; Wu, Y. Coca: Contrastive captioners are image-text foundation models. arXiv 2022, arXiv:2205.01917. [Google Scholar]
- Zhai, X.; Mustafa, B.; Kolesnikov, A.; Beyer, L. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 11941–11952. [Google Scholar]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning; PMLR: New York, NY, USA, 2022; pp. 12888–12900. [Google Scholar]
- Sun, Q.; Fang, Y.; Wu, L.; Wang, X.; Cao, Y. Eva-clip: Improved training techniques for clip at scale. arXiv 2023, arXiv:2303.15389. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Dai, W.; Li, J.; Li, D.; Tiong, A.M.H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv 2023, arXiv:2305.06500. [Google Scholar]
- Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv 2023, arXiv:2308.12966. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
- Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26296–26306. [Google Scholar]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar]
- Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft coco captions: Data collection and evaluation server. arXiv 2015, arXiv:1504.00325. [Google Scholar]
- Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Adv. Neural Inf. Process. Syst. 2022, 35, 25278–25294. [Google Scholar]
- Dai, D.; Li, Y.; Liu, Y.; Jia, M.; YuanHui, Z.; Wang, G. 15M multimodal facial image-text dataset. arXiv 2024, arXiv:2407.08515. [Google Scholar]
- Zhang, S.; Xu, Y.; Usuyama, N.; Xu, H.; Bagga, J.; Tinn, R.; Preston, S.; Rao, R.; Wei, M.; Valluri, N.; et al. BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv 2023, arXiv:2303.00915. [Google Scholar]
- Zhang, K.; Ma, L.; Cui, B.; Li, X.; Zhang, B.; Xie, N. Visual large language model for wheat disease diagnosis in the wild. Comput. Electron. Agric. 2024, 227, 109587. [Google Scholar] [CrossRef]
- Trong, V.H.; Gwang-hyun, Y.; Vu, D.T.; Jin-young, K. Late fusion of multimodal deep neural networks for weeds classification. Comput. Electron. Agric. 2020, 175, 105506. [Google Scholar] [CrossRef]
- Han, N.B.N.; Lee, J.H.; Vu, D.T.; Murtza, I.; Kim, H.G.; Kim, J.Y. HAFREE: A Heatmap-Based Anchor-Free Detector for Apple Defect Detection. IEEE Access 2024, 12, 182799–182813. [Google Scholar]
- Tonmoy, S.; Zaman, S.; Jain, V.; Rani, A.; Rawte, V.; Chadha, A.; Das, A. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv 2024, arXiv:2401.01313. [Google Scholar]
- Peng, B.; Chen, K.; Li, M.; Feng, P.; Bi, Z.; Liu, J.; Niu, Q. Securing large language models: Addressing bias, misinformation, and prompt attacks. arXiv 2024, arXiv:2409.08087. [Google Scholar]
- Tan, Z.; Li, D.; Wang, S.; Beigi, A.; Jiang, B.; Bhattacharjee, A.; Karami, M.; Li, J.; Cheng, L.; Liu, H. Large language models for data annotation and synthesis: A survey. arXiv 2024, arXiv:2402.13446. [Google Scholar]
- Dong, H.; Li, J.; Wu, B.; Wang, J.; Zhang, Y.; Guo, H. Benchmarking and improving detail image caption. arXiv 2024, arXiv:2405.19092. [Google Scholar]
- Yin, H.; Hsu, T.Y.; Min, J.; Kim, S.; Rossi, R.A.; Yu, T.; Jung, H.; Huang, T.H. Understanding How Paper Writers Use AI-Generated Captions in Figure Caption Writing. arXiv 2025, arXiv:2501.06317. [Google Scholar]
- Huang, Y.; Arora, C.; Houng, W.C.; Kanij, T.; Madulgalla, A.; Grundy, J. Ethical Concerns of Generative AI and Mitigation Strategies: A Systematic Mapping Study. arXiv 2025, arXiv:2502.00015. [Google Scholar]
- Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 2012, 21, 4695–4708. [Google Scholar] [CrossRef]
- Wang, J.; Chan, K.C.; Loy, C.C. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 2555–2563. [Google Scholar]
- Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning; PMLR: New York, NY, USA, 2020; pp. 1597–1607. [Google Scholar]
- Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
- Güldenring, R.; Nalpantidis, L. Self-supervised contrastive learning on agricultural images. Comput. Electron. Agric. 2021, 191, 106510. [Google Scholar] [CrossRef]
- Shen, Z.; Liu, Z.; Qin, J.; Savvides, M.; Cheng, K.T. Partial is better than all: Revisiting fine-tuning strategy for few-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 9594–9602. [Google Scholar]
- Fahes, M.; Vu, T.H.; Bursuc, A.; Pérez, P.; de Charette, R. Fine-Tuning CLIP’s Last Visual Projector: A Few-Shot Cornucopia. arXiv 2024, arXiv:2410.05270. [Google Scholar]
| Dataset | Family | Class | Samples |
|---|---|---|---|
| CNU Pear Disease | Normal | Normal | 400 |
| | Disease | Black spot | 200 |
| | | Moth disease | 178 |
| | | Pear cicada disease | 222 |
| Surface Subtle Apple Defects | Fuji apple | Pest | 393 |
| | | Scratch | 407 |
| CNU Weeds | Amaranthaceae | Common pigweed | 100 |
| | | Slender amaranth | 100 |
| | | Smooth pigweed | 100 |
| | | Green amaranth | 100 |
| | | Spiny amaranth | 100 |
| | Scrophulariaceae | Large speedwell | 100 |
| | | Common speedwell | 100 |
| | Chenopodiaceae | White goosefoot | 100 |
| | | Small goosefoot | 100 |
| | Convolvulaceae | Round-leaved morning glory | 100 |
| | Asteraceae | Common ragweed | 100 |
| | | Mapleleaf ragweed | 100 |
| | | American cocklebur | 100 |
| | | Annual fleabane | 100 |
| | | Hairy starwort | 100 |
| | | Common sowthistle | 100 |
| | | Tall prickly sowthistle | 100 |
| | | Tall annual fleabane | 100 |
| | | Prickly sowthistle | 100 |
| | | Greater burweed | 100 |
| | | Devil’s beggarticks | 100 |
| Total | 8 | 27 | 3900 |
| Dataset | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| Apple Defects | CLIP | 56.25 | 56.83 | 56.04 | 54.87 |
| | ALIGN | 51.88 | 53.48 | 51.39 | 43.04 |
| | BLIP | 52.50 | 75.80 | 51.90 | 37.69 |
| | CoCa | 57.50 | 60.46 | 57.15 | 53.67 |
| CNU Weed | CLIP | 18.81 | 13.90 | 17.85 | 14.13 |
| | ALIGN | 9.29 | 14.93 | 7.78 | 7.60 |
| | BLIP | 7.14 | 3.66 | 7.43 | 3.32 |
| | CoCa | 10.71 | 12.32 | 10.53 | 7.95 |
| CNU Weed (5 families) | CLIP | 46.19 | 36.10 | 48.19 | 36.08 |
| | ALIGN | 10.24 | 28.47 | 21.25 | 9.57 |
| | BLIP | 23.57 | 11.75 | 19.09 | 8.86 |
| | CoCa | 25.95 | 35.19 | 26.02 | 25.95 |
| CNU Pear Disease | CLIP | 43.00 | 25.63 | 25.93 | 19.40 |
| | ALIGN | 31.50 | 24.08 | 38.30 | 25.87 |
| | BLIP | 9.50 | 9.48 | 10.11 | 8.41 |
| | CoCa | 52.00 | 38.41 | 36.36 | 31.93 |
| VL-PAW class | CLIP | 35.38 | 20.95 | 24.43 | 20.49 |
| | ALIGN | 28.46 | 23.17 | 16.67 | 14.51 |
| | BLIP | 6.54 | 3.55 | 6.34 | 2.74 |
| | CoCa | 30.13 | 22.88 | 17.13 | 14.80 |
| VL-PAW type | CLIP | 98.21 | 97.35 | 98.15 | 97.72 |
| | ALIGN | 82.31 | 83.62 | 83.32 | 81.48 |
| | BLIP | 77.31 | 49.89 | 63.92 | 55.18 |
| | CoCa | 96.03 | 94.49 | 95.44 | 94.73 |
Model | 0-Shot | 1-Shot | 2-Shot | 3-Shot | 10-Shot | Whole |
---|---|---|---|---|---|---|
ViTf [2] | - | 71.92 | 77.31 | 80.64 | 89.10 | 92.18 |
ViTp | - | 5.00 | 11.92 | 20.13 | 56.28 | 81.92 |
Vf [48] | - | 28.85 | 31.79 | 35.13 | 61.67 | 85.00 |
Vp | - | 5.00 | 7.05 | 9.49 | 19.10 | 81.54 |
VLf (ours) | 35.38 | 36.15 | 47.44 | 50.77 | 72.31 | 84.36 |
VLp (ours) | 35.38 | 56.79 | 72.82 | 74.49 | 83.85 | 91.67 |
| Dataset | Caption | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| Pear | Brief | 72.00 | 71.24 | 59.12 | 58.47 |
| | Descriptive | 80.00 | 79.33 | 74.17 | 74.70 |
| Apple | Brief | 46.25 | 45.42 | 46.02 | 44.29 |
| | Descriptive | 54.37 | 56.03 | 54.67 | 51.80 |
| Weed | Brief | 58.81 | 64.60 | 60.27 | 56.08 |
| | Descriptive | 59.29 | 58.39 | 58.23 | 56.37 |
| VL-PAW | Brief | 56.79 | 65.05 | 56.00 | 53.06 |
| | Descriptive | 59.74 | 57.82 | 59.68 | 56.26 |
Shot | Top@1 | Top@3 | Top@5 |
---|---|---|---|
0-shot | 22.22 | 48.14 | 55.55 |
1-shot | 81.48 | 88.88 | 92.59 |
2-shot | 85.18 | 92.59 | 96.29 |
3-shot | 88.88 | 92.59 | 96.29 |
10-shot | 96.29 | 96.29 | 96.29 |
whole | 96.29 | 96.29 | 100 |