Article

E-CLIP: An Enhanced CLIP-Based Visual Language Model for Fruit Detection and Recognition

1 College of Informatics, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
2 College of Plant Science and Technology, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
3 Wuhan X-Agriculture Intelligent Technology Co., Ltd., Wuhan 430070, China
4 Hubei Hongshan Laboratory, Wuhan 430070, China
5 National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
* Authors to whom correspondence should be addressed.
Agriculture 2025, 15(11), 1173; https://doi.org/10.3390/agriculture15111173
Submission received: 20 April 2025 / Revised: 21 May 2025 / Accepted: 26 May 2025 / Published: 29 May 2025
(This article belongs to the Section Digital Agriculture)

Abstract

With the progress of agricultural modernization, intelligent fruit harvesting is gaining importance. While fruit detection and recognition are essential for robotic harvesting, existing methods suffer from limited generalizability: they struggle to adapt to complex environments and to handle new fruit varieties. This problem stems from their reliance on unimodal visual data, which creates a semantic gap between image features and contextual understanding. To address these issues, this study proposes a multi-modal fruit detection and recognition framework based on visual language models (VLMs). By integrating multi-modal information, the proposed model enhances robustness and generalization across diverse environmental conditions and fruit types. The framework accepts natural language instructions as input, facilitating effective human–machine interaction. Through its core module, Enhanced Contrastive Language–Image Pre-Training (E-CLIP), which employs image–image and image–text contrastive learning mechanisms, the framework achieves robust recognition of various fruit types and their maturity levels. Experimental results demonstrate the excellent performance of the model, which achieves an F1 score of 0.752 and an mAP@0.5 of 0.791. The model also exhibits robustness under occlusion and varying illumination conditions, attaining a zero-shot mAP@0.5 of 0.626 for unseen fruits. In addition, the system operates at an inference speed of 54.82 FPS, effectively balancing speed and accuracy, and shows practical potential for smart agriculture. This research provides new insights and methods for the practical application of smart agriculture.

1. Introduction

With the continuous growth of the global population and the increase in food demand, the agricultural sector faces significant challenges in improving production efficiency and sustainability [1]. In this context, the automation of agricultural harvesting has become one of the research and development hotspots. By introducing robotic technology to perform fruit-picking tasks, it is not only possible to greatly reduce labor intensity but also effectively enhance operational efficiency and quality. Presently, intelligent picking robots primarily consist of a recognition module, control module, and motion module [2]. As the first module for acquiring and processing external information, the recognition module plays a crucial role in improving the work efficiency of intelligent picking robots. Many studies have made efforts to enhance various aspects of agricultural robot recognition modules, such as speed, accuracy, and generalization.
Driven by the rapid advancement of artificial intelligence technologies, deep learning-based methods have achieved remarkable progress in fruit detection and recognition. YOLO is popular in agricultural recognition and edge-device deployment due to its high speed and precision. Sapkota et al. adopted YOLOv8 for instance segmentation in complex apple orchard environments, constructing datasets for dormant apple trees and for early-growing-season images containing green leaves and unripe apples. On dataset 1, YOLOv8 achieved a precision of 0.90 and a recall of 0.95 for all classes; on dataset 2, it achieved a precision of 0.93 and a recall of 0.97 [3]. Jiang et al. used YOLOv8 for watermelon counting in the field, achieving a detection accuracy of 99.20% [4]. Zhou et al. employed YOLOv7 for pitaya detection and classification, achieving precision, recall, and mean average precision of 0.844, 0.924, and 0.932, respectively [5]. However, when multiple objects overlap within a grid cell, YOLO may fail to detect all of them because it predicts only a fixed number of bounding boxes per cell, making it less suitable for complex farm environments and fruit occlusions. The SSD algorithm combines the benefits of Faster R-CNN and YOLO, adopting a one-pass prediction scheme that reduces redundant computation and significantly increases detection speed over Faster R-CNN while maintaining comparable accuracy [6]. Liang et al. proposed an SSD-based mango detection method, achieving an excellent F1 score of 0.911 at 35 FPS [7]. A related detection-based counting approach realized a video counting accuracy of 90% on Hass avocado and lemon datasets from Chile and apple datasets from California, USA [8]. However, because shallow feature maps have smaller receptive fields, SSD may perform poorly on small targets, and the same fruit may be detected redundantly by bounding boxes of different sizes. The R-CNN series (including R-CNN, Fast R-CNN, and Faster R-CNN) and Mask R-CNN are network architectures specifically designed for object detection. Wang et al. [9] proposed an improved Faster R-CNN with an attention mechanism for background color similarity for cherry tomato detection and identification, addressing the low recognition accuracy caused by varying light conditions and leaf occlusion. Xu et al. [10] proposed an improved Mask R-CNN model that considers prior neighborhood constraints among peduncles for cherry tomato identification.
While deep learning algorithms improve recognition accuracy, they face a fundamental limitation in that they are narrowly specialized rather than general-purpose. The key issues include the following: (1) limited adaptability to complex environments—many existing models do not perform well under varying conditions, making them less effective in real-world scenarios; (2) narrow specialization—most models are trained for specific fruit types and cannot generalize to unseen varieties without retraining; (3) dependence on large labeled datasets—supervised methods require extensive annotations for each new fruit category, which is costly for rare or regional crops. These problems stem from the closed-set paradigm and reliance on unimodal visual data in traditional deep learning, where models only recognize predefined classes and lack semantic understanding of fruit attributes, making them unable to adapt to dynamic agricultural needs. Therefore, developing a general-purpose fruit recognition system with enhanced semantic understanding, robust environmental adaptation, and efficient few-shot and zero-shot learning capability is of significant research importance for advancing intelligent agricultural applications.
The emergence of visual language models (VLMs) has brought new methods and ideas for addressing the challenges in automated fruit harvesting. VLMs are a type of multimodal AI system that integrates vision and natural language processing capabilities [11]. Compared with traditional deep learning models, VLMs demonstrate significant advantages in multiple aspects. First, VLMs can effectively process multimodal data by integrating visual and textual information into a unified representational space, achieving semantic alignment between visual content and linguistic descriptions. This ability enables VLMs to perform more accurate reasoning in complex visual scenes by leveraging linguistic information, thereby maintaining robust performance under varying conditions such as different lighting or occlusion. Second, VLMs exhibit strong generalization capabilities. By capturing the semantic attributes of objects, VLMs can recognize and classify unseen varieties based on their visual and linguistic properties. This ability makes VLMs adaptable to diverse tasks and scenarios, significantly reducing the need for task-specific retraining and the dependence on large labeled data, enhancing their practicality in real-world applications. Furthermore, VLMs exhibit great flexibility in input and output, making them adaptable to a variety of visual tasks.
Among the current VLMs, Contrastive Language–Image Pre-Training (CLIP) is the most widely used. CLIP [12] adopts a dual-tower structure and has made significant progress in the field of text–image fusion. Its core concept is to align image and text representations through contrastive learning using massive weakly supervised text pairs. BLIP [13] builds on CLIP and removes noisy data by optimizing the module structure to generate higher-quality text descriptions. BLIP-2 [14] freezes the pre-trained image and language models and adds a lightweight query transformer to bridge the modality gap. The LLaVa family [15] directly projects CLIP embeddings as soft prompts of LLM. The Qwen-VL family [16] connects ViT and Qwen (7B) through an adapter to convert image feature sequences into sequences that match the length of Qwen (7B) sequences for fusion. VLMs are also emerging in agriculture, with several notable applications already being explored. Cao et al. [17] proposed ITLMLP, integrating CLIP and SimCLR structures for cucumber disease recognition. Zhou et al. [18] utilized pre-trained VLMs for crop disease classification, generating descriptive texts with Qwen-VL and enhancing key text features through cross-attention and SE attention. Tan et al. [19] experimented with GPT-4 for crop recognition, nutrient deficiency, pest and disease identification, and phenotyping. Qing et al. [20] provided reliable diagnoses of plant diseases based on GPT models. However, the current applications of VLMs are predominantly focused on general object recognition and classification tasks, such as identifying common objects like vehicles, animals, and everyday items. In contrast, there is limited research specifically targeting fruit detection and recognition. This gap is notable because fruit detection and recognition present unique challenges, such as distinguishing between different varieties, identifying ripeness levels, and detecting fruits in complex natural environments. These specialized requirements have not yet been fully addressed in the existing literature and applications.
Inspired by the current work on VLMs, this study introduces and optimizes the classic CLIP visual language model to build a robust framework for fruit detection and output the pixel coordinates of detected fruit bounding boxes, providing a basis for subsequent robotic grasping actions. The optimized CLIP model leverages the strengths of the original CLIP architecture while further enhancing its zero-shot and few-shot learning capabilities.
The main contributions of this research can be summarized as follows:
(1)
Multimodal Fruit Dataset: A multimodal dataset was constructed specifically designed for fruit recognition in agricultural robotics. The dataset comprises 6770 real-world orchard images spanning 12 fruit categories, with 7 categories annotated at three maturity levels (unripe, semi-ripe, and ripe). Additionally, we generated 100 natural language queries using Qwen-7B to establish semantic alignments between visual features and textual descriptions. This dataset facilitates both fine-grained maturity detection and open-set recognition, enabling systems to adapt to dynamic picking instructions.
(2)
Natural Language Instruction Module: An input module for natural language instructions has been developed, integrating a language model parser, enabling robots to perform complex tasks based on natural language commands (through text or voice inputs). This enhances operational flexibility and user interaction. Furthermore, this language input will be combined with image data to improve the accuracy of subsequent fruit detection and recognition. The method significantly lowers the technical threshold for agricultural robots, aiding in their wider adoption and utilization in fruit picking.
(3)
Enhanced CLIP Model Architecture: An enhanced CLIP model architecture has been proposed, modifying the original CLIP framework by incorporating three key components: a YOLO detection head, an image–image contrastive learning branch, and an image–text contrastive learning branch. The YOLO detection head is used to detect fruit regions within images; the image–image contrastive learning branch focuses on identifying similarities between different fruit images, enhancing the model’s recognition capability in scenarios with scarce data and complex environmental conditions. The image-to-text contrastive learning branch contrasts structured natural language instructions with images, learning the relationship between instructions and fruit features. This not only enhances the model’s understanding of task requirements but also leverages the mapping between text and images in zero-shot scenarios, thus improving the model’s generalization ability.

2. Materials and Methods

2.1. Multimodal Dataset

2.1.1. Dataset Construction

(1) Image Dataset Construction
In this study, we constructed a small image dataset containing a variety of fruits. Specifically, as shown in Table 1, the dataset includes 7 fruit categories with ripeness labels (apple, banana, grape, strawberry, persimmon, peach, and passion fruit) and 5 fruit categories without ripeness labels (lychee, lemon, pear, tomato, and mango). The fruits were selected based on their representativeness in an orchard scene. Data collection was mainly conducted through two methods:
  • Field capture: A high-resolution camera was used to capture clear images under various occlusion conditions (slight, moderate, and heavy occlusion).
  • Open data sources: Additional images of mangoes and tomatoes were sourced from open-access repositories to enhance dataset diversity.
(2) Text Dataset Construction
To train the multimodal visual language model (VLM), we generated 100 text sentences using the Qwen-7B language model. These sentences simulate the natural language commands that humans may use to direct a picking robot, covering a variety of contexts, including indicators of fruit ripeness (e.g., changes in color and maturity). Each prompt was carefully designed to ensure semantic alignment with the image labels.
In addition, text data are also used to test the speed of different language models and the accuracy of conversion to JSON, guiding the selection of language models.

2.1.2. Dataset Preprocessing

To address the class imbalance, images from underrepresented categories were augmented, mainly through flipping, rotation, cropping, and scaling. The dataset was divided into training, validation, and test sets at a ratio of 8:1:1 for regular training and evaluation.
In addition, to evaluate the model’s ability to learn from a small number of samples, a rigorous stratified sampling strategy was employed: for each fruit category, K samples were randomly selected as the training set, where K ∈ {1, 4, 8, 16}, and the remaining samples were assigned to the test set to evaluate the model’s generalization ability.
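As an illustration of this sampling protocol, the minimal sketch below builds a K-shot split per category; the function and variable names are ours and not part of any released code.

```python
import random
from collections import defaultdict

def k_shot_split(samples, k, seed=0):
    """Stratified K-shot split: for each fruit category, draw K training
    samples at random and keep the rest for testing (illustrative helper,
    names are hypothetical)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:          # samples: list of (image_path, category)
        by_class[label].append(path)

    train, test = [], []
    for label, paths in by_class.items():
        rng.shuffle(paths)
        train += [(p, label) for p in paths[:k]]   # K support samples per category
        test += [(p, label) for p in paths[k:]]    # remaining samples for evaluation
    return train, test

# The few-shot experiments use K in {1, 4, 8, 16}:
# splits = {k: k_shot_split(all_samples, k) for k in (1, 4, 8, 16)}
```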
To ensure a comprehensive evaluation, three test sets were constructed:
(1) Basic Test Set: 50 images are randomly selected from each category and used for regular model performance evaluation.
(2) Extreme Condition Test Set: 50 images from the remaining samples are randomly selected to simulate challenging conditions. Specifically, images occluded by leaves or other fruits simulate occlusion scenarios at three levels of severity: light (10–30%), medium (30–60%), and heavy (60–90%). For each occlusion condition, OpenCV was then used to adjust the brightness and contrast of the images to simulate varying lighting conditions, specifically with parameters α = 0.7 and β = −30 to reduce image contrast and brightness (see the sketch after this list).
(3) Zero-Shot Test Set: To further evaluate the model’s generalization ability, 20 images from four fruit categories (orange, watermelon, cantaloupe, and cherry) that were not included in the training phase were selected, forming another test subset.
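The lighting degradation described in test set (2) can be reproduced with OpenCV’s linear brightness–contrast transform; a minimal sketch using the reported α = 0.7 and β = −30 is shown below (preprocessing details beyond these two parameters are assumptions).

```python
import cv2

def simulate_backlight(image_bgr, alpha=0.7, beta=-30):
    """Apply g(x, y) = alpha * f(x, y) + beta to reduce contrast and brightness,
    mimicking the backlit images of the extreme-condition test set."""
    return cv2.convertScaleAbs(image_bgr, alpha=alpha, beta=beta)

# img = cv2.imread("apple_occluded.jpg")
# dark = simulate_backlight(img)   # darker, lower-contrast version of the same scene
```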

2.2. The Overall Framework of Fruit-Picking System

Figure 1 presents the overall workflow of our fruit-picking system based on visual language models (VLMs), which comprises three core components:
(1)
Instruction Processing Module: Processes natural language commands describing picking requirements.
(2)
Visual Language Processing Module: Analyzes multimodal information, including vision and text data, through the VLM for context understanding and object recognition, then outputs the 2D bounding box coordinates $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ of detected targets.
(3)
Motion Control Module: Generates actionable motor commands by fusing prior perception results.
This study specifically concentrates on the first two modules: instruction parsing and visual perception.

2.2.1. Instruction Processing Module

Figure 2 illustrates the instruction processing workflow, starting with user instruction reception. The system accepts either text or voice commands: text inputs are directly fed into the language model, while voice inputs undergo speech-to-text conversion before entering the model. The language model then converts these instructions into a standard JSON format under the designed prompt (Appendix B) and determines whether they contain a picking task through the “action” field of the JSON. If a picking task is identified, the system activates the camera to capture scene videos. Both the textual and visual data are then transmitted to the VLM for object detection and spatial localization. If no picking task is detected, the command is routed to the motion control module to execute robotic movements.
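The snippet below sketches this routing step: a command parsed by the language model is accepted as JSON and dispatched according to its "action" field. The "action" key is named in the text; the remaining fields are illustrative, since the authoritative schema is the image-based specification in Appendix B.

```python
import json

def route_instruction(llm_output: str):
    """Dispatch a parsed command to the vision-language pipeline or the motion
    controller, depending on whether it describes a picking task."""
    cmd = json.loads(llm_output)
    if cmd.get("action") == "picking":
        # Picking task: activate the camera and pass text + frames to E-CLIP.
        return "visual_language_module", cmd
    # Any other action is routed directly to the motion control module.
    return "motion_control_module", cmd

# Example output for "Pick the ripe apples on the left":
# route_instruction('{"action": "picking", "target": "apple", "maturity": "ripe"}')
```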

2.2.2. Visual Language Processing Module

In this module, the Enhanced CLIP model processes text and visual information for object detection and localization. While the original CLIP model performs well in image–text contrastive learning, it lacks object region detection due to its generic design and reliance on text cues, and has difficulty in accurate classification in complex fruit images. To address these limitations, the Enhanced CLIP model retains the image–text contrastive learning branch in the original CLIP model, while adding a YOLOv8 detection head to provide accurate fruit region localization, and an image–image contrastive learning branch to enhance the model’s robustness and generalization capabilities.
(1)
YOLOv8 Detection Head
YOLOv8 [21] is specifically employed for fruit region detection, as shown in Figure 3, which provides precise bounding boxes for subsequent classification tasks carried out by the CLIP model. The fundamental structure of YOLOv8 is retained, which includes the Backbone for feature extraction, the Neck for feature fusion, and the Head for prediction output. However, to tailor it for our specific needs, modifications have been made in the Head. The classification component in the Head has been entirely removed. The model now solely focuses on bounding box regression, outputting the coordinates of detected fruit regions without performing any class identification. Additionally, the loss function has been redefined to exclude the classification loss, thereby emphasizing localization accuracy during training. Once pre-trained, YOLOv8 can be directly integrated into the framework for fruit region detection without the need for further training.
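To illustrate how a classification-free detection head feeds the rest of the pipeline, the sketch below uses an off-the-shelf ultralytics YOLOv8 checkpoint purely as a box proposer and crops each region for the CLIP branches; the authors’ actual head is retrained without a classification branch, which this sketch does not reproduce.

```python
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")   # generic checkpoint, used here only for illustration

def crop_fruit_regions(image_path, conf_threshold=0.25):
    """Return cropped regions R_1, ..., R_M whose boxes come from YOLOv8.
    Class predictions are ignored; only the box coordinates are used."""
    image = cv2.imread(image_path)
    results = detector(image, conf=conf_threshold)
    crops = []
    for x1, y1, x2, y2 in results[0].boxes.xyxy.int().tolist():
        crops.append(image[y1:y2, x1:x2])   # region handed to the ViT image encoder
    return crops
```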
(2)
Image Encoder
This study proposes a unified architecture based on the Vision Transformer (ViT) [22] that serves both the image–image and the image–text contrastive learning tasks. As shown in Figure 4, ViT meets the differing requirements of the two tasks through flexible feature extraction strategies:
For image–image contrastive learning tasks, ViT fully utilizes its global modeling advantages to capture the overall semantic associations and contextual dependencies of the image through the self-attention mechanism. Specifically, the image encoding process first utilizes the YOLO object detection head to perform object detection on the input image. The YOLO model outputs bounding boxes $R_1, R_2, \ldots, R_M$ for different target areas in the image and crops these target regions from the original image to obtain each target region $R_j$ (where $j = 1, 2, \ldots, M$). These target regions are subsequently fed into the ViT model for feature extraction. Each target region $R_j$ is divided into fixed-size patches, and each patch is linearly mapped into a D-dimensional vector representation $x_{j,i} \in \mathbb{R}^D$:

$$x_{j,i} = \mathrm{Proj}(\mathrm{Flatten}(p_j^i)) \in \mathbb{R}^D, \quad i = 1, 2, \ldots, N_j$$

where $N_j$ represents the number of patches in the target region $R_j$, $p_j^i$ denotes the $i$-th patch within the target region, $\mathrm{Proj}(\cdot)$ is the linear projection operation, and $\mathrm{Flatten}(\cdot)$ indicates flattening each patch into a one-dimensional vector. Then, after adding learnable positional encodings $E_{\mathrm{pos}}$ to all patch features, the input sequence is formed as follows:

$$X_j^{(0)} = [\,x_{j,1} + e_1,\; x_{j,2} + e_2,\; \ldots,\; x_{j,N_j} + e_{N_j}\,] \in \mathbb{R}^{N_j \times D}$$

where $X_j^{(0)}$ is the initial input sequence for the $j$-th target region $R_j$, consisting of the D-dimensional feature vectors $x_{j,i}$ of each image patch after the flatten and projection operations, plus learnable positional encodings $e_i$ that preserve spatial information; $x_{j,i}$ represents the feature vector of the $i$-th image patch in $R_j$, with $i = 1, 2, \ldots, N_j$, where $N_j$ denotes the total number of patches in $R_j$ and $D$ is the dimension of the projected feature space; and $\mathbb{R}^{N_j \times D}$ indicates that $X_j^{(0)}$ is a real-valued matrix of size $N_j \times D$, in which each row corresponds to an image patch’s feature vector with its positional encoding added. This input sequence is fed into the multi-layer Transformer encoder of ViT. Each layer of the Transformer encoder includes Multi-Head Self-Attention (MHSA) and Feed-Forward Network (FFN) modules:

$$Z_j^{(l)} = \mathrm{MHSA}(\mathrm{LN}(X_j^{(l-1)})) + X_j^{(l-1)}$$

$$X_j^{(l)} = \mathrm{FFN}(\mathrm{LN}(Z_j^{(l)})) + Z_j^{(l)}, \quad l = 1, 2, \ldots, L$$

where MHSA stands for the Multi-Head Self-Attention mechanism, LN is Layer Normalization, and FFN is the Feed-Forward Network. Each layer of the Transformer processes the input patch features to capture global information and contextual relationships. This architectural feature effectively supports the accurate evaluation of the overall semantic similarity between images.
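A minimal PyTorch sketch of this region encoder is given below: patches are flattened and linearly projected, learnable positional encodings are added, and a standard pre-norm Transformer encoder produces a pooled region feature. Depth, width, and the mean-pooling choice are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class RegionViTEncoder(nn.Module):
    """Sketch of the patch embedding and Transformer encoder described above:
    Proj(Flatten(patch)) + positional encoding, then MHSA + FFN blocks.
    Hyperparameters are illustrative."""

    def __init__(self, patch=16, dim=512, depth=6, heads=8, max_patches=196):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(3 * patch * patch, dim)              # Proj(Flatten(p))
        self.pos = nn.Parameter(torch.zeros(1, max_patches, dim))  # learnable E_pos
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)         # stacked MHSA + FFN

    def forward(self, region):   # region: (B, 3, H, W); H, W divisible by patch size
        B, C, H, W = region.shape
        patches = region.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * self.patch * self.patch)
        x = self.proj(patches) + self.pos[:, : patches.shape[1]]   # X_j^(0), N <= max_patches
        x = self.encoder(x)
        return x.mean(dim=1)     # pooled global feature for the region
```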
In the image–text contrastive learning task, ViT achieves local detail preservation by adjusting the feature extraction strategy. In addition to the conventional global features, the model also outputs fine-grained feature maps. To obtain local features at various scales, we set up windows of different sizes to perform localized computations of the self-attention mechanism. These computations are followed by multi-scale feature fusion. Incorporating Equations (2) and (3), we obtain the localized attention values $Z_{j,\mathrm{scale}_i}^{(l)}$ for windows of different sizes. Then, we fuse these different $Z_{j,\mathrm{scale}_i}^{(l)}$ into a unified $Z_j^{(l)}$. This fused representation is subsequently fed into a feed-forward neural network. This feature extraction mechanism with different windows enables the visual representation to maintain fine-grained alignment with the text description.
(3)
Text Encoder
This study adopts the standard Transformer architecture [23] as the text encoder, whose core function is to encode natural language instructions (e.g., “a round red apple”) into 512-dimensional semantic feature vectors $f_{\text{text}}(T)$ that provide a fine-grained semantic representation. The text encoder consists of the following key modules:
Positional Encoding
Since Transformer itself does not have sequence order information, we inject absolute position information through positional encoding. The encoding formula is:
$$E(x_i) = x_i + W_p \cdot \mathrm{PE}(i)$$

where $x_i$ is the $i$-th word embedding vector in the input sequence, $W_p$ is the learnable position encoding matrix, and $\mathrm{PE}(i)$ uses a combination of sine and cosine functions [23] to ensure that the model captures the relative positional relationships of the sequence.
Multi-Head Self-Attention (MHSA)
After incorporating positional encoding, the model processes the input sequence using Multi-Head Self-Attention. This mechanism allows the model to focus on different parts of the input sequence in parallel, capturing long-range dependencies.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

where

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$

where $Q$, $K$, and $V$ denote the Query, Key, and Value matrices, respectively, generated from the input sequence by linear transformations; $d_k$ is the dimension of the key vectors (usually set to $d_{\text{model}}/h$, with $d_{\text{model}} = 512$); and $h$ is the number of attention heads.
Feed-Forward Network (FFN)
The output of the Multi-Head Self-Attention mechanism is then passed through a Feed-Forward Network (FFN) to further refine the semantic representation of the text. This module applies a nonlinear transformation to the input features:
$$\mathrm{FFN}(x) = \max(0,\, W_1 x + b_1)\, W_2 + b_2$$

where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ are the learnable weight matrices, $d_{\text{ff}} = 2048$ is the hidden layer dimension, and $\max(0, \cdot)$ denotes the ReLU activation function. These components work together to generate a rich semantic representation of the input text, which is crucial for aligning text with image features.
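The text branch can be sketched in the same style: token embeddings plus sinusoidal positional encodings feed a stack of MHSA + FFN layers, and the output is pooled into a 512-dimensional vector $f_{\text{text}}(T)$. Note that this sketch uses a fixed sine–cosine table, whereas the paper additionally applies a learnable matrix $W_p$; vocabulary size, depth, and the pooling choice are assumptions.

```python
import math
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Minimal text branch: embeddings + sinusoidal positions + Transformer
    layers (MHSA and FFN), mean-pooled into a 512-d sentence feature."""

    def __init__(self, vocab=49408, dim=512, heads=8, ff=2048, depth=6, max_len=77):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        pe = torch.zeros(max_len, dim)                       # fixed sin/cos table PE(i)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))
        layer = nn.TransformerEncoderLayer(dim, heads, ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, token_ids):                            # (B, L) integer tokens
        x = self.embed(token_ids) + self.pe[:, : token_ids.shape[1]]
        x = self.encoder(x)
        return x.mean(dim=1)                                 # f_text(T), shape (B, 512)
```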
(4)
Loss Function
The model is optimized using a combined loss function that integrates both image–image and image–text contrastive learning objectives. Specifically, for the image–image contrastive learning task, the loss function is defined as:
$$\mathcal{L}_{\text{img-img}} = -\log \frac{\exp\!\left( f_{\text{img}}(I_q) \cdot f_{\text{img}}(I_s) / \tau \right)}{\sum_{j=1}^{M} \exp\!\left( f_{\text{img}}(I_q) \cdot f_{\text{img}}(I_j) / \tau \right)}$$

where $f_{\text{img}}(I_q)$ and $f_{\text{img}}(I_s)$ are the global feature vectors extracted by ViT for the query and support images, respectively, $\tau$ is the temperature parameter, and $M$ is the number of negative samples. This loss function ensures that the feature vectors of similar images have a higher cosine similarity, while those of dissimilar images have a lower similarity. For the image–text contrastive learning task, the loss function is defined as:

$$\mathcal{L}_{\text{img-text}} = -\log \frac{\exp\!\left( f_{\text{img}}(I) \cdot f_{\text{text}}(T) / \tau \right)}{\sum_{j=1}^{M} \exp\!\left( f_{\text{img}}(I) \cdot f_{\text{text}}(T_j) / \tau \right)}$$

where $f_{\text{img}}(I)$ is the local feature vector extracted by ViT for the image, and $f_{\text{text}}(T)$ is the semantic feature vector extracted by the text encoder; $\tau$ is the temperature parameter, and $M$ is the number of negative samples. This loss function ensures that the image features and the correct text description have a higher cosine similarity, while those with incorrect descriptions have a lower similarity. The total loss function is a weighted sum of the two contrastive losses:

$$\mathcal{L}_{\text{total}} = \alpha\, \mathcal{L}_{\text{img-img}} + (1 - \alpha)\, \mathcal{L}_{\text{img-text}}$$

where $\alpha$ is a hyperparameter that balances the contributions of the two losses.
This combined loss function allows the model to be optimized for both global semantic understanding and fine-grained text–image alignment, leveraging the strengths of both images and text in their respective tasks.
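A compact PyTorch version of the two contrastive objectives and their weighted combination is sketched below. It assumes one positive followed by M negatives per anchor and L2-normalized features so that dot products act as cosine similarities; batching and negative-sampling details are simplifications.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, candidates, temperature=0.07):
    """One-directional InfoNCE: `candidates` holds the matching sample first,
    followed by M negatives. Features are L2-normalized so the dot product
    acts as a cosine similarity."""
    anchor = F.normalize(anchor, dim=-1)          # (D,)
    candidates = F.normalize(candidates, dim=-1)  # (1 + M, D)
    logits = candidates @ anchor / temperature    # (1 + M,)
    return -torch.log_softmax(logits, dim=0)[0]   # -log p(positive | all candidates)

def e_clip_loss(img_q, img_s, img_negs, img_local, txt_pos, txt_negs, alpha=0.6):
    """Weighted combination alpha * L_img-img + (1 - alpha) * L_img-text,
    with alpha = 0.6 as in the reported best setting."""
    l_ii = info_nce(img_q, torch.cat([img_s.unsqueeze(0), img_negs]))
    l_it = info_nce(img_local, torch.cat([txt_pos.unsqueeze(0), txt_negs]))
    return alpha * l_ii + (1 - alpha) * l_it
```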
(5)
Fruit Detection Output
The output of the visual language processing module is structured as JSON (Appendix A) data, providing detailed information for downstream robotic execution tasks. The output includes class level, confidence score, bounding box, and centroid position.
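An illustrative instance of this output is shown below. The four pieces of information (class level, confidence score, bounding box, and centroid position) follow the text; the exact key names and nesting are assumptions, since the normative format is the image in Appendix A.

```json
{
  "detections": [
    {
      "class": "apple",
      "maturity": "ripe",
      "confidence": 0.93,
      "bbox": [412, 228, 506, 331],
      "centroid": [459, 279]
    }
  ]
}
```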

2.3. Experimental Setup

2.3.1. Implementation Details

Table 2 and  Table 3 describe the hardware and software environment of the experiment.

2.3.2. Hyperparameter Settings

The framework integrates a YOLOv8 detection head with CLIP and ViT for joint detection and classification. YOLOv8 is trained for 50 epochs with a batch size of 32 and a learning rate of 0.01 using SGD and 8 data-loading workers. CLIP and ViT are both trained for 300 epochs using the Adam optimizer, with learning rates of $1 \times 10^{-5}$ and $1 \times 10^{-4}$ and batch sizes of 16 and 8, respectively; both use 4 workers. The contrastive learning weight α is set to 0.6.
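For reference, the schedule above can be collected into a single configuration object; the numeric values follow the text, while the structure and key names are purely illustrative.

```python
# Illustrative consolidation of the training schedule; only the values are from the paper.
TRAIN_CONFIG = {
    "yolov8_head": {"epochs": 50, "batch_size": 32, "lr": 1e-2,
                    "optimizer": "SGD", "workers": 8},
    "clip_branch": {"epochs": 300, "batch_size": 16, "lr": 1e-5,
                    "optimizer": "Adam", "workers": 4},
    "vit_branch": {"epochs": 300, "batch_size": 8, "lr": 1e-4,
                   "optimizer": "Adam", "workers": 4},
    "loss": {"alpha": 0.6},   # weight of the image-image contrastive term
}
```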

2.3.3. Evaluation Metrics

The performance of the vision language model was rigorously evaluated using a comprehensive suite of standard metrics, encompassing accuracy, precision, recall, F1 score, average precision (AP), mean average precision (mAP), and intersection over union (IoU). Additionally, computational efficiency was assessed through GFLOPs, parameters, and frames per second (FPS) to measure computational complexity, parameter count, and processing speed, respectively.
(1)
Precision (P), reflecting the proportion of true positive samples among all detected positives, is defined as:
$$P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
where TP denotes correctly identified positive samples, and FP represents false positives.
(2)
Recall (R), which quantifies the fraction of actual positives accurately predicted by the model, is given by:
$$R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$

where FN denotes positive samples incorrectly classified as negative.
(3)
The F1 score, calculated as the harmonic mean of precision and recall, provides a balanced assessment:
$$F1 = \frac{2 \times P \times R}{P + R}$$
(4)
Average precision (AP) measures localization accuracy by determining the area under the precision–recall curve. Mean average precision (mAP), an aggregate measure across all classes, reflects overall detection performance:
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i$$
where N is the total number of classes.
(5)
Intersection over union (IoU), indicating spatial overlap between predicted and ground-truth bounding boxes, is expressed as:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ denote the regions covered by the predicted and ground-truth bounding boxes, respectively.
(6)
GFLOPs, measuring computational complexity, are calculated as follows:
$$\mathrm{GFLOPs} = \frac{1}{10^{9}} \times K \times K \times C_{\text{in}} \times C_{\text{out}} \times H \times W$$

where $H$ and $W$ are the height and width of the output feature map, respectively; $K$, $C_{\text{in}}$, and $C_{\text{out}}$ are defined below.
(7)
Parameters quantify the total trainable parameters within the model, particularly convolutional layers, via the following:
$$\mathrm{Parameters} = K \times K \times C_{\text{in}} \times C_{\text{out}}$$

where $K$ is the convolutional kernel size, $C_{\text{in}}$ is the number of input channels, and $C_{\text{out}}$ is the number of output channels.
(8)
Finally, FPS, representing inference speed, is given by the following:
$$\mathrm{FPS} = \frac{N}{T_{\text{total}}}$$

where $N$ is the total number of processed frames and $T_{\text{total}}$ is the total inference time in seconds.
Detection performance was evaluated with AP metrics at varying IoU thresholds, providing insights into the model’s detection capabilities. Computational efficiency metrics ensured a holistic evaluation of the model’s effectiveness and efficiency.
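The detection metrics above reduce to a few lines of code; the sketch below implements precision, recall, F1, and IoU exactly as defined, with FPS noted as a comment (the per-threshold counting of TP/FP/FN needed for AP is omitted).

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean, as defined above."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# FPS: total processed frames divided by total inference time in seconds.
# fps = n_frames / total_seconds
```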

3. Results and Discussion

3.1. Language Model Comparison

To select the most appropriate language model, the speed and accuracy of five models under various quantization methods (e.g., BF16, GPTQ-Int4, AWQ, and FP8) were evaluated: Phi-2, Qwen2-7B, TinyLlava, GPT-4, and Deepseek. We recorded the average time to convert 100 commands (Appendix C) to JSON and manually evaluated the parsing accuracy of the converted commands. The results are shown in Table 4. As can be seen, Phi-2 is slightly less accurate than the other models but has a clear speed advantage, making it the preferred choice for real-time command parsing and execution in the fruit-picking task.

3.2. Visual Language Model Evaluation

3.2.1. Model Performance Comparison

The performance of the visual language model E-CLIP is compared with several current classic models, including single-stage object detection models (such as the YOLO series and RT-DETR), multimodal large models (such as GLIP, RegionCLIP, and MM-Grounding-DINO), and Transformer-based visual detection models (such as ViTDet). The results are listed in Table 5. The model shows very competitive performance, with a precision of 0.778, a recall of 0.728, and an mAP@0.5 of 0.791. This performance is attributed to the joint optimization of image–image and image–text contrastive learning, which improves the precision of detection and classification. Figure 5 shows the AP performance of the model for different types of fruits. Many fruits have an AP value above 0.85; owing to the limited amount of training data, the AP value for green apples is lower. Figure 6 shows the detection results for some fruits.
Compared with single-stage detection models, E-CLIP introduces a joint image–text contrastive learning mechanism that compensates for the limited semantic understanding of traditional detection architectures, improving its ability to distinguish complex targets from the background. Compared with multimodal large models, E-CLIP’s image-similarity alignment effectively avoids the ambiguity that arises when whole-image information is aligned with regional information. Compared with Transformer-based visual detection models, E-CLIP introduces an image–text enhancement module that captures visual–semantic associations while retaining the global modeling ability of the Transformer. As a result, it achieves significant improvements in metrics such as mAP@0.5 and mAP@[0.5:0.95], demonstrating strong learning ability in complex orchard scenes.
RegionCLIP’s mAP@0.5 is only 0.084, which is far below the other models. The main reason is that its RPN generates a large number of invalid proposals in the candidate-box generation stage, making it difficult to distinguish foreground from background; this leads to inaccurate region–text alignment and degrades the overall detection performance.
E-CLIP has its own limitations, mainly that its performance depends on the detection accuracy of the detection head. The detection head currently used is YOLOv8; if YOLOv8 misses a target, the final performance of the model is affected.

3.2.2. Robustness Assessment

As shown in Table 6, the proposed E-CLIP model shows comprehensive advantages across six occlusion–illumination coupling scenarios. Specifically, compared with the traditional YOLO series, the mAP@0.5 of our model is 0.106 higher than that of the standalone YOLOv8 under normal lighting and slight occlusion (0.832 vs. 0.726), 0.026 higher under backlight conditions (0.695 vs. 0.669), and 0.046 higher under severe occlusion with backlight (0.551 vs. 0.505). At the same time, mAP@[0.5:0.95] improves by 0.067 (0.462 vs. 0.395) and the F1 score by 0.042 (0.543 vs. 0.501). These results show that E-CLIP is more robust than pure vision models. This is because E-CLIP incorporates text information, which provides high-level semantic guidance and steers the model toward key semantic attributes rather than local pixels, compensating for the limitations of pure vision models in complex scenes. Figure 7 shows the mAP@0.5 values of each model under different occlusion and lighting conditions.
Compared with the multimodal models, in the slight occlusion–normal illumination scenario, the mAP@0.5 of our model is 0.304 higher than GLIP (0.832 vs. 0.528), 0.741 higher than RegionCLIP (0.832 vs. 0.091), and 0.364 higher than MM-Grounding-DINO (0.832 vs. 0.468); in the severe occlusion–backlight scenario, the F1 score is 0.223 higher than GLIP (0.543 vs. 0.320) and 0.115 higher than MM-Grounding-DINO (0.543 vs. 0.428). These results indicate that the proposed multimodal model outperforms CLIP-based models that use only image–text contrastive learning. This is because E-CLIP adds an image–image contrastive learning branch, which further strengthens the model’s invariance to illumination and occlusion, allowing it to ignore irrelevant variables and focus on more essential image features (such as texture and shape), thereby reducing the impact of external interference factors such as illumination and occlusion.
In summary, the added image–image contrastive learning and the retained image–text contrastive learning enable the model to learn more discriminative and robust feature representations. Under different occlusion and illumination conditions, these contrastive objectives help the model grasp the key features of the fruit. In addition, the image encoders in both branches use ViT, which captures the global information of the image through the self-attention mechanism rather than only local features. Even when the target is occluded or affected by lighting, the model can identify it accurately through contextual information. Therefore, ViT also contributes to the model’s feature extraction ability and invariance learning.

3.2.3. Generalization Assessment

In the few-shot learning scenario, the model is trained with a limited number of samples from each fruit category and then evaluated on a separate test set containing these categories. The results (Table 7) show that the model reaches an mAP@0.5 of 0.646 with 16 samples, only 0.145 below the 0.791 obtained on the full test set. Compared with other models, our model achieves higher F1 scores and mAP@0.5 values at all sample sizes (Figure 8). In the zero-shot learning scenario, the model is tested on new fruit categories without using any training samples from these categories; this setting evaluates the model’s ability to recognize new categories based only on pre-trained knowledge. Table 8 shows that the model achieves an average mAP@0.5 of 0.626 on the new categories, demonstrating its ability to generalize to unseen categories. Figure 9 shows the detection results for selected test samples in the zero-shot scenario.
The generalization ability of the model can be attributed to its multimodal learning architecture. Traditional image–text contrastive pre-training methods mainly match whole images with text. Although they have good semantic understanding capabilities, they are easily disturbed by irrelevant background information in the image, which introduces noise, especially in region-level recognition tasks. Our method instead focuses on the target region itself by feeding the local region features extracted by the detection head into CLIP for image–text alignment. However, since these regions are not specifically trained on during pre-training, relying solely on image–text contrast is prone to inaccurate matching. Therefore, image–image contrastive learning is introduced. By constructing contrastive constraints between augmented image pairs, the model can capture the similarities and differences between regions more finely from a visual perspective, making up for the shortcomings of pure image–text contrast in object-level understanding. Especially when samples are scarce, image–image contrast strengthens the model’s structural understanding of similar targets, so that accurate classification can be achieved even with very few or no training samples. This mechanism not only enhances the model’s discriminative ability but also further improves its generalization to unseen categories.

3.2.4. Ablation Studies

To assess the contributions of each module and component in E-CLIP, a series of ablation studies was conducted and the results are shown in Table 9.
(1) Effect of image–image contrastive learning module
The removal of the image–image contrastive learning module resulted in significant performance degradation in both localization and classification. Specifically, mAP@[0.5:0.95] decreased from 0.652 to 0.513, while mAP@0.5 declined from 0.791 to 0.667. This underscores the critical role of ViT in capturing global semantic context, particularly for small and multi-scale objects. Classification metrics also deteriorated, with the F1 score dropping from 0.752 to 0.686, indicating reduced semantic consistency in class predictions. These results highlight the module’s importance in enhancing robustness for complex detection scenarios.
(2) Effect of image–text contrastive learning module
Similarly, disabling the image–text contrastive learning module led to notable performance declines. The mAP@[0.5:0.95] decreased from 0.652 to 0.521 and mAP@0.5 fell from 0.791 to 0.683, reflecting weakened fine-grained localization capabilities. Classification performance suffered more severely, with the F1 score decreasing to 0.695. This demonstrates the module’s essential role in aligning local visual features with textual semantics, which is pivotal for class-discriminative detection.
(3) Impact of the combined loss function’s weight parameter α
The balance between the two modules was further analyzed by tuning the weight parameter $\alpha$ in the combined loss function $\mathcal{L}_{\text{total}} = \alpha\, \mathcal{L}_{\text{img-img}} + (1 - \alpha)\, \mathcal{L}_{\text{img-text}}$, as shown in Table 10 and Figure 10. Optimal performance was achieved at $\alpha = 0.6$, yielding an mAP@0.5 of 0.791 and an F1 score of 0.752. Deviating from this balance, e.g., $\alpha = 0.4$ (overemphasizing text alignment) or $\alpha = 0.8$ (prioritizing image contrast), resulted in suboptimal performance, with mAP@0.5 declining to 0.614 and 0.532, respectively. This emphasizes the necessity of harmonizing global and local feature learning to maximize cross-modal synergy.

3.2.5. Computational Efficiency

The experimental results in Table 11 demonstrate that our model has excellent computational efficiency compared with state-of-the-art methods. Notably, our model achieves an inference speed of 54.82 FPS, outperforming most existing models, including ViTDet (26.45 FPS), RT-DETR (28.79 FPS), RegionCLIP (14.79 FPS), and MM-Grounding-DINO (4.76 FPS). Although lower than the FPS of YOLOv8 (272.27), our framework significantly reduces computational complexity while maintaining competitive real-time performance. Specifically, our model requires only 98.61 GFLOPs and 86.42M parameters, a significant drop compared with other multimodal models. This lightweight architecture outperforms computationally intensive models such as GLIP (415.17 GFLOPs) and RegionCLIP (518.52 GFLOPs). Our model balances parameter count and speed. These characteristics enable deployment on resource-constrained edge devices while maintaining real-time responsiveness, which is critical for agricultural robotics applications.

3.3. Limitation and Future Work

Although the visual language model proposed in this study shows significant performance in fruit detection and automated picking tasks, it still has certain limitations.
(1) Imbalanced data distribution: The dataset constructed in this study contains 12 different fruit categories. However, there is a significant imbalance in the distribution of these categories. For example, the number of mango instances far exceeds that of less common fruits such as lychee and lemon. This imbalance may have a negative impact on the model’s detection accuracy for underrepresented categories.
(2) Less robust under extreme occlusion conditions: Although the model has good robustness under moderate occlusion and illumination changes, its detection performance still needs to be improved under extreme occlusion or complex mixed illumination environments. The overall framework relies on YOLOv8 as the detection head. Although it performs well in general object detection tasks, it may still miss objects in extreme occlusion scenarios, which will affect the model results.
(3) Poor generalization to non-circular objects: Although the model shows some generalization ability in zero-shot and few-shot scenarios, its performance drops significantly when detecting new categories with large morphological differences (such as carrots). This indicates that there are challenges in handling extreme variations in object morphology.
(4) The speed is still behind the traditional YOLO: Although this method is faster than traditional multimodal models, its speed still lags behind YOLO, which poses a challenge for real-time applications that require high-speed processing.
Future research can further optimize and expand this framework from multiple directions.
(1) Expand the diversity and balance of the dataset. Including more fruit varieties and adding more fine-grained maturity labels (especially labels from complex agricultural scenes) will improve the versatility and accuracy of the model.
(2) The model’s performance under extreme occlusion and complex lighting can be improved in several ways. For example, attention mechanisms or occlusion-aware loss functions can be introduced to enhance the model’s perception of occluded targets, and adaptive normalization can be used to extract more stable features and reduce the impact of lighting changes. In addition, a detection head better suited to complex scenes, or a candidate-region completion mechanism, can be introduced to improve overall robustness and detection accuracy.
(3) Fusion of depth information and multispectral images can enhance the perception ability of the model. Combining it with an adaptive feature extraction mechanism will help improve the detection accuracy of objects with significant morphological differences.
(4) Explore technologies such as model distillation and pruning. These methods can reduce the computational load while maintaining or even improving model performance, making it more suitable for real-time applications on resource-constrained devices.

4. Conclusions

In summary, this study aims to enhance the robustness and generalizability of current fruit detection and recognition models. To achieve this, the following work has been completed:
First, to effectively train the proposed multimodal model E-CLIP, a multimodal dataset was created. It includes 6770 fruit images across 12 categories, 7 of which are annotated at three maturity levels (21 maturity-level classes in total), covering a variety of lighting conditions, together with 100 carefully designed natural language queries. This multimodal dataset provides a comprehensive and diverse foundation for training and evaluating the model, supporting its ability to adapt to dynamic agricultural environments and new fruit varieties.
Second, the speech recognition module with the language model is integrated, enabling the robot to understand and execute natural language commands, greatly simplifying the operation process and lowering the threshold for use. In the experiment, a variety of language models were compared, including Phi-2, Qwen2-7B, TinyLlava, GPT-4, and Deepseek. The experimental results show that the Phi-2 model has the best balance between speed and accuracy and has efficient real-time performance. Therefore, it is most suitable for the command processing of agricultural harvesting robots.
Third, and most importantly, to address the insufficient robustness and generalization of traditional fruit detection and classification models, an enhanced CLIP (E-CLIP) model is proposed. This model includes three branches: the YOLO detection head, the image–image contrastive learning branch, and the image–text contrastive learning branch. This integrated structure makes up for the shortcomings of pure vision models and further enhances the learning ability of the multimodal model. Experimental results show that the mAP@0.5 and F1 score of the model reach 0.791 and 0.752, significantly higher than the multimodal baseline models and also improved over the traditional YOLO. The model also exhibits robustness under occlusion and illumination changes; it maintains high accuracy even under severe occlusion and backlighting, showing its practical application potential in dynamic orchard environments. Furthermore, the model demonstrates strong generalization in few-shot learning (e.g., 1, 4, 8, and 16 samples per category) and zero-shot learning scenarios, achieving an mAP@0.5 of 0.626 for new fruit categories in zero-shot learning and demonstrating its adaptability to new categories without direct supervision.
These enhancements not only boost the efficiency and accuracy of automatic fruit picking but also pave the way for advancements of general-purpose picking robots in smart agriculture.   

Author Contributions

Writing—original draft, data collection and pre-processing, methodology, and experiment implementation and testing: Y.Z.; Writing—review and editing, methodology guidance, innovation design: H.P.; Project administration and funding acquisition and Writing—review and editing and polishing: P.S.; Review and Editing and Writing—polishing: R.Z.; Partial image dataset collection and JSON specification: Y.S., C.T., Z.L. (Zhenqing Liu) and Z.L. (Zhengda Li); Supervision: H.P. and P.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development Program of Hubei Province grant number 2024BBB053.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors would like to thank Wenfu Huang for his assistance in figure beautification, and Miao Yang for his contributions to the experiments on the MM-Grounding-DINO model.

Conflicts of Interest

Author Zhengda Li was employed by the company Wuhan X-Agriculture Intelligent Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. The JSON Output Format

[The JSON output format is provided as an image in the original article.]

Appendix B. Language Model JSON Specification

Appendix B.1. Start Harvesting Task

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.2. End Harvesting Task

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.3. Pause Work

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.4. Resume Work

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.5. Robot Voice-Controlled Movement

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.6. Speed Adjustment Control

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.7. Reset Arm Position

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.8. Self-Charging

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix C. Command Extraction and Entity Recognition Results

Taking apple picking as an example, Table A1 shows the actions and entities extracted by the language model from the text dataset we built.
Table A1. Command extraction and entity recognition results for apple picking activities.
ID | Original Command | Action | Entity
1 | For semi-ripe apples, use special tools for picking. | Picking | Semi-ripe apples
2 | Start from the edge of the orchard and gradually move inward to pick ripe apples. | Picking | Ripe apples
3 | After rain, check and pick brighter apples. | Checking, Picking | Brighter apples
4 | Use sensors to detect sugar content and decide when to pick unripe apples. | Deciding Picking Time | Unripe apples
5 | Pick those that are close to full ripeness but have not yet fallen. | Picking | Nearly fully ripe apples
6 | To ensure quality, prioritize picking apples with smooth and undamaged surfaces. | Prioritize Picking | Smooth and undamaged apples
7 | Create a picking route map in the orchard to plan the picking of different types of apples. | Planning Picking | Different types of apples
8 | Use drones to assist in locating hard-to-reach apples. | Assisting Location | Apples
9 | Utilize AI vision systems to analyze the optimal picking time for apples. | Analyzing Picking | Apples
10 | Record the location of each apple during picking for subsequent management. | Recording Location | Apples
11 | To protect the environment, pick wild apples without disrupting the ecosystem. | Picking | Wild apples
12 | By comparing the characteristics of different varieties, learn how to more accurately pick apples. | Learning Picking | Apples
13 | Combine weather forecast information to schedule picking activities ahead of time to ensure the quality of apples. | Scheduling Picking | Apples
14 | Improve picking techniques using machine learning to increase efficiency while reducing damage to apples. | Improving Picking Techniques | Techniques
15 | At night, use infrared cameras to help pick apples that ripen at night. | Helping Picking | Nightly ripe apples
16 | Develop specialized applications to guide workers on correctly picking various types of apples. | Guiding Picking | Various types of apples
17 | Train pickers to understand the best picking methods for each type of apple. | Training Picking | Picking methods
18 | For special events, carefully select and pick the highest quality apples. | Carefully Selecting, Picking | Highest quality apples
19 | Through community cooperation, jointly participate in the local orchard’s apple picking work. | Participating Picking | Apples
20 | During the harvest season, organize volunteers to pick large quantities of ripe apples together. | Organizing Picking | Ripe apples
21 | Pick wild apples from bushes near the ground. | Picking | Wild apples
22 | Find the highest branches and pick the largest apples there. | Picking | Largest apples
23 | When detecting specific colors, pick high-hanging apples. | Picking | Specific color apples
24 | On dewy mornings, pick fresh apples before the dew dries. | Picking | Fresh apples
25 | Carefully distinguish and only pick fully ripe apples. | Picking | Fully ripe apples
26 | Avoid damaging surrounding leaves and precisely pick the hard apples. | Precisely Picking | Hard apples
27 | Find and pick hidden apples among dense foliage. | Picking | Hidden apples
28 | Use image recognition technology to assist in picking rare apples. | Assisting Picking | Rare apples
29 | Adjust strategies according to seasonal changes to pick suitable apples. | Adjusting Picking | Apples
30 | Optimize routes efficiently to pick multiple apples using smart algorithms. | Optimizing Path, Picking | Apples
31 | Carefully pick ripe apples from fruit trees. | Carefully Picking | Ripe apples
32 | Instruct robots to go to the orchard to select and pick fresh apples. | Selecting, Picking | Fresh apples
33 | Use robotic arms to accurately pick brightly colored apples. | Accurately Picking | Brightly colored apples
34 | In greenhouses, search for and pick ripe clusters of apples. | Searching, Picking | Ripe clusters of apples
35 | Detect and pick small apples hanging on lower branches. | Detecting, Picking | Small apples
36 | Identify and pick apples whose color has changed from green to red. | Identifying, Picking | Green to red apples
37 | Confirm the fruit is ripe and then start picking round apples. | Confirming, Picking | Round apples
38 | Select and pick apples based on size and color. | Selecting, Picking | Apples
39 | Remove apples that have changed color from the tree. | Removing | Changed color apples
40 | Safely pick soft apples from vines or branches. | Safely Picking | Soft apples
41 | Use drones to take aerial photos to help locate apples that need picking. | Helping Locating | Apples
42 | Robots selectively pick apples based on preset maturity criteria. | Selectively Picking | Apples
43 | Set up an automatic navigation system in the orchard to assist in picking apples. | Assisting Picking | Apples
44 | Develop new picking algorithms to adapt to different sizes and shapes of apples. | Developing Algorithms | Apples
45 | Apply augmented reality (AR) technology to guide pickers to find the best locations for apples. | Guiding Picking | Apples
46 | Conduct a comprehensive scan of all apples in a specific area before starting to pick. | Scanning, Picking | Apples
47 | Use laser rangefinders to determine the exact position of each apple. | Determining Position | Apples
48 | Integrate environmental sensors to monitor weather conditions, optimizing the timing for picking unripe apples. | Monitoring, Optimizing | Unripe apples
49 | Employ wearable devices like smart glasses to assist workers in efficient picking of apples. | Assisting Picking | Apples
50 | Use machine vision recognition technology to accurately locate apples against complex backgrounds. | Locating | Apples
51 | Determine the optimal picking time by detecting color changes in fruits to start picking apples. | Detecting, Picking | Apples
52 | Instruct robots to search and only pick apples that meet specific maturity standards. | Searching, Picking | Maturity standard apples
53 | Assess the hardness of the fruit with sensors to determine the picking time, ensuring apples are at their best maturity. | Assessing, Picking | Apples
54 | When detected sugar content reaches peak, instruct robots to pick ripe apples. | Instructing, Picking | Ripe apples
55 | Use image recognition technology to analyze surface features of the fruit, selecting high-maturity apples. | Analyzing, Picking | High-maturity apples
56 | Identify and prioritize picking fully ripe apples. | Identifying, Prioritizing Picking | Fully ripe apples
57 | Screen and pick apples that meet predefined maturity parameters. | Screening, Picking | Apples
58 | Confirm that the size and color of the fruit meet maturity requirements before initiating the picking program for apples. | Confirming, Initiating | Apples
59 | Predict the optimal maturity of each type of apple using machine learning algorithms to schedule picking times. | Predicting, Scheduling | Apples
60 | Regularly monitor environmental conditions (such as temperature, humidity) to optimize the picking plan and ensure the maturity of apples. | Monitoring, Optimizing | Apples
61 | Based on fruit growth cycle data, intelligently determine when to pick apples. | Determining, Picking | Apples
62 | Perform a quick check before picking to ensure all selected apples have reached the expected maturity. | Checking, Ensuring | Apples
63 | Use infrared imaging technology to assist in judging the internal maturity of apples, guiding precise picking. | Judging, Guiding | Apples
64 | Combine weather forecast information to plan picking activities in advance, ensuring apples are picked at optimal maturity. | Planning, Ensuring | Apples
65 | Use automated systems to monitor the development progress of each apple, determining specific picking dates. | Monitoring, Determining | Apples
66 | Dynamically adjust picking strategies during the process to accommodate different types of apples based on real-time maturity analysis. | Analyzing, Adjusting | Apples
67 | Develop dedicated software to help growers identify which apples have reached ideal maturity, ready for picking. | Developing Software, Helping | Apples
68 | Collect data through wireless sensor networks, analyzing and predicting the maturity trends of apples on each tree. | Collecting, Analyzing, Predicting | Apples
69 | Use AI models to simulate the maturation process of apples under different environmental conditions, providing scientific picking suggestions. | Simulating, Providing | Apples
70 | Equip multi-spectral cameras for precise assessment of apple maturity, executing picking tasks accordingly. | Assessing, Executing | Apples
71 | Differentiate between semi-ripe and fully ripe apples, selecting appropriate picking targets as needed. | Differentiating, Selecting | Apples
72 | Accurately predict the maturity of apples by referencing historical data and current environmental conditions, scheduling picking accordingly. | Predicting, Scheduling | Apples
73 | Use machine vision systems to perform 3D scans of fruits to assess their shape and maturity, deciding whether to pick apples. | Scanning, Assessing, Deciding | Apples
74 | Leverage deep-learning-based algorithms enabling robots to efficiently recognize and pick mature apples in complex backgrounds. | Recognizing, Picking | Apples
75 | Before picking, conduct gentle touch tests to verify if apples are sufficiently ripe. | Testing, Verifying | Apples
76 | Set multiple maturity thresholds to allow robots to flexibly handle the picking needs of different types of apples. | Setting, Handling | Apples
77 | Judge maturity based on the aroma release pattern of the fruit, selectively picking suitable apples. | Judging, Picking | Apples
78 | Continuously update information about the maturity of apples during the picking process, enhancing picking efficiency and quality. | Updating, Enhancing | Apples
79Combine various non-invasive detection methods, such as sound waves and spectral analysis, to determine the maturity of apples.Combining, DeterminingApples
80Robots remotely control picking actions based on cloud platform data analysis capabilities, ensuring only correctly matured apples are picked.Controlling, EnsuringApples
81Receive real-time feedback through mobile applications, adjusting the robot’s maturity evaluation standards for apples.Receiving, AdjustingApples
82Recognize prematurely or delayed-ripening apples due to weather reasons, adjusting picking strategies Accuracyordingly.Recognizing, AdjustingApples
83Customize maturity detection schemes based on the characteristics of different varieties of apples, improving picking Accuracy.Customizing, ImprovingApples
84Utilize blockchain to record the entire process from sowing to picking, ensuring transparency and traceability of maturity information for each apple.Recording, EnsuringApples
85Flexibly adjust picking operations for apples based on user-set maturity preferences.AdjustingApples
86Before picking, confirm the final maturity of apples, avoiding premature or late picking.Confirming, AvoidingApples
87Simulate natural pollination processes to promote fruit development, ensuring apples reach ideal maturity at picking time.Promoting, EnsuringApples
88Use virtual reality (VR) training modules to familiarize operators with how to judge the maturity of different kinds of apples.Training, JudgingApples
89Instantly evaluate and select the ripest apples during the picking process.Evaluating, SelectingApples
90See those very bright-colored apples? It’s time to pick them.PickingBright-colored apples
91This morning I noticed some apples started changing color; please help me pick these ripe fruits.PickingRipe fruits
92Can you gently pick those just perfectly ripe apples in the orchard while the dew is still on?PickingPerfectly ripe apples
93When you lightly touch those hanging apples on the tree, if they feel soft but not mushy, please pick them.PickingSoft but not mushy apples
94Look at those darker-colored apples at the top of the tree; please help me pick them.PickingDarker-colored apples
95Accuracyording to the weather forecast, these days are perfect for picking; please prepare in advance and do not miss any batch of ripe apples.Preparing, PickingRipe apples
96The greenhouse environment is well controlled; robot, you can check and pick those apples that have ripened ideally.Checking, PickingRipe apples
97Before picking, observe the color and size of each apple; choose only the perfect ones.Observing, ChoosingPerfect apples
98Through touch and visual inspection, determine which apples have reached ideal maturity, then carefully pick them.Determining, PickingApples
99As soon as you smell the sweet aroma in the air, it’s time to go pick those enticingly scented apples.PickingEnticingly scented apples
100Monitor the weather conditions and select sunny days for picking to ensure the sweetness and quality of the apples.Pickingapples
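The examples above pair each free-form instruction with the action and picking target that the system should extract. As a purely illustrative stand-in for the language-model-based intent recognition used by the instruction module, the toy Python parser below (the keyword lists, the regex, and the function name parse_instruction are all hypothetical, not the paper's implementation) shows the Instruction → (Action, Maturity, Target) mapping in code form.

```python
import re

# Toy keyword-based parser, shown only to illustrate the mapping in the table
# above. The paper's instruction module relies on a language model, not on
# hand-written rules, so every name and rule here is hypothetical.
PICK_VERBS = ("pick", "harvest", "select", "remove")
MATURITY_WORDS = ("fully ripe", "semi-ripe", "unripe", "ripe",
                  "semi-mature", "immature", "mature")

def parse_instruction(text: str) -> dict:
    text_l = text.lower()
    action = next((v for v in PICK_VERBS if v in text_l), None)
    maturity = next((m for m in MATURITY_WORDS if m in text_l), None)
    # Crude target extraction: the word immediately before "apples", if any.
    match = re.search(r"([\w-]+\s+apples)", text_l)
    target = match.group(1) if match else ("apples" if "apples" in text_l else None)
    return {"action": action, "maturity": maturity, "target": target}

print(parse_instruction("Carefully pick ripe apples from fruit trees."))
# {'action': 'pick', 'maturity': 'ripe', 'target': 'ripe apples'}
```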

References

  1. Chen, Z.; Lei, X.; Yuan, Q.; Qi, Y.; Ma, Z.; Qian, S.; Lyu, X. Key Technologies for Autonomous Fruit- and Vegetable-Picking Robots: A Review. Agronomy 2024, 14, 2233. [Google Scholar] [CrossRef]
  2. Nyyssönen, A. Vision-Language Models in Industrial Robotics. Bachelor’s Thesis, Faculty of Engineering and Natural Sciences, Tampere University, Tampere, Finland, 2024. [Google Scholar]
  3. Sapkota, R.; Ahmed, D.; Karkee, M. Comparing YOLOv8 and Mask R-CNN for instance segmentation in complex orchard environments. Artif. Intell. Agric. 2024, 13, 84–99. [Google Scholar] [CrossRef]
  4. Jiang, L.; Jiang, H.; Jing, X.; Dang, H.; Li, R.; Chen, J.; Majeed, Y.; Sahni, R.; Fu, L. UAV-based field watermelon detection and counting using YOLOv8s with image panorama stitching and overlap partitioning. Artif. Intell. Agric. 2024, 13, 117–127. [Google Scholar] [CrossRef]
  5. Zhou, J.; Zhang, Y.; Wang, J. A Dragon Fruit Picking Detection Method Based on YOLOv7 and PSP-Ellipse. Sensors 2023, 23, 3803. [Google Scholar] [CrossRef] [PubMed]
  6. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  7. Liang, Q.; Zhu, W.; Long, J.; Wang, Y.; Sun, W.; Wu, W. A Real-Time Detection Framework for On-Tree Mango Based on SSD Network. In Proceedings of the Intelligent Robotics and Applications, Newcastle, NSW, Australia, 9–11 August 2018; Chen, Z., Mendes, A., Yan, Y., Chen, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2018; pp. 423–436. [Google Scholar]
  8. Vasconez, J.; Delpiano, J.; Vougioukas, S.; Auat Cheein, F. Comparison of convolutional neural networks in fruit detection and counting: A comprehensive evaluation. Comput. Electron. Agric. 2020, 173, 105348. [Google Scholar] [CrossRef]
  9. Wang, P.; Niu, T.; He, D. Tomato Young Fruits Detection Method under Near Color Background Based on Improved Faster R-CNN with Attention Mechanism. Agriculture 2021, 11, 1059. [Google Scholar] [CrossRef]
  10. Xu, P.; Fang, N.; Liu, N.; Lin, F.; Yang, S.; Ning, J. Visual recognition of cherry tomatoes in plant factory based on improved deep instance segmentation. Comput. Electron. Agric. 2022, 197, 106991. [Google Scholar] [CrossRef]
  11. Bordes, F.; Pang, R.Y.; Ajay, A.; Li, A.C.; Bardes, A.; Petryk, S.; Mañas, O.; Lin, Z.; Mahmoud, A.; Jayaraman, B.; et al. An Introduction to Vision-Language Modeling. arXiv 2024, arXiv:2405.17247. [Google Scholar]
  12. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar]
  13. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv 2022, arXiv:2201.12086. [Google Scholar]
  14. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv 2023, arXiv:2301.12597. [Google Scholar]
  15. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485. [Google Scholar]
  16. Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar]
  17. Cao, Y.; Chen, L.; Yuan, Y.; Sun, G. Cucumber disease recognition with small samples using image-text-label-based multi-modal language model. Comput. Electron. Agric. 2023, 211, 107993. [Google Scholar] [CrossRef]
  18. Zhou, Y.; Yan, H.; Ding, K.; Cai, T.; Zhang, Y. Few-Shot Image Classification of Crop Diseases Based on Vision–Language Models. Sensors 2024, 24, 6109. [Google Scholar] [CrossRef]
  19. Tan, C.; Cao, Q.; Li, Y.; Zhang, J.; Yang, X.; Zhao, H.; Wu, Z.; Liu, Z.; Yang, H.; Wu, N.; et al. On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications. arXiv 2023, arXiv:2312.17016. [Google Scholar]
  20. Qing, J.; Deng, X.; Lan, Y.; Li, Z. GPT-aided diagnosis on agricultural image based on a new light YOLOPC. Comput. Electron. Agric. 2023, 213, 108168. [Google Scholar] [CrossRef]
  21. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO (Version 8.0.0). 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 9 April 2025).
  22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  24. Li, Y.; Bubeck, S.; Eldan, R.; Giorno, A.D.; Gunasekar, S.; Lee, Y.T. Textbooks Are All You Need II: Phi-1.5 technical report. arXiv 2023, arXiv:2309.05463. [Google Scholar]
  25. Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2 Technical Report. arXiv 2024, arXiv:2407.10671. [Google Scholar]
  26. Zhou, B.; Hu, Y.; Weng, X.; Jia, J.; Luo, J.; Liu, X.; Wu, J.; Huang, L. TinyLLaVA: A Framework of Small-scale Large Multimodal Models. arXiv 2024, arXiv:2402.14289. [Google Scholar]
  27. OpenAI. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  28. DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  29. DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv 2024, arXiv:2405.04434. [Google Scholar]
  30. DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv 2025, arXiv:2412.19437. [Google Scholar]
  31. Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded Language-Image Pre-training. arXiv 2022, arXiv:2112.03857. [Google Scholar]
  32. Li, Y.; Xie, S.; Chen, X.; Dollar, P.; He, K.; Girshick, R. Benchmarking Detection Transfer Learning with Vision Transformers. arXiv 2021, arXiv:2111.11429. [Google Scholar]
  33. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2024, arXiv:2304.08069. [Google Scholar]
  34. Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. RegionCLIP: Region-based Language-Image Pretraining. arXiv 2021, arXiv:2112.09106. [Google Scholar]
  35. Zhao, X.; Chen, Y.; Xu, S.; Li, X.; Wang, X.; Li, Y.; Huang, H. An Open and Comprehensive Pipeline for Unified Object Grounding and Detection. arXiv 2024, arXiv:2401.02361. [Google Scholar]
  36. Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892. [Google Scholar]
  37. Yaseen, M. What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector. arXiv 2024, arXiv:2408.15857. [Google Scholar]
  38. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
Figure 1. (a) The overall framework of the fruit-picking system, which comprises the instruction processing module, the visual language processing module, and the motion control module. (b) Technical details of the visual language processing module. YOLOv8 first detects individual fruit targets in the image; each detected region is then classified through image–image and image–text contrastive learning, and the final result is output.
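To make the classification step in Figure 1b concrete, the sketch below shows one plausible way to fuse the two branches at inference time: the embedding of a detected region is compared against per-class text-prompt embeddings (image–text branch) and per-class reference image embeddings (image–image branch), and the two cosine-similarity vectors are blended with a weight α (0.6 is the best value in the sweep of Table 10). The function name, the linear fusion rule, and the reuse of α at inference are illustrative assumptions; the encoders and the construction of the prototypes are omitted.

```python
import torch
import torch.nn.functional as F

def fuse_similarities(region_emb: torch.Tensor,
                      text_embs: torch.Tensor,
                      proto_embs: torch.Tensor,
                      alpha: float = 0.6) -> torch.Tensor:
    """Blend the image-text and image-image branches into one class score.

    region_emb: (D,)   embedding of one YOLOv8-detected fruit region
    text_embs:  (K, D) embeddings of K class/maturity prompts (text branch)
    proto_embs: (K, D) reference image embeddings, one prototype per class
    alpha:      assumed weight on the image-image branch
    """
    region = F.normalize(region_emb, dim=-1)
    sim_text = F.normalize(text_embs, dim=-1) @ region    # (K,) cosine similarities
    sim_image = F.normalize(proto_embs, dim=-1) @ region  # (K,) cosine similarities
    return alpha * sim_image + (1.0 - alpha) * sim_text

if __name__ == "__main__":
    D, K = 512, 5                                          # embedding dim, class count
    scores = fuse_similarities(torch.randn(D), torch.randn(K, D), torch.randn(K, D))
    print(int(scores.argmax()))                            # index of the predicted class
```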
Figure 2. Flowchart of the intelligent fruit-picking system. The instruction module receives user commands (text or voice), identifies picking intent via language model processing, and triggers the Identify Module to capture and analyze video for object detection. The motion control module then guides the robotic arm to perform the picking action based on detection results.
Figure 3. YOLOv8 Detection Head. It retains the detection function. The input fruit features go through two convolutional layers (kernel size k = 3, stride s = 1, padding p = 1), and then enter a regression head (Conv2d) with kernel size k = 3, stride s = 1, padding p = 0, and output channels c = 4 × reg_max, where reg_max is the maximum bounding box coordinate offset that the model can predict. Bounding box predictions are supervised using the Bbox.Loss function. The red box indicates the detected object.
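A minimal PyTorch rendering of the regression branch as described in the caption is shown below; the channel widths and the SiLU activations are illustrative assumptions, while the kernel/stride/padding values and the 4 × reg_max output channels follow Figure 3.

```python
import torch
import torch.nn as nn

def bbox_regression_head(c_in: int = 256, c_mid: int = 64, reg_max: int = 16) -> nn.Sequential:
    """Regression branch with the hyperparameters stated in Figure 3:
    two 3x3 convs (s=1, p=1) followed by a 3x3 Conv2d (s=1, p=0) that outputs
    4 * reg_max channels. Channel widths and activations are illustrative."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, kernel_size=3, stride=1, padding=1), nn.SiLU(),
        nn.Conv2d(c_mid, c_mid, kernel_size=3, stride=1, padding=1), nn.SiLU(),
        nn.Conv2d(c_mid, 4 * reg_max, kernel_size=3, stride=1, padding=0),  # p=0 trims H and W by 2
    )

head = bbox_regression_head()
print(head(torch.randn(1, 256, 80, 80)).shape)  # torch.Size([1, 64, 78, 78])
```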
Figure 4. Improved ViT architecture. It uses windows of different sizes to fuse image features and implements a multi-scale self-attention mechanism to capture local and global features of the image. “*” represents the extra learnable [class] token used for classification.
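The multi-scale self-attention idea in Figure 4 can be sketched as windowed attention computed at several window sizes and then fused. The window sizes, the averaging fusion, and the class name below are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiScaleWindowAttention(nn.Module):
    """Simplified multi-scale windowed self-attention: attention is computed
    inside non-overlapping windows of several sizes and the outputs are averaged."""

    def __init__(self, dim: int, num_heads: int = 4, window_sizes=(2, 4, 8)):
        super().__init__()
        self.window_sizes = window_sizes
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in window_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C); H and W are assumed divisible by every window size.
        B, H, W, C = x.shape
        fused = []
        for w, attn in zip(self.window_sizes, self.attn):
            # Partition the feature map into (H/w * W/w) windows of w*w tokens.
            win = x.reshape(B, H // w, w, W // w, w, C)
            win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
            out, _ = attn(win, win, win)                       # local self-attention
            out = out.reshape(B, H // w, W // w, w, w, C)
            out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
            fused.append(out)
        return torch.stack(fused).mean(dim=0)                  # fuse the scales

feats = torch.randn(1, 16, 16, 64)
print(MultiScaleWindowAttention(64)(feats).shape)              # torch.Size([1, 16, 16, 64])
```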
Figure 5. The AP50 value of each type of fruit. The results are divided into three groups represented by different colors: 0–0.6, 0.6–0.9, 0.9–1.0.
Figure 6. The results of fruit detection and classification.
Figure 7. The mAP@0.5 values of each model under different occlusion and lighting conditions.
Figure 8. The few-shot performance of different models under varying numbers of training samples. (a) Performance metrics with 1-shot training. (b) Performance metrics with 4-shot training. (c) Performance metrics with 8-shot training. (d) Performance metrics with 16-shot training.
Figure 9. The detection and classification results of untrained fruits.
Figure 10. Optimal values of mAP@0.5, mAP@[0.5:0.95], and F1 scores under different α values.
Table 1. Distribution of the dataset across different categories and maturity levels.
Category | Maturity | Total Instances | Category | Maturity | Total Instances
Apple | Mature, Semi-mature, Immature | 568 | Lychee | - | 448
Banana | Mature, Semi-mature, Immature | 217 | Lemon | - | 451
Grapes | Mature, Semi-mature, Immature | 395 | Pear | - | 451
Strawberry | Mature, Semi-mature, Immature | 265 | Tomato | - | 357
Persimmon | Mature, Semi-mature, Immature | 379 | Mango | - | 1223
Peach | Mature, Semi-mature, Immature | 1308 | | |
Passion Fruit | Mature, Semi-mature, Immature | 708 | | |
Total | | 3840 | Total | | 2930
Total: 6770
Table 2. Hardware configuration for the model deployment.
Component | Specification | Usage
GPU | NVIDIA RTX 4090D | Model inference and computation acceleration
CPU | AMD EPYC 9754 | Data preprocessing and backend services
Memory | 60 GB | Store model parameters, intermediate cache, and temporary data
Storage | System Disk: 30 GB SSD; Data Disk: 50 GB SSD | Fast loading of training data and model files
Table 3. Software stack for the model deployment.
Component | Version | Usage
Operating System | Ubuntu 22.04 | Provides a reliable runtime environment
Framework | PyTorch 2.5.1 | Supports model training and inference
Python Environment | Python 3.12 | Manages script execution and dependencies
CUDA Toolkit | CUDA 12.4 | Enables GPU acceleration for deep learning
Table 4. Simulated performance comparison under different quantization methods.
Model | Input Length (Chinese Characters) | Quantization | GPU Num | Speed (Tokens/s) | Accuracy (%)
Phi-2 [24] | 30–50 | BF16 | 1 | 42.5 | 91.00
Phi-2 [24] | 30–50 | GPTQ-Int4 | 1 | 47.3 | 91.40
Phi-2 [24] | 30–50 | AWQ | 1 | 43.9 | 90.7
Qwen2-7B [25] | 30–50 | BF16 | 1 | 34.7 | 95.10
Qwen2-7B [25] | 30–50 | GPTQ-Int4 | 1 | 36.2 | 94.80
Qwen2-7B [25] | 30–50 | AWQ | 1 | 33.1 | 93.80
TinyLLaVA [26] | 30–50 | BF16 | 1 | 30.5 | 91.10
TinyLLaVA [26] | 30–50 | GPTQ-Int4 | 1 | 34.2 | 90.90
GPT-4 [27] | 30–50 | BF16 | 1 | 30.2 | 94.60
GPT-4 [27] | 30–50 | GPTQ-Int4 | 1 | 34.8 | 94.80
GPT-4 [27] | 30–50 | AWQ | 1 | 33.6 | 93.80
DeepSeek-R1 [28] | 30–50 | BF16 | 1 | 21.8 | 93.60
DeepSeek-V2 [29] | 30–50 | FP8 | 1 | 32.2 | 95.20
DeepSeek-V3 [30] | 30–50 | 3-bit | 1 | 12.3 | 94.80
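Table 4 reports decoding speed in tokens per second for short Chinese picking instructions. A minimal benchmarking sketch of that metric with Hugging Face Transformers is given below; the checkpoint name, prompt, and generation length are placeholders, and the quantized rows (GPTQ-Int4, AWQ, FP8, 3-bit) would be loaded from the corresponding quantized checkpoints rather than in BF16.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "microsoft/phi-2"                  # placeholder; swap in any model from Table 4
PROMPT = "请去果园采摘那些已经成熟的红苹果。"  # example Chinese picking instruction

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"   # BF16 rows of Table 4
)

inputs = tok(PROMPT, return_tensors="pt").to(model.device)
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```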
Table 5. Performance comparison of different models.
Model | Precision | Recall | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
GLIP [31] | 0.516 | 0.523 | 0.519 | 0.507 | 0.226
ViTDet [32] | 0.539 | 0.554 | 0.546 | 0.542 | 0.366
RT-DETR [33] | 0.671 | 0.691 | 0.681 | 0.683 | 0.396
RegionCLIP [34] | 0.081 | 0.089 | 0.085 | 0.084 | 0.036
MM-Grounding-DINO [35] | 0.413 | 0.708 | 0.522 | 0.413 | 0.248
YOLOv5 [36] | 0.682 | 0.706 | 0.694 | 0.721 | 0.568
YOLOv8 [37] | 0.711 | 0.723 | 0.717 | 0.720 | 0.558
YOLO11 [38] | 0.728 | 0.710 | 0.719 | 0.718 | 0.557
E-CLIP (ours) | 0.778 | 0.728 | 0.752 | 0.791 | 0.652
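As a quick sanity check on Table 5, the F1 score is the harmonic mean of precision and recall; for example, E-CLIP's precision of 0.778 and recall of 0.728 reproduce the reported F1 of 0.752.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# E-CLIP row of Table 5: precision 0.778 and recall 0.728 give the reported F1 of 0.752.
print(round(f1_score(0.778, 0.728), 3))  # 0.752
```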
Table 6. Robustness comparison of different models.
Model | Scenario | Precision | Recall | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
GLIP | Slight Occ.—Normal Light | 0.527 | 0.531 | 0.529 | 0.528 | 0.241
ViTDet | Slight Occ.—Normal Light | 0.541 | 0.564 | 0.552 | 0.553 | 0.372
RT-DETR | Slight Occ.—Normal Light | 0.689 | 0.701 | 0.695 | 0.693 | 0.405
RegionCLIP | Slight Occ.—Normal Light | 0.096 | 0.103 | 0.099 | 0.091 | 0.042
MM-Grounding-DINO | Slight Occ.—Normal Light | 0.468 | 0.874 | 0.610 | 0.468 | 0.289
YOLOv5 | Slight Occ.—Normal Light | 0.728 | 0.703 | 0.715 | 0.736 | 0.575
YOLOv8 | Slight Occ.—Normal Light | 0.724 | 0.728 | 0.726 | 0.729 | 0.561
YOLO11 | Slight Occ.—Normal Light | 0.732 | 0.711 | 0.721 | 0.727 | 0.559
E-CLIP (ours) | Slight Occ.—Normal Light | 0.836 | 0.829 | 0.832 | 0.832 | 0.661
GLIP | Slight Occ.—Backlight | 0.472 | 0.481 | 0.476 | 0.462 | 0.205
ViTDet | Slight Occ.—Backlight | 0.502 | 0.515 | 0.508 | 0.508 | 0.335
RT-DETR | Slight Occ.—Backlight | 0.623 | 0.643 | 0.633 | 0.635 | 0.362
RegionCLIP | Slight Occ.—Backlight | 0.070 | 0.078 | 0.074 | 0.073 | 0.035
MM-Grounding-DINO | Slight Occ.—Backlight | 0.352 | 0.700 | 0.468 | 0.352 | 0.201
YOLOv5 | Slight Occ.—Backlight | 0.635 | 0.658 | 0.646 | 0.673 | 0.530
YOLOv8 | Slight Occ.—Backlight | 0.663 | 0.675 | 0.669 | 0.672 | 0.520
YOLO11 | Slight Occ.—Backlight | 0.678 | 0.662 | 0.670 | 0.670 | 0.519
E-CLIP (ours) | Slight Occ.—Backlight | 0.702 | 0.689 | 0.695 | 0.693 | 0.589
GLIP | Moderate Occ.—Normal Light | 0.427 | 0.437 | 0.432 | 0.418 | 0.183
ViTDet | Moderate Occ.—Normal Light | 0.468 | 0.482 | 0.475 | 0.475 | 0.316
RT-DETR | Moderate Occ.—Normal Light | 0.581 | 0.602 | 0.591 | 0.593 | 0.340
RegionCLIP | Moderate Occ.—Normal Light | 0.060 | 0.068 | 0.064 | 0.063 | 0.014
MM-Grounding-DINO | Moderate Occ.—Normal Light | 0.337 | 0.790 | 0.473 | 0.335 | 0.276
YOLOv5 | Moderate Occ.—Normal Light | 0.590 | 0.613 | 0.601 | 0.627 | 0.495
YOLOv8 | Moderate Occ.—Normal Light | 0.615 | 0.628 | 0.621 | 0.628 | 0.487
YOLO11 | Moderate Occ.—Normal Light | 0.632 | 0.617 | 0.624 | 0.626 | 0.485
E-CLIP (ours) | Moderate Occ.—Normal Light | 0.634 | 0.665 | 0.649 | 0.651 | 0.581
GLIP | Moderate Occ.—Backlight | 0.394 | 0.403 | 0.398 | 0.384 | 0.165
ViTDet | Moderate Occ.—Backlight | 0.433 | 0.447 | 0.440 | 0.442 | 0.295
RT-DETR | Moderate Occ.—Backlight | 0.541 | 0.562 | 0.551 | 0.553 | 0.313
RegionCLIP | Moderate Occ.—Backlight | 0.055 | 0.062 | 0.058 | 0.057 | 0.012
MM-Grounding-DINO | Moderate Occ.—Backlight | 0.586 | 0.554 | 0.569 | 0.582 | 0.209
YOLOv5 | Moderate Occ.—Backlight | 0.550 | 0.573 | 0.561 | 0.585 | 0.462
YOLOv8 | Moderate Occ.—Backlight | 0.575 | 0.588 | 0.581 | 0.586 | 0.455
YOLO11 | Moderate Occ.—Backlight | 0.591 | 0.577 | 0.584 | 0.583 | 0.453
E-CLIP (ours) | Moderate Occ.—Backlight | 0.581 | 0.643 | 0.610 | 0.619 | 0.523
GLIP | Severe Occ.—Normal Light | 0.352 | 0.362 | 0.357 | 0.345 | 0.142
ViTDet | Severe Occ.—Normal Light | 0.397 | 0.410 | 0.403 | 0.403 | 0.275
RT-DETR | Severe Occ.—Normal Light | 0.502 | 0.523 | 0.512 | 0.514 | 0.294
RegionCLIP | Severe Occ.—Normal Light | 0.048 | 0.055 | 0.051 | 0.050 | 0.012
MM-Grounding-DINO | Severe Occ.—Normal Light | 0.426 | 0.759 | 0.553 | 0.426 | 0.324
YOLOv5 | Severe Occ.—Normal Light | 0.512 | 0.535 | 0.523 | 0.551 | 0.435
YOLOv8 | Severe Occ.—Normal Light | 0.535 | 0.548 | 0.541 | 0.548 | 0.425
YOLO11 | Severe Occ.—Normal Light | 0.552 | 0.539 | 0.545 | 0.545 | 0.423
E-CLIP (ours) | Severe Occ.—Normal Light | 0.552 | 0.601 | 0.575 | 0.574 | 0.483
GLIP | Severe Occ.—Backlight | 0.316 | 0.325 | 0.320 | 0.308 | 0.121
ViTDet | Severe Occ.—Backlight | 0.358 | 0.371 | 0.364 | 0.363 | 0.251
RT-DETR | Severe Occ.—Backlight | 0.462 | 0.483 | 0.472 | 0.474 | 0.261
RegionCLIP | Severe Occ.—Backlight | 0.032 | 0.039 | 0.035 | 0.034 | 0.001
MM-Grounding-DINO | Severe Occ.—Backlight | 0.332 | 0.595 | 0.428 | 0.328 | 0.208
YOLOv5 | Severe Occ.—Backlight | 0.472 | 0.495 | 0.483 | 0.508 | 0.402
YOLOv8 | Severe Occ.—Backlight | 0.495 | 0.508 | 0.501 | 0.505 | 0.395
YOLO11 | Severe Occ.—Backlight | 0.512 | 0.499 | 0.505 | 0.503 | 0.393
E-CLIP (ours) | Severe Occ.—Backlight | 0.513 | 0.576 | 0.543 | 0.551 | 0.462
Table 7. Generalization performance of the E-CLIP.
Setting | Precision | Recall | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
1-shot | 0.431 | 0.125 | 0.194 | 0.059 | 0.049
4-shot | 0.420 | 0.438 | 0.429 | 0.387 | 0.292
8-shot | 0.596 | 0.421 | 0.493 | 0.492 | 0.340
16-shot | 0.606 | 0.618 | 0.612 | 0.646 | 0.523
Table 8. Performance metrics for new categories.
Category | Precision | Recall | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
Orange | 0.505 | 0.632 | 0.561 | 0.649 | 0.442
Watermelon | 0.664 | 0.610 | 0.636 | 0.619 | 0.248
Cantaloupe | 0.944 | 0.615 | 0.745 | 0.819 | 0.492
Cherry | 0.356 | 0.679 | 0.467 | 0.415 | 0.129
All | 0.617 | 0.634 | 0.625 | 0.626 | 0.328
Table 9. The effect of the image–image module and text–image module. ↓ indicates performance drop compared to the full model; ✓ means the module is enabled, and × means the module is disabled.
Image–Image Module | Text–Image Module | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
✓ | ✓ | 0.752 | 0.791 | 0.652
✓ | × | 0.695 (↓0.057) | 0.683 (↓0.108) | 0.521 (↓0.131)
× | ✓ | 0.686 (↓0.066) | 0.667 (↓0.124) | 0.513 (↓0.139)
Table 10. The impact of the loss function’s weight parameter α .
α | Precision | Recall | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
0.0 | 0.318 | 0.638 | 0.419 | 0.396 | 0.281
0.1 | 0.357 | 0.622 | 0.453 | 0.413 | 0.306
0.2 | 0.414 | 0.641 | 0.511 | 0.523 | 0.389
0.3 | 0.445 | 0.644 | 0.526 | 0.529 | 0.390
0.4 | 0.502 | 0.553 | 0.527 | 0.614 | 0.479
0.5 | 0.638 | 0.692 | 0.663 | 0.661 | 0.512
0.6 | 0.778 | 0.728 | 0.752 | 0.791 | 0.652
0.7 | 0.445 | 0.644 | 0.526 | 0.529 | 0.39
0.8 | 0.434 | 0.656 | 0.528 | 0.532 | 0.399
0.9 | 0.337 | 0.558 | 0.421 | 0.385 | 0.281
1.0 | 0.305 | 0.707 | 0.419 | 0.387 | 0.272
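Table 10 sweeps the weight α that balances the two contrastive objectives, with α = 0.6 performing best. A minimal sketch of one plausible formulation is given below, assuming a symmetric InfoNCE term for each branch and a linear combination L = α·L_image–image + (1 − α)·L_image–text; the exact loss used in the paper may differ, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings a, b of shape (N, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                       # (N, N) similarity matrix
    labels = torch.arange(a.size(0))               # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def combined_loss(img: torch.Tensor, text: torch.Tensor, ref_img: torch.Tensor,
                  alpha: float = 0.6) -> torch.Tensor:
    """Assumed weighting: L = alpha * L_image-image + (1 - alpha) * L_image-text."""
    return alpha * info_nce(img, ref_img) + (1.0 - alpha) * info_nce(img, text)

# Quick check with random embeddings (batch of 8, dimension 512).
print(combined_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)))
```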
Table 11. Comparison of model inference speed and computing efficiency.
Model | GFLOPs (10⁹) | Parameters (10⁶) | FPS
ViTDet | 321.94 | 563.20 | 26.45
RT-DETR | 231.72 | 409.23 | 28.79
GLIP | 415.17 | 954.19 | 18.76
RegionCLIP | 518.52 | 923.01 | 14.79
MM-Grounding-DINO | 39.61 | 201.26 | 4.76
YOLOv5 | 24.1 | 9.13 | 135.14
YOLOv8 | 28.7 | 11.15 | 272.27
YOLO11 | 6.50 | 2.60 | 83.33
E-CLIP (ours) | 98.61 | 86.42 | 54.82
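The FPS and parameter counts in Table 11 can be measured with a simple PyTorch harness like the sketch below; the 640 × 640 input resolution, warm-up length, and iteration count are assumptions rather than the paper's measurement protocol.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, input_shape=(1, 3, 640, 640),
                warmup: int = 10, iters: int = 100) -> float:
    """Average single-image frames per second (the FPS column of Table 11)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                    # warm-up to stabilise clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

def count_params_m(model: torch.nn.Module) -> float:
    """Parameter count in millions (the 10^6 column of Table 11)."""
    return sum(p.numel() for p in model.parameters()) / 1e6
```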
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
