Article

E-CLIP: An Enhanced CLIP-Based Visual Language Model for Fruit Detection and Recognition

1 College of Informatics, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
2 College of Plant Science and Technology, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
3 Wuhan X-Agriculture Intelligent Technology Co., Ltd., Wuhan 430070, China
4 Hubei Hongshan Laboratory, Wuhan 430070, China
5 National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
* Authors to whom correspondence should be addressed.
Agriculture 2025, 15(11), 1173; https://doi.org/10.3390/agriculture15111173
Submission received: 20 April 2025 / Revised: 21 May 2025 / Accepted: 26 May 2025 / Published: 29 May 2025
(This article belongs to the Section Digital Agriculture)

Abstract

With the progress of agricultural modernization, intelligent fruit harvesting is gaining importance. While fruit detection and recognition are essential for robotic harvesting, existing methods suffer from limited generalizability: they struggle to adapt to complex environments and to handle new fruit varieties. This problem stems from their reliance on unimodal visual data, which creates a semantic gap between image features and contextual understanding. To address these issues, this study proposes a multi-modal fruit detection and recognition framework based on visual language models (VLMs). By integrating multi-modal information, the proposed model enhances robustness and generalization across diverse environmental conditions and fruit types. The framework accepts natural language instructions as input, facilitating effective human–machine interaction. Through its core module, Enhanced Contrastive Language–Image Pre-Training (E-CLIP), which employs image–image and image–text contrastive learning mechanisms, the framework achieves robust recognition of various fruit types and their maturity levels. Experimental results demonstrate the excellent performance of the model, which achieves an F1 score of 0.752 and an mAP@0.5 of 0.791. The model also exhibits robustness under occlusion and varying illumination conditions, attaining a zero-shot mAP@0.5 of 0.626 for unseen fruits. In addition, the system operates at an inference speed of 54.82 FPS, effectively balancing speed and accuracy, and shows practical potential for smart agriculture. This research provides new insights and methods for the practical application of smart agriculture.

1. Introduction

With the continuous growth of the global population and the increase in food demand, the agricultural sector faces significant challenges in improving production efficiency and sustainability [1]. In this context, the automation of agricultural harvesting has become one of the research and development hotspots. By introducing robotic technology to perform fruit-picking tasks, it is not only possible to greatly reduce labor intensity but also effectively enhance operational efficiency and quality. Presently, intelligent picking robots primarily consist of a recognition module, control module, and motion module [2]. As the first module for acquiring and processing external information, the recognition module plays a crucial role in improving the work efficiency of intelligent picking robots. Many studies have made efforts to enhance various aspects of agricultural robot recognition modules, such as speed, accuracy, and generalization.
Driven by the rapid advancement of artificial intelligence technologies, deep learning-based methods have achieved remarkable progress in fruit detection and recognition. YOLO is popular in agricultural recognition and edge-device deployment due to its high speed and precision. Sapkota et al. adopted YOLOv8 for instance segmentation in complex apple orchard environments, constructing datasets for dormant apple trees and for early-growing-season images containing green leaves and unripe apples. On dataset 1, YOLOv8 achieved a precision of 0.90 and a recall of 0.95 for all classes; on dataset 2, it achieved a precision of 0.93 and a recall of 0.97 [3]. Jiang et al. used YOLOv8 for watermelon counting in the field, achieving a detection accuracy of 99.20% [4]. Zhou et al. employed YOLOv7 for pitaya detection and classification, achieving precision, recall, and mean average precision of 0.844, 0.924, and 0.932, respectively [5]. However, when multiple objects overlap within a grid cell, YOLO may fail to detect all of them because it predicts only a fixed number of bounding boxes per cell, making it less suitable for complex farm environments and fruit occlusions. The SSD algorithm combines the benefits of Faster R-CNN and YOLO, adopting a one-pass prediction scheme that reduces redundant computation and significantly increases detection speed over Faster R-CNN while maintaining comparable accuracy [6]. Liang et al. proposed an SSD-based mango detection method, achieving an excellent F1 score of 0.911 at 35 FPS [7]. A related detection-based counting approach realized a video counting accuracy of 90% on Hass avocado and lemon datasets from Chile and apple datasets from California, USA [8]. However, because shallow feature maps have smaller receptive fields, SSD may perform poorly on small targets, and the same fruit may be detected redundantly by bounding boxes of different sizes. The R-CNN series (including R-CNN, Fast R-CNN, and Faster R-CNN) and Mask R-CNN are network architectures specifically designed for object detection. Wang et al. [9] proposed an improved Faster R-CNN with an attention mechanism for background color similarity for cherry tomato detection and identification, addressing the low recognition accuracy caused by varying light conditions and leaf occlusion. Xu et al. [10] proposed an improved Mask R-CNN model that considers prior neighborhood constraints among peduncles for cherry tomato identification.
While deep learning algorithms improve recognition accuracy, they face a fundamental limitation in that they are narrowly specialized rather than general-purpose. The key issues include the following: (1) limited adaptability to complex environments—many existing models do not perform well under varying conditions, making them less effective in real-world scenarios; (2) narrow specialization—most models are trained for specific fruit types and cannot generalize to unseen varieties without retraining; (3) dependence on large labeled datasets—supervised methods require extensive annotations for each new fruit category, which is costly for rare or regional crops. These problems stem from the closed-set paradigm and reliance on unimodal visual data in traditional deep learning, where models only recognize predefined classes and lack semantic understanding of fruit attributes, making them unable to adapt to dynamic agricultural needs. Therefore, developing a general-purpose fruit recognition system with enhanced semantic understanding, robust environmental adaptation, and efficient few-shot and zero-shot learning capability is of significant research importance for advancing intelligent agricultural applications.
The emergence of visual language models (VLMs) has brought new methods and ideas for addressing the challenges in automated fruit harvesting. VLMs are a type of multimodal AI system that integrates vision and natural language processing capabilities [11]. Compared with traditional deep learning models, VLMs demonstrate significant advantages in multiple aspects. First, VLMs can effectively process multimodal data by integrating visual and textual information into a unified representational space, achieving semantic alignment between visual content and linguistic descriptions. This ability enables VLMs to perform more accurate reasoning in complex visual scenes by leveraging linguistic information, thereby maintaining robust performance under varying conditions such as different lighting or occlusion. Second, VLMs exhibit strong generalization capabilities. By capturing the semantic attributes of objects, VLMs can recognize and classify unseen varieties based on their visual and linguistic properties. This ability makes VLMs adaptable to diverse tasks and scenarios, significantly reducing the need for task-specific retraining and the dependence on large labeled data, enhancing their practicality in real-world applications. Furthermore, VLMs exhibit great flexibility in input and output, making them adaptable to a variety of visual tasks.
Among the current VLMs, Contrastive Language–Image Pre-Training (CLIP) is the most widely used. CLIP [12] adopts a dual-tower structure and has made significant progress in the field of text–image fusion. Its core concept is to align image and text representations through contrastive learning using massive weakly supervised text pairs. BLIP [13] builds on CLIP and removes noisy data by optimizing the module structure to generate higher-quality text descriptions. BLIP-2 [14] freezes the pre-trained image and language models and adds a lightweight query transformer to bridge the modality gap. The LLaVa family [15] directly projects CLIP embeddings as soft prompts of LLM. The Qwen-VL family [16] connects ViT and Qwen (7B) through an adapter to convert image feature sequences into sequences that match the length of Qwen (7B) sequences for fusion. VLMs are also emerging in agriculture, with several notable applications already being explored. Cao et al. [17] proposed ITLMLP, integrating CLIP and SimCLR structures for cucumber disease recognition. Zhou et al. [18] utilized pre-trained VLMs for crop disease classification, generating descriptive texts with Qwen-VL and enhancing key text features through cross-attention and SE attention. Tan et al. [19] experimented with GPT-4 for crop recognition, nutrient deficiency, pest and disease identification, and phenotyping. Qing et al. [20] provided reliable diagnoses of plant diseases based on GPT models. However, the current applications of VLMs are predominantly focused on general object recognition and classification tasks, such as identifying common objects like vehicles, animals, and everyday items. In contrast, there is limited research specifically targeting fruit detection and recognition. This gap is notable because fruit detection and recognition present unique challenges, such as distinguishing between different varieties, identifying ripeness levels, and detecting fruits in complex natural environments. These specialized requirements have not yet been fully addressed in the existing literature and applications.
Inspired by the current work on VLMs, this study introduces and optimizes the classic CLIP visual language model to build a robust framework for fruit detection and output the pixel coordinates of detected fruit bounding boxes, providing a basis for subsequent robotic grasping actions. The optimized CLIP model leverages the strengths of the original CLIP architecture while further enhancing its zero-shot and few-shot learning capabilities.
The main contributions of this research can be summarized as follows:
(1)
Multimodal Fruit Dataset: A multimodal dataset was constructed specifically designed for fruit recognition in agricultural robotics. The dataset comprises 6770 real-world orchard images spanning 12 fruit categories, with 7 categories annotated at three maturity levels (unripe, semi-ripe, and ripe). Additionally, we generated 100 natural language queries using Qwen-7B to establish semantic alignments between visual features and textual descriptions. This dataset facilitates both fine-grained maturity detection and open-set recognition, enabling systems to adapt to dynamic picking instructions.
(2)
Natural Language Instruction Module: An input module for natural language instructions has been developed, integrating a language model parser, enabling robots to perform complex tasks based on natural language commands (through text or voice inputs). This enhances operational flexibility and user interaction. Furthermore, this language input will be combined with image data to improve the accuracy of subsequent fruit detection and recognition. The method significantly lowers the technical threshold for agricultural robots, aiding in their wider adoption and utilization in fruit picking.
(3)
Enhanced CLIP Model Architecture: An enhanced CLIP model architecture has been proposed, modifying the original CLIP framework by incorporating three key components: a YOLO detection head, an image–image contrastive learning branch, and an image–text contrastive learning branch. The YOLO detection head is used to detect fruit regions within images; the image–image contrastive learning branch focuses on identifying similarities between different fruit images, enhancing the model’s recognition capability in scenarios with scarce data and complex environmental conditions. The image-to-text contrastive learning branch contrasts structured natural language instructions with images, learning the relationship between instructions and fruit features. This not only enhances the model’s understanding of task requirements but also leverages the mapping between text and images in zero-shot scenarios, thus improving the model’s generalization ability.

2. Materials and Methods

2.1. Multimodal Dataset

2.1.1. Dataset Construction

(1) Image Dataset Construction
In this study, we constructed a small image dataset containing a variety of fruits. Specifically, as shown in Table 1, the dataset includes 7 fruit categories with ripeness labels (apple, banana, grape, strawberry, persimmon, peach, and passion fruit) and 5 fruit categories without ripeness labels (lychee, lemon, pear, tomato, and mango). The fruits were selected based on their representativeness in an orchard scene. Data collection was mainly conducted through two methods:
  • Field capture: A high-resolution camera was used to capture clear images under various occlusion conditions (slight, moderate, and heavy occlusion).
  • Open data sources: Additional images of mangoes and tomatoes were sourced from open-access repositories to enhance dataset diversity.
(2) Text Dataset Construction
To train the multimodal visual language model (VLM), we generated 100 text sentences using the Qwen-7B language model. These sentences simulate the natural language commands that humans may use to direct a picking robot, covering a variety of contexts, including indicators of fruit ripeness (e.g., changes in color and maturity). Each prompt was carefully designed to ensure semantic alignment with the image labels.
In addition, text data are also used to test the speed of different language models and the accuracy of conversion to JSON, guiding the selection of language models.

2.1.2. Dataset Preprocessing

To address the class imbalance, images from underrepresented categories were augmented, mainly through flipping, rotation, cropping, and scaling. The dataset was divided into training, validation, and test sets at a ratio of 8:1:1 for regular training and evaluation.
In addition, to evaluate the model’s ability to learn from a small number of samples, a rigorous stratified sampling strategy was employed: for each fruit category, K samples were randomly selected as the training set, where K ∈ {1, 4, 8, 16}, and the remaining samples were assigned to the test set to evaluate the model’s generalization ability.
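As an illustration of this sampling protocol, the minimal sketch below builds a K-shot split per category; the function and variable names are ours and not part of any released code.

```python
import random
from collections import defaultdict

def k_shot_split(samples, k, seed=0):
    """Stratified K-shot split: for each fruit category, draw K training
    samples at random and keep the rest for testing (illustrative helper,
    names are hypothetical)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:          # samples: list of (image_path, category)
        by_class[label].append(path)

    train, test = [], []
    for label, paths in by_class.items():
        rng.shuffle(paths)
        train += [(p, label) for p in paths[:k]]   # K support samples per category
        test += [(p, label) for p in paths[k:]]    # remaining samples for evaluation
    return train, test

# The few-shot experiments use K in {1, 4, 8, 16}:
# splits = {k: k_shot_split(all_samples, k) for k in (1, 4, 8, 16)}
```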
To ensure a comprehensive evaluation, three test sets were constructed:
(1) Basic Test Set: 50 images are randomly selected from each category and used for regular model performance evaluation.
(2) Extreme Condition Test Set: 50 images from the remaining samples are randomly selected to simulate challenging conditions. Specifically, images occluded by leaves or other fruits simulate occlusion scenarios at three levels of severity: light (10–30%), medium (30–60%), and heavy (60–90%). For each occlusion condition, OpenCV was then used to adjust the brightness and contrast of the images to simulate varying lighting conditions, specifically with parameters α = 0.7 and β = −30 to reduce image contrast and brightness (see the sketch after this list).
(3) Zero-Shot Test Set: To further evaluate the model’s generalization ability, 20 images from four fruit categories (orange, watermelon, cantaloupe, and cherry) that were not included in the training phase were selected, forming another test subset.
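The lighting degradation described in test set (2) can be reproduced with OpenCV’s linear brightness–contrast transform; a minimal sketch using the reported α = 0.7 and β = −30 is shown below (preprocessing details beyond these two parameters are assumptions).

```python
import cv2

def simulate_backlight(image_bgr, alpha=0.7, beta=-30):
    """Apply g(x, y) = alpha * f(x, y) + beta to reduce contrast and brightness,
    mimicking the backlit images of the extreme-condition test set."""
    return cv2.convertScaleAbs(image_bgr, alpha=alpha, beta=beta)

# img = cv2.imread("apple_occluded.jpg")
# dark = simulate_backlight(img)   # darker, lower-contrast version of the same scene
```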

2.2. The Overall Framework of Fruit-Picking System

Figure 1 presents the overall workflow of our fruit-picking system based on visual language models (VLMs), which comprises three core components:
(1)
Instruction Processing Module: Processes natural language commands describing picking requirements.
(2)
Visual Language Processing Module: Analyzes multimodal information, including vision and text data, through the VLM for context understanding and object recognition, then outputs the 2D bounding box coordinates $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ of detected targets.
(3)
Motion Control Module: Generates actionable motor commands by fusing prior perception results.
This study specifically concentrates on the first two modules: instruction parsing and visual perception.

2.2.1. Instruction Processing Module

Figure 2 illustrates the instruction processing workflow, starting with user instruction reception. The system accepts either text or voice commands: text inputs are directly fed into the language model, while voice inputs undergo speech-to-text conversion before entering the model. The language model then converts these instructions into a standard JSON format under the designed prompt (Appendix B) and determines whether they contain a picking task through the “action” field of the JSON. If a picking task is identified, the system activates the camera to capture scene videos. Both the textual and visual data are then transmitted to the VLM for object detection and spatial localization. If no picking task is detected, the command is routed to the motion control module to execute robotic movements.
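The snippet below sketches this routing step: a command parsed by the language model is accepted as JSON and dispatched according to its "action" field. The "action" key is named in the text; the remaining fields are illustrative, since the authoritative schema is the image-based specification in Appendix B.

```python
import json

def route_instruction(llm_output: str):
    """Dispatch a parsed command to the vision-language pipeline or the motion
    controller, depending on whether it describes a picking task."""
    cmd = json.loads(llm_output)
    if cmd.get("action") == "picking":
        # Picking task: activate the camera and pass text + frames to E-CLIP.
        return "visual_language_module", cmd
    # Any other action is routed directly to the motion control module.
    return "motion_control_module", cmd

# Example output for "Pick the ripe apples on the left":
# route_instruction('{"action": "picking", "target": "apple", "maturity": "ripe"}')
```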

2.2.2. Visual Language Processing Module

In this module, the Enhanced CLIP model processes text and visual information for object detection and localization. While the original CLIP model performs well in image–text contrastive learning, it lacks object region detection due to its generic design and reliance on text cues, and has difficulty in accurate classification in complex fruit images. To address these limitations, the Enhanced CLIP model retains the image–text contrastive learning branch in the original CLIP model, while adding a YOLOv8 detection head to provide accurate fruit region localization, and an image–image contrastive learning branch to enhance the model’s robustness and generalization capabilities.
(1)
YOLOv8 Detection Head
YOLOv8 [21] is specifically employed for fruit region detection, as shown in Figure 3, which provides precise bounding boxes for subsequent classification tasks carried out by the CLIP model. The fundamental structure of YOLOv8 is retained, which includes the Backbone for feature extraction, the Neck for feature fusion, and the Head for prediction output. However, to tailor it for our specific needs, modifications have been made in the Head. The classification component in the Head has been entirely removed. The model now solely focuses on bounding box regression, outputting the coordinates of detected fruit regions without performing any class identification. Additionally, the loss function has been redefined to exclude the classification loss, thereby emphasizing localization accuracy during training. Once pre-trained, YOLOv8 can be directly integrated into the framework for fruit region detection without the need for further training.
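To illustrate how a classification-free detection head feeds the rest of the pipeline, the sketch below uses an off-the-shelf ultralytics YOLOv8 checkpoint purely as a box proposer and crops each region for the CLIP branches; the authors’ actual head is retrained without a classification branch, which this sketch does not reproduce.

```python
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")   # generic checkpoint, used here only for illustration

def crop_fruit_regions(image_path, conf_threshold=0.25):
    """Return cropped regions R_1, ..., R_M whose boxes come from YOLOv8.
    Class predictions are ignored; only the box coordinates are used."""
    image = cv2.imread(image_path)
    results = detector(image, conf=conf_threshold)
    crops = []
    for x1, y1, x2, y2 in results[0].boxes.xyxy.int().tolist():
        crops.append(image[y1:y2, x1:x2])   # region handed to the ViT image encoder
    return crops
```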
(2)
Image Encoder
This study proposes a unified architecture based on the Vision Transformer (ViT) [22] that serves both the image–image and the image–text contrastive learning tasks. As shown in Figure 4, ViT meets the differing requirements of the two tasks through flexible feature extraction strategies:
For image–image contrastive learning tasks, ViT fully utilizes its global modeling advantages to capture the overall semantic associations and contextual dependencies of the image through the self-attention mechanism. Specifically, the image encoding process first utilizes the YOLO object detection head to perform object detection on the input image. The YOLO model outputs bounding boxes $R_1, R_2, \ldots, R_M$ for different target areas in the image and crops these target regions from the original image to obtain each target region $R_j$ (where $j = 1, 2, \ldots, M$). These target regions are subsequently fed into the ViT model for feature extraction. Each target region $R_j$ is divided into fixed-size patches, and each patch is linearly mapped into a D-dimensional vector representation $x_{j,i} \in \mathbb{R}^D$:

$$x_{j,i} = \mathrm{Proj}(\mathrm{Flatten}(p_j^i)) \in \mathbb{R}^D, \quad i = 1, 2, \ldots, N_j$$

where $N_j$ represents the number of patches in the target region $R_j$, $p_j^i$ denotes the $i$-th patch within the target region, $\mathrm{Proj}(\cdot)$ is the linear projection operation, and $\mathrm{Flatten}(\cdot)$ indicates flattening each patch into a one-dimensional vector. Then, after adding learnable positional encodings $E_{\mathrm{pos}}$ to all patch features, the input sequence is formed as follows:

$$X_j^{(0)} = [\,x_{j,1} + e_1,\; x_{j,2} + e_2,\; \ldots,\; x_{j,N_j} + e_{N_j}\,] \in \mathbb{R}^{N_j \times D}$$

where $X_j^{(0)}$ is the initial input sequence for the $j$-th target region $R_j$, consisting of the D-dimensional feature vectors $x_{j,i}$ of each image patch after the flatten and projection operations, plus learnable positional encodings $e_i$ that preserve spatial information; $x_{j,i}$ represents the feature vector of the $i$-th image patch in $R_j$, with $i = 1, 2, \ldots, N_j$, where $N_j$ denotes the total number of patches in $R_j$ and $D$ is the dimension of the projected feature space; and $\mathbb{R}^{N_j \times D}$ indicates that $X_j^{(0)}$ is a real-valued matrix of size $N_j \times D$, in which each row corresponds to an image patch’s feature vector with its positional encoding added. This input sequence is fed into the multi-layer Transformer encoder of ViT. Each layer of the Transformer encoder includes Multi-Head Self-Attention (MHSA) and Feed-Forward Network (FFN) modules:

$$Z_j^{(l)} = \mathrm{MHSA}(\mathrm{LN}(X_j^{(l-1)})) + X_j^{(l-1)}$$

$$X_j^{(l)} = \mathrm{FFN}(\mathrm{LN}(Z_j^{(l)})) + Z_j^{(l)}, \quad l = 1, 2, \ldots, L$$

where MHSA stands for the Multi-Head Self-Attention mechanism, LN is Layer Normalization, and FFN is the Feed-Forward Network. Each layer of the Transformer processes the input patch features to capture global information and contextual relationships. This architectural feature effectively supports the accurate evaluation of the overall semantic similarity between images.
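A minimal PyTorch sketch of this region encoder is given below: patches are flattened and linearly projected, learnable positional encodings are added, and a standard pre-norm Transformer encoder produces a pooled region feature. Depth, width, and the mean-pooling choice are illustrative assumptions rather than the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class RegionViTEncoder(nn.Module):
    """Sketch of the patch embedding and Transformer encoder described above:
    Proj(Flatten(patch)) + positional encoding, then MHSA + FFN blocks.
    Hyperparameters are illustrative."""

    def __init__(self, patch=16, dim=512, depth=6, heads=8, max_patches=196):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(3 * patch * patch, dim)              # Proj(Flatten(p))
        self.pos = nn.Parameter(torch.zeros(1, max_patches, dim))  # learnable E_pos
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)         # stacked MHSA + FFN

    def forward(self, region):   # region: (B, 3, H, W); H, W divisible by patch size
        B, C, H, W = region.shape
        patches = region.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * self.patch * self.patch)
        x = self.proj(patches) + self.pos[:, : patches.shape[1]]   # X_j^(0), N <= max_patches
        x = self.encoder(x)
        return x.mean(dim=1)     # pooled global feature for the region
```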
In the image–text contrastive learning task, ViT achieves local detail preservation by adjusting the feature extraction strategy. In addition to the conventional global features, the model also outputs fine-grained feature maps. To obtain local features at various scales, we set up windows of different sizes to perform localized computations of the self-attention mechanism. These computations are followed by multi-scale feature fusion. Incorporating Equations (2) and (3), we obtain the localized attention values $Z_{j,\mathrm{scale}_i}^{(l)}$ for windows of different sizes. Then, we fuse these different $Z_{j,\mathrm{scale}_i}^{(l)}$ into a unified $Z_j^{(l)}$. This fused representation is subsequently fed into a feed-forward neural network. This feature extraction mechanism with different windows enables the visual representation to maintain fine-grained alignment with the text description.
(3)
Text Encoder
This study adopts the standard Transformer architecture [23] as the text encoder, whose core function is to encode natural language instructions (e.g., “a round red apple”) into 512-dimensional semantic feature vectors $f_{\text{text}}(T)$ that provide a fine-grained semantic representation. The text encoder consists of the following key modules:
Positional Encoding
Since Transformer itself does not have sequence order information, we inject absolute position information through positional encoding. The encoding formula is:
$$E(x_i) = x_i + W_p \cdot \mathrm{PE}(i)$$

where $x_i$ is the $i$-th word embedding vector in the input sequence, $W_p$ is the learnable position encoding matrix, and $\mathrm{PE}(i)$ uses a combination of sine and cosine functions [23] to ensure that the model captures the relative positional relationships of the sequence.
Multi-Head Self-Attention (MHSA)
After incorporating positional encoding, the model processes the input sequence using Multi-Head Self-Attention. This mechanism allows the model to focus on different parts of the input sequence in parallel, capturing long-range dependencies.
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

where

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$

where $Q$, $K$, and $V$ denote the Query, Key, and Value matrices, respectively, generated from the input sequence by linear transformations; $d_k$ is the dimension of the key vectors (usually set to $d_{\text{model}}/h$, with $d_{\text{model}} = 512$); and $h$ is the number of attention heads.
Feed-Forward Network (FFN)
The output of the Multi-Head Self-Attention mechanism is then passed through a Feed-Forward Network (FFN) to further refine the semantic representation of the text. This module applies a nonlinear transformation to the input features:
$$\mathrm{FFN}(x) = \max(0,\, W_1 x + b_1)\, W_2 + b_2$$

where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ are the learnable weight matrices, $d_{\text{ff}} = 2048$ is the hidden layer dimension, and $\max(0, \cdot)$ denotes the ReLU activation function. These components work together to generate a rich semantic representation of the input text, which is crucial for aligning text with image features.
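The text branch can be sketched in the same style: token embeddings plus sinusoidal positional encodings feed a stack of MHSA + FFN layers, and the output is pooled into a 512-dimensional vector $f_{\text{text}}(T)$. Note that this sketch uses a fixed sine–cosine table, whereas the paper additionally applies a learnable matrix $W_p$; vocabulary size, depth, and the pooling choice are assumptions.

```python
import math
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Minimal text branch: embeddings + sinusoidal positions + Transformer
    layers (MHSA and FFN), mean-pooled into a 512-d sentence feature."""

    def __init__(self, vocab=49408, dim=512, heads=8, ff=2048, depth=6, max_len=77):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        pe = torch.zeros(max_len, dim)                       # fixed sin/cos table PE(i)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe.unsqueeze(0))
        layer = nn.TransformerEncoderLayer(dim, heads, ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, token_ids):                            # (B, L) integer tokens
        x = self.embed(token_ids) + self.pe[:, : token_ids.shape[1]]
        x = self.encoder(x)
        return x.mean(dim=1)                                 # f_text(T), shape (B, 512)
```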
(4)
Loss Function
The model is optimized using a combined loss function that integrates both image–image and image–text contrastive learning objectives. Specifically, for the image–image contrastive learning task, the loss function is defined as:
$$\mathcal{L}_{\text{img-img}} = -\log \frac{\exp\!\left( f_{\text{img}}(I_q) \cdot f_{\text{img}}(I_s) / \tau \right)}{\sum_{j=1}^{M} \exp\!\left( f_{\text{img}}(I_q) \cdot f_{\text{img}}(I_j) / \tau \right)}$$

where $f_{\text{img}}(I_q)$ and $f_{\text{img}}(I_s)$ are the global feature vectors extracted by ViT for the query and support images, respectively, $\tau$ is the temperature parameter, and $M$ is the number of negative samples. This loss function ensures that the feature vectors of similar images have a higher cosine similarity, while those of dissimilar images have a lower similarity. For the image–text contrastive learning task, the loss function is defined as:

$$\mathcal{L}_{\text{img-text}} = -\log \frac{\exp\!\left( f_{\text{img}}(I) \cdot f_{\text{text}}(T) / \tau \right)}{\sum_{j=1}^{M} \exp\!\left( f_{\text{img}}(I) \cdot f_{\text{text}}(T_j) / \tau \right)}$$

where $f_{\text{img}}(I)$ is the local feature vector extracted by ViT for the image, and $f_{\text{text}}(T)$ is the semantic feature vector extracted by the text encoder; $\tau$ is the temperature parameter, and $M$ is the number of negative samples. This loss function ensures that the image features and the correct text description have a higher cosine similarity, while those with incorrect descriptions have a lower similarity. The total loss function is a weighted sum of the two contrastive losses:

$$\mathcal{L}_{\text{total}} = \alpha\, \mathcal{L}_{\text{img-img}} + (1 - \alpha)\, \mathcal{L}_{\text{img-text}}$$

where $\alpha$ is a hyperparameter that balances the contributions of the two losses.
This combined loss function allows the model to be optimized for both global semantic understanding and fine-grained text–image alignment, leveraging the strengths of both images and text in their respective tasks.
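A compact PyTorch version of the two contrastive objectives and their weighted combination is sketched below. It assumes one positive followed by M negatives per anchor and L2-normalized features so that dot products act as cosine similarities; batching and negative-sampling details are simplifications.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, candidates, temperature=0.07):
    """One-directional InfoNCE: `candidates` holds the matching sample first,
    followed by M negatives. Features are L2-normalized so the dot product
    acts as a cosine similarity."""
    anchor = F.normalize(anchor, dim=-1)          # (D,)
    candidates = F.normalize(candidates, dim=-1)  # (1 + M, D)
    logits = candidates @ anchor / temperature    # (1 + M,)
    return -torch.log_softmax(logits, dim=0)[0]   # -log p(positive | all candidates)

def e_clip_loss(img_q, img_s, img_negs, img_local, txt_pos, txt_negs, alpha=0.6):
    """Weighted combination alpha * L_img-img + (1 - alpha) * L_img-text,
    with alpha = 0.6 as in the reported best setting."""
    l_ii = info_nce(img_q, torch.cat([img_s.unsqueeze(0), img_negs]))
    l_it = info_nce(img_local, torch.cat([txt_pos.unsqueeze(0), txt_negs]))
    return alpha * l_ii + (1 - alpha) * l_it
```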
(5)
Fruit Detection Output
The output of the visual language processing module is structured as JSON (Appendix A) data, providing detailed information for downstream robotic execution tasks. The output includes class level, confidence score, bounding box, and centroid position.
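An illustrative instance of this output is shown below. The four pieces of information (class level, confidence score, bounding box, and centroid position) follow the text; the exact key names and nesting are assumptions, since the normative format is the image in Appendix A.

```json
{
  "detections": [
    {
      "class": "apple",
      "maturity": "ripe",
      "confidence": 0.93,
      "bbox": [412, 228, 506, 331],
      "centroid": [459, 279]
    }
  ]
}
```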

2.3. Experimental Setup

2.3.1. Implementation Details

Table 2 and  Table 3 describe the hardware and software environment of the experiment.

2.3.2. Hyperparameter Settings

The framework integrates a YOLOv8 detection head with CLIP and ViT for joint detection and classification. YOLOv8 is trained for 50 epochs with a batch size of 32 and a learning rate of 0.01 using SGD and 8 data-loading workers. CLIP and ViT are both trained for 300 epochs using the Adam optimizer, with learning rates of $1 \times 10^{-5}$ and $1 \times 10^{-4}$ and batch sizes of 16 and 8, respectively; both use 4 workers. The contrastive learning weight α is set to 0.6.
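For reference, the schedule above can be collected into a single configuration object; the numeric values follow the text, while the structure and key names are purely illustrative.

```python
# Illustrative consolidation of the training schedule; only the values are from the paper.
TRAIN_CONFIG = {
    "yolov8_head": {"epochs": 50, "batch_size": 32, "lr": 1e-2,
                    "optimizer": "SGD", "workers": 8},
    "clip_branch": {"epochs": 300, "batch_size": 16, "lr": 1e-5,
                    "optimizer": "Adam", "workers": 4},
    "vit_branch": {"epochs": 300, "batch_size": 8, "lr": 1e-4,
                   "optimizer": "Adam", "workers": 4},
    "loss": {"alpha": 0.6},   # weight of the image-image contrastive term
}
```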

2.3.3. Evaluation Metrics

The performance of the vision language model was rigorously evaluated using a comprehensive suite of standard metrics, encompassing accuracy, precision, recall, F1 score, average precision (AP), mean average precision (mAP), and intersection over union (IoU). Additionally, computational efficiency was assessed through GFLOPs, parameters, and frames per second (FPS) to measure computational complexity, parameter count, and processing speed, respectively.
(1)
Precision (P), reflecting the proportion of true positive samples among all detected positives, is defined as:
$$P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
where TP denotes correctly identified positive samples, and FP represents false positives.
(2)
Recall (R), which quantifies the fraction of actual positives accurately predicted by the model, is given by:
$$R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$

where FN denotes positive samples incorrectly classified as negative.
(3)
The F1 score, calculated as the harmonic mean of precision and recall, provides a balanced assessment:
$$F1 = \frac{2 \times P \times R}{P + R}$$
(4)
Average precision (AP) measures localization accuracy by determining the area under the precision–recall curve. Mean average precision (mAP), an aggregate measure across all classes, reflects overall detection performance:
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i$$
where N is the total number of classes.
(5)
Intersection over union (IoU), indicating spatial overlap between predicted and ground-truth bounding boxes, is expressed as:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ and $B$ denote the regions covered by the predicted and ground-truth bounding boxes, respectively.
(6)
GFLOPs, measuring computational complexity, are calculated as follows:
$$\mathrm{GFLOPs} = \frac{1}{10^{9}} \times K \times K \times C_{\text{in}} \times C_{\text{out}} \times H \times W$$

where $H$ and $W$ are the height and width of the output feature map, respectively; $K$, $C_{\text{in}}$, and $C_{\text{out}}$ are defined below.
(7)
Parameters quantify the total trainable parameters within the model, particularly convolutional layers, via the following:
$$\mathrm{Parameters} = K \times K \times C_{\text{in}} \times C_{\text{out}}$$

where $K$ is the convolutional kernel size, $C_{\text{in}}$ is the number of input channels, and $C_{\text{out}}$ is the number of output channels.
(8)
Finally, FPS, representing inference speed, is given by the following:
$$\mathrm{FPS} = \frac{N}{T_{\text{total}}}$$

where $N$ is the total number of processed frames and $T_{\text{total}}$ is the total inference time in seconds.
Detection performance was evaluated with AP metrics at varying IoU thresholds, providing insights into the model’s detection capabilities. Computational efficiency metrics ensured a holistic evaluation of the model’s effectiveness and efficiency.
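The detection metrics above reduce to a few lines of code; the sketch below implements precision, recall, F1, and IoU exactly as defined, with FPS noted as a comment (the per-threshold counting of TP/FP/FN needed for AP is omitted).

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean, as defined above."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# FPS: total processed frames divided by total inference time in seconds.
# fps = n_frames / total_seconds
```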

3. Results and Discussion

3.1. Language Model Comparison

To select the most appropriate language model, the speed and accuracy of five models under various quantization methods (e.g., BF16, GPTQ-Int4, AWQ, and FP8) were evaluated: Phi-2, Qwen2-7B, TinyLlava, GPT-4, and Deepseek. We recorded the average time to convert 100 commands (Appendix C) to JSON and manually evaluated the parsing accuracy of the converted commands. The results are shown in Table 4. As can be seen, Phi-2 is slightly less accurate than the other models but has a clear speed advantage, making it the preferred choice for real-time command parsing and execution in the fruit-picking task.

3.2. Visual Language Model Evaluation

3.2.1. Model Performance Comparison

The performance of the visual language model E-CLIP is compared with several current classic models, including single-stage object detection models (such as the YOLO series and RT-DETR), multimodal large models (such as GLIP, RegionCLIP, and MM-Grounding-DINO), and Transformer-based visual detection models (such as ViTDet). The results are listed in Table 5. The model shows very competitive performance, with a precision of 0.778, a recall of 0.728, and an mAP@0.5 of 0.791. This performance is attributed to the joint optimization of image–image and image–text contrastive learning, which improves the precision of detection and classification. Figure 5 shows the AP performance of the model for different types of fruits. Many fruits have an AP value above 0.85; owing to the limited amount of training data, the AP value for green apples is lower. Figure 6 shows the detection results for some fruits.
Compared with single-stage detection models, E-CLIP introduces a joint image–text contrastive learning mechanism that compensates for the limited semantic understanding of traditional detection architectures, improving its ability to distinguish complex targets from the background. Compared with multimodal large models, E-CLIP’s image-similarity alignment effectively avoids the ambiguity that arises when whole-image information is aligned with regional information. Compared with Transformer-based visual detection models, E-CLIP introduces an image–text enhancement module that captures visual–semantic associations while retaining the global modeling ability of the Transformer. As a result, it achieves significant improvements in metrics such as mAP@0.5 and mAP@[0.5:0.95], demonstrating strong learning ability in complex orchard scenes.
RegionCLIP’s mAP@0.5 is only 0.084, which is far below the other models. The main reason is that its RPN generates a large number of invalid proposals in the candidate-box generation stage, making it difficult to distinguish foreground from background; this leads to inaccurate region–text alignment and degrades the overall detection performance.
E-CLIP has its own limitations, mainly that its performance depends on the detection accuracy of the detection head. The detection head currently used is YOLOv8; if YOLOv8 misses a target, the final performance of the model is affected.

3.2.2. Robustness Assessment

As shown in Table 6, the proposed E-CLIP model shows comprehensive advantages across six occlusion–illumination coupling scenarios. Specifically, compared with the traditional YOLO series, the mAP@0.5 of our model is 0.106 higher than that of the standalone YOLOv8 under normal lighting and slight occlusion (0.832 vs. 0.726), 0.026 higher under backlight conditions (0.695 vs. 0.669), and 0.046 higher under severe occlusion with backlight (0.551 vs. 0.505). At the same time, mAP@[0.5:0.95] improves by 0.067 (0.462 vs. 0.395) and the F1 score by 0.042 (0.543 vs. 0.501). These results show that E-CLIP is more robust than pure vision models. This is because E-CLIP incorporates text information, which provides high-level semantic guidance and steers the model toward key semantic attributes rather than local pixels, compensating for the limitations of pure vision models in complex scenes. Figure 7 shows the mAP@0.5 values of each model under different occlusion and lighting conditions.
Compared with the multimodal models, in the slight occlusion–normal illumination scenario, the mAP@0.5 of our model is 0.304 higher than GLIP (0.832 vs. 0.528), 0.741 higher than RegionCLIP (0.832 vs. 0.091), and 0.364 higher than MM-Grounding-DINO (0.832 vs. 0.468); in the severe occlusion–backlight scenario, the F1 score is 0.223 higher than GLIP (0.543 vs. 0.320) and 0.115 higher than MM-Grounding-DINO (0.543 vs. 0.428). These results indicate that the proposed multimodal model outperforms CLIP-based models that use only image–text contrastive learning. This is because E-CLIP adds an image–image contrastive learning branch, which further strengthens the model’s invariance to illumination and occlusion, allowing it to ignore irrelevant variables and focus on more essential image features (such as texture and shape), thereby reducing the impact of external interference factors such as illumination and occlusion.
In summary, the added image–image contrastive learning and the retained image–text contrastive learning enable the model to learn more discriminative and robust feature representations. Under different occlusion and illumination conditions, these contrastive objectives help the model grasp the key features of the fruit. In addition, the image encoders in both branches use ViT, which captures the global information of the image through the self-attention mechanism rather than only local features. Even when the target is occluded or affected by lighting, the model can identify it accurately through contextual information. Therefore, ViT also contributes to the model’s feature extraction ability and invariance learning.

3.2.3. Generalization Assessment

In the few-shot learning scenario, the model is trained with a limited number of samples from each fruit category and then evaluated on a separate test set containing these categories. The results (Table 7) show that the model reaches an mAP@0.5 of 0.646 with 16 samples, only 0.145 below the 0.791 obtained on the full test set. Compared with other models, our model achieves higher F1 scores and mAP@0.5 values at all sample sizes (Figure 8). In the zero-shot learning scenario, the model is tested on new fruit categories without using any training samples from these categories; this setting evaluates the model’s ability to recognize new categories based only on pre-trained knowledge. Table 8 shows that the model achieves an average mAP@0.5 of 0.626 on the new categories, demonstrating its ability to generalize to unseen categories. Figure 9 shows the detection results for selected test samples in the zero-shot scenario.
The generalization ability of the model can be attributed to its multimodal learning architecture. Traditional image–text contrastive pre-training methods mainly match whole images with text. Although they have good semantic understanding capabilities, they are easily disturbed by irrelevant background information in the image, which introduces noise, especially in region-level recognition tasks. Our method instead focuses on the target region itself by feeding the local region features extracted by the detection head into CLIP for image–text alignment. However, since these regions are not specifically trained on during pre-training, relying solely on image–text contrast is prone to inaccurate matching. Therefore, image–image contrastive learning is introduced. By constructing contrastive constraints between augmented image pairs, the model can capture the similarities and differences between regions more finely from a visual perspective, making up for the shortcomings of pure image–text contrast in object-level understanding. Especially when samples are scarce, image–image contrast strengthens the model’s structural understanding of similar targets, so that accurate classification can be achieved even with very few or no training samples. This mechanism not only enhances the model’s discriminative ability but also further improves its generalization to unseen categories.

3.2.4. Ablation Studies

To assess the contributions of each module and component in E-CLIP, a series of ablation studies was conducted and the results are shown in Table 9.
(1) Effect of image–image contrastive learning module
The removal of the image–image contrastive learning module resulted in significant performance degradation in both localization and classification. Specifically, mAP@[0.5:0.95] decreased from 0.652 to 0.513, while mAP@0.5 declined from 0.791 to 0.667. This underscores the critical role of ViT in capturing global semantic context, particularly for small and multi-scale objects. Classification metrics also deteriorated, with the F1 score dropping from 0.752 to 0.686, indicating reduced semantic consistency in class predictions. These results highlight the module’s importance in enhancing robustness for complex detection scenarios.
(2) Effect of image–text contrastive learning module
Similarly, disabling the image–text contrastive learning module led to notable performance declines. The mAP@[0.5:0.95] decreased from 0.652 to 0.521 and mAP@0.5 fell from 0.791 to 0.683, reflecting weakened fine-grained localization capabilities. Classification performance suffered more severely, with the F1 score decreasing to 0.695. This demonstrates the module’s essential role in aligning local visual features with textual semantics, which is pivotal for class-discriminative detection.
(3) Impact of the combined loss function’s weight parameter α
The balance between the two modules was further analyzed by tuning the weight parameter $\alpha$ in the combined loss function $\mathcal{L}_{\text{total}} = \alpha\, \mathcal{L}_{\text{img-img}} + (1 - \alpha)\, \mathcal{L}_{\text{img-text}}$, as shown in Table 10 and Figure 10. Optimal performance was achieved at $\alpha = 0.6$, yielding an mAP@0.5 of 0.791 and an F1 score of 0.752. Deviating from this balance, e.g., $\alpha = 0.4$ (overemphasizing text alignment) or $\alpha = 0.8$ (prioritizing image contrast), resulted in suboptimal performance, with mAP@0.5 declining to 0.614 and 0.532, respectively. This emphasizes the necessity of harmonizing global and local feature learning to maximize cross-modal synergy.

3.2.5. Computational Efficiency

The experimental results in Table 11 demonstrate that our model has excellent computational efficiency compared with state-of-the-art methods. Notably, our model achieves an inference speed of 54.82 FPS, outperforming most existing models, including ViTDet (26.45 FPS), RT-DETR (28.79 FPS), RegionCLIP (14.79 FPS), and MM-Grounding-DINO (4.76 FPS). Although lower than the FPS of YOLOv8 (272.27), our framework significantly reduces computational complexity while maintaining competitive real-time performance. Specifically, our model requires only 98.61 GFLOPs and 86.42M parameters, a significant drop compared with other multimodal models. This lightweight architecture outperforms computationally intensive models such as GLIP (415.17 GFLOPs) and RegionCLIP (518.52 GFLOPs). Our model balances parameter count and speed. These characteristics enable deployment on resource-constrained edge devices while maintaining real-time responsiveness, which is critical for agricultural robotics applications.

3.3. Limitation and Future Work

Although the visual language model proposed in this study shows significant performance in fruit detection and automated picking tasks, it still has certain limitations.
(1) Imbalanced data distribution: The dataset constructed in this study contains 12 different fruit categories. However, there is a significant imbalance in the distribution of these categories. For example, the number of mango instances far exceeds that of less common fruits such as lychee and lemon. This imbalance may have a negative impact on the model’s detection accuracy for underrepresented categories.
(2) Less robust under extreme occlusion conditions: Although the model has good robustness under moderate occlusion and illumination changes, its detection performance still needs to be improved under extreme occlusion or complex mixed illumination environments. The overall framework relies on YOLOv8 as the detection head. Although it performs well in general object detection tasks, it may still miss objects in extreme occlusion scenarios, which will affect the model results.
(3) Poor generalization to non-circular objects: Although the model shows some generalization ability in zero-shot and few-shot scenarios, its performance drops significantly when detecting new categories with large morphological differences (such as carrots). This indicates that there are challenges in handling extreme variations in object morphology.
(4) The speed is still behind the traditional YOLO: Although this method is faster than traditional multimodal models, its speed still lags behind YOLO, which poses a challenge for real-time applications that require high-speed processing.
Future research can further optimize and expand this framework from multiple directions.
(1) Expand the diversity and balance of the dataset. Including more fruit varieties and adding more fine-grained maturity labels (especially labels from complex agricultural scenes) will improve the versatility and accuracy of the model.
(2) The model’s performance under extreme occlusion and complex lighting can be improved in several ways. For example, attention mechanisms or occlusion-aware loss functions can be introduced to enhance the model’s perception of occluded targets, and adaptive normalization can be used to extract more stable features and reduce the impact of lighting changes. In addition, a detection head better suited to complex scenes, or a candidate-region completion mechanism, can be introduced to improve overall robustness and detection accuracy.
(3) Fusion of depth information and multispectral images can enhance the perception ability of the model. Combining it with an adaptive feature extraction mechanism will help improve the detection accuracy of objects with significant morphological differences.
(4) Explore technologies such as model distillation and pruning. These methods can reduce the computational load while maintaining or even improving model performance, making it more suitable for real-time applications on resource-constrained devices.

4. Conclusions

In summary, this study aims to enhance the robustness and generalizability of current fruit detection and recognition models. To achieve this, the following work has been completed:
First, to effectively train the proposed multimodal model E-CLIP, a multimodal dataset was created. It includes 6770 fruit images across 12 categories, 7 of which are annotated at three maturity levels (21 maturity-level classes in total), covering a variety of lighting conditions, together with 100 carefully designed natural language queries. This multimodal dataset provides a comprehensive and diverse foundation for training and evaluating the model, supporting its ability to adapt to dynamic agricultural environments and new fruit varieties.
Second, the speech recognition module with the language model is integrated, enabling the robot to understand and execute natural language commands, greatly simplifying the operation process and lowering the threshold for use. In the experiment, a variety of language models were compared, including Phi-2, Qwen2-7B, TinyLlava, GPT-4, and Deepseek. The experimental results show that the Phi-2 model has the best balance between speed and accuracy and has efficient real-time performance. Therefore, it is most suitable for the command processing of agricultural harvesting robots.
Third, and most importantly, to address the insufficient robustness and generalization of traditional fruit detection and classification models, an enhanced CLIP (E-CLIP) model is proposed. This model includes three branches: the YOLO detection head, the image–image contrastive learning branch, and the image–text contrastive learning branch. This integrated structure makes up for the shortcomings of pure vision models and further enhances the learning ability of the multimodal model. Experimental results show that the mAP@0.5 and F1 score of the model reach 0.791 and 0.752, significantly higher than the multimodal baseline models and also improved over the traditional YOLO. The model also exhibits robustness under occlusion and illumination changes; it maintains high accuracy even under severe occlusion and backlighting, showing its practical application potential in dynamic orchard environments. Furthermore, the model demonstrates strong generalization in few-shot learning (e.g., 1, 4, 8, and 16 samples per category) and zero-shot learning scenarios, achieving an mAP@0.5 of 0.626 for new fruit categories in zero-shot learning and demonstrating its adaptability to new categories without direct supervision.
These enhancements not only boost the efficiency and accuracy of automatic fruit picking but also pave the way for advancements of general-purpose picking robots in smart agriculture.   

Author Contributions

Writing—original draft, data collection and pre-processing, methodology, and experiment implementation and testing: Y.Z.; Writing—review and editing, methodology guidance, innovation design: H.P.; Project administration and funding acquisition and Writing—review and editing and polishing: P.S.; Review and Editing and Writing—polishing: R.Z.; Partial image dataset collection and JSON specification: Y.S., C.T., Z.L. (Zhenqing Liu) and Z.L. (Zhengda Li); Supervision: H.P. and P.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development Program of Hubei Province grant number 2024BBB053.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors would like to thank Wenfu Huang for his assistance in figure beautification, and Miao Yang for his contributions to the experiments on the MM-Grounding-DINO model.

Conflicts of Interest

Author Zhengda Li was employed by the company Wuhan X-Agriculture Intelligent Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. The JSON Output Format

[The JSON output format is provided as an image in the original article.]

Appendix B. Language Model JSON Specification

Appendix B.1. Start Harvesting Task

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.2. End Harvesting Task

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.3. Pause Work

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.4. Resume Work

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.5. Robot Voice-Controlled Movement

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.6. Speed Adjustment Control

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.7. Reset Arm Position

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix B.8. Self-Charging

[The prompt and example JSON for this task are provided as images in the original article.]

Appendix C. Command Extraction and Entity Recognition Results

Taking apple picking as an example, Table A1 shows the actions and entities extracted by the language model from the text dataset we built.
Table A1. Command extraction and entity recognition results for apple picking activities.
ID | Original Command | Action | Entity
1 | For semi-ripe apples, use special tools for picking. | Picking | Semi-ripe apples
2 | Start from the edge of the orchard and gradually move inward to pick ripe apples. | Picking | Ripe apples
3 | After rain, check and pick brighter apples. | Checking, Picking | Brighter apples
4 | Use sensors to detect sugar content and decide when to pick unripe apples. | Deciding Picking Time | Unripe apples
5 | Pick those that are close to full ripeness but have not yet fallen. | Picking | Nearly fully ripe apples
6 | To ensure quality, prioritize picking apples with smooth and undamaged surfaces. | Prioritize Picking | Smooth and undamaged apples
7 | Create a picking route map in the orchard to plan the picking of different types of apples. | Planning Picking | Different types of apples
8 | Use drones to assist in locating hard-to-reach apples. | Assisting Location | Apples
9 | Utilize AI vision systems to analyze the optimal picking time for apples. | Analyzing Picking | Apples
10 | Record the location of each apple during picking for subsequent management. | Recording Location | Apples
11 | To protect the environment, pick wild apples without disrupting the ecosystem. | Picking | Wild apples
12 | By comparing the characteristics of different varieties, learn how to more accurately pick apples. | Learning Picking | Apples
13 | Combine weather forecast information to schedule picking activities ahead of time to ensure the quality of apples. | Scheduling Picking | Apples
14 | Improve picking techniques using machine learning to increase efficiency while reducing damage to apples. | Improving Picking Techniques | Techniques
15 | At night, use infrared cameras to help pick apples that ripen at night. | Helping Picking | Nightly ripe apples
16 | Develop specialized applications to guide workers on correctly picking various types of apples. | Guiding Picking | Various types of apples
17 | Train pickers to understand the best picking methods for each type of apple. | Training Picking | Picking methods
18 | For special events, carefully select and pick the highest quality apples. | Carefully Selecting, Picking | Highest quality apples
19 | Through community cooperation, jointly participate in the local orchard’s apple picking work. | Participating Picking | Apples
20 | During the harvest season, organize volunteers to pick large quantities of ripe apples together. | Organizing Picking | Ripe apples
21 | Pick wild apples from bushes near the ground. | Picking | Wild apples
22 | Find the highest branches and pick the largest apples there. | Picking | Largest apples
23 | When detecting specific colors, pick high-hanging apples. | Picking | Specific color apples
24 | On dewy mornings, pick fresh apples before the dew dries. | Picking | Fresh apples
25 | Carefully distinguish and only pick fully ripe apples. | Picking | Fully ripe apples
26 | Avoid damaging surrounding leaves and precisely pick the hard apples. | Precisely Picking | Hard apples
27 | Find and pick hidden apples among dense foliage. | Picking | Hidden apples
28 | Use image recognition technology to assist in picking rare apples. | Assisting Picking | Rare apples
29 | Adjust strategies according to seasonal changes to pick suitable apples. | Adjusting Picking | Apples
30 | Optimize routes efficiently to pick multiple apples using smart algorithms. | Optimizing Path, Picking | Apples
31 | Carefully pick ripe apples from fruit trees. | Carefully Picking | Ripe apples
32 | Instruct robots to go to the orchard to select and pick fresh apples. | Selecting, Picking | Fresh apples
33 | Use robotic arms to accurately pick brightly colored apples. | Accurately Picking | Brightly colored apples
34 | In greenhouses, search for and pick ripe clusters of apples. | Searching, Picking | Ripe clusters of apples
35 | Detect and pick small apples hanging on lower branches. | Detecting, Picking | Small apples
36 | Identify and pick apples whose color has changed from green to red. | Identifying, Picking | Green to red apples
37 | Confirm the fruit is ripe and then start picking round apples. | Confirming, Picking | Round apples
38 | Select and pick apples based on size and color. | Selecting, Picking | Apples
39 | Remove apples that have changed color from the tree. | Removing | Changed color apples
40 | Safely pick soft apples from vines or branches. | Safely Picking | Soft apples
41 | Use drones to take aerial photos to help locate apples that need picking. | Helping Locating | Apples
42 | Robots selectively pick apples based on preset maturity criteria. | Selectively Picking | Apples
43 | Set up an automatic navigation system in the orchard to assist in picking apples. | Assisting Picking | Apples
44 | Develop new picking algorithms to adapt to different sizes and shapes of apples. | Developing Algorithms | Apples
45 | Apply augmented reality (AR) technology to guide pickers to find the best locations for apples. | Guiding Picking | Apples
46 | Conduct a comprehensive scan of all apples in a specific area before starting to pick. | Scanning, Picking | Apples
47 | Use laser rangefinders to determine the exact position of each apple. | Determining Position | Apples
48 | Integrate environmental sensors to monitor weather conditions, optimizing the timing for picking unripe apples. | Monitoring, Optimizing | Unripe apples
49 | Employ wearable devices like smart glasses to assist workers in efficient picking of apples. | Assisting Picking | Apples
50 | Use machine vision recognition technology to accurately locate apples against complex backgrounds. | Locating | Apples
51 | Determine the optimal picking time by detecting color changes in fruits to start picking apples. | Detecting, Picking | Apples
52 | Instruct robots to search and only pick apples that meet specific maturity standards. | Searching, Picking | Maturity standard apples
53 | Assess the hardness of the fruit with sensors to determine the picking time, ensuring apples are at their best maturity. | Assessing, Picking | Apples
54 | When detected sugar content reaches peak, instruct robots to pick ripe apples. | Instructing, Picking | Ripe apples
55 | Use image recognition technology to analyze surface features of the fruit, selecting high-maturity apples. | Analyzing, Picking | High-maturity apples
56 | Identify and prioritize picking fully ripe apples. | Identifying, Prioritizing Picking | Fully ripe apples
57 | Screen and pick apples that meet predefined maturity parameters. | Screening, Picking | Apples
58 | Confirm that the size and color of the fruit meet maturity requirements before initiating the picking program for apples. | Confirming, Initiating | Apples
59 | Predict the optimal maturity of each type of apple using machine learning algorithms to schedule picking times. | Predicting, Scheduling | Apples
60 | Regularly monitor environmental conditions (such as temperature, humidity) to optimize the picking plan and ensure the maturity of apples. | Monitoring, Optimizing | Apples
61 | Based on fruit growth cycle data, intelligently determine when to pick apples. | Determining, Picking | Apples
62 | Perform a quick check before picking to ensure all selected apples have reached the expected maturity. | Checking, Ensuring | Apples
63 | Use infrared imaging technology to assist in judging the internal maturity of apples, guiding precise picking. | Judging, Guiding | Apples
64 | Combine weather forecast information to plan picking activities in advance, ensuring apples are picked at optimal maturity. | Planning, Ensuring | Apples
65 | Use automated systems to monitor the development progress of each apple, determining specific picking dates. | Monitoring, Determining | Apples
66 | Dynamically adjust picking strategies during the process to accommodate different types of apples based on real-time maturity analysis. | Analyzing, Adjusting | Apples
67 | Develop dedicated software to help growers identify which apples have reached ideal maturity, ready for picking. | Developing Software, Helping | Apples
68 | Collect data through wireless sensor networks, analyzing and predicting the maturity trends of apples on each tree. | Collecting, Analyzing, Predicting | Apples
69 | Use AI models to simulate the maturation process of apples under different environmental conditions, providing scientific picking suggestions. | Simulating, Providing | Apples
70 | Equip multi-spectral cameras for precise assessment of apple maturity, executing picking tasks accordingly. | Assessing, Executing | Apples
71 | Differentiate between semi-ripe and fully ripe apples, selecting appropriate picking targets as needed. | Differentiating, Selecting | Apples
72 | Accurately predict the maturity of apples by referencing historical data and current environmental conditions, scheduling picking accordingly. | Predicting, Scheduling | Apples
73 | Use machine vision systems to perform 3D scans of fruits to assess their shape and maturity, deciding whether to pick apples. | Scanning, Assessing, Deciding | Apples
74 | Leverage deep-learning-based algorithms enabling robots to efficiently recognize and pick mature apples in complex backgrounds. | Recognizing, Picking | Apples
75 | Before picking, conduct gentle touch tests to verify if apples are sufficiently ripe. | Testing, Verifying | Apples
76 | Set multiple maturity thresholds to allow robots to flexibly handle the picking needs of different types of apples. | Setting, Handling | Apples
77 | Judge maturity based on the aroma release pattern of the fruit, selectively picking suitable apples. | Judging, Picking | Apples
78 | Continuously update information about the maturity of apples during the picking process, enhancing picking efficiency and quality. | Updating, Enhancing | Apples
79Combine various non-invasive detection methods, such as sound waves and spectral analysis, to determine the maturity of apples.Combining, DeterminingApples
80Robots remotely control picking actions based on cloud platform data analysis capabilities, ensuring only correctly matured apples are picked.Controlling, EnsuringApples
81Receive real-time feedback through mobile applications, adjusting the robot’s maturity evaluation standards for apples.Receiving, AdjustingApples
82Recognize prematurely or delayed-ripening apples due to weather reasons, adjusting picking strategies Accuracyordingly.Recognizing, AdjustingApples
83Customize maturity detection schemes based on the characteristics of different varieties of apples, improving picking Accuracy.Customizing, ImprovingApples
84Utilize blockchain to record the entire process from sowing to picking, ensuring transparency and traceability of maturity information for each apple.Recording, EnsuringApples
85Flexibly adjust picking operations for apples based on user-set maturity preferences.AdjustingApples
86Before picking, confirm the final maturity of apples, avoiding premature or late picking.Confirming, AvoidingApples
87Simulate natural pollination processes to promote fruit development, ensuring apples reach ideal maturity at picking time.Promoting, EnsuringApples
88Use virtual reality (VR) training modules to familiarize operators with how to judge the maturity of different kinds of apples.Training, JudgingApples
89Instantly evaluate and select the ripest apples during the picking process.Evaluating, SelectingApples
90See those very bright-colored apples? It’s time to pick them.PickingBright-colored apples
91This morning I noticed some apples started changing color; please help me pick these ripe fruits.PickingRipe fruits
92Can you gently pick those just perfectly ripe apples in the orchard while the dew is still on?PickingPerfectly ripe apples
93When you lightly touch those hanging apples on the tree, if they feel soft but not mushy, please pick them.PickingSoft but not mushy apples
94Look at those darker-colored apples at the top of the tree; please help me pick them.PickingDarker-colored apples
95Accuracyording to the weather forecast, these days are perfect for picking; please prepare in advance and do not miss any batch of ripe apples.Preparing, PickingRipe apples
96The greenhouse environment is well controlled; robot, you can check and pick those apples that have ripened ideally.Checking, PickingRipe apples
97Before picking, observe the color and size of each apple; choose only the perfect ones.Observing, ChoosingPerfect apples
98Through touch and visual inspection, determine which apples have reached ideal maturity, then carefully pick them.Determining, PickingApples
99As soon as you smell the sweet aroma in the air, it’s time to go pick those enticingly scented apples.PickingEnticingly scented apples
100Monitor the weather conditions and select sunny days for picking to ensure the sweetness and quality of the apples.Pickingapples
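The examples above pair each free-form instruction with the action and picking target that the system should extract. As a purely illustrative stand-in for the language-model-based intent recognition used by the instruction module, the toy Python parser below (the keyword lists, the regex, and the function name parse_instruction are all hypothetical, not the paper's implementation) shows the Instruction → (Action, Maturity, Target) mapping in code form.

```python
import re

# Toy keyword-based parser, shown only to illustrate the mapping in the table
# above. The paper's instruction module relies on a language model, not on
# hand-written rules, so every name and rule here is hypothetical.
PICK_VERBS = ("pick", "harvest", "select", "remove")
MATURITY_WORDS = ("fully ripe", "semi-ripe", "unripe", "ripe",
                  "semi-mature", "immature", "mature")

def parse_instruction(text: str) -> dict:
    text_l = text.lower()
    action = next((v for v in PICK_VERBS if v in text_l), None)
    maturity = next((m for m in MATURITY_WORDS if m in text_l), None)
    # Crude target extraction: the word immediately before "apples", if any.
    match = re.search(r"([\w-]+\s+apples)", text_l)
    target = match.group(1) if match else ("apples" if "apples" in text_l else None)
    return {"action": action, "maturity": maturity, "target": target}

print(parse_instruction("Carefully pick ripe apples from fruit trees."))
# {'action': 'pick', 'maturity': 'ripe', 'target': 'ripe apples'}
```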

References

  1. Chen, Z.; Lei, X.; Yuan, Q.; Qi, Y.; Ma, Z.; Qian, S.; Lyu, X. Key Technologies for Autonomous Fruit- and Vegetable-Picking Robots: A Review. Agronomy 2024, 14, 2233. [Google Scholar] [CrossRef]
  2. Nyyssönen, A. Vision-Language Models in Industrial Robotics. Bachelor’s Thesis, Faculty of Engineering and Natural Sciences, Tampere University, Tampere, Finland, 2024. [Google Scholar]
  3. Sapkota, R.; Ahmed, D.; Karkee, M. Comparing YOLOv8 and Mask R-CNN for instance segmentation in complex orchard environments. Artif. Intell. Agric. 2024, 13, 84–99. [Google Scholar] [CrossRef]
  4. Jiang, L.; Jiang, H.; Jing, X.; Dang, H.; Li, R.; Chen, J.; Majeed, Y.; Sahni, R.; Fu, L. UAV-based field watermelon detection and counting using YOLOv8s with image panorama stitching and overlap partitioning. Artif. Intell. Agric. 2024, 13, 117–127. [Google Scholar] [CrossRef]
  5. Zhou, J.; Zhang, Y.; Wang, J. A Dragon Fruit Picking Detection Method Based on YOLOv7 and PSP-Ellipse. Sensors 2023, 23, 3803. [Google Scholar] [CrossRef] [PubMed]
  6. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  7. Liang, Q.; Zhu, W.; Long, J.; Wang, Y.; Sun, W.; Wu, W. A Real-Time Detection Framework for On-Tree Mango Based on SSD Network. In Proceedings of the Intelligent Robotics and Applications, Newcastle, NSW, Australia, 9–11 August 2018; Chen, Z., Mendes, A., Yan, Y., Chen, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2018; pp. 423–436. [Google Scholar]
  8. Vasconez, J.; Delpiano, J.; Vougioukas, S.; Auat Cheein, F. Comparison of convolutional neural networks in fruit detection and counting: A comprehensive evaluation. Comput. Electron. Agric. 2020, 173, 105348. [Google Scholar] [CrossRef]
  9. Wang, P.; Niu, T.; He, D. Tomato Young Fruits Detection Method under Near Color Background Based on Improved Faster R-CNN with Attention Mechanism. Agriculture 2021, 11, 1059. [Google Scholar] [CrossRef]
  10. Xu, P.; Fang, N.; Liu, N.; Lin, F.; Yang, S.; Ning, J. Visual recognition of cherry tomatoes in plant factory based on improved deep instance segmentation. Comput. Electron. Agric. 2022, 197, 106991. [Google Scholar] [CrossRef]
  11. Bordes, F.; Pang, R.Y.; Ajay, A.; Li, A.C.; Bardes, A.; Petryk, S.; Mañas, O.; Lin, Z.; Mahmoud, A.; Jayaraman, B.; et al. An Introduction to Vision-Language Modeling. arXiv 2024, arXiv:2405.17247. [Google Scholar]
  12. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar]
  13. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv 2022, arXiv:2201.12086. [Google Scholar]
  14. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv 2023, arXiv:2301.12597. [Google Scholar]
  15. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485. [Google Scholar]
  16. Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar]
  17. Cao, Y.; Chen, L.; Yuan, Y.; Sun, G. Cucumber disease recognition with small samples using image-text-label-based multi-modal language model. Comput. Electron. Agric. 2023, 211, 107993. [Google Scholar] [CrossRef]
  18. Zhou, Y.; Yan, H.; Ding, K.; Cai, T.; Zhang, Y. Few-Shot Image Classification of Crop Diseases Based on Vision–Language Models. Sensors 2024, 24, 6109. [Google Scholar] [CrossRef]
  19. Tan, C.; Cao, Q.; Li, Y.; Zhang, J.; Yang, X.; Zhao, H.; Wu, Z.; Liu, Z.; Yang, H.; Wu, N.; et al. On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications. arXiv 2023, arXiv:2312.17016. [Google Scholar]
  20. Qing, J.; Deng, X.; Lan, Y.; Li, Z. GPT-aided diagnosis on agricultural image based on a new light YOLOPC. Comput. Electron. Agric. 2023, 213, 108168. [Google Scholar] [CrossRef]
  21. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO (Version 8.0.0). 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 9 April 2025).
  22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  24. Li, Y.; Bubeck, S.; Eldan, R.; Giorno, A.D.; Gunasekar, S.; Lee, Y.T. Textbooks Are All You Need II: Phi-1.5 technical report. arXiv 2023, arXiv:2309.05463. [Google Scholar]
  25. Yang, A.; Yang, B.; Hui, B.; Zheng, B.; Yu, B.; Zhou, C.; Li, C.; Li, C.; Liu, D.; Huang, F.; et al. Qwen2 Technical Report. arXiv 2024, arXiv:2407.10671. [Google Scholar]
  26. Zhou, B.; Hu, Y.; Weng, X.; Jia, J.; Luo, J.; Liu, X.; Wu, J.; Huang, L. TinyLLaVA: A Framework of Small-scale Large Multimodal Models. arXiv 2024, arXiv:2402.14289. [Google Scholar]
  27. OpenAI. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  28. DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  29. DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv 2024, arXiv:2405.04434. [Google Scholar]
  30. DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv 2025, arXiv:2412.19437. [Google Scholar]
  31. Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded Language-Image Pre-training. arXiv 2022, arXiv:2112.03857. [Google Scholar]
  32. Li, Y.; Xie, S.; Chen, X.; Dollar, P.; He, K.; Girshick, R. Benchmarking Detection Transfer Learning with Vision Transformers. arXiv 2021, arXiv:2111.11429. [Google Scholar]
  33. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2024, arXiv:2304.08069. [Google Scholar]
  34. Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. RegionCLIP: Region-based Language-Image Pretraining. arXiv 2021, arXiv:2112.09106. [Google Scholar]
  35. Zhao, X.; Chen, Y.; Xu, S.; Li, X.; Wang, X.; Li, Y.; Huang, H. An Open and Comprehensive Pipeline for Unified Object Grounding and Detection. arXiv 2024, arXiv:2401.02361. [Google Scholar]
  36. Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892. [Google Scholar]
  37. Yaseen, M. What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector. arXiv 2024, arXiv:2408.15857. [Google Scholar]
  38. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
Figure 1. (a) The overall framework of the fruit-picking system, which comprises the instruction processing module, the visual language processing module, and the motion control module. (b) Technical details of the visual language processing module. YOLOv8 first detects individual fruit targets in the image; each detected region is then classified through image–image and image–text contrastive learning, and the final result is output.
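To make the classification step in Figure 1b concrete, the sketch below shows one plausible way to fuse the two branches at inference time: the embedding of a detected region is compared against per-class text-prompt embeddings (image–text branch) and per-class reference image embeddings (image–image branch), and the two cosine-similarity vectors are blended with a weight α (0.6 is the best value in the sweep of Table 10). The function name, the linear fusion rule, and the reuse of α at inference are illustrative assumptions; the encoders and the construction of the prototypes are omitted.

```python
import torch
import torch.nn.functional as F

def fuse_similarities(region_emb: torch.Tensor,
                      text_embs: torch.Tensor,
                      proto_embs: torch.Tensor,
                      alpha: float = 0.6) -> torch.Tensor:
    """Blend the image-text and image-image branches into one class score.

    region_emb: (D,)   embedding of one YOLOv8-detected fruit region
    text_embs:  (K, D) embeddings of K class/maturity prompts (text branch)
    proto_embs: (K, D) reference image embeddings, one prototype per class
    alpha:      assumed weight on the image-image branch
    """
    region = F.normalize(region_emb, dim=-1)
    sim_text = F.normalize(text_embs, dim=-1) @ region    # (K,) cosine similarities
    sim_image = F.normalize(proto_embs, dim=-1) @ region  # (K,) cosine similarities
    return alpha * sim_image + (1.0 - alpha) * sim_text

if __name__ == "__main__":
    D, K = 512, 5                                          # embedding dim, class count
    scores = fuse_similarities(torch.randn(D), torch.randn(K, D), torch.randn(K, D))
    print(int(scores.argmax()))                            # index of the predicted class
```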
Figure 2. Flowchart of the intelligent fruit-picking system. The instruction module receives user commands (text or voice), identifies picking intent via language model processing, and triggers the Identify Module to capture and analyze video for object detection. The motion control module then guides the robotic arm to perform the picking action based on detection results.
Figure 3. YOLOv8 Detection Head. It retains the detection function. The input fruit features go through two convolutional layers (kernel size k = 3, stride s = 1, padding p = 1), and then enter a regression head (Conv2d) with kernel size k = 3, stride s = 1, padding p = 0, and output channels c = 4 × reg_max, where reg_max is the maximum bounding box coordinate offset that the model can predict. Bounding box predictions are supervised using the Bbox.Loss function. The red box indicates the detected object.
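A minimal PyTorch rendering of the regression branch as described in the caption is shown below; the channel widths and the SiLU activations are illustrative assumptions, while the kernel/stride/padding values and the 4 × reg_max output channels follow Figure 3.

```python
import torch
import torch.nn as nn

def bbox_regression_head(c_in: int = 256, c_mid: int = 64, reg_max: int = 16) -> nn.Sequential:
    """Regression branch with the hyperparameters stated in Figure 3:
    two 3x3 convs (s=1, p=1) followed by a 3x3 Conv2d (s=1, p=0) that outputs
    4 * reg_max channels. Channel widths and activations are illustrative."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_mid, kernel_size=3, stride=1, padding=1), nn.SiLU(),
        nn.Conv2d(c_mid, c_mid, kernel_size=3, stride=1, padding=1), nn.SiLU(),
        nn.Conv2d(c_mid, 4 * reg_max, kernel_size=3, stride=1, padding=0),  # p=0 trims H and W by 2
    )

head = bbox_regression_head()
print(head(torch.randn(1, 256, 80, 80)).shape)  # torch.Size([1, 64, 78, 78])
```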
Figure 4. Improved ViT architecture. It uses windows of different sizes to fuse image features and implements a multi-scale self-attention mechanism to capture local and global features of the image. “*” represents the extra learnable [class] token used for classification.
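The multi-scale self-attention idea in Figure 4 can be sketched as windowed attention computed at several window sizes and then fused. The window sizes, the averaging fusion, and the class name below are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiScaleWindowAttention(nn.Module):
    """Simplified multi-scale windowed self-attention: attention is computed
    inside non-overlapping windows of several sizes and the outputs are averaged."""

    def __init__(self, dim: int, num_heads: int = 4, window_sizes=(2, 4, 8)):
        super().__init__()
        self.window_sizes = window_sizes
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in window_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C); H and W are assumed divisible by every window size.
        B, H, W, C = x.shape
        fused = []
        for w, attn in zip(self.window_sizes, self.attn):
            # Partition the feature map into (H/w * W/w) windows of w*w tokens.
            win = x.reshape(B, H // w, w, W // w, w, C)
            win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
            out, _ = attn(win, win, win)                       # local self-attention
            out = out.reshape(B, H // w, W // w, w, w, C)
            out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
            fused.append(out)
        return torch.stack(fused).mean(dim=0)                  # fuse the scales

feats = torch.randn(1, 16, 16, 64)
print(MultiScaleWindowAttention(64)(feats).shape)              # torch.Size([1, 16, 16, 64])
```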
Figure 5. The AP50 value of each type of fruit. The results are divided into three groups represented by different colors: 0–0.6, 0.6–0.9, 0.9–1.0.
Figure 6. The results of fruit detection and classification.
Figure 7. The mAP@0.5 values of each model under different occlusion and lighting conditions.
Figure 8. The few-shot performance of different models under varying numbers of training samples. (a) Performance metrics with 1-shot training. (b) Performance metrics with 4-shot training. (c) Performance metrics with 8-shot training. (d) Performance metrics with 16-shot training.
Figure 9. The detection and classification results of untrained fruits.
Figure 10. Optimal values of mAP@0.5, mAP@[0.5:0.95], and F1 scores under different α values.
Table 1. Distribution of the dataset across different categories and maturity levels.
Category | Maturity | Total Instances | Category | Maturity | Total Instances
Apple | Mature, Semi-mature, Immature | 568 | Lychee | - | 448
Banana | Mature, Semi-mature, Immature | 217 | Lemon | - | 451
Grapes | Mature, Semi-mature, Immature | 395 | Pear | - | 451
Strawberry | Mature, Semi-mature, Immature | 265 | Tomato | - | 357
Persimmon | Mature, Semi-mature, Immature | 379 | Mango | - | 1223
Peach | Mature, Semi-mature, Immature | 1308 | | |
Passion Fruit | Mature, Semi-mature, Immature | 708 | | |
Total | | 3840 | Total | | 2930
Total: 6770
Table 2. Hardware configuration for the model deployment.
Component | Specification | Usage
GPU | NVIDIA RTX 4090D | Model inference and computation acceleration
CPU | AMD EPYC 9754 | Data preprocessing and backend services
Memory | 60 GB | Store model parameters, intermediate cache, and temporary data
Storage | System Disk: 30 GB SSD; Data Disk: 50 GB SSD | Fast loading of training data and model files
Table 3. Software stack for the model deployment.
Component | Version | Usage
Operating System | Ubuntu 22.04 | Provides a reliable runtime environment
Framework | PyTorch 2.5.1 | Supports model training and inference
Python Environment | Python 3.12 | Manages script execution and dependencies
CUDA Toolkit | CUDA 12.4 | Enables GPU acceleration for deep learning
Table 4. Simulated performance comparison under different quantization methods.
Model | Input Length (Chinese Characters) | Quantization | GPU Num | Speed (Tokens/s) | Accuracy (%)
Phi-2 [24] | 30–50 | BF16 | 1 | 42.5 | 91.00
Phi-2 [24] | 30–50 | GPTQ-Int4 | 1 | 47.3 | 91.40
Phi-2 [24] | 30–50 | AWQ | 1 | 43.9 | 90.7
Qwen2-7B [25] | 30–50 | BF16 | 1 | 34.7 | 95.10
Qwen2-7B [25] | 30–50 | GPTQ-Int4 | 1 | 36.2 | 94.80
Qwen2-7B [25] | 30–50 | AWQ | 1 | 33.1 | 93.80
TinyLLaVA [26] | 30–50 | BF16 | 1 | 30.5 | 91.10
TinyLLaVA [26] | 30–50 | GPTQ-Int4 | 1 | 34.2 | 90.90
GPT-4 [27] | 30–50 | BF16 | 1 | 30.2 | 94.60
GPT-4 [27] | 30–50 | GPTQ-Int4 | 1 | 34.8 | 94.80
GPT-4 [27] | 30–50 | AWQ | 1 | 33.6 | 93.80
DeepSeek-R1 [28] | 30–50 | BF16 | 1 | 21.8 | 93.60
DeepSeek-V2 [29] | 30–50 | FP8 | 1 | 32.2 | 95.20
DeepSeek-V3 [30] | 30–50 | 3-bit | 1 | 12.3 | 94.80
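Table 4 reports decoding speed in tokens per second for short Chinese picking instructions. A minimal benchmarking sketch of that metric with Hugging Face Transformers is given below; the checkpoint name, prompt, and generation length are placeholders, and the quantized rows (GPTQ-Int4, AWQ, FP8, 3-bit) would be loaded from the corresponding quantized checkpoints rather than in BF16.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "microsoft/phi-2"                  # placeholder; swap in any model from Table 4
PROMPT = "请去果园采摘那些已经成熟的红苹果。"  # example Chinese picking instruction

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"   # BF16 rows of Table 4
)

inputs = tok(PROMPT, return_tensors="pt").to(model.device)
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```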
Table 5. Performance comparison of different models.
Model | Precision | Recall | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
GLIP [31] | 0.516 | 0.523 | 0.519 | 0.507 | 0.226
ViTDet [32] | 0.539 | 0.554 | 0.546 | 0.542 | 0.366
RT-DETR [33] | 0.671 | 0.691 | 0.681 | 0.683 | 0.396
RegionCLIP [34] | 0.081 | 0.089 | 0.085 | 0.084 | 0.036
MM-Grounding-DINO [35] | 0.413 | 0.708 | 0.522 | 0.413 | 0.248
YOLOv5 [36] | 0.682 | 0.706 | 0.694 | 0.721 | 0.568
YOLOv8 [37] | 0.711 | 0.723 | 0.717 | 0.720 | 0.558
YOLO11 [38] | 0.728 | 0.710 | 0.719 | 0.718 | 0.557
E-CLIP (ours) | 0.778 | 0.728 | 0.752 | 0.791 | 0.652
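As a quick sanity check on Table 5, the F1 score is the harmonic mean of precision and recall; for example, E-CLIP's precision of 0.778 and recall of 0.728 reproduce the reported F1 of 0.752.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# E-CLIP row of Table 5: precision 0.778 and recall 0.728 give the reported F1 of 0.752.
print(round(f1_score(0.778, 0.728), 3))  # 0.752
```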
Table 6. Robustness comparison of different models.
Model | Scenario | Precision | Recall | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
GLIP | Slight Occ.—Normal Light | 0.527 | 0.531 | 0.529 | 0.528 | 0.241
ViTDet | Slight Occ.—Normal Light | 0.541 | 0.564 | 0.552 | 0.553 | 0.372
RT-DETR | Slight Occ.—Normal Light | 0.689 | 0.701 | 0.695 | 0.693 | 0.405
RegionCLIP | Slight Occ.—Normal Light | 0.096 | 0.103 | 0.099 | 0.091 | 0.042
MM-Grounding-DINO | Slight Occ.—Normal Light | 0.468 | 0.874 | 0.610 | 0.468 | 0.289
YOLOv5 | Slight Occ.—Normal Light | 0.728 | 0.703 | 0.715 | 0.736 | 0.575
YOLOv8 | Slight Occ.—Normal Light | 0.724 | 0.728 | 0.726 | 0.729 | 0.561
YOLO11 | Slight Occ.—Normal Light | 0.732 | 0.711 | 0.721 | 0.727 | 0.559
E-CLIP (ours) | Slight Occ.—Normal Light | 0.836 | 0.829 | 0.832 | 0.832 | 0.661
GLIP | Slight Occ.—Backlight | 0.472 | 0.481 | 0.476 | 0.462 | 0.205
ViTDet | Slight Occ.—Backlight | 0.502 | 0.515 | 0.508 | 0.508 | 0.335
RT-DETR | Slight Occ.—Backlight | 0.623 | 0.643 | 0.633 | 0.635 | 0.362
RegionCLIP | Slight Occ.—Backlight | 0.070 | 0.078 | 0.074 | 0.073 | 0.035
MM-Grounding-DINO | Slight Occ.—Backlight | 0.352 | 0.700 | 0.468 | 0.352 | 0.201
YOLOv5 | Slight Occ.—Backlight | 0.635 | 0.658 | 0.646 | 0.673 | 0.530
YOLOv8 | Slight Occ.—Backlight | 0.663 | 0.675 | 0.669 | 0.672 | 0.520
YOLO11 | Slight Occ.—Backlight | 0.678 | 0.662 | 0.670 | 0.670 | 0.519
E-CLIP (ours) | Slight Occ.—Backlight | 0.702 | 0.689 | 0.695 | 0.693 | 0.589
GLIP | Moderate Occ.—Normal Light | 0.427 | 0.437 | 0.432 | 0.418 | 0.183
ViTDet | Moderate Occ.—Normal Light | 0.468 | 0.482 | 0.475 | 0.475 | 0.316
RT-DETR | Moderate Occ.—Normal Light | 0.581 | 0.602 | 0.591 | 0.593 | 0.340
RegionCLIP | Moderate Occ.—Normal Light | 0.060 | 0.068 | 0.064 | 0.063 | 0.014
MM-Grounding-DINO | Moderate Occ.—Normal Light | 0.337 | 0.790 | 0.473 | 0.335 | 0.276
YOLOv5 | Moderate Occ.—Normal Light | 0.590 | 0.613 | 0.601 | 0.627 | 0.495
YOLOv8 | Moderate Occ.—Normal Light | 0.615 | 0.628 | 0.621 | 0.628 | 0.487
YOLO11 | Moderate Occ.—Normal Light | 0.632 | 0.617 | 0.624 | 0.626 | 0.485
E-CLIP (ours) | Moderate Occ.—Normal Light | 0.634 | 0.665 | 0.649 | 0.651 | 0.581
GLIP | Moderate Occ.—Backlight | 0.394 | 0.403 | 0.398 | 0.384 | 0.165
ViTDet | Moderate Occ.—Backlight | 0.433 | 0.447 | 0.440 | 0.442 | 0.295
RT-DETR | Moderate Occ.—Backlight | 0.541 | 0.562 | 0.551 | 0.553 | 0.313
RegionCLIP | Moderate Occ.—Backlight | 0.055 | 0.062 | 0.058 | 0.057 | 0.012
MM-Grounding-DINO | Moderate Occ.—Backlight | 0.586 | 0.554 | 0.569 | 0.582 | 0.209
YOLOv5 | Moderate Occ.—Backlight | 0.550 | 0.573 | 0.561 | 0.585 | 0.462
YOLOv8 | Moderate Occ.—Backlight | 0.575 | 0.588 | 0.581 | 0.586 | 0.455
YOLO11 | Moderate Occ.—Backlight | 0.591 | 0.577 | 0.584 | 0.583 | 0.453
E-CLIP (ours) | Moderate Occ.—Backlight | 0.581 | 0.643 | 0.610 | 0.619 | 0.523
GLIP | Severe Occ.—Normal Light | 0.352 | 0.362 | 0.357 | 0.345 | 0.142
ViTDet | Severe Occ.—Normal Light | 0.397 | 0.410 | 0.403 | 0.403 | 0.275
RT-DETR | Severe Occ.—Normal Light | 0.502 | 0.523 | 0.512 | 0.514 | 0.294
RegionCLIP | Severe Occ.—Normal Light | 0.048 | 0.055 | 0.051 | 0.050 | 0.012
MM-Grounding-DINO | Severe Occ.—Normal Light | 0.426 | 0.759 | 0.553 | 0.426 | 0.324
YOLOv5 | Severe Occ.—Normal Light | 0.512 | 0.535 | 0.523 | 0.551 | 0.435
YOLOv8 | Severe Occ.—Normal Light | 0.535 | 0.548 | 0.541 | 0.548 | 0.425
YOLO11 | Severe Occ.—Normal Light | 0.552 | 0.539 | 0.545 | 0.545 | 0.423
E-CLIP (ours) | Severe Occ.—Normal Light | 0.552 | 0.601 | 0.575 | 0.574 | 0.483
GLIP | Severe Occ.—Backlight | 0.316 | 0.325 | 0.320 | 0.308 | 0.121
ViTDet | Severe Occ.—Backlight | 0.358 | 0.371 | 0.364 | 0.363 | 0.251
RT-DETR | Severe Occ.—Backlight | 0.462 | 0.483 | 0.472 | 0.474 | 0.261
RegionCLIP | Severe Occ.—Backlight | 0.032 | 0.039 | 0.035 | 0.034 | 0.001
MM-Grounding-DINO | Severe Occ.—Backlight | 0.332 | 0.595 | 0.428 | 0.328 | 0.208
YOLOv5 | Severe Occ.—Backlight | 0.472 | 0.495 | 0.483 | 0.508 | 0.402
YOLOv8 | Severe Occ.—Backlight | 0.495 | 0.508 | 0.501 | 0.505 | 0.395
YOLO11 | Severe Occ.—Backlight | 0.512 | 0.499 | 0.505 | 0.503 | 0.393
E-CLIP (ours) | Severe Occ.—Backlight | 0.513 | 0.576 | 0.543 | 0.551 | 0.462
Table 7. Generalization performance of the E-CLIP.
Setting | Precision | Recall | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
1-shot | 0.431 | 0.125 | 0.194 | 0.059 | 0.049
4-shot | 0.420 | 0.438 | 0.429 | 0.387 | 0.292
8-shot | 0.596 | 0.421 | 0.493 | 0.492 | 0.340
16-shot | 0.606 | 0.618 | 0.612 | 0.646 | 0.523
Table 8. Performance metrics for new categories.
Category | Precision | Recall | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
Orange | 0.505 | 0.632 | 0.561 | 0.649 | 0.442
Watermelon | 0.664 | 0.610 | 0.636 | 0.619 | 0.248
Cantaloupe | 0.944 | 0.615 | 0.745 | 0.819 | 0.492
Cherry | 0.356 | 0.679 | 0.467 | 0.415 | 0.129
All | 0.617 | 0.634 | 0.625 | 0.626 | 0.328
Table 9. The effect of the image–image module and text–image module. ↓ indicates performance drop compared to the full model; ✓ means the module is enabled, and × means the module is disabled.
Image–Image Module | Text–Image Module | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
✓ | ✓ | 0.752 | 0.791 | 0.652
✓ | × | 0.695 (↓0.057) | 0.683 (↓0.108) | 0.521 (↓0.131)
× | ✓ | 0.686 (↓0.066) | 0.667 (↓0.124) | 0.513 (↓0.139)
Table 10. The impact of the loss function’s weight parameter α .
α | Precision | Recall | F1 Score | mAP@0.5 | mAP@[0.5:0.95]
0.0 | 0.318 | 0.638 | 0.419 | 0.396 | 0.281
0.1 | 0.357 | 0.622 | 0.453 | 0.413 | 0.306
0.2 | 0.414 | 0.641 | 0.511 | 0.523 | 0.389
0.3 | 0.445 | 0.644 | 0.526 | 0.529 | 0.390
0.4 | 0.502 | 0.553 | 0.527 | 0.614 | 0.479
0.5 | 0.638 | 0.692 | 0.663 | 0.661 | 0.512
0.6 | 0.778 | 0.728 | 0.752 | 0.791 | 0.652
0.7 | 0.445 | 0.644 | 0.526 | 0.529 | 0.39
0.8 | 0.434 | 0.656 | 0.528 | 0.532 | 0.399
0.9 | 0.337 | 0.558 | 0.421 | 0.385 | 0.281
1.0 | 0.305 | 0.707 | 0.419 | 0.387 | 0.272
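Table 10 sweeps the weight α that balances the two contrastive objectives, with α = 0.6 performing best. A minimal sketch of one plausible formulation is given below, assuming a symmetric InfoNCE term for each branch and a linear combination L = α·L_image–image + (1 − α)·L_image–text; the exact loss used in the paper may differ, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings a, b of shape (N, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau                       # (N, N) similarity matrix
    labels = torch.arange(a.size(0))               # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def combined_loss(img: torch.Tensor, text: torch.Tensor, ref_img: torch.Tensor,
                  alpha: float = 0.6) -> torch.Tensor:
    """Assumed weighting: L = alpha * L_image-image + (1 - alpha) * L_image-text."""
    return alpha * info_nce(img, ref_img) + (1.0 - alpha) * info_nce(img, text)

# Quick check with random embeddings (batch of 8, dimension 512).
print(combined_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)))
```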
Table 11. Comparison of model inference speed and computing efficiency.
Model | GFLOPs (10⁹) | Parameters (10⁶) | FPS
ViTDet | 321.94 | 563.20 | 26.45
RT-DETR | 231.72 | 409.23 | 28.79
GLIP | 415.17 | 954.19 | 18.76
RegionCLIP | 518.52 | 923.01 | 14.79
MM-Grounding-DINO | 39.61 | 201.26 | 4.76
YOLOv5 | 24.1 | 9.13 | 135.14
YOLOv8 | 28.7 | 11.15 | 272.27
YOLO11 | 6.50 | 2.60 | 83.33
E-CLIP (ours) | 98.61 | 86.42 | 54.82
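The FPS and parameter counts in Table 11 can be measured with a simple PyTorch harness like the sketch below; the 640 × 640 input resolution, warm-up length, and iteration count are assumptions rather than the paper's measurement protocol.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, input_shape=(1, 3, 640, 640),
                warmup: int = 10, iters: int = 100) -> float:
    """Average single-image frames per second (the FPS column of Table 11)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                    # warm-up to stabilise clocks and caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

def count_params_m(model: torch.nn.Module) -> float:
    """Parameter count in millions (the 10^6 column of Table 11)."""
    return sum(p.numel() for p in model.parameters()) / 1e6
```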
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
