Article

E-CLIP: An Enhanced CLIP-Based Visual Language Model for Fruit Detection and Recognition

1 College of Informatics, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
2 College of Plant Science and Technology, Huazhong Agricultural University, No. 1 Shizi Mountain Street, Hongshan District, Wuhan 430070, China
3 Wuhan X-Agriculture Intelligent Technology Co., Ltd., Wuhan 430070, China
4 Hubei Hongshan Laboratory, Wuhan 430070, China
5 National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan 430070, China
* Authors to whom correspondence should be addressed.
Agriculture 2025, 15(11), 1173; https://doi.org/10.3390/agriculture15111173
Submission received: 20 April 2025 / Revised: 21 May 2025 / Accepted: 26 May 2025 / Published: 29 May 2025
(This article belongs to the Section Digital Agriculture)

Abstract

As agricultural modernization advances, intelligent fruit harvesting is gaining importance. Fruit detection and recognition are essential for robotic harvesting, yet existing methods generalize poorly: they struggle to adapt to complex environments and to handle new fruit varieties. This limitation stems from their reliance on unimodal visual data, which leaves a semantic gap between image features and contextual understanding. To address it, this study proposes a multi-modal fruit detection and recognition framework based on visual language models (VLMs). By integrating multi-modal information, the proposed model improves robustness and generalization across diverse environmental conditions and fruit types. The framework accepts natural language instructions as input, facilitating effective human–machine interaction. Its core module, Enhanced Contrastive Language–Image Pre-Training (E-CLIP), employs image–image and image–text contrastive learning to recognize various fruit types and their maturity levels robustly. In experiments, the model achieves an F1 score of 0.752 and an mAP@0.5 of 0.791. It also remains robust under occlusion and varying illumination, and attains a zero-shot mAP@0.5 of 0.626 on unseen fruits. The system runs at an inference speed of 54.82 FPS, balancing speed and accuracy. These results indicate practical potential for smart agriculture and offer new insights and methods for its application.
Keywords: visual language models; contrast learning; smart agriculture
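
The abstract names image–image and image–text contrastive learning as the mechanisms behind E-CLIP but does not spell out the objective. As a rough illustration only, the following sketch shows the standard CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. It is not the authors' implementation: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions made for this example. The same function could in principle be applied to image–text pairs or to two views of the same image.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(a_embs, b_embs, temperature=0.07):
    """Symmetric InfoNCE loss; a_embs[i] and b_embs[i] form a positive pair.

    temperature=0.07 is an assumed illustrative value, not taken from the paper.
    """
    a = [l2_normalize(v) for v in a_embs]
    b = [l2_normalize(v) for v in b_embs]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy example: two hypothetical image embeddings and their caption embeddings.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9]]
matched = contrastive_loss(image_embs, text_embs)
mismatched = contrastive_loss(image_embs, list(reversed(text_embs)))
print(matched < mismatched)
```

The loss is low when each embedding is most similar to its own partner and high when partners are swapped, which is the property that pulls matching pairs together during training.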

Share and Cite

MDPI and ACS Style

Zhang, Y.; Shao, Y.; Tang, C.; Liu, Z.; Li, Z.; Zhai, R.; Peng, H.; Song, P. E-CLIP: An Enhanced CLIP-Based Visual Language Model for Fruit Detection and Recognition. Agriculture 2025, 15, 1173. https://doi.org/10.3390/agriculture15111173

