Abstract
Strawberry harvesting is labor-intensive and has long depended on manual picking, which makes it a pressing challenge for modern agriculture. Autonomous harvesting robots offer a transformative alternative, but their success hinges on accurate and efficient strawberry detection. In this paper, we present DPViT-YOLOV8, a novel approach that leverages advances in computer vision and deep learning to improve strawberry detection. DPViT-YOLOV8 integrates three components into the YOLOV8 architecture: the EfficientViT backbone, which provides multi-scale linear attention; the Dynamic Head mechanism, which unifies the object detection heads with attention; and the proposed C2f_Faster module, which improves computational efficiency. We curate and annotate a diverse dataset of strawberry images collected on a farm. A rigorous evaluation shows that DPViT-YOLOV8 outperforms the baseline models, achieving higher mean Average Precision (mAP), precision, and recall. An ablation study isolates the contribution of each enhancement, and qualitative results demonstrate the model’s ability to locate ripe strawberries in real-world agricultural settings. Notably, DPViT-YOLOV8 remains computationally efficient, reducing both inference time and FLOPs relative to the baseline YOLOV8. Our research bridges the gap between computer vision and agricultural systems, offering a practical tool to accelerate the adoption of autonomous strawberry harvesting, reduce labor costs, and support the sustainability of strawberry farming.