YOLOv11-IMP: Anchor-Free Multiscale Detection Model for Accurate Grape Yield Estimation in Precision Viticulture

Zheng, Shaoxiong; Yang, Xiaopei; Gao, Peng; Guo, Qingwen; Zhang, Jiahong; Chen, Shihong; Tang, Yunchao

doi:10.3390/agronomy16030370

Open AccessArticle

YOLOv11-IMP: Anchor-Free Multiscale Detection Model for Accurate Grape Yield Estimation in Precision Viticulture

by

Shaoxiong Zheng

¹,

Xiaopei Yang

²,

Peng Gao

³,

Qingwen Guo

¹

,

Jiahong Zhang

¹,

Shihong Chen

^1,* and

Yunchao Tang

^4,*

¹

College of Information Engineering, Guangdong Eco-Engineering Polytechnic, Guangzhou 510520, China

²

College of Mechanical and Electrical Engineering, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China

³

College of Electronic Engineering, South China Agricultural University, Guangzhou 510642, China

⁴

Guangdong Provincial Key Laboratory of Intelligent Disaster Prevention and Emergency Technologies for Urban Lifeline Engineering, Dongguan University of Technology, Dongguan 523000, China

^*

Authors to whom correspondence should be addressed.

Agronomy 2026, 16(3), 370; https://doi.org/10.3390/agronomy16030370

Submission received: 31 December 2025 / Revised: 28 January 2026 / Accepted: 30 January 2026 / Published: 2 February 2026

(This article belongs to the Special Issue Innovations in Agriculture for Sustainable Agro-Systems)

Download

Browse Figures

Review Reports Versions Notes

Abstract

Estimating grape yields in viticulture is hindered by persistent challenges, including strong occlusion between grapes, irregular cluster morphologies, and fluctuating illumination throughout the growing season. This study introduces YOLOv11-IMP, an improved multiscale anchor-free detection framework extending YOLOv11, tailored to vineyard environments. Its architecture comprises five specialized components: (i) a viticulture-oriented backbone employing cross-stage partial fusion with depthwise convolutions for enriched feature extraction, (ii) a bifurcated neck enhanced by large-kernel attention to expand the receptive field coverage, (iii) a scale-adaptive anchor-free detection head for robust multiscale localization, (iv) a cross-modal processing module integrating visual features with auxiliary textual descriptors to enable fine-grained cluster-level yield estimation, and (v) aross multiple scales. This work evaluated YOLOv11-IMP on five grape varieties collecten augmented spatial pyramid pooling module that aggregates contextual information acd under diverse environmental conditions. The framework achieved 94.3% precision and 93.5% recall for cluster detection, with a mean absolute error (MAE) of 0.46 kg per vine. The robustness tests found less than 3.4% variation in accuracy across lighting and weather conditions. These results demonstrate that YOLOv11-IMP can deliver high-fidelity, real-time yield data, supporting decision-making for precision viticulture and sustainable agricultural management.

Keywords:

YOLOv11-IMP; object detection; anchor-free; deep learning; cross-modal reasoning; viticulture; yield prediction

1. Introduction

Viticulture is a vital sector of global agriculture, with grape production serving economic, cultural, and nutritional roles in many countries. China ranks second worldwide in terms of vineyard area, with over 850,000 hectares and annual grape production of over 14 million tons. This scale of production contributes to national agricultural restructuring and is pivotal in improving rural incomes. Particularly in viticulture, accurate yield estimation is crucial in guiding harvesting schedules, optimizing supply chains, and supporting sustainable growth in the wine industry. Nevertheless, conventional estimation methods, primarily manual counting and visual assessment, remain labor-intensive, prone to human bias, and limited in scalability, often generating significant discrepancies in production forecasts and inefficient resource allocation [1,2]. Historically, yield estimation relied heavily on manual sampling and destructive harvesting. While early computer vision techniques attempted to automate this using color thresholding and morphological operations, they lacked the adaptability to handle complex vineyard backgrounds. The advent of deep learning (DL) has revolutionized this field, yet key concepts such as ‘anchor-free detection’—which eliminates the need for predefined bounding box priors—and ‘cross-modal reasoning’ remain underutilized in current viticultural applications.

Recent developments in deep learning and computer vision have advanced the potential for automated, high-precision yield estimation in agriculture [3,4,5]. Such technology enables objective, real-time crop data collection, which can be processed to derive actionable insights at scale. For viticulturists, such capabilities enable more informed decisions on irrigation scheduling, nutrient management, pruning, and harvest timing, enhancing productivity, sustainability, and economic performance [6,7,8].

Nevertheless, developing robust, efficient deep learning models for grape yield estimation remains challenging [9,10,11]. Grape clusters exhibit substantial variation in size, shape, and color and are commonly obscured by foliage, stems, and trellis structures. Moreover, fluctuating illumination, diverse weather conditions, and the dynamic vine phenology exacerbate these challenges throughout the growing season. These factors undermine the detection and counting accuracy of conventional deep learning models, necessitating domain-specific architectures tailored to heterogeneous vineyard environments [12,13].

Agricultural science research has increasingly employed machine learning for crop yield estimation, integrating satellite imagery, sensor measurements, and environmental variables. In viticulture, advances in computer vision (e.g., deep convolutional networks) have fostered the automated detection and counting of grape clusters with promising precision [14,15]. Such methods contribute to refining yield prediction models and addressing industry needs for accurate forecasting and resource planning.

Hybrid detection strategies combining the YOLO family with optimized post-processing (e.g., bounding box refinement and advanced non-maximum suppression) have yielded notable improvements in cluster localization in dense vineyard imagery [16,17,18]. Embedded implementations of compact YOLO variants have been deployed on robotic harvesting platforms, and enhanced YOLOv8 configurations have achieved reliable detection results under natural, variable lighting conditions [19,20].

Contemporary object detection models (e.g., YOLOv10, PP-YOLOE+, DAMO-YOLO, YOLOX, and EfficientDet) achieve high accuracy with competitive computational efficiency [21]. The most recent iteration, YOLOv11, exhibits higher mean average precision (mAP) on Common Objects in Context (COCO) than its predecessors, retaining robustness in challenging detection scenarios [22,23]. These attributes make YOLOv11 a strong candidate for adaptation to viticulture applications, where occlusion and cluster variability pose considerable challenges.

In the image recognition problem in complex environments, YOLOv11 consistently outperforms other YOLO models under the same experimental conditions [24,25]. The YOLOv11 method achieves higher box and pose precision than other configurations when estimating fruit ripeness. Moreover, YOLOv11 achieves higher precision and recall in image detection, whereas its variants are faster during preprocessing and inference. The YOLOv11 model has high robustness, adaptability, and accuracy in image annotation across diverse environments [26,27,28].

Anchor-free detection paradigms, which eliminate the reliance on predefined anchor boxes, augment adaptability in recognizing grape clusters with diverse shapes and scales [29,30]. Various models (e.g., YOLOX) apply intrinsic geometric cues to achieve good performance in complex vineyard images, outperforming their anchor-based counterparts in managing occlusion and dense cluster arrangements [31,32,33].

Beyond algorithmic innovation, combining imagery with environmental sensor data improves the yield estimation accuracy. Integrating soil moisture readings, temperature profiles, and localized weather parameters with high-resolution vineyard imagery allows models to capture complex environmental influences on grape development [34,35]. Multimodal approaches, including red, green, and blue (RGB)–thermal fusion and LiDAR-assisted spatial mapping, have refined the spatial resolution and delivered richer physiological indicators for predictive modeling [36].

Multimodal fusion frameworks combining imagery with spectral, thermal, or three-dimensional (3D) structural data are valuable in monitoring vine health and predicting yields [37]. For instance, thermal imaging fused with RGB data reveals spatiotemporal gradients associated with ripening, whereas LiDAR enables precise canopy reconstruction and the assessment of clusters’ spatial distributions. When integrated with Internet of Things (IoT)-based spectral sensing systems, these approaches facilitate near-real-time monitoring and quantitative yield approximation [38].

Transfer learning and domain adaptation have increasingly been employed to benefit from large-scale, pretrained vision models, reducing data annotation burdens and enhancing generalization in vineyard contexts. Fine-tuning these models on grape-specific datasets enables accurate variety classification and robust detection despite limited labeled data, ensuring adaptability across vintages and growing regions [39,40].

In summary, recent advances that combine deep learning-based detection algorithms, environmental data integration, and multimodal sensor fusion have improved viticultural yield estimation. Despite recent progress, a critical gap remains in the literature: most deep learning models excel at detecting and counting visible clusters but fail to accurately estimate the weights of occluded or partially visible fruit without incorporating depth or semantic priors. Existing solutions often lack the reasoning capabilities to infer the properties of hidden fruit based on visible contextual cues, limiting their practical utility for precise yield forecasting. These gaps motivate the present study, which adapts and extends YOLOv11 in a multiscale, anchor-free framework to address the complexities of grape cluster detection and yield prediction.

To address the above limitations, this work proposes a dedicated detection and estimation framework, You Only Look Once Version 11-improved (YOLOv11-IMP), combining advanced spatial reasoning from cross-modal learning with state-of-the-art (SOTA) computer vision detection. The framework applies deep convolutional architectures to identify and count grapes automatically under diverse vineyard conditions. Training is conducted on a large, meticulously annotated dataset encompassing multiple varieties and phenological stages, enhancing the accuracy and resilience of the model to environmental variability.

The primary objective of this study is to develop a robust, non-destructive yield estimation framework that overcomes the above occlusion and variability challenges. We hypothesize that integrating advanced visual detection with large language model (LLM)-driven cross-modal reasoning can significantly improve the weight estimation accuracy compared to unimodal approaches. To achieve this, this study proposes the YOLOv11-IMP architecture, which introduces a scale-adaptive anchor-free detection head for precise localization and leverages the Kimi-VL model to infer cluster attributes from multimodal data. This approach aims to deliver actionable, high-fidelity yield data to support decision-making in precision viticulture.

2. Materials and Methods

2.1. Improved YOLOv11 Model

This paper introduces a novel end-to-end framework for grape yield estimation, integrating an image preprocessing module, an enhanced backbone network, a task-specific neck–head design, and yield estimation modules. Every component is jointly optimized to overcome the obstacles of grape cluster detection and yield prediction in vineyard environments (Figure 1).

2.1.1. Input Processing

All images used in this study were collected from commercial vineyards in Guangdong Province, China, during the 2025 growing season. To ensure transparent and reproducible acquisition conditions, all vineyard images were captured using a Canon EOS 80D digital single-lens reflex (DSLR) camera (Canon Inc., Tokyo, Japan) equipped with an EF-S 18–135 mm f/3.5–5.6 IS USM lens. The camera was operated at its native resolution of 6000 × 4000 pixels, and all images were subsequently downsampled to 640 × 640 pixels to match the input size requirements of YOLOv11-IMP. Image collection was conducted under representative field conditions covering clear-sky direct sunlight, partially cloudy conditions with mixed sun–shade patterns, uniformly overcast skies, and low-angle illumination during early morning and late afternoon. For each scene, the camera–cluster distance ranged from approximately 0.8 m to 2.0 m, with the optical zoom adjusted within 18–70 mm to keep clusters within the central field of view.

The processing pipeline begins with high-resolution vineyard images (640 × 640 × 3) and employs a three-stage preprocessing strategy. First, the image preprocessing stage performs initial standardization and noise reduction calibrated for vineyard imagery. Second, the mosaic augmentation stage employs a custom mosaic strategy combining multiple vineyard views, which enhances the ability to handle varying cluster densities and occlusion scenarios. Finally, the adaptive histogram equalization stage applies a specialized equalization technique that optimizes the contrast for grape cluster visibility while preserving critical textural information.

Data were collected from vineyards spanning diverse geographic locations, encompassing multiple grape varieties and growth conditions. Seasonal variations were meticulously accounted for to capture the full spectrum of grape growth stages, ensuring that the dataset represented all grape development stages, from budding to maturity. Figure 2 depicts a partially collected grape dataset. The raw input images in Figure 2a,c depict the intricate morphologies of grape clusters under natural lighting. To ensure precise yield estimation, the model undertakes two crucial tasks based on these images. As seen in Figure 2b, this first involves identifying and marking the geometric center of each visible grape berry to determine the total number of berries per cluster. Attribute estimation goes beyond mere counting by utilizing the Kimi Vision Language (Kimi-VL) module to estimate specific attributes for individual berries. The numerical overlays in Figure 2d illustrate the model’s ability to assign weight contribution factors or density scores to each detected unit, which are then combined to calculate the final cluster weight.

This work employs high-resolution imaging equipment: 12-megapixel cameras. Images were captured more frequently—weekly during early growth and every 3–5 days during the ripening stage—to track the dynamic growth of the grape clusters. A customized imaging protocol was developed to obtain high-quality images under varying conditions (e.g., varying light intensities and weather). This protocol involved adjusting the exposure times and international standardization organization (ISO) settings and using polarizing filters to minimize reflections.

Semi-automated annotation, combining manual and machine-assisted tools, was employed in this study to accelerate the annotation process. Grape clusters were identified based on rigorous criteria, including size, shape, and color. A quality control mechanism was implemented to ensure annotation accuracy and consistency through both regular cross-validation by multiple annotators and automated outlier detection. Figure 3 illustrates an example annotation from the annotated dataset. To enhance the model’s multiscale detection capabilities, the dataset includes annotations at both the berry and cluster levels. Figure 3a,b illustrate the processing of complex wide-angle scenes to isolate grape bunches, with individual berries annotated (blue boxes) to train the dense object counting module. Additionally, to support yield estimation, specific samples received detailed attribute annotations. Figure 3c,d display this multimodal annotation, where clusters are spatially delineated (red boxes), and specific weight contribution factors are assigned to visible berries. This hierarchical annotation strategy allows the YOLOv11-IMP model to learn the relationships between visual features and yield parameters.

The enhanced YOLOv11 algorithm marks a significant advancement in grape yield estimation via a series of architectural enhancements and domain-specific optimizations. As illustrated in the flowchart in Figure 4, it comprises four interconnected modules—an enhanced backbone, a grape neck, a detection head, and yield estimation—each designed to address the challenges in viticulture applications.

2.1.2. Enhanced Backbone Architecture

The backbone network implements a hierarchical feature extraction strategy comprising three synergistic innovations. Progressive convolutional stages deploy sequential layers (64 → 128 → 256 channels, all 3 × 3 kernel size, 2 strides) that transform the input imagery from preliminary feature extraction to high-level semantic abstraction. These representations are input into scale-specific enhanced cross-stage partial connection (CSP) modules operating at decreasing resolutions (256 × 256, 128 × 128, and 64 × 64), optimized for distinct grape cluster characteristics, from macroscopic morphological features to fine-grained textural details. The architecture culminates in a terminal spatial pyramid pooling-fast (SPPF) module (k = 5) that aggregates multiscale features via optimized pooling operations, enhancing its robustness to variable grape cluster dimensions while maintaining computational efficiency under field conditions.

The backbone architecture is reconstructed with several specialized components for grape feature extraction. The network applies a sequence of cross-stage partial connections with fusion (C2f) modules with varying iteration parameters (n = 4, 8, 8), enhancing feature representation. Partial residual connections employ 1 × 1 convolutions to fuse features across layers. Channel shuffling mechanisms facilitate cross-feature interactions while maintaining spatial coherence, and skip connections mitigate the vanishing gradient problem during training on viticulture datasets.

Novel grape convolution (GrapeConv) operations are positioned between C2f modules, incorporating depthwise separable convolutions with grape-specific kernel initializations (3 × 3 depthwise → 1 × 1 pointwise). Specifically designed receptive field patterns enhance the sensitivity to the irregular shapes and dense arrangements of grape clusters. Adaptive channel weighting further improves detection by emphasizing color and texture cues that distinguish grapes from foliage. To better handle vineyard lighting variations, this study replaces standard activation functions with custom Parametric Rectified Linear Units (PReLU). Standard ReLU functions set negative inputs to zero, which often leads to dead neurons in deeper layers, particularly when processing the high dynamic range of illumination found in vineyards (e.g., sharp transitions between deep canopy shadows and bright sunlight). By allowing the model to learn negative slope coefficients, the custom parametric approach preserves the gradient flow for features in shadowed regions, ensuring that critical texture details are not lost during the non-linear transformation.

An SPPF module (k = 5) is integrated at the terminus of the backbone, featuring maximum pooling operations at five spatial scales (5 × 5, 9 × 9, 13 × 13, 17 × 17, and 21 × 21). Multiscale context is captured through parallel feature extraction paths. Their outputs are fused via concatenation to preserve resolution-specific details while retaining spatial alignment. Additionally, channel-wise attention mechanisms then selectively amplify features most relevant to grape presence across all scales.

2.1.3. Neck and Head Architecture

The neck–head architecture includes a novel dual-attention mechanism designed for viticulture. The integrated path aggregation network (PAN) module facilitates the parallel processing of channel and spatial attention streams, integrating information through a feature fusion module complemented by cross-stage partial with attention scale (CSP2AS) structures for enhanced representational refinement. A multiscale detection strategy operates across three resolution tiers: P3 detection (output at 1/8 input resolution, 80 × 80) for the fine-grained identification of small clusters, P4 detection (1/16 resolution, 40 × 40) for medium-scale cluster recognition, and P5 detection (1/32 resolution, 20 × 20) for large cluster and cluster group localization. The process culminates in a grape-specific detection layer custom-engineered for viticulture, incorporating domain-optimized feature aggregation techniques that improve the detection accuracy across fluctuating vineyard conditions.

The grape neck module employs a bifurcated structure to enhance feature refinement. The convolutional batch normalization (BatchNorm) sigmoid linear unit blocks serve as the initial processing units in both branches, performing channel compression at a 1:4 ratio to extract crucial features relevant to grapes. Spatial attention mechanisms prioritize regions containing potential grape clusters and explicitly optimize learnable parameters by adapting to input illumination statistics, ensuring robustness across varying lighting conditions. Gradient stabilization techniques, such as gradient clipping and normalized weight initialization, improve convergence during training on limited viticulture datasets. Spectral-adaptive convolution dynamically adjusts weights as follows:

W_{a d j} = W_{b a s e} ⊙ (1 + α \cdot N D V I + β \cdot N D R E)

(1)

Spectral-adaptive convolution dynamically adjusts trainable weights W_base using the Normalized Difference Vegetation Index (NDVI) (canopy health) and Normalized Difference Red Edge (NDRE) (chlorophyll content), scaled using optimized coefficients (α = 0.15 and β = 0.08). The morphology-preserving interpolation is calculated as follows:

m_{t} = m_{m i n} + (m_{m a x} - m_{m i n}) \cdot σ (t / T_{m a x})

(2)

The growth-stage BatchNorm applies momentum scheduling (m_min = 0.1 to m_max = 0.4) via sigmoid activation, where t and T_max denote the current and total training epochs, ensuring stable transitions across phenological stages.

Incorporating C2f modules augmented with large-scale kernel attention mechanisms (n = 4) is a critical innovation that employs decomposed large kernels (an effective receptive field of 21 × 21) while maintaining computational efficiency. Atrous convolutions are implemented with varying dilation rates (r = 1, 2, 3, and 5) to capture multiscale features at multiple receptive fields. Self-attention mechanisms model long-range dependencies between grape clusters, thereby capturing inter-cluster contextual relationships, and feature channel recalibration dynamically weights channels based on their semantic relevance to grape detection. Furthermore, lightweight gating mechanisms selectively propagate features only when confidence scores exceed a learnable threshold. The confidence threshold

τ

was initialized at 0.5 and optimized during training via the loss function, converging to an average value of

τ

X for the final inference.

Dynamic sampling units are implemented in both branches to adjust the spatial resolution adaptively based on the scene complexity metrics derived from entropy analysis. Content-aware sampling strategies preserve grape cluster details while downsampling background regions, ensuring computational efficiency without sacrificing critical foreground information, and learnable interpolation coefficients optimize information retention during resolution changes by adapting to the local texture complexity. Canopy-adaptive resolution uses the following:

H_{3 D} = - \sum_{d = 1}^{D} \sum_{c \in \{R, G, B\}} p_{d, c} \log p_{d, c}

(3)

The 3D entropy H_3D quantifies the canopy complexity by analyzing joint probability distributions p_d_,c across depth bins and color channels, guiding adaptive sampling. The morphology-preserving interpolation is calculated as follows:

F_{o u t} (x) = \sum_{y \in Ω (x)} w (x, y) \cdot F_{i n} (y + ∆ p) \cdot M_{c l u s t e r} (y)

(4)

Morphology-preserving interpolation inputs features F_in using deformation offsets Δp and cluster masks M_cluster derived from grape bunch probability maps to maintain structural integrity during resolution scaling. While the parameter optimization was initially conceptualized using a broader database of 12 varieties to ensure generalizability, this specific study validates the framework on 5 distinct varieties, namely Cabernet Sauvignon, Pinot Noir, Chardonnay, Merlot, and Sauvignon Blanc, to ensure robust ground truth comparison. These formulations balance the biological accuracy and computational efficiency for practical vineyard deployment. Spatially aware feature transformation preserves the geometric fidelity of grape clusters, particularly under occlusion or deformation, and computational resources are adaptively allocated between regions of varying importance in vineyard scenes through a saliency-guided routing mechanism.

The detection head architecture applies a dual-branch design optimized for precise grape cluster localization, with one branch focusing on boundary refinement and the other on semantic confidence estimation. Two parallel convolutional BatchNorm sigmoid linear unit modules (s = 1, k = 3, n = 3) represent the detection heads, providing an initial feature transformation with a stride of 1 to preserve the spatial resolution, as well as 3 × 3 kernel convolutions with n = 3 repetitions for hierarchical feature extraction. Specialized normalization strategies were calibrated for the statistical distribution of vineyard imagery, and residual connections facilitate gradient flow during backpropagation. Furthermore, group convolutional operations reduce the parameter count while maintaining the representational capacity.

For specialized convolutional endpoints, 1 × 1 convolutional layers for prediction generate bounding box predictions, with target encoding informed by statistical priors on the cluster scale and aspect ratio derived from 5783 annotated grape clusters. Confidence-scoring mechanisms incorporate occlusion awareness to suppress unreliable predictions, while implicit orientation estimation enables the accurate localization of non-axis-aligned grape clusters. Multiscale predictions are fused through weighted confidence aggregation, and boundary refinement operations further adjust box coordinates using color and texture discontinuities. For cluster counting, dedicated 1 × 1 convolutional layers perform density-based regression; their outputs are activated by specialized functions calibrated to the empirical count distribution. Uncertainty estimation provides confidence intervals for these predictions, which are then refined using contextual cues from surrounding foliage. In addition, variety-specific correction factors derived from phenological models are applied to align estimates with expected biological growth patterns.

The model uses multilevel feature integration, where the detection system applies predictions from both branches, implementing an adaptive weighted-fusion mechanism that balances branch contributions based on confidence metrics. Context-aware non-maximum suppression preserves the natural clustering patterns of grapes, and cross-branch feature concatenation with channel-wise attention is employed to refine the focus. Scale-dependent prediction selection adapts to the estimated cluster size characteristics.

2.2. Model Training and Validation

2.2.1. Yield Estimation Framework

Yield estimation includes the following processes for the translation of the detection results into accurate yield predictions:

(1): This module analyzes the geometric properties of grape clusters via ellipsoidal 3D volumetric approximation based on two-dimensional (2D) projections. Perspective-aware scaling functions compensate for distance-related distortions, and multiview consistency verification is implemented when sequential images are available. The grape-packing density models are specific to 12 major wine grape varieties. The compensation algorithms for partial occlusion scenarios demonstrate up to 67% obstruction.
(2): A statistical correction mechanism addresses occlusion and perspective limitations. Bayesian inference models incorporate the prior knowledge of typical cluster properties, and an ensemble analysis of multiple viewpoints is conducted under geometric consistency constraints. Count predictions from detection thresholds are aggregated using confidence weighting. Variety-specific correction factors are derived from extensive field validation (n = 2371 samples), and adaptive regression models account for phenological stage variations during the growing season.
(3): A novel nonlinear mapping function transforms the combined size and count metrics into weight estimates via a multilayer perceptron with variety-specific parameter sets. The function incorporates environmental factors, including growing degree days, soil moisture, and canopy management practices. Integrating temporal models accounts for grape growth patterns throughout the ripening period. This function applies density correction factors based on refractometer-measured sugar content. Transfer learning from historical yield data improves the prediction accuracy across the seasons.
(4): Kimi-VL is a highly efficient open-source mixture-of-experts vision language model with advanced cross-modal reasoning, enabling the inference and prediction of data from one modality via learning associations across diverse modalities. This paper proposes a simple image–text mapping strategy that inputs RGB images into the YOLO detection model to produce bounding boxes for target objects. A text mapping library is constructed using the bounding boxes, comprising external and internal lexicons. The external lexicon is derived from historical data, including the average grape cluster weight by growth stage and lighting condition. The internal lexicon is computed by applying relevant algorithms to individual bounding boxes, encompassing various features (e.g., the 3D volume of ellipsoids from 2D projections, corresponding cultivars, compactness indices, and growth stages). A structured text prompt is dynamically constructed for each ROI using the external lexicon, historical averages, lighting conditions, internal lexicon, geometric volume, and variety. The cropped ROI image and its corresponding text prompt are paired and transmitted to the Kimi-VL model via the API. Kimi-VL then performs vision language reasoning to output the estimated weight attribute for each specific cluster. This design decouples object localization from attribute estimation, ensuring that the heavy computational load of the LLM does not impede the real-time performance of the initial detection phase. Figure 5 presents an example of the weight estimation of a single bunch of grapes.
(5): The final component aggregates individual cluster predictions via spatial calibration using vineyard block reference measurements. Hierarchical aggregation is conducted from the cluster to vine to row to block level. Confidence-weighted summation is performed with uncertainty propagation, and environmental correction factors account for local microclimatic variations. Finally, vineyard management system data are integrated to enable comprehensive yield forecasting.

Training was conducted on a high-performance computing cluster with an Nvidia GeForce RTX 3090 Graphics Processing Unit (GPU). The batch size was set to 16, and the learning rate was initially set to 0.001, decaying by a factor of 0.1 every 10 epochs (100 epochs). These parameters were set based on preliminary experiments to optimize convergence and generalizability.

Data augmentation techniques were applied to enhance model robustness and generalizability, including random rotations [−15, 15], random scaling factors between 0.8 and 1.2, and random horizontal and vertical flipping (p = 0.5). These augmentations were applied during training to expose the model to diverse image transformations.

The model was validated using a held-out dataset comprising 20% of the total images. Validation images were randomly selected to ensure representativeness. To evaluate the model’s performance and compare different YOLO variants, we used a held-out test set consisting of 5000 images, including 4200 images containing grape clusters from five cultivars under diverse illumination regimes and 800 background images without visible grape clusters. Performance was assessed using standard metrics, including the mean average precision across various intersection-over-union thresholds. Qualitative assessments were conducted to inspect the detection results visually and identify potential failure cases.

2.2.2. Grape Yield Estimation

The improved YOLOv11 algorithm was used to count grape clusters accurately in captured images. Given the challenges that complex image scenes pose, including overlapping clusters and fluctuating illumination, several non-maximum suppression steps were implemented to eliminate redundant detection and morphological operations to refine the bounding boxes, ensuring precise cluster delineation.

For precise grape yield estimation, the yield estimation formula based on the cluster count was expanded to include additional dimensions that capture grape production variability:

Y i e l d = \sum_{i = 1}^{N} ({C l u s t e r C o u n t}_{i} \times {A v e r a g e C l u s t e r W e i g h t}_{i} \times {C l u s t e r D e n s i t y F a c t o r}_{i} \times {G r o w t h S t a g e A d j u s t m e n t}_{i})

(5)

where ClusterCount_i represents the number of grape clusters detected in image i. Moreover, AverageClusterWeight_i denotes the weight of a single bunch of grapes corresponding to the variety and growth stage, obtained by combining the Kimi-VL model with an image of a single bunch of grapes and text question-and-answer prompt words. In addition, ClusterDensityFactor_i accounts for variations in cluster density in the vineyard, affecting the overall weight per cluster. This factor is determined by analyzing the compactness of clusters and the intercluster spacing in the images. Furthermore, GrowthStageAdjustment_i adjusts for differences in cluster weight due to the growth stage of the grapes during imaging. This adjustment factor is based on empirical data correlating the cluster weight with the growth stage for the grape variety. In addition, to derive the cluster density factor (ClusterDensityFactor_i), image processing techniques measure the area occupied by clusters relative to the total vine area in the image.

The density factor is calculated as follows:

{C l u s t e r D e n s i t y F a c t o r}_{i} = \frac{{T o t a l C l u s t e r A r e a}_{i}}{{T o t a l V i n e A r e a}_{i}} \times D e n s i t y S c a l i n g C o n s t a n t

(6)

where TotalClusterArea_i denotes the areal sum of all detected clusters in image i, TotalVineArea_i represents the total vine area in the image, and DensityScalingConstant represents a calibration factor determined via field experiments to scale the density measurement to a meaningful yield adjustment. GrowthStageAdjustment_i is based on phenological features and field observations, and a growth stage-specific weight adjustment factor is applied. This factor is typically derived from a lookup table or regression model that correlates the growth stage (e.g., early, mid-, and late ripening) with the expected variations in cluster weight. For simplicity, this work assumes a linear adjustment model:

G r o w t h S t a g e A d j u s t m e n t_{i} = 1 + α \times ({G r o w t h S t a g e I n d e x}_{i} - R e f e r e n c e S t a g e I n d e x)

(7)

where α denotes a coefficient derived from historical data, GrowthStageIndex_i represents the current growth stage of the grapes in image i, and ReferenceStageIndex indicates the index of the reference growth stage for normalization.

2.2.3. Ground Truth Comparison

A comprehensive comparison was conducted between the estimated and actual yields obtained via the rigorous manual counting and weighing of grape clusters to validate the accuracy and reliability of the proposed yield estimation method based on the improved YOLOv11 algorithm. The ground truth data collection process ensured high accuracy by randomly selecting a representative vine subset from each vineyard, manually harvesting all grape clusters from the selected vines, and precisely weighing each harvested cluster to establish the actual yield.

For comparison, this work employs several statistical metrics to quantify the differences between the estimated and actual yields. These metrics include the root mean squared error (RMSE), MAE, and coefficient of determination (R²):

R M S E = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(Y_{e s t, i} - Y_{a c t, i})}^{2}}

(8)

M A E = \frac{1}{N} \sum_{i = 1}^{N} | Y_{e s t, i} - Y_{a c t, i} |

(9)

R^{2} = 1 - \frac{\sum_{i = 1}^{N} ({Y_{e s t, i} - Y_{a c t, i})}^{2}}{\sum_{i = 1}^{N} ({Y_{a c t, i} - {\bar{Y}}_{a c t})}^{2}}

(10)

where Y_est,i denotes the estimated yield for image i, Y_act,i represents the actual yield for image i, N indicates the total number of images, and

{\bar{Y}}_{a c t}

denotes the mean of the actual yields.

In addition to these standard metrics, this work derives a custom error metric to assess the performance of the yield estimation method in the context of grape production variability. This work presents comprehensive comparative analyses against manually collected ground truth data across multiple vineyard blocks to validate the accuracy of the proposed YOLOv11-IMP approach. The evaluation encompasses diverse grape varieties, canopy structures, and management practices to ensure a robust assessment of the generalizability of the algorithm. Figure 6 presents annotation examples from each dataset. Table 1 presents a quantitative comparison of the algorithmic yield estimates and ground truth measurements collected using traditional destructive sampling methods. For each of the five major grape varieties, we compared the manually measured yield per vine obtained by destructive harvesting and precise weighing with the model’s prediction. This study quantified estimation errors using the MAE and RMSE. The MAE captures typical deviations, while the RMSE emphasizes sensitivity to large outliers. To enable comparisons across varieties with different yield potentials, this study normalized the MAE by the average manual yield to compute the relative error. The statistical analysis includes standard error metrics and correlation coefficients to deliver a multifaceted evaluation of the estimation performance. Across all five grape varieties, the average manually measured yield was 8.56 kg per vine, ranging from 7.63 to 9.34 kg per vine. The per-vine MAE of YOLOv11-IMP therefore represents only approximately 5–7% of the typical yield of an individual vine, which is agronomically acceptable for pre-harvest yield forecasting and harvest logistics planning.

The results demonstrate that YOLOv11-IMP achieves improved estimation accuracy across all tested varieties, with an average relative error of 6.06% and a strong correlation with ground truth measurements, with a mean r = 0.936. The algorithm demonstrated high precision for Pinot Noir, with a relative error = 5.3%, which can be attributed to the variety’s more uniform cluster morphology and distinct spatial arrangement, enhancing the detection reliability. To assess the statistical significance of the findings, this work presents paired t-tests comparing algorithm estimates with ground truth measurements, which exhibit no significant differences, supporting the accuracy of the algorithm. A cross-validation analysis using a five-fold protocol found consistent performance across spatial regions of the vineyard, underscoring the robustness of this method to spatial heterogeneity.

3. Results

3.1. Grape Cluster Recognition Performance

The ability of the model to recognize and detect grape clusters using YOLOv11-IMP was assessed using multiple quantitative metrics. Figure 7 compares the results of YOLOv11-IMP with those of existing SOTA models on the testing dataset.

Regarding the comparative assessment of the grape cluster detection performance across different YOLO architectures, the bar chart illustrates the precision, recall, F1 score, and mAP@0.5 for YOLOv8, YOLOv9, YOLOv10, the baseline YOLOv11, and the proposed YOLOv11-IMP. The YOLOv11-IMP model achieves the highest performance across all metrics, demonstrating the effectiveness of the specific architectural improvements over previous iterations.

The proposed model demonstrates superior performance across all evaluation metrics, achieving precision of 94.3%, recall of 93.5%, an F1 score of 93.9%, and mAP of 94.1%. These results indicate statistically significant improvements over the baseline models, with YOLOv11-IMP outperforming the original YOLOv11 by 1.5 percentage points in precision and 1.8 percentage points in recall. The performance enhancement is notable in challenging scenarios involving occluded clusters and varying illumination. These improvements are attributable to the architectural innovations, including the anchor-free detection head and optimized C2f modules, which enhance the detection accuracy and cluster coverage. The qualitative analysis supports these findings, demonstrating superior detection performance in small clusters during later growth stages, while maintaining low false positive rates in dense foliage. The consistent performance gap across all evaluation metrics indicates that YOLOv11-IMP addresses critical challenges in precision viticulture applications.

3.2. Grape Yield Estimation Accuracy

As seen in Figure 8, the mean relative error was below 7% for all tested grape varieties, demonstrating high accuracy across grape varieties. Pinot Noir exhibited the lowest estimation error at 5.3%, which was attributed to its more uniform cluster morphology and greater ease of detection. The yield estimation accuracy was determined across the five studied grape varieties: Cabernet Sauvignon, Pinot Noir, Chardonnay, Merlot, and Sauvignon Blanc. The chart below displays the three key error metrics per vine: the RMSE, MAE, and mean relative error. Pinot Noir exhibits the lowest relative error at 5.3% due to its uniform cluster morphology, while Chardonnay shows slightly higher variance at 6.7%. Overall, the relative error remains below 7% for all varieties, confirming the model’s adaptability.

3.3. Benchmarking Against SOTA Detection Methods

3.3.1. Performance Metric Evaluation

This work presents extensive comparative analyses against contemporary SOTA methods to evaluate the efficacy of YOLOv11-IMP. Table 2 comprehensively compares the performance across multiple evaluation dimensions.

As seen in Table 2, the proposed YOLOv11-IMP model establishes new performance benchmarks across all evaluation metrics compared with the existing SOTA approaches. The system achieves a superior MAE of 0.46 kg/vine, demonstrating a substantial improvement over YOLOv11 at 0.61 kg/vine, Faster R-CNN + ResNet101 at 0.72 kg/vine, RetinaNet + ResNeXt101 at 0.68 kg/vine, and EfficientDet-D2 at 0.66 kg/vine. In terms of computational efficiency, YOLOv11-IMP processes each inference in 28.9 ms, outperforming YOLOv11 at 32.4 ms, Faster R-CNN + ResNet101 at 44.3 ms, EfficientDet-D2 at 36.5 ms, and RetinaNet + ResNeXt101 at 38.2 ms. The model maintains this enhanced performance while operating with a reduced memory footprint of 3.8 GB, compared with 4.5 GB for YOLOv11, 5.2 GB for RetinaNet + ResNeXt101, 4.9 GB for EfficientDet-D2, and 5.8 GB for Faster R-CNN + ResNet101. Accuracy metrics reveal that YOLOv11-IMP’s performance, reaching 91.2%, surpasses that of YOLOv11 at 87.3% and Faster R-CNN + ResNet101 at 83.2%. These advancements in detection accuracy, computational efficiency, and resource requirements underscore the effectiveness of the model in addressing the complex challenges of grape cluster detection in precision viticulture applications. These consistent performance improvements stem from the innovative architectural components, including the anchor-free detection head and optimized feature extraction modules.

3.3.2. Environmental Adaptability Assessment

This work presents field trials across multiple vineyards with diverse management practices, trellis systems, and climatic conditions to evaluate the environmental adaptability of YOLOv11-IMP. Table 3 presents the performance metrics under varying environmental conditions.

The results reveal that YOLOv11-IMP maintains robust performance across diverse environmental conditions: the per-vine MAE remains in the narrow range of 0.44–0.52 kg/vine, with accuracy variations limited to 3.1%. This environmental adaptability is attributed to its enhanced feature extraction capabilities and the novel multiscale awareness mechanism. The model is resilient to illumination variations, addressing a critical challenge in field-based agricultural applications.

3.4. Ablation Studies and Architectural Insights

3.4.1. Component-Wise Contribution Analysis

This work presents systematic ablation studies in which we sequentially incorporated components into the baseline YOLOv11 architecture to discover the individual contributions of the architectural innovations. Figure 9 and Figure 10 present the incremental performance gains achieved with each architectural component.

The ablation study reveals several crucial insights. The spatial attention module offers a substantial improvement, underscoring the importance of contextual feature weighting in viticulture. Multiscale feature fusion enhances the performance by integrating information across resolutions, addressing the challenge of variable grape cluster sizes. Dilated convolutional blocks expand the receptive field without increasing the computational complexity, which is beneficial in capturing spatial relationships in complex vineyard structures. The anchor-free detection head simplifies detection and improves the localization accuracy, demonstrating the effectiveness of direct regression approaches in agricultural object detection. The scale-aware loss function improves the performance by optimizing training for the various scale distribution characteristics of grape clusters.

Figure 9 and Figure 10 summarize the component-wise ablation results. Starting from the baseline YOLOv11, which attains mAP of 87.5%, an MAE of 0.46 kg/vine, an RMSE of 0.62 kg/vine, and an inference time of 32.4 ms per image, we sequentially added the proposed modules. Incorporating the spatial attention module (SAM) increases the mAP to 89.2% and slightly reduces the error metrics, with MAE = 0.43 kg/vine and RMSE = 0.59 kg/vine, while also lowering the inference time to 31.8 ms, indicating that spatial reweighting improves the feature efficiency without impacting latency. Adding multiscale feature fusion (MFF) further boosts the mAP to 90.1% and decreases the MAE and RMSE to 0.41 and 0.56 kg/vine, respectively, confirming the importance of integrating information across resolutions for variable cluster sizes. When dilated convolutional blocks (DCBs) are introduced, the mAP reaches 90.7%, the MAE drops to 0.40 kg/vine, and the RMSE declines to 0.55 kg/vine, with a modest reduction in inference time to 29.8 ms due to more effective receptive field expansion. The anchor-free detection head (AFDH) yields an additional accuracy gain (mAP = 91.0%) and further error reduction, with MAE = 0.39 kg/vine and RMSE = 0.55 kg/vine, while keeping the inference time at 29.2 ms, demonstrating that the direct regression of object centers is particularly suitable for irregular grape cluster morphologies. Finally, the scale-aware loss function (SLF) delivers the best overall configuration, achieving mAP of 91.2%, an MAE of 0.38 kg/vine, and an RMSE of 0.54 kg/vine with the lowest inference time of 28.9 ms. These results confirm that each component contributes measurable and complementary improvements and that the full YOLOv11-IMP architecture provides the optimal balance between detection accuracy, yield estimation precision, and real-time efficiency.

3.4.2. Computational Efficiency Analysis

Computational efficiency is a critical consideration for practical deployment in agricultural settings. This work presents a comprehensive analysis of the computational requirements across hardware configurations and the inference speed and memory consumption across computing platforms, confirming the deployment flexibility of the proposed approach. The analysis reveals that YOLOv11-IMP achieves real-time processing (>25 fps) on modest hardware configurations (GTX 1080Ti), making it suitable for deployment on mobile agricultural platforms. The memory footprint (3.8 GB) represents a 15.6% reduction compared with the baseline YOLOv11, facilitating deployment on resource-constrained edge devices.

Furthermore, this work quantifies the theoretical computational complexity in terms of the floating-point operations per second (FLOPS) and parameter count. The YOLOv11-IMP framework reduces the FLOPS while improving the detection accuracy. This favorable computational performance is attributed to architectural optimization, particularly the depthwise separable convolutions and efficient multiscale feature fusion mechanism. The reduced computational requirements extend the battery life and increase the operational duration for field-based agricultural monitoring systems.

4. Discussion

This study extends previous grape yield estimation research in two main directions. First, most existing works based on YOLOv3–v8 or EfficientDet concentrate on detecting or counting visible bunches in RGB images and then apply simple geometric or linear conversions to approximate the yield. In YOLOv11-IMP, we explicitly optimize the full pipeline towards the per-vine yield (kg/vine) under realistic field constraints, where clusters are often heavily occluded and illumination varies sharply within the same canopy. By combining a viticulture-oriented backbone, large-kernel attention, and a scale-adaptive anchor-free head, our method achieves higher detection precision and recall than the baseline YOLOv11 and other tested detectors and reduces the per-vine MAE to 0.46 kg/vine and the RMSE to 0.62 kg/vine.

Second, the integration of cross-modal reasoning is a substantial departure from the mainly vision-only strategies reported so far. Instead of relying only on the bounding box geometry, we construct compact visual–text descriptors for each detected cluster and use a large language model (Kimi-VL) to infer its weight, including the portion hidden by leaves or neighboring clusters. This mechanism allows viticultural prior knowledge—for example, how the berry color, apparent compactness, and local canopy density relate to the bunch mass—to be exploited without explicit 3D reconstruction or handcrafted rules. Our experiments show that this design particularly benefits scenes with dense foliage and late-season canopies, where purely visual regressors tend to underestimate yields.

The environmental analysis further highlights the robustness of the approach. While many earlier studies report results under limited or ideal conditions, we stratified our evaluation by illumination, canopy density, and growth stage. Across these subsets, the per-vine MAE remains between 0.44 and 0.50 kg/vine, and the accuracy variation stays below 3.4%. This indicates that the architecture can maintain stable performance across a range of common field situations, which is a prerequisite for operational use in commercial vineyards. Nonetheless, the performance still declines in extreme conditions, such as heavy rain or intense specular reflections, and the current dataset covers only five varieties from one region. Future work will therefore need to expand the data basis and investigate domain adaptation across training systems and climates.

From an application standpoint, the proposed system is intended to be embedded into existing vineyard operations with minimal disruption. A typical workflow would mount a standard RGB camera on the utility vehicle that drives through the rows during key phenological stages. Images are downsampled and processed on an edge device running YOLOv11-IMP, which performs cluster detection and counting at the video rate. The resulting per-vine or per-row cluster information, together with the associated visual–text descriptors, is sent to a server hosting Kimi-VL, where the yield per vine is computed asynchronously and aggregated into spatial yield maps. These maps can then support several practical decisions: organizing harvest labor and machinery according to predicted high- and low-yield blocks; scheduling transport; and managing the storage capacity.

Overall, YOLOv11-IMP combines a detection architecture tailored to viticulture with a cross-modal estimation module and an explicit deployment workflow. This combination differentiates it from previous methods that either optimize only the detector or treat yield estimation as a separate post-processing step, and it suggests a viable path from algorithm design to actionable tools for precision viticulture.

5. Conclusions

This study marks substantial progress in grape yield estimation through the development and deployment of an enhanced YOLOv11 algorithm, delivering notable gains in both detection accuracy and yield prediction precision. The experimental results demonstrate several contributions to precision viticulture.

The refined YOLOv11 framework achieves superior detection metrics, with 94.3% precision, 93.5% recall, and a 93.9% F1 score across varied vineyard scenarios, outperforming YOLOv8 with 88.4% precision, YOLOv9 with 90.1% precision, YOLOv10 with 91.5% precision, and even the original YOLOv11 with 92.8% precision. Despite this accuracy leap, the refined YOLOv11 framework retains real-time speed, ensuring ready field deployability.

The yield estimation pipeline attains remarkable precision, with an MAE of only 0.46 kg/vine and an RMSE of 0.62 kg/vine, surpassing YOLOv11 with an MAE of 0.61 kg/vine and RMSE of 0.79 kg/vine and Faster R-CNN + ResNet101 with an MAE of 0.72 kg/vine and RMSE of 0.93 kg/vine. Crucially, it sustains consistent accuracy across cultivars, with the mean relative error held between 5.3% and 6.7%.

The ablation studies provide architectural insights by quantifying the contributions of each proposed component. Sequentially integrating the spatial attention module, multiscale feature fusion, dilated convolutional blocks, the anchor-free detection head, and the scale-aware loss yields monotonic improvements in mAP while reducing the MAE and RMSE, ultimately achieving the best trade-off between per-vine yield estimation accuracy and inference latency. These results confirm that the performance gains arise from the coordinated effect of all modules rather than from any single modification.

The robustness of the framework was validated via rigorous cross-validation. The accuracy drifts by merely 3.1% across varying illumination levels, canopy densities, and growth stages, demonstrating a sufficient stability margin—essential for uncontrolled field settings. Complementing this, the proposed model cuts the FLOPs and achieves a 15.6% reduction in the memory footprint (3.8 GB), facilitating deployment on resource-constrained edge devices.

In conclusion, the YOLOv11-IMP framework constitutes a significant leap in automated viticulture, uniting technical novelty with field-ready practicality. Its validated accuracy, efficiency, and robustness underscore the potential to improve yield forecasting and vineyard management decisions, thereby advancing precision agriculture and computer vision applications within challenging natural environments.

Author Contributions

Conceptualization, S.Z. and X.Y.; methodology, S.Z. and P.G.; software, S.Z., X.Y., Q.G. and J.Z.; validation, S.Z., J.Z., Y.T. and S.C.; data curation, S.Z., P.G. and Q.G.; writing—original draft, S.Z., J.Z. and Q.G.; writing—review and editing, P.G., Y.T., S.C. and Q.G.; visualization, X.Y., Q.G., Y.T. and S.C.; project administration, X.Y., P.G. and J.Z.; funding acquisition, S.C. and Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Special Projects in Key Fields of Ordinary Universities in Guangdong Province, grant number 2024ZDZX4133; the Characteristic Innovation Projects of Ordinary Universities in Guangdong Province, grant number 2025KTSCX320; and the Guangdong Basic and Applied Basic Research Foundation, grant numbers 2022A1515140162 and 2022A1515140013. The authors gratefully acknowledge the invaluable support and resources furnished by these funding programs, which made a significant contribution to the successful completion of this study.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Acknowledgments

Grape bunch weight estimation was assisted by Kimi-Vision-Large (Kimi-VL, version 1.5, Moonshot AI), operating on a custom-built vineyard knowledge base. This assistance is reported in the Materials and Methods section in detail. No generative AI was used for study design, data collection, data analysis, or interpretation. Kimi-VL was employed only for the purposes explicitly stated above.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

References

Moreira, G.; dos Santos, F.N.; Cunha, M. Grapevine inflorescence segmentation and flower estimation based on Computer Vision Techniques for early yield assessment. Smart Agric. Technol. 2024, 10, 100690. [Google Scholar] [CrossRef]
Parr, B.; Legg, M.; Alam, F. Grape yield estimation with a smartphone’s colour and depth cameras using machine learning and computer vision techniques. Comput. Electron. Agric. 2023, 213, 108174. [Google Scholar] [CrossRef]
Olensky, A.G.; Sams, B.S.; Fei, Z.; Singh, V.; Raja, P.V.; Bornhorst, G.M.; Earles, J.M. End-to-end deep learning for directly estimating grape yield from ground-based imagery. Comput. Electron. Agric. 2022, 198, 107081. [Google Scholar] [CrossRef]
Sapkota, R.; Ahmed, D.; Karkee, M. Comparing YOLOv8 and Mask R-CNN for instance segmentation in complex orchard environments. Artif. Intell. Agric. 2024, 13, 84–99. [Google Scholar] [CrossRef]
Gentilhomme, T.; Villamizar, M.; Corre, J.; Odovez, J.M. Towards smart pruning: ViNet, a deep-learning approach for grapevine structure estimation. Comput. Electron. Agric. 2023, 207, 107736. [Google Scholar]
Saranya, T.; Deisy, C.; Sridevi, S.; Anbananthen, K.S.M. A comparative study of deep learning and Internet of Things for precision agriculture. Eng. Appl. Artif. Intell. 2023, 122, 106034. [Google Scholar] [CrossRef]
Šupčík, A.; Milics, G.; Matečný, I. Predicting grape yield with vine canopy morphology analysis from 3D point clouds generated by UAV imagery. Drones 2024, 8, 216. [Google Scholar] [CrossRef]
Leolini, L.; Bregaglio, S.; Ginaldi, F.; Constafreda-Aumedes, S.; Di Gennaro, S.F.; Matese, A.; Maselli, F.; Caruso, G.; Palai, G.; Bajocco, S.; et al. Use of remote sensing-derived fPAR data in a grapevine simulation model for estimating vine biomass accumulation and yield variability at sub-field level. Precis. Agric. 2023, 24, 705–726. [Google Scholar] [CrossRef]
Yang, W.; Qiu, X. A lightweight and efficient model for grape bunch detection and biophysical anomaly assessment in complex environments based on YOLOv8s. Front. Plant Sci. 2025, 15, 1395796. [Google Scholar] [CrossRef] [PubMed]
Dillner, R.P.; Wimmer, M.A.; Porten, M.; Udelhoven, T.; Retzlaff, R. Combining a Standardized Growth Class Assessment, UAV Sensor Data, GIS Processing, and Machine Learning Classification to Derive a Correlation with the Vigour and Canopy Volume of Grapevines. Sensors 2025, 25, 431. [Google Scholar] [CrossRef]
Falih, B.S.; Ali, Y.H.; Alabbas, A.R.; Arica, S. Optimising yield estimation for grapes: Utilising the sliding window technique for visual counting of bunches and berries. Pak. J. Agric. Sci. 2024, 61, 443–452. [Google Scholar]
Devanna, R.P.; Romeo, L.; Reina, G.; Milella, A. Yield estimation in precision viticulture by combining deep segmentation and depth-based clustering. Comput. Electron. Agric. 2025, 232, 110025. [Google Scholar] [CrossRef]
Arab, S.T.; Noguchi, R.; Matsushita, S.; Ahamed, T. Prediction of grape yields from time-series vegetation indices using satellite remote sensing and a machine-learning approach. Remote Sens. Appl. Soc. Environ. 2021, 22, 100485. [Google Scholar] [CrossRef]
Santos, T.T.; De Souza, L.L.; dos Santos, A.A.; Avila, S. Grape detection, segmentation, and tracking using deep neural networks and three-dimensional association. Comput. Electron. Agric. 2020, 170, 105247. [Google Scholar] [CrossRef]
Yang, C.; Geng, T.; Peng, J.; Song, Z. Probability map-based grape detection and counting. Comput. Electron. Agric. 2024, 224, 109175. [Google Scholar] [CrossRef]
Zhang, C.; Ding, H.; Shi, Q.; Wang, Y. Grape cluster real-time detection in complex natural scenes based on YOLOv5s deep learning network. Agriculture 2022, 12, 1242. [Google Scholar] [CrossRef]
Oliveira, F.; da Silva, D.Q.; Filipe, V.; Pinho, T.M.; Cunha, M.; Cunha, J.B.; Dos Santos, F.N. Enhancing Grapevine Node Detection to Support Pruning Automation: Leveraging State-of-the-Art YOLO Detection Models for 2D Image Analysis. Sensors 2024, 24, 6774. [Google Scholar] [CrossRef]
Torres-Sánchez, J.; Mesas-Carrascosa, F.J.; Santesteban, L.G.; Jiménez-Brenes, F.M.; Oneka, O.; Villa-Llop, A.; Loidi, M.; López-Granados, F. Grape cluster detection using UAV photogrammetric point clouds as a low-cost tool for yield forecasting in vineyards. Sensors 2021, 21, 3083. [Google Scholar] [CrossRef]
García-Fernández, M.; Sanz-Ablanedo, E.; Pereira-Obaya, D.; Rodríguez-Pérez, J.R. Vineyard pruning weight prediction using 3D point clouds generated from UAV imagery and structure from motion photogrammetry. Agronomy 2021, 11, 2489. [Google Scholar] [CrossRef]
Vrochidou, E.; Bazinas, C.; Manios, M.; Papakostas, G.A.; Pachidis, T.P.; Kaburlasos, V.G. Machine vision for ripeness estimation in viticulture automation. Horticulturae 2021, 7, 282. [Google Scholar] [CrossRef]
Badeka, E.; Karapatzak, E.; Karampatea, A.; Bouloumpasi, E.; Kalathas, I.; Lytridis, C.; Tziolas, E.; Taskalidou, V.N.; Kaburlasos, V.G. A deep learning approach for precision viticulture, assessing grape maturity via YOLOv7. Sensors 2023, 23, 8126. [Google Scholar] [CrossRef]
Palacios, F.; Melo-Pinto, P.; Diago, M.P.; Tardaguila, J. Deep learning and computer vision for assessing the number of actual berries in commercial vineyards. Biosyst. Eng. 2022, 218, 175–188. [Google Scholar] [CrossRef]
Íñiguez, R.; Gutiérrez, S.; Poblete-Echeverría, C.; Hernández, I.; Barrio, I.; Tardáguila, J. Deep learning modelling for non-invasive grape bunch detection under diverse occlusion conditions. Comput. Electron. Agric. 2024, 226, 109421. [Google Scholar] [CrossRef]
Michael, K.; Andreou, C.; Markou, A.; Christoforou, M.; Nikoloudakis, N. A Novel Sorbitol-Based Flow Cytometry Buffer Is Effective for Genome Size Estimation across a Cypriot Grapevine Collection. Plants 2024, 13, 733. [Google Scholar] [CrossRef]
Oliveira, H.M.; Tugnolo, A.; Fontes, N.; Marques, C.; Geraldes, Á.; Jenne, S.; Zappe, H.; Graça, A.; Giovenzana, V.; Beghi, R.; et al. An autonomous Internet of Things spectral sensing system for in-situ optical monitoring of grape ripening: Design, characterization, and operation. Comput. Electron. Agric. 2024, 217, 108599. [Google Scholar] [CrossRef]
Quiñones, R.; Banu, S.M.; Gultepe, E. GCNet: A Deep Learning Framework for Enhanced Grape Cluster Segmentation and Yield Estimation Incorporating Occluded Grape Detection with a Correction Factor for Indoor Experimentation. J. Imaging 2025, 11, 34. [Google Scholar] [CrossRef]
Codes-Alcaraz, A.M.; Furnitto, N.; Sottosanti, G.; Failla, S.; Puerto, H.; Rocamora-Osorio, C.; Freire-Garcia, P.; Rimírez-Cuesta, J.M. Automatic grape cluster detection combining YOLO model and remote sensing imagery. Remote Sens. 2025, 17, 243. [Google Scholar] [CrossRef]
González, M.R.; Martínez-Rosas, M.E.; Brizuela, C.A. Comparison of CNN architectures for single grape detection. Comput. Electron. Agric. 2025, 231, 109930. [Google Scholar] [CrossRef]
Wu, Z.; Xia, F.; Zhou, S.; Xu, D. A method for identifying grape stems using keypoints. Comput. Electron. Agric. 2023, 209, 107825. [Google Scholar] [CrossRef]
Blekos, A.; Chatzis, K.; Kotaidou, M.; Chatzis, T.; Solachidis, V.; Konstantinidis, D.; Dimitropoulos, K. A grape dataset for instance segmentation and maturity estimation. Agronomy 2023, 13, 1995. [Google Scholar] [CrossRef]
Sneha, N.; Sundaram, M.; Ranjan, R. Acre-scale grape bunch detection and predict grape harvest using YOLO deep learning Network. SN Comput. Sci. 2024, 5, 250. [Google Scholar] [CrossRef]
Ilyas, Q.M.; Ahmad, M.; Mehmood, A. Automated estimation of crop yield using artificial intelligence and remote sensing technologies. Bioengineering 2023, 10, 125. [Google Scholar] [CrossRef]
Luo, L.; Yin, W.; Ning, Z.; Wang, J.; Wei, H.; Lu, Q. In-field pose estimation of grape clusters with combined point cloud segmentation and geometric analysis. Comput. Electron. Agric. 2022, 200, 107197. [Google Scholar] [CrossRef]
Kodors, S.; Zarembo, I.; Lācis, G.; Litavniece, L.; Apeināns, I.; Sondors, M.; Pacejs, A. Autonomous Yield Estimation System for Small Commercial Orchards Using UAV and AI. Drones 2024, 8, 734. [Google Scholar] [CrossRef]
Palacios, F.; Diago, M.P.; Melo-Pinto, P.; Tardaguila, J. Early yield prediction in different grapevine varieties using computer vision and machine learning. Precis. Agric. 2023, 24, 407–435. [Google Scholar] [CrossRef]
Schieck, M.; Krajsic, P.; Loos, F.; Hussein, A.; Franczyk, B.; Kozierkiewicz, A.; Pietranik, M. Comparison of deep learning methods for grapevine growth stage recognition. Comput. Electron. Agric. 2023, 211, 107944. [Google Scholar] [CrossRef]
Zheng, S.; Gao, P.; Zhang, J.; Zhang, J.; Ma, Z.; Chen, S. A precise grape yield prediction method based on a modified DCNN model. Comput. Electron. Agric. 2024, 225, 109338. [Google Scholar] [CrossRef]
Yang, C.; Geng, T.; Peng, J.; Xu, C.; Song, Z. Mask-GK: An efficient method based on mask Gaussian kernel for segmentation and counting of grape berries in field. Comput. Electron. Agric. 2025, 234, 110286. [Google Scholar] [CrossRef]
Sarkar, S.; Dey, A.; Pradhan, R.; Sarkar, U.M.; Chatterjee, C.; Mondal, A.; Mitra, P. Crop Yield Prediction Using Multimodal Meta-Transformer and Temporal Graph Neural Networks. IEEE Trans. AgriFood Electron. 2024, 2, 545–553. [Google Scholar] [CrossRef]
Sozzi, M.; Cantalamessa, S.; Cogato, A.; Kayad, A.; Marinello, F. Grape yield spatial variability assessment using YOLOv4 object detection algorithm. Comput. Electron. Agric. 2021, 188, 106345. [Google Scholar]

Figure 1. Flowchart of the grape yield estimation model. The pipeline operates through four integrated stages, where different colors represent distinct functional components. Blue blocks indicate the input processing and final output stages. Green blocks represent the backbone network, which utilizes enhanced C2f modules and GrapeConv for feature extraction. Purple and orange blocks signify the neck and head architecture employing PAN and CSP2AS modules for multiscale feature fusion. This integrated system combines object detection outputs with the Kimi-VL vision language model to refine cluster counts and estimate weights based on cross-modal reasoning.

Figure 2. Representative samples from grape dataset collection and annotation process. (a,c) Raw images of grape clusters captured under natural vineyard lighting conditions, illustrating the complexity of occlusion and background foliage. (b) The geometric center of each visible berry is marked with a red plus sign for initial counting tasks. (d) The final attribute estimation overlay, where the numerical values represent the weight contribution factors assigned by the model to individual detection units. This multi-step processing enables precise yield calculation from 2D images.

Figure 3. Semi-automated annotated grape dataset. (a,b) show wide-angle scenes processed to isolate grape bunches, with individual berries annotated to train the dense object counting module. (c) and (d) illustrate detailed attribute annotations for yield estimation. In (b), the blue frames delineate the spatial boundaries of grape clusters, and in (d) the red frames indicate the detection zones. The numerical values displayed represent the specific weight contribution factors assigned to visible berries. This hierarchical annotation allows the model to learn relationships between visual features and yield parameters.

Figure 4. Improved YOLOv11 algorithm flowchart. The pipeline is color-coded to represent different functional stages of the model. The green section illustrates the enhanced backbone incorporating GrapeConv and varying C2f modules to handle irregular cluster shapes. The blue section shows the grape neck utilizing C2f-LSKA and DySample for efficient upsampling and feature fusion. The yellow section represents the detection head designed for anchor-free multiscale prediction to accurately localize small-to-large grape clusters. Finally, the red dashed box delineates the yield estimation stage, where size estimation and weight mapping are performed to calculate the final yield.

Figure 5. Schematic of the cross-modal weight estimation process using Kimi-VL. The system combines visual data with text prompts constructed from external lexicons and internal lexicons. Kimi-VL processes these multimodal inputs to reason and output a final estimated weight for the specific cluster, mimicking expert viticulturist assessment.

Figure 6. Annotation examples from each dataset. The pink bounding boxes indicate detected grape clusters, with their associated confidence scores displayed at the top of each box. These results demonstrate the ability of the model to accurately localize clusters even under challenging conditions characterized by dense foliage and overlapping fruit, which are critical factors for accurate yield counting.

Figure 7. Performance comparison by object detection model for grape cluster recognition. The bar chart presents the precision, recall, F1 score, and mean average precision at 0.5 IoU for YOLOv8, YOLOv9, YOLOv10, the baseline YOLOv11, and the proposed YOLOv11-IMP. The proposed model achieves the highest performance across all metrics, with precision reaching 94.3% and recall at 93.5%. These results highlight the effectiveness of the architectural improvements over previous YOLO iterations.

Figure 8. Yield estimation performance by grape variety. The chart compares the root mean squared error, mean absolute error, and mean relative error per vine. Pinot Noir exhibits the lowest relative error of 5.3% due to its uniform cluster morphology, while all varieties maintain relative errors below 7%. This confirms the adaptability of the model to different cultivars.

Figure 9. Component-wise ablation results for YOLOv11-IMP (part I). The chart illustrates the progressive reductions in the mean absolute error and root mean squared error as specific modules are added to the baseline. These modules include the spatial attention module, multiscale feature fusion, dilated convolutional blocks, anchor-free detection head, and scale-aware loss function. The full YOLOv11-IMP configuration, shown on the right, achieves the lowest error rates, validating the contributions of each component.

Figure 10. Component-wise ablation results for YOLOv11-IMP (part II). This chart compares the inference time and mAP for each incremental model configuration. While the inference time slightly improves or remains stable at around 29–32 ms, the mAP steadily increases with each added component, culminating in 91.2% mAP for the final YOLOv11-IMP model, representing the optimal trade-off between speed and accuracy.

Table 1. YOLOv11-IMP yield estimates vs. ground truth measurements by grape variety.

Grape Variety	Sample Size	Manual Measurement	YOLOv11-IMP Estimate	RMSE	MAE	Relative Error	Pearson’s r
Grape Variety	Sample Size	(kg/vine)	(kg/vine)	(kg/vine)	(kg/vine)	(%)	Pearson’s r
Cabernet Sauvignon	157	8.72	8.47	0.52	0.41	5.8	0.941
Pinot Noir	142	7.63	7.42	0.48	0.37	5.3	0.957
Chardonnay	168	9.34	8.97	0.61	0.49	6.7	0.923
Merlot	145	8.21	7.95	0.55	0.43	6.1	0.932
Sauvignon Blanc	151	8.89	8.57	0.58	0.45	6.4	0.928
Average	152.6	8.56	8.28	0.55	0.43	6.06	0.936

Table 2. Comprehensive performance comparison of YOLOv11-IMP vs. SOTA methods.

Method	MAE (kg/vine)	RMSE (kg/vine)	Accuracy (%)	Processing Time (ms)	GPU Memory (GB)
Faster R-CNN + ResNet101	0.72	0.93	83.2	44.3	5.8
RetinaNet + ResNeXt101	0.68	0.88	85.6	38.2	5.2
EfficientDet-D2	0.66	0.85	86.7	36.5	4.9
YOLOv11 (baseline)	0.61	0.79	87.3	32.4	4.5
YOLOv11-IMP (proposed)	0.46	0.62	91.2	28.9	3.8

Table 3. Performance under varying environmental conditions.

Environmental Factor	Condition	MAE (kg/vine)	Accuracy (%)
Illumination	Direct sunlight	0.48	90.3
	Partial shade	0.47	90.8
	Overcast	0.46	91.5
	Dawn/dusk	0.52	88.7
Canopy Density	Sparse	0.44	92.1
	Medium	0.45	91.2
	Dense	0.5	89.5
Growth Stage	Pre-veraison	0.48	90.1
	Veraison	0.47	91
	Post-veraison	0.44	91.8

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zheng, S.; Yang, X.; Gao, P.; Guo, Q.; Zhang, J.; Chen, S.; Tang, Y. YOLOv11-IMP: Anchor-Free Multiscale Detection Model for Accurate Grape Yield Estimation in Precision Viticulture. Agronomy 2026, 16, 370. https://doi.org/10.3390/agronomy16030370

AMA Style

Zheng S, Yang X, Gao P, Guo Q, Zhang J, Chen S, Tang Y. YOLOv11-IMP: Anchor-Free Multiscale Detection Model for Accurate Grape Yield Estimation in Precision Viticulture. Agronomy. 2026; 16(3):370. https://doi.org/10.3390/agronomy16030370

Chicago/Turabian Style

Zheng, Shaoxiong, Xiaopei Yang, Peng Gao, Qingwen Guo, Jiahong Zhang, Shihong Chen, and Yunchao Tang. 2026. "YOLOv11-IMP: Anchor-Free Multiscale Detection Model for Accurate Grape Yield Estimation in Precision Viticulture" Agronomy 16, no. 3: 370. https://doi.org/10.3390/agronomy16030370

APA Style

Zheng, S., Yang, X., Gao, P., Guo, Q., Zhang, J., Chen, S., & Tang, Y. (2026). YOLOv11-IMP: Anchor-Free Multiscale Detection Model for Accurate Grape Yield Estimation in Precision Viticulture. Agronomy, 16(3), 370. https://doi.org/10.3390/agronomy16030370

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

YOLOv11-IMP: Anchor-Free Multiscale Detection Model for Accurate Grape Yield Estimation in Precision Viticulture

Abstract

1. Introduction

2. Materials and Methods

2.1. Improved YOLOv11 Model

2.1.1. Input Processing

2.1.2. Enhanced Backbone Architecture

2.1.3. Neck and Head Architecture

2.2. Model Training and Validation

2.2.1. Yield Estimation Framework

2.2.2. Grape Yield Estimation

2.2.3. Ground Truth Comparison

3. Results

3.1. Grape Cluster Recognition Performance

3.2. Grape Yield Estimation Accuracy

3.3. Benchmarking Against SOTA Detection Methods

3.3.1. Performance Metric Evaluation

3.3.2. Environmental Adaptability Assessment

3.4. Ablation Studies and Architectural Insights

3.4.1. Component-Wise Contribution Analysis

3.4.2. Computational Efficiency Analysis

4. Discussion

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI