Article

GBDR-Net: A YOLOv10-Derived Lightweight Model with Multi-Scale Feature Fusion for Accurate, Real-Time Detection of Grape Berry Diseases

1
School of Emergency Equipment, North China Institute of Science and Technology, Langfang 065201, China
2
Hebei Key Laboratory of Safety Monitoring of Mining Equipment, Langfang 065201, China
3
School of Mechanical and Electrical Engineering, Huainan Normal University, Huainan 232038, China
4
School of Mechanical Engineering, University of Science and Technology Beijing, Beijing 100083, China
*
Author to whom correspondence should be addressed.
Horticulturae 2026, 12(1), 38; https://doi.org/10.3390/horticulturae12010038
Submission received: 18 November 2025 / Revised: 17 December 2025 / Accepted: 26 December 2025 / Published: 28 December 2025
(This article belongs to the Section Viticulture)

Abstract

Grape berries are highly susceptible to diseases during growth and harvest, which severely impacts yield and postharvest quality. While rapid and accurate disease detection is essential for real-time control and optimized management, it remains challenging due to complex symptom patterns, occlusions in dense clusters, and orchard environmental interference. Although deep learning presents a viable solution, robust methods specifically for detecting grape berry diseases under dense clustering conditions are still lacking. To bridge this gap, we propose GBDR-Net—a high-accuracy, lightweight, and deployable model based on YOLOv10. It incorporates four key enhancements: (1) an SDF-Fusion module replaces the original C2f module in deeper backbone layers to improve global context and subtle lesion feature extraction; (2) an additional Detect-XSmall head is integrated at the neck, with cross-concatenated outputs from SPPF and PSA modules, to enhance sensitivity to small disease spots; (3) the nearest-neighbor upsampling is substituted with a lightweight content-aware feature reassembly operator (LCFR-Op) for efficient and semantically aligned multi-scale feature enhancement; and (4) the conventional bounding box loss function is replaced with Inner-SIoU loss to accelerate convergence and improve localization accuracy. Evaluated on the Grape Berry Disease Visual Analysis (GBDVA) dataset, GBDR-Net achieves a precision of 93.4%, recall of 89.6%, mAP@0.5 of 90.2%, and mAP@0.5:0.95 of 86.4%, with a model size of only 4.83 MB, computational cost of 20.5 GFLOPs, and a real-time inference speed of 98.2 FPS. It outperforms models such as Faster R-CNN, SSD, YOLOv6s, and YOLOv8s across key metrics, effectively balancing detection accuracy with computational efficiency. This work provides a reliable technical solution for the intelligent monitoring of grape berry diseases in horticultural production. The proposed lightweight architecture and its design focus on dense, small-target detection offer a valuable framework that could inform the development of similar systems for other cluster-growing fruits and vegetables.

1. Introduction

Grapes are important fruit crops that combine nutritional and economic value, playing a pivotal role in agricultural production. However, as berry plants, grapes are highly susceptible to various diseases during the growth and harvest seasons, often leading to yield reduction and quality deterioration with substantial economic losses [1,2]. For instance, fruit diseases such as anthracnose and gray mold not only directly compromise berry appearance and food safety, but also cause yield losses exceeding 20% during the harvest period due to fruit rotting and shedding, with severe cases even leading to total crop loss [3]. Currently, chemical control remains the primary approach to managing grape diseases, but excessive reliance on pesticides may not only induce pathogen resistance but also pose prominent issues such as pesticide residues and environmental pollution [4,5]. Notably, grape berries exhibit significant differences in disease susceptibility across different growth stages. During harvest, the dense clustering and mutual contact of berries, coupled with restrictions on chemical pesticide use, further facilitate disease spread [6]. Therefore, achieving accurate and rapid detection of fruit diseases at this stage is of great significance for ensuring the quality of fresh-eating and processed products, formulating timely harvest strategies, and optimizing postharvest grading processes.
Currently, the detection of grape berry diseases primarily relies on manual observation and empirical judgment by growers. This approach is not only costly and inefficient during the harvest season but is also highly susceptible to interference from variable lighting, occlusion within fruit clusters, and complex field conditions, leading to a high rate of misclassification. Consequently, it fails to meet the demands of modern viticulture for rigorous quality control and precise disease management [7,8,9]. The introduction of machine learning has significantly improved disease diagnosis through computer vision and image processing techniques. However, conventional methods often depend on handcrafted feature extraction algorithms, which typically exhibit limited generalization and poor robustness when identifying multiple diseases under complex field conditions. In practical scenarios characterized by overlapping berries, varying lesion sizes, and uneven illumination, these methods are prone to missed detections and false positives, thereby hindering the widespread application of computer vision in large-scale, automated harvest sorting [10,11].
The advent of deep learning has brought transformative changes to this field. Unlike traditional computer vision methods that depend on handcrafted features, deep learning utilizes its powerful end-to-end learning capability to automatically extract hierarchical discriminative features directly from raw images. This approach effectively captures subtle visual patterns of diseases, thereby significantly improving detection accuracy and efficiency [12,13,14]. The emergence of lightweight network architectures has been particularly impactful, reducing computational complexity and parameter counts. This optimization enables real-time inference on resource-constrained devices, such as embedded systems or mobile platforms, providing feasible technical support for rapid in-field disease diagnosis in agricultural production [15,16]. Consequently, such methods have been extensively adopted for crop disease detection tasks. In grape disease research, scholars have actively explored and refined various deep learning models to overcome detection challenges in complex orchard environments, continually enhancing their practicality and robustness. For instance, Wu et al. [17] developed GC-MobileNet, a model based on MobileNetV3, for efficient classification and fine-grained severity assessment of grape leaf diseases. By integrating Ghost modules to replace certain inverted residual structures, the model significantly reduced its parameter count while enhancing feature extraction efficiency. The incorporation of the CBAM attention mechanism strengthened spatial and channel feature representation, and the use of the LeakyReLU activation function helped retain both positive and negative feature information. Combined with transfer learning and data augmentation strategies, these improvements enabled the model to achieve a classification accuracy of 98.63%. To address detection in complex environments, Cai et al. [18] proposed a Siamese network (Siamese DWOAM-DRNet). This model employs a dual-factor weight optimization attention mechanism (DWOAM) to enhance disease feature extraction and suppress background interference. It also utilizes diverse branch residual modules (DRM) to enrich feature representation and adopts a combined loss function to improve discrimination between similar diseases. Experimental results demonstrated a detection accuracy of 93.26%, confirming the model’s effectiveness in classifying disease images under natural conditions. Zhang et al. [19] introduced DLVTNet, a lightweight model for grape leaf disease detection. They innovatively designed an LVT module, which combines Ghost and Transformer structures to collaboratively extract and fuse multi-scale local and global contextual features. Furthermore, dense connections between the LVT and MARI modules enhanced feature richness and improved the perception and localization of lesion areas. On the New Plant Diseases dataset, the model attained an average detection accuracy of 98.48%.
Despite significant progress in deep learning for plant disease detection, substantial challenges persist in the detection of grape berry diseases. Firstly, most existing studies focus on grape leaves, with relatively scarce research dedicated specifically to berry diseases. The lack of high-quality public datasets and practical application cases further hinders progress. As evidenced by existing works (e.g., [17,18,19]), which primarily target branches and leaves, insufficient attention has been paid to berry detection under complex field conditions, thereby limiting methodological advancement and model generalization in this area. Secondly, the performance of deep learning-based detection fundamentally depends on the model’s ability to learn sufficiently rich and discriminative features. This is particularly challenging for grape berries due to their small size and the complex diversity of disease manifestations, which make detection more difficult than for common leaf diseases. Early symptoms, such as subtle spots, depressions, or mold layers, are characterized by small scales and low contrast against healthy tissues, posing significant challenges to stable feature capture and precise localization. Moreover, common diseases like anthracnose, scab, and gray mold exhibit similar visual features in early stages, while lesion size, shape, and texture change considerably as the disease progresses, further complicating model discrimination and generalization. Finally, most existing models exhibit high structural complexity and computational costs, hindering efficient inference in resource-limited settings like orchard harvest sites. This limitation also restricts their deployment on mobile or embedded platforms for real-time sorting applications.
To address the aforementioned challenges and bridge the research gap in grape berry disease detection, this study proposes GBDR-Net, a detection model that integrates high accuracy, a lightweight design, and easy deployment. It aims to provide an effective technical pathway for the precise detection and real-time prevention of fruit diseases in natural environments. The design rationale of the proposed GBDR-Net model is summarized as follows: (1) built on the YOLOv10 framework, it incorporates the innovative SDF-Fusion module to enhance the backbone network’s capability of perceiving global context and subtle features; (2) an additional Detect-XSmall detection head is introduced to strengthen the recognition sensitivity of faint lesions, while the cross-concatenation strategy is adopted to achieve efficient fusion of multi-scale features; (3) a lightweight content-aware feature rearrangement operator replaces traditional upsampling methods to improve the semantic alignment quality of small-scale disease features; and (4) the traditional bounding box loss function is replaced with the Inner-SIoU loss, effectively improving the model’s convergence speed and localization accuracy. To evaluate its performance, we trained GBDR-Net on a dataset collected under field conditions and compared it with several common models.
This study provides an effective technical tool for the intelligent monitoring and precise prevention of grape berry diseases. When integrated with smart agriculture platforms and precision spraying systems, the GBDR-Net model significantly enhances the timeliness and accuracy of disease management, facilitating real-time detection and precise removal of diseased fruits during harvest to safeguard grape yield and post-harvest quality. Furthermore, this research improves the stability and resilience of vineyard production systems, contributing to the sustainable development of the horticultural industry.

2. Materials and Methods

2.1. Image Acquisition

This study focuses on four common grape berry diseases: black mold, canker, powdery mildew, and sour rot. Image data encompassing these diseases across multiple grape cultivars were systematically acquired to support the development of a dedicated dataset. The images were collected at Beijing Changping Aroma Grape Garden (116°9′24.808″ E, 40°11′53.959″ N). The grape garden cultivates several popular cultivars, including ‘Kyoho’, ‘Summer Black’, and ‘Red Globe’, thereby enabling the collection of a representative sample set. The annual grape harvest in this region occurs from August to September, a period characterized by a climatic transition from hot, rainy summers to cool, dry autumns with increasing diurnal temperature variation. To enhance model robustness and environmental adaptability, image acquisition spanned multiple time periods from early August to late September, under diverse weather conditions including clear skies, light rain, post-rain overcast conditions, and winds below Beaufort scale 4. A total of 2270 high-resolution images (4000 × 4000 pixels, 1:1 aspect ratio) of diseased grapes were acquired using a Canon EOS 200D II camera. Figure 1 presents representative images for each of the four disease classes.

2.2. Image Data Processing

To ensure effective model training, we constructed the Grape Berry Disease Visual Analysis Dataset (GBDVA) by selecting 950 high-quality images from the collected set through stratified random sampling. These 950 images formed the training set for this study. To optimize computational resource utilization, all 950 images were uniformly resized to 640 × 640 pixels, which preserves detail clarity while reducing storage requirements. Subsequently, the LabelImg v1.8.1 tool was employed for annotation, generating a standardized dataset format compatible with the YOLO object detection framework. To mitigate potential model overfitting or underfitting caused by imbalanced class distribution, a multi-faceted data augmentation strategy was employed, incorporating techniques such as horizontal/vertical flipping, random angle rotation, Gaussian noise injection, Gaussian blur filtering, and random brightness adjustment. This process expanded the dataset from 950 to 5750 images, thereby enhancing sample diversity and significantly improving the model’s robustness and generalization capability under complex conditions. To prevent data leakage between subsets, the remaining 1320 original images were uniformly resized to 640 × 640 pixels, after which 924 images were selected through stratified random sampling to form the validation set, with the remaining 396 images allocated to the test set. The distribution of sample image counts across each subset is detailed in Table 1.
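The paper lists the augmentation types but not the library or parameter ranges used. The sketch below is therefore an illustrative, hedged offline pipeline using albumentations, with probabilities, value ranges, and file names chosen only for demonstration; bounding boxes are transformed jointly with the images so that annotations stay valid.

```python
# Illustrative offline augmentation pipeline for the GBDVA training images.
# Library choice (albumentations) and all parameter values are assumptions;
# the paper specifies only the transform types.
import albumentations as A
import cv2

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                                   # horizontal flipping
        A.VerticalFlip(p=0.5),                                     # vertical flipping
        A.Rotate(limit=30, p=0.5),                                 # random-angle rotation
        A.GaussNoise(p=0.3),                                       # Gaussian noise injection
        A.GaussianBlur(blur_limit=(3, 7), p=0.3),                  # Gaussian blur filtering
        A.RandomBrightnessContrast(brightness_limit=0.2,
                                   contrast_limit=0.0, p=0.5),     # random brightness adjustment
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = cv2.cvtColor(cv2.imread("grape_berry.jpg"), cv2.COLOR_BGR2RGB)  # hypothetical file
boxes = [[0.52, 0.48, 0.10, 0.12]]     # YOLO format (cx, cy, w, h), normalized
labels = [2]                            # hypothetical class index
out = augment(image=image, bboxes=boxes, class_labels=labels)
aug_image, aug_boxes = out["image"], out["bboxes"]
```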
To ensure the quality and consistency of the GBDVA dataset, a standardized annotation protocol was implemented. All visible disease lesions were delineated with bounding boxes by three annotators following a unified guideline. For the purpose of this study, lesion progression was pragmatically categorized into two stages within the annotation metadata: “Early-stage” (characterized by small, isolated spots or slight discoloration) and “Developed-stage” (marked by larger, coalesced, or pronounced symptomatic areas). To quantitatively assess the labeling consistency, a subset of 150 images was independently annotated. The inter-annotator agreement, measured by the mean Average Precision (AP) at an Intersection-over-Union (IoU) threshold of 0.5, reached 0.85, indicating a high level of consensus. Furthermore, the environmental variability integral to the dataset—encompassing diverse lighting conditions (e.g., full sun, shadow), weather scenarios (e.g., clear, post-rain), and temporal contexts as described in Section 2.1—ensures its representativeness for real-world field applications and enhances model robustness.

2.3. GBDR-Net Model Construction

2.3.1. Overview of the Model Architecture

The accurate detection of grape berry diseases represents a critical task in smart agriculture. The challenge is compounded by the diversity of disease types and their visual manifestations, which can include irregular spots, discoloration, textural degradation, and shape deformities, thereby substantially complicating detection and classification. Furthermore, vineyard conditions pose additional obstacles, such as dense foliage and occluded fruit clusters. Variations in natural lighting and complex backgrounds can also severely compromise the accuracy of image acquisition and analysis. These practical challenges collectively demand intelligent detection models with high accuracy, strong generalization capability, and resilience to environmental variations.
To address the aforementioned challenges, this study adopts YOLOv10 [20] as the base framework for grape berry disease detection. As a significant milestone in the YOLO series, YOLOv10 achieves a remarkable end-to-end breakthrough in real-time object detection by eliminating its reliance on non-maximum suppression (NMS) for post-processing. Its high efficiency and accuracy make it exceptionally suitable for deployment in field scenarios with limited computational resources. The model’s core innovation is its consistent dual assignment strategy. This approach synergistically leverages both one-to-many and one-to-one branches during training. The one-to-many branch enriches supervision signals by assigning multiple predictions to a single ground-truth object, thereby enhancing learning accuracy. Concurrently, the one-to-one branch employs a one-to-one matching strategy, which allows the model to directly generate optimal predictions at inference time without depending on NMS. This design not only reduces inference latency but also resolves the long-standing discrepancy between the training and inference pipelines. Consequently, it establishes a solid foundation for highly robust real-time detection in field conditions. Architecturally, YOLOv10 incorporates comprehensive and systematic optimizations for both efficiency and accuracy. Its backbone employs an enhanced CSPNet to improve gradient flow, while the neck integrates a Path Aggregation Network (PAN) for effective multi-scale feature fusion. Furthermore, the model includes a suite of lightweight and enhancement techniques—such as a lightweight classification head, spatial-channel decoupled downsampling, large-kernel convolution, and partial self-attention modules—providing an advanced and efficient starting point for subsequent task-specific optimizations.
The overall architecture of GBDR-Net, the grape berry disease recognition model constructed in this study by improving YOLOv10, is illustrated in Figure 2. Its backbone network primarily consists of fundamental modules such as SCDown, C2f, and SDF-Fusion, which are responsible for efficiently extracting multi-scale features from input images while minimizing information loss. For the neck network, the model adopts a Path Aggregation Network (PANet) and cross-stage partial connection structure to fuse deep semantic features and shallow localization feature maps from the backbone, thereby enhancing the multi-scale perception capability for subtle lesion features. The detection head adopts a decoupled design. Specifically, the main improvements of the model are as follows: In the backbone network, the innovative SDF-Fusion modules replace the original C2f modules in the deep network, aiming to enhance the model’s ability to extract global contextual information and subtle lesion features. An additional Detect-XSmall detection head is added at the end of the neck, specifically for capturing lesion targets with smaller pixel areas and weaker features, significantly improving the model’s perceptual sensitivity to small-scale diseases. A novel Cross-concatenation strategy is introduced in the neck network, where the output of the SPPF module at the end of the backbone network is cross-concatenated with the output of the PSA module, serving jointly as inputs for subsequent upsampling and detection heads. Furthermore, a lightweight content-aware feature reassembly operator (LCFR-Op) is proposed to replace the nearest neighbor interpolation as the new upsampling operator, achieving enhanced semantic alignment of multi-scale disease features by dynamically reorganizing convolution kernels. The Inner-SIoU loss function is adopted to replace the original CIoU loss function, endowing the model with faster convergence speed and higher final localization accuracy. Through the collaborative design of lightweight modules and attention mechanisms, the GBDR-Net model can effectively improve the detection accuracy and robustness for grape berry disease regions while maintaining high inference speed.
The data transmission and processing workflow of the GBDR-Net model operates as follows. Preprocessed images are first fed into the backbone network. Here, efficient downsampling is performed by the SCDown module, while multi-scale features are progressively extracted through the C2f and SDF-Fusion modules, enhancing the model’s perception of both detailed symptoms and global context. Subsequently, the neck network, based on a Path Aggregation Network (PANet), merges deep semantic features with shallow localization maps. A novel cross-concatenation strategy further reorganizes the outputs from the SPPF and PSA modules, strengthening multi-scale representations. These refined feature maps are then upsampled by the LCFR-Op operator to achieve precise semantic alignment before being routed to the decoupled detection heads (including Detect-XSmall). These heads directly output the category, confidence, and spatial coordinates of disease targets. Optimized by the Inner-SIoU loss function, the entire pipeline eliminates the need for non-maximum suppression (NMS), achieving end-to-end efficient inference and real-time detection.

2.3.2. SDF-Fusion Module

This study proposes a novel custom module, SDF-Fusion, which incorporates an advanced bottleneck structure based on Fast Context Attention (FCA). The FCA module, an enhanced version derived from the FasterNet [21] architecture, employs a bottleneck comprising partial convolution, two sequential 1 × 1 convolutions, and cross-layer residual connections. This structural design is pivotal for significantly boosting inference speed and computational efficiency. Furthermore, to enhance the model’s capacity for perceiving subtle disease features in grape berries, the module integrates the Context Anchor Attention (CAA) mechanism [22], a functional sub-unit originating from the Poly Kernel Inception Network. The SiLU [23] activation function is also adopted to ensure smoother gradient flow, which accelerates model convergence and enhances the model’s overall performance.
As depicted in Figure 3, the FCA module employs a 3 × 3 spatial mixing operation for selected input channels. This design significantly reduces computational complexity compared to standard convolutional operations. A key component of the module is a feed-forward network (FFN) that integrates two pointwise convolution (PWConv) layers with the CAA mechanism. This integration enables the capture of long-range contextual dependencies, thereby strengthening feature representation. This capability proves particularly beneficial in complex scenarios containing multiple objects of the same category. The CAA mechanism operates by first extracting local features using average pooling and pointwise convolution. It then utilizes two lightweight depthwise separable strip convolutions to efficiently emulate the receptive field of a large-kernel depthwise convolution. This approach substantially reduces the parameter count compared to a conventional 2D depthwise convolution of kernel size $k_b$. The mechanism employs a weight matrix, $A_{l-1} \in \mathbb{R}^{C_{l-1} \times H_{l-1} \times W_{l-1}}$, to quantify channel-wise importance. The detailed computational workflow is shown in Figure 3 and formalized by the equations below:
$F_{l-1}^{pool} = \mathrm{Conv}_{1\times 1}\left(P_{avg}(X_{l-1})\right)$
$F_{l-1}^{w} = \mathrm{DWConv}_{1\times k_b}\left(F_{l-1}^{pool}\right), \quad F_{l-1}^{h} = \mathrm{DWConv}_{k_b\times 1}\left(F_{l-1}^{w}\right)$
$A_{l-1} = \mathrm{Sigmoid}\left(\mathrm{Conv}_{1\times 1}\left(F_{l-1}^{h}\right)\right)$
The CAA mechanism adaptively calibrates input features by computing channel-wise attention weights. This mechanism generates a weight vector from global contextual information. These weights are then used to enhance the feature map through channel-wise multiplication. The calibrated features are combined with the original input via skip connections, forming a residual learning unit. This structure mitigates information loss through feature reuse and promotes smooth gradient flow, which stabilizes training and facilitates the construction of deeper networks. Collectively, these properties substantially improve the model’s representational capacity and its performance on complex tasks.
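A minimal PyTorch sketch of the CAA branch described by the equations above is given below. The strip-kernel size $k_b$, the average-pooling window, and the module interface are assumptions chosen only to make the example concrete; they are not values reported in the paper.

```python
# Hedged sketch of the Context Anchor Attention (CAA) branch:
# pooling -> 1x1 conv -> 1 x k_b and k_b x 1 depthwise strip convs -> 1x1 conv -> Sigmoid,
# followed by channel-wise calibration and a residual (skip) connection.
import torch
import torch.nn as nn

class CAA(nn.Module):
    def __init__(self, channels: int, kb: int = 11):  # kb is an assumed kernel size
        super().__init__()
        self.avg_pool = nn.AvgPool2d(kernel_size=7, stride=1, padding=3)   # P_avg
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)          # Conv_1x1
        # Two strip depthwise convolutions emulate a kb x kb receptive field cheaply.
        self.dw_h = nn.Conv2d(channels, channels, (1, kb), padding=(0, kb // 2), groups=channels)
        self.dw_v = nn.Conv2d(channels, channels, (kb, 1), padding=(kb // 2, 0), groups=channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.conv1(self.avg_pool(x))      # F_pool
        f = self.dw_v(self.dw_h(f))           # F_w, then F_h
        attn = self.act(self.conv2(f))        # attention weights A
        return x + x * attn                   # calibration plus skip connection

x = torch.randn(1, 256, 20, 20)
print(CAA(256)(x).shape)                      # torch.Size([1, 256, 20, 20])
```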

2.3.3. XSmall Detection Head

The original YOLO model conducts progressive downsampling on feature maps through five stages (P1 to P5). For an input image size of 640 × 640 pixels, the resolutions of the feature maps fed into the detection heads are sequentially reduced to 80 × 80 (stage P3), 40 × 40 (stage P4), and 20 × 20 (stage P5). To enhance the detection capability for small-sized objects, we innovatively introduced the XSmall detection head architecture [24], which delivers a feature map with a high resolution of 160 × 160 pixels. By drastically reducing the number of downsampling levels to only two, this detection head effectively preserves richer and finer-grained feature information of small objects, mitigating potential detail loss associated with traditional multi-stage downsampling. As illustrated in Figure 2, the XSmall detection head performs structural feature concatenation and fusion with the feature map of the corresponding scale in the backbone network. By integrating this high-resolution detection head into the model’s neck network architecture, we facilitated the effective integration of finer-grained features, thereby significantly optimizing the multi-scale feature fusion performance in the feature pyramid. This improvement not only enhances the model’s representation capability for small-sized objects but also establishes a more hierarchical and semantically abundant feature representation through the complementary fusion of high-resolution and low-resolution features, ultimately boosting the comprehensive detection performance for objects of different scales in complex scenarios.
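The head resolutions quoted above follow directly from the network stride at each pyramid level. The short calculation below reproduces them for a 640 × 640 input; the stride-4 (P2) level feeding the XSmall head is inferred from the stated 160 × 160 resolution.

```python
# Feature-map resolutions feeding the detection heads for a 640 x 640 input.
input_size = 640
strides = {"P2 (XSmall)": 4, "P3": 8, "P4": 16, "P5": 32}
for name, stride in strides.items():
    side = input_size // stride
    print(f"{name}: {side} x {side}")
# P2 (XSmall): 160 x 160, P3: 80 x 80, P4: 40 x 40, P5: 20 x 20
```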

2.3.4. A Novel Cross-Concatenation Strategy

We introduced a novel cross-concatenation strategy in the neck network (as illustrated in Figure 2), designed to optimize the feature fusion mechanism in the object detection pipeline and enhance the detection performance of multi-scale objects. In contrast to YOLOv10, which merely feeds the output of the PSA module directly to the first upsampling block and the final feature stage prior to the large detection head, our strategy innovatively adopts the output of the SPPF module for cross-concatenation with the PSA module’s output while preserving the original connection path between the PSA module and the first upsampling block. The SPPF module effectively captures rich multi-scale contextual information by performing spatial pyramid pooling operations on feature maps across different scales—a trait particularly crucial for object detection tasks involving targets with significant size variations in grape berry disease images, thereby enhancing the model’s ability to perceive objects of diverse scales. The core value of this structural adjustment lies in its proactive integration of multi-scale features from the SPPF module, which effectively alleviates the loss of extensive contextual information potentially induced by traditional attention mechanisms in the later stages of the pipeline. This thereby lays a more comprehensive and detailed foundation for scene semantic understanding in the final detection layer, ultimately enhancing the model’s overall detection performance for targets in complex orchard scenes.

2.3.5. Lightweight Content-Aware Reassembly Operator

In object detection tasks, feature upsampling is a critical operation for enhancing feature map resolution and fusing multi-scale information [25]. Nearest-neighbor interpolation upsampling, employed in traditional methods such as YOLOv10, assigns values to each target position by directly replicating its nearest neighbor in the source feature map. While computationally efficient, it has notable limitations: the method performs interpolation solely based on pixel spatial positions, completely ignoring the semantic information embedded in feature tensors. Furthermore, its upsampling kernel typically has a small receptive field, hindering effective capture of global context. To address these bottlenecks, this study proposes a lightweight content-aware feature reassembly operator (LCFR-Op) as a novel upsampling operator. By dynamically reconstructing convolutional kernels, this operator enhances the semantic alignment of multi-scale disease features. During the feature reconstruction stage, it not only enables dynamic upsampling based on input data but also effectively expands the receptive field to capture richer contextual information. Moreover, it maintains low computational cost through an optimized computational architecture, providing an efficient solution for the model to achieve more accurate semantic alignment in the feature fusion stage. As illustrated in Figure 4, the LCFR-Op primarily consists of a kernel prediction module and a content-aware reassembly module.
Given the input tensor $X \in \mathbb{R}^{C \times H \times W}$, the LCFR-Op upsamples it to $X' \in \mathbb{R}^{C \times \xi H \times \xi W}$ via an integer scaling factor $\xi$. Each coordinate point $p' = (u', v')$ in the output feature map $X'$ maps to the corresponding coordinate $p = (u, v)$ in the input feature map $X$, where $u = u'/\xi$ and $v = v'/\xi$.
The Kernel Prediction Module
The core function of this module is to dynamically generate reassembly kernels. Specifically, it produces a dedicated reassembly kernel $W_{p'}$ for each target spatial location $p'$ in the output feature map. The generation process relies on contextual information provided by the input feature tensor $X$: the module extracts features from a local neighborhood $N(X_p, k_{encoder})$ of size $k_{encoder}$ centered at a location $p$ in the input features, and uses this information to predict the corresponding kernel. The complete mapping from local input features to the target reassembly kernel can be formally defined by the following mathematical expression:
$W_{p'} = \sigma\left(N(X_{p},\, k_{encoder})\right)$
where $\sigma$ denotes the kernel prediction module; $X_p$ represents the feature at pixel location $p$ in the input feature tensor $X$; and $k_{encoder}$ is defined by the relation $k_{encoder} = k_{up} - 2$, which specifies the size of the encoding convolution kernel.
The kernel prediction module consists of three sequential operations: channel compression, content encoding, and kernel normalization. In the channel compression stage, a 1 × 1 convolutional layer reduces the number of input channels from $C$ to $C_m$ to decrease the parameter count and computational overhead of subsequent steps. During content encoding, a convolutional layer with a kernel size of $k_{encoder}$ generates the reassembly kernels, whose parameter count amounts to $k_{encoder}^{2} \times C_m \times C_{up}$. Although increasing $k_{encoder}$ enlarges the receptive field to capture richer contextual information, it also leads to a quadratic rise in computational cost, necessitating a careful balance between performance and efficiency. Finally, in the kernel normalization stage, each $k_{up} \times k_{up}$ reassembly kernel is normalized via a Softmax function before use. This operation enforces a unit-sum constraint to achieve soft local selection and ensures the stability of the feature tensor’s mean throughout the LCFR-Op computation.
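A compact PyTorch sketch of this kernel prediction branch is shown below. The compressed channel width $C_m$, the kernel size $k_{up}$, and the upsampling factor are assumed values, and the pixel-shuffle arrangement mirrors common content-aware upsampling implementations rather than a detail stated in the paper.

```python
# Hedged sketch of the kernel prediction module:
# channel compression -> content encoding -> kernel normalization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelPredictor(nn.Module):
    def __init__(self, c_in: int, c_mid: int = 64, k_up: int = 5, scale: int = 2):
        super().__init__()
        k_enc = k_up - 2                                                   # k_encoder = k_up - 2
        self.scale = scale
        self.compress = nn.Conv2d(c_in, c_mid, kernel_size=1)             # channel compression C -> C_m
        self.encode = nn.Conv2d(c_mid, scale * scale * k_up * k_up,
                                kernel_size=k_enc, padding=k_enc // 2)    # content encoding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.encode(self.compress(x))        # (B, scale^2 * k_up^2, H, W)
        w = F.pixel_shuffle(w, self.scale)       # (B, k_up^2, scale*H, scale*W): one kernel per output location
        return F.softmax(w, dim=1)               # kernel normalization (unit-sum soft selection)
```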
Content-Aware Reassembly Module
This module employs the predicted kernel $W_{p'}$ from earlier as weighting parameters to perform a weighted reassembly of the features within the $k_{up} \times k_{up}$ neighborhood $N(X_p, k_{up})$ centered at position $p$ in the input tensor $X$, producing the reassembled feature value at the corresponding location $p'$ in the output feature map $X'$. The complete feature reassembly operation is defined by the following expression:
$X'_{p'} = \phi\left(N(X_{p},\, k_{up}),\, W_{p'}\right)$
where $\phi$ denotes the content-aware reassembly module, and $k_{up}$ represents the kernel size used for the reassembly operation.
For each target location $p'$ in the output feature map, the content-aware reassembly module redistributes features within the local neighborhood $N(X_p, k_{up})$ of the input feature map, which is centered at location $p$ with a radius $r = \lfloor k_{up}/2 \rfloor$. This is achieved by applying a dynamically predicted kernel $W_{p'}$ and a specific reassembly function $\phi$. The core mechanism adaptively adjusts the contribution of features at each position within the neighborhood using the learned weighting parameters, enabling a refined reconstruction of the feature map. This reassembly process can be precisely defined by the following mathematical expression:
$X'_{p'} = \sum_{n=-r}^{r}\sum_{m=-r}^{r} W_{p'}(n, m)\cdot X_{(u+n,\, v+m)}$
where $X'_{p'}$ denotes the output feature value at location $p'$. The prediction kernel $W_{p'}$ differentially weights pixels in the local neighborhood $N(X_p, k_{up})$ based on their feature content rather than spatial proximity alone, thereby determining the contribution of each feature to the final output $X'_{p'}$. This content-aware weighting mechanism enhances critical information during feature reassembly, resulting in a reconstructed feature map $X'$ that is semantically richer than the input $X$.
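Continuing from the KernelPredictor sketch above, the reassembly step can be written as a weighted sum over unfolded $k_{up} \times k_{up}$ neighborhoods. This is a hedged approximation of the LCFR-Op; the class name and interface are placeholders rather than the authors' implementation.

```python
# Hedged sketch of the content-aware reassembly step (uses KernelPredictor from above).
class LCFROp(nn.Module):
    def __init__(self, c_in: int, c_mid: int = 64, k_up: int = 5, scale: int = 2):
        super().__init__()
        self.k_up, self.scale = k_up, scale
        self.predictor = KernelPredictor(c_in, c_mid, k_up, scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        kernels = self.predictor(x)                                   # (B, k_up^2, sH, sW)
        # Gather every k_up x k_up neighborhood N(X_p, k_up) of the input map.
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)      # (B, C*k_up^2, H*W)
        patches = patches.view(b, c * self.k_up ** 2, h, w)
        # Replicate each source neighborhood to the scale x scale output locations it serves
        # (the mapping u = u'/xi, v = v'/xi from the text).
        patches = F.interpolate(patches, scale_factor=self.scale, mode="nearest")
        patches = patches.view(b, c, self.k_up ** 2, h * self.scale, w * self.scale)
        # Weighted reassembly: sum over the neighborhood of W_{p'}(n, m) * X_(u+n, v+m).
        return (patches * kernels.unsqueeze(1)).sum(dim=2)            # (B, C, sH, sW)

feat = torch.randn(1, 128, 40, 40)
print(LCFROp(128)(feat).shape)   # torch.Size([1, 128, 80, 80])
```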

2.3.6. Inner-SIoU Bounding Box Loss Function

YOLOv10 utilizes CIoU (Complete Intersection over Union) as the default loss function for bounding box regression [26]. Building on the traditional IoU, this function incorporates three geometric factors between the predicted and ground-truth boxes, namely overlap area, center point distance, and aspect ratio—improving localization accuracy to a certain extent. However, its mathematical formulation has a notable limitation: it completely ignores the angular difference between bounding boxes. This flaw imposes a notable impact on practical training. In the absence of directional consistency constraints, the model requires more iterations to adjust the orientation of predicted boxes, resulting in slower convergence. Meanwhile, localization accuracy becomes limited when handling objects with specific orientations or complex scenes, especially for targets with abnormal aspect ratios, for which directional deviations are more prominent. To address these shortcomings, this study introduces the Inner-SIoU loss function [27] to replace the original CIoU. While retaining all geometric constraints of CIoU, this method innovatively introduces an angular loss term between the predicted and ground-truth boxes. By incorporating angular cost as a new optimization objective, the model can more accurately perceive the directional differences between bounding boxes, thereby achieving faster convergence and higher final localization accuracy during training, effectively addressing the limitations of CIoU in theoretical formulation and application performance.
Let $t = (x_t, y_t, w_t, h_t)$ denote the ground-truth box and $a = (x_a, y_a, w_a, h_a)$ represent the anchor box. To enhance the robustness of bounding box regression and broaden the model’s adaptability to objects of varying scales, a scaling factor $\gamma \in [0.5, 1.5]$ is introduced to control the size variation range of the auxiliary bounding boxes. The coordinates of an auxiliary bounding box are defined by its top-left ($tl$) and bottom-right ($br$) corners, calculated as follows:
$\hat{t}_{tl} = \left(x_t - \frac{\gamma w_t}{2},\; y_t - \frac{\gamma h_t}{2}\right)$
$\hat{t}_{br} = \left(x_t + \frac{\gamma w_t}{2},\; y_t + \frac{\gamma h_t}{2}\right)$
$\hat{a}_{tl} = \left(x_a - \frac{\gamma w_a}{2},\; y_a - \frac{\gamma h_a}{2}\right)$
$\hat{a}_{br} = \left(x_a + \frac{\gamma w_a}{2},\; y_a + \frac{\gamma h_a}{2}\right)$
The formula for calculating the intersection area is as follows:
$A_{inter} = \left[\min(\hat{t}_{br}^{\,x}, \hat{a}_{br}^{\,x}) - \max(\hat{t}_{tl}^{\,x}, \hat{a}_{tl}^{\,x})\right]_{+} \times \left[\min(\hat{t}_{br}^{\,y}, \hat{a}_{br}^{\,y}) - \max(\hat{t}_{tl}^{\,y}, \hat{a}_{tl}^{\,y})\right]_{+}$
The formula for calculating the union area is as follows:
$A_{union} = \gamma^{2}\left(w_t h_t + w_a h_a\right) - A_{inter}$
The formula for the Inner-IoU metric is as follows:
$\mathrm{IoU}_{inner} = \dfrac{A_{inter}}{A_{union}}$
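The auxiliary-box computation above translates directly into code. The sketch below follows the equations with boxes given as (cx, cy, w, h) tensors; the example values of $\gamma$ and the boxes are arbitrary.

```python
# Direct transcription of the Inner-IoU equations above.
import torch

def inner_iou(t: torch.Tensor, a: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    xt, yt, wt, ht = t.unbind(-1)
    xa, ya, wa, ha = a.unbind(-1)
    # Auxiliary (scaled) box corners.
    t_tl_x, t_tl_y = xt - gamma * wt / 2, yt - gamma * ht / 2
    t_br_x, t_br_y = xt + gamma * wt / 2, yt + gamma * ht / 2
    a_tl_x, a_tl_y = xa - gamma * wa / 2, ya - gamma * ha / 2
    a_br_x, a_br_y = xa + gamma * wa / 2, ya + gamma * ha / 2
    # Intersection, clamped at zero when the auxiliary boxes do not overlap.
    inter_w = (torch.min(t_br_x, a_br_x) - torch.max(t_tl_x, a_tl_x)).clamp(min=0)
    inter_h = (torch.min(t_br_y, a_br_y) - torch.max(t_tl_y, a_tl_y)).clamp(min=0)
    inter = inter_w * inter_h
    union = gamma ** 2 * (wt * ht + wa * ha) - inter
    return inter / union.clamp(min=1e-7)

t = torch.tensor([[50.0, 50.0, 20.0, 30.0]])   # ground-truth box (cx, cy, w, h)
a = torch.tensor([[52.0, 49.0, 22.0, 28.0]])   # predicted/anchor box
print(inner_iou(t, a, gamma=0.8))
```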
Building on this, the method introduces the SIoU loss function [28]. It embodies an innovative angular alignment mechanism, which effectively optimizes the bounding box regression training process through the redesign of the angular penalty term’s computation. Based on vector geometric relationships, this mechanism enables the predicted box to autonomously perceive the angular deviation from the ground-truth box and swiftly align with the nearest coordinate axis direction in the early training stages, significantly enhancing regression efficiency. Building upon angular alignment, the method further integrates distance loss and shape loss to construct a multi-dimensional optimization framework: angular alignment ensures the accuracy of bounding box orientation, distance loss governs the fine-tuning of the center point position, and shape loss regulates the matching of aspect ratios. These three dimensions synergize, and the joint optimization strategy comprehensively improves bounding box localization accuracy, empowering the model to retain stable detection performance even under complex scenarios.
(i) Distance Loss ($\Delta_{dist}$): Based on the angular alignment constraint, this loss component dynamically adjusts the penalty on center point distance. It automatically reduces the distance penalty when a substantial angular deviation is detected, prioritizing directional correction before fine-grained positional refinement.
(ii) Shape Loss ($\Delta_{shape}$): This component constrains the aspect ratio discrepancy between predicted and ground-truth boxes. By incorporating an exponential scaling factor, it amplifies dimensional errors to enhance the model’s perception of geometric variations and its sensitivity to significant shape differences.
The SIoU loss function is defined as follows (where $\lambda_1$ and $\lambda_2$ are hyperparameters):
$L_{SIoU} = 1 - \mathrm{IoU} + \lambda_1 \Delta_{dist} + \lambda_2 \Delta_{shape}$
The final Inner-SIoU loss function is defined as follows:
$L_{Inner\text{-}SIoU} = L_{SIoU} + \left(\mathrm{IoU} - \mathrm{IoU}_{inner}\right)$
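Combining the two components follows the final equation above. In the sketch below, `siou_loss` is a placeholder for an existing SIoU implementation (its angle, distance, and shape terms are not reproduced here), and `inner_iou` is the function from the previous sketch; with $\gamma = 1$ the auxiliary boxes coincide with the original boxes, so `inner_iou(..., gamma=1.0)` yields the standard IoU.

```python
# Hedged sketch of the combined Inner-SIoU loss.
def inner_siou_loss(pred: torch.Tensor, target: torch.Tensor, gamma: float = 0.8) -> torch.Tensor:
    l_siou = siou_loss(pred, target)                    # placeholder: 1 - IoU + distance + shape terms
    iou = inner_iou(pred, target, gamma=1.0)            # standard IoU (gamma = 1)
    iou_inner = inner_iou(pred, target, gamma=gamma)    # auxiliary-box IoU
    return l_siou + (iou - iou_inner)
```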

2.4. Experimental Environment and Parameter Configuration

Both training and testing of the GBDR-Net model were conducted on a high-performance computing platform, which is equipped with an NVIDIA GeForce RTX 4090Ti GPU (24 GB video memory; NVIDIA Corporation, Santa Clara, CA, USA) and an Intel Core™ i9-7940X CPU (3.10 GHz clock speed; Intel Corporation, Santa Clara, CA, USA), running the Windows 11 64-bit operating system. Input images were uniformly scaled to a resolution of 640 × 640 × 3, and the algorithm was implemented based on Python 3.8 and the PyTorch 2.0 deep learning framework. For the training phase, the following configurations were adopted: 300 epochs were configured with a batch size of 8, the network backbone was initialized by loading pre-trained weights, and an initial learning rate of 0.01 was set with the stochastic gradient descent (SGD) algorithm employed for optimization. All aforementioned training and testing procedures were performed under the specified hardware and software environments to ensure consistency in model development and validation.
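These hyperparameters map onto a training call such as the hypothetical sketch below. The model and dataset YAML file names are placeholders, and the paper does not state which training wrapper was used; the ultralytics-style interface is only one plausible way to reproduce the listed settings.

```python
# Hypothetical training invocation reproducing the reported hyperparameters
# (300 epochs, batch size 8, 640 x 640 input, SGD, initial LR 0.01, pre-trained backbone).
from ultralytics import YOLO

model = YOLO("gbdr-net.yaml")        # placeholder: custom GBDR-Net architecture definition
model.train(
    data="gbdva.yaml",               # placeholder: GBDVA dataset description file
    epochs=300,
    batch=8,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,
    pretrained=True,                 # initialize backbone from pre-trained weights
    device=0,
)
```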

2.5. Performance Metrics

To systematically evaluate the detection performance of the GBDR-Net model on the GBDVA dataset, this study adopted multiple commonly used metrics, including precision, recall, mAP@0.5, mAP@0.5:0.95, F1-score, model size, and inference speed, to conduct a comprehensive analysis of the model’s overall performance.
In object detection tasks, precision is defined as the ratio of true positives correctly predicted by the model to all predicted positive instances, and is used to evaluate the reliability of detection results. Conversely, recall represents the ratio of true positives accurately identified by the model to all actual positive instances, reflecting the model’s coverage capability for target objects. The mAP@0.5 adopts an Intersection over Union (IoU) threshold of 0.5 as the criterion for correct detection—i.e., a detection is considered correct when the overlap between the predicted bounding box and the ground-truth box exceeds 50%. It calculates the mean average precision (mAP) based on this threshold, providing a comprehensive evaluation of the model’s performance across all categories. The mAP@0.5:0.95 further assesses the mean average precision across diverse IoU thresholds (ranging from 0.5 to 0.95 with a step size of 0.05), better reflecting the model’s robustness under varying localization precision requirements. The F1-score is the harmonic mean of precision and recall, balancing these two metrics to offer a holistic assessment of model performance. Higher values of the aforementioned metrics indicate better detection performance of the model. Model size characterizes the algorithm’s complexity and resource consumption; smaller models typically imply lower computational overhead and reduced hardware dependency, facilitating deployment on edge devices. Inference speed refers to the time required for the model to complete a single forward propagation, and faster inference speed helps meet the real-time processing requirements in practical applications.
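For the count-based metrics, a short worked example with arbitrary detection counts is given below; mAP@0.5 and mAP@0.5:0.95 additionally average per-class precision-recall curves over IoU thresholds and are normally taken from the evaluation toolkit rather than computed by hand.

```python
# Precision, recall, and F1 from hypothetical true-positive / false-positive / false-negative counts.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 180 TP, 20 FP, 30 FN -> precision 0.900, recall ~0.857, F1 ~0.878
print(precision_recall_f1(180, 20, 30))
```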

3. Results

3.1. Ablation Study and Analysis

To evaluate the collective optimization contributions of the SDF-Fusion module, Detect-XSmall detection head, Cross-concatenation strategy, LCFR-Op operator, and Inner-SIoU loss function to model performance, this study systematically conducted a set of ablation studies using YOLOv10 as the baseline model and the self-constructed GBDVA dataset. By sequentially incorporating the aforementioned components, the experiments quantitatively evaluated their synergistic optimization effects on detection performance and model size. The settings and key results of the ablation studies are summarized in Table 2. The ultimately integrated GBDR-Net model demonstrates excellent performance across multiple critical metrics: precision of 93.4%; recall of 89.6%; mAP@0.5 of 90.2%; mAP@0.5:0.95 of 86.4%; with the model size constrained to 4.83 MB.
Table 2 systematically demonstrates the progressive optimization effects of each improved module on model performance. First, replacing the original C2f module with the SDF-Fusion module in the backbone network achieves effective model compression while enhancing the ability to extract global context and subtle lesion features. This improvement reduces the model size by 21.2% to 6.82 MB, increases precision by 1.7 percentage points to 86.6%, and improves mAP@0.5 by 4.3 percentage points to 82.9%, indicating that the module significantly optimizes computational efficiency while maintaining feature expression capability. Subsequently, the Detect-XSmall detection head, dedicated to small-scale disease targets, is introduced, leading to a significant increase in recall by 1.1 percentage points to 84.6% and verifying enhanced detection ability for subtle lesion features. Although this operation slightly increases the model parameter count (with the model size increasing to 7.16 MB), it provides crucial support for the subsequent implementation of multi-scale feature fusion. Further embedding the Cross-concatenation strategy in the neck network enhances the transmission and reuse of multi-scale semantic information by fusing the output features of the SPPF and PSA modules at the end of the backbone. This strategy improves precision by 3.2 percentage points to 90.5%, mAP@0.5 by 2.4 percentage points to 88.5%, and mAP@0.5:0.95 by 3.5 percentage points to 81.7%, while the model size only slightly increases to 7.21 MB, demonstrating excellent performance gains and parameter efficiency. To further achieve model lightweighting, the LCFR-Op operator is adopted to replace the traditional upsampling method, promoting semantic alignment of multi-scale features through dynamic convolution kernel reorganization. This operator significantly reduces the model size by 26.8% to 5.28 MB while maintaining high-precision detection performance (mAP@0.5 reaches 89.4%), exhibiting dual advantages in feature enhancement and structure simplification. Finally, the Inner-SIoU loss function replaces the original CIoU function, further optimizing the model’s convergence speed and localization accuracy by introducing more refined geometric constraints of bounding boxes. With the model size further compressed to 4.83 MB, all detection metrics are comprehensively improved.
The above analysis results indicate that the synergistic effects of each module enable the constructed GBDR-Net model to achieve a favorable balance between detection accuracy and lightweight performance. As shown in Figure 5, the model exhibits excellent multi-scale disease detection capability, maintaining stable detection accuracy for grape berry diseases of different morphologies with almost no missed detections or false positives, which fully verifies its practicality and reliability in complex orchard environments. Table 3 systematically presents the quantitative detection results of the GBDR-Net model for various grape berry diseases, further indicating that the model maintains high detection accuracy for all disease types under different disease severity levels and background interference conditions.
To accurately evaluate the GBDR-Net model’s detection performance for various grape berry diseases, this study plotted separate Precision-Recall (PR) curves for each disease category based on the precision and recall data collected during training, and the visualizations are presented in Figure 6. The area under the PR curve (AUC) for each disease category in the figure was calculated as the average precision (AP) at an Intersection over Union (IoU) threshold of 0.5, thereby objectively reflecting the model’s detection capability across different categories. Notably, the areas enclosed by the PR curves of different disease categories and the coordinate axes are relatively close—indicating that the model delivers stable detection performance in complex orchard environments and maintains consistent detection performance for fruit diseases of diverse morphologies and scales, thus verifying the effectiveness and applicability of the proposed model structure for grape berry disease detection.
Based on the CardCAM visual heatmap method [29], this study performed a visualized analysis of the feature attention regions of the GBDR-Net model during grape berry disease detection, and the results are illustrated in Figure 7. Heatmaps intuitively reflect the model’s attention intensity across different image regions via color gradients, with highlighted regions indicating that the features in these areas are highly contributory to model decision-making. The visualization results demonstrate that GBDR-Net can accurately focus on the key characteristic regions of various diseases: when recognizing black mold, the model’s attention is highly concentrated on the edges and central texture areas of mold spots; when detecting canker, it precisely captures the transition areas between typical sunken lesions and adjacent healthy tissues; for powdery mildew, it mainly responds to the distribution of powdery substances on the berry surface; and when identifying sour rot, it prioritizes distinctive symptoms (e.g., soft rot and exudation) on the berry surface. Such a highly consistent feature response pattern indicates that the model can effectively capture the typical visual features of different diseases. In summary, the CardCAM visualization results validate the effectiveness and interpretability of the GBDR-Net model in grape berry disease detection tasks from an attribution perspective, providing robust support for its reliable application in real-world complex environments.

3.2. Performance Comparison Against Other Detection Models

To accurately evaluate the detection performance of the GBDR-Net model developed in this study, four representative object detection models—Faster R-CNN [30], SSD [31], YOLOv6s [32], and YOLOv8s [33]—were selected for comparative experiments alongside GBDR-Net, focusing on examining their actual performance in grape berry disease detection tasks. All experiments were conducted on the designated test set, with strict consistency upheld across the hardware platform, software environment, and hyperparameter settings to ensure high objectivity and credibility of the comparative results. The specific detection outcomes are presented in Figure 8, providing an intuitive comparison of each model’s performance on typical grape berry disease samples. Notably, Faster R-CNN (Figure 8b) can localize some prominent lesions but is prone to missed detections for disease regions with complex morphologies, accompanied by limited bounding box precision. SSD (Figure 8c) excels in detection speed but is inadequate in recognizing dense small lesions, exhibiting obvious missed detections. YOLOv6s (Figure 8d) shows improved performance compared to the previous two models but still exhibits localization deviations at the detailed edges of lesions. YOLOv8s (Figure 8e) performs well overall but has certain missed detections in areas with berry surface reflection or mild symptomatic regions. In contrast, the proposed GBDR-Net model (Figure 8f) demonstrates the optimal comprehensive performance across all test scenarios. It can not only accurately identify various diseases—including subtle mold spots, early-stage cankers, and irregular rot areas—but also maintain extremely high detection stability and localization accuracy under complex background interference and varying lighting conditions. This is largely due to its integrated internal SDF-Fusion module, multi-scale feature fusion strategy, and Inner-SIoU loss function, which enable it to more effectively capture the discriminative features of diseases. This comparative experiment visually validates the superiority and robustness of the GBDR-Net model in grape berry disease detection tasks.
To ensure a fair and rigorous comparison, the following measures were implemented across all baseline models: (1) Hyperparameter Optimization: Each baseline model (Faster R-CNN, SSD, YOLOv6s, YOLOv8s) was individually tuned on the GBDVA validation set to achieve its optimal performance. Key hyperparameters (e.g., learning rate, optimizer settings, anchor box scales) were adjusted according to the common practices or official recommendations for each model. (2) Augmentation Consistency: All models were trained using the identical data augmentation pipeline as described in Section 2.2, ensuring they learned from the same enhanced data distribution. (3) Multi-run Stability: To account for training stochasticity, each model was trained three times with different random seeds. The performance metrics reported in Table 4 and Figure 9 are the mean values from these runs, reflecting stable performance. The standard deviation for key metrics (e.g., mAP@0.5) was less than ±0.5%, indicating high reproducibility. (4) Statistical Significance: The performance improvements of GBDR-Net over all baselines are not only substantial in magnitude but also consistent across all three independent runs, confirming the statistical reliability of the reported advantages.
Figure 9 depicts the trends in precision, recall, mAP@0.5, and mAP@0.5:0.95 across different detection models over identical training cycles. The results indicate that for the grape berry disease detection task, the proposed GBDR-Net model demonstrates not only superior convergence but also comprehensively outperforms all other models, including Faster R-CNN, SSD, YOLOv6s, and YOLOv8s, across key performance metrics. As detailed in Table 4, GBDR-Net achieves significant leads in all four key metrics. Specifically, it attains a precision of 93.4%, surpassing Faster R-CNN, SSD, YOLOv6s, and YOLOv8s by 8.2, 10.8, 3.7, and 2.2 percentage points, respectively. Similarly, its recall values are higher by 9.3, 12.7, 5.4, and 2.3 percentage points. For mAP@0.5, the margins of superiority are 9.7, 12.4, 4.4, and 1.9 percentage points. Even under the more stringent mAP@0.5:0.95 metric, GBDR-Net maintains the highest performance, leading by 12.9, 16.3, 7.2, and 2.7 percentage points, respectively. These quantitative outcomes robustly validate that GBDR-Net ensures high detection accuracy while substantially improving localization capability, offering pronounced advantages in comprehensive detection performance within complex scenarios.
Table 4 also provides a systematic comparison of GBDR-Net against Faster R-CNN, SSD, YOLOv6s, and YOLOv8s across four key dimensions: F1-score, model size, inference speed, and computational complexity. The experimental results demonstrate that the proposed method achieves synergistic optimization across these metrics while maintaining robust detection performance. In terms of the F1-score, GBDR-Net attains 92.9%, outperforming Faster R-CNN, SSD, YOLOv6s, and YOLOv8s by 8.5, 11.6, 4.8, and 2.4 percentage points, respectively, indicating more consistent identification of grape disease regions under complex backgrounds. From a model lightweighting perspective, GBDR-Net exhibits significant parameter efficiency, with a model size of only 4.83 MB—representing reductions of 96.2%, 95.4%, 71.3%, and 78.5% compared to Faster R-CNN, SSD, YOLOv6s, and YOLOv8s, thereby substantially enhancing its deployment potential on resource-constrained devices. Regarding real-time inference performance, GBDR-Net achieves 98.2 fps, which is 85.7 fps higher than Faster R-CNN, and also exceeds SSD, YOLOv6s, and YOLOv8s by 69.9 fps, 35.7 fps, and 19.8 fps, respectively, demonstrating superior responsiveness in practical applications. Furthermore, GBDR-Net shows a clear advantage in computational efficiency, requiring only 20.5 GFLOPs—a reduction of 154.8 GFLOPs compared to Faster R-CNN, and 122.1 GFLOPs, 28.0 GFLOPs, and 8.1 GFLOPs lower than SSD, YOLOv6s, and YOLOv8s, respectively. This indicates a significantly lower demand for computational resources while maintaining high accuracy, making the model better suited for practical agricultural applications such as edge computing.
To objectively evaluate the disease detection capability of the GBDR-Net model, this study employed the confusion matrix to analyze its classification performance, and the results are illustrated in Figure 10. Notably, the model displays a certain degree of misclassification between some disease categories: both black mold and sour rot are typified by dark mold and tissue decay, resulting in mutual misclassification; meanwhile, the powdery symptoms of powdery mildew are visually analogous to the early-stage white lesions of canker disease, causing moderate classification confusion. This observation indicates that the high similarity in color and texture features among different diseases in complex orchard environments is one of the key challenges hindering the accurate localization of visual detection models. Nevertheless, GBDR-Net still demonstrates stable and reliable detection performance in the overall detection of the four grape berry diseases, with the false detection rate kept at a low level.
Figure 11 employs a radar chart to systematically evaluate the overall performance of the proposed GBDR-Net against Faster R-CNN, SSD, YOLOv6s, and YOLOv8s for grape berry disease detection. The results reveal that the lightweight-optimized GBDR-Net exhibits superior robustness in complex orchard environments. It successfully maintains high detection accuracy and real-time detection speed while achieving a substantially reduced model size. Consequently, the model establishes an excellent balance between detection accuracy, inference efficiency, and computational demand, underscoring its comprehensive advantages for deployment in resource-constrained edge scenarios, such as orchard on-site monitoring and handheld diagnostic devices. These findings collectively validate the efficacy of the proposed architectural improvements for the grape berry disease detection task.
The GBDR-Net model proposed in this study offers an effective technical solution for the precise and rapid detection of grape berry diseases during the harvest period, helping to reduce yield and quality losses caused by diseases at this critical stage. The model supports the development of an integrated “identification–localization–assessment” intelligent harvest-assistance system, promoting the transition of grape harvesting operations from traditional reliance on manual experience to intelligent sorting and decision-making based on visual perception. This research holds significant academic value and practical relevance for improving harvesting efficiency and ensuring quality for both fresh consumption and processing purposes.

4. Discussion

This study tackles long-standing challenges in grape berry disease detection, including the absence of specialized models for clustered fruits, inadequate accuracy in detecting small-scale lesions, and the incompatibility of complex models with resource-constrained deployment environments. On the GBDVA dataset, GBDR-Net demonstrates outstanding overall performance, achieving a precision of 93.4%, recall of 89.6%, mAP@0.5 of 90.2%, and mAP@0.5:0.95 of 86.4%. Furthermore, the model exhibits remarkable lightweight characteristics, with a size of 4.83 MB, a computational cost of 20.5 GFLOPs, and a real-time inference speed of 98.2 FPS, highlighting its strong deployment potential. These results collectively confirm that GBDR-Net effectively bridges the identified research gaps by striking an optimal balance between detection accuracy and practical applicability.
Compared to existing deep learning-based methods for plant disease detection, the innovation of GBDR-Net lies in its targeted design for the unique attributes of grape berries. Most previous studies (e.g., Wu et al. [17], Cai et al. [18], Zhang et al. [19]) have focused on grape leaf disease detection, where lesions are generally larger and plant tissue occlusion is less severe, making feature extraction relatively easier. In contrast, grape berries are not only small in size and grow in dense clusters, but their early-stage disease spots also exhibit more subtle features that can be easily obscured by adjacent fruits or complex backgrounds. To address these challenges, the SDF-Fusion module integrated into the backbone of GBDR-Net plays a critical role. It enhances the extraction of both global contextual information and fine-grained lesion features, effectively compensating for the loss of detailed semantic information in deeper layers, a limitation of the original C2f module in YOLOv10. This emphasis on capturing small-scale, low-contrast features, which are essential for early disease diagnosis, explains why GBDR-Net outperforms leaf-oriented disease detection models in fruit-specific scenarios.
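Because the internal composition of SDF-Fusion is not reproduced in this section, the following PyTorch sketch only illustrates the general idea of pairing a local detail branch with a global-context branch inside a residual block. The specific layer choices (a depthwise 3×3 convolution and squeeze-and-excitation-style channel gating) are our own assumptions for illustration and should not be read as the authors' exact module.

```python
import torch
import torch.nn as nn

class DualContextFusion(nn.Module):
    """Illustrative block combining fine-grained detail with pooled global context."""

    def __init__(self, channels: int):
        super().__init__()
        # Local branch: depthwise 3x3 keeps fine-grained lesion detail at low cost.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        # Global branch: pooled context re-weights channels (SE-style gating).
        self.global_ctx = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1),
            nn.SiLU(),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        detail = self.local(x)            # B, C, H, W
        context = self.global_ctx(x)      # B, C, 1, 1
        return self.fuse(detail * context) + x   # residual keeps deep-layer gradients stable
```

The residual connection at the end reflects a common design choice for deep backbone stages, where it helps preserve gradient flow while the gated branch emphasizes context-relevant channels.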
The incorporation of the Detect-XSmall head and the cross-concatenation strategy for SPPF and PSA outputs further addresses the challenge of multi-scale lesion detection. Unlike SSD or Faster R-CNN, which rely on fixed-scale feature maps and are prone to missing small lesions, the multi-head design of GBDR-Net enables effective detection of early-stage lesions as small as 1–2 mm. This capability is particularly critical during the harvest period, when chemical control is restricted and the timely removal of infected berries becomes the only measure to prevent disease spread. Moreover, the model's lightweight profile, with a size of only 4.83 MB and a computational cost of 20.5 GFLOPs, represents a critical step towards efficient deployment on resource-constrained hardware. This significant reduction in model complexity and computational demand, compared to mainstream detectors, establishes a strong foundation for its potential application in real-time, edge-computing scenarios such as embedded systems on sorting lines or mobile devices for field scouting. The reported high inference speed (98.2 FPS) on a high-performance GPU demonstrates the algorithm's intrinsic efficiency; its translation into practical throughput on specific edge devices (e.g., Jetson series, Raspberry Pi) will be the focus of subsequent engineering optimization and deployment studies.
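Since the FPS figures quoted above are hardware-dependent, a minimal way to estimate raw inference throughput for any PyTorch detector is sketched below. The 640×640 input size, run counts, and warm-up length are arbitrary assumptions, and real pipelines also incur pre- and post-processing costs that are not measured here.

```python
import time
import torch

def measure_fps(model: torch.nn.Module, img_size: int = 640,
                runs: int = 200, warmup: int = 20) -> float:
    """Rough single-image throughput estimate (model forward pass only)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    dummy = torch.randn(1, 3, img_size, img_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):            # warm-up stabilises GPU clocks / cuDNN autotuning
            model(dummy)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(dummy)
        if device == "cuda":
            torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)

# Example usage: fps = measure_fps(my_detector)
```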
The design principles and architectural innovations of GBDR-Net, particularly its focus on capturing “small-scale, low-contrast features” in clustered environments (via SDF-Fusion) and its semantic alignment capabilities (via LCFR-Op), suggest its potential adaptability to other clustered fruits facing similar detection challenges, such as strawberries, blueberries, and raspberries. For instance, the LCFR-Op operator is designed to preserve inter-object spatial relationships, which could theoretically help mitigate false detections caused by overlapping healthy fruits in dense clusters—a common issue in strawberry disease detection. While direct validation on these crops is beyond the scope of this study and remains a subject for future work, the problem-driven design of GBDR-Net provides a potentially transferable technical framework for addressing analogous “dense growth + small-scale lesion” detection tasks in precision horticulture.
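The LCFR-Op itself is not specified in this section, but the family of content-aware feature reassembly operators to which it belongs (CARAFE being the best-known example) can be sketched as follows in PyTorch. The kernel size, compression width, and layer layout here are illustrative assumptions, not the operator used in GBDR-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentAwareUpsample(nn.Module):
    """Minimal CARAFE-style content-aware 2x upsampling sketch."""

    def __init__(self, channels: int, scale: int = 2, k_up: int = 5, compressed: int = 16):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, compressed, 1)
        # Predict one k_up x k_up reassembly kernel per upsampled location.
        self.kernel_pred = nn.Conv2d(compressed, (k_up ** 2) * scale ** 2, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # 1) Predict and normalise reassembly kernels from the feature content itself.
        kernels = self.kernel_pred(self.compress(x))            # B, k^2*s^2, H, W
        kernels = F.pixel_shuffle(kernels, self.scale)           # B, k^2, sH, sW
        kernels = F.softmax(kernels, dim=1)
        # 2) Gather k_up x k_up neighbourhoods of the input feature map.
        x_unfold = F.unfold(x, self.k_up, padding=self.k_up // 2)       # B, c*k^2, H*W
        x_unfold = x_unfold.view(b, c * self.k_up ** 2, h, w)
        x_unfold = F.interpolate(x_unfold, scale_factor=self.scale, mode="nearest")
        x_unfold = x_unfold.view(b, c, self.k_up ** 2, h * self.scale, w * self.scale)
        # 3) Weighted reassembly: each output pixel is a learned blend of its neighbourhood.
        return (x_unfold * kernels.unsqueeze(1)).sum(dim=2)      # B, c, sH, sW
```

The key property shared with the operator described above is that the reassembly kernels are predicted from the feature content, so upsampled features stay semantically aligned with their source locations instead of being copied blindly as in nearest-neighbor upsampling.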
The GBDR-Net model supports the transition in sustainable agriculture from broad-spectrum chemical control to precision disease management. With a detection accuracy of 90.2% mAP@0.5, the model reliably identifies and facilitates the removal of infected berries during harvest, effectively reducing yield loss at this critical stage. Its real-time inference speed of 98.2 FPS further enables dynamic harvesting strategies. For example, if the system detects that disease incidence in the field exceeds a preset threshold, growers can prioritize harvesting healthy clusters to minimize cross-contamination risks. This capability aligns with global objectives to reduce pesticide use and enhance food safety. By providing accurate in-field disease identification, GBDR-Net helps curtail the need for pre-harvest broad-spectrum fungicides, mitigates pesticide residue risks at the source, and offers a practical pathway toward greening horticultural production systems.
In post-harvest processing, integrating GBDR-Net into automated sorting systems can effectively overcome the cost and accuracy limitations associated with heavy reliance on manual grading. With a precision of 93.4%, the model significantly reduces the misclassification of healthy fruits, while its 89.6% recall rate ensures that the vast majority of infected fruits are accurately identified and removed. This capability not only enhances end-product quality directly but also holds substantial practical significance for establishing a reliable quality control barrier at the upstream supply chain level and strengthening end-consumer trust.
Despite the significant progress achieved in this study, several limitations remain. While the current model can accurately identify disease categories, it cannot yet grade infection severity at a fine-grained level. This capability gap is partly attributable to constraints in its data foundation. First, the GBDVA dataset covers only four common diseases and lacks several prevalent ones such as anthracnose, as well as complex scenarios involving co-infection, which limits the model's diagnostic scope. Second, all samples were collected from a single geographical region (North China) and specific orchards, without representation of diverse climatic zones, such as the rainy regions of Southern China or arid Northwestern China, or of varied grape cultivars, which challenges the model's generalization across broader real-world conditions. Finally, the current GBDVA dataset consists primarily of high-resolution, close-range images that facilitate the learning of clear disease features, so a gap remains between such data and the wide-angle, low-resolution scenarios commonly encountered in practical field monitoring; this highlights the need to further improve the model's robustness under multi-scale imaging conditions. Nevertheless, the overall advantages and practical value of the GBDR-Net model remain substantial, positioning it as a promising solution for real-time monitoring of grape berry diseases in agricultural applications.
Based on the current research outcomes, our future work will advance along five interconnected directions to further enhance intelligent grape disease detection and management. First, we plan to systematically enhance the scale diversity of the dataset. On the one hand, we will collect or integrate wide-angle images from drones, field robots, and fixed monitoring points to simulate real-world scenarios in which target fruits appear smaller and at lower resolution within the image. On the other hand, a multi-scale collaborative training strategy will be designed and implemented, enabling the model to learn simultaneously from close-range details and long-range contextual information.
Second, we will expand the dataset by incorporating high-incidence diseases such as anthracnose and black pox, along with mixed infection cases, through cross-regional sampling across diverse climatic zones including southern rainy and northwestern arid areas in China. This will establish a comprehensive multi-disease, multi-variety, multi-region, and multi-infection-type dataset to significantly improve model generalization. Third, we will strengthen fine-grained analysis capabilities by implementing a multi-task learning framework that incorporates infection severity grading alongside existing detection and classification tasks. Through optimized loss function design, the model will achieve simultaneous disease identification, severity quantification, and mixed-infection detection. Fourth, we will focus on optimizing edge deployment efficiency by conducting performance tests on low-power devices like Raspberry Pi Zero and applying 8-bit quantization with model pruning for further compression. This will be complemented by developing mobile applications and UAV-mounted systems to enable multi-scenario deployment strategies combining handheld field inspection and large-scale aerial monitoring. Finally, we will explore multi-modal data fusion by integrating near-infrared imaging, spectral data, and RGB images to construct a unified detection framework that leverages complementary features for identifying early-stage weak symptoms and variety-specific disease manifestations. Collectively, these efforts will establish a comprehensive intelligent monitoring system characterized by broad adaptability, fine-grained classification, flexible deployment, and multi-modal perception, thereby providing stronger technical support for precision viticulture and sustainable production.
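As a minimal sketch of the edge-oriented compression step mentioned above, the snippet below exports a small placeholder PyTorch network to ONNX and applies ONNX Runtime's dynamic 8-bit weight quantization. The placeholder network and file names are not GBDR-Net; structured pruning and hardware-specific toolchains (e.g., TensorRT) would be evaluated separately in the deployment studies.

```python
import torch
import torch.nn as nn
from onnxruntime.quantization import quantize_dynamic, QuantType

# Placeholder network standing in for a trained detector (not GBDR-Net itself).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 4),
).eval()

dummy = torch.randn(1, 3, 640, 640)
torch.onnx.export(model, dummy, "detector.onnx", opset_version=12)

# Dynamic quantization stores weights as INT8, shrinking the exported file for edge devices.
quantize_dynamic("detector.onnx", "detector_int8.onnx", weight_type=QuantType.QInt8)
```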

5. Conclusions

This study developed GBDR-Net, a lightweight grape berry disease detection model designed for complex orchard environments, addressing the long-standing technical challenge of achieving high-precision, real-time detection in dense clusters with small-scale lesions. Unlike previous studies focused primarily on leaf diseases or generic object detection, our research introduces four key improvements to YOLOv10 that jointly advance detection performance and practical applicability: (1) integration of the SDF-Fusion module into the backbone network to enhance the extraction of global contextual and fine-grained lesion features; (2) incorporation of a Detect-XSmall head combined with a cross-concatenation strategy in the neck network, constructing an enhanced feature pyramid that significantly improves sensitivity to small disease objects; (3) proposal of the lightweight content-aware feature reassembly operator (LCFR-Op) to achieve efficient upsampling with improved semantic alignment; and (4) adoption of the Inner-SIoU bounding box loss function, which accelerates convergence and improves localization accuracy by introducing more refined geometric constraints.
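As a concrete illustration of improvement (4), the sketch below implements only the "inner box" idea: both predicted and ground-truth boxes are shrunk about their centres by a fixed ratio before the overlap is computed, which sharpens the regression signal for small, tightly packed targets. The angle, distance, and shape costs of the full Inner-SIoU loss are omitted, and the ratio of 0.8 is an illustrative assumption.

```python
import torch

def inner_iou(box1: torch.Tensor, box2: torch.Tensor,
              ratio: float = 0.8, eps: float = 1e-7) -> torch.Tensor:
    """IoU of 'inner' boxes; inputs are (..., 4) tensors in (x1, y1, x2, y2) format."""
    def shrink(b):
        cx = (b[..., 0] + b[..., 2]) / 2
        cy = (b[..., 1] + b[..., 3]) / 2
        w = (b[..., 2] - b[..., 0]) * ratio
        h = (b[..., 3] - b[..., 1]) * ratio
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax1, ay1, ax2, ay2 = shrink(box1)
    bx1, by1, bx2, by2 = shrink(box2)
    inter_w = (torch.min(ax2, bx2) - torch.max(ax1, bx1)).clamp(min=0)
    inter_h = (torch.min(ay2, by2) - torch.max(ay1, by1)).clamp(min=0)
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter + eps
    return inter / union

# Example usage in a regression loss: loss = 1 - inner_iou(pred_boxes, target_boxes)
```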
Comprehensive evaluation experiments demonstrate that GBDR-Net achieves leading performance on the key metrics among the evaluated models. On the GBDVA dataset, it achieved a precision of 93.4%, a recall of 89.6%, and an mAP@0.5 of 90.2%, significantly outperforming traditional models (e.g., Faster R-CNN, SSD) and surpassing recent detectors (e.g., YOLOv6s, YOLOv8s). Notably, with a model size of only 4.83 MB and a computational cost of 20.5 GFLOPs, the model achieves a real-time inference speed of 98.2 FPS. This reflects an effective balance across the three critical dimensions of accuracy, speed, and model compactness, laying a foundation for large-scale deployment on embedded devices and mobile terminals.
In summary, GBDR-Net not only fills a critical gap in specialized, high-precision detection of diseases on densely clustered fruit, but also delivers a practically viable solution for smart horticulture through its strong overall performance. Future research will focus on enhancing generalization capability, achieving practical deployment, deepening perceptual capacity, and establishing a comprehensive monitoring system, with the aim of advancing this technology from a laboratory prototype toward robust field application.

Author Contributions

Conceptualization, P.L. (Pan Li) and J.Z.; methodology, H.S.; software, J.Z.; validation, P.L. (Pan Li) and J.Z.; formal analysis, P.L. (Penglin Li) and X.C.; investigation, P.L. (Pan Li) and X.C.; resources, P.L. (Pan Li); data curation, P.L. (Penglin Li); writing—original draft preparation, P.L. (Pan Li); writing—review and editing, P.L. (Pan Li) and X.C.; visualization, P.L. (Pan Li); supervision, H.S.; project administration, P.L. (Pan Li); funding acquisition, H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Youth Fund Project for Scientific Research of Higher Education Institutions in Hebei Province (grant number: QN2025015), the 14th Five-Year Plan Project for Higher Education Science Research of the Hebei Higher Education Association (grant number: GJXHZ2024-30), the Hebei Natural Science Foundation Spring Talent Special Project (grant number: A2025508001), the Open Research Fund of the Anhui Province Key Laboratory of Machine Vision Inspection (grant number: KLMVI-2024-HIT-12), and the Anhui Province Science and Technology Innovation Breakthrough Plan (grant number: 202423i08050056).

Data Availability Statement

The raw data are still being actively expanded as part of an ongoing research project and will therefore not be made publicly available until the study is completed.

Acknowledgments

The authors are grateful to Beijing Changping Aroma Grape Garden for their sustained support of this research. We also extend our thanks to the editors and reviewers for their insightful comments and valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Strati, V.; Albéri, M.; Barbagli, A.; Boncompagni, S.; Casoli, L.; Chiarelli, E.; Colla, R.; Colonna, T.; Elek, N.I.; Galli, G.; et al. Advancing Grapevine Disease Detection Through Airborne Imaging: A Pilot Study in Emilia-Romagna (Italy). Remote Sens. 2025, 17, 2465. [Google Scholar] [CrossRef]
  2. Li, W.; Zhou, B.; Zhou, Y.; Jiang, C.; Ruan, M.; Ke, T.; Wang, H.; Lv, C. Grape Disease Detection Using Transformer-Based Integration of Vision and Environmental Sensing. Agronomy 2025, 15, 831. [Google Scholar] [CrossRef]
  3. Rahman, M.U.; Liu, X.; Wang, X.; Fan, B. Grapevine gray mold disease: Infection, defense and management. Hortic. Res. 2024, 11, 182. [Google Scholar] [CrossRef]
  4. Das, S.; Dutta, S.; Ghosh, S.; Mukherjee, A. Chitinolytic microorganisms for biological control of plant pathogens: A Comprehensive review and meta-analysis. Crop Prot. 2024, 185, 106888. [Google Scholar] [CrossRef]
  5. Li, R.; Liu, J.; Shi, B.; Zhao, H.; Li, Y.; Zheng, X.; Peng, C.; Lv, C. High-Performance Grape Disease Detection Method Using Multimodal Data and Parallel Activation Functions. Plants 2024, 13, 2720. [Google Scholar] [CrossRef] [PubMed]
  6. Prasad, V.K.; Vaidya, H.; Rajashekhar, C.; Karelal, K.S.; Sali, R.; Nisar, K.S. Multiclass classification of diseased grape leaf identification using deep convolutional neural network (DCNN) classifier. Sci. Rep. 2024, 14, 9002. [Google Scholar] [CrossRef]
  7. G, O.; Billa, S.R.; Malik, V.; Bharath, E.; Sharma, S. Grapevine fruits disease detection using different deep learning models. Multimed. Tools Appl. 2024, 84, 5523–5548. [Google Scholar] [CrossRef]
  8. Calzarano, F.; Amalfitano, C.; Seghetti, L.; Di Marco, S. Effect of Different Foliar Fertilizer Applications on Esca Disease of Grapevine: Symptom Expression and Nutrient Content in the Leaf and Composition of the Berry. Agronomy 2023, 13, 1355. [Google Scholar] [CrossRef]
  9. Ismail, K.; Ishak, P. Advancements in deep learning for accurate classification of grape leaves and diagnosis of grape diseases. J. Plant Dis. Prot. 2024, 131, 1061–1080. [Google Scholar] [CrossRef]
  10. Saha, K.D.; Ahmed, R.M.; Nath, D.T.; Boby, R.I.; Hossen, M.; Mridha, M.F. Fusing explainable deep learning ensembles and LLM recommendations for real-time plant leaf disease diagnosis. Intell. Syst. Appl. 2025, 28, 200596. [Google Scholar] [CrossRef]
  11. Chen, J.; Zhang, D.; Nanehkaran, Y.A.; Li, D. Detection of rice plant diseases based on deep transfer learning. J. Sci. Food Agric. 2020, 100, 3246–3256. [Google Scholar] [CrossRef]
  12. Sun, H.; Wang, R.-F. BMDNet-YOLO: A Lightweight and Robust Model for High-Precision Real-Time Recognition of Blueberry Maturity. Horticulturae 2025, 11, 1202. [Google Scholar] [CrossRef]
  13. Zeng, T.; Li, C.; Zhang, B.; Wang, R.; Fu, W.; Wang, J.; Zhang, X. Rubber Leaf Disease Recognition Based on Improved Deep Convolutional Neural Networks with a Cross-Scale Attention Mechanism. Front. Plant Sci. 2022, 13, 829479. [Google Scholar] [CrossRef]
  14. Jiao, Z.; Zhang, D.; Zhang, J.; Wang, L.; Ma, D.; Ma, L.; Wang, Y.; Gu, A.; Fan, X.; Peng, B.; et al. Early Detection of Chinese Cabbage Clubroot Based on Integrated Leaf Multispectral Imaging and Machine Learning. Horticulturae 2025, 11, 1335. [Google Scholar] [CrossRef]
  15. Diallo, B.M.; Li, Y.; Chukwuka, S.O.; Boamah, S.; Gao, Y.; Kone, M.M.K.; Rocho, G.; Wei, L. Enhanced-RICAP: A novel data augmentation strategy for improved deep learning-based plant disease identification and mobile diagnosis. Front. Plant Sci. 2025, 16, 1646611. [Google Scholar] [CrossRef]
  16. Albahli, S. AgriFusionNet: A Lightweight Deep Learning Model for Multisource Plant Disease Diagnosis. Agriculture 2025, 15, 1523. [Google Scholar] [CrossRef]
  17. Wu, C.; Gu, X.; Xiong, H.; Huang, H. Fine-grained recognition of grape leaf diseases based on transfer learning and convolutional block attention module. Appl. Soft Comput. 2025, 172, 112896. [Google Scholar] [CrossRef]
  18. Cai, C.; Wang, Q.; Cai, W.; Yang, Y.; Hu, Y.; Li, L.; Wang, Y.; Zhou, G. Identification of grape leaf diseases based on VN-BWT and Siamese DWOAM-DRNet. Eng. Appl. Artif. Intell. 2023, 123, 106341. [Google Scholar] [CrossRef]
  19. Zhang, N.; Zhang, E.; Qi, G.; Li, F.; Lv, C. Lightweight grape leaf disease recognition method based on transformer framework. Sci. Rep. 2025, 15, 28974. [Google Scholar] [CrossRef]
  20. Li, Y.; Guo, Z.; Sun, Y.; Chen, X.; Cao, Y. Weed Detection Algorithms in Rice Fields Based on Improved YOLOv10n. Agriculture 2024, 14, 2066. [Google Scholar] [CrossRef]
  21. Wang, T.; Niu, Y.; Zhao, W.; Gamage, R.P.; Ahmad, I. Research on intelligent classification of limestone photomicrographs based on the improved FasterNet architecture. Earth Sci. Inf. 2025, 18, 538. [Google Scholar] [CrossRef]
  22. Naparstek, O. Complexity as Advantage: A Regret-Based Perspective on Emergent Structure. arXiv 2025, arXiv:2511.04590. [Google Scholar] [CrossRef]
  23. Shin, J.; Yang, H.; Yi, Y. SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference. arXiv 2025, arXiv:2411.12692. [Google Scholar]
  24. Jiang, Y.; Wei, Z.; Hu, G. Detection of tea leaf blight in UAV remote sensing images by integrating super-resolution and detection networks. Environ. Monit. Assess. 2024, 196, 1044. [Google Scholar] [CrossRef]
  25. Manuylovich, E.; Bednyakova, A.E.; Ivoilov, D.A.; Terekhov, I.S.; Turitsyn, S.K. SOA-based reservoir computing using upsampling. Opt. Lett. 2024, 49, 5827–5830. [Google Scholar] [CrossRef] [PubMed]
  26. Xue, J.; Cheng, F.; Li, Y.; Song, Y.; Mao, T. Detection of Farmland Obstacles Based on an Improved YOLOv5s Algorithm by Using CIoU and Anchor Box Scale Clustering. Sensors 2022, 22, 1790. [Google Scholar] [CrossRef] [PubMed]
  27. Fan, H.; Liu, J.; Yan, X.; Zhang, C.; Cao, X.; Mao, Q. A Fast and High-Accuracy Foreign Object Detection Method for Belt Conveyor Coal Flow Images with Target Occlusion. Sensors 2024, 24, 5251. [Google Scholar] [CrossRef]
  28. Shen, M.; Liu, Y.; Chen, J.; Ye, K.; Gao, H.; Che, J.; Wang, Q.; He, H.; Liu, J.; Wang, Y.; et al. Defect detection of printed circuit board assembly based on YOLOv5. Sci. Rep. 2024, 14, 19287. [Google Scholar] [CrossRef]
  29. Li, P.; Zhou, J.; Sun, H.; Zeng, J. RDRM-YOLO: A High-Accuracy and Lightweight Rice Disease Detection Model for Complex Field Environments Based on Improved YOLOv5. Agriculture 2025, 15, 479. [Google Scholar] [CrossRef]
  30. Xia, B.; Luo, H.; Shi, S. Improved Faster R-CNN Based Surface Defect Detection Algorithm for Plates. Comput. Intell. Neurosci. 2022, 2022, 3248722. [Google Scholar] [CrossRef]
  31. Tan, L.; Huangfu, T.; Wu, L.; Chen, W. Comparison of RetinaNet, SSD, and YOLOv3 for real-time pill identification. BMC Med. Inform. Decis. Mak. 2021, 21, 324. [Google Scholar] [CrossRef] [PubMed]
  32. Geetha, A.S. What is YOLOv6? A Deep Insight into the Object Detection Model. arXiv 2024, arXiv:2412.13006. [Google Scholar] [CrossRef]
  33. Xu, W.; Li, X.; Ji, Y.; Li, S.; Cui, C. BD-YOLOv8s: Enhancing bridge defect detection with multidimensional attention and precision reconstruction. Sci Rep. 2024, 14, 18673. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Typical sample images of four grape berry diseases: (a) black mold; (b) canker; (c) powdery mildew; (d) sour rot.
Figure 2. GBDR-Net model architecture diagram.
Figure 3. The proposed SDF-Fusion module architecture diagram. Note: * represents the spatial convolution operation between the h × w × cp feature map and the filters, where appropriate padding is applied to retain the spatial dimensions (h × w) of the output feature map.
Figure 4. Upsampling process of the LCFR-Op operator.
Figure 5. Detection effect of GBDR-Net model: (a) black mold; (b) canker; (c) powdery mildew; (d) sour rot.
Figure 6. Precision-Recall curves of the GBDR-Net model.
Figure 7. Visual heatmaps of the GBDR-Net model for grape berry disease detection.
Figure 8. Comparison of detection performance across different models: (a) Original images; (b) Faster R-CNN; (c) SSD; (d) YOLOv6s; (e) YOLOv8s; and (f) GBDR-Net.
Figure 9. Training curves of different detection models.
Figure 10. Confusion matrix of the GBDR-Net model.
Figure 11. Radar chart for multi-dimensional performance evaluation of detection models.
Table 1. Distribution of sample image counts across subsets.

| Grape Berry Disease | Original Samples | Augmented Samples | Total Samples | Training Set | Validation Set | Test Set |
|---|---|---|---|---|---|---|
| Black mold | 234 | 1186 | 1746 | 1420 | 230 | 96 |
| Canker | 258 | 1282 | 1890 | 1540 | 254 | 96 |
| Powdery mildew | 232 | 1228 | 1794 | 1460 | 240 | 94 |
| Sour rot | 226 | 1104 | 1640 | 1330 | 220 | 90 |
| Total | 950 | 4800 | 7070 | 5750 | 924 | 396 |
Table 2. Ablation study settings and key results (✓: component enabled; –: not used).

| YOLOv10 | SDF-Fusion | Detect-XSmall | Cross-Concatenation | LCFR-Op | Inner-SIoU | Precision/% | Recall/% | mAP@0.5/% | mAP@0.5:0.95/% | Model Size/MB |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | – | – | – | – | – | 84.9 | 80.2 | 78.6 | 71.6 | 8.59 |
| ✓ | ✓ | – | – | – | – | 86.6 | 83.5 | 82.9 | 75.9 | 6.82 |
| ✓ | ✓ | ✓ | – | – | – | 87.3 | 84.6 | 86.3 | 78.2 | 7.16 |
| ✓ | ✓ | ✓ | ✓ | – | – | 90.5 | 87.1 | 88.5 | 81.7 | 7.21 |
| ✓ | ✓ | ✓ | ✓ | ✓ | – | 91.7 | 88.3 | 89.4 | 83.2 | 5.28 |
| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 93.4 | 89.6 | 90.2 | 86.4 | 4.83 |
Table 3. Detection performance of the GBDR-Net model for different grape berry diseases.

| Grape Berry Disease | Precision/% | Recall/% | mAP@0.5:0.95/% | F1-Score/% | IoU |
|---|---|---|---|---|---|
| Black mold | 94.2 | 92.1 | 88.2 | 93.7 | 0.88 |
| Canker | 91.9 | 89.7 | 85.5 | 91.2 | 0.85 |
| Powdery mildew | 93.3 | 91.4 | 87.9 | 92.3 | 0.86 |
| Sour rot | 92.5 | 90.2 | 86.5 | 91.5 | 0.83 |
Table 4. Performance metrics of different detection models.

| Performance Parameter | Faster R-CNN | SSD | YOLOv6s | YOLOv8s | GBDR-Net |
|---|---|---|---|---|---|
| Precision/% | 85.2 | 82.6 | 89.7 | 91.2 | 93.4 |
| Recall/% | 80.3 | 76.9 | 84.2 | 87.3 | 89.6 |
| mAP@0.5/% | 80.5 | 77.8 | 85.8 | 88.3 | 90.2 |
| mAP@0.5:0.95/% | 73.5 | 70.1 | 79.2 | 83.7 | 86.4 |
| F1-Score/% | 84.4 | 81.3 | 88.1 | 90.5 | 92.9 |
| Model size/MB | 125.6 | 105.2 | 16.8 | 22.5 | 4.83 |
| Inference speed/FPS | 12.5 | 28.3 | 62.5 | 78.4 | 98.2 |
| FLOPs/GFLOPs | 175.3 | 142.6 | 48.5 | 28.6 | 20.5 |
