1. Introduction
In the era of precision agriculture, the demand for intelligent harvesting technologies has grown rapidly due to increasing labor shortages, rising production costs, and the urgent need for quality and efficiency in fruit production [1,2]. Among the core tasks enabling automated harvesting, real-time and accurate fruit-ripeness detection plays a pivotal role in determining optimal picking times, minimizing post-harvest losses, and improving yield quality [3]. Traditional manual harvesting methods are often labor intensive, inconsistent, and unsuitable for large-scale orchards [4,5]. As a result, vision-based ripeness-detection systems have emerged as promising solutions for bridging the gap between traditional practices and intelligent agricultural operations [6,7].
Blueberries, characterized by their small size, clustered distribution, and rapid ripening cycle, are both an economically valuable crop and a highly challenging target for intelligent harvesting [8,9,10]. Their rich nutritional profile, high in anthocyanins, vitamin C, and antioxidants, has fueled global consumption and the expansion of cultivation areas [11,12]. However, the fruits' tendency to rot rapidly after ripening and their physical complexity make detection and harvesting particularly difficult in natural orchard environments [13]. Overlapping fruits, occlusion from leaves and stems, and varying lighting conditions further exacerbate the detection challenge, especially for systems deployed on low-power embedded devices with limited computational capacity [14]. Moreover, traditional machine-learning methods, such as color-based segmentation and shape-based analysis, often suffer from low detection accuracy, poor robustness, high computational cost, and slow processing speed, limiting their applicability in complex agricultural environments.
To address these issues, recent studies have applied improved object-detection models to the task of berry ripeness recognition. Two-stage detection frameworks have significantly improved detection performance; among them, the R-CNN family is a representative approach. By introducing the Region Proposal Network (RPN), these models generate candidate object regions within an image and perform classification and localization through separate subnetworks, enabling accurate and robust object detection. In contrast, one-stage detection methods, represented by the YOLO series, remove the need for an RPN and directly perform object classification and bounding-box regression in a single step, significantly improving detection speed while maintaining competitive accuracy. In [15], Chen et al. proposed the MTD-YOLOv7 model for cherry tomato bunch ripeness detection. By extending the YOLOv7 architecture with multi-task decoders, it simultaneously identifies fruit bunches, individual fruit ripeness, and bunch-level ripening; the model demonstrated strong robustness in complex agricultural environments, achieving 86.6% accuracy and showing promise for robotic harvesting applications. In [16], Zhu et al. developed YOLO-LM, a lightweight detector for Camellia oleifera fruit in orchards. Incorporating Criss-Cross Attention (CCA) and Adaptive Spatial Feature Fusion (ASFF), the model improved detection accuracy in occluded environments (93.18% mAP@0.5), facilitating orchard yield estimation and autonomous harvesting. In [17], Li et al. introduced a multi-view imaging-based phenotyping system (MARS-PhenoBot) that integrates the Segment Anything Model (SAM) for label-free annotation together with a customized BerryNet model. This system automates the measurement of metrics such as fruit count, ripeness level, and cluster compactness, enabling high-throughput phenotyping in field conditions for precision breeding and management. In [18], Yang et al. developed an enhanced detail feature module (EDFM) with content-aware reassembly of features (CARAFE), improving the extraction of color and texture features and thus enhancing detection accuracy. In [19], Quiroz et al. validated a CNN-based model for identifying 'Legacy' blueberry growth stages in Chilean smart farms, demonstrating the versatility of deep learning in agricultural settings. Despite these advances, existing methods still struggle in visually complex scenes, particularly under occlusion and background interference. Moreover, their high computational demands hinder applicability to real-time embedded systems.
To overcome these challenges, we propose BlueberryNet, a novel lightweight and robust deep-learning framework tailored for high-accuracy blueberry ripeness detection in real-world orchard settings. Our approach is guided by three core intuitions: (1) accurate detection under occlusion requires strong global semantic representation; (2) multi-scale feature fusion should be dynamically adaptable to account for variability in fruit size and viewpoint; (3) loss functions must be sensitive to IoU quality and class imbalance in dense scenes.
To this end, BlueberryNet introduces three novel modules that jointly enhance accuracy, adaptability, and efficiency. First, the GLKRep module improves global semantic perception by leveraging reparameterized large-kernel convolutions, enabling wide receptive fields without increasing inference overhead. Second, the UMSF detection head dynamically fuses multi-scale features through learnable receptive field selection, enhancing robustness to varying fruit sizes and perspectives. Finally, the model incorporates the SAIoU loss function, which introduces semantic consistency constraints among regional features during regression optimization, thereby mitigating false detections under occlusion and class imbalance.
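Since the implementation is not published in this excerpt, the following minimal PyTorch sketch illustrates only the general reparameterization idea behind GLKRep under our own assumptions: a large-kernel grouped convolution is trained alongside a parallel small-kernel branch, and the two are merged into a single convolution for inference, so the wide receptive field adds no deployment overhead. The class and method names are hypothetical, and normalization layers are omitted for brevity.

```python
import torch
import torch.nn as nn

class GroupedLargeKernelRep(nn.Module):
    """Hypothetical sketch: train with parallel large- and small-kernel
    grouped convs; merge them into one conv for inference."""
    def __init__(self, channels, large_k=7, small_k=3, groups=4):
        super().__init__()
        self.large = nn.Conv2d(channels, channels, large_k,
                               padding=large_k // 2, groups=groups)
        self.small = nn.Conv2d(channels, channels, small_k,
                               padding=small_k // 2, groups=groups)
        self.deployed = None  # set by reparameterize()

    def forward(self, x):
        if self.deployed is not None:
            return self.deployed(x)            # single conv at inference
        return self.large(x) + self.small(x)   # two branches in training

    @torch.no_grad()
    def reparameterize(self):
        k_l, k_s = self.large.kernel_size[0], self.small.kernel_size[0]
        pad = (k_l - k_s) // 2
        # Zero-pad the small kernel to the large size; conv is linear in
        # its weights, so the branch sum equals one conv with summed kernels.
        w = self.large.weight + nn.functional.pad(self.small.weight, [pad] * 4)
        b = self.large.bias + self.small.bias
        fused = nn.Conv2d(self.large.in_channels, self.large.out_channels,
                          k_l, padding=k_l // 2, groups=self.large.groups)
        fused.weight.copy_(w)
        fused.bias.copy_(b)
        self.deployed = fused
```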
In contrast to previous lightweight detectors such as YOLOv5n and YOLOv8n, which rely on static feature-fusion structures and conventional classification losses, BlueberryNet introduces both structural adaptability and sample-aware optimization into a compact architecture. This enables superior performance in complex orchard environments characterized by dense clustering, occlusion, and variable illumination. By jointly addressing semantic representation, multi-scale fusion, and supervisory quality, BlueberryNet achieves a favorable trade-off between accuracy and efficiency, making it a practical and deployable solution for real-time fruit-ripeness detection on edge devices.
The main contributions of this paper are summarized as follows:
- (1)
We construct a novel Grouped Large Kernel Reparameterization (GLKRep) module, which improves semantic representation using structurally reparameterized grouped convolutions, allowing for large receptive fields without increased inference cost.
- (2)
We propose the Unified Adaptive Multi-Scale Fusion (UMSF) detection head, which dynamically fuses multi-scale features through learned receptive field selection, overcoming the rigidity of the traditional FPN- or PANet-based fusion used in YOLO-family models.
- (3)
We integrate Semantics-Aware IoU (SAIoU) Loss, which introduces semantic consistency constraints among regional features during the regression optimization process, enabling a more comprehensive and precise evaluation of the alignment between predicted and ground truth bounding boxes.
The rest of this paper is organized as follows:
Section 2 introduces the dataset and preprocessing methods;
Section 3 presents the BlueberryNet model;
Section 4 reports experimental results;
Section 5 discusses the findings;
Section 6 concludes the paper and outlines future work.
5. Experiments
5.1. Experimental Details
To evaluate the performance, generalization, and efficiency of the proposed BlueberryNet model, we conducted a series of quantitative and qualitative experiments against multiple baselines.
The experimental models were trained, validated, and tested on a Windows 10 (64-bit) operating system. The computer used had 32 GB of RAM, an NVIDIA GeForce RTX 2060 GPU (NVIDIA Corporation, Santa Clara, CA, USA), and an Intel(R) Core(TM) i7-10870H CPU @ 2.20 GHz (Intel Corporation, Santa Clara, CA, USA). The PyTorch version was 1.10.0, the programming language was Python 3.8.5, and CUDA 11.3 was used for GPU acceleration.
All experiments in this study were conducted under identical conditions. The training images were resized to 640 × 640 pixels, with a batch size of 16. The initial learning rate was set to 0.01, and the optimizer used for training was SGD, with a momentum value of 0.937. The training process was carried out for a total of 120 epochs.
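For reproducibility, these settings map directly onto a standard training call. The sketch below assumes the Ultralytics YOLOv8 interface; the authors' actual training script and dataset config are not provided, so the model and `blueberry.yaml` entries are placeholders.

```python
from ultralytics import YOLO

# Hedged sketch of the stated hyperparameters; BlueberryNet itself is
# not public, so the stock YOLOv8n config stands in here.
model = YOLO("yolov8n.yaml")
model.train(
    data="blueberry.yaml",   # hypothetical dataset config
    imgsz=640,               # 640 x 640 input resolution
    batch=16,
    epochs=120,
    optimizer="SGD",
    lr0=0.01,                # initial learning rate
    momentum=0.937,
    device=0,                # single GPU (RTX 2060 in the paper)
)
```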
5.2. Evaluation Indicators
This study primarily evaluates model performance using precision (P), recall (R), mean average precision (mAP), floating-point operations (FLOPs), frames per second (FPS), and the number of parameters (Params, M), which are defined as follows [30,31]:

$$P = \frac{TP}{TP + FP}$$

where true positive (TP) represents the number of actual positive samples correctly predicted as positive, while false positive (FP) refers to the number of actual negative samples incorrectly predicted as positive.

$$R = \frac{TP}{TP + FN}$$

where false negative (FN) indicates the number of actual positive samples predicted as negative.

$$AP = \int_0^1 P(R)\,dR, \qquad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where average precision (AP) measures the average precision for a specific class of targets at various recall points and corresponds to the area under the precision–recall (PR) curve, and N is the number of classes. When the Intersection over Union (IoU) threshold is set to 0.5, AP is specifically denoted as AP50.

Real-time performance is assessed using frames per second (FPS), where a higher FPS value indicates better real-time detection capability. These metrics collectively evaluate the accuracy and efficiency of the model in detecting blueberry ripeness.
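As a concrete illustration of these standard definitions, the short sketch below computes P, R, and all-point-interpolated AP from raw counts and a PR curve. The numbers are toy values for illustration, not results from the paper.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN) from raw detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """Area under the PR curve via the standard all-point interpolation
    (as used for AP50)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make the precision envelope monotonically decreasing, right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]       # points where recall changes
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

p, r = precision_recall(tp=95, fp=2, fn=5)   # toy counts
ap = average_precision(np.array([0.5, 0.8, 0.95]),
                       np.array([1.0, 0.9, 0.8]))
print(f"P={p:.3f}, R={r:.3f}, AP={ap:.3f}")
```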
5.3. Performance Comparison of BlueberryNet and YOLOv8n
This study proposes BlueberryNet by making structural improvements based on YOLOv8n. To verify the effectiveness of the improved model, a series of comparative experiments were conducted in which several representative test images were randomly selected for comparison.
As shown in Figure 6, the four sets of comparative images illustrate the limitations of YOLOv8n when dealing with small, dense, and occluded blueberry targets. First, YOLOv8n exhibits a noticeable tendency toward missed detections. In the first row, YOLOv8n identifies only a subset of the fruits, with the number of bounding boxes significantly lower than the actual number of blueberries, particularly for those located near the edges or partially occluded by leaves. In contrast, BlueberryNet accurately localizes most of the fruits in the scene. Second, YOLOv8n suffers from low localization precision and blurred bounding-box boundaries. In the second row, multiple predicted boxes from YOLOv8n overlap substantially, which hinders target differentiation and negatively impacts subsequent tasks such as fruit counting and recognition. BlueberryNet, however, produces tighter and more precise bounding boxes that closely fit the fruit contours, demonstrating improved localization performance. Finally, YOLOv8n tends to produce redundant detections in densely packed scenarios. As shown in the fourth row, YOLOv8n outputs several overlapping boxes within the same region, leading to detection clustering. BlueberryNet effectively mitigates this issue by incorporating a more efficient feature-extraction mechanism, which enhances its ability to detect densely distributed small targets.
Figure 7 presents the performance curves of YOLOv8n and the improved BlueberryNet model during the training process.
An analysis of four key metrics reveals that BlueberryNet demonstrates superior learning capability and convergence speed from the early stages of training. In terms of recall, BlueberryNet improves rapidly within the first 20 epochs and maintains a stable value above 0.96 throughout the remainder of training, indicating stronger object-detection capability and a lower risk of missed detections. For precision, BlueberryNet achieves a relatively stable curve with minimal fluctuations, reflecting robust training stability and better generalization. In contrast, YOLOv8n exhibits less stability in the early stages and maintains comparatively lower precision overall. Regarding the mAP@0.5 metric, BlueberryNet sustains accuracy above 0.95 in the later stages of training, significantly outperforming YOLOv8n, which plateaus around 0.90. Overall, BlueberryNet consistently outperforms YOLOv8n across all key performance metrics: recall, precision, mAP@0.5, and mAP@0.5:0.95.
Furthermore, its performance advantage becomes increasingly evident as training progresses, validating the effectiveness and superiority of the proposed model in blueberry-detection tasks.
5.4. Comparison Experiments
To evaluate the detection performance of the proposed BlueberryNet algorithm, a comprehensive comparison was conducted between BlueberryNet and several mainstream object-detection models, including Faster R-CNN [32], SSD [33], YOLOv5n [34], YOLOv7-tiny [35], YOLOv8n [36], YOLOv9t [37], YOLOv10n [38], and YOLOv11n [39]. All models were trained under identical environments and hyperparameter settings to ensure a fair evaluation. The performance comparison results are presented in Table 2.
As shown in Table 2, BlueberryNet achieves the most outstanding performance, with a precision of 98.1%, a recall of 95.5%, and an mAP of 97.5%. Furthermore, it maintains a lightweight structure with only 2.6M parameters and a computational cost of 7.2 GFLOPs, making it highly suitable for deployment in resource-constrained environments. These results are largely attributed to the integration of the improved GLKRep module at the end of the backbone, which enhances contextual awareness while significantly reducing the number of parameters. Additionally, the UMSF detection head aggregates multi-level feature maps and adaptively optimizes multi-scale feature fusion through a dynamic receptive-field selection mechanism, thereby improving the model's ability to detect blueberry targets of varying sizes. In comparison, Faster R-CNN achieves an mAP@0.5:0.95 of 91.7%, a precision of 96.2%, and a recall of 97.1%. Although its recall is slightly higher, this comes at the cost of increased model complexity (3.0M parameters) and computational load (8.2 GFLOPs), as is typical of two-stage detection frameworks. SSD attains the highest recall among single-stage models but suffers from excessive complexity, resulting in low inference efficiency and poor suitability for lightweight applications. YOLOv10n and YOLOv11n generate overly large weight files, limiting their compatibility with embedded platforms. Although YOLOv5n, YOLOv7-tiny, YOLOv8n, and YOLOv9t each offer different trade-offs between accuracy and efficiency, none achieves the optimal balance demonstrated by BlueberryNet.
The proposed model proves especially effective in dense, small-object scenarios such as real-time blueberry ripeness detection in complex orchard environments.
To provide an intuitive comparison of model performance, representative test images were selected, as shown in Figure 8.
In the first row of images, the blueberry fruits are densely packed with noticeable differences in ripeness, where light green and light purple fruits are interspersed among dark, mature fruits, creating some identification interference. In this scenario, Faster R-CNN can generally detect most dark fruits but fails to effectively identify lighter-colored or partially occluded fruits, with some bounding boxes erroneously offset to leaf areas, resulting in significant errors. YOLOv5n detects more targets, but the confidence scores are unevenly distributed, with severe overlapping of some bounding boxes and instances of false detections and redundant boxes. YOLOv7-tiny shows improvement in detecting edge fruits but still misses some lighter-colored fruits. YOLOv10 and YOLOv11 produce more compact bounding box distributions and clearer boundary delineations among overlapping fruits, with YOLOv11 achieving relatively accurate localization of some partially mature fruits. BlueberryNet performs the best in this scenario, not only identifying all mature fruits but also accurately detecting two light green, unripe fruits, demonstrating that the model has learned the appearance features of fruits at different growth stages during training.
In the second row of images, the blueberry fruits are more sparsely distributed, but the background features significant leaf occlusion, posing a challenge to the models’ anti-interference capabilities. Faster R-CNN only detects some foreground fruits, failing to penetrate leaf occlusion to identify targets in the background. YOLOv5n and YOLOv7-tiny show an increase in the number of detections but still miss several mature fruits. YOLOv10 and YOLOv11 effectively avoid misjudgments caused by leaf veins or reflections through accurate extraction of fruit edge contours, achieving significantly better detection performance than the previous models. BlueberryNet once again demonstrates precise recognition of small-scale and leaf-occluded fruits, even accurately boxing a fruit with only a partially exposed peel, showcasing significantly enhanced robustness and target perception capabilities.
The third row of images depicts blueberry fruits in a greenhouse environment with complex background structures and some degree of uneven lighting. Faster R-CNN’s detection performance further declines, failing to mark most edge targets except for clearly visible foreground fruits. YOLOv5n and YOLOv7-tiny show slightly improved adaptability to the environment, but misdetections persist in areas with light spots or highly reflective leaves. YOLOv10 produces bounding boxes that better align with fruit contours, reducing the false detection rate, while YOLOv11 maintains boundary independence among multiple overlapping fruits. BlueberryNet again exhibits strong recognition capabilities in occluded and unevenly lit areas, particularly in the heavily occluded lower-left region of the image, where it successfully detects targets completely missed by other models, maintaining high confidence scores.
In the fourth row of images, the blueberries exhibit significant ripeness variation, ranging from light green and pink to purple and dark blue. Faster R-CNN and YOLOv5n almost entirely fail to identify non-dark fruits, resulting in low bounding-box density and insufficient accuracy. YOLOv7-tiny responds to purple fruits to some extent but suffers from fragmented recognition and redundant boxes. YOLOv10 and YOLOv11 stably detect fruits of medium to high ripeness with balanced confidence-score distributions. BlueberryNet comprehensively covers fruits of all color stages, achieving the highest detection count with almost no false positives, indicating that its training data likely include blueberry samples under varied spectral conditions, endowing the model with superior spectral robustness and ripeness perception.
Figure 9 presents a radar chart comparing the performance metrics of BlueberryNet with those of other benchmark models. The figure provides a clear visual illustration of BlueberryNet's strengths, particularly in terms of its lightweight design: it achieves the best scores for both parameter count (Params) and computational complexity (FLOPs), i.e., the fewest parameters and the lowest FLOPs, indicating excellent suitability for deployment on resource-constrained devices.
Considering multiple aspects—including the number of detected objects, confidence score distribution, boundary localization accuracy, occlusion handling, and ripeness stage diversity—BlueberryNet consistently outperforms competing models. These results highlight its strong potential for real-world applications, especially in scenarios requiring high efficiency and robustness in complex agricultural environments.
5.5. Ablation Experiments
To evaluate the effectiveness of each proposed improvement, we conducted four ablation experiments under identical datasets and training settings. The experiments were carried out in a stepwise manner: first, the original SPPF layer in the backbone network was replaced with the custom-designed GLKRep module; second, the PANet structure in the neck network was substituted with the UMSF detection head; and finally, the original classification loss was replaced with the SAIoU loss function. The detailed results are presented in Table 3.
The baseline model, without any modifications, achieved an mAP@0.5:0.95 of 91.7%, with 8.2 GFLOPs and 3.0M parameters. After introducing the GLKRep module, the mAP increased to 94.1%, while FLOPs and the parameter count each decreased by approximately 0.2 (GFLOPs and M, respectively), indicating that the module enhances local feature extraction and receptive-field representation without sacrificing computational efficiency.
Building upon this, replacing the PANet in the neck with the UMSF detection head led to a further increase in mAP to 96.8% and a reduction in model size to 2.7M parameters. This improvement is attributed to the UMSF detection head's ability to receive multi-level feature maps and adaptively optimize multi-scale feature fusion through a dynamic receptive-field selection mechanism. These results further validate the module's effectiveness in enhancing detection performance for blueberry targets of varying sizes.
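UMSF's internal design is not fully specified in this excerpt. The sketch below shows one plausible, selective-kernel-style reading of "dynamic receptive-field selection": parallel dilated branches with different effective receptive fields are fused by input-dependent softmax weights. The class name, branch count, and dilation choices are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DynamicReceptiveFieldFusion(nn.Module):
    """Hypothetical sketch: fuse parallel branches with different
    receptive fields using weights predicted from the input itself."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations)
        self.gate = nn.Sequential(            # per-branch soft weights
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, len(dilations), 1),
            nn.Softmax(dim=1))

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # B,K,C,H,W
        w = self.gate(x).unsqueeze(2)                              # B,K,1,1,1
        return (feats * w).sum(dim=1)          # weighted sum over branches
```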
Finally, after integrating both structural improvements, we replaced the original loss function with the SAIoU loss. This led to a final mAP of 97.5%, with 7.2 GFLOPs and 2.6M parameters. The SAIoU loss improves the model's discriminative power by weighting positive and negative samples based on IoU-aware scores, thereby enhancing detection accuracy under occlusion and reducing false negatives and false positives. The results demonstrate a well-balanced improvement in both detection accuracy and model efficiency.
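The exact form of the SAIoU loss is not given in this excerpt; because the conclusions tie it to IoU-Aware Classification Scores (IACS), a varifocal-style IoU-weighted classification loss is a reasonable approximation. The function below is an illustrative sketch under that assumption, not the authors' definition.

```python
import torch
import torch.nn.functional as F

def iou_aware_cls_loss(pred_logits, ious, labels, alpha=0.75, gamma=2.0):
    """Varifocal-style sketch of an IoU-aware classification loss.
    pred_logits: (N, C) raw scores; ious: (N,) IoU of each positive with
    its matched GT box; labels: (N,) long class index, -1 for negatives."""
    pred = pred_logits.sigmoid()
    target = torch.zeros_like(pred)
    pos = labels >= 0
    target[pos, labels[pos]] = ious[pos]        # soft IoU-valued target
    # Positives are weighted by their IoU target; negatives are focally
    # down-weighted so easy background does not dominate.
    weight = torch.where(target > 0, target, alpha * pred.pow(gamma))
    return (F.binary_cross_entropy(pred, target, reduction="none")
            * weight).sum()
```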
Figure 10 presents feature map visualizations to compare the performance of the GLKRep module and UMSF detection head during the feature extraction process. Figure 10a shows the original input image containing multiple blueberry fruits in a complex background. In Figure 10b, the feature maps extracted after incorporating the GLKRep module exhibit strong responses to object edges and local textures, effectively highlighting the structural information of blueberry fruits. This enhancement contributes to more precise localization by emphasizing fine-grained details. Figure 10c shows the feature maps after introducing the UMSF detection head. The results demonstrate improved semantic consistency and spatial continuity, with feature activations more concentrated in the blueberry regions and significantly reduced background interference. This indicates superior global feature fusion capability. Overall, the GLKRep module enhances the extraction of fine details, while the UMSF detection head strengthens multi-scale semantic fusion. The combination of both modules substantially improves the model's robustness and accuracy in complex detection scenarios.
To intuitively illustrate the effectiveness of the proposed model improvements, we employ Gradient-weighted Class Activation Mapping (Grad-CAM) to visualize the attention regions of the detection targets [40,41]. In the visualizations, brighter areas indicate regions to which the model pays greater attention. The results are shown in Figure 11.
Figure 11a shows the original input images. Figure 11b displays the heatmaps from the baseline YOLOv8n model. As observed, the attention is relatively scattered and inconsistent, with noticeable activation on non-target areas such as leaves and background structures. This reflects the model's limited ability to focus accurately on the blueberry regions in cluttered environments.
When the GLKRep module is incorporated (Figure 11c), the attention becomes significantly more concentrated on the actual blueberry targets. The model learns to emphasize relevant local features, thereby enhancing discrimination between foreground and background. However, some residual background interference remains under certain complex conditions.
With the integration of the UMSF detection head (Figure 11d), the model gains improved global contextual awareness, enabling it to better aggregate multi-scale information and suppress irrelevant activations. The heatmaps show a more coherent and stable attention focus, even under occlusions or partial visibility of the fruit.
Finally, Figure 11e shows the results from the full model, which combines the GLKRep module and the UMSF detection head. This configuration yields the most accurate and robust attention distribution, with clear, sharply localized focus on the blueberry regions and minimal response to background clutter.
These visual results confirm that the proposed architectural modifications contribute to improved feature representation and localization precision, particularly in complex field environments.
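Grad-CAM itself is model-agnostic; a minimal PyTorch implementation of the visualization procedure used above looks like the following. This is a generic sketch, not the authors' script, and `score_fn` is a hypothetical hook for reducing detector output to a scalar (e.g., the top detection's confidence).

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    """Generic Grad-CAM sketch. image: (1,3,H,W) tensor; target_layer:
    a conv module inside `model`; score_fn: output -> scalar score."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    score = score_fn(model(image))
    model.zero_grad()
    score.backward()                            # gradients w.r.t. activations
    h1.remove(); h2.remove()
    w = grads["g"].mean(dim=(2, 3), keepdim=True)   # GAP over spatial dims
    cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return cam / cam.max().clamp(min=1e-6)          # normalize to [0, 1]
```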
5.6. Generalization Assessment
To evaluate the generalization capability of the BlueberryNet model, this study conducted tests on a publicly available blueberry dataset from the literature [42]. As shown in Figure 12, the left column presents the original images, while the right column displays the detection results produced by BlueberryNet. The experimental results demonstrate that the model can accurately detect blueberry fruits and distinguish their ripeness levels under complex background conditions. Even in scenarios involving dense fruit clusters, partial occlusion, or significant lighting variations, the model consistently produces precise bounding boxes and class labels, exhibiting strong robustness and high detection accuracy. These findings further confirm that BlueberryNet maintains reliable consistency and recognition performance across various shooting angles and application scenarios.
Table 4 compares the performance of YOLO-BLBE and BlueberryNet across multiple evaluation metrics. As shown in the table, BlueberryNet demonstrates higher efficiency in terms of model size and detection speed, while also outperforming YOLO-BLBE in key detection indicators such as precision, recall, and mAP. Although its F1 score is slightly lower than that of YOLO-BLBE, the overall performance of BlueberryNet is more balanced. These results indicate that BlueberryNet not only performs well on the specific training dataset but also exhibits strong detection capability and generalization on the public test set.
6. Conclusions
This paper proposes BlueberryNet, a lightweight and robust deep-learning framework for high-accuracy blueberry ripeness detection. To achieve accurate identification of blueberries at different ripeness stages and address the limitations of existing detection models in multi-scale feature extraction and adaptability to complex environments, the model is built on the YOLOv8n architecture. It incorporates a GLKRep module to enhance semantic perception and introduces a UMSF dual-layer detection head to meet multi-scale feature-fusion requirements. The advantages of the proposed model are demonstrated in three key aspects:
- (1)
Lightweight design and deep integration of structural reparameterization: This paper innovatively introduces the GLKRep module, which combines grouped channel convolution with large-kernel structural reparameterization. This approach significantly reduces computational complexity while maintaining semantic perception capabilities, effectively enhancing the depth and semantic awareness of feature extraction. It ensures efficient deployment and real-time response on edge devices.
- (2)
Adaptive dual-layer receptive-field multi-scale fusion structure: To address the significant scale variations and complex spatial distribution of blueberries in natural environments, a UMSF dual-layer detection head was designed. This module dynamically receives and fuses feature maps from different layers of the backbone network, utilizing a multi-scale convolutional structure to achieve precise blueberry fruit recognition under conditions of scale variation, target overlap, and perspective changes, significantly enhancing the model's robustness in identifying blueberry ripeness in complex scenarios.
- (3)
Introduction of an IoU-aware classification loss to optimize detection consistency: During the model training phase, the SAIoU loss function is introduced, leveraging IoU-Aware Classification Scores (IACS) to effectively coordinate the optimization of the target classification and bounding-box regression tasks. This results in higher stability and accuracy in multi-target detection scenarios with dense fruit clusters and severe occlusion.
Despite the significant breakthroughs achieved by the BlueberryNet model in the blueberry ripeness-detection task, certain limitations remain. First, the model currently focuses on ripeness classification and does not address the detection of fruit pests or diseases. Second, the model relies on high-quality image inputs, and its adaptability to extreme weather conditions or blurry images needs further improvement. Future research will focus on enhancing the BlueberryNet model, further exploring its detection and recognition capabilities for blueberry targets in complex agricultural scenarios. This includes achieving precise identification and classification of blueberry pests and diseases, intelligent estimation of large-scale blueberry yields, and analysis of blueberry growth trends, ultimately contributing to the promotion and development of intelligent agricultural monitoring technologies. In addition, we plan to deploy BlueberryNet on mobile and embedded platforms such as NVIDIA Jetson Nano, enabling real-time inference in orchards and post-harvest processing lines. These enhancements will further promote the deployment of AI-based monitoring systems in precision agriculture and facilitate the transition toward autonomous fruit production management.
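For the planned edge deployment, a typical preparation step would be exporting the trained weights to an inference format supported by Jetson-class devices. The sketch below assumes the Ultralytics export API; the weights filename is a placeholder, since BlueberryNet's checkpoint is not published.

```python
from ultralytics import YOLO

# Hypothetical export path for edge deployment (future work in the paper).
model = YOLO("blueberrynet.pt")          # placeholder weights file
model.export(format="onnx", imgsz=640)   # ONNX, e.g., for TensorRT on Jetson
```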