1. Introduction
Blueberries are widely grown around the world, including in the United States, Japan, China, New Zealand, and Europe [1], and have high economic value because of their rich nutrition and unique taste [2]. As traditional manual harvesting can no longer meet growing market demand, automated blueberry harvesting technology has become an active research topic [3]. Fast and accurate maturity detection is key to automated picking, but it faces challenges in complex outdoor environments, such as shading by branches and leaves, variations in light, overlapping fruits, and the small diameter of the blueberries themselves.
Traditional image processing methods mainly rely on extracting surface features of the target fruit, such as color and texture, and then use methods such as support vector machines or K-means clustering to detect and classify fruit maturity [4]. Khojastehnazhand et al. classified apricots into three maturity stages (unripe, ripe, and overripe) using color features [5]. Saifullah et al. used the K-Nearest Neighbors (KNN) method based on color feature extraction to classify the maturity of oil palm fruit [6]. Saha et al. studied starfruit maturity using smartphone-based imaging and a variety of traditional machine-learning algorithms [7]. Wei et al. predicted grape maturity from color values using a back-propagation neural network (BPNN) [8]. Yue et al. developed three strawberry maturity recognition models using absorbance data, finding the multivariate nonlinear model most accurate under greenhouse conditions [9]. Wang et al. segmented strawberries using hue-based masks and quantified maturity by the red ratio [10].
While traditional methods achieve high accuracy in controlled settings, their sensitivity to lighting and background changes limits their generalization and their suitability for outdoor farming. This limitation has led to the adoption of deep learning, which offers a more robust approach to maturity detection in fruit and vegetable crops under complex natural conditions [3]. Yang et al. proposed a strawberry maturity grading model based on YOLOv8, combined with an LW-Swin Transformer module to improve generalization capability [11]. Addressing more complex scenarios, such as the detection of small objects and occlusion across different maturity stages, Wang et al. proposed a foreground–foreground class balance method and an improved YOLOv8s network to detect and segment tomato fruits [12]. Zhou et al. [13] compared strawberry maturity classification using YOLOv3 with aerial and ground-based images, demonstrating superior speed and accuracy, respectively. Siricharoen et al. [14] addressed the problem of limited training data for pineapple maturity grading using the more traditional Mask R-CNN. Saedi et al. [15] introduced a dual-path lightweight CNN combining regular and dilated convolutions to classify olive varieties. To address occlusion and lighting interference in broccoli maturity recognition, Kang et al. [16] developed a CNN–Transformer-based instance segmentation network. MacEachern et al. [17] compared six YOLO variants for wild blueberry maturity detection, aiding localized grower decisions. Parvathi et al. [18] proposed a Faster R-CNN model with a ResNet-50 backbone for detecting coconut maturity in complex backgrounds. Wang et al. [19] combined Faster R-CNN with a path aggregation network for tomato maturity detection, achieving strong results under occlusion and uneven lighting. Chen et al. [20] introduced a multi-task variant of YOLOv7 to assess cherry tomato clusters with respect to location, size, category, and maturity. The methods and drawbacks of deep-learning-based fruit and vegetable maturity detection research are summarized in Table 1.
While these algorithms outperform traditional computer vision methods in accuracy, robustness, and applicability, they are often limited to close-range detection and struggle with more complex scenarios. In blueberry maturity detection, the issues of extraordinarily small fruits, severe occlusions, and strongly varying light conditions need to be addressed for automated harvesting across large areas. Therefore, how to optimize model complexity while ensuring detection accuracy and enhancing adaptability in complex environments remains a key issue in current algorithm research.
The contributions of this work are as follows: (1) The well-known YOLOv8n model is extended with Receptive-Field Attention Convolution (RFAConv) [21] in the backbone network to enhance the model’s ability to extract small-object features and improve detection accuracy. (2) A small object detection layer is introduced in the neck network. (3) The resulting features are integrated by a four-head adaptive spatial feature fusion detection head (Detect-FASFF), based on [22], which further enhances the model’s ability to detect targets at different scales and reduces feature loss, addressing missed detections and false positives under conditions such as long-range views, severe occlusion, and varying lighting. (4) A more focused complete intersection over union loss (Focaler-CIoU) is introduced into the loss function to speed up model convergence and thereby improve detection performance [23].
2. YOLOv8 Reference Network
YOLOv8, developed by Ultralytics, is the eighth-generation model in the YOLO series. It introduces several architectural improvements to enhance performance and flexibility in object detection tasks [24]. The standard YOLOv8 network consists of three main components: the backbone, the neck, and the head networks.
In YOLOv8, the backbone network replaces the earlier C3 (Cross Stage Partial Network v3) module with the C2f (Cross Stage Partial Network v2 with Fusion) module for feature extraction. The C2f module integrates the bottleneck structure, convolutional layers, and skip connections, improving gradient flow while reducing model complexity. Additionally, the SPPF (Spatial Pyramid Pooling-Fast) module is incorporated for multi-scale feature extraction and fusion. The neck network employs the FPN (Feature Pyramid Network) and PAN (Path Aggregation Network) architectures to enhance feature fusion for multi-scale object detection. The head network adopts a decoupled structure with an anchor-free design, where the classification task uses the BCE (Binary Cross Entropy) loss function and the regression task employs the Distribution Focal Loss and CIoU (complete intersection over union) loss functions.
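To make the C2f structure concrete, the following is a simplified PyTorch sketch of a C2f-style block. It is a reconstruction for illustration only, not the exact Ultralytics implementation; the class and argument names are chosen here for clarity.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic building block used throughout YOLOv8."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with an optional residual (skip) connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C2f(nn.Module):
    """Split features into two halves, pass one half through n bottlenecks while keeping
    every intermediate output, then concatenate all branches and fuse with a 1x1 conv."""
    def __init__(self, c_in, c_out, n=1, shortcut=False):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)
        self.m = nn.ModuleList(Bottleneck(self.c, shortcut) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for m in self.m:
            y.append(m(y[-1]))  # keep every intermediate output for the final concatenation
        return self.cv2(torch.cat(y, dim=1))

# Example: C2f(64, 64, n=2)(torch.randn(1, 64, 80, 80)) has shape (1, 64, 80, 80).
```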
3. Modifications to the Reference Network
This paper aims to improve the YOLOv8n baseline network for the detection of very small objects, such as blueberry fruits. Our modifications include the following: (1) In the backbone, Receptive-Field Attention Convolution (RFAConv) is applied to the C2f module. The C2f-RFAConv module aims to capture more detailed structural information for small objects by strengthening receptive fields to enhance sensitivity to fine-scale features. (2) In the neck network, a small object detection layer is added to enhance the ability to extract small object features. (3) At the output stage, this study employs a four-head version of the adaptive spatial feature fusion detection head (Detect-FASFF). This module integrates multi-scale spatial features to improve the detection of small and occluded objects. (4) The CIoU loss function is replaced with Focaler-CIoU, a more focused complete intersection over union loss. This modification intends to improve bounding box regression by prioritizing difficult-to-detect small objects. The adapted model’s structure diagram is shown in Figure 1, with the red dashed boxes indicating the modifications.
3.1. Receptive-Field Attention Convolution
RFAConv (Receptive-Field Attention Convolution) [21] improves upon standard convolution by addressing its key limitation: parameter sharing. Traditional convolutions apply the same kernel across the entire image, which can overlook variations in feature importance across different spatial locations. RFAConv introduces Receptive-Field Attention (RFA), which assigns attention weights to different receptive fields for more context-aware feature extraction. This approach is beneficial for detecting small objects like blueberries by focusing on locally relevant receptive fields, making it more effective at capturing small-object patterns.
In this work, integrating RFAConv into the C2f module of YOLOv8 is proposed, as shown in Figure 2: the conventional bottleneck consists of two convolutions with a residual connection and fusion (concat) [25,26,27], and these convolutions are replaced by the RFAConv module.
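As a concrete illustration, the following is a minimal PyTorch sketch of a receptive-field attention convolution. It follows the mechanism described above (attention weights over each k × k receptive field, followed by a k × k convolution of stride k over the re-arranged fields), but it is a simplified reconstruction under stated assumptions, not the authors’ exact RFAConv implementation; all names are chosen here for illustration.

```python
import torch
import torch.nn as nn

class RFAConvSketch(nn.Module):
    """Simplified receptive-field attention convolution (sketch, k assumed odd).

    For every spatial location, the k*k values inside its receptive field are extracted,
    weighted by learned attention, re-arranged into an enlarged map, and convolved with a
    k x k kernel of stride k, so each receptive field is weighted individually rather than
    processed with a fully shared kernel.
    """
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.k = k
        # extract the k*k receptive-field values around every location
        self.unfold = nn.Unfold(kernel_size=k, padding=k // 2)
        # produce k*k attention logits per channel and location
        self.attn = nn.Sequential(
            nn.AvgPool2d(kernel_size=k, stride=1, padding=k // 2),
            nn.Conv2d(c_in, c_in * k * k, kernel_size=1, groups=c_in),
        )
        # final convolution over the re-arranged, attention-weighted fields
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=k, stride=k, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        k = self.k
        feat = self.unfold(x).view(b, c, k * k, h, w)            # receptive-field values
        attn = self.attn(x).view(b, c, k * k, h, w).softmax(2)   # weights over each field
        y = (feat * attn).view(b, c, k, k, h, w)
        y = y.permute(0, 1, 4, 2, 5, 3).reshape(b, c, h * k, w * k)
        return self.conv(y)

# In the C2f-RFAConv module of Figure 2, the two convolutions inside each bottleneck
# would be replaced by blocks of this kind, e.g.
# RFAConvSketch(64, 64)(torch.randn(1, 64, 80, 80)) has shape (1, 64, 80, 80).
```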
3.2. Small-Target Detection Layer
The blueberry dataset used in this paper includes scenes with long shooting distances and severe branch and leaf occlusion, containing a large number of small blueberry fruits. The feature maps at the lower pyramid levels P1 and P2 of the backbone network contain rich information about small targets; however, this information is not fully utilized by the original network. To address this, a small-target detection layer is added in this paper. Because small targets occupy only a very small proportion of the image, the model needs a higher spatial resolution to capture their features; that is, a 160 × 160 feature-map detection layer is added to the original network to detect small blueberry targets larger than 4 × 4 pixels, thereby improving the detection capability for small blueberries at long distances and in severely occluded scenes. The structure is shown as a dashed box in Figure 1.
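The link between the added detection layer’s resolution and the smallest detectable fruit size can be illustrated with a quick calculation, assuming a 640 × 640 input and the standard YOLOv8 strides:

```python
# Grid sizes of the detection layers for a 640 x 640 input (standard YOLOv8 strides,
# plus the added P2 layer). Each P2 grid cell covers a 4 x 4 pixel region, which is
# why the extra head targets blueberries down to roughly that size.
input_size = 640
strides = {"P2 (added)": 4, "P3": 8, "P4": 16, "P5": 32}
for level, stride in strides.items():
    cells = input_size // stride
    print(f"{level}: {cells} x {cells} grid, each cell covers {stride} x {stride} pixels")
```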
3.3. Four-Head ASFF Detector
Adaptive spatial feature fusion (ASFF) reweights feature maps from different pyramid levels and fuses them using learned control parameters. This process is differentiable and can be optimized through back-propagation to improve the model’s detection performance [22]. The ASFF structure diagram is shown in Figure 3.
The PAN-FPN structure in YOLOv8 can extract information from multiple scales, but its feature fusion process is fixed and lacks flexibility. Feature maps of different resolutions are fused with fixed weights, and the best fusion strategy cannot be adaptively selected according to the characteristics of the target. This fixed fusion may lead to feature loss and feature inconsistency when detecting very small targets such as blueberries. To optimize feature fusion, improve detection accuracy, and reduce the missed and false detection rates in complex environments, this paper proposes a four-head adaptive spatial feature fusion detection head (Detect-FASFF) based on ASFF. The original equations are extended to fuse four levels, with the structure shown in Figure 4 and the feature calculation given in Equations (1) and (2).
Here, $x_{ij}^{s \to l}$ denotes a feature map rescaled from the size of layer $s$ to the size of layer $l$. The weight parameters $\alpha$, $\beta$, $\gamma$, and $\delta$ indicate the importance of each layer’s feature map; their values are learned during training through gradient descent to effectively adjust the contribution of each layer’s feature map to the final output. $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, $\gamma_{ij}^{l}$, and $\delta_{ij}^{l}$ are the values of $\alpha$, $\beta$, $\gamma$, and $\delta$ at position $(i, j)$ of output level $l$, respectively.
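Since the equations themselves are not reproduced in this excerpt, the following block sketches the four-level fusion that Equations (1) and (2) presumably express, following the standard ASFF formulation [22] extended to a fourth level (the exact notation in the original may differ):

```latex
% Four-level adaptive fusion at output level l (reconstruction, not copied from the paper)
y_{ij}^{l} = \alpha_{ij}^{l}\, x_{ij}^{1 \to l} + \beta_{ij}^{l}\, x_{ij}^{2 \to l}
           + \gamma_{ij}^{l}\, x_{ij}^{3 \to l} + \delta_{ij}^{l}\, x_{ij}^{4 \to l},
\qquad
\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} + \delta_{ij}^{l} = 1,
\quad \alpha_{ij}^{l}, \beta_{ij}^{l}, \gamma_{ij}^{l}, \delta_{ij}^{l} \ge 0.
```

In the standard ASFF design, the non-negativity and sum-to-one constraints are enforced by applying a softmax across the learned control maps.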
3.4. Focaler-CIoU Loss Function
The bounding-box loss $L_{\mathrm{CIoU}}$ of the original YOLOv8 network is calculated using the CIoU (complete intersection over union) formula, as shown in Equations (3) to (5).
Here, $v$ measures the consistency of the aspect ratios of the predicted box and the ground-truth box, and $\alpha$ is a weight function that depends on the difference between these aspect ratios; $w_g$ and $h_g$ are the width and height of the ground-truth box, respectively. In the detection of small blueberry objects, the CIoU loss cannot accurately reflect the regression quality because of the small object size, leading to unsatisfactory optimization. Object scale also varies greatly, especially in small-object detection, where it depends strongly on the camera view and distance. Moreover, CIoU handles uneven sample distributions poorly: it tends to overfit easy-to-classify samples while neglecting difficult ones. In complex tasks, its convergence is still slow, and it does not dynamically adjust to the sample distribution, resulting in an overall unbalanced performance and certain limitations.
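For reference, the standard CIoU definition that Equations (3) to (5) presumably follow is sketched below; here $b$ and $b_g$ denote the centers of the predicted and ground-truth boxes, $\rho(\cdot)$ the Euclidean distance, and $c$ the diagonal length of the smallest enclosing box (these symbols are not defined in the excerpt and are stated as assumptions):

```latex
% Standard CIoU loss (sketch of Eqs. (3)-(5); b, b_g, rho, and c are assumed symbols)
L_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b_{g})}{c^{2}} + \alpha v,
\qquad
v = \frac{4}{\pi^{2}} \left( \arctan\frac{w_{g}}{h_{g}} - \arctan\frac{w}{h} \right)^{2},
\qquad
\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}.
```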
To address these issues, we combine the CIoU loss function with the Focaler-IoU loss and introduce Focaler-CIoU. This strengthens bounding box regression by focusing on the sample distribution, thereby mitigating sample imbalance. Our method reconstructs the IoU term in the loss function, making the model focus more on positive samples that are difficult to classify and improving detection accuracy. The formula for $L_{\mathrm{Focaler\text{-}CIoU}}$ is given in Equations (6) and (7).
$L_{\mathrm{Focaler\text{-}CIoU}}$ is the improved loss, and $d$ and $u$ are the lower and upper thresholds, respectively, with $d, u \in [0, 1]$. Focaler-CIoU optimizes the loss function through the intersection over union (IoU) value, increasing the model’s sensitivity to samples with medium overlap so that it learns the features of these samples more effectively. By adjusting the thresholds, the reconstructed IoU can focus on different regression samples rather than only the simplest or most difficult ones, as shown in Figure 5.
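The following block sketches the linear reconstruction of the IoU and the combined loss that Equations (6) and (7) presumably correspond to, following the standard Focaler-IoU formulation [23] (a reconstruction; the exact form in the original may differ):

```latex
% Sketch of Eqs. (6)-(7): linear IoU reconstruction and the combined Focaler-CIoU loss
\mathrm{IoU}^{\mathrm{focaler}} =
\begin{cases}
0, & \mathrm{IoU} < d, \\[4pt]
\dfrac{\mathrm{IoU} - d}{u - d}, & d \le \mathrm{IoU} \le u, \\[8pt]
1, & \mathrm{IoU} > u,
\end{cases}
\qquad
L_{\mathrm{Focaler\text{-}CIoU}} = L_{\mathrm{CIoU}} + \mathrm{IoU} - \mathrm{IoU}^{\mathrm{focaler}}.
```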
Figure 6a,b show the linear mapping curves of Focaler-CIoU for difficult samples and simple samples, respectively. A loss function designed in this way can effectively distinguish samples with different IoU values, allowing the model to learn more selectively. Specifically, samples with an extremely low IoU are ignored to prevent the model from overly focusing on negative samples that are too difficult to learn. For samples with an IoU in the medium range, the loss function is more sensitive to changes in the IoU, which leads to better performance on these samples. For samples with a very high IoU, $\mathrm{IoU}^{\mathrm{focaler}}$ is fixed at 1 to avoid over-optimization on these already well-performing samples, thereby saving computational resources.
4. Experiment and Result Analysis
All experiments in this section were conducted under the same conditions to ensure fairness in the results. Specifically, all experiments used the same image input size, augmentation methods, training schedule, and parameters.
4.1. Experimental Environment and Training Parameter Settings
The following hardware configuration was used for training: an Intel Xeon(R) Silver 4210 CPU @ 2.20 GHz × 40 and an NVIDIA GeForce RTX 2080 Ti GPU. The software environment was Ubuntu 18.04.6 LTS with Python 3.8, the PyTorch 2.3.0 framework, and CUDA 12.1 as the GPU acceleration library. The training parameters are listed in Table 2.
4.2. Image Acquisition
The image dataset used in this study was assembled from two sources: (1) images collected from the internet by a web crawler and (2) the Roboflow Universe platform. The number of images collected from the web was 661, and the number of images from the Roboflow Universe platform was 614, giving a total of 1275 original images. The image resolution varied between 1480 × 1974 and 3883 × 5184 pixels.
The images cover different lighting conditions (strong light, normal light, insufficient light), natural environmental factors (such as branch and leaf occlusion and fruit stacking), and different shooting distances (close-up and distant). They include three levels of blueberry fruit maturity: ripe, semi-ripe, and unripe.
Table 3 provides representative close-up images of the maturity levels, while Figure 6 presents representative images from the dataset.
To ensure the effectiveness of model training, this study used the LabelImg tool to annotate the collected images, generating XML annotation files in the Pascal VOC format. The data was divided into training, validation, and test sets using a split ratio of 70%:10%:20%. To improve generalization and reduce model overfitting, the dataset was augmented with common techniques such as flipping, scaling, translation, rotation, and the addition of noise, along with random combinations of these methods [28] (a sketch of such a pipeline is given after this paragraph). To minimize computational effort, the data augmentation was pre-computed, resulting in a total of 4500 augmented images. As shown in Figure 7, the visualization of the dataset illustrates several aspects: the top left of the figure shows the number of labels for each category, the top right presents the distribution of bounding boxes, the bottom left shows the distribution of the bounding-box center points, and the bottom right displays the size distribution of the labels. It can be seen that most of the target sizes are concentrated in the 0–0.2 range, indicating that small blueberry targets account for a large proportion of the entire dataset.
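The offline augmentation described above could be implemented along the following lines; this is a minimal sketch using the albumentations library with Pascal VOC-format boxes (the paper does not specify the actual library or parameter values, so these choices are assumptions):

```python
import albumentations as A
import cv2

# Minimal offline augmentation sketch (assumed pipeline and parameters).
# Boxes are (x_min, y_min, x_max, y_max), matching the Pascal VOC annotations from LabelImg.
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                                  # flipping
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2,      # translation, scaling,
                           rotate_limit=15, p=0.7),               # and rotation
        A.GaussNoise(p=0.3),                                      # noise injection
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

def augment_sample(image_path, boxes, labels):
    """Apply one random augmentation to an image and its bounding boxes."""
    image = cv2.imread(image_path)
    out = augment(image=image, bboxes=boxes, labels=labels)
    return out["image"], out["bboxes"], out["labels"]
```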
4.3. Effect of the Four-Head ASFF Detector
First, the original three-head ASFF detector (Detect-ASFF) was compared with the four-head ASFF detector (Detect-FASFF), which extends to the P2 feature layer for small objects. To this end, the conventional YOLOv8n was used as the base model, and the three-head and four-head detectors were each inserted in turn. All other parameters were kept unchanged, and the models were evaluated on the blueberry maturity dataset.
The experimental results are shown in Table 4. Although the Detect-FASFF head slightly increases the computational complexity and memory requirements, precision, recall, and average precision improved by 1.7%, 4.2%, and 4.0%, respectively, compared to the baseline model, so the overall detection quality was significantly improved. Although the frame rate (FPS) decreased, this setup still meets the requirements of practical applications.
Figure 8 shows the results on a representative image. Detect-ASFF and Detect-FASFF have fewer missed detections, especially when the blueberries are smaller or severely occluded.
4.4. Ablation Experiments
To verify the improvements in the YOLOv8n algorithm proposed in this paper, five ablation experiments were conducted on the blueberry maturity dataset to assess the effectiveness of each modification. The results of the ablation experiments are compared to the YOLOv8n reference in Table 5.
As shown in Table 5, in Experiment 1, replacing the C2f module in the backbone of the YOLOv8n network with C2f-RFAConv enhanced the model’s attention to local information when processing features. However, this comes at the cost of partially neglecting global features, leading to a decrease in recall. Despite this, the model’s precision and mean average precision (mAP@0.5) increased by 2.2% and 0.7%, respectively, with only a slight increase of 0.1 MB in model size.
In Experiment 2, the introduction of Detect-FASFF adds convolutional layers and parameters that enable the model to learn and compute adaptive weights for feature fusion. As a result, the model size increased by 2.6 MB. However, compared to the reference experiment, precision, recall, and mAP@0.5 improved by 1.7%, 4.2%, and 4%, respectively, significantly enhancing the model’s detection performance.
Experiment 3 incorporated Focaler-CIoU, further improving the model’s precision, recall, and mAP@0.5 and demonstrating that Focaler-CIoU enhances the model’s fitting capability. In Experiment 4, fusing C2f-RFAConv and Detect-FASFF led to an overall performance improvement compared to Experiments 1 and 2. In Experiment 5, the fusion of Detect-FASFF and Focaler-CIoU further improved the model’s detection performance compared to Experiment 2.
Finally, Experiment 6 evaluated the combined effect of all three proposed enhancements to the baseline network, resulting in the best-performing model. Compared to the baseline, the improved model achieved increases of 2.8% in precision, 4% in recall, and 4.5% in mAP@0.5, while the model size grew by only 2.7 MB. Additionally, the frame rate reached 80 frames per second, fully meeting real-time detection requirements.
Furthermore, as illustrated in Figure 9, the P-R curve of the final improved model is closer to the upper right corner of the coordinate axes and exhibits a larger area under the curve (AUC) compared to the baseline model, indicating better overall performance across different thresholds.
In addition, to further demonstrate the model’s sensitivity to blueberry targets of different sizes, an image containing blueberries of various sizes was selected for testing. The results are shown in Figure 10. The heatmaps show intuitively that the proposed model focuses well on blueberry targets of different sizes at both close and far distances while effectively suppressing environmental background noise. By contrast, the YOLOv8n model pays insufficient attention to blueberries of certain sizes, and some environmental background noise remains.
4.5. Comparison of Other Backbone Networks
In this experiment, YOLOv8n was used as the baseline network, and its backbone was replaced with mainstream feature extraction networks such as MobileNetV3, GhostNetV2, and ShuffleNet for comparison. To ensure a fair comparison, all settings except for the backbone network remained unchanged while evaluating the training performance on the blueberry dataset.
As shown in Table 6, our backbone incorporating the C2f-RFAConv improvement outperformed ShuffleNetV1, ShuffleNetV2, and GhostNetV2, increasing mAP@0.5 by 4.3%, 2.4%, and 2.5%, respectively. While the proposed backbone achieved the same mAP@0.5 improvement as MobileNetV3, it reduced the number of parameters by 2.2 M, the computational complexity by 2.2 G, and the model size by 5 MB. These reductions make the model more efficient and better suited for real-world deployment.
4.6. Comparison of Different Target Detection Models
To verify the proposed algorithm for blueberry maturity detection, the same dataset and experimental conditions were used to compare it with mainstream models such as YOLOv3-tiny, YOLOv5n, YOLOv6, YOLOv7-tiny, and YOLOv9c. The experimental results are shown in Table 7.
According to Table 7, the proposed algorithm outperformed the other models in several key performance indicators, especially precision, recall, and average precision (mAP@0.5). Its mAP@0.5 is improved by 13.8%, 5.3%, 7.5%, 2.3%, 6.0%, and 0.5% compared with YOLOv3-tiny, YOLOv5n, YOLOv6n, YOLOv6s, YOLOv7-tiny, and YOLOv9c, respectively. In addition, although the number of parameters and the amount of computation increased compared to the older YOLOv5n and YOLOv6n models, the proposed model performs excellently in terms of precision, recall, and mAP@0.5, achieving a favorable balance between performance and resource efficiency and significantly improving the accuracy and reliability of blueberry maturity detection.
4.7. Visual Comparison
To visualize the results, two representative cases are compared in Figure 11 and Figure 12. Red boxes indicate ripe blueberries, orange boxes indicate semi-ripe blueberries, and pink boxes indicate unripe blueberries; blue boxes indicate missed or false detections.
4.7.1. Strong Light, Close Range, and Occlusion
As shown in Figure 11b,c, under strong light, YOLOv8n and YOLOv9c struggle to detect small blueberry targets at long distances and when heavily obstructed by branches and leaves, resulting in missed and false detections. While YOLOv9c demonstrates higher confidence levels, it still produces some missed and false detections. In contrast, the proposed model exhibits lower missed and false detection rates, achieving the best detection performance with higher confidence levels.
4.7.2. Weak Light, Long Distance, and Occlusion
As shown in Figure 12b,c, under conditions of insufficient light, long distances, and severe occlusion, YOLOv8n and YOLOv9c frequently misidentify darker branches and leaves as ripe or unripe blueberries, resulting in high missed and false detection rates. In contrast, the proposed model demonstrates superior detection performance in these challenging environments, with overall higher confidence levels.
4.8. Model Robustness Testing Under Different Conditions
To verify the robustness of the model under varying lighting, occlusion, and distance conditions, the 917 test images in the dataset were divided into two experimental groups based on the presence of environmental disturbances. The test images in Experiment 1 mainly include blueberries at close distances, without occlusion, and under normal lighting conditions. The test images in Experiment 2 mainly include blueberries under environmental disturbances such as long distances, partial occlusion, strong lighting, and insufficient lighting. Under the same experimental conditions, the performance of the proposed model was compared with that of YOLOv8n. The experimental results are shown in Table 8, and the confusion matrices after testing the models are shown in Figure 13.
As shown in Table 8, the proposed model exhibits significant advantages under environmental disturbance conditions, outperforming YOLOv8n in precision, recall, and mAP@0.5. In Experiment 1, under normal environmental conditions, the accuracy of the proposed model is slightly higher than that of YOLOv8n, with only a small difference in detection performance between the two. However, in Experiment 2, when faced with environmental disturbances, the precision, recall, and mAP@0.5 of the proposed model improved by 3%, 5.4%, and 4.6%, respectively, compared to YOLOv8n, significantly enhancing its performance in complex environments. These results indicate that the proposed model exhibits stronger robustness in blueberry detection tasks, better adaptability to environmental changes, and higher stability and accuracy.
As shown in the confusion matrices in Figure 13, the proposed model outperforms YOLOv8n in both experiments. Although the performance of the two models is similar in Experiment 1, the proposed model has a slight advantage in classification accuracy, especially for blueberry-U. In Experiment 2, however, the proposed model significantly outperforms YOLOv8n, providing more precise classification for all categories (blueberry-R, blueberry-U, blueberry-S) and markedly reducing the misclassification rate for all categories. Overall, the proposed model demonstrates higher stability and accuracy than YOLOv8n in both experiments, especially in the more complex classification task, and shows stronger robustness.
5. Conclusions
Based on a review of previous research on fruit detection, a blueberry detection algorithm based on improved YOLOv8n is proposed in this paper, which effectively addresses the problem of detecting very small objects in complex orchard environments. The improvements include optimizing the feature extraction of the backbone network, adding a small-target detection layer at pyramid level P2, and improving the detection head and loss function. Specifically, RFAConv is introduced into the backbone network and a C2f-RFAConv module is constructed to enhance the model’s ability to extract target features and capture global information more accurately through the interactive adaptive learning of spatial receptive fields and convolutions. To address small-target detection, a dedicated detection layer is added to the neck network, which significantly improves the detection accuracy for distant small targets. The head network adopts a four-head detector (Detect-FASFF), which reduces missed and false detections in complex environments by adaptively fusing information from the lower pyramid levels’ feature maps. In addition, a more focused complete intersection over union loss (Focaler-CIoU) is introduced into the bounding box loss function, which effectively handles the sample imbalance problem, speeds up convergence, and reduces the bounding box loss value. Combining these improvements, the proposed algorithm exhibits high detection performance in complex orchard environments, demonstrating excellent robustness and accuracy and significantly improving the accuracy and efficiency of blueberry maturity detection.
The proposed model performs well in blueberry target detection in complex environments, with average precisions of 89.1%, 91.0%, and 85.5% for ripe, semi-ripe, and unripe blueberries, respectively. Compared with the baseline YOLOv8n model, the proposed model improves mAP@0.5 by 4.5% across all maturity detection tasks. The model demonstrates excellent detection performance in complex environments, such as those with multiple targets, strong lighting, insufficient lighting, and both distant and close ranges, while achieving a frame rate of up to 80 FPS.
Although this study achieved significant results in blueberry maturity detection, there are still areas for improvement to be addressed in future research. Firstly, although the current model achieves a good balance between detection accuracy and operational efficiency, there is still potential for further compression in terms of parameter scale and model size. This is especially important for application scenarios where lightweight models are crucial. Therefore, future work will focus on optimization strategies such as model pruning and quantization, with the aim of further reducing model size, lowering computational costs, and improving deployment efficiency while maintaining detection accuracy.
Meanwhile, the deployment of the model not only relies on the optimization of the algorithm itself but also requires comprehensive consideration of the hardware performance of the terminal devices, real-time data acquisition, and the overall stability of the system. Future research will further explore the integration of this detection system with agricultural intelligent devices (such as drones, agricultural robots, etc.) to promote the intelligence and scalable application of the blueberry maturity detection system in real-world scenarios. This will provide more accurate and efficient solutions for agricultural production, effectively improving blueberry harvesting efficiency and quality management and supporting automation and intelligence in agriculture. However, the current model still has limitations in terms of adaptability to complex natural environments. For example, under rainy or foggy conditions, the stability and accuracy of the model may be affected. Therefore, future research should further enhance the robustness of the model in dynamic environments to improve its stability and reliability in real-world applications.
In addition, the current training data coverage may still be insufficient and cannot fully represent all blueberry varieties, growth stages, and environmental differences. Future efforts should enhance data diversity, covering more real collection scenarios and samples from different geographical regions. By integrating techniques such as adaptive learning and transfer learning, the generalization ability of the model in various environments can be improved. Although this study focuses on blueberries, future research will explore its applicability to other fruits and vegetables, such as strawberries and grapes, given the potential universality of this method for fruit maturity detection. These crops are also sensitive to harvest time and have rich visual features, making them suitable for maturity recognition using computer vision technology. By applying this model to other fruits and vegetables, its scope could be further expanded and technical support could be provided for the development of a universal intelligent fruit and vegetable detection platform.