1. Introduction
Blueberries are a fruit of high economic value, and the precise detection and classification of their maturity are of great significance for orchard management, automated harvesting, and market grading [
1,
2,
3]. According to the statistics of the Food and Agriculture Organization of the United Nations (FAO), global blueberry production has increased by nearly 60% over the past decade, with a production of approximately 1.2 million tons in 2023. China, the United States, and Chile are the main producers [
4]. Since the beginning of the 21st century, China has undertaken large-scale cultivation of blueberries and has emerged as a key production region within the Asia-Pacific area. According to recent industry statistics reflected in the cited literature, the total national planting area has reached approximately 1.1 million hectares, with an annual output of up to 500,000 tons. It should be noted that these figures represent general agricultural estimates for the sector as a whole, rather than values tied to specific cultivars or localized markets. While blueberry cultivation carries high agricultural added value, the maturity of the berries significantly affects their taste, nutritional quality, market price, and harvest costs [
5,
6]. For example, market analyses indicate that, under common commercial grading systems, blueberries classified as maturity level III typically command prices more than 30% higher than those of maturity level I, a trend observed across several major distribution channels [
7]. Therefore, the automatic detection and classification of blueberry maturity can not only improve harvesting efficiency but also significantly increase economic benefits [
8,
9].
The traditional methods for detecting blueberry maturity mostly rely on manual observation and hand-crafted feature extraction [
10,
However, such approaches are generally labor-intensive, time-consuming, and susceptible to human subjectivity, making it difficult for them to meet the requirements of intelligent and automated agriculture, especially under large-scale production conditions [
12]. With the continuous expansion of the blueberry cultivation scale, the limitations of traditional methods have become increasingly prominent. In recent years, the rapid development of computer vision and deep learning technologies has brought about changes in the agricultural field, especially in image-based object detection and classification methods, which have been widely applied [
13,
14,
15,
16]. However, when deployed in complex natural environments characterized by factors such as occlusion, varying lighting conditions, and significant scale variation among fruits, these models often face challenges in maintaining robust accuracy while meeting the stringent requirements for real-time processing and lightweight deployment essential for practical agricultural applications. To address these specific challenges, this study proposes the M-YOLOv11n network. Building upon the efficient YOLO framework, our work introduces key structural innovations. The novelty of M-YOLOv11n lies in its integrated approach, which strategically incorporates a multi-scale feature extraction module and an attention mechanism within a lightweight architecture. This design aims to enhance feature discrimination for blueberries under challenging conditions without substantially increasing computational cost, thereby advancing the balance between accuracy, robustness, and efficiency for in-field maturity detection. For example, Zhang et al. proposed a fruit detection method based on Faster R-CNN and achieved more than 90% accuracy in the detection task of apples and oranges [
13]. However, the high computational complexity of Faster R-CNN makes it perform poorly in real-time detection [
17]. In contrast, the YOLO series models have achieved a better balance between real-time performance and accuracy due to their efficient single-stage detection framework, which has been successfully applied in various agricultural detection tasks. For instance, an improved YOLOv8 model has been used for the precise identification and detection of fresh leaves from five different varieties of Yunnan large-leaf tea trees, achieving a mean Average Precision (mAP) of 94.8% [
18]. The YOLOv7 framework further demonstrates the series’ capability by establishing a new state-of-the-art for real-time general object detectors through its novel trainable optimization strategies [
19]. Similarly, a lightweight improved YOLOv4-Tiny network has shown high effectiveness in recognizing blueberry fruits and determining their maturity levels in natural environments, with a detection speed as fast as 5.7 milliseconds per image [
20]. Wang et al. proposed a transformer-based grape maturity detection method in Computers and Electronics in Agriculture, but the model complexity limits the practical application scenarios [
14]. At present, certain progress has been made in the research on precise fruit identification and fruit maturity classification, both at home and abroad. In order to detect round-like fruits, Li Ying et al. proposed an improved citrus fruit maturity detection method based on the YOLOv8s model [
21]. The mAP@0.5 of the improved YOLOv8s model on the test set is 95.6%. However, the method still suffers from missed detections caused by overlapping and occluded fruits in citrus maturity detection. Chen Fengjun et al. addressed the issue that
Camellia oleifera fruits are often obscured in the natural environment [
22]. Based on the original YOLOv7 model, they proposed an improved method for detecting the maturity of
Camellia oleifera fruits. The mAP of the improved YOLOv7 model on the test set was 94.60%. However, this method still suffers from missed and false detections in Camellia oleifera fruit maturity detection, and it is not easy to deploy on mobile devices. For blueberries as a specific crop, existing detection networks still face significant challenges in complex real orchard environments: First, blueberry fruits are small and densely clustered, making them highly prone to missed detections due to overlapping and occlusion by branches and leaves; Second, the variable lighting conditions in orchards (e.g., backlighting, shadows) severely affect the robustness of color-dependent ripeness discrimination; Third, most improved models increase computational complexity to enhance precision, making real-time deployment difficult on edge devices with limited computing resources, such as picking robots. Therefore, developing a blueberry detection model with high precision, strong robustness, and lightweight characteristics in complex environments is crucial for realizing automated harvesting [
23,
24,
25,
26,
27].
To address the above issues, this study proposes a blueberry maturity detection and classification method based on the improved YOLOv11n model. By introducing Multi-Scale Block (MsBlock) with a depth-separable lightweight component into the Backbone and introducing the SE attention mechanism into the feature pyramid, this study aims to enhance the model’s feature extraction capability for targets of different scales, thereby improving detection accuracy and robustness. Multi-scale feature extraction has been proven to have significant advantages in object detection tasks. For example, the Feature Pyramid Network (FPN) proposed by He et al. significantly improves the detection ability of small targets by fusing features at different levels [
28]. Similarly, the Multi-Scale attention mechanism proposed by Chen et al. further enhances the robustness of the model in complex scenarios by dynamically adjusting the feature weights [
29].
Beyond blueberries, imaging-based maturity and quality assessment combined with advanced machine-learning techniques has been increasingly explored across a wide range of agricultural and horticultural crops. Recent studies have demonstrated the integration of imaging and deep learning with quantitative quality and maturity indices in various agricultural and horticultural products, highlighting the potential for cross-crop modeling and transferability of such approaches [
30,
31]. These cross-crop efforts provide important methodological insights and further support the broader applicability of the proposed framework beyond a single crop species.
Based on the theoretical models and empirical cases discussed above, this study makes structural improvements to the original YOLOv11n Backbone network. Without significantly increasing the memory footprint of the network structure, the Multi-Scale Block (MsBlock) and the SE attention mechanism are introduced. The effectiveness of the improved model for detecting and recognizing blueberry fruit maturity in natural environments was verified through experiments. This study can provide an important data foundation and recognition basis for subsequent yield estimation, labor allocation planning, and target locking in automated mechanical harvesting, and exhibits potential for further application in smart agriculture management systems.
In terms of data collection and construction, this study collected blueberry images under natural light conditions on a farm in Florida, USA, and constructed a specialized dataset containing scenes of mild occlusion, severe occlusion, and backlighting. The dataset comprises 876 original images, with 63,728 fruits annotated and categorized into three classes based on maturity. After data augmentation expanded the dataset to 7005 images, it was split into training, validation, and test sets in a ratio of 7:1:2. Regarding experimental design and evaluation, the effectiveness of the MsBlock, SE attention mechanism, and depthwise separable convolution was progressively validated through seven ablation experiments. Comparisons were made with YOLOv8n, SSD-MobileNet, YOLOv11n, and Faster R-CNN across the three test scenarios. All experiments were conducted under consistent software and hardware environments using identical training strategies. Metrics including mAP50, accuracy, recall, F1-score, FPS, and parameter count were employed to comprehensively evaluate model performance.
3. Algorithm Design and Experiment
3.1. YOLOv11n Object Detection Network
There are two main categories of object detection methods based on deep learning. The first category is two-stage object detection algorithms based on region proposals, such as R-CNN, Fast R-CNN, and Faster R-CNN. The second category is one-stage object detection methods based on regression, such as YOLO, RetinaNet, and EfficientDet. Since Redmon proposed the first regression-based object detector, YOLOv1, in 2016, the approach has received extensive attention from researchers [
32]. By 2024, the YOLO series had been updated to its 11th generation. After verification on standard datasets, YOLOv11 shows good performance. However, its detection speed and its accuracy in multi-feature environments still cannot fully meet real-time requirements, and the network structure occupies a large amount of memory, making it difficult to deploy on the embedded systems carried by agricultural picking robots.
YOLOv11n is a lightweight object detector with a C3k2 backbone. Compared to C2f, it achieves lightweighting via optimized convolution kernels and group convolution, retaining SiLU activation to balance feature extraction and computational cost [
33]. Adopting FPN for multi-scale fusion, it optimizes feature extraction through information flow and cross-scale fusion, focusing on the P3/P4/P5 scales. Structural optimization and channel compression reduce computational complexity, meeting real-time multi-scale detection needs in conventional scenarios. In summary, YOLOv11n aims to enhance real-time detection efficiency and reduce computational cost. However, its drawback lies in the fact that it does not introduce advanced feature enhancement modules or attention mechanisms, and thus cannot fully model the importance correlations across feature channels and spatial dimensions. Its ability to capture local detail features is relatively weak [
34]. When detecting crops, due to factors such as tree branch obstruction and backlight exposure, YOLOv11n is prone to problems such as weakened small-target features and confusion between targets and background, with obvious missed detections. There is still considerable room for improvement in detection accuracy in complex environments.
Table 2 presents the comparison between YOLOv11n and other existing models.
In order to further improve the performance and detection accuracy of the object detection network, this study proposes an improved object detection network (M-YOLOv11n) containing a Multi-Scale Block (MsBlock). By introducing the MsBlock and the SE attention mechanism into the YOLOv11n object detection network, and adopting a hierarchical multi-branch structure and a multi-scale feature extraction strategy based on depthwise separable convolution to capture the feature differences of targets at different scales and integrate local details with global information, the effective transmission of multi-scale features is strengthened, thereby enhancing the extraction of deep information in the network structure.
3.2. Multi-Scale Block
In object detection tasks, the model’s ability to capture features of different scales directly determines the detection accuracy of the target in different environments. This feature capture ability is closely related to the receptive field coverage range of the module [
35]. In lightweight networks, fixed-scale convolution kernels are often used to control the number of parameters, making it difficult to meet the feature extraction requirements of both the local details of small targets and the global contours of large targets. Especially in natural scenes, detection omissions are prone to occur under different conditions [
36].
A Multi-Scale Block (MsBlock) is a feature extraction module composed of multiple parallel convolutional layers, each of which has a different receptive field to capture feature information of different scales [
37]. The basic principle of MsBlock can be summarized into the following three core pillars:
In object detection tasks, the model’s ability to capture features at different scales is crucial, as it directly affects detection accuracy for targets of varying sizes in complex natural environments. For blueberry fruit detection, the dataset includes both close-range (large targets) and distant (small targets) fruits, often accompanied by occlusion from branches and leaves. To address these multi-scale and partial occlusion challenges, this study designs a Multi-Scale Block (MsBlock), the core of which lies in using a set of parallel convolutional layers to obtain differentiated receptive fields, thereby effectively capturing feature information from local details to global contours.
The structural design of MsBlock directly responds to the characteristics of the blueberry dataset. Among its components, smaller convolution kernels focus on extracting local subtle variations in fruit surface color and texture, which is essential for distinguishing maturity levels; whereas larger convolution kernels help integrate broader contextual information under foliage occlusion to infer the complete contour of the fruit. These multi-scale features are subsequently integrated through weighted fusion, ensuring that the model can simultaneously adapt to blueberry fruits of different sizes and visibility levels in the dataset.
In summary, MsBlock is not a generic multi-scale structure but rather a customized feature extraction scheme designed to address the inherent scale variability and occlusion complexity in field images of blueberries. The selection of the number of branches and the sizes of convolution kernels is based on the specific target size distribution and occlusion situations observed in prior data analysis.
As illustrated in
Figure 5, the MS-Block module processes input features by splitting them into multiple parallel branches along the channel dimension. Each branch first performs cross-channel interaction and dimension mapping via a 1 × 1 convolution, followed by spatial feature extraction using k × k depthwise convolution. The features are then refined and compressed through another 1 × 1 convolution before all branches are finally merged via a channel-wise 1 × 1 convolution to integrate multi-scale information [
38]. This design enhances the model’s ability to perceive objects with large-scale variations, making it well-suited for complex detection scenarios.
In M-YOLOv11n, the MS-Block is integrated into the backbone network (
Figure 6), where it extracts both local and global information through multi-scale convolutional kernels. A channel weighting mechanism is applied to adaptively adjust the contribution of features at each scale, thereby improving feature representation. The corresponding calculation is expressed as follows:
F = Σ_i β_i · X_i, where β_i = exp(S_i/γ) / Σ_j exp(S_j/γ)
Here, X_i is the feature map at the i-th scale, β_i is its weighting coefficient, S_i is the channel importance score, and γ is the temperature coefficient, which controls the smoothness of the weight distribution. Subsequent experiments show that after adding MsBlock to YOLOv11n, the mAP for small-object detection improves, with particularly strong performance in fine-grained recognition tasks such as fruit ripeness detection.
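As a concrete illustration of this temperature-controlled weighting scheme, a minimal plain-Python sketch is given below. The per-scale features are reduced to scalars purely for brevity, and the function names are our own illustrative assumptions, not the authors' implementation:

```python
import math

def fusion_weights(scores, gamma=1.0):
    """Temperature-scaled softmax over channel importance scores S_i.

    A larger gamma flattens the weight distribution; a smaller gamma
    sharpens it toward the highest-scoring scale.
    """
    exps = [math.exp(s / gamma) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(features, scores, gamma=1.0):
    """Weighted sum of per-scale features (scalars here for illustration)."""
    betas = fusion_weights(scores, gamma)
    return sum(b * x for b, x in zip(betas, features))
```

Because the weights are a softmax, they always sum to 1, and γ directly controls how strongly the fusion favors the most informative scale.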
3.3. Depthwise Separable Convolution
Depthwise separable convolution is an efficient convolution structure that is widely used in the design of lightweight neural networks, aiming to significantly reduce the computational complexity and the number of parameters of the model while maintaining a strong feature extraction capability [
39]. In this section, the structure principle, computational complexity, and its application in M-YOLOv11n are analyzed in detail.
3.3.1. Structure and Principle
Let H and W denote the spatial dimensions of the input feature map, C_in and C_out denote the input and output channel counts, respectively, and K denote the convolution kernel size. Computational complexity in the context of convolutional neural networks refers to the number of floating-point operations (FLOPs) required to perform a given layer's computation. The computational complexity of standard convolution (FLOPs_std) and depthwise separable convolution (FLOPs_dsc) can be expressed as:
FLOPs_std = H × W × C_in × C_out × K²
FLOPs_dsc = H × W × C_in × K² + H × W × C_in × C_out
Computational efficiency comparison: the ratio between depthwise separable convolution and standard convolution is
FLOPs_dsc / FLOPs_std = 1/C_out + 1/K²
When the condition C_out ≫ K² is satisfied, the following approximate relationship can be obtained:
FLOPs_dsc / FLOPs_std ≈ 1/K²
Taking a typical layer in a backbone network as an example, with parameters C_in = 256, C_out = 512, and K = 3, standard convolution requires approximately 1.18 GFLOPs, while depthwise separable convolution requires only about 0.13 GFLOPs, a computational reduction of approximately 89%. This significant reduction in computational cost demonstrates that depthwise separable convolution serves as a core component in lightweight network design, enabling substantial computational overhead reduction while maintaining robust feature extraction capabilities.
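The figures above can be checked with a short script. The spatial size H = W = 32 is an assumed value for illustration (the paper does not state it), so the absolute GFLOP numbers differ slightly from those quoted; the FLOPs ratio itself is independent of H and W:

```python
def std_conv_flops(h, w, c_in, c_out, k):
    # Standard convolution: H * W * C_in * C_out * K^2
    return h * w * c_in * c_out * k * k

def dsc_flops(h, w, c_in, c_out, k):
    # Depthwise part (H * W * C_in * K^2) plus pointwise part (H * W * C_in * C_out)
    return h * w * c_in * k * k + h * w * c_in * c_out

std = std_conv_flops(32, 32, 256, 512, 3)   # ~1.21e9 FLOPs at this spatial size
dsc = dsc_flops(32, 32, 256, 512, 3)        # ~1.37e8 FLOPs
ratio = dsc / std                           # equals 1/C_out + 1/K^2 ~= 0.113
```

With K = 3 and a large C_out, the ratio approaches 1/K² = 1/9, i.e., close to a ninefold reduction.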
3.3.2. Application in M-YOLOv11n
To construct a detection model that takes into account both high precision and high efficiency, in this paper, depthwise separable convolution is introduced as the core lightweight component into the M-YOLOv11n network. The core idea of the Multi-Scale Block (MsBlock) proposed in this paper is to enhance the model’s performance in processing multi-scale information by improving the size and structure of the convolution kernel and optimizing the feature fusion method, thereby improving the overall object detection accuracy and efficiency [
40]. Depthwise separable convolution effectively realizes the hierarchical feature fusion of MsBlock, maintaining the model's strong feature extraction capability while significantly reducing computational complexity and memory usage.
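To make the two-stage structure concrete, the following is a minimal plain-Python sketch of a depthwise-then-pointwise convolution on a tiny input (valid padding, stride 1). The shapes and helper name are illustrative assumptions, not the network's actual implementation:

```python
def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """x: C_in x H x W nested lists; dw_kernels: C_in kernels of size K x K;
    pw_weights: C_out x C_in matrix for the 1x1 pointwise convolution."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(dw_kernels[0])
    oh, ow = h - k + 1, w - k + 1
    # Depthwise stage: each input channel is convolved with its own K x K kernel
    dw = [[[sum(x[c][i + u][j + v] * dw_kernels[c][u][v]
                for u in range(k) for v in range(k))
            for j in range(ow)] for i in range(oh)] for c in range(c_in)]
    # Pointwise stage: a 1x1 convolution mixes information across channels
    return [[[sum(pw_weights[o][c] * dw[c][i][j] for c in range(c_in))
              for j in range(ow)] for i in range(oh)]
            for o in range(len(pw_weights))]
```

The split makes the factorization explicit: spatial filtering happens per channel, and only the cheap 1 × 1 stage combines channels, which is exactly where the FLOPs savings in Section 3.3.1 come from.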
3.4. Complete Intersection over Union (CIoU) Loss Function
To achieve accurate localization and classification of blueberry fruits in complex natural environments, it is essential to optimize the loss function to balance the training errors associated with bounding box position, confidence, and category. This study adopts the Complete Intersection over Union (CIoU) loss function. Compared to the traditional Intersection over Union (IoU) loss, CIoU mitigates the gradient vanishing problem when the predicted bounding box does not intersect with the ground truth and provides a more comprehensive measure of their overlap [
41].
In the context of blueberry detection, where target fruits are often small and frequently exhibit incomplete boundaries due to occlusion by branches and leaves, the CIoU loss offers particular advantages. Beyond merely considering the overlap area, CIoU introduces penalty terms for both the center-point distance and aspect ratio consistency. This design is crucial for bounding box regression under conditions of small targets and partial occlusion. The center-point distance term provides an effective gradient direction even when the overlap between the predicted and ground truth boxes is low, alleviating optimization difficulties caused by small target sizes. Meanwhile, the aspect ratio term constrains the predicted box shape to better conform to the nearly circular appearance of blueberries, thereby enhancing localization stability. Consequently, for blueberry datasets characterized by significant scale variation and frequent occlusion, CIoU contributes to more robust bounding box regression. The formulation of this loss function is given by the following Equation (6).
L = S(B, G) + D(B, G) + V(B, G)   (6)
In the formula, S, D, and V represent the overlap area, the center-point distance, and the aspect-ratio consistency between the predicted bounding box B and the ground-truth bounding box G, respectively. However, the IoU and GIoU losses only consider the overlapping area, as shown in Equation (7).
IoU = |B ∩ G| / |B ∪ G|   (7)
The normalized center-point distance is adopted to measure the distance between the two bounding boxes, as shown in Equation (8).
D = ρ²(b, b_g) / c²   (8)
where b and b_g are the center points of box B and box G, ρ(·) is the Euclidean distance, and c is the diagonal length of the smallest enclosing box covering both boxes.
The consistency of the aspect ratio is measured as shown in Equation (9).
v = (4/π²) · (arctan(w_g/h_g) − arctan(w/h))²   (9)
Finally, the complete CIoU loss function is obtained, as shown in Equation (10).
L_CIoU = 1 − IoU + ρ²(b, b_g)/c² + αv   (10)
Here, IoU represents the Intersection over Union between the predicted and ground-truth bounding boxes, b and b_g are their center points, ρ is the Euclidean distance between the two center points, c is the diagonal length of the minimum enclosing box, v measures the consistency of the aspect ratio, and α is the weight coefficient that controls the influence of the aspect-ratio loss. The trade-off parameter α is given by Equation (11).
α = v / ((1 − IoU) + v)   (11)
The CIoU loss can rapidly shorten the distance between the predicted and ground-truth bounding boxes, so its convergence speed is much faster than that of the GIoU loss. For cases involving non-overlapping bounding boxes or boxes with extreme aspect ratios, the CIoU loss still drives fast regression, while the GIoU loss almost degenerates into the IoU loss.
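A self-contained sketch of the CIoU loss of Equations (6)-(11) is given below. Boxes are assumed to be in (x1, y1, x2, y2) corner format; this is an illustrative reference implementation, not the training code used in this study:

```python
import math

def iou(box_a, box_b):
    """Intersection over Union for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def ciou_loss(pred, gt):
    """CIoU = 1 - IoU + rho^2/c^2 + alpha * v (Equations (10)-(11))."""
    i = iou(pred, gt)
    # Squared center-point distance rho^2 (Equation (8) numerator)
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gcx, gcy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    # Squared diagonal c^2 of the smallest enclosing box
    ex1, ey1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    ex2, ey2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # Aspect-ratio consistency term v (Equation (9))
    pw, ph = pred[2] - pred[0], pred[3] - pred[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - i) + v) if v > 0 else 0.0  # Equation (11)
    return 1 - i + rho2 / c2 + alpha * v
```

For perfectly aligned boxes the loss is 0, and for disjoint boxes the center-distance term still produces a useful gradient, which is the property exploited for small, occluded blueberries.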
3.5. Adaptive Attention Mechanism
The SE attention mechanism significantly enhances the model’s ability to extract key features of blueberry ripeness through dynamic adjustment of channel weights: In terms of color perception, this mechanism can strengthen the color channels related to ripeness (such as purple, red, and green), and suppress background and interfering color channels, thereby improving color robustness under complex lighting conditions; in terms of shape and texture representation, by introducing SE in the shallow P3 layer, it enhances the perception of small target edges and local textures, and realizes adaptive selection of multi-scale features at the cross-scale fusion node to balance details and overall shape. At the same time, SE effectively reduces the interference of surface specular reflection on features by inhibiting the weights of high-brightness and high-contrast channels. Visual analysis further validates the effectiveness of this mechanism, showing that the attention weights can specifically concentrate on the key color channels at different ripeness stages, such as the purple channel of mature blueberries, the red transition channel of semi-ripe fruits, and the green channel of unripe fruits, indicating that the SE mechanism can adaptively focus on the feature information most relevant to ripeness discrimination.
The calculation process of the SE module can be concisely described as follows:
Compression (squeeze) stage:
z_c = (1 / (H × W)) · Σ_{i=1..H} Σ_{j=1..W} x_c(i, j)
where X is the input feature map, C is the number of channels, H × W is the spatial dimension, z_c is the global feature descriptor of the c-th channel, and z is the compressed vector.
Excitation stage:
s = σ(W2 · δ(W1 · z + b1) + b2)
where W1 is the weight matrix of the first fully connected layer, b1 is its bias term, δ is the ReLU function, and a = δ(W1 · z + b1) is the intermediate feature. W2 is the weight matrix of the second fully connected layer, b2 is its bias term, σ is the Sigmoid function, and s is the final channel attention weight vector.
Re-weighting stage:
x̃_c = s_c · x_c
where x̃_c is the reweighted feature map of the c-th channel.
The estimation of channel weights is an end-to-end process from global information statistics to nonlinear relationship learning, and then to feature recalibration [
42].
3.6. M-YOLOv11n Object Detection Network
While ensuring the real-time performance of the object detection network, it is also necessary to improve its accuracy in recognizing blueberry fruits as much as possible. In this study, to enhance the performance of YOLOv11n in multi-scale object detection tasks, especially the accuracy and robustness when dealing with targets of different sizes, its Backbone structure was optimized. The C3k2 module in the original YOLOv11n Backbone was replaced by a Multi-Scale Block (MsBlock), and the depthwise separable convolution component was introduced into the MsBlock.
In the proposed M-YOLOv11n, multiple MsBlock modules are sequentially deployed in the backbone. As shown in
Table 3, three MsBlock stages output feature maps with resolutions of 1/8, 1/16, and 1/32, which are further forwarded to the neck network for multi-scale feature fusion and detection. The remaining MsBlock operates at a higher-resolution stage to enhance shallow feature representation and is not directly connected to the detection heads. The network structure diagram is shown in
Figure 7.
The original C3k2 module of YOLOv11n Backbone consists of successive convolutional and pooling layers, where C3 represents a 3 × 3 convolution operation, and K2 represents each convolutional layer followed by a 2 × 2 pooling layer. This structure can effectively extract high-order features of images. However, due to the fixed size of the convolution kernel, it is prone to poor adaptability to targets with significant scale changes, especially in small object detection and complex backgrounds, where its feature extraction ability is limited [
43]. C3k2 is calculated by a standard 3 × 3 convolution as follows.
F = σ(W ∗ X + b)
where X is the input feature map, W is the 3 × 3 convolution kernel, b is the bias term, ∗ represents the convolution operation, and σ is the activation function. Since all targets use the same-scale feature extraction, this structure may not fully extract key information when dealing with targets with large scale variations or small targets, resulting in a decrease in detection accuracy.
In order to overcome the shortcomings of C3k2, we introduce MsBlock, whose core idea is to use multi-scale convolution kernels to extract feature information of different scales in parallel and then improve the detection ability through a fusion mechanism. The calculation process of the MsBlock structure is as follows:
F_ms = Σ_{i=1..N} α_i · (W_i ∗ X)
Among them, Wi represents convolution kernels of different scales, αi are channel weighting factors calculated through the channel attention mechanism, and Fms is the fused multi-scale feature map. Specifically, Wi (i = 1, 2, …, N) denotes a set of convolution kernels operating in parallel, with their scales designed to capture multi-granularity information ranging from local details to global context. The channel weighting factor αi is dynamically generated by a channel attention mechanism (e.g., SE block): the mechanism first learns the importance of each channel through global average pooling and fully connected layers, then assigns weights via a normalization function, enabling the model to adaptively enhance informative feature channels while suppressing redundant ones. This weighted fusion mechanism allows Fms to effectively integrate multi-scale representations, thereby improving the model’s discriminative ability for blueberry maturity detection in complex scenarios.
Convolutional layers of different scales can capture local details, such as clustered small targets like blueberries and global information, and adaptively adjust the contribution of features at each scale through a channel weighting mechanism. Finally, MsBlock generates more diverse feature representations to improve the detection performance of the model in complex scenes.
3.7. Experimental Design and Evaluation
3.7.1. Experimental Platform
The training and testing in this study were run on a computer equipped with an Intel® Core™ i5-12400F CPU, an NVIDIA RTX 4060 GPU, and 32 GB of RAM, with the CUDA 12.6.2 parallel computing framework and the cuDNN 9.6.0 deep learning acceleration library installed. All experiments were conducted under Windows 11 using Python 3.10.15 and PyTorch 2.5.1. The baseline YOLOv11n model was implemented based on the official open-source YOLOv11 repository, following the default network configuration and training pipeline. To accelerate convergence and improve training stability, all models were initialized with COCO pre-trained weights. No layers were frozen during training, and all parameters were fine-tuned on the blueberry maturity dataset to ensure a fair and consistent comparison among different model variants.
3.7.2. M-YOLOv11n Ablation Experiment
To verify the influence of MsBlock, Squeeze-and-Excitation (SE) module and depthwise separable convolution on model performance, the following seven sets of experiments were designed, as shown in
Table 4.
Standardized configuration was used in this study to ensure the reproducibility and comparability of results [
45]. For dataset division, the experiments adopt a fixed training/validation/test split, and all comparison experiments are based on the same data distribution for model training and performance verification, so as to eliminate the influence of data randomness on the experimental results [
45]. In terms of hyperparameter configuration, the batch size (batch_size) was set to 16, the number of training epochs to 200, the early-stopping patience to 15 epochs, and the minimum improvement threshold (delta) to 0.001. The SGD optimizer was used together with a cosine annealing learning rate scheduling strategy; the initial learning rate was set to 0.005, the momentum coefficient to 0.937, and the weight decay coefficient to 0.0005. To comprehensively evaluate model performance, multi-dimensional evaluation indicators were selected as follows: for detection accuracy, the mean Average Precision (mAP50) and recall (Recall) serve as the main indicators; real-time performance is evaluated by frames per second (FPS) to measure inference speed; and model complexity is quantified by the number of trainable parameters (Params). The FPS test is performed on a single NVIDIA RTX 4060 GPU with 640 × 640 input resolution.
All reported metrics are obtained by averaging the results of three independent runs with different random seeds. This experimental protocol, which is commonly adopted in object detection and YOLO-based studies, is used to mitigate the randomness introduced by stochastic optimization and to evaluate the consistency of performance trends. The training, validation, and test splits are kept identical across all runs to ensure experimental consistency and fair comparison.
3.7.3. Comparison Test
To comprehensively evaluate the proposed M-YOLOv11n, a representative set of baseline models was selected for comparison. The selection covers diverse architectural paradigms and design priorities to ensure a robust assessment. Specifically, YOLOv11n serves as the direct baseline to isolate the contribution of the proposed MsBlock and SE modules. YOLOv8n represents the widely adopted previous-generation state-of-the-art in lightweight YOLO detectors, providing an evolutionary benchmark. SSD-MobileNet is included as a classical, efficiency-optimized one-stage detector, establishing a standard for mobile and embedded performance. Finally, Faster R-CNN provides a high-accuracy two-stage detector reference, representing an accuracy upper bound to contextualize the speed–accuracy trade-off of the proposed lightweight model. This multifaceted comparison across architecture types, generations, and design goals offers a comprehensive evaluation context for M-YOLOv11n.
All comparison models were trained and tested on the blueberry dataset constructed in this study, and a uniform experimental configuration was adopted to ensure fairness. The input image resolution was 640 × 640 pixels, the batch size was 16, and the number of training epochs was 200. The SGD optimizer was adopted, combined with the cosine annealing learning rate scheduling strategy; the initial learning rate was 0.005, the momentum coefficient was 0.937, and the weight decay coefficient was 0.0005.
3.7.4. Experimental Index
For the recognition of blueberry targets in natural and complex environments, the accuracy and real-time performance of the detection network need to be taken into consideration.
In this study, Mean Average Precision (mAP, %) is adopted as the evaluation index of the model’s detection accuracy. mAP is computed from precision and recall, and its calculation is shown in Equations (18) to (21) [
46].
Recall: it reflects how completely the model detects the positive samples.
Precision: it reflects how reliable the model’s positive detections are.
Average Precision (AP): it reflects the overall detection accuracy for a single category and corresponds to the integral of the precision–recall curve.
Mean Average Precision (mAP): it reflects the model’s overall detection accuracy across all categories and is the mean of the single-category AP values.
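Equations (18) to (21) are not reproduced in this excerpt; as a sketch, the standard definitions they correspond to, consistent with the TP/FP/FN notation defined in the text, are:

```latex
\begin{align}
\text{Recall} &= \frac{TP}{TP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
AP &= \int_{0}^{1} P(R)\, \mathrm{d}R \\
mAP &= \frac{1}{M} \sum_{k=1}^{M} AP(k)
\end{align}
```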
In the above equations, TP is the number of samples correctly classified as positive, FP is the number of samples incorrectly classified as positive, FN is the number of samples incorrectly classified as negative, M is the total number of categories, and AP(k) is the AP value of the k-th class.
The F1 score is a metric used to measure the accuracy of a binary classification model and is often used as a comparative experimental metric. The F1 score is the harmonic mean of the model’s precision and recall, with a maximum value of 1 and a minimum value of 0, as shown in Equation (26).
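The relationship between the raw detection counts and these metrics can be sketched in a few lines of Python; the counts in the example are illustrative, not results from this study.

```python
def f1_score(tp, fp, fn):
    """Compute precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp)              # reliability of positive detections
    recall = tp / (tp + fn)                 # completeness of positive detections
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Example: 90 true positives, 10 false positives, 10 false negatives
p, r, f1 = f1_score(90, 10, 10)  # p = r = f1 = 0.9
```

Because F1 is a harmonic mean, it is pulled toward the smaller of precision and recall, so a model cannot achieve a high F1 by maximizing one at the expense of the other.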
In object detection models, FPS (Frames Per Second) is used as an experimental metric to measure the real-time inference speed of the model, i.e., how many image frames the model can process per second.
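A minimal FPS measurement sketch is shown below. It is a generic timing protocol, not the exact procedure used in this study: `infer_fn` stands in for a detector's forward pass, and warm-up iterations are excluded so that one-time initialization cost does not distort the estimate.

```python
import time

def measure_fps(infer_fn, frame, warmup=10, runs=100):
    """Estimate inference FPS: average wall-clock time over `runs` calls,
    after `warmup` untimed calls."""
    for _ in range(warmup):            # warm-up is excluded from timing
        infer_fn(frame)
    start = time.perf_counter()
    for _ in range(runs):
        infer_fn(frame)
    elapsed = time.perf_counter() - start
    return runs / elapsed

# Usage with a dummy stand-in for a detector's forward pass:
fps = measure_fps(lambda x: sum(x), list(range(1000)))
```

Note that when timing GPU inference (as on the RTX 4060 used here), the device must be synchronized (e.g., `torch.cuda.synchronize()`) before reading the timer, since CUDA kernel launches are asynchronous.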
5. Discussion
This study proposes an improved multi-scale object detection network based on YOLOv11n (M-YOLOv11n) for the recognition and detection of blueberry fruit at different ripeness levels. On top of the YOLOv11n object detection network, a Multi-Scale Block (MsBlock) built on depthwise separable convolution and an adaptive channel attention module (Squeeze-and-Excitation, SE) are introduced. While significantly improving mAP, Precision, Recall, and F1-score, these modifications bring only a small increase in the number of parameters and memory usage. Overall, the model still maintains a lightweight structure, which facilitates deployment on agricultural embedded mobile devices and provides a reliable detection basis for picking robots and early crop yield estimation.
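The lightweight property attributed to depthwise separable convolution can be made concrete with a parameter count. The arithmetic below uses illustrative channel widths (64 in, 128 out, 3 × 3 kernel), not the actual M-YOLOv11n layer sizes:

```python
def conv_params(k, c_in, c_out):
    """Weight count of a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out

def dws_conv_params(k, c_in, c_out):
    """Depthwise separable convolution: a k x k depthwise conv (one filter
    per input channel) followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out

std = conv_params(3, 64, 128)      # 3*3*64*128 = 73,728 weights
dws = dws_conv_params(3, 64, 128)  # 3*3*64 + 64*128 = 8,768 weights
ratio = std / dws                  # roughly 8.4x fewer parameters
```

This reduction, roughly a factor of k² for large channel counts, is why replacing standard convolutions with depthwise separable ones inside the MsBlock adds multi-scale capacity at a modest parameter cost.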
According to the different scenes in the natural environment, blueberry image datasets for three scenarios, namely slight occlusion, severe occlusion, and backlight, were created. Comparative experiments were conducted between the YOLOv11n object detection network before and after improvement and the SSD-MobileNet, YOLOv8n, and Faster R-CNN object detection networks. The results show that the mean average precision and F1 score of the improved object detection network (M-YOLOv11n) reach 96.5% and 96%, respectively. For the detection of blueberries at three maturity levels, the M-YOLOv11n object detection network performs better and provides higher recognition accuracy while achieving real-time performance.
In the current study, the evaluation of model performance mainly focuses on the characteristics of target detection and multi-classification tasks. Therefore, metrics such as mean average precision (mAP), precision, recall, and F1 score are used for measurement. These metrics directly reflect the model’s ability to locate blueberry fruits in complex natural scenes and accurately classify their maturity categories. However, it should be noted that indicators such as the coefficient of determination (R²), mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE) are typically applicable to regression analysis of continuous variables, such as predicting specific physiological parameters like fruit size, sugar content, or firmness. The lightweight detection framework established in this study lays a stable foundation for further continuous quantitative prediction of blueberry maturity. In future work, if such regression analysis is conducted, the above error and goodness-of-fit indicators will have significant evaluation value.
The M-YOLOv11n model proposed in this study demonstrated superior detection accuracy and real-time performance in natural environments, but it still has some notable limitations. Firstly, at the level of commercialization and practical deployment, although the model is lightweight, integrating it into actual agricultural production processes (such as robotic harvesting or post-harvest sorting lines) still requires addressing numerous engineering issues. As a robot-mounted perception system, its performance needs to be systematically verified under mobile-platform motion, vibration disturbance, and different lighting periods, and it must meet the very low latency requirements of harvesting action planning. For sorting-line applications, the camera installation position, production line speed, and the software and hardware coordination with existing sensors (such as weighing devices and spectrometers) need to be considered. In addition, long-term maintenance of the model after deployment is crucial, including: performing regular re-training or domain adaptation when production areas differ in lighting conditions or blueberry varieties (fruit size, bloom, color depth); establishing calibration procedures for different camera models and installation positions; and evaluating the model’s robustness to common on-site disturbances such as lens stains and dust. Secondly, in terms of the generalization and transferability of the model, the dataset used in this study comes from a fixed camera configuration on a single farm in Florida, USA; it covers complex scenarios such as occlusion and backlighting, but does not systematically cover the diverse varieties and cultivation patterns of major global production areas.
The model’s adaptability to different blueberry varieties (especially those with significant differences in skin luster or bloom characteristics) needs further verification. However, the core contribution of this study, namely enhancing the model’s feature representation and discrimination ability in complex scenarios through the multi-scale module (MsBlock) and channel attention (SE), follows a generally applicable design concept. Therefore, the proposed improved architecture is expected to transfer directly to other small-fruit detection tasks (such as grapes, cherries, and strawberries) and to non-fruit agricultural detection tasks with similar challenges (such as pest and disease identification and flower counting), although data collection and fine-tuning for the specific target crops are required. These limitations indicate the direction for future research: promoting the model from laboratory performance verification to field engineering application and cross-crop generalization.
Future research should focus on advancing the model from laboratory performance validation toward field-ready engineering applications and cross-crop generalization. Key directions include conducting hardware-in-the-loop validation on typical edge computing platforms (e.g., the Jetson series) to optimize throughput, power consumption, and stability in real-world deployments, as well as constructing dynamic datasets that encompass factors such as platform vibration, multi-period lighting variations, and seasonal changes to enhance system robustness. Concurrently, it is essential to collaboratively build multi-regional benchmark datasets spanning different cultivars and growing seasons, and to develop lightweight domain adaptation methods suitable for embedded devices, thereby reducing the cost of adapting the model to new environments and crops. A full lifecycle management system for the model should also be established, incorporating online performance monitoring, automated calibration, and interactive update protocols. Furthermore, the core architecture proposed in this study should be migrated to other intensive agricultural vision tasks to systematically verify its potential as a general lightweight detection framework. Through these efforts, a practical, maintainable, and scalable agricultural vision solution can be formed, providing solid technical support for the perception layer of smart agriculture.