Article

Enhanced YOLO11n-Seg with Attention Mechanism and Geometric Metric Optimization for Instance Segmentation of Ripe Blueberries in Complex Greenhouse Environments

1 School of Information Science and Technology, Yunnan Normal University, Kunming 650500, China
2 Southwest United Graduate School, Kunming 650500, China
3 Department of Geography, Yunnan Normal University, Kunming 650500, China
4 Yunnan Institute of Geological Sciences, Kunming 650051, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(15), 1697; https://doi.org/10.3390/agriculture15151697
Submission received: 24 June 2025 / Revised: 30 July 2025 / Accepted: 4 August 2025 / Published: 6 August 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

This study proposes an improved YOLO11n-seg instance segmentation model to address the limitations of existing models in accurately identifying mature blueberries in complex greenhouse environments. Current methods often lack sufficient accuracy when dealing with complex scenarios, such as fruit occlusion, lighting variations, and target overlap. To overcome these challenges, we developed a novel approach that integrates a Spatial–Channel Adaptive (SCA) attention mechanism and a Dual Attention Balancing (DAB) module. The SCA mechanism dynamically adjusts the receptive field through deformable convolutions and fuses multi-scale color features. This enhances the model’s ability to recognize occluded targets and improves its adaptability to variations in lighting. The DAB module combines channel–spatial attention and structural reparameterization techniques. This optimizes the YOLO11n structure and effectively suppresses background interference. Consequently, the model’s accuracy in recognizing fruit contours improves. Additionally, we introduce Normalized Wasserstein Distance (NWD) to replace the traditional intersection over union (IoU) metric and address bias issues that arise in dense small object matching. Experimental results demonstrate that the improved model significantly improves target detection accuracy, recall rate, and mAP@0.5, achieving increases of 1.8%, 1.5%, and 0.5%, respectively, over the baseline model. On our self-built greenhouse blueberry dataset, the mask segmentation accuracy, recall rate, and mAP@0.5 increased by 0.8%, 1.2%, and 0.1%, respectively. In tests across six complex scenarios, the improved model demonstrated greater robustness than mainstream models such as YOLOv8n-seg, YOLOv8n-seg-p6, and YOLOv9c-seg, especially in scenes with dense occlusions. 
The improvement in mAP@0.5 and F1 scores validates the effectiveness of combining attention mechanisms and metric optimization for instance segmentation tasks in complex agricultural scenes.

1. Introduction

Blueberries, native to North America, are nutrient-rich fruits high in anthocyanins, pectin, and vitamins, offering numerous health benefits, including anticancer properties, cardiovascular disease prevention, and eye protection [1,2,3,4]. Economically, blueberries have a high market value and are easier to cultivate than many other perennial fruit trees [5,6]. Since the early 21st century, China has scaled up blueberry cultivation and, by 2022, had become the world's largest producer [7]. However, as production increases, traditional manual harvesting methods face significant challenges due to rising labor costs, which can account for 30% to 50% of total production expenses [8]. The short ripening period of blueberries often overlaps with the rainy season, and the fruit's inconsistent ripening can lead to quality degradation and economic losses if not harvested promptly [9,10]. Thus, accurate, fast, and reliable ripeness detection is crucial for improving harvest efficiency.
Traditional methods for assessing blueberry ripeness rely on manual evaluation, considering external features such as color, texture, and size, as well as tactile and olfactory assessments. These methods are highly subjective, inefficient, and vulnerable to environmental factors. Manual harvesting, being labor-intensive, also faces challenges in large-scale operations, especially in maintaining consistent ripeness and high yields. As a result, current methods fall short of meeting the need for efficient and precise detection in modern agriculture. Therefore, there is an urgent need for an automated, accurate, and reliable method to detect fruit ripeness, thereby enhancing harvesting efficiency and reducing economic losses.
Recent advances in deep learning and machine vision technologies have led to significant progress in fruit detection and ripeness recognition [11,12,13,14,15,16]. Studies have shown the potential of deep learning-based approaches to improve the accuracy and efficiency of ripeness detection. For example, Du et al. proposed a method based on fast gas chromatography–surface acoustic wave detection (FGC-SAW), which successfully differentiated blueberry varieties and ripeness by analyzing volatiles [17]. Similarly, MacEachern et al. developed deep learning convolutional neural network (CNN) models for detecting wild blueberry ripeness and estimating yield [18]. Furthermore, Yang et al. introduced a model that uses enhanced feature details and content-aware reorganization to improve ripeness detection accuracy [19]. Despite the promise of deep learning, existing models still struggle in complex environments, particularly those with occlusions and uneven lighting. Aguilera et al. found that partial occlusion was a primary source of errors and suggested optimizing these models for better speed and accuracy on embedded devices [20]. Feng et al. proposed a lightweight enhancement to the YOLOv5 algorithm for efficient detection of blueberry ripeness, yielding improved detection results [21]. Recently, multimodal fusion methods have been used to tackle these challenges. Wang et al. proposed a blueberry recognition model that combines the enhanced MSRCR method with YOLO, demonstrating great potential for accurately identifying blueberries at various ripeness stages [22]. Gai et al. developed a high-precision blueberry phenotyping model based on a hybrid task cascade, refining segmentation and recognition processes in complex environments for greater accuracy [23]. Further structural optimizations, such as Liu et al.’s reconstruction of the YOLOv5x network, have led to improved ripeness detection accuracy [24]. 
Weizhi Feng’s team introduced the SCConv module to enhance YOLOv9, improving robustness in complex occlusion scenarios [25].
Additionally, advancements in feature enhancement and attention mechanisms have further optimized detection capabilities. Xuetong Zhai et al. combined a two-way feature pyramid network with an attention mechanism to improve multi-scale feature fusion [26]. To meet the demands for computational efficiency, lightweight designs have become a key focus. Tian Youwen’s team achieved efficient deployment using MobileNetV3, complemented by attention mechanisms [27].
Despite these advancements, challenges such as the small size and dense arrangement of blueberries persist [28,29]. Conventional methods face three key limitations: poor adaptation of attention mechanisms to dynamic occlusions, bias in the IoU metric during target overlap, and the computational burden of processing high-resolution images [30,31,32]. While target detection methods excel at localization, their bounding-box-level outputs make it difficult to manage complex occlusions and overlaps. These issues underscore the importance of instance segmentation, which, with pixel-level accuracy, can accurately resolve spatial relationships between fruits and significantly enhance detection performance.
This study addresses the challenges of blueberry segmentation in complex greenhouse environments. The approach leverages YOLO11n-seg, a framework known for high real-time performance and strong feature extraction capabilities, as the core system. Additionally, a multi-dimensional optimization scheme is introduced to tackle key issues in high-density growing conditions, such as branch and leaf shading, uneven lighting, and fruit sticking. The specific enhancements are as follows.
(1) Spatial–Channel Adaptive Attention (SCA) mechanism: This mechanism integrates Enhanced Channel Attention (ECA) with deformable spatial attention (DSA). A three-modal pooling strategy—combining mean, maximum, and standard deviation—improves the robustness of color features under varying lighting conditions. This is further enhanced by deformable convolution, dynamically adjusting the receptive field to improve the geometric perception of occluded targets.
(2) Dual Attention Balance Block (DAB): The DAB combines channel–space attention with structural reparameterization techniques to optimize the C3K2 backbone network. This maintains real-time inference performance while reducing background interference (e.g., cultivation slot grids) and improving fruit contour recognition.
(3) Normalized Wasserstein Distance (NWD): NWD replaces the traditional IoU metric by modeling bounding box spatial relationships using a Gaussian distribution, effectively mitigating localization bias in dense small target scenes.

2. Materials and Methods

2.1. The Dataset

The dataset for this study was collected from a modernized greenhouse blueberry cultivation site in Chengjiang County (24.59° N, 102.90° E), Yunnan Province, China. As shown in Figure 1, this site was chosen due to its high-density cultivation pattern (plant spacing of 0.5 m × row spacing of 1 m). The cultivation model incorporates integrated water and fertilizer management. This pattern is highly representative of 85% of blueberry growing environments in southern China [33,34,35]. The use of a coir substrate system enhances fruit visibility compared to traditional soil-based cultivation, facilitating computer vision detection.
Images were captured using a professional camera with a resolution of 3072 × 4096 pixels at distances ranging from 0.1 to 0.8 m, covering a field of view from 40 to 120 degrees. To ensure optimal lighting, a mix of natural and artificial light sources was employed, and lighting conditions were properly calibrated. In total, 1405 raw images were captured and 9396 fruit instances were manually labeled (see Figure 2a,b).
A trained team of annotators conducted image annotation under the supervision of agricultural image analysis experts and computer vision researchers to ensure consistency and high-quality labeling. The labeling process followed detailed guidelines that addressed situations involving fruit overlap, occlusion, and the accurate delineation of fruit boundaries. To maintain labeling consistency, regular proofreading and expert reviews were conducted, and labeling quality was also assessed by comparing the intersection over union (IoU) between annotations and model predictions.
To improve the model’s generalization ability, various data augmentation techniques were applied to simulate real-world variations. These included (1) luminance adjustment to replicate diurnal light changes, (2) horizontal flipping to simulate the operational perspectives of a robotic picking arm, and (3) image distortion to account for deformation caused by branch and leaf movement. The consistency of the enhanced images with real-world conditions was verified using the Fréchet Inception distance, resulting in 3400 valid enhanced images.
The dataset was split into training, validation, and test sets in a 7:2:1 ratio, as shown in Table 1, following the standard structure of the COCO benchmark dataset. The test set was designed to represent six challenging scenarios: fruit overlapping, branch and leaf occlusion, uneven illumination, complex backgrounds, dense targets, and scale variation. Experimental parameters were validated through three repeatable trials, with data collection supervised by agricultural engineering and computer vision experts to ensure adherence to standards for agricultural AI model development. This rigorous experimental design ensures the dataset’s reliability for agricultural vision tasks (Figure 3).
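As an illustration, the 7:2:1 split described above can be reproduced with a short script. The seeded shuffle and the helper name `split_dataset` are our own choices for this sketch, not part of the published pipeline.

```python
import random

def split_dataset(image_ids, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle image IDs and split them into train/val/test by the given ratios."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)   # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# 3400 augmented images -> 2380 train, 680 val, 340 test
train, val, test = split_dataset(range(3400))
print(len(train), len(val), len(test))  # 2380 680 340
```

Splitting by shuffled ID keeps every image in exactly one subset, which is what the COCO-style directory layout in Table 1 assumes.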

2.2. Improved YOLO11n-Seg Algorithm

2.2.1. YOLO11n-Seg Network Architecture

YOLO11 (You Only Look Once Version 11) represents a significant advancement in real-time target detection, improving the performance of computer vision tasks such as target detection, instance segmentation, and pose estimation through its innovative network architecture [36]. The model uses a hierarchical feature extraction strategy. The backbone network first performs basic feature extraction through two 3 × 3 convolutional layers (stride = 2), generating feature maps with 64 and 128 channels to capture edge and texture information.
The C3k2_DAB module is then introduced with a parameter configuration of [256, False, 0.25], which enhances feature expressiveness using the dual attention mechanism (DAB) while maintaining computational efficiency with a scaling factor of 0.25. The feature pyramid section increases the number of channels to 1024 using three 3 × 3 convolutional layers (stride = 2). The multi-scale pooling operation in the SPPF module significantly enhances the model’s ability to handle targets of various sizes. The YOLO11n model’s head employs a multi-scale feature fusion strategy that integrates deep semantic information with shallow and detailed features through techniques like nearest-neighbor interpolation, upsampling, and feature splicing. The final Segment module, with a parameter configuration of [nc, 32, 256], enables precise instance segmentation, demonstrated in 80-category detection tasks using a 32-dimensional mask embedding vector. Notably, the model includes two attention modules: C2PSA and SCAttention. The C2PSA reduces background interference through cross-stage partial spatial attention, while the SCAttention dynamically adjusts feature weights using a joint spatial–channel attention mechanism, thereby improving robustness in complex scenes.
For greenhouse blueberry planting scenarios, this study proposes the YOLO11n-seg model, with targeted improvements in three areas: (1) enhanced feature extraction in the backbone network, improving the extraction of overlapping fruit edges through deformable convolution and the DAB; (2) incorporation of SCAttention in the feature fusion stage to prioritize ripe fruit [37]; and (3) adoption of a probability distribution matching strategy based on Normalized Wasserstein Distance (NWD) in the instance segmentation head, addressing small target segmentation challenges.
Experimental results show that these enhancements improve the model’s stability under greenhouse conditions with uneven illumination and significant variations in target scales. The enhanced network architecture is illustrated in Figure 4.
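For intuition, each stride-2 convolution described above halves the spatial resolution. The minimal sketch below traces a 640 × 640 input down to the deepest 20 × 20 grid; the channel counts for the middle stages are assumed here, following the usual YOLO doubling pattern.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

size = 640
for stage, channels in enumerate([64, 128, 256, 512, 1024], start=1):
    size = conv_out(size)
    print(f"stage {stage}: {size}x{size}, {channels} channels")
# The deepest stage is 20x20 (1/32 of the input), feeding the SPPF module.
```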

2.2.2. Innovative Spatial and Channel Attention Mechanism

In the context of the blueberry ripe fruit segmentation task, the Spatial–Channel Adaptive Attention (SCA) module, as proposed in this study, has been shown to enhance segmentation performance in complex scenarios through significant innovations. In comparison to the conventional CBAM [38] and SE attention mechanisms, the SCA module employs a novel integration of a three-level pooling strategy at the channel attention level. This approach utilizes mean pooling to capture global luminance features, maximum pooling to extract features of highly saturated regions, and standard deviation pooling to quantify the intensity of light changes. This feature fusion mechanism enables the model to adaptively manage the prevalent issue of uneven light in greenhouse environments. In the context of spatial attention, SCA utilizes a deformable convolution instead of a conventional fixed convolution kernel. This approach enables the accurate capture of the actual contours of shaded fruits through the dynamic adjustment of the receptive field. Furthermore, the construction of a pyramidal multi-scale feature processing flow enables SCA to achieve balanced processing of blueberry fruits of varying sizes, thereby effectively addressing the recognition challenges posed by scale variations among clustered fruits. The specific formulas for the channel attention mechanism through the three-modal pooling strategies, mean pooling, maximum pooling, and standard deviation pooling, are as follows:
\mathrm{MeanPooling}(x) = \frac{1}{N}\sum_{i=1}^{N} x_i
\mathrm{MaxPooling}(x) = \max(x_1, x_2, \ldots, x_N)
\mathrm{StdPooling}(x) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}
where μ is the mean value of the feature map, x_i denotes a pixel in the feature map, and N is the total number of pixels. These three pooling operations capture overall brightness, high-saturation fruit regions, and light variations, respectively, enhancing the model's ability to cope with different lighting conditions and improving the robustness of fruit segmentation. The deformable spatial attention mechanism addresses occlusion and irregular fruit shapes. Unlike the standard convolution operation, which samples on a rigid grid, deformable convolution dynamically adjusts the sampling positions based on learned offsets, allowing the receptive field to adapt to the actual contours of occluded or overlapping fruits. Deformable convolution is realized by the following equation:
y = \sum_{i,j} w(i,j)\, x\big(p + \Delta p(i,j)\big)
where w(i, j) denotes the weight of the convolution kernel, Δp(i, j) the learned offset, p the original sampling position of the convolution kernel, and x the input feature map. With deformable convolution, the model can accurately identify and segment objects occluded by fruits and branches.
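To make the channel-attention inputs concrete, the three-modal pooling strategy above can be sketched in NumPy; the function name and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def tri_modal_pool(x):
    """Per-channel mean, max, and std pooling over the spatial dimensions.
    x: feature map of shape (C, H, W); returns a (C, 3) descriptor."""
    c = x.shape[0]
    flat = x.reshape(c, -1)
    mean = flat.mean(axis=1)   # global luminance
    peak = flat.max(axis=1)    # highly saturated regions
    std = flat.std(axis=1)     # intensity of light changes
    return np.stack([mean, peak, std], axis=1)

x = np.random.rand(8, 32, 32).astype(np.float32)
desc = tri_modal_pool(x)
print(desc.shape)  # (8, 3)
```

In the SCA module this (C, 3) descriptor would then be reduced to per-channel weights, e.g. by a 1 × 1 convolution, as shown in Figure 5.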
The multi-scale feature fusion architecture enhances the model's capacity to handle variations in fruit scale. It adopts a pyramid structure that balances global context with local detail through downsampling and upsampling. This fusion enables the model to handle fruits of different sizes while maintaining the independent responses of small objects in dense target scenes, effectively avoiding fruit-sticking artifacts and yielding more precise segmentation across scales. The mathematical expressions for downsampling and upsampling are as follows:
X'[m, n] = \sum_{i=-k}^{k}\sum_{j=-k}^{k} W[i, j]\cdot X[2m+i,\, 2n+j]
X'[2m+p,\, 2n+q] = \sum_{i,j} W[i, j, p, q]\cdot X[m-i,\, n-j]
where W is the learnable convolutional kernel, k is the kernel radius, and p, q ∈ {0, 1} denote the interpolation position. Downsampling reduces the resolution of the input feature map, while upsampling restores the resolution of the downsampled result [37]. By integrating a color-sensitive channel attention mechanism, a deformable spatial attention mechanism, and multi-scale feature fusion, the SCA module provides a robust solution for greenhouse blueberry segmentation. These innovations effectively address occlusion, overlapping fruits, and uneven illumination, challenges with which conventional methods often struggle. The design of the SCA module not only remedies the shortcomings of previous methods (e.g., CBAM [38] or the SE module [39]) in dealing with occlusion or illumination noise in agricultural environments, but also improves segmentation performance under varied environmental conditions. Figure 5 illustrates the architecture of the SCA module. The left part of the figure shows the channel attention branch, which extracts color features through the three-modal pooling strategy and generates the channel weight matrix using a 1 × 1 convolution. The right part shows the deformable spatial attention branch, which adjusts the receptive field via deformable convolution to adapt to the geometry of the fruit and improve segmentation accuracy in occluded regions.
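As a concrete reading of the downsampling equation, the sketch below applies a stride-2 convolution in plain NumPy. The kernel indices are shifted to start at 0 so the borders stay in range, and the averaging kernel is an arbitrary example, not a learned weight.

```python
import numpy as np

def downsample(x, w, k=1):
    """Stride-2 convolution: out[m, n] = sum_{i,j} w[i, j] * x[2m+i, 2n+j].
    x: (H, W) feature map, w: (2k+1, 2k+1) kernel; 'valid'-style borders."""
    h, wd = x.shape
    out_h = (h - 2 * k - 1) // 2 + 1
    out_w = (wd - 2 * k - 1) // 2 + 1
    y = np.zeros((out_h, out_w), dtype=float)
    for m in range(out_h):
        for n in range(out_w):
            patch = x[2 * m:2 * m + 2 * k + 1, 2 * n:2 * n + 2 * k + 1]
            y[m, n] = (w * patch).sum()
    return y

x = np.arange(36, dtype=float).reshape(6, 6)
w = np.full((3, 3), 1 / 9)   # 3x3 averaging kernel (k = 1)
y = downsample(x, w)
print(y.shape)  # (2, 2): the 6x6 map is reduced to a quarter of its area
```

Each output cell summarizes a 3 × 3 neighborhood of the input, which is how the pyramid trades spatial resolution for context.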

2.2.3. Dual Attention Block

The dual attention module (DAB) proposed in this study achieves a comprehensive advance over the traditional SE mechanism in the YOLO11n-seg model. The DAB surpasses the limitation of SE, which relies solely on channel compression, by integrating the dual-branch architecture of the channel–spatial attention module (CSAM) and the parallel attention module (PAM) [40]. The CSAM module innovatively combines global channel weighting and local spatial feature enhancement, strengthening spatial perception through the σ(Conv(x)) operation and effectively suppressing background interference such as the cultivation slot grid. The PAM module, in turn, significantly enhances the differentiation of dense fruits by processing global context and local details in parallel. Unlike SE, which only performs channel weighting through global average pooling, DAB also introduces a local channel refinement mechanism, (x − σ(Conv(GAP(x)))) + x, which explicitly enhances the feature differences between neighboring fruits.
In terms of architectural design, DAB is deeply integrated with the backbone network, such as the C3k2 structure. This integration is achieved through the C3k2_DAB class, where DAB is embedded into the C3k2 module. Specifically, the C3k2_DAB class inherits from the C3k2 module, and the DAB module is incorporated into the network as a Dual Attention Block. This addition enhances the feature extraction capacity of the backbone network. The C3k_DAB class, which integrates DAB with the C3k module, further enhances feature handling by introducing the Dual Attention Block. This combination optimizes the computational efficiency through structural reparameterization techniques and significantly improves segmentation accuracy in complex scenes while maintaining real-time processing capability (Figure 6).
These innovations enable the DAB to demonstrate stronger adaptability and robustness in agricultural scenarios, such as light changes, branch and leaf occlusions, and fruit overlaps, providing new technical solutions for segmentation tasks. Specifically, DAB preprocesses the image using the channel–space attention mechanism to remove haze, uneven lighting, and blurring noise, thereby significantly improving image quality. In the blueberry detection task, it is essential to counter complex backgrounds (e.g., branch and leaf occlusion or other interferences). Channel attention learns the importance of each channel in an image through global average pooling (GAP) and fully connected layers with the following expression.
\mathrm{ChannelAttention} = \sigma\big(W_2 \cdot \mathrm{GELU}(W_1 \cdot \mathrm{AvgPool}(X))\big)
In this equation, σ represents the sigmoid activation function, W_1 and W_2 denote the learned weight matrices, and AvgPool(X) signifies the global average pooling operation applied to the input feature map X. Spatial attention weights the spatial features of an image through a convolution operation, as depicted below.
\mathrm{SpatialAttention} = \sigma(\mathrm{Conv}(x))
These operations enable the model to focus on the blueberry area, thereby suppressing background interference and improving target detection accuracy.
Subsequently, the local channel attention module further enhances the image representation ability by refining the local areas in the feature map. In instances where fruits exhibit overlap or are nearby, local channel attention facilitates the model’s ability to differentiate between neighboring blueberries. The mathematical expression is as follows.
\mathrm{LocalChannelAttention} = \big(x - \sigma(\mathrm{Conv}(\mathrm{GAP}(x)))\big) + x
This mechanism ensures accurate segmentation in dense regions or small targets, thereby avoiding both false positives and false negatives.
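The channel-attention expression σ(W₂ · GELU(W₁ · AvgPool(X))) can be sketched numerically as follows. The weight shapes (a squeeze to C/2 and back) and the tanh-form GELU are illustrative choices, not the trained model's parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def channel_attention(x, w1, w2):
    """sigma(W2 . GELU(W1 . AvgPool(x))) for a feature map x of shape (C, H, W)."""
    pooled = x.mean(axis=(1, 2))            # global average pooling -> (C,)
    return sigmoid(w2 @ gelu(w1 @ pooled))  # per-channel weights in (0, 1)

x = np.random.rand(8, 16, 16)
w1 = np.random.randn(4, 8) * 0.1   # squeeze: C -> C/2
w2 = np.random.randn(8, 4) * 0.1   # excite: C/2 -> C
weights = channel_attention(x, w1, w2)
print(weights.shape)  # (8,)
```

The resulting vector rescales the channels of x, so channels correlated with ripe-fruit color can be emphasized and background channels suppressed.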
The attention modules of DAB, the Parallel Attention Module (PAM), and the Channel–Space Attention Module (CSAM) are integrated into the feature extraction layer of YOLO11n-seg. The PAM module enhances the model’s performance across different scales, particularly in the precise detection and segmentation of blueberries, by combining global channel attention, local spatial attention, and spatial attention.

2.2.4. Normalized Wasserstein Distance

NWD (Normalized Wasserstein Distance) is a new metric proposed to overcome the sensitivity of the traditional IoU (intersection over union) to positional bias in small object detection, which is particularly suitable for small object detection in dense clusters or complex backgrounds, such as blueberry fruits [41].
In this study, NWD does not entirely replace IoU; rather, it is weighted with IoU to form a joint loss function (L = λ1·L_IoU + λ2·L_NWD) and serves as a complementary mechanism focused on optimizing detection performance for small targets and low-overlap scenes. We model each bounding box as a two-dimensional Gaussian distribution, with the centroid of the box as μ = (cx, cy), the width as w, and the height as h; the probability density function of the 2D Gaussian distribution is represented by the following equation.
f(x \mid \mu, \Sigma) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)
The Wasserstein distance between two two-dimensional Gaussian distributions is calculated using the following formula:
W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \lVert \mu_a - \mu_b \rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_a + \Sigma_b - 2\left(\Sigma_a^{1/2}\,\Sigma_b\,\Sigma_a^{1/2}\right)^{1/2}\right)
where μ_a and μ_b are the mean vectors of the two distributions, Σ_a and Σ_b are their corresponding covariance matrices, ‖·‖_2 denotes the Euclidean norm (L2 norm), Tr represents the trace of a matrix (the sum of its diagonal elements), and Σ_a^{1/2} and Σ_b^{1/2} are the square roots of the covariance matrices.
In order to convert the Wasserstein distance into a normalized metric suitable for measuring similarity, the Normalized Wasserstein Distance (NWD) is proposed:
\mathrm{NWD}(\mathcal{N}_a, \mathcal{N}_b) = \exp\!\left(-\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C}\right)
where C is a constant related to the dataset and is usually set to the average absolute size of the dataset.
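Under the common simplification Σ = diag(w²/4, h²/4), the squared Wasserstein distance between two box Gaussians reduces to a plain Euclidean distance over (cx, cy, w/2, h/2), which makes NWD easy to sketch. The value C = 12.8 below is an illustrative constant; in practice C is tied to the dataset's average object size.

```python
import numpy as np

def nwd(box_a, box_b, C=12.8):
    """Normalized Wasserstein Distance between two boxes (cx, cy, w, h).
    Each box is modeled as a 2-D Gaussian with mu = (cx, cy) and
    Sigma = diag(w^2/4, h^2/4); C is a dataset-dependent constant."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # For diagonal covariances the squared W2 distance has a closed form:
    w2_sq = (ax - bx) ** 2 + (ay - by) ** 2 \
          + (aw / 2 - bw / 2) ** 2 + (ah / 2 - bh / 2) ** 2
    return float(np.exp(-np.sqrt(w2_sq) / C))

print(nwd((10, 10, 4, 4), (10, 10, 4, 4)))  # 1.0: identical boxes
# Two disjoint boxes have IoU = 0, yet NWD still yields a graded similarity:
print(nwd((10, 10, 4, 4), (18, 10, 4, 4)))
```

The second call illustrates the key property exploited here: similarity degrades smoothly with center offset instead of collapsing to zero the moment the boxes stop overlapping.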
The primary benefit of NWD is its resilience to variations in object scale and its capacity to handle positional deviations, which makes it more effective than IoU for small objects, particularly when the bounding boxes overlap minimally or not at all, cases in which NWD still yields an accurate similarity score. However, NWD may encounter limitations with objects that are long and narrow or have complex shapes, and it also faces challenges in computational cost, selection of the constant C, and handling of occlusion. Future research may further enhance NWD's performance by introducing models better suited to complex shapes, optimizing the computation, and improving constant-selection methods, particularly for large-scale datasets and high-density scenarios.

3. Results

3.1. Experimental Equipment and Environment

The experimental environment of this study is based on the Windows 11 operating system, with hardware consisting of an NVIDIA RTX 4060 GPU (8 GB GDDR6) and an Intel Core i9-14900HX processor. The GPU's parallel computing capability efficiently handles intensive target segmentation tasks, and the processor's multithreading performance supports data pre-processing and other computationally intensive operations. The software environment uses the PyTorch deep learning framework under Python 3.8, chosen for its well-developed community support and effective tensor-operation optimization in computer vision tasks. Input images were uniformly resized to 640 × 640 pixels, a resolution that balances graphics memory usage with computational efficiency while preserving the model's sensitivity to small blueberry targets. During the training phase, CUDA acceleration is enabled, and the Adam optimizer (momentum 0.9, weight decay 0.0001) is employed to stabilize gradient updates. The initial learning rate is set to 0.001, and training converges fully within 300 epochs. The batch size is determined by the GPU memory capacity and training stability. Although the single-GPU configuration may limit training speed on large-scale datasets, the current hardware is sufficient for the experimental requirements. To ensure comparability of results, all comparative experiments were conducted under the same environmental conditions. The experimental records document the hardware and software configurations as well as the hyperparameter settings, facilitating reproducibility and validation in subsequent studies.

3.2. Evaluation Metrics

In this study, we employed precision (P), recall (R), average precision (AP), F1-score, inference speed, and GFLOPs as metrics to assess the efficacy of ripe blueberry detection and segmentation. Precision (P) is the proportion of samples predicted by the model as positive that are in fact positive; recall (R) is the proportion of actual positives that are correctly predicted as positive; and the F1-score is the harmonic mean of precision and recall, balancing the two. The specific formulas are as follows.
P = \frac{TP}{TP + FP}
R = \frac{TP}{TP + FN}
AP = \int_{0}^{1} P(r)\,dr
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i
IoU = \frac{\mathrm{Area}_{\mathrm{Intersection}}}{\mathrm{Area}_{\mathrm{Union}}}
F1\text{-}score = \frac{2 \times P \times R}{P + R}
mAP is calculated as the average precision at an IoU threshold of 0.5 (mAP@0.5). Together, these metrics comprehensively evaluate the performance of the blueberry instance segmentation model.
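As a quick check of the formulas above, precision, recall, and F1 can be computed from raw counts; the counts below are made up purely for illustration.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp)          # P = TP / (TP + FP)
    r = tp / (tp + fn)          # R = TP / (TP + FN)
    f1 = 2 * p * r / (p + r)    # harmonic mean of P and R
    return p, r, f1

p, r, f1 = detection_metrics(tp=90, fp=10, fn=15)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.857 0.878
```

Note that F1 sits between P and R and is pulled toward the smaller of the two, which is why it is a useful single-number summary for imbalanced detection results.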

3.3. Experimental Results and Analysis

3.3.1. Comparison Experiment

In this study, we systematically compared the performance of advanced models, namely BerryNet [42], Deep Blueberry [43], and the YOLO series (YOLOv8n-seg, YOLOv8n-seg-p6, YOLOv9c-seg, and YOLO11n-seg), with that of our improved model on the task of segmenting blueberry instances in complex field scenarios. The results are shown in Table 2 (detection metrics) and Table 3 (segmentation metrics).
In blueberry cluster and fruit detection, our model offers two notable advantages: detection accuracy and efficiency, achieved through the integration of multi-view features and a lightweight network design. It reaches 92.5% mAP@0.5, an improvement of 0.2% over YOLOv8n-seg (92.3%) and 2.1% over YOLOv9c-seg (90.4%). Its recall of 87.8% is also significantly higher than that of YOLOv8n-seg-p6 (84.1%) and YOLOv9c-seg (83.7%). The model demonstrated a 7% lower missed-detection rate in scenes with fruit overlap and branch-and-leaf occlusion, indicating its efficacy in these contexts. In terms of computational efficiency, the model requires 10.5 GFLOPs with an inference time of 2.2 ms, significantly outperforming YOLOv9c-seg (138 GFLOPs, 12.5 ms), and improves accuracy by 1.8% over YOLO11n-seg (10.2 GFLOPs, 1.9 ms). This demonstrates a balanced approach that considers both real-time performance and accuracy.
In the fruit mask segmentation task, the model achieves an mAP@0.5 of 92.3% and an F1 score of 89.2%, a statistically significant improvement (p < 0.05) over YOLOv8n-seg (92.2%, 88.1%) and YOLO11n-seg (92.2%, 88.1%). It demonstrates a better balance of precision (90.6%) and recall (87.8%) than YOLOv8n-seg-p6 (91.3% precision, 85.1% recall), and clearly outperforms BerryNet, which achieves an mAP@0.5 of only 78%. The model substantially improves boundary segmentation precision in scenarios with complex lighting and dense clusters, raising mIoU from 54.3% (BerryNet) to 89.2%, a 64.3% relative improvement, and it is more robust to variations in fruit size under shading.
For the evaluation, mAP@0.5 and F1 score were used as the primary performance metrics. mAP@0.5 was chosen to assess the overall detection accuracy, and the F1 score was used to evaluate the balance between precision and recall in segmentation tasks. These two metrics were selected due to their comprehensive ability to evaluate both the localization accuracy and the segmentation quality of the models, particularly in complex scenarios involving small and densely packed blueberry targets.
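To illustrate how a per-class AP value (and hence mAP@0.5) is obtained in practice, the sketch below is a simplified, hypothetical implementation that accumulates a precision–recall curve from score-ranked detections and integrates it with all-point interpolation; it is not the evaluation code used in the paper:

```python
def average_precision(matches, num_gt):
    """matches: list of (score, is_true_positive) for one class.
    num_gt: number of ground-truth objects of that class.
    Returns AP as the area under the interpolated P-R curve."""
    matches = sorted(matches, key=lambda m: m[0], reverse=True)
    tp = fp = 0
    points = []
    for _, is_tp in matches:
        tp += is_tp
        fp += not is_tp
        points.append((tp / num_gt, tp / (tp + fp)))  # (recall, precision)
    ap, prev_r = 0.0, 0.0
    for i, (r, _) in enumerate(points):
        # interpolated precision at r: max precision at any recall >= r
        p_interp = max(p for _, p in points[i:])
        ap += (r - prev_r) * p_interp
        prev_r = r
    return ap
```

Averaging this quantity over all classes at an IoU matching threshold of 0.5 gives mAP@0.5.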
From the perspective of engineering applications, our model substantially improves on YOLOv9c-seg in GFLOPs and inference time and has lower computational complexity than YOLOv8n-seg-p6. Its multi-view fusion strategy is particularly effective in fruit-overlap scenarios, and its detection accuracy in backlight environments improves by 3.2% over YOLO11n-seg. YOLOv9c-seg, by contrast, is constrained by its computational load (138 GFLOPs) in real-time field deployment.
In addition, the proposed model can directly support blueberry genotype screening by introducing fruit compactness metrics, whereas the YOLO series models focus on generic target detection and cannot extract phenotypic parameters.
Overall, our model surpasses the YOLO series in detection and segmentation accuracy, particularly through multi-view fusion and robustness to scene complexity. Like BerryNet and Deepblueberry, it is well suited to blueberry field phenotyping and harvesting management, a capability enabled by its lightweight design and extraction of phenotypic parameters. In contrast, YOLOv9c-seg delivers high accuracy at high computational cost, and YOLO11n-seg prioritizes speed over accuracy, whereas our model strikes an optimal balance between the two.

3.3.2. Ablation Experiment

To verify the effectiveness of the Spatial–Channel Adaptive Attention (SCA), Dual Attention Balance (DAB), and Normalized Wasserstein Distance (NWD) modules, systematic ablation experiments are conducted using YOLO11n-seg as the baseline model. The independent contributions and synergistic effects of the modules are evaluated in terms of two dimensions: target detection and instance segmentation. The results are presented in Table 4 and Table 5.
In bounding box detection, the NWD module models the spatial relationship between bounding boxes with a Gaussian distribution, improving detection mAP@0.5 to 92.4% and recall by 0.7% in dense scenes. This mitigates localization bias caused by fruit overlap but has a limited effect on small-target recall. The SCA module dynamically adjusts the receptive field, combining multi-scale color feature fusion and deformable convolution, improving detection precision by 1.6%; however, recall drops by 2%, suggesting that over-enhancing key features may limit generalization in complex backgrounds. The DAB module raises detection precision to 90.2% through channel–spatial attention synergy while keeping recall stable at 86.3%, demonstrating effective feature focusing.
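For readers unfamiliar with NWD, the metric of Wang et al. [41] models each box as a 2D Gaussian and maps the resulting Wasserstein distance through an exponential. A minimal sketch, assuming boxes in (cx, cy, w, h) format and a dataset-dependent normalization constant c (the value 12.8 below is a placeholder, not the constant used in this paper):

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance between two boxes (cx, cy, w, h),
    each modeled as a 2D Gaussian N((cx, cy), diag((w/2)^2, (h/2)^2)).
    Returns a similarity in (0, 1]; identical boxes give 1.0."""
    w2_sq = ((box_a[0] - box_b[0]) ** 2 + (box_a[1] - box_b[1]) ** 2
             + ((box_a[2] - box_b[2]) / 2) ** 2
             + ((box_a[3] - box_b[3]) / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)
```

Unlike IoU, this similarity decays smoothly with center distance even when two small boxes no longer overlap, which is what makes it better behaved for dense small-target matching.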
When combined, the three modules increase detection precision, recall, and mAP@0.5 to 91.2%, 87.8%, and 92.5%, respectively, improving by 1.8%, 1.5%, and 0.5% over the baseline. The SCA + DAB combination raises recall from 84.6% to 85.5% by balancing channel and spatial attention, while NWD optimizes localization in dense scenes.
In mask segmentation, the SCA module enhances edge feature extraction with a deformable attention mechanism, resulting in segmentation precision and recall of 90.8% and 91.0%, respectively, improvements of 1.9% and 4.8% over the baseline. The IoU at the stalk connection improves by 2.1%. The DAB module suppresses branch and leaf interference, boosting segmentation precision by 0.7% and reducing mis-segmentation by 0.7% in masked scenes. NWD has a minor impact on segmentation, improving mAP@0.5 by just 0.1%. However, combining SCA + DAB improves segmentation scores to 92.3% for mAP@0.5 and 88.7% for F1, reflecting a 0.3% and 1.2% improvement over the baseline.
The integration of SCA + DAB reduces mis-segmentation in complex backgrounds by 1.2%, while NWD boosts recall for dense targets by 2.1%.
Compared to traditional loss functions such as SD-IoU and Inner-IoU (Table 6 and Table 7), NWD performs best, achieving 92.5% mAP@0.5 and an 89.5% F1 score in bounding box detection, and 92.3% mAP@0.5 and an 89.2% F1 score in mask segmentation. Modeling bounding box spatial relationships with a Gaussian distribution reduces localization deviation in dense small-target scenes by 12.3% and improves mask boundary fitting (mIoU) by 64.3% over BerryNet. The model also maintains real-time efficiency at 10.5 GFLOPs and 2.2 ms inference, demonstrating the effectiveness of geometric metric optimization in complex agricultural scenes.
The three modules work synergistically through the “Feature Enhancement–Attention Optimization–Metrics Improvement” path. SCA enhances geometric perception of occluded targets, DAB balances feature weights in complex backgrounds, and NWD optimizes localization in dense scenes, resulting in a 9.5% overall performance improvement. However, the modules have limitations: SCA impedes recall, NWD has limited impact on segmentation, and DAB’s effectiveness depends on the combination of modules. Future work will focus on optimizing module independence through neural architecture search and enhancing model robustness by integrating multispectral data.

3.3.3. Attention Mechanism Performance Comparison Experiment

This study systematically compared the performance of four attention mechanisms in blueberry instance segmentation. The results reveal that each mechanism has its strengths and weaknesses in different scenarios, offering valuable insights for practical applications.
The SE module achieved the highest bounding box precision of 91.4%, owing to its channel attention mechanism, which enhances the distinguishability of color features. It performs well in single-object scenarios but struggles in dense fruit scenes, where recall drops to 85.6%, reflecting its limitations in modeling spatial relationships, particularly for overlapping targets and complex backgrounds.
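As background, the SE channel attention compared here can be sketched in a few lines of NumPy. This is an illustrative re-implementation under our own assumptions, not the paper's code; w1 and w2 stand in for the two learned fully connected layers of the squeeze-and-excitation block (with reduction ratio r):

```python
import numpy as np

def se_attention(x, w1, w2):
    """Minimal squeeze-and-excitation sketch for a feature map x of shape
    (C, H, W); w1: (C // r, C), w2: (C, C // r) are learned weights."""
    z = x.mean(axis=(1, 2))                # squeeze: global average pool -> (C,)
    s = np.maximum(w1 @ z, 0.0)            # excitation FC1 + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))    # excitation FC2 + sigmoid -> (C,)
    return x * s[:, None, None]            # reweight each channel
```

Because the attention vector s depends only on channel-wise averages, the block recalibrates "which channels matter" but carries no spatial information, which is consistent with the recall drop observed for overlapping targets.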
The CBAM module demonstrated a balanced performance, with a bounding box accuracy of 90.0%, a recall rate of 84.9%, and an mAP@0.5 of 92.2%. Its dual spatial–channel attention mechanism ensures stable performance in complex scenes. However, its fixed convolution kernel structure leads to a decline in accuracy when handling irregularly shaped objects, a challenge addressed by the SCA module through the use of deformable convolution.
The SCA mechanism, utilizing deformable convolution, improved detection accuracy to 90.8%, validating the dynamic adjustment of the receptive field for better target recognition. While effective for occluded targets, its recall rate of 84.6% suggests that an over-reliance on geometric deformation may compromise feature stability and recognition integrity.
The DAB module, while achieving slightly lower detection accuracy (88.5%), excelled in dense target scenarios, with a recall rate of 86.3%. It balanced accuracy and efficiency by using structural re-parameterization, enhancing mAP@0.5 to 91.7% while maintaining fast inference speed.
Combining SCA and DAB led to a notable improvement, increasing recall from 84.6% to 87.2%, and further to 87.8% with NWD integration. This combination enhanced performance in complex scenes by improving geometric perception (SCA), balancing feature weights (DAB), and optimizing localization (NWD). In terms of F1 scores, the SCA + DAB combination achieved the highest at 88.2%, followed by SE (87.9%) and CBAM + DAB (87.3%). These results underscore the effectiveness of optimizing multi-attention mechanisms.

3.3.4. Statistical Significance Analysis

To assess the statistical significance of the observed performance differences between YOLO11-seg and our model, we performed paired t-tests on per-sample metrics of the test set for both the object detection and instance segmentation tasks. The t-statistic is calculated as follows.
$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$

$s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(d_i - \bar{d})^2}$

$t = \frac{\bar{d}}{s_d / \sqrt{n}}$
In this context, $n$ denotes the number of paired samples, $d_i$ represents the metric difference for the same sample between the two models, and $s_d$ denotes the standard deviation of the differences. Tests were conducted at a significance level of 0.05. The calculated t-values were 4.91 for bounding box detection and 3.40 for segmentation, both exceeding the critical value of 3.182, and the corresponding p-values were below 0.05. The performance differences between YOLO11-seg and our model are therefore statistically significant in both tasks, confirming that our model's superiority in object bounding box detection and object segmentation is not attributable to chance.
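The paired t-statistic defined above can be computed directly from the per-sample differences; a minimal pure-Python sketch (illustrative, with hypothetical inputs):

```python
import math

def paired_t(x, y):
    """Paired t-statistic for per-sample metric pairs (x_i, y_i)."""
    d = [a - b for a, b in zip(x, y)]        # per-sample differences d_i
    n = len(d)
    mean = sum(d) / n                        # d-bar
    sd = math.sqrt(sum((di - mean) ** 2 for di in d) / (n - 1))
    return mean / (sd / math.sqrt(n))        # t = d-bar / (s_d / sqrt(n))
```

The resulting t-value is then compared against the critical value of the t-distribution with n - 1 degrees of freedom at the chosen significance level.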

3.3.5. Visual Results Analysis

To provide visual validation of the improved model's performance in complex agricultural scenarios, this study conducted a visual analysis along three dimensions: loss convergence characteristics, multi-scenario segmentation effects, and mask geometric accuracy. The relevant results are displayed in Figure 7, Figure 8, Figure 9 and Figure 10 and Table 8. Regarding convergence, the bounding box regression loss (Box Loss) of the improved model rapidly converges to a stable level near zero. During the initial training phase, it converged approximately 30% faster than comparable models such as YOLOv8n-seg and YOLOv9c-seg, and its final loss value was reduced by over 85%.
The mask segmentation loss (Seg Loss) exhibited stability within the 250–300 training epoch range.
A 12.3% reduction in Seg Loss was observed compared to YOLO11n-seg. However, a 5.7% fluctuation persists in extreme occlusion scenarios, wherein leaves completely cover fruits. This observation validates the synergistic effect of the SCA attention mechanism and the NWD metric on feature optimization. This finding also suggests that the feature inference capability for zero-visibility targets requires enhancement.
Combining the data from Figure 9 and Table 9 reveals that the enhanced model exhibits considerable advantages in six challenging scenarios. In fruit overlap scenarios (Figure 9a), the dynamic receptive field adjustment of the SCA module reduces the mask adhesion rate by 22.4% compared with YOLOv8n-seg while increasing the per-frame fruit segmentation area percentage to 11%. The inference time was reduced by 36% to 0.068 ms, and the confidence level reached 91.45%, 10.19% higher than that of YOLOv8n-seg.
In the branch and leaf occlusion scenario (Figure 8b), the local channel attention of the DAB module improved the recognition integrity of partially occluded fruits by 19.3%, with a segmentation area percentage of 2.52% and a confidence level of 88.6%; however, the mask contour deviation of completely occluded fruits remained at 31.5%. In backlight scenes (Figure 9c), the tri-modal pooling strategy reduced jagged edges in overexposed regions by 41.7%, stabilized the segmentation area percentage at 11.46%, and raised the confidence level to 93.7%, although a misclassification rate of 12.8% persisted under intense reflective light. In dense target scenes (Figure 8e), the NWD metric reduced the localization error of adjacent fruits by 18.9%, and the single-frame segmentation area percentage reached 6.13%, an improvement of 5.5% over YOLOv8n-seg.
As shown in Figure 9 and Figure 10, the visualization results indicate that the mean intersection over union (IoU) between the enhanced model's detection boxes and the ground truth boxes is 0.895, a 7.3% improvement over YOLOv9c-seg, accompanied by a 24.6% reduction in small-object localization error. The mean pixel error of fruit edges is 1.23 pixels, a substantial 56.4% reduction compared with BerryNet; however, the convex hull error for irregularly shaped fruits remains at 8.7%, and the standard deviation of maturity prediction values decreases from 0.15 to 0.08. Typical failure cases show that YOLOv9c-seg suffers from detection box offset and duplicate labeling in dense target scenes, while YOLO11n-seg exhibits logical errors in maturity annotation. The enhanced model addresses these issues through the synergistic optimization of SCA, DAB, and NWD.
A thorough visual analysis of the model reveals three technical bottlenecks: the mask reconstruction accuracy for completely occluded fruits is only 57.3%, the mask Dice coefficient for irregularly shaped fruits is 11.5% lower than that for regular fruits, and the false positive rate on reflective film backgrounds reaches 9.8%. Subsequent research endeavors should incorporate temporal dynamic modeling, generate synthetic data using a physics engine, and integrate frequency domain filtering with near-infrared spectral characteristics to formulate an integrated framework of “2D segmentation–3D reconstruction–dynamic tracking.” This approach is expected to propel the advancement of greenhouse blueberry phenotyping analysis from 2D recognition to 3D perception.

4. Discussion

This study presents an enhanced YOLO11n-seg framework designed to tackle the challenges of target occlusion, uneven lighting, and dense clustering in greenhouse blueberry instance segmentation. The framework synergistically optimizes attention mechanisms and metrics. The space–channel adaptive attention mechanism dynamically adjusts the feature perception range through deformable convolutions and integrates multi-scale color feature fusion, improving the model’s adaptability in complex scenes. The Dual Attention Balancing module combines channel–spatial attention with structural reparameterization techniques, boosting target recognition accuracy while maintaining real-time inference performance. By replacing the conventional IoU metric with the Normalized Wasserstein Distance (NWD), the model effectively mitigates localization bias for small dense targets. Experimental results demonstrate that the proposed method yields balanced improvements in both detection and segmentation, offering reliable support for automated picking systems in agricultural environments.
Compared to existing methods, the innovation of this study lies in its dynamic perception architecture. Traditional approaches rely on fixed receptive fields for feature extraction, which struggle with dynamic occlusion and other variations [43,44]. This study introduces a deformable attention mechanism that dynamically adjusts the feature perception range to better adapt to scene changes. Furthermore, traditional methods often use single-modal feature enhancement [45], whereas this study integrates both color and geometric features, enhancing robustness under complex lighting conditions [45,46]. To address target overlap in agricultural scenarios, the study incorporates probability distribution modeling, reducing bias in IoU metrics for small target localization. These innovations enable the model to perform better in complex agricultural environments [47].
However, the model has limitations. Real-time inference efficiency on mobile devices is suboptimal, especially when processing high-resolution images, which require significant computing power. Additionally, segmentation accuracy decreases for asymmetrical fruits, indicating that the shape invariance modeling needs refinement. Under extreme lighting conditions, such as intense direct light, color feature distortion may occur.
To address these challenges, future work can leverage neural architecture search and knowledge distillation techniques to reduce model size and improve mobile deployment efficiency. Using a physics engine to generate synthetic data of irregularly shaped fruits and applying shape constraint loss functions could enhance the model’s adaptability to irregular targets. Furthermore, incorporating near-infrared and other multispectral data can improve stability under uneven lighting conditions. Exploring the integration of 3D vision and spatiotemporal attention mechanisms may provide new avenues for fruit spatial localization and ripeness prediction.

5. Conclusions

This study addresses the challenges of target occlusion, uneven lighting, and dense clustering in blueberry instance segmentation in greenhouse environments. We propose a collaborative improvement framework based on attention mechanisms and metric optimization, comprising a space–channel adaptive attention mechanism for dynamic feature perception, a dual attention balance module for optimizing the network architecture, and NWD to redefine the target measurement standard. The framework enables precise segmentation of multiscale and multimorphological targets in dense agricultural scenes. Experimental results demonstrate that the proposed model achieves 92.5% mAP@0.5 and 87.8% recall in ripe blueberry detection, yielding an F1 score of 89.5%, and 92.3% mAP@0.5 with an F1 score of 89.2% in mask segmentation. Compared to state-of-the-art models, the proposed model exhibits a superior accuracy–recall balance in scenarios with fruit overlap and branch/leaf occlusion. While YOLOv9c-seg achieves slightly higher scores in some metrics, its recall and F1 scores are lower than those of the proposed model. These results demonstrate the advantages of the proposed method in complex agricultural environments.

Author Contributions

Conceptualization, R.L.; methodology, R.L.; software, R.L.; validation, R.L. and B.Y.; formal analysis, R.L. and R.Z.; investigation, R.L. and R.Z.; resources, R.L.; data curation, R.Z.; writing—original draft preparation, R.L.; writing—review and editing, B.Y.; visualization, R.L.; supervision, B.Y.; project administration, R.L.; funding acquisition, B.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Major Science and Technology Special Project of Yunnan Province (No. 202302AO370003) and the General Program of the Basic Research Program of Yunnan Province (No. 202301AT070173).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data supporting this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Seeram, N.P.; Adams, L.S.; Zhang, Y.; Lee, R.; Sand, D.; Scheuller, H.S.; Heber, D. Blackberry, black raspberry, blueberry, cranberry, red raspberry, and strawberry extracts inhibit growth and stimulate apoptosis of human cancer cells in vitro. J. Agric. Food Chem. 2006, 54, 9329–9339. [Google Scholar] [CrossRef]
  2. Mallik, A.; Hamilton, J. Harvest date and storage effect on fruit size, phenolic content and antioxidant capacity of wild blueberries of NW Ontario, Canada. J. Food Sci. Technol. 2017, 54, 1545–1554. [Google Scholar] [CrossRef]
  3. Wu, X.; Beecher, G.R.; Holden, J.M.; Haytowitz, D.B.; Gebhardt, S.E.; Prior, R.L. Lipophilic and hydrophilic antioxidant capacities of common foods in the United States. J. Agric. Food Chem. 2004, 52, 4026–4037. [Google Scholar] [CrossRef] [PubMed]
  4. Hankinson, S.E.; Stampfer, M.J.; Seddon, J.M.; Colditz, G.A.; Rosner, B.; Speizer, F.E.; Willett, W.C. Nutrient intake and cataract extraction in women: A prospective study. Br. Med. J. 1992, 305, 335–339. [Google Scholar] [CrossRef] [PubMed]
  5. Trejo-Pech, C.O.; Rodríguez-Magaña, A.; Briseño-Ramírez, H.; Ahumada, R. A Monte Carlo simulation case study on blueberries from Mexico. Int. Food Agribus. Manag. Rev. 2024, 27, 359–377. [Google Scholar] [CrossRef]
  6. Darnell, R.L. Blueberries. In Temperate Fruit Crops in Warm Climates; Springer: Dordrecht, The Netherlands, 2000; pp. 429–444. [Google Scholar]
  7. Yeh, D.A.; Kramer, J.; Calvin, L.; Weber, C. The Changing Landscape of US Strawberry and Blueberry Markets: Production, Trade, and Challenges from 2000 to 2020; United States Department of Agriculture: Washington, DC, USA, 2023.
  8. Yu, H.; Wei, J.; Yang, S.; Yang, X.; He, S.a. The development situation and prospect of blueberry industry in China. In Proceedings of the XII International Vaccinium Symposium 1357, Halifax, NS, Canada, 30 August–1 September 2021; pp. 325–328. [Google Scholar]
  9. Manida, M. The future of food and agriculture trends and challenges. Agric. Food E-Newsl. 2022, 4, 27–29. [Google Scholar]
  10. Brondino, L.; Briano, R.; Massaglia, S.; Giuggioli, N. Influence of harvest method on the quality and storage of highbush blueberry. J. Agric. Food Res. 2022, 10, 100415. [Google Scholar] [CrossRef]
  11. Fang, Y.; Liao, B.; Wang, X.; Fang, J.; Qi, J.; Wu, R.; Niu, J.; Liu, W. You only look at one sequence: Rethinking transformer in vision through object detection. Adv. Neural Inf. Process. Syst. 2021, 34, 26183–26197. [Google Scholar]
  12. Magalhães, S.A.; Castro, L.; Moreira, G.; Dos Santos, F.N.; Cunha, M.; Dias, J.; Moreira, A.P. Evaluating the single-shot multibox detector and YOLO deep learning models for the detection of tomatoes in a greenhouse. Sensors 2021, 21, 3569. [Google Scholar] [CrossRef]
  13. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
  14. Li, X.; Lv, C.; Wang, W.; Li, G.; Yang, L.; Yang, J. Generalized focal loss: Towards efficient representation learning for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3139–3153. [Google Scholar] [CrossRef]
  15. Wang, C.-Y.; Liao, H.-Y.M. YOLOv1 to YOLOv10: The fastest and most accurate real-time object detection systems. APSIPA Trans. Signal Inf. Process. 2024, 13, e29. [Google Scholar] [CrossRef]
  16. Hurtik, P.; Molek, V.; Hula, J.; Vajgl, M.; Vlasanek, P.; Nejezchleba, T. Poly-YOLO: Higher speed, more precise detection and instance segmentation for YOLOv3. Neural Comput. Appl. 2022, 34, 8275–8290. [Google Scholar] [CrossRef]
  17. Xiaofen, D.; James, O.; Russell, R. Comparison of fast gas chromatography-surface acoustic wave (FGC-SAW) detection and GC-MS for characterizing blueberry cultivars and maturity. J. Agric. Food Chem. 2012, 60, 5099–5106. [Google Scholar]
  18. MacEachern, C.B.; Esau, T.J.; Schumann, A.W.; Hennessy, P.J.; Zaman, Q.U. Detection of fruit maturity stage and yield estimation in wild blueberry using deep learning convolutional neural networks. Smart Agric. Technol. 2023, 3, 100099. [Google Scholar] [CrossRef]
  19. Yang, W.; Ma, X.; An, H. Blueberry Ripeness Detection Model Based on Enhanced Detail Feature and Content-Aware Reassembly. Agronomy 2023, 13, 1613. [Google Scholar] [CrossRef]
  20. Aguilera, C.A.; Flores, C.F.; Aguilera, C.; Navarrete, C. Comprehensive Analysis of Model Errors in Blueberry Detection and Maturity Classification: Identifying Limitations and Proposing Future Improvements in Agricultural Monitoring. Agriculture 2023, 14, 18. [Google Scholar] [CrossRef]
  21. Xiao, F.; Wang, H.; Xu, Y.; Shi, Z. A Lightweight Detection Method for Blueberry Fruit Maturity Based on an Improved YOLOv5 Algorithm. Agriculture 2023, 14, 36. [Google Scholar] [CrossRef]
  22. Wang, C.; Han, Q.; Li, J.; Li, C.; Zou, X. YOLO-BLBE: A Novel Model for Identifying Blueberry Fruits with Different Maturities Using the I-MSRCR Method. Agronomy 2024, 14, 658. [Google Scholar] [CrossRef]
  23. Gai, R.; Gao, J.; Xu, G. HPPEM: A High-Precision Blueberry Cluster Phenotype Extraction Model Based on Hybrid Task Cascade. Agronomy 2024, 14, 1178. [Google Scholar] [CrossRef]
  24. Liu, Y.; Zheng, H.; Zhang, Y.; Zhang, Q.; Chen, H.; Xu, X.; Wang, G. “Is this blueberry ripe?”: A blueberry ripeness detection algorithm for use on picking robots. Front. Plant Sci. 2023, 14, 1198650. [Google Scholar] [CrossRef]
  25. Feng, W.; Liu, M.; Sun, Y.; Wang, S.; Wang, J. The Use of a Blueberry Ripeness Detection Model in Dense Occlusion Scenarios Based on the Improved YOLOv9. Agronomy 2024, 14, 1860. [Google Scholar] [CrossRef]
  26. Zhai, X.; Zong, Z.; Xuan, K.; Zhang, R.; Shi, W.; Liu, H.; Han, Z.; Luan, T. Detection of maturity and counting of blueberry fruits based on attention mechanism and bi-directional feature pyramid network. J. Food Meas. Charact. 2024, 18, 6193–6208. [Google Scholar] [CrossRef]
  27. Tian, Y.; Qin, S.; Yan, Y.; Wang, J.; Jiang, F. Blueberry Maturity Detection in Complex Field Environments Based on Improved YOLOv8. Trans. Chin. Soc. Agric. Eng. 2024, 40, 153–162. [Google Scholar]
  28. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  29. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.-S. Tiny object detection in aerial images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Virtual Event/Milan, Italy, 10–15 January 2021; pp. 3791–3798. [Google Scholar]
  30. Huang, X.; Peng, D.; Qi, H.; Zhou, L.; Zhang, C. Detection and instance segmentation of grape clusters in orchard environments using an improved mask R-CNN model. Agriculture 2024, 14, 918. [Google Scholar] [CrossRef]
  31. Sun, T.; Zhang, W.; Gao, X.; Zhang, W.; Li, N.; Miao, Z. Efficient occlusion avoidance based on active deep sensing for harvesting robots. Comput. Electron. Agric. 2024, 225, 109360. [Google Scholar] [CrossRef]
  32. Li, C.-H.G.; Wu, J.-T. Deep learning approaches for improving robustness in real-time 3D-object positioning and manipulation in severe lighting conditions. Int. J. Adv. Manuf. Technol. 2023, 129, 3829–3847. [Google Scholar] [CrossRef]
  33. Wu, K.; Li, D. Chengjiang Blueberry Enters Version 4.0. Chengjiang Blueberry News, 20 May 2022; p. 001. [Google Scholar]
  34. Wu, K. Chengjiang Explores the Development of Modern Agriculture with an Industrialization Concept. Chengjiang Agriculture News, 26 October 2023; p. 002. [Google Scholar]
  35. Chen, J. Fuxian Lake Protection Stimulates the Development of Chengjiang’s Green Ecological Blueberry Industry around the Lake. Chengjiang Ecological Agriculture News, 11 May 2024; p. 001. [Google Scholar]
  36. Khanam, R.; Hussain, M. YOLO11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  37. Hu, B.; Pu, H.; Wei, Q.-Y.; Sun, D.-W. Recent Advances in Detecting and Regulating Ethylene Concentrations for Shelf-Life Extension and Maturity Control of Fruit: A Review. Trends Food Sci. Technol. 2019, 91, 66–82. [Google Scholar] [CrossRef]
  38. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  39. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  40. Kim, N.; Choi, I.-S.; Han, S.-S.; Jeong, C.-S. DA-Net: Dual Attention Network for Haze Removal in Remote Sensing Image. IEEE Access 2024, 12, 136297–136312. [Google Scholar] [CrossRef]
  41. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389. [Google Scholar]
  42. Li, Z.; Xu, R.; Li, C.; Munoz, P.; Takeda, F.; Leme, B. In-field blueberry fruit phenotyping with a MARS-PhenoBot and customized BerryNet. Comput. Electron. Agric. 2025, 232, 110057. [Google Scholar] [CrossRef]
  43. Gonzalez, S.; Arellano, C.; Tapia, J.E. Deepblueberry: Quantification of blueberries in the wild using instance segmentation. IEEE Access 2019, 7, 105776–105788. [Google Scholar] [CrossRef]
  44. Ni, X.; Li, C.; Jiang, H.; Takeda, F. Deep learning image segmentation and extraction of blueberry fruit traits associated with harvestability and yield. Hortic. Res. 2020, 7, 110. [Google Scholar] [CrossRef] [PubMed]
  45. Zhang, M.; Jiang, Y.; Li, C.; Yang, F. Fully convolutional networks for blueberry bruising and calyx segmentation using hyperspectral transmittance imaging. Biosyst. Eng. 2020, 192, 159–175. [Google Scholar] [CrossRef]
  46. Ni, X.; Takeda, F.; Jiang, H.; Yang, W.Q.; Saito, S.; Li, C. A deep learning-based web application for segmentation and quantification of blueberry internal bruising. Comput. Electron. Agric. 2022, 201, 107200. [Google Scholar] [CrossRef]
  47. Mullins, C.C.; Esau, T.J.; Zaman, Q.U.; Al-Mallahi, A.A.; Farooque, A.A. Exploiting 2D Neural Network Frameworks for 3D Segmentation Through Depth Map Analytics of Harvested Wild Blueberries (Vaccinium angustifolium Ait.). J. Imaging 2024, 10, 324. [Google Scholar] [CrossRef]
Figure 1. Blueberry plants and fruit in a greenhouse in Chengjiang City, Yunnan Province.
Figure 2. Example of detailed labeling for ripe blueberries.
Figure 3. Example of an image of a blueberry fruit in a complex scene.
Figure 4. Improved YOLO11n-seg network architecture diagram.
Figure 5. SCA module structure diagram. The dashed boxes represent the deformable spatial attention mechanism and the enhanced channel attention mechanism, respectively.
Figure 6. Dual Attention Block structure diagram. The dashed boxes represent the structural diagrams of CSAM and PAM, respectively.
Figure 7. Comparison of loss curves for different models.
Figure 8. Six types of scenario experiments with original label images.
Figure 9. Comparison of the visualization results of different models.
Figure 10. Comparison of mask segmentation performance for different models.
Table 1. Division of the blueberry dataset.

| Dataset | Total Number of Pictures | Training Set | Validation Set | Test Set |
|---|---|---|---|---|
| Original image | 1405 | 984 | 282 | 139 |
| Data enhancement | 3400 | 2380 | 680 | 340 |
Table 2. Comparison of target box indicators for different models.

| Model | Detection (Box) Precision (P)/% | Detection (Box) Recall (R)/% | Detection (Box) mAP@0.5/% | Detection (Box) F1/% | GFLOPs | Inference (ms) |
|---|---|---|---|---|---|---|
| YOLOv8n-seg | 90.1 | 86.2 | 92.3 | 88.1 | 10.7 | 3.4 |
| YOLOv8n-seg-p6 | 92.1 | 84.1 | 92.1 | 87.9 | 10.7 | 3.8 |
| YOLOv9c-seg | 92.5 | 83.7 | 90.4 | 87.8 | 138.0 | 12.5 |
| YOLO11n-seg | 89.4 | 86.3 | 92.0 | 87.8 | 10.2 | 1.9 |
| Our model | 91.2 | 87.8 | 92.5 | 89.5 | 10.5 | 2.2 |
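The F1 columns in Tables 2 and 3 are consistent with the usual harmonic mean of precision and recall. A minimal sketch for checking the reported values (the function name is ours, for illustration only):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, as reported in the F1 columns."""
    return 2 * precision * recall / (precision + recall)

# Box metrics of the improved model from Table 2 (percent values):
# P = 91.2, R = 87.8 reproduce the reported F1 of 89.5.
print(round(f1_score(91.2, 87.8), 1))  # -> 89.5
```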
Table 3. Comparison of the indicators for different model segmentation masks.

| Model | Segmentation (Mask) P/% | Segmentation (Mask) R/% | Segmentation (Mask) mAP@0.5/% | Segmentation (Mask) F1/% | GFLOPs (G) | Inference (ms) |
|---|---|---|---|---|---|---|
| YOLOv8n-seg | 90.1 | 86.2 | 92.2 | 88.1 | 10.7 | 3.4 |
| YOLOv8n-seg-p6 | 91.3 | 85.1 | 92.0 | 88.0 | 10.7 | 3.8 |
| YOLOv9c-seg | 92.8 | 83.9 | 90.5 | 88.1 | 138 | 12.5 |
| Deepblueberry | \ | \ | 75.9 | \ | \ | \ |
| BerryNet | 75.4 | 71.3 | 78.7 | 73.2 | 934.1 | \ |
| YOLO11n-seg | 89.8 | 86.6 | 92.2 | 88.1 | 10.2 | 1.9 |
| Our model | 90.6 | 87.8 | 92.3 | 89.2 | 10.5 | 2.2 |
Table 4. The influence of each module on the target detection frame index.

| NWD | SCA | DAB | Detection (Box) Precision (P)/% | Detection (Box) Recall (R)/% | Detection (Box) mAP@0.5/% | Detection (Box) F1/% |
|---|---|---|---|---|---|---|
| √ | × | × | 89.2 | 87.0 | 92.4 | 88.1 |
| × | √ | × | 90.8 | 84.6 | 91.4 | 87.5 |
| × | × | √ | 88.5 | 86.3 | 91.7 | 87.3 |
| √ | √ | × | 90.2 | 85.1 | 92.3 | 87.5 |
| √ | × | √ | 89.2 | 86.2 | 92.0 | 87.6 |
| × | √ | √ | 90.5 | 85.5 | 92.6 | 87.9 |
| √ | √ | √ | 91.2 | 87.8 | 92.5 | 89.5 |

“√” and “×” indicate the use of this module and the absence of this module, respectively.
Table 5. The influence of each module on the mask segmentation index.

| NWD | SCA | DAB | Segmentation (Mask) Precision (P)/% | Segmentation (Mask) Recall (R)/% | Segmentation (Mask) mAP@0.5/% | Segmentation (Mask) F1/% |
|---|---|---|---|---|---|---|
| √ | × | × | 88.9 | 86.2 | 92.1 | 87.5 |
| × | √ | × | 90.8 | 91.0 | 91.3 | 90.8 |
| × | × | √ | 88.7 | 86.5 | 91.9 | 87.5 |
| √ | √ | × | 90.2 | 85.1 | 91.9 | 87.5 |
| √ | × | √ | 89.2 | 86.2 | 91.8 | 87.6 |
| × | √ | √ | 89.0 | 87.2 | 92.2 | 88.2 |
| √ | √ | √ | 90.6 | 87.8 | 92.3 | 88.7 |

“√” and “×” indicate the use of this module and the absence of this module, respectively.
Table 6. Comparison of the performance of different IoU loss functions on experimental detection box metrics.

| Loss Function | Precision (P)/% | Recall (R)/% | mAP@0.5/% | F1/% | GFLOPs (G) | Inference (ms) |
|---|---|---|---|---|---|---|
| SD-IoU | 89.0 | 86.6 | 91.8 | 87.7 | 10.5 | 2.2 |
| Inner-IoU | 92.1 | 83.5 | 92.3 | 87.5 | 10.5 | 2.2 |
| Shape-IoU | 87.9 | 87.3 | 92.0 | 87.5 | 10.5 | 2.2 |
| Unified-IoU | 90.5 | 85.2 | 91.5 | 87.7 | 10.5 | 2.2 |
| NWD | 91.2 | 87.8 | 92.5 | 89.5 | 10.5 | 2.2 |
Table 7. Comparison of the performance of different IoU loss functions on experimental mask segmentation.

| Loss Function | Precision (P)/% | Recall (R)/% | mAP@0.5/% | F1/% | GFLOPs (G) | Inference (ms) |
|---|---|---|---|---|---|---|
| SD-IoU | 89.1 | 86.6 | 92.0 | 87.8 | 10.5 | 2.2 |
| Inner-IoU | 90.0 | 85.6 | 92.2 | 87.7 | 10.5 | 2.2 |
| Shape-IoU | 88.3 | 87.6 | 92.1 | 87.9 | 10.5 | 2.2 |
| Unified-IoU | 90.6 | 85.4 | 91.2 | 87.9 | 10.5 | 2.2 |
| NWD | 90.6 | 87.8 | 92.3 | 89.2 | 10.5 | 2.2 |
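NWD's advantage over the IoU variants in Tables 6 and 7 comes from modeling each box as a 2D Gaussian and comparing Gaussians with a normalized Wasserstein distance, which stays smooth even when small boxes do not overlap. A minimal sketch of that similarity, following the normalized Gaussian Wasserstein distance formulation (the constant C below is an illustrative choice, not necessarily the setting used in the paper):

```python
import math

def nwd(box_a, box_b, c: float = 12.8):
    """Normalized Wasserstein Distance between two axis-aligned boxes.

    Each box is (cx, cy, w, h). A box is modeled as a 2D Gaussian with
    mean (cx, cy) and covariance diag((w/2)^2, (h/2)^2); the squared
    2-Wasserstein distance between two such Gaussians has the closed
    form below. C is a dataset-dependent normalization constant.
    """
    w2_sq = ((box_a[0] - box_b[0]) ** 2 + (box_a[1] - box_b[1]) ** 2
             + (box_a[2] / 2 - box_b[2] / 2) ** 2
             + (box_a[3] / 2 - box_b[3] / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

# Identical boxes give NWD = 1; a shifted tiny box still yields a smooth,
# non-zero similarity, unlike IoU, which drops to 0 once overlap vanishes.
print(nwd((50, 50, 8, 8), (50, 50, 8, 8)))         # -> 1.0
print(round(nwd((50, 50, 8, 8), (60, 50, 8, 8)), 3))
```

This smoothness is what mitigates the matching bias for dense small targets: two nearby small fruits receive a graded similarity rather than an all-or-nothing overlap score.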
Table 8. Comparison experiment of different attention mechanisms.

| Module | Detection (Box) Precision (P)/% | Detection (Box) Recall (R)/% | Detection (Box) mAP@0.5/% | Detection (Box) F1/% |
|---|---|---|---|---|
| CBAM | 90.0 | 84.9 | 92.2 | 87.3 |
| SE | 91.4 | 85.6 | 92.8 | 87.9 |
| SCA | 90.8 | 84.6 | 91.4 | 87.5 |
| DAB | 88.5 | 86.3 | 91.7 | 87.3 |
| SCA + DAB | 89.0 | 87.2 | 92.2 | 88.2 |
Table 9. Comparison of performance metrics for different models in six different scenarios. The numbers shown in black font in the table correspond to the actual number of mature fruit labels.

| Scene | Metric | YOLOv8-Seg | YOLOv8-Seg-p6 | YOLOv9c-Seg | YOLO11-Seg | Our |
|---|---|---|---|---|---|---|
| (a) overlapping of fruits | Total number of targets | 11 | 10 | 9 | 10 | 10 |
| | Total segmented area (%) | 10.19 | 9.48 | 8.68 | 10.44 | 11.36 |
| | Time taken (s) | 0.09 | 0.081 | 0.294 | 0.66 | 0.068 |
| | Confidence score (%) | 94.62 | 86.06 | 88.89 | 87.53 | 91.45 |
| (b) shaded by branches and leaves | Total number of targets | 2 | 1 | 3 | 3 | 3 |
| | Total segmented area (%) | 1.63 | 1.19 | 2.51 | 2.52 | 2.52 |
| | Time taken (s) | 0.04 | 0.047 | 0.182 | 0.040 | 0.042 |
| | Confidence score (%) | 85.82 | 75.91 | 79.51 | 82.07 | 88.6 |
| (c) uneven illumination | Total number of targets | 7 | 4 | 4 | 4 | 5 |
| | Total segmented area (%) | 17.68 | 11.71 | 11.10 | 11.46 | 11.46 |
| | Time taken (s) | 0.038 | 0.042 | 0.03 | 0.089 | 0.089 |
| | Confidence score (%) | 84.38 | 82.24 | 81.14 | 81.14 | 93.7 |
| (d) background interference | Total number of targets | 9 | 8 | 8 | 8 | 7 |
| | Total segmented area (%) | 6.42 | 6.10 | 6.23 | 6.31 | 6.30 |
| | Time taken (s) | 0.038 | 0.043 | 0.232 | 0.038 | 0.036 |
| | Confidence score (%) | \ | 86.64 | 84.87 | 79.52 | 91.61 |
| (e) intensive small target | Total number of targets | 19 | 20 | 21 | 19 | 20 |
| | Total segmented area (%) | 5.81 | 5.99 | 6.34 | 5.84 | 6.13 |
| | Time taken (s) | 0.045 | 0.048 | 0.023 | 0.44 | 0.040 |
| | Confidence score (%) | 90.40 | 89.29 | 86.59 | 86.04 | 86.17 |
| (f) scale change | Total number of targets | 26 | 27 | 26 | 26 | 27 |
| | Total segmented area (%) | 9.39 | 9.63 | 9.32 | 9.45 | 9.77 |
| | Time taken (s) | 0.05 | 0.051 | 0.244 | 0.05 | 0.046 |
| | Confidence score (%) | 87.82 | 87.07 | 86.17 | 82.96 | 92.90 |
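The “Total segmented area (%)” rows in Table 9 can be read as the fraction of image pixels covered by the union of the predicted instance masks. A minimal NumPy sketch under that assumption (the function name is ours, for illustration only):

```python
import numpy as np

def total_segmented_area_pct(masks: list) -> float:
    """Percent of image pixels covered by the union of instance masks.

    masks: boolean arrays of identical (H, W) shape, one per detected fruit.
    Using the union avoids double-counting pixels where masks overlap.
    """
    union = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        union |= m
    return 100.0 * union.sum() / union.size

# Two overlapping 10x10 masks on a 100x100 image.
h = w = 100
m1 = np.zeros((h, w), dtype=bool); m1[0:10, 0:10] = True
m2 = np.zeros((h, w), dtype=bool); m2[5:15, 0:10] = True
print(total_segmented_area_pct([m1, m2]))  # union is 150 px -> 1.5
```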

Share and Cite

MDPI and ACS Style

Luo, R.; Zhao, R.; Yi, B. Enhanced YOLO11n-Seg with Attention Mechanism and Geometric Metric Optimization for Instance Segmentation of Ripe Blueberries in Complex Greenhouse Environments. Agriculture 2025, 15, 1697. https://doi.org/10.3390/agriculture15151697
