Article

GS-BiFPN-YOLO: A Lightweight and Efficient Method for Segmenting Cotton Leaves in the Field

1 College of Information Engineering, Tarim University, Alaer 843300, China
2 Key Laboratory of Tarim Oasis Agriculture, Tarim University, Ministry of Education, Alaer 843300, China
3 Key Laboratory of Modern Agricultural Engineering, Tarim University, Alaer 843300, China
* Author to whom correspondence should be addressed.
Agriculture 2026, 16(1), 102; https://doi.org/10.3390/agriculture16010102
Submission received: 24 November 2025 / Revised: 24 December 2025 / Accepted: 29 December 2025 / Published: 31 December 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

Instance segmentation of cotton leaves in complex field environments presents challenges including low accuracy, high computational complexity, and costly data annotation. This paper presents GS-BiFPN-YOLO, a lightweight instance segmentation method that integrates the Segment Anything Model (SAM) for semi-automatic labeling and enhances YOLOv11n-seg with GSConv, BiFPN, and CBAM modules to reduce annotation cost and improve accuracy. To streamline parameters, the YOLOv11-seg architecture incorporates the lightweight GSConv module, utilizing group convolution and channel shuffle. Integration of a Bidirectional Feature Pyramid Network (BiFPN) enhances multi-scale feature fusion, while a Convolutional Block Attention Module (CBAM) boosts discriminative focus on leaf regions through its combined channel and spatial attention mechanisms. Experimental results on a self-built cotton leaf dataset reveal that GS-BiFPN-YOLO achieves a bounding box and mask mAP@0.5 of 0.988 and a recall of 0.972, maintaining a computational cost of 9.0 GFLOPs and achieving an inference speed of 322 FPS. In comparison to other lightweight models (YOLOv8n-seg to YOLOv12n-seg), the proposed approach achieves superior segmentation accuracy while preserving high real-time performance. This research offers a practical solution for precise and efficient cotton leaf instance segmentation, thereby facilitating the advancement of intelligent monitoring systems for cotton production.

1. Introduction

Cotton is a vital global economic crop, playing a pivotal role in the textile industry and socioeconomic development. In Xinjiang, China’s primary cotton production region, the planting area reached 2447.9 thousand hectares in 2024, yielding 5.686 million tons [1] and accounting for 92.2% [2] of the national total. To enable efficient and precise management, smart agricultural technologies, particularly agricultural robots, are becoming indispensable. Within this context, fine-grained monitoring at the leaf level has emerged as an essential research direction for advancing intelligent cotton production.
Accurate instance segmentation of individual leaves is a critical prerequisite for enabling subsequent robotic operations. This technique provides the visual perception foundation for key applications such as leaf phenotypic parameter extraction, growth status analysis, and precise pest and disease identification [3]. However, achieving robust cotton leaf segmentation in complex field environments remains challenging due to three major issues. First, the irregular morphology of cotton leaves, combined with field complexities such as uneven illumination and inter-leaf occlusion, complicates the precise delineation of leaf boundaries. This leads to high annotation costs and results that are prone to subjective bias [4,5,6]. Second, in high-density planting scenarios where leaves are heavily clustered, models must possess strong discriminative capabilities, which often results in large parameter sizes and high computational complexity, hindering deployment on resource-limited devices [7]. Lastly, a fundamental trade-off exists between segmentation accuracy and real-time performance. The computational burden of performing pixel-level segmentation on numerous irregular and overlapping leaves makes it difficult for conventional methods to meet stringent speed requirements while maintaining high precision [8].
To mitigate the conflict between model complexity and deployment efficiency, model lightweighting has become a key research focus. Techniques such as model pruning, efficient convolution modules, and quantization have been employed to reduce computational overhead. For instance, Liu et al. [9] proposed a channel pruning method for YOLOv5, compressing the model to ~1.7 M parameters and achieving over 10 FPS on Raspberry Pi. Similarly, Xia et al. [10] used Ghost modules to build a lightweight network for pear inflorescence detection, reducing parameters while maintaining accuracy. However, these methods often sacrifice feature extraction capacity, leading to significant accuracy drops when segmenting densely overlapping cotton leaves.
Enhancements in feature fusion mechanisms have also improved model perception of multi-scale and occluded targets. Gao et al. [11] applied a Feature Pyramid Network (FPN) to bolster small object detection by integrating multi-level features. Jin et al. [12] integrated a weighted Bidirectional Feature Pyramid Network (BiFPN) into a YOLO-based model, achieving a high mean average precision (mAP) of 92.7% for detecting tomato seedling diseases while maintaining a lightweight model of only 12.8 MB. Ma et al. [13] employed a lightweight YOLOv8 architecture with a BiFPN for efficient feature fusion, achieving an mAP of 94.7% in wheat grain detection while significantly reducing the model’s computational footprint. While these methods enhance robustness to scale variation, their capability for fine-grained segmentation remains inadequate for the blurred edges caused by the irregular morphology and severe occlusion of cotton leaves.
Advances in architectural design, such as CNN-Transformer hybrids, offer new pathways for complex scene understanding by combining local feature extraction with global contextual modeling. For instance, Xu et al. [14] applied a Swin Transformer-based encoder for paddy rice identification using Sentinel-2 imagery, demonstrating superior accuracy in rice mapping and enhanced performance in delineating field boundaries compared to other deep learning models such as U-Net and DeepLab v3. Bah et al. [15] combined FPN with a lightweight Transformer for crop row detection, achieving sub-pixel localization error in UAV imagery. Nevertheless, such models typically incur substantial computational costs, challenging their deployment for real-time inference on low-power edge devices.
In summary, while existing methods are mature for perceiving relatively regular-structured targets like wheat heads or tomato fruits, efficient instance segmentation of cotton leaves remains a distinct and under-explored challenge. Unlike the elongated leaves of monocots, cotton—a dicot—has irregular palmate-lobed leaves with long, twisted petioles, creating a unique fine-grained segmentation scenario characterized by extreme overlap and occlusion. Although a few studies have adapted architectures like U-Net for cotton disease spot identification (achieving an IoU of 0.89) [16], these works focus on semantic segmentation of diseased areas or leaf counting, and lack a systematic solution for whole-leaf instance segmentation that simultaneously balances high accuracy, efficiency, and lightweight design.
Furthermore, a critical aspect often overlooked in agricultural computer vision is the generalization capability of models across different crop species. While model performance on the primary target crop is essential, practical agricultural systems often involve multiple crops, and the ability of a model to maintain reasonable performance on unseen crop types is crucial for real-world deployment. This generalization challenge is compounded by dataset limitations commonly encountered in agricultural research, including data from single geographic regions, fixed camera types, and narrow temporal windows. To address these limitations, we systematically evaluate the cross-crop transferability of our method. Specifically, we assess its zero-shot generalization capability on soybean, another economically important dicot crop, thereby providing insights into the broader applicability of agricultural vision systems.
Therefore, an automatic segmentation method that can reduce annotation costs, handle irregular shapes and dense overlaps, and balance accuracy with speed is urgently needed. To this end, this study proposes a lightweight, efficient, real-time method named Group Shuffle-Bidirectional Feature Pyramid Network-YOLO (GS-BiFPN-YOLO). The main contributions are threefold:
  • For reducing annotation costs in complex field environments, a semi-automatic workflow was developed utilizing the Segment Anything Model (SAM). SAM’s robust zero-shot generalization capability enabled the generation of high-quality preliminary leaf contours, substantially diminishing the need for manual annotation.
  • Addressing the issues of excessive model parameters and high computational complexity, the GSConv (Group Shuffle Convolution) module was integrated into the YOLOv11-seg backbone to replace standard convolutions. This design leverages grouped convolution and channel shuffling to compress the model size while preserving its representational power, thereby improving its suitability for lightweight devices.
  • To counteract performance degradation from leaf overlap and multi-scale variation, a hybrid architecture incorporating a Bidirectional Feature Pyramid Network (BiFPN) and a Convolutional Block Attention Module (CBAM) was devised. The BiFPN enhances multi-scale feature fusion, while the CBAM employs sequential channel and spatial attention mechanisms to accentuate leaf regions and suppress background interference, thus effectively balancing accuracy and inference speed in complex scenarios.
Validation results on the self-constructed cotton leaf dataset demonstrated that GS-BiFPN-YOLO achieved an excellent balance between segmentation accuracy and inference speed while significantly reducing annotation costs. Additionally, we conducted cross-crop evaluation on an independent soybean-cotton mixed dataset to assess the model’s generalization capability. The proposed method generates precise instance masks for cotton leaves in complex scenarios and shows promising zero-shot performance on soybean leaves, providing reliable technical support for automated phenotypic analysis in precision agriculture and insights into the development of general-purpose crop monitoring systems.
The remainder of this paper is organized as follows: Section 2 introduces the dataset construction and model methodology; Section 3 presents the experimental results and analysis, including cross-crop generalization assessment; and Section 4 discusses the advantages, limitations, and future research directions of the model.

2. Materials and Methods

2.1. Data Collection

All data used in this study were collected from cotton fields at the Tenth Regiment of Alar, Xinjiang, during the mid-to-late growth stages (early June to mid-August 2024). This period is characterized by a fully developed canopy with high structural complexity, including dense foliage, significant leaf overlap, and varying lighting conditions—providing a robust test scenario for segmentation algorithms. Images were captured using the 12-megapixel primary camera of an iPhone 14 smartphone (Apple Inc., Cupertino, CA, USA) at a maximum resolution of 4000 × 3000 pixels. To ensure consistent image quality, all data were collected between 9:00 a.m. and 12:00 p.m. under natural light. A top-down perspective was employed for all images to optimize the visibility of leaf morphology and structure for the segmentation task. To ensure sample diversity and enhance model robustness under complex field conditions, the image data encompassed cotton leaves captured across varying spatial arrangements, occlusion scenarios, and diverse weather conditions (Figure 1). Consequently, this study assembled a comprehensive dataset comprising 1000 high-quality raw images. All images were characterized by high-resolution properties, serving as a reliable data foundation for cotton phenotyping analysis under authentic field conditions.

2.2. Dataset Construction

2.2.1. Data Annotation

This study adopted the Segment Anything 2.1 (Large) model, integrated within the X-Anylabeling platform, for the semi-automated annotation of cotton leaf images (Figure 2). Built upon the Segment Anything Model (SAM) architecture [17], this large variant was selected for its powerful generalization capability, which facilitated the accurate initial localization of cotton leaf regions within complex field imagery. The annotation workflow comprised two primary steps: first, assigning the unified category “cotton leaf” to all target leaves; second, generating high-quality, instance-level segmentation masks through point-click interactions. The graphical interface of the X-Anylabeling platform (Figure 2) provides a standard toolbar for these operations, including icons for file navigation, drawing annotations, and editing labels. A critical focus during the subsequent manual verification and refinement stage was the precise delineation of leaf boundaries in areas of severe overlap and occlusion. All automatically generated masks were meticulously inspected, and instances with blurred or inaccurate contours were re-annotated to ensure each leaf—even when partially hidden—was accurately defined.
The resulting dataset provides pixel-perfect, instance-level masks that serve as the essential supervisory signal for training our model. By learning from these high-fidelity annotations, the model is fundamentally enabled to develop the capability to extract discriminative features for separating adjacent and overlapping leaves. Specifically, it learns to recognize subtle visual cues such as boundary continuity, intensity gradients at occlusion edges, and the topological structure of leaf lobes, which are critical for addressing the core challenge of feature extraction under occlusion as outlined in the introduction. Therefore, this semi-automated, human-verified approach not only significantly improved annotation efficiency but also ensured the construction of a high-precision dataset explicitly tailored for robust instance segmentation in dense and complex cotton canopies.
After annotation, the platform supported the direct export of segmentation labels in the YOLO format (saved as .txt files), which facilitated the subsequent training of YOLO-series models. The annotation interface of Segment Anything 2.1 (Large) in X-Anylabeling is shown in Figure 2. To clearly illustrate the annotation quality and instance-level details, we further visualized the generated masks by assigning a distinct random color to each leaf instance. Representative results of this visualization are presented in Figure 3, which compares a raw field image Figure 3a with its corresponding annotation map after background removal Figure 3b. These high-quality, instance-wise annotations provide precise supervision for training the segmentation model, while the model’s inherent ability to handle challenges such as leaf overlap stems from its architectural design—described in the following section.
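As a concrete illustration of the exported label format, the following minimal Python sketch converts a single binary instance mask into one YOLO-segmentation label line (class index followed by normalized polygon coordinates). The export itself was performed by X-Anylabeling; this function is only an assumed re-implementation for readers who wish to generate such labels from their own masks.

```python
import cv2
import numpy as np

def mask_to_yolo_seg_line(mask: np.ndarray, class_id: int = 0) -> str:
    """Convert a binary instance mask (H, W) into one YOLO-seg label line:
    'class_id x1 y1 x2 y2 ...' with polygon coordinates normalized to [0, 1]."""
    h, w = mask.shape
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Assumes one connected leaf region; keep the largest contour as its outline.
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(float)
    contour[:, 0] /= w   # normalize x coordinates
    contour[:, 1] /= h   # normalize y coordinates
    coords = " ".join(f"{v:.6f}" for v in contour.flatten())
    return f"{class_id} {coords}"
```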

2.2.2. Annotation Efficiency Comparison: SAM-Assisted vs. Manual

To systematically evaluate the performance of the proposed SAM-assisted annotation pipeline, a controlled experiment was conducted. An experienced annotator performed annotations on the same set of 12 cotton leaf images using two distinct workflows within the identical X-Anylabeling platform. The first workflow employed traditional manual annotation using the polygon (“linestrip”) tool for point-by-point contour tracing. The second utilized the SAM-assisted workflow, where mask generation was initiated via point prompts and subsequently refined. The image set was carefully selected to represent four prevalent challenging field conditions (as illustrated in Figure 1): densely occluded leaves, overlapping leaves within the canopy, isolated and unobstructed leaves, and leaves with dust deposition, with three images per condition to ensure comprehensive assessment. The time required for each annotation task was recorded to the second for quantitative efficiency analysis. All output masks were preserved for subsequent qualitative visual comparison.
(1) Quantitative Analysis of Annotation Efficiency
The quantitative results of the efficiency comparison are summarized in Table 1. The data reveals a substantial efficiency gain provided by the SAM-assisted workflow. Overall, the SAM-assisted method required an average of only 176.53 s per image, representing a 73.26% reduction in time cost compared to the average of 660.07 s for fully manual annotation. This efficiency advantage was consistently observed across all four challenging scenarios. Notably, even for the most complex “densely occluded” condition, the average time saving was 65.55%. The most dramatic improvement was seen for the “isolated leaf” condition, where the SAM-assisted process achieved an average time saving of 85.45%, highlighting its potential for rapidly processing large volumes of clear targets.
(2) Visual Comparison of Annotation Quality
A visual comparison of all 12 annotation pairs was conducted to assess the qualitative aspects of the outputs. This comparison is presented in two complementary figures: Figure 4 displays the annotation results from the purely manual workflow, while Figure 5 presents the corresponding results from the SAM-assisted workflow. Both figures are organized into four columns, each representing one of the core challenge scenarios: (a) densely occluded leaf, (b) overlapping leaves within the canopy, (c) isolated and unobstructed leaf from a single plant, and (d) leaf with dust deposition.
Visual inspection reveals several key observations regarding annotation quality. First, the SAM-assisted workflow successfully generated complete instance masks for all leaf targets, confirming its fundamental reliability. More importantly, a direct comparison between Figure 4 and Figure 5 demonstrates that the boundaries produced by SAM exhibit superior smoothness and continuity. The SAM-generated contours adhere more precisely to the natural curvature of the leaves, effectively delineating the outline even in complex areas of overlap or occlusion. In contrast, the manually drawn contours in Figure 4, likely due to the inherent limitations of point-wise manual tracing, occasionally show subtle irregularities. Furthermore, for leaves with dust deposition, the SAM-assisted annotations demonstrate robust performance by consistently capturing the primary leaf structure while largely ignoring scattered dust spots as visual noise. These observations underscore the advantageous consistency and geometric accuracy of the SAM-assisted workflow under challenging field conditions.

2.2.3. Data Augmentation

To enhance the model’s generalization capability and mitigate overfitting caused by insufficient training data, this study used four data augmentation techniques to expand the original set of images and their corresponding label files simultaneously. The specific methods included color and growth state adjustment, geometric transformations, flip transformation combined with noise injection, and color space transformation. This process resulted in a final dataset comprising 5000 images and their associated label files (including the originals). These augmentation strategies effectively simulated real-field challenges such as variations in illumination, shooting perspectives, and equipment noise, thereby significantly improving the model’s adaptability and robustness across diverse agricultural scenarios (Figure 6).
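The paper does not specify the augmentation implementation; the sketch below is a minimal, assumed pipeline (using the albumentations library with illustrative parameter values) that mirrors the four augmentation families described above while keeping the image and its instance masks synchronized.

```python
import numpy as np
import albumentations as A

# Illustrative pipeline covering the four augmentation families described above;
# all parameter values are assumptions, not the settings used in the study.
augment = A.Compose([
    A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=20, val_shift_limit=20, p=0.5),  # color / growth-state adjustment
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.10, rotate_limit=15, p=0.5),           # geometric transformation
    A.HorizontalFlip(p=0.5),                                                                   # flip transformation ...
    A.GaussNoise(p=0.3),                                                                       # ... combined with noise injection
    A.RGBShift(r_shift_limit=15, g_shift_limit=15, b_shift_limit=15, p=0.3),                   # color-space transformation
])

image = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)   # stand-in field image
masks = [np.zeros((640, 640), dtype=np.uint8)]                      # per-instance binary masks
out = augment(image=image, masks=masks)
aug_image, aug_masks = out["image"], out["masks"]
```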

2.2.4. Data Splitting

Following data augmentation, the dataset was randomly partitioned into training, validation, and test sets at an 8:1:1 ratio. Each sample was defined as an image-label pair during the split to maintain consistency. Consequently, the training, validation, and test sets comprised 4000, 500, and 500 samples, respectively. This standardized division facilitates effective model training and provides a reliable basis for performance evaluation.
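A minimal sketch of such an 8:1:1 random split over image-label pairs is shown below; the directory layout, file extensions, and random seed are assumptions for illustration.

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0, ratios=(0.8, 0.1, 0.1)):
    """Randomly partition image-label pairs into train/val/test at an 8:1:1 ratio."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * ratios[0])
    n_val = int(len(images) * ratios[1])
    splits = {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],
    }
    # Each image keeps its paired YOLO label file (same stem, .txt extension).
    return {k: [(img, img.with_suffix(".txt")) for img in v] for k, v in splits.items()}
```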

2.3. Cross-Crop Generalization Evaluation Dataset

To comprehensively assess the model’s generalization capability beyond the primary training target, we employed an additional independent dataset for cross-crop evaluation. This soybean-cotton mixed leaf dataset, developed by the Department of Mechanical Engineering, São Carlos School of Engineering, University of São Paulo, contains 640 high-resolution images with precisely annotated bounding boxes and segmentation masks for 7221 soybean leaves and 5190 cotton leaves [18]. The dataset encompasses diverse growth stages, lighting conditions, and field complexities, making it suitable for evaluating model performance in realistic agricultural scenarios.
For our cross-crop generalization analysis, we selected three representative subsets from this dataset, each comprising 10 images:
  • Pure cotton images: Images containing only cotton leaves, allowing comparison with performance on our primary dataset.
  • Pure soybean images: Images containing only soybean leaves, enabling zero-shot evaluation on a different dicot species.
  • Mixed crop images: Images containing both cotton and soybean leaves, testing discriminative capability in intercropping scenarios.
This selection strategy ensures comprehensive evaluation across different crop types and mixed scenarios. The statistical characteristics of each subset are summarized in Table 2. Notably, soybean leaves exhibit different morphological characteristics (trifoliate structure) compared to cotton leaves (palmate-lobed structure), presenting a challenging test for cross-crop generalization.
Figure 7 presents a cross-crop evaluation dataset, organized to illustrate phenotypic variation across growth stages for three distinct categories. The images are arranged in three columns: column (a) shows pure cotton leaves at different developmental stages, demonstrating variations in occlusion and canopy architecture; column (b) presents pure soybean leaves over time, highlighting changes in leaf size and arrangement morphology; and column (c) displays mixed cotton-soybean leaves across various periods, showcasing different intercropping proportions and canopy complexities.
The cross-crop evaluation serves two primary purposes: (1) to assess the model’s zero-shot generalization capability to an unseen crop species (soybean), and (2) to evaluate performance consistency when the model encounters mixed crop scenarios. This analysis directly addresses concerns regarding dataset limitations (single crop type, single geographic region) by demonstrating the model’s ability to handle variations beyond its training distribution. Furthermore, it provides insights into the transferability of learned features across dicot species with different leaf morphologies, contributing to the development of more generalizable agricultural vision systems.

2.4. GS-BiFPN-YOLO Cotton Leaf Segmentation Model

2.4.1. Baseline Model: YOLO11n-Seg

This study used YOLOv11n-seg [19] as the baseline model. As the most lightweight variant in the YOLOv11-seg series, it maintained competitive accuracy while delivering exceptional inference efficiency, which met the stringent real-time requirements for edge deployment in field environments. Building upon this efficient baseline, we introduced architectural optimizations to enhance the segmentation accuracy and robustness for cotton leaves under complex field conditions.
As shown in Figure 8, the YOLO11n-seg architecture comprises three components: the backbone, neck, and segmentation head. The backbone extracts multi-scale feature layers and feeds them to the neck of the model. The neck uses a feature pyramid structure to fuse and enhance features across layers and then transmits the processed features to the corresponding scale segmentation heads. Finally, the algorithm outputs optimal detection boxes and segmentation masks after non-maximum suppression (NMS).

2.4.2. GS-BiFPN-YOLO Model Construction

Based on the YOLOv11n-seg architecture, this study presents the GS-BiFPN-YOLO model to address the challenge of cotton leaf image segmentation in complex field environments. Building on the efficient design of the original network, the model incorporates three core improvement modules to establish a segmentation architecture that balances high accuracy with computational efficiency. The overall model structure, illustrated in Figure 9, achieves synergistic optimization of the feature extraction, fusion, and selection mechanisms through a systematic modular design.
In the architectural design, GS-BiFPN-YOLO reconstructs the feature extraction layers in the backbone network using GSConv modules. By integrating grouped convolution with channel shuffle operations, these modules reduce the computational complexity while maintaining strong feature representation capabilities. Furthermore, the model adopts a Bidirectional Feature Pyramid Network (BiFPN) as its neck structure. Leveraging its bidirectional cross-scale connections and learnable weighting mechanism, the BiFPN enables the efficient fusion of multilevel features. Finally, a Convolutional Block Attention Module (CBAM) was embedded after the critical feature layers. Through its sequential channel and spatial attention mechanisms, the CBAM guides the model to focus on the key leaf regions, effectively suppressing interference from complex backgrounds.
Through the coordinated action of these three core modules, GS-BiFPN-YOLO achieves improved segmentation accuracy and robustness in complex field environments while maintaining a high inference speed. Specifically, GSConv contributes to model lightweighting, BiFPN enhances multi-scale feature fusion, and CBAM improves the selection of critical features. This systematic improvement scheme provides a more reliable solution for cotton leaf segmentation tasks, and its high-performance and lightweight design approach offers valuable insights for other complex agricultural vision applications.

2.4.3. GSConv Module

To enhance the model’s adaptability to the complex variations in cotton leaf morphology while maintaining computational efficiency suitable for lightweight devices, we integrated the GSConv module [20] to replace standard convolutions in YOLOv11n-seg. Standard convolution suffers from a quadratic increase in parameters with the kernel size $K$, multiplied by the input and output channel counts ($C_{in}$, $C_{out}$), as defined in Equation (1), which hinders scalability.
$\mathrm{Params}_{\mathrm{std}} = K^2 \times C_{in} \times C_{out}$
GSConv addresses the computational inefficiency of standard convolutions by restructuring the pipeline. As illustrated in Figure 10, its workflow is as follows.
The input feature map $X \in \mathbb{R}^{C_{in} \times H \times W}$ is first processed by a group convolution with kernel weights $W_g$. This operation splits the input channels into groups to produce an intermediate feature map $Y_g$ (Equation (2)).
$Y_g = \mathrm{GroupConv}(X, W_g)$
A pointwise convolution (1 × 1) with weights $W_p$ then integrates features across the groups, yielding $Y_p$ (Equation (3)).
$Y_p = \mathrm{Conv}_{1 \times 1}(Y_g, W_p)$
Finally, a channel shuffle operation $S(\cdot)$ is applied to $Y_p$ to ensure thorough inter-group information fusion, producing the final output feature $Y_{out}$ (Equation (4)).
$Y_{out} = S(Y_p)$
This design, which leverages group convolution, pointwise convolution, and channel shuffling, shifts parameter growth from quadratic to approximately linear. This reduction in computational complexity allows the model to efficiently capture multi-scale features of cotton leaves, such as detailed edges and vein textures, without compromising the feature representation capability required for high accuracy, thereby improving deployability on mobile platforms.
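The following PyTorch sketch implements the GSConv pipeline as described by Equations (2)-(4): group convolution, 1 × 1 pointwise convolution, and channel shuffle. It is an illustrative reading of these equations rather than the original GSConv implementation, and the group count and kernel size are assumed values.

```python
import torch
import torch.nn as nn

class GSConvSketch(nn.Module):
    """Minimal sketch of the GSConv pipeline in Equations (2)-(4):
    group convolution -> 1x1 pointwise convolution -> channel shuffle.
    The group count must divide both channel dimensions."""

    def __init__(self, c_in: int, c_out: int, k: int = 3, groups: int = 4):
        super().__init__()
        self.group_conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, groups=groups)
        self.pointwise = nn.Conv2d(c_out, c_out, 1)
        self.groups = groups

    def channel_shuffle(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Split channels into groups, swap the group and per-group axes, flatten back.
        return x.view(b, self.groups, c // self.groups, h, w).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_g = self.group_conv(x)           # Equation (2)
        y_p = self.pointwise(y_g)          # Equation (3)
        return self.channel_shuffle(y_p)   # Equation (4)

# Example: a 64-channel feature map mapped to 128 channels at the same resolution.
out = GSConvSketch(64, 128)(torch.randn(1, 64, 80, 80))
```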

2.4.4. BiFPN Module

To enhance multi-scale feature representation for cotton leaves of varying sizes, a Bidirectional Feature Pyramid Network (BiFPN) [21] was integrated into the neck network. BiFPN establishes bidirectional cross-scale connections to fuse features from different backbone layers effectively.
BiFPN fuses the multi-scale feature set $\{P_2, P_3, P_4, P_5\}$, where $P_i \in \mathbb{R}^{C_i \times H_i \times W_i}$ denotes the feature map at level $i$. The fusion process consists of two complementary pathways.
In the top-down pathway, higher-level features (e.g., $P_5$) are upsampled and fused with adjacent, shallower features (e.g., $P_4$) via element-wise addition to propagate rich semantic information downward. The top-down fused feature $T_i$ at layer $i$ is given by Equation (5):
$T_i = \mathrm{UpSample}(T_{i+1}) + P_i$
In the bottom-up pathway, high-resolution features are downsampled and fused with deeper features to enhance spatial details, producing the bottom-up feature $B_i$:
$B_i = \mathrm{DownSample}(B_{i-1}) + P_i$
BiFPN introduces learnable weights $\alpha_i$ and $\beta_i$ to adaptively balance the contributions from both pathways. The final output feature $F_i$ at each level is a weighted combination of the two paths:
$F_i = \alpha_i \times T_i + \beta_i \times B_i$
where $\alpha_i + \beta_i = 1$ and $\alpha_i, \beta_i \geq 0$. As illustrated in Figure 11, this bidirectional architecture enables deep semantic cues for leaf regions to be refined with precise spatial details (e.g., edge textures) from shallow features. This coordinated fusion improves the model’s robustness to scale variation and occlusion with minimal parameter overhead, making it suitable for real-time inference.
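For clarity, the sketch below expresses the fusion at a single pyramid level in PyTorch: upsampled top-down fusion (Equation (5)), downsampled bottom-up fusion, and the learnable weighted combination. Equal channel counts across levels and the pooling-based downsampling are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiFPNFusionSketch(nn.Module):
    """Weighted bidirectional fusion at one pyramid level, following the
    top-down, bottom-up, and weighted-combination steps described above."""

    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))  # learnable fusion weights (alpha, beta)

    def forward(self, p_i, t_above, b_below):
        # Top-down: upsample the deeper fused feature and add it to P_i (Equation (5)).
        t_i = F.interpolate(t_above, size=p_i.shape[-2:], mode="nearest") + p_i
        # Bottom-up: downsample the shallower fused feature and add it to P_i.
        b_i = F.adaptive_max_pool2d(b_below, p_i.shape[-2:]) + p_i
        # Normalize the weights so that alpha + beta = 1 and both are non-negative.
        w = torch.relu(self.w)
        alpha, beta = w[0] / (w.sum() + 1e-6), w[1] / (w.sum() + 1e-6)
        return alpha * t_i + beta * b_i
```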

2.4.5. CBAM Attention Mechanism

In the context of cotton leaf segmentation, a primary challenge is the accurate delineation of individual leaves under frequent occlusion and overlap. To address this, we integrated the Convolutional Block Attention Module (CBAM) [22]. This module enables the model to adaptively focus on the most discriminative features of each leaf instance while suppressing irrelevant background interference (e.g., soil and weeds), by sequentially applying channel and spatial attention to recalibrate feature maps.
As illustrated in Figure 12, the Convolutional Block Attention Module (CBAM) operates on an input feature map $F \in \mathbb{R}^{C \times H \times W}$. It sequentially infers two attention maps: a 1D channel attention map $M_C \in \mathbb{R}^{C \times 1 \times 1}$, followed by a 2D spatial attention map $M_S \in \mathbb{R}^{1 \times H \times W}$.
The channel-refined feature $F'$ is obtained by element-wise multiplication of $M_C(F)$ with the input $F$:
$F' = M_C(F) \otimes F$
Subsequently, the spatial attention map is applied to $F'$ to produce the final output $F''$:
$F'' = M_S(F') \otimes F'$
Here, the operator $\otimes$ denotes element-wise multiplication.
The channel attention mechanism models inter-channel relationships by aggregating global context through both Global Average Pooling (GAP) and Global Max Pooling (GMP). These pooled features are then processed by a shared Multi-Layer Perceptron (MLP).
The channel attention mechanism is designed to model inter-channel relationships while preserving the unique statistical information present in different channels—a crucial consideration for cotton leaves where color and texture cues are distributed non-uniformly across RGB channels. To mitigate the potential loss of discriminative information that could arise from relying solely on a single pooling strategy (e.g., simple averaging), CBAM employs a dual-pooling approach. GAP captures the overall statistical mean of each channel, reflecting consistent features like the general green hue of healthy leaves. In contrast, GMP captures the most salient responses within each channel, highlighting distinctive high-intensity features such as leaf veins, specular highlights, or regions affected by disease or dust. The features from both pooling operations are then processed and fused. This synergistic use of GAP and GMP ensures that the computed channel weights are informed by both the broad channel characteristics and the most prominent within-channel activations, thereby preserving the specificity of cotton leaf representations across RGB channels and reducing the risk of information loss associated with any single aggregation method.
The channel attention weights $M_C$ are obtained by passing the globally average-pooled and max-pooled features separately through a shared MLP, summing the two outputs, and applying a sigmoid activation $\sigma$:
$M_C(F) = \sigma\big(\mathrm{MLP}(\mathrm{GAP}(F)) + \mathrm{MLP}(\mathrm{GMP}(F))\big)$
This mechanism allows the model to adaptively emphasize feature channels that are most discriminative for leaf segmentation.
The spatial attention mechanism focuses on identifying salient regions, such as leaf contours. It generates the spatial attention map $M_S$ by applying a 7 × 7 convolution, denoted $f^{7 \times 7}$, to the concatenation of channel-wise average-pooled and max-pooled features:
$M_S(F) = \sigma\big(f^{7 \times 7}(\mathrm{Concat}(\mathrm{GAP}_c(F), \mathrm{GMP}_c(F)))\big)$
Here, $\mathrm{GAP}_c$ and $\mathrm{GMP}_c$ perform average and max pooling along the channel dimension, respectively, each producing a 2D map of size $1 \times H \times W$.
By collaboratively optimizing the “what” (channel) and “where” (spatial) aspects of features, CBAM enhances the representation of leaf bodies and fine details with minimal parameter overhead, significantly improving segmentation robustness in real-world cotton fields without substantially increasing computational burden.
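A minimal PyTorch sketch of this sequential channel-then-spatial attention is given below; the reduction ratio of the shared MLP is an assumed value, and the code is illustrative rather than the exact module used in the model.

```python
import torch
import torch.nn as nn

class CBAMSketch(nn.Module):
    """Channel attention (shared MLP over GAP/GMP features) followed by
    7x7 spatial attention, as described above."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: shared MLP over global average- and max-pooled descriptors.
        gap = torch.mean(x, dim=(2, 3), keepdim=True)
        gmp = torch.amax(x, dim=(2, 3), keepdim=True)
        m_c = torch.sigmoid(self.mlp(gap) + self.mlp(gmp))
        x = m_c * x
        # Spatial attention: 7x7 convolution over channel-wise average/max maps.
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map = torch.amax(x, dim=1, keepdim=True)
        m_s = torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
        return m_s * x
```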

2.5. Experimental Environment and Parameter Settings

Experiments were conducted on a workstation with an AMD Ryzen 7 8845HS CPU (Advanced Micro Devices, Inc., Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4060 Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA). The software environment included the Windows 11 operating system, CUDA version 11.8, Python 3.10.18 programming language, and the PyTorch 2.0.0 deep learning framework.
During the training process, the input images were uniformly resized to a resolution of 640 × 640 pixels to balance the computational efficiency and feature preservation requirements. The model was trained for 300 epochs to ensure adequate convergence. Each training batch contained 16 images, with eight data loading workers enabled for parallel processing to improve the data preprocessing efficiency. The Stochastic Gradient Descent (SGD) optimizer was employed with an initial learning rate of 1 × 10−2 and a momentum parameter of 0.9 to accelerate convergence and stabilize training dynamics. Additionally, a weight decay coefficient of 5 × 10−4 was applied as an L2 regularization constraint to avoid overfitting. This parameter combination was systematically validated to achieve an optimal balance between the computational resource efficiency and model convergence stability. The detailed experimental parameter configuration is summarized in Table 3 below.
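For reference, a training call matching the Table 3 settings might look as follows under the Ultralytics API; the model and dataset YAML file names are placeholders, since the modified GS-BiFPN-YOLO architecture would need to be registered as a custom model configuration.

```python
from ultralytics import YOLO

# Hypothetical custom architecture file for the modified network.
model = YOLO("gs-bifpn-yolo11n-seg.yaml")
model.train(
    data="cotton_leaf.yaml",   # dataset config: train/val/test paths, single class
    epochs=300,
    imgsz=640,
    batch=16,
    workers=8,
    optimizer="SGD",
    lr0=0.01,                  # initial learning rate
    momentum=0.9,
    weight_decay=0.0005,       # L2 regularization
)
```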

2.6. Evaluation Metrics

A comprehensive suite of metrics was employed to rigorously evaluate the proposed model, encompassing detection accuracy, segmentation quality, and inference efficiency. This multi-dimensional assessment ensures a holistic understanding of the model’s performance in localizing leaf instances, generating precise masks, and meeting the real-time requirements of field deployment.

2.6.1. Detection Accuracy Metrics

These metrics evaluate the model’s reliability in correctly identifying and localizing individual leaf instances based on bounding box predictions.
Precision (P) quantifies the proportion of correct detections among all predicted positives, reflecting the model’s ability to minimize false alarms.
$\mathrm{Precision} = \frac{TP}{TP + FP}$
Recall (R) quantifies the proportion of actual ground-truth leaves that are successfully detected, indicating the model’s capability to avoid missed detections.
$\mathrm{Recall} = \frac{TP}{TP + FN}$
F1-Score provides a single balanced metric that harmonizes the trade-off between Precision and Recall.
$F1 = \frac{2 \times P \times R}{P + R}$
In the above equations, a True Positive (TP) is a correctly detected leaf instance, a False Positive (FP) is a background region incorrectly predicted as a leaf, and a False Negative (FN) is an undetected ground-truth leaf.
The mean Average Precision (mAP) serves as the primary composite metric. For the single-class task in this study ($n$ = 1), it is computed as the area under the Precision-Recall curve:
$mAP = \frac{1}{n}\sum_{i=1}^{n} \int_{0}^{1} P(R)\,dR$
We specifically report mAP@0.5, where a detection is valid only if the Intersection over Union (IoU) between its predicted bounding box and the ground truth is ≥0.5. To assess robustness across stricter localization criteria, we also compute mAP@0.5:0.95, defined as the mean of Average Precision (AP) values calculated at 10 IoU thresholds from 0.5 to 0.95 with a step of 0.05:
$mAP_{50\text{-}95} = \frac{AP_{0.5} + AP_{0.55} + \cdots + AP_{0.95}}{k}$
where $k$ = 10 is the number of IoU thresholds.
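A compact NumPy sketch of these two computations (all-point interpolation of the precision-recall curve for AP, and averaging over the ten IoU thresholds for mAP@0.5:0.95) is shown below for reference; in practice these values are reported directly by the training framework.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve using all-point interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # monotone precision envelope
    idx = np.where(r[1:] != r[:-1])[0]              # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def map_50_95(ap_per_threshold: list[float]) -> float:
    """Mean of AP values at IoU thresholds 0.5, 0.55, ..., 0.95 (k = 10)."""
    return float(np.mean(ap_per_threshold))
```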

2.6.2. Segmentation Quality Metrics

To provide a granular, pixel-level assessment of mask accuracy—which is critical for downstream phenotypic analysis—we employ three standard segmentation metrics that evaluate region overlap and boundary fidelity.
Intersection over Union (IoU), or Jaccard Index, measures the spatial overlap between the predicted mask (A) and the ground truth mask (B).
$IoU = \frac{|A \cap B|}{|A \cup B|} = \frac{TP}{TP + FP + FN}$
Dice Coefficient (Dice), functionally equivalent to the F1-score at the pixel level, emphasizes the similarity between the predicted and ground-truth regions.
$\mathrm{Dice} = \frac{2 \times |A \cap B|}{|A| + |B|} = \frac{2 \times TP}{2 \times TP + FP + FN}$
Hausdorff Distance (HD) quantifies the maximum boundary deviation between two point sets, capturing the worst-case segmentation error. Given the predicted boundary point set $P$ and the ground truth set $G$, it is defined as:
$HD(P, G) = \max\Big( \sup_{p \in P} d(p, G),\; \sup_{g \in G} d(g, P) \Big)$
where $d(p, G)$ denotes the minimum Euclidean distance from point $p$ to the set $G$. A lower HD indicates superior boundary alignment.
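The three segmentation metrics can be computed from binary masks and boundary point sets as in the following sketch, which assumes SciPy for the directed Hausdorff distance.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def mask_iou_dice(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Pixel-level IoU and Dice between two binary masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)

def hausdorff_distance(pred_boundary: np.ndarray, gt_boundary: np.ndarray) -> float:
    """Symmetric Hausdorff distance between boundary point sets of shape (N, 2)."""
    return max(directed_hausdorff(pred_boundary, gt_boundary)[0],
               directed_hausdorff(gt_boundary, pred_boundary)[0])
```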

2.6.3. Inference Efficiency Metrics

The model’s practicality for deployment in resource-constrained field environments is assessed using the following efficiency metrics.
Frames Per Second (FPS): The average number of images processed per second during inference, indicating real-time processing capability.
Giga Floating-point Operations (GFLOPs): The total number of floating-point operations required for a single forward pass, quantifying the model’s computational complexity.
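FPS can be measured empirically as in the sketch below, which times repeated forward passes of a PyTorch model on a GPU after a warm-up phase; the batch size, resolution, and iteration counts are illustrative assumptions.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, batch_size: int = 1, imgsz: int = 640,
                warmup: int = 20, iters: int = 100, device: str = "cuda") -> float:
    """Average images processed per second for a given batch size and resolution."""
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, imgsz, imgsz, device=device)
    for _ in range(warmup):           # warm-up to stabilize GPU clocks and cuDNN autotuning
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return iters * batch_size / elapsed
```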

3. Results and Analysis

3.1. Model Training Results

The GS-BiFPN-YOLO model achieved balanced performance in cotton leaf instance segmentation. Specifically, it attained precision and recall rates of 0.951 and 0.972 for bounding boxes, with identical metrics for mask segmentation. When evaluated at an IoU threshold of 0.5, the model attained mAP scores of 0.988 for both bounding box and mask segmentation, confirming its accurate identification and segmentation capability, even under stringent overlap requirements. Furthermore, in the more comprehensive mAP@0.5:0.95 evaluation, the model maintained a robust performance with scores of 0.940 for bounding boxes and 0.904 for mask prediction, reflecting strong generalization across varying overlap thresholds.
With a computational complexity of 9.0 GFLOPs and an inference speed of 322 FPS, the proposed GS-BiFPN-YOLO achieves an excellent accuracy-efficiency trade-off. The integration of GSConv and BiFPN enhances multi-scale leaf feature extraction and fusion, thereby preserving the inherent inference efficiency of the YOLO framework and offering a practical solution for real-time image analysis in complex field settings (Figure 13).

3.2. Comparative Experiments

To comprehensively evaluate the overall performance of the proposed model (hereinafter referred to as GS-BiFPN-YOLO for consistency), we conducted a comparative analysis with several current instance segmentation models, including five lightweight YOLO-based models (YOLOv8n-seg, YOLOv9t-seg, YOLOv10n-seg, YOLOv11n-seg, and YOLOv12n-seg) and the widely recognized Mask R-CNN as a representative two-stage approach. All models were trained and tested on the same cotton leaf dataset under identical conditions to ensure a fair comparison. The experiment aimed to validate the advantages of the proposed model over existing counterparts in terms of segmentation accuracy, inference efficiency, and computational complexity. The results are summarized in Table 4.
Based on the comparative results presented in Table 4, distinct performance trade-offs are observed among the evaluated models. The classical two-stage Mask R-CNN model provides a solid baseline in accuracy but incurs the highest computational cost (180 GFLOPs) and the slowest inference speed (2.44 FPS), which limits its suitability for real-time field applications.
Among the lightweight YOLO-based models, each variant exhibits different strengths. YOLOv11n-seg achieves the fastest inference (333 FPS), while YOLOv9t-seg has the lowest computational complexity (8.2 GFLOPs). However, their overall segmentation accuracy, particularly in terms of mask mAP50–95, does not lead the comparison. YOLOv8n-seg maintains a reasonable balance between speed and accuracy, though its recall is comparatively lower. YOLOv10n-seg excels in mask precision but shows a compromise in recall and overall mAP50–95. In this study, YOLOv12n-seg yields the lowest mask mAP50–95 score among all compared models.
In this context, our proposed GS-BiFPN-YOLO demonstrates superior overall performance. It attains the highest scores in both bounding box and mask mAP50 (0.988). More importantly, it achieves a mask mAP50–95 of 0.904, representing a 4.5% improvement over the best-performing baseline. The model also maintains the highest recall (0.972) and the highest F1-score (0.962), indicating strong robustness in detecting leaves in dense field scenarios.
In terms of efficiency, GS-BiFPN-YOLO remains highly competitive. Its computational cost is only 9.0 GFLOPs, which is lower than most compared models. With an inference speed of 322 FPS, it matches the pace of the fastest model while being significantly faster than several others, confirming the effectiveness of its lightweight design.
To further evaluate segmentation quality at the pixel level, we report the Dice coefficient, IoU, and Hausdorff Distance (HD) in Table 5. GS-BiFPN-YOLO achieves the highest IoU (0.881) and a leading Dice score (0.935), confirming its strong pixel-wise agreement. Although Mask R-CNN attains the best HD value, indicating smaller extreme boundary errors, our model achieves a competitive HD while maintaining superior overall segmentation accuracy.
In summary, the comparative experiments show that by integrating GSConv, BiFPN, and CBAMs, GS-BiFPN-YOLO surpasses other lightweight models in key accuracy metrics while maintaining high inference efficiency. Compared to Mask R-CNN, it offers markedly better computational efficiency while delivering comparable or better accuracy. The model thus presents a robust and efficient solution for real-time cotton leaf segmentation in complex field environments.

3.3. Ablation Experiments

To systematically evaluate the contributions of the proposed GSConv, Bidirectional Feature Pyramid Network (BiFPN), and Convolutional Block Attention Module (CBAM), we conducted a detailed ablation study on the cotton leaf instance segmentation dataset. Starting from the YOLOv11n-seg baseline (Group 1), we incrementally integrated each proposed module to isolate and quantify its impact on performance. The experimental results are summarized in Table 6.
As detailed in Table 6, the ablation study validates the role of each component. The baseline (Group 1) offers a strong balance of accuracy and efficiency. Introducing GSConv alone (Group 2) reduces computational cost (GFLOPs) by 8.3% while maintaining speed, with a minor trade-off in accuracy. Incorporating BiFPN alone (Group 3) significantly improves recall but incurs a substantial speed penalty. Their combination (Group 4) demonstrates clear synergy: GSConv mitigates BiFPN’s efficiency loss, restoring speed close to the baseline while recovering accuracy. Finally, integrating CBAM (Group 5) yields the best overall performance, achieving superior accuracy (e.g., mask mAP50: 0.988) while retaining high inference speed.
To further quantify the impact of each module on model complexity and efficiency, we present a stepwise analysis of incremental contributions in Table 7. This table complements the performance overview in Table 6 by focusing on the changes in parameters, computational load, and inference speed for each architectural variant.
Table 7 provides a distilled view of the cost-accuracy trade-offs. GSConv effectively reduces model size and computation. BiFPN introduces a speed overhead without increasing parameters. Their combination (GSConv + BiFPN) shows a net efficiency gain over using BiFPN alone. The final integration of CBAM in the complete model delivers the most substantial accuracy improvement (+0.036 mAP50–95 over the baseline), while the overall architecture ends up with fewer parameters and lower computational cost than the original baseline, achieving an optimal balance for real-time deployment.
While the ablation tables provide quantitative evidence of each module’s contribution, it is equally important to visually examine how these architectural improvements translate to tangible segmentation quality in challenging field conditions. To this end, Figure 14 presents a comparative visual analysis of segmentation results between the baseline YOLOv11n-seg (Group 1) and our full GS-BiFPN-YOLO model (Group 5) across four representative scenarios: dense occlusion, overlapping canopies, single unobstructed leaves, and dust-adhered leaves.
From Figure 14, it can be seen that YOLOv11n-seg has difficulty achieving complete segmentation of leaves, with the problem of mask fragmentation being particularly prominent when processing individual, unobstructed leaves. In contrast, GS-BiFPN-YOLO not only accurately segments the complete leaf contours, but also consistently outputs higher prediction confidence, fully demonstrating the superiority of its structural improvements.

3.4. Efficiency Analysis Under Different Operational Settings

To thoroughly assess the model’s practical deployment potential and address the scalability of its efficiency claims, we conducted additional experiments measuring the end-to-end inference speed under varying batch sizes and input resolutions. This analysis provides a crucial dual perspective on performance, differentiating between theoretical peak throughput and measured real-time latency.
The results are summarized in Table 8. Two key operational modes are evaluated:
Real-time Streaming Performance (Batch Size = 1): This simulates the stringent condition of processing a live video feed, where each frame must be handled individually—a common scenario for drones or robotic platforms. Under this setting, our model achieves 86.3 FPS at a resolution of 640 × 640 and maintains 72.4 FPS at a higher resolution of 960 × 960. These empirically measured speeds include all system overhead and confirm the model’s capability to meet real-time requirements in field applications.
Batch Processing Efficiency (Batch Size > 1): When processing bursts of images (e.g., from a stored dataset or when latency tolerance is higher), larger batch sizes improve GPU utilization. The model scales efficiently, reaching 136.4 FPS at a batch size = 8 for 640 × 640 inputs.
These empirical measurements complement the theoretical peak throughput of 322 FPS reported in Section 3.1, which was derived under optimal, batched conditions (batch size = 16) to reflect the model’s inherent computational lightness and maximum hardware utilization. The significant performance under batch size = 1 is particularly relevant for edge deployment, as it guarantees responsive processing in continuous streaming scenarios. Together, these metrics demonstrate that the GS-BiFPN-YOLO architecture is not only efficient for high-throughput offline analysis but also capable of sustaining robust real-time performance for critical streaming inputs in agricultural fields.

3.5. Visualization of Segmentation Results

The proposed GS-BiFPN-YOLO model demonstrates strong comprehensive performance in cotton leaf instance segmentation under complex field conditions. As shown in Figure 15, the model accurately identifies and segments cotton leaves, even in challenging scenarios involving occlusion and overlapping.
In terms of object detection and instance segmentation, GS-BiFPN-YOLO shows consistent and marked improvements over all compared baseline models. The model achieves a better balance between precision and recall, leading to superior performance in both bounding box and mask mAP metrics across different IoU thresholds. These results confirm its robustness in handling complex backgrounds and dense leaf arrangements, providing a reliable foundation for subsequent morphological analysis.
Regarding efficiency, GS-BiFPN-YOLO maintains the lowest computational complexity among the models evaluated, with an inference speed sufficient for real-time field applications. This demonstrates that the model achieves an excellent trade-off between accuracy and operational efficiency.
In summary, by integrating GSConv, BiFPN, and CBAM, GS-BiFPN-YOLO attains synergistic optimization in accuracy and speed, offering a practical and efficient solution for real-time crop phenotyping in smart agriculture.

3.6. Analysis of Failure Cases and Limitations

The model’s performance boundaries are most evident in specific challenging scenarios, as visually summarized in Figure 16. A qualitative analysis identifies two primary failure modes:
Missed Detections under Severe Occlusion and Degradation: The model frequently fails to detect leaves that are visually dominated by occluders (e.g., where only a small tip is visible) or have their surface features heavily altered by dust. This is because the visible pixels lack sufficient discriminative context for the network to recognize a complete leaf instance.
Boundary Errors in Visually Homogeneous Clusters: In areas where multiple leaves with nearly identical color and texture overlap, the model may produce blurred boundaries or merge adjacent instances. This indicates that while attention mechanisms help, disambiguating such feature-entangled regions remains difficult.
These failures define the model’s operational envelope: it excels when leaves have clear, continuous contours and distinct appearances. Performance declines predictably when these conditions are compromised. Addressing these cases—potentially by integrating structural reasoning—is a clear direction for enhancing robustness in unstructured field environments.

3.7. Cross-Crop Generalization Analysis

To further examine the model’s generalization capability beyond the training domain, we conducted a comprehensive zero-shot evaluation on an independent soybean-cotton mixed dataset. This analysis assesses the model’s performance when applied to a different dicot species (soybean) and mixed crop scenarios, providing insights into feature transferability and cross-crop generalization.

3.7.1. Cross-Crop Performance Evaluation

Figure 17 presents representative segmentation results. On pure cotton (Figure 17a), the model delivers precise segmentation with high confidence. For pure soybean (Figure 17b), it successfully detects and segments leaf structures, confirming the transfer of learned botanical features (e.g., edge, texture). However, all soybean instances are labeled as “cotton”—a natural outcome of single-class training—and exhibit lower visual confidence. In mixed scenes (Figure 17c), the model distinguishes individual leaves from both crops, maintaining high quality for cotton while replicating the soybean performance pattern observed in Figure 17b.
Table 9 summarizes the quantitative performance across crop types. The model maintains strong performance on cotton (average confidence: 0.669) while achieving reasonable zero-shot performance on soybean (average confidence: 0.535), representing 78.3% relative performance. In mixed scenarios, the model achieves intermediate performance (average confidence: 0.645), demonstrating discriminative capability.
Figure 18 presents the confidence distribution heatmap across different crop types and confidence intervals. The heatmap reveals distinct patterns: cotton images show concentration in high confidence ranges (60–80% in 0.8–1.0 range), soybean images shift toward moderate confidence ranges (40–60% in 0.4–0.8 range), while mixed images show intermediate distributions.

3.7.2. Cross-Crop Generalization Performance Analysis

The results clearly delineate the model’s generalization scope. The maintained performance (~78%) on soybean demonstrates that the architectural innovations (GSConv, BiFPN, CBAM) learn transferable, general-purpose features for leaf instance segmentation. This confirms feature-level generalization.
However, the systematic confidence drop and categorical mislabeling define the current boundary. The model excels at instance segmentation across domains but cannot perform species classification without corresponding multi-class training. This distinction is crucial for practical application: the model provides a powerful, efficient backbone for cross-crop detection and segmentation tasks, upon which a species classifier could be added with targeted training on new crops.
In summary, the analysis validates that GS-BiFPN-YOLO learns a robust and transferable representation of leaf structures, while honestly scoping its capability and providing a clear direction for future work in multi-crop agricultural vision systems.

4. Discussion

The proposed GS-BiFPN-YOLO model demonstrates excellent performance for cotton leaf instance segmentation, achieving a mask mAP@0.5 of 0.988 and a high inference speed. These results should be interpreted within the broader context of agricultural computer vision and the specific challenges outlined in the introduction. Compared to two-stage methods like Mask R-CNN, our model achieves significant improvement in accuracy while drastically reducing computational cost, aligning with the need for efficient architectures in real-time applications. Against leading lightweight YOLO variants, our model shows superior comprehensive segmentation accuracy. The integration of GSConv, BiFPN, and CBAM effectively addresses the core challenges in dense dicot leaf segmentation—scale variation, severe occlusion, and irregular morphology—which were under-explored in prior work focused on more structured targets.
The performance gain stems from targeted architectural innovations. The GSConv module ensures lightweight feature extraction, while the BiFPN’s learnable weighting mechanism enables adaptive multi-scale fusion, crucial for handling the large size variation in leaves. Most importantly, the CBAM, through its combined channel and spatial attention, enhances boundary precision for overlapping leaves with similar colors by focusing on discriminative edge features. This coordinated design represents a deeper optimization for complex leaf segmentation compared to simpler approaches.
The model’s generalization capability is a key practical consideration. The zero-shot evaluation on soybean leaves (Section 3.7), where the model retains 78.3% of its average confidence relative to cotton, suggests promising transferability of learned structural features among dicot species. This performance, validated on the public ‘SoyCotton-Leafs’ dataset, moves beyond the common limitation of single-crop, single-region datasets. However, the remaining gap underscores that for ready multi-crop application, techniques like domain adaptation would be beneficial.
A critical examination of the model’s limitations is essential. As detailed in Section 3.6, the model excels when leaves exhibit visible morphological continuity but struggles under specific conditions: (1) severe occlusion where the visible portion lacks sufficient discriminative context, and (2) dense clusters of visually homogeneous leaves where feature confusion leads to instance merging. These boundaries are honestly defined, indicating that overcoming such edge cases may require integrating higher-level shape priors.
Regarding deployment, the reported 322 FPS (theoretical throughput at batch size 16) indicates high architectural efficiency. The supplementary efficiency analysis (Section 3.4) provides further practical insight: under a simulated real-time streaming setting (batch size = 1), the model maintains 86.3 FPS. This measurable real-time capability, combined with its low computational footprint (9.0 GFLOPs), strongly supports its potential for field deployment. Nevertheless, we acknowledge that actual performance on resource-constrained edge hardware (e.g., NVIDIA Jetson) remains to be validated, and real-world dynamics demand further robustness—both of which are clearly outlined as immediate future work.

5. Conclusions

This study developed GS-BiFPN-YOLO, a novel lightweight model for leaf instance segmentation in complex field environments. The main contributions are threefold:
  • Architectural Innovation for Accuracy-Efficiency Balance: The synergistic integration of GSConv, BiFPN, and CBAMs achieves state-of-the-art segmentation accuracy (0.988 mask mAP@0.5) while maintaining high efficiency. The model’s lightweight design is evidenced by its low computational cost (9.0 GFLOPs), high theoretical throughput (322 FPS at batch size 16), and, crucially, its measured real-time performance of over 86 FPS under streaming conditions (batch size 1).
  • Enhanced Generalizability and Practical Validation: Beyond the primary cotton target, cross-crop zero-shot evaluation on soybean leaves demonstrates promising feature transferability, with the model retaining 78.3% of its average confidence relative to cotton. This addresses common concerns regarding dataset specificity. Furthermore, the proposed SAM-assisted annotation pipeline quantitatively reduces manual labeling effort by 73.26%, providing a practical solution for data preparation.
  • Transparent Scope Definition through Rigorous Analysis: The work explicitly defines the model’s operational envelope through dedicated failure case analysis. The model excels in scenes with clear leaf structures, while its performance boundaries lie under severe occlusion and in visually homogeneous leaf clusters. This transparent scoping, coupled with the empirical efficiency analysis, provides a clear and reliable foundation for practical application.
These contributions, however, also define the current scope of our work and highlight clear avenues for future research. First, to overcome the performance boundaries under severe occlusion and feature ambiguity identified in Section 3.6, integrating higher-level shape priors or structural reasoning mechanisms is a logical next step. Second, while the architectural efficiency is established, practical deployability on resource-constrained edge devices (e.g., NVIDIA Jetson) requires dedicated validation through performance and power profiling, a gap explicitly acknowledged in our efficiency analysis. Finally, to translate the promising cross-crop generalization into robust multi-crop systems, future work must focus on expanding training data diversity and investigating domain adaptation techniques.
In summary, GS-BiFPN-YOLO provides a high-performance, efficient, and practically validated solution for a challenging agricultural vision task. By clearly stating its contributions, transparently scoping its limitations, and outlining a concrete path forward, this work aims to facilitate the transition from laboratory research to field-ready deployment.

Author Contributions

Conceptualization, W.W. and L.C.; methodology, W.W.; software, W.W.; validation, W.W.; formal analysis, W.W.; investigation, W.W.; resources, W.W.; data curation, W.W. and L.C.; writing—original draft preparation, W.W.; writing—review and editing, L.C.; visualization, W.W.; supervision, L.C.; project administration, L.C.; funding acquisition, L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Regional Innovation Guidance Plan of Science and Technology Bureau of Xinjiang Production and Construction Corps (2023AB040 and 2021BB012).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors thank the research team members for their contributions to this work.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figure 1. Cotton leaf images under different conditions. (a) Densely occluded leaf, (b) overlapping leaves within the canopy, (c) isolated and unobstructed leaf from a single plant, and (d) leaf with dust deposition.
Figure 2. Annotation interface of the X-Anylabeling platform integrating the Segment Anything 2.1 (Large) model. The graphical user interface shows the vertical toolbar, which includes (from top to bottom) icons for: file operations (open, save); image navigation; drawing tools (polygon, rectangle, circle, point, etc.); annotation editing (select, delete, undo); view control; and automated annotation.
Figure 3. SAM-annotated cotton leaf image. (a) Raw field cotton leaf image; (b) leaf-annotated image with background removed.
Figure 4. Annotation results from the purely manual workflow. (a) Densely occluded leaf, (b) overlapping leaves within the canopy, (c) isolated and unobstructed leaf, (d) leaf with dust deposition. The green dashed line is part of the annotation software’s interface captured in the screenshot.
Figure 5. Annotation results from the SAM-assisted workflow. (a) Densely occluded leaf, (b) overlapping leaves within the canopy, (c) isolated and unobstructed leaf, (d) leaf with dust deposition. The green dashed line is part of the annotation software’s interface captured in the screenshot.
Figure 6. Examples of data augmentation. Top row: Densely occluded leaves—(a) color and growth state adjustment, (b) geometric transformation, (c) mirroring combined with noise addition, (d) color variation. Second row: Overlapping canopy leaves—(e) color and growth state adjustment, (f) geometric transformation, (g) mirroring combined with noise addition, and (h) color variation. Third row: Unobstructed single-plant leaves—(i) color and growth state adjustment, (j) geometric transformation, (k) mirroring combined with noise addition, (l) color variation. Bottom row: Dust-adhered leaves—(m) color and growth state adjustment, (n) geometric transformation, (o) mirroring combined with noise addition, (p) color variation.
Figure 7. Examples from the cross-crop evaluation dataset. (a) Pure cotton leaf images; (b) pure soybean leaf images; (c) mixed cotton-soybean leaf images.
Figure 8. Architecture of YOLOv11-seg model.
Figure 9. Architecture of the GS-BiFPN-YOLO model.
Figure 10. Structural Diagram of the GSConv Module.
Figure 11. Architectural Overview of the Bidirectional Feature Pyramid Network (BiFPN).
Figure 12. Schematic of the Convolutional Block Attention Module (CBAM).
Figure 13. Training Results of GS-BiFPN-YOLO Model. The data points on the blue line represent the model performance values after each training epoch, and the orange markers represent the fitted values after curve smoothing.
Figure 14. Comparative segmentation results of YOLOv11n-seg and GS-BiFPN-YOLO. (a) Densely occluded leaves segmented by YOLOv11n-seg. (b) Densely occluded leaves segmented by GS-BiFPN-YOLO. (c) Overlapping canopy leaves segmented by YOLOv11n-seg. (d) Overlapping canopy leaves segmented by GS-BiFPN-YOLO. (e) Single, unobstructed leaves segmented by YOLOv11n-seg. (f) Single, unobstructed leaves segmented by GS-BiFPN-YOLO. (g) Dust-adhered leaves segmented by YOLOv11n-seg. (h) Dust-adhered leaves segmented by GS-BiFPN-YOLO.
Figure 15. Segmentation results of the GS-BiFPN-YOLO model in various scenarios: (a) dense occlusion, (b) group overlapping, (c) a single plant without occlusion, and (d) dust deposition.
Figure 16. Examples of failure cases in four challenging scenarios: (a) dense occlusion, (b) overlapping canopy, (c) isolated leaf, (d) dust deposition.
Figure 17. Cross-crop segmentation results. (a) Pure cotton leaf segmentation; (b) pure soybean leaf segmentation; (c) mixed cotton-soybean segmentation. The model maintains high segmentation quality across different crop types with minimal confusion between species.
Figure 18. Performance metrics and dataset composition across crop types. (A) Average confidence score for cotton, soybean, and mixed image datasets. (B) Average number of object detections per image for each dataset type. (C) Dataset composition pie chart, showing an equal distribution of images across the three types (one-third each). Percentages are rounded to one decimal place for presentation. (D) Zero-shot performance comparison, plotting average detection rate against average confidence for each dataset type.
Table 1. Quantitative comparison of annotation efficiency under different conditions.
ID | Condition | Manual Time (s) | SAM-Assisted Time (s) | Time Saved (s) | Time Saved (%)
001 | Densely occluded leaf | 855.35 | 241.37 | 613.98 | 71.78%
003 | Densely occluded leaf | 854.36 | 355.09 | 499.27 | 58.44%
016 | Densely occluded leaf | 732.92 | 245.67 | 487.25 | 66.48%
– | Densely occluded (Avg.) | 814.21 | 280.71 | 533.50 | 65.55%
041 | Overlapping leaves within the canopy | 660.83 | 175.52 | 485.31 | 73.44%
148 | Overlapping leaves within the canopy | 1080.44 | 365.96 | 714.48 | 66.13%
243 | Overlapping leaves within the canopy | 772.49 | 243.84 | 528.65 | 68.44%
– | Canopy overlap (Avg.) | 837.92 | 261.77 | 576.15 | 69.33%
481 | Isolated and unobstructed leaf from a single plant | 209.11 | 30.54 | 178.57 | 85.40%
569 | Isolated and unobstructed leaf from a single plant | 295.92 | 45.57 | 250.35 | 84.60%
570 | Isolated and unobstructed leaf from a single plant | 307.31 | 41.95 | 265.36 | 86.36%
– | Isolated leaf (Avg.) | 270.78 | 39.35 | 231.43 | 85.45%
678 | Leaf with dust deposition | 681.07 | 131.93 | 549.14 | 80.63%
999 | Leaf with dust deposition | 686.63 | 150.38 | 536.25 | 78.11%
1000 | Leaf with dust deposition | 784.34 | 90.59 | 693.75 | 88.45%
– | Dust deposition (Avg.) | 717.35 | 124.30 | 593.05 | 82.40%
– | Overall Average | 660.07 | 176.53 | 483.53 | 73.26%
Table 2. Statistical summary of the cross-crop evaluation dataset.
Dataset Type | Number of Images | Average Leaves per Image | Average Leaf Area (Pixels) | Total Leaves
Pure Cotton | 10 | 7.9 ± 7.19 | 41,956 ± 23,795 | 79
Pure Soybean | 10 | 3.5 ± 3.95 | 38,856 ± 29,919 | 35
Mixed Crops | 10 | 3.9 ± 3.57 | 39,581 ± 25,951 | 39
Total | 30 | 5.1 ± 5.41 | 40,131 ± 26,343 | 153
Table 3. Model training parameter configuration.
Parameter Name | Value
Image size (pixels) | 640
Epochs | 300
Batch size | 16
Workers | 8
Optimizer | SGD
Learning rate | 1 × 10−2
Weight decay | 5 × 10−4
Momentum | 0.90
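As a convenience for readers, the Table 3 settings map naturally onto an Ultralytics-style training call. The sketch below is only an illustration under that assumption; the model YAML and dataset YAML file names are hypothetical placeholders, not artifacts released with this paper.

```python
from ultralytics import YOLO

# Hypothetical custom architecture definition and dataset description files.
model = YOLO("gs-bifpn-yolo-seg.yaml")
model.train(
    data="cotton_leaf.yaml",   # dataset YAML (train/val paths, class names)
    imgsz=640,                 # input image size
    epochs=300,
    batch=16,
    workers=8,
    optimizer="SGD",
    lr0=1e-2,                  # initial learning rate
    weight_decay=5e-4,
    momentum=0.9,
)
```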
Table 4. Model performance comparison.
Model | P (Box) | R (Box) | mAP50 (Box) | mAP50–95 (Box) | P (Mask) | R (Mask) | mAP50 (Mask) | mAP50–95 (Mask) | F1 | FPS | GFLOPs
Mask R-CNN | 0.770 | 0.801 | 0.800 | 0.801 | 0.774 | 0.805 | 0.803 | 0.688 | 0.789 | 2.44 | 180
YOLOv8n-seg | 0.929 | 0.929 | 0.974 | 0.898 | 0.927 | 0.930 | 0.972 | 0.865 | 0.930 | 370 | 10.1
YOLOv9t-seg | 0.915 | 0.857 | 0.944 | 0.758 | 0.913 | 0.864 | 0.949 | 0.711 | 0.890 | 243 | 8.2
YOLOv10n-seg | 0.970 | 0.925 | 0.973 | 0.836 | 0.974 | 0.929 | 0.978 | 0.804 | 0.950 | 294 | 9.8
YOLOv11n-seg | 0.935 | 0.933 | 0.977 | 0.906 | 0.935 | 0.932 | 0.975 | 0.868 | 0.930 | 333 | 9.6
YOLOv12n-seg | 0.940 | 0.933 | 0.970 | 0.824 | 0.942 | 0.935 | 0.977 | 0.800 | 0.940 | 270 | 9.6
GS-BiFPN-YOLO | 0.951 | 0.972 | 0.988 | 0.940 | 0.951 | 0.972 | 0.988 | 0.904 | 0.962 | 322 | 9.0
Table 5. Fine-grained segmentation performance comparison (↑ indicates higher is better, ↓ indicates lower is better).
Model | Dice ↑ | IoU ↑ | HD (px) ↓
Mask R-CNN | 0.762 | 0.718 | 36.5
YOLOv8n-seg | 0.926 | 0.865 | 45.5
YOLOv9t-seg | 0.902 | 0.874 | 42.1
YOLOv10n-seg | 0.931 | 0.824 | 38.6
YOLOv11n-seg | 0.925 | 0.863 | 42.0
YOLOv12n-seg | 0.933 | 0.877 | 41.2
GS-BiFPN-YOLO | 0.935 | 0.881 | 39.7
Table 6. Effects of different modules on model performance.
Group | GSConv | BiFPN | CBAM | P (B) | R (B) | mAP50 (B) | mAP50–95 (B) | P (M) | R (M) | mAP50 (M) | mAP50–95 (M) | FPS | GFLOPs
1 |  |  |  | 0.935 | 0.933 | 0.977 | 0.906 | 0.935 | 0.932 | 0.975 | 0.868 | 333 | 9.6
2 | √ |  |  | 0.934 | 0.932 | 0.975 | 0.897 | 0.934 | 0.932 | 0.974 | 0.861 | 333 | 8.8
3 |  | √ |  | 0.929 | 0.938 | 0.976 | 0.902 | 0.928 | 0.938 | 0.975 | 0.868 | 277 | 9.6
4 | √ | √ |  | 0.938 | 0.937 | 0.977 | 0.905 | 0.939 | 0.937 | 0.977 | 0.871 | 323 | 8.9
5 | √ | √ | √ | 0.951 | 0.972 | 0.988 | 0.940 | 0.951 | 0.972 | 0.988 | 0.904 | 322 | 9.0
Note: A checkmark (√) in the GSConv, BiFPN, and CBAM columns indicates that the corresponding module was included in the model configuration for that experimental group; an empty cell indicates the absence of that module.
Table 7. Stepwise analysis of incremental module contributions. All Δ values are calculated relative to the baseline YOLOv11n-seg (Group 1).
Model Variant | ΔParams (M) | Params (M) | ΔGFLOPs | GFLOPs | ΔFPS | FPS | ΔmAP50–95 (Mask) | mAP50–95 (Mask)
YOLOv11n-seg | – | 2.84 | – | 9.6 | – | 333 | – | 0.868
+GSConv | −0.33 | 2.51 | −0.8 | 8.8 | ±0.0 | 333 | −0.007 | 0.861
+BiFPN | ±0.00 | 2.84 | ±0.0 | 9.6 | −56 | 277 | ±0.000 | 0.868
GSConv + BiFPN | +0.01 | 2.52 | −0.5 | 8.9 | −10 | 323 | +0.003 | 0.871
GS-BiFPN-YOLO | −0.32 | 2.52 | −0.6 | 9.0 | −11 | 322 | +0.036 | 0.904
Table 8. Measured inference speed under different operational settings.
Resolution | Batch Size | Avg. FPS | Latency per Image (ms) | Note
640 × 640 | 1 | 86.3 | 11.6 | Measured real-time streaming performance
640 × 640 | 8 | 136.4 | 7.3 | Measured small-batch processing performance
960 × 960 | 1 | 72.4 | 13.8 | Measured high-resolution real-time performance
640 × 640 | 16 | 322.0 | 3.1 | Theoretical peak throughput (from Section 3.1)
Table 9. Cross-crop performance summary.
Crop Type | Images | Avg. Confidence | Detections per Image | Avg. Mask Area (Pixels)
Cotton | 10 | 0.669 ± 0.250 | 7.9 ± 7.19 | 41,956 ± 23,795
Soybean | 10 | 0.535 ± 0.219 | 3.5 ± 3.95 | 38,856 ± 29,919
Mixed | 10 | 0.645 ± 0.280 | 3.9 ± 3.57 | 39,581 ± 25,951