1. Introduction
Cotton is a vital economic crop, and its yield directly influences farmers’ income and the overall performance of the agricultural economy [1]. To ensure stable and high yields, topping operations during the vigorous growth period (June to July) are essential: removing the apical growth point weakens apical dominance and redirects nutrients toward boll development, thereby enhancing yield [2]. Traditional manual topping is labor-intensive, inefficient, and heavily dependent on farmers’ visual experience, which limits operational accuracy. With the rapid advancement of agricultural mechanization and intelligent equipment, the adoption of cotton topping machines has significantly improved operational efficiency and reduced labor costs [3]. As the key sensing component of mechanized operations, accurate apical bud recognition via computer vision forms the foundation for automated topping.
With the continuous evolution of the YOLO series [4,5,6,7,8], object detection techniques have become increasingly mature and widely applied in agricultural scenarios. Xuening Zhang et al. [9] used YOLOv8 as the baseline, replaced the C2f module with a cross-level partial network and partial convolution (CSPPC) module, and optimized the loss function with Inner CIoU, which introduces auxiliary inner bounding boxes to emphasize overlap consistency between predicted and ground-truth regions and thereby improves localization accuracy for small objects; the model achieved 97% accuracy on a self-constructed dataset. Yufei Xie et al. [10] formulated cotton apical bud detection as a small-object detection problem and proposed a leaf-morphology region-of-interest (ROI) generation network based on YOLOv11n, achieving 96.7% accuracy by constraining the search range. Similarly, Meng Li et al. [11] developed CMD-YOLO to improve small-object detection accuracy for cherry ripeness, reducing model complexity through structural simplification and enhanced feature fusion; compared with the YOLOv12 baseline, it improved accuracy by 5.3% and reduced parameters by 73.1%. For potato bud detection, Qiang Zhao et al. [12] improved YOLOv5s by adding a dedicated small-object detection layer and an attention mechanism, enabling high-precision detection of potato “buds.” As another state-of-the-art framework, Faster R-CNN [13] has also been widely applied in agriculture. Andrew Magdy et al. [14] proposed a lightweight Faster R-CNN-based model tailored to agricultural remote sensing, performing high-resolution detection of crop distribution, land use, and pest-infected regions; by optimizing the RoI pooling and region proposal processes, both inference speed and detection accuracy were improved. Although numerous YOLO- and Faster R-CNN-based models have been developed for agricultural object detection [15], most improvements focus primarily on architectural optimization to enhance feature extraction and convergence speed. Consequently, achieving higher accuracy and robustness often demands extensive datasets, resulting in substantial annotation costs and computational overhead.
Despite the progress of YOLO-based methods for apical bud detection, their deployment in real-world field environments remains challenging. The field environment is inherently complex: apical buds are frequently occluded by upper leaves, illumination varies drastically with time and weather, and noise arises from dust, pollution, and cluttered backgrounds. These factors collectively diminish the visual saliency of apical buds, complicating feature extraction and target localization. Moreover, apical buds exhibit typical small-object characteristics (small size, variable scale, and high similarity to surrounding tissues), causing mainstream one-stage detectors to miss or misidentify targets under multi-occlusion and weak-saliency conditions. To address these challenges, this paper proposes an indirect apical bud localization approach based on cotton top-leaf segmentation. The proposed method leverages the morphological relationship between the apical bud and the top leaves, as illustrated in Figure 1, effectively alleviating the false and missed detections observed in direct detection. As an indeterminate plant, cotton exhibits a radial leaf arrangement in which the top leaves radiate around the terminal bud, which typically lies near the intersection of the extended petiole lines [16]. This spatial configuration provides a novel means of locating the apical bud. Additionally, since top leaves have larger areas and more distinct visual features, they can be segmented more accurately. Hence, accurate top-leaf segmentation becomes the crucial first step.
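To make the geometric relationship concrete, the following minimal sketch estimates the bud position as the intersection of two extended petiole lines obtained from segmented top leaves. The endpoint coordinates and the helper function are hypothetical illustrations of the idea described above, not part of the paper's pipeline.

```python
# Hypothetical sketch: approximate the apical bud as the intersection of the
# extended petiole lines of two segmented top leaves (image coordinates in pixels).
import numpy as np

def line_intersection(p1, p2, q1, q2):
    """Intersect the line through p1-p2 with the line through q1-q2."""
    p1, p2, q1, q2 = map(np.asarray, (p1, p2, q1, q2))
    d1, d2 = p2 - p1, q2 - q1
    denom = d1[0] * d2[1] - d1[1] * d2[0]   # 2D cross product of the two directions
    if abs(denom) < 1e-9:                   # near-parallel petioles: no reliable estimate
        return None
    t = ((q1[0] - p1[0]) * d2[1] - (q1[1] - p1[1]) * d2[0]) / denom
    return p1 + t * d1                      # estimated apical bud location

# Made-up petiole endpoints (tip -> base) of two top leaves:
bud_xy = line_intersection((260, 300), (320, 410), (430, 310), (350, 405))
print(bud_xy)  # approximate pixel position near which the terminal bud is expected
```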
Segmentation techniques can generally be categorized into single-stage, two-stage, and multi-stage approaches. Single-stage models, such as YOLACT [17], YOLACT++ [18], SOLO [19], and the YOLO-Seg series, unify detection and segmentation in an end-to-end framework, bypassing the conventional “detect-then-extract” process by directly predicting instance masks at the pixel level. Two-stage models, such as Mask R-CNN [20], first generate region proposals before mask prediction. However, existing segmentation models still require extensive annotations and training resources; unlike detection labels, segmentation labels consist of detailed polygon contours rather than simple bounding boxes. In April 2023, Meta AI introduced the Segment Anything Model (SAM) [21], establishing a new segmentation paradigm. Built upon a Vision Transformer (ViT) encoder, SAM accepts point, box, and mask prompts to achieve interactive zero-shot segmentation. Owing to its strong generalization, SAM has been widely adopted in medical image analysis [22] and remote sensing segmentation [23]. Nevertheless, it suffers significant performance degradation in agricultural scenes characterized by long petioles, fine structures, severe occlusion, and weak texture, and it remains sensitive to prompt quality and scale [24]. For agricultural automation, an interactive segmentation paradigm is impractical; an end-to-end segmentation framework with automatic prompting is therefore preferred.
To this end, this paper proposes CF-SAM, a prompt-enhanced SAM model for instance segmentation of terminal (top) cotton leaves. CF-SAM retains SAM’s zero-shot generalization capability while achieving accurate instance segmentation with a limited dataset. Specifically, the original SAM image encoder is replaced with a lightweight Tiny-ViT, and LoRA-based fine-tuning is applied to the encoder–decoder structure for domain adaptation. In addition, an Adaptive Prompting Strategy (APS) automatically generates point prompts for cotton top leaves, enabling fully automated end-to-end instance segmentation.
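As a rough illustration of the intended end-to-end automation, the sketch below chains an APS-style point generator with the public segment-anything predictor API. The `aps_point_generator` stub, checkpoint path, and image file name are hypothetical placeholders rather than the released CF-SAM implementation, and the Tiny-ViT encoder swap and LoRA adapters are omitted for brevity.

```python
# Sketch of automatic prompting: an APS-style module supplies point prompts,
# which are passed to a SAM predictor for top-leaf mask prediction.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def aps_point_generator(image_rgb):
    """Hypothetical APS stub: return one (x, y) point expected to lie inside the top leaf."""
    h, w = image_rgb.shape[:2]
    return np.array([[w // 2, h // 2]], dtype=np.float32)  # placeholder: image center

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder weights
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("cotton_top.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

points = aps_point_generator(image)
masks, scores, _ = predictor.predict(
    point_coords=points,
    point_labels=np.ones(len(points), dtype=int),  # 1 = foreground prompt
    multimask_output=False,                        # one mask per prompt set
)
top_leaf_mask = masks[0]  # boolean H x W mask of the segmented top leaf
```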
The main contributions of this paper are as follows:
(1) The SAM model is introduced into cotton-leaf segmentation, and its performance on fine structures and weak-texture regions is enhanced via LoRA fine-tuning (a minimal illustrative sketch is given below).
(2) A lightweight Tiny-ViT encoder is adopted to reduce model parameters and inference time while maintaining accuracy.
(3) An automatic point-prompting mechanism (APS) is proposed for cotton top leaves, enabling accurate and efficient automatic prompting.
Owing to the lack of publicly available datasets, a self-constructed dataset of cotton apical leaves was developed to evaluate the proposed model. Experimental results demonstrate that CF-SAM produces more accurate segmentation masks with only a minor increase in parameters, thereby providing a solid technical foundation for apical bud localization in cotton plants.
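The LoRA adaptation referenced in contribution (1) can be sketched generically as follows. This is an illustrative PyTorch example under the rank-4 setting reported later, not the authors' code; the wrapped linear layers stand in for SAM's attention projections.

```python
# Generic LoRA sketch: freeze a pretrained projection W and train only the
# low-rank update B @ A added to its output.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Wrap the Q, K, and V projections of one (stand-in) attention block:
attn = nn.ModuleDict({k: nn.Linear(256, 256) for k in ("q", "k", "v")})
for name in ("q", "k", "v"):
    attn[name] = LoRALinear(attn[name], rank=4)

trainable = sum(p.numel() for p in attn.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable}")  # 3 * 4 * (256 + 256) = 6144
```

Only the low-rank matrices are updated during fine-tuning, which is why the parameter increments reported later remain in the 0.04 to 0.09 M range.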
3. Results and Analysis
3.1. Experimental Environment
To assess the advantages of LoRA, two machines were used for model training, inference, and performance evaluation. For inference, CF-SAM ran on a server equipped with an Intel Xeon E5-2680 CPU and an NVIDIA A100 (40 GB) GPU; for training and inference, it ran on a personal workstation equipped with an Intel Core i7-12700KF CPU and an NVIDIA RTX 4070 (12 GB) GPU. Both systems ran Ubuntu Server 20.04 (64-bit) with CUDA 11.8, the PyTorch 2.4.1 deep learning framework, and Python 3.10. The hyperparameter values used in the experiments are listed in Table 2 and Table 3.
3.2. Comparative Experiment
To comprehensively evaluate the performance of CF-SAM, we conducted comparative experiments against several mainstream instance-segmentation frameworks, including Mask R-CNN, YOLO-Seg variants, and the SAM-Base model. The quantitative results are summarized in Table 4.
Among all competitors, YOLOv11n-seg and YOLOv12n-seg achieved competitive accuracy within the YOLO family. Although Mask R-CNN yielded the highest bounding-box precision for leaf instances, its mask predictions lacked sharpness along leaf boundaries, and its two-stage pipeline substantially increased computational overhead, limiting inference speed to 55.7 FPS. The baseline SAM-Base exhibited the weakest overall performance: despite its strong zero-shot generalization capability, it struggled to adapt to fine-grained leaf segmentation, achieving only P(Mask) = 0.8532 at 13 FPS while incurring the largest parameter load (92.3 M) among all models. The Mask R-CNN model, built upon a ResNet-50 backbone, likewise suffered from heavy parameterization (44.1 M). In contrast, CF-SAM fine-tunes SAM in a parameter-efficient manner by integrating LoRA into its attention layers and updating only the low-rank matrices. This design significantly improves segmentation accuracy on small datasets while keeping the model compact. With only 0.091 M additional parameters, CF-SAM attains P(Mask) = 0.9804, F1 = 0.9675, and mAP@0.5 = 0.9783, achieving precision comparable to YOLOv12n-seg while sustaining 58 FPS inference, which is sufficient for real-time field-level automation.
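For reference, the reported metrics are assumed here to follow their standard definitions (the paper's metric section is not reproduced): with TP, FP, and FN counted at the mask level,

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}, \qquad
\mathrm{IoU} = \frac{\lvert M_{\mathrm{pred}} \cap M_{\mathrm{gt}} \rvert}{\lvert M_{\mathrm{pred}} \cup M_{\mathrm{gt}} \rvert},
```

and mAP@0.5 denotes the mean average precision computed at an IoU threshold of 0.5.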
As shown in Figure 9, SAM-B exhibited clearly suboptimal segmentation performance on this task and was highly dependent on the provided prompts. Although the proposed APS strategy supplied accurate point prompts within the interior of the top leaf, SAM-B still failed to produce a complete and precise segmentation of the entire leaf. This further corroborates that, while SAM, as a general-purpose segmentation foundation model, possesses strong generalization ability, its original form does not transfer well to specialized, fine-grained segmentation tasks. As a typical two-stage instance segmentation framework, Mask R-CNN achieved nearly 100% accuracy in predicting bounding boxes for leaf instances; however, its mask predictions along leaf-edge texture were unsatisfactory, and it even misidentified the top leaf in the second image. In contrast, YOLOv12-Seg delivered superior performance, accurately localizing the top leaf and producing more refined segmentation along leaf boundaries. CF-SAM achieved segmentation results comparable to those of YOLOv12-Seg, indicating that the fine-tuned model, combined with the proposed APS prompting strategy, exhibits strong adaptability and robustness for top-leaf segmentation in cotton plants.
As illustrated in Figure 10, the proposed CF-SAM achieves a superior trade-off between accuracy and efficiency compared with the baseline SAM-B model. In terms of network complexity, CF-SAM reduces both the number of trainable parameters and the overall model size by approximately one order of magnitude, accompanied by a substantial decrease in FLOPs, indicating a significantly lower computational cost during inference. Despite this drastic reduction in complexity, CF-SAM delivers consistent performance gains: its mIoU and F1-score surpass those of the baseline, demonstrating improved localization accuracy and more stable pixel-level classification. Notably, the parameter increment of CF-SAM is roughly three orders of magnitude lower than that of SAM-B (0.091 M versus 92.3 M), underscoring the effectiveness of the LoRA-based fine-tuning mechanism in achieving efficient and accurate segmentation.
3.3. Ablation Experiment
To evaluate the effectiveness of each proposed component, a series of ablation studies was conducted on the cotton apical leaf dataset.
As shown in Table 5, replacing the original encoder with Tiny-ViT drastically reduces both the number of trainable parameters and FLOPs, by approximately one order of magnitude, while maintaining comparable segmentation accuracy. Hence, all subsequent experiments employ Tiny-ViT as the image encoder.
To assess the contribution of LoRA in different attention projections, we applied it individually to the Q, K, and V projection layers within the image encoder. As reported in Table 6, fine-tuning only the encoder yields a noticeable improvement in segmentation accuracy, with the best performance obtained when Q, K, and V are fine-tuned simultaneously: merely 0.044 M (0.438%) additional parameters bring a 5% improvement over the SAM-Base model.
As shown in Table 7, fine-tuning the mask decoder yields an even larger gain: with only 0.047 M (4.63%) additional parameters, the model achieves an 8% accuracy improvement over the baseline. This demonstrates that tuning the decoder enables better learning of fine-grained texture and edge information in cotton leaves.
We then fine-tuned the image encoder and the mask decoder simultaneously; the results are shown in Table 8. Comparing the effect of applying LoRA to different modules shows that fine-tuning the encoder helps the model learn richer cotton-leaf features, improving segmentation accuracy, whereas fine-tuning the decoder helps it recover more complete leaf texture and edge information, improving segmentation quality. Segmentation quality is best when both the encoder and decoder are fine-tuned simultaneously, achieving a 13% improvement in segmentation accuracy over SAM-Base with only a 0.091 M (0.8%) increase in parameters.
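The combined increment is consistent with the per-module results quoted above, assuming the encoder and decoder adapters simply add up:

```latex
\Delta\theta_{\text{total}} \approx \Delta\theta_{\text{enc}} + \Delta\theta_{\text{dec}} = 0.044\,\mathrm{M} + 0.047\,\mathrm{M} = 0.091\,\mathrm{M}.
```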
As shown in Figure 11, the visualization results of fine-tuning different layers and projection modules are presented: (a) and (b) show fine-tuning of the encoder only, with rank-4 matrices inserted into the QV and QKV projections, respectively; (c) and (d) show fine-tuning of the decoder only, with rank-4 matrices inserted into the QKV and QKVO projections, respectively; and (e) and (f) show fine-tuning of both the encoder and decoder simultaneously, using matrices of rank 2 and rank 4, respectively.
The visualization results show that fine-tuning only the encoder or decoder leads to inaccurate segmentation of leaf edges. As illustrated in (e) and (f), when both the encoder and decoder are fine-tuned simultaneously, the model achieves a higher level of segmentation performance, enabling more precise delineation of leaf boundaries. The model exhibits the best segmentation quality when the rank size is set to 4.
We also compared the impact of different rank sizes on segmentation accuracy by inserting low-rank matrices of different ranks into the multi-head self-attention, cross-attention, and mask-attention mechanisms of the image encoder and mask decoder. The results are shown in Table 9.
Experimental results show that fine-tuning with different rank sizes yields comparable results. The model’s segmentation ability is strongest when the rank of the low-rank matrices is 4, with P(Mask) reaching 98% and mIoU reaching 0.916, while keeping the parameter increment modest. When the rank exceeds 4, performance decreases, indicating that an excessively large rank causes the model to learn redundant features, yielding a sparser feature matrix and thereby degrading segmentation accuracy.
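This trade-off follows from the standard LoRA parameterization ($\Delta W = BA$ with $B \in \mathbb{R}^{d_{\text{out}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\text{in}}}$): each adapted projection adds

```latex
\Delta\theta_{\text{LoRA}} = r\,(d_{\text{in}} + d_{\text{out}})
```

parameters, so the increment grows linearly with the rank r, while Table 9 shows no accuracy benefit beyond r = 4.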
3.4. Demonstration
To further illustrate the practical effectiveness of the proposed CF-SAM, we present a series of qualitative demonstrations on cotton images from the test set and additional field scenes. Representative examples show the APS-generated point prompts, the predicted masks, and their overlays on the original images. Notably, the test-set demonstrations explicitly include challenging dust-haze samples; for example, the cases shown in Figure 12c–e were all collected under dust-haze conditions. This qualitative evidence complements the quantitative evaluation by illustrating the robustness of CF-SAM under severe atmospheric interference. Compared with the baseline SAM-B and other competing methods, CF-SAM produces more complete top-leaf contours, sharper leaf edges, and fewer false positives in background regions. The qualitative results also indicate that CF-SAM maintains stable segmentation performance under variations in illumination, background clutter, and partial occlusion, demonstrating its suitability for real-field deployment and downstream automated phenotyping applications.
4. Discussion
The proposed CF-SAM model demonstrates superior performance in cotton apical leaf segmentation under complex field conditions. By fine-tuning the large SAM model with LoRA on a small dataset, CF-SAM substantially improves fine-grained segmentation accuracy while retaining zero-shot generalization. Compared with the untuned SAM-Base (P(Mask) ≈ 0.85), CF-SAM achieves P(Mask) = 0.98 and mAP@0.5 = 97.83% with only 0.09 M additional parameters. Replacing ViT-B with Tiny-ViT reduces the parameter count from 92.3 M to 13.5 M, cutting computational cost by nearly an order of magnitude without compromising accuracy. These results verify that lightweight architectures can achieve real-time segmentation on resource-constrained agricultural machinery.
The Adaptive Prompting Strategy (APS) enables automatic end-to-end segmentation. By generating high-quality point cues via a CNN, APS guides SAM to focus precisely on top-leaf regions, removing the need for manual prompts. It performs reliably across varying illumination and occlusion but shows limited robustness when leaves are heavily overlapped or the background is highly cluttered. For example, as shown in Figure 13c, APS may mistakenly select a young inner leaf as a top-layer target, which can further degrade segmentation quality.
Methodologically, CF-SAM illustrates that large vision models can be effectively adapted to agricultural tasks using small datasets through efficient fine-tuning. With only 1000 training images, the model achieves high accuracy and compactness, proving the feasibility of transferring pretrained general models to agricultural scenarios. The introduction of Tiny-ViT further enhances portability, making CF-SAM deployable on embedded and edge devices.
Nevertheless, given the substantial variability of real-world field environments, the current dataset—although collected under diverse conditions (e.g., clear-sky, overcast, and dust-haze)—remains moderate in both size and coverage for instance segmentation. Moreover, all images were acquired within a single region and season, which may constrain external validity when deploying the model across different cotton cultivars, growth stages, and geographic areas. Future work will therefore expand data collection across regions and varieties and incorporate cross-region and cross-dataset evaluations to more rigorously assess generalization.
In addition, APS currently supports only single-plant segmentation. Future work will focus on multi-target extension and integrating geometric reasoning or depth cues to improve robustness and bud localization.
In summary, CF-SAM combines accuracy, efficiency, and automation, providing a practical framework for intelligent cotton topping and demonstrating the potential of large-model adaptation in agricultural vision.
5. Conclusions
This paper explores a novel approach to indirect apical bud localization in cotton plants and develops an end-to-end segmentation model, CF-SAM, for segmenting top-leaf instances. We introduce the general-purpose segmentation model SAM into an agricultural scenario and significantly improve its segmentation accuracy and practicality under limited data through the combination of LoRA fine-tuning, a lightweight Tiny-ViT encoder, and an adaptive prompting strategy. Experimental results show that CF-SAM achieves significant accuracy advantages (more than a 5% improvement in P(Mask) and mAP@0.5) over traditional detection and segmentation methods while requiring only minimal additional training parameters. The prompting mechanism eliminates the need for manual interaction and automates the top-leaf segmentation process, laying the foundation for precise bud localization by intelligent topping machines. This work demonstrates the feasibility and efficiency of adapting a large model to small-sample tasks in precision agriculture.
In practical applications, CF-SAM holds promise for integration into field operation platforms, such as drones or self-propelled topping machines, to segment the top leaves of cotton plants. This will improve the intelligence level of cotton topping operations and reduce reliance on human experience. However, we also recognize the limitations of this study: the model’s robustness to extreme occlusion conditions and its cross-environment generalization ability require further verification. In future work, we plan to collect larger-scale, multi-regional cotton plant image data to train and test the model, continuously improving its adaptability and reliability. In parallel, we will explore methods such as multi-point prompts and multimodal information fusion to improve the APS strategy and enhance segmentation accuracy in complex scenarios. We will also attempt to extend this method to the identification of key parts of other crops, verifying its universality and effectiveness in broader precision agriculture applications. In conclusion, CF-SAM provides a new technical approach for precise topping of cotton buds and offers valuable experience and insights for the application of deep learning models in agriculture.