Article

RA-CottNet: A Real-Time High-Precision Deep Learning Model for Cotton Boll and Flower Recognition

1 Department of Crop and Soil Sciences, College of Agriculture and Environmental Sciences, University of Georgia, Tifton, GA 31793, USA
2 International College Beijing, China Agricultural University, 17 Qinghua East Road, Haidian, Beijing 100083, China
3 College of Engineering, China Agricultural University, 17 Qinghua East Road, Haidian, Beijing 100083, China
4 Institute of Plant Breeding, Genetics and Genomics, University of Georgia-Tifton Campus, Tifton, GA 31793, USA
5 Department of Computer Science, Wake Forest University, 1834 Wake Forest Road, Winston-Salem, NC 27109, USA
* Author to whom correspondence should be addressed.
AI 2025, 6(9), 235; https://doi.org/10.3390/ai6090235
Submission received: 19 August 2025 / Revised: 11 September 2025 / Accepted: 16 September 2025 / Published: 18 September 2025

Abstract

Cotton is the most important natural fiber crop worldwide, and its automated harvesting is essential for improving production efficiency and economic benefits. However, cotton boll detection faces challenges such as small target size, fine-grained category differences, and complex background interference. This study proposes RA-CottNet, a high-precision object detection model with both directional awareness and attention-guided capabilities, and develops an open-source dataset containing 4966 annotated images. Based on YOLOv11n, RA-CottNet incorporates ODConv and SPDConv to enhance directional and spatial representation, while integrating CoordAttention, an improved GAM, and LSKA to improve feature extraction. Experimental results show that RA-CottNet achieves 93.683% Precision, 86.040% Recall, 93.496% mAP50, 72.857% mAP95, and 89.692% F1-score, maintaining stable performance under multi-scale and rotation perturbations. The proposed approach demonstrated high accuracy and real-time capability, making it suitable for deployment on agricultural edge devices and providing effective technical support for automated cotton boll harvesting and yield estimation.

1. Introduction

Cotton plays a pivotal role in the global agricultural and industrial system [1,2,3], accounting for about 35% of natural fiber output [4]. It underpins the livelihoods of over 100 million farming households worldwide [5] and remains crucial to the textile, apparel, medical, and hygiene sectors thanks to its superior fiber properties [6,7,8]. Global production and harvested area have steadily increased from 1994 to 2023 (Figure 1), reflecting its sustained agricultural and industrial significance.
Despite this importance, cotton harvesting faces critical challenges [2]. Manual picking is highly labor-intensive and increasingly costly amid rural depopulation and rising wages, with skilled workers harvesting only 50–80 kg per day—far below the efficiency of intelligent machinery [9]. Harvest timing is equally critical, as delays can reduce fiber quality and yield [10]. Thus, accurate recognition of boll maturity is essential for automation and yield assessment [11]. Image-based methods can discriminate boll maturity, quantify yield, and support breeding decisions [12]. Early machine learning approaches, such as the system of Sanjay et al. [11], which combined high-resolution imaging with robotic control, improved efficiency by 20–25%. However, such machine learning methods still face limitations [13,14,15,16,17]: poor scalability and adaptability under diverse conditions [18,19]; computational bottlenecks with large-scale data [20]; weak robustness against livestock interference or machinery vibrations [21]; and the persistent tension between edge computing constraints and real-time demands, where even lightweight models often fall short for time-critical operations [22].
With rapid advances in computer vision and artificial intelligence [23,24,25,26], deep learning has opened new avenues for addressing long-standing challenges in traditional cotton harvesting and agricultural visual recognition [2,27]. In image recognition tasks, deep models offer end-to-end learning and highly nonlinear feature representations that capture complex phenotypic structures, markedly improving the accuracy and stability of feature extraction and classification [28,29]. In recent years, deep learning has been widely adopted in agriculture and has delivered strong performance in multiple tasks [30]. For field management, Sun et al. [31] combined feature wavelength spectral imaging with a convolutional neural network (CNN) to rapidly detect imidacloprid residues on lettuce leaves. Tetila et al. [32] used unmanned aerial vehicle (UAV) imagery and deep learning for soybean pest detection and classification, with a fine-tuned ResNet-50 achieving 93.82% accuracy. At the harvest stage, Machefer et al. [33] integrated a Mask R-CNN branch into a Faster R-CNN framework and applied transfer learning to optimize UAV imaging parameters, enabling efficient identification and counting of lettuce.
In agricultural visual recognition, the YOLO family has become a widely used approach due to its lightweight architecture, rapid inference, and strong real-time performance [34,35]. Compared with Faster R-CNN, Mask R-CNN, SVM/HOG, and ResNet/CNN methods, YOLO offers markedly higher detection efficiency and real-time throughput, addressing limitations of conventional deep learning frameworks [36]. YOLO models also impose modest computational demands, making them suitable for edge devices and resource-constrained settings and enabling flexible deployment [35]. There are many examples of successful YOLO-based model applications. Di et al. [37] proposed UAV-YOLO, a YOLOv8s-based small object detector that achieved 47.3% mAP50 on the VisDrone2019 dataset, improving accuracy by 8.9%. Similarly, Huo et al. [38] integrated coordinate attention into YOLOX-tiny with HSV thresholding and morphological operations for high-precision chili pepper detection. Cardellicchio et al. [39] developed an incremental learning pipeline based on YOLOv11 for tomato phenotyping, enhancing accuracy, domain adaptation, and stability. Wang et al. [40] introduced YOLO11-PGM, a lightweight pomegranate growth stage detector with MSEE, SSCH, and HSFPN modules, achieving state-of-the-art accuracy and efficiency for practical orchard monitoring.
Deep learning also exhibits strong capability in cotton boll detection and has become a key enabler of automated harvesting. Liu et al. [41] compiled a 1500-image, multi-environment dataset of unopened bolls and proposed MRF-YOLO, an enhanced YOLOX model that incorporates multi-receptive-field extraction and a small object detection layer, markedly improving small object detection and counting (R² = 0.92) and supporting yield prediction and smart agriculture. In parallel, high-performance YOLO models have been applied to UAV-based boll detection. For example, Zhang et al. [42] introduced YOLO-SSPD, an improved YOLOv8 integrating spatial depthwise convolution (SPD-Conv) and a small object head, which significantly increased detection accuracy at the boll opening stage. Similarly, Liu et al. [43] developed a lightweight model for cotton boll detection and yield prediction based on YOLOv8. Notably, Yu et al. [44] proposed a cross-platform detection method (CPD-YOLO) for cotton pests and diseases using UAV and smartphone imaging.
Despite recent progress in cotton boll detection, substantial practical challenges remain. Field scenes present multiple sources of interference—interlaced foliage occlusions, illumination fluctuations, weather variability, and soil background clutter—that can induce false positives (e.g., reflective leaves mistaken for bolls) and false negatives (missed occluded targets) [45]. Moreover, wide geographic dispersion of cotton and pronounced ecological and dataset differences across production regions can lead to performance degradation under cross-regional deployment, revealing limited generalization. Deep learning methods also rely on large, high-quality annotations, yet public datasets for cotton imagery are scarce and often insufficient for deep model training; lengthy data collection cycles, high labeling costs, and the need for domain expertise further constrain algorithm development. Most critically, boll morphology itself is challenging: plants have irregular architecture, bolls are embedded within foliage and easily occluded, and targets occupy a small fraction of the image with weak visual cues, all of which complicate precise detection and localization.
To address the above limitations, this study proposes an innovative solution and implements improvements at both the model architecture and data utilization levels; the specific contributions are as follows:
1. Open data resources to support follow-up research: To address the scarcity of high-quality cotton boll imagery, we curated a dataset from existing acquisitions and annotations. Both the dataset and training configuration files are open-source to the research community.
2. Proposed ODConv–SPDConv detection framework: To address the rotated, small-scale characteristics of cotton bolls in field scenes, we integrated Omni-Dimensional Dynamic Convolution (ODConv) and Space-to-Depth Convolution (SPDConv) into the architecture for the first time, enhancing multi-angle and multi-scale feature representation and improving detection accuracy in complex environments.
3. Multi-attention mechanisms for stronger aggregation and representation: We incorporated a Global Attention Mechanism (GAM), Coordinate Attention (CA), and Large Separable Kernel Attention (LSKA) to strengthen global context capture and fine-grained spatial feature extraction, effectively mitigating the weak feature nature of cotton boll targets.
4. Empirical validation with significant performance gains: The results demonstrated superior performance on cotton boll detection, achieving a Precision, Recall, mAP50, mAP95, and F1-score of 93.683%, 86.040%, 93.496%, 72.857%, and 89.692%, respectively, while also maintaining stable detection under multi-scale and rotational disturbances.
Overall, this study advances automated cotton boll recognition by introducing an innovative, high-precision deep learning model, offering a useful reference for future precision agriculture. The remainder of this paper is organized as follows: Section 2 details the methods and materials, including dataset construction and model architecture; Section 3 describes the training hardware, evaluation metrics, and experimental design; Section 4 presents and analyzes the experimental results; Section 5 discusses this study’s strengths, challenges, and potential avenues for future research and optimization; and Section 6 concludes this work.

2. Materials and Methods

2.1. Dataset

The dataset was constructed in eight steps: (1) Morphology review: We systematically surveyed boll states at harvest (flower, fully opened boll, partially opened boll, and defected boll) to define class labels. (2) Open-source collection: We gathered publicly available cotton boll object detection datasets from the Internet to enrich sources and increase sample diversity. (3) Web scraping: Using Python crawlers, we automatically retrieved boll images from multiple websites to further expand scale. (4) Quality control: We filtered out low-resolution, high-noise, and low-information images to ensure data quality. (5) Manual annotation: Volunteers labeled the screened images with bounding boxes and classes, forming the raw web-scraped dataset. (6) Data augmentation: On the raw scraped set, we applied mirroring, brightness/darkness adjustment, and random masking to produce an augmented boll dataset. (7) Integration: We merged the online open-source data with the augmented scraped data to construct the final dataset with diverse samples. (8) Partitioning: We split the final dataset for subsequent experiments. The overall workflow is shown in Figure 2.
In the final constructed dataset for this study, approximately 78% of images were sourced from open-source datasets and about 22% were obtained via web scraping. Before formal acquisition, we conducted broad searches on major platforms (GitHub and Kaggle) to locate open-source cotton boll object detection datasets, followed by Google Scholar queries using the keywords “Cotton,” “Cotton Boll,” and “Cotton Flower” to identify datasets released with published studies. Throughout online retrieval and collection, we applied initial screening criteria that considered dataset provenance and scale, platform reputation and credibility, and the consistency and accuracy of annotations. To ensure rigor, the collected open-source data was cleaned by removing low-quality, mislabeled, or irrelevant images and eliminating duplicate entries across datasets.
To further enrich sample diversity and enhance model generalization, we implemented a Python-based web crawler to automatically collect cotton boll images from public online sources. The crawler combined keyword queries with image filtering strategies to target high-quality imagery. After crawling, four co-authors manually screened the images to remove duplicates, blurred or otherwise low-quality items, and irrelevant content, ensuring accuracy and representativeness. All images were then converted to JPEG and annotated to meet deep learning training requirements. To increase diversity and robustness, we applied data augmentation to the scraped images, including random brightness adjustment, insertion of random masks (randomly generating 1–20 masks of varying sizes), flipping, and Gaussian noise. The augmented web datasets were subsequently integrated with the open-source datasets, yielding a complete set of 4966 images. This final dataset was split 8:2 into a Training Set (3982 images) and a Validation Set (984 images). Representative samples are shown in Figure 3, with pre- and post-augmentation examples in Figure 4.
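To make the augmentation step concrete, the following sketch applies the four operations listed above (brightness adjustment, 1–20 random masks, mirroring, and Gaussian noise) to a single image. The parameter ranges, mask sizes, and noise level are illustrative assumptions rather than the exact values used to build the dataset.

```python
# Minimal sketch of the augmentation pipeline described above. Function names,
# parameter ranges, and the uint8 image convention are assumptions, not the
# authors' exact implementation.
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> np.ndarray:
    """Apply the four augmentations to an HxWx3 uint8 image."""
    img = image.astype(np.float32)

    # Random brightness adjustment (scale range is an assumed value).
    img *= rng.uniform(0.6, 1.4)

    # Insert 1-20 random rectangular masks of varying sizes.
    h, w, _ = img.shape
    for _ in range(rng.integers(1, 21)):
        mh, mw = rng.integers(h // 20, h // 5), rng.integers(w // 20, w // 5)
        y, x = rng.integers(0, h - mh), rng.integers(0, w - mw)
        img[y:y + mh, x:x + mw] = 0  # black occlusion patch

    # Random horizontal flip (mirroring).
    if rng.random() < 0.5:
        img = img[:, ::-1]

    # Additive Gaussian noise (sigma is an assumed value).
    img += rng.normal(0.0, 8.0, size=img.shape)

    return np.clip(img, 0, 255).astype(np.uint8)
```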

2.2. YOLOv11n

In cotton boll detection, models must simultaneously deliver high recognition accuracy, fast inference, and easy deployment to support field applications. The YOLO (You Only Look Once) family comprises widely used single-stage detectors [37,46,47], which are valued for speed, architectural simplicity, and strong accuracy, and is broadly applied to agricultural targets [2]. YOLOv11, a recent iteration, preserves the efficient architecture while introducing structural refinements that enhance small object performance and facilitate deployment on resource-constrained devices [48]. Its lightweight variant, YOLOv11n, offers strong overall performance for cotton boll detection, a task requiring fine-grained detail and precise localization.
YOLOv11n (Figure 5) features a small model size, low computational complexity, and excellent inference speed, making it well suited for deployment on resource-constrained agricultural devices [35]. This model integrates efficient residual connections, improved multi-scale feature fusion, and an optimized bounding-box regression loss, enabling robust and generalizable detection under severe illumination changes, large morphological variability, and cluttered backgrounds in cotton fields [49]. Compared with two-stage detectors such as Faster R-CNN [50,51], its end-to-end design avoids redundant proposal generation and markedly increases efficiency; relative to DETR-style models [52], it is simpler, easier to train, and more flexible and efficient to deploy on field equipment.
YOLOv11n employs a structured multi-task loss to jointly optimize detection accuracy and confidence prediction, as shown in Equation (1):
L_{\mathrm{YOLO}} = \lambda_{\mathrm{box}} \cdot L_{\mathrm{box}} + \lambda_{\mathrm{cls}} \cdot L_{\mathrm{cls}} + \lambda_{\mathrm{obj}} \cdot L_{\mathrm{obj}}
where $L_{\mathrm{box}}$ is the bounding-box regression loss, typically evaluated with CIoU or DIoU to measure the overlap between predicted and ground-truth boxes; $L_{\mathrm{cls}}$ is the classification loss; and $L_{\mathrm{obj}}$ is the objectness (presence) loss. The three terms are combined with weights ($\lambda_{\mathrm{box}}$, $\lambda_{\mathrm{cls}}$, $\lambda_{\mathrm{obj}}$) to jointly optimize the model and strengthen overall performance in multi-object detection.
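To make Equation (1) concrete, the brief sketch below combines the three loss terms with their weights. The lambda values and the scalar example losses are illustrative assumptions, not YOLOv11n's exact defaults.

```python
# A minimal sketch of the weighted multi-task loss in Equation (1). The lambda
# values are assumed illustrative weights; the component losses are supplied by
# the detection head (e.g., CIoU for boxes, BCE for classes and objectness).
import torch

def yolo_total_loss(l_box: torch.Tensor,   # bounding-box regression loss
                    l_cls: torch.Tensor,   # classification loss
                    l_obj: torch.Tensor,   # objectness / presence loss
                    lambda_box: float = 7.5,
                    lambda_cls: float = 0.5,
                    lambda_obj: float = 1.0) -> torch.Tensor:
    """Weighted sum of the three loss terms, as in Equation (1)."""
    return lambda_box * l_box + lambda_cls * l_cls + lambda_obj * l_obj

# Example: combining scalar per-batch losses.
total = yolo_total_loss(torch.tensor(0.42), torch.tensor(0.18), torch.tensor(0.07))
```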
Compared with YOLOv7 and YOLOv8, YOLOv11n reduces model size without compromising detection accuracy, thereby lowering deployment barriers and enhancing real-time performance on edge platforms such as mobile devices and UAVs [49]. It is particularly suitable for large-scale cotton field operations requiring real-time classification of “fully opened bolls,” “partially opened bolls,” “defected bolls,” and “flowers.” For downstream tasks—including intelligent picking, maturity assessment, and phenotypic modeling—YOLOv11n provides stable, efficient visual recognition and thus offers key technical support for frontline, production-oriented intelligent systems.
Beyond the YOLO family, models such as RetinaNet, EfficientDet, and RT-DETR are also used for object detection. However, RetinaNet adapts poorly to scenarios requiring high detection speed [53]; EfficientDet, while offering strong compression, has a complex architecture and depends heavily on training strategies and tuning expertise [54]; and Transformer-based models like RT-DETR, despite strong accuracy, incur higher inference latency that hinders real-time deployment [55,56]. By contrast, YOLOv11n provides superior detection efficiency, model compression, and platform adaptability, making it one of the most practical choices for cotton boll detection. Accordingly, we adopted YOLOv11n as the baseline for subsequent module optimization and deployment experiments.

2.3. Convolutions

2.3.1. Omni-Dimensional Dynamic Convolution

To better capture diverse spatial structures and texture orientations, we incorporated an Omni-Dimensional Dynamic Convolution (ODConv) module into the YOLOv11n backbone. Proposed by Li et al. [57], ODConv dynamically models convolutional kernels along spatial, input channel, output channel, and kernel number dimensions. Compared with conventional convolutions or channel attention schemes, it substantially enhances directional selectivity and context modeling while preserving computational efficiency.
ODConv augments standard convolution by assigning learnable attention weights to four dimensions—spatial extent, input channels, output channels, and kernel cardinality—and synthesizing the effective kernel through weighted fusion. This enables finer-grained, adaptive modeling of the input features. The core formulation is provided in Equation (2).
\mathrm{ODConv}(X) = \sum_{i=1}^{K} \alpha_i \cdot \left( W_i * X \right)
where $X$ is the input feature map; $W_i$ is the $i$-th learnable convolution kernel; $\alpha_i \in \mathbb{R}$ is the dynamic weight produced by a directional attention mechanism; $*$ denotes standard convolution; and $K$ is the number of kernels.
Specifically, ODConv first applies global average pooling and a lightweight attention module to generate attention vectors for each dimension and then fuses them to obtain the final attention weights α i . In contrast to conventional dynamic convolutions (e.g., CondConv and DYConv) that introduce dynamics only along the channel or kernel cardinality dimension, ODConv performs all-dimensional dynamic fusion, yielding stronger nonlinear modeling capacity. The architecture is shown in Figure 6.
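The following sketch illustrates the kernel-fusion idea of Equation (2) in PyTorch. For brevity, only the kernel-number attention branch is modeled (the full ODConv additionally modulates the spatial, input-channel, and output-channel dimensions), and the shapes, reduction strategy, and initialization are assumptions rather than the exact module used in RA-CottNet.

```python
# Simplified ODConv-style sketch: K candidate kernels fused by input-conditioned
# attention weights, then applied as a single convolution (Equation (2)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleODConv(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 3, num_kernels: int = 4):
        super().__init__()
        # K candidate kernels W_i, each of shape (c_out, c_in, k, k).
        self.weight = nn.Parameter(torch.randn(num_kernels, c_out, c_in, k, k) * 0.02)
        # Lightweight attention: GAP -> FC -> softmax over the K kernels.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c_in, num_kernels), nn.Softmax(dim=1))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alpha = self.attn(x)                                   # (B, K) dynamic weights alpha_i
        # Fuse kernels per sample: sum_i alpha_i * W_i.
        fused = torch.einsum('bk,koihw->boihw', alpha, self.weight)
        b, c, h, w = x.shape
        # Grouped-convolution trick applies a different fused kernel to each sample.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       fused.reshape(-1, c, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.reshape(b, -1, h, w)

y = SimpleODConv(16, 32)(torch.randn(2, 16, 64, 64))           # -> (2, 32, 64, 64)
```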
In cotton boll imagery, targets (e.g., partially opened and fully opened bolls) often exhibit complex structural details, pose variability, and inconsistent orientations. Introducing ODConv allows the model—conditioned on spatial position, channel responses, and feature directionality—to adapt its receptive field and kernel orientation, thereby extracting local and global information more effectively. This markedly improves discrimination of fine structures in cluttered backgrounds; notably, when multi-scale bolls are densely distributed, ODConv strengthens the capture of salient directional cues [58]. These gains come with only modest increases in parameters and computation, making ODConv suitable for module-level optimization in lightweight models. Consequently, it balances detection accuracy and inference efficiency and lays the groundwork for deployment on field edge devices.

2.3.2. Space-to-Depth Convolution

In boll recognition, targets are typically small, densely distributed, and weakly bounded. To enhance sensitivity to such small objects under complex backgrounds and low-resolution imagery, we integrated a Space-to-Depth Convolution (SPDConv) module into the YOLOv11n backbone [59]. Unlike commonly used strided convolutions and pooling in conventional CNNs, SPDConv performs downsampling via a structural replacement that preserves spatial detail, thereby strengthening the model’s capacity to represent small-scale cotton bolls [60].
Specifically, SPDConv consists of two components: a space-to-depth (SPD) rearrangement and a non-strided convolution.
First, the SPD operation partitions the input feature map $X \in \mathbb{R}^{S \times S \times C}$ into non-overlapping sub-blocks according to a downsampling factor $scale$ and reassembles them along the channel dimension to produce a new feature map $X'$, as shown in Equation (3).
X' = \mathrm{SPD}(X; scale) \in \mathbb{R}^{\frac{S}{scale} \times \frac{S}{scale} \times (C \cdot scale^{2})}
Subsequently, a standard stride-1 convolution is applied to $X'$ for compressive encoding, as specified in Equation (4).
X'' = \mathrm{Conv}_{1 \times 1,\, C_{\mathrm{out}}}(X') \in \mathbb{R}^{\frac{S}{scale} \times \frac{S}{scale} \times C_{\mathrm{out}}}
This design performs spatial downsampling while retaining all original feature information, thereby avoiding the sampling bias and edge information loss associated with traditional strided convolutions. The SPDConv architecture is shown in Figure 7.
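A minimal PyTorch sketch of Equations (3) and (4) is given below: nn.PixelUnshuffle performs the space-to-depth rearrangement, followed by a stride-1 1×1 convolution for compressive encoding. The channel sizes and scale = 2 are illustrative assumptions.

```python
# Minimal SPDConv sketch: space-to-depth rearrangement plus a non-strided conv.
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    def __init__(self, c_in: int, c_out: int, scale: int = 2):
        super().__init__()
        # SPD step: (C, S, S) -> (C * scale^2, S/scale, S/scale), Eq. (3).
        self.spd = nn.PixelUnshuffle(scale)
        # Non-strided 1x1 convolution for compressive encoding, Eq. (4).
        self.conv = nn.Conv2d(c_in * scale ** 2, c_out, kernel_size=1, stride=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.spd(x))

x = torch.randn(1, 64, 80, 80)
print(SPDConv(64, 128)(x).shape)   # torch.Size([1, 128, 40, 40])
```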
In natural cotton field scenes, some bolls appear blurred and are easily missed due to low image resolution or their intrinsically small size. Incorporating SPDConv alleviates the loss of small object features introduced by downsampling during feature extraction, preserving strong discriminative capability under low-resolution inputs. SPDConv further provides an interpretable alternative for spatial compression in lightweight models and can be flexibly embedded into diverse detection architectures. Without relying on higher-resolution inputs, adding SPDConv to YOLO-family models yields significant improvements in small object detection accuracy and exhibits strong potential for practical deployment.

2.4. Attention Mechanisms

2.4.1. Coordinate Attention Mechanism

To strengthen joint modeling of spatial structure and channel semantics, we incorporated a Coordinate Attention (CA) module into the improved YOLOv11n. CA is a lightweight attention mechanism [61] that, unlike conventional channel attention (e.g., SE [62] and CBAM [63]), preserves positional encoding while modeling inter-channel dependencies, thereby enhancing structural expressiveness and spatial discrimination. This makes it particularly suitable for crop image analysis in agricultural scenes with pronounced structural features.
Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, the CA module performs global average pooling separately along the height ($H$) and width ($W$) dimensions to produce directional context vectors $z^{h} \in \mathbb{R}^{C \times H \times 1}$ and $z^{w} \in \mathbb{R}^{C \times 1 \times W}$, as shown in Equations (5) and (6).
z^{h}(c, h) = \frac{1}{W} \sum_{i=1}^{W} X(c, h, i)
z^{w}(c, w) = \frac{1}{H} \sum_{j=1}^{H} X(c, j, w)
The two directional features are then concatenated and compressed to generate attention weights $\alpha^{h}$ and $\alpha^{w}$. These weights are applied elementwise to the input feature map to produce the enhanced output $Y$, which integrates spatial positional encoding, as given in Equation (7).
Y(c, h, w) = X(c, h, w) \cdot \alpha^{h}(c, h) \cdot \alpha^{w}(c, w)
This mechanism effectively mitigates the positional information loss inherent to conventional global pooling and enhances the model’s sensitivity to fine-grained spatial structures. The CA module architecture is shown in Figure 8a.
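The sketch below is a minimal PyTorch rendering of Equations (5)–(7): directional pooling along H and W, joint compression, and elementwise reweighting of the input. The reduction ratio and the sigmoid gating follow common CA implementations and are assumptions here, not necessarily the exact configuration used in RA-CottNet.

```python
# Minimal Coordinate Attention sketch following Equations (5)-(7).
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        mid = max(8, channels // r)
        self.compress = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.to_h = nn.Conv2d(mid, channels, 1)
        self.to_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                        # Eq. (5): (B, C, H, 1)
        z_w = x.mean(dim=2, keepdim=True)                        # Eq. (6): (B, C, 1, W)
        y = self.compress(torch.cat([z_h, z_w.transpose(2, 3)], dim=2))  # joint compression
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.to_h(y_h))                      # directional weights along H
        a_w = torch.sigmoid(self.to_w(y_w.transpose(2, 3)))      # directional weights along W
        return x * a_h * a_w                                     # Eq. (7)

out = CoordAttention(64)(torch.randn(1, 64, 32, 48))             # same shape as the input
```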
The CA module is lightweight and modular, allowing insertion at multiple points in compact detectors such as YOLOv11n with minimal impact on overall computational complexity. Its strong positional retention and semantic enhancement make it well suited for deployment on resource-constrained agricultural edge platforms (e.g., UAVs and UGVs), meeting stringent real-time and accuracy requirements for target detection.

2.4.2. Optimized Global Attention Mechanism

During deep semantic feature modeling, it is essential to fuse global context from disparate spatial regions to better perceive complex structures and multi-scale targets. Accordingly, we inserted a Global Attention Mechanism (GAM) [64] into the YOLOv11n backbone to enhance global representation and cross-channel interactions in deep features. Unlike the original GAM, we changed its internal activation from Sigmoid [65] to ReLU [66], thereby reducing the computational burden of the nonlinear transformations and making the model more suitable for lightweight agricultural object detection tasks.
Specifically, the GAM comprises two branches: a Spatial Attention Branch and a Channel Attention Branch. Given an input feature map X R C × H × W , it models global contextual dependencies along both spatial and channel dimensions and fuses them to produce the enhanced output Y. Structurally, the channel branch extracts global channel statistics and uses fully connected layers to compress them and generate attention weights, while the spatial branch applies convolutional operations to model the spatial dimension and obtain a spatial response map. The outputs of the two branches are jointly applied to the input to yield an enhanced representation with strong global context awareness. To reduce computational complexity and improve inference efficiency, we replaced the original GAM’s Sigmoid activation with ReLU. This preserves attention distribution capability while mitigating gradient saturation and eliminating the exponential computation overhead, making the module better suited to resource-constrained agricultural devices. In UAVs and field edge terminals that demand high real-time performance and low power, this modification further improves the GAM’s practicality and the overall system efficiency. The improved architecture is shown in Figure 8b.
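As a reading aid, the sketch below shows a simplified two-branch GAM with the Sigmoid gate replaced by ReLU, mirroring the modification described above. The reduction ratio, the 7×7 spatial kernels, and the use of global pooling in the channel branch are simplifying assumptions for illustration.

```python
# Simplified sketch of the modified GAM: channel branch (global statistics plus
# fully connected compression) and spatial branch (convolutional), with ReLU
# gating instead of the original Sigmoid.
import torch
import torch.nn as nn

class GAMReLU(nn.Module):
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        mid = channels // r
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, mid), nn.ReLU(inplace=True), nn.Linear(mid, channels))
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, mid, 7, padding=3), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 7, padding=3), nn.BatchNorm2d(channels))
        self.gate = nn.ReLU(inplace=True)      # replaces Sigmoid to avoid exponential ops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel branch: per-channel global statistics -> FC -> gating weights.
        ca = self.gate(self.channel_mlp(x.mean(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ca
        # Spatial branch: convolutional response map -> gating.
        return x * self.gate(self.spatial(x))

y = GAMReLU(64)(torch.randn(1, 64, 40, 40))    # same spatial size as the input
```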

2.4.3. Large Separable Kernel Attention

To strengthen global context modeling and long-range dependencies at the tail of the backbone, we introduced a fused SPPF-LSKA module at the deepest layer of YOLOv11n. Built upon the YOLOv11 series’ Spatial Pyramid Pooling—Fast (SPPF) structure and augmented with Large Separable Kernel Attention (LSKA) [67], this design preserves lightweight computation while enhancing deep feature maps’ sensitivity to large-scale targets and global structure.
As shown in Figure 8c, Large Separable Kernel Attention (LSKA) is a feature extraction module that combines the strengths of large-kernel convolution and attention. By adopting depthwise separable convolutions, it reduces computational complexity, while enlarged kernels expand the receptive field to enhance sensitivity to spatial structure and fine details. Coupled with a spatial attention mechanism, LSKA enables the convolution to focus on salient information over a broader region, thereby improving the understanding of complex imagery [68]. The attention generation and enhancement process is formalized in Equations (8)–(11).
Modeling long-range dependencies via directional separable convolution (Equation (8)):
\bar{Z}^{C} = \sum_{H,W} W^{C}_{(2d-1) \times 1} * \left( \sum_{H,W} W^{C}_{1 \times (2d-1)} * F^{C} \right)
Dilated convolution expands the receptive field and strengthens representational capacity (Equation (9)):
Z^{C} = \sum_{H,W} W^{C}_{\lceil k/d \rceil \times 1} * \left( \sum_{H,W} W^{C}_{1 \times \lceil k/d \rceil} * \bar{Z}^{C} \right)
Channel attention map generation (Equation (10)):
A^{C} = W_{1 \times 1} * Z^{C}
Elementwise weighting to enhance the input feature map (Equation (11)):
\bar{F}^{C} = A^{C} \otimes F^{C}
where $F^{C}$ is the original input feature map, $A^{C}$ is the channel attention map, $\bar{F}^{C}$ is the enhanced output, and $\otimes$ denotes elementwise multiplication.
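A compact PyTorch sketch of Equations (8)–(11) follows: cascaded one-dimensional depthwise convolutions, dilated one-dimensional depthwise convolutions, a 1×1 convolution that produces the attention map, and elementwise weighting of the input. The specific kernel sizes (5 and 7 with dilation 3) are assumed example values, not necessarily those used in RA-CottNet.

```python
# Minimal LSKA sketch following Equations (8)-(11).
import torch
import torch.nn as nn

class LSKA(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.dw_h = nn.Conv2d(c, c, (1, 5), padding=(0, 2), groups=c)   # Eq. (8), horizontal
        self.dw_v = nn.Conv2d(c, c, (5, 1), padding=(2, 0), groups=c)   # Eq. (8), vertical
        self.dwd_h = nn.Conv2d(c, c, (1, 7), padding=(0, 9), dilation=3, groups=c)  # Eq. (9)
        self.dwd_v = nn.Conv2d(c, c, (7, 1), padding=(9, 0), dilation=3, groups=c)  # Eq. (9)
        self.pw = nn.Conv2d(c, c, 1)                                    # Eq. (10): 1x1 conv

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        z_bar = self.dw_v(self.dw_h(f))       # long-range context via separable convs
        z = self.dwd_v(self.dwd_h(z_bar))     # dilated convs enlarge the receptive field
        a = self.pw(z)                        # attention map
        return a * f                          # Eq. (11): elementwise weighting

y = LSKA(128)(torch.randn(1, 128, 20, 20))    # same shape as the input
```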
LSKA employs lightweight large-kernel modeling to markedly expand the receptive field while substantially reducing computational cost, making it suitable for resource-constrained embedded agricultural platforms. In scenes with complex backgrounds and for small object detection, it offers strong feature expressiveness and serves as a valuable complement to conventional convolutional networks, particularly for vision tasks that must balance efficiency and accuracy [69]. When combined with the SPPF structure, LSKA enhances global spatial modeling without sacrificing the advantages of the YOLO architecture. Incorporating this module yields clear gains in overall accuracy and robustness to small objects in cotton boll detection, especially for densely distributed targets with subtle textural differences in complex field imagery.

2.5. Proposed Model (RA-CottNet)

To address small target sizes, fine-grained class distinctions, and cluttered backgrounds in cotton boll detection, we propose a lightweight, high-precision model built on YOLOv11n—the Rotation- and Attention-enhanced Cotton Detection Network (RA-CottNet). As illustrated in Figure 9, RA-CottNet integrates directional dynamic convolution, spatial reconfiguration convolution, and multi-scale attention mechanisms to provide direction-aware, attention-guided feature learning, markedly improving robustness for multi-class boll recognition in complex field scenes.
RA-CottNet preserves the overall YOLOv11n backbone while replacing and augmenting modules to meet the semantic demands of different feature stages. At the input stage, standard convolutions are replaced with ODConv, which assigns learnable dynamic weights across spatial, input channel, output channel, and kernel dimensions to enable joint multi-dimensional modeling and improve adaptation to directional structural variations in the imagery. In the shallow feature extraction stage, a CA module encodes both channel and spatial information, strengthening focus on object boundaries and localization regions while keeping the network lightweight. For mid- to high-level semantic modeling, portions of the original convolutions are retained to control computational cost, and two SPDConv modules replace the original downsampling convolutions: spatial information is rearranged into the channel dimension prior to convolution to achieve lossless downsampling, thereby preserving key small object structures (e.g., partially opened bolls) that would otherwise be compressed early.
In the deep backbone, RA-CottNet incorporates a GAM to aggregate global contextual semantics, enhancing responsiveness to distant targets, blurred instances, and cluttered background regions. To further reduce computational complexity, we replaced the GAM’s original Sigmoid activation with ReLU, removing exponential operations and improving deployability. Finally, the SPPF layer is augmented with LSKA, which strengthens global perception and target separability in the backbone outputs while maintaining a low computational cost.
In the neck, RA-CottNet retains YOLOv11n’s multi-scale decoding architecture. Before fusing high- and mid-level features, an additional CA module is inserted to improve cross-layer feature alignment and structural selectivity prior to concatenation.
Taken together, RA-CottNet maintains a bounded parameter count and inference latency while, through module-level optimizations, enhancing spatial expressivity, directional modeling, and semantic focus. These advances provide a reliable visual perception basis for more efficient, fine-grained cotton boll detection and the subsequent development of intelligent picking systems.

3. Experiments and Evaluation Metrics

3.1. Model Training Device and Parameter Setup

Model training was conducted on an NVIDIA A100 GPU to ensure efficient computation. We used a batch size of 64, with preprocessing and augmentation parallelized on a 16-core CPU to improve data throughput. The initial learning rate was set to 0.001, the optimizer was AdamW, and L2 regularization (weight decay = 0.0005) was applied to mitigate gradient explosion and enhance generalization. Training proceeded for up to 800 epochs with early stopping [70]: if validation performance showed no substantial improvement for 50 consecutive epochs (patience = 50), training was terminated to conserve resources. Mixed-precision training was enabled to accelerate computation and reduce GPU memory usage.
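For reference, the training setup above can be expressed with the Ultralytics training API as sketched below. The model and dataset YAML paths are placeholders, and the arguments simply mirror the hyperparameters stated in the text.

```python
# Hedged sketch of the training configuration; paths are placeholders.
from ultralytics import YOLO

model = YOLO("ra-cottnet.yaml")        # placeholder path to the modified model definition

model.train(
    data="cotton_boll.yaml",           # placeholder dataset config (train/val splits, 4 classes)
    epochs=800,                        # upper bound on training epochs
    patience=50,                       # early stopping after 50 epochs without improvement
    batch=64,
    optimizer="AdamW",
    lr0=0.001,                         # initial learning rate
    weight_decay=0.0005,               # L2 regularization
    workers=16,                        # 16-core CPU for parallel preprocessing/augmentation
    amp=True,                          # mixed-precision training
    device=0,                          # single NVIDIA A100 GPU
)
```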

3.2. Model Evaluation Experiment

3.2.1. Baseline Models for Comparative Evaluation

To comprehensively validate the effectiveness and overall advantages of the proposed cotton boll recognition model (RA-CottNet), we benchmarked it against a broad set of mainstream detectors—covering multiple versions and scales of the YOLO family and representative Transformer-based models from the RT-DETR series—to ensure comprehensive evaluation dimensions and objective results.
For the YOLO family, we evaluated YOLOv8, YOLOv11, and YOLOv12 across five scales (n, s, m, l, x). YOLOv11 [48] has seen broad adoption; by refining the backbone, introducing efficient multi-scale fusion, and improving the decoder, it achieves a strong accuracy–speed balance [71], with scale variants spanning ultra-low-resource to high-accuracy use cases. YOLOv8 [72] emphasizes architectural lightweighting and training strategy optimization, combining deployment flexibility with solid performance; its five variants are well suited to both mobile and server platforms. YOLOv12 [73] further advances the detection head and feature fusion design and pairs these with inference optimizations, delivering an excellent speed–accuracy trade-off across scales.
For Transformer-based detectors, we adopted RT-DETR-50 and RT-DETR-101 as representatives [74,75]. RT-DETR retains DETR’s end-to-end paradigm while introducing efficient query selection and structured feature encoding, substantially improving inference efficiency and detection accuracy. RT-DETR-50 targets real-time scenarios with constrained computation, whereas RT-DETR-101 further strengthens global feature modeling and detection performance while maintaining high throughput [76,77]. Incorporating these diverse baselines—spanning categories, scales, and architectures—enables a comprehensive and objective comparison of RA-CottNet’s overall performance on cotton boll recognition.

3.2.2. Evaluation Metrics

To evaluate RA-CottNet in cotton boll detection, we adopted a standard suite of metrics: Precision, Recall, F1-score, mean average precision (mAP), and the confusion matrix. Precision quantifies the accuracy of positive-class predictions; Recall measures the detection rate of the positive class; F1-score is the harmonic mean of Precision and Recall; and mAP provides an overall assessment of detection performance, reported as mAP50 (AP at IoU = 0.5) and mAP95 (mean AP over IoU thresholds from 0.5 to 0.95 in 0.05 increments). The confusion matrix offers a visual summary of the model's per-class classification behavior. The formulas for Recall, Precision, F1-score, and mAP are shown in Equations (12)–(15).
\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%
\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%
F1\text{-}score = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
mAP = \frac{\sum_{n=1}^{N} AP(n)}{N} \times 100\%
where $TP$ denotes true positives; $FN$ denotes false negatives; $FP$ denotes false positives; and $N$ denotes the number of classes.
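A direct implementation of Equations (12)–(15) is given below; the counts and AP values in the usage example are illustrative only.

```python
# Minimal implementations of Equations (12)-(15) from per-class counts. AP(n)
# values would normally come from the precision-recall curve of class n; here
# they are passed in directly for illustration.
from typing import Sequence

def precision(tp: int, fp: int) -> float:
    return 100.0 * tp / (tp + fp)                          # Eq. (12)

def recall(tp: int, fn: int) -> float:
    return 100.0 * tp / (tp + fn)                          # Eq. (13)

def f1_score(p: float, r: float) -> float:
    return 2.0 * p * r / (p + r)                           # Eq. (14)

def mean_ap(ap_per_class: Sequence[float]) -> float:
    return 100.0 * sum(ap_per_class) / len(ap_per_class)   # Eq. (15)

p, r = precision(860, 58), recall(860, 140)                # illustrative counts only
print(round(f1_score(p, r), 2))
```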

3.3. Ablation Experiment

To comprehensively assess the effectiveness and performance contribution of each core module in the proposed cotton boll recognition model (RA-CottNet), we conducted systematic ablation experiments. Seven variants were constructed and compared: (1) RA-CottNet (full model: YOLOv11n + ODConv + SPDConv + GAM + CA + LSKA); (2) Without_CA (YOLOv11n + ODConv + SPDConv + GAM + LSKA); (3) Without_GAM (YOLOv11n + ODConv + SPDConv + CA + LSKA); (4) Without_LSKA (YOLOv11n + ODConv + SPDConv + GAM + CA); (5) Without_All_Attn (all attention modules removed: YOLOv11n + ODConv + SPDConv); (6) Without_ODConv (YOLOv11n + SPDConv + GAM + CA + LSKA); and (7) Without_SPDConv (YOLOv11n + ODConv + GAM + CA + LSKA).
The ablation experiments were designed to quantify the impact of individual modules on detection performance and generalization and to reveal the benefits of multi-module synergy. By removing specific components while keeping the rest of the architecture fixed, we directly compared each module’s contribution to enhancing feature expressiveness, improving small object detection, and suppressing interference from complex backgrounds.
For evaluation, we used six key indicators—Precision, Recall, mAP50, mAP95, F1-score, and the time to reach the best model checkpoint (Time)—to analyze each variant across detection accuracy, recall capacity, multi-scale adaptability, and overall balance. The results not only confirm the critical role of each module in boosting performance, but also offer a reliable reference for future modular optimization and deployment in resource-constrained cotton field settings.

4. Results

4.1. Results of Model Evaluation Experiment

The multi-model comparison results for cotton boll recognition are shown in Table 1. Given that this study seeks to provide a reliable perceptual basis for automated harvesting and yield estimation, Precision was treated as the primary metric. The rationale is that missed detections (low Recall) can be corrected later during harvesting or tallying through manual intervention, whereas false positives (low Precision) may directly trigger erroneous picking commands, damaging partially opened bolls and flowers, or lead to the mistaken harvesting or counting of defected cotton bolls, thereby causing actual yield loss. Thus, in this task, high Precision is both a prerequisite for operational reliability and a cornerstone of accurate yield estimation.
Experimental results show that RA-CottNet excels in Precision, reaching 93.683%—a 4.5% improvement over YOLOv11n—and outperforming all comparison models, including YOLOv11, YOLOv8, YOLOv12, and RT-DETR. Its Recall is 86.040%, slightly below some YOLOv11 and YOLOv12 variants; nevertheless, the F1-score remains high at 89.692%, indicating that RA-CottNet preserves its precision advantage while maintaining strong overall detection performance.
Regarding mAP, RA-CottNet attains an mAP50 of 93.496%, leading among the mainstream detectors in our comparisons; its mAP95 reaches 72.857%, indicating stable detection and good generalization across different IoU thresholds. In addition, the time to reach the best model checkpoint (i.e., to generate the best.pt weight file) is 2565.07 s, markedly shorter than mid-/large-scale models such as YOLOv11m/l/x and YOLOv12m/x, demonstrating faster training convergence while maintaining high accuracy. Figure 10 depicts RA-CottNet's performance trajectory, and the confusion matrix in Figure 11 further corroborates its strong performance in cotton boll recognition.
In sum, RA-CottNet shows clear advantages for precision-prioritized cotton boll detection, effectively reducing false positives and thereby mitigating yield loss risks from erroneous harvesting. Combined with its high mAP and moderate training overhead, these results indicate strong practical potential and feasibility for deployment in real cotton fields for automated harvesting and phenotypic analysis.

4.2. Results of Ablation Experiment

To assess the practical role of each improvement and its impact on boll recognition, we performed ablations by removing, in turn, Coordinate Attention (CA), the Global Attention Mechanism (GAM), Large Separable Kernel Attention (LSKA), all attention modules (All Attentions), ODConv, and SPDConv, yielding six model variants; the comparative results are summarized in Table 2. As in Section 4.1, Precision is treated as the primary metric for cotton boll detection: high precision prevents false positives that could trigger erroneous picking and yield loss, whereas missed detections can be corrected later through manual supplementation.
In the ablation experiments, the full RA-CottNet achieved the best Precision (93.683%), outperforming every variant with any module removed, underscoring the importance of cross-module synergy in reducing false positives. The LSKA module contributed most: removing it lowered Precision to 92.697%, with the F1-score changing from 89.692% to 89.754%, indicating LSKA's key role in enlarging the receptive field, enriching fine-grained features, and distinguishing subtly textured bolls from the background. ODConv also delivered notable gains in both Precision and Recall; without it, Precision fell to 92.164% and Recall to 85.896%, highlighting the advantage of dynamic kernels under varying field scenes and illumination. CA and the GAM were complementary: removing CA or the GAM reduced Precision to 88.092% and 89.983%, respectively, with CA having the larger impact—consistent with the need to suppress background interference and maintain stable detection. SPDConv had a comparatively milder effect on Precision (89.812% when removed) yet remained beneficial for multi-scale feature representation. Notably, eliminating all attention mechanisms caused the most severe degradation, with Precision dropping to 89.411% and mAP50 to 92.049%, reaffirming the indispensable role of multi-attention schemes in fusing global and local information. However, it should be noted that many models, including the baseline model, exhibited relatively limited performance in certain cases, which can be largely attributed to practical challenges such as leaf occlusion, reflective glare, and complex field backgrounds that obscure or distort boll features.
In terms of training efficiency, the variant with all attention mechanisms removed converged the fastest (1209.43 s) but exhibited substantial drops in Precision and overall performance, highlighting the trade-off between accuracy and training time. By contrast, RA-CottNet attains the highest Precision while keeping convergence time (2565.07 s) within an acceptable range, demonstrating a favorable balance of accuracy and efficiency for practical deployment.
In summary, the ablation results substantiate the effectiveness of all modules in improving cotton boll detection accuracy, suppressing false positives, and enhancing model robustness. Notably, the combination of LSKA, ODConv, and multiple attention mechanisms contributes most to Precision, a gain that is critical for ensuring accurate automated harvesting and reliable yield estimation.

5. Discussion

5.1. Advantages

Overall, RA-CottNet demonstrates a marked performance advantage for cotton boll recognition, achieving the highest Precision among all baselines (93.683%)—a critical factor for automated harvesting and yield estimation. Relative to lightweight baselines such as YOLOv11n and YOLOv8n, it sustains a moderate Recall (86.040%) and a strong F1-score (89.692%), reducing false positives while preserving overall detection quality. Ablation studies further confirm the contribution of each component: LSKA provides the largest gains by enlarging the receptive field and enriching fine-grained features; ODConv excels in adapting to diverse field scenes and illumination; and the combination of multiple attention mechanisms enhances feature discriminability and robustness. This synergistic, multi-module design maintains stable performance under complex backgrounds, varying lighting, and dense target distributions, while striking a favorable balance between convergence speed and computational cost—thereby laying the groundwork for deployment on UAVs and embedded terminals.

5.2. Challenges

Although RA-CottNet excels in Precision, its Recall is slightly lower than some YOLOv11 and YOLOv12 variants, indicating a risk of missed detections under extreme occlusion, weak texture, or low-light conditions. While such omissions can be corrected manually, they may still reduce overall efficiency in large-scale operations. Moreover, although the training time (2565.07 s) is reasonable given the achieved accuracy, the incorporation of multiple attention mechanisms and custom convolutions inevitably increases architectural complexity, which may require further optimization on severely resource-constrained platforms. In addition, many models, including the baseline model, exhibited relatively limited performance in certain scenarios, largely due to practical challenges such as leaf occlusion, reflective glare, and complex field backgrounds that obscure or distort boll features. Finally, the current experiments were based on a single dataset; generalization to other cotton varieties, different acquisition seasons, or cross-regional settings remains insufficiently validated and warrants further study prior to broad deployment.

5.3. Future Perspectives

To overcome these limitations and broaden practical applicability, future work can proceed along four directions:
  • Boost Recall and performance in complex scenes: While maintaining high Precision, introduce lightweight attention or improved feature fusion strategies [27] to raise Recall and reduce missed detections, especially under weak textures and heavy occlusion.
  • Enhance robustness via multimodal fusion: Integrate RGB, near-infrared (NIR), and hyperspectral imagery to handle challenging illumination and background variation, thereby improving adaptability and stability across diverse cotton field conditions.
  • Deployment and lightweight optimization: Apply pruning, quantization, and knowledge distillation at deployment to cut computational cost and meet real-time inference needs on resource-limited edge devices such as UAVs and handheld terminals.
  • Cross-domain generalization and large-scale testing: Conduct extensive field trials across regions, climates, and cotton varieties to systematically assess cross-domain generalization and ensure stable performance of RA-CottNet in varied agricultural production scenarios.

6. Conclusions

Cotton is the most widely cultivated natural fiber crop and plays a major role in economic development in both producing and consuming countries. Yet harvesting remains challenging: manual picking is inefficient and time-consuming and easily misses the optimal window, causing quality degradation and yield loss. Accurate recognition of cotton bolls and their maturity is therefore a prerequisite to automated harvesting and yield estimation, and it also benefits breeding by enabling selection of high-yield varieties. To advance automation and intelligence in boll recognition, we propose RA-CottNet, a high-precision, direction-aware, and attention-guided detector, and release an open-source boll dataset with 4966 annotated images. Methodologically, RA-CottNet builds on YOLOv11n, augmenting directional and spatial representation with ODConv (directional dynamic convolution) and SPDConv (spatial reconfiguration convolution) and incorporating three attention mechanisms—CoordAttention (CA), a modified Global Attention Mechanism (GAM), and Large Separable Kernel Attention (LSKA)—to address small object size, fine-grained class differences, and cluttered backgrounds. On our dataset, RA-CottNet delivered strong accuracy and stability, achieving a Precision of 93.683%, Recall of 86.040%, mAP50 of 93.496%, mAP95 of 72.857%, and F1-score of 89.692%, while remaining robust under multi-scale variation and rotational perturbations. Overall, the model combines high accuracy with real-time performance, is suitable for deployment on agricultural edge devices, and provides solid technical support for automated field harvesting and yield assessment, as well as an efficient, reliable phenotypic recognition tool for cotton breeding.

Author Contributions

Conceptualization, K.C. and R.-F.W.; methodology, K.C. and R.-F.W.; software, R.-F.W.; validation, R.-F.W. and M.X.; formal analysis, R.-F.W. and K.C.; investigation, R.-F.W., Y.-M.Q., and Y.-Y.Z.; data curation, Y.-M.Q., Y.-Y.Z., and K.C.; writing—original draft preparation, R.-F.W., Y.-M.Q., and Y.-Y.Z.; writing—review and editing, M.X., I.B.S., K.C., and R.-F.W.; visualization, Y.-M.Q., Y.-Y.Z., and R.-F.W.; supervision, K.C. and R.-F.W.; project administration, K.C. and R.-F.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The proposed dataset and training configuration can be found at https://github.com/SweefongWong/RA-CottNet (accessed on 15 August 2025).

Acknowledgments

We sincerely thank the two volunteers—Heng-Wei Zhang and Rui-Lan Wang—for their valuable assistance in the creation of the proposed dataset. We also extend our special thanks to the co-authors Yi-Ming Qin and Yi-Yi Zhao for their active participation in data collection, annotation, and organization. All four contributors played a significant role in ensuring the high quality of the dataset. We also express our thanks to Iago Beffart Schardong for his help in polishing our manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Scarpin, G.J.; Bhattarai, A.; Hand, L.C.; Snider, J.L.; Roberts, P.M.; Bastos, L.M. Cotton lint yield and quality variability in Georgia, USA: Understanding genotypic and environmental interactions. Field Crops Res. 2025, 325, 109822. [Google Scholar] [CrossRef]
  2. Yang, Z.Y.; Xia, W.K.; Chu, H.Q.; Su, W.H.; Wang, R.F.; Wang, H. A comprehensive review of deep learning applications in cotton industry: From field monitoring to smart processing. Plants 2025, 14, 1481. [Google Scholar] [CrossRef]
  3. Smith, C.W.; Cothren, J.T. Cotton: Origin, History, Technology, and Production; John Wiley & Sons: Hoboken, NJ, USA, 1999. [Google Scholar]
  4. Huang, G.; Huang, J.Q.; Chen, X.Y.; Zhu, Y.X. Recent advances and future perspectives in cotton research. Annu. Rev. Plant Biol. 2021, 72, 437–462. [Google Scholar] [CrossRef]
  5. Fortucci, P. The contribution of cotton to economy and food security in developing countries. In Proceedings of the Cotton and Global Trade Negotiations Sponsored by the World Bank and ICAC Conference, Washington, DC, USA, 8–9 July 2002; Volume 8, pp. 8–9. [Google Scholar]
  6. Mathangadeera, R.W.; Hequet, E.F.; Kelly, B.; Dever, J.K.; Kelly, C.M. Importance of cotton fiber elongation in fiber processing. Ind. Crops Prod. 2020, 147, 112217. [Google Scholar] [CrossRef]
  7. Krifa, M.; Stevens, S.S. Cotton utilization in conventional and non-conventional textiles—A statistical review. Agric. Sci. 2016, 7, 747–758. [Google Scholar] [CrossRef]
  8. Shahriari Khalaji, M.; Lugoloobi, I. Biomedical application of cotton and its derivatives. In Cotton Science and Processing Technology: Gene, Ginning, Garment and Green Recycling; Springer: Singapore, 2020; pp. 393–416. [Google Scholar]
  9. Gu, S.; Sun, S.; Wang, X.; Wang, S.; Yang, M.; Li, J.; Maimaiti, P.; van der Werf, W.; Evers, J.B.; Zhang, L. Optimizing radiation capture in machine-harvested cotton: A functional-structural plant modelling approach to chemical vs. manual topping strategies. Field Crops Res. 2024, 317, 109553. [Google Scholar] [CrossRef]
  10. Ibrahim, A.A.; Hamoda, S. Effect of planting and harvesting dates on the physiological characteristics of cotton-seed quality. J. Plant Prod. 2021, 12, 1295–1299. [Google Scholar] [CrossRef]
  11. Sanjay, N.A.; Venkatramani, N.; Harinee, V.; Dinesh, V. Cotton harvester through the application of machine learning and image processing techniques. Mater. Today Proc. 2021, 47, 2200–2205. [Google Scholar] [CrossRef]
  12. Pabuayon, I.L.B.; Sun, Y.; Guo, W.; Ritchie, G.L. High-throughput phenotyping in cotton: A review. J. Cotton Res. 2019, 2, 18. [Google Scholar] [CrossRef]
  13. Zhang, Z.; Janvekar, N.A.S.; Feng, P.; Bhaskar, N. Graph-Based Detection of Abusive Computational Nodes. U.S. Patent 12,223,056, 11 February 2025. [Google Scholar]
  14. Li, L.; Li, J.; Wang, H.; Georgieva, T.; Ferentinos, K.; Arvanitis, K.; Sygrimis, N. Sustainable energy management of solar greenhouses using open weather data on MACQU platform. Int. J. Agric. Biol. Eng. 2018, 11, 74–82. [Google Scholar] [CrossRef]
  15. Qin, Y.M.; Tu, Y.H.; Li, T.; Ni, Y.; Wang, R.F.; Wang, H. Deep Learning for sustainable agriculture: A systematic review on applications in lettuce cultivation. Sustainability 2025, 17, 3190. [Google Scholar] [CrossRef]
  16. Zhou, Y.; Xia, H.; Yu, D.; Cheng, J.; Li, J. Outlier detection method based on high-density iteration. Inf. Sci. 2024, 662, 120286. [Google Scholar] [CrossRef]
  17. Wang, J.X.; Fan, L.F.; Wang, H.H.; Zhao, P.F.; Li, H.; Wang, Z.Y.; Huang, L. Determination of the moisture content of fresh meat using visible and near-infrared spatially resolved reflectance spectroscopy. Biosyst. Eng. 2017, 162, 40–56. [Google Scholar] [CrossRef]
  18. Tan, L.; Lu, J.; Jiang, H. Tomato leaf diseases classification based on leaf images: A comparison between classical machine learning and deep learning methods. AgriEngineering 2021, 3, 542–558. [Google Scholar] [CrossRef]
  19. Cui, K.; Tang, W.; Zhu, R.; Wang, M.; Larsen, G.D.; Pauca, V.P.; Alqahtani, S.; Yang, F.; Segurado, D.; Fine, P.; et al. Efficient Localization and Spatial Distribution Modeling of Canopy Palms Using UAV Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4413815. [Google Scholar] [CrossRef]
  20. Wu, A.Q.; Li, K.L.; Song, Z.Y.; Lou, X.; Hu, P.; Yang, W.; Wang, R.F. Deep Learning for Sustainable Aquaculture: Opportunities and Challenges. Sustainability 2025, 17, 5084. [Google Scholar] [CrossRef]
  21. Nagano, S.; Moriyuki, S.; Wakamori, K.; Mineno, H.; Fukuda, H. Leaf-movement-based growth prediction model using optical flow analysis and machine learning in plant factory. Front. Plant Sci. 2019, 10, 227. [Google Scholar] [CrossRef]
  22. Sliwa, B.; Piatkowski, N.; Wietfeld, C. LIMITS: Lightweight machine learning for IoT systems with resource limitations. In Proceedings of the ICC 2020–2020 IEEE International Conference on Communications (ICC), Dublin, Ireland, 7–11 June 2020; pp. 1–7. [Google Scholar]
  23. Zhang, S.; Zhao, K.; Huo, Y.; Yao, M.; Xue, L.; Wang, H. Mushroom image classification and recognition based on improved ConvNeXt V2. J. Food Sci. 2025, 90, e70133. [Google Scholar] [CrossRef]
  24. Lu, W.; Wang, J.; Wang, T.; Zhang, K.; Jiang, X.; Zhao, H. Visual style prompt learning using diffusion models for blind face restoration. Pattern Recognit. 2025, 161, 111312. [Google Scholar] [CrossRef]
  25. Zhou, Y.; Liu, C.; Urgaonkar, B.; Wang, Z.; Mueller, M.; Zhang, C.; Zhang, S.; Pfeil, P.; Horn, D.; Liu, Z.; et al. PBench: Workload Synthesizer with Real Statistics for Cloud Analytics Benchmarking. arXiv 2025, arXiv:2506.16379. [Google Scholar] [CrossRef]
  26. Yao, S.; Guan, R.; Wu, Z.; Ni, Y.; Huang, Z.; Liu, R.W.; Yue, Y.; Ding, W.; Lim, E.G.; Seo, H.; et al. Waterscenes: A multi-task 4d radar-camera fusion dataset and benchmarks for autonomous driving on water surfaces. IEEE Trans. Intell. Transp. Syst. 2024, 25, 16584–16598. [Google Scholar] [CrossRef]
  27. Wang, Z.; Zhang, H.W.; Dai, Y.Q.; Cui, K.; Wang, H.; Chee, P.W.; Wang, R.F. Resource-Efficient Cotton Network: A Lightweight Deep Learning Framework for Cotton Disease and Pest Classification. Plants 2025, 14, 2082. [Google Scholar] [CrossRef]
  28. Cynthia, E.P.; Ismanto, E.; Arifandy, M.I.; Sarbaini, S.; Nazaruddin, N.; Manuhutu, M.A.; Akbar, M.A.; Abdiyanto. Convolutional Neural Network and Deep Learning Approach for Image Detection and Identification. J. Phys. Conf. Ser. 2022, 2394, 012019. [Google Scholar]
  29. Cui, K.; Zhu, R.; Wang, M.; Tang, W.; Larsen, G.D.; Pauca, V.P.; Alqahtani, S.; Yang, F.; Segurado, D.; Lutz, D.; et al. Detection and geographic localization of natural objects in the wild: A case study on palms. arXiv 2025, arXiv:2502.13023. [Google Scholar]
  30. Wang, R.F.; Su, W.H. The application of deep learning in the whole potato production Chain: A Comprehensive review. Agriculture 2024, 14, 1225. [Google Scholar] [CrossRef]
  31. Sun, L.; Cui, X.; Fan, X.; Suo, X.; Fan, B.; Zhang, X. Automatic detection of pesticide residues on the surface of lettuce leaves using images of feature wavelengths spectrum. Front. Plant Sci. 2023, 13, 929999. [Google Scholar] [CrossRef] [PubMed]
  32. Tetila, E.C.; Machado, B.B.; Astolfi, G.; de Souza Belete, N.A.; Amorim, W.P.; Roel, A.R.; Pistori, H. Detection and classification of soybean pests using deep learning with UAV images. Comput. Electron. Agric. 2020, 179, 105836. [Google Scholar] [CrossRef]
  33. Machefer, M.; Lemarchand, F.; Bonnefond, V.; Hitchins, A.; Sidiropoulos, P. Mask R-CNN refitting strategy for plant counting and sizing in UAV imagery. Remote Sens. 2020, 12, 3015. [Google Scholar] [CrossRef]
  34. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  35. Badgujar, C.M.; Poulose, A.; Gan, H. Agricultural object detection with You Only Look Once (YOLO) Algorithm: A bibliometric and systematic literature review. Comput. Electron. Agric. 2024, 223, 109090. [Google Scholar] [CrossRef]
  36. Darwin, B.; Dharmaraj, P.; Prince, S.; Popescu, D.E.; Hemanth, D.J. Recognition of bloom/yield in crop images using deep learning models for smart agriculture: A review. Agronomy 2021, 11, 646. [Google Scholar] [CrossRef]
  37. Di, X.; Cui, K.; Wang, R.F. Toward Efficient UAV-Based Small Object Detection: A Lightweight Network with Enhanced Feature Fusion. Remote Sens. 2025, 17, 2235. [Google Scholar] [CrossRef]
  38. Huo, Y.; Wang, R.F.; Zhao, C.T.; Hu, P.; Wang, H. Research on Obtaining Pepper Phenotypic Parameters Based on Improved YOLOX Algorithm. AgriEngineering 2025, 7, 209. [Google Scholar] [CrossRef]
  39. Cardellicchio, A.; Renò, V.; Cellini, F.; Summerer, S.; Petrozza, A.; Milella, A. Incremental Learning with Domain Adaption for Tomato Plant Phenotyping. Smart Agric. Technol. 2025, 12, 101324. [Google Scholar] [CrossRef]
  40. Wang, R.; Chen, Y.; Zhang, G.; Yang, C.; Teng, X.; Zhao, C. YOLO11-PGM: High-Precision Lightweight Pomegranate Growth Monitoring Model for Smart Agriculture. Agronomy 2025, 15, 1123. [Google Scholar] [CrossRef]
  41. Liu, Q.; Zhang, Y.; Yang, G. Small unopened cotton boll counting by detection with MRF-YOLO in the wild. Comput. Electron. Agric. 2023, 204, 107576. [Google Scholar] [CrossRef]
  42. Zhang, M.; Chen, W.; Gao, P.; Li, Y.; Tan, F.; Zhang, Y.; Ruan, S.; Xing, P.; Guo, L. YOLO SSPD: A small target cotton boll detection model during the boll-spitting period based on space-to-depth convolution. Front. Plant Sci. 2024, 15, 1409194. [Google Scholar] [CrossRef]
  43. Xiang, L.; Ruoxue, X.; Chenglong, B.; Min, T.; Mingtian, T.; Kaiwen, H. Lightweight Cotton Boll Detection Model and Yield Prediction Method Based on Improved YOLO v8. Nongye Jixie Xuebao/Trans. Chin. Soc. Agric. Mach. 2025, 56, 130–140. [Google Scholar]
  44. Yu, G.; Ma, B.; Zhang, R.; Xu, Y.; Lian, Y.; Dong, F. CPD-YOLO: A cross-platform detection method for cotton pests and diseases using UAV and smartphone imaging. Ind. Crops Prod. 2025, 234, 121515. [Google Scholar] [CrossRef]
  45. Khan, M.A.; Wahid, A.; Ahmad, M.; Tahir, M.T.; Ahmed, M.; Ahmad, S.; Hasanuzzaman, M. World cotton production and consumption: An overview. In Cotton Production and Uses: Agronomy, Crop Protection, and Postharvest Technologies; Springer: Cham, Switzerland, 2020; pp. 1–7. [Google Scholar]
  46. Hidayatullah, P.; Syakrani, N.; Sholahuddin, M.R.; Gelar, T.; Tubagus, R. YOLOv8 to YOLO11: A comprehensive architecture in-depth comparative review. arXiv 2025, arXiv:2501.13400. [Google Scholar]
  47. Alkhammash, E.H. Multi-classification using YOLOv11 and hybrid YOLO11n-MobileNet models: A fire classes case study. Fire 2025, 8, 17. [Google Scholar] [CrossRef]
  48. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  49. Wang, R.F.; Tu, Y.H.; Li, X.C.; Chen, Z.Q.; Zhao, C.T.; Yang, C.; Su, W.H. An Intelligent Robot Based on Optimized YOLOv11l for Weed Control in Lettuce. In Proceedings of the 2025 ASABE Annual International Meeting. American Society of Agricultural and Biological Engineers, Toronto, ON, Canada, 13–16 July 2025; p. 1. [Google Scholar]
  50. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  51. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  52. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  53. Li, Y.; Ren, F. Light-weight RetinaNet for object detection. arXiv 2019, arXiv:1905.10011. [Google Scholar]
  54. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  55. Zhao, Z.; Chen, S.; Ge, Y.; Yang, P.; Wang, Y.; Song, Y. RT-DETR-Tomato: Tomato target detection algorithm based on improved RT-DETR for agricultural safety production. Appl. Sci. 2024, 14, 6287. [Google Scholar] [CrossRef]
  56. Xie, W.; Zhao, M.; Liu, Y.; Yang, D.; Huang, K.; Fan, C.; Wang, Z. Recent advances in Transformer technology for agriculture: A comprehensive survey. Eng. Appl. Artif. Intell. 2024, 138, 109412. [Google Scholar] [CrossRef]
  57. Li, C.; Zhou, A.; Yao, A. Omni-dimensional dynamic convolution. arXiv 2022, arXiv:2209.07947. [Google Scholar] [CrossRef]
  58. Qian, J.; Lin, J.; Bai, D.; Xu, R.; Lin, H. Omni-dimensional dynamic convolution meets bottleneck transformer: A novel improved high accuracy forest fire smoke detection model. Forests 2023, 14, 838. [Google Scholar] [CrossRef]
  59. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; Springer: Cham, Switzerland, 2022; pp. 443–459. [Google Scholar]
  60. Sun, R.; Fan, H.; Tang, Y.; He, Z.; Xu, Y.; Wu, E. Research on small target detection algorithm for UAV inspection scene based on SPD-conv. In Proceedings of the Fourth International Conference on Computer Vision and Data Mining (ICCVDM 2023), Changchun, China, 20–22 October 2023; SPIE: Bellingham, WA, USA, 2024; Volume 13063, pp. 686–691. [Google Scholar]
  61. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  62. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  63. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  64. Liu, Y.; Shao, Z.; Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar] [CrossRef]
  65. Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef]
  66. Agarap, A.F. Deep learning using rectified linear units (ReLU). arXiv 2018, arXiv:1803.08375. [Google Scholar]
  67. Lau, K.W.; Po, L.M.; Rehman, Y.A.U. Large separable kernel attention: Rethinking the large kernel attention design in cnn. Expert Syst. Appl. 2024, 236, 121352. [Google Scholar] [CrossRef]
  68. Tie, J.; Zhu, C.; Zheng, L.; Wang, H.; Ruan, C.; Wu, M.; Xu, K.; Liu, J. LSKA-YOLOv8: A lightweight steel surface defect detection algorithm based on YOLOv8 improvement. Alex. Eng. J. 2024, 109, 201–212. [Google Scholar] [CrossRef]
  69. Deng, L.; Wu, S.; Zhou, J.; Zou, S.; Liu, Q. LSKA-YOLOv8n-WIoU: An Enhanced YOLOv8n Method for Early Fire Detection in Airplane Hangars. Fire 2025, 8, 67. [Google Scholar] [CrossRef]
  70. Wu, X.X.; Liu, J.G. A new early stopping algorithm for improving neural network generalization. In Proceedings of the 2009 Second International Conference on Intelligent Computation Technology and Automation, Washington, DC, USA, 10–11 October 2009; Volume 1, pp. 15–18. [Google Scholar]
  71. He, L.H.; Zhou, Y.Z.; Liu, L.; Cao, W.; Ma, J.H. Research on object detection and recognition in remote sensing images based on YOLOv11. Sci. Rep. 2025, 15, 14032. [Google Scholar] [CrossRef]
  72. Varghese, R.; Sambath, M. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar]
  73. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  74. Zhang, Y.; Wang, X.; Hu, J.; Zhang, J.; Zhu, P.; Lu, W. Research on small target detection in complex substation environments based on an end-to-end improved RT-DETR model. In Proceedings of the International Conference on Mechatronic Engineering and Artificial Intelligence (MEAI 2024), Shenyang, China, 13–15 December 2024; SPIE: Bellingham, WA, USA, 2025; Volume 13555, pp. 813–821. [Google Scholar]
  75. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  76. Kong, Y.; Shang, X.; Jia, S. Drone-DETR: Efficient small object detection for remote sensing image using enhanced RT-DETR model. Sensors 2024, 24, 5496. [Google Scholar] [CrossRef] [PubMed]
  77. Wu, M.; Qiu, Y.; Wang, W.; Su, X.; Cao, Y.; Bai, Y. Improved RT-DETR and its application to fruit ripeness detection. Front. Plant Sci. 2025, 16, 1423682. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Global cotton production and yield quantities from 1994 to 2023. [FAO dataset, last accessed on 6 June 2025].
Figure 2. Workflow of constructing our proposed cotton boll and flower dataset.
Figure 3. Samples of the proposed cotton boll and flower dataset: (a) cotton flower; (b) defective boll; (c,d) partially opened bolls; (e,f) fully opened bolls.
Figure 4. Samples of our web-crawled cotton boll and flower images after augmentation (original images included). Note: (a) cotton flower; (b) defective boll; (c) partially opened boll; (d) fully opened boll.
Figure 5. The architecture diagram of the YOLOv11n network.
Figure 6. Structural diagram of Omni-Dimensional Dynamic Convolution (ODConv). Note: GAP denotes Global Average Pooling; FC denotes a Fully Connected layer.
Figure 7. Illustration of Space-to-Depth Convolution (SPDConv).
Figure 8. (a) Structural diagram of Coordinate Attention (CA) mechanism; (b) structural diagram of Global Attention Mechanism (GAM); (c) structural diagram of Large Separable Kernel Attention (LSKA) mechanism.
Figure 9. Overall framework of our proposed model (RA-CottNet).
Figure 10. Performance curves of RA-CottNet over training iterations.
Figure 11. Confusion matrix of the trained RA-CottNet model for cotton bolls and flowers.
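To make Figure 11 easier to interpret, the sketch below illustrates how such a per-class confusion matrix can be assembled once each matched detection has been assigned a ground-truth and a predicted class. The class names and label arrays are purely illustrative and are not drawn from the implementation used in this study; they only mirror the four categories shown in Figures 3 and 4.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative class set mirroring the dataset categories
# (flower, defective boll, partially opened boll, fully opened boll).
CLASSES = ["flower", "defective_boll", "partially_opened", "fully_opened"]

# Hypothetical matched labels: one ground-truth/prediction pair per detection.
y_true = np.array([0, 1, 2, 3, 3, 2, 1, 0, 3, 2])
y_pred = np.array([0, 1, 2, 3, 2, 2, 1, 0, 3, 3])

# Rows correspond to ground-truth classes, columns to predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=list(range(len(CLASSES))))
print(cm)
```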
Table 1. Performance comparison between the proposed model and mainstream object detection models.
Model          P (%)     R (%)     F1 (%)    mAP50 (%)   mAP95 (%)   Time (s)
YOLOv11n       89.183    88.144    88.661    92.949      73.790      5019.29
YOLOv11s       89.227    90.419    89.818    93.442      73.550      2995.95
YOLOv11m       91.923    89.550    90.723    93.097      73.204      5713.89
YOLOv11l       89.985    89.515    89.749    93.471      73.310      7525.64
YOLOv11x       89.650    90.513    90.079    93.833      74.665      11916.40
YOLOv8n        93.577    86.693    90.022    92.308      71.674      2205.33
YOLOv8s        90.271    89.516    89.892    92.053      72.214      4216.21
YOLOv8m        91.246    87.544    89.358    92.708      72.757      3136.89
YOLOv8l        91.056    90.139    90.595    92.746      72.046      4241.90
YOLOv8x        92.053    87.980    89.968    92.807      72.914      7865.91
YOLOv12n       87.911    89.897    88.991    92.406      72.004      2373.79
YOLOv12s       89.671    89.100    89.384    92.990      73.434      6616.87
YOLOv12m       90.746    89.172    89.952    93.104      73.464      9010.27
YOLOv12x       89.694    86.045    87.829    92.649      72.552      6459.60
RT-DETR-50     91.166    88.333    89.727    91.672      71.805      10054.10
RT-DETR-101    89.041    87.907    88.471    91.500      71.547      8969.67
RA-CottNet     93.683    86.040    89.692    93.496      72.857      2565.07
Note: "P" represents Precision; "R" represents Recall; and "F1" stands for F1-score.
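The F1 values in Table 1 are the harmonic mean of Precision and Recall. The following minimal sketch (a generic helper, not part of our training code) reproduces this relation for the RA-CottNet row; the small gap between the computed 89.699 and the tabulated 89.692 arises because P and R are themselves reported rounded to three decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; both are given in percent."""
    return 2.0 * precision * recall / (precision + recall)

# RA-CottNet row of Table 1: P = 93.683%, R = 86.040%.
# Prints 89.699, which matches the tabulated 89.692 up to rounding of P and R.
print(round(f1_score(93.683, 86.040), 3))
```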
Table 2. Ablation experiment results of different modules.
Model             P (%)     R (%)     F1 (%)    mAP50 (%)   mAP95 (%)   Time (s)
Without_CA        88.092    90.523    89.290    93.113      72.211      5823.26
Without_GAM       89.983    86.799    88.360    92.717      71.472      2282.23
Without_LSKA      92.697    86.998    89.754    92.976      72.021      2727.82
Without_All_Attn  89.411    87.302    88.342    92.049      70.820      1209.43
Without_ODConv    92.164    85.896    88.918    92.422      71.951      3816.51
Without_SPDConv   89.812    86.845    88.301    92.642      71.360      2153.55
RA-CottNet        93.683    86.040    89.692    93.496      72.857      2565.07
Note: "P" represents Precision; "R" represents Recall; and "F1" stands for F1-score.
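For readers who wish to reproduce the ablation protocol of Table 2, the sketch below expresses the variants as simple module toggles. The flag names and the dictionary are hypothetical conveniences for illustration only and do not correspond to the configuration files used in this work.

```python
# Hypothetical module toggles describing the ablation variants of Table 2.
ABLATIONS = {
    "Without_CA":       {"coord_attn": False, "gam": True,  "lska": True,  "odconv": True,  "spdconv": True},
    "Without_GAM":      {"coord_attn": True,  "gam": False, "lska": True,  "odconv": True,  "spdconv": True},
    "Without_LSKA":     {"coord_attn": True,  "gam": True,  "lska": False, "odconv": True,  "spdconv": True},
    "Without_All_Attn": {"coord_attn": False, "gam": False, "lska": False, "odconv": True,  "spdconv": True},
    "Without_ODConv":   {"coord_attn": True,  "gam": True,  "lska": True,  "odconv": False, "spdconv": True},
    "Without_SPDConv":  {"coord_attn": True,  "gam": True,  "lska": True,  "odconv": True,  "spdconv": False},
    "RA-CottNet":       {"coord_attn": True,  "gam": True,  "lska": True,  "odconv": True,  "spdconv": True},
}

for variant, modules in ABLATIONS.items():
    # In an actual experiment, each flag would control whether the corresponding
    # module (CoordAttention, GAM, LSKA, ODConv, SPDConv) is inserted into the network.
    disabled = [name for name, enabled in modules.items() if not enabled]
    print(f"{variant}: disabled modules -> {disabled or 'none'}")
```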
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
