Article

A Real-Time Cotton Boll Disease Detection Model Based on Enhanced YOLOv11n

1 School of Agricultural Engineering and Food Science, Shandong University of Technology, Zibo 255000, China
2 Shandong Province Smart Farm Technology International Cooperation Joint Laboratory, Zibo 255000, China
3 Institute of Modern Agricultural Equipment, Zibo 255000, China
4 Australian Institute for Machine Learning (AIML), University of Adelaide, Adelaide, SA 5000, Australia
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 8085; https://doi.org/10.3390/app15148085
Submission received: 26 June 2025 / Revised: 14 July 2025 / Accepted: 18 July 2025 / Published: 21 July 2025

Abstract

Existing methods for detecting cotton boll diseases frequently exhibit high rates of both false negatives and false positives under complex field conditions (e.g., lighting variations, shadows, and occlusions) and struggle to achieve real-time performance on edge devices. To address these limitations, this study proposes an enhanced YOLOv11n model (YOLOv11n-ECS) for improved detection accuracy. A dataset of cotton boll diseases under different lighting conditions and shooting angles in the field was constructed. To mitigate false negatives and false positives encountered by the original YOLOv11n model during detection, the EMA (efficient multi-scale attention) mechanism is introduced to enhance the weights of important features and suppress irrelevant regions, thereby improving the detection accuracy of the model. Partial Convolution (PConv) is incorporated into the C3k2 module to reduce computational redundancy and lower the model's computational complexity while maintaining high recognition accuracy. Furthermore, to enhance the localization accuracy of diseased bolls, the original CIoU loss is replaced with Shape-IoU. The floating point operations (FLOPs), parameter count, and model size of the improved model are 96.8%, 96%, and 96.3% of those of the original YOLOv11n model, respectively. The improved model achieves an mAP@0.5 of 85.6% and an mAP@0.5:0.95 of 62.7%, representing improvements of 2.3 and 1.9 percentage points, respectively, over the baseline YOLOv11n model. Compared with CenterNet, Faster R-CNN, YOLOv8-LSW, MSA-DETR, DMN-YOLO, and YOLOv11n, the improved model shows mAP@0.5 improvements of 25.7, 21.2, 5.5, 4.0, 4.5, and 2.3 percentage points, respectively, along with corresponding mAP@0.5:0.95 increases of 25.6, 25.3, 8.3, 2.8, 1.8, and 1.9 percentage points. Deployed on a Jetson TX2 development board, the model achieves a recognition speed of 56 frames per second (FPS) and an mAP of 84.2%, confirming its suitability for real-time detection. Furthermore, the improved model effectively reduces instances of both false negatives and false positives for diseased cotton bolls while yielding higher detection confidence, thus providing robust technical support for intelligent cotton boll disease detection.

1. Introduction

Cotton is a vital economic crop in China, playing a crucial role in national defense, pharmaceuticals, and industrial applications [1]. Cotton boll disease is a complex disorder caused by multiple pathogens, including cotton boll blight, red rot, black fruit disease, soft rot, and Aspergillus infection. Infected bolls exhibit poor fiber extrusion and distorted shells that fail to open properly, and they frequently wither or rot, severely compromising cotton quality and yield [2]. Currently, disease detection relies heavily on manual inspection, a method that is time-consuming, labor-intensive, and prone to subjective error, and that hinders timely prevention and control. Consequently, developing automated detection technology for cotton boll diseases is of significant importance for safeguarding the cotton industry's sustainable growth and enhancing productivity. However, these diseases typically manifest during the middle to late growth stages and primarily affect bolls on the middle and lower sections of the plants, where lighting is poor and withered leaves clutter the view. Coupled with natural environmental factors such as occlusions and shadows, this makes disease identification even more challenging.
Efficient and accurate recognition of crop disease symptoms using image recognition algorithms can provide agricultural producers with timely and accurate disease diagnosis information. Chen et al. [3] employed an improved YOLOv5-CBM to detect early-stage tea leaf spots in natural environments. Compared to the original YOLOv5s model, the improved model increased the average precision of early tea powdery mildew and early tea ring spot disease detection by 1.9% and 0.9%, respectively, achieving an mAP of 97.3% for different diseases. Kinger et al. [4] used three different deep learning models, CNN, InceptionV3, and ResNet-52, for cotton leaf disease detection, with detection accuracies of 99.057%, 97.170%, and 98.113%, respectively. Zhang et al. [5] proposed the YOLOv8-FECA model to address the problem of false negatives and false positives in small target detection for wheat fusarium head blight spores. By adding a small target detection layer, constructing a focus attention mechanism (FECA), and combining Wise-IoU with DFL loss function optimization, mAP improved by 4.3% compared to the original YOLOv8. Yang et al. [6] proposed a lightweight network based on YOLOv8n, achieving 98.5% mAP and 97.5% precision through StarBlock reconstruction, the introduction of a hybrid local channel attention mechanism, optimization of detection heads, and channel pruning strategies, significantly improving model efficiency and detection accuracy. Tian et al. [7] proposed the multi-scale dense network MD-YOLO, which introduced DenseNet blocks and an adaptive attention module (AAM) to enhance feature extraction and small target detection capabilities, achieving a recognition accuracy of 86.2%.
To improve the accuracy of cotton boll disease detection in complex field environments, this study proposes a YOLOv11n-ECS model suitable for mobile deployment. The proposed model demonstrates significant performance improvements over the baseline YOLOv11n. The main contributions of this study are as follows: (1) A cotton boll disease dataset was constructed, comprising images captured under varying light intensities and shooting angles to accurately reflect real-world production conditions. (2) The EMA mechanism was integrated to enhance distinctive feature representation. It adaptively weights channel-wise features, suppressing background interference while strengthening disease features under suboptimal lighting, thereby improving detection accuracy. (3) The loss function was replaced with Shape-IoU to optimize bounding box localization precision. (4) A C3k2_PConv module was designed to replace the standard C3k2 block. This module employs partial convolution operations to reduce computational redundancy while enhancing the model’s capability to handle irregular regions.

2. Materials and Methods

2.1. Cotton Boll Disease Image Collection

The data collection area is located at the Cotton Planting Base of Binzhou National Agricultural Science and Technology Park, Shandong Province, China (37°34′53″ N, 118°3′49.5″ E). The data were collected during August–September 2024, between 10:00 and 18:00. The image data of cotton boll diseases was mainly obtained through three methods: using an Intel D435i (RGB-D) camera mounted on a cotton boll disease detection trolley (as shown in Figure 1), manual photography, and online collection.
The RGB-D camera was mounted 0.3 to 0.6 m above ground level and captured images at a resolution of 640 × 640. Additionally, manual image collection was conducted under various lighting conditions and angles. During manual collection, a Redmi K60 phone was used as the shooting device, capturing images vertically at a distance of 20 cm from the target at a resolution of 4096 × 2304. All images were saved in JPG format. To enhance the model's robustness and applicability, additional cotton boll disease images were collected online. When collecting these images from the Internet, the selection criteria prioritized high-resolution, blur-free images captured at appropriate angles under moderate lighting, ensuring clear visualization of boll morphology and pathological characteristics. For quality control, a two-step process was implemented: initial manual screening eliminated images that failed the predefined standards, and expert validation of the disease diagnoses then ensured semantic consistency across the dataset. A total of 7100 cotton boll images were collected, including 4000 images captured by the disease detection system, 3000 images captured manually, and 100 images collected online. Some of the images are shown in Figure 2.

2.2. Cotton Boll Disease Dataset Construction

The common boll diseases mainly include boll rot, red rot, anthracnose, and soft rot. The characteristics of boll diseases primarily consist of brown spots on the boll surface and yellow-brown, fuzz-like mold. These features provide a reference for rapid identification and detection of field diseases. The obtained boll disease image data are categorized into four types: uncracked green bolls, cotton bolls with opened fibers, lightly diseased bolls (with disease spots covering 0–50% of the boll surface area), and severely diseased bolls (with disease spots covering more than 50% of the entire boll surface area), as shown within the red rectangular box in Figure 3. The four-category grading scheme employed in this study is grounded primarily in observable morphological characteristics and practical field assessment expertise. It was designed to enhance operational efficiency for image recognition research and serve as a foundation for preliminary data organization. The disease severity thresholds (0–50% and >50% surface coverage) were established through consensus by experienced field technicians during data collection, based on standardized visual symptom assessments. This approach was adopted to capture distinct levels of visual symptom severity relevant to field scouting and potential management decisions.
Using LabelImg for annotation, the healthy uncracked green bolls, severely diseased bolls, lightly diseased bolls, and cotton bolls with opened fibers were labeled as “0”, “1”, “2”, and “3”, respectively.
Although factors such as changing lighting conditions and complex backgrounds in the natural environment were considered during image collection, with data gathered from multiple angles, times, and locations, it is still impossible to capture all scenarios. To address this, several data augmentation techniques tailored to agricultural environmental conditions were chosen: color enhancement simulated lighting variations; mirroring and rotation facilitated multi-angle observation; and random occlusion mimicked partial obstructions caused by plant leaves or branches. Data augmentation was performed on the annotated boll disease images using the Python PIL (Pillow 10.0.0) library, applying methods such as color enhancement, mirroring, rotation, Gaussian noise, random occlusion, and scaling, with multiple augmentations applied to a single image [8]. To ensure data quality, an augmentation screening framework based on agricultural pathological characteristics was developed. This framework enforces two constraints: HSV hue variation ≤5° (preserving diagnostic color features such as the reddish-brown hue) and a Structural Similarity Index (SSIM) >0.85 (maintaining lesion morphological integrity), using an automated pipeline that filters out non-compliant samples in real time; a minimal sketch of this screening idea is shown below. A sample of the data augmentation is shown in Figure 4. A total of 35,500 sample images were obtained and randomly divided into training, validation, and testing sets in an 8:1:1 ratio.
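To make the screening step concrete, the following is a minimal sketch of how such a filter could be implemented. It assumes Pillow for the augmentation and NumPy plus scikit-image for the hue and SSIM checks; the file name, the specific augmentation, and the use of grayscale SSIM are illustrative assumptions rather than the authors' exact pipeline.

```python
# Minimal sketch of the hue/SSIM augmentation screening described above.
# Library choice and thresholds follow the constraints stated in the text
# (hue shift <= 5 degrees, SSIM > 0.85); file names are hypothetical.
import numpy as np
from PIL import Image, ImageEnhance
from skimage.color import rgb2hsv
from skimage.metrics import structural_similarity as ssim


def mean_hue_deg(img: Image.Image) -> float:
    """Mean hue of an RGB image in degrees (0-360)."""
    hsv = rgb2hsv(np.asarray(img.convert("RGB"), dtype=np.float32) / 255.0)
    return float(hsv[..., 0].mean() * 360.0)


def passes_screening(original: Image.Image, augmented: Image.Image,
                     max_hue_shift: float = 5.0, min_ssim: float = 0.85) -> bool:
    """Keep an augmented sample only if it preserves diagnostic colour and lesion structure."""
    # Note: hue is circular; a production filter should also handle wrap-around at 360 degrees.
    hue_shift = abs(mean_hue_deg(original) - mean_hue_deg(augmented))
    gray_orig = np.asarray(original.convert("L"))
    gray_aug = np.asarray(augmented.convert("L").resize(original.size))
    structural = ssim(gray_orig, gray_aug)
    return hue_shift <= max_hue_shift and structural >= min_ssim


if __name__ == "__main__":
    img = Image.open("boll_disease_sample.jpg")     # hypothetical source image
    aug = ImageEnhance.Color(img).enhance(1.2)       # example colour enhancement
    print("keep augmented sample:", passes_screening(img, aug))
```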

3. Cotton Boll Disease Detection Algorithm and Improvement

3.1. Selection of the Basic Detection Model

The YOLO (You Only Look Once) series is a representative single-stage object detection algorithm that regards object detection as a regression problem. It directly predicts object bounding boxes and classes from images. The primary advantage of this method lies in its high speed, enabling real-time object detection, which has led to its widespread application in the field of computer vision. To select the most suitable YOLO algorithm, five commonly used versions—YOLOv5s, YOLOv7, YOLOv8n, YOLOv9, and YOLOv11n—were compared. The results are shown in Table 1.
As shown in Table 1, YOLOv11n achieved the highest mAP@0.5 of 83.3% and the highest precision of 90.1%, surpassing all other models in both metrics. In contrast, YOLOv9t recorded the lowest mAP@0.5 of 80.7%. Furthermore, YOLOv11n attained a recall of 81.9%, also the highest among the models, demonstrating its strong capability in disease detection. Moreover, YOLOv11n exhibits significantly fewer parameters and lower computational complexity than YOLOv5s, YOLOv7, and YOLOv8n. Although YOLOv11n has 0.8 M more parameters than YOLOv9t, its mAP@0.5 is 2.6 percentage points higher. Therefore, considering its overall superiority, YOLOv11n was selected as the baseline network for this study.

3.2. YOLOv11 Convolutional Neural Network Model

YOLOv11 is a next-generation real-time object detection model developed by Ultralytics. Building upon the YOLO series, it achieves significant improvements in speed, accuracy, and efficiency through multiple architectural innovations. The YOLOv11 network architecture consists of four main components: the Input, Backbone, Neck, and Head. The Input module preprocesses the image and resizes it to a predefined dimension. The Backbone is constructed based on an improved version of CSPDarknet53, incorporating C3K2 blocks [9] and the C2PSA spatial attention module to enhance feature extraction efficiency. The Neck utilizes PANet to achieve multi-scale feature fusion and integrates a lightweight C2f module to further strengthen feature representation. Finally, the Head predicts the object’s location, class, and confidence score based on the feature maps provided by the Neck.
YOLOv11 achieves a dual breakthrough in both speed and accuracy by introducing the C3K2 feature extraction component, the C2PSA attention enhancement module, and the SPPF [10] design. The C3K2 module improves the efficiency of feature extraction, enabling the model to perform better in small object detection and complex scenes while maintaining high inference speed. The C2PSA module, with its enhanced spatial attention mechanism, automatically focuses on key information in the image, thereby improving the model’s robustness in challenging scenarios such as occlusion and varying lighting conditions. The SPPF design optimizes multi-scale feature extraction through a fast pooling strategy, reducing computational resource consumption while increasing the model’s response speed. Thanks to its modular architecture and multitask support, YOLOv11 offers significant advantages for efficient deployment on embedded devices and in complex application environments.

3.3. Improved Cotton Boll Disease Detection Algorithm Model Framework

This paper is based on the YOLOv11n object detection network and proposes the following optimizations: (1) The EMA mechanism is introduced to enhance the model’s feature extraction capability and improve detection accuracy, (2) the PConv (Partial Convolution) is integrated into the C3K2 module to reduce computational redundancy and strengthen the model’s ability to handle irregular regions and occluded objects, and (3) the loss function is replaced with Shape-IoU, enabling more accurate calculation of object shape and position, thereby further enhancing detection precision. The resulting YOLOv11n-ECS network architecture based on these improvements is illustrated in Figure 5.

3.3.1. Introducing the EMA Mechanism

During cotton boll disease detection, leaf occlusions and ambient lighting variations frequently lead to false negatives and false positives. Additionally, visual similarities between desiccated foliage and diseased cotton bolls frequently induce misclassification. These challenges stem primarily from the model’s limited capacity to extract and filter discriminative disease features. To address this, an EMA mechanism [11] is introduced into the Backbone of YOLOv11n. By emphasizing important features and suppressing irrelevant regions (such as leaves), the EMA mechanism improves the detection accuracy of the model. By replacing the ordinary attention mechanism in C2PSA with EMA, the model better extracts and fuses multi-scale features, thereby enhancing its target detection capability in complex scenarios. The EMA mechanism, proposed by Daliang Ouyang in 2023, is an efficient multi-scale attention approach based on cross-spatial feature learning. It reshapes part of the channel dimension into the batch dimension and performs grouped operations across channels without dimensionality reduction, thus avoiding the loss of channel-wise feature information. This design reduces computational overhead while maintaining high accuracy and a low parameter count [12]. The structure of the EMA mechanism is illustrated in Figure 6.
The EMA mechanism enhances pixel-level attention accuracy in the feature map through 1 × 1 and 3 × 3 branches and a cross-spatial learning module. The 1 × 1 branch establishes cross-channel interactions through global average pooling and 1 × 1 convolutions, while the 3 × 3 branch extracts multi-scale features. The cross-spatial learning module pools and applies Softmax activation to the outputs of the 1 × 1 and 3 × 3 branches, generating two spatial attention maps via dot product. These maps are then fused, which strengthens the model’s contextual understanding. By using convolutions at different scales in parallel, EMA effectively integrates multi-scale features, significantly improving the model’s object detection capability.
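For illustration, the following PyTorch sketch follows the published EMA design (grouped 1 × 1 and 3 × 3 branches with cross-spatial learning). The grouping factor and layer choices mirror the commonly used reference implementation and are assumptions with respect to the exact configuration used in this study.

```python
import torch
import torch.nn as nn


class EMA(nn.Module):
    """Efficient Multi-scale Attention sketch: channels are folded into groups,
    a 1x1 branch encodes directional (H/W) context, a 3x3 branch encodes local
    multi-scale context, and cross-spatial learning fuses the two branches."""

    def __init__(self, channels: int, factor: int = 8):
        super().__init__()
        assert channels % factor == 0, "channels must be divisible by the group factor"
        self.groups = factor
        c = channels // factor
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d((1, 1))        # global pooling for cross-spatial weights
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool along width  -> (h, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool along height -> (1, w)
        self.gn = nn.GroupNorm(c, c)
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.size()
        g = self.groups
        gx = x.reshape(b * g, c // g, h, w)
        # 1x1 branch: directional pooling + shared 1x1 convolution
        xh = self.pool_h(gx)
        xw = self.pool_w(gx).permute(0, 1, 3, 2)
        hw = self.conv1x1(torch.cat([xh, xw], dim=2))
        xh, xw = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(gx * xh.sigmoid() * xw.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: local multi-scale context
        x2 = self.conv3x3(gx)
        # cross-spatial learning: each branch re-weights the other's spatial map
        a1 = self.softmax(self.agp(x1).reshape(b * g, -1, 1).permute(0, 2, 1))
        a2 = self.softmax(self.agp(x2).reshape(b * g, -1, 1).permute(0, 2, 1))
        weights = (a1 @ x2.reshape(b * g, c // g, -1) +
                   a2 @ x1.reshape(b * g, c // g, -1)).reshape(b * g, 1, h, w)
        return (gx * weights.sigmoid()).reshape(b, c, h, w)
```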

3.3.2. Introduce PConv to Optimize the C3k2 Module

In YOLOv11, the C3k2 module enhances feature extraction and network efficiency via an optimized convolution kernel and cross-stage connections. However, its multi-branch convolutions introduce substantial computational overhead. This manifests primarily in two aspects: (1) substantial computational redundancy exists across feature map channels, and (2) frequent full-channel convolution operations incur high memory access costs. These factors form critical bottlenecks limiting C3k2's computational efficiency. To address these efficiency issues arising from channel redundancy and frequent memory access, Partial Convolution (PConv) [13] is integrated into the C3k2 module, forming the C3k2_PConv module. PConv effectively reduces redundant computation and memory accesses by performing dense convolution only on a subset of input feature map channels [14]. Its structure is shown in Figure 7.
C3k2_PConv utilizes the PConv design from FasterNet, replacing the Bottleneck in the feature extraction module with a Faster Block as the main gradient flow branch. This reduces the model's floating-point operations and computational load. PConv improves computational efficiency by reducing redundant computations and memory accesses, enabling the C3k2_PConv module to maintain efficient feature extraction while reducing computational costs. By replacing some standard convolutional layers in YOLOv11 with the C3k2_PConv module, the model's floating-point computational load and memory usage are reduced, optimizing the utilization of computing resources and enhancing the model's detection capability in complex scenarios. Its structure is shown in Figure 8.
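As a concrete illustration of the idea, the following PyTorch sketch shows a PConv layer and a simplified Faster-Block-style residual unit. The channel split ratio (1/4), the pointwise MLP, and the activation choice are assumptions for illustration and do not reproduce the exact C3k2_PConv configuration used in this study.

```python
import torch
import torch.nn as nn


class PConv(nn.Module):
    """Partial convolution sketch: a 3x3 convolution is applied to only 1/n_div of
    the input channels; the remaining channels pass through unchanged."""

    def __init__(self, channels: int, n_div: int = 4):
        super().__init__()
        self.dim_conv = channels // n_div            # channels that take part in the convolution
        self.dim_untouched = channels - self.dim_conv
        self.partial_conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, 1, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        return torch.cat((self.partial_conv(x1), x2), dim=1)


class FasterBlock(nn.Module):
    """Faster-Block-style unit: PConv followed by a pointwise expand/contract MLP with a residual."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mlp(self.pconv(x))
```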

3.3.3. Loss Function Optimization

Loss functions are fundamental components in deep learning that quantify discrepancies between predictions and ground truth. To address limitations in conventional bounding box regression where shape attributes are inadequately modeled, the CIoU loss in YOLOv11n is replaced with Shape-IoU [15]. Shape-IoU enhances detection accuracy through three key mechanisms: distance penalty, aspect ratio constraint, and edge alignment. The distance penalty enhances the localization accuracy of small objects, the aspect ratio penalty effectively prevents shape distortion, and the edge alignment term ensures more precise bounding box alignment [16]. This enables Shape-IoU to better handle variations in object shape, size, and position, thereby improving the model’s performance in complex scenarios.
The definition of the Shape-IoU loss function is as follows:
IoU = \frac{|B \cap B^{GT}|}{|B \cup B^{GT}|}
ww = \frac{2 \times (w^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}
hh = \frac{2 \times (h^{gt})^{scale}}{(w^{gt})^{scale} + (h^{gt})^{scale}}
distance^{shape} = hh \times \frac{(x_c - x_c^{gt})^2}{c^2} + ww \times \frac{(y_c - y_c^{gt})^2}{c^2}
\Omega^{shape} = \sum_{t=w,h} (1 - e^{-\omega_t})^{\theta}, \quad \theta = 4
\omega_w = hh \times \frac{|w - w^{gt}|}{\max(w, w^{gt})}
\omega_h = ww \times \frac{|h - h^{gt}|}{\max(h, h^{gt})}
L_{Shape\text{-}IoU} = 1 - IoU + distance^{shape} + 0.5 \times \Omega^{shape}
In these equations, scale is a scaling factor related to the size of the objects in the dataset. ww and hh are the weight coefficients in the horizontal and vertical directions, respectively, and their values depend on the shape of the ground-truth (GT) bounding box. w^gt and h^gt are the width and height of the GT bounding box, and x_c^gt and y_c^gt are its center coordinates. w and h are the width and height of the anchor box, and x_c and y_c are its center coordinates. ω_w and ω_h are the width and height terms of the shape cost, and c is the diagonal length of the smallest box enclosing both the predicted and GT boxes. distance^shape and Ω^shape denote the distance and shape loss components, respectively.
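The loss can be expressed compactly in code. The following PyTorch sketch implements the equations above for boxes given as (x_c, y_c, w, h); the default scale value and the epsilon guard are assumptions for illustration, not values specified in this study.

```python
import torch


def shape_iou_loss(pred: torch.Tensor, target: torch.Tensor,
                   scale: float = 0.0, eps: float = 1e-7) -> torch.Tensor:
    """Shape-IoU loss sketch for boxes in (xc, yc, w, h) format, following the equations above."""
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = target.unbind(-1)

    # IoU term
    inter_w = (torch.min(px + pw / 2, gx + gw / 2) - torch.max(px - pw / 2, gx - gw / 2)).clamp(0)
    inter_h = (torch.min(py + ph / 2, gy + gh / 2) - torch.max(py - ph / 2, gy - gh / 2)).clamp(0)
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter + eps
    iou = inter / union

    # Shape-dependent weights derived from the GT box (ww, hh)
    ww = 2 * gw.pow(scale) / (gw.pow(scale) + gh.pow(scale))
    hh = 2 * gh.pow(scale) / (gw.pow(scale) + gh.pow(scale))

    # Squared diagonal of the smallest enclosing box (c^2)
    cw = torch.max(px + pw / 2, gx + gw / 2) - torch.min(px - pw / 2, gx - gw / 2)
    ch = torch.max(py + ph / 2, gy + gh / 2) - torch.min(py - ph / 2, gy - gh / 2)
    c2 = cw.pow(2) + ch.pow(2) + eps

    # Shape-weighted centre distance term
    dist_shape = hh * (px - gx).pow(2) / c2 + ww * (py - gy).pow(2) / c2

    # Shape cost (theta = 4)
    omega_w = hh * (pw - gw).abs() / torch.max(pw, gw)
    omega_h = ww * (ph - gh).abs() / torch.max(ph, gh)
    omega_shape = (1 - torch.exp(-omega_w)).pow(4) + (1 - torch.exp(-omega_h)).pow(4)

    return 1 - iou + dist_shape + 0.5 * omega_shape
```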

3.4. Model Training and Evaluation Metrics

3.4.1. Test Platform and Model Parameters

The hardware configuration of the training model platform is an Intel Core i9-9820X@3.30 GHz processor (Intel Corporation, Santa Clara, CA, USA), 32 GB of RAM, and an Nvidia GeForce RTX2080Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA). The software environment consists of Ubuntu 18.04 LTS 64-bit system, Python 3.8, the PyTorch 1.11.0 framework, and CUDA 11.4. During training, the model processes input images at a resolution of 640 × 640 pixels, with a total of 250 epochs and a batch size of 16. Pretrained model weights are used for initialization. The Adam optimizer is employed, which dynamically adjusts the learning rate for each parameter by calculating the mean and variance of the gradients, thus accelerating convergence and improving training stability. The initial learning rate is set to 0.001, with a weight decay factor of 0.0005.
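For reference, a hedged sketch of this training configuration using the Ultralytics API is shown below. The weight and dataset file names are placeholders, and the custom YOLOv11n-ECS modules would additionally need to be registered in a model configuration file; argument names follow the Ultralytics training interface.

```python
# Hedged sketch of the training setup described above (Ultralytics API).
from ultralytics import YOLO

model = YOLO("yolo11n.pt")            # pretrained baseline weights used for initialization
model.train(
    data="cotton_boll_disease.yaml",  # hypothetical dataset config with the four classes
    imgsz=640,                        # input resolution 640 x 640
    epochs=250,
    batch=16,
    optimizer="Adam",
    lr0=0.001,                        # initial learning rate
    weight_decay=0.0005,
)
```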

3.4.2. Model Evaluation Metrics

In this study, the model's performance is primarily evaluated using the following metrics: Precision (P), Recall (R), mean Average Precision at IoU 0.5 (mAP@0.5), mAP@0.5:0.95, FLOPs (floating point operations), number of parameters, average frames per second (FPS), and model size. The specific formulas are as follows:
P = \frac{TP}{TP + FP} \times 100\%
R = \frac{TP}{TP + FN} \times 100\%
AP = \int_{0}^{1} P(r)\,dr
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i
where TP represents the number of actual positive samples predicted as positive, FP represents the number of actual negative samples predicted as positive, and FN represents the number of actual positive samples predicted as negative. AP represents the area under the precision–recall curve. mAP stands for mean Average Precision.
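A small Python sketch of these metrics is given below for illustration. The confusion counts and the toy precision-recall curve are made-up example values; AP is computed by numerically integrating a monotonized precision-recall curve.

```python
import numpy as np


def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall (%) from the confusion counts defined above."""
    p = tp / (tp + fp) * 100
    r = tp / (tp + fn) * 100
    return p, r


def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # make precision monotonically non-increasing
    return float(np.trapz(p, r))


# Example with made-up numbers: P and R for one class, AP from a toy PR curve.
print(precision_recall(tp=90, fp=8, fn=12))
print(average_precision(np.array([0.2, 0.5, 0.8]), np.array([0.95, 0.9, 0.7])))
```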

4. Results and Analyses

4.1. Impact of Different Attention Mechanisms on Model Detection Performance

To evaluate the performance of different attention mechanism modules, this paper conducts comparative experiments by replacing the EMA attention mechanism in the model with four other attention mechanisms: SE [17], CA [18], ECA [19], and GAM [20]. The results are shown in Table 2.
As shown in Table 2, the EMA mechanism demonstrates superior performance for cotton boll disease detection, achieving 84.4% mAP@0.5 and 60.2% mAP@0.5:0.95. This represents an improvement of 2.3, 3.5, 1.3, and 4.8 percentage points over the SE, CA, ECA, and GAM mechanisms, respectively. Notably in mAP@0.5:0.95, EMA surpasses SE, CA, ECA, and GAM by 2.8, 5.1, 0.9, and 1.3 percentage points, respectively. The precision of the model with the EMA mechanism is also higher by 0.5, 3.6, and 2.4 percentage points compared to SE, CA, and GAM, respectively, and slightly lower than ECA’s precision of 90.9%. However, EMA achieves 2.3 percentage points higher recall than ECA. Recall quantifies the proportion of correctly identified positives among all actual positives. High recall indicates effective identification of diseased cotton bolls with minimal false negatives. Due to its effective feature weighting capability, the model can more comprehensively focus on the key features of cotton bolls, effectively improving detection accuracy. This dual-attention architecture significantly outperforms the SE and CA mechanisms, which focus on a single dimension, as well as the ECA mechanism, which uses 1D convolutions to optimize channel attention.

4.2. Impact of Different Lightweight Networks on Model Detection Performance

To validate the effectiveness of the C3K2-PConv structure, this study employs YOLOv11 as the baseline and substitutes its feature extraction network with several mainstream lightweight models for comparative analysis. The selected models comprise ShuffleNet, EfficientNetv2 [21], and MobileNetV4 [22]. This approach aims to comprehensively assess the enhancements afforded by the C3K2-PConv structure. By maintaining identical parameters except for the backbone network, we compare their impacts on training performance. The experimental results are shown in Table 3.
As shown in Table 3, compared with ShuffleNet, C3K2-PConv increases the number of parameters and FLOPs by 0.8 M and 2.0 G, respectively, but improves mAP@0.5 and mAP@0.5:0.95 by 5.5 and 5.4 percentage points and increases the frame rate by 3 frames. Compared with EfficientNetv2, C3K2-PConv reduces the number of parameters by 1.0 M, improves mAP@0.5 and mAP@0.5:0.95 by 1.5 and 3.9 percentage points, reduces the model size by 2.0 MB, and increases the frame rate by 11%. In comparison with MobileNetV4, the mAP@0.5 of C3K2-PConv decreases by 0.4 percentage points, but its number of parameters and FLOPs decrease by 1.5 M and 0.4 G, respectively, while mAP@0.5:0.95 increases by 1.5 percentage points, the model size is reduced by 3.1 MB, and the frame rate increases by 14 frames. Although C3K2-PConv shows an increase in parameter count and model size compared with ShuffleNet, it achieves a significant improvement in both mAP and frame rate. Furthermore, compared with EfficientNetv2 and MobileNetV4, C3K2-PConv outperforms them in most metrics, with a significant increase in frame rate. The results demonstrate that the C3K2-PConv structure proposed in this study is effective in improving the model's detection accuracy while reducing computational overhead.

4.3. Effect of Different Loss Functions on Model Detection Performance

To analyze the performance of different loss functions on the cotton boll disease dataset, this experiment compares four loss functions used in the YOLOv11 model: CIoU, SIoU, Focal-EIoU [23], and Shape-IoU. The training loss curves for the four loss functions are shown in Figure 9.
As shown in Figure 9, when using SIoU, the convergence speed is the slowest, and the loss value after convergence is the highest. Both Focal-EIoU and CIoU converge faster than SIoU, with lower final loss values. Shape-IoU performs the best, showing the fastest gradient descent speed during training and the lowest loss value after convergence. This is because Shape-IoU takes into account the shape information of the bounding box, enhancing the model’s robustness to changes in object shape, allowing the loss function to more accurately guide the model’s parameter updates, thereby accelerating the convergence speed and significantly reducing the final loss value.

4.4. YOLOv11n-ECS Ablation Experiment Performance Comparison

An ablation experiment [24] analyzes the contribution of individual components, layers, or parameters to overall model performance by removing them or fixing them to constant values during model design and optimization. To verify the effectiveness of each improvement module, ablation tests were conducted by adding the three improvement modules to the network step by step and in different combinations. In Table 4, "√" indicates that a module is used and "×" indicates that it is not. The test results are shown in Table 4.
As indicated in Table 4, after adding the EMA mechanism module to the network, mAP@0.5 and mAP@0.5:0.95 improved by 1.4 and 1.7 percentage points compared to the original YOLOv11n model, although the FLOPs slightly increased and the model weight size remained unchanged. This demonstrates that the EMA module enhances the model's feature extraction ability while introducing only a small additional computational overhead. Replacing C3k2 with the C3k2_PConv module increased mAP@0.5 and mAP@0.5:0.95 by 1.1 and 2.3 percentage points, while reducing FLOPs and parameters by 3.1% and 4%, respectively. Model size concurrently decreased by 0.2 MB. This improvement is due to the PConv module, which reduces computational redundancy by exploiting the redundant information in the feature map and performing convolution operations only on a subset of the channels. Substituting the original CIoU loss with Shape-IoU boosted mAP@0.5 by 0.9 percentage points without increasing FLOPs, parameters, or model size. By introducing both the EMA and C3k2_PConv modules, the improved model's mAP@0.5 and mAP@0.5:0.95 increased by 1.8 and 3.6 percentage points compared to the original YOLOv11n model, while the computational cost, number of parameters, and model size were all reduced. Incorporating all three modules achieved mAP@0.5 and mAP@0.5:0.95 of 85.6% and 62.7%, gains of 2.3 and 4.2 percentage points over the baseline model, while reducing FLOPs, parameters, and model size to 96.8%, 96.0%, and 96.3% of their original values. This effectively controlled the model's complexity while improving its performance. As shown in Figure 10, the improved model performs better in the presence of background interference such as dried leaves and soil, reducing false negatives and false positives.

4.5. Performance Comparison of Different Object Detection Models

To evaluate the performance of the model proposed in this paper, several classic object detection models were selected for comparison, including the two-stage object detection model Faster R-CNN [25], the one-stage object detection algorithm CenterNet [26], and the MSA-DETR [27] model based on the Transformer architecture. In addition, YOLOv8-LSW [28] and DMN-YOLO [29] models specifically designed for small-target disease detection were also included in the comparison. All experiments were performed under identical datasets, training strategies, and test environments. The results are shown in Table 5.
As shown in Table 5, the improved YOLOv11n-ECS model surpasses the other comparative models in precision, recall, and mAP. Compared to CenterNet, Faster R-CNN, YOLOv8-LSW, MSA-DETR, DMN-YOLO, and YOLOv11n, the proposed method achieves improvements of 25.7, 21.2, 5.5, 4.0, 4.5, and 2.3 percentage points in mAP@0.5, and 25.6, 25.3, 8.3, 2.8, 1.8, and 1.9 percentage points in mAP@0.5:0.95, respectively. Regarding model size, YOLOv11n-ECS is 118.7 MB, 102.7 MB, 0.1 MB, 33.6 MB, 0.2 MB, and 0.2 MB smaller than CenterNet, Faster R-CNN, YOLOv8-LSW, MSA-DETR, DMN-YOLO, and YOLOv11n, respectively, demonstrating a significant lightweight advantage. As a two-stage object detection model, Faster R-CNN has a larger computational cost and parameter size, resulting in slower inference speed and a larger weight file; its mAP@0.5 is only 64.4%, higher only than that of CenterNet. As representative single-stage object detection models, the YOLO series excels in both speed and detection performance. Compared to CenterNet, YOLOv8-LSW and YOLOv11n feature significantly smaller model sizes and lower computational costs, along with substantially higher mAP scores. The MSA-DETR model, built on the Transformer architecture, achieves remarkable results in object detection, with an mAP@0.5 of 81.6%, surpassing the YOLOv8-LSW and DMN-YOLO models that were specifically optimized for disease detection tasks. However, MSA-DETR exhibits relatively high computational complexity and a large model size, exceeded only by CenterNet and Faster R-CNN in computational load and parameter count. Both YOLOv8-LSW and DMN-YOLO have undergone backbone network optimizations, enabling superior model lightweighting. Notably, DMN-YOLO demonstrates more robust localization capabilities, achieving an mAP@0.5:0.95 of 60.9%, which surpasses MSA-DETR by one percentage point. In contrast, the improved YOLOv11n-ECS model not only maintains its advantage in lightweight design but also achieves higher detection performance. In conclusion, the YOLOv11n-ECS model achieves an excellent balance between accuracy, speed, and lightweight design, making it suitable for real-time detection and resource-constrained scenarios. Compared to traditional detection models, it demonstrates stronger adaptability and robustness in small object detection, occlusion handling, and complex environments, making it highly applicable to a wide range of real-world scenarios.

4.6. Analysis of the Deployment Test Results of the Model on Jetson TX2

To evaluate the model’s detection performance for cotton boll diseases on embedded devices, the YOLOv11n-ECS and the original YOLOv11n models were deployed on the Jetson TX2 development board for comparative testing. At the same time, TensorRT was used to optimize and accelerate the models with high-performance operators, and Int8 quantization was applied to further reduce the model size and computational complexity. This approach helps prevent a significant drop in detection speed when migrating the model from a computer to a resource-constrained embedded device. The same cotton boll disease dataset was used for testing. The detection results and performance comparison are shown in Table 6 and Figure 11.
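A hedged sketch of this deployment step using the Ultralytics export API is shown below. The weight path, calibration dataset, and test image are placeholders, and exact argument support for Int8 TensorRT export depends on the Ultralytics and TensorRT versions installed on the Jetson TX2.

```python
# Hedged sketch of exporting the trained model to a TensorRT engine with Int8
# quantization and running inference with it (Ultralytics API; paths are placeholders).
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
engine_path = model.export(
    format="engine",                  # TensorRT engine
    int8=True,                        # Int8 quantization (requires calibration data)
    data="cotton_boll_disease.yaml",  # hypothetical calibration dataset config
    imgsz=640,
    device=0,
)

# Run inference with the exported engine.
trt_model = YOLO(engine_path)
results = trt_model("test_boll_image.jpg")  # hypothetical test image
```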
As shown in Table 6, the mAP@0.5 of both models decreased after deployment on the Jetson TX2. Specifically, the mAP@0.5 of YOLOv11n and the improved model dropped by 1.9 and 1.4 percentage points, respectively. As illustrated in Figure 11, due to the small size of cotton bolls and the complexity of the background environment, the YOLOv11n model exhibited issues such as false positives and duplicate detections when identifying diseased bolls. In contrast, the improved model demonstrated better detection performance, with more accurate target localization and fewer false positives. The detection speeds of the improved model and the original YOLOv11n model on the Jetson TX2 development board were 56 FPS and 52 FPS, respectively. The improved model achieved an mAP@0.5 of 84.2%, demonstrating good detection performance on embedded devices. Compared to the original model, it achieved a 7.7% increase in detection speed, indicating its capability to meet the real-time detection requirements for diseased cotton bolls in actual field conditions.

5. Discussion and Conclusions

5.1. Discussion

To mitigate issues such as occlusion, false positives, and false negatives encountered during the original model's detection, we introduce the EMA mechanism to enhance feature extraction capabilities and integrate PConv into the C3k2 module to reduce computational redundancy. Furthermore, the CIoU loss function is replaced with Shape-IoU to improve the localization accuracy for diseased cotton bolls. The mAP@0.5 of the improved YOLOv11n-ECS model reaches 85.6%, an increase of 2.3 percentage points compared to the base YOLOv11n network model. Compared to the original model, the improved model demonstrates higher detection accuracy in scenarios with numerous targets and significantly reduces both false positives and false negatives in complex environments featuring distractions such as falling leaves. Hamdulla et al. [30] proposed an improved YOLOv5s-ESTC model for detecting Cistanche deserticola in natural environments characterized by occlusion and dense distributions. Their model achieves an mAP of 89.8%. It is important to note that the images used in YOLOv5s-ESTC were captured at close range, where targets occupy a larger image area. In contrast, our study involves downscaling images (from 4000 × 3000 to 640 × 640), which reduces the pixel proportion of cotton bolls, leading to the loss of some detailed information and consequently increasing the difficulty of detecting diseased bolls. Despite this challenge, YOLOv11n-ECS achieves competitive performance.
In terms of model complexity, the FLOPs, parameters, and model size of the improved YOLOv11n-ECS were reduced to 96.8%, 96%, and 96.3% of those of the original YOLOv11n network, respectively. The introduction of PConv was pivotal to achieving this efficient design, reducing computational redundancy while maintaining high recognition accuracy. This lightweight approach prepares the model effectively for engineering deployment. Wang et al. [31] proposed a lightweight rice disease recognition model named YOLOv8-DiDL. The model incorporates modules including the inverted residual mobile block and deformable convolutional networks v2, and replaces the spatial pyramid pooling fast (SPPF) module with a large separable kernel attention (LSKA) module. Compared with YOLOv8n, the model weight is decreased by 9.7%, and the number of parameters is reduced by approximately 9.3%. In this study, the PConv-based design and architectural refinements enabled a significant reduction in computational burden while still achieving a substantial 2.3 percentage point increase in mAP over the base YOLOv11n. This balance of high accuracy, reduced parameters, and faster detection speed better meets the real-time requirements of agricultural production. During data collection, some images exhibited issues such as blurred backgrounds and overexposure. To address these issues, the deep learning-based Super-Resolution Convolutional Neural Network (SRCNN) algorithm could be applied prior to data processing to enhance blurred images and restore detail. At the same time, algorithms such as adaptive histogram equalization could be used to adjust the brightness of overexposed images and enhance contrast. Additionally, the approach combining YOLOv5 with dark channel prior enhancement proposed by Fan et al. [32,33] could be referenced to address low-light nighttime images. Furthermore, collecting images under diverse nighttime conditions could also enrich the dataset and enhance model robustness.
Although the EMA mechanism and C3k2_PConv structure proposed in this study demonstrate superior performance on the cotton boll disease dataset, they still face inherent limitations in agricultural scenarios: under severe occlusion or chromatic camouflage, EMA's attention weighting may fail or become less effective, and C3k2_PConv still requires improvement in capturing subtle morphological variations while remaining robust to complex textured backgrounds. Furthermore, the model is sensitive to extreme illumination and to chromatic variations across growth stages. Future work will prioritize designing occlusion-resistant attention mechanisms, integrating multi-spectral features to mitigate camouflage, optimizing sparse connectivity to enhance fine-grained discrimination, and developing adaptive background suppression modules, thereby improving the model's generalization in complex agricultural environments. It should also be noted that the four-tier classification framework based on visual lesion coverage (0–50% vs. >50%) carries inherent limitations. Its linkage to pathological mechanisms is weak, since the symptom grades are not quantified against bio-pathological evidence such as pathogen invasion dynamics or host physiological responses, and its experience-dependent thresholds, while practical for rapid field assessment, risk misclassification of borderline cases due to environmental variability and observer subjectivity, potentially constraining model generalizability in complex agronomic scenarios. Future refinements should integrate objective biomarkers through multi-center pathological validation to strengthen the scientific robustness of the criteria.

5.2. Conclusions

In this study, a novel model named YOLOv11n-ECS is proposed for rapid multi-target detection of cotton boll diseases in natural environments, based on the YOLOv11n architecture. The improved model achieves a precision of 92.2%, a recall of 84.3%, an mAP@0.5 of 85.6%, and an mAP@0.5:0.95 of 62.7%. Compared to the original model, YOLOv11n-ECS achieves reductions of 3.2%, 4.0%, and 3.7% in computational load, model parameters, and weight file size, respectively. It thus effectively controls model complexity while improving detection performance.
Compared to CenterNet, Faster R-CNN, YOLOv8-LSW, MSA-DETR, DMN-YOLO, and the original YOLOv11n, YOLOv11n-ECS demonstrates substantial improvements in both mAP metrics. Specifically, it achieves gains of 25.7, 21.2, 5.5, 4.0, 4.5, and 2.3 percentage points in mAP@0.5 and 25.6, 25.3, 8.3, 2.8, 1.8, and 1.9 percentage points in mAP@0.5:0.95, respectively. Furthermore, YOLOv11n-ECS exhibits superior detection performance for diseased cotton bolls in natural scenes, significantly reducing both false negatives and false positives. When deployed on the Jetson TX2, it achieves a detection speed of 56 FPS with an mAP@0.5 of 84.2%, fulfilling the requirements for intelligent real-time recognition in field applications.

Author Contributions

Writing—original draft preparation, methodology, investigation, conceptualization, L.Y.; validation, data curation, W.C.; investigation, data curation, J.L.; data curation, formal analysis, G.H.; formal analysis, resources, Q.Z.; software, investigation, Y.L.; writing—review and editing, resources, funding acquisition, J.Z.; formal analysis, project administration, Y.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Z.; Li, Y.; Zheng, Y.; Liu, X.; Shu, X.; Liu, J.; Guo, C. Cotton production pattern and contribution factors in Xinjiang from 1988 to 2020. J. Agric. Resour. Environ. 2024, 41, 1192. [Google Scholar]
  2. Li, S.; Lu, X.; Hao, Y.; Zhao, M.; Nian, G.; Guo, Q.; Zhang, X.; Ma, P. Cotton boll rot occurrence, analysis of varietal resistance and pathogenicity differentiation of the major pathogen. Acta Phytopathol. Sin. 2017, 47, 824–831. [Google Scholar] [CrossRef]
  3. Zhai, Z.; Cao, Y.; Yu, H.; Yuan, P.; Wang, H. Review of key techniques for crop disease and pest detection. Trans. Chin. Soc. Agric. Mach. 2021, 52, 1–18. [Google Scholar]
  4. Chen, Y.; Wu, X.; Zhang, Z.; Yan, J.; Zhang, F.; Yu, L. Method for identifying tea diseases in natural environment using improved YOLOv5s. Trans. Chin. Soc. Agric. Eng. 2023, 39, 185–194. [Google Scholar]
  5. Kinger, S.; Tagalpallewar, A.; George, R.R.; Hambarde, K.; Sonawane, P. Deep learning based cotton leaf disease detection. In Proceedings of the 2022 International Conference on Trends in Quantum Computing and Emerging Business Technologies (TQCEBT), Pune, India, 13–15 October 2022; pp. 1–10. [Google Scholar]
  6. Zhang, D.; Gao, Y.; Tao, C.; HU, G.; Yang, X.; Qiao, H.; Guo, W.; Gu, C. Detection of wheat scab spores in dense scene based on YOLOv8-FECA. Trans. Chin. Soc. Agric. Eng. 2024, 40, 127–136. [Google Scholar]
  7. Yang, S.; Zhang, P.; Wang, L.; Tang, L.; Wang, S.; He, X. Identifying tomato leaf diseases and pests using lightweight improved YOLOv8n and channel pruning. Trans. Chin. Soc. Agric. Eng. 2025, 41, 206–214. [Google Scholar]
  8. Tian, Y.; Wang, S.; Li, E.; Yang, G.; Liang, Z.; Tan, M. MD-YOLO: Multi-scale Dense YOLO for small target pest detection. Comput. Electron. Agric. 2023, 213, 108233. [Google Scholar] [CrossRef]
  9. Wang, S. Evaluation of impact of image augmentation techniques on two tasks: Window detection and window states detection. Results Eng. 2024, 24, 103571. [Google Scholar] [CrossRef]
  10. Wei, J.; Ni, L.; Luo, L.; Chen, M.; You, M.; Sun, Y.; Hu, T. GFS-YOLO11: A Maturity Detection Model for Multi-Variety Tomato. Agronomy 2024, 14, 2644. [Google Scholar] [CrossRef]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  12. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  13. Chen, S.; Li, Y.; Zhang, Y.; Yang, Y.; Zhang, X. Soft X-ray image recognition and classification of maize seed cracks based on image enhancement and optimized YOLOv8 model. Comput. Electron. Agric. 2024, 216, 108475. [Google Scholar] [CrossRef]
  14. Chen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–20 June 2023; pp. 12021–12031. [Google Scholar]
  15. Wang, Z.; Zhang, S.; Chen, Y.; Xia, Y.; Wang, H.; Jin, R.; Wang, C.; Fan, Z.; Wang, Y.; Wang, B. Detection of small foreign objects in Pu-erh sun-dried green tea: An enhanced YOLOv8 neural network model based on deep learning. Food Control 2025, 168, 110890. [Google Scholar] [CrossRef]
  16. Zhang, H.; Zhang, S. Shape-iou: More accurate metric considering bounding box shape and scale. arXiv 2023, arXiv:2312.17663. [Google Scholar]
  17. Wang, D.; Tan, J.; Wang, H.; Kong, L.; Zhang, C.; Pan, D.; Li, T.; Liu, J. SDS-YOLO: An improved vibratory position detection algorithm based on YOLOv11. Measurement 2025, 244, 116518. [Google Scholar] [CrossRef]
  18. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  19. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  20. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  21. Liu, Y.; Shao, Z.; Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar]
  22. Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. arXiv 2021, arXiv:2104.00298. [Google Scholar]
  23. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B. Mobilenetv4-universal models for the mobile ecosystem. arXiv 2023, arXiv:2404.10518. [Google Scholar]
  24. Yang, Z.; Wang, X.; Li, J. EIoU: An improved vehicle detection algorithm based on vehiclenet neural network. Proc. J. Phys. Conf. Ser. 2021, 1924, 012001. [Google Scholar] [CrossRef]
  25. Tan, H.; Ma, W.; Tian, Y.; Zhang, Q.; Li, M.; Li, M.; Yang, X. Improved YOLOv8n object detection of fragrant pears. Trans. Chin. Soc. Agric. Eng. 2024, 40, 178–185. [Google Scholar]
  26. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  27. Kang, H.; Zhou, F.; Gao, S.; Xu, Q. Crack detection of concrete based on improved CenterNet model. Appl. Sci. 2024, 14, 2527. [Google Scholar] [CrossRef]
  28. Lin, W.; Li, S. Grape leaf disease detection method based on improved RT-DETR. Trans. Chin. Soc. Agric. Mach. 2025, 41, 1–10. [Google Scholar] [CrossRef]
  29. Liu, S.; Xu, H.; Deng, Y.; Cai, Y.; Wu, Y.; Zhong, X.; Zheng, J.; Lin, Z.; Ruan, M.; Chen, J.; et al. YOLOv8-LSW: A Lightweight Bitter Melon Leaf Disease Detection Model. Agriculture 2025, 15, 1281. [Google Scholar] [CrossRef]
  30. Gao, L.; Cao, H.; Zou, H.; Wu, H. DMN-YOLO: A Robust YOLOv11 Model for Detecting Apple Leaf Diseases in Complex Field Conditions. Agriculture 2025, 15, 1138. [Google Scholar] [CrossRef]
  31. Arkin, H.; Hou, Y. Detecting Cistanche deserticola using YOLOv5s-ESTC. Trans. Chin. Soc. Agric. Eng. 2024, 40, 267–276. [Google Scholar]
  32. Guo, L.; Huang, J.; Wu, Y. Detecting rice diseases using improved lightweight YOLOv8n. Trans. Chin. Soc. Agric. Eng. 2025, 41, 156–164. [Google Scholar]
  33. Fan, Y.; Zhang, S.; Feng, K.; Qian, K.; Wang, Y.; Qin, S. Strawberry maturity recognition algorithm combining dark channel enhancement and YOLOv5. Sensors 2022, 22, 419. [Google Scholar] [CrossRef]
Figure 1. Cotton boll disease detection vehicle (1—GPS receiver; 2—aluminum profile bracket; 3—electronic compass; 4—fill light equipment; 5—mobile power supply; 6—Jetson TX2; 7—display screen; 8—RGB-D camera; 9—Songling SCOUT2.0 chassis).
Figure 2. Examples of dataset images: (a) RGB-D camera collection, (b) manual collection, and (c) online collection. All images are in RGB format.
Figure 3. Cotton boll disease dataset division: (a) uncracked green bolls, (b) severely diseased bolls, (c) lightly diseased bolls, and (d) opening cotton bolls.
Figure 4. Sample data enhancement: (a) original, (b) color, (c) mirroring, (d) random rotation, (e) adding noise, (f) random masking, (g) scaling, (h) contrast adjustment, and (i) random cropping.
Figure 5. YOLOv11n-ECS model structure. Note: Conv2 d is the convolutional layer, Concat is the feature connection module, Upsample is the upsampling module, Detect is the detection head, MaxPool2d is the maximum pooling, Bottleneck is the convolution module containing residual connections, and Split is the feature layering.
Figure 6. Network structure of the EMA (efficient multi-scale attention) mechanism module.
Figure 7. PConv module structure diagram. Note: c is the number of channels in the input feature map; cp is the number of channels participating in the convolution; * is the convolution operation; h and w are the height and width dimensions of the feature map.
Figure 8. C3k2_PConv module structure diagram.
Figure 9. Loss curves for different loss functions.
Figure 10. Comparison of detection effects of YOLOv11n and YOLOv11n-ECS: (a) YOLOv11n detection samples and (b) YOLOv11n-ECS detection samples. Note: △ in the figure indicates a repeat detection, ○ indicates a false positive, and ◊ indicates a false negative. Shapes represent meanings as in Figure 11.
Figure 11. Comparison of detection effects of YOLOv11n and YOLOv11n-ECS: (a) YOLOv11n detection samples and (b) YOLOv11n-ECS detection samples.
Table 1. Comparison of the detection effect of different models.
Model Name | Precision/% | Recall/% | mAP@0.5/% | FLOPs/G | Parameters/M
YOLOv5s | 87.3 | 79.2 | 82.6 | 15.8 | 7.2
YOLOv7 | 86.8 | 78.3 | 81.3 | 105.1 | 37
YOLOv8n | 86.2 | 81.5 | 83.0 | 28.7 | 11.8
YOLOv9t | 87.5 | 80.3 | 80.7 | 6.4 | 1.7
YOLOv11n | 90.1 | 81.9 | 83.3 | 6.4 | 2.5
Table 2. Comparative analysis of different attention mechanisms.
Attention Mechanisms | Precision/% | Recall/% | mAP@0.5/% | mAP@0.5:0.95/%
EMA | 90.7 | 81.9 | 84.7 | 60.2
SE | 90.2 | 77.6 | 82.1 | 57.4
CA | 87.1 | 78.5 | 80.9 | 55.1
ECA | 90.9 | 79.6 | 83.1 | 59.3
GAM | 88.3 | 71.5 | 79.6 | 58.9
Table 3. Comparison of different lightweight feature extraction backbone networks.
Models | Parameters/M | FLOPs/G | Model Size/MB | mAP@0.5/% | mAP@0.5:0.95/% | FPS
ShuffleNet | 1.7 | 4.1 | 3.8 | 78.9 | 55.4 | 86
EfficientNetv2 | 3.5 | 3.6 | 7.3 | 82.9 | 56.9 | 71
MobileNetV4 | 4.0 | 6.5 | 8.4 | 84.8 | 59.3 | 65
C3K2-PConv | 2.5 | 6.1 | 5.3 | 84.4 | 60.8 | 79
Table 4. YOLOv11n-ECS ablation experiment performance comparison.
No. | EMA | C3k2_PConv | Shape-IoU | mAP@0.5/% | mAP@0.5:0.95/% | FLOPs/G | Parameters/M | Model Size/MB
1 | × | × | × | 83.3 | 58.5 | 6.4 | 2.5 | 5.5
2 | √ | × | × | 84.7 | 60.2 | 6.5 | 2.5 | 5.5
3 | × | √ | × | 84.4 | 60.8 | 6.2 | 2.4 | 5.3
4 | × | × | √ | 84.2 | 60.7 | 6.4 | 2.5 | 5.5
5 | √ | √ | × | 85.1 | 62.1 | 6.2 | 2.4 | 5.4
6 | √ | × | √ | 84.9 | 61.4 | 6.5 | 2.5 | 5.5
7 | × | √ | √ | 85.0 | 61.1 | 6.2 | 2.4 | 5.3
8 | √ | √ | √ | 85.6 | 62.7 | 6.2 | 2.4 | 5.3
Table 5. Comparison of detection results of cotton boll diseases by different models.
Models | Precision/% | Recall/% | mAP@0.5/% | mAP@0.5:0.95/% | FLOPs/G | Parameters/M | Model Size/MB
CenterNet | 62.6 | 58.2 | 59.9 | 37.1 | 70.2 | 32.66 | 124.0
Faster R-CNN | 50.7 | 60.9 | 64.4 | 37.4 | 369.8 | 136.7 | 108
YOLOv8-LSW | 88.7 | 82.3 | 80.1 | 54.4 | 5.7 | 2.2 | 5.4
MSA-DETR | 89.9 | 83.0 | 81.6 | 59.9 | 57.6 | 32.4 | 38.9
DMN-YOLO | 87.5 | 80.3 | 81.1 | 60.9 | 6.7 | 2.9 | 5.5
YOLOv11n | 90.1 | 81.9 | 83.3 | 60.8 | 6.4 | 2.5 | 5.5
YOLOv11n-ECS | 92.2 | 84.3 | 85.6 | 62.7 | 6.2 | 2.4 | 5.3
Table 6. Model deployment test results on Jetson TX2.
Models | mAP@0.5/% | FPS
YOLOv11n | 81.4 | 52
YOLOv11n-ECS | 84.2 | 56
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, L.; Cui, W.; Li, J.; Han, G.; Zhou, Q.; Lan, Y.; Zhao, J.; Qiao, Y. A Real-Time Cotton Boll Disease Detection Model Based on Enhanced YOLOv11n. Appl. Sci. 2025, 15, 8085. https://doi.org/10.3390/app15148085
