Article

Improved YOLOv8n Method for the High-Precision Detection of Cotton Diseases and Pests

by Jiakuan Huang 1 and Wei Huang 1,2,*

1 School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China
2 Hubei Key Laboratory of Intelligent Robotics, Wuhan Institute of Technology, Wuhan 430205, China
* Author to whom correspondence should be addressed.
AgriEngineering 2025, 7(7), 232; https://doi.org/10.3390/agriengineering7070232
Submission received: 17 May 2025 / Revised: 23 June 2025 / Accepted: 6 July 2025 / Published: 11 July 2025

Abstract

Accurate detection of cotton pests and diseases is essential for agricultural productivity yet remains challenging due to complex field environments, the small size of pests and lesions, and significant occlusion. To address these challenges, a novel cotton disease and pest detection method is proposed. The method builds on the YOLOv8 baseline and incorporates a Multi-Scale Feature Enhancement (MSFE) module within the backbone to strengthen feature extraction for small targets. Furthermore, a Depthwise Separable Dilated Convolution module (C2f-DWR) is designed to replace the existing C2f module in the neck of the network; by employing varying dilation rates, this modification expands the receptive field and alleviates the loss of detail caused by downsampling. In addition, a Multi-Head Attention Detection Head (MultiSEAMDetect) is introduced to replace the original detection head. It combines diverse patch sizes with adaptive average pooling, enabling the model to adjust its responses to varying contexts and substantially improving its handling of occlusion during detection. For experimental validation, a dedicated cotton disease and pest detection dataset was constructed. On this dataset, the improved model raised mAP50 and mAP50:95 from 73.4% and 46.2% to 77.2% and 48.6%, respectively, compared with the original YOLOv8 algorithm. Validation on two Kaggle datasets showed mAP50 rising from 92.1% and 97.6% to 93.2% and 97.9%, respectively, while mAP50:95 improved from 86.0% and 92.5% to 87.1% and 93.5%. Compared with other advanced mainstream algorithms, the proposed method achieves higher precision and recall, indicating superior performance in cotton pest and disease detection.

1. Introduction

Cotton (Gossypium L.) is one of the most important economic crops globally. However, throughout its growth cycle, it faces constant threats from pests and diseases, resulting in significant economic losses [1]. Consequently, the development of effective strategies to address these issues, stabilize crop yields, and mitigate economic losses has become a critical concern [2]. Traditional pest and disease detection methods, particularly in large agricultural areas, rely heavily on on-site identification by agricultural experts or farmers based on their experience [3]. This reliance leads to labor-intensive and time-consuming processes, compounded by a shortage of agricultural specialists. Furthermore, visual inspections introduce subjective factors that can result in the indiscriminate use of pesticides, causing environmental pollution and unnecessary economic losses. With advancements in computer vision technology, deep learning has been increasingly applied in the detection of agricultural pests and diseases. These technologies not only automate the identification process but also enhance the efficiency and accuracy of detection [4].
In recent years, deep learning-based image detection networks have been categorized into two-stage and one-stage detection networks. The Faster Region-based Convolutional Neural Network (Faster R-CNN) [5] is one of the most representative two-stage detection networks. Zhou et al. [6] proposed a rice disease detection algorithm that integrates Faster R-CNN with FCM-KM, achieving commendable performance. Although Faster R-CNN boasts high detection accuracy, its speed is relatively slow. In contrast, one-stage detection networks typically offer faster performance than their two-stage counterparts, albeit with potential compromises in accuracy. You Only Look Once (YOLO) [7] and Single Shot MultiBox Detector (SSD) [8] are notable examples of one-stage detection networks. Sun et al. [9] introduced a multi-scale feature fusion instance detection method based on SSD, which improved the detection of maize leaf blight in complex backgrounds. Despite the advancements in speed, these methods still fall short of meeting real-time detection requirements. Conversely, YOLO has achieved notable success in balancing speed and accuracy, leading to its widespread application in the detection of agricultural pests and diseases. For instance, Liu et al. [10] enhanced the accuracy and speed of pest and disease detection by implementing image pyramid technology and utilizing the feature layers of the YOLOv3 model for multi-scale feature detection. Additionally, Liu et al. [11] proposed a tomato pest recognition algorithm based on an improved YOLOv4 that integrates a triplet attention mechanism (YOLOv4-TAM). This method demonstrated effective detection performance across multiple experimental datasets. Dai et al. [12] introduced the Swin Transformer and Transformer mechanisms into YOLOv5, enhancing its ability to capture global features and expanding its receptive field for detecting plant pests and diseases. Zhang et al. [13] employed a channel attention module (ECA) to reduce the influence of complex backgrounds and used the Swish activation function in conjunction with the focal loss function from YOLOX, resulting in improved performance in detecting cotton pests and diseases. Additionally, Zhang et al. [14] proposed the CBAM-YOLOv7 algorithm, an improved attention mechanism for YOLOv7, which provides a robust theoretical foundation for the real-time monitoring of cotton leaf diseases.
In these studies focused on small object detection, mainstream object detection network models have been enhanced primarily through techniques such as super-resolution [15], multi-scale feature fusion [16], contextual information learning [17], and attention mechanisms [18]. Some research has introduced novel model structures or optimization methods, such as the SSAM attention module and MPFPN structure [19], to enhance the accuracy of small object detection. Despite significant advancements in object detection algorithms, existing methods continue to face numerous challenges when detecting small pest and disease targets in cotton. These challenges primarily include the following [20]:
  • Due to low image resolution, the available feature information is limited, which can result in unclear visual characteristics of small objects.
  • Effective feature extraction is critical for object detection tasks, as the quality of feature extraction directly impacts the accuracy of detection results. Compared to large-scale objects, the extraction of features from small objects presents greater challenges. This is primarily because essential features of small objects may be lost following multiple down-sampling operations during the detection process, further complicating the detection task.
  • The complexity of field environments makes small objects prone to occlusion, which poses difficulties in distinguishing the target from the background.
To address these challenges and enhance the performance of small object detection for cotton pest and disease recognition, we propose a lightweight pest and disease detection model based on YOLOv8n. First, we introduce the Multi-Scale Feature Enhancement (MSFE) module into the backbone to strengthen feature extraction capabilities. Second, we design a Depthwise Separable Dilated Convolution module (C2f-DWR) to replace the C2f module in the neck, thereby improving the extraction of both coarse-grained and fine-grained semantic information. Finally, we integrate the Separable Enhanced Attention Module (SEAM) into the original detection head, aiming to enhance its learning capability for complex pest samples and ultimately improve detection accuracy for small targets. Through these enhancements, we aim to deliver a more efficient and accurate method for detecting cotton pests and diseases, thereby significantly improving recognition performance for small pest and disease targets.

2. Materials and Methods

2.1. Establishment of the Dataset

The cotton pest and disease detection dataset used in this study was collected with an online web crawler (the images obtained from the web are freely available) and comprises 6097 images annotated using LabelImg (version 1.8.6). During dataset construction, the images were split into training, validation, and test sets at a ratio of 7:2:1. The dataset covers eight categories: leaf spot disease (Alternaria spp.), leaf blight (Ascochyta spp.), wilt disease (Verticillium spp.), gray mold (Botrytis cinerea), leaf curl disease (Begomovirus), aphids (Aphididae), armyworms (Spodoptera spp.), and healthy plants. A sample of the dataset is presented in Figure 1.
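As a minimal sketch of the 7:2:1 split described above, the following Python snippet partitions a flat folder of images and their LabelImg YOLO-format .txt annotations into training, validation, and test sets. The directory layout, file extensions, and random seed are assumptions for illustration, not the authors' actual pipeline.

```python
import random
import shutil
from pathlib import Path

# Hypothetical layout: cotton_dataset/images/*.jpg with LabelImg YOLO-format
# .txt annotations stored next to each image. Adjust paths as needed.
random.seed(0)
images = sorted(Path("cotton_dataset/images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.7 * n)],              # 70%
    "val":   images[int(0.7 * n): int(0.9 * n)],  # 20%
    "test":  images[int(0.9 * n):],               # 10%
}

for split, files in splits.items():
    img_dir = Path("cotton_dataset") / split / "images"
    lbl_dir = Path("cotton_dataset") / split / "labels"
    img_dir.mkdir(parents=True, exist_ok=True)
    lbl_dir.mkdir(parents=True, exist_ok=True)
    for img in files:
        shutil.copy(img, img_dir / img.name)
        label = img.with_suffix(".txt")           # YOLO-format annotation
        if label.exists():
            shutil.copy(label, lbl_dir / label.name)
```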

2.2. Cotton Pest and Disease Detection Methods

2.2.1. Introduction to the YOLOv8 Algorithm

YOLOv8 is the latest model in the YOLO series, employing an efficient single-stage detection architecture that predicts object bounding boxes and their corresponding class probabilities directly from input images using a unified neural network. This approach allows YOLOv8 to excel in real-time object detection tasks. Additionally, the model integrates a range of advanced data augmentation techniques, which enhance its generalization across diverse environments and enable it to adapt more effectively to the complex and dynamic scenarios encountered in real-world applications.
YOLOv8 is categorized into multiple versions, including YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, based on varying depths and widths of the network. These versions share a similar network architecture, with the primary distinctions residing in their depth and width configurations [21].
The core structure of the YOLOv8 model consists of three key layers: the backbone, the neck, and the head. The model first preprocesses the input images through the input layer, followed by feature extraction using the backbone. The extracted features are then input into the neck, which merges features across different scales to construct a feature pyramid, thereby enhancing the model’s information representation capabilities. Finally, the prediction results are generated by the head layer. The network architecture of YOLOv8 is illustrated in Figure 2.

2.2.2. Improved YOLOv8 Architecture

The application of deep learning networks in various complex real-world scenarios holds significant practical value for detection tasks. This paper explores optimization strategies based on YOLOv8 to effectively detect small targets associated with cotton pests and diseases in intricate environments. The main improvements are as follows:
To enhance the model's performance in complex settings and improve the detection of small and distant targets, a Multi-Scale Feature Enhancement (MSFE) module is introduced.
Additionally, C2f-DWR is proposed to replace the original C2f module in the YOLOv8 neck. This modification reduces information loss for small targets during the multi-layer convolution process while decreasing the model's complexity, thereby further enhancing the model's ability to recognize targets of varying scales in complex backgrounds, particularly small targets.
Finally, to address occlusion issues in complex environments, a MultiSEAMDetect head is proposed to replace the original detection head of YOLOv8. This adaptation integrates adaptive pooling and channel reduction techniques, effectively utilizing multi-scale spatial information to enhance detection accuracy and robustness. The improved YOLOv8 network structure is illustrated in Figure 3.
(1)
Multi-Scale Feature Enhancement (MSFE) Module
The SPPF layer in the Backbone primarily enhances the receptive field through pooling operations. While it can capture information at various scales, it may struggle to fully extract detailed local features or texture information when addressing complex characteristics. In the context of crop pest and disease detection, small lesions or pests may be overlooked. Moreover, the SPPF layer does not adaptively adjust for channel importance, which can result in the suppression of features from significant channels while emphasizing irrelevant features, ultimately affecting overall detection performance. To mitigate these issues, the MSFE [22] module was introduced following the SPPF layer.
The MSFE is a network structure based on the channel attention mechanism (ECA) [23], spatially separable convolutions, and multi-scale feature extraction, as illustrated in Figure 4. It enhances the feature representation capability of channels through the ECA module and employs multiple branches with varying convolution and pooling strategies to capture multi-scale and multi-directional feature information. By utilizing spatially separable convolutions (such as 1 × 3, 3 × 1, 5 × 1, and 1 × 5 convolutions) in specific branches, it further optimizes computational efficiency without compromising feature extraction capabilities. This design enables the network to effectively capture and integrate spatial and channel information in the input data while reducing computational load and the complexity of the network structure, ultimately improving the effectiveness of feature extraction and representation.
ECA (Efficient Channel Attention) is an attention mechanism designed to enhance the performance of convolutional neural networks. It comprises a compression module that aggregates global spatial information and an excitation module that facilitates interaction between channels. As illustrated in Figure 5, the process commences with global average pooling to aggregate features without reducing dimensionality. Subsequently, ECA adaptively determines the kernel size (k). A one-dimensional convolution of size (k) is then applied to the feature map, followed by the application of a Sigmoid function to derive the channel attention weights. These weights are utilized to scale each channel of the input feature map, resulting in the generation of the output feature map. The formula for ECA is as follows:
$$\omega = \delta\big(\mathrm{C1D}_k(\mathrm{GAP}(X))\big) \quad (1)$$
$$Y = \omega \otimes X \quad (2)$$
$$k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}} \quad (3)$$
where δ denotes the Sigmoid function, C1D_k denotes a one-dimensional convolution with kernel size k, and GAP denotes global average pooling. The kernel size k is determined adaptively from the channel dimension C via Equation (3), where |t|_odd denotes the odd number nearest to t, and γ and b are hyperparameters set to 2 and 1, respectively. ω denotes the channel attention weights, and X and Y are the input and output feature maps.
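For concreteness, the following is a minimal PyTorch sketch of the ECA mechanism described by Equations (1)–(3). It follows the published ECA design (global average pooling, adaptive kernel size, 1-D convolution, Sigmoid scaling) but is not the authors' exact implementation.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP followed by a 1-D convolution whose
    kernel size k is derived adaptively from the channel count C
    (Equation (3), with gamma = 2 and b = 1)."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # k = |log2(C)/gamma + b/gamma|_odd -> round up to the nearest odd number
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1
        self.avg_pool = nn.AdaptiveAvgPool2d(1)  # GAP, no dimensionality reduction
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> channel descriptor (B, C, 1, 1)
        y = self.avg_pool(x)
        # Treat channels as a 1-D sequence for local cross-channel interaction
        y = self.conv(y.squeeze(-1).transpose(-1, -2)).transpose(-1, -2).unsqueeze(-1)
        w = self.sigmoid(y)  # channel attention weights (Equation (1))
        return x * w         # rescale each channel of the input (Equation (2))
```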
Spatially separable convolution decomposes a standard convolution in the spatial dimension into multiple smaller-kernel convolutions, as shown in Equations (4) and (5): a k × k convolution is equivalent to the combination of a k × 1 convolution and a 1 × k convolution.
$$\begin{bmatrix} w_{11} & \cdots & w_{1k} \\ \vdots & \ddots & \vdots \\ w_{k1} & \cdots & w_{kk} \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_k \end{bmatrix} \times \begin{bmatrix} y_1 & \cdots & y_k \end{bmatrix} \quad (4)$$
$$w_{ij} = x_i \times y_j, \quad 1 \le i, j \le k \quad (5)$$
where w_ij denotes the value at position (i, j) in the k × k kernel, x_i the value at position (i, 1) in the k × 1 kernel, and y_j the value at position (1, j) in the 1 × k kernel. The k × 1 and 1 × k convolutions together have k + k = 2k parameters, which is fewer than the k² parameters of the k × k convolution whenever k > 2. Spatially separable convolution therefore reduces the number of parameters in the module, accelerates computation, and allows the network depth to increase.
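The parameter saving is straightforward to check in PyTorch. The sketch below, with illustrative channel counts, compares a standard k × k convolution against its k × 1 plus 1 × k decomposition; note that by Equation (4) the two are exactly equivalent only when the k × k kernel is a rank-1 outer product.

```python
import torch
import torch.nn as nn

# Illustrative sizes; any k > 2 shows the saving.
k, ch = 5, 16

standard = nn.Conv2d(ch, ch, kernel_size=(k, k), padding=k // 2, bias=False)
separable = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=(k, 1), padding=(k // 2, 0), bias=False),
    nn.Conv2d(ch, ch, kernel_size=(1, k), padding=(0, k // 2), bias=False),
)

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Per input-output channel pair: k*k = 25 weights vs. k + k = 2k = 10 weights.
print(n_params(standard), n_params(separable))  # 6400 vs. 2560 for ch=16, k=5

x = torch.randn(1, ch, 64, 64)
assert standard(x).shape == separable(x).shape  # spatial output size preserved
```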
Incorporating the Multi-Scale Feature Enhancement (MSFE) module after the SPPF layer of the backbone yielded significant improvements across several metrics. However, to manage the resulting multi-scale feature information effectively, the number of channels in the head network must be increased to ensure dimensional compatibility after feature fusion. This adjustment improves the model's adaptability in multi-scale object detection, enabling it to better represent target features across scales and thereby improving detection accuracy and robustness. The network architecture was therefore adjusted to integrate MSFE coherently, using cross-layer connections and the feature pyramid structure to reduce confusion among multi-scale features. Experimental results indicate that, when the final detection layer of the head receives standardized input, hierarchically adjusting the merged connection layers processes feature information more effectively, making the model more flexible and better able to generalize.
(2)
C2f-DWR Module Design
The C2f module serves as a backbone network layer grounded in Cross Stage Partial (CSP) networks. It comprises an initial and a final convolution layer (CV1 and CV2), along with a series of bottleneck blocks. Each bottleneck block fosters feature reuse and facilitates information flow by processing intermediate features. The network begins by extracting raw features, and these design choices enhance the network’s performance and efficiency without sacrificing detection accuracy. However, as deeper layers progressively capture high-level features, the reliance on bottleneck blocks with smaller convolution kernels and limited receptive fields may result in the loss of critical details pertaining to small objects, ultimately impacting detection accuracy and robustness.
The Dilation-wise Residual Module (DWR) [24] is designed to enhance the feature extraction capabilities of convolutional neural networks. This module utilizes dilated convolutions and wide connections to improve the network’s feature extraction ability. Dilated convolutions implement varying dilation rates to expand the receptive field, thereby capturing a broader spatial context. Meanwhile, wide connections augment channel connectivity, which enhances the stability of residual learning and feature transfer. These techniques enable the network to maintain the integrity and accuracy of information when processing small objects and complex visual tasks.
To mitigate information loss during downsampling, the C2f-DWR replaces the bottleneck blocks in the C2f module with the DWR module, as illustrated in Figure 6. This modification significantly improves the model’s capability to detect and recognize small objects, thereby reducing information loss resulting from multiple convolution operations and enhancing detection accuracy and robustness.
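As a rough illustration of the mechanism (not the DWRSeg module [24] or the authors' C2f-DWR verbatim), the following PyTorch sketch combines parallel dilated depthwise 3 × 3 convolutions with a residual connection; the dilation rates and the 1 × 1 fusion layer are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Simplified DWR-style block: parallel 3x3 depthwise convolutions with
    different dilation rates widen the receptive field, and a residual (wide)
    connection preserves fine detail lost to repeated downsampling."""

    def __init__(self, channels: int, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels, bias=False),  # depthwise, dilated
                nn.BatchNorm2d(channels),
                nn.SiLU(),
            )
            for d in dilations
        ])
        # 1x1 convolution fuses the concatenated branches back to the input width
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(multi_scale)  # residual keeps low-level detail

block = DilatedResidualBlock(64)
print(block(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])
```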
(3)
MultiSEAMDetect Module Design
The detection head design of YOLOv8 utilizes a single-scale receptive field to identify objects, concentrating on extracting features from specific levels. However, this approach may hinder the model’s ability to effectively comprehend and locate the occluded portions of partially obscured objects.
MultiSEAM [25] is a multi-head attention network developed to tackle challenges such as alignment errors, local occlusions, and feature loss resulting from inter-class occlusion, as illustrated in Figure 7. It incorporates multiple instances of the Channel and Spatial Mixing Module (CSMM), with each module configured with varying patch sizes to ensure that the network effectively captures both local and global spatial information. Following feature extraction through the CSMM modules, an adaptive average pooling layer is employed to consolidate the spatial information into a compact representation. This pooling step reduces dimensionality while preserving essential features. Subsequently, a fully connected network is utilized to perform adaptive scaling and modulation of the extracted features. This mechanism enables the network to adjust its response based on the input content, thereby enhancing its robustness and discriminative capabilities.
To address the occlusion problem, MultiSEAMDetect incorporates MultiSEAM into the original detection head, as illustrated in Figure 8. This integration preserves the richness of feature representation while simultaneously reducing the number of parameters. By extracting features at various scales and synthesizing information across multiple channels, the model enhances its understanding and detection capabilities for occluded objects. Consequently, this improvement further increases the model’s adaptability and generalization in complex scenes.
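The following PyTorch sketch conveys the idea in simplified form: depthwise branches with different patch sizes, adaptive average pooling to a compact descriptor, and a fully connected network that rescales channels based on content. It is an illustrative approximation, not the MultiSEAM implementation of [25]; the branch layout, patch sizes, and reduction ratio are assumptions.

```python
import torch
import torch.nn as nn

class PatchMixingAttention(nn.Module):
    """Hedged sketch of the MultiSEAM idea: several CSMM-like branches mix
    spatial information at different patch sizes, adaptive average pooling
    compresses the result, and an FC network modulates channel responses."""

    def __init__(self, channels: int, patch_sizes=(3, 5, 7), reduction: int = 4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, p, padding=p // 2,
                          groups=channels, bias=False),  # patch-wise depthwise mixing
                nn.GELU(),
                nn.BatchNorm2d(channels),
            )
            for p in patch_sizes
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)  # compact spatial summary
        self.fc = nn.Sequential(             # adaptive scaling of features
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = sum(branch(x) for branch in self.branches) / len(self.branches)
        w = self.fc(self.pool(y).view(b, c)).view(b, c, 1, 1)
        return x * w  # content-dependent response, more robust under occlusion
```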

2.3. Experimental Environment Configuration

The hardware configuration used in the experiments is as follows: Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, NVIDIA GeForce RTX 2080 Ti, with CUDA version 11.3. The experimental platform operates on Ubuntu 20.04, and the deep learning framework utilized is PyTorch (1.12.1+cu113). The specific parameter settings for the experiments are detailed in Table 1.
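For reference, the Table 1 settings map directly onto the Ultralytics training API. The snippet below sketches how the YOLOv8n baseline would be trained with these hyperparameters; the dataset configuration file name is a placeholder, and the improved model would additionally require the custom modules (MSFE, C2f-DWR, MultiSEAMDetect) to be registered with the framework.

```python
from ultralytics import YOLO

# Baseline training with the Table 1 settings; 'cotton.yaml' is an assumed
# dataset config listing the image paths and the eight class names.
model = YOLO("yolov8n.pt")
model.train(
    data="cotton.yaml",
    imgsz=640,       # input size 640 x 640
    epochs=300,
    batch=32,
    optimizer="SGD",
    momentum=0.937,
    lr0=0.01,        # initial learning rate
)
```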

2.4. Model Evaluation Metrics

To evaluate the performance of the proposed algorithm and compare it with other methods, the following metrics were utilized: Precision (P), average precision (AP), Recall (R), and the computational complexity indicated by the number of floating-point operations (FLOPs). The evaluation metrics were defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (6)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (7)$$
$$AP = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d\mathrm{Recall} \quad (8)$$
$$mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i \quad (9)$$
In these formulas, TP is the number of samples correctly identified as the current cotton pest or disease category, FP is the number of samples from other categories that the model incorrectly assigns to the current category, and FN is the number of samples belonging to the current category that the model incorrectly assigns to other categories. C is the number of pest and disease categories.
We also report the mean average precision (mAP) at different IoU thresholds: mAP@0.5 is the AP averaged across all categories at an IoU threshold of 0.5, while mAP@0.5:0.95 averages the per-category AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
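Equations (8) and (9) can be realized numerically as follows. This NumPy sketch applies the common monotone-precision envelope before integrating the precision-recall curve; the P/R arrays and per-class values are purely illustrative.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (Equation (8)). Inputs are the
    P/R values for one class, swept over confidence thresholds."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing (standard PR envelope)
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.trapz(p, r))

def mean_average_precision(ap_per_class) -> float:
    """Equation (9): mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Toy example with assumed values: two classes at IoU threshold 0.5
ap_class_0 = average_precision(np.array([0.2, 0.5, 0.9]),
                               np.array([1.0, 0.8, 0.6]))
print(mean_average_precision([ap_class_0, 0.70]))
```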

3. Results and Discussion

3.1. The Impact of the MSFE Module on the Algorithm

To validate the impact of the network architecture adjustments after incorporating the MSFE module, two sets of experiments were designed: the first set involved adding the MSFE module to the original model followed by fine-tuning, while the second set involved fine-tuning the model after integrating all design modules into the original network. The detailed results are shown in Table 2. After incorporating the MSFE module, the fine-tuned network structure model achieved improvements of 0.1% and 1% in mAP50 and mAP50:95, respectively, compared to the model without fine-tuning. When all design modules were integrated into the model, the network structure model without fine-tuning showed a 0.1% decrease in mAP50 compared to the model with only the MSFE module. Furthermore, the fine-tuned network structure model outperformed the non-fine-tuned model by 2.2% and 1.3% in mAP50 and mAP50:95, respectively. These results suggest that when the final detection layer of the detection head receives standardized input, adjusting the connection layers of the merged head can more effectively process feature information, thereby enhancing the model’s flexibility and generalization ability. Therefore, in this study, the model structure was fine-tuned after incorporating the MSFE module.

3.2. Ablation Experiments

To validate the impact of the improvement methods on the YOLOv8 model, ablation experiments were conducted by applying each module individually to the YOLOv8n model. Four evaluation criteria were employed to assess their effectiveness, as illustrated in Table 3.
The introduction of the MSFE module enhanced the model’s feature representation capability, resulting in a 1.7% increase in mAP50. Substituting the neck C2f module with C2f-DWR mitigated the loss of critical information while also reducing the model’s computational load, leading to a 1.1% increase in mAP50 and a 1.1% decrease in FLOPs. Replacing the original detection head with MultiSEAMDetect improved mAP while significantly decreasing FLOPs, indicating that this module not only effectively enhances the model’s robustness but also conserves computational resources.
When all three modules were integrated simultaneously, mAP50 increased by 3.8%, mAP50:95 improved by 2.4%, and FLOPs decreased by 6.17%, achieving optimal performance for the model. The improved algorithm was trained and tested on the cotton pest and disease dataset, with the detection results presented in Figure 9. The original model failed to detect certain diseases, whereas the improved model significantly reduced this issue, demonstrating the effectiveness of the proposed method for detecting cotton pests and diseases.

3.3. Performance Comparison with Other Algorithms

To validate the rationale and effectiveness of the algorithm improvements proposed in this paper, comparative experiments were conducted with current mainstream object detection algorithms under the same experimental conditions and dataset. The experimental results are presented in Table 4. On the core metric of mAP50, the proposed algorithm achieved the highest value, 77.2%, demonstrating a clear precision advantage. Compared to the classic lightweight models YOLOv5n (71.3%) and YOLOv8n (73.4%), it outperformed them by 5.9 and 3.8 percentage points, respectively. Although YOLOv9s, with its more complex network architecture, reached 76.9%, its computational cost (26.7 GFLOPs) far exceeded that of the proposed model (7.6 GFLOPs). The proposed algorithm also exhibited a clear mAP50 advantage over the other reference models, including YOLOv6 (72.2%), YOLOv7-tiny (70.4%), YOLOv10n (72.7%), and RT-DETR (73.1%) [26,27,28,29,30]. In terms of inference speed, while YOLOv6 (277 FPS) and YOLOv10n (256 FPS) slightly outperformed the proposed algorithm (227 FPS), their detection accuracy was significantly lower than that of the proposed model (77.2% mAP50). This indicates that the proposed algorithm maintains excellent real-time performance while ensuring high precision. Furthermore, the proposed model achieved the highest recall (72.8%), surpassing all reference models, which indicates exceptional object coverage and sensitivity to positive samples, effectively reducing missed detections. Overall, the proposed algorithm achieves a superior balance between precision, speed, and computational efficiency, demonstrating promising prospects for practical applications.

3.4. Model Generalization Experiment

To validate the robustness and effectiveness of the model, tests were conducted on the original YOLOv8 model and the proposed method using two datasets obtained from the Kaggle website. The Field Weed Dataset was obtained from https://www.kaggle.com/datasets/jawadulkarim117/cotton-weed-12-class (accessed on 17 July 2024). The Tomato Leaf Disease Dataset was obtained from https://www.kaggle.com/datasets/kpoviesistphane/tomato-leaf-disease-detection (accessed on 17 July 2024). The Field Weed Dataset comprises twelve types of weeds (Chenopodium album, Convolvulus arvensis, Cynodon dactylon, Dichondra repens, Eclipta prostrata, Malva parviflora, Palmer amaranth, Persicaria pensylvanica, Portulaca oleracea, Sium suave, Sonchus oleraceus, and Artemisia vulgaris), totaling 5648 images divided into a training set (4321 images), a validation set (480 images), and a test set (847 images). The Tomato Leaf Disease Dataset consists of nine categories (Early Blight, Healthy, Late Blight, Leafminer, Leaf Mold, Mosaic Virus, Septoria Leaf Spot, Spider Mites, and Yellow Curl Virus), containing a total of 10,044 images divided into a training set (9037 images), a validation set (843 images), and a test set (1164 images). The experimental environment and hyperparameters were kept identical to those used in the cotton pest and disease experiments.
The experimental results are presented in Table 5, where A denotes the Field Weed Dataset and B the Tomato Leaf Disease Dataset. On the Field Weed Dataset, the improved YOLOv8 model achieved gains of 1.1% in both mAP50 and mAP50:95 over the original model. On the Tomato Leaf Disease Dataset, improvements of 0.3% and 1.0% were observed in mAP50 and mAP50:95, respectively, further demonstrating the effectiveness of the enhanced model.

4. Conclusions

This paper presents an improved YOLOv8 model for detecting cotton pests and diseases in field conditions. The original YOLOv8n model underwent structural adjustments in the backbone, neck, and head: the MSFE module was introduced in the backbone, the C2f module in the neck was replaced with the C2f-DWR module, and the original detection head was replaced with MultiSEAMDetect. The model's performance was comprehensively evaluated on the prepared dataset and compared with other mainstream algorithms, and an ablation study was conducted to validate the effectiveness of the designed modules. The results show that the proposed algorithm achieved optimal performance with a recall of 72.8%, an mAP50 of 77.2%, and a computational cost of 7.6 GFLOPs. The comprehensive evaluation and comparison with standard lightweight models indicate that the improved model has a compact structure, lower complexity, higher detection accuracy, and real-time detection capability. At 227 FPS, it is slightly slower than YOLOv6 and YOLOv10n, but its inference speed is sufficient to meet real-time detection requirements.
Considering potential application on mobile devices, the computational load becomes an important factor to address. Therefore, future research will focus on exploring effective methods to reduce the floating-point computation requirements of the model and develop more lightweight object detection models. Additionally, the improved model will be deployed on resource-constrained devices, such as drones, and a platform for precise detection and pesticide spraying will be developed.

Author Contributions

Conceptualization, J.H.; methodology, W.H.; software, J.H.; formal analysis, J.H.; investigation, J.H.; resources, J.H. and W.H.; data curation, J.H. and W.H.; writing—original draft, J.H.; visualization, W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Hubei Province Science and Technology Innovation Talents Project (grant number 2023DJC070).

Data Availability Statement

Data are unavailable due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Abbas, A. Toxicity of Selective Insecticides against Sap Sucking Insect Pests of Cotton (Gossypium Hirsutum). Pure Appl. Biol. 2021, 11, 72–78. [Google Scholar] [CrossRef]
  2. He, Y.; Xu, Y.; Chen, X. Biology, Ecology and Management of Tephritid Fruit Flies in China: A Review. Insects 2023, 14, 196. [Google Scholar] [CrossRef]
  3. Lippi, M.; Bonucci, N.; Carpio, R.F.; Contarini, M.; Speranza, S.; Gasparri, A. A YOLO-Based Pest Detection System for Precision Agriculture. In Proceedings of the 2021 29th Mediterranean Conference on Control and Automation (MED), Puglia, Italy, 22–25 June 2021; pp. 342–347. [Google Scholar]
  4. Meng, F.; Li, J.; Zhang, Y.; Qi, S.; Tang, Y. Transforming Unmanned Pineapple Picking with Spatio-Temporal Convolutional Neural Networks. Comput. Electron. Agric. 2023, 214, 108298. [Google Scholar] [CrossRef]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  6. Zhou, G.; Zhang, W.; Chen, A.; He, M.; Ma, X. Rapid Detection of Rice Disease Based on FCM-KM and Faster R-CNN Fusion. IEEE Access 2019, 7, 143190–143206. [Google Scholar] [CrossRef]
  7. Liu, C. Farmland Pest Detection Based on YOLO-V5l and ResNet50. Artif. Intell. Robot. Res. 2022, 11, 236–247. [Google Scholar] [CrossRef]
  8. Deng, Z.; Wang, P.; Song, X.H. Research on Granary Pest Detection Based on SSD. J. Comput. Eng. Appl. 2020, 56, 214–218. [Google Scholar]
  9. Sun, J.; Yang, Y.; He, X.; Wu, X. Northern Maize Leaf Blight Detection Under Complex Field Environment Based on Deep Learning. IEEE Access 2020, 8, 33679–33688. [Google Scholar] [CrossRef]
  10. Liu, J.; Wang, X. Tomato Diseases and Pests Detection Based on Improved Yolo V3 Convolutional Neural Network. Front. Plant Sci. 2020, 11, 898. [Google Scholar] [CrossRef]
  11. Liu, J.; Wang, X.; Miao, W.; Liu, G. Tomato Pest Recognition Algorithm Based on Improved YOLOv4. Front. Plant Sci. 2022, 13, 814681. [Google Scholar] [CrossRef]
  12. Dai, M.; Dorjoy, M.M.H.; Miao, H.; Zhang, S. A New Pest Detection Method Based on Improved YOLOv5m. Insects 2023, 14, 54. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Ma, B.; Hu, Y.; Li, C.; Li, Y. Accurate Cotton Diseases and Pests Detection in Complex Background Based on an Improved YOLOX Model. Comput. Electron. Agric. 2022, 203, 107484. [Google Scholar] [CrossRef]
  14. Zhang, N.; Zhang, X.; Bai, T.C. Identification Method of Cotton Leaf Pests and Diseases in Natural Environment Based on CBAM-YOLO v7. Trans. Chin. Soc. Agric. Mach. 2023, 54, 239–244. [Google Scholar]
  15. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving Multi-Scale Feature Learning for Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  16. Deng, J.; Yang, C.; Huang, K.; Lei, L.; Ye, J.; Zeng, W.; Zhang, J.; Lan, Y.; Zhang, Y. Deep-Learning-Based Rice Disease and Insect Pest Detection on a Mobile Phone. Agronomy 2023, 13, 2139. [Google Scholar] [CrossRef]
  17. Bai, Y.; Zhang, Y.; Ding, M.; Ghanem, B. SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network. In Computer Vision—ECCV 2018, Proceedings of the 5th European Conference, Munich, Germany, 8–14 September 2018; Lecture Notes in Computer Science; Springer: Heidelberg, Germany, 2018; pp. 210–226. [Google Scholar]
  18. Qiao, S.; Chen, L.-C.; Yuille, A. DetectoRS: Detecting Objects with Recursive Feature Pyramid and Switchable Atrous Convolution. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  19. Lim, J.-S.; Astrid, M.; Yoon, H.-J.; Lee, S.-I. Small Object Detection Using Context and Attention. In Proceedings of the 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Jeju Island, Republic of Korea, 13–16 April 2021; pp. 181–186. [Google Scholar] [CrossRef]
  20. Chen, G.; Wang, H.; Chen, K.; Li, Z.; Song, Z.; Liu, Y.; Chen, W.; Knoll, A. A Survey of the Four Pillars for Small Object Detection: Multiscale Representation, Contextual Information, Super-Resolution, and Region Proposal. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 936–953. [Google Scholar] [CrossRef]
  21. Chen, Y.; Zheng, W.; Zhao, Y.; Song, T.; Shin, H. DW-YOLO: An Efficient Object Detector for Drones and Self-Driving Vehicles. arXiv 2024, arXiv:2403.15447. [Google Scholar] [CrossRef]
  22. Zhang, W.; Liu, Z.; Zhou, S.; Qi, W.; Wu, X.; Zhang, T.; Han, L. LS-YOLO: A Novel Model for Detecting Multiscale Landslides with Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4952–4965. [Google Scholar] [CrossRef]
  23. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  24. Wei, H.; Liu, X.; Xu, S.; Dai, Z.; Dai, Y.; Xu, X. DWRSeg: Rethinking Efficient Acquisition of Multi-Scale Contextual Information for Real-Time Semantic Segmentation. arXiv 2022, arXiv:2212.01173. [Google Scholar]
  25. Yu, Z.; Huang, H.; Chen, W.; Su, Y.; Liu, Y.; Wang, X. YOLO-FaceV2: A Scale and Occlusion Aware Face Detector. arXiv 2024, arXiv:2208.02019. [Google Scholar] [CrossRef]
  26. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  27. Wang, C.; Bochkovskiy, A.; Liao, H. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. arXiv 2023, arXiv:2207.02696. [Google Scholar]
  28. Wang, C.; Yeh, I.; Liao, H. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  29. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  30. Lv, W.; Xu, S.; Zhao, Y.; Wang, G.; Wei, J.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. DETRs Beat YOLOs on Real-Time Object Detection. arXiv 2024, arXiv:2304.08069. [Google Scholar]
Figure 1. Sample dataset diagram.
Figure 2. YOLOv8n network structure.
Figure 3. Improved YOLOv8n network structure.
Figure 4. MSFE module structure.
Figure 5. ECA module structure.
Figure 6. C2f-DWR module structure.
Figure 7. MultiSEAM module structure.
Figure 8. MultiSEAMDetect module structure.
Figure 9. Comparison of algorithm detection effects.
Table 1. Experimental parameter settings.

| Training Parameters | Details |
| --- | --- |
| img-size (pixel) | 640 × 640 |
| Epochs | 300 |
| Batch | 32 |
| Optimization algorithm | SGD |
| Momentum | 0.937 |
| Initial learning rate | 0.01 |
Table 2. Effect of adjusting the network after inserting the MSFE module.

| Models | Adjusting the Network | GFLOPs | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv8n | – | 8.1 | 79.0 | 69.1 | 73.4 | 46.2 |
| +MSFE | – | 8.4 | 82.4 | 71.9 | 75.1 | 46.8 |
| +MSFE | ✓ | 8.6 | 80.4 | 72.2 | 75.2 | 47.8 |
| +MSFE + C2f-DWR + MultiSEAMDetect | – | 7.5 | 84.2 | 68.8 | 75.0 | 47.3 |
| +MSFE + C2f-DWR + MultiSEAMDetect | ✓ | 7.6 | 82.3 | 72.8 | 77.2 | 48.6 |
Table 3. Comparison of ablation experiments.

| Experiment | MSFE | C2f-DWR | MultiSEAMDetect | GFLOPs | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | – | – | – | 8.1 | 79.0 | 69.1 | 73.4 | 46.2 |
| 2 | ✓ | – | – | 8.6 | 80.4 | 72.2 | 75.2 | 47.8 |
| 3 | – | ✓ | – | 8.0 | 81.1 | 71.7 | 74.5 | 47.1 |
| 4 | – | – | ✓ | 7.3 | 82.3 | 69.9 | 75.1 | 46.1 |
| 5 | ✓ | ✓ | – | 8.5 | 83.1 | 72.2 | 75.4 | 48.1 |
| 6 | ✓ | – | ✓ | 7.7 | 80.0 | 73.6 | 75.7 | 47.3 |
| 7 | – | ✓ | ✓ | 7.2 | 84.1 | 70.0 | 75.6 | 47.9 |
| 8 | ✓ | ✓ | ✓ | 7.6 | 82.3 | 72.8 | 77.2 | 48.6 |
Table 4. Performance comparison of different algorithms.

| Model | P/% | R/% | mAP@0.5/% | GFLOPs | FPS |
| --- | --- | --- | --- | --- | --- |
| YOLOv5n | 81.6 | 67.4 | 71.3 | 7.1 | 99 |
| YOLOv6 | 79.2 | 68.4 | 72.2 | 11.8 | 277 |
| YOLOv7-tiny | 69.8 | 72.7 | 70.4 | 13.1 | 161 |
| YOLOv8n | 79.0 | 69.1 | 73.4 | 8.1 | 232 |
| YOLOv9s | 84.6 | 72.6 | 76.9 | 26.7 | 145 |
| YOLOv10n | 78.3 | 69.1 | 72.7 | 6.5 | 256 |
| RT-DETR | 82.9 | 69.3 | 73.1 | 57.0 | 104 |
| Ours | 82.3 | 72.8 | 77.2 | 7.6 | 227 |
Table 5. Experimental results on different datasets.

| Dataset | Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% |
| --- | --- | --- | --- | --- | --- |
| A | YOLOv8n | 92.5 | 85.3 | 92.1 | 86.0 |
| A | Ours | 92.4 | 86.6 | 93.2 | 87.1 |
| B | YOLOv8n | 96.3 | 92.9 | 97.6 | 92.5 |
| B | Ours | 96.1 | 94.2 | 97.9 | 93.5 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
