1. Introduction
Cotton (Gossypium spp.) plays a vital role in global agriculture, serving as a cornerstone of the textile industry and supporting the livelihoods of millions of farmers, particularly in regions such as Central Asia, South Asia, and parts of Africa [1]. Despite its economic importance, cotton cultivation is highly vulnerable to a broad spectrum of biotic stressors, including foliar and vascular diseases such as leaf spot, Fusarium wilt, gray mold, and aphid-induced stress [2]. These diseases can initiate at early developmental stages [3], subtly manifesting through visual symptoms like minor chlorosis [4], spot formation [5], and edge curling [6]. If undetected and unmanaged, such conditions can escalate rapidly, leading to substantial yield losses and increased dependency on chemical pesticides [7]. The timely detection of these early-stage symptoms is crucial for implementing effective integrated pest management (IPM) strategies [8]. However, traditional methods such as manual scouting are labor-intensive, prone to human error, and lack scalability [9]. Recent advancements in remote sensing technologies and deep learning-based image analysis have shown promise in automating disease detection tasks [10]. In particular, UAVs, or drones, offer an efficient, scalable platform for high-resolution crop monitoring [11]. While several successful UAV-based disease detection systems utilize RGB sensors to exploit color and texture features, a substantial number of high-accuracy systems still rely on multispectral or hyperspectral sensors, which remain cost-prohibitive and logistically challenging for smallholder farmers [12].
To address these limitations, this paper introduces CottoNet, a novel deep learning framework specifically designed for early-stage cotton disease detection using only RGB imagery captured via UAVs. Unlike prior work that often emphasized later-stage symptoms or relied on costly sensor setups, our model is optimized for real-time deployment in resource-constrained agricultural environments. At its core, CottoNet integrates a lightweight yet powerful feature extractor that maintains high spatial resolution while ensuring computational efficiency, making it ideal for UAV-based inference. This is complemented by a multi-scale fusion mechanism, enriched with spatial and channel-wise attention modules, which amplifies fine-grained disease patterns and suppresses background noise. Additionally, a hybrid visual enhancement module is incorporated, leveraging color space transformations, texture encoding, and edge detection to highlight subtle pathological cues that conventional detectors frequently overlook. Our method is trained and evaluated on a custom-labeled UAV image dataset collected from cotton fields in Uzbekistan, covering a range of disease classes and healthy samples under diverse lighting and environmental conditions. The model achieves a mean average precision (mAP@50) of 89.7%, an F1 score of 88.2%, and an early detection accuracy (EDA) of 91.5%, significantly outperforming established lightweight models such as YOLOv8n [13], LCDDN-YOLO [14], and ResNet-based detectors. This paper makes the following key contributions:
- We present one of the first RGB-only deep learning frameworks specifically tailored for early-stage cotton disease detection using UAV imagery, with a focus on low-cost deployment for small-scale farming systems.
- We propose a dual-attention pyramid architecture tailored for multi-scale disease symptom representation, integrated with an early symptom enhancement module for subtle visual cue extraction.
- We construct and release a high-quality annotated UAV dataset from Central Asia, addressing a geographic and technological gap in existing cotton disease datasets.
- We provide extensive benchmarking, ablation studies, and real-time deployment evaluations on embedded hardware, validating the model's field-readiness and scalability.
Through CottoNet, we aim to bridge the gap between advanced AI-based crop monitoring and practical field deployment, fostering a future of precision agriculture that is both inclusive and sustainable.
2. Related Works
The application of deep learning in precision agriculture, particularly for plant disease detection, has witnessed substantial growth over the past decade [15]. With the advent of high-resolution imagery from UAVs and advancements in convolutional neural networks (CNNs), automated disease diagnosis has become increasingly viable [16]. However, most existing approaches either depend on high-cost sensors or overlook the challenge of detecting early-stage symptoms, especially using only RGB imagery under natural field conditions [17]. Initial attempts at plant disease detection leveraged handcrafted features, such as color histograms, texture descriptors, and edge-based filters, followed by conventional classifiers like support vector machines (SVMs) [18] or k-nearest neighbors (kNN) [19]. While these methods demonstrated moderate success under controlled environments, they failed to generalize to complex, real-world agricultural settings due to high intra-class variability, background clutter, and sensitivity to lighting. To improve detection accuracy and robustness, many recent systems have turned to multispectral and hyperspectral imaging [20]. These modalities provide detailed spectral signatures that can highlight biochemical and physiological changes in plant tissue invisible to the naked eye [21]. However, hyperspectral sensors are expensive, bulky, and computationally demanding, limiting their deployment in low-cost UAV platforms or smallholder farming contexts.
In response to the limitations of spectral methods, several studies have focused on RGB-only approaches for disease detection [22]. Popular object detection architectures like Faster R-CNN [23], SSD [24], YOLOv3/YOLOv5 [25], and EfficientDet [26] have been adapted to plant pathology tasks. These models benefit from end-to-end learning, superior generalization, and rapid inference [13]. However, most of these detectors are trained on datasets with fully developed, visually prominent disease symptoms, making them less effective for detecting subtle, early-stage indicators. In [14], the authors proposed LCDDN-YOLO, a YOLOv8n-based model enhanced with a convolutional block attention module (CBAM), specifically designed for cotton disease detection in natural field environments. While the inclusion of attention mechanisms improved robustness, the model still relies heavily on strong visual symptoms and lacks optimization for early-stage detection. In [27], the authors explored a federated learning approach for plant disease identification, allowing models to be trained across multiple field locations while preserving data privacy. They also incorporated incremental learning to adapt the model over time. Though innovative, their framework primarily targeted general disease classification and not early symptom detection from UAV data. In [28], the authors introduced YOLO-UP, a customized architecture tailored for UAV-based pest detection using visible light imagery. Their model achieved improved performance in densely planted fields through an augmented feature pyramid network (AFPN). Nevertheless, YOLO-UP is focused on insect-related damage and does not emphasize early fungal or bacterial symptoms. Other lightweight detectors, such as those built upon MobileNetV2/V3 [18] or ResNet50-FPN, have been adopted to meet the computational constraints of embedded devices. However, they often sacrifice detection precision, especially for small or low-contrast objects, and are not explicitly engineered for early symptom modeling.
The integration of attention mechanisms into CNNs has proven effective for enhancing feature discrimination in complex scenes. Channel-wise attention modules, like squeeze-and-excitation (SE) blocks [29] or CBAM [30], improve model focus by reweighting feature maps based on global context [31]. Spatial attention modules [32], on the other hand, prioritize informative regions within feature maps, which is critical when symptoms occupy small or inconsistent areas of the plant canopy. Feature pyramid networks (FPNs) [33] and their variants, such as BiFPN (EfficientDet) [34], have enabled multi-resolution feature fusion, allowing detectors to localize objects of varying scales. These structures are particularly relevant in agricultural imagery, where diseases may affect only small subregions of a leaf or plant [35]. However, few studies have explored hybrid pyramid/attention mechanisms tailored for early-stage detection.
Despite its practical importance, early disease detection remains a relatively underexplored area in plant pathology. Early symptoms are typically characterized by low color contrast, subtle texture shifts, or minimal geometric deformation, making them difficult to distinguish using standard CNN pipelines. Moreover, RGB imagery captured from UAVs introduces further variability due to lighting changes, altitude-induced resolution shifts, and occlusions. To the best of our knowledge, no existing model simultaneously integrates multi-scale attention, visual enhancement, and color space transformation specifically aimed at highlighting early, faint symptoms in RGB UAV imagery. While techniques such as contrast enhancement, histogram equalization, and edge detection have been employed individually in preprocessing pipelines, their incorporation into an end-to-end deep learning framework remains limited.
In summary, while the field of UAV-based plant disease detection has advanced through deep learning and attention-enhanced architectures, significant gaps remain, particularly in detecting early-stage symptoms using low-cost RGB imagery. Existing models either lack sensitivity to faint visual cues or are too resource-intensive for real-time deployment on edge devices. This paper addresses these limitations by proposing CottoNet, a unified framework combining an efficient backbone architecture, dual-attention feature pyramids, and an early symptom enhancement module for enhanced precision and scalability.
3. Materials and Methods
To evaluate the effectiveness of the proposed CottoNet framework in detecting early-stage cotton diseases, we designed a comprehensive experimental pipeline encompassing data acquisition, model architecture design, training procedures, and evaluation. This section outlines the methodological components underpinning our study, including the collection and preprocessing of UAV-based RGB imagery, the architectural design of the proposed deep learning model, and the implementation of attention-based modules and enhancement techniques aimed at early symptom recognition.
We begin by detailing the UAV imaging platform and the conditions under which the dataset was captured, emphasizing natural field variability to ensure robustness and generalizability. Subsequently, we describe the model architecture in depth, including the EfficientNetV2-S backbone, the dual-attention feature pyramid network (DA-FPN), a multi-scale fusion mechanism enriched with spatial and channel-wise attention modules, the early symptom enhancement module (ESEM), and the detection head responsible for disease localization and classification. Finally, we outline the training configurations, data augmentation strategies, and evaluation metrics employed to benchmark model performance. By combining lightweight yet expressive components, our approach is designed to operate efficiently on embedded hardware platforms while maintaining high sensitivity to subtle and spatially limited disease features. The methods described here form the foundation for the results and evaluations presented in the following sections.
3.1. Model Architecture
The backbone of the proposed CottoNet architecture is based on EfficientNetV2-S, a convolutional neural network that balances high performance with computational efficiency. This backbone is selected for its ability to extract rich spatial features from RGB imagery while maintaining a lightweight structure suitable for UAV deployment and real-time inference. EfficientNetV2 employs a compound scaling method introduced by Tan and Le (2021) [36], which uniformly scales the depth (d), width (w), and input resolution (r) of the network using a set of optimized constants α, β, γ and a compound coefficient ϕ. The formulation is given as:

d = α^ϕ, w = β^ϕ, r = γ^ϕ, subject to α · β² · γ² ≈ 2, with α ≥ 1, β ≥ 1, γ ≥ 1
In terms of feature extraction, EfficientNetV2 uses the Swish activation function, defined as Swish(x) = x · σ(x), where σ(·) is the sigmoid function. This self-gated function enhances gradient flow and allows the network to capture subtle disease patterns such as chlorotic spots and edge blurring. Moreover, EfficientNetV2-S incorporates progressive learning techniques such as progressive image resizing and regularization-aware training, which allow the network to first learn coarse structures and then refine finer details, thereby accelerating convergence, reducing overfitting, and improving generalization (Figure 1).
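To make the scaling rule and activation above concrete, the short Python sketch below computes scaled depth, width, and resolution for a given compound coefficient and checks the Swish (SiLU) identity. The base values and the constants α = 1.2, β = 1.1, γ = 1.15 are illustrative assumptions for the example only, not values reported for CottoNet.

```python
import torch
import torch.nn.functional as F

# Illustrative compound-scaling constants (assumed for this sketch,
# not the values used in CottoNet / EfficientNetV2-S).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(base_depth: int, base_width: int, base_res: int, phi: float):
    """Scale depth, width, and input resolution with the compound coefficient phi."""
    depth = round(base_depth * ALPHA ** phi)   # d = alpha^phi
    width = round(base_width * BETA ** phi)    # w = beta^phi
    res = round(base_res * GAMMA ** phi)       # r = gamma^phi
    return depth, width, res

# Swish / SiLU: x * sigmoid(x), available in PyTorch as F.silu.
x = torch.linspace(-3, 3, 7)
assert torch.allclose(F.silu(x), x * torch.sigmoid(x))

print(compound_scale(base_depth=4, base_width=64, base_res=384, phi=1.0))
```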
In our implementation, the input RGB image I ∈ ℝ^(H×W×3) is passed through the EfficientNetV2-S backbone, yielding multi-scale feature maps {F_1, F_2, …, F_N} at different resolutions, which are subsequently fed into the DA-FPN for enhanced aggregation:

{F_1, F_2, …, F_N} = Backbone(I)

These feature maps retain both high-level semantic information and low-level spatial resolution, which is critical for detecting early-stage disease symptoms that may only occupy a small portion of the plant canopy.
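As a rough illustration of how such multi-scale feature maps can be tapped from an EfficientNetV2-S backbone, the sketch below uses the torchvision implementation and collects the outputs of a few intermediate stages. The chosen stage indices are assumptions for this example and do not necessarily match the taps used in CottoNet.

```python
import torch
from torchvision.models import efficientnet_v2_s

backbone = efficientnet_v2_s(weights=None).features  # stage-wise nn.Sequential
TAP_STAGES = {2, 4, 6}  # hypothetical stages whose outputs would feed the DA-FPN

def extract_pyramid(image: torch.Tensor):
    """Run the backbone and keep the feature maps produced at selected stages."""
    feats, x = [], image
    for idx, stage in enumerate(backbone):
        x = stage(x)
        if idx in TAP_STAGES:
            feats.append(x)
    return feats

pyramid = extract_pyramid(torch.randn(1, 3, 512, 512))
for f in pyramid:
    print(tuple(f.shape))  # decreasing spatial size, increasing channel depth
```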
To effectively detect early-stage cotton diseases that often manifest as subtle, low-contrast visual cues, we propose a DA-FPN to enhance multi-scale feature representation and fusion. The DA-FPN is designed to address two main challenges in UAV-based disease detection: (1) capturing fine-grained details across multiple resolutions, and (2) suppressing background noise while amplifying informative regions. The base of DA-FPN is built upon the bidirectional feature pyramid network (BiFPN), which enables efficient bidirectional cross-scale connections and feature fusion with learnable weights.
Let {F_1, …, F_N} denote the feature maps extracted from the backbone at N different scales. In standard BiFPN, the fusion at a given level j is defined as:

P_j = Σ_i (w_i · F_i) / (ϵ + Σ_k w_k)

where w_i are learnable scalar weights, ϵ is a small constant to avoid division by zero, and the F_i are the input features contributing to level j, depending on the connection topology. BiFPN preserves both semantic and spatial information across scales, enabling the detection of both coarse and fine patterns relevant to disease onset. To further refine the feature representation, we integrate two complementary attention modules: the channel attention module (CAM), which emphasizes informative feature channels by modeling inter-channel dependencies, and the spatial attention module (SAM), which highlights critical spatial regions by focusing on the most relevant locations within the feature maps. Together, these modules enhance the model's ability to capture subtle yet discriminative patterns associated with early-stage cotton diseases. These attention units are applied sequentially to each fused feature map P_j to produce attention-weighted maps P̃_j:

P̃_j = SAM(CAM(P_j))
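Returning to the weighted fusion step above, a minimal sketch of the fast normalized fusion applied at each BiFPN level is given below, assuming the inputs have already been resized and projected to a common shape; the module and parameter names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion: sum_i w_i * F_i / (eps + sum_k w_k), with w_i >= 0."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.weights)          # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)      # normalize without a softmax
        return sum(wi * x for wi, x in zip(w, inputs))

fuse = WeightedFusion(num_inputs=2)
a, b = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(fuse([a, b]).shape)  # torch.Size([1, 64, 32, 32])
```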
CAM emphasizes important channels by using global average pooling and max pooling, followed by a shared multi-layer perceptron (MLP):

M_c(P_j) = σ(MLP(AvgPool(P_j)) + MLP(MaxPool(P_j))),  P_j′ = M_c(P_j) ⊗ P_j

where M_c is the channel-wise attention map, ⊗ denotes channel-wise multiplication, and σ is the sigmoid activation function. SAM is applied to spatially focus on regions exhibiting early disease patterns. It uses channel-wise average and max projections, followed by a 7 × 7 convolution:

M_s(P_j′) = σ(f^(7×7)([AvgPool(P_j′); MaxPool(P_j′)])),  P̃_j = M_s(P_j′) ⊗ P_j′

where [·; ·] denotes channel-wise concatenation, f^(7×7) is a convolutional layer with a 7 × 7 kernel, and M_s is the spatial attention mask. The final set of attention-weighted, multi-scale feature maps {P̃_j} is passed to the detection head for object localization and classification. This multi-scale, attention-enhanced representation enables the model to robustly detect disease-affected regions even when symptoms are visually faint and sparsely distributed.
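The dual attention described above follows the CBAM formulation; a compact PyTorch sketch is shown below. The channel count, reduction ratio, and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))               # global max pooling branch
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # channel attention map M_c
        return x * m_c

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)               # channel-wise average projection
        mx = x.amax(dim=1, keepdim=True)                # channel-wise max projection
        m_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # spatial mask M_s
        return x * m_s

dual_attention = nn.Sequential(ChannelAttention(128), SpatialAttention())
print(dual_attention(torch.randn(1, 128, 64, 64)).shape)
```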
Detecting early-stage cotton diseases poses significant challenges due to the subtle nature of initial symptoms, such as minor color aberrations, fine chlorotic spots, or barely visible lesions. These visual cues are often overshadowed by complex features in UAV-captured RGB images. To address this, we introduce the ESEM, a hybrid enhancement unit designed to highlight low-intensity pathological cues while preserving important textural details. ESEM is a multi-branch image enhancement block that enriches the original feature maps by fusing color-contrast, texture, and edge-based cues. It acts as an auxiliary pathway within the backbone, producing an enhanced intermediate feature map that is later fused with the main path during mid-level processing. Given an input RGB feature map X, ESEM outputs a complementary map X_ESEM designed to amplify early disease indicators. ESEM transforms input features from the RGB space to both HSV and LAB color spaces, which are more perceptually aligned with human vision and can better differentiate between healthy and diseased regions, especially under varying lighting conditions. Let X denote the raw image or early-layer feature map. It is transformed into HSV and LAB representations:

X_HSV = T_HSV(X),  X_LAB = T_LAB(X)

where T(·) is a differentiable color space transformation function. From HSV, the saturation S and value V components, and from LAB, the a- and b-channels, are extracted, highlighting color distortions due to disease. These channels are concatenated and passed through a 1 × 1 convolutional layer to produce a compact color-enhanced map:

X_color = Conv_1×1([S; V; a; b])
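A sketch of this color-enhancement branch is shown below, assuming the kornia library for differentiable RGB-to-HSV/LAB conversion (an assumption; the paper does not name the implementation). Channel choices mirror the description, while the output width is illustrative.

```python
import torch
import torch.nn as nn
import kornia.color as KC  # differentiable color space conversions

class ColorBranch(nn.Module):
    """Extract S, V (HSV) and a, b (LAB) channels, then compress them with a 1x1 conv."""
    def __init__(self, out_channels: int = 32):
        super().__init__()
        self.proj = nn.Conv2d(4, out_channels, kernel_size=1)

    def forward(self, rgb):                      # rgb in [0, 1], shape (B, 3, H, W)
        hsv = KC.rgb_to_hsv(rgb)
        lab = KC.rgb_to_lab(rgb)
        s, v = hsv[:, 1:2], hsv[:, 2:3]          # saturation and value channels
        a, b = lab[:, 1:2], lab[:, 2:3]          # a- and b-channels
        return self.proj(torch.cat([s, v, a, b], dim=1))

x_color = ColorBranch()(torch.rand(1, 3, 256, 256))
print(x_color.shape)  # torch.Size([1, 32, 256, 256])
```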
Edge irregularities and vein distortions are common early indicators of biotic stress. The edge attention branch applies a Sobel gradient operator to each channel of the input:

G = √((S_x ∗ X)² + (S_y ∗ X)²)

where ∗ denotes convolution, and S_x, S_y are Sobel kernels in the horizontal and vertical directions, respectively. To reduce noise, the output is passed through a Gaussian smoothing kernel G_σ:

X_edge = G_σ ∗ G
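The edge branch can be realized with fixed Sobel and Gaussian kernels applied as depthwise convolutions, as in the sketch below; the 5 × 5 binomial kernel is an illustrative choice for the Gaussian smoothing.

```python
import torch
import torch.nn.functional as F

def edge_branch(x: torch.Tensor) -> torch.Tensor:
    """Per-channel Sobel gradient magnitude followed by 5x5 Gaussian smoothing."""
    c = x.shape[1]
    sx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    sy = sx.t()
    kx = sx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)      # depthwise horizontal Sobel kernels
    ky = sy.view(1, 1, 3, 3).repeat(c, 1, 1, 1)      # depthwise vertical Sobel kernels
    gx = F.conv2d(x, kx, padding=1, groups=c)
    gy = F.conv2d(x, ky, padding=1, groups=c)
    grad = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)      # gradient magnitude G

    g1d = torch.tensor([1., 4., 6., 4., 1.])         # binomial approximation of a Gaussian
    g2d = torch.outer(g1d, g1d)
    g2d = (g2d / g2d.sum()).view(1, 1, 5, 5).repeat(c, 1, 1, 1)
    return F.conv2d(grad, g2d, padding=2, groups=c)  # smoothed edge map X_edge

print(edge_branch(torch.rand(1, 3, 256, 256)).shape)
```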
This branch highlights discontinuities and boundaries that may indicate lesions or insect punctures. Subtle texture changes are captured using local binary patterns (LBP), a non-parametric descriptor that encodes spatial micro-patterns:
where
are the 8 neighbors of pixel (
i,
j),
. LBP features are normalized and projected via a 3 × 3 convolution to match the dimensionality:
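A simple, non-optimized tensor implementation of the 8-neighbor LBP code under the definition above is sketched below; the normalization constant is illustrative.

```python
import torch
import torch.nn.functional as F

def lbp(gray: torch.Tensor) -> torch.Tensor:
    """8-neighbor local binary pattern for a (B, 1, H, W) grayscale tensor."""
    padded = F.pad(gray, (1, 1, 1, 1), mode="replicate")
    center = gray
    # Offsets of the 8 neighbors, ordered clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = torch.zeros_like(gray)
    h, w = gray.shape[-2:]
    for p, (dy, dx) in enumerate(offsets):
        neighbor = padded[..., 1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        code = code + ((neighbor - center) >= 0).float() * (2 ** p)
    return code / 255.0  # normalize to [0, 1] before the 3x3 projection convolution

codes = lbp(torch.rand(1, 1, 64, 64))
print(codes.min().item(), codes.max().item())
```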
The three enhanced branches are concatenated and projected back to the original feature dimensionality using a 1 × 1 convolution and a ReLU activation:

X_fused = ReLU(Conv_1×1([X_color; X_edge; X_texture]))

Finally, a residual connection is applied to preserve the original semantic content:

X_ESEM = X + λ · X_fused

where λ is a learnable scalar balancing original and enhanced features.
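Assuming the three branch outputs have been projected to a common channel width, the fusion and residual step can be sketched as follows; the initial value of λ is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ESEMFusion(nn.Module):
    """Concatenate the color, edge, and texture branches, project back, add the residual."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(3 * channels, channels, 1), nn.ReLU(inplace=True))
        self.lam = nn.Parameter(torch.tensor(0.5))   # learnable balance between paths

    def forward(self, x, x_color, x_edge, x_texture):
        fused = self.proj(torch.cat([x_color, x_edge, x_texture], dim=1))
        return x + self.lam * fused                  # residual connection

block = ESEMFusion(channels=32)
t = torch.randn(1, 32, 64, 64)
print(block(t, t, t, t).shape)
```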
3.2. Detecting Stage
The final stage of the CottoNet architecture is the detection head, responsible for performing both spatial localization of disease-affected regions and categorical classification of the disease type and severity. Inspired by the one-stage detection paradigm, the detection head is designed to balance inference speed and precision, while being tailored to the challenges of small, early-stage symptoms in UAV-captured cotton field imagery. The regression branch predicts four parameters (x, y, w, h) that define the relative coordinates and dimensions of each bounding box with respect to the grid cell. The output regression vector is denoted as:

b̂_{i,j} = (x̂_{i,j}, ŷ_{i,j}, ŵ_{i,j}, ĥ_{i,j})

where (i, j) indexes the grid cell, x̂_{i,j}, ŷ_{i,j} represent the relative center coordinates, and ŵ_{i,j}, ĥ_{i,j} are width and height, normalized to the input image size. The Complete Intersection over Union (CIoU) loss is used to optimize bounding box regression, considering overlapping area, center distance, and aspect ratio consistency:

L_CIoU = 1 − IoU + ρ²(b, b^gt) / c² + α · v

where ρ is the Euclidean distance between the centers of the predicted box b and the ground-truth box b^gt, c is the diagonal length of the smallest enclosing box, v measures aspect ratio consistency, and α is a balancing factor. The objectness head estimates the probability that a given anchor box contains an object (disease instance), independent of its class. For each grid cell, it outputs:

p̂_obj ∈ [0, 1]

The binary cross-entropy (BCE) loss is used:

L_obj = −[y_obj · log(p̂_obj) + (1 − y_obj) · log(1 − p̂_obj)]

where y_obj is the ground-truth objectness label.
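For reference, these two loss terms map directly onto existing PyTorch/torchvision utilities; the sketch below assumes boxes in (x1, y1, x2, y2) format with illustrative values, and is not the exact training code.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss

# Bounding-box regression term (CIoU), boxes given as (x1, y1, x2, y2).
pred_boxes = torch.tensor([[10., 10., 60., 80.]])
gt_boxes = torch.tensor([[12., 8., 58., 84.]])
l_ciou = complete_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")

# Objectness term (binary cross-entropy on the raw logit).
obj_logit = torch.tensor([0.7])
obj_target = torch.tensor([1.0])
l_obj = F.binary_cross_entropy_with_logits(obj_logit, obj_target)

print(l_ciou.item(), l_obj.item())
```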
The classification head predicts the probability distribution over C disease categories. The predicted vector for a cell is:

ĉ_{i,j} = (ĉ_1, ĉ_2, …, ĉ_C)

To address class imbalance between healthy samples and sparse early-stage disease instances, we adopt the Focal Tversky Loss (FTL), which penalizes false negatives more heavily:

L_FTL = Σ_c (1 − TP_c / (TP_c + α · FN_c + β · FP_c))^γ

where TP_c, FN_c, FP_c are true positives, false negatives, and false positives for class c, α and β are tunable weights (typically α = 0.7, β = 0.3), and γ controls the strength of the focusing. This choice ensures the model remains sensitive to rare but critical early symptoms while maintaining overall class discrimination. The total training loss is a weighted combination of the three heads:

L_total = λ_1 · L_CIoU + λ_2 · L_obj + λ_3 · L_FTL

where λ_1, λ_2, λ_3 are loss balancing coefficients. In our implementation, values are empirically set to 2.0, 1.0, and 2.0, respectively. To accommodate the small scale of early symptoms, anchor boxes were optimized using k-means clustering on ground-truth bounding boxes across the training set. The process minimizes the IoU distance between ground-truth boxes and anchors:

d(box, anchor) = 1 − IoU(box, anchor)

This improves convergence and increases detection precision for minute features such as lesions or edge deformations.
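A sketch of the Focal Tversky Loss and the weighted total loss, following the definitions above with the stated coefficient values, is shown below; the tensor shapes and the focusing exponent γ = 0.75 are assumptions made for the example.

```python
import torch

def focal_tversky_loss(probs, targets, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-7):
    """probs, targets: (N, C) class probabilities and one-hot labels."""
    tp = (probs * targets).sum(dim=0)            # per-class true positives
    fn = ((1 - probs) * targets).sum(dim=0)      # per-class false negatives
    fp = (probs * (1 - targets)).sum(dim=0)      # per-class false positives
    tversky = tp / (tp + alpha * fn + beta * fp + eps)
    return ((1 - tversky) ** gamma).sum()

probs = torch.softmax(torch.randn(8, 6), dim=1)          # 6 classes: 5 diseases + healthy
targets = torch.eye(6)[torch.randint(0, 6, (8,))]

l_cls = focal_tversky_loss(probs, targets)
l_ciou, l_obj = torch.tensor(0.4), torch.tensor(0.2)     # placeholders for the other heads
l_total = 2.0 * l_ciou + 1.0 * l_obj + 2.0 * l_cls       # weights 2.0 / 1.0 / 2.0 from the text
print(l_total.item())
```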
4. Experiments
To validate the effectiveness of the proposed CottoNet framework in detecting early-stage cotton diseases under realistic field conditions, we conducted a series of controlled experiments using the two datasets described below. Our evaluation focused on both classification and object detection performance, with particular emphasis on the model’s sensitivity to subtle visual cues indicative of early symptoms. The experiments were designed to assess the model’s generalization capability across varying geographies, imaging perspectives, and disease categories. We benchmarked CottoNet against several state-of-the-art deep learning architectures adapted for plant disease detection and UAV-based imagery analysis. Additionally, we performed ablation studies to quantify the individual contributions of the model core components, namely the DA-FPN and the ESEM. All models were trained and evaluated under consistent conditions using standardized splits, augmentation pipelines, and performance metrics. The following sections describe our training configuration, evaluation metrics, and benchmarking results in detail.
4.1. Dataset
To support the development and evaluation of the proposed CottoNet framework, two complementary datasets were utilized: a custom UAV-based cotton disease detection dataset collected in Uzbekistan, and the publicly available SAR-CLD-2024 dataset developed in Bangladesh [9]. Together, these datasets provide a rich and diverse foundation for training and benchmarking early-stage cotton disease detection models under real-world agricultural conditions.
The custom dataset was collected using a DJI Phantom 4 UAV (DJI, Shenzhen, China) equipped with a 4K RGB camera during the early to mid-stages of the 2024 cotton growing season in the Jizzakh Region of Uzbekistan. UAV flights were conducted at altitudes between 20 and 30 m with nadir-view imaging to ensure consistent spatial representation of the crop canopy. A total of 5400 high-resolution RGB images were acquired, each reflecting diverse environmental conditions including variable lighting, occlusions, and background clutter typical of open-field agriculture. These images were manually annotated by plant pathology experts to mark regions affected by early symptoms of five common cotton diseases (bacterial blight, curl virus, fusarium wilt, leaf variegation, and leaf reddening), alongside healthy samples for baseline comparison. Particular emphasis was placed on capturing early-stage symptoms that are subtle in appearance, such as chlorosis, edge curling, minor lesions, and slight textural shifts, which are often difficult to detect through traditional scouting or conventional computer vision approaches. The dataset was stratified into training, validation, and test splits in a 70–15–15 ratio to ensure class balance and to allow robust performance assessment across disease categories (Figure 2).
In addition to the custom UAV dataset, we incorporated the SAR-CLD-2024 dataset to further enrich the model's exposure to varied disease phenotypes, cotton varieties, and imaging perspectives. This dataset consists of 2137 original RGB images and 7000 augmented images, captured during field surveys conducted from October 2023 to January 2024 in Gazipur, Bangladesh. Images were taken using a Redmi Note 11S smartphone (Xiaomi Corporation, Beijing, China) under different lighting and weather conditions. The dataset covers eight cotton leaf conditions, including bacterial blight, curl virus, herbicide growth damage, leaf hopper jassids, leaf reddening, leaf variegation, and healthy leaves. Each image was annotated by domain experts, and the dataset was curated to support both classification and machine learning benchmarking tasks.
The combination of these two datasets enables CottoNet to achieve high generalization capability by exposing it to a wide spectrum of disease types, plant phenotypes, and image acquisition scenarios (Table 1). This dual-source training strategy enhances the robustness, accuracy, and deployability of the proposed model across different geographies and farming contexts.
To ensure a balanced and unbiased evaluation, the datasets were divided into training (70%), validation (15%), and test (15%) subsets using stratified sampling by class. Model training was performed using the PyTorch 2.1 framework on an NVIDIA RTX 3090 GPU (24 GB), as summarized in Table 2.
To simulate field variability and enhance generalization, we applied several augmentation techniques, including random horizontal and vertical flips, Gaussian blur combined with salt-and-pepper noise, hue/saturation/value (HSV) shifts, random shadow and brightness simulation, as well as CutMix and Mosaic augmentation.
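As an illustration, a comparable image-level augmentation pipeline can be assembled with the albumentations library (an assumption; the paper does not name the implementation). CutMix and Mosaic are omitted here because they are typically applied at the batch level by the detector's dataloader.

```python
import albumentations as A

train_aug = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.GaussianBlur(p=0.2),
        A.GaussNoise(p=0.2),              # stands in for salt-and-pepper-style noise
        A.HueSaturationValue(p=0.3),      # HSV shifts
        A.RandomShadow(p=0.2),            # shadow simulation
        A.RandomBrightnessContrast(p=0.3),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
```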
To verify model stability and convergence, we monitored the loss and macro F1 score over 150 training epochs. As illustrated in Figure 3, both training and validation loss decreased consistently without divergence, indicating well-controlled optimization dynamics. The macro F1 score on the validation set steadily improved and closely tracked training performance, suggesting effective generalization and no signs of overfitting. The complete training procedure was performed with a batch size of 32 and an input resolution of 512 × 512 pixels; the training of CottoNet over 150 epochs required approximately 6.3 h.
4.2. Evaluation Metrics
Model performance was evaluated using the following metrics.

F1 score (per class and macro-averaged):

F1 = 2 · Precision · Recall / (Precision + Recall)

EDA (early detection accuracy):

EDA = TP_early / N_early

Here, TP_early refers to true positive predictions correctly identifying early-stage disease symptoms, as annotated by domain experts, and N_early is the total number of annotated early-stage instances. An early-stage detection is considered correct if the predicted bounding box has an IoU ≥ 0.5 with the ground-truth annotation and matches the correct disease class label. The inference speed (FPS) was measured on an NVIDIA Jetson Xavier NX to assess real-time UAV deployment feasibility.
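Under the matching rule just described (IoU ≥ 0.5 and correct class), early detection accuracy can be computed as in the sketch below, assuming per-image tensors of predicted and ground-truth early-stage boxes; the helper name and test values are hypothetical.

```python
import torch
from torchvision.ops import box_iou

def early_detection_accuracy(pred_boxes, pred_labels, gt_boxes, gt_labels, iou_thr=0.5):
    """Fraction of ground-truth early-stage boxes matched by a correct-class prediction."""
    if gt_boxes.numel() == 0:
        return float("nan")
    if pred_boxes.numel() == 0:
        return 0.0
    ious = box_iou(gt_boxes, pred_boxes)                   # (num_gt, num_pred) IoU matrix
    same_class = gt_labels[:, None] == pred_labels[None, :]
    # A ground-truth box counts as detected if any prediction overlaps it with the right class.
    matched = ((ious >= iou_thr) & same_class).any(dim=1)
    return matched.float().mean().item()

gt = torch.tensor([[10., 10., 50., 50.], [70., 70., 120., 110.]])
pred = torch.tensor([[12., 11., 49., 52.]])
print(early_detection_accuracy(pred, torch.tensor([0]), gt, torch.tensor([0, 1])))  # 0.5
```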
5. Results
This section presents the analysis of experimental outcomes to validate the performance of the proposed CottoNet model. The results are discussed in terms of detection accuracy, generalization capacity, early symptom sensitivity, and real-time deployment feasibility. Performance is assessed across both benchmark comparisons and internal ablation experiments. CottoNet was benchmarked against state-of-the-art models adapted for UAV imagery and plant disease detection (Figure 4). To evaluate the contribution of each architectural component, we conducted an ablation analysis (Table 3).
Both the dual-attention module and the ESEM contribute significantly to detection accuracy and robustness. The model performs consistently across all disease types, with marginally lower precision in the aphid stress category due to visual similarity with the healthy class under certain lighting conditions (Table 4).
The strong performance in differentiating early-stage diseases from healthy plants supports the model’s practical reliability in real-world monitoring systems.
To validate the performance of the proposed CottoNet model, we conducted a comprehensive benchmarking study against ten state-of-the-art (SOTA) models commonly used in plant disease detection and UAV-based image analysis. The selected models span both lightweight and standard architectures, including both one-stage and two-stage detectors, as well as models tailored for RGB-only agricultural imagery (Table 5).
All models were trained and evaluated on the same custom RGB dataset of early-stage cotton diseases collected from cotton fields in Uzbekistan. The dataset includes five disease classes and one healthy class, with a 70–15–15 split for training, validation, and testing. Input resolution for all models was standardized to 512 × 512 pixels, and performance was measured in terms of mAP@50, F1 score, EDA, and inference speed (FPS) on the NVIDIA Jetson Xavier NX (Table 6).
The results clearly demonstrate that CottoNet outperforms all compared models across all major evaluation metrics. It achieves the highest detection precision (mAP@50 of 89.7%) and the best sensitivity to early-stage symptoms (EDA of 91.5%), while maintaining a real-time inference speed (35 FPS) suitable for deployment on embedded UAV platforms. Notably, while YOLO-based models such as YOLO-UP showed strong accuracy (87.2% mAP@50), they required significantly more computational resources and offered lower inference speed compared to our model. Similarly, two-stage detectors like Faster R-CNN achieved competitive F1 scores but lacked practical viability for real-time agricultural applications due to latency.
Figure 5 illustrates the performance of the CottoNet model in detecting early-stage cotton diseases using RGB images captured in natural field conditions. The model predictions are shown through red bounding boxes annotated with disease class labels and confidence scores. The images depict a range of disease types, including bacterial blight, curl virus, fusarium wilt, leaf variegation, and leaf reddening, along with healthy leaf samples. The model accurately localizes affected regions, even when symptoms appear faint or partially occluded. For example, early bacterial blight is identified through dense spotting on leaf surfaces, while curl virus is detected through curled and distorted leaf structures. Fusarium wilt is recognized by the presence of widespread yellowing and wilting across the plant. Their characteristic color abnormalities distinguish leaf variegation and reddening. In contrast, healthy leaves are detected with high confidence and without false positives. These visual results demonstrate CottoNet’s high sensitivity to early symptoms and its robustness under varied lighting and background conditions, supporting its suitability for real-time agricultural monitoring.
The improved performance of CottoNet can be attributed to the synergy between its architectural components. The EfficientNetV2-S backbone provides a strong balance between representational power and computational efficiency. The dual-attention FPN fuses multi-scale features while emphasizing critical regions through spatial and channel attention mechanisms. Additionally, the ESEM uniquely enhances subtle color and textural anomalies that are often overlooked by conventional RGB-based detectors. This comparative analysis highlights CottoNet’s potential as a next-generation model for UAV-based plant disease monitoring, effectively bridging the gap between high accuracy and field deployability. As such, it presents a robust and scalable solution for sustainable precision agriculture, particularly in low-resource settings.
6. Conclusions
In this study, we present CottoNet, a novel deep learning framework for the early detection of cotton diseases using UAV-acquired RGB imagery. Unlike most existing systems that rely on expensive multispectral or hyperspectral sensors, our approach leverages only standard RGB images, making it cost-effective, scalable, and accessible for smallholder farmers and resource-constrained agricultural environments. CottoNet integrates three key architectural innovations: (1) an EfficientNetV2-S backbone that ensures a high capacity-to-efficiency ratio, (2) a DA-FPN that enhances multi-scale feature fusion while suppressing background noise, and (3) an ESEM that accentuates subtle color, texture, and edge cues indicative of early disease onset. Together, these components enable the model to detect faint and sparsely distributed visual patterns—an essential capability for timely intervention in crop management.
Our method was extensively evaluated on a custom UAV-captured RGB dataset collected from cotton fields in Uzbekistan. Experimental results demonstrated that CottoNet achieved superior performance across multiple metrics, including a mAP@50 of 89.7%, an F1 score of 88.2%, and an EDA of 91.5%, outperforming leading lightweight models such as YOLOv8n, LCDDN-YOLO, and YOLO-UP. Importantly, the model maintained real-time inference speed on edge devices like the NVIDIA Jetson Xavier NX, highlighting its practical viability for in-field deployment. The proposed approach holds significant promise for enabling early and accurate disease diagnosis, thereby reducing crop losses and minimizing unnecessary pesticide usage. By empowering farmers with affordable and intelligent crop monitoring tools, CottoNet contributes to the broader goal of sustainable and precision-driven agriculture. While CottoNet demonstrates strong performance across detection accuracy, early symptom sensitivity, and real-time inference speed, several limitations should be acknowledged. First, the model has been trained primarily on UAV-acquired RGB imagery of cotton crops from Uzbekistan and Bangladesh. Its generalizability to other crops, geographic regions, or drastically different agroecological conditions remains to be empirically validated. Second, despite its compatibility with low-cost RGB sensors, CottoNet’s performance may degrade under suboptimal UAV flight conditions, such as motion blur, off-angle views, or harsh lighting, which are not extensively represented in the training set. Third, the current implementation focuses on single-frame inference and lacks temporal modeling, which could otherwise enhance robustness in detecting disease progression over time. Finally, although the use of dual-attention mechanisms improves feature focus, the model does not currently include a dedicated explainability module, such as saliency maps or uncertainty quantification, which could further support agronomic decision-making and user trust.
Future extensions of this work may include integrating temporal image sequences to improve disease progression modeling, expanding the model to other crops or diseases, and deploying the system in real-world farming operations for longitudinal validation. Additionally, coupling RGB data with environmental variables could further enhance predictive accuracy in complex agroecological settings.
Author Contributions
Methodology, H.K., M.A., S.M., J.C., C.L. and H.-S.J.; software, H.K., M.A. and S.M.; validation, J.C. and H.-S.J.; formal analysis, J.C. and H.-S.J.; resources, H.K., C.L., M.A., S.M., J.C. and H.-S.J.; data curation, H.K., C.L., M.A., S.M., J.C. and H.-S.J.; writing—original draft, H.K., M.A., S.M., J.C. and H.-S.J.; writing—review and editing, H.K., M.A., S.M., C.L., J.C. and H.-S.J.; supervision, J.C. and H.-S.J.; project administration, H.K., M.A. and S.M. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (MSIT) (No. RS-2024-00412141).
Data Availability Statement
The original contributions presented in this study are included in the article. The custom UAV-based dataset is currently under review for public release and will be made available upon approval. Further inquiries can be directed to the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Kumar, R.; Kumar, A.; Bhatia, K.; Nisar, K.S.; Chouhan, S.S.; Maratha, P.; Tiwari, A.K. Hybrid Approach of Cotton Disease Detection for Enhanced Crop Health and Yield. IEEE Access 2024, 12, 132495–132507. [Google Scholar] [CrossRef]
- Omaye, J.D.; Ogbuju, E.; Ataguba, G.; Jaiyeoba, O.; Aneke, J.; Oladipo, F. Cross-comparative review of Machine learning for plant disease detection: Apple, cassava, cotton and potato plants. Artif. Intell. Agric. 2024, 12, 127–151. [Google Scholar] [CrossRef]
- Jayanthy, S.; Kiruthika, G.; Lakshana, G.; Pragatheshwaran, M. Early Cotton Plant Disease Detection using Drone Monitoring and Deep Learning. In Proceedings of the 2024 IEEE International Conference for Women in Innovation, Technology & Entrepreneurship (ICWITE), Bangalore, India, 16–17 February 2024; IEEE: New York, NY, USA, 2024; pp. 625–630. [Google Scholar]
- Singh, G.; Aggarwal, R.; Bhatnagar, V.; Kumar, S.; Dhondiyal, S.A. Performance Evaluation of Cotton Leaf Disease Detection Using Deep Learning Models. In Proceedings of the 2024 International Conference on Computational Intelligence and Computing Applications (ICCICA), Samalkha, India, 23–24 May 2024; IEEE: New York, NY, USA, 2024; Volume 1, pp. 193–197. [Google Scholar]
- Chopkar, P.; Wanjari, M.; Jumle, P.; Chandankhede, P.; Mungale, S.; Shaikh, M.S. A Comprehensive Review on Cotton Leaf Disease Detection using Machine Learning Method. Grenze Int. J. Eng. Technol. 2024, 10, 239–245. [Google Scholar]
- Nazeer, R.; Ali, S.; Hu, Z.; Ansari, G.J.; Al-Razgan, M.; Awwad, E.M.; Ghadi, Y.Y. Detection of cotton leaf curl disease’s susceptibility scale level based on deep learning. J. Cloud Comput. 2024, 13, 1–18. [Google Scholar] [CrossRef]
- Devi, C.; Vishva, S.; Gopal, M. Cotton Leaf Disease Prediction and Diagnosis Using Deep Learning. In Proceedings of the 2024 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI), Chennai, India, 9–10 May 2024; IEEE: New York, NY, USA, 2024; pp. 1–7. [Google Scholar]
- Lakshmi, R.T.; Katiravan, J.; Visu, P. CoDet: A novel deep learning pipeline for cotton plant detection and disease identification. Autom. Časopis Za Autom. Mjer. Elektron. Računarstvo Komun. 2024, 65, 662–674. [Google Scholar] [CrossRef]
- Bishshash, P.; Nirob, A.S.; Shikder, H.; Sarower, A.H.; Bhuiyan, T.; Noori, S.R.H. A comprehensive cotton leaf disease dataset for enhanced detection and classification. Data Brief 2024, 57, 110913. [Google Scholar] [CrossRef]
- Butt, S.; Sohaib, M.; Qasim, M.; Farooq, M.A. The Cotton Guard AI Cotton Disease Detection Using Deep Learning Mehtods; MCS: Geneva, Switzerland, 2024. [Google Scholar]
- Kaur, A.; Sharma, R.; Chattopadhyay, S.; Joshi, K. Cotton Leaf Disease Classification Using Fine-Tuned VGG16 Deep Learning Model. In Proceedings of the 2024 2nd World Conference on Communication & Computing (WCONF), Raipur, India, 12–14 July 2024; IEEE: New York, NY, USA, 2024; pp. 1–4. [Google Scholar]
- Stephen, A.; Arumugam, P.; Arumugam, C. An efficient deep learning with a big data-based cotton plant monitoring system. Int. J. Inf. Technol. 2023, 16, 145–151. [Google Scholar] [CrossRef]
- Kinda, Z.; Malo, S.; Bayala, T.R. Detection of Cotton Diseases by YOLOv8 on UAV Images Using the RT-DETR Backbone. In Proceedings of the International Symposium on Ambient Intelligence, Tallinn, Estonia, 11–14 September 2024; Springer: Cham, Switzerland, 2025; pp. 3–13. [Google Scholar]
- Feng, H.; Chen, X.; Duan, Z. LCDDN-YOLO: Lightweight Cotton Disease Detection in Natural Environment, Based on Improved YOLOv8. Agriculture 2025, 15, 421. [Google Scholar] [CrossRef]
- Kaur, A.; Kukreja, V.; Kumar, M.; Choudhary, A.; Sharma, R. A Fine-tuned Deep Learning-based VGG16 Model for Cotton Leaf Disease Classification. In Proceedings of the 2024 IEEE 9th International Conference for Convergence in Technology (I2CT), Pune, India, 5–7 April 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
- Shrotriya, A.; Sharma, A.K.; Bairwa, A.K.; Manoj, R. Hybrid Ensemble Learning with CNN and RNN for Multimodal Cotton Plant Disease Detection. IEEE Access 2024, 12, 198028–198045. [Google Scholar] [CrossRef]
- Parashar, N.; Johri, P. Deep Learning for Cotton Leaf Disease Detection. In Proceedings of the 2024 2nd International Conference on Device Intelligence, Computing and Communication Technologies (DICCT), Dehradun, India, 15–16 March 2024; IEEE: New York, NY, USA, 2024; pp. 158–162. [Google Scholar]
- Saleh, A.; Hussein, A.; Emad, A.; Gad, R.; Mariem, A.; Yasser, D.; Ayman, Y.; Bahr, A.; Yasser, M.; Aboelftooh, M. Machine Learning-based classification of cotton diseases using Mobilenet and Support Vector Machine. In Proceedings of the International Telecommunications Conference (ITC-Egypt), Cairo, Egypt, 22–25 July 2024; IEEE: New York, NY, USA, 2024; pp. 165–171. [Google Scholar]
- Jamadar, B.; Harikant, B.M.; Chaithra, B.J.; Mohit, J.S.; Vasudevan, V. Cotton Leaf Disease Classification and Pesticide Recommendation. In Proceedings of the 2024 Second International Conference on Networks, Multimedia and Information Technology (NMITCON), Bengaluru, India, 9–10 August 2024; IEEE: New York, NY, USA, 2024; pp. 1–7. [Google Scholar]
- Li, W.; Guo, Y.; Yang, W.; Huang, L.; Zhang, J.; Peng, J.; Lan, Y. Severity Assessment of Cotton Canopy Verticillium Wilt by Machine Learning Based on Feature Selection and Optimization Algorithm Using UAV Hyperspectral Data. Remote Sens. 2024, 16, 4637. [Google Scholar] [CrossRef]
- Zhang, X.; Vinatzer, B.A.; Li, S. Hyperspectral imaging analysis for early detection of tomato bacterial leaf spot disease. Sci. Rep. 2024, 14, 1–14. [Google Scholar] [CrossRef] [PubMed]
- Zhang, S.; Jing, H.; Dong, J.; Su, Y.; Hu, Z.; Bao, L.; Fan, S.; Sarsen, G.; Lin, T.; Jin, X. Accurate Estimation of Plant Water Content in Cotton Using UAV Multi-Source and Multi-Stage Data. Drones 2025, 9, 163. [Google Scholar] [CrossRef]
- Majeed, A.; Faridoon, F.; Irfan, M.; Rashid, A. Deep Learning Approaches for Precision Agriculture: Weed and Cotton Crop Classification Using YOLO and Faster RCNN. In Proceedings of the 2024 International Conference on IT and Industrial Technologies (ICIT), Chiniot, Pakistan, 10–12 December 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
- Guo, W.; Feng, S.; Feng, Q.; Li, X.; Gao, X. Cotton leaf disease detection method based on improved SSD. Int. J. Agric. Biol. Eng. 2024, 17, 211–220. [Google Scholar]
- Yang, S.; Zhou, G.; Feng, Y.; Zhang, J.; Jia, Z. SRNet-YOLO: A model for detecting tiny and very tiny pests in cotton fields based on super-resolution reconstruction. Front. Plant Sci. 2024, 15, 1416940. [Google Scholar] [CrossRef]
- Li, R.; He, Y.; Li, Y.; Qin, W.; Abbas, A.; Ji, R.; Li, S.; Wu, Y.; Sun, X.; Yang, J. Identification of cotton pest and disease based on CFNet- VoV-GCSP -LSKNet-YOLOv8s: A new era of precision agriculture. Front. Plant Sci. 2024, 15, 1348402. [Google Scholar] [CrossRef]
- Nie, J.; Li, H.; Li, Y.; Li, J.; Chao, S. Incremental YOLOv5 for Federated Learning in Cotton Pest and Disease Detection with Blockchain Sharding; Research Square: Durham, NC, USA, 2024. [Google Scholar]
- Sun, C.; Bin Azman, A.; Wang, Z.; Gao, X.; Ding, K. YOLO-UP: A High-Throughput Pest Detection Model for Dense Cotton Crops Utilizing UAV-Captured Visible Light Imagery. IEEE Access 2025, 13, 19937–19945. [Google Scholar] [CrossRef]
- Wu, J.; Abolghasemi, V.; Anisi, M.H.; Dar, U.; Ivanov, A.; Newenham, C. Strawberry Disease Detection Through an Advanced Squeeze-and-Excitation Deep Learning Model. IEEE Trans. AgriFood Electron. 2024, 2, 259–267. [Google Scholar] [CrossRef]
- Rahman, M.A.; Ullah, M.S.; Devnath, R.K.; Chowdhury, T.H.; Rahman, G.; Rahman, M.A. Cotton Leaf Disease Detection: An Integration of CBAM with Deep Learning Approaches. Int. J. Comput. Appl. 2025, 975, 8887. [Google Scholar] [CrossRef]
- Shao, Y.; Yang, W.; Wang, J.; Lu, Z.; Zhang, M.; Chen, D. Cotton Disease Recognition Method in Natural Environment Based on Convolutional Neural Network. Agriculture 2024, 14, 1577. [Google Scholar] [CrossRef]
- Qiu, K.; Zhang, Y.; Ren, Z.; Li, M.; Wang, Q.; Feng, Y.; Chen, F. SpemNet: A Cotton Disease and Pest Identification Method Based on Efficient Multi-Scale Attention and Stacking Patch Embedding. Insects 2024, 15, 667. [Google Scholar] [CrossRef]
- Chen, S.; Yang, W.; Chen, X. Automatic Recognition of Agriculture Pests with Balanced Feature Pyramid Network. Appl. Eng. Agric. 2024, 40, 525–535. [Google Scholar] [CrossRef]
- He, R.; Li, P.; Zhu, J.; Zhang, F.; Wang, Y.; Zhang, T.; Yang, D.; Zhou, B. YOLOv9-LSBN: An Improved YOLOv9 Model for Cotton Pest and Disease Identification Method; Research Square: Durham, NC, USA, 2024. [Google Scholar]
- Wang, J.; Qi, Z.; Wang, Y.; Liu, Y. A lightweight weed detection model for cotton fields based on an improved YOLOv8n. Sci. Rep. 2025, 15, 1–16. [Google Scholar] [CrossRef] [PubMed]
- Tan, M.; Le, Q.V. Efficientnetv2: Smaller models and faster training. arXiv 2021, arXiv:2104.00298. [Google Scholar]