1. Introduction
Cotton is a major economic crop in China, and the degree of intelligence in its disease monitoring directly affects yield and quality assurance in the digital transformation of modern agriculture. Because cotton leaf diseases are numerous and often occur in complex field environments with high temperature and humidity, traditional detection methods that rely on manual inspection and sensor-based image acquisition have obvious limitations: expensive equipment, low acquisition efficiency, and data that are easily degraded by lighting and occlusion [
1,
2]. In addition, the subtle differences between disease types and their regional transmission characteristics make it challenging to build high-quality, diverse datasets, further restricting the training effectiveness and generalization ability of deep learning models. Efficient cotton disease detection technology is therefore of great significance for large-scale farmland monitoring and early disease identification.
Researchers have actively explored model structures, lightweight design strategies, and data augmentation. Eunice et al. [
3] proposed a plant leaf disease detection method using convolutional neural networks. By optimizing the hyperparameters of pre-trained models such as DenseNet-121, ResNet-50, VGG-16, and Inception V4, they improved classification across diseases, achieving a precision of 99.81% on a 38-class disease classification task. Khan et al. [
4] proposed a two-stage apple leaf disease detection method based on Xception and Faster R-CNN. They constructed a dataset of about 9000 expert-annotated images and applied transfer learning to improve model performance, achieving 88% precision in classification and 42% mAP in detection. Zhong et al. [
5] proposed LightMixer, a lightweight tomato leaf disease recognition model. By combining the Phish module and the optical residual module, it enhances feature fusion and computational efficiency, achieving a classification precision of 99.3% with a model size of only 1.5 M. Mustak Un Nobi et al. [
6] proposed GLD-Det, a lightweight and robust guava leaf disease detection model based on transfer learning. By integrating max pooling and global average pooling, multi-layer batch normalization, dropout, and multi-layer fully connected networks on an improved MobileNet architecture, it improves feature expression and generalization, achieving accuracy, precision, recall, and AUC values of up to 98.0%, 98.0%, 97.0%, and 99.0%, respectively, on two public benchmark datasets. Jin et al. [
7] proposed GrapeGAN, a generative adversarial network based on a feature fusion mechanism and a capsule network structure. By integrating convolutional residual blocks and feature recombination, they built a U-Net-style generator and designed a discriminator containing a capsule structure, achieving a Fréchet Inception Distance (FID) of 5.495, with the recognition precision of generated images exceeding 86.36%. Although the above methods achieve good results, they generally rely on real field images and remain insufficient for rare diseases, small-sample categories, or co-occurring diseases, especially under complex scene changes (such as backlight, dust, and leaf curling).
The leap forward in generative AI technology, especially large language model (LLM)-driven text-to-image models such as DALL-E [
8,
9] and Stable Diffusion [
10], has revolutionized the way agricultural visual data are acquired. Highly realistic cotton leaf images exhibiting disease spots can be generated from simple text prompts, making the construction of disease datasets more flexible and efficient, with advantages of low cost, high controllability, and strong scalability [
11,
12]. This method not only makes up for the shortcomings of traditional image acquisition in terms of diversity but also provides data support for training more robust detection models.
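In practice, such class-conditioned generation can be driven through the OpenAI Images API. The sketch below illustrates one way this might look; the prompt template, the symptom phrases, and the `generate_images` helper are illustrative assumptions for exposition, not the exact pipeline used in this study:

```python
# Sketch: expanding rare cotton leaf disease classes with synthetic
# images via the OpenAI Images API. The prompt template and symptom
# descriptions are illustrative assumptions, not this study's exact prompts.

PROMPT_TEMPLATE = (
    "A photorealistic close-up of a cotton leaf affected by {disease}, "
    "showing {symptoms}, on a natural field background in daylight."
)

# Hypothetical symptom phrases for two low-frequency classes.
RARE_CLASSES = {
    "bacterial blight": "angular water-soaked lesions turning brown",
    "target spot": "concentric brown rings spreading across the blade",
}

def build_prompt(disease: str) -> str:
    """Fill the template with a class-specific symptom description."""
    return PROMPT_TEMPLATE.format(disease=disease, symptoms=RARE_CLASSES[disease])

def generate_images(disease: str, n: int = 4) -> list:
    """Request `n` 1024x1024 synthetic images; returns their URLs (network call)."""
    from openai import OpenAI  # deferred import: requires the `openai` package
    client = OpenAI()          # reads OPENAI_API_KEY from the environment
    urls = []
    for _ in range(n):         # dall-e-3 returns one image per request
        resp = client.images.generate(
            model="dall-e-3", prompt=build_prompt(disease), size="1024x1024"
        )
        urls.append(resp.data[0].url)
    return urls
```

Generated images still require annotation (e.g., lesion bounding boxes) before they can train a detector; the API supplies only the raw images.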
Currently, deep learning models based on the YOLO [
13,
14] series are extensively applied in plant leaf disease detection and have become an important technical means for smart agricultural monitoring and disease prevention and control. Sangaiah et al. [
15] introduced a T-YOLO-rice method for rice disease detection, which achieved 86% mAP by integrating modules such as SPP, CBAM, and the Sand Clock Feature Extraction Module (SCFEM). Wang et al. [
16] introduced an MGA-YOLO model for apple leaf disease detection, which achieved 94.0% mAP, a 10.34 MB model size, and an 84.1 FPS inference speed by introducing the Ghost module, Mobile Inverted Residual structure, and CBAM attention mechanism, and reached 12.5 FPS on mobile phones. Although these models strike an effective balance between accuracy and efficiency, further performance gains are still constrained by how the data are acquired. How to effectively improve the robustness and generalization ability of target detection models has therefore become an important open problem in smart agricultural visual perception.
This study proposes a cotton leaf disease detection method that combines LLM-driven synthesis of cotton leaf images with the DEMM-YOLO model. The key contributions can be summarized as follows:
(1) To address the imbalance in cotton leaf disease image samples, this study used OpenAI’s DALL-E model to generate images of low-frequency disease categories from text descriptions of disease characteristics. This effectively expanded the dataset, improved its balance and diversity, and enhanced the model’s accuracy and generalization in recognizing rare diseases.
(2) To tackle challenges in cotton leaf disease detection, such as scale variation, irregular lesion distribution, and difficulty distinguishing small lesions with unclear edges, this study proposes a multi-scale feature aggregation module (MFAM). It effectively integrates multi-scale semantic information, improving the model’s ability to perceive and differentiate small diseased areas while maintaining computational efficiency.
(3) The Deformable Attention Mechanism (DAT) is introduced to improve the feature extraction of the C2PSA module in cotton leaf disease detection. By dynamically learning spatial offsets, DAT can focus on irregular areas like disease spots, overcoming the fixed receptive field limitations of traditional convolution. This adaptive mechanism enhances C2PSA’s ability to detect small-scale, diverse disease spots with clearer details, improving overall detection accuracy and robustness, even in complex backgrounds.
(4) An enhanced efficient multi-scale attention (EEMA) mechanism is proposed, integrating feature grouping, multi-scale parallel sub-networks, and cross-space interaction learning to create a more expressive attention structure. To improve regression performance and training efficiency, the MPDIoU loss function is used in place of the original bounding box regression loss, boosting both convergence speed and positioning accuracy.
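The MPDIoU metric mentioned in contribution (4) penalizes IoU by the distances between matching box corners. A minimal NumPy sketch, assuming the common formulation in which the top-left and bottom-right corner distances are each normalized by the squared input-image dimensions:

```python
import numpy as np

def mpdiou(box1, box2, img_w, img_h):
    """MPDIoU between two boxes in (x1, y1, x2, y2) format.

    IoU minus the squared top-left and bottom-right corner distances,
    each normalized by w^2 + h^2 of the input image.
    """
    b1 = np.asarray(box1, dtype=float)
    b2 = np.asarray(box2, dtype=float)

    # Intersection rectangle (clamped to zero when boxes are disjoint).
    ix1, iy1 = np.maximum(b1[:2], b2[:2])
    ix2, iy2 = np.minimum(b1[2:], b2[2:])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)

    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    iou = inter / (area1 + area2 - inter + 1e-9)

    # Squared distances between matching corners.
    d1 = (b1[0] - b2[0]) ** 2 + (b1[1] - b2[1]) ** 2  # top-left
    d2 = (b1[2] - b2[2]) ** 2 + (b1[3] - b2[3]) ** 2  # bottom-right
    norm = img_w ** 2 + img_h ** 2

    return iou - d1 / norm - d2 / norm

# The regression loss is then L_MPDIoU = 1 - MPDIoU, which is zero
# only when prediction and target coincide exactly.
```

Because both corner terms vanish only for identical boxes, the loss directly rewards corner alignment, which speeds convergence relative to plain IoU loss.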
4. Discussion
This study proposes a cotton leaf disease detection method (DEMM-YOLO) that combines synthetic images generated with an LLM-driven image model (DALL-E) with an improved YOLOv11 model. This method not only effectively addresses the sample imbalance issue for rare disease categories but also significantly improves cotton leaf disease detection performance through technological innovation. The specific contributions are summarized as follows:
(1) Application of Synthetic Data: This study leverages OpenAI’s DALL-E model to generate high-quality synthetic images of cotton leaf diseases, successfully expanding the diversity of the dataset and improving detection performance for rare disease categories. The introduction of synthetic data significantly enhances the model’s generalization ability for low-sample categories.
(2) Multi-Scale Feature Aggregation Module (MFAM): To address the large-scale variation and irregular distribution of disease regions, we designed the MFAM module. This module effectively integrates multi-scale semantic information through a lightweight multi-branch convolutional structure, improving the model’s detection capabilities for small-scale diseases.
(3) Deformable Attention Mechanism (DAT): The Deformable Attention Mechanism (DAT) is introduced into the C2PSA module. By learning spatial offsets, it dynamically focuses on the diseased area in a complex background, overcoming the fixed receptive field limitation of traditional convolution, effectively improving the model’s feature extraction capability and detection accuracy.
(4) Enhanced Efficient Multi-Scale Attention Mechanism (EEMA): This study introduces an enhanced Efficient Multi-Scale Attention mechanism (EEMA), which further improves the model’s feature representation capabilities in complex environments through feature grouping, multi-scale parallel sub-networks, and cross-spatial interactive learning.
(5) MPDIoU Loss: By replacing the traditional regression loss function with the MPDIoU loss, the accuracy of bounding box regression is improved, the model convergence is accelerated, and the model’s localization accuracy is further enhanced.
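The grouped multi-scale attention described in (4) can be sketched as follows. This is a simplified illustration assuming the usual EMA-style ingredients (channel grouping, directional pooling in a 1x1 path, and a parallel 3x3 path); it does not reproduce the exact cross-spatial interaction of the proposed EEMA:

```python
import torch
import torch.nn as nn

class EEMASketch(nn.Module):
    """Simplified grouped multi-scale attention in the spirit of EEMA.

    Channels are split into groups; each group is re-weighted by
    directional (height/width) pooled descriptors from a 1x1 path and
    combined with a 3x3 convolutional path. The paper's exact
    cross-spatial interaction is not reproduced here.
    """

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        cg = channels // groups
        self.conv1x1 = nn.Conv2d(cg, cg, kernel_size=1)
        self.conv3x3 = nn.Conv2d(cg, cg, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        g = x.reshape(n * self.groups, c // self.groups, h, w)

        # Directional descriptors: average over width and over height.
        pool_h = g.mean(dim=3, keepdim=True)   # (N*G, cg, H, 1)
        pool_w = g.mean(dim=2, keepdim=True)   # (N*G, cg, 1, W)
        a_h = torch.sigmoid(self.conv1x1(pool_h))
        a_w = torch.sigmoid(self.conv1x1(pool_w))

        # Gated 1x1 path plus a local 3x3 path, merged per group.
        out = g * a_h * a_w + self.conv3x3(g)
        return out.reshape(n, c, h, w)
```

Because attention is computed within each group independently, the extra cost grows with the per-group channel count rather than the full channel dimension, keeping the mechanism lightweight.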
Compared to existing research, DEMM-YOLO introduces innovative improvements in disease detection accuracy, data augmentation strategies, and module design. Traditional disease detection methods typically rely on real-world image data, the collection of which requires substantial human and material resources and is often hindered by issues like sample scarcity and data imbalance. Although previous studies have attempted to address these challenges through techniques such as data augmentation and transfer learning, problems like insufficient samples and interference from complex backgrounds remain. In contrast, this study effectively mitigates these issues by generating synthetic data, which enhances the detection capabilities for rare disease categories. Additionally, by incorporating innovative modules such as MFAM, DAT, and EEMA, the model’s robustness in complex backgrounds is significantly improved.
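To make the module-level descriptions above concrete, a minimal PyTorch sketch of the multi-branch idea behind MFAM is given below. The branch count, depthwise kernel sizes, 1x1 fusion, and residual connection are illustrative assumptions; the actual module design may differ:

```python
import torch
import torch.nn as nn

class MFAMSketch(nn.Module):
    """Illustrative multi-scale feature aggregation block.

    Parallel depthwise convolutions with different kernel sizes capture
    lesions at several scales; a 1x1 convolution fuses the branches and
    a residual connection preserves fine detail. Kernel sizes and the
    fusion scheme are assumptions, not the paper's exact specification.
    """

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Depthwise branches keep the parameter count low (lightweight design).
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
             for k in kernel_sizes]
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]
        out = self.fuse(torch.cat(feats, dim=1))
        return self.act(out + x)
```

Fusing small and large receptive fields in one block is what lets a single feature map respond to both tiny early-stage spots and large spreading lesions.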
Although the model achieves significant improvements, some limitations remain. Detection accuracy tends to decrease under challenging environmental conditions such as intense lighting, moving shadows, or changes in leaf angle, and lesion boundaries remain blurred when the contrast between diseased and healthy areas is low. In addition, although the synthetic images display individual disease symptoms realistically, they still struggle to fully capture the overlap of multiple diseases on a single leaf, which is common in real-world scenarios. Future research will therefore integrate real images of overlapping diseases and leverage the DALL-E model to generate images containing multiple crop disease types, improving the model’s ability to detect a broader range of diseases and to handle the complex, overlapping disease patterns encountered in actual field conditions.
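The deformable sampling idea behind the DAT module discussed above can be sketched in PyTorch as follows. This is a simplified single-head illustration (an assumed offset head with small, bounded shifts), not the actual C2PSA integration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSamplingSketch(nn.Module):
    """Simplified deformable sampling: learn a 2D offset per location
    and bilinearly re-sample the feature map at the shifted positions.
    Illustrates the core idea of deformable attention (escaping the
    fixed receptive field) without the full attention computation.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Predict (dx, dy) offsets for every spatial position.
        self.offset_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, h, w = x.shape
        offsets = self.offset_head(x).permute(0, 2, 3, 1)  # (N, H, W, 2)

        # Base sampling grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).expand(n, h, w, 2)

        # Shift the grid by small, bounded learned offsets and re-sample.
        grid = base + torch.tanh(offsets) * 0.1
        return F.grid_sample(x, grid, align_corners=True)
```

Because the offsets are learned from the features themselves, sampling positions can drift toward irregular lesion shapes instead of staying on the fixed grid of a standard convolution.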
5. Conclusions
This study introduces the DEMM-YOLO approach for cotton leaf disease detection, which integrates LLM-driven synthetic image generation with architectural enhancements to improve performance. To address data imbalance, particularly for rare disease categories, we leveraged OpenAI’s DALL-E to generate synthetic images, effectively enriching the dataset and enhancing the model’s generalization capability in few-shot scenarios. To handle challenges such as varying lesion scales and irregular spatial distributions, we designed an MFAM module that combines multi-scale semantic information through a lightweight multi-branch structure. Additionally, we incorporated the DAT into the C2PSA module to overcome the limitations of fixed receptive fields, allowing the model to focus dynamically on lesions in complex backgrounds. The EEMA mechanism further improved the model’s feature representation by combining grouped features, multi-scale parallelism, and cross-spatial interaction. Finally, the use of the MPDIoU loss function enhanced bounding box regression accuracy and accelerated model convergence. Experimental results show that DEMM-YOLO achieves strong performance, with 94.8% precision, 93.1% recall, and 96.7% mAP@0.5. Compared to mainstream models such as YOLOv5s, YOLOv6s, and YOLOv8s, and even recent versions such as YOLOv10s and YOLOv12s, our model shows more balanced and consistently high performance across key metrics. Despite a slightly lower precision than RT-DETR and YOLOv10s, DEMM-YOLO outperforms them in overall detection stability and computational efficiency, achieving 81.5 FPS while maintaining a lightweight size of 20.1 MB. These results confirm DEMM-YOLO’s strong potential for real-world deployment in precision agriculture and intelligent disease monitoring systems.