1. Introduction
Cotton is a major economic crop in China, and the degree of intelligence in its disease monitoring directly affects yield and quality assurance in the digital transformation of modern agriculture. Because cotton leaf diseases are numerous and often occur in complex field environments with high temperature and humidity, traditional detection methods that rely on manual inspection and sensor-based image acquisition have obvious limitations: expensive equipment, low acquisition efficiency, and data that are easily degraded by lighting and occlusion [
1,
2]. In addition, the subtle differences between disease types and their regional transmission characteristics make it challenging to build high-quality, diverse datasets, further restricting the training effectiveness and generalization ability of deep learning models. Efficient cotton disease detection technology is therefore of great significance for large-scale farmland monitoring and early disease identification.
Researchers have actively explored model structures, lightweight design strategies, and data augmentation. Eunice et al. [
3] proposed a plant leaf disease detection method using convolutional neural networks. By optimizing the hyperparameters of pre-trained models such as DenseNet-121, ResNet-50, VGG-16, and Inception V4, they improved classification across diseases, achieving a precision of 99.81% on a 38-class disease classification task. Khan et al. [
4] proposed a two-stage apple leaf disease detection method based on Xception and Faster R-CNN. They constructed a dataset of about 9000 expert-annotated images and applied transfer learning to improve model performance, achieving 88% precision in classification and 42% mAP in detection. Zhong et al. [
5] proposed LightMixer, a lightweight tomato leaf disease recognition model. By combining the Phish module and the optical residual module, it enhances feature fusion and computational efficiency, achieving a classification precision of 99.3% with a model size of only 1.5 M. Mustak Un Nobi et al. [
6] proposed GLD-Det, a lightweight and robust guava leaf disease detection model based on transfer learning. By integrating max pooling and global average pooling, multi-layer batch normalization, dropout, and multi-layer fully connected networks on an improved MobileNet architecture, it improves feature expression and generalization, achieving accuracy, precision, recall, and AUC values of up to 98.0%, 98.0%, 97.0%, and 99.0%, respectively, on two public benchmark datasets. Jin et al. [
7] proposed GrapeGAN, a generative adversarial network based on a feature fusion mechanism and a capsule network structure. By integrating convolutional residual blocks and feature recombination, they built a U-Net-style generator and designed a discriminator containing a capsule structure, achieving a Fréchet Inception Distance (FID) of 5.495, with the recognition precision of generated images exceeding 86.36%. Although the above methods achieve good results, they generally rely on real field images and remain insufficient for rare diseases, small-sample categories, or co-occurring diseases, especially under complex scene changes (such as backlight, dust, and leaf curling).
The leap forward in generative AI technology, especially large language model (LLM)-driven text-to-image models such as DALL-E [
8,
9] and Stable Diffusion [
10], has revolutionized the way agricultural visual data are acquired. Highly realistic cotton leaf images exhibiting disease spots can be generated from simple text prompts, making the construction of disease datasets more flexible and efficient, with advantages of low cost, high controllability, and strong scalability [
11,
12]. This method not only makes up for the shortcomings of traditional image acquisition in terms of diversity but also provides data support for training more robust detection models.
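In practice, such class-conditioned generation can be driven through the OpenAI Images API. The sketch below illustrates one way this might look; the prompt template, the symptom phrases, and the `generate_images` helper are illustrative assumptions for exposition, not the exact pipeline used in this study:

```python
# Sketch: expanding rare cotton leaf disease classes with synthetic
# images via the OpenAI Images API. The prompt template and symptom
# descriptions are illustrative assumptions, not this study's exact prompts.

PROMPT_TEMPLATE = (
    "A photorealistic close-up of a cotton leaf affected by {disease}, "
    "showing {symptoms}, on a natural field background in daylight."
)

# Hypothetical symptom phrases for two low-frequency classes.
RARE_CLASSES = {
    "bacterial blight": "angular water-soaked lesions turning brown",
    "target spot": "concentric brown rings spreading across the blade",
}

def build_prompt(disease: str) -> str:
    """Fill the template with a class-specific symptom description."""
    return PROMPT_TEMPLATE.format(disease=disease, symptoms=RARE_CLASSES[disease])

def generate_images(disease: str, n: int = 4) -> list:
    """Request `n` 1024x1024 synthetic images; returns their URLs (network call)."""
    from openai import OpenAI  # deferred import: requires the `openai` package
    client = OpenAI()          # reads OPENAI_API_KEY from the environment
    urls = []
    for _ in range(n):         # dall-e-3 returns one image per request
        resp = client.images.generate(
            model="dall-e-3", prompt=build_prompt(disease), size="1024x1024"
        )
        urls.append(resp.data[0].url)
    return urls
```

Generated images still require annotation (e.g., lesion bounding boxes) before they can train a detector; the API supplies only the raw images.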
Currently, deep learning models based on the YOLO [
13,
14] series are extensively applied in plant leaf disease detection and have become an important technical means for smart agricultural monitoring and disease prevention and control. Sangaiah et al. [
15] introduced a T-YOLO-rice method for rice disease detection, which achieved 86% mAP by integrating modules such as SPP, CBAM, and the Sand Clock Feature Extraction Module (SCFEM). Wang et al. [
16] introduced an MGA-YOLO model for apple leaf disease detection, which achieved 94.0% mAP, a 10.34 MB model size, and an 84.1 FPS inference speed by introducing the Ghost module, Mobile Inverted Residual structure, and CBAM attention mechanism, and reached 12.5 FPS on mobile phones. Although these models strike an effective balance between accuracy and efficiency, further performance gains are still constrained by how the data are acquired. How to effectively improve the robustness and generalization ability of target detection models has therefore become an important open problem in smart agricultural visual perception.
This study proposes a cotton leaf disease detection method that combines LLM-driven synthesis of cotton leaf images with the DEMM-YOLO model. The key contributions can be summarized as follows:
(1) To address the imbalance in cotton leaf disease image samples, this study used OpenAI’s DALL-E model to generate images of low-frequency disease categories from text descriptions of disease characteristics. This effectively expanded the dataset, improved its balance and diversity, and enhanced the model’s accuracy and generalization in recognizing rare diseases.
(2) To tackle challenges in cotton leaf disease detection, such as scale variation, irregular lesion distribution, and difficulty distinguishing small lesions with unclear edges, this study proposes a multi-scale feature aggregation module (MFAM). It effectively integrates multi-scale semantic information, improving the model’s ability to perceive and differentiate small diseased areas while maintaining computational efficiency.
(3) The Deformable Attention Mechanism (DAT) is introduced to improve the feature extraction of the C2PSA module in cotton leaf disease detection. By dynamically learning spatial offsets, DAT can focus on irregular areas like disease spots, overcoming the fixed receptive field limitations of traditional convolution. This adaptive mechanism enhances C2PSA’s ability to detect small-scale, diverse disease spots with clearer details, improving overall detection accuracy and robustness, even in complex backgrounds.
(4) An enhanced efficient multi-scale attention (EEMA) mechanism is proposed, integrating feature grouping, multi-scale parallel sub-networks, and cross-space interaction learning to create a more expressive attention structure. To improve regression performance and training efficiency, the MPDIoU loss function is used in place of the original bounding box regression loss, boosting both convergence speed and positioning accuracy.
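The MPDIoU metric mentioned in contribution (4) penalizes IoU by the distances between matching box corners. A minimal NumPy sketch, assuming the common formulation in which the top-left and bottom-right corner distances are each normalized by the squared input-image dimensions:

```python
import numpy as np

def mpdiou(box1, box2, img_w, img_h):
    """MPDIoU between two boxes in (x1, y1, x2, y2) format.

    IoU minus the squared top-left and bottom-right corner distances,
    each normalized by w^2 + h^2 of the input image.
    """
    b1 = np.asarray(box1, dtype=float)
    b2 = np.asarray(box2, dtype=float)

    # Intersection rectangle (clamped to zero when boxes are disjoint).
    ix1, iy1 = np.maximum(b1[:2], b2[:2])
    ix2, iy2 = np.minimum(b1[2:], b2[2:])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)

    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    iou = inter / (area1 + area2 - inter + 1e-9)

    # Squared distances between matching corners.
    d1 = (b1[0] - b2[0]) ** 2 + (b1[1] - b2[1]) ** 2  # top-left
    d2 = (b1[2] - b2[2]) ** 2 + (b1[3] - b2[3]) ** 2  # bottom-right
    norm = img_w ** 2 + img_h ** 2

    return iou - d1 / norm - d2 / norm

# The regression loss is then L_MPDIoU = 1 - MPDIoU, which is zero
# only when prediction and target coincide exactly.
```

Because both corner terms vanish only for identical boxes, the loss directly rewards corner alignment, which speeds convergence relative to plain IoU loss.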
4. Discussion
This study proposes a cotton leaf disease detection method (DEMM-YOLO) that combines synthetic images generated with an LLM-driven image model (DALL-E) with an improved YOLOv11 model. This method not only effectively addresses the sample imbalance issue for rare disease categories but also significantly improves cotton leaf disease detection performance through technological innovation. The specific contributions are summarized as follows:
(1) Application of Synthetic Data: This study leverages OpenAI’s DALL-E model to generate high-quality synthetic images of cotton leaf diseases, successfully expanding the diversity of the dataset and improving detection performance for rare disease categories. The introduction of synthetic data significantly enhances the model’s generalization ability for low-sample categories.
(2) Multi-Scale Feature Aggregation Module (MFAM): To address the large-scale variation and irregular distribution of disease regions, we designed the MFAM module. This module effectively integrates multi-scale semantic information through a lightweight multi-branch convolutional structure, improving the model’s detection capabilities for small-scale diseases.
(3) Deformable Attention Mechanism (DAT): The Deformable Attention Mechanism (DAT) is introduced into the C2PSA module. By learning spatial offsets, it dynamically focuses on the diseased area in a complex background, overcoming the fixed receptive field limitation of traditional convolution, effectively improving the model’s feature extraction capability and detection accuracy.
(4) Enhanced Efficient Multi-Scale Attention Mechanism (EEMA): This study introduces an enhanced Efficient Multi-Scale Attention mechanism (EEMA), which further improves the model’s feature representation capabilities in complex environments through feature grouping, multi-scale parallel sub-networks, and cross-spatial interactive learning.
(5) MPDIoU Loss: By replacing the traditional regression loss function with the MPDIoU loss, the accuracy of bounding box regression is improved, the model convergence is accelerated, and the model’s localization accuracy is further enhanced.
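The grouped multi-scale attention described in (4) can be sketched as follows. This is a simplified illustration assuming the usual EMA-style ingredients (channel grouping, directional pooling in a 1x1 path, and a parallel 3x3 path); it does not reproduce the exact cross-spatial interaction of the proposed EEMA:

```python
import torch
import torch.nn as nn

class EEMASketch(nn.Module):
    """Simplified grouped multi-scale attention in the spirit of EEMA.

    Channels are split into groups; each group is re-weighted by
    directional (height/width) pooled descriptors from a 1x1 path and
    combined with a 3x3 convolutional path. The paper's exact
    cross-spatial interaction is not reproduced here.
    """

    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        cg = channels // groups
        self.conv1x1 = nn.Conv2d(cg, cg, kernel_size=1)
        self.conv3x3 = nn.Conv2d(cg, cg, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        g = x.reshape(n * self.groups, c // self.groups, h, w)

        # Directional descriptors: average over width and over height.
        pool_h = g.mean(dim=3, keepdim=True)   # (N*G, cg, H, 1)
        pool_w = g.mean(dim=2, keepdim=True)   # (N*G, cg, 1, W)
        a_h = torch.sigmoid(self.conv1x1(pool_h))
        a_w = torch.sigmoid(self.conv1x1(pool_w))

        # Gated 1x1 path plus a local 3x3 path, merged per group.
        out = g * a_h * a_w + self.conv3x3(g)
        return out.reshape(n, c, h, w)
```

Because attention is computed within each group independently, the extra cost grows with the per-group channel count rather than the full channel dimension, keeping the mechanism lightweight.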
Compared to existing research, DEMM-YOLO introduces innovative improvements in disease detection accuracy, data augmentation strategies, and module design. Traditional disease detection methods typically rely on real-world image data, the collection of which requires substantial human and material resources and is often hindered by issues like sample scarcity and data imbalance. Although previous studies have attempted to address these challenges through techniques such as data augmentation and transfer learning, problems like insufficient samples and interference from complex backgrounds remain. In contrast, this study effectively mitigates these issues by generating synthetic data, which enhances the detection capabilities for rare disease categories. Additionally, by incorporating innovative modules such as MFAM, DAT, and EEMA, the model’s robustness in complex backgrounds is significantly improved.
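To make the module-level descriptions above concrete, a minimal PyTorch sketch of the multi-branch idea behind MFAM is given below. The branch count, depthwise kernel sizes, 1x1 fusion, and residual connection are illustrative assumptions; the actual module design may differ:

```python
import torch
import torch.nn as nn

class MFAMSketch(nn.Module):
    """Illustrative multi-scale feature aggregation block.

    Parallel depthwise convolutions with different kernel sizes capture
    lesions at several scales; a 1x1 convolution fuses the branches and
    a residual connection preserves fine detail. Kernel sizes and the
    fusion scheme are assumptions, not the paper's exact specification.
    """

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Depthwise branches keep the parameter count low (lightweight design).
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
             for k in kernel_sizes]
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]
        out = self.fuse(torch.cat(feats, dim=1))
        return self.act(out + x)
```

Fusing small and large receptive fields in one block is what lets a single feature map respond to both tiny early-stage spots and large spreading lesions.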
Although the model achieves significant improvements, some limitations remain. Detection accuracy tends to decrease under challenging environmental conditions such as intense lighting, moving shadows, or changes in leaf angle, and lesion boundaries remain blurred when the contrast between diseased and healthy areas is low. In addition, although the synthetic images display individual disease symptoms realistically, they still struggle to fully capture the overlap of multiple diseases on a single leaf, which is common in real-world scenarios. Future research will therefore integrate real images of overlapping diseases and leverage the DALL-E model to generate images containing multiple crop disease types, improving the model’s ability to detect a broader range of diseases and to handle the complex, overlapping disease patterns encountered in actual field conditions.
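The deformable sampling idea behind the DAT module discussed above can be sketched in PyTorch as follows. This is a simplified single-head illustration (an assumed offset head with small, bounded shifts), not the actual C2PSA integration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSamplingSketch(nn.Module):
    """Simplified deformable sampling: learn a 2D offset per location
    and bilinearly re-sample the feature map at the shifted positions.
    Illustrates the core idea of deformable attention (escaping the
    fixed receptive field) without the full attention computation.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Predict (dx, dy) offsets for every spatial position.
        self.offset_head = nn.Conv2d(channels, 2, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, _, h, w = x.shape
        offsets = self.offset_head(x).permute(0, 2, 3, 1)  # (N, H, W, 2)

        # Base sampling grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).expand(n, h, w, 2)

        # Shift the grid by small, bounded learned offsets and re-sample.
        grid = base + torch.tanh(offsets) * 0.1
        return F.grid_sample(x, grid, align_corners=True)
```

Because the offsets are learned from the features themselves, sampling positions can drift toward irregular lesion shapes instead of staying on the fixed grid of a standard convolution.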
5. Conclusions
This study introduces the DEMM-YOLO approach for cotton leaf disease detection, which integrates LLM-driven synthetic image generation with architectural enhancements to improve performance. To address data imbalance, particularly for rare disease categories, we leveraged OpenAI’s DALL-E to generate synthetic images, effectively enriching the dataset and enhancing the model’s generalization capability in few-shot scenarios. To handle challenges such as varying lesion scales and irregular spatial distributions, we designed an MFAM module that combines multi-scale semantic information through a lightweight multi-branch structure. Additionally, we incorporated the DAT into the C2PSA module to overcome the limitations of fixed receptive fields, allowing the model to focus dynamically on lesions in complex backgrounds. The EEMA mechanism further improved the model’s feature representation by combining grouped features, multi-scale parallelism, and cross-spatial interaction. Finally, the use of the MPDIoU loss function enhanced bounding box regression accuracy and accelerated model convergence. Experimental results show that DEMM-YOLO achieves strong performance, with 94.8% precision, 93.1% recall, and 96.7% mAP@0.5. Compared to mainstream models such as YOLOv5s, YOLOv6s, and YOLOv8s, and even recent versions such as YOLOv10s and YOLOv12s, our model shows more balanced and consistently high performance across key metrics. Despite a slightly lower precision than RT-DETR and YOLOv10s, DEMM-YOLO outperforms them in overall detection stability and computational efficiency, achieving 81.5 FPS while maintaining a lightweight size of 20.1 MB. These results confirm DEMM-YOLO’s strong potential for real-world deployment in precision agriculture and intelligent disease monitoring systems.