Article

CottonCapT6: A Multi-Task Image Captioning Framework for Cotton Disease and Pest Diagnosis Using CrossViT and T5

1 School of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China
2 Ministry of Education Engineering, Research Center for Intelligent Agriculture, Urumqi 830052, China
3 Xinjiang Agricultural Informatization Engineering Technology Research Center, Urumqi 830052, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10668; https://doi.org/10.3390/app151910668
Submission received: 10 September 2025 / Revised: 27 September 2025 / Accepted: 30 September 2025 / Published: 2 October 2025
(This article belongs to the Section Agricultural Science and Technology)

Abstract

The identification of cotton diseases and pests is crucial for maintaining cotton yield and quality. However, conventional manual methods are inefficient and prone to high error rates, limiting their practicality in real-world agricultural scenarios. Furthermore, Convolutional Neural Network–Long Short-Term Memory (CNN-LSTM) models fall short in generating fine-grained and semantically rich image captions, particularly for complex disease and pest features. To overcome these challenges, we introduce CottonCapT6, a novel multi-task image captioning framework based on the Cross Vision Transformer (CrossViT-18-Dagger-408) and Text-to-Text Transfer Transformer (T5). We also construct a new dataset containing annotated images of seven common cotton diseases and pests to support this work. Experimental results show that CottonCapT6 achieves a Consensus-based Image Captioning Evaluation (CIDEr) score of 197.2% on the captioning task, demonstrating outstanding performance. Notably, the framework excels at producing descriptive, coherent, and contextually accurate captions. This approach has strong potential for deployment on cotton farms, helping pest control personnel and farmers make precise judgments about cotton diseases and pests. However, its generalizability to other crops and environmental conditions remains an area for future exploration.

1. Introduction

Cotton is widely recognized as a vital cash crop and a foundational raw material for the textile industry. Its yield and fiber quality have a direct impact on agricultural income and rural economic development, particularly in major cotton-producing regions such as Xinjiang, China [1,2,3]. In recent years, as farming practices have become increasingly large-scale and mechanized, diseases and pests have emerged as major threats to cotton yields. Typical threats such as cotton bollworm and Fusarium wilt can lead to yield losses exceeding 20% and degrade fiber quality by 10–15% [4]. Manual inspection remains the primary approach for disease and pest identification. However, it is constrained by limited scalability, subjective judgments, and field error rates of up to 35% [5]. These limitations underscore the need for intelligent and efficient alternatives.
In this context, deep learning has emerged as a powerful paradigm capable of addressing such challenges in agricultural research. Deep learning is a learning approach that utilizes artificial neural networks to extract knowledge from data [6]. With the advancement of artificial intelligence (AI) and deep learning, this technology has achieved increasingly significant results [7]. Deep learning models, particularly CNNs, have shown strong performance in agricultural identification by leveraging large-scale image data and hierarchical feature extraction. Juneja et al. proposed an advanced deep learning framework called MRI-CropNet for automatic prostate cropping from Magnetic Resonance Imaging (MRI), which achieved state-of-the-art performance metrics, including a Dice score of 0.99, further advancing the role of AI in early prostate cancer detection [8]. Deep learning methods have significantly improved both the accuracy and the efficiency of agricultural pest and disease detection. Nevertheless, most models are trained on generalized datasets and overlook the unique characteristics of specific diseases and pests. For cotton diseases and pests, for example, features such as lesion shape, lesion color, and pest morphology are critical for accurate diagnosis but are often neglected. Due to the reliance on the end-to-end learning paradigm, deep learning-based methods fail to establish connections between low-level image features and textual semantics, resulting in a semantic gap between feature representation and final output. This makes it difficult for users to comprehend the model’s reasoning logic and undermines its interpretability. One promising approach to enhance interpretability is to move beyond pure classification and leverage methods that can generate human-readable explanations of visual content.
To address the interpretability challenge in AI-driven diagnosis, we can generate natural language descriptions of pest and disease characteristics observed in images. To achieve this goal, image captioning methods can be employed to translate visual content into structured natural language descriptions. Image captioning has evolved from early neural encoder–decoder frameworks to attention-based models [9,10,11], with further refinements through hybrid approaches and semantic attention mechanisms, enabling more context-aware captioning [12,13,14]. In recent years, foundation models and transformers have significantly advanced image captioning, and these advancements lay the groundwork for domain adaptation in specialized fields, including agriculture. Image captioning has been widely applied across various fields, such as medical diagnosis, remote sensing, and agricultural disease and pest diagnosis, and is increasingly becoming a research focus. Kamal et al. proposed a clinical captioning method for medical images based on the fine-tuned Bootstrapping Language-Image Pretraining (BLIP) model, which incorporates Unified Medical Language System (UMLS) concept identifiers and explainable AI mechanisms to generate standardized diagnostic descriptions. Experimental results demonstrated that the method achieved high clinical agreement with a Bilingual Evaluation Understudy Score with precision—4-gram (BLEU-4) of 0.7300, indicating its potential to enhance the reliability and efficiency of radiological assessments [15]. Abed et al. introduced a report generation method that outperformed alternatives on BLEU-1 (0.587) and CIDEr (0.405), showing its ability to produce accurate clinical descriptions [16]. Han et al. conducted a comparative study of Vision Transformer and Visual Geometry Group 16-layer Network (VGG16) for remote sensing image captioning, and the experimental results demonstrated that the Vision Transformer model achieved superior performance with a BLEU score of 0.5507 at 50 samples [17]. Lin et al. proposed RS-MoE, improving caption specificity and scalability for remote sensing imagery [18].
Within agriculture, captioning models have been explored to enhance the understanding of plant diseases. However, adapting these models to complex agricultural environments remains challenging. Lu et al., Lee et al., and Quoc et al. extended multimodal captioning to crop disease diagnosis, combining image captioning with object detection and vision–language modeling [19,20,21]. Despite progress, these methods struggle with fine-grained, domain-specific features critical for accurate disease identification, such as leaf pubescence, vein structure, and boll morphology. Recent multimodal models like BLIP3-o, Florence-2, and Large Language Model Meta AI 4 (LLaMA-4), while strong in general tasks, require vast amounts of data and incur high computational costs, making them less practical for capturing the fine-grained agricultural specifics needed for accurate disease diagnosis [22,23]. Additionally, they often rely on fixed visual encoders, limiting adaptability to unique agricultural contexts. Recent efforts like Prompt-based Captioning (PromptCap), Bootstrapping Language-Image Pretraining with Dynamic Prompting (BLIP-DP), and zero-shot Large Language Model-based (LLM-based) semantic correction show promise [24,25]. Furthermore, studies highlight the lack of image–text datasets curated specifically for cotton and the absence of captioning frameworks that balance semantic precision, interpretability, and crop-specific relevance.
Motivated by these gaps, the aim of this paper is to introduce CottonCapT6 and demonstrate how it enhances the diagnosis of cotton diseases and pests by combining advanced deep learning models, including CrossViT and T5, with a dedicated multimodal dataset. The main contributions of this research are as follows:
  • A high-quality image–text dataset, CottonDP, was constructed to provide domain-specific image–text data for the diagnosis of cotton diseases and pests.
  • A new image captioning framework was also proposed in this study. The framework innovatively integrates a multi-scale visual encoder (CrossViT) and a text generation model (T5) to solve the task of generating descriptions for agricultural disease and pest diagnosis [26,27].
  • A training strategy based on classification enhancement was designed. The core innovation of the framework is a two-stage training strategy. Firstly, the visual encoder was pretrained through the cotton pest classification task to learn the discriminant visual representation from local lesion microscopic features to global distribution. This enhanced visual perception was then transferred to the description generation task, significantly improving the semantic accuracy and interpretability of diagnostic descriptions.
  • The model effectively describes the details of cotton pest and disease images, improves the accuracy of diagnosis, and is of great value for promoting the informatization and intelligent management of cotton planting.

2. Materials and Methods

2.1. Research Framework

With the advancement of deep learning technology, the scope of agricultural research has expanded from conventional pest identification to more complex image captioning tasks. In this study, an end-to-end framework combining visual encoding and natural language generation was developed. The overall process is illustrated in Figure 1. During the data acquisition phase, image samples of seven typical cotton diseases and pests were collected from public search platforms, including the Baidu Image Library (https://image.baidu.com/, accessed on 15 February 2025) and the Google Image Library (https://images.google.com/, accessed on 16 February 2025). Unlike traditional methods that train visual and textual components together from scratch, we employ a two-stage training strategy to better capture the specific visual semantics of cotton diseases and pests. In the first stage, the CrossViT-18-Dagger-408 model is trained for cotton disease and pest classification. This allows it to learn discriminative visual representations. Once this pretraining is complete, the model’s weights are transferred and used as the visual encoder in the captioning framework. In the second stage, the T5 language model is guided by these learned features to generate descriptive, structured diagnostic text.
This two-stage approach enables the model to capture detailed domain-specific visual features and ensures that the text generated is not only accurate but also contextually relevant. The architecture is built on the encoder–decoder structure, where CrossViT serves as the visual encoder and T5 generates the corresponding textual descriptions. This separation allows the model to focus on learning image-specific features first before moving on to the more complex task of generating descriptive text.
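As a concrete illustration of this hand-off, the sketch below shows how a classification-pretrained visual backbone can be wired into a captioning model in PyTorch; the module names and interfaces are illustrative assumptions rather than the released implementation.
```python
# Minimal sketch of the two-stage hand-off (illustrative names, not the released implementation).
import torch.nn as nn

class CottonCaptioner(nn.Module):
    """Stage 2: reuse the classification-pretrained visual encoder to condition a text decoder."""
    def __init__(self, visual_encoder: nn.Module, text_decoder: nn.Module):
        super().__init__()
        self.visual_encoder = visual_encoder  # CrossViT backbone from Stage 1, classification head removed
        self.text_decoder = text_decoder      # T5-style decoder

    def forward(self, images, caption_tokens):
        visual_features = self.visual_encoder(images)  # features learned on the 7-class task
        return self.text_decoder(visual_features, caption_tokens)

# Stage 1: train a CrossViT classifier on the seven cotton disease/pest categories, then
# transfer its backbone into the captioner, e.g.:
# captioner = CottonCaptioner(pretrained_crossvit_backbone, t5_decoder)
```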

2.2. Dataset Construction

2.2.1. Data Collection

During cotton cultivation, the occurrence of diseases and pests is both frequent and irregular, making it difficult to obtain relevant images promptly using traditional methods. To overcome this limitation, this study constructed a training dataset using images from Baidu and Google image libraries. While these web-sourced images were collected from public platforms, efforts were made to ensure that the selected images accurately represent the conditions of major cotton-producing regions in China, particularly the Xinjiang region. Most of the images were taken during the peak periods of disease and pest outbreaks, and the collection was focused on these critical periods to capture the most relevant and contextually accurate data for training. By restricting the image collection to these outbreak periods, we aimed to enhance the dataset’s representativeness and ensure consistency with the ecological and pathological conditions specific to the region.
Compared to manual collection, Python-based (Python 3.9.7) web crawling technology was employed to efficiently gather a large number of images, significantly reducing the workload involved in data collection. This dataset, named CottonDP, includes images representing seven common cotton diseases and pests: target spot disease, powdery mildew, wilt disease, thrips, whiteflies, cotton aphids, and cotton bollworms. The images were primarily collected during peak disease and pest seasons, reinforcing the dataset’s relevance to real-world agricultural conditions.

2.2.2. Image Preprocessing and Augmentation

During the image acquisition phase, a total of 3500 raw image samples representing seven types of cotton diseases and pests were collected. To ensure high image quality and enhance the effectiveness of subsequent model training, the dataset underwent the following three processing steps.
Firstly, the images were organized by pest and disease type, assigned standardized identification (ID) numbers, and converted to JPG format. Then, Python scripts were used to implement deduplication based on perceptual hashes, automatically extracting visual features and eliminating duplicate or highly similar images, thereby significantly reducing the manual screening workload. Finally, manual quality screening was carried out to remove photos that did not meet quality standards. Following these procedures, 784 high-quality photos were retained.
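As an illustration of the deduplication step, the following sketch uses the imagehash package to drop near-duplicate images via perceptual hashing; the directory layout and distance threshold are hypothetical, not taken from the paper.
```python
# Sketch of perceptual-hash deduplication with the imagehash package; the directory
# path and distance threshold are hypothetical.
from pathlib import Path
from PIL import Image
import imagehash

def deduplicate(image_dir: str, max_distance: int = 5) -> list:
    """Keep one image per group of visually near-identical images."""
    kept, kept_hashes = [], []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        h = imagehash.phash(Image.open(path))  # 64-bit perceptual hash
        # Treat an image as a duplicate if its Hamming distance to any kept hash is small.
        if all(h - other > max_distance for other in kept_hashes):
            kept.append(path)
            kept_hashes.append(h)
    return kept

# unique_images = deduplicate("CottonDP/raw/target_spot")
```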
To minimize the risk of overfitting from a limited sample size and to enhance both data diversity and model robustness, data augmentation techniques were utilized (see the sketch after this paragraph). First, image brightness was randomly adjusted to simulate different lighting environments. Second, horizontal and vertical flipping were used to enrich spatial transformations and reduce the model’s dependence on image orientation. Third, random noise was added to the images to improve the model’s robustness against disturbances. After augmentation, the dataset was expanded to 2801 images encompassing seven representative types of cotton diseases and pests. The per-category sample counts after augmentation were as follows: cotton target spot with 400 samples, cotton powdery mildew with 500 samples, cotton wilt with 400 samples, cotton thrips with 300 samples, cotton whitefly with 301 samples, cotton aphid with 450 samples, and cotton bollworm with 450 samples.
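A minimal augmentation pipeline mirroring the three operations described above could be expressed with torchvision transforms as follows; the specific parameter values are illustrative assumptions, not the paper's settings.
```python
# Augmentation pipeline mirroring the three operations above (brightness jitter, flips,
# additive noise); parameter values are illustrative, not the paper's settings.
import torch
from torchvision import transforms

class AddGaussianNoise:
    def __init__(self, std: float = 0.02):
        self.std = std
    def __call__(self, tensor: torch.Tensor) -> torch.Tensor:
        return (tensor + torch.randn_like(tensor) * self.std).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.Resize((408, 408)),            # CrossViT-18-Dagger-408 input resolution
    transforms.ColorJitter(brightness=0.3),   # simulate different lighting environments
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ToTensor(),
    AddGaussianNoise(std=0.02),               # robustness to sensor and background noise
])

# augmented_tensor = augment(Image.open("aphid_0001.jpg").convert("RGB"))
```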
Ultimately, the CottonDP dataset was partitioned into training, validation, and test sets in an 8:1:1 ratio, comprising 2241, 224, and 224 images, respectively.

2.2.3. Data Preprocessing for Classification Task

To improve the model’s capacity to identify distinguishing features in cotton disease and pest images, image classification was employed as a pretraining task. This approach equips the model with the capability to differentiate between various types of cotton diseases and pests. Accordingly, accurate image annotation was essential for constructing a high-quality classification dataset. The open-source tool LabelMe was used to annotate the collected images manually, assigning clear category labels such as cotton target spot disease, cotton wilt disease, and cotton bollworm, among others.
This annotation process established a solid foundation for the subsequent model training, greatly improving the model’s capability to differentiate between various disease and pest categories. Following annotation, the dataset was organized, cleaned, and partitioned to ensure a balanced sample distribution, allowing the model to learn diverse image features during training effectively.
The resulting image classification dataset is essential in the training and evaluation process, providing a strong foundation for more complex downstream tasks like image caption generation.

2.2.4. Data Preprocessing for Captioning Task

To support data annotation for the image captioning task, this study first collected symptom keywords corresponding to images of various cotton diseases and pests from authoritative platforms and professional books such as “Cotton Diseases and Pests in Xinjiang and Their Control”. On this basis, a unified annotation guideline was established. For example, for cotton target spot disease and cotton whitefly, the following professional keywords were extracted:
  • Cotton target spot disease: brown circular lesions;
  • Cotton whitefly: pale bodies and white wing covers.
Based on this, standardized image captioning sentences were constructed according to the specific symptoms and pest/disease types presented in the images. The examples are as follows:
  • Cotton target spot disease: Small brown lesions on cotton leaves suggest target spot.
  • Cotton whitefly: Cotton whiteflies show pale bodies and white wing covers.
To ensure the professionalism and consistency of the annotations, the annotation results were reviewed by experts based on the unified annotation guidelines. Experts provided feedback on some annotated samples, and through multiple rounds of review and revision, the annotations were refined to maintain high quality.
To boost the model’s performance in comprehending diverse linguistic expressions, this study further introduced two text enhancement techniques: word order transformation and synonym replacement. Word order transformation involves constructing new sentences by rearranging the order of sentence components, leveraging the relatively flexible sentence structure of English to create more diverse sentence patterns; synonym replacement involves substituting keywords with their synonyms without altering the original meaning, thereby generating new sentences with consistent semantics but different expressions. These enhancement techniques help improve the model’s language generalization ability and robustness, particularly in scenarios with small sample sizes.
In practice, for each image, a standard descriptive sentence is constructed based on the extracted keywords, and one enhanced sentence is constructed using each of the word order transformation and synonym replacement techniques, resulting in a total of three descriptive sentences per image. Specific examples of image descriptive sentences in this dataset are shown in Table 1.
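The two text enhancement operations can be illustrated with a toy sketch such as the one below; the synonym table, reorder rule, and example sentence are hypothetical and only meant to show the idea.
```python
# Toy sketch of the two caption-augmentation operations; the synonym table, reorder rule,
# and example sentence are hypothetical illustrations.
import random

SYNONYMS = {
    "small": ["tiny", "minor"],
    "brown": ["brownish", "tan"],
    "lesions": ["spots", "marks"],
}

def synonym_replace(caption: str) -> str:
    """Swap known keywords for synonyms without changing the meaning."""
    words = [random.choice(SYNONYMS.get(w.lower(), [w])) for w in caption.split()]
    words[0] = words[0].capitalize()
    return " ".join(words)

def reorder(caption: str) -> str:
    """Simple pattern-based word-order transform: 'A on B suggest C' -> 'On B, A suggest C'."""
    if " on " in caption and " suggest " in caption:
        subject, rest = caption.split(" on ", 1)
        location, diagnosis = rest.split(" suggest ", 1)
        return f"On {location}, {subject.lower()} suggest {diagnosis}"
    return caption

base = "Small brown lesions on cotton leaves suggest target spot"
print(synonym_replace(base))  # e.g. "Tiny brownish spots on cotton leaves suggest target spot"
print(reorder(base))          # "On cotton leaves, small brown lesions suggest target spot"
```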
Finally, this study constructed a total of 8403 image captioning sentences for 2801 images, with 2241 images and 6723 sentences in the training set, 224 images and 672 sentences in the validation set, and 224 images and 672 sentences in the test set. As shown in Figure 2 and Figure 3, the CottonDP dataset includes seven cotton disease and pest categories: the bar chart depicts the number of image samples for each category, while the doughnut chart illustrates the distribution of samples across the training, validation, and test sets (inner ring) and the relative composition of categories within each set (outer ring), with segment sizes corresponding to sample counts. This diverse, high-quality image captioning dataset provides a robust data foundation for training and evaluating models in image captioning tasks.

2.3. Model Architecture

2.3.1. Classification Pretraining with CrossViT-18-Dagger-408

This study adopts the CrossViT-18-dagger-408 pretrained model, which is based on the Vision Transformer (ViT) architecture, for image classification tasks. CrossViT is a multi-scale visual transformer designed to effectively capture image features at different levels of granularity. Unlike traditional CNNs, CrossViT leverages cross-attention mechanisms to model global dependencies between different image scales. This capability significantly enhances the model’s effectiveness in understanding complex visual content.
As illustrated in Figure 4, the overall architecture of CrossViT comprises three main modules: the Multi-Scale Image Encoder, the Cross-Attention Fusion Module, and the Classification Head. In the input stage, the image is partitioned into two scales of patches, large and small, to achieve a balance between capturing fine-grained local details and global semantic information. These multi-scale patches are then fed into separate Transformer branches for feature encoding.
Subsequently, the multi-scale features extracted by the encoders are fused through efficient information exchange in the Cross-Attention Fusion Module. As depicted in Figure 5, the core of this module is the cross-attention mechanism implemented within the large-branch Transformer. This mechanism enables the integration of complementary information from both scales, enhancing the model’s ability to capture rich and hierarchical visual representations.
Within each branch, standard self-attention is first applied to the corresponding patch tokens to capture intrascale feature representations. This is followed by inter-branch information exchange facilitated by the Cross-Attention Fusion Module. At the core of this mechanism is the use of the Class (CLS) token from one branch as the query, while the patch tokens from the other branch serve as the key and value in the cross-attention computation. The input to the cross-attention operation is defined as follows:
$$x'^{l} = \left[\, f^{l}\!\left(x_{cls}^{l}\right) \;\Vert\; x_{patch}^{s} \,\right],$$
where $f^{l}(\cdot)$ is the projection function used for dimension alignment and $x_{cls}^{l}$ is the CLS token of the large branch. The module then performs the cross-attention calculation between $f^{l}\!\left(x_{cls}^{l}\right)$ and $x'^{l}$, as expressed by the following formulas:
$$q = f^{l}\!\left(x_{cls}^{l}\right) W_{q}, \qquad k = x'^{l} W_{k}, \qquad v = x'^{l} W_{v},$$
$$A = \operatorname{softmax}\!\left(\frac{q k^{T}}{\sqrt{C/h}}\right), \qquad \operatorname{CA}\!\left(x'^{l}\right) = A v,$$
where $W_{q}, W_{k}, W_{v} \in \mathbb{R}^{C \times (C/h)}$ are the learnable parameters, $C$ is the embedding dimension, and $h$ is the number of attention heads.
Since only the CLS token is used as the query vector in this context, the computation and memory overhead for generating the attention map $A$ are linear in complexity, as opposed to the quadratic complexity of full attention mechanisms. This design significantly improves the efficiency of the entire process.
More specifically, the output of the cross-attention is added to the projected CLS token through a residual connection and then concatenated with the patch tokens for subsequent encoding; this is defined as follows:
$$y_{cls}^{l} = f^{l}\!\left(x_{cls}^{l}\right) + \operatorname{MCA}\!\left(\operatorname{LN}\!\left(\left[\, f^{l}\!\left(x_{cls}^{l}\right) \;\Vert\; x_{patch}^{s} \,\right]\right)\right),$$
$$z^{l} = \left[\, g^{l}\!\left(y_{cls}^{l}\right) \;\Vert\; x_{patch}^{l} \,\right].$$
In the above equations, $f^{l}(\cdot)$ and $g^{l}(\cdot)$ are the projection and inverse projection functions used for dimension alignment, respectively.
The small branch also performs a similar cross-attention operation. Afterward, the CLS tokens from both branches are sent to the subsequent Transformer encoding layers, improving their semantic representation. The final CLS tokens are passed to a Multi-Layer Perceptron (MLP) head for classification. The classification results from both branches are combined or fused to determine the category of the cotton disease and pest images.
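A simplified, single-head reading of the cross-attention step defined above can be sketched in PyTorch as follows; the dimensions and module structure are illustrative assumptions and omit the multi-head and repeated-fusion details of the full CrossViT implementation.
```python
# Single-head sketch of CLS-token cross-attention: the large-branch CLS token queries the
# small-branch patch tokens (a simplified reading, not the authors' implementation).
import math
import torch
import torch.nn as nn

class ClsCrossAttention(nn.Module):
    def __init__(self, dim_small: int, dim_large: int):
        super().__init__()
        self.f = nn.Linear(dim_large, dim_small)   # f^l: project the CLS token to the small-branch dim
        self.g = nn.Linear(dim_small, dim_large)   # g^l: project the fused CLS token back
        self.wq = nn.Linear(dim_small, dim_small, bias=False)
        self.wk = nn.Linear(dim_small, dim_small, bias=False)
        self.wv = nn.Linear(dim_small, dim_small, bias=False)
        self.norm = nn.LayerNorm(dim_small)

    def forward(self, cls_large: torch.Tensor, patch_small: torch.Tensor) -> torch.Tensor:
        # cls_large: (B, 1, dim_large); patch_small: (B, N, dim_small)
        cls_proj = self.f(cls_large)                                # f^l(x^l_cls)
        x = self.norm(torch.cat([cls_proj, patch_small], dim=1))   # LN([f^l(x_cls) || x_patch^s])
        q = self.wq(x[:, :1])                                      # only the CLS token queries
        k, v = self.wk(x), self.wv(x)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))   # (B, 1, N+1)
        out = attn.softmax(dim=-1) @ v                              # CA(x'^l) = A v
        fused_cls = cls_proj + out                                  # residual connection
        return self.g(fused_cls)                                    # back to the large-branch dim

# ca = ClsCrossAttention(dim_small=256, dim_large=448)
# y = ca(torch.randn(2, 1, 448), torch.randn(2, 196, 256))   # -> (2, 1, 448)
```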
CrossViT is designed to leverage the power of Vision Transformers in modeling long-range dependencies and multi-scale perception. The pretrained CrossViT-18-Dagger-408 model, which was fine-tuned on the ImageNet-1K dataset, supports 408 × 408 high-resolution inputs. This enhances the model’s generalization ability, making it suitable for various downstream tasks, especially in small-sample learning scenarios. The 408 × 408 input resolution ensures high precision, improving classification accuracy—vital for high-precision agricultural image recognition tasks like detecting cotton diseases and pests.
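For the classification pretraining stage, one plausible setup is to load the pretrained backbone from the timm library and fine-tune it on the seven CottonDP categories, as sketched below; the learning rate, optimizer settings, and helper function are illustrative assumptions, not the paper's configuration.
```python
# Sketch of Stage 1: fine-tune a pretrained CrossViT-18-Dagger-408 on the seven CottonDP
# classes via the timm library; training details here are assumptions.
import timm
import torch
import torch.nn as nn

NUM_CLASSES = 7  # target spot, powdery mildew, wilt, thrips, whitefly, aphid, bollworm

model = timm.create_model("crossvit_18_dagger_408", pretrained=True, num_classes=NUM_CLASSES)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of 408 x 408 cotton disease/pest images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# After training, save the weights so the backbone can be reused as the captioning encoder:
# torch.save(model.state_dict(), "crossvit_cotton_cls.pt")
```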

2.3.2. Captioning with T5 and Transfer Learning

The image captioning model used in this study is built on a conventional end-to-end architecture. It is widely used in natural language processing tasks. The entire model consists of two parts: an encoder and a decoder. The encoder extracts visual features from the input image, while the decoder generates corresponding natural language descriptions based on the visual features. In this study, the encoder uses the CrossViT-18-Dagger-408 model. It has been pretrained for cotton disease and pest image classification tasks and possesses the ability to extract semantic features from cotton disease and pest images. The decoder uses the T5 model, which excels at natural language generation, to generate descriptive sentences corresponding to images.
  • CrossViT-18-Dagger-408 Encoder
For the image captioning task, the CrossViT-18-Dagger-408 model is used as the encoder. This model, pretrained for cotton disease and pest image classification, effectively extracts multi-scale semantic features, providing detailed information for generating accurate textual descriptions.
The model first processes the input image to extract a set of high-dimensional visual features, which capture the semantic essence of the entire image. In the subsequent decoding phase, the visual features are used to generate textual descriptions that accurately correspond to the image content. Since the structure of this model has been thoroughly introduced in the previous classification task, it will not be repeated here.
  • T5 Decoder
The decoder uses T5, a powerful language model based on the Transformer architecture. T5 transforms various natural language processing tasks into a text-to-text problem, allowing it to generate descriptive sentences from the extracted visual features.
As shown in Figure 6, T5 includes several improvements over the standard Transformer architecture. Notably, T5 replaces external position encoding with relative position biases, enhancing its flexibility for variable-length inputs. It also simplifies the layer normalization structure, improving training stability and reducing redundancy. Each block in T5 includes self-attention, cross-attention, and feedforward networks, making it effective at modeling context and integrating encoder information.
In this study, T5 uses the image feature vectors output by CrossViT-18-Dagger-408 as input, and through its self-attention mechanism and encoder–decoder attention mechanism, it performs context modeling to gradually generate semantically coherent and content-accurate descriptive text. T5 can not only capture syntactic structures and semantic logical relationships but also flexibly integrate professional terminology and expression habits in the field of diseases and pests, thereby generating more precise and natural image captioning sentences.
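A hedged sketch of this wiring, using the Hugging Face transformers API, is shown below: the projected CrossViT tokens are passed to T5 as precomputed encoder outputs for both training and generation. The "t5-base" checkpoint, the 448-dimensional feature size, and the decoding parameters are assumptions for illustration only.
```python
# Sketch of conditioning a T5 decoder on CrossViT features via Hugging Face transformers;
# checkpoint name, feature dimension, and decoding settings are assumptions.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5TokenizerFast
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
project = nn.Linear(448, t5.config.d_model)  # map visual tokens to T5's hidden size

def caption_loss(visual_tokens: torch.Tensor, captions: list) -> torch.Tensor:
    """Training: T5 cross-attends to the projected visual tokens and is supervised by captions."""
    enc = BaseModelOutput(last_hidden_state=project(visual_tokens))
    labels = tokenizer(captions, return_tensors="pt", padding=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss
    return t5(encoder_outputs=enc, labels=labels).loss

def generate_caption(visual_tokens: torch.Tensor) -> str:
    """Inference: autoregressively decode a diagnostic description from the visual tokens."""
    enc = BaseModelOutput(last_hidden_state=project(visual_tokens))
    ids = t5.generate(encoder_outputs=enc, max_length=40, num_beams=3)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

# feats = crossvit_encoder(image)  # e.g. shape (1, num_tokens, 448) from the pretrained encoder
# print(generate_caption(feats))
```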

2.3.3. Overall Pipeline and Model Architecture

The overall structure of the CottonCapT6 model is demonstrated in Figure 7. The image captioning process consists of two stages. First, the CrossViT-18-Dagger-408 model is pretrained on a cotton-specific image classification task to learn domain-relevant visual features, capturing subtle lesion characteristics and pest morphology. This classification pretraining step enhances the semantic representation capability of the visual encoder. Second, the pretrained CrossViT encoder is integrated into the encoder–decoder architecture of CottonCapT6. When an input image is fed into the system, the encoder extracts a high-dimensional visual feature vector that encodes the global semantic content of the image. This feature vector is then passed into the T5 decoder, which incorporates positional encoding and multi-layer attention mechanisms—including self-attention and cross-attention—to gradually transform visual semantics into coherent diagnostic language.
During training, the T5 model is optimized by minimizing the difference between the generated text and the true captions. During inference, the model can automatically generate clear, accurate, and natural descriptions for any input cotton disease or pest image.

2.4. Experimental Setup

Experiments were conducted on Ubuntu 20.04 with an Intel Xeon Platinum 8457C CPU (Intel Corporation, Santa Clara, CA, USA), 100 GB RAM, NVIDIA L20 GPU (48 GB), CUDA 11.8, and PyTorch 2.1.2. The settings of the experimental parameters are presented in Table 2.
As shown in the table, both experiments were optimized using the Adam optimizer. A key element of the training strategy was the use of a validation set to monitor both the validation loss and key metrics (e.g., the CIDEr score). Early stopping was implemented: training was halted if no improvement was observed for 2 consecutive epochs, helping to prevent overfitting. Additionally, the model checkpoint with the best performance on the validation set was selected as the final model for testing. The final evaluation of all models, as detailed in Section 3, was conducted using a held-out test set to ensure an unbiased assessment.
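The early-stopping logic described above can be sketched as follows; the training and validation helpers are passed in as placeholders rather than the authors' code.
```python
# Early-stopping sketch: monitor a validation metric, stop after `patience` epochs without
# improvement, and keep the best checkpoint; the helper callables are placeholders.
import copy

def fit(model, train_one_epoch, evaluate_metric, epochs: int = 50, patience: int = 2):
    best_score, best_state, bad_epochs = float("-inf"), None, 0
    for epoch in range(epochs):
        train_one_epoch(model)                                # one pass over the training set
        score = evaluate_metric(model)                        # e.g. validation CIDEr
        if score > best_score:
            best_score, bad_epochs = score, 0
            best_state = copy.deepcopy(model.state_dict())    # remember the best checkpoint
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                        # no improvement for 2 epochs
                break
    model.load_state_dict(best_state)                         # final model = best checkpoint
    return model
```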

2.5. Evaluation Indicators

In this study, the evaluation of the classification task employed metrics such as precision, recall, F1-score, and accuracy to comprehensively assess the classification performance of pretrained models in image recognition. For the image captioning task, evaluation metrics including BLEU1–4, Recall-Oriented Understudy for Gisting Evaluation—Longest Common Subsequence (ROUGE-L), Metric for Evaluation of Translation with Explicit Ordering (METEOR), and CIDEr were used to measure the accuracy of the generated image descriptions. According to the characteristics of the two tasks, the classification performance was determined based on the overall results of the four performance indicators across multiple models, while the accuracy of language generation was comprehensively evaluated based on various linguistic metrics reflecting semantic relevance, syntactic structure, and readability of the generated sentences.
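One common way to compute these captioning metrics is the COCO caption evaluation toolkit (pycocoevalcap); the sketch below assumes that package and uses illustrative image identifiers and captions.
```python
# Metric computation sketch using the COCO caption evaluation toolkit (pycocoevalcap);
# the image identifiers and captions are illustrative.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

def score_captions(references: dict, predictions: dict) -> dict:
    """Both arguments map image id -> list of caption strings (three references per image here)."""
    results = {}
    bleu_scores, _ = Bleu(4).compute_score(references, predictions)
    results.update({f"BLEU-{i + 1}": s for i, s in enumerate(bleu_scores)})
    results["ROUGE-L"], _ = Rouge().compute_score(references, predictions)
    results["METEOR"], _ = Meteor().compute_score(references, predictions)  # requires a Java runtime
    results["CIDEr"], _ = Cider().compute_score(references, predictions)
    return results

# refs = {"img_001": ["Small brown lesions on cotton leaves suggest target spot", "..."]}
# hyps = {"img_001": ["Brown target-pattern lesions on a cotton leaf"]}
# print(score_captions(refs, hyps))
```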

3. Results

3.1. The Results of the Image Classification Task

3.1.1. Quantitative Analysis of Image Classification Task

This experiment evaluated the performance of eight mainstream deep learning models in image classification tasks using cotton disease and pest image data, including Residual Network-101 (ResNet-101), Efficient Network-B4 (EfficientNet-B4), Inception Version 3 (Inceptionv3), Mobile Network Version 2 (MobileNetV2), VGG16, Swin Transformer, Vision Transformer Base with 16 × 16 Patch Size (ViT-B/16), and the CrossViT_18_dagger_408 model adopted in this study. All models were trained and assessed using a consistent dataset and an identical data augmentation strategy. The primary evaluation metrics included precision, recall, F1-score, and accuracy on the validation set. The specific results are shown in Table 3.
From the experimental results, CNNs such as ResNet-101, MobileNetV2, and VGG16 showed relatively weaker performance, with accuracy values of 80.50%, 78.10%, and 80.86%, respectively. In contrast, EfficientNet-B4 and Inceptionv3 achieved stronger results, reaching 82.14% and 86.79% accuracy. Overall, Transformer architectures outperform traditional CNN models on certain metrics. Specifically, Swin Transformer attained 79.50% accuracy, while ViT-B/16 and the CrossViT_18_dagger_408 model reached 90.60% and 91.43%, respectively. Among all compared methods, the proposed model achieved the highest precision (92.00%), recall (91.00%), and F1-score (91.00%), establishing clear superiority over the other models.
In summary, the CrossViT_18_dagger_408 model demonstrates optimal performance in complex agricultural image recognition tasks due to its multi-scale modeling capabilities and sensitivity to small objects, validating its effectiveness and practical value in cotton disease and pest classification tasks.

3.1.2. Qualitative Analysis of Image Classification Task

To provide additional confirmation of the model’s diagnostic performance, one representative image from each of the seven disease and pest categories in the test set was chosen for comparative visualization. The results depicted in Figure 8 demonstrate the classification results of the CrossViT_18_dagger_408 model alongside EfficientNet-B4 and InceptionV3, highlighting its superior accuracy and confidence in real-world scenarios across diverse conditions.
As shown in the results, the CrossViT_18_dagger_408 model successfully identified the corresponding disease and pest categories across all seven test images, with predicted labels fully matching the actual labels. Notably, it achieved high prediction confidence—for instance, 0.98 for target spot—significantly outperforming EfficientNet-B4 and InceptionV3 in both accuracy and confidence levels, demonstrating superior discriminative capability for fine-grained feature recognition.
Specifically, the model can accurately identify typical leaf spot diseases (e.g., target spot), fungal diseases (e.g., powdery mildew), and various pests (e.g., aphid, whitefly, thrips, bollworm) and maintains a high level of distinction even between categories with minor morphological differences. Particularly when dealing with diseases like wilt, which exhibits uneven symptom distribution and diverse morphological characteristics, the model can make correct judgments, demonstrating strong generalization ability and robustness.
Additionally, the model’s classification performance across various types of diseases and pests was further analyzed using a confusion matrix to assess its discriminative ability between similar categories. As shown in Figure 9, the model’s overall classification performance was satisfactory, with most samples correctly classified, demonstrating good diagnostic accuracy.
Specifically, the model demonstrates stable recognition performance and high accuracy for categories such as powdery mildew (49/50), thrips (29/30), aphid (40/45), and bollworm (40/45), indicating that the CrossViT-18-Dagger-408 model exhibits a certain level of robustness in identifying the characteristics of these diseases and pests. However, there is still some confusion between certain categories, with the following given as examples:
  • In the wilt category, four samples were misclassified as bollworm, and two samples were misclassified as target spot;
  • There was minor confusion between whitefly and aphid, with 2–3 misclassified samples each;
  • There were also a few cross-misclassifications between target spot and powdery mildew (one sample each).
To further refine the model’s diagnostic capabilities across categories, statistical analyses of the precision, recall, and F1-score metrics for each pest or disease category were conducted, and the results are visualized in Figure 10. The results show that the model achieved F1-scores above 0.90 for categories such as powdery mildew, target spot, thrips, and aphids, demonstrating high recognition accuracy and stability. However, the F1-score was relatively low for wilt and bollworm categories, at 0.87 and 0.84, indicating that the model still faces certain challenges in identifying these categories, which may be related to the diversity of their disease morphologies and local interference factors.
In summary, through visualization analysis of typical disease and pest samples, confusion matrix evaluation, and comparison of various category metrics, the model’s diagnostic capabilities in multi-category disease and pest identification tasks were systematically validated. The experimental results show that the model not only accurately identifies various types of diseases and pests but also demonstrates high consistency and accuracy in the identification of leaf spot diseases, fungal diseases, and common pests. Additionally, it exhibits good discriminative ability when dealing with complex backgrounds, small target areas, and situations where features between different categories are similar.
Based on the confusion matrix and classification reports, the model achieved high precision, recall, and F1-score values for most categories, particularly showing stable performance in categories such as powdery mildew, target spot, and thrips. However, for categories with complex morphologies or easily confused features, such as wilt and bollworm, there are still some misclassification phenomena, primarily concentrated between categories with visually similar features, which exhibit reasonable interpretability.
Overall, the model demonstrates good robustness and generalizability in real farmland environments, with broad application potential, providing efficient and reliable technical support for intelligent monitoring and precise control of diseases and pests.

3.2. The Results of the Image Captioning Task

The image captioning task is implemented using an encoder–decoder structure: the CrossViT model trained on the classification task extracts image features, and the three descriptive sentences per image serve as the textual input for training. Training uses the sparse categorical cross-entropy loss function.
Figure 11 shows the changes in the loss function during the training process. The model converges at around the 40th iteration, with the loss eventually settling at around 0.6.

3.2.1. Quantitative Analysis of the Image Captioning Task

To validate the impact of each module on the quality of the generated text, a series of ablation experiments was designed and implemented, with the specific results shown in Table 4. The experiments covered different combinations of encoder and decoder structures, including the traditional CNN+LSTM, CNN+LSTM+Attention, CNN+Transformer, and CNN+T5 models, as well as the CrossViT-based CrossViT+Transformer and CrossViT+T5 models and the proposed CottonCapT6 model.
As shown in the results, the overall quality of text generation improves progressively with enhancements in both the visual encoder and language decoder architectures. Traditional models such as CNN+LSTM yield relatively low performance across all metrics (e.g., BLEU4 of 36.6 and CIDEr of 135.3), which often translates to generic, incomplete, or factually inconsistent captions that limit their practical utility. When the T5 decoder is introduced (CNN+T5), the CIDEr score rises to 180.0, demonstrating the effectiveness of pretrained language modeling. In practical terms, this leap signifies a substantial improvement in the linguistic fluency and diversity of the generated descriptions, moving beyond rigid template-like phrases.
Further improvements are observed with the integration of CrossViT as the visual encoder. The CrossViT+Transformer and CrossViT+T5 models achieve CIDEr scores of 195.0 and 196.0, respectively, confirming the benefit of multi-scale visual encoding.
Notably, the proposed model CottonCapT6, which incorporates a two-stage training strategy (classification pretraining followed by captioning), achieves the highest performance across all metrics, including BLEU4 of 48.7, ROUGE-L of 67.4, METEOR of 42.0, and CIDEr of 197.2. The practical significance of these incremental yet consistent gains lies in the enhanced semantic richness, factual correctness, and reliability of the generated captions. A higher BLEU score reflects closer n-gram overlap with annotations, while the peak CIDEr score—better aligned with human judgment—indicates that the outputs are not only accurate but also more comprehensive and contextually appropriate.
These improvements are particularly valuable in real-world applications, where CottonCapT6 can assist cotton pest control specialists and cotton farmers by generating precise and naturally worded diagnostic statements for easier interpretation. Overall, these results clearly demonstrate that beyond architectural design, task-specific visual pretraining plays a crucial role in enhancing the semantic quality and linguistic accuracy of caption generation.

3.2.2. Qualitative Analysis of Image Captioning Task

To assess the effectiveness of the image captioning results, as shown in Table 5, we selected several examples and compared the generated captions with the manually annotated text in terms of semantic consistency and keyword extraction accuracy. The specific comparison and analysis results are as follows:
  • Disease Image 1: The model-generated description includes keywords such as ‘target-pattern lesions,’ which are semantically consistent with the manually annotated terms ‘yellowish-brown spots’ and ‘round spores,’ accurately capturing the morphology and typical symptoms of the lesions, demonstrating strong semantic alignment capability.
  • Disease Image 2: The predicted text ‘Fuzzy white deposits on young leaves’ and the improved version ‘Fine white fungal growth accumulates…’ both accurately reflect the growth characteristics of powdery mildew, highly consistent with annotated information such as ‘white fungal hyphae’ and ‘leaf surface coverage.’
  • Pest Image 1: Descriptions such as ‘Cotton whiteflies producing waxy platelets on leaves’ and its enhanced version accurately identify key pest characteristics such as ‘whiteflies’ and their ‘waxy secretions.’
  • Pest Image 2: Texts such as ‘Aphids fully cover top cotton leaf groups’ or ‘Dense aphid populations blanket…’ accurately depict the locations where aphids congregate and the forms of damage they cause, with complete keyword extraction and natural, fluent expression.
Overall, the model generates semantically complete, logically coherent, and naturally structured descriptive sentences across multiple examples, maintaining high consistency with manually annotated text in terms of core semantics and keywords, demonstrating excellent image–text matching capability and natural language generation performance.

4. Discussion

Although the deep learning model based on the Visual Transformer architecture proposed in this study demonstrates excellent performance in both classification and image captioning tasks, some limitations remain in practical applications, suggesting both constraints and potential directions for improvement.
  • Category Confusion and Potential Mitigation Strategies
In the image classification task, although the CrossViT-18-Dagger-408 model performs exceptionally well overall, confusion matrix analysis still reveals misclassification between certain categories, particularly between wilt-type diseases and bollworm or target spot. This issue may stem from the visual similarity between these diseases or be influenced by factors such as lighting, angle, and background interference during image acquisition. Future work should explore metric learning techniques, such as contrastive or center-loss functions, to explicitly optimize the feature embedding space for better inter-class separation. Additionally, synthesizing hybrid or intermediate disease images through generative models like diffusion models could explicitly teach the model to distinguish these ambiguous cases.
  • Enhancing Caption Richness and Actionability
In the image captioning task, while the model generates text with stable semantic accuracy and structural fluency, some descriptions lack sufficient detail, especially in the early-stage disease features. To bridge this gap, future versions could integrate external agricultural knowledge bases. This would allow the model to generate not just descriptions but also diagnostic conclusions and treatment suggestions. Furthermore, employing a reinforcement learning framework optimized directly for clinical correctness and informativeness, rather than just n-gram overlap, could significantly enhance the practical utility of the generated text.
  • Addressing Data Scarcity and Complexity through Augmentation
The current dataset’s focus on single-condition samples limits the model’s ability to diagnose complex, co-occurring pathologies in the field. To overcome this, a promising direction is the development of a multi-label dataset and the use of synthetic data augmentation. Techniques like StyleGAN or diffusion models can generate realistic images of leaves with multiple diseases.
  • Optimizing the Training Framework for Efficiency and Transferability
The two-stage training strategy, while effective for semantic alignment, introduces computational overhead and limits transferability to other crops. A critical future direction is to investigate unified, end-to-end pretraining paradigms inspired by recent vision–language foundation models. Designing a single foundation model capable of both diagnostic and descriptive tasks would drastically reduce deployment costs. To enable deployment on low-cost agricultural devices, model compression techniques and knowledge distillation into a smaller, single-model architecture are the next essential steps. This would facilitate real-time crop disease management directly in the field.

5. Conclusions

This study proposes a novel visual–language model, CottonCapT6, which integrates a multi-scale visual encoder (CrossViT-18-Dagger-408) and a pretrained text decoder (T5) for cotton disease and pest diagnosis. The model utilizes a two-stage training strategy: first, the visual encoder is trained on a classification task to capture domain-specific features, followed by transfer learning to guide image captioning. The results indicate strong performance across both tasks, demonstrating a balance of accuracy, interpretability, and domain relevance. The CrossViT architecture enabled effective recognition of complex visual patterns such as lesions, edges, and pest clusters. However, confusion between visually similar classes (e.g., wilt-type diseases and bollworm) was observed, highlighting the potential for fine-grained feature improvement. These results validate the benefit of combining classification-aware visual features with language generation. For image captioning, the model achieves competitive performance, generating semantically accurate and fluent descriptions of key visual features.
Beyond technical validation, this work provides a practical framework for plant phenotyping and diagnostics. The immediate practical implication lies in assisting agricultural extension services and agronomists by generating standardized, interpretable reports from field images, reducing reliance on subjective visual assessment, and improving the speed and consistency of disease logging.
While these results are promising, several limitations remain. The two-stage strategy introduces training overhead, and the current dataset limits generalization to complex field scenarios. To transition from laboratory validation to field deployment, a clear roadmap is necessary. The next critical step involves collaborative field trials with agricultural research stations to validate the model’s performance under diverse real-world conditions (e.g., varying lighting, growth stages). Success in these controlled field trials should be followed by developing a lightweight mobile application, leveraging model compression techniques to enable real-time diagnosis on handheld devices.
Future work will focus on improving robustness through multi-label training and advanced data augmentation. A key scalability objective is to adapt the framework to other high-value crops (e.g., wheat, rice) by leveraging transfer learning, which would significantly broaden its impact on global precision agriculture. Enhancing semantic richness through knowledge graphs and optimizing for edge deployment remain crucial for improving practical applicability.
In summary, CottonCapT6 provides a robust and interpretable solution for cotton disease and pest diagnosis, advancing the integration of computer vision and natural language generation in smart agriculture. Its pathway to impact involves validated field trials, development of deployable edge applications, and expansion to a broader set of crops, ultimately contributing to data-driven, sustainable crop management.

Author Contributions

Conceptualization, X.M. and C.Z.; methodology, X.M. and C.Z.; software, C.Z.; validation, C.Z.; formal analysis, C.Z.; investigation, C.Z.; resources, C.Z. and B.B.; data curation, C.Z. and B.B.; writing—original draft preparation, C.Z.; writing—review and editing, X.M., H.Q. and C.Z.; visualization, C.Z.; supervision, X.M. and H.Q.; project administration, X.M.; funding acquisition, X.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2022ZD0115805) and the Provincial Key S&T Program of Xinjiang (2022A02011).

Data Availability Statement

The data are available from the corresponding author upon reasonable request.

Acknowledgments

We are grateful to our colleagues at the School of Computer and Information Engineering, Xinjiang Agricultural University, for their help and input, without which this study would not have been possible.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CottonCapT6: Cross Vision Transformer-18-Dagger-408 and Text-to-Text Transfer Transformer for Cotton Disease and Pest Image Captioning
CrossViT-18-Dagger-408: Cross Vision Transformer-18-Dagger-408
T5: Text-to-Text Transfer Transformer
CIDEr: Consensus-based Image Captioning Evaluation
AI: Artificial Intelligence
CNN-LSTM: Convolutional Neural Network–Long Short-Term Memory
LSTM: Long Short-Term Memory
MRI: Magnetic Resonance Imaging
BLIP: Bootstrapping Language-Image Pretraining
UMLS: Unified Medical Language System
BLEU-4: Bilingual Evaluation Understudy Score with precision—4-gram
VGG16: Visual Geometry Group 16-layer Network
LLaMA-4: Large Language Model Meta AI 4
PromptCap: Prompt-based Captioning
BLIP-DP: Bootstrapping Language-Image Pretraining with Dynamic Prompting
LLM-based: Large Language Model-based
ID: Identification
ViT: Vision Transformer
CLS: Class
MLP: Multi-Layer Perceptron
ROUGE-L: Recall-Oriented Understudy for Gisting Evaluation—Longest Common Subsequence
METEOR: Metric for Evaluation of Translation with Explicit Ordering
ResNet-101: Residual Network-101
EfficientNet-B4: Efficient Network-B4
Inceptionv3: Inception Version 3
MobileNetV2: Mobile Network Version 2
ViT-B/16: Vision Transformer Base with 16 × 16 Patch Size

References

  1. Su, Y.; Wei, X.; Wang, Z.; Gao, L.; Zhang, Z. Cotton production in the Yellow River Basin of China: Reforming cropping systems for ecological, economic stability and sustainable production. Front. Sustain. Food Syst. 2025, 9, 1615566. [Google Scholar] [CrossRef]
  2. Zhu, Y.; Wang, G.; Du, H.; Liu, J.; Yang, Q. The Effect of Agricultural Mechanization Services on the Technical Efficiency of Cotton Production. Agriculture 2025, 15, 1233. [Google Scholar] [CrossRef]
  3. Wang, J.; Tong, J.; Fang, Z. Assessing the drivers of sustained agricultural economic development in China: Agricultural productivity and poverty reduction efficiency. Sustainability 2024, 16, 2073. [Google Scholar] [CrossRef]
  4. Bishshash, P.; Nirob, A.S.; Shikder, H.; Sarower, A.H.; Bhuiyan, T.; Noori, S.R.H. A comprehensive cotton leaf disease dataset for enhanced detection and classification. Data Brief 2024, 57, 110913. [Google Scholar] [CrossRef] [PubMed]
  5. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric. 2018, 147, 70–90. [Google Scholar] [CrossRef]
  6. Tarekegn, A.N.; Ullah, M.; Cheikh, F.A. Deep learning for multi-label learning: A comprehensive survey. arXiv 2024, arXiv:2401.16549. [Google Scholar] [CrossRef]
  7. Tan, G.D.; Chaudhuri, U.; Varela, S.; Ahuja, N.; Leakey, A.D. Machine learning-enabled computer vision for plant phenotyping: A primer on AI/ML and a case study on stomatal patterning. J. Exp. Bot. 2024, 75, 6683–6703. [Google Scholar] [CrossRef] [PubMed]
  8. Juneja, M.; Saini, S.K.; Chanana, C.; Jindal, P. MRI-CropNet for Automated Cropping of Prostate Cancer in Magnetic Resonance Imaging. Wirel. Pers. Commun. 2024, 136, 1183–1210. [Google Scholar] [CrossRef]
  9. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
  10. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
Figure 1. Research framework of CottonCapT6. The pipeline includes three phases: (1) dataset construction via web crawling, filtering, labeling, and augmentation; (2) classification pretraining using CrossViT-18-Dagger-408 to capture domain-specific features of cotton diseases and pests; and (3) training of the CottonCapT6 captioning model using a T5 encoder–decoder structure to generate diagnostic descriptions.
Figure 2. The bar chart presents the number of image samples for each category of cotton diseases and pests, including target spot (400), powdery mildew (500), wilt (400), thrips (300), whitefly (301), aphid (450), and bollworm (450).
Figure 3. Doughnut chart showing the distribution of the samples in CottonDP across training, validation, and test datasets. The inner ring represents the proportion of samples in each dataset (Train, Validation, Test), while the outer ring shows the composition of different categories within each dataset. The size of each segment corresponds to the number of samples.
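For readers who wish to reproduce a comparable partition, the sketch below shows how a stratified train/validation/test split can be produced with scikit-learn. The 70/15/15 ratio, random seed, and variable names are illustrative assumptions, not the exact proportions used for CottonDP.

```python
# Illustrative stratified split; ratios are assumptions, not the CottonDP settings.
from sklearn.model_selection import train_test_split

def split_dataset(image_paths, labels, val_frac=0.15, test_frac=0.15, seed=42):
    # Hold out the test portion first, preserving class proportions (stratify).
    train_val_x, test_x, train_val_y, test_y = train_test_split(
        image_paths, labels, test_size=test_frac, stratify=labels, random_state=seed)
    # Split the remainder into train and validation, again stratified.
    rel_val = val_frac / (1.0 - test_frac)
    train_x, val_x, train_y, val_y = train_test_split(
        train_val_x, train_val_y, test_size=rel_val, stratify=train_val_y, random_state=seed)
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```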
Figure 4. Architecture of the CrossViT-18-Dagger-408 model used for cotton disease and pest classification. The cotton target spot image is processed through a dual-branch structure with small and large patch sizes (S-Branch and L-Branch), followed by separate transformer encoders and a multi-scale fusion via cross-attention. The final classification is obtained by combining outputs through MLP heads.
Figure 5. Detailed architecture of the cross-attention mechanism in CrossViT. The CLS token from the large branch attends to the token sequence of the small branch using a scaled dot-product attention mechanism.
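To make the mechanism in Figure 5 concrete, the following simplified, single-head PyTorch sketch shows how a large-branch CLS token can attend to the small-branch token sequence through scaled dot-product attention. The projection layers and the omission of the branch-dimension adapters are simplifications for illustration, not the exact CrossViT-18-Dagger-408 implementation.

```python
import math
import torch
import torch.nn as nn

class CrossBranchAttention(nn.Module):
    """Single-head sketch: the large-branch CLS token queries the small-branch tokens."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)    # projects the large-branch CLS token (query)
        self.k = nn.Linear(dim, dim)    # projects small-branch tokens (keys)
        self.v = nn.Linear(dim, dim)    # projects small-branch tokens (values)
        self.proj = nn.Linear(dim, dim)

    def forward(self, cls_large, tokens_small):
        # cls_large: (B, 1, D); tokens_small: (B, N, D)
        q = self.q(cls_large)
        k = self.k(tokens_small)
        v = self.v(tokens_small)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))   # (B, 1, N)
        attn = attn.softmax(dim=-1)
        fused_cls = self.proj(attn @ v)                             # (B, 1, D)
        # The fused CLS token is returned to the large branch with a residual connection.
        return cls_large + fused_cls
```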
Figure 6. Comparison of the standard Transformer (a) and T5 encoder–decoder architecture (b). Both models follow a similar encoder–decoder paradigm, but T5 replaces standard attention modules with customized T5Self Attention and T5Cross Attention blocks. T5 also incorporates task-specific layer normalization (Add&T5LayerNorm) and operates in a text-to-text format, improving performance in the image captioning task.
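One concrete way T5 departs from the standard Transformer blocks in Figure 6a is its simplified layer normalization, which rescales activations by their root mean square without mean centering or a bias term. A minimal PyTorch sketch of this T5-style normalization follows; the class and variable names are ours.

```python
import torch
import torch.nn as nn

class T5StyleLayerNorm(nn.Module):
    """RMS-style normalization as used in T5: scale only, no mean subtraction, no bias."""
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        variance = x.pow(2).mean(dim=-1, keepdim=True)  # mean of squares, not true variance
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x
```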
Figure 7. Overview of the proposed CottonCapT6 architecture for cotton pest and disease image captioning. The CottonCapT6 model encodes multi-scale image features from cotton pest imagery using a CrossViT encoder and decodes them into text descriptions with a T5 model.
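As a hedged illustration of the encoder–decoder wiring in Figure 7, the sketch below projects a sequence of visual tokens to the T5 hidden width and feeds it to a T5 decoder through the Hugging Face transformers interface. The model identifier, projection layer, and feature shapes are assumptions for illustration and not the exact CottonCapT6 code.

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

class CaptionHead(nn.Module):
    """Feeds a visual token sequence into a T5 decoder (illustrative sketch)."""
    def __init__(self, visual_dim, t5_name="t5-base"):
        super().__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained(t5_name)
        # Project image-encoder tokens to the T5 hidden width so the decoder can cross-attend to them.
        self.proj = nn.Linear(visual_dim, self.t5.config.d_model)

    def forward(self, visual_tokens, labels=None):
        # visual_tokens: (batch, num_tokens, visual_dim) produced by the image encoder.
        enc = BaseModelOutput(last_hidden_state=self.proj(visual_tokens))
        return self.t5(encoder_outputs=enc, labels=labels)  # returns the LM loss when labels are given

    @torch.no_grad()
    def generate_captions(self, visual_tokens, tokenizer, max_length=40):
        enc = BaseModelOutput(last_hidden_state=self.proj(visual_tokens))
        ids = self.t5.generate(encoder_outputs=enc, max_length=max_length)
        return tokenizer.batch_decode(ids, skip_special_tokens=True)
```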
Figure 8. Comparative visualization of the recognition results of different models for seven types of cotton diseases and pests. As illustrated, the CrossViT_18_dagger_408 model (highlighted for emphasis) achieves markedly higher recognition accuracy than the other benchmark models.
Figure 9. Confusion matrix heatmap of the classification results for cotton disease and pest categories obtained with the CrossViT-18-Dagger-408 model. Rows represent predicted categories and columns represent true categories. Diagonal cells show correctly classified samples, so higher diagonal values indicate better performance, while off-diagonal cells show misclassifications. The color gradient ranges from deep blue (higher sample counts) to white (lower sample counts).
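A heatmap of this kind can be reproduced with scikit-learn and seaborn as sketched below; the transpose puts predicted categories on the rows and true categories on the columns to match the orientation described in Figure 9. Class names and variable names are placeholders.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

CLASSES = ["target spot", "powdery mildew", "wilt", "thrips",
           "whitefly", "aphid", "bollworm"]

def plot_confusion_heatmap(y_true, y_pred):
    # scikit-learn puts true labels on rows; transpose so rows are predictions.
    cm = confusion_matrix(y_true, y_pred, labels=range(len(CLASSES))).T
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=CLASSES, yticklabels=CLASSES)
    plt.xlabel("True category")
    plt.ylabel("Predicted category")
    plt.tight_layout()
    plt.show()
```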
Figure 10. Grouped bar chart illustrating the precision, recall, and F1-score for each cotton disease and pest category.
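The per-category values plotted in Figure 10 can be computed directly from the test predictions, for example with scikit-learn as sketched below; the label ordering is an assumption.

```python
from sklearn.metrics import precision_recall_fscore_support

def per_class_metrics(y_true, y_pred, class_names):
    # Returns one precision/recall/F1 triple per category, in the given label order.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=range(len(class_names)), zero_division=0)
    for name, p, r, f in zip(class_names, precision, recall, f1):
        print(f"{name:16s} precision={p:.2%} recall={r:.2%} f1={f:.2%}")
```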
Figure 11. Training loss curve of the CottonCapT6 model across 4700 iterations.
Table 1. Example samples from the cotton pest and disease paired image–text dataset are shown. The dataset includes images of cotton leaves, each paired with three distinct textual descriptions: (1) the original annotation, (2) a paraphrase-augmented version that introduces variations in sentence structure, and (3) a synonym-replaced version, where key terms are substituted with their synonyms while maintaining the original meaning. These enhancements were applied to improve the model’s ability to generalize across diverse linguistic expressions and enhance robustness, particularly in small-sample scenarios.
Image | Description
Applsci 15 10668 i001 | 1. Cotton leaves have multiple lesions proving target spot. 2. Multiple lesions on cotton leaves indicate target spot. 3. Cotton foliage shows several spots suggesting target spot disease.
Applsci 15 10668 i002 | 1. Cotton leaves show a white powdery coating, diagnosed as powdery mildew. 2. On cotton leaves, a white powdery coating appears, identified as powdery mildew. 3. Cotton foliage exhibits a whitish powder-like layer, diagnosed as powdery mildew.
Applsci 15 10668 i003 | 1. Cotton leaves appear purplish-red, diagnosed as wilt disease. 2. Purplish-red coloration appears on cotton leaves, which is diagnosed as wilt disease. 3. Cotton foliage shows a reddish-purple hue, identified as a symptom of wilt disease.
Applsci 15 10668 i004 | 1. Tiny black insects are congregating inside the cotton flower buds, diagnosed as thrips. 2. Inside the cotton flower buds, small black insects are gathering, identified as thrips. 3. Tiny dark insects were found clustered in the cotton flower, confirmed as thrips.
Applsci 15 10668 i005 | 1. Cotton whiteflies produce waxy platelets on leaves. 2. Waxy platelets are produced by cotton whiteflies on the leaves. 3. Cotton whiteflies secrete waxy particles on the leaves.
Applsci 15 10668 i006 | 1. The green insects are feeding on cotton leaves, thus identified as cotton aphids. 2. On cotton leaves, green insects are feeding, which have been identified as aphids. 3. Green bugs are munching on the cotton leaves, confirmed as cotton aphids.
Applsci 15 10668 i007 | 1. The boll has a small hole with a green insect emerging, possibly a bollworm. 2. From a small hole in the boll, a green insect emerges, possibly a bollworm. 3. The boll shows a tiny opening, and a green larva is coming out, possibly a bollworm.
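The synonym-replacement augmentation illustrated in Table 1 can be approximated with a simple dictionary-based substitution, as in the toy sketch below. The synonym table shown is a small hand-made example and is not the mapping used to build CottonDP.

```python
import random

# Tiny hand-made synonym table; the actual CottonDP mapping is not reproduced here.
SYNONYMS = {
    "leaves": ["foliage"],
    "lesions": ["spots"],
    "tiny": ["small"],
    "coating": ["layer"],
}

def synonym_replace(caption, prob=0.5, seed=None):
    rng = random.Random(seed)
    words = []
    for word in caption.split():
        key = word.lower().strip(".,")
        if key in SYNONYMS and rng.random() < prob:
            words.append(rng.choice(SYNONYMS[key]))
        else:
            words.append(word)
    return " ".join(words)

# Example: synonym_replace("Cotton leaves have multiple lesions proving target spot.", seed=0)
```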
Table 2. Hyperparameter settings for the two experimental stages: image classification pretraining and image captioning training.
Parameters | Image Classification Pretraining | Image Captioning Training
learning rate | 1 × 10^−4 | 3 × 10^−5
optimizer | Adam | Adam
batch size | 16 | 64
epochs | 10 | 10
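Translating Table 2 into code, the two training stages would be configured roughly as follows; the model objects are placeholders, and only the listed hyperparameters are taken from the table.

```python
import torch

def make_stage_optimizers(classifier, captioner):
    # Stage 1: classification pretraining of the CrossViT backbone (batch size 16, 10 epochs).
    cls_optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
    # Stage 2: image-captioning training of CottonCapT6 (batch size 64, 10 epochs).
    cap_optimizer = torch.optim.Adam(captioner.parameters(), lr=3e-5)
    return cls_optimizer, cap_optimizer
```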
Table 3. Comparison results of classification models.
Model | Precision (%) | Recall (%) | F1-Score (%) | Accuracy (%)
ResNet-101 | 81.00 | 78.90 | 79.00 | 80.50
EfficientNet-B4 | 84.00 | 80.00 | 82.00 | 82.14
Inceptionv3 | 88.00 | 87.00 | 87.00 | 86.79
MobileNetV2 | 80.10 | 78.50 | 78.00 | 78.10
VGG16 | 82.00 | 80.00 | 80.00 | 80.86
SwinTransformer | 80.00 | 78.00 | 78.00 | 79.50
ViT-B/16 | 91.00 | 89.00 | 90.00 | 90.60
CrossViT_18_dagger_408 | 92.00 | 91.00 | 91.00 | 91.43
Table 4. Ablation experiment evaluation results.
Model | BLEU1 (%) | BLEU2 (%) | BLEU3 (%) | BLEU4 (%) | ROUGE-L (%) | METEOR (%) | CIDEr (%)
CNN+LSTM | 65.6 | 50.5 | 42.1 | 36.6 | 55.5 | 34.1 | 135.3
CNN+LSTM+Attention | 72.1 | 59.0 | 51.3 | 45.3 | 64.4 | 37.3 | 170.8
CNN+Transformer | 68.5 | 54.0 | 46.5 | 40.5 | 60.0 | 36.5 | 165.0
CNN+T5 | 70.0 | 58.5 | 51.0 | 44.5 | 63.0 | 38.0 | 180.0
CrossViT+Transformer | 72.0 | 60.5 | 53.0 | 47.0 | 66.5 | 41.5 | 195.0
CrossViT+T5 | 72.5 | 61.0 | 53.7 | 48.0 | 67.0 | 41.8 | 196.0
CottonCapT6 | 72.9 | 61.7 | 54.4 | 48.7 | 67.4 | 42.0 | 197.2
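The captioning metrics reported in Table 4 (BLEU-1 through BLEU-4, ROUGE-L, METEOR, and CIDEr) are commonly computed with the pycocoevalcap toolkit; the sketch below shows one way to obtain them from reference and generated captions, though whether CottonCapT6 used this exact package is an assumption.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

def score_captions(references, hypotheses):
    # references / hypotheses: dict mapping an image id to a list of caption strings,
    # with the same set of ids in both dicts.
    scores = {}
    bleu, _ = Bleu(4).compute_score(references, hypotheses)
    scores.update({f"BLEU{i + 1}": s for i, s in enumerate(bleu)})
    scores["ROUGE-L"], _ = Rouge().compute_score(references, hypotheses)
    scores["METEOR"], _ = Meteor().compute_score(references, hypotheses)
    scores["CIDEr"], _ = Cider().compute_score(references, hypotheses)
    return scores
```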
Table 5. The performance of CottonCapT6 on the CottonDP test set, with one image–text pair selected as an example for each class: 1. represents the ground-truth label from the test set; 2. represents the model-generated result.
Image | Results
Applsci 15 10668 i008 | 1. Cotton leaves display target-pattern lesions indicating target spot. 2. Cotton leaves reveal target-pattern lesions indicating target spot.
Applsci 15 10668 i009 | 1. Fuzzy white deposits on young leaves, indicating powdery mildew colonization. 2. Fine white fungal growth accumulates on young leaves, suggesting active powdery mildew colonization.
Applsci 15 10668 i010 | 1. The cotton leaf is showing a large area of purple-red coloration, which has been diagnosed as wilt disease. 2. The cotton leaf exhibits extensive purple-red hues, diagnosed as wilt disease.
Applsci 15 10668 i011 | 1. Tiny black and small insects are on the cotton plant, diagnosed as thrips. 2. The cotton plant is infested with tiny black insects, confirmed as thrips.
Applsci 15 10668 i012 | 1. Cotton whiteflies produce waxy platelets on leaves. 2. Waxy platelets are produced by cotton whiteflies feeding across leaf surfaces.
Applsci 15 10668 i013 | 1. Aphids fully cover the top cotton leaf groups. 2. Dense aphid populations blanket the upper cotton leaves.
Applsci 15 10668 i014 | 1. The green worm is seen feeding on the cotton bolls, which might be bollworm. 2. The green worm feeding on the cotton bolls is likely a bollworm.