RS Transformer: A Two-Stage Region Proposal Using Swin Transformer for Few-Shot Pest Detection in Automated Agricultural Monitoring Systems

Abstract: Agriculture is pivotal in national economies, with pest classification significantly influencing food quality and quantity. In recent years, pest classification methods based on deep learning have made progress. However, these methods face two problems. One is that there are few multi-scale pest detection algorithms, and they often lack effective global information integration and discriminative feature representation. The other is the lack of high-quality agricultural pest datasets, leading to insufficient training samples. To overcome these two limitations, we propose two methods: RS Transformer (a two-stage region proposal network using Swin Transformer) and the Randomly Generated Stable Diffusion Dataset (RGSDD). Firstly, observing that diffusion models can generate high-resolution images, we developed a training strategy called the RGSDD, which generates agricultural pest images that are mixed with real datasets for training. Secondly, RS Transformer uses Swin Transformer as the backbone to enhance the extraction of global features while reducing the computational burden of earlier Transformers. Finally, we added a region proposal network and ROI Align to form a two-stage training mode. Experimental results show that RS Transformer performs better than the other models, and the RGSDD helps to improve the training accuracy of the model. Compared with methods of the same type, RS Transformer achieves an improvement of up to 4.62%.


Introduction
Agriculture directly impacts people's lives and is essential to the development of the global economy. However, pests often cause great losses to crops. Therefore, pest control is necessary to ensure a high agricultural yield [1]. With developments in science and technology, pest detection methods are continually changing [2]. Early detection relied on field diagnosis by agricultural experts, but proper diagnosis is difficult due to the complexity of pest conditions, the lack of qualified staff, and inconsistent experience at the grassroots level. Furthermore, incorrect pest identification by farmers has led to an escalation in pesticide usage, which in turn has bolstered pest resistance [3] and exacerbated the harm inflicted upon the natural environment.
An effective integrated automated pest monitoring system relies on a high-quality algorithm. With the development of image processing technology and deep learning, scholars increasingly use pest image data and deep learning to identify pests, which improves the effectiveness of agricultural pest detection and is also among the first application examples of intelligent diagnosis. Research on the classification and detection of agricultural pests is crucial to help farmers manage crops effectively and take timely measures to reduce the harm caused by pests. Object detection models, which come in one-stage and two-stage varieties, are frequently employed in pest classification and detection. One-stage models such as YOLO [4][5][6] and SSD [7] are renowned for their rapid detection capabilities. In contrast, two-stage models such as Fast R-CNN [8] and Faster R-CNN [9] excel in achieving high accuracy, albeit at a slower processing speed than their one-stage counterparts. The Transformer model [10] has many potential applications in AI. Based on its effectiveness in natural language processing (NLP) [11], recent research has extended the Transformer to computer vision (CV) [12]. In 2021, Swin Transformer [13] was proposed as a universal backbone for CV, achieving state-of-the-art (SOTA) results on multiple dense prediction benchmarks. The differences between language and vision, such as the vast range of visual entity scales, make the transition from language to vision difficult; Swin Transformer, however, handles this problem well. In this paper, we use a Vision Transformer with a shifted window to detect pests.
Currently, two dataset-related issues affect pest detection. The first is the scarcity of high-quality datasets: a dataset covering eight pest species may contain only approximately 600 photos, reflecting the lack of agricultural pest data [14]. The second is the challenge of detecting pests at multiple scales. The size difference between large and microscopic pests can be up to 30-fold. For example, the relative size of the largest pest in the LMPD2020 dataset is 0.9%, while that of the smallest is only 0.03%. When the size differences among target objects are large, it is difficult to achieve high accuracy at all scales simultaneously, and missed detections often occur. Moreover, the Transformer also requires a large dataset for training.
In agriculture, few high-quality pest datasets are available, and some datasets from the internet have poor clarity and inconsistent sizes. In recent years, with the development of AI-generated content (AIGC) technology, an increasing number of large text-to-image generation models have been developed. The diffusion model [15], introduced as a sequence of denoising autoencoders, aims to remove Gaussian noise through continuous application during training on images. A newer diffusion model [16] represents a novel state of the art in deep image generation. In image generation tasks, it outperforms the previous SOTA, the generative adversarial network (GAN) [17], and performs well in a variety of applications, including CV, NLP, waveform signal processing, time series modeling, and adversarial learning. The Denoising Diffusion Probabilistic Model (DDPM) was proposed later [18] and applied to image generation. Then, OpenAI's paper "Diffusion Models Beat GANs on Image Synthesis" [19] made machine-generated images even more realistic than those of GANs. DALL-E 2 [20] allows us to generate a desired image from a text description. To improve the accuracy of pest identification, we can enable models to learn more complex semantic information from training data and complement the agricultural dataset. We propose the Randomly Generated Stable Diffusion Dataset (RGSDD) method to help generate pest images.
We reviewed four years of representative pest detection papers, as shown in Table 1, and counted the algorithms used and the pest species included in the datasets. We found that previous papers neither used Swin Transformer as a backbone network nor used a diffusion model to generate datasets. Overall, the main contributions of this paper are the RS Transformer detection model and the RGSDD data generation strategy.

This study focuses on crops of high economic value; as a result, the selected agricultural pests have small sample sizes. First, we went to the Beizang Village experimental field next to the Daxing Campus of Beijing University of Civil Engineering and Architecture and collected 400 pictures of pests using an iPhone 12 Pro Max, taken at a resolution of 3024 × 4032 pixels. Secondly, we searched for pest images in the IPMImages database [30], the National Bureau of Agricultural Insect Resources (NBAIR), Google, Bing, etc. The dataset has eight pest species as labels: Tetranychus urticae (TU), Bemisia argentifolii (BA), Zeugodacus cucurbitae (ZC), Thrips palmi (TP), Myzus persicae (MP), Spodoptera litura (SL), Spodoptera exigua (SE), and Helicoverpa armigera (HA), as shown in Figure 1.

Dataset Generation
Stable diffusion is a diffusion model that can be used to generate detailed images conditioned on text descriptions.
The diffusion model, which produces samples that fit the data after a finite number of steps, is a parameterized Markov chain trained via variational inference [18]. As shown in Figure 2, the entire diffusion model can be separated into a forward process and a reverse process. Intuitively, the forward diffusion process keeps adding Gaussian noise to the image until it becomes unrecognizable, while the reverse process removes the noise step by step and restores the image. The core sampling formula of the diffusion model is

x_{t−1} = (1/√α_t) (x_t − ((1 − α_t)/√(1 − ᾱ_t)) ε_θ(x_t, t)) + σ_t z,

where ᾱ_t = ∏_{s=1}^{t} α_s is an experimentally set constant that decreases as t increases, ε_θ is the learned noise predictor, and z is standard Gaussian noise drawn from N(0, I).
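As an illustrative sketch of the forward noising and the reverse step in the formula above (not the production Stable Diffusion pipeline), the DDPM process can be written in a few lines of NumPy; the linear beta schedule and the toy image size are assumptions, and `eps_pred` stands in for the learned noise predictor ε_θ:

```python
import numpy as np

# Linear beta schedule (an illustrative assumption); alpha_bar_t decreases as t grows.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def forward_noise(x0, t, rng):
    """q(x_t | x_0): add Gaussian noise to a clean image x0 at step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def reverse_step(xt, t, eps_pred, rng):
    """One DDPM reverse step: remove the predicted noise eps_pred from x_t."""
    coef = (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_pred) / np.sqrt(alphas[t])
    if t > 0:
        sigma = np.sqrt(betas[t])           # sigma_t, here set to sqrt(beta_t)
        return mean + sigma * rng.standard_normal(xt.shape)
    return mean  # no noise is added at the final step

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))            # toy "image"
xt, eps = forward_noise(x0, T - 1, rng)     # fully noised
x_prev = reverse_step(xt, T - 1, eps, rng)  # denoise one step with the true noise
```

In training, ε_θ is a network fitted to predict `eps`; sampling then iterates `reverse_step` from t = T − 1 down to 0.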

The stable diffusion model was trained using the real pest dataset. The images generated by stable diffusion are 299 × 299 pixels, as shown in Figure 4.
To increase the chance of generating pest images, we chose captions that contained any word from the following list: [BA, HA, MP, SE, SL, TP, TU, ZC]. We input keywords and text descriptions of the desired picture into the diffusion model, such as "pest on the tree", "pest on the leaf", "pest chewing on the leaf", "worm chewing on the trunk", "worm swarm", "cornfield", "leaf", and "field". After carefully eliminating the last few false positives, we obtained a dataset of 512 pest images, with 64 high-resolution images for each pest category.


Dataset Enhancement
In this study, the original images were processed using enhancement methods such as rotation, translation, flipping, and noise addition, and the AutoAugment technique [31] was applied to adjust the color of the images. Finally, we obtained 36,504 pest images; the details are shown in Table 2. With the data-enhanced images, we trained RS Transformer. In the first stage, we did not use the generated RGSDD data, training first with real images alone to obtain RS Transformer results on real data. In the second stage, we mixed in the images generated by the RGSDD according to the training ratios in Table 3, and we applied the same method to YOLOv8, DETR, and other models.
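A minimal NumPy sketch of the geometric and noise enhancements named above (AutoAugment's learned color policies are omitted, and the rotation steps, shift range, and noise magnitude are illustrative assumptions):

```python
import numpy as np

def augment(img, rng):
    """Apply the enhancement operations used for the pest dataset:
    rotation (90-degree steps here), translation, flipping, and Gaussian noise."""
    out = []
    out.append(np.rot90(img, k=int(rng.integers(1, 4))))            # rotation
    out.append(np.roll(img, shift=int(rng.integers(-5, 6)), axis=1))  # translation
    out.append(np.fliplr(img))                                      # horizontal flip
    out.append(np.clip(img + rng.normal(0, 10, img.shape), 0, 255))  # additive noise
    return out

rng = np.random.default_rng(42)
img = rng.integers(0, 256, size=(299, 299, 3)).astype(np.float64)
augmented = augment(img, rng)  # each original image yields several variants
```

Applying several such variants per photo is how a few hundred field images grow into the tens of thousands reported in Table 2.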

Framework of the Proposed Method
In this paper, the backbone of R-CNN [32] is replaced by Swin Transformer and applied to pest target detection tasks. Additionally, we propose a novel object detection method called RS Transformer. Our scheme offers several advantages. Firstly, we introduce a new feature extraction method specifically designed around Swin Transformer, which enhances the alignment of global features. This improves localization accuracy while significantly reducing the computational cost of the Transformer through the shifted window model. Secondly, the RS Transformer incorporates essential components such as the RPN, ROI Align, and feature maps, which further enhance its performance and capabilities. Lastly, we propose a new data composition method called the RGSDD. This method trains the stable diffusion model on real images collected beforehand, generates 512 images, and randomly mixes them with the real images at 10%, 20%, 30%, 40%, and 50% of the number of real images. Overall, our approach combines the advantages of Swin Transformer, the novel RS Transformer architecture, and the RGSDD data composition method to achieve improved results in pest target detection.

RS Transformer
RS Transformer is a two-stage model (Figure 5). It first extracts features using Swin Transformer and then generates a series of region proposals.

Swin Transformer Backbone
The Swin Transformer backbone is introduced in Figure 6. Compared to traditional CNN models, it has stronger feature extraction capabilities, incorporates the CNN's local and hierarchical structure, and utilizes attention mechanisms to produce a more interpretable model whose attention distribution can be examined. In the Swin Transformer block, a 2-layer MLP (multi-layer perceptron) with GELU non-linearity follows a shifted-window-based MSA module (W-MSA). Each MSA (multi-head self-attention) module and each MLP has an LN (layer norm) layer applied before it and a residual connection applied after it. Supposing each window contains M × M patches, the computational complexities of a global MSA module and a window-based MSA module on an image of h × w patches are as follows:

Ω(MSA) = 4hwC² + 2(hw)²C, (1)
Ω(W-MSA) = 4hwC² + 2M²hwC, (2)

where C is the channel dimension; the former is quadratic in the number of patches hw, while the latter is linear when M is fixed. Two consecutive Swin Transformer blocks, computed with the shifted window partitioning approach, are denoted as follows:

ẑ^l = W-MSA(LN(z^{l−1})) + z^{l−1},
z^l = MLP(LN(ẑ^l)) + ẑ^l,
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l,
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1},

where ẑ^l and z^l represent the output of the (S)W-MSA module and the MLP of block l, respectively. Swin Transformer constructs hierarchical feature maps and has computational complexity linear in image size. A sample diagram of the hierarchy with a small patch size is shown in Figure 7. It begins with small patches and merges neighboring patches in deeper Transformer layers. Using patch-splitting modules as in ViT, RGB images are divided into non-overlapping patches with a patch size of 4 × 4, making each patch's feature dimension 4 × 4 × 3 = 48. This raw feature is projected to an arbitrary dimension (designated C) by a linear embedding layer.
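The savings promised by Equations (1) and (2) can be checked directly. The sketch below evaluates both complexity formulas; the 56 × 56 patch grid, C = 96, and window size M = 7 are assumed typical Swin-T stage-1 values, not taken from this paper:

```python
def msa_flops(h, w, C):
    """Global multi-head self-attention: quadratic in the number of patches h*w."""
    return 4 * h * w * C**2 + 2 * (h * w)**2 * C

def wmsa_flops(h, w, C, M):
    """Window-based self-attention: linear in h*w for a fixed window size M."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

# Assumed Swin-T stage-1 setting: a 56x56 patch grid, C = 96, window M = 7.
h = w = 56
C, M = 96, 7
ratio = msa_flops(h, w, C) / wmsa_flops(h, w, C, M)
print(f"W-MSA is ~{ratio:.1f}x cheaper than global MSA at this resolution")
```

Because the W-MSA term scales linearly with h·w, doubling the input resolution quadruples its cost rather than multiplying it sixteen-fold, which is what makes the backbone practical for dense prediction.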


RS Transformer Neck: FPN
An FPN (feature pyramid network) is used to achieve a better fusion of feature maps. As illustrated in Figure 8, the purpose of the FPN is to integrate feature maps from the bottom layer to the top layer so as to fully utilize the features extracted at each stage.
The FPN produces a feature pyramid, not just a single feature map. The RPN applied to the pyramid produces many region proposals, and the ROI is cut out according to each region proposal for subsequent classification and regression prediction. We use the following formula to determine the level k from which an ROI of width w and height h should be cut:

k = ⌊k₀ + log₂(√(wh)/299)⌋,

where 299 is the size of the image used for pre-training and k₀ is the level to which an ROI of area w × h = 299 × 299 is mapped. A large-scale ROI is cut from a low-resolution feature map, which is conducive to detecting large targets, while a small-scale ROI is cut from a high-resolution feature map, which is conducive to detecting small targets.
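The level-assignment formula amounts to a one-line function. In this sketch, k₀ = 4 and the level bounds [2, 5] are assumptions borrowed from the original FPN convention, not values stated in this paper:

```python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5):
    """Map an ROI of width w and height h to a pyramid level k.
    299 is the pre-training image size; an ROI of 299x299 maps to level k0."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 299))
    return max(k_min, min(k_max, k))  # clamp to the levels the pyramid actually has

# Large ROIs map to coarse (high) levels, small ROIs to fine (low) levels.
print(fpn_level(299, 299))  # k0
print(fpn_level(64, 64))    # small pest -> high-resolution level
```

An ROI a quarter of the reference side length lands two levels lower, so small pests are cropped from the finest available feature map.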


RS Transformer Head: RPN, ROI Align
To predict the coordinates and scores of each region proposal while extracting features, the RPN adds a regression layer (reg-layer) and a classification layer (cls-layer) on top of Swin Transformer. Figure 9 depicts the RPN's working principle. The RPN centers on a pixel of the last-layer feature map and traverses the feature map with a 3 × 3 sliding window. The pixel points mapped from the center of the sliding window to the original image are the anchor points. Taking each anchor point as the center in the original image, and using 15 preset anchor boxes with 5 different areas (32 × 32, 64 × 64, 128 × 128, 256 × 256, 512 × 512) and 3 distinct aspect ratios (2:1, 1:1, and 1:2), k = 15 original candidate regions are obtained. The RPN sends the candidate regions in the k anchor boxes to the regression layer and the classification layer for boundary regression and classification prediction, respectively. The regression layer predicts the box coordinates (X, Y, W, H), so its output size is 4k; the classification layer predicts target versus background, so its output size is 2k. Each anchor is then screened for boundary overflow and by non-maximum suppression (NMS), ranking scores from largest to smallest to retain the top 1000 or 2000. Finally, the candidate boundaries predicted as background by the classification layer are removed, and those predicted as targets are retained.
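The 5 × 3 anchor layout described above can be sketched as follows; the helper name and the corner-coordinate output format are illustrative choices, not code from the paper:

```python
def make_anchors(cx, cy):
    """Generate the k = 15 RPN anchors at one anchor point (cx, cy):
    5 areas x 3 aspect ratios, returned as (x1, y1, x2, y2) boxes."""
    sizes = [32, 64, 128, 256, 512]      # side lengths, i.e. areas 32^2 ... 512^2
    ratios = [(2, 1), (1, 1), (1, 2)]    # width : height
    anchors = []
    for s in sizes:
        area = s * s
        for rw, rh in ratios:
            w = (area * rw / rh) ** 0.5  # keep the area fixed while changing shape
            h = area / w
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

boxes = make_anchors(150, 150)  # 15 candidate regions at one sliding-window position
```

Each of the 15 boxes preserves its nominal area while varying shape, which is what lets a single anchor point cover both elongated and square pests.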

ROI Align
The function of ROI Pool and ROI Align is to find the feature map region corresponding to each candidate box and then to convert feature maps of different sizes and proportions into a fixed size, so that they can be fed into the subsequent fixed-size network. Mask R-CNN proposed ROI Align [33] as a refinement of ROI Pool. The bilinear interpolation method is used to determine the feature value at each sampling point in the region of interest, which avoids the error caused by quantization and improves the accuracy of box prediction and mask prediction.
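The bilinear interpolation at the heart of ROI Align can be shown in isolation. This is a minimal single-point sketch (real implementations sample several points per output bin); the function name and toy feature map are illustrative:

```python
import numpy as np

def bilinear(feat, y, x):
    """Sample a 2D feature map at a fractional location (y, x) by bilinear
    interpolation -- the operation that lets ROI Align avoid the quantization
    error of ROI Pool, which snaps (y, x) to integer coordinates."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) +
            feat[y1, x1] * dy * dx)

feat = np.arange(16, dtype=float).reshape(4, 4)
v = bilinear(feat, 1.5, 2.5)  # averages the four neighbors feat[1:3, 2:4]
```

Because the sampling location stays fractional, small pest boxes keep sub-pixel accuracy instead of drifting by up to a pixel as under ROI Pool.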


Experimental Setup
Experiments were conducted on the AutoDL platform, which provides low-cost GPU computing power and pre-configured environments that can be rented at any time. For researchers and universities without high-performance GPUs or servers, AutoDL offers a wide range of high-performance GPUs. The experiments were implemented using the PyTorch 1.10.0 framework, Python 3.8, CUDA 11.3, and Nvidia RTX 2080Ti GPUs with 11 GB of memory.


Evaluation Indicators
To evaluate the performance of the proposed model, we used accuracy, precision, recall, average precision (AP), mAP, and the F1 score:

Precision = TP / (TP + FP), (10)
Recall = TP / (TP + FN),
F1 = 2 × Precision × Recall / (Precision + Recall),

where TP indicates true positives, FP false positives, and FN false negatives.
Average precision (AP): the average of the precision values over different recall rates; the higher the precision, the higher the AP.
Recall: the average recall rate at different levels of precision; the higher the recall, the higher the average recall (AR).
mAP: image classification is usually a multi-class problem; the AP of each class is computed as above, and their average is the mAP.
The F1 score combines precision and recall to evaluate the performance of a binary classification model.
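The three counting-based metrics above reduce to a few lines; the helper name and example counts here are illustrative, not results from the paper:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)           # fraction of predictions that are correct
    recall = tp / (tp + fn)              # fraction of ground-truth pests found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Toy example: 90 correct detections, 10 false alarms, 20 missed pests.
p, r, f1 = detection_metrics(tp=90, fp=10, fn=20)
```

AP then averages precision over recall levels per class, and mAP averages AP over the eight pest classes.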

Experimental Results and Analysis
To illustrate the performance of the proposed model, we assessed eight popular deep learning models on the dataset (Table 5), using a fixed image resolution of 299 × 299 pixels. Compared to the other models, our proposed method achieved significant improvements, with an mAP of 90.18%, representing gains of 13.27%, 17.53%, 29.8%, 13.97%, 9.89%, 5.46%, and 4.62% over SSD, Faster R-CNN, YOLOv3, YOLOv4, YOLOv5m, YOLOv8, and DETR, respectively. The proposed method achieved a mean detection time (mDT) of 20.1 ms per image.
To visually analyze the classification results for each pest, we utilized a confusion matrix, as shown in Figure 11. These data were obtained using real images for training. The confusion matrix provides an intuitive representation of the classification performance: rows represent predicted pest categories, columns represent actual pest categories, and the values on the main diagonal represent the classification accuracy for each category. The color on the main diagonal of RS Transformer's confusion matrix is the darkest, indicating the highest values in each row and column. This shows that RS Transformer exhibits excellent classification performance for each type of pest.

The contrast in mAP is visually presented in Figure 12. The mAP of the three compared models exhibits an upward trend during training, albeit with substantial fluctuations. Conversely, our model's mAP follows a more consistent trajectory, stabilizing at 77.73% after approximately 75 epochs. Subsequently, the RS Transformer model attains its peak performance, achieving a maximum mAP of 90.18%. These findings collectively confirm the stability of RS Transformer, its capacity to enhance network performance, and its ability to expedite convergence.

RS Transformer exhibits a robust capacity for discerning similar pests and demonstrates superior overall performance compared to the other models, as detailed in Table 6 (models' mAP) and illustrated in Figure 13. Furthermore, in challenging scenarios such as the TU class, the model maintains a remarkable recognition rate of 90.24%.

The dataset generated using the diffusion model (see Figure 14) was combined with the real data at proportions of 10%, 20%, 30%, 40%, and 50%. These mixed datasets were then used as inputs to the RS Transformer model and rigorously tested; the results are presented in Table 7.
Applying the RGSDD method to RS Transformer, it is evident that upon incorporating 30% generated data, the model attains its peak performance, with a notable increase of 5.53% in mAP. The RGSDD methodology was also applied to the Faster R-CNN, YOLOv5m, YOLOv8, and DETR models. The results of these experiments demonstrate that the RGSDD positively contributes to model performance, as evidenced in Tables 8-11.
These data underscore the practical applicability of the RGSDD, as shown in Figure 15. Specifically, the YOLOv8 model with 30% incorporation yielded a substantial 3.79% improvement in mAP, and the DETR model with 40% incorporation showed a noticeable enhancement of 4.36%. However, when 50% generated data are included, the models' performance declines significantly; this subset of data appears to introduce interference and is, to some extent, treated as noise, adversely affecting model performance. Comparing the mAP, F1 score, and recall of the different networks, RS Transformer remains better than the others even when the RGSDD is used. At the optimal value, its mAP outperforms Faster R-CNN by 9.29% and YOLOv5m by 4.95%.

Figure 16 presents the outcomes achieved by the RS Transformer model integrated with the RGSDD. Notably, the results highlight its exceptional accuracy in effectively identifying multi-scale pests across various species.

Comparison Results Summary
The performance comparison of the proposed method with existing methods on the eight-class pest dataset is shown in Table 12. Setiawan et al. [35] applied a CNN with MobileNetV2 and the Adam optimizer for large-scale pest classification, achieving an accuracy of 82.95% for eight classes of agricultural pests. However, because of the CNN backbone, the ideal effect was not achieved when the scale differences among pest images were large. Liu et al. [36] used a novel Transformer auto-encoder to capture features, benefiting classification accuracy; on eight pest classes with small samples, their method reached an mAP of 85.17%. We can see that models such as Vision Transformer (ViT) that require large training datasets do not work well on datasets containing small targets such as pests: in this case, ViT struggles to capture image features, resulting in inaccurate recognition. At the same time, the field environment is complex, and image quality is uncertain because of factors such as sunlight and region at capture time, which reduces accuracy. To improve the accuracy of the other models, we mixed the pest pictures generated by the RGSDD into the training dataset at a 30% proportion and found that the method of Setiawan et al. [35] improved significantly by 6.40% and that of Liu et al. [36] by 3.06%, which demonstrates the universality and practicability of the RGSDD method. From the experimental results, our proposed combination of RS Transformer and the RGSDD provides good performance in few-shot learning for pest classification.

Discussion
The analysis of the results clearly shows that RS Transformer performed well. Since Swin Transformer was proposed and shown to outperform CNNs, a large number of application algorithms based on it have followed [37][38][39]. However, a common feature of these algorithms is that large datasets are required to train Swin Transformer before it can realize its ability to extract global features. We therefore added an FPN, an RPN, and ROI Align on top of Swin Transformer, which reduces the computational complexity and improves the feature extraction capability. Then, by using the RGSDD method to generate a dataset to assist training, we not only expanded the dataset but also improved the training accuracy of the model. RS Transformer achieved an accuracy up to 9.08% higher than the compared models: 1.41% higher than the general-purpose DETR model and 6.59% higher than YOLOv8. Its superior multi-scale feature extraction capability effectively helps to improve accuracy.
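The two-stage flow described above (backbone features, RPN proposals, then per-region pooling for the detection head) can be sketched schematically. All components below are toy stand-ins with random or trivial weights, not the paper's actual Swin Transformer, FPN, or RPN; only the data flow between the stages is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(image):
    # Stand-in for the Swin Transformer + FPN backbone: a C x H x W feature map.
    return rng.standard_normal((256, 32, 32))

def rpn(features, num_proposals=5):
    # Stand-in RPN: score each spatial location and keep the top ones.
    h, w = features.shape[1:]
    scores = features.mean(axis=0)                      # objectness proxy per location
    ys, xs = np.unravel_index(np.argsort(scores, axis=None)[-num_proposals:], (h, w))
    # Each proposal is an (x1, y1, x2, y2) box around a high-scoring location.
    return np.stack([xs - 2, ys - 2, xs + 2, ys + 2], axis=1).clip(0, 31)

def roi_pool(features, boxes):
    # Crude stand-in for ROI Align: average-pool each proposal's crop to one vector.
    pooled = []
    for x1, y1, x2, y2 in boxes:
        crop = features[:, y1:y2 + 1, x1:x2 + 1]
        pooled.append(crop.mean(axis=(1, 2)))
    return np.stack(pooled)

feats = backbone(None)
proposals = rpn(feats)              # stage 1: region proposals
roi_feats = roi_pool(feats, proposals)  # stage 2 input: one 256-d vector per proposal
# roi_feats.shape == (5, 256); these vectors feed the classification/regression head.
```

The key property of the two-stage design is visible here: the expensive backbone runs once per image, while the head runs once per proposal on a small fixed-size feature.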
In a two-stage model like Dong's [40], ResNet-50 was used as the backbone. Even though the model was improved and deep convolutional neural networks (DCNNs) were used, it still failed to achieve ideal results at small scales, with an mAP of only 67.9%. Jiao [22] used VGG-16 as the backbone and trained on a large dataset of about 25.4k images, yet obtained an mAP of only 56.40%. Even with large training datasets, the algorithms proposed by these authors still fall short of practical application requirements: on the one hand, the pest scale is small; on the other hand, the feature extraction ability of the CNN is limited. In deep learning, it is difficult to claim that any particular backbone or model has an absolute advantage in an application field, but in our experiments we found that RS Transformer does have certain advantages.
Before this study, there was no research on agricultural pest identification based on AIGC. For the first time, we used a diffusion model for agricultural pest training and image generation and achieved unexpectedly good results. After adding 30% generated images, RS Transformer; YOLOv3, 4, 5, and 8; and DETR all improved, by up to 8.93%. Such high-resolution generated images are less noisy, more conducive to model training, and help the model quickly locate and extract effective features.
In general, the quality and size of the dataset, the choice of improvement strategy, and the underlying model architecture all have important effects on detection accuracy. Multi-stage algorithms are becoming faster and lighter while ensuring accuracy, while single-stage algorithms are improving detection accuracy while retaining their advantages in speed and model size. Achieving higher performance and striking a balance among accuracy, speed, and model size are the current trends.

Conclusions
Swin Transformer, introduced here as the foundational network for pest detection, represents a pioneering contribution. Building upon it and the inherent strengths of the R-CNN framework, we developed RS Transformer. Furthermore, we employed a diffusion model to create a novel pest dataset, accompanied by an innovative training approach tailored to the Randomly Generated Stable Diffusion Dataset (RGSDD): the judicious fusion of synthetic data generated through the RGSDD with real data, calibrated as a percentage of the total dataset. Our study comprehensively compared the performance of RS Transformer and the RGSDD against established models including SSD, Faster R-CNN, YOLOv3, YOLOv4, YOLOv5m, YOLOv8, and DETR. The experimental results clearly demonstrate the superiority of RS Transformer and the efficacy of the RGSDD, surpassing prevailing benchmarks. Importantly, our method achieves a good balance between accuracy and network characteristics. These findings have substantial implications for future ecological informatics research, offering fresh insights into ecological pest and disease control. The presented approach promises to advance the state of the art and contribute to more effective ecological management strategies.
RS Transformer can be used not only for agricultural pest detection, but also for multi-scale target detection tasks in complex environments such as transportation, medicine, and industrial equipment. In addition, the RGSDD, an image generation method based on a diffusion model, is helpful for expanding datasets and improving accuracy. We hope to undertake further research based on the methods in this paper in the future.

Figure 2.
Figure 2. Diffusion processes. The overall structure of the diffusion model is shown in Figure 3. It contains three models. The first is the CLIP model (Contrastive Language-Image Pre-Training), a text encoder that converts text into vectors as input. The image is then generated using the diffusion model. This is performed in the latent space of the compressed image, so the input and output of the diffusion model are the image features of the latent space, not the pixels of the image itself. During the training of the latent diffusion model, an encoder is used to obtain the latents of the picture training set, which are used in the forward diffusion process (each step adds more noise to the latent representation). At inference time, the decoder part of the VAE (Variational Auto-Encoder) converts the denoised latents generated by the reverse diffusion process back into an image.
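The forward diffusion step described above can be sampled in closed form: at step t, z_t = sqrt(ᾱ_t)·z_0 + sqrt(1 − ᾱ_t)·ε with ε ~ N(0, I), where ᾱ_t is the cumulative product of (1 − β). A minimal sketch follows; the linear beta schedule and the toy latent shape are assumptions for illustration, not the schedule used by Stable Diffusion.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)        # cumulative signal-retention factor

def q_sample(z0, t):
    """Sample z_t from q(z_t | z_0) for a latent z0 at integer step t."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps

z0 = rng.standard_normal((4, 8, 8))         # toy VAE latent
z_early, z_late = q_sample(z0, 10), q_sample(z0, 900)

# Later steps retain almost none of the signal: the correlation with z0 shrinks.
corr_early = np.corrcoef(z0.ravel(), z_early.ravel())[0, 1]
corr_late = np.corrcoef(z0.ravel(), z_late.ravel())[0, 1]
```

The training objective of the denoiser is then to predict ε from z_t and t; the reverse process inverts these steps before the VAE decoder maps the final latent back to pixels.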

Figure 3 .
Figure 3. The framework of the diffusion model.

Figure 5.
Figure 5. Structure diagram of RS Transformer.

2.3.1. Swin Transformer Backbone
The Swin Transformer backbone is introduced in Figure 6. Compared to traditional CNN models, it has stronger feature extraction capabilities, incorporates the CNN's local and hierarchical structure, and utilizes attention mechanisms, producing a more interpretable model whose attention distribution can be examined.

Figure 7 .
Figure 7. Sample diagram of a hierarchy with a small patch size.

Figure 9.
Figure 9. RPN working principle diagram.

ROI Align avoids the error caused by the quantization operation and improves the accuracy of box prediction and mask prediction. The ROI Align algorithm's primary steps are as follows: (1) Each candidate region is traversed on the feature map, keeping the floating-point boundary unquantized. (2) As shown in Figure 10, the candidate region is evenly divided into k × k bins, and the edges of each bin retain floating-point numbers without quantization. (3) For each bin, 2 × 2 sample points are taken, and bilinear interpolation over each sampling point's four neighboring pixels is used to compute its value. (4) Finally, the sampled values in each bin are maximized to obtain the value of the bin.
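Steps (1)-(4) above can be sketched directly. This is a minimal NumPy illustration of the algorithm as listed (float boxes, k × k bins, 2 × 2 bilinear samples per bin, max per bin), not the optimized implementation used in practice; the regular sample-grid placement inside each bin is an assumption.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a C x H x W feature map at float (y, x)."""
    h, w = feat.shape[1:]
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * feat[:, y0, x0] + (1 - dy) * dx * feat[:, y0, x1]
            + dy * (1 - dx) * feat[:, y1, x0] + dy * dx * feat[:, y1, x1])

def roi_align(feat, box, k=2, samples=2):
    """Steps (1)-(4): keep float box coords, split into k x k bins, take
    samples x samples bilinear samples per bin, then take the max per bin."""
    x1, y1, x2, y2 = box                          # (1) floating point, never quantized
    bin_h, bin_w = (y2 - y1) / k, (x2 - x1) / k   # (2) k x k bins with float edges
    out = np.empty((feat.shape[0], k, k))
    for i in range(k):
        for j in range(k):
            vals = []
            for si in range(samples):             # (3) regular sample grid in the bin
                for sj in range(samples):
                    y = y1 + (i + (si + 0.5) / samples) * bin_h
                    x = x1 + (j + (sj + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, y, x))
            out[:, i, j] = np.max(vals, axis=0)   # (4) maximize within the bin
    return out

feat = np.arange(36, dtype=float).reshape(1, 6, 6)  # feat[0, r, c] = 6r + c
pooled = roi_align(feat, (0.5, 0.5, 4.5, 4.5))      # pooled.shape == (1, 2, 2)
```

Because no coordinate is rounded at any stage, the pooled features track sub-pixel box positions, which is exactly the quantization error that ROI Pooling suffers from and ROI Align removes.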

Figure 13 .
Figure 13. Comparison of mAPs to identify similar pests.

Table 1 .
Summary of pest detection algorithms and their accuracy.

Table 2 .
Details regarding the number of images in the dataset, including the generated dataset, real data, and datasets from the internet.

Table 3 .
Details regarding the number of images using the RGSDD method.

Table 5 .
Comparison of different indexes.

Table 6 .
Comparison of different mAP indexes.

Table 12 .
Related work and accuracy results (%) summary.