Pest Identification of IP102 by YOLOv5 Embedded with a Novel Lightweight Module

Abstract: The development of the agricultural economy is hindered by various pest-related problems. Most pest detection studies focus on only a single pest category, which is not suitable for practical application scenarios. This paper presents a deep learning algorithm based on YOLOv5 that aims to assist agricultural workers in efficiently diagnosing information related to 102 types of pests. To achieve this, we propose a new lightweight convolutional module called C3M, inspired by the MobileNetV3 network. Compared to the original convolution module C3, C3M occupies less computing memory and yields a faster inference speed, with detection precision improved by 4.6%. In addition, the GAM (Global Attention Mechanism) is introduced into the neck of YOLOv5, which further improves the detection capability of the model. The experimental results indicate that the C3M-YOLO algorithm performs better than YOLOv5 on IP102, a public dataset consisting of 102 pest species. Specifically, the detection precision P is 2.4% higher than that of the original model, mAP 0.75 increased by 1.7%, and the F1-score improved by 1.8%. Furthermore, the mAP 0.5 and mAP 0.75 of the C3M-YOLO algorithm are higher than those of the YOLOX detection model by 5.1% and 6.2%, respectively.


Introduction
The world's population is vast and continues to grow. Thus, people's demand for food is constantly evolving with the times [1]. The management of pest-related issues associated with food crops has always been a significant concern in the field of agriculture [2]. Therefore, how to adopt effective pest control methods to raise crop yields and reduce losses in the agricultural economy is an issue of essential concern in the industry [3,4]. In the field of computer vision, experimental studies on pest detection, ranging from machine learning to deep learning techniques, will be introduced in Section 2.
The main work of the paper is illustrated in Figure 1. First, we performed data augmentation preprocessing on the IP102 dataset and then used GAM to obtain feature information processed by the YOLOv5 backbone. These features were further extracted by the neck network with the added C3M module. Finally, the head network produced detection results with anchor boxes at three scales.
In summary, the work presented in this paper has three aspects:
• The model training is based on the IP102 dataset [5], with a total of nearly 19,000 images of agricultural pests, covering 102 pest species of 8 crops (for example, rice, corn, and wheat);
• We adopt the YOLOv5-6.0 version [6] as the baseline, integrate the lightweight convolutional structure idea proposed in MobileNetV3 [7], and propose a new module, C3M, which has a faster calculation speed and higher precision than the C3 module;
• We introduce the GAM attention mechanism [8] to enlarge the range of the receptive fields, so that the model receives more image feature points extracted from its backbone. The experiments eventually confirmed that the improved algorithm has a better detection effect on pest images than YOLOv5.

Machine Learning
In the past, agricultural experts had to manually inspect pests, which was time-consuming and inefficient. However, the development of artificial intelligence technology has provided great convenience for identifying pests in agriculture. The process of object detection tasks based on machine learning can be divided into three basic steps: data acquisition, data preprocessing, and algorithm model classification [9]. Among agricultural pest prediction methods, a real-time judgement system can be constructed using Gaussian Naive Bayes and Fast Association Rule Mining algorithms to aid farmers in identifying pests [10]. In addition, other research experiments for detecting pests include the preliminary extraction of the size, color, and texture characteristics of insects through HOG (Histogram of Oriented Gradients) [11] or chromatic aberration denoising extraction [12]; SVMs (Support Vector Machines) are then used as the training model to learn these features for the identification of pests. Machine learning-related detection experiments were carried out for common diseases of wheat plants, furan methrapyran blight, and Fusarium head blight. Multiple linear regression, ridge regression, and random forest regression were separately combined with neural network technology to obtain identification models for detecting those specific agricultural diseases [13][14][15], which can provide an auxiliary basis for agricultural workers to quickly and effectively diagnose crop diseases and pests.

Deep Learning
Currently, deep learning methods, which build on machine learning, demonstrate exceptional efficacy in pest detection. Compared to classical machine learning, deep learning is more robust and does not rely heavily on the manual processing of image features, resulting in better data-fitting ability. He Yong et al. utilized the SSD (Single Shot MultiBox Detector) deep learning detection model with an inception layer to successfully detect rapeseed pests. A Dropout layer was added to the proposed network to balance the performance metrics between the precision and time complexity of identification, preventing overfitting of the model to the image data [16]. In the end, 12 typical rapeseed pests were tested under different lighting environments and backgrounds, and the experiment obtained a 77.14% mAP. Furthermore, PestNet [17] does not utilize a fully connected layer; instead, it uses position-sensitive score mapping to achieve a 75.46% mAP on a dataset of 16 pests. It is quite a breakthrough that Deep-PestNet has a 100% accuracy rate in identifying major pests, such as aphids, nocturnal moths, and bollworms [18].

YOLOv5
The above experiments share a common problem: the sample variety or volume of the pest datasets used is insufficient, so in actual deployment there is an obvious disadvantage that the identification accuracy is not high, and some types of pests cannot be identified in reality. The SSD described above is a typical one-stage detection model. In the one-stage architecture, the YOLO series [19][20][21] occupies an indispensable position, and this article chooses the YOLOv5 deep learning model as the basic model. The YOLOv5 model can run detection in real time in a GPU environment, has the advantages of being open source and convenient to use, and supports multi-scale prediction. The backbone adopts the CSPDarknet53 [22] lightweight network structure, which to a certain extent reduces the amount of calculation and the memory over-occupation of the model. The backbone is mainly composed of a standard convolution Conv module, a C3 module, and an SPPF module. The neck of the model applies an FPN (Feature Pyramid Network) [23]. The FPN fuses different levels of image features into richer multi-scale features through upsampling and downsampling layers, which improves the precision of the final detection. The head has three scale detection fields of view, which are responsible for processing the different-scale features produced by the CSPDarknet53 backbone.

Materials and Methods
In this section, we first explain a data augmentation method, followed by a detailed description of the architecture of the YOLOv5 model and its component blocks, as well as the lightweight convolution module we propose. Finally, we list some common model evaluation indicators for object detection as formulas.

Data Augmentation
The Mosaic data augmentation algorithm [24] can effectively improve the diversity of datasets and obtain image data with more semantic features, thereby improving the detection accuracy of the model. The Mosaic pipeline includes Mixup, Cutout, and CutMix operations, among others. Mixup mixes two random images proportionally, and the final classification result is distributed in the same proportion. Cutout randomly fills part of an image with 0-pixel values, with no change in the classification result. CutMix randomly uses another image from the same dataset to fill part of the original image, and the classification result is weighted by the area proportion. The specific implementation details are shown in Figure 2. During model training, four images are randomly selected for Mosaic data enhancement, so a single composite image carries more target feature information and the model can learn richer semantic features as a result.
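The three operations just described can be sketched in a few lines of NumPy (single-channel images and hard-coded boxes, purely for illustration; function names are our own):

```python
import numpy as np

def mixup(img_a, img_b, lam=0.7):
    """Blend two images proportionally; labels mix with the same weights."""
    return lam * img_a + (1.0 - lam) * img_b

def cutmix(img_a, img_b, box):
    """Paste a rectangular patch of img_b into img_a; the label weight for
    img_b's class equals the area ratio of the patch."""
    x1, y1, x2, y2 = box
    out = img_a.copy()
    out[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]
    area = (x2 - x1) * (y2 - y1) / (img_a.shape[0] * img_a.shape[1])
    return out, area

def cutout(img, box):
    """Zero-fill a rectangular region; the label is unchanged."""
    x1, y1, x2, y2 = box
    out = img.copy()
    out[y1:y2, x1:x2] = 0
    return out

a = np.ones((8, 8), dtype=np.float32)   # stand-in image of class A
b = np.zeros((8, 8), dtype=np.float32)  # stand-in image of class B
mixed = mixup(a, b, lam=0.75)
cm, w = cutmix(a, b, (0, 0, 4, 4))
co = cutout(a, (0, 0, 4, 4))
print(mixed[0, 0], w)  # 0.75, 0.25
```

Mosaic then tiles four such augmented images into one training sample, which is why a single composite image carries more targets.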

YOLOv5 Network Model
As a typical one-stage detector, YOLOv5 is a lightweight detection architecture designed to achieve real-time detection while meeting the conditions of high detection accuracy and fast inference speed. YOLOv5 has 5 models of different magnitudes, which can be sorted from smallest to largest: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. It is convenient to modify the magnitude of the model by simply adjusting the scale factors of the width and depth in the model configuration file. YOLOv5 iterates rapidly; in this article, we use version 6.0 as the baseline, and the overall architecture is shown in Figure 3. The model can be divided into three parts: the backbone, neck, and head, and the basic modules in YOLOv5, such as Conv, C3, and SPPF, are introduced as follows.
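The width and depth scale factors work roughly as follows (a sketch of the gain logic used by the YOLOv5 configuration files; the 0.50/0.33 values are the published YOLOv5s multiples):

```python
import math

def make_divisible(x, divisor=8):
    """Round a channel count up to the nearest multiple of the divisor."""
    return math.ceil(x / divisor) * divisor

def scale_channels(c, width_multiple):
    """Output channels of a layer after applying the width gain."""
    return make_divisible(c * width_multiple)

def scale_repeats(n, depth_multiple):
    """Repeat count of a stacked block after the depth gain (at least 1)."""
    return max(round(n * depth_multiple), 1) if n > 1 else n

# YOLOv5s uses depth_multiple = 0.33 and width_multiple = 0.50
print(scale_channels(1024, 0.50))  # 512
print(scale_repeats(9, 0.33))      # 3
```

Changing only these two numbers in the YAML file is what turns the same layer list into YOLOv5n through YOLOv5x.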

C3
In the C3 module shown in Figure 5, the input is first processed by the first Conv; then the features from the second Conv and the bottleneck module are concatenated together, and finally, the output is obtained by the third Conv.
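A minimal PyTorch sketch of the Conv/Bottleneck/C3 structure just described (simplified relative to the official Ultralytics implementation, e.g. fixed padding and a fixed 0.5 expansion):

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Standard YOLOv5 convolution block: Conv2d + BatchNorm + SiLU."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """1x1 then 3x3 Conv, with an optional residual shortcut."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = Conv(c, c, 1)
        self.cv2 = Conv(c, c, 3)
        self.add = shortcut
    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3(nn.Module):
    """Two parallel 1x1 branches; one passes through n Bottlenecks,
    the results are concatenated and fused by a third Conv."""
    def __init__(self, c1, c2, n=1):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = Conv(c1, c_, 1)
        self.cv2 = Conv(c1, c_, 1)
        self.cv3 = Conv(2 * c_, c2, 1)
        self.m = nn.Sequential(*(Bottleneck(c_) for _ in range(n)))
    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

x = torch.randn(1, 64, 32, 32)
y = C3(64, 128, n=2)(x)
print(y.shape)  # torch.Size([1, 128, 32, 32])
```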

Conv
The Conv block combines a two-dimensional convolution layer, a batch normalization layer, and the SiLU activation function into a new standard convolutional Conv module, as shown in Figure 4.


SPPF
SPPF is an upgraded version of SPP, which changes the convolution kernel sizes of the three max-pooling layers from 5 × 5, 9 × 9, and 13 × 13 to a uniform 5 × 5 size, improving the ability of the pooling part to handle features. SPPF is more efficient than SPP at feature processing, and its concrete architecture is described in Figure 6.
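The kernel-size claim can be checked directly: two chained 5 × 5 stride-1 max-pools cover a 9 × 9 window, and three cover 13 × 13, so SPPF reuses one small pool where SPP ran three large ones in parallel. A minimal PyTorch sketch (plain 1 × 1 convolutions, no BN or activation):

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Three chained 5x5 max-pools: the 2nd output equals a 9x9 pool and
    the 3rd a 13x13 pool, so one kernel size replaces SPP's 5/9/13 set."""
    def __init__(self, c1, c2):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = nn.Conv2d(c1, c_, 1)
        self.cv2 = nn.Conv2d(c_ * 4, c2, 1)
        self.m = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
    def forward(self, x):
        x = self.cv1(x)
        y1 = self.m(x)
        y2 = self.m(y1)
        y3 = self.m(y2)
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))

x = torch.randn(1, 256, 20, 20)
y = SPPF(256, 256)(x)
print(y.shape)  # torch.Size([1, 256, 20, 20])

# Equivalence check: two chained k=5 pools equal a single k=9 pool
p5 = nn.MaxPool2d(5, 1, 2)
p9 = nn.MaxPool2d(9, 1, 4)
t = torch.randn(1, 1, 16, 16)
eq = torch.allclose(p5(p5(t)), p9(t))
print(eq)  # True
```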

C3M-YOLO Model
The architecture of the C3M-YOLO model, an improvement on YOLOv5, is shown in Figure 7. We replaced the first C3 module of the original neck with a C3M module. This allows the model to process features in a more flexible and diverse manner, resulting in improved detection performance. Additionally, to enable the network to recognize a wider field of view and capture more feature information from the backbone, we introduced the GAM attention mechanism.


C3M
The C3M architecture mainly relies on the convolution structure of the C3 module. We optimized the convolution block structure of the bottleneck section, and Figure 8 shows the new bottleneck convolution structure, called MBottleneck, which processes image features faster and more efficiently. The MBottleneck also draws on the convolution characteristics of MobileNetV3 to construct a new convolution module, MConv, whose detailed implementation is described later. The MBottleneck introduces a Boolean variable called "Shortcut" to determine whether to combine the input with the feature obtained after several MConv convolutions, which allows deep network learning while avoiding the model degradation problem.

MConv
Figure 9 shows the architecture of MConv, which uses the "Identity" parameter to indicate whether the input and hidden layers of MConv are equal in width. If they are equal, the input is first processed by a depthwise separable convolution and then by an SE (Squeeze-and-Excitation) layer with an attention mechanism for feature processing; finally, the result is obtained through a pointwise convolution. If the numbers of input and hidden channels differ, the features pass through four modules in sequence: pointwise, depthwise, SELayer, and pointwise. Both paths have a residual parameter that determines whether to output features using the residual method.


Pointwise and depthwise convolutions are a pair of lightweight convolution methods. The former only changes the number of channels of the input feature map without changing its spatial size. The latter does the opposite, transforming the spatial size of the feature map without changing the number of channels. The SELayer enhances the sensitivity of the model to features in different channels. By fusing pointwise convolution, depthwise convolution, and the SELayer, the features obtained from these three dimensions are richer and more representative. Thus, MConv is more flexible than YOLOv5's standard convolutions in processing image features.

GAM
The GAM attention mechanism effectively amplifies cross-dimensional receptive-field feature information and stably improves performance across different deep learning network architectures by combining the strengths of the CA (Channel Attention) and SA (Spatial Attention) mechanisms. Figure 10 shows the implementation principle of GAM: the input features are processed by CA and then SA, with element-wise multiplication at each step. GAM has better data scalability and robustness than other common attention mechanisms, such as CBAM [25], ABN [26], and TAM [27]. Thus, it is added to the backbone to reinforce the scalability of the algorithm for obtaining more specific pest image features.
The GAM implementation is given by Equations (1) and (2), where $F_{input}$ represents the input image features, $F_{output}$ represents the output image features, and $F'$ is the intermediate feature. $M_c$ and $M_s$ are the attention functions of CA and SA:

$$F' = M_c(F_{input}) \otimes F_{input} \quad (1)$$
$$F_{output} = M_s(F') \otimes F' \quad (2)$$
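Read together with Figure 10, this two-step computation can be sketched in PyTorch as follows (a simplification: the published GAM uses an MLP across channels for $M_c$ and two 7 × 7 convolutions for $M_s$; the reduction rate here is an assumption):

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    def __init__(self, c, rate=4):
        super().__init__()
        # Mc: channel attention, an MLP applied along the channel axis
        self.channel = nn.Sequential(
            nn.Linear(c, c // rate), nn.ReLU(), nn.Linear(c // rate, c))
        # Ms: spatial attention, two 7x7 convolutions
        self.spatial = nn.Sequential(
            nn.Conv2d(c, c // rate, 7, padding=3), nn.BatchNorm2d(c // rate),
            nn.ReLU(), nn.Conv2d(c // rate, c, 7, padding=3),
            nn.BatchNorm2d(c))
    def forward(self, x):
        # F' = Mc(F_input) (x) F_input
        att = self.channel(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        f = x * torch.sigmoid(att)
        # F_output = Ms(F') (x) F'
        return f * torch.sigmoid(self.spatial(f))

x = torch.randn(2, 16, 8, 8)
y = GAM(16)(x)
print(y.shape)  # torch.Size([2, 16, 8, 8])
```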

Evaluation Metrics
The object detection evaluation metrics used in this paper include P (Precision), R (Recall), the F1-score, and mAP (mean Average Precision). The F1-score is more rigorous than P and R alone and better reflects the detection performance of the model. The AP (Average Precision) is the area enclosed by the P-R curve, which is formed with P as the vertical axis and R as the horizontal axis. The mean of the AP over all classes in the dataset gives the mAP metric, which is a particularly important metric in object detection. In this paper, the detection ability of the model is tested with three ranges of mAP metrics: mAP 0.5 is the value when the IoU threshold is set at 0.5, mAP 0.75 is the average of the AP calculated under different IoU thresholds (0.5-0.75, stride 0.05), and mAP 0.5-0.95 is the average of the AP with IoU thresholds 0.5-0.95 (stride 0.05). The detailed formulas for these metrics are shown in (3)-(7).
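The formulas referenced as (3)-(7) are the standard definitions, consistent with the description above (here N is the number of classes):

```latex
P = \frac{TP}{TP + FP} \quad (3)
\qquad
R = \frac{TP}{TP + FN} \quad (4)

F1 = \frac{2 \times P \times R}{P + R} \quad (5)

AP = \int_0^1 P(R)\, dR \quad (6)
\qquad
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \quad (7)
```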

Representations of TP, TN, FP, and FN are described in Table 1. "True" in the prediction situation means that the model's judgment is correct: if the original image is indeed an image of this category, the case is recorded as TP, and if it is not, as TN. "False" indicates that the model misjudged and its prediction does not match reality.
Table 1.Determination of the relationship between the predicted situation and the real situation.

            True    False
Positive    TP      FP
Negative    TN      FN
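A small worked example of the quantities behind these metrics (the counts and box coordinates are illustrative):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def precision_recall_f1(tp, fp, fn):
    """P, R, and F1 from the counts in Table 1 (TN is not used here)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# a detection shifted by half a box width against its ground truth
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
print(overlap)  # ~0.333: this detection fails an IoU = 0.5 threshold

# e.g. 80 correct detections, 20 false alarms, 20 missed pests
p, r, f1 = precision_recall_f1(80, 20, 20)
print(p, r, f1)  # 0.8 0.8 0.8
```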

Results and Discussion
In this section, we introduce the experimental environment for the model training and the distribution of the labels in the dataset samples.We also present demonstration experiments for our proposed modules, which include ablation experiments as well as comparative experiments.

Experiment Environment Configuration
The model is implemented with the PyTorch deep learning framework in an Anaconda environment. The training environment is an Ubuntu 20.04 system with an RTX 3090 GPU; the test environment is a Windows 10 system with an RTX 3060 GPU. The GPU is used to accelerate the training process of the model.

Label Distribution of the Training Set
The training and testing data ratio of the IP102 pest detection dataset is 7:3, and the dataset can be obtained from https://github.com/xpwu95/IP102, accessed on 20 June 2019. Figure 11 illustrates all the ground-truth bounding boxes of pests in the training set. Figure 11a shows the number of instances of each pest type. Due to the uneven distribution of the number of images per pest type, it is difficult for the model to accurately identify all types of pests, which undoubtedly increases the difficulty of model training. Figure 11b shows the distribution of all the ground-truth bounding boxes of the pest images in the training set, which we obtained using the K-means clustering algorithm, including the distribution range of the center point coordinates (x, y) and the bounding box widths and heights. According to the color depth, where darker areas indicate a higher concentration, we concluded that most of the pest targets to be detected in the training set are located in the center of the original image. Moreover, the scatter plot with width as the x-axis and height as the y-axis describes their correlation: most of the bounding boxes' widths and heights lie above the diagonal line, showing a proportional relationship.
Based on the distribution of these bounding box sizes, the model selects three sets of initial anchor values that are closest to the size values in the dataset, corresponding to the three scale detection heads of YOLOv5. By using these three sets of anchor values to detect pests of different sizes, only slight adjustments to the bounding box size are needed, which effectively reduces the bounding box regression loss and improves the detection efficiency of the model.
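The clustering step can be sketched as plain k-means over (width, height) pairs (a simplification: YOLOv5's own anchor tool additionally refines the centroids with a genetic algorithm, and it clusters by IoU rather than Euclidean distance; the synthetic data here is illustrative):

```python
import numpy as np

def kmeans_anchors(wh, k=3, iters=30):
    """Plain k-means on (width, height) pairs; centroids become anchors.
    Deterministic init: samples evenly spaced along the sorted widths."""
    order = np.argsort(wh[:, 0])
    centers = wh[order[np.linspace(0, len(wh) - 1, k).astype(int)]].astype(float)
    for _ in range(iters):
        dists = ((wh[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    return centers[np.argsort(centers[:, 0])]

# synthetic ground-truth boxes in three size groups (small / medium / large)
rng = np.random.default_rng(0)
wh = np.vstack([rng.normal(loc, 3.0, size=(100, 2))
                for loc in ((20, 20), (60, 60), (150, 150))])
anchors = kmeans_anchors(wh, k=3)
print(anchors.round(1))  # roughly [[20, 20], [60, 60], [150, 150]]
```

Each of the three centroid groups is then assigned to the detection head whose stride matches that object scale.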

Experimental Hyperparameter Setting
Table 2 lists the specific experimental parameters used to train the network model proposed in this paper. We set the number of epochs to 300 and the batch size to 16, and resized the IP102 input images to 640 × 640. We used SGD as the optimizer with an initial learning rate of 0.01; the momentum and weight decay were set to 0.937 and 0.0005, respectively. This optimizer effectively accelerated the model's training process and achieved optimal detection performance.
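A minimal sketch of these Table 2 optimizer settings in PyTorch. The one-layer model here is a placeholder for illustration; the actual network is the modified YOLOv5.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, kernel_size=3)  # stand-in for the detection network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,              # initial learning rate from Table 2
    momentum=0.937,       # momentum from Table 2
    weight_decay=0.0005,  # weight decay from Table 2
)
print(optimizer.param_groups[0]["lr"])
```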


Ablation Experiments
The ablation experiment presented in Table 3 demonstrates the changes in the detection metrics (mAP, P, and F1-score) of the original model after introducing the new modules. The results indicate that both the C3M module with deformable convolutions and the GAM module with an expanded receptive field effectively improve the accuracy of the model. Moreover, introducing both modules together achieves the best detection performance on the IP102 dataset, which supports using the two modules in combination.

Parameter Comparison Experiment
The experiments presented in Table 4 demonstrate the change in the model parameters after introducing the C3M module. After introducing C3M into YOLOv5, the model's detection accuracy improves by 4.6%, and the inference speed also increases. Compared with the original model, C3M significantly reduces the depth of the network and cuts the parameter volume by 2.3% and the GFLOPs by 3%.

Presentation of Detection Results
Figure 17 shows the validation detection results of the model on the training set. The first row presents the ground-truth labels of the images, the second row displays the detection results of YOLOv5s, and the last row shows the recognition results of C3M-YOLO. Compared to the original model, the improved model detects pest images more accurately. The detection confidence score for several pests, such as black cutworms, armyworms, mole crickets, and blister beetles, increased by 0.2, resulting in higher detection precision.
C3M-YOLO's detection metrics are clearly superior. Specifically, the mAP0.5-0.95 increased by 3.8%, while the mAP0.5 and mAP0.75 increased by 5.1% and 6.2%, respectively. The improved model therefore achieves the best detection accuracy, enabling it to better recognize pest images in IP102.
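mAP0.5 and mAP0.75 differ only in the IoU threshold a predicted box must clear to count as a true positive; the boxes below are toy values chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # clamp negative overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# This prediction overlaps the ground truth enough to be a true positive
# under mAP0.5 (IoU >= 0.5) but not under mAP0.75 (IoU >= 0.75).
score = iou((0, 0, 10, 10), (2, 0, 12, 10))
print(round(score, 3))  # 0.667
```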

Conclusions
First, we carried out data augmentation on the IP102 dataset using Mosaic enhancement so that the model could extract more detailed feature information and perform better in various real-world scenarios. Second, our proposed C3M module flexibly handles image features while also improving the model's inference speed. Third, the introduction of the GAM attention mechanism expands the model's receptive field, enabling the neck of the model to effectively learn the feature information processed by the backbone. Subsequent ablation experiments verified the rationality of our improvement strategy.
To address the issue of imbalanced sample sizes in the IP102 pest dataset, we can extract effective feature information by capturing global image features or by using more versatile convolutional methods to improve the model's detection accuracy. Additionally, we found in the experiments that the detection model produced inaccurate anchor boxes, and sometimes lost detection confidence entirely, for small pest categories such as Unaspis yanonensis and Aleurocanthus spiniferus. Therefore, in the future, we will optimize the model's detection performance for small target pests in complex background environments to better fit practical application scenarios.

Figure 1. The main work of the article.

Figure 2. Sixteen examples with some interference characteristics of Mosaic data enhancement.

3.2.1. Conv
In the Conv module, an ordinary 2D convolution, a BN (Batch Normalization) regularization operation, and a SiLU (Sigmoid Linear Unit) activation function are combined into a new standard convolutional Conv module, as shown in Figure 4.
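A minimal PyTorch sketch of this Conv block, mirroring YOLOv5's standard convolution (the "same"-padding rule for odd kernels is standard in that codebase):

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """2D convolution + BatchNorm + SiLU, as described for the Conv module."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

x = torch.randn(1, 3, 64, 64)
y = Conv(3, 32, k=3, s=2)(x)
print(y.shape)  # torch.Size([1, 32, 32, 32])
```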

Figure 9. MConv convolutional architecture diagram. Pointwise and depthwise convolutions are a pair of lightweight convolution methods: the former only changes the number of channels of the input image without changing the size of the feature map, while the latter does the opposite, transforming the size of the feature map without changing the number of channels. The SELayer enhances the model's sensitivity to features in different channels. By fusing pointwise convolution, depthwise convolution, and the SELayer, the image features obtained from these three operations are richer and more representative. Thus, MConv is more flexible than YOLOv5's standard convolution in processing image features.

Figure 11. (a) The quantity of each pest type in the training set of the IP102 dataset. (b) The K-means clustering method is used to calculate the center-point coordinates, widths, heights, and their correlations for all target pests in the dataset.

4.4.
Figures 12 and 13 respectively show the relationships between P, R, and the confidence score. Each black line represents the change in detection accuracy of one pest in the IP102 dataset, while the blue line represents the average change over all pests. Overall, the precision P of all the pests is directly proportional to the confidence, while the recall R is inversely proportional to it, and the blue curve is relatively smooth. The maximum precision obtained by the improved model is 0.909, and the corresponding recall value is 0.880.
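The precision-up, recall-down behavior described above can be reproduced by sweeping a confidence cutoff over scored detections and recomputing P and R at each step. The detections and ground-truth count below are toy values, not IP102 results.

```python
dets = [  # (confidence score, is this detection a true positive?)
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.50, True), (0.30, False),
]
num_gt = 5  # total ground-truth pests in the toy set

def pr_at(threshold):
    """Precision and recall keeping only detections at or above the threshold."""
    kept = [tp for conf, tp in dets if conf >= threshold]
    if not kept:
        return 1.0, 0.0  # convention: an empty prediction set is "precise"
    tp = sum(kept)
    return tp / len(kept), tp / num_gt

for t in (0.4, 0.65, 0.85):
    p, r = pr_at(t)
    print(f"conf>={t}: P={p:.2f} R={r:.2f}")
```

Raising the cutoff discards the least certain (often wrong) detections, so precision rises while recall falls, matching the shapes of the curves in Figures 12 and 13.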

Figure 12. Graph of the relationship between Precision and Confidence score for all pests.


Figure 13. Graph of the relationship between Recall and Confidence score for all pests.

Figures 14 and 15 show the P-R curves for the original and improved YOLOv5 models. The results indicate that the improved model performs better than the original model on the training set, with an increase of 0.9% in mAP0.5. However, some pests in the training results exhibit severe fluctuations in detection accuracy, possibly due to insufficient samples for these pests; as a result, the detection model cannot learn sufficient feature information, leading to misidentification.


Figure 16 shows the trend of the training loss and detection accuracy of C3M-YOLO for each epoch on the training and validation datasets. The boundary loss (box_loss), confidence score loss (obj_loss), and classification loss (cls_loss) decrease gradually as the training epochs increase. Moreover, the corresponding detection precision, recall, and mAP values improve significantly within the first training epochs.



Figure 17. Comparison of image results during training.

Figure 18 compares the detection results of different models on several original insect images. For each image, the left side shows the test results of the original model, and the right side presents the results of the improved model. Compared to YOLOv5s, C3M-YOLO slightly increases the detection confidence for Cicadellidae, rice leaf roller, and other pests, and notably improves the confidence score for the corn borer by 0.22.



Figure 18. Detection results of different pest images.

Table 4. The magnitude and inference speed change with C3M.