1. Introduction
As a major agricultural nation, China regards integrated pest management as crucial for sustainable agricultural development. Every year, diverse pests pose significant challenges during crop cultivation, reducing crop yields and compromising quality; in severe cases, these challenges can lead to widespread crop failures. In addition, many pests look very much alike: Lepidoptera alone contains dozens of common field crop pests with highly similar physical characteristics. Pests also vary greatly in size, and some are so small that their morphology is difficult to distinguish in photographs. What is more complicated is that each pest may be at a different insect age and developmental stage, such as the larval or adult stage, so even the same pest can look very different. Field pest detection therefore needs to handle multi-posture, multi-species, multi-form pests, which poses much greater technical challenges than other detection and recognition tasks. Therefore, the accurate and effective classification and identification of insects are essential for implementing timely pest control measures and mitigating substantial economic losses in crop production.
Traditional approaches to classifying and identifying diseases and insect pests depend heavily on the expertise of insect specialists or taxonomists, who accumulate research experience through professional knowledge and literature references. However, this method is slow, inefficient, costly, subjective, and often lacks timeliness. Even with extensive knowledge and experience, it remains challenging to avoid species confusion. The continuous advancement of the Internet and information technology has introduced new methods and ideas for crop disease and insect pest identification. Efficient image recognition technology has emerged as a solution to improve recognition efficiency, reduce costs, and enhance accuracy. Various operators such as SIFT [1], LBP [2], ORB [3], and SURF [4] have been employed to represent targets. Machine learning techniques such as Support Vector Machine (SVM) [5], k-Nearest Neighbors (KNN) [6], and Random Forest [7] are then utilized for target recognition. However, these feature-based methods rely heavily on the representation of the feature operators and are not robust against illumination variations, occlusion, complex environments, and interference from similar targets.
With the rapid development of deep learning technology, researchers have increasingly applied it to image recognition, leading to significant advancements in recent years. Deep learning is characterized by complex network structures and the ability to learn from large-scale datasets, and it has provided a robust technical foundation for image recognition tasks. Various deep learning models have been proposed, including the Deep Belief Network (DBN) [8], the convolutional neural network (CNN) [9], the Recurrent Neural Network (RNN) [10], the Generative Adversarial Network (GAN) [11], and the Capsule Network (CapsNet) [12]. However, it is challenging for these methods to fully meet practical requirements, and many of them remain limited to laboratory-level research.
In recent years, computer vision technology applied in the field of agricultural engineering, specifically utilizing deep learning algorithms, has gained increasing attention and favor from experts and scholars worldwide. Compared to traditional machine vision techniques, deep-learning-based computer vision offers superior efficiency and accuracy in various areas, including image processing, feature extraction, feature abstraction, and feature classification. As a result, numerous pest identification and classification methods based on deep learning have been proposed. These advancements aim to enhance and address the limitations of current pest identification methods, ultimately achieving more timely and effective pest control measures.
In 2016, Wu Xiang from Zhejiang University [13] successfully applied a convolutional neural network (CNN) model to identify 10 species of moth pests, such as the box borer, corn borer, rice leaf roller, and mole cricket. The image dataset consisted of 900 color images collected from the natural environment, with each image containing a single pest. The CNN pest recognition model comprised five layers and achieved a recognition accuracy of approximately 76.7%. Wang et al. [14] proposed a plant pest recognition method based on a CNN with inception modules and dilated convolutions. Huang et al. [15] developed a CNN model to classify eight types of tomato pests and employed transfer learning to reduce training time. However, these approaches are limited by their small datasets, which may result in knowledge limitations and overfitting during model learning. Moreover, the extracted features are relatively simplistic, leading to inadequate generalization in real-world scenarios.
Thanks to progress in deep learning technology, numerous detectors based on convolutional neural networks have achieved impressive detection performance. Single-stage detectors [14] directly employ convolutional neural networks to predict the category and location of targets. Alternatively, Faster R-CNN [16] generates region proposals through a region proposal network, enabling more accurate classification and regression. A recently proposed plug-and-play module [17] for R-CNN models builds a query-based model that can reason over varying numbers of proposals and further extends it to a dynamic model; however, that work does not address whether the approach can be applied to YOLO-series object detection. Transformer-based detectors [18,19], on the other hand, eliminate the need for anchor constraints and post-processing steps such as non-maximum suppression, and this end-to-end design simplifies the object detection pipeline significantly.
In the domain of object detection, the YOLO series [20,21] has played a significant role as a standalone detector. For example, Cai et al. [22] proposed an improved YOLOv4 detection framework for self-driving cars, which improves detection accuracy and supports real-time operation; compared with the original model, it shows higher average precision and inference speed. This paper introduces an enhanced model called CLT-YOLOX, based on YOLOX [23], to address the aforementioned challenges.
Figure 1 provides an overview of the detection pipeline in CLT-YOLOX, which incorporates data augmentation during training. This expands the dataset and enhances adaptability to significant variations in object sizes within images; to further combat overfitting during training, a data augmentation strategy is proposed. The backbone of CLT-YOLOX follows the original version and uses the CSPDarknet [24] network. Before the input of the original small-object detection head, a special CLT module is introduced, which handles tiny objects by fusing feature information extracted at different scales. In the Path Aggregation Network (PANet) [25], we achieve direct upsampling by removing the convolution operation in the upsampling stage, and improvements are made to the C3 module in the PFPN. Compared with BiFPN [26], our method similarly removes nodes with only one input edge, and we additionally remove the convolution operation before upsampling to save computational cost. Furthermore, to capture attention in images with large coverage, we incorporate the Convolutional Block Attention Module (CBAM) [27]. As a result, our enhanced CLT-YOLOX model demonstrates superior performance in handling agricultural pest images compared to YOLOX.
Our contributions are summarized as follows:
Addressing the scarcity of real pest image data: To overcome the limited availability of real pest image data in the field, we employed Mosaic and Mixup data augmentation techniques. These techniques effectively augment the dataset, allowing for a better model generalization. Additionally, a novel data enhancement strategy was introduced to further improve the model’s performance.
Cross-Layer Transformer (CLT) module: Our proposed CLT (Cross-Layer Transformer) module incorporates cross-layer information, enabling the extraction of fine-grained features more effectively. By leveraging this cross-layer information, our model achieves improved detection results compared to the original YOLOX algorithm.
Enhancements to the C3 module and integration of CBAM: We enhanced the C3 module within the PFPN structure by removing the convolution operation before the upsampling stage. Additionally, we integrated the Convolutional Block Attention Module (CBAM) to capture attention in complex scenes. These modifications enhance the recognition ability for multi-scale targets while managing the trade-off between computational requirements and accuracy.
Performance improvements: The improved YOLOX algorithm proposed in this paper achieved an average precision (AP) of 57.7% on the public IP102 dataset. This performance is 2.2% higher than the original YOLOX model. Notably, the APsmall value increased by 2.2%, demonstrating the effectiveness of our enhancements in detecting small targets.
3. Materials and Methods
YOLOX is recognized as one of the most accurate object detectors while offering a competitive inference speed, and it introduces enhanced data augmentation techniques for data preprocessing, which helps overcome the limitations of existing pest detection methods, such as small datasets and simplistic feature extraction. It employs an Anchor-free framework, effectively addressing the class imbalance issue commonly encountered in Anchor-based methods. Additionally, YOLOX uses decoupled heads to handle classification and regression separately, in line with the recent trend of detection models replacing coupled detection heads, which helps to improve the accuracy and efficiency of object detection. We therefore adopt the YOLOX [21] framework to enhance the effectiveness of pest detection. The YOLOX series consists of six models: YOLOX-Nano, YOLOX-Tiny, YOLOX-S, YOLOX-M, YOLOX-L, and YOLOX-X. Compared to YOLOX-M, which has 25.3M parameters, YOLOX-S has only 9.0M parameters, making it a smaller and more suitable option for future deployment on handheld devices. While YOLOX-Tiny and YOLOX-Nano use depth-wise convolution to further reduce the number of parameters, they sacrifice some accuracy, particularly on the APsmall index. As a result, we have chosen YOLOX-S as the benchmark model due to its relatively small parameter count while still maintaining acceptable accuracy for pest image recognition tasks.
To address the scarcity of real pest images, we employ the Mosaic and Mixup techniques to process the training data. These data augmentation methods significantly enhance the dataset’s diversity and quality. Additionally, we propose a new enhancement strategy to further improve the model’s performance.
To extract fine-grained features, we introduce shallow information into the existing network architecture. We propose the CLT module, which incorporates cross-layer information fusion, to effectively extract features. Furthermore, we enhance the feature pyramid structure and strike a balance between data volume and accuracy. The overall framework of our algorithm is depicted in Figure 2.
3.1. Data Enhancement
In this paper, two fundamental data augmentation techniques, Mosaic and Mixup, are employed. Mosaic data augmentation: by applying a series of random operations to four images and stitching them together, the backgrounds of the detected objects in the dataset are significantly enriched; this enhances the diversity of the training data and improves the model's ability to handle complex backgrounds. Mixup: two images are randomly selected from the training dataset and combined through a weighted summation, with their labels weighted accordingly; this process helps to reduce the impact of incorrect labels and enhances the model's robustness. The impact of data augmentation is illustrated in Figure 3.
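To make the weighted-summation step of Mixup concrete, the following is a minimal Python sketch; the function name, the Beta-distributed mixing weight, and alpha = 0.5 are illustrative assumptions rather than the exact settings used in this paper.

```python
# Minimal Mixup sketch: images are arrays/tensors of the same shape and labels are
# one-hot (or soft) vectors; the mixing ratio lam is drawn from Beta(alpha, alpha).
import numpy as np

def mixup(img_a, img_b, label_a, label_b, alpha=0.5):
    lam = np.random.beta(alpha, alpha)                    # mixing coefficient in (0, 1)
    mixed_img = lam * img_a + (1.0 - lam) * img_b         # weighted sum of the two images
    mixed_label = lam * label_a + (1.0 - lam) * label_b   # labels weighted the same way
    return mixed_img, mixed_label
```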
3.2. Backbone
The backbone of the CLT-YOLOX architecture uses the CSPDarknet [24] module for feature extraction, aiming to maximize the diversity of gradient combinations. To prevent different layers from learning redundant gradient information, the gradient flow truncation method is employed. This approach enhances the feature extraction capability of the convolutional network, improving detection speed and reducing computational overhead while maintaining a high detection accuracy. Compared with other commonly used backbone networks, the CSPDarknet module achieves superior feature extraction without compromising detection accuracy, leading to improved efficiency and reduced computational cost.
To facilitate a more detailed discussion, let us define the model structure in mathematical terms. We denote the input image as $x$, where $x \in \mathbb{R}^{H \times W \times 3}$, and the four feature outputs of the backbone network as $C_i$, $i = 1, \dots, 4$. These four features are produced as in Formula (1):

$$C_i = \mathrm{Block}_i(C_{i-1}), \quad i = 1, \dots, 4, \qquad C_0 = \mathrm{Focus}(x) \tag{1}$$

The backbone network is thus defined by different regions denoted as $\mathrm{Block}_i$. The blocks $\mathrm{Block}_1$, $\mathrm{Block}_2$, and $\mathrm{Block}_3$ consist of a convolutional layer (Conv) followed by 3 or 9 CSPBottleneck modules. Additionally, there is a region $\mathrm{Block}_4$, which is composed of a Conv layer, 3 CSPBottleneck modules, and an SPP (Spatial Pyramid Pooling) module. Each module is described as follows:
Focus module: the focus module slices an image by taking a value for each pixel at an interval of two (similar to adjacent downsampling). This process moves information from the width (W) and height (H) dimensions into the channel dimension: the number of channels is expanded by a factor of four, so a three-channel input becomes a spliced image with 12 channels. This increase in channels benefits subsequent computation. Figure 4 illustrates the concept of the focus module.
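The slicing operation of the focus module can be illustrated with a short PyTorch sketch; this is a generic reconstruction of the interval-sampling idea rather than the exact code of the model.

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    # x: (N, C, H, W) -> (N, 4C, H/2, W/2).
    # Take every second pixel along H and W, producing four shifted copies,
    # and stack them along the channel dimension (a 3-channel image becomes 12 channels).
    return torch.cat(
        [x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]],
        dim=1,
    )
```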
SPP module: the SPP module is inspired by the idea of Spatial Pyramid Pooling. It uses a pooling layer composed of three max-pooling kernels of different sizes (5 × 5, 9 × 9, 13 × 13) to fuse local and global features, which enriches the expressive capability of the final feature map. The SPP module enhances the network's ability to capture features at different scales, improving overall performance. Figure 4 provides an illustration of the SPP module.
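A minimal PyTorch sketch of the SPP idea is given below, assuming stride-1 max pooling with "same" padding so that the three branches keep the spatial size; the 1 × 1 convolutions that surround the block in the full network are omitted.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Parallel max pooling with 5/9/13 kernels; outputs are concatenated with the input."""
    def __init__(self):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in (5, 9, 13)]
        )

    def forward(self, x):
        # Fuse local and global context by concatenating the pooled maps along channels.
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```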
These modules, including the focus module and the SPP module, play a crucial role in the backbone network of CLT-YOLOX. They enable effective feature extraction and information integration, contributing to the accurate detection of objects in the subsequent stages of the architecture.
3.3. Improved Neck
3.3.1. Improved C3 Module
In the CLT-YOLOX architecture, the C3 module, responsible for feature fusion and extraction, is improved to enhance the model's ability to extract and distinguish important information specific to each pest species in the image. The improved version is called the C2F module, which combines the C3 module from the original model with the ELAN (Efficient Layer Aggregation Network) [40] concept.
The original C3 module in YOLOX utilized CSPNet [41] (Cross-Stage Partial Network) to introduce the concept of splitting and incorporated a residual structure. The C2F module in this paper builds upon the C3 module and integrates the ELAN idea to achieve a light weight while capturing a broader range of gradient flow information. The C2F module enhances the ability of the model to learn and represent complex patterns in pest images.
Figure 5 shows the respective structure of the C3 module and C2F in detail.
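The split-and-aggregate idea behind the C2F module can be sketched in PyTorch as follows; the channel counts, SiLU activations, and the residual bottleneck are our own illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c, c, 3, padding=1), nn.SiLU(),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection

class C2F(nn.Module):
    """Split the features, run a chain of bottlenecks, and concatenate every
    intermediate output so that more gradient paths reach the loss."""
    def __init__(self, c_in, c_out, n=3):
        super().__init__()
        c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * c, 1)
        self.blocks = nn.ModuleList([Bottleneck(c) for _ in range(n)])
        self.cv2 = nn.Conv2d((2 + n) * c, c_out, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))  # split into two c-channel halves
        for b in self.blocks:
            y.append(b(y[-1]))                 # keep every intermediate output
        return self.cv2(torch.cat(y, dim=1))   # aggregate all branches
```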
3.3.2. Convolutional Block Attention Module (CBAM)
The CLT-YOLOX model incorporates the CBAM (Convolutional Block Attention Module) [22] to enhance its ability to capture attention regions in pest images. CBAM is a lightweight attention module known for its simplicity and effectiveness in improving feature representation. It integrates seamlessly into CNN architectures and can be trained end-to-end.
The CBAM module operates on a feature map and sequentially derives attention maps along two independent dimensions: channel and spatial. The channel attention mechanism captures interdependencies between channels, allowing the model to focus on informative channels while suppressing less relevant ones. The spatial attention mechanism highlights spatial regions of interest by modelling interdependencies between spatial locations. These attention maps are then multiplied with the input feature map, enabling adaptive feature refinement. Its structure is shown in Figure 6.
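For reference, a compact PyTorch sketch of CBAM's channel-then-spatial attention is shown below; the reduction ratio of 16 and the 7 × 7 spatial kernel follow the original CBAM design, while the rest is a simplified reconstruction rather than this paper's exact implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        w = torch.sigmoid(avg + mx)[..., None, None]
        return x * w                          # reweight informative channels

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)     # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)      # channel-wise max map
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                          # highlight spatial regions of interest

class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```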
3.3.3. Cross-Layer Transformers Module
There are also four features in the “neck”, where the shallowest backbone feature $C_1$ directly enters the new module, so the expression of $P_i$ covers $i = 2, \dots, 4$. These three features can be expressed as follows:

$$P_i = \phi_i\left(M_i\right), \quad i = 2, \dots, 4 \tag{2}$$

In the formula, the expression of $M_i$ is as follows:

$$M_4 = C_4, \qquad M_i = \mathrm{Concat}\left(C_i, \mathrm{Up}(P_{i+1})\right), \quad i = 2, 3 \tag{3}$$

Among them, $\mathrm{Concat}$ represents a connection operation, $\mathrm{Up}$ represents an upsampling operation, and $\phi_i$ represents a combination of different modules. In Formula (3), the index $i$ starts at 2. In $\phi_2$ and $\phi_3$, $\phi_i$ consists of three C2F modules and one CBAM module.
We define the features before the last three convolutional layers as $N_i$, where $i = 2, \dots, 4$. So, we obtain Formula (4):

$$N_2 = P_2, \qquad N_i = \psi_i\left(\mathrm{Concat}\left(P_i, \mathrm{Down}(N_{i-1})\right)\right), \quad i = 3, 4 \tag{4}$$

where $\mathrm{Down}$ represents a downsampling block, playing a role similar to $\mathrm{Up}$ in Formula (3). $\psi_i$ also represents different blocks: $\psi_3$ and $\psi_4$ are composed of one CBAM module and three C2F modules, respectively, while in $\psi_2$ the CBAM operation is omitted and the direct output is selected. After obtaining $N_i$, we can obtain the final prediction as shown in Equation (6):

$$D_i = \mathrm{Head}_i(N_i), \quad i = 2, \dots, 4 \tag{6}$$

where $D_i$ represents the three output predictions from the different prediction heads.
Inspired by CoTNet [42], this paper proposes a multi-scale fusion algorithm that combines the outputs of the backbone network, $C_1$, $C_2$, $C_3$, and $C_4$. In order to better extract fine-grained features, this paper proposes the CLT algorithm in the “neck” part. Considering the shallow information of the original framework, $C_1$ is further introduced, so that the shallow path $C_1$ and the small path $P_2$ enter the CLT module to provide the keys and query, and the value, respectively, in order to predict the final $N_2$; the cross-layer feature Transformer is thereby realized. The new $N_2$ is given by Formula (7):

$$N_2 = \mathrm{CLT}(C_1, P_2) \tag{7}$$

The CLT module is utilized as a cross-layer feature Transformer. To fuse and extract feature information from different output sizes, we designed the CLT module as depicted in the figure. It takes the inputs $C_1$ and $P_2$ and generates the output $N_2$. Additionally, K and Q are generated from $C_1$, while V is generated from $P_2$. In this context, $C_1 \in \mathbb{R}^{c \times 2h \times 2w}$ and $P_2 \in \mathbb{R}^{2c \times h \times w}$.
Next, let us delve into the detailed steps of the CLT module; the flowchart is shown in Figure 7. Initially, we obtain $K^1$ as the static representation of the input. Instead of using a 1 × 1 convolution, we spatially align all adjacent keys within the k × k grid and perform a k × k group convolution. This process enables the learned contextualized keys to inherently capture the static key information from neighboring keys. Subsequently, the connection operation is executed, as illustrated in Formula (8):

$$Y = \mathrm{Concat}(K^1, Q) \tag{8}$$

where $\mathrm{Concat}$ represents the connection operation. After the connection operation, two different convolution operations are performed to obtain the attention matrix A, as shown in Formula (9):

$$A = W_\delta\left(W_\theta\left(Y\right)\right) \tag{9}$$

where $W_\theta$ and $W_\delta$ represent two consecutive 1 × 1 convolution operations ($W_\theta$ with the activation function and $W_\delta$ without the activation function).
Next, we process the other input of the CLT module. First, a convolution operation denoted as $W_v$ is applied to $P_2$, where $W_v$ is a 1 × 1 convolution that adjusts the number of channels for the subsequent fusion process. As a result, the dimension of the value $V$ is transformed to $2c \times h \times w$.
In other words, for each head, the local attention matrix $A$ at each spatial position is learned based on the query feature, incorporating more realistic key features that are contextually relevant. This approach enhances self-attention learning with the additional guidance of the static context $K^1$. Subsequently, we obtain the feature map $K^2$ by combining the attention matrix with the participation of $V$; the details are shown in Figure 8. Finally, we obtain the resulting dynamic feature map $K^2$.
To ensure compatibility with the feature map $K^2$, a convolution operation is applied to $K^1$, adjusting its size to match that of $K^2$. The final output of the CLT module is the concatenation $F_{out}$ of the dynamic feature map $K^2$ and the static feature map $K^1$, which captures the dynamic feature interaction between inputs from different feature layers.
The entire workflow of CLT can be seen in Algorithm 1.
Algorithm 1: Cross-Layer Transformer
Input: input feature maps, where K, Q ← $C_1$ and V ← $P_2$
Output: output feature map $F_{out}$
1. Set kernel_size = 3, stride = 2; K, Q have shape (c, 2h, 2w) and V has shape (2c, h, w)
2. $K^1$ ← GroupConv(K) (padding = kernel_size/2, groups = 4)
3. Y ← Concat($K^1$, Q)
4. A ← $W_\theta$(Y) (1 × 1 convolution with activation)
5. A ← $W_\delta$(A) (1 × 1 convolution without activation)
6. Change the size of matrix A to (2c, 2 × 2, h, w)
7. Average matrix A over dimension 2
8. A ← A.mean(2, keepdim = False).view(2c, −1)
9. $K^2$ ← softmax(A, dim = 1) ∗ $V^1$, where $V^1$ = $W_v$(V)
10. $K^1$ ← Conv(kernel_size, stride)($K^1$)
11. Change the size of matrix $K^2$ to (2c, h, w)
12. $F_{out}$ ← Concat($K^1$, $K^2$)
13. return $F_{out}$
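To make the data flow of Algorithm 1 easier to follow, a loose PyTorch sketch of the CLT module is given below. It is a reconstruction under stated assumptions: the grouped 3 × 3 key convolution, the two 1 × 1 attention convolutions, and the static/dynamic concatenation follow the description above, while the head-wise reshaping and averaging of A (steps 6–8) are simplified into a single spatial softmax, and the final 1 × 1 fusion convolution is our own addition to restore the channel count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLT(nn.Module):
    """Sketch of the Cross-Layer Transformer: the high-resolution map (c, 2h, 2w)
    supplies keys and queries, the low-resolution map (2c, h, w) supplies values."""
    def __init__(self, c, kernel_size=3, stride=2, groups=4):
        super().__init__()
        pad = kernel_size // 2
        # Static context K1: grouped k x k convolution that downsamples the keys to (2c, h, w).
        self.key_conv = nn.Conv2d(c, 2 * c, kernel_size, stride, pad, groups=groups)
        # Query downsampling so Q can be concatenated with K1.
        self.query_conv = nn.Conv2d(c, 2 * c, kernel_size, stride, pad)
        # Value embedding W_v: 1 x 1 convolution on the low-resolution input.
        self.value_conv = nn.Conv2d(2 * c, 2 * c, 1)
        # Two consecutive 1 x 1 convolutions producing the attention matrix A (Formula (9)).
        self.attn = nn.Sequential(
            nn.Conv2d(4 * c, 2 * c, 1), nn.ReLU(),
            nn.Conv2d(2 * c, 2 * c, 1),
        )
        # Assumed fusion convolution to bring the concatenated output back to 2c channels.
        self.fuse = nn.Conv2d(4 * c, 2 * c, 1)

    def forward(self, x_high, x_low):
        k1 = self.key_conv(x_high)                       # static context, (N, 2c, h, w)
        q = self.query_conv(x_high)                      # queries aligned to (N, 2c, h, w)
        v = self.value_conv(x_low)                       # values V1, (N, 2c, h, w)
        a = self.attn(torch.cat([k1, q], dim=1))         # attention matrix A
        n, c2, h, w = a.shape
        w_attn = F.softmax(a.view(n, c2, -1), dim=-1)    # normalize over spatial positions
        k2 = (w_attn * v.view(n, c2, -1)).view(n, c2, h, w)  # dynamic context K2
        return self.fuse(torch.cat([k1, k2], dim=1))     # fuse static and dynamic features
```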
3.4. Head
As depicted in Figure 2, the head section comprises three prediction heads, each consisting of separate classification and regression branches. These branches are concatenated along the channel dimension, and a reshape operation is applied by multiplying the width (W) and height (H). Subsequently, the three prediction heads are concatenated along the W × H dimension, and the loss is calculated based on this combined output.
3.4.1. Decoupled Head
In previous versions of the YOLO series, the regression task and the classification task shared the parameters of the preceding layer, which may not have been optimal for both tasks. To address this, the proposed model divides the vector input to the head into two parts, specifically for detection box regression and target classification. This division allows the classification task to focus more on determining which extracted features are most relevant to the existing categories, while enabling the regression task to prioritize the position coordinates and correct the bounding box parameters more accurately. This approach accelerates the convergence speed of the model and enhances detection accuracy.
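The following is a simplified PyTorch sketch of such a decoupled head; the channel width, activations, and the placement of the objectness branch are illustrative assumptions rather than the exact configuration used here.

```python
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Separate branches for classification and box regression (plus objectness)."""
    def __init__(self, in_ch, num_classes, width=256):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, width, 1)
        self.cls_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, num_classes, 1),      # per-class scores
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
        )
        self.reg_pred = nn.Conv2d(width, 4, 1)     # box offsets
        self.obj_pred = nn.Conv2d(width, 1, 1)     # objectness score

    def forward(self, x):
        x = self.stem(x)
        reg_feat = self.reg_branch(x)
        return self.cls_branch(x), self.reg_pred(reg_feat), self.obj_pred(reg_feat)
```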
3.4.2. Anchor Free
Currently, there are two mainstream methods for object detection: Anchor-based and Anchor-free. For instance, YOLOv3 and YOLOv5 employ the Anchor-based method to extract target frames and compare them with annotated ground truth frames to assess the differences. In contrast, this model utilizes the Anchor-free method to overcome the limitations of the Anchor-based approach. It directly predicts four numerical values at each position on the output feature map, which determine the offsets of the top-left and bottom-right corners relative to the target at that particular position. This approach eliminates the need for predefined Anchor boxes and allows for more flexibility in detecting objects of various sizes and aspect ratios.
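As an illustration of how the four predicted offsets can be turned into a box, consider the sketch below; the variable names and the grid-unit convention are assumptions for illustration, not the model's exact decoding code.

```python
def decode_anchor_free(l, t, r, b, cx, cy, stride):
    # (l, t, r, b): predicted distances from the grid position (cx, cy) to the
    # left/top and right/bottom box borders, in grid units; stride maps back to pixels.
    x1 = (cx - l) * stride
    y1 = (cy - t) * stride
    x2 = (cx + r) * stride
    y2 = (cy + b) * stride
    return x1, y1, x2, y2

# Example: a prediction at grid cell (10, 7) on the stride-8 feature map.
print(decode_anchor_free(1.5, 2.0, 2.5, 1.0, cx=10, cy=7, stride=8))  # (68.0, 40.0, 100.0, 64.0)
```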
3.4.3. SimOTA
A fixed, hand-crafted distribution of positive samples is often suboptimal. The SimOTA strategy adopted in this model assigns a different number of positive samples to different targets. The cost formula is as follows:

$$c_{ij} = L_{ij}^{cls} + \lambda\, L_{ij}^{reg}$$

In the equation, $\lambda$ represents the balance coefficient, $L_{ij}^{cls}$ represents the category loss between the prediction and the ground truth, and $L_{ij}^{reg}$ represents the corresponding regression loss. During the training process, each Anchor box undergoes adaptive adjustment within a fixed central area. The algorithm dynamically allocates k positive samples and assigns positive labels to the corresponding grid cells for these positive predictions, while the remaining grid cells are labeled as negative. Compared to OTA (Optimal Transport Assignment), SimOTA (simplified OTA) significantly reduces training time and computational complexity.
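The dynamic allocation of k can be sketched as follows for a single ground-truth box; candidate filtering by the fixed central area and the resolution of conflicts between ground truths are omitted, so this is only an outline of the idea.

```python
import torch

def dynamic_k_matching(cost, ious, n_candidate=10):
    """cost, ious: per-prediction cost and IoU vectors for one ground-truth box."""
    topk_ious, _ = torch.topk(ious, min(n_candidate, ious.numel()))
    k = max(int(topk_ious.sum().item()), 1)           # dynamic number of positives
    _, pos_idx = torch.topk(cost, k, largest=False)   # k lowest-cost predictions
    return pos_idx                                    # indices labeled as positive
```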
3.4.4. Loss Function
We calculate the classification loss and regression loss using binary cross-entropy, which is as follows:

$$L_{BCE} = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$

In the above, $y$ represents the binary label, taking the value 0 or 1, and $\hat{y}$ is the predicted probability for each category.
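In PyTorch, this corresponds to the standard binary cross-entropy loss; a minimal usage sketch with made-up tensor shapes is shown below.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-class logits for 3 predictions over 5 classes and their 0/1 targets.
logits = torch.randn(3, 5)
targets = torch.randint(0, 2, (3, 5)).float()

# binary_cross_entropy_with_logits applies the sigmoid internally, matching the
# formulation above with y_hat = sigmoid(logit).
loss = F.binary_cross_entropy_with_logits(logits, targets)
```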
5. Conclusions
In this paper, a novel approach is proposed to address the challenges associated with pest image detection. The combination of Mosaic and Mixup techniques for data preprocessing is introduced to overcome the limitations imposed by small datasets in pest detection. By leveraging these techniques, the model can better handle knowledge limitations and mitigate overfitting issues during the learning process.
To enhance the detection performance specifically in pest identification, several state-of-the-art techniques are incorporated into the model. The C2F module is introduced to capture broader gradient flow information while maintaining a lightweight structure. The CBAM module, known for its simplicity and effectiveness, is utilized to extract attention regions in pest images, aiding in filtering out confusing information and focusing on relevant target objects. Additionally, empirical tricks are employed to further enhance the detection performance of the model.
To expand the scale feature extraction of the network model and improve its expressive ability, the paper introduces shallow information and proposes the CLT module. This module enriches the features of small paths by utilizing information from larger paths, enabling better feature representation and extraction. The CLT module plays a crucial role in improving the model’s expressive ability and contributes to the development of the CLT-YOLOX model.
To validate the effectiveness of each strategy employed in this study, ablation experiments are conducted. These experiments systematically evaluate the impact of different components and techniques on the overall performance of the model. Furthermore, the model’s performance is evaluated on public datasets, which serve as benchmarks for pest image detection. The results demonstrate the efficacy of the proposed approach in accurately detecting pests in various scenarios.
Overall, this paper presents a comprehensive approach that combines various techniques and strategies to address the challenges in pest image detection. The proposed CLT-YOLOX model showcases an improved performance and demonstrates its effectiveness in pest identification tasks.