Automatic Crop Pest Detection Oriented Multiscale Feature Fusion Approach

Simple Summary: Monitoring pests is a labor-intensive and time-consuming task for agricultural experts. This paper proposes a new approach to classifying and counting different categories of crop pests. Specifically, we propose a multi-category pest detection network (MCPD-net), which includes a multiscale feature pyramid network and a novel adaptive feature region proposal network. The multiscale feature pyramid network is used to fuse the multiscale pest information, which significantly improves detection accuracy. The adaptive feature region proposal network addresses the misalignment between anchors and features that arises during region proposal network (RPN) iteration, especially for small pest objects. Extensive experiments on the multi-category pests dataset 2021 (MPD2021) demonstrated that the proposed method provides significant improvements in terms of average precision (AP) and average recall (AR); it outperformed other deep learning-based models.
Abstract: Specialized pest control is a high-priority agricultural issue. There are multiple categories of tiny pests, which pose significant challenges to monitoring. Previous work mainly relied on manual monitoring of pests, which was labor-intensive and time-consuming. Recently, deep-learning-based pest detection methods have achieved remarkable improvements and can be used for automatic pest monitoring. However, there are two main obstacles in the task of pest detection. (1) Small pests often go undetected because much information is lost during the network training process. (2) The highly similar physical appearances of some categories of pests make it difficult for networks to distinguish the specific categories. To alleviate the above problems, we propose the multi-category pest detection network (MCPD-net), which includes a multiscale feature pyramid network (MFPN) and a novel adaptive feature region proposal network (AFRPN). MFPN fuses the pest information in multiscale features, which significantly improves detection accuracy. AFRPN solves the problem of anchor and feature misalignment during RPN iteration, especially for small pest objects. In extensive experiments on the multi-category pests dataset 2021 (MPD2021), the proposed method achieved 67.3% mean average precision (mAP) and 89.3% average recall (AR), outperforming other deep learning-based models.


Introduction
Crop yields play an essential role in agricultural economic development. However, agricultural pests significantly affect crop production. Traditional pest recognition usually depends on manual observation by agricultural experts, whose work is subjective and labor-intensive [1][2][3]. Therefore, it is essential to propose a method that can automatically monitor pests and promptly inform agricultural experts of pest occurrences. With the advancements in light trapping device technology, numerous pest images with high spatial
In this work, to address these problems, we propose a multi-category pest detection network (MCPD-net). First, a multiscale feature pyramid network (MFPN) for feature extraction is designed to obtain pest information at different scales. To obtain an enhanced feature pyramid network, an adaptive channel fusion module (ACFM) and a global context module (GCM) are introduced to capture recognizable multiscale contextual information, which improves the detection accuracy for pests of diverse sizes, especially small pests. Second, an adaptive feature region proposal network (AFRPN) is proposed to obtain richer features with detailed information on pests. To alleviate the disturbances from complex backgrounds, a feature adaptation module (FAM) and a two-stage RPN method are introduced to correctly locate and classify pests. Third, we created a large-scale dataset named the multi-category pests dataset 2021 (MPD2021), containing 18,595 crop pest images with 26 categories and 125,700 specimens. Finally, on the MPD2021 dataset, extensive comparison experiments showed that our new model achieves 67.3% mean average precision (mAP) and 89.3% average recall (AR), significantly outperforming other state-of-the-art methods.
To sum up, this work makes the following contributions:
(1) A large-scale dataset named MPD2021 was built, which will promote the effectiveness of applications of new object detection approaches in intelligent agriculture.
(2) An end-to-end detection method named MCPD-net is presented to detect large-scale pest images. The MFPN in MCPD-net can handle pests of various sizes, which significantly enhances the detection performance for multiscale pests.
(3) The presented AFRPN is able to solve the problem of anchor and feature inconsistency during the training iteration process, which benefits pest location and classification.
The rest of the paper is organized as follows: Section 2 describes the related works; Section 3 presents the pest image dataset description and analysis; Section 4 introduces the proposed method and technical details; Section 5 describes the experimental results; conclusions are covered in Section 6.

CNN-Based Crop Pest Detection Method
Some DCNN-based approaches have been developed to solve pest detection tasks [17][18][19][20][21][22]. Most of these pest detection methods are improvements on general object detection methods. To improve the detection performance, Deng et al. [18] detected ten categories of pests using the natural statistics model, which reached an accuracy of 85.5%. Liu et al. [19] proposed PestNet, which can detect 16 categories of pests. It contains a channel-spatial attention (CSA) module used for feature enhancement and a position-sensitive score maps (PSSM) module to encode position information. Rustia et al. [20] proposed an online semi-supervised learning method and applied it to an automated insect pest monitoring system, thereby achieving a pseudo-labeling accuracy of 96.3%. To accurately detect tiny and densely distributed pests, Li et al. [17] developed a coarse convolutional neural network (CCNN) for searching aphid cliques and a fine convolutional neural network (FCNN) for refining the regions in aphid cliques, combined as a coarse-to-fine network (CFN) that detects tiny and densely distributed aphids. In recent work, Li et al. [21] presented a DCNN-based pest detection framework to classify ten categories of pests, which achieved excellent results. Wang et al. [22] developed a novel region proposal network (S-RPN) for generating accurate object proposals and a backbone network using an attention mechanism, which achieved 89.0% AR and 78.7% mAP on 21 categories of pests. These methods improve the detection accuracy of small pests through data augmentation strategies or enhanced network structures. However, enhanced feature fusion, which is necessary for detecting pests with high inter-category similarity, has not been considered.

Feature Pyramid Network
FPN-based object detectors fuse multiscale features in a top-down and lateral connection manner to build a feature pyramid [23][24][25]. They have achieved great success on general datasets, e.g., MS COCO [26] and PASCAL VOC [27]. To integrate the balanced feature information from each resolution, Libra R-CNN [28] proposed a balanced feature pyramid (BFP) that integrates multi-level features using lateral connections and then refines balanced semantic features to reduce the imbalance between feature maps. EfficientDet [29] developed a BiFPN allowing efficient, bidirectional cross-scale connections. Some recent works have focused on adaptive feature fusion to improve the FPN's ability. Adaptive spatial feature fusion (ASFF) [30] predicts the weight factors of feature maps from different layers during feature fusion via a self-adaptive mechanism. AugFPN [31] narrows the semantic gaps between features of different scales through consistent supervision. However, the aforementioned feature pyramid models mostly use the weighted fusion of upper and lower features and do not consider the channel-wise and global context views, which contain useful information. We introduce this information by using ACFM and GCM in MFPN.

Region Proposal Network
Region proposal networks (RPNs) are frequently used in two-stage detectors, where they generate a sparse set of proposal boxes by adjusting the anchors. Traditional methods adopt selective search (SS) [32] and EdgeBoxes [33] approaches to generate proposal boxes. In this process, the imbalance between the foreground and background is increased due to the dense sampling of anchor boxes, which requires huge computations and leads to performance degradation. To address these problems, Faster R-CNN [14] replaced SS with an RPN for object proposal generation; the proposals are then refined and classified by R-CNN. Based on this, some improved solutions were developed to enhance the RPN's proposal box generation. Vu et al. [34] proposed multi-stage refinement of the anchor box in each position, followed by adaptive convolution to align the features and the anchors. In the dimension-decomposition region proposal network (DeRPN) [35], an anchor string mechanism is used to automatically match object shapes, which makes the network less sensitive to variant object shapes. Besides, to avoid small objects being overwhelmed by larger objects, DeRPN [35] designed a novel scale-sensitive loss that addresses the problem of imbalanced loss computations for objects of different scales. RPNs have also been applied to pest detection tasks. An end-to-end deep learning approach (PestNet) [19] directly used the RPN module to search for potential region proposals for objects. Karar et al. [36] proposed a mobile application that utilizes an RPN as an object bounds predictor for the detection and classification of crop pests. A channel recalibration feature pyramid network (CRA-Net) [37] proposed an adaptive anchor (AA) module used in the RPN iteration to effectively correct the mismatch between the anchor and ground-truth boxes. Building on these works, we introduce AFRPN, which combines the advantages of the above methods through FAM and a two-stage RPN method to align features with anchor boxes.

Light Trapping Device for Pest Monitoring
The appearance and main internal structure of the light trapping device for pest monitoring are shown in Figure 2a,b. This equipment was designed by Jia Duo Co., Ltd. (Hebi, China) [38]. It can be placed in fields of vegetables, rice, corn, wheat and other major crops to monitor pests. This equipment has the functions of pest trapping and photography, environmental information collection, data transmission, data analysis, etc. In addition, the proposed pest detection method provides pest classification and counting results, and the statistical results are reported in real time for automatic monitoring of pests. The multispectral light trap emits light to attract multiple categories of pests; its wavelength is changed over time according to the pests' habits. The collected pests then drop onto the collection plate at the bottom. Meanwhile, the HD camera above the tray is programmed to take photos every 15 min. The pests are swept away from the pest collection plate after being photographed to avoid gathering and overlapping. The collected images are saved with a resolution of 2592 × 1920 pixels. These pest images are sent to a cloud server, which recognizes the species and numbers of pests using a deep learning-based detection method.

Multi-Category Pest Dataset 2021 (MPD2021)
To promote the development of the field of automatic pest monitoring, some open datasets have been published so researchers can train models, such as IP102 [39] and the open access repository [40]. However, these datasets are mainly used for recognition, which does not meet the purpose of detecting multiple categories of pests in one image. In addition, the models trained using these datasets are also difficult to apply to pest images with complex scenes. To meet practical application requirements and train the detection model, a multi-category pest image dataset was built for pest detection tasks. Images of multiple categories of pests in the dataset were collected from a pest monitoring device. Each pest in the image was annotated with a bounding box by several agriculture experts using the LabelImg software. Each bounding box has information about the upper left point, height, width, and pest category. The dataset follows the PASCAL VOC data format [27], which uses XML files to record the pest labeling information (PASCAL VOC itself is an image dataset containing 20 categories of classified and annotated objects). In summary, 125,700 labeled pests in 26 categories were annotated from 18,595 images. Based on this, a new dataset named MPD2021 was created. Table 1 shows the statistics for each pest species, including the number of specimens, the average relative scale, and the average width and height of the labeled boxes. The number of specimens per species ranges from 241 for Spodoptera frugiperda (category 26) to 24,694 for Proxenus lepigone (category 7). The average width of pests ranges from 37 to 215 pixels, and the average height ranges from 35 to 211 pixels. All categories have a relative scale of less than 0.9%; the smallest average relative scale is only 0.0282%. As the MPD2021 dataset has a large number of small objects with poor feature information, it poses significant challenges for network localization and accurate recognition.
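As a rough illustration of how the per-box statistics in Table 1 could be derived from a PASCAL VOC annotation, the sketch below parses one XML file with Python's standard library. It assumes that the relative scale is the box area divided by the image area and that boxes are stored as VOC `xmin/ymin/xmax/ymax` corners; the sample annotation, its category labels, and the helper name `relative_scales` are hypothetical and not taken from MPD2021.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in PASCAL VOC layout; all values are made up.
SAMPLE_XML = """
<annotation>
  <size><width>2592</width><height>1920</height><depth>3</depth></size>
  <object>
    <name>7</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>137</xmax><ymax>235</ymax></bndbox>
  </object>
  <object>
    <name>22</name>
    <bndbox><xmin>1000</xmin><ymin>800</ymin><xmax>1200</xmax><ymax>1000</ymax></bndbox>
  </object>
</annotation>
"""


def relative_scales(xml_text):
    """Return (category, width, height, relative scale in %) for each labeled box."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    image_area = int(size.find("width").text) * int(size.find("height").text)
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        w = int(box.find("xmax").text) - int(box.find("xmin").text)
        h = int(box.find("ymax").text) - int(box.find("ymin").text)
        # Assumed definition: box area as a percentage of the image area.
        rows.append((obj.find("name").text, w, h, 100.0 * w * h / image_area))
    return rows


rows = relative_scales(SAMPLE_XML)
for name, w, h, scale in rows:
    print(f"category {name}: {w}x{h} px, relative scale {scale:.4f}%")
```

Under this assumed definition, a 37 × 35 pixel box in a 2592 × 1920 image covers roughly 0.026% of the image, which is the same order of magnitude as the smallest average relative scale reported above.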
To further analyze the distribution of specimens in MPD2021, the distribution of specimens and relative scales are shown in Figure 3a. The number of pest specimens in the MPD2021 dataset varies greatly across categories. The overall trend shows a long-tailed distribution for the number of specimens in each category, and the number of specimens in the most plentiful category is 102 times greater than that in the rarest category. For example, there are only 306 and 241 object specimens for Pleonomus canaliculatus (category 23) and Spodoptera frugiperda (category 26), respectively. To further analyze the scale problem, we analyzed the distribution of the pest objects' relative scale, as reported in Figure 3b. The relative size of most pests in MPD2021 is comparatively small, mainly 0.15-0.5%. The collected pest images were randomly divided into a training set and a test set (4:1) to train the DCNN models.
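The 4:1 random split and the imbalance ratio described above can be sketched as follows. This is only an illustration: the helper name `split_dataset`, the fixed seed, and the use of integer image ids are assumptions, and the actual split used for MPD2021 is not reproduced here.

```python
import random


def split_dataset(image_ids, train_ratio=0.8, seed=0):
    """Shuffle image ids and split them into train/test lists (4:1 by default)."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded only to make the sketch repeatable
    n_train = round(len(ids) * train_ratio)
    return ids[:n_train], ids[n_train:]


train_ids, test_ids = split_dataset(range(18595))
print(len(train_ids), len(test_ids))  # 14876 3719

# Long-tail imbalance: most plentiful vs. rarest category (counts from the text).
imbalance = 24694 / 241
print(round(imbalance, 1))  # about 102.5
```

With 18,595 images, a 4:1 split leaves 3719 images for testing, which matches the test set size used in the experiments.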

MCPD-Net Construction
The overall architecture of MCPD-net is shown in Figure 4. We propose a unified framework named MCPD-net, which consists of four parts: (1) The pest images are first collected from the image acquisition equipment and then fed into the backbone networks.

Multiscale Feature Pyramid Network (MFPN)
Some researchers developed detection approaches to address the challenges of multiscale features with different spatial resolutions. Following the setting of FPN [23], the features used to build the feature pyramid are denoted as $\{C_2, C_3, C_4, C_5\}$, which correspond to the feature maps with strides of {4, 8, 16, 32} with respect to the input images. The feature maps $\{P_2, P_3, P_4, P_5\}$ construct the feature pyramid network. On the one hand, the low-level feature maps are enhanced by the high-level semantic information; thus, the features will have diverse context information. On the other hand, there will be information loss from $C \in [C_2, C_3, C_4, C_5]$ to $P \in [P_2, P_3, P_4, P_5]$ because of the reduction in feature channels and the decrease in the scale of the feature map, which leads to global semantic information loss. Pest images contain many small specimens whose feature information is usually suppressed by complex background information and other large specimens. To address the problem of poor accuracy when detecting small objects, the typical approach only exploits the spatial information of multiscale feature maps. We argue that the information between feature channels and the global context is also essential for small objects. Thus, we designed MFPN to achieve accurate detection of multiple categories of pests. The overall framework of MFPN is shown in Figure 5. Two components of MFPN are discussed in the following subsections.
Adaptive Channel Fusion: To appropriately achieve feature fusion, we fully leverage the relationships between the feature maps. Let $C_i$ represent the $i$-th $C \in [C_2, C_3, C_4, C_5]$.
We compress the spatial information of $C_i$ into channel descriptors. Specifically, a 1 × 1 convolutional kernel is applied to the high-level feature map to unify the number of channels, and the compression can be expressed as:
$$g_c = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} F_c(x, y)$$
where $g_c$ denotes the channel feature descriptor obtained by compressing, $F_c(x, y)$ indicates each pixel point $(x, y)$ in the feature map, and $H \times W$ are the spatial dimensions of the feature map. We compute the relationships across channels using adaptive global average pooling (AGAP) and adaptive max-pooling (AMP) for a given channel descriptor $g_c \in \mathbb{R}^{H \times W \times C}$. The subsequent operations, in which $\sigma$ represents the sigmoid activation function, proceed as follows. Firstly, the operation is split into two branches: one branch applies AGAP to the feature plane $g_c \in \mathbb{R}^{H \times W \times C}$ to extract $F_1 \in \mathbb{R}^{H \times W \times C}$ from the high-level feature map, and the other branch applies AMP to the same feature plane to obtain $F_2 \in \mathbb{R}^{H \times W \times C}$. Secondly, a convolution with 1 × 1 kernels is applied to each channel descriptor, giving $F_1' = f_{1D}(F_1) \in \mathbb{R}^{H \times W \times C/8}$ and $F_2' = f_{1D}(F_2) \in \mathbb{R}^{H \times W \times C/8}$. Thirdly, the two channel descriptors $F_1'$ and $F_2'$ are passed through the Relu activation function and a second Conv1D operation $f'_{1D}$, which restores the number of channels to $C$. Finally, the two resulting channel descriptors are summed to get the final channel descriptor $A_C \in \mathbb{R}^{H \times W \times C}$. Then, $A_C$ is activated by the sigmoid function, and the Hadamard product is taken with the original high-level feature map to obtain $M_C \in \mathbb{R}^{H \times W \times C}$. The whole computation process can be summarized as follows:
$$M_C = \mathrm{Had}\big(\sigma\big(f'_{1D}(\mathrm{Relu}(f_{1D}(\mathrm{AvgPool}(y_c)))) + f'_{1D}(\mathrm{Relu}(f_{1D}(\mathrm{MaxPool}(y_c))))\big),\; y_c\big)$$
where $\mathrm{AvgPool}(y_c)$ and $\mathrm{MaxPool}(y_c)$ denote the AGAP and AMP operations for each input feature map $y_c$. $f_{1D}$ and $f'_{1D}$ denote the one-dimensional convolution operations with kernel sizes of 1 that decrease and increase the number of channels, respectively. Relu denotes the rectified linear unit activation function, and Had denotes the Hadamard product operation.
Global Context: Detection performance and stability can be improved by exploiting global context information. Therefore, a global context module is introduced to strengthen MFPN. This module is integrated after the ACFM, as shown in Figure 5. Specifically, a global average pooling (GAP) operation is first employed to extract the global information, and a 1 × 1 convolution then integrates information across channels. Finally, the output features are added to the main information stream.
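The channel fusion and global context steps described above can be sketched in plain Python on a tiny feature map. This is a minimal sketch, not the authors' implementation: it assumes that both pooling operations reduce each channel to a single value, that the two branches share the same 1 × 1 convolution weights, and that the global context is broadcast-added to every pixel. The function names, the random weights, and the toy dimensions are all hypothetical.

```python
import math
import random


def avg_pool(fmap):
    """Global average pooling: one value per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]


def max_pool(fmap):
    """Global max pooling: one value per channel."""
    return [max(map(max, ch)) for ch in fmap]


def conv1x1(vec, weight):
    """Kernel-size-1 convolution on a channel descriptor (a matrix-vector product)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def relu(vec):
    return [max(0.0, v) for v in vec]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def acfm(fmap, w_down, w_up):
    """Channel gates sigma(f'(Relu(f(avg))) + f'(Relu(f(max)))), applied per channel."""
    branch_avg = conv1x1(relu(conv1x1(avg_pool(fmap), w_down)), w_up)
    branch_max = conv1x1(relu(conv1x1(max_pool(fmap), w_down)), w_up)
    gates = [sigmoid(a + m) for a, m in zip(branch_avg, branch_max)]
    # Hadamard product: every pixel of a channel is rescaled by that channel's gate.
    scaled = [[[g * px for px in row] for row in ch] for g, ch in zip(gates, fmap)]
    return scaled, gates


def gcm(fmap, w_ctx):
    """Global context: GAP, then 1x1 convolution, then add to the main stream."""
    ctx = conv1x1(avg_pool(fmap), w_ctx)
    return [[[px + c for px in row] for row in ch] for c, ch in zip(ctx, fmap)]


rng = random.Random(0)
C, H, W, r = 16, 4, 4, 8  # toy sizes; r = 8 mirrors the C/8 reduction in the text
fmap = [[[rng.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w_down = [[rng.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(C // r)]
w_up = [[rng.uniform(-0.5, 0.5) for _ in range(C // r)] for _ in range(C)]
w_ctx = [[rng.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(C)]

attended, gates = acfm(fmap, w_down, w_up)
out = gcm(attended, w_ctx)
print(len(out), len(out[0]), len(out[0][0]))  # 16 4 4
```

The output keeps the shape of the input feature map, so in this sketch the two modules could be dropped into a pyramid level without changing the downstream layers.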

Adaptive Feature Region Proposal Network (AFRPN)
The detection problem is formulated in Faster R-CNN as a two-step procedure. The RPN is first used to generate a sparse set of proposal boxes by adjusting a set of anchors. The proposal boxes generated by the RPN are then refined and classified by a regional CNN detector. The RPN is designed to extract high-level features and predict proposals in an end-to-end way. For a feature map $F_i$ of size $w \times h$, a group of anchor boxes is initialized uniformly over the corresponding image. Each anchor box $a$ is described by a four-dimensional vector $a = (a_x, a_y, a_w, a_h)$, where $(a_x, a_y)$ denotes the center location of the anchor and $(a_w, a_h)$ denotes its width and height. The regression branch predicts the transformation $\sigma = (\sigma_x, \sigma_y, \sigma_w, \sigma_h)$ from the anchor box $a$ to the ground-truth box $g$, and the regressed anchor $a' = (a'_x, a'_y, a'_w, a'_h)$ is computed as follows:
$$a'_x = \sigma_x a_w + a_x, \quad a'_y = \sigma_y a_h + a_y, \quad a'_w = a_w \exp(\sigma_w), \quad a'_h = a_h \exp(\sigma_h).$$
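The anchor transformation above can be written as a few lines of Python. The `decode_anchor` function follows the equation in the text; `encode_target` is its algebraic inverse and is included only to check the round trip. The function names and the example boxes are hypothetical.

```python
import math


def decode_anchor(anchor, sigma):
    """Apply predicted offsets sigma = (sx, sy, sw, sh) to anchor = (ax, ay, aw, ah)."""
    ax, ay, aw, ah = anchor
    sx, sy, sw, sh = sigma
    return (sx * aw + ax, sy * ah + ay, aw * math.exp(sw), ah * math.exp(sh))


def encode_target(anchor, gt):
    """Inverse mapping: the offsets that move the anchor exactly onto the box gt."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


anchor = (100.0, 100.0, 64.0, 64.0)  # (center x, center y, width, height)
gt = (108.0, 96.0, 37.0, 35.0)       # a small ground-truth box near the anchor
sigma = encode_target(anchor, gt)
decoded = decode_anchor(anchor, sigma)
print(decoded)  # recovers gt up to floating-point error
```

Because width and height are predicted in log space, the regressed box always has a positive size regardless of the sign of the predicted offsets.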
The regressed anchors A = {a} are then filtered by non-maximum suppression (NMS) [41] to generate the sparse proposal boxes. However, in a traditional RPN, anchor boxes with different scales and aspect ratios are selected as positive samples based on an intersection over union (IoU) threshold with the labeled object. In this process, for small objects, the IoU values are usually too small to reach the set threshold. Therefore, most small object samples will be treated as negative samples during the training process.
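A small numeric example makes the small-object problem concrete. The sketch below computes the IoU between one anchor and two ground-truth boxes; the anchor size and box coordinates are hypothetical, and the 0.7 threshold mentioned in the note is only an illustrative value, since the threshold used in this work is not stated here.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


anchor = (0, 0, 128, 128)      # hypothetical 128 x 128 anchor
small_pest = (46, 46, 83, 81)  # 37 x 35 box lying fully inside the anchor
large_pest = (0, 0, 120, 110)  # larger box that almost fills the anchor

print(iou(anchor, small_pest))  # about 0.079
print(iou(anchor, large_pest))  # about 0.81
```

Even though the small box lies entirely inside the anchor, its IoU stays below 0.1, so with an illustrative positive threshold of 0.7 the anchor would not be assigned to it, whereas the large box would be matched.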
Additionally, dense anchor sampling aggravates the imbalance between foreground and background, leading to degraded performance. We propose an approach called AFRPN to systematically solve the aforementioned problems caused by the anchors and to align features with anchor boxes. The pipeline of AFRPN is shown in Figure 6. During training iteration, AFRPN uses the conventional convolution to maintain the spatial features in the first stage and then adopts the proposed FAM to compute the regression prediction in the second stage to achieve high performance.
In a conventional convolution, the input feature $x$ is sampled on a regular grid $C$ and the samples are summed with weights $w$:
$$y[l] = \sum_{c \in C} w[c] \cdot x[l + c]$$
for each location $l$ on the output feature $y$. However, in AFRPN the offset field $O$ is directly inferred from the deformable convolution [42] and replaces the regular grid $C$. The output feature $y$ then becomes
$$y[l] = \sum_{o \in O} w[o] \cdot x[l + o]$$
By learning the offset, the deformable convolution improves the spatial sampling location and ensures alignment between the anchors and features.
Two-stage RPN: A two-stage process is proposed to align anchors to features in the RPN stage. That is, the conventional convolution is used to maintain the spatial features in the first stage. In the following stages, the offset $o_\kappa$ of the input anchor $a_\kappa$ on the feature map is computed by FAM. Then the regression prediction $\gamma_\kappa = f_\kappa(x, o_\kappa)$ and the regressed anchor $a_{\kappa+1}$ from $\gamma_\kappa$ are computed using Equation (6). In the end, the object scores are calculated by the classifier and then filtered by NMS processing to generate region proposals.
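The final filtering step can be illustrated with a greedy NMS sketch. It is a generic textbook version rather than the routine used in this work; the function names, the example boxes and scores, and the 0.5 threshold for the proposal stage are assumptions.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82 and is suppressed, while the distant third box is kept as a separate proposal.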

Results
This section briefly describes the evaluation metrics, training parameters, experimental details, and results of experiments on the constructed MPD2021 dataset.

Evaluation Metrics and Parameter Settings
Several evaluation metrics are used to evaluate the effectiveness of different approaches on the multi-category pest dataset. Average precision (AP) and mAP are used as the main evaluation metrics. AP is the area bounded by the precision-recall curve. In addition, $AP_{50}$ (IoU = 0.5), $AP_{75}$ (IoU = 0.75), average recall (AR), $AR_{50}$ (IoU = 0.5), the number of model parameters, and FLOPs are used as auxiliary evaluation metrics to demonstrate the ability of MCPD-net. The calculation formulas are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad AP = \int_0^1 P(R)\, dR, \quad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively, $P(R)$ is the precision at recall $R$, and $N$ is the number of pest categories.
In MCPD-net, the backbone network is ResNet-50, which was pretrained on the ImageNet [43] dataset. The images were resized to 1333 × 800 pixels during the training and validation stages. Moreover, the model was optimized during the training phase using the stochastic gradient descent (SGD) method. Specifically, the learning rate was 2.5 × 10⁻³ for the first eight epochs and then decayed with a step policy for the following epochs, and the momentum and the weight decay values were 0.9 and 0.0001, respectively. The batch size was 4. We applied NMS with an IoU threshold of 0.5 per category during the validation and testing stages. The code was developed based on the MMDetection toolbox [44]. All experiments were run on Dell Precision T3630 workstations equipped with an Intel Core i9-9900K CPU and an NVIDIA RTX 2080Ti (24-GB memory) GPU; the software environment was Ubuntu 18.04, CUDA 10.1, cuDNN 7.6, and Python 3.7.
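The metrics above can be sketched in a few lines of Python. This is an illustrative VOC-style computation at a single IoU threshold with all-point interpolation; whether the reported numbers use exactly this variant is not stated in the text, and the function names and the toy detection list are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_tp, num_gt):
    """AP for one category: area under the precision-recall curve.

    is_tp lists detections sorted by descending confidence (True means the
    detection matched a ground-truth box at the chosen IoU threshold);
    num_gt is the number of ground-truth boxes of that category.
    """
    precisions, recalls = [], []
    tp = fp = 0
    for hit in is_tp:
        tp += hit
        fp += not hit
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make precision non-increasing from right to left (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_category):
    """mAP: mean of the per-category AP values."""
    return sum(ap_per_category) / len(ap_per_category)


ap = average_precision([True, True, False, True, False], num_gt=4)
print(ap)                                # 0.6875
print(precision_recall(3, 2, 1))         # (0.6, 0.75)
print(mean_average_precision([ap, 0.5]))  # 0.59375
```

In the toy example, three of the five detections are correct and one of the four ground-truth boxes is missed, which gives a precision of 0.6, a recall of 0.75, and an AP of 0.6875.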

Quantitative Analysis
In this section, the performance of MCPD-net is first compared with those of state-of-the-art object detectors. Then, extensive ablation experiments on the MPD2021 dataset are reported to validate the effectiveness of the proposed modules in MCPD-net. The test set of the MPD2021 dataset contains 3719 pest images. Table 2 presents the comparison between the AP and AR of MCPD-net and other CNN models. MCPD-net yielded 38.3% mAP, which surpasses all compared detectors. Our proposed method outperformed SSD (one-stage) and FCOS (anchor-free), achieving 6.4% and 5.2% mAP improvements, respectively. Compared with PAFPN and Mask R-CNN (multi-stage), our method achieved 3.2% and 3.6% AP improvements, respectively. Given these results, our method has the best performance. At fixed IoU thresholds, our method achieved 67.3% $AP_{50}$ and 40.4% $AP_{75}$, which are higher than those of the other detection approaches. Additionally, our method had a 55.4% AR, which indicates that it is more precise in object localization. There are significant differences in the results for specific categories, as shown in Table 3. Our proposed approach significantly outperformed other detection methods in most pest categories. Nilaparvata lugens (category 1) seemed to be the most difficult to detect and had the lowest AP, 28.8%. Almost all models could successfully detect Gryllotalpa orientalis (category 22), with 94.0% AP. This is because effective features are harder to extract from tiny pests than from larger pests. Furthermore, about 12 categories had over 70% detection accuracy, and the accuracy of almost all pest categories increased by using MCPD-net. The detection results for the small pests Nilaparvata lugens, Plutella xylostella, Cnaphalocrocis medinalis, and Melanotus caudex (categories 1, 2, 12, and 25) showed different degrees of improvement.

Ablation Experiments
(a) Baseline setup: The baseline network is Faster R-CNN with a ResNet-50 backbone. It can be seen in Table 4 that Faster R-CNN can quickly detect pest images at 22 FPS. However, the detection performance for small pests was not satisfactory. For example, the AP for Nilaparvata lugens (category 1) was only 16.5%.
(b) Effect of MFPN: It can be seen in Table 4 and Figure 7 that the detection results were improved by adding the MFPN module. The detection AP of small pests showed significant improvements with the MFPN module. For instance, the detection results of tiny pests (categories 1 and 12) were improved by about 12.3% and 7.6% AP, respectively. The detection results for pests that are highly similar in appearance (categories 8, 11, and 23) were also greatly enhanced. Additionally, our method achieved 64.9% $AP_{50}$ and 80.7% $AR_{50}$, which indicates that it is more precise in detection and more accurate at object localization. The detection results presented in Figure 7a were obtained with a multiscale structure that fuses both high-level and low-level layers, which shows that the MFPN is effective.
(c) Effect of AFRPN: There are many easy negative samples during the training stage, which can lead to poor results. The AFRPN was proposed to solve this problem. As shown in Table 4, we achieved 87.8% $AR_{50}$ after adding the AFRPN, which is a significant enhancement over the baseline. Figure 7 presents the effectiveness of adding the AFRPN module, particularly for small pests. As a result, the overall detection accuracy slightly increased to 61.8% $AP_{50}$. Finally, the whole detection framework (+ MFPN + AFRPN) achieves the best pest detection result at 17 FPS. Although the inference speed is slightly slower than that of the baseline (22 FPS), the detection framework is suitable for accurate pest detection in real-world application scenes.

Visualization Analyses
Some visualization analyses were conducted to evaluate the proposed MCPD-net. As shown in Figure 8, MCPD-net has promising detection performance for tiny pests. Additionally, the detection and visualization results of different approaches on the MPD2021 dataset are shown in Figure 9. The visualizations of feature maps were obtained with the Grad-CAM [45] method. As presented in Figure 9a,b, the context information learned with the SSD method was not sufficient to accurately identify the pests. Figure 9c,d shows that MCPD-net obtains more pest feature information. Figure 9e,f shows the pest detection results in a complex background. As shown in Figure 9g,h, the contextual information of the pests is much richer, which indicates that MCPD-net predicts more accurately and misses fewer pests.
The positive anchor samples in AFRPN are illustrated in Figure 10. The anchor boxes (colored) can cover the ground-truth boxes (white) through the learning of FAM in AFRPN. In addition, the shapes of the anchor boxes are close to those of the ground-truth boxes. This shows that our method improves the anchor prediction performance, so the localization capability is enhanced. The detection results for some typical images from the MPD2021 dataset are shown in Figure 11. Our method handles complex backgrounds and small, densely distributed pests well. Rows 1, 3, and 5 show the ground-truth boxes (red boxes). Row 2 of Figure 11 shows the detection results with multiple pest specimens in each image. MCPD-net can detect almost all of the labeled boxes. The detection results with complex backgrounds and dense pest distribution are shown in rows 4 and 6. The fine-grained information of multiple categories of pests is more distinct with our method, and the regression of pest boxes is more accurate. These results demonstrate the generalization performance of MCPD-net for multi-category pest detection tasks on large-scale pest images.

Discussion
China is a large agricultural country. The main crops grown include rice, wheat, corn, etc. However, these crops are prone to reduced quality and yield due to crop pests. The widespread application of chemical pesticides has become an important means to prevent and control crop pests. However, farmers often tend to blindly use pesticides in large quantities when they cannot accurately identify pests, which can cause ecological environmental pollution and soil contamination. Thus, agricultural experts with professional knowledge are badly needed to help them recognize pests. However, traditional pest recognition is usually labor-intensive and time-consuming. Intelligent light trapping devices can automatically attract many species of pests and capture images, which greatly reduces the workload of these experts. Although in practical applications light trapping devices also trap pollinators and beneficial insects, the number of devices deployed is not large, so they will not have a huge impact on the environment. The collected images are sent to a cloud server for analysis and monitoring. We proposed an object detection method named MCPD-net to automatically monitor multiple categories of pests and replace the manual observation method.
The proposed detection method can detect 26 categories of widespread agricultural pests (see Figure 1). When compared with other deep learning-based methods, the proposed MCPD-net achieved the highest detection accuracy, as shown in Table 3. The advantages of MCPD-net have been verified as follows. First, similar pests in images with complex backgrounds can be successfully detected. Second, MCPD-net is suitable for real-time detection of crop pests without prior knowledge of which pest species appear in the acquired images. The proposed MFPN and AFRPN significantly improve the detection accuracy from the perspective of enhanced feature extraction, as shown in Figures 5 and 6.
Our next research goal will be to further extend the number of pest categories in the dataset. This will help farmers apply chemical pesticides precisely to protect fields from further damage. Although MCPD-net has achieved excellent experimental results, it still needs to be improved. For example, for Nilaparvata lugens (category 1) and Plutella xylostella (category 12), the detection results were only 28.8% AP and 35.6% AP, which are worse than those for other categories. These pests are too small, occupying only 0.0282% and 0.0545% of the pixels of the entire image, respectively. Our future work will aim to improve the detection accuracy for tiny pests.

Conclusions
In order to replace manual recognition methods with computer vision methods for automatic monitoring of pests in crop fields, we proposed a novel end-to-end method named MCPD-net that can be applied to detect 26 species of crop pests. MCPD-net consists of an MFPN for obtaining multiscale pest features and a novel AFRPN that keeps the anchor boxes and features consistent. Extensive experiments were conducted on the MPD2021 dataset. MCPD-net achieved 67.3% AP and 89.3% AR, surpassing other state-of-the-art methods.

Acknowledgments:
The authors would like to thank Jia Duo Co., Ltd. for providing data support.

Conflicts of Interest:
The authors declare that there is no conflict of interest.