AgriPest: A Large-Scale Domain-Specific Benchmark Dataset for Practical Agricultural Pest Detection in the Wild

The recent explosion of large-scale standardized datasets of annotated images has offered promising opportunities for deep learning techniques in effective and efficient object detection applications. However, due to the large quality gap between these standardized datasets and practical raw data, how to maximize the utilization of deep learning techniques in practical agricultural applications remains a critical problem. Here, we introduce AgriPest, a domain-specific benchmark dataset for tiny wild pest recognition and detection, providing researchers and communities with a standard large-scale dataset of practically wild pest images and annotations, as well as evaluation procedures. Over the past seven years, we captured the 49.7K images of AgriPest, covering four crops and 14 species of pests, using our purpose-built image collection equipment in field environments. All of the images are manually annotated by agricultural experts with up to 264.7K bounding boxes locating the pests. This paper also offers a detailed analysis of AgriPest, where the validation set is split into four types of scenes that are common in practical pest monitoring applications. We explore and evaluate the performance of state-of-the-art deep learning techniques on AgriPest. We believe that the scale, accuracy, and diversity of AgriPest offer great opportunities to researchers in computer vision as well as pest monitoring applications.


Introduction
Object detection is a classic research topic in the computer vision community. The current large volume of standardized object detection datasets [1][2][3] helps to explore many key research challenges related to object detection and to evaluate the performance of different algorithms and technologies. In particular, the recent popularity and development of deep learning techniques has demonstrated that, given sufficient high-quality annotated image datasets, deep learning approaches [4][5][6] can effectively and efficiently accomplish detection and classification tasks. This has led to practical breakthroughs in many classic applications, including face recognition [7] and vehicle detection [8]. However, in some domain-specific object detection applications, there is a large quality gap between standardized annotated datasets and practical raw data. This leads us to the obvious question: how can we maximize the utilization of deep learning techniques in practical applications?
Taking a typical object detection task in smart agriculture as an example, current pest monitoring requires precise pest detection and population counting in static images. Computer vision based automatic pest monitoring techniques have been widely used in real practice for this purpose. These techniques deal with pest images captured from fixed stations and adopt traditional image processing algorithms to analyze pest-associated features for detection [9]. Most existing solutions formulate this as a whole-image classification task [10][11][12]. However, in practical applications, wild pest detection, which requires not only classification but also localization, may be much more important for pest hazard assessment, since precise detection provides higher-level semantic information, such as pest occurrence areas and pest population counts in the field.
Although recent deep learning approaches have shown great success in image recognition [13] and generic object detection applications [14][15][16], they are often not ready-to-use in practice and fail to deliver satisfactory performance on pest detection and classification. The main reasons are: (1) compared with generic object detection, pest detection in the wild remains an open problem, because many discriminative details and features of the objects are small, blurred, or hidden. This poses a fundamental dilemma: it is hard to distinguish small objects from clutter in the background. (2) The diversity and complexity of scenes in the wild cause a variety of challenges, including dense distribution, sparse distribution, illumination variations, and background clutter, as shown in Figure 1. These types of scenes increase the difficulty of applying generic object detection techniques to the tiny wild pest detection task. It is well known that large-scale image datasets play a key role in driving efficient models and enabling powerful feature representations. In the field of agricultural pest control, the first challenge is how to select the field crops and pest species for the large-scale dataset to build a hierarchical taxonomy. From the practical point of view of pest reduction, we consider the field crops that account for a large share of world food production. Under this consideration, the Food and Agriculture Organization of the United Nations (FAO) reports that rice (paddy), maize (corn), and wheat are the three major field crops for food production, providing about 700 M, 1000 M, and 800 M tonnes in 2019 [17]. In addition, rape is planted over a large area in Asia. Among these crops, certain insects and other arthropods are serious agricultural pests, causing significant crop losses if not controlled. Some of them, e.g., moth larvae (Lepidoptera), directly feed on the rhizomes and leaves of crops, while others, such as aphids and leafhoppers, mainly feed on non-harvested portions of the plant or suck plant juices [18]. Due to the damage caused by these pests, an estimated 18-20% of annual crop production worldwide is destroyed, valued at more than $470 billion [19].
When considering the targeted field crops and pest species, using computer vision for pest monitoring calls for a domain-specific dataset. However, current public datasets for agricultural pest recognition and detection have several limitations: (1) most of them cover a small number of samples [20,21], which results in poor generalization, so models may fail to recognize pests in their varied poses. (2) Many datasets target the pest recognition problem, in which pest objects occupy a large portion of the image [22,23]; however, pests typically appear at tiny sizes in real-life scenes. Moreover, most images in these datasets contain only one insect pest category, which is unusual in practical pest images. (3) Some datasets collect images in laboratory or other non-field environments, using trap devices or the Internet; such pest images have very simple backgrounds, making it difficult to cope with the complexity of practical fields [9,24].
In this paper, we introduce a domain-specific benchmark dataset, called AgriPest, for tiny wild pest detection, providing researchers and communities with a standard large-scale dataset of practically wild pest images and annotations, as well as standardized evaluation procedures. Different from other public object detection datasets, such as MS COCO [1] and PASCAL VOC [2], which are collected by searching the Internet, AgriPest is built with task-specific image acquisition equipment of our own design. We spent over seven years collecting the images due to seasonal and regional difficulties. AgriPest captures 49.7K images of four field crops and 14 species of pests in the field environment. All of the images are manually annotated by agricultural experts with up to 264.7K bounding boxes locating the pests. This paper also offers a detailed analysis of AgriPest, where the validation set is split into four types of scenes that are common in practical pest monitoring applications. For practical precision agriculture, AgriPest can provide a large amount of valuable information for precise pest monitoring and thus help reduce crop production losses. Specifically, current agricultural automation systems could deploy a deep learning pest detector to build effective pest management policies, such as the choice and concentration of pesticides, natural-enemy control, and production estimation. We believe our efforts will benefit future precision agriculture and agroecosystems.
The major contributions of this paper are three-fold:
• To the best of our knowledge, AgriPest is the largest-scale domain-specific dataset for tiny pest detection research, containing more than 49.7K images and 264.7K annotated pests. This benchmark will significantly promote the effectiveness and usefulness of new object detection approaches in intelligent agriculture, e.g., crop production forecasting.
• AgriPest defines, categorizes, and establishes a series of detailed and comprehensive domain-specific sub-datasets. It first covers two typical challenges: pest detection and pest population counting. It then categorizes the validation subsets of AgriPest into four types of scenes that are common in practical pest monitoring applications: dense distribution, sparse distribution, illumination variations, and background clutter.
• Accompanying AgriPest, we build practical pest monitoring systems based on deep learning detectors deployed on the task-specific equipment, and we give comprehensive performance evaluations of state-of-the-art deep learning techniques on AgriPest. We believe that AgriPest provides a feasible benchmark dataset and will facilitate further research on the pest detection task. Our dataset and code will be made publicly available.

Related Work
The emergence of deep learning techniques has led to significantly promising progress in the field of object detection [25], including SSD [4], Faster R-CNN [5], Feature Pyramid Network (FPN) [6], and extended variants of these networks [26][27][28][29]. CNNs have exhibited superior capacity in learning invariance across multiple object categories from large amounts of training data [23]. They enable suggesting object proposal regions in the detection process and extracting more discriminative features than hand-engineered ones. Experimental results on the MS COCO [1] and PASCAL VOC [2] datasets show that Faster R-CNN [5] is an effective region-based detector for general object detection in the wild, with an Average Precision (AP) of up to 42.7% at IoU 0.5. In Faster R-CNN, Region-of-Interest (RoI) pooling extracts features from a single-scale feature map. For small object detection, in contrast, FPN [6] is the state-of-the-art technique on the MS COCO dataset, with AP up to 56.9% at IoU 0.5. By building a multi-scale feature pyramid, FPN enables a model to detect objects across a large range of scales over both positions and pyramid levels. This property is particularly useful for tiny object detection.
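As a concrete illustration of why the pyramid helps tiny objects, the RoI-to-level assignment rule from the FPN paper can be sketched as follows; the canonical size 224 and the level bounds P2-P5 follow that paper, and this is only an illustrative sketch, not code from the AgriPest pipeline:

```python
import math

def fpn_level(box_w, box_h, k0=4, canonical=224, k_min=2, k_max=5):
    """Assign an RoI to a pyramid level (P2-P5), following the FPN paper:
    k = floor(k0 + log2(sqrt(w * h) / 224)). Smaller boxes map to finer
    (higher-resolution) pyramid levels, which suits tiny pests."""
    k = math.floor(k0 + math.log2(math.sqrt(box_w * box_h) / canonical))
    return max(k_min, min(k_max, k))

# A tiny 20x20 px pest box falls on the finest level P2,
# while a 224x224 box maps to the canonical level P4.
print(fpn_level(20, 20))    # -> 2
print(fpn_level(224, 224))  # -> 4
```

Since most pests in the dataset occupy well under 1% of the image, nearly all RoIs would land on the finest levels, where spatial detail is best preserved.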
Benefiting from the success of these object detection methods, many applications have been developed in recent years [30][31][32]. For pest detection in the wild, however, deep learning methods may not achieve satisfactory performance, because an effective object detection application using deep learning usually needs to be trained on a sufficiently large dataset. Although a few datasets exist for agricultural issues [33,34], most public datasets of tiny objects, especially agricultural pest images, cover a limited data volume, which restricts deep learning methods for pest detection [21][22][23]. Besides, many current pest-related datasets are collected in controlled laboratory or non-field environments, which cannot satisfy the practical requirements of in-field pest monitoring applications [24]. Moreover, these datasets mainly focus on the pest recognition task rather than pest detection, so the pest objects occupy a large portion of the image [20]. In contrast, our proposed AgriPest is built to address practical issues in pest monitoring applications: all images are collected in wild fields, and each pest is annotated with a bounding box for detection as well as pest population counting.

Taxonomy
IP102 provides a pest taxonomy of 102 pest species [23]. However, many of these pest insects do not require control in practical agriculture, because they cause little harm to certain types of crops. Besides, several works point out that rice and wheat are two major crops degraded by pests [35]. Therefore, we reform the pest taxonomy of IP102 and focus on pests occurring in four types of crops. Finally, we obtain 14 categories of pests in four super-classes corresponding to four common field crops: wheat, rice, corn, and rape. Within these super-classes, each pest is a subordinate class (also known as a sub-class) of a super-class. For example, rice planthopper (RPH) is a sub-class that damages rice, one of the super-classes. With this taxonomy, we build a hierarchical structure of pest categories in AgriPest; a sample of each category is visualized in Figure 2.

Image Acquisition
Current datasets, such as MS COCO, usually collect images using Google or Bing image search. However, most images on the Internet are not suitable for building a practical pest monitoring application. Besides, since the pest monitoring task is novel and specific, ordinary cameras may not be convenient for capturing pests at the roots of crops; thus, no off-the-shelf image acquisition device fits our task. To make the captured images representative of practical pest occurrence in wild fields, we set the following requirements: (1) each image must contain at least one of the pest species discussed in Section 3.1; (2) the distance between camera and pest should vary, to increase the diversity of AgriPest; (3) all captured pests must show the different poses and gestures they exhibit in the real world, and overlap among pests is allowed. To meet these requirements, we design task-specific equipment for wild pest image collection, whose structure is illustrated in Figure 3. The apparatus mounts three components on a stand: a mobile client, a CCD camera, and a temperature-humidity sensor. When using this equipment in a field crop, we first adjust the stand height according to the pest locations of the crop, e.g., higher than the crop for wheat, since most pests occur on the leaves. Subsequently, we deploy the mobile client and CCD camera on the stand and randomly rotate the hinge of the stand so that the CCD camera covers various viewpoints during image capturing. The CCD camera is set to a 4 mm focal length with an aperture of f/3.3. Meanwhile, the mobile client is connected to the CCD camera over a wireless network to help users photograph pest images conveniently.
In addition, we adopt a temperature and humidity sensor to record high-level environmental information that assists the pest annotation process, since certain pest species occur only under specific environmental conditions. With this acquisition setup, we capture numerous pest images from field crops. We deliberately photograph images in various typical places to improve diversity and balance the distribution of the AgriPest dataset. Furthermore, the candidate images are manually filtered to eliminate those containing few pests. The total number of images captured in AgriPest is 49.7K.

Professional Data Annotation
Because our large-scale dataset requires labeling numerous object instances across 49.7K agricultural images, we invited 20 agricultural experts, experienced and knowledgeable in the agricultural domain, to annotate the images filtered from the raw data. Specifically, our image annotation team consists of researchers from the Academy of Agricultural Sciences and associate professors in the School of Agriculture and Forestry. To guarantee the correctness of the annotations, each expert focuses on pest species of only one super-class, so the invited experts are divided into four groups, each responsible for annotating the corresponding crop; this ensures that every image is annotated by at least five agricultural experts. In data annotation, the images are first categorized into their super-classes. The super-class can be assigned perfectly, because the image source is recorded during collection. Subsequently, the expert groups annotate the pest species and their locations with bounding boxes. Finally, all experts jointly check the correctness of each labeled instance. The final annotations follow this criterion: a bounding box and its category are accepted only when agreed upon by more than five experts. The bounding box annotations of pests follow the Pascal VOC format.
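Since the annotations follow the Pascal VOC format, they can be read with a minimal stdlib parser; the sketch below is illustrative (the sample annotation string and the class abbreviation RPH are hypothetical, and real VOC files contain additional fields such as `size` and `filename`):

```python
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_str):
    """Parse a Pascal VOC annotation string into a list of
    (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_str)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes

# Minimal example annotation ("RPH" = rice planthopper).
sample = """<annotation>
  <object><name>RPH</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>34</xmax><ymax>41</ymax></bndbox>
  </object>
</annotation>"""
print(parse_voc_annotation(sample))  # -> [('RPH', 10, 20, 34, 41)]
```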

Dataset Structure and Splits
To validate the practical application value of AgriPest, we randomly split all images into training and validation subsets at the sub-class level. In total, AgriPest is split into 44,716 training and 4991 validation images for the wild pest detection task. We keep a similar split ratio across super-classes, which ensures that the distribution of the validation subset matches that of the training subset. Table 1 gives the detailed splits of the two subsets. Note that pests occupy at most 3% of the whole image area, because we aim to detect pests of tiny sizes. Furthermore, to investigate the various types of scenes in practical pest detection, we manually split the validation subset into four types of scenes: dense distribution, sparse distribution, illumination variations, and background clutter, which are typical scenes in pest monitoring applications. Note that there are gaps among the four validation subsets, e.g., Wheat Sticky is absent from the "dense" subset. This is explained by the varying habits of pest species: several kinds of pests damage fields without group occurrence, while others usually gather into cliques in field crops. Table 2 gives the detailed statistics of these four challenges.
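The class-ratio-preserving split described above can be sketched as a simple stratified split. The function below is illustrative rather than the authors' actual split script; the 10% validation ratio is an assumption chosen to roughly mirror the reported 44,716/4991 split:

```python
import random
from collections import defaultdict

def stratified_split(samples, val_ratio=0.1, seed=0):
    """Split (image_id, super_class) pairs into train/val lists while
    keeping a similar class ratio in both subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for img, cls in samples:
        by_class[cls].append(img)
    train, val = [], []
    for cls, imgs in by_class.items():
        rng.shuffle(imgs)
        n_val = max(1, round(len(imgs) * val_ratio))  # at least one val image per class
        val += imgs[:n_val]
        train += imgs[n_val:]
    return train, val

# Toy example with two super-classes of 50 images each.
data = [(f"img{i}", "rice" if i % 2 else "wheat") for i in range(100)]
train, val = stratified_split(data, val_ratio=0.1)
print(len(train), len(val))  # -> 90 10
```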

Comparison with Other Datasets
We compare AgriPest with several existing datasets from two aspects, i.e., comparison with generic object detection datasets and comparison with datasets that are related to the task of insect pest recognition or detection to further motivate the construction and usage of our dataset. Table 3 illustrates the comparison.
Compared to the PASCAL VOC dataset, one of the largest and most typical generic object detection datasets, AgriPest contains over four times as many sample images and eight times as many annotated objects. In addition, both PASCAL VOC and MS COCO organize many common categories of objects, so the average size of the targeted objects is large (16.76% and 7.74% of the whole image area, respectively). In contrast, for in-field tiny pest objects, AgriPest reflects their real-life body sizes: pests occupy on average only 0.16% of the whole image area, dozens of times smaller than objects in generic detection datasets, as shown in Figure 4. While the traditional generic object detection task supports only a single-depth taxonomy hierarchy and a single-scenario test set, pest monitoring requires a more complex validation methodology. Therefore, AgriPest provides hierarchical categories for pest samples and multi-scenario validation, as shown in Table 3.
With respect to other existing insect pest datasets, AgriPest also offers clear advantages. Most other datasets target the pest classification task [20][21][22] (Figure 5a), whereas the insect pests in AgriPest images are tiny. In terms of background, a few current pest detection datasets contain images captured in non-field environments [23] (Figure 5b). Under these limitations, most existing insect pest datasets are difficult to apply to practical pest monitoring applications. AgriPest targets the tiny pest detection task to meet the requirements of practical applications. Furthermore, AgriPest covers a larger number of images collected in wild fields than current insect pest datasets (Figure 5c), as detailed in Table 3.

Experimental Settings
For the pest recognition and detection tasks, feature representation is the most significant component. To comprehensively evaluate our AgriPest dataset, we adopt deep learning architectures as benchmarks. For the pest detection task, we select several state-of-the-art methods, categorized into one-stage architectures and two-stage region-based architectures: SSD [4], RetinaNet [36], FCOS [37], Faster R-CNN [5], FPN [6], and Cascade R-CNN [38].
We choose VGG16 [39] as the CNN backbone for SSD and ResNet-50 [40] for the other detection approaches; all backbones are pretrained on ImageNet [13] and then fine-tuned on AgriPest. For fair comparison, the learning algorithms and hyper-parameters are kept the same, and all models are trained until optimal. Specifically, mini-batch Stochastic Gradient Descent [41] is used as the optimizer with a batch size of 2. The base learning rate is set to 0.01 with a step decay policy, in which the learning rate is multiplied by 0.1 at the 8th and 11th epochs of a 12-epoch schedule, following Detectron [42]. The weight decay and momentum are set to 0.0001 and 0.9, respectively. The experiments are implemented in PyTorch and run on two NVIDIA 1080Ti GPUs with 11 GB of memory each.
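The learning-rate policy described above can be sketched as a plain step schedule; whether the milestones are counted from epoch 0 or epoch 1 is an assumption here, and in PyTorch the same effect is usually obtained with `torch.optim.lr_scheduler.MultiStepLR`:

```python
def learning_rate(epoch, base_lr=0.01, milestones=(8, 11), gamma=0.1):
    """Step schedule used in the experiments: the base rate 0.01 is
    multiplied by 0.1 at each milestone of a 12-epoch (Detectron 1x) run."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr

# Epochs 0-7 keep the base rate; the rate then decays at each milestone.
for epoch in range(12):
    print(epoch, learning_rate(epoch))
```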

Evaluation Metrics
To evaluate the performance of CNN models on AgriPest, we employ several comprehensive metrics. AgriPest involves two sub-tasks for pest monitoring: pest detection and pest population counting. Firstly, for pest detection we use Average Precision (AP) averaged over Intersection over Union (IoU) thresholds in [0.50:0.05:0.95], as well as AP0.50 and AP0.75. The IoU is defined as the intersection over the union between a predicted box and the ground truth. Precision and Recall are also employed, describing false positive reduction and the misdetection rate, respectively. Secondly, for the pest population counting challenge, we evaluate models with the Mean Absolute Error (MAE) and Mean Squared Error (MSE), following the convention of the crowd counting task [43]. The MAE and MSE are averaged across classes. Generally, MAE measures pest population counting accuracy, while MSE measures the robustness of the estimates.
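The two building blocks of these metrics can be sketched in a few lines. Note that, following the crowd-counting convention [43], the reported "MSE" is commonly the root of the mean squared error; the sketch below assumes that convention:

```python
import math

def iou(a, b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def mae_mse(pred_counts, true_counts):
    """Counting metrics: MAE measures accuracy, and MSE (reported here
    as the root of the mean squared error) measures robustness."""
    n = len(true_counts)
    mae = sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / n
    mse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts)) / n)
    return mae, mse

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # overlap 25, union 175 -> ~0.1429
```

A detection counts as a true positive at threshold t only when its IoU with a ground-truth box is at least t, which is why AP drops sharply for tiny boxes at stricter thresholds.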

Wild Tiny Pest Detection Results
On the AgriPest dataset, we conduct experiments to evaluate the performance of existing approaches. We select six state-of-the-art object detection methods for comparison, three of which are one-stage architectures (SSD512, RetinaNet, and FCOS), while the other three are two-stage methods (Faster R-CNN, FPN, and Cascade R-CNN). Table 4 shows the multi-class tiny pest detection performance of these methods. Generally, two-stage architectures achieve better performance than one-stage methods, by approximately two to four AP points. This can be explained by the fact that most pests in AgriPest have tiny sizes, so the coarse-to-fine detection strategy adopted by region-based methods leads to more precise pest classification with finer features. This effect is much more pronounced on smaller objects. For example, pest CP, which occupies only 0.006% of the image area, shows an AP gap of over 10 points between the two types of methods. Among these approaches, SSD512 performs poorly on most pest categories. This indicates that, when the image is scaled down to 512 × 512 resolution, the features of tiny pests become hard to extract, and current state-of-the-art methods still cannot satisfy real-world applications.
In addition, we report the detection results using AP0.50, AP0.75, and AP[0.50:0.05:0.95] as metrics in Tables 5 and 6. As can be observed, most methods obtain satisfactory performance at IoU 0.5, but suffer a significant decrease when a higher IoU threshold is set. Thus, existing object detection methods may not work well for highly precise pest localization, because many ground truths are too small to localize. Overall, these results demonstrate both the difficulty of wild tiny pest detection on AgriPest and its research value.


Precision-Recall Analysis
To further analyze the detailed detection results, we evaluate Precision-Recall (PR) curves for the six object detection methods on AgriPest, shown in Figure 6. These methods achieve satisfactory performance for most pest categories, especially those with relatively large sizes, such as pests SW, DP, and GM. However, for a few classes, such as RM and CP, existing methods do not perform well on misdetection reduction (low recall). Furthermore, precision drops dramatically with even a slight improvement in recall, indicating that false positive reduction is also not well handled. This can be attributed to two reasons. Firstly, for these 'hard' categories, a large number of pests are densely distributed in each image (around 60 pests per image for RM and 50 for CP), leading to poor recall, which is also consistent with the low AP in Table 2. Secondly, the training samples for RM and CP are insufficient in AgriPest: there are only 189 and 193 training images, respectively. In this case, models may not effectively learn highly discriminative features for these pests from their background context.

To evaluate the influence of various scenes on wild pest detection, Table 7 reports the detection results under dense distribution, sparse distribution, illumination variations, and background clutter, using Cascade R-CNN [38] as the pest detector. The results show that sparsely distributed pests are the easiest to detect, with more than 70% AP obtained for most pest species, while dense distribution is the most difficult challenge for wild tiny pest detection. This is in line with the observation that most object detection approaches do not detect pests RM and CP well, as these usually occur in dense cliques in the field.
On the contrary, for pest species that do not gather together, the detector performs well even when the background is cluttered. Apart from the influence of distribution, illumination variation is also an unavoidable challenge in practical pest monitoring.

Pest Population Counting Results
In AgriPest, pest population counting is another task for practical pest monitoring applications, because precise population estimation is important in assessing crop damage and pest severity. Table 8 presents the average Mean Absolute Error (MAE) and Mean Squared Error (MSE) of the six object detection methods for pest population counting. As can be seen, Faster R-CNN achieves the best results in both MAE and MSE. Moreover, two-stage approaches dramatically outperform one-stage approaches in this task. Thus, for tiny pest counting, region-based methods maintain the correctness of the detected pest population more precisely.
In Figure 7, we compare these object detection methods in greater detail. We evenly group the test images of each class into five groups by pest population, in increasing order. From this figure, it can be seen that in groups 1 and 2, where images contain few pests, the six methods show similar counting performance, which indicates that these approaches rarely detect pests incorrectly. As the number of pests increases, the errors, both absolute and squared, grow rapidly, and the gap between the two types of methods also becomes larger. Therefore, two-stage methods are more accurate and more robust to large variance in pest number and density.
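Deriving population counts from a detector's output can be sketched as thresholded per-class counting, where each sufficiently confident detection counts as one pest; the confidence threshold of 0.5 and the (class, score) output shape are assumptions for illustration:

```python
from collections import Counter

def count_pests(detections, score_thresh=0.5):
    """Derive per-class pest population counts from detector output,
    given as (class_name, confidence_score) pairs: each detection at or
    above the confidence threshold counts as one pest."""
    counts = Counter()
    for cls, score in detections:
        if score >= score_thresh:
            counts[cls] += 1
    return counts

# Hypothetical detections on one image (RPH = rice planthopper, CP as in the paper).
dets = [("RPH", 0.9), ("RPH", 0.6), ("RPH", 0.3), ("CP", 0.8)]
print(count_pests(dets))  # -> Counter({'RPH': 2, 'CP': 1})
```

Under this scheme, a missed detection lowers the count and a false positive raises it, which is why the detection-quality gap between one-stage and two-stage methods carries over directly to MAE and MSE.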

Limitations and Future Work
Although we have implemented some state-of-the-art object detection approaches with good performance on AgriPest, two limitations remain for future study. Firstly, the problem of unbalanced data has not been well solved. Specifically, pests RM and CP are two difficult categories in wild pest detection, because AgriPest does not contain sufficient data for models to learn them, while they usually occur in cliques with tiny sizes, as visualized in Figure 8. Secondly, directly employing existing generic object detection approaches for the wild tiny pest detection task is not an adequate solution. Future work will focus on covering a larger number of categories and on developing a novel domain-specific algorithm for this task.

Conclusions
In this work, we collect a domain-specific benchmark dataset, named AgriPest, for large-scale tiny pest detection in the wild. Our dataset covers 49.7K images and 264.7K annotated pest objects across 14 common pest species. Compared with other insect pest datasets, AgriPest targets wild tiny pest detection in practical scenarios. In addition, the validation images are split into four challenges that are common in practical pest monitoring applications. The images in AgriPest are collected by our purpose-built task-specific equipment, which is also deployed in practical in-field pest monitoring applications. We implement and evaluate some state-of-the-art generic object detection methods on AgriPest. The experimental results demonstrate the difficulty and particularity of AgriPest. We believe this work will help to advance future research on the wild pest detection task and practical precision agriculture applications.