Detecting and Classifying Pests in Crops Using Proximal Images and Machine Learning: A Review

Pest management is among the most important activities on a farm. Monitoring all different species visually may not be effective, especially in large properties. Accordingly, considerable research effort has been spent towards the development of effective ways to remotely monitor potential infestations. A growing number of solutions combine proximal digital images with machine learning techniques, but since the species and conditions associated with each study vary considerably, it is difficult to draw a realistic picture of the actual state of the art on the subject. In this context, the objectives of this article are (1) to briefly describe some of the most relevant investigations on the subject of automatic pest detection using proximal digital images and machine learning; (2) to provide a unified overview of the research carried out so far, with special emphasis on research gaps that still linger; (3) to propose some possible targets for future research.


Introduction
Pests are among the main causes of losses in agriculture [1]. Insects can be particularly damaging, as they can feed on leaves, affecting photosynthesis, and they are also vectors for several serious diseases [2]. There are many chemical and biological methods for pest control [3], but to reach their maximum effectiveness, careful monitoring across the entire property is usually recommended.
In many cases, monitoring is done passively by workers as they carry out their daily activities. The problem with this method is that when the infestation is detected, a lot of damage may have already been done. Early pest detection requires a more systematic approach, especially in large farms. Traps are arguably the most widely adopted tool for systematic pest monitoring [4,5]. If implemented properly, this kind of device can successfully sample insect populations over the entire area of interest. However, without some kind of automation the traps still need to be placed and collected by hand, and infestation evaluation needs to be performed visually, introducing a degree of subjectivity that can lead to a biased assessment of the situation [5,6]. Thus, regardless of the adoption of traps or not, there is a need for methods capable of assessing the status of pests quickly, accurately and autonomously.
Considerable effort has been dedicated to the creation of more effective methods for pest detection and classification. Some techniques try to detect the associated damage instead of the pests themselves [7][8][9], but direct detection is the predominant approach. Many early studies tried to detect and identify insects by performing an acoustical analysis on the sounds emitted, but interest in this kind of approach seems to have faded in the last decade [10]. Most studies nowadays use digital images for the task [10,11]. While aerial images captured by means of unmanned aerial vehicles (UAV) are being increasingly explored [12,13], in most cases they do not offer enough resolution for the detection of small specimens, and plant canopies may prevent proper detection. Thus, proximal images are still prevalent. Multispectral [14,15], hyperspectral [16,17], thermal [18] and X-ray [10] sensors are being explored, but conventional RGB (Red-Green-Blue) sensors still dominate due to their low price, portability and flexibility [19].
This article focuses on the combination of digital Red-Green-Blue (RGB) images with machine learning techniques for pest monitoring in the field. There are a few studies dedicated to pest detection in stored products [20][21][22][23][24], but those are not considered here. Also, although the majority of methods for pest monitoring using RGB images employs some kind of machine learning algorithm, there are a few studies that use only image processing techniques such as mathematical morphology [25,26], thresholding [27][28][29] and template matching [30]. While those methods fall a little outside the scope of this article, they are addressed in the text whenever deemed relevant.
As mentioned before, there are two types of problems tackled by the methods proposed in the literature: detection and classification. Detection refers to the attempt to distinguish a certain target pest from all other elements in an image; it can be viewed as a binary classification (presence/absence of the target). Detection results often are used to estimate the severity of the outbreak. Classification aims to identify different species of pests, which can be viewed as a multiclass detection. Although detection and classification are related problems, they have certain particularities that often guide the choice of techniques used in the algorithms, as emphasized in Section 2.
The creation of an automatic system for pest monitoring can be roughly divided into four stages: data acquisition, model building, encapsulation in a usable tool, and practical use. The vast majority of the studies found in the literature focus on the second stage. In most cases, the first stage is only superficially addressed: a description of how the data was collected is usually provided, but a meaningful discussion on the efforts to make the data more representative and on the limitations of the database is rarely present. The third and fourth stages are often beyond the scope of the studies, as the final steps towards practical adoption of the technologies involve aspects that are more related to user experience, marketability, etc.
In this context, although aspects related to all four stages are addressed at some point in the text, this review focuses on the first two stages. It is important to note that the second stage is intimately related to the first. In fact, machine learning tools have evolved to a point in which the data fed to them has a much more prominent role in their success than their intrinsic characteristics, which explains the relatively similar performances yielded by different models when the data used to train and test them is the same [31,32]. Conversely, image sets with different data distributions and variability can lead to significantly disparate performances even when exactly the same machine learning model is employed, a fact that is clearly reflected in the accuracies shown in Table 1. This indicates that most of the weaknesses and research gaps that still remain are connected to the quality and variability of the data gathered, and the content of this review reflects this fact.
This article has three main contributions:
- It summarizes the progress achieved so far on the use of digital images and machine learning for effective pest monitoring, thus providing a complete picture on the subject in a single source (Section 2).
- It provides a detailed discussion on the main weaknesses and research gaps that still remain, with emphasis on technical aspects that discourage practical adoption (Section 3).
- It proposes some possible directions for future research (Section 4).

State of the Art of Pest Monitoring Using Digital Images and Machine Learning
An extensive search on the databases Google Scholar, Scopus, ScienceDirect and IEEE Xplore was performed to identify as many relevant studies on pest monitoring using digital images as possible. The search was performed using the keywords "pest", "insect", "image" and "detection". These are relatively general terms that yielded a large number of irrelevant results, but it was important to keep the search broad in order to avoid missing relevant matches. Only investigations published in peer-reviewed journals were selected, with the exception of [33], which is a conference article reporting some relevant findings. A second search considering the reference lists of the articles selected in the first step was also conducted. A total of 34 articles met all the search criteria, while 7 articles that did not employ machine learning techniques but used some interesting strategies were also included. Also, two of the selected articles are literature surveys: Liu et al. [10] reviewed different sensing technologies (acoustic, visible light, multispectral, hyperspectral, X-rays) for detection of invertebrates on crops, while Martineau et al. [11] focused on describing different pest classification strategies found in the literature. Table 1 summarizes all selected investigations, and Figure 1 shows the most common relationships between the types of data extracted from the images and the detectors/classifiers that are adopted.
This section is structured according to the task (detection or classification) addressed by the proposed methods. It is worth noting that a more logical structure in which the methods would be presented as they improved upon the weaknesses of their predecessors was considered. However, because most studies aim primarily to overcome the limitations associated with manual and visual monitoring, instead of targeting current research gaps, following a logical progressive sequence is largely unfeasible. This fact highlights one of the main issues detected in this study: many of the methods proposed in the literature employ similar classification strategies. In most cases, the only major distinction is the species of interest, resulting in studies with significant redundancy and limited novelty. This review identifies many of the research gaps that need to be addressed, which hopefully will serve as inspiration for future efforts.

Pest Detection Methods
Detection methods are interested in distinguishing a certain target pest from the rest of the scene in an image. This is equivalent to a binary classification in which the classes are "target present" and "target absent". The number of detected specimens is often tallied in order to provide a measurement for the degree of infestation.
Some early studies explored image processing techniques without using machine learning algorithms. The main advantage of this type of approach is its simplicity, both in terms of implementation and computational complexity. On the other hand, such methods rely heavily on handcrafted parameters and thresholds, which makes them highly susceptible to changes in the characteristics of the images. In one of the earliest studies on the subject [34], two knowledge-based systems (KBS) were combined for detecting and counting whiteflies in traps. The first KBS employed a series of image processing operations to select a few candidate locations for the specimens. The second KBS used a series of numerical descriptors to fully characterize the specimens of interest; these, combined with a set of rules, determined if the pest was present or not in the candidate locations. In the same year, Qiao et al. [28] used simple thresholding and a boundary tracking algorithm (sequence of rules delineating each object of interest) for the detection of whiteflies in traps. Al-Saqer and Hassan [30] used Zernike moments and regional properties to describe the shapes of the pest of interest (Red Palm Weevil). The features calculated using the two methods were used to build a library of templates for the targeted pest, which in turn was used to determine the presence or absence of the Red Palm Weevil amidst other insect species. Wang et al. [25] applied a sequence of image processing operations for detection of whiteflies in crops. The proposed algorithm is composed of a median filter, followed by Otsu thresholding and an opening operation to separate clustered objects. The final estimate is obtained by counting the connected objects. Barbedo [26] combined color space transformations, thresholding and mathematical morphology operations (hole filling) for the detection and counting of whiteflies in soybean leaves. Special attention was given to the detection of nymphs at the earliest stages of development. Maharlooei et al. [29] employed color transformations and some image processing operations for the detection of aphids in soybean leaves. Leaf segmentation was performed manually, followed by segmentation of the pests using the hue-saturation-intensity (HSI) color space. Objects of certain sizes were selected as aphids and the remaining connected components were counted.
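To make the thresholding-based pipelines above more concrete, the following minimal sketch reproduces the general sequence attributed to Wang et al. [25] (Otsu thresholding, opening, counting of connected objects). It is an illustration only and not the authors' implementation: the median filter is omitted for brevity, the 3 × 3 structuring element, the 4-connectivity and the assumption that pests are brighter than the background are choices made here, and all function names are hypothetical.

```python
from collections import deque


def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance
    of the two classes 'value <= t' and 'value > t'."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def _morph(mask, erode):
    """3x3 erosion (erode=True) or dilation (erode=False) of a boolean mask.
    Pixels outside the image are treated as background."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [
                mask[ny][nx] if 0 <= ny < h and 0 <= nx < w else False
                for ny in (y - 1, y, y + 1)
                for nx in (x - 1, x, x + 1)
            ]
            out[y][x] = all(vals) if erode else any(vals)
    return out


def opening(mask):
    """Erosion followed by dilation; removes thin bridges between blobs."""
    return _morph(_morph(mask, True), False)


def count_objects(mask):
    """Count 4-connected foreground components with a breadth-first search."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            count += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def count_bright_objects(image):
    """image: list of rows of integer grey levels in [0, 255]."""
    t = otsu_threshold([p for row in image for p in row])
    mask = [[p > t for p in row] for row in image]
    return count_objects(opening(mask))
```

In this sketch, two blobs joined by a one-pixel bridge are counted as a single object before the opening and as two objects after it, which is the effect the opening step is meant to achieve. The fixed structuring element is also an example of the handcrafted parameters that make such pipelines sensitive to changes in image characteristics.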
Multifractal analysis aims to determine the degree of irregularity and complexity of an object. This technique can be a good fit for pest detection due to its robustness to scale and rotation variations, while preserving most of the information contained in the image. Two studies employed this technique. In [58], candidate blobs representing potential locations for whiteflies in traps were determined using two different thresholding schemes. Multifractal analysis was then applied to extract the features that characterize the objects of interest. In [44], image background was first removed by means of a Mahalanobis distance, then multifractal analysis was applied to select some candidate blobs. Features extracted for each blob were combined with size and shape rules for the final whitefly detection in the images.
K-means clustering is a vector quantization technique that aims to group a certain number of observations into k clusters, or classes. This technique was employed by Wang et al. [53] for detecting whiteflies in crops. The algorithm begins by dividing the image into 100 × 100 blocks. Both the RGB and L*a*b* color spaces are then used as basis for an algorithm that preselects potential cluster centers, and then k-means clustering is applied to classify each pixel. Spurious objects are eliminated using ellipse eccentricity rules. Yao et al. [62] investigated the performance of three techniques, normalized cuts (NCuts), watershed and K-means clustering, applied to the separation of pest specimens in traps. NCuts with the optical flow angle as weight function achieved the most accurate results. K-means clustering was one of the classifiers tested in [45], and it has also been used to enhance the results produced by deep learning models [50].
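The pixel-level use of k-means described above can be illustrated with a plain implementation of the standard algorithm (alternating assignment and center update). This is a generic sketch, not the method of [53]: centers are initialized by random sampling rather than by the preselection step described in that study, and the pixel values in the usage example are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Cluster a list of tuples into k groups.

    Returns (centers, labels), where labels[i] is the index of the
    center closest to points[i] after the last iteration.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: random initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: squared_distance(p, centers[c]))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels


# Hypothetical RGB pixels: four leaf-like (green) and three pest-like (whitish).
pixels = [
    (30, 120, 40), (35, 125, 45), (28, 118, 38), (32, 122, 42),
    (230, 235, 228), (240, 240, 238), (225, 230, 226),
]
centers, labels = kmeans(pixels, k=2)
```

The cluster indices returned by k-means are arbitrary, so a rule is still needed to decide which cluster corresponds to the pest; in the studies above this role is played by additional color, size or shape criteria.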
SVMs are among the most widely used machine learning classifiers. They are particularly suitable for binary classifications, as they try to find the hyperplane that best separates two classes. Yao et al. [63] employed a three-layer detection strategy for the detection of rice planthoppers in crops. The first layer was an AdaBoost classifier based on Haar features; the second layer was a SVM classifier based on HOG features; the third layer used color and shape features to remove spurious objects detected in the first two steps. Liu et al. [45] employed SVMs fed with HOG features for detection of aphids in images captured in wheat crops. The HOG features were extracted from regions selected using a maximally stable extremal region (MSER) descriptor. Four other machine learning classifiers were also tested: k-means clustering, AdaBoost with Haar features, and SVM with two different sets of features. Ebrahimi et al. [17] used the HSI (Hue, Saturation, Intensity) color channels as inputs to a SVM for detection of thrips in strawberry flowers. Prior to the SVM application, background was removed by means of thresholding, and morphological operations were applied to close holes in the binary image.
Artificial neural networks (ANN) are models containing numerous nodes and connections, being loosely inspired by the way neurons are organized and interconnected in a brain. ANNs have been frequently used for the task of image classification. Before 2010, virtually all neural networks had a shallow architecture (few hidden layers). As computational power grew and Graphics Processing Units (GPUs) became the main piece of hardware for training the models, deep architectures began to rise. Since 2015, research on image classification has strongly veered towards using deep learning [66]. However, a few studies still employ ANNs with shallow architectures. Vakilian and Massah [51] combined image processing techniques with ANNs for the detection of beet armyworms in images captured under controlled conditions. Potential pests were first segmented by means of a Canny edge detector, then seven morphological and texture features were extracted and used as input for the three-layer neural networks. Espinoza et al. [40] used a three-step approach for detecting thrips and whiteflies in traps. Potential locations for the objects of interest were selected by means of a threshold, from which 14 color and morphological features were extracted. Those features were used as input to a multilayer feed-forward neural network, which was responsible for the final detection. Roldán-Serrato et al. [48] applied ANNs for the detection of beetles in bean and potato crops. Two types of neural network architectures were used, RSC and LRA, with the former yielding better results both in terms of speed and accuracy.
Deep architectures have been preferred due to their ability to infer spatial relationships without explicit feature extraction [66]. Convolutional neural networks (CNNs) have been particularly popular due to their intimate relationship between layers and spatial information [67]. CNNs have been used in [39] for detecting and counting moths in traps. The system employed a sliding window and, for each image patch, the CNN output a probability for the presence of a moth. Non-maximum suppression (NMS) was used to retain only the patches for which the probability was locally maximal and above a certain threshold, revealing the locations of the specimens. Barbedo and Castro [5] explored the use of CNNs for the detection of psyllids in images of traps. Using the Squeezenet architecture, the authors investigated the influence of sensor quality, image resolution and training/test dataset distribution on the ability of the model to properly detect the psyllids amidst other objects (insects and debris) captured by the trap. CNNs have also been combined with decision networks in order to explore contextual information [54], and with an anchor-free region proposal network for pinpointing pest positions [43].
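The sliding-window detection followed by non-maximum suppression described above for [39] can be sketched as follows. The exact suppression criterion used in that study is not given here, so this example assumes the common greedy variant based on intersection over union (IoU); the function names, the thresholds and the boxes in the usage example are all hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_threshold=0.5, iou_threshold=0.3):
    """Keep the locally best-scoring windows.

    detections: list of (box, probability) pairs, one per image patch.
    A window is kept only if its probability reaches score_threshold and
    it does not overlap an already kept, higher-scoring window by more
    than iou_threshold.
    """
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Hypothetical classifier outputs for four overlapping or isolated patches.
windows = [
    ((0, 0, 10, 10), 0.9),
    ((2, 2, 12, 12), 0.8),    # overlaps the first window, lower score
    ((30, 30, 40, 40), 0.7),
    ((50, 50, 60, 60), 0.2),  # below the probability threshold
]
moths = non_max_suppression(windows)
```

The number of retained windows serves as the count estimate, which is why the choice of the two thresholds directly affects over- and under-counting when specimens touch or overlap.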
Although CNNs have been the most popular deep learning architectures for image classification, other types of architectures have also been explored. Sun et al. [50] used a downsized RetinaNet for beetle detection in traps. The RetinaNet, which is a one-stage deep learning detector, was downsized by a depthwise separable convolution and feature pyramid tailoring. The detector was further enhanced by a k-means anchor optimization and residual classification subnet. The system was capable of distinguishing red turpentine beetles from five other beetle species. Yue et al. [64] focused their study on the development of an effective algorithm for increasing the resolution of images captured in the field. The proposed strategy employed a deep recursive super resolution network with Laplacian Pyramid for high resolution reconstruction of the images. The algorithm was compared with two other image upscaling methods, Bicubic Interpolation and Super-Resolution Convolutional Neural Network (SRCNN).

Pest Classification Methods
The challenge involved in classifying pests is high because a system like this not only has to properly discriminate between the targeted species, but also deal with non-targeted species, which can be numerous. For this reason, the adoption of sophisticated machine learning techniques is higher than in the case of detection. From the classification studies considered in this work, only two did not employ machine learning tools. Cho et al. [27] used a series of image processing operations for classification of whiteflies, aphids and thrips in traps. Two separate strategies were used: the first, dedicated to the identification of aphids, was based on some size, shape and color features; the second used features extracted from the YUV color space to discriminate whiteflies and thrips. A related approach was adopted in [49], which combined two feature-extraction techniques for recognition of six different insect species. The first technique, called LOSS, extracts a number of shape features to characterize the objects of interest. The second technique, SIFT, is a well-known algorithm that identifies salient features in the image. Specimens were then grouped by feature value similarity.
Before the rise of deep learning techniques, SVM was the preferred technique for pest classification in images. However, there have been some studies employing other types of machine learning techniques. Xia et al. [59] used the watershed algorithm for segmentation of the insects, followed by the extraction of color features from the YCrCb color space using the Mahalanobis distance. The classification of each object into whitefly, aphid or thrips was given by the nearest distance between the extracted feature vector and the reference vectors associated with each class. K-means clustering was also applied in [41] for recognition of 10 pest species in images captured in the field.
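The nearest-reference classification described above for [59] can be illustrated with a short sketch. It assumes a Mahalanobis distance with a diagonal covariance matrix (one variance per feature), which is a simplification adopted here to avoid a matrix inversion; the reference means and variances are invented for illustration and are not taken from that study.

```python
import math


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under the assumption of a diagonal covariance,
    i.e. each feature difference is scaled by that feature's variance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def classify(x, references):
    """Return the class whose reference vector is nearest to x.

    references maps a class name to a (mean, variance) pair of tuples.
    """
    return min(references, key=lambda name: mahalanobis_diag(x, *references[name]))


# Hypothetical (Cr, Cb) reference statistics for the three classes.
REFERENCES = {
    "whitefly": ((128.0, 128.0), (25.0, 25.0)),
    "aphid": ((110.0, 140.0), (36.0, 36.0)),
    "thrips": ((150.0, 115.0), (49.0, 49.0)),
}

label = classify((112.0, 138.0), REFERENCES)  # nearest reference: "aphid"
```

Because the distance is scaled by the variance of each class, a class with widely spread color values tolerates larger deviations than a tightly clustered one, which a plain Euclidean distance would not capture.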
The objective of some studies was to compare and evaluate several different machine learning models. Wen et al. [55] used the SIFT descriptor to extract features for the characterization of 5 pest species in images captured under controlled conditions. Six classifiers were tested: MLSLC, kNN, PDLC, PCALC, NMC, and SVM. PCALC and SVM yielded the highest accuracies. The same authors tested several geometric, contour, texture and color features as inputs for five different classifiers [56]: MLSLC, KNNC, NMC, NDLC and DT. NDLC yielded the highest accuracies when classifying eight different pest species.
As mentioned above, SVMs have been among the most widely used machine learning techniques. Although interest has dropped with the inception of deep learning techniques, SVMs are still being employed in a wide variety of classification problems. Yao et al. [61] extracted 156 color, shape and texture features and used them as input for a SVM classifier with a radial basis function as kernel, with the objective of classifying four species of Lepidoptera. Venugoban and Ramanan [52] employed SURF and HOG descriptors to extract the features used to feed a SVM dedicated to classify 20 pest species. SVMs have also been used in the context of bio-inspired methods for classification of 10 species in crops [37]. In this work, the Saliency Using Natural statistics model (SUN) was used to generate saliency maps and detect the region of interest (ROI). Features were extracted using the Hierarchical Model and X (HMAX) model combined with SIFT, NNSC and LCP algorithms. All extracted features were then fed to a SVM using a Radial Basis Function (RBF) as kernel function for pest recognition.
As in the case of pest detection, shallow neural network architectures are still being explored for pest classification. Han et al. [42] focused on the proposal of a hardware platform for pest classification in images of traps. The system extracts several morphological and color features that are used as input for the three-layer, 29-neuron ANN dedicated to the classification. Dimililer and Zarrouk [38] used a shallow ANN architecture for classification of eight pest species. The algorithm preprocesses images using grayscale conversion and median filtering, applies thresholding to remove background and Canny edge detection to segment the pest, and the image is rescaled by pattern averaging before being fed to the two neural networks responsible for pest classification.
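The rescaling by pattern averaging mentioned above is interpreted here as replacing each non-overlapping block of pixels by its mean value, which reduces the number of inputs presented to the network. This interpretation, the block size and the function name are assumptions; the details of the procedure in [38] are not reproduced.

```python
def pattern_average(image, block):
    """Downscale a grayscale image by averaging non-overlapping
    block x block tiles.

    image: list of equal-length rows of numbers.
    block: tile side in pixels; must divide both image dimensions.
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("block size must divide both image dimensions")
    reduced = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            tile = [
                image[y][x]
                for y in range(top, top + block)
                for x in range(left, left + block)
            ]
            row.append(sum(tile) / len(tile))
        reduced.append(row)
    return reduced


# A 2 x 4 image reduced to 1 x 2 with 2 x 2 tiles.
small = pattern_average([[0, 2, 4, 4], [2, 4, 4, 4]], block=2)
```

Averaging trades spatial detail for a smaller input layer, so the block size has to be chosen with the size of the smallest pest features in mind.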
Deep learning techniques are the current state of the art when it comes to classification based on digital images. As is the case for pest detection, CNNs have been prevalent. A deep CNN model for classification of six pest species in images of traps was tested in [46]. The authors used a global contrast region-based approach to compute a saliency map for localizing pest insect objects. Bounding boxes containing targets were then extracted and used to train and test a CNN with an architecture inspired by AlexNet. Different architectures were explored by shrinking depth and width. Cheng et al. [35] optimized two CNN architectures, ResNet50 and ResNet101, using the deep residual learning method. When used to classify ten different pest species in field images with complex backgrounds, the proposed method outperformed SVM, shallow ANN and plain AlexNet CNN classifiers. The VGG19 deep learning architecture was employed in [60] for classifying 24 pest species in images captured under natural conditions. Region Proposal Network was adopted rather than a traditional selective search technique to generate a smaller number of proposal windows, improving accuracy and speed. The method was compared with a few other deep learning architectures. Dawei et al. [36] applied transfer learning to a pre-trained AlexNet CNN for classification of 10 species in images captured in the field. The model outperformed a CNN trained from scratch, likely due to the relatively small dataset size, which did not contain enough information for proper full model training. A four-part deep learning approach was used in [47] for classification of 16 butterfly species in images of traps. First, channel-spatial attention (CSA) was fused into the CNN backbone for feature extraction and enhancement. A RPN was applied to indicate potential pest positions. A PSSM was used to replace fully connected (FC) layers in the deep CNN architecture for pest classification and bounding box regression. Finally, contextual regions of interest were used to improve detection accuracy. A two-step deep learning approach for classifying and counting five pest types in images of traps was adopted in [33]. The detection step, in which locations of the insects on the sticky paper are determined, was performed by the Tiny YOLO v3 object detector. The classification step was carried out in a three-level hierarchical structure, in which CNN models trained specifically to deal with the classification problems associated with each level were applied successively.
Two other types of deep learning strategies have been investigated for pest classification. Besides being used in [33] (see above), YOLO has been investigated in [65]. This work employed a two-step machine learning approach for the classification of six pest species in images of traps. Objects were first localized by means of the YOLO detector, followed by classification and counting using a SVM fed with color, shape, texture and HOG features. Four kernel functions were tested. In [57], a deep learning model was applied for classification of nine moth species in traps. The algorithm applied a two-step morphological segmentation of the objects of interest (moths). Then, 154 texture, color, shape and local features were extracted and fed to an improved pyramidal stacked de-noising autoencoder (IpSDAE) architecture, the deep learning model dedicated to the classification step. Results were compared to SVM, random forests (RF), BayesNet, logistic regression classifier (LRC), and radial basis function (RBF).

The Data Gap Problem
As mentioned before, machine learning algorithms are now capable of solving most classification problems, as long as the data used to fit the models are representative enough. Due to intrinsic characteristics of the agricultural environment, the data collected tends to have a high degree of variability. Many studies try to counteract this by controlling the conditions under which images are acquired. This certainly increases the accuracies obtained in the experiments, but those same models tend to fail with data obtained under more realistic conditions. There are two possible solutions for this problem. The first solution is the development of models and training strategies capable of learning data distributions from a relatively small number of samples. As suggested in Lake et al. [68], the learning process of models dealing with small datasets should evolve towards the way humans learn, which involves incorporating domain-specific knowledge. Using synthetic data and augmentation to artificially increase the dataset size can also be employed [69]. The use of Generative Adversarial Networks (GANs) may be particularly interesting in this context [70], but more studies are needed to determine if this kind of approach can indeed properly capture the whole variability introduced by the agricultural environment. As promising as these new algorithms are, their development is still in its infancy and will probably not be applicable to agricultural classification problems in the near future. The second solution is to improve the process of image acquisition to better cover all variability found in practice. The remainder of this section discusses the main hurdles towards such a goal, also proposing possible solutions whenever relevant. All issues discussed in this section are summarized in Table 2.
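As a simple illustration of the dataset augmentation mentioned above, the following sketch generates the eight orientations of an image obtained by combining 90-degree rotations with horizontal flips. It is only one of many possible augmentation schemes and is not taken from [69]; the function names are hypothetical, and images are represented as nested lists to keep the example free of external dependencies.

```python
def hflip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rot90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def augment(image):
    """Return the 8 variants produced by the four rotations of the image,
    each with and without a horizontal flip (the original comes first)."""
    variants = []
    current = [list(row) for row in image]
    for _ in range(4):
        variants.append(current)
        variants.append(hflip(current))
        current = rot90(current)
    return variants


# A tiny asymmetric image yields eight distinct training samples.
samples = augment([[1, 2], [3, 4]])
```

Geometric transformations of this kind multiply the number of training samples, but they do not add new illumination, background or pose conditions, so they complement rather than replace a more representative image acquisition process.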

Difficulties Related to the Strategy Adopted-Images of Traps
A little less than half of the studies considered in this work employ traps to aid the task of pest monitoring. This type of device often offers the highest count accuracy and robustness, and can be applied to multiple types of pests [45,62]. Images of traps tend to have more homogeneous characteristics than those captured in the field. As a result of this reduced variability, it is easier to build a dataset capable of representing the whole range of conditions found in practice. However, images of traps also have some specific challenges associated. Shadows caused by the trap's frame may cause detection problems [40]. Specimens trapped near or on the edges of the frames are hard to detect [28]. Also, depending on the time the trap is left on the field, insects may decay and become unrecognizable [40]. Noise [44] and grid marks [27,57,59] have also been pointed out as potential sources of error. Fortunately, most of those problems can be prevented.
A common approach for pest monitoring using traps is to collect images regularly and remotely without removing the trap [39], which enables an analysis of the infestation over time. This type of approach also brings some challenges. As time elapses, new objects appear in the trap (increasing occlusion and overlapping problems), wing poses change, decay effects increase, illumination conditions vary, and background texture changes [39]. These factors may cause inconsistent count estimates over time. Ding and Taylor [39] argued that if temporal image sequences are provided with a reasonably high frequency, changes may be tracked more easily and errors may be avoided.
The position in which insects get glued can change considerably. Pose variations can cause detection and classification problems [57]. Some studies suggested that the best solution would be to capture images as soon as insects land [56], but this may be impractical in most cases. Including different poses in the dataset used to tune the algorithms and models seems to be a more feasible option, especially taking into consideration that traps usually collect a large number of samples.
Depending on the population density of insects, trapped specimens may touch or even overlap [50,61]. In some cases, traps may contain thousands of objects [40]. Situations like this can be very challenging, often leading pest numbers to be considerably underestimated [34,39,40]. Some image processing techniques are capable of partially separating clusters [62], and some deep learning models can successfully deal with crowded images [47]. However, depending on the degree of overlapping, the only solution may be applying statistical correction techniques. In this case, estimates will be only an approximation, but ultimately not even highly trained humans would be able to provide an accurate estimate under overcrowded conditions [26].
Traps with low population density also pose some problems. In particular, the relative impact of the presence of objects such as dust, debris and other insects becomes more pronounced [28]. If the number of specimens of interest is very low, even a couple of false positives will cause high error rates, a problem that is also observed when field images are used [63]. This type of issue is not easy to address, but should be taken into consideration when evaluating the performance of any trained model. While traps allow for a more controlled pest monitoring, they also have some major disadvantages. According to Liu et al. [45], traps may not be the best tool for making decisions related to treatment need or timing, as there may be a delay between the beginning of infestation and the moment the trap has enough samples for meaningful conclusions. In addition, traps usually capture only airborne adult pests, missing immature specimens capable of causing severe damage to the crops. For this reason, many studies try to detect and classify pests directly on the leaves, with images being captured either under controlled conditions (detached leaves) or directly in the field.
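The effect of a few false positives at low population densities, noted above, follows directly from how the relative counting error is computed. The short example below uses hypothetical counts to show that the same two spurious detections produce very different error rates depending on the number of specimens actually present; the function name and the error definition (absolute count difference divided by the true count) are assumptions made for illustration.

```python
def relative_count_error(true_count, false_positives, false_negatives=0):
    """Relative error of a count estimate.

    The estimate is the true count plus spurious detections minus missed
    specimens; the error is the absolute difference to the true count,
    divided by the true count.
    """
    if true_count <= 0:
        raise ValueError("true_count must be positive")
    estimated = true_count + false_positives - false_negatives
    return abs(estimated - true_count) / true_count


# Two false positives on a sparsely and on a densely populated trap.
sparse = relative_count_error(true_count=4, false_positives=2)    # 0.5
dense = relative_count_error(true_count=200, false_positives=2)   # 0.01
```

For this reason, reporting absolute errors alongside relative ones may give a fairer picture of model performance when the number of specimens of interest is small.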

Difficulties Related to the Strategy Adopted-Images in the Field
When images are captured in the field, considerable illumination differences may exist [26,45], which can pose a significant challenge [29]. Those differences can be caused by weather conditions (sunny or overcast), angle of insolation, camera settings and presence of shadows, among others [71]. Many datasets do not reflect these variations, as the protocols used to generate them usually prevent images from being captured under conditions far from ideal. In practice, it may be unfeasible to force a potential user to follow the same protocols, especially considering how hostile the agricultural environment can be. Consequently, the variability found in practice is usually much larger than that found in most studies, which is one explanation for the tendency of most proposed methods to fail under more realistic conditions. This is particularly true for methods based on deep learning, as these require a comprehensive training dataset to work properly. In most cases, practical adoption of software for field pest monitoring will require models trained with more realistic and comprehensive datasets. It is worth noting that there are some techniques capable of partially compensating for illumination differences [45], which may be useful depending on the classifier being applied.
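As an illustration of what partial illumination compensation can look like, the sketch below applies the gray-world assumption, a classic color constancy heuristic. It is not necessarily the technique used in [45]; it is chosen here only because it is simple. The function name `gray_world` and the flat list-of-RGB-tuples image representation are assumptions made for this example.

```python
def gray_world(pixels):
    """Rescale each RGB channel so that its mean matches the overall mean.

    `pixels` is a flat list of (R, G, B) tuples with values in 0..255.
    This representation is an assumption made to keep the sketch dependency-free.
    """
    n = len(pixels)
    # Mean intensity of each channel over the whole image.
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    # Under the gray-world assumption, all channel means should be equal.
    target = sum(means) / 3.0
    # Guard against an all-zero channel to avoid division by zero.
    gains = [target / m if m > 0 else 1.0 for m in means]
    return [
        tuple(min(255, round(p[c] * gains[c])) for c in range(3))
        for p in pixels
    ]


# A tiny image with a warm (reddish) cast, as might occur near sunset.
warm = [(200, 100, 60), (180, 90, 50), (220, 110, 70)]
balanced = gray_world(warm)
```

A correction like this only removes a global color cast. It does not address shadows or local illumination differences, so it would complement, rather than replace, a training set that covers varied lighting.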
The background of images captured in the field can also vary considerably [37], which means that the contrast between specimens of interest and their surroundings will change. This is true even when detection is performed on plant leaves, as these can have colors ranging from light green to dark brown. In addition, the background usually contains artifacts (stems, leaves, spots) that may cause confusion [45]. As mentioned recurrently throughout this article, the training dataset has to represent the entire range of conditions expected to be found in practice, but if the degree of background variation is too high, this may be unfeasible. In cases like this, it may be necessary to enforce some image capture protocols.
Another challenge associated with images captured in the field is that perspective distortions may be common [45], causing specimens to appear with very different visual characteristics depending on the angle of capture. This problem is less prevalent in the case of small pests, but in any case the dataset used for training should contain different viewpoints of the targeted pest, or some image capture protocols need to be created (but these can be difficult to enforce in practice).
Full automation of pest monitoring in the field without using traps is currently unfeasible in most cases. Although it is possible to install a network of image sensors strategically located throughout the crop, many pests of interest are usually found underneath the leaves. Even when pests are located on top of the leaves, chances are that most specimens will be occluded by the crop canopies [45]. In the future, it may be possible to deploy swarming robots that will actively search for the infestations. At present, either image-based field monitoring needs to be paired with other monitoring strategies, or its estimates need to be corrected by carefully designed statistical models [63].

Difficulties Related to the Insects Themselves
Most methods are designed to detect adult insects. However, a complete picture of the infestation and how it is evolving may require the identification and counting of specimens at earlier stages of development [17,26,45]. This may pose a significant challenge, because younger specimens may not only be smaller [63], but also have quite different visual characteristics [45]. In some extreme cases, young nymphs may be semi-transparent, which makes them very difficult to detect [26].
When the targeted pest is small and detection is to be performed directly on the leaves, symptoms and signs produced by diseases, nutritional deficiencies, dead skins and other insects may also become sources of confusion [26,29]. If this is expected to be a common occurrence, the training dataset should include samples of those problematic situations so the model can learn how to properly detect the pest of interest under such conditions. Another major challenge faced by recognition methods is that the targeted pest may share many visual similarities with other species [59]. This fact is particularly relevant when traps are used, as many different species may be present in a single image [50,55], including many that are not targeted by the classification system [61]. While most studies addressed this problem, the reported results will be intrinsically connected to the specific geographical region where the data was collected [5]. Since other regions may harbor different pest species, the degree of detection difficulty may vary. Thus, while most published results may be viewed as proofs of concept, new experiments need to be carried out and new models should be generated whenever a new area is considered.

Difficulties Related to the Imaging Equipment
The cameras used for capturing the images may also have considerable impact on the ability of the model to detect the objects of interest [26]. Optical quality plays an important role and, under low illumination conditions, camera settings may also be a relevant factor [29]. Building a dataset including samples captured with all kinds of sensors is impractical; instead, cameras expected to be prevalent in practice should be given priority.
In most cases, the higher the spatial resolution of the images used for recognition, the more information is available and, consequently, better results can be achieved (at the price of increased computational requirements). However, depending on the technique being used, too much detail may lead to oversegmentation, increasing error rates [59]. Thus, it is important to consider that working with the highest possible resolution is not necessarily the best approach.

Difficulties Related to Model Learning
A problem that is often overlooked is that many machine learning algorithms tend to overfit the data [5], especially when data is limited. It is common to find studies reporting accuracies very close to 100%, which may not be realistic depending on the problem being addressed (this may also be caused by overly homogeneous datasets). There are several ways to avoid overfitting, including the application of regularization techniques [46], image augmentation [39] and unit dropout in deep architectures [46], among others, so more care should be given to the experimental design.
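Of the countermeasures listed above, image augmentation is the easiest to illustrate. The following minimal sketch generates flipped and rotated copies of an image stored as a nested list. The helper names (`hflip`, `vflip`, `rot90`, `augment`) are hypothetical, and practical pipelines would normally rely on an image library and a richer set of transformations.

```python
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img):
    """Return the original image plus five geometric variants."""
    r90 = rot90(img)
    r180 = rot90(r90)
    r270 = rot90(r180)
    return [img, hflip(img), vflip(img), r90, r180, r270]


# A 2x2 "image" is enough to see that every variant is distinct.
variants = augment([[1, 2], [3, 4]])
```

Geometric transformations of this kind are plausible for insects, which can appear in any orientation. They add no genuinely new specimens or backgrounds, however, so they mitigate overfitting without removing the need for more varied data.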
Class imbalance is also a common problem. In any image classification problem, certain classes will usually be much more common than others. Because there is a tendency to build datasets as large as possible, the proportion of samples ends up reflecting such imbalance. The problem with this is that the classifier will tend to become highly biased and, depending on the metrics used to evaluate the model, such bias may remain undetected. The number of samples should be roughly the same for all classes, which can be achieved by either oversampling the smaller classes or undersampling the larger ones [5].
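The two points made above, that balancing can be achieved by oversampling and that some metrics hide the bias, can be sketched as follows. This is a minimal illustration using random oversampling with replacement; the function names and the class labels ("aphid", "whitefly") are hypothetical and do not come from any of the reviewed studies.

```python
import random
from collections import Counter, defaultdict


def oversample(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class weighs the same."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Imbalanced toy dataset: 95 images of one species, 5 of another.
labels = ["aphid"] * 95 + ["whitefly"] * 5
samples = list(range(len(labels)))  # stand-ins for image identifiers

# A degenerate classifier that always predicts the majority class.
always_majority = ["aphid"] * len(labels)
plain = accuracy(labels, always_majority)              # looks good
balanced = balanced_accuracy(labels, always_majority)  # reveals the bias

new_samples, new_labels = oversample(samples, labels)
```

In this toy case the plain accuracy is 0.95 while the balanced accuracy is 0.5, which shows how a biased model can go unnoticed. Oversampling would normally be applied to the training split only, so that duplicated images do not leak into the test set.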
Covariate shift issues are also widespread and rarely addressed. Covariate shift is the phenomenon in which differences between the distribution of the data used to train the model and that of the data on which the model is to be applied result in low accuracies [66,71,72]. This problem becomes evident when the same database is used for training and assessing the models, which is common practice in the machine learning community for practical reasons [66,73]. Some studies on recognition of plant diseases have reported steep accuracy drops when the model trained with a given dataset is applied to other data sources [74]. Domain adaptation techniques can mitigate this problem [75], but a more definitive solution would involve capturing a much wider variety of training data, which may be impractical [66].
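One way to expose covariate shift during evaluation, rather than discovering it after deployment, is to hold out an entire data source (farm, camera or campaign) at a time instead of splitting a single pooled database at random. The sketch below shows such a leave-one-source-out split. The record layout and the `source` field are assumptions made for this example.

```python
def leave_one_source_out(records):
    """Yield (held_out_source, train, test) splits, one per data source.

    Each record is a dict with at least a "source" key identifying where the
    image came from (hypothetical field name).
    """
    sources = sorted({r["source"] for r in records})
    for held_out in sources:
        train = [r for r in records if r["source"] != held_out]
        test = [r for r in records if r["source"] == held_out]
        yield held_out, train, test


# Toy metadata for images collected at three hypothetical farms.
records = [
    {"image": "a1.jpg", "source": "farm_A"},
    {"image": "a2.jpg", "source": "farm_A"},
    {"image": "b1.jpg", "source": "farm_B"},
    {"image": "c1.jpg", "source": "farm_C"},
    {"image": "c2.jpg", "source": "farm_C"},
]
splits = list(leave_one_source_out(records))
```

A large gap between accuracy measured on random splits and accuracy measured on held-out sources would suggest that the model depends on source-specific characteristics.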
The evaluation of the proposed methods is also not trivial. Samples in the dataset need to be properly annotated to serve as the references (or ground-truth) to which the outputs yielded by the model are compared. The problem is that the annotation process is subjective, labor intensive and, as a result, susceptible to considerable inconsistencies [71,76]. Indeed, some authors have observed significant deviations when different people performed the task [34]. Ideally, image annotation should be performed redundantly by several people, but in most cases this is impractical. Alternatively, authors should always make clear that the references used for evaluation of their algorithms might have some inconsistencies.
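When at least part of the dataset can be annotated by two people, the degree of inconsistency can be quantified instead of merely acknowledged. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic. It is offered as one possible measure and is not one prescribed by the studies cited above; the labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Proportion of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Two annotators labelling the same ten image regions.
annotator_1 = ["pest"] * 4 + ["none"] * 4 + ["pest", "none"]
annotator_2 = ["pest"] * 4 + ["none"] * 4 + ["none", "pest"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 items, but because half of that agreement would be expected by chance, kappa is about 0.6. Reporting such a value alongside accuracy gives readers a sense of how reliable the references are.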
Many of the problems mentioned in this section could be minimized by the use of more comprehensive datasets. In a natural environment, conditions cannot be controlled, which means that data variability will be very high. Covering all possible situations found in practice is nearly impossible, especially considering that some conditions are rare and require a certain amount of luck to be properly captured. As observed in [37], pests will occur infrequently and in different locations, and they will not always be ideally positioned for image capture. In the case of plant diseases, there have been some initiatives to build comprehensive datasets using social network and citizen science concepts [66], which may also be a suitable solution for the problem of pest recognition. On the other hand, the labeling process tends to become less rigorous under such a scheme [66], exacerbating the evaluation problem discussed in the previous paragraph. Although there is no definitive solution for the problem of dataset insufficiency, evidence suggests that it is better to have comprehensive datasets that are relatively poorly labelled than small datasets that are rigorously annotated [77].

Conclusions and Future Directions
Machine learning techniques, and deep learning in particular, have shown a remarkable ability to properly detect and classify pests, either in traps or natural images. Arguably, the main factor preventing a more widespread adoption of automatic pest monitoring systems is their lack of robustness to the vast variety of situations that can be found in practice. This, in turn, is the result of limitations on the datasets used to train the classification models. Building more comprehensive pest image databases is essential to close this gap. However, given the degree of variability associated with practical use, it is very unlikely that substantial progress in this regard will be achieved using conventional approaches. Future efforts could focus on creating mechanisms to facilitate and encourage the involvement of farmers and entomologists in the process of image collection and labelling. This could be achieved, for example, by using the concepts of citizen science [78] to involve more people and scale up the efforts towards a representative database. In the specific case of pest recognition, farmers and field workers could collect images in the field and, after being uploaded to a server, those images would be properly labelled by an expert. Initiatives like this are already being carried out in the context of plant disease identification (Barbedo, 2018).
Joint efforts and data sharing can also greatly contribute toward the creation of more comprehensive datasets. If datasets generated by different research groups were made available and properly integrated, the resulting set of images would be much more representative and research results would be more meaningful and applicable to real world conditions. Datasets adhering to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles [79] are particularly useful, as this enables and maximizes the fitness for reuse of research data.
As mentioned before, monitoring pests under natural conditions usually allows for a more timely response to possible infestations. However, monitoring entire farms in real time is a considerable challenge. Having a high density network of imaging sensors covering the entire area would be ideal, but this option is probably not cost effective, and it is technically challenging to properly implement it in an inhospitable environment. An alternative would be to embed sensors in the machinery used in the farm's daily life. This would not allow for real time monitoring, but the whole property would be covered in relatively short time frames. In the more distant future, employing swarming micro-robots may also become feasible [80]. Other suitable solutions may exist, but more research effort needs to be dedicated to the subject.
Traps have some disadvantages, but provide an easier way of monitoring large areas. Better solutions in edge computing need to be created so the communications network needed for full monitoring automation is not overwhelmed by excessive amounts of data [81]. Models and algorithms also need to be light so that small, low-cost processing units can process the images and produce useful information in a timely manner. With raw data being properly processed at the edge of the network, it should be possible to transmit only the information needed in the decision making process.
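The idea of transmitting only decision-relevant information can be sketched as follows: instead of sending the trap image, the edge device sends a compact summary of what a detector found in it. The detection format, the trap identifier, the confidence threshold and the function name `summarize` are all assumptions made for this example; the detector itself is outside the scope of the sketch.

```python
import json
from collections import Counter


def summarize(trap_id, detections, min_score=0.5):
    """Reduce a list of detections to a compact JSON payload of per-species counts.

    Each detection is a dict with "label" and "score" keys (hypothetical format).
    Detections below `min_score` are discarded before counting.
    """
    counts = Counter(d["label"] for d in detections if d["score"] >= min_score)
    message = {"trap": trap_id, "counts": dict(sorted(counts.items()))}
    # Compact separators keep the payload as small as possible.
    return json.dumps(message, separators=(",", ":"))


# Output of a hypothetical on-device detector for one trap image.
detections = [
    {"label": "whitefly", "score": 0.91},
    {"label": "whitefly", "score": 0.42},  # below threshold, ignored
    {"label": "thrips", "score": 0.77},
]
payload = summarize("T-07", detections)
```

A payload of this kind occupies a few dozen bytes, whereas the image it summarizes would typically occupy orders of magnitude more, which is what makes low-bandwidth links viable for large trap networks.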
Automating pest monitoring is a challenging task. With the evolution of machine learning algorithms, the tools needed to build accurate systems with real practical applications are already available. Gathering data representative enough of the huge variability found in practice is very challenging, but as devices with imaging capabilities become ubiquitous and mechanisms to enable citizen science are perfected, this might not be a significant problem in the relatively near future. However, as discussed throughout this article, there are still many research gaps that need to be addressed, which means that pest monitoring automation will continue to be a compelling research subject for many years to come.