Insect Pest Image Recognition: A Few-Shot Machine Learning Approach including Maturity Stages Classification

Abstract: Recognizing insect pests from images is an important and challenging research issue. A correct species classification helps in choosing a more appropriate mitigation strategy for crop management, but designing an automated solution is difficult due to the high similarity between species at similar maturity stages. This research proposes a solution to this problem using a few-shot learning approach. First, a novel insect data set based on curated images from IP102 is presented. The IP-FSL data set is composed of 97 classes of adult insect images and 45 classes of early stages, totalling 6817 images. Second, a few-shot prototypical network is proposed based on a comparison with other state-of-the-art models and a further divergence analysis. Experiments were conducted separating the adult classes and the early stages into different groups. The best results achieved an accuracy of 86.33% for the adults and 87.91% for early stages, both using a Kullback–Leibler divergence measure. These results are promising for a crop scenario where the more significant pests are few and it is important to detect them at earlier stages. Further research directions include evaluating a similar approach in particular crop ecosystems and testing cross-domain settings.


Introduction
Crop yields are subject to many threats and conditions, such as pathological agents, mismanagement of soil nutrients, and climate change, to name a few. If not controlled, insect pests inflict high damage on every crop, and a warming climate scenario may increase insect infestations and losses, especially in tropical areas [1].
Insect pests are a major cause of concern in crops because of yield losses and the intensive use of broad-spectrum insecticides [2]. Although integrated pest management (IPM) practices have gained importance, there is still a lack of precision in the timely identification of hazardous insect species during a crop cycle [3]. If pests were identified more precisely, and at early stages, monitoring and mitigation strategies could be applied to avoid economic losses and support more sustainable practices [2].
The similarities between insect species, especially at the same maturity stages, make conventional manual identification imprecise, time-consuming, and inefficient in most cases, even for experienced agronomists [4]. Visual-based machine learning algorithms can help address this issue. Using images to help classify insects for pest management has become a major research topic with the advance of machine learning techniques [5,6]. Deep learning is one of the most widely used approaches for insect classification tasks in agriculture, as demonstrated in [7][8][9][10]. However, supervised models require large labeled data sets for training, which are scarce, very demanding to assemble, and still far from covering the variability within insect classes [11]. Besides, computer vision and deep learning methods may supply novel cost-efficient and automated sensor techniques to the field of entomology [4].
In recent years, many automated recognition systems based on computer vision and machine learning have been proposed to manage insect pests in agriculture. Karar et al. [12] proposed a mobile application to classify five classes of insect pests using deep learning in cloud computing. Chen et al. [13] proposed an embedded drone system and deep learning to recognize insects in a tree. Li et al. [7] studied five state-of-the-art deep learning architectures for image recognition of ten categories of crop pests. Thenmozhi and Reddy [14] proposed an improved deep convolutional network, outperforming fine-tuned models in insect pest recognition. Deep learning has been the most used method for visual insect recognition using image data sets but, to our knowledge, no previous work has approached the problem by separating maturity stages and using few samples.
Recently, the IP102 data set [15], with more than 75,000 images of 102 categories of insects and mixing samples in different life stages such as egg, larva, pupa, and adult, has been put together and made available for research on this topic. Although it is a major advance in data availability for insect pest recognition, IP102 is unbalanced across many species [15]. Moreover, because IP102 keeps different life stages of an insect in the same class, the automatic visual recognition task becomes even more difficult, mainly due to the large intra-class morphological variation of the samples [15].

Learning from a Small Amount of Data: Few-Shot Learning
Machine learning is a sub-area of artificial intelligence where computer programs are designed to solve tasks T, based on gathering experience E, and approximating an objective function using a performance measure P [16]. Despite its success in data-intensive applications, obtaining large amounts of supervised data (i.e., the experience E) is not always feasible. Learning from a small number of samples may be possible, though, if prior knowledge from a few categories can be gathered and subsequently applied to further categories [17]. Few-shot learning (FSL) refers to this problem of learning using few samples, with interesting scenarios, approaches, and learning issues depending on the area addressed [18].
FSL [18] is a learning approach that seeks to bring machine learning closer to human learning in the challenging task of learning from very few samples. One important category of FSL methods is metric-based meta-learning [19]. Figure 1 provides an overview of metric-based FSL under the meta-learning paradigm. Given a labeled training data set from a particular problem, the goal is to learn concepts in an embedding space, through training tasks, in order to generalize to the classes of test tasks from a novel problem by using a similarity metric. Convolutional neural networks (CNN) are commonly used as the embedding functions f and g for image feature extraction.
In this paradigm, an FSL model is typically trained through several N-way, K-shot classification tasks. A classification task is referred to as a training episode. In an episode, the support set S is composed of N classes containing K samples from each of them (i.e., |S| = N × K), and the query set Q consists of q samples from each of the same classes (i.e., |Q| = N × q). In Figure 1, a two-way, two-shot task with one query image from a particular class is shown for demonstration. The goal of the model is to label the Q images into the N classes of the task. Furthermore, following meta-learning guidelines, a source set is used for the training tasks and a target set for the test tasks, with no class overlap between the source and target sets.
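The episode construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `sample_episode`, the dictionary layout of the data, and the toy file names are assumptions made for the example and are not taken from the original implementation.

```python
import random

def sample_episode(data, n_way, k_shot, q, rng=None):
    """Draw one N-way, K-shot episode with q query samples per class.

    `data` maps a class label to a list of samples (hypothetical layout).
    Returns (support, query) as lists of (sample, label) pairs.
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(data), n_way)           # pick N classes
    support, query = [], []
    for label in classes:
        picked = rng.sample(data[label], k_shot + q)    # disjoint samples
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy source set: 6 classes with 10 placeholder image names each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(10)] for c in range(6)}
support, query = sample_episode(toy, n_way=2, k_shot=2, q=1,
                                rng=random.Random(0))
```

With the two-way, two-shot, one-query setting of Figure 1, the support set holds N × K = 4 samples and the query set N × q = 2 samples, and no sample appears in both.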
At present, several areas have benefited from FSL approaches, including image classification [20] and object detection in images [21], with great potential for agricultural applications. Few-shot learning enables the construction of models with a drastically reduced number of parameters, which facilitates deployment in embedded and mobile systems. Li and Yang [22] classified cotton crop pests using prototypical nets in an embedded terminal. Yang et al. [23] improved the results of prototypical nets by combining recognition and object localization. They proposed a salient region detection mechanism, which represents the region with the highest discriminatory characteristics for insect classification. Li and Yang [19] analyzed the cross-domain few-shot classification problem in agriculture, using insect and plant leaf disease data sets. Their results showed that the mixed domain, in which meta-training and meta-testing use classes of both types of data together, produces better results.

Figure 1. Meta-metric few-shot learning example representation. Illustrated settings of tasks are 2-way, 2-shot, and one query image. In some approaches, the embedding functions f and g may be the same. Query images are labeled according to a similarity score with support embeddings.
In this research, we propose to address insect pest recognition by first assembling a new image data set (IP-FSL), derived from IP102, that separates classes into two maturity stages: early and adult. We then frame maturity-stage insect classification as a few-shot learning problem and leverage a prototypical network by including divergence measures as similarity functions. We see a need in agronomy for an effective insect management tool that rapidly classifies insects by maturity stage from field images. We achieved accuracies of 86.33% and 87.91% for the adult and early insect classes, respectively, on the IP-FSL data set.
The remainder of this paper is organized as follows. Section 2 describes the materials and methods, and Section 3 describes the experiments. Section 4 presents the results, Section 5 discusses them, and Section 6 summarizes the conclusions.

The Meta IP-FSL Data Set
IP102 [15] is a large insect pest data set that provides 75,222 images distributed in 102 classes. Most classes cover different stages of the insect life cycle such as egg, larva, pupa, and adult. Its taxonomy comprises two major agricultural crop groups: field crops (rice, corn, wheat, beet, and alfalfa), and economic crops (vitis, citrus, and mango).
We thoroughly analyzed the IP102 to rearrange classes, and to select samples, according to two biological stages of the pests, adult and early stages, and assembled a new data set, IP-FSL (insect pests for few-shot learning), for few-shot learning. Different maturity stages in the same class can make it difficult to learn patterns from a particular class, because visually they are far apart, and consequently, the resultant classifications of the learning algorithms may be misleading. By separating the biological maturity stages, we expect two advantages: (1) providing a more discriminative feature extraction for the classes, and (2) more accurate recognition in the early stages of the pest, which is important to control the spread of the insect pest.
We built IP-FSL by selecting a maximum of 50 images from each class in IP102. This upper limit was chosen because of the large diversity of the data set; some categories contain fewer images, as shown in Table 1. For the Early stage subset, we considered images containing eggs, larvae, or pupae. The Adult stage subset includes young and adult insects. As a selection criterion, images taken under field conditions were chosen. For species with images of both stages, we created a separate class for the species in each of the two subsets.
The final configuration of the IP-FSL data set is presented in Figure 2. It has a total of 6817 insect images. The Early stage subset is composed of 45 insect classes, totaling 2050 images. The Adult stage subset consists of 97 classes, totaling 4767 images. Figure 3 shows some examples of the IP-FSL data set, and class names and amounts are presented in Table 1.
Table 1. IP-FSL image data set information, derived from IP102 (Insect Pest 102) [15], assembled specifically for this few-shot learning research. The names of the insects were kept as published in the original source (IP102), and the categories may contain common as well as scientific names. Some classes have two maturity stages (adult and early) and appear in both subsets, for example classes 2, 31, 67, 74, 87, and 89.

Metric-Based Multi-Class Networks
FSL algorithms learn through tasks to adapt to new tasks, as shown in Figure 1. Matching [24] and Prototypical [25] networks are competitive few-shot metric-based multi-class approaches. Matching and prototypical networks originally use cosine and Euclidean distances as similarity measures. We propose here to leverage and evaluate those frameworks using other divergences, such as Mahalanobis, Kullback-Leibler, and Itakura-Saito. This group of divergences, also called Bregman divergences, measures differences between distributions, and as we are going to show in this research, can produce even better results in this FSL setting.

Matching Networks
Matching networks [24] are examples of multi-class classification. They consist of two embedding functions for feature extraction, f and g, which are appropriate convolutional neural networks (CNN), potentially with f = g. An attention mechanism uses the cosine similarity to compare a test sample $\hat{x}$ with the samples in the support set, from which the class probability is obtained, as given in Equation (1):
$$P(\hat{y} \mid \hat{x}, S) = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i, \tag{1}$$
where the sum runs over the k support samples $(x_i, y_i)$, and in which a(., .) is the attention mechanism described as Equation (2):
$$a(\hat{x}, x_i) = \frac{\exp\big(c(f(\hat{x}), g(x_i))\big)}{\sum_{j=1}^{k} \exp\big(c(f(\hat{x}), g(x_j))\big)}, \tag{2}$$
c(., .) is the cosine similarity, and f and g are the embedding functions.
In general, matching networks change the way samples are embedded by conditioning the support and query embeddings on the support set S through a full context embeddings (FCE) process. Query and support images go through the f and g networks, respectively, for feature extraction. Matching nets predict the probability of query samples by measuring the cosine similarity between support and query embeddings.
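As an illustration of the attention-based prediction described above, the following sketch computes a softmax over cosine similarities between a query embedding and the support embeddings, and accumulates the attention weights per class. It is a simplified sketch under stated assumptions: the embeddings are given directly as short vectors (no CNN, no FCE step), and the names `cosine`, `attention`, and `predict` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def attention(query, support_embeddings):
    """Softmax over cosine similarities between query and each support sample."""
    sims = [cosine(query, s) for s in support_embeddings]
    top = max(sims)                                  # for numerical stability
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def predict(query, support):
    """Sum attention weights per label; `support` holds (embedding, label) pairs."""
    weights = attention(query, [emb for emb, _ in support])
    probs = {}
    for w, (_, label) in zip(weights, support):
        probs[label] = probs.get(label, 0.0) + w
    return probs

# Toy 2-way, 2-shot task with 2-dimensional embeddings (assumed values).
support = [([1.0, 0.0], "a"), ([0.9, 0.1], "a"),
           ([0.0, 1.0], "b"), ([0.1, 0.9], "b")]
probs = predict([1.0, 0.2], support)
```

Because cosine similarities are bounded in [-1, 1], the resulting distribution is fairly soft; the query is still assigned to the class whose support samples point in the most similar direction.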

Prototypical Networks
The prototypical net (ProtoNet) architecture [25] consists of a CNN for image feature extraction and a classifier based on Euclidean distance. The main idea is that the centroids of the support embeddings (prototypes) yield relevant class representatives. ProtoNet aims to learn a metric in the feature space that represents similarity by distance for image predictions. Query images are labeled by finding the closest class prototype.
Each prototype corresponds to the average of the class embeddings, calculated according to Equation (3):
$$c_n = \frac{1}{|S_n|} \sum_{(x_i, y_i) \in S_n} f_\phi(x_i), \tag{3}$$
where $c_n$ represents the centroid of class n, $S_n$ is the set of support samples of class n, and $f_\phi$ is the embedding function with parameters $\phi$. Query images are classified according to a probability distribution. Such probabilities are given by a softmax over the distances between prototypes and query embeddings, according to Equation (4):
$$p_\phi(y = n \mid x) = \frac{\exp\big(-d(f_\phi(x), c_n)\big)}{\sum_{n'} \exp\big(-d(f_\phi(x), c_{n'})\big)}, \tag{4}$$
where d is the distance function. ProtoNet learning proceeds by minimizing the negative log-probability $J(\phi) = -\log p_\phi(y = n \mid x)$ of the true class n via stochastic gradient descent (SGD).
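The prototype computation and the softmax over negative distances can be sketched as follows. This is a minimal sketch, assuming the embeddings are already available as plain vectors and using the squared Euclidean distance; the function names are hypothetical, and the `dist` argument is left pluggable so that another divergence could be substituted.

```python
import math

def prototype(embeddings):
    """Mean of the support embeddings of one class."""
    dim = len(embeddings[0])
    return [sum(e[j] for e in embeddings) / len(embeddings) for j in range(dim)]

def sq_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def class_probabilities(query, prototypes, dist=sq_euclidean):
    """Softmax over negative distances between a query and each prototype."""
    scores = {n: -dist(query, c) for n, c in prototypes.items()}
    top = max(scores.values())                       # for numerical stability
    exps = {n: math.exp(s - top) for n, s in scores.items()}
    total = sum(exps.values())
    return {n: e / total for n, e in exps.items()}

def episode_loss(query, true_class, prototypes, dist=sq_euclidean):
    """Negative log-probability of the true class for one query."""
    return -math.log(class_probabilities(query, prototypes, dist)[true_class])

# Toy 2-way, 2-shot support set with 2-dimensional embeddings (assumed values).
support = {"a": [[0.0, 0.0], [2.0, 0.0]], "b": [[4.0, 4.0], [6.0, 4.0]]}
prototypes = {n: prototype(embs) for n, embs in support.items()}
probs = class_probabilities([1.0, 1.0], prototypes)
loss = episode_loss([1.0, 1.0], "a", prototypes)
```

In a trainable model, the loss would be averaged over the query set of an episode and back-propagated through the embedding network; that part is omitted here.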
The learning structure is an important factor in metric models, but the performance also depends on the chosen similarity metric [24,25]. In the next section, other divergences, which are used to quantify the similarity between distributions but have not been fully considered for FSL frameworks, are reviewed for further use in this proposal.

Leveraging FSL with Other Divergences
Bregman divergences have been applied to optimization, clustering, and machine learning problems [26][27][28], but have not been fully explored in FSL. This group of divergences establishes a generalized measure between distributions, defined in terms of a strictly convex function [29]. Given a continuously differentiable, strictly convex function $F : S \to \mathbb{R}$, defined on a convex domain $S \subseteq \mathbb{R}^d$, the Bregman divergence between $x, y \in S$ induced by F is defined as
$$D_F(x, y) = F(x) - F(y) - \langle \nabla F(y),\, x - y \rangle, \tag{5}$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product, and $\nabla F(y)$ represents the gradient vector of F evaluated at y. Bregman divergences have pertinent properties, among them non-negativity, $D_F(x, y) \geq 0$, with $D_F(x, y) = 0$ if and only if x = y. Furthermore, with some exceptions, Bregman divergences are asymmetric, that is, $D_F(x, y) \neq D_F(y, x)$ in general. The concepts of three main Bregman divergences for similarity measurement are presented in the next sections.
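The general definition can be checked numerically with a short sketch. The helper `bregman` below is a hypothetical name and takes the generating function and its gradient as arguments; two generators are tried, the squared norm (which recovers the squared Euclidean distance) and the negative entropy with natural logarithm (which illustrates the asymmetry).

```python
import math

def bregman(F, grad_F, x, y):
    """D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_F(y), x, y))
    return F(x) - F(y) - inner

# Generator 1: squared norm, whose Bregman divergence is the squared
# Euclidean distance.
def sq_norm(v):
    return sum(a * a for a in v)

def grad_sq_norm(v):
    return [2 * a for a in v]

# Generator 2: negative entropy (natural logarithm), defined for positive vectors.
def neg_entropy(v):
    return sum(a * math.log(a) for a in v)

def grad_neg_entropy(v):
    return [math.log(a) + 1 for a in v]

d_euclid = bregman(sq_norm, grad_sq_norm, [1, 2], [3, 5])   # (1-3)^2 + (2-5)^2

p, r = [0.5, 0.5], [0.9, 0.1]
d_pr = bregman(neg_entropy, grad_neg_entropy, p, r)
d_rp = bregman(neg_entropy, grad_neg_entropy, r, p)         # differs from d_pr
```

The squared norm is one of the exceptions that gives a symmetric divergence, whereas swapping the arguments under the negative entropy changes the value.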

Squared Mahalanobis Divergence
The Mahalanobis divergence, generated by the convex function $F(x) = x^T A x$, is defined as a distance between a point and a distribution. For this reason, it takes into account the covariance between the variables. The Mahalanobis distance between a vector x and a distribution y can be calculated by Equation (6):
$$D_F(x, y) = (x - y)^T A\, (x - y), \tag{6}$$
which is called the Mahalanobis distance when A is the inverse of the covariance matrix. Equation (6) addresses a limitation of the Euclidean distance when the data are linearly correlated: it transforms the variables into uncorrelated variables by scaling them through the covariance matrix. In this way, Equation (6) corresponds to computing the Euclidean distance on the scaled data.
In this work, the low time cost for estimating the covariance matrix was prioritized. Therefore, we assumed that the covariance estimation based on task prototypes yields relevant results with low training time. That way, x corresponds to the query embeddings set and y the prototypes set of a task. This approach allows using K-shot ≥ 1 without paradigm shifts in the covariance estimation algorithm.
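The link between the generating function $F(x) = x^T A x$ and the quadratic form referred to as Equation (6) can be verified by substituting F into the general Bregman definition. The short derivation below assumes that A is symmetric, as is the case for an inverse covariance matrix, so that $\nabla F(y) = 2Ay$.

```latex
\[
\begin{aligned}
D_F(x, y) &= F(x) - F(y) - \langle \nabla F(y),\, x - y \rangle \\
          &= x^T A x - y^T A y - 2\, y^T A\, (x - y) \\
          &= x^T A x - 2\, y^T A x + y^T A y \\
          &= (x - y)^T A\, (x - y).
\end{aligned}
\]
```

Setting A to the identity matrix recovers the squared Euclidean distance, which is consistent with reading the Mahalanobis form as a Euclidean distance on rescaled data.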

Kullback-Leibler Divergence
KL-divergence, or relative entropy, is generated by the negative entropy, which for a discrete distribution is the convex function $F(x) = \sum_{j=1}^{d} x_j \log_2 x_j$. It quantifies the difference between two probability distributions. The Bregman divergence between two discrete probability distributions x and y generated by this function is the KL-divergence, described as:
$$D_{KL}(x \,\|\, y) = \sum_{j=1}^{d} x_j \log_2 \frac{x_j}{y_j}. \tag{7}$$
For the experiments in this work, the embeddings and prototypes from the task were transformed into probability vectors for the KL-divergence computation. In other words, given a feature vector y', the new probability vector is calculated as y = y'/sum(y') such that $\sum_{j=1}^{d} y_j = 1$. A small constant was added to the vectors before calculating the divergence to avoid infinite results (log 0) or division by zero.
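The normalization and smoothing steps described above can be sketched as follows. Several details are assumptions made for the example and are not specified in the text: the feature vectors are taken to be non-negative, the constant is set to `eps = 1e-8`, and it is added before the normalization. The function names are hypothetical.

```python
import math

def to_probability(v, eps=1e-8):
    """Shift a non-negative feature vector by eps and normalize it to sum to 1."""
    shifted = [a + eps for a in v]
    total = sum(shifted)
    return [a / total for a in shifted]

def kl_divergence(x, y, eps=1e-8):
    """KL-divergence (base 2) between the probability vectors built from x and y."""
    p, r = to_probability(x, eps), to_probability(y, eps)
    return sum(a * math.log2(a / b) for a, b in zip(p, r))

same = kl_divergence([1.0, 3.0], [2.0, 6.0])        # same direction: close to 0
one_bit = kl_divergence([1.0, 0.0], [0.5, 0.5])     # close to 1 bit
finite = kl_divergence([1.0, 0.0], [0.0, 1.0])      # finite thanks to eps
```

Without the added constant, the last call would involve log 0 and a division by zero; with it, the divergence is large but finite. Such a function could be passed as the distance in a prototype-based classifier in place of the Euclidean distance.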

Itakura-Saito Divergence
The Itakura-Saito divergence, or IS-divergence, is an asymmetric measure widely used in signal processing. IS-divergence is generated by the function $F(x) = -\sum_{j=1}^{d} \log x_j$, and it can be calculated as follows:
$$D_{IS}(x, y) = \sum_{j=1}^{d} \left( \frac{x_j}{y_j} - \log \frac{x_j}{y_j} - 1 \right). \tag{8}$$
As in the KL computation, probability vectors are generated and a constant is added for the IS-divergence computation.
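The form of the IS-divergence follows from the general Bregman definition with the generator $F(x) = -\sum_j \log x_j$, whose gradient has components $\partial F / \partial y_j = -1/y_j$. A short derivation, assuming strictly positive vectors x and y:

```latex
\[
\begin{aligned}
D_F(x, y) &= F(x) - F(y) - \langle \nabla F(y),\, x - y \rangle \\
          &= -\sum_{j=1}^{d} \log x_j + \sum_{j=1}^{d} \log y_j
             + \sum_{j=1}^{d} \frac{x_j - y_j}{y_j} \\
          &= \sum_{j=1}^{d} \left( \frac{x_j}{y_j} - \log \frac{x_j}{y_j} - 1 \right).
\end{aligned}
\]
```

Each term depends only on the ratio $x_j / y_j$, so the divergence is unchanged when x and y are multiplied by the same positive factor, and it becomes unbounded when a component of either vector approaches zero, which is why a small constant is needed in practice.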

Experiments
The experiments were conducted using two main few-shot models, prototypical and matching networks. We organized the experiments in three scenarios: (I) classification of the Mini-ImageNet data set as a baseline experiment for choosing the best network, (II) insect classification at the adult maturity stage, and (III) insect classification at the early stage.

Prototypical Nets
The episode training of the prototypical net starts by randomly selecting N classes from the source set. Figure 4 presents the framework with a three-way task for demonstration. After that, the data for the respective task are divided into the support set S and the query set Q, according to the previously assigned parameters N-way, K-shot, and q, and the CNN embeds all images to generate support and query embeddings. Prototypes are computed from the support set as class representatives. The divergences are then computed between prototypes and query embeddings to classify the Q images according to a probability distribution over divergences.

Matching Nets
The matching net episode training (Figure 5) differs from the prototypical episode in two aspects: (1) it uses a mechanism to generate full context embeddings (FCE), and (2) similarities are computed between query and support embeddings, instead of between query embeddings and prototypes.
A test episode is similar to a training episode for both networks, except that the parameters of the models are frozen during testing time. Furthermore, the target set is used instead of the source set. Finally, the test episode ends with the classification of Q test images, where the accuracy of the model is computed.

Experiment I
The first experiment evaluates prototypical and matching networks, along with different divergences, on a public benchmark data set. The goal is to choose the best model for insect classification, based on experimental results on a consolidated benchmark. Mini-ImageNet [24] is a widely used benchmark for few-shot classification. This data set consists of 100 classes, each containing 600 images. Here, the classes were divided into 80% for model training (source set) and 20% for testing (target set), following a conventional training/testing division [16]. We evaluated the models in one-shot and five-shot settings. Moreover, we used five-way tasks with q = 15.
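A class-level 80:20 split of this kind can be sketched as below. The function name `split_classes`, the fixed seed, and the use of rounding to place the cut are assumptions made for the example; the text does not state how the classes were assigned to each set.

```python
import random

def split_classes(class_names, train_fraction=0.8, seed=0):
    """Split class names into disjoint source (training) and target (test) sets."""
    names = sorted(class_names)
    random.Random(seed).shuffle(names)               # reproducible shuffle
    cut = round(train_fraction * len(names))
    return names[:cut], names[cut:]

# Mini-ImageNet has 100 classes; placeholder names are used here.
all_classes = [f"class_{i:03d}" for i in range(100)]
source, target = split_classes(all_classes)
```

The split is made over classes rather than over images, so every class seen at test time is new to the model.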

Experiment II
This experiment aims to use the model with the highest accuracy provided by Experiment I. The training steps of the respective model, as presented in Section 3.1, were used to classify insects only in adult maturity stage. We divided the adult stage subset classes at a rate of 80:20 for training and testing [16], respectively. Therefore, the model was tested on classes unseen in training tasks. Different divergences were evaluated as a few-shot similarity function in n-way k-shot parameters settings to obtain the best model performance. For all experiments, therefore, we analyzed one-shot and five-shot in three-way and five-way tasks. In addition, q = 5 was fixed for all experiments.

Experiment III
The third experiment consists of insect classification only at the early maturity stage. The procedures match with those in Experiment II, including the network used. We divided early stage subset classes at a rate of 80:20 for training and testing [16], and then evaluated the classification tasks for one-shot and five-shot settings related to three-way and five-way tasks. Moreover, we set q = 5.

Experiments Setups
All training and testing setups were performed in the same way for insect classification in the two maturity stages. The model takes color (RGB) images as input without any further image preprocessing. However, some transformations were carried out to standardize the images and to increase the number of classes. Initially, all the images in IP-FSL were resized to 96 × 96 × 3 and rotated in multiples of 90° to create new augmented classes. Thus, after rotations up to 270°, each subset ended with four times the initial number of classes, keeping the same number of samples in each class.
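The rotation-based class augmentation can be illustrated with a small sketch in which images are represented as nested lists. The rotation direction (clockwise), the `(label, angle)` naming of the new classes, and the function names are assumptions made for the example; resizing is omitted.

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment_classes(dataset):
    """Create one new class per rotation angle (0, 90, 180, 270 degrees).

    `dataset` maps a class label to a list of images (hypothetical layout).
    """
    augmented = {}
    for label, images in dataset.items():
        current = images
        for angle in (0, 90, 180, 270):
            augmented[(label, angle)] = current
            current = [rotate90(img) for img in current]
    return augmented

# Toy data set: 2 classes with 2 tiny single-channel "images" each.
toy = {
    "beetle": [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
    "moth": [[[0, 1], [1, 0]], [[2, 2], [3, 3]]],
}
augmented = augment_classes(toy)
```

Each original class yields four classes with the same number of samples, so 45 early-stage classes become 180 and 97 adult classes become 388.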
The experiments were conducted on the Google Colaboratory platform. The PyTorch library, version 1.9.0, was used for writing and training the models. The model was trained using the source set for 10 epochs, with 2000 episodes per epoch. We therefore carried out 20,000 training episodes for each combination of N-way, K-shot, and divergence. The initial learning rate of $10^{-3}$ is halved after each epoch.
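The training budget and the learning rate decay described above amount to simple arithmetic, sketched here in plain Python. The function name `learning_rate` is hypothetical; in PyTorch, a step scheduler with a step size of one epoch and a decay factor of 0.5 would produce the same sequence, although the text does not state which scheduler was used.

```python
EPOCHS = 10
EPISODES_PER_EPOCH = 2000
INITIAL_LR = 1e-3

def learning_rate(epoch, initial=INITIAL_LR, factor=0.5):
    """Learning rate used during a given epoch (0-based), halved after each epoch."""
    return initial * factor ** epoch

schedule = [learning_rate(e) for e in range(EPOCHS)]
total_episodes = EPOCHS * EPISODES_PER_EPOCH
```

Under these settings, the last epoch runs with a learning rate of about 1.95e-6, roughly 500 times smaller than the initial value.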
In few-shot learning, the results are commonly presented as the average of several testing tasks. In this work, the average accuracy of 1000 testing episodes for each experiment is computed and shown.
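The reported figure is therefore a mean of per-episode accuracies, which can be sketched as follows. The function names and the toy predictions are assumptions made for the example.

```python
def episode_accuracy(predicted, actual):
    """Fraction of query images labeled correctly in one episode."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def mean_accuracy(episodes):
    """Average accuracy over a list of (predicted, actual) label lists."""
    scores = [episode_accuracy(p, a) for p, a in episodes]
    return sum(scores) / len(scores)

# Two toy test episodes with three query images each.
episodes = [
    (["a", "b", "a"], ["a", "b", "b"]),   # 2 of 3 correct
    (["c", "c", "d"], ["c", "c", "d"]),   # 3 of 3 correct
]
average = mean_accuracy(episodes)
```

In the experiments, the list would hold 1000 test episodes rather than two.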

Results
The models learn image features to differentiate classes through a set of divergences. In addition to the Euclidean, we investigated the results of the Mahalanobis, Kullback-Leibler, and Itakura-Saito divergences. Our implementation, therefore, integrates these divergences as dissimilarity measures for Mini-ImageNet data set classification in Experiment I, and for insect classification of adult and early life cycle, through Experiments II and III, respectively.

Experiment I: Mini-ImageNet Classification
Experiment I was carried out to evaluate the performance of the prototypical and matching networks on the Mini-ImageNet data set with the proposed divergences, in order to choose the most appropriate model. Table 2 presents these results, in which bold numbers indicate the best accuracy for each model and related K-shot setting.
Table 2. Results for Experiment I (Mini-ImageNet). ED: Euclidean distance, MD: Mahalanobis distance, KL: KL-divergence, IS: IS-divergence.
From the results in Table 2, both networks show accuracies close to each other. However, prototypical networks achieved the highest accuracy of 0.7097. Because of this, further experiments were performed with prototypical networks on the IP-FSL data set.

Experiment II: Adult Stage Insect Classification
For the adult stage subset, prototypical network training and testing were carried out in a meta-learning procedure, that is, performed on source and target sets, respectively, without class overlap. A set of 16 experiment configurations was carried out for evaluation. For each of the divergences investigated, we evaluated three-way and five-way tasks in one-shot and five-shot settings, and the results are presented in Table 3.

Experiment III: Early Stage Insect Classification
For the early stage experiments, model training and testing were also performed under the meta-learning paradigm using source and target sets, respectively, with no class overlap between them. Table 4 gives the results.

Discussion
In this study, we have addressed the important problem of insect pest image recognition in adult and early-stage maturity categories, using few samples and a leveraged prototypical network learning approach. Since insects at different stages can damage crops to different levels, recognizing specific stages can help mitigate the spread and the impact of further damage on crops. For this reason, we assembled the IP-FSL data set based on two life-cycle stages, designed a few-shot experimental approach to differentiate insects by leveraging state-of-the-art models with other divergences for similarity measurement, and compared their performance. Encouraging general results were obtained for the two maturity stages, comparable with results in the literature but using fewer samples. We achieved high accuracy in both categories, adult and early, of 86.33% and 87.91%, respectively.
Insect pest recognition is a challenging and relevant issue in agriculture, and in entomology in general [4]. Practical applications require rapid and accurate visual recognition to control infestations in crops. Two ways to address it are to use few-sample machine learning algorithms and to train them on each maturity stage of the pests separately. In our approach, we propose to address the insect maturity stage categories separately, since they are visually very distinct. Figure 6 presents a sample classification task performed in Experiment II, for adult insect recognition, in which 15 query images are labeled according to three-way classes. The performance for the three-way tasks in Experiment II, given in Table 3, showed that our approach recognized insects with the best accuracy of 77.97% in one-shot and 86.33% in five-shot using KL-divergence. In the five-way tasks, the best performance reached 66.4% in one-shot using KL, and 77.68% in five-shot using IS-divergence, although KL came very close with 77.43% in five-shot. While IS outperforms KL in the five-way, five-shot setting, the difference is very small (0.25 percentage points). We assume that KL-divergence has an advantage for adult stage insect classification, since it was the most accurate by a larger margin in the three-way case. In contrast, Euclidean and Mahalanobis distances yielded considerably lower accuracy. KL and IS were shown to be promising approaches to measure the dissimilarity between adult insects, with the best accuracy achieved by KL, which improved the final performance of the few-shot model.
In the early stage insect classification (Experiment III), KL and IS also yielded better accuracy, with KL achieving the best results in all settings. The three-way setting procedure is presented in Figure 7, where 15 query images are labeled according to three task classes. In this situation, KL-divergence achieved the best accuracy of 81.67% in one-shot and 87.91% in five-shot. In five-way tasks, KL is also the best similarity approach, achieving 69.06% and 80.72% for one-shot and five-shot, respectively.
As seen in Experiments I, II, and III, accuracy increases as K-shot increases. K-shot represents the number of support images in each insect class. It is reasonable to presume that the network learns more information when a greater number of class images is explored, so accuracy could possibly be further improved in a 10-shot setting. In a few-shot context, however, it is important that K is not high.
In contrast to K-shot, accuracy increases when N-way is smaller. N-way represents the number of classes within a classification task, among which the model needs to label the query images. It is also reasonable to presume that a greater number of candidate classes makes it harder to label query images correctly. We observed that insects in the early maturity stages are more accurately classified in IP-FSL. A possible reason is that adult insects have a higher visual similarity, which makes it difficult to label images correctly. This may explain the importance of identifying a more suitable similarity metric in few-shot learning, such as KL-divergence as shown.
To the best of our knowledge, no other work has yet approached this proposed split into maturity categories (early and adult). Regarding the whole IP102 data set, Wu et al. [15] reported an accuracy of 49.5%, an accuracy of 55.2% was achieved in [30], and more recently Nanni et al. [31] reported an accuracy of 74.11%. Xie et al. [32] proposed an insect image data set with 4508 images divided into 40 classes, and designed a multi-level feature learning procedure to represent the categories before classification. Their accuracy was 89.30% on their data set. On smaller insect data sets, Ayan et al. [33] experimented with an ensemble procedure that combines CNN models using a genetic algorithm to weight the results in the classifier. They tested it on a data set by [34] with 562 images and 10 classes, and achieved a best accuracy of 95.16%. Deng et al. [34] obtained an accuracy of 85.50% on their proposed data set.
As compared with the other works from the literature, our work here has brought in the following novelties:

• The IP-FSL data set with 6817 insect pest images, divided by species maturity stage (97 adult classes and 45 early-stage classes);
• A few-shot leveraged prototypical network for classification, which achieved 86.33% and 87.91% accuracy for the adult and early categories, respectively.
These results are relevant for the classification of insect pests using few samples. We see it as a promising approach for practical field applications, especially if it is crop-based, focusing on the most damaging species for a particular crop. Previous works did not focus on specific maturity stages to classify insects in image databases and, as discussed, the accuracy rates they report are comparable with the results presented here.

Conclusions
We have approached insect pest image recognition with few samples, separating maturity stages, by means of an improved few-shot learning network. We have proposed a data set, IP-FSL, with 6817 samples of adult (97 classes) and early-stage (45 classes) insect pests, derived from IP102 and organized for this problem. We evaluated other divergences along with state-of-the-art FSL matching and prototypical networks, and we have shown that a leveraged prototypical network with KL-divergence is the most promising for this setting.
Our results on the adult and early stages of the insect pests achieved 86.33% and 87.91% accuracy, respectively, in the three-way, five-shot experiments, which are high figures even when compared to other approaches with only adult classes.
Future directions to be explored include studies on cross-domain shifts in insect pest recognition, a focus on specific crops and related insect ecosystems, and the deployment of mobile applications to help agronomists in detecting and identifying potential insect infestations on crops.