Detection of Tomato Leaf Miner Using Deep Neural Network

As a result of climate change and global warming, plant diseases and pests are drawing attention because they are dispersing more quickly than ever before. The tomato leaf miner destroys the growth structure of the tomato, resulting in 80 to 100 percent tomato loss. Despite extensive efforts to prevent its spread, the tomato leaf miner can be found on most continents. To protect tomatoes from the tomato leaf miner, inspections must be performed on a regular basis throughout the tomato life cycle. To find a better deep neural network (DNN) approach for detecting the tomato leaf miner, we investigated two DNN models, one for classification and one for segmentation. The same RGB images of tomato leaves captured from real-world agricultural sites were used to train the two DNN models. Precision, recall, and F1-score were used to compare the performance of the two DNN models. In terms of diagnosing the tomato leaf miner, the DNN model for segmentation outperformed the DNN model for classification, with higher precision, recall, and F1-score values. Furthermore, there were no false negative cases in the predictions of the DNN model for segmentation, indicating that it is adequate for detecting plant diseases and pests.


Introduction
Protection of crops from plant disease is a problem that is intertwined with agriculture and climate change [1]. Climate change caused by global warming alters host resistance and pathogenic rates, as well as affects physiological interaction between hosts and pathogens [2]. The plant disease problem has worsened as various plant diseases spread faster than ever before around the world. The possibility of plant diseases emerging in previously non-affected regions has increased, and there is a lack of local expertise to treat the new plant disease in the regions [3].
Tomato leaf miner causes crop losses of around 80 to 100% in places where it is present. By invading leaves, stems, flowers, and fruit, the tomato leaf miner destroys the structure of the tomato growth. It is extremely difficult to prevent the spread of the tomato leaf miner. Despite significant efforts to prevent its migration, the tomato leaf miner took only about three years to spread across southern Europe after it was first identified in Spain. Now the tomato leaf miner is found in most South American countries, Southern Europe, Northern Africa, and West Asia [4].
Leaves, the most vulnerable component of a plant, are where disease symptoms first appear [5]. From the very beginning of their life cycle until they are ready to be harvested, the crops need to be inspected in a timely manner to protect the crops against various plant diseases. Agricultural specialists conventionally observed agricultural fields using the time-consuming approach of naked-eye surveillance to keep a check on the plant leaves for symptoms of diseases [6].
In agriculture, computer vision tools have largely supplanted the naked eye in identifying plant diseases and pests. Such tools have frequently employed conventional image processing algorithms, which require handcrafted feature design along with classifiers to detect plant diseases and pests. Computer vision tools increase detection performance by creating imaging schemes and selecting appropriate light sources and shooting angles based on the characteristics of the plant diseases and pests.
Although handcrafted imaging schemes help computer vision tools detect plant diseases and pests, they also increase the application cost. Furthermore, in a complex natural environment, plant diseases and pests are difficult to identify via handcrafted computer vision tools because it is hard to expect that the traditional computer vision tools completely exclude the influence of low contrast, large variations in the scale, image noise, and disturbances under natural light [7].
The capacity to directly use raw data without using a handcrafted feature extractor is a significant advantage of deep neural network (DNN) models [8]. DNN models especially based on convolutional neural networks (CNN) have shown success in recent years when used in a variety of computer vision applications, such as traffic detection, medical image recognition, scenario text detection, and face recognition [9][10][11][12].
In this study, DNN-based approaches for classification and segmentation were used to diagnose the tomato leaf miner. Two DNN models were trained using the same RGB images of tomato leaves captured from real-world agricultural sites. The diagnosis performance of the two DNN models was evaluated and compared. Based on the diagnosis performance, one of the two DNN models was suggested for tomato leaf miner detection.

Dataset Description
AI Hub is operated by the National Information Society Agency of the Republic of Korea to accelerate the advancement of artificial intelligence technology and its application. Datasets related to natural language, healthcare, autonomous driving, agriculture, livestock, education, and other domains have been released on AI Hub.
The Agricultural Knowledge Base (AKB) dataset [37], one of the agricultural datasets released on AI Hub, was organized by I IMC corporation in 2018. The AKB dataset contains a total of 40,704 RGB images of rose leaves and tomato leaves taken in the laboratory and at real-world agricultural sites. All leaf images in the AKB are labeled as normal or with a type of disease. The rose and tomato leaves in the AKB dataset are labeled with 11 and 17 classes, respectively, including normal leaves.
We processed the AKB dataset in two different ways to train and evaluate two types of deep neural networks (DNNs) applicable to real-world agricultural sites. Images and labels of normal and mined tomato leaves collected from real-world agricultural sites were selected from the AKB dataset. The selected dataset contains 3115 and 3341 pairs of images and labels of normal tomato leaves and mined tomato leaves, respectively. Because the images in the selected dataset had various sizes, all of them were resized to 300 by 300 pixels. The selected and resized dataset (DRN152) was used to train and evaluate a DNN model for image classification. The DRN152 was separated into a training dataset, a validation dataset, and a test dataset with a split ratio of 60%, 20%, and 20%, respectively.
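The 60/20/20 split described above can be sketched in plain Python; the filename pattern and random seed below are illustrative assumptions, not details taken from the study.

```python
import random

def split_dataset(pairs, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle (image, label) pairs and split them into train/val/test."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)  # deterministic shuffle for reproducibility
    n = len(pairs)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = pairs[:n_train]
    val = pairs[n_train:n_train + n_val]
    test = pairs[n_train + n_val:]
    return train, val, test

# 3115 normal + 3341 mined pairs = 6456, as in the selected dataset;
# the (filename, class) tuples here are placeholders.
pairs = [(f"img_{i}.png", i % 2) for i in range(6456)]
train, val, test = split_dataset(pairs)
```

Shuffling before splitting keeps the class mixture roughly even across the three subsets; a stratified split would guarantee it exactly.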
The region of each tomato leaf infected by the tomato leaf miner was manually segmented into a polygonal shape from the resized images in the AKB dataset to generate binary mask images. The binary mask images have a size of 300 by 300 pixels, the same as the resized images. Pixel values where the tomato leaf miner occurred in an image were converted into one, and the other pixel values in the image were converted into zero. When all the pixel values in the image had been converted into one or zero in this way, the image became the binary mask image.
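The conversion from a polygonal annotation to a binary mask can be sketched with an even-odd (ray-casting) point-in-polygon test. The triangle below is a hypothetical annotation, and the 50-pixel grid keeps the example fast; a production pipeline would more likely rasterize polygons with a library such as Pillow or OpenCV.

```python
def point_in_polygon(x, y, poly):
    """Even-odd ray-casting test; poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def polygon_to_mask(poly, size=300):
    """Pixels inside the annotated lesion polygon become 1, the rest 0."""
    return [[1 if point_in_polygon(c + 0.5, r + 0.5, poly) else 0
             for c in range(size)] for r in range(size)]

# Hypothetical lesion annotation (a small triangle), not from the dataset
mask = polygon_to_mask([(10, 10), (40, 10), (10, 40)], size=50)
```

Sampling each pixel at its center (the +0.5 offsets) avoids ambiguity for points lying exactly on polygon edges.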
Pairs of the resized image and the binary mask image (DMRCNN) were used to train and evaluate a DNN model for object segmentation. The DMRCNN was separated into a training dataset, a validation dataset, and a test dataset with a split ratio of 60%, 20%, and 20%, respectively.

Deep Neural Network for Tomato Leaf Miner Classification
Transfer learning is a machine learning method that reuses knowledge previously learned on a different but related problem and applies that knowledge to solve a new problem.
ResNet [38] was developed using a residual learning framework with a shortcut structure to address the issue of DNN performance degrading as depth exceeds a certain number of layers. ResNet152, one of the ResNet structures, was pre-trained on the ImageNet dataset, which contains over 14 million images and 1000 different labels.
It was assumed that the classification process of DRN152 is relevant to that of the ImageNet dataset. ResNet152 was used for transfer learning to classify the DRN152 into a binary class: the normal tomato leaf and the tomato leaf infected by the tomato leaf miner. Figure 1 shows the developed DNN structure for processing DRN152 using transfer learning with ResNet152 (DNNRN152).
The structure of DNNRN152 is shown in Figure 1. In Figure 1, green, blue, grey, and orange boxes denote the pooling layer, convolutional layer, residual module, and fully connected layer, respectively. Red lines indicate shortcut structures for residual learning. DNNRN152 has two individual stages to process the images: a feature extractor and a classifier. Layers conv1 through conv5 in Figure 1 form the feature extractor, which extracts feature maps from the images. The fully connected layers (FCLs) in Figure 1 form the classifier, which makes a prediction using the feature maps.
Layers of the DNNRN152 from conv1 to conv5 reused the structure and weights of the ResNet152 trained on the ImageNet dataset. The structure of the FCLs was determined by finding optimal hyperparameters using Bayesian optimization (BO). BO is a strategy for finding a set of hyperparameters from a hyperparameter space to optimize an objective function that requires a large amount of computational power and, thus, is expensive to evaluate. Table 1 shows the hyperparameter space explored to determine the structure of the FCLs and the training process. The objective function of the BO was set to the F1-score on the validation dataset, which indicates the classification performance (see Equation (3) for more detail). The hyperparameter space was iteratively explored by the BO method to find the optimal set of hyperparameters that maximizes the F1-score. As a result of the BO, the number of layers, the number of neurons, and the dropout rate in each layer of the FCLs were determined as 1, 64, and 0.2, respectively. The rectified linear unit and the softmax were used as the activation functions in the hidden layer and the output layer, respectively. The number of neurons in the output layer was one to deal with the binary classification. For the learning process, the batch size, the optimizer, and the learning rate were set to 64, SGD, and 0.001, respectively.
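As an illustration of searching such a hyperparameter space, the sketch below uses plain random search as a stand-in for BO (BO proper would typically come from a library such as scikit-optimize). The space and the toy objective are illustrative assumptions, not the study's actual Table 1 or validation F1-score.

```python
import random

# Illustrative hyperparameter space in the spirit of Table 1 (assumed values)
SPACE = {
    "num_layers": [1, 2, 3],
    "num_neurons": [32, 64, 128, 256],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.5],
}

def toy_objective(params):
    """Stand-in for the validation F1-score of a trained classifier."""
    return (1.0 / params["num_layers"]
            + params["num_neurons"] / 256
            - abs(params["dropout"] - 0.2))

def random_search(space, objective, n_trials=200, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {k: rng.choice(v) for k, v in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best, score = random_search(SPACE, toy_objective)
```

Unlike random search, BO fits a surrogate model to past evaluations and proposes the next configuration where improvement is most likely, which matters when each evaluation means training a full network.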
The DNNRN152 was trained twice using the training dataset of the DRN152. For the first training process, the weights of the feature extractor were frozen and not trained; only the weights of the classifier were trained. All the weights of the DNNRN152, both the feature extractor and the classifier, were trained during the second training process.

Deep Neural Network for Tomato Leaf Miner Segmentation
A DNN model for segmentation (DNNMRCNN) was trained in a transfer learning manner using Mask R-CNN [39], a type of region-based convolutional neural network. The Mask R-CNN was pretrained on the COCO dataset and implemented using Matterport's library [40] in the TensorFlow environment. The DNNMRCNN was developed to segment and classify regions infected by the tomato leaf miner from a leaf image by training the Mask R-CNN using DMRCNN.
The DNNMRCNN processed a leaf image as shown in Figure 2 and yielded the class, confidence, bounding box, and binary mask for the segmented pixels. The feature maps of the input leaf image were extracted using ResNet101, denoted by (a) in Figure 2. In Figure 2b, a region proposal network [41] generated anchor boxes in regions expected to contain plant diseases. In Figure 2c, the predicted anchor boxes and the feature maps were processed to generate fixed-size feature maps using the region of interest pooling method [42]. The output of the region of interest pooling method was used as input for two types of DNNs: the FCL and the feature pyramid network (FPN) [43]. The FCL, in Figure 2d, classified the anchor boxes as the tomato leaf miner or the background and predicted the position of bounding boxes for the tomato leaf miner. The FPN, in Figure 2e, generated a binary mask image with a value of 1 for the regions infected by the tomato leaf miner and a value of 0 for the remaining regions.
The DNNMRCNN was trained using the DMRCNN. The batch size, optimizer, and learning rate were set to 2, SGD, and 0.001, respectively, during the learning process. The loss and MRCNN class loss on the training and validation datasets are shown in Figure 3. During the learning process of Mask R-CNN, five types of losses are computed. The MRCNN class loss is one of the five losses and represents how successfully the Mask R-CNN classifies the detected object from the image. The loss refers to the aggregate of all five losses. The loss and MRCNN class loss on the training dataset are shown in Figure 3a,b, respectively, and those on the validation dataset are shown in Figure 3c,d.
When there were regions with a value of 1 in the binary mask image predicted from the input leaf image, tomato leaf miner-infected regions were detected in the input leaf image. In this case, the input leaf image was classified as a tomato leaf miner infection. In the opposite case, the input leaf image was classified as normal.

Performance Evaluation Metrics for Developed Deep Neural Networks
The performance of the two developed DNNs for the tomato leaf miner was evaluated using four metrics: the confusion matrix, precision, recall, and F1-score. The DNN for the tomato leaf miner segmentation was additionally evaluated using intersection over union (IoU).
In the classification problem, the confusion matrix compares the prediction results of DNN with the target value and presents the comparison results in matrix form. There are four types of comparison results: true positive (TP), false positive (FP), false negative (FN), and true negative (TN). When the regions infected by the tomato leaf miner are actually present in the leaf image, the TP is the case where the DNN predicts that the leaf image contains the infected regions. The FN is the case in which the DNN predicts that there is no infected region in the leaf image. The FP is the case where the DNN predicts the leaf image contains the infected regions when all of the leaves in the leaf image are normal and do not contain the infected region. The TN is the case in which the DNN predicts that there is no infected region in the leaf image. In other words, the TP and TN are the cases in which the DNN prediction and the target value match. The FP and FN are the cases when those are not matched.
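Counting the four cases from paired predictions and targets can be sketched as follows, with 1 denoting an infected leaf image and 0 a normal one; the label lists are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = infected, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Illustrative targets and predictions for six leaf images
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)  # 2, 1, 1, 2
```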
The precision is the rate at which the leaf image actually contains the infected regions when the DNN predicts the leaf image as the infection of tomato leaf miner. The precision was calculated by Equation (1) using the number of the TP and the FP cases denoted by n(TP) and n(FP), respectively.
The recall is the probability that the DNN prediction and the target value are matched when the leaf image actually contains the infected regions. The recall was calculated by Equation (2) using the number of the TP and the FN cases denoted by n(TP) and n(FN), respectively.
The precision and the recall have an inverse relationship. The F1-score is the harmonic mean of the precision and the recall, and it is used to reflect both the precision and the recall in the DNN performance evaluation for classification. The F1-score was calculated by Equation (3) using the calculation results of the precision and the recall denoted by cal(Precision) and cal(Recall), respectively.
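Equations (1)-(3) correspond to the standard definitions and can be written directly from the counts n(TP), n(FP), and n(FN); the counts in the example are illustrative, not the study's results.

```python
def precision(n_tp, n_fp):
    # Equation (1): n(TP) / (n(TP) + n(FP))
    return n_tp / (n_tp + n_fp)

def recall(n_tp, n_fn):
    # Equation (2): n(TP) / (n(TP) + n(FN))
    return n_tp / (n_tp + n_fn)

def f1_score(p, r):
    # Equation (3): harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# Example with hypothetical counts: 8 TP, 2 FP, 2 FN
p = precision(8, 2)   # 0.8
r = recall(8, 2)      # 0.8
f1 = f1_score(p, r)   # 0.8
```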
The IoU represents the percentage of matches between the human-segmented binary mask image and the DNN-predicted binary mask image. To calculate the IoU in the plant leaf images, an overlapping area and a union area between the plant disease regions segmented by the human and the plant disease regions predicted by the DNN are required. The human-segmented and DNN-predicted plant disease regions were polygonal, making it difficult to calculate their numerical area. The number of pixels in the overlapping area and the union area between the human-segmented and DNN-predicted plant disease regions were counted to substitute the area calculation.
Both the human-generated and the DNN-generated binary mask images were converted into two-dimensional matrices consisting of integers 0 and 1 to be used in counting the number of pixels in the overlapping and union areas. The sum of the two matrices was used to count the pixels in the union area between the human-segmented plant disease regions and the DNN-predicted plant disease regions. Elements with values one and two in the sum result of the two matrices were counted as the number of pixels in the union area. Then, the Hadamard product was computed between the two matrices to count the pixels in the overlapping area between the human-segmented plant disease regions and the DNN-predicted plant disease regions. Elements with the value one were counted as the number of pixels in the overlapping area as a result of the Hadamard product. The IoU was calculated by Equation (4) using the counted number of pixels in the overlapping area and the union area. In Equation (4), the number of pixels in the overlapping area and the union area are denoted by n(AO) and n(AU), respectively.
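The pixel-counting procedure above (matrix sum for the union, Hadamard product for the overlap) and Equation (4) can be sketched on two small binary masks; the 3-by-3 masks are illustrative.

```python
def iou(mask_a, mask_b):
    """Equation (4): IoU = n(A_O) / n(A_U), counted pixel-wise.

    mask_a, mask_b: equally sized 2-D lists of 0/1 integers.
    """
    n_union = 0
    n_overlap = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            # Sum of the two matrices: values 1 and 2 belong to the union.
            if a + b >= 1:
                n_union += 1
            # Hadamard (element-wise) product: value 1 marks the overlap.
            if a * b == 1:
                n_overlap += 1
    return n_overlap / n_union

# Illustrative human-segmented and DNN-predicted masks
human = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]
pred  = [[0, 0, 1],
         [0, 1, 1],
         [0, 1, 0]]
# overlap = 3 pixels, union = 5 pixels
print(iou(human, pred))  # 0.6
```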


Results
The performance of the DNNRN152 and the DNNMRCNN was evaluated using the test dataset. On both the DNNRN152 and the DNNMRCNN, the confusion matrix, the precision, the recall, and the F1-score were calculated. The IoU, on the other hand, was calculated only for DNNMRCNN.

Confusion Matrix
The DNNRN152 and the DNNMRCNN were both evaluated using the test dataset, which included 623 normal leaf images and 665 leaf images with infected regions of the tomato leaf miner. The DNNRN152 directly classified the input tomato leaf images as normal or infected with tomato leaf miner. The DNNMRCNN, on the other hand, segmented the regions infected by the tomato leaf miner, and the segmentation results were used to classify the input tomato leaf images, as explained in Section 2.3.
Two confusion matrices in Figure 4a,b describe the classification results of the DNNRN152 and the DNNMRCNN, respectively. As shown in Figure 4a, the DNNRN152 classified the 665 leaf images with regions infected by the tomato leaf miner into 560 infected leaf images and 105 normal leaf images. The 623 normal leaf images were classified into 89 normal leaf images and 534 infected leaf images by the DNNRN152. In Figure 4b, the DNNMRCNN correctly classified all 665 leaf images with infected regions as infected leaf images. The 623 normal leaf images were classified into 594 normal leaf images and 29 infected leaf images by the DNNMRCNN. With the DNNRN152, the DNN-predicted class matched the true class for 649 of the 1288 images, or 51.7%. With the DNNMRCNN, the predicted class matched the true class for 1259 of the 1288 images, or 97.7%. The DNNMRCNN predicted fewer FN and FP cases than the DNNRN152. It is important to note that the DNNMRCNN produced no FN predictions, which suggests that the DNNMRCNN is more adequate than the DNNRN152 for practical plant disease and pest detection.

Precision, Recall, and F1-Score
The precision, recall, and F1-score were calculated based on the results of the confusion matrix. Figure 5 compares the precision, recall, and F1-score calculated from the confusion matrices of the DNNRN152 and the DNNMRCNN. In Figure 5, red and blue bars denote the performance evaluation results of the DNNRN152 and the DNNMRCNN, respectively. The precision, recall, and F1-score calculated from the prediction results of the DNNMRCNN were all higher than those calculated from the prediction results of the DNNRN152, indicating that the DNNMRCNN outperforms the DNNRN152 in terms of diagnostic performance for the tomato leaf miner.

Intersection over Union
IoU was calculated by comparing the human-segmented binary mask image and the DNN-predicted binary mask image. Figure 6 shows the calculation results of IoU. The human-segmented binary mask image and the DNN-predicted binary mask image are superimposed in Figure 6b. Yellow denotes the human-segmented plant disease region, and the DNN-predicted plant disease region is depicted in yellow with high translucency. The overlapping images are arranged based on the IoU value. A bar graph in Figure 6a shows the number of test images with IoU values within a certain range. The horizontal axis in the bar graph represents the range of IoU values, with a minimum of 0 and a maximum of 1, divided into 0.1 intervals. The vertical axis indicates the number of DNNMRCNN predictions whose IoU value falls within the corresponding range on the horizontal axis. The minimum and maximum IoU values for the DNNMRCNN prediction results on the test dataset were 0.05 and 1.0, respectively. In addition, the average of the IoU values was 0.59, with a variance of 0.03.
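The binning behind Figure 6a (IoU values grouped into 0.1-wide intervals from 0 to 1) can be reproduced with a simple counting loop; the IoU values in the example are illustrative, not the study's measurements.

```python
def iou_histogram(iou_values, n_bins=10):
    """Count IoU values per 0.1-wide bin; an IoU of 1.0 falls in the last bin."""
    counts = [0] * n_bins
    for v in iou_values:
        idx = min(int(v * n_bins), n_bins - 1)  # clamp 1.0 into bin [0.9, 1.0]
        counts[idx] += 1
    return counts

values = [0.05, 0.42, 0.45, 0.59, 0.61, 0.77, 1.0]
print(iou_histogram(values))  # [1, 0, 0, 0, 2, 1, 1, 1, 0, 1]
```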
When the IoU value is 0.6 or higher in Figure 6b, the human-segmented and DNN-predicted binary mask images are nearly identical. As shown in Figure 6a, approximately half of the prediction results for the 665 images containing leaves infected by the tomato leaf miner have an IoU value of 0.6 or higher. Except for 50 predictions, the vast majority have an IoU of 0.4 or higher. The IoU calculation results suggest that the DNNMRCNN is capable of precisely locating lesions.

Conclusions
In this study, we developed two DNN models to diagnose tomato leaves infected by the tomato leaf miner. DNNRN152, one of the developed DNN models, employed the well-known convolutional neural network structure ResNet152. DNNRN152 used a feature extractor identical to that of ResNet152 and a customized classifier. Using a tomato leaf image as input, DNNRN152 directly classified the image as a normal leaf or a leaf infected by the tomato leaf miner. Mask R-CNN was used to develop the other DNN model, DNNMRCNN. The regions infected by the tomato leaf miner were segmented from the tomato leaf image using DNNMRCNN, and the segmentation results were used to classify the tomato leaf image as a normal leaf or a leaf infected by the tomato leaf miner.
The same tomato leaf images captured from real-world agricultural sites were used to train and evaluate both DNN models. The human-segmented binary mask images were additionally provided to the DNNMRCNN for the training process.
As a preliminary study, we compared the performance of DNN models for classification (DNNRN152) and segmentation (DNNMRCNN) to determine which DNN model is better for detecting a single plant disease in a single crop from images captured in real-world agricultural sites. The precision, recall, and F1-score were used to assess the performance of the two developed DNN models. For all criteria, the DNNMRCNN outperformed the DNNRN152 in terms of the diagnostic performance for the tomato leaf miner. The IoU was additionally calculated to assess the segmentation performance of the DNNMRCNN. The IoU calculation results showed that, for the majority of the test dataset, the DNNMRCNN precisely segmented the regions infected by the tomato leaf miner from the input image.
In future work, we intend to train a DNN model using Mask R-CNN to detect multiple plant diseases and pests that occur in various crops in real-world agricultural sites.