Automatic Detection of Tomato Diseases Using Deep Transfer Learning

Abstract: Global food production is being strained by extreme weather conditions, fluctuating temperatures, and geopolitics. Tomato is a staple agricultural product with tens of millions of tons produced every year worldwide. Thus, protecting the tomato plant from diseases will go a long way in reducing economic losses and boosting output. Technological innovations have great potential in facilitating disease detection and control. More specifically, artificial intelligence algorithms in the form of deep learning methods have established themselves in many real-life applications in a wide range of disciplines (e.g., medicine, agriculture, and facial recognition). In this paper, we aim at applying deep transfer learning to the classification of nine tomato diseases (i.e., bacterial spot, early blight, late blight, leaf mold, mosaic virus, septoria leaf spot, spider mites, target spot, and yellow leaf curl virus) in addition to the healthy state. The approach in this work uses leaf images as input, which are fed to convolutional neural network models. No preprocessing, feature extraction, or image processing is required. Moreover, the models are based on transfer learning of well-established deep learning networks. The performance was extensively evaluated using multiple strategies for data split and a number of metrics. In addition, the experiments were repeated 10 times to account for randomness. The ten categories were classified with mean values of 99.3% precision, 99.2% F1 score, 99.1% recall, and 99.4% accuracy. Such results show that it is highly feasible to develop smartphone-based applications that can aid plant pathologists and farmers in performing disease detection and subsequent control quickly and accurately.


Introduction
Agriculture is one of humanity's most critical activities, of which plant disease control is a cornerstone. It is necessary to pay attention to the quality and wellbeing of the agricultural harvest. This will help maintain food production levels in the face of natural diseases and aid countries in coping with political and environmental challenges. Tomatoes are among the vital crops and staple food products around the world because of their rich nutritional content and their role in many recipes [1]. The Food and Agriculture Organization (FAO) ranks tomatoes as the sixth most abundant vegetable around the world [2]. In 2017, nearly 170.8 million tons of tomatoes were produced worldwide [3]. However, the tomato plant is susceptible to many diseases caused by bacteria, viruses, or fungi that have a direct adverse effect on productivity [4].
To detect plant diseases, farmers refer to plant pathologists. Alternatively, they can rely on their own experience or public resources. However, the required time, effort, and technical expertise may be prohibitive for most professional or hobby farmers [5]. Thus, technological solutions that can aid disease detection and identification will go a long way in reducing cost and improving the accuracy and speed of disease control. In this regard, recent advances in artificial intelligence (AI) have empowered a wide swath of applications from various disciplines. AI systems capture domain knowledge in their models through the training and validation process. They provide decision-making capabilities with nontrivial sophistication and complexity [6,7]. More specifically, deep learning algorithms have enabled the capture of intricate relationships and features of real-life processes. Convolutional neural networks (CNNs) are a type of deep learning algorithm that was found to be particularly useful for direct image-based decision-making and object detection [8].
Neural networks are composed of three types of layers: input, hidden, and output. Deep learning, by contrast, involves a far greater number of hidden layers, which enables the capturing of input features and details at various scales. Out of the many deep learning algorithms, convolutional neural networks are the most suitable for handling images as input [8]. Layers in a convolutional neural network perform a series of convolution operations using filters of various sizes, each of which is typically followed by a rectified linear unit (ReLU) activation function. The result from the ReLU is a feature map that is downsampled by the subsequent pooling layer. In general, the final layer before the output in a CNN is a fully connected layer, which combines the various features learned by previous layers and feeds the output layer.
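The convolution, ReLU, and pooling sequence described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work; the 5 × 5 input, the 2 × 2 filter, and all function names are hypothetical and chosen only to make the three operations visible.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding.

    As in most deep learning frameworks, the kernel is not flipped,
    so strictly speaking this is a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Replace every negative activation with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 5x5 single-channel "image" and 2x2 filter.
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 2],
    [2, 1, 0, 1, 0],
    [0, 2, 1, 0, 1],
]
kernel = [[1, 0], [0, -1]]

features = relu(conv2d_valid(image, kernel))  # 4x4 feature map
pooled = max_pool(features)                   # 2x2 downsampled map
```

In a real CNN, many such filters are learned per layer and the pooled maps of the last stage are flattened before reaching the fully connected layer.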
Building CNN models is an elaborate process, which needs to balance the computational cost with the ability to automatically extract appropriate features at various scales, orientations, colors, reflections, and spatial properties. Moreover, the models may suffer from overfitting, underfitting, or inefficiency. In addition, thorough evaluation is needed to establish the trustworthiness of the models. Luckily, several public and well-established models exist in the literature. These models offer a wide range of reliable capabilities with great efficiency [9]. Importantly, these models can be reused via an approach called transfer learning. This method utilizes generically pre-trained models by reusing the network structure and retraining part or all of the models, including the existing model weights and parameters. Harnessing these robust models accelerates the development of new innovative AI applications without reinventing new CNN architectures. This methodology was successfully employed in many AI solutions for image classification [6,7].
In the context of technological and AI-based innovations for tomato disease diagnosis, several studies were conducted in the literature. Mim et al. [10] developed a system that helps tomato farmers discover the type of disease using leaf images of the plant. The researchers used artificial intelligence algorithms and CNN to develop a six-class (five diseases and one healthy) classification model with an accuracy of 96.55%. Hlaing and Zaw [11] isolated the leaf image from the background, and used explicit feature extraction in the form of statistical properties and scale invariant feature transform of texture features. These descriptors fed a support vector machine (SVM) classifier, which distinguishes between seven input categories (six diseases and one healthy) with an accuracy of 84.7%. Kumar and Vani [12] experimented with four deep learning models (LeNet, VGG16, ResNet, and Xception) for ten-class classification (nine diseases and one healthy) of tomato leaf images, and reported a maximum accuracy of 99.25% using VGG16. Similarly, Tm et al. [13] used the AlexNet, GoogLeNet, and LeNet models for the same classification problem and achieved an accuracy range of 94-95%. Annabel and Muthulakshmi [14] used masking and threshold-based segmentation to identify and isolate infected areas of a leaf image. They extracted several features (e.g., dissimilarity, homogeneity, and contrast) and used a random forest classifier to categorize three diseases plus healthy leaves with an accuracy of 94.1%. Agarwal et al. [15] developed a custom CNN model by modifying the VGG16 structure. They compared this model with traditional machine learning models (e.g., random forest and decision trees) and three deep learning ones (i.e., VGG16, Inceptionv3, and MobileNet) for ten-class classification, and achieved an accuracy of 98.4%. Ouhami et al. [16] employed transfer learning of three models: DenseNet-161, DenseNet-121, and VGG16. The highest accuracy (i.e., 95.65%) was achieved using DenseNet-161. Similarly, Alhaj Ali et al. [17] used Inceptionv3 and reported the highest accuracy to be 99.8%. However, this high result was achieved using a dataset enlarged with augmented duplicates of the original images.
Other approaches were employed in the literature. In one avenue, deep learning algorithms were combined with traditional machine learning to solve the classification problem. Al-gaashani et al. [18] extracted features from leaf images using MobileNetv2 and NASNetMobile. After dimensionality reduction, the concatenation of these features was fed into non-deep classifiers (i.e., random forest, SVM, and multinomial logistic regression). In another methodology, deep object detection methods were applied to plant images to detect diseases on leaves. Liu and Wang [19] employed the you only look once version 3 (YOLOv3) algorithm to detect gray leaf spot disease. They reported a mean average precision of 92.5%. Similarly, Wang et al. [20] used Faster R-CNN and Mask R-CNN to detect eleven disease states (including healthy) in fruit images.
Traditional methods were also used in the literature recently. Gadade and Kirange [21] extracted features using Gabor filters, the gray level co-occurrence matrix, and speeded up robust features. This approach involves less computational and memory overhead than deep learning, but it is less effective in solving the classification problem, as demonstrated by their reported accuracy of 74%. Similarly, Lu et al. [22] presented spectral vegetation indices as features for classification using K-nearest neighbors (KNN), and they reported a 100% accuracy, albeit with a very small dataset (445 images).
This work is motivated by the following factors:
• The adoption and implementation of technological innovations is generally lacking in the agricultural literature in comparison with other fields (e.g., medicine). This is especially true for the number of artificial intelligence applications in agriculture versus in medicine.
• Traditional classification methods rely upon explicit feature extraction and/or image processing techniques, which may be sensitive to changes in image quality, orientation, size, lighting, noise, etc. Furthermore, the classification performance is directly affected by the quality of the features on which it is based. Moreover, pre-processing increases the delay, the computational requirements, and the risk of compounded errors. In addition, it may hinder the deployment of real-life applications if complicated actions are required by the user.
• Previous works suffer from several deficiencies. First, some of these studies artificially increase the size of the dataset by including images that differ only subtly from the originals. However, deep learning models are known to be immune to such changes. This duplication artificially improves the results by leading the model to recognize the similarities with the original images rather than the features of the disease or health states. Second, building a customized CNN model is fraught with risks in terms of overfitting, underfitting, efficiency, and hardware requirements. Using deep transfer learning with a pre-existing network architecture carries with it the inherent credibility of the thousands of applications based on these models and the extensive scrutiny they have gone through, although at the expense of a perceived lack of novelty and originality. Third, transfer learning is able to achieve competitive if not superior performance.
In this paper, deep transfer learning was used to detect and classify tomato disease using images of infected leaves. This approach has the advantages of employing well-established, trustworthy, and robust models without the need to redesign/reinvent a custom architecture. Moreover, deep learning models can render feature extraction and image preprocessing needless. The contributions of this paper are as follows:
1. Develop deep transfer learning models for the detection and classification of tomato diseases from leaf images for nine tomato diseases: bacterial spot, early blight, late blight, leaf mold, mosaic virus, septoria leaf spot, spider mites, target spot, and yellow leaf curl virus. In addition, healthy leaves were discerned as a 10th class;
2. Implement transfer learning of eleven deep convolutional neural network models for the classification of leaf images into ten classes. Future implementation of such a system in smart devices will greatly help farmers perform prompt disease control;
3. Evaluate the performance of the various models using multiple metrics that cover many aspects of the detection and classification capabilities. Moreover, the training and validation times were reported.
The remainder of this paper is organized as follows: the data, convolutional network models, and performance evaluation metrics and setup are presented in detail in Section 2; Section 3 presents the performance evaluation results along with a comparison to the related literature and a discussion of the models; and Section 4 concludes the paper.

Materials and Methods
Figure 1 shows a diagram of all phases involved in the proposed approach. By using CNNs, performing explicit feature extraction is not required. Furthermore, there is no need for separating relevant image parts (i.e., segmentation). These steps and others are handled implicitly by the complex operations of the deep learning models. Given a generically pretrained deep learning model, several changes need to be made to re-purpose the model to the specific application. First, replace the classification layer to match the number of classes in the application (i.e., 10 classes for this paper). Second, replace the learnable layer that combines features from previous layers with a new layer. This may be a fully connected layer or a convolution2d layer depending on the CNN model. Third, if training is to be made faster, then some initial layers can be frozen (i.e., layer weights will not be updated during training). The number of frozen layers can be determined empirically depending on the application and the resulting testing performance and training speed. No layers were frozen in this work as the available hardware permitted extensive training. Fourth, the dataset needs to be prepared by resizing the images to fit the CNN requirements (e.g., 256 × 256 to 224 × 224). Furthermore, the data are split into training and validation subsets. In addition, image augmentation operations may be performed to introduce more variety into the dataset and improve the learning process. Fifth, in the final step, the CNN is retrained with the tomato dataset, and the performance is evaluated. The next few subsections give more details about each part.
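The first three re-purposing steps can be sketched in a framework-independent way. The snippet below is only an illustration under stated assumptions: the `Layer` record, the `retarget` helper, and the five-layer toy network are hypothetical stand-ins and do not correspond to any real pretrained model or to the MATLAB code used in this work.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Layer:
    """Hypothetical, minimal description of one network layer."""
    name: str
    kind: str            # e.g., "conv", "fc", "softmax", "classification"
    outputs: int = 0     # number of output units/classes, where meaningful
    trainable: bool = True


def retarget(layers, num_classes, frozen_count=0):
    """Return a copy of `layers` adapted to a `num_classes`-way problem."""
    # The last learnable layer (fully connected or convolutional) combines
    # the features of all previous layers and must be replaced.
    last_learnable = max(
        i for i, layer in enumerate(layers) if layer.kind in ("fc", "conv")
    )
    adapted = []
    for i, layer in enumerate(layers):
        if i == last_learnable:
            # New, untrained layer sized for the new number of classes.
            layer = Layer("new_" + layer.name, layer.kind, num_classes)
        elif layer.kind == "classification":
            layer = Layer("new_" + layer.name, "classification", num_classes)
        elif i < frozen_count:
            # Frozen layers keep their pretrained weights during retraining.
            layer = replace(layer, trainable=False)
        adapted.append(layer)
    return adapted


# Toy stand-in for a network pretrained on a 1000-class problem.
pretrained = [
    Layer("conv1", "conv", 64),
    Layer("conv2", "conv", 128),
    Layer("fc1000", "fc", 1000),
    Layer("softmax", "softmax"),
    Layer("output", "classification", 1000),
]

# Ten tomato classes; no layers frozen, as in this work.
model = retarget(pretrained, num_classes=10, frozen_count=0)
```

Passing a positive `frozen_count` would trade some adaptability for faster training, which is the empirical choice described in the third step.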

Dataset
The dataset consists of 18,160 publicly available tomato leaf images displaying features of nine tomato diseases in addition to the healthy state. The number of images per class was as follows: 2127 bacterial spot, 1000 early blight, 1909 late blight, 952 leaf mold, 373 mosaic virus, 1771 septoria leaf spot, 1676 spider mites, 1404 target spot, 5357 yellow leaf curl virus, and 1591 healthy [23]. Each image represents a photo of a single leaf exhibiting one of the ten health classes. The photos were taken using a neutral background that appears somewhat unified for all images. In addition, each leaf appears at the center of each image. Although the images may contain irrelevant margins displaying the background, no cropping or pre-processing was performed. The public source of the images provided the dataset in JPEG format and a 256 × 256 resolution. Samples of leaf images of the nine diseases and healthy leaves are shown in Figure 2.
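As a quick consistency check on the class counts listed above, the short sketch below sums them and quantifies the class imbalance. The counts are taken from the text; the variable names and the choice of the largest-to-smallest ratio as an imbalance measure are illustrative assumptions.

```python
# Images per class, as listed in the dataset description.
class_counts = {
    "bacterial spot": 2127,
    "early blight": 1000,
    "late blight": 1909,
    "leaf mold": 952,
    "mosaic virus": 373,
    "septoria leaf spot": 1771,
    "spider mites": 1676,
    "target spot": 1404,
    "yellow leaf curl virus": 5357,
    "healthy": 1591,
}

total = sum(class_counts.values())
largest = max(class_counts, key=class_counts.get)
smallest = min(class_counts, key=class_counts.get)

# Ratio between the most and least represented classes.
imbalance_ratio = class_counts[largest] / class_counts[smallest]

# Fraction of the whole dataset contributed by each class.
shares = {name: count / total for name, count in class_counts.items()}
```

The counts add up to the stated 18,160 images, and the largest class (yellow leaf curl virus) is roughly 14 times the size of the smallest (mosaic virus), which is why imbalance-aware metrics matter in the evaluation.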

Performance Evaluation Setup
The training was performed using the same hyperparameters for all models. The number of training epochs was experimentally set to 5. This was based on the training and validation behavior of the models. Further training was deemed unnecessary. The available system memory allowed for a batch size of 16. The number of training iterations is equal to the number of epochs multiplied by the number of mini-batches per epoch (i.e., the number of training images divided by the batch size). The learning rate was set to $3 \times 10^{-4}$. The fast converging stochastic gradient descent with momentum (SGDM) was used as the solver optimization algorithm for network training [32].
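The relation between dataset size, batch size, epochs, and iterations can be made concrete with a small calculation. This is a sketch under two assumptions that the text does not confirm: an incomplete final mini-batch is discarded in each epoch (floor division), and the training subset size is the plain percentage of the 18,160 images. The function name is hypothetical.

```python
def training_iterations(num_training_images, batch_size=16, epochs=5):
    """Total number of mini-batch updates over all epochs.

    Assumption: the last, incomplete mini-batch of each epoch is dropped.
    """
    iterations_per_epoch = num_training_images // batch_size
    return iterations_per_epoch * epochs


TOTAL_IMAGES = 18160

# Training percentages of the three data split strategies.
iterations_by_split = {}
for train_percent in (50, 70, 90):
    n_train = TOTAL_IMAGES * train_percent // 100
    iterations_by_split[train_percent] = training_iterations(n_train)
```

Under these assumptions, the 50/50, 70/30, and 90/10 splits correspond to 2835, 3970, and 5105 iterations, respectively.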
Several data splitting strategies were used to test both the models' ability to generalize when evaluated on a larger testing set and their ability to learn better from a larger training set. The first strategy split the dataset into equal-sized training and validation sets (i.e., 50/50), the second one allocated 70% for training, and the last one used 90% of the images for training. Moreover, images in each set were augmented by performing scaling operations using random values from the range [0.9, 1.1], and x-y translation using random values from the range [−30, 30] pixels. In addition, random x-axis reflection (i.e., flipping of the image) was applied. Augmentation has been shown to improve the generalization of the learned knowledge [33]. It should be noted that augmentation did not increase the size of the dataset because the augmented images replaced the originals rather than being added to them.
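The split and augmentation setup can be sketched as follows. The snippet only draws the random augmentation parameters and does not transform any pixels. The per-class (stratified) split, the 50% reflection probability, and all function names are assumptions made for illustration; the text specifies only the split percentages and the parameter ranges.

```python
import random


def stratified_split(labels, train_fraction, rng):
    """Split image indices into training/validation sets, class by class."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        validation.extend(indices[cut:])
    return train, validation


def sample_augmentation(rng):
    """Draw one random set of augmentation parameters for a single image."""
    return {
        "scale": rng.uniform(0.9, 1.1),          # scaling factor
        "translate_x": rng.uniform(-30.0, 30.0),  # horizontal shift in pixels
        "translate_y": rng.uniform(-30.0, 30.0),  # vertical shift in pixels
        "reflect": rng.random() < 0.5,            # assumed 50% chance of flipping
    }


# Toy label list standing in for the real dataset.
labels = ["healthy"] * 10 + ["mosaic virus"] * 20
rng = random.Random(0)  # a different seed per run gives a different random split

train_idx, val_idx = stratified_split(labels, train_fraction=0.7, rng=rng)

# One parameter set per training image: each image is transformed in place,
# so the dataset size does not grow.
augmentations = [sample_augmentation(rng) for _ in train_idx]
```

Repeating the split with ten different seeds mirrors the repeated-runs protocol used to account for randomness.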
The models were implemented and evaluated using MATLAB R2021a software running on an HP OMEN 30L desktop (GT13) with 64 GB RAM, an NVIDIA GeForce RTX 3080 GPU, an Intel Core i7-10700K CPU @ 3.80 GHz, and a 1 TB SSD.

Performance Evaluation Metrics
The metrics used to evaluate the performance of the CNN models are shown in Equations (1)-(6). In these equations, $TP$ represents the true positive (i.e., a leaf image correctly classified in one of the nine disease states), $FN$ represents the false negative (i.e., a leaf image classified as healthy, but, in reality, it was drawn from one of the disease classes), $FP$ represents the false positive (i.e., a healthy leaf image wrongly classified as representing a disease), and $TN$ represents the true negative (i.e., a healthy image classified correctly as such). Recall (i.e., true positive rate (TPR) or sensitivity) measures the ability of the model to identify a leaf image as belonging to the correct disease class out of all the positive images, which is affected by the existence of false negatives. Moreover, specificity measures the ability of the model to identify a leaf image as belonging to the healthy class, which is affected by the existence of false positives. High sensitivity indicates that the model easily recognizes leaf images as representing a disease but may include a large number of false positives. Precision measures the ratio of true positives to all cases identified as positive (i.e., false positives included). Accuracy measures the ratio of the sum of true positives and true negatives to the total number of testing images. However, since different classes have a different number of images (i.e., class imbalance), the F1 score is considered a more reliable measure of the model classification performance [34]. The Matthews Correlation Coefficient (MCC), see Equation (6), is another metric of great importance. MCC and its multiclass generalization provide a more faithful reflection of the classification performance in comparison to the accuracy and F1 score because the size imbalance of the different classes is taken into consideration [35].

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{2}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{6}$$
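For a multiclass problem, these quantities can be obtained per class by treating that class as positive and all others as negative. The sketch below shows this one-vs-rest computation on a small confusion matrix. The 3 × 3 matrix is hypothetical, the function names are illustrative, and how the per-class values are aggregated into the overall figures reported later is not specified in the text, so no aggregation is shown.

```python
import math


def one_vs_rest_counts(confusion, k):
    """Return (TP, FP, FN, TN) for class k.

    confusion[i][j] is the number of images of true class i predicted as class j.
    """
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                          # class k predicted as another class
    fp = sum(confusion[i][k] for i in range(n)) - tp     # other classes predicted as k
    total = sum(sum(row) for row in confusion)
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def binary_metrics(tp, fp, fn, tn):
    """Standard binary metrics; no guard against zero denominators."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical confusion matrix for three classes (rows: true, columns: predicted).
confusion = [
    [50, 2, 0],
    [3, 45, 2],
    [0, 1, 47],
]

counts_class0 = one_vs_rest_counts(confusion, 0)
metrics_class0 = binary_metrics(*counts_class0)

# Overall multiclass accuracy: correctly classified images over all images.
overall_accuracy = sum(confusion[i][i] for i in range(3)) / sum(map(sum, confusion))
```

For class 0 in this toy matrix, the counts are TP = 50, FP = 3, FN = 2, and TN = 95, so precision is 50/53 and recall is 50/52.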

Results and Discussion
The performance evaluation was performed in order to gauge and compare the classification capabilities of the various deep transfer learning models using well-known and reflective performance indices. Moreover, the evaluation was repeated 10 times to account for random choices for the various data subsets. In addition, the time requirements for training/validation were reported for all models under the various setups.
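Summarizing repeated runs amounts to computing the mean, minimum, maximum, and standard deviation of a metric over the 10 randomized repetitions. The sketch below uses placeholder accuracy values that are not results from this study, and it assumes the sample standard deviation, since the text does not state which estimator was used.

```python
import statistics


def summarize_runs(values):
    """Mean, minimum, maximum, and sample standard deviation of repeated runs."""
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "std": statistics.stdev(values),  # assumption: sample (n - 1) estimator
    }


# Placeholder accuracies (%) for 10 hypothetical runs of one model.
accuracies = [97.5, 98.0, 98.5, 98.0, 97.0, 99.0, 98.0, 98.5, 97.5, 98.0]

summary = summarize_runs(accuracies)
```

A large gap between the minimum and maximum, or a large standard deviation, indicates sensitivity to which images fall into the training and validation sets.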
Three data split strategies were used (i.e., 50/50, 70/30, and 90/10), which may reveal the abilities of the different models to learn from more data, as well as any underfitting/overfitting anomalies. Table 1 shows the mean over 10 runs for the overall F1 score, precision, recall, specificity, and MCC using 50% of the data for training. Most models performed exceptionally well, with the highest mean F1 score of 98.5% achieved by DenseNet-201. The worst performing model was SqueezeNet with a 90.9% F1 score. These performance values are corroborated by the confusion matrices for the best and worst performing models as shown in Figure 3. The matrix for SqueezeNet shows a problematic trend of misclassifying leaves with diseases as healthy, especially spider mites and target spots. Further insight into the results is provided by Figure 4, which shows the mean, minimum, and maximum accuracy for all algorithms over 10 randomized runs for the 50/50 data split. Three models (i.e., SqueezeNet, GoogLeNet, and DarkNet-53) experienced high variability over the 10 random runs in comparison with the other models, which indicates their relative sensitivity to the choice of images included in the training/validation sets. The maximum standard deviation was 2.0% for SqueezeNet. The highest average accuracy was 98.8% for DenseNet-201.
Although the number of images is somewhat acceptable considering the corresponding results, it is worthwhile to explore the effect of increasing the size of the training dataset. Deep learning models, in comparison to traditional machine learning algorithms, are well-known to achieve better performance with more data. Table 2 shows the mean over 10 runs for the overall F1 score, precision, recall, specificity, and MCC using 70% of the data for training. All models achieved better performance, although with diminishing returns. SqueezeNet improved to a 91.8% F1 score and DenseNet-201 performed the best with an F1 score of 99.0%. The confusion matrices in Figure 5 corroborate the performance values and reveal a drastically improved diagnosis in comparison with the matrix in Figure 3 with relation to misclassifying spider mites and target spots as healthy. Figure 6 shows the fluctuation of the accuracy results for the eleven models over 10 randomized runs. In comparison to Figure 4, DarkNet-53 displayed much less fluctuation with more training data, which means the model had the potential for better learning with more data. Most of the other models experienced less fluctuation; however, the smaller models (i.e., SqueezeNet and GoogLeNet) do not seem to benefit from more training data with respect to their sensitivity to the random choices of the images to be included in the training data. The standard deviation of the accuracy results remained 2.0% for SqueezeNet. The highest average accuracy was 99.2% for DenseNet-201 and DarkNet-53.
Pushing toward the extreme case of using 90% of the images for training reveals further insight into the models. Table 3 shows the mean over 10 runs for the F1 score, precision, recall, specificity, and MCC using 90% of the data for training. Both SqueezeNet and GoogLeNet improved further to an F1 score of 93.3% and 95.9%, respectively. However, the other models with high performance values seemed to peak. DarkNet-53 did not improve and the remaining algorithms showed small improvements (i.e., <1%). DenseNet-201 achieved the maximum mean F1 score of 99.2% and was closely followed by Inceptionv3 at 99.1%. Figure 7 shows sample confusion matrices for the DenseNet-201 and SqueezeNet models using 90% of the data for training. The figure shows that very few images were misclassified. DenseNet-201 classified several categories perfectly. Another observation relates to the ResNet models (101, 50, and 18), where a larger number in the model's name corresponds to a deeper network; the models' performance improved with increased depth and number of layers. Regarding the fluctuation of the results with different random choices, Figure 8 shows that SqueezeNet improved to a 0.4% standard deviation for the classification accuracy, but GoogLeNet had the highest standard deviation with 1.0%. The ShuffleNet model fluctuation does not seem to be affected by more training data and remained almost fixed throughout the various data splitting strategies. The highest average classification accuracy was 99.4% using DenseNet-201.
Table 4 shows the mean training and validation times for all the models using the 50/50, 70/30, and 90/10 data splits. The SqueezeNet model trains the fastest in comparison to all other models. However, it also performs the worst. On the other hand, ResNet-18 seems to represent a good compromise between better classification performance and faster training time. The model produced a range of F1 scores of 97.2-98.2% with a corresponding training time of 395.5-491.9 s, which is very fast in comparison to the better performing models. Nonetheless, training times may not affect the ability to deploy the models in real-life applications, especially if no live model update is performed. This is because testing does not involve model update and is usually very fast, and training is done once and offline with respect to the deployment. The inference times were in the range of 0.5-7 milliseconds/image, which is very small from a human user perspective. These times are independent of the data split and depend on the hardware platform and size of the model.
Several studies were conducted in the literature on the application of machine learning and deep learning algorithms for the identification and classification of plant diseases. Some of these studies (e.g., Hlaing and Zaw [11] and Annabel and Muthulakshmi [14]) used the traditional approach of employing image processing techniques to segment the input images (i.e., separation of the leaf or infected area from the background) and to extract texture features that reflect the disease state of the leaf. These features form the input for non-deep traditional machine learning algorithms (e.g., SVM). However, these studies did not consider images of different backgrounds and the classification performance results were worse than their deep learning counterparts. On the other hand, deep learning algorithms do not require these preprocessing steps and the accompanying overhead and errors. Agarwal et al. [15] modified the well-established structure of the VGG16 model and produced good performance. However, the original VGG16 model has shown its worth over hundreds of applications and thousands of studies and any modification will need to go through rigorous scrutiny. Tm et al. [13] used a similar approach to ours; however, the comparison was performed for three weaker models only (i.e., AlexNet, GoogLeNet, and LeNet). Similarly, Kumar and Vani [12] experimented with four models (i.e., LeNet, VGG16, ResNet, and Xception) and produced 99.25% accuracy. However, their results were based on 14,903 leaf images from the same dataset with no apparent reason for dropping the remaining 3257 images. Table 5 shows a summary of the related literature to identify and classify tomato disease.
The present study has some limitations. First, tomato has two major leaf shapes (regular and potato leaf) and multiple other variations relating to leaf dimensions, color, and shades of green. However, the dataset does not include varieties of tomato leaf shapes. This will narrow the applicability and performance of any tomato disease identification system to the specific tomato variant in the dataset. Second, all the images in the dataset have a unified background. It would be worthwhile to investigate leaf images with different backgrounds taken in a non-unified manner. Third, tomatoes are susceptible to other diseases or pests (e.g., Tuta absoluta) that are not part of the dataset. Fourth, the dataset is imbalanced with varying numbers of images in each class.

Conclusions
Tomato is an important mass-produced agricultural product that is susceptible to diseases and the consequent yield loss. The use of deep transfer learning and well-established models showed great potential in many applications in the literature. In this work, we targeted the identification of tomato diseases from infected leaf images. Using leaf images as input, eleven deep learning models were customized and retrained to identify nine tomato diseases in addition to healthy plants. The models (i.e., DarkNet-53, DenseNet-201, GoogLeNet, Inceptionv3, MobileNetv2, ResNet-18, ResNet-50, ResNet-101, ShuffleNet, SqueezeNet, and Xception) were compared in terms of six common metrics and training/validation times. Although all models performed well, the DenseNet-201 model produced the best results with values larger than 99% for all metrics. However, the SqueezeNet model trained the fastest, and had the shortest inference time (i.e., 0.50 milliseconds/image).
The transfer learning approach carries inherent credibility and less complexity. In addition, it requires neither explicit image processing nor feature extraction. Thus, it is suitable to be implemented in standalone smartphone applications, which can aid plant pathologists and farmers in quick and effective disease recognition and control. Future work will consider evolving the models by using incremental learning (i.e., improving the model during deployment). Moreover, the same approach can be adapted to identify diseases from tomato fruit images rather than the leaves. This may require 3D deep learning models to cover all sides of the fruit. In addition, other models or an ensemble of models can be used for solving the same problem. Field testing and commercial availability in the form of ready-to-download applications are promising areas of future activities.

Figure 1. A diagram of all phases of the proposed approach.

Figure 4. The mean, minimum, and maximum accuracy for all algorithms over 10 randomized runs and 50/50 data split.

Figure 8. The mean, minimum, and maximum accuracy for all algorithms over 10 randomized runs and 90/10 data split.

Table 1. The mean overall F1 score, Precision, Recall, Specificity, and MCC using 50% of the data for training. The results are an average of 10 runs.

Table 2. The mean overall F1 score, Precision, Recall, Specificity, and MCC using 70% of the data for training. The results are an average of 10 runs.

Table 3. The mean overall F1 score, Precision, Recall, Specificity, and MCC using 90% of the data for training. The results are an average of 10 runs.

Table 4. The mean training and validation times for all algorithms and data split strategies. All times are in seconds.

Table 5. A summary of the related literature to identify and classify tomato diseases. The size of the training subset is a percentage of the dataset.