Tomato Leaf Disease Recognition on Leaf Images Based on Fine-Tuned Residual Neural Networks

Humans depend heavily on agriculture, which is the main source of prosperity. The various plant diseases that farmers must contend with have constituted a lot of challenges in crop production. The main issues that should be taken into account for maximizing productivity are the recognition and prevention of plant diseases. Early diagnosis of plant disease is essential for maximizing the level of agricultural yield as well as saving costs and reducing crop loss. In addition, the computerization of the whole process makes it simple for implementation. In this paper, an intelligent method based on deep learning is presented to recognize nine common tomato diseases. To this end, a residual neural network algorithm is presented to recognize tomato diseases. This research is carried out on four levels of diversity including depth size, discriminative learning rates, training and validation data split ratios, and batch sizes. For the experimental analysis, five network depths are used to measure the accuracy of the network. Based on the experimental results, the proposed method achieved the highest F1 score of 99.5%, which outperformed most previous competing methods in tomato leaf disease recognition. Further testing of our method on the Flavia leaf image dataset resulted in a 99.23% F1 score. However, the method had a drawback that some of the false predictions were of tomato early light and tomato late blight, which are two classes of fine-grained distinction.


Introduction
We rely on edible plants in the same way that we rely on oxygen. There is no food without crops, and there is no life without food. It is no coincidence that the invention of agriculture coincided with the rise of human civilization [1]. The tomato (Solanum lycopersicum L.), family Solanaceae, originated in the Andean region of South America has, according to the Food and Agriculture Organization Statistics (FAOSTAT), in the past fifty years become one of the most important and extensively grown horticultural crops in the Mediterranean region and throughout the world. Currently, it is the world's second most cultivated vegetable crop after the potato, with approximately 181 million tonnes from 5 Mha [2]. With 0.2 Mha, it is the highest-yielding vegetable in Southern Europe, and the major producers in the Mediterranean basin are Turkey, Egypt, Italy, Spain, and Morocco [2,3].
The tomato is susceptible to a variety of plant diseases caused by pathogens such as fungal, bacterial, phytoplasma, virus, and viroid pathogens due to its genetic properties as shown in Table 1. Not only is its genetic inheritance critical to managing the numerous tomato pathogens, but so are current climate changes, recently revised phytopathological control measures, and seed industry globalization [4]. One of the common diseases affecting tomato yield, the Septoria leaf spot, is caused by a fungal pathogen. Septoria has emerged as a major emerging pathogen as a result of climatic change and widespread variability. The pathogen's disease severity ranges between 35 and 65 percent in both cultivated and non-cultivated crops, posing a serious threat in the future [5]. The pathogen's complex adaptability combined with its cosmopolitan nature makes it more vigorous by targeting The cultivated tomato has a low genetic diversity due to its intensive selection and severe genetic bottlenecks that arose during evolution and domestication [10][11][12]. For these reasons, the tomato is more prone to a high disease incidence, and during the cultivation and post-harvest period, it can be affected by more than 200 diseases caused by different pathogens throughout the world [13,14]. In this paper, we propose a method for effective recognition of diseases affecting the tomato, that are mainly reflected on the leaves. We propose a residual neural network algorithm for this, which is the state-of-the-art, most recent deep learning image recognition algorithm. The efficient recognition of such can inform the farmers of the presence of such diseases on their crops and allow them to carry out control measures currently authorized in the EU that allow growers to achieve a successful and eco-sustainable disease management of this vegetable crop, fundamental for the Mediterranean diet.
One of the most important research areas in precision agriculture is disease identification using images of plant leaves [15]. Artificial intelligence, image processing, and graphical processing unit advancements have the potential to broaden and improve the practice of precise plant protection and growth. Most plant diseases produce a variety of visible symptoms; thus, learning models should be able to adequately observe and identify the distinctive symptoms of any disease [16].
Tomato leaf disease identification falls under the purview of computational agriculture [17,18]. Traditional methods for identifying tomato leaf disease at an early stage frequently use global features (such as color, texture, and shape) to describe the characteristics of disease spots in crop leaf disease images. The methods used served to separate the diseased and normal parts of the tomato leaf, and the area ratio of the two parts was used as the criterion for identification [19].
Recent advancements in deep learning provide obvious benefits in feature extraction and recognition, such as the convolutional neural network (CNN), which automatically trains the network to extract data features by introducing local connections and weight sharing in 1.
A double form of data augmentation, using image transformations and the implementation of CutMix as a secondary form of data augmentation for model generalization.

2.
The effect of the train/test data split size ratio of our dataset on the network model for disease recognition on tomato leaf images. Train/test data split ratios of sizes 40/60, 50/50, 60/40, 70/30, and 80/20 were adopted and studied. 3.
The effect of varying batch sizes in training our network to correctly recognize tomato leaf diseases. According to the capacity of the GPU available, batch sizes of 40,50,60,70,80,90, and 100 were adopted. 4.
The role of network depth in the effective recognition of tomato leaf disease. Residual networks with varying depths of 18, 34, 50, 101, and 152 layers were studied. 5.
The effect of tuning the learning rate while training the network and identification of a threshold to obtain suitable learning rates to effectively train the network to recognize tomato leaf diseases. The implementation of a discriminative learning rate for efficient training of residual models.

Related Work
Durmus et al. [22] used AlexNet [23] and SqueezeNet [24] models to classify and recognize 10 different types of tomato diseases in the PlantVillage dataset. The experiment discovered that while AlexNet's classification accuracy is slightly higher than SqueezeNet's, the size of the model and the time required are doubled.
Aravind et al. [25] used AlexNet and VGG16 [26] in conjunction with transfer learning to identify seven types of tomato diseases; the experiment revealed that the accuracies were 97.29 and 97.49 percent, respectively. Although transfer learning can accelerate model convergence and improve recognition performance, it is constrained by the original network structure.
Karthik et al. [27] proposed an attention-based deep residual network for detecting the type of tomato leaf infection. The PlantVillage dataset was used in the experiment, with 95,999 images used as training models and 24,001 images used for validation. The diseases included in the dataset were the tomato early blight, late blight, and leaf mold. The experimental results demonstrated that the proposed attention-based residual network can use CNN learning features at different processing levels and achieve 98 percent overall accuracy on the validation set in five-fold cross-validation.
Anand et al. [28] proposed an image processing and machine learning-based technique for diagnosing brinjal leaf disease. They used a K-means clustering technique to segment brinjal leaf diseases with some remarkable performance. In 2018, Zhang et al. [29] proposed a K-means clustering and PHOG algorithms-based fusion of super-pixel clustering-based leaf segmentation. Their technique performed admirably in the segmentation and recognition of plant leaf diseases. By extracting features based on color and texture and feeding them to a multiclass SVM classifier, Rani et al. [30] also proposed a K-means clusteringbased leaf disease and classification technique. On average, they recorded a classification accuracy of 95%. In the same year, Kumari et al. [31] proposed an image processing-based leaf spot recognition system, with the four stages of image acquisition, image segmentation, feature extraction, and classification. To compute the disease features, they also utilized the K-means algorithm. They achieved an accuracy of 90% and 80% for bacterial leaf spot and cotton leaf disease target spot, respectively. Liu et al. [32] proposed a leaf disease identification model based on generative adversarial networks. This model employed DenseNet and instance normalization to recognize actual and false disease images, as well as the feature extraction capability on grape leaf lesions. Finally, the approach applied a deep regret gradient penalty to stabilize the training process. The findings revealed that the GAN-based data augmentation strategy may effectively overcome the overfitting problem in disease identification while simultaneously improving accuracy. A leaf disease detection approach based on the AlexNet architecture was proposed by Lv et al. [33] in 2020. First, they created a maize leaf feature enhancement framework, which improved the capability of feature extraction combined with dilated convolution and multiscale convolution in a complex environment. After that, a DMS-Robust AlexNet architecture network was created, which improved the capability of feature extraction combined with dilated convolution and multiscale convolution in a complex environment. The disease features on tomato leaves, such as spot blight, late blight, and yellow leaf curl disease, were extracted using a deep learning method by Jiang et al. [34]. After continuous iterative learning, the proposed technique correctly predicted the disease category for each disease, with accuracy increases of 0.6 percent and 2.3 percent in the training and test sets, respectively. Waheed et al. [35] proposed an optimized DenseNet-based maize leaf recognition model with few parameters to boost job efficiency. The results of the experiments demonstrated that this technology is effective at detecting corn leaf disease. Huang et al. [36] proposed an end-to-end plant disease diagnostic model-based deep neural network, which can reliably classify plant types and plant diseases. Their model consists of two components: the leaf segmentation part that separates the leaves in the original image from the background; and the plant disease classifier, which is based on a two-headed network that classifies plant diseases using features extracted by multiple common pre-trained models. Experimental results show that this method can achieve a plant classification accuracy of 0.9807 and a disease recognition accuracy of 0.8745. In 2020, [37] proposed a combination of ABCK-BWTR and B-ARNet models for the identification of tomato leaf disease, consisting of a channel attention module in a ResNet50 [38], using the dual channel filter to extract the primary leaf features.
Sethy et al. [39] used different deep learning models for extracting rich features and applied an SVM classifier to classify them. They achieved their highest performance accuracy with a combination of a ResNet50 model with an SVM classifier. Oyewola et al. [40] in their work proposed using plain CNNs(PCNN) and deep residual network (DRNN) in identifying five different cassava plant diseases; their results showed that PCNN was outperformed by DRNN by a margin of +9.25%. Zeng et al. [41], on the other hand, proposed a self-attention convolutional neural network (SACNN) to identify several crop diseases. To examine the robustness of their model, the authors introduced noise at different levels in the test images. Diseases prone to affect rice and maize leaves were identified by Chen et al. [42] using an INC-VGGN method. They replaced the last convolutional layer of a VGG19 model with two inception layers and a global average pooling layer. Maize, apple, and grape crop diseases were identified by Yang et al. [43] using a shallow CNN (SCNN) embedded with SVM and RF classifiers. A transfer-learning approach was adopted by Ramacharan et al. [44] to identify three diseases and two pest-damage types that plague cassava plants. The authors further extended their work by implementing a smartphone-based CNN model for the identification of cassava plant diseases and recorded an 80.6% accuracy [45]. Adedoja et al. [46] proposed a deep CNN architecture based on NASNet to identify diseases on some plant leaves with an accuracy of 93.82%. However, there is still a need for improvement in the accuracy of plant disease recognition.
In 2022, an attention-based method was proposed by Devi et al. [47] where they used the Salp Swarm algorithm in the classification of tomato leaf diseases. Their method achieved an accuracy of 97.56% in predicting five types of tomato leaf diseases from leaf images taken from the plant village dataset. Apart from the computational complexity of the method, it is also limited in performance score. A method that utilized a lightweight attention-based CNN [48] to classify tomato leaf diseases achieved a model accuracy of 99.34% but with a slightly higher time complexity than conventional methods. Also in 2022, Zhao et al. [49] developed a method that utilized a spatial attention mechanism with CNN for real-time leaf disease detection. However, this method achieved a 95.20% accuracy and did not generalize well. With the aim of improving performance and generalizability, our method was researched. We also tested the proposed method on another plant leaf benchmark dataset that is different from our target dataset.

Evaluation Metrics, Results, and Discussion
This section presents the metrics used in evaluating the results of this research, the detailed results, and relevant discussion.

Evaluation Metrics
The accuracy, precision, recall, and f1-score of the proposed method were all evaluated. The proposed plant recognition system's accuracy has been calculated using the following expression, which incorporates numerical details such as true positive (TP) (the number of correctly identified leaf images), false positive (FP) (the number of incorrectly detected leaves), true negative (TN) (the number of correctly detected leaf images), and false negative (FN) (it is a parameter for representation of the number of leaf images that are correctly recognized).

•
Accuracy: Accuracy is the number of right predictions that are made by the model with respect to the total number of predictions that were made. It is mathematically represented by Equation (1). Accuracy = TP+TN TP+TN+FP+FN (1) • Precision: Precision is defined as the number of true positive results (TP) divided by the number of positive results (TP + FP) that are predicted by the model. The range of the precision is between 0 and 1 and is calculated using Equation (2). It is used to find the proportion of positive identifications that is true. • F1 Score: Being one of the widely used metrics for the performance evaluation of machine learning algorithms, the F1 score is the harmonic mean of precision and recall. The range of the F1 score is between 0 and 1, and it is calculated as shown by Equation (4). It reflects the number of instances that are correctly classified by the learning model.   (18,34,50,101, and 152) on the maximum batch size used in this research, having a value of 100. The results show that a train-validation split ratio of 80/20 recorded the highest performance in the F1 score. After 29 epochs, the result shows how the performance score reached a peak of 98% and a minimum of 94% based on different train-validation data split ratios as indicated on the plot image.

Results on Varied Network Depth
Given five different network depths adopted in this research, 18, 34, 50, 101, and 152 layers, respectively, this section reports the results and discusses the findings on the relationship between the network depth and the performance of the proposed network. The results of the F1 score based on the different depths of the proposed residual neural network are displayed in Figure 1, while Figures 2-5 show the performance via confusion matrices of the various network depths. Figure 1 displays the plot of epoch over F1 score for a network of the five varying depths (18,34,50,101, and 152) on the maximum batch size used in this research, having a value of 100. The results show that a train-validation split ratio of 80/20 recorded the highest performance in the F1 score. After 29 epochs, the result shows how the performance score reached a peak of 98% and a minimum of 94% based on different train-validation data split ratios as indicated on the plot image. It was observed that the network depth affected the network performance, though not at a very high value. However, the network depth of 152, being the highest depth used in this research, had the highest performance score of 99.51% as shown in Figure 1e above.  It was observed that the network depth affected the network performance, though not at a very high value. However, the network depth of 152, being the highest depth used in this research, had the highest performance score of 99.51% as shown in Figure 1e above.         It was observed that the network depth affected the network performance, though not at a very high value. However, the network depth of 152, being the highest depth used in this research, had the highest performance score of 99.51% as shown in Figure 1e above.

Results on Varied Train-Validation Data Split Ratios
Different train-validation data split ratios were tested on this proposed network, ranging from 40/60, 50/50, 60/40/ 70/30, and 80/10 for training and validation data, respectively. This section displays and discusses the results obtained. Table 2 shows the relationship between batch size and performance on the network on different train-validation data split ratios. The results of the F1 score for different train-validation data split ratios on the proposed residual neural network are displayed in Figures 6 and 7 below.
The results in Figures 6 and 7 show the relationship between the train-validation data split ratio and the F1 score of the networks. These results suggest that the split ratio had a great impact on the performance of the network. Figure 7 shows a clear distinction in the performance value as the train samples are increased.     The results in Figures 6 and 7 show the relationship between the train-validation data split ratio and the F1 score of the networks. These results suggest that the split ratio had a great impact on the performance of the network. Figure 7 shows a clear distinction in the performance value as the train samples are increased.
For a batch size of 100 images, being the highest batch size value adopted for this research, train set values of 40, 50, 60, 70, and 80% of the entire dataset resulted in a performance of 0.9775, 0.9848, 0.9884, 0.9921, and 0.9937, respectively, out of a total value of 1, for a network depth of 152. The results on other network depths also show a similar pattern in performance with such data split ratios. This suggests that a split ratio of 80/20 is a good choice for plant leaf image recognition. For a batch size of 100 images, being the highest batch size value adopted for this research, train set values of 40, 50, 60, 70, and 80% of the entire dataset resulted in a performance of 0.9775, 0.9848, 0.9884, 0.9921, and 0.9937, respectively, out of a total value of 1, for a network depth of 152. The results on other network depths also show a similar pattern in performance with such data split ratios. This suggests that a split ratio of 80/20 is a good choice for plant leaf image recognition.

Results on Different Batch Sizes
The effect of different batch sizes on the various models was studied and the results are described here. The results of the F1 score for different train-validation data split ratios on the proposed residual neural network are displayed in Figure 8 below. It displays the plot of the epoch over the F1 score for a batch size of 100, 90, 80, 70, 60, 60, and 50 images, respectively.
The results of the different batch sizes above do not show much difference in the overall performance of the F1 score at the end of the number of training epochs; however, the time taken was greatly influenced as is displayed in the next section.

Results on Different Batch Sizes
The effect of different batch sizes on the various models was studied and the results are described here. The results of the F1 score for different train-validation data split ratios on the proposed residual neural network are displayed in Figure 8 below. It displays the plot of the epoch over the F1 score for a batch size of 100, 90, 80, 70, 60, 60, and 50 images, respectively.

Results on Computing Time
The results of time on the proposed residual neural network for plant leaf recognition are displayed in Figure 9, and Table 3 shows the time taken on various train-validation split ratios and batch sizes.  100  162  196  221  234  249  90  179  188  218  237  253  80  169  193  222  235  247  70  176  188  211  232  240  60  172  193  217  236  259  50  188  208  221  242  275  40  200  204  226  248  281  80  169  193  222  235  247  70  176  188  211  232  240  60  172  193  217  236  259  50  188  208  221  242  275  40  200  204  226  248  From Figure 9 above, the fastest of the networks was that of a depth of 18 layers, with a train-validation data split ratio of 40/60. Having more data for validation than for training made the network training and testing procedure faster, albeit our goal was not just for speed but also improved performance. Hence, the result of performance with the fastest time was not the optimized result we have. The split ratio of 80/20 was rather that with the highest performance in training and testing, as recorded in Table 2. Figure 10 shows the plot of the test split ratio and layer depth over time. This displays a more elaborate view of the relationship between time and other parameters such as split ratio and network depth. From Figure 9 above, the fastest of the networks was that of a depth of 18 layers, with a train-validation data split ratio of 40/60. Having more data for validation than for training made the network training and testing procedure faster, albeit our goal was not just for speed but also improved performance. Hence, the result of performance with the fastest time was not the optimized result we have. The split ratio of 80/20 was rather that with the highest performance in training and testing, as recorded in Table 2. Figure 10 shows the plot of the test split ratio and layer depth over time. This displays a more elaborate view of the relationship between time and other parameters such as split ratio and network depth. The F1 score, being a more robust metric for recognition, combined both the recall and precision levels of the network. Here, we show how the network tried to maintain a considerably consistent F1 score for most of the training process with a +0.15 and −0.25 interval. The learning rate on this dataset had been seen to do well from 10 −5 to 10 −3 . The best position to train a model has been estimated to fall along that axis. Afterward, as can be seen in the plot, as the learning rate increased, the model loss increased. Table 4 compares the performance of the proposed network on the PlantVillage dataset on various training and validation data split ratios. Figure 11 shows the areas of mistake recorded by the model in this research: (a) the plot loss for a network depth of 152; (b) the images the The F1 score, being a more robust metric for recognition, combined both the recall and precision levels of the network. Here, we show how the network tried to maintain a considerably consistent F1 score for most of the training process with a +0.15 and −0.25 interval. The learning rate on this dataset had been seen to do well from 10 −5 to 10 −3 . The best position to train a model has been estimated to fall along that axis. Afterward, as can be seen in the plot, as the learning rate increased, the model loss increased. Table 4 compares the performance of the proposed network on the PlantVillage dataset on various training and validation data split ratios. Figure 11 shows the areas of mistake recorded by the model in this research: (a) the plot loss for a network depth of 152; (b) the images the model predicted wrongly for a network depth of 152, detailing the predicted, ground truth, loss, and probability values.  The F1 score, being a more robust metric for recognition, combined both the recall and precision levels of the network. Here, we show how the network tried to maintain a considerably consistent F1 score for most of the training process with a +0.15 and −0.25 interval. The learning rate on this dataset had been seen to do well from 10 −5 to 10 −3 . The best position to train a model has been estimated to fall along that axis. Afterward, as can be seen in the plot, as the learning rate increased, the model loss increased. Table 4 compares the performance of the proposed network on the PlantVillage dataset on various training and validation data split ratios. Figure 11 shows the areas of mistake recorded by the model in this research: (a) the plot loss for a network depth of 152; (b) the images the model predicted wrongly for a network depth of 152, detailing the predicted, ground truth, loss, and probability values. The network, though having an outstanding performance, was not 100% perfect. From Figure 11 above, classes that were wrongly recognized are displayed. The first image shows how the model predicted a late blight leaf as an early blight class. Table 4 shows the result of the network on various train-validation split ratios. The early blight and late blight diseased leaves had a very striking resemblance and as such, most of the wrong predictions recorded by our method happened to fall in between the two leaf classes. Additionally, some of the spots that could be found on leaves were so close that they could constitute a challenge in perfect distinction, giving rise to an imperfect recognition, even to the human eye. However, more research on fine-grained distinct images is needed to get to that point. Figure 12 below shows the plot of loss against epoch for the network of depth 152 layers on a train-validation split ratio of 40/60. Tables 5-9 detail the performance of the proposed network on the PlantVillage dataset with various model depths of the residual neural network architecture on six different training and validation data split ratios. Table 5 compares the performance of our model on various model depths of the same architecture on a validation data split ratio of 60.  Tables 5-9 detail the performance of the proposed network on the PlantVillage dataset with various model depths of the residual neural network architecture on six different training and validation data split ratios. Table 5 compares the performance of our model on various model depths of the same architecture on a validation data split ratio of 60.  Table 6 compares the performance of our model on various model depths of the same architecture on a validation data split ratio of 50.  Table 7 compares the performance of our model on various model depths of the same architecture on a validation data split ratio of 40.  Table 8 compares the performance of our model on various model depths of the same architecture on a validation data split ratio of 30.  Table 9 compares the performance of our model on various model depths of the same architecture on a validation data split ratio of 20.  A summary of the related work on plant disease identification based on leaf images and the result comparison of some of them is shown in Table 11. As can be seen from both Table 11 and Figure 13, our model outperformed the previous models, surpassing that of Li et al. [58], which was the closest in performance, with a +0.75% performance gain. Whereby some authors used the accuracy metric to measure their performance, we recorded a much higher accuracy but chose to benchmark against our F1 score value, which is regarded as a much better form of performance measure for classification problems, since it combines both the precision and recall of the model in question.   Figure 13 below shows the chart of a comparison of our proposed technique and other techniques. A detailed summary of the technique, dataset and year of publication is just as displayed in Table 11 above. More discussion on the image is given in the Discussion section of this article.   Table 11 above. More discussion on the image is given in the Discussion section of this article.

Data Acquisition and Pre-Processing
The Flavia leaf dataset The Flavia leaf dataset (download link: http://flavia.sourceforge.net/ (accessed on 13 November 2021)), introduced by Wu et al. [75], contains 1907 leaf images of size (1600 × 1200 pixels) obtained from 32 plant species on a white background, containing about 50-77 images per class. Figure 14 shows random sample images from the Flavia dataset used for this research work.

The Flavia leaf dataset
The Flavia leaf dataset (download link: http://flavia.sourceforge.net/ (accessed on 13 November 2021)), introduced by Wu et al. [75], contains 1907 leaf images of size (1600 × 1200 pixels) obtained from 32 plant species on a white background, containing about 50-77 images per class. Figure 14 shows random sample images from the Flavia dataset used for this research work.

The tomato leaf dataset
The tomato leaf dataset that was used in this research consists of images of diseased and healthy tomato plant leaves that were obtained from the publicly available PlantVillage [1] dataset, which is an open-access public resource for agriculture-related content. The entire dataset contains 54,306 images of plant leaves, which have a spread of 38 class labels assigned to them. However, the experiments in this research were narrowed down to only images of tomato plant leaves, which include nine types of tomato leaf diseases, and some healthy tomato leaves, making a total of 10 different categories for our research out of the entire 38 which are obtainable in the entire PlantVillage database. Our cropspecific data contained 18,160 images of tomato plant leaves, and each class is defined either with the corresponding name of the disease affecting the leaves in it or categorized as part of the healthy class. Figure 15 shows sample images from the PlantVillage dataset used for this research work.

2.
The tomato leaf dataset The tomato leaf dataset that was used in this research consists of images of diseased and healthy tomato plant leaves that were obtained from the publicly available PlantVillage [1] dataset, which is an open-access public resource for agriculture-related content. The entire dataset contains 54,306 images of plant leaves, which have a spread of 38 class labels assigned to them. However, the experiments in this research were narrowed down to only images of tomato plant leaves, which include nine types of tomato leaf diseases, and some healthy tomato leaves, making a total of 10 different categories for our research out of the entire 38 which are obtainable in the entire PlantVillage database. Our crop-specific data contained 18,160 images of tomato plant leaves, and each class is defined either with the corresponding name of the disease affecting the leaves in it or categorized as part of the healthy class. Figure 15 shows sample images from the PlantVillage dataset used for this research work. From Figure 15, the various diseases can be categorized either as fungi, bacteria, mold, viruses, or mites. Four of the diseases, namely, early blight, late blight, leaf spot, and target spots are caused by fungi; the bacterial spot is caused by bacteria, the leaf mold is the cause of a mold disease, while both the tomato yellow leaf curl and the tomato mosaic are viral infections, and the spider mite is a mite disease. A brief description of each of these diseases is given below:

•
Early blight is a fungal infection, and symptoms start as oval-shaped lesions with a yellow chlorotic region across the lesion; concentric leaf lesions may be seen on infected leaves.

•
Late blight, being another fungal infection, affects all aerial parts of the tomato plant; initial symptoms of the disease appear as water-soaked green to black areas on leaves which rapidly change to brown lesions; fluffy white fungal growth may appear on infected areas and leaf undersides during wet weather.

•
The leaf spot is another fungal infection. Infected plants exhibit bronzing or purpling of the upper sides of young leaves and develop necrotic spots; leaf spots may resemble those caused by bacterial spots, but a bacterial ooze test will be negative; leaves may cup downwards, and shoot tips may begin to die back. • Septoria leaf spot is yet another fungal disease. Symptoms may occur at any stage of tomato development and begin as small, water-soaked spots or circular grayishwhite spots on the underside of older leaves; spots have a grayish center and a dark margin, and they may coalesce.

•
Leaf mold is still another fungal infection. The older leaves exhibit pale greenish to yellow spots (without distinguishable margins) on the upper surface, whereas, the lower portion of these spots exhibits green to brown velvety fungal growth. As the From Figure 15, the various diseases can be categorized either as fungi, bacteria, mold, viruses, or mites. Four of the diseases, namely, early blight, late blight, leaf spot, and target spots are caused by fungi; the bacterial spot is caused by bacteria, the leaf mold is the cause of a mold disease, while both the tomato yellow leaf curl and the tomato mosaic are viral infections, and the spider mite is a mite disease. A brief description of each of these diseases is given below:

•
Early blight is a fungal infection, and symptoms start as oval-shaped lesions with a yellow chlorotic region across the lesion; concentric leaf lesions may be seen on infected leaves. • Late blight, being another fungal infection, affects all aerial parts of the tomato plant; initial symptoms of the disease appear as water-soaked green to black areas on leaves which rapidly change to brown lesions; fluffy white fungal growth may appear on infected areas and leaf undersides during wet weather.

•
The leaf spot is another fungal infection. Infected plants exhibit bronzing or purpling of the upper sides of young leaves and develop necrotic spots; leaf spots may resemble those caused by bacterial spots, but a bacterial ooze test will be negative; leaves may cup downwards, and shoot tips may begin to die back. • Septoria leaf spot is yet another fungal disease. Symptoms may occur at any stage of tomato development and begin as small, water-soaked spots or circular grayish-white spots on the underside of older leaves; spots have a grayish center and a dark margin, and they may coalesce. • Leaf mold is still another fungal infection. The older leaves exhibit pale greenish to yellow spots (without distinguishable margins) on the upper surface, whereas, the lower portion of these spots exhibits green to brown velvety fungal growth. As the disease progresses, the spots may coalesce and appear brown. The infected leaves wither and die but stay attached to the plant. • Bacterial spots are bacterial diseases, and lesions start as small water-soaked spots; lesions become more numerous and coalesce to form necrotic areas on the leaves giving them a blighted appearance; leaves drop from the plant, and severe defoliation can occur leaving the fruit susceptible to sunscald; mature spots have a greasy appearance and may appear transparent when held up to a light source; centers of lesions dry up and fall out of the leaf; blighted leaves often remain attached to the plant and give it a blighted appearance. Pre-processing steps are applied to cleanse and organize data before being fed into the model. By introducing a few distorted images into the training dataset, image transformations are used to increase the number of images in the dataset and reduce the chances of the model overfitting. The augmented images for the training data are created using standard image augmentation techniques such as flipping, Gamma correction, noise injection, PCA color augmentation, rotation, and scaling transformations. The images are each further resized to 128 before being fed to the training model.

Data Augmentation 2-CutMix
Let x ∈ R W×H×C and y denote both the training image and the corresponding label of our tomato leaf image, respectively. The goal of the CutMix augmentator is to generate a new training sample ( x, y) through the combination of two training samples, (x A , y A ) and (x B , y B ). The newly generated sample ( x, y) is then used in training the model with its original loss function.
We define the combining operation as where M ∈ {0, 1} W×H denotes a binary mask indicating where to drop out and fill in from two images, 1 is a binary mask filled with ones and is an element-wise multiplication, and the combination ratio λ between two data points is sampled from the beta distribution Beta(α, α). CutMix replaces an image region with a patch from another training image and generates locally natural images. CutMix is simple and incurs a negligible computational overhead as with existing data augmentation techniques; we can efficiently utilize it to train any network architecture. To sample the binary mask M, the bounding box coordinates B = (r x , r y ,r w , r h ) are first sampled indicating the cropping regions on x A and x B . The region B in x A is removed and filled in with the patch cropped from B of x B . The box coordinates are uniformly sampled according to: making the cropped area ratio: With the cropping region, the binary mask M ∈ {0, 1} W×H is decided by filling with 0 within the bounding box B, otherwise 1. In each training iteration, a CutMix-ed sample ( x, y) is generated by combining two randomly selected training samples in a mini-batch according to Equation (5). Figure 16 shows a visualization of the CutMix operation on the Flavia dataset. The mixture of patch images on different class images can be seen.
CutMix replaces an image region with a patch from another training image and generates locally natural images. CutMix is simple and incurs a negligible computational overhead as with existing data augmentation techniques; we can efficiently utilize it to train any network architecture. To sample the binary mask M, the bounding box coordinates = ( , , , ℎ ) are first sampled indicating the cropping regions on and . The region B in is removed and filled in with the patch cropped from B of . The box coordinates are uniformly sampled according to: making the cropped area ratio: With the cropping region, the binary mask M ∈ {0, 1} W× H is decided by filling with 0 within the bounding box B, otherwise 1. In each training iteration, a CutMix-ed sample (̃,̃) is generated by combining two randomly selected training samples in a mini-batch according to Equation (5). Figure 16 shows a visualization of the CutMix operation on the Flavia dataset. The mixture of patch images on different class images can be seen.

Convolutional Neural Networks
A typical convolutional neural network architecture will contain some basic building blocks, some of which are referred to as layers. These make up the building block for our proposed network approach and are described in the following subsection.

Convolutional Neural Networks
A typical convolutional neural network architecture will contain some basic building blocks, some of which are referred to as layers. These make up the building block for our proposed network approach and are described in the following subsection.

Convolutional Layer
In this layer, convolutional operations are performed on the input to learn useful features. To this effect, a convolutional kernel slides along the input image with a certain stride and outputs convolution plus a bias generally known as a feature map. The input to this layer could be either an RGB image or the output feature of a preceding layer for a multilayer network. A convolutional kernel means that given an input image when it is processed, the weighted average of pixels in a small area of the input image becomes each corresponding pixel in the output image, and the weight is defined by a function, and hence, they share weights to reduce parameters in the network. This process can be expressed mathematically as: where Y l−1 i is the output of the i-th feature map in the l − 1 layer, and X l j is the input of the jth feature map in the l layer. K ij and b l j are the convolutional kernel and bias in the l layer. f (.) is the activation function. Higher-level unique features can be identified through a series of increased convolution layers; hence, the need to go deeper.

2.
Pooling Layer After the convolution operation, there is a need to reduce the dimensions of the image for further processing. This process is known as down sampling or simply a pooling operation. This process can be expressed mathematically as: where X l j represents the jth feature map in the l layer. β l j and d l j are the multiplicative factor and bias, respectively.·down s represents an under-sampling function. Under-sampling can be done in many forms, some of which are average pooling, maximal pooling (max pool), minimal pooling operation, and so on. In our work, we employed max pooling.

Fully Connected Layer
Each neuron in the fully connected layer is connected to all neurons in the feature map of the previous layer, and the output can be expressed as: where h W,b (x) is the output, and W represents the corresponding weights of the network. The inputs to the fully connected layer are mainly features extracted from the preceding layer. Each feature in the former layer represents different semantic information that is unique and important for the next layer.

Transfer Learning Approach
The goal of transfer learning is to improve the target learners' performance on target domains by transferring useful knowledge [76,77] from disparate but related source domains to the target at a lower computational cost. The reliance on a large number of target domain data for constructing target learners can thus be reduced. It has emerged as a popular and promising area of machine learning due to its wide range of application possibilities, especially in solving real-world problems [78][79][80][81] in a cheaper and more reliable method. The sphere of use of transfer learning is not few, coupled with its record of high results [82,83].
An approach in transfer learning is that the last few layers of the pre-trained network are replaced with new layers, such as a fully connected layer and a softmax classification layer, with the number of classes set to be equivalent to that of the new target dataset, which in our research is 10 for the number of tomato leaf classes. In our research, all the networks used were pre-trained on the imagenet dataset before they were re-trained to learn the features of the tomato leaf image dataset in order to correctly recognize them based on their different classes.
However, the challenge of maximizing its use with regard to modifying the learning rate to suit the target data still exists. It is discovered that one learning rate being used across all the layers of a model in transfer learning is not a good practice for transfer learning [84]. In each model, we unfreeze the layer and add a stack of one activation layer, one batch-normalization layer, and one dropout layer. All models are tested with the same dropout values, different learning rates, and varying batch sizes.

Overall Architecture of the Proposed Method
The proposed research network architecture consists of five networks designed similarly but with a different number of layers. Each layer consists of both an identity and a convolutional block. The identity block is the standard block used in ResNets and corresponds to the case where the input activation and output activation have the same dimensions. Figure 17 shows the entire workflow for the tomato leaf disease recognition system. layer, with the number of classes set to be equivalent to that of the new target dataset, which in our research is 10 for the number of tomato leaf classes. In our research, all the networks used were pre-trained on the imagenet dataset before they were re-trained to learn the features of the tomato leaf image dataset in order to correctly recognize them based on their different classes.
However, the challenge of maximizing its use with regard to modifying the learning rate to suit the target data still exists. It is discovered that one learning rate being used across all the layers of a model in transfer learning is not a good practice for transfer learning [84]. In each model, we unfreeze the layer and add a stack of one activation layer, one batch-normalization layer, and one dropout layer. All models are tested with the same dropout values, different learning rates, and varying batch sizes.

Overall Architecture of the Proposed Method
The proposed research network architecture consists of five networks designed similarly but with a different number of layers. Each layer consists of both an identity and a convolutional block. The identity block is the standard block used in ResNets and corresponds to the case where the input activation and output activation have the same dimensions. Figure 17 shows the entire workflow for the tomato leaf disease recognition system. The identity block is as shown in Figure 18, in which two hidden layers are skipped to avoid the vanishing gradient problem. The purpose of this block is to match input and output dimensions. The purpose of the identity block is to resize the input to a different dimension. Figure 18 shows the residual identity mapping for the residual neural network which is the basic building block for the network used in this research, while Figure 19 shows a simplified block diagram for the arrangement of the various blocks for the residual neural network [60] network used in this research. The identity block is as shown in Figure 18, in which two hidden layers are skipped to avoid the vanishing gradient problem. The purpose of this block is to match input and output dimensions. The purpose of the identity block is to resize the input to a different dimension. Figure 18 shows the residual identity mapping for the residual neural network which is the basic building block for the network used in this research, while Figure 19 shows a simplified block diagram for the arrangement of the various blocks for the residual neural network [60] network used in this research.    Training deep learning architectures, whether it is a new model from scratch or via transfer learning is not without its difficulties. Learning algorithms based on artificial neural networks, and particularly deep learning for computer vision, may appear to include a plethora of bells and whistles, popularly known as hyperparameters [85]. To successfully train and debug them, more hyperparameters should be adjustable, and this has made it possible for one to obtain more interesting results. Making sure we have the right learning rate is one of the most important things we can do when training a model. If our learning rate is too low, training our model may take many, many epochs. For this research, all our five networks were pre-trained with the ImageNet dataset and then finetuned in the manner described below for re-training on our tomato leaf images.
We implement discriminative learning rates [86] in the process of training our model on plant leaf images. Discriminative fine-tuning is a fine-tuning strategy introduced by universal language model fine-tuning (ULMFiT) [86] for the implementation of natural language processing (NLP). Instead of the same learning rate being used across all the layers of our model, this allows us to tune each layer with a different learning rate that is efficient and suits it better than any other learning rate. Since the deepest layers of pretrained models may not require so high a learning rate as the final ones will require, we will use different learning rates for them as described by [84,87].
To preserve the quality of those pre-trained weights even after unfreezing them, some of the added parameters are tuned for a few epochs, and naturally, we would not expect the best learning rate for those pre-trained parameters to be as high as the best learning rate for the randomly added parameters. Keep in mind that the pre-trained weights were trained on millions of images over hundreds of epochs, so they have learned Training deep learning architectures, whether it is a new model from scratch or via transfer learning is not without its difficulties. Learning algorithms based on artificial neural networks, and particularly deep learning for computer vision, may appear to include a plethora of bells and whistles, popularly known as hyperparameters [85]. To successfully train and debug them, more hyperparameters should be adjustable, and this has made it possible for one to obtain more interesting results. Making sure we have the right learning rate is one of the most important things we can do when training a model. If our learning rate is too low, training our model may take many, many epochs. For this research, all our five networks were pre-trained with the ImageNet dataset and then fine-tuned in the manner described below for re-training on our tomato leaf images.
We implement discriminative learning rates [86] in the process of training our model on plant leaf images. Discriminative fine-tuning is a fine-tuning strategy introduced by universal language model fine-tuning (ULMFiT) [86] for the implementation of natural language processing (NLP). Instead of the same learning rate being used across all the layers of our model, this allows us to tune each layer with a different learning rate that is efficient and suits it better than any other learning rate. Since the deepest layers of pre-trained models may not require so high a learning rate as the final ones will require, we will use different learning rates for them as described by [84,87].
To preserve the quality of those pre-trained weights even after unfreezing them, some of the added parameters are tuned for a few epochs, and naturally, we would not expect the best learning rate for those pre-trained parameters to be as high as the best learning rate for the randomly added parameters. Keep in mind that the pre-trained weights were trained on millions of images over hundreds of epochs, so they have learned rich features from being previously trained on large datasets. Our experiment was carried out with a few specific detailed goals in mind, they are:

1.
To research the role of depth on the performance of such models, various residual convolutional neural networks, having different layers, such as ResNet-18, ResNet-34, ResNet-50, ResNet-102, and ResNet-152 were tested.

2.
Various train/test data split ratios were experimented on to determine the optimum value of the train/test split ratio for such a research area.

3.
Different batch sizes were selected based on the capacity of the GPU system obtainable in the laboratory to test for the influence of batch size on the training and if so on the result of the training and test processes of the model.

4.
The discriminative learning process was studied to determine the best learning rates to select in re-training the various models to achieve an optimal training process from one domain of data to a new domain of datasets: specific to this research is the tomato plant leaf dataset for disease recognition.

Tuning the Learning Rate Schedule
We began with an extremely low learning rate, and used it for one mini-batch, then determined the losses and increased the learning rate by a certain percentage, i.e., doubling it each time. Then, we ran another mini-batch and repeated the procedure above until the loss worsened rather than improved. This is the point where we realized we had gone too far. Then, we chose a learning rate that was slightly lower than this point. We used discriminative learning rates and a gradual unfreezing of the network to re-train the network using the tomato leaf images. The learning rate tool in fast.ai (fastai is a deep learning library which provides practitioners with high-level components [88]) was employed to return a plot of learning rate versus loss (cross-entropy) and to identify the optimal learning rates to use with the Adam optimizer. Figure 20 shows the plot of the learning rate curve as we trained the network. We used smaller learning rates in the early layers to permit the weights in those layers to change in a slower pattern than those from the later layers: 1 × 10 −4 , 1 × 10 −3 , and 1 × 10 −2 for the first, middle, and last layers, respectively. While training, for the first three epochs, only the final layers were trained, while all prior layers were frozen. Subsequently, all layers were unfrozen and trained for an additional 27 epochs. able in the laboratory to test for the influence of batch size on the training and if so on the result of the training and test processes of the model. 4. The discriminative learning process was studied to determine the best learning rates to select in re-training the various models to achieve an optimal training process from one domain of data to a new domain of datasets: specific to this research is the tomato plant leaf dataset for disease recognition.

Tuning the Learning Rate Schedule
We began with an extremely low learning rate, and used it for one mini-batch, then determined the losses and increased the learning rate by a certain percentage, i.e., doubling it each time. Then, we ran another mini-batch and repeated the procedure above until the loss worsened rather than improved. This is the point where we realized we had gone too far. Then, we chose a learning rate that was slightly lower than this point. We used discriminative learning rates and a gradual unfreezing of the network to re-train the network using the tomato leaf images. The learning rate tool in fast.ai (fastai is a deep learning library which provides practitioners with high-level components [88]) was employed to return a plot of learning rate versus loss (cross-entropy) and to identify the optimal learning rates to use with the Adam optimizer. Figure 20 shows the plot of the learning rate curve as we trained the network. We used smaller learning rates in the early layers to permit the weights in those layers to change in a slower pattern than those from the later layers: 1 × 10 −4 , 1 × 10 −3 , and 1 × 10 −2 for the first, middle, and last layers, respectively. While training, for the first three epochs, only the final layers were trained, while all prior layers were frozen. Subsequently, all layers were unfrozen and trained for an additional 27 epochs.  We can see from plot Figure 20a that in the range of 1 × 10 −6 to 1 × 10 −4 , nothing happened and the model did not train. Then, the loss started to decrease until it reached a minimum at 1 × 10 −1 , and then it increased again. This showed that a learning rate greater than 1 × 10 −1 was high and will make training diverge, but a learning rate of 1 × 10 −1 was already too high for training; hence, we needed to select a better rate. Figure 20 shows us four different points from our learning rate finder: valley, slide, steep, and minimum points, respectively. With such, we can select an appropriate value for the learning rate. However, that is not the end of the training procedure. Our goal is to train the whole network without breaking the pre-trained weights. Figure 21 shows activations of a convolutional neural network by layers, visually demonstrating what is learned by the different layers of a model [89]. already too high for training; hence, we needed to select a better rate. Figure 20 shows us four different points from our learning rate finder: valley, slide, steep, and minimum points, respectively. With such, we can select an appropriate value for the learning rate. However, that is not the end of the training procedure. Our goal is to train the whole network without breaking the pre-trained weights. Figure 21 shows activations of a convolutional neural network by layers, visually demonstrating what is learned by the different layers of a model [89]. Because different layers in a neural network capture different types of information, there is a need for them to be fine-tuned to different extents. Instead of using the same learning rate for all layers of the model, the practice of discriminative fine-tuning allows for the tuning of each layer in a neural network with different learning rates. For a proper context, the regular SGD update of a model's parameters θ at time step t looks can be represented by the following: where η is the learning rate and ∇θJ(θ) represents the gradient with regard to the objective function of the model in question. For discriminative fine-tuning, we split the parameters θ into {θ 1 ,...,θ L }, where θ l con- Because different layers in a neural network capture different types of information, there is a need for them to be fine-tuned to different extents. Instead of using the same learning rate for all layers of the model, the practice of discriminative fine-tuning allows for the tuning of each layer in a neural network with different learning rates. For a proper context, the regular SGD update of a model's parameters θ at time step t looks can be represented by the following: where η is the learning rate and ∇θJ(θ) represents the gradient with regard to the objective function of the model in question. For discriminative fine-tuning, we split the parameters θ into {θ 1 ,...,θ L }, where θ l contains the parameters of the model at the l-th layer and L is the number of layers in the entire model.
Similarly, we obtain {η 1 ,...,η L }, where η l signifies the learning rate of the l-th layer. The SGD update with discriminative fine-tuning can then be represented by the following: Empirically, it has been found to be the best practice to first of all choose the learning rate η L of the last layer by fine-tuning only the last layer and using η l−1 = η l /2.6 as the learning rate for lower layers.

Unfreezing and Re-Tuning the Learning Rate Schedule
As we went about fine-tuning the model, the deepest layers of our pre-trained ResNet model did not need as high a learning rate as the last ones, so we should probably use different learning rates for those-this is known as using discriminative learning rates. It is based on the idea that we use a lower learning rate for the early layers of the neural network, and a higher learning rate for the later layers, especially the randomly added layers that suit the problem to be solved. It derives its foundation from the insights drawn from the proposal of Jason Yosinski et al. [87] as summarized in Figure 22 below, which shows the impact different layers make and training methods on the transfer learning approach in applications of deep learning models. The learning rate finder selects a minibatch of images, determines the gradient for the minibatch, and then steps the weights based on the learning rate and the gradient. Leslie Smith initially set a very low learning rate for the first minibatch before gradually raising it. We chose the steep or minimal point divided by a value of 10 because the minimum was where learning stops, but the steepest point was where learning was most effective. The early layers did not need to be modified at all; however, the later layers did need to be adjusted to fit the tomato leaf images. Therefore, we used discriminative learning rates and set a smaller learning rate for the early layers and a greater learning rate for the latter layers.

Experimental Setup
The training and testing processes in this experiment were implemented on Pytorch The learning rate finder selects a minibatch of images, determines the gradient for the minibatch, and then steps the weights based on the learning rate and the gradient. Leslie Smith initially set a very low learning rate for the first minibatch before gradually raising it. We chose the steep or minimal point divided by a value of 10 because the minimum was where learning stops, but the steepest point was where learning was most effective. The early layers did not need to be modified at all; however, the later layers did need to be adjusted to fit the tomato leaf images. Therefore, we used discriminative learning rates and set a smaller learning rate for the early layers and a greater learning rate for the latter layers.

Experimental Setup
The training and testing processes in this experiment were implemented on Pytorch using fastai library [88]; the scikit-learn, pillow, and OpenCV libraries were used, all of which are written in Python. The model was trained and tested on an NVIDIA DGX-1 V100 with 8X Tesla V100 GPUs with a performance of one petaFLOP.
To carry out this research, a computer system was used to carry out the analysis processes. The computer system has the following specifications: Windows 10, 64-bit, Intel Core i7-4720 CPU @ 2.60 GHz, RAM 32 GB and GPU Nvidia GeForce GTX 1050 4 GB dedicated memory, and Python 3.7 on Anaconda. The PC was used on a Linux-based DELL PowerEdge T640 Tower Server with CUDA-based video cards GTX 1080TI, each GPU Video memory was 11Gb, with a storage memory of 10TB Hard Drive and 3320 GB SSD. Tables 12 and 13 detail the setup of the system used and the parameters used in running the model.

Discussion
Using data from ten different classes, the effect of different depths, and batch sizes (depth = 18, 34, 50, 101, and 152 and BS = 40, 50, 60, 70, 80, 90, and 100) and parameter values on their performance in adequately recognizing healthy tomatoes and other diseased classes were investigated in this study. The research was carried out using transfer learning and a pre-trained residual neural network. They performed well on the training and test data, according to the results. However, the steady state in the test data was observed to be delayed as the batch size value increased. The highest recognition accuracy value was 99.51% for depth = 152, with a train/validation data split size of 80/20 and a batch size of 40. According to the findings, the batch size value has no significant effect on overall performance but increasing the batch size value delays obtaining stable results. Table 11 shows a detailed comparison of our proposed technique with other state-ofthe-art models. As can be seen from Table 11 and Figure 13, our model outperformed the previous models, surpassing that of Li et al. [58], which was the closest in performance, with a +0.75% performance gain, and achieved a favorable performance with a +3.79% improvement from that of Paymode et al. [69] and other previous models. However, most recent models, such as Islam et al. [84] and Tarek et al. [71] outperformed our method, recording 100% and 99.81% accuracy. Whereby some authors used the accuracy metric to measure their performance, we recorded a much higher accuracy but chose to benchmark against our F1 score value, which is regarded as a much better form of performance measure for classification and recognition problems, since it combines both the precision and recall of the model in question.

Conclusions
Discriminative learning was implemented in training our proposed networks, having diverse layers, such as 18, 34, 50, 102, and 152 layers, respectively. The results show that the network with the highest depth produced the best performance, suggesting that deeper networks could still be better in deep learning. Five train/test data split ratios of 40/60, 50/50, 60/40, 70/30, and 80/20 were experimented on to determine the optimum value of the train/test split ratio for such a research area. The split ratio of 80/20 resulted in the optimal solution, with 70/30 coming after it. Different batch sizes were selected based on the capacity of the GPU system obtainable in the laboratory to test for the influence of batch size on the training, and if so on the result of the training and test processes of the model. The charts show a slight increase in performance for lesser batch sizes, while the most notable effect of the batch sizes was in speeding up the training process. The discriminative learning process was studied to determine the best learning rates to select in re-training the network during transfer learning, and the results show how effective it is to pre-train, determine a range of suitable learning rates, and re-train the entire network based on the discriminative learning rates to achieve a faster and more efficient performance. The data augmentation also proved to be effective, as the results were a little bit improved with the augmentation process.