A Deep Learning Enabled Multi-Class Plant Disease Detection Model Based on Computer Vision

Abstract: In this paper, a deep learning enabled object detection model for multi-class plant disease detection is proposed, based on a state-of-the-art computer vision algorithm. While most existing models are limited to detecting disease at a large scale, the current model addresses the accurate detection of fine-grained, multi-scale, early-stage disease. The proposed model has been optimized for both detection speed and accuracy and applied to multi-class apple plant disease detection in a real environment. The mean average precision (mAP) and F1-score of the detection model reached 91.2% and 95.9%, respectively, at a detection rate of 56.9 FPS. The overall detection results demonstrate that the current algorithm significantly outperforms the state-of-the-art detection model, with a 9.05% increase in precision and a 7.6% increase in F1-score. The proposed model can be employed as an effective and efficient method to detect different apple plant diseases under complex orchard scenarios.


Introduction
Plant diseases and pests cause significant ecological and agricultural losses. Thus, early detection and prevention of plant diseases is a key strategy in agricultural technology for commercial farms and orchards. Traditional disease diagnosis by manual visual observation is inefficient and time-consuming and significantly increases overhead costs [1][2][3][4][5][6]. Recently, with the advancement of computer vision in precision agriculture, disease detection has become an integral part of crop health monitoring, which substantially improves the efficiency of disease detection and the output of crop production [7][8][9][10][11].
Early identification and prevention of plant diseases are important aspects of crop harvesting since they can effectively reduce growth disorders and thus minimize pesticide application for pollution-free crop production. In this regard, automated plant disease detection utilizing different machine learning (ML) algorithms has become an efficient approach for precision agriculture [12][13][14][15][16][17][18]. Different ML approaches such as K-means clustering [14] and support vector machines (SVM) [16][17][18] have been employed for plant and disease classification and detection. However, due to complex image preprocessing and feature extraction steps, such methods have lower performance and speed in real-time disease detection. Additionally, one of the main drawbacks of traditional ML approaches is that they are not suitable for real-life detection scenarios with non-uniform, complex backgrounds. Recently, deep learning has made a significant breakthrough in the realm of computer vision with various applications [19][20][21]. It has also been employed in automated agricultural technology [22], including crop and fruit classification [23][24][25], image segmentation [26,27], and crop detection [28]. Consequently, convolutional neural network (CNN)-based models have gained significant popularity by demonstrating higher

The Proposed Network Structure of the Detection Model
In the current work, an improved model based on the state-of-the-art YOLOv4 algorithm [49] has been utilized for disease detection.
YOLOv4 is a high-precision one-stage object detection model that transforms the object detection task into a regression problem by generating bounding box coordinates and the corresponding probabilities of each class. During object detection, the input image is divided into an N × N grid of equal cells. When the center of a target-class ground truth falls inside a grid cell, that cell is responsible for detecting the target. Each grid cell predicts B bounding boxes with their confidence scores and C class conditional probabilities for the target classes. The confidence score can be expressed as
$$\text{confidence} = p_r(\text{object}) \times \mathrm{IoU}^{\text{truth}}_{\text{pred}} \tag{1}$$
When the target class falls inside the grid cell, $p_r(\text{object}) = 1$; otherwise, $p_r(\text{object}) = 0$. The overlap between the reference and the predicted bounding box is described by $\mathrm{IoU}^{\text{truth}}_{\text{pred}}$, where IoU is the intersection over union. When a target is detected inside the grid cell, the confidence score therefore reflects the accuracy of the bounding box prediction. Finally, the best bounding box predictions from each detection scale are filtered by the non-maximum suppression (NMS) [41] algorithm to obtain the final bounding boxes. The detection process is shown in Figure 1.
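The confidence score (the product of p_r(object) and the IoU) and the NMS filtering step can be illustrated with a minimal Python sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2) tuples; the function names, the sample boxes, and the 0.5 overlap threshold are illustrative assumptions, not values taken from the proposed model.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence(object_present, pred_box, truth_box):
    """Confidence = p_r(object) * IoU(truth, pred), with p_r equal to 1 or 0."""
    p_r = 1.0 if object_present else 0.0
    return p_r * iou(pred_box, truth_box)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap strongly with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping candidates and one separate candidate (made-up data).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy example the second box overlaps the first with an IoU of about 0.82, so it is suppressed and only the first and third boxes survive.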

[Figure 1: N × N grid on input; bounding boxes + confidence; class probability map; disease detection.]
However, when the original YOLOv4 model is used to detect different diseases in apple plants, several issues arise: densely populated, fine-grained, multi-scale lesion distributions; irregular geometric morphology of the infected areas; the occurrence of multiple diseases on the same leaf; and complex backgrounds. These significantly hinder detection accuracy and lead to a high number of missed detections as well as false predictions. To resolve these issues, an improved and optimized version of the state-of-the-art YOLOv4 algorithm is proposed in the present work, based on the characteristics and complexities of the disease dataset, to achieve better efficiency and accuracy in detecting different apple plant diseases at real-time detection speed in a complex environment. The complete schematic of the model network architecture is shown in Figure 2. It consists of three parts: a backbone for feature extraction, a neck for semantic representation of the extracted features, and a head for prediction.

Figure 2. Schematic of (a) the proposed network architecture for plant disease detection, consisting of Dense-CSPDarknet53 integrating SPP as the backbone, modified PANet as the neck, and a regular YOLOv3 head; (b) dense block structure.
During object detection, the YOLOv4 algorithm reduces the feature maps as they pass through the neural network. In order to preserve important feature maps and reuse critical feature information more efficiently, the DenseNet framework [51] has been implemented in the proposed model, where each layer is connected to all other layers in feed-forward mode. The main advantage of the DenseNet block is that the n-th layer receives the feature information of all previous layers $X_0, X_1, \ldots, X_{n-1}$ as input, which can be expressed as $X_n = H_n([X_0, X_1, \ldots, X_{n-1}])$, where $H_n$ is the spliced feature map function for layer n and $[X_0, X_1, \ldots, X_{n-1}]$ is the concatenation of the feature maps of layers $0, 1, \ldots, n-1$. Given the complexity of the image dataset, it was found that the dense blocks facilitate better feature transfer and gradient flow throughout the proposed neural network. Additionally, they may mitigate over-fitting to some degree. Thus, in the proposed model, the Cross-Stage Partial (CSP) convolution blocks CSP1, CSP2, CSP8, CSP8, and CSP4 in the original CSPDarknet53 have been modified to D1-CSP1, D2-CSP2, D3-CSP4, D4-CSP4, and D5-CSP2 by adding dense connection blocks to enhance feature propagation and by reducing the number of convolution blocks to remove redundant feature operations and improve computational speed. The schematic of the proposed dense block network structure is shown in Figure 2b.
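The dense connectivity rule, in which layer n receives the concatenation of all earlier feature maps, can be sketched in a few lines of Python. This is only a toy illustration: feature maps are represented as flat lists of channel values, and `make_layer` is a hypothetical stand-in for the convolution, normalization, and activation that a real dense layer would apply.

```python
def dense_block(x0, layer_fns):
    """Apply X_n = H_n([X_0, ..., X_{n-1}]) for each layer in turn.

    x0        : list of channel values for the block input X_0
    layer_fns : callables H_1..H_L, each mapping a channel list to a
                new channel list
    Returns the list [X_0, X_1, ..., X_L].
    """
    features = [x0]
    for h in layer_fns:
        # Concatenate every earlier feature map along the channel axis.
        concatenated = [c for x in features for c in x]
        features.append(h(concatenated))
    return features


def make_layer(growth_rate):
    """Hypothetical layer: emits `growth_rate` channels, each a scaled
    mean of its inputs (a placeholder for a learned convolution)."""
    def h(channels):
        mean = sum(channels) / len(channels)
        return [mean * (k + 1) for k in range(growth_rate)]
    return h


# Four input channels, three layers, two new channels per layer.
features = dense_block([1.0, 2.0, 3.0, 4.0], [make_layer(2) for _ in range(3)])
```

With four input channels and a growth rate of two, the three layers see 4, 6, and 8 input channels, respectively, which is how earlier features remain directly available to later layers.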
An important aspect of an object detection model is the selection of a proper activation function for the specific problem to enhance the accuracy and performance of the neural network [52]. In order to stabilize the network gradient flow and help the detection model learn more expressive features, the proposed model uses the Mish activation function [50], which can be expressed as $f(x) = x \tanh(\mathrm{softplus}(x)) = x \tanh(\ln(1 + e^{x}))$. Because Mish is unbounded above and bounded below, it helps to avoid saturation of the output neurons and improves network regularization. Additionally, due to its smoothness, it is less sensitive to the initialization of weights and to the learning rate. Using Mish as the primary activation function in place of the Leaky Rectified Linear Unit (Leaky-ReLU) [53] in the proposed model has demonstrated a significant gain in accuracy on our custom dataset.
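For reference, the Mish function can be written directly from its definition. The sketch below uses only the Python standard library; the numerically stable form of softplus and the Leaky-ReLU slope of 0.1 are assumptions made for illustration and are not specified in the text.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def leaky_relu(x, slope=0.1):
    """Leaky-ReLU for comparison; the slope value is an assumption."""
    return x if x >= 0 else slope * x


# Sample both activations on a few points for a side-by-side comparison.
samples = [-5.0, -1.0, 0.0, 1.0, 5.0]
table = [(x, mish(x), leaky_relu(x)) for x in samples]
```

For large positive inputs Mish approaches the identity, while for negative inputs it stays above a small negative floor (roughly -0.31) and returns to zero, which is the "unbounded above, bounded below" behaviour described in the text.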
To enhance the receptive field and separate important context features during object detection, an SPP block [54] was tightly integrated with the last residual block (D5-CSP2), as shown in Figure 2. In the proposed model, the SPP was modified to retain the output spatial dimension, with max pooling applied using sliding kernels of size 5 × 5, 9 × 9, and 13 × 13 and a stride of 1. The relatively large 13 × 13 max pooling effectively increases the receptive field of the backbone. Furthermore, to preserve fine-grained localization information, a modified PANet [55] has been used in the neck of the proposed network, which shortens the path for fusing high- and low-level features in the multi-scale feature pyramid, as shown in Figure 2. Additionally, DropBlock regularization [56] for learning spatially discriminating features and class label smoothing [49] for better generalization were employed. The original YOLOv3 head was utilized as the detection head. With an input image size of 512 × 512 × 3, the proposed model predicts bounding boxes at the detection head at three different scales: 64 × 64 × 24, 32 × 32 × 24, and 16 × 16 × 24. Data augmentation (i.e., rotation, mirror projection, color balancing, brightness transformation, and blur processing) was employed (as shown in Figure 3) to increase the variability of input images obtained from different environments, which enhances the robustness of the detection model.
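The two shape-related statements above, that stride-1 max pooling can retain the spatial dimension and that a 512 × 512 input yields 64, 32, and 16 cell grids, can be checked with a small Python sketch. The implicit padding of k // 2, the strides of 8, 16, and 32, and the choice of three anchors and three classes per scale are assumptions based on the usual YOLOv3-style head layout; the last two are chosen only because they reproduce the 24 output channels quoted in the text.

```python
def max_pool_same(grid, k):
    """Stride-1 max pooling with implicit padding k // 2, so the output
    has the same height and width as the input."""
    h, w, r = len(grid), len(grid[0]), k // 2
    return [[max(grid[i][j]
                 for i in range(max(0, y - r), min(h, y + r + 1))
                 for j in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)]
            for y in range(h)]


def spp(grid, kernels=(5, 9, 13)):
    """Return the identity branch plus one pooled map per kernel size;
    a real SPP block would concatenate these along the channel axis."""
    return [grid] + [max_pool_same(grid, k) for k in kernels]


def head_shapes(input_size=512, strides=(8, 16, 32), anchors=3, classes=3):
    """Output shapes of a YOLOv3-style head.
    Channels = anchors * (4 box offsets + 1 objectness + classes)."""
    channels = anchors * (5 + classes)
    return [(input_size // s, input_size // s, channels) for s in strides]


# A small synthetic 16 x 16 single-channel feature map.
feature_map = [[(y * 16 + x) % 7 for x in range(16)] for y in range(16)]
branches = spp(feature_map)
shapes = head_shapes()
```

Under these assumptions `shapes` reproduces the three scales quoted above, and every SPP branch keeps the 16 × 16 size of the input map.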

Performance Metrics of the Detection Model
In deep learning-based object detection models, several statistical metrics, including intersection over union (IoU), precision (P), recall (R), F1 score, average precision (AP), and mean average precision (mAP), are generally used to evaluate the performance of the model. In YOLOv4, the scale-invariant evaluation metric IoU is a standard measure of the accuracy of target object detection. IoU measures the overlap area ratio between the bounding box predicted by the model and the true bounding box of the object, which can be expressed as
$$\mathrm{IoU} = \frac{A_{\text{overlap}}}{A_{\text{union}}} \tag{2}$$
where $A_{\text{overlap}}$ is the intersection area between the predicted bounding box and the true bounding box of the object, as shown in Figure 4, and $A_{\text{union}}$ is the union area of the two bounding boxes. For binary classification, if IoU is greater than 0.5, the detection is defined as a true positive (TP); for IoU below 0.5, it is labeled as a false positive (FP). A ground-truth object that the model fails to detect is a false negative (FN). From the definitions of TP, FP, and FN, the performance parameters P and R can be expressed as
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \tag{3}$$
From Equation (3), one can conclude that a higher P represents a stronger capability of the model to distinguish negative samples, whereas a higher R refers to a stronger detection capability for positive samples. To combine both measures into a single score, the F1 score can be defined from Equation (3) as
$$F_1 = \frac{2 \times P \times R}{P + R} \tag{4}$$
The F1 score is the harmonic mean of precision and recall and thus reconciles the two measures. A higher F1 score indicates that the model is more robust.
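How the 0.5 IoU threshold turns raw predictions into TP, FP, and FN counts, and how precision, recall, and F1 follow from those counts, can be sketched as below. The greedy one-to-one matching and the sample boxes are simplifying assumptions for illustration; a full evaluation would usually process predictions in order of decreasing confidence and per class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def evaluate(predictions, truths, threshold=0.5):
    """Greedy one-to-one matching of predictions to ground-truth boxes.
    Returns the counts (TP, FP, FN)."""
    unmatched = list(truths)
    tp = fp = 0
    for p in predictions:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) > threshold:
            tp += 1
            unmatched.remove(best)  # each truth box is matched at most once
        else:
            fp += 1
    return tp, fp, len(unmatched)


def precision_recall_f1(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Three ground-truth lesions; two are found, one prediction is spurious.
truths = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
predictions = [(1, 1, 10, 10), (20, 20, 29, 31), (70, 70, 80, 80)]
tp, fp, fn = evaluate(predictions, truths)
p, r, f1 = precision_recall_f1(tp, fp, fn)
```

In this made-up example two predictions exceed the threshold, one prediction matches nothing, and one ground-truth box is missed, so precision, recall, and F1 all equal 2/3.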
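In a general sense, AP is equal to the area under the PR curve, which can be expressed as
$$AP = \int_0^1 P(R)\, dR \tag{5}$$
A higher AP corresponds to a larger area under the PR curve, indicating better accuracy in predicting an object class, whereas mAP is the average of all APs, which can be expressed as
$$mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i \tag{6}$$
where C is the number of classes.
In dense object detection models, bounding box regression is a popular approach to predict the localization boxes on the input images. In the proposed model, complete IoU (CIoU) [57] has been utilized to achieve better accuracy and convergence speed in the target bounding box prediction process. The CIoU loss incorporates an aspect-ratio consistency parameter v and a positive trade-off parameter α, and can be expressed as
$$L_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v, \qquad v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - \mathrm{IoU}) + v} \tag{7}$$
where, following the standard CIoU definition [57], $\rho(b, b^{gt})$ is the Euclidean distance between the centers of the predicted and ground-truth boxes and c is the diagonal length of the smallest box enclosing both; $w^{gt}$, $w$ and $h^{gt}$, $h$ are the widths and heights of the ground-truth bounding box and the predicted bounding box, respectively, as shown in Figure 5.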
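A minimal Python sketch of the CIoU loss is given below for a single pair of boxes. It follows the commonly published CIoU formulation (IoU term, normalized center distance, and aspect-ratio term weighted by α); the (cx, cy, w, h) box encoding and the function names are assumptions for illustration and may differ from the implementation used in the proposed model.

```python
import math


def corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def ciou_loss(pred, truth):
    """CIoU loss for two boxes given as (cx, cy, w, h) with w, h > 0."""
    px1, py1, px2, py2 = corners(pred)
    tx1, ty1, tx2, ty2 = corners(truth)

    # Plain IoU term.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pred[2] * pred[3] + truth[2] * truth[3] - inter
    iou = inter / union

    # Squared center distance over squared diagonal of the enclosing box.
    rho2 = (pred[0] - truth[0]) ** 2 + (pred[1] - truth[1]) ** 2
    c2 = ((max(px2, tx2) - min(px1, tx1)) ** 2
          + (max(py2, ty2) - min(py1, ty1)) ** 2)

    # Aspect-ratio consistency v and trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(truth[2] / truth[3])
                              - math.atan(pred[2] / pred[3])) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0

    return 1 - iou + rho2 / c2 + alpha * v


perfect = ciou_loss((5, 5, 4, 4), (5, 5, 4, 4))
disjoint = ciou_loss((0, 0, 2, 2), (10, 0, 2, 2))
```

Unlike a plain 1 - IoU loss, which is constant for all non-overlapping boxes, the center-distance term keeps the loss informative even when the boxes are disjoint, as in the second call.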

Results and Discussion
In order to develop a real-time, high-performance disease detection model on a single GPU, an improved version of the state-of-the-art YOLOv4 algorithm has been considered. Initially, a total of 600 original images, consisting of 200 images for each of the two apple diseases (i.e., scab and rust) and 200 images containing both scab and rust, were collected from the publicly available Kaggle PlantPathology Apple Dataset [58] to construct the single dataset. Utilizing different image augmentation procedures, the single dataset was expanded tenfold to obtain the custom dataset for this study (see Table 1). For image annotation of the target classes in the custom dataset, the Python-based open-source tool LabelImg [59] was used, which saves the annotations as XML files in PASCAL VOC format. Each XML file contains the target class and the corresponding bounding box coordinates for an image in the training dataset. From the custom dataset, 3600, 1200, and 1200 images were randomly selected to construct the training, validation, and test sets, respectively. The experiments were performed on a local system. The local computing resources and deep neural network (DNN) environment specifications are detailed in Table 2. To obtain better accuracy of the proposed detection model for different growth phases of apple, input images of size 512 × 512 were considered. The initial configuration parameters (i.e., initial learning rate, number of channels, momentum value, decay regularization, etc.) were kept the same as the original parameters in the YOLOv4 model. The primary initial configuration parameters of the improved YOLOv4 model are presented in Table 3.
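The dataset arithmetic (600 images expanded tenfold to 6000, then divided into 3600, 1200, and 1200) can be illustrated with a short Python sketch of a seeded random split. The image identifiers, the fixed seed, and the function name are hypothetical; the source states only that the images were selected at random.

```python
import random


def split_dataset(items, n_train=3600, n_val=1200, n_test=1200, seed=0):
    """Shuffle the items with a fixed seed and cut them into three sets."""
    if n_train + n_val + n_test != len(items):
        raise ValueError("split sizes must sum to the dataset size")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# 600 original images, each with 10 augmented variants (hypothetical names).
image_ids = [f"img_{i:03d}_aug{k}" for i in range(600) for k in range(10)]
train, val, test = split_dataset(image_ids)
```

The resulting split corresponds to a 60/20/20 ratio of training, validation, and test images.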

Overall Performance of the Proposed Detection Model
In order to compare the overall performance of the proposed detection model, the values of IoU, F1-score, mAP, final validation loss, and average detection time were compared with those of YOLOv3 and YOLOv4, as shown in Table 4. The proposed model attained the highest IoU value of 0.922, which is 6.1% higher than that of the original YOLOv4 model. Thus, the proposed detection model predicts bounding boxes more accurately than the other two models. The model demonstrated better efficiency and accuracy in detection performance with an F1 score of 0.959 and an mAP of 0.912, which are 7.6% and 7.3% improvements over YOLOv4, respectively. Furthermore, the average detection time was compared between the three models; YOLOv4 has the lowest detection time of 15.301 ms (or a speed of 65.22 FPS). The detection time of the proposed model, 17.577 ms (or 56.89 FPS), is higher than that of YOLOv4. Nevertheless, it can still provide real-time detection of high-resolution images with better accuracy and confidence than the other two models. The comparison of precision-recall (PR) curves between the three models is shown in Figure 6. Comparing the PR curves, one can conclude that the proposed model yields a higher precision at a given recall and the largest area under the PR curve among the three models. This indicates that the current model demonstrates better detection accuracy compared to YOLOv3 and YOLOv4.
Figure 7 compares the validation loss curves of the three models. In the initial phase, the loss began to decrease significantly after approximately 20,000 training steps in YOLOv4, whereas for the proposed model the loss reduction occurred after approximately 5000 training steps, indicating better convergence characteristics compared to YOLOv4.
After several cycles of fluctuation in the loss curve, the loss of the proposed model began to saturate after approximately 60,000 training steps with a final loss value of 1.65, whereas the final loss values of YOLOv3 and YOLOv4 were 11.12 and 4.31, respectively, as shown in Table 4. Clearly, the proposed model has a faster convergence rate and better convergence characteristics than the original YOLOv4 model, which indicates superior performance and detection accuracy. Detailed detection results containing TP, FP, and FN for each class and the corresponding precision, recall, and F1 score are presented in Table 5. The proposed model demonstrated relatively higher precision and recall for rust, namely 94.37% and 98.41%, respectively. Overall, the proposed model attained 93.91% precision and 98.14% recall, which are increases of 9.05% and 5.91% over the original YOLOv4, respectively. In comparison to the other models, the proposed model maximizes the TP value while FP and FN reach their minimum for all classes. For example, relative to YOLOv4, TP increases from 2944 to 3272, while FP and FN decrease from 525 to 212 and from 248 to 62, respectively, as shown in Table 5. Thus, the proposed model significantly improves the overall precision, recall, and F1 score on the test dataset compared to the YOLOv3 and YOLOv4 detection models, at the cost of a slight reduction in detection speed.
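The reported figures can be cross-checked from the TP, FP, and FN totals and the detection time quoted in the text. The short Python sketch below does only that arithmetic; the dictionary layout and function names are illustrative assumptions.

```python
def fps_from_ms(ms_per_image):
    """Frames per second for a given per-image detection time in ms."""
    return 1000.0 / ms_per_image


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # algebraically equal to 2PR / (P + R)
    return precision, recall, f1


# (TP, FP, FN) totals quoted in the text for the test set.
counts = {"YOLOv4": (2944, 525, 248), "proposed": (3272, 212, 62)}
metrics = {name: precision_recall_f1(*c) for name, c in counts.items()}

# Differences in percentage points: (precision, recall, F1).
gains = tuple(100 * (new - old)
              for new, old in zip(metrics["proposed"], metrics["YOLOv4"]))

# Detection speed of the proposed model from its 17.577 ms detection time.
proposed_fps = fps_from_ms(17.577)
```

Evaluated this way, the gains come out at roughly 9.05, 5.91, and 7.6 percentage points for precision, recall, and F1, which suggests that the reported increases are absolute differences rather than relative ones, and the 17.577 ms detection time corresponds to about 56.89 FPS.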

Detection Results for Different Plant Disease Classes
The detection results from the proposed model for two distinct diseases in the apple plant, considering two different infected leaves for each disease class, were compared with those of the YOLOv3 and YOLOv4 models, as shown in Figures 8 and 9. For better clarity of the bounding boxes, the two diseases, scab and rust, were marked with the bounding box class identifiers 1 and 2, respectively. The corresponding detection results, consisting of detected (Detc.), undetected (Undetc.), and misdetected (Misdetc.) diseases for each of the leaves, are detailed and compared between the three models in Tables 6-8. From the detection results, one can see that the bounding box predictions from the proposed model are more accurate than those of the YOLOv3 and YOLOv4 detection models for all disease classes.
Figure 8. Comparison of detection results for apple scab on two distinct apple leaves from three models: (a1,a2) YOLOv3; (b1,b2) YOLOv4; (c1,c2) proposed model.
Table 6. Comparison of detection results between YOLOv3, YOLOv4, and the proposed model for apple scab detection as shown in Figure 8. Bold highlights the best result obtained from the corresponding model prediction.

Scab detection: Scab lesions on leaves are roughly elliptical with feathery edges and have an olive-green to black color. They are typically distributed as discrete patches, as shown in Figure 8. Due to erratic growth patterns and the often high aspect ratio of the patches, it is challenging to detect each spot individually. In the first test case, a relatively sparse, discrete distribution of scab was considered. In this case, all three models work relatively well; however, the proposed model showed superior performance by correctly identifying all scab spots, while YOLOv3 and YOLOv4 had three and two undetected spots, respectively, as shown in Figure 8(a1-c1). For a more challenging case, a densely scab-infected sample with a complex background of soil and leaves was considered; the detection results from the proposed model indicate a significant improvement in detection accuracy and fewer undetected disease spots compared to the other two models, as shown in Figure 8(a2-c2). Overall, the proposed model demonstrates a reduced number of undetected scab spots compared to YOLOv3 and YOLOv4, as shown in Table 6.
Table 7. Comparison of detection results between YOLOv3, YOLOv4, and the proposed model for apple rust detection as shown in Figure 9. Bold highlights the best result obtained from the corresponding model prediction.

Rust detection: Rust infections usually first appear as small pale yellow spots on the upper surface of the leaf. They can rapidly extend to the whole surface of the leaf with a dense distribution of spots. Due to their fine-grained nature and the similarity of their texture to the complex background, it is often hard to detect each of these affected areas precisely. In the first test case, a relatively sparse, fine-grained, discrete distribution of rust was considered. The detection results from YOLOv3 and YOLOv4 show several missed detections of fine-grained disease spots, with five and three undetected rust spots, respectively, as shown in Table 7. In contrast, the proposed model identified the fine-grained infected zones without any undetected spots, as shown in Figure 9(a1-c1). In a more challenging scenario with a densely populated distribution of infected areas, there are several missed detections from YOLOv3 and YOLOv4, as shown in Figure 9(a2,b2). In this scenario, the proposed model demonstrated better multi-scale disease detection capability than the other two models, with higher confidence scores in bounding box prediction and a significant reduction in missed detections (see Table 7).
Table 8. Comparison of detection results between YOLOv3, YOLOv4, and the proposed model for both apple scab and rust as shown in Figure 10. Bold highlights the best result obtained from the corresponding model prediction.

Multi-class disease detection: In this section, the proposed model is tested for multi-class disease detection, where both scab and rust are present in the image. First, we considered a challenging early-phase case in which both diseases are fine-grained. One can see that the proposed model detects multi-class fine-grained disease spots more accurately than YOLOv3 and YOLOv4, as shown in Figure 10(a1-c1). In the second case, we considered a multi-scale disease detection problem in which the rust spots are relatively larger than the scab spots, as shown in Figure 10(a2-c2). In this challenging scenario, the proposed model demonstrated superior detection results and greatly reduced missed detections, as shown in Figure 10(c2). Moreover, it has higher confidence scores in bounding box prediction compared to the other two models, as shown in Table 8.
Figure 10. Comparison of detection results for both apple scab and rust on two distinct apple leaves from three models: (a1,a2) YOLOv3; (b1,b2) YOLOv4; (c1,c2) proposed model.

It can be concluded from our results that the proposed detection model has better capability and higher adaptability for disease detection in various environments compared to YOLOv3 and YOLOv4. The detection results demonstrate that the proposed detection model can provide high classification accuracy for multi-scale disease spot detection. Overall, it detects objects more accurately and can effectively reduce false detections and missed detections compared to the YOLOv3 and YOLOv4 models. The proposed model can be employed in real-life complex orchard scenarios for disease detection under various environmental conditions.

Conclusions
To summarize, in this study, a real-time object detection framework has been developed based on an improved YOLOv4 algorithm and applied to the detection of various apple plant diseases. The proposed model has been modified to optimize for accuracy and verified by detecting diseases under complex orchard scenarios. At a detection rate of 56.9 FPS, the proposed algorithm reached a mean average precision (mAP) of 91.2% and an F1-score of 95.9%. Compared to the original YOLOv4 model, the proposed model achieves a 9.05% increase in precision and a 7.6% increase in F1-score, indicating the potential for superior inspection performance in real-time in-field applications. The current work provides an effective and efficient method of detecting different plant diseases under complex scenarios and can be extended to different fruit and crop detection, generic disease detection, and automated agricultural detection processes.

Data Availability Statement:
The data that support the findings of this study are available from J. Bhaduri (j.bhaduri@capacloud.com) upon reasonable request.