Plant Leaf Disease Recognition Using Depth-Wise Separable Convolution-Based Models

Proper plant leaf disease (PLD) detection is challenging in complex backgrounds and under different capture conditions. For this reason, modified adaptive centroid-based segmentation (ACS) is first used to trace the proper region of interest (ROI). Automatic initialization of the number of clusters (K) using modified ACS before recognition increases the scalability of ROI tracing, even for symmetrical features in various plants. Besides, convolutional neural network (CNN)-based PLD recognition models achieve adequate accuracy to some extent. However, the memory requirements (large numbers of parameters) and the high computational cost of CNN-based PLD models are pressing issues for memory-restricted mobile and IoT-based devices. Therefore, after tracing ROIs, three proposed depth-wise separable convolutional PLD (DSCPLD) models, namely segmented modified DSCPLD (S-modified MobileNet), segmented reduced DSCPLD (S-reduced MobileNet), and segmented extended DSCPLD (S-extended MobileNet), are used to represent a constructive trade-off among accuracy, model size, and computational latency. Moreover, we compare our proposed DSCPLD recognition models with state-of-the-art models, such as MobileNet, VGG16, VGG19, and AlexNet. Among the segmentation-based DSCPLD models, S-modified MobileNet achieves the best accuracy of 99.55% and F1-score of 97.07%. We also simulate our DSCPLD models using both full plant leaf images and segmented plant leaf images and conclude that, after using modified ACS, all models increase their accuracy and F1-score. Furthermore, a new plant leaf dataset containing 6580 images of eight plants was used to experiment with several depth-wise separable convolution models.


Introduction
Plant disease is one of the crucial reasons for food insecurity all over the world. It reduces both the quantity and the quality of plant production [1]. For this reason, early detection of plant diseases and protective measures against them are a significant part of plant monitoring in the agro-industry. However, early detection of plant disorders and their categories is difficult with the naked eye and susceptible to human error. Support from machine learning and computer vision opens opportunities for automatic image-based decision making [2], monitoring, 3D reconstruction [3], and robot guidance in agricultural fields.
Plant diseases can be detected through leaves, roots, stems, and other parts of fruits and vegetables. For early detection of plant diseases, it is essential to detect the symptoms from the relevant plant part. This monitoring is vital in plant diagnosis. Sometimes, symptoms appear on specific parts of plants. Sometimes, symptoms develop in one plant part and then spread to another. In the latter case, symptoms may diminish in the later stages of the disease. Therefore, choosing the right plant part is important. In our depth-wise separable convolutional plant leaf disease (DSCPLD) recognition framework, we consider the detection of plant diseases that spread through young leaves.
Conventional machine learning algorithms are only appropriate and effective in specific circumstances and setups [4]. Under diversified and uncontrolled conditions, the accuracy of these algorithms falls drastically. With the breakthrough of deep learning [5], researchers were encouraged to apply deep learning to obtain state-of-the-art performance in agriculture. Some challenges remain, such as the memory restrictions of devices (number of parameters), sustainable accuracy (no fall in accuracy when testing on a new dataset), and computational latency (floating-point operations and multiply-accumulate operations).
Sustainable accuracy is a challenge in convolutional neural network (CNN)-based plant leaf disease (PLD) recognition models because accuracy falls after new PLD images are added, as reported in References [6,7]. To overcome this challenge, it is essential to remove unnecessary information from PLD images and to consider heterogeneous image backgrounds. Moreover, some works are limited to symmetric backgrounds [6-10] and are sensitive to image capturing conditions [11].
To overcome the above-mentioned limitations of existing PLD recognition frameworks, we propose a depth-wise separable convolution (DSC)-based PLD (DSCPLD) recognition framework. In this framework, we introduce a segmentation technique called adaptive centroid-based segmentation (ACS) that traces the proper regions of interest (ROIs) under the different circumstances described in Reference [23], such as images with shading, images with objects behind the leaf, and shrunk images overlapped with other plant leaves. Automatic initialization of the optimal cluster number (K) from the PLD images in our modified ACS resolves the insensitivity to a proper K in Reference [20]. This technique helps the DSCPLD recognition model avoid noise and distortion in ROIs irrespective of real field environments. It increases the generalization ability of DSCPLD and limits the fall in accuracy reported in References [6,7].
Moreover, to reduce the parameters and computational cost for mobile and IoT-based applications, depth-wise separable convolutional (DSC)-based PLD (DSCPLD) models are developed based on MobileNet [24,25]. Finally, a comprehensive trade-off is drawn among accuracy, parameter size, and computational latency for mobile and IoT-based PLD recognition.
The primary contributions of this paper are as follows:
(i) A new dataset is introduced that includes diversified backgrounds of PLD images. PLD images are investigated under both direction-based and illumination-based augmentations to recognize PLDs in natural circumstances.
(ii) A modified segmentation technique is introduced that can trace the accurate ROI irrespective of diversified backgrounds, uneven illumination, and orientation. This increases the sustainability of our DSCPLD recognition framework and decreases the possibility of a fall in accuracy when testing on an independent dataset.
(iii) Various modified and reduced DSC-based architectures are developed using segmented images and full PLD images to establish a concrete trade-off among accuracy, parameter size, and computational latency for mobile and IoT-based PLD recognition.
The rest of the paper is organized as follows. Section 2 discusses related works; the proposed model for recognizing plant leaf diseases is presented in Section 3; experimental results and observations are presented in Section 4; and, finally, the paper is concluded in Section 5.

Literature Review
Manual plant disease identification and plant health monitoring are laborious and time-consuming tasks. They are also often subjective, costly, and challenging. Therefore, researchers investigate automatic detection and identification techniques to solve this problem and make farmers' activities more efficient and accurate.
Conventional machine learning algorithms are only appropriate and effective in specific circumstances and setups [4]. Under diversified and uncontrolled conditions, the accuracy of these algorithms falls drastically. With the breakthrough of deep learning [5], researchers were encouraged to apply deep learning to obtain state-of-the-art performance in agriculture.
Numerous modifications have been made to CNN architectures for recognizing PLDs in recent years. Ferentinos et al. [6] applied CNN models to detect 58 diseases of 25 plants and achieved a 99.53% success rate with VGG. However, accuracy fell by 25-35% on data previously unknown to the trained model. In Reference [7], 26 PLDs of 14 crop species were identified using GoogleNet and AlexNet, both by transfer learning and by training from scratch, with an accuracy of 99.35%. However, this work has limitations: images were taken under controlled conditions, and accuracy falls drastically (above 31%) on the independent test dataset. Sladojevic et al. [8] applied a modified CaffeNet using ImageNet to more than 3000 images of 13 classes collected from Internet resources and achieved an accuracy of 96.3%. However, this work is limited by the small number of sample images in the dataset and could be improved by increasing the samples. In Reference [10], VGG, ResNet, Inception, and DenseNet were applied to detect 38 PLDs of 14 plants, and DenseNet achieved 99.75% accuracy. However, computational cost remains a concern. Another limitation is that only homogeneous backgrounds with a single leaf are considered. Liang et al. [11] proposed a custom CNN model for rice blast disease recognition and achieved better accuracy than feature extraction techniques, such as histogram-based local binary pattern (HLBP) and Haar wavelet transformation (HaarWT). In this work, the custom CNN architecture achieved the best accuracy of 95.83%. However, this work is sensitive to image capturing conditions and needs a larger number of samples.
In Reference [13], two common diseases of banana were detected using the LeNet architecture. The experiment was performed on 3700 banana color images collected from PlantVillage and was also run on grayscale images. In this work, the LeNet architecture achieved 92-99% accuracy. However, the proposed work is still limited when images are taken in real conditions, and accuracy falls significantly on grayscale images. Rahman et al. [14] applied two state-of-the-art CNN architectures, VGG16 and InceptionV3, for recognizing rice diseases. Besides, they proposed a two-stage CNN model, which is effective for memory-restricted devices. The authors identified that their manual process of dividing symptom classes might cause misclassifications. In Reference [18], Liu et al. proposed PLD recognition models, including five CNN architectures (AlexNet, GoogleNet, ResNet20, and VGGNet-16) and two machine learning algorithms, support vector machine (SVM) and backpropagation neural network (BPNN), for recognizing apple leaf diseases. Among them, modified AlexNet achieved the best accuracy of 97.62%. As future work, they identified the need to expand the dataset. In Reference [19], Arsenovic et al. applied various state-of-the-art CNN architectures (AlexNet, VGG19, InceptionV3, DenseNet201, and ResNet) with generative adversarial network (GAN) data augmentation for recognizing 42 classes of 12 species and achieved the best accuracy of 90.88%. Besides, in this work, Faster R-CNN, Faster R-CNN with FPN, Faster R-CNN with TDM, YOLOv3, SSD513, and RetinaNet were applied for object detection in plants. Moreover, this work demonstrates generalization by using independent training and test datasets. The authors pointed out that, in the future, they will integrate their work into a mobile application. However, there is no analysis of computational complexity and memory requirements for mobile devices in this work.
The authors in Reference [20] trained custom CNN models on both full images and segmented images of 10 diseases and achieved 98.6% for S-CNN and 42.3% for F-CNN; however, the work is limited in segmenting properly under uneven illumination and different orientations.
Chen et al. [21] proposed a custom CNN model named LeafNet for extracting disease features from tea leaf images. Moreover, in this work, dense scale-invariant feature transform features (DFTF) were also extracted and later used to construct a bag of visual words (BOVW) model. Support vector machine (SVM) and multi-layer perceptron (MLP) classifiers were then used to classify diseases. Among all the models, the LeafNet algorithm identified tea leaf diseases with an accuracy of 90.16%. The authors identified the need to investigate their model's universality for different species. Transfer learning was used in Reference [26] to identify plants. Six state-of-the-art architectures (AlexNet, DenseNet169, InceptionV3, ResNet34, SqueezeNet-1.1, and VGG13) were applied to the PlantVillage dataset and achieved an accuracy of more than 99.2%. A saliency map used as a visualization method helped to learn the diversified features. In Reference [27], the authors investigated the computational complexity and memory requirements for plant leaf disease recognition. In Reference [28], the authors applied Faster R-CNN, region-based fully convolutional network (R-FCN), and SSD, with VGG16 as the backend, to recognize tomato diseases. Their motivation was to overcome the limitations of tracing disease features under different illumination and complex backgrounds. They used 5000 images and later increased the number of images using geometrical and intensity transformations. Despite data augmentation, the obtained accuracy is not high, at 85.98% on average. In Reference [29], the authors demonstrated the impact of segmentation and background removal. To do so, they used 1567 images to identify multiple diseases in the same sample. A pre-trained GoogLeNet CNN architecture achieved 75 to 100% accuracy depending on species. The work in Reference [30] presented a concrete study of various pooling strategies, such as mean-pooling, max-pooling, and stochastic pooling, to recognize rice leaf diseases using CNNs.
In this work, the CNN achieved 95.48% with stochastic pooling. The authors pointed out the need to expand the sample images and to optimize the number of parameters. In Reference [31], the machine learning-based algorithms support vector machine (SVM), linear regression (LR), and random forest (RF) were used to classify six classes of peanut leaf diseases. Moreover, five CNN models (VGG, AlexNet, ResNet50, DenseNet121, and InceptionV3) were investigated with and without augmentation. Among them, DenseNet121 achieved 95.98% with augmentation, and ResNet50 achieved 94.36% without augmentation. The authors found that deep learning models achieved better accuracy with augmentation and when ensembled with machine learning algorithms. The ensemble of DenseNet121 and RF achieved a better accuracy of 97.59%. However, this work is still limited by the small number of disease images and classes.
In recent times, some comprehensive surveys [4,23,27] have been conducted to summarize the limitations of current PLD recognition methods. Some challenges of current PLD recognition works are as follows: (i) diversified data with heterogeneous backgrounds, such as natural and complex backgrounds, under uncontrolled capture conditions; (ii) more accurate identification, given similar symptoms in various plant diseases; (iii) a drastic fall in accuracy; and (iv) identification of disease phases, given that symptoms change.
In most cases, authors have solved the above-mentioned problems to a certain extent; however, there are many opportunities to improve PLD recognition models:
(i) sustainable accuracy. To do so:
• use diversified data with heterogeneous backgrounds, such as natural and complex backgrounds, under uncontrolled capture conditions.
• use a segmentation phase to eradicate unnecessary noise.
• test on a dataset that is not part of the training set.
(ii) investigate memory requirements and computational latency in order to integrate the model into mobile devices.
Table 1 gives brief descriptions of various PLD recognition frameworks, and Table 2 lists the limitations of existing PLD recognition frameworks (NR = not resolved, R = resolved, PR = partially resolved).

Materials and Proposed Method
In this section, our proposed framework is discussed in detail. Initially, the disease recognition framework optionally enhances the RGB PLD image, and then ACS is applied to trace the ROIs. Finally, our DSC-based architectures, which are modifications of MobileNet [24,25], are applied to recognize the PLDs. The proposed DSCPLD recognition framework is shown in Figure 1.

Adding Direction Disturbance to Dataset
One of the challenges in PLD recognition is uncontrolled capturing conditions, such as image capturing in different orientations. Depending on the relative position of the acquisition device, the characteristics of images can be spatially transformed. However, it is impractical to collect PLD images from every angle. For this reason, we use different directional augmentations to expand our PLD dataset. This augmentation increases the adaptability of our DSCPLD models.
Rotation in an image refers to rotating all pixels by a certain angle. Suppose $P(x_0, y_0)$ is a certain pixel in an image. After rotating by θ° clockwise, this pixel moves to position $P(x, y)$. The coordinates of $P(x_0, y_0)$ and $P(x, y)$ are related as in Equations (1) and (2).
Mirror symmetry in an image refers to reflecting all pixels about a line selected as an axis. Horizontal mirror symmetry selects a vertical line in the image and reflects all pixels about it, whereas vertical mirror symmetry selects a horizontal line in the image and reflects all pixels about it. Suppose the image's width is w and $P(x_0, y_0)$ is a certain pixel in the image. After applying horizontal and vertical mirror symmetry, the point's coordinates are as shown in Equations (3) and (4), respectively.
In our DSCPLD recognition framework, we use rotation and mirror symmetry (vertical and horizontal) on our original PLD images as shown in Figure 3a-g.
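The coordinate mappings described above can be illustrated with a minimal sketch. Equations (1)-(4) are not reproduced in this text, so the conventions below are assumptions: rotation is clockwise about the origin with the y axis pointing up, and the mirrors act on a zero-based pixel grid of a given width and height. The function names are hypothetical.

```python
import math

def rotate_clockwise(x0, y0, theta_deg):
    """Rotate (x0, y0) clockwise by theta_deg about the origin.

    Assumes a mathematical axis orientation (y pointing up).
    """
    t = math.radians(theta_deg)
    x = x0 * math.cos(t) + y0 * math.sin(t)
    y = -x0 * math.sin(t) + y0 * math.cos(t)
    return x, y

def mirror_horizontal(x0, y0, width):
    """Reflect about a vertical axis: columns are reversed, rows unchanged.

    Assumes zero-based column indices 0 .. width - 1.
    """
    return width - 1 - x0, y0

def mirror_vertical(x0, y0, height):
    """Reflect about a horizontal axis: rows are reversed, columns unchanged.

    Assumes zero-based row indices 0 .. height - 1.
    """
    return x0, height - 1 - y0
```

In an image coordinate system with the y axis pointing down, the same rotation formula turns pixels in the opposite visual direction, so the sign of the sine terms would need to be flipped.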

Adding Lighting Disturbance to Dataset
Weather condition is one of the challenges in capturing images. Sunlight orientation, shadow and foggy weather have an impact on the brightness of acquired images. For improving the generalization ability, we generate images by adjusting the sharpness value, brightness value, and contrast value.
Sharpening an image enhances edges and borders so that the objects in the image stand out. Suppose a pixel in RGB is $P(x, y)$, with $P(x, y) = [R(x, y), B(x, y), G(x, y)]^T$. To add sharpness to the image, we apply the Laplace operator to that pixel using Equation (5).
Brightness adjustment in an image refers to increasing or decreasing the RGB values of a pixel. Suppose $B_0$ is the original RGB value and d is the brightness transformation factor. After applying the brightness transformation factor, we get the adjusted RGB value (B) as shown in Equation (6).
Contrast adjustment in an image increases larger RGB values and decreases smaller RGB values relative to the brightness median. Suppose $B_0$ is the original RGB value, d is the brightness transformation factor, and i is the brightness median. After applying the contrast adjustment, we get the adjusted RGB value (B) as shown in Equation (7).
We apply various illumination-based augmentations in our DSCPLD recognition framework, such as changes in contrast, brightness, and sharpness, on our PLD dataset, as shown in Figure 4a-g.
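The brightness and contrast adjustments described above can be sketched as follows. Equations (6) and (7) are not reproduced in this text, so the forms used here, B = B0 × (1 + d) for brightness and B = i + (B0 − i) × (1 + d) for contrast, are assumptions that match the verbal description; clipping to the 8-bit range [0, 255] and the function names are also assumptions.

```python
def clip(value, low=0.0, high=255.0):
    """Keep an adjusted channel value inside the 8-bit range (assumed)."""
    return max(low, min(high, value))

def adjust_brightness(b0, d):
    """Scale a channel value by the brightness factor d.

    Assumed form: B = B0 * (1 + d). Positive d brightens, negative d darkens.
    """
    return clip(b0 * (1.0 + d))

def adjust_contrast(b0, d, i):
    """Push a channel value away from (d > 0) or toward (d < 0) the median i.

    Assumed form: B = i + (B0 - i) * (1 + d). Values above the median
    increase and values below it decrease when d is positive.
    """
    return clip(i + (b0 - i) * (1.0 + d))
```

With d = 0.5 and a median of 128, a channel value of 200 moves up to 236 and a value of 50 moves down to 11, which is the behavior the contrast description calls for.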

Enhancing Image Using Statistical Features
Enhancement of the PLD image is optional, as it depends on the magnitude of degradation. Two enhancement conditions are defined using statistical features of a plant leaf image: mean ($\mu$), median ($\tilde{x}$), and mode ($M_0$). The two conditions for image enhancement are given in Equations (8) and (9).
The performance of the image enhancement conditions is shown in Figure 5a-h. When the ROI and the image background have symmetric color, the enhancement condition in Equation (8) performs well, as shown in Figure 5a. When leaf shadow is present on the image background, the enhancement condition in Equation (9) performs well, as shown in Figure 5b.

Clustering by Adaptive Centroid-Based Segmentation
The modified adaptive centroid-based segmentation (ACS) is applied once the PLD image quality has been enhanced. Initially, the RGB PLD image is converted to the L*a*b color space. Our modified ACS focuses on initializing the optimal K automatically from the leaf image based on the chromatic values (a and b), to eliminate the lack of sensitivity to K noted in Reference [20]. In traditional K-means clustering, the Euclidean distance between each point and the centroid is calculated to check whether the point belongs to the same cluster. In the modified ACS, data points are first checked for eligibility using a statistical threshold. After that, we calculate the distance between these eligible points and the centroids, which reduces the effort needed to form clusters and limits the misclustering of data points. The statistical threshold (ST) value is calculated by Equation (10).
where $X_i$, C, and N stand for the data points, the centroid of the data points, and the total number of data points, respectively. The automatic initialization of K using ACS can effectively detect image characteristics for different orientations and illuminations. ACS also increases the scalability of the proposed segmentation technique, as shown in Figure 5f,h, over the traditional segmentation technique, as shown in Figure 5e,g. A few examples under different circumstances are shown in Figure 6a-e. Figure 6a shows a shrunk rice leaf image in a natural background with shadow present. Figure 6b presents a blurred rice leaf image in a natural background with light of the same color. Figure 6c shows a rice leaf image in which the ROI and the shadow of the objects behind it have symmetric color. Figure 6d shows a rice leaf image with a complex background. Figure 6e shows a potato image with shadow behind the ROIs. The segmented results of the plant samples in Figure 6a-e are presented in Figure 6f-j, respectively.

Recognition by DSCPLD Models
In this section, we describe the basic operations of depth-wise separable convolution, basic modules of MobileNet variations, DSCPLD model design, and tuning.

Depth-wise Separable Convolution
Our PLD recognition framework is constructed on depth-wise separable convolution (DSC). DSC comprises two convolutions: a depth-wise convolution and a point-wise convolution. It splits a 3 × 3 convolution into a 3 × 3 depth-wise convolution and a 1 × 1 point-wise convolution. Traditional convolution performs both the channel-wise and the spatial-wise computation in a single step. In traditional convolution, each input channel is convolved with one specific kernel, and the output is the combination of the convolved results from all the channels. In contrast, DSC breaks the operation into two steps. Depth-wise convolution is a channel-wise convolution that convolves each input channel individually. Point-wise convolution, which is similar to traditional convolution with kernel size 1 × 1, then combines the results of each channel. The comparison among the convolutions is shown in Figure 7. The computational cost of the traditional convolution ($Cost_C$) is shown in Equation (11).
However, in the case of depth-wise separable convolution, the computational cost ($Cost_D$) is shown in Equation (12).
The weight ($W_C$) considered for traditional convolution is shown in Equation (13).
The weight ($W_D$) considered for depth-wise separable convolution is shown in Equation (14).
where N is the number of input channels, and P is the number of output channels. K × K is the width and height of the kernel, and M × M is the width and height of an input feature map. Finally, the reductions in weights ($F_W$) and operations ($F_{Cost}$) are derived in Equations (15) and (16).
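The cost and weight comparison can be made concrete with a short calculation. The sketch below uses the standard MobileNet-style counts in the symbols defined above: M·M·K·K·N·P operations and K·K·N·P weights for traditional convolution, against M·M·K·K·N + M·M·N·P operations and K·K·N + N·P weights for DSC, which gives a reduction factor of 1/P + 1/K² in both cases. It assumes stride 1 and an output feature map of the same M × M size, and ignores biases; whether Equations (11)-(16) are written in exactly this form is an assumption, and the function names are hypothetical.

```python
def conv_cost(m, k, n, p):
    """Multiply-accumulates of a traditional K x K convolution on an M x M map."""
    return m * m * k * k * n * p

def dsc_cost(m, k, n, p):
    """Multiply-accumulates of depth-wise (K x K) plus point-wise (1 x 1) convolution."""
    depthwise = m * m * k * k * n   # one K x K filter per input channel
    pointwise = m * m * n * p       # 1 x 1 convolution mixing N channels into P
    return depthwise + pointwise

def conv_weights(k, n, p):
    """Weights of a traditional convolution (biases ignored)."""
    return k * k * n * p

def dsc_weights(k, n, p):
    """Weights of a depth-wise separable convolution (biases ignored)."""
    return k * k * n + n * p

def reduction_factor(k, p):
    """Ratio of DSC to traditional convolution, for both cost and weights."""
    return 1.0 / p + 1.0 / (k * k)

if __name__ == "__main__":
    m, k, n, p = 112, 3, 32, 64  # illustrative layer shape, not taken from the paper
    print(conv_cost(m, k, n, p), dsc_cost(m, k, n, p))
    print(conv_weights(k, n, p), dsc_weights(k, n, p))
    print(reduction_factor(k, p))
```

For a 3 × 3 kernel the factor is dominated by the 1/K² term, so DSC needs roughly one eighth to one ninth of the operations and weights of the traditional layer once P is moderately large.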

Basic Depth-wise Separable Convolution Modules
Numerous CNN models are constructed by modifying convolution layers. AlexNet, VGG, Inception, and ResNet perform comparatively well in recognizing PLDs. However, it is not feasible to use those models for mobile and IoT-based PLD recognition applications because of their large number of network parameters. To overcome this, depth-wise separable convolutions are proposed to improve the trade-off among accuracy, parameter size, and computational latency. There are two variations of depth-wise separable convolution: point-wise convolution adjacent to depth-wise convolution, as shown in Figure 8b, and batch normalization and ReLU used between each depth-wise convolution and point-wise convolution, as shown in Figure 8c. From these concepts, we propose three architectures. The first, based on Figure 8b as depicted in Reference [25], is modified MobileNet (called S-modified MobileNet for segmented images and F-modified MobileNet for full leaf images). The second is reduced MobileNet (called S-reduced MobileNet for segmented images and F-reduced MobileNet for full leaf images), based on the MobileNet version in Reference [24], as shown in Figure 8c. The third is extended MobileNet (called S-extended MobileNet for segmented images and F-extended MobileNet for full leaf images), based on Figure 8c with a max-pooling layer used once after the last point-wise convolution.
In MobileNetV2 [36], a linear bottleneck and an inverted residual structure are added to build an efficient structure. It includes an additional 1 × 1 convolution followed by a pair of depth-wise and point-wise convolutions. Moreover, there is a residual connection between input and output when they have the same number of channels, as shown in Figure 9.
In MobileNetV3 [37], in addition to all these layers, modified swish non-linearities and squeeze-and-excitation modules are added to make MobileNet more efficient.
There are two extra hyper-parameters in the MobileNet versions: the width multiplier (α) and the resolution multiplier (ρ). The width multiplier (α) is used to make the network thinner, and the resolution multiplier (ρ) is used to control the input resolution and, consequently, the size of each layer.
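The effect of the two multipliers on the cost of a single DSC layer can be sketched as follows, using the MobileNet formulation in which α scales the numbers of input and output channels and ρ scales the feature-map side length. The symbols M, K, N, and P follow the cost equations of the previous subsection; rounding the scaled sizes to integers and the function name are assumptions made for illustration.

```python
def dsc_cost_scaled(m, k, n, p, alpha=1.0, rho=1.0):
    """Multiply-accumulates of one DSC layer under width and resolution multipliers.

    alpha thins the layer (channels N and P), rho shrinks the M x M feature map.
    Scaled sizes are rounded to the nearest integer (an assumption).
    """
    n_s = int(round(alpha * n))   # thinned input channels
    p_s = int(round(alpha * p))   # thinned output channels
    m_s = int(round(rho * m))     # reduced feature-map side length
    depthwise = m_s * m_s * k * k * n_s
    pointwise = m_s * m_s * n_s * p_s
    return depthwise + pointwise

if __name__ == "__main__":
    base = dsc_cost_scaled(112, 3, 32, 64)
    for alpha in (0.25, 0.5, 0.75, 1.0):
        print("alpha", alpha, dsc_cost_scaled(112, 3, 32, 64, alpha=alpha) / base)
    for rho in (0.5, 0.75, 1.0):
        print("rho", rho, dsc_cost_scaled(112, 3, 32, 64, rho=rho) / base)
```

The resolution multiplier reduces cost by exactly ρ², while the width multiplier reduces it by roughly α² because the point-wise term, which scales with both channel counts, dominates.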

Model Design and Tuning
As one of our goals was to establish a concrete representation of the trade-off among accuracy, parameter size, and computational latency, we compared our DSCPLD recognition models with state-of-the-art CNN models, such as AlexNet (input size: 224 × 224), VGG (input size: 180 × 180), MobileNetV1 (input size: 224 × 224), MobileNetV2 (input size: 224 × 224), and MobileNetV3 (input size: 224 × 224). The architectures of the three DSCPLD recognition models based on MobileNet are given in Table 4 (input size 224 × 224), Table 5 (input size 224 × 224), and Table 6 (input size 256 × 256), respectively. We split our PLD dataset into three parts, train, validation, and test, in the ratio 70-20-10, as shown in Table 3. In the training phase, we train our DSCPLD models and the other state-of-the-art models using our PLD dataset. We validate our DSCPLD models using PLD images from our dataset to tune hyper-parameters and reduce the bias of those models. For generalization, we test our DSCPLD models with our PLD dataset and another benchmark rice dataset. The performance of the models is evaluated on mean test accuracy (mAcc) and mean F1-score (mF). Then, we investigate the impact of segmentation by running all the models on both segmented and full leaf images. For all the experiments, various optimizers, such as Adam, SGD, and RMSprop, are used to optimize the weights and minimize the loss. We investigate the best loss of our DSCPLD models using learning rates of 0.001 and 0.0001. The momentum values used for the SGD optimizer are 0.8 and 0.9. We use categorical cross-entropy as the loss function and softmax as the activation in the output layer for multi-class PLD recognition. The hyper-parameters used to tune the models for recognizing PLDs are shown in Table 7.
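The 70-20-10 partition described above could be produced as in the following minimal sketch. It assumes a simple seeded random shuffle over image indices; whether the actual split was stratified per class is not stated in the text, and `split_indices` and its parameters are hypothetical names.

```python
import random

def split_indices(n_images, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle image indices and cut them into train, validation, and test parts.

    The test part receives whatever remains after the first two cuts, so the
    three parts always cover every index exactly once.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)      # seeded for reproducibility
    n_train = int(round(n_images * ratios[0]))
    n_val = int(round(n_images * ratios[1]))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

if __name__ == "__main__":
    train, val, test = split_indices(1000)
    print(len(train), len(val), len(test))
```

Because the class sizes are imbalanced, a per-class (stratified) version of the same cut would keep the class proportions equal across the three parts; that variant is a design option, not something the text confirms.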

Hardware Requirements
All the experiments were conducted on a machine with an AMD Ryzen 7 2700X eight-core 3.7 GHz processor, 32 GB of RAM, and an Nvidia GeForce RTX 2060 Super GPU with 8 GB of memory, running Ubuntu 20.04. Keras with the TensorFlow backend was used.

Dataset Collection
In this experiment, 4606 images of eight plants, of size 256 × 256 pixels, are used for training, and 1947 PLD images are used for validation. Moreover, 658 independent PLD images are used to test twelve classes. Data are collected from different internet sources and a benchmark dataset. Source-wise statistics of our PLD image dataset are shown in Table 8.

Performance Evaluation of Our DSCPLD Frameworks Based on Mean Accuracy and Mean F1-Score Using Segmented Images
To evaluate the performance of our proposed DSCPLD recognition models, we compare them with MobileNetV1, MobileNetV2, MobileNetV3, VGG16, VGG19, and AlexNet based on training, validation, and test accuracy and on F1-score. To do so, we first segment the images using our modified ACS and then feed the images to the models. Because the number of samples is imbalanced across classes, we use performance indicators such as mean class accuracy (mAcc) and mean class F1-score (mF), as given in Equations (17)-(23).
The mean class accuracy (mCAcc), mean class precision (mCP), and mean class recall (mCR) of a model are computed over the classes from the following per-class quantities:

$$\text{Recognition rate of a class} = \frac{\text{True Positive} + \text{True Negative}}{\text{Number of all samples of that class}},$$

$$\text{Precision of a class} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}},$$

$$\text{Recall of a class} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}},$$

$$\text{Mean } F_1 \text{ score } (mF) = \frac{2 \times mP \times mR}{mP + mR},$$

where k represents each of the classes, $N_k$ indicates the number of samples in class k, and N is the total number of samples used to test the model. The comparison among PLD recognition models using segmented images with respect to accuracies and mean F1-score (mF) is shown in Table 9.

Table 9. A concrete representation of accuracies and mean F1-score of various PLD recognition models using segmented images.
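As an illustration of how these indicators can be computed from a confusion matrix, consider the minimal sketch below. The per-class precision and recall follow the definitions above. The aggregation over classes is an assumption: since the text defines $N_k$ and N, the sketch weights each class by $N_k / N$, but the exact mean formulas are not recoverable from this text. The function names are hypothetical.

```python
def per_class_precision_recall(cm):
    """Per-class precision and recall from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(cm)
    precisions, recalls = [], []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(k)) - tp   # predicted c, true class differs
        fn = sum(cm[c]) - tp                        # true c, predicted as another class
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return precisions, recalls

def weighted_mean(values, class_sizes):
    """Mean over classes weighted by N_k / N (assumed aggregation)."""
    total = sum(class_sizes)
    return sum(v * n_k for v, n_k in zip(values, class_sizes)) / total

def mean_f1(cm):
    """mF = 2 * mP * mR / (mP + mR), using the class-weighted means."""
    precisions, recalls = per_class_precision_recall(cm)
    class_sizes = [sum(row) for row in cm]          # N_k for each true class
    m_p = weighted_mean(precisions, class_sizes)
    m_r = weighted_mean(recalls, class_sizes)
    return 2 * m_p * m_r / (m_p + m_r)

if __name__ == "__main__":
    cm = [[8, 2], [1, 9]]  # toy two-class example, not data from the paper
    print(per_class_precision_recall(cm))
    print(mean_f1(cm))
```

Replacing `weighted_mean` with a plain arithmetic mean gives the unweighted (macro) variant, which treats small and large classes equally.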


Performance Evaluation of Our DSCPLD Frameworks Using Segmented Images Based on Model Size and Computational Latency
For further evaluation, we calculate the number of trainable parameters to assess memory requirements, and the floating-point operations (FLOPs) and multiply-accumulate operations (MACC) to assess computational latency. FLOPs measure the complexity of a model in terms of the operations it performs. MACC represents the number of additions and multiplications (dot product computations). FLOPs and MACC are calculated as shown in Reference [38]. The memory requirements and computational complexity of the various models are shown in Table 10.

Selection of the Best DSCPLD Framework Based on All Criteria
Table 9 shows that S-modified MobileNet and the state-of-the-art architecture MobileNetV3 achieve the best mean test accuracy of 99.55% on our PLD dataset. However, MobileNetV3 requires almost 5-10 times as many parameters as our three proposed DSCPLD recognition models, as shown in Table 10. Besides, S-modified MobileNet achieves the best mean F1-score of 97.07%. In terms of model size, FLOPs, and MACCs, as shown in Table 10, the best model is S-reduced MobileNet; however, considering all the factors in Tables 9 and 10, S-modified MobileNet is the best among all the PLD recognition models for mobile and IoT-based PLD recognition.
Confusion matrices, ROC curves, and accuracy and loss curves of our three proposed DSCPLD models are shown in Figures 10a-d, 11a-d, and 12a-d.

Processing Steps Using Our DSCPLD Framework
A processing example of a rice blast leaf image using S-modified MobileNet is shown in Figure 13a-r, with some activations from each of the layers. The presence of symmetrical color in both the infected area and the image background makes this leaf disease recognition quite difficult. The results in Figure 13a-r demonstrate the following:
• effectiveness of our segmentation technique in a complex situation.
• accurate recognition in a natural background.

Performance Evaluation of Our PLD Frameworks Using Segmented Images and Full Leaf Images
Further, we run the DSCPLD models (F-modified MobileNet, F-reduced MobileNet, and F-extended MobileNet) and six state-of-the-art CNN models (VGG16, VGG19, AlexNet, MobileNetV1, MobileNetV2, and MobileNetV3) on full leaf images to evaluate the effectiveness of segmentation. The performance of the DSCPLD models is shown in Table 11. From Table 11, F-modified MobileNet (modified MobileNet using full leaf images) achieves the highest accuracy of 99.10%. The performance comparisons between the segmentation-based DSCPLD models and the DSCPLD models using full leaf images are shown in Tables 12, 13, and 14. The confusion matrix and ROC curve of F-modified MobileNet are shown in Figure 14a

Performance Evaluation of Our PLD Frameworks Using Various Parameters on MobileNetV3
Further, we execute MobileNetV3 on segmented PLD images and investigate the results using width multipliers of 0.25, 0.5, 0.75, and 1.0 with a fixed image size of 224 × 224, as shown in Table 15. Then, we evaluate resolutions of 128, 160, 192, and 224 with a fixed width multiplier of 1.0, as shown in Table 16. From Tables 15 and 16, it is observed that S-modified MobileNet is more effective than the MobileNetV3 variations tested, based on accuracy, computational latency, and model size.

From Table 12, it is shown that S-modified MobileNet improves accuracy by 0.45% and F1-score by 0.44% over F-modified MobileNet. This gain comes from removing extra noise from the leaf images in situations such as obstacles behind the leaf, images with shading, and shrunken leaves overlapped by other plant leaves, as shown in Figure 6a-e.
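The effect of the width multipliers and input resolutions varied in Tables 15 and 16 can be illustrated with the standard cost count for a single depth-wise separable layer. This is a simplified sketch under stated assumptions, not the cost model used for Table 10: it counts multiply-accumulates for one MobileNetV1-style layer only, ignores MobileNetV3-specific components, and uses a hypothetical layer (3 × 3 kernel, 256 input and output channels) whose feature map is assumed to be the input resolution divided by 16.

```python
def standard_maccs(kernel, in_ch, out_ch, fmap):
    """Multiply-accumulates of a standard convolution layer."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap


def separable_maccs(kernel, in_ch, out_ch, fmap, alpha=1.0):
    """Multiply-accumulates of a depth-wise separable layer.

    alpha is the width multiplier; it thins both the input and
    output channels of the layer.
    """
    m = round(alpha * in_ch)
    n = round(alpha * out_ch)
    depthwise = kernel * kernel * m * fmap * fmap  # one filter per channel
    pointwise = m * n * fmap * fmap                # 1 x 1 channel mixing
    return depthwise + pointwise


# Hypothetical layer, chosen only for illustration.
KERNEL, IN_CH, OUT_CH = 3, 256, 256
ASSUMED_STRIDE = 16  # assumed total downsampling before this layer

base = standard_maccs(KERNEL, IN_CH, OUT_CH, 224 // ASSUMED_STRIDE)
print(f"standard convolution at 224: {base:,} MACCs")

# Vary the width multiplier at a fixed 224 x 224 input.
for alpha in (0.25, 0.5, 0.75, 1.0):
    cost = separable_maccs(KERNEL, IN_CH, OUT_CH, 224 // ASSUMED_STRIDE, alpha)
    print(f"alpha={alpha:<4} resolution=224: {cost:,} MACCs")

# Vary the input resolution at a fixed width multiplier of 1.0.
for resolution in (128, 160, 192, 224):
    fmap = resolution // ASSUMED_STRIDE
    cost = separable_maccs(KERNEL, IN_CH, OUT_CH, fmap)
    print(f"alpha=1.0  resolution={resolution}: {cost:,} MACCs")
```

In this count, the cost falls roughly with the square of the width multiplier (the pointwise term dominates) and with the square of the resolution, which is why both settings trade accuracy against latency and model size.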

Evaluation of Generalization for Our DSCPLD Framework
Because noise is removed in the segmentation phase, only the ROI with symptoms is passed to our DSCPLD recognition models. This improves the generalization and sustainability of those PLD recognition models. To evaluate the generalization of our S-modified MobileNet, we test this model on a rice leaf disease dataset (https://github.com/aldrin233/RiceDiseases-DataSet (accessed on 17 February 2021)). We consider only rice blast and rice bacterial leaf blight images for testing our DSCPLD model. The test set contains 160 rice blast leaf images (including 80 rotated images) and 180 rice bacterial leaf blight images (including 90 rotated images). S-modified MobileNet achieves a mean test accuracy of 98.53% for recognizing the two rice disease classes, so the mean accuracy (mAcc) is 1.02% lower than when testing on our dataset, as shown in Table 17. For further evaluation, we also test this dataset using F-modified MobileNet; in this case, mAcc is 3.57% lower than when testing on our dataset with F-modified MobileNet, as shown in Table 18.
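The arithmetic behind these generalization figures can be checked with a short sketch. The helper name `accuracy_drop` is hypothetical, and the external accuracy printed for F-modified MobileNet is derived here from the reported drop; it is not a value quoted in the text.

```python
def accuracy_drop(reference_acc, external_acc):
    """Drop in mean accuracy, in percentage points, rounded to two decimals."""
    return round(reference_acc - external_acc, 2)


# Size of the external rice test set described above.
rice_blast_images = 160        # includes 80 rotated images
bacterial_blight_images = 180  # includes 90 rotated images
total_test_images = rice_blast_images + bacterial_blight_images

# S-modified MobileNet: 99.55% on our dataset, 98.53% on the rice dataset.
s_drop = accuracy_drop(99.55, 98.53)

# F-modified MobileNet: 99.10% on our dataset and a reported drop of 3.57
# points; the implied external accuracy below is derived, not reported.
f_implied_external = round(99.10 - 3.57, 2)

print(f"external test images: {total_test_images}")
print(f"S-modified drop: {s_drop} points")
print(f"F-modified implied external accuracy: {f_implied_external}%")
```

The smaller drop for the segmented model is consistent with the argument that removing background noise before recognition helps on unseen data.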

Comparison among Some Benchmark PLD Recognition Frameworks
Most of the works listed in Table 2 did not investigate the fall in accuracy on an independent dataset, computational complexity, or memory restrictions. In our work, however, we investigate the fall in accuracy when testing on a new set of plant images. When tested on a rice dataset separate from the training dataset, the fall is 1.02% for S-modified MobileNet (Table 17) and 3.57% for F-modified MobileNet (Table 18). Even so, generalization is better than in the works of References [6,7]. With the DSCPLD recognition models, we show that computational latency and memory requirements for mobile and IoT-based PLD recognition can be reduced compared with CNN models, as shown in Table 10. These models are not only mobile-compatible PLD recognition models but also achieve better accuracy than other PLD works, as shown in Tables 9 and 19.
NR = not resolved, R = resolved, PR = partially resolved.

Conclusions
Accurate plant leaf disease recognition is an issue in the agro-industry. The recent use of deep learning methods advances precision agriculture through early and accurate detection of plant diseases. Deep feature extraction and faster, hardware-accelerated processing in deep learning methods make such decisions possible. However, sustainable accuracy, computational latency, and model size are the key factors for recognizing plant leaf diseases on mobile and IoT-based devices.
To gain sustainable accuracy, we introduced a new dataset containing PLD images under complex and natural backgrounds. Furthermore, we added direction- and illumination-based augmentation to the dataset, which increases the scalability of tracing the ROI in various circumstances. In this paper, we introduced a DSCPLD recognition framework in which the modified segmentation technique first finds the optimal K from the PLD images and addresses the limitation of the segmentation-based CNN in Reference [20]. In the segmentation phase, image characteristics under uncontrolled conditions, such as uneven illumination and different orientations, are correctly traced, which makes the models sustainable. However, accuracy falls by 1.02% using S-modified MobileNet and by 3.57% using F-modified MobileNet when testing on new data from another dataset. These methods provide better results than the methods reported in References [6,7] in terms of accuracy. In addition, S-modified MobileNet is very effective for mobile and IoT-based applications because of its smaller number of network parameters and lower computational cost.
In the future, we will extend our proposed model to detect multiple plant leaf diseases in the same image. Further, we will focus on the stages of plant leaf diseases to visualize how the symptoms change over time.
Author Contributions: All authors contributed equally to the conception of the idea, the design of experiments, the analysis and interpretation of results, and the writing and improvement of the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The authors have constructed a novel dataset on plant leaf diseases, which is available on request from the corresponding author.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: