Classification of Plant Leaf Diseases Based on Improved Convolutional Neural Network

Plant leaf diseases are closely related to people's daily life. Because of the wide variety of diseases, identifying and classifying them by eye is not only time-consuming and labor-intensive but also error-prone. Therefore, we propose a deep learning-based method to identify and classify plant leaf diseases. The proposed method exploits a neural network to extract the characteristics of diseased regions and thus classify target disease areas. To address the long training convergence time and excessive parameter counts of traditional convolutional neural networks, we improved the network by combining an Inception module, a squeeze-and-excitation (SE) module, and a global pooling layer. Through the Inception structure, the feature data of the convolutional layers are fused at multiple scales to improve accuracy on the leaf disease dataset. Finally, a global average pooling layer is used instead of a fully connected layer to reduce the number of model parameters. Compared with several traditional convolutional neural networks, our model yields better performance and achieves an accuracy of 91.7% on the test set, while the number of model parameters and the training time are also greatly reduced. The experimental classification of plant leaf diseases indicates that our method is feasible and effective.


Introduction
With the rapid development of computer technology, traditional machine learning methods have been applied more and more widely in plant disease prediction. With the popularity of machine learning algorithms in computer vision, researchers have studied automated plant disease diagnosis based on traditional machine learning algorithms, such as random forest, k-nearest neighbor, and support vector machine (SVM), in order to improve the accuracy and speed of diagnosis [1][2][3]. Tan et al. established a multi-layer BP neural network model to identify soybean leaf diseases by calculating the chromaticity values of the leaves [4]. By extracting the color and texture characteristics of diseased grape leaves, Tian et al. used an SVM-based recognition method that achieved better results than a neural network [5]. Wang et al. developed a discriminant analysis method to identify cucumber lesions by extracting the color, shape, and texture features of leaf lesions and combining them with environmental information [6]. Zhang et al. also extracted the color, shape, and texture features of lesions after lesion segmentation, and then used a k-nearest neighbor (KNN) classifier to identify five types of corn leaves [7].
However, applying large CNNs to this task still faces two problems: (1) limited by experimental conditions, such as the current platform and hardware, a large CNN requires a long training time and converges slowly; (2) a long training convergence time can cause the final classification accuracy to decrease.
To shorten the training convergence time, reduce the enormous number of parameters of most current network models, and increase recognition accuracy, this paper proposes an integrated method. It adopts the Inception structure to fuse the extracted high-level features, the Squeeze-and-Excitation (SE) module to re-calibrate and weight the features along the CNN channel dimension, and global average pooling in place of the fully connected layer. The experimental results show that our method is effective for the classification and identification of plant leaf diseases. Compared with other traditional convolutional neural networks, our model achieved the highest classification accuracy, 91.7%, on our plant leaf disease dataset.

Data Preprocessing and Augmentation
We collected images of 10 kinds of diseased leaves from a plant leaf disease library (https://challenger.ai/), where digital color cameras were used to capture diseased leaf images with a fixed width of 256 pixels and variable height, as shown in Figure 1. Because some types of leaf diseases are visually confusable and hard to identify, only 10 types of leaf data were selected for our research. The corn images were adopted to verify identification and to show the generalization ability of different CNN structures across types of leaf diseases. The diseased parts of apple and cherry leaves are similar, and the appearance of these diseases at different severity levels is also similar, which makes them practical test cases for our research: compared with other types of leaf diseases, they better reveal how well different CNN structures distinguish disease areas and how the structures compare in leaf classification. First, all leaf disease images were resized to 224 × 224 so that height and width were equal; resizing to 224 × 224 before inputting images adapts them to the different pre-trained CNN structures. Then, because some leaf disease types contain fewer images than others and the collection of leaf disease images is random, the images of these types (Cedar Apple Rust-serious, Cherry Powdery Mildew-general, and Cherry Powdery Mildew-serious) were flipped horizontally and vertically. The leaf disease dataset was thus expanded to balance the classifier while avoiding redundancy and preserving the validity of the image data. After data augmentation, the plant leaf disease dataset contained 6108 images, of which 5588 were used for the training set and 520 for the test set. Table 1 lists the number of images for each disease class.
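The preprocessing steps above (resizing to 224 × 224 and flip-based augmentation) can be sketched as follows. This is a minimal numpy illustration; the paper does not specify the interpolation method, so nearest-neighbor sampling is assumed here for brevity.

```python
import numpy as np

def resize_nearest(img, size=224):
    """Resize an H x W x 3 image to size x size using nearest-neighbor sampling."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row index for each output row
    cols = np.arange(size) * w // size   # source column index for each output column
    return img[rows][:, cols]

def augment(img):
    """Return the image plus its horizontal and vertical flips."""
    return [img, img[:, ::-1], img[::-1, :]]

# A dummy "leaf image" with the dataset's fixed width of 256 and an arbitrary height.
leaf = np.random.randint(0, 256, size=(300, 256, 3), dtype=np.uint8)
resized = resize_nearest(leaf)   # shape (224, 224, 3)
samples = augment(resized)       # original + 2 flipped copies
```

In practice a library resize (e.g., with bilinear interpolation) would be used; the flips triple the sample count for the under-represented classes named above.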

CNN Overall Architecture
Our deep learning-based network consists of VGG16 convolutional layers combined with a Squeeze-and-Excitation (SE) module and an Inception structure. The first five convolutional blocks are based on the VGG16 model and learn low- to high-level features from the training images; deeper convolutional layers further reduce the resolution of the feature maps and extract more abstract high-level features. A max pooling layer is then used to filter noise from the feature maps generated by the preceding convolutional layer. The Inception structure performs feature fusion, broadens the network's capacity to acquire features from the feature maps, and extracts the most discriminative features through multi-scale analysis. The embedded SE module re-calibrates the original features along the channel dimension, and a global average pooling layer replaces the fully connected layer, which reduces the number of training parameters, speeds up model convergence, and thereby improves the classification accuracy of the model. The network structure of the improved model and its related parameters are shown in Figure 2 and Table 2, respectively. The five convolutional blocks use the VGG16 pre-trained model, which determines which layers of the original network are frozen during the pre-training phase and which layers continue learning at a certain learning rate. Usually, the first several layers are frozen because their low-level features adapt well to a wide range of problems.
This work used a stochastic gradient descent (SGD) optimization method to train the model on our own dataset. The initial learning rate was set to 0.001, while momentum and weight decay were set to 0.9 and 0.0005, respectively. A Dropout layer [24] was used in our experiments to prevent over-fitting during training and make the model more effective.

Table 2. Related parameters of the convolutional neural network (CNN)-based model.

[Table 2 body not recovered in extraction; its columns are: Type, Size/Stride, Output Size.]
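The training update described above can be sketched as a single SGD step with momentum and L2 weight decay. This is a minimal numpy illustration of the standard (Caffe-style) update rule, assuming the conventional assignment of momentum = 0.9 and weight decay = 0.0005; it is not the paper's actual training code.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay.

    weight decay adds an L2 penalty gradient; momentum accumulates a velocity term.
    """
    grad = grad + weight_decay * w
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Toy parameters and gradient for one step.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
grad = np.array([0.5, 0.5])
w, v = sgd_momentum_step(w, grad, v)
```

With a zero initial velocity, the first step is simply `w - lr * (grad + weight_decay * w)`; on later steps the accumulated velocity smooths the trajectory.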

GoogLeNet's Inception
The Inception module is the main component of the GoogLeNet network. The Inception structure embeds multi-scale information and gathers features from different receptive fields to improve identification performance. It maintains a sparse structure while increasing the depth and broadening the width of the network, thereby reducing both over-fitting and the number of free parameters. Figure 3 shows that the Inception module uses three different convolution kernels (1 × 1, 3 × 3, and 5 × 5) as well as a 3 × 3 max pooling layer. It extracts features at three different scales to increase feature diversity, covering both macroscopic and microscopic features; the purpose of the pooling branch is to preserve the primitive input information. The module splices the extracted features along the channel dimension, concatenating the convolutional and pooling branches to output a multi-scale feature map.
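The branch-and-concatenate idea above can be sketched in a few lines. This is an illustrative numpy version on a single-channel map, using simple averaging kernels in place of learned convolution weights (an assumption for brevity); the essential point is that all branches keep the same spatial size and are joined along the channel dimension.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2D convolution of an H x W map with a k x k averaging kernel."""
    h, w = x.shape
    p = k // 2
    xp = np.pad(x, p)
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].mean()
    return out

def maxpool3_same(x):
    """'Same'-padded 3 x 3 max pooling with stride 1."""
    h, w = x.shape
    xp = np.pad(x, 1, constant_values=-np.inf)
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + 3, j:j + 3].max()
    return out

def inception_block(x):
    """Run 1x1, 3x3, 5x5 'conv' branches plus a 3x3 max-pool branch, concatenate on channels."""
    branches = [conv2d_same(x, 1), conv2d_same(x, 3), conv2d_same(x, 5), maxpool3_same(x)]
    return np.stack(branches, axis=0)

x = np.random.rand(8, 8)
y = inception_block(x)   # shape (4, 8, 8): four branches, unchanged spatial size
```

In the real module each branch has many learned filters (and 1 × 1 reductions before the larger kernels), but the output is formed the same way: channel-wise concatenation of same-resolution branch outputs.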

Global Average Pooling (GAP)
The fully connected layer has always been a standard component of CNNs. However, the large number of parameters in fully connected layers slows down network training and makes over-fitting likely. The idea of global average pooling (GAP) [25] is to average all pixels of each feature map globally, producing one output per feature map. The vector composed of these output features is sent directly to softmax for classification. Figure 4 compares the fully connected layer with the global average pooling layer.

Squeeze-and-Excitation (SE) Module

Figure 5 is a schematic diagram of the SE module, which omits the preceding series of convolutions of the original SE module. Given an input X, the number of feature channels is C. Unlike a traditional CNN, three operations are taken to recalibrate the previously obtained features.

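The GAP classification head described above can be sketched directly: average each feature map over its spatial dimensions and feed the resulting vector to softmax. This is a minimal numpy illustration; the 10 maps and 7 × 7 spatial size are assumed for the example.

```python
import numpy as np

def global_average_pool(feature_maps):
    """Average each of the C feature maps (C x H x W) over H and W -> length-C vector."""
    return feature_maps.mean(axis=(1, 2))

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

fmaps = np.random.rand(10, 7, 7)              # one feature map per class
probs = softmax(global_average_pool(fmaps))   # class probabilities, no FC weights needed
```

Note that this head has zero trainable parameters, which is exactly why replacing the fully connected layers with GAP shrinks the model so much.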

Figure 5. Squeeze-and-Excitation (SE) module.
The first operation is the Squeeze operation. Suppose the input is X = (x_1, x_2, ..., x_C), x_c ∈ R^{H×W}. Formally, a statistic z ∈ R^C is generated by shrinking X through its spatial dimensions H × W; the c-th element of z is calculated by:

z_c = F_sq(x_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)

Therefore, the Squeeze operation converts an input of H × W × C into an output of 1 × 1 × C, corresponding to the F_sq operation in Figure 5. The result of this step is equivalent to the numerical distribution of the C feature maps of the layer, i.e., global information; the output z_c can be thought of as a description of a set of local descriptors for the entire channel map.

The second operation is the Excitation operation. It employs a simple gating mechanism with a sigmoid activation:

s = F_ex(z, W) = σ(W_2 δ(W_1 z))

where δ refers to the ReLU function, σ is the sigmoid, W_1 ∈ R^{(C/r)×C}, and W_2 ∈ R^{C×(C/r)}. To control the complexity and generalization of the model, this gating mechanism is parameterized by two nonlinear fully connected layers (a dimensionality reduction by ratio r followed by a restoration).

Finally, a Reweight operation regards the output of the Excitation operation as the importance of each feature channel after feature selection, and weights the previous features channel by channel to complete the re-calibration in the channel dimension. The output of the block is obtained by rescaling X with the activations s:

x̃_c = F_scale(x_c, s_c) = s_c · x_c

where F_scale(x_c, s_c) refers to channel-wise multiplication between the scalar s_c and the feature map x_c ∈ R^{H×W}, and X̃ = (x̃_1, x̃_2, ..., x̃_C). The SE module can be embedded in the Inception structure and in standard network architectures such as ResNet; Figure 6 shows the combined structure of the SE module and the Inception module.

Figure 6. The combined structure of the SE module and the Inception structure.
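The three SE operations (Squeeze, Excitation, Reweight) can be sketched end to end in numpy. This is an illustrative implementation with random gating weights; in a trained network W_1 and W_2 are learned.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation on a C x H x W input.

    Squeeze: global average pool each channel -> z in R^C.
    Excitation: two FC layers with ReLU then sigmoid -> channel weights s in (0, 1).
    Reweight: rescale each channel map by its weight.
    """
    z = x.mean(axis=(1, 2))            # squeeze: C x H x W -> C
    s = sigmoid(w2 @ relu(w1 @ z))     # excitation: gating in (0, 1)
    return x * s[:, None, None], s     # reweight: channel-wise multiplication

C, r = 8, 4                            # channels and reduction ratio
rng = np.random.default_rng(0)
x = rng.random((C, 6, 6))
w1 = rng.standard_normal((C // r, C))  # W_1: (C/r) x C
w2 = rng.standard_normal((C, C // r))  # W_2: C x (C/r)
y, s = se_block(x, w1, w2)             # y has the same shape as x
```

The reduction ratio r keeps the two fully connected layers small (2·C²/r parameters), which is how the SE module adds channel attention at low cost.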

Experiments and Results
The experiments were performed on an Ubuntu workstation with an i7-8700k CPU and 32 GB of RAM, accelerated by two NVIDIA GTX 1080 Ti GPUs. All of our experiments were implemented in Caffe, an open-source deep learning framework [26]. The accuracy rate was used to evaluate the performance of the network models. It refers to the proportion of correct positive predictions among all positive predictions and can be expressed as:

Accuracy = N_TP / (N_TP + N_FP)

where N_TP is the number of correct positive predictions and N_FP is the number of wrong positive predictions.
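The metric above is a one-line function. The counts in the example below are hypothetical, chosen only to show that a value near the paper's reported 91.7% arises from a 520-image test set.

```python
def accuracy_rate(n_tp, n_fp):
    """Accuracy as defined in the text: correct positive predictions / all positive predictions."""
    return n_tp / (n_tp + n_fp)

# Hypothetical counts on a 520-image test set: 477 correct, 43 wrong.
acc = accuracy_rate(477, 43)   # close to 0.917
```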

Effects of the Feature Extraction Network
The most important metric we considered is the average accuracy on the test set. Table 3 lists the experimental accuracy, model size, and training time for several commonly used deep learning CNN architectures, together with the results of our method.


The first observation from Table 3 is that different convolution depths lead the trained models to produce different classification results on the test set. In general, more convolutional layers can learn more complex features from the original images. A shallow CNN such as AlexNet achieved an accuracy of 0.894 on the test set, while the deeper networks VGG16, VGG19, ResNet-50, and Inception-v2 yielded accuracies of 0.905, 0.903, 0.901, and 0.903, respectively. Compared to these networks, our network is relatively shallow but achieves higher accuracy on the test set. One possible reason is that a shallow network generalizes relatively well compared to deep ones. Another is the use of the Inception module to broaden the network and combine multi-scale feature information, together with the SE module merged into the Inception module to weight and re-calibrate the feature channels. As a result, our network achieved the highest accuracy, 91.7%, on the test set. Figure 7 shows the accuracy trends of the different CNN models on the test set.


Comparison of Model Size for Different Network Models
From the comparison of model sizes in Table 3, we can make an intuitive observation: the larger the CNN model, the more parameters it has and the longer its training takes. The sizes of the trained AlexNet, VGG16, and VGG19 models were 217 MB, 537.2 MB, and 558.4 MB, respectively. These models are large because their last three layers are all fully connected, which makes the trained models larger than those of other deep learning models. In contrast, GoogLeNet, Inception-v2, and Inception-v3, which use the Inception structure, greatly reduce the model size to 47.1 MB, 45.1 MB, and 87.3 MB, respectively. The size of our model is 57.3 MB, greatly reduced compared with VGG16 and VGG19, because our model uses the Inception structure and global average pooling instead of the last three fully connected layers. This structure avoids the need for a large number of weight parameters, reduces the size of the CNN model, and addresses the problems of large memory occupancy and slow convergence when training a CNN model.
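The size gap above is easy to account for. As a rough illustration (assuming VGG16's standard head dimensions of 512 × 7 × 7 input, two 4096-unit hidden layers, and 10 output classes; the paper's exact Table 2 dimensions are not recoverable here), the three fully connected layers alone contribute on the order of 10^8 parameters, whereas global average pooling contributes none:

```python
def fc_head_params(in_features=512 * 7 * 7, hidden=4096, classes=10):
    """Weights + biases of a VGG-style three-layer fully connected head."""
    sizes = [in_features, hidden, hidden, classes]
    return sum(sizes[i] * sizes[i + 1] + sizes[i + 1] for i in range(3))

fc = fc_head_params()   # roughly 119.6 million parameters
gap = 0                 # global average pooling has no trainable parameters
```

At 4 bytes per float32 parameter, the FC head alone accounts for several hundred megabytes, which is consistent with the reported sizes of VGG16 and VGG19.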

Comparison of Training Time for Different Network Models
A general CNN model linearly converts all extracted feature maps into a 4096-dimensional feature vector after the convolutional and pooling layers, and classifies leaf diseases with a softmax layer. Table 4 shows the forward-propagation and backpropagation times for the different CNN models and the improved model. As can be seen from Table 4, our model performs forward propagation in 0.038 s, which means that testing one image takes 0.038 s. Compared with the other CNN models, our model has an advantage in forward propagation time.

Loss Function and Confusion Matrix of Our Network
From Figure 8a, it can be concluded that our model converges (blue curve) and the final accuracy stabilizes at 91.7% (orange curve), a good classification result. Overall accuracy alone is an unreliable metric for evaluating a classification model, because it can be misleading when the sample numbers of the different classes in the dataset are unevenly distributed; the average accuracy over all categories is a more accurate indicator of the model's performance on the test set. In other words, the poor results on hard-to-classify categories can be masked by the easily classified ones. The confusion matrix shows how accurate a classification model is for each category. From the confusion matrix in Figure 8b, we can conclude that for some hard-to-classify plant leaf diseases, the single-category classification accuracy on the test set is low, because the diseased region in each leaf is small and the numbers of samples at different disease grades differ, making these categories difficult for the model to classify and identify. For instance, leaves with "Cherry Powdery Mildew-general" and "Cherry Powdery Mildew-serious" are difficult to separate, because most regions of these leaves are very similar. The confusion matrix of the last experiment showed that the recognition accuracy for corn diseases is 100%, without interference from other types of leaf diseases. Figure 9 visualizes the features extracted after different layers of our network. Visualizing the network model helps us understand the classification model intuitively: an ideal CNN feature map should be sparse and contain typical local information. Through visualization we can see what features each CNN layer learns, which can be used to adjust network parameters and improve the accuracy of the model.
As a result, visualizing the various convolutional layers provides a better understanding of how the CNN learns the characteristics of the input image. We found that the features learned by the CNN are hierarchical: the higher the layer, the more specific the presented features, and the more the high-level feature maps contribute to correctly classifying the images. Specifically, the shallow feature maps (Figure 9(1) or (2)) respond to the color information of corners and other edges; the middle-layer feature maps (Figure 9(3), (4), (5), or (6)) show more complex invariance and capture similar textures; and the deep layers (Figure 9(7) or (8)) present edge corners and abstract color features. The high-level feature maps show the salient pose of the entire image after extraction of the high-level abstract features.
Figure 9. Feature maps of different layers (caption partially recovered; cf. Table 2): (2) conv3_1, (3) conv5_1, (4) inception_1 × 1, (5) inception_3 × 3, (6) inception_5 × 5, (7) inception_pool, (8) pool7.


Conclusions
This paper proposed an improved convolutional neural network structure for the identification and classification of a large dataset of different plant leaf diseases. Based on the traditional five-block convolutional model of VGG16, the final fully connected layers of VGG16 were replaced with Inception and SE modules, which improves the classification accuracy of the model on the plant leaf disease dataset. Moreover, the global pooling layer shortens the training time, reduces the parameter memory requirement, and improves the generalization ability of the model. As a result, our method achieved the highest classification accuracy, 91.7%, on the plant leaf disease test set. Compared with other CNN methods, it adapts better to changes in image spatial position and shows better robustness in identifying diseases across various plant leaves, rather than being limited to different diseases of the same plant.