Fused-Deep-Features Based Grape Leaf Disease Diagnosis

Abstract: Rapid and accurate grape leaf disease diagnosis is of great significance to the yield and quality of grape. In this paper, aiming at the identification of grape leaf diseases, a fast and accurate detection method based on fused deep features, extracted from a convolutional neural network (CNN), plus a support vector machine (SVM) is proposed. Based on an open dataset, three state-of-the-art CNN networks, three deep feature fusion methods, seven deep feature layers, and a multi-class SVM classifier were studied. Firstly, images were resized to meet the input requirements of the CNN network; then, the deep features of the input images were extracted via a specific deep feature layer of the CNN network. Two kinds of deep features from different networks were then fused using different fusion methods to increase the effective classification feature information. Finally, a multi-class SVM classifier was trained with the fused deep features. The experimental results on the open dataset show that the fused deep features, with any of the fusion methods, obtain better classification performance than any single type of deep feature. The direct concatenation of the Fc1000 deep features extracted from ResNet50 and ResNet101 achieves the best classification result among the three fusion methods, with an F1 score of 99.81%. Furthermore, the SVM classifier trained using the proposed method achieves a classification performance comparable to that of using a CNN model directly, but its training time is less than 1 s, which is a clear advantage over the tens of minutes needed to train a CNN model. The experimental results indicate that the proposed method can achieve fast and accurate identification of grape leaf diseases and meet the needs of actual agricultural production.


Introduction
Grape is one of the most popular fruits in the world; it contains a variety of vitamins, carotenoids, and polyphenols, which have numerous benefits for human health, such as anti-cancer, antioxidant, and photoprotective effects [1,2]. Italy, France, Spain, the United States, and China are the main producers of grapes. According to survey data of the Food and Agriculture Organization of the United Nations, grape disease is the main reason for the decrease in global grape production. Most grape diseases start from the leaves and then spread to the entire plant. Therefore, a method that could identify grape leaf diseases with high accuracy would help to improve the management of grape production and provide a good growth environment.
Conventional expert diagnosis of grape leaf disease has the disadvantages of high cost and a large risk of error. With the development of computer vision (CV), machine learning (ML), and deep learning (DL), these technologies have been widely applied to crop disease detection [3,4]. Conventional machine vision methods segment crop disease spots using handcrafted features such as color, texture, or shape. However, the symptoms of different diseases are highly similar; as a result, it is difficult to judge the type of disease, and the accuracy of disease recognition is poor, especially in a complex natural environment.
Fusing deep features extracted from different networks can make an SVM classifier learn more features and improve the classification performance. The main contributions of this study are as follows: (1) The deep features extracted by CNN models were adopted to train a support vector machine (SVM) classifier for the classification of grape leaf disease. (2) Three deep feature fusion methods were adopted to fuse deep features extracted from different CNN models to improve the classification performance of the classifier. (3) A comprehensive analysis of the deep feature plus SVM, fused deep features plus SVM, and conventional deep learning methods was carried out.
The rest of the paper is organized as follows. Section 2 describes the studied dataset and the proposed method. Section 3 presents a comprehensive discussion based on the experimental results. Finally, Section 4 concludes the research.

Dataset
The dataset adopted to evaluate the performance in this study is a publicly available grape leaf disease dataset, which can be downloaded at http://www.kaggle.com (1 August 2021). Kaggle hosts a large number of plant disease images. The dataset contains 4062 images (resolution: 256 × 256) covering 4 kinds of grape leaves (black rot, esca (black measles), leaf spot, and healthy). A detailed distribution of the dataset is shown in Table 1, and images of the 4 categories of grape leaves are shown in Figure 1.


Network Architectures and Deep Feature Layers
In this study, the deep features extracted from three state-of-the-art CNN models, i.e., AlexNet [16], GoogLeNet [17], and ResNet [18], were adopted to evaluate the performance of the proposed method. All deep features are extracted from a fully connected layer of a CNN model. Generally, a CNN may contain several different fully connected layers (deep feature layers); e.g., AlexNet has three fully connected layers: fc6, fc7, and fc8. In this research, only some typical deep feature layers were examined, and detailed information on the selected layers is listed in Table 2.
AlexNet was proposed by Alex Krizhevsky et al. [16] and won first place in the ImageNet competition in 2012. The proposal of AlexNet is regarded as the beginning of deep learning. AlexNet, as shown in Figure 2, is a basic, simple, and effective CNN architecture, mainly composed of convolutional layers, pooling layers, rectified linear unit (ReLU) layers, and fully connected layers. The success of AlexNet can be attributed to several practical strategies: (1) using ReLU nonlinearities instead of sigmoid activation functions, which significantly accelerates the training phase; (2) a dropout strategy, which can be considered a regularization that reduces the co-adaptation of neurons by setting input or hidden neurons to zero at random, was adopted to suppress overfitting; (3) the network was trained on multiple GPUs to speed up the training phase. In this research, the deep features of fc6, fc7, and fc8 of AlexNet were examined.
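The ReLU and dropout strategies above can be sketched in a few lines. The following NumPy snippet is illustrative only: it implements the now-common "inverted" dropout variant (with rescaling of the surviving units), not AlexNet's original training code, and the function names are ours.

```python
import numpy as np

def relu(x):
    # ReLU activation: max(0, x), applied element-wise.
    return np.maximum(0.0, x)

def dropout(x, drop_prob=0.5, training=True, rng=None):
    # Inverted dropout: zero each unit with probability drop_prob during
    # training, and rescale the survivors so the expected activation is
    # unchanged. At inference time the input passes through untouched.
    if not training or drop_prob == 0.0:
        return x
    if rng is None:
        rng = np.random.default_rng(0)
    mask = rng.random(x.shape) >= drop_prob
    return x * mask / (1.0 - drop_prob)

activations = relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0]))
dropped = dropout(activations, drop_prob=0.5)
```

Setting roughly half of the fc6/fc7 activations to zero on each forward pass, as sketched here, is what discourages neurons from co-adapting.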


GoogLeNet
GoogLeNet was proposed by Christian Szegedy et al. in 2014. Before that, deep learning networks obtained better performance by increasing the depth (number of layers) of the network. However, as the layers increase, many problems may occur, such as overfitting, vanishing gradients, and exploding gradients. In addition, when designing a network, only one operation, such as convolution or pooling, was used in a layer, and the size of the convolution kernel for the convolution operation was fixed. In practice, however, images of different sizes need convolution kernels of different sizes to produce the best performance, and for the same image, kernels of different sizes behave differently because they have different receptive fields. To address the above problems, GoogLeNet, constructed from Inception modules, was proposed. An Inception module puts multiple convolutions in parallel as a unit to form the network; the model can then choose the optimal convolutional kernels by adjusting the parameters during training. Networks constructed from Inception modules use computing resources more efficiently and can extract more features at the same computational cost. In this research, the loss3-classifier deep feature layer was examined.
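The core idea of an Inception module, running several filter sizes in parallel over the same input and concatenating the resulting feature maps channel-wise, can be illustrated with a toy NumPy sketch. The mean filters below are simplified stand-ins for learned convolutions, and the function names are hypothetical:

```python
import numpy as np

def mean_filter(img, k):
    # "Same"-padded k×k mean filter standing in for one convolution branch.
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=float)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def inception_like(img, kernel_sizes=(1, 3, 5)):
    # Run the parallel branches and stack their outputs channel-wise,
    # mirroring how an Inception module concatenates branch feature maps.
    branches = [mean_filter(img, k) for k in kernel_sizes]
    return np.stack(branches, axis=-1)  # shape: (H, W, n_branches)

img = np.arange(36, dtype=float).reshape(6, 6)
fused = inception_like(img)
```

Because all branches preserve the spatial size, their outputs can be concatenated along the channel axis, so later layers see responses at several receptive-field sizes at once.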

ResNet
The residual network (ResNet) was proposed by He et al. [18]; it solves the degradation problem via the introduction of a residual module. Network degradation refers to the decline of network accuracy as the network deepens. This degradation is not caused by overfitting; if it were, the training accuracy would remain high. For the problem of "accuracy decreases as the network deepens", the residual block provides two paths, i.e., identity mapping and residual mapping, where the identity mapping (usually called the "shortcut connection") and the residual mapping correspond to x and F(x), respectively. The output of a residual block is y = F(x) + x (ignoring the nonlinear activation). In the training phase, when the network has reached its optimum, even if the network deepens, the residual mapping is pushed to 0, leaving only the identity mapping; the network thus stays at the optimum and the performance does not decrease.
As shown in Equation (1), when the dimensions of x and F(x) are different, a linear projection W_s is applied to x so that its dimensions match those of F(x):

y = F(x) + W_s x. (1)
In this article, three widely used ResNet architectures, i.e., ResNet18, ResNet50, and ResNet101, were chosen as deep feature extraction networks. The deep feature extraction layers shown in Table 2 were examined.
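The residual computation described above can be written compactly. The NumPy sketch below is illustrative (the helper names are ours, not from the paper); it shows both the identity shortcut y = F(x) + x and the projected shortcut used when the dimensions of x and F(x) differ:

```python
import numpy as np

def residual_block(x, F, W_s=None):
    # y = F(x) + x, with an optional linear projection W_s applied to the
    # shortcut when the dimensions of x and F(x) differ (Equation (1)).
    fx = F(x)
    shortcut = x if W_s is None else W_s @ x
    return fx + shortcut

x = np.array([1.0, 2.0, 3.0])

# Identity-dimension case: if the residual mapping is pushed to zero,
# the block output is exactly x, so deepening cannot hurt the optimum.
y = residual_block(x, F=lambda v: np.zeros_like(v))

# Dimension-changing case: F maps R^3 -> R^2, so the shortcut needs W_s.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y2 = residual_block(x, F=lambda v: W @ v * 0.1, W_s=W)
```

The first call demonstrates the argument in the text: with F(x) = 0 the block reduces to the identity mapping, so stacking more such blocks leaves the network's function unchanged.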

Fusion of Deep Features by Canonical Correlation Analysis
In this research, the canonical correlation analysis (CCA) [19] algorithm was adopted to fuse two kinds of deep features, extracted by different networks from different deep feature layers, into a single feature vector. The fused feature is more discriminative than either of the input feature vectors. CCA has been widely adopted to analyze associations between two sets of variables.
Suppose that two methods are adopted to extract p- and q-dimensional deep features from each sample, yielding two matrices X ∈ R^(p×n) and Y ∈ R^(q×n), respectively, where n is the number of samples. In total, (p + q)-dimensional features are thus extracted for each sample.
Let S_xx ∈ R^(p×p) and S_yy ∈ R^(q×q) denote the within-set covariance matrices of X and Y, and let S_xy ∈ R^(p×q) denote the between-set covariance matrix between X and Y (with S_yx = S_xy^T). The overall (p + q) × (p + q) covariance matrix S shown below contains all the information on associations between the pairs of deep features:

S = ( S_xx  S_xy
      S_yx  S_yy ). (2)
However, the correlation between these two sets of deep feature vectors may not follow a consistent pattern, and it is therefore difficult to understand the relationship between the two sets of deep features from this matrix [20]. The aim of CCA is to find linear transformations, X* = W_x^T X and Y* = W_y^T Y, that maximize the pair-wise correlation between the two datasets, where

corr(X*, Y*) = W_x^T S_xy W_y, var(X*) = W_x^T S_xx W_x, var(Y*) = W_y^T S_yy W_y.

The covariance between X* and Y* (X*, Y* ∈ R^(d×n) are known as canonical variates) is maximized using the Lagrange multiplier method under the constraint var(X*) = var(Y*) = 1. The linear transformation matrices W_x and W_y can then be obtained by solving the eigenvalue equations below [20]:

S_xx^(−1) S_xy S_yy^(−1) S_yx Ŵ_x = Λ² Ŵ_x,
S_yy^(−1) S_yx S_xx^(−1) S_xy Ŵ_y = Λ² Ŵ_y,

where Ŵ_x and Ŵ_y are the eigenvector matrices, and Λ² is a diagonal matrix of eigenvalues, i.e., the squares of the canonical correlations.
The number of non-zero eigenvalues of each equation is d = rank(S_xy) ≤ min(n, p, q); arranged in descending order, λ_1 ≥ λ_2 ≥ · · · ≥ λ_d. The transformation matrices W_x and W_y are composed of the eigenvectors corresponding to the sorted non-zero eigenvalues. For the transformed data, the sample covariance matrix defined in Equation (2) takes the form:

S* = ( I  Λ
       Λ  I ).

As shown in the above matrix, the upper-left and lower-right identity matrices indicate that the canonical variates are uncorrelated within each dataset, and the canonical variates have non-zero correlation only at their corresponding indices.
As defined in [19], the deep features extracted by different CNN models can be fused via concatenation or summation of the transformed features (canonical variates X* and Y*), as shown in Equations (6) and (7):

Z_1 = ( X* ; Y* ) = ( W_x^T X ; W_y^T Y ), (6)
Z_2 = X* + Y* = W_x^T X + W_y^T Y, (7)
where Z_1 and Z_2 are called canonical correlation discriminant features (CCDFs). In this research, both fusion methods shown in Equations (6) and (7) were adopted to fuse the deep features extracted from different CNN networks. In addition, a third fusion method, the direct concatenation of two kinds of deep features extracted from different CNN networks, was also evaluated in the experiment.
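As an illustration of the CCA fusion described above, the NumPy sketch below computes the canonical variates via the SVD of the whitened cross-covariance matrix, an equivalent formulation of the eigenvalue equations, and then forms both fused features. The function names, the small regularization term, and the toy data are our assumptions, not the paper's implementation:

```python
import numpy as np

def cca_fuse(X, Y, d=None, reg=1e-6):
    # X: (p, n), Y: (q, n) -- two deep-feature matrices for the same n samples.
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sxx = Xc @ Xc.T / (n - 1) + reg * np.eye(X.shape[0])
    Syy = Yc @ Yc.T / (n - 1) + reg * np.eye(Y.shape[0])
    Sxy = Xc @ Yc.T / (n - 1)

    def inv_sqrt(S):
        # Symmetric inverse square root via eigendecomposition.
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular vectors of the whitened cross-covariance give the canonical
    # directions; the singular values are the canonical correlations.
    U, s, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is)
    if d is None:
        d = min(len(s), X.shape[0], Y.shape[0])
    Wx = Sxx_is @ U[:, :d]
    Wy = Syy_is @ Vt.T[:, :d]
    Xs, Ys = Wx.T @ Xc, Wy.T @ Yc          # canonical variates X*, Y*
    Z1 = np.vstack([Xs, Ys])               # Equation (6): concatenation
    Z2 = Xs + Ys                           # Equation (7): summation
    return Z1, Z2, s[:d]

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 100))
Y = 0.8 * X[:3] + 0.2 * rng.standard_normal((3, 100))  # correlated "views"
Z1, Z2, corrs = cca_fuse(X, Y)
```

With two strongly correlated feature sets, the leading canonical correlations come out close to 1, and the fused features Z_1 (dimension 2d) and Z_2 (dimension d) carry the shared discriminative information.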

Proposed Methodology
The processing flow of the proposed method is demonstrated in Figure 3: First, adjust the image size to make it fit the input requirement of the CNN models. The input requirements of the selected CNN models (AlexNet, ResNet, and GoogLeNet) are 227 × 227 × 3, 224 × 224 × 3, and 224 × 224 × 3, respectively.
Second, extract the deep features at specific layers of the CNN model. By inputting the image to the pre-trained CNN model and reading the activation values at the specified layers of the network, the specified deep features can be obtained. The selected CNNs are pre-trained on ImageNet, a well-known large-scale dataset containing more than 14 million images covering more than 20,000 categories. As a result, more effective and meaningful deep features can be extracted by the pre-trained CNN models.
Third, make a fusion of the extracted deep features using one of the following methods, i.e., direct concatenation, canonical correlation analysis (CCA) concatenation, and canonical correlation analysis (CCA) sum.
Finally, feed the fused deep features into a fine-trained SVM classifier; the classifier then outputs the disease type of the input grape leaf. In the training stage, the multiclass error-correcting output codes function "fitcecoc" (MATLAB 2020b) with its default parameters was used to train a multi-class SVM classifier. For K classes, "fitcecoc" trains K(K−1)/2 binary SVM learners with a one-vs-one coding design, which enhances the classification performance of the classifier. Part of the default parameters adopted in the training stage are shown in Table 3.
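The one-vs-one coding design used by "fitcecoc" can be illustrated in Python. In the sketch below, a nearest-centroid rule is a deliberately simple stand-in for each binary SVM learner (the class and method names are hypothetical); it shows how K = 4 classes yield K(K−1)/2 = 6 pairwise learners whose votes are combined:

```python
import numpy as np
from itertools import combinations

class OneVsOneECOC:
    # One-vs-one coding design: K classes -> K(K-1)/2 binary learners, each
    # responsible for one pair of classes; prediction is by majority vote.
    # The binary learner here is a nearest-centroid stand-in for an SVM.
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.pairs_ = list(combinations(self.classes_, 2))
        self.centroids_ = {c: X[y == c].mean(axis=0) for c in self.classes_}
        return self

    def predict(self, X):
        votes = np.zeros((len(X), len(self.classes_)), dtype=int)
        for a, b in self.pairs_:
            # Each pairwise learner votes for the closer class centroid.
            da = np.linalg.norm(X - self.centroids_[a], axis=1)
            db = np.linalg.norm(X - self.centroids_[b], axis=1)
            winner = np.where(da < db, a, b)
            for i, c in enumerate(self.classes_):
                votes[:, i] += (winner == c)
        return self.classes_[votes.argmax(axis=1)]

# Four disease classes -> 4*3/2 = 6 pairwise learners.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0],
              [0.0, 5.0], [0.0, 5.1], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1, 2, 2, 3, 3])
model = OneVsOneECOC().fit(X, y)
```

Each binary problem only ever sees two classes, which keeps the individual learners simple; the final label is the class that wins the most pairwise contests.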

Experiment Setup
A Dell T7920 graphics workstation acted as the experiment platform. The basic configuration of the workstation is: Windows 10 operating system, two Intel Xeon Gold 6248R CPUs, two NVIDIA Quadro RTX 5000 GPUs, 64 GB RAM, and a 1 TB solid-state drive. The software environment is MATLAB 2020b, which supports classical CNN models such as AlexNet, GoogLeNet, and ResNet via the Deep Learning Toolbox. In addition, all the adopted CNN models (AlexNet, GoogLeNet, and ResNet) are pre-trained on ImageNet to ensure that they have powerful feature extraction capability.

The Evaluation Index
In the experiment, four metrics, i.e., accuracy, recall, precision, and F1 score, as shown in Equations (8)-(11), were adopted to evaluate the performance:

Accuracy = (TP + TN) / (TP + TN + FP + FN), (8)
Precision = TP / (TP + FP), (9)
Recall = TP / (TP + FN), (10)
F1 = 2 × Precision × Recall / (Precision + Recall), (11)

where TP, TN, FP, and FN represent the number of true positive, true negative, false positive, and false negative samples, respectively.
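The four metrics can be computed directly from the prediction counts. The sketch below macro-averages precision and recall over the classes (one-vs-rest counting); the paper does not state whether its per-class scores are macro- or micro-averaged, so this is one plausible reading:

```python
import numpy as np

def evaluate(y_true, y_pred, classes):
    # Count TP, FP, FN per class (one-vs-rest), then macro-average the
    # per-class precision and recall, as in Equations (8)-(11).
    precs, recs = [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precs.append(tp / (tp + fp) if tp + fp else 0.0)
        recs.append(tp / (tp + fn) if tp + fn else 0.0)
    precision, recall = np.mean(precs), np.mean(recs)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = np.mean(y_true == y_pred)
    return accuracy, precision, recall, f1

# Toy predictions over the four grape-leaf classes: one class-1 sample
# is misclassified as class 2.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 2, 2, 2, 3, 3])
acc, prec, rec, f1 = evaluate(y_true, y_pred, classes=[0, 1, 2, 3])
```

Note that accuracy is computed over all samples at once, while precision, recall, and F1 are per-class quantities that must then be averaged, which is why the paper's reported precision and recall can differ from its accuracy.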

Performance Analysis Based on Single Type of Deep Feature
The classification results of the fine-trained SVM classifier with a single type of deep feature are shown in Table 4. In order to obtain more reliable experimental data, 10 independent runs for training and validation of each SVM classifier were made, and their mean results were adopted to represent its performance. From Table 4 it can be observed that, for the AlexNet network, the fc6 deep features perform better than the other two deep layers. The Fc1000 layer of ResNet50 obtained the best classification performance among all the examined deep layers, with accuracy, precision, recall, and F1 scores of 99.08%, 99.26%, 99.24%, and 99.25%, respectively. In addition, training the SVM model with the fc6 features of AlexNet takes the most time, 36.35 s on average, while the average training time for all other single-type deep features was about 1 s or less. Because the fc6 layer of AlexNet achieves better performance than fc7 and fc8, only the fc6 deep feature of AlexNet is considered in the following sections.
Table 5 shows the performance on the test set of the SVM classifier trained using the direct concatenation fusion of two deep features extracted from different CNN networks. As in the previous section, 10 independent runs for training and validation of each SVM classifier were made, and their mean results were adopted to represent its performance (the following sections adopt the same experimental method). It can be observed that the combination of ResNet50 (Fc1000) and ResNet101 (Fc1000) obtains the best classification performance, while GoogLeNet (loss3-classifier) and ResNet18 (Fc1000) obtain the worst. Their accuracy, precision, recall, and F1 scores are 99.77%, 99.81%, 99.80%, and 99.81% and 98.90%, 99.12%, 99.12%, and 99.12%, respectively.
Table 6 shows the performance on the test set of the SVM classifier trained using the CCA sum fusion of two deep features extracted from different CNN networks. It can be observed that the combination of ResNet50 (Fc1000) and ResNet101 (Fc1000) obtains the best classification performance, while AlexNet (fc6) and GoogLeNet (loss3-classifier) obtain the worst. Their accuracy, precision, recall, and F1 scores are 99.57%, 99.66%, 99.64%, and 99.65% and 96.55%, 97.29%, 97.15%, and 97.22%, respectively.
Table 7 shows the performance on the test set of the SVM classifier trained using the CCA concatenation fusion of two deep features extracted from different CNN networks. It can be observed that the combination of ResNet50 and ResNet101 obtains the best classification performance, while AlexNet (fc6) and GoogLeNet obtain the worst. Their accuracy, precision, recall, and F1 scores are 99.55%, 99.65%, 99.64%, and 99.64% and 96.50%, 97.21%, 97.13%, and 97.17%, respectively.

Performance Comparison between Three Deep Feature Fusion Methods and a Single Deep Feature
In order to examine the influence of fused features on the classification performance of the SVM classifier, we compared the performance of the three fusion methods with the corresponding single deep features before fusion. Because the calculation of the F1 score integrates precision and recall, as shown in Equation (11), the F1 score was adopted to evaluate the performance differences; the results are shown in Figure 4. Each group in Figure 4 reports the performance of the former deep feature (the first deep feature used in the fusion), the latter deep feature (the second deep feature fused with the former), the CCA sum fusion feature, the CCA concatenation fusion feature, and the direct concatenation fusion feature, respectively. It can be observed that, for fused deep features which contain the fc6 feature, only direct concatenation fusion improves the performance, while the CCA-related fusion methods reduce the F1 score. The last 6 groups show that for deep features extracted using the other CNN models, no matter which fusion method is used, the classification performance is better than that of a single deep feature.
Table 8 shows the performance comparison of the best results obtained via the different methods (single feature, direct concatenation fusion, CCA concatenation fusion, CCA sum fusion). It can be seen from the table that, among the three fusion methods, the direct concatenation of deep features obtains better results than the CCA-based fusion methods. In addition, the SVM classifier trained with any fused deep feature achieves better classification results than with a single deep feature, which indicates that the method proposed in this paper improves the classification performance compared with existing methods based on a single deep feature plus SVM.
The best F1 score is obtained via the direct concatenation of the deep features extracted from ResNet50 (Fc1000) and ResNet101 (Fc1000), which is 0.56% higher than using ResNet50 (Fc1000) alone (F1 score 99.25%, the best performance with a single deep feature). From the view of time consumption, the shortest training time with a single feature is only 0.2596 s, with no additional feature fusion time needed, but for all three feature fusion methods, the total time of feature fusion and SVM classifier training is still within 3 s.
Table 9 lists the training parameters for the examined CNN models. The "sgdm" solver was selected, and the values of "MiniBatchSize", "InitialLearnRate", and "MaxEpochs" were 20, 1 × 10^(−3), and 50, respectively.
Furthermore, because the workstation has two GPUs, the parameter "ExecutionEnvironment" was set to "multi-gpu" to speed up the training. The performance comparison between the feature fusion method proposed in this article and using a CNN network directly is shown in Table 10. On the one hand, from the perspective of classification performance, the proposed method obtains slightly better performance than using any of the CNN networks directly. On the other hand, from the view of training time, in the experimental environment of this paper, it usually takes tens of minutes to train a CNN network, while the fused deep feature + SVM method takes less than 1 s to complete training. Therefore, we believe that the method proposed in this paper has clear advantages compared with using CNN models directly, especially in terms of training time.
Table 11 shows some studies on the diagnosis of plant diseases in recent years. Among them, studies 1 to 5 concern the diagnosis of grape leaf diseases, while studies 6 and 7 concern the diagnosis of other crop leaf diseases. It can be observed that the accuracy achieved by the method proposed in this paper outperforms those studies. Study 1 adopted the same dataset as ours, but it applied a GAN model to preprocess the original dataset and generate sufficient grape leaf disease images with prominent lesions; the Xception network was adopted as the classifier, and an accuracy of 98.7% was obtained on the augmented dataset. In study 2, an attention mechanism module, i.e., a squeeze-and-excitation block, was embedded into the Faster R-CNN model to make it focus on the more effective features, and an accuracy of 99.47% was achieved. In study 3, a UnitedModel was proposed, in which two CNN models are combined in parallel.
The features extracted by each CNN model are concatenated and flow to the fully connected layer and the softmax layer to realize the classification.
In this study, the best classification performance is obtained from the direct fusion of the Fc1000 features of ResNet50 and ResNet101; its accuracy, precision, recall, and F1 scores are 99.77%, 99.81%, 99.81%, and 99.81%, respectively. The performance improvement verifies that the proposed method is effective. Furthermore, compared with using a CNN network directly, the proposed algorithm also achieves better classification performance. In particular, from the perspective of training time, in the experimental environment of this study, it usually takes tens of minutes to train a CNN network, while training the SVM with fused deep features takes less than one second, which is an obvious advantage.

In the future, work will be focused on model deployment. Many studies have implemented their algorithms on smartphones [28-31], which is more convenient for end-users to diagnose diseases in situ. Generally, there are two candidate solutions for implementing the proposed method on smartphones: (1) package the proposed algorithm into a library file and develop an app based on it; (2) deploy the algorithm on a cloud server, with the smartphone responsible for sending images to the cloud server and receiving the diagnosis results. The first scheme allows the application to be used without a network connection but depends on the phone's computational ability, while the latter needs good network bandwidth. In addition, although the proposed method could theoretically be applied to the diagnosis of other plant diseases, its versatility and effectiveness need to be further verified on other datasets in the future.