Leaf Image Recognition Based on Bag of Features

Plants are ubiquitous in human life. Recognizing an unknown plant by its leaf image quickly is a very interesting and challenging research. With the development of image processing and pattern recognition, plant recognition based on image processing has become possible. Bag of features (BOF) is one of the most powerful models for classification, which has been used for many projects and studies. Dual-output pulse-coupled neural network (DPCNN) has shown a good ability for texture features in image processing such as image segmentation. In this paper, a method based on BOF and DPCNN (BOF_DP) is proposed for leaf classification. BOF_DP achieved satisfactory results in many leaf image datasets. As it is hard to get a satisfactory effect on the large dataset by a single feature, a method (BOF_SC) improved from bag of contour fragments is used for shape feature extraction. BOF_DP and LDA (linear discriminant analysis) algorithms are, respectively, employed for textual feature extraction and reducing the feature dimensionality. Finally, both features are used for classification by a linear support vector machine (SVM), and the proposed method obtained higher accuracy on several typical leaf datasets than existing methods.


Introduction
The traditional plant classification method is mainly realized by artificial recognition, which has the disadvantages of being time-consuming, susceptible to subjective judgment, and low recognition accuracy, far from meeting the requirements for rapid and accurate plant identification. Therefore, the rapid and accurate identification of plants is very challenging and meaningful. Plant recognition has been a challenging study since early last century, and plants play an irreplaceable role in human life. In the last decades, many researchers have studied image processing and pattern recognition as well as paid extensive attention to plant recognition. They have used images of plant organs (e.g., leaf, flower, fruit, and bark) for plant recognition.
In fact, although the images of flower, fruit, and bark have been employed for plant recognition, they have low recognition rates. In addition, these organ images have some limits; for instance, the flowering period is short and the texture of bark is unstable. Compared with flower, fruit, and bark, leaf images can be collected easily during the year, and its shape and texture are also stable. Therefore, the leaf is used as one of the important features for identifying plants. Most methods for plant recognition based on image processing rely on leaf images. In other words, plant species are recognized by leaf recognition.
In pattern recognition, using shape, texture, and color features for classification has been widely used. Soumyabrata et al. [1] proposed an improved text-based classification method to improve the classification results by integrating color and texture information. In addition, different color neighboring neurons. Once the value of the feedback input ( ) exceeds the active value, the neuron generates a feedback output pulse. Then, the feedback output, feedback input, and external stimulus from the neighboring neurons work together to change the value of the internal activity ( ). Once the value of the neuron's internal activity item exceeds its activity threshold, a compensation output pulse is generated. Finally, the activity threshold ( ) and external excitation ( ) values are updated.  The mathematical expressions of DPCNN model are as follows: where S is the external stimulus, and it changes depending on the current outputs Y F and Y U . V E , V U , V F , f , and g are fixed constants between 0 and 1. W and M are the connection weights which the current neuron communicates with its neighbors. Y F is the feeding out and Y U is the compensating output. Each neuron of DPCNN is an active neuron, which can be ignited by the feedback input or internal activity of the neuron to generate output pulse. First, the feedback input (F ij ) changes due to the influence of external stimuli and external compensation output from neighboring neurons. Once the value of the feedback input (F ij ) exceeds the active value, the neuron generates a feedback output pulse. Then, the feedback output, feedback input, and external stimulus from the neighboring neurons work together to change the value of the internal activity (U ij ). Once the value of the neuron's internal activity item exceeds its activity threshold, a compensation output pulse is generated. Finally, the activity threshold (E ij ) and external excitation (S ij ) values are updated.
The pulse sequence generated by pulse-coupled neural network (PCNN) can represent the image edge and texture information; thus, it can extract effective image features. However, there are still some limitations in feature extraction. For example, there is only one pulse generator in the entire neuron model, and the excitation of neurons lacks a compensation mechanism. DPCNN is improved based on the PCNN model. Compared with PCNN, DPCNN has the following advantages: (1) each neuron of DPCNN has two chances to be excited; (2) DPCNN can adaptively change the size of the external excitation of each neuron; and (3) received local stimuli from peripheral neurons are affected Appl. Sci. 2020, 10, 5177 4 of 17 by the modulation of the input stimulus. In addition, DPCNN also has translation, rotation, scale invariance, and robustness.
When DPCNN is used for feature extraction, the input image must be a gray image and the intensity of a pixel should be between 0 and 1. In our tests, the parameters were the same as in Ref. [28], except the iterations. The output of each iteration is a binary image, which is called pulse image. The entropy of the pulse image is used as a feature, and, after n times-iteration, the feature EnS, which is a vector with lengths of n, is obtained.

Bag of Feature
BOF model represents an image as an orderless collection of local features, and it has been widely used in pattern recognition. After the efforts of many researchers, the BOF model, which is used with spatial pyramid matching (SPM) [29] and locality-constrained linear coding (LLC) [30], has good performance in many studies. The flow chart of the BOF classification model is shown in Figure 2.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 3 of 18 mechanism. DPCNN is improved based on the PCNN model. Compared with PCNN, DPCNN has the following advantages: (1) each neuron of DPCNN has two chances to be excited; (2) DPCNN can adaptively change the size of the external excitation of each neuron; and (3) received local stimuli from peripheral neurons are affected by the modulation of the input stimulus. In addition, DPCNN also has translation, rotation, scale invariance, and robustness. When DPCNN is used for feature extraction, the input image must be a gray image and the intensity of a pixel should be between 0 and 1. In our tests, the parameters were the same as in Ref. [28], except the iterations. The output of each iteration is a binary image, which is called pulse image. The entropy of the pulse image is used as a feature, and, after n times-iteration, the feature EnS, which is a vector with lengths of n, is obtained.

Bag of Feature
BOF model represents an image as an orderless collection of local features, and it has been widely used in pattern recognition. After the efforts of many researchers, the BOF model, which is used with spatial pyramid matching (SPM) [29] and locality-constrained linear coding (LLC) [30], has good performance in many studies. The flow chart of the BOF classification model is shown in Figure 2. LLC is a linear coding scheme with local constraints. Local constraint makes coding results more accurate and acquires spare code, and it improves the speed of training and classification. The mathematical expression of LLC is as follows: SPM is an algorithm of image matching, recognition, and classification using spatial pyramid, and it is a method for obtaining the spatial information of the image by statistically distributing image feature points on images of different resolutions. Generally, it has two steps: (1) Extract features from different scales and combine them together.
(2) Convert features of different lengths into fixed-length features. LLC is a linear coding scheme with local constraints. Local constraint makes coding results more accurate and acquires spare code, and it improves the speed of training and classification. The mathematical expression of LLC is as follows: . , x N } is the feature descriptor set obtained after the origin image blocking; c i is the coding result of x i ; ⊕ denotes the dot product operation; and d i is the Euclidean distance between x i and B. SPM is an algorithm of image matching, recognition, and classification using spatial pyramid, and it is a method for obtaining the spatial information of the image by statistically distributing image feature points on images of different resolutions. Generally, it has two steps: (1) Extract features from different scales and combine them together.
(2) Convert features of different lengths into fixed-length features.
When BOF works, each image of the dataset is divided into many blocks, and the size of the block is always 8 × 8. To get a better effect, neighboring blocks are combined as a patch. The collection of patches can be regarded as a bag of components. Generally, the sizes of the patches of the collection are too big, and many patches are similar. Thus, it is reasonable to unify similar patches into a standard component. In fact, the above operation is calculated in feature space such as SIFT space and HOG space. Patches are expressed by feature extracted from themselves; the collection of standard components counted by K-means is called codebook; and the standard component is called code. For each patch, it is described by its neighborhood (the code in codebook) using a histogram which is the function of LLC. Finally, the histogram is pooling by SPM, and the sparse and smooth feature is obtained.

Dictionary Learning
As traditional method of learning codebook is based on the unsupervised learning method K-means, which does not take advantage of training label. When K-means works to find the center of clustering, it will calculate the distances between a center and all the points; however, there are only a few parts of points which contribute to the calculation of center. Each cluster center is regarded as a visual vocabulary in the dictionary. When the dataset is large, it will cost a large amount of time and computing resource. The cost of clustering is mainly determined by the size of the feature matrix, and normally the size of the feature matrix is large. The features of the training set are employed to reduce the number of points while finding effective centers. For each species, D centers are counted; its typical value is 8. Combine the cluster centers of each class to get a D × n dimensional dictionary, where n is the number of species. If the value of the clustering center is too small, the features cannot be accurately clustered, and the error is large. However, if the value of the clustering center is too large, it will increase the calculation amount and time consumption. Therefore, the cluster center value we choose can reduce the learning cost and improve the learning speed.

Shape Feature Extractions
The shape of leaf is a basic feature of leaf image; when people identify an object, its shape comes to our mind firstly. Similarly, for leaf image recognition, the shape is a simple and fast feature, but the effect of a feature may be not qualified for leaf recognition by itself, as many kinds of leaf images have similar shapes. Thus, in our system, shape feature is a minor feature. As shown in Figure 3, after getting the contour of leaf image, the contour is cut into numerous shape fragments; the middle points of each fragment are shown in Figure 3C.
such as SIFT space and HOG space. Patches are expressed by feature extracted from themselves; the collection of standard components counted by K-means is called codebook; and the standard component is called code. For each patch, it is described by its neighborhood (the code in codebook) using a histogram which is the function of LLC. Finally, the histogram is pooling by SPM, and the sparse and smooth feature is obtained.

Dictionary Learning
As traditional method of learning codebook is based on the unsupervised learning method K-means, which does not take advantage of training label. When K-means works to find the center of clustering, it will calculate the distances between a center and all the points; however, there are only a few parts of points which contribute to the calculation of center. Each cluster center is regarded as a visual vocabulary in the dictionary. When the dataset is large, it will cost a large amount of time and computing resource. The cost of clustering is mainly determined by the size of the feature matrix, and normally the size of the feature matrix is large. The features of the training set are employed to reduce the number of points while finding effective centers. For each species, D centers are counted; its typical value is 8. Combine the cluster centers of each class to get a × dimensional dictionary, where is the number of species. If the value of the clustering center is too small, the features cannot be accurately clustered, and the error is large. However, if the value of the clustering center is too large, it will increase the calculation amount and time consumption. Therefore, the cluster center value we choose can reduce the learning cost and improve the learning speed.

Shape Feature Extractions
The shape of leaf is a basic feature of leaf image; when people identify an object, its shape comes to our mind firstly. Similarly, for leaf image recognition, the shape is a simple and fast feature, but the effect of a feature may be not qualified for leaf recognition by itself, as many kinds of leaf images have similar shapes. Thus, in our system, shape feature is a minor feature. As shown in Figure 3, after getting the contour of leaf image, the contour is cut into numerous shape fragments; the middle points of each fragment are shown in Figure 3C.
Then, shape context is used as a descriptor of each fragment. Finally, the BOF model is used for feature coding and pooling so that we can get a more effective feature. Unlike bag of contour fragments (BCF), in this paper, uniform sampling method and simple fragments are used to improve the speed for feature classification.  Then, shape context is used as a descriptor of each fragment. Finally, the BOF model is used for feature coding and pooling so that we can get a more effective feature. Unlike bag of contour fragments (BCF), in this paper, uniform sampling method and simple fragments are used to improve the speed for feature classification.

Texture Feature Extractions
When DPCNN works, the parameters must be set firstly; the parameters of DPCNN in this paper are from Ref. [28], except the times of iteration. In our method, the image is divided into many blocks with 8 × 8 sizes (assuming that the number of patches is N to each kind of leaf images), each block is regarded as a patch. Then, after the iteration of n times, there will be n entropy images, and the entropy of each patch in every entropy image will be counted in order, as shown in Figure 4. If the entropy of one patch is e i , the entropy vector of the jth patch will be E j = [e 1 , e 2 . . . . . . e n ]. To a species, the features matrix will be E Nn , and eight codes (center of clustering) will be counted based on this matrix too.
BOF model is mainly divided into four stages: preprocessing, acquisition of DPCNN pulse images, low-level feature extraction, and feature coding. The process of obtaining image features of BOF_DPCNN is shown in Figures 4 and 5. Since the datasets we used are processed, the preprocessing stage can be ignored. As shown in Figure 4, the color image is converted to grayscale image firstly, and then the grayscale image is divided into blocks of the same size. For each small block, the DPCNN model is iterated to obtain the pulse entropy images. Finally, the entropy vectors are calculated from the entropy images to obtain low-level features. The BOF model is used to construct the codebook with low-dimensional features, LLC is used to encode, and SPM is used for pooling, as shown as Figure 5.

Proposed Recognition Method
Image recognition has a fixed framework. In general, for plant recognition, object images acquired by special devices (e.g., camera or scanner) are used. In this paper, we select the leaf datasets with clean background for identification. Some key features which can identify the object are extracted from the images through various algorithms. The classifier is employed for classification after feature extraction. Most classifiers need to be trained by samples before classification. Finally, the result is obtained. The proposed method of leaf image recognition also adopts the above framework. The detailed scheme of the proposed method is shown in Figure 6. It can be divided into three steps: leaf image preprocessing, leaf feature extraction, and recognition. These steps are explained in the following. The process of extracting leaf image features by BOF_DPCNN combining DPCNN and BOF model is mainly divided into four stages: preprocessing, acquisition of DPCNN pulse images, low-level feature extraction, and feature coding. The process of obtaining image features of BOF_DPCNN is shown in Figures 4 and 5. Since the datasets we used are processed, the preprocessing stage can be ignored. As shown in Figure 4, the color image is converted to grayscale image firstly, and then the grayscale image is divided into blocks of the same size. For each small block, the DPCNN model is iterated to obtain the pulse entropy images. Finally, the entropy vectors are calculated from the entropy images to obtain low-level features. The BOF model is used to construct the codebook with low-dimensional features, LLC is used to encode, and SPM is used for pooling, as shown as Figure 5.
When DPCNN works, the parameters must be set firstly; the parameters of DPCNN in this paper are from Ref. [28], except the times of iteration. In our method, the image is divided into many blocks with 8×8 sizes (assuming that the number of patches is N to each kind of leaf images), each block is regarded as a patch. Then, after the iteration of n times, there will be n entropy images, and the entropy of each patch in every entropy image will be counted in order, as shown in Figure 4. If the entropy of one patch is ei, the entropy vector of the jth patch will be Ej = [e1, e2…… en]. To a species, the features matrix will be ENn, and eight codes (center of clustering) will be counted based on this matrix too.
The process of extracting leaf image features by BOF_DPCNN combining DPCNN and BOF model is mainly divided into four stages: preprocessing, acquisition of DPCNN pulse images, low-level feature extraction, and feature coding. The process of obtaining image features of BOF_DPCNN is shown in Figures 4 and 5. Since the datasets we used are processed, the preprocessing stage can be ignored. As shown in Figure 4, the color image is converted to grayscale image firstly, and then the grayscale image is divided into blocks of the same size. For each small block, the DPCNN model is iterated to obtain the pulse entropy images. Finally, the entropy vectors are calculated from the entropy images to obtain low-level features. The BOF model is used to construct the codebook with low-dimensional features, LLC is used to encode, and SPM is used for pooling, as shown as Figure 5.

Proposed Recognition Method
Image recognition has a fixed framework. In general, for plant recognition, object images acquired by special devices (e.g., camera or scanner) are used. In this paper, we select the leaf datasets with clean background for identification. Some key features which can identify the object are extracted from the images through various algorithms. The classifier is employed for classification after feature extraction. Most classifiers need to be trained by samples before classification. Finally, the result is obtained. The proposed method of leaf image recognition also adopts the above framework. The detailed scheme of the proposed method is shown in Figure 6. It can be divided into three steps: leaf image preprocessing, leaf feature extraction, and recognition. These steps are explained in the following.

Proposed Recognition Method
Image recognition has a fixed framework. In general, for plant recognition, object images acquired by special devices (e.g., camera or scanner) are used. In this paper, we select the leaf datasets with clean background for identification. Some key features which can identify the object are extracted from the images through various algorithms. The classifier is employed for classification after feature extraction. Most classifiers need to be trained by samples before classification. Finally, the result is obtained. The proposed method of leaf image recognition also adopts the above framework. The detailed scheme of the proposed method is shown in Figure 6. It can be divided into three steps: leaf image preprocessing, leaf feature extraction, and recognition. These steps are explained in the following. Appl. Sci. 2020, 10, x FOR PEER REVIEW 6 of 18 Figure 6. Scheme of the proposed method.

Image Preprocessing
The leaf image is preprocessed to improve image quality. Image preprocessing contains the following steps.
a. Image denoising: If the leaf image has some background information, the background should be deleted, which will decrease the calculations of features extraction. Because most leaf image datasets are built using optical scanners, the background is simple and easy to be removed by an adaptive threshold segmentation method. b. Image segmentation: Sometimes the obtained leaf image has a complex background, and it needs to be separated from the background by segmentation. Since most leaf images contain some regions without value, the target region is extracted by a morphology method. Then, a quadrilateral is used to surround the target region. The quadrilateral is obtained from the original image and rotated to horizontal. c. Image enhancement: Sometimes it is essential to enhance the contrast and texture of the image. Histogram equalization and linear stretching are adopted in this method. Then, high-pass filter is employed to enhance the edge and texture of the leaf image (gray image). Finally, texture feature is extracted from this gray image.

Feature Fusion
The feature extraction is introduced in Section 3. In this section, the two features are fused to a feature vector. Support F and T are the BOF_SC and BOF_DP features, respectively. Firstly, different weights α and β (support to α + β = 1) are assigned to F and T, thus the feature vector can be expressed as FV = [αF, βT]. The larger is the weight, the greater is the role of the feature in the fused feature. Because these weights greatly influence the final recognition result, α and β are usually determined after many experiments. As F and T are sparse matrices, FV is still a sparse matrix; it might be easy for classification, but it requires much memory. Thus, a direct linear discriminant analysis (LDA) algorithm [31] is used for dimensionality reduction. Finally, the final dimensionality of the feature vector is 1000.

Classification
There are many classifiers for leaf recognition, such as support vector machine (SVM) [10], probabilistic neural network (PNN) [5], K nearest neighbor (KNN) [32], and random forests [33]. The most commonly used is SVM for its high accuracy and easy of use. Liblinear [34] and Libsvm [35] are two popular SVM tools for classification. Although Libsvm and Liblinear can achieve similar results in linear classification, Liblinear is much more efficient than Libsvm in

Image Preprocessing
The leaf image is preprocessed to improve image quality. Image preprocessing contains the following steps. a.
Image denoising: If the leaf image has some background information, the background should be deleted, which will decrease the calculations of features extraction. Because most leaf image datasets are built using optical scanners, the background is simple and easy to be removed by an adaptive threshold segmentation method. b.
Image segmentation: Sometimes the obtained leaf image has a complex background, and it needs to be separated from the background by segmentation. Since most leaf images contain some regions without value, the target region is extracted by a morphology method. Then, a quadrilateral is used to surround the target region. The quadrilateral is obtained from the original image and rotated to horizontal. c.
Image enhancement: Sometimes it is essential to enhance the contrast and texture of the image. Histogram equalization and linear stretching are adopted in this method. Then, high-pass filter is employed to enhance the edge and texture of the leaf image (gray image). Finally, texture feature is extracted from this gray image.

Feature Fusion
The feature extraction is introduced in Section 3. In this section, the two features are fused to a feature vector. Support F and T are the BOF_SC and BOF_DP features, respectively. Firstly, different weights α and β (support to α + β = 1) are assigned to F and T, thus the feature vector can be expressed as FV = [αF, βT]. The larger is the weight, the greater is the role of the feature in the fused feature. Because these weights greatly influence the final recognition result, α and β are usually determined after many experiments. As F and T are sparse matrices, FV is still a sparse matrix; it might be easy for classification, but it requires much memory. Thus, a direct linear discriminant analysis (LDA) algorithm [31] is used for dimensionality reduction. Finally, the final dimensionality of the feature vector is 1000.

Classification
There are many classifiers for leaf recognition, such as support vector machine (SVM) [10], probabilistic neural network (PNN) [5], K nearest neighbor (KNN) [32], and random forests [33]. The most commonly used is SVM for its high accuracy and easy of use. Liblinear [34] and Libsvm [35] are two popular SVM tools for classification. Although Libsvm and Liblinear can achieve similar results in linear classification, Liblinear is much more efficient than Libsvm in both training and prediction. When the number of samples is large, Liblinear is significantly faster than Libsvm [36]. Thus, we use Liblinear rather than Libsvm. Given a set of training leaf features Fv i , y i ∈ [1, . . . , N], where N is the number of leaf species, when Liblinear is used for leaf recognition, the problem can be defined as: When Liblinear works, it will learn a multi-class space. r i represents the ith class learned from training data. In Equation (9), the first part is a linear regularization term or linear kernel. c is the weight of linear kernel. For the testing data, the predicted labels are defined by Equation (10).

Datasets
As leaf recognition becomes more and more attractive, many open source leaf datasets can be used for studies, such as Flavia [5], ICL [37], Swedish [38,39], MEW2012 [32], and so on.
Flavia dataset (http://flavia.sourceforge.net/) contains 1907 leaf images of 32 kinds, and it is the most used dataset for leaf recognition. Most leaves of Flavia dataset, as shown in Figure 7, are common plants in the Yangtze Delta, China. To each species, there are at least 50 leaves, which is enough for training and testing. These leaves are single leaves with the petiole removed and without complex background.

Datasets
As leaf recognition becomes more and more attractive, many open source leaf datasets can be used for studies, such as Flavia [5], ICL [37], Swedish [38,39], MEW2012 [32], and so on.

Length of DPCNN
When DPCNN works, the iteration is a significant parameter which would influence the effect of features. In Ref. [28], the iteration is set at 47. Generally, for most all PCNN models, e.g. ICM and SCM, used for feature extraction, iterations are more than 30. To some degree, the iterative process is a process of feature extraction by using a dynamic threshold, which is the most prominent feature of PCNN models.
While the iterative process is also essential for BOF_DP, how much times it costs is the key point of this part. To find the best iteration of DPCNN, the iteration number was changed from 5 to 45 to find a better iteration below 45. In fact, if the iteration were 45 or more, the time for feature extraction would be too long, so the maximum of iteration was set at 45. On the other hand, for an image, the entropy vector is an approximate periodic vector; too many iterations

Length of DPCNN
When DPCNN works, the iteration is a significant parameter which would influence the effect of features. In Ref. [28], the iteration is set at 47. Generally, for most all PCNN models, e.g. ICM and SCM, used for feature extraction, iterations are more than 30. To some degree, the iterative process is a process of feature extraction by using a dynamic threshold, which is the most prominent feature of PCNN models.
While the iterative process is also essential for BOF_DP, how much times it costs is the key point of this part. To find the best iteration of DPCNN, the iteration number was changed from

Length of DPCNN
When DPCNN works, the iteration is a significant parameter which would influence the effect of features. In Ref. [28], the iteration is set at 47. Generally, for most all PCNN models, e.g., ICM and SCM, used for feature extraction, iterations are more than 30. To some degree, the iterative process is a process of feature extraction by using a dynamic threshold, which is the most prominent feature of PCNN models.
While the iterative process is also essential for BOF_DP, how much times it costs is the key point of this part. To find the best iteration of DPCNN, the iteration number was changed from 5 to 45 to find a better iteration below 45. In fact, if the iteration were 45 or more, the time for feature extraction would be too long, so the maximum of iteration was set at 45. On the other hand, for an image, the entropy vector is an approximate periodic vector; too many iterations would not be helpful, and, on the contrary, it would lower the feature's productivity. Flavia dataset was selected for testing, where 30 sample images were selected for training for each species, and the remaining images were tested. The average recognition rates are listed in Figure 11. It is clear that the accuracy reaches its peak after a sharp increase. After the peak, when the iteration is 20, the accuracy shows a noticeable steady fall, and it never presents a rising trend. Hence, the best iteration number is around 20.

Effect and Stability Analysis
The train number of each species (tr_no) was changed from 5 to 30 as shown in Figure 12. To each training set, SIFT with LLC coding was used for comparison, and the training set and testing set were kept the same for each feature. All the accuracies are the average of 10 times.  Training number of each species BOF_DP has the best effect when iteration number is 20 while the traditional DPCNN has the best feature when the iteration number is 47. The iteration process is reduced obviously. To some extent, this may be caused by the method of sub-block processing, when images are divided into smaller pieces. The local feature is more outstanding in each block, but, when the iteration number is oversized, there would be some unnecessary data that can be regarded as noise. Actually, when the iteration number is smaller than 20, the redundancy and noise also exist. Hence, an effective method for feature selection will be helpful for improving the efficiency of the proposed feature.

Effect and Stability Analysis
The train number of each species (tr_no) was changed from 5 to 30 as shown in Figure 12. To each training set, SIFT with LLC coding was used for comparison, and the training set and testing set were kept the same for each feature. All the accuracies are the average of 10 times.

Effect and Stability Analysis
The train number of each species (tr_no) was changed from 5 to 30 as shown in Figure 12. To each training set, SIFT with LLC coding was used for comparison, and the training set and testing set were kept the same for each feature. All the accuracies are the average of 10 times. It is obvious that, for each feature, the accuracy increases with tr_no. However, we are most concerned with the proposed feature BOF_DP showing a better effect than BOF_SIFT. Both BOF_SIFT and BOF_DP are better than BOF_SC, and the combined features of BOF_DP and BOF_SC achieved the highest recognition accuracy in the Flavia dataset.
To show the results of recognition clearly, recognition rates for each species are shown in Figure 13. The training number of each class was 30, and the final recognition rate on Flavia dataset was 98.2049%. Except for Species 11 and 25, the recognition accuracies of other species were ideal. Training number of each species It is obvious that, for each feature, the accuracy increases with tr_no. However, we are most concerned with the proposed feature BOF_DP showing a better effect than BOF_SIFT. Both BOF_SIFT and BOF_DP are better than BOF_SC, and the combined features of BOF_DP and BOF_SC achieved the highest recognition accuracy in the Flavia dataset.
To show the results of recognition clearly, recognition rates for each species are shown in Figure 13. The training number of each class was 30, and the final recognition rate on Flavia dataset was 98.2049%. Except for Species 11 and 25, the recognition accuracies of other species were ideal.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 11 of 18 Figure 13. The recognition accuracy of each species in the Flavia dataset.

Comparison of Features
Some other features were used for comparison, as shown in Table 1. BOF_DP represents the proposed feature, BOW + SIFT represents the features in Ref. [14], BOW + SC is also a proposed method based on SC and BOW in Ref. [14], LLC + SIFT is the original LLC method using SIFT [29], DBCS is a deformation-based representation space for curved shapes, and the authors of [39] proposed an adaptation of k-means clustering for shape analysis in DBCS. 2DPCA [40] is the 2D-based method of principal component analysis (PCA) and uses the bagging classifier with the decision tree as a weak learner. The recognition accuracies of these features are relatively close. 2DPCA has the lowest accuracy among these features. The proposed feature BOF_DP obtains the highest accuracy in the comparison.

Comparison of Features
Some other features were used for comparison, as shown in Table 1. BOF_DP represents the proposed feature, BOW + SIFT represents the features in Ref. [14], BOW + SC is also a proposed method based on SC and BOW in Ref. [14], LLC + SIFT is the original LLC method using SIFT [29], DBCS is a deformation-based representation space for curved shapes, and the authors of [39] proposed an adaptation of k-means clustering for shape analysis in DBCS. 2DPCA [40] is the 2D-based method of principal component analysis (PCA) and uses the bagging classifier with the decision tree as a weak learner. The recognition accuracies of these features are relatively close. 2DPCA has the lowest accuracy among these features. The proposed feature BOF_DP obtains the highest accuracy in the comparison.

Comparison of Different Methods
We also compared the proposed method with other methods on some datasets. To test the effectiveness and extensibility of the proposed feature and system, leaf datasets Flavia, ICL [37], Swedish [38,42], and MEW2012 [32] were used for testing.
As ICL contains so many leaf images, most methods always take a part of ICL dataset for testing. To compare with the MEW method [32], we followed its setting. On each dataset, for each species, half of leaf images were chosen as training sets, and the rest were the testing set. Supposing the number of species is p; if p is an even number, the training leaf images number was p/2; otherwise, the training leaf images number was (p + 1)/2. Finally, the training set and the testing set were roughly equal (in fact, the testing set was larger than the training set). The detailed data of the four datasets are shown in Table 2. In the following, all tests were repeated 10 times to get a convincing result. First, we compared these methods on the Flavia dataset. The comparison results with other methods are shown in Table 3. ZRM [41] is a method based on Zernike moments. Z&H represents the method of Ref. [11], which is based on Zernike moments and histogram of oriented. VGG16 [42] and VGG19 [42] are the pre-trained models based on CNN architecture with logistic regression. MLAB (Margin, lobes, apex and base) [43] is the phenetic features of leaf. MLBP [44] is the method of extracting texture features based on modified local binary patterns. Muammer Turkoglu and Davut Hanbay [45] proposed the improved descriptors based on LBP, called region mean-LBP (RM-LBP), overall mean-LBP (OM-LBP), and ROM-LBP. RIWD (rotation invariant wavelet descriptor) [46] is a new shape proposed by Ehsan Yousefi et al. GIST [47] is an approach for plant recognition using GIST texture features. Wang et al. [48] proposed a few-shot learning method based on the Siamese network framework (S-Inception) to better classify the small sample size (where n is the number of species used in this experiment and the number of trainings is 20 n). Most of these comparison methods do not introduce the number of training and test samples. Among the comparative methods, the deep learning-based method [42,48] does not obtain the best recognition results, but is slightly lower than other machine learning methods [44][45][46]. SSV [17] is a fusion feature composed of 11 shape features, 7 statistical features, and 5 vein features. The recognition result of SSV is slightly higher than our proposed method. As shown in Table 3, the training samples of the experiment are far more than the test sample images. It can be seen from the method Z&H [11] in Table 4 that, when the number of training samples increases and the number of test samples decreases, the recognition rate increases. Further, our method uses more total images than SSV. The total images of SSV were 1600, while our total images were 1907. More than 300 images were removed in SSV. The Flavia dataset we used is original and unfiltered. Therefore, it is understandable that the SSV method obtains a slightly better recognition rate under the very superior experimental conditions. Overall, the proposed method is superior to most of other existing methods.   Table 4 shows the results of different methods on the Swedish dataset. It contains 1125 sample images from 15 species, with 75 images per species. The authors of [49] proposed SMF, which utilizes the area ratio to quantify the convexity/concavity of each contour point at different scales to construct margin feature, and they used a combination of morphological features as shape feature. Yang et al. [50] introduced a novel multiscale Fourier descriptor (MF) based on triangular features, which effectively captures the local and global features of leaf shape. MARCH [51] (multiscale arch height) is a novel multiscale shape description. Wang et al. [52] proposed a hierarchical string cuts (HSCs) method. CSD [53] is a counting-based shape descriptor for leaf recognition, which can capture global and local shape information independently. CBOW is a shape recognition algorithm based on the curvature bag of words (CBOW) model. Generally, the recognition accuracy is improved with the increase of the number of training samples. When the training number of the method Z&H [11] is 750, the recognition result is significantly improved, which is slightly higher than the method we propose. In addition, compared with the other existing methods, the proposed method is superior. S-Inception [48] obtained the lowest recognition accuracy, while MEW [32], MF [50] and CSD [53] were close to the accuracy of the proposed method. The recognition accuracies of the other methods were also very close.
For ICL dataset, some researchers only use part of samples from dataset. Hence, the detailed comparisons are listed in Table 5. GTCLC [55] is a leaf classification method using multiple descriptors. Cem Kalyoncu et al. proposed a new local binary pattern (LBP) descriptor, and they combined it with geometric, shape, texture, and color features for leaf recognition. The authors of [56] used several different descriptors to extract texture and shape features and proposed a pre-training method based on the PID to improve the DBNs. DWSRC (discriminant WSRC) [57] is the method proposed by Zhang et al. for large-scale plant species recognition. The authors of [58] presented the novel relative sub-image sparse coefficient (RSSC) algorithm for mobile devices. DBNs chose 50 species for training and testing and it obtained the highest accuracy with 96%, higher than the proposed method; however, when the number of species in the experiment was 220, the recognition accuracy dropped to 93.90%. When 220 species were selected for training and testing, our proposed method achieved the highest accuracy 94.22%. MEW dataset is also a large dataset. We compared our method with some classic methods, as shown in Table 6. The PCNN proposed by Wang et al. [59], based on pulse-coupled neural network and SVM, is a novel plant recognition method. PCNN and DPCNN have better performance than the others. It is obvious that the method we propose is better than the other methods.

Conclusions
In this paper, we propose a new feature for plant recognition based on leaf image using DPCNN and BOF and propose a method combining BOF_SC and BOF_DP. In the proposed method, features of leaf are adopted, and SVM is taken as the classifier. Firstly, the proposed features BOF_DP were compared with the existing features on the Flavia dataset. After that, four famous leaf datasets were used to validate the performance of the proposed system. Experimental results show that BOF_DP has a better effect than other features, and our method is superior to other methods in recognition accuracy. However, to the DPCNN model, the parameters may not be optimal. In future work, we will try to find the best way to set the parameters automatically and improve the recognition accuracy.
Author Contributions: Writing-original draft preparation, Z.W. and J.C.; writing-review and editing, Z.W. and J.C.; project administration, Y.Z.; resources, J.K. and Y.M. All authors have read and agreed to the published version of the manuscript.