Performance of Various Deep-Learning Networks in the Seed Classification Problem

Abstract: We report the results of an in-depth study of 15 variants of five different Convolutional Neural Network (CNN) architectures for the classification of seeds of six different grass species that possess symmetry properties. The performance metrics of the nets are investigated in relation to the computational load and the number of parameters. The results indicate that the relation between the accuracy and the operation count or number of parameters is linear within the same family of nets, but that there is no relation between the two when comparing different CNN architectures. Using the default pre-trained weights of the CNNs was found to increase the classification accuracy by ≈ 3% compared with training from scratch. The best performing CNN was found to be DenseNet201, with a 99.42% test accuracy for the highest resolution image set.


Introduction
From the tropical savannas to the arctic taiga, grasses have dispersed and adapted to all ecosystems around the world. In addition to being an important feed source for livestock, different grass species play crucial ecological and economic roles as habitats of beneficial fauna. In land management, dense grass turf is utilized to prevent soil erosion and landslides. Grasses are also used in parks, gardens and yards for recreational and ornamental functions [1].
Grasses account for a large share of the world's plant diversity, yet there is no data set that can be used for the classification of grass seeds. The application of deep learning to the classification of plant seeds is an active research area that requires large image data sets. The current study provides a new data set consisting of 8654 images of seeds of six different grass species, acquired in a laboratory environment while preserving the symmetry properties.
The automated classification of images has been one of the widely studied problems of artificial intelligence research for more than four decades. Early studies employed methods such as support vector machines, decision trees and various neural networks, which are broadly called traditional machine learning methodologies. One of the drawbacks of these methods is the need for the manual extraction of features from the images to be used as the input to the method. Feature extraction is time consuming and requires expertise. Deep learning provides great advances in solving problems in the field of image-based recognition because the computer vision techniques are applied with an artificial neural network that contains many more processing layers than traditional artificial intelligence applications.
One of the advantages of deep learning is that it benefits from raw data without using hand-crafted features. CNNs are more convenient to use than machine learning algorithms with manual feature extraction. CNNs are a subclass of deep learning and have many successful applications in image classification and object recognition. CNN popularity began with AlexNet [2], which used GPUs to accelerate the learning stage and won the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2012 with a 15.3% top-5 error rate. In the following years, the error rate decreased steadily to approximately 2% as new networks were proposed.
The full training of deep CNNs requires a large amount of labeled training data and extensive computational resources. In addition, they might exhibit various overfitting and convergence problems, which makes deep learning from scratch a tedious and expensive undertaking. Transfer learning has been proposed as one possible solution to this problem; it involves fine-tuning deep CNNs pre-trained on a large dataset for the relevant set instead of training from scratch. Tajbakhsh et al. compared the performance of pre-trained CNNs and CNNs trained from scratch in the medical context and showed that pre-training was beneficial in increasing the robustness of the training to the size of the training set and in improving performance [3]. The importance of initialization for the convergence of training in CNNs is discussed in [4].
The automatic recognition of plant seeds is essential for biodiversity preservation as well as for commercial activities [5]. The identification of grass seeds is different from the identification of seeds of cultivated plants. Cultivated plants must exhibit certain characteristics to be approved as a variety. In contrast to cultivated plants, the intra-species variation in weed species is high and the seeds of some closely related species can be very similar. This makes the identification of weed seeds more difficult than the classification of seeds of cultivated plants [6]. The classification of weed seeds from color images instead of black and white images allowed the use of color-dependent features [7], and, in 1996, Chtioui et al. [8] compared the results of artificial neural networks and linear discriminant analysis using the features obtained from the images of four different weed seeds.
Granitto et al. [6,9] classified weed seeds according to color, texture and morphological features by using the Naive Bayesian approach and artificial neural networks. In the local mean-based nonparametric classifier (LMC), which uses the image features of weed seeds obtained from the principal component analysis network (PCANet), 91 different weed seeds were reported as classified with a 64.8% average accuracy [10]. Eight different pepper seeds were classified with 84.94% accuracy using the multilayer perceptron model [11]. A CNN performed better than SVM and KNN in classifying hyperspectral images of four rice species [12].
In the classification of ten types of soybean seeds, the test accuracy of six known CNN models was shown to range from 90.6% to 97.2% [13]. Researchers concluded that CNN models containing a large number of layers produced better results in classifying corn seeds as haploid or diploid [14]. Seven pre-trained CNN models were used for haploid and diploid classification in maize seeds, and VGG19 gave the best accuracy of 94.22% [15]. Gulzar et al. considered an updated version of a deep convolutional neural network (VGG16) for the classification of 14 different seeds from the RGB images of around 200 samples for each seed species in a transfer learning context and reported 99% training and test accuracy [16].
Ramcharan et al. reported encouraging results for using transfer learning to develop solutions for the digital plant disease detection problem for cassava plants [17]. Similarly, Rangarajan and Purushothaman demonstrated that the pre-trained CNN VGG16 could be utilized to detect eggplant diseases [18]. Comparing the performance of VGG16, ResNets with 50, 101 and 152 layers, Inception V4 and DenseNet with 121 layers in the identification of plant diseases from leaf pictures, Too et al. reported a maximum of 99.75% test accuracy for the DenseNet [19]. Recently, Loddo, Loddo and Di Ruberto reported a novel CNN-based deep net called SeedNet [5], which was used to classify two sets of plant seeds with an accuracy of 97.47%.
To the best of the authors' knowledge, there exists no publicly available image data set for grass seeds. The main objective of the present study is three-fold: first, we would like to provide a large set of grass seed images to the public. The second aim of the present study is to investigate the effectiveness of various widely successful convolutional neural networks in the classification of grass species from seed images. The considered CNNs have widely differing numbers of parameters, running times and flops. As the third aim of the study, we investigate the possible relations between the test accuracy of the networks and their size or running time.

Methods
Six species of grass seeds, tall fescue (Festuca arundinacea), sheep fescue (Festuca ovina), red fescue (Festuca rubra), perennial ryegrass (Lolium perenne), annual ryegrass (Lolium multiflorum) and Kentucky bluegrass (Poa pratensis), used in this study were obtained from the General Directorate of Seed Registration of the Ministry of Agriculture and Forestry of Turkey. These species are used most commonly as ornamental plants in gardens, but some, such as Poa pratensis, Lolium perenne and Festuca arundinacea, are also used in pastures and planted as cover crops. Seed images were obtained at a resolution of 100 pixels per millimeter at 1600 × 1200 pixels in daylight using a digital microscope (Celestron). Example images of the seeds are presented in Figure 1. The number of images for each seed species is shown in Table 1.

Deep Learning Models
Deep learning approaches have many processing layers that process images and automatically extract features for classification. There are four main groups of learning algorithms: convolutional neural networks (CNNs), autoencoders, restricted Boltzmann machines and sparse coding techniques [20]. A CNN consists of input, convolution, pooling, fully connected, softmax and output layers, which extract complex features of images, and the network uses them for classification purposes. Many densely connected convolutional neural network architectures have been developed for speech and image recognition tasks in recent years. The use of CNNs is increasing, and promising results have been reported in many studies.
Training CNN-based deep learning models from scratch is a time-consuming process and requires a large data set. In some studies, models trained on a larger data set in the area of the problem are used in classification. This approach is known as transfer learning. Transfer learning can be used if the existing data set has a small number of samples for each class. In transfer learning, the model is usually fine-tuned by training only the fully connected layers. In this study, starting from the weights of CNN models trained on the ImageNet data set, all model layers were trained, since the amount of our data was sufficient for this. When only the last few layers were trained, the accuracy was lower than when all layers of the system were trained.
Transfer learning can either apply a pre-trained model and its weights to a similar problem or fine-tune certain layers of the model on the new dataset. The CNN models used were trained on the ImageNet dataset, and the weights of these models were used in this study. For the present study, we chose the CNN architectures that were successful in ImageNet competitions between the years 2012 and 2018. Here, we present a short overview of those used in the present study:
• VGG [21]: was developed to study the performance implications of increased convolutional network depth. It has up to 144 million parameters for N_c convolutional layers with ReLU activation and 3 × 3 receptive fields, five 2 × 2 max-pooling layers, three fully connected layers with dropout regularization and a softmax activation layer in the output. It has been used in many transfer learning tasks but suffers from a large parameter set, which makes it computationally expensive to train. We investigated two versions of it with N_c = 13 and 16, which are called VGG16 and VGG19, respectively.
• ResNet [22]: was introduced as an exotic network architecture that relies on so-called "network-in-network architectures" and was instrumental in showing that very deep CNNs can be trained by using the SGD algorithm in conjunction with proper initialization and residual modules, owing to improvements in the gradient flow. In the present study, we investigate the ResNet50, ResNet101 and ResNet152 variants, which have 50, 101 and 152 layers.
• DenseNet [23]: passes the features of all previous layers to the current layer, thus reducing the number of parameters required while strengthening feature propagation and feature reuse. We studied two versions of it with 121 and 201 layers.
• EfficientNet [24]: aims to increase both the computational performance and the accuracy by scaling the network depth, width and resolution in a balanced way. We considered EfficientNets 0-5 in the present study.

Datasets and Computational Details
One of the aims of the current study is to investigate the resolution dependence of the test accuracy. Toward that aim, four datasets at 32 × 32, 64 × 64, 128 × 128 and 256 × 256 pixel resolution were created from the acquired images. Furthermore, the images in these sets were converted to gray scale to obtain a total of eight datasets to be investigated. For each set, the data were randomly divided into 70% training-and-validation data and 30% test data, and 30% of the training data was then held out for validation. In-place data augmentation (rotation and zoom) was performed so that the network would see different variations of the data in each epoch of the training process.
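The nested split described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used in the study: the function name `split_indices`, the fixed seed and the rounding of the subset sizes are assumptions made here, and only the 8654-image total and the 70/30 proportions come from the text.

```python
import random


def split_indices(n_images, test_frac=0.30, val_frac=0.30, seed=0):
    """Shuffle image indices and split them into train/validation/test lists.

    test_frac is taken from the whole set; val_frac is taken from what
    remains after the test images are removed (assumed interpretation).
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)

    n_test = round(n_images * test_frac)
    test = indices[:n_test]
    remainder = indices[n_test:]

    n_val = round(len(remainder) * val_frac)
    val = remainder[:n_val]
    train = remainder[n_val:]
    return train, val, test


train, val, test = split_indices(8654)
print(len(train), len(val), len(test))  # 4241 1817 2596
```

Under this reading, roughly 49% of the images are used for weight updates, 21% for validation and 30% for testing.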
In the optimization of the pre-trained CNN models, Stochastic Gradient Descent (SGD) was used with an initial learning rate of 0.01 that was reduced by a factor of 10 every three epochs, a momentum of 0.0 and Nesterov set to True. The SGD optimization algorithm has become dominant due to its trade-off between accuracy and efficiency [26]. Each model was trained for 50 epochs on each data set, and the average accuracy values presented in the study were calculated over the last 30 epochs.
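As a rough sketch of this training protocol, the step-wise learning-rate decay and the averaging over the final epochs can be written as plain Python functions. The names `step_decay_lr` and `mean_over_last_epochs` are hypothetical, and the zero-based epoch counting (first drop at the fourth epoch) is an assumption; the text only fixes the initial rate, the factor of 10 and the three-epoch interval.

```python
def step_decay_lr(epoch, initial_lr=0.01, drop_factor=0.1, epochs_per_drop=3):
    """Learning rate for a zero-based epoch index: divided by 10 every 3 epochs."""
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)


def mean_over_last_epochs(values, last_n=30):
    """Average a per-epoch metric over the final last_n epochs of a run."""
    tail = values[-last_n:]
    return sum(tail) / len(tail)


# Learning rates seen during the first seven epochs of a run.
for epoch in range(7):
    print(epoch, step_decay_lr(epoch))
```

A function of this shape could be passed to a scheduler callback in a deep learning framework, and `mean_over_last_epochs` applied to the per-epoch accuracy history of a 50-epoch run.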

Evaluation Metrics
Various metrics have been used to characterize the performance of neural networks over the years, ranging from the most commonly used classification accuracy to the mean relative error between the predicted and actual values [27]. We used a set of six metrics (training, validation and testing accuracy, precision (P), recall (R) and F-1 score (F)) to evaluate the performance of the nets investigated in the present study. These metrics are defined in terms of the numbers of true positives (TP: class A predicted to be class A), false positives (FP: any class in the set that is not A classified as A), true negatives (TN: any class in the set that is not A classified as not A) and false negatives (FN: class A predicted to be any class other than A in the set) in the testing phase, with the sums taken over all the classes in the problem.
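To make the TP, FP, TN and FN definitions concrete, the sketch below derives the per-class precision, recall and F-1 score from a confusion matrix using the standard textbook formulas P = TP/(TP + FP), R = TP/(TP + FN) and F = 2PR/(P + R). The function names, the row-is-true-class convention and the small 3-class matrix are illustrative assumptions, not data from the study, and the way the study averaged the per-class values is not assumed here.

```python
def per_class_metrics(confusion):
    """Per-class TP/FP/TN/FN, precision, recall and F-1.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j (assumed convention).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    results = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class c predicted as another class
        fp = sum(confusion[r][c] for r in range(n)) - tp   # other classes predicted as c
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"tp": tp, "fp": fp, "tn": tn, "fn": fn,
                        "precision": precision, "recall": recall, "f1": f1})
    return results


def overall_accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


# Made-up 3-class example, for illustration only.
toy = [[8, 2, 0],
       [1, 9, 0],
       [0, 0, 10]]
metrics = per_class_metrics(toy)
print(overall_accuracy(toy))
print(metrics[0])
```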

Training and Validation Accuracy Statistics
The resolution of the images used in the training of nets has two important implications for the classification problem: it determines the amount of available information as well as the number of floating point operations required for the training of the CNN. First, we present the training and validation accuracy statistics of the considered CNNs in Figures 2 and 3, respectively.
Here, all of the nets were trained for 50 epochs by using the pre-trained weights supplied in the Keras framework as the initial parameters, and the statistics were calculated over the last 30 epochs. It can be observed from Figure 2a-d that the training accuracy for all investigated CNNs approached 1 as the image resolution increased from 128 × 128 pixels to 256 × 256 pixels, as expected. Although it is difficult to generalize at lower resolutions (32 and 64), the DenseNet and EfficientNet variants displayed higher training accuracy. It is surprising that both VGG16 and VGG19 had relatively low average training accuracy at 128 × 128 resolution compared to their value at 64 × 64 pixel resolution. Similar statistics presented in Figure 3 for the validation accuracy indicate that the EfficientNets had the highest average validation accuracy, which was almost independent of the image resolution. At the highest resolution, the average validation accuracy of all nets except those of VGGs and ResNet50 approaches 1. Again, both DenseNet and ResNet displayed a larger resolution dependence compared to the EfficientNets and ResNets.

Performance Metrics
We present the performance metrics results for all image sizes for the considered CNNs in Figures 4 and 5 for the raw and the gray scale images, respectively. The results are summarized as star plots whose axes are the F-1 score, recall, precision and the accuracy in the training, validation and test phases. A general observation from these figures is that the CNNs trained on higher resolution images had better performance, as expected [28]. It can be seen from these figures that, except for the MobileNet results for the segmented 32 × 32 pixel images, the training accuracy is in the range [0.95, 0.99] for all of the investigated CNNs.
The plots of the results for both the raw (Figure 4) and the gray scale (Figure 5) images indicate that, as the image resolution increased, all six performance metrics improved considerably for all of the considered CNNs. While the improvement was linear for the segmented images, surprisingly, the metrics of the 32 × 32 pixel images were found to be better, on average, than those of the 64 × 64 pixel images for the gray scale images.
Another interesting finding is that, as the image resolution increased, the overall best performer CNN changed; for instance, while the MobileNet was the best overall performer for the raw image study at 64 × 64 pixel resolution, it became the worst at 256 × 256 pixel resolution for the same set. Furthermore, the same MobileNet architecture had dismal performance for the 32 × 32, 64 × 64 and 128 × 128 pixel segmented images, but its performance was on par with the other considered networks for the 256 × 256 pixel segmented images. At the highest resolution, DenseNet201 was found to perform the best for both image sets.
A confusion matrix displays how a classification model confuses the classes in its predictions. In Figure 6, we display confusion matrices for the worst and the best classifiers for the lowest and the highest resolution images. The relatively low accuracy of the MobileNet framework for the low resolution images appears to stem from its failure to distinguish two of the grass seeds (Fa and Fr) from the other ones. Even the best performer E5 for the 32 × 32 pixels set had difficulty distinguishing Fr seeds from Fa (Figure 6a) and Fo from Lm. At high resolution, MobileNet still had difficulty deciding on the Fa seeds (Figure 6c), while the ResNet101 confusion matrix indicates very few mis-identifications (Figure 6d).
Two of the important considerations in choosing a deep CNN for practical applications are the number of weights that need to be determined in training and the number of floating point operations required for their determination. In Figure 7, we display the flops as a function of the number of parameters for all nets examined in the current study. As stated before, the MobileNet, which was developed to be run on mobile devices, had the lowest number of parameters and the lowest computational requirements.
Due to the differences in the architectures, scaling of the flops with the number of parameters was not uniform; for instance, although EfficientNetB4 and VGG19 had the same number of parameters, one needed an order of magnitude more operations to determine the weights of VGG19.
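One reason the operation count need not track the parameter count is that, for a convolutional layer, the number of weights is independent of the feature-map size, whereas the number of multiply-accumulate operations grows with it. The sketch below illustrates this for a single, simplified layer (stride 1, "same" padding, one bias per output channel); the function name `conv2d_cost` and the 3-to-64-channel example are assumptions chosen for illustration and do not reproduce the flop counts of any network in the study.

```python
def conv2d_cost(height, width, in_channels, out_channels, kernel=3):
    """Parameters and multiply-accumulates of one stride-1, 'same'-padded conv layer."""
    weights = kernel * kernel * in_channels * out_channels
    params = weights + out_channels   # one bias per output channel
    macs = weights * height * width   # every weight is applied at every output pixel
    return params, macs


# Same layer, four input resolutions: parameters stay fixed, operations grow.
for side in (32, 64, 128, 256):
    params, macs = conv2d_cost(side, side, in_channels=3, out_channels=64)
    print(side, params, macs)
```

Doubling the image side quadruples the operation count of such a layer while leaving its 1792 parameters unchanged, which is consistent with architectures of similar size differing widely in computational load.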
It can be seen from the figures that the relation between the flops and the number of parameters was linear for the same CNNs with different numbers of layers (R, D and VGG) or scaling (E). A study on the comparative performance of various well-known CNNs on the rice seed classification problem found that the accuracy depended on the number of parameters of the network [29]. Hoai et al., who studied the rice seed classification problem with standard machine learning algorithms and CNNs, reported ResNet50 and DenseNet121 as the two highest accuracy nets, which is similar to the findings of the current study [30].
One of the important parameters to consider in the evaluation of various CNNs in the classification problem is the computing resources required for the training phase. In Figure 8, we display the dependence of the test accuracy on the number of required operations at all four image resolutions. As expected, the flops increase with increasing image resolution, but the accuracy-resolution dependence is not linear for all the net families. For instance, the accuracy of MobileNet is at its maximum at the 64 pixel resolution, but it decreases as the resolution is increased further.
It is obvious from the figure that, at the same image resolution, the flops of different nets might differ by two orders of magnitude (MobileNet and VGG19 at 32 × 32 pixel resolution). Overall, the accuracy/flops ratio for the EfficientNet appears to be the best, as indicated by the high accuracy of the E1 net already at 128 × 128 pixel resolution (blue filled triangles in Figure 8). It is also interesting to note that VGG19, which has more convolutional layers than VGG16, performed consistently worse than VGG16.
The success of deep nets is thought to stem partly from the large number of layers and the large number of parameters, which makes it natural to ask whether the accuracy of the predictions depends in any way on the number of parameters for the given net. Figure 9a,b displays the accuracy as a function of the number of model parameters for the 128 × 128 and 256 × 256 pixel images, respectively. Although there is an increase in accuracy with increasing parameter numbers in the lowest parameter region for both image sets, further increases have no effect on the prediction accuracy.
Furthermore, for the raw image set, the model with the highest number of parameters was not among the best performing nets, which is contrary to the case observed for the segmented image set. This finding is in line with the results reported in [31,32], which demonstrated that deeper nets might not be better for all convolutional networks, especially in small data settings, and that the optimum network mostly depends on the data. The training accuracies for both resolutions for EfficientNet (E0-E5) showed that increasing the scaling parameter beyond 1 did not increase the accuracy. For the VGG, ResNet and DenseNet deep nets, the accuracy decreased with the increasing number of layers for the highest resolution dataset.
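One simple way to quantify a relation of this kind between a network characteristic and the accuracy is the Pearson correlation coefficient. The sketch below is a minimal implementation under that assumption; the text does not state which statistic was used in the study, and the parameter counts and accuracies in the example are placeholder values invented for illustration, not measurements.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder values (NOT results from the study): accuracy rises at first
# and then saturates while the parameter count keeps growing.
params_millions = [5, 10, 20, 40, 80]
accuracy = [0.90, 0.97, 0.98, 0.98, 0.97]
print(round(pearson_r(params_millions, accuracy), 3))
```

A saturating curve like the placeholder one yields a coefficient well below 1 even though the two quantities are clearly related at small sizes, so a scatter plot is a useful companion to any single correlation value.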

Conclusions
We conducted an in-depth study of the classification problem for six grass seed species by using 15 different convolutional neural network architectures. DenseNet201 was found to be the best CNN with 99.42% test accuracy for the 256 × 256 pixel image set, which is comparable to the accuracy of the best plant-related classification studies. Our aim was to elucidate the main factors that determine the success rate of the CNN. We found that, as expected, the size of the image used in training and testing was the single most important factor that determined the accuracy of the classification.
We found appreciable differences (up to ≈100%) in the performance of the considered nets among the same image size samples. Network characteristics, such as the number of parameters, flops and epoch time, were investigated as determinants of the accuracy. We did not detect any statistically significant correlation between any of those parameters and the accuracy except for the gray scale images. Investigating the same metrics with higher resolution images to better understand the main determinants of the performance of the CNNs would be beneficial not only for the grass seed classification problem but also for the general effectiveness of deep nets along the lines of [33].

Conflicts of Interest:
The authors declare no conflict of interest.