Weed Identification in Maize, Sunflower, and Potatoes with the Aid of Convolutional Neural Networks

Abstract: The increasing public concern about food security and the stricter rules applied worldwide concerning herbicide use in the agri-food chain reduce consumer acceptance of chemical plant protection. Site-Specific Weed Management can be achieved by applying a treatment only on the weed patches. Crop and weed identification is a necessary component for various aspects of precision farming, such as spot herbicide spraying, robotic weeding, and precision mechanical weed control. In recent years, many different methods have been proposed, yet further improvements are needed concerning the speed, robustness, and accuracy of the algorithms and recognition systems. Digital cameras and Artificial Neural Networks (ANNs) have developed rapidly in the past few years, providing new methods and tools also for agriculture and weed management. In the current work, images gathered by an RGB camera of Zea mays, Helianthus annuus, Solanum tuberosum, Alopecurus myosuroides, Amaranthus retroflexus, Avena fatua, Chenopodium album, Lamium purpureum, Matricaria chamomila, Setaria spp., Solanum nigrum, and Stellaria media were used to train Convolutional Neural Networks (CNNs). Three different CNNs, namely VGG16, ResNet-50, and Xception, were adapted and trained on a pool of 93,000 images. The training images contained plant material of only one species per image. A top-1 accuracy between 77% and 98% was obtained in plant detection and weed species discrimination when testing the networks.


Introduction
Public concern about food security has increased in the last decades. Simultaneously, stricter rules have been applied worldwide regarding pesticide usage in the agri-food chain. Both reasons have reduced consumer acceptance of chemical plant protection. Site-Specific Weed Management can be achieved by applying a treatment only on the weed patches. The current application methodology is to spread the herbicide over the whole field [1], which means that a portion of the herbicide is applied to non-target plants, as weeds have a variable and heterogeneous distribution over the field [2,3]. Thus, the current state of application technology usually achieves a low treatment effectiveness while simultaneously introducing an unnecessary negative input into the environment [4]. Yet, reducing the spray rate is not advisable for agronomic reasons, since it can promote the emergence of resistant weed species and lead to a decrease in yield [5].
The more classes a network comprises, the more difficult it is to reach a good classification result. Nevertheless, the twelve-class CNN achieved an average test accuracy of 94.38%. Besides training and building a CNN from scratch, Ge et al. [33] proposed that, when the dataset is limited, it is better to take a network pretrained on a large dataset such as ImageNet and apply transfer learning [34] to reach a better performance and reduce overfitting. dos Santos Ferreira et al. [15] used a replication of AlexNet, pretrained on the ImageNet dataset, for their neural network. Four classes (soil, Glycine max, grass, and broadleaf weeds) were distinguished from drone data, and the network reached an average accuracy of 99.5%. Munz and Reiser [35] used several pretrained networks not only to separate pea from oat but also to estimate their coverage. Olsen et al. [20] trained multiple CNNs: Inception-v3 [36], based on GoogLeNet [37], and ResNet-50 [38], both pretrained on the ImageNet dataset. In addition to comparing different CNNs, they introduced the first large, multi-class weed species image dataset (DeepWeeds), comprising eight invasive weed species and collected entirely under field conditions. The networks were trained for 100 epochs and achieved average classification results of 95.1% (Inception-v3) and 95.7% (ResNet-50), respectively.
Each author either constructed their own network from scratch or modified existing architectures for the respective dataset. In all cases, the authors have demonstrated the potential of Neural Networks also in the agricultural domain. Nevertheless, choosing a suitable network requires careful planning, as it must fit the task at hand [39]. Furthermore, the robustness of the trained network, along with the robustness of training similar networks, has not been examined. In the context of weed and crop classification, supervised training with a prelabeled dataset is widely used to cope with the high variability in plant morphology caused by development stages and environmental influences, which can otherwise lead to poor classification accuracy [39]. Yet, the difficulty of acquiring multiple labeled instances of each plant in different development stages still poses an important academic and practical challenge [20]. The acquired datasets typically have a small number of labels and a large variation between the classes, which enforces the usage of an unbalanced dataset. In the current paper, three different networks, namely VGG16 [40], ResNet-50 [38], and Xception [41], were examined in their capability of identifying twelve different plant species. Our aim was to demonstrate how fast those networks can be trained and how reliable this training is over multiple trainings. Through the proposed methodology, a significant number of labeled images was acquired, which enabled the utilization of a balanced subset of the dataset for training and validation.  Ten repetitions of each network were performed to examine whether the CNN training always converges to similar results in a standardized and systematic way.
Therefore, we investigated whether this balanced dataset can achieve a better result in weed identification and plant classification than previously demonstrated, for an agronomically applicable number of classes.

Experimental Field
Images were gathered on a predefined experimental field at the Heidfeldhof research station of the University of Hohenheim, in southwest Germany (48°42′59.0″ N, 9°11′35.4″ E), in 2019. Twelve plots of 12.5 × 1 m were used, each seeded with the respective plant species. Three crop species were used, namely maize (Zea mays L.), potato (Solanum tuberosum L.), and sunflower (Helianthus annuus L.), along with nine weed species, namely Alopecurus myosuroides Huds., Amaranthus retroflexus L., Avena fatua L., Chenopodium album L., Lamium purpureum L., Matricaria chamomila L., Setaria spp., Solanum nigrum L., and Stellaria media Vill. (Table 1). In the current work, "plant species" refers to both crop and weed data; otherwise, it is explicitly stated as crop or weed plants. Images were gathered every second day from the date of emergence for 45 days, until the plants had reached the 8-leaf stage or the beginning of tillering. Prior to seeding, the soil was cultivated in spring with a Rabe cultivator with a working width of 3 m, and the field was sterilized with a steam treatment to reduce the emergence of unwanted weeds and volunteer plants of the previous crop. Furthermore, the experimental plots were cleaned by hand twice a week from weeds foreign to the intended species.

Image Acquisition
The pictures were captured at noon with a Sony Alpha 7R Mark 4 (ILCE7-RM4, Sony Corporation, Tokyo, Japan), a 61-megapixel RGB DSLR camera. The camera has a 35.7 × 23.8 mm back-illuminated full-frame CMOS sensor, and JPEG images were taken at a resolution of 9504 × 6336 pixels. A shutter speed of 1/2500 s was used, the ISO was calibrated automatically to achieve a good image quality under the changing lighting conditions during the measurements, and the aperture was adjusted on each recording day and set between f/7 and f/11. A Zeiss Batis 25 mm fixed focal length lens was used to achieve a better optical quality compared to a zoom lens. The camera was mounted at a height of 1.2 m on the "Sensicle" [42], a multisensor platform for precision farming experiments. The driving speed was 4 km/h, and a picture was captured every second.

Image Preprocessing
From each plot, images were saved with information on the plot and acquisition date. For each image, a binary image was created (Figure 1), using the Excess Green-Red Index as a thresholding mechanism to separate plant material from the soil [43,44]. Each connected pixel formation resulting from this thresholding procedure constituted a potential region of interest to be fed into the CNNs and was separated and prelabeled, creating the relevant bounding box, based on the following rules:
• Pixel formations of less than 400 pixels were discarded.
• Both the original and the merged bounding boxes were kept for labeling.
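The thresholding and size-filtering steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact index formulation and threshold of [43,44] are assumed to be the normalized Excess Green minus Excess Red index with a cutoff at zero, and the connected-component search is a simple 4-connected flood fill.

```python
from collections import deque
import numpy as np

def plant_mask(rgb):
    """Excess Green minus Excess Red (ExG - ExR), thresholded at 0."""
    rgb = np.asarray(rgb, dtype=float)
    s = rgb.sum(axis=2)
    s[s == 0] = 1.0  # avoid division by zero on black pixels
    r, g, b = (rgb[..., i] / s for i in range(3))
    exg = 2.0 * g - r - b
    exr = 1.4 * r - g
    return (exg - exr) > 0.0

def filter_small_formations(mask, min_pixels=400):
    """Discard 4-connected pixel formations smaller than min_pixels."""
    out = np.zeros_like(mask)
    seen = np.zeros_like(mask)
    h, w = mask.shape
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                comp, queue = [], deque([(y, x)])
                seen[y, x] = True
                while queue:
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                if len(comp) >= min_pixels:
                    for cy, cx in comp:
                        out[cy, cx] = True
    return out
```

The bounding box of each surviving formation would then be prelabeled with the plot's species code.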
The above procedure ensures that all potential inputs provided for classification by our preprocessing method are available for labeling, while reducing soil clusters and other noise interferences as much as possible. Labels with the respective European and Mediterranean Plant Protection Organization (EPPO) code of each plant were put automatically on each bounding box, based on the image information. These labels were examined by a human expert, who discarded possible wrong classifications or unwanted weeds (Figure 1, Table 1). Some example images that were used for the training of the networks can be seen in Figure 2.

Neural Networks
Artificial Neural Networks, and specifically Convolutional Neural Networks (CNNs), are a powerful technique that can achieve successful plant and weed identification. The basic structure of a neural network comprises an input layer, multiple hidden layers, and an output layer. For the current study, CNNs that have demonstrated good or robust results across different disciplines were selected. Specifically, we used VGG16, ResNet-50, and Xception as our base networks, modifying the top layer architecture of each network.

VGG16
Simonyan and Zisserman [40] proposed VGG16 as a further development of AlexNet. VGG16 was one of the best performing networks at the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), providing a 71.3% top-1 accuracy and a 90.1% top-5 accuracy. ImageNet is a labeled dataset of over 14 million images classified into 1000 different classes. VGG16 has been used for its robustness, since it can provide a high performance and respective accuracies even when the image datasets are small [45]. The input of VGG16 is a three-channel RGB image of the fixed size of 224 × 224 pixels. The VGG16 architecture contains a total of 16 layers, comprising 13 convolutional (3 × 3) and three fully-connected layers (Table 2). Rectified linear units (ReLUs), first presented by Krizhevsky et al. [46], act as the activation function for each convolutional layer and for the first two fully-connected layers. VGG16 is one of the best performing networks of the last years, while simultaneously being simpler and less computationally intensive than other networks.

ResNet-50

ResNet-50 (Residual Network 50) was first presented by He et al. [38]. Their architecture continued the trend of increased layer depth. ResNet-50 has a similar architecture to VGG16, centered around a 3 × 3 convolutional layer with a ReLU activation function, but 1 × 1 convolutional layers are placed before and after each 3 × 3 convolutional layer. Furthermore, only one pooling layer is used, batch normalization is implemented, and the final network structure comprises three times more layers than VGG16. It is comparable to the VGG16 network, apart from the fact that ResNet-50 has an additional identity mapping capability [45]. ResNet-50 can be trained much faster than VGG16, since it reduces the vanishing gradient problem by creating an alternative shortcut for the gradient to pass through.
Practically, this means that even though the network is much deeper than VGG16, it can bypass a CNN layer if it is not necessary. The proposed final network comprises 50 layers (Table 2) and reached first place at the ILSVRC 2015, outperforming the previous benchmark set by VGG16. The input of ResNet-50 is also a three-channel RGB image of the fixed size of 224 × 224 pixels. The residual management that ResNet-50 provides makes this algorithm one of the best for training on new datasets.
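The identity shortcut can be illustrated with a toy numerical sketch. This is not the ResNet-50 architecture itself: plain matrix products stand in for the 1 × 1, 3 × 3, 1 × 1 convolution stack with batch normalization, but the shortcut mechanism is the same.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual unit: a transformation branch plus an identity shortcut."""
    # The residual branch F(x); in ResNet-50 this is a bottleneck of
    # 1x1 -> 3x3 -> 1x1 convolutions, approximated here by two matmuls.
    f = relu(x @ w1) @ w2
    # The shortcut `+ x` gives the gradient a direct path around F,
    # mitigating the vanishing gradient problem in deep networks.
    return relu(f + x)
```

If the residual branch contributes nothing (zero weights), the block reduces to the identity mapping, which is why extra depth does not have to hurt an already good mapping.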

Xception
Xception [41] stands for Extreme version of Inception and is, accordingly, an adaptation of Inception, which changed how CNNs are designed. ResNet-50 tried to solve the image classification problem by increasing the depth of the network. The Inception architectures follow a different approach, increasing the width of the network. A generic Inception module calculates multiple different layers over the same input map in parallel, merging their results into the output. Three different convolutional layers and one max pooling layer are activated in parallel, generating a wider CNN compared with the previous networks. Each output is then combined in a single concatenation layer. Therefore, for each module, Inception performs a 5 × 5, a 3 × 3, and a 1 × 1 convolutional transformation, plus an additional max pooling. The concatenation layer of the model then decides whether and how the information of each branch can be used. In Xception, the Inception modules have been replaced with depthwise separable convolutions: the spatial correlations are calculated on each channel independently of the others, and afterwards a 1 × 1 pointwise convolution captures the cross-channel correlations. Xception also has a deep architecture, even deeper than ResNet-50, with a depth of 71 layers (Table 2). The input of Xception differs from the two previous networks, as it is a three-channel RGB image of the fixed size of 299 × 299 pixels, compared to the 224 × 224 pixels used before. The width approach that Xception uses increases its degrees of freedom and can therefore better fit the identification task at hand.
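The efficiency gain of depthwise separable convolutions can be seen from a simple weight count. The sketch below is illustrative arithmetic, not Xception code; the channel counts (128 in, 256 out) are hypothetical values chosen for the example.

```python
def standard_conv_weights(k, c_in, c_out):
    # A standard k x k convolution mixes space and channels in one step.
    return k * k * c_in * c_out

def separable_conv_weights(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel (spatial correlations),
    # then a 1 x 1 pointwise convolution (cross-channel correlations).
    return k * k * c_in + c_in * c_out

standard = standard_conv_weights(3, 128, 256)    # 294,912 weights
separable = separable_conv_weights(3, 128, 256)  # 1,152 + 32,768 = 33,920
```

For this layer shape, the separable factorization needs roughly nine times fewer weights than the standard convolution, which is what lets Xception spend its parameter budget on a wider and deeper architecture.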

Dataset Normalization
All 93,130 images were separated into three distinct datasets for the training, validation, and testing of the networks. To achieve a uniform comparison between network repetitions and network architectures, the separation was done a priori, before any image enhancement or augmentation. For each separate class of labels, 70% of the images were used for the training of the networks, 15% of the dataset was used for the validation performed during each training, and the remaining 15% constituted our testing subset, which was used for the final measurements and demonstration of the achieved results.
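The per-class 70/15/15 separation can be sketched as below. This is a minimal illustration under the assumption that images are referenced by file path and grouped by class label; the fixed seed stands in for the a priori, one-time nature of the split.

```python
import random

def stratified_split(paths_by_class, seed=0):
    """Per-class 70/15/15 split into train/validation/test, fixed once
    before augmentation so every network sees the same subsets."""
    rng = random.Random(seed)
    split = {"train": {}, "val": {}, "test": {}}
    for label, paths in paths_by_class.items():
        shuffled = sorted(paths)
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * 0.70)
        n_val = int(len(shuffled) * 0.15)
        split["train"][label] = shuffled[:n_train]
        split["val"][label] = shuffled[n_train:n_train + n_val]
        split["test"][label] = shuffled[n_train + n_val:]
    return split
```

Because the split is stratified per class, the unbalanced class proportions of the full dataset are preserved in the testing subset.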
In order to perform the training and validation of the networks on a normalized dataset, subsampling was performed. A balanced dataset avoids population bias, since the dataset contains some majority classes with a high number of labeled images and some minority classes with fewer images. The large number of images in our dataset enabled this subsampling, since even the minority classes had more than 1600 training images per class. Specifically, every five epochs, 1300 images per class of the training subset and 400 images per class of the validation subset were randomly chosen from their respective subsets. This resulted in 15,600 images used in each epoch for training and another 4800 for validation. The testing was performed on the complete unbalanced testing subset, since this is a representative fraction of the labels actually identified inside the field.
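The balanced draw renewed every five epochs can be sketched as follows; with twelve classes, 12 × 1300 = 15,600 training and 12 × 400 = 4800 validation images are drawn. This is an illustrative sketch, not the authors' pipeline code.

```python
import random

def balanced_subsample(train_by_class, val_by_class, rng):
    """Draw the balanced pool that is renewed every five epochs:
    1300 training and 400 validation images per class."""
    train = [p for paths in train_by_class.values()
             for p in rng.sample(paths, 1300)]
    val = [p for paths in val_by_class.values()
           for p in rng.sample(paths, 400)]
    return train, val
```

Sampling without replacement within each draw (as `random.sample` does) keeps the pool balanced while still rotating through the majority classes over successive draws.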

Network Training
The network experimentation was performed with Keras 2.4.3 in Python 3.6.8 using the TensorFlow (2.3.0) backend. Transfer learning was used [34]: all the aforementioned networks were used with the pretrained weights from the ImageNet dataset, and the layers of those networks were not trained during our experimentation. For each of the used networks, the pretrained variant was used without the top classification layers for the ImageNet classification. Instead, two additional fully-connected dense layers of 512 neurons each were included (Figure 3). On both of those layers, a ReLU activation function was implemented, while during the training a 50% neuron dropout was used. The networks were trained on a supercomputer cluster using an NVIDIA® Tesla V100 PCIe Tensor Core GPU with 12 GB of GPU memory (Nvidia Corporation, Santa Clara, CA, USA). Instead of the stochastic gradient descent (SGD) algorithm, Adam, an adaptive learning rate algorithm, was implemented for Xception and ResNet-50, with a learning rate of 1 × 10−3 and a decay of 0.01/200. For VGG16, a smaller learning rate of 1 × 10−4 was chosen, but with the same decay. Each network was trained ten times, each training independent of the previous ones. For the training and validation subsets, data augmentation was also performed to avoid overfitting and overcome the highly variable nature of the target classification, accounting for variation in parameters like rotation, scale, illumination, perspective, and color. Specifically, a rotation of up to 120 degrees, a brightness shift of ±20%, a channel shift of ±30%, and a zoom of ±20% were randomly performed, along with possible horizontal and vertical flips. A batch size of 32 images was selected. Each network was trained until the validation accuracy did not improve anymore for 150 consecutive epochs.
This ensured that the networks had converged to a maximum, while even for the majority classes the probability that each training image had been used at least once during these 150 epochs exceeded 99%. The maximum and minimum number of epochs among the ten repetitions of each network can be seen in Table 2. Table 2 also shows further information about the training, such as the mean time per training epoch and the training parameters of each architecture. Since the actual input size for training depends on the batch size, the batch size is given as a variable in Table 2; in the current manuscript, the batch size used for training, validation, and testing was 32.
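The at-least-once probability behind the 99% claim can be checked with a short calculation. Note the assumptions: the balanced pool is redrawn every five epochs, so 150 epochs correspond to roughly 30 independent draws of 1300 images per class, and the class sizes used below (1600 as a minority class, 8000 as a majority class) are hypothetical, since the actual per-class counts are given in Table 1.

```python
def p_seen_at_least_once(class_size, per_draw=1300, draws=30):
    """Probability that a given image of a class with `class_size`
    training images appears in at least one of `draws` independent
    random draws of `per_draw` images."""
    p_missed_per_draw = 1.0 - per_draw / class_size
    return 1.0 - p_missed_per_draw ** draws
```

For a hypothetical majority class of 8000 training images this gives about 0.995, consistent with the stated 99% bound; the larger the class, the lower the probability, so the bound is tightest for the majority classes.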

Evaluation Metrics
In order to evaluate the performance of the respective network classification results, precision, recall, and f1-score were used, as proposed by Sokolova and Lapalme [47]. Precision refers to the conformity of the data labels with the positive labels assigned by the classifier and is calculated by:

Precision = tp / (tp + fp) (1)

where tp represents the true positive values, i.e., the plants belonging to a class that were identified as such, and fp the false positive values, i.e., the plants that do not belong to a class but were identified as such. Recall evaluates the sensitivity of the respective network and was calculated by:

Recall = tp / (tp + fn) (2)

where tp is defined as in Equation (1), and fn represents the false negative values, i.e., the plants that belong to the class but were not identified as such. The f1-score expresses the ratio between precision and recall via their harmonic mean and was calculated by:

f1 = 2 · (Precision · Recall) / (Precision + Recall) (3)

where Precision is defined in Equation (1) and Recall is defined in Equation (2).
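The three metrics translate directly into code; a minimal sketch (zero-division handling omitted for brevity):

```python
def precision(tp, fp):
    # Equation (1): fraction of positive predictions that are correct.
    return tp / (tp + fp)

def recall(tp, fn):
    # Equation (2): fraction of actual class members that were found.
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Equation (3): harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2.0 * p * r / (p + r)
```

For example, a class with 80 true positives, 20 false positives, and 20 false negatives scores 0.8 on all three metrics.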

Results
In the current paper, three different networks were tested to evaluate their performance on a balanced dataset of twelve different plant species. All three networks had their original layers pretrained on the ImageNet dataset, while additional fully-connected layers were included as top layers of the pretrained network. Only these layers were trained to classify and identify our twelve plant classes. Ten repetitions of each network type were trained until the validation accuracy did not improve for 150 consecutive epochs.

Model Accuracy/Model Loss
The mean training and validation loss, along with the training and validation accuracy per network type, can be seen in Figure 4. The training accuracy increases rapidly in the first 100 epochs and then slowly approaches an optimum, depending on the network, at around 300 to 700 epochs for VGG16 (Figure 4a), 350 to 850 epochs for ResNet-50 (Figure 4c), and 400 to 800 epochs for Xception (Figure 4e). The model accuracy did not differ much between the ten repetitions of each network. The largest difference between the maximum and the minimum accuracy occurred in the VGG16 training. All networks achieved similar results of around 81% for VGG16 and more than 97% for ResNet-50 and Xception, both in the training and validation accuracy, which was also similar to the testing accuracy obtained after the finalization of the training (Figure 4, Table 6). Specifically, the top-1 accuracy for the VGG16 network ranged from 81% to 82.7%, for ResNet-50 between 97.2% and 97.7%, while the highest accuracy was achieved with Xception at 97.5% to 97.8%.
A loss function is used to improve and evolve the accuracy of a neural network model. Loss functions map the parameters of the network onto a single scalar value that indicates how well those parameters perform on the task the network is intended to do. The loss value implies how poorly or how well the model behaves after each iteration of optimization. In our case, the model loss is visualized via the sum of the errors made for each example in the training or validation set, respectively. The training and validation loss behaved similarly in all networks, with the validation loss presenting higher fluctuations and differences than the training loss. The validation loss decreases steadily until it reaches its lowest value, depending on the neural network, at a similar number of epochs as the maximum accuracy: around 300 to 700 epochs for VGG16 (Figure 4b), 350 to 850 epochs for ResNet-50 (Figure 4d), and 400 to 800 epochs for Xception (Figure 4f). The highest fluctuations between the minimum and the maximum loss were observed in the Xception training, while the lowest fluctuations were observed in the ResNet-50 training.

Classification Performance
In our dataset, twelve different classes were used to separate the plants into the equivalent plant species. The distribution of how well the three networks identified each plant species can be seen in Tables 3-5. These are the results on the testing part of the dataset: all images that were kept for testing were fed into the finalized networks, the relative metrics were calculated, and the corresponding confusion matrices were created. Since 15% of each plant species was separated and kept as the testing part of the dataset, this result derives from an unbalanced dataset, which represents the availability of our input data. Each row represents the actual plant species of the tested plant, while each column shows which plant species was the most prevalent decision of the Neural Network (top-1 accuracy). In order to make the data comparable and more comprehensible, all data are presented as percentages relative to the number of actual plants per species; therefore, the sum of each row is 100.
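The row normalization used in Tables 3-5 can be sketched in a few lines; this is an illustration of the presentation convention, not the evaluation code itself.

```python
import numpy as np

def confusion_row_percentages(counts):
    """Rescale each row of a raw-count confusion matrix to sum to 100.
    Rows are the actual species, columns the top-1 prediction."""
    counts = np.asarray(counts, dtype=float)
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)
```

Normalizing per row makes the per-species recognition rates directly comparable even though the testing subset is unbalanced.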
The mean values of the ten VGG16 confusion matrices are shown in Table 3a, while their standard deviation is shown in Table 3b. The best identification was achieved for M. chamomila, where 93.36% of the M. chamomila weeds were identified as such, while the worst identification was achieved for C. album, with 55.01% correct identifications. VGG16 performed the worst of the three networks, and this is also visible in the confusion matrix: many plants are misclassified as Setaria spp., S. nigrum, and S. tuberosum, while many S. media and C. album weeds are not identified as such. It should be noted that, especially concerning C. album, whenever the network is unclear, its second choice is S. nigrum. Concerning the three crop species included in the dataset, 82-85% of them were correctly identified as their relevant plant species, while if the misclassifications as other crop plants are added to each crop classification, maize, potato, and sunflower were identified as a crop at 88.0%, 84.6%, and 95.0%, respectively. The ten different networks performed similarly, but in the aforementioned problematic classifications the standard deviation between different networks was highest, ranging between 0.5% and 2.0%.
The mean values of the ten ResNet-50 confusion matrices are shown in Table 4a, while their standard deviation is shown in Table 4b. ResNet-50 achieved an accuracy of around 97%, which is also visible in the relevant confusion matrices. All plant species had more than 90% correct identifications. The best identification was achieved for A. retroflexus, followed by M. chamomila, where 99.66% and 99.54% of the respective weeds were identified as such. S. media, one of the worst performers for the VGG16 network, was the third most correctly identified species with 99.41%. The worst identification was achieved for S. nigrum and C. album, with 90.33% and 91.52% correct identifications, respectively. The misclassification between S. nigrum and C. album also exists in this network, but with a smaller degree of uncertainty compared to VGG16. Concerning the three crop species included in the dataset, 97-99% of them were correctly identified as their relevant plant species, while if the misclassifications as other crop plants are added to each crop classification, maize, potato, and sunflower were identified as a crop at 98.9%, 98.9%, and 99.0%, respectively. The ten different networks performed similarly, but in the aforementioned problematic classifications the standard deviation between different networks was highest, between 0.5% and 0.8%.
The mean values of the ten Xception confusion matrices are shown in Table 5a, while their standard deviation is shown in Table 5b. Xception achieved the best accuracy of around 98%. All plant species had more than 92% correct identifications. The best identification was achieved for M. chamomila, followed by S. media, where 99.74% and 99.61% of the respective weeds were identified as such. The worst identification was achieved for S. nigrum and C. album, with 91.49% and 92.46% correct identifications, respectively. The misclassification between S. nigrum and C. album exists also in this network, but with a smaller degree of uncertainty compared to the other networks. Concerning the three crop species included in the dataset, 97-99% of them were correctly identified as their relevant plant species, while if the misclassifications as other crop plants are added to each crop classification, maize, potato, and sunflower were identified as a crop at 99.2%, 98.9%, and 99.0%, respectively. The ten different networks performed similarly, but in the aforementioned problematic classifications the standard deviation between different networks was highest, between 0.6% and 0.9%, even higher than for the ResNet-50 networks.

Precision/Recall
In all three Neural Networks, the average precision and recall are similar to the identified top-1 accuracy (Table 6). Even though Table 6 only shows the result of the first trained network of each type, there is some fluctuation between the absolute values per species, but the averages are the same or almost the same. VGG16 has an average precision of 0.75 and a recall of 0.79, while ResNet-50 and Xception both have a recall of 0.97, with a precision of 0.96 for ResNet-50 and 0.96 or 0.97 for different repetitions of Xception. It should be noted that, both for the averages and per species, the majority of cases show a higher recall than precision for VGG16 and Xception. In ResNet-50, more instances show a higher precision than recall, resulting in almost 50% of the cases showing higher precision and the rest higher recall. Species like H. annuus, M. chamomila, and S. media show the best results, in many cases achieving a perfect precision or recall score of 1.00. At the other end, C. album achieved the worst result in all neural networks, followed by S. nigrum. Table 6. Precision, recall, and f1-score for the first repetition of each of the three Neural Networks that were tested. These results derive from testing the networks only on the testing proportion of the dataset.

Discussion
Image recognition with the aid of neural networks is a relatively new topic in the domain of plant and weed classification, as most publications appeared after 2016 [25], yet it shows high potential. All networks used in this experiment were able to train on our dataset and achieved significant discrimination results in all repetitions. ResNet-50 and Xception performed better than VGG16, achieving a performance of 97% and 98%, respectively. Recent publications like dos Santos Ferreira et al. [15], Potena et al. [18], Tang et al. [4], Sharpe et al. [39], and Elnemr [19] have also achieved classification results of over 90%. Yet, in the majority of these cases, a low number of classes (two to four) was used, or the datasets were only sufficient to prove the researched hypothesis but not to transfer the results into the complexity of the real world. Potena et al. [18] and Sharpe et al. [39] used only two classes, while Tang et al. [4] and dos Santos Ferreira et al. [15] used four. Such a small number of classes is not sufficient for specific local weed populations and the coverage of each species [2]. They cannot be used for weed management applications like, for example, precision spraying or mechanical weed control [6]. The selection of a limited number of classes for classification is mainly due to the fact that the more classes are considered, the less accurate the result [19]. In our case, we managed to achieve quite a high classification accuracy, exceeding 97% in two of our networks, with twelve different classes, representing three summer crops with some of their representative weeds, both grasses and broadleaved weeds.
For distinguishing between many classes, a large and robust dataset is required, whose acquisition is a time-consuming task [25,30,31,48]. In cases where authors have tried to achieve multi-class weed and plant classification, their classification accuracy dropped below 90% [6,49]. This can be attributed to their limited amount of training data and the associated unbalanced dataset, which can make it hard for the network to generalize [21], or a bias towards the majority class can be created [50,51]. Therefore, the results should always be interpreted under the potential dataset limitations [52], which generally encompass the scale of the dataset, the number of distinguished classes, the distribution between the classes, and potential dataset biases. With the methodology that we used, we managed to acquire a significant amount of plant and weed images, while simultaneously making the labeling of said images easier. Olsen et al. [20] demonstrated one of the most robust datasets, as their dataset was collected at different locations under field conditions, was balanced with 1000 images per plant, and even included a negative class in the training process. In our case, the images were gathered at only one location, with a homogeneous soil type and a specific camera, at a similar time of day, along with the optimum image settings the specific camera software chose. This data uniformity could pose a problem for the robust application of the network, but based on the proposed methodology, more data can be acquired and integrated into the current dataset. Even in our case, the acquisition of images at different dates and growth stages makes the dataset more representative per species, while the images were also acquired under different soil conditions (e.g., wet, normal, and quite dry soil).
The dataset comprises single plants, overlapping plants of the same species, plant sections, leaf fragments, and damaged leaf surfaces, which results in a high variability within the classes, but simultaneously allows generalization to possible new images of the specific species. Even though there are images of various qualities inside the dataset, we did not notice any significant problems or systematic errors with the images used. The task of a Neural Network is to generalize and overcome influences on its data input, concerning both image irregularities and unwanted background objects. Images with different weed species overlapping still need to be examined and included. With our dataset and our methodology, the goal is to further improve the standard for plant and weed recognition set by Olsen et al. [20], as our dataset comprises a total of 93,130 labeled single-plant images of nine weed species and three crops, which is sufficient for choosing the appropriate herbicide treatment.
Our dataset comprised images of the plant species at various development stages, from first emergence until tillering. The capability of both ResNet-50 and Xception to achieve an F1-score of at least 0.89, but typically between 0.95 and 0.98 per plant species, should be noted, since each plant differs in its morphological structure, especially in leaf shape, leaf surface texture, and the total number of leaves. This high variation in the acquired images typically constrains successful identification, particularly in the period between emergence and early development, which is also the most favorable time for successful weed control [3,6,53]. In our training, S. nigrum and C. album performed the worst, showing a high rate of mutual misclassification. This can be attributed to the similar morphological characteristics of these two plants, especially during the 0- and 2-leaf stages, where they can be discriminated only by texture and color. Pérez-Ortiz et al. [1] also pointed to classification problems caused by morphological similarities between different species, especially at the time of emergence. Moreover, overlapping of individual leaves can occur, making it even more difficult to distinguish between individual weed and crop species. In our case, plant overlapping also existed, but only between plants of the same species. For the three crop species included in the dataset, 97-99% of the images were correctly assigned to the relevant plant species, and if we pool the respective crop misclassifications, these numbers rise to 99.2%, 98.9%, and 99.0% for maize, potato, and sunflower, respectively. This encourages the use of these networks for crop-related applications. Due to our high average classification results, especially in the early development stages of Z. mays, S. tuberosum, and H. annuus, where weed interference can significantly reduce the yield [54], weed-specific herbicide applications can be executed.
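The per-class F1-scores and the mutual misclassification discussed above follow directly from a confusion matrix. The sketch below shows the computation; the 3x3 matrix is a toy example mimicking the kind of confusion observed between S. nigrum and C. album, and is not the paper's actual confusion matrix.

```python
# Sketch: per-class precision, recall, and F1 from a confusion matrix.
# The counts below are illustrative only, NOT the paper's results.
labels = ["S. nigrum", "C. album", "Z. mays"]
conf = [  # rows: true class, columns: predicted class
    [85, 12, 3],   # S. nigrum often confused with C. album
    [10, 88, 2],   # and vice versa
    [1, 1, 98],    # the crop is rarely misclassified
]

def f1_scores(conf):
    n = len(conf)
    scores = []
    for i in range(n):
        tp = conf[i][i]
        fp = sum(conf[r][i] for r in range(n)) - tp  # column sum minus TP
        fn = sum(conf[i]) - tp                       # row sum minus TP
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        scores.append(2 * precision * recall / (precision + recall))
    return scores

for lab, f1 in zip(labels, f1_scores(conf)):
    print(f"{lab}: F1 = {f1:.3f}")
```

Pooling crop misclassifications, as done in the text, corresponds to merging the off-diagonal weed columns of the crop rows before computing the per-crop rates.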
VGG16 used the least time per epoch to train but simultaneously had the poorest outcome of the three network architectures. Its simpler architecture makes it a good candidate for online systems, where processing power can be a limiting factor. Unfortunately, at an accuracy of 82%, even though VGG16 can be a viable alternative to methods used until now [55], it still lacks the robustness needed for an online application. Xception had the best performance of all networks. Its greater depth and complexity enabled it to adapt and generalize better than the other two networks [1], although it outperformed ResNet-50 only slightly. This complexity came at a cost, however, making Xception the slowest network during training and validation, and afterward during testing. ResNet-50 achieved results similar to Xception, but due to its architecture and its slightly lower number of layers, it trained and validated much faster than Xception. Its high accuracy, combined with its low computation time, suggests it as the most viable candidate of the three for an online application.
The images of the dataset were acquired near the ground, but all images had to be rescaled to the input dimensions of the Neural Networks (224 or 299 pixels). For small plants this meant enlarging them, while for bigger plants it meant shrinking them. The lower effective resolution used for bigger plants suggests that this method could also be implemented on Unmanned Aerial Vehicles (UAVs). Peña et al. [56], using OBIA, could separate sunflowers from weeds at a later stage with high accuracy (77-91%). Capturing enough pixels per plant for robust recognition is, in the majority of cases, the limiting factor. As technology improves and pixel resolutions increase, this hurdle can be overcome. Pflanz et al. [57] used a UAV at a low flight altitude (1 to 6 m) to achieve good discrimination between Matricaria recutita L., Papaver rhoeas L., Viola arvensis M., and winter wheat. Such a low flight altitude is not practical on commercial farms, but it clearly shows the potential of such a system. As resolutions increase, a flight altitude of 15-20 m could be used for plant classification and would be commercially applicable. In all cases, Neural Networks need a minimum number of pixels per plant to be able to recognize and classify it.
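The pixels-per-plant constraint can be made concrete with a ground sampling distance (GSD) calculation. The sketch below uses hypothetical sensor and plant dimensions, chosen only to illustrate how flight altitude trades off against the resolution available per plant; they do not describe any camera used in the paper.

```python
# Sketch: ground sampling distance (GSD) and resulting pixels per plant
# for a nadir-looking camera. All parameter values are hypothetical.
def gsd_cm_per_px(sensor_width_mm, focal_length_mm, altitude_m, image_width_px):
    """Ground sampling distance in cm per pixel."""
    return (sensor_width_mm * altitude_m * 100) / (focal_length_mm * image_width_px)

def pixels_per_plant(plant_diameter_cm, gsd):
    """Approximate number of pixels a plant spans across the image."""
    return plant_diameter_cm / gsd

gsd_low = gsd_cm_per_px(13.2, 8.8, 5, 5472)    # low-altitude flight (~5 m)
gsd_high = gsd_cm_per_px(13.2, 8.8, 20, 5472)  # commercial altitude (~20 m)
# Quadrupling the altitude quadruples the GSD, so a small seedling
# spans only a quarter as many pixels at 20 m as at 5 m.
```

Because GSD scales linearly with altitude, improving the sensor resolution by the same factor as the altitude increase preserves the pixels available per plant.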
The presented weed identification algorithm can be used in combination with site-specific weed control methods for more precise herbicide applications and mechanical treatments, for example to control a sprayer or a mechanical hoe in real time. Weed classification and monitoring can also enable more sophisticated Decision Support Systems, and such tools can be used by farmers, agronomists, and consultants during weed scouting and vegetation surveys. However, two practical limitations need to be addressed first: a more robust and diverse dataset, and better hardware. For practical use, the data needs to be collected and processed simultaneously over the entire sprayer boom at a certain frame rate. At the same time, a heterogeneous plant stock must be recognized correctly, so a diverse and robust dataset is imperative. As the results show, more complex neural networks increase the classification accuracy, but this is accompanied by an increase in the required computing power. The trade-off between accuracy and speed therefore needs to be explored further, since in our case the increase in accuracy provided by Xception cannot justify its increase in computational time. Similarly to Integrated Pest Management, where the balance between pest control, sustainability, and food security is explored, we need to investigate how accurate the classification must be for practical applications.
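The real-time constraint on the sprayer boom can be estimated from the driving speed and the along-track footprint of each camera. The sketch below derives the required frame rate and the resulting per-image inference budget; all parameter values (speed, footprint, overlap, number of cameras) are hypothetical illustration values, not specifications from the paper.

```python
# Sketch: required frame rate and per-image time budget for cameras
# mounted along a sprayer boom. All parameters are hypothetical.
def required_fps(speed_kmh, footprint_along_track_m, overlap=0.2):
    """Frames per second so that consecutive images overlap by `overlap`."""
    speed_ms = speed_kmh / 3.6
    advance_per_frame = footprint_along_track_m * (1 - overlap)
    return speed_ms / advance_per_frame

def time_budget_ms(fps, cameras_per_boom):
    """Per-image inference budget if one processor serves all cameras."""
    return 1000.0 / (fps * cameras_per_boom)

fps = required_fps(8, 0.5)       # 8 km/h with a 0.5 m footprint
budget = time_budget_ms(fps, 6)  # six cameras sharing one processor
```

Under these assumptions the budget comes out at a few tens of milliseconds per image, which is exactly the regime where the accuracy-versus-speed trade-off between ResNet-50 and Xception discussed above becomes decisive.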

Conclusions
In the current paper, we have presented the results of plant identification using three Convolutional Neural Networks. A methodology for improving image acquisition and dataset generation has been proposed, which makes the acquisition of images easier, along with their labeling and use in Neural Network training and testing. Both ResNet-50 and Xception achieved a high top-1 testing accuracy (>97%), outperforming VGG16, although there were systematic misclassifications between S. nigrum and C. album. More work needs to be done to improve the robustness and usability of the dataset, with more diverse images of the currently classified plants and more plant species. Bigger datasets would enable the testing of even more detailed classification schemes, e.g., per plant species, growth stage, or crop variety. The current work demonstrates a functional approach for porting this knowledge and classification routine to online, in-field weed identification and management.