Optimized Deep Learning Model as a Basis for Fast UAV Mapping of Weed Species in Winter Wheat Crops

Abstract: Weed maps should be available quickly, reliably, and with high detail to be useful for site-specific management in crop protection and to promote more sustainable agriculture by reducing pesticide use. Here, the optimization of a deep residual convolutional neural network (ResNet-18) for the classification of weed and crop plants in UAV imagery is proposed. The target was to reach sufficient performance on an embedded system while maintaining the same features of the ResNet-18 model as a basis for fast UAV mapping. This would enable online recognition and subsequent mapping of weeds during UAV flying operation. Optimization was achieved mainly by avoiding redundant computations that arise when a classification model is applied on overlapping tiles in a larger input image. The model was trained and tested with imagery obtained from a UAV flight campaign at low altitude over a winter wheat field, and classification was performed at species level with the weed species Matricaria chamomilla L., Papaver rhoeas L., Veronica hederifolia L., and Viola arvensis ssp. arvensis observed in that field. The ResNet-18 model with the optimized image-level prediction pipeline reached a performance of 2.2 frames per second with an NVIDIA Jetson AGX Xavier on the full-resolution UAV image, which would amount to an area output of about 1.78 ha h−1 for continuous field mapping. The overall accuracy for determining crop, soil, and weed species was 94%. There were some limitations in the detection of species unknown to the model. When shifting from 16-bit to 32-bit model precision, no improvement in classification accuracy was observed, but there was a strong decline in speed performance, especially when a higher number of filters was used in the ResNet-18 model. Future work should be directed towards the integration of the mapping process on UAV platforms, guiding UAVs autonomously for mapping purposes, and ensuring the transferability of the models to other crop fields.


Introduction
Today, artificial intelligence is transforming the extraction of information from very-high-resolution (VHR) remote sensing data, with neural networks embedded in deep learning architectures tailored specifically to the needs of image data. This enables object recognition and classification in much higher detail and accuracy than before, and combined with imagery obtained from unmanned aerial vehicles (UAVs), smarter monitoring of agricultural land becomes conceivable. Applied to the right scenario, this might pave the way for a more sustainable agriculture [1].
One such application would be site-specific weed management (SSWM). Conventionally, pesticides are supplied with dosage instructions that are calculated uniformly on a "per hectare" basis for the entire field; the target in this case is the whole area within the field. For differentiating individual plant details to identify the weed species, UAVs need to collect imagery from altitudes below approximately 10 m [11]. Yet, mapping entire fields with such a small ground sample distance would require a large number of aerial images, especially if image overlap is needed for photogrammetry. Thus, one problem with UAV imagery from low altitude is the sheer volume of image data, which hinders rapid weed mapping, because it is impractical in terms of data storage, handling, and further processing with photogrammetry and OBIA. A more economical and flexible approach would be an image classifier capable of automatically and quickly identifying weeds from UAV images. This would allow weed mapping directly from a UAV platform as it flies over the field, with image recognition embedded in a computer aboard the platform that analyzes the images online. This way, only the information necessary for weed mapping needs to be stored or transferred to a ground station, such as the classification image, position, and type of the weed plants from post-classification or, even more abstractly, summary statistics over the complete image, e.g., the overall coverage of weeds at species level in that image.
With some success, global features of plant morphology such as convexity, contour, or moments have been used in image classifiers to identify individual plant species directly from images [22][23][24][25]. Yet, these approaches begin to fail on cluttered imagery, such as UAV images from crop fields. More recently, the use of local invariant features within the bag-of-visual-words framework [26] has been tested successfully for identifying weed species in cluttered field imagery [11,27]. This type of classifier only failed if weed species were very similar in their appearance [11]. Even more promising seems the use of convolutional neural networks for identifying weed plants, specifically within a deep learning framework [28]. One benefit of deep convolutional neural networks (DCNNs) is that they learn the feature filters needed to extract the relevant information from the images directly in one process within the training of the network using convolutional layer structures. Beginning with LeNet-5 [29], proposed in 1998 with a rather slim design of two convolutional layers and three fully connected layers and about 60,000 parameters to be fitted, architectures quickly became deeper with the growing capabilities of modern computing hardware. Inception-V3 and ResNet-50, proposed in 2015, hold over 20 million parameters [30,31]. To train and use them optimally, more and more specialized designs became necessary. In the case of deep residual networks (ResNets), residual blocks were popularized as the key feature: they enable shortcut connections in the architecture, which allow more efficient training of deeper DCNNs. This ability has led to breakthroughs in classification accuracy in major image recognition benchmarks such as ImageNet [32].
For weed image classification based on DCNNs, Dyrmann et al. [33] proposed their own DCNN structure and trained it from scratch with segmented images from different sources of RGB images. They achieved moderate to high classification accuracies for 22 different weed species. dos Santos Ferreira et al. [34] tested different machine learning approaches, e.g., support vector machines, Adaboost, random forests, and DCNNs, for classifying UAV images obtained from soybean crops into soil, soybean, grass, and broadleaf classes. Among the tested approaches, the best results were obtained for a DCNN based on an AlexNet architecture [28]. They concluded that one advantage of DCNNs is their independence from the choice of an appropriate feature extractor. More recently, Peteinatos et al. [35] tested three different DCNN architectures, namely VGG16 [36], Inception, and ResNet-50, for the classification of weeds in maize, sunflower, and potato crops with images taken from a ground-based vehicle, in which VGG16 was outperformed by the other two DCNNs. They also concluded that data sets for weed classification by DCNNs need to be more robust, usable, and diverse. Weed classification has also been achieved by DCNN-based segmentation of images. Zou et al. [37] successfully differentiated crop from weeds to estimate weed density in a marigold field using a modified U-Net architecture with images taken from a UAV platform at 20 m altitude.
For online mapping with UAVs, it is paramount not only to achieve high accuracy of the image classifier for weed identification, but also to optimize the predictive capabilities of the network in terms of the speed for evaluating a full-resolution UAV image captured by the camera. Most recently, research has focused on integrating DCNNs on embedded systems for identifying weeds online. Olsen et al. [38] successfully trained models for classifying different rangeland weed species with Inception-V3 and ResNet-50 DCNN architectures and implemented the models on an NVIDIA Jetson TX2 board. They theoretically achieved an inference speed of 18.7 frames per second (fps) for evaluating resampled weed images (224 × 224 px) collected from a ground-based vehicle. Deng et al. [39] used a semantic segmentation network based on an adapted AlexNet architecture and could effectively discriminate rice and weeds on an NVIDIA Jetson TX board at 4.5 fps. This study similarly aims at optimizing a DCNN for weed identification on embedded systems for UAV imagery. In this approach, optimization was reached mainly by avoiding redundant computations that arise when a classification model is applied on overlapping tiles in a larger input image. This is similar to fully convolutional architectures used in segmentation models, but unlike those models, this approach does not require pixel-level segmentation labels at training time, which would be inefficient to produce. As DCNN architecture, a deep residual ResNet-18 structure [31] was used, and the network was taught to recognize the most typical weed species in UAV images collected in winter wheat crops. Based on the DCNN model and its optimization, the aim is an intelligent mapping system capable of identifying and capturing weed species from a UAV platform while it is flying over the field. Here, the optimization approach in the prediction pipeline of the ResNet-18 classifier, its implementation on an embedded system, and its performance in classifying UAV images for typical weed plants in winter wheat crops are shown.

The UAV Image Data Set and Plant Annotation
The data set used in this study was originally introduced in the study of Pflanz et al. [11]. Only the essentials are repeated here. The image data were acquired during a UAV flight campaign in a wheat field (52°12′54.6″ N, 10°37′25.7″ E, near Brunswick, Germany) conducted on 6 March 2014, when weed plants and wheat crop were abundant in the field, with the wheat at development stage BBCH 23 (tillering). The flight mission was conducted between 1:00 and 3:00 p.m. under high fog and cloudy skies, so that the lighting conditions were diffuse, with no direct sunlight. As UAV platform, a hexacopter system (Hexa XL, HiSystems GmbH, Moormerland, Germany) was used, from which images could be captured at a very low altitude between 1 and 6 m above ground at 110 waypoints. The camera setup mounted below the copter consisted of a Sony NEX 5N (Sony Corporation, Tokyo, Japan) with a 23.5 × 15.6 mm APS-C sensor using a lens with a fixed focal length of 60 mm (Sigma 2.8 DN, Sigma Corp., Kawasaki City, Japan). Images were shot in nadir position with a ground sample distance between 0.1 and 0.5 mm. Each image had a dimension of 4912 × 3264 px.
The field was subdivided into training and test areas. All images acquired in the training areas were used for training the model, and all images acquired in the test area were used for testing the prediction capabilities of the model. Experts examined all UAV images and annotated 24,623 plants and background items, recording the coordinates of each plant's midpoint and its species name in an annotation database. Around each annotation coordinate, a buffer of a 201 × 201 px quadratic frame was drawn, and a subimage or image patch was clipped to that buffer depicting the annotated item. In total, 16,500 image patches were extracted this way and used for model training.
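As an illustration of this clipping step, a minimal sketch extracting a 201 × 201 px patch around an annotated midpoint is given below; the array layout and variable names are assumptions, not the authors' code:

```python
import numpy as np

HALF = 100  # 201 x 201 px frame -> 100 px on each side of the midpoint

def clip_patch(image: np.ndarray, x: int, y: int, half: int = HALF):
    """Clip a (2*half+1) x (2*half+1) patch centered on an annotated
    plant midpoint (x, y); returns None if the buffer crosses the border."""
    h, w = image.shape[:2]
    if x - half < 0 or y - half < 0 or x + half >= w or y + half >= h:
        return None  # skip annotations too close to the image border
    return image[y - half:y + half + 1, x - half:x + half + 1]
```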

The Image Classifier Base Architecture
The core of the image classifier is a DCNN based on a residual neural network (ResNet) architecture. ResNets use so-called residual blocks that implement shortcut connections in the network architecture [31]. The stack of convolution layers within each residual block only needs to learn a residual term that refines the input of the residual block toward the desired output. This makes the DCNN easier to train, because the shortcut connections enable the direct propagation of information and gradients across multiple layers of the network, leading to better gradient flow and convergence properties of the network during calibration [40].
The specific network architecture that was used here, shown in Figure 1, is inspired by the 18-layer residual neural network architecture proposed by He et al. [31], but deviates from this model in several aspects relevant for the optimization of computational efficiency. It incorporates two different types of residual blocks (Type A and Type B, shown in Figure 2). Type B follows the original design proposed by He et al. [31] with an identity mapping for the non-residual branch in the block, while Type A implements a modified version, in which a single convolution layer is added to the non-residual branch, as in He et al. [40]. The architecture starts with a 7 × 7 convolution layer with 16 filters, followed by a stride-two 2 × 2 max pooling layer to reduce spatial resolution. Stride-two means that the filters are moved at twice the spatial offset in the input as compared to the output, effectively reducing the spatial dimension of the feature map by a factor of two.
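A minimal Keras sketch of the two residual block types may make this concrete. The number of convolutions per block and the kernel sizes inside the blocks are assumptions based on the standard ResNet-18 design, not details confirmed by the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel=3, stride=1):
    """Convolution followed by batch normalization and ReLU activation."""
    x = layers.Conv2D(filters, kernel, strides=stride,
                      padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def residual_block(x, filters, block_type="B", stride=1):
    """Type B: identity shortcut; Type A: a single convolution layer
    added to the non-residual (shortcut) branch."""
    shortcut = x
    y = conv_bn_relu(x, filters, stride=stride)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    if block_type == "A":
        shortcut = layers.Conv2D(filters, 1, strides=stride,
                                 padding="same", use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```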
These initial layers are followed by eight residual blocks, alternating between Type A and Type B. The number of filters is 16 in the convolution layers within the first two residual blocks and 32 in all remaining convolution layers. Note that these numbers are much lower than in standard ResNet architectures in order to improve computational efficiency. After the first two residual blocks, the spatial dimension is again decreased by a stride-two convolution layer. All convolution layers are followed by batch normalization and nonlinear activation layers. As activation layers, rectified linear units (ReLUs) were used throughout the network, as proposed by He et al. [40]. Note that the model up to and including the final residual block is fully convolutional in the sense of Long et al. [41]. However, unlike the model studied by Long et al. [41], which is a segmentation model that needs to be trained on pixel-level segmentation labels, our model is a classification model that is trained in a multiclass classification setting on 201 × 201 px inputs.

In standard ResNet architectures, the final residual block is followed by a global average-pooling layer and a dense layer for classification. In the model proposed in this study, the output of the final residual block, whose dimensions are 50 × 50 × 32, is first spatially cropped to 20 × 20 × 32 by removing the 15 neurons closest to the borders for all filters in both spatial dimensions. This spatial cropping layer is then followed by a global average-pooling layer and a dense layer for classification, as in standard ResNet architectures. The rationale for the spatial cropping layer is that it removes all neurons in the output of the final residual block whose receptive field on the input would exceed the 201 × 201 px buffer once the model is turned into a fully convolutional model and applied to larger inputs. This is discussed in more detail in Section 2.3.
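Putting these pieces together, a sketch of the complete patch-level classifier could look as follows. The exact position of the stride-two convolution and the A/B ordering of the blocks are assumptions consistent with the description above; `conv_bn_relu` and `residual_block` are from the previous sketch:

```python
def build_patch_classifier(n_classes=6):
    inp = layers.Input((201, 201, 3))
    x = conv_bn_relu(inp, 16, kernel=7)        # 7x7 convolution, 16 filters
    x = layers.MaxPooling2D(2, strides=2)(x)   # stride-two 2x2 max pooling
    x = residual_block(x, 16, "A")
    x = residual_block(x, 16, "B")
    x = residual_block(x, 32, "A", stride=2)   # stride-two convolution
    x = residual_block(x, 32, "B")
    x = residual_block(x, 32, "A")
    x = residual_block(x, 32, "B")
    x = residual_block(x, 32, "A")
    x = residual_block(x, 32, "B")             # output: 50 x 50 x 32
    x = layers.Cropping2D(15)(x)               # spatial crop to 20 x 20 x 32
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```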

Optimizing Computational Performance for Creating Weed Maps
The trained classification model shown in Figure 2 takes as input a 201 × 201 px image patch and predicts the plant species (or bare soil) at the center of this image patch. The goal of this study is to produce high-resolution weed maps, that is, to annotate every spatial position in a large image with the plant species that is growing at that position. A straightforward way to produce such a map would be to apply the trained model to every position on a fine grid laid over the large image. However, this is computationally demanding, because the number of image patches can be very large depending on the resolution of the grid. In this study, images captured by the camera have a resolution of 3264 × 4912 px, and the aim was to classify the plant species in a four-pixel grid. This would result in 766 × 1178 = 902,348 individual classifications of 201 × 201 image patches, assuming that only patches that are fully included in the 3264 × 4912 image are used. Even for a lightweight model, this is computationally challenging, in particular if inference has to be carried out on an embedded device. Note that the image patches are strongly overlapping.
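For the stated sizes, the grid dimensions follow directly: with a 201 px patch and a 4 px stride, there are ⌊(3264 − 201)/4⌋ + 1 = 766 valid vertical positions and ⌊(4912 − 201)/4⌋ + 1 = 1178 valid horizontal positions, giving 766 × 1178 = 902,348 patches.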
The computational performance of the proposed model was optimized by following a different approach in which the trained classification model is converted into another model that can be applied directly to the larger image, and directly outputs 766 × 1178 individual classifications for the plant species in the four-pixel grid. The trained model will be referred to as the patch-level classifier, and the converted model as the image-level classifier. The image-level classifier is designed in such a way that it is mathematically equivalent to performing the 766 × 1178 classifications with the patch-level classifier, that is, it yields exactly the same predictions as this straightforward approach. However, it is much more computationally efficient, mainly because it avoids redundant computations in the convolution layers of the patch-level classifier that would occur when applying it to strongly overlapping image patches.
To begin with the discussion of the image-level classifier, shown in Figure 3, it should be noted that the patch-level classifier is fully convolutional up to and including the last residual block, that is, this part can be applied directly to larger input images and then computes the correspondingly larger feature maps for these larger inputs. This is much more efficient than applying the patch-level model to the many strongly overlapping image patches, as the redundant computations in the convolution layers are avoided. Applying this part of the model to a full image of size 3264 × 4912 px yields an output with a dimension of 816 × 1228 × 32 (where 816 × 1228 is the spatial dimension and 32 is the number of channels), because the two spatial pooling layers in the network jointly reduce the spatial dimension by a factor of four.

Figure 3. Architecture of the image-level classifier. Convolution and max pooling layers as well as residual blocks are identical to those in the patch-level model, except for their larger spatial dimension that results from the larger input size of the model. The cumulative local average pooling layer is a custom layer developed in this study and is described in Section 2.3. Together with the 1 × 1 convolution layer, it mimics the operation of the three last layers of the patch-level model (Figure 1) for each position in the grid.
A 50 × 50 × 32 spatial patch from this 816 × 1228 × 32 output is essentially equivalent to the 50 × 50 × 32 output that would have been generated at the end of the last residual block in the original patch-level model if one had applied it to a particular 201 × 201 patch in the full image. However, the activation values in a 50 × 50 × 32 patch from the 816 × 1228 × 32 output are not exactly identical to the values one would get from the last residual block of the patch-level classifier applied to the corresponding 201 × 201 image patch. This is because the outer neurons in the 50 × 50 × 32 patch have a receptive field that covers more than 201 × 201 px in the input image. In the patch-level classifier, they would see borders that are padded with zeros, while in the image-level classifier they see pixels outside of the 201 × 201 area. However, all activations within the inner 20 × 20 spatial positions of the 50 × 50 × 32 patch are identical to the output of the 20 × 20 spatial cropping layer in the patch-level classifier, which is why the cropping layer was added to the patch-level classifier (see Section 2.2 and Figure 3). Note that there are 766 × 1178 such patches, one for each position in the four-pixel grid.

To complete the image-level classifier, one needs to implement layers that mimic the operation of the last three layers (cropping, global average pooling, dense layer) in the patch-level model for each of the 766 × 1178 grid positions. The cropping and pooling part could be achieved with a standard 20 × 20 spatial average pooling layer; however, this pooling layer would account for a significant fraction of the total computational cost of inference. The problem is that pooling is carried out over strongly overlapping patches, leading again to redundant computations. An equivalent and more efficient way of implementing the pooling operation is thus to compute cumulative sums along both the x-axis and the y-axis over the entire 816 × 1228 × 32 output and to subtract the cumulative sums at the correct indices to obtain the sums over the 20 × 20 spatial patches, which can then be normalized to averages. This efficient procedure was implemented in a custom layer (called cumulative local average pooling in Figure 3). Finally, the dense layer in the patch-level model can be translated into a corresponding 1 × 1 convolution layer in the image-level model. This computes for each grid position the product between a particular 1 × 1 × 32 entry from the 766 × 1178 × 32 feature map and a 32 × 6 weight matrix to yield the six class scores, much like the dense layer in the patch-level model computes class scores from the 32 values resulting from global average pooling. The 1 × 1 convolution layer inherits the weights from the dense layer of the patch-level classifier.
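A minimal sketch of the cumulative-sum pooling idea (an integral image along both spatial axes) and of the dense-to-1 × 1-convolution weight transfer is given below. Function and variable names are illustrative, not the authors' implementation. Note that on the ×4-downsampled feature map, the four-pixel image grid corresponds to a stride of one, and that this sketch pools over all window positions, a superset of the 766 × 1178 positions whose 201 × 201 receptive field lies fully inside the image:

```python
import tensorflow as tf

def cumulative_local_average_pool(fmap, window=20, stride=1):
    """Average over all window x window patches of fmap (batch, H, W, C)
    using 2-D cumulative sums, so each patch average costs O(1) additions
    instead of O(window^2)."""
    s = tf.cumsum(tf.cumsum(fmap, axis=1), axis=2)
    s = tf.pad(s, [[0, 0], [1, 0], [1, 0], [0, 0]])  # prepend zero row/column
    h, w = tf.shape(fmap)[1], tf.shape(fmap)[2]
    i = tf.range(0, h - window + 1, stride)
    j = tf.range(0, w - window + 1, stride)

    def corner(di, dj):  # gather integral-image values at the patch corners
        return tf.gather(tf.gather(s, i + di, axis=1), j + dj, axis=2)

    patch_sum = (corner(window, window) - corner(0, window)
                 - corner(window, 0) + corner(0, 0))
    return patch_sum / float(window * window)

# The dense layer of the patch-level model (32 -> 6) becomes a 1 x 1
# convolution with the same weights, reshaped to (1, 1, 32, 6):
# kernel, bias = dense_layer.get_weights()
# conv1x1 = tf.keras.layers.Conv2D(6, 1, activation="softmax",
#                                  weights=[kernel[None, None], bias])
```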
To summarize, for a 3264 × 4912 px image, the image-level classifier computes exactly the same class probabilities as a patch-level classifier moved over the image on a four-pixel grid. As there are 766 × 1178 possible positions in the four-pixel grid, the output of the image-level classifier is of size 766 × 1178 × 6, with one score per class. That is, it makes a prediction every four pixels (horizontally and vertically), so the output has only one fourth of the original image size in each spatial dimension. It does make predictions for the entire image, but at a slightly lower resolution than that of the original image.
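Under the assumptions of the earlier sketches, the conversion itself can be expressed by rebuilding the fully convolutional part with a flexible input size and transferring the trained weights. This is a hedged illustration of the conversion step, not the published code, and it relies on the two builders creating their weighted layers in the same order:

```python
def build_image_level_classifier(patch_model, n_classes=6):
    """Convert the trained patch-level classifier into the image-level
    classifier: same convolutional body, but crop + global average pooling
    + dense are replaced by cumulative pooling + a 1x1 convolution."""
    inp = layers.Input((None, None, 3))  # e.g., full 3264 x 4912 px images
    x = conv_bn_relu(inp, 16, kernel=7)
    x = layers.MaxPooling2D(2, strides=2)(x)
    x = residual_block(x, 16, "A")
    x = residual_block(x, 16, "B")
    x = residual_block(x, 32, "A", stride=2)
    x = residual_block(x, 32, "B")
    x = residual_block(x, 32, "A")
    x = residual_block(x, 32, "B")
    x = residual_block(x, 32, "A")
    x = residual_block(x, 32, "B")
    x = layers.Lambda(cumulative_local_average_pool)(x)  # mimics crop + GAP
    out = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    model = tf.keras.Model(inp, out)

    # Transfer the trained weights; the dense kernel (32, n_classes) is
    # reshaped into a 1x1 convolution kernel (1, 1, 32, n_classes).
    *body, kernel, bias = patch_model.get_weights()
    model.set_weights(body + [kernel[None, None], bias])
    return model
```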
The code for the image classifier and its image-level optimization was made publicly available by the authors in a GitHub repository (https://github.com/tiborboglar/FastWeedMapping, accessed on 27 April 2021).

Testing the Accuracy of the Image Classifier and Its Prediction Performance (Model Training and Testing)
Model training was based on the 201 × 201 px image patches taken from the annotation database as discussed in Section 2.1. Based on these image patches, the task was to teach the classifier to distinguish six categories: bare soil (SOIL), crop (wheat, TRZAW), and four different species of weeds observed commonly in the field, which were Matricaria chamomilla L. (MATCH), Papaver rhoeas L. (PAPRH), Veronica hederifolia L. (VERHE), and Viola arvensis ssp. arvensis (VIOAR). In the following, they are referred to by their EPPO code.
This training set was augmented by adding, for each image, copies rotated by 90°, 180°, and 270°, and additionally, for each rotation angle, copies mirrored left-to-right. For the training, eight different models were created. Each of these models differed in the filter configuration applied for convolution within the network. A lower number of filters was used for the shallow part of the network (Filter 1) and a higher number of filters in the deeper part of the network (Filter 2). The exact filter configurations and the naming convention are given in Table 1.
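A sketch of this eight-fold augmentation with TensorFlow image operations (the function name is illustrative):

```python
import tensorflow as tf

def eight_fold_augment(patch):
    """Return the eight variants described in the text: the patch rotated by
    0, 90, 180, and 270 degrees, plus a left-right mirror of each rotation."""
    variants = []
    for k in range(4):
        rotated = tf.image.rot90(patch, k=k)
        variants.append(rotated)
        variants.append(tf.image.flip_left_right(rotated))
    return variants
```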
All models were trained using the same optimizer and hyperparameters, namely, the Adam optimizer with a learning rate of 0.01 and without any decay [42]. The number of epochs was fixed at 100 and the batch size at 32 images. A batch size of 32 is one of the most widely chosen batch sizes, and models are typically not very sensitive to it. The order of magnitude of the epochs needed for convergence was judged from the behavior of the training loss and fixed at 100 as a round number; the model is not expected to be sensitive to the number of epochs as long as it is high enough. For optimization, categorical cross-entropy was used as the loss function and accuracy as the metric.

The trained model was implemented in TensorFlow [43] and deployed on an NVIDIA Jetson AGX Xavier embedded system (NVIDIA Corporation, Santa Clara, CA, USA). For prediction, the optimized procedure described in Section 2.3 was used. To further improve computational efficiency, the NVIDIA TensorRT Software Development Kit (NVIDIA Corporation, Santa Clara, CA, USA) was used to decrease the floating-point precision of the models from 32- to 16-bit. This procedure takes advantage of the half-precision capabilities of the Volta GPU by reducing arithmetic bandwidth and thus increasing 16-bit arithmetic throughput. As halving the floating-point precision could negatively impact the prediction results, this study also examined whether these impacts are negligible.

Each model was run five times with different randomizations (seeds) of the weights. For each UAV test image, a classification map was generated this way. All classification maps were compared with 8123 annotations made by experts in the UAV test images. To generate more robust outcomes for testing, the five model runs were aggregated by calculating the median over the classification results. From this, a 6 × 6 confusion matrix was calculated, which was then used to assess the metrics recall, precision, and accuracy. The weed classification in this study is not only a binary crop-weed classification; it also discriminates between four different weed species as well as soil and winter wheat. Thus, true positive (TP), false negative (FN), and false positive (FP) values were determined from a multi-class perspective. They were calculated from the 6 × 6 confusion matrices for each class separately. For example, in the case of MATCH, correct predictions of the category MATCH are counted as TP. FP summarizes cases in which an item is falsely predicted as MATCH when it in fact belongs to a different category, while FN describes cases where a true MATCH is incorrectly predicted as a different category. Based on TP, FP, and FN, the following metrics were calculated: the precision of a class i represents how many of the predicted positives of that class are truly positives among the class predictions (Equation (1)); the recall of a class i represents how many of the actual positives of that class are correctly predicted as that class (Equation (2)). Thus, precision focuses on the predictions, whereas recall focuses on the measurements. The overall accuracy (OA) was calculated by Equation (3) over all classes (k = 6), where N refers to the overall number of cases in the confusion matrix:

Precision_i = TP_i / (TP_i + FP_i) (1)

Recall_i = TP_i / (TP_i + FN_i) (2)

OA = (TP_1 + TP_2 + … + TP_k) / N (3)
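A compact sketch of this training and FP16-conversion setup in TensorFlow follows. The dataset object and file paths are placeholders, and the TF-TRT calls shown follow the TF 2.x interface, which has shifted across releases; treat this as an illustration rather than the authors' exact pipeline:

```python
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

model = build_patch_classifier(n_classes=6)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),  # no decay
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_dataset.batch(32), epochs=100)  # train_dataset: a tf.data
model.save("weed_classifier_savedmodel")        # pipeline (placeholder)

# Reduce floating-point precision from 32- to 16-bit with TF-TRT for the
# Jetson AGX Xavier (half-precision capabilities of the Volta GPU).
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="weed_classifier_savedmodel",
    conversion_params=params)
converter.convert()
converter.save("weed_classifier_trt_fp16")
```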
As inference time could potentially vary over different test images, measurements of inference time are given as the average time over all images in the test set. Inference was done with the embedded system in MAX POWER mode, meaning that the embedded system was allowed to use up to 30 W of electrical power.
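The averaging of inference time over the test set can be expressed in a few lines (illustrative; warm-up runs are excluded so that one-time graph and engine setup is not counted):

```python
import time

def average_inference_time(model, images, warmup=3):
    """Mean wall-clock seconds per full-resolution image."""
    for img in images[:warmup]:
        model.predict(img[None, ...])  # warm-up, excluded from timing
    start = time.perf_counter()
    for img in images:
        model.predict(img[None, ...])
    return (time.perf_counter() - start) / len(images)
```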
To make the trained ResNet-18 model more transparent, we highlighted important regions of the training images as represented in the model by using gradient-weighted class activation maps (Grad-CAM). Grad-CAM was implemented following Selvaraju et al. [44].

Results
The training of the ResNet-18 model with the 201 × 201 px image patches from the training set reached fast convergence after about 60 epochs, as can be seen from the accuracy and loss curves in Figure 4. There was no indication of any substantial change in the trend beyond that point. Thus, the use of 100 epochs for model training seemed acceptable.
In Figure 5, Grad-CAM images are shown for each class type as heat maps. Lighter colors indicate stronger importance for the prediction of the specific class type. All Grad-CAM images showed a localized highlighting of the importance for modeling that was distinctive for each class type. Mostly, it coincided with the features belonging to the specific class type, such as leaf structure, leaf edges, or soil textural background. In the case of MATCH, the model importance was centered on the fern-like, bipinnate leaves. It is interesting that the MATCH heat maps highlighted the importance strongly in areas where the MATCH leaves crossed underlying linear structures, e.g., from wheat plants or background material. Similarly, in the TRZAW heat maps, the linear structures of the wheat leaves were strongly highlighted, but here with strong importance devoted to the green and healthy leaves and less importance to the yellow and defective leaves. SOIL, as expected, had the strongest model importance in areas with a clear view of the soil background, specifically highlighting areas with distinct pattern information about soil crust or small stones. The weed types PAPRH, VERHE, and VIOAR, although occurring more sporadically in the example images, were precisely highlighted in their respective heat maps. Even though these latter weed species have a rather simple lobed leaf structure, it seemed that model importance was attached to specific leaf characteristics, e.g., leaf margins and lobed structures, unique to the particular weed species.
Figure 5. The heat maps of the ResNet-18 model show Grad-CAM images that highlight the importance of areas in the training image for model calibration.

Overall Performance of the ResNet-18 Image-Level Classifier Regarding 32-Bit and 16-Bit Precision
The image-level classifier was tested using different filter configurations on the embedded system Jetson AGX Xavier. In general, an increasing trend of the overall accuracy with an increasing number of filters was determined (Table 2). The largest gain in overall accuracy was found within the lower filter configurations, from 2/4 to 6/12. In the higher filter configurations, overall accuracy was well above 90%, indicating strong predictive capabilities of the models. When changing the computation precision of the model from 32- to 16-bit, only a slight deviation was determined, with values below 0.001. The same was found for the individual classes (Figure 6). No class deviated from the 32-bit models by more than 0.003 in precision or recall. Thus, the differences between 32- and 16-bit precision are negligibly small, and the use of 16-bit precision showed no detrimental effect on model quality in this study.

Table 2. Overall accuracy of prediction of the ResNet-18 model in 32-bit and 16-bit precision, along with the difference between 32- and 16-bit, shown for different filter configurations.

In Figure 7, the evaluation speed was recorded for one test image for the patch-level and the image-level classifier. The patch-level classifier uses no optimization in the prediction pipeline and predicts on the image patch by patch independently, which is of course much more inefficient in terms of computation costs. The patch-level classifier resulted in evaluation times ranging from 1077 to 2321 s from the lowest to the highest filter configuration at 32-bit precision. This would be far too slow for online mapping from a UAV. With the image-level classifier, the evaluation time was substantially reduced and ranged from 0.42 to 1.07 s from the lowest to the highest filter configuration at 32-bit precision, a reduction of evaluation time by a factor of around 2100 to 2600. The evaluation speed of the image-level classifier was further improved by using the 16-bit rather than the 32-bit version (Figure 7c). Globally, the evaluation time increased with increasing filter configuration, yet the increase was greater for 32-bit than for 16-bit precision.

With higher filter configurations, the test images were classified nearly twice as fast with 16-bit as with 32-bit precision. In numbers, an image needed 0.79 s to be fully classified on the embedded system in 32-bit with filter configuration 10/20, whereas only 0.46 s was needed with 16-bit precision, corresponding to 1.3 and 2.2 frames per second, respectively. The latter speed would be suitable for online evaluation on the UAV for mapping weeds in the fields. Thus, the remaining sections only discuss model testing in 16-bit mode, because the lower precision improves computational performance without sacrificing accuracy.

Class Specific Prediction Quality Assessment
In Figure 8, the precision and recall values are shown for the individual classes in relation to the filter configuration of the model. With a smaller number of filters integrated into the model, precision and recall are lower and behave more erratically from one filter configuration to the next. This effect is especially strong for the classes VIOAR, PAPRH, and VERHE, and stronger for recall than for precision. From filter configuration 10/20 onward, precision and recall values stabilize for all models. The highest values for both precision and recall were achieved by the classes SOIL, TRZAW, and MATCH. In precision, the weeds PAPRH and VERHE also reached high values above 90%, but their recall values were below 90%. Evidently, the models tend to miss some of the PAPRH and VERHE plants, but those predicted to be PAPRH and VERHE are very likely to be actually present. The relatively worst model accuracy was obtained for the class VIOAR, with values below 90% for precision and recall. However, at filter configurations of 10/20 and above, VIOAR was still predicted with high quality, with precision and recall values well above 80%.
In Table 3, a confusion matrix is given for the models with filter configuration 10/20, calculated over all test images. The counts of the five random-seed outcomes were summarized by the median. Overall, the models differentiated strongly between plants and background as well as between crop and weed. The overall classification accuracy was 94%. Regarding the differentiation from the soil background, only for MATCH was a slight misclassification of the predictions observed. This misclassification might be related to the fact that the leaves of MATCH are subdivided into many branches of small lobed leaflets. Therefore, the soil shines through the plant structure of MATCH, which might become hard to discriminate in some situations in the images for the models. Yet, the misclassification rate was still at a very low level, with a percentage below 1.2%. According to the confusion matrix, TRZAW was very well differentiated from the weed plants. There was only a weak confusion with MATCH, which might be attributed again to the transparency of the MATCH plants and to some extent to the remote similarity between them due to their ribbon-like plant structures.

Table 3. Confusion matrix for the evaluation on the test set for the image-level classifier using the ResNet-18 model with filter configuration 10 (20). The resulting counts were aggregated from five random seeds by median. CV refers to the coefficient of variation computed from the different outcomes of recall or precision, expressed as a percentage.
Regarding the stability among the different seeds of the models, the models for SOIL, TRZAW, and MATCH showed very little variation in precision and recall, with coefficients of variation from 0.5% to 1.3%, corroborating the high consistency of the model predictions. This variation was somewhat higher for the weed species PAPRH, VERHE, and VIOAR, varying from 3.2% to 5.2%. Whereas MATCH, PAPRH, and VERHE or VIOAR were relatively well discriminated from each other, a more noticeable confusion occurred between VERHE and VIOAR, with up to 10% of cases falsely predicted as VIOAR when they were in fact VERHE. Both weed species show a high degree of similarity, especially at the young growth stages in which they were observed. In addition, both plants appeared very small in the UAV images, with only very few remarkable features.

In Figure 9, a zoomed representation of a UAV aerial image from the test set is shown. This image was one of the images used to estimate a classification map with the image-level classifier on the embedded system. The classification map is shown on the left side of the figure for comparison. The incorporated class types are quite well detected and outlined in the classification map. The background SOIL class (in pink) covered not only the soil crust and aggregate structures, but also sporadically appearing stones of different shapes in the soil. The crop wheat, class TRZAW (in green), was found where it had grown densely and the leaves had a green appearance. Dead and unhealthy wheat leaves, however, were not detected by the image classifier. MATCH, which appeared quite frequently in the image (in red), was detected both where it appeared in the open and where it appeared densely below the wheat crop. Thus, the image classifier showed the ability to differentiate the plants even when they overlapped each other. VIOAR (light blue) and VERHE (yellow) occurred less frequently and covered only small areas of the ground as individual plants, but were accurately detected by the image classifier where they appeared in the image. However, some limitations of the image classifier are also evident from the classification map of the test image shown in the figure. Although VERHE and VIOAR were precisely found in the test image, more areas of the image were assigned to VERHE and VIOAR than occurred in the field. These areas were mostly found at boundaries from one class to another, e.g., at the edges of plant leaves. Probably an ambiguous structure appears in these areas of the image that has a high similarity to another class. Another limitation can be seen in the bottom right part of the image, where a volunteer rapeseed plant appears. This plant species was not learned by the model and was also not present in the background training images. Since information about the plant was not available in the model, the image classifier tried to assign the plant area to the available class labels. This resulted in splitting this image area into the TRZAW, VERHE, and PAPRH (dark blue) class labels.

Discussion
The optimized model approach for image-level classification presented in this study is fully convolutional and inherits the same features as the conventional ResNet-18 classification model. The optimization successfully increased the evaluation speed for image classification of the UAV images, and it is implementable on an embedded system with online evaluation capabilities. Using the NVIDIA Jetson AGX Xavier board, a stable evaluation of 2.2 frames per second on the 3264 × 4912 px full-resolution images was reached in this study. Assuming a ground coverage of 2.25 m² per image of the low-altitude UAV imagery, this would result in an area performance of 1.78 ha h−1 for full, continuous crop field mapping (2.2 frames s−1 × 2.25 m² ≈ 4.95 m² s−1 ≈ 1.78 ha h−1). No loss of predictive capability was recorded when moving from 32-bit to 16-bit floating-point computation, but there was a large gain in speed. It can be assumed that a further gain in speed would be achieved by shifting entirely to integer-based computation on the embedded board [45], which was not tested in this study. Area performance could also be increased with higher camera resolution to become more practical, as Peteinatos et al. [35] pointed out. However, another approach to enhance area performance could be sparse mapping. In this scenario, the UAV records images with gaps between the flight paths over the field, so that faster mapping can be achieved. This can be combined with overview UAV images taken from a higher altitude, which would give additional information for interpolating the weed map. Geostatistical interpolation methods, such as co-kriging or regression kriging, have been shown to be suitable for integrating UAV imagery as secondary information in the interpolation process [46,47].
The image classifier was trained, optimized, and tested with the goal of later integration into an online weed detection system for winter wheat on UAV platforms. Thus, both the training and test images were not taken under controlled conditions where, for example, the camera was pointed directly at weed plants or the environmental conditions were controlled such that easy segmentation of individual weed, plant, or background features would have been possible. All images were captured from the copter platform in nadir perspective during low-altitude flights. Some uncertainty was deliberately accepted in this study in order to assess the performance of the model under natural conditions. Such differences should be taken into account when comparing model performance with other studies. In general, the optimized image classifier of this study performed with 94% overall classification accuracy, well in the range of studies aiming to classify mixed weed plants [33][34][35][48][49]. In comparison with Pflanz et al. [11], a higher overall accuracy was obtained on the same data set. The better performance was particularly striking for the similar weed species VIOAR and VERHE. This might indicate that deep residual networks are better suited than bag-of-visual-words approaches for the classification and discrimination of weed species in UAV imagery. In contrast to segmentation models, which, being directly fully convolutional, would also produce a pixel-level segmentation of a given input image into the different classes [41], our approach does not need segmentation-level labeling in the training data. This trades off model accuracy against annotation effort to some extent, because patch labeling is not as accurate as segmentation labeling: it also includes labels where wheat or weed plants did not exactly fit into the patch or where background objects were present next to the object of interest. This noise may also have impacted model accuracy.
The UAV approach shown here does not need sophisticated camera technology. The network was trained on images captured by a snapshot RGB camera. In principle, this approach can be duplicated at rather low cost, especially if drone and computing technology drop further in price. In perspective, drone swarms would allow mapping entire fields for weeds within minutes. Rapidly available weed maps obtained by UAV remote sensing might pave the way to accelerating the adoption of SSWM technology. In previous experiments with an optoelectronic and camera-based weed sensor conducted in farmers' cereal and pea fields, average herbicide savings of up to 25.6% could be reached with SSWM [50]. Fast weed maps might also pave the way for selective weed management using fast-reacting direct injection sprayers [51,52]. Gerhards and Christensen [53] used tractor-mounted bispectral cameras for weed detection. In small-row crops, winter wheat and winter barley, they reached herbicide savings with application maps, depending on the level of weed infestation, of even more than 90% by leaving areas unsprayed where a certain treatment threshold was not reached. With the weed detection approach presented here, it should be possible in the future to identify and localize the key weeds that are important for wheat cultivation. This will contribute to adapted and more environmentally compatible crop protection and reduce the input of unwanted amounts of crop protection products into the environment and the soil.

Conclusions
The approach presented in this study successfully optimized a ResNet-18 DCNN classifier to differentiate crop, soil, and weeds, including individual weed species, in very-high-resolution UAV imagery captured from low altitudes. Due to the optimization, the classification model can be applied efficiently to overlapping image patches in large images without incurring redundant computations in the convolution layers. This is achieved by computing the fully convolutional part of the model directly over the large, full-resolution UAV images instead of applying it patch by patch in a sliding-window approach. The image-level classifier is guaranteed to give exactly the same predictions as independently applying the ResNet-18 classification model to the image patches and therefore shares all of its advantages for prediction. A ResNet filter configuration of 10 in the shallow and 20 in the deeper part of the network was found to be the best trade-off between accuracy and speed. Full-image evaluation under these settings ran at about 2.2 frames per second on an NVIDIA Jetson AGX Xavier board at 16-bit precision. Shifting from 16-bit to 32-bit precision brought no improvement in accuracy, but increased the time cost of image evaluation by about a factor of two. This performance enables implementation on a UAV platform for online mapping of weeds in crop fields. Assuming constant speed and image processing of the UAV platform, this would amount to an area output of about 1.78 ha h−1 when mapping is performed continuously without any gaps from image to image. The image classifier achieved an overall accuracy of 94% when mapping the UAV aerial images of the test field. The classified images quite accurately distinguished the weed species learned by the model, even in more complicated areas of the aerial imagery where plants overlapped each other. There are still limitations of the model regarding the classification of unknown species, which need to be addressed to improve the transferability of the model to other crop fields.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article. The code is available in a GitHub repository: https://github.com/tiborboglar/FastWeedMapping.