Today, artificial intelligence is transforming the extraction of information from very-high-resolution (VHR) remote sensing data with deep learning architectures tailored specifically to image data. This enables object recognition and classification in much greater detail and with higher accuracy than before, and, combined with imagery obtained from unmanned aerial vehicles (UAVs), smarter monitoring of agricultural land becomes conceivable. Applied to the right scenario, this might pave the way for a more sustainable agriculture [1].
One such application would be site-specific weed management (SSWM). Conventionally, pesticides are supplied with dosage instructions that are calculated uniformly on a “per hectare” basis for the entire field. In this case, the target is the area within the field rather than the weeds themselves. For SSWM, in contrast, the treatment is directed at the weed plants. Weed plants normally exhibit an aggregated pattern of patches across the field [2
]. Thus, SSWM can reduce the amount of unused herbicide that reaches the ground and misses the target plants. This is consistent with government regulations and policies in the European Union aiming at considerably reducing the amount of pesticide used in agriculture by 2030 [4
]. For SSWM, it is important to delineate the location and the size of weed patches in the field. To achieve this, sensors for automatic weed detection are needed to replace visual weed scouting in the field. Depending on the way in which weeds are recognized in the field, two approaches for SSWM can be outlined—the online and the offline approach [5
]. In the case of the online approach, weed detection and the spraying action are performed in one operational step. For example, a tractor may be equipped with a sensor capable of detecting the weed cover, which then regulates an implement that controls the spray liquid ad hoc. In the case of the offline approach, weed detection and spraying are done in two separate operational steps. Weed maps are first generated and translated into prescription maps, which are then passed to variable-rate herbicide sprayers that vary application rates according to the spatial variability of the weeds. Thus, while driving over the field, the spraying amount can decrease if the coverage value or the number of weed plants decreases, and vice versa. The spatial accuracy of herbicide application has become quite reliable with the commercial spraying technology available for SSWM [7].
In recent years, small UAV platforms have become increasingly popular in precision agriculture, because they provide flexible and cost-effective monitoring of fields, offer small ground sample distances, enable on-demand collection, and provide information to the farmer quickly [8
]. UAVs can be piloted at altitudes from which images can be captured that contain enough detail to identify even subtle structures of individual plants in crop fields with unsophisticated camera systems such as snapshot RGB cameras. This allows information to be extracted from the images that can be used to distinguish not only between crops and weeds, but also which weed species are present at a particular location in the field [11
]. UAV technology would offer tremendous advantages for SSWM. Detailed weed maps from UAV imagery can be generated to accurately delineate weed patterns and patches in the field [12
]. The capability of differentiating among the weed species would further enable selective herbicide application [13
]. More accurate and detailed weed maps would also improve the understanding about weed concurrence and competition for analyzing and predicting the propagation mechanisms of weeds and improve the accuracy of spatio-temporal models of weed populations for agronomists and ecologists [14
]. As part of a smarter agriculture, online weed assessment by UAV could guide weed robots more efficiently across the field [16
], or UAVs extended with spraying equipment could treat selected areas of the field directly from the air, only where needed [17].
There have been a number of attempts to generate site-specific weed maps in the past using UAV remote sensing. Since the spectral characteristics of crop and weed plants can be highly similar in the early season, the use of object characteristics of plants and plant associations has been seen to be highly effective in improving the results for weed mapping [19
]. Specifically, for the general crop–weed differentiation, many studies propose a multi-step classification approach using an object-based image analysis methodology (OBIA). Peña et al. [20
] suggested an OBIA procedure for weed classification of UAV imagery using a combination of several contextual, hierarchical, and object-based image features in a three-step algorithm. This included the identification of crop rows based on their linear pattern, the discrimination between crop and weed plants, and a subsequent gridding for generating the weed map. They concluded that the combination of UAV imagery and OBIA is a favorable technology compared to airborne and satellite remote sensing for producing crop–weed maps to calculate herbicide requirements and plan weed control applications in advance. In a later study, they were able to obtain good discrimination between crop and weed plants even within crop rows by further refining the OBIA model with a random forest classifier incorporating 3D surface information from UAV photogrammetry [21].
For differentiating individual plant details to identify the weed species, UAVs need to collect the imagery from altitudes below approximately 10 m [11
]. Yet, mapping entire fields with such a small ground sample distance would require a large number of aerial images, especially if image overlap is needed for photogrammetry. Thus, one problem with UAV imagery from low altitude is the sheer volume of image data, which hinders rapid weed mapping because it is impractical in terms of data storage, handling, and further processing with photogrammetry and OBIA. A more economical and flexible approach would be an image classifier capable of automatically and quickly identifying weeds in UAV images. This would allow weed mapping directly from a UAV platform as it flies over the field, with image recognition embedded in a single on-board computer that analyzes the images online. This way, only the information necessary for weed mapping needs to be stored or transferred to a ground station, such as the classification image, position, and type of the weed plants from post-classification or, even more abstractly, summary statistics over the complete image, e.g., the overall coverage of weeds at species level in that image.
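Such per-image summary statistics could be computed on-board directly from the patch-level class predictions. A minimal sketch in Python/NumPy, assuming a hypothetical label map in which each entry is the predicted class index of one image patch (the class names follow the abbreviations used later in this study):

```python
import numpy as np

def class_coverage(label_map, class_names):
    """Fraction of patches assigned to each class in one classified image."""
    labels = np.asarray(label_map)
    total = labels.size
    return {name: float(np.sum(labels == i)) / total
            for i, name in enumerate(class_names)}

# Hypothetical 4 x 5 grid of patch predictions (class indices).
classes = ["SOIL", "TRZAW", "MATCH", "PAPRH", "VERHE", "VIOAR"]
label_map = [[0, 0, 1, 1, 2],
             [0, 1, 1, 2, 2],
             [0, 0, 1, 1, 5],
             [0, 1, 1, 3, 4]]
coverage = class_coverage(label_map, classes)
```

Instead of transmitting the full-resolution image, only such a small dictionary of coverage fractions (plus position) would need to reach the ground station.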
With some success, global features of plant morphology such as convexity, contour, or moments have been used in image classifiers to identify individual plant species directly from images [22
]. Yet, these approaches begin to fail if cluttered imagery, such as UAV images from crop fields, is used. More recently, the use of local invariant features within the framework of bag-of-visual words [26
] has been tested successfully for identifying weed species in cluttered field imagery [11
]. This type of classifier only failed if weed species were very similar in their appearance [11
]. Even more promising seems the use of convolutional neural networks for identifying weed plants, specifically within a deep learning framework [28
]. One benefit of deep convolutional neural networks (DCNN) is that they learn the feature filters needed to extract the relevant information from the images directly in one process within the training network using convolutional layer structures. Beginning with LeNet-5 [29
], proposed in 1998 with a rather compact design of two convolutional layers and three fully connected layers and about 60,000 parameters to be fitted, architectures quickly became deeper with the growing capabilities of modern computing hardware. Inception-V3 and ResNet-50, both proposed in 2015, hold over 20 million parameters [30
]. To train and use them optimally, more and more specialized designs became necessary. In the case of deep residual networks (ResNets), residual blocks were popularized as the key feature: they introduce shortcut connections into the architecture, which allow more efficient training of deeper DCNNs. This ability has led to a breakthrough in classification accuracy in major image recognition benchmarks such as ImageNet [32].
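The shortcut connection of a residual block can be illustrated with a small sketch (an illustrative single-channel simplification, not the exact ResNet layer configuration): the block computes a residual F(x) through two convolutions and adds the unchanged input x back before the final activation, so the stacked layers only have to learn the residual:

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 3x3 'same' convolution for a single-channel 2D input."""
    h, wd = x.shape
    pad = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * w)
    return out

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x) with F = conv -> ReLU -> conv (identity shortcut)."""
    f = conv2d_same(np.maximum(conv2d_same(x, w1), 0.0), w2)
    return np.maximum(f + x, 0.0)  # shortcut: add the input back

# With all-zero filter weights F(x) = 0, so the block passes x through
# unchanged (for non-negative x) -- exactly the identity mapping that
# makes very deep stacks of such blocks trainable.
x = np.arange(16, dtype=float).reshape(4, 4)
y = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))
```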
For weed image classification based on DCNNs, Dyrmann et al., [33
] proposed their own DCNN architecture and trained it from scratch with segmented images from different sources of RGB imagery. They achieved moderate to high classification accuracies for 22 different weed species. A. dos Santos Ferreira et al. [34
] tested different machine learning approaches, e.g., support vector machines, Adaboost, random forests, and DCNN, for classifying UAV images obtained from soybean crops into soil, soybean, grass, and broadleaf classes. Among the tested approaches, the best results were obtained for a DCNN based on an AlexNet architecture [28
]. They concluded that one advantage of DCNNs is their independence from the choice of an appropriate feature extractor. More recently, Peteinatos et al. [35
] tested three different DCNN architectures, including VGG16 [36
], Inception, and ResNet-50, for the classification of weeds in maize, sunflower, and potato crops with images taken from a ground-based vehicle; VGG16 was outperformed by the other two DCNNs. They also concluded that data sets for weed classification by DCNNs need to be more robust, usable, and diverse. Weed classification has also been achieved by DCNN-based segmentation of images. Zou et al. [37
] successfully differentiated crop from weeds to estimate weed density in a marigold crop field using a modified U-Net architecture with images taken from a UAV platform at 20 m altitude.
For online mapping with UAVs, it is paramount not only to achieve a high accuracy of the image classifier for weed identification, but also to optimize the predictive capabilities of the network in terms of the speed of evaluating a full-resolution UAV image captured by the camera. Most recently, research has focused on integrating DCNNs on embedded systems for identifying weeds online. Olsen et al. [38
] successfully trained models for classifying different rangeland weed species with Inception-V3 and ResNet-50 DCNN architectures and implemented the models on an NVIDIA Jetson TX2 board. They theoretically achieved an inference rate of 18.7 fps for evaluating resampled weed images (224 × 224 px) collected from a ground-based vehicle. Deng et al. [39
] used a semantic segmentation network based on an adapted AlexNet architecture and could effectively discriminate rice and weeds on an NVIDIA Jetson TX board at 4.5 fps. The present study similarly aims at optimizing a DCNN for weed identification from UAV imagery on embedded systems. In this approach, optimization was achieved mainly by avoiding the redundant computations that arise when a classification model is applied to overlapping tiles of a larger input image. This is similar to fully convolutional architectures used in segmentation models, but unlike those models, this approach does not require pixel-level segmentation labels at training time, which are costly to produce. As DCNN architecture, a deep residual ResNet-18 structure [31
] was used, and the network was trained to recognize the most typical weed species in UAV images collected in winter wheat crops. Based on the DCNN model and its optimization, the goal is an intelligent mapping system capable of identifying and mapping weed species from a UAV platform while it is flying over the field. Here, the optimization approach in the prediction pipeline of the ResNet-18 classifier, its implementation on an embedded system, and its performance in classifying UAV images for typical weed plants in winter wheat crops are shown.
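The redundancy argument can be illustrated for a single convolution layer (a toy sketch with a hypothetical 12 × 12 image and one 3 × 3 filter, not the full ResNet-18 pipeline): convolving every overlapping patch independently recomputes values that a single pass over the whole image already produces, so the per-patch results can simply be read out as windows of the full-image feature map:

```python
import numpy as np

def conv2d_valid(x, w):
    """Naive 'valid' convolution for a single-channel 2D input."""
    kh, kw = w.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(12, 12))
w = rng.normal(size=(3, 3))
patch_size = 6
n_pos = image.shape[0] - patch_size + 1  # overlapping patch positions per axis

# Patch-level: re-convolve every overlapping patch independently (redundant).
per_patch = [conv2d_valid(image[i:i + patch_size, j:j + patch_size], w)
             for i in range(n_pos) for j in range(n_pos)]

# Image-level: convolve the full image once, then read out windows.
full = conv2d_valid(image, w)
shared = [full[i:i + patch_size - 2, j:j + patch_size - 2]
          for i in range(n_pos) for j in range(n_pos)]
```

Because the two variants are numerically identical, the speed-up of the image-level classifier comes purely from eliminating repeated work, not from changing the model.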
The training of the ResNet-18 model with the 201 × 201 px image patches from the training set converged quickly after about 60 epochs, as can be seen from the accuracy and loss curves in Figure 4. There was no indication of any substantial change in the trend beyond that point. Thus, the use of 100 epochs for model training seemed acceptable.
In Figure 5
, Grad-CAM images are shown for each class type as heat maps. Lighter colors indicate stronger importance for the prediction of the specific class type. All Grad-CAM images showed a localized highlighting of the model importance that was distinctive for each class type. Mostly, it coincided with the features belonging to the specific class type, such as leaf structure, leaf edges, or the textural background of the soil. In the case of MATCH, the model importance was centered on the fern-like, bipinnate leaves. Interestingly, the MATCH heat maps strongly highlighted areas where the MATCH leaves crossed underlying linear structures, e.g., from wheat plants or background material. Similarly, in the TRZAW heat maps, the linear structures of the wheat leaves were strongly highlighted, but here with strong importance attached to the green and healthy leaves and less to the yellow and damaged leaves. SOIL had, as expected, the strongest model importance in areas with a clear view of the soil background, specifically highlighting areas with distinct pattern information such as soil crust or small stones. The weed types PAPRH, VERHE, and VIOAR, although occurring more sporadically in the example images, were precisely highlighted in their respective heat maps. Even though these latter weed species have a rather simple lobed leaf structure, model importance appeared to be attached to specific leaf characteristics, e.g., leaf margins and lobed structures, unique to the particular weed species.
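Grad-CAM itself reduces to a short computation: each feature map of a convolutional layer is weighted by the spatial average of the class-score gradient flowing into it, the weighted maps are summed over channels, and only positive contributions are kept. A minimal sketch on hypothetical activation and gradient arrays (the hooks that extract these arrays from the trained network are omitted here):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat map: ReLU of the gradient-weighted sum of feature maps.

    activations, gradients: arrays of shape (channels, height, width).
    """
    weights = gradients.mean(axis=(1, 2))             # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted channel sum
    return np.maximum(cam, 0.0)                       # keep positive evidence

# Hypothetical activations/gradients for 4 channels on a 7 x 7 feature map.
rng = np.random.default_rng(1)
acts = rng.normal(size=(4, 7, 7))
grads = rng.normal(size=(4, 7, 7))
cam = grad_cam(acts, grads)
```

The resulting low-resolution map is then upsampled to the input image size and rendered as the heat-map overlay shown in Figure 5.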
3.1. Overall Performance of the ResNet-18 Image-Level Classifier Regarding 32-Bit and 16-Bit Precision
The image-level classifier was tested using different filter configurations on the embedded system Jetson AGX Xavier. In general, overall accuracy increased with an increasing number of filters (Table 2). The largest gain in overall accuracy occurred in the lower filter configurations, from 2/4 to 6/12. At the higher filter configurations, overall accuracy was well above 90%, indicating strong predictive capabilities of the models. When changing the computation precision of the model from 32- to 16-bit, only a slight deviation was determined, with values below 0.001. The same was found for the individual classes (Figure 6
). No class deviated from the 32-bit models by more than 0.003 in precision or recall. Thus, the differences between 32- and 16-bit precision are negligibly small, and the use of 16-bit precision showed no detrimental effect on model quality in this study.
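The effect of halving the floating-point precision can be probed in the same spirit with a small sketch (illustrative only; the study measured this on the full ResNet-18 models): run the same linear-plus-softmax scoring in float32 and float16 and compare the resulting class probabilities.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical feature vector and 6-class weight matrix.
rng = np.random.default_rng(2)
features = rng.normal(size=256).astype(np.float32)
weights = rng.normal(scale=0.1, size=(6, 256)).astype(np.float32)

p32 = softmax(weights @ features)
p16 = softmax(weights.astype(np.float16) @ features.astype(np.float16))

# Maximum absolute deviation between the two probability vectors.
deviation = float(np.abs(p32 - p16.astype(np.float32)).max())
```

Because softmax probabilities are bounded and the logits dominate the class ranking, small float16 rounding errors rarely change the predicted class, which matches the negligible accuracy differences reported above.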
In Figure 7
, the evaluation time was recorded for one test image for the patch-level and the image-level classifier. The patch-level classifier uses no optimization in the prediction pipeline and works as if predicting on the image patch by patch independently, which is of course much more inefficient in terms of computational cost. The patch-level classifier resulted in evaluation times ranging from 1077 to 2321 s from the lower to the higher filter configuration with 32-bit precision. This evaluation time would be far too long for online mapping applications with a UAV. With the image-level classifier, the evaluation time was substantially reduced and ranged from 0.42 to 1.07 s from the lower to the higher filter configuration in 32-bit precision, a reduction of evaluation time by a factor of around 2100 to 2600. The evaluation time of the image-level classifier was further reduced by using 16-bit rather than 32-bit precision (Figure 7c). Overall, the evaluation time increased with increasing filter configuration, but the increase was greater for 32-bit than for 16-bit precision. At the higher filter configurations, the test images were classified nearly twice as fast with 16-bit as with 32-bit precision. In numbers, an image needed 0.79 s to be fully classified on the embedded system in 32-bit mode with filter configuration 10/20, whereas only 0.46 s was needed with 16-bit precision, corresponding to 1.3 and 2.2 frames per second, respectively. The latter speed would be suitable for online evaluation on the UAV for mapping weeds in the field. Thus, the remaining sections only discuss model testing in 16-bit mode, because the lower precision improves computational performance without sacrificing accuracy.
3.2. Class Specific Prediction Quality Assessment
In Figure 8
, the precision and recall values are shown for the individual classes in relation to the filter configuration of the model. With a smaller number of filters in the model, precision and recall are lower and behave more erratically from one filter configuration to the next. This effect is especially strong for the classes VIOAR, PAPRH, and VERHE, and stronger for recall than for precision. From filter configuration 10/20 onward, precision and recall values stabilize for all models. The highest values for both precision and recall were achieved by the classes SOIL, TRZAW, and MATCH. For precision, the weeds PAPRH and VERHE also reached high values above 90%, but their recall values were below 90%. Evidently, the models tend to miss some of the PAPRH and VERHE plants, but those predicted to be PAPRH or VERHE are very likely to actually be present. The relatively worst model accuracy was obtained for the class VIOAR, with values below 90% for precision and recall. However, with filter configurations greater than 10/20, VIOAR was still predicted with high quality, with precision and recall values well above 80%.
In Table 3
, a confusion matrix calculated over all test images is given for the models with filter configuration 10/20. The counts of the five random-seed runs were summarized by their median. Overall, the models differentiated strongly between plants and background as well as between crop and weeds. The overall classification accuracy was 94%. Regarding the differentiation from the soil background, a slight misclassification was determinable only for MATCH. This misclassification might be related to the fact that the leaves of MATCH are subdivided into many branches of small lobed leaflets. Therefore, the soil shines through the plant structure of MATCH, which might be hard for the models to discriminate in some situations in the images. Yet, the misclassification rate remained at a very low level, below 1.2%. According to the confusion matrix, TRZAW was very well differentiated from the weed plants. There was only a weak confusion with MATCH, which might again be attributed to the transparency of the MATCH plants and, to some extent, to the remote similarity between the two due to their ribbon-like plant structures.
Regarding the stability among the different random seeds, the models for SOIL, TRZAW, and MATCH varied very little in precision and recall, with coefficients of variation from 0.5% to 1.3%, corroborating a high consistency of model prediction. This variation was somewhat higher for the weed species PAPRH, VERHE, and VIOAR, ranging from 3.2% to 5.2%. Whereas MATCH, PAPRH, and VERHE or VIOAR were relatively well discriminated from each other, a more noticeable confusion occurred between VERHE and VIOAR, with up to 10% of instances falsely predicted as VIOAR when they were in fact VERHE. Both weed species show a high degree of similarity, especially in the younger growth stages in which they were observed. In addition, both plants appeared very small in the UAV images, with only very few distinctive features.
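The per-class statistics used above follow directly from the confusion-matrix counts; a minimal sketch with a hypothetical 3-class matrix (not the counts from Table 3), including the coefficient of variation used for the seed-stability comparison:

```python
import numpy as np

def precision_recall(cm):
    """Per-class precision and recall from a confusion matrix
    (rows = reference class, columns = predicted class)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)  # column sums: all predicted as class k
    recall = tp / cm.sum(axis=1)     # row sums: all actually class k
    return precision, recall

def coefficient_of_variation(values):
    """Relative spread of a statistic across random-seed runs."""
    v = np.asarray(values, dtype=float)
    return v.std() / v.mean()

# Hypothetical 3-class confusion matrix.
cm = [[90,  5,  5],
      [10, 80, 10],
      [ 0, 10, 90]]
precision, recall = precision_recall(cm)
```

Class 1 here is missed more often than it is falsely predicted (recall 0.80 vs. precision 0.84), the same asymmetry reported for PAPRH and VERHE.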
In Figure 9
, a zoomed representation of a UAV aerial image from the test set is shown. This image was one of those used to estimate a classification map with the image-level classifier on the embedded system. The classification map is shown on the left side of the figure for comparison. The incorporated class types are quite well detected and outlined in the classification map. The background SOIL class (in pink) covered not only the soil crust and aggregate structures, but also sporadically appearing stones of different shapes in the soil. The wheat crop, class TRZAW (in green), was found where it had grown densely and the leaves had a green appearance. Dead and unhealthy wheat leaves, however, were not detected by the image classifier. MATCH, which appeared quite frequently in the image (in red), was detected both when it appeared in the open and when it grew densely below the wheat crop. Thus, the image classifier showed the ability to differentiate the plants even when they overlapped. VIOAR (light blue) and VERHE (yellow) occurred less frequently and covered only small areas of the ground as individual plants, but were accurately detected by the image classifier when they appeared in the image. However, some limitations of the image classifier were also evident from the classification map of the test image. Although VERHE and VIOAR were precisely found in the test image, more areas of the image were assigned to VERHE and VIOAR than actually occurred in the field. These areas were mostly found at boundaries between classes, e.g., at the edges of plant leaves. Probably, ambiguous structures appear in these areas of the image that have a high similarity to another class. Another limitation can be seen in the bottom right part of the image, where a volunteer rapeseed plant appears. This plant species was not learned by the model and was also not included in the background training images.
Since information about this plant was not available to the model, the image classifier tried to assign the plant area to the available class labels, resulting in a split of this image area into the TRZAW, VERHE, and PAPRH (dark blue) class labels.
The optimized model approach for image-level classification presented in this study is fully convolutional and inherits the same features as the conventional ResNet-18 classification model. The optimization successfully increased the evaluation speed for image classification of the UAV imagery, and it is implementable on an embedded system with online evaluation capabilities. Using the NVIDIA Jetson AGX Xavier board, a stable evaluation rate of 2.2 frames per second on the 3264 × 4912 px full-resolution images was reached in this study. Assuming a ground coverage of 2.25 m² for the low-altitude UAV imagery, this results in an area performance of 1.78 ha h⁻¹ for full, continuous crop field mapping. No loss of predictive capability was recorded when moving from 32-bit to 16-bit floating-point computation, but a huge gain in speed was achieved. It can be assumed that a further gain in speed can be achieved by shifting entirely to integer-based computation on the embedded board [45
], which was not tested in this study. Area performance could also be increased with higher camera resolution to become more practical, as Peteinatos et al. [35
] pointed out. However, another approach to enhance area performance could be sparse mapping. In this scenario, the UAV records images with gaps between the flight paths over the field so that faster mapping can be achieved. This can be combined with overview UAV images taken from a higher altitude, which would provide additional information for interpolating the weed map. Geostatistical interpolation methods, such as co-kriging or regression kriging, have been shown to be suitable for integrating UAV imagery as secondary information in the interpolation process [46].
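The area performance quoted above follows directly from the frame rate and the per-image ground footprint; a quick arithmetic check:

```python
def area_performance_ha_per_h(fps, footprint_m2):
    """Mapped area per hour (ha/h) for continuous, gap-free coverage:
    frames/s * m^2/frame -> m^2/s, times 3600 s/h, divided by 10,000 m^2/ha."""
    return fps * footprint_m2 * 3600 / 10_000

# 2.2 fps at 2.25 m^2 ground coverage per image, as reported above.
rate = area_performance_ha_per_h(fps=2.2, footprint_m2=2.25)  # ~1.78 ha/h
```

The same formula shows why higher camera resolution (a larger footprint at equal ground sample distance) or sparse mapping raises the achievable area performance.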
The image classifier was trained, optimized, and tested with the goal of later integration into an online weed detection system for UAV platforms in winter wheat. Thus, neither the training nor the test images were taken under controlled conditions in which, for example, the camera was pointed directly at weed plants or the environmental conditions were controlled such that easy segmentation of individual weed, plant, or background features would have been possible. All images were captured from the copter platform in nadir perspective during low-altitude flights. Some uncertainty was deliberately retained in this study in order to assess the performance of the model under natural conditions. These differences should be taken into account when comparing model performance with other studies. In general, the optimized image classifier of this study, with 94% overall classification accuracy, performed well within the range of studies aiming at classifying mixed weed plants [33
]. In comparison with Pflanz et al. [11
], a higher overall accuracy could be obtained on the same data set. The better performance was particularly striking for the similar weed species VIOAR and VERHE. This might indicate that deep residual networks are better suited than bag-of-visual-words approaches for the classification and discrimination of weed species in UAV imagery. In contrast to segmentation models, which are directly fully convolutional and produce a pixel-level segmentation of a given input image into the different classes [41
], our approach does not need segmentation-level labeling of the training data. This trades off model accuracy against annotation effort to some extent, because patch labeling is not as accurate as segmentation labeling: it also includes labels where wheat or weed plants did not exactly fit into the patch or where background objects were present next to the object of interest. This label noise may therefore also have affected model accuracy.
The UAV approach shown here does not need sophisticated camera technology. The network was trained on images captured by a snapshot RGB camera. In principle, this approach can be duplicated at rather low cost, especially if drone and computation technology drop further in price. In perspective, drone swarms would allow entire fields to be mapped for weeds in minutes. Quickly available weed maps achieved by UAV remote sensing might pave the way to accelerate the adoption of SSWM technology. In previous experiments with an optoelectronic and camera-based weed sensor conducted in farmers’ fields of cereal and pea, average herbicide savings of up to 25.6% could be reached with SSWM [50
]. They might also pave the way for selective weed management using fast-reacting direct injection sprayers [51
]. Gerhards and Christensen [53
] used tractor-mounted bispectral cameras for weed detection. In narrow-row crops (winter wheat and winter barley), they achieved herbicide savings of even more than 90% with application maps depending on the level of weed infestation, by leaving areas unsprayed where a certain treatment threshold was not reached. With the weed detection approach presented here, it should be possible in the future to identify and localize the key weeds that are important for wheat cultivation. This will contribute to adapted and more environmentally compatible crop protection and reduce the input of unwanted amounts of crop protection agents into the environment and the soil.