Weed Mapping with UAS Imagery and a Bag of Visual Words Based Image Classifier

Abstract: Weed detection in aerial images is a major challenge for generating field maps for site-specific plant protection. The requirements might be met with low-altitude flights of unmanned aerial vehicles (UAV), which provide ground resolutions adequate for differentiating even single weeds accurately. The following study proposed and tested an image classifier based on a Bag of Visual Words (BoVW) framework for mapping weed species, using a small unmanned aircraft system (UAS) with a commercial camera on board at low flying altitudes. The image classifier was trained with support vector machines after building a visual dictionary of local features from many collected UAS images. A window-based processing of the models was used for mapping the weed occurrences in the UAS imagery. The UAS flight campaign was carried out over a weed-infested wheat field, and images were acquired at flight altitudes between 1 and 6 m. From the UAS images, 25,452 weed plants were annotated at the species level, along with wheat and soil as background classes, for training and validation of the models. The results showed that the BoVW model allowed the discrimination of single plants with high accuracy for Matricaria recutita L. (88.60%), Papaver rhoeas L. (89.08%), Viola arvensis M. (87.93%), and winter wheat (94.09%) within the generated maps. Regarding site-specific weed control, the classified UAS images would enable the selection of the right herbicide based on the distribution of the predicted weed species.


Introduction
Weeds compete with crops for light, water, space, and nutrients [1][2][3], and can cause substantial crop yield losses [4][5][6]. Since wheat accounts for the second largest food supply to the world's population (i.e., 527 kcal per capita per day, [7]), weed-induced yield losses have a serious impact on the global economy and food security. Consequently, in a world of steadily rising food demand, sustainable agriculture requires the implementation of adequate strategies for weed control. Given its effectiveness, the use of herbicides is still increasing worldwide [8], despite possible negative impacts on the environment, e.g., leaching into groundwater, risks to soil biology, or impacts on biodiversity [9,10]. However, the social awareness of the proper use of herbicides is also increasing [11].
One strategy to reduce herbicide use and the related environmental impact is site-specific weed management (SSWM). The aim of SSWM is to confine the use of herbicides to those areas of a field where weeds are actually present, whilst avoiding a reduction in the efficacy of the overall crop protection application. Since weeds are spatially aggregated in wheat fields, SSWM has the potential to reduce herbicide use considerably.

Regarding these preliminary considerations, the aim of this study was to present an approach for mapping the emergence of weeds in winter wheat using low-altitude UAS imagery. Our intention was to develop an approach that could recognize weed species in images taken from RGB camera systems, which can be installed on UAVs. Since human perception can differentiate and annotate weed species in the super high-resolution imagery, it is plausible to try to recognize weed plants using specific learning approaches that partially mimic human brain functions. We follow an object- and form-based approach relying on local invariant image features. The predictive model we chose is based on a learned image classifier using a Bag of Visual Words (BoVW) framework. The classifier was calibrated on a large weed image data set annotated from natural, cluttered UAS images, which were not segmented before modeling. Overall accuracy for separating weed and crop, as well as for separating different weed species, was tested on an independent image data set. We further tested the influence of altitude (1-6 m), filter intensity, and parameterization of the BoVW model.

Background
In general, weed detection has been studied for decades. Hyperspectral measurements have been used to characterize the reflectance of leaves to differentiate weed and crop plants under laboratory conditions [26,27]. To some extent, these results could be corroborated under semi-controlled or outdoor conditions under ambient lighting [26]. This would enable a pixel-wise detection of weed occurrences and may even reveal weed leaves that only appear below crop plants, depending on the resolution of the images. However, differing lighting conditions strongly affect reflectance spectra, and this has led to unacceptable classification rates [28]. Furthermore, the data handling of large hyperspectral images is complicated and the equipment is expensive. In contrast, commercial cameras are cost-effective and can be easily integrated onto low-cost aerial platforms such as UAVs. However, the broad spectral bandwidths of RGB cameras limit the pixel-wise detection of weeds to conditions where there is a good discrimination between crop and weed by color or RGB indices [29][30][31]. Thus, it makes sense to further incorporate information about leaf shape, texture, and size in the classification.
Consequently, earlier research targeted the global morphology of the leaves, so that features such as contour, convexity, or the first invariant moments were used for classification [32][33][34]. However, in cluttered images, global features have been used to train support vector machines (SVM) with only moderate success. This is because overlapping plants often form complex objects that must be handled with a more specific and complex shape extraction method [35]. In particular cases, active shape models (which use deformable templates for matching leaves) and texture-based models had some success in detecting and discriminating plants in images with overlapping leaves [36,37]. More recently, local invariant features have been used to discriminate weeds from crops. Local invariant features are distinctive points of interest in an image, which are representative of its structure, are found by a keypoint detector, and carry a unique description of their immediate neighborhood. Kazmi et al. [38] showed a very high discrimination of creeping thistle from sugar beet using local invariant features with a BoVW classifier and a maximally stable extremal regions (MSER) detector. A recent study by Suh et al. used BoVW to differentiate sugar beet from volunteer potato [39]. As local image descriptors, they used the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF). They trained models for weed detection with different machine learning classifiers and concluded that the SVM approach showed better classification performance than random forests or neural networks when used with a BoVW. However, their study was based on a rather small annotation data set of 400 images.
For more differentiated plant detection, convolutional neural networks (CNN) are a further approach that has recently gained attention. Unlike BoVW, image features are learned directly by the neural network using convolutional hidden layers [40]. Dyrmann et al. [41] used more than 10,000 RGB images depicting single weed plants to train their own CNNs. Images were taken under controlled light conditions, as well as under field conditions. The background was removed from the plants by color segmentation before modeling. They showed medium to high classification rates for 22 weed species.
To extract weed information from high-resolution UAS imagery, most studies use a user-generated approach following an object-based image analysis (OBIA) workflow. Peña et al. [42] successfully classified weeds in a maize field from UAS imagery collected from a 30 m altitude, using the relative position of the plants with reference to the crop row structure. They used a rule-set algorithm, which comprised spectral, contextual, and morphological information, for image segmentation in different working tasks using the software eCognition. In a later study, the workflow was further improved to differentiate Sorghum halepense from maize crops. Tamouridou et al. [43] were able to map Silybum marianum in a fallow field with a high variety of different weeds. For this purpose, UAS imagery with a ground resolution of 0.04 m was used with a maximum likelihood classification based on green, red, near-infrared, and texture filter information. Thus far, few studies have used local invariant features for the classification of weeds in UAV imagery. Hung et al. [44] identified three weed types in Australia from UAS imagery by learning a filter bank with a sparse autoencoder. Recently, Huang et al. [45] have shown the generation of crop-weed maps in a rice field from UAS imagery with different types of CNNs. Nonetheless, specifically for UAS imagery, BoVW approaches have not been tried for weed detection until now. We postulate that the use of local invariant image features, such as SIFT, might be beneficial because they are both translation and rotation invariant, and are thus robust to changes of flight altitude, flight orientation, or slight deviations from the nadir camera perspective.

Test Site and UAS Flight Mission
The study was conducted in a field in the northern part of Germany near Braunschweig (52°12′54.6″ N, 10°37′25.7″ E). Winter wheat (TRZAW, cv. 'Julius') was grown in this field with a row distance of 12.5 cm and a seed density of 380 seeds·m−2. The soils in this field were gleyic Cambisols with sporadic occurrences of stones. The flight mission was completed during one day in March 2014. At this date, all the weed and crop plants were abundant, and the winter wheat sown in November 2013 had reached the development stage BBCH 23 (Table 1). No weed control measures were performed before the measurements. The flight mission was conducted with a hexacopter system (Hexa XL, HiSystems GmbH, Moormerland, Germany) using modified flight control software based on the Paparazzi UAV project ([46]; CiS GmbH, Rostock-Bentwisch, Germany). For image acquisition, a Sony NEX 5N (Sony Corporation, Tokyo, Japan) point-and-shoot camera was used, which resolved 4912 × 3264 image points on a 23.5 × 15.6 mm sensor surface (APS-C sensor). The attached lens had a fixed focal length of 60 mm (Sigma 2.8 DN, Sigma Corp., Kawasaki City, Japan). The camera was mounted onto a gimbal underneath the copter. The copter was navigated along two routes at altitudes between 1 and 6 m. In total, over 100 waypoints, each located at the center of a 24 × 24 m pattern, were passed (Figure 1). At each waypoint, an image was shot from a nadir perspective. Camera parameters and flight altitudes resulted in a ground sampling distance (GSD) between 0.1 and 0.5 mm. The images spanned an area between 0.1 and 3.7 m² on the ground. We subdivided the field into training and test areas and allocated the acquired images to the corresponding data sets for later use in calibrating and testing the image classifier (Figure 2).

Plant Annotation, Image Preprocessing, Sub-Image Extraction
All UAS images were visually examined and the plants were annotated. Plant annotations were located by setting a point in the center of each plant and were described by relative image coordinates using a Matlab GUI script (ImageObjectLocator v0.35). In total, 25,452 ground truthing locations were annotated in the UAS images and referenced to an annotation database. Around the midpoint of each ground truthing location, a 200 × 200 px square frame was drawn for extracting the sub-images for later model building and validation. Each sub-image showed either a specific plant or soil. The investigation area was subdivided into two training areas and one test area, as given in Figure 2. The exact allocation of the sub-images into train and test sets for each plant species and the soil background is shown in Table 2. Sample images are given in Figure 3. For model optimization, we further divided the train set into calibration and validation sets. Samples were selected randomly from the train set, with 75% for calibration and 25% for validation.
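The frame extraction around an annotated midpoint can be sketched as follows. This is a minimal Python/NumPy illustration, not the original Matlab script; the function name and the convention of relative coordinates in [0, 1] are our assumptions:

```python
import numpy as np

def extract_subimage(image, rel_x, rel_y, size=200):
    """Crop a size x size px frame centered on a relative image
    coordinate (rel_x, rel_y in [0, 1]), clipped at the image borders."""
    h, w = image.shape[:2]
    cx, cy = int(rel_x * w), int(rel_y * h)
    half = size // 2
    x0, y0 = max(cx - half, 0), max(cy - half, 0)
    x1, y1 = min(cx + half, w), min(cy + half, h)
    return image[y0:y1, x0:x1]
```

Frames near the image border are simply clipped, so a plant annotated at an image corner yields a smaller sub-image.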

Calibration of An Image Classifier for Weed Recognition
The image classifier was generated with the BoVW concept, as given in Figure 4. The idea is to break down the content of an image in relation to a generalized set of features that is derived a priori from a large set of images. This generalized set of features, or code words, is referred to as the visual dictionary. The relations between the features of an image and this visual dictionary can be expressed as a frequency vector counting the numbers of unique relations between image features and code words (the generalized features of the visual dictionary). This frequency vector is called the "bag of visual words" [47]. Combined with the known labels of the images, these frequency vectors are then used for calibrating the image classifier. To train an image classifier using the BoVW concept, three steps need to be performed. First, a visual dictionary must be generated from many images. These images may be related to a specific theme, such as generic images of plants. From these images, the most relevant information is extracted using key point extractors and descriptors. Depending on its type, the key point extractor finds key points at edges, corners, homogeneous regions, or blobs in the images, and the key point descriptor describes the immediate local neighborhood of each key point invariant to rotation or scale. The key point descriptions are collected over the entire image set and generalized to a smaller set of code words by finding the centroids with a vector quantization method, such as k-means, to generate the visual dictionary. The second step involves a new, labelled set of images. Using the same key point descriptor with which the visual dictionary was built, the features of each new image are extracted and described. The new key point descriptions are then referenced to the code words of the visual dictionary based on their similarity. The frequencies of the key point allocations are recorded over the code words to generate the BoVW vector.
In the last step, the labelled BoVW vectors are used to train the image classifier, in our case a support vector machine (SVM). For a prediction on a new image, a new BoVW vector must be produced in relation to the codebook, and using the calibrated image classifier, this BoVW vector is assigned to a specific category, such as a specific weed species.
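The first two steps above, building the dictionary and constructing the BoVW vector, can be sketched compactly. The following is a minimal Python/NumPy illustration, not the VLFEAT/Matlab implementation used in the study; in practice the descriptors would come from a SIFT detector, and the function names are ours:

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Cluster local descriptors into k code words (Lloyd's k-means)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each descriptor to its nearest cluster center
        d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the centroid of its members
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bovw_vector(descriptors, vocabulary):
    """Histogram of nearest-code-word assignments, L1-normalized."""
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()
```

The resulting normalized histograms, paired with the image labels, are exactly the feature vectors fed to the SVM in the last step.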
We used only the sub-images (Figure 3) extracted from the UAS images located in the training area (Figure 2) for calibration. The sub-images were processed in three different color spaces before presenting them to the model for comparison, i.e., unprocessed RGB, a plain gray-scale transformation using the Matlab built-in function rgb2gray, and a transformation to HSV color space by rgb2hsv, which applies the algorithm of Smith [48].
The visual dictionaries were built with a modified framework originating from the computer vision feature library for Matlab (VLFEAT, Version 0.9.21) [49]. As the standard algorithm for key point detection and description, this toolbox uses a blob detector based on SIFT [50]. To find the optimal descriptor scale, spatial histograms were calculated for different bins, which varied by pixel widths of 2, 4, and 8. The visual dictionary was built using k-means clustering. This means that the code words constituting the visual dictionary represented the Euclidean centroids of the key point descriptors assigned to one of the n clusters. The mapping of new input data into one of the clusters by looking for the closest center was tested with vector quantization (VQ) and a k-dimensional binary tree (KDTREE). In contrast to the linear approximation process of VQ, KDTREE uses a hierarchical data structure to find nearest neighbors by partitioning the source data recursively along the dimension of maximum variance [49].
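The two quantization strategies can be contrasted in a few lines. In the Python sketch below (ours; SciPy's cKDTree stands in for VLFEAT's KDTREE), both routines return the same nearest code word for each descriptor; they differ only in lookup strategy and cost:

```python
import numpy as np
from scipy.spatial import cKDTree

def quantize_vq(descriptors, vocabulary):
    """Brute-force vector quantization: full distance matrix, then argmin."""
    d = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    return d.argmin(axis=1)

def quantize_kdtree(descriptors, vocabulary):
    """kd-tree nearest-neighbor lookup over the code words."""
    _, idx = cKDTree(vocabulary).query(descriptors, k=1)
    return idx
```

For a dictionary of n code words, the brute-force variant costs O(n) per descriptor, whereas the kd-tree lookup is typically much faster for the low-dimensional spatial-histogram descriptors used here.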
The dictionary size, i.e., the number of code words (or cluster centers), was systematically varied between 200, 500, and 1000. Key point descriptors of the new images were related to the code words with an approximate nearest neighbor search to construct the BoVW vectors. Support vector machines with a linear kernel and different solvers were used to calibrate the image classifier. While the Stochastic Dual Coordinate Ascent (SDCA) solver maximizes the dual SVM objective, Stochastic Gradient Descent (SGD) minimizes the primal SVM objective. We also tested the open-source Large Linear Classification library (liblinear), which implements linear SVMs and logistic regression models using a coordinate descent algorithm [51]. All hyper-parameters of the BoVW framework were systematically permuted, which produced a total of n = 486 trained BoVW models (Table 3). All analyses were done using MATLAB (Version 2016b, The MathWorks, Natick, MA, USA).
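The systematic permutation of hyper-parameters amounts to expanding a full factorial grid. The Python sketch below illustrates the mechanism; the factor names and values are illustrative assumptions drawn from the text, not the complete grid of Table 3 (which yields the full n = 486 models):

```python
from itertools import product

# Hypothetical hyper-parameter grid; the actual factors and levels
# are listed in Table 3 of the study.
grid = {
    "color_space": ["rgb", "gray", "hsv"],
    "bin_width":   [2, 4, 8],
    "dict_size":   [200, 500, 1000],
    "quantizer":   ["vq", "kdtree"],
    "solver":      ["sdca", "sgd", "liblinear"],
}

def permute(grid):
    """Yield one configuration dict per combination of factor levels."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(permute(grid))
# 3 * 3 * 3 * 2 * 3 = 162 combinations under these assumed factors
```

Each configuration dict then parameterizes one dictionary-building and SVM-training run.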

Mapping the Models on the UAS Field Scenes
A sliding window approach (Figure 5a) was used for mapping the BoVW models on a total of n = 34 UAS images of the test zone [52,53]. To accomplish this, the UAS images were segmented into numerous smaller sub-images with block sizes (BS) of 50 and 200 px and overlaps (shift) of 5 and 20 px, respectively. The lower the shift value, the higher the overlap of the sub-images and the finer the sampling rate. The BoVW model was then evaluated over all rectangular sub-images; for each sub-image, a value of 1 was added at every pixel position of the block in the layer of the predicted category. Finally, a quality function mapped the category number into a winner matrix, considering the object's location and the maximum value across the category layers (Figure 5b). The mapping results were validated by n = 8686 ground truthing points within the UAS field scenes from the test set area (see Figure 2).
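The vote-accumulation scheme can be sketched as follows. This is a minimal Python/NumPy illustration of the sliding-window mapping; `classify` stands in for an arbitrary BoVW window classifier and, like the function names, is an assumption of ours:

```python
import numpy as np

def map_image(image_h, image_w, classify, bs=50, shift=5, n_classes=6):
    """Slide a bs x bs window with step `shift`; add one vote per pixel
    of the window to the predicted category's layer, then take the
    per-pixel argmax across layers as the winner matrix."""
    layers = np.zeros((n_classes, image_h, image_w), dtype=np.int32)
    for y in range(0, image_h - bs + 1, shift):
        for x in range(0, image_w - bs + 1, shift):
            c = classify(y, x, bs)            # category of this window
            layers[c, y:y + bs, x:x + bs] += 1
    return layers.argmax(axis=0)              # winner matrix
```

With shift much smaller than bs, every pixel collects votes from many overlapping windows, which smooths the category map and corresponds to the finer sampling rate described above.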

Metrics for Evaluation
The testing of the BoVW models and the generated maps was done using the sub-images and the annotations made in the UAS images, both taken from the test zone. The weed classification approach shown here considered not only a global crop-weed classification, but discriminated between four different weed species, soil, and winter wheat. Nevertheless, this detection task needs a metric to evaluate the quality of decisions. True Positives (TP), True Negatives (TN), False Negatives (FN), and False Positives (FP) were calculated from the 6 × 6 confusion matrices for each class. For example, in the case of MATCH, the correct predictions of the category MATCH are called TP, whilst TN is the sum of all other correct predictions of non-MATCH categories. FP summarizes cases in which a sample is falsely predicted as MATCH whilst in reality it belongs to another category, whereas FN describes cases in which a MATCH sample is falsely predicted as another category. TP, TN, FP, and FN were used to estimate the performance of the BoVW models with the following metrics:

Precision = TP / (TP + FP) (1)

Recall = TP / (TP + FN) (2)

Overall accuracy = (TP + TN) / (TP + TN + FP + FN) (3)

Precision represents the fraction of predicted positives that are truly real positives (Equation (1)). It refers to how well a given model delivers only correctly classified species in the result set, i.e., how many plants predicted in a specific category are truly part of this category. In contrast, recall (also known as sensitivity) is the probability that real positives are predicted as positives (Equation (2)); it reflects how many plants of a category are actually retrieved. The overall accuracy was calculated by Equation (3). All metric calculations were done in R [54] using the 'caret' package [55].
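Equations (1)-(3) extend naturally to a per-class (one-vs-rest) evaluation of the multi-class confusion matrix. A minimal Python/NumPy sketch (ours, not the 'caret' code used in the study):

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, and overall accuracy from a square confusion
    matrix `cm` with rows = true classes and columns = predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correct predictions per class
    fp = cm.sum(axis=0) - tp         # predicted as class c but wrong
    fn = cm.sum(axis=1) - tp         # class c samples that were missed
    precision = tp / (tp + fp)       # Equation (1)
    recall = tp / (tp + fn)          # Equation (2)
    accuracy = tp.sum() / cm.sum()   # Equation (3), over all classes
    return precision, recall, accuracy
```

For the 6 × 6 matrices used here, this yields a precision and recall value per category plus a single overall accuracy.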

Image Classification
The classification results of the BoVW models with different vocabulary sizes of the visual codebook and transformations of the input images are summarized in Table 4. In short, only the results from the SDCA solver and KDTREE quantization are shown here, because the results from LIBLINEAR and VQ did not provide better accuracies and their computing performance was around two times slower (see Table S1). Generally, there was a slight increase in the overall classification accuracy when increasing the vocabulary size from 200 to 500, whereas the increase in accuracy from a vocabulary size of 500 to 1000 was negligible. When transforming the images to HSV color space before subjecting them to the BoVW model, the overall classification accuracies improved. In more detail, the identification of soil and wheat images had a higher precision (>95%) than the identification of weed images (<70%). However, to some extent, weed images were falsely classified as soil or wheat, as can be seen from the recall reaching only 80-85% (Figure 6). Among the weed species, MATCH was most precisely identified in the test image set, whilst VERHE and VIOAR had the worst precision. This is corroborated by the confusion matrix of model 578 (Table 5, vocabulary size 500 and HSV transformation), which shows in more detail that VERHE and VIOAR had a high misclassification rate between each other. Evidently, the BoVW model confused both weed species because of their high similarity in visual appearance. However, the HSV transformation positively influenced the precision and recall of the VERHE and VIOAR models, as well as the identification of the soil background, whereas it had almost no effect, or even an unfavorable effect, on the MATCH and PAPRH models.

BoVW Mapping of the UAS Field Scenes
The BoVW models were used to predict categories continuously over the UAS field scenes to map the occurrences of soil, wheat, and the weed species. We compared the model performances of two different window block sizes (50 and 200 px) and two step sizes (5 and 20 px) of the sliding window computation (Tables 5 and 6 and Table S2). Generally, with finer filtering and scanning, the overall accuracy increased by about 20% for the pixel-exact validation. The precision of the individual categories showed a similar pattern as the models of the image test set. Soil and wheat were identified with the highest precision, greater than 90%, for the models with finer filtering. This showed that a good differentiation between wheat, background, and weeds was generally possible using the mapping approach. However, whilst the soil category had a high precision for both fine and coarse filtering, the recall for soil in the case of coarse filtering was quite low. This means that many wheat and weed images were wrongly classified into the soil category. According to the confusion matrices shown in Tables 5 and 6, this was particularly the case for VERHE and VIOAR. Nevertheless, with finer filtering, the recall for SOIL strongly increased from 38 to 77%, whilst the rate of misclassifications of VERHE and VIOAR remained almost constant. In terms of the four different altitudes of the UAV platform, the overall accuracy increased when choosing a lower altitude. This gradient was stronger with coarser filtering. However, it is noteworthy that in most cases it was not the lowest altitude that yielded the best accuracy, but the altitude between 2.5 and 4.0 m. Amongst the weed categories, except for PAPRH, the precision increased with finer filtering.
MATCH had the highest mapping precision with 87%, whilst the performance of the other weed species was mainly underwhelming, except at the lowest altitude, at which VIOAR and PAPRH obtained high precision values. However, with increasing altitude, the mapping precision of VIOAR and PAPRH strongly decreased to far below 50%.
It is rather hard to match the ground truthing location exactly with one pixel, because of the small real-world dimensions of a pixel. When allowing a small tolerance as a buffer around the prediction location, corresponding to the average plant dimensions, the overall accuracy strongly increased (Table 7) and was greater than 90% for all altitudes with the finer filtering (Table 8), with the highest accuracy at altitudes between 2.5 and 4.0 m (compare Tables 9 and 10). This means that the models predicted the right category well in the immediate neighborhood of the ground truthing location. This was further supported by the prediction maps given in Figure 7 and Figure S3 for one of the UAS images. It can be seen quite strikingly that the soil category shines through the plants like a background matrix, whereas the wheat plants form regular, long stripes within the map because of the seed rows. The different weed species, however, were spatially irregularly located between the wheat rows. The white markings on the maps signify the ground truthing locations of the predicted categories, and the red circles delineate the tolerance buffers allowed around the prediction points. In almost all cases, the correct category lay within the tolerance buffer, showing the high validity of the map.

Table 9. Correlation between precision, recall, altitude (ALT), and window filtering (block size, BS, and shift), where n is the number of ground truthing points from manual annotations of the images.

Discussion
The BoVW model proposed in this study was able to differentiate weeds from crops in the cluttered UAS imagery with similar accuracy (>90%) to research studies using OBIA, supervised classification, or image recognition for weed classification. For example, Peña et al. [42] generated weed-crop maps from UAS imagery using an OBIA approach with an 86% overall accuracy for discriminating three categories of weed coverage in maize crops. Huang et al. [45] were able to differentiate rice from unspecified weeds using a patch-based convolutional neural network with an accuracy of 97%. However, our study focused on differentiating various weed species from winter wheat. When focusing on the image set alone, the study most similar to ours is that of Dyrmann et al. [41]. For wheat (Triticum aestivum), MATCH, PAPRH, VERHE, and VIOAR, they obtained mixed classification accuracies of 92%, 90%, 69%, 45%, and 33%, respectively. In comparison, we achieved similar results of 98%, 66%, 65%, 48%, and 51%, respectively, for the same plant species. However, the results cannot be directly compared, because their study tested 21 plant species and images recorded from a constant altitude, whereas this study tested only 5 plant species but acquired images from an unstable UAV platform at altitudes between 1.0 and 6.0 m.
The use of UAS imagery for model calibration could also have influenced the accuracy of the BoVW model. However, we found it more logical to use a realistic image data set than to rely solely on a data set acquired under laboratory conditions, where illumination, shading, image sharpness, and resolution are held constant. It might be interesting to augment the calibration data set with accurate image data acquired under controlled conditions. The altitude of the UAV platform had an influence on the mapping of VIOAR and VERHE, which might be explained by their small plant sizes. Since the BoVW model significantly depends on a unique collection of key point descriptor information, this information becomes increasingly homogeneous with other categories as the altitude increases. Specifically, owing to the high resemblance of VIOAR and VERHE, the distinctive form-specific information might get lost in the descriptors at higher altitudes because of the lower spatial resolution of the images. Still, VIOAR reached a 79% mapping accuracy when allowing a plant-specific tolerance buffer.
It is difficult to exactly match the midpoint of the ground truthing locations because of the small real-world dimensions of one pixel. Allowing a small, plant-specific buffer was more realistic than the pixel-exact testing, because it integrated the classification results in the nearest vicinity and considered the multiple testing during the scanning process whilst mapping. The visual evaluation of the weed maps corroborated this approach. Furthermore, the model calibration may have been influenced by the rather unbalanced distribution of the annotations. This was partly unavoidable owing to the natural abundance of weeds in the UAS imagery. However, we had at least 1000 annotations in each category. Contrary to neural network approaches, we dispensed with data augmentation because the SIFT features used are rotation and translation invariant. We also did not segment the images before presenting them to the model, because the cluttered UAS images contained dry old leaves, stones, and soil with varying illumination due to micro-relief, which might pose a problem for simple segmentation procedures such as image thresholding. We argue that the implementation of a background category will give better results in the end, especially as more data are integrated subsequently.

Conclusions
With high-resolution UAS imagery, we were able to map weeds at the species level within a winter wheat field. For this research, we annotated a large set of images taken from a UAV platform at different altitudes and calibrated bag of visual words models based on SIFT image features and support vector machine classification. The validation with an independent image set showed a producer's accuracy of 98% for differentiating wheat crops from images with weeds and images with soil background. Among the weed species, Matricaria recutita L. showed the highest classification accuracy, whereas for Viola arvensis M. and Veronica hederifolia L., the models were weaker and showed signs of confusing both species with each other. The choice of the hyper-parameters for the BoVW models had little influence on the outcomes. The weed maps generated by the BoVW models were able to distinguish weeds from crops and single weed species with an accuracy of nearly 90% or greater. For this kind of mapping, high-resolution images capable of resolving even subtle details of crop and weed plants are an important prerequisite. Accordingly, scanning resolution and altitude had a strong effect on the mapping accuracy of the UAS images, and generally, the better the resolution, the better the map quality. Relating to SSWM, our approach would enable the selection of the right herbicide based on the distribution of the weed species in the classified UAS images. Further research will be directed at improving the generalization capabilities of the models by extending the annotation database with image data from different crop-weed associations and regions.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-4292/10/10/1530/s1. Table S1: Precision and recall values of all the models tested with an independent image set; KDTREE and VQ were used as quantizers, and SDCA, SGD, and liblinear as solvers. Table S2: Precision and recall values of all the models tested on UAS field images using the sliding window approach. Figure S3: Category map of the spatial distribution of weeds, winter wheat, and soil.