Deep Learning with Unsupervised Data Labeling for Weed Detection in Line Crops in UAV Images



Introduction
Currently, losses due to pests, diseases, and weeds can reach 40% of global crop yields each year, and this percentage is expected to increase significantly in the coming years [1]. Common weed control practices consist of spraying the entire field with herbicides, a practice that involves significant waste and cost for farmers and that causes environmental pollution [2]. In order to reduce the volume of chemicals while continuing to increase productivity, the concept of precision agriculture was introduced [3,4]. Precision agriculture can be defined as the application of technology for the purpose of improving crop performance and environmental quality [3]. The main goal of precision agriculture is to select the right management practice in order to allocate the right doses of inputs, such as fertilizers, herbicides, seed, fuel, etc., to the right place and at the right time [5]. Weed detection and characterization represent one of the major challenges of precision agriculture, since, in current farming practice, herbicides are typically applied uniformly across fields, despite the fact that weeds exhibit uneven spatial distributions.
In the literature, several methods of weed detection are proposed with different acquisition systems [6][7][8]. Compared to robot or satellite acquisitions, drones have been considered more efficient since they allow a fast acquisition of the field with very high spatial resolution and at a low cost [9,10]. Despite significant advances in unmanned aerial vehicle (UAV) acquisition systems, the automatic detection of weeds remains a challenging problem. In recent years, deep learning techniques have shown a dramatic improvement for many computer vision tasks, and recent developments have shown the importance of these techniques for weed detection [11,12]. They are still not widely used in agriculture, however, as the huge quantities of data required in the learning phase have accentuated the problem of the manual annotation of these datasets. The same problem arises in agriculture data, where labeling plants in a field image is very time consuming. So far, very little attention has been paid to the unsupervised annotation of data to train deep learning models, particularly for agriculture.
In this article, we propose a fully automatic method for weed detection in drone images. Our method is based on the unsupervised collection of a training dataset and convolutional neural networks (CNNs). The proposed method is performed in three steps. First, we detect crop rows and exploit them to detect inter-row weeds. In the second step, these inter-row weeds are used to form the training dataset. Finally, the database created in the previous step is used to generate a model from deep learning. The advantage of this method is that it is adaptive and robust, which means that it is possible not only to use the generated model to detect weeds in a new field with the same crop type, but also to generate a new model by applying this method in a new field without any feature selection methods.
The paper is divided into five parts. In Section 2, we discuss related work. Section 3 presents the proposed method. In Section 4, we comment on and discuss the experimental results obtained. Section 5 concludes the paper.

Related Work
In the literature, several approaches have been used to detect weeds with different acquisition systems. The main approach for weed detection is to extract vegetation from the image using segmentation and then to discriminate crop and weeds. Common segmentation approaches use color and multispectral information to separate vegetation and background (soil and residues). Specific indices are calculated from this information to segment the vegetation effectively [13].
However, weeds and crop are hard to discriminate using spectral information because of their strong similarities. Regional approaches and a spatial arrangement of pixels are preferred in most cases. In [14], the Excess Green Vegetation Index (ExG) [15] and Otsu's thresholding [16] helped to remove background (soil, residues) before performing a double Hough transform [17] in order to identify the main crop lines in perspective images. Then, to discriminate crop and weeds in the segmented image, the authors applied a region-based segmentation method to develop a blob coloring analysis. Thus, any region with at least one pixel belonging to the detected rows is considered to be crop; otherwise, it is weeds. Unfortunately, this technique failed to handle weeds that were close to crop regions. In [18], an object-based image analysis (OBIA) procedure was developed based on a series of UAV images for the automatic discrimination of crop rows and weeds in a maize field. The UAV images were segmented into homogeneous multi-pixel objects using the multiscale algorithm [19]. The large scale highlighted the structures of crop rows and the small scale brought out objects that lay within the crop rows. The authors found that the process was strongly affected by the presence of weeds very close to or within the crop rows.
In [20], 2-D Gabor filters were applied to extract the features, and an artificial neural network (ANN) was used to classify broadleaf and grass weeds. Their results showed that joint space-frequency texture features have the potential for weed classification. In [21], the authors relied on morphological variation and used neural network analysis to separate weeds from a maize crop. Support vector machine (SVM) and shape features were proposed for the effective classification of crops and weeds in digital images in [22]. In their experiment, a total of 14 features that characterized crops and weeds in the images were tested to find the optimal combination of features which provided the highest classification rate. Latha et al. [23] suggested that, in the image, edge frequencies and veins of the crop and the weeds have different density properties (strong and weak edges) that could be used to separate crop from weed. A semi-supervised method was proposed in [24] to discriminate weeds and crop. Otsu thresholding was applied twice on ExG. In the first step, the authors used segmentation to remove the background and then created two classes considered to be crop and weeds. K-means clustering was used to select 100 samples in each class for the training. Classification was then performed with an SVM classifier using the geometric features, spatial features, and first- and second-order statistics extracted in the red, blue, green, and ExG bands. The method proved to be effective in the sunflower field, but less robust in the corn field because of the shade produced by the corn plants. In [25], the authors used texture features extracted from wavelet subimages to detect and characterize four types of weeds in a sugar beet field. Neural networks were applied as the classifier. The use of wavelets proved to be efficient for the detection of weeds, even at a stage of growth of beet greater than six leaves. Bakhshipour and Jafari [26] evaluated weed detection with support vector machine and artificial neural networks in four species of common weeds in sugar beet fields using shape features. In [27], a semiautomatic object-based image analysis (OBIA) procedure was developed with random forests (RF) combined with feature selection techniques to classify soil, weeds, and maize. A common point in all these studies is that the selected features change, in general, from one type of crop to another or from one type of data to another.
Recently, convolutional neural networks have emerged as a powerful approach to computer vision tasks. CNNs [28] have progressed mostly through their successful use in the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC12) and the creation of the AlexNet network in 2012, which showed that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised training [29]. Nowadays, deep learning is applied to several domains to help solve many big data problems, such as computer vision, speech recognition, and natural language processing. In the agriculture domain, CNNs were applied to classify patches of water hyacinth, serrated tussock, and tropical soda apple in [30]. Mortensen et al. [31] used CNNs for semantic segmentation in the context of mixed crops from images of an oil radish plot trial with barley, grass, weed, stump, and soil. Milioto et al. [32] provided accurate weed classification in real sugar beet fields with mobile agricultural robots. Dos Santos Ferreira et al. [11] applied AlexNet to the detection of weeds in soybean crops. In [12], AlexNet was applied to weed detection in different crop fields, such as beet, spinach, and bean in UAV imagery.
The main common point between the supervised machine learning algorithms is the need for training data. For a good optimization of deep learning models, it is necessary to have a certain amount of labeled data. However, as mentioned previously, creating large agricultural datasets with pixel-level annotations is an extremely time-consuming task. Few attempts have been made to develop fully automatic systems for the training and identification of weeds in agricultural fields. In a recent study, Di Cicco et al. [33] suggested the use of synthetic training datasets. However, this technique requires precise modeling in terms of texture, 3D models, and light conditions. In [34], an automatic image processing method was developed to discriminate between crop and weed pixels by combining spatial and spectral information extracted from four-band multispectral images. Image data were captured at 3 m above ground with a camera mounted on a manually held pole. The spatial approach (Hough transform) was used to detect crop rows and to build a training dataset. SVM was applied to the spectral information to perform classification. This method assumes that weeds and crops have different spectral information, which is not always the case in agricultural fields. The success of this kind of method relies on better feature selection which involves human analysis of each particular field. To the best of our knowledge, no studies have been carried out on weed detection in UAV images using automatic labeling of training images and deep learning.

Proposed Method
In modern agriculture, most crops are grown in regular rows, separated by a defined space that depends on the type of crop. Generally, plants that grow outside the rows are considered weeds, commonly referred to as inter-row weeds. Several studies have used this assumption to locate weeds using the geometric properties of the rows [35]. The main advantage of this technique is that it is unsupervised and does not depend on the training data. Based on this hypothesis, we first detected the crop rows, then inter-row vegetation was used to constitute our training database, with data categorized into two classes: crop and weed. Thereafter, we trained CNNs on this database to build a model able to detect the crop and weeds in the images. The flowchart in Figure 1 depicts the main steps of the proposed method. The following sections describe each step in detail.

Detection of Crop Lines
A crop row can be defined as a composition of several parallel lines. The aim of this step is to detect the main line of each crop row. For that purpose, we used the Hough transform to highlight alignments of the pixels. In Hough space, there is one cell per line, which means that cells are aggregated by crop row. The main lines in Hough space correspond to cells which contain the maximum number of votes on each aggregation. Before starting any line detection procedure, generally, preprocessing is required to remove undesirable perturbations, such as shadows, soil, or stones. Here, we used the ExG (Equation (1)) with Otsu adaptive thresholding to discriminate between vegetation and background.
$$\mathrm{ExG} = 2g - r - b \qquad (1)$$
where r, g, and b are the normalized RGB color values. The Hough transform is one of the most widely used methods for line detection, and it is often integrated into tools to guide agricultural machines because of its robustness and ability to handle discontinuous lines caused by missing crop plants in a row or by poor germination [36]. Usually, for crop line detection, the Hough transform is directly applied to the segmented image. This procedure is computationally expensive and depends on the density of vegetation in the crop rows. There is also a risk of line overdetection. We addressed this problem by using the skeleton of each row, which is an approach that showed better performance in our previous study [37]. We found that the Hough transform applied to the skeleton gave a good overall detection rate of crop lines, close to 100%, and a low overdetection rate even for images with high infestation rates. We also discovered that the skeleton provided a good overall representation of the field structure, namely, orientations and periodicity. The Hough transform H(θ, ρ) was computed on the skeleton with a θ resolution of 0.1°, letting θ take values in the range [−90°; 90°], and a ρ resolution of 1.
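As an illustration of the preprocessing step, the following is a minimal sketch of ExG followed by Otsu thresholding, written with the Python standard library only. It assumes the common definition ExG = 2g − r − b on chromatic coordinates; the function names, the 256-bin histogram, and the handling of black pixels are choices made for this sketch, not details taken from the paper.

```python
def excess_green(rgb):
    """Return ExG = 2g - r - b per pixel; rgb is a list of rows of (R, G, B)."""
    exg = []
    for row in rgb:
        out = []
        for R, G, B in row:
            total = R + G + B
            if total == 0:
                out.append(0.0)  # black pixel: no color information (assumption)
                continue
            r, g, b = R / total, G / total, B / total
            out.append(2 * g - r - b)
        exg.append(out)
    return exg


def otsu_threshold(values, bins=256):
    """Otsu's method: choose the threshold maximizing between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_var, best_bin = -1.0, 0
    w0, sum0 = 0, 0.0
    for i, h in enumerate(hist):
        w0 += h
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += i * h
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_bin = var_between, i
    return lo + (best_bin + 1) * width  # upper edge of the best bin


def vegetation_mask(rgb):
    """True where ExG exceeds the Otsu threshold (vegetation), else False."""
    exg = excess_green(rgb)
    t = otsu_threshold([v for row in exg for v in row])
    return [[v > t for v in row] for row in exg]
```

On a toy image mixing brownish and green pixels, `vegetation_mask` marks only the green ones; on real UAV images the result depends on illumination and on how bimodal the ExG histogram is.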
Thanks to a histogram of the skeletons' directions, the most frequently represented angle was chosen as the main orientation θ_lines of crop lines. H(θ, ρ) was normalized to H_norm(θ, ρ) in order to give the same weight to all the crop lines, especially the short ones close to the borders of the image [14]. H_norm(θ, ρ) is defined as the ratio between the accumulator of the vegetation image and the accumulator of a totally white image of the same size, H_ones(θ, ρ). To disregard the small lines created by the aggregation of weeds in the inter-row space, a threshold of 0.1 was applied to the normalized Hough transform. Moreover, in modern agriculture, crops are usually sown in parallel lines with the same inter-row distance, which means that the main peaks corresponding to the crop lines are aligned around an angle in the Hough space with the same gaps. Unfortunately, because of the realities in the agricultural field, lines are not perfectly parallel; thus, peaks in the Hough space have close but different angles, and the inter-row distance is not constant. In order to avoid skipping any crop line during the detection, the lines were kept if they had a peak in Hough space whose angle deviated by no more than 20° from the overall orientation (θ_lines) of the lines. Figure 2 presents the flowchart of the line detection method. However, to avoid detecting more than one peak in an aggregation (i.e., to reduce overdetection), whenever a peak of a crop row was spotted in H_norm(θ, ρ), we identified the corresponding skeleton, and then we deleted the votes of this skeleton in H_norm(θ, ρ) before continuing. All the steps are summarized in Algorithm 1.
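The normalization and filtering described above can be sketched as follows. This is a simplified illustration in pure Python: the function names are hypothetical, ρ is rounded to the nearest integer, and the iterative deletion of the votes of each detected skeleton is deliberately left out.

```python
import math


def hough_accumulator(mask, thetas_deg):
    """Vote for lines rho = x*cos(theta) + y*sin(theta) over the set pixels."""
    acc = {}
    for y, row in enumerate(mask):
        for x, on in enumerate(row):
            if not on:
                continue
            for t in thetas_deg:
                rad = math.radians(t)
                rho = round(x * math.cos(rad) + y * math.sin(rad))
                acc[(t, rho)] = acc.get((t, rho), 0) + 1
    return acc


def normalized_hough(skeleton, thetas_deg):
    """Divide the skeleton accumulator by that of an all-white image."""
    h, w = len(skeleton), len(skeleton[0])
    ones = [[1] * w for _ in range(h)]
    acc = hough_accumulator(skeleton, thetas_deg)
    acc_ones = hough_accumulator(ones, thetas_deg)
    # Every cell voted by the skeleton is also voted by the white image.
    return {cell: votes / acc_ones[cell] for cell, votes in acc.items()}


def candidate_crop_lines(h_norm, theta_lines, min_score=0.1, max_dev=20.0):
    """Keep cells above the 0.1 threshold and within 20 degrees of theta_lines."""
    return sorted(
        cell for cell, score in h_norm.items()
        if score > min_score and abs(cell[0] - theta_lines) <= max_dev
    )
```

For a 5 × 5 skeleton with two full vertical lines at x = 1 and x = 3, the cells (θ = 0, ρ = 1) and (θ = 0, ρ = 3) reach a normalized score of 1.0, so a short line near the image border can score as high as a long one, which is the purpose of the normalization.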
Algorithm 1. Crop line detection (only the final step of the listing is recoverable: "The detected line is a crop line").

Unsupervised Training Data Labeling
The unsupervised training dataset annotation is based on the detected lines obtained following the procedure in the previous section. Assuming that the lines detected are mainly at the center of the crop rows (Figure 3), we applied a mask to delimit the crop rows. Hence, vegetation overlapped by the mask corresponds to the crop. This mask was obtained from the intersection of superpixels formed by the simple linear iterative clustering (SLIC) algorithm [38] and the detected lines. SLIC was chosen since it is simple and efficient in terms of the quality of results and the computation time.
It is an adaptation of the k-means approach for superpixel generation, with a control for the size and compactness of the superpixels. SLIC creates a local grouping of pixels based on their spectral values, which are defined by the values of the CIELAB color space, and their spatial proximity [11,38]. A higher value of compactness makes superpixels more regularly shaped while a lower value makes superpixels adhere better to boundaries, making them irregularly shaped. Since here the goal was to create a mask around the detected crop lines that is able to delimit the crop rows, we chose a compactness of 20 because it was found that the process was less sensitive to variations of color caused by the effects of light and shadow. Figure 4 shows examples of images segmented with different sizes of superpixels.
Once the crop has been identified, the next step consists of detecting the inter-row weeds. An inter-row weed is defined as a plant growing between the crop lines. To detect weeds that lie in inter-rows, we applied a blob coloring algorithm. Hence, any region that does not intersect with the crop mask is regarded as a weed. Also, vegetation pixels which belong neither to the crop mask nor to the inter-row weeds are attributed to the potential weeds. Figure 5 presents the mask of crop, inter-row weeds, and potential weeds. To construct the training dataset, we extracted patches from the original images using positions of the detected inter-row weeds and crops. For weed samples, we applied bounding boxes to each segmented inter-row weed. For the crop samples, a sliding window was applied to the input image using positions relative to the segmented crop lines. Thus, for a given position of the window, if it intersects the binary mask and there are no inter-row weed pixels, it is attributed to the crop class. Generally, the crop class has many more samples than the weed one. In cases where there were few inter-row weed samples but a large number of potential weeds, as in Figure 5, we included the latter in the training dataset of weeds. Hence, the window which contained only potential weeds was labeled weeds. On the other hand, windows which contained crop and potential weeds, where we had more potential weeds than crop, were not retained.
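The window labeling rules can be summarized in a short sketch. It assumes a per-pixel label map with four hypothetical codes (background, crop, inter-row weed, potential weed) and compares pixel counts inside each window; windows containing inter-row weed pixels are skipped here because those weeds are sampled separately through their bounding boxes. This is one reading of the rules above, not the authors' implementation.

```python
from collections import Counter

# Hypothetical label codes for the per-pixel map.
BACKGROUND, CROP, INTER_ROW_WEED, POTENTIAL_WEED = 0, 1, 2, 3


def label_window(counts):
    """Return 'crop', 'weed', or None when the window is not retained."""
    n_crop = counts[CROP]
    n_potential = counts[POTENTIAL_WEED]
    if counts[INTER_ROW_WEED] > 0:
        return None  # inter-row weeds are extracted with bounding boxes instead
    if n_crop > 0:
        # Mixed window: drop it when potential weeds outnumber crop pixels.
        return None if n_potential > n_crop else "crop"
    if n_potential > 0:
        return "weed"  # window containing only potential weeds
    return None  # background only


def sliding_window_labels(label_map, size, stride):
    """Slide a size x size window and collect (top, left, label) samples."""
    h, w = len(label_map), len(label_map[0])
    samples = []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            counts = Counter(
                label_map[y][x]
                for y in range(top, top + size)
                for x in range(left, left + size)
            )
            label = label_window(counts)
            if label is not None:
                samples.append((top, left, label))
    return samples
```

In practice the window would be 64 × 64 pixels, as in the experiments; a tiny window is enough to check the rules on a toy map.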

Crop/Weed Classification Using Convolutional Neural Networks
CNNs are part of the deep learning approach and have shown an impressive performance in many computer vision tasks [39]. CNNs are made up of two types of layers: the convolutional layers which extract different characteristics from images, and the fully connected layers based on the multilayer perceptron to perform classification. The number of convolutional layers depends on the classification task and also the number and the size of the training data.
In this work, we used a Residual Network (ResNet). This network architecture was introduced in 2015 [40]. It won the ImageNet Large Scale Visual Recognition Challenge 2015 with 152 layers. However, given the size of our data, we used the ResNet with 18 layers (ResNet18) described in [40] because it achieved a better result than AlexNet and VGG13 [41] in the ImageNet challenge.
However, given the number of parameters to be updated in ResNet18 and the data we had at our disposal, we decided to use transfer learning. Transfer learning aims to extract knowledge from one or more source tasks and applies this knowledge to a target task. In other words, transfer learning is a machine learning method in which a model developed for one task is reused as the starting point for a model in a second task [42]. Transfer learning is a widely used approach in deep learning; models pretrained on a dataset such as ImageNet are used as the starting point to solve another problem in computer vision (weed and crop classification, in our case). Due to the large number of categories and images in ImageNet, some studies have shown that transfer learning of networks trained with the ImageNet database could be successfully used [31,43]. Thus, we performed a transfer learning technique called fine-tuning to train the networks with our data. Fine-tuning means that we started with the learned features from the ImageNet dataset, then we truncated the last layer (softmax layer) of the pretrained network and replaced it with a new softmax layer that is relevant to our own problem. Here, the thousand categories of ImageNet were replaced by two categories (crop and weeds).

Feature Extraction
Although color indices make sense in distinguishing between vegetation and background, they become less effective when applied to classify plant species. Sometimes, the color of weeds and crop leaves looks almost the same. Moreover, the result becomes unreliable under different lighting conditions. To solve this problem, several image features were analyzed. We computed a series of statistical features, shape features, and texture features which have been selected in other works [23][24][25][44]. A procedure for feature selection was then used to analyze the most suitable features.

Color Features
The color features are means and standard deviations of the three RGB image bands and of the ExG image. In order to make the color features consistent with different lighting levels, each color band was normalized by the sum of all three color bands.

Geometric Shape Features
Based on [22], three parameters, namely, Form Factor, Elongatedness, and Solidity, were computed as geometric features. We named the feature vector created by these three Geo3.
$$\text{Form Factor} = \frac{4\pi \cdot \text{area}}{\text{perimeter}^2} \qquad (2)$$
$$\text{Elongatedness} = \frac{\text{area}}{\text{thickness}^2} \qquad (3)$$
$$\text{Solidity} = \frac{\text{area}}{\text{convex area}} \qquad (4)$$
Here, area is defined as the number of pixels with a value of '1' in the binary image. Perimeter is defined as the number of pixels with a value '1' for which at least one of the eight neighboring pixels has the value '0', implying that the perimeter is the number of border pixels. Convex area is the area of the smallest convex hull that covers all the plant pixels in an image.
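A minimal sketch of the area and perimeter definitions given above, together with the three ratios, is shown below. It assumes the usual forms Form Factor = 4π·area/perimeter², Elongatedness = area/thickness², and Solidity = area/convex area; thickness and convex area are taken as inputs because their computation is not detailed here, and pixels outside the image are treated as background, which is an assumption of this sketch.

```python
import math


def area_and_perimeter(mask):
    """Count plant pixels and border pixels (8-neighborhood) in a binary mask."""
    h, w = len(mask), len(mask[0])
    area = perimeter = 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            area += 1
            # A border pixel has at least one of its eight neighbors equal to 0;
            # positions outside the image count as 0 (assumption).
            is_border = any(
                not (0 <= y + dy < h and 0 <= x + dx < w) or not mask[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dy, dx) != (0, 0)
            )
            perimeter += is_border
    return area, perimeter


def form_factor(area, perimeter):
    return 4 * math.pi * area / perimeter ** 2


def elongatedness(area, thickness):
    return area / thickness ** 2


def solidity(area, convex_area):
    return area / convex_area
```

For a 3 × 3 square of plant pixels centered in a 5 × 5 image, the area is 9 and the perimeter is 8, since only the central pixel has no background neighbor.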

Edge Density
Edge density exploits the fact that the edge frequencies and veins of crop and weeds have different density properties (strong and weak edges), which can be used to separate crop from weed [23]. In the remainder of this article, we denote edge density as Edensity. It is defined as:
$$\text{Edensity} = \frac{\text{edge area}}{\text{area}} \qquad (5)$$
Here, area is defined as the number of pixels with a value '1' in the binary image. The image edges were computed by the Sobel edge detection method. All the pixels marked as edge were summed, and their sum is called edge area.
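The edge density computation can be sketched as follows, reading the definition above as the ratio of Sobel edge pixels to plant pixels. The magnitude threshold used to mark a pixel as edge is a hypothetical parameter of this sketch, since the text does not specify one, and image border pixels are left unmarked.

```python
import math


def sobel_edges(gray, threshold):
    """Mark interior pixels whose Sobel gradient magnitude exceeds threshold."""
    h, w = len(gray), len(gray[0])
    edges = [[False] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            edges[y][x] = math.hypot(gx, gy) > threshold
    return edges


def edge_density(gray, mask, threshold=100):
    """Edge area divided by plant area (number of '1' pixels in mask)."""
    edges = sobel_edges(gray, threshold)
    edge_area = sum(e for row in edges for e in row)
    area = sum(1 for row in mask for v in row if v)
    return edge_area / area if area else 0.0
```

A patch with a single sharp vertical transition yields a small edge density, while a heavily veined leaf would be expected to produce a larger one.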

Histogram of Oriented Gradients (HOG)
Contour attributes generally correspond to the histogram of the gradient orientation. HOG counts the occurrences of gradient orientation in localized regions in an image. It is fast, compared to the SIFT algorithm (because no smoothing is computed); it is processed on a large number of cells uniformly spaced in the image and overlapping. Thanks to normalization of the local contrast, it is invariant to conditions of illumination. HOG was initially used for pedestrian detection [45], but it has proved its robustness for many other issues. In agriculture, it is used to identify plant leaves [46,47]. These experiments are inspiring and indicate that we can combine the features extracted by HOG methods for the classification of leaves. The principle of HOG is the division of the image into small regions called cells. For each cell, a histogram of the gradient is computed. Depending on the gradient orientation, each cell is discretized into angular bins. Finally, adjacent cells are merged into blocks and then normalized.
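To make the cell and block steps concrete, here is a minimal sketch of a single-cell orientation histogram and an L2 block normalization. The nine unsigned bins over 0° to 180°, the centered-difference gradient, and the magnitude-weighted voting are common HOG conventions assumed for this sketch; the settings actually used in the experiments are not stated.

```python
import math


def cell_histogram(gray, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations in a cell."""
    h, w = len(gray), len(gray[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]
            gy = gray[y + 1][x] - gray[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += mag
    return hist


def l2_normalize(block):
    """Normalize the concatenated histograms of a block to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in block))
    return list(block) if norm == 0 else [v / norm for v in block]
```

A cell whose intensity increases only from left to right puts all of its votes in the first bin, and the normalization removes the dependence on the overall contrast of the block.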

Haralick Texture
The co-occurrence matrix makes it possible to obtain the occurrence frequency of a pattern of two pixels separated by a distance d along a direction θ. In [48], the authors proposed 14 features that can be computed on this matrix. These features have the aim of highlighting some visual characteristics, statistics, the randomness of the gray level distribution, and the linear dependence of the gray levels on a neighborhood of pixels (homogeneity, coarseness, periodicity, smoothness, etc.). In 2012, the Haralick method was applied to extract texture features in a classification of plant species [49]. Here, we used six Haralick features, namely, autocorrelation, contrast, correlation, dissimilarity, energy, and entropy. More details of these features can be found in [48,50,51].
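As an illustration, the sketch below builds a normalized co-occurrence table for one pixel offset and derives four of the six features listed above. The offset is given as (dx, dy) rather than as a distance and an angle, the table is not symmetrized, and energy is computed as the sum of squared probabilities; some libraries report its square root instead, so these are assumptions of this sketch.

```python
import math
from collections import Counter


def cooccurrence(gray, dx=1, dy=0):
    """Normalized frequency of gray-level pairs (i, j) at offset (dx, dy)."""
    h, w = len(gray), len(gray[0])
    counts = Counter()
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                counts[(gray[y][x], gray[ny][nx])] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()} if total else {}


def texture_features(p):
    """Contrast, dissimilarity, energy, and entropy of a co-occurrence table."""
    return {
        "contrast": sum((i - j) ** 2 * v for (i, j), v in p.items()),
        "dissimilarity": sum(abs(i - j) * v for (i, j), v in p.items()),
        "energy": sum(v * v for v in p.values()),
        "entropy": -sum(v * math.log(v) for v in p.values()),
    }
```

For a two-level image whose rows read 0, 0, 1, 1, the horizontal offset produces three equally likely pairs, so contrast, dissimilarity, and energy all equal 1/3 and the entropy equals ln 3.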

Gabor Wavelets
This method performs joint space-frequency analysis. The short-time Fourier transform with a Gaussian window is called the Gabor transform. It is able to preserve both local and global information in the image and is particularly useful for analyzing texture images containing highly specific frequency or orientation characteristics [52,53]. In 2003, Tang et al. fixed the filter orientation at 90° for the classification of broad and narrow leaves [20]. By analyzing the separation between classes of each feature, they concluded that a filter bank with four frequency levels, from 4 to 7, was suitable for the classification task. Therefore, we chose from 4 to 7 as the frequencies and 0°, 45°, and 90° as the orientations. We generated 12 Gabor features.

Support Vector Machine (SVM)
The ideal foundation of a good classification system is to have a fast classifier which avoids overfitting and is able to respond to multi-class problems, to separate classes with a large gap or margin, and to manage large feature vectors. In this study, we applied the support vector machine (SVM), also known as a large-margin classifier [54]. It is one of the most successful machine learning methods [55,56]. Its popularity comes not only from the fact that it provides class separation with a very large margin if provided with data in two classes but also because it is suitable for linear and nonlinear cases [57].

Random Forest (RF)
Random forest [58] is a meta-classifier, which combines several weak classifiers to form a strong one. RF easily handles multi-class problems, and it is robust to large feature vectors and has a very low risk of overfitting. It is used in several applications, such as point tracking in video surveillance, medical imaging, and games in Microsoft's Kinect. In addition, RF has been shown to be ideally suited for classifying high-resolution UAV data [59]. It is structured like a real forest with trees, where each tree has roots, branches, and leaves. Trees correspond to the different classifiers. The first node corresponds to the root of the tree (the point of entry of our data), each node is then separated into intermediate nodes, and each leaf corresponds to a terminal node where the final decision is stored. The forest trees are built using bagging or bootstrap aggregating [60]. The principle of bagging is the construction of each tree by selecting a subset of n observations among the N learning data (n < N) obtained by random sampling with replacement. The objective is to get trees as different as possible, or, in other words, to obtain uncorrelated trees, because the more different the trees are, the more robust the forest is. The other advantage of bagging is that it makes it possible to estimate the prediction error of the forest by using "out-of-bag" (OOB) data or data not used during the construction of trees.
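The bagging step can be illustrated with a few lines of standard-library Python. The function name and the use of sample indices are choices made for this sketch; it only shows how one tree's in-bag sample and its out-of-bag complement are obtained, not how the trees themselves are grown.

```python
import random


def bootstrap_split(n_samples, n_draw, rng):
    """Draw n_draw indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_draw)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


# One bootstrap sample per tree: each tree sees a different subset, and its
# out-of-bag indices can be used to estimate the prediction error.
rng = random.Random(0)
bags = [bootstrap_split(100, 80, rng) for _ in range(5)]
```

Because sampling is done with replacement, an in-bag list usually contains duplicates, so each tree leaves out more than the N − n observations one might first expect.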

Results and Discussion
Experiments were conducted on two different fields of bean and spinach (Figure 6). Images were acquired by a DJI Phantom 3 Pro drone that embeds a 36-megapixel (MP) RGB camera at an altitude of 20 m. This acquisition system produces very high resolution images with a spatial resolution of approximately 0.35 cm.
To build the unsupervised training database, we selected two different parts of each field.The first one (Part1) was used to collect training data, and the other one (Part2) was used for test data collection.
To create the crop binary mask after line detection, the superpixels' compactness was set at 20 and the number of superpixels was set to 0.1% × N, where N = 7360 × 4912 pixels, i.e., about 36,000 superpixels (Figure 4b). In this experiment, we used a 64 × 64 window to create the weed and crop training databases. This window size provides a good trade-off between plant type and overall information. A small window is not sufficient to capture the whole plant and can lead to confusing crops and weeds because, in some conditions, crop and weed leaves have the same visual characteristics. On the other hand, too large a size presents a risk of having crop and weeds in the same window. In the bean field, the weeds present are thistles and young potato sprouts from a previous sowing on the same field. This field has few inter-row weeds, so we decided to include potential weeds directly in the weed samples. After applying the unsupervised labeling method, the number of samples collected was 673 for weeds and 4861 for crop. Even with potential weeds included, the collected samples were unbalanced. To address this problem, we carried out data augmentation. Hence, we performed two contrast changes, smoothing with a Gaussian filter, and three rotations (90°, 180°, 270°). A strong heterogeneity in the fields can often be encountered from one part of the field to another one. This heterogeneity may correspond to a difference in soil moisture, presence of straw, etc. In order to make our models robust to background, we mixed samples with and without background. Samples without background were obtained by applying ExG followed by Otsu thresholding on previously created samples (Figure 7). We evaluated the performance of our method by comparing models created by data labeled in supervised and unsupervised ways.
The supervised training datasets were labeled by human experts. A mask was applied manually to the pixels of weeds and crop. Figure 8 presents weeds delineated in red by an expert. The supervised data collected were also unbalanced, so we carried out the same data augmentation procedure performed on the unsupervised data. The total number of samples is shown in Table 1. The spinach field is more invaded by weeds (mainly thistles) than the bean field. Altogether, 4303 samples of crop and 3626 samples of weed were labeled in an unsupervised way. Unlike for the bean field, we obtained less unbalanced data. Therefore, the only data augmentation applied was adding samples without background. The same processing was applied to the supervised data. Table 1 presents the number of samples.
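The rotation part of the augmentation is easy to sketch on a patch stored as a list of rows. Only the three rotations are shown; the two contrast changes and the Gaussian smoothing are omitted because their parameters are not given, and the function names are hypothetical.

```python
def rotate90(patch):
    """Rotate a patch (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def rotations(patch):
    """Return the 90, 180, and 270 degree rotations of a patch."""
    r90 = rotate90(patch)
    r180 = rotate90(r90)
    r270 = rotate90(r180)
    return [r90, r180, r270]


# Each original weed patch yields three extra samples from rotations alone.
weed_patch = [[1, 2], [3, 4]]
augmented = [weed_patch] + rotations(weed_patch)
```

Applied to the minority class only, such transformations reduce the imbalance between weed and crop samples without collecting new images.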

Results and Discussion
After the creation of both weed and crop classes, 80% of the samples were selected randomly for the training, and the remaining ones were used for validation. Table 1 presents the training and validation data used on each field.
For fine-tuning, we tested different values of the learning rate. The initial learning rate was set to 0.01 and updated every 200 epochs. The update was done by dividing the learning rate by a factor of 10. Figure 9 shows the evolution of the loss function during training for supervised and unsupervised datasets for spinach and bean fields. From these figures, it can be seen that the validation loss curves decrease during about the first 80 epochs before increasing and converging (behavior close to overfitting). Overfitting was less pronounced in the supervised labeled bean data. The best models were obtained during the first learning phase with a learning rate of 0.01.
The performance of the models was assessed on test ground truth data collected in Part2 in a supervised way on each field; Table 2 presents the samples. The performance of the classification results is illustrated with receiver operating characteristic (ROC) curves.
The ROC curves (Figure 10) show that the AUCs (area under the curve) are close to or greater than 90% and that both types of learning data provide good results that are comparable. For both fields, a false positive rate of 20% provides a true positive rate greater than 80%. The differences in performance between supervised and unsupervised data labeling are about 6% in the bean field and about 1.5% in the spinach field. The performance gap in the bean field can be explained by the sparsity of inter-row weeds.
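The learning rate schedule described above amounts to a step decay, and keeping the model with the lowest validation loss is one simple way to pick the best model when the validation curve starts rising again. Both helpers below are illustrative; the function names are hypothetical and the selection rule is an assumption of this sketch.

```python
def step_decay(epoch, initial_lr=0.01, step=200, factor=10.0):
    """Learning rate at a given epoch: divided by `factor` every `step` epochs."""
    return initial_lr / (factor ** (epoch // step))


def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)
```

With these defaults the rate stays at 0.01 for epochs 0 to 199 and drops to 0.001 at epoch 200, so a minimum of the validation loss around epoch 80 falls in the first phase.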
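For readers who want to reproduce this kind of evaluation, the sketch below computes ROC points and the AUC by the trapezoidal rule from classifier scores. It assumes weed is the positive class (label 1) and that both classes are present; the function names are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, ties grouped."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Lower the decision threshold past all samples sharing this score.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

The AUC equals the fraction of (weed, crop) pairs in which the weed sample receives the higher score, which gives a quick sanity check on small examples.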
Both fields are infested mainly by thistles; we tested the robustness of our models by exchanging the samples of weeds from the bean field with those of the spinach field.
In Figure 11, the results obtained show that, despite the small number of samples harvested in the bean field, those data are suitable for the spinach field, and the model created with unsupervised labeling in the spinach field is most sensitive to the presence of young potato sprouts among bean weed samples.In the bean field, the areas under the curve (AUCs) are 91.37% for unsupervised data and 93.25% for supervised data.In the spinach field, the areas under the curve (AUC) are 82.70% for unsupervised data and 94.34% for supervised data.Supervised and unsupervised data mean, respectively, data labeled in supervised and unsupervised ways.SVM and RF were applied to features extracted from the datasets (Table 1).RF was performed with 200 trees.As for Resnet18, models were created based on data labeled in supervised and unsupervised ways.In order to assess the effectiveness of the selected features, we applied them separately, and to select the set of features that gives the optimal classification result, we combined them.Figures 12  and 13 show that the color, Haralick, and geometric features give the best results.In the spinach field, the abundance of thistles with a different color of leaf from that of spinach at a certain level of growth explains the effectiveness of the color features.In the bean field, the color features were less effective than the texture features (Haralick) in both datasets since we have young potato shoots from the previous sowing among the weeds, and their color is almost the same as that of the bean plants.
With SVM, when the features are combined, the improvement is less than 2% for the data labeled in a supervised manner and about 10% for the unsupervised data in the spinach field. In the bean field, the same remark applies to the data collected in a supervised manner; for the unsupervised data, no improvement was found. Another remark is that the best features are not the same from one type of data to another. We also noticed that the selected features are not suitable to detect the weeds present in the bean field. With RF, the feature selection procedure only increased performance by about 1% for both spinach datasets. In the bean field, an improvement of about 1% was observed with the data labeled in an unsupervised way and about 5% with the data labeled in a supervised way.
Tables 3 and 4 present the results of SVM and RF with the best selected features. As reported in Table 3, ResNet18 provides much better results than SVM and RF in the bean field, with a performance difference greater than 20%. However, in the spinach field, the results obtained are comparable, and the results of ResNet18 are sometimes lower than those of SVM and RF (Table 4). This performance of ResNet18 can be explained by the small amount of data used for training in the spinach field. For deep learning algorithms, the more data we have, the better the algorithm learns. We can also note that the performance of the models trained on the two types of collected data is comparable for the three classification methods. The maximum difference is about 6% in both fields.
Based on the results, it can be concluded that even if we manage to select the most suitable features to identify weeds in a field, these features may not be adapted to another field with a different type of crop. The results also show that the features considered best by one classifier may not be the best for another classifier. Moreover, from one year to another, new types of weeds may appear in the fields, and the level of growth of the plants can sometimes cause confusion between weeds and crops, which requires a new collection of weed/crop data and a new selection of features. Thus, for an efficient classification, it would be interesting to use a tool capable of automatically generating relevant samples and features to detect weeds, hence the interest in using deep learning with unsupervised data labeling.

Weed Detection
In order to detect weeds in an entire UAV image, we applied an overlapping window for weed detection. For each position of the window, the CNN models provide the probability of the plants being weeds or crops. Thus, the center of the extracted image is marked by a colored dot according to the probabilities. Blue, red, and white dots mean, respectively, that the extracted image is identified as crop, weed, and an uncertain decision (Figure 14a,c). Uncertain decision means that the two probabilities are very close to 0.5. Thereafter, we used crop line information and the previously created superpixels to classify all the pixels of the image. On each superpixel, we identify which dot color is dominant. A superpixel is classed as crop or weed if the majority of dots are blue or red, respectively. For superpixels where the majority of dots are white, we used crop line information. Hence, superpixels which are in the crop lines are regarded as crop and the others are weeds. The superpixels created in the background are removed.
Figure 14b,d present the classification results in parts of the spinach and bean fields. It can be seen that inter-row and intra-row weeds are slightly overdetected. Overdetections are mainly found at the edges of the crop rows where the window cannot overlap the whole plant. Some weed pixels are not entirely in red because, after applying the threshold to the ExG, the parts of these plants which are less green are considered soil. However, the unsupervised data collection method strongly depends on the efficiency of the crop line detection method and also on the presence of weeds in the inter-row. The line detection approach used here has already shown its effectiveness in beet and corn fields in our previous work [37]. With the bean field, we found that even if a field does not have a lot of samples of weeds in the inter-row, it is possible to create a robust model with data augmentation. By using a deep learning architecture such as ResNet18, robust models can be created for the classification of weeds in bean or spinach fields with supervised or unsupervised data labeling.
This work can be compared to recent studies which also aim to develop unsupervised detection approaches. A semi-automatic object-based image analysis (OBIA) procedure was developed with random forest combined with feature selection techniques to classify soil, weeds, and maize in [27]. An overall accuracy of 0.945 and a Kappa value of 0.912 were obtained. This method was applied to only one field, but we have found that the feature selection approach, even with random forest, is not robust when the field or crop type changes. In [34], an automatic image processing method was developed to discriminate between crop and weed pixels on images acquired by a camera mounted on a manually held pole. The authors combined spatial and spectral information extracted from four-band multispectral images. SVM was applied to the spectral information to perform the classification. On all images, the mean value of the weed detection rate was 89% for their spatial and spectral combination method. This method assumes that weeds and crops have different spectral information, which is not always the case in farm fields. Di Cicco et al. [33] used synthetic training datasets. However, this technique requires a precise modeling in terms of texture, 3D models, and light conditions. Overall, this illustrates that the main advantage of our method compared to the ones that use unsupervised labeling is that it is fully automatic and that no feature selection is required.
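The dot labeling and superpixel voting rule described at the start of this section can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation used here: the width of the band around 0.5 (`margin`), the handling of ties and of superpixels without dots, and all function and constant names are hypothetical choices that the text does not specify.

```python
from collections import Counter

CROP, WEED, UNCERTAIN = "crop", "weed", "uncertain"


def label_dot(p_weed, margin=0.05):
    """Label one window center from the CNN weed probability.

    margin is an assumed half-width for "very close to 0.5"; the text
    does not give a numerical value.
    """
    if abs(p_weed - 0.5) <= margin:
        return UNCERTAIN  # white dot
    return WEED if p_weed > 0.5 else CROP  # red or blue dot


def classify_superpixel(dot_labels, on_crop_line):
    """Classify a vegetation superpixel from the dots that fall inside it.

    A clear crop or weed majority decides the class. When uncertain dots
    dominate, the crop line information decides: superpixels on a crop line
    are regarded as crop, the others as weed. Ties and superpixels without
    any dot are treated like the uncertain case (an assumption).
    """
    counts = Counter(dot_labels)
    if counts:
        top = max(counts.values())
        winners = [label for label, n in counts.items() if n == top]
        if len(winners) == 1 and winners[0] != UNCERTAIN:
            return winners[0]
    return CROP if on_crop_line else WEED


# Hypothetical example: three windows fall inside one inter-row superpixel.
example_dots = [label_dot(p) for p in (0.91, 0.52, 0.78)]
example_class = classify_superpixel(example_dots, on_crop_line=False)
```

Background superpixels are assumed to have been removed beforehand, so the function only handles superpixels that contain vegetation.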
Currently, our method has only been evaluated on images acquired in the visible spectrum. As the line detection approach depends on the background segmentation, we intend to adapt the proposed method to multispectral images in future work. We could then implement a robust background segmentation algorithm using the Normalized Difference Vegetation Index (NDVI) [61]. Beyond segmentation, the multispectral bands could also provide additional information to distinguish crops from certain weed species.
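For reference, NDVI is computed per pixel from the near-infrared and red reflectances as (NIR - Red) / (NIR + Red). A minimal sketch of an NDVI-based background mask is given below; the threshold of 0.3, the small `eps` term, and the function names are assumptions made for illustration and would need to be tuned on real multispectral data.

```python
def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index for one pixel.

    nir, red: non-negative reflectance values.
    eps avoids a division by zero on fully dark pixels (an assumption).
    """
    return (nir - red) / (nir + red + eps)


def vegetation_mask(nir_band, red_band, threshold=0.3):
    """Return a boolean mask (list of rows) that is True on vegetation pixels.

    threshold is a hypothetical value separating vegetation from soil.
    """
    return [
        [ndvi(n, r) > threshold for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical 1 x 2 image: a vegetation pixel followed by a soil pixel.
example_mask = vegetation_mask([[0.5, 0.2]], [[0.1, 0.2]])
```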

Conclusions
In this paper, we propose a novel fully automatic learning method using convolutional neural networks with unsupervised training dataset collection for weed detection in UAV images acquired from bean and spinach fields. The results obtained show a performance close to that of supervised data labeling. The area under the curve (AUC) differences are 1.5% in the spinach field and 6% in the bean field. Supervised labeling is an expensive task for human experts, and given the small differences in accuracy between supervised and unsupervised labeling, our method can be a better choice for the detection of weeds, especially when crop rows are spaced well apart. The proposed method is interesting in terms of flexibility and adaptivity, since a model can be easily trained on a new dataset. We also found that the ResNet18 architecture can extract useful features for weed classification in bean or spinach fields with data labeled in a supervised or unsupervised manner. In addition, the developed method could be a key technique for online weed detection with UAV.
As future work, we plan to use multispectral images because, in some conditions, multispectral bands such as red edge or near infrared could help to distinguish plants, even if they have similarities in the visible spectrum and leaf shape. With multispectral information, we also expect to improve the background segmentation. To enhance the simplicity of use and rapidity of the method, we intend to implement an application with a graphical interface that will automatically chain the different methods used in the processing flowchart and generate a weed infestation map. This map can then be integrated into a robot or tractor for selective herbicide spraying, thereby helping farmers to save money while applying the right amount of herbicide where it is needed.

Figure 1. Flowchart of the proposed method.

Figure 3. From left to right: line detection in bean (a) and spinach (b) fields. Detected lines are in blue. In the spinach field, inter-row distance and the crop row orientation are not regular. The detected lines are mainly located in the center of the crop rows.

Figure 5. Detection of inter-row weeds (red) after line detection (blue) in a bean image. The crop mask is represented in green and the potential weeds in magenta.

Figure 6. Example of images taken in bean (a) and spinach fields (b). The bean field has fewer inter-row weeds and is predominantly composed of potential weeds. The inter-row distance is stable and plants are sparse compared to the spinach field, which presents a dense vegetation in the crop rows and irregular inter-row distances. The spinach field has more inter-row weeds and has few potential weeds.

Figure 7. Example of crop and weed samples of size 64 × 64 pixels with and without background. Bean: samples of crop (a,b), samples of weed (c,d). Spinach: samples of crop (e,f) and samples of weed (g,h). Depending on the plant size and the window position, we obtain a plant or aggregation of plants per window.

Figure 9. Evolution of the loss during training for supervised and unsupervised data in the spinach and bean fields. The validation loss curves decrease during about the first 80 epochs before increasing and converging. The top two figures represent the spinach field, and the bottom two correspond to the bean field. Figures on the left are from training on the supervised data, and those on the right are from training on the unsupervised data.

Figure 11. ROC curves of test data with weed data from the bean field exchanged with those of the spinach field. From left to right: the ROC curves computed on the bean (a) and spinach (b) test data. In the bean field, the areas under the curve (AUCs) are 91.37% for unsupervised data and 93.25% for supervised data. In the spinach field, the AUCs are 82.70% for unsupervised data and 94.34% for supervised data. Supervised and unsupervised data mean, respectively, data labeled in supervised and unsupervised ways.

Figure 12. ROC curves of the SVM models created by each feature for each field. The first row represents the spinach field and the second one is the bean field. The first and second columns are the results of the models trained on the supervised and unsupervised data, respectively.

Figure 14. Examples of unmanned aerial vehicle (UAV) image classification with models created by unsupervised data in two different fields. The top two figures show samples from the spinach field and the bottom two samples are from the bean field. On the left are the samples obtained after using a sliding window, without crop line and background information. Blue, red, and white dots mean that the plants are identified as crop, weed, and an uncertain decision, respectively. On the right in red are the weeds detected after crop line and background information has been applied.

Author Contributions: M.D.B., A.H. and R.C. conceived and designed the method; M.D.B. implemented the method and performed the experiments; M.D.B., A.H. and R.C. wrote the paper, discussed the results and revised the manuscript. All authors have read and approved the manuscript.
Funding: This work is part of the ADVENTICES project supported by the Centre-Val de Loire Region (France), grant number ADVENTICES 16032PR.

Table 1. Training and validation data in the bean and spinach fields.

Table 2. Number of test samples used for each field.

Table 3. Results of test data collected in the bean field with ResNet18, support vector machine (SVM), and random forest (RF). For the SVM and RF, only the results of the best selected features are presented.

Table 4. Results of test data collected in the spinach field with ResNet18, SVM, and random forest. For the SVM and RF, only the results of the best selected features are presented. Sup and Unsup mean, respectively, supervised and unsupervised.