Comparison of Object Detection and Patch-Based Classification Deep Learning Models on Mid- to Late-Season Weed Detection in UAV Imagery

Mid- to late-season weeds that escape routine early-season weed management threaten agricultural production by producing a large number of seeds that create problems for several future growing seasons. Rapid and accurate detection of weed patches in the field is the first step of site-specific weed management. In this study, object detection-based convolutional neural network models were trained and evaluated on low-altitude unmanned aerial vehicle (UAV) imagery for mid- to late-season weed detection in soybean fields. Two object detection models, Faster RCNN and the Single Shot Detector (SSD), were evaluated and compared in terms of weed detection performance, using mean Intersection over Union (IoU), and inference speed. The Faster RCNN model with 200 box proposals had weed detection performance similar to that of the SSD model in terms of precision, recall, f1 score, and IoU, as well as a similar inference time. The precision, recall, f1 score, and IoU were 0.65, 0.68, 0.66, and 0.85 for Faster RCNN with 200 proposals, and 0.66, 0.68, 0.67, and 0.84 for SSD, respectively. However, the optimal confidence threshold of the SSD model was found to be much lower than that of the Faster RCNN model, which indicated that SSD might have lower generalization performance than Faster RCNN for mid- to late-season weed detection in soybean fields using UAV imagery. The performance of the object detection models was also compared with that of a patch-based CNN model. The Faster RCNN model yielded better weed detection performance than the patch-based CNN with and without overlap. The inference time of Faster RCNN was similar to that of the patch-based CNN without overlap, but significantly less than that of the patch-based CNN with overlap. Hence, Faster RCNN was found to be the best model in terms of weed detection performance and inference time among the models compared in this study.
This work is important in understanding the potential and identifying the algorithms for an on-farm, near real-time weed detection and management.

been used for weed detection using data obtained in three different ways-using UAVs, using the autonomous ground robot, and high-resolution images obtained manually in the field. A simple CNN binary classifier was trained to classify manually collected small high-resolution images of maize and weeds [42,43]. The performance of the classifier with transfer learning on various pre-trained networks such as LeNet and AlexNet was compared, but this study was limited in variability in the obtained dataset and on the evaluation of the classification approach with large images. Dyrmann et al. [23] used a pre-trained VGG-16 network and replaced the fully connected layer with a deconvolution layer to output a pixel-wise classification map of maize, weeds, and soil. The training images were simulated by overlapping a small number of available images of soil, maize, and weeds with various sizes and orientations. The use of an encoder-decoder architecture for real-time output of pixel-wise classification maps for site-specific spraying was studied. It was found that by adding hand-crafted features such as vegetation indices, different color spaces, and edges as input channels to CNN, the model's ability to generalize to different locations and at the different growth stages of the crop improved [44][45][46]. Furthermore, to improve the generalization performance of the CNN-based weed detection system, Lottes et al. [25] studied the use of fully-convolutional DenseNet with spatiotemporal fusion and spatiotemporal decoder with sequential images to learn the local geometry of crops in fixed straight lines along the path of a ground robot. In the case of overlapping crop and weed objects, Lottes et al. [15] proposed a key point based feature extraction approach that was used to detect weed objects that overlap with the crop. 
In addition to weed detection, for effective removal of weeds using mechanical or laser-based methods, it is necessary to detect the stem location of weeds prior to actuation. A fully-convolutional DenseNet was trained to output the stem location as well as a pixel-wise segmentation map of crops and weeds [47,48].
In the case of weed detection using UAV imagery, similar to OBIA approaches mentioned above, dos Santos Ferreira et al. [3] used a Superpixel segmentation algorithm to segment objects and trained a CNN to classify these clusters. They then compared the performance with other machine learning classifiers which use handcrafted features. Sa et al. [27] studied the use of an encoder-decoder architecture, Segnet, for the pixel-wise classification of multispectral imagery and followed up with a performance evaluation of this detection system using different UAV platforms and multispectral cameras [49][50][51]. Bah et al. [29] used the Hough transform along with a patch-based CNN to detect weeds from UAV imagery and found that overlapping weed and crop objects led to some errors in this approach. It is to be noted that, in this approach, the patches are sliced from the large image in a non-overlapping manner. Huang et al. [30] studied the performance of various deep learning architectures for pixel-wise classification of rice and weeds and found that the fully-convolutional network architecture outperformed other architectures. Yu et al. [52] studied the use of CNN for multispecies weed detection in rye grass.
From the literature reviewed, it can be seen that automated weed detection has been primarily focused on early season weeds, since that is found to be the critical period for weed management and to prevent crop yield loss. However, it should be noted that mid- to late-season weeds that escape from the routine early-season management also threaten production by creating a large number of seeds which creates problems for several future growing seasons. With herbicide resistance, escaped weeds can proliferate and become difficult to manage. Studies on early season weeds can use vegetation segmentation as a preprocessing step to reduce the memory requirements; however, this does not apply to mid- to late-season weed imaging with no soil pixels due to canopy closure. Furthermore, because of the significant overlap between crops and weeds, it is challenging to find the optimal scale and other parameters of segmentation in OBIA to achieve the maximum performance. With deep learning-based object detection methods proving successful for tasks such as fruit counting (another situation with a cluttered background), it is hypothesized that such methods would be able to detect mid- to late-season weeds from UAV imagery. Hence, the objective of this study was to evaluate deep learning-based object detection models on detecting mid- to late-season weeds and compare their performance with patch-based CNN method for near-real time weed detection. Near-real time refers to on-farm processing of the aerial imagery on the edge device as it is collected. We refer to this as near-real time rather than real-time because there is no real time control output generated from the collected imagery and so we refer to near-real time as the completion of processing shortly after completion of data collection. The specific objectives of the study are:
1. Evaluate the performance of two object detection models with different detection performance and inference speed (the Faster RCNN and Single Shot Detector (SSD) models) in detecting mid- to late-season weeds from UAV imagery using precision, recall, f1 score, and mean IoU as the evaluation metrics for their detection performance and inference time as the metric for their speed;
2. Compare the performance of object detection CNN models with the patch-based CNN model in terms of weed detection performance using mean IoU and inference time.

Study Site
The study sites were located in the South Central Agricultural Laboratory of the University of Nebraska, Lincoln at Clay Center, NE, USA (40.5751, -98.1309). The two study sites were located adjacent to each other. They were different soybean weed management research plots. Figure 1 shows the stitched maps of the study sites.

UAV Data Collection
A DJI Matrice 600 Pro unmanned aerial vehicle (UAV) platform (Figure 2) was used with a Zenmuse X5R camera to capture aerial imagery. In order to collect data with varying growth stages of the crop as well as variations in illumination conditions, the images from study site 1 (shown at the top in Figure 1) were collected on 2 July 2018 whereas the images from study site 2 (shown at the bottom in Figure 1) were collected on 12 July 2018. The flight altitude in both cases was 20 m above ground level. The Zenmuse X5R is a 16 megapixel camera with a 4/3" sensor and a 72 degree diagonal field of view. The dimension of the captured images is 4608 × 3456 pixels in three bands: red, green, and blue. To develop an economical solution, this study focuses on using only RGB imagery. At a 20 m altitude, for the given sensor specifications, the spatial resolution of the output image is 0.5 cm/pixel. DJI Ground Station Pro software was used for flight control. Common weed species at the experimental site were waterhemp (Amaranthus tuberculatus), Palmer amaranth (Amaranthus palmeri), common lambsquarters (Chenopodium album), velvetleaf (Abutilon theophrasti), and foxtail species such as yellow and green foxtails. The weeds were naturally infesting the crop and were forming patches. The two data collections were performed 45 to 50 days after soybean planting and 15 to 20 days after post-emergence herbicides were applied in most treatments, except in plots where only pre-emergence herbicides were applied and in non-treated control plots. Soybean was at the V6 (six trifoliate) to R2 (full flowering) growth stage.
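The stated spatial resolution can be checked from the flight altitude, the diagonal field of view, and the image dimensions. The following is a minimal sketch, assuming a nadir-pointing camera over flat ground, square pixels, and no lens distortion; the function name is hypothetical and used only for illustration.

```python
import math


def ground_sampling_distance(altitude_m, diag_fov_deg, width_px, height_px):
    """Approximate ground sampling distance in cm/pixel.

    Assumes a nadir view over flat ground, square pixels and no lens distortion.
    """
    # Length of the image diagonal projected onto the ground, in metres.
    diag_ground_m = 2.0 * altitude_m * math.tan(math.radians(diag_fov_deg) / 2.0)
    # Number of pixels along the image diagonal.
    diag_px = math.hypot(width_px, height_px)
    return 100.0 * diag_ground_m / diag_px


gsd = ground_sampling_distance(20.0, 72.0, 4608, 3456)
print(f"{gsd:.2f} cm/pixel")
```

Under these assumptions the result agrees with the 0.5 cm/pixel figure reported for the 20 m flights.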

Data Annotation and Processing
The objective of the study is to develop a weed detection system with on-farm data processing capability. Since the mosaicking of overlapping aerial images is the most time-consuming process in the workflow and is not required in this case, overlapping images were removed, and only the non-overlapping raw images were retained. The original dimension of the raw image is too large to fit in memory for processing, so each raw image of size 4608 × 3456 pixels was sliced into 12 sub-images of size 1152 × 1152 pixels. The weed areas in each sub-image were annotated as rectangular bounding boxes using the Python labeling tool LabelImg [53]. Only one annotator was involved in the labeling process. The annotator was trained to draw rectangular bounding boxes around weed patches. For weed patches of complex shapes, multiple rectangular bounding boxes were drawn to cover such patches. A total of 450 sub-images were annotated manually and were then randomly split into 90% training images and 10% test images.
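The slicing step can be illustrated by computing the pixel bounds of each sub-image. This is a minimal sketch rather than the code used in the study; `tile_boxes` is a hypothetical helper, and the boxes are returned in (left, top, right, bottom) order, which is the order that, for example, Pillow's `Image.crop` accepts.

```python
def tile_boxes(width, height, tile):
    """Return (left, top, right, bottom) bounds of non-overlapping square tiles."""
    if width % tile or height % tile:
        raise ValueError("image dimensions must be multiples of the tile size")
    return [
        (left, top, left + tile, top + tile)
        for top in range(0, height, tile)
        for left in range(0, width, tile)
    ]


# A 4608 x 3456 raw image yields a 4 x 3 grid of 1152 x 1152 sub-images.
boxes = tile_boxes(4608, 3456, 1152)
print(len(boxes))
```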

Patch Based CNN
Convolutional neural networks (CNNs) are feedforward artificial neural networks in which the fully connected layers in the initial hidden layers are replaced with convolutional filters. This reduces the number of parameters in each layer and enables CNNs to learn spatial patterns in images and other two-dimensional data. The advantage of a CNN is its ability to learn the features by itself, thereby avoiding the time-consuming hand engineering of features needed in other Computer Vision algorithms. CNN architectures were proposed, and their use in applications such as document recognition, with backpropagation for training, was studied much earlier [54]. However, their applications were limited because of the need for very large datasets to train the large number of parameters in deep networks, and also because of the computational needs of training. In the last decade, with advancements in parallel processing capabilities using graphical processing units and increases in the availability of large datasets, Krizhevsky et al. [36] showed the potential of CNNs in complex multiclass image classification tasks. However, in most cases, it was found that there were not enough data available to train a deep CNN from scratch. Transfer learning helped overcome this limitation. Transfer learning is the technique of taking the weights of networks such as AlexNet or GoogLeNet that have been pre-trained on very large datasets and retraining them with small datasets for other applications [55]. This has been found to lead to exceptional classification performance, and one hypothetical explanation is that the features learned in the initial convolutional layers are global features common across various image classification tasks. Several studies have looked at the application of neural networks for weed detection, such as [28,56].
In this study, a pre-trained network called MobileNet v2 was used for transfer learning [57]. MobileNet v2 was developed primarily for use in mobile devices with limited memory. Hence, in order to reduce the number of parameters, each convolutional block of MobileNet v2 consists of an expansion layer with a convolutional kernel of window size 1, which increases the number of channels in the input. This is followed by a depthwise convolutional layer, which applies a single convolutional filter per input channel, and then by a projection layer that consists of a convolutional kernel of window size 1. This 1 × 1 convolutional layer is called the pointwise layer. It reduces the number of channels in the output, thereby reducing the number of parameters in the next convolutional block. Hence, in each block, feature maps are projected to a high-dimensional space, higher-dimensional features are learned in the depthwise convolutional layer, and these are then encoded using the pointwise convolutional projection layer. The MobileNet v2 network was trained on the ImageNet dataset containing 1.4 million images belonging to 1000 classes [57]. This network was then fine-tuned using the training patches belonging to both the classes in this study. Initially, for the first 10 epochs, only the classifier layer of the network was trained, with the weights of all other layers frozen. This was performed to use the global features learned on the ImageNet dataset and fine-tune the classifier for this specific application. After this, fine-tuning was performed in which all the top layers were unfrozen to allow the network to adapt to this specific application. The fine-tuning was performed for 10 epochs and, hence, the model was trained for only 20 epochs in total [58].
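The parameter savings of this block structure can be made concrete with a simple weight count. The sketch below is illustrative only: it ignores biases and batch normalization, and the expansion factor of 6, the 3 × 3 depthwise kernel, and the 32-channel example are assumptions chosen for illustration rather than values stated in this study.

```python
def inverted_residual_params(c_in, c_out, expansion=6, kernel=3):
    """Weight count of an expansion -> depthwise -> projection block."""
    hidden = c_in * expansion
    expand = c_in * hidden                # 1 x 1 expansion convolution
    depthwise = kernel * kernel * hidden  # one k x k filter per channel
    project = hidden * c_out              # 1 x 1 pointwise projection
    return expand + depthwise + project


def standard_conv_params(c_in, c_out, kernel=3):
    """Weight count of a standard k x k convolution."""
    return kernel * kernel * c_in * c_out


block = inverted_residual_params(32, 32)
# A standard 3 x 3 convolution operating at the same expanded width (192 channels).
dense = standard_conv_params(32 * 6, 32 * 6)
print(block, dense)
```

In this example the block needs 14,016 weights, whereas a standard 3 × 3 convolution at the expanded width would need 331,776.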

Object Detection Models
An object in Computer Vision refers to a connected, single element present in the image. Object detection is defined as the problem of finding the class of an object and also localizing it in the image [59]. Hence, for every object in the image, the model is expected to regress the coordinates of the bounding box of the object in addition to the class probabilities for classification. Two different models were investigated, Faster RCNN and SSD, both with Inception v2 as the feature extractor. Faster RCNN and SSD were chosen since Faster RCNN was found to have better performance, whereas SSD was found to have better speed [60]. Several different models trained on the ImageNet dataset, such as Inception v2 [61], MobileNet v2 [57], ResNet 101 [62], and VGG 16 [63], can be used as feature extractors for transfer learning. Of these, Inception v2 and MobileNet v2 have been found to be the fastest in terms of inference speed [60]. The objective was to develop a weed detection system with on-farm real-time data processing capabilities. Since Inception v2 has better performance than MobileNet v2 for object detection tasks at a similar inference speed, Inception v2 was chosen as the feature extractor [60].

Faster RCNN
Faster RCNN is a region proposal method-based object detection algorithm. Region-based CNN (R-CNN) was the first region proposal method-based model [64]. However, it was computationally expensive since CNN based feature extraction has to be performed for each proposed region. Fast RCNN was proposed to reduce the computational time by sharing convolutional features across the region proposals [65]. To improve the speed, Faster RCNN was proposed with fully convolutional Region Proposal Networks (RPN) that are trained to propose better object regions [66]. The Faster RCNN model consists of four sections: the feature extractor, the region proposal network, Region of Interest (RoI) pooling, and classification (as shown in Figure 3).
For feature extraction, the convolutional layers from Inception v2 were used. The advantage of the Inception v2 network is its use of wider networks with filters of different kernel sizes in each layer, which makes it translation and scale invariant. The Inception v2 architecture outputs a reduced-dimensional feature map for the region proposal layer. The region proposal network is defined by anchors, or fixed bounding boxes, at each location. At each location, anchors of different scales and aspect ratios are defined, thereby enabling the region proposal network to make scale-invariant proposals. The region proposal layer uses a convolutional filter on the feature map to output a confidence score for two classes: object and background. This is called the objectness score. Furthermore, the convolutional filter outputs regression offsets for the anchor boxes. Hence, assuming there are k anchors at a location, the convolutional filter in the region proposal network outputs 6k values, namely 4k coordinates and 2k scores. Two losses are calculated from this output: the classification loss and the bounding box regression loss. The bounding box coordinates of anchors classified as objects are then combined with the feature map from the feature extractor. In the RoI pooling layer, bounding box regions of different sizes and aspect ratios are resized to fixed-size outputs using max pooling. A pooling layer is a downsampling layer; in max pooling, the downsampling is done by taking the maximum of the pixels in each pooling window [36]. The max-pooled feature map of a fixed size corresponding to each output is then classified, and its bounding box offsets with respect to ground truth boxes are regressed. Hence, as in the region proposal layer, two losses are computed at this output, namely the classification loss and the bounding box regression loss.
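The idea of RoI max pooling, in which regions of different sizes are reduced to a fixed-size output, can be sketched on a single-channel feature map. This is a simplified illustration and not the implementation used by the framework: the function name, the integer box coordinates, and the 2 × 2 output size are assumptions, and real implementations operate on multi-channel feature maps and may quantize bins differently.

```python
import numpy as np


def roi_max_pool(feature_map, box, output_size=(2, 2)):
    """Max-pool the region box=(x0, y0, x1, y1) of a 2-D map to a fixed size."""
    x0, y0, x1, y1 = box
    region = feature_map[y0:y1, x0:x1]
    out_h, out_w = output_size
    if region.shape[0] < out_h or region.shape[1] < out_w:
        raise ValueError("region is smaller than the requested output size")
    # Bin edges that split the region into out_h x out_w roughly equal cells.
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty(output_size, dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cell = region[row_edges[i]:row_edges[i + 1],
                          col_edges[j]:col_edges[j + 1]]
            pooled[i, j] = cell.max()
    return pooled


feature_map = np.arange(36).reshape(6, 6)
square = roi_max_pool(feature_map, (0, 0, 4, 4))  # 4 x 4 region
wide = roi_max_pool(feature_map, (1, 0, 6, 3))    # 3 x 5 region
print(square.shape, wide.shape)
```

Both regions, despite their different sizes and aspect ratios, are reduced to the same 2 × 2 output, which is what allows a single classifier head to process every proposal.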

Hyperparameters of the Architecture
In the framework that was used, the input images to the Faster RCNN network were resized to a fixed size of 1024 × 1024 pixels. At each location in the region proposal layer, 4 different scales, namely 0.25, 0.5, 1.0, and 2.0, and 3 different aspect ratios, namely 0.5, 1.0, and 2.0, were used. Hence, in total, there were 12 anchors at each location. The model was trained for 25,000 epochs with a batch size of 1 using the stochastic gradient descent with momentum optimizer. The training dataset was split into training and validation datasets, and the performance of the model on the validation data was continuously monitored during training to check whether the model started to overfit. Because the crop rows in the collected data were always parallel to the horizontal axis of the image, random horizontal flip and random crop operations were used to augment the training data.
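The anchor count follows directly from combining the scales and aspect ratios. The sketch below generates the anchors at one location; the base anchor size of 256 pixels, the definition of aspect ratio as width divided by height, and the function name are assumptions made for illustration and are not stated in this study.

```python
import math


def anchors_at(cx, cy, scales, aspect_ratios, base_size=256.0):
    """Return (x0, y0, x1, y1) anchors centred on (cx, cy)."""
    boxes = []
    for scale in scales:
        for ratio in aspect_ratios:
            # Keep the anchor area fixed per scale while varying its shape.
            width = base_size * scale * math.sqrt(ratio)
            height = base_size * scale / math.sqrt(ratio)
            boxes.append((cx - width / 2, cy - height / 2,
                          cx + width / 2, cy + height / 2))
    return boxes


anchors = anchors_at(512, 512,
                     scales=(0.25, 0.5, 1.0, 2.0),
                     aspect_ratios=(0.5, 1.0, 2.0))
k = len(anchors)
rpn_outputs_per_location = 4 * k + 2 * k  # box offsets plus objectness scores
print(k, rpn_outputs_per_location)
```

With k = 12 anchors, the region proposal layer outputs 6k = 72 values at each location.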

Single Shot Detector
The Single Shot Detector (SSD) model (Figure 4) was proposed to improve the inference time of object detection models that use a region proposal network, such as Faster RCNN. The main difference in SSD compared to Faster RCNN is the generation of detection outputs without a separate region proposal layer. Similar to Faster RCNN, SSD uses a feature extractor, which is the Inception v2 architecture in this case. At each location of the feature map output, the model outputs a set of bounding boxes of different scales and aspect ratios. This is very similar to Faster RCNN, with the difference that the convolutional filter on the feature map directly outputs the confidence scores corresponding to the output classes along with the regression box offsets. Hence, the class and bounding box offsets are output in a single shot, as the name suggests. For the model to be scale and translation invariant, rather than outputting bounding boxes from only the feature map, extra feature layers are added to the feature map output and detection boxes are output at different scales from each output. Hence, in total, the SSD model has 6 layers that output detection boxes at different scales [67].

Hyperparameters of the Architecture
In the case of SSD, in the framework that was used, the input images are always reshaped to a fixed dimension of 300 × 300 pixels. After the feature extraction, in the 6 different layers that output detection boxes, 6 different scales in the range 0.2-0.95 were used. Boxes with five different aspect ratios, namely 1.0, 2.0, 0.5, 3.0, and 0.333, were generated at each location. The model was trained for 25,000 epochs, as in the case of Faster RCNN. A batch size of 24 was used in training, and the RMSProp optimizer was used. Data augmentation was applied with random horizontal flipping and random cropping of images. Validation images were, again, evaluated periodically during training to check whether the model was overfitting.
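One way to obtain 6 scales in the range 0.2-0.95 is the linear spacing used in the original SSD formulation, sketched below. Whether the framework spaces the scales exactly this way is an assumption, as are the function names; widths and heights are expressed as fractions of the input image size, with aspect ratio taken as width divided by height.

```python
import math


def ssd_scales(s_min=0.2, s_max=0.95, num_layers=6):
    """Linearly spaced box scales, one per detection layer."""
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]


def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5, 3.0, 1.0 / 3.0)):
    """(width, height) of the default boxes for one layer, relative to the image."""
    return [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in aspect_ratios]


scales = ssd_scales()
print([round(s, 2) for s in scales])
print(default_box_shapes(scales[0]))
```

Under this assumption the per-layer scales are 0.2, 0.35, 0.5, 0.65, 0.8, and 0.95, so the earliest detection layer targets the smallest boxes and the last layer the largest.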

Hardware and Software Used
The models were trained and evaluated on a computer with an Intel i9 processor with 18 cores, 64 GB of RAM, and an NVIDIA GeForce RTX 2080 Ti graphics card. The TensorFlow object detection API [61] in Python was used to train and evaluate Faster RCNN and SSD. The TensorFlow tutorial on transfer learning [58] was used to train the MobileNet v2 architecture for the patch-based CNN.

Evaluation Metrics
Precision, recall, f1 score, and Intersection over Union (IoU) are the evaluation metrics used in this study.
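In their standard form, which is assumed here, these metrics are defined as follows, where A and B denote the predicted and ground truth regions being compared:

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
\text{f1 score} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \\
\text{IoU} &= \frac{|A \cap B|}{|A \cup B|}
\end{align}
```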
Here, TP refers to true positive, FP refers to false positive, and FN refers to false negative. Mean Average Precision (mAP) is another metric that is commonly used in object detection problems [59,68]. It is the mean of the average precision over all recall values, computed at IoU thresholds between prediction and ground truth boxes ranging from 0.5 to 0.95. It should be noted that these metrics were primarily formulated for object detection. Even though object detection models are used in this study, the objective is not to find individual weed objects but rather all the area covered by weeds for management purposes. In a deep learning-based object detection model, multiple objects are predicted along with their bounding boxes. Of these, only the boxes whose IoU with the ground truth is greater than the IoU threshold and whose class score (the probability of that object being in each class) is greater than the confidence threshold are considered positive prediction boxes. Among these, only the box with the highest class score is considered the true positive, and the other positive boxes are considered false positives. In our case, for a weed patch that is marked as one ground truth box, the model might output multiple positive weed boxes corresponding to that one ground truth box. However, only one of those would be considered a true positive, and the other boxes would be false positives. As can be seen in Figure 5, the output for this image has two prediction boxes covering the weed area on the left, but in the ground truth it was marked as one bounding box. Hence, if precision is used as the evaluation metric, the box on the bottom will be regarded as a false positive even though that box adds to the weed area being detected. Therefore, the Intersection over Union (IoU) of the binary output image representing weed and background pixels with the ground truth binary image is used as the primary evaluation metric.
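This pixel-level IoU can be sketched by rasterizing the prediction and ground truth boxes into binary masks. The code below is a minimal illustration, not the evaluation code used in the study; the function names and integer (x0, y0, x1, y1) box format are assumptions, as is the convention of returning 1.0 when neither image contains weed pixels.

```python
import numpy as np


def boxes_to_mask(boxes, height, width):
    """Binary image with True inside any of the (x0, y0, x1, y1) boxes."""
    mask = np.zeros((height, width), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    return mask


def mask_iou(pred_boxes, truth_boxes, height=1152, width=1152):
    """IoU between the binary prediction image and the binary ground truth image."""
    pred = boxes_to_mask(pred_boxes, height, width)
    truth = boxes_to_mask(truth_boxes, height, width)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # assumed convention: no weeds predicted and none present
    return float(np.logical_and(pred, truth).sum() / union)


# One ground truth box covered by two adjacent prediction boxes.
truth = [(0, 0, 100, 200)]
two_boxes = [(0, 0, 100, 100), (0, 100, 100, 200)]
print(mask_iou(two_boxes, truth), mask_iou(two_boxes[:1], truth))
```

In this example the two prediction boxes together reach an IoU of 1.0, while keeping only one of them gives 0.5, so the second box is rewarded for the weed area it covers rather than being counted as a false positive.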
The binary output images corresponding to prediction outputs and ground truth are obtained by considering pixels representing weed objects as 1 and other areas as 0. The intersection and union of the two binary images obtained are then used to find the IoU ratio. Hence, IoU here represents the ratio between the intersection and the union of all positive prediction boxes (true positives and false positives in object detection terms) and all ground truth boxes in an image.
To evaluate the patch-based CNN on the sub-image, an overlap slicing approach is used. The sub-image of size 1152 × 1152 pixels is sliced into patches of size 128 × 128 pixels with a stride of 32 pixels in the horizontal and vertical directions. Therefore, the sliced patches have 75% horizontal and vertical overlap. Hence, each small area of size 32 × 32 away from the image border is part of up to 16 patches (4 along each axis), and the class with the maximum votes from these patches is assigned as the class of the small area. To evaluate this result against the ground truth and to compare it with the results of Faster RCNN and SSD, IoU is used as the evaluation metric.
Figure 6 shows the training graph for Faster RCNN and SSD. The decrease in training loss and the increase in mAP of the validation data with training epochs can be seen. By the end of the training, there was very little difference between the validation mAP of Faster RCNN and that of SSD. Faster RCNN converged faster than SSD. The training process of Faster RCNN might appear to oscillate more than that of SSD, which could be due to the different batch sizes and optimizers used by the two models. However, it should be noted that the scales of the two loss plots were different. The different batch sizes and optimizers could also be the reason for the Faster RCNN model converging to a high validation mAP earlier than SSD, since a batch size of 1 for Faster RCNN leads to 24 times more gradient updates per epoch than SSD with a batch size of 24.

Optimal IoU and Confidence Thresholds for Faster RCNN and SSD
In order to find the optimal threshold for IoU of the prediction boxes and ground truth boxes that would result in best performance of the model, precision recall curves were drawn using various confidence thresholds from 0 to 1 at various IoU thresholds ranging from 0.5 to 0.95 (Figure 7).
It can be seen that the area under the precision-recall curve is almost the same in case of Faster RCNN and SSD which explains the fact that the validation mAP during the final epochs as seen from the training graph was very similar (0.63 in Faster RCNN and 0.62 in SSD). Furthermore, both Faster RCNN and SSD achieved the maximum area under the precision-recall curve at an IoU threshold of 0.5 for the prediction box and ground truth box. Hence, for each ground truth box, among all prediction boxes with a confidence score greater than the threshold for confidence score, the prediction box with the highest value of IoU with the ground truth box and also whose IoU with ground truth box is greater than the threshold for IoU was considered a true positive. All prediction boxes that were not a true positive with any ground truth box are regarded as false positives. The number of false negatives is equal to the number of ground truth boxes that do not have a corresponding true positive. With the optimal IoU threshold found for Faster RCNN and SSD, the following graph (Figure 8) was plotted to find the optimal confidence threshold for Faster RCNN and SSD that results in the best performance.
Figure 8. Change in IoU of output binary image and ground truth binary image as well as f1 score with change in recall.
Figure 8 shows the change in f1 score and the mean IoU of the output binary image of the model with the ground truth binary image with change in recall. From the figure, the recall value which results in the best IoU and F1 score was found using the peak.
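The box-level counting rule described in this section can be sketched as follows. This is a hedged illustration rather than the evaluation code used in the study: the function names and the (box, score) input format are hypothetical, and preventing one prediction box from matching more than one ground truth box is an assumption not stated in the text.

```python
def box_iou(a, b):
    """IoU of two (x0, y0, x1, y1) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, truths, conf_threshold, iou_threshold=0.5):
    """Return (tp, fp, fn) for predictions given as (box, score) pairs."""
    kept = [box for box, score in predictions if score > conf_threshold]
    matched = set()
    tp = 0
    for truth in truths:
        # Pick the unmatched prediction with the highest IoU above the threshold.
        best_idx, best_iou = None, iou_threshold
        for idx, box in enumerate(kept):
            iou = box_iou(box, truth)
            if idx not in matched and iou > best_iou:
                best_idx, best_iou = idx, iou
        if best_idx is not None:
            matched.add(best_idx)
            tp += 1
    return tp, len(kept) - tp, len(truths) - tp


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truths = [(0, 0, 100, 200)]
predictions = [((0, 0, 100, 120), 0.9),      # IoU 0.6 with the ground truth
               ((0, 100, 100, 200), 0.8),    # IoU 0.5, not above the threshold
               ((300, 300, 400, 400), 0.05)]  # removed by the confidence threshold
tp, fp, fn = count_matches(predictions, truths, conf_threshold=0.1)
print(tp, fp, fn, precision_recall_f1(tp, fp, fn))
```

In this example the second box lies entirely on the weed patch but is still counted as a false positive, which lowers precision to 0.5 even though the patch is detected.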
The recall at which the best mean IoU and f1 score were observed was around 0.7, and the corresponding confidence threshold for class scores was 0.6 in the case of Faster RCNN and 0.1 in the case of SSD. It is to be noted that mean IoU here refers to the Intersection over Union of the whole binary model output image with the ground truth binary image, whereas the IoU mentioned earlier was the Intersection over Union of individual prediction bounding boxes with individual ground truth bounding boxes. Table 1 shows the precision, recall, f1 score, and mean IoU of the model output binary image and the ground truth binary image, along with the inference time for a 1152 × 1152 image. The precision, recall, f1 score, and mean IoU of both the models were similar, but the SSD model was slightly faster in execution than Faster RCNN. It should be noted that the above performance was obtained with the Faster RCNN network outputting 300 proposals from the region proposal network. However, Huang et al. [36] found that by reducing the number of proposals output by Faster RCNN, the inference time of Faster RCNN can be improved at a slight cost in precision, recall, and f1 score. Therefore, experiments were conducted to study the change in inference time, precision, recall, f1 score, and mean IoU when varying the number of proposal boxes from the Faster RCNN network from 50 to 300, and the results are plotted in Figure 9.
The inference time of Faster RCNN increased linearly with the number of proposal boxes output from the region proposal network. It can be seen that reducing the number of proposals from 300 to 200 did not change the performance of the model but decreased the inference time. Hence, 200 proposals was selected as the optimal number of proposals for this dataset.
At 200 proposals, the inference time of Faster RCNN was 0.21 seconds, which was the same as that of SSD. In the case of constraints in computational power, using 100 proposal boxes would result in significant compute savings with minimal loss in mean IoU. Hence, no difference in performance was found between Faster RCNN with 200 proposals and SSD in terms of the evaluation metrics used in this study. However, it is to be noted that, even with the same performance metrics, Faster RCNN output weed objects with higher confidence than SSD, since the confidence threshold used for Faster RCNN was 0.6, whereas it was a very low 0.1 for SSD. Though this threshold might result in the best performance with the current test dataset, it might affect the generalization performance of the model in the case of a test dataset from a different location or from a field with different management practices. In such cases, the low threshold might lead to reduced precision.

Comparison of Performance of Faster RCNN and SSD
On visual observation of the outputs of all 44 test images, it was found that in 41 images both networks detected all the weed areas. Hence, in these images, the difference in IoU between the model output and the ground truth is only because of slight displacements of the boundaries of the bounding boxes from each other. As mentioned in Section 2.7, the low values of precision, recall, and f1 score obtained are primarily because of the way these metrics are calculated: only one bounding box is considered a true positive for each ground truth box, whereas for some weed areas with slight discontinuities the model outputs multiple prediction boxes to cover those areas. Therefore, the mean IoU of the binary output image with the binary image of the ground truth is the appropriate metric. In three of the test images (shown in Figure 10), there was a difference between the outputs of Faster RCNN and SSD. In output image 1, Faster RCNN failed to detect a small strip of weed between the crop rows, whereas SSD detected it. However, the confidence score of this weed object from SSD shows that SSD was only able to detect it because of the very low confidence threshold set for it. In output image 2, SSD misclassified a row of soybean crops with herbicide drift injury as weeds. Moreover, in output image 3, SSD could not detect the weeds on the left vertical border of the image. With both failure areas located at the border of the images, this might indicate a susceptibility of the SSD model at image borders. This could be due to the architecture of SSD, which performs object localization and classification in a single shot, unlike Faster RCNN. Another possible reason could be that, by default, the API used to train both models resized the input images to 600 × 600 for Faster RCNN but to 300 × 300 for SSD. Therefore, this further loss of detail in the SSD input image compared to the Faster RCNN input image might have led to the misclassifications at the border. Hence, further study with the same input image resolution is needed for a fair comparison.
Other than the above-mentioned three images, Faster RCNN, as well as SSD, performed exceptionally well in detecting weed objects of various scales, as seen in Figure 11. As mentioned earlier, though SSD detected all the weed objects that were detected by Faster RCNN, the confidence of many of those predictions was very low, and they ended up as true positives only because of the low confidence threshold. Since, by reducing the number of proposals to 200, Faster RCNN can be as fast as SSD in terms of inference time, it can be concluded that Faster RCNN has the better speed-performance tradeoff.
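The effect of one-to-one matching on the box-level metrics discussed in this section can be illustrated with a minimal sketch. It assumes a common greedy matching scheme in which predictions are visited in order of decreasing confidence and each ground truth box can be matched at most once. The exact procedure of Section 2.7 is not reproduced here, and the IoU threshold of 0.5, the function names, and the toy boxes are assumptions made for illustration only.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[Box, float]            # (box, confidence score)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def precision_recall_f1(preds: List[Detection], truths: List[Box],
                        iou_threshold: float = 0.5) -> Tuple[float, float, float]:
    """Greedy one-to-one matching: each ground truth box yields at most one true positive."""
    matched = set()
    true_pos = 0
    for box, _score in sorted(preds, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(truths):
            if idx in matched:
                continue  # this ground truth box is already used
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_pos += 1
    false_pos = len(preds) - true_pos
    false_neg = len(truths) - true_pos
    precision = true_pos / (true_pos + false_pos) if preds else 0.0
    recall = true_pos / (true_pos + false_neg) if truths else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example (made-up values): two prediction boxes over one ground truth weed area.
truths = [(0, 0, 10, 10)]
preds = [((0, 0, 10, 8), 0.9), ((0, 2, 10, 10), 0.8)]
precision, recall, f1 = precision_recall_f1(preds, truths)
print(precision, recall, f1)
```

Both prediction boxes lie entirely on the weed area, yet the second one is counted as a false positive because the ground truth box is already matched. Precision therefore drops to 0.5 even though a binary-mask IoU for the same output would be 1.0.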

Comparison of Performance of Faster RCNN and Patch-Based CNN
The Mobilenet v2 network trained on the training patches showed very high performance in classifying test patches with an f1 score of 0.98. However, in order to evaluate its performance in detecting the weed objects in the sub-image and compare its performance with the Faster RCNN object detection model, the overlapping approach explained earlier was used. Table 2 shows the mean IoU of the output binary image from Faster RCNN and patch-based CNN with the ground truth binary image. Furthermore, the table shows the time taken to evaluate one sub-image by both the models.

Faster RCNN had better performance than patch-based CNN with overlap, both in terms of mean IoU and inference time. However, patch-based CNN without overlap had an inference time almost the same as that of Faster RCNN. The low IoU values of patch-based CNN without overlap were because of the coarse nature of this algorithm. Since each sub-image was split into 81 patches in this approach, weeds that were smaller in size would not be detected. Furthermore, because of the way the patches were sliced, there could be many patches with weeds and background in equal proportion, whereas the Mobilenet v2 model had only been trained with patches that contained only weed or only background, and hence the model was prone to error in this approach. To reduce this error, the slicing with overlap approach was tested. Since, for each small block within a patch, the class was determined by majority vote over eight patches, the problem of mixed patches was solved to some extent. Still, the IoU of slicing with overlap is similar to that of slicing without overlap because the ground truth binary image represents weed objects as rectangular boxes, whereas output binary images from the patch-based overlap approach consist of weed objects that are polygonal in nature because of the majority vote, as can be seen in Figure 12. Therefore, patch-based CNN with overlap has better performance than its IoU value with the ground truth image suggests. However, the drawback of this approach is the very high inference time compared to Faster RCNN and patch-based CNN without overlap.
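The idea of slicing with overlap and voting per block can be sketched as follows. This is a generic two-dimensional sliding-window version under stated assumptions, not the exact overlap layout of the study, which is described earlier in the paper and yields eight votes per block. The patch size of 128 pixels is inferred from 81 patches per 1152 × 1152 sub-image, and the function names and the `classify_patch` callback are hypothetical stand-ins for the trained patch classifier.

```python
from typing import Callable, List


def patch_origins(image_size: int, patch: int, stride: int) -> List[int]:
    """Top-left coordinates of patches along one axis of a square image."""
    return list(range(0, image_size - patch + 1, stride))


def vote_map(image_size: int, patch: int, stride: int,
             classify_patch: Callable[[int, int, int], bool]) -> List[List[bool]]:
    """Classify overlapping patches and label each stride-sized block by majority vote.

    classify_patch(x, y, size) is a stand-in for the CNN; it returns True for weed.
    Ties are resolved as background, which is a choice made for this sketch.
    """
    assert patch % stride == 0 and image_size % stride == 0
    n_blocks = image_size // stride
    blocks_per_patch = patch // stride
    weed_votes = [[0] * n_blocks for _ in range(n_blocks)]
    total_votes = [[0] * n_blocks for _ in range(n_blocks)]
    origins = patch_origins(image_size, patch, stride)
    for y in origins:
        for x in origins:
            is_weed = classify_patch(x, y, patch)
            # Every block covered by this patch receives one vote.
            for by in range(y // stride, y // stride + blocks_per_patch):
                for bx in range(x // stride, x // stride + blocks_per_patch):
                    total_votes[by][bx] += 1
                    weed_votes[by][bx] += is_weed
    return [[2 * w > t for w, t in zip(weed_row, total_row)]
            for weed_row, total_row in zip(weed_votes, total_votes)]


# Number of patches the classifier must evaluate per 1152 x 1152 sub-image.
patches_no_overlap = len(patch_origins(1152, 128, 128)) ** 2   # stride equals patch size
patches_half_overlap = len(patch_origins(1152, 128, 64)) ** 2  # 50% overlap (assumed)
print(patches_no_overlap, patches_half_overlap)

# Toy 8 x 8 image with a dummy classifier that flags only the leftmost patches.
toy_result = vote_map(8, 4, 2, lambda x, y, size: x < 2)
```

With a stride equal to the patch size, the sketch evaluates 81 patches, matching the non-overlapping case above. Halving the stride in both directions raises this to 289, which illustrates why any overlap increases inference time.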
Further studies can be done with different levels of horizontal and vertical overlap to examine their influence on the inference time of this approach. However, with the inference time of Faster RCNN being the same as that of the patch-based CNN without overlap, any amount of overlap would lead to more patches to be evaluated than in the non-overlap approach and hence to greater inference time. Therefore, among the approaches investigated in this study, Faster RCNN had the best overall performance. It would be interesting to study a modified Fast RCNN architecture with the region proposal part replaced with an image analysis method that selects polygons. This could achieve faster computational speed as well as better performance for a patch-based CNN method.
In order to implement this system for on-farm detection, further evaluation of the performance of these approaches at higher altitudes is needed. At the altitude of 20 m at which these data were collected, it is practically impossible to cover large soybean fields with the current limitations on the battery capacity of UAV systems. Therefore, the evaluation of the performance of these models on low-resolution images from high altitude is needed for practical adoption of these systems. Like SSD, it can be seen that there is a higher misclassification rate of patches in the border of the images. In this case, it is suggested to collect images with some overlap, such as 15%, so that weed objects present in the border of one image end up in the interior of the next image.
Furthermore, it is to be noted that the dataset used to train the models in this study was collected on only two different days. Therefore, the differences in phenological stage of the crop and the weed, and in lighting conditions, are limited within the dataset. Further experiments with wide variations in lighting conditions, flight altitudes, and phenological stages are needed to analyze and compare the generalizability of the performance of these models under varying conditions in the field. In addition, since the manual labeling of bounding boxes used in this study was performed by one annotator, it is possible that there is error due to observer bias. Therefore, further studies using multiple annotators for labeling data with more variations, as mentioned above, are needed to remove bias and study the generalizability of the models. With the manual annotation of images being a time-consuming process, the use of multiresolution segmentation approaches from OBIA could help in automating this. In that case, OBIA could help generate polygon labels from which rectangular bounding box labels can be generated for object detection tasks.
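The conversion from polygon labels to rectangular bounding box labels mentioned above can be sketched in a few lines. The polygon vertices below are made-up values standing in for a hypothetical OBIA segment, and the function names are illustrative only. The sketch also compares the polygon area with the box area, which relates to the earlier observation that rectangular ground truth boxes and polygonal outputs do not overlap perfectly even when the same weed patch is identified.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_to_bbox(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def polygon_area(polygon: List[Point]) -> float:
    """Area of a simple (non-self-intersecting) polygon via the shoelace formula."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(polygon, polygon[1:] + polygon[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


# Hypothetical polygon label for one weed patch, in pixel coordinates.
weed_polygon = [(2, 5), (7, 3), (9, 8), (4, 10)]
bbox = polygon_to_bbox(weed_polygon)
bbox_area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
fill_ratio = polygon_area(weed_polygon) / bbox_area  # share of the box that is weed
print(bbox, fill_ratio)
```

For this toy polygon, only about 59% of the enclosing box is covered by the polygon, so a box label derived this way would also include a substantial amount of background.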

Conclusions
In this study, Faster RCNN and SSD object detection models were trained and evaluated over UAV imagery for mid- to late-season weed detection in soybean fields. The two object detection models, Faster RCNN and the Single Shot Detector (SSD), were compared with each other and with a patch-based CNN model in terms of weed detection performance, using mean IoU, and inference speed.
It was found that the Faster RCNN model with 200 box proposals had a similar weed detection performance to the SSD model in terms of precision, recall, f1 score, and IoU, as well as a similar inference time. The precision, recall, f1 score, and IoU were 0.65, 0.68, 0.66, and 0.85 for Faster RCNN with 200 proposals, and 0.66, 0.68, 0.67, and 0.84 for SSD, respectively. However, the optimal confidence threshold of SSD was found to be 0.1, indicating the lower confidence of this model in the weed objects detected, whereas the optimal confidence threshold of Faster RCNN was found to be 0.6, meaning higher confidence in the weed objects detected. In addition, SSD was susceptible to misclassification in the border of some test images. These findings indicate that SSD might have lower generalization performance than Faster RCNN for mid- to late-season weed detection in soybean fields using UAV imagery. Hence, Faster RCNN was determined to be the better performing model of the two in this study. Between Faster RCNN and patch-based CNN, Faster RCNN had better weed detection performance than patch-based CNN with overlap as well as without overlap. The inference time of Faster RCNN was similar to that of patch-based CNN without overlap, but significantly less than that of patch-based CNN with overlap. Hence, Faster RCNN was found to be the best model in terms of weed detection performance and inference time among the different models compared in this study.
Future work can evaluate the performance variation of the models across different weed species. In addition, the performance of Faster RCNN at different altitudes can be studied by resampling high-resolution images to low-resolution images. Furthermore, the inference time experiments at different altitudes should be performed on low computational power devices such as regular laptops and mini-PCs used for the flight control of UAV systems. Inference time experiments should also be performed on low-cost hardware accelerators available for edge computing, such as the Intel Neural Compute Stick or Google Coral. This would help understand the potential of using such devices for on-farm, near real-time data processing and actuation. In addition, the effect of model compression techniques and approximation algorithms developed for neural networks can be studied to understand the limit of edge computing for in-field near real-time weed detection. Moreover, further work can be performed on using the RTK GPS data of individual images and their corresponding IMU data to orthorectify the images and find the geolocation of the weed patches detected by the object detection models. In addition, the performance of object detection models for weed detection can be compared between raw individual images, as used in this study, and stitched mosaic maps. With the manual annotation of images being a laborious part of the process, techniques such as self-supervised learning [69] and active learning [70] can be studied as a way to reduce the amount of manual labeling for this task. Furthermore, few-shot learning algorithms can be studied to investigate the transfer of these models to other crops and weed species by training with a few labeled instances from those crops and weed species.
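The resampling of high-resolution images to lower resolution proposed above can be sketched with simple block averaging. The sketch assumes a fixed camera, for which the ground sampling distance grows in proportion to flight altitude, so that an image taken at 20 m and downsampled by an integer factor k approximates the resolution at 20k m. It ignores other altitude-dependent effects such as motion blur and atmospheric conditions, and the function names and toy pixel values are hypothetical.

```python
from typing import List


def downsample_factor(source_altitude_m: float, target_altitude_m: float) -> int:
    """Integer resampling factor, assuming resolution scales linearly with altitude."""
    ratio = target_altitude_m / source_altitude_m
    assert ratio >= 1 and ratio == int(ratio), "sketch supports integer factors only"
    return int(ratio)


def downsample(image: List[List[float]], factor: int) -> List[List[float]]:
    """Average each factor x factor block of a single-channel image."""
    height, width = len(image), len(image[0])
    assert height % factor == 0 and width % factor == 0
    result = []
    for block_y in range(0, height, factor):
        row = []
        for block_x in range(0, width, factor):
            block = [image[y][x]
                     for y in range(block_y, block_y + factor)
                     for x in range(block_x, block_x + factor)]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


# Toy 4 x 4 single-channel image (made-up values), resampled from 20 m to 40 m.
toy_image = [[0, 2, 4, 4],
             [2, 4, 4, 4],
             [1, 1, 0, 0],
             [1, 1, 0, 8]]
factor = downsample_factor(20, 40)
low_res = downsample(toy_image, factor)
print(factor, low_res)
```

A multi-channel image could be handled by applying the same averaging to each channel separately. Bounding box coordinates would need to be divided by the same factor so that the labels stay aligned with the resampled image.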