Deep Learning-Based Weed Detection Using UAV Images: A Comparative Study

Semantic segmentation has been widely used in precision agriculture, such as weed detection, which is pivotal to increasing crop yields. Various well-established and swiftly evolved AI models have been developed of late for semantic segmentation in weed detection; nevertheless, there is insufficient information about their comparative performance to guide optimal model selection in this field. Identifying such a model helps the agricultural community make the best use of technology. As such, we perform a comparative study of cutting-edge deep learning-based segmentation models for weed detection using an RGB image dataset acquired with UAV, called CoFly-WeedDB. For this, we leverage AI segmentation models, ranging from SegNet to DeepLabV3+, combined with five backbone convolutional neural networks (VGG16, ResNet50, DenseNet121, EfficientNetB0 and MobileNetV2). The results show that UNet with EfficientNetB0 as a backbone CNN is the best-performing model compared with the other candidate models used in this study on the CoFly-WeedDB dataset, achieving Precision (88.20%), Recall (88.97%), F1-score (88.24%) and mean Intersection over Union (56.21%). From this study, we suggest that the UNet model combined with EfficientNetB0 could potentially be used by the concerned stakeholders (e.g., farmers, the agricultural industry) to detect weeds more accurately in the field, thereby removing them at the earliest point and increasing crop yields.


Introduction
Global food demand is projected to increase by 35% to 56% from 2010 to 2050 [1]. However, the expansion of industrialization, desertification and urbanization has led to a reduction in the crop production area and, hence, productivity of food [2]. In addition to these challenges, climate change is increasingly creating favorable conditions for pests such as insects and weeds, harming crops [3]. Therefore, crop quality and quantity will be affected by insects and weeds if the appropriate treatment is not devised in a timely manner. Traditionally, herbicides and pesticides have been employed as a means of control [4]. When herbicides are sprayed throughout entire fields without precise identification of weeds, the application serves its purpose but has a detrimental impact on both crop yield and the environment. While these chemicals effectively combat pests and diseases that threaten crops, their excessive use where no weeds are present can reduce agricultural productivity [5]. Therefore, it is essential to precisely distinguish weeds from crops, so that cultivated plants can be saved from pesticide harm. As such, there is a requirement for a method of weed management that can gather and assess weed-related data within the agricultural field, while also taking appropriate measures to effectively regulate weed growth on farms [6].
Remote sensing (RS)-based approaches can be an alternative for automated weed detection using satellite imagery [7]. However, the success of satellite-based RS in weed detection is significantly influenced by three major limitations. First, the satellites acquire images with spatial resolutions measured in meters (e.g., Landsat at 30 m and Sentinel at 10 m), which is generally insufficient for analyzing plant- or individual plot-level weed data. Moreover, the fixed schedule of satellite revisits may not align with the timing needed to capture essential crop field images. Additionally, environmental factors like cloud cover frequently hinder the dependable quality of these images.
Recently, the Unmanned Aerial Vehicle (UAV) has made significant progress in its design and capability, including payload flexibility, communication and connectivity, navigation and autonomy, speed and flight time, etc. [8]. It offers versatile revisiting capabilities, allowing farmers/researchers to deploy it when weather conditions permit, ensuring frequent image capture (thus achieving high temporal resolution). Moreover, UAVs can capture images with remarkable spatial detail, closely observing individual plants from an elevated perspective, leading to centimeter-level image resolutions. Additionally, by flying at lower altitudes, UAVs can bypass cloud cover, obtaining clear and high-quality images [9]. Combined with the high-resolution crop field images acquired with UAV, semantic segmentation methods based on deep learning (DL) can provide a promising method for precise weed detection.
Semantic segmentation (SS) in computer vision is a pixel-level classification task that has revolutionized various fields, such as medical image segmentation [10,11] and precision agriculture (PA) [12]. For instance, Liu et al. [11] utilized the segmentation of retinal images to help diagnose and treat retinal diseases. In the PA domain, SS has been adopted for different problems such as agricultural field boundary segmentation [12], agricultural land segmentation [13], diseased vs. healthy plant detection [14] and weed segmentation [15]. Weed segmentation, which helps to identify unnecessary plants disturbing the growth of crops, is considered one of the major areas that directly contribute to improving crop productivity.
Over recent years, SS has gained significant traction in the weed detection area of PA. Computer vision techniques that utilize image processing and machine learning methods for weed detection are widely investigated in the literature [16-18]. However, deep learning methods for SS have shown state-of-the-art (SOTA) results for image segmentation tasks in general. The availability of deep neural networks pre-trained on large datasets such as ImageNet [19] made it possible to transfer cross-domain knowledge to agriculture field images. For instance, convolutional neural networks (CNNs) such as DeepLab [20], UNet [21] and SegNet [22] are implemented for weed detection on various crop fields. The performance of these neural networks depends on multiple factors, such as the resolution of images, types of crops and field conditions. Since the colour and texture of weeds are very similar to those of crops, differentiating between crops and weeds is a complex problem. Furthermore, if more than one type of weed is present in the field, it becomes more challenging to segment such regions.
A few researchers attempted to perform a thorough review and benchmarking of weed detection and segmentation tasks using computer vision and machine learning techniques [23,24]. For instance, Li et al. [23] evaluated the performance of various deep learning methods such as Faster RCNN [25], YOLO [26] and CenterNet [27] for weed detection on publicly available datasets. Additionally, Fathipoor et al. [24] experimented with UNet and its variations for weed segmentation using ground-based RGB images. They achieved an IoU score of 56% with UNet++. Since most of these research works dealt with ground-based RGB images, a comparative study of DL methods for weed segmentation using UAV-based RGB images is still needed. Despite significant advancements in SS for weed detection, the existing literature has the following limitations. First, the existing works lack a rigorous comparison to identify the optimal deep learning (DL)-based model for weed segmentation. Second, the performances of these DL-based models are dependent on the use of various CNNs as feature extractors, the number of training images available and the type of regions that need to be segmented or identified. As a result, it might be essential to apply some data augmentation techniques for the effective training of such a DL-based segmentation model when using various backbone CNNs as feature extractors.
Considering the aforementioned limitations, we conduct a detailed comparative study of DL models being used for weed detection and identify the best-performing model in the field. We also evaluate the performance of such SOTA models on a UAV dataset that is publicly available for weed segmentation. In summary, the main contributions of this paper are as follows: (i) The comprehensive implementation of five backbone CNNs with three segmentation models for weed detection is achieved. For this, we utilize the patch-based data augmentation method for model building. (ii) The performance comparison of five well-established CNNs as feature extractors employed with three segmentation models is reported. For this, we experiment with two strategies: binary segmentation (weed vs. non-weed) and multi-class segmentation (three classes of weed and non-weed). (iii) A DL-based method is implemented for weed segmentation using the best-performing backbone CNN and segmentation model. (iv) The effect of data augmentation techniques on the learning curve of the DL-based segmentation model while training is compared and reported.
The remainder of the paper is organized as follows: Section 2 presents a literature review, highlighting the summary of the existing works. Section 3 discusses the materials and methods used in this study. Section 4 presents the results and discussion, and Section 5 concludes the paper with future recommendations.

Related Work
Owing to the recent advancement in drone and sensor technology, the research on weed detection using DL methods has been swiftly progressing. For instance, a CNN was implemented by Dos et al. [15] for weed detection using aerial images. They acquired soybean (Glycine max) field images in Brazil with a drone and created a database of more than 1500 images including images of the soil, soybeans, broadleaf and grass weeds. A classification accuracy of 98% was achieved using ConvNet while detecting the broadleaf and grass weeds. However, their approach employed the classification of whole images into different categories for weed detection rather than the segmentation of image pixels into various classes. Similarly, a CNN was implemented for weed mapping in sod production using aerial images by Zhang et al. [28]. They first processed the UAV images using Pix4DMapper and produced an orthomosaic of the agricultural field. Then, the orthomosaic was divided into smaller image tiles. A CNN was built with an image size of (125 px × 125 px) as input. The CNN achieved a maximum precision of 0.87, 0.82, 0.83, 0.90 and 0.88 for broadleaf, grass weeds, spurge (Euphorbia spp.), sedges (Cyperus spp.) and no weeds, respectively. Ong et al. [29] performed weed detection on a Chinese cabbage field using UAV images. They adapted AlexNet [30] to perform weed detection and compared its performance with traditional machine learning classifiers such as Random forest [31]. The results showed that the CNN achieved the highest accuracy of 92.41%, which was 6% higher than that of Random forest. A lightweight deep learning framework for weed detection in soybean fields was implemented by Razfar et al. [32] using MobileNetV2 [33] and ResNet50 [34] networks.
Aside from the single-stage CNNs, few works have been reported that use multi-stage pipelines for weed detection on UAV images. For instance, Bah et al. [35] implemented a three-step method for weed detection on spinach and bean fields using UAV images. First, they detected the crop rows using the Hough transform [36] technique; then, the weeds between these crop rows were used as training samples where a CNN was trained to detect the crop and weed in the UAV images. However, their proposal depends on the accuracy of the line detection technique, which might not be robust when the UAV images contain varying backgrounds and image contrast. A two-stage classifier for weed detection in tobacco crops was implemented in [37]. Here, the authors first segmented the background pixels from the vegetation pixels, which included both weed and tobacco pixels. Then, a three-class image segmentation model was implemented. Their proposal achieved a maximum Intersection over Union (IoU) of 0.91 for weed segmentation. However, the two-stage segmentation model requires separate training at each stage, and hence it is not possible to train the model in an end-to-end fashion, adding extra complexity to its deployment.
Object detection approaches such as single shot detector (SSD) [38], Faster RCNN [39] and YOLO [40] were also employed for weed detection using UAV images. For instance, Veeranampalayam et al. [38] compared two object detectors, namely, Faster RCNN and SSD, for weed detection using UAV images. The InceptionV2 [41] model was used for feature extraction in both detectors (Faster RCNN and SSD). The comparison revealed that Faster RCNN models produced a higher accuracy as well as less inference time for weed detection.
The segmentation of images into weed and non-weed regions at the pixel level is more precise and can be beneficial for the accurate application of pesticides. Xu et al. [20] combined the visible color index with a DL-based segmentation model for weed detection in soybean fields. They first generated the visible color index image for each UAV image and fed it into a DL-based segmentation model which utilized the DeepLabV3 [42] network. When comparing its performance with other SOTA segmentation architectures such as fully convolutional neural network (FCNN) [43] and UNet [44], it provided an accuracy of 90.50% and an IoU score of 95.90% for weed segmentation.

Dataset
We used a publicly available dataset [45], which consists of 201 RGB images acquired with DJI Phantom Pro 4 from a cotton field in Greece by Krestenitsi et al. [45]. The cotton field images were acquired by flying the UAV at a height of 5 m to provide a close, clear view of the cotton field. The images were annotated at the pixel level into three types of weeds (Johnson grass (Sorghum halepense), field bindweed (Convolvulus arvensis) and purslane (Portulaca oleracea)) and background.
The dataset includes a total of 201 RGB images of size 1280 px × 720 px. In the dataset, there are very few pixels of purslane weeds (only $0.27 \times 10^{6}$ pixels), whereas the Johnson grass, field bindweed and backgrounds have $1.44 \times 10^{6}$, $7.56 \times 10^{6}$ and $175 \times 10^{6}$ pixels [45], respectively. It is clear that the dataset is highly imbalanced, which makes the automated weed detection task more challenging. Sample images and their corresponding masks are shown in Figure 1. Here, we used the yellow, red and gray colored masks to represent the Johnson grass, field bindweed and purslane weeds, respectively, for this illustration.

Patch Generation and Augmentation
Since the images were captured with a drone flying 5 m above the crop, the original image size was 1280 px by 720 px. This is too large to feed into DL models, as a large image size requires high memory and thereby slows the training process [46]. One way of dealing with this is to resize the image while feeding it into the DL model, but this loses crucial information and the model performance decreases significantly. We employ a patch-based strategy to create small-size patches and use such patches to train the DL models because it has two-fold benefits. First, it increases the number of training patches, which is essential to train the DL models as they generally contain millions of trainable parameters which require a large number of training samples. Second, it also helps balance the training dataset.
Since the dataset is highly imbalanced (with the majority of the pixels in the background class) and the main focus of this study is weed detection, it is logical to remove the image patches that only include background pixels. Therefore, we set a threshold to remove image patches that consist almost entirely (97%) of background pixels (as shown in Figure 2). Following this procedure, each UAV image is divided into patches of size (256 px × 256 px), which results in a total of 786 image patches. This dataset of 786 image patches is divided into train and test sets in a ratio of 8:2 (which results in 628 images for the training set and 158 in the test set).
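The patch generation and background filtering described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `extract_patches`, the background label id `0`, the use of non-overlapping tiles and the dropping of the incomplete bottom strip are all assumptions, since the text does not specify how the 720 px dimension is tiled.

```python
import numpy as np

PATCH = 256          # patch side length in pixels (from the text)
BG_THRESHOLD = 0.97  # patches with at least this background fraction are dropped
BACKGROUND = 0       # assumed label id for background pixels


def extract_patches(image, mask, patch=PATCH, bg_threshold=BG_THRESHOLD):
    """Tile an image/mask pair into non-overlapping patches and keep
    only the patches whose background fraction is below the threshold."""
    kept = []
    height, width = mask.shape
    # Assumption: tiles that would run past the image border are skipped.
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            mask_tile = mask[top:top + patch, left:left + patch]
            bg_fraction = np.mean(mask_tile == BACKGROUND)
            if bg_fraction >= bg_threshold:
                continue  # almost only background: not useful for weed detection
            image_tile = image[top:top + patch, left:left + patch]
            kept.append((image_tile, mask_tile))
    return kept


# Synthetic 1280 px x 720 px example with one weed region in the top-left tile.
image = np.zeros((720, 1280, 3), dtype=np.uint8)
mask = np.zeros((720, 1280), dtype=np.uint8)
mask[0:128, 0:128] = 1  # 25% of the first tile is labelled as weed
patches = extract_patches(image, mask)
print(len(patches))  # 1
```

With non-overlapping tiles, a 1280 px × 720 px image yields at most 5 × 2 = 10 full patches, so the number that survives filtering depends on how much weed cover each image contains.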
To see the effect of data augmentation on the performance of DL-based weed segmentation methods, we apply three augmentation techniques, flip (horizontal and vertical), rotation (by 90 degrees) and grid distortion, to each image of the training set. The first transformation is a combination of horizontal flip, random rotation by 90 degrees and grid distortion. The second transformation is a combination of vertical flip, random rotation and grid distortion [47]. After applying the data augmentation, the training set contains 1884 samples (the 628 original patches plus two transformed copies of each). The detailed statistics of the original (D1) and augmented dataset (D2) are presented in Table 1.
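A simplified sketch of the two transformations is given below, using NumPy only. It covers the flip and the random 90-degree rotation and deliberately omits grid distortion, which the study applies through the tool cited as [47]. The helper name `flip_and_rotate` and the random seed are illustrative assumptions. The key point is that the image and its mask must receive exactly the same geometric transform.

```python
import numpy as np


def flip_and_rotate(image, mask, horizontal, quarter_turns):
    """Apply the same flip and 90-degree rotation to an image and its mask."""
    axis = 1 if horizontal else 0  # axis 1 = left/right, axis 0 = up/down
    image_aug = np.rot90(np.flip(image, axis=axis), k=quarter_turns, axes=(0, 1))
    mask_aug = np.rot90(np.flip(mask, axis=axis), k=quarter_turns, axes=(0, 1))
    return image_aug.copy(), mask_aug.copy()


rng = np.random.default_rng(0)  # fixed seed, for reproducibility of the sketch

# One synthetic training pair with a single marked pixel.
image = np.zeros((256, 256, 3), dtype=np.uint8)
mask = np.zeros((256, 256), dtype=np.uint8)
image[10, 20] = 255
mask[10, 20] = 1
train_set = [(image, mask)]

# Original sample + horizontal-flip variant + vertical-flip variant.
augmented = list(train_set)
for img, msk in train_set:
    for horizontal in (True, False):
        turns = int(rng.integers(0, 4))  # random multiple of 90 degrees
        augmented.append(flip_and_rotate(img, msk, horizontal, turns))

print(len(train_set), len(augmented))  # 1 3
```

Applied to 628 training patches, the same loop would produce 628 × 3 = 1884 samples, matching the size reported for D2.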
Figure 2. Image patches (a-f) and their corresponding masks (g-l). Note that the masks of the image patches (g,j) include no weed pixels (or less than 3%), and such patches are excluded while building the weed detection model.

ResNet
Deep CNNs like VGG16 and VGG19 have shown promising results in large-scale image classification tasks. However, training such models is challenging due to the vanishing gradient problem, where small gradients propagated back through the layers diminish and vanish as the network becomes deeper. To address this issue, researchers introduced skip connections, allowing certain layers to be skipped. These skip connections form residual blocks, which are the core of the ResNet architecture [34], to mitigate the vanishing gradient problem, enabling the training of very deep networks for improved performance in image classification tasks.
The ResNet model offers many variants (based on the depth of the network), such as ResNet18, ResNet34, ResNet50, ResNet101 and so on. In this study, we utilized ResNet50, which comprises 48 Convolution layers, 1 Max-pooling layer and 1 Average pooling layer.

DenseNet
DenseNet [49] is a CNN model that expands on the concept of skip connections seen in ResNet by extending them to multiple steps. The central element of DenseNet is the Dense block, which is used between these connections. In DenseNet, each layer is directly connected to all subsequent layers, creating a dense connectivity pattern. This connectivity ensures that each layer receives input from all preceding layers. Dense blocks consist of Convolution layers with the same feature map size but varying kernel sizes. Based on the specific depth of the network, it has different variations such as DenseNet121, DenseNet169 and DenseNet201. In this study, the DenseNet121 network is utilized, which consists of 120 Convolution layers and 4 Average pooling layers. The DenseNet architecture facilitates robust information flow and feature extraction, making it a valuable tool in various applications.

EfficientNet
CNNs such as ResNet [34] and DenseNet [49] have scaled networks along the dimensions of width, depth and resolution, but not systematically. In contrast, EfficientNet, proposed by Tan et al. [50], introduced a methodical strategy for scaling up CNNs using a fixed set of scaling coefficients. The architecture of EfficientNet consists of three main parts: the stem block, the body and the final block. While the stem and final blocks remain consistent across all versions of EfficientNet, the body varies among different versions. The stem block involves several components such as input processing, re-scaling, normalization, padding, convolution, batch normalization and activation layers. On the other hand, the body is composed of five modules, each containing depth-wise convolution, batch normalization and activation layers. In this study, we utilized the smaller version known as EfficientNetB0, which consists of a total of 237 layers excluding the top layer.

MobileNet
MobileNet is a CNN model that utilizes depth-wise convolution [51]. Specifically, MobileNetV2 [33], an improved version of MobileNetV1 [51], introduces additional layers and blocks to enhance its performance. It incorporates one regular Convolution layer, followed by 13 depth-wise separable convolution blocks and another regular Convolution layer. MobileNetV2 also includes an Average pooling layer. Notably, it introduces the Expand layer, Residual connections and Projection layers, along with depth-wise Convolution layers known as Bottleneck residual blocks. These additions contribute to the effectiveness and efficiency of MobileNetV2, making it a powerful architecture for various applications. Here, we utilized MobileNetV2 as a backbone CNN while implementing weed segmentation models.

SegNet
SegNet [52] is based on FCNN architecture. It consists of an encoder and decoder followed by a pixel-wise classification layer. The encoder includes convolution and pooling operations to produce sparse feature maps. Then, the decoder up-samples the feature maps using un-pooling operations. The un-pooling operation uses the stored indices from the corresponding encoder pooling layers to precisely locate the feature within the up-sampled map.
In this work, we implement SegNet with various CNNs as feature extractors, as depicted in Figure 3. The corresponding layer used for feature extraction for each CNN is shown in Table 2. Then, the decoder part of the networks includes the four up-sampling blocks, where the first two blocks consist of one Up-sampling and three Convolution layers and the other two blocks consist of one Up-sampling and two Convolution layers. Finally, the segmentation layer with softmax is implemented.
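The index-based un-pooling used by the original SegNet decoder can be illustrated on a single-channel feature map. The NumPy sketch below is for intuition only and assumes a 2 × 2 pooling window and input sizes divisible by the window; the function names are hypothetical and do not come from any library.

```python
import numpy as np


def max_pool_with_indices(x, size=2):
    """Max-pool a 2-D map and remember where each maximum came from."""
    h, w = x.shape  # assumes h and w are divisible by `size`
    blocks = (x.reshape(h // size, size, w // size, size)
               .transpose(0, 2, 1, 3)
               .reshape(h // size, w // size, size * size))
    indices = blocks.argmax(axis=-1)  # position of the maximum inside each window
    pooled = blocks.max(axis=-1)
    return pooled, indices


def max_unpool(pooled, indices, size=2):
    """Place each pooled value back at its stored position; fill the rest with zeros."""
    ph, pw = pooled.shape
    blocks = np.zeros((ph, pw, size * size), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    blocks[rows, cols, indices] = pooled
    return (blocks.reshape(ph, pw, size, size)
                  .transpose(0, 2, 1, 3)
                  .reshape(ph * size, pw * size))


x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 9, 8],
              [7, 0, 6, 0]])
pooled, indices = max_pool_with_indices(x)
restored = max_unpool(pooled, indices)
print(pooled)    # [[4 5] [7 9]]
print(restored)  # maxima returned to their original locations, zeros elsewhere
```

The restored map is sparse, which is why the decoder follows each un-pooling step with convolution layers that densify the features.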

UNet
UNet [44] consists of two paths, contracting and expansive paths, forming a U-shaped network. It is widely used for image segmentation [13,44]. The contracting path, also known as the encoder, down-samples the image by extracting image features using a series of convolution and pooling operations. The expansive path, also known as the decoder, up-samples the features and recovers the spatial resolution of the original images using transposed convolution operations. As an example, the ResNet50-based implementation of UNet is demonstrated in Figure 4. Here, a skip connection is used from the feature map of each block of the encoder to the corresponding block of the decoder to perform concatenation and up-sampling operations. The feature extraction layers for the five backbone CNNs utilized in the UNet model are adapted from Kezmann et al. [53]. Note that the + operator in Figure 4 (cf. [54]) represents the concatenation and up-sampling operation.

DeepLabV3+
DeepLab is a state-of-the-art deep learning model for semantic segmentation. It has evolved from DeepLabV1 to the latest DeepLabV3+. DeepLabV1 [42] utilized the concept of deep convolutional neural networks (DCNNs) and atrous convolution for semantic segmentation. As atrous convolutions enabled the network to capture multi-scale contextual information, DeepLabV1 achieved significant results in semantic segmentation tasks. DeepLabV2 further extended DeepLabV1 by introducing atrous spatial pyramid pooling (ASPP), which captures multi-scale features at different atrous rates [55]. DeepLabV3 introduced an improved ASPP that utilized global average pooling and various atrous rates to capture contextual information more effectively. DeepLabV3+ further enhances it by introducing a decoder module that up-samples the feature maps and combines them with lower level features, as shown in Figure 5. The feature extraction layers for all CNNs used in the DeepLabV3+ segmentation model are also adapted from Kezmann et al. [53].
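To make the idea of an atrous rate concrete, the following sketch applies a 3-tap kernel to a 1-D signal at several rates. It is a toy illustration under simplifying assumptions (one dimension, no padding, a hand-written loop) and not the DeepLab implementation; `atrous_conv1d` is a hypothetical helper. A larger rate widens the receptive field without adding kernel weights, and ASPP applies several such rates in parallel to the same feature map.

```python
import numpy as np


def atrous_conv1d(signal, kernel, rate=1):
    """Valid 1-D cross-correlation with `rate - 1` gaps between kernel taps."""
    span = (len(kernel) - 1) * rate + 1   # receptive field of the dilated kernel
    out_len = len(signal) - span + 1
    out = np.zeros(out_len)
    for i in range(out_len):
        taps = signal[i:i + span:rate]    # every `rate`-th sample inside the window
        out[i] = np.dot(taps, kernel)
    return out


signal = np.arange(10, dtype=float)
kernel = np.ones(3)

# Same three weights, increasingly wide context (an ASPP-like set of rates).
for rate in (1, 2, 3):
    response = atrous_conv1d(signal, kernel, rate)
    print(rate, len(response), response[0])
# 1 8 3.0
# 2 6 6.0
# 3 4 9.0
```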

Experimental Setup
The experimental setup to conduct the weed segmentation was built on the Python-based Keras package [56]. All the experiments were carried out on the Google Colab [57] cloud computing platform, which utilized an NVIDIA T4 GPU with 12 GB RAM.
To maintain consistency in comparison, each segmentation model was trained for a maximum of 100 epochs, with a learning rate of 0.001 and the Adam optimizer. To prevent over-fitting of the model, early stopping was implemented with a patience of 10 epochs. The total loss function was calculated from focal loss [58] and dice loss [59]. The experiments were conducted using two strategies: (a) binary vs. multi-class segmentation and (b) with and without data augmentation. The results of these experiments are reported in Section 4.
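A combined focal and dice loss of the kind mentioned above can be sketched in NumPy as follows. The text does not state how the two terms are combined or which hyper-parameters were used, so the unweighted sum, `gamma=2.0` and `alpha=0.25` below are assumptions chosen for illustration, and the function names are hypothetical. Inputs are one-hot labels and predicted class probabilities, flattened to shape (pixels, classes).

```python
import numpy as np


def dice_loss(y_true, y_pred, eps=1e-7):
    """1 - mean Dice coefficient over classes."""
    intersection = np.sum(y_true * y_pred, axis=0)
    denominator = np.sum(y_true, axis=0) + np.sum(y_pred, axis=0)
    dice = (2.0 * intersection + eps) / (denominator + eps)
    return 1.0 - dice.mean()


def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Categorical focal loss: cross-entropy down-weighted for easy pixels."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    per_class = -alpha * y_true * (1.0 - p) ** gamma * np.log(p)
    return per_class.sum(axis=1).mean()


def total_loss(y_true, y_pred):
    """Assumed combination: plain sum of the two terms."""
    return dice_loss(y_true, y_pred) + focal_loss(y_true, y_pred)


# Five pixels, four classes (three weed types plus background).
y_true = np.eye(4)[[0, 1, 2, 3, 0]]
uniform = np.full((5, 4), 0.25)       # an uninformative prediction
print(total_loss(y_true, y_true))     # close to 0 for a perfect prediction
print(total_loss(y_true, uniform))    # noticeably larger
```

Both terms are commonly used for imbalanced segmentation problems: the dice term is computed per class, so rare classes contribute as much as the background, while the focal term reduces the influence of pixels that are already classified confidently.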

$$P = \frac{TP}{TP + FP} \tag{1}$$
where $TP$, $TN$, $FP$ and $FN$ represent true positive, true negative, false positive and false negative for a given class, respectively.
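The per-class metrics reported in the following sections can be computed from these counts, as in the sketch below. Only the precision formula appears in Equation (1); the recall, F1-score and IoU expressions used here are the standard definitions and are assumed to match the ones used in the study. The function name and the example counts are illustrative.

```python
def segmentation_metrics(tp, fp, fn):
    """Per-class precision, recall, F1-score and IoU from pixel counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    union = tp + fp + fn
    iou = tp / union if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Made-up counts for one class: 80 correct pixels, 20 false alarms, 20 misses.
scores = segmentation_metrics(tp=80, fp=20, fn=20)
print(scores)
```

For these counts, precision, recall and F1-score are all 0.8, while IoU is 80/120 ≈ 0.667. IoU is never higher than the F1-score for the same counts, which is consistent with the mean IoU values in this study being lower than the corresponding F1-scores.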

Performance Comparison of Different DL-Based Models for Binary Segmentation
In this section, we present the outcomes of the weed vs. background segmentation task. Since this task exhibits relatively lower complexity, all segmentation models based on DL yielded commendable performance.
The performance of the majority of the models (combination of backbone CNNs plus segmentation models) on both datasets D1 and D2 is similar, which shows that the data augmentation technique has no significant effect on the binary segmentation task (see Table 3). For instance, the highest performing model (DenseNet121 + SegNet) has a mean IoU score of 67.56% on D2 (with augmentation) and 67.12% on D1 (without augmentation). Similar results are seen for other combinations of backbone CNNs and segmentation models (SegNet, UNet and DeepLabV3+), except for VGG16. In this case, UNet with VGG16 as a backbone improves its mean IoU score from 54.54% to 64.22% when the data augmentation technique is utilized.
Upon evaluating the backbone CNNs without employing augmentation (D1), both ResNet50 and DenseNet121 yield a mean IoU score exceeding 60% across all three segmentation models (SegNet, UNet and DeepLabV3+). Among these three segmentation models, SegNet achieves the highest mean IoU score of 67.12% when paired with DenseNet121. It also demonstrates competitive results with a mean IoU score of 66.85% for EfficientNetB0, 65.85% for ResNet50 and 63.50% for VGG16. The exception is MobileNetV2, for which UNet outperforms all other models, attaining a mean IoU score of 67.07%. Comparing the results of backbone CNNs on the augmented dataset (D2), DenseNet121 and EfficientNetB0 produce the highest mean IoU scores of 67.56% and 67.24%, respectively, when combined with the SegNet model, whereas MobileNetV2 with UNet shows a competitive performance (mean IoU score of 67.07%).
Regarding accuracy comparison, the SegNet model with both DenseNet121 and EfficientNetB0 as backbones achieves the highest accuracy, surpassing 88%. Notably, the DeepLabV3+ model with VGG16 as its backbone displays the lowest performance. This disparity in performance might be attributed to the limited availability of training data for the models.

Comparative Study of DL-Based Models for Multi-Class Segmentation
The results of the multi-class segmentation task are discussed in this section. Since this task aims to segment the image region into four classes, including three types of weed and background, it is challenging for most of the segmentation models to assign each pixel to the correct class. Distinguishing the weed classes is considerably complex due to the resemblances in texture, color and patterns exhibited by distinct types of weeds. This similarity substantially contributes to the challenges encountered in effectively categorizing these diverse weed species.
For the multi-class task, the performance of the majority of the models is increased with data augmentation. For instance, EfficientNetB0 combined with UNet produces a mean IoU of 56.21%, whereas its mean IoU is only 51.97% when data augmentation is not applied (see Table 4).
Among the three segmentation models, UNet with EfficientNetB0 produces the highest mean IoU of 56.21%. It is noted that UNet performed well with other backbone CNNs as compared with SegNet and DeepLabV3+. For instance, the mean IoU of UNet with DenseNet121 is 56.09%, which is the second-highest performance among the compared models. Comparing the performance of five backbone CNNs, ResNet50, DenseNet121 and EfficientNetB0 achieve a mean IoU score of above 50% when combined with UNet and SegNet on both datasets (D1 and D2). MobileNetV2 combined with DeepLabV3+ has a mean IoU of 32.56% for D1 and 33.06% for D2.
It is observed that DenseNet121 with SegNet performs the best for the binary task, whereas EfficientNetB0 with UNet achieves the best performance for the multi-class task. However, the combinations of other backbone CNNs and segmentation models also yield a similar performance for the binary task. This might be attributed to the relatively lower complexity associated with the binary segmentation problem, where all models are able to discriminate between the background (which also includes the crops) and weeds. In comparison, the multi-class segmentation task is more challenging as it has higher intra-class similarity between the different types of weeds. Since UNet [44] transfers the entire feature map from encoder to decoder, this might help discriminate the multiple weed classes for multi-class segmentation. This is further supported by the consistently higher performance of UNet with the majority of backbone CNNs for the multi-class segmentation task (refer to Table 4).
The pixel-wise classification accuracy of most of the models ranges from 75% to 88%, which shows that the DL-based segmentation models are able to learn some patterns from the UAV images and have some potential in weed segmentation.

Class-Wise Study of Best-Performing DL-Based Segmentation Model
We report the class-wise performance of the best models: DenseNet121 with SegNet for the binary and EfficientNetB0 with UNet for the multi-class segmentation. Table 5 shows that the model is able to identify the background class with all the performance metrics (IoU of 87.66%, Precision 91.68%, Recall 95.24% and F1-score 96.42%) higher than 87%. However, the performance metrics for the weed class ranged from 47% (IoU) to 71% (Precision).
For the multi-class segmentation, the background class has the highest performance (IoU of 88.09%), while the weed class (Johnson grass) has the lowest performance (IoU of 44.78%) (refer to Table 6). It seems that it is more challenging to distinguish between the types of weeds than to differentiate between background vs. weed pixels.

Five-Fold Results of Best-Performing Model
To validate the best-performing model, we provide the five-fold cross-validation results for the multi-class (EfficientNetB0 + UNet) segmentation model. The confidence interval (CI) (refer to Equation (6)) for each performance metric is calculated at a 95% confidence level, which shows the statistical estimate of performance among the five folds of the dataset [60].
$$CI = \mu \pm z \frac{\sigma}{\sqrt{n}} \tag{6}$$
where $\mu$, $z$, $\sigma$ and $n$ represent the sample mean, the critical value for the chosen confidence level, the sample standard deviation and the sample size, respectively. We preferred the CI over the p-value in this statistical analysis because the interpretation of trial results based solely on p-values can be misleading [61]. Table 7 reports the performance scores of the EfficientNetB0 with UNet model for five folds, shedding light on the consistency of the model across the folds. The performance of the model in Fold-3 (mean IoU of 60.43%) is the highest, followed by Fold-1 (mean IoU of 58.05%). The minimum scores of the model are reported in Fold-4 (mean IoU of 56.21%). However, the confidence interval (CI) at α = 0.05 shows that the model is robust across the folds, with small margins of error (±1.4% for IoU, ±0.24% for precision, ±0.34% for recall and ±0.18% for F1-score) from the mean.
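The margin of error for a set of fold scores can be computed as in the sketch below, which assumes the normal-approximation form of the interval with z = 1.96 for a 95% confidence level and the sample standard deviation. The fold scores used here are placeholders chosen for easy arithmetic, not the values from Table 7, and `confidence_interval` is a hypothetical helper name.

```python
import math
import statistics


def confidence_interval(scores, z=1.96):
    """Return (mean, margin) so that the interval is mean +/- margin."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)             # sample standard deviation
    margin = z * std / math.sqrt(len(scores))
    return mean, margin


# Placeholder mean IoU values (%) for five folds; NOT the values from Table 7.
fold_scores = [56.0, 57.0, 58.0, 59.0, 60.0]
mean, margin = confidence_interval(fold_scores)
print(f"{mean:.2f} +/- {margin:.2f}")  # 58.00 +/- 1.39
```

With only five folds, a Student's t critical value (about 2.78 for four degrees of freedom) would give a wider and more conservative interval than z = 1.96; which of the two was used in the study is not stated in the text.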

A DL-Based Framework for Weed Detection on UAV Images
We finally present the DL-based framework for weed detection using UAV images, which consists of six stages: (a) input UAV-acquired images, (b) patch generation, (c) load the trained DL model at patch level, (d) make the prediction, (e) post-process the patch-level prediction and (f) generate the final segmentation map (refer to Figure 7). The framework begins with large input images (1280 px × 720 px), and the image is divided into smaller patches of size (256 px × 256 px). By dividing the whole image into manageable patches, we believe that the backbone CNNs can focus on discriminating features within localized areas that include weed pixels. Then, the best-performing DL-based segmentation model (e.g., EfficientNetB0 with UNet) is loaded and deployed to make the segmentation mask prediction at the patch level. The model evaluates the content of each patch and predicts types of weeds and background pixels based on the knowledge acquired during training (binary or multi-class). Finally, after obtaining predictions for individual patches, the framework proceeds to post-process these predictions. This step involves refining and combining the patch-level predictions to generate a more coherent and accurate prediction map for the whole image. In order to deploy the proposed DL framework for weed segmentation for new crop fields (other crops and weed types), the steps included in the training block (see Figure 7) are required, which will train the model to learn how to discriminate the specific crops vs. weeds. The trained model can then be deployed to predict the weeds in the given field image. By applying the above procedure, we tested the efficacy of the proposed framework (using EfficientNetB0 with UNet) for a multi-class segmentation task. The sample output generated by the proposed framework is demonstrated in Figure 8.
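The patch-level prediction and stitching stages of the framework can be sketched as below. This is an illustration under stated assumptions rather than the authors' pipeline: the image is reflect-padded to a multiple of the patch size so that the bottom strip is also covered, the post-processing step is reduced to cropping the stitched map back to the original size, and `predict_patch` stands in for the trained segmentation model (here a trivial green-channel threshold).

```python
import numpy as np


def predict_full_image(image, predict_patch, patch=256):
    """Tile the image, predict every tile, and stitch the label maps together."""
    height, width = image.shape[:2]
    pad_h = (-height) % patch  # rows needed to reach a multiple of `patch`
    pad_w = (-width) % patch
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="reflect")

    stitched = np.zeros(padded.shape[:2], dtype=np.int64)
    for top in range(0, padded.shape[0], patch):
        for left in range(0, padded.shape[1], patch):
            tile = padded[top:top + patch, left:left + patch]
            stitched[top:top + patch, left:left + patch] = predict_patch(tile)
    return stitched[:height, :width]  # drop the padded border


def dummy_predict_patch(tile):
    """Stand-in for the trained model: label bright-green pixels as class 1."""
    return (tile[..., 1] > 127).astype(np.int64)


image = np.zeros((720, 1280, 3), dtype=np.uint8)
image[600:700, 100:300, 1] = 255  # a synthetic "weed" region near the bottom edge
segmentation_map = predict_full_image(image, dummy_predict_patch)
print(segmentation_map.shape, int(segmentation_map.sum()))  # (720, 1280) 20000
```

With a real model, `predict_patch` would typically run the network on the tile and take the arg-max over the class probabilities; overlapping tiles with averaged predictions are a common refinement for reducing seams at patch borders.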

Conclusions
In this work, we comprehensively studied well-established deep learning-based segmentation models for weed detection. By investigating weed segmentation on UAV images along three aspects (backbone CNNs, segmentation models and data augmentation), we suggest frameworks for both binary and multi-class weed segmentation. The results indicate that DenseNet121 paired with SegNet performs the best for the binary task, while EfficientNetB0 combined with UNet achieves the highest performance for multi-class segmentation. Furthermore, the comparison of five backbone CNNs on the benchmark dataset shows that the UNet model with the EfficientNetB0, DenseNet121 and ResNet50 backbones has the best performance on multi-class weed detection. The other models have varying results when using different CNNs as the backbone.
Comparing the three segmentation models (UNet, SegNet and DeepLabV3+) with different backbones, we find a mixed performance on both binary and multi-class segmentation tasks. In particular, in the binary task, SegNet (with DenseNet121 as backbone CNN) has the highest mean IoU score of 67.56%, whereas UNet (with EfficientNetB0) has the highest mean IoU score for the multi-class segmentation task.
With respect to the complexity of the segmentation tasks (binary and multi-class), the majority of models (combinations of backbone CNNs and segmentation models) demonstrate similar performance for the binary task. This similarity might be attributed to the relatively lower complexity of binary segmentation, where all models effectively distinguish between the background (including crops) and weeds. In contrast, the multi-class segmentation task is more challenging because different types of weeds look alike. Here, UNet excels compared with the other models, which might be attributed to its ability to transfer entire feature maps from the encoder to the decoder during segmentation.
This work has two limitations. First, the model is trained with RGB images, which only cover the visible light spectrum. Multispectral images can capture more canopy information and might help in learning distinguishable patterns between the weeds and the background. This might further boost the performance of the segmentation models, as these SOTA models have shown strong performance in other domains. Second, the effect of data augmentation on multi-class task performance in most of the models indicates that more training data are needed for improved performance. Data generation techniques such as generative models can be employed to enlarge the dataset.

3.3. Backbone CNN Models
3.3.1. VGG
VGG [48] is a CNN developed by the Visual Geometry Group (VGG) at Oxford University, which achieved top results in the 2014 ImageNet challenge. It is a large neural network with a layered architecture that includes an Input layer, Convolution layers, Pooling layers and Dense layers. The number of layers depends on the variant. It has two variations: VGG16 (16-layer network) and VGG19 (19-layer network). In this work, VGG16 is utilized as a feature extractor, which includes 13 Convolution layers, 5 Max-pooling layers and 3 Dense layers.
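The layer counts quoted above can be checked against the commonly used VGG16 layout. The sketch below is only a bookkeeping aid: the configuration list follows the widely published VGG16 design, the function name `summarize` is hypothetical, and the spatial-size calculation assumes size-preserving 3 × 3 convolutions and 2 × 2 max-pooling with stride 2.

```python
# Widely published VGG16 layout: integers are conv output channels,
# "M" marks a 2x2 max-pooling layer that halves the spatial size.
VGG16_FEATURES = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                  512, 512, 512, "M", 512, 512, 512, "M"]
NUM_DENSE = 3  # dense layers at the top of the original classifier


def summarize(cfg, input_size):
    """Return (conv count, pool count, output side length, output channels)."""
    convs = sum(1 for v in cfg if v != "M")
    pools = cfg.count("M")
    size, channels = input_size, None
    for v in cfg:
        if v == "M":
            size //= 2      # each pooling layer halves height and width
        else:
            channels = v    # convs keep the size (assumed 'same' padding)
    return convs, pools, size, channels


convs, pools, size, channels = summarize(VGG16_FEATURES, input_size=256)
print(convs, pools, convs + NUM_DENSE)  # 13 5 16
print(size, channels)                   # 8 512
```

For a 256 px × 256 px patch, the convolutional part therefore yields an 8 × 8 feature map with 512 channels under these assumptions. Which layer each segmentation model actually taps for features is listed in Table 2 and is not reproduced here.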

Figure 3. High-level block diagram of SegNet with backbone CNN as an encoder.

Figure 4. The UNet block diagram with ResNet50 as backbone CNN (adapted and modified from [54]). Note that the + operator represents the concatenation and up-sampling operations.

Figure 5. Block diagram of DeepLabV3+ with backbone CNN as encoder.
The Intersection over Union (IoU) score and f-score were used as evaluation functions while training the model. A sample plot of the IoU score and loss achieved per epoch while training the UNet model with EfficientNetB0 as the backbone is reported in Figure 6 (for both the D1 and D2 datasets).
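For reference, the two evaluation functions can be written in terms of per-class true positives (TP), false positives (FP) and false negatives (FN): IoU = TP / (TP + FP + FN) and f-score = 2TP / (2TP + FP + FN). The sketch below applies these standard definitions to integer label masks. The function name `per_class_scores` is hypothetical, and the exact averaging and thresholding used during training are not specified in this section, so this is an illustration rather than the implementation used in the study.

```python
import numpy as np


def per_class_scores(y_true, y_pred, num_classes):
    """Return per-class IoU and f-score arrays for integer label masks.

    Classes absent from both masks get NaN so they can be skipped
    when averaging with np.nanmean.
    """
    ious, f_scores = [], []
    for c in range(num_classes):
        t = (y_true == c)
        p = (y_pred == c)
        tp = np.logical_and(t, p).sum()
        fp = np.logical_and(~t, p).sum()
        fn = np.logical_and(t, ~p).sum()
        union = tp + fp + fn
        ious.append(tp / union if union else np.nan)
        denom = 2 * tp + fp + fn
        f_scores.append(2 * tp / denom if denom else np.nan)
    return np.array(ious), np.array(f_scores)


# Tiny hand-made example: 0 = background, 1 = weed.
y_true = np.array([[0, 0], [1, 1]])
y_pred = np.array([[0, 1], [1, 1]])
ious, f_scores = per_class_scores(y_true, y_pred, num_classes=2)
print(ious)              # class 0: 1/2, class 1: 2/3
print(f_scores)          # class 0: 2/3, class 1: 4/5
print(np.nanmean(ious))  # mean IoU over the classes present
```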

Figure 6. The training and validation curves for UNet with EfficientNetB0 as a backbone. Note that (a,b) represent the model training curves on dataset (D1) without augmentation and (c,d) represent the model training curves on dataset (D2) with augmentation.

Figure 7. A DL-based framework for weed detection using UAV images. Note that, to adapt the proposed model for any other crop field, the training block shown in the dotted line needs to be followed.

Table 1. The image patch statistics of the two datasets, without augmentation (D1) and with augmentation (D2). Note that the images in the test set are not augmented.

Table 2. The feature extraction layer for each backbone CNN used in the SegNet model.

Table 3. The segmentation results of three segmentation models with five backbone CNNs for the weed vs. background segmentation task. Note that the bold values represent the highest performance.

Table 4. The segmentation results of three segmentation models with five backbone CNNs for the multi-class segmentation task. Note that the bold values represent the highest performance.

Table 5. Class-wise performance of SegNet with DenseNet121 as backbone CNN for weed vs. background segmentation. Note that the performance metrics are reported in percentages (%).

Table 6. Class-wise performance of UNet with EfficientNetB0 as backbone CNN for multi-class segmentation. Note that the performance metrics are reported in percentages (%).

Table 7. Five-fold results for the multi-class segmentation task (EfficientNetB0 + UNet) with data augmentation (D2). Note that the CI represents the confidence interval at α = 0.05.