Benchmarking Object Detection and a Novel Self-Supervised Framework for Weed Detection in Soybean

Srivastava, Dhiraj; Singh, Vijay; Wamanse, Rutvij; Li, Song; Kochersberger, Kevin; Virk, Simerjeet; Yadav, Pappu

doi:10.3390/rs18111720

by Dhiraj Srivastava ¹, Vijay Singh ¹,*, Rutvij Wamanse ¹, Song Li ², Kevin Kochersberger ³, Simerjeet Virk ⁴ and Pappu Yadav ⁵

¹ Eastern Shore Agricultural Research and Extension Center, Virginia Tech, Painter, VA 23420, USA
² School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA 24060, USA
³ Department of Mechanical Engineering, Virginia Tech, Blacksburg, VA 24060, USA
⁴ Department of Biosystems Engineering, Auburn University, Auburn, AL 36849, USA
⁵ Department of Agricultural and Biosystems Engineering, South Dakota State University, Brookings, SD 57007, USA
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1720; https://doi.org/10.3390/rs18111720

Submission received: 19 April 2026 / Revised: 18 May 2026 / Accepted: 21 May 2026 / Published: 27 May 2026

(This article belongs to the Special Issue Remote Sensing Imagery for Agricultural Monitoring and Precision Farming)

Highlights

What are the main findings?

Across eight object detectors, YOLO models led on accuracy and computational cost, while two-stage detectors achieved the highest threshold-independent weed coverage (WCR-AUC).
GeoCLR, a self-supervised framework pretrained on unlabeled aerial imagery, produced more class-discriminative features than ImageNet pretraining and recovered roughly 95% of full-data detection accuracy using only half of the annotations.

What are the implications of the main findings?

Detector selection for precision spraying should be based on operational and deployment metrics in addition to mAP, because a detector chosen solely for accuracy may not provide the highest spraying efficacy or the lowest deployment cost.
Self-supervised pretraining on unlabeled UAS imagery can substantially reduce the annotation effort needed to field a usable weed detector, improving scalability across fields and seasons.

Abstract

Palmer amaranth is one of the most problematic weeds in soybean production in the United States and can cause major yield loss if not managed early. This study benchmarked eight object detection models for site-specific Palmer amaranth detection in soybean using high-resolution uncrewed aerial system (UAS) imagery, with the goal of supporting targeted herbicide application and reducing herbicide usage. The models (YOLOv8m, YOLOv9m, YOLOv10m, YOLOv11m, Faster R-CNN, RetinaNet, RT-DETR, and a self-supervised Faster R-CNN variant) were evaluated using five-fold cross-validation on 2064 annotated aerial RGB image tiles containing 5990 bounding-box instances across multiple growth stages and field conditions, with an additional 7615 unlabeled tiles used for self-supervised pretraining. All detectors followed an identical 150-epoch schedule with early stopping and were compared using the Friedman test with Iman–Davenport correction and post hoc Nemenyi tests. Detectors were assessed on three axes: detection accuracy (mAP and class-wise AP), operational spraying efficacy summarized by a threshold-independent weed coverage rate area under the curve (WCR-AUC), and computational deployment cost across batch sizes from 1 to 32. The YOLO models achieved the highest detection accuracy along with the lowest inference latency and memory use but showed weaker threshold-independent weed coverage; the two-stage Faster R-CNN models showed the opposite pattern. Weighing all three axes, YOLOv8m provided the most practical balance for real-time deployment. The study also introduced GeoCLR, a self-supervised pretraining framework that constructs positive pairs from UAS flight overlap rather than synthetic augmentation. GeoCLR produced more structured and class-discriminative features than ImageNet pretraining, and a detector fine-tuned on only half of the annotations recovered approximately 95% of full-data accuracy. Together, these results highlight the importance of operational metrics for practical model selection and show that self-supervised pretraining can reduce annotation effort for scalable precision agriculture.

Keywords:

Palmer amaranth; self-supervised learning; soybean; uncrewed aerial system; weed recognition

1. Introduction

Weeds interfere with crop growth and compete for soil nutrients, water, light, and space [1]. Weeds have the potential to reduce soybean (Glycine max (L.) Merr.) crop yield by 26–29% globally [2]. It has been estimated that weed infestations in soybean result in an annual revenue loss of $16.2 billion in the United States (U.S.) [3]. Following the widespread adoption of genetically modified herbicide-resistant crops, herbicides became the tool most commonly used by growers to manage weeds. However, various studies have highlighted the negative impact of herbicides on groundwater quality, human health, and the environment [4,5,6,7]. Furthermore, herbicide-resistant weeds are becoming more prevalent, making it difficult for growers to manage weeds efficiently and effectively [8,9,10]. According to a survey conducted by the Weed Science Society of America (WSSA) in 2022, Palmer amaranth (Amaranthus palmeri S. Watson) ranked first among the seven most troublesome weeds in soybean in the U.S. [11]. Early-emerging Palmer amaranth can reduce crop yield more than plants that emerge when the crop is fully grown [12]. The presence of just eight Palmer amaranth plants per meter of row can cause soybean yield loss of up to 78% [13]. Therefore, it is crucial for growers to monitor Palmer amaranth densities in their fields from planting onward so that effective weed management strategies can be designed and implemented in time to reduce yield loss. Even when herbicides are sprayed early, Palmer amaranth plants that escape control or emerge later in the season can cause significant yield loss; hence, detection at multiple growth stages is also essential.

Herbicides are traditionally applied using agricultural boom sprayers; however, this approach wastes herbicide because it is spread uniformly, even over areas where weeds are absent. Recently, site-specific weed control (SSWC) has gained popularity as an effective alternative. SSWC targets only those areas of the field where weeds are present, thereby reducing herbicide usage [14,15]. Building SSWC tools requires a comprehensive understanding of weed densities in the field [16]. Several factors, such as similarities in color, texture, and shape between crops and weeds, varying lighting conditions, and the growth stages of the crops and weeds, can affect how precisely SSWC tools locate weeds in a large field [17].

Computer vision research on weed detection dates back to the 1990s, when it relied on manual extraction of color, texture, and shape features by weed science experts [18,19,20,21]. Classical machine learning methods such as artificial neural networks, support vector machines, and random forests achieved good accuracy in this setting [22,23,24], but generalized poorly to natural field conditions and required expert-driven feature design [25,26]. Convolutional neural networks (CNNs) have since become the standard deep-learning approach for agricultural image analysis [27,28,29,30,31,32,33,34]. However, CNN classification models treat an entire image as a single class, while uncrewed aerial system (UAS) and ground robotic platforms commonly capture multiple weed and crop species in a single frame. Object detection algorithms such as You Only Look Once (YOLO) address this by localizing multiple instances within an image [35,36,37].

CNN-based algorithms require large amounts of annotated data to train robust weed detection models. The quality and diversity of training imagery strongly influence model performance and depend on factors such as sensor type, acquisition platform (e.g., UAS, robots, handheld cameras), weather conditions, illumination, and field variability [38,39]. Previous studies have used RGB imagery collected from handheld cameras and UAS platforms for Palmer amaranth detection in cotton and soybean. For example, a previous study achieved 76.9% mAP@0.5 using YOLOv5n for detecting carpetweed (Mollugo verticillata L.), morning glory (Ipomoea spp.), and Palmer amaranth in cotton [36]. Similarly, another study trained YOLOv5 on handheld and aerial imagery for Palmer amaranth detection in soybean and reported 77% mAP@0.5 [40]. Although YOLO-based models are the most widely used object detection algorithms for weed recognition, their evaluation in agriculture has often been limited to relatively small or controlled datasets and usually focused on a single generation such as YOLOv3 or YOLOv5. Since then, newer YOLO versions (YOLOv8 to YOLOv11) have introduced architectural and training modifications intended to improve both detection accuracy and computational efficiency. However, newer architectures do not necessarily perform better under real aerial field conditions [41]. In addition, other detector families such as Faster R-CNN, RetinaNet, and transformer-based detectors follow substantially different localization and feature-representation strategies [42,43,44,45]. Evaluating these detector families together therefore provides a broader understanding of how single-stage, two-stage, and transformer-based architectures behave under aerial agricultural imaging conditions. 
However, no study has systematically compared recent YOLO generations alongside other major detector families on high-resolution UAS imagery of Palmer amaranth in soybean across diverse growth stages and field environments.

Developing robust object detection models for SSWC requires large volumes of expert-annotated bounding boxes, making dataset generation both time-intensive and expensive, particularly when expanding across new fields, weed species, or growing seasons. Self-supervised learning (SSL) offers an alternative approach in which visual representations are learned directly from unlabeled imagery and later transferred to downstream tasks using relatively small labeled datasets [46,47]. Existing SSL frameworks such as SimCLR [46], MoCo [48], and Seasonal Contrast (SeCo) [49] have shown that meaningful image representations can be learned either from augmentation-based views or from naturally occurring spatial and temporal relationships. These ideas are particularly relevant for UAS-based agricultural imaging because overlapping flight trajectories naturally provide multiple views of the same physical field region under slight variation in viewing angle, illumination, and acquisition conditions. However, the use of UAS flight-overlap geometry as a positive-pair signal for self-supervised pretraining in aerial weed detection has not been systematically explored.

Beyond standard object detection metrics, a weed detection model should also be evaluated based on how useful it is for site-specific spraying. Metrics such as mAP@0.5 and mAP@0.5:0.95 are useful for measuring detection accuracy and localization quality, but they do not directly show how many weeds would actually be sprayed or how much herbicide could be saved. Previous studies have proposed deployment-oriented metrics such as weed coverage rate and area sprayed to evaluate precision spraying performance [50]; however, these metrics have generally been reported only at a single fixed confidence threshold, which does not capture detector behavior across the full operating range used in commercial spot-spraying systems [51]. Computational deployment characteristics such as inference latency at multiple batch sizes, throughput, model size, and GPU memory have similarly been under-reported, despite being decisive for on-board UAS deployment. Together, these gaps in operational and deployment reporting limit how directly current weed detector benchmarks translate into field-deployment guidance for UAS-based precision spraying.

The present study addresses these gaps through three methodological contributions. First, a novel self-supervised pretraining framework that constructs positive pairs from the natural spatial overlap between adjacent UAS images within a flight, rather than from synthetic augmentations of a single image, is introduced. Second, a threshold-independent operational metric is introduced to summarize weed-targeting performance across the full range of confidence thresholds rather than at a single operating point. Third, a three-axis evaluation framework combining detection accuracy, operational spraying behavior, and computational deployment cost is used to compare detector architectures jointly, rather than along a single criterion. To our knowledge, benchmarking recent YOLO variants alongside Faster R-CNN, RetinaNet, and a transformer-based detector within this framework on UAS imagery of Palmer amaranth in soybean has not been reported in the precision spraying literature.

The current study was designed using data collected from U.S. soybean production fields in the Mid-Atlantic region with the following three objectives:

1. Develop an annotated Palmer amaranth and soybean image dataset representative of diverse weather conditions and growth stages.
2. Benchmark recent YOLO models (v8–v11) alongside Faster R-CNN, RetinaNet, and a transformer-based detector using an evaluation framework spanning detection accuracy, operational spraying efficacy, and computational deployment cost.
3. Introduce a self-supervised pretraining framework that learns from unlabeled UAS imagery, and evaluate its label efficiency for downstream Palmer amaranth detection in soybean.

2. Materials and Methods

2.1. Data Acquisition

The research study was conducted at the Eastern Shore Agricultural Research and Extension Center, Painter, Virginia (37.58566°N, 75.78511°W). Soybean for this experiment was planted in two different fields on 17 June 2022, using the Enlist E3® (Corteva Agriscience, Indianapolis, IN, USA) variety at a seeding rate of 140,000 seeds per acre. The experiment included three weed density treatments, low (5 m⁻²), medium (10 m⁻²), and high (20 m⁻²), along with a weed-free control. These densities were created by spreading Palmer amaranth seeds within plots and thinning them to the desired density to allow variation in plant growth, overlap, and competition with soybean. To maintain Palmer amaranth presence, grasses were selectively controlled using clethodim (Select Max®, Valent USA LLC, Walnut Creek, CA, USA), while other broadleaf weeds were removed manually. The experiment followed a randomized complete block design (RCBD) with four replications. Each treatment plot measured 12 m in width and 3.7 m in length, with a 1 m border on all sides. The goal of this experimental design was to introduce variability in weed density and field conditions so that the collected aerial imagery represents realistic scenarios for model development and deployment.

RGB images of soybean and Palmer amaranth were collected at different crop development stages. Soybean image data were collected through multiple flights at vegetative stages (V2–V5 and V6–V8) and reproductive stages (R1–R3 and R5–R8). Palmer amaranth image data were collected for plants ranging from 7.5 cm tall seedlings to approximately 75 cm in height. Apart from the two structured experimental fields, separate fields containing only soybean and natural infestations of Palmer amaranth without soybean were also imaged to capture pure class representations. These additional datasets were collected across two growing seasons (2022 and 2023).

Data were collected using a Zenmuse P1 camera (focal length 0.035 m) mounted on a DJI Matrice-300 UAS (DJI, Shenzhen, China). The drone was flown at a height of 12 m with a constant horizontal speed of 5.6 km/h. The ground sampling distance of the grid flight plan was 0.15 cm/pixel, with side and frontal overlap ratios of 70% and 80%, respectively. The total area covered in one complete flight was 3030 m², with an average flight time of 14 min. Flights were conducted between 9:00 am and 3:00 pm to ensure good lighting conditions. The average wind speed during flights was 4.8 km/h. A total of 16,589 RGB images were collected across the growing seasons. Each raw image had a resolution of 8192 × 5460 pixels. The DJI Matrice-300 UAS with the Zenmuse P1 camera, along with a representative raw RGB image of a soybean field infested with Palmer amaranth, are shown in Figure 1.

2.2. Image Processing and Annotation

The original aerial RGB images had a resolution of 8192 × 5460 pixels, which was too large to be directly processed during model training. To address this issue, image tiles of 640 × 640 pixels were extracted using Python version 3.8.16 and the Pillow library (version 9.5.0). Tiles were generated using a non-overlapping sliding-window approach starting from the top-left corner of each raw image, producing a fixed grid of patches per image. For object detection training, the extracted tiles were manually annotated using LabelImg (https://github.com/heartexlabs/labelImg, accessed on 13 May 2026). A total of 2064 images were labeled with rectangular bounding boxes around soybean and Palmer amaranth plants. During annotation, images with poor visual quality or unclear plant features were excluded. The final dataset captured diversity in lighting conditions, backgrounds, plant densities, and growth stages, as illustrated in Figure 2. In total, 5990 bounding box instances were annotated, including 4513 Palmer amaranth and 1477 soybean objects. The dataset included variation in Palmer amaranth morphology across growth stages and densities, whereas soybean generally exhibited more consistent visual structure. The original YOLO annotations were converted to COCO JSON format to ensure compatibility with Faster R-CNN, RetinaNet, and RT-DETR. Using a unified annotation format ensured that all object detection models were trained and evaluated under identical conditions for fair benchmarking.
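
The tiling and annotation-conversion steps described above can be sketched as follows. This is a minimal illustration rather than the authors' script: the function names `tile_boxes` and `yolo_to_coco_bbox` are hypothetical, and discarding partial tiles at the right and bottom edges is an assumption, since the text only states that the grid is non-overlapping and starts at the top-left corner.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def tile_boxes(width: int, height: int, tile: int = 640) -> List[Box]:
    """Non-overlapping tile boxes, row by row from the top-left corner.

    Assumption: partial tiles at the right/bottom edges are discarded.
    """
    return [
        (x, y, x + tile, y + tile)
        for y in range(0, height - tile + 1, tile)
        for x in range(0, width - tile + 1, tile)
    ]


def yolo_to_coco_bbox(xc: float, yc: float, w: float, h: float,
                      img_w: int = 640, img_h: int = 640) -> List[float]:
    """Convert a normalized YOLO box (centre x, centre y, width, height)
    to a COCO box [x_min, y_min, width, height] in pixels."""
    bw, bh = w * img_w, h * img_h
    return [xc * img_w - bw / 2, yc * img_h - bh / 2, bw, bh]


boxes = tile_boxes(8192, 5460)
print(len(boxes))           # number of full tiles per raw image
print(boxes[0], boxes[-1])  # first and last crop boxes
print(yolo_to_coco_bbox(0.5, 0.5, 0.5, 0.25))
```

Each box uses the (left, upper, right, lower) order expected by Pillow's `Image.crop`, so a tile can be obtained with `image.crop(box)`. Under the stated edge-handling assumption, an 8192 × 5460 image yields a 12 × 8 grid of full tiles.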

To characterize the visual properties of the annotated classes, all 5990 bounding-box instances were analyzed for object size and color distribution. Soybean instances were generally larger than Palmer amaranth instances, with a median bounding-box area of 67,417 px² compared to 40,209 px² for Palmer amaranth (Mann–Whitney U, p = 5.6 × 10⁻²⁴). Soybean foliage also showed a higher excess-green index than Palmer amaranth (39.8 vs. 30.1; Mann–Whitney U, p < 10⁻¹⁹⁰), indicating stronger contrast against the soil background.
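
For readers unfamiliar with the excess-green index, the sketch below computes it for a set of pixels. It assumes the common definition ExG = 2G − R − B on raw 8-bit channel values; the text does not state which ExG variant or normalization was used, so this choice and the function name `excess_green` are illustrative assumptions only.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), 8-bit values


def excess_green(pixels: Iterable[Pixel]) -> float:
    """Mean excess-green index over the given pixels.

    Assumption: ExG = 2G - R - B on unnormalized 8-bit channels.
    """
    values = [2 * g - r - b for r, g, b in pixels]
    if not values:
        raise ValueError("at least one pixel is required")
    return sum(values) / len(values)


# Two hypothetical pixels cropped from a bounding box.
print(excess_green([(50, 100, 40), (60, 90, 50)]))
```

Per-instance values computed this way for the two classes could then be compared with a rank-based test such as `scipy.stats.mannwhitneyu`, which matches the test named in the text.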

A separate set of 1523 raw aerial images that were not part of the object detection dataset was used for SSL pretraining. From each raw image, five 640 × 640 tiles were extracted using flight-metadata-driven sampling (described in Section 2.4), resulting in a total of 7615 unlabeled images for SSL training. These unlabeled images were used to construct positive pairs for contrastive learning. To evaluate the quality of SSL-learned representations for plant species classification, an independent labeled dataset consisting of 468 images (216 soybean and 252 Palmer amaranth) was used as a held-out test set. This classification dataset was fully separated from both the SSL pretraining data and the object detection annotation dataset, ensuring that no image appeared in more than one experimental pipeline.

2.3. Object Detection Networks

Eight object detection models were evaluated in this study: four single-stage anchor-free YOLO detectors (YOLOv8m, YOLOv9m, YOLOv10m, and YOLOv11m), a single-stage anchor-based detector (RetinaNet), a two-stage anchor-based detector (Faster R-CNN), a real-time transformer-based detector (RT-DETR), and a Faster R-CNN model initialized using the SSL framework described in Section 2.4 (Faster R-CNN-GeoCLR). These models represent different object detection approaches, including anchor-based, anchor-free, two-stage, transformer-based, and SSL-initialized detection frameworks. All models were trained and evaluated on the same aerial Palmer amaranth and soybean dataset to compare how model architecture and pretraining strategy influence detection accuracy, inference speed, trainability, and deployment performance.

2.3.1. YOLO

YOLO is a single-stage object detection model that divides an image into a grid of cells, where each cell predicts bounding box coordinates, objectness confidence, and class probabilities in a single forward pass. The model uses three main loss components: bounding box regression loss, objectness loss, and classification loss. Non-Maximum Suppression (NMS) is generally used to remove redundant detections based on IoU thresholds. Unlike earlier YOLO versions that relied on predefined anchor boxes, YOLOv8 through YOLOv11 use an anchor-free detection approach. In addition, YOLOv10 introduces an NMS-free training strategy designed to reduce post-processing overhead during inference [52]. Each YOLO version is available in five model sizes (nano, small, medium, large, and extra-large), allowing trade-offs between accuracy, computational cost, and inference speed. In this study, the medium variants of YOLOv8 through YOLOv11 were used as the primary benchmark models. Additional experiments comparing different sizes of the best-performing YOLO model were also conducted.
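
The NMS post-processing step mentioned above can be illustrated with a short greedy implementation. This is a generic sketch of the standard algorithm, not the code used inside any YOLO release; the 0.5 IoU threshold is an assumed example value.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_thr: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes in descending score order and keep a box
    only if it does not overlap an already kept box above iou_thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # indices of the detections that survive
```

In this example the second box overlaps the first with an IoU of about 0.68, so it is suppressed while the distant third box is kept.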

2.3.2. Faster R-CNN

Faster R-CNN is a two-stage object detection architecture that separates region proposal and classification. A Region Proposal Network (RPN) first generates candidate object regions from shared convolutional feature maps, and these candidate regions are then refined and classified by a secondary detection head. The model uses anchor boxes to represent potential object locations across multiple scales and aspect ratios and is trained using a multi-task loss that combines classification and bounding box regression.

Two variants of Faster R-CNN were evaluated in this study. The first used a standard ResNet-50 backbone initialized with ImageNet-pretrained weights, while the second used a ResNet-50 backbone initialized using the SSL framework described in Section 2.4. All other architectural settings, including the feature pyramid network, RPN configuration, and detection head structure, were kept identical between the two variants.

2.3.3. RetinaNet

RetinaNet is a single-stage object detection model that uses a feature pyramid network for multi-scale feature extraction. The model introduces focal loss to address class imbalance between foreground objects and background regions, which is a common challenge in dense object detection tasks. Smooth L1 loss is used for bounding box regression. Unlike Faster R-CNN, RetinaNet performs object classification and localization in a single forward pass without a separate region proposal stage, making it computationally lighter than two-stage detectors while still using anchor boxes for object localization.
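
The effect of focal loss on easy versus hard examples can be seen in a small binary sketch. The defaults `alpha = 0.25` and `gamma = 2.0` are the values commonly cited from the original focal-loss paper and are assumptions here; the settings used in this study are not stated in this section.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for one prediction.

    p is the predicted foreground probability and y the label (1 or 0).
    The (1 - p_t) ** gamma factor down-weights well-classified examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident, correct foreground prediction
hard = focal_loss(0.1, 1)  # confident, wrong foreground prediction
print(easy < hard)
```

Setting `gamma = 0` reduces the expression to an alpha-weighted cross-entropy, which is a convenient way to check an implementation.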

2.3.4. RT-DETR

RT-DETR is a transformer-based end-to-end object detector that performs object classification and bounding box regression in a single forward pass without using anchor boxes or NMS. Similar to DETR, the model uses a transformer-based encoder–decoder architecture to directly predict object locations and class labels. However, RT-DETR was specifically designed for real-time detection by improving computational efficiency and reducing inference overhead compared to earlier DETR architectures. RT-DETR is trained using a bipartite matching loss that combines classification loss with L1 and Generalized IoU (GIoU) losses for bounding box regression.
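
The GIoU term used for box regression can be sketched as below. This follows the standard definition (IoU minus the fraction of the smallest enclosing box not covered by the union); the function name and the handling of degenerate boxes are illustrative choices, and the corresponding loss is typically taken as 1 − GIoU.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def giou(a: Box, b: Box) -> float:
    """Generalized IoU of two axis-aligned boxes, in the range (-1, 1]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Area of the smallest box enclosing both inputs.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    if union <= 0 or enclose <= 0:
        return 0.0  # degenerate input; illustrative fallback
    return inter / union - (enclose - union) / enclose


print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(giou((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes give a negative value
```

Unlike plain IoU, GIoU remains informative for non-overlapping boxes because the enclosing-box penalty grows with their separation.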

2.4. Self-Supervised Learning

The SSL framework proposed in this study is based on the hypothesis that spatial overlap between adjacent UAS images within a flight provides a stronger positive-pair signal than synthetic augmentations alone, because overlapping images capture the same physical ground region under slightly different viewing angles, illumination conditions, and timing. To evaluate this empirically before constructing the SSL pipeline, a similarity analysis was conducted on 800 image pairs (200 pairs per condition) sampled from four conditions: (i) within-flight pairs, defined as image pairs from the same field and flight whose GPS positions were within 4 m and whose timestamps were within 180 s; (ii) cross-date same-field pairs, defined as image pairs from the same field collected on different dates; (iii) cross-field same-date pairs, defined as image pairs from different fields collected within seven days of each other; (iv) random pairs separated by at least 50 m and one day in time. For each image pair, two similarity distances were computed: a chi-squared distance between 32-bin RGB color histograms generated using the color analysis functions from PlantCV v4 [53], and a cosine distance between 2048-dimensional features extracted from the penultimate layer of an ImageNet-pretrained ResNet-50. The resulting similarity-distance distributions and representative image-pair examples across the evaluated pair-selection conditions are shown in Figure 3. Within-flight pairs showed approximately three-fold lower median color-histogram distance and two-fold lower median ResNet-50 feature distance compared to all other conditions, with all comparisons significant at p < 10⁻²⁴ using one-sided Mann–Whitney U tests. These results support the use of within-flight overlap as the positive-pair signal for contrastive learning.
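
The two pairwise distances used in this analysis can be sketched in plain Python. The chi-squared form below (with the 0.5 factor and a small `eps` for empty bins) is one common variant and is an assumption, as the exact formula is not given in the text; histogram and feature extraction are assumed to have been done beforehand.

```python
import math
from typing import Sequence


def chi2_distance(h1: Sequence[float], h2: Sequence[float], eps: float = 1e-10) -> float:
    """Chi-squared distance between two histograms of equal length.

    Assumption: the symmetric form 0.5 * sum((a - b)^2 / (a + b)).
    """
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus the cosine similarity of two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy 2-bin histograms and 2-D features standing in for the real
# 32-bin color histograms and 2048-dimensional ResNet-50 features.
print(chi2_distance([0.5, 0.5], [0.5, 0.5]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Lower values of either distance indicate more similar image pairs, which is the direction reported for within-flight pairs.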

For each raw image in the unlabeled pretraining set, GPS coordinates, altitude, and timestamp were extracted from EXIF metadata recorded by the Zenmuse P1 camera. GPS coordinates were projected into local Universal Transverse Mercator (UTM, EPSG:32618) coordinates to calculate metric distance. Two images were considered candidate positive-pair sources if their GPS-derived ground positions were within 4 m and their timestamps were within 180 s. The 4 m spatial threshold was derived from the planned flight geometry. The ground footprint of each image was computed from the camera and flight parameters as

$$W_{g} = \frac{H \cdot W_{s}}{f}, \qquad H_{g} = \frac{H \cdot H_{s}}{f},$$

where $H$ is the flight altitude (12 m), $W_{s}$ and $H_{s}$ are the physical sensor width and height (35.9 mm and 24.0 mm for the Zenmuse P1 full-frame sensor), and $f$ is the lens focal length (35 mm). This yields a ground footprint of approximately 12.3 × 8.2 m per image. Given the 8192 × 5460 pixel raw image resolution, this corresponds to ground sampling distances of $\frac{12.3\ \text{m}}{8192\ \text{pixels}} \approx 0.150$ cm/pixel along the image width and $\frac{8.2\ \text{m}}{5460\ \text{pixels}} \approx 0.150$ cm/pixel along the image height, consistent with the planned ground sampling distance. At the planned 80% frontal overlap, consecutive frames along the flight direction are spaced $d_{\text{frame}} = (1 - p_{f}) \cdot W_{g} \approx 2.5$ m apart on the ground, where $p_{f} = 0.80$ is the planned frontal overlap. The 4 m threshold therefore captures adjacent overlapping frames while allowing tolerance for GPS error and excluding most non-overlapping neighboring frames. The 180 s temporal threshold was selected to retain pairs from adjacent flight lines that overlap spatially because of the planned 70% side overlap but may be separated temporally by a full pass length. For each raw image, overlapping neighbors were identified by applying the 4 m and 180 s thresholds to all other images in the pretraining pool. Tile pairs were then constructed by sampling 640 × 640 regions from the spatial overlap zone between the source image and one overlapping neighbor and extracting the corresponding tile from each image. Each tile was constrained to remain fully within its source-image boundary. This procedure was repeated $K = 5$ times per source image, producing 7615 tiles from 1523 raw images in the SSL pretraining pool.
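
The flight-geometry arithmetic above can be reproduced with a few lines of Python. The numeric inputs are those stated in the text; the function and variable names are hypothetical, and the frame spacing follows the text in using the footprint width for the along-track dimension.

```python
from typing import Tuple


def ground_footprint(altitude_m: float, sensor_w_mm: float,
                     sensor_h_mm: float, focal_mm: float) -> Tuple[float, float]:
    """Ground footprint (width, height) in metres for a nadir-pointing camera."""
    return (altitude_m * sensor_w_mm / focal_mm,
            altitude_m * sensor_h_mm / focal_mm)


# Values stated in the text: 12 m altitude, 35.9 x 24.0 mm sensor, 35 mm lens.
w_g, h_g = ground_footprint(12.0, 35.9, 24.0, 35.0)

# Ground sampling distance in cm/pixel along each image axis.
gsd_w_cm = 100.0 * w_g / 8192
gsd_h_cm = 100.0 * h_g / 5460

# Spacing between consecutive frames at 80% frontal overlap.
p_f = 0.80
d_frame = (1.0 - p_f) * w_g

print(f"footprint: {w_g:.1f} x {h_g:.1f} m")
print(f"GSD: {gsd_w_cm:.2f} and {gsd_h_cm:.2f} cm/pixel")
print(f"frame spacing: {d_frame:.1f} m")
```

The resulting spacing of roughly 2.5 m sits comfortably below the 4 m threshold, which is the margin the text attributes to GPS error tolerance.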

The extracted tiles covered a wide range of field conditions, including dense Palmer amaranth clusters, weeds overlapping with soybean canopy, treatment-boundary regions, and bare-soil patches. Positive pairs were intentionally similar but not identical, representing different views of the same physical region under slight variation in viewing angle, illumination, and framing. This variation encourages the encoder to learn features that are robust to nuisance factors instead of memorizing pixel-level appearance. During training, positive pairs were formed from overlapping image footprints satisfying the spatial and temporal thresholds, while negative pairs were formed implicitly within each batch from non-overlapping tiles. The 1523 unlabeled raw images used for SSL pretraining were collected from two RCBD field locations on a single flight date. Restricting SSL pretraining to a single date helped keep all positive pairs within the empirically validated within-flight condition. In contrast, the labeled object detection dataset spans multiple dates and growth stages, allowing evaluation of whether representations learned from a single-date SSL dataset transfer to downstream detection under varying field conditions. The raw images used for object detection annotation were not included in the SSL pretraining pool so that SSL pretraining and downstream detection evaluation remained fully separated. No image used during SSL pretraining later appeared as labeled training or validation data for the detection task.
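
The spatial and temporal thresholds used to identify overlapping neighbors can be sketched as a simple pairwise filter. The sketch assumes coordinates have already been projected to UTM metres (for example with a projection library, which is not shown), treats "within" as less than or equal to the threshold, and uses a hypothetical `Frame` record; none of these details are specified in the text.

```python
import math
from typing import List, NamedTuple, Sequence, Tuple


class Frame(NamedTuple):
    name: str
    easting: float    # UTM easting in metres (assumed pre-projected)
    northing: float   # UTM northing in metres
    timestamp: float  # acquisition time in seconds


def candidate_pairs(frames: Sequence[Frame], max_dist_m: float = 4.0,
                    max_dt_s: float = 180.0) -> List[Tuple[str, str]]:
    """Return all frame pairs that satisfy both the distance and time thresholds."""
    pairs = []
    for i, a in enumerate(frames):
        for b in frames[i + 1:]:
            dist = math.hypot(a.easting - b.easting, a.northing - b.northing)
            if dist <= max_dist_m and abs(a.timestamp - b.timestamp) <= max_dt_s:
                pairs.append((a.name, b.name))
    return pairs


# Hypothetical frames: A, B, C lie along one flight line about 2.5 m apart;
# D is spatially close to A but was captured more than 180 s later.
frames = [
    Frame("A", 0.0, 0.0, 0.0),
    Frame("B", 2.5, 0.0, 3.0),
    Frame("C", 5.0, 0.0, 6.0),
    Frame("D", 0.0, 3.0, 200.0),
]
print(candidate_pairs(frames))
```

The exhaustive double loop is quadratic in the number of frames, which is acceptable for a pool of about 1500 images; a spatial index would be a natural substitute for larger collections.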

The complete SSL framework proposed in this study is referred to as GeoCLR (Geographic Contrastive Learning of Representations). GeoCLR follows the encoder–projection-head architecture and NT-Xent contrastive loss used in SimCLR [46], but differs in how positive pairs are constructed. A shared encoder $f_{\theta}$ based on ResNet-50 was initialized from ImageNet-pretrained weights with the final classification layer removed. The encoder generated a 2048-dimensional feature representation for each image. A two-layer multilayer perceptron projection head $g_{\phi}$ with one hidden layer of width 2048 and ReLU activation projected the encoder features into a 128-dimensional contrastive embedding space. To improve robustness to variation between overlapping views, each image underwent mild photometric augmentation, including random brightness, contrast, saturation, hue jitter, and occasional Gaussian blur. The NT-Xent loss for a positive pair $(i, j)$ within a batch of $N$ positive pairs (yielding $2N$ tiles in total) was defined as

$$L_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_{i}, z_{j}) / \tau_{c})}{\sum_{k=1}^{2N} \mathbb{1}[k \neq i] \exp(\mathrm{sim}(z_{i}, z_{k}) / \tau_{c})},$$

where $z_{i} = g_{\phi}(f_{\theta}(x_{i}))$ denotes the projected embedding of tile $x_{i}$, $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity, $\tau_{c}$ is the temperature parameter, and $\mathbb{1}[\cdot]$ is the indicator function. This objective pulls embeddings from positive-pair tiles closer together while separating embeddings from unrelated tiles within the same batch, allowing the encoder to learn features that remain stable under changes in viewpoint and illumination while still distinguishing different field regions. The GeoCLR encoder was trained for 200 epochs using a batch size of 32 and an input tile size of 640 × 640 pixels. The temperature parameter was set to $\tau_{c} = 0.5$. Optimization used stochastic gradient descent with momentum 0.9, weight decay of 5 × 10⁻⁴, and an initial learning rate of 0.06 with cosine annealing over the full training schedule. After training, the projection head was discarded and the encoder weights were frozen for downstream evaluation. To evaluate whether GeoCLR-learned representations could distinguish Palmer amaranth from soybean, the frozen encoder was used to extract 2048-dimensional features from the held-out 468-image classification dataset. The resulting embeddings were visualized using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) to assess class separability in the learned feature space. The same frozen encoder was then used as the backbone initialization for the Faster R-CNN-GeoCLR detector. The GeoCLR encoder weights replaced the standard ImageNet-pretrained backbone weights, while the feature pyramid network, region proposal network, and detection heads were trained from scratch during downstream detection training. Faster R-CNN was selected as the downstream detector because it is commonly used as a transfer-learning benchmark in SSL literature for object detection evaluation. Faster R-CNN-GeoCLR training and evaluation followed the same protocol described in Section 2.5.
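
The NT-Xent objective can be made concrete with a small reference implementation in plain Python. This is a didactic sketch, not the training code: it assumes the batch is laid out so that `z[i]` and `z[i + N]` form a positive pair and averages the loss over both orderings of every pair, as is conventional in SimCLR-style training. A practical implementation would use a tensor library instead.

```python
import math
from typing import Sequence


def cosine_sim(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nt_xent(z: Sequence[Sequence[float]], tau: float = 0.5) -> float:
    """Mean NT-Xent loss over 2N embeddings.

    Assumed layout: z[i] and z[i + N] are the two views of one positive pair.
    """
    n2 = len(z)
    n = n2 // 2
    total = 0.0
    for i in range(n2):
        j = (i + n) % n2  # index of the positive partner of i
        logits = [cosine_sim(z[i], z[k]) / tau for k in range(n2)]
        # The denominator sums over every tile except the anchor itself.
        denom = sum(math.exp(logits[k]) for k in range(n2) if k != i)
        total += -(logits[j] - math.log(denom))
    return total / n2


# Two positive pairs whose views coincide and whose pairs are orthogonal.
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(nt_xent(z, tau=0.5))
```

For this toy batch each anchor has similarity 1 with its partner and 0 with the two negatives, so the loss equals log(1 + 2e⁻²) for every anchor, which gives a quick sanity check on the implementation.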

2.5. Training and Evaluation

All training and inference experiments were conducted on the Virginia Tech Advanced Research Computing (ARC) Tinkercliffs cluster using NVIDIA A100 GPUs. To evaluate model generalization and reduce split-dependent bias, five-fold cross-validation was implemented [54]. The 2064 annotated image tiles were partitioned into five non-overlapping folds at the level of the original raw aerial image so that tiles extracted from the same aerial image never appeared in both training and validation sets. Fold assignment was stratified to balance Palmer amaranth and soybean instance counts across folds. Standardized training and evaluation settings shared across all detection models, SSL label-fraction experiments, and YOLO variant-size experiments are summarized in Table 1. Architecture-specific settings were retained at their standard published configurations where appropriate.

To evaluate label efficiency, Faster R-CNN-GeoCLR was trained using 10%, 25%, 50%, and 100% of the annotated object detection training data within each fold. A nested subsampling strategy was used in which smaller label fractions formed strict subsets of larger ones. Specifically, the 10% subset was contained within the 25% subset, which was contained within the 50% subset, and all were contained within the full training set. Training images were deterministically ordered using a fixed random seed (seed = 42), and the first 10%, 25%, 50%, and 100% of the ordered list were selected for the respective experiments. This design ensured that performance differences across label fractions reflected only the effect of training-data volume rather than variation in sampled images. Validation was always performed on the full validation fold.
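The nested subsampling strategy can be sketched as follows. This is a minimal illustration, assuming the training images of one fold are identified by string IDs; the function name, the sorting step, and the rounding rule are assumptions rather than details reported in the study.

```python
import random

def nested_label_subsets(image_ids, fractions=(0.10, 0.25, 0.50, 1.00), seed=42):
    """Return {fraction: list of image IDs}, where smaller subsets are prefixes of larger ones."""
    ordered = sorted(image_ids)            # canonical order so the shuffle is reproducible
    random.Random(seed).shuffle(ordered)   # fixed seed gives one deterministic ordering
    n = len(ordered)
    # Taking the first round(f * n) items of the same ordering makes the subsets nested.
    return {f: ordered[: max(1, round(f * n))] for f in fractions}

# Hypothetical fold with 200 training tiles.
train_ids = [f"tile_{i:04d}" for i in range(200)]
subsets = nested_label_subsets(train_ids)
```

Because every subset is a prefix of the same shuffled list, the 10% subset is contained in the 25% subset, which is contained in the 50% subset, matching the design described above.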

Beyond standard detection metrics, three deployment-oriented operational metrics were computed to evaluate how detector predictions for Palmer amaranth translate into practical site-specific spraying outcomes: the weed coverage rate (WCR), the herbicide saving rate (HSR), and the weed coverage rate area under the curve (WCR-AUC). These metrics are more directly connected to field deployment because a detector with high mAP may still miss weeds that fall outside the spray region or trigger spraying over areas that contain no weeds. Spray geometry was modeled after the Ecorobotix® (Yverdon, Switzerland) ARA field sprayer [55] as an operational example, which reports ultra-localized spraying precision of up to 6 cm × 6 cm. Each predicted weed detection retained at confidence threshold $\tau$ was assumed to trigger a single spray cell centered at the detection bounding-box centroid. At the ground sampling distance of 0.15 cm/pixel used in this study, each spray cell corresponded to a $40 \times 40$ pixel region in image space (6 cm / 0.15 cm per pixel = 40 pixels).

Let $G$ denote the set of ground-truth Palmer amaranth instances in an image, let $I$ denote the image domain, and let $D(\tau)$ denote the set of detections retained after applying confidence threshold $\tau$. For any predicted detection $d \in D(\tau)$, let $c_d \in \mathbb{R}^2$ denote its centroid in pixel coordinates and let $C_d(\tau) \subset I$ denote the corresponding $40 \times 40$ pixel spray cell centered at $c_d$. Similarly, for any ground-truth Palmer amaranth instance $g \in G$, let $c_g$ denote its annotated centroid.

WCR at threshold $\tau$ was defined as

$$\mathrm{WCR}(\tau) = \frac{1}{|G|} \sum_{g \in G} \mathbb{1}\left[ \min_{d \in D(\tau)} \lVert c_g - c_d \rVert_\infty \leq r \right]$$

where $\lVert \cdot \rVert_\infty$ denotes the Chebyshev ($L_\infty$) distance, $r = 20$ pixels corresponds to half the spray-cell side length, and $\mathbb{1}[\cdot]$ is the indicator function. A ground-truth Palmer amaranth was considered covered if its centroid fell within the spray region of at least one retained detection.
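The WCR definition above can be sketched in a few lines. The example assumes centroids are given as (x, y) pixel tuples and that the detections passed in have already been filtered at the chosen confidence threshold; the function name and the handling of images without ground-truth weeds are illustrative assumptions.

```python
def weed_coverage_rate(gt_centroids, det_centroids, r=20):
    """Fraction of ground-truth weed centroids within Chebyshev distance r of any detection."""
    if not gt_centroids:
        return float("nan")  # assumption: WCR is undefined for images with no ground-truth weeds
    covered = 0
    for gx, gy in gt_centroids:
        # Chebyshev distance is the larger of the two per-axis absolute differences.
        if any(max(abs(gx - dx), abs(gy - dy)) <= r for dx, dy in det_centroids):
            covered += 1
    return covered / len(gt_centroids)

# Three weeds, two retained detections: only the first weed lies inside a spray cell.
gt = [(100, 100), (300, 300), (500, 120)]
det = [(110, 85), (340, 300)]
wcr = weed_coverage_rate(gt, det)  # 1 of 3 weeds covered
```

The second weed is 40 pixels from the nearest detection along the x-axis, so it falls outside the 20-pixel half-width and is counted as missed.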

HSR at threshold $\tau$ was defined as the proportion of image area that remained unsprayed:

$$\mathrm{HSR}(\tau) = 1 - \frac{\left| \bigcup_{d \in D(\tau)} C_d(\tau) \right|}{|I|}$$

where $|I|$ denotes the total image area. The union of spray cells was implemented using raster operations so that overlapping spray regions from adjacent detections were merged and counted only once.
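A minimal raster-style sketch of the HSR computation is shown below. It assumes a 640 × 640 pixel tile, integer-rounded centroids, and spray cells clipped at the image border; these choices, along with the function name, are illustrative assumptions and not the study's implementation.

```python
def herbicide_saving_rate(det_centroids, width=640, height=640, cell=40):
    """Proportion of the image left unsprayed when each detection triggers one square cell."""
    half = cell // 2
    sprayed = set()  # set of (x, y) pixels, so overlapping cells are counted only once
    for cx, cy in det_centroids:
        cx, cy = int(round(cx)), int(round(cy))
        # Clip the cell to the image so that it stays inside the image domain.
        x0, x1 = max(0, cx - half), min(width, cx + half)
        y0, y1 = max(0, cy - half), min(height, cy + half)
        for y in range(y0, y1):
            for x in range(x0, x1):
                sprayed.add((x, y))
    return 1.0 - len(sprayed) / (width * height)

hsr_single = herbicide_saving_rate([(320, 320)])               # one full 40 x 40 cell
hsr_overlap = herbicide_saving_rate([(320, 320), (340, 320)])  # two cells sharing a 20-pixel strip
```

With two cells offset by 20 pixels, the union covers 60 × 40 = 2400 pixels rather than 3200, which is the merging behavior described for overlapping spray regions.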

Both WCR and HSR depend on the confidence threshold $\tau$, which determines which detections are retained as spray triggers. Existing precision-spraying studies [50] have reported these metrics only at fixed operating points, leaving no threshold-independent summary of detector performance. To address this limitation, this study introduces WCR-AUC, defined as the normalized area under the WCR($\tau$) curve over $\tau \in [0.05, 0.95]$:

$$\mathrm{WCR\text{-}AUC} = \frac{1}{\tau_{\max} - \tau_{\min}} \int_{\tau_{\min}}^{\tau_{\max}} \mathrm{WCR}(\tau) \, d\tau$$

where $\tau_{\min} = 0.05$ and $\tau_{\max} = 0.95$. The integral was approximated using the trapezoidal rule with threshold increments of 0.05. WCR and HSR were reported at $\tau = 0.25$ as representative single-operating-point operational metrics.
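The trapezoidal approximation of WCR-AUC can be sketched as follows. The callable `wcr_at`, which returns WCR at a given threshold, is a hypothetical stand-in for re-running the coverage computation on detections filtered at that threshold.

```python
def wcr_auc(wcr_at, tau_min=0.05, tau_max=0.95, step=0.05):
    """Normalized area under the WCR(tau) curve using the trapezoidal rule."""
    n = round((tau_max - tau_min) / step)            # 18 intervals, 19 thresholds
    taus = [tau_min + i * step for i in range(n + 1)]
    values = [wcr_at(t) for t in taus]
    area = sum(
        0.5 * (values[i] + values[i + 1]) * (taus[i + 1] - taus[i])
        for i in range(n)
    )
    return area / (taus[-1] - taus[0])               # normalize by the threshold range

# Toy curves: constant coverage and coverage that declines linearly with the threshold.
auc_constant = wcr_auc(lambda tau: 0.7)
auc_linear = wcr_auc(lambda tau: 1.0 - tau)
```

Because of the normalization, a detector with constant coverage of 0.7 receives a WCR-AUC of 0.7, and the linearly declining toy curve receives its mean value over the threshold range, approximately 0.5.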

2.6. Statistical Analysis

To evaluate whether performance differences among object detection models were statistically significant, non-parametric statistical tests were used because the analysis involved paired measurements across five cross-validation folds without assuming normality of the performance metrics [56,57]. The Friedman test was first used as the global repeated-measures comparison across all models. The average rank of the $k$-th model $(k = 1, \dots, K)$ was defined as

$$R_k = \frac{1}{J} \sum_{j=1}^{J} r_k^j,$$

where $J$ is the number of datasets (here $J = 5$ folds) and $r_k^j$ is the rank assigned to the $k$-th model in the $j$-th fold. The Friedman statistic was then calculated as

$$\chi_F^2 = \frac{12J}{K(K+1)} \left( \sum_{k=1}^{K} R_k^2 - \frac{K(K+1)^2}{4} \right),$$

which follows a chi-square distribution with $K - 1$ degrees of freedom under the null hypothesis.

Because the Friedman test can be conservative for small sample sizes, the Iman–Davenport correction was also applied:

$$F_{ID} = \frac{(J - 1) \chi_F^2}{J(K - 1) - \chi_F^2},$$

which follows an F-distribution with $(K - 1)$ and $(K - 1)(J - 1)$ degrees of freedom.
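The rank-based statistics above can be reproduced with a short sketch. It assumes a score matrix with one row per fold and one column per model, where higher scores are better; the function names are illustrative, p-values are omitted to keep the example dependency-free, and the plain Friedman formula is used without a tie correction.

```python
def average_ranks(scores):
    """scores[j][k] is the metric of model k on fold j; rank 1 is best, ties share the mean rank."""
    n_folds, n_models = len(scores), len(scores[0])
    totals = [0.0] * n_models
    for row in scores:
        order = sorted(range(n_models), key=lambda k: -row[k])
        i = 0
        while i < n_models:
            j = i
            while j + 1 < n_models and row[order[j + 1]] == row[order[i]]:
                j += 1                        # extend the block of tied scores
            mean_rank = (i + j) / 2 + 1       # positions i..j are 0-based
            for m in range(i, j + 1):
                totals[order[m]] += mean_rank
            i = j + 1
    return [t / n_folds for t in totals]

def friedman_chi2(avg_ranks, n_folds):
    """Friedman statistic from average ranks R_k over J folds."""
    k = len(avg_ranks)
    return 12 * n_folds / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)

def iman_davenport(chi2, n_folds, n_models):
    """Iman-Davenport F statistic derived from the Friedman statistic."""
    return (n_folds - 1) * chi2 / (n_folds * (n_models - 1) - chi2)

# Toy example: three models ranked identically on two folds.
ranks = average_ranks([[0.9, 0.8, 0.7], [0.9, 0.8, 0.7]])
chi2_toy = friedman_chi2(ranks, n_folds=2)

# Plugging in a Friedman statistic of 26.73 with J = 5 folds and K = 8 models.
f_id = iman_davenport(26.73, n_folds=5, n_models=8)  # about 12.93
```

With $J = 5$ and $K = 8$, a Friedman statistic of 26.73 gives an Iman–Davenport value of roughly 12.93, which agrees with the value reported in Section 3.1 up to rounding of the inputs.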

When the global test rejected the null hypothesis at $\alpha = 0.05$, post hoc pairwise comparisons were performed using the Nemenyi test [58], which compares the average ranks of all detector pairs while controlling for multiple comparisons. The Nemenyi test was selected because it is commonly used as the post hoc procedure following the Friedman test for comparing multiple models across datasets or cross-validation folds.
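An equivalent way to read the Nemenyi test is through its critical difference (CD): two models differ significantly when their average ranks differ by at least CD = q_alpha * sqrt(K(K+1) / (6J)). The sketch below uses this textbook formulation, whereas the study reports pairwise p-values. The q values are the commonly tabulated critical values at alpha = 0.05 and should be checked against a statistical reference before reuse.

```python
import math

# Commonly tabulated critical values q_alpha at alpha = 0.05 for the two-tailed
# Nemenyi test, indexed by the number of models K (assumption: quoted from standard tables).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def nemenyi_cd(n_models, n_folds, q_table=Q_ALPHA_005):
    """Critical difference in average rank for K models compared over J folds."""
    return q_table[n_models] * math.sqrt(n_models * (n_models + 1) / (6 * n_folds))

cd = nemenyi_cd(n_models=8, n_folds=5)  # about 4.70

def differs(rank_a, rank_b, critical_difference=cd):
    """True when the gap between two average ranks reaches the critical difference."""
    return abs(rank_a - rank_b) >= critical_difference
```

For eight detectors and five folds the critical difference is about 4.70 rank units, so only pairs at opposite ends of the ranking can be separated, which illustrates how little power five folds provide for pairwise comparisons.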

Deployment-characteristic metrics including parameter count, GFLOPs, inference latency, throughput, and GPU memory usage were reported as point estimates and were not subjected to statistical testing because they depend on architecture design and inference hardware rather than fold-level variation. All statistical analyses were implemented in Python using the scipy and statsmodels libraries.

2.7. Overall Workflow

The complete experimental workflow used in this study is summarized in Figure 4. The pipeline consisted of two parallel branches originating from the same aerial image collection process. The first branch focused on supervised object detection benchmarking across multiple detector families, while the second branch focused on GeoCLR-based self-supervised pretraining using unlabeled aerial imagery.

The object detection branch included model benchmarking, YOLOv8 variant-size comparison, and label-fraction ablation experiments. The GeoCLR branch pretrained a ResNet-50 encoder using overlapping flight-image regions as positive pairs for contrastive learning. The pretrained encoder was then used for downstream transfer into Faster R-CNN-GeoCLR and for feature-space visualization.

Outputs from both branches were integrated into a unified evaluation stage consisting of detection, operational spraying, and deployment-oriented metrics, followed by statistical comparison and deployment-level interpretation.

3. Results

3.1. Detection Metrics Results

The eight object detection models were compared in terms of mAP at IoU thresholds of 0.5 and 0.5:0.95, as well as class-wise average precision for soybean and Palmer amaranth. All results are reported as mean ± standard deviation across the five cross-validation folds to indicate the stability of performance.

At IoU 0.5, the four YOLO models achieved similar mAP values in a narrow range of 0.790–0.806 (Figure 5a): YOLOv9m at 0.806 ± 0.015, YOLOv11m at 0.804 ± 0.013, YOLOv8m at 0.799 ± 0.016, and YOLOv10m at 0.790 ± 0.020. Faster R-CNN reached 0.795 ± 0.016, placing it within the YOLO performance range, while RetinaNet followed at 0.781 ± 0.021. Faster R-CNN-GeoCLR achieved 0.771 ± 0.017. RT-DETR recorded the lowest mAP@0.5 at 0.733 ± 0.037 and also showed the highest fold-to-fold variability. The relatively small standard deviations across folds indicate that detector performance remained stable across different train–test partitions of the dataset. At IoU 0.5, the overall gap between the strongest and weakest detectors was modest, spanning approximately 0.07 mAP.

The architectural differences became more apparent under the stricter mAP@0.5:0.95 criterion (Figure 5b). The four YOLO models maintained the highest performance, with mAP values in a narrow range of 0.514–0.525 (YOLOv11m at 0.525 ± 0.016, YOLOv9m at 0.522 ± 0.013, YOLOv8m at 0.514 ± 0.015, and YOLOv10m at 0.514 ± 0.013). In comparison, Faster R-CNN, RetinaNet, Faster R-CNN-GeoCLR, and RT-DETR dropped to 0.428–0.465, with RT-DETR recording the lowest value at 0.428 ± 0.033. The larger performance gap under the stricter criterion suggests that the YOLO models produced more precise bounding-box localization, while the non-YOLO detectors were more strongly penalized as the overlap requirement increased.

Class-specific average precision analysis showed that soybean was generally detected more accurately than Palmer amaranth across all detectors (Figure 6). For soybean, the YOLO variants performed nearly identically: YOLOv9m at 0.829 ± 0.025, YOLOv11m at 0.827 ± 0.021, YOLOv8m at 0.826 ± 0.027, and YOLOv10m at 0.811 ± 0.029. The non-YOLO detectors fell in a slightly lower band, with Faster R-CNN at 0.806 ± 0.026, Faster R-CNN-GeoCLR at 0.788 ± 0.033, and RetinaNet at 0.787 ± 0.034. RT-DETR recorded the lowest soybean AP at 0.752 ± 0.043. For Palmer amaranth, the eight detectors spanned a narrow range of 0.713–0.784. Faster R-CNN, YOLOv9m, and YOLOv11m sat at the top of this range (0.784, 0.782, and 0.782, respectively), with RetinaNet (0.775 ± 0.010), YOLOv8m (0.772 ± 0.014), and YOLOv10m (0.769 ± 0.014) only marginally behind. Faster R-CNN-GeoCLR achieved 0.755 ± 0.016, and RT-DETR recorded the lowest Palmer amaranth AP at 0.713 ± 0.034.

To assess whether the observed performance differences across detectors were statistically significant, mAP@0.5 was used as the primary metric for statistical testing, and the per-fold values were compared using the Friedman test. The Friedman test showed significant variation among the eight detectors ($\chi^2(7) = 26.73$, $p = 3.7 \times 10^{-4}$), and the Iman–Davenport correction confirmed this result ($F = 12.94$, $p < 10^{-4}$). Average ranks across the five folds placed YOLOv9m first (mean rank 1.8), followed by YOLOv11m (2.2) and YOLOv8m (3.2), while RT-DETR ranked lowest (7.6) and Faster R-CNN-GeoCLR second-lowest (7.0). Post hoc pairwise comparisons using the Nemenyi test identified significant differences primarily between the strongest and weakest detectors. YOLOv9m and YOLOv11m both significantly outperformed RT-DETR ($p = 0.0045$ and $p = 0.0115$, respectively) and Faster R-CNN-GeoCLR ($p = 0.0180$ and $p = 0.0409$, respectively). No additional pairwise comparisons reached the significance threshold. The large middle group of detectors, including YOLOv8m, YOLOv10m, Faster R-CNN, and RetinaNet, could not be statistically separated from one another or from the top-ranked YOLO models at IoU 0.5. This result suggests that although the strongest YOLO variants showed measurable advantages over the weakest detectors, most architectures achieved relatively similar performance under the permissive IoU 0.5 criterion. The limited number of folds ($n = 5$) also reduces the power of the pairwise comparisons to resolve smaller differences among closely clustered detectors.

3.2. Operational Metrics Results

Operational performance results were computed under the simulated precision-spraying protocol described in Section 2.5, in which each retained detection triggered a single $6 \times 6$ cm spray cell modeled after the Ecorobotix® ARA commercial sprayer [55]. Results are reported as mean ± standard deviation across the five cross-validation folds (Table 2). HSR showed limited variation across most detectors. At $\tau = 0.25$, HSR ranged from 0.968 to 0.992, indicating that all detectors triggered spray coverage over only a small fraction of the total image area. The per-threshold analysis shown in Figure 7 indicated that HSR exceeded 0.97 by $\tau = 0.15$ for six of the eight detectors; RT-DETR, which produced the highest volume of low-confidence detections, reached that level by $\tau = 0.25$. The largest HSR reductions occurred at low confidence thresholds ($\tau \leq 0.10$), where RT-DETR and RetinaNet produced larger numbers of low-confidence detections that activated additional spray cells.

Compared with HSR, WCR and WCR-AUC provided stronger separation among detectors. Faster R-CNN achieved the highest WCR-AUC at $0.664 \pm 0.018$, followed by Faster R-CNN-GeoCLR at $0.645 \pm 0.034$ and RT-DETR at $0.609 \pm 0.046$. RetinaNet followed at $0.535 \pm 0.031$, while the four YOLO models produced the lowest WCR-AUC values, ranging from 0.483 to 0.501.

The statistical significance of the WCR-AUC ranking was evaluated using the same Friedman and Nemenyi framework described in Section 2.6. The Friedman test rejected the null hypothesis of equal detector performance ($\chi^2(7) = 28.73$, $p = 1.6 \times 10^{-4}$), and the Iman–Davenport correction confirmed this result ($F = 18.34$, $p < 10^{-5}$). Average WCR-AUC ranks placed Faster R-CNN highest (1.2), followed by Faster R-CNN-GeoCLR (2.0) and RT-DETR (3.0), while the YOLO models occupied the four lowest ranks (6.0–7.0). Post hoc Nemenyi comparisons showed that Faster R-CNN significantly outperformed all four YOLO models ($p = 0.005$–$0.041$), and Faster R-CNN-GeoCLR significantly outperformed YOLOv10m ($p = 0.027$). No additional pairwise differences reached statistical significance.

The threshold-dependent behavior of the detectors is shown in Figure 7a. RT-DETR achieved the highest WCR at the default operating point ($0.790 \pm 0.025$ at $\tau = 0.25$), exceeding Faster R-CNN by approximately 0.085. However, RT-DETR coverage decreased more rapidly as the confidence threshold increased, causing its overall WCR-AUC to fall below both Faster R-CNN variants. In contrast, Faster R-CNN maintained more stable weed coverage across the full threshold range.

3.3. Deployment Metrics Results

Deployment characteristics were benchmarked for all eight detectors on a single NVIDIA A100 GPU using $640 \times 640$ input resolution and FP32 precision. Batch sizes of 1, 4, 8, 16, and 32 were evaluated. For each configuration, latency was measured over 200 timed iterations following 50 warm-up iterations. Five deployment metrics were recorded: parameter count, GFLOPs, inference latency, throughput in frames per second (FPS), and peak GPU memory usage. To complement the quantitative deployment metrics, representative prediction examples generated by YOLOv8m on held-out aerial field images that were not part of the annotated object detection dataset are shown in Figure 8. These examples span a range of Palmer amaranth densities, canopy overlap conditions, and field backgrounds observed across the test imagery and serve as a visual reference rather than for quantitative evaluation.
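The timing protocol described above (50 warm-up iterations followed by 200 timed iterations) can be sketched generically as follows. The `run_batch` callable and the optional `sync` hook are hypothetical placeholders: on a GPU, `sync` would need to block until queued device work has finished (for example `torch.cuda.synchronize` in PyTorch), since otherwise only the kernel launch time is measured. This is an illustrative harness, not the benchmarking code used in the study.

```python
import statistics
import time

def benchmark(run_batch, batch_size, warmup=50, iters=200, sync=None):
    """Return (mean latency per batch in ms, throughput in frames per second)."""
    for _ in range(warmup):            # warm-up iterations are executed but not timed
        run_batch()
    if sync is not None:
        sync()
    durations = []
    for _ in range(iters):
        start = time.perf_counter()
        run_batch()
        if sync is not None:           # wait for asynchronous device work before stopping the clock
            sync()
        durations.append(time.perf_counter() - start)
    mean_seconds = statistics.mean(durations)
    return 1000.0 * mean_seconds, batch_size / mean_seconds

# CPU stand-in for a forward pass, timed with a shortened schedule.
latency_ms, fps = benchmark(lambda: sum(range(20000)), batch_size=4, warmup=2, iters=5)
```

At batch size 1 the two outputs are reciprocal (FPS = 1000 / latency in ms), whereas at larger batch sizes throughput also reflects how well the per-batch cost is amortized across frames.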

Headline deployment metrics at batch size 1, representing the regime most relevant for real-time field deployment, are summarized in Table 3, while batch-size scaling of latency and throughput is presented in Figure 9.

The eight detectors spanned a broad range of model sizes, from 16.5 M parameters for YOLOv10m to 43.3 M for Faster R-CNN-GeoCLR. At batch size 1, the YOLO models achieved the lowest inference latency. YOLOv8m was fastest at 8.3 ms per frame (120.9 FPS), followed by YOLOv11m at 10.8 ms (92.3 FPS), YOLOv10m at 13.0 ms (76.7 FPS), and YOLOv9m at 15.1 ms (66.2 FPS). RetinaNet and the two Faster R-CNN variants formed a middle group at 16.3–18.9 ms (53–62 FPS). RT-DETR showed the highest single-frame latency at 36.7 ms (27.2 FPS).

The batch-size sweep showed clear differences in scaling behavior across architectures (Figure 9). The YOLO models scaled most efficiently, with throughput increasing strongly as batch size increased. YOLOv8m reached approximately 508 FPS at batch size 32. RT-DETR also scaled well under batching, increasing from 27 FPS at batch size 1 to 220 FPS at batch size 32, suggesting that a larger portion of its runtime comes from fixed computational overhead that becomes amortized under larger batches. In contrast, Faster R-CNN, Faster R-CNN-GeoCLR, and RetinaNet scaled more slowly, reaching approximately 72–121 FPS at batch size 32. These differences are important for deployment because single-frame systems depend mainly on low inference latency, while buffered systems benefit more from high batched throughput.

GPU memory usage generally followed model size and architecture complexity. At batch size 1, the YOLO models required only 0.30–0.34 GB of GPU memory, RT-DETR required 0.50 GB, and the two-stage detectors required 0.72–1.08 GB. The gap widened further under batching. At batch size 32, the YOLO models and RT-DETR remained between 2.46 and 3.63 GB, whereas Faster R-CNN reached 6.70 GB and Faster R-CNN-GeoCLR reached 10.57 GB.

3.4. Self-Supervised Feature Analysis

To better understand what the GeoCLR self-supervised pretraining contributed beyond downstream detection accuracy, the learned ResNet-50 backbone was analyzed in two complementary ways: (i) separability of the learned feature embeddings for Palmer amaranth and soybean, and (ii) spatial activation structure across network depth. In both analyses, the GeoCLR-pretrained backbone was compared against an ImageNet-pretrained backbone of identical architecture, with all feature extraction and processing steps kept identical so that any observed differences were attributable to the pretraining strategy alone. Both pretraining schemes produced class-separable features, which is expected given the visual differences between Palmer amaranth and soybean. However, the GeoCLR backbone separated the two classes substantially more cleanly and compactly than the ImageNet-pretrained backbone. The frozen GeoCLR backbone produced clear separation between Palmer amaranth and soybean in the feature space (Figure 10a). In the PCA projection, the two classes separated primarily along PC1, which explained 42.7% of the variance, while PC2 explained an additional 22.6%, resulting in 65.3% cumulative variance captured by the first two principal components. In contrast, the first two principal components of the ImageNet-pretrained backbone captured only 20.3% of the total variance, indicating that the GeoCLR features concentrated the class-relevant structure into a substantially lower-dimensional subspace.

The t-SNE projection showed a similar trend, with Palmer amaranth and soybean forming compact and well-separated clusters under GeoCLR pretraining. This improvement was also reflected quantitatively by the silhouette coefficient computed on the full 2048-dimensional feature space, which increased from 0.113 for the ImageNet-pretrained backbone to 0.335 for the GeoCLR backbone, representing approximately a threefold improvement in class separation quality. Together, the embedding projections and silhouette analysis showed stronger class separation under GeoCLR pretraining than under ImageNet pretraining.
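For readers unfamiliar with the silhouette coefficient, the sketch below computes the mean silhouette for labeled feature vectors. It assumes Euclidean distance, which the text does not specify, and uses tiny two-dimensional toy features rather than the 2048-dimensional embeddings analyzed in the study.

```python
import math

def silhouette(features, labels):
    """Mean silhouette coefficient using Euclidean distance and the given class labels."""
    n = len(features)
    scores = []
    for i in range(n):
        by_label = {}  # label -> distances from sample i to the other samples with that label
        for j in range(n):
            if j != i:
                by_label.setdefault(labels[j], []).append(math.dist(features[i], features[j]))
        own = by_label.get(labels[i])
        others = [d for lab, d in by_label.items() if lab != labels[i]]
        if not own or not others:
            scores.append(0.0)                           # convention for singleton or single-class cases
            continue
        a = sum(own) / len(own)                          # mean distance within the sample's own class
        b = min(sum(d) / len(d) for d in others)         # mean distance to the nearest other class
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n

labels = ["palmer", "palmer", "soybean", "soybean"]
separated = silhouette([[0, 0], [0, 1], [10, 0], [10, 1]], labels)   # compact, distant classes
intermixed = silhouette([[0, 0], [10, 0], [0, 1], [10, 1]], labels)  # classes interleaved
```

Values near 1 indicate compact, well-separated classes and values near or below 0 indicate overlapping classes, which gives a scale for interpreting the reported increase from 0.113 to 0.335.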

To characterize how the two backbones responded spatially across network depth, activation maps were extracted from three successive feature levels (L0, L1, and L2) for representative aerial tiles spanning background-only regions, soybean-dominant regions, Palmer-dominant regions, and mixed Palmer–soybean scenes (Figure 10b). Activation maps from the GeoCLR-pretrained and ImageNet-pretrained backbones were visualized using the same normalization scale at each level so that differences in intensity reflected differences in feature response rather than independent per-image rescaling. The clearest difference between the two backbones appeared in the background-only tile. Across all three feature levels, the GeoCLR backbone produced relatively weak and spatially uniform responses over bare soil and crop-residue regions, and this suppression became stronger at deeper levels. By L2, the background region was largely quiescent with only limited localized activation remaining. In contrast, the ImageNet-pretrained backbone produced broader responses across the same background tile, including strong activations near image borders and corners that became increasingly pronounced at deeper levels. A similar trend was observed across tiles containing vegetation. In the soybean-only, Palmer-only, and mixed Palmer–soybean scenes, the GeoCLR backbone produced more spatially coherent and selective responses that increasingly concentrated around vegetation structure with depth. The ImageNet-pretrained backbone produced broader responses extending further into background texture and edge regions, particularly at the deeper L2 level.

3.5. Ablation Studies

Two ablation studies were conducted to support key methodological decisions used throughout the main benchmark. The first evaluated whether changing detector capacity materially affected detection accuracy within the YOLOv8 family, addressing the choice of the medium-capacity variant used in the main experiments. The second evaluated the label efficiency of Faster R-CNN-GeoCLR by quantifying how detection performance changes as the amount of annotated training data is reduced.

3.5.1. Effect of Model Scale

The main benchmark used the medium-capacity variant for each YOLO family. To determine whether this choice influenced the comparative results, additional experiments were conducted using the nano, small, and large variants of YOLOv8 under the identical 5-fold cross-validation protocol, training schedule, and data splits used in the primary benchmark (Table 4).

Detector scale produced only modest differences in accuracy. Mean mAP@0.5 ranged from 0.785 to 0.800 across the three additional variants, corresponding to a spread of approximately 1.5 percentage points. The relationship between model size and performance was also non-monotonic: the nano variant achieved performance comparable to the large variant, whereas the small variant recorded the lowest mean mAP@0.5 (0.785). The medium-capacity model used in the primary benchmark fell within this same narrow performance range. Similar trends were observed for mAP@0.5:0.95.

These results suggest that, for this two-class aerial weed-detection task, increasing detector capacity beyond the medium regime provides limited accuracy benefit relative to the additional computational cost. The medium-capacity variants therefore provided a reasonable tradeoff between detection performance and deployment efficiency, while also enabling a consistent comparison across detector families.

3.5.2. Label Efficiency of GeoCLR

Because GeoCLR pretraining operates entirely on unlabeled aerial imagery, only the detector fine-tuning stage requires manually annotated bounding-box labels. To quantify how detection performance scales with annotation availability, Faster R-CNN-GeoCLR was fine-tuned using 10%, 25%, 50%, and 100% of the manually annotated training data, as described in Section 2.5 (Table 5).

Detection accuracy increased rapidly at low annotation fractions and then gradually approached saturation as additional bounding-box annotations were introduced. Using only 10% of the annotated training data, the detector achieved an mAP@0.5 of 0.622, corresponding to 80.6% of the full-data performance. At 50% of the annotated training data, performance increased to 0.730 mAP@0.5, recovering 94.6% of the full-data result. Similar behavior was observed for mAP@0.5:0.95, which increased from 0.327 at 10% labels to 0.418 at 50%.

The performance gains per additional annotation fraction progressively decreased as more bounding-box labels were added. Increasing the label fraction from 10% to 25% improved mAP@0.5 by 0.065, whereas increasing the label fraction from 50% to 100% improved performance by only 0.042. This indicates that much of the discriminative representation learned through GeoCLR pretraining can already be transferred using a relatively small subset of manually annotated bounding boxes.
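The diminishing returns described above can be checked with simple arithmetic. In the sketch below, the 10% and 50% values are those quoted in the text, while the 25% and 100% values are inferred from the stated increments (+0.065 and +0.042) rather than read from Table 5, so they should be treated as assumptions.

```python
# mAP@0.5 by label fraction; the 0.25 and 1.00 entries are inferred from stated increments.
map50 = {
    0.10: 0.622,
    0.25: 0.622 + 0.065,
    0.50: 0.730,
    1.00: 0.730 + 0.042,
}

# Share of full-data accuracy recovered at each label fraction.
recovered = {frac: value / map50[1.00] for frac, value in map50.items()}

# Gain in mAP@0.5 per additional percentage point of labeled data between consecutive fractions.
fractions = sorted(map50)
gain_per_point = [
    (map50[hi] - map50[lo]) / (100 * (hi - lo))
    for lo, hi in zip(fractions, fractions[1:])
]

pct_10 = round(100 * recovered[0.10], 1)  # about 80.6
pct_50 = round(100 * recovered[0.50], 1)  # about 94.6
```

Under these inferred values, the marginal gain per percentage point of labeled data falls at each step, from roughly 0.0043 between 10% and 25% to roughly 0.0008 between 50% and 100%.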

4. Discussion

This study evaluated weed detection models from a practical precision-agriculture perspective. The goal was not only to identify which detector produced the highest accuracy, but also to understand which models may be useful for site-specific spraying. For agricultural users, this distinction is important because a model with strong detection accuracy is not automatically the best model for field deployment. A useful spraying model should detect weeds reliably, maintain good weed coverage across confidence thresholds, and run fast enough for the intended hardware platform.

The results showed that detector rankings changed depending on the evaluation criterion. The YOLO models achieved the strongest detection accuracy and the lowest computational cost, while the Faster R-CNN models showed stronger threshold-independent weed coverage. This means that model selection should depend on the final use case. A real-time UAS or edge-based spraying workflow may favor a faster detector, whereas an offline mapping or decision-support workflow may tolerate a slower model if weed coverage is the main objective. GeoCLR also addressed a different but related problem: reducing dependence on large manually labeled datasets by learning useful representations from unlabeled overlapping UAS images.

4.1. Suitability of YOLO Architectures for Aerial Weed Detection

The four YOLO variants achieved the highest detection accuracy, and their advantage became more apparent under the stricter mAP@0.5:0.95 criterion. This suggests that the YOLO models not only identified the correct plant class but also localized plant regions more accurately than the two-stage, transformer-based, and non-YOLO single-stage detectors evaluated in this study. This result is consistent with prior weed-detection studies showing that YOLO-based models are effective for crop-weed detection in field images, including Palmer amaranth detection in soybean and cotton systems [36,40,41].

At the same time, previous weed-detection studies were often limited to fewer model generations, smaller datasets, ground-level imagery, or more controlled field conditions [36,40,59,60]. The present study extends this earlier work by comparing recent YOLO generations with two-stage, transformer-based, and non-YOLO single-stage detectors using high-resolution aerial imagery collected across multiple plant growth stages. This broader comparison is important because recent work has also shown that newer YOLO versions do not always outperform older versions under all weed-detection conditions [41]. Therefore, agricultural users should select models based on field performance, deployment needs, and operating conditions rather than model release order alone.

One likely reason the YOLO family performed well is that the image structure matched the strengths of single-stage detection. The aerial tiles contained many small and broadly similar-scale plant instances against relatively uniform field backgrounds. The single-stage YOLO design with multi-scale feature aggregation appears well matched to this type of dense small-object detection. In comparison, RT-DETR relies on attention over a limited set of object queries, which may be less suitable when many small plant instances occur within the same aerial tile. Similarly, the additional proposal stage used by Faster R-CNN did not improve localization accuracy under these relatively homogeneous field conditions. These findings support the broader observation that YOLO models can be strong candidates for field deployment when speed and localization quality are both important [61,62].

Soybean was generally easier to detect than Palmer amaranth across nearly all detectors. As described in Section 2.2, soybean foliage had a higher excess-green index than Palmer amaranth and therefore showed stronger contrast against the soil background. Soybean plants were also larger on average, with a median bounding-box area approximately 1.68 times larger than Palmer amaranth. Larger and visually distinct objects are easier to localize in aerial imagery, while Palmer amaranth showed greater variation in morphology, overlap, and canopy density across growth stages. These visual differences likely explain why soybean detection accuracy was consistently higher than Palmer amaranth detection accuracy.

4.2. Detection Accuracy and Operational Spraying Behavior Are Not Equivalent

A major finding of this study is that detection accuracy and spraying usefulness are related but not identical. Standard object detection metrics such as mAP reward correct class prediction, confidence, and bounding-box overlap. These metrics are important, but they do not directly show whether the predicted spray region would cover the weed in a practical spraying workflow. For this reason, deployment-oriented metrics such as WCR, HSR, and WCR-AUC provide additional information that mAP alone cannot provide [50].

The difference between mAP and operational spraying behavior was especially clear when comparing YOLO and Faster R-CNN. The YOLO models achieved the highest mAP values, but the Faster R-CNN models achieved stronger threshold-independent weed coverage across confidence levels. This means that a detector can be accurate by standard computer-vision criteria but still behave differently when translated into a spraying decision. For example, a detector may produce high-confidence boxes that improve mAP, but if detections become unstable across thresholds, the effective spray coverage may decline. RT-DETR showed this issue clearly. It achieved high weed coverage at lower confidence thresholds, but coverage declined rapidly as the confidence threshold increased. This led to lower WCR-AUC despite competitive low-threshold behavior. In contrast, Faster R-CNN maintained more stable weed coverage across thresholds, even though its mAP was lower than the best YOLO models. The HSR metric showed less separation among detectors because the simulated spray-cell area occupied only a small fraction of the image across models. Under these imaging and simulation conditions, operational performance was driven more by weed coverage than by sprayed area.

These findings have practical implications for precision spraying. If the goal is to build a real-time spraying system, model evaluation should include both detection accuracy and spraying behavior. Reporting only mAP may lead to selecting a model that performs well on benchmark metrics but is not optimal for actual spray coverage. Threshold-independent metrics such as WCR-AUC are therefore useful because they show how robustly a detector covers weeds across the confidence range used in deployment.

4.3. Detector Selection for Field Deployment

No single architecture dominated across all evaluation axes. YOLOv8m provided the most balanced deployment profile by combining strong detection accuracy, low latency, high throughput, and relatively low GPU memory usage. This makes YOLOv8m a practical candidate for real-time or near-real-time aerial spraying workflows. In contrast, Faster R-CNN provided stronger weed coverage across thresholds but required substantially more computational resources. Therefore, Faster R-CNN may be more useful for offline weed mapping, generating field maps that guide where herbicide should be applied, or decision-support systems where speed is less restrictive.

The deployment results also show why the intended operating mode should guide model choice. For on-board UAS inference or edge-computing systems, single-image latency and memory cost are critical because the model must process images quickly during flight or immediately after acquisition. Under this use case, the YOLO family is more practical than heavier two-stage or transformer-based models. For buffered post-flight processing, however, batching can reduce the penalty of slower architectures. RT-DETR, for example, had slow single-frame inference but scaled more efficiently under batching. This suggests that transformer-based models may be more suitable for offline processing than for direct real-time spraying.
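
The latency and throughput figures discussed here follow a warm-up-then-measure protocol (50 warm-up iterations, then the median of 200 timed iterations). A minimal, framework-agnostic sketch of such a harness is shown below. The function names and the dummy workload are hypothetical, and the conversion from latency to frames per second as batch size divided by median latency is an assumption. On a GPU, asynchronous kernel launches would also need to be synchronized before each clock read (for example with `torch.cuda.synchronize()` in PyTorch), which this CPU-only sketch omits.

```python
import statistics
import time
from typing import Callable, Sequence, Tuple


def fps_from_latency(latency_ms: float, batch_size: int = 1) -> float:
    """Frames per second implied by a per-batch latency in milliseconds (assumed relation)."""
    return batch_size * 1000.0 / latency_ms


def benchmark(fn: Callable[[Sequence], object], batch: Sequence,
              warmup: int = 50, iters: int = 200) -> Tuple[float, float]:
    """Return (median latency in ms, throughput in FPS) for fn applied to one batch."""
    for _ in range(warmup):          # warm-up runs are executed but not timed
        fn(batch)
    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(batch)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    median_ms = statistics.median(samples_ms)
    return median_ms, fps_from_latency(median_ms, len(batch))


def dummy_model(batch: Sequence[int]) -> list:
    """Stand-in workload; a real run would call the detector's forward pass instead."""
    return [sum(i * i for i in range(2000)) + x for x in batch]


if __name__ == "__main__":
    for batch_size in (1, 4, 16):
        latency, fps = benchmark(dummy_model, list(range(batch_size)))
        print(f"batch={batch_size:2d}  median={latency:.3f} ms  throughput={fps:.1f} FPS")
```

At batch size 1 the assumed relation reduces to 1000 divided by the latency in milliseconds, which is consistent with the reported RT-DETR values (36.7 ms and 27.2 FPS).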

These findings extend previous UAS and precision-agriculture studies by evaluating accuracy, spraying behavior, and computational cost together. Prior studies have commonly emphasized mAP, F1-score, or latency, while deployment-related metrics such as throughput, memory use, and spraying coverage are less often reported together [36,40,41,50]. The current results show that these evaluation axes can lead to different model rankings. For practical agricultural deployment, this means that detector selection should be based on the full workflow rather than a single accuracy metric.
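
The divergence between evaluation axes can be checked directly from the reported values. The short sketch below ranks the eight detectors by WCR-AUC (Table 2) and by batch-size-1 latency and GPU memory (Table 3). The dictionary layout and the helper name `rank_by` are illustrative choices, not part of the study's code.

```python
# Values copied from Table 2 (WCR-AUC means) and Table 3 (batch size 1).
DETECTORS = {
    "YOLOv8m":             {"wcr_auc": 0.483, "latency_ms": 8.3,  "gpu_mem_gb": 0.31},
    "YOLOv9m":             {"wcr_auc": 0.501, "latency_ms": 15.1, "gpu_mem_gb": 0.30},
    "YOLOv10m":            {"wcr_auc": 0.488, "latency_ms": 13.0, "gpu_mem_gb": 0.30},
    "YOLOv11m":            {"wcr_auc": 0.496, "latency_ms": 10.8, "gpu_mem_gb": 0.34},
    "RT-DETR":             {"wcr_auc": 0.609, "latency_ms": 36.7, "gpu_mem_gb": 0.50},
    "Faster R-CNN":        {"wcr_auc": 0.664, "latency_ms": 18.9, "gpu_mem_gb": 0.72},
    "RetinaNet":           {"wcr_auc": 0.535, "latency_ms": 16.3, "gpu_mem_gb": 1.08},
    "Faster R-CNN-GeoCLR": {"wcr_auc": 0.645, "latency_ms": 18.8, "gpu_mem_gb": 1.03},
}


def rank_by(metric: str, higher_is_better: bool) -> list:
    """Detector names ordered from best to worst on one metric."""
    return sorted(DETECTORS, key=lambda name: DETECTORS[name][metric],
                  reverse=higher_is_better)


coverage_rank = rank_by("wcr_auc", higher_is_better=True)
latency_rank = rank_by("latency_ms", higher_is_better=False)
memory_rank = rank_by("gpu_mem_gb", higher_is_better=False)

print("WCR-AUC :", " > ".join(coverage_rank))
print("Latency :", " > ".join(latency_rank))
print("GPU mem :", " > ".join(memory_rank))
```

With these reported values, ranking by WCR-AUC places Faster R-CNN first and YOLOv8m last, whereas ranking by latency places YOLOv8m first and RT-DETR last.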

4.4. Significance of GeoCLR

GeoCLR was developed to address the labeling challenge in UAS-based weed detection. Manual annotation of weed and crop bounding boxes is time-consuming, especially when datasets must cover different fields, seasons, growth stages, and image acquisition conditions. GeoCLR uses spatial overlap between adjacent UAS images to form positive pairs for self-supervised pretraining. This is different from standard augmentation-based contrastive learning, where positive pairs are usually created only from transformed versions of the same image. The approach builds on the idea that naturally occurring spatial or temporal relationships can provide useful supervision for representation learning [46,48,49].
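
One way to picture overlap-based pair formation is sketched below: consecutive frames along a flight line become a positive pair when their ground footprints overlap by at least a chosen fraction. This is an illustrative reading of the idea only. The `Frame` type, the rectangular axis-aligned footprint, the footprint dimensions, and the 0.5 overlap threshold are all hypothetical and are not taken from the GeoCLR implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Frame:
    name: str
    x_m: float  # ground position of the image center, metres (hypothetical metadata)
    y_m: float


def overlap_fraction(a: Frame, b: Frame, width_m: float, height_m: float) -> float:
    """Shared ground area of two equal, axis-aligned footprints, as a fraction of one footprint."""
    dx = max(0.0, width_m - abs(a.x_m - b.x_m))
    dy = max(0.0, height_m - abs(a.y_m - b.y_m))
    return (dx * dy) / (width_m * height_m)


def positive_pairs(frames: Sequence[Frame], width_m: float, height_m: float,
                   min_overlap: float = 0.5) -> List[Tuple[str, str]]:
    """Pair each frame with the next one in the flight sequence if they overlap enough."""
    pairs = []
    for a, b in zip(frames, frames[1:]):
        if overlap_fraction(a, b, width_m, height_m) >= min_overlap:
            pairs.append((a.name, b.name))
    return pairs


# Toy flight line: three closely spaced frames, then a jump to a new pass.
FLIGHT = [Frame("f0", 0.0, 0.0), Frame("f1", 0.0, 2.0),
          Frame("f2", 0.0, 4.0), Frame("f3", 0.0, 12.0)]

print(positive_pairs(FLIGHT, width_m=10.0, height_m=6.0))
```

In contrast to augmentation-only contrastive learning, the two views in each pair here are different photographs of partly shared ground, so viewpoint and illumination differences come from the acquisition itself.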

The feature analysis showed that GeoCLR produced stronger class separation than standard ImageNet pretraining. On the held-out classification dataset, the GeoCLR backbone separated Palmer amaranth and soybean more clearly than an ImageNet-pretrained backbone of the same architecture, with a silhouette coefficient of 0.335 compared with 0.113. The GeoCLR features were also more compact, with most class-related variation concentrated into a lower-dimensional subspace. The activation-map analysis showed a similar pattern: deeper GeoCLR features responded more selectively to vegetation structure while suppressing more homogeneous background regions.
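
For readers unfamiliar with the silhouette coefficient, the sketch below computes it from scratch for two toy feature sets. It assumes Euclidean distance and the standard per-sample definition $(b - a) / \max(a, b)$, where $a$ is the mean distance to samples of the same class and $b$ is the mean distance to the nearest other class. The distance metric and tooling used in this study are not stated in this section, and the toy points are hypothetical.

```python
import math
from typing import List, Sequence


def silhouette(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all samples (Euclidean distance assumed)."""
    classes = set(labels)
    if len(classes) < 2:
        raise ValueError("silhouette needs at least two classes")
    scores: List[float] = []
    for i, (p, label) in enumerate(zip(points, labels)):
        same = [math.dist(p, q) for j, (q, lab) in enumerate(zip(points, labels))
                if j != i and lab == label]
        if not same:                 # singleton class: score defined as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / sum(1 for lab in labels if lab == other)
            for other in classes if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


LABELS = [0, 0, 1, 1]
SEPARATED = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]   # two tight, distant groups
INTERLEAVED = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0), (1.5, 0.0)]   # classes mixed along a line

print(round(silhouette(SEPARATED, LABELS), 3))
print(round(silhouette(INTERLEAVED, LABELS), 3))
```

Values near 1 indicate compact, well-separated classes, values near 0 indicate overlapping classes, and negative values indicate samples that sit closer to the other class. On that scale, both reported values (0.335 and 0.113) indicate partial overlap, with GeoCLR features being the more separable of the two.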

The main benefit of GeoCLR was not a large increase in fully supervised detection mAP. Faster R-CNN-GeoCLR remained close to the standard ImageNet-pretrained Faster R-CNN, with only a small gap in mAP@0.5. The stronger practical contribution was label efficiency. Because GeoCLR uses unlabeled aerial imagery during pretraining, it recovered 80.6% of full-data accuracy using only 10% of the annotations and 94.6% using half of the annotations. This pattern is consistent with prior agricultural self-supervised learning studies showing that SSL is often most useful when labeled data are limited [63,64,65].
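
The recovery percentages quoted above can be read as the accuracy obtained with a label fraction divided by the accuracy obtained with all labels. The sketch below states that reading explicitly. It is an assumed definition, and the mAP values in the example are placeholders chosen for round arithmetic, not results from this study.

```python
from typing import Dict


def recovery_percent(metric_subset: float, metric_full: float) -> float:
    """Share of full-data accuracy retained when training on a label subset (assumed definition)."""
    if metric_full <= 0:
        raise ValueError("full-data metric must be positive")
    return 100.0 * metric_subset / metric_full


# Placeholder mAP@0.5 values keyed by label fraction (NOT values from the paper).
PLACEHOLDER_MAP: Dict[float, float] = {0.10: 0.30, 0.50: 0.45, 1.00: 0.50}

for fraction, value in PLACEHOLDER_MAP.items():
    pct = recovery_percent(value, PLACEHOLDER_MAP[1.00])
    print(f"{fraction:>4.0%} of labels -> {pct:.1f}% of full-data mAP")
```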

For weed scientists and agricultural engineers, this result is useful because UAS imagery is often easier to collect than to annotate. Large collections of routine aerial images could be used for pretraining, and only a smaller subset may need manual labeling for downstream detection. This could reduce the cost and time required to build new weed-detection models for additional fields, seasons, or crop systems. Recent weed dataset studies have also highlighted that public weed datasets are limited by variation, distribution shift, and cross-season generalization challenges [66]. Under these conditions, methods that use unlabeled imagery more effectively may help improve model development when large labeled datasets are unavailable.

4.5. Limitations and Future Work

Several limitations should be considered when interpreting these results. Most of the dataset was collected under experimental field conditions with a single soybean cultivar in one U.S. region and under controlled weed-density treatments. Commercial production fields can be more variable, with mixed weed species, irregular canopy structure, different soil backgrounds, and changing illumination. Therefore, the detectors evaluated here should be interpreted as benchmarked under controlled aerial acquisition conditions rather than fully validated for commercial deployment across all soybean systems.

Future studies should test these models across additional locations, soybean cultivars, weed communities, growth stages, flight conditions, and management systems. Field management practices such as tillage, strip-till, and no-till systems may change soil background, residue cover, canopy structure, and weed visibility. Weather and imaging conditions, including cloud cover, shadows, sun angle, and illumination changes, may also affect detection accuracy and spraying behavior. Evaluating these factors will be important before deployment in broader commercial settings.

The analysis also did not explicitly separate performance by growth stage, although the dataset included multiple soybean and Palmer amaranth growth stages. This matters because the appearance of both species changes substantially during canopy development, especially when plants overlap or weed density increases. Separately, the YOLOv8 variant-size ablation suggested that increasing model capacity beyond the medium-scale model produced only limited performance improvement, which indicates that dataset diversity may currently be a larger limitation than model size for this task.

GeoCLR pretraining was restricted to a single flight date so that all positive pairs satisfied the empirically validated within-flight overlap condition. Future work should evaluate whether the same framework generalizes across seasons, locations, crop systems, and acquisition conditions. Additional comparisons with other self-supervised approaches would also help determine how much of the observed improvement is specific to GeoCLR.

Finally, this study evaluated detection-derived operational metrics using simulated spraying behavior. Additional field validation is needed to determine how these weed maps translate into herbicide savings, economic return, weed control efficacy, and reduced crop yield loss under real spraying conditions.

5. Conclusions

This study compared eight object detection models for Palmer amaranth and soybean detection in high-resolution aerial imagery. The models were evaluated using detection accuracy, spraying-oriented operational behavior, and computational deployment cost. The study also introduced GeoCLR, a self-supervised pretraining framework that uses spatial overlap between adjacent UAS images to learn from unlabeled aerial imagery.

The results showed that model choice should depend on the intended agricultural use. The YOLO family achieved the highest detection accuracy and the lowest computational cost, making these models strong candidates for real-time or edge-based precision spraying workflows. Among the evaluated detectors, YOLOv8m provided the most balanced deployment profile. Faster R-CNN models showed stronger threshold-independent weed coverage, suggesting that they may still be useful for offline weed mapping, such as generating field maps that guide where herbicide should be applied, when weed coverage is more important than speed.

Detection accuracy alone did not fully describe spraying behavior. Standard mAP metrics are useful for measuring object detection performance, but they do not directly show whether weeds would be covered by a spray decision. Operational metrics such as WCR, HSR, and WCR-AUC provide additional information about how a detector may behave in a site-specific spraying workflow.

GeoCLR improved feature separation and label efficiency by learning from unlabeled overlapping UAS images. This is important for agricultural applications because aerial imagery can be collected routinely, while manual weed annotation is expensive and time-consuming. The results suggest that GeoCLR can reduce the amount of labeled data required to build useful weed-detection models.

Together, these findings suggest that YOLO-based detectors and GeoCLR pretraining can play complementary roles in precision agriculture. YOLO models provide the speed required for practical field deployment, while GeoCLR can reduce the manual labeling burden involved in model development. This combination can support site-specific weed management by improving weed detection, reducing excessive herbicide application, and contributing to more sustainable farming practices.

Author Contributions

This project was part of D.S.’s MS thesis. D.S. flew the UAS to collect the imagery, labeled images for the deep learning models, trained all the models, analyzed all the results, and wrote all the sections. V.S., S.L., and K.K., as committee members, provided valuable guidance throughout the project and reviewed the manuscript. Conceptualization, D.S. and V.S.; methodology, D.S. and V.S.; software, D.S. and R.W.; validation, D.S. and R.W.; formal analysis, D.S.; investigation, D.S.; resources, D.S. and V.S.; data curation, D.S.; writing—original draft preparation, D.S.; writing—review and editing, D.S., V.S., R.W., S.V., P.Y., S.L. and K.K.; visualization, D.S.; supervision, D.S., V.S., S.L. and K.K.; project administration, V.S.; funding acquisition, V.S. All authors have read and agreed to the published version of the manuscript.

Funding

This activity was funded, in part, with an integrated, internal competitive grant (#141611) from the College of Agriculture and Life Sciences at Virginia Tech. Funding was provided to V.S.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The authors would like to acknowledge support from Vipin Kumar, Milton Sturgis, and Andrew Fletcher for their help in the establishment of this study and data collection.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ghanizadeh, H.; Lorzadeh, S.; Aryannia, N. Effect of weed competition on corn. Weed Biol. Manag. 2014, 14, 133–137.
2. Oerke, E. Crop losses to pests. J. Agric. Sci. 2006, 144, 31–43.
3. Soltani, N.; Dille, J.; Burke, I.; Everman, W.; VanGessel, M.; Davis, V.; Sikkema, P. Perspectives on Potential Soybean Yield Losses from Weeds in North America. Weed Technol. 2017, 31, 148–154.
4. Nicolopoulou-Stamati, P.; Maipas, S.; Kotampasi, C.; Stamatis, P.; Hens, L. Chemical Pesticides and Human Health: The Urgent Need for a New Concept in Agriculture. Front. Public Health 2016, 4, 148.
5. Myers, J.P.; Antoniou, M.N.; Blumberg, B.; Carroll, L.; Colborn, T.; Everett, L.G.; Hansen, M.; Landrigan, P.J.; Lanphear, B.P.; Mesnage, R.; et al. Concerns over use of glyphosate-based herbicides and risks associated with exposures: A consensus statement. Environ. Health 2016, 15, 19.
6. Aparecida, M.; de Campos Ventura-Camargo, B.; Miyuki, M. Toxicity of Herbicides: Impact on Aquatic and Soil Biota and Human Health; InTech: London, UK, 2013.
7. Ghazi, M.R.; Nik Yusoff, N.R.; Abdul Halim, N.S.; Wahab, I.R.A.; Ab Latif, N.; Hasmoni, S.H.; Ahmad Zaini, M.A.; Zakaria, Z.A. Health effects of herbicides and its current removal strategies. Bioengineered 2023, 14, 2259526.
8. Travlos, I.; de Prado, R.; Chachalis, D.; Bilalis, D.J. Editorial: Herbicide Resistance in Weeds: Early Detection, Mechanisms, Dispersal, New Insights and Management Issues. Front. Ecol. Evol. 2020, 8, 213.
9. Busi, R.; Vila-Aiub, M.M.; Beckie, H.J.; Gaines, T.A.; Goggin, D.E.; Kaundun, S.S.; Lacoste, M.; Neve, P.; Nissen, S.J.; Norsworthy, J.K.; et al. Herbicide-resistant weeds: From research and knowledge to future needs. Evol. Appl. 2013, 6, 1218–1221.
10. Paul, S.K.; Mazumder, S.; Naidu, R. Herbicidal weed management practices: History and future prospects of nanotechnology in an eco-friendly crop production system. Heliyon 2024, 10, e26527.
11. Van Wychen, L. Survey of the Most Common and Troublesome Weeds in Broadleaf Crops, Fruits Vegetables in the United States and Canada. In National Weed Survey Dataset; Weed Science Society of America: Westminster, CO, USA, 2022; Available online: http://wssa.net/wp-content/uploads/2022-Weed-Survey-Broadleaf-crops.xlsx (accessed on 12 September 2025).
12. MacRae, A.W.; Webster, T.; Sosnoskie, L.; Culpepper, A.S.; Kichler, J. Cotton yield loss potential in response to length of Palmer amaranth (Amaranthus palmeri) interference. J. Cotton Sci. 2013, 17, 227–232.
13. Bensch, C.; Horak, M.; Peterson, D. Interference of redroot pigweed (Amaranthus retroflexus), Palmer amaranth (A. palmeri), and common waterhemp (A. rudis) in soybean. Weed Sci. 2009, 51, 37–43.
14. Christensen, S.; Søgaard, H.T.; Kudsk, P.; Nørremark, M.; Lund, I.; Nadimi, E.S.; Jørgensen, R. Site-specific weed control technologies. Weed Res. 2009, 49, 233–241.
15. Coleman, G.; Stead, A.; Rigter, M.; Xu, Z.; Johnson, D.; Brooker, G.; Sukkarieh, S.; Walsh, M. Using energy requirements to compare the suitability of alternative methods for broadcast and site-specific weed control. Weed Technol. 2019, 33, 633–650.
16. Gerhards, R.; Andújar Sanchez, D.; Hamouz, P.; Peteinatos, G.G.; Christensen, S.; Fernandez-Quintanilla, C. Advances in site-specific weed management in agriculture—A review. Weed Res. 2022, 62, 123–133.
17. Hasan, A.S.M.; Sohel, F.; Diepeveen, D.; Laga, H.; Jones, M.G.K. A survey of deep learning techniques for weed detection from images. Comput. Electron. Agric. 2021, 184, 106067.
18. Woebbecke, D.; Meyer, G.; Von Bargen, K.; Mortensen, D. Color indices for weed identification under various soil, residue, and lighting conditions. Trans. ASAE 1995, 38, 259–269.
19. Hemming, J.; Rath, T. Image processing for plant determination using the Hough transform and clustering methods. Gartenbauwissenschaft 2002, 67, 1–10.
20. Jafari, A.; Mohtasebi, S.S.; Jahromi, H.E.; Omid, M. Weed detection in sugar beet fields using machine vision. Int. J. Agric. Biol. 2006, 8, 602–605.
21. Bakhshipour, A.; Jafari, A.; Nassiri, S.M.; Zare, D. Weed segmentation using texture features extracted from wavelet sub-images. Biosyst. Eng. 2017, 157, 1–12.
22. Burks, T.F.; Shearer, S.A.; Heath, J.R.; Donohue, K.D. Evaluation of neural-network classifiers for weed species discrimination. Biosyst. Eng. 2005, 91, 293–304.
23. Barrero, O.; Rojas, D.; Gonzalez, C.; Perdomo, S. Weed detection in rice fields using aerial images and neural networks. In Proceedings of the XXI Symposium on Signal Processing, Images and Artificial Vision (STSIVA); IEEE: Bucaramanga, Colombia, 2016; pp. 1–4.
24. Wu, L.; Wen, Y. Weed/corn seedling recognition by support vector machine using texture features. Afr. J. Agric. Res. 2009, 4, 840–846.
25. Kodagoda, S.; Zhang, Z.; Ruiz, D.; Dissanayake, G. Weed detection and classification for autonomous farming. In Intelligent Production Machines and Systems; Elsevier: Amsterdam, The Netherlands, 2008.
26. Slaughter, D.C.; Giles, D.K.; Downey, D. Autonomous robotic weed control systems: A review. Comput. Electron. Agric. 2008, 61, 63–78.
27. Sarker, I.H. Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions. SN Comput. Sci. 2021, 2, 420.
28. Yu, J.; Schumann, A.W.; Cao, Z.; Sharpe, S.M.; Boyd, N.S. Weed detection in perennial ryegrass with deep learning convolutional neural network. Front. Plant Sci. 2019, 10, 1422.
29. Suh, H.K.; IJsselmuiden, J.; Hofstee, J.W.; van Henten, E.J. Transfer learning for the classification of sugar beet and volunteer potato under field conditions. Biosyst. Eng. 2018, 174, 50–65.
30. Wang, A.; Zhang, W.; Wei, X. A review on weed detection using ground-based machine vision and image processing techniques. Comput. Electron. Agric. 2019, 158, 226–240.
31. Teimouri, N.; Dyrmann, M.; Nielsen, P.R.; Mathiassen, S.K.; Somerville, G.J.; Jørgensen, R.N. Weed growth stage estimator using deep convolutional neural networks. Sensors 2018, 18, 1580.
32. Yang, Q.; Shi, L.; Han, J.; Zha, Y.; Zhu, P. Deep convolutional neural networks for rice grain yield estimation at the ripening stage using UAV-based remotely sensed images. Field Crops Res. 2019, 235, 142–153.
33. Pal, C.; Karmakar, S.; Mukherjee, I.; Chakrabarti, P.P. A lightweight and explainable CNN model for empowering plant disease diagnosis. Sci. Rep. 2025, 15, 30720.
34. Coleman, G.R.Y.; Bender, A.; Hu, K.; Sharpe, S.M.; Schumann, A.W.; Wang, Z.; Bagavathiannan, M.; Boyd, N.; Walsh, M.J. Weed detection to weed recognition: Reviewing 50 years of research to identify constraints and opportunities for large-scale cropping systems. Weed Technol. 2022, 36, 741–757.
35. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788.
36. Rahman, A.; Lu, Y.; Wang, H. Performance evaluation of deep learning object detectors for weed detection for cotton. Smart Agric. Technol. 2023, 3, 100126.
37. Sharpe, S.M.; Schumann, A.W.; Boyd, N.S. Goosegrass detection in strawberry and tomato using a convolutional neural network. Sci. Rep. 2020, 10, 9548.
38. Cubero, S.; Aleixos, N.; Moltó, E.; Gómez-Sanchis, J.; Blasco, J. Advances in machine vision applications for automatic inspection and quality evaluation of fruits and vegetables. Food Bioprocess Technol. 2011, 4, 487–504.
39. Hu, C.; Sapkota, B.B.; Thomasson, J.A.; Bagavathiannan, M.V. Influence of image quality and light consistency on the performance of convolutional neural networks for weed mapping. Remote Sens. 2021, 13, 2140.
40. Barnhart, I.; Lancaster, S.; Goodin, D.; Spotanski, J.; Dille, J. Use of open-source object detection algorithms to detect Palmer amaranth (Amaranthus palmeri) in soybean. Weed Sci. 2022, 70, 648–662.
41. Coleman, G.R.Y.; Kutugata, M.; Walsh, M.J.; Bagavathiannan, M.V. Multi-Growth Stage Plant Recognition: A Case Study of Palmer Amaranth (Amaranthus palmeri) in Cotton (Gossypium hirsutum). Comput. Electron. Agric. 2024, 217, 108622.
42. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS’15), Montreal, Canada, 7–12 December 2015; MIT Press: Cambridge, MA, USA, 2015; pp. 91–99. Available online: https://dl.acm.org/doi/10.5555/2969239.2969250 (accessed on 20 May 2026).
43. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV); IEEE: Venice, Italy, 2017; pp. 2999–3007.
44. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV 2020; Lecture Notes in Computer Science; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; Volume 12346, pp. 213–229.
45. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069.
46. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual Event, 13–18 July 2020; PMLR, 2020; pp. 1597–1607. Available online: https://proceedings.mlr.press/v119/chen20j.html (accessed on 20 May 2026).
47. Ericsson, L.; Gouk, H.; Loy, C.C.; Hospedales, T.M. Self-supervised representation learning: Introduction, advances and challenges. arXiv 2021, arXiv:2110.09327.
48. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 9726–9735.
49. Manas, O.; Lacoste, A.; Giro-i-Nieto, X.; Vazquez, D.; Rodriguez, P. Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data. arXiv 2021, arXiv:2103.16607.
50. Darbyshire, M.; Salazar-Gomez, A.; Gao, J.; Sklar, E.I.; Parsons, S. Towards Practical Object Detection for Weed Spraying in Precision Agriculture. Front. Plant Sci. 2023, 14, 1183277.
51. See & Spray: 5 Things John Deere Learned in 2024. Available online: https://www.agweb.com/news/machinery/see-spray-5-things-john-deere-learned-2024 (accessed on 13 May 2026).
52. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
53. Schuhl, H.; Brown, K.E.; Sheng, H.; Bhatt, P.K.; Gutierrez, J.; Schneider, D.; Casto, A.L.; Acosta-Gamboa, L.; Ballenger, J.G.; Barbero, F.; et al. PlantCV v4: Image analysis software for high-throughput plant phenotyping. Plant Phenome J. 2026, 9, e70065.
54. Xu, Y.; Goodacre, R. On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning. J. Anal. Test. 2018, 2, 249–262.
55. ARA Field Sprayer. Available online: https://ecorobotix.com/en-us/crop-care/ara-field-sprayer (accessed on 13 May 2026).
56. Rainio, O.; Teuho, J.; Klén, R. Evaluation Metrics and Statistical Tests for Machine Learning. Sci. Rep. 2024, 14, 6086.
57. Blakesley, R.E.; Mazumdar, S.; Dew, M.A.; Houck, P.R.; Tang, G.; Reynolds, C.F., III; Butters, M.A. Comparisons of Methods for Multiple Hypothesis Testing in Neuropsychological Research. Neuropsychology 2009, 23, 255–264.
58. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. Available online: https://dl.acm.org/doi/10.5555/1248547.1248548 (accessed on 20 May 2026).
59. Ahmad, A.; Saraswat, D.; Aggarwal, V.; Etienne, A.; Hancock, B. Performance of deep learning models for classifying and detecting common weeds in corn and soybean production systems. Comput. Electron. Agric. 2021, 184, 106081.
60. Wang, Q.; Cheng, M.; Huang, S.; Cai, Z.; Zhang, J.; Yuan, H. A deep learning approach incorporating YOLO v5 and attention mechanisms for field real-time detection of the invasive weed Solanum rostratum Dunal seedlings. Comput. Electron. Agric. 2022, 199, 107194.
61. Gómez, A.; Moreno, H.; Andújar, D. Intelligent Inter- and Intra-Row Early Weed Detection in Commercial Maize Crops. Plants 2025, 14, 881.
62. Gallo, I.; Rehman, A.U.; Dehkordi, R.H.; Landro, N.; La Grassa, R.; Boschetti, M. Deep Object Detection of Crop Weeds: Performance of YOLOv7 on a Real Case Dataset from UAV Images. Remote Sens. 2023, 15, 539.
63. Güldenring, R.; Nalpantidis, L. Self-supervised contrastive learning on agricultural images. Comput. Electron. Agric. 2021, 191, 106510.
64. Marszalek, M.; Saux, B.L.; Mathieu, P.-P.; Nowakowski, A.; Springer, D. Self-supervised learning—A way to minimize time and effort for precision agriculture? arXiv 2022, arXiv:2204.02100.
65. Kar, S.; Nagasubramanian, K.; Elango, D.; Carroll, M.E.; Abel, C.A.; Nair, A.; Mueller, D.S.; O’Neal, M.E.; Singh, A.K.; Sarkar, S.; et al. Self-Supervised Learning Improves Classification of Agriculturally Important Insect Pests in Plants. Plant Phenome J. 2023, 6, e20079.
66. Deng, B.; Lu, Y.; Xu, J. Weed database development: An updated survey of public weed datasets and cross-season weed detection adaptation. Ecol. Inform. 2024, 81, 102546.

Figure 1. UAS-based RGB image collection setup; (a) DJI Matrice-300 UAS with Zenmuse P1 camera, (b) representative raw RGB image (8192 × 5460 pixels) of soybean field infested with Palmer amaranth, acquired at 12 m altitude.

Figure 2. Representative images illustrating training dataset diversity in the annotated dataset; (a) soybean-dominant imagery, (b) low-density Palmer amaranth presence within soybean canopy, (c) medium-density Palmer amaranth presence within soybean canopy, (d) dense Palmer amaranth growth with overlapping canopies, (e) magnified region from panel ‘c’ containing medium-density Palmer amaranth annotations, (f) overlapping Palmer amaranth annotation region from panel ‘d’. Bounding boxes mark annotated instances; P denotes Palmer amaranth (orange) and S denotes soybean (blue).

Figure 3. (a) Distribution of image-pair similarity distances across different pair-selection conditions using color-histogram and ResNet-50 feature distances. Each color represents one pair-selection condition: within-flight, cross-date same-field, cross-field same-date, or random pairs; (b) representative examples of image pairs sampled across the same four pair-selection conditions. Colored dots between paired images indicate the corresponding pair-selection condition.

Figure 4. Schematic workflow from aerial data collection to deployment recommendations. The detection branch (left) trains all eight object detection models under 5-fold cross-validation. The GeoCLR branch (right) pretrains a ResNet-50 encoder on overlapping flight-image tiles; the frozen encoder serves both as the backbone for Faster R-CNN-GeoCLR and for embedding analysis. Two ablation studies follow the main benchmark. All evaluation paths converge into three metric groups, followed by statistical comparison and deployment recommendations.

Figure 5. Mean average precision (mAP) across five folds for Palmer amaranth and soybean detection models; (a) IoU 0.5 and (b) IoU 0.5:0.95. Bars show the mean across folds and error bars show standard deviation. Individual dots represent the five cross-validation folds, with consistent colors used across models so that the same fold can be visually tracked across models.

Figure 6. Class-specific average precision across 5 folds at IoU 0.5 for (a) soybean and (b) Palmer amaranth. Bars show the mean across folds and error bars show standard deviation. Individual dots represent the five cross-validation folds, with consistent colors used across models so that the same fold can be visually tracked across models.

Figure 7. Operational metric curves across confidence thresholds $τ$; (a) Weed Coverage Rate (WCR): lower thresholds retain more detections and generally increase weed coverage, while higher thresholds suppress low-confidence detections, (b) Herbicide Saving Rate (HSR) saturates near 0.98–0.99 for most detectors once $τ \geq 0.15$. Markers and error bars represent mean and standard deviation across the five cross-validation folds. The vertical dashed line marks the representative operating point at $τ = 0.25$; the dot-dashed horizontal line in ‘b’ marks the 95% herbicide saving floor.

Figure 8. Representative prediction examples generated by YOLOv8m on held-out aerial field images that were not part of the annotated object detection dataset. The examples span a range of Palmer amaranth densities, canopy overlap conditions, and field backgrounds observed across the test imagery.

Figure 9. Deployment scaling behavior across batch sizes on a single NVIDIA A100 GPU using FP32 precision and $640 \times 640$ input resolution. (a) median inference latency as a function of batch size; (b) throughput in frames per second (FPS) as a function of batch size. Each latency value represents the median of 200 timed iterations following 50 warm-up iterations; error bars in ‘a’ represent the standard deviation across iterations, which was typically below 0.2% of the median and therefore visually small at the scale of the plot. The batch-size axis is shown on a logarithmic scale (base 2).

Figure 10. Self-supervised feature analysis of the GeoCLR-pretrained backbone; (a) two-dimensional visualization of backbone features extracted from the held-out classification dataset. Left: PCA projection showing separation between Palmer amaranth and soybean features, with PC1 and PC2 explaining 42.7% and 22.6% of the variance, respectively. Right: t-SNE projection showing nonlinear separation between the two classes in the learned feature space, (b) activation maps across successive backbone feature levels (L0–L2) comparing GeoCLR-pretrained and ImageNet-pretrained backbones on representative aerial field images spanning background-only, soybean-only, Palmer-only, and mixed-species scenes.

Table 1. Standardized training and evaluation hyperparameters applied across all detection models, SSL label-fraction experiments, and YOLO variant-size experiments.

Hyperparameter	Value
Input image size	$640 \times 640$ pixels
Batch size	32
Maximum epochs	150
Early stopping metric	Validation mAP@0.5
Early stopping patience	50 epochs
Cross-validation strategy	5-fold cross-validation
Random seed	42
GPU	NVIDIA A100 (80 GB)
CPU cores per run	4
Precision	FP32

Table 2. Operational metrics for the eight Palmer amaranth weed detectors evaluated under the simulated precision-spraying protocol using a $6 \times 6$ cm spray cell. WCR and HSR are reported at $τ = 0.25$, while WCR-AUC summarizes weed coverage across $τ \in [0.05, 0.95]$. Values are reported as mean ± standard deviation across five cross-validation folds.

Detector	WCR @ $τ = 0.25$	HSR @ $τ = 0.25$	WCR-AUC
Faster R-CNN	$0.705 \pm 0.018$	$0.979 \pm 0.003$	$0.664 \pm 0.018$
Faster R-CNN-GeoCLR	$0.685 \pm 0.037$	$0.980 \pm 0.004$	$0.645 \pm 0.034$
RT-DETR	$0.790 \pm 0.025$	$0.968 \pm 0.007$	$0.609 \pm 0.046$
RetinaNet	$0.697 \pm 0.013$	$0.984 \pm 0.001$	$0.535 \pm 0.031$
YOLOv9m	$0.632 \pm 0.015$	$0.991 \pm 0.001$	$0.501 \pm 0.009$
YOLOv11m	$0.636 \pm 0.008$	$0.991 \pm 0.000$	$0.496 \pm 0.015$
YOLOv10m	$0.605 \pm 0.012$	$0.992 \pm 0.000$	$0.488 \pm 0.019$
YOLOv8m	$0.622 \pm 0.031$	$0.991 \pm 0.001$	$0.483 \pm 0.027$
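To illustrate how a threshold-swept metric such as WCR-AUC can be condensed into a single number, the sketch below integrates a coverage curve over $τ \in [0.05, 0.95]$ with the trapezoidal rule and divides by the width of the range. Everything here is an assumption made for illustration: the caption does not state the integration rule, the threshold step, or whether the area is normalized, and the curve `wcr_curve` is hypothetical data rather than a result from Table 2.

```python
def normalized_auc(thresholds, values):
    """Trapezoidal area under values(thresholds), divided by the range width.

    With this normalization the result equals the mean height of the curve,
    so it stays on the same 0-1 scale as the metric itself (an assumption).
    """
    if len(thresholds) != len(values) or len(thresholds) < 2:
        raise ValueError("need at least two matching (threshold, value) pairs")
    area = 0.0
    for i in range(1, len(thresholds)):
        width = thresholds[i] - thresholds[i - 1]
        area += 0.5 * width * (values[i] + values[i - 1])
    return area / (thresholds[-1] - thresholds[0])


# Hypothetical threshold grid: 0.05, 0.15, ..., 0.95 (step size is assumed).
taus = [0.05 + 0.10 * k for k in range(10)]
# Hypothetical coverage curve that falls linearly as the threshold rises.
wcr_curve = [1.0 - t for t in taus]

wcr_auc = normalized_auc(taus, wcr_curve)
print(f"normalized AUC = {wcr_auc:.3f}")
```

Under this normalization, a detector whose coverage degrades slowly as τ increases scores higher than one that is strong only near a single operating point.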

Table 3. Deployment metrics for the eight detectors measured on a single NVIDIA A100 GPU at $640 \times 640$ input resolution and FP32 precision. Latency, FPS, and GPU memory usage are reported at batch size 1. Latency values represent the median over 200 timed iterations following 50 warm-up iterations.

Detector	Params (M)	GFLOPs	Latency (ms)	FPS	GPU Mem (GB)
YOLOv8m	25.9	39.5	8.3	120.9	0.31
YOLOv9m	20.2	38.8	15.1	66.2	0.30
YOLOv10m	16.5	32.0	13.0	76.7	0.30
YOLOv11m	20.1	34.1	10.8	92.3	0.34
RT-DETR	32.8	54.0	36.7	27.2	0.50
Faster R-CNN	41.1	89.0	18.9	53.0	0.72
RetinaNet	36.1	79.8	16.3	61.5	1.08
Faster R-CNN-GeoCLR	43.3	280.4	18.8	53.2	1.03
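Because Table 3 reports both latency and FPS at batch size 1, the two columns can be cross-checked. The sketch below assumes that FPS was derived as 1000 divided by the latency in milliseconds; the table does not state this, so the relation is an assumption being tested rather than a documented definition. The latency and FPS values are copied from Table 3.

```python
# (detector, latency in ms, reported FPS), copied from Table 3.
rows = [
    ("YOLOv8m", 8.3, 120.9),
    ("YOLOv9m", 15.1, 66.2),
    ("YOLOv10m", 13.0, 76.7),
    ("YOLOv11m", 10.8, 92.3),
    ("RT-DETR", 36.7, 27.2),
    ("Faster R-CNN", 18.9, 53.0),
    ("RetinaNet", 16.3, 61.5),
    ("Faster R-CNN-GeoCLR", 18.8, 53.2),
]


def fps_from_latency(latency_ms, batch_size=1):
    """Assumed relation: images per second = batch size / latency."""
    return batch_size * 1000.0 / latency_ms


rel_diffs = {}
for name, latency_ms, reported_fps in rows:
    derived = fps_from_latency(latency_ms)
    rel_diffs[name] = abs(derived - reported_fps) / reported_fps
    print(f"{name:20s} derived={derived:6.1f}  reported={reported_fps:6.1f}")

max_rel_diff = max(rel_diffs.values())
print(f"largest relative difference: {max_rel_diff:.2%}")
```

The derived and reported values agree to within 1% for every detector, which is consistent with the latencies having been rounded to 0.1 ms before tabulation.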

Table 4. Effect of YOLOv8 model scale on detection performance. All variants were trained using the identical 5-fold cross-validation protocol, training schedule, and data splits. Values are reported as mean ± standard deviation across the five folds.

YOLOv8 Variant	mAP@0.5	mAP@0.5:0.95
Nano (n)	$0.797 \pm 0.006$	$0.511 \pm 0.007$
Small (s)	$0.785 \pm 0.020$	$0.503 \pm 0.012$
Large (l)	$0.800 \pm 0.014$	$0.517 \pm 0.010$

Table 5. Label efficiency of the Faster R-CNN-GeoCLR detector, fine-tuned using increasing fractions of the available training annotations. The last column expresses each mAP@0.5 as a percentage of the 100%-label value. Values are reported as mean ± standard deviation across the five folds.

Label Fraction	mAP@0.5	mAP@0.5:0.95	% of Full-Data mAP@0.5
10%	$0.622 \pm 0.008$	$0.327 \pm 0.012$	80.6%
25%	$0.687 \pm 0.015$	$0.381 \pm 0.013$	89.0%
50%	$0.730 \pm 0.017$	$0.418 \pm 0.014$	94.6%
100%	$0.772 \pm 0.017$	$0.449 \pm 0.012$	100.0%
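The final column of Table 5 can be reproduced from the mean mAP@0.5 values by dividing each by the 100%-label result. The short check below uses only numbers copied from the table; the variable names are illustrative.

```python
# label fraction -> (mean mAP@0.5, reported % of full-data mAP@0.5), from Table 5
table5 = {
    "10%": (0.622, 80.6),
    "25%": (0.687, 89.0),
    "50%": (0.730, 94.6),
    "100%": (0.772, 100.0),
}

full_data_map = table5["100%"][0]

# Recompute each percentage and round to one decimal, as in the table.
recomputed = {
    fraction: round(100.0 * mean_map / full_data_map, 1)
    for fraction, (mean_map, _) in table5.items()
}

for fraction, (mean_map, reported) in table5.items():
    print(f"{fraction:>4s}: mAP@0.5={mean_map:.3f}  "
          f"recomputed={recomputed[fraction]:.1f}%  reported={reported:.1f}%")
```

For example, the 50% row gives 0.730 / 0.772 ≈ 0.946, which matches the reported 94.6% and the "roughly 95%" figure quoted in the Highlights.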

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Srivastava, D.; Singh, V.; Wamanse, R.; Li, S.; Kochersberger, K.; Virk, S.; Yadav, P. Benchmarking Object Detection and a Novel Self-Supervised Framework for Weed Detection in Soybean. Remote Sens. 2026, 18, 1720. https://doi.org/10.3390/rs18111720
