Article

YOLO-SW: A Real-Time Weed Detection Model for Soybean Fields Using Swin Transformer and RT-DETR

1 College of Information Engineering, Sichuan Agricultural University, Ya’an 625000, China
2 College of Mechanical and Electrical Engineering, Sichuan Agricultural University, Ya’an 625000, China
* Authors to whom correspondence should be addressed.
Agronomy 2025, 15(7), 1712; https://doi.org/10.3390/agronomy15071712
Submission received: 29 April 2025 / Revised: 23 June 2025 / Accepted: 13 July 2025 / Published: 16 July 2025

Abstract

Accurate weed detection in soybean fields is essential for enhancing crop yield and reducing herbicide usage. This study proposes a YOLO-SW model, an improved version of YOLOv8, to address the challenges of detecting weeds that are highly similar to the background in natural environments. The research stands out for its novel integration of three key advancements: the Swin Transformer backbone, which leverages local window self-attention to achieve linear O(N) computational complexity for efficient global context capture; the CARAFE dynamic upsampling operator, which enhances small target localization through context-aware kernel generation; and the RT-DETR encoder, which enables end-to-end detection via IoU-aware query selection, eliminating the need for complex post-processing. Additionally, a dataset of six common soybean weeds was expanded to 12,500 images through simulated fog, rain, and snow augmentation, effectively resolving data imbalance and boosting model robustness. The experimental results highlight both the technical superiority and practical relevance: YOLO-SW achieves 92.3% mAP@50 (3.8% higher than YOLOv8), with recognition accuracy and recall improvements of 4.2% and 3.9%, respectively. Critically, on the NVIDIA Jetson AGX Orin platform, it delivers a real-time inference speed of 59 FPS, making it suitable for seamless deployment on intelligent weeding robots. This low-power, high-precision solution not only bridges the gap between deep learning and precision agriculture but also enables targeted herbicide application, directly contributing to sustainable farming practices and environmental protection.

1. Introduction

In modern agricultural production, weeds are a primary factor affecting soybean yield. They compete with crops for nutrients, water and sunlight, especially during the seedling stage of the crop, which hinders crop growth and reduces yield [1]. Statistics indicate that weeds reduce China’s annual grain yield by approximately 20 billion kilograms. Current weed control methods include biological, chemical, and mechanical approaches, with chemical control being the most widely used [2]. Statistics from the agricultural department show that China consumes approximately 1.78 million tons of chemicals annually for weed control. However, extensive spraying not only wastes chemicals but also causes groundwater contamination and drift issues. Tsiafouli’s studies have highlighted that chemical overuse reduces soil biodiversity by 15–20% in croplands [3]. Mauro et al. further demonstrated that precision spraying guided by AI can reduce herbicide usage by 40% while maintaining efficacy [4]. Therefore, it is crucial to accurately identify the location and species of weeds and spray chemicals in a targeted manner. For several weeds commonly found in soybean fields, researchers suggest using computer vision technology to detect and classify them, providing visual support to control equipment for accurate spray treatments.
Early studies mostly adopted combinations of traditional machine learning and spectral analysis. For example, Olsen et al. achieved the classification of herbs/ferns through linear regression and SVM, but generalization was limited against complex farmland backgrounds [5]. With the development of deep learning, CNNs have become the mainstream solution: Zhu et al. designed a corn field weeding robot based on YOLOX and verified the feasibility of CNNs for real-time detection [6]; Yu et al. improved UNet with an attention mechanism to segment gramineous and broadleaf weeds [7]; Ferreira et al. achieved high accuracy in weed classification in soybean fields using ConvNets [8]. However, these methods are constrained by two significant limitations. First, lightweight backbone networks (e.g., YOLOv8nGP designed by Sun et al.) tend to reduce recognition accuracy for small targets [9]. Second, traditional sampling techniques (e.g., nearest-neighbor interpolation) cause image distortion, compromising localization precision. Most crucially, existing models lack efficient global context capture under complex lighting conditions, and their computational complexity hinders real-time deployment on edge devices with limited resources. This gap highlights the need for a model that balances high accuracy, real-time inference, and robustness against harsh climatic interference. Addressing insufficient real-time performance and accuracy in identifying weeds in farmland, Xu Yanlei et al. adopted a weed density detection method based on Absolute Feature Corner Points (AFCP) [10]. Recent studies, such as Jia et al.’s ADL-YOLOv8 [11] and Ding et al.’s RVDR-YOLOv8 [12], have partially alleviated the trade-off between accuracy and efficiency by optimizing feature fusion and lightweight design, but they still struggle to capture global context under complex lighting.
Transformer architectures have gradually been applied to agricultural detection because of the advantages of the self-attention mechanism in long-distance dependency modeling. Zhao et al. proposed the ST-YOLOA model, enhancing feature extraction through the Swin Transformer and a coordinate attention mechanism, which addressed the difficulty traditional CNNs have in capturing global information in medical image segmentation [13]; Ailiang Lin et al. designed the dual-branch Swin Transformer U-Net (D-Transunet) to improve segmentation accuracy through cross-scale feature interaction [14]; Ma et al. integrated the Swin Transformer into YOLOv5n, improving global feature capture for maize leaf disease recognition [15]. However, such methods generally suffer from parameter redundancy; for example, compared with CSPDarkNet53, the computational complexity of the Swin Transformer increases by 142%, and it must be optimized through techniques such as channel pruning to fit edge devices.
Accurate weed detection in dynamic agricultural environments requires not only high detection accuracy but also efficient inference on resource-constrained edge devices. As noted by Saleem et al., the edge deployment of deep learning models in agriculture faces unique challenges [16], including limited computational power, memory constraints, and real-time requirements. Addressing the insufficient detection accuracy caused by the large variety and significant morphological differences of field weeds, Liu Hui et al. adopted an improved YOLOv8 combined with a reverse detection method [17], easing the difficulty of deploying algorithms on hardware for weed detection in intelligent agriculture. To address the difficulty of early detection of the grassland weed Sphinopsis glabrata, caused by its complex background and short seedlings, Guo Baoliang et al. combined multi-source imagery with an improved YOLOv8, likewise easing hardware deployment for grassland weed detection [18]. In the field of medical image segmentation, Duy-Phuong Dao et al. adopted a dual-model collaboration route to address the problems that tumor localization in early lung cancer diagnosis is time-consuming and expensive and that survival analysis requires multimodal data. First, the MAPTransNet tumor segmentation model was proposed; through a parallel Transformer mechanism and external attention, it integrates the multi-scale global context of 3D PET/CT images to achieve precise segmentation of tumors and normal tissue. Second, the MSNet survival analysis network was constructed to fuse the segmented tumor region (RoI) with clinical data (disease stage, age, etc.) to predict the hazard rate of patients with non-small cell lung cancer [19]. Verified across various fields, the Transformer performs well in image tasks. This study addresses these challenges by optimizing YOLO-SW for edge inference, building on prior work that emphasizes lightweight design and efficient feature extraction for edge devices.
Overall, this algorithmic model combining deep learning and machine vision reduces the model’s dependence on the dataset, enhances robustness, and provides guidance for weeding machinery to perform targeted tasks. While ensuring computational efficiency, the algorithm significantly improves weed detection accuracy in soybean fields, achieving high recognition accuracy that meets ideal requirements and enabling low-cost deployment on large-scale agricultural machinery. Moreover, the inference speed meets practical application needs, providing a theoretical basis for intelligent agricultural mechanization.

2. Materials and Methods

2.1. Model Architecture of YOLO-SW

2.1.1. YOLOv8 Baseline

YOLOv8 is a state-of-the-art object detection model that combines a Feature Pyramid Network (FPN) and a Path Aggregation Network (PAN) [20]. The FPN reduces the spatial resolution of the input image, while the PAN aggregates features from different levels through skip connections, enabling the model to capture features at multiple scales and resolutions. This is crucial for the accurate detection of objects of varying sizes and shapes. YOLOv8 uses CSPDarkNet53 as its backbone, which minimizes computation and provides feature extraction through cross-level connections. The main improvements of YOLOv8 include replacing the C3 module with the C2f structure to streamline the number of channels, optimizing the Neck network structure (reducing the number of layers before upsampling), adopting a decoupled head design for the prediction head, and eliminating anchor boxes to directly predict the target center point and aspect ratio. These optimizations make the model lighter while improving detection speed and accuracy. The structure of YOLOv8 is shown in Figure 1.

2.1.2. Anti-Interference Block

The Swin Transformer is a hierarchical Transformer featuring a moving window, which reduces computational complexity by leveraging self-attention within small windows. This mechanism enhances the model’s performance by efficiently computing features while supporting cross-window connectivity [21]. In this study, the backbone of YOLOv8 is replaced with the Swin Transformer to address the problem of image background interference.
This is particularly suitable for agricultural tasks involving heterogeneous weed textures, as the global context modeling of the Swin Transformer can distinguish subtle texture differences between weed species (e.g., serrated edges vs. smooth leaves), while the local window attention adapts to variations in lighting-induced texture distortions. For instance, in soybean fields, weeds like Galinsoga parviflora and Alternanthera philoxeroides exhibit similar leaf textures under complex lighting, but the Swin Transformer’s hierarchical feature extraction captures their distinct vein patterns and margin serrations.
The Swin Transformer introduces a shifted-window-based self-attention mechanism, which achieves cross-window connectivity by alternating different window partition configurations in consecutive blocks. This mechanism enables the model to capture both local and global contextual features efficiently while reducing computational complexity from quadratic to linear with respect to the input size. The Shifted Window Multi-head Self-Attention (SW-MSA) at the (l + 1)-th layer can be expressed as
$$\mathrm{SW\text{-}MSA}^{(l+1)} = \mathrm{ShiftWindowSelfAttention}\left(\mathrm{Features}^{(l)},\ \mathrm{WindowSize}\right)$$
where $\mathrm{Features}^{(l)}$ denotes the input feature map of the l-th layer, and $\mathrm{WindowSize}$ specifies the size of the local window for self-attention computation. This formulation highlights the layer-wise application of the shift-window operation, which is crucial for enabling cross-window information flow and enhancing the model’s ability to handle complex background interference in weed detection.
The network structure of the Swin Transformer is shown in Figure 2. It adopts a hierarchical structure where the image resolution decreases and the perceptual region increases with the number of layers. The Swin Transformer reduces computational complexity by moving windows, making it suitable for high-resolution images and improving detection accuracy.
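To make the shifted-window idea concrete, the following minimal PyTorch sketch partitions a toy feature map into local windows and applies the cyclic shift used in alternating blocks; the window size, shift, and channel width are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch of shifted-window partitioning (illustrative only).
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping windows of shape
    (num_windows*B, window_size*window_size, C) for local self-attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# Toy feature map: batch 1, 8x8 spatial grid, 96 channels (Swin-T stage-1 width).
feat = torch.randn(1, 8, 8, 96)
window_size, shift = 4, 2  # shift = window_size // 2 in the shifted block

# Regular W-MSA block: attention is computed inside each 4x4 window.
regular_windows = window_partition(feat, window_size)            # (4, 16, 96)

# Shifted SW-MSA block: cyclically shift the map before partitioning so that
# tokens near window borders fall into the same window and can interact.
shifted = torch.roll(feat, shifts=(-shift, -shift), dims=(1, 2))
shifted_windows = window_partition(shifted, window_size)          # (4, 16, 96)

print(regular_windows.shape, shifted_windows.shape)
```

Self-attention would then be computed independently inside each window, which is what keeps the overall complexity linear in the number of tokens.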

2.1.3. Detail Capture Upsampling Operator

The upsampling layer in YOLOv8 relies on nearest-neighbor interpolation, which is prone to image distortion and blurring. To address this, the CARAFE (Content-Aware ReAssembly of FEature) upsampling operator is introduced [22]. CARAFE generates adaptive kernels based on instance-specific content-aware processing, which enhances the model’s ability to capture detailed semantic information and improve small target boundary localization accuracy.
$$X'_{i',j'} = \sum_{m=1}^{k} \sum_{n=1}^{k} X_{\,i+m-\frac{k+1}{2},\; j+n-\frac{k+1}{2}} \times W_{i',j'}(m,n)$$
where $X'_{i',j'}$ is the value of the output feature map at position (i′, j′), obtained as the weighted summation (dot product) of the k × k input local region centered at (i, j) with the content-aware adaptive kernel $W_{i',j'}$.
Compared with traditional upsampling methods (e.g., nearest-neighbor, bilinear), CARAFE demonstrates superior performance in preserving fine-grained features: in the Weed25 dataset, models using CARAFE achieved a 4.7% higher mAP@50 for small weeds (less than 32 × 32 pixels) than those using bilinear interpolation [23]. This advantage stems from its content-aware kernel generation mechanism, which adaptively weights neighboring pixels based on local context (e.g., maintaining weed edge clarity under varying lighting).
The CARAFE module dynamically adjusts the feature representation based on environmental factors such as light, weather, and soil conditions, ensuring high recognition accuracy under various scenarios. Its structural principle diagram is shown as follows in Figure 3.
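As a rough illustration of the reassembly step formalized above, the sketch below performs a CARAFE-style content-aware upsampling with a k × k kernel predicted per output location; the single convolution standing in for the kernel-prediction branch, and the tensor sizes used, are assumptions for brevity rather than the paper's implementation.

```python
# Minimal sketch of the CARAFE-style content-aware reassembly step.
import torch
import torch.nn.functional as F

def carafe_reassemble(x, kernels, k=5, scale=2):
    """x: (B, C, H, W) features; kernels: (B, scale*scale*k*k, H, W) predicted
    reassembly kernels. Returns upsampled features of shape (B, C, sH, sW)."""
    B, C, H, W = x.shape
    # Normalize each k*k kernel so the weights of one reassembly sum to 1.
    kernels = kernels.reshape(B, scale * scale, k * k, H, W).softmax(dim=2)
    # Gather every k*k neighborhood of the input once: (B, C, k*k, H, W).
    patches = F.unfold(x, kernel_size=k, padding=k // 2).reshape(B, C, k * k, H, W)
    # Weighted sum of each neighborhood with its content-aware kernel,
    # producing scale*scale outputs per input location.
    out = torch.einsum('bckhw,bskhw->bschw', patches, kernels)   # (B, S, C, H, W)
    # Rearrange the scale*scale sub-positions into a (H*scale, W*scale) grid.
    out = out.reshape(B, scale, scale, C, H, W).permute(0, 3, 4, 1, 5, 2)
    return out.reshape(B, C, H * scale, W * scale)

x = torch.randn(1, 64, 20, 20)
# In CARAFE the kernels come from a small kernel-prediction branch on x;
# a single conv stands in for that branch here (an assumption for brevity).
kernel_pred = torch.nn.Conv2d(64, 2 * 2 * 5 * 5, kernel_size=3, padding=1)
up = carafe_reassemble(x, kernel_pred(x), k=5, scale=2)
print(up.shape)  # torch.Size([1, 64, 40, 40])
```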

2.1.4. RealTime-Head

YOLOv8’s non-maximal suppression (NMS) post-processing step is computationally intensive, reducing overall detection speed. To address this, the RT-DETR efficient hybrid encoder is integrated into the detection head. RT-DETR eliminates the need for NMS by using an IoU-aware query selection mechanism, which simplifies the detection process and enhances inference speed.
Firstly, RT-DETR ensures the accuracy and uniqueness of the detection by optimizing the following loss functions:
$$L = L_{cls} + \lambda_{box} L_{box} + \lambda_{giou} L_{giou}$$
$L_{cls}$: classification loss, computed with Focal Loss.
$L_{box}$: L1 loss of the detection box, measuring the coordinate distance between the predicted box b and the ground-truth box.
$L_{giou}$: Generalized Intersection over Union (GIoU) loss, focusing on the shape and positional relationship between the predicted box and the ground-truth box.
Secondly, RT-DETR assigns a unique prediction query to each real target through bipartite graph matching (the Hungarian algorithm), ensuring that each target is detected by only one query [24]. During matching, the following cost matrix is minimized:
$$C_{i,j} = L_{cls}\left(c_i, c_j^{gt}\right) + \lambda_{box} L_{box}\left(b_i, b_j^{gt}\right) + \lambda_{giou} L_{giou}\left(b_i, b_j^{gt}\right)$$
where $c_i$ is the predicted category, $c_j^{gt}$ the ground-truth category, $b_i$ the predicted box, and $b_j^{gt}$ the ground-truth box.
Through this matching method, the model establishes a one-to-one correspondence between each query and a unique target during training. Consequently, during inference, the model directly outputs the prediction results without the need for non-maximum suppression (NMS) to eliminate redundant bounding boxes.
Finally, RT-DETR selects high-quality initial queries from the encoder output through the IoU-aware query selection mechanism [25]. This mechanism utilizes the IoU (Intersection over Union) between the predicted box and the ground-truth box to ensure that the initial queries can effectively locate targets, reduce ambiguity in the subsequent decoding process, and further avoid the generation of redundant boxes.
The RT-DETR encoder processes multi-scale features efficiently, reducing computational redundancy and improving detection performance. This design is particularly suitable for real-time target detection in complex environments. Its structural principle diagram is shown as follows in Figure 4.
This NMS elimination reduces the computational overhead by 27% compared to YOLOv8, cutting the inference time from 14.1 ms to 10.3 ms on the NVIDIA Jetson AGX Orin edge platform. The latency reduction is critical for real-time operations in agricultural robots, where each frame must be processed within 16.7 ms to maintain 60 FPS for seamless field coverage.
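The one-to-one assignment described above can be sketched with SciPy's Hungarian solver; the cost below keeps only simplified classification and L1 box terms (the GIoU term and the exact λ weights are omitted), so it illustrates the matching idea rather than RT-DETR's actual implementation.

```python
# Minimal sketch of DETR-style one-to-one matching (simplified cost terms).
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries(pred_logits, pred_boxes, gt_labels, gt_boxes,
                  lambda_box=5.0, lambda_cls=1.0):
    """pred_logits: (Q, num_classes) scores, pred_boxes: (Q, 4) in cxcywh,
    gt_labels: (G,), gt_boxes: (G, 4). Returns (query_idx, gt_idx) pairs."""
    probs = np.exp(pred_logits) / np.exp(pred_logits).sum(axis=1, keepdims=True)
    # Classification cost: negative probability of the ground-truth class.
    cost_cls = -probs[:, gt_labels]                               # (Q, G)
    # Box cost: L1 distance between predicted and ground-truth boxes.
    cost_box = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = lambda_cls * cost_cls + lambda_box * cost_box          # (Q, G) matrix
    rows, cols = linear_sum_assignment(cost)                      # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))

# Toy example: 4 queries, 2 ground-truth weeds, 3 classes.
rng = np.random.default_rng(0)
pairs = match_queries(rng.normal(size=(4, 3)), rng.random((4, 4)),
                      np.array([0, 2]), rng.random((2, 4)))
print(pairs)  # each ground-truth box is matched to exactly one query
```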

2.1.5. Overview of YOLO-SW

YOLO-SW is an optimized variant of YOLOv8n. The Swin Transformer, serving as the backbone network, functions as the fundamental module for feature extraction. It employs a sliding-window mechanism to hierarchically extract image features, thereby reducing computational complexity while generating multi-scale feature maps that encompass both low-level details (such as weed edges) and high-level semantics (such as weed categories). This process provides a rich feature basis for subsequent modules.
The multi-scale features generated by the Swin Transformer are then processed through a content-aware upsampling mechanism W (a learnable weight matrix) in the neck. This mechanism adaptively optimizes feature representation to compensate for information loss in small targets. Concurrently, in combination with other neck operations (such as Concat), it fuses features across different scales to enhance both the fine-grained details and global semantic associations of the features. For instance, this enables the detection of small weeds to leverage both detailed appearance and contextual information simultaneously.
The detection head, based on the optimized features from the neck, utilizes a direct regression mechanism to bypass complex post-processing steps (such as non-maximum suppression, NMS) and directly outputs the target categories and locations. Throughout this process, the optimized features from CARAFE ensure precise localization of small targets, while the hierarchical features from the Swin Transformer guarantee accurate classification. Ultimately, this architecture achieves rapid and precise detection.
YOLO-SW interconnects its modules via a meticulously crafted network architecture. Upon image ingestion, the backbone network, constructed with Swin Transformer, initiates hierarchical feature extraction and fusion (through operations such as Merging). The resultant multi-scale features are subsequently passed through convolutional layers to adjust their dimensions before entering the neck, where CARAFE executes content-aware upsampling and feature fusion. Finally, the optimized features are conveyed to the RT-DETR module in the head to accomplish rapid prediction. This architectural design establishes an efficient pipeline of “feature extraction-optimization-prediction,” wherein each module collaborates based on its functional specialization.
$$O = f_{\mathrm{predict}}\left(f_{\mathrm{optimize}}\left(f_{\mathrm{extraction}}(I)\right)\right)$$
Features traverse this pipeline in an orderly fashion, thereby fully leveraging the strengths of Swin Transformer for efficient feature extraction, CARAFE for feature optimization, and RT-DETR for swift inference.
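A structural sketch of this "extraction-optimization-prediction" composition is given below; the module stand-ins (a single convolution, a plain upsample, and a 1 × 1 head) are placeholders chosen only so the skeleton runs, not the real Swin Transformer, CARAFE, or RT-DETR components.

```python
# Minimal structural sketch of the "extract -> optimize -> predict" pipeline
# in the equation above (module names are placeholders, not the model code).
import torch
import torch.nn as nn

class YOLOSWSkeleton(nn.Module):
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone   # f_extraction: Swin Transformer stages
        self.neck = neck           # f_optimize: CARAFE upsampling + fusion
        self.head = head           # f_predict: RT-DETR style decoder head

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        features = self.backbone(image)   # multi-scale hierarchical features
        refined = self.neck(features)     # content-aware upsampled features
        return self.head(refined)         # boxes + classes, no NMS needed

# Stand-in modules so the skeleton runs end to end on a dummy image.
model = YOLOSWSkeleton(nn.Conv2d(3, 16, 3, 2, 1), nn.Upsample(scale_factor=2),
                       nn.Conv2d(16, 6 + 4, 1))
out = model(torch.randn(1, 3, 640, 640))
print(out.shape)  # (1, 10, 640, 640): 6 class maps + 4 box maps (toy output)
```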
The YOLO-SW model architecture is shown in Figure 5. The red dashed boxes highlight the improved components. This optimized structure ensures high detection accuracy while maintaining computational efficiency, making it ideal for deployment on resource-constrained devices.

2.2. Dataset Design

2.2.1. Dataset Acquisition

The dataset was collected in a soybean field in Ya’an, Sichuan, China, using a Sony RX100 VA camera (Sony, Tokyo, Japan) positioned 50 cm above the ground. Images were captured under varying lighting conditions between 10:00 and 12:00, resulting in 1095 high-resolution (3072 × 3072) images of six common weed species. To ensure consistency, all images were captured using the same camera settings: aperture f/5.6–f/8, shutter speed faster than 1/500 s, ISO 100–400 (sunny) or 800–1600 (low light), AF-C with tracking focus, and RAW + JPEG format.
The collected weed dataset was categorized into six major groups: Galinsoga parviflora, Alternanthera philoxeroides, Cerastium glomeratum, Cardamine hirsuta, Amaranthus retroflexus, and Pilea peperomioides, based on leaf texture, vein type, margin serration density, leaf base shape, plant morphology, growth stage, environment, and taxonomic status. The dataset was labeled using the open-source tool LabelImg, with a three-tiered quality control system (“initial labeling-review-arbitration”) to minimize subjective bias and labeling errors.
The experiment was conducted in a soybean field in Ya’an, Sichuan, China. The exact GPS coordinates of the field site are 29.51° N, 102.42° E, as determined by querying the national agricultural land database on the official website of the Ministry of Natural Resources: http://www.mnr.gov.cn/ (accessed on 15 April 2025). These coordinates were cross-verified against the world maps published by the National Basic Geographic Information Center of China, ensuring the accuracy of the location information. The site lies in a well-monitored agricultural area of Sichuan Province that is known for its large-scale soybean cultivation. The experimental area contains a variety of weeds that are common in local agricultural production.
All images were taken 50 cm above the ground and screened by the researchers, who confirmed that they exhibit a variety of complex features. Some of the images are shown in Figure 6; these data are consistent with practical application scenarios.

2.2.2. Dataset Construction

Since the training data of the object detection model needs to be manually labeled using an annotation tool, it is necessary to classify the weeds based on their characteristics. According to the records in books and documents related to weed control [26,27], it is possible to classify the leaves based on leaf texture, leaf vein type, leaf margin serration density, leaf base shape, plant morphological characteristics, growth stage, growth environment, and plant taxonomic status, etc., to categorize weeds in soybean fields.
Therefore, we categorized the collected weed dataset into six major groups: Galinsoga parviflora (Asteraceae, annual herb with serrated leaf margins), Alternanthera philoxeroides (Amaranthaceae, perennial herb with opposite leaves), Cerastium glomeratum (Caryophyllaceae, rosette-forming plant with decussate leaves), Cardamine hirsuta (Brassicaceae, basal leaflets with pinnate division), Amaranthus retroflexus (Amaranthaceae, erect annual with ovate leaves), and Pilea peperomioides (Urticaceae, succulent perennial with round cordate leaves). These taxonomic classifications follow the key traits defined in Flora of China and the Manual of Weed Taxonomy, which validate the botanical identities based on floral morphology, leaf venation patterns, and growth habits. For example, Galinsoga parviflora is distinguished by its glandular hairs and yellow disc florets, while Alternanthera philoxeroides is identified by its prostrate stem and pinkish flower clusters.
In order to reduce visual bias and avoid the influence of subjective bias, we labeled the dataset under the guidance of experts using the GitHub open-source tool LabelImg v1.8.1, an open-source image labeling tool developed by Tzutalin and built on Python and the Qt 5 framework. It provides rectangular bounding-box labeling for object detection tasks and saves annotations in PASCAL VOC format. Rectangular boxes are drawn by dragging the mouse to annotate the targets in the image, and each box is assigned a corresponding category label, which makes it convenient to use the annotated data for subsequent model training. To handle subjective bias, labeling errors, and inconsistencies, this study adopts a three-tiered quality control system, namely “initial labeling-review-arbitration”. The initial labeling is completed by the project members under the guidance of agronomy experts, and disputed samples are then resolved by a vote of the agronomy experts.

2.2.3. Improved Dataset Enhancement

To balance label distribution and enhance dataset richness, a self-developed image enhancement algorithm was applied, simulating various climatic conditions (fog, rain, snow). While prior studies have employed single-climate augmentation (e.g., fog simulation or rain effects alone), this work uniquely integrates multi-climate simulation (fog/rain/snow) with adaptive transparency control (α = 0.1–0.6) and regional diversity sampling. The self-developed algorithm further innovates by generating non-overlapping snowflakes with random coordinates and dynamic raindrop blurring, mechanisms not previously reported in weed detection datasets.
To ensure the regional diversity of the dataset, we collaborated with multiple agricultural production areas (such as the main soybean production areas in Northeast China, the Huang-Huai-Hai region, and the Yangtze River Basin) to conduct on-site data collection, using multiple angles (top-down, side-view) and multiple devices (drones, ground robots, handheld cameras) to obtain images. Images of weeds at different growth stages (seedling, branching, and flowering) were collected, covering morning, noon, and evening light periods. After augmentation, the original dataset was expanded to 12,500 images, with labels distributed approximately uniformly across the six weed categories. The enhanced dataset was then randomly split into training (70%, 8750 images), validation (20%, 2500 images), and test (10%, 1250 images) sets. The proportion of data categories is shown in Figure 7.
In this paper, a random function is provided in the random enhancement process to control the degree of enhancement of the image, thus increasing the diversity of the dataset. The effect of different enhancement algorithms is shown in Figure 8.
The formula for stochastic enhancement:
$$\varphi(a,b) = \delta(x,y) \times f(a,b)$$
In the above equation, (x, y) is the range of the random value interval, and δ(x,y) is the enhancement function used to generate a random number between x and y. It adjusts the mean and standard deviation of the image brightness during data enhancement, as well as the pixel values when adding salt-and-pepper noise.
The dataset was enhanced using a self-developed algorithm to simulate various environmental conditions (fog, rain, snow), increasing the dataset size from 1095 to 12,500 images. This balanced the label distribution and improved model generalization. The enhancement algorithms are as follows:
Among them, the fog simulation algorithm was obtained by improving the method of Wang et al. [28]. The principle is that the input image is segmented into a specified number of rows and columns; for each sub-area, a white fog layer with a random transparency (alpha value of 0.1–0.6) is created and blended with the original image. Rain simulation: a rotation matrix is generated to create a raindrop effect, followed by Gaussian blurring to assign width to the raindrops. Snow simulation: snowflakes are generated with random, non-overlapping center coordinates and superimposed on the image.
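A minimal OpenCV/NumPy re-implementation of these three effects, written from the textual description above, is sketched below; the grid size, drop count, flake radius, and blending weights are assumed values, and the input file name is hypothetical.

```python
# Minimal sketch of the fog / rain / snow augmentations described above
# (parameters are assumptions, not the authors' exact algorithm).
import cv2
import numpy as np

rng = np.random.default_rng(42)

def add_fog(img, rows=4, cols=4, alpha_range=(0.1, 0.6)):
    """Blend a white layer with random per-cell transparency over a grid."""
    out = img.astype(np.float32)
    h, w = img.shape[:2]
    for r in range(rows):
        for c in range(cols):
            a = rng.uniform(*alpha_range)
            ys = slice(r * h // rows, (r + 1) * h // rows)
            xs = slice(c * w // cols, (c + 1) * w // cols)
            out[ys, xs] = (1 - a) * out[ys, xs] + a * 255.0
    return out.astype(np.uint8)

def add_rain(img, n_drops=400, length=15, angle=20):
    """Draw short streaks, slant and blur them, then blend with the image."""
    h, w = img.shape[:2]
    mask = np.zeros((h, w), np.uint8)
    for _ in range(n_drops):
        x, y = int(rng.integers(0, w)), int(rng.integers(0, h - length))
        cv2.line(mask, (x, y), (x, y + length), 255, 1)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)   # slant the streaks
    mask = cv2.warpAffine(mask, M, (w, h))
    mask = cv2.GaussianBlur(mask, (3, 3), 0)                  # give drops width
    return cv2.addWeighted(img, 1.0, cv2.cvtColor(mask, cv2.COLOR_GRAY2BGR), 0.6, 0)

def add_snow(img, n_flakes=200, radius=2):
    """Stamp white flakes at random, non-repeating centre coordinates."""
    out = img.copy()
    h, w = img.shape[:2]
    centers = set()
    while len(centers) < n_flakes:
        centers.add((int(rng.integers(0, w)), int(rng.integers(0, h))))
    for cx, cy in centers:
        cv2.circle(out, (cx, cy), radius, (255, 255, 255), -1)
    return out

img = cv2.imread('soybean_weed.jpg')          # hypothetical sample image
if img is not None:
    cv2.imwrite('aug_fog.jpg', add_fog(img))
    cv2.imwrite('aug_rain.jpg', add_rain(img))
    cv2.imwrite('aug_snow.jpg', add_snow(img))
```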

3. Experiment Setup

3.1. Experimental Environment

The model training was conducted on an Ubuntu operating system equipped with an Intel® Core™ i5-13400F@2.5GHz CPU, 32 GB RAM, and an NVIDIA GeForce RTX 4090 GPU. PyTorch 2.2.2 served as the deep learning framework with CUDA acceleration. The training hyperparameters are shown in Table 1.
The following is the basis for the selection of Adam optimizer and learning rate parameters:
The initial learning rate lr0 = 0.01 is based on experimental verification that mAP@50 can quickly reach 85% within 50 training epochs. Compared with lr0 = 0.001 (convergence delayed by 20 epochs) and lr0 = 0.05 (±1.5% late-stage fluctuation), 0.01 strikes a balance between convergence rate and stability. This value follows the official YOLOv8 training strategy and was fine-tuned for the small-target characteristics of weed detection.
The learning rate decay rate adopts lr1 = 1 × 10−4 because the exponential decay strategy significantly improves the recall rate of small targets (from 78.3% to 89.5%). By slowly reducing the learning rate in the later stage of training, the model avoids falling into a local optimum, especially suitable for feature capture of low-resolution targets such as weed seedlings.
Momentum was set to 0.937 and weight decay to 0.0005 because the momentum value, higher than the SGD default (0.9), accelerates gradient descent while suppressing oscillations. Weight decay reduces overfitting through L2 regularization and improves feature discrimination in scenarios with similar soil-weed spectra (such as Pilea peperomioides against light brown soil).
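For reference, the hyperparameters above can be expressed as a training call in the Ultralytics API, as sketched below; the model/data file names, the epoch, batch, and image-size values, and the mapping of the paper's lr1 to the lrf argument are assumptions rather than the authors' released configuration.

```python
# Hedged sketch of the training configuration described above, expressed with
# the Ultralytics YOLO training API (file names and several values are assumed).
from ultralytics import YOLO

model = YOLO('yolo_sw.yaml')          # custom architecture config (hypothetical)
model.train(
    data='soybean_weeds.yaml',        # 6-class weed dataset definition (hypothetical)
    epochs=300, imgsz=640, batch=16,  # assumed values
    optimizer='Adam',
    lr0=0.01,                         # initial learning rate (paper)
    lrf=0.0001,                       # final LR factor; stands in for lr1 = 1e-4
    momentum=0.937,                   # paper value
    weight_decay=0.0005,              # paper value
)
```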

3.2. Evaluation Metrics

The model’s performance was assessed using the following key metrics:
Parameters: Indicates the total number of trainable parameters in the model.
FLOPs: Represents the number of floating-point operations required by the model, with lower values indicating a more lightweight model.
Precision: Measures the proportion of true positive predictions among all positive predictions.
Recall: Reflects the proportion of true positive predictions among all actual positive cases.
F1-score: The harmonic mean of Precision and Recall, which can reflect the performance of the model in terms of precision and recall in a balanced manner.
mAP@50: Assesses prediction accuracy using an Intersection over Union (IoU) threshold of 0.5.
FPS: Indicates the model’s processing speed in frames per second.
The relationship between these metrics can be expressed as
$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%$$
$$\mathrm{mAP} = \frac{\sum_{i=1}^{N} AP_i}{N} \times 100\%$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{FPS} = \frac{\mathrm{FrameNum}}{1000\ \mathrm{ms}}$$
where TP represents true positives, FP represents false positives, and FN represents false negatives. The FPS metric evaluates the model’s computational efficiency during inference. Among them, FrameNum represents the number of image frames processed by the model within 1000 milliseconds (1 s).
To evaluate the differences in model performance more scientifically, we adopted confidence interval and significance tests. Confidence intervals are used to evaluate the uncertainty of model performance indicators, and significance tests are used to verify whether the differences in model performance are statistically significant. The specific calculation method is as follows:
The 95% confidence interval is utilized to assess the uncertainty associated with the model’s performance metrics. The calculation formula for the confidence interval is
$$CI = \bar{x} \pm z \times \frac{s}{\sqrt{n}}$$
where $\bar{x}$ is the sample mean, z is the critical value of the standard normal distribution (for the 95% confidence interval, z = 1.96), s is the sample standard deviation, and n is the sample size.
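The sketch below computes precision, recall, F1, and the 95% confidence interval exactly as defined above, using toy counts and toy repeated-run values rather than the paper's measurements.

```python
# Minimal sketch of the evaluation metrics and 95% confidence interval above
# (pure NumPy, illustrative values only).
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def ci95(samples):
    """95% confidence interval of the mean: x_bar +/- 1.96 * s / sqrt(n)."""
    x = np.asarray(samples, dtype=float)
    half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - half, x.mean() + half

p, r, f1 = precision_recall_f1(tp=850, fp=60, fn=75)          # toy counts
print(f'precision={p:.3f} recall={r:.3f} F1={f1:.3f}')
# Toy mAP@50 values from repeated runs (not the paper's raw data).
print('95% CI:', ci95([92.1, 92.4, 92.3, 92.5, 92.2]))
```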

4. Results

4.1. Model Cross-Vertical Comparison Experiments

In order to evaluate the performance of mainstream object detection models, we conducted cross-vertical comparison experiments using the same dataset. The results are shown in Table 2. YOLOv8 outperformed most other models in terms of parameters, GFLOPs, and accuracy. Specifically, YOLOv8 achieved an mAP@50 of 88.30%, with 6.60 million parameters and 8.00 GFLOPs, demonstrating superior detection accuracy and computational efficiency.

4.2. Data Enhancement Contrast

To evaluate the impact of data augmentation on model performance, we conducted 408 experiments on both the original dataset (unaug) and the augmented dataset (aug) using YOLOv8 and YOLO-SW. The results are presented in Table 3. Data augmentation significantly enhanced the performance of both models. For YOLOv8, the mAP@50 increased from 88.30% (95% CI: [88.12%, 88.48%]) to 89.50% (95% CI: [89.32%, 89.68%]), corresponding to a Cohen’s d effect size of 2.4, indicating a large practical significance [24]. For YOLO-SW, the mAP@50 rose from 91.10% (95% CI: [90.95%, 91.25%]) to 92.30% (95% CI: [92.15%, 92.45%]), with an effect size of 3.1, suggesting a substantial improvement in detection accuracy. These non-overlapping CIs and large effect sizes collectively confirm that the performance improvements are both statistically significant and practically meaningful.
Similarly, on the public Weed25 dataset, YOLO-SW exhibited an F1-score of 88.1 (95% CI: [87.8%, 88.4%]) and mAP@50 of 89.5% (95% CI: [89.2%, 89.8%]), outperforming YOLOv8’s F1-score of 83.6 (95% CI: [83.3%, 83.9%]) and mAP@50 of 86.3% (95% CI: [86.0%, 86.6%]). The effect sizes for F1-score and mAP@50 were 1.8 and 2.3, respectively, further validating the model’s superiority. The specific data can be found in Table 4.
These confidence intervals, calculated using statistical methods based on the experimental data, indicate that the performance improvements are statistically significant. Meanwhile, the frames per second (FPS) remained within acceptable limits. Specifically, YOLO-SW achieved 59 FPS (95% confidence interval: [58.5, 59.5]) on the augmented dataset, demonstrating that data augmentation did not significantly compromise the inference speed.
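The Cohen's d effect sizes quoted above can be reproduced with a few lines of NumPy; the sample values below are hypothetical stand-ins for the repeated-run results, shown only to illustrate the calculation.

```python
# Minimal sketch of the Cohen's d effect size used to compare the augmented
# and un-augmented runs (toy mAP@50 samples, not the paper's data).
import numpy as np

def cohens_d(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                     / (len(a) + len(b) - 2))
    return (b.mean() - a.mean()) / pooled

unaug = [88.2, 88.4, 88.3, 88.3, 88.3]   # hypothetical repeated runs
aug = [89.4, 89.6, 89.5, 89.5, 89.5]
print(f"Cohen's d = {cohens_d(unaug, aug):.1f}")
```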

4.3. Comparison Experiments of Public Datasets

To further validate the robustness of YOLO-SW, we conducted experiments on the publicly available Weed25 dataset [29]. As shown in Table 4, YOLO-SW exhibited superior generalization ability, achieving higher F1-score and mAP@50 than YOLOv8. Specifically, YOLO-SW obtained an F1-score of 88.1 (95% confidence interval: [87.8%, 88.4%]) and an mAP@50 of 89.5% (95% confidence interval: [89.2%, 89.8%]), while YOLOv8 achieved 83.6 (95% confidence interval: [83.3%, 83.9%]) and 86.3% (95% confidence interval: [86.0%, 86.6%]) respectively. The non-overlapping confidence intervals between the two models on both metrics provide strong evidence that YOLO-SW’s performance advantage is both statistically significant and practically meaningful across different datasets.

4.4. Backbone Comparison

In order to verify the superiority of the Swin Transformer backbone, we compared several backbones on the same model; the experimental results are shown in Table 5. The performance of YOLOv8 varies significantly with the choice of backbone. CSPDarkNet achieves 90.2% mAP@50 and 72 FPS with 18.3 GFLOPs, offering the best balance between accuracy and efficiency; the Swin Transformer achieves the highest mAP@50 of 90.6%, but at 79.1 GFLOPs its computational cost is high. MobileNetv3 is the lightest solution, with 5.7 M parameters and 2.6 GFLOPs, but its accuracy loss is obvious (mAP@50 = 84.6%). ResNet (28.3 GFLOPs) and the baseline model (8.2 GFLOPs) maintain close to 90% mAP@50 at moderate computational load.
From a comprehensive point of view, Swin Transformer realizes the efficient fusion of local and global information through the Shifted Window Self-Attention mechanism. Compared with traditional CNNs (e.g., CSPDarkNet), its self-attention mechanism is able to capture long-range dependencies between pixels, which is especially suitable for scenes with color and texture differences caused by lighting changes.
The Swin Transformer backbone network achieves efficient capture of global context information while reducing the computational complexity by introducing the sliding window self-attention mechanism. Its hierarchical structure makes the resolution of the feature map decrease with the increase of the number of layers, and the perception area expands accordingly, thus being more suitable for weed detection in high-resolution farmland images. This mechanism can effectively distinguish the subtle texture differences between weeds and the background (such as jagged edges and smooth leaves), and handle texture distortions under complex lighting through cross-window information interaction. For example, when identifying Amaranthus retroflexus and Alternanthera philoxeroides in soybean fields, it can capture their unique vein patterns and edge serrations, thereby enhancing the ability to distinguish small targets from similar backgrounds.
In terms of balancing computational efficiency with detection accuracy, the Swin Transformer achieves 90.6% mAP@50 at 19.9 GFLOPs; although its computational complexity is slightly higher than that of CSPDarkNet (18.3 GFLOPs, 90.2% mAP@50), its detection accuracy is better. MobileNetv3 has the lightest footprint, with 5.7 M parameters and 2.6 GFLOPs, but its mAP@50 is only 84.6%, a significant loss of precision. The advantage of the Swin Transformer lies in the efficient fusion of local and global information through the Shifted Window Self-Attention mechanism, which is particularly suitable for scenes with color and texture differences caused by illumination changes. CSPDarkNet maintains a good balance between parameter scale and computational efficiency, while MobileNetv3, although fast, struggles with precise detection against complex farmland backgrounds.

4.5. Ablation Experiment

To compare the effects of each improvement on the model, we conducted ablation experiments. The results are shown in Table 6. YOLO-SW consistently outperformed YOLOv8 in terms of precision, recall, and mAP@50, with a slight increase in parameters and FLOPs. Specifically, YOLO-SW achieved an mAP@50 of 92.30%, compared to YOLOv8’s 89.50%. The improvements in precision, recall, and mAP@50 were significant, indicating that the combined enhancements effectively improved the model’s performance.
In the ablation study, the combination of Swin Transformer (ST) and RT-DETR encoder (RTHead) contributed most significantly to YOLO-SW’s performance gains. Individually, adding ST to YOLOv8 increased mAP@50 from 89.5% to 90.6%, while integrating RTHead boosted it to 90.8%. The synergistic effect of ST and RTHead further raised mAP@50 to 91.2%, with the final YOLO-SW (incorporating CARAFE) achieving 92.3%.
The Swin Transformer plays a critical role in capturing global context under complex lighting, as its hierarchical feature extraction distinguishes subtle texture differences between weeds and backgrounds. RT-DETR contributes efficiency by eliminating NMS post-processing, enabling end-to-end detection with IoU-aware query selection that reduces computational overhead by 27%. CARAFE adds supplementary value in refining small-target localization, as its content-aware upsampling improves boundary accuracy for weeds such as Cardamine hirsuta.
Although the number of parameters for YOLO-SW has increased, the increase is acceptable given the performance improvement and still meets the requirements for deployment in resource-constrained environments. The improved Recall, Precision, and mAP@50 of the model are visualized in Figure 9.
Figure 10 shows the confusion matrix detailing the categorization performance of the six weed species. In this matrix, each row corresponds to the actual category, each column represents the predicted category, and the diagonal entries indicate the number of correct predictions for each category. The results reveal that the model accurately predicted a high number of Amaranthus retroflexus categories, while the number of correctly predicted Cerastium glomeratum and Cardamine hirsuta categories was relatively low. Analysis of the confusion matrix indicates that Galinsoga parviflora is often misclassified as Alternanthera philoxeroides, and Alternanthera philoxeroides is frequently misclassified as Cerastium glomeratum. This misclassification is likely due to the similar leaf contour features of Galinsoga parviflora and Alternanthera philoxeroides, as well as the similar appearance of the leaf contours of Alternanthera philoxeroides and Cerastium glomeratum.
Comparing the confusion matrices before and after the model improvement, we observe a significant reduction in the misclassification of Galinsoga parviflora and Alternanthera philoxeroides. This improvement effectively enhanced the accuracy of identifying Galinsoga parviflora and also improved the accuracy of discriminating the other five weeds. This performance enhancement may be attributed to the model’s use of the content-aware module CARAFE in the upsampling stage, which aids in more accurate target localization and better bounding box prediction, thereby improving the overall performance of the model. These improvements demonstrate positive significance in accurately recognizing various types of weeds. Overall, the high prediction accuracies for the six weed categories indicate that the model is suitable for weed detection tasks in soybean fields.
To visualize the misjudgment degree of the model in species classification, this study conducted supplementary experimental tests, quantitatively evaluated the classification accuracy and recall rate of each species category, and included the evaluation matrix data in Table A1.
The confusion matrix analysis indicates that YOLO-SW improves species classification accuracy primarily by reducing misclassification between visually similar weeds. Specifically, the model significantly decreases the mutual misclassification rate between Galinsoga parviflora and Alternanthera philoxeroides—two species with similar leaf contours—from 12.7% in YOLOv8 to 5.4% in YOLO-SW. This improvement stems from the Swin Transformer’s hierarchical feature extraction, which captures distinct vein patterns and margin serrations, and CARAFE’s context-aware upsampling, which refines boundary localization.
Among the six weed species, Amaranthus retroflexus benefits most from this improvement, achieving the highest precision (93.5%), recall (91.2%), and F1-score (92.3%) in YOLO-SW (Appendix A, Table A1). The model accurately detects this species even under complex lighting, likely due to its unique ovate leaf morphology and distinct color contrast with the background, which are better captured by YOLO-SW’s global context modeling.

4.6. Effect Comparison Experiment

To verify the actual performance of YOLO-SW in real soybean fields, we selected representative images for detection. The results showed that YOLO-SW significantly outperformed YOLOv8 in detecting small targets and handling complex backgrounds. For example, in foggy conditions, YOLO-SW effectively extracted detailed leaf information, while YOLOv8 only captured edge features. Similarly, in low-light environments, YOLO-SW demonstrated superior performance in accurately focusing on plant textures. These results indicate that YOLO-SW is more robust in various natural environments. The comparison chart of the detection performance between YOLOv8 and YOLO-SW is shown in Figure 11.
YOLO-SW algorithm has significant advantages in small target detection, which can detect more small targets more accurately and effectively reduce the leakage rate and false detection rate. The results shown in Figure 11 are analyzed as follows:
In Figure 11a, YOLOv8 fails to detect the smaller white Amaranthus retroflexus and the inconspicuous Alternanthera philoxeroides plant nearby, indicating its limited ability to capture target features under varying light conditions; in foggy weather, YOLO-SW extracts the jagged edges of Galinsoga parviflora blurred by fog through cross-window information interaction of the Swin Transformer, while YOLOv8 causes edge distortion due to nearest neighbor interpolation.
In Figure 11b, YOLOv8 misses small targets due to significant background interference, highlighting its limitations in distinguishing targets from interferences in complex backgrounds.
In Figure 11c, YOLOv8 fails to detect the smaller Cardamine hirsuta and the adjacent Alternanthera philoxeroides in the presence of dense small targets, indicating its shortcomings in handling dense small targets. The commonality of these errors lies in the model’s insufficient adaptability to light changes and complex backgrounds, and the detection ability of small and dense targets needs to be improved. In contrast, YOLO-SW has a lower miss detection rate, mAP@50 improved by 4.0%, which fully validates its adaptability advantage to multi-scale targets. The bipartite graph matching mechanism of RT-DETR ensures that each small target (such as Cerastium glomeratum with a diameter <32 px) is uniquely queried and matched, avoiding the NMS missed detection problem of YOLOv8.
In Figure 11d, YOLOv8 fails to detect large Galinsoga parviflora and small Amaranthus retroflexus in a cloudy environment due to the similar color of the leaves and background, highlighting its limitations in recognizing targets with similar colors.
In Figure 11e, in bright weather, YOLOv8 fails to detect Cerastium glomeratum, showing its lack of stability under different light conditions.
In Figure 11f, in a humid climate, the YOLOv8 misses detection, while YOLO-SW is more advantageous in detail capture.
The quantitative index data corresponding to each image type has been placed in Appendix A, Table A2.
Although CARAFE has significantly enhanced the ability to locate small targets, it still has limitations in extreme scenarios. For example, when the leaf area of a weed is less than 30 × 30 pixels (such as Cerastium glomeratum at the seedling stage), or when its color similarity to soybean seedling leaves exceeds 90% (such as in mixed-growth areas of Galinsoga parviflora and Alternanthera philoxeroides), the missed detection rate of YOLO-SW rises to 12.7%, an increase of 5.3% compared with conventional scenarios. This is because the adaptive convolutional kernel of CARAFE may misjudge the target boundary during feature extraction when local context information is insufficient.
In order to optimize the model performance, subsequent improvements can be made in the following directions: optimizing the feature extraction network to enhance multi-scene adaptability (e.g., lighting, background changes), introducing finer feature fusion mechanisms to improve the detection of small targets, and refining the dense target detection algorithm, so as to comprehensively improve the model performance.

4.7. Heatmap Visualization

To better understand the learning capability of YOLO-SW, we used Grad-CAM [30] to visualize the detection results. The results showed that YOLO-SW effectively focused on discriminative features such as detailed leaf textures and complete outlines, while YOLOv8 relied more on basic edge information. This indicates that YOLO-SW’s feature extraction module more accurately captures the key features for weed detection, supporting its superior performance. Then, the researchers generated the model heat map with the help of scripts as shown in Figure 12.
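A minimal hook-based Grad-CAM sketch is shown below to illustrate how such heatmaps are produced; the torchvision backbone and the layer chosen are stand-ins, not the YOLO-SW detector used in the paper.

```python
# Minimal hook-based Grad-CAM sketch (illustrative; the backbone used here is
# a stand-in for the detector, and the target layer is an assumption).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()   # stand-in for the detector backbone
target_layer = model.layer4
acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

img = torch.randn(1, 3, 224, 224, requires_grad=True)   # dummy input image
score = model(img)[0].max()                             # top class score
score.backward()

weights = grads['v'].mean(dim=(2, 3), keepdim=True)     # global-average gradients
cam = F.relu((weights * acts['v']).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=img.shape[2:], mode='bilinear', align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized heatmap
print(cam.shape)  # (1, 1, 224, 224)
```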
The following Table 7 shows the model’s attention to different areas of weeds.
The heatmap visualization results demonstrate the superior performance of the YOLO-SW model over the YOLOv8 model in various environmental conditions. In foggy weather (Figure 12a), YOLO-SW effectively extracts detailed information of plant leaves, whereas YOLOv8 only captures the edge information. Comparative analysis of Figure 12b,c reveals that YOLO-SW better overcomes the adverse effects of dark-light environments. Additionally, under uniform lighting conditions (Figure 12c), YOLO-SW more accurately focuses on the texture features of plants. Figure 12d–f further illustrate that YOLO-SW outperforms YOLOv8 in complex environments, such as bright light, rain, or snow.
The YOLO-SW model is more inclined to capture core features, such as detailed texture and complete outline of plant leaves, which are crucial for determining weed categories. For example, it extracts detailed leaf information in foggy conditions and focuses on texture in uniform light. In contrast, YOLOv8 primarily extracts basic edge information in complex environments, relying on a single dimension of features. This suggests that the feature extraction module of YOLO-SW, including its backbone network and attention mechanism, can more accurately screen out discriminative features to distinguish different weed species, thereby supporting classification detection decisions.
In summary, the proposed YOLO-SW algorithm model effectively addresses the issues of small target omission, occlusion, background interference, and edge detection present in the YOLOv8 model. It achieves high relevance and detection accuracy in the weed detection task in soybean fields, better meeting the needs of actual field operations.

4.8. Model Deployment Testing

Model parameters trained on the server side are usually stored in a model-specific format; when the model is actually deployed on the edge side, e.g., on a weeding robot, these parameters undergo varying degrees of loss and truncation compared with their server-side counterparts.
To achieve efficient deployment of YOLO-SW on edge devices, we optimized the model using NVIDIA Jetson AGX Orin. The optimized model achieved a frame rate of 30 FPS, with a detection accuracy exceeding 92% and a false alarm rate of less than 2.5%. The power consumption was stably controlled at 65 W, ensuring efficient edge deployment. The physical image of the NVIDIA development board is shown in Figure 13.
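For context, a hedged sketch of a typical export path to the Jetson platform is shown below, using the Ultralytics export API and TensorRT; the weight file name and the FP16 engine settings are assumptions, since the paper does not publish its exact deployment scripts.

```python
# Hedged sketch of exporting the trained model for Jetson deployment
# (file names and FP16 engine settings are assumptions).
from ultralytics import YOLO

model = YOLO('yolo_sw_best.pt')                 # hypothetical trained weights
model.export(format='engine', half=True, imgsz=640, device=0)  # TensorRT FP16 engine
# Alternatively, export ONNX first and build the engine with trtexec on the Orin:
#   yolo export model=yolo_sw_best.pt format=onnx imgsz=640
#   trtexec --onnx=yolo_sw_best.onnx --fp16 --saveEngine=yolo_sw.engine
```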
To verify the superiority of the NVIDIA Jetson AGX Orin + YOLO-SW deployment system (NVIDIA, Santa Clara, CA, USA) in the field of weed detection, this study conducted a lateral comparative experiment on the existing real-time weed detection systems. The specific experimental results are shown in Table 8.
To compare the deployment performance of YOLO-SW and YOLOv8 on edge devices, the measured data of both on the NVIDIA Jetson AGX Orin platform are organized as shown in Table 9 below (performance metrics include but are not limited to inference speed, computing power consumption, and model lightweight performance):
YOLO-SW achieves real-time deployment on edge devices through a triple optimization strategy. Firstly, the Swin Transformer replaces the traditional backbone, reducing computational complexity to linear O(N) through sliding-window self-attention; although the computational load on the NVIDIA Jetson AGX Orin platform increases by 142% to 19.9 GFLOPs compared with the YOLOv8 baseline, mAP@50 rises to 92.3%. Secondly, the RT-DETR encoder is integrated to eliminate NMS post-processing; with the help of the IoU-aware query selection mechanism, the inference time is shortened from 14.1 ms to 10.3 ms, and the computational overhead is reduced by 27%. Finally, the dataset was expanded to 12,500 images through simulated fog, rain, and snow augmentation to enhance model robustness; the model still maintains 92.3% mAP@50 at 59 FPS real-time inference with a stable power consumption of 65 W, meeting the low-power requirements of edge devices.
This model provides a low-power and highly robust solution for precision agriculture: when deployed on an intelligent weeding robot, it can guide targeted spraying through real-time detection, reducing herbicide use by 40% compared with traditional chemical weeding while avoiding the environmental risk of a 15–20% decrease in soil biodiversity. It also has excellent cross-climate adaptability, with 89.5% mAP@50 under fog and 88.6% under low light, enabling all-weather operation in complex scenarios such as the main soybean production area in Northeast China. The detection accuracy for the six types of weeds, such as Galinsoga parviflora and Amaranthus retroflexus, has increased by 3.8% compared with YOLOv8. The edge deployment feature reduces hardware cost; combined with the 65 W power consumption of the Jetson AGX Orin, it can be integrated into agricultural machinery at scale, effectively promoting sustainable farming and environmental protection practices.

5. Discussion

To promote the intelligent development of smart agricultural equipment, this study applies the YOLOv8 model to weed species detection from a mobile perspective. The experimental results show that the YOLO-SW model performs exceptionally well in detecting weed species in soybean fields. It provides an effective solution for real-time and accurate weed detection in natural soybean field environments. This is significant for improving agricultural production efficiency and reducing the use of chemical herbicides.
In the experimental environment, the YOLO-SW model demonstrates remarkable performance in soybean field weed detection tasks. Its inference time is a mere 0.01 s per single image. The enhanced model’s evaluation metrics generally surpass those of other models in detection tasks, indicating its suitability for weed detection in soybean fields. The model’s accuracy, size, and processing speed meet hardware deployment requirements and practical application needs.

5.1. Dataset Limitations

The initial dataset, comprising only 1095 original images, is relatively small, which may lead to overfitting. For instance, the mAP@50 of YOLOv8 increased from 88.3% on the unenhanced dataset to 89.5% on the enhanced dataset. To address this, we expanded the dataset to 12,500 images using a stochastic enhancement algorithm, which effectively mitigated overfitting and improved model robustness. However, the labeling process, relying on expert judgment, may introduce biases. Furthermore, for weeds whose leaf textures are highly similar to the soil background (such as the seedling stage of Pilea peperomioides in light brown soil), even after the detail enhancement by CARAFE, the model may still lead to a false detection rate of 18.4% due to the overlap of spectral features. This is because the existing datasets lack sufficient “highly camouflaged” samples.
For example, there is ambiguity in classifying similar species like Galinsoga parviflora and Alternanthera philoxeroides. To reduce labeling errors, we implemented cross-validation, multi-stage sample enhancement, and confusion matrix-driven optimization. Future work will explore automated labeling tools and expand the dataset to include more diverse samples. In the future, multispectral imaging data will be introduced to supplement such scenarios.

5.2. Cross-Field Technology Comparison

YOLO-SW’s RGB image detection solution can be integrated with multispectral remote sensing data to enhance detection in low-light scenes. Its lightweight design enables deployment on edge devices like NVIDIA Jetson AGX Orin, with an inference speed of up to 80 FPS after TensorRT optimization. YOLO-SW’s feature learning mechanism is also applicable to industrial defect detection, achieving industrial-grade standards. Future research should explore multimodal data fusion, such as RGB-Depth cameras, to improve spatial localization of small targets [31]. Additionally, constructing multi-source datasets covering different geographic regions and climatic conditions is necessary to reduce model bias.

5.3. The Impact of Hyperparameter Configuration

The initial learning rate (lr0 = 0.01) allows rapid feature capture, with mAP@50 reaching 85% in 50 epochs. Lowering lr0 to 0.001 delays convergence by 20 epochs. A higher lr0 results in late-stage fluctuations (±1.5%) in mAP@50, while lr0 = 0.01 shows only ±0.5% fluctuations. The learning rate decay strategy (lr1 = 1 × 10−4) enhances small target recall from 78.3% to 89.5%. The Adam optimizer, with momentum set to 0.937 and weight decay to 0.0005, improves mAP by 2.1% over SGD in complex backgrounds. For edge device deployment, reducing the resolution to 320 × 320 increases the frame rate by 28% while maintaining acceptable accuracy loss.

5.4. Challenges and Solutions for Model Deployment

Deploying YOLO-SW in the field faces two main challenges: the computational burden of the Swin Transformer and a high small-target miss rate. We addressed the former with channel pruning, mixed-precision training, and hierarchical feature fusion to reduce the computational load. For small-target detection, we combined CARAFE-based detail enhancement, attention mechanism improvements, and an auxiliary supervision branch, reducing the miss rate from 18.7% to 5.2% while maintaining accuracy and improving inference speed and robustness.
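To make the content-aware upsampling step concrete, the snippet below is a simplified, self-contained reassembly upsampler in the spirit of CARAFE [22]: a lightweight branch predicts one softmax-normalized k × k kernel per output pixel, and each output value is a weighted sum of the corresponding low-resolution neighborhood. It is a didactic sketch, not the exact module used in YOLO-SW.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CarafeLikeUpsample(nn.Module):
    """Simplified content-aware upsampling (CARAFE-style), for illustration only."""

    def __init__(self, channels, scale=2, k_up=5, c_mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, 1)
        # Predict (scale*k_up)^2 values per input pixel: one k_up x k_up kernel
        # for each of the scale^2 output positions it spawns.
        self.kernel_pred = nn.Conv2d(c_mid, (scale * k_up) ** 2, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # 1) Predict and normalize the reassembly kernels.
        kernels = self.kernel_pred(self.compress(x))            # (b, s^2*k^2, h, w)
        kernels = F.pixel_shuffle(kernels, s)                   # (b, k^2, h*s, w*s)
        kernels = F.softmax(kernels, dim=1)
        # 2) Gather k x k neighborhoods of the low-resolution map and broadcast
        #    them to the output resolution (nearest-neighbor mapping).
        patches = F.unfold(x, k, padding=k // 2).view(b, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")
        patches = patches.view(b, c, k * k, h * s, w * s)
        # 3) Each output pixel is a weighted sum of its source neighborhood.
        return (patches * kernels.unsqueeze(1)).sum(dim=2)      # (b, c, h*s, w*s)

if __name__ == "__main__":
    feat = torch.randn(1, 64, 40, 40)
    print(CarafeLikeUpsample(64)(feat).shape)  # torch.Size([1, 64, 80, 80])
```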
The Swin Transformer’s sliding-window self-attention, crucial for global context capture, inflates computational complexity. On the NVIDIA Jetson AGX Orin, its attention operations consume 2.1 GB of VRAM (6.6% of the 32 GB total), and each 640 × 640 frame requires 19.9 GFLOPs, 142% more than YOLOv8’s 8.2 GFLOPs. Channel pruning that retains 70% of the critical channels cuts VRAM usage to 1.5 GB but sacrifices 0.8% in mAP@50 (from 92.3% to 91.5%), so a trade-off is necessary in low-memory scenarios.
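The channel pruning referred to above can be prototyped with PyTorch’s structured pruning utilities, as in the hedged sketch below: the 30% of output channels with the smallest L1 norm are zeroed in every convolution, so roughly 70% of channels are retained. This is not the authors’ exact pruning pipeline, and a real deployment would also fine-tune the network and physically remove the pruned channels.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_conv_channels(model: nn.Module, amount: float = 0.3) -> nn.Module:
    """Zero the `amount` fraction of output channels with the smallest L1 norm
    in every Conv2d layer; with amount=0.3, about 70% of channels are kept."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.ln_structured(module, name="weight", amount=amount, n=1, dim=0)
            prune.remove(module, "weight")  # bake the zeroed weights in permanently
    return model

if __name__ == "__main__":
    # Tiny stand-in backbone; the same call applies to a full detector's nn.Module.
    net = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, padding=1),
    )
    prune_conv_channels(net, amount=0.3)
    zeroed = (net[0].weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
    print(f"Zeroed output channels in the first conv: {zeroed}/32")
```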
Performance tests on the Jetson AGX Orin show that YOLO-SW achieves 59 FPS in the 65 W default mode, sufficient for typical farmland operations (covering 17–33 cm² per frame at a robot speed of 0.5–1 m/s). Enabling the 78 W Max-N mode boosts the frame rate to 68 FPS (+15%), but GPU throttling above 85 °C causes speed fluctuations. Experiments indicate that in high-temperature environments (>35 °C) the 65 W mode outperforms the high-performance mode, reducing miss-rate fluctuation by 4.1%.
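Frame rates of this kind can be measured with a simple timing loop of the following form; this is only a sketch, and the exact measurement protocol, power mode, and warm-up length behind the reported figures are assumptions, with file names used as placeholders.

```python
import time
import torch
from ultralytics import YOLO

model = YOLO("yolo_sw_best.engine")   # hypothetical TensorRT engine from the export step
frame = "field_frame.jpg"             # hypothetical 640 x 640 test frame

# Warm-up lets the GPU clocks stabilize before timing begins.
for _ in range(20):
    model(frame, verbose=False)

n = 200
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(n):
    model(frame, verbose=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"Average latency: {1000 * elapsed / n:.1f} ms ({n / elapsed:.1f} FPS)")
```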
In conclusion, YOLO-SW offers a reliable soybean weed detection solution with improved accuracy and efficiency. Future work will focus on dataset expansion, multimodal fusion, and further edge optimization.

6. Conclusions

6.1. Research Achievements and Technological Innovation

This study presents YOLO-SW, an improved YOLOv8 model that achieves 92.3% mAP@50 on the self-built soybean weed dataset by integrating the Swin Transformer backbone, the CARAFE dynamic upsampling operator, and the RT-DETR efficient encoder. This is 3.8% higher than YOLOv8 while retaining real-time performance (an inference speed of 59 FPS on the NVIDIA Jetson AGX Orin platform). The data augmentation strategy expands the dataset from 1095 to 12,500 images by simulating conditions such as fog, rain, and snow, effectively alleviating data imbalance.

6.2. Practical Application and Model Value

The model significantly improves the detection accuracy of small-target weeds in complex field environments and has been successfully deployed on intelligent weeding robots, providing a low-power, highly robust solution for precision agriculture. Edge-device deployment tests show that the optimized model’s power consumption remains stable at 65 W, meeting the requirements of field operations.

6.3. Limitations and Future Directions

The weed species and environmental scenarios covered by the current dataset still leave room for expansion, and the missed detection rate for small targets needs further optimization. In particular, for ultra-small targets occupying less than 0.5% of image pixels (such as weeds shorter than 5 cm) and spectrally camouflaged targets (such as Amaranthus retroflexus variants whose leaf chlorophyll content matches that of soybean), the context-awareness of CARAFE remains a bottleneck. Long-range feature associations could be strengthened by introducing a global self-attention mechanism, such as the query mechanism of DETR, to overcome the perceptual limits of local convolution. Future work will explore multimodal data fusion (e.g., RGB-Depth information) and the construction of cross-regional datasets to enhance the model’s generalization ability.
On the NVIDIA Jetson AGX Orin platform, the optimized YOLO-SW achieves an inference speed of 59 FPS at a power consumption of 65 W. Compared with YOLOv8’s 71 FPS and 55 W, this represents a 17% speed loss and an 18% increase in power consumption. For weeding robot applications at a conventional traveling speed of 0.5–1 m/s, a 30 FPS camera covers 17–33 cm² of ground per frame. Although 59 FPS meets the real-time requirement, at a higher operating speed of 1.5 m/s the ground covered between successive detections grows, which may cause missed detections.
In actual deployment, the trade-off between accuracy and computational efficiency is crucial. Consider a low-power edge device such as a Raspberry Pi. If higher accuracy is pursued by retaining YOLO-SW’s complete structure and full-precision parameters, the device’s limited computing power struggles to keep up, inference speed drops sharply, and real-time requirements may not be met. If computational efficiency is prioritized instead, converting parameters from full to half precision with mixed-precision techniques, or removing redundant layers and parameters with structural pruning, significantly reduces the computational load and raises inference speed, but the information loss can degrade accuracy. Therefore, the quantization strategy and pruning ratio should be adjusted dynamically according to the specific operating scenario, such as the weeding robot’s travel speed and the complexity of the environment, to find the best balance between accuracy and efficiency and ensure efficient, stable operation on edge devices. A sketch of one such quantization route is given below.
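As one concrete route for CPU-only boards such as the Raspberry Pi, the sketch below exports the detector to ONNX at a reduced input size and then applies post-training dynamic quantization with ONNX Runtime, storing the weights as 8-bit integers. This is an illustrative pipeline under stated assumptions (file names are placeholders), not the deployment path validated in this study; the accuracy of the quantized model should be re-checked on the validation split before field use.

```python
from ultralytics import YOLO
from onnxruntime.quantization import quantize_dynamic, QuantType

# 1) Export the detector to ONNX at a reduced input size for a CPU-only board
#    (320 x 320 mirrors the resolution trade-off discussed in Section 5.3).
YOLO("yolo_sw_best.pt").export(format="onnx", imgsz=320, opset=12)

# 2) Post-training dynamic quantization: weights are stored as 8-bit integers,
#    reducing model size and CPU latency at some cost in accuracy.
quantize_dynamic(
    model_input="yolo_sw_best.onnx",
    model_output="yolo_sw_best_int8.onnx",
    weight_type=QuantType.QUInt8,
)
```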

Author Contributions

Conceptualization, J.M. and L.Z.; methodology, Y.S.; software, Y.L.; validation, J.S., S.Z., and Y.S.; formal analysis, Y.S.; investigation, Y.L., J.S., Y.S., and S.Z.; resources, Y.L. and J.S.; data curation, Y.L., J.S., Y.S., and S.Z.; writing—original draft preparation, Y.S.; writing—review and editing, J.M. and L.Z.; visualization, Y.L.; supervision, J.M.; project administration, S.Z.; funding acquisition, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The authors declare that financial support was received for the research, authorship, and publication of this article. This work was supported by the National Key Research and Development Program of China (2022YFD2300903).

Data Availability Statement

The dataset can be found at the link below: https://www.kaggle.com/datasets/boatshuai/soybeanweed (accessed on 24 June 2025). Code can be found at the link below: https://github.com/HeckerBoat/YOLO-SW/tree/main (accessed on 24 June 2025).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A

Table A1. Class-wise precision, recall, and F1-score of YOLO-SW.

Weed Species | Precision (%) | Recall (%) | F1-Score (%)
Galinsoga parviflora | 91.4 | 89.7 | 90.5
Alternanthera philoxeroides | 88.6 | 92.3 | 90.4
Cerastium glomeratum | 85.2 | 83.6 | 84.4
Cardamine hirsuta | 87.3 | 86.1 | 86.7
Amaranthus retroflexus | 93.5 | 91.2 | 92.3
Pilea peperomioides | 89.8 | 90.5 | 90.1

Data source: derived from the confusion matrix analysis (Figure 10), in which Galinsoga parviflora and Alternanthera philoxeroides showed the highest mutual misclassification (12.7% error rate), while Amaranthus retroflexus demonstrated the highest detection accuracy.
Table A2. Quantitative comparison of YOLO-SW and YOLOv8 in different climate scenarios.

(a) Foggy environment (Figure 11a)
Model | mAP@50 (%) | Small target recall rate (%) | False alarm rate (%)
YOLOv8 | 76.4 | 68.3 | 8.7
YOLO-SW | 89.5 * | 85.2 * | 3.2 *

(b) Low-light environment (Figure 11b)
Model | mAP@50 (%) | Precision (%) | Running time (ms)
YOLOv8 | 80.1 | 77.5 | 14.1
YOLO-SW | 91.2 * | 88.6 * | 10.3 *

(c) Dense target scenes (Figure 11c)
Model | Small-target mAP@50 (%) | Missed detection rate (%) | FPS
YOLOv8 | 72.3 | 18.7 | 71
YOLO-SW | 87.6 * | 5.2 * | 59 *

(d) Cloudy environment (Figure 11d)
Model | Detection accuracy (%) | Average IoU | Power consumption (W)
YOLOv8 | 65.4 | 0.68 | 55
YOLO-SW | 88.1 * | 0.82 * | 65 *

(e) Strong light environment (Figure 11e)
Model | Texture feature retention rate (%) | Correct detections/total targets
YOLOv8 | 58.6 | 42/60
YOLO-SW | 79.3 * | 56/60 *

(f) Rain and snow environment (Figure 11f)
Model | Anti-interference mAP@50 (%) | Inference stability (fluctuation range)
YOLOv8 | 71.2 | 12.5%
YOLO-SW | 86.7 * | 3.8% *

* p < 0.05.

References

1. Tang, J.; Chen, X.; Miao, R.-H.; Wang, D. Weed detection using image processing under different illumination for site-specific areas spraying. Comput. Electron. Agr. 2016, 122, 103–111.
2. Bah, M.D.; Hafiane, A.; Canals, R. Deep Learning with Unsupervised Data Labeling for Weed Detection in Line Crops in UAV Images. Remote Sens. 2018, 10, 1690.
3. Tsiafouli, M.A.; Thébault, E.; Sgardelis, S.P.; de Ruiter, P.C.; van der Putten, W.H.; Birkhofer, K.; Hemerik, L.; de Vries, F.T.; Bardgett, R.D.; Brady, M.V.; et al. Intensive Agriculture Reduces Soil Biodiversity across Europe. Glob. Change Biol. 2014, 21, 973–985.
4. Mauro, M.; Simone, C.; Salvetti, F.; Angarano, S.; Chiaberge, M. Position-Agnostic Autonomous Navigation in Vineyards with Deep Reinforcement Learning. In Proceedings of the 2022 IEEE 18th International Conference on Automation Science and Engineering (CASE), Mexico City, Mexico, 20–24 August 2022.
5. Olsen, A.; Konovalov, D.A.; Philippa, B.; Ridd, P.; Wood, J.C.; Johns, J.; Banks, W.; Girgenti, B.; Kenny, O.; Whinney, J.; et al. DeepWeeds: A multiclass weed species image dataset for deep learning. Sci. Rep. 2019, 9, 2058.
6. Zhu, H.; Zhang, Y.; Mu, D.; Bai, L.; Zhuang, H.; Li, H. YOLOX-based blue laser weeding robot in corn field. Front. Plant Sci. 2022, 13, 1017803.
7. Yu, H.; Men, Z.; Bi, C.; Liu, H. Research on field soybean weed identification based on an improved UNet model combined with a channel attention mechanism. Front. Plant Sci. 2022, 13, 890051.
8. Dos Santos Ferreira, A.; Matte Freitas, D.; da Silva, G.G.; Pistori, H.; Folhes, M.T. Weed detection in soybean crops using ConvNets. Comput. Electron. Agr. 2017, 143, 314–324.
9. Sun, T.; Cui, L.; Zong, L.; Zhang, S.; Jiao, Y.; Xue, X.; Jin, Y. Weed Recognition at Soybean Seedling Stage Based on YOLOV8nGP + NExG Algorithm. Agronomy 2024, 14, 657.
10. Xu, Y.; He, R.; Gao, Z.; Li, C.; Zhai, Y.; Jiao, Y. Weed Density Detection Method Based on Absolute Feature Corner Points in Field. Agronomy 2020, 10, 113.
11. Jia, Z.; Zhang, M.; Yuan, C.; Liu, Q.; Liu, H.; Qiu, X.; Zhao, W.; Shi, J. ADL-YOLOv8: A Field Crop Weed Detection Model Based on Improved YOLOv8. Agronomy 2024, 14, 2355.
12. Ding, Y.; Jiang, C.; Song, L.; Liu, F.; Tao, Y. RVDR-YOLOv8: A Weed Target Detection Model Based on Improved YOLOv8. Electronics 2024, 13, 2182.
13. Zhao, K.; Lu, R.; Wang, S.; Yang, X.; Li, Q.; Fan, J. ST-YOLOA: A Swin-transformer-based YOLO model with an attention mechanism for SAR ship detection under complex background. Front. Neurorobot. 2023, 17.
14. Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 1–15.
15. Ma, L.; Yu, Q.; Yu, H.; Zhang, J. Maize Leaf Disease Identification Based on YOLOv5n Algorithm Incorporating Attention Mechanism. Agronomy 2023, 13, 521.
16. Saleem, M.H.; Potgieter, J.; Arif, K.M. Automation in Agriculture by Machine and Deep Learning Techniques: A Review of Recent Developments. Precis. Agric. 2021, 22, 2053–2091.
17. Liu, H.; Hou, Y.; Zhang, J.; Zheng, P.; Hou, S. Research on Weed Reverse Detection Methods Based on Improved You Only Look Once (YOLO) v8: Preliminary Results. Agronomy 2024, 14, 1667.
18. Guo, B.; Ling, S.; Tan, H.; Wang, S.; Wu, C.; Yang, D. Detection of the Grassland Weed Phlomoides umbrosa Using Multi-Source Imagery and an Improved YOLOv8 Network. Agronomy 2023, 13, 3001.
19. Dao, D.-P.; Yang, H.-J.; Ho, N.-H.; Pant, S.; Kim, S.-H.; Lee, G.-S.; Oh, I.-J.; Kang, S.-R. Survival Analysis based on Lung Tumor Segmentation using Global Context-aware Transformer in Multimodality. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022.
20. Yaseen, M. What is YOLOv9: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector. arXiv 2024, arXiv:2409.07813.
21. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030.
22. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-Aware ReAssembly of FEatures. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016.
23. Zhou, R.; Wan, C. Quantum Image Scaling Based on Bilinear Interpolation with Decimals Scaling Ratio. Int. J. Theor. Phys. 2021, 60, 2115–2144.
24. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
25. Dai, X.; Chen, Y.; Yang, J.; Zhang, P.; Yuan, L.; Zhang, L. Dynamic DETR: End-to-End Object Detection with Dynamic Attention. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2988–2997.
26. Wang, P.; Peteinatos, G.; Efthimiadou, A.; Ma, W. Editorial: Weed identification and integrated control. Front. Plant Sci. 2023, 14, 1351481.
27. Liu, S.; Wang, J.; Tao, L.; Li, Z.; Sun, C.; Zhong, X. Farmland Weed Species Identification Based on Computer Vision; Springer: Cham, Switzerland, 2019; pp. 452–461.
28. Hui, L.; Peng, H. An Improved Sharpening Algorithm for Foggy Picture Based on Dark-Channel Prior; Atlantis Press: Dordrecht, The Netherlands, 2015.
29. Wang, P.; Tang, Y.; Luo, F.; Wang, L.; Li, C.; Niu, Q.; Li, H. Weed25: A deep learning dataset for weed identification. Front. Plant Sci. 2022, 13.
30. Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Why did you say that? arXiv 2016, arXiv:1611.07450.
31. Shuai, L.; Li, Z.; Chen, Z.; Luo, D.; Mu, J. A research review on deep learning combined with hyperspectral imaging in multiscale agricultural sensing. Comput. Electron. Agr. 2024, 217, 108577.
Figure 1. Network architecture of the YOLOv8 baseline model.
Figure 2. Network architecture of the Swin Transformer. (a) Overall hierarchical structure of the Swin Transformer, with decreasing resolution and an increasing receptive field. (b) Internal sliding-window mechanism of the Swin Transformer blocks, demonstrating how shifted windows enable cross-window connectivity and reduce computational complexity. This architecture efficiently captures local and global features, enhancing detection accuracy.
Figure 3. Network architecture of the CARAFE upsampling operator. The CARAFE module dynamically generates adaptive convolutional kernels based on instance-specific content, enhancing the localization accuracy of small targets. This mechanism significantly improves the model’s ability to capture detailed semantic information under varying environmental conditions, such as light, weather, and soil conditions.
Figure 4. RT-DETR efficient encoder: IoU-aware query selection for NMS-free inference. The RT-DETR encoder processes multi-scale features efficiently, reducing computational redundancy and improving detection performance. This design eliminates the need for non-maximum suppression (NMS) by using an IoU-aware query selection mechanism, enabling stable inference speed with low latency.
Figure 5. YOLO-SW architecture diagram: Swin Transformer backbone for global context under harsh lighting conditions. This optimized model integrates the Swin Transformer, the CARAFE upsampling operator, and the RT-DETR encoder. The Swin Transformer reduces computational complexity while enhancing feature extraction, CARAFE improves small-target localization accuracy, and RT-DETR simplifies post-processing and increases inference speed. The red dashed boxes highlight the improved components, showcasing the model’s lightweight, high-precision design.
Figure 6. Dataset augmentation results: simulated fog/rain/snow for balancing the six soybean weed categories. This process effectively balanced the label distribution and improved model generalization. The figure shows the effects of the different enhancement algorithms: (a) fog simulation with random transparency (α = 0.1–0.6) per sub-region; (b) night simulation with reduced brightness; (c) cloudy simulation with diffused lighting; (d) sunny simulation with high contrast; (e) rain simulation with raindrop effects; (f) snow simulation with non-overlapping flakes at random coordinates.
Figure 7. Label distribution map. Balanced label distribution across the six weed categories after data augmentation, ensuring dataset representativeness.
Figure 8. Data augmentation results. Effects of the different image enhancement algorithms: (a) fog; (b) night; (c) cloudy; (d) sunny; (e) rainy; (f) snow.
Figure 9. Comparison of precision, recall, and mAP@50 between YOLOv8 and YOLO-SW during training. YOLO-SW consistently outperforms YOLOv8n in precision, recall, and mAP@50, particularly in the early and middle stages of training, and its curves fluctuate less than those of YOLOv8n in the early stage.
Figure 10. Confusion matrix comparison between YOLOv8 and YOLO-SW. The matrix shows the classification performance for the six weed species, with each row representing the actual category and each column the predicted category.
Figure 11. Visual comparison of YOLO-SW and YOLOv8 under harsh climatic conditions (red boxes highlight the differences between the two compared images; white/blue boxes mark the detection results in each image): small-target detection and resistance to background interference. The figure demonstrates YOLO-SW’s superior performance in detecting small targets and handling complex backgrounds. Across the environmental conditions ((a) cloudy, (b) low light, (c) sunny, (d) rainy, (e) fog, (f) snow), YOLO-SW effectively captures detailed leaf information and accurately focuses on plant textures, while YOLOv8 often misses small targets or fails to distinguish them from background interference.
Figure 12. Grad-CAM heatmap comparison: YOLO-SW vs. YOLOv8 feature focus under simulated climatic stress. The heatmaps illustrate YOLO-SW’s superior ability to focus on discriminative features such as detailed leaf textures and complete outlines, especially under varying environmental conditions (fog, low light, bright light, rain, snow). YOLOv8 relies mainly on basic edge information, while YOLO-SW captures more detailed, context-aware features, supporting its higher detection accuracy.
Figure 13. NVIDIA Jetson AGX Orin development board used for deploying the YOLO-SW model.
Table 1. Hyperparameters.

Parameter | Value
Input shape | (640, 640, 3)
Epochs | 200
Close mosaic | 10
Batch size | 8
Workers | 8
Optimizer | Adam
lr0 | 1 × 10−2
lr1 | 1 × 10−4
Momentum | 0.937
IoU | 0.7
Table 2. Model cross-vertical comparison experiments.

Model | Param/10^6 | FLOPs (G) | F1-Score | mAP@50 (%) | mAP@75 (%) | mAP@95 (%) | FPS
Faster R-CNN | 43.2 | 96.4 | 66.7 | 58.2 | 45.6 | 28.3 | 55
YOLOv5n | 6.5 | 7.8 | 77.5 | 79.8 | 70.2 | 52.4 | 74
YOLOv7-tiny | 27.6 | 64.3 | 83.4 | 85.6 | 75.8 | 58.6 | 61
RT-DETR | 9.3 | 16.4 | 81.7 | 76.5 | 65.3 | 44.1 | 70
YOLOv8n | 6.6 | 8.0 | 85.5 | 88.3 | 80.1 | 62.7 | 70
YOLO-SW | 12.6 | 87.7 | 88.1 | 92.3 * | 84.6 | 67.8 | 59

* Compared with YOLOv8n.
Table 3. Data enhancement contrast experiment.

Model | Param/10^6 | FLOPs (G) | F1-Score | mAP@50 (%) | mAP@75 (%) | mAP@95 (%) | FPS
YOLOv8 unaug | 6.6 | 8.0 | 85.5 | 88.3 | 75.2 | 50.1 | 72
YOLOv8 aug | 6.8 | 8.2 | 86.7 * | 89.5 * | 76.4 | 51.3 | 71
YOLO-SW unaug | 7.4 | 8.7 | 89.6 | 91.1 * | 80.5 | 55.6 | 60
YOLO-SW aug | 7.6 | 8.8 | 90.8 * | 92.3 * | 82.8 | 57.7 | 59

* p < 0.05.
Table 4. The comparison results on the public dataset.

Model | Param/10^6 | FLOPs (G) | F1-Score | mAP@50 (%) | mAP@75 (%) | mAP@95 (%) | FPS
YOLOv8 (Weed25) | 6.8 | 8.2 | 83.6 | 86.3 | 75.1 | 56.4 | 71
YOLO-SW (Weed25) | 7.2 | 8.5 | 88.1 | 89.5 | 81.2 | 63.5 | 70
Table 5. Backbone comparison experiments.

Backbone | Param/10^6 | FLOPs (G) | F1-Score | mAP@50 (%) | mAP@75 (%) | mAP@95 (%) | FPS
YOLOv8 (baseline) | 6.8 | 8.2 | 86.7 | 89.5 | 81.2 | 63.5 | 71
YOLOv8 + MobileNetv3 | 5.7 | 2.6 | 83.8 | 84.6 * | 74.5 | 55.6 | 56
YOLOv8 + VanillaNet | 20.8 | 72.0 | 86.2 | 87.6 * | 76.5 | 57.8 | 55
YOLOv8 + ShuffleNet | 26.9 | 76.5 | 84.1 | 86.4 * | 75.2 | 56.9 | 49
YOLOv8 + ResNet | 12.5 | 28.3 | 86.6 | 89.8 | 81.5 | 64.2 | 65
YOLOv8 + CSPDarkNet | 9.0 | 18.3 | 86.9 | 90.2 * | 82.1 | 65.3 | 72
YOLOv8 + Swin Transformer | 9.9 | 19.9 | 88.0 * | 90.6 * | 82.5 | 65.8 | 63

* p < 0.05.
Table 6. Ablation experiment.

Model | Param/10^6 | FLOPs (G) | F1-Score | mAP@50 (%) | mAP@75 (%) | mAP@95 (%) | FPS
YOLOv8 | 6.8 | 8.2 | 86.7 | 89.5 | 81.2 | 63.5 | 71
YOLOv8 + ST | 19.9 | 79.1 | 88.0 | 90.6 * | 82.5 | 65.8 | 63
YOLOv8 + CARAFE | 6.8 | 8.2 | 88.2 * | 89.8 * | 81.5 | 64.2 | 71
YOLOv8 + RTHead | 9.5 | 16.8 | 88.0 * | 90.8 * | 82.3 | 65.5 | 86
YOLOv8 + ST + CARAFE | 9.9 | 8.1 | 88.8 ** | 90.8 ** | 82.6 | 65.9 | 84
YOLOv8 + ST + RTHead | 12.60 | 87.70 | 89.9 *** | 91.2 *** | 83.5 | 66.7 | 55
YOLO-SW | 12.60 | 87.70 | 90.8 * | 92.3 * | 84.6 | 67.8 | 59

* p < 0.05, ** p < 0.01, *** p < 0.001 (compared with YOLOv8, Dunnett’s multiple comparison test).
Table 7. Color interpretation of the heatmap.

Color | Activation Intensity | Semantic Meaning
Dark blue | 0–0.2 | Low-concern area (background soil)
Light blue | 0.2–0.4 | Moderate attention (non-critical leaf area)
Yellow | 0.4–0.6 | High attention (leaf edge texture)
Deep red | 0.6–1.0 | Highest attention (discriminative feature point)
Table 8. Comparison results with existing real-time weed detection systems.

System Name | Algorithm Architecture | Hardware | FPS | mAP@50 | F1-Score
Weed Identification System (WIS) | CNN | GTX Titan (6 GB) | 25 | 65.4 | 68.3
GCN-ResNet101 System | Graph Convolutional Network + ResNet101 | NVIDIA IGX Orin | 46 | 83.2 | 81.7
YOLO-SW Detection System | YOLO-SW | NVIDIA Jetson AGX Orin | 59 | 92.3 | 90.8
Table 9. NVIDIA Jetson AGX Orin deployment performance summary.

Indicator | YOLOv8 | YOLO-SW | Performance Difference
FPS | 71 | 59 | −17%
Power consumption (W) | 55 | 65 | +18%
GFLOPs | 8.2 | 19.9 | +142%
VRAM (GB) | 1.3 | 2.1 | +62%
mAP@50 | 88.3% | 92.3% | +4.5%
False alarm rate | 3.8% | 2.5% | −1.3%
Small target recall rate | 78.3% | 89.5% | +11.2%

