1. Introduction
Field weeds compete with crops for water and nutrients, significantly impacting crop growth [1]. In China, weed-related yield losses account for approximately 10–15% of total annual grain production. Rapeseed, the nation’s primary oil crop [2], is especially susceptible to weed damage. In winter rapeseed-growing areas such as the Yangtze River Basin, high temperatures and ample soil moisture at sowing create favorable conditions for weeds. After rapeseed is sown, weeds emerge rapidly, reaching a peak that competes intensively with rapeseed seedlings and leads to substantial yield reductions. Typical yield losses range from 10–20% [3], and severe weed infestations can reduce production by over 50%. The National Farmland Weed Investigation Team reports that weed damage affects approximately 46.9% of winter rapeseed fields in the Yangtze River Basin, with 22.3% experiencing moderate to severe damage. Effective weed control during the seedling stage is therefore crucial for minimizing these losses. However, accurately detecting weeds in crops under natural field conditions remains challenging due to factors such as inter- and intraspecies variability in weed characteristics (e.g., shape, size, color, and texture), weed–crop similarity, and changing field conditions (e.g., lighting and soil background).
Advancements in image acquisition technology, reduced hardware costs, and increased GPU computing power have facilitated deep-learning applications in agriculture. These applications encompass crop classification, pest and disease identification, fruit counting, plant nutrient content estimation, and weed identification, among others. Deep-learning-based methods have demonstrated considerable success in weed detection and classification, with convolutional neural networks (CNNs) supported by large datasets showing robustness to biological variability and diverse imaging conditions, resulting in more accurate and efficient weed control automation.
For instance, Wang et al. [4] developed a corn weed recognition method using multi-scale hierarchical features in CNNs, which extracts features from Gaussian pyramid layers and connects them to a multi-layer perceptron for pixel-level classification. Peng et al. [5] optimized the RPN network with a feature pyramid network, enhancing convolutional network performance for cotton and weed detection under natural lighting. Jiang et al. [6] proposed a Mask R-CNN-based method to address low segmentation accuracy in complex field conditions, while Fan et al. [7] combined data augmentation and Faster R-CNN with a VGG16 feature extraction network for improved weed recognition. Other researchers have leveraged lightweight CNN architectures for efficiency. For example, Dyrmann et al. [8] constructed a CNN with three convolutional layers and residual blocks, achieving 86.2% classification accuracy on over 10,000 images containing weed and crop species at early growth stages. Potena et al. [9] applied two CNNs to RGB and NIR images, achieving rapid, accurate weed–crop classification. In drone-based studies, Beeharry and Bassoo [10] demonstrated AlexNet’s 99.8% classification accuracy compared to less than 50% with standard ANN methods, emphasizing the advantages of deep-layered CNNs such as AlexNet in achieving efficient weed detection.
Further innovations include McCool et al.’s [11] combination of lightweight CNNs, which achieved over 90% accuracy at speeds exceeding 10.7 FPS, and You et al.’s [12] hybrid network with a dilated convolutional network and DropBlock, achieving 88% mean intersection over union (mIoU) on the Stuttgart and Bonn datasets. Milioto et al. [13] improved pixel segmentation accuracy to 98.16% for weeds and 95.17% for crops by enhancing the RGB input with HSV, Laplacian, and Canny edge representations. Graph convolutional networks (GCNs) have also been adapted for weed detection. Jiang et al. [14] integrated GCNs with ResNet101 features in a semi-supervised learning framework, achieving 96–99% accuracy on various weed species and outperforming AlexNet, VGG16, and ResNet-101. Similarly, Bah et al. [15] demonstrated the efficiency of CNNs in unsupervised training for spinach and bean weed detection, achieving AUC scores similar to those of supervised methods. Peng et al. [16] developed WeedDet, an enhanced RetinaNet-based model for rice field weed detection that improves detection speed through a lightweight feature pyramid and detection head. Zhang et al. [17] used the EM-YOLOv4-Tiny model for real-time weed classification in peanut fields, achieving detection in 10.4 ms per image. To handle the similar shape and scale of weeds and cotton seedlings in large fields, Fan et al. [18] added a weighted-fusion BiFPN module, CBAM, and a bilinear interpolation method to a Faster R-CNN baseline to address recognition in complex environments. In [19], a YOLOv8-based model was designed with a multi-scale feature fusion architecture at the Neck and a dynamic feature aggregation head, addressing the large number of network parameters and the inability to detect dynamically.
Utilizing drone imagery for weed detection presents a compelling advantage over other on-field or alternative imaging methods [20]. Drones, with their ability to hover at low altitudes, provide a unique perspective that captures high-resolution images with enhanced spatial detail [21]. The real-time data acquisition capabilities of drones facilitate more frequent monitoring, allowing for the timely identification of emerging weed threats. Additionally, drones are cost-effective and can be deployed rapidly, enabling swift responses to evolving weed dynamics. The comprehensive coverage, agility, and timely data acquisition make drone imagery a superior option for weed detection, contributing to the efficiency and precision of modern agricultural practices.
This study focuses on weed detection in winter rapeseed fields during the seedling stage, targeting four common weeds. To address challenges such as high weed density, occlusion, and size variability, we propose the STBNA-YOLOv5 weed detection model. To increase the feature extraction ability of the backbone network, a Swin Transformer encoder module is added. To make full use of the rich feature information extracted by the backbone, a BiFPN structure is used and a NAM attention mechanism module is embedded, providing a reliable front-end information source for the Prediction stage. To enhance the sensitivity of the detection head to crops and weeds, an Adaptively Spatial Feature Fusion (ASFF) module is introduced into the detection head.
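For readers who prefer code, the sketch below illustrates the core idea behind a Swin-style encoder block: multi-head self-attention restricted to local windows of a backbone feature map, followed by an MLP, both with residual connections. It is a minimal PyTorch illustration, not the authors' exact module: the class name, window size, and dimensions are illustrative, and shifted windows and relative position bias are omitted.

```python
import torch
import torch.nn as nn

def window_partition(x, ws):
    """Split a (B, H, W, C) map into (B*num_windows, ws*ws, C) token groups."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    """Inverse of window_partition, back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.reshape(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

class WindowAttentionBlock(nn.Module):
    """Simplified Swin-style encoder block: windowed MHSA + MLP, each with a residual."""
    def __init__(self, dim, num_heads=4, ws=7, mlp_ratio=4):
        super().__init__()
        self.ws = ws
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                      # x: (B, C, H, W), H and W divisible by ws
        B, C, H, W = x.shape
        x = x.permute(0, 2, 3, 1)              # to (B, H, W, C)
        shortcut = x
        w = window_partition(self.norm1(x), self.ws)
        w, _ = self.attn(w, w, w)              # self-attention within each local window only
        x = shortcut + window_reverse(w, self.ws, H, W)
        x = x + self.mlp(self.norm2(x))
        return x.permute(0, 3, 1, 2)           # back to (B, C, H, W)

# Example: a 28x28 backbone feature map with 96 channels and 7x7 windows.
block = WindowAttentionBlock(dim=96, num_heads=4, ws=7)
out = block(torch.randn(1, 96, 28, 28))        # -> torch.Size([1, 96, 28, 28])
```

Because attention is computed inside each window rather than across the whole feature map, the block adds contextual modeling at a cost that scales with the window size instead of the image size, which is why it can be inserted into a detection backbone without a prohibitive computational burden.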
3. Results and Discussion
3.1. Performance of Dataset Augmentation
To verify the actual effectiveness of the random occlusion method based on image pixels and the data augmentation method specific to weed classes in this model, comparative experiments were conducted using the original YOLOv5 model. These methods were evaluated alongside traditional data augmentation techniques (including horizontal translation, vertical rotation, and brightness enhancement). Since the canola recognition accuracy remains above 90% across all models, this study does not discuss canola-specific metrics. Instead, the performance indicators focus on weed detection, specifically Recallweed, Precisionweed, F1-scoreweed, and mAP@0.5. To ensure consistency in data quantity after augmentation, each data augmentation method was adjusted to produce 5000 images for the training set only, without affecting the validation or test sets. During model training, data were divided into training, validation, and test sets at a 7:2:1 ratio. The Precision–Recall (PR) curves and augmentation performance results across various metrics for both the original YOLOv5 with traditional data augmentation and the proposed random occlusion and weed-specific augmentation methods are shown in Figure 11 and Figure 12.
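Since the exact occlusion procedure is not spelled out in this section, the NumPy sketch below should be read as one plausible implementation rather than the authors' code: small pixel blocks inside annotated weed boxes are blanked out while the box labels are kept, and the extra copies are added to the training split only. The function name, patch sizes, and fill value are assumptions.

```python
import numpy as np

def random_occlude_weeds(image, weed_boxes, max_patches=3, patch_frac=0.2, fill=0, rng=None):
    """Return a copy of `image` with random pixel blocks occluded inside weed boxes.

    image:      H x W x 3 uint8 array
    weed_boxes: list of (x1, y1, x2, y2) boxes for weed instances only (canola boxes untouched)
    """
    rng = rng if rng is not None else np.random.default_rng()
    aug = image.copy()
    for (x1, y1, x2, y2) in weed_boxes:
        bw, bh = x2 - x1, y2 - y1
        for _ in range(int(rng.integers(1, max_patches + 1))):
            pw = max(1, int(bw * patch_frac * rng.random()))   # patch width
            ph = max(1, int(bh * patch_frac * rng.random()))   # patch height
            px = int(rng.integers(x1, max(x1 + 1, x2 - pw)))   # patch top-left x
            py = int(rng.integers(y1, max(y1 + 1, y2 - ph)))   # patch top-left y
            aug[py:py + ph, px:px + pw] = fill                 # occlude; the label is kept
    return aug

# Each occluded copy goes into the training set only, expanding the weed class
# without generating additional canola samples.
```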
Based on the experimental results, as shown in the figures, the proposed random occlusion method based on image pixels and the weed-specific data augmentation technique outperform traditional data augmentation methods in terms of weed detection metrics. Specifically, the proposed methods achieved a Recallweed of 51.3%, Precisionweed of 76%, F1-scoreweed of 61.2%, and mAP@0.5 of 84.7%. Compared to traditional data augmentation, these methods showed improvements in Precisionweed, F1-scoreweed, and mAP@0.5 by 2%, 1.4%, and 1.9%, respectively. While traditional data augmentation effectively increases the weed sample size, it also proportionally increases crop samples. This approach fails to address the deep network’s bias toward high-proportion target objects. In contrast, the proposed random occlusion and weed-specific augmentation methods focus specifically on expanding weed samples without increasing canola samples, thereby balancing the dataset. Additionally, random occlusion forces the network to extract deeper feature information from the target objects. These factors contribute to the superior performance of the proposed methods over traditional augmentation techniques in this model.
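As a quick consistency check, F1-scoreweed is the harmonic mean of Precisionweed and Recallweed, and the snippet below reproduces the reported 61.2% from the reported precision and recall (up to rounding of the inputs).

```python
precision_weed, recall_weed = 0.760, 0.513          # reported Precision_weed and Recall_weed
f1_weed = 2 * precision_weed * recall_weed / (precision_weed + recall_weed)
print(round(f1_weed, 3))                             # 0.613, matching the reported 61.2% up to input rounding
```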
3.2. Ablation Experiment
To validate and assess the effectiveness of the proposed Swin Transformer module, BiFPN feature fusion module, NAM attention mechanism, and Adaptively Spatial Feature Fusion (ASFF) module on YOLOv5, ablation experiments were conducted for each module. To ensure thorough validation, all models in the ablation experiments were trained using the dataset enhanced by the proposed data augmentation methods, with a total sample size of 5000 images. The dataset was divided into training, validation, and testing sets at a ratio of 7:2:1. The ablation experiment results are shown in Table 1. Configuration ① includes only the Swin Transformer module; ② adds the BiFPN module to ①; ③ further includes the NAM attention mechanism based on ②; and ④ incorporates the ASFF module in addition to ③. Each experiment was evaluated based on the Recallweed, F1-scoreweed, APweed, mAP@0.5, and FPS metrics.
Table 1 shows that the “YOLOv5 + Swin Transformer” configuration improved on the original YOLOv5 network in terms of Recallweed, F1-scoreweed, APweed, mAP@0.5, and FPS by 1%, 2.8%, 1.5%, and 0.7 FPS, respectively. These improvements in Recallweed, F1-scoreweed, APweed, and mAP@0.5 indicate that the Swin Transformer encoder block enhances YOLOv5’s contextual sensitivity, especially for detecting small weed targets, effectively narrowing the recognition gap between weeds and crops. This module also aids in capturing targets in complex field environments. Additionally, the increase in FPS suggests that the Swin Transformer encoder block has a lower computational burden than YOLOv5’s original CSP module, which is advantageous for field deployment. Comparing the “YOLOv5 + Swin Transformer + BiFPN” model with the original YOLOv5 network on the Recallweed, F1-scoreweed, APweed, and mAP@0.5 metrics, the configuration achieved respective increases of 4%, 5.4%, and 3%. By reusing the rich feature information extracted from the Backbone network, which now contains the Swin Transformer module, the BiFPN module gains access to more comprehensive information. The feature weight parameters introduced here allow the network to automatically learn the importance of different input features, suppressing irrelevant features to achieve multi-level feature fusion.
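The learnable fusion weights mentioned above can be made concrete with a minimal sketch of BiFPN-style fast normalized fusion: each input feature map gets a non-negative learnable scalar, the scalars are normalized to sum to roughly one, and the maps are blended accordingly. The class name and epsilon value are illustrative choices, not the authors' exact layer.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion of N same-shaped feature maps."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # one learnable scalar per input
        self.eps = eps

    def forward(self, feats):                  # feats: list of (B, C, H, W) tensors
        w = torch.relu(self.weights)           # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)           # normalize so the weights sum to ~1
        return sum(wi * f for wi, f in zip(w, feats))

# Example: fuse a top-down feature with a same-resolution lateral feature.
fuse = WeightedFusion(num_inputs=2)
p_td, p_lateral = torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40)
p_out = fuse([p_td, p_lateral])                # -> (1, 128, 40, 40)
```

During training the weights adapt so that more informative inputs dominate the fused map, which is the "automatic importance learning" described above.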
Moreover, with the addition of the NAM attention mechanism, which functions similarly to the weight factors in BiFPN, the spatial and semantic information for both weeds and crops is further enriched. The configuration “YOLOv5 + Swin Transformer + BiFPN + NAM” demonstrated improvements over the original model in Recallweed, F1-scoreweed, APweed, and mAP@0.5, with respective gains of 11%, 1.1%, 10.9%, and 5.2%, and only a slight decrease in FPS. Together, the Swin Transformer, BiFPN, and NAM modules provide a wealth of front-end information to the Prediction stage. Finally, after incorporating the ASFF module, the full model achieved 88.1% Recallweed, 64.4% F1-scoreweed, 82.5% APweed, and 90.8% mAP@0.5. Relative to the baseline YOLOv5, mAP@0.5 improved by about 6%, but this metric is influenced by the high recognition rate of crops, which remained over 90% for both the baseline and improved models; APweed, by contrast, increased by nearly 12%. The added modules thus bring significant gains in recognition accuracy with minimal sacrifice in detection efficiency, which is beneficial for field operations.
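For context, the core of adaptively spatial feature fusion can be sketched as follows: a 1×1 convolution per level predicts a spatial weight map, the maps are normalized across levels with a softmax, and the levels are blended pixel by pixel. The sketch below is a simplified illustration under stated assumptions: the three inputs are taken as already resized to a common shape and channel count, and ASFF's per-level compression and resizing convolutions are omitted.

```python
import torch
import torch.nn as nn

class SimpleASFF(nn.Module):
    """Adaptively spatial feature fusion over three same-shaped feature levels."""
    def __init__(self, channels):
        super().__init__()
        # One 1x1 conv per level predicts a single-channel spatial weight map.
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3)])

    def forward(self, feats):                   # feats: list of three (B, C, H, W) tensors
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1)
        weights = torch.softmax(logits, dim=1)  # (B, 3, H, W): weights sum to 1 at each pixel
        return sum(weights[:, i:i + 1] * f for i, f in enumerate(feats))

# Example: blend three detection-head features of the same shape.
asff = SimpleASFF(channels=128)
f1, f2, f3 = torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40)
fused = asff([f1, f2, f3])                      # -> (1, 128, 40, 40)
```

Because the blending weights vary spatially, each pixel of the fused map can favor the pyramid level whose receptive field best matches the object at that location, which is why this kind of fusion helps with the large size variability between weeds and canola.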
3.3. Comparative Experiment
To further validate the effectiveness of the proposed STBNA-YOLOv5 model, a comparative experiment was conducted using several object detection models: EfficientDet, SSD, Faster R-CNN, DETR, and YOLOv3, alongside the proposed model itself. All models were evaluated on the augmented dataset, which was split into training, validation, and test sets at a 7:2:1 ratio, using five metrics: Recall, Precision, F1-score, mAP@0.5, and FPS. The comparison results are presented in Table 2. To ensure fairness, all models were tested on the same dataset under identical conditions. This setup allows a direct and unbiased performance comparison across models, highlighting the specific advantages of STBNA-YOLOv5 in detection accuracy, recall, precision, and processing speed, and confirming its suitability and robustness for complex agricultural field scenarios.
SSD, as a single-stage object detector, achieves a high FPS but underperforms in small object detection. In our experiments, SSD’s Recallweed and Precisionweed were 8.1% and 13.4% lower than those of the proposed model, and its mAP was 7.1% lower than that of our improved YOLOv5. This is likely because SSD relies on lower-level feature maps, which limits its small object detection capability.
Faster R-CNN, a two-stage detector widely applied in agricultural object detection, comprises a region proposal stage followed by a feature extraction and classification stage. In our tests, Faster R-CNN achieved an mAP@0.5 of 81.9%. However, because of its separate proposal and detection stages, its FPS was lower than that of the other algorithms, which makes it less suitable for real-time field deployment.
DETR, a recent end-to-end model, shows promise but has a large architecture and slow convergence, requiring extended training time. It underperformed our model in Recall, Precision, and mAP@0.5, suggesting it may not be ideal for real-time applications in our field scenarios.
YOLOv3, the classic YOLO model from which YOLOv5 evolved, uses a simple FPN structure. While YOLOv3 balances acceptable detection accuracy with fast inference speed, its single FPN architecture limits cross-scale feature fusion. Our model’s BiFPN structure, which enhances multi-scale feature fusion, improves mAP by nearly 7%, albeit with a slight trade-off in real-time performance.
In comparison with SSD, Faster R-CNN, YOLOv3, and DETR, STBNA-YOLOv5 excels in Precisionweed, F1-scoreweed, and mAP@0.5, proving the effectiveness of our proposed modifications on the rapeseed-weed dataset. This makes our approach the most effective model in accurately detecting rapeseed and weeds among the models tested.
3.4. Experiments on Weed Multi-Class Detection
To better demonstrate the superiority of the proposed algorithm, we selected several images from the test set to visually showcase the performance of the STBNA-YOLOv5 network in the multi-class detection of weeds in canola fields. As shown in Figure 13, detection results are presented under various field conditions: sunny, cloudy, unobstructed, and obstructed. The model clearly identifies both canola and weed species. The confidence levels for canola consistently exceed 0.7, while those for small nettle and purslane are above 0.8. In unobstructed scenarios, the confidence level for lamb’s quarters exceeds 0.8; even when partially obstructed, the confidence level remains at 0.66, with correct detection results. Additionally, in obstructed conditions, barnyard grass is accurately detected with a confidence level of 0.86. By incorporating four enhancements (the Swin Transformer encoder block, BiFPN, NAM, and ASFF), the STBNA-YOLOv5 model exhibits improved object perception capabilities. This effectively reduces the likelihood of false negatives, false positives, and duplicate detections, making the model well suited for the real-time detection of weeds and canola during the winter canola seedling stage.
Figure 14 and Figure 15 illustrate the weed detection results obtained using the YOLOv5 and STBNA-YOLOv5 models, respectively. As shown in Figure 14a, the YOLOv5 model failed to detect purslane, resulting in a missed detection; in contrast, the STBNA-YOLOv5 model successfully identified all instances of purslane. Figure 14a also reveals that the YOLOv5 model incorrectly classified a specimen of nutsedge as barnyard grass, leading to a false positive, whereas the STBNA-YOLOv5 model correctly identified the nutsedge.
4. Conclusions
This study established a dataset containing both weeds and rapeseed. While rapeseed samples are relatively easy to obtain, the diversity of weed species can lead to a significant data imbalance. To address this issue, we introduced a random occlusion method based on image pixel points, along with an image data augmentation technique specifically for weed categories. Building on the original YOLOv5 algorithm, we improved the Backbone, Neck, and Prediction components of the model. To enhance the feature extraction capability of the Backbone, we incorporated a Swin Transformer encoder block. We utilized a BiFPN structure at the Neck to fully leverage the rich feature information extracted by the Backbone, embedding a NAM attention mechanism to further enhance performance. Additionally, we introduced the Adaptively Spatial Feature Fusion (ASFF) module into the Prediction component. The modified YOLOv5 algorithm underwent ablation experiments to analyze the impact of the various modules, including the image augmentation strategies, the Swin Transformer encoder block, the BiFPN module, the NAM module, and the ASFF module. The results indicated that the STBNA-YOLOv5 model excelled in the Recallweed, APweed, and mAP@0.5 metrics. When compared with models such as SSD, Faster R-CNN, YOLOv3, DETR, and EfficientDet, the STBNA-YOLOv5 model demonstrated superior F1-scoreweed, APweed, and mAP@0.5, with values of 0.644, 0.825, and 0.908, respectively. Furthermore, we presented detection results under various field conditions, including sunny and overcast weather, as well as scenarios with and without occlusions. The model consistently demonstrated its ability to accurately identify rapeseed and specific types of weeds. Future work will focus on developing a system that incorporates the trained weed detection model.