Article

An Efficient Group Convolution and Feature Fusion Method for Weed Detection

by Chaowen Chen, Ying Zang, Jinkang Jiao, Daoqing Yan, Zhuorong Fan, Zijian Cui and Minghua Zhang

1 College of Engineering, South China Agricultural University, Guangzhou 510642, China
2 Key Laboratory of Key Technology on Agricultural Machine and Equipment (South China Agricultural University), Ministry of Education, Guangzhou 510642, China
3 State Key Laboratory of Agricultural Equipment Technology, Beijing 100083, China
4 Huangpu Innovation Research Institute of SCAU, Guangzhou 510715, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(1), 37; https://doi.org/10.3390/agriculture15010037
Submission received: 28 November 2024 / Revised: 19 December 2024 / Accepted: 24 December 2024 / Published: 27 December 2024
(This article belongs to the Special Issue Intelligent Agricultural Machinery Design for Smart Farming)

Abstract

Weed detection is a crucial step in achieving intelligent weeding for vegetables. Currently, research on vegetable weed detection technology is relatively limited, and existing detection methods still face challenges due to complex natural conditions, resulting in low detection accuracy and efficiency. This paper proposes the YOLOv8-EGC-Fusion (YEF) model, an enhancement based on the YOLOv8 model, to address these challenges. This model introduces plug-and-play modules: (1) The Efficient Group Convolution (EGC) module leverages convolution kernels of various sizes combined with group convolution techniques to significantly reduce computational cost. Integrating this EGC module with the C2f module creates the C2f-EGC module, strengthening the model’s capacity to grasp local contextual information. (2) The Group Context Anchor Attention (GCAA) module strengthens the model’s capacity to capture long-range contextual information, contributing to improved feature comprehension. (3) The GCAA-Fusion module effectively merges multi-scale features, addressing shallow feature loss and preserving critical information. Leveraging GCAA-Fusion and PAFPN, we developed an Adaptive Feature Fusion (AFF) feature pyramid structure that amplifies the model’s feature extraction capabilities. To ensure effective evaluation, we collected a diverse dataset of weed images from various vegetable fields. A series of comparative experiments was conducted to verify the detection effectiveness of the YEF model. The results show that the YEF model outperforms the original YOLOv8 model, Faster R-CNN, RetinaNet, TOOD, RTMDet, and YOLOv5 in detection performance. The detection metrics achieved by the YEF model are as follows: precision of 0.904, recall of 0.88, F1 score of 0.891, and mAP0.5 of 0.929. In conclusion, the YEF model demonstrates high detection accuracy for vegetable and weed identification, meeting the requirements for precise detection.

1. Introduction

Vegetables hold a significant position in global agricultural production. In recent years, China’s vegetable industry has seen substantial growth, accounting for 35% of per capita food consumption. While vegetables play a crucial role in sustaining people’s lives, they also greatly enhance the economic well-being of growers [1,2,3]. As a major component of agricultural ecosystems, weeds compete with vegetables for growth resources [4], severely reducing both vegetable yield and quality. In vegetable production, uncontrolled weeds can cause yield losses ranging from 45% to 95% [5]. Therefore, effective weed management plays a vital role in vegetable field productivity.
With advancements in agricultural technology, intelligent weeding technologies have made it possible to precisely identify and manage weeds in fields, reducing management costs and minimizing the use of chemical herbicides. However, complex field environments increase the difficulty of accurate weed recognition [6,7,8]. Currently, computer vision, encompassing traditional image processing, machine learning, and deep learning, is the mainstream technology for weed recognition. Traditional computer vision methods can utilize the morphological characteristics of vegetables and weeds for detection, but they are time-consuming and often fail to meet the demands of agricultural production [9]. Classical machine learning methods, such as Support Vector Machines (SVMs) and decision trees, enable automatic classification and decision-making but rely on manually designed features. Other techniques, such as hyperspectral imaging and remote sensing, have also been explored for weed detection, but they often face limitations in cost, data processing complexity, and real-time applicability in agricultural scenarios. In comparison, deep learning-based object detection algorithms provide a more efficient solution for weed detection tasks [10,11].
Deep learning object detection methods can be broadly categorized into one-stage and two-stage models. Two-stage models offer higher recognition accuracy but are less suited for real-time applications [12,13,14]. In contrast, as a representative of one-stage models, YOLO maintains high efficiency and real-time performance while providing sufficient detection accuracy, making it more suitable for real-time weed detection in vegetable fields [15]. Consequently, the YOLO model is employed as the foundational framework for research [16].
In recent years, various versions of the YOLO model have been widely applied in agricultural production. Wu et al. developed a model for detecting cabbage weeds based on YOLOv4, demonstrating improved accuracy for small weed targets and better adaptability across different crops. The model attained an mAP0.5 of 85.2% [17]. Ying et al. modified YOLOv4 for weed detection in carrot seedlings by incorporating the MobileNetV3-Small Backbone and depth-wise separable convolution. The modification resulted in an mAP0.5 of 89.11% and a model weight of 159 MB [18]. Hu et al. developed a multi-module YOLOV-L model by combining Efficient Channel Attention (ECA) and Coordinate Attention (CA) mechanisms into YOLOv7. These improvements enhanced its performance, achieving an mAP0.5 of 97.1%, a precision of 97.5%, and a model weight of 18.4 MB [19]. Solimani et al. introduced a tomato plant shape detection model based on YOLOv8, which enables real-time monitoring and evaluation of tomato plant shapes and growth conditions. This model achieved an mAP0.5 of 65.08% [20].
While these studies have demonstrated progress in weed and plant detection, they primarily focus on improving recognition performance. This often led to increased complexity and a heavy computing load in image processing, limiting their suitability for embedded systems in agricultural scenarios. Additionally, the cited works highlight challenges such as limited generalization in real-world conditions. The biological similarity between vegetables and weeds, irregular weed contours, and subtle differences in texture and shape under varying field conditions pose significant challenges to recognition performance [21,22].
The YOLOv8 model exhibits stronger multi-scale detection and generalization capabilities, effectively and accurately identifying weeds and crops in complex backgrounds, making it particularly suitable for weed detection in vegetable fields. Although YOLOv9 and YOLOv10 have seen further improvements, they have yet to be widely validated and are less commonly applied [23,24].
This paper focuses on optimizing the YOLOv8 model to address these issues and better capture the characteristics of weeds. New feature fusion and convolution modules are proposed. The specific improvements are as follows:
(1)
To optimize convolution operations, a plug-and-play Efficient Group Convolution (EGC) module is proposed. This module enhances processing for high-channel feature maps by reducing computational complexity while maintaining accuracy. Furthermore, the EGC serves as the foundation for the new C2f-EGC module, which further optimizes feature extraction.
(2)
An innovative attention-based feature extraction module, Group Context Anchor Attention (GCAA), is designed to capture richer contextual information across features. Based on this module, the GCAA-Fusion module is developed to blend high and low feature layers for improved weed detection accuracy, refining the traditional feature pyramid structure.
(3)
The YOLOv8-EGC-Fusion model is introduced, which builds upon YOLOv8 with an enhanced network structure to boost feature extraction capabilities, particularly for recognizing weeds in complex backgrounds.

2. Materials and Methods

2.1. Data Acquisition and Processing

2.1.1. Vegetable Weed Dataset Collection

Data collection for this study took place in mid-May, late July, and mid-December 2023 at the Cencun Research Experimental Base of South China Agricultural University, located in Guangzhou, Guangdong Province, China (latitude 23° N, longitude 113° E), as shown in Figure 1. Images for the dataset were taken multiple times in various field conditions using iPhone XS and iPhone 15 (Apple, Cupertino, CA, USA) smartphones (12-megapixel cameras). The images were captured at a height of 90 cm from the ground to simulate the typical working height of a weed control robot, with a vertical shooting angle and a resolution of 3024 × 4032 pixels. The dataset includes various types and growth stages of both vegetables and weeds. The vegetable growth stages primarily ranged from the early to middle stages, with species including Lactuca sativa, Brassica rapa subsp. pekinensis, Raphanus sativus, Cichorium endivia L., and Brassica juncea, as shown in Figure 2b. Weed species included Artemisia argyi H. Lév. & Vaniot, Chenopodium album L., Amaranthus retroflexus, Oxalis corniculata L., and Eleusine indica (L.) Gaertn., as shown in Figure 2a. The images also encompass diverse lighting conditions (such as sunny, cloudy, and post-rain), complex environments, and instances of leaf occlusion.

2.1.2. Image Preprocessing

The dataset initially comprised 1300 images. To simulate various environmental conditions for vegetable growth, data augmentation techniques (brightness adjustment, rotation, color variation, and sharpness enhancement) were employed, expanding the dataset to 2750 images, as shown in Figure 3. After data cleaning, 2666 images were retained for the final dataset. This dataset was then divided into 2155 images for the training set, 244 for the validation set, and 267 for the test set. Annotation of the images was performed manually using LabelImg (version 1.8.6), an open-source annotation tool available on GitHub, with bounding boxes created for 11,338 vegetable targets and 37,531 weed targets.
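As an illustration of the augmentation pipeline described above, the following is a minimal sketch using torchvision; the specific parameter values and file names are assumptions rather than settings reported by the authors, and bounding-box annotations would additionally need to be transformed for geometric augmentations such as rotation.

```python
# Minimal augmentation sketch: brightness, rotation, colour variation, and
# sharpness enhancement, as described in Section 2.1.2. Parameter values and
# file names are illustrative assumptions.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, saturation=0.4, hue=0.05),  # brightness / colour variation
    transforms.RandomRotation(degrees=15),                             # small rotations (boxes must be rotated too)
    transforms.RandomAdjustSharpness(sharpness_factor=2.0, p=0.5),     # sharpness enhancement
])

img = Image.open("field_sample.jpg")      # hypothetical input image
augmented = augment(img)
augmented.save("field_sample_aug.jpg")
```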
In the field, weed sizes and spatial distributions are uneven, which poses a challenge for model generalization. This imbalance necessitates capturing long-range and local contextual information to improve the model’s ability to accurately detect large and small targets simultaneously.

2.2. Weed Detection Model Based on Improved YOLOv8

The YOLOv8 architecture consists of three main components: the Backbone, Neck, and Detection Head. The Backbone is responsible for extracting fundamental features from the input image. The Neck then integrates and refines these multi-scale features through the Feature Pyramid Network (FPN), enhancing the model’s capability to extract information across different spatial resolutions. Finally, the Detection Head predicts object categories and bounding boxes from the fused features at multiple scales.
Considering the limited computational resources in agricultural environments, the YOLOv8n (Nano) version, designed specifically for lightweight applications, is the basis for this study. This paper develops the YOLOv8-EGC-Fusion (YEF) model based on the YOLOv8n framework. The model architecture is illustrated in Figure 4. Key modifications include the integration of Efficient Group Convolution (EGC) into the Backbone, improving the extraction of multi-scale features and decreasing the model’s parameter count. Additionally, a GCAA-Fusion module is designed to improve the capture of low-level features by optimizing the existing feature pyramid structure. The detailed design and performance optimization of the YEF model are further discussed in this paper.

2.2.1. C2f-EGC Module

(1)
Efficient Group Convolution
In the initial YOLOv8 architecture, the convolutional module used in the feature extraction phase applies different convolutional kernels to capture useful features from the input data. This module primarily comprises convolutional layers created by conv2d, activation functions, and batch normalization layers [25]. The computational complexity of the convolutional module determines the data processing speed, and the parameter count Pconv is calculated as follows:
P_{conv} = C_{in} \times C_{out} \times K_h \times K_w \quad (1)
where Cin represents the number of input channels. Cout represents the number of output channels. Kh × Kw represents the height and width of the convolutional kernel.
Each output channel in the convolutional module performs convolution operations over all input channels. In the convolutional modules of the YOLOv8 model, the channel dimensions are 128, 256, 512, and 1024, meaning that both computational cost and parameter count grow as the input and output channels increase. This leads to a greater demand for computational resources and longer processing times [26]. Additionally, the convolutional kernel size in standard convolutions is fixed, which limits robustness against unknown geometric transformations, thus affecting the model’s generalization capability [27].
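As a concrete illustration of Equation (1), a single 3 × 3 convolution mapping 256 input channels to 512 output channels (channel widths taken from the list above, used here purely for illustration) already requires

P_{conv} = 256 \times 512 \times 3 \times 3 = 1{,}179{,}648

parameters, before any normalization or bias terms are counted.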
This paper proposes the EGC module to address this issue. This module utilizes grouped convolutions with kernels of varying sizes to decrease the parameter count while also enhancing the capacity to capture multi-scale spatial features and improving generalization capability. The structure of the module is shown in Figure 5. Let the input and output of the EGC module be denoted as F_{in} \in \mathbb{R}^{C \times H \times W} and F_{out} \in \mathbb{R}^{C \times H \times W}, where C is the number of channels, and H and W represent the height and width of the feature map. The input feature map F_{in} is split along the channel dimension into two paths: F_{cheap} \in \mathbb{R}^{C/2 \times H \times W} and F_{group} \in \mathbb{R}^{C/2 \times H \times W}. After splitting, the dimensions of the feature maps are as follows:
F_{cheap} = F_{in}[\,{:}\tfrac{1}{2}C,\; {:}\,], \qquad F_{group} = F_{in}[\,\tfrac{1}{2}C{:}\,,\; {:}\,] \quad (2)
where “:” is used to indicate slicing operations across dimensions.
In the EGC, one path (F_{cheap}) performs a simple operation to retain the original features, reducing redundancy in the feature mapping, as shown in Figure 6. The other path (F_{group}) undergoes group convolution, where F_{group} is split into two groups: F_{group1}, F_{group2} \in \mathbb{R}^{C/4 \times H \times W}. These are used as input for the group convolution, which, after processing and feature fusion, generates F_2 \in \mathbb{R}^{C/2 \times H \times W}. Finally, a pointwise convolution is applied to merge the feature channels from both paths, resulting in the fused output:
F_{out} = Conv_{1 \times 1}(Concat(F_{cheap}, F_2)) \quad (3)
By applying the EGC module, the number of input and output channels in each grouped convolution is reduced to one-quarter of the original size, which decreases the parameter count accordingly. The parameter count of the EGC module is given by Equation (4):
P_{EGC} = \sum_{i=1}^{G} C_{min\_in} \times C_{min\_out} \times K_i \times K_j \quad (4)
where C_{min\_in} and C_{min\_out} represent the numbers of input and output channels after the split operation, G denotes the number of groups, and K_i and K_j denote the width and height of the grouped convolution kernels, respectively.
The receptive field for recognizing image features is directly impacted by the convolution kernel’s size. A well-chosen convolution kernel can improve both the model’s performance and accuracy. Table 1 demonstrates the parameter and computation costs for different kernel sizes. By combining 1 × 1 and 3 × 3 kernels, it is possible to maintain model performance while improving efficiency. Therefore, this study adopts the [1, 3] kernel combination to optimize model performance.
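The following is a minimal PyTorch sketch of the EGC idea described above, using the [1, 3] kernel combination; the placement of batch normalization and the SiLU activation, and the bias settings, are assumptions rather than details taken from the paper.

```python
# Sketch of the Efficient Group Convolution (EGC): channel split into a cheap
# identity path and a grouped path with 1x1 and 3x3 kernels, fused by a
# pointwise convolution (Eqs. (2)-(3)).
import torch
import torch.nn as nn

class EGC(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        quarter = channels // 4
        # Group path: two C/4 groups processed with 1x1 and 3x3 kernels.
        self.branch1 = nn.Conv2d(quarter, quarter, kernel_size=1, bias=False)
        self.branch3 = nn.Conv2d(quarter, quarter, kernel_size=3, padding=1, bias=False)
        # Pointwise convolution that fuses the cheap path and the group path.
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        half = x.shape[1] // 2
        f_cheap, f_group = x[:, :half], x[:, half:]        # Eq. (2): channel split
        g1, g2 = torch.chunk(f_group, 2, dim=1)            # two C/4 groups
        f2 = torch.cat([self.branch1(g1), self.branch3(g2)], dim=1)
        out = self.fuse(torch.cat([f_cheap, f2], dim=1))   # Eq. (3): pointwise fusion
        return self.act(self.bn(out))

# Quick shape check
x = torch.randn(1, 256, 40, 40)
print(EGC(256)(x).shape)   # torch.Size([1, 256, 40, 40])
```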
(2)
C2f-EGC Module
In the YOLOv8 model, the C2f module primarily extracts higher-level feature representations [28]. Its core comprises convolution layers, activation functions, and Bottleneck modules [29]. This study replaces the second standard convolution module in the Bottleneck with the EGC module, forming the EGC-Bottleneck module (as shown in Figure 7b) to promote the exchange of multi-scale information within the model and capture local contextual information. By stacking multiple EGC-Bottleneck modules, a new network structure, C2f-EGC, is constructed (as shown in Figure 7a).
In convolution layers with high channel counts (128 and above), the C2f-EGC module replaces the original C2f module in this study. For layers with fewer channels (less than 128), the impact of grouped convolution is minimal, so the original C2f module is retained. Table 2 compares the parameter counts between the C2f and C2f-EGC modules for high-channel layers. The results show that the C2f-EGC module reduces the parameter count by more than 20% without compromising feature extraction capability.
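Building on the EGC sketch above, the replacement described here can be expressed as follows; the exact YOLOv8 Bottleneck configuration (kernel sizes, expansion ratio) is simplified, so this is an illustrative sketch rather than the authors' implementation.

```python
# Sketch of the EGC-Bottleneck: the second standard convolution of the YOLOv8
# Bottleneck is replaced by the EGC block defined above.
import torch.nn as nn

class EGCBottleneck(nn.Module):
    def __init__(self, channels: int, shortcut: bool = True):
        super().__init__()
        self.cv1 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        self.cv2 = EGC(channels)   # second convolution replaced by EGC
        self.add = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y   # residual shortcut, as in the original Bottleneck
```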

2.2.2. Group Context Anchor Attention

In object detection, local information can be impacted by local blurring or noise, leading to decreased model performance. In vegetable and weed recognition, vegetables are typically planted in an organized manner, while weeds are randomly distributed [30,31]. This paper introduces the Group Context Anchor Attention (GCAA) mechanism, consisting primarily of pooling layers, EGC modules, depth-wise separable convolutions, and activation functions. The GCAA is intended to enhance the spatial feature representation for both weeds and vegetables by increasing the model’s ability to capture long-range contextual information.
Figure 8 illustrates the GCAA model’s structure. First, an average pooling operation is applied to local regions, which averages out the local features, extracts global features, and reduces the dimensionality of the feature map. The local region’s features are then obtained using the EGC convolution procedure. The operation formula is as follows:
X_{pool} = EGC_{i \times i}(P_{avg}(X_{in})) \quad (5)
where EGC_{i \times i} represents the convolution operation of the EGC module with i \in \{1, 3\}, indicating the [1, 3] convolution kernel combination; P_{avg} represents the average pooling operation; and X_{in} represents the input feature map.
Then, depth-wise separable convolution is applied to capture long-range dependencies in both horizontal and vertical directions. Large convolution kernels of size 1 × 11 and 11 × 1 are used to perform convolutions along the vertical and horizontal axes. The operation is formulated as:
X_w = Conv_{1 \times 11}(X_{pool}) \quad (6)
X_h = Conv_{11 \times 1}(X_w) \quad (7)
where X_w and X_h represent the outputs of the 1 × 11 and 11 × 1 convolutions, respectively, and Conv represents the standard convolution operation.
Finally, a pointwise convolution is used to integrate and compress the features extracted from the separable convolution (X_h) along the channel dimension. The sigmoid activation function is then applied as a nonlinear transformation, generating the output feature X_{out} as shown in Equation (8). Constraining the output feature values to the range between 0 and 1 is essential for the feature fusion that follows.
X_{out} = \sigma(Conv_{1 \times 1}(X_h)) \quad (8)
where σ denotes the sigmoid function.
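A hedged PyTorch sketch of the GCAA block, following Equations (5)–(8) and reusing the EGC sketch from Section 2.2.1; the pooling window (7 × 7, stride 1) and the depth-wise grouping of the 1 × 11 and 11 × 1 convolutions are assumptions based on the text, not confirmed implementation details.

```python
# Sketch of Group Context Anchor Attention (GCAA), Eqs. (5)-(8): local average
# pooling, an EGC block, depth-wise 1x11 / 11x1 convolutions for long-range
# context, then a pointwise convolution with a sigmoid gate.
import torch
import torch.nn as nn

class GCAA(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=7, stride=1, padding=3)   # local average pooling (size assumed)
        self.egc = EGC(channels)                                       # EGC with the [1, 3] kernels (see above)
        self.conv_w = nn.Conv2d(channels, channels, (1, 11), padding=(0, 5), groups=channels)
        self.conv_h = nn.Conv2d(channels, channels, (11, 1), padding=(5, 0), groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        x_pool = self.egc(self.pool(x))        # Eq. (5)
        x_w = self.conv_w(x_pool)              # Eq. (6): horizontal context
        x_h = self.conv_h(x_w)                 # Eq. (7): vertical context
        return torch.sigmoid(self.pw(x_h))     # Eq. (8): attention weights in (0, 1)
```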

2.2.3. GCAA-Based Feature-Level Fusions Model

Shallow feature maps contain rich low-level information, such as edges and textures, which can effectively distinguish subtle differences between vegetables and weeds. However, during the object detection process, consecutive convolution and pooling operations reduce the resolution of deep feature maps, resulting in the loss of shallow information. Deep networks primarily extract abstract features such as object categories, negatively impacting the recognition of small objects and targets with similar shapes, as illustrated in Figure 9 [32,33].
We assessed the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) of images at various resolutions, as defined in Equations (9) and (10), to better examine how image resolution affects feature extraction and object detection accuracy [34].
PSNR = 10 \log_{10}\left(\frac{MAX^2}{MSE}\right) \quad (9)
where MAX is the maximum pixel value of the image, and MSE is the Mean Squared Error between two images. A higher PSNR value indicates better image quality.
SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \quad (10)
where \mu_x and \mu_y are the means of images x and y, \sigma_x^2 and \sigma_y^2 are their variances, \sigma_{xy} is their covariance, and C_1 and C_2 are constants that stabilize the division. SSIM ranges from 0 to 1, with higher scores indicating greater similarity.
The results are summarized in Table 3. As the image resolution decreases, PSNR and SSIM values decline, indicating a loss of information and structural similarity. Lower-resolution images lose finer details, which potentially degrades deep networks’ ability to distinguish objects.
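The two metrics can be reproduced with a few lines of Python; the file name and resolutions below are illustrative only, and scikit-image's structural_similarity is used for SSIM.

```python
# Illustrative computation of PSNR (Eq. 9) and SSIM (Eq. 10) between an image
# and a downsampled-then-upsampled copy, mirroring the resolution study in
# Table 3. The file name and sizes are assumptions.
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

orig = Image.open("field_sample.jpg").convert("L").resize((320, 320))
low = orig.resize((80, 80)).resize((320, 320))        # simulate resolution loss
a, b = np.array(orig), np.array(low)
print("PSNR:", psnr(a, b))
print("SSIM:", structural_similarity(a, b, data_range=255))
```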
Feature enhancement is key to solving the problem of feature disappearance. In previous object detection algorithms, attention mechanisms are often used to highlight key features; this is effective but still faces the challenge of balancing the preservation of low-level details with the capture of high-level contextual information.
To address this issue, this paper introduces a feature fusion module based on GCAA (GCAA-Fusion). The module adaptively merges shallow and deep feature maps, enhancing both feature preservation and gradient backpropagation. As shown in Figure 10, the low-level feature map F_{low} \in \mathbb{R}^{C \times H \times W} is combined with the high-level feature map F_{high} \in \mathbb{R}^{C \times H \times W} through simple addition. The combined features are then passed through the GCAA attention module to generate an initial attention map W_A \in \mathbb{R}^{C \times H \times W}, integrating long-range contextual information with detailed features, as expressed in Equation (11).
W_A = GCAA(F_{low} \oplus F_{high}) \quad (11)
where \oplus represents element-wise summation.
To obtain a more precise saliency feature map, the initial attention map W_A is first concatenated with the summed feature map to form W_{AC}. Then, a channel shuffle operation is applied to rearrange the channels of W_{AC} alternately. Finally, a 7 × 7 convolution followed by an activation function produces the feature weights W. The computation process is shown in Equations (12) and (13).
W_{AC} = Concat(W_A, F_{low} \oplus F_{high}) \quad (12)
W = \sigma(Conv_{7 \times 7}(CS(W_{AC}))) \quad (13)
where CS(·) refers to the channel shuffle operation and Conv_{7 \times 7} denotes a convolution with a 7 × 7 kernel.
The precise feature maps generated through weighted summation are then integrated, with skip connections introduced to enhance the input features. This helps mitigate the vanishing gradient problem and simplifies training. Given that shallow and deep features complement each other, the generated weight W is applied to one feature map, while the other is weighted by 1 − W [33]. Based on this, the fused features are mapped through a 1 × 1 convolution layer to obtain the final output, as given in Equation (14).
F_{CAAE} = Conv_{1 \times 1}\left( F_{low} \otimes W \oplus F_{high} \otimes (1 - W) \oplus F_{low} \oplus F_{high} \right) \quad (14)
where \otimes represents element-wise multiplication.
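A sketch of GCAA-Fusion following Equations (11)–(14), reusing the GCAA sketch above; the assumption here is that the 7 × 7 convolution maps the concatenated 2C channels back to C channels so that the weight W broadcasts per channel in Equation (14).

```python
# Sketch of GCAA-Fusion: sum the shallow and deep maps, compute an attention
# map with GCAA, refine it via concatenation, channel shuffle and a 7x7
# convolution, then perform the weighted fusion with a skip connection.
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class GCAAFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gcaa = GCAA(channels)                             # attention block from Section 2.2.2
        self.conv7 = nn.Conv2d(2 * channels, channels, 7, padding=3)
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, f_low, f_high):
        s = f_low + f_high                                     # element-wise sum
        w_a = self.gcaa(s)                                     # Eq. (11)
        w_ac = torch.cat([w_a, s], dim=1)                      # Eq. (12)
        w = torch.sigmoid(self.conv7(channel_shuffle(w_ac)))   # Eq. (13)
        fused = f_low * w + f_high * (1 - w) + s               # weighted fusion + skip connection
        return self.out(fused)                                 # Eq. (14)
```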

2.2.4. Adaptive Feature Fusion (AFF)

The Feature Pyramid is crucial in object detection for capturing multi-scale features. Figure 11 outlines different pyramid structures. In Figure 11a, a single feature map is used for prediction, limiting the ability to exploit multi-scale information. Figure 11b presents an image pyramid approach where feature maps are generated at each scale, but this incurs a high computational cost [35]. Figure 11c enhances detection performance by utilizing multi-layer feature extraction [36], though it may lack precision in capturing fine details. Figure 11d focuses on resolving the multi-scale challenge in object detection while reducing computational complexity, although efficiency could still be optimized [37]. Improving feature extraction methods can substantially enhance the network’s detection accuracy. YOLOv8 employs the Path Aggregation Feature Pyramid Network (PAFPN) to facilitate information fusion across different layers, but deeper network layers can lead to feature loss.
This paper introduces an AFF structure built upon the PAFPN architecture to address this issue. As shown in Figure 4 and Figure 12, the orange dashed lines correspond to the 9th, 6th, and 4th layers, and the output layers N5, N4, and N3 correspond to the 15th, 18th, and 21st layers, respectively. Three GCAA-Fusion modules are incorporated before each Detection Head, with their inputs sourced from the 4th and 15th layers, the 6th and 18th layers, and the 9th and 21st layers. The AFF structure adaptively merges low-level and high-level features along the channel dimension, increasing the model’s ability to preserve features and improving the precision and efficiency of object detection. This results in greater robustness and adaptability across various tasks.
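The wiring can be illustrated with placeholder tensors standing in for the Backbone and PAFPN outputs; the channel widths 64/128/256 are the YOLOv8n defaults at the three detection scales and are an assumption here, as is reuse of the GCAAFusion sketch above.

```python
# Placeholder illustration of where the three GCAA-Fusion modules sit in the
# AFF neck: each fuses a shallow Backbone feature with the corresponding PAFPN
# output before a Detection Head. Shapes and channel widths are assumptions.
import torch

p3, n3_in = torch.randn(1, 64, 80, 80), torch.randn(1, 64, 80, 80)     # layers 4 and 15
p4, n4_in = torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40)   # layers 6 and 18
p5, n5_in = torch.randn(1, 256, 20, 20), torch.randn(1, 256, 20, 20)   # layers 9 and 21

n3 = GCAAFusion(64)(p3, n3_in)     # fused map for the small-object head
n4 = GCAAFusion(128)(p4, n4_in)
n5 = GCAAFusion(256)(p5, n5_in)    # fused map for the large-object head
# n3, n4, n5 then feed the three Detection Heads.
```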

2.3. Experimental Platform and Parameter Configuration

The experimental platform operates on a Windows 10 system, equipped with an NVIDIA Quadro RTX 5000 GPU (NVIDIA, Santa Clara, CA, USA), an Intel(R) Xeon(R) Gold 6248R @ 3.00 GHz CPU (Intel, Santa Clara, CA, USA), and 64.0 GB of RAM (Kingston, Fountain Valley, CA, USA). PyTorch is used as the deep learning framework, with CUDA 11.6 serving as the parallel computing platform and programming model. Python 3.9 is employed as the programming language.
Most of the experiment’s parameters retain YOLOv8’s default settings. The batch size is 32, the number of workers is 5, and the input image size is fixed at 640 × 640 pixels. The weight decay is 0.0005, the momentum is 0.937, and the learning rate is 0.01. Table 4 provides detailed information about the hyperparameter settings.
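For reference, a training run with the Table 4 hyperparameters could be launched as follows, assuming the Ultralytics YOLOv8 API; the model and dataset YAML file names are placeholders, not files provided by the paper.

```python
# Hedged sketch of a training run with the hyperparameters in Table 4.
from ultralytics import YOLO

model = YOLO("yolov8n-egc-fusion.yaml")   # hypothetical YEF model definition
model.train(
    data="vegetable_weed.yaml",           # hypothetical dataset config
    epochs=400,
    imgsz=640,
    batch=32,
    workers=5,
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
)
```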

2.4. Metric

The evaluation of network performance primarily relies on mAP (Mean Average Precision) during the training process and the model’s performance on the validation set after training. To assess the model’s performance, five key metrics are used in this paper: precision, recall, mAP, F1 score, and parameters. The corresponding calculation formulas are as follows:
Precision = \frac{True\ Positives}{True\ Positives + False\ Positives} \quad (15)
Recall = \frac{True\ Positives}{True\ Positives + False\ Negatives} \quad (16)
F1\ score = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (17)
where True Positives represent the quantity of actual weed samples accurately classified by the model as weeds. False Positives represent the number of actual non-weed samples mistakenly identified by the model as weeds. False Negatives represent the number of actual weed samples incorrectly identified by the model as non-weeds.
mAP is a crucial performance evaluation metric that calculates the mean of the average precision (AP) across all categories, as illustrated in Equations (18) and (19). mAP0.5 denotes the mAP value when the Intersection over Union (IoU) is set to 0.5. mAP0.5–0.95 refers to the mAP evaluated at various IoU thresholds ranging from 0.5 to 0.95, with an increment of 0.05 [38].
AP = \int_{0}^{1} Precision(Recall)\, dr \quad (18)
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \quad (19)
where N represents the number of classification categories in this paper.
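A small worked example of Equations (15)–(19), using the headline values later reported for the YEF model; minor differences from the published F1 of 0.891 and mAP0.5 of 92.9% come from rounding of the inputs.

```python
# Worked example of Eqs. (17) and (19) with values from Table 7.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

p, r = 0.904, 0.88                      # precision and recall ("All" class, Table 7)
print(round(f1(p, r), 3))               # ~0.892 (reported as 0.891 due to rounding of p and r)

ap50 = {"weed": 0.888, "crop": 0.971}   # per-class AP at IoU 0.5 (Table 7)
print(sum(ap50.values()) / len(ap50))   # 0.9295, i.e. the reported mAP0.5 of 92.9%
```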

3. Results and Discussion

3.1. Performance Experiments

This study conducts experiments with varying counts of GCAA-Fusion modules to evaluate the impact of the AFF network architecture on YOLOv8n model performance while keeping other parameters fixed. Fusion-1 represents the YOLOv8n model enhanced with a single GCAA-Fusion module, while Fusion-2 incorporates two GCAA-Fusion modules. Results in Table 5 indicate that adding GCAA-Fusion modules leads to measurable improvements across detection metrics compared to the baseline YOLOv8n model. Notably, with the addition of three GCAA-Fusion modules (YEF), the model achieves optimal results across all evaluation metrics, highlighting the significant performance enhancements provided by the AFF network architecture.
This study substitutes the GCAA attention mechanism within the GCAA-Fusion module with alternative attention mechanisms—MLCA [38], CBAM [39], SE [40], and EMA [41]—to form new feature fusion modules and further assess the impact of feature fusion on the model.
The results, presented in Table 6, demonstrate that all feature fusion modules improve detection metrics over the baseline YOLOv8n model, validating the feasibility and robustness of the proposed feature fusion framework. The F1 score shows the most significant improvement among all metrics, indicating that feature fusion better balances precision and recall.
Moreover, the GCAA-based fusion module consistently outperforms other attention mechanisms, achieving the highest mAP scores. This highlights the strength of the GCAA mechanism in optimizing feature representation by combining high-level and low-level features. Additionally, different attention mechanisms display unique advantages for specific metrics; for example, SE excels in improving recall, while CBAM achieves competitive performance across most metrics. These attention mechanisms bring unique characteristics suited to different tasks, enabling flexibility in selecting fusion modules based on specific recognition targets.
Figure 13 illustrates the mAP0.5 values over training epochs for models with different fusion modules. In the early training stages, mAP values across fusion modules are similar; however, in later stages, all fusion-enhanced models surpass the baseline, with the GCAA-Fusion module yielding the highest mAP0.5 value, underscoring its superior detection performance relative to other fusion methods.

3.2. Comparative Experiments with the Original Model

This study presents enhancements to the YOLOv8 model’s feature extraction and fusion techniques. Several comparative experiments were designed to evaluate the efficacy of these changes. First, using the same dataset, the YEF model’s detection performance was compared to that of the original YOLOv8n model; the results are displayed in Table 7. As indicated, the YEF model outperforms YOLOv8n in all metrics across all categories without increasing the parameter count. Notably, the precision reaches 90.4%, representing a 1% improvement over the original model. Other metrics also show varying degrees of improvement, demonstrating that the modifications effectively enhance detection performance.
Grad-CAM is a technique that visualizes the intensity distribution of features through color changes, where brighter colors indicate greater attention [42]. Grad-CAM is used in this work to illustrate the YEF model’s detection results. As illustrated in Figure 14, the shallow network layer (layer 5) emphasizes detailed information, such as edges, textures, and object outlines. With the addition of the C2f-EGC module in the Backbone network, the model’s semantic understanding of the plants is significantly enhanced. The 12th layer (small object detection layer) effectively captures fine features of small objects, while the most critical 24th layer (fusion layer), enhanced by the GCAA-Fusion module, further improves the integration capability of multi-scale features.
These visualizations confirm that the improvements in the YEF model enhance feature representation. By effectively capturing both high-level semantic features and low-level details, the model is able to better represent the objects, leading to more accurate detection and localization. This ability to leverage both types of features, particularly in challenging environments, results in more precise and robust performance.
To further demonstrate the generalizability of the proposed EGC and GCAA-Fusion modules, the EGC module was embedded within the C3 module of YOLOv5, forming the C3-EGC module. Additionally, the GCAA-Fusion module was integrated for enhanced feature fusion, resulting in a new model architecture named YEF-5. Comparative experiments were conducted to evaluate the detection performance of YEF-5 against the YOLOv5n baseline.
According to Table 8, the findings reveal that YEF-5 consistently outperforms YOLOv5n across all evaluation metrics. Figure 15 illustrates the evolution of mAP0.5 over training epochs. Although YEF-5 initially shows a slightly lower mAP than YOLOv5n, it eventually surpasses the baseline as training progresses. This trend suggests that YEF-5 requires a more extended adjustment period during early training but ultimately excels in capturing critical data features, leading to superior performance in later stages.
These findings highlight the positive impact of the introduced modules on model performance, demonstrating their adaptability across different architectures and confirming their effectiveness in both YOLOv8n and YOLOv5n frameworks. These findings imply that the proposed improvement methods and modules could be generalized to a wide range of object detection models, potentially benefiting agricultural applications.

3.3. Comparative Experiments with the Other Model

The benefits of the YEF model in vegetable weed recognition tasks were assessed in this study by conducting comparative experiments on the same dataset against several well-known object detection algorithms. First, it was compared with the classic two-stage Faster R-CNN model, followed by the classic single-stage RetinaNet model. Next, comparisons were made with the end-to-end DINO model, the lightweight models RTMDet-Tiny and YOLOv10n, and finally, the TOOD-R50 model [43,44,45,46].
According to the data in Table 9 and the comparison experiment results, the YEF model outperforms other models in several key metrics. Among the models tested, the YEF model achieves the highest precision (90.4%), mAP0.5 (92.9%), and mAP0.5–0.95 (73.3%). Notably, TOOD-R50 and DINO deliver impressive detection results with mAP0.5 values of 83.3% and 89%, respectively, but their parameter counts are significantly higher—321 M and 47 M, respectively—compared to the YEF model’s lightweight 3 M. Similarly, YOLOv10, a compact model with 2.76 M parameters, achieves strong precision (89%) and mAP0.5 (91%), but it still falls short of the YEF model across all metrics.
These results underscore the YEF model’s capability for accurate and efficient detection, making it particularly suitable for agricultural applications. Future work could focus on optimizing the model for better deployment efficiency in resource-constrained environments.

3.4. Qualitative Results

To verify the reliability and generalizability of the proposed model, this study evaluated the detection performance of the YEF model on publicly available weed datasets. The first dataset employed was the publicly available sesame weed dataset [47], which comprises 1300 images of sesame crops and various types of weeds, with each image sized at 512 × 512 pixels. Table 10 presents the detection results of the YEF model, showing an overall precision of 86.1%, a recall rate of 82.7%, an mAP at IoU 0.5 of 89%, and an mAP at IoU 0.5–0.95 of 56.9%. Specifically, for the weed class, the model achieved a precision of 93.4% and a recall rate of 84%.
The second dataset utilized is the largest public weed detection dataset currently available in cotton production systems [48]. This dataset includes 12 common types of weeds found in cotton fields, comprising 5648 images and 9370 bounding boxes. Table 11 presents the detection results of the YEF model for various weed types, demonstrating superior identification capabilities across most categories. For instance, the precision for Eclipta reached 95.1%, with a recall rate of 96.8%. The results from both datasets indicate that the YEF model exhibits strong detection performance when handling different types of weed datasets, thereby proving its effectiveness and reliability for practical applications.
As shown in Figure 16, the qualitative detection results demonstrate the YEF model’s ability to accurately identify crops and weeds under various challenging conditions, including complex backgrounds, diverse plant types, and varying object sizes. (a) illustrates an example containing both vegetables and weeds, (b) shows a weed-only target, and (c) focuses solely on vegetable targets. This comprehensive representation highlights the model’s adaptability to different detection scenarios.

4. Conclusions

This paper presents a vegetable and weed detection model, YOLOv8-EGC-Fusion (YEF), capable of distinguishing crops from weeds in various complex environments. The model introduces three plug-and-play modules: EGC, GCAA, and GCAA-Fusion. The EGC module employs convolution kernels of varying sizes for efficiently grouped convolution, enabling the capture of spatial features at different scales, and it combines with the C2f module to form the C2f-EGC module. Utilizing the C2f-EGC module with a high number of channels enhances the network’s capacity to acquire local contextual information. The GCAA module leverages separable convolutions to obtain long-range contextual information. The GCAA-Fusion module effectively merges shallow and deep features, addressing the issue of feature loss within the network. Additionally, this paper designs a new feature pyramid structure—AFF—based on GCAA-Fusion and PAFPN, which further improves the model’s feature extraction capabilities.
The effectiveness of the proposed model was verified through several experiments. Comparative experiments against the original model and mainstream network models demonstrate that the YEF model excels in recognition tasks, effectively addressing the difficulties associated with multi-scale object detection, and outperforms the YOLOv8n model, as well as mainstream networks such as Faster R-CNN, RetinaNet, TOOD, RTMDet, and YOLOv5, in key metrics like precision, recall, and mAP. Results from comparative experiments on feature fusion modules with different attention mechanisms indicate that the feature fusion module based on the GCAA attention mechanism significantly enhances the algorithm’s expressiveness and detection performance. Incorporating the improved modules into the YOLOv5 model and verifying their effectiveness through comparative experiments shows that the proposed modules generalize well, significantly enhancing the recognition capability of YOLOv5. Testing the YEF model on public datasets confirmed its strong performance across different scenarios and target conditions, achieving the expected objectives.
The YEF model has shown promising results in vegetable and weed detection tasks and holds potential for practical applications in agriculture; however, it still has certain limitations: (1) the model requires further optimization for speed and memory usage prior to hardware deployment, and (2) the model has only been tested for a limited number of vegetables and weeds, while actual agricultural production involves a greater variety of vegetables and more complex planting environments. Therefore, further research and validation of the model’s effectiveness for other vegetables and weeds is needed.

Author Contributions

Conceptualization, C.C. and Y.Z.; methodology, C.C. and Z.F.; software, C.C. and Z.C.; validation, C.C. and M.Z.; formal analysis, J.J. and C.C.; investigation, J.J.; resources, C.C.; data curation, C.C.; writing—original draft preparation, C.C.; writing—review and editing, Y.Z.; visualization, D.Y. and C.C.; supervision, Y.Z.; project administration, Y.Z.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Guangdong Province (2024A1515010463), the earmarked fund for CARS-01, and the Guangdong Province Science and Technology Plan Project (2021B1212040009).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets analyzed for this study can be found in the manuscript. Other data presented in this study are available on request from the first author.

Acknowledgments

We would like to thank the anonymous reviewers for their critical comments and suggestions for improving the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ryder, E.J.; Dias, J.S. World Vegetable Industry: Production, Breeding, Trends. Hortic. Rev. 2011, 38, 299–356. [Google Scholar] [CrossRef]
  2. Han, J.; Luo, Y.; Yang, L.; Liu, X.; Wu, L.; Xu, J. Acidification and Salinization of Soils with Different Initial pH under Greenhouse Vegetable Cultivation. J. Soils Sediments 2014, 14, 1683–1692. [Google Scholar] [CrossRef]
  3. Tang, Y.; Dong, J.; Gruda, N.; Jiang, H. China Requires a Sustainable Transition of Vegetable Supply from Area-Dependent to Yield-Dependent and Decreased Vegetable Loss and Waste. Int. J. Environ. Res. Public Health 2023, 20, 1223. [Google Scholar] [CrossRef]
  4. Iqbal, N.; Manalil, S.; Chauhan, B.S.; Adkins, S.W. Investigation of Alternate Herbicides for Effective Weed Management in Glyphosate-Tolerant Cotton. Arch. Agron. Soil Sci. 2019, 65, 1885–1899. [Google Scholar] [CrossRef]
  5. Mennan, H.; Jabran, K.; Zandstra, B.H.; Pala, F. Non-Chemical Weed Management in Vegetables by Using Cover Crops: A Review. Agronomy 2020, 10, 257. [Google Scholar] [CrossRef]
  6. Bakhshipour, A.; Jafari, A.; Nassiri, S.M.; Zare, D. Weed Segmentation Using Texture Features Extracted from Wavelet Sub-Images. Biosyst. Eng. 2017, 157, 1–12. [Google Scholar] [CrossRef]
  7. Raja, R.; Slaughter, D.C.; Fennimore, S.A.; Nguyen, T.T.; Siemens, M.C. Crop Signalling: A Novel Crop Recognition Technique for Robotic Weed Control. Biosyst. Eng. 2019, 187, 278–291. [Google Scholar] [CrossRef]
  8. Wang, X.; Wang, Q.; Qiao, Y.; Zhang, X.; Lu, C.; Wang, C. Precision Weed Management for Straw-Mulched Maize Field: Advanced Weed Detection and Targeted Spraying Based on Enhanced YOLO v5s. Agriculture 2024, 14, 2134. [Google Scholar] [CrossRef]
  9. Wang, A.; Zhang, W.; Wei, X. A Review on Weed Detection Using Ground-Based Machine Vision and Image Processing Techniques. Comput. Electron. Agric. 2019, 158, 226–240. [Google Scholar] [CrossRef]
  10. Aversano, L.; Bernardi, M.L.; Cimitile, M.; Iammarino, M.; Rondinella, S. Tomato Diseases Classification Based on VGG and Transfer Learning. In Proceedings of the 2020 IEEE International Workshop on Metrology for Agriculture and Forestry (MetroAgriFor), Trento, Italy, 4–6 November 2020; pp. 129–133. [Google Scholar] [CrossRef]
  11. Meyer, G.E.; Neto, J.C. Verification of color vegetation indices for automated crop imaging applications. Comput. Electron. Agric. 2008, 63, 282–293. [Google Scholar] [CrossRef]
  12. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  14. Jiang, L.; Wang, Y.; Wu, C.; Wu, H. Fruit Distribution Density Estimation in YOLO-Detected Strawberry Images: A Kernel Density and Nearest Neighbor Analysis Approach. Agriculture 2024, 14, 1848. [Google Scholar] [CrossRef]
  15. Chen, J.; Wang, H.; Zhang, H.; Luo, T.; Wei, D.; Long, T.; Wang, Z. Weed Detection in Sesame Fields Using a YOLO Model with an Enhanced Attention Mechanism and Feature Fusion. Comput. Electron. Agric. 2022, 202, 107412. [Google Scholar] [CrossRef]
  16. Cao, Y.; Pang, D.; Zhao, Q.; Yan, Y.; Jiang, Y.; Tian, C.; Wang, F.; Li, J. Improved YOLOv8-GD Deep Learning Model for Defect Detection in Electroluminescence Images of Solar Photovoltaic Modules. Eng. Appl. Artif. Intell. 2024, 131, 107866. [Google Scholar] [CrossRef]
  17. Wu, H.; Wang, Y.; Zhao, P.; Qian, M. Small-Target Weed-Detection Model Based on YOLO-V4 with Improved Backbone and Neck Structures. Precis. Agric. 2023, 24, 2149–2170. [Google Scholar] [CrossRef]
  18. Ying, B.; Xu, Y.; Zhang, S.; Shi, Y.; Liu, L. Weed Detection in Images of Carrot Fields Based on Improved YOLOv4. Trait. Signal. 2021, 38, 341–348. [Google Scholar] [CrossRef]
  19. Hu, R.; Su, W.; Li, J.; Peng, Y. Real-Time Lettuce-Weed Localization and Weed Severity Classification Based on Lightweight YOLO Convolutional Neural Networks for Intelligent Intra-Row Weed Control. Comput. Electron. Agric. 2024, 226, 109404. [Google Scholar] [CrossRef]
  20. Solimani, F.; Cardellicchio, A.; Dimauro, G.; Petrozza, A.; Summerer, S.; Cellini, F.; Renò, V. Optimizing Tomato Plant Phenotyping Detection: Boosting YOLOv8 Architecture to Tackle Data Complexity. Comput. Electron. Agric. 2024, 218, 108728. [Google Scholar] [CrossRef]
  21. Qu, H.-R.; Su, W.-H. Deep Learning-Based Weed–Crop Recognition for Smart Agricultural Equipment: A Review. Agronomy 2024, 14, 363. [Google Scholar] [CrossRef]
  22. Su, D.; Qiao, Y.; Kong, H.; Sukkarieh, S. Real-Time Detection of Inter-Row Ryegrass in Wheat Farms Using Deep Learning. Biosyst. Eng. 2021, 204, 198–211. [Google Scholar] [CrossRef]
  23. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar] [CrossRef]
  24. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. Yolov9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  25. Wen, C.; Guo, H.; Li, J.; Hou, B.; Huang, Y.; Li, K.; Lu, Y. Application of Improved YOLOv7-Based Sugarcane Stem Node Recognition Algorithm in Complex Environments. Front. Plant Sci. 2023, 14, 1230517. [Google Scholar] [CrossRef]
  26. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. arXiv 2020, arXiv:2005.05928. Available online: https://arxiv.org/abs/2005.05928 (accessed on 27 November 2024).
  27. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 12021–12031. [Google Scholar] [CrossRef]
  28. Xiong, C.; Zayed, T.; Abdelkader, E.M. A Novel YOLOv8-GAM-Wise-IoU Model for Automated Detection of Bridge Surface Cracks. Constr. Build. Mater. 2024, 414, 135025. [Google Scholar] [CrossRef]
  29. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  30. Huang, Z.; Wang, X.; Wei, Y.; Huang, L.; Shi, H.; Liu, W. CCNet: Criss-Cross Attention for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6896–6908. [Google Scholar] [CrossRef]
  31. Jing, X.; Liu, X.; Liu, B. Composite Backbone Small Object Detection Based on Context and Multi-Scale Information with Attention Mechanism. Mathematics 2024, 12, 622. [Google Scholar] [CrossRef]
  32. Chen, Z.; He, Z.; Lu, Z.M. DEA-Net: Single Image Dehazing Based on Detail-Enhanced Convolution and Content-Guided Attention. IEEE Trans. Image Process. 2024, 33, 1002–1015. [Google Scholar] [CrossRef]
  33. Tang, L.; Zhang, H.; Xu, H.; Ma, J. Rethinking the Necessity of Image Fusion in High-Level Vision Tasks: A Practical Infrared and Visible Image Fusion Network Based on Progressive Semantic Injection and Scene Fidelity. Inf. Fusion 2023, 99, 101870. [Google Scholar] [CrossRef]
  34. Zhang, Z.; Wang, Z.; Lin, Z.; Qi, H. Image Super-Resolution by Neural Texture Transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7974–7983. [Google Scholar] [CrossRef]
  35. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models. IEEE Trans. Softw. Eng. 2009, 32, 1627–1645. [Google Scholar] [CrossRef]
  36. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar] [CrossRef]
  37. Lin, T.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. IEEE Comput. Soc. 2017, 41, 939–954. [Google Scholar] [CrossRef]
  38. Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed Local Channel Attention for Object Detection. Eng. Appl. Artif. Intell. 2023, 123, 106442. [Google Scholar] [CrossRef]
  39. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
  40. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  41. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. Repvit-sam: Towards real-time segmenting anything. arXiv 2023, arXiv:2312.05760. Available online: https://arxiv.org/abs/2312.05760 (accessed on 27 November 2024).
  42. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar] [CrossRef]
  43. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-Aligned One-Stage Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499. [Google Scholar] [CrossRef]
  44. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Chen, K. RTMDet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar] [CrossRef]
  45. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef]
  46. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar] [CrossRef]
  47. Ravirajsinh, D. Crop and Weed Detection Data with Bounding Boxes [Dataset]; Kaggle: San Francisco, CA, USA, 2020; Available online: https://www.kaggle.com/datasets/ravirajsinh45/crop-and-weed-detection-data-with-bounding-boxes (accessed on 27 November 2024).
  48. Dang, F.; Chen, D.; Lu, Y.; Li, Z. YOLOWeeds: A Novel Benchmark of YOLO Object Detectors for Multi-Class Weed Detection in Cotton Production Systems. Comput. Electron. Agric. 2023, 205, 107655. [Google Scholar] [CrossRef]
Figure 1. Location of sampling sites for weeds and vegetables. The image was obtained from © Google Maps. Note: Non-English terms in the figure (e.g., “广州”) are Chinese names for locations.
Figure 2. Data collection. Images showing examples of samples from the dataset used for training the model.
Figure 3. Image enhancement result.
Figure 4. YOLOv8-EGC-Fusion architecture.
Figure 5. Network architecture of the EGC module.
Figure 6. Feature maps generated by the YOLOv8 Backbone network.
Figure 7. Structure of the C2f-EGC module, which substitutes the EGC module for the second standard convolution module in the Bottleneck.
Figure 8. Structure of the GCAA.
Figure 9. Feature maps from the 4th and 18th layers of YOLOv8 are shown. The green box represents the 4th layer, and the blue box represents the 18th layer, with both having the same number of output channels.
Figure 10. Structure diagram of the GCAA-Fusion module.
Figure 11. Different feature pyramid structures. P3, P4, and P5 represent the feature levels generated by the FPN.
Figure 12. Structure of the AFF module. P3, P4, and P5 represent the feature levels generated by the FPN, while N3, N4, and N5 indicate the corresponding newly generated feature maps.
Figure 13. mAP values across training epochs for models with different feature fusion modules. “None” indicates the YOLOv8 model without any feature fusion modules.
Figure 14. Model detection results and heatmap analysis.
Figure 15. Variation curves of detection metrics with training epochs.
Figure 16. Qualitative results of detection by the YEF model. The red bounding boxes represent weeds, while the pink bounding boxes represent vegetables.
Table 1. Convolution kernels of different sizes.
Kernel Size | Parameters | GFLOPs
NONE | 3,006,038 | 8.1
[1, 3] | 2,643,798 | 7.5
[1, 5] | 2,698,406 | 7.7
[1, 7] | 2,772,134 | 7.8
[3, 5] | 2,722,982 | 7.7
[3, 7] | 2,791,254 | 7.7
[5, 7] | 2,847,794 | 7.9
Table 2. Parameter comparison between C2f and C2f-EGC modules.
Model | Layer | Output Channel | Kernel Size | Parameters
C2f | 6 | 128 | 3 × 3 | 197,632
C2f | 8 | 256 | 3 × 3 | 460,288
C2f | 12 | 128 | 3 × 3 | 148,224
C2f | 18 | 128 | 3 × 3 | 123,648
C2f | 21 | 256 | 3 × 3 | 493,056
C2f-EGC | 6 | 128 | [1, 3] | 137,344
C2f-EGC | 8 | 256 | [1, 3] | 339,584
C2f-EGC | 12 | 128 | [1, 3] | 118,080
C2f-EGC | 18 | 128 | [1, 3] | 93,504
C2f-EGC | 21 | 256 | [1, 3] | 372,352
Table 3. PSNR and SSIM values for different image sizes.
Image Size | PSNR | SSIM
320 × 320 | 24.00 | 0.80
160 × 160 | 21.65 | 0.73
80 × 80 | 19.79 | 0.69
40 × 40 | 18.20 | 0.67
Table 4. Hyperparameter information.
Hyperparameters | Values
Epochs | 400
Batch Size | 32
Workers | 5
Imgsz | 640
Learning Rate | 0.01
Momentum | 0.937
Weight Decay | 0.0005
Table 5. Results of the YOLOv8 with varying numbers of GCAA-Fusion modules.
Models | Precision (%) Weed/Crop/All | Recall (%) Weed/Crop/All | mAP0.5 (%) Weed/Crop/All | mAP0.5–0.95 (%) Weed/Crop/All | F1 (%) All
YOLOv8 | 87.1/92.7/89.4 | 80.7/94.6/87.8 | 88.5/96.8/92.7 | 61.6/84.4/72.9 | 88.5
Fusion-1 | 86.4/92.2/89.3 | 80.8/95.7/88.2 | 88.3/97.1/92.6 | 60.7/84.8/72.8 | 88.7
Fusion-2 | 87.8/93/90.4 | 80.2/95.1/87.7 | 88.5/97.1/92.8 | 61.2/84.8/73 | 89.1
YEF | 87.9/93/90.4 | 80.8/95.4/88 | 88.8/97.1/92.9 | 61.5/84.4/73.3 | 89.1
Table 6. Model performance metrics with different feature fusion modules.
Models | Precision (%) Weed/Crop/All | Recall (%) Weed/Crop/All | mAP0.5 (%) Weed/Crop/All | mAP0.5–0.95 (%) Weed/Crop/All | F1 (%) All
MLCA | 88.5/93.1/90.8 | 80.1/95.3/87.7 | 88.3/97.1/92.7 | 61.3/85.2/73.1 | 89.2
EMA | 86.8/92.6/89.7 | 81.2/94.3/87.8 | 88.9/96.4/92.7 | 61.3/84.6/72.9 | 88.7
CBAM | 88.2/94.1/91.1 | 79.3/94.9/87.1 | 88.5/97.2/92.9 | 61.1/85.2/73.2 | 89.0
SE | 85.8/92.3/89 | 81.6/95.2/88.4 | 88.7/96.8/92.8 | 60.8/84.9/72.8 | 88.7
None | 87.1/92.7/89.4 | 80.7/94.6/87.8 | 88.5/96.8/92.7 | 61.6/84.4/72.9 | 88.5
GCAA | 87.9/93/90.4 | 80.5/95.4/88 | 88.8/97.1/92.9 | 61.5/84.4/73.3 | 89.1
Table 7. Comparison of detection results between YEF and original model.
Models | Parameters | Class | Precision (%) | Recall (%) | mAP0.5 (%) | mAP0.5–0.95 (%) | F1 (%)
YEF | 3.0 M | Weed | 87.9 | 80.5 | 88.8 | 61.5 | 84.0
YEF | 3.0 M | Crop | 93 | 95.4 | 97.1 | 85.1 | 94.1
YEF | 3.0 M | All | 90.4 | 88 | 92.9 | 73.3 | 89.1
YOLOv8n | 3.0 M | Weed | 87.1 | 80.7 | 88.5 | 61.6 | 83.7
YOLOv8n | 3.0 M | Crop | 92.7 | 94.6 | 96.8 | 84.4 | 93.6
YOLOv8n | 3.0 M | All | 89.4 | 87.8 | 92.7 | 72.9 | 88.5
Table 8. Detection results of the YEF-5 model compared to the YOLOv5n model.
Models | Precision (%) Weed/Crop/All | Recall (%) Weed/Crop/All | mAP0.5 (%) Weed/Crop/All | mAP0.5–0.95 (%) Weed/Crop/All | F1 (%) All
YOLOv5n | 85.6/92/88.8 | 80.6/95/87.8 | 88.4/96.6/92.5 | 60.3/84.4/72.4 | 88.2
YEF-5 | 86.5/93.3/89.9 | 81.3/95.4/88.5 | 88.7/96.9/92.8 | 60.3/84.5/73.4 | 89.1
Table 9. Detection results of the YEF model compared with other models.
Models | Parameters | Precision (%) | Recall (%) | mAP0.5 (%) | mAP0.5–0.95 (%)
Faster R-CNN | 41 M | 71.1 | 90 | 81.5 | 65.4
TOOD-R50 | 321 M | 71.4 | 91.4 | 83.3 | 69.7
RTMDet-Tiny | 4 M | 73.1 | 91.3 | 81.3 | 64.7
RetinaNet | 36 M | 60.3 | 87.6 | 77.5 | 63.7
DINO | 47 M | 86.4 | 87 | 89 | 70.3
YOLOv10 | 2.76 M | 89 | 87 | 91 | 72.1
YEF | 3 M | 90.4 | 88 | 92.9 | 73.3
Table 10. Detection results of the YEF model on the sesame weed dataset.
Class | Precision (%) | Recall (%) | mAP0.5 (%) | mAP0.5–0.95 (%) | F1 (%)
All | 86.1 | 82.7 | 89 | 56.9 | 84.3
Weed | 93.4 | 84 | 91.9 | 54.4 | 88.4
Crop | 78.7 | 81.5 | 86.1 | 55.1 | 80.0
Table 11. Detection results of the YEF model on the cotton weed dataset.
Types of Weeds | Precision (%) | Recall (%) | mAP0.5 (%) | mAP0.5–0.95 (%) | F1 (%)
All | 92.1 | 89.4 | 93.6 | 88 | 90.7
Eclipta | 95.1 | 96.8 | 99.1 | 97.9 | 95.9
Ipomoea indica | 97.2 | 92.4 | 96.2 | 88.7 | 94.5
Eleusine indica | 93.5 | 90.1 | 93.8 | 84.5 | 87
Sida rhombifolia | 91 | 92 | 95.1 | 92.3 | 91.4
Physalis angulata | 85.2 | 71.4 | 77.8 | 73.8 | 77.6
Senna obtusifolia | 95.4 | 100 | 99.5 | 98.7 | 97.6
Amaranthus palmeri | 94.4 | 92.3 | 96.3 | 94.5 | 93.3
Euphorbia maculata | 95.7 | 85 | 91.4 | 83.5 | 90
Portulaca oleracea | 89.1 | 90.9 | 95.3 | 86.3 | 89.9
Mollugo verticillata | 84.9 | 81.2 | 88.9 | 80.4 | 83
Amaranthus tuberculatus | 96.2 | 92.2 | 97.2 | 92.5 | 94.1
Ambrosia artemisiifolia | 88.1 | 88.2 | 92.5 | 82.6 | 88.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
