1. Introduction
Weeds represent a significant challenge to the productivity and quality of soybean production systems [1]. Effective weed management is of critical importance in reducing competition between crops and weeds for essential resources such as light, water, and nutrients [2,3]. Conventional control methodologies, predominantly those based on chemical herbicides and manual weeding, are frequently inefficient, costly, and prone to environmental contamination. Such challenges underscore the necessity for more precise and sustainable weed management strategies. The advent of precision agriculture and intelligent technologies has precipitated a paradigm shift in weed identification. Automated approaches, underpinned by computer vision and machine learning, have emerged as a pivotal area of research. By analysing in-field images, these techniques can rapidly and accurately determine weed species and spatial distribution, thereby providing scientific guidance for precision spraying and mechanical weeding. In this study, a novel weed identification framework tailored to soybean fields is proposed. The proposed approach integrates state-of-the-art image processing methods with machine learning algorithms to improve the accuracy and efficiency of weed detection in soybean plots, offering a solid technical foundation for precision weed management. This work contributes to the advancement of smart agriculture and addresses the urgent demand for sustainable, high-efficiency agricultural production.
In the study conducted by Liu et al. [4], a novel methodology was proposed that integrates semantic segmentation and image processing for weed detection. Their experiments demonstrated that, after applying knowledge distillation to the DeepLabV3+ model, the average accuracy across all classes exceeded 99.5%, and the MIoU for all classes surpassed 95.5%. Building on similar encoder–decoder paradigms, Qi et al. [5] presented a semantic segmentation method known as PF-UperNet, which is rooted in an encoder–decoder architecture and achieved an MIoU of 87.45% and an MPA of 96.82%, with a total of 46.16 million parameters. Genze et al. [6] proposed DeBlurWeedSeg, a methodology that integrates deblurring and segmentation models to facilitate the analysis of motion-blurred images; their experiments demonstrated that combining deblurring and segmentation enabled accurate separation of weeds from sorghum and background in both clear and motion-blurred UAV-captured images. Zou et al. [7] presented a semantic-segmentation-based methodology for the evaluation of weed species and density, where the coefficient of determination (R²) between algorithm-computed and manually assessed weed density reached 0.90, with a root mean square error of 0.05, indicating effective density estimation in complex environments. Guo et al. [8] developed a semi-supervised deep learning model capable of learning semantic segmentation from both annotated and unannotated images, which reached 85.5% MIoU on test data even under conditions of intense weed infestation. In the broader field of agricultural computer vision, Ma et al. [9] presented a SegNet-based fully convolutional network for paddy-field semantic segmentation, achieving a mean accuracy of 92.7% and effectively classifying pixels of rice seedlings, background, and weeds, while Janneh et al. [10] proposed a refined DCNN algorithm for pixel-wise semantic segmentation of crops and weeds, obtaining average MIoU scores of 0.8646, 0.9164, and 0.8459 on the CWFID carrot dataset, the BoniRob sugar beet dataset, and a rice seedling dataset, respectively. Nong et al. [11] further advanced semi-supervised segmentation with SemiWeedNet, a weed–crop segmentation method designed for complex environments to reduce reliance on extensive labelled data; comparative experiments on public datasets showed that SemiWeedNet outperformed several state-of-the-art approaches. In addition, You et al. [12] proposed a weed/crop segmentation network capable of enhancing performance for the precise identification of weeds of arbitrary shape under complex conditions, thereby supporting autonomous robots in effectively reducing weed density.
In parallel with these high-accuracy segmentation models, a substantial body of work has focused on lightweight and real-time architectures suitable for deployment in the field. Han et al. [13] proposed a fast weed segmentation approach based on a Crop Detection Model (CDM) and the Excess Green (ExG) index, which expedited the segmentation process while maintaining accuracy; the model achieved a precision of 92.50%, an IoU of 76.14%, and an overall accuracy of 98.10%, thereby providing real-time and accurate weed segmentation. Kong et al. [14] developed a new segmentation network based on the YOLO architecture and showed that the insertion of cross-attention significantly enhanced performance: the improved model achieved an average MIoU@50 of 90.9%, alongside a 5.9% improvement in precision and a 15.56% reduction in GFLOPs, suggesting suitability for resource-constrained environments. Yu et al. [15] proposed DCSAnet, a lightweight weed segmentation network designed for mobile weed-control equipment; with a parameter count of only 0.57 million, DCSAnet achieved an MIoU of 85.95% together with the highest segmentation precision among the compared methods, demonstrating its effectiveness for practical weed-related tasks. Kong et al. [16] developed an indirect approach that first segments crops and then classifies remaining green objects as weeds, achieving an MIoU of 97.9%, a recall of 93.4%, and a precision of 97.6%, while also improving inference speed. Gao et al. [17] reported that their EPAnet increased overall accuracy by 0.65%, MIoU by 1.91%, and frequency-weighted IoU by 1.19%; compared with state-of-the-art methods, EPAnet delivered superior segmentation under uneven illumination, leaf interference, and shadows in natural environments. Lan et al. [18] presented two enhanced recognition models, MobileNetV2-UNet and FFB-BiSeNetV2, which attained higher segmentation precision than BiSeNetV2, with peak pixel accuracy and MIoU of 93.09% and 80.28%, respectively; on embedded hardware with FP16 weights, their inference speeds reached 45.05 FPS and 40.16 FPS per image. The improved U-Net model based on a MaxViT encoder and CBAM attention fusion proposed by Li et al. [19] further demonstrated that attention-enhanced encoder–decoder architectures can significantly improve the efficiency and generalisation ability of beet–weed segmentation while ensuring high accuracy, with most misclassifications occurring only at plant boundaries.
In parallel, classical and hybrid approaches that incorporate hand-crafted features and traditional image processing remain important. Bakhshipour et al. [20] investigated two approaches for distinguishing weeds from main crops and used principal component analysis to select 14 texture features from an initial set of 52. Their results demonstrated that wavelet texture features were effective in differentiating weeds within crops even in the presence of heavy occlusion and leaf overlap. Xu et al. [21] also exemplified a hybrid strategy, in which visible colour indices were combined with an encoder–decoder instance segmentation model to improve robustness and accuracy. These studies highlight the value of spectral and texture descriptors, particularly when integrated with modern deep learning architectures. Despite notable advances in semantic segmentation, lightweight design, and semi-supervised learning, current weed identification methods still face key challenges that limit their deployment in the field. Complex agricultural conditions, such as variable lighting, occlusion, soil heterogeneity, and motion blur, reduce the stability of feature extraction. The high visual similarity between crops and weeds also leads to boundary ambiguity and misclassification. Furthermore, most models rely heavily on large annotated datasets, yet generalise poorly across regions or growth stages. Lightweight networks improve efficiency but often sacrifice fine-grained accuracy, whereas heavier models are unsuitable for embedded platforms. Finally, the limited use of agronomic priors restricts robustness in dense or irregular weed scenarios. Taken together, these limitations underscore the need for weed segmentation models that are more accurate, robust, and scalable.
In this study, an EDM-UNet-based method for weed segmentation in soybean fields is proposed, which demonstrates strong robustness across varying weed densities while meeting real-time detection requirements. The model architecture integrates an efficient channel attention (ECA) mechanism with an edge-assisted guidance module: the ECA modules are embedded in the skip connections, and Canny-derived edge information is introduced at each decoder stage to guide the network's focus towards both key semantic channels and boundary structures. This significantly improves the accuracy of weed–soybean boundary delineation. In the post-processing stage, a direction-consistency enhancement module employs multi-orientation Gabor filters to reinforce the directional texture features characteristic of ridge-aligned planting patterns, thereby effectively suppressing false responses from weeds with ambiguous textures. Finally, we leverage geometric priors of field crops to design a morphology-constrained aspect-ratio filtering mechanism, which further eliminates non-target regions and enhances the structural plausibility and agricultural relevance of the segmentation output. The experimental findings, obtained under natural field conditions, corroborate the efficacy of the proposed model for soybean weed segmentation. Beyond advancing automation and precision in soybean weed detection, this work aims to support agricultural modernisation, thereby improving overall production efficiency and economic returns. Furthermore, the techniques developed in this study can serve as a valuable reference and be extended to precision-management tasks in other crop systems.
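As a concrete illustration of the first two components, the following PyTorch sketch shows an ECA block of the kind placed in skip connections and a helper that prepares a Canny edge map for injection at a decoder stage. It is a minimal sketch under our own assumptions (the 1-D kernel size, the Canny thresholds of 100/200, and fusion by channel concatenation are illustrative choices), not the authors' implementation.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """Efficient channel attention: a 1-D convolution over the pooled
    channel descriptor replaces the fully connected layers of SE blocks."""
    def __init__(self, channels: int, k_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.avg_pool(x)                          # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(1, 2))  # conv across channels
        y = self.sigmoid(y.transpose(1, 2).unsqueeze(-1))
        return x * y.expand_as(x)                     # channel-wise reweighting

def canny_guidance(image_bgr: np.ndarray, hw: tuple) -> torch.Tensor:
    """Canny edge map resized to a decoder stage's spatial size, so it can
    be concatenated to the feature map as one extra guidance channel."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                 # thresholds illustrative
    edges = cv2.resize(edges, (hw[1], hw[0])) / 255.0
    return torch.from_numpy(edges).float()[None, None]  # (1, 1, H, W)

# usage: reweight a skip feature, then append the edge channel
feat = torch.randn(1, 64, 128, 128)
feat = ECABlock(64)(feat)
edge = canny_guidance(np.zeros((512, 512, 3), dtype=np.uint8), (128, 128))
fused = torch.cat([feat, edge], dim=1)                # (1, 65, 128, 128)
```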
The remainder of this paper is structured as follows: Section 2 details the structure and implementation of the algorithm, Section 3 presents experimental results and analysis, Section 4 discusses the findings, and Section 5 concludes the study.
3. Results and Analysis
3.1. Experimental Environment
The experiments were conducted using the PyTorch (version 1.10.0) framework, and the details of the experimental environment are provided in Table 1. The input image size was set to 512 × 512 pixels. The model hyperparameters were configured as follows: the batch size was 16, stochastic gradient descent (SGD) was used as the optimizer with an initial learning rate of 0.01, and the momentum parameter was set to 0.937. The learning rate was adjusted using cosine annealing, with a decay coefficient of 0.0005.
Training comprised 300 epochs, with weight files saved every 50 epochs. A log file recorded the loss values for the training and validation sets. These hyperparameters were selected to promote fast convergence, minimize overfitting, and prevent the model from becoming stuck in local minima. To ensure fairness, all comparative models in this study were trained and evaluated under identical hardware conditions and hyperparameter settings. The FPS values reported in this paper were obtained in PyTorch by running 100 forward passes and averaging the inference speed.
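The configuration described above could be reproduced along the following lines. This is a hedged sketch, not the authors' training script: the stand-in model, the interpretation of the 0.0005 decay coefficient as SGD weight decay, and the warm-up count are our assumptions.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv2d(3, 2, 3, padding=1).to(device)  # stand-in for EDM-UNet

# SGD with lr 0.01 and momentum 0.937; cosine annealing over 300 epochs
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

@torch.no_grad()
def measure_fps(model, runs: int = 100) -> float:
    """Average inference speed over `runs` forward passes on 512x512 input."""
    model.eval()
    x = torch.randn(1, 3, 512, 512, device=device)
    for _ in range(10):           # warm-up so lazy initialisation is excluded
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return runs / (time.time() - start)

print(f"{measure_fps(model):.2f} FPS")
```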
3.2. Model Evaluation Index
The evaluation metrics for semantic segmentation models primarily include accuracy indicators such as recall and precision, semantic segmentation performance metrics represented by MIoU, computational efficiency measured by FPS, and model complexity indicators represented by the number of parameters and floating point operations.
Accuracy metrics can be defined using a confusion matrix. For the soybean weed segmentation task, we designate weed pixels as the positive class and non-weed pixels as the negative class, thereby distinguishing four categories: true negative (TN), false positive (FP), false negative (FN), and true positive (TP).
Based on the confusion matrix, the following evaluation metrics can be calculated. Recall represents the proportion of actual positive samples that are correctly identified as positive, as shown in Equation (5); precision indicates the proportion of correctly predicted positive samples among all samples predicted as positive, as expressed in Equation (6):

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{6}$$

The IoU measures the overlap between the predicted segmentation region and the ground-truth label region, defined as the ratio of the area of their intersection to the area of their union. The MIoU is computed as the arithmetic mean of the IoU values over all classes, as shown in Equation (7):

$$\mathrm{MIoU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{TP_i}{TP_i + FP_i + FN_i} \tag{7}$$

where $k$ denotes the number of foreground categories and $k+1$ is the total number of categories including the background class.
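For reference, Equations (5)–(7) can be computed directly from a confusion matrix; the short NumPy sketch below shows the per-class arithmetic on an illustrative toy matrix (not data from this study).

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """conf[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted as class c but actually another class
    fn = conf.sum(axis=1) - tp   # actually class c but predicted as another class
    recall = tp / (tp + fn)              # Eq. (5), per class
    precision = tp / (tp + fp)           # Eq. (6), per class
    iou = tp / (tp + fp + fn)
    return recall, precision, iou.mean()  # Eq. (7): mean IoU over k+1 classes

# toy 2x2 example: rows/cols are background (0) and weed (1)
conf = np.array([[9500,  100],
                 [ 300, 1100]])
recall, precision, miou = segmentation_metrics(conf)
```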
Furthermore, FPS is introduced as an indicator to measure the processing speed of the semantic segmentation network. In addition, Params denotes the number of trainable parameters in millions, and FLOPs denotes the number of floating-point operations per forward pass in billions. These two indicators are used to characterise the model’s complexity and computational cost, which are critical for deployment in resource-constrained agricultural environments.
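Parameter counts are straightforward to obtain in PyTorch; FLOPs require a profiler, and the sketch below uses the third-party thop package as one common option, which is our assumption rather than necessarily the tool used in this study.

```python
import torch
from thop import profile  # pip install thop

def complexity(model: torch.nn.Module, size=(1, 3, 512, 512)):
    """Return (parameters in millions, operations per forward pass in billions).
    Note: thop reports multiply-accumulate counts, which many papers
    report as FLOPs."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    ops, _ = profile(model, inputs=(torch.randn(size),), verbose=False)
    return params_m, ops / 1e9
```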
3.3. Ablation Test Results
In the soybean–weed segmentation task, in order to verify the effect of the proposed modules on the overall model performance, this study conducted a series of ablation experiments based on EDM-UNet, with the specific results shown in Table 2. The experiments progressively combined three modules (the ECA attention mechanism, edge enhancement, and morphological post-processing) to explore the impact of each on the model's MIoU, recall, precision, and FPS. Figure 8 shows the performance comparison of the different combinations.
As shown in Experiment 1, the baseline ResNet50-UNet model achieves an MIoU of 82.74%, a recall of 87.86%, a precision of 91.75%, and an FPS of 51.61, establishing the benchmark for comparison. In Experiment 2, after incorporating the lightweight ECA module, the FPS decreased to 45.50, but recall and MIoU increased to 91.58% and 88.08%, respectively, while precision reached its maximum value of 95.15%. Experiment 3 builds on Experiment 2 by introducing edge detection, further enhancing feature representation. Compared with Experiment 2, MIoU increased by 0.29 percentage points to 88.37%, recall increased by 0.65 percentage points to 92.23%, precision decreased slightly to 94.80%, and FPS dropped to 39.46. This indicates enhanced recall capability and improved segmentation accuracy, although the added edge guidance may slightly increase high-confidence background misclassifications, which explains the modest decline in precision. Experiment 4 adds the edge-detection module to the baseline of Experiment 1, with the objective of enhancing segmentation quality in regions adjacent to the boundary. The results show an MIoU of 83.30%, a recall of 87.48%, and a precision of 93.12%, with FPS decreasing to 40.66. Apart from the decline in FPS, MIoU and precision improve consistently while recall remains at a comparable level, suggesting that this module effectively mitigates boundary blurring and enhances the model's capacity to detect pixels at target edges.

Experiment 5, also built on Experiment 1, incorporates the morphological post-processing module, raising MIoU, recall, and precision by 5.05, 4.28, and 2.36 percentage points, respectively, with a decline in FPS of 2.91%. Despite the slight FPS decline, all accuracy metrics improve substantially, suggesting that morphological operations applied to the model output effectively remove small false-response regions and thereby enhance overall segmentation accuracy. Notably, this module does not alter the backbone network structure, so its impact on computational complexity is minimal. Experiment 6 combines the configurations of Experiments 4 and 5, applying morphological post-processing on top of edge enhancement and further improving overall performance: the model achieves an MIoU of 89.05%, with recall and precision reaching 93.01% and 94.83%, respectively. These accuracy metrics exceed those achieved by edge enhancement or morphological post-processing alone, indicating that the combination yields better segmentation and validating the efficacy of the improvements. The frame rate of 39.97 FPS remains within the acceptable range. Experiment 7 integrates the ECA attention mechanism with the morphological post-processing module, achieving good performance without adding structural complexity: the MIoU is 87.78%, the recall 91.59%, and the precision 94.72%. Its FPS of 49.19 is the highest among all groups except the original model; although its metrics are not optimal, this combination offers a good balance, making it suitable for deployment in scenarios with high efficiency requirements.

The final Experiment 8 is the full model, introducing ECA attention, edge enhancement, and morphological post-processing together. Overall performance reaches its best level: the MIoU is 89.45%, while recall and precision reach 93.53% and 94.78%, respectively, at a frame rate of 40.36 FPS. Compared with the baseline model, MIoU, recall, and precision increase by 6.71, 5.67, and 3.03 percentage points, respectively; MIoU and recall attain their maximum values among all combinations, while FPS drops by 11.25 FPS. The full model thus demonstrates clear advantages in accuracy, recall, and segmentation consistency, confirming the important role of the collaborative fusion of modules in improving model performance.
These findings demonstrate that the individual modules affect different aspects of model performance. The ECA attention mechanism enhances the model's responsiveness to key channel features, improving both recall and MIoU. The edge enhancement module improves the boundary quality of targets, strengthening overall detection robustness. Finally, morphological post-processing improves segmentation cleanliness and accuracy by refining the model outputs. The final combined model achieves both high accuracy and relatively low computational cost, making it suitable for practical soybean–weed segmentation tasks in agricultural environments.
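To make the post-processing concrete, the OpenCV sketch below illustrates the two ideas: a maximum response over multi-orientation Gabor filters and an aspect-ratio filter on connected components. The kernel parameters, the number of orientations, and the ratio bounds are illustrative assumptions, not the values used in this study.

```python
import cv2
import numpy as np

def gabor_direction_response(gray: np.ndarray, n_orient: int = 8) -> np.ndarray:
    """Maximum filter response over n_orient Gabor orientations, which
    emphasises consistent directional texture such as ridge-aligned rows."""
    responses = []
    for i in range(n_orient):
        theta = i * np.pi / n_orient
        # ksize, sigma, theta, lambda, gamma, psi (values illustrative)
        kern = cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5, 0)
        responses.append(cv2.filter2D(gray, cv2.CV_32F, kern))
    return np.max(responses, axis=0)

def aspect_ratio_filter(mask: np.ndarray, lo: float = 0.2,
                        hi: float = 5.0) -> np.ndarray:
    """Drop connected components whose bounding-box aspect ratio falls
    outside plausible bounds for field plants."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask.astype(np.uint8))
    keep = np.zeros_like(mask, dtype=np.uint8)
    for i in range(1, n):  # label 0 is the background component
        w = stats[i, cv2.CC_STAT_WIDTH]
        h = stats[i, cv2.CC_STAT_HEIGHT]
        if lo <= w / max(h, 1) <= hi:
            keep[labels == i] = 1
    return keep
```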
3.4. Simulation Test
In the evaluation of deep learning models, the confusion matrix [29] is a pivotal tool for assessing classification performance. It is presented as a matrix in which rows represent the true labels and columns represent the predicted labels; each element reflects the proportion or count of instances of a given true class predicted as a particular class. Diagonal entries indicate the proportion of correct classifications, while off-diagonal entries represent misclassifications. Inspecting the confusion matrix gives an intuitive picture of the model's false negatives and false positives for each class, thereby guiding improvements.
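Row-normalised matrices of the kind shown in Figure 9 can be derived from flattened label and prediction masks; a brief sketch follows (the bincount idiom is a standard technique, not taken from the authors' code).

```python
import numpy as np

def normalized_confusion(y_true: np.ndarray, y_pred: np.ndarray,
                         n_classes: int = 2) -> np.ndarray:
    """Row-normalised confusion matrix: entry (i, j) is the fraction of
    true-class-i pixels predicted as class j, so each row sums to 1."""
    idx = y_true.ravel() * n_classes + y_pred.ravel()
    conf = np.bincount(idx, minlength=n_classes ** 2)
    conf = conf.reshape(n_classes, n_classes).astype(float)
    return conf / conf.sum(axis=1, keepdims=True)
```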
In the binary segmentation task of soybean-field weed detection, the primary focus is the model's capacity to differentiate between "background" and "weed" pixels. As shown in Figure 9A, the normalized confusion matrix for the original model on the test set indicates that true background pixels are predicted as background with a proportion of 1.00 and as weeds with 0.00, suggesting that the original model recognises the background with high accuracy and no false positives. However, among true weed pixels, only 0.76 are correctly predicted as weeds and 0.24 are misclassified as background, indicating a high rate of false negatives. This may be attributed to the similarity in colour, texture, or morphology between weeds and surrounding crop leaves, soil, or other vegetation; boundary blurring and small-scale targets that are difficult to capture may also contribute.
As shown in Figure 9B, the confusion matrix for the enhanced EDM-UNet on the test set still predicts 100% of true background pixels as background, introducing no additional background false positives. Among true weed pixels, the proportion correctly predicted increases to 0.87, while the misclassification rate as background decreases to 0.13. Compared with the original model, the false negative rate is thus reduced from 24% to 13%, an improvement of 11 percentage points. These results validate the synergistic effect of the multiple improvement modules: EDM-UNet maintains high-precision background recognition while significantly enhancing weed detection, especially under blurred boundaries, complex textures, and background interference.
In summary, without significantly increasing the parameter count or inference cost, the enhanced EDM-UNet model improves the accuracy of weed segmentation in soybean fields and reduces false negatives through the combination of ECA attention, edge detection, and morphological post-processing.
3.5. Detection Model Comparison Test
In the soybean–weed segmentation task, in order to validate the effectiveness of the proposed EDM-UNet model, a comparison was made against current mainstream semantic segmentation architectures (DeepLabv3+, U-Net, PSPNet, and ResNet50-UNet) under the same experimental settings; the detailed results are presented in Table 3. To further analyse model behaviour, radar charts of MIoU, recall, and precision were plotted for the five models (Figure 10).
As shown in Table 3 and Figure 10, EDM-UNet attains the best overall performance, with an MIoU of 89.45%, a recall of 93.53%, and a precision of 94.78%. Compared with DeepLabv3+, it improves MIoU by 7.62, recall by 0.38, and precision by 9.05 percentage points. Relative to U-Net, the gains are 7.36 percentage points in MIoU, 6.49 in recall, and 3.00 in precision. When compared with PSPNet, EDM-UNet improves MIoU, recall, and precision by 18.58, 18.29, and 7.82 percentage points, respectively. Against the baseline ResNet50-UNet model, the improvements are 6.71 percentage points in MIoU, 5.67 in recall, and 3.03 in precision. These gains demonstrate the advantages of EDM-UNet in pixel-level classification accuracy and detection capability, indicating a marked enhancement in region-wise segmentation precision.
In terms of computational complexity, EDM-UNet reaches 40.36 FPS, which is 11.25 FPS lower than the baseline ResNet50-UNet. The parameter count and FLOPs of EDM-UNet are very close to those of ResNet50-UNet: EDM-UNet has 43.942 million parameters and 184.257 GFLOPs, while the baseline has 43.933 million parameters and 184.100 GFLOPs. Although DeepLabv3+ and PSPNet run faster than EDM-UNet and require fewer parameters and operations, with 5.813 million parameters and 52.867 GFLOPs for DeepLabv3+ and 2.376 million parameters and 6.031 GFLOPs for PSPNet, the substantial improvements in accuracy and segmentation quality achieved by EDM-UNet make this computational cost acceptable. In addition, compared with U-Net, EDM-UNet not only achieves better performance on all evaluation metrics, but also reduces the FLOPs from 451.672 G to 184.257 G while maintaining a reasonable parameter scale and inference speed.
In summary, the proposed EDM-UNet model significantly enhances the accuracy and robustness of soybean–weed segmentation without incurring excessive complexity or computational cost. Its superior overall performance and practical inference speed make it well suited for deployment in resource-limited agricultural scenarios.
3.6. Detection Performance in Complex Natural Scenes
The present study evaluates model performance in three weed-density scenarios: sparse, moderate, and dense weeds, comparing the segmentation results of DeepLabv3+, U-Net, PSPNet, ResNet50-UNet, and EDM-UNet on images of soybeans and weeds. The collated dataset encompasses imagery of soybean plants under diverse weed conditions. The segmentation outcomes for sparse-weed cases by EDM-UNet and the other four networks are shown in Figure 11. U-Net, ResNet50-UNet, and EDM-UNet achieve high detection rates in the presence of sparse weeds, whereas DeepLabv3+ and PSPNet exhibit a high rate of false positives, with PSPNet in particular misclassifying a significant number of soybean plants as weeds.
In scenarios with moderate weed levels, PSPNet frequently misidentifies soybean plants as weeds while also failing to detect a significant number of weeds. Although U-Net and ResNet50-UNet produce no false positives, their capacity for weed segmentation remains limited, yielding low segmentation accuracy. Both DeepLabv3+ and EDM-UNet tend to overlook small weeds on the right-hand side of the images; these small targets are difficult to capture owing to the constraints imposed by the UAV flight altitude.
Under dense vegetation, PSPNet, U-Net, and ResNet50-UNet persistently generate false positives and false negatives, and DeepLabv3+ occasionally merges multiple weed instances into a single region. In contrast, EDM-UNet attains the highest detection rate with a low false negative rate, missing only extremely small weed instances.
EDM-UNet demonstrates a significant enhancement in segmentation performance compared with the original ResNet50-UNet. Unlike PSPNet, which suffers severe false positives by misclassifying soybeans as weeds, and unlike U-Net and ResNet50-UNet, which suffer pronounced false negatives and low accuracy across weed scenarios, EDM-UNet precisely focuses on critical features through its ECA attention mechanism and effectively avoids misclassifying soybeans as weeds. Compared with DeepLabv3+, which can merge adjacent weed regions under dense conditions, EDM-UNet's integrated edge-detection module enhances boundary recognition, and its post-processing stage suppresses noisy misclassifications arising from irregular weed orientations while removing geometrically implausible false positives. Consequently, EDM-UNet maintains a high weed detection rate and, in both moderate and dense weed scenarios, yields segmentation results with superior accuracy, purity, and shape conformity relative to U-Net, ResNet50-UNet, PSPNet, and DeepLabv3+. Given the UAV's flight altitude, only extremely small weeds remain difficult to capture.
In conclusion, EDM-UNet has been demonstrated to exhibit both robust and effective soybean–weed segmentation capabilities across a range of weed densities. This renders it well suited for deployment in complex field environments.
5. Conclusions
In this study, we employed UAV-based remote sensing to collect images of soybean fields containing weeds. We then constructed a soybean–weed segmentation dataset and developed the EDM-UNet model for precise weed segmentation.
We integrated a lightweight ECA attention module, Canny-based edge guidance in the decoder, and a dual-constraint post-processing stage to strengthen boundary recognition and suppress non-target artefacts. The experimental findings on the test set demonstrate that the proposed EDM-UNet attains an MIoU of 89.45%, a recall of 93.53%, and a precision of 94.78%. With regard to inference speed, the model achieves 40.36 FPS, meeting real-time detection requirements. Compared with the baseline ResNet50-UNet, EDM-UNet improves MIoU, recall, and precision by 6.71, 5.67, and 3.03 percentage points, respectively, with an acceptable FPS reduction of 11.25 FPS. Visualisation analyses confirm that the model segments weeds in soybean fields with high efficiency and is therefore suitable for practical deployment. The proposed method has the potential to reduce the financial burden of manual inspection and to provide essential decision support for variable-rate spraying systems. To broaden applicability, future work will pursue transfer learning and domain adaptation to improve cross-site and cross-crop generalisation, and will investigate model compression and edge deployment for true onboard, real-time UAV operation.