BAE-UNet: A Background-Aware and Edge-Enhanced Segmentation Network for Two-Stage Pest Recognition in Complex Field Environments

Chang, Jing; Li, Xuefang; Ze, Xingye; Ding, Xue; Gong, He

doi:10.3390/agronomy16020166

by Jing Chang ¹,*, Xuefang Li ¹, Xingye Ze ¹, Xue Ding ² and He Gong ¹,³,*

¹ College of Information Technology, Jilin Agricultural University, Changchun 130118, China
² School of Mathematics, Jilin University, Changchun 130012, China
³ Jilin Province Intelligent Environmental Engineering Research Center, Changchun 130118, China
* Authors to whom correspondence should be addressed.

Agronomy 2026, 16(2), 166; https://doi.org/10.3390/agronomy16020166

Submission received: 20 November 2025 / Revised: 1 January 2026 / Accepted: 2 January 2026 / Published: 8 January 2026

(This article belongs to the Section Pest and Disease Management)

Abstract

To address issues such as significant scale differences, complex pose variations, strong background interference, and similar category characteristics of pests in the images obtained from field traps, this study proposes a pest recognition method based on a two-stage “segmentation–detection” approach to improve the accuracy of field pest situation monitoring. In the first stage, an improved segmentation model, BAE-UNet (Background-Aware and Edge-Enhanced U-Net), is adopted. Based on the classic U-Net framework, a Background-Aware Contextual Module (BACM), a Spatial-Channel Refinement and Attention Module (SCRA), and a Multi-Scale Edge-Aware Spatial Attention Module (MESA) are introduced. These modules respectively optimize multi-scale feature extraction, background suppression, and boundary refinement, effectively removing complex background information and accurately extracting pest body regions. In the second stage, the segmented pest body images are input into the YOLOv8 model to achieve precise pest detection and classification. Experimental results show that BAE-UNet performs excellently in the segmentation task, achieving an mIoU of 0.930, a Dice coefficient of 0.951, and a Boundary F1 of 0.943, significantly outperforming both the baseline U-Net and mainstream models such as DeepLabV3+. After segmentation preprocessing, the detection performance of YOLOv8 is also significantly improved. The precision, recall, mAP50, and mAP50–95 increase from 0.748, 0.796, 0.818, and 0.525 to 0.958, 0.971, 0.977, and 0.882, respectively. The results verify that the proposed two-stage recognition method can effectively suppress background interference, enhance the stability and generalization ability of the model in complex natural scenes, and provide an efficient and feasible technical approach for intelligent pest trap image recognition and pest situation monitoring.

Keywords: pest recognition; two-stage model; BAE-UNet; YOLOv8

1. Introduction

Agricultural pests are the primary biological agents that continuously threaten crop yield and quality, posing a severe challenge to the stability of agricultural production and food security. Against this backdrop, the real-time and accurate monitoring of field pests to achieve precise prediction and effective early warning has become an urgent need for ensuring safe agricultural production. With the development of smart agriculture technologies, pest-monitoring systems based on image recognition have emerged as important means to achieve precise pest control [1]. Among them, light traps, with their automated image-acquisition capabilities, provide crucial technical support for real-time pest-situation monitoring and early warning [2]. However, the images collected by traps are often affected by the complex field environment. Although the equipment can reduce some background interference through an automatic cleaning track or manual intervention, it is difficult to completely avoid noises such as insect debris, dust, mud, and water stains adhering to the track, resulting in a still cluttered image background. The pests themselves exhibit significant scale differences, diverse postures, and high inter-class feature similarities. These issues collectively restrict the performance of traditional recognition models in real-world scenarios. Therefore, improving the robustness of models in complex backgrounds at the algorithm level has become a key issue in promoting the practical application of automatic pest-recognition systems.

Early pest-recognition research mainly relied on hand-designed features (such as color, texture, and morphological features) [3,4] combined with machine-learning classifiers [5,6,7] (such as support vector machines [8] and AdaBoost [9]). Although these methods achieved reasonable results in controlled environments, their performance was limited in complex field environments. For example, Liu et al. (2021) pointed out that traditional feature extraction methods struggle to handle pest-recognition problems under natural lighting changes and cluttered backgrounds [10]. Because their feature-representation capabilities are limited, such methods adapt poorly to complex natural scenes and have gradually been replaced by end-to-end methods based on deep learning. With the breakthroughs in deep learning, convolutional neural networks (CNNs) [11] have been widely applied to pest-classification tasks. By fine-tuning pre-trained models (such as VGG [12] and ResNet [13]), researchers have significantly improved classification performance. Li et al. (2022) used a residual network for transfer learning and achieved high accuracy in the pest-classification task [14]. However, their research also indicated that, without explicit localization constraints, the pure classification paradigm still lacks robustness to complex backgrounds. To overcome this limitation, subsequent studies gradually introduced object-detection methods (such as Faster R-CNN [15] and the YOLO series [16]) into the field of pest recognition to achieve localization and classification simultaneously. Wen et al. (2022) proposed the Pest-YOLO model for the detection of large-scale, multi-category, dense, and tiny pests. It showed an approximately 5.3% improvement over YOLOv4 on the Pest24 dataset, verifying the effectiveness of deep-detection models in handling complex objects [17].

However, when the target is small or resembles the background in color and texture, detection performance still drops significantly. The fundamental reason is that the complex background is encoded into the feature representation together with the target, interfering with the model’s learning of discriminative features of the pest body [18]. To suppress the interference of background noise, some studies have started to explore a two-stage “segmentation-detection” strategy [19]. This strategy first extracts clean pest body regions through semantic segmentation and then performs detection or classification on them, weakening the impact of background information at the feature level [20]. Biradar et al. (2024) combined U-Net and DenseNet to construct a segmentation-detection framework, significantly improving the detection accuracy in complex field environments [21]. Mu et al. (2024) further pointed out that even on edge devices, the two-stage method can effectively alleviate the detection difficulties of small and overlapping targets, significantly enhancing the model’s generalization ability [22]. However, most existing two-stage frameworks adopt generic segmentation backbones and treat the segmentation and detection stages as sequential yet loosely coupled processes, which limits their ability to jointly address background clutter suppression, boundary ambiguity, and feature inconsistency when dealing with small or morphologically similar pest species.

Although existing research has achieved some progress, there are still numerous challenges in achieving high-precision pest identification. Current segmentation networks have insufficient adaptability to multi-scale pest targets. In the feature extraction process, background suppression is incomplete, and noise pollution seriously affects subsequent identification. There is also a lack of fine-grained segmentation ability for pest boundaries, making it difficult to distinguish morphologically similar species. Therefore, how to improve the multi-scale adaptability and background robustness of the model while maintaining high-precision segmentation has become a key issue that needs to be addressed in the research of automatic pest identification.

To address the above problems, this study proposes a two-stage pest identification method for complex scenarios, with the following main contributions:

1. The BAE-UNet segmentation model is proposed, which integrates three key modules: BACM, SCRA, and MESA. The segmentation stage is designed in a task-driven collaborative manner, in which BACM, SCRA, and MESA respectively focus on multi-scale feature representation, background noise suppression, and boundary-aware feature enhancement, thereby collaboratively strengthening the support of segmentation results for subsequent detection tasks. As a result, the model effectively removes complex field background interference and accurately extracts pest regions, providing high-quality feature inputs for subsequent detection tasks.
2. A detection stage based on YOLOv8 is constructed. Using the pest regions generated by segmentation as input, it achieves accurate localization and category discrimination while reducing the interference of complex backgrounds on the detection task. This significantly improves the model’s robustness and generalization in scenarios with multi-scale and multi-pose pests.
3. The method is systematically validated on a self-built dataset. Experimental results on this dataset show that the detection accuracy and robustness of the proposed two-stage method are superior to those of mainstream single-stage models, providing reliable data support and practical reference for the field implementation of intelligent pest monitoring systems.

2. Materials and Methods

2.1. Materials

2.1.1. Image Data Collection

Data were collected from July to September, the pest-active period, in each year from 2023 to 2025 in multiple paddy and corn fields in Shuangyang District and Jiutai District, Changchun City, Jilin Province. A light-trapping device was employed to attract pests based on their phototaxis, with the main capture targets being adult individuals of various pests. Rapid inactivation was achieved with a controllable electric heating plate, which preserves the morphological characteristics of the specimens and provides reliable samples for subsequent image acquisition and model detection.

Image acquisition was carried out using the MV-CE2001UC industrial camera, which has a resolution of 4024 × 3036 pixels and is manufactured by Hikvision, an enterprise based in Hangzhou, China. All samples were photographed on a standardized track. Although there is no color confusion between the insects and the track background, the background contains obvious noise, including insect debris, dust deposition, mud, and water stains. In addition, the postures of the pest bodies vary widely, including open and closed wings and ventral and dorsal orientations. Examples of the original dataset and the diverse postures of pests are shown in Figure 1.

2.1.2. Dataset Construction

Based on the practical needs of agricultural pest monitoring and regional representativeness, this study selected four major pests, Ostrinia furnacalis [23], Chilo suppressalis [24], Xestia c-nigrum [25], and Gryllotalpidae [26], as the research subjects. The reasons for their selection are as follows:

1. These species are of significant economic importance in the crop ecosystems of Northeast China.
2. Their morphological characteristics exhibit a high degree of diversity, including variations in body size, morphological structure, and phenotypic traits. For example, Gryllotalpidae are relatively large, while moths such as Ostrinia furnacalis and Chilo suppressalis are slender, which helps to assess the model’s adaptability to multi-scale targets.
3. Some species have similar morphologies. For instance, Ostrinia furnacalis and Chilo suppressalis appear similar from certain viewpoints, providing challenging samples for the model to distinguish subtle morphological differences.

Initially, 3000 images were collected. To ensure the quality of the samples, all images were uniformly cropped to 800 × 800 pixels, and those without experimental targets or with severely blurred targets were excluded. Ultimately, 2667 images were retained to form the basic dataset. This study adopted a two-stage strategy to construct datasets for the training and validation of the segmentation model (BAE-UNet) and the detection model (YOLOv8). The data construction process is shown in Table 1.

In the first stage, 1452 images were selected from the basic dataset for the training and validation of the BAE-UNet model. Among these, 950 images were used as the training set, and 502 as the validation set. All images were labeled pixel by pixel using X-AnyLabeling 2.4.1, and each selected image contained at least one pest target with a complete shape and clear features. To improve the model’s generalization ability, data augmentation was performed only on the training set images. After augmentation, the training set increased to 2850 images, and the first-stage dataset consisted of a total of 3352 images. The remaining 1215 images were used as the test set. This test set was strictly isolated during the model development phase; it was used to evaluate generalization ability and then served as input for the second-stage detection model.

In the second stage, a YOLOv8 detection dataset was constructed from the segmentation results of the BAE-UNet model on the independent test set from the first stage. It contained a total of 1215 images, which were divided into a training set of 972 images and a validation set of 243 images. X-AnyLabeling 2.4.1 was also used to label the bounding boxes of all pest bodies. To keep the evaluation of the segmentation results faithful, no data augmentation was performed at this stage, so that changes introduced by augmentation could not affect the segmentation predictions. This hierarchical construction strategy ensured that both models were trained on high-quality data, providing reliable data support for subsequent model performance evaluation. The overall experimental process is shown in Figure 2.
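The split sizes quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch below only restates counts given in the text; the variable names are illustrative, and the augmentation factor is inferred from the growth of the training set (950 to 2850) rather than stated by the authors.

```python
# Counts quoted in Section 2.1.2; variable names are illustrative.
collected = 3000
retained = 2667                          # after removing empty or blurred crops

seg_train, seg_val = 950, 502            # stage 1 (BAE-UNet), before augmentation
seg_total = seg_train + seg_val          # images selected for segmentation

seg_train_aug = 2850                     # training set after augmentation
aug_factor = seg_train_aug / seg_train   # inferred, not stated in the text

stage1_total = seg_train_aug + seg_val   # size of the stage-1 dataset
held_out = retained - seg_total          # images reserved as the test set

det_train, det_val = 972, 243            # stage 2 (YOLOv8) split of the held-out set
det_total = det_train + det_val

print(f"discarded during cleaning: {collected - retained}")
print(f"stage-1 selection: {seg_total}, after augmentation: {stage1_total}")
print(f"inferred augmentation factor: {aug_factor:.1f}")
print(f"held-out images: {held_out}, stage-2 total: {det_total}")
print(f"stage-2 training share: {det_train / det_total:.2f}")
```

The held-out count and the stage-2 total agree, which is consistent with the statement that the detection dataset is built entirely from the isolated test set.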

2.1.3. Data Augmentation

To enhance the generalization and robustness of the model, this study implemented a targeted data augmentation strategy for pest bodies. First, the pest bodies were precisely segmented from the original images. Subsequently, random rotation and scaling (with a scale range of 0.8 to 1.2 times) were applied to the pest bodies to simulate natural variations in multiple angles and scales. To avoid overlap among the augmented samples, the augmentation script automatically detected the overlap between the pest body area and the original annotation area before the pasting operation. Random pasting was then performed only in non-overlapping areas.

To maintain ecological plausibility, the pasting process follows the natural spatial distribution patterns of field pests, avoiding unnatural overlap or aggregation. The processed pest bodies were pasted at random positions on their own original background images, which preserves realistic background–foreground relationships, mitigates potential distribution bias, and generates new samples with diverse spatial distributions, as shown in Figure 3.
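The overlap-aware pasting step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' augmentation script: the function names, the rejection-sampling loop, and the `max_tries` limit are hypothetical, nearest-neighbour rescaling stands in for whatever interpolation was actually used, and `np.rot90` is only a placeholder for arbitrary-angle rotation.

```python
import numpy as np

def rescale_nearest(patch, patch_mask, scale):
    """Nearest-neighbour rescale of a pest cut-out and its mask (assumed interpolation)."""
    h, w = patch_mask.shape
    new_h, new_w = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return patch[rows][:, cols], patch_mask[rows][:, cols]

def paste_without_overlap(image, occupied, patch, patch_mask, rng, max_tries=50):
    """Paste `patch` at a random position that does not touch `occupied`.

    image:      (H, W, 3) uint8 background image (a copy is modified)
    occupied:   (H, W) bool mask of pixels already covered by annotated pests
    patch:      (h, w, 3) uint8 cut-out of one pest body
    patch_mask: (h, w) bool mask of the pest pixels inside the patch
    Returns (new_image, new_occupied, (top, left)); position is None on failure.
    """
    H, W = occupied.shape
    h, w = patch_mask.shape
    if h > H or w > W:
        return image, occupied, None
    for _ in range(max_tries):
        top = int(rng.integers(0, H - h + 1))
        left = int(rng.integers(0, W - w + 1))
        window = occupied[top:top + h, left:left + w]
        if np.any(window & patch_mask):          # would overlap an existing pest
            continue
        out, occ = image.copy(), occupied.copy()
        out[top:top + h, left:left + w][patch_mask] = patch[patch_mask]
        occ[top:top + h, left:left + w] |= patch_mask
        return out, occ, (top, left)
    return image, occupied, None

# Toy example: a blank track image with one annotated pest region.
rng = np.random.default_rng(0)
image = np.zeros((64, 64, 3), dtype=np.uint8)
occupied = np.zeros((64, 64), dtype=bool)
occupied[20:40, 20:40] = True
patch = np.full((10, 12, 3), 200, dtype=np.uint8)
patch_mask = np.ones((10, 12), dtype=bool)

scale = float(rng.uniform(0.8, 1.2))             # scale range given in the text
patch, patch_mask = rescale_nearest(patch, patch_mask, scale)
k = int(rng.integers(0, 4))                      # placeholder for random rotation
patch, patch_mask = np.rot90(patch, k), np.rot90(patch_mask, k)

new_image, new_occupied, position = paste_without_overlap(
    image, occupied, patch, patch_mask, rng)
print("pasted at", position)
```

Updating the occupancy mask after each paste means that several augmented bodies can be added to one image without overlapping each other or the original annotations.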

This method efficiently expands the number of training samples and encourages the model to focus on the morphological features of the pest bodies themselves rather than on fixed background positions, which improves both its ability to suppress complex background interference and its feature generalization. Alternative augmentation strategies such as color jittering and Gaussian blur were considered but not adopted, as the pasting-based strategy was found to be more suitable for enhancing appearance diversity while preserving discriminative morphological features under complex field conditions.

2.2. Methods

2.2.1. BAE-UNet

To effectively address the core challenges of pest target segmentation in complex backgrounds, such as variable scales, background interference, and blurred boundaries, this study systematically improved the classic U-Net [27] framework and constructed a Background-Aware and Edge-Enhanced U-Net (BAE-UNet). The overall structure of the model is shown in Figure 4.

The traditional U-Net adopts a symmetric encoder–decoder structure and fuses shallow-layer details with deep-layer semantic information through skip connections, showing excellent performance in fields such as medical image segmentation. However, when directly applied to field pest segmentation, it exhibits several limitations:

Its encoder typically uses a stack of standard convolutions with a fixed receptive field, making it difficult to simultaneously adapt to the large-scale variations of pest targets.
The shallow-layer features transmitted directly by skip connections contain substantial background clutter, which may interfere with the decoder’s restoration of target details.
The feature map at the end of the decoder lacks a mechanism to strengthen key boundary information, resulting in blurred predicted boundaries where pests and background are mixed.

To retain the advantages of the classic U-shaped framework and skip connections of U-Net, this study selected VGG16 as the main encoder network to fully leverage its powerful feature extraction capabilities and hierarchical representation. This choice is motivated by its simple and stable architecture, which facilitates effective feature reuse and preserves fine-grained spatial details that are critical for small pest targets and boundary-sensitive segmentation. Compared with more recent or lightweight backbones, VGG16 provides a favorable balance between representational capacity and implementation reliability, and preliminary empirical observations indicated no clear performance advantage when replacing it with deeper or heavily compressed alternatives under the same training settings. Moreover, the moderate computational complexity of VGG16 ensures compatibility with the subsequent lightweight detection stage and supports practical deployment scenarios. Although VGG16 has a relatively large number of parameters, lightweight improvement modules were embedded at key levels. Through depth-separable convolutions and an adaptive fusion mechanism, overall complexity was effectively controlled, achieving a balance between performance and efficiency.

At the bottleneck stage, the Background-Aware Contextual Module (BACM) was embedded. This module is located at the bottleneck layer, which contains the richest feature semantics but the lowest spatial resolution, and aims to enhance the model’s adaptive perception of multi-scale targets. BACM extracts contextual information using parallel multi-scale depth-separable convolution branches (3 × 3, 5 × 5, 7 × 7) and introduces a learnable dynamic weight fusion mechanism. It adaptively adjusts the importance of each scale branch according to the input features, thereby representing multi-scale structures from subtle antennae to the overall torso. At the skip connections, the Spatial-Channel Refinement and Attention (SCRA) module was introduced to reduce interference from background noise in shallow-layer features. This module first generates a spatial weight map to suppress irrelevant backgrounds, then applies channel attention to the purified features. By modeling global relationships, it enhances the target response intensity, achieving background purification and semantic enhancement during feature fusion. At the end of the decoder, the Multi-Scale Edge-Aware Spatial Attention (MESA) module was integrated to optimize boundary prediction. MESA employs multi-scale parallel depth-separable convolutions (3 × 3, 5 × 5, 7 × 7) to capture edge information at different levels. Through an explicit edge-bias-guided spatial attention mechanism, it adaptively strengthens the target contour area, significantly improving the clarity and continuity of boundaries between pests and the background.

BAE-UNet systematically enhances segmentation ability in complex backgrounds from three aspects: feature extraction, feature fusion, and boundary optimization. Through depth-separable convolutions and a modular lightweight design, it significantly improves background suppression and boundary recognition while effectively controlling the number of parameters and computational overhead. This achieves a good balance between performance and efficiency and lays a solid foundation for subsequent pest-body recognition tasks.
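The parameter saving from depth-separable convolutions can be made concrete with a short calculation. The sketch below assumes the usual factorization (one k × k depthwise filter per input channel followed by a 1 × 1 pointwise convolution) and ignores biases; the text does not spell out the exact layer layout, and the channel width of 512 is only an illustrative choice matching the deepest VGG16 blocks.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """One k x k depthwise filter per input channel plus a 1 x 1 pointwise mix."""
    return c_in * k * k + c_in * c_out

channels = 512  # illustrative width, not a value reported for BAE-UNet
for k in (3, 5, 7):
    std = standard_conv_params(channels, channels, k)
    sep = depthwise_separable_params(channels, channels, k)
    print(f"{k}x{k}: standard={std:,}  separable={sep:,}  ratio={std / sep:.1f}x")
```

Under these assumptions the saving grows with the kernel size, which is why the 5 × 5 and 7 × 7 branches add comparatively little to the parameter budget.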

2.2.2. BACM

Pest individuals in the images differ significantly in scale: some are large and complex in structure, while others occupy only a tiny area and have extremely fine antennae and wing edges. Traditional convolutional neural networks (CNNs) usually adopt a fixed receptive field. If the receptive field is too small, it cannot cover the entire target; if it is too large, local details are lost, resulting in limited feature representation capabilities. To solve this problem, this study designed the BACM and introduced it at the bottleneck layer of U-Net to achieve adaptive fusion of multi-scale features.

The structure of the BACM module is shown in Figure 5. The input feature map $F_{in} \in \mathbb{R}^{C \times H \times W}$ first passes through a 1 × 1 convolutional layer for channel adjustment and feature transformation. Subsequently, the feature is fed into three parallel branches. Each branch contains depth-separable convolutions with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, respectively, to capture local details, local context, and global context information. The outputs of the three branches are adaptively fused through learnable weights and then passed through a residual connection to facilitate gradient propagation and feature reuse. The forward-propagation process can be expressed by the following formula:

$$F_{out} = F_{in} + \sum_{i=1}^{3} w_{i} \cdot \mathrm{DWConv}_{k_{i}}(\mathrm{Conv}_{1 \times 1}(F_{in}))$$

(1)

where $\mathrm{DWConv}_{k_{i}}$ represents the depth-separable convolution operation with a kernel size of $k_{i}$, and $w_{i}$ is the corresponding learnable weight. The key innovation of BACM lies in its dynamic weight-learning mechanism. The weight parameters $w = [w_{1}, w_{2}, w_{3}]$ are adaptively optimized through back-propagation during training. These weights are initialized with equal values and are learned in an unconstrained manner without explicit normalization, allowing flexible scale adaptation. In practice, the residual connection within BACM helps stabilize the optimization process and prevents potential dominance of a single scale branch during training. This enables the model to adjust the contribution ratios of different scale branches according to the scale distribution of the input features. For example, when the input contains small pests, the model assigns a higher weight to the 3 × 3 branch to focus on details. For larger pests, it enhances the responses of the 5 × 5 or 7 × 7 branches to capture broader context information. Through this adaptive multi-scale fusion, BACM significantly improves the model’s representation ability for multi-scale targets at the bottleneck layer, providing a more robust semantic foundation for subsequent decoding and fine-grained segmentation.
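Equation (1) can be traced step by step with a framework-free NumPy forward pass. This is a sketch under stated assumptions rather than the authors' implementation: each branch is modelled as a plain depthwise convolution with zero padding, normalization and activation layers are omitted, the 1 × 1 convolution keeps the channel count unchanged so that the residual addition is valid, and the initial scale weights are set to 1.0 because the text only says they start equal.

```python
import numpy as np

def conv1x1(x, weight):
    """1 x 1 convolution: mixes channels at every pixel.
    x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def depthwise_conv(x, kernels):
    """Depthwise k x k convolution with zero 'same' padding.
    x: (C, H, W), kernels: (C, k, k); each channel uses its own filter."""
    c, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def bacm_forward(f_in, w_1x1, branch_kernels, scale_weights):
    """F_out = F_in + sum_i w_i * DWConv_{k_i}(Conv_1x1(F_in)), following Equation (1)."""
    shared = conv1x1(f_in, w_1x1)
    fused = sum(w * depthwise_conv(shared, kern)
                for w, kern in zip(scale_weights, branch_kernels))
    return f_in + fused

# Toy tensors; the shapes and random values are illustrative only.
rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
f_in = rng.standard_normal((C, H, W))
w_1x1 = rng.standard_normal((C, C)) * 0.1
branch_kernels = [rng.standard_normal((C, k, k)) * 0.1 for k in (3, 5, 7)]
scale_weights = np.ones(3)   # equal start, no softmax or other normalization

f_out = bacm_forward(f_in, w_1x1, branch_kernels, scale_weights)
print(f_out.shape)
```

In a trainable version the three entries of `scale_weights` would be parameters updated by back-propagation; setting all of them to zero reduces the module to the identity through the residual path.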

2.2.3. SCRA

In pest trap images, the background often contains complex interference such as soil, water stains, or shadows. Such noise is particularly prominent in shallow-layer features. If these features are passed directly to the decoder through skip connections without processing, they can seriously affect the semantic restoration of the target area. Traditional channel-attention mechanisms (such as the SE module [28]) only model channel relationships through global average pooling (GAP) [29], ignoring the non-uniformity of spatial distribution. As a result, the generated channel weights may be contaminated by a large amount of background noise, failing to optimally serve the foreground target features and limiting the segmentation accuracy of pest targets in complex backgrounds.

To address this problem, this study proposes the SCRA module and embeds it in the skip connections of U-Net to perform spatial purification and channel recalibration on shallow-layer features before feature fusion. The module structure is shown in Figure 6 and is divided into two stages:

1. Spatial Adaptive Weighting. The input feature map $X \in \mathbb{R}^{C \times H \times W}$ first passes through a 1 × 1 convolutional layer to compress the channel dimension and generate a spatial response map $S$. After passing through the Sigmoid activation function, a spatial weight map is obtained:

$$S = \sigma(\mathrm{Conv}_{1 \times 1}(X))$$

(2)

Subsequently, enhancement of the target area and suppression of the background are achieved through element-wise multiplication:

$$X' = X \odot S$$

(3)

where $\sigma$ represents the Sigmoid function, and $\odot$ represents element-wise multiplication.

2. Channel Relationship Modeling. On the spatially weighted feature $X'$, the module uses a linear transformation to generate query ($Q$), key ($K$), and value ($V$) vectors, and captures the global channel-dependent relationships through an attention mechanism:

$$A = \mathrm{Softmax}(Q K^{T})$$

(4)

$$F_{out} = A \cdot V$$

(5)

where the attention matrix $A$ adaptively depicts the importance distribution among channels, achieving enhancement of significant features and suppression of redundant features.

The core of the SCRA module lies in the collaborative mechanism of “spatial processing first, then channel recalibration.” Unlike the traditional SE module, it introduces spatial weighting before global modeling, weakening background interference at the source so that the channel attention can focus more on foreground target features. This design significantly improves the model’s feature discriminability and segmentation accuracy in complex backgrounds.
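The two SCRA stages in Equations (2) to (5) can be followed in a small NumPy forward pass. The sketch makes several assumptions that the text leaves open: the 1 × 1 convolution compresses the features to a single spatial map, Q, K, and V are produced by channel-mixing linear maps applied to the flattened features so that the attention matrix is C × C, and no scaling factor or residual path is added because none appears in the printed equations. All function and parameter names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scra_forward(x, w_spatial, w_q, w_k, w_v):
    """x: (C, H, W); w_spatial: (C,); w_q, w_k, w_v: (C, C)."""
    c, h, w = x.shape
    # Stage 1: 1 x 1 conv compresses C channels into one map, sigmoid gives S (Eq. 2).
    s = sigmoid(np.einsum("c,chw->hw", w_spatial, x))
    x_prime = x * s                                  # Eq. 3, S broadcast over channels
    # Stage 2: channel attention on the spatially weighted features.
    flat = x_prime.reshape(c, h * w)                 # one row per channel
    q, k, v = w_q @ flat, w_k @ flat, w_v @ flat     # each (C, H*W)
    attn = softmax(q @ k.T, axis=-1)                 # (C, C), Eq. 4
    out = attn @ v                                   # (C, H*W), Eq. 5
    return out.reshape(c, h, w), s, attn

# Toy tensors; the shapes and random values are illustrative only.
rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
x = rng.standard_normal((C, H, W))
w_spatial = rng.standard_normal(C) * 0.1
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

f_out, s_map, attn = scra_forward(x, w_spatial, w_q, w_k, w_v)
print(f_out.shape, s_map.shape, attn.shape)
```

Because the spatial gate is applied before Q, K, and V are formed, pixels with a low weight in `s_map` contribute little to the channel-similarity matrix, which is the ordering the text contrasts with the SE module.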

2.2.4. MESA

After obtaining a feature map with rich multi-scale features and effectively suppressed background noise through the BACM and SCRA modules, there may still be blurry segmentation in areas where pest and background boundaries are mixed. This is because traditional spatial attention mechanisms (such as CBAM [30]) rely on single-scale convolution and pooling operations, making it difficult to fully capture fine-grained edge cues and multi-scale contour information. Therefore, this study designed the MESA module at the end of the decoder to achieve pixel-level boundary refinement and enhancement.

The structure of MESA is shown in Figure 7. The input feature map $X \in \mathbb{R}^{C \times H \times W}$ passes through three parallel depth-separable convolution branches (with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, respectively) to simultaneously extract fine-edge, local context, and overall shape information. The outputs of each branch are fused by element-wise addition, combined with a learnable edge bias term $b_{edge}$, and then a spatial attention map $A_{s}$ is generated through the Sigmoid activation function:

$$A_{s} = \sigma(\mathrm{DWConv}_{3 \times 3}(X) + \mathrm{DWConv}_{5 \times 5}(X) + \mathrm{DWConv}_{7 \times 7}(X) + b_{edge})$$

(6)

Subsequently, the edge-enhanced output feature is obtained through element-wise multiplication:

$$F_{out} = X \odot A_{s}$$

(7)

where $\sigma$ is the Sigmoid function, $\mathrm{DWConv}_{k \times k}$ represents the depth-separable convolution with a kernel size of $k$, $b_{edge}$ is the learnable edge-bias term, and $\odot$ denotes element-wise multiplication.

To further strengthen the synergy among channels, MESA is subsequently connected to a lightweight channel-attention sub-module. Based on the QKV mechanism, it models the channel-dependency relationships and fuses the results with the spatial-attention outcome, thus optimizing boundary perception in both spatial and channel dimensions.

The innovation of MESA lies in its construction of a multi-scale spatial-channel collaborative attention mechanism oriented towards boundary optimization. Among the three branches, the 3 × 3 convolution branch focuses on pixel-level edge details, the 5 × 5 branch connects broken contour segments, and the 7 × 7 branch perceives the overall shape trend. Through the complementarity of multi-scale features and explicit edge guidance, this module can adaptively enhance the boundary response between pests and the background. Compared with CBAM, MESA significantly improves edge-recognition ability while maintaining a lightweight structure, further enhancing the accuracy and stability of the model in fine-grained segmentation tasks.
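The spatial-attention part of MESA, Equations (6) and (7), can be sketched in NumPy as follows. The sketch follows the equations as printed, so the attention map has the same shape as the input; the branches are modelled as plain depthwise convolutions, `b_edge` is treated as a single scalar, and the subsequent QKV channel-attention sub-module is left out. These choices are assumptions, since the text does not fix them.

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Depthwise k x k convolution with zero 'same' padding.
    x: (C, H, W), kernels: (C, k, k)."""
    c, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def mesa_spatial_attention(x, branch_kernels, b_edge):
    """A_s = sigmoid(sum_k DWConv_k(X) + b_edge); F_out = X * A_s (Eqs. 6 and 7)."""
    logits = sum(depthwise_conv(x, kern) for kern in branch_kernels) + b_edge
    a_s = 1.0 / (1.0 + np.exp(-logits))
    return x * a_s, a_s

# Toy tensors; the shapes and random values are illustrative only.
rng = np.random.default_rng(0)
C, H, W = 4, 12, 12
x = rng.standard_normal((C, H, W))
branch_kernels = [rng.standard_normal((C, k, k)) * 0.1 for k in (3, 5, 7)]
b_edge = 0.0   # scalar bias assumed; it could equally be per-channel

f_out, a_s = mesa_spatial_attention(x, branch_kernels, b_edge)
print(f_out.shape, float(a_s.min()), float(a_s.max()))
```

With all kernels at zero the gate is 0.5 everywhere, so any sharpening of boundaries has to come from what the three branches learn; the bias term shifts the default gate value up or down.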

2.2.5. YOLOv8

After completing pest-body segmentation, to further achieve accurate identification of different pest species, this study adopted YOLOv8 [31] as the detection model for the second stage. YOLOv8 is an object-detection and classification framework released by the Ultralytics team. It offers advantages such as fast detection speed, high recognition accuracy, and easy deployment. It can balance recognition performance in scenarios with small targets and complex backgrounds while maintaining efficient inference.

In this stage, no modifications were made to the YOLOv8 network structure. Instead, the official pre-trained model was directly used for transfer learning to fully utilize its mature feature-extraction and classification capabilities. The structure diagram is shown in Figure 8. By conducting detection training on the pest-body regions obtained in the segmentation stage, an end-to-end two-stage processing flow from pest-body extraction to species identification can be achieved, thereby enabling accurate identification and analysis of various field pests.
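The hand-off between the two stages can be illustrated with a short background-suppression step. How the segmentation output is applied to the image before detection is not specified in the text, so the sketch below rests on an assumption: background pixels are overwritten with a constant value (black by default) and pest pixels are kept unchanged. The function name and the `fill_value` parameter are hypothetical.

```python
import numpy as np

def suppress_background(image, mask, fill_value=0):
    """Keep pixels where `mask` is true and overwrite the rest with `fill_value`.

    image: (H, W, 3) uint8 trap image
    mask:  (H, W) binary segmentation mask predicted by the first stage
    """
    mask = mask.astype(bool)
    out = np.full_like(image, fill_value)
    out[mask] = image[mask]
    return out

# Toy example: a noisy 6 x 6 image with a 2 x 3 "pest" region.
rng = np.random.default_rng(0)
image = rng.integers(1, 256, size=(6, 6, 3), dtype=np.uint8)
mask = np.zeros((6, 6), dtype=np.uint8)
mask[2:4, 1:4] = 1

clean = suppress_background(image, mask)
print(int((clean.sum(axis=-1) > 0).sum()), "foreground pixels kept")
```

The resulting image keeps its original size, so bounding-box annotations drawn on it remain aligned with the original trap image.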

2.3. Evaluation Metrics and Experimental Setup

2.3.1. Evaluation Metrics

To comprehensively evaluate the performance of the two-stage pest identification model proposed in this study at each stage, a multi-dimensional performance evaluation system was constructed. For the segmentation task, metrics such as mean intersection over union (mIoU), Dice coefficient, Boundary F1 score, and mean pixel accuracy (mPA) were used for evaluation. Mean intersection over union (mIoU) is a widely used comprehensive metric in semantic segmentation, reflecting the degree of overlap between the predicted segmentation area and the true annotation. The higher its value, the better the segmentation accuracy. The calculation formula is shown in Equation (8). The Dice coefficient is used to measure the spatial similarity between the predicted segmentation result and the true annotation, and its calculation formula is shown in Equation (9). The Boundary F1 score specifically focuses on the accuracy of the target boundary area in the segmentation result. By calculating the matching degree between the predicted boundary and the true boundary within a certain tolerance range, it reflects the model’s ability to capture edge details, which is of great significance for improving the accuracy of pest morphological feature extraction. The calculation formula is shown in Equation (10). Mean pixel accuracy (mPA) calculates and averages the pixel-level category accuracy of pests and the background, which can intuitively reflect the correct classification degree of the model for different category pixels. The calculation formula is shown in Equation (11).
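A small NumPy sketch shows how the counting-based segmentation metrics in Equations (8), (9), and (11) can be computed from a predicted and a ground-truth label map. It follows the equations as printed, treats background as class 0 and pest as class 1, and assumes every class occurs at least once so that no denominator is zero. Boundary F1 is left out because it depends on a distance tolerance that the text does not specify; the function names are illustrative.

```python
import numpy as np

def per_class_counts(pred, truth, num_classes):
    """Pixel-level TP, FP, FN for each class."""
    tp = np.array([np.sum((pred == c) & (truth == c)) for c in range(num_classes)])
    fp = np.array([np.sum((pred == c) & (truth != c)) for c in range(num_classes)])
    fn = np.array([np.sum((pred != c) & (truth == c)) for c in range(num_classes)])
    return tp, fp, fn

def miou(pred, truth, num_classes=2):
    tp, fp, fn = per_class_counts(pred, truth, num_classes)
    return float(np.mean(tp / (tp + fp + fn)))           # Equation (8)

def dice(pred_fg, truth_fg):
    inter = np.sum(pred_fg & truth_fg)
    return float(2 * inter / (pred_fg.sum() + truth_fg.sum()))   # Equation (9)

def mpa(pred, truth, num_classes=2):
    tp, fp, _ = per_class_counts(pred, truth, num_classes)
    return float(np.mean(tp / (tp + fp)))                # Equation (11) as printed

# Toy 2 x 4 label maps (0 = background, 1 = pest).
truth = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1]])
pred = np.array([[0, 1, 1, 1],
                 [0, 0, 0, 1]])

print("mIoU:", miou(pred, truth))
print("Dice:", dice(pred == 1, truth == 1))
print("mPA :", mpa(pred, truth))
```

In this toy case one pest pixel is missed and one background pixel is mislabeled, so both classes have three true positives, one false positive, and one false negative.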

For the detection task, Precision, Recall, mean average precision at an IoU of 0.5 (mAP50), and mean average precision from an IoU of 0.5 to 0.95 (mAP50–95) were used for evaluation. Precision measures the proportion of samples predicted as positive that are actually positive, reflecting the model’s ability to control false positives. It is particularly crucial in high-precision recognition scenarios, and its formula is shown in Equation (12). Recall evaluates the model’s ability to identify positive samples, reflecting the false-negative rate. It is essential for application scenarios requiring comprehensive pest detection, and its formula is shown in Equation (13). The mean average precision at an IoU of 0.5 (mAP50) is the average of the areas under the precision–recall curves (AP) of all categories when the intersection over union (IoU) threshold is 0.5. It comprehensively reflects the detection performance of the model across different categories. The calculation involves computing AP50 for each category and then taking the average, as shown in Equation (14). mAP50–95 is the average of the areas under the precision–recall curves of all categories when the IoU threshold ranges from 0.5 to 0.95 at intervals of 0.05. It comprehensively reflects the detection performance of the model under different levels of strictness, as shown in Equation (15).

\mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_{i}}{TP_{i} + FP_{i} + FN_{i}}

(8)

\mathrm{Dice} = \frac{2 \times |X \cap Y|}{|X| + |Y|}

(9)

\mathrm{Boundary\ F1} = \frac{2 \times \mathrm{Precision}_{boundary} \times \mathrm{Recall}_{boundary}}{\mathrm{Precision}_{boundary} + \mathrm{Recall}_{boundary}}

(10)

\mathrm{mPA} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_{i}}{TP_{i} + FP_{i}}

(11)

\mathrm{Precision} = \frac{TP}{TP + FP}

(12)

\mathrm{Recall} = \frac{TP}{TP + FN}

(13)

\mathrm{mAP50} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{AP50}_{i}

(14)

\mathrm{mAP}_{50-95} = \frac{1}{k} \sum_{i=1}^{k} \frac{1}{10} \sum_{j=0}^{9} \mathrm{AP}_{0.5 + 0.05 j,\, i}

(15)

Here, TP (true positive) is the number of positive samples correctly identified by the model; FN (false negative) is the number of positive samples misidentified as negative; FP (false positive) is the number of negative samples misidentified as positive; and TN (true negative) is the number of negative samples correctly identified by the model. In the segmentation task, these quantities are calculated at the pixel level: TP_i, FP_i, and FN_i are the true-positive, false-positive, and false-negative pixel counts of the i-th category, respectively, and k is the total number of categories (including the background). In Equation (9), X and Y are the predicted and annotated regions. In the detection task, Precision and Recall are calculated at the image level. mAP50 is the mean of the average precisions of all categories, where k is the number of categories and AP50_i is the average precision of the i-th category at an IoU threshold of 0.5. mAP50–95 averages the per-category AP values over IoU thresholds from 0.5 to 0.95 at intervals of 0.05, reflecting detection performance under varying levels of strictness. In Equations (8) and (11), k has the same meaning: the total number of segmentation categories, including pests, background, and all distinguishable classes.
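As an illustration of Equations (8), (9), and (11), the sketch below derives mIoU, Dice, and mPA from a pixel-level confusion matrix. It is a sketch rather than the authors' evaluation code: the function names, the choice of class 1 as the pest class for Dice, and the convention that an absent class contributes 0 to the mean are assumptions.

```python
import numpy as np


def confusion_matrix(pred, target, k):
    """Return a k x k matrix: rows are true classes, columns are predicted classes."""
    pred = np.asarray(pred, dtype=np.int64).ravel()
    target = np.asarray(target, dtype=np.int64).ravel()
    return np.bincount(target * k + pred, minlength=k * k).reshape(k, k)


def _safe_div(num, den):
    """Element-wise division that returns 0 where the denominator is 0."""
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def segmentation_metrics(pred, target, k=2, pest_class=1):
    cm = confusion_matrix(pred, target, k).astype(float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as class i, annotated as another class
    fn = cm.sum(axis=1) - tp  # annotated as class i, predicted as another class
    miou = _safe_div(tp, tp + fp + fn).mean()  # Equation (8)
    mpa = _safe_div(tp, tp + fp).mean()        # Equation (11), as written in the text
    x = tp[pest_class] + fp[pest_class]        # |X|: predicted pest pixels
    y = tp[pest_class] + fn[pest_class]        # |Y|: annotated pest pixels
    dice = 2 * tp[pest_class] / (x + y) if x + y > 0 else 0.0  # Equation (9)
    return {"mIoU": float(miou), "Dice": float(dice), "mPA": float(mpa)}


# Usage on a toy 2 x 4 label map (0 = background, 1 = pest).
target = [[0, 0, 1, 1], [0, 0, 1, 1]]
pred = [[0, 0, 0, 1], [0, 0, 1, 1]]
print(segmentation_metrics(pred, target))
```

In this toy case one pest pixel is missed, so the per-class IoU values are 0.8 (background) and 0.75 (pest).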

2.3.2. Experimental Setup

The experiments in this study were conducted based on the PyTorch deep-learning framework and executed in the Anaconda environment. Table 2 shows the configuration of the main experimental hardware and software environment, and Table 3 presents the primary hyperparameter settings.

3. Experimental Results

3.1. First-Stage Segmentation Results

3.1.1. Comparative Experiments

To verify the effectiveness of the BAE-UNet model designed in this study on the field pest dataset, comparative experiments were conducted with four mainstream segmentation models: U-Net, DeepLabV3+ [32], PSPNet [33], and HRNetV2 [34]. All models adopted the same training–test set split, data augmentation strategy, and hyperparameters. The evaluation metrics included mIoU, Dice, Boundary F1, and mPA. It should be noted that, due to the dominance of background pixels over pest regions in field images, mPA is strongly influenced by background classification accuracy and may yield relatively high values even when small-target or boundary segmentation performance is limited. Therefore, mIoU, Dice, and Boundary F1 are emphasized as more task-relevant indicators for assessing pest segmentation quality, especially in small-target and boundary-sensitive scenarios. As shown in Table 4, BAE-UNet significantly outperformed the other models in mIoU (0.930), Dice (0.951), and Boundary F1 (0.943), with a noticeable improvement across key metrics. Figure 9 presents the grouped bar chart comparison of the models across the four metrics, further visually reflecting the differences among the models.
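The caution about mPA can be made concrete with a small worked example. The pixel counts below are hypothetical and chosen only to illustrate the effect; pixel accuracy follows Equation (11) as written in this paper.

```python
# Hypothetical 100 x 100 image: 100 pest pixels and 9900 background pixels.
# The prediction recovers only half of the pest and adds no false pest pixels,
# so the 50 missed pest pixels are counted as background false positives.
pest = {"tp": 50, "fp": 0, "fn": 50}
background = {"tp": 9900, "fp": 50, "fn": 0}


def iou(c):
    return c["tp"] / (c["tp"] + c["fp"] + c["fn"])


def pixel_accuracy(c):
    return c["tp"] / (c["tp"] + c["fp"])  # per-class term of Equation (11)


miou = (iou(pest) + iou(background)) / 2
mpa = (pixel_accuracy(pest) + pixel_accuracy(background)) / 2
print(f"mIoU = {miou:.3f}, mPA = {mpa:.3f}")  # mIoU = 0.747, mPA = 0.997
```

Although half of the pest body is lost, mPA stays close to 1, while mIoU drops to about 0.75, which is why the overlap-based metrics are the more informative ones here.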

U-Net was slightly superior to HRNetV2 in boundary fineness (Boundary F1 scores of 0.857 and 0.811, respectively), indicating that the classic encoder–decoder structure still holds some advantages in boundary restoration. Although HRNetV2 achieved the highest performance in pixel accuracy (mPA 0.993), due to the larger number of background pixels in the dataset compared to pest targets, this metric primarily reflects the classification accuracy of the background area and does not directly demonstrate its superiority in small-target or boundary segmentation. DeepLabV3+ (mIoU 0.838, Boundary F1 0.830) and PSPNet (mIoU 0.826, Boundary F1 0.796) focused on global semantic modeling and were relatively insufficient in handling small- and medium-sized targets and boundary features. Consistent trends can also be observed across different pest categories. In particular, the proposed BAE-UNet demonstrates more evident advantages for visually challenging cases, such as small-scale pests and morphologically similar species, while maintaining stable segmentation performance for larger and visually distinctive pests. This indicates that the proposed model exhibits good adaptability across pest categories under complex field conditions. Overall, BAE-UNet can accurately segment multi-scale, multi-type pests in complex field environments while balancing boundary fineness and background suppression capabilities, fully demonstrating the effectiveness of its multi-scale feature fusion, background perception, and edge-attention modules.

The comparison of segmentation results of the five models on the same background image is shown in Figure 10. From left to right are the original image and the segmentation results of U-Net, DeepLabV3+, PSPNet, HRNetV2, and BAE-UNet. This result clearly demonstrates the role of the segmentation stage in feature purification and target focusing, providing more discriminative input features for subsequent detection.

3.1.2. Ablation Experiments

To verify the contribution of each module to the model’s performance, this study conducted ablation experiments in which BACM, SCRA, and MESA were each added to the baseline U-Net model and then combined in the complete BAE-UNet. The experimental results are shown in Table 5. Figure 11 uses the Grad-CAM visualization technique to show how the attention areas differ between modules during feature learning. After adding the BACM module, the mIoU and Dice coefficient increased from 0.852 and 0.867 to 0.932 and 0.944, respectively. This indicates that, through multi-scale depthwise-separable convolutions and the adaptive weight-fusion mechanism, this module effectively enhances the model’s perception of multi-scale pest features. Compared with the baseline U-Net, which only focuses on the core area of the pest body, the BACM module enables the model’s attention to cover the complete structure from the insect’s antennae to its torso, visually supporting its advantage in multi-scale feature fusion. After introducing the SCRA module, the Boundary F1 score increased from 0.857 to 0.916, indicating that this module reduces background interference through spatial-weight suppression and channel-recalibration mechanisms. Compared with the baseline model, the background noise response after applying the SCRA module is clearly weakened, and the feature response is more concentrated in the main area of the pest body, supporting its effectiveness in background purification and semantic enhancement. With the MESA module, the Boundary F1 score increased from 0.857 to 0.937, the largest single-module gain on this metric, indicating that this module effectively improves boundary-segmentation accuracy through multi-scale edge-perception and spatial-attention mechanisms. The MESA module strengthens the response of the pest body’s edge areas, making the boundary contours appear clearer and more continuous in the heatmap, which is consistent with its role in edge-refinement optimization.

When the three modules work together, the complete BAE-UNet model achieves the best balance across all indicators (mIoU: 0.930, Dice: 0.951, Boundary F1: 0.943). Its mIoU is marginally below that of the BACM-only and MESA-only variants (0.932 and 0.933 in Table 5), but its Dice, Boundary F1, and mPA are the highest of all variants. BAE-UNet not only maintains the background-purification effect of the SCRA module and the edge-enhancement capability of the MESA module, but also inherits the multi-scale perception ability of the BACM module, achieving coordinated improvements in feature extraction, background suppression, and boundary optimization.
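The per-variant gains over the baseline can be read directly from Table 5. The short script below only reproduces that arithmetic; the dictionary layout and function name are illustrative choices.

```python
# Values copied from Table 5.
BASELINE = {"mIoU": 0.852, "Dice": 0.867, "Boundary F1": 0.857, "mPA": 0.981}
VARIANTS = {
    "U-Net + BACM": {"mIoU": 0.932, "Dice": 0.944, "Boundary F1": 0.891, "mPA": 0.971},
    "U-Net + SCRA": {"mIoU": 0.929, "Dice": 0.936, "Boundary F1": 0.916, "mPA": 0.984},
    "U-Net + MESA": {"mIoU": 0.933, "Dice": 0.938, "Boundary F1": 0.937, "mPA": 0.979},
    "BAE-UNet": {"mIoU": 0.930, "Dice": 0.951, "Boundary F1": 0.943, "mPA": 0.985},
}


def gains(variant):
    """Absolute change of each metric relative to the baseline U-Net."""
    return {name: round(value - BASELINE[name], 3) for name, value in variant.items()}


for model, scores in VARIANTS.items():
    print(model, gains(scores))
```

Listing the differences this way makes it easy to see which module drives which metric, for example that MESA alone accounts for most of the Boundary F1 gain.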

3.2. Second-Stage Detection Results

To verify the promoting effect of the first-stage segmentation results on the pest-body detection task, this study used YOLOv8 as the detection model in the second stage and compared the differences in detection accuracy between the original images and the images segmented by BAE-UNet. The experimental results are shown in Table 6. When the original images were directly used as input, Precision, Recall, mAP50, and mAP50–95 were 0.748, 0.796, 0.818, and 0.525, respectively. When the images segmented by BAE-UNet were used as input, these four indicators increased to 0.958, 0.971, 0.977, and 0.882, representing a significant improvement in overall detection performance. These results indicate that the first-stage BAE-UNet effectively removes cluttered field backgrounds and non-target interference, highlighting the main area of the pest body. This enables the second-stage detection model to learn the target representation in a cleaner and more concentrated feature space, thereby achieving higher detection accuracy and improved generalization ability. Further analysis shows that the performance gains introduced by the segmentation stage are category dependent. The segmentation-based pre-processing provides more pronounced improvements for visually challenging pest categories, especially small-scale and morphologically similar species. By effectively suppressing background clutter and isolating more complete pest-body regions, the segmentation stage enables the detection model to learn more discriminative shape and texture features. In contrast, larger pests with distinctive visual characteristics already exhibit relatively robust detection performance on original images and therefore show comparatively moderate, yet consistent, accuracy improvements after segmentation.
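One way to hand the first-stage output to the detector is to blank out every pixel outside the predicted pest mask. The sketch below assumes a binary mask and a black fill value; the paper does not state how the segmented images are composed, so both choices, and the function name, are assumptions.

```python
import numpy as np


def suppress_background(image, mask, fill_value=0):
    """Keep pixels where mask is nonzero and replace the rest with fill_value.

    image: H x W x 3 array (for example uint8 RGB)
    mask:  H x W array, nonzero where the segmentation model predicts pest
    """
    image = np.asarray(image)
    mask = np.asarray(mask)
    if image.shape[:2] != mask.shape:
        raise ValueError("image and mask must have the same height and width")
    keep = mask.astype(bool)[..., None]  # add a channel axis for broadcasting
    return np.where(keep, image, np.asarray(fill_value, dtype=image.dtype))


# Usage: a uniform grey image with a 2 x 2 pest region in the centre.
image = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
cleaned = suppress_background(image, mask)
print(cleaned[0, 0], cleaned[1, 1])
```

The output keeps the input shape and dtype, so it can be saved and labelled with the same bounding boxes as the original image.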

Figure 12 shows the visualization results of different pest categories. In each group, from left to right, are the original image, the detection result of YOLOv8 after inputting the image segmented by BAE-UNet, and the detection result of YOLOv8 for the original image. After the segmentation stage, background interference in the image is significantly reduced. The YOLOv8 model can more accurately locate and identify the pest-body targets, showing obvious advantages, especially for samples with complex backgrounds and small-sized pests.

When the original images and the images segmented by BAE-UNet are separately input into the YOLOv8 model for detection, the Precision-Recall curves show that, after using the segmented images as input, the detection curves of various pests shift upward overall while still maintaining high precision in the high-recall interval. This indicates that the segmentation stage effectively reduces background interference, enabling the detection model to focus more on the discriminative features of the pest-body area. The results are shown in Figure 13. Specifically, segmentation pre-processing increases the average mAP50 of YOLOv8 from 0.818 to 0.977 and Precision from 0.748 to 0.958, significantly improving detection performance. This further verifies the effectiveness and feasibility of the two-stage framework in complex backgrounds.
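The link between these curves and the reported mAP values can be sketched as follows. The example uses all-point interpolation of the precision-recall curve, which is one common convention and not necessarily the one used by the YOLOv8 tooling; the function names and the toy numbers are illustrative.

```python
import numpy as np


def average_precision(recall, precision):
    """Area under the precision-recall curve (recall sorted in ascending order)."""
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, dtype=float), [0.0]))
    # Precision envelope: at each recall, use the best precision at any higher recall.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum the rectangles between consecutive distinct recall values.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


def map50_95(ap_table):
    """Equation (15): ap_table[j][i] is the AP of class i at IoU 0.5 + 0.05 * j."""
    ap = np.asarray(ap_table, dtype=float)
    if ap.ndim != 2 or ap.shape[0] != 10:
        raise ValueError("expected 10 rows, one per IoU threshold")
    return float(ap.mean(axis=0).mean())


# Usage with toy numbers (not results from the paper).
print(average_precision([0.5, 1.0], [1.0, 0.5]))  # 0.75
toy_table = np.linspace(0.9, 0.0, 10)[:, None] * np.ones((10, 2))
print(map50_95(toy_table))
```

A curve that stays high as recall approaches 1 encloses more area, so an upward shift of the curves translates directly into a higher AP for each category.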

4. Discussion

The two-stage pest identification method proposed in this study exhibits significant advantages in complex field scenarios. The performance improvement is mainly attributed to the effective suppression of background interference in the segmentation stage. Through the collaborative efforts of the BACM, SCRA, and MESA modules, BAE-UNet can accurately extract the pest-body area, providing high-quality feature inputs for YOLOv8. This enables YOLOv8 to focus more on discriminative features such as the shape and texture of the pest body, thus significantly reducing false positives and false negatives and demonstrating stronger robustness, especially in the detection of small-sized or similar-category pests.

However, the model still has certain limitations in extreme scenarios. When images contain strongly reflective surfaces or the pests are severely occluded, the segmentation and detection performance decline. Specifically, occlusion primarily affects the segmentation stage, as incomplete pest contours and disrupted morphological structures lead to inaccurate region extraction, which subsequently propagates errors to the detection stage. This is mainly because the highlighted areas lead to the loss of pest feature information, or occlusion disrupts the morphological integrity of the target. In the future, the robustness of the model can be further enhanced by increasing samples from extreme scenarios or introducing feature-completion techniques such as Generative Adversarial Networks (GAN) [35].

Trained on field-collected data with targeted augmentation strategies that simulate real-world field conditions, the model shows good practicality and transferability. It can operate stably in both rice and corn fields in Northeast China, indicating that this framework has cross-crop adaptability and can reduce the cost of retraining the model in different farmland ecosystems. The modular design further enhances the system’s scalability: for newly added pest categories, only the output channels of BAE-UNet need to be adjusted and its parameters fine-tuned, and the classification head of YOLOv8 updated, which keeps adaptation fast and reduces maintenance and update costs.

Although the multi-scale fusion and attention mechanisms of BAE-UNet improve performance, they also increase computational overhead. In the future, lightweight strategies such as model pruning and knowledge distillation will be introduced to deploy the model on edge platforms, such as unmanned aerial vehicles or mobile devices, while ensuring accuracy. This will enable real-time and precise monitoring of field pest situations and provide reliable technical support for smart agriculture.

5. Conclusions

Aiming to address the problems of significant scale differences, diverse postures, strong background interference, and similar-category features of pests in field-trap images, this study proposed a two-stage “segmentation-detection” pest identification framework to improve monitoring accuracy and robustness in complex agricultural scenarios. By explicitly incorporating background suppression and boundary-aware segmentation into the recognition pipeline, the proposed framework enhances the discriminative representation of pest bodies under complex backgrounds.

The constructed BAE-UNet, based on the classic U-Net, integrates the BACM, SCRA, and MESA modules, achieving coordinated optimization of multi-scale feature adaptation, background noise suppression, and boundary refinement. Experimental results demonstrate that the proposed segmentation model consistently outperforms mainstream methods across multiple task-relevant metrics, confirming its effectiveness in extracting accurate pest regions from cluttered backgrounds. Ablation experiments further verified the role of each module in adapting to the size differences of pest bodies, suppressing background interference, and improving boundary fineness.

After pre-processing by BAE-UNet segmentation, the Precision, Recall, mAP50, and mAP50–95 of the YOLOv8 detection model increased to 0.958, 0.971, 0.977, and 0.882, respectively, verifying the effectiveness of the strategy of first segmenting and purifying, and then detecting and classifying. By providing cleaner and more discriminative region features, the proposed two-stage framework significantly reduces misclassification risks caused by background noise and visually similar pest categories.

In summary, the two-stage framework proposed in this study provides an effective solution for pest identification in complex agricultural scenarios. This study demonstrates that integrating multi-scale feature adaptation, background suppression, and boundary-aware modeling within a two-stage framework constitutes a meaningful advance for pest recognition in complex agricultural environments. In the future, it can be combined with model-lightweight strategies to achieve edge-terminal deployment, promoting real-time monitoring of field pest situations on unmanned aerial vehicles or portable devices, and supporting intelligent pest management and the sustainable development of smart agriculture.

Author Contributions

Conceptualization, J.C., X.L. and H.G.; Data Curation, X.L. and X.Z.; Formal Analysis, X.Z. and X.D.; Funding Acquisition, J.C.; Investigation, X.D.; Methodology, J.C. and H.G.; Resources, J.C. and H.G.; Supervision, X.D. and H.G.; Validation, X.L.; Visualization, X.Z.; Writing—Original Draft, J.C. and X.L.; Writing—Review and Editing, J.C., X.L., X.Z., X.D. and H.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Provincial Natural Science Foundation of Jilin Province, Project No. YDZJ202601ZYTS065.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Nanni, L.; Maguolo, G.; Pancino, F. Insect pest image detection and recognition based on bio-inspired methods. Ecol. Inform. 2020, 57, 101089.
2. Preti, M.; Verheggen, F.; Angeli, S. Insect pest monitoring with camera-equipped traps: Strengths and limitations. J. Pest Sci. 2021, 94, 203–217.
3. Espinoza, K.; Valera, D.L.; Torres, J.A.; López, A.; Molina-Aiz, F.D. Combination of image processing and artificial neural networks as a novel approach for the identification of Bemisia tabaci and Frankliniella occidentalis on sticky traps in greenhouse agriculture. Comput. Electron. Agric. 2016, 127, 495–505.
4. Xie, C.; Wang, R.; Zhang, J.; Chen, P.; Dong, W.; Li, R.; Chen, T.; Chen, H. Multi-level learning features for automatic classification of field crop pests. Comput. Electron. Agric. 2018, 152, 233–241.
5. Xie, C.; Zhang, J.; Li, R.; Li, J.; Hong, P.; Xia, J.; Chen, P. Automatic classification for field crop insects via multiple-task sparse representation and multiple-kernel learning. Comput. Electron. Agric. 2015, 119, 123–132.
6. Liu, T.; Chen, W.; Wu, W.; Sun, C.; Guo, W.; Zhu, X. Detection of aphids in wheat fields using a computer vision technique. Biosyst. Eng. 2016, 141, 82–93.
7. Ebrahimi, M.; Khoshtaghaza, M.H.; Minaei, S.; Jamshidi, B. Vision-based pest detection based on SVM classification method. Comput. Electron. Agric. 2017, 137, 52–58.
8. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
9. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
10. Liu, J.; Wang, X. Plant diseases and pests detection based on deep learning: A review. Plant Methods 2021, 17, 22.
11. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 2002, 86, 2278–2324.
12. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
13. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
14. Li, C.; Zhen, T.; Li, Z. Image classification of pests with residual neural network based on transfer learning. Appl. Sci. 2022, 12, 4356.
15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
17. Wen, C.; Chen, H.; Ma, Z.; Zhang, T.; Yang, C.; Su, H.; Chen, H. Pest-YOLO: A model for large-scale multi-class dense and tiny pest detection and counting. Front. Plant Sci. 2022, 13, 973985.
18. Nazir, A.; Wani, M.A. Multi-scale feature enhancement using EfficientNet-B7 and PANet in faster R-CNN for small object detection. Int. J. Inf. Technol. 2025, 1–8.
19. Huang, Y.-Q.; Huang, Z.-C.; Huang, C.; Qiao, X. MRUNet: A two-stage segmentation model for small insect targets in complex environments. J. Integr. Agric. 2023, 22, 1117–1130.
20. Abinaya, S.; Kumar, K.U.; Alphonse, A.S. Cascading autoencoder with attention residual U-Net for multi-class plant leaf disease segmentation and classification. IEEE Access 2023, 11, 98153–98170.
21. Biradar, N.; Hosalli, G. Segmentation and detection of crop pests using novel U-Net with hybrid deep learning mechanism. Pest Manag. Sci. 2024, 80, 3795–3807.
22. Mu, J.; Sun, L.; Ma, B.; Liu, R.; Liu, S.; Hu, X.; Zhang, H.; Wang, J. TFEMRNet: A Two-Stage Multi-Feature Fusion Model for Efficient Small Pest Detection on Edge Platforms. AgriEngineering 2024, 6, 4688–4703.
23. Abbas, A.; Saddam, B.; Ullah, F.; Hassan, M.A.; Shoukat, K.; Hafeez, F.; Alam, A.; Abbas, S.; Ghramh, H.A.; Khan, K.A. Global distribution and sustainable management of Asian corn borer (ACB), Ostrinia furnacalis (Lepidoptera: Crambidae): Recent advancement and future prospects. Bull. Entomol. Res. 2025, 115, 105–120.
24. Xiang, X.; Liu, S.; Li, H.; Danso Ofori, A.; Yi, X.; Zheng, A. Defense Strategies of Rice in Response to the Attack of the Herbivorous Insect, Chilo suppressalis. Int. J. Mol. Sci. 2023, 24, 14361.
25. Landolt, P.; Guedot, C.; Zack, R. Spotted cutworm, Xestia c-nigrum (L.) (Lepidoptera: Noctuidae) responses to sex pheromone and blacklight. J. Appl. Entomol. 2011, 135, 593–600.
26. Thompson, S.R.; Brandenburg, R.L. Tunneling responses of mole crickets (Orthoptera: Gryllotalpidae) to the entomopathogenic fungus, Beauveria bassiana. Environ. Entomol. 2005, 34, 140–147.
27. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
28. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
29. Lin, M.; Chen, Q.; Yan, S. Network in network. arXiv 2013, arXiv:1312.4400.
30. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
31. Varghese, R.; Sambath, M. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6.
32. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
33. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
34. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364.
35. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. arXiv 2014, arXiv:1406.2661.

Figure 1. Examples of original pest images collected by field pest traps and pictures of target pests in various postures.

Figure 2. Experimental Flowchart of the Two-stage Pest Recognition Method.

Figure 3. Comparison of pest images before and after data augmentation. (a–f) The original images and their corresponding augmented results of six different pest image groups.

Figure 4. Overall Structure Diagram of the BAE-UNet Segmentation Model.

Figure 5. Structural Diagram of the BACM Module.

Figure 6. Structure Diagram of the SCRA Module.

Figure 7. Structure Diagram of the MESA Module.

Figure 8. Structure Diagram of the YOLOv8 Detection Model.

Figure 9. Grouped Bar Chart of Segmentation Models.

Figure 10. Comparison of Segmentation Results of Different Models. (a–c) show the segmentation results of models (U-Net, DeepLabV3+, PSPNet, HRNetV2, BAE-UNet) for three different original pest images: the first column is the original image, and the rest are the segmentation results of the corresponding models.

Figure 11. Heatmaps after processing by different modules. Color explanation: In the heatmaps, brighter colors (e.g., yellow and red) indicate higher attention weights assigned by the model to the corresponding regions, whereas darker colors (e.g., blue) indicate lower attention weights. (a) Gryllotalpidae; (b) Xestia c-nigrum; (c) Chilo suppressalis; (d) Ostrinia furnacalis.

Figure 12. Comparison of Detection Effects of the YOLOv8 Model on Different Input Images. (a–d) show the detection results of the YOLOv8 model for four different input images.

Figure 13. Comparison of Precision-Recall Curves. (a) Detection results of YOLOv8 with images segmented by BAE-UNet as input; (b) Detection results of YOLOv8 with original images as input.

Table 1. Construction and Division of the Two-stage Pest Recognition Model Dataset.

Phase	Dataset/Subset	Number of Original Images	Data Augmentation	Final Number of Images
Data Preparation	Basic Dataset	3000	No	2667
First Phase (BAE-UNet)	Total	1452	-	3352
	Training Set	950	Yes	2850
	Validation Set	502	No	502
	Test Set	1215	No	1215
Second Phase (YOLOv8)	Total	1215	-	1215
	Training Set	972	No	972
	Validation Set	243	No	243

Table 2. Experimental environment configuration.

Environment Configuration	Parameter
CPU	Intel(R) Xeon(R) Gold 6148 CPU @ 2.40 GHz
GPU	2 × A100 (80 GB)
Development environment	PyCharm 2023.2.5
Language	Python 3.8
Framework	PyTorch 2.0.1
Operating platform	CUDA 11.8
Operating system	Windows 11

Table 3. Hyperparameter settings.

Hyperparameter	First Phase (BAE-UNet)	Second Phase (YOLOv8)
Epochs	300	300
Batch Size	16	32
Learning Rate	1 × 10⁻⁴	5 × 10⁻⁴
Optimizer	Adam	AdamW
Input Image Size	512 × 512	512 × 512

Table 4. Performance Comparison of Segmentation Models.

Model	Backbone	mIoU	Dice	Boundary F1	mPA
U-Net	VGG16	0.852	0.867	0.857	0.981
DeepLabV3+	ResNet101	0.838	0.848	0.830	0.975
PSPNet	ResNet50	0.826	0.836	0.796	0.983
HRNetV2	HRNetV2-W32	0.850	0.852	0.811	0.993
BAE-UNet	VGG16	0.930	0.951	0.943	0.985

Table 5. Performance Comparison of Ablation Experiments.

Model	mIoU	Dice	Boundary F1	mPA
U-Net	0.852	0.867	0.857	0.981
U-Net + BACM	0.932	0.944	0.891	0.971
U-Net + SCRA	0.929	0.936	0.916	0.984
U-Net + MESA	0.933	0.938	0.937	0.979
BAE-UNet	0.930	0.951	0.943	0.985

Table 6. Comparative Results of YOLOv8 Experiments.

Model	Precision	Recall	mAP50	mAP50–95
Original image+YOLOv8	0.748	0.796	0.818	0.525
BAE-UNet+YOLOv8	0.958	0.971	0.977	0.882

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
