Tackling Metamorphosis and Complex Backgrounds: A Coarse-to-Fine Network for Fine-Grained Agricultural Pest Recognition
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset and Analysis
2.1.1. Overview of the Proposed Cascade Framework
- A YOLOv8-based detector is employed to localize pest targets in complex backgrounds and generate cropped “clean” images.
- An improved ResNeXt50 architecture serves as the feature extractor, enhanced by the Convolutional Block Attention Module (CBAM). Crucially, the classification head is redesigned as an Adaptive Multi-Center Head (AMC-Head) to handle multi-state pest morphologies.
2.1.2. Architecture Selection
- 1. C2f Module: By replacing the traditional C3 module with the C2f module (a Cross Stage Partial bottleneck with two convolutions), the network incorporates more skip connections. This enhances gradient flow, allowing the model to capture richer feature representations of pests with subtle textures.
- 2. Anchor-Free Head: YOLOv8 adopts an anchor-free paradigm with a decoupled head structure. This separates the classification and regression tasks, significantly improving the model’s adaptability to pests with extreme aspect ratios (e.g., slender stick insects vs. round beetles).
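As an illustration of the anchor-free paradigm described above, each grid cell predicts four distances (l, t, r, b) from its center to the box edges, which are then scaled by the feature-map stride into pixel coordinates. The function below is a sketch of that decoding step, not the YOLOv8 source; the function name and argument layout are our assumptions.

```python
def decode_ltrb(cx, cy, ltrb, stride):
    """Decode anchor-free (l, t, r, b) distances, predicted at a grid-cell
    center (cx, cy) in feature-map units, into an (x1, y1, x2, y2) box in
    image pixels by scaling with the feature-map stride."""
    l, t, r, b = ltrb
    x1 = (cx - l) * stride
    y1 = (cy - t) * stride
    x2 = (cx + r) * stride
    y2 = (cy + b) * stride
    return (x1, y1, x2, y2)
```

Because no anchor shapes are involved, a slender stick insect and a round beetle are handled by the same four regression targets, which is what makes the head robust to extreme aspect ratios.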
2.1.3. Context-Aware Cropping Strategy
2.2. Fine-Grained Classification Network
2.2.1. Preprocessing and Advanced Augmentation
- 1. Geometric and photometric transformations: We apply random horizontal flipping and rotation to enforce rotational invariance. Additionally, color jittering (brightness, contrast, and saturation) is utilized to simulate diverse lighting conditions in the field.
- 2. Mixup regularization: To further address the long-tailed distribution and encourage smoother decision boundaries, we introduce Mixup. Unlike traditional augmentation, Mixup constructs virtual training examples by linearly interpolating between two random samples (x_i, y_i) and (x_j, y_j): x̃ = λx_i + (1 − λ)x_j and ỹ = λy_i + (1 − λ)y_j, where λ ∈ [0, 1] is drawn from a Beta(α, α) distribution. This forces the model to learn less confident predictions for ambiguous samples, effectively preventing the network from memorizing noise in minority classes.
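The interpolation above can be sketched in a few lines. This is a generic Mixup implementation for illustration; the paper's exact α and batch-sampling details are not restated here.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two (sample, one-hot label) pairs into one virtual example.
    lambda ~ Beta(alpha, alpha); both the inputs and the labels are
    interpolated with the same coefficient."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

Because the blended label is soft (e.g., 0.7 of one class, 0.3 of another), the network is penalized for overconfident predictions on ambiguous inputs, which is the regularization effect exploited for the minority classes.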
2.2.2. Backbone: ResNeXt with Grouped Convolutions
2.2.3. Attention-Guided Feature Refinement (CBAM)
2.3. Adaptive Multi-Center Classification Head (AMC-Head)
2.3.1. Motivation: The Unimodal Assumption Bottleneck
2.3.2. AMC-Head Formulation
- If the input image is a larva, it activates a specific sub-center that specializes in cylindrical features.
- If the input is an adult, it activates a different sub-center specialized in wing textures.
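A minimal sketch of the multi-center idea behind these bullets, assuming the common formulation in which each class holds K sub-center weight vectors and the class logit is the maximum similarity over its sub-centers; the paper's exact normalization and any margin terms are not restated here.

```python
import math

def amc_logits(feat, centers):
    """Multi-center logits: for each class, compute the cosine similarity of
    the feature to each of its K sub-center vectors and keep the maximum.
    A larva and an adult can thus each match a different sub-center while
    still receiving the same class label.  centers[c] is a list of K
    weight vectors for class c."""
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den if den else 0.0
    return [max(cos(feat, w) for w in sub) for sub in centers]
```

The max operation is what removes the unimodal assumption: only the best-matching sub-center contributes to the class score, so the class's feature distribution no longer has to collapse onto a single prototype.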
2.3.3. Optimization Objective
2.4. Implementation Details
2.4.1. Experimental Setup
2.4.2. Training Strategy
- Batch size: 32.
- Learning rate: decayed from its initial to its final value using a Cosine Annealing schedule.
- Weight decay: applied to prevent overfitting.
- Epochs: 50.
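The schedule in the list above follows the standard Cosine Annealing formula, lr(t) = lr_min + ½(lr_max − lr_min)(1 + cos(πt/T)). A small sketch; the learning-rate values used below are placeholders, not the paper's settings.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min):
    """Cosine Annealing: decay from lr_max at epoch 0 to lr_min at the
    final epoch, following half a cosine period."""
    t = epoch / max(total_epochs - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

Compared with step decay, the cosine shape spends more time at intermediate rates and approaches the minimum smoothly, which tends to stabilize the final epochs of fine-tuning.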
2.4.3. Evaluation Metrics
3. Results and Discussion
3.1. Comparative Analysis with State-of-the-Art Methods
3.1.1. Quantitative Performance Comparison
- Discussion: This substantial margin (e.g., +18.9% over ResNet50) can be attributed to the fundamental difference in feature extraction strategies. Standard CNNs force the network to simultaneously learn localization (where is the pest?) and classification (what is the pest?) from full images, often leading to overfitting on dominant background features like soil or leaf textures. In contrast, our cascade approach decouples these tasks, ensuring that the classification head receives only relevant, clean biological signals.
3.1.2. Class-Wise Performance Analysis
- Discussion:
- Metamorphic Robustness: Notably, our method shows exceptional performance on pests with distinct life stages, such as Mythimna separata (Armyworm). Baseline models frequently confused the larvae of this species with other caterpillars due to their visual disparity from the adult moth. The high accuracy here confirms that the AMC-Head successfully learned to map these distinct morphologies (larva and adult) to the same semantic label without requiring manual sub-class supervision.
- Remaining Challenges: However, slight confusion persists among physically similar species (e.g., differing only by wing spot patterns). This misclassification is likely due to inter-species mimicry and shared host plants, suggesting that future work should focus on even finer-grained attention mechanisms to capture subtle texture differences.
3.2. Ablation Studies
3.2.1. Contribution of Individual Modules
- Effect of Background Decoupling (+YOLO): Introducing the cropping stage brings the most significant immediate gain, boosting accuracy by +5.2%. This empirical evidence supports our hypothesis that “environmental noise” is the primary bottleneck in field monitoring. By physically removing the background, we force the classifier to focus solely on the pest, effectively bridging the domain gap.
- Effect of Attention (+CBAM): Adding CBAM further improves accuracy by +1.9%. This indicates that even within the cropped pest body, discriminative information is non-uniform—often concentrated in key areas like antennae or wing veins—rather than the smooth body surface.
- Effect of Multi-Center Learning (+AMC): Replacing the standard classifier with our AMC-Head yields a final boost of +2.1%. Notably, the F1-score improves disproportionately, suggesting that the multi-center mechanism is particularly effective for balancing precision and recall in complex categories (e.g., pests with high intra-class variance).
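To make the +CBAM row concrete, the sketch below keeps CBAM's two-stage data flow — a channel gate computed from pooled per-channel statistics, then a spatial gate computed from pooled per-position statistics — but replaces the module's learned MLP and 7×7 convolution with identity mappings. It therefore illustrates only the structure of the refinement, not the trained parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cbam_gate(fmap):
    """Reduced CBAM-style refinement on a C x H x W feature map (nested
    lists): channel attention first, spatial attention second."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    # Channel gate: sigmoid(avg-pool + max-pool) per channel (MLP omitted).
    ch_gate = []
    for c in range(C):
        vals = [v for row in fmap[c] for v in row]
        ch_gate.append(sigmoid(sum(vals) / len(vals) + max(vals)))
    f1 = [[[fmap[c][i][j] * ch_gate[c] for j in range(W)]
           for i in range(H)] for c in range(C)]
    # Spatial gate: sigmoid(channel-avg + channel-max) per position
    # (7x7 convolution omitted).
    out = []
    for c in range(C):
        plane = []
        for i in range(H):
            row = []
            for j in range(W):
                col = [f1[k][i][j] for k in range(C)]
                sp = sigmoid(sum(col) / C + max(col))
                row.append(f1[c][i][j] * sp)
            plane.append(row)
        out.append(plane)
    return out
```

Since both gates lie in (0, 1), the module can only attenuate, never amplify, a response — it reweights the cropped pest body so that high-information regions such as antennae or wing veins dominate the pooled descriptor.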
3.2.2. Sensitivity Analysis of Sub-Centers (K)
- Discussion: This statistical result aligns remarkably well with the biological reality of holometabolous insects, which typically exhibit three distinct visual phases: (1) Larva, (2) Pupa, and (3) Adult.
- When K = 1, the model suffers from the unimodal bottleneck, struggling to compress divergent forms into one center.
- When K = 2, accuracy improves significantly as it captures the primary binary variance (Larva vs. Adult).
- At K = 3, the model achieves optimal performance, likely capturing an additional “intermediate” or “occluded” state.
- Increasing K to 4 or 5 leads to diminishing returns and potential overfitting, as the sub-centers become too sparse for the available training data.
3.2.3. Performance on Long-Tailed Categories
- Discussion: This result indicates that standard classifiers bias heavily towards majority classes. In contrast, the combination of Mixup (which expands the data manifold) and AMC-Head (which prevents minority features from being averaged out) successfully preserves the diversity of the feature space. This ensures the model does not ignore rare pests, which is crucial for early-stage pest warning systems.
3.3. Qualitative Analysis
3.3.1. Attention Visualization (Grad-CAM)
- Discussion:
- Baseline Failure Analysis: In the first row (complex background), the Baseline model’s attention (red region) is scattered across the surrounding leaves and soil. This visualization exposes the “background bias” inherent in single-stage CNNs: without explicit localization, the model erroneously learns environmental correlations (e.g., associating “soil color” with “ground pests”) rather than actual biological features.
- Ours Success Analysis: Conversely, our model accurately localizes the key discriminative parts of the pest (e.g., the texture on the wing), even when the pest color is similar to the background. This confirms that the combination of YOLO cropping and CBAM effectively suppresses environmental interference, forcing the network to learn “true” invariant features.
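The heatmaps discussed here follow the standard Grad-CAM recipe: weight each activation map by the spatial average of its gradient, sum over channels, and clip negatives with a ReLU. A dependency-free sketch on nested lists; the real pipeline operates on the backbone's tensors.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one convolutional layer.
    activations, gradients: C x H x W nested lists (gradients of the target
    class score w.r.t. the activations).  Each channel's weight is the
    spatial mean of its gradient; the weighted sum is ReLU-clipped."""
    C, H, W = len(activations), len(activations[0]), len(activations[0][0])
    weights = [sum(v for row in gradients[c] for v in row) / (H * W)
               for c in range(C)]
    cam = [[max(0.0, sum(weights[c] * activations[c][i][j] for c in range(C)))
            for j in range(W)] for i in range(H)]
    return cam
```

Upsampled to the input resolution and overlaid on the image, this map is what exposes the baseline's "background bias": channels correlated with soil or leaf texture receive large weights even though they carry no biological signal.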
3.3.2. Visualization of Detection and Cropping
- Discussion: Critically, the “Context-Aware Expansion” strategy ensures that the cropped images preserve complete morphological structures—such as the long antennae of the Rice Bug shown in the third column. As highlighted in Section 3.2, preserving these peripheral structures is vital for distinguishing pests from morphologically similar natural enemies (e.g., Syrphidae larvae). The visualization confirms that our cropping strategy retains the necessary “biological context” that standard tight cropping might discard.
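The "Context-Aware Expansion" described above can be sketched as a relative margin added around the tight detection box and clamped to the image bounds; the 15% margin below is an assumed placeholder, not the paper's reported ratio.

```python
def expand_crop(box, img_w, img_h, margin=0.15):
    """Expand a tight (x1, y1, x2, y2) detection box by a relative margin on
    every side, clamped to the image bounds, so peripheral structures such
    as long antennae survive the crop."""
    x1, y1, x2, y2 = box
    dw, dh = (x2 - x1) * margin, (y2 - y1) * margin
    return (max(0.0, x1 - dw), max(0.0, y1 - dh),
            min(float(img_w), x2 + dw), min(float(img_h), y2 + dh))
```

A tight crop would pass only the detector's box to the classifier; the margin trades a little background back in so that morphologically diagnostic appendages are not amputated at the box edge.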
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| CNNs | Convolutional Neural Networks |
| CBAM | Convolutional Block Attention Module |
| Grad-CAM | Gradient-weighted Class Activation Mapping |
| AMC-Head | Adaptive Multi-Center Classification Head |
| FAO | Food and Agriculture Organization |
| IPM | Integrated Pest Management |
| SOTA | State-of-the-Art |
| PA | Precision Agriculture |
| SIFT | Scale-Invariant Feature Transform |
| HOG | Histogram of Oriented Gradients |
| DL | Deep Learning |
| ROI | Region of Interest |
| CE | Cross-Entropy |
| FL | Focal Loss |
| SGD | Stochastic Gradient Descent |
| TL | Transfer Learning |
References
- Sharma, S.; Kooner, R.; Arora, R. Insect pests and crop losses. In Breeding Insect Resistant Crops for Sustainable Agriculture; Springer: Singapore, 2017; pp. 45–66.
- Oerke, E.C. Crop losses to pests. J. Agric. Sci. 2006, 144, 31–43.
- Chen, C.; Liang, Y.; Zhou, L.; Tang, X.; Dai, M. An automatic inspection system for pest detection in granaries using YOLOv4. Comput. Electron. Agric. 2022, 201, 107302.
- Preti, M.; Verheggen, F.; Angeli, S. Insect pest monitoring with camera-equipped traps: Strengths and limitations. J. Pest Sci. 2021, 94, 203–217.
- Meshram, A.T.; Vanalkar, A.V.; Kalambe, K.B.; Badar, A.M. Pesticide spraying robot for precision agriculture: A categorical literature review and future trends. J. Field Robot. 2022, 39, 153–171.
- Harris, C.G.; Andika, I.P.; Trisyono, Y.A. A Comparison of HOG-SVM and SIFT-SVM Techniques for Identifying Brown Planthoppers in Rice Fields. In Proceedings of the 2022 IEEE 2nd Conference on Information Technology and Data Science (CITDS), Debrecen, Hungary, 16–18 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 107–112.
- Wang, D.; Cao, W.; Zhang, F.; Li, Z.; Xu, S.; Wu, X. A review of deep learning in multiscale agricultural sensing. Remote Sens. 2022, 14, 559.
- Kumar, V.; Arora, H.; Sisodia, J. Resnet-based approach for detection and classification of plant leaf diseases. In Proceedings of the 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 2–4 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 495–502.
- Guan, B.; Zhang, L.; Zhu, J.; Li, R.; Kong, J.; Wang, Y.; Dong, W. The key issues and evaluation methods for constructing agricultural pest and disease image datasets: A review. Smart Agric. 2023, 5, 17–34.
- Zhang, W.; Sun, Y.; Huang, H.; Pei, H.; Sheng, J.; Yang, P. Pest region detection in complex backgrounds via contextual information and multi-scale mixed attention mechanism. Agriculture 2022, 12, 1104.
- Fennell, J.G.; Talas, L.; Baddeley, R.J.; Cuthill, I.C.; Scott-Samuel, N.E. The Camouflage Machine: Optimizing protective coloration using deep learning with genetic algorithms. Evolution 2021, 75, 614–624.
- Ahsan, F.F.; Thomas, M.L.; Laga, H.; Sohel, F. Deep learning-based analysis of insect life stages using a repurposed dataset. Ecol. Inform. 2025, 90, 103202.
- Hall, M.J.; Martín-Vega, D. Visualization of insect metamorphosis. Philos. Trans. R. Soc. B 2019, 374, 20190071.
- Du, M.; Wang, F.; Wang, Y.; Li, K.; Hou, W.; Liu, L.; He, Y.; Wang, Y. Improving long-tailed pest classification using diffusion model-based data augmentation. Comput. Electron. Agric. 2025, 234, 110244.
- Masko, D.; Hensman, P. The Impact of Imbalanced Training Data for Convolutional Neural Networks; KTH Royal Institute of Technology: Stockholm, Sweden, 2015.
| Target Crop | Classes | Representative Pest Species (Scientific Name) |
|---|---|---|
| Rice | 14 | Orseolia oryzae, Cnaphalocrocis medinalis, Lissorhoptrus oryzophilus, Chilo suppressalis |
| Corn | 13 | Ostrinia furnacalis, Mythimna separata, Agriotes spp., Agrotis ipsilon |
| Wheat | 9 | Dolerus tritici, Petrobia latens, Sitobion avenae, Penthaleus major |
| Beet | 8 | Spodoptera exigua, Pegomya hyoscyami, Loxostege sticticalis, Achyra rantalis |
| Alfalfa | 7 | Hypera postica, Peridroma saucia, Bruchophagus roddi, Adelphocoris lineolatus |
| Vitis (Grape) | 10 | Xylotrechus pyrrhoderus, Erythroneura apicalis, Bactrocera dorsalis, Theretra japonica |
| Citrus | 12 | Phyllocnistis citrella, Diaphorina citri, Panonychus citri, Aleurocanthus spiniferus |
| Mango | 6 | Procontarinia matteiana, Deanolis sublimbalis, Idioscopus clypealis |
| Others | 23 | Helicoverpa armigera, Locusta migratoria, Epicauta gorhami |
| Total | 102 | 75,222 Images (Hierarchically Structured) |
| Method | Backbone | Precision (%) | Recall (%) | Accuracy (%) |
|---|---|---|---|---|
| VGG16 | VGG-16 | 78.20 | 76.45 | 77.85 |
| MobileNetV3 | MobileNet-Small | 80.12 | 78.90 | 79.15 |
| ResNet50 | ResNet-50 | 79.45 | 71.20 | 72.50 |
| DenseNet121 | DenseNet-121 | 85.60 | 84.80 | 85.33 |
| EfficientNet-B0 | EfficientNet | 86.15 | 85.90 | 86.40 |
| L&I-Net (Ours) | ResNeXt50 + AMC | 91.25 | 90.50 | 91.40 |
| Model Variant | YOLO Crop | CBAM | Mixup | AMC-Head | Accuracy (%) |
|---|---|---|---|---|---|
| Baseline (ResNet50) | × | × | × | × | 82.50 |
| + Stage 1 Decoupling | ✓ | × | × | × | 87.72 (+5.22) |
| + Attention Module | ✓ | ✓ | × | × | 89.65 (+1.93) |
| + Mixup Strategy | ✓ | ✓ | ✓ | × | 90.30 (+0.65) |
| + AMC-Head (Ours) | ✓ | ✓ | ✓ | ✓ | 91.40 (+1.10) |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Su, H.; Zhao, L.; Liang, Y.; Liu, S. Tackling Metamorphosis and Complex Backgrounds: A Coarse-to-Fine Network for Fine-Grained Agricultural Pest Recognition. Appl. Sci. 2026, 16, 2191. https://doi.org/10.3390/app16052191

