Article

Tackling Metamorphosis and Complex Backgrounds: A Coarse-to-Fine Network for Fine-Grained Agricultural Pest Recognition

College of Science and Information, Qingdao Agricultural University, Qingdao 266109, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(5), 2191; https://doi.org/10.3390/app16052191
Submission received: 21 January 2026 / Revised: 19 February 2026 / Accepted: 21 February 2026 / Published: 25 February 2026
(This article belongs to the Section Agricultural Science and Technology)

Abstract

Timely and accurate identification of agricultural pests is imperative for precision crop protection. However, real-world pest recognition faces two critical challenges: the interference of complex field backgrounds, which introduces significant noise, and the severe large intra-class variance caused by pest metamorphosis, which confuses standard classifiers. To address these issues, this paper proposes a coarse-to-fine cascade framework that integrates object localization with fine-grained multi-modal classification. First, we deploy a YOLOv8-based detector to precisely localize and crop pest regions from cluttered environments, effectively eliminating background redundancy. Second, for the cropped targets, we design a fine-grained classification network based on ResNeXt50 integrated with the Convolutional Block Attention Module (CBAM) to extract discriminative features. Crucially, to tackle the challenge of multi-state pest morphologies, we propose a novel Adaptive Multi-Center Classification Head (AMC-Head). Unlike traditional methods that enforce a single feature center for each class, our approach dynamically allocates multiple latent sub-centers for each category, allowing the model to automatically disentangle and cluster distinct morphological representations within a single label. Extensive experiments on the large-scale benchmark dataset IP102 demonstrate that our method achieves an end-to-end accuracy of 91.4%, significantly outperforming single-stage baselines. The proposed framework effectively mitigates the impact of complex backgrounds and metamorphic variation, providing a robust solution for automated pest monitoring.

1. Introduction

Agricultural pests constitute one of the most formidable threats to global food security and ecosystem stability [1]. According to recent statistics from the Food and Agriculture Organization (FAO), plant pests and diseases account for a 20–40% reduction in global crop yields annually, resulting in devastating economic losses exceeding billions of dollars [2]. To mitigate these losses, effective pest management relies heavily on early detection and precise identification. Traditionally, this task has been integrated into Integrated Pest Management (IPM) strategies, which prioritize monitoring to minimize chemical intervention. In practice, pest monitoring still relies on manual field scouting and the use of physical traps, where the subsequent identification of collected specimens requires the specialized knowledge of taxonomists and agricultural experts [3]. While the identification criteria are well-established for major crops, this process remains labor-intensive and time-consuming. Moreover, the subjective nature of human assessment, often performed under field fatigue, can lead to misidentification—such as the historically significant confusion between Helicoverpa armigera and Helicoverpa zea. Such diagnostic errors can result in inadequate management, missing critical prevention windows or leading to the misuse of pesticides [4]. Consequently, with the rapid advancement of Precision Agriculture (PA), there is an urgent and escalating demand for automated, non-invasive, and real-time pest identification systems to facilitate site-specific and environmentally friendly crop protection [5].
In the early stages of automated pest recognition, researchers primarily employed traditional computer vision techniques combined with machine learning algorithms [6]. These approaches typically relied on handcrafted feature extractors, such as Scale-Invariant Feature Transform (SIFT) or Histogram of Oriented Gradients (HOG), to capture morphological traits like texture, shape, and color, which were then fed into classifiers like Support Vector Machines (SVM) [7]. However, these methods exhibit significant limitations in unstructured field environments. First, handcrafted features are highly sensitive to environmental fluctuations; for instance, varying lighting conditions and complex soil/leaf backgrounds introduce pixel-level noise that can easily overwhelm the subtle descriptors of a small pest. Second, these shallow architectures lack the hierarchical representation capability to bridge the “semantic gap”—they focus on low-level geometric patterns but fail to encapsulate the high-level biological essence of the insects. Consequently, while effective in controlled laboratory settings with uniform backdrops, these models generalize poorly to real-world scenarios where pests are often occluded by foliage or camouflaged against complex biotic textures.
In recent years, the paradigm has shifted dramatically with the advent of Deep Learning (DL), a subset of machine learning that utilizes multi-layered neural networks to automatically learn hierarchical representations from raw data. Specifically, Convolutional Neural Networks (CNNs) have emerged as the cornerstone of agricultural computer vision due to their ability to preserve spatial hierarchies in images through local receptive fields and weight sharing [8]. Unlike traditional methods that require manual feature engineering, CNNs can autonomously extract low-level edges, mid-level textures, and high-level semantic structures in an end-to-end fashion. This capability has revolutionized various tasks, including plant disease diagnosis, crop counting, and pest classification [9]. Despite these successes, directly deploying standard CNN architectures to unstructured field scenarios remains fraught with challenges that hamper practical application.
The first major bottleneck is the interference of complex field backgrounds [10]. Unlike images captured in laboratory settings with uniform backdrops, images collected in the wild are subject to uncontrollable environmental factors, including varying lighting, soil clutter, and object occlusion [11]. Most existing approaches attempt to perform classification directly on the entire image. However, standard CNNs often struggle to focus on the Region of Interest (ROI) in such cluttered scenes, leading to models that “cheat” by learning background correlations (e.g., associating specific leaf textures with a pest label) rather than the pest features themselves [12]. While attention mechanisms have been proposed to mitigate this, they are often insufficient when the target pest is extremely small relative to the background. Therefore, a coarse-to-fine strategy—which decouples object localization from fine-grained identification—is essential for robust performance in real-world scenarios.
The second, and perhaps more critical challenge, lies in the large intra-class variance inherent to biological organisms. Pests exhibit complex biological life cycles characterized by either holometabolism or hemimetabolism, both of which introduce significant morphological diversity within the same species [13]. In holometabolous insects, complete metamorphosis results in distinct, often unrecognizable morphologies between immature phases (larvae, pupae) and the adult stage. Conversely, hemimetabolous insects undergo partial metamorphosis; while the overall body plan remains relatively consistent, adults are distinguished from nymphs by the presence of functional wings and reproductive maturity [14,15]. Historically, conventional deep learning classifiers—typically composed of a Global Average Pooling (GAP) layer followed by a single fully-connected (FC) layer—have struggled with this variance. Such architectures enforce a unimodal constraint, mapping all intra-class samples to a singular representative feature center in the high-dimensional latent space. When a class contains such multi-modal distributions, this forced convergence leads to feature confusion and suboptimal generalization, particularly on large-scale datasets like IP102, where diverse life stages are collapsed into a single label.
To address these intertwined challenges, this paper proposes a coarse-to-fine strategy named the Unified Cascade Pest Recognition Framework with Auto-Split Multi-Center Learning. Unlike single-stage methods that process the entire image at once, our framework mimics the inspection process of a human expert, first locating the pest and then scrutinizing its details. Specifically, we employ a two-stage “Divide and Conquer” approach. In the first stage, a high-performance object detector is deployed to precisely localize and crop pest regions, effectively filtering out complex background noise. In the second stage, the cropped targets are fed into a fine-grained classification network integrated with attention mechanisms to extract discriminative features. Crucially, to tackle the biological challenge of metamorphic variance, we introduce a novel Adaptive Multi-Center Classification Head (AMC-Head). This module allows the network to dynamically allocate multiple latent sub-centers for a single category, enabling the model to automatically disentangle and cluster distinct morphological representations (e.g., larvae and adults) under the same label supervision.
Consequently, the primary objective of this study is to establish a robust solution for automated pest monitoring that simultaneously addresses background clutter and morphological diversity. To this end, we propose a coarse-to-fine cascade architecture that decouples object localization from fine-grained identification, effectively suppressing environmental noise by focusing solely on the pest targets. Furthermore, to overcome the biological challenge of metamorphosis, we introduce a novel Adaptive Multi-Center Head that automatically disentangles distinct life stages (e.g., larva vs. adult) into separate sub-centers, thereby resolving the intra-class variance issues. Extensive evaluations on the large-scale IP102 benchmark demonstrate the efficacy of this approach, achieving an end-to-end accuracy of 91.4% and significantly outperforming mainstream single-stage baselines.

2. Materials and Methods

2.1. Dataset and Analysis

Experimental validation was conducted using the IP102 dataset, a large-scale benchmark specifically designed for pest recognition. The dataset comprises 75,222 images categorized into 102 distinct pest classes. To ensure data integrity, we performed a comprehensive statistical re-analysis of the IP102 dataset’s annotation files. Based on this analysis, we constructed the class distribution profile (Figure 1) and taxonomically reorganized the pest categories by host crop and scientific nomenclature (Table 1). While the image quality varies significantly—ranging from high-resolution field photography to noise-heavy internet-crawled samples—this heterogeneity effectively simulates the unconstrained conditions of real-world agricultural monitoring. To ensure biological precision and avoid the ambiguity associated with common names, the categories in Table 1 are listed under their binomial (scientific) names.
Statistically, the dataset exhibits a significant long-tailed class distribution, which directly influences our methodological design. Quantitatively, “head” classes such as Cicadella viridis contain over 3000 samples, whereas “tail” classes like Papilio xuthus are represented by fewer than 50 images. This severe imbalance necessitates a learning framework capable of robust feature extraction despite majority class bias. Structurally, the 102 classes are hierarchically grouped by their primary economic host crops, covering staples (e.g., Rice, Corn, Wheat) and cash crops (e.g., Citrus, Mango). This crop-centric organization requires the model to disentangle pest features from diverse host plant morphologies, distinguishing, for instance, the venation patterns of Zea mays from the broad leaves of Mangifera indica.
The dataset includes pests at various life stages (e.g., eggs, larvae, pupae, and adults). As shown in Figure 2, a larva and an adult of the same species share almost no visual similarity, yet they share the same ground-truth label.

2.1.1. Overview of the Proposed Cascade Framework

To address the aforementioned challenges, we propose a coarse-to-fine cascade framework named Locate-and-Identify Network (L&I-Net). As shown in Figure 3, the pipeline consists of two stages:
  • A YOLOv8-based detector is employed to localize pest targets in complex backgrounds and generate cropped “clean” images.
  • An improved ResNeXt50 architecture serves as the feature extractor, enhanced by the Convolutional Block Attention Module (CBAM). Crucially, the classification head is redesigned as an Adaptive Multi-Center Head (AMC-Head) to handle multi-state pest morphologies.
Complex field backgrounds (e.g., leaves, soil, mulch) act as significant noise for pest recognition. To decouple the pest from the background, we employ YOLOv8 for target detection. YOLOv8 is a state-of-the-art anchor-free detector that balances speed and accuracy. It utilizes a CSPDarknet backbone to extract multi-scale features and a Path Aggregation Network (PANet) for feature fusion. In our framework, YOLOv8 outputs bounding boxes B = (x, y, w, h) for potential pests. We apply a dynamic cropping strategy with a 10% expansion padding around the predicted boxes to ensure the integrity of pest appendages (e.g., antennae, legs). These cropped regions are then resized to 256 × 256 pixels, serving as the “purified” input for the next stage.

2.1.2. Architecture Selection

In unstructured field environments, the pest region of interest (ROI) typically occupies only a small fraction of the image, while the majority consists of irrelevant background noise (e.g., soil, mulch, and complex leaf veins). Direct classification on full-scale images often leads to feature misalignment. To address this, we introduce a dedicated background decoupling stage employing the YOLOv8 detector.
We selected YOLOv8 due to its superior trade-off between inference speed and detection precision. The architecture features two critical improvements suitable for pest detection:
1. C2f Module: By replacing the traditional C3 module with the C2f module (CSP-stage 2 with fusion), the network incorporates more skip connections. This enhances gradient flow, allowing the model to capture richer feature representations of pests with subtle textures.
2. Anchor-Free Head: YOLOv8 adopts an anchor-free paradigm with a decoupled head structure. This separates the classification and regression tasks, significantly improving the model’s adaptability to pests with extreme aspect ratios (e.g., slender stick insects vs. round beetles).

2.1.3. Context-Aware Cropping Strategy

A standard bounding box crop often tightly encloses the object, potentially severing protruding biological structures crucial for identification. In agricultural ecosystems, many pests share morphological similarities with natural enemies (e.g., the superficial resemblance between predatory Syrphidae larvae and herbivorous Lepidopteran larvae). Critical distinguishing features—such as the presence of prolegs, distinct head capsules, or specific antennal structures—are often located at the periphery of the subject. To preserve the complete morphological integrity required to differentiate these subtle biological nuances, we propose a Context-Aware Expansion Strategy.
Let the predicted bounding box be denoted as B = (x_c, y_c, w, h), where (x_c, y_c) is the center coordinate. We first define the expansion offsets δ_w and δ_h with a coefficient η = 0.1:
δ_w = (w/2)(1 + η),  δ_h = (h/2)(1 + η).
Subsequently, the cropped coordinates are calculated by constraining the expanded box within the original image boundaries (W_img, H_img):
x_min = max(0, x_c − δ_w),  y_min = max(0, y_c − δ_h),  x_max = min(W_img, x_c + δ_w),  y_max = min(H_img, y_c + δ_h).
This operation ensures that the cropped image contains the entire pest body along with a minimal margin of contextual information.
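As a minimal sketch, the expansion-and-clamp rule above can be written as follows (the function name and the center-format box arguments are our illustrative choices, not taken from a released implementation):

```python
def context_aware_crop(xc, yc, w, h, img_w, img_h, eta=0.1):
    """Expand a detector box (center format) by a factor eta per side
    and clamp the result to the image boundaries."""
    dw = (w / 2) * (1 + eta)   # delta_w
    dh = (h / 2) * (1 + eta)   # delta_h
    x_min = max(0.0, xc - dw)
    y_min = max(0.0, yc - dh)
    x_max = min(float(img_w), xc + dw)
    y_max = min(float(img_h), yc + dh)
    return x_min, y_min, x_max, y_max

# A box near the image border is expanded but never leaves the image:
box = context_aware_crop(10, 10, 40, 40, img_w=640, img_h=480)
```

For a box well inside the image, each side simply grows by 10%; near the border, the max/min clamps keep the crop valid.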

2.2. Fine-Grained Classification Network

The cropped images from Stage 1 provide a clean, noise-free input. However, classifying pests with high visual similarity (fine-grained recognition) requires a powerful feature extractor. In this stage, we employ an improved ResNeXt50 architecture integrated with attention mechanisms.

2.2.1. Preprocessing and Advanced Augmentation

To mitigate the risk of overfitting—particularly on tail classes—and to enhance the model’s generalization capability, a comprehensive data augmentation strategy was implemented. Let I_crop be the input image; the augmented image Î is generated via a series of transformations:
1. Geometric and photometric transformations: We apply random horizontal flipping and rotation to enforce rotational invariance. Additionally, color jittering (brightness, contrast, and saturation within ±20%) is utilized to simulate diverse lighting conditions in the field.
2. Mixup regularization: To further address the long-tailed distribution and encourage smoother decision boundaries, we introduce Mixup. Unlike traditional augmentation, Mixup constructs virtual training examples by linearly interpolating between two random samples (x_i, y_i) and (x_j, y_j):
x̃ = λx_i + (1 − λ)x_j,  ỹ = λy_i + (1 − λ)y_j,
where λ ∼ Beta(α, α) and α = 0.2. This forces the model to learn less confident predictions for ambiguous samples, effectively preventing the network from memorizing noise in minority classes.
Finally, all images are resized to 256 × 256 pixels and normalized using standard ImageNet statistics.
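The Mixup interpolation can be sketched in NumPy as follows (a minimal illustration assuming one-hot label vectors; the function name is ours):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """Build a virtual sample by convex combination of two inputs
    and their one-hot labels, with lambda ~ Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x_tilde = lam * x_i + (1 - lam) * x_j
    y_tilde = lam * y_i + (1 - lam) * y_j
    return x_tilde, y_tilde, lam
```

With α = 0.2 the Beta distribution is U-shaped, so most mixed samples stay close to one of the two originals, which keeps the soft labels only mildly ambiguous.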

2.2.2. Backbone: ResNeXt with Grouped Convolutions

We adopt ResNeXt50 as the backbone network. While standard ResNet relies on increasing depth (number of layers) or width (number of channels) to improve performance, ResNeXt introduces a new dimension called “Cardinality” (the size of the set of transformations).
ResNeXt replaces the standard convolution in the residual block with a split-transform-merge strategy using Grouped Convolutions. Formally, the output y of a ResNeXt block is defined as:
y = x + ∑_{i=1}^{C} T_i(x),
where x is the input, C = 32 is the cardinality, and T i denotes the transformation within each group. This structure is biologically inspired and has proven superior in capturing the repetitive and fine-grained texture patterns commonly found on insect bodies (e.g., wing scales, compound eyes) with fewer parameters than wide ResNets.
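In PyTorch, the split-transform-merge aggregation is realized with a single grouped convolution rather than C explicit branches. The sketch below is illustrative (the channel and width numbers follow common ResNeXt50 conventions, not values stated in the paper):

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Residual block with cardinality C=32: 1x1 reduce, 3x3 grouped
    convolution (the C parallel transforms T_i), 1x1 expand, plus identity."""
    def __init__(self, channels=256, width=128, cardinality=32):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(channels, width, 1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # y = x + sum_i T_i(x): the grouped conv computes all T_i in parallel
        return self.relu(x + self.transform(x))

x = torch.randn(1, 256, 32, 32)
print(ResNeXtBlock()(x).shape)  # torch.Size([1, 256, 32, 32])
```

The `groups=32` argument partitions the 3×3 convolution into 32 independent paths, which is exactly the summation over T_i in the equation above.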

2.2.3. Attention-Guided Feature Refinement (CBAM)

To further suppress any residual background noise that might have survived the cropping stage, we embed the Convolutional Block Attention Module (CBAM) into the ResNeXt blocks. CBAM sequentially infers attention maps along two separate dimensions: channel and spatial.
As illustrated in Figure 4, the module refines the intermediate feature map F as follows:
F′ = M_c(F) ⊗ F,  F″ = M_s(F′) ⊗ F′,
where ⊗ denotes element-wise multiplication.
1. The Channel Attention Module (CAM) focuses on “what” is meaningful. It aggregates spatial information using both Global Average Pooling and Global Max Pooling to generate channel descriptors, allowing the network to selectively emphasize informative feature channels (e.g., distinct pest patterns):
M_c(F) = σ( MLP(AvgPool(F)) + MLP(MaxPool(F)) ).
2. The Spatial Attention Module (SAM) focuses on “where” the informative part is located. It generates a spatial attention map to highlight the pest body region:
M_s(F) = σ( f^{7×7}( [AvgPool(F); MaxPool(F)] ) ).
This dual-attention mechanism acts as a sophisticated filter, ensuring that the subsequent classification head receives only the most discriminative pest features.
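A compact PyTorch sketch of the two attention equations is given below (the reduction ratio 16 is the conventional CBAM default and an assumption on our part, as the paper does not state it):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """Channel attention (shared MLP over avg/max descriptors) followed by
    spatial attention (7x7 conv over channel-wise avg/max maps)."""
    def __init__(self, channels, reduction=16, kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):
        b, c = x.shape[:2]
        # M_c(F) = sigma(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).view(b, c))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).view(b, c))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # M_s(F') = sigma(f7x7([AvgPool(F'); MaxPool(F')])), pooled over channels
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```

Because both attention maps are multiplicative gates in [0, 1], the module can only re-weight the feature map; its output shape matches its input, so it can be dropped into any ResNeXt stage.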

2.3. Adaptive Multi-Center Classification Head (AMC-Head)

2.3.1. Motivation: The Unimodal Assumption Bottleneck

Standard Convolutional Neural Networks typically employ a Global Average Pooling (GAP) layer followed by a fully connected (FC) layer to map the feature vector x ∈ ℝ^D to class scores. This paradigm implicitly makes a “Unimodal Assumption”: it forces all samples belonging to the same class c to cluster around a single weight vector w_c in the high-dimensional feature space.
However, as discussed in Section 2.1, agricultural pests exhibit drastic morphological changes due to metamorphosis. For instance, the visual features of a Spodoptera litura larva (cylindrical, green body) are orthogonal to those of its adult form (winged, brown moth). Forcing these distinct intra-class modalities to collapse into a single feature center creates a “tension” during optimization, preventing the model from learning a compact representation for either morphology. As visualized in Figure 5a, this leads to loosely packed clusters and confusing decision boundaries.

2.3.2. AMC-Head Formulation

To address this, we propose the Adaptive Multi-Center Head (AMC-Head). Instead of assigning a single prototype to each class, we allocate K latent sub-centers to capture the multi-modal distribution of pest data.
Formally, let the weight matrix of the classification head be expanded from W ∈ ℝ^{C×D} to W′ ∈ ℝ^{(C·K)×D}, where C is the number of classes (102) and K is the hyperparameter representing the number of sub-centers (set to K = 3 in our experiments). For an input feature vector x, the network first computes the affinity scores with all C × K sub-centers. The final logit z_c for class c is obtained via a Max-Feature operation:
z_c = max_{k ∈ {1, …, K}} ( w_{c,k}^T x + b_{c,k} ).
This mechanism functions as a dynamic routing process:
  • If the input image is a larva, it will activate a specific sub-center (e.g., k = 1 ) that specializes in cylindrical features.
  • If the input is an adult, it will activate a different sub-center (e.g., k = 2 ) specialized in wing textures.
Crucially, this specialization emerges automatically during backpropagation without requiring fine-grained sub-labels. As shown in Figure 5b, this approach encourages the formation of multiple tight sub-clusters for each class, significantly improving the separability between different pest species.
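The max-over-sub-centers routing can be sketched in NumPy as follows (shapes and the function name are our illustrative choices; the actual head is a learned FC layer inside the network):

```python
import numpy as np

def amc_logits(x, W, b, num_classes, K):
    """AMC-Head forward pass: score all C*K sub-centers, then keep the
    best-matching sub-center per class as that class's logit.
    x: (D,) feature vector; W: (C*K, D) sub-center weights; b: (C*K,) biases."""
    scores = W @ x + b                      # affinity with every sub-center
    scores = scores.reshape(num_classes, K) # group sub-centers by class
    return scores.max(axis=1)               # z_c = max_k (w_{c,k}^T x + b_{c,k})
```

Only the winning sub-center of each class receives a gradient through the max, which is what lets the sub-centers specialize (e.g., one for larvae, one for adults) without sub-labels.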

2.3.3. Optimization Objective

While Focal Loss is effective for class imbalance, it can lead to instability when combined with multi-center learning, as it may excessively penalize the “correct” sub-centers during the early training phase. Therefore, we adopt Label Smoothing Cross Entropy (LS-CE) as the objective function.
Label smoothing introduces a soft target distribution, which prevents the model from becoming over-confident on a single sub-center and encourages a more robust exploration of the feature space. The loss is defined as:
L = − ∑_{c=1}^{C} q_c · log( exp(z_c) / ∑_{j=1}^{C} exp(z_j) ),
where q_c is the smoothed label:
q_c = 1 − ϵ if c = y (ground truth),  q_c = ϵ/(C − 1) otherwise.
Here, we set the smoothing factor ϵ = 0.1 . As conceptually illustrated in Figure 5, the mathematical intention behind this design is to alter the gradient flow during training. Unlike the hard boundaries enforced by traditional losses, this formulation aims to prevent the model from collapsing into a single sub-center, theoretically enabling the capture of diverse intra-class variations (e.g., larvae vs. adults) within the feature space.
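The loss can be written out directly (a NumPy sketch of the equations above; the function name is ours, and a PyTorch training loop would use the built-in smoothed cross entropy instead):

```python
import numpy as np

def label_smoothing_ce(z, y, num_classes, eps=0.1):
    """Cross entropy against the smoothed target distribution:
    q_y = 1 - eps and q_c = eps / (C - 1) for all c != y."""
    z = z - z.max()                      # shift logits for numerical stability
    log_p = z - np.log(np.exp(z).sum())  # log-softmax
    q = np.full(num_classes, eps / (num_classes - 1))
    q[y] = 1.0 - eps
    return -(q * log_p).sum()
```

With ϵ = 0 this reduces to standard cross entropy; with ϵ = 0.1 each wrong class keeps a small target mass ϵ/(C − 1), which caps the confidence the model can assign to any single sub-center.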

2.4. Implementation Details

2.4.1. Experimental Setup

All experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 4070 SUPER (12 GB) GPU (Nvidia, Santa Clara, CA, USA) and an Intel Core i7 processor (Intel, Santa Clara, CA, USA). The software environment was built upon PyTorch 1.13 and Python 3.9. To ensure fair comparisons, all baseline models were implemented using the MMClassification and Ultralytics libraries under identical hardware constraints.

2.4.2. Training Strategy

Since our framework is a cascade system, the training was performed in two sequential phases:
Phase 1: Detector Training. The YOLOv8 model was initialized with weights pre-trained on the COCO dataset to accelerate convergence. We fine-tuned the model on the IP102 dataset for 50 epochs with an image size of 640 × 640 . The initial learning rate was set to 0.01 with an SGD optimizer, and the batch size was set to 16.
Phase 2: Classifier Training. The ResNeXt50 backbone was initialized with ImageNet-1K V2 weights. The input images were resized to 256 × 256 pixels. We employed the AdamW optimizer, which is known for its superior weight decay handling, with the following hyperparameters:
  • Batch size: 32.
  • Learning rate: initialized at 1 × 10 4 and decayed using a Cosine Annealing schedule to 1 × 10 6 .
  • Weight decay: 1 × 10 2 to prevent overfitting.
  • Epochs: 50.
Regularization techniques including Mixup ( α = 0.2 ) and Label Smoothing ( ϵ = 0.1 ) were applied throughout the training process to handle the long-tailed distribution.
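The cosine annealing used in Phase 2 (typically `torch.optim.lr_scheduler.CosineAnnealingLR` in PyTorch) follows a simple closed form; the standalone sketch below illustrates the decay from 1 × 10⁻⁴ to 1 × 10⁻⁶ over 50 epochs (function name ours):

```python
import math

def cosine_annealing_lr(t, total_epochs=50, lr_max=1e-4, lr_min=1e-6):
    """Learning rate at epoch t under cosine annealing:
    lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / total_epochs))

# Starts at lr_max, decays smoothly, ends at lr_min after 50 epochs.
schedule = [cosine_annealing_lr(t) for t in range(51)]
```

The schedule spends relatively many epochs near both endpoints and moves fastest mid-training, which pairs well with AdamW's decoupled weight decay.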

2.4.3. Evaluation Metrics

To comprehensively evaluate the performance of the proposed method, particularly on the imbalanced dataset, we report four standard metrics: Accuracy (Acc), Precision (Pre), Recall (Rec), and F1-score (F1). Given the multi-class nature of the problem, we calculate the macro-averaged metrics to treat all classes equally, regardless of their sample size:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1-score = (2 × Precision × Recall) / (Precision + Recall)
where TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively. Additionally, we use Grad-CAM for qualitative analysis to visualize the model’s focus regions.
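Macro-averaging as described above can be sketched directly from per-class TP/FP/FN counts (a toy NumPy illustration; in practice a library such as scikit-learn would be used):

```python
import numpy as np

def macro_metrics(y_true, y_pred, num_classes):
    """Compute accuracy plus macro-averaged precision, recall and F1:
    each metric is computed per class, then averaged with equal weight."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p); recalls.append(r); f1s.append(f1)
    acc = float(np.mean(y_true == y_pred))
    return acc, float(np.mean(precisions)), float(np.mean(recalls)), float(np.mean(f1s))
```

Because every class contributes equally to the average, rare "tail" species influence the macro scores as strongly as abundant ones, which is the point of using this variant on IP102.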

3. Results and Discussion

3.1. Comparative Analysis with State-of-the-Art Methods

To demonstrate the superiority of the proposed L&I-Net, we conducted comprehensive comparative experiments on the IP102 test set. We benchmarked our framework against several mainstream Convolutional Neural Networks (CNNs) widely used in agricultural pest recognition, including VGG16, ResNet50, DenseNet121, and the lightweight MobileNetV3. For fair comparison, all baseline models were initialized with ImageNet pre-trained weights and fine-tuned using the same hyperparameters (learning rate, batch size, and optimizer) as our classification stage.

3.1.1. Quantitative Performance Comparison

The quantitative results are summarized in Table 2. The proposed framework achieves an outstanding overall Accuracy of 91.40% and an F1-score of 90.85%, significantly outperforming all baseline methods.
  • Discussion: This substantial margin (e.g., +18.9% over ResNet50) can be attributed to the fundamental difference in feature extraction strategies. Standard CNNs force the network to simultaneously learn localization (where is the pest?) and classification (what is the pest?) from full images, often leading to overfitting on dominant background features like soil or leaf textures. In contrast, our cascade approach decouples these tasks, ensuring that the classification head receives only relevant, clean biological signals.

3.1.2. Class-Wise Performance Analysis

To further analyze the model’s robustness across different categories, we visualize the Confusion Matrix in Figure 6. The matrix exhibits a strong diagonal dominance, indicating that the majority of pest classes are correctly identified.
  • Discussion: 
  • Metamorphic Robustness: Notably, our method shows exceptional performance on pests with distinct life stages, such as Mythimna separata (Armyworm). Baseline models frequently confused the larvae of this species with other caterpillars due to their visual disparity from the adult moth. The high accuracy here confirms that the AMC-Head successfully learned to map these distinct morphologies (larva and adult) to the same semantic label without requiring manual sub-class supervision.
  • Remaining Challenges: However, slight confusion persists among physically similar species (e.g., differing only by wing spot patterns). This misclassification is likely due to inter-species mimicry and shared host plants, suggesting that future work should focus on even finer-grained attention mechanisms to capture subtle texture differences.

3.2. Ablation Studies

To rigorously investigate the contribution of each component in our proposed L&I-Net, we designed three sets of ablation experiments.

3.2.1. Contribution of Individual Modules

We define the baseline model as a standard ResNet50 trained on original images using Cross-Entropy loss. We then incrementally integrated our proposed modules: (1) Stage 1 Cropping (YOLO), (2) CBAM Attention, (3) Mixup Augmentation, and (4) AMC-Head. The results are detailed in Table 3.
  • Effect of Background Decoupling (+YOLO): Introducing the cropping stage brings the most significant immediate gain, boosting accuracy by +5.2%. This empirical evidence supports our hypothesis that “environmental noise” is the primary bottleneck in field monitoring. By physically removing the background, we force the classifier to focus solely on the pest, effectively bridging the domain gap.
  • Effect of Attention (+CBAM): Adding CBAM further improves accuracy by +1.9%. This indicates that even within the cropped pest body, discriminative information is non-uniform—often concentrated in key areas like antennae or wing veins—rather than the smooth body surface.
  • Effect of Multi-Center Learning (+AMC): Replacing the standard classifier with our AMC-Head yields a final boost of +2.1%. Notably, the F1-score improves disproportionately, suggesting that the multi-center mechanism is particularly effective for balancing precision and recall in complex categories (e.g., pests with high intra-class variance).

3.2.2. Sensitivity Analysis of Sub-Centers (K)

A critical hyperparameter in our AMC-Head is K, the number of latent sub-centers per class. Intuitively, K should correspond to the major morphological states of the pest. We evaluated the model performance with K varying from 1 (Standard Softmax) to 5. As illustrated in Figure 7, the performance peaks at K = 3 .
  • Discussion: This statistical result aligns remarkably well with the biological reality of holometabolous insects, which typically exhibit three distinct visual phases: (1) Larva, (2) Pupa, and (3) Adult.
  • When K = 1 , the model suffers from the unimodal bottleneck, struggling to compress divergent forms into one center.
  • When K = 2 , accuracy improves significantly as it captures the primary binary variance (Larva vs. Adult).
  • At K = 3 , the model achieves optimal performance, likely capturing an additional “intermediate” or “occluded” state.
  • Increasing K to 4 or 5 leads to diminishing returns and potential overfitting, as the sub-centers become too sparse for the available training data.

3.2.3. Performance on Long-Tailed Categories

To verify whether our method genuinely addresses the long-tailed distribution problem, we divided the 102 classes into three groups based on sample size: Head (>500 images), Medium (100–500 images), and Tail (<100 images).
Figure 8 compares the accuracy of the Baseline model versus our L&I-Net. While the improvement in Head classes is moderate (+4.5%), the improvement in Tail classes is drastic, jumping from 62.1% to 81.5% (+19.4%).
  • Discussion: This result proves that standard classifiers tend to bias heavily towards majority classes. In contrast, the combination of Mixup (which expands the data manifold) and AMC-Head (which prevents minority features from being averaged out) successfully preserves the diversity of the feature space. This ensures the model does not ignore rare pests, which is crucial for early-stage pest warning systems.

3.3. Qualitative Analysis

To provide deeper insights into the interpretability of our model, we visualized the intermediate features using Grad-CAM and the detection results.

3.3.1. Attention Visualization (Grad-CAM)

We visualized the class activation maps of the final convolutional layer to understand where the model focuses. As shown in Figure 9, we compare the Baseline (ResNet50) with our L&I-Net.
  • Discussion:
  • Baseline Failure Analysis: In the first row (complex background), the Baseline model’s attention (red region) is scattered across the surrounding leaves and soil. This visualization exposes the “background bias” inherent in single-stage CNNs: without explicit localization, the model erroneously learns environmental correlations (e.g., associating “soil color” with “ground pests”) rather than actual biological features.
  • Success Analysis (Ours): Conversely, our model accurately localizes the key discriminative parts of the pest (e.g., the texture on the wing), even when the pest's coloration is similar to the background. This confirms that the combination of YOLO cropping and CBAM effectively suppresses environmental interference, forcing the network to learn genuinely invariant features.
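For reference, the maps in Figure 9 follow the standard Grad-CAM recipe: channel weights are the spatially averaged gradients, and the heatmap is the ReLU of the weighted sum of activations. A framework-agnostic sketch, assuming the target layer's activations and gradients have already been captured (e.g., via forward/backward hooks):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer's forward activations and
    backpropagated gradients, both of shape (C, H, W).

    Channel importance = global-average-pooled gradient; the map is the
    ReLU of the importance-weighted channel sum, normalized to [0, 1].
    """
    weights = gradients.mean(axis=(1, 2))             # (C,) per-channel weight
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0)                          # ReLU: keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                              # normalize for display
    return cam
```

The resulting (H, W) map is then upsampled to the input resolution and overlaid on the image, as in Figure 9.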

3.3.2. Visualization of Detection and Cropping

Figure 10 illustrates the output of Stage 1. Despite the pests being small or camouflaged, the YOLOv8 detector successfully generates tight bounding boxes.
  • Discussion: Critically, the “Context-Aware Expansion” strategy ensures that the cropped images preserve complete morphological structures—such as the long antennae of the Rice Bug shown in the third column. As highlighted in Section 3.2, preserving these peripheral structures is vital for distinguishing pests from morphologically similar natural enemies (e.g., Syrphidae larvae). The visualization confirms that our cropping strategy retains the necessary “biological context” that standard tight cropping might discard.
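The expansion step itself can be sketched as a simple box inflation with boundary clamping. The 15% margin below is an illustrative value of our choosing, not necessarily the ratio used in the paper:

```python
def expand_box(x1, y1, x2, y2, img_w, img_h, ratio=0.15):
    """Context-aware expansion (sketch; `ratio` is an assumed margin).

    Enlarges a detector box by `ratio` of its width/height on each side
    before cropping, so peripheral structures such as long antennae
    survive, then clamps the result to the image bounds.
    """
    dw = (x2 - x1) * ratio
    dh = (y2 - y1) * ratio
    return (max(0.0, x1 - dw), max(0.0, y1 - dh),
            min(float(img_w), x2 + dw), min(float(img_h), y2 + dh))
```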

4. Conclusions

In this study, we addressed the dual challenges of complex background interference and intra-class morphological variance in large-scale agricultural pest recognition. We proposed a novel coarse-to-fine cascade framework, L&I-Net, which synergizes deep localization with adaptive multi-center classification.
By integrating a context-aware expansion strategy, our method effectively decouples pests from chaotic field backgrounds, ensuring that the classifier focuses on biologically relevant features. Furthermore, the proposed AMC-Head breaks the “unimodal assumption” of traditional CNNs. By allocating latent sub-centers for distinct life stages, the model successfully captures the diversity of pest metamorphosis (e.g., larva vs. adult), achieving a state-of-the-art accuracy of 91.40% on the IP102 dataset. These results validate that decoupling localization from classification is a superior strategy for fine-grained recognition tasks in complex natural environments.
Regarding the trade-off between speed and accuracy, our two-stage architecture prioritizes recognition precision, which is critical for accurate pest counting and diagnosis. While the current inference speed is well-suited for high-performance workstations or cloud-based analysis, deployment on extremely low-power edge devices remains a promising direction for extension. Future research may explore model lightweighting techniques, such as Knowledge Distillation, to transfer the multi-center knowledge to compact networks, thereby extending the applicability of L&I-Net to real-time mobile monitoring scenarios.

Author Contributions

Conceptualization, H.S., L.Z. and S.L.; methodology, H.S.; software, H.S.; validation, H.S. and Y.L.; formal analysis, H.S.; investigation, Y.L.; resources, L.Z. and S.L.; data curation, H.S.; writing—original draft preparation, H.S.; writing—review and editing, L.Z. and S.L.; visualization, H.S. and Y.L.; supervision, L.Z. and S.L.; project administration, L.Z.; funding acquisition, L.Z. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Key Project of the XPCC Division and City, grant number 2024GG1502; the Key Research and Development Program of Shandong Province (SME Innovation Capacity Improvement Project), grant number 2024TSCC0205; and the “Tianchi Talent” Introduction Plan of Xinjiang Uygur Autonomous Region (2023). The APC was funded by Qingdao Agricultural University Doctoral Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/xpwu95/IP102 (accessed on 20 February 2026). The version used in this study is IP102.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNNs	Convolutional Neural Networks
CBAM	Convolutional Block Attention Module
Grad-CAM	Gradient-weighted Class Activation Mapping
AMC-Head	Adaptive Multi-Center Classification Head
FAO	Food and Agriculture Organization
IPM	Integrated Pest Management
SOTA	State-of-the-Art
PA	Precision Agriculture
SIFT	Scale-Invariant Feature Transform
HOG	Histogram of Oriented Gradients
DL	Deep Learning
ROI	Region of Interest
CE	Cross-Entropy
FL	Focal Loss
SGD	Stochastic Gradient Descent
TL	Transfer Learning

Figure 1. Statistical distribution of the IP102 dataset derived from our analysis of the ground-truth annotations. The histogram highlights the severe long-tailed imbalance between head and tail classes.
Figure 2. Examples of intra-class variance in the IP102 dataset. Images (a,b) belong to the same class “cabbage army worm” but exhibit distinct morphologies.
Figure 3. The overall architecture of the proposed cascade framework. Stage 1 utilizes YOLOv8 for background decoupling, and Stage 2 utilizes an Attention-based ResNeXt with Multi-Center Head for fine-grained identification.
Figure 4. The architecture of the proposed Attention-Integrated Residual Block. The CBAM module is seamlessly inserted after the final convolution layer of the block to adaptively recalibrate feature responses.
Figure 5. Schematic illustration of the optimization objective. This diagram conceptualizes how Label Smoothing (a) is expected to encourage broader feature exploration compared to the rigid constraints of Focal Loss (b).
Figure 6. Confusion Matrix of the proposed L&I-Net on the IP102 test set. The distinct diagonal line demonstrates high classification accuracy across most of the 102 categories, proving the model’s effectiveness in handling large-scale fine-grained recognition tasks.
Figure 7. Impact of the number of sub-centers (K) on model accuracy. The shaded region represents the standard deviation across 5 independent runs. K = 3 provides the best trade-off between model capacity and generalization.
Figure 8. Accuracy comparison across different class frequencies (Head, Medium, Tail). Our method demonstrates superior robustness, particularly on the Tail classes where data is scarce.
Figure 9. Grad-CAM visualization comparison. (a) Baseline Attention (often distracted by background); (b) Ours Attention (precisely focused on the pest body).
Figure 10. Qualitative results of the localization stage. The green boxes represent the detection results, and the images on the right show the cropped “clean” targets fed into the classifier.
Table 1. Taxonomic composition of the IP102 dataset, curated by the authors. We reorganized the original 102 classes based on their primary economic host crops and cross-referenced biological databases (e.g., CABI, GBIF) to provide accurate scientific names.

| Target Crop | Classes | Representative Pest Species (Scientific Name) |
|---|---|---|
| Rice | 14 | Orseolia oryzae, Cnaphalocrocis medinalis, Lissorhoptrus oryzophilus, Chilo suppressalis |
| Corn | 13 | Ostrinia furnacalis, Mythimna separata, Agriotes spp., Agrotis ipsilon |
| Wheat | 9 | Dolerus tritici, Petrobia latens, Sitobion avenae, Penthaleus major |
| Beet | 8 | Spodoptera exigua, Pegomya hyoscyami, Loxostege sticticalis, Achyra rantalis |
| Alfalfa | 7 | Hypera postica, Peridroma saucia, Bruchophagus roddi, Adelphocoris lineolatus |
| Vitis (Grape) | 10 | Xylotrechus pyrrhoderus, Erythroneura apicalis, Bactrocera dorsalis, Theretra japonica |
| Citrus | 12 | Phyllocnistis citrella, Diaphorina citri, Panonychus citri, Aleurocanthus spiniferus |
| Mango | 6 | Procontarinia matteiana, Deanolis sublimbalis, Idioscopus clypealis |
| Others | 23 | Helicoverpa armigera, Locusta migratoria, Epicauta gorhami |
| Total | 102 | 75,222 Images (Hierarchically Structured) |
Table 2. Performance comparison of different methods on the IP102 dataset. “Pre-trained” indicates initialization with ImageNet weights. The best results are highlighted in bold.

| Method | Backbone | Precision (%) | Recall (%) | Accuracy (%) |
|---|---|---|---|---|
| VGG16 | VGG-16 | 78.20 | 76.45 | 77.85 |
| MobileNetV3 | MobileNet-Small | 80.12 | 78.90 | 79.15 |
| ResNet50 | ResNet-50 | 79.45 | 71.20 | 72.50 |
| DenseNet121 | DenseNet-121 | 85.60 | 84.80 | 85.33 |
| EfficientNet-B0 | EfficientNet | 86.15 | 85.90 | 86.40 |
| L&I-Net (Ours) | ResNeXt50 + AMC | **91.25** | **90.50** | **91.40** |
Table 3. Step-wise ablation study on the IP102 dataset. “Baseline” refers to ResNet50 trained on full images. The progressive improvements confirm the effectiveness of each module.

| Model Variant | YOLO Crop | CBAM | Mixup | AMC-Head | Accuracy (%) |
|---|---|---|---|---|---|
| Baseline (ResNet50) | × | × | × | × | 82.50 |
| + Stage 1 Decoupling | ✓ | × | × | × | 87.72 (+5.22) |
| + Attention Module | ✓ | ✓ | × | × | 89.65 (+1.93) |
| + Mixup Strategy | ✓ | ✓ | ✓ | × | 90.30 (+0.65) |
| + AMC-Head (Ours) | ✓ | ✓ | ✓ | ✓ | 91.40 (+1.10) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Su, H.; Zhao, L.; Liang, Y.; Liu, S. Tackling Metamorphosis and Complex Backgrounds: A Coarse-to-Fine Network for Fine-Grained Agricultural Pest Recognition. Appl. Sci. 2026, 16, 2191. https://doi.org/10.3390/app16052191
