1. Introduction
With the intensification of global climate change and continued expansion of agricultural production, agricultural harmful organisms—including insect pests, plant pathogens, and weeds—pose increasingly serious threats to food security and sustainable agricultural development. According to the Food and Agriculture Organization of the United Nations (FAO), approximately 30–40% of global crop production is lost annually due to the impact of harmful organisms, resulting in economic losses exceeding USD 100 billion. Insect pests exacerbate these losses through long-distance migration and the development of pesticide resistance, while pathogenic microorganisms continuously evolve via genetic mutations, enabling them to overcome conventional control strategies. Meanwhile, weeds spread rapidly owing to their strong ecological adaptability and invasive capacity. Together, these factors threaten not only agricultural productivity but also ecosystem stability.
As a major agricultural country, China faces dual pressures from both invasive and native harmful organisms. Since its invasion in 2019, Spodoptera frugiperda has rapidly spread across 26 provinces, causing severe damage to maize and other staple crops. Similarly, Mikania micrantha, recognized as one of the world’s most destructive invasive weeds, poses significant threats to agricultural production and ecosystem health by suppressing photosynthesis and displacing native vegetation. In addition to invasive species, certain native and invasive insects exhibit high morphological similarity, such as Hyphantria cunea and Spilosoma obliqua, which often leads to misidentification during manual inspection. These challenges substantially increase the difficulty of accurate identification and timely control in practical agricultural production and quarantine scenarios.
Traditional approaches for identifying agriculturally harmful organisms mainly rely on morphological observation or molecular diagnostic techniques. These methods typically require experienced specialists to perform labor-intensive examination, measurement, and comparison, which makes them time-consuming and costly. Moreover, their effectiveness is often constrained by subjective judgment and limited scalability, rendering them unsuitable for large-scale applications such as port quarantine, field surveillance, and real-time monitoring. In recent years, with the rapid development of computer vision and deep learning technologies, image-based automated recognition methods have emerged as promising alternatives for agricultural harmful organism identification.
Convolutional neural networks (CNNs) have achieved substantial progress in agricultural image analysis and recognition tasks. Zhao et al. [1] proposed an MOG2–YOLOv4-based detection model for Locusta migratoria, effectively alleviating recognition failures caused by motion blur and occlusion. Wang et al. [2] improved the AlexNet architecture to enhance the accuracy of crop disease identification. Bo et al. [3] introduced FOTCA, a hybrid Transformer–CNN model, achieving an accuracy of 99.8% and an F1-score of 0.9931 in leaf disease recognition. These studies demonstrate the effectiveness of deep learning methods for single-category agricultural recognition tasks. However, most existing approaches focus on a single type of biological target, limiting their applicability in complex agricultural environments where multiple harmful organisms often coexist.
Especially in real-world application scenarios such as field monitoring or quarantine inspection, targets often appear at small scales with subtle visual differences and strong background interference, and the discriminative cues are frequently confined to limited local regions such as lesion boundaries, texture variations, or fine-grained insect structures. Convolutional representations that rely solely on global features are easily affected by illumination variations, cluttered backgrounds, and occlusions, which can lead to confusion between visually similar classes. To improve robustness under cluttered backgrounds and alleviate fine-grained discrimination difficulties caused by high inter-class resemblance, attention mechanisms that highlight key regions while suppressing irrelevant background information have therefore been widely introduced into agricultural image recognition. Attention-based designs improve robustness by selectively emphasizing informative visual cues while suppressing background noise in plant disease recognition tasks [4]. Among them, Coordinate Attention (CA) explicitly embeds positional information into channel attention via direction-aware encoding, offering a favorable trade-off between localization capability and computational cost for fine-grained agricultural recognition [5].
Although attention mechanisms can enhance feature representation in complex backgrounds and improve fine-grained discrimination, most existing studies focus on a single target type (e.g., diseases only or pests/weeds only) and are typically built upon single-category datasets and task settings. Because different biological groups differ substantially in morphological structure, scale distribution, and imaging backgrounds, models developed for a single target are difficult to transfer directly to real agricultural scenarios where multiple groups coexist, thereby underscoring the need for a unified multi-class recognition framework. In practice, in applications such as port quarantine inspection, field diagnosis, and ecological management, crop diseases, pests, and weeds often co-occur under complex background interference, making single-species or single-category recognition paradigms insufficient for intelligent monitoring. Moreover, existing studies lack systematic cross-group sample resources and unified benchmarks, which further constrain model generalization and undermine real-world applicability.
Therefore, the recognition task in this study can be broadly categorized into coarse-grained and fine-grained recognition: at the coarse-grained level, targets are grouped into three major categories (pests, weeds, and crop diseases), highlighting the capability of unified recognition across different biological groups; at the fine-grained level, the task is further refined to distinguish specific species and disease categories (120 classes in total), where discrimination typically relies on subtle differences in local morphology and texture. To tackle the challenges in multi-class recognition of agricultural harmful organisms under complex scenarios, including fine-grained discrimination difficulty, substantial intra-class appearance variation, and imbalanced class distributions, this study focuses on fine-grained 120-class classification as the primary task. The main contributions are as follows:
(1) constructing a comprehensive image dataset covering three major categories of agricultural harmful organisms (pests, weeds, and crop diseases), forming a multi-species, multi-source, and multi-scene data framework;
(2) designing an improved DenseNet121-based multi-class recognition model (DenseNet-CSL) that integrates Coordinate Attention, Deep Supervision, and Label Smoothing to enhance feature representation, stabilize optimization, and mitigate class-imbalance effects;
(3) evaluating the effectiveness and generalization of DenseNet-CSL on multi-class recognition tasks using standard classification metrics, and demonstrating its feasibility as a technical approach for intelligent identification of agriculturally harmful organisms.
2. Materials and Methods
2.1. Dataset Acquisition
The dataset used in this study consists of images representing three major categories of agriculturally harmful organisms: insect pests, weeds, and crop diseases. All samples were carefully verified and annotated by experts in agriculture and plant protection to ensure labeling accuracy and reliability. Specifically, the insect and weed datasets were manually collected using smartphones in field environments across different regions of China, while the sugarcane disease images in the crop disease dataset were likewise captured using smartphones in Guangxi, China. The remaining disease images were obtained from two publicly available datasets: (1) IDADP (Agricultural Disease and Pest Image Database) [6], from which we used the corn- and rice-related disease subsets; and (2) PlantVillage [7], from which we used images for disease categories other than corn, rice, and sugarcane. For transparency, the PlantVillage data were downloaded via a Kaggle mirror: https://www.kaggle.com/datasets/abdallahalidev/plantvillage-dataset (accessed on 5 September 2022).
To enhance dataset representativeness, the collected images cover a wide range of growth stages, ecological environments, and imaging conditions. This multi-source data collection strategy aims to reflect the complexity of real agricultural scenarios and to support robust evaluation of multi-class recognition performance.
2.1.1. Pest Dataset
The pest dataset includes a total of 62 insect species belonging to 36 genera, comprising two larval species (Spodoptera exigua and Spodoptera frugiperda) and 60 adult pest species such as Zeugodacus tau, Carpomya vesuviana, and Lymantria dispar. In total, 9628 pest images were collected. These images were acquired under natural field conditions and cover a wide range of developmental stages, body postures, and background environments, as summarized in Table 1.
As illustrated in Figure 1, the constructed pest dataset exhibits three prominent types of phenotypic variation.
First, sexual dimorphism is evident in certain species, where males and females share similar coloration but differ markedly in morphological characteristics such as wing-tip structure or abdominal features.
Second, high inter-class similarity exists among several species, making visual discrimination challenging. For example, species pairs such as 47 and 48, as well as groups including species 22, 25, 26, and 50, display strong resemblance in wing shape, body coloration, and overall morphology, which increases the likelihood of misclassification.
Third, substantial intra-class variability is observed, with individuals of the same species showing noticeable differences in coloration or appearance. For instance, although species 41 maintains consistent structural traits, individuals may vary considerably in body color.
Overall, the pest dataset is characterized by pronounced sexual dimorphism, high inter-class similarity, and significant intra-class variability. These factors collectively increase the complexity of pest recognition tasks and pose substantial challenges for accurate classification in practical agricultural applications.
2.1.2. Weed Dataset
The weed dataset comprises 28 weed species belonging to 24 genera. Representative species include Spartina alterniflora, Ipomoea cairica, Mimosa pudica var. diplotricha, Amaranthus spinosus, Solanum torvum, Melinis repens, and Mikania micrantha. In total, 5042 weed images were collected, covering three typical habitat scenarios: isolated individual plants, sparsely distributed patches, and densely aggregated growth communities, as detailed in Table 2.
As shown in Figure 2, sample diversity within the weed dataset was enhanced by varying the shooting time, camera angle, and shooting distance during image acquisition. Nevertheless, under natural field conditions, the spatial distribution and abundance of weed species differ substantially. Species such as Erigeron annuus, Ambrosia artemisiifolia, and Spartina alterniflora exhibit strong growth vigor and wide geographic distribution, resulting in relatively large sample sizes. In contrast, species such as Plantago major occur more sparsely, yielding fewer samples and leading to a certain degree of class imbalance within the dataset.
2.1.3. Disease Dataset
The crop disease dataset covers nine major crops, including wheat, rice, maize, sugarcane, tomato, apple, tea, grape, and potato, and consists of 30 representative crop diseases, totaling 9325 images. The selected diseases reflect common and economically significant threats encountered in agricultural production. Detailed information on disease types and sample quantities is provided in Table 3.
Most disease images were obtained from publicly available datasets, primarily the PlantVillage dataset on the Kaggle platform, which was constructed based on extensive experimental research conducted by the original authors and has been widely used in previous studies. To ensure consistency across different data sources, all images in the disease dataset were standardized in terms of image format and resolution. Representative samples from the disease dataset are shown in Figure 3.
2.1.4. Data Analysis and Data Preprocessing
To further analyze the overall characteristics of the constructed dataset, statistical visualization was performed using Origin 2024 (OriginLab Corporation, Northampton, MA, USA), as shown in Figure 4. Two key observations can be drawn from the distribution results. First, there is a pronounced imbalance in sample quantity across categories, with the largest class containing 690 samples and the smallest only 16. This uneven distribution reflects the natural occurrence frequency of different harmful organisms but also introduces challenges for model training. Second, although all images share a consistent file format, noticeable variation exists in image resolution, with some samples exhibiting substantial differences in pixel size.
To address inconsistencies in image resolution, we applied a unified preprocessing procedure to all images, resizing them to a standardized 224 × 224 square input. In addition, for classes with fewer than 250 images, we applied four data augmentation strategies to the training set to alleviate class imbalance and improve model generalization: horizontal flipping, vertical flipping, random rotation, and color jittering. After augmentation, the number of images increased from 23,995 to 30,000.
Overall, the constructed dataset demonstrates notable advantages in terms of species diversity, category coverage, and scene variability, while also presenting realistic challenges such as class imbalance and heterogeneous visual characteristics. These properties provide a representative and demanding benchmark for evaluating multi-class recognition performance in complex agricultural environments.
2.2. Classification Model
2.2.1. DenseNet121
DenseNet (Dense Convolutional Network) [8], proposed by researchers from Cornell University, Tsinghua University, and Facebook AI Research in 2017, is a compact and efficient convolutional neural network architecture. The core idea of DenseNet lies in its dense connectivity pattern, in which each layer receives the feature maps of all preceding layers as input. This design promotes extensive feature reuse, enhances information flow between layers, alleviates the vanishing gradient problem, and enables high representational capacity with a relatively small number of parameters.
DenseNet121 is a representative model within the DenseNet family, consisting of 121 layers organized into four Dense Blocks and three Transition Layers. To reduce computational complexity and improve parameter efficiency, DenseNet121 adopts bottleneck structures based on 1 × 1 convolutions as well as channel compression strategies in the transition layers. Owing to its effective feature propagation mechanism and compact architecture, DenseNet121 has demonstrated strong performance in various image recognition tasks, including medical image analysis and object classification.
Considering the balance between network depth, computational efficiency, and recognition accuracy, DenseNet121 was selected as the backbone network in this study. Based on this architecture, targeted structural enhancements were introduced to better address the challenges of multi-class agricultural harmful organism recognition. The overall structure of DenseNet121 is illustrated in Figure 5.
2.2.2. Component Selection Rationale
Attention mechanisms are commonly used in agricultural image recognition to suppress background noise and highlight discriminative regions, thereby improving the capture of fine-grained differences. Existing attention modules can be broadly categorized into channel attention, spatial attention, and their combined forms, with different designs involving distinct trade-offs between performance gains and computational overhead. Channel attention structures represented by SE (Squeeze-and-Excitation) [9] and ECA (Efficient Channel Attention) [10] are relatively simple and easy to deploy, but they are less position-sensitive for tasks that require fine-grained focusing on local regions. CBAM (Convolutional Block Attention Module) [11] integrates channel and spatial attention sequentially, which can enhance spatial localization but typically introduces additional computational overhead and deployment cost. In contrast, Coordinate Attention (CA) incorporates direction-aware positional encoding within the channel-attention framework, explicitly injecting spatial information while remaining lightweight, thus balancing local-region sensitivity and computational efficiency. Compared with CBAM, CA emphasizes efficiency while improving localization; compared with SE and ECA, CA can more effectively focus on local discriminative regions. Based on these trade-offs, CA was selected as the attention module in this study.
In multi-scenario, multi-class recognition of agricultural harmful organisms, inter-class differences can be very subtle, and the model needs to capture fine-grained local textures as well as higher-level semantic information. Relying solely on the final-layer output may weaken supervision for intermediate layers, resulting in suboptimal mid-level feature learning and, in some cases, unstable convergence. To address this issue, one common approach is to adopt deeper or higher-capacity backbones to enhance representation ability, apply stronger regularization, or introduce multi-scale feature fusion modules. However, compared with deep supervision, these methods typically incur additional parameters and computational cost. Deep supervision is easier to implement and provides auxiliary gradient signals that can effectively stabilize optimization and encourage meaningful feature learning at multiple depths. Therefore, we adopt deep supervision in this work as a key strategy to improve training stability and multi-level representation learning.
In large-scale multi-class recognition tasks, models often exhibit overconfidence, assigning excessively high confidence to predicted classes. This issue becomes particularly pronounced under class imbalance, heterogeneous sample quality, or the presence of label noise. Label smoothing regularizes the output layer by converting hard one-hot labels into softened targets, thereby discouraging overly confident fitting to the training samples. This strategy typically improves generalization, reduces the risk of overconfidence, and provides better robustness to potential label noise. In addition, approaches such as focal loss, class-weighted loss, and threshold tuning/confidence calibration can alleviate the effects of class imbalance to some extent. However, our goal is to improve generalization and robustness without substantially increasing training complexity or computational overhead. Therefore, we adopt label smoothing as an output regularization strategy to reduce overconfidence and to mitigate potential overfitting arising from data noise and imbalanced class distributions.
2.2.3. Architecture of the DenseNet-CSL Network
To address the challenges associated with high sample complexity, inter-class similarity, and class imbalance in multi-species recognition tasks, this study proposes an improved network architecture, termed DenseNet-CSL, based on DenseNet121. The overall structure of the proposed model is shown in Figure 6. DenseNet-CSL enhances the baseline network by integrating three complementary strategies: Coordinate Attention, Deep Supervision, and Label Smoothing, each designed to target specific limitations of multi-class agricultural image recognition.
First, a Coordinate Attention (CA) mechanism is introduced after each Dense Block. While DenseNet effectively promotes feature reuse through dense connections, its ability to explicitly model spatial location information and long-range channel dependencies remains limited. The CA module addresses this issue by embedding positional information into channel attention. Specifically, it decomposes conventional two-dimensional global average pooling into separate horizontal and vertical encoding processes, enabling the network to capture both spatial structure and channel-wise importance simultaneously.
By incorporating coordinate information, the CA mechanism allows the network to focus more precisely on discriminative regions of harmful organisms, even under complex backgrounds and varying imaging conditions. This enhancement is particularly beneficial for agricultural images, where targets often exhibit subtle morphological differences and appear in cluttered natural environments. The detailed structure of the Coordinate Attention module is illustrated in Figure 7.
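The CA block described above can be sketched in PyTorch as follows. This is a minimal re-implementation of the published Coordinate Attention design; the reduction ratio and activation choice are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal Coordinate Attention block: direction-aware pooling along
    height and width, a shared 1x1 bottleneck, and per-direction gates."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)     # bottleneck width (assumed)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # Decompose 2D global pooling into two 1D encodings.
        x_h = x.mean(dim=3, keepdim=True)                       # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (n, c, w, 1)
        y = torch.cat([x_h, x_w], dim=2)                        # (n, c, h+w, 1)
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                         # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))     # (n, c, 1, w)
        return x * a_h * a_w    # re-weight features by both attention maps
```

In DenseNet-CSL such a block would follow each Dense Block, with `channels` matching that block's output width.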
Second, to alleviate gradient vanishing and improve training stability, deep supervision branches are introduced after the second and third Dense Blocks. By allowing intermediate feature representations to directly participate in loss backpropagation, deep supervision enhances gradient flow and encourages effective feature learning at different network depths. This design enables the model to capture multi-scale semantic information simultaneously, which is particularly important for recognizing agriculturally harmful organisms that exhibit large variations in size, morphology, and appearance.
With deep supervision, auxiliary classifiers are attached to intermediate layers, and their corresponding losses are jointly optimized with the main classifier. The overall loss function can be expressed as follows:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{main}} + \lambda_1 \mathcal{L}_{\mathrm{aux}_1} + \lambda_2 \mathcal{L}_{\mathrm{aux}_2}$$

where $\mathcal{L}_{\mathrm{main}}$ denotes the loss of the main classifier, $\mathcal{L}_{\mathrm{aux}_i}$ represents the loss of the $i$-th auxiliary classifier, and the coefficients $\lambda_1$ and $\lambda_2$ serve as weighting factors to balance the contributions of the main loss and the auxiliary losses.
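The joint objective can be sketched as below: auxiliary heads after the second and third Dense Blocks contribute weighted cross-entropy terms alongside the main head. The auxiliary weights used here are illustrative assumptions, not the paper's reported values.

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(main_logits, aux_logits_list, targets,
                          aux_weights=(0.3, 0.3)):
    """Main cross-entropy loss plus weighted auxiliary losses
    (one per intermediate classifier)."""
    loss = F.cross_entropy(main_logits, targets)
    for lam, aux_logits in zip(aux_weights, aux_logits_list):
        loss = loss + lam * F.cross_entropy(aux_logits, targets)
    return loss

# Placeholder logits standing in for the three classifier heads.
main_logits = torch.randn(8, 120)
aux1, aux2 = torch.randn(8, 120), torch.randn(8, 120)
targets = torch.randint(0, 120, (8,))
loss = deep_supervision_loss(main_logits, [aux1, aux2], targets)
```

At inference time the auxiliary heads are discarded and only the main classifier's output is used.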
Finally, to mitigate the effects of class imbalance, label noise, and overfitting, the traditional cross-entropy loss is replaced with label smoothing loss. Instead of assigning a probability of 1 to the ground-truth class, label smoothing distributes a small portion of probability mass to non-target classes. This strategy reduces overconfidence in model predictions and improves generalization, particularly in datasets with uneven class distributions.
Formally, the smoothed label distribution is defined as:

$$y_k' = (1 - \varepsilon)\, y_k + \frac{\varepsilon}{K}$$

where $y_k'$ denotes the smoothed target probability for class $k$, $\varepsilon$ is the smoothing factor (set to 0.1 in this study), and $K$ represents the total number of classes. Based on this formulation, the label smoothing loss is given by:

$$\mathcal{L}_{\mathrm{LS}} = -\sum_{k=1}^{K} y_k' \log p_k$$

where $p_k$ is the predicted probability of class $k$.
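With $\varepsilon = 0.1$ and $K$ classes, the true class receives $1 - \varepsilon + \varepsilon/K$ of the probability mass and every other class receives $\varepsilon/K$. PyTorch exposes this directly through the `label_smoothing` argument of `CrossEntropyLoss` (available since version 1.10, so compatible with the PyTorch 1.12.1 environment used here); the logits below are placeholders.

```python
import torch
import torch.nn as nn

# Cross-entropy with label smoothing (epsilon = 0.1), replacing the
# hard one-hot targets with softened distributions as defined above.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 120)              # batch of 4, K = 120 classes
targets = torch.tensor([3, 17, 88, 119])  # ground-truth class indices
loss = criterion(logits, targets)
```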
In summary, the three enhancements integrated into DenseNet-CSL—Coordinate Attention for improved feature representation, Deep Supervision for stabilized training and multi-scale learning, and Label Smoothing for enhanced generalization—jointly address key challenges in multi-class agricultural harmful organism recognition. These mechanisms significantly improve classification robustness and accuracy while maintaining a relatively lightweight network structure.
2.3. Experimental Environment and Parameter Settings
The experiments were conducted on a Windows Server 2022 Datacenter (64-bit) operating system equipped with an Intel® Xeon® Gold 5317 CPU operating at 3.00 GHz and 256 GB of RAM. Model training and evaluation were accelerated using an NVIDIA RTX 6000 Ada Generation GPU with 48 GB of video memory. The software environment comprised Python 3.9.15, CUDA 11.3, and PyTorch 1.12.1, with the Anaconda3 distribution used for environment management and PyCharm Community Edition 2022.1 used as the development IDE. A summary of the experimental platform and software configuration is provided in Table 4.
The dataset was first partitioned at the original-image level into training, validation, and test subsets using a class-stratified ratio of 6:2:2. Data augmentation was subsequently applied only to the training subset to increase intra-class variability and alleviate class imbalance. After augmentation, a total of 30,000 images were used for model training and evaluation, comprising 18,000 training images, 6000 validation images, and 6000 test images. This data partitioning strategy ensures sufficient samples for model optimization while enabling reliable and unbiased performance assessment.
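The class-stratified 6:2:2 split at the original-image level can be sketched with scikit-learn; the file paths and labels below are placeholders, and the random seed is an assumption.

```python
from sklearn.model_selection import train_test_split

# Hypothetical image paths and 120-class labels standing in for the
# real dataset index (augmentation happens only after this split).
paths = [f"img_{i}.jpg" for i in range(1000)]
labels = [i % 120 for i in range(1000)]

# First carve out 60% for training, stratified by class...
train_p, rest_p, train_y, rest_y = train_test_split(
    paths, labels, train_size=0.6, stratify=labels, random_state=42)

# ...then split the remaining 40% evenly into validation and test.
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
```

Stratifying both splits keeps the per-class proportions consistent across the three subsets, which matters given the strong class imbalance noted in Section 2.1.4.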
During training, the AdamW optimizer was employed to update model parameters, as it provides adaptive learning rates while effectively decoupling weight decay from gradient updates. To further improve convergence behavior, a cosine annealing learning rate scheduler (CosineLR) was adopted for all trainable parameters. This scheduling strategy gradually reduces the learning rate from its initial value to a predefined minimum following a smooth cosine curve, which helps the model escape local minima and stabilizes training in later stages.
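The optimizer and scheduler configuration described above might be set up as follows. The weight-decay value and minimum learning rate are assumptions not stated in the text; the linear layer stands in for the full network.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 120)   # stand-in for DenseNet-CSL

# AdamW decouples weight decay from the adaptive gradient update.
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001,
                              weight_decay=1e-4)   # decay value assumed

# Cosine annealing from the initial lr down to eta_min over 200 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200, eta_min=1e-6)            # eta_min assumed

for epoch in range(200):
    # ... one training epoch over the data loader would run here,
    # calling optimizer.step() per batch ...
    scheduler.step()   # lr follows a smooth cosine curve per epoch
```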
To mitigate overfitting and enhance generalization performance, dropout was applied during training by randomly deactivating a subset of neuron connections. The network was trained for 200 epochs with a batch size of 32, allowing efficient utilization of GPU resources while maintaining stable gradient estimation throughout the training process. The detailed training hyperparameters are summarized in Table 5. All input images were resized to a fixed resolution of 224 × 224 pixels to ensure consistency with the backbone network architecture. A batch size of 32 was selected to balance training efficiency and memory usage. The initial learning rate was set to 0.001, and no pre-trained weights were used, allowing the model to be trained from scratch on the constructed dataset.
These parameter settings were selected to ensure stable optimization and fair comparison across different network architectures. By maintaining consistent training conditions, the experimental results reliably reflect the performance differences attributable to model design rather than parameter tuning.
2.4. Performance Evaluation Metrics
To comprehensively evaluate the performance of the proposed classification model, multiple evaluation metrics were employed, including the confusion matrix, precision, recall, F1-score, inference time, and model size. These metrics provide complementary perspectives on classification effectiveness, efficiency, and practical applicability. Together, they enable a systematic assessment of the model’s performance in multi-class agricultural harmful organism recognition tasks.
2.4.1. Confusion Matrix
The confusion matrix is a fundamental tool for evaluating classification performance, as it provides a detailed visualization of prediction outcomes across all categories. In the matrix, rows correspond to ground-truth labels, while columns represent predicted labels. Diagonal elements indicate correctly classified samples, whereas off-diagonal elements reflect misclassification cases.
By analyzing the confusion matrix, it is possible to identify categories that are consistently recognized with high accuracy as well as those that are prone to confusion. This analysis is particularly valuable for multi-species recognition tasks, as it helps reveal inter-class similarity and potential error patterns, thereby providing insights for further model optimization and refinement.
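A confusion matrix of this kind can be built with scikit-learn; the tiny label arrays below are placeholders for the 120-class test-set predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder ground-truth and predicted labels for three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows correspond to ground truth, columns to predictions;
# diagonal entries count correctly classified samples.
cm = confusion_matrix(y_true, y_pred)

# Per-class accuracy: diagonal over the row sums (true totals).
per_class_acc = cm.diagonal() / cm.sum(axis=1)
```

Inspecting the off-diagonal entries of `cm` reveals exactly which class pairs are confused, which is the analysis applied to the five models in Section 3.1.2.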
2.4.2. Precision
Precision measures the reliability of positive predictions made by the model and reflects the proportion of correctly classified positive samples among all samples predicted as positive. This metric is particularly important in applications where false positives may lead to unnecessary intervention or misjudgment. Precision is calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

where $TP$ denotes true positives and $FP$ denotes false positives.
2.4.3. Recall
Recall quantifies the model's ability to correctly identify positive samples and represents the proportion of true positives among all actual positive instances. This metric is especially critical in agricultural monitoring scenarios, where failing to detect harmful organisms may result in delayed control measures or economic losses. Recall is defined as:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where $FN$ denotes false negatives.
2.4.4. F1-Score
The F1-score is the harmonic mean of precision and recall and provides a balanced evaluation of classification performance by jointly considering both metrics. It is particularly suitable for datasets with imbalanced class distributions, as it penalizes extreme discrepancies between precision and recall. The F1-score is computed as follows:

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
By integrating both precision and recall into a single measure, the F1-score offers a comprehensive indicator of overall model performance in multi-class recognition tasks.
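The three metrics defined above can be computed in one call with scikit-learn; macro averaging (each class weighted equally, appropriate under class imbalance) is an assumption here, and the labels are placeholders.

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder labels; in the actual evaluation these would be the
# 6000 test-set ground truths and model predictions over 120 classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Macro average: per-class precision/recall/F1, then unweighted mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
```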
3. Results
3.1. Experimental Results
This section presents the experimental results obtained on the constructed agricultural harmful organism dataset and evaluates the performance of the proposed DenseNet-CSL model through comparative experiments. Several mainstream deep learning architectures, including ResNet101, DenseNet121, EfficientNetV2, MobileNetV3, Vision Transformer, and Swin Transformer, were selected as baseline models. The comparison focuses on training behavior, classification accuracy, and convergence characteristics, aiming to assess the effectiveness and robustness of the proposed improvement.
3.1.1. Comparison of Loss Functions and Accuracy
In this study, seven neural network models (ResNet101 [12], DenseNet121, EfficientNetV2 [13], MobileNetV3 [14], Vision Transformer [15], Swin Transformer [16], and the proposed DenseNet-CSL) were evaluated to examine their convergence behavior and classification performance. Figure 8 illustrates the loss curves of the seven models during training.
As shown in Figure 8, the training loss of all models decreases progressively with increasing epochs, indicating stable convergence behavior. However, notable differences can be observed in both convergence speed and final loss values. Among the evaluated models, DenseNet-CSL exhibits the fastest loss reduction and reaches a stable state after approximately 50 epochs, with a final loss value of around 1.1. This value is substantially lower than that of the other models, suggesting a stronger capability in learning discriminative features from complex agricultural images.
By comparison, DenseNet121 and ResNet101 show similar convergence trends, with their loss values gradually stabilizing at approximately 2.0. EfficientNetV2 and MobileNetV3 also demonstrate convergent behavior, although their loss reduction occurs more slowly and stabilizes at relatively higher values, indicating comparatively weaker fitting performance on the constructed dataset. It is worth noting that the loss curves of VisionTransformer and SwinTransformer decrease much more slowly than those of the other models. Even after 150 epochs, the training process has not fully stabilized, and the final loss remains relatively high, indicating a lack of effective convergence. This suggests that, under the current task setting and dataset scale, Transformer-based architectures are difficult to train adequately and may be constrained by limited data availability and/or high image noise, resulting in suboptimal classification performance.
Figure 9 presents the corresponding accuracy curves during training. All models show a rapid increase in accuracy during the early training stages, followed by gradual stabilization. DenseNet-CSL achieves higher accuracy at an earlier stage and ultimately stabilizes at approximately 0.80, outperforming the other models. DenseNet121 and ResNet101 reach stable accuracies of around 0.77 and 0.71, respectively, while EfficientNetV2 and MobileNetV3 converge to lower accuracy levels and exhibit greater fluctuations during early training. In contrast, VisionTransformer and SwinTransformer achieved notably lower accuracies. Their performance started at a low level in the early training stage and improved only to approximately 0.47 and 0.56 by the end, which is substantially below that of the CNN-based baselines. This observation further supports the conclusion drawn from the loss curves that these Transformer-based models fail to converge effectively under the current training setting. Possible explanations include the limited dataset size and the higher optimization difficulty of Transformers, which may prevent them from realizing their expected advantages on this task. Therefore, in the subsequent experimental analysis and model comparison, the two types of Transformer models will not be emphasized.
Taken together, the loss and accuracy results indicate that DenseNet-CSL not only converges more rapidly but also achieves superior final performance. These observations suggest that the integration of Coordinate Attention, Deep Supervision, and Label Smoothing contributes to more effective feature learning and improved training stability in multi-class agricultural harmful organism recognition tasks.
3.1.2. Comparative Analysis of Confusion Matrices
To further compare the classification performance of DenseNet-CSL with the four baseline models, confusion matrices were constructed to analyze prediction results at the category level. The confusion matrix provides a detailed comparison between predicted labels and ground-truth labels, recording both correctly classified and misclassified samples for each species. Based on the 6000 test images covering 120 categories of agriculturally harmful organisms, the confusion matrices of the five models are presented in Figure 10. The left panel shows the complete confusion matrix, while the right panel displays a subset of categories (species 1–10 and 76–95) to facilitate clearer visualization and interpretation.
Overall, DenseNet-CSL demonstrates consistently high classification accuracy across most categories, with a strong concentration of correct predictions along the diagonal of the confusion matrix. Several categories, including Asota caricae, Sphagneticola trilobata, Spartina alterniflora, Ipomoea cairica, apple rust, and Cuscuta chinensis, exhibit very low misclassification rates. These species possess relatively distinctive visual characteristics, enabling reliable recognition by the model.
Despite the overall strong performance, misclassification still occurs among certain categories with high phenotypic similarity. For instance, Colias poliographus and Pieris rapae are occasionally confused due to similarities in wing shape and coloration. Likewise, apple Alternaria leaf spot and apple gray spot share visually similar disease symptoms, which can lead to classification ambiguity. Similar confusion patterns are also observed between Vanessa indica and Vanessa cardui, as well as between Acherontia styx and Acherontia lachesis. These errors primarily arise from subtle inter-species differences and represent typical challenges in fine-grained image classification tasks.
In comparison with the baseline models, DenseNet-CSL exhibits fewer misclassification cases across visually similar categories, indicating improved discriminative capability. Although some confusion remains unavoidable in cases of extreme morphological similarity, the overall results suggest that the proposed model achieves a favorable balance between classification accuracy and robustness across diverse species of pests, weeds, and crop diseases.
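The category-level analysis above can be reproduced with a short routine that accumulates a confusion matrix from predicted and ground-truth labels and then ranks the most frequent confusions. The sketch below is illustrative only; the function names are ours and are not taken from the paper's codebase:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] = number of samples whose true class is i and predicted class is j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def most_confused_pairs(cm, top=5):
    """Return the off-diagonal cells with the largest counts,
    i.e. the (true, predicted) pairs confused most often."""
    off = cm.copy()
    np.fill_diagonal(off, 0)  # ignore correct predictions
    flat = np.argsort(off, axis=None)[::-1]
    idx = np.dstack(np.unravel_index(flat, off.shape))[0]
    return [(int(i), int(j), int(off[i, j])) for i, j in idx[:top] if off[i, j] > 0]
```

Applied to the full 120-class matrix, such a ranking would be expected to surface visually similar pairs like those discussed above (e.g., Colias poliographus vs. Pieris rapae).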
3.2. Analysis of Test Results
To further evaluate the overall performance of the proposed DenseNet-CSL model, a quantitative comparison was conducted using precision, recall, F1-score, inference time, and model size as evaluation metrics. The results of all models on the test set are summarized in Table 6.
As shown in Table 6, DenseNet-CSL achieves the highest precision (0.813), recall (0.801), and F1-score (0.800) among all evaluated models. Compared with the baseline DenseNet121, these values represent improvements of 4.1%, 3.1%, and 3.4%, respectively. These results indicate that the proposed model is more effective in correctly identifying agricultural harmful organisms while maintaining a low rate of misclassification.
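For reference, macro-averaged precision, recall, and F1 of the kind reported in Table 6 can be derived from a confusion matrix as sketched below. This is a minimal sketch assuming rows index true classes and columns index predictions; the paper does not publish its evaluation code, so this convention is our assumption:

```python
import numpy as np

def macro_metrics(cm):
    """Macro-averaged precision, recall and F1 from a confusion matrix
    (rows = true classes, columns = predicted classes)."""
    tp = np.diag(cm).astype(float)
    pred = cm.sum(axis=0).astype(float)   # predicted count per class
    true = cm.sum(axis=1).astype(float)   # ground-truth count per class
    # guard against classes with zero predictions or zero samples
    prec = np.divide(tp, pred, out=np.zeros_like(tp), where=pred > 0)
    rec = np.divide(tp, true, out=np.zeros_like(tp), where=true > 0)
    denom = prec + rec
    f1 = np.divide(2 * prec * rec, denom, out=np.zeros_like(tp), where=denom > 0)
    return prec.mean(), rec.mean(), f1.mean()
```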
In terms of computational efficiency, DenseNet-CSL requires 22.360 s for inference on the test set, which is 1.36 s faster than DenseNet121. At the same time, the increase in model size is relatively small, with an additional 1.772 MB of parameters. This suggests that the proposed architectural enhancements improve recognition performance without imposing a substantial computational burden.
Further comparison shows that although ResNet101 and EfficientNetV2 require less inference time, their larger model sizes and lower accuracy limit their suitability for time-sensitive agricultural applications. MobileNetV3, despite its smaller model size and shorter inference time, remains substantially less accurate than DenseNet-CSL. Overall, DenseNet-CSL provides a favorable trade-off between accuracy and efficiency, making it well suited for multi-class agricultural harmful organism recognition tasks.
To further examine the classification stability and per-category performance of each model, the recognition accuracy of all 120 species in the test set was statistically analyzed. For ease of comparison, recognition accuracy was divided into five intervals: below 60%, 60–85%, 85–95%, 95–100%, and 100%. The number of species falling into each interval was counted for all models, and the results are illustrated in Figure 11.
As shown in Figure 11, DenseNet121 exhibits a strong concentration of species within the 60–85% accuracy interval, indicating relatively stable performance across a large proportion of categories. In contrast, DenseNet-CSL demonstrates a higher proportion of species in the upper accuracy intervals. A larger number of species are recognized with accuracies above 85%, including a substantial portion achieving accuracies above 95% and 100%, highlighting its superior high-precision recognition capability. Moreover, DenseNet-CSL shows only eleven species in the below-60% accuracy interval, reflecting improved robustness and reduced performance degradation across difficult categories. By comparison, ResNet101 places a considerable number of species in the lower and middle accuracy intervals and achieves fewer categories with perfect accuracy, indicating limited overall discriminative capability. EfficientNetV2 presents a relatively balanced distribution but includes several species with low recognition accuracy, while MobileNetV3 shows a more uniform distribution across intervals without dominating in high-accuracy ranges. Overall, the accuracy interval analysis highlights the strong generalization ability and classification stability of DenseNet-CSL. The model not only increases the number of species recognized with high accuracy but also effectively reduces the occurrence of low-accuracy cases, which is essential for reliable deployment in practical agricultural monitoring and quarantine applications.
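Interval statistics of this kind can be obtained by binning per-class accuracies. Because adjacent intervals share endpoints in the text (60–85%, 85–95%, 95–100%, 100%), the sketch below assigns each shared boundary to the higher interval; that boundary convention is our assumption, not stated in the paper:

```python
def bin_accuracies(per_class_acc):
    """Count species per accuracy interval: <60%, 60-85%, 85-95%, 95-<100%, 100%.
    Accuracies are fractions in [0, 1]; shared boundaries go to the higher bin."""
    bins = {"<60%": 0, "60-85%": 0, "85-95%": 0, "95-100%": 0, "100%": 0}
    for a in per_class_acc:
        if a >= 1.0:
            bins["100%"] += 1
        elif a >= 0.95:
            bins["95-100%"] += 1
        elif a >= 0.85:
            bins["85-95%"] += 1
        elif a >= 0.60:
            bins["60-85%"] += 1
        else:
            bins["<60%"] += 1
    return bins
```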
3.3. Phenotypic Analysis of Different Harmful Organism Images
To further evaluate the recognition performance of DenseNet-CSL under complex phenotypic conditions, the model was tested using a set of representative samples exhibiting diverse morphological characteristics. As shown in Figure 12, agricultural harmful organisms often present pronounced sexual dimorphism (Figure 12a), substantial intra-class variability (Figure 12b), and high inter-class similarity (Figure 12c). Despite these challenges, DenseNet-CSL is able to correctly identify most samples, indicating its capacity to capture discriminative visual features under realistic and heterogeneous conditions.
These results suggest that the proposed model can effectively extract and integrate fine-grained morphological cues, even when visual differences between categories are subtle or when individuals of the same species display notable phenotypic variation. Such robustness is particularly important for agricultural applications, where imaging conditions are often uncontrolled and biological traits vary across growth stages and environments.
However, misclassification still occurs among certain species with high phenotypic similarity, as illustrated in Figure 13a. These species exhibit extremely similar morphological traits, making discrimination challenging even for experienced taxonomists. Typical examples include Colias erate [17,18], Pieris rapae [19], Colias fieldii [18], Colias poliographus [20], and Catopsilia scylla [21,22]. All of these species belong to the family Pieridae and are characterized by predominantly white or yellow wing coloration with similar pattern distributions.
The primary interspecific differences among these species lie in subtle variations in wing shape, vein structure, and the spatial arrangement of color patches. Such fine-grained distinctions are difficult to capture consistently, particularly when images are affected by viewpoint changes, illumination variation, or partial occlusion. As a result, these species represent a typical and challenging case for fine-grained classification in agricultural image recognition.
Similar difficulties are observed between Vanessa indica [23,24] and Vanessa cardui [23,25], which share highly similar wing patterns and coloration (Figure 13b). Although V. indica generally exhibits slightly larger body size and more saturated color patches, these differences may be diminished under certain imaging conditions, leading to occasional misclassification.
In addition to highly similar species, misclassification is also observed in certain species that do not exhibit strong phenotypic resemblance, as shown in Figure 14a. A representative example is Erigeron annuus [26,27], a highly invasive weed species belonging to the family Asteraceae. Although this species possesses relatively distinctive morphological features, misclassification may still occur under specific conditions. Further analysis indicates that such errors primarily arise from two factors. First, key morphological features may be partially occluded by surrounding vegetation or insects, resulting in incomplete feature representation (Figure 14b). Second, phenotypic variation caused by environmental conditions or growth stages—such as changes in flower color or inflorescence structure—can alter the visual appearance of the species (Figure 14c), increasing recognition difficulty.
Overall, DenseNet-CSL demonstrates strong recognition capability under a wide range of phenotypic conditions commonly encountered in agricultural environments. While misclassification persists in cases involving extreme morphological similarity or significant phenotypic variation, these limitations are largely attributable to inherent biological characteristics and data constraints rather than model instability. Future work may further enhance recognition performance by incorporating fine-grained classification strategies, region-aware attention mechanisms, or multimodal information, such as hyperspectral or structural data, to improve discrimination among visually similar species.
3.4. Ablation Study
To further investigate the contribution of each component integrated into the proposed DenseNet-CSL architecture, an ablation study was conducted using DenseNet121 as the baseline model. Different enhancement strategies were applied individually and in combination, and their performance was evaluated on the constructed dataset. The corresponding results are summarized in Table 7.
As shown in Table 7, the baseline DenseNet121 model achieves a precision of 0.772, a recall of 0.770, and an F1-score of 0.766. When the Coordinate Attention (CA) mechanism is introduced, these metrics increase to 0.777, 0.771, and 0.768, respectively, while the inference time is reduced from 23.720 s to 22.991 s. This improvement indicates that CA effectively enhances spatial feature representation and channel-wise dependency modeling, enabling the network to focus more accurately on discriminative regions.
When Label Smoothing (LS) is applied independently, the model shows the most notable single-module improvement, reaching 0.797 precision, 0.797 recall, and an F1-score of 0.790, with an inference time of 22.804 s. This suggests that LS effectively regularizes the classification objective and stabilizes training, leading to more reliable generalization. In contrast, using Deep Supervision (DS) alone does not yield consistent benefits, resulting in 0.766/0.765/0.757, which is slightly lower than the baseline, implying that auxiliary supervision may not be well-aligned with this dataset or training setup when used in isolation.
For module combinations, CA + LS achieves the best overall accuracy among the tested variants, with an F1-score of 0.799 (precision 0.810, recall 0.803), although its inference time increases slightly to 23.196 s. Adding DS to LS (LS + DS) does not further improve accuracy beyond LS alone (F1 remains 0.790), but it reduces inference time to 22.500 s. Finally, integrating all three components (CA + LS + DS) yields 0.812/0.801/0.797 with the fastest inference time among enhanced configurations (22.368 s), demonstrating a favorable trade-off between accuracy and efficiency. Overall, the ablation results indicate that LS is the primary contributor to performance gains, CA offers additional but modest improvement, and DS provides limited accuracy benefits, mainly affecting efficiency when combined with other modules. These findings support the effectiveness of the proposed DenseNet-CSL design, particularly highlighting the complementary role of CA + LS in improving recognition performance.
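To make the role of Label Smoothing concrete, the NumPy sketch below computes cross-entropy against a smoothed target distribution, following the formulation used by `torch.nn.CrossEntropyLoss` with its `label_smoothing` argument. The smoothing factor `eps=0.1` is a common default and only a placeholder; the paper does not restate the exact value used:

```python
import numpy as np

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a smoothed target distribution that places
    (1 - eps) + eps/K on the true class and eps/K on every other class."""
    k = logits.shape[-1]
    # numerically stable log-softmax
    shifted = logits - logits.max()
    logp = shifted - np.log(np.exp(shifted).sum())
    smooth = np.full(k, eps / k)
    smooth[target] += 1.0 - eps
    return float(-(smooth * logp).sum())
```

With `eps=0.0` this reduces to the ordinary cross-entropy; increasing `eps` penalizes over-confident predictions on the true class, which is the regularization effect discussed above.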
3.5. Grad-CAM Visualization Comparison
To further interpret the feature-learning behaviors of different models and to examine whether, when trained on mixed-source images, the model relies primarily on biologically relevant cues rather than background artifacts, we employ Gradient-weighted Class Activation Mapping (Grad-CAM) to visualize the spatial regions most associated with the classification decisions. Grad-CAM provides an intuitive representation of which image regions contribute most to a model’s prediction, thereby facilitating qualitative comparison of feature extraction capability across different architectures.
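For readers unfamiliar with the technique, a minimal Grad-CAM can be implemented with PyTorch hooks as sketched below. The model and target layer passed in are placeholders rather than the paper's DenseNet-CSL; the routine itself follows the standard Grad-CAM recipe (channel weights from spatially averaged gradients, weighted activation sum, ReLU, normalization):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Return a [0, 1]-normalized Grad-CAM heatmap for one image and class."""
    acts, grads = {}, {}

    def fwd_hook(_, __, output):
        acts["a"] = output          # activations of the target layer

    def bwd_hook(_, grad_in, grad_out):
        grads["g"] = grad_out[0]    # gradients of the class score w.r.t. activations

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        model.zero_grad()
        score = model(image)[0, class_idx]
        score.backward()
    finally:
        h1.remove()
        h2.remove()

    a = acts["a"][0]                     # (C, H, W)
    w = grads["g"][0].mean(dim=(1, 2))   # (C,) per-channel importance weights
    cam = F.relu((w[:, None, None] * a).sum(dim=0))
    return (cam / (cam.max() + 1e-8)).detach()
```

For a DenseNet-style backbone, the target layer would typically be the last dense block's output; here any convolutional layer works for illustration.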
Figure 15 shows representative Grad-CAM visualizations produced by ResNet101, DenseNet121, EfficientNetV2, and DenseNet-CSL for six test samples spanning both laboratory-style PlantVillage disease images (typically characterized by uniform backgrounds) and field-captured pest/weed images (typically characterized by cluttered natural backgrounds). Overall, ResNet101 and EfficientNetV2 tend to produce more scattered or multi-region activations, and in several cases their responses extend beyond the target organism to surrounding leaves, soil, or background textures. This indicates that these models may be more sensitive to background clutter and may partially rely on context-related cues rather than strictly discriminative biological regions. In contrast, DenseNet121 shows relatively improved localization but still exhibits occasional background leakage. In comparison, DenseNet-CSL consistently focuses on more complete and biologically relevant regions of the harmful organisms, such as characteristic body contours, leaf lesion areas, or distinctive structural features. This observation indicates that the proposed model is more effective in capturing spatially coherent and semantically meaningful features.
The improved localization performance of DenseNet-CSL can be attributed to the integration of Coordinate Attention and Deep Supervision mechanisms. Coordinate Attention enables the network to encode positional information along spatial dimensions, thereby enhancing sensitivity to target location and shape. Meanwhile, Deep Supervision encourages intermediate layers to learn discriminative features at multiple scales, which contributes to more stable and interpretable attention patterns. As a result, DenseNet-CSL exhibits clearer and more focused activation regions, even in images containing cluttered backgrounds or subtle visual differences.
Overall, the Grad-CAM visualization results provide qualitative evidence that DenseNet-CSL achieves more accurate feature localization and improved interpretability compared with the baseline models. These findings are consistent with the quantitative performance gains reported in previous sections and further support the effectiveness of the proposed architecture for multi-class agricultural harmful organism recognition.
4. Discussion
This study addresses the challenge of multi-class recognition of agricultural harmful organisms by integrating pest, weed, and crop disease identification into a unified deep learning framework. By constructing a comprehensive dataset encompassing 120 species across three major biological categories and proposing the DenseNet-CSL model, this work aims to bridge the gap between single-category recognition studies and the complex requirements of real agricultural environments. The experimental results demonstrate that the proposed approach achieves stable and accurate performance across diverse species and imaging conditions, highlighting its promise for agricultural monitoring scenarios, while further validation under broader deployment conditions is still required.
From a data perspective, the identification of agricultural harmful organisms inherently involves several challenges, including large species diversity, high inter-class similarity, and complex imaging backgrounds. In real agricultural environments, images are often captured under uncontrolled conditions, where variations in illumination, viewpoint, growth stage, and background clutter are common. These factors contribute to substantial visual variability within the same species and subtle differences between different species, making accurate recognition particularly demanding.
Traditional identification methods based on morphological traits and expert experience can achieve reliable results in controlled settings; however, they are time-consuming, highly subjective, and difficult to scale to large-scale monitoring tasks. In contrast, deep learning–based approaches enable automatic extraction of discriminative features and rapid processing of large volumes of image data. Nevertheless, many existing studies focus primarily on single-category recognition tasks, such as pest or disease identification alone, and rely on relatively homogeneous datasets. This limitation restricts their applicability in practical scenarios where multiple types of harmful organisms coexist and need to be identified simultaneously.
At the model level, the performance of DenseNet-CSL can be attributed to its targeted architectural design, which directly addresses the key challenges posed by multi-species agricultural datasets. Coordinate Attention enhances the model’s ability to capture spatial positional cues and channel-wise dependencies, enabling more precise localization of discriminative regions under complex backgrounds. This capability is particularly relevant for agricultural images, where harmful organisms often occupy only a small portion of the image and are surrounded by visually cluttered environments.
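A Coordinate Attention block of the kind described here can be sketched in PyTorch as follows. This follows the published Coordinate Attention design (pooling along height and width separately, a shared transform, then per-axis sigmoid gates); the reduction ratio is chosen for illustration and is not necessarily the value used in DenseNet-CSL:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = self.pool_h(x)                           # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)       # (B, C, W, 1)
        y = torch.cat([xh, xw], dim=2)                # shared transform over both axes
        y = self.act(self.bn(self.conv1(y)))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                       # (B, C, H, 1) gate
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))   # (B, C, 1, W) gate
        return x * ah * aw                            # position-aware reweighting
```

Because the two gates factor attention along height and width, the block preserves positional information that plain channel attention (e.g., SE) discards, which is why it helps localize small targets in cluttered scenes.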
Deep Supervision further contributes to improved performance by facilitating gradient propagation and encouraging effective feature learning at intermediate layers. By providing auxiliary supervision signals, the network is able to learn multi-scale representations that capture both fine-grained local details and higher-level semantic information. This is especially beneficial for recognizing harmful organisms that exhibit large variations in size, morphology, and growth stage. In addition, Label Smoothing regularizes the output distribution by reducing over-confident predictions and improves robustness to potential label noise; it may indirectly alleviate imbalance-related bias but does not replace dedicated imbalance-aware strategies (e.g., focal loss or class re-weighting).
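A minimal sketch of how Deep Supervision and Label Smoothing combine during training: an auxiliary classifier attached to an intermediate stage contributes a weighted loss term alongside the main head. The toy backbone, auxiliary weight, and smoothing factor below are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DeepSupervisedNet(nn.Module):
    """Toy two-stage backbone with an auxiliary classifier on stage 1."""
    def __init__(self, num_classes=120):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.aux_head = nn.Linear(8, num_classes)    # supervises intermediate features
        self.main_head = nn.Linear(16, num_classes)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        aux = self.aux_head(self.pool(f1).flatten(1))
        main = self.main_head(self.pool(f2).flatten(1))
        return main, aux

def ds_loss(main, aux, target, aux_weight=0.3, label_smoothing=0.1):
    """Total loss = smoothed CE on the main head + weighted smoothed CE on the aux head."""
    ce = nn.CrossEntropyLoss(label_smoothing=label_smoothing)
    return ce(main, target) + aux_weight * ce(aux, target)
```

The auxiliary head is used only during training; at inference, predictions come from the main head alone, so Deep Supervision adds no deployment cost.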
Together, these design choices allow DenseNet-CSL to achieve faster convergence and more stable performance compared with baseline architectures. Rather than relying on a single enhancement, the combination of complementary mechanisms enables the model to adapt more effectively to the heterogeneous visual characteristics of pests, weeds, and crop diseases. Beyond theoretical exploration, the proposed architectural enhancements do not impose a substantial computational burden; this is supported by standardized inference-time measurements under a fixed protocol (same input resolution and batch size) and by model complexity indicators (e.g., parameter counts and FLOPs).
From an error-diagnosis perspective, the remaining misclassifications can be broadly attributed to three factors: (1) phenotypic resemblance across classes, (2) partial occlusion or truncation of key discriminative regions, and (3) appearance variations induced by growth stage and illumination. These factors are consistent with the visual characteristics of field imagery. Despite the overall strong performance of DenseNet-CSL, errors still occur for a subset of species, particularly those with high inter-class similarity or pronounced intra-class variability. Representative cases include species within the family Pieridae and the genus Vanessa, where interspecific differences are subtle and often depend on fine-grained morphological cues such as wing-vein patterns or slight color variations. Under natural imaging conditions, these cues may be partially obscured or visually degraded by changes in illumination, viewpoint, and background clutter, thereby increasing the likelihood of confusion. In addition, some misclassification cases occur in species that do not inherently exhibit high phenotypic similarity. Further analysis suggests that these errors are mainly associated with occlusion of key morphological features or phenotypic variation caused by environmental conditions and growth stages. Such factors are difficult to fully control during field data acquisition and represent inherent challenges in agricultural image recognition tasks. These observations indicate that misclassification is not solely attributable to model limitations, but also reflects the intrinsic complexity of biological systems and real-world imaging environments.
Building on the above analysis, several directions may be explored to further improve the recognition of agricultural harmful organisms under complex real-world conditions. For species with high phenotypic similarity, fine-grained classification strategies—such as region-aware attention mechanisms or multi-branch feature extraction networks—may help enhance sensitivity to subtle morphological differences. In addition, increasing sample diversity through targeted data collection or advanced data augmentation techniques could improve model robustness for underrepresented categories.
Beyond image-based approaches, integrating multimodal information offers another promising direction. The incorporation of hyperspectral data, structured-light measurements, or morphological descriptors may provide complementary cues that are difficult to capture using RGB images alone. Furthermore, extending the proposed framework to additional species and geographic regions would allow a more comprehensive evaluation of its generalization capability across different agricultural contexts. Moreover, when mixing laboratory-style disease images with field-captured pest/weed images, shortcut learning caused by acquisition-specific backgrounds cannot be fully ruled out. Future work will include Grad-CAM-based verification and background-perturbation robustness tests to confirm that the model attends to biologically relevant regions. In addition, transfer learning with pretrained backbones and comparisons with modern baselines will be explored to further improve robustness and strengthen the empirical evidence.
5. Conclusions
This study addresses the challenges of multi-class recognition of agricultural harmful organisms arising from species diversity, morphological variability, and complex background conditions. To support research in this area, a comprehensive image dataset was constructed, covering three major categories of agricultural threats—pests, weeds, and crop diseases. This dataset provides a unified data foundation for multi-target recognition tasks and reflects the complexity of real agricultural environments.
Based on the constructed dataset, an improved convolutional neural network, DenseNet-CSL, was developed by integrating Coordinate Attention, Deep Supervision, and Label Smoothing mechanisms into the DenseNet121 backbone. These enhancements collectively improve feature representation, stabilize model training, and enhance generalization capability under class imbalance and complex visual conditions. Experimental results demonstrate that DenseNet-CSL achieves higher recognition accuracy and improved efficiency compared with the baseline DenseNet121 model, while maintaining a lightweight network structure.
The proposed DenseNet-CSL model exhibits strong robustness under challenging scenarios, including complex backgrounds, high phenotypic similarity among species, and imbalanced sample distributions. These characteristics indicate that the model is well-suited for practical applications such as agricultural monitoring, pest and disease diagnosis, and port quarantine inspection. The results of this study further demonstrate the potential of deep learning techniques for large-scale and multi-class recognition tasks in agricultural biosecurity.
Future research may further extend the proposed framework by incorporating additional agricultural harmful organism species and more diverse data sources to improve generalization across regions and environments. Moreover, integrating advanced deep learning strategies and deploying the model on agricultural monitoring platforms or edge computing devices could enable real-time detection and early warning in field and inspection scenarios. Overall, this study provides both methodological insight and experimental evidence for the development of intelligent agricultural harmful organism recognition systems.