1. Introduction
With the intensification of global climate change and continued expansion of agricultural production, agricultural harmful organisms—including insect pests, plant pathogens, and weeds—pose increasingly serious threats to food security and sustainable agricultural development. According to the Food and Agriculture Organization of the United Nations (FAO), approximately 30–40% of global crop production is lost annually due to the impact of harmful organisms, resulting in economic losses exceeding USD 100 billion. Insect pests exacerbate these losses through long-distance migration and the development of pesticide resistance, while pathogenic microorganisms continuously evolve via genetic mutations, enabling them to overcome conventional control strategies. Meanwhile, weeds spread rapidly owing to their strong ecological adaptability and invasive capacity. Together, these factors threaten not only agricultural productivity but also ecosystem stability.
As a major agricultural country, China faces dual pressures from both invasive and native harmful organisms. Since its invasion in 2019, Spodoptera frugiperda has rapidly spread across 26 provinces, causing severe damage to maize and other staple crops. Similarly, Mikania micrantha, recognized as one of the world’s most destructive invasive weeds, poses significant threats to agricultural production and ecosystem health by suppressing photosynthesis and displacing native vegetation. In addition to invasive species, certain native and invasive insects exhibit high morphological similarity, such as Hyphantria cunea and Spilosoma obliqua, which often leads to misidentification during manual inspection. These challenges substantially increase the difficulty of accurate identification and timely control in practical agricultural production and quarantine scenarios.
Traditional approaches for identifying agriculturally harmful organisms mainly rely on morphological observation or molecular diagnostic techniques. These methods typically require experienced specialists to perform labor-intensive examination, measurement, and comparison, which makes them time-consuming and costly. Moreover, their effectiveness is often constrained by subjective judgment and limited scalability, rendering them unsuitable for large-scale applications such as port quarantine, field surveillance, and real-time monitoring. In recent years, with the rapid development of computer vision and deep learning technologies, image-based automated recognition methods have emerged as promising alternatives for agricultural harmful organism identification.
Convolutional neural networks (CNNs) have achieved substantial progress in agricultural image analysis and recognition tasks. Zhao et al. [1] proposed an MOG2–YOLOv4-based detection model for Locusta migratoria, effectively alleviating recognition failures caused by motion blur and occlusion. Wang et al. [2] improved the AlexNet architecture to enhance the accuracy of crop disease identification. Bo et al. [3] introduced FOTCA, a hybrid Transformer–CNN model, achieving an accuracy of 99.8% and an F1-score of 0.9931 in leaf disease recognition. These studies demonstrate the effectiveness of deep learning methods for single-category agricultural recognition tasks. However, most existing approaches focus on a single type of biological target, limiting their applicability in complex agricultural environments where multiple harmful organisms often coexist.
Especially in real-world application scenarios such as field monitoring or quarantine inspection, targets often appear at small scales with subtle visual differences and strong background interference, and the discriminative cues are frequently confined to limited local regions such as lesion boundaries, texture variations, or fine-grained insect structures. Convolutional representations that rely solely on global features are easily affected by illumination variations, cluttered backgrounds, and occlusions, which can lead to confusion between visually similar classes. To improve robustness under cluttered backgrounds and alleviate fine-grained discrimination difficulties caused by high inter-class resemblance, attention mechanisms that highlight key regions while suppressing irrelevant background information have therefore been widely introduced into agricultural image recognition. Attention-based designs improve robustness by selectively emphasizing informative visual cues while suppressing background noise in plant disease recognition tasks [4]. Among them, Coordinate Attention (CA) explicitly embeds positional information into channel attention via direction-aware encoding, offering a favorable trade-off between localization capability and computational cost for fine-grained agricultural recognition [5].
Although attention mechanisms can enhance feature representation in complex backgrounds and improve fine-grained discrimination, most existing studies focus on a single target type (e.g., diseases only or pests/weeds only) and are typically built upon single-category datasets and task settings. Because different biological groups differ substantially in morphological structure, scale distribution, and imaging backgrounds, models developed for a single target are difficult to transfer directly to real agricultural scenarios where multiple groups coexist, thereby underscoring the need for a unified multi-class recognition framework. In practice, in applications such as port quarantine inspection, field diagnosis, and ecological management, crop diseases, pests, and weeds often co-occur under complex background interference, making single-species or single-category recognition paradigms insufficient for intelligent monitoring. Moreover, existing studies lack systematic cross-group sample resources and unified benchmarks, which further constrain model generalization and undermine real-world applicability.
Therefore, the recognition task in this study can be broadly categorized into coarse-grained and fine-grained recognition: at the coarse-grained level, targets are grouped into three major categories (pests, weeds, and crop diseases), highlighting the capability of unified recognition across different biological groups; at the fine-grained level, the task is further refined to distinguish specific species and disease categories (120 classes in total), where discrimination typically relies on subtle differences in local morphology and texture. To tackle the challenges in multi-class recognition of agricultural harmful organisms under complex scenarios, including fine-grained discrimination difficulty, substantial intra-class appearance variation, and imbalanced class distributions, this study focuses on fine-grained 120-class classification as the primary task. The main contributions are as follows:
(1) constructing a comprehensive image dataset covering three major categories of agricultural harmful organisms (pests, weeds, and crop diseases), forming a multi-species, multi-source, and multi-scene data framework;
(2) designing an improved DenseNet121-based multi-class recognition model (DenseNet-CSL) that integrates Coordinate Attention, Deep Supervision, and Label Smoothing to enhance feature representation, stabilize optimization, and mitigate class-imbalance effects;
(3) evaluating the effectiveness and generalization of DenseNet-CSL on multi-class recognition tasks using standard classification metrics, and demonstrating its feasibility as a technical approach for intelligent identification of agriculturally harmful organisms.
2. Materials and Methods
2.1. Dataset Acquisition
The dataset used in this study consists of images representing three major categories of agriculturally harmful organisms: insect pests, weeds, and crop diseases. All samples were carefully verified and annotated by experts in agriculture and plant protection to ensure labeling accuracy and reliability. Specifically, the insect and weed datasets were manually collected using smartphones in field environments across different regions of China, while the sugarcane disease images in the crop disease dataset were likewise captured using smartphones in Guangxi, China. The remaining disease images were obtained from two publicly available datasets: (1) IDADP (Agricultural Disease and Pest Image Database) [6], from which we used the corn- and rice-related disease subsets; and (2) PlantVillage [7], from which we used images for disease categories other than corn, rice, and sugarcane. For transparency, the PlantVillage data were downloaded via a Kaggle mirror: https://www.kaggle.com/datasets/abdallahalidev/plantvillage-dataset (accessed on 5 September 2022).
To enhance dataset representativeness, the collected images cover a wide range of growth stages, ecological environments, and imaging conditions. This multi-source data collection strategy aims to reflect the complexity of real agricultural scenarios and to support robust evaluation of multi-class recognition performance.
2.1.1. Pest Dataset
The pest dataset includes a total of 62 insect species belonging to 36 genera, comprising two larval species (Spodoptera exigua and Spodoptera frugiperda) and 60 adult pest species such as Zeugodacus tau, Carpomya vesuviana, and Lymantria dispar. In total, 9628 pest images were collected. These images were acquired under natural field conditions and cover a wide range of developmental stages, body postures, and background environments, as summarized in Table 1.
As illustrated in Figure 1, the constructed pest dataset exhibits three prominent types of phenotypic variation.
First, sexual dimorphism is evident in certain species, where males and females share similar coloration but differ markedly in morphological characteristics such as wing-tip structure or abdominal features.
Second, high inter-class similarity exists among several species, making visual discrimination challenging. For example, species pairs such as 47 and 48, as well as groups including species 22, 25, 26, and 50, display strong resemblance in wing shape, body coloration, and overall morphology, which increases the likelihood of misclassification.
Third, substantial intra-class variability is observed, with individuals of the same species showing noticeable differences in coloration or appearance. For instance, although species 41 maintains consistent structural traits, individuals may vary considerably in body color.
Overall, the pest dataset is characterized by pronounced sexual dimorphism, high inter-class similarity, and significant intra-class variability. These factors collectively increase the complexity of pest recognition tasks and pose substantial challenges for accurate classification in practical agricultural applications.
2.1.2. Weed Dataset
The weed dataset comprises 28 weed species belonging to 24 genera. Representative species include Spartina alterniflora, Ipomoea cairica, Mimosa pudica var. diplotricha, Amaranthus spinosus, Solanum torvum, Melinis repens, and Mikania micrantha. In total, 5042 weed images were collected, covering three typical habitat scenarios: isolated individual plants, sparsely distributed patches, and densely aggregated growth communities, as detailed in Table 2.
As shown in Figure 2, sample diversity within the weed dataset was enhanced by varying the shooting time, camera angle, and shooting distance during image acquisition. Nevertheless, under natural field conditions, the spatial distribution and abundance of weed species differ substantially. Species such as Erigeron annuus, Ambrosia artemisiifolia, and Spartina alterniflora exhibit strong growth vigor and wide geographic distribution, resulting in relatively large sample sizes. In contrast, species such as Plantago major occur more sparsely, yielding fewer samples and leading to a certain degree of class imbalance within the dataset.
2.1.3. Disease Dataset
The crop disease dataset covers nine major crops, including wheat, rice, maize, sugarcane, tomato, apple, tea, grape, and potato, and consists of 30 representative crop diseases, totaling 9325 images. The selected diseases reflect common and economically significant threats encountered in agricultural production. Detailed information on disease types and sample quantities is provided in Table 3.
Most disease images were obtained from publicly available datasets, primarily the PlantVillage dataset on the Kaggle platform, which was constructed based on extensive experimental research conducted by the original authors and has been widely used in previous studies. To ensure consistency across different data sources, all images in the disease dataset were standardized in terms of image format and resolution. Representative samples from the disease dataset are shown in Figure 3.
2.1.4. Data Analysis and Data Preprocessing
To further analyze the overall characteristics of the constructed dataset, statistical visualization was performed using Origin 2024 (OriginLab Corporation, Northampton, MA, USA), as shown in Figure 4. Two key observations can be drawn from the distribution results. First, there is a pronounced imbalance in sample quantity across categories, with the largest class containing 690 samples and the smallest only 16. This uneven distribution reflects the natural occurrence frequency of different harmful organisms but also introduces challenges for model training. Second, although all images share a consistent file format, noticeable variation exists in image resolution, with some samples exhibiting substantial differences in pixel size.
To address inconsistencies in image resolution, we applied a unified preprocessing procedure to all images, resizing them to a standardized 224 × 224 square input. In addition, for classes with fewer than 250 images, we applied four data augmentation strategies to the training set to alleviate class imbalance and improve model generalization: horizontal flipping, vertical flipping, random rotation, and color jittering. After augmentation, the number of images increased from 23,995 to 30,000.
Overall, the constructed dataset demonstrates notable advantages in terms of species diversity, category coverage, and scene variability, while also presenting realistic challenges such as class imbalance and heterogeneous visual characteristics. These properties provide a representative and demanding benchmark for evaluating multi-class recognition performance in complex agricultural environments.
2.2. Classification Model
2.2.1. DenseNet121
DenseNet (Dense Convolutional Network) [8], proposed by researchers from Cornell University, Tsinghua University, and Facebook AI Research in 2017, is a compact and efficient convolutional neural network architecture. The core idea of DenseNet lies in its dense connectivity pattern, in which each layer receives the feature maps of all preceding layers as input. This design promotes extensive feature reuse, enhances information flow between layers, alleviates the vanishing gradient problem, and enables high representational capacity with a relatively small number of parameters.
DenseNet121 is a representative model within the DenseNet family, consisting of 121 layers organized into four Dense Blocks and three Transition Layers. To reduce computational complexity and improve parameter efficiency, DenseNet121 adopts bottleneck structures based on 1 × 1 convolutions as well as channel compression strategies in the transition layers. Owing to its effective feature propagation mechanism and compact architecture, DenseNet121 has demonstrated strong performance in various image recognition tasks, including medical image analysis and object classification.
Considering the balance between network depth, computational efficiency, and recognition accuracy, DenseNet121 was selected as the backbone network in this study. Based on this architecture, targeted structural enhancements were introduced to better address the challenges of multi-class agricultural harmful organism recognition. The overall structure of DenseNet121 is illustrated in Figure 5.
2.2.2. Component Selection Rationale
Attention mechanisms are commonly used in agricultural image recognition to suppress background noise and highlight discriminative regions, thereby improving the capture of fine-grained differences. Existing attention modules can be broadly categorized into channel attention, spatial attention, and their combined forms, with different designs involving distinct trade-offs between performance gains and computational overhead. Channel attention structures represented by SE (Squeeze-and-Excitation) [9] and ECA (Efficient Channel Attention) [10] are relatively simple and easy to deploy, but they are less position-sensitive for tasks that require fine-grained focusing on local regions. CBAM (Convolutional Block Attention Module) [11] integrates channel and spatial attention sequentially, which can enhance spatial localization but typically introduces additional computational overhead and deployment cost. In contrast, Coordinate Attention (CA) incorporates direction-aware positional encoding within the channel-attention framework, explicitly injecting spatial information while remaining lightweight, thus balancing local-region sensitivity and computational efficiency. Compared with CBAM, CA emphasizes efficiency while improving localization; compared with SE and ECA, CA can more effectively focus on local discriminative regions. Based on these trade-offs, CA was selected as the attention module in this study.
In multi-scenario, multi-class recognition of agricultural harmful organisms, inter-class differences can be very subtle, and the model needs to capture fine-grained local textures as well as higher-level semantic information. Relying solely on the final-layer output may weaken supervision for intermediate layers, resulting in suboptimal mid-level feature learning and, in some cases, unstable convergence. To address this issue, one common approach is to adopt deeper or higher-capacity backbones to enhance representation ability, apply stronger regularization, or introduce multi-scale feature fusion modules. However, compared with deep supervision, these methods typically incur additional parameters and computational cost. Deep supervision is easier to implement and provides auxiliary gradient signals that can effectively stabilize optimization and encourage meaningful feature learning at multiple depths. Therefore, we adopt deep supervision in this work as a key strategy to improve training stability and multi-level representation learning.
In large-scale multi-class recognition tasks, models often exhibit overconfidence, assigning excessively high confidence to predicted classes. This issue becomes particularly pronounced under class imbalance, heterogeneous sample quality, or the presence of label noise. Label smoothing regularizes the output layer by converting hard one-hot labels into softened targets, thereby discouraging overly confident fitting to the training samples. This strategy typically improves generalization, reduces the risk of overconfidence, and provides better robustness to potential label noise. In addition, approaches such as focal loss, class-weighted loss, and threshold tuning/confidence calibration can alleviate the effects of class imbalance to some extent. However, our goal is to improve generalization and robustness without substantially increasing training complexity or computational overhead. Therefore, we adopt label smoothing as an output regularization strategy to reduce overconfidence and to mitigate potential overfitting arising from data noise and imbalanced class distributions.
2.2.3. Architecture of the DenseNet-CSL Network
To address the challenges associated with high sample complexity, inter-class similarity, and class imbalance in multi-species recognition tasks, this study proposes an improved network architecture, termed DenseNet-CSL, based on DenseNet121. The overall structure of the proposed model is shown in Figure 6. DenseNet-CSL enhances the baseline network by integrating three complementary strategies: Coordinate Attention, Deep Supervision, and Label Smoothing, each designed to target specific limitations of multi-class agricultural image recognition.
First, a Coordinate Attention (CA) mechanism is introduced after each Dense Block. While DenseNet effectively promotes feature reuse through dense connections, its ability to explicitly model spatial location information and long-range channel dependencies remains limited. The CA module addresses this issue by embedding positional information into channel attention. Specifically, it decomposes conventional two-dimensional global average pooling into separate horizontal and vertical encoding processes, enabling the network to capture both spatial structure and channel-wise importance simultaneously.
By incorporating coordinate information, the CA mechanism allows the network to focus more precisely on discriminative regions of harmful organisms, even under complex backgrounds and varying imaging conditions. This enhancement is particularly beneficial for agricultural images, where targets often exhibit subtle morphological differences and appear in cluttered natural environments. The detailed structure of the Coordinate Attention module is illustrated in Figure 7.
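The CA block described above can be sketched in PyTorch as follows. This is a minimal re-implementation of the published Coordinate Attention design; the reduction ratio and activation choice are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal Coordinate Attention block: direction-aware pooling along
    height and width, a shared 1x1 bottleneck, and per-direction gates."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)     # bottleneck width (assumed)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # Decompose 2D global pooling into two 1D encodings.
        x_h = x.mean(dim=3, keepdim=True)                       # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (n, c, w, 1)
        y = torch.cat([x_h, x_w], dim=2)                        # (n, c, h+w, 1)
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                         # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))     # (n, c, 1, w)
        return x * a_h * a_w    # re-weight features by both attention maps
```

In DenseNet-CSL such a block would follow each Dense Block, with `channels` matching that block's output width.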
Second, to alleviate gradient vanishing and improve training stability, deep supervision branches are introduced after the second and third Dense Blocks. By allowing intermediate feature representations to directly participate in loss backpropagation, deep supervision enhances gradient flow and encourages effective feature learning at different network depths. This design enables the model to capture multi-scale semantic information simultaneously, which is particularly important for recognizing agriculturally harmful organisms that exhibit large variations in size, morphology, and appearance.
With deep supervision, auxiliary classifiers are attached to intermediate layers, and their corresponding losses are jointly optimized with the main classifier. The overall loss function can be expressed as follows:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{main}} + \lambda_1 \mathcal{L}_{\mathrm{aux}_1} + \lambda_2 \mathcal{L}_{\mathrm{aux}_2}$$

where $\mathcal{L}_{\mathrm{main}}$ denotes the loss of the main classifier, $\mathcal{L}_{\mathrm{aux}_i}$ represents the loss of the $i$-th auxiliary classifier, and the coefficients $\lambda_1$ and $\lambda_2$ serve as weighting factors to balance the contributions of the main loss and the auxiliary losses.
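The joint objective can be sketched as below: auxiliary heads after the second and third Dense Blocks contribute weighted cross-entropy terms alongside the main head. The auxiliary weights used here are illustrative assumptions, not the paper's reported values.

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(main_logits, aux_logits_list, targets,
                          aux_weights=(0.3, 0.3)):
    """Main cross-entropy loss plus weighted auxiliary losses
    (one per intermediate classifier)."""
    loss = F.cross_entropy(main_logits, targets)
    for lam, aux_logits in zip(aux_weights, aux_logits_list):
        loss = loss + lam * F.cross_entropy(aux_logits, targets)
    return loss

# Placeholder logits standing in for the three classifier heads.
main_logits = torch.randn(8, 120)
aux1, aux2 = torch.randn(8, 120), torch.randn(8, 120)
targets = torch.randint(0, 120, (8,))
loss = deep_supervision_loss(main_logits, [aux1, aux2], targets)
```

At inference time the auxiliary heads are discarded and only the main classifier's output is used.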
Finally, to mitigate the effects of class imbalance, label noise, and overfitting, the traditional cross-entropy loss is replaced with label smoothing loss. Instead of assigning a probability of 1 to the ground-truth class, label smoothing distributes a small portion of probability mass to non-target classes. This strategy reduces overconfidence in model predictions and improves generalization, particularly in datasets with uneven class distributions.
Formally, the smoothed label distribution is defined as:

$$y_k' = (1 - \varepsilon)\, y_k + \frac{\varepsilon}{K}$$

where $y_k'$ denotes the smoothed target probability for class $k$, $\varepsilon$ is the smoothing factor (set to 0.1 in this study), and $K$ represents the total number of classes. Based on this formulation, the label smoothing loss is given by:

$$\mathcal{L}_{\mathrm{LS}} = -\sum_{k=1}^{K} y_k' \log p_k$$

where $p_k$ is the predicted probability of class $k$.
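With $\varepsilon = 0.1$ and $K$ classes, the true class receives $1 - \varepsilon + \varepsilon/K$ of the probability mass and every other class receives $\varepsilon/K$. PyTorch exposes this directly through the `label_smoothing` argument of `CrossEntropyLoss` (available since version 1.10, so compatible with the PyTorch 1.12.1 environment used here); the logits below are placeholders.

```python
import torch
import torch.nn as nn

# Cross-entropy with label smoothing (epsilon = 0.1), replacing the
# hard one-hot targets with softened distributions as defined above.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 120)              # batch of 4, K = 120 classes
targets = torch.tensor([3, 17, 88, 119])  # ground-truth class indices
loss = criterion(logits, targets)
```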
In summary, the three enhancements integrated into DenseNet-CSL—Coordinate Attention for improved feature representation, Deep Supervision for stabilized training and multi-scale learning, and Label Smoothing for enhanced generalization—jointly address key challenges in multi-class agricultural harmful organism recognition. These mechanisms significantly improve classification robustness and accuracy while maintaining a relatively lightweight network structure.
2.3. Experimental Environment and Parameter Settings
The experiments were conducted on a Windows Server 2022 Datacenter (64-bit) operating system equipped with an Intel® Xeon® Gold 5317 CPU operating at 3.00 GHz and 256 GB of RAM. Model training and evaluation were accelerated using an NVIDIA RTX 6000 Ada Generation GPU with 48 GB of video memory. The software environment comprised Python 3.9.15, CUDA 11.3, and PyTorch 1.12.1, with the Anaconda3 distribution used for environment management and PyCharm Community Edition 2022.1 used as the development IDE. A summary of the experimental platform and software configuration is provided in Table 4.
The dataset was first partitioned at the original-image level into training, validation, and test subsets using a class-stratified ratio of 6:2:2. Data augmentation was subsequently applied only to the training subset to increase intra-class variability and alleviate class imbalance. After augmentation, a total of 30,000 images were used for model training and evaluation, comprising 18,000 training images, 6000 validation images, and 6000 test images. This data partitioning strategy ensures sufficient samples for model optimization while enabling reliable and unbiased performance assessment.
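The class-stratified 6:2:2 split at the original-image level can be sketched with scikit-learn; the file paths and labels below are placeholders, and the random seed is an assumption.

```python
from sklearn.model_selection import train_test_split

# Hypothetical image paths and 120-class labels standing in for the
# real dataset index (augmentation happens only after this split).
paths = [f"img_{i}.jpg" for i in range(1000)]
labels = [i % 120 for i in range(1000)]

# First carve out 60% for training, stratified by class...
train_p, rest_p, train_y, rest_y = train_test_split(
    paths, labels, train_size=0.6, stratify=labels, random_state=42)

# ...then split the remaining 40% evenly into validation and test.
val_p, test_p, val_y, test_y = train_test_split(
    rest_p, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
```

Stratifying both splits keeps the per-class proportions consistent across the three subsets, which matters given the strong class imbalance noted in Section 2.1.4.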
During training, the AdamW optimizer was employed to update model parameters, as it provides adaptive learning rates while effectively decoupling weight decay from gradient updates. To further improve convergence behavior, a cosine annealing learning rate scheduler (CosineLR) was adopted for all trainable parameters. This scheduling strategy gradually reduces the learning rate from its initial value to a predefined minimum following a smooth cosine curve, which helps the model escape local minima and stabilizes training in later stages.
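The optimizer and scheduler configuration described above might be set up as follows. The weight-decay value and minimum learning rate are assumptions not stated in the text; the linear layer stands in for the full network.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 120)   # stand-in for DenseNet-CSL

# AdamW decouples weight decay from the adaptive gradient update.
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001,
                              weight_decay=1e-4)   # decay value assumed

# Cosine annealing from the initial lr down to eta_min over 200 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200, eta_min=1e-6)            # eta_min assumed

for epoch in range(200):
    # ... one training epoch over the data loader would run here,
    # calling optimizer.step() per batch ...
    scheduler.step()   # lr follows a smooth cosine curve per epoch
```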
To mitigate overfitting and enhance generalization performance, dropout was applied during training by randomly deactivating a subset of neuron connections. The network was trained for 200 epochs with a batch size of 32, allowing efficient utilization of GPU resources while maintaining stable gradient estimation throughout the training process. The detailed training hyperparameters are summarized in Table 5. All input images were resized to a fixed resolution of 224 × 224 pixels to ensure consistency with the backbone network architecture. A batch size of 32 was selected to balance training efficiency and memory usage. The initial learning rate was set to 0.001, and no pre-trained weights were used, allowing the model to be trained from scratch on the constructed dataset.
These parameter settings were selected to ensure stable optimization and fair comparison across different network architectures. By maintaining consistent training conditions, the experimental results reliably reflect the performance differences attributable to model design rather than parameter tuning.
2.4. Performance Evaluation Metrics
To comprehensively evaluate the performance of the proposed classification model, multiple evaluation metrics were employed, including the confusion matrix, precision, recall, F1-score, inference time, and model size. These metrics provide complementary perspectives on classification effectiveness, efficiency, and practical applicability. Together, they enable a systematic assessment of the model’s performance in multi-class agricultural harmful organism recognition tasks.
2.4.1. Confusion Matrix
The confusion matrix is a fundamental tool for evaluating classification performance, as it provides a detailed visualization of prediction outcomes across all categories. In the matrix, rows correspond to ground-truth labels, while columns represent predicted labels. Diagonal elements indicate correctly classified samples, whereas off-diagonal elements reflect misclassification cases.
By analyzing the confusion matrix, it is possible to identify categories that are consistently recognized with high accuracy as well as those that are prone to confusion. This analysis is particularly valuable for multi-species recognition tasks, as it helps reveal inter-class similarity and potential error patterns, thereby providing insights for further model optimization and refinement.
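A confusion matrix of this kind can be built with scikit-learn; the tiny label arrays below are placeholders for the 120-class test-set predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Placeholder ground-truth and predicted labels for three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows correspond to ground truth, columns to predictions;
# diagonal entries count correctly classified samples.
cm = confusion_matrix(y_true, y_pred)

# Per-class accuracy: diagonal over the row sums (true totals).
per_class_acc = cm.diagonal() / cm.sum(axis=1)
```

Inspecting the off-diagonal entries of `cm` reveals exactly which class pairs are confused, which is the analysis applied to the five models in Section 3.1.2.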
2.4.2. Precision
Precision measures the reliability of positive predictions made by the model and reflects the proportion of correctly classified positive samples among all samples predicted as positive. This metric is particularly important in applications where false positives may lead to unnecessary intervention or misjudgment. Precision is calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

where $TP$ denotes true positives and $FP$ denotes false positives.
2.4.3. Recall
Recall quantifies the model's ability to correctly identify positive samples and represents the proportion of true positives among all actual positive instances. This metric is especially critical in agricultural monitoring scenarios, where failing to detect harmful organisms may result in delayed control measures or economic losses. Recall is defined as:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where $FN$ denotes false negatives.
2.4.4. F1-Score
The F1-score is the harmonic mean of precision and recall and provides a balanced evaluation of classification performance by jointly considering both metrics. It is particularly suitable for datasets with imbalanced class distributions, as it penalizes extreme discrepancies between precision and recall. The F1-score is computed as follows:

$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
By integrating both precision and recall into a single measure, the F1-score offers a comprehensive indicator of overall model performance in multi-class recognition tasks.
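The three metrics defined above can be computed in one call with scikit-learn; macro averaging (each class weighted equally, appropriate under class imbalance) is an assumption here, and the labels are placeholders.

```python
from sklearn.metrics import precision_recall_fscore_support

# Placeholder labels; in the actual evaluation these would be the
# 6000 test-set ground truths and model predictions over 120 classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Macro average: per-class precision/recall/F1, then unweighted mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
```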
3. Results
3.1. Experimental Results
This section presents the experimental results obtained on the constructed agricultural harmful organism dataset and evaluates the performance of the proposed DenseNet-CSL model through comparative experiments. Several mainstream deep learning architectures, including ResNet101, DenseNet121, EfficientNetV2, MobileNetV3, Vision Transformer, and Swin Transformer, were selected as baseline models. The comparison focuses on training behavior, classification accuracy, and convergence characteristics, aiming to assess the effectiveness and robustness of the proposed improvement.
3.1.1. Comparison of Loss Functions and Accuracy
In this study, seven neural network models (ResNet101 [12], DenseNet121, EfficientNetV2 [13], MobileNetV3 [14], Vision Transformer [15], Swin Transformer [16], and the proposed DenseNet-CSL) were evaluated to examine their convergence behavior and classification performance. Figure 8 illustrates the loss curves of the seven models during training.
As shown in Figure 8, the training loss of all models decreases progressively with increasing epochs, indicating stable convergence behavior. However, notable differences can be observed in both convergence speed and final loss values. Among the evaluated models, DenseNet-CSL exhibits the fastest loss reduction and reaches a stable state after approximately 50 epochs, with a final loss value of around 1.1. This value is substantially lower than that of the other models, suggesting a stronger capability in learning discriminative features from complex agricultural images.
By comparison, DenseNet121 and ResNet101 show similar convergence trends, with their loss values gradually stabilizing at approximately 2.0. EfficientNetV2 and MobileNetV3 also demonstrate convergent behavior, although their loss reduction occurs more slowly and stabilizes at relatively higher values, indicating comparatively weaker fitting performance on the constructed dataset. It is worth noting that the loss curves of VisionTransformer and SwinTransformer decrease much more slowly than those of the other models. Even after 150 epochs, the training process has not fully stabilized, and the final loss remains relatively high, indicating a lack of effective convergence. This suggests that, under the current task setting and dataset scale, Transformer-based architectures are difficult to train adequately and may be constrained by limited data availability and/or high image noise, resulting in suboptimal classification performance.
Figure 9 presents the corresponding accuracy curves during training. All models show a rapid increase in accuracy during the early training stages, followed by gradual stabilization. DenseNet-CSL achieves higher accuracy at an earlier stage and ultimately stabilizes at approximately 0.80, outperforming the other models. DenseNet121 and ResNet101 reach stable accuracies of around 0.77 and 0.71, respectively, while EfficientNetV2 and MobileNetV3 converge to lower accuracy levels and exhibit greater fluctuations during early training. In contrast, VisionTransformer and SwinTransformer achieved notably lower accuracies. Their performance started at a low level in the early training stage and improved only to approximately 0.47 and 0.56 by the end, which is substantially below that of the CNN-based baselines. This observation further supports the conclusion drawn from the loss curves that these Transformer-based models fail to converge effectively under the current training setting. Possible explanations include the limited dataset size and the higher optimization difficulty of Transformers, which may prevent them from realizing their expected advantages on this task. Therefore, in the subsequent experimental analysis and model comparison, the two types of Transformer models will not be emphasized.
Taken together, the loss and accuracy results indicate that DenseNet-CSL not only converges more rapidly but also achieves superior final performance. These observations suggest that the integration of Coordinate Attention, Deep Supervision, and Label Smoothing contributes to more effective feature learning and improved training stability in multi-class agricultural harmful organism recognition tasks.
3.1.2. Comparative Analysis of Confusion Matrices
To further compare the classification performance of DenseNet-CSL with the four baseline models, confusion matrices were constructed to analyze prediction results at the category level. The confusion matrix provides a detailed comparison between predicted labels and ground-truth labels, recording both correctly classified and misclassified samples for each species. Based on the 6000 test images covering 120 categories of agriculturally harmful organisms, the confusion matrices of the five models are presented in Figure 10. The left panel shows the complete confusion matrix, while the right panel displays a subset of categories (species 1–10 and 76–95) to facilitate clearer visualization and interpretation.
Overall, DenseNet-CSL demonstrates consistently high classification accuracy across most categories, with a strong concentration of correct predictions along the diagonal of the confusion matrix. Several categories, including Asota caricae, Sphagneticola trilobata, Spartina alterniflora, Ipomoea cairica, apple rust, and Cuscuta chinensis, exhibit very low misclassification rates. These species possess relatively distinctive visual characteristics, enabling reliable recognition by the model.
Despite the overall strong performance, misclassification still occurs among certain categories with high phenotypic similarity. For instance, Colias poliographus and Pieris rapae are occasionally confused due to similarities in wing shape and coloration. Likewise, apple Alternaria leaf spot and apple gray spot share visually similar disease symptoms, which can lead to classification ambiguity. Similar confusion patterns are also observed between Vanessa indica and Vanessa cardui, as well as between Acherontia styx and Acherontia lachesis. These errors primarily arise from subtle inter-species differences and represent typical challenges in fine-grained image classification tasks.
In comparison with the baseline models, DenseNet-CSL exhibits fewer misclassification cases across visually similar categories, indicating improved discriminative capability. Although some confusion remains unavoidable in cases of extreme morphological similarity, the overall results suggest that the proposed model achieves a favorable balance between classification accuracy and robustness across diverse species of pests, weeds, and crop diseases.
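The category-level analysis above can be reproduced with a short routine that accumulates a confusion matrix from predicted and ground-truth labels and then ranks the most frequent confusions. The sketch below is illustrative only; the function names are ours and are not taken from the paper's codebase:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] = number of samples whose true class is i and predicted class is j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def most_confused_pairs(cm, top=5):
    """Return the off-diagonal cells with the largest counts,
    i.e. the (true, predicted) pairs confused most often."""
    off = cm.copy()
    np.fill_diagonal(off, 0)  # ignore correct predictions
    flat = np.argsort(off, axis=None)[::-1]
    idx = np.dstack(np.unravel_index(flat, off.shape))[0]
    return [(int(i), int(j), int(off[i, j])) for i, j in idx[:top] if off[i, j] > 0]
```

Applied to the full 120-class matrix, such a ranking would be expected to surface visually similar pairs like those discussed above (e.g., Colias poliographus vs. Pieris rapae).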
3.2. Analysis of Test Results
To further evaluate the overall performance of the proposed DenseNet-CSL model, a quantitative comparison was conducted using precision, recall, F1-score, inference time, and model size as evaluation metrics. The results of all models on the test set are summarized in Table 6.
As shown in Table 6, DenseNet-CSL achieves the highest precision (0.813), recall (0.801), and F1-score (0.800) among all evaluated models. Compared with the baseline DenseNet121, these values represent improvements of 4.1%, 3.1%, and 3.4%, respectively. These results indicate that the proposed model is more effective in correctly identifying agricultural harmful organisms while maintaining a low rate of misclassification.
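For reference, macro-averaged precision, recall, and F1 of the kind reported in Table 6 can be derived from a confusion matrix as sketched below. This is a minimal sketch assuming rows index true classes and columns index predictions; the paper does not publish its evaluation code, so this convention is our assumption:

```python
import numpy as np

def macro_metrics(cm):
    """Macro-averaged precision, recall and F1 from a confusion matrix
    (rows = true classes, columns = predicted classes)."""
    tp = np.diag(cm).astype(float)
    pred = cm.sum(axis=0).astype(float)   # predicted count per class
    true = cm.sum(axis=1).astype(float)   # ground-truth count per class
    # guard against classes with zero predictions or zero samples
    prec = np.divide(tp, pred, out=np.zeros_like(tp), where=pred > 0)
    rec = np.divide(tp, true, out=np.zeros_like(tp), where=true > 0)
    denom = prec + rec
    f1 = np.divide(2 * prec * rec, denom, out=np.zeros_like(tp), where=denom > 0)
    return prec.mean(), rec.mean(), f1.mean()
```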
In terms of computational efficiency, DenseNet-CSL requires 22.360 s for inference on the test set, which is 1.36 s faster than DenseNet121. At the same time, the increase in model size is relatively small, with an additional 1.772 MB of parameters. This suggests that the proposed architectural enhancements improve recognition performance without imposing a substantial computational burden.
Further comparison shows that although ResNet101 and EfficientNetV2 require less inference time, their larger model sizes and lower accuracy limit their suitability for time-sensitive agricultural applications. MobileNetV3, despite its smaller model size and shorter inference time, remains substantially less accurate than DenseNet-CSL. Overall, DenseNet-CSL provides a favorable trade-off between accuracy and efficiency, making it well suited for multi-class agricultural harmful organism recognition tasks.
To further examine the classification stability and per-category performance of each model, the recognition accuracy of all 120 species in the test set was statistically analyzed. For ease of comparison, recognition accuracy was divided into five intervals: below 60%, 60–85%, 85–95%, 95–100%, and 100%. The number of species falling into each interval was counted for all models, and the results are illustrated in Figure 11.
As shown in Figure 11, DenseNet121 exhibits a strong concentration of species within the 60–85% accuracy interval, indicating relatively stable performance across a large proportion of categories. In contrast, DenseNet-CSL demonstrates a higher proportion of species in the upper accuracy intervals. A larger number of species are recognized with accuracies above 85%, including a substantial portion achieving accuracies above 95% and 100%, highlighting its superior high-precision recognition capability. Moreover, DenseNet-CSL shows only eleven species in the below-60% accuracy interval, reflecting improved robustness and reduced performance degradation across difficult categories. By comparison, ResNet101 places a considerable number of species in the lower and middle accuracy intervals and achieves fewer categories with perfect accuracy, indicating limited overall discriminative capability. EfficientNetV2 presents a relatively balanced distribution but includes several species with low recognition accuracy, while MobileNetV3 shows a more uniform distribution across intervals without dominating in high-accuracy ranges. Overall, the accuracy interval analysis highlights the strong generalization ability and classification stability of DenseNet-CSL. The model not only increases the number of species recognized with high accuracy but also effectively reduces the occurrence of low-accuracy cases, which is essential for reliable deployment in practical agricultural monitoring and quarantine applications.
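Interval statistics of this kind can be obtained by binning per-class accuracies. Because adjacent intervals share endpoints in the text (60–85%, 85–95%, 95–100%, 100%), the sketch below assigns each shared boundary to the higher interval; that boundary convention is our assumption, not stated in the paper:

```python
def bin_accuracies(per_class_acc):
    """Count species per accuracy interval: <60%, 60-85%, 85-95%, 95-<100%, 100%.
    Accuracies are fractions in [0, 1]; shared boundaries go to the higher bin."""
    bins = {"<60%": 0, "60-85%": 0, "85-95%": 0, "95-100%": 0, "100%": 0}
    for a in per_class_acc:
        if a >= 1.0:
            bins["100%"] += 1
        elif a >= 0.95:
            bins["95-100%"] += 1
        elif a >= 0.85:
            bins["85-95%"] += 1
        elif a >= 0.60:
            bins["60-85%"] += 1
        else:
            bins["<60%"] += 1
    return bins
```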
3.3. Phenotypic Analysis of Different Harmful Organism Images
To further evaluate the recognition performance of DenseNet-CSL under complex phenotypic conditions, the model was tested using a set of representative samples exhibiting diverse morphological characteristics. As shown in Figure 12, agricultural harmful organisms often present pronounced sexual dimorphism (Figure 12a), substantial intra-class variability (Figure 12b), and high inter-class similarity (Figure 12c). Despite these challenges, DenseNet-CSL is able to correctly identify most samples, indicating its capacity to capture discriminative visual features under realistic and heterogeneous conditions.
These results suggest that the proposed model can effectively extract and integrate fine-grained morphological cues, even when visual differences between categories are subtle or when individuals of the same species display notable phenotypic variation. Such robustness is particularly important for agricultural applications, where imaging conditions are often uncontrolled and biological traits vary across growth stages and environments.
However, misclassification still occurs among certain species with high phenotypic similarity, as illustrated in Figure 13a. These species exhibit extremely similar morphological traits, making discrimination challenging even for experienced taxonomists. Typical examples include Colias erate [17,18], Pieris rapae [19], Colias fieldii [18], Colias poliographus [20], and Catopsilia scylla [21,22]. All of these species belong to the family Pieridae and are characterized by predominantly white or yellow wing coloration with similar pattern distributions.
The primary interspecific differences among these species lie in subtle variations in wing shape, vein structure, and the spatial arrangement of color patches. Such fine-grained distinctions are difficult to capture consistently, particularly when images are affected by viewpoint changes, illumination variation, or partial occlusion. As a result, these species represent a typical and challenging case for fine-grained classification in agricultural image recognition.
Similar difficulties are observed between Vanessa indica [23,24] and Vanessa cardui [23,25], which share highly similar wing patterns and coloration (Figure 13b). Although V. indica generally exhibits slightly larger body size and more saturated color patches, these differences may be diminished under certain imaging conditions, leading to occasional misclassification.
In addition to highly similar species, misclassification is also observed in certain species that do not exhibit strong phenotypic resemblance, as shown in Figure 14a. A representative example is Erigeron annuus [26,27], a highly invasive weed species belonging to the family Asteraceae. Although this species possesses relatively distinctive morphological features, misclassification may still occur under specific conditions. Further analysis indicates that such errors primarily arise from two factors. First, key morphological features may be partially occluded by surrounding vegetation or insects, resulting in incomplete feature representation (Figure 14b). Second, phenotypic variation caused by environmental conditions or growth stages—such as changes in flower color or inflorescence structure—can alter the visual appearance of the species (Figure 14c), increasing recognition difficulty.
Overall, DenseNet-CSL demonstrates strong recognition capability under a wide range of phenotypic conditions commonly encountered in agricultural environments. While misclassification persists in cases involving extreme morphological similarity or significant phenotypic variation, these limitations are largely attributable to inherent biological characteristics and data constraints rather than model instability. Future work may further enhance recognition performance by incorporating fine-grained classification strategies, region-aware attention mechanisms, or multimodal information, such as hyperspectral or structural data, to improve discrimination among visually similar species.
3.4. Ablation Study
To further investigate the contribution of each component integrated into the proposed DenseNet-CSL architecture, an ablation study was conducted using DenseNet121 as the baseline model. Different enhancement strategies were applied individually and in combination, and their performance was evaluated on the constructed dataset. The corresponding results are summarized in Table 7.
As shown in Table 7, the baseline DenseNet121 model achieves a precision of 0.772, a recall of 0.770, and an F1-score of 0.766. When the Coordinate Attention (CA) mechanism is introduced, these metrics increase to 0.777, 0.771, and 0.768, respectively, while the inference time is reduced from 23.720 s to 22.991 s. This improvement indicates that CA effectively enhances spatial feature representation and channel-wise dependency modeling, enabling the network to focus more accurately on discriminative regions.
When Label Smoothing (LS) is applied independently, the model shows the most notable single-module improvement, reaching 0.797 precision, 0.797 recall, and an F1-score of 0.790, with an inference time of 22.804 s. This suggests that LS effectively regularizes the classification objective and stabilizes training, leading to more reliable generalization. In contrast, using Deep Supervision (DS) alone does not yield consistent benefits, resulting in 0.766/0.765/0.757, which is slightly lower than the baseline, implying that auxiliary supervision may not be well-aligned with this dataset or training setup when used in isolation.
For module combinations, CA + LS achieves the best overall accuracy among the tested variants, with an F1-score of 0.799 (precision 0.810, recall 0.803), although its inference time increases slightly to 23.196 s. Adding DS to LS (LS + DS) does not further improve accuracy beyond LS alone (F1 remains 0.790), but it reduces inference time to 22.500 s. Finally, integrating all three components (CA + LS + DS) yields 0.812/0.801/0.797 with the fastest inference time among enhanced configurations (22.368 s), demonstrating a favorable trade-off between accuracy and efficiency. Overall, the ablation results indicate that LS is the primary contributor to performance gains, CA offers additional but modest improvement, and DS provides limited accuracy benefits, mainly affecting efficiency when combined with other modules. These findings support the effectiveness of the proposed DenseNet-CSL design, particularly highlighting the complementary role of CA + LS in improving recognition performance.
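To make the role of Label Smoothing concrete, the NumPy sketch below computes cross-entropy against a smoothed target distribution, following the formulation used by `torch.nn.CrossEntropyLoss` with its `label_smoothing` argument. The smoothing factor `eps=0.1` is a common default and only a placeholder; the paper does not restate the exact value used:

```python
import numpy as np

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a smoothed target distribution that places
    (1 - eps) + eps/K on the true class and eps/K on every other class."""
    k = logits.shape[-1]
    # numerically stable log-softmax
    shifted = logits - logits.max()
    logp = shifted - np.log(np.exp(shifted).sum())
    smooth = np.full(k, eps / k)
    smooth[target] += 1.0 - eps
    return float(-(smooth * logp).sum())
```

With `eps=0.0` this reduces to the ordinary cross-entropy; increasing `eps` penalizes over-confident predictions on the true class, which is the regularization effect discussed above.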
3.5. Grad-CAM Visualization Comparison
To further interpret the feature-learning behaviors of different models and to examine whether, when trained on mixed-source images, the model relies primarily on biologically relevant cues rather than background artifacts, we employ Gradient-weighted Class Activation Mapping (Grad-CAM) to visualize the spatial regions most associated with the classification decisions. Grad-CAM provides an intuitive representation of which image regions contribute most to a model’s prediction, thereby facilitating qualitative comparison of feature extraction capability across different architectures.
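For readers unfamiliar with the technique, a minimal Grad-CAM can be implemented with PyTorch hooks as sketched below. The model and target layer passed in are placeholders rather than the paper's DenseNet-CSL; the routine itself follows the standard Grad-CAM recipe (channel weights from spatially averaged gradients, weighted activation sum, ReLU, normalization):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Return a [0, 1]-normalized Grad-CAM heatmap for one image and class."""
    acts, grads = {}, {}

    def fwd_hook(_, __, output):
        acts["a"] = output          # activations of the target layer

    def bwd_hook(_, grad_in, grad_out):
        grads["g"] = grad_out[0]    # gradients of the class score w.r.t. activations

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        model.zero_grad()
        score = model(image)[0, class_idx]
        score.backward()
    finally:
        h1.remove()
        h2.remove()

    a = acts["a"][0]                     # (C, H, W)
    w = grads["g"][0].mean(dim=(1, 2))   # (C,) per-channel importance weights
    cam = F.relu((w[:, None, None] * a).sum(dim=0))
    return (cam / (cam.max() + 1e-8)).detach()
```

For a DenseNet-style backbone, the target layer would typically be the last dense block's output; here any convolutional layer works for illustration.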
Figure 15 shows representative Grad-CAM visualizations produced by ResNet101, DenseNet121, EfficientNetV2, and DenseNet-CSL for six test samples spanning both laboratory-style PlantVillage disease images (typically characterized by uniform backgrounds) and field-captured pest/weed images (typically characterized by cluttered natural backgrounds). Overall, ResNet101 and EfficientNetV2 tend to produce more scattered or multi-region activations, and in several cases their responses extend beyond the target organism to surrounding leaves, soil, or background textures. This indicates that these models may be more sensitive to background clutter and may partially rely on context-related cues rather than strictly discriminative biological regions. In contrast, DenseNet121 shows relatively improved localization but still exhibits occasional background leakage. In comparison, DenseNet-CSL consistently focuses on more complete and biologically relevant regions of the harmful organisms, such as characteristic body contours, leaf lesion areas, or distinctive structural features. This observation indicates that the proposed model is more effective in capturing spatially coherent and semantically meaningful features.
The improved localization performance of DenseNet-CSL can be attributed to the integration of Coordinate Attention and Deep Supervision mechanisms. Coordinate Attention enables the network to encode positional information along spatial dimensions, thereby enhancing sensitivity to target location and shape. Meanwhile, Deep Supervision encourages intermediate layers to learn discriminative features at multiple scales, which contributes to more stable and interpretable attention patterns. As a result, DenseNet-CSL exhibits clearer and more focused activation regions, even in images containing cluttered backgrounds or subtle visual differences.
Overall, the Grad-CAM visualization results provide qualitative evidence that DenseNet-CSL achieves more accurate feature localization and improved interpretability compared with the baseline models. These findings are consistent with the quantitative performance gains reported in previous sections and further support the effectiveness of the proposed architecture for multi-class agricultural harmful organism recognition.
4. Discussion
This study addresses the challenge of multi-class recognition of agricultural harmful organisms by integrating pest, weed, and crop disease identification into a unified deep learning framework. By constructing a comprehensive dataset encompassing 120 species across three major biological categories and proposing the DenseNet-CSL model, this work aims to bridge the gap between single-category recognition studies and the complex requirements of real agricultural environments. The experimental results demonstrate that the proposed approach achieves stable and accurate performance across diverse species and imaging conditions, highlighting its promise for agricultural monitoring scenarios, while further validation under broader deployment conditions is still required.
From a data perspective, the identification of agricultural harmful organisms inherently involves several challenges, including large species diversity, high inter-class similarity, and complex imaging backgrounds. In real agricultural environments, images are often captured under uncontrolled conditions, where variations in illumination, viewpoint, growth stage, and background clutter are common. These factors contribute to substantial visual variability within the same species and subtle differences between different species, making accurate recognition particularly demanding.
Traditional identification methods based on morphological traits and expert experience can achieve reliable results in controlled settings; however, they are time-consuming, highly subjective, and difficult to scale to large-scale monitoring tasks. In contrast, deep learning–based approaches enable automatic extraction of discriminative features and rapid processing of large volumes of image data. Nevertheless, many existing studies focus primarily on single-category recognition tasks, such as pest or disease identification alone, and rely on relatively homogeneous datasets. This limitation restricts their applicability in practical scenarios where multiple types of harmful organisms coexist and need to be identified simultaneously.
At the model level, the performance of DenseNet-CSL can be attributed to its targeted architectural design, which directly addresses the key challenges posed by multi-species agricultural datasets. Coordinate Attention enhances the model’s ability to capture spatial positional cues and channel-wise dependencies, enabling more precise localization of discriminative regions under complex backgrounds. This capability is particularly relevant for agricultural images, where harmful organisms often occupy only a small portion of the image and are surrounded by visually cluttered environments.
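A Coordinate Attention block of the kind described here can be sketched in PyTorch as follows. This follows the published Coordinate Attention design (pooling along height and width separately, a shared transform, then per-axis sigmoid gates); the reduction ratio is chosen for illustration and is not necessarily the value used in DenseNet-CSL:

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # pool over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # pool over height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = self.pool_h(x)                           # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)       # (B, C, W, 1)
        y = torch.cat([xh, xw], dim=2)                # shared transform over both axes
        y = self.act(self.bn(self.conv1(y)))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                       # (B, C, H, 1) gate
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))   # (B, C, 1, W) gate
        return x * ah * aw                            # position-aware reweighting
```

Because the two gates factor attention along height and width, the block preserves positional information that plain channel attention (e.g., SE) discards, which is why it helps localize small targets in cluttered scenes.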
Deep Supervision further contributes to improved performance by facilitating gradient propagation and encouraging effective feature learning at intermediate layers. By providing auxiliary supervision signals, the network is able to learn multi-scale representations that capture both fine-grained local details and higher-level semantic information. This is especially beneficial for recognizing harmful organisms that exhibit large variations in size, morphology, and growth stage. In addition, Label Smoothing regularizes the output distribution by reducing over-confident predictions and improves robustness to potential label noise; it may indirectly alleviate imbalance-related bias but does not replace dedicated imbalance-aware strategies (e.g., focal loss or class re-weighting).
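A minimal sketch of how Deep Supervision and Label Smoothing combine during training: an auxiliary classifier attached to an intermediate stage contributes a weighted loss term alongside the main head. The toy backbone, auxiliary weight, and smoothing factor below are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DeepSupervisedNet(nn.Module):
    """Toy two-stage backbone with an auxiliary classifier on stage 1."""
    def __init__(self, num_classes=120):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.aux_head = nn.Linear(8, num_classes)    # supervises intermediate features
        self.main_head = nn.Linear(16, num_classes)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        aux = self.aux_head(self.pool(f1).flatten(1))
        main = self.main_head(self.pool(f2).flatten(1))
        return main, aux

def ds_loss(main, aux, target, aux_weight=0.3, label_smoothing=0.1):
    """Total loss = smoothed CE on the main head + weighted smoothed CE on the aux head."""
    ce = nn.CrossEntropyLoss(label_smoothing=label_smoothing)
    return ce(main, target) + aux_weight * ce(aux, target)
```

The auxiliary head is used only during training; at inference, predictions come from the main head alone, so Deep Supervision adds no deployment cost.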
Together, these design choices allow DenseNet-CSL to achieve faster convergence and more stable performance compared with baseline architectures. Rather than relying on a single enhancement, the combination of complementary mechanisms enables the model to adapt more effectively to the heterogeneous visual characteristics of pests, weeds, and crop diseases. Beyond theoretical exploration, the proposed architectural enhancements do not impose a substantial computational burden; this is supported by standardized inference-time measurements under a fixed protocol (same input resolution and batch size) and by model complexity indicators (e.g., parameter counts and FLOPs).
From an error-diagnosis perspective, the remaining misclassifications can be broadly attributed to three factors: (1) phenotypic resemblance across classes, (2) partial occlusion or truncation of key discriminative regions, and (3) appearance variations induced by growth stage and illumination. These factors are consistent with the visual characteristics of field imagery. Despite the overall strong performance of DenseNet-CSL, errors still occur for a subset of species, particularly those with high inter-class similarity or pronounced intra-class variability. Representative cases include species within the family Pieridae and the genus Vanessa, where interspecific differences are subtle and often depend on fine-grained morphological cues such as wing-vein patterns or slight color variations. Under natural imaging conditions, these cues may be partially obscured or visually degraded by changes in illumination, viewpoint, and background clutter, thereby increasing the likelihood of confusion. In addition, some misclassification cases occur in species that do not inherently exhibit high phenotypic similarity. Further analysis suggests that these errors are mainly associated with occlusion of key morphological features or phenotypic variation caused by environmental conditions and growth stages. Such factors are difficult to fully control during field data acquisition and represent inherent challenges in agricultural image recognition tasks. These observations indicate that misclassification is not solely attributable to model limitations, but also reflects the intrinsic complexity of biological systems and real-world imaging environments.
Building on the above analysis, several directions may be explored to further improve the recognition of agricultural harmful organisms under complex real-world conditions. For species with high phenotypic similarity, fine-grained classification strategies—such as region-aware attention mechanisms or multi-branch feature extraction networks—may help enhance sensitivity to subtle morphological differences. In addition, increasing sample diversity through targeted data collection or advanced data augmentation techniques could improve model robustness for underrepresented categories.
Beyond image-based approaches, integrating multimodal information offers another promising direction. The incorporation of hyperspectral data, structured-light measurements, or morphological descriptors may provide complementary cues that are difficult to capture using RGB images alone. Furthermore, extending the proposed framework to additional species and geographic regions would allow a more comprehensive evaluation of its generalization capability across different agricultural contexts. Moreover, when mixing laboratory-style disease images with field-captured pest/weed images, shortcut learning caused by acquisition-specific backgrounds cannot be fully ruled out. Future work will include Grad-CAM-based verification and background-perturbation robustness tests to confirm that the model attends to biologically relevant regions. In addition, transfer learning with pretrained backbones and comparisons with modern baselines will be explored to further improve robustness and strengthen the empirical evidence.
5. Conclusions
This study addresses the challenges of multi-class recognition of agricultural harmful organisms arising from species diversity, morphological variability, and complex background conditions. To support research in this area, a comprehensive image dataset was constructed, covering three major categories of agricultural threats—pests, weeds, and crop diseases. This dataset provides a unified data foundation for multi-target recognition tasks and reflects the complexity of real agricultural environments.
Based on the constructed dataset, an improved convolutional neural network, DenseNet-CSL, was developed by integrating Coordinate Attention, Deep Supervision, and Label Smoothing mechanisms into the DenseNet121 backbone. These enhancements collectively improve feature representation, stabilize model training, and enhance generalization capability under class imbalance and complex visual conditions. Experimental results demonstrate that DenseNet-CSL achieves higher recognition accuracy and improved efficiency compared with the baseline DenseNet121 model, while maintaining a lightweight network structure.
The proposed DenseNet-CSL model exhibits strong robustness under challenging scenarios, including complex backgrounds, high phenotypic similarity among species, and imbalanced sample distributions. These characteristics indicate that the model is well-suited for practical applications such as agricultural monitoring, pest and disease diagnosis, and port quarantine inspection. The results of this study further demonstrate the potential of deep learning techniques for large-scale and multi-class recognition tasks in agricultural biosecurity.
Future research may further extend the proposed framework by incorporating additional agricultural harmful organism species and more diverse data sources to improve generalization across regions and environments. Moreover, integrating advanced deep learning strategies and deploying the model on agricultural monitoring platforms or edge computing devices could enable real-time detection and early warning in field and inspection scenarios. Overall, this study provides both methodological insight and experimental evidence for the development of intelligent agricultural harmful organism recognition systems.