MSA-ResNet: A Neural Network for Fine-Grained Instar Identification of Spodoptera frugiperda Larvae in Smart Agriculture

Xu, Quanyuan; Wang, Mingyang; Lu, Ying; Feng, Dan; Ye, Hui; Li, Yonghe

doi:10.3390/agronomy15122724

Open AccessArticle

MSA-ResNet: A Neural Network for Fine-Grained Instar Identification of Spodoptera frugiperda Larvae in Smart Agriculture

by

Quanyuan Xu

^1,2,†,

Mingyang Wang

^3,†,

Ying Lu

^1,2,

Dan Feng

⁴,

Hui Ye

^5,* and

Yonghe Li

^6,*

¹

College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming 650224, China

²

Key Laboratory of Forestry and Ecological Big Data State Forestry Administration, Southwest Forestry University, Kunming 650024, China

³

Yinglin Branch of Yunnan Institute of Forest Inventory and Planning, Kunming 650021, China

⁴

Forest Protection Research Institute, Yunnan Academy of Forestry and Grassland, Kunming 650201, China

⁵

School of Ecology and Environment, Yunnan University, Kunming 650091, China

⁶

Yunnan Plateau Characteristic Agricultural Industry Research Institute, Yunnan Agricultural University, Kunming 650210, China

^*

Authors to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Agronomy 2025, 15(12), 2724; https://doi.org/10.3390/agronomy15122724

Submission received: 23 October 2025 / Revised: 15 November 2025 / Accepted: 25 November 2025 / Published: 26 November 2025

(This article belongs to the Section Pest and Disease Management)

Download

Browse Figures

Versions Notes

Abstract

The Spodoptera frugiperda (fall armyworm), a globally significant agricultural pest, poses severe threats to crop production. Accurate identification of larval instar stages is crucial for implementing precise control measures and reducing pesticide use. However, traditional identification methods suffer from low efficiency and heavy reliance on expert knowledge, while existing deep learning models still face challenges such as insufficient feature extraction and high computational complexity in fine-grained instar classification. To address these issues, this study proposes a novel network model, termed Multi-Scale Improved Self-Attention ResNet (MSA-ResNet), which integrates large convolutional kernels (LCK), atrous spatial pyramid pooling (ASPP), and an improved self-attention (ISA) mechanism into the ResNet50 backbone. These enhancements enable the model to more effectively capture and discriminate subtle morphological details of larvae. Experiments conducted on a self-constructed dataset comprising 24,179 images across six instar stages demonstrate that MSA-ResNet achieves an accuracy of 96.81% on the test set, significantly outperforming mainstream models such as ResNet50, VGG16, and MobileNetV3. In particular, the precision for the first instar increased by 12.94%, while the recall rates for the second and fourth instars improved by 16% and 8.97%, respectively. Ablation studies further validate the effectiveness of each module and the optimal embedding strategy. This research presents a high-precision and efficient intelligent solution for larval instar identification of S. frugiperda, offering a transferable reference for fine-grained image recognition tasks in agricultural pest management.

Keywords:

Spodoptera frugiperda; instar identification; deep learning; precision agriculture; sustainable pest management

1. Introduction

The Spodoptera frugiperda (fall armyworm), listed by the Food and Agriculture Organization (FAO) of the United Nations as a major transboundary migratory pest of global concern, has spread from the Americas to nearly one hundred countries and regions worldwide [1,2,3]. Its larvae are highly voracious and destructive, often causing severe yield losses or even total crop failure. The estimated annual economic loss exceeds USD 10 billion, posing a significant threat to the safe production of staple crops such as maize and rice, as well as to global food security and the livelihoods of smallholders [4,5,6].

The larvae of S. frugiperda typically undergo six instar stages, each exhibiting distinct behavioral patterns and damage characteristics. Early instar larvae (1st–3rd instars) tend to cluster on leaf surfaces, producing characteristic “window-pane” feeding symptoms. Although their feeding capacity is limited, this stage represents a critical window for effective chemical control due to their concealed feeding habits [7,8]. In contrast, late instar larvae (4th–6th instars, especially the 5th and 6th) become more dispersed, and their feeding intensity increases dramatically—accounting for approximately 80–90% of the total larval consumption. They are capable of boring into stems and ears, resulting in lodging and severe yield reduction. Studies have shown that larval tolerance to insecticides increases significantly with development, and older larvae exhibit strong resistance, thereby diminishing the effectiveness of chemical control [9]. Differences in susceptibility to natural enemies among instars have also been reported [10,11,12]. Therefore, accurate identification of larval instar stages—particularly the timely detection of early instars—is of great importance for precision pest management and minimizing economic losses.

Currently, a major challenge in field control lies in the difficulty of real-time and accurate identification of larval instar stages, especially in distinguishing the easily controllable early instars from the highly destructive and resistant late instars. Specific difficulties include: (1) early instars are small in size, cryptically colored, and often hidden on the undersides or whorls of leaves, making early detection difficult; (2) larval coloration varies with environmental conditions and host plants [13]; and although late instars exhibit more distinct morphological traits, they can still be easily confused with other noctuid larvae such as Mythimna separata, leading to misidentification by inexperienced personnel; (3) traditional morphological identification methods rely heavily on expert experience and are time-consuming, which limits their scalability for large-scale monitoring [14]; while molecular diagnostic techniques [15] offer high accuracy, they are costly and technically demanding, restricting their field deployment. Moreover, the timeliness of identification poses another constraint: in the field, larvae of multiple instars often coexist and develop asynchronously [16]. Missing the early instar control window allows larvae to progress into more resistant, stem-boring stages, resulting in reduced control efficacy, aggravated economic loss, and increased environmental and pesticide residue risks.

In recent years, rapid advances in deep learning have driven remarkable progress in agricultural pest image recognition. Researchers have developed a range of deep learning–based models tailored to different crops and pest species, achieving high accuracy and efficiency in classification tasks [17,18,19,20,21,22]. For example, Chiranjeevi et al. [23] proposed an end-to-end framework named InsectNet for high-precision recognition of multiple insect categories in agroecosystems, including pollinators, parasitoids, predators, and pests. Zhang et al. [24] developed a Vision Transformer-based method for crop disease and pest recognition, achieving superior performance to conventional CNNs on two datasets comprising 10 and 15 pest categories. Dharmasasth et al. [25] introduced a CNN ensemble model optimized by a genetic algorithm to improve generalization ability, outperforming both single models and average ensemble strategies on a 10-class insect dataset. An et al. [26] proposed a feature fusion network integrating ResNet, Vision Transformer, and Swin Transformer architectures and employed Grad-CAM-based attention selection for enhanced interpretability. Their model achieved state-of-the-art accuracy on both 20-class subsets and the full IP102 dataset, demonstrating strong robustness under augmented image conditions.

Although deep learning has been widely applied in pest identification, most studies have focused on species-level or disease-type classification tasks, while automatic identification of larval instar stages of S. frugiperda remains largely unexplored. To date, few studies have systematically modeled and classified its developmental stages using deep learning. Therefore, this study aims to fill this research gap by proposing a novel, deep learning-based approach for accurate instar-stage recognition of this globally significant pest.

To overcome the aforementioned limitations, this study develops a structurally improved version of the classical ResNet50 architecture based on a self-constructed, high-resolution image dataset of S. frugiperda larvae. A novel recognition model, termed Multi-Scale and Self-Attention ResNet (MSA-ResNet), is proposed by integrating multi-scale perception and attention mechanisms. The proposed model not only enhances classification accuracy but also substantially reduces both missed and false detections. The main improvements are as follows:

Large convolutional kernels for expanding shallow receptive fields: In the shallow feature extraction module, 5 × 5 depthwise separable convolutions are employed to replace traditional convolution operations. This design significantly enlarges the receptive field, enabling the network to capture richer texture and edge information while effectively reducing the number of parameters and computational complexity.
Atrous convolution for enhanced multi-scale perception: In the deep feature extraction stage, atrous (dilated) convolutions are introduced using a spatial pyramid structure with multiple dilation rates (e.g., r = 1, 2, 4). This multi-scale aggregation strengthens the joint modeling of local and global morphological features, thereby improving the model’s ability to distinguish subtle differences between adjacent instar stages.
Improved self-attention for key-region modeling: An efficient self-attention module is incorporated into the modified residual blocks to achieve dynamic focusing on key feature regions. This enhancement increases the representational power and spatial modeling depth of the network while maintaining high inference efficiency.

Building upon the classical convolutional neural network framework, this study proposes a structural optimization pathway that integrates large convolutional kernels, atrous convolutions, and an improved self-attention mechanism to address issues such as limited receptive fields, insufficient multi-scale feature utilization, and weak responses to critical regions in image classification. The proposed MSA-ResNet model enables high-precision automatic recognition of larval instar stages of S. frugiperda, providing an efficient and stable intelligent tool for pest monitoring. It facilitates accurate developmental-stage identification and early warning, thereby supporting rational pesticide application, reducing chemical use and costs, mitigating environmental pollution, and promoting a shift from experience-based to data-driven decision making in pest management.

Under the accelerating development of green and precision agriculture, the findings of this study have broad application potential. The proposed framework can be applied to large-scale field monitoring systems and agricultural informatization platforms, and also serves as a transferable architectural reference for other fine-grained visual tasks, such as insect instar identification and plant disease classification.

The structure of the subsequent chapters is arranged as follows. Section 2 provides a detailed description of the data acquisition and preprocessing procedures used in this study, including data sources, larval image collection methods, preprocessing strategies, and dataset partitioning. It also elaborates on the construction of the proposed model, covering the ResNet50 backbone, large-kernel convolution module, dilated convolution, the improved self-attention mechanism, and the overall network architecture. Section 3 presents comparative experiments and ablation studies to validate the model’s performance, analyzes the contribution of each improved module, and completes the performance evaluation. Section 4 discusses the advantages of the proposed improvements, the underlying mechanisms of each module, and potential directions for future research. Section 5 summarizes the main findings and conclusions of the entire study.

2. Materials and Methods

2.1. Data Acquisition and Preprocessing

2.1.1. Data Source and Acquisition

This study adopted a “tri-modal integrated acquisition strategy” that combined field collection, laboratory-controlled rearing, and open-source data mining to systematically obtain images of S. frugiperda larvae at six developmental instars, as well as larvae of closely related species under different environmental conditions.

(1): Field Collection

Yunnan Province, located on the southwestern border of China, lies at the intersection of East Asian and Southeast Asian biogeographic and agroecosystem zones. It serves as a frontier region for combating transboundary agricultural pest invasions. Its unique geographical position and transitional subtropical–tropical climate (warm, humid, and favorable year-round) make it the primary entry point and a persistent occurrence area for S. frugiperda migrating from Southeast Asia into inland China.

In this study, the eastern (Qujing, Wenshan, and Honghe) and central (Kunming and Yuxi) regions of Yunnan Province (approximately 102°30′–105°30′ E, 22°30′–26°30′ N) were selected as field sampling sites, as shown in Figure 1. This region, a key ecological corridor for cross-border migration and colonization of S. frugiperda, is characterized by early population emergence, overlapping generations, and severe year-round infestation. Within maize production systems, in particular, pest density and damage are high, making this region a representative area for investigating population dynamics and implementing coordinated regional pest control on both national and global scales.

Field sampling was conducted during the main crop-growing seasons from May to October 2023–2024. In the field, larvae collection prioritized maize plants showing clear leaf damage, as S. frugiperda larvae are more easily located on such plants. Concurrently, efforts were made to capture early instar larvae by specifically targeting plants with semi-transparent “window-paning” or small hole feeding damage—characteristics typically caused by 1st to 2nd instar larvae. Although plants exhibiting these specific symptoms represented a smaller proportion of the total, they were intentionally included in the collection scope to ensure representation across different instar stages. Once target larvae were located in the core regions, individuals were carefully collected using sterile tweezers or soft brushes and placed into ventilated containers containing fresh host plant tissue. Samples were then transported to the laboratory for species identification and instar determination.

(2): Laboratory Rearing Data Collection

A standardized full-cycle rearing system for S. frugiperda was established at the Institute of Yunnan Biodiversity, Southwest Forestry University (Kunming, China). Through multidimensional environmental regulation and standardized feeding protocols, a large amount of high-quality larval developmental data was obtained.

Core environmental parameters were precisely controlled by programmable climate chambers: temperature maintained at 25 ± 0.5 °C and relative humidity at 70 ± 5%, simulating stable and optimal growth conditions. Larvae were reared individually in customized polypropylene boxes and fed with a standardized artificial diet (see Figure 2).

After each molting event, morphological changes were promptly recorded and photographed. Following image acquisition, the larvae were returned to their original rearing environment, and a fresh diet was periodically supplied until pupation. The day of molting was defined as the start of the next instar. Each instar lasted approximately 2–3 days, with the entire larval stage lasting around 15 days. This standardized process provided highly consistent and reliable benchmark image data for deep learning model training.

(3): Open-Source Data Collection

To enhance the diversity and scale of the dataset, open-access image resources were integrated from multiple online sources, including search engines, agricultural information portals, and scientific databases. A systematic web data acquisition workflow was developed by combining API-based retrieval and targeted web crawling, allowing parallel extraction and filtering of S. frugiperda larval images that met predefined quality and content criteria.

2.1.2. Collection of Larva Images

A total of 256 larvae were collected from the field. All these field-collected specimens had their instar stage (1st to 6th) confirmed through traditional morphological characteristics (such as body length, head capsule width, and body segment patterns). For the laboratory-reared larvae, the instar stage was accurately determined by observing molting events. During the rearing process, although some larvae were lost due to handling during photography, natural mortality, or were excluded as developmental anomalies, images were captured for all larvae, including those that were later lost. Ultimately, the number of healthy larvae that successfully pupated was 489.

A multi-posture image acquisition protocol was employed to obtain high-quality images of S. frugiperda larvae in diverse postures. The photographs were taken under natural lighting conditions using a standard smartphone equipped with an external macro lens (APEXEL, Shenzhen, China). The native image resolution was uniformly set to 3024 × 4032 pixels to ensure the clarity of fine details. To meet the input requirements of the subsequent deep learning models, the images were later resized using proportional scaling and standardized to a uniform resolution of 224 × 224 pixels, ensuring consistency in the input data format.

During image collection, larvae were placed on a standard grid background with a precision of 1 mm² per cell. Once the larvae remained still, a soft brush was gently used to guide them into standard dorsal and lateral positions for imaging. In addition, mild tactile stimulation was applied to induce natural behavioral postures such as extension and curling. Representative images of S. frugiperda larvae from the first to sixth instars are shown in Figure 3, while images of closely related lepidopteran species are presented in Figure 4.

2.1.3. Data Preprocessing and Allocation

To improve the overall quality of the dataset and mitigate the issue of class imbalance, systematic data preprocessing was conducted, mainly involving data augmentation and class balancing operations. These procedures aimed to enhance the model’s generalization ability and ensure the scientific rigor of the training, validation, and testing processes.

During the data augmentation phase, a combination of techniques—random cropping, horizontal flipping, rotation, and denoising—was applied to expand the original image set. To address class imbalance, images were selectively augmented in proportion to the original number of samples in each class, ensuring a balanced distribution across all categories.

After preprocessing, the dataset size increased from 8633 to 24,179 images, with the number of samples in each class approaching uniformity, as shown in Table 1. Finally, the augmented dataset was divided into training, validation, and test sets at a ratio of 8:1:1, ensuring balanced class representation in each subset. This allocation provided sufficient and representative samples for both learning and evaluation, thereby effectively improving the reliability and accuracy of the model.

2.2. Building the Model

2.2.1. ResNet50

The Residual Network (ResNet) [27] represents a revolutionary breakthrough in deep learning, addressing the issues of gradient vanishing and gradient explosion that often occur when training very deep networks. By introducing the concept of residual connections, ResNet enables networks to reach greater depth while maintaining efficient training and improved representational capacity.

In traditional convolutional neural networks (CNNs), increasing the number of layers often leads to training difficulties. As the depth grows, the training error may paradoxically increase, resulting in degraded model performance. To overcome this limitation, ResNet introduces the concept of a Residual Block. Within each block, the input is added directly to the output of a series of convolutional operations, forming a so-called residual signal. This mechanism effectively preserves information flow across layers, allowing gradients to propagate more smoothly to earlier layers.

A residual unit consists of two parts: the identity mapping and the residual function. The formulation can be expressed as Equations (1) and (2):

y_{l} = h (x) + F (x_{l}, W_{l})

(1)

x_{l + 1} = f (y_{l})

(2)

where

x_{l}

is the input vector to the

l

-th layer,

y_{l}

is the output vector of the

l

-th layer,

W_{l}

denotes the learnable weight parameters of the

l

-th layer,

F (x_{l}, W_{l})

represents the residual mapping to be learned,

h (\cdot)

is the identity mapping function used for matching input and output dimensions, and

f (\cdot)

is the activation function (ReLU is used in this paper). If the activation function

f (\cdot)

is an identity mapping (i.e.,

f (z) = z

), the following new Equation (3) can be derived:

x_{N} = x_{0} + \sum_{l = 0}^{N - 1} F (x_{l}, W_{l})

(3)

The core idea of ResNet lies in its skip connections, which directly add the output of one layer to the output of a previous layer, as illustrated in Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8 and Figure 9. Even when gradient vanishing occurs, information can still propagate effectively through these shortcut paths to the shallower layers. In this way, the network can be trained much deeper without suffering from severe optimization difficulties or performance degradation.

The success of ResNet has greatly facilitated the development of deep architectures, providing strong support for handling complex image recognition tasks and laying a foundation for many subsequent network innovations.

2.2.2. Large Convolution Kernel Module

To enhance the global perception capability of the model for the target regions of Spodoptera frugiperda larvae, this study proposes a structural optimization strategy based on large convolution kernels (LCKs).

In conventional convolutional neural networks, small kernels (e.g., 3 × 3) are typically used in the shallow layers. Although such kernels are effective for local feature extraction, they struggle to capture the global structural information of the target, thereby limiting the model’s understanding of large-scale textures and morphological patterns [28]. To address this issue, a 5 × 5 convolution kernel is introduced in the shallow feature extraction stage to expand the receptive field and enhance the modeling of overall morphological features. This design enables the model to more effectively capture subtle structural differences among larvae at different developmental instars by improving its contextual awareness within local regions.

However, directly employing large convolution kernels can significantly increase the number of parameters and computational complexity, especially in the early stages where the feature maps retain high spatial resolution. Moreover, current mainstream deep learning frameworks provide limited computational optimization for large kernels, further constraining their efficiency and scalability in real-world applications.

To overcome these limitations, this study incorporates the Depthwise Separable Convolution (DS Conv) [29], which decomposes a standard convolution into a depthwise convolution and a pointwise convolution. This approach substantially reduces the number of parameters and floating-point operations (FLOPs) while maintaining the expressive capacity of the network.

Building upon this idea, the proposed LCK strategy integrates the 5 × 5 convolution kernel with depthwise separable convolution, as illustrated in Figure 5. This design achieves a balanced trade-off between receptive field expansion and computational efficiency. On one hand, the larger kernel enhances the model’s ability to capture texture and edge structures, improving its sensitivity to fine morphological details of larvae. On the other hand, the use of depthwise separable convolution effectively mitigates the computational overhead introduced by the increased kernel size, alleviating performance bottlenecks during model deployment. The detailed architecture of the module is shown in Figure 6.

Figure 5. Network Structure of Depthwise Separable Convolution.

Figure 6. Network structure of 5 × 5 large convolutional kernel.

2.2.3. Atrous Convolution

To enhance the model’s ability to capture multi-scale information and improve its representation and recognition of larval instar features under complex backgrounds, atrous convolution [30] is introduced in the high-level feature extraction stage of the network. Compared with standard convolution, atrous convolution can effectively expand the receptive field without increasing the number of parameters or computational overhead, thereby enabling the model to perceive larger contextual regions and better capture global semantic structures. Specifically, the Atrous Spatial Pyramid Pooling (ASPP) [31] module is incorporated into the Conv4_x stage to extract high-level features at multiple scales.

Within the ASPP module, multiple parallel branches are established, each adopting a different dilation rate (e.g., r = 1, 2, 4) to model contextual information at different scales. Branches with smaller dilation rates (e.g., r = 1) focus more on local critical details such as head contours and segment edges, while branches with larger dilation rates (e.g., r = 4) capture broader structural information, contributing to a more comprehensive understanding of the overall morphology. This pyramid-style feature extraction mechanism enables the construction of multi-scale perceptual features on a single-feature map, effectively mitigating spatial information loss caused by repeated downsampling in deeper layers.

To further improve feature fusion efficiency and discriminative capability, the feature maps extracted from branches with different dilation rates are fused using a weighted strategy. This weighted fusion not only achieves effective complementarity of scale information at the feature level but also enhances the model’s robustness and adaptability when handling variations in larval size, camera angles, and occlusions in complex scenes. The fused multi-scale features retain rich spatial and semantic information, allowing the model to maintain high-level semantic abstraction while strengthening its fine-grained recognition ability, thereby significantly improving the classification performance across different larval instars.

In summary, the introduction of the ASPP module provides the network with stronger global context modeling capabilities and scale invariance, enabling more accurate focus on discriminative regions under diverse input conditions and enhancing the integrity and expressiveness of feature extraction. The detailed architecture of the module is illustrated in Figure 7.

Figure 7. Network architecture with integrated ASPP module.

2.2.4. Improved Self-Attention

In this study, an improved self-attention mechanism, directly applicable to the feature maps output by convolutional layers of CNNs, was employed to enhance the feature representation capability and receptive field of the model while maintaining low computational complexity [32]. Unlike conventional self-attention methods (e.g., ViT [33]), which require partitioning the image into patches for sequential modeling, this approach processes the feature maps without dividing the image, thereby preserving the original spatial structure and avoiding detail loss caused by resolution reduction. Additionally, this mechanism maintains low memory usage and computational cost, making it particularly suitable for high-resolution image analysis tasks.

The proposed improved self-attention mechanism consists of three submodules, as illustrated in Figure 8: the Channel Processing Unit (CPU), the Lightweight Self-Attention module (LSA), and the Convolutional Gated Linear Unit (Convolutional GLU) [34], as shown in Figure 9.

Figure 8. Network architecture of improved self-attention.

Figure 9. Convolutional gated linear unit.

2.2.5. Proposed Model

The proposed model, termed MSA-ResNet, integrates multi-scale perception and attention mechanisms, as depicted in Figure 10.

Built upon the ResNet50 backbone, the model systematically optimizes the original architecture to address its limitations in detail extraction and semantic representation. In the shallow layers, large convolution kernels combined with depthwise separable convolutions (LCK) are introduced to expand the receptive field, enhance the modeling of texture and structural details, and effectively reduce computational overhead. In the high-level semantic modeling stage, an Atrous Spatial Pyramid Pooling (ASPP) module is integrated to perform parallel multi-scale context modeling, improving the perception of global morphological variations. Simultaneously, the improved self-attention (ISA) module is embedded within the residual blocks to strengthen the network’s responsiveness to key regions and its discriminative representation.

MSA-ResNet significantly improves the accuracy and robustness of S. frugiperda larval stage recognition while maintaining low computational complexity, providing an efficient and reliable solution for pest monitoring tasks.

3. Results

3.1. Experimental Methods

The experiments were conducted on a computer equipped with an AMD R9-9950X processor and 192 GB of RAM, using the PyTorch (v2.1) deep learning framework. Models were trained on an NVIDIA 4090 GPU. During training, the Adam (Adaptive Moment Estimation) optimizer was employed with the cross-entropy loss function. The network was trained for 100 epochs with a batch size of 32, and the learning rate was set to 0.0001.

To evaluate the performance of the proposed model, a series of comparative experiments was performed. Firstly, the proposed approach was compared with conventional classification models using standard evaluation metrics to determine whether it outperforms existing methods. Subsequently, ablation experiments were conducted by systematically removing or modifying specific modules and different combinations of improved components, observing the impact on model performance to determine the optimal configuration.

3.2. Comparison Experiments of Different Models

To comprehensively validate the superiority of the proposed network in classification accuracy, several key performance metrics were employed, including Accuracy, Precision, Recall, and F1-Score. To further demonstrate the performance advantage, the proposed model was compared with seven classical image classification models, including ResNet18, MobileNetV3, and AlexNet, while maintaining the same learning rate and number of training epochs to ensure fair comparison. The experimental results are presented in Table 2.

The comparison results indicate that the proposed model significantly outperforms other mainstream models in classification performance, achieving an Accuracy of 96.81%, a Precision of 96.84%, a Recall of 96.81%, and an F1-Score of 96.82%.

3.3. Ablation Experiment

3.3.1. The Impact of LCK Optimization at Different Stages on the Network

To systematically evaluate the practical effect of the LCK optimization strategy at different network stages, LCK was introduced into Conv2_x, Conv3_x, and the combination Conv2_x + Conv3_x stages of ResNet50, and compared with the original model. The relevant experimental results are summarized in Table 3.

As shown in Table 3, the introduction of LCK at different network stages has a notable but varying impact on model performance and computational efficiency. Among all configurations, the Conv3_x scheme (Num 2) shows the best overall performance, achieving a Top-1 error of 6.39% and reducing the parameter count to 20.55 M, representing a 12.3% reduction relative to the baseline ResNet50. This indicates a significant reduction in computational resource consumption while maintaining high accuracy, exhibiting excellent overall performance. The Conv2_x configuration (Num 1) also improves performance, reducing the Top-1 error to 6.54% and decreasing parameters to 23.43 M, indicating moderate gains under low computational cost. The combined optimization scheme (Conv2_x + Conv3_x) yields less improvement in Top-1 error rate compared to single-stage optimization.

3.3.2. The Impact of ASPP Optimization at Different Stages on Network Performance

To investigate the optimization effect of the ASPP module at different semantic levels, ASPP was embedded into Conv4_x, Conv5_x, and their combination (Conv4_x + Conv5_x) of ResNet50. The performance in terms of classification accuracy, computational complexity, and inference efficiency was then compared, with results presented in Table 4.

The findings indicate that embedding ASPP at Conv4_x (Num 1) achieves the best balance between accuracy and model size, reducing the Top-1 error to 5.99% with a parameter count of 22.35 M. Embedding ASPP deeper at Conv5_x (Num 2) results in a higher error of 6.19%, suggesting that inserting ASPP too late may oversmooth high-level semantics and weaken discriminative details.

Additionally, the Conv4_x configuration is the most parameter-efficient, effectively controlling model size compared with the joint scheme. Although the Conv4_x + Conv5_x scheme offers certain advantages, such as a larger receptive field and richer feature representation, it also incurs higher computational cost and inference latency, limiting its deployment flexibility in practical applications.

3.3.3. The Impact of ISA on Network Performance

The integration of attention mechanisms into deep neural networks directly affects the network’s feature extraction capability and overall recognition performance. Proper integration of attention modules into the network structure not only enhances the model’s focus on salient information but also effectively suppresses redundant features, thereby improving classification accuracy. To explore the optimal attention module integration strategy, this study analyzed the insertion positions and embedding strategies of attention modules in depth.

Based on the residual structure characteristics of ResNet50, four attention module insertion strategies were designed to investigate the impact of different positions on model performance, as illustrated in Figure 11. Each strategy introduces the attention mechanism at distinct positions within the residual block, aiming to study its effect on feature representation and gradient propagation, thereby optimizing the overall network performance.

Experimental data in Table 5 reveal the influence of ISA insertion position on network performance. The Standard scheme (Num 1), in which the attention module is embedded within the residual block, achieves the best performance with a Top-1 error rate of 4.99%, validating the effectiveness of synergistic optimization between feature extraction and attention. Embedding attention in the main residual path effectively integrates local features with global context, constituting the most performant configuration. The PRE scheme (Num 2) yields a Top-1 error of 5.59%, indicating that introducing attention before convolution disrupts the intended receptive field distribution. The POST scheme (Num 3) performs the worst with 15.77%, because applying attention after residual addition severely breaks the stability of residual propagation. The Identity scheme (Num 4) gives 6.39%, inferior to the Standard configuration.

The experiments demonstrate that attention module deployment should follow the principle of residual structure integrity. Implementing attention within the main residual path (Standard scheme) maintains stable feature propagation. Therefore, in ResNet50 architecture optimization, the Standard scheme should be prioritized as the default attention embedding paradigm, while strategies such as POST, which may disrupt residual characteristics, should be avoided.

Moreover, ResNet50 is composed of multiple residual modules, and the extracted feature hierarchies differ significantly across stages. Shallow layers primarily capture edges and texture information, whereas deeper layers focus on high-level semantic features. Consequently, the embedding of attention modules at different stages may differently influence feature learning. This study examined the effect of attention mechanisms at four major stages (Conv2_x − Conv5_x) and at joint stages, analyzing their impact on feature extraction to determine the optimal embedding strategy, as presented in Table 6.

The results indicate that the insertion position of the attention mechanism significantly affects classification performance. The Conv2_x scheme achieves the best performance with a Top-1 error rate of 4.99%, while the joint-stage scheme exhibits the worst performance. In contrast, the Conv3_x, Conv4_x, and Conv5_x schemes result in increased Top-1 error rates and parameter counts. Notably, although the joint scheme incorporates attention modules across all stages, the Top-1 error rate increases significantly to 13.37%, accompanied by a substantial rise in parameter count. This indicates that excessive integration of attention mechanisms may introduce redundant information, thereby hindering effective feature learning and diminishing the model’s recognition performance.

Therefore, considering classification accuracy, computational complexity, and inference efficiency, the optimal attention embedding strategy is at the Conv2_x stage. Introducing the attention module at this level enhances recognition performance while maintaining relatively low computational cost, and avoids embedding attention at all stages simultaneously, which could lead to resource wastage and performance degradation.

3.3.4. The Impact of Different Modules on Network Performance

The classification performance differences were evaluated using multiple performance metrics, and the detailed experimental results are summarized in Table 7. The results of seven experimental configurations indicate that different module combinations systematically affect the overall model performance.

The ISA-only model (Num 3) achieves 95.01% accuracy, outperforming both LCK-only (93.61%) and ASPP-only (94.01%). Combining LCK + ISA (Num 5) yields 94.21% accuracy with only 23.55 M parameters—close to ISA-only performance but with reduced model size. The full combination (LCK + ASPP + ISA, Num 7) achieves the highest accuracy (96.81%). Notably, adding ASPP to LCK (Num 4: 93.02%) slightly reduces accuracy compared to LCK alone, suggesting potential incompatibility or diminishing returns when combining certain modules. These results highlight that module synergy is non-trivial: while ISA consistently improves performance, its combination with other components must be carefully calibrated to avoid inefficiency or interference.

3.4. Model Performance Evaluation

To comprehensively evaluate the classification performance of the proposed model on the S. frugiperda larvae dataset, both overall accuracy and per-class performance were analyzed. The experimental results for each category are presented in Figure 12.

The results indicate that MSA-ResNet improves the precision for first-instar larvae by 12.94% and the recall for second- and fourth-instar larvae by 16% and 8.97%, respectively.

4. Discussion

4.1. Comparison of Improved Results

ResNet50 exhibits high misclassification rates for first-instar larvae (precision of 85.71%) and misses for second- and fourth-instar larvae (recall around 80%) in the identification of S. frugiperda larvae. To address these limitations, this study proposes a series of structural optimization strategies, including convolutional kernel adjustments and the introduction of attention mechanisms. Experimental results demonstrate that the optimized model achieves a significant overall performance improvement: the accuracy reaches 96.81% (an increase of 4.2%), and the F1 score improves by 5.56%. Compared with mainstream models such as VGG16 and MobileNetV3, the improved ResNet50 exhibits the best recognition performance, providing a reliable technical solution for real-time field identification.

4.2. Analysis of Improved Modules

Convolutional structure adjustments, such as enlarged receptive fields, enhance the model’s ability to capture global morphological features of larvae, significantly improving the discrimination between easily confused instars (e.g., second- and third-instar larvae) and thereby boosting the recall rates for second- and fourth-instar larvae. The introduction of attention mechanisms focuses the model on critical local features, strengthening the recognition of subtle morphological cues in first-instar larvae and substantially reducing misclassification rates.

4.3. Future Work

Despite these improvements, several limitations remain:

Background Generalization: The dataset primarily consists of laboratory and single-background images. Additional images from complex field environments are required to enhance the model’s adaptability in natural settings.
Lightweight Deployment: To facilitate deployment on mobile devices for field applications, further model compression is necessary, and lightweight techniques such as knowledge distillation and quantization should be explored.

5. Conclusions

In this study, a comprehensive image dataset covering six instars of S. frugiperda larvae (24,179 images) was constructed. A novel network model integrating multi-scale perception and attention mechanisms was proposed. Based on ResNet50, the model incorporates large convolutional kernels, Atrous Spatial Pyramid Pooling (ASPP) modules, and an improved self-attention mechanism to enhance the capture and discrimination of larval morphological details. The optimized model achieves an overall accuracy of 96.81%, with the precision for first-instar larvae improved by 12.94%, and the recall for second- and fourth-instar larvae increased by 16% and 8.97%, respectively, effectively addressing critical bottlenecks in instar recognition.

This study represents the first application of deep learning to the instar identification of S. frugiperda larvae. The proposed optimization strategies significantly enhance the model’s ability to discern fine morphological features, providing a high-precision, efficient solution for intelligent field pest monitoring. Furthermore, this work establishes a technical paradigm for instar identification in closely related species.

Author Contributions

Conceptualization, Q.X. and Y.L. (Ying Lu); methodology, M.W. and Q.X.; validation, Q.X.; formal analysis, Q.X. and M.W.; investigation, Y.L. (Ying Lu), D.F. and M.W.; resources, H.Y. and Y.L. (Yonghe Li); data curation, Q.X., M.W. and D.F.; writing—original draft preparation, M.W.; writing—review and editing, Q.X., M.W., Y.L. (Yonghe Li), H.Y. and D.F.; supervision, Y.L. (Yonghe Li) and Q.X. All authors have read and agreed to the published version of the manuscript.

Funding

Key Science and Technology Plan Project of Yunnan Provincial Department of Science and Technology, China—Project Number: 202001BB050002; Project Name: Research on Cross-border Invasion, Migration Patterns, and Ecological Control of Spodoptera frugiperda in Grasslands.

Data Availability Statement

The dataset used and/or analyzed during the current study is available from the corresponding author upon reasonable request.

Acknowledgments

Thanks to all the authors cited in this article and the referee for their helpful comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Goergen, G.; Kumar, P.L.; Sankung, S.B.; Togola, A.; Tamò, M. First Report of Outbreaks of the Fall Armyworm Spodoptera frugiperda (J E Smith) (Lepidoptera, Noctuidae), a New Alien Invasive Pest in West and Central Africa. PLoS ONE 2016, 11, e0165632. [Google Scholar] [CrossRef] [PubMed]
Early, R.; González-Moreno, P.; Murphy, S.T.; Day, R. Forecasting the Global Extent of Invasion of the Cereal Pest Spodoptera frugiperda, the Fall Armyworm. NeoBiota 2018, 40, 25–50. [Google Scholar] [CrossRef]
Li, X.-J.; Wu, M.-F.; Ma, J.; Gao, B.-Y.; Wu, Q.-L.; Chen, A.-D.; Liu, J.; Jiang, Y.-Y.; Zhai, B.-P.; Early, R.; et al. Prediction of Migratory Routes of the Invasive Fall Armyworm in Eastern China Using a Trajectory Analytical Approach. Pest Manag. Sci. 2020, 76, 454–463. [Google Scholar] [CrossRef]
Mendesil, E.; Tefera, T.; Blanco, C.A.; Paula-Moraes, S.V.; Huang, F.; Viteri, D.M.; Hutchison, W.D. The Invasive Fall Armyworm, Spodoptera frugiperda, in Africa and Asia: Responding to the Food Security Challenge, with Priorities for Integrated Pest Management Research. J. Plant Dis. Prot. 2023, 130, 1175–1206. [Google Scholar] [CrossRef]
Wu, P.; Wu, F.; Fan, J.; Zhang, R. Potential Economic Impact of Invasive Fall Armyworm on Mainly Affected Crops in China. J. Pest Sci. 2021, 94, 1065–1073. [Google Scholar] [CrossRef]
De Groote, H.; Kimenju, S.C.; Munyua, B.; Palmas, S.; Kassie, M.; Bruce, A. Spread and Impact of Fall Armyworm (Spodoptera frugiperda J.E. Smith) in Maize Production Areas of Kenya. Agric. Ecosyst. Environ. 2020, 292, 106804. [Google Scholar] [CrossRef]
Ren, Q.; Haseeb, M.; Fan, J.; Wu, P.; Tian, T.; Zhang, R. Functional Response and Intraspecific Competition in the Fall Armyworm, Spodoptera frugiperda (Lepidoptera: Noctuidae). Insects 2020, 11, 806. [Google Scholar] [CrossRef] [PubMed]
Gao, J.; Gong, L.-F.; Wang, H.-H.; Zhao, R.; Xiao, X.; Tian, X.-Y.; Li, B.; Liang, P.; Gao, X.-W.; Gu, S.-H. Expression and Functional Analysis of Ace1 and Ace2 Reveal Their Differential Roles in Larval Growth and Insecticide Sensitivity in Spodoptera frugiperda (J. E. Smith, 1797). J. Pest Sci. 2023, 96, 1651–1666. [Google Scholar] [CrossRef]
Chen, J.; Cao, L.; Ma, Z.; Yuan, X.; Gong, Y.; Shen, X.; Wei, S. Susceptibility of Different Instar Larvae of Spodoptera frugiperda to Commonly Used Insecticides. Guangdong Agric. Sci. 2022, 49, 81–86. [Google Scholar] [CrossRef]
Fan, Z.; Kong, W.; Ran, X.; Lv, X.; Ma, C.; Yan, H. Biological and Physiological Changes in Spodoptera frugiperda Larvae Induced by Non-Consumptive Effects of the Predator Harmonia Axyridis. Agriculture 2024, 14, 1566. [Google Scholar] [CrossRef]
Li, H.; Jiang, S.; Zhang, H.; Geng, T.; Wyckhuys, K.A.G.; Wu, K. Two-Way Predation between Immature Stages of the Hoverfly Eupeodes Corollae and the Invasive Fall Armyworm (Spodoptera frugiperda J. E. Smith). J. Integr. Agric. 2021, 20, 829–839. [Google Scholar] [CrossRef]
Acharya, R.; Hwang, H.-S.; Mostafiz, M.M.; Yu, Y.-S.; Lee, K.-Y. Susceptibility of Various Developmental Stages of the Fall Armyworm, Spodoptera frugiperda, to Entomopathogenic Nematodes. Insects 2020, 11, 868. [Google Scholar] [CrossRef]
Sartiami, D.; Dadang; Harahap, I.; Kusumah, Y.; Anwar, R. First Record of Fall Armyworm (Spodoptera frugiperda) in Indonesia and Its Occurence in Three Provinces. IOP Conf. Ser. Earth Environ. Sci. 2020, 468, 012021. [Google Scholar] [CrossRef]
Sharanabasappa, S.D.; Kalleshwaraswamy, C.M.; Maruthi, M.S.; Pavithra, H.B. Biology of Invasive Fall Army Worm Spodoptera frugiperda (J.E. Smith) (Lepidoptera: Noctuidae) on Maize. Indian J. Entomol. 2018, 80, 540–543. [Google Scholar] [CrossRef]
Jing, D.-P.; Guo, J.-F.; Jiang, Y.-Y.; Zhao, J.-Z.; Sethi, A.; He, K.-L.; Wang, Z.-Y. Initial Detections and Spread of Invasive Spodoptera frugiperda in China and Comparisons with Other Noctuid Larvae in Cornfields Using Molecular Techniques. Insect Sci. 2020, 27, 780–790. [Google Scholar] [CrossRef]
Togola, A.; Beyene, Y.; Bocco, R.; Tepa-Yotto, G.; Gowda, M.; Too, A.; Boddupalli, P. Fall Armyworm (Spodoptera frugiperda) in Africa: Insights into Biology, Ecology and Impact on Staple Crops, Food Systems and Management Approaches. Front. Agron. 2025, 7, 1538198. [Google Scholar] [CrossRef]
Xu, W.; Li, W.; Wang, L.; Pompelli, M.F. Enhancing Corn Pest and Disease Recognition through Deep Learning: A Comprehensive Analysis. Agronomy 2023, 13, 2242. [Google Scholar] [CrossRef]
Dong, Q.; Sun, L.; Han, T.; Cai, M.; Gao, C. PestLite: A Novel YOLO-Based Deep Learning Technique for Crop Pest Detection. Agriculture 2024, 14, 228. [Google Scholar] [CrossRef]
Aphid Cluster Recognition and Detection in the Wild Using Deep Learning Models|Scientific Reports. Available online: https://www.nature.com/articles/s41598-023-38633-5 (accessed on 1 September 2025).
Zhu, L.-Q.; Ma, M.-Y.; Zhang, Z.; Zhang, P.-Y.; Wu, W.; Wang, D.-D.; Zhang, D.-X.; Wang, X.; Wang, H.-Y. Hybrid Deep Learning for Automated Lepidopteran Insect Image Classification. Orient. Insects 2017, 51, 79–91. [Google Scholar] [CrossRef]
Kortbeek, R.W.J.; Galland, M.D.; Muras, A.; van der Kloet, F.M.; André, B.; Heilijgers, M.; van Hijum, S.A.F.T.; Haring, M.A.; Schuurink, R.C.; Bleeker, P.M. Natural Variation in Wild Tomato Trichomes; Selecting Metabolites That Contribute to Insect Resistance Using a Random Forest Approach. BMC Plant Biol. 2021, 21, 315. [Google Scholar] [CrossRef] [PubMed]
Kasinathan, T.; Singaraju, D.; Uyyala, S.R. Insect Classification and Detection in Field Crops Using Modern Machine Learning Techniques. Inf. Process. Agric. 2021, 8, 446–457. [Google Scholar] [CrossRef]
Chiranjeevi, S.; Saadati, M.; Deng, Z.K.; Koushik, J.; Jubery, T.Z.; Mueller, D.S.; O’Neal, M.; Merchant, N.; Singh, A.; Singh, A.K.; et al. InsectNet: Real-Time Identification of Insects Using an End-to-End Machine Learning Pipeline. PNAS Nexus 2025, 4, 575. [Google Scholar] [CrossRef]
Zhang, Y.; Chen, L.; Yuan, Y. Multimodal Fine-Grained Transformer Model for Pest Recognition. Electronics 2023, 12, 2620. [Google Scholar] [CrossRef]
Dharmasastha, K.N.S.; Banu, K.S.; Kalaichevlan, G.; Lincy, B.; Tripathy, B.K. Classification of Pest in Tomato Plants Using CNN. In Proceedings of the Meta Heuristic Techniques in Software Engineering and Its Applications; Mohanty, M.N., Das, S., Ray, M., Patra, B., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 56–64. [Google Scholar]
An, J.; Du, Y.; Hong, P.; Zhang, L.; Weng, X. Insect Recognition Based on Complementary Features from Multiple Views. Sci. Rep. 2023, 13, 2966. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Ding, X.; Zhang, X.; Zhou, Y.; Han, J.; Ding, G.; Sun, J. Scaling Up Your Kernels to 31 × 31: Revisiting Large Kernel Design in CNNs. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arxiv 2017, arXiv:1704.04861. [Google Scholar]
Yu, F.; Koltun, V.; Funkhouser, T. Dilated Residual Networks. In Proceedings of the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
Wang, M.; Lu, Y.; Xu, Q. Improved Self-Attention for Spodoptera frugiperda Larval Instar Stages Identification. In Proceedings of the 2024 7th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Hangzhou, China, 15–17 August 2024; pp. 235–239. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Shi, D. Transnext: Robust Foveal Visual Perception for Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17773–17783. [Google Scholar]
Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V. Searching for Mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 1314–1324. [Google Scholar]
Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition 2015. arXiv 2014, arXiv:1409.1556. [Google Scholar]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]

Figure 1. Location of the study area.

Figure 2. Rearing containers for S. frugiperda larvae.

Figure 3. Images of 1st–6th instar larvae of S. frugiperda.

Figure 4. Trap other insects. (a). Spodoptera litura; (b). Spodoptera exigua; (c). Ostrinia furnacalis; (d). Leucania loreyi.

Figure 10. Network Structure of MSA-ResNet.

Figure 11. Attention module embedding location schematic: (a) Residual Block; (b) Standard; (c) PRE; (d) POST; (e) Identity.

Figure 12. The effects of ResNet and MSA-ResNet models in Multi-classification: (a) Accuracy, (b) Precision, (c) Recall, (d) F1-Score.

Table 1. Number of pictures of S. frugiperda.

Class	Class Name	Before Data Preprocessing	After Data Preprocessing
1	1th	1122	3622
2	2th	1126	3654
3	3th	1467	3774
4	4th	1812	3812
5	5th	1294	3594
6	6th	949	2949
7	others	863	2810

Table 2. Comparison experiments of different models.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1-Score (%)
ResNet18 [27]	90.62	90.82	90.25	90.15
ResNet50 [27]	92.61	93.11	92.23	92.42
ResNet101 [27]	92.81	93.57	92.56	92.81
MobilenetV3 [35]	77.45	79.49	77.28	77.20
VGG16 [36]	85.83	86.07	85.58	85.26
ViT-B/16 [33]	88.02	87.54	87.54	87.75
AlexNet [37]	75.05	76.04	74.69	74.32
Our Model	96.81	96.84	96.81	96.82

Table 3. The impact of large kernel optimization at different stages on network. “✓” in ResNet50 denotes the backbone; in Stage, it indicates large kernel optimization applied.

Num	ResNet50	Stage		Top-1 Err (%)	Params (M)
Num	ResNet50	Conv2_x	Conv3_x	Top-1 Err (%)	Params (M)
1	✓	✓		6.54	23.43
2	✓		✓	6.39	20.55
3	✓	✓	✓	6.58	22.38

Table 4. The impact of ASPP Modules at Different Stages on network performance. “✓” in ResNet50 denotes the backbone; in Stage, it indicates ASPP optimization applied.

Num	ResNet50	Stage		Top-1 Err (%)	Params (M)
Num	ResNet50	Conv4_x	Conv5_x	Top-1 Err (%)	Params (M)
1	✓	✓		5.99	22.35
2	✓		✓	6.19	22.55
3	✓	✓	✓	6.07	22.38

Table 5. The impact of attention module placement strategies on ResNet50 recognition. “✓” in ResNet50 denotes the backbone; in Location denotes the strategy employed.

Num	ResNet50	Location				Top-1 Err (%)
Num	ResNet50	Standard	PRE	POST	Identity	Top-1 Err (%)
1	✓	✓				4.99
2	✓		✓			5.59
3	✓			✓		15.77
4	✓				✓	6.39

Table 6. Experimental Results of Attention Module Integration at Different Stages. “✓” in ResNet50 denotes the backbone; in Stage denotes the stage of Attention Module Integration.

Num	ResNet50	Stage				Top-1 Err (%)	Params (M)
Num	ResNet50	Conv2_x	Conv3_x	Conv4_x	Conv5_x	Top-1 Err (%)	Params (M)
1	✓	✓				5.23	23.87
2	✓		✓			5.39	25.37
3	✓			✓		6.19	34.57
4	✓				✓	6.79	45.58
5	✓	✓	✓	✓	✓	13.37	58.84

Table 7. Performance Comparison of ResNet50 with Different Modules (LCK, ASPP, ISA). The “✓” marks indicate the use of ResNet50 as the backbone and the inclusion of the specified modules (LCK, ASPP, ISA).

Num	ResNet50	Module			Accuracy (%)	Params (M)
Num	ResNet50	LCK	ASPP	ISA	Accuracy (%)	Params (M)
1	✓	✓			93.61	20.56
2	✓		✓		94.01	22.55
3	✓			✓	95.01	23.87
4	✓	✓	✓		93.02	30.26
5	✓	✓		✓	94.21	23.55
6	✓		✓	✓	93.81	30.47
7	✓	✓	✓	✓	96.81	30.38

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xu, Q.; Wang, M.; Lu, Y.; Feng, D.; Ye, H.; Li, Y. MSA-ResNet: A Neural Network for Fine-Grained Instar Identification of Spodoptera frugiperda Larvae in Smart Agriculture. Agronomy 2025, 15, 2724. https://doi.org/10.3390/agronomy15122724

AMA Style

Xu Q, Wang M, Lu Y, Feng D, Ye H, Li Y. MSA-ResNet: A Neural Network for Fine-Grained Instar Identification of Spodoptera frugiperda Larvae in Smart Agriculture. Agronomy. 2025; 15(12):2724. https://doi.org/10.3390/agronomy15122724

Chicago/Turabian Style

Xu, Quanyuan, Mingyang Wang, Ying Lu, Dan Feng, Hui Ye, and Yonghe Li. 2025. "MSA-ResNet: A Neural Network for Fine-Grained Instar Identification of Spodoptera frugiperda Larvae in Smart Agriculture" Agronomy 15, no. 12: 2724. https://doi.org/10.3390/agronomy15122724

APA Style

Xu, Q., Wang, M., Lu, Y., Feng, D., Ye, H., & Li, Y. (2025). MSA-ResNet: A Neural Network for Fine-Grained Instar Identification of Spodoptera frugiperda Larvae in Smart Agriculture. Agronomy, 15(12), 2724. https://doi.org/10.3390/agronomy15122724

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

MSA-ResNet: A Neural Network for Fine-Grained Instar Identification of Spodoptera frugiperda Larvae in Smart Agriculture

Abstract

1. Introduction

2. Materials and Methods

2.1. Data Acquisition and Preprocessing

2.1.1. Data Source and Acquisition

2.1.2. Collection of Larva Images

2.1.3. Data Preprocessing and Allocation

2.2. Building the Model

2.2.1. ResNet50

2.2.2. Large Convolution Kernel Module

2.2.3. Atrous Convolution

2.2.4. Improved Self-Attention

2.2.5. Proposed Model

3. Results

3.1. Experimental Methods

3.2. Comparison Experiments of Different Models

3.3. Ablation Experiment

3.3.1. The Impact of LCK Optimization at Different Stages on the Network

3.3.2. The Impact of ASPP Optimization at Different Stages on Network Performance

3.3.3. The Impact of ISA on Network Performance

3.3.4. The Impact of Different Modules on Network Performance

3.4. Model Performance Evaluation

4. Discussion

4.1. Comparison of Improved Results

4.2. Analysis of Improved Modules

4.3. Future Work

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI