Autonomous Detection of Spodoptera frugiperda by Feeding Symptoms Directly from UAV RGB Imagery

The use of digital technologies to detect, locate, and quantify pests quickly and accurately is very important in precision agriculture. Image acquisition with airborne drones, combined with deep learning, is a new and viable alternative to human labor such as visual interpretation, which consumes considerable time and effort. In this study, we developed a method based on a convolutional neural network for automatically detecting an important maize pest, Spodoptera frugiperda, by the gnawing holes it leaves on maize leaves. We validated the split-attention mechanism in the classical network structure ResNet50, which improves accuracy and robustness, and verified the feasibility of using two kinds of gnawing holes as features for identifying Spodoptera frugiperda invasion and its degree. To verify the robustness of this detection method against plant morphological changes, images at the jointing stage and heading stage were used for training and testing, respectively. The models trained with the jointing stage images (ResNeSt50, ResNet50, EfficientNet, and RegNet) achieved validation accuracies of 98.77%, 97.59%, 97.89%, and 98.07%, with heading stage test accuracies of 89.39%, 81.88%, 86.21%, and 84.21%, respectively.


Introduction
Spodoptera frugiperda, originating from the American continent, has invaded Europe, Asia, and Africa [1]. As a migratory pest, Spodoptera frugiperda has a strong survival ability and a rapid reproduction rate, colonizing the above continents in a short time and causing great damage to corn, rice, and other main food crops [2][3][4][5].
At present, the standard approach to controlling this pest relies on pesticides and involves (a) detecting the occurrence and status of pests by field sampling investigation, which relies on agronomists or trained surveyors [3], and (b) spraying pesticides evenly in the corresponding area [6,7]. Indiscriminate spraying is simple and easy, but obtaining the field information is time consuming and laborious, and it depends on the subjective judgment of surveyors. Uniform spraying also causes pesticide waste and environmental pollution [8,9]. In this context, there is an urgent need for a low-cost, high-efficiency, and high-precision method to quickly and effectively obtain field information, including the occurrence location, extent, and overall distribution of insect pests [10].
There have been several research studies focusing on the identification of pests and diseases affecting plant leaves. Most of the image data come from ground-based sensors such as mobile phones and digital cameras [11][12][13], and a small part is collected by unmanned aerial vehicles (UAVs), a remote sensing (RS) technology [14][15][16]. RS has been frequently adopted as a rapid, non-destructive, and cost-effective means for [...]. Compared to satellite remote sensing and aerial remote sensing, UAVs have great advantages in terms of cost, operation, portability, etc. [19], and they have been widely used in crop classification, growth monitoring, yield estimation, and other tasks, especially for large fields [20].
On the other hand, deep learning, which originates from machine learning, has gradually gained popularity because of its ability to automatically extract representative features from a large number of input images [21,22]. Konstantinos et al. developed CNN models to perform plant disease detection and diagnosis using simple images of healthy and diseased leaves [23]. Chen et al. used the UNet-based BLSNet to automatically identify and segment the diseased regions of rice bacterial leaf streak in camera photos [24]. The emergence of the attention mechanism has further improved network performance [22,25].
The following are studies based on UAV imagery combined with machine learning or deep learning: Tetila et al. detected soybean foliar diseases subjected to biological stress based on the simple linear iterative clustering segmentation method through foliar physical properties using RGB imagery captured by the low-cost unmanned aerial vehicle model DJI Phantom 3 [26]. Harvey et al. used an unmanned aerial vehicle (UAV) to acquire high-resolution images in the field, and they built an automated, high-throughput system based on a convolutional neural network (CNN) for the detection of northern leaf blight of maize plants [27]. Jin et al. proposed a computerized system based on CNN to process images captured by UAVs at low altitudes, which can detect Fusarium wilt of radish with high accuracy [28]. Ryo et al. used CNN to implement a detection method of virus-infected plants in a potato seed production field, with UAV RGB images being captured at an altitude of 5-10 m from the ground [7].
Compared with diseases, insect pests are more mobile. There are two primary approaches to insect identification [29]: (i) direct, focusing on the insects themselves, and (ii) indirect, focusing on the damage caused by the insects [6]. For example, Liu et al. used a field insect light trap to obtain images and combined a CNN with an attention mechanism to construct a direct classification model for insect identification [30]. Zhang focused on the significant change in the plant's leaf area index caused by Spodoptera frugiperda to indirectly monitor the infestation [31]. Capturing pest images at close range with a camera is also widely used; for example, Li et al. integrated a convolutional neural network (CNN) and non-maximum suppression to locate and count aphids in rice images obtained by a close-view camera, achieving 0.93 accuracy and 0.885 mAP by optimizing key parameters and the feature extraction network [13].
The above methods may lack accuracy or cannot be applied in practice over large areas. Thanks to the development of UAV technology, pest identification based on UAV images is worth further research [32]. Ana et al. mounted RGB cameras on small drones to obtain images of vineyard plants and applied geometric and computer vision techniques, combined with landform factors, to quantitatively analyze the influence of pests on the vineyard; this provides accurate, low-cost information for digital farm management and helps to implement and improve farm management and decision-making processes [33]. Farian et al. also used corn leaves damaged by Spodoptera frugiperda and applied VGG16 and InceptionV3 to detect infected corn leaves captured by UAV remote sensing, while using a corner detection method from computer vision to strengthen the feature representation and improve the detection accuracy [34].
This paper presents a CNN-based deep learning system for the automatic detection of maize leaves infected by Spodoptera frugiperda from RGB UAV remote sensing images at high spatial resolution. UAV remote sensing images have excellent potential for agricultural data acquisition, while deep learning has similar potential for agricultural data processing. By combining the two, this study builds a multi-stage pest detection and classification model, based on ResNeSt, suited to the characteristics of actual maize production environments. The work makes the following contributions: (1) corn images were collected under actual field production conditions for the automatic detection of leaves infected by Spodoptera frugiperda; (2) the pest stage of the infected leaves is determined accurately and quickly according to the feeding characteristics of the pest, providing a reliable reference for the formulation and implementation of prevention and control measures; (3) the potential and generalization ability of indirect pest detection based on UAV remote sensing images are verified. This provides a reference for the automated detection of pest invasion status in the field. The remainder of this paper is organized as follows: In Section 2, we describe the study area, data collection, and methods. Section 3 presents results, and Section 4 provides a discussion. Finally, Section 5 summarizes this work and highlights future work.

Study Area
The UAV RGB imagery of the maize pest Spodoptera frugiperda was captured at the Experimental Station (117.552616, 34.309942) of the China University of Mining and Technology, Xuzhou city, Jiangsu Province, China. Images of the site are shown in Figure 1. The experimental site was invaded by Spodoptera frugiperda because its maize planting cycle was later than that of the surrounding fields. When we started the data collection, the larvae in the field were in a transition phase from low to medium age.
Appl. Sci. 2022, 12, x FOR PEER REVIEW
Figure 1. (a) The study area located in Xuzhou and the experimental field; (b) images of the experimental field obtained on 8, 9, and 24 September 2020 using a low-altitude UAV equipped with an RGB sensor. Each of the colors represents a different experiment date. Each drone image has its own coordinate.

UAV Image Collection
The image acquisition device was a DJI Mavic Air 2 equipped with a half-inch CMOS sensor with 48 million effective pixels. It is an ultra-small drone, weighing just 570 g, capable of capturing high-resolution standard red-green-blue (RGB) photos of corn at ultra-low altitudes.
The data were collected three times in September 2020 during the critical growth period of maize from the jointing stage to the heading stage, and the specific time and resolution are shown in Table 2. Images were collected once in the jointing stage and twice (the second and third collections) in the heading stage. Between these two stages, maize grows rapidly and its morphology changes greatly; in particular, the appearance of stamens changes the overall morphology of the plant to a great extent, which makes the later images suitable for testing the generalization of the model. This is why we set the interval between collections. The specific differences between the two stages are shown in Figure 2 (Part A), and the maize leaf categories are shown in Figure 2 (Part B).

Figure 2. (Part A) The picture on the left shows corn in the jointing stage, and the picture on the right shows corn in the heading stage; (Part B) the red box shows severely infected corn leaves, the blue box shows slightly infected corn leaves, and the purple boxes show corn leaves in healthy condition.

The flight speed was controlled at 1.5 m/s and the flight altitude was from 2 to 5 m above the ground, which was very close to the corn canopy. Moreover, the shooting angle was 90°, with the flight path along the ridges of the field, for more efficient capture of corn canopy information. The specific information of the images is shown in Table 2, including the date, number, and image resolution.

Image Preprocessing
The data processing included two main steps: cropping and classing.
Cropping: Due to the size of the images, we cropped them from the sizes in Table 2 to 200 × 200 to speed up training and to reduce the pressure on graphics memory. We used the OpenCV-Python tool to read and crop images in batches. For visual effect, Figure 3 shows the conversion process of an image from 1 × 3000 × 4000 to 25 [...].
Classing: After cropping, the main body of each image is basically composed of corn leaves, and part of the images show only land and weeds. By combining the edge detection tool in the OpenCV-Python library with RGB channel calculation, the images containing only land and weeds were removed. According to the habits of Spodoptera frugiperda and its specific representation in the images, we divided the rest of the images into 3 categories by visual interpretation, as shown in Figure 4. For pest control, the earlier the intervention, the lower the losses and the less pesticide used, so the first symptom, the translucent silver windowpane, is the most important object. However, most of the leaves in the images were healthy. In order to balance the number of positive and negative samples, we selected the number of healthy leaves in condition 1 according to the number of infected leaves in conditions 2 and 3, and we removed the leaves in condition 4 at the same time.
After the above processing, we finally obtained more than 5000 maize images in the jointing stage, including 2043 healthy leaves, 1866 condition 1 images, and 1430 condition [...]. To test the robustness of the model's detection ability, 1545 images in the heading stage were used, including 532 in condition 1 and 417 in condition 2. These images did not participate in training and validation at all and were used as an independent test set after training.
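The cropping step described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `tile_boxes` is hypothetical, and dropping edge remainders smaller than a tile is an assumption, since the text does not say how partial tiles were handled.

```python
def tile_boxes(height, width, tile=200):
    """Return (top, left, bottom, right) boxes for non-overlapping square tiles.

    Edge remainders smaller than `tile` are dropped. This is an assumption;
    the source does not describe how partial tiles were treated.
    """
    boxes = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            boxes.append((top, left, top + tile, left + tile))
    return boxes


# A 3000 x 4000 image yields 15 rows x 20 columns of 200 x 200 tiles.
boxes = tile_boxes(3000, 4000)
print(len(boxes))
```

With OpenCV-Python, an image loaded by `cv2.imread` is an array indexed as `[row, column]`, so each tile could be taken as `img[top:bottom, left:right]`.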

Augmentation
Training a deep learning model requires a very large amount of data, so we used data augmentation to enlarge the data set [35]. Data augmentation can reduce over-fitting by flipping, mirroring, and contrast transformation without changing the category of an image [36]. In classification tasks, geometric transformation, color space enhancement, random erasing, and feature space enhancement change the appearance of an image without changing its category, which improves the quantity and quality of the data, narrows the gap between the training data and the test data, and reduces overfitting during model training [37,38].
For example, a contrast change alters the brightness of the image to a certain extent and improves the robustness of the model to illumination changes. In this study, we used a variety of image augmentation methods, as shown in Figure 5.
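To make the label-preserving transformations concrete, the sketch below applies flipping, mirroring, and a contrast change to a tiny grayscale tile stored as nested lists. The function names and the mean-centered contrast formula are illustrative assumptions; the exact operations and parameters used in the study are those summarized in Figure 5.

```python
def flip_horizontal(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the row order (top to bottom)."""
    return [row[:] for row in img[::-1]]


def adjust_contrast(img, factor):
    """Scale pixel deviations from the mean intensity, clipped to 0..255."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return [
        [min(255, max(0, round(mean + factor * (p - mean)))) for p in row]
        for row in img
    ]


tile = [[10, 20, 30],
        [40, 50, 60]]

# Each variant keeps the class label of the original tile.
augmented = [
    flip_horizontal(tile),
    flip_vertical(tile),
    adjust_contrast(tile, 2.0),
]
```

A factor above 1 stretches the intensity range and a factor below 1 compresses it, which mimics stronger or weaker illumination contrast in the field.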

Convolutional Neural Network
After a long period of development since AlexNet [39], convolutional neural networks, composed of convolution layers, pooling layers, and fully connected layers, have evolved into a series of models that can automate the extraction of features through training iterations [40,41]. ResNet [42], through the application of residual blocks, alleviated the problems of network degradation and vanishing gradients that arise as the number of network layers increases, making a lasting contribution to the progress of deep learning with the Deep Convolutional Neural Network (DCNN). A DCNN can automatically extract image features with convolution kernels of different sizes to obtain higher classification accuracy, and it has become the most common identification method [43].
Based on ResNet, the integration of different methods has led to the development of various network structures, such as grouped convolution [44], the self-attention mechanism [45], and the selective attention mechanism [46]. Therefore, in this study, the feasibility of the feeding symptoms method for Spodoptera frugiperda on maize was verified by using several kinds of ResNet-related networks, including ResNet, ResNeSt [47], SE-Net, and SK-Net. Although the residual structure has been widely applied thanks to its simple structure and convenient modular design, its performance is not satisfactory in downstream applications, where it is affected by factors such as receptive field size and channel interaction. Recently, the successful application of channel and attention mechanisms has introduced new possibilities for its improvement. ResNeXt first introduced the idea of grouped convolution. SE-Net introduces a channel-attention mechanism for feature construction by adaptively recalibrating the channel feature response. SK-Net extracts the channel information from the feature map through the construction of grouped channels. Following the idea of taking the channel as the operation unit and dividing the input data into more fine-grained weighted subgroups or subchannels based on the global context, it is possible to build a channel-based split-attention structure. During training, each subgroup can perform different mapping abstractions on its own part of the input channel data so as to build different feature representations. In the model, this module is named a split-attention block. Thanks to its simple and modular structure, the split-attention block can be reused and stacked multiple times to construct a general structure similar to the residual model.
In short, the block replaces the original residual part with attention operations that take channels as units and thus assigns the corresponding weights.
At the same time, for comparison, other classic network models such as EfficientNet [48] and RegNet [49] were also selected. The split-attention network used in this study followed the architecture of ResNet50, with the original block replaced by a split-attention block, as shown in Figure 6.
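The weighting step at the core of a split-attention block can be illustrated with plain Python. This is a simplified sketch under stated assumptions, not the ResNeSt implementation: the `logits` argument stands in for the output of the learned dense layers applied to the pooled global context, which are omitted here, and each channel is stored as a flat list of pixel values.

```python
import math


def split_attention(splits, logits):
    """Fuse R feature-map splits with per-channel softmax weights.

    splits[i][c] is the flattened channel c of split i.
    logits[i][c] stands in for the output of the learned dense layers,
    which are omitted in this sketch (assumption for illustration).
    """
    n_splits, n_channels = len(splits), len(splits[0])
    fused = []
    for c in range(n_channels):
        # Softmax across the splits for this channel.
        peak = max(logits[i][c] for i in range(n_splits))
        exps = [math.exp(logits[i][c] - peak) for i in range(n_splits)]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the corresponding channel in every split.
        n_pixels = len(splits[0][c])
        fused.append([
            sum(weights[i] * splits[i][c][p] for i in range(n_splits))
            for p in range(n_pixels)
        ])
    return fused


splits = [
    [[1.0, 2.0], [0.0, 4.0]],  # split 0: two channels, two pixels each
    [[3.0, 6.0], [8.0, 0.0]],  # split 1
]
logits = [[0.0, 0.0], [0.0, 0.0]]  # equal logits give each split weight 0.5
fused = split_attention(splits, logits)
```

Because the softmax is taken across splits separately for every channel, each channel can favor a different split, which is what lets the subgroups contribute different feature representations.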

Transfer Learning
Transfer learning transplants the weights obtained by pre-training on large data sets to the target network. Fine-tuning based on these weights can accelerate network training and reduce the amount of data required for training [50,51]. In this paper, the ImageNet data set [52] was used as the source data to pre-train the model.

Experimental Setup
In this experiment, we used PyTorch 1.4 as the framework, which is an open-source deep learning package based on the Python programming language. The selected optimizer was Stochastic Gradient Descent (SGD) with a momentum of 0.9; the initial learning rate was 0.002, which decreased with the loss; the batch size was 32; and the loss function was cross-entropy. The training was performed on a machine with an NVIDIA GTX2080s graphics processor and 32 GB of memory. We trained and tested the models with the data set consisting of corn leaves in the jointing stage, and to validate robustness, we tested the models with the heading stage data set. Figure 7 illustrates the processes involved in obtaining the images used for the experiments.

Figure 7. Illustration of the process used in this study.
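For readers unfamiliar with the optimizer named in the experimental setup, the following sketch shows a single SGD-with-momentum update using the reported momentum (0.9) and initial learning rate (0.002). It follows the update convention used by PyTorch's SGD (velocity accumulates the raw gradient, then is scaled by the learning rate); the function name and the toy quadratic loss are assumptions for illustration, and the learning-rate decay mentioned in the text is not modeled.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.002, momentum=0.9):
    """One update: v = momentum * v + g, then p = p - lr * v."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# Toy example (hypothetical): minimize f(w) = w^2, whose gradient is 2w.
params, velocity = [1.0], [0.0]
for _ in range(100):
    grads = [2.0 * w for w in params]
    params, velocity = sgd_momentum_step(params, grads, velocity)
```

The velocity term carries over a fraction of the previous update, so consistent gradient directions accelerate progress even with a small learning rate.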

Evaluation Parameters
Model performance was assessed using six parameters: Accuracy, Sensitivity, Specificity, Precision, F1 Score, and Kappa.
Accuracy is the main measure of model performance; it represents the classification ability of the model as the proportion of correctly classified samples in the total number of samples. In the formulas below, T (True) and F (False) indicate whether the prediction is correct, and P (Positive) and N (Negative) indicate the category predicted by the model; TP and TN are the numbers of correct predictions, and FP and FN are the numbers of incorrect predictions:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Sensitivity represents the model's recognition ability for positive samples, consisting of TP and FN:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
Specificity is defined to show true negative assessment ability, consisting of TN and FP:
$$\text{Specificity} = \frac{TN}{TN + FP}$$
Precision shows the accuracy of all model-identified positive samples, consisting of TP and FP:
$$\text{Precision} = \frac{TP}{TP + FP}$$
F1 Score is an aggregate indicator based on the harmonic mean of precision and recall (sensitivity):
$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$
The Kappa coefficient is a consistency indicator based on the confusion matrix, with values between −1 and 1; values closer to 1 indicate a better overall classification:
$$\text{Kappa} = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = \frac{\sum_i a_i b_i}{n^2}$$
In the formula, $p_o$ is the overall accuracy, $a_i$ and $b_i$ represent the true number and the predicted number of samples in category $i$, respectively, and $n$ is the total number of samples.
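The six evaluation parameters can be computed directly from confusion-matrix counts, as in the sketch below. The function names are hypothetical, and the counts in the example are made up for illustration; they are not results from this study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity, specificity, precision, and F1 from counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def kappa(confusion):
    """Kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(len(confusion))) / n
    a = [sum(row) for row in confusion]        # true count per category
    b = [sum(col) for col in zip(*confusion)]  # predicted count per category
    p_e = sum(ai * bi for ai, bi in zip(a, b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Illustrative counts only (not from the paper).
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
agreement = kappa([[40, 5], [10, 45]])
```

The `kappa` function accepts any number of categories, so the same code applies to the three-class case (healthy, translucent silver window, irregular wormhole).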

Experimental Results
In this study, we compared the performance of four models on the data set, and the results of each model are shown in Figure 8. The accuracies of ResNeSt50, ResNet50, EfficientNet, and RegNet were 98.77%, 97.59%, 97.89%, and 98.07%, respectively. All four network structures therefore achieve high recognition accuracy on this classification problem. Among them, ResNet50 with split attention (ResNeSt50) has the highest accuracy. With the addition of transfer learning, all networks essentially reach a steady state in about 20 epochs, which means that in a production environment the training and validation of the model can be completed in a relatively short time.
An image was randomly selected from the test data set (already cropped), and the model was run on it. The infected image blocks in the results were given different colors according to their severity and then spliced together for display. A blue box represents a slight silver window, while a red box represents an irregular wormhole (see Figure 9).
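The splicing of tile predictions into a color-coded display can be sketched as a mapping from tile index to box coordinates. The function name, the label strings, and the row-major tile order are assumptions for illustration; only the blue and red color coding comes from the text.

```python
# Box color per predicted class; healthy tiles are left unmarked.
COLORS = {"TSW": "blue", "IW": "red"}


def colored_boxes(labels, n_cols, tile=200):
    """Map row-major tile labels to (top, left, bottom, right, color) boxes."""
    boxes = []
    for index, label in enumerate(labels):
        color = COLORS.get(label)
        if color is None:
            continue  # healthy or unknown label: no box drawn
        row, col = divmod(index, n_cols)
        top, left = row * tile, col * tile
        boxes.append((top, left, top + tile, left + tile, color))
    return boxes


# A 2 x 2 grid of tiles with one slight and one severe symptom.
labels = ["healthy", "TSW", "IW", "healthy"]
boxes = colored_boxes(labels, n_cols=2)
```

Each box is expressed in the pixel coordinates of the original image, so it can be drawn on the full frame and related to the image's geographic coordinate.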

Discussion
In this study, we proposed a deep learning model to detect the invasion of Spodoptera frugiperda from the features of damaged leaves. Four different neural network structures were used to verify the feasibility of the proposed method. In addition, to verify the robustness of these features against maize morphological change, the four models were trained on images at the jointing stage and tested on images at the heading stage. The appearance and fall of stamens in the heading stage images has a certain influence on the overall structure and color of the image. However, the neural networks based on features of the damaged leaves still achieve good accuracy. Accuracy, Sensitivity, Specificity, Precision, F1 score, and Kappa were used to demonstrate the recognition ability for maize leaves with holes (see Table 4) (TSW: translucent silver window; IW: irregular wormhole).

Discussion
In this study, we proposed the deep learning model to detect the invasion of Spodoptera frugiperda by the features of the damaged leaves. At the same time, four different neural network structures are used to verify the feasibility of the proposed method. In addition, in order to verify the ability of this feature against maize morphological change, the four models were trained on the images at the jointing stage and tested on the images at the heading stage. The appearance of stamen and the fall of stamen in corn leaves images at the heading stage has a certain influence on the overall structure and color of the image. However, the neural network based on features of the damaged leaves still has good accuracy. Accuracy, Sensitivity, Specificity, Precision, F1 score, and Kappa were used to demonstrate the recognition ability of maize leaves with holes (see Table 4) (TSW: Translucent silver window, IW: Irregular wormhole). It can be seen from the table that the models based on four different network structures all have a good ability to identify the infected leaves from the corn images at the heading stage. However, compared with the original valid accuracy, the current accuracy has a degree of decline, respectively 89.39%, 81.88%, 86.21%, and 84.21%. The split-attention models outperformed the origin ResNet50 structures and the classical network model on the performance in terms of Accuracy, Precision, etc. ReNest50 also achieves the best results on the Kappa coefficients and F1 Score. What is more, we can see the differences between these networks more clearly in the CAM (Class Activation Map based on the average gradient) in Figure 10 below. Compared with the results of other network structures, the split-attention network can identify the target more accurately and closely. It can be seen from the table that the models based on four different network structures all have a good ability to identify the infected leaves from the corn images at the heading stage. 
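A class activation map based on the average gradient is commonly known as Grad-CAM. The sketch below illustrates the core computation only, assuming the activations of the last convolutional layer and the gradients of the class score with respect to them have already been extracted; the function name `grad_cam` and the toy values are hypothetical, and the exact implementation behind Figure 10 is not specified here.

```python
def grad_cam(activations, gradients):
    """Gradient-weighted class activation map (core step only).

    activations, gradients: lists of K feature maps, each an H x W
    list of lists, taken from the same convolutional layer.
    Returns an H x W map scaled to [0, 1].
    """
    # Channel weight = spatial average of that channel's gradient.
    weights = [
        sum(sum(row) for row in g) / (len(g) * len(g[0]))
        for g in gradients
    ]
    height, width = len(activations[0]), len(activations[0][0])

    # Weighted sum over channels, followed by ReLU to keep only
    # regions that contribute positively to the class score.
    cam = [
        [
            max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Normalize by the peak value so maps are comparable across images.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


if __name__ == "__main__":
    # Toy example: two 2 x 2 feature maps with made-up values.
    acts = [[[1, 0], [0, 2]], [[0, 3], [1, 0]]]
    grads = [[[1, 1], [1, 1]], [[-1, -1], [0, 0]]]
    print(grad_cam(acts, grads))
```

In practice, the resulting map is upsampled to the input resolution and overlaid on the image, so a tighter highlighted region around the gnawing holes indicates that the network attends to the intended feature.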

Conclusions and Future Directions
This study aimed to detect maize images containing leaves infected by Spodoptera frugiperda at an early stage. Four models, ResNeSt50, ResNet50, EfficientNet, and RegNet, were used to verify the feasibility of using the above features for recognition and to explore whether the split-attention mechanism improves the accuracy and robustness of the model. ResNeSt50 achieved an accuracy of 98.77% on the validation data set from the jointing stage and 89.39% on the test data set from the heading stage. The model was able to identify infected maize leaves at different growth stages and to classify them according to the degree of infection.
During model construction, data augmentation and transfer learning were adopted to speed up training, reduce overfitting, and improve robustness. Targeted treatment can then be carried out according to each image's coordinates and the grade and distribution of infected leaves, which can significantly reduce pesticide use and assist in the implementation of biological control.
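As an illustration of how per-image predictions might be turned into a treatment map, the following sketch groups classified images into grid cells by their coordinates. It is a hypothetical example, not the procedure used in this study: the function name `summarize_by_cell`, the grade coding (0 = healthy, 1 = TSW, 2 = IW), the cell size, and the sample coordinates are all assumptions.

```python
from collections import defaultdict


def summarize_by_cell(predictions, cell_size):
    """Aggregate per-image predictions into field grid cells.

    predictions: iterable of (x, y, grade) tuples, where x and y are
    image-center coordinates in meters and grade is an assumed coding
    (0 = healthy, 1 = TSW, 2 = IW).
    cell_size: edge length of a square grid cell in meters.
    """
    cells = defaultdict(list)
    for x, y, grade in predictions:
        cell = (int(x // cell_size), int(y // cell_size))
        cells[cell].append(grade)

    summary = {}
    for cell, grades in cells.items():
        infected = sum(1 for g in grades if g > 0)
        summary[cell] = {
            "n_images": len(grades),
            "infected_fraction": infected / len(grades),
            "max_grade": max(grades),
        }
    return summary


if __name__ == "__main__":
    # Made-up predictions for three images in a field.
    sample = [(1.0, 1.0, 0), (2.0, 3.0, 1), (12.0, 1.0, 2)]
    for cell, stats in sorted(summarize_by_cell(sample, cell_size=10).items()):
        print(cell, stats)
```

Cells with a high infected fraction or a high maximum grade could then be prioritized for spraying or biological control, subject to thresholds set with agronomic knowledge.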
Although the model can accurately and quickly identify insect damage on the maize leaves present in an image, the following problems still need further study in practical application: (1) images are acquired in orthographic (top-down) projection, so some damaged leaves may be missed due to occlusion; (2) the influence of image acquisition parameters such as height, angle, and resolution, and of actual field planting conditions, needs to be examined; (3) the overall statistical analysis of field pest distribution and its subsequent application should be further explored with agronomic knowledge. In future research, we will apply the Spodoptera frugiperda model, built on a more accurate network architecture, to real-time recognition of field corn images. In addition, using the optimal resolution combination obtained, we will conduct a new round of data collection during this year's maize planting period to further verify the method. We will also collect more data to build a model that can identify more pests and diseases faster and more accurately.