1. Introduction
Locusts are a major cause of economic losses in global agriculture and grassland animal husbandry [1]. Locust plagues, floods, and droughts are known as the three major natural disasters. In China, more than two million hectares of agricultural land are affected by locust invasions every year, causing great harm to crops and threatening food security [2]. Therefore, establishing an efficient, accurate, and intelligent pest monitoring and early warning system has become key to improving disaster response capabilities [3].
Traditional locust monitoring methods mainly rely on manual ground surveys, including the step-by-step investigation of locust eggs, nymphs, and adults [4]. These surveys are usually carried out by local plant protection stations or professional organizations, but they consume substantial manpower, material, and financial resources, have a low efficiency, and struggle to cover complex terrain such as lakes and swamps [5]. To improve monitoring efficiency, some studies have introduced climate prediction [6], phenological prediction [7], and remote-sensing technology [8] to analyze pest dynamics at the regional level. However, climate prediction has a low spatial resolution because it depends on macro-scale data, and phenological prediction does not consider small-scale spatial heterogeneity. Neither integrates high-resolution spatial data and landscape features, so the spatial positioning of locust occurrence areas remains inaccurate. Although remote-sensing monitoring offers real-time, wide-area coverage, the limited image resolution makes it difficult to accurately identify individual locusts.
In recent years, great progress has been made in combining machine learning with remote-sensing technology for locust monitoring. Tabar et al. [9] developed PLAN, a spatio-temporal deep learning model that uses crowdsourced data and environmental data to predict locust migration patterns in East Africa; its AUC score reaches 0.9, significantly outperforming traditional machine learning models. Kimathi et al. [10] used the MaxEnt niche model, combined with environmental variables such as temperature, precipitation, soil moisture, and sand content, to predict the breeding sites of desert locusts in East Africa. Shao et al. [11] monitored and predicted the severity of locust outbreaks in Asian–African deserts by combining MODIS time-series data with a Hidden Markov Model (HMM) and quantitatively evaluated the impact of locust outbreaks on crops using hyperspectral images. Gómez et al. [8] used near-real-time soil moisture data from the SMOS satellite and six machine learning algorithms to predict desert locust breeding grounds, finding that soil moisture data can provide sufficient predictive information from 95 to 12 days before locusts appear. Sun et al. [12] proposed a dynamic prediction model based on a support vector machine and multivariate time-delay sliding window technology; combined with multi-source remote-sensing data and historical locust survey data, it predicted desert locust swarms 16 days in advance. Guo et al. [13] used support vector machine (SVM), random forest (RF), and maximum likelihood (ML) methods to study the formation mechanism of high-density patches of Asian migratory locusts based on time-series remote-sensing images. However, traditional machine learning algorithms have a poor feature extraction ability in complex scenes and therefore struggle to meet the needs of high-precision locust detection.
Deep learning can automatically capture complex features in data through layer-by-layer feature extraction and generation mechanisms. Compared with traditional machine learning, it achieves a higher precision and efficiency in recognition tasks [14]. Ye et al. [5] developed the ResNet-Locust-BN model by improving the ResNet structure; trained on RGB images captured in the field, it automatically identifies the species and instars of East Asian migratory locusts. Its precision in distinguishing migratory locusts from rice locusts and cotton locusts was 93.60% and 97.80%, respectively, with an overall precision of 90.16%. Bai et al. [15] proposed a video target detection method for East Asian migratory locusts based on the MOG2-YOLOv4 network; by combining background separation with deep learning, it addresses occlusion and blurring, achieving an average precision of 82.33%. For the segmentation of camouflaged locusts, Liu et al. [16] proposed EG-PraNet, an improved PraNet model; by introducing a grouped reverse module and image enhancement, segmentation precision was significantly improved, with the Dice and IoU indices increasing by 17.8% and 25.7%, respectively. At present, there are few studies on locust detection based on deep learning. Existing research mainly focuses on improving detection precision and efficiency, and the optimization of real-time performance and generalization ability still needs further exploration.
As a lightweight and efficient target detection algorithm, YOLO (You Only Look Once) has been widely used in plant pest detection [17]. For example, to balance computational efficiency and detection performance, Li et al. [18] achieved a 94.3% detection precision and a 93.5% mAP in complex field environments by integrating cross-stage feature fusion (Hor-BNFA), spatial depth conversion convolution (SPDConv), and group shuffle convolution (GsConv); the model size is only 7.9 MB, outperforming existing models. Liu et al. [19] addressed the problem that existing models struggle to capture long-distance dependencies and fine-grained features in images, which leads to poor recognition against complex backgrounds; by integrating a hybrid convolution Mamba module into the neck network, introducing a similarity-based attention mechanism, and using a weighted bidirectional feature pyramid network, the model's disease recognition ability against complex backgrounds was significantly improved, with the F1-score and mAP being 3.0% and 4.8% higher than those of YOLOv8, respectively. Zhang et al. [20] constructed AgriPest-YOLO, a lightweight pest detection model based on YOLOv5 that addresses scale variation and complex backgrounds in light-trap images through a coordination and local attention mechanism, grouped spatial pyramid pooling, and soft-NMS optimization. Zhu et al. [21] proposed CBF-YOLO, an improved network based on YOLOv7 that achieves high-precision detection of leaves damaged by soybean pests by combining CSE-ELAN, Bi-PAN, and FFE modules; its mAP was 86.9% on a public dataset and 81.6% in actual scenes, still affected by lighting conditions, background complexity, and the similarity of pest characteristics. Wang et al. [22] developed Insect-YOLO, based on YOLOv8 and the CBAM attention module, for detecting and counting pests in low-resolution farmland images and integrated the algorithm into a remote pest monitoring and analysis system; it achieved an mAP@50 of 93.8% on 2058 low-resolution images, surpassing other baseline models.
At present, research on locust detection based on deep learning is limited, and most existing methods focus on target images with a single background, ignoring the challenges that real field environments pose to the detection task. Precision and generalization in changing environments (such as illumination changes and target occlusion) must be improved [5]. In addition, existing models usually have a high parameter count and computational complexity, which hinders their deployment on resource-constrained mobile or embedded platforms [23]. To address these challenges, this study aims to develop a lightweight, efficient, and high-precision detection model for Locusta migratoria ssp. manilensis in complex environments. We first construct a new dataset named Real-Locust, which includes locust images captured against diverse backgrounds, at varying densities, and in different poses. We then propose ATD-YOLO, an improved YOLOv8n-based model that integrates the AIFI, CBAM, and LTSC modules to enhance feature extraction and reduce model complexity. In the experimental phase, we conduct extensive cross-validation and ablation studies to evaluate the effectiveness and efficiency of the proposed model, and we compare it with other state-of-the-art models to demonstrate its advantages and provide a reliable solution for the real-time monitoring of locusts in agricultural fields.
3. Results
3.1. Experimental Environment and Parameter Settings
All experiments in this study were performed using Python 3.10 and PyTorch 2.1.0, and training was carried out on an RTX A5000 GPU with 24 GB of memory. The specific experimental environment and parameter settings are shown in Table 1, and the key training parameter settings are shown in Table 2 to ensure the transparency and repeatability of the experiments.
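For reproducibility, the following is a minimal sketch of how such a training run can be launched with the Ultralytics YOLO API. The dataset configuration file name and the hyperparameter values are illustrative placeholders, since the exact settings are given in Tables 1 and 2.

```python
# Minimal sketch of the training setup described above (Ultralytics YOLO API).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")       # baseline model that ATD-YOLO extends
model.train(
    data="real_locust.yaml",     # hypothetical dataset config for Real-Locust
    epochs=200,                  # placeholder; see Table 2 for the actual value
    imgsz=640,                   # placeholder input resolution
    batch=16,                    # placeholder batch size
    device=0,                    # single RTX A5000 GPU
)
```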
3.2. Model Evaluation Indicators
In this paper, precision, recall, F1-score, mAP, params, and GFLOPs are used as evaluation indices to measure the improvement effect. The precision P is the proportion of true positive samples among the samples predicted to be positive. The formula is

$$P = \frac{TP}{TP + FP}$$

In the formula, TP denotes the number of positive samples correctly predicted, and FP denotes the number of negative samples incorrectly predicted as positive. The smaller the FP, the higher the precision, and the fewer the non-locust targets misjudged as locust targets.
The recall rate R is the proportion of actual positive samples that are correctly predicted as positive. The calculation formula is

$$R = \frac{TP}{TP + FN}$$

In the formula, FN represents the number of positive samples wrongly predicted as negative. The smaller the FN, the higher the recall rate, and the fewer the locusts that are missed.
In some cases, there is a trade-off between precision and recall, which requires comprehensive consideration. To this end, the F1-score, a composite index that takes both into account, is introduced. The calculation formula is

$$F1 = \frac{2 \times P \times R}{P + R + \varepsilon}$$

Here, $\varepsilon$ is a minimal constant that prevents the denominator from being 0.
The mean average precision (mAP) is the mean of the average detection precision over all classes. The calculation formula is

$$mAP = \frac{1}{c} \sum_{i=1}^{c} \int_{0}^{1} P_i(R)\, dR$$

In the formula, c represents the total number of target detection categories, and P(R) is the curve drawn with the recall rate on the X axis and the precision on the Y axis; the area enclosed by the curve and the coordinate axes is the average precision (AP). In this paper, the model only needs to identify locust targets, so the mAP value equals the AP value of the single locust class.
The number of parameters (params) is the total number of trainable parameters in the model, which directly affects model complexity, training time, and inference speed. The number of floating-point operations (GFLOPs) is the number of floating-point operations performed during a single model inference. The lower the parameter count and the number of floating-point operations, the lower the complexity of the algorithm and the lighter the model.
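As a concrete illustration, the sketch below computes precision, recall, and F1 from raw detection counts following the formulas above. The counts in the usage example are invented for demonstration; mAP is omitted because it requires integrating the full P(R) curve and is normally computed by the detection framework itself.

```python
EPS = 1e-8  # same small constant used in the F1 formula above

def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from detection counts, per the formulas above."""
    precision = tp / (tp + fp + EPS)
    recall = tp / (tp + fn + EPS)
    f1 = 2 * precision * recall / (precision + recall + EPS)
    return precision, recall, f1

# Example: 86 correct detections, 9 false alarms, 14 missed locusts
print(detection_metrics(86, 9, 14))  # ~ (0.905, 0.860, 0.882)
```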
In addition, the Mean Activation Value (MAV), Activation Entropy (AE), Background Mean Activation (BMA), and Activation Contrast (AC) were used as analysis indices for the heat maps. Among them, the MAV reflects the intensity of the model’s attention to the target area, and the calculation formula is

$$MAV = \frac{1}{N} \sum_{i=1}^{N} A_i$$

In the formula, $A_i$ is the activation value of the i-th pixel in the target box, and N is the total number of pixels in the target box. The higher the value, the stronger the activation of the model in the target area, indicating that the area is a key region for the model’s judgment.
AE is used to measure the concentration of activation values in the target area. The calculation formula is

$$AE = -\sum_{i=1}^{N} p_i \log\left(p_i + 1 \times 10^{-8}\right), \quad p_i = \frac{A_i}{\sum_{j=1}^{N} A_j + 1 \times 10^{-8}}$$

In the formula, $1 \times 10^{-8}$ is a minimal constant that avoids a zero denominator or an undefined logarithm, and N is the total number of pixels in the target box. The lower the entropy value, the more the model’s focus on the target is concentrated in a few key areas, and the stronger the feature discrimination.
BMA reflects the ability of the model to suppress background noise. The calculation formula is

$$BMA = \frac{1}{M} \sum_{j=1}^{M} B_j$$

In the formula, $B_j$ is the activation value of the j-th pixel in the non-target areas of the image, and M is the total number of pixels in the background area. The lower the value, the weaker the activation of the background region, the better the background suppression, and the stronger the anti-interference ability.
AC is used to measure the activation difference between the target and the background. The calculation formula is

$$AC = \frac{MAV}{BMA + 1 \times 10^{-8}}$$

In the formula, $1 \times 10^{-8}$ is a minimal constant that avoids a zero denominator. The higher the value, the greater the difference between the activation of the target and that of the background, and the stronger the model’s ability to discriminate the target.
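To make the four heat-map indicators concrete, the following NumPy sketch computes MAV, AE, BMA, and AC from a single activation map and a target-box mask, following the definitions above. The normalization inside AE is an assumption consistent with the stated role of the $1 \times 10^{-8}$ constant; the authors’ exact implementation may differ.

```python
import numpy as np

EPS = 1e-8  # the 1e-8 constant from the AE and AC formulas

def heatmap_metrics(heatmap: np.ndarray, target_mask: np.ndarray):
    """MAV, AE, BMA, and AC for one heat map.

    `heatmap` holds activation values (e.g., Grad-CAM++ output in [0, 1]);
    `target_mask` is a boolean array of the same shape marking pixels inside
    the ground-truth box.
    """
    target = heatmap[target_mask]
    background = heatmap[~target_mask]

    mav = target.mean()                          # mean activation value
    p = target / (target.sum() + EPS)            # normalized target activations
    ae = -np.sum(p * np.log(p + EPS))            # activation entropy
    bma = background.mean()                      # background mean activation
    ac = mav / (bma + EPS)                       # activation contrast
    return mav, ae, bma, ac
```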
3.3. Cross-Validation Experiments
To evaluate the robustness and generalization ability of the model, this study employed five-fold cross-validation. Specifically, the training and validation sets were combined into a single dataset (a total of 1370 images), which was then divided into five subsets. In each experiment, four of these subsets were used as the training set, while the remaining one served as the validation set. This process was repeated five times to ensure that each subset participated in the validation phase. This approach helps minimize evaluation bias caused by uneven data partitioning, providing a more reliable assessment of the model’s performance.
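A minimal sketch of this splitting procedure is shown below, assuming the 1370 combined images are available as a flat list of file paths (the paths themselves are placeholders).

```python
# Five-fold split of the combined training + validation set (1370 images).
from sklearn.model_selection import KFold

image_paths = [f"images/{i:04d}.jpg" for i in range(1370)]  # placeholder paths
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, val_idx) in enumerate(kfold.split(image_paths)):
    train_set = [image_paths[i] for i in train_idx]  # ~1096 images per fold
    val_set = [image_paths[i] for i in val_idx]      # ~274 images per fold
    # train the model on train_set, then record P/R/F1/mAP on val_set
```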
In each fold of the experiment, the model underwent the standard training process and was then evaluated on the validation set. Performance metrics such as precision, recall, F1-score, and mean average precision (mAP) were recorded. After completing all folds, the mean and standard deviation of these metrics were calculated to provide a more stable assessment of the model’s performance. The results of these experiments are shown in Table 3.
As shown in Table 3, the model’s performance on the validation set was relatively consistent, with an mAP of 92.3% and a standard deviation of 0.9%, indicating minimal fluctuation across folds and good robustness. The standard deviations of precision (P) and recall (R) were 1.7% and 1.9%, respectively, while the F1-score had a standard deviation of only 0.3%, further confirming the model’s balance and consistency across folds. Based on the cross-validation results, the model with the highest mAP on the validation set was selected for the final evaluation on the test set. Table 4 compares this best-performing model’s validation-set results with its results on the test set.
3.4. Ablation Experiment
To verify the effectiveness of the proposed improvements, this experiment used YOLOv8n as the base model and performed eight groups of ablation experiments on the self-made Real-Locust dataset, using the same equipment, for the improved methods proposed in Section 2.2.1, Section 2.2.2, and Section 2.2.3. To verify the stability of the model, each group of experiments was repeated five times, and the results are reported as the mean ± standard deviation. The model was evaluated using P, R, mAP@50, mAP@50:95, F1, params, GFLOPs, and model size. The experimental results are shown in Table 5.
The ablation results show that the original YOLOv8n model exhibits a solid baseline performance in the target detection task (P = 0.900 ± 0.007, R = 0.820 ± 0.009, mAP@50 = 0.882 ± 0.005, mAP@50:95 = 0.420 ± 0.010) but suffers from a high parameter count and computational complexity.

YOLOv8n + CBAM improves R by 0.85% and mAP@50:95 by 2.4% by introducing a channel-spatial attention mechanism, significantly improving detection performance under different IoU thresholds. However, P decreased by 0.44% and F1 by 0.81%, indicating that the attention mechanism may introduce feature redundancy while improving localization accuracy. YOLOv8n + AIFI significantly improves P by 1.22% and R by 1.46% while compressing the parameter count by 7.18%, verifying the module’s ability to optimize feature extraction efficiency; however, GFLOPs decreased by only 3.6%, indicating that the module’s optimization of computational complexity is limited and needs to be combined with a lightweight design. YOLOv8n + LTSC achieves a 25.2% parameter compression and a 21.7% GFLOPs reduction while increasing R by 2.56% through the lightweight shared-convolution detection head; the results show that separated normalization can effectively retain target information while reducing redundant computation, making it suitable for real-time detection scenarios.

The combined YOLOv8n + AIFI + LTSC scheme reduces the number of parameters by 27.5% and GFLOPs by 22.9% while increasing R by 3.78%, and it yields the smallest model size, achieving a better balance between performance and lightweight design; however, mAP@50:95 increased by only 0.71%, reflecting the cost of the lightweight design for detection at high IoU thresholds. YOLOv8n + AIFI + CBAM combines the advantages of AIFI and CBAM: P increases by 1.22%, F1 increases by 1.84%, and mAP@50:95 is the best; compared with YOLOv8n + AIFI, R increases by 2.88% and mAP@50 by 1.24%, proving that the synergy of the attention mechanism and feature interaction can effectively compensate for the information loss caused by the lightweight design. In the YOLOv8n + CBAM + LTSC scheme, R increased by 2.07% while the parameter count was compressed by 23.1%, but mAP@50:95 decreased by 0.24% and F1 by 0.09%; this indicates that simply stacking the attention mechanism and lightweight modules may weaken the feature expression ability, and the module coupling needs further optimization.

Finally, ATD-YOLO integrates the advantages of all modules, achieving a 27.4% reduction in parameters and a 22.9% reduction in GFLOPs while increasing R by 5.49%, reaching an mAP@50 of 0.904 ± 0.012 and an optimal F1 of 0.886 ± 0.010. Although mAP@50:95 increases by only 1.67%, the model achieves the best balance among GFLOPs, model size, and high-precision detection through efficient feature interaction and modular compression, with the attention mechanism compensating for the information loss of the lightweight design.
3.5. Performance Comparison with Baseline Models
To further verify the effectiveness of the improved network model in the complex-environment locust detection task, the results of several well-performing target detection models, including YOLOv3-tiny, YOLOv5n, YOLOv6n, YOLOv8n, YOLOv9t, YOLOv10n, YOLOv11n, and YOLOv12n [36], were compared with those of the improved model under the same experimental equipment, parameters, dataset, and training strategy. The evaluation indices are the same as in Section 3.3. The experimental results are shown in Table 6.
The experimental results show that the improved ATD-YOLO achieves a good performance on multiple indicators. P increased from 0.904 to 0.91, slightly higher than the other models, indicating that it is better at reducing false detections and can more accurately distinguish targets from non-targets. R increased by 4.3%, significantly better than the other models, indicating substantial progress in reducing missed detections. The mAP@50 reached 0.909, exceeding the other eight models and improving by 2.3% over the base network YOLOv8n. The mAP@50:95 increased from 0.424 to 0.431; although slightly lower than that of YOLOv11n, it still exceeds the original YOLOv8n and the other models, indicating an improved detection precision across different IoU thresholds. For F1, the improved ATD-YOLO achieved the highest value of 0.89, showing a good balance between precision and recall. In terms of parameter count, complexity, and model size, although it is slightly inferior to YOLOv9t, YOLOv11n, and YOLOv12n, it is greatly improved compared with the other models. Overall, the improved ATD-YOLO strikes a good balance between computational efficiency and detection performance.
In summary, the ATD-YOLO model proposed in this paper has certain advantages in accurately detecting locusts in complex agricultural environments.
3.5.1. Significance Analysis of ATD-YOLO, YOLOv8n, and YOLOv11n
In this experiment, YOLOv11n was added for comparison. Each group of experiments was repeated five times, and the results are reported as the mean ± standard deviation. Recall and mAP@50 were selected as the evaluation indices: recall directly determines the missed-detection rate, the core index for agricultural applications, while mAP@50 corresponds to the average precision at IoU = 0.5, which better matches the agricultural-scenario requirement that roughly locating a target is sufficient to trigger prevention and control.
As shown in Figure 9, ATD-YOLO is significantly better than YOLOv8n and YOLOv11n in recall and mAP@50. In terms of recall, the mean value of ATD-YOLO (0.865) was 3.3% higher than that of YOLOv11n (0.837); the box plots do not overlap, and the t-test showed a significant difference (p < 0.001). In terms of mAP@50, the mean value of ATD-YOLO (0.904) covered the upper limit of YOLOv11n’s performance, a significant increase of 0.9% (p < 0.05). Although YOLOv11n is slightly better at high IoU thresholds, ATD-YOLO holds the advantage in striking an accuracy–efficiency balance through the collaborative optimization of lightweight and attention mechanisms, providing a better solution for resource-constrained agricultural scenarios.
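For reference, the significance test described above can be reproduced with a standard two-sample t-test. The five per-run recall values below are illustrative placeholders consistent with the reported means, not the actual experimental records.

```python
# Two-sample t-test on the five repeated-run recall values.
from scipy.stats import ttest_ind

recall_atd = [0.861, 0.868, 0.863, 0.870, 0.863]  # ATD-YOLO, mean ~0.865
recall_v11 = [0.834, 0.840, 0.835, 0.839, 0.837]  # YOLOv11n, mean ~0.837

t_stat, p_value = ttest_ind(recall_atd, recall_v11)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")     # p < 0.001 indicates significance
```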
3.5.2. Performance Comparison of ATD-YOLO, YOLOv8n, and YOLOv11n
Figure 10 shows the mAP@50 and total-loss comparison curves of the improved ATD-YOLO model, YOLOv8n, and YOLOv11n.
Overall, the mAP@50 curves show that ATD-YOLO reaches a relatively high detection precision early in training, with a stable training process and no obvious fluctuations. YOLOv8n converges slowly at first and exhibits an early-stopping phenomenon. Although YOLOv11n attains a high precision in the later stage, it fluctuates frequently and drops sharply at times, suggesting possible overfitting. The total-loss curves show that ATD-YOLO performs best: although its initial loss is slightly higher than that of YOLOv11n, its training is more stable, its descent curve is smoother, and its final convergence is markedly better. YOLOv11n converges quickly in the early stage but fluctuates slightly later, leaving room for optimization. These comparisons further verify that ATD-YOLO has a better detection effect than the other models.
3.6. Visual Analysis
To enrich the evaluation of the ATD-YOLO algorithm’s performance, this section selects several well-performing models for comparative experiments from two perspectives. First, the real locust test set is divided into three categories, and all models are validated on locusts with different density distributions; the classification details are shown in Table 7. Second, the gradient-weighted class activation mapping (Grad-CAM++) method [37] is used for visualization and analysis.
3.6.1. Comparison of Test Results of Different Distribution Densities
To more intuitively illustrate the detection performance of the proposed algorithm, this experiment selected three scenarios under different density conditions and compared the proposed model with representative models (YOLOv8n and YOLOv11n). The detection results are shown in Table 8 and Figure 11.
The table data show that YOLOv8n, YOLOv11n, and ATD-YOLO all perform best in the medium-density scenario: the R (0.913), mAP@50 (0.967), and F1 (0.932) of ATD-YOLO led all models, and the P (0.957) of YOLOv11n was the highest. At low density, ATD-YOLO had the best comprehensive performance (F1 = 0.858), and YOLOv11n had the lowest R (0.782). At high density, the F1 (0.884) of ATD-YOLO and the R (0.86) of YOLOv11n were the best, while the indicators of YOLOv8n generally lagged behind.
YOLOv8n has obvious shortcomings. In the low-density scenario, its mAP@50:95 is only 0.398, reflecting that the feature extraction module fails to capture the features of locust targets, which tend to be more dispersed and relatively small at low density; this makes accurate identification difficult under stricter detection standards and yields a poor accuracy on the more demanding evaluation index. In the high-density scene, its recall drops to 0.814, indicating that the detection head has an insufficient ability to separate and localize overlapping instances when dealing with dense locust targets, resulting in missed detections. The weakness of YOLOv11n is that its recall is only 0.782 at low density, far below the 0.896 at medium density and 0.86 at high density, indicating that its initial feature extraction layers are not sensitive enough to sparsely distributed locust targets; when locusts are few and sparsely distributed, the model struggles to quickly search for and locate them, resulting in a high missed-detection rate. The shortcoming of ATD-YOLO is its mAP@50:95 of 0.377 in the low-density scenario, indicating that for the small locust targets that may occur at low density, the down-sampling and up-sampling strategies of the feature extraction network do not fully extract and recover small-target features, limiting performance under high-precision detection standards.
As can be seen from Figure 11, YOLOv8n and YOLOv11n still fall short in detail handling. For example, in the Level I scenario, they fail to detect locusts with half of the body visible; in the Level II scene, there are many false and missed detections, with other objects mistaken for locusts; in the Level III scenario, individual locusts hidden in weeds and densely distributed in small areas increase the detection difficulty, resulting in missed and repeated detections. ATD-YOLO shows clear advantages: when targets are densely distributed or occluded, it identifies them more accurately and reduces missed detections.
In summary, ATD-YOLO is the preferred solution for complex environments owing to its balanced performance across multi-density scenarios and its advantages in detailed detection. YOLOv11n is suitable for tasks with high accuracy requirements in medium- and high-density scenes, but its sparse-target detection ability needs optimization; YOLOv8n needs algorithmic improvements for dense targets and complex background interference to improve its detection reliability in practical applications.
Although ATD-YOLO can maintain its detection performance in complex environments, there is still room for improvement in some scenarios. For example, when the distribution of target objects is extremely dense or mutual occlusion occurs, the ATD-YOLO algorithm may need to be further improved to enhance the fit and precision of the bounding box.
3.6.2. Heat Maps
To further demonstrate the advantages of the ATD-YOLO model in locust detection tasks in complex environments, this section uses the gradient-weighted class activation mapping (Grad-CAM++) method for visualization and analysis, as shown in Figure 12.
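As an illustration of how such heat maps can be generated, the following sketch uses the open-source pytorch-grad-cam package. For brevity it targets a standard torchvision classifier; applying the method to a detector such as ATD-YOLO additionally requires a detection-specific target function, which the package supports but the paper does not detail. The image path is a placeholder.

```python
# Generic Grad-CAM++ demonstration with the pytorch-grad-cam package.
import cv2
import numpy as np
import torch
from torchvision.models import resnet18
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.image import show_cam_on_image

model = resnet18(weights="IMAGENET1K_V1").eval()
target_layers = [model.layer4[-1]]                 # last convolutional block

rgb = cv2.cvtColor(cv2.imread("locust.jpg"), cv2.COLOR_BGR2RGB)
rgb = cv2.resize(rgb, (224, 224)).astype(np.float32) / 255.0
input_tensor = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)

cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=input_tensor)[0]        # (H, W) activation map in [0, 1]
overlay = show_cam_on_image(rgb, heatmap, use_rgb=True)
cv2.imwrite("gradcam_overlay.jpg", cv2.cvtColor(overlay, cv2.COLOR_RGB2BGR))
```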
To supplement the interpretability evaluation of the heat maps, this section compares four quantitative indicators: MAV, AE, BMA, and AC. The results are shown in Table 9.
The results show that ATD-YOLO performs best on image II, with the lowest AE (11.676), the highest AC (2.087), and the lowest BMA (0.277), reflecting concentrated target features, strong background suppression, and an outstanding discrimination ability. YOLOv11n performs excellently in BMA (0.241) and AC (1.476) on image I, making it suitable for complex background scenes, but its AE is generally high and its feature concentration needs optimization. YOLOv8n has the highest MAV (0.5) on image II, but also the highest BMA (0.537), resulting in an AC (1.079) close to 1 and a weak ability to distinguish targets from the background; moreover, its AC on image III falls below 1, a markedly worse performance.
In summary, ATD-YOLO is superior in feature concentration, background suppression, and discrimination; YOLOv11n is suitable for complex background scenes; and YOLOv8n needs to strengthen its background noise suppression and target–background contrast. These indicators provide a key basis for model interpretability analysis and performance optimization, aid subsequent model improvement and scene adaptation, and provide data support for model selection and optimization in target detection tasks.
4. Discussion
In this study, the YOLOv8n model was improved; it was selected as the base model in consideration of its deployability and light weight. As described in Section 2, this study optimized the intra-scale feature interaction, channel-spatial attention mechanism, detection head structure, parameter count, and computational complexity of YOLOv8n and designed the ATD-YOLO model. To prove the effectiveness of each individual module, ablation experiments were performed, as shown in Table 5. The improved model is superior to the base model in both the training and test stages; its performance was demonstrated by comparing precision, recall, and mAP, as shown in Figure 9. To further observe the precision and generalization ability of comparable YOLO-series models, this paper presented visual experiments, as shown in Figures 10 and 11. The experiments show that by integrating the intra-scale feature interaction module (AIFI), the convolutional block attention module (CBAM), and the lightweight shared-convolution detection head (LTSC), the ATD-YOLO model improves recognition under the challenges of complex backgrounds and target occlusion and achieves a higher detection precision across different distribution density scenarios.
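For readers unfamiliar with the attention mechanism involved, the following is a compact PyTorch sketch of the standard CBAM block (sequential channel and spatial attention, per Woo et al.). It follows the original CBAM formulation rather than ATD-YOLO’s exact integration, whose placement and hyperparameters may differ.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Standard CBAM: channel attention followed by spatial attention."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Channel attention: shared MLP over global max- and avg-pooled features
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: conv over channel-wise max and mean maps
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention weights from pooled descriptors
        ca = torch.sigmoid(
            self.mlp(x.amax((2, 3), keepdim=True)) + self.mlp(x.mean((2, 3), keepdim=True))
        )
        x = x * ca
        # Spatial attention weights from channel-pooled maps
        sa = torch.sigmoid(
            self.spatial(torch.cat([x.amax(1, keepdim=True), x.mean(1, keepdim=True)], dim=1))
        )
        return x * sa
```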
Although the model performs well in the controlled dataset environment, its robustness under extreme weather (such as heavy rain or dust storms) or cross-species interference (such as the presence of other insects) still needs improvement, and more field tests are needed to verify its practical effectiveness. Future research can combine edge computing devices to fully verify the model’s real-time performance in multi-modal data scenarios and ensure its stability and reliability in complex real-world environments.
Model accuracy and related indicators have improved significantly over previous work, but the experimental results also expose problems that urgently need to be solved. Although the dataset, as the foundation of the experiments, simulates the distribution of locusts in real scenes as much as possible, problems remain, such as an insufficient number of targets and an uneven data quality. Moreover, the manual annotation process may introduce uncertainties, potentially leading to misclassifications or labeling errors that affect the training of the model.
In terms of structural optimization, although ATD-YOLO has made staged progress in reducing the parameter count and computational complexity, more lightweight model structures and algorithms still need to be explored to further improve computational efficiency while maintaining high precision. Through continued optimization of the model design, researchers may achieve a more efficient computing performance in practical applications and promote the adoption of related technologies.
For agricultural communities, this study provides a practical and lightweight solution for locust monitoring, which can assist plant protection robots in preventing agricultural pests and diseases, thereby reducing the dependence on labor-intensive manual surveys and improving the disaster response efficiency. It is hoped that this research will promote the further development of field locust monitoring technology.
5. Conclusions
This paper presented a Locusta migratoria ssp. manilensis dataset that focuses on real, complex backgrounds, varying locust densities, and diverse poses, enriching the available data resources and providing data support for future locust monitoring. At the same time, ATD-YOLO, an optimized model based on YOLOv8n, was proposed, which effectively improves locust detection performance in complex environments. First, the SPPF structure at the end of the original YOLOv8n backbone is replaced with the attention-based intra-scale feature interaction (AIFI) module, enhancing the network’s ability to capture long-distance dependencies and thereby improving the detection precision of targets at different scales. Second, the convolutional block attention module (CBAM) is integrated, enabling the network to adapt to complex backgrounds and improving its ability to suppress background noise and highlight key regions. Third, a lightweight shared-convolution structure (LTSC) is proposed for the detection head, improving the parameter utilization efficiency of the detection head and enhancing the recognition precision of locusts.
The experimental results show that the proposed model achieves a mean average precision (mAP) of 90.9% on the Real-Locust dataset, with the precision improved by 0.6%, the recall improved by 4.3%, and the parameter count and computational complexity reduced by 27.4% and 22.9%, respectively. These results show that the proposed method effectively handles locust detection in complex environments and provides valuable technical support for the real-time monitoring of locusts.
Future work will focus on (1) expanding the dataset with ground-perspective and multi-angle UAV images and combining Generative Adversarial Networks (GANs) to synthesize diverse locust scenes in order to enhance the model’s generalization; (2) exploring ultra-lightweight architectures through model pruning, quantization, or knowledge distillation to further reduce the computational overhead on resource-constrained edge devices; and (3) integrating ATD-YOLO into intelligent agricultural systems for real-time field deployment to achieve dynamic early warning and precise pest control.