Lightweight GAFNet Model for Robust Rice Pest Detection in Complex Agricultural Environments

Zhou, Yang; Huang, Wanqiang; Liu, Benjing; Chen, Tianhua; Wang, Jing; Zhang, Qiqi; Yang, Tianfu

doi:10.3390/agriengineering8010026

Open AccessArticle

Lightweight GAFNet Model for Robust Rice Pest Detection in Complex Agricultural Environments

by

Yang Zhou

¹

,

Wanqiang Huang

¹

,

Benjing Liu

²,

Tianhua Chen

³,

Jing Wang

¹,

Qiqi Zhang

¹ and

Tianfu Yang

^1,*

¹

College of Information and Technology, Jilin Agricultural University, Changchun 130118, China

²

Qian Gorlos Mongolian Autonomous County Black Soil Protection and Monitoring Center, Songyuan 131100, China

³

College of Biological and Agricultural Engineering, Jilin University, Changchun 130022, China

^*

Author to whom correspondence should be addressed.

AgriEngineering 2026, 8(1), 26; https://doi.org/10.3390/agriengineering8010026 (registering DOI)

Submission received: 24 November 2025 / Revised: 7 January 2026 / Accepted: 8 January 2026 / Published: 10 January 2026

Download

Browse Figures

Versions Notes

Abstract

To address challenges such as small target size, high density, severe occlusion, complex background interference, and edge device computational constraints, a lightweight model, GAFNet, is proposed based on YOLO11n, optimized for rice pest detection in field environments. To improve feature perception, we propose the Global Attention Fusion and Spatial Pyramid Pooling (GAM-SPP) module, which captures global context and aggregates multi-scale features. Building on this, we introduce the C3-Efficient Feature Selection Attention (C3-EFSA) module, which refines feature representation by combining depthwise separable convolutions (DWConv) with lightweight channel attention to enhance background discrimination. The model’s detection head, Enhanced Ghost Detect (EGDetect), integrates Enhanced Ghost Convolution (EGConv), Squeeze-and-Excitation (SE), and Sigmoid-Weighted Linear Unit (SiLU) activation, which reduces redundancy. Additionally, we propose the Focal-Enhanced Complete-IoU (FECIoU) loss function, incorporating stability and hard-sample weighting for improved localization. Compared to YOLO11n, GAFNet improves Precision, Recall, and mean Average Precision (mAP) by 3.5%, 4.2%, and 1.6%, respectively, while reducing parameters and computation by 5% and 21%. GAFNet can deploy on edge devices, providing farmers with instant pest alerts. Further, GAFNet is evaluated on the AgroPest-12 dataset, demonstrating enhanced generalization and robustness across diverse pest detection scenarios. Overall, GAFNet provides an efficient, reliable, and sustainable solution for early pest detection, precision pesticide application, and eco-friendly pest control, advancing the future of smart agriculture.

Keywords:

rice pest detection; lightweight model; global attention fusion; multi-scale feature optimization; smart agriculture

1. Introduction

Rice is a globally vital staple crop, with its yield and quality directly influencing food security, rural livelihoods, and socioeconomic stability [1]. However, factors such as global warming, increased planting density, heavy fertilizer and pesticide use have led to the growing threat of rice pests. These pests, characterized by small size, high population density, severe leaf overlap, and complex backgrounds, are causing significant damage, reducing both yield and quality [2]. Therefore, there is an urgent need for automated, intelligent, and lightweight pest detection technologies. These technologies can enable early warning, precise pesticide application, and eco-friendly pest control, ultimately advancing intelligent and sustainable agricultural production.

Traditional rice pest detection methods, such as manual field surveys, sex pheromone traps, and yellow sticky boards, are labour-intensive, depend heavily on the experience of field personnel, and may exhibit high omission rates under large-scale or dense-planting conditions. These methods fail to meet the demands of modern agriculture for “efficient, precise, and intelligent” pest monitoring in terms of time response, spatial coverage, and detection accuracy [3,4,5,6]. To address these limitations, computer-based pest detection methods have emerged. Traditional machine vision technology, through image acquisition, processing, and analysis, enables efficient, real-time pest identification and localization. This significantly enhances the automation and accuracy of monitoring, overcoming the shortcomings of conventional methods. Typical methods include pest discrimination systems based on classifiers such as Support Vector Machine (SVM), K-Nearest Neighbors, Naive Bayes, Decision Trees, and Random Forests [7,8,9,10]. These methods typically rely on manually extracted low-level features like color, texture, and shape for pest image classification or localization. M.A. Ebrahimi et al. [11] used the ratio of major to minor axes as a regional feature and hue, saturation, and intensity as color features, employing SVM with different kernel functions to detect potential parasites on strawberry plants. Limiao Deng et al. [12] proposed a method based on integrating SIFT into the HMAX model and extracting invariant features using the LCP algorithm, achieving 85.5% recognition accuracy for pest identification in complex natural environments using SVM. However, such methods are highly sensitive to expert experience and relevant knowledge, involve cumbersome feature selection processes, and have limited generalization performance when facing complex field backgrounds, lighting variations, leaf occlusion, and multi-pest scenarios. These methods suffer from high false positives and missed detections, making them inadequate for practical production needs.

In recent years, deep learning has made significant progress in object detection tasks and has become a mainstream technology in agricultural vision. Among these, the YOLO (You Only Look Once) series models, with their end-to-end structure and high inference speed, significantly improve pest detection accuracy and robustness, and have been widely applied in crop pest and disease identification. For example, Yunong Tian et al. [13] proposed MD-YOLO, which enhances small object detection ability through a dense network and adaptive attention mechanism, specifically designed for small moth pests, achieving a mean Average Precision (mAP) of 86.2%. Yongkang Liu et al. [14] introduced YOLO-Wheat, which improves the detection algorithm by using attention mechanisms and feature fusion, achieving an mAP of 89.6%. Yongzong Lu et al. [15] developed MA-YOLO, a lightweight real-time detection model based on multi-scale convolution and dynamic weighting, with an mAP of 65.4% and 73.9% on the IP102 [16] and Pest24 [17] datasets, respectively. Sen Yang et al. [18] proposed SRNet-YOLO, which addresses the issue of missed detections for low-resolution pests through super-resolution reconstruction and attention fusion, achieving an mAP of 78.2%, with an mAP of 57% even for very small pest targets. Nan Wang et al. [19] proposed Insect-YOLO, which systematically extracts complex pest features and integrates multi-scale information to optimize feature representation. With only 8.2 GFLOPs of low computational demand, it enables fast and accurate identification of low-resolution pest images in the field. Despite recent advances, current lightweight YOLO based models remain insufficient for the complex pad field environment. First, its ability to extract spatial details of small targets is limited, resulting in missed detections. Second, modeling disturbances such as uneven lighting, occlusion, and image blur is challenging, leading to false positives. Third, despite the availability of lightweight versions, the model size and computational requirements remain incompatible with resource-constrained agricultural terminals or edge devices.

To overcome the aforementioned challenges and improve pest detection in paddy fields, the integration of global attention mechanisms, multi-scale feature optimization, and lightweight design is critical for addressing issues such as small target size, severe occlusion, and complex backgrounds. Beyond accuracy, the model’s robustness and minimal computational demand are vital for ensuring effective deployment on edge devices. In response, this study introduces GAFNet, an enhanced version of YOLO11n, incorporating strategies like global context fusion, efficient feature selection, lightweight detection heads, and an optimized loss function. These innovations enable precise and robust detection of rice pests, offering a highly efficient and sustainable solution for resource-constrained agricultural terminals and drone platforms.

The main contributions of this article include: proposing three novel modules that not only expand the receptive field, fuse attention, enhance channel selection, and strengthen features, but also make the model more lightweight; A new loss function has been proposed to maintain the training stability of bounding box regression for slender pests. Through a series of experiments and deployments, it has been proven that GAFNet not only has good generalization and robustness, but can also be deployed on convenient devices, providing a direct and practical method for precise pest management.

2. Methods

YOLO11 [20], introduced by Ultralytics in 2024, signifies the next evolution in the YOLO series for real-time object detection. It achieves an optimal equilibrium between speed, precision, efficiency, and multi-tasking capabilities. YOLO11 is available in five configurations, n/s/m/l/x, with YOLO11n being the most lightweight in terms of parameter count and computational demand, making it particularly well-suited for rice pest detection in this paper. YOLO11n was chosen as the baseline model because it is extremely lightweight while maintaining accuracy, and also excellent in inference speed, making it very suitable for pest detection research on edge devices.

In the context of rice pest detection, YOLO11n encounters challenges such as minuscule object size, high density, severe occlusion, and intricate backgrounds. To address these challenges, we propose GAFNet, an enhanced variant of YOLO11n tailored for robust rice pest detection under field conditions. Initially, a novel Global Attention Fusion and Spatial Pyramid Pooling (GAM-SPP) module was introduced. This module integrates global contextual information from the rice field scene with multi-scale features to bolster the model’s ability to detect concealed pests. Subsequently, the C3 Efficient Feature Selection Attention (C3-EFSA) module is proposed, which merges depthwise separable convolutions (DWConv) [21] with an Efficient Channel Attention (ECA) [22], thereby enhancing feature discrimination efficiency in complex rice field environments. Furthermore, an Enhanced Ghost-based Detection Head (EGDetect) is introduced, which integrates Enhanced Ghost Convolution (EGConv), Squeeze-and-Excitation (SE) [23], and Sigmoid-Weighted Linear Unit (SiLU) [24] to reformulate a lightweight composite structure, reducing computational redundancy. Finally, the Focal-Enhanced Complete-IoU (FECIoU) loss function, an advancement over Complete-IoU (CIoU) [25], is presented, incorporating numerical stability terms and a hard sample weighting mechanism to further enhance pest localization under occlusion scenarios.

Through the synergistic design of these four components, GAFNet not only preserves the lightweight advantages of YOLO11n but also substantially enhances the accuracy, robustness, and adaptability to edge scenarios in rice pest detection. A schematic illustration of the rice pest detection mechanism based on the GAFNet model is presented in Figure 1.

2.1. Multi-Scale Feature Enhancement Mechanism Based on Global Attention Fusion

In rice pest detection tasks, the targets often exhibit characteristics such as small size, high density, and severe occlusion, which result in significant loss of fine-grained spatial information during the feature extraction process, particularly in lightweight object detection models. While the original Spatial Pyramid Pooling-Fast (SPPF) module in YOLO11n can expand the receptive field, it lacks the selective modeling ability for crucial target regions and salient channel information, limiting the model’s discriminative power for pest regions. To address this issue, this paper proposes a GAM-SPP module, which replaces the SPPF module in the YOLO11n model. The advantage of this module lies in its ability to retain multi-scale context fusion while integrating Channel Attention (CA) and Spatial Attention (SA), enabling more effective focus on the key information of rice pest regions. Figure 2 illustrates the proposed GAM-SPP module.

As depicted in Figure 2, the GAM-SPP module proposed in this paper is primarily composed of the Spatial Pyramid Pooling (SPP) [26] branch, CA, and SA components. In the GAM-SPP module, the input feature map is first subjected to channel compression through a 1 × 1 convolution, followed by multi-scale pyramid pooling. Three distinct receptive field 2D max-pooling operations are sequentially introduced, utilizing kernel sizes of 5 × 5, 9 × 9, and 13 × 13, with a stride of 1 and appropriate padding to preserve the spatial dimensions of the feature map. This design effectively models multi-scale spatial contextual information, facilitating the detection of prominent regions within pest targets of varying sizes.

Next, feature fusion and channel compression are carried out. The original feature map is concatenated with the three pooled feature maps along the channel dimension, and a 1 × 1 convolution is applied to reduce the channel count back to the input size. This step prevents channel dimension expansion, reduces both parameters and computational overhead, and seamlessly integrates multi-scale semantic features.

The CA mechanism enables the model to dynamically adjust the importance of different channels, assigning greater attention to more relevant features. Utilizing the SE structure, CA performs global 2D adaptive average pooling on the fused features, followed by two 1 × 1 convolutions, each with Rectified Linear Unit (ReLU) [27] and Sigmoid activations, respectively. The resulting channel-wise weight map is used to reweight the channel responses of the input features. Additionally, the SA mechanism models spatial saliency. SA is implemented by applying max-pooling and average-pooling along the channel dimension to generate two spatial maps. These maps are concatenated and processed through a 2D convolutional layer with a 7 × 7 kernel, followed by a Sigmoid activation to produce a spatial attention map. This map is used to reweight the feature map, allowing the model to focus more effectively on potential pest regions in rice leaves.

Finally, a 1 × 1 convolution is applied to further consolidate the feature map enhanced by the attention mechanism, ensuring that the output dimension matches the input size, thereby replacing the original SPPF module with the GAM-SPP.

2.2. Design of the C3-EFSA Module Based on Multi-Scale DWConv and Lightweight ECA Mechanism

In the practical task of pest detection in rice fields, the feature extraction capability of the C3k2 module in YOLO11n exhibits certain limitations under various disturbances such as leaf occlusion and motion blur. To address this, a novel and more efficient module, C3-EFSA, is proposed in this paper, offering enhanced expressive power. C3-EFSA’s parallel depth-wise branches and ECA boost small-pest signal strength without extra cost, keeping accuracy under motion. By replacing all C3k2 modules in the backbone of the YOLO11n network with this module, the aim is to effectively reduce false negatives and false positives in complex environments, thereby improving the overall detection accuracy and robustness of the model. As shown in Figure 3, this is a schematic representation of the proposed C3-EFSA module’s structural principles.

As shown in Figure 3, the C3-EFSA module design incorporates lightweight multi-scale receptive field modeling and the ECA mechanism. This design aims to enhance the model’s ability to express pest features under limited computational resources. The core structure of the module adopts a three-branch parallel design, with each branch capturing structural information at different scales and spatial distributions from the input features. Among them, two branches employ 3 × 3 and 5 × 5 DWConv, preceded by a 1 × 1 convolution, to extract local and contextual texture information while balancing accuracy and computational efficiency. The third branch uses a 3 × 3 grouped convolution, which introduces cross-channel interactions while reducing computational overhead, followed by Batch Normalization (BN) [28] and SiLU activation functions to enhance non-linear expressiveness. After concatenating the three feature maps along the channel dimension, they are fused using a 1 × 1 convolution. Then, BN and SiLU activation are applied to enhance non-linear representational capacity.

After the fusion of the three feature maps, the C3-EFSA module incorporates the ECA mechanism. Unlike traditional attention mechanisms, such as SE and Convolutional Block Attention Module [29], ECA utilizes a lightweight 1D convolution, avoiding channel dimension compression and fully connected operations. The attention module first performs global 2D adaptive average pooling on the fused feature map to obtain the global response vector for each channel. It then enables local channel interactions via a 1D convolution with a small kernel, typically size 3. Finally, the Sigmoid activation function generates attention coefficients for each channel, which are applied to recalibrate the original feature map through weighted adjustments. As a result, the C3-EFSA module, augmented with ECA, significantly enhances the model’s focus on key regions of rice pest infestations. At the same time, it suppresses irrelevant background areas, improving the model’s ability to distinguish small rice pest targets.

Additionally, the C3-EFSA module preserves the residual connection structure [30], where the module output is element-wise added to the input when the number of input and output channels are equal. This retains low-level feature information and facilitates stable gradient propagation in deep networks. This design further enhances the convergence speed during training and improves the overall stability of the network.

2.3. Lightweight Ghost Convolution Optimization Module with SE Attention Integration

To enhance the model’s robustness in complex rice field scenarios, this paper optimizes the original Detect structure of YOLO11n. It introduces the EGDetect module to achieve this improvement. This module preserves the network’s lightweight nature while incorporating structural enhancements and attention mechanisms. As a result, it improves feature representation and enhances the model’s ability to focus on targets for rice pest detection.

As shown in Figure 4, the original Detect module of YOLO11n consists of multiple standard convolution layers for bounding box regression and object classification. However, standard convolutions exhibit high feature redundancy and limited expressiveness, particularly when dealing with small or low-contrast targets. To reduce feature redundancy and enhance the focus on key target regions, this paper proposes a novel EGConv module based on Ghost Convolution [31], which replaces the 3 × 3 and 1 × 1 standard convolutions in the original YOLO11n architecture. Figure 5 illustrates the conceptual structure of the proposed EGConv module.

As illustrated in Figure 5, the core concept of EGConv is to generate core features through a subset of “real” convolutions (Primary Convolution), while the remaining features are “synthetically” generated from existing ones using computationally efficient operations (Cheap Operation), specifically DWConv. This approach significantly reduces computational costs. Specifically, EGConv first applies a standard convolution for initial feature extraction, where the number of output channels in the main branch contributes to a portion of the final output. These features are then passed through a DWConv to generate the remaining output channels, enabling richer feature representations. The outputs from both the primary and cheap branches are subsequently concatenated along the channel dimension to form the complete output.

To further enhance the model’s focus on key target regions, this paper introduces a lightweight channel attention mechanism, the SE module, into the EGConv module. The SE module extracts the response statistics of each channel through global 2D adaptive average pooling of the feature map. It then generates channel weights via two 1 × 1 2D convolutions to recalibrate the importance of each channel in the feature map. This mechanism effectively amplifies significant features related to pest targets while suppressing background noise and irrelevant regions. The attention output is applied to all concatenated channel features. A truncation operation ensures the final output’s channel count aligns with expectations, maintaining full compatibility with the original Detect module.

Furthermore, to enhance the network’s nonlinear expressive capability, this paper uniformly replaces all ReLU activation functions used in the EGConv module with the SiLU activation function. The SiLU activation function is defined as shown in Equation (1):

S i L U (x) = x \cdot σ (x) = \frac{x}{1 + e^{- x}}

(1)

where

x

denotes each element of the input feature map and

σ (x)

represents the Sigmoid activation function, the SiLU activation function offers superior gradient continuity and expressive capability compared to the traditional ReLU. Particularly in tasks such as object regression and small object detection, it leads to more stable performance. The advantage of replacing the activation function lies in its ability to enhance optimization convergence and improve accuracy without altering the network’s structural dimensions.

In this paper, all 3 × 3 and 1 × 1 standard convolution modules in the original YOLO11n Detect are replaced with the new EGConv module to enhance the model’s feature generation, attention fusion capabilities, and activation function optimization. The improved version of the Detect module is named EGDetect, as illustrated in Figure 6.

2.4. Design of the Enhanced Loss Function FECIoU

In rice pest detection, standard Intersection over Union (IoU) [32] and its variants, like CIoU, show strong geometric modeling in bounding box regression. However, they struggle with small, occluded, and irregularly shaped targets, often failing to achieve optimal localization accuracy in capturing pest morphology and handling overlaps. To address this, we improve the CIoU loss in the original YOLO11n model and propose FECIoU, an enhanced loss function optimized for complex pest detection. The traditional CIoU loss

L_{C I o U}

is defined in Equation (2):

L_{C I o U} = 1 - I o U + \frac{ρ^{2} (b, b^{g})}{c^{2}} + α v

(2)

Here,

ρ^{2} (b, b^{g})

represents the squared Euclidean distance between the centers of the predicted and ground truth boxes.

c^{2}

denotes the squared diagonal length of the minimal enclosing box. The aspect ratio discrepancy is quantified by

v = \frac{4}{π^{2}} {(a r c t a n (\frac{w_{g}}{h_{g}}) - a r c t a n (\frac{w_{p}}{h_{p}}))}^{2}

, where

w_{g}

and

h_{g}

are the width and height of the ground truth box, and

w_{p}

and

h_{p}

are those of the predicted box. The aspect ratio penalty

α = \frac{v}{1 - I o U + v}

is a dynamic weight. However, in rice pest detection, where targets can be extremely small or elongated, the ground truth box height

h

may approach zero. In such cases, the value of

a r c t a n (w / h)

becomes excessively large, leading to numerical instability or gradient explosion. This instability significantly impairs the training process, especially when handling extreme target shapes, degrading the model’s convergence.

To address this issue, a small positive number

ϵ

is introduced into the calculation, improving the aspect ratio discrepancy term

v

in CIoU, as shown in Equation (3):

v = \frac{4}{π^{2}} {(\arctan (\frac{w_{g}}{h_{g} + ϵ}) - \arctan (\frac{w_{p}}{h_{p} + ϵ}))}^{2}

(3)

In Equation (2), the weight term

α

may experience significant gradient fluctuations when the IoU is very small. To mitigate this issue and smooth gradient variations,

α

is redefined as shown in Equation (4):

α = \frac{v}{v + (1 - I o U + ϵ)}

(4)

In addition, to further enhance the model’s ability to fit samples, this paper introduces a difficult sample weighting mechanism similar to Focal Loss [33] in the loss function. This mechanism utilizes a simple IoU index weighting term

(1 - I o U)^{γ}

to enhance the gradient influence of low-quality predictions, allowing the model to focus more on small targets, occluded or highly biased samples during training. Here,

γ

is the focusing parameter that controls the intensity of this adjustment. In the experiment, in order to appropriately enhance the contribution of difficult samples without excessively disturbing normal sample training,

γ = 0.05

was set. Setting

γ

to 0.05 yields a gentle yet non-zero increase in gradient for low-IoU samples, emphasizing hard cases without amplifying gradient noise or destabilizing optimization. This value aligns with the original Focal-Loss design principle, ensuring numerical stability while encouraging refinement of faint or partially occluded pests. Finally, the FECIoU loss function is expressed as Equation (5):

L_{F E C I o U} = (1 - I o U + \frac{ρ^{2}}{c^{2}} + α v) \cdot (1 - I o U)^{γ}

(5)

3. Experiments and Discussion

3.1. Dataset Preparation and Augmentation Methods

3.1.1. Data Collection and Splitting

IP102 is the largest publicly available pest image dataset in agriculture, containing 75,222 images across 102 pest categories. However, it includes many duplicates, irrelevant, and misclassified images, leaving only a few usable samples. In this study, we extracted rice pest images, removed duplicates, cleaned the data, and manually reviewed it with the help of experts from our university’s Plant Protection College, resulting in 4226 valid samples. The rice pests were annotated in YOLO format using LabelImg 1.8.6. The annotation adopts a double-blind protocol, with two people independently annotating. If there are differences in the annotation boxes or categories, they will be reviewed by experts from our university’s Plant Protection College. The new dataset, named RicePest-10 (RP10), contains 10 rice pest categories, shown in Figure 7. These include Curculionidae (weevils), Delphacidae (planthoppers), Cicadellidae (leafhoppers), Phlaeothripidae (thrips), Cecidomyiidae (gall midges), Hesperiidae (skippers), Crambidae (grass moths), Chloropidae (grass flies), Ephydridae (shore flies), and Noctuidae (owlet moths). These 10 types are the most common rice pests. The class imbalance was not corrected during the sampling phase, but random partitioning was ensured during dataset partitioning to maintain a natural distribution for all subsets. RP10 was split into training (3375 images), validation (424 images), and test (427 images) sets in an 8:1:1 ratio, ensuring even distribution of pest categories across subsets for fair evaluation and effective training, as shown in Table 1. Data augmentation techniques, including random brightness, motion blur, random occlusion, and salt-and-pepper noise, were applied to expand the dataset, resulting in the final RP10 dataset for the experiments.

3.1.2. Data Augmentation

To enhance the generalization and robustness of the model, four data augmentation techniques were applied to the training set: random brightness, motion blur, random rectangular occlusion, and salt-and-pepper noise. These methods simulate various real-world interferences, such as complex lighting changes, plant oscillations, camera shake, leaf occlusions, and imaging noise, in rice paddy fields. Specifically, random brightness simulates light variations under weak light before and after sunrise, overexposure at noon, and low light at dusk; motion blur replicates the blurring and trailing effects caused by device shake or wind-induced plant movement; random rectangular occlusion mimics partial or large-area occlusion of pests by rice leaves or panicles; salt-and-pepper noise simulates noise from imaging devices or the environment. By applying these four augmentation techniques, the training set was expanded to double its original size, reaching 6750 images, while the validation and test sets remained unchanged, thereby improving the model’s generalization ability and robustness. Examples of the augmented training set are shown in Figure 8.

3.2. Experimental Environment and Parameter Settings

To verify the effectiveness of the model, the experiments in this study were conducted on a Windows 10 operating system using the PyTorch 2.2.2 deep learning framework, with Python 3.11.11 as the programming language. The GPU used was an NVIDIA GeForce RTX 4070 Ti SUPER with 16 GB of VRAM. Additionally, the model was trained with a batch size of 32, 200 total training epochs, an initial learning rate of 0.01, a momentum factor of 0.937, and a weight decay coefficient of 0.0005 for the optimizer. It is worth mentioning that seed is set to 0. When the random seed is set to 0, all random operations such as model weight initialization, online data augmentation, sample sampling order, etc., are locked. In the same experimental environment, the training process can be replicated, so the results of each experiment are completely consistent. The specific parameter settings are shown in Table 2.

3.3. Evaluation Metrics

To comprehensively and objectively evaluate the performance of the proposed GAFNet model in rice pest detection, this study assesses the model’s performance based on Precision, Recall, mAP, Parameters, and FLOPs.

P = \frac{T P}{T P + F P}

(6)

where P denotes Precision, representing the proportion of targets predicted by the model to belong to a specific category, and serves as an indicator of the model’s false positive rate. A higher P corresponds to fewer false positives. Additionally,

T P

(True Positive) refers to the number of accurately detected pest bounding boxes, while

F P

(False Positive) indicates the number of erroneously identified boxes.

R = \frac{T P}{T P + F N}

(7)

where R denotes Recall, representing the proportion of true pest targets successfully detected by the model, reflecting the model’s false negative rate.

F N

(False Negative) refers to the number of missed targets. A higher R indicates fewer missed detections. In this study, mAP is calculated by averaging the AP (Average Precision) of all 10 rice pest categories, offering a comprehensive evaluation of the model’s localization and classification performance. The definition of

A P_{c}

is as follows:

A P_{c} = \int_{0}^{1} P_{c} (R) d R

(8)

where c represents a specific class and

A P_{c}

denotes the average precision for class

c

. Therefore, the calculation of mAP is given by Equation (9):

m A P = \frac{1}{N} \sum_{c = 1}^{N} A P_{c}

(9)

Here,

N

represents the total number of classes.

Moreover, the number of parameters and FLOPs are crucial metrics for evaluating both the model’s performance and resource consumption. The number of parameters refers to the total count of all trainable weights and biases within the model, measured in “millions” (M), directly influencing the model’s storage requirements and peak memory usage, which is especially critical in edge deployment scenarios where storage and memory constraints are paramount. FLOPs, on the other hand, indicate the number of floating-point operations required for the model to perform a single forward inference on an input image of size 640 × 640, measured in “billions” (G). FLOPs serve as a standard for assessing computational complexity, and lower FLOPs can reduce inference latency and power consumption, making it more suitable for real-time processing applications. By considering these metrics, one can effectively assess the practical performance of GAFNet and provide a basis for its optimization.

3.4. Training Progress and Classification Performance of GAFNet

As illustrated in Figure 9, the loss curve for different epochs during the training process demonstrates a smooth convergence trend, characterized by a sharp initial decline followed by a gradual reduction. Compared to YOLO11n, the GAFNet model proposed in this paper exhibits lower loss values and faster convergence, preliminarily validating the advantages of the proposed method during the model training phase. The evaluation results on the test dataset indicate that the GAFNet model achieves an accuracy of 89.8%, a Recall rate of 85.6%, and an mAP of 90.1%, while the model’s parameter count is only 2.45 M and its computational load is merely 5.0 GFLOPs. These results highlight the model’s strong performance-to-efficiency ratio, achieving high accuracy with minimal computational overhead.

Furthermore, the use of confusion matrices is essential, as it provides detailed classification information for each category, helping to identify performance discrepancies across specific classes and uncover potential biases or imbalance issues. This is invaluable for guiding model optimization. As shown in Figure 10, the confusion matrix of the GAFNet model’s performance on the rice pest classification task illustrates clear diagonal elements, demonstrating effective differentiation among all 10 categories. Among these, Cecidomyiidae, Crambidae, and Hesperiidae show the highest identification rates, indicating that the model has effectively learned the features of these categories. Delphacidae, Phlaeothripidae, and Ephydridae also exhibit favorable identification performance with relatively high accuracy. In contrast, Curculionidae and Cicadellidae have the lowest recognition rates, often being misclassified as background or other similar categories, primarily due to factors such as leaf obstruction, the small size of the pests, and complex backgrounds. Through the confusion matrix, GAFNet’s overall high recognition accuracy and exceptional category discrimination capability in the rice pest classification task are further validated.

3.5. Ablation Study and Contribution Analysis of Improvement Modules

To assess the contributions of various enhancements to model detection accuracy and lightweight optimization, this study conducts a progressive ablation experiment on the rice pest dataset. Using YOLO11n as the benchmark, we ensure strict consistency across hardware, software, training hyperparameters, and data augmentation strategies. Subsequently, we incrementally integrate GAM-SPP, C3-EFSA, EGDetect, and FECIoU into the network. The results of the ablation experiment, as presented in Table 3, highlight the contributions of each enhancement module in terms of both detection precision and model efficiency.

As shown in Table 3, we first replace the SPPF in YOLO11n with the proposed GAM-SPP. The modified model, with a slight increase in parameters and computational cost, improves mAP from 88.5% to 89.5%, and significantly boosts Precision by 1.9%. This demonstrates the advantage of multi-scale context and attention synergy in enhancing mAP and Precision. Building on this, we replace all C3k2 layers in the backbone with C3-EFSA, incorporating a three-way parallel structure with 3 × 3, 5 × 5 DWConv and 3 × 3 group convolutions. After fusion, we introduce the lightweight ECA, which not only increases Recall by 2.8% and elevates mAP to 89.9%, but also reduces parameters to 2.64 M and FLOPs to 5.7 G through the sparse computation of DWConv, achieving higher accuracy with lower computational cost. Subsequently, we replace the detection head with EGDetect, where all 3 × 3 and 1 × 1 standard convolutions are replaced by an enhanced composite structure combining EGConv, SE, and SiLU. The main and lightweight branches are concatenated for output, maintaining mAP while reducing parameters to 2.45 M and FLOPs to 5.0 G, further validating the lightweight potential of EGConv and SE attention in the detection head. Finally, the FECIoU loss is introduced, adding a minimal stabilization term

ϵ

and a difficulty-weighted factor

γ

to the IoU exponent in CIoU, further boosting Recall by 1.4% and achieving an mAP of 90.1%, with no increase in parameters or FLOPs. This confirms the loss function’s effectiveness in improving detection accuracy for small and partially occluded pests. Consequently, after all the synergistic improvements, the model achieves a 3.5% increase in Precision, a 4.2% increase in Recall, a 1.6% improvement in mAP, a reduction of 0.13M in parameters, and a decrease of 1.3 G in FLOPs, validating the effectiveness and complementarity of each module design. Although the performance improvement of GAFNet is incremental, it is more lightweight compared to YOLO11n, which makes its efficiency on edge hardware significantly higher. Under resource constrained deployment, GAFNet has a moderate mAP increase that translates into practical advantages.

3.6. Performance Comparison and Analysis of the GAFNet Model with Other Detection Models

To validate the advancement and practicality of the proposed method, a comparative analysis was conducted on the rice pest dataset with current mainstream detection models such as Faster R-CNN [34], SSD [35], RT-DETR [36], and the YOLO series [37,38,39,40]. The performance comparison results of various models are shown in Table 4.

As shown in Table 4, Faster R-CNN, SSD, and RT-DETR exhibit lower precision compared to the method proposed in this paper. They also have larger parameters and FLOPs, making them unsuitable for deployment on resource-constrained edge devices. In contrast, the mAP of YOLOv5n to YOLOv12n ranges from 81.8% to 88.7%, with parameters ranging from 1.97 M to 6.03 M and FLOPs from 6.3 G to 13.1 G. The proposed GAFNet model, with a lightweight configuration of 2.45 M parameters and 5.0 GFLOPs, achieves an impressive mAP of 90.1%, improving by 1.6% over the baseline YOLO11n, with accuracy and recall rates increasing by 3.5% and 4.2%, respectively.

To visually compare the performance of the proposed GAFNet with the YOLO series, a radar chart based on Table 4 is presented in Figure 11. The larger the area covered in the radar chart, the better the model’s performance. It is evident from the chart that GAFNet excels in accuracy, recall, mAP, and FLOPs. Compared to YOLOv9t, although YOLOv9t has slightly fewer parameters than GAFNet, GAFNet outperforms it by 8.3% in mAP and has a significantly higher recall rate. Even when compared with YOLOv10n, which has the highest mAP in the YOLO series, GAFNet still achieves 1.4% higher mAP, with 39% fewer FLOPs and fewer parameters. Therefore, GAFNet is more suitable for pest detection in rice fields compared to the aforementioned methods.

3.7. Performance Evaluation of the FECIoU Loss Function Compared to Other IoU-Based Loss Functions

To verify the effectiveness of the proposed FECIoU loss function, while keeping the remaining modules unchanged, we compared it with the current mainstream IoU-based loss function, such as CIoU, Distance-IoU (DIoU) [25], Efficient-IoU (EIoU) [41], Generalized-IoU (GIoU) [42], Scylla-IoU (SIoU) [43], Wise-IoU (WIoU) [44], and Alpha-IoU (

α

-IoU) [45], Comparative analysis was conducted. The comparison results of various loss functions are shown in Table 5.

As shown in Table 5, under identical experimental conditions, the mAPs of CIoU, DIoU, EIoU, GIoU, SIoU, WIoU, and

α

-IoU were 89.9%, 87.4%, 86.8%, 87.8%, 89.0%, 88.2%, and 87.1%, respectively, all lower than the FECIoU proposed in this paper. Further observation of the recall rate shows that FECIoU leads other loss functions with 85.6%, indicating a significant reduction in missed detections on slender targets. In terms of accuracy, FECIoU also leads with 89.8%, indicating that it can reduce background false alarms while maintaining target localization accuracy. Overall, FECIoU achieves a better balance between gradient magnitude and convergence stability by introducing the minimal stabilization term

ϵ

and difficulty-weighted factor

γ

, making it more suitable for practical scenarios such as rice field pests with small targets, elongated shapes, and complex backgrounds. Therefore, compared to other IoU-based loss functions, FECIoU is a better choice for detecting pests in rice fields.

3.8. Visual Comparison of Detection Capabilities Between GAFNet and Baseline Models

To more intuitively assess the detection capability and feature focusing effect of the GAFNet model in real rice field scenarios, Figure 12 shows the detection results for various pests. As illustrated, the top row shows the original RGB images without detection, the middle row shows the detection results from the original YOLO11n model with final detection boxes overlaid, and the bottom row displays the detection results from the GAFNet model with final detection boxes overlaid.

It is clear that the YOLO11n model suffers from false negatives and false positives in detecting certain pests. For example, in column (d), Curculionidae is partially missed; in column (c), a Delphacidae is misclassified as a Cicadellidae; and in column (e), a Cicadellidae is misclassified as a Phlaeothripidae. In contrast, the GAFNet model correctly detects all pests, accurately locates them with high confidence, demonstrating the robust performance of the improved model for pest detection in rice fields.

The heatmap, utilizing a gradient color effect, effectively visualizes the distribution of data across different regions, further highlighting the advantages and optimization of the GAFNet model. As shown in Figure 13, the GAFNet model significantly outperforms the original model in detection performance. The gradient from blue to red visually represents the model’s attention focus across the entire image: warm-colored regions consistently highlight the pest contours, while cold-colored, low-activation areas at the edges clearly delineate the background leaf veins. This indicates that the model effectively suppresses interference from leaf textures and rice panicle shadows.

In particular, for elongated pests, the red heatmap region extends continuously along the pest’s long axis without interruption, validating the stable regression capabilities of EGDetect and FECIoU for targets with extreme aspect ratios. In leaf-occluded scenes, the highlighted pixels focus exclusively on the exposed core texture of the pest, with the background remaining predominantly cool, reflecting the multi-scale attention of GAM-SPP in filtering out redundant information. In densely clustered scenes, each pest corresponds to a distinct heatmap spot, with no large-scale dispersion between neighboring individuals. This demonstrates the ECA channel attention mechanism of C3-EFSA in successfully distinguishing high-frequency texture variations. Overall, the heatmap closely aligns with manually annotated regions, providing clear validation of the lightweight improvements in accurately focusing on key pest features under complex field conditions.

3.9. Evaluation of the Generalization Performance and Robustness of the GAFNet Model

To thoroughly evaluate the generalization ability of the GAFNet model, this study tests its robustness and transferability across different crops and collection conditions using the publicly available AgroPest-12 dataset. This dataset differs significantly from the previously used ones in terms of pest categories, image collection environments, and distribution characteristics. It effectively tests the model’s ability to adapt to unknown agricultural scenarios. As shown in Figure 14, these are example images for each pest category in the AgroPest-12 dataset.

As shown in Figure 14, the dataset includes various crop scenes such as rice, vegetables, and fruit trees, with a total of 13,143 images. These are divided into 11,502 training images, 1095 validation images, and 546 test images. Compared to the previously used datasets, AgroPest-12 introduces several new pest types beyond rice pests, testing the model’s ability to classify other pest categories. Additionally, the images were collected from different regions, lighting conditions, and using various equipment. The background complexity and lighting variation are significantly higher than in the single rice field scenario, posing greater challenges for the model’s multi-scale perception. In this study, YOLO11n and GAFNet were both trained and tested on the AgroPest-12 dataset under the same experimental conditions. Table 5 shows a comparison of the generalization performance of YOLO11n and GAFNet on the AgroPest-12 test set, with some visual results presented in Figure 15.

As shown in Table 6, GAFNet achieves an accuracy of 82.7%, a recall rate of 71.8%, and an mAP of 75.7% on this dataset. This represents an improvement of 4.4% in both accuracy and recall, and a 4.0% increase in mAP compared to YOLO11n. In terms of model parameters and computational cost, GAFNet remains lightweight, with only 2.45 M parameters and 5.0 GFLOPs.

Additionally, some visualization results are shown in Figure 15. From the detection results of Caterpillars and Earthworms in columns (a) and (b), it is clear that YOLO11n misses some detections, whereas GAFNet successfully identifies them. In columns (c) and (d), the heatmaps for Ants and Bees show that GAFNet significantly outperforms the original model. The warm-colored highlighted areas precisely cover the contours of the insects, while the cooler, low-activation regions at the edges and background create a clear separation. This demonstrates that GAFNet has excellent generalization ability.

3.10. Embedded Edge Device Deployment Experiment

This study verifies the feasibility and deployment efficiency of GAFNet on edge devices by deploying it on the NVIDIA Jetson Orin Nano embedded device, as shown in Figure 14a. The software environment is configured with Python 3.8 and PyTorch 1.8, while the hardware is powered by an Arm Cortex-A78AE CPU and a 32-core Tensor GPU.

The experiment evaluates GAFNet’s practical feasibility for real-time pest detection in rice fields by measuring its detection speed in frames per second (FPS) on the edge device. As shown in Figure 16c–f, the frame rate of GAFNet on the NVIDIA Jetson Orin Nano remains stable between 19 and 24 FPS, demonstrating excellent real-time performance.

The practical deployment experiment demonstrates that the proposed GAFNet runs stably on resource-constrained edge devices, with response latency within acceptable limits. CPU and GPU resource utilization are efficient, and power consumption meets edge device requirements. This lays a solid foundation and provides technical support for future deployment of GAFNet in pest detection systems. Due to the lack of specific power consumption measurements, this is a limitation of the current work and will be tested under controlled battery conditions in the future. Although GAFNet has validated real-time capabilities on Nvidia Jetson Orin Nano, current edge testing is only based on a single platform, and performance on other edge devices remains to be evaluated. This work will be carried out in the future.

4. Conclusions

This paper addresses the key challenges in rice paddy pest detection, including small target size, high density, severe occlusion, complex background, and limited edge computation, by proposing a lightweight detection model, GAFNet, based on an improved YOLO11n. Firstly, a subset of the IP102 dataset, covering 10 rice pest species, is constructed, and data augmentation strategies, including random brightness, motion blur, random rectangular occlusion, and salt-and-pepper noise, are employed to enhance dataset diversity and model generalization. Secondly, the proposed GAM-SPP replaces the SPPF in YOLO11n, integrating channel and spatial dual attention while introducing multi-scale receptive fields, significantly improving the model’s ability to detect concealed pests. Thirdly, a new C3-EFSA module is introduced, replacing all C3k2 blocks in the backbone. By utilizing DWConv and grouped convolutions in a three-way parallel structure, followed by ECA integration, the model achieves a 2.8% increase in recall rate while reducing parameters and FLOPs. Next, EGDetect is designed, restructuring the 3 × 3 and 1 × 1 convolutions in Detect with a composite structure of EGConv, SE, and SiLU. This reduces the model to 2.45 M parameters and 5.0 GFLOPs while maintaining the mAP. Finally, the FECIoU loss function is proposed, introducing numerical stability terms and IoU-weighted indices for difficult samples to enhance recall in occlusion scenarios by an additional 1.4%. Experimental results show that GAFNet achieves 89.8% accuracy, 85.6% recall, and 90.1% mAP under a lightweight constraint of only 2.45 M parameters and 5.0 GFLOPs. Compared to the baseline YOLO11n, it improves accuracy, recall, and mAP by 3.5%, 4.2%, and 1.6%, respectively, while reducing parameters and computation by 5% and 21%. Cross-model comparisons demonstrate that GAFNet strikes the best balance between accuracy, model size, and inference cost, outperforming mainstream methods such as Faster R-CNN, SSD, RT-DETR, and YOLOv5n to YOLOv12n. In Integrated Pest Management (IPM), missed detections can result in delayed pesticide applications, allowing pest populations to proliferate and leading to infestations, increased chemical use and costs, and crop damage. Conversely, misdetections may prompt unnecessary pesticide applications, wasting resources and harming beneficial insects. Both scenarios undermine the fundamental principle of IPM, which is to optimize chemical use by applying fewer pesticides at the appropriate times. In summary, GAFNet provides a resource-efficient, accurate, and environmentally sustainable solution for early pest detection, enabling precision spraying and greener pest management in smart agriculture. Future work will focus on combining Transformer architecture, multimodal perception, and cross-seasonal transfer learning to further improve the model’s generalization and real-time performance under complex weather, lighting, and planting conditions.

Author Contributions

Conceptualization, Y.Z. and W.H.; methodology, Y.Z., W.H. and B.L.; software, T.C. and J.W.; validation, J.W. and Q.Z.; formal analysis, B.L. and Q.Z.; investigation, B.L., T.C. and W.H.; resources, Y.Z. and T.Y.; data curation, T.C. and W.H.; writing—original draft preparation, W.H.; writing—review and editing, Y.Z., T.Y. and W.H.; visualization, J.W. and Q.Z.; supervision, Y.Z. and T.Y.; project administration, Y.Z.; funding acquisition, Y.Z. and T.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jilin Provincial Department of Science and Technology-Free Exploration Basic Research, grant number YDZJ202301ZYTS408.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Lu, Y.; Tang, Y.; Zhang, J.; Liu, S.; Liang, X.; Li, M.; Li, R. Variations and Trends in Rice Quality across Different Types of Approved Varieties in China, 1978–2022. Agronomy 2024, 14, 1234. [Google Scholar] [CrossRef]
Conde, S.; Catarino, S.; Ferreira, S.; Temudo, M.P.; Monteiro, F. Rice Pests and Diseases Around the World: Literature-Based Assessment with Emphasis on Africa and Asia. Agriculture 2025, 15, 667. [Google Scholar] [CrossRef]
Kim, G.-E.; Son, C.-H.; Lee, S.J.C. ROI-aware multiscale cross-attention vision transformer for pest image identification. Comput. Electron. Agric. 2025, 237, 110732. [Google Scholar] [CrossRef]
Hu, Y.; Deng, X.; Lan, Y.; Chen, X.; Long, Y.; Liu, C. Detection of Rice Pests Based on Self-Attention Mechanism and Multi-Scale Feature Fusion. Insects 2023, 14, 280. [Google Scholar] [CrossRef]
Nyakuri, J.P.; Nkundineza, C.; Gatera, O.; Nkurikiyeyezu, K.; Mwitende, G.J.S.R. AI and IoT-powered edge device optimized for crop pest and disease detection. Sci. Rep. 2025, 15, 22905. [Google Scholar] [CrossRef] [PubMed]
Xie, J.; Lu, M.; Gao, Q.; Chen, L.; Zou, Y.; Wu, J.; Cao, Y.; Xu, N.; Wang, W.; Li, J. Intelligent Detection and Control of Crop Pests and Diseases: Current Status and Future Prospects. Agronomy 2025, 15, 1416. [Google Scholar] [CrossRef]
Kasinathan, T.; Singaraju, D.; Uyyala, S.R. Insect classification and detection in field crops using modern machine learning techniques. Inf. Process. Agric. 2021, 8, 446–457. [Google Scholar] [CrossRef]
Yang, Z.; Li, W.; Li, M.; Yang, X. Automatic greenhouse pest recognition based on multiple color space features. Int. J. Agric. Biol. Eng. 2021, 14, 188–195. [Google Scholar] [CrossRef]
Li, Y.; Ercisli, S. Data-efficient crop pest recognition based on KNN distance entropy. Sustain. Comput. Informatics Syst. 2023, 38, 100860. [Google Scholar] [CrossRef]
Guo, B.; Wang, J.; Guo, M.; Chen, M.; Chen, Y.; Miao, Y. Overview of Pest Detection and Recognition Algorithms. Electronics 2024, 13, 3008. [Google Scholar] [CrossRef]
Ebrahimi, M.A.; Khoshtaghaza, M.H.; Minaei, S.; Jamshidi, B. Vision-based pest detection based on SVM classification method. Comput. Electron. Agric. 2017, 137, 52–58. [Google Scholar] [CrossRef]
Deng, L.; Wang, Y.; Han, Z.; Yu, R. Research on insect pest image detection and recognition based on bio-inspired methods. Biosyst. Eng. 2018, 169, 139–148. [Google Scholar] [CrossRef]
Tian, Y.; Wang, S.; Li, E.; Yang, G.; Liang, Z.; Tan, M. MD-YOLO: Multi-scale Dense YOLO for small target pest detection. Comput. Electron. Agric. 2023, 213, 108233. [Google Scholar] [CrossRef]
Liu, Y.; Wang, Q.; Zheng, Q.; Liu, Y.J.A. YOLO-wheat: A more accurate real-time detection algorithm for wheat pests. Agronomy 2024, 14, 2244. [Google Scholar] [CrossRef]
Lu, Y.; Liu, P.; Tan, C.J.A. MA-YOLO: A Pest Target Detection Algorithm with Multi-Scale Fusion and Attention Mechanism. Agronomy 2025, 15, 1549. [Google Scholar] [CrossRef]
Wu, X.; Zhan, C.; Lai, Y.-K.; Cheng, M.-M.; Yang, J. Ip102: A large-scale benchmark dataset for insect pest recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 8787–8796. [Google Scholar]
Wang, Q.-J.; Zhang, S.-Y.; Dong, S.-F.; Zhang, G.-C.; Yang, J.; Li, R.; Wang, H.-Q.J.C. Pest24: A large-scale very small object data set of agricultural pests for multi-target detection. Comput. Electron. Agric. 2020, 175, 105585. [Google Scholar] [CrossRef]
Yang, S.; Zhou, G.; Feng, Y.; Zhang, J.; Jia, Z. SRNet-YOLO: A model for detecting tiny and very tiny pests in cotton fields based on super-resolution reconstruction. Front. Plant Sci. 2024, 15, 1416940. [Google Scholar] [CrossRef]
Wang, N.; Fu, S.; Rao, Q.; Zhang, G.; Ding, M. Insect-YOLO: A new method of crop insect detection. Comput. Electron. Agric. 2025, 232, 110085. [Google Scholar] [CrossRef]
Khanam, R.; Hussain, M.J. Yolov11: An overview of the key architectural enhancements. arXiv 2024. [Google Scholar] [CrossRef]
Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H.J. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017. [Google Scholar] [CrossRef]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
Elfwing, S.; Uchibe, E.; Doya, K.J.N. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef]
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520. [Google Scholar]
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 19–24 June 2024; pp. 16965–16974. [Google Scholar]
Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022. [Google Scholar] [CrossRef]
Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 19–24 June 2023; pp. 7464–7475. [Google Scholar]
Wang, C.-Y.; Yeh, I.-H.; Mark Liao, H.-Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 29 September–4 October 2024; pp. 1–21. [Google Scholar]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
Zhang, Y.-F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T.J.N. Focal and efficient IOU loss for accurate bounding box regression. Neural Netw. 2022, 506, 146–157. [Google Scholar] [CrossRef]
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 658–666. [Google Scholar]
Gevorgyan, Z.J. SIoU loss: More powerful learning for bounding box regression. arXiv 2022. [Google Scholar] [CrossRef]
Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023. [Google Scholar]
He, J.; Erfani, S.; Ma, X.; Bailey, J.; Chi, Y.; Hua, X.-S. Alpha-IoU: A family of power intersection over union losses for bounding box regression. Adv. Neural Inf. Process. Syst. 2021, 34, 20230–20242. [Google Scholar]

Figure 1. The principle framework diagram of the GAFNet rice pest detection model based on global attention fusion and multi-scale feature optimization.

Figure 2. Schematic diagram of the GAM-SPP structural principle.

Figure 3. Schematic diagram illustrating the structure of the C3-EFSA module. In this diagram, c1 denotes the input channels, and c2 represents the output channels.

Figure 4. The original structure of the YOLO11n Detect module.

Figure 5. Schematic diagram of the proposed EGConv module.

Figure 6. Schematic diagram of the EGDetect architecture.

Figure 7. Example of different types of rice pest infestations. (a) Image of Curculionidae; (b) Image of Delphacidae; (c) Image of Cicadellidae; (d) Image of Phlaeothripidae; (e) Image of Cecidomyiidae; (f) Image of Hesperiidae; (g) Image of Crambidae; (h) Image of Chloropidae; (i) Image of Ephydridae; (j) Image of Noctuidae.

Figure 8. Examples of augmented training set images. (a) Augmented image of Curculionidae; (b) Augmented image of Delphacidae; (c) Augmented image of Cicadellidae; (d) Augmented image of Phlaeothripidae; (e) Augmented image of Cecidomyiidae; (f) Augmented image of Crambidae.

Figure 9. Comparison of loss curves for GAFNet and YOLO11n on the training set across different epochs.

Figure 10. Confusion matrix for rice pest classification using the GAFNet model.

Figure 11. Radar chart comparing the performance of different detection models.

Figure 12. Comparison of detection results across different models. (a) Detection results for Curculionidae; (b) Detection results for Hesperiidae; (c) Detection results for Delphacidae; (d) Detection results for Curculionidae; (e) Detection results for Cicadellidae.

Figure 13. Heatmap visualization results across different models.

Figure 14. Example images of each pest category in the AgroPest-12 dataset. (a) Ants; (b) Bees; (c) Beetles; (d) Caterpillars; (e) Earthworms; (f) Earwigs; (g) Grasshoppers; (h) Moths; (i) Slugs; (j) Snails; (k) Wasps; (l) Weevils.

Figure 15. Visualization results of the generalization experiment. (a) Detection results for Caterpillars; (b) Detection results for Earthworms; (c) Heatmap for Ants; (d) Heatmap for Bees.

Figure 16. GAFNet deployment on the Nvidia Jetson Orin Nano embedded device. Panel (a) shows the device used for the experiment. Panel (b) shows the device used for result display. Panels (c–f) show detection speed and performance for Curculionidae, Cicadellidae, Delphacidae, and Phlaeothripidae, respectively, with FPS indicating processing efficiency.

Table 1. Distribution statistics of different types of pest data.

Class	Train	Validation	Test	Total
Curculionidae	658	84	81	823
Delphacidae	586	71	72	729
Cicadellidae	508	57	71	636
Phlaeothripidae	227	33	28	288
Cecidomyiidae	292	45	33	370
Hesperiidae	313	42	38	393
Crambidae	370	44	47	461
Chloropidae	121	16	16	153
Ephydridae	131	15	21	167
Noctuidae	169	17	20	206
Total	3375	424	427	4226

Table 2. Parameter settings for model training experiments.

Parameter	Setting
epochs	200
batch	32
workers	0
device	0
patience	100
imgsz	640
optimizer	SGD
lr0	0.01
momentum	0.937
weight_decay	0.0005
seed	0

Table 3. Ablation study results on the contribution of improvement modules to model performance enhancement. × denotes not using this module, and √ denotes using this module.

GAM-SPP	C3-EFSA	EGDetect	FECIoU	P/%	R/%	mAP/%	Parameters/M	FLOPs/G
×	×	×	×	86.3	81.4	88.5	2.58	6.3
√	×	×	×	88.2	81.9	89.5	2.83	6.5
√	√	×	×	88.5	84.7	89.9	2.64	5.7
√	√	√	×	88.9	84.2	89.9	2.45	5.0
√	√	√	√	89.8	85.6	90.1	2.45	5.0

Table 4. Performance comparison of different models.

Model	P/%	R/%	mAP/%	Parameters/M	FLOPs/G
Faster R-CNN	47.1	71.4	67.7	137.10	370.2
SSD	78.1	58.6	67.4	23.75	60.9
RT-DETR	86.2	80.9	83.7	32.00	103.5
YOLOv5n	85.3	77.6	85.2	2.50	7.1
YOLOv6n	84.9	78.1	85.5	4.23	11.8
YOLOv7-tiny	83.4	75.1	82.7	6.03	13.1
YOLOv8n	85.6	78.5	85.0	3.01	8.1
YOLOv9t	87.2	70.3	81.8	1.97	7.6
YOLOv10n	87.0	82.4	88.7	2.70	8.2
YOLO11n	86.3	81.4	88.5	2.58	6.3
YOLOv12n	85.4	80.7	86.0	2.56	6.3
GAFNet	89.8	85.6	90.1	2.45	5.0

Table 5. Performance comparison of different loss functions.

Loss Function	P/%	R/%	mAP/%
CIoU	88.9	84.2	89.9
DIoU	86.6	83.0	87.4
EIoU	87.2	79.4	86.8
GIoU	84.1	83.0	87.8
SIoU	87.5	85.1	89.0
WIoU	85.8	82.1	88.2
$α$ -IoU	89.7	78.7	87.1
FECIoU	89.8	85.6	90.1

Table 6. Comparison of generalization performance between YOLO11n and GAFNet on AgroPest-12.

Model	P/%	R/%	mAP/%	Parameters/M	FLOPs/G
YOLO11n	78.3	67.4	71.7	2.58	6.3
GAFNet	82.7	71.8	75.7	2.45	5.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhou, Y.; Huang, W.; Liu, B.; Chen, T.; Wang, J.; Zhang, Q.; Yang, T. Lightweight GAFNet Model for Robust Rice Pest Detection in Complex Agricultural Environments. AgriEngineering 2026, 8, 26. https://doi.org/10.3390/agriengineering8010026

AMA Style

Zhou Y, Huang W, Liu B, Chen T, Wang J, Zhang Q, Yang T. Lightweight GAFNet Model for Robust Rice Pest Detection in Complex Agricultural Environments. AgriEngineering. 2026; 8(1):26. https://doi.org/10.3390/agriengineering8010026

Chicago/Turabian Style

Zhou, Yang, Wanqiang Huang, Benjing Liu, Tianhua Chen, Jing Wang, Qiqi Zhang, and Tianfu Yang. 2026. "Lightweight GAFNet Model for Robust Rice Pest Detection in Complex Agricultural Environments" AgriEngineering 8, no. 1: 26. https://doi.org/10.3390/agriengineering8010026

APA Style

Zhou, Y., Huang, W., Liu, B., Chen, T., Wang, J., Zhang, Q., & Yang, T. (2026). Lightweight GAFNet Model for Robust Rice Pest Detection in Complex Agricultural Environments. AgriEngineering, 8(1), 26. https://doi.org/10.3390/agriengineering8010026

Article Menu

Lightweight GAFNet Model for Robust Rice Pest Detection in Complex Agricultural Environments

Abstract

1. Introduction

2. Methods

2.1. Multi-Scale Feature Enhancement Mechanism Based on Global Attention Fusion

2.2. Design of the C3-EFSA Module Based on Multi-Scale DWConv and Lightweight ECA Mechanism

2.3. Lightweight Ghost Convolution Optimization Module with SE Attention Integration

2.4. Design of the Enhanced Loss Function FECIoU

3. Experiments and Discussion

3.1. Dataset Preparation and Augmentation Methods

3.1.1. Data Collection and Splitting

3.1.2. Data Augmentation

3.2. Experimental Environment and Parameter Settings

3.3. Evaluation Metrics

3.4. Training Progress and Classification Performance of GAFNet

3.5. Ablation Study and Contribution Analysis of Improvement Modules

3.6. Performance Comparison and Analysis of the GAFNet Model with Other Detection Models

3.7. Performance Evaluation of the FECIoU Loss Function Compared to Other IoU-Based Loss Functions

3.8. Visual Comparison of Detection Capabilities Between GAFNet and Baseline Models

3.9. Evaluation of the Generalization Performance and Robustness of the GAFNet Model

3.10. Embedded Edge Device Deployment Experiment

4. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI