Lightweight Pest Object Detection Model for Complex Economic Forest Tree Scenarios

Cheng, Xiaohui; Wang, Xukun; Kang, Yanping; Deng, Yun; Lu, Qiu; Tang, Jian; Shi, Yuanyuan; Zhao, Junyu

doi:10.3390/insects16090959

Open AccessArticle

Lightweight Pest Object Detection Model for Complex Economic Forest Tree Scenarios

by

Xiaohui Cheng

^1,2,

Xukun Wang

¹

,

Yanping Kang

^1,2,*,

Yun Deng

^1,2,

Qiu Lu

^1,2,

Jian Tang

³,

Yuanyuan Shi

³

and

Junyu Zhao

³

¹

College of Computer Science and Engineering, Guilin University of Technology, Guilin 541004, China

²

Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin 541004, China

³

Guangxi Forestry Research Institute, Nanning 530002, China

^*

Author to whom correspondence should be addressed.

Insects 2025, 16(9), 959; https://doi.org/10.3390/insects16090959

Submission received: 12 August 2025 / Revised: 8 September 2025 / Accepted: 11 September 2025 / Published: 12 September 2025

(This article belongs to the Special Issue Artificial Intelligence (AI) and Insect Pests Management: Securing Food Security, Human Health, and Natural Resources)

Download

Browse Figures

Review Reports Versions Notes

Simple Summary

Pests in economic forests are often difficult to detect due to their resemblance to backgrounds, and traditional manual monitoring is inefficient and prone to missed detections. This study proposes LightFAD-DETR, a lightweight model specifically designed for pest identification in complex forest scenarios. The model accurately identifies and pinpoints tiny pests while distinguishing their growth stages. Its innovative design significantly reduces computational demands, enabling direct deployment on edge devices. This technology helps reduce excessive pesticide use, maintain tree health, and safeguard yields of fruits and timber, providing practical support for sustainable forestry management.

Abstract

Pest control in economic forests is a crucial aspect of sustainable forest resource management, yet it faces bottlenecks such as low efficiency and high miss rates for small objects. Based on the RT-DETR model, this paper proposes LightFAD-DETR, a lightweight architecture integrated with feature aggregation diffusion, designed for complex economic forest scenarios. Firstly, by employing the YOLOv9 lightweight backbone network to compress the computational base load, we introduce the RepNCSPELAN4-CAA module, which integrates re-parameterization techniques and one-dimensional strip convolution. This enhances the model’s ability for cross-regional modeling of slender insect morphologies. Secondly, a feature aggregation diffusion network is designed, incorporating a dimension-aware selective integration mechanism. This dynamically fuses shallow detail features with deep semantic features, effectively mitigating information loss for small objects occluded by foliage. Finally, a re-parameterized batch normalization technique is introduced to reconstruct the AIFI module. Combined with a progressive training strategy, this eliminates redundant parameters, thereby enhancing inference efficiency on edge devices. Experimental validation demonstrates that compared to the baseline RT-DETR model, LightFAD-DETR achieves a 1.4% improvement in

{m A P}_{0.5 : 0.95}

, while reducing parameters by 41.7% and computational load by 35.0%. With an inference speed reaching 106.3 FPS, the method achieves balanced improvements in both accuracy and lightweight design.

Keywords:

economic forest pest control; RT-DETR; feature aggregation; small object detection

1. Introduction

Economic forest trees, including high-value species such as rubber, tea oil camellia, walnut, chestnut, eucalyptus, poplar, and fruit trees, have become vital pillars for national ecological security barriers, strategic timber resource reserves, and rural revitalization. Their economic value is constituted by providing core income sources for forest farmers and supporting extensive industrial chains encompassing timber processing, forest chemical industries, fruit, and medicinal material production [1,2]. Various pests, such as longhorn beetles, pine caterpillars, scale insects, as well as wood-boring and foliage-feeding insects, pose pervasive and severe threats to these economically important trees, which are characterized by long growth cycles and exposure to open environments. The explosive, sudden, and devastating nature of pest infestations leads to stunted tree growth, deterioration of timber quality, significant yield reduction or even complete loss of fruit, and in severe cases, the death of entire stands. These consequences result in incalculable direct and indirect economic losses [3,4]. Furthermore, traditional control methods reliant on chemical pesticides are not only costly but also incur serious negative impacts, including environmental pollution, pesticide residues, and the destruction of natural enemies [5]. Economic forests often grow in heterogeneous environments that can be described as complex economic forest ecosystems. This concept refers to multi-species stands with diverse canopy structures and variable microclimatic conditions, which not only harbor diverse pest species but also generate intricate visual backgrounds that increase the difficulty of pest monitoring.

Current monitoring methods relying on manual inspections and physicochemical trapping exhibit significant limitations. Manual inspections are highly inefficient, making it difficult to cover vast and complex forest areas. Identification accuracy heavily depends on the inspector’s experience and is highly subjective, easily leading to missed detections and misjudgments. This approach struggles to capture early-stage, localized pest outbreaks, delaying optimal control timing and posing safety hazards in rugged terrain. Physicochemical trapping [6] has a narrow target range, primarily attracting specific pest species and providing limited information. It cannot precisely locate infested plants or quantitatively assess infestation severity, and it requires ongoing maintenance investment. These methods fall short in timeliness, accuracy, coverage, and cost-effectiveness, failing to meet the demands of modern precision forestry. Within intelligent pest monitoring systems, object detection technology [7] plays a pivotal core role. Its fundamental value lies in simultaneously achieving pest species identification and precise location pinpointing, providing an unprecedented richness in information dimensions. This enables the development of automated, intelligent monitoring systems based on ground-based sensing devices and artificial intelligence. Such systems can efficiently process massive volumes of image data, significantly enhance monitoring efficiency and coverage, reduce human error, improve identification accuracy and objectivity, and hold promise for providing early warnings of minor initial damage [8].

Although object detection technology holds great promise, its research and application in the specific field of economic forest pest control still faces significant challenges [9]. The complex and variable backgrounds of forest areas, such as interlaced branches and leaves, uneven lighting, shadows, and different growth stages, can severely interfere with target recognition [10]. In addition, the targets themselves present complex characteristics: pests are often small, morphologically diverse, camouflaged, or mimetic, with damage symptoms that vary widely in form and scale and can easily be confused with diseases or physiological stress [11]. These factors highlight the urgent need to improve the robustness and generalization ability of existing models in complex real-world scenarios. At the same time, deployment on mobile devices demands lightweight models and fast real-time inference. Here, a “lightweight model” refers to a neural network with reduced parameters and computational requirements, enabling real-time inference on resource-constrained devices.

However, most existing detection models still fall short of these requirements. Many lightweight pest detection methods are developed for crop fields or single-species environments, making them difficult to generalize to complex economic forest scenarios. Conventional approaches also struggle to preserve fine-grained details of tiny pests under strong background interference, resulting in high miss rates. Furthermore, although some models achieve competitive accuracy, they typically incur high computational costs, which hinder practical deployment on resource-constrained edge devices.

To overcome these gaps, this study proposes a lightweight detection paradigm tailored for economic forest pests. The planned scientific novelty lies in combining lightweight architectural design with enhanced cross-regional feature modeling and adaptive small-object representation, while maintaining real-time inference efficiency suitable for edge deployment. This integrated design directly addresses the dual challenges of accuracy in complex backgrounds and practicality for field applications, thus providing a new pathway for intelligent forest-pest monitoring. Based on these considerations, we hypothesize that a lightweight detection model integrating re-parameterized multi-branch convolution, adaptive feature aggregation, and efficient normalization strategies can effectively improve the recognition of small pest objects in complex forestry scenes while maintaining real-time inference speed and reduced computational cost.

Based on the RT-DETR framework, this paper proposes a lightweight detection model, LightFAD-DETR, for complex scenarios. Through multi-dimensional collaborative optimization, it achieves an efficient balance between speed, accuracy, and resource consumption. The main contributions of this paper are as follows:

A high-efficiency lightweight model, LightFAD-DETR, is proposed for detecting pests in economic forests. This model achieves an excellent balance between detection accuracy and computational efficiency.
The RepNCSPELAN4-CAA module is designed using a re-parameterization strategy that maintains a multi-branch convolution structure during training and fuses it into a single-path structure during inference, reducing computational latency. Moreover, the module incorporates one-dimensional strip convolutions to dynamically establish long-range spatial dependencies, enhancing the model’s capability to model cross-regional features of elongated pest targets.
A feature aggregation diffusion network is developed, incorporating a dimension-aware selective integration module. This design adaptively integrates deep semantic features with shallow texture details, mitigating information loss for small-scale pest targets and improving the recognition of subtle pest signs under complex leaf texture backgrounds.
An improved AIFI module with re-parameterized batch normalization is proposed. This enhancement adopts a progressive fusion approach to integrate dynamic linear components into the normalization process and ensures their fusion with adjacent linear layers during inference, thereby further reducing computational redundancy and optimizing the model’s execution performance on edge computing devices.

The remainder of this paper is organized as follows: Section 2 provides a brief overview of related work. Section 3 presents a detailed description of the LightFAD-DETR model and its internal components. Section 4 offers a comprehensive analysis of the experimental results, including ablation experiments and comparative experiments. Section 5 discusses the performance of the LightFAD-DETR model. Finally, Section 6 concludes the paper.

2. Related Work

2.1. Pest Object Detection

In this study, the term “pest object” denotes an individual insect or a visible pest-related symptom localized within an image. Early detection and accurate identification of pest infestations are fundamental to the implementation of effective plant protection measures. Object detection technology based on computer vision, with its non-contact and automation potential, is increasingly becoming a central focus in intelligent pest monitoring research. Tian et al. [12] proposed the MD-YOLO model for detecting small Lepidoptera pests on sticky traps. They developed a dual-path feature fusion network that combines feature extraction and feature aggregation paths, along with targeted data augmentation strategies such as random splicing and point scattering. The model is deployed via a cloud server, integrated into an Internet of Things system, and equipped with a cloud service platform based on a B/S architecture, enabling remote pest image detection and early warning.

Dong et al. [13] introduced a lightweight pest detection model called PestLite, which incorporates a Multi-Scale Feature Pooling Module to enhance detection accuracy for agricultural pests while significantly reducing model complexity, making it more suitable for real-time detection scenarios. Ye et al. [14] proposed an improved framework combining YOLOv8 [15] and SAHI [16], which enhances the detection of tiny tea pests through multi-module optimization. The framework achieves high precision while maintaining lightweight and real-time performance.

Yang et al. [17] developed an improved lightweight tea-pest detection model by introducing deformable convolutions to enhance feature extraction, using dynamic attention mechanisms for feature fusion optimization, and designing a novel decoupled head and improved non-maximum suppression algorithm. These improvements significantly enhance the detection accuracy of small and visually similar pest species in complex backgrounds, providing an efficient tool for pest management in tea plantations. Li et al. [18] proposed the IMLL-DETR model for detecting pests and diseases in lychee leaves. They designed the MDGA module to enhance multi-scale feature extraction and contextual fusion, introduced a P2 detection head to retain small object details, and restructured the loss function to improve localization accuracy, thereby significantly boosting performance in complex agricultural environments.

Chen et al. [19] developed the Pest-PVT model, which improves detection accuracy for dense, small-sized pests in the field by integrating anchor-free detection, adaptive sample selection, and split self-attention mechanisms. The study utilized the NVIDIA Jetson TX2 embedded platform, along with the JetPack development kit, CUDA, and the MMdetection framework, and developed a graphical interface using PySimpleGUI to deploy and apply the model on low-power edge devices. Zhao et al. [20] introduced the AC-YOLO model to address the challenge of stored grain pest detection. By integrating attention mechanisms into the YOLOv7 [21] framework, the model improves recognition accuracy for multiple small target categories, providing an effective solution for intelligent pest monitoring in grain storage facilities.

2.2. Small Object Detection

As one of the key challenges in the field of computer vision, small object detection has attracted increasing research attention in recent years. A wide range of innovative methods have emerged, with significant progress made in areas such as improving feature pyramids, introducing attention mechanisms, optimizing contextual information, and deploying lightweight models. Despite these advances, the inherent characteristics of small objects such as low resolution, limited information, and high sensitivity to background interference continue to pose considerable challenges. These factors make small object detection a central focus for current research and ongoing optimization. Deng et al. [22] addressed the challenges of feature coupling and information loss in small object detection by proposing an extended feature pyramid network. They added a high-resolution pyramid layer specifically for handling small objects and designed a feature texture transfer module to super-resolve feature maps and extract reliable regional details, effectively improving detection accuracy for small objects.

Jiang et al. [23] tackled core challenges in aerial imagery from drones, where small objects are prevalent, densely packed, and vary greatly in scale. They enhanced small object perception by replacing the standard detection head with a specialized head for tiny objects. A multi-scale feature extraction module was designed using multi-branch heterogeneous convolution to capture rich scale information. A bidirectional dense feature pyramid was also constructed to deeply integrate shallow detail with deep semantic information, greatly enhancing small object detection performance in drone images. Cheng et al. [24] conducted a systematic review of the key challenges in small object detection, including information loss, feature noise, sensitivity to bounding box shifts, and insufficient training samples. They proposed six categories of algorithmic strategies, namely sample-based approaches, scale-aware methods, attention mechanisms, feature mimicry, context modeling, and focused detection strategies. In addition, they built several datasets across related fields to support algorithm evaluation and development.

Wang et al. [25] focused on small object detection in autonomous driving scenarios. They introduced an innovative query-guided module called Q-block, which uses coarse localization to identify key regions and then guides fine-grained detection using high-resolution features. This method significantly improves the localization accuracy of distant small objects. Zhang et al. [26] addressed small object detection in industrial defect inspection. They designed a lightweight feature fusion network sensitive to detail, which improved detection performance for small-scale defects across multiple resolutions while maintaining real-time inference speed. This provides an efficient solution for industrial applications. Shen et al. [27] introduced a dynamic perception module for small objects that treats the feature pyramid as a temporal sequence, similar to video frames. By using three-dimensional convolution, the module focuses on multi-scale features and surrounding context of small objects. This enhances fine detail extraction and improves small object detection performance in remote sensing imagery.

3. Proposed Method

As shown in Figure 1, based on the RT-DETR [28] framework, this paper proposes a lightweight detection model called LightFAD-DETR designed for complex scenarios. Through multi-dimensional collaborative optimization, the model achieves an efficient balance between speed, accuracy, and resource consumption. First, the RepNCSPELAN4-CAA module employs re-parameterization technology to merge the multi-branch convolutions used during training into an equivalent single-path structure during inference, significantly reducing computational latency. At the same time, it integrates one-dimensional strip convolutions to dynamically capture long-range spatial dependencies, enhancing the model’s ability to represent cross-region associations for elongated pest targets. Second, a feature aggregation diffusion network (FADN) is designed, incorporating a dimension-aware selective integration (DASI) module. This enables dynamic fusion of deep semantic features and shallow detail features, alleviating information loss for small targets, which is especially beneficial for recognizing subtle pest signs under leaf texture interference. In addition, an improved AIFI module with re-parameterized batch normalization (RepBN) is introduced. Through a progressive strategy, dynamic linear components are integrated into the normalization process and fused with adjacent linear layers during inference, further eliminating redundant computations and enhancing efficiency for edge deployment. These innovative designs significantly reduce the model’s parameter count and computational complexity while maintaining high-resolution image processing capability, making it particularly suitable for resource-constrained edge devices.

As shown in Figure 1, based on the RT-DETR model, we first incorporate the first nine layers of the YOLOv9 [29] backbone to replace the original backbone of RT-DETR, which helps reduce the number of parameters and computational cost. Next, the improved RepNCSPELAN4-CAA module is used to replace the RepNCSPELAN4 module in the backbone and the RepC3 module in the CCFM component of RT-DETR. By applying re-parameterization techniques, the multi-branch structure used during training is merged into a single-path structure during inference, thereby reducing computational overhead. Then, a feature aggregation diffusion network is designed, and the DASI module is introduced to enhance the preservation of small object details and improve resistance to background interference. Finally, the AIFI module is improved with the integration of RepBN, forming the new RepBN-AIFI module, which helps reduce training fluctuations and improves inference speed. These components together form the LightFAD-DETR model. LightFAD-DETR effectively reduces both missed and false detections in complex scenarios, while balancing lightweight design, real-time performance, and accuracy, providing reliable technical support for intelligent forest-pest detection.

3.1. RepNCSPELAN4-CAA Module

The RepNCSPELAN4 module is an innovative design in YOLOv9 that deeply integrates re-parameterization techniques with cross-scale feature aggregation. It achieves efficient feature extraction and fusion through a multi-path collaborative mechanism. As shown in Figure 2, the module first expands the input feature channels using a 1 × 1 convolution and splits them into two paths. The main path adopts a re-parameterized cross-stage partial network (RepNCSP), which enhances feature representation through multi-branch convolutions during training and automatically merges them into an equivalent single-path structure during inference. This reduces computational latency while maintaining accuracy. The auxiliary path refines multi-level features through cascaded RepNCSP sub-modules. Each sub-module progressively enhances local features via channel compression, re-parameterized bottleneck layers, and residual connections. Finally, shallow detail features and deep semantic features are densely fused along the channel dimension. The output is then passed through a 1 × 1 convolution to compress the channels, resulting in a multi-scale feature map that combines rich spatial details with strong semantic representation. This module overcomes the representational limitations of traditional convolution by using re-parameterization techniques. It balances computational efficiency and feature diversity through a dynamic channel-splitting strategy and excels at capturing spatial topological relationships of multi-scale objects in complex backgrounds, providing a robust feature foundation for real-time high-precision detection.

The core advantage of the RepNCSPELAN4 module lies in its triple collaborative design of re-parameterization, cross-stage channel optimization, and multi-scale feature aggregation, enabling an efficient balance among speed, accuracy, and lightweight design in pest detection tasks. Compared to traditional modules, it significantly enhances feature representation through multi-branch convolutions during training, which are automatically merged into a single-branch structure during inference. This allows for a notable performance boost with virtually no additional computational cost. Its unique cross-stage channel splitting strategy divides the input features into a main path and a lightweight auxiliary path. The main path deepens feature abstraction through stacked re-parameterized bottleneck layers, while the auxiliary path preserves original information, effectively avoiding gradient redundancy and reducing computational load. This makes it especially suitable for deployment on edge devices. The ELAN-style multi-scale dense feature concatenation mechanism fuses shallow details with deep semantic information, equipping the model with strong discriminative capability for tiny pest objects, densely arranged objects, and complex backgrounds. While maintaining a lightweight structure, this module addresses the limitations of traditional models in real-time scenarios, such as missed or false detections of multi-scale objects, and introduces a novel design paradigm for efficient detection in high-resolution imagery.

Cai et al. [30] proposed the Context Anchor Attention (CAA) module, as shown in Figure 3.

The CAA module performs context modeling through multi-stage collaborative processing. First, the input feature

X

undergoes global average pooling, denoted as

P_{a v g}

, to compress spatial information. Then, a 1 × 1 convolution is applied to adjust the channels, generating an initial global context representation

F^{p o o l}

, which provides a semantic foundation for subsequent long-range modeling, as shown in Equation (1) [30]:

F^{p o o l} = {C o n v}_{1 \times 1} (P_{a v g} (X))

(1)

Subsequently, depthwise separable one-dimensional strip convolutions in the horizontal

(1 \times k_{b})

and vertical

(k_{b} \times 1)

directions are applied to model spatial dependencies along the width and height dimensions, respectively. The kernel size is dynamically adjusted as

k_{b} = 11 + 2 n

, where

n

represents the network depth, allowing the receptive field to be extended with linear complexity and better adapted to elongated object structures. As shown in Equations (2) and (3) [30]:

F^{w} = {D W C o n v}_{1 \times k_{b}} (F^{p o o l})

(2)

F^{h} = {D W C o n v}_{k_{b} \times 1} (F^{w})

(3)

Finally, the direction-sensitive feature

F^{h}

is mapped into a spatial attention map

A

. A lightweight convolution activated by the Sigmoid function generates spatial attention weights, which are dynamically fused with the local feature

P

through weighted integration. A residual connection is then applied to enhance the response in target regions, resulting in the output enhanced feature

F^{a t t n}

. As shown in Equations (4) and (5) [30]:

A = S i g m o i d ({C o n v}_{1 \times 1} (F^{h}))

(4)

F^{a t t n} = (A ⨀ P) \oplus P

(5)

Here, “

⨀

” denotes element-wise multiplication, and “

⨁

” denotes element-wise addition.

This process decomposes large-kernel convolutions into lightweight one-dimensional operations, which reduces computational complexity while accurately modeling cross-region associations of elongated objects in pest images. It also suppresses background noise and achieves a balanced representation of local details and global semantics.

The CAA module dynamically captures long-range contextual information in pest images using lightweight one-dimensional strip convolutions. It combines global average pooling to extract global features and leverages an adaptive attention mechanism to enhance feature responses in object regions. Its core advantage lies in replacing traditional large-kernel convolutions with depthwise separable convolutions in the horizontal and vertical directions. This significantly reduces computational cost while effectively addressing the issues of sparse context in large-scale objects and background noise in small objects. It is particularly effective at handling elongated targets such as long-bodied pests, achieving a balance between local detail preservation and global dependency modeling. The module provides efficient feature enhancement for multi-scale object detection in complex scenarios.

We embed the CAA module between the multi-scale feature concatenation and the final convolution in the RepNCSPELAN4 module. This integration uses a dynamic context anchoring strategy to further enhance the semantic awareness of feature fusion, thereby forming the RepNCSPELAN4-CAA module in LightFAD-DETR. This design retains the original advantages of re-parameterization and cross-stage channel optimization while addressing the challenges of extreme object scale variation and complex background interference in economic forest scenarios. The CAA module’s one-dimensional strip convolutions adaptively capture long-range cross-region dependencies, and the spatial attention mechanism softly filters the concatenated multi-level features, focusing on strengthening the response in target regions while suppressing background noise.

3.2. Feature Aggregation Diffusion Network Integrated with Dimension-Aware Selective Integration Module

Xu et al. [31] proposed the DASI module, as shown in Figure 4. The DASI module adaptively selects and fuses multi-level features, dynamically balancing high-dimensional semantic information with low-dimensional detail features, thereby enhancing the saliency of small pest objects.

The DASI module first aligns the high-dimensional feature

F_{h} \in R^{H_{h} \times W_{h} \times C_{h}}

, the low-dimensional feature

F_{l} \in R^{H_{l} \times W_{l} \times C_{l}}

, and the current layer feature

F_{u} \in R^{H \times W \times C}

to the same spatial size. Then, these features are evenly divided into four parts along the channel dimension. As shown in Equation (6) [31]:

h_{i}, l_{i}, u_{i} = s p l i t (F_{h}), s p l i t (F_{l}), s p l i t (F_{u}), i = 1,2, 3,4

(6)

Here, each split has the shape

R^{H \times W \times \frac{C}{4}}

, ensuring that features from different levels are aligned in both spatial and channel dimensions.

Subsequently, each split of the current layer feature

u_{i}

is passed through a Sigmoid function to generate the weight

α

, which is then used to dynamically fuse high-dimensional contextual features with low-dimensional detail features based on the value of

α

. As shown in Equation (7) [31]:

α = s i g m o i d (u_{i})

(7)

Based on the weight

α

, the high-dimensional feature

h_{i}

and the low-dimensional feature

l_{i}

are linearly fused. As shown in Equation (8) [31]:

u_{i}^{'} = α ⨀ l_{i} + (1 - α) ⨀ h_{i}

(8)

Here,

⨀

denotes element-wise multiplication. When

α > 0.5

, the model favors low-dimensional details; otherwise, it focuses on high-dimensional context, enabling task-driven adaptive selection.

Finally, the fused features are concatenated and passed through a convolution to integrate channel information and enhance non-linearity. As shown in Equations (9) and (10) [31]:

F_{u}^{″} = [u_{1}^{'}, u_{2}^{'}, u_{3}^{'}, u_{4}^{'}]

(9)

F_{u}^{'} = R e L U (B N (C o n v (F_{u}^{″})))

(10)

Here, the convolution compresses the channels and fuses information across the splits, while batch normalization (BN) and ReLU further optimize the feature distribution and enhance the representational capability of the module.

This module, through lightweight channel splitting and an adaptive weighting mechanism, effectively alleviates the problem of small object information loss caused by multiple downsampling operations while reducing the number of parameters. It achieves efficient collaborative modeling of detail and semantics under complex backgrounds.

The data focused by the DASI module is further processed through a specific diffusion mechanism, as shown in the feature aggregation diffusion network part of the model in Figure 1. The DASI module accepts inputs from three different scales to extract multi-scale texture features with varying receptive fields. Through the diffusion mechanism, features enriched with contextual information are propagated across all detection scales. By performing feature aggregation and diffusion twice, each scale’s features are enhanced with detailed contextual information, which is more beneficial for subsequent object detection and classification. This process forms the feature aggregation diffusion network.

To address the challenges posed by pest objects that are small in size, highly variable in shape, and easily confused with leaf textures, the FADN dynamically fuses high-dimensional semantic features with low-dimensional detail features through an adaptive weighting mechanism to capture subtle morphological differences. When detecting tiny insect bodies, it prioritizes enhancing local details in shallow features. In contrast, under strong background interference, it focuses more on deep features’ global semantic information to suppress noise. This dynamic balancing mechanism enables the model to preserve the fine structures of pest objects while enhancing robustness against complex backgrounds, thereby reducing missed detections and false positives.

3.3. Re-Parameterized Batch Normalization

Guo et al. [32] proposed RepBN, an efficient normalization method designed to replace LayerNorm in Transformers by dynamically adjusting the normalization process through combining standard BatchNorm with a learnable parameter

η

. The core idea is to add a linear term

(η X)

of the input

X

to the output of BatchNorm during training, thereby enhancing the model’s representational capacity. After training, RepBN is mathematically re-parameterized into an equivalent BatchNorm form, allowing it to be directly fused with adjacent linear layers during inference. This eliminates extra computation and reduces latency. To avoid training instability caused by directly replacing LayerNorm, RepBN employs a progressive strategy that gradually reduces dependence on LayerNorm. At the initial stage, the model fully relies on LayerNorm, then progressively transitions to pure BatchNorm as training advances, ensuring stable training, as illustrated in Figure 5.

This design not only retains the efficient inference characteristics of BatchNorm but also improves training stability and flexibility through dynamic parameter adjustment and progressive replacement. It makes the model better suited for resource-constrained deployment environments while maintaining performance. The formula for RepBN is shown in Equation (11) [32]:

R e p B N (X) = B N (X) + η X

(11)

Here, BN(X) denotes the standard Batch Normalization operation, and

η

is a learnable parameter. This design introduces additional flexibility by adding a linear term

(η X)

of the input feature

X

on top of the BatchNorm output, allowing the normalization layer to dynamically adjust the feature distribution. During training,

η

is optimized via gradients, enhancing the model’s ability to adapt to complex patterns. The re-parameterized form of RepBN is shown in Equation (12) [32]:

R e p B N (X) = B N (X; μ, σ, α + η σ, β + η μ)

(12)

Here,

μ

and

σ

are the mean and standard deviation of BatchNorm, while

α

and

β

are the original BatchNorm scaling and shifting parameters. After re-parameterization, the new scaling parameter becomes

α + η σ

, and the new shifting parameter becomes

β + η μ

. The key to this transformation is incorporating the dynamic adjustment term

η X

into the BatchNorm parameters, allowing it to be directly merged with adjacent linear layers during inference without extra computation, thereby improving efficiency.

We transformed the AIFI module of RT-DETR into a RepBN-based architecture by replacing the normalization component of AIFI with RepBN, resulting in AIFI-RepBN. This architecture can be further re-parameterized into BatchNorm and merged with linear layers, significantly improving inference efficiency while maintaining detection accuracy. Its core advantage lies in combining the efficient inference characteristics of BatchNorm with a learnable linear term, enabling the model to reduce extra normalization computations and lower latency when processing high-resolution images and multi-scale objects. At the same time, it adaptively handles complex feature distribution variations, enhancing sensitivity to small pest objects or occluded scenarios. Furthermore, the progressive strategy smoothly transitions from LayerNorm to BatchNorm, avoiding model instability caused by abrupt parameter changes during early training. Additionally, RepBN’s hardware-friendly design makes it easier to deploy on edge devices by simplifying computations through parameter fusion and reducing memory usage, providing an efficient and reliable solution for real-time pest detection in resource-constrained environments.

4. Experiment

4.1. Experimental Dataset and Environment

The Forestry Pest Dataset [33] was constructed by Liu et al. as a specialized dataset for the identification of pests in economic forestry. It aims to address the low efficiency of manual inspection in traditional methods as well as the small scale and limited diversity of existing datasets. This dataset was constructed through a multi-stage process to ensure quality and reproducibility. Images were collected from search engines and specialized forestry pest websites, covering 31 common forestry pest species across multiple life stages such as eggs, larvae, nymphs, and adults, all captured in natural wild environments. They cover a wide range of ecological groups, including wood borers, defoliators, sap-sucking pests and leaf miners. A preliminary filtering process was carried out by four trained volunteers with the assistance of three forestry experts, who removed duplicate, invalid, or blurred images, resulting in 2278 high-quality original samples. To address class imbalance and improve generalization, seven data augmentation techniques including rotation, noise addition, and brightness transformation were applied, which expanded the dataset to 7163 images. Annotation was performed using the Labelimg tool, where three volunteers labeled pest categories and bounding boxes under the supervision of experts. Ambiguous samples were jointly reviewed and resolved by experts to ensure annotation accuracy. Finally, the dataset was divided into 5730 training images, 716 validation images, and 717 test images. These procedures clearly define the conditions for dataset construction and the criteria for sample selection, thereby enhancing the reproducibility of the experimental results. The 31 pest categories are numbered sequentially from 0 to 30, as shown in Table 1.

The number and size distribution of labels are shown in Figure 6. Figure 6a indicates that the dataset exhibits an imbalance in sample distribution, where different pest categories are distinguished by 31 colors corresponding to class indices from 0 to 30 in the Forestry Pest Dataset. The most frequently occurring categories are “Erthesina fullo (nymph2)” and “Plagiodera versicolora (Laicharting) (ovum)”. These pests often appear in clusters within a single image, resulting in significantly more annotated instances compared to other categories. Similarly, other small-scale objects tend to appear repeatedly in individual images, leading to relatively high occurrence frequencies. This distribution pattern aligns with real-world conditions in economic forestry scenarios. Figure 6b shows that most object aspect ratios in the dataset are concentrated below 0.1, which corresponds to the typical size characteristics of small object detection.

During the data annotation process, forestry experts were invited to participate in classification and labeling to ensure accuracy, using professional tools to assist with the task. The dataset retains the complex morphological variations and life cycle differences among pests in real outdoor environments, such as the developmental changes in the same species at different life stages. Compared to previous forestry pest datasets that were limited to laboratory conditions or focused on a single species, this dataset is more representative of real-world application scenarios. It provides high-quality data support for deep learning research in the monitoring and control of pests in economic forestry. Therefore, we use this dataset for model training. Detailed information about the experimental settings is shown in Table 2.

4.2. Evaluation Metrics

In object detection tasks, Precision, Recall, and mean Average Precision (

m A P

) are the most commonly used evaluation metrics. Precision measures the proportion of correctly identified targets among all the targets predicted by the model, as shown in Equation (13). A high precision indicates that most of the detected targets are accurate and reliable, while a low precision suggests a high number of false positives. Recall represents the proportion of correctly detected true targets out of all actual targets, as shown in Equation (14). A high recall indicates that the model successfully captures the majority of targets, while a low recall suggests a high rate of missed detections.

P r e c i s i o n = \frac{T P}{T P + F P}

(13)

R e c a l l = \frac{T P}{T P + F N}

(14)

Here,

T P

(True Positive) refers to the number of correctly detected targets,

F P

(False Positive) is the number of incorrectly detected targets, and

F N

(False Negative) is the number of targets that were not correctly detected.

Average Precision (AP) evaluates the detection accuracy of a single class by computing the average precision across different recall levels. Mean Average Precision (

m A P

) is then calculated by averaging the AP values of all classes, providing a comprehensive assessment of the model’s overall detection performance, as shown in Equations (15) and (16).

A P = \int_{0}^{1} P (r) d r

(15)

m A P = \frac{1}{C} \sum_{i = 1}^{C} {A P}_{i}

(16)

In the formulas,

P (r)

represents the precision at a given recall rate

r

, and

C

denotes the total number of classes.

{m A P}_{0.5}

refers to the mean Average Precision calculated at a specific threshold (IoU = 0.5), where IoU (Intersection over Union) measures the overlap between the predicted bounding box and the ground truth box. This metric evaluates the model’s detection accuracy under a relatively lenient IoU requirement. In contrast,

{m A P}_{0.5 : 0.95}

is the mean Average Precision computed across a range of IoU thresholds from 0.5 to 0.95 with a step size of 0.05. Compared to

{m A P}_{0.5}

,

{m A P}_{0.5 : 0.95}

provides a more rigorous evaluation, as it requires the model to maintain high localization accuracy under stricter IoU conditions.

FPS (frames per second) is a key metric for evaluating a model’s processing efficiency, directly reflecting the algorithm’s ability to analyze image data within a unit of time. An increase in this parameter corresponds to enhanced real-time processing performance, meaning the system can achieve faster visual information recognition and response in time-sensitive scenarios.

Additionally, the number of model parameters is generally related to the model’s complexity. A larger number of parameters may indicate stronger fitting capability but also increases computational cost and storage requirements. The goal of lightweight models is to reduce the number of parameters while maintaining high performance. FLOPs (floating point operations) refer to the total number of floating-point calculations performed during model inference, usually measured in GFLOPs (billion floating-point operations), and are used to assess the computational complexity of a model. A higher FLOPs value means more inference time and hardware resources are required. In object detection tasks, an ideal model should achieve a high

{m A P}_{0.5 : 0.95}

while minimizing both the number of parameters and FLOPs to meet deployment demands in practical scenarios. Therefore, this paper selects

{m A P}_{0.5 : 0.95}

, model parameter count, and FLOPs as the evaluation metrics.

4.3. Ablation Experiments

To verify the effectiveness of each improved module proposed in the LightFAD-DETR model, ablation experiments are conducted in this section. In comparison with the baseline model, the following components are individually integrated into the RT-DETR framework: replacing the backbone with the first nine layers of YOLOv9, incorporating the RepNCSPELAN4-CAA module, deploying the FADN with the DASI module, and replacing the AIFI module with AIFI-RepBN. Subsequently, all four improvements are progressively integrated to form the complete LightFAD-DETR model.

When only replacing the backbone with the first nine layers of YOLOv9, the

{m A P}_{0.5 : 0.95}

drops by 0.2% compared to the baseline model, but the number of parameters decreases by 47.2% and the computational cost is reduced by 40.8%, making the model more lightweight. When only the RepNCSPELAN4-CAA module is added by replacing the RepC3 module in the RT-DETR model, the

{m A P}_{0.5 : 0.95}

increases by 0.5 percent, enhancing the semantic perception capability of feature fusion, while the parameter count and computational load also decrease to some extent. When only the FADN with the DASI module is integrated, the

{m A P}_{0.5 : 0.95}

increases by 1.5%, achieving efficient collaborative modeling of details and semantics in complex backgrounds, although this leads to some increase in both parameters and computation. When the AIFI-RepBN module is added alone, the

{m A P}_{0.5 : 0.95}

improves by 0.8% without significantly changing the model’s parameter size or computational load, greatly enhancing inference efficiency while maintaining detection accuracy. When the YOLOv9 backbone is used in combination with replacing both the RepNCSPELAN4 modules in the backbone and the RepC3 module in the model with the RepNCSPELAN4-CAA modules, the model’s parameters and computational cost are further reduced, and

{m A P}_{0.5 : 0.95}

increases by 0.3%. Following this, the integration of the DASI-based FADN brings a 0.7% increase in

{m A P}_{0.5 : 0.95}

, with only an 11.5% increase in parameters and a 14.2% increase in FLOPs. Finally, by incorporating the AIFI-RepBN module, the model achieves an additional 0.6% increase in

{m A P}_{0.5 : 0.95}

without significantly affecting parameters or computation. In summary, compared to the baseline, the final LightFAD-DETR model achieves a 1.4% improvement in

{m A P}_{0.5 : 0.95}

, with a 41.7% reduction in parameters and a 35.0% reduction in computational cost, thus achieving a dual improvement in both accuracy and lightweight performance. Detailed results are shown in Table 3.

Vaswani et al. [34] proposed the Learned Positional Encoding method, which was used to enhance the positional encoding mechanism of the AIFI module, resulting in the AIFI-LPE module. Shaker et al. [35], in the SwiftFormer model, introduced the Efficient Additive Attention mechanism, which was applied to improve the global attention mechanism of the AIFI module, leading to the AIFI-EfficientAdditive module. Xia et al. [36] proposed Deformable Attention, which was incorporated into the AIFI module’s global self-attention mechanism, resulting in the AIFI-DAttention module. Pan et al. [37] introduced the HiLo Attention mechanism that captures both high and low frequency information and was used to improve the attention mechanism in AIFI, resulting in the AIFI-HiLo module. Wu et al. [38] proposed the M2SA module, which includes two branches: the first employs an enhanced channel attention mechanism, while the second integrates multi-scale depthwise separable dilated convolutions and adaptive pooling in the MSMHSA (Multi-Scale Multi-Head Self-Attention) module. We integrated the MSMHSA module into the AIFI structure, forming the AIFI-MSMHSA module.

We integrated different improved versions of the AIFI module into the RT-DETR model to evaluate the suitability of the AIFI-RepBN module. As shown in Table 4, all modified modules maintain nearly the same parameter count and computational cost. The inclusion of the AIFI-LPE and AIFI-EfficientAdditive modules resulted in a slight decrease in the original model’s

{m A P}_{0.5 : 0.95}

. The AIFI-DAttention and AIFI-HiLo modules had minimal impact on the

{m A P}_{0.5 : 0.95}

. In contrast, both the AIFI-RepBN and AIFI-MSMHSA modules led to a noticeable improvement in the model’s

{m A P}_{0.5 : 0.95}

.

To demonstrate that the AIFI-RepBN module is more suitable for the LightFAD-DETR model, we integrated six different improved AIFI modules into the LightFAD-DETR model separately. The results are shown in Table 5. All improved models have similar parameter counts and computational costs. Among them, the AIFI-RepBN, AIFI-DAttention, and AIFI-HiLo modules all achieved improvements in

{m A P}_{0.5 : 0.95}

based on the LightFAD-DETR model. Notably, the integration of the AIFI-RepBN module led to the most significant improvement in

{m A P}_{0.5 : 0.95}

while maintaining the lowest computational cost, making it the most suitable for the LightFAD-DETR model. Therefore, we adopt the AIFI-RepBN module in the final model.

In the experiment regarding improving the RepC3 module in the baseline model, we integrated different improved versions of the RepC3 module into the RT-DETR model to verify the suitability of the RepNCSPELAN4-CAA module. Ding et al. [39] proposed the Diverse Branch Block, which can convert complex multi-branch designs into a single convolutional layer using structural re-parameterization techniques. This allows the model to maintain high expressive power during training and lightweight structure during inference. We introduced this block by replacing the RepBlock in the RepC3 module, forming the DBBC3 module. Gong [40] proposed the Dynamic Group Shuffle Transformer (DGST), which uses a 3:1 splitting strategy to combine group convolution with channel shuffle operations, and replaces traditional linear layers with convolutional layers. This design reduces computational complexity while enhancing the model’s ability to capture both local and global features. In this study, we replaced the RepC3 module with the DGST module to conduct comparative experiments. Song et al. [41] introduced the gConv Block, which combines gating mechanisms with depthwise separable convolutions to enhance model expressiveness while maintaining computational efficiency. We used this block to replace the RepBlock in the RepC3 module, resulting in the gConvC3 module. Ding et al. [42] also proposed the Dilated Reparam Block in UniRepLKNet, which allows flexible adjustment of dilation rates and kernel sizes to dynamically adapt to different receptive field and detail extraction requirements. We used this block to replace the RepBlock in the RepC3 module, forming the DRBC3 module. Finally, we replaced the RepC3 module with the RepNCSPELAN4 module from YOLOv9 and conducted comparative experiments with the other improved versions. As shown in Table 6, all the improved modules reduced model parameters and computational cost to varying degrees. However, only the DBBC3 module and the RepNCSPELAN4-CAA module improved the model’s

{m A P}_{0.5 : 0.95}

, with the RepNCSPELAN4-CAA module achieving better improvement while maintaining lower parameter count and computation cost.

To demonstrate that the RepNCSPELAN4-CAA module is more suitable for the LightFAD-DETR model, we integrated six different improved versions of the RepC3 module into the LightFAD-DETR model. The results are shown in Table 7. Among all the improvements, the DBBC3 module and the RepNCSPELAN4-CAA module showed the most significant improvement in

{m A P}_{0.5 : 0.95}

, with the RepNCSPELAN4-CAA module achieving the best performance. However, the

{m A P}_{0.5 : 0.95}

improvement from the DBBC3 module came at the cost of increased model parameters and computational complexity. Although the other improved modules further reduced the model’s complexity, their

{m A P}_{0.5 : 0.95}

performance was not satisfactory. Therefore, we incorporated the RepNCSPELAN4-CAA module into the LightFAD-DETR model.

4.4. Classification Performance Analysis

To evaluate the overall performance of the model, we conducted a series of comparative experiments between the LightFAD-DETR model and RT-DETRr18, RT-DETRr34, and RT-DETRr50. As shown in Table 8, the LightFAD-DETR model outperforms all versions of RT-DETR in terms of accuracy and

{m A P}_{0.5 : 0.95}

. In terms of model size and computational complexity, LightFAD-DETR has about one-third the parameters and FLOPs of RT-DETRr34, and about one-fourth of RT-DETRr50. Moreover, its FPS surpasses both RT-DETRr34 and RT-DETRr50, significantly exceeding the real-time detection requirements for forestry pest monitoring.

Figure 7 presents a comparison of the

{m A P}_{0.5 : 0.95}

across 31 categories between the LightFAD-DETR model and the RT-DETRr18 model. The results show that the LightFAD-DETR model achieves higher detection accuracy in approximately two-thirds of the categories, while the difference in accuracy for the remaining less favorable categories is relatively small. Notably, significant improvements in

{m A P}_{0.5 : 0.95}

are observed for categories such as “Drosicha contrahens (female)”, “Erthesina fullo”, “Cnidocampa flavescens”, “Plagiodera versicolora (Laicharting) (larvae)”, “Plagiodera versicolora (Laicharting) (ovum)”, “Apriona germari (Hope)”, and “Latoia consocia (Walker)”. These improvements highlight the LightFAD-DETR model’s strong ability to capture local detail information, its advantage in detecting small objects, and its robustness against background interference.

4.5. Performance Comparison of Different Models

Table 9 presents the comparison results between the LightFAD-DETR model and several popular object detection models. The comparison includes two-stage detectors (Faster R-CNN [43]), one-stage detectors (SSD512 [44], RetinaNet [45], and the YOLO series), and Transformer-based detectors (DETR [46], Deformable DETR [47], Swin-B [48]). The analysis shows that the LightFAD-DETR model achieves an outstanding balance between accuracy, lightweight design, and speed, significantly outperforming other models. In terms of detection accuracy, the LightFAD-DETR model reaches a

{m A P}_{0.5 : 0.95}

of 88.5% and a precision of 97.3%, placing it at a top-tier level. Its performance is comparable to or even better than YOLOv8m and clearly surpasses other lightweight models such as YOLOv5m and YOLOv10m [49], as well as classic models like Faster R-CNN and SSD512.

The core advantage of the LightFAD-DETR model lies in its exceptionally lightweight design and efficiency, all while maintaining high accuracy. It has the lowest computational complexity and number of parameters among all the compared models, significantly lower than models such as Faster R-CNN, DETR, and Swin-B. Despite this substantial reduction in model size, it does not compromise on speed, as it achieves an inference speed of 106.3 FPS, which far exceeds real-time detection requirements. Although slightly lower than the fastest models, YOLOv8m (137.8 FPS) and YOLOv10m (142.5 FPS), the LightFAD-DETR model achieves nearly a 50% reduction in parameters and FLOPs while offering comparable or even superior accuracy. Therefore, LightFAD-DETR is a lightweight model that maintains high accuracy and speed, making it well-suited for deployment on edge devices under the strict resource constraints and real-time demands of intelligent forestry pest detection.

4.6. Performance Analysis on Different Datasets

To demonstrate the generalization capability and deployment advantages of the LightFAD-DETR model in complex, multi-category pest scenarios, we conducted comparative experiments on the IP102 dataset [52]. The IP102 dataset provides strong data support for pest detection in economic forest settings, covering 102 common pest species and meeting the specific identification needs of such environments. With over 75,000 images captured in real-world conditions, the dataset comprehensively represents pests across different developmental stages and includes complex background interference, effectively enhancing the model’s generalization ability in diverse forestry scenarios. In addition, approximately 19,000 images with bounding box annotations are available for training object detection models, enabling integrated pest localization and classification analysis.

The detection performance comparison of common training models is shown in Table 10. The results indicate that all models experience a decline in detection accuracy on this dataset compared to the Forestry Pest Dataset. However, the LightFAD-DETR model remains in the top tier, achieving a

{m A P}_{0.5 : 0.95}

of 41.2% with extremely low resource consumption. Its accuracy is comparable to mainstream models such as YOLOv8m and RT-DETRr50, and it significantly outperforms traditional detectors like Faster R-CNN and SSD512. Furthermore, its high inference speed of 106.3 FPS surpasses most Transformer-based models, fully meeting real-time application demands. These results highlight the effectiveness of LightFAD-DETR’s lightweight design and demonstrate its strong performance in complex, cross-dataset pest detection tasks. The model maintains high accuracy and stability when facing new categories and unfamiliar backgrounds, effectively establishing strong discriminative feature representations for diverse target morphologies. It achieves an optimal balance of speed, accuracy, and efficiency under edge device resource constraints, providing a reliable and high-performance foundation for intelligent pest detection in economic forestry applications.

4.7. Visual Analysis of Detection Results

Figure 8 presents a visual comparison of pest target detection results between the RT-DETR model and the proposed LightFAD-DETR model on the Forestry Pest Dataset. In each group, the left column shows the original image, the middle column displays the detection results from the RT-DETR model, and the right column shows the results from the LightFAD-DETR model. Each row represents an independent comparison group, highlighting the performance differences between the two models in challenging scenarios. In the first row, the RT-DETR model shows clear limitations when dealing with blurred images and densely packed objects, failing to detect a large number of pest eggs. In contrast, the LightFAD-DETR model successfully identifies all targets in the scene, overcoming the difficulties of blur and occlusion, and generally assigns higher confidence scores to the detected objects. In the second row, the RT-DETR model incorrectly identifies parts of the image background as pest eggs, which could lead to unnecessary alerts or interventions in real-world applications. The LightFAD-DETR model, however, avoids these false detections. In the third row, at the bottom right of the image, the RT-DETR model mistakenly detects two closely adjacent pests as a single target, while the LightFAD-DETR model accurately distinguishes and detects both targets correctly. In summary, through these three representative comparison scenarios, the LightFAD-DETR model demonstrates significant advantages over the RT-DETR model in handling image blur, dense object distributions, background interference, and closely located objects. These results strongly validate the superior accuracy and robustness of the LightFAD-DETR model for pest detection tasks in economic forestry.

5. Discussion

Through its innovative lightweight architecture and effective multi-module collaborative optimization, LightFAD-DETR successfully achieves an excellent balance of accuracy, speed, and model efficiency in the task of economic forestry pest detection. Its strong robustness in complex backgrounds, effective detection of small and slender objects, and demonstrated generalization ability across datasets fully validate the model’s advanced design and practical applicability. LightFAD-DETR provides strong technical support for the edge deployment of intelligent forestry pest-monitoring systems and holds significant application value and broad prospects for adoption.

Compared with previously known pest detection models [43,44,45,46,47,48,49,50,51], LightFAD-DETR achieves a more balanced trade-off between accuracy, efficiency, and deployment feasibility. Traditional approaches often emphasize accuracy at the expense of model size and inference time, making them difficult to use in resource-constrained forestry environments. In contrast, our model integrates lightweight architectural design with enhanced feature aggregation, thereby preserving fine-grained details of small pest targets while maintaining high processing speed. From a scientific perspective, this work contributes a novel detection paradigm that demonstrates how advanced feature modeling strategies and re-parameterization techniques can be effectively combined in economic forest pest monitoring, addressing challenges such as complex backgrounds and small object recognition. From a practical perspective, LightFAD-DETR offers a deployable solution for intelligent forestry management, enabling real-time, large-scale, and cost-effective pest monitoring in real-world scenarios. These advantages underscore both the theoretical significance and the applied value of the model, establishing it as a step forward toward bridging the gap between laboratory research and field application.

In addition to experimental validation, the LightFAD-DETR model was developed with practical deployment in mind. Its lightweight structure and reduced computational requirements make it well suited for embedding into edge devices such as portable monitoring instruments, mobile phones, or compact computing units with GPU or NPU acceleration. This enables real-time pest detection directly in the field, even in remote forestry areas with limited network connectivity, thereby reducing latency and improving system reliability. The model can also be integrated into unmanned aerial vehicles for large-scale aerial surveys or combined with fixed ground-based monitoring stations and Internet of Things platforms to establish continuous and automated pest surveillance systems. These application pathways highlight the feasibility of deploying LightFAD-DETR in real-life forestry scenarios, bridging the gap between academic research and practical forest pest management.

This study’s analysis is entirely based on publicly available datasets. The model’s performance potential is constrained by the quality and inherent limitations of these datasets, which in turn affects the robustness of the study’s conclusions. In terms of internal validity, a key challenge lies in the quality of the datasets themselves. For instance, inadequate sampling strategies may fail to comprehensively cover the object phenomena, and subjective inconsistencies during data annotation may introduce potential biases into the model’s learning process, leading to distorted findings. From the perspective of external validity, the limitations of the datasets pose major constraints, such as sample size, coverage scope, and characteristics of the specific data collection environment. These factors collectively may weaken the generalizability of the study’s conclusions to broader contexts. The observed accuracy decline on the IP102 dataset suggests that there is still room for improvement when the model is faced with larger volumes of data, finer objects, or extremely dense and overlapping instances. The recognition of tiny pests under multilayer leaf occlusion also remains a persistent challenge. Although the model accounts for different developmental stages of pests, its robustness in recognizing species with highly variable morphologies, or those in non-standard postures or damaged conditions, requires further validation and enhancement.

Future research will further enhance the adaptability of the LightFAD-DETR model in complex forestry environments. To address the challenges of extremely small objects and heavy occlusion, we plan to integrate super-resolution techniques and adaptive receptive field mechanisms to improve the model’s ability to detect tiny pests hidden beneath deep leaf coverage. Additionally, to reduce the model’s reliance on annotated data, we aim to explore its integration with few-shot learning and transfer learning strategies. On the lightweight design front, we will investigate dynamic neural architecture search (NAS) techniques to generate customized subnetworks tailored to the hardware characteristics of different edge devices, achieving a refined balance between accuracy and efficiency.

6. Conclusions

This paper proposes a lightweight pest detection model, LightFAD-DETR, designed for complex economic forestry scenarios. By implementing multidimensional collaborative optimization based on the RT-DETR framework, the model achieves an efficient balance between accuracy, speed, and resource consumption. First, the introduction of the YOLOv9 backbone significantly reduces the computational load. Then, the RepNCSPELAN4-CAA module is proposed, which integrates re-parameterization techniques with one-dimensional strip convolutions to enhance cross-region modeling capabilities for elongated pest objects while reducing inference latency. Additionally, a feature aggregation diffusion network is designed, along with the introduction of a dimension-aware selective integration module. These components dynamically fuse deep and shallow features, effectively mitigating information loss for small objects in complex backgrounds. Furthermore, the AIFI module, improved with re-parameterized batch normalization and trained using a progressive training strategy, eliminates redundant computation and further enhances deployment efficiency on edge devices. Experimental results demonstrate that, while significantly reducing parameter count and computational complexity, the proposed model surpasses the baseline in detection accuracy and meets real-time inference requirements. In cross-dataset evaluations, it exhibits generalization performance comparable to mainstream models, particularly excelling in handling blurred images, dense objects, and complex background interference, effectively reducing missed and false detections. In terms of quantitative performance, LightFAD-DETR achieved a mean average precision of 88.5% (

{m A P}_{0.5 : 0.95}

) on the Forestry Pest Dataset, surpassing existing baseline models by a clear margin in both overall accuracy and small object recognition. The model runs at 106.3 FPS on RTX 2080 GPU, with only 11.6 million parameters and 31.7 GFLOPs, demonstrating its ability to maintain real-time performance while remaining lightweight. Compared with previous studies, this model achieves breakthrough advancements in both lightweight design and real-time performance, offering a highly efficient and reliable solution for edge deployment of intelligent pest detection terminals in economic forestry.

Author Contributions

Conceptualization, X.C., X.W. and Y.D.; methodology, X.C. and X.W.; software, X.C. and X.W.; validation, X.C., X.W., Q.L. and Y.K.; formal analysis, X.C., X.W., J.T., Y.S. and J.Z.; investigation, X.W. and Y.D.; resources, X.C. and Y.D.; data curation, X.W. and J.T.; writing—original draft preparation, X.C. and X.W.; writing—review and editing, X.C., X.W., Y.D., Q.L., Y.K., J.T., Y.S. and J.Z.; supervision, X.C.; project administration, X.C.; funding acquisition, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was supported by Key R & D projects of the Guangxi Science and Technology Program (Guike.AB24010338), The central government guides local science and technology development fund projects (GuikeZY22096012), National Natural Science Foundation of China (32360374), Independent research project (GXRDCF202307-01).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Kanan, A.H.; Masiero, M.; Pirotti, F. Estimating Economic and Livelihood Values of the World’s Largest Mangrove Forest (Sundarbans): A Meta-Analysis. Forests 2024, 15, 837. [Google Scholar] [CrossRef]
Wang, L.; Wang, E.; Mao, X.; Benjamin, W.; Liu, Y. Sustainable Poverty Alleviation through Forests: Pathways and Strategies. Sci. Total Environ. 2023, 904, 167336. [Google Scholar] [CrossRef] [PubMed]
Venkatesh, Y.N.; Ashajyothi, M.; Uma, G.S.; Rajarajan, K.; Handa, A.K.; Arunachalam, A. Diseases and Insect Pests Challenge to Meet Wood Production Demand of Tectona grandis (L.), a High-Value Tropical Tree Species. J. Plant Dis. Prot. 2023, 130, 929–945. [Google Scholar] [CrossRef]
Raum, S.; Collins, C.M.; Urquhart, J.; Potter, C.; Pauleit, S.; Egerer, M. Tree Insect Pests and Pathogens: A Global Systematic Review of Their Impacts in Urban Areas. Urban Ecosyst. 2023, 26, 587–604. [Google Scholar] [CrossRef]
Barathi, S.; Sabapathi, N.; Kandasamy, S.; Lee, J. Present Status of Insecticide Impacts and Eco-Friendly Approaches for Re-mediation—A Review. Environ. Res. 2024, 240, 117432. [Google Scholar] [CrossRef]
Saberi Riseh, R.; Gholizadeh Vazvani, M.; Hassanisaadi, M.; Thakur, V.K.; Kennedy, J.F. Use of Whey Protein as a Natural Polymer for the Encapsulation of Plant Biocontrol Bacteria: A Review. Int. J. Biol. Macromol. 2023, 234, 123708. [Google Scholar] [CrossRef]
Wang, S.; Xu, D.; Liang, H.; Bai, Y.; Li, X.; Zhou, J.; Su, C.; Wei, W. Advances in Deep Learning Applications for Plant Disease and Pest Detection: A Review. Remote Sens. 2025, 17, 698. [Google Scholar] [CrossRef]
Marvasti-Zadeh, S.M.; Goodsman, D.; Ray, N.; Erbilgin, N. Early Detection of Bark Beetle Attack Using Remote Sensing and Machine Learning: A Review. ACM Comput. Surv. 2023, 56, 97. [Google Scholar] [CrossRef]
Mittal, M.; Gupta, V.; Aamash, M.; Upadhyay, T. Machine Learning for Pest Detection and Infestation Prediction: A Comprehensive Review. WIREs Data Min. Knowl. Discov. 2024, 14, e1551. [Google Scholar] [CrossRef]
Godinez-Garrido, G.; Gonzalez-Islas, J.-C.; Gonzalez-Rosas, A.; Flores, M.U.; Miranda-Gomez, J.-M.; Gutierrez-Sanchez, M.d.J. Estimation of Damaged Regions by the Bark Beetle in a Mexican Forest Using UAV Images and Deep Learning. Sustainability 2024, 16, 10731. [Google Scholar] [CrossRef]
Yin, J.; Zhang, H.; Chen, Z.; Li, J. Detecting Emerald Ash Borer Boring Vibrations Using an Encoder–Decoder and Improved DenseNet Model. Pest Manag. Sci. 2025, 81, 384–401. [Google Scholar] [CrossRef]
Tian, Y.; Wang, S.; Li, E.; Yang, G.; Liang, Z.; Tan, M. MD-YOLO: Multi-Scale Dense YOLO for Small Target Pest Detection. Comput. Electron. Agric. 2023, 213, 108233. [Google Scholar] [CrossRef]
Dong, Q.; Sun, L.; Han, T.; Cai, M.; Gao, C. PestLite: A Novel YOLO-Based Deep Learning Technique for Crop Pest Detection. Agriculture 2024, 14, 228. [Google Scholar] [CrossRef]
Ye, R.; Gao, Q.; Qian, Y.; Sun, J.; Li, T. Improved YOLOv8 and SAHI Model for the Collaborative Detection of Small Targets at the Micro Scale: A Case Study of Pest Detection in Tea. Agronomy 2024, 14, 1034. [Google Scholar] [CrossRef]
Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; IEEE: Chennai, India, 2024; pp. 1–6. [Google Scholar] [CrossRef]
Cagatay Akyon, F.; Onur Altinuc, S.; Temizel, A. Slicing Aided Hyper Inference and Fine-tuning for Small Object Detection. arXiv 2022, arXiv:2202.06934. [Google Scholar] [CrossRef]
Yang, Z.; Feng, H.; Ruan, Y.; Weng, X. Tea Tree Pest Detection Algorithm Based on Improved Yolov7-Tiny. Agriculture 2023, 13, 1031. [Google Scholar] [CrossRef]
Li, Z.; Shen, Y.; Tang, J.; Zhao, J.; Chen, Q.; Zou, H.; Kuang, Y. IMLL-DETR: An Intelligent Model for Detecting Multi-Scale Litchi Leaf Diseases and Pests in Complex Agricultural Environments. Expert Syst. Appl. 2025, 273, 126816. [Google Scholar] [CrossRef]
Chen, H.; Wen, C.; Zhang, L.; Ma, Z.; Liu, T.; Wang, G.; Yu, H.; Yang, C.; Yuan, X.; Ren, J. Pest-PVT: A Model for Multi-Class and Dense Pest Detection and Counting in Field-Scale Environments. Comput. Electron. Agric. 2025, 230, 109864. [Google Scholar] [CrossRef]
Zhao, C.; Bai, C.; Yan, L.; Xiong, H.; Suthisut, D.; Pobsuk, P.; Wang, D. AC-YOLO: Multi-Category and High-Precision Detection Model for Stored Grain Pests Based on Integrated Multiple Attention Mechanisms. Expert Syst. Appl. 2024, 255, 124659. [Google Scholar] [CrossRef]
Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
Deng, C.; Wang, M.; Liu, L.; Liu, Y.; Jiang, Y. Extended Feature Pyramid Network for Small Object Detection. IEEE Trans. Multimed. 2022, 24, 1968–1979. [Google Scholar] [CrossRef]
Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. MFFSODNet: Multiscale Feature Fusion Small Object Detection Network for UAV Aerial Images. IEEE Trans. Instrum. Meas. 2024, 73, 5015214. [Google Scholar] [CrossRef]
Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards Large-Scale Small Object Detection: Survey and Benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
Wang, H.; Liu, C.; Cai, Y.; Chen, L.; Li, Y. YOLOv8-QSD: An Improved Small Object Detection Algorithm for Autonomous Vehicles Based on YOLOv8. IEEE Trans. Instrum. Meas. 2024, 73, 2513916. [Google Scholar] [CrossRef]
Zhang, Y.; Zhang, H.; Huang, Q.; Han, Y.; Zhao, M. DsP-YOLO: An Anchor-Free Network with DsPAN for Small Object Detection of Multiscale Defects. Expert Syst. Appl. 2024, 241, 122669. [Google Scholar] [CrossRef]
Shen, C.; Qian, J.; Wang, C.; Yan, D.; Zhong, C. Dynamic Sensing and Correlation Loss Detector for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5627212. [Google Scholar] [CrossRef]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar] [CrossRef]
Wang, C.-Y.; Yeh, I.-H.; Mark Liao, H.-Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15089, pp. 1–21. [Google Scholar] [CrossRef]
Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly Kernel Inception Network for Remote Sensing Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 27706–27716. [Google Scholar] [CrossRef]
Xu, S.; Zheng, S.; Xu, W.; Xu, R.; Wang, C.; Zhang, J.; Teng, X.; Li, A.; Guo, L. HCF-Net: Hierarchical Context Fusion Network for Infrared Small Object Detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niigata, Japan, 26 June–5 July 2024; pp. 1–6. [Google Scholar] [CrossRef]
Guo, J.; Chen, X.; Tang, Y.; Wang, Y. SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 21–27 July 2024; PMLR: London, UK, 2024; Volume 235, pp. 16802–16812. [Google Scholar]
Liu, B.; Liu, L.; Zhuo, R.; Chen, W.; Duan, R.; Wang, G. A Dataset for Forestry Pest Identification. Front. Plant Sci. 2022, 13, 857104. [Google Scholar] [CrossRef] [PubMed]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
Shaker, A.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.-H.; Khan, F.S. SwiftFormer: Efficient Additive Attention for Transformer-Based Real-Time Mobile Vision Applications. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 17379–17390. [Google Scholar] [CrossRef]
Xia, Z.; Pan, X.; Song, S.; Li, E.; Huang, G. Vision Transformer with Deformable Attention. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 4784–4793. [Google Scholar] [CrossRef]
Pan, Z.; Cai, J.; Zhuang, B. Fast Vision Transformers with HiLo Attention. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 14541–14554. [Google Scholar]
Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–12. [Google Scholar] [CrossRef]
Ding, X.; Zhang, X.; Han, J.; Ding, G. Diverse Branch Block: Building a Convolution as an Inception-like Unit. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual Event/Nashville, TN, USA, 19–25 June 2021; pp. 10881–10890. [Google Scholar] [CrossRef]
Gong, W. Lightweight Object Detection: A Study Based on YOLOv7 Integrated with ShuffleNetv2 and Vision Transformer. arXiv 2024, arXiv:2403.01736. [Google Scholar] [CrossRef]
Song, Y.; Zhou, Y.; Qian, H.; Du, X. Rethinking Performance Gains in Image Dehazing Networks. arXiv 2022, arXiv:2209.11448. [Google Scholar] [CrossRef]
Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 5513–5524. [Google Scholar] [CrossRef]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2016; Volume 9905, pp. 21–37. [Google Scholar]
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part I; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12346, pp. 213–229. [Google Scholar]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, Austria, 3–7 May 2021; OpenReview: Amherst, MA, USA, 2021; pp. 1–12. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual Event/Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar] [CrossRef]
Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar] [CrossRef]
Wu, X.; Zhan, C.; Lai, Y.-K.; Cheng, M.-M.; Yang, J. IP102: A Large-Scale Benchmark Dataset for Insect Pest Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8779–8788. [Google Scholar] [CrossRef]

Figure 1. The network structure of the LightFAD-DETR model.

Figure 2. Module structure of RepNCSPELAN4.

Figure 3. Context anchor attention module.

Figure 4. Dimension-Aware Selective Integration module.

Figure 5. The progressive strategy of RepBN.

Figure 6. Tag quantity map (a) and label distribution size map (b).

Figure 7. Comparison of

{m A P}_{0.5 : 0.95}

between RT-DETR and LightFAD-DETR models in each pest category.

Figure 7. Comparison of

{m A P}_{0.5 : 0.95}

between RT-DETR and LightFAD-DETR models in each pest category.

Figure 8. Comparison of detection results between the RT-DETR and the LightFAD-DETR.

Table 1. The image sample numbers of 31 types of pests in the Forest Pest Dataset.

Label Number	Category	Label Number	Category
0	Drosicha contrahens (female)	16	Latoia consocia (Walker) (larvae)
1	Drosicha contrahens (male)	17	Plagiodera versicolora (Laicharting) (larvae)
2	Chalcophora japonica	18	Plagiodera versicolora (Laicharting) (ovum)
3	Erthesina fullo	19	Plagiodera versicolora (Laicharting)
4	Erthesina fullo (nymph2)	20	Spilarctia subcarnea (Walker) (larvae)
5	Erthesina fullo (nymph)	21	Spilarctia subcarnea (Walker) (larvae2)
6	Spilarctia subcarnea (Walker)	22	Apriona germari (Hope)
7	Psilogramma menephron	23	Hyphantria cunea
8	Sericinus montela	24	Cerambycidae (larvae)
9	Sericinus montela (larvae)	25	Psilogramma menephron (larvae)
10	Clostera anachoreta	26	Monochamus alternatus
11	Cnidocampa flavescens (Walker)	27	Micromelalopha troglodyte (Graeser) (larvae)
12	Cnidocampa flavescens (Walker) (pupa)	28	Hyphantria cunea (larvae)
13	Anoplophora chinensis	29	Hyphantria cunea (pupa)
14	Micromelalopha troglodyte (Graeser)	30	Latoia consocia (Walker)
15	Psacothea hilari (Pascoe)

Table 2. Experimental configuration details, including hardware environment, software framework, and training parameter settings.

Experimental Configuration	Configuration Information
Deep Learning Framework	Pytorch 2.0.0
Software Version	PyCharm PROFESSIONAL 2023.1
Python Version	Python 3.8.19
Server	AMAX
GPU	NVIDIA GeForce 2080 Ti
Number of Training Epochs	300
Training Batch Size	16
Number of Threads	4
Optimizer	0.01

Table 3. Ablation experiments compared with the baseline model.

Baseline Model	Replace the Backbone	Rep-NCSPE-LAN4-CAA	FADN (DASI)	AIFI-RepBN	${m A P}_{0.5 : 0.95}$ (%)	Params (M)	GFLOPs
RT-DETR					87.1	19.9	57.1
	√				86.9	10.5	33.8
		√			87.6	19.1	53.1
			√		88.6	21.2	63.5
				√	87.9	19.9	57.1
	√	√			87.2	10.4	32.5
	√	√	√		87.9	11.6	37.1
	√	√	√	√	88.5	11.6	37.1

Table 4. The experiment regarding improving the AIFI module in the baseline model.

Improved AIFI Module	${m A P}_{0.5 : 0.95}$ (%)	Params (M)	GFLOPs
AIFI-LPE	86.4	20.0	57.1
AIFI-EfficientAdditive	86.6	19.9	57.3
AIFI-DAttention	87.2	19.9	57.3
AIFI-HiLo	87.1	19.9	57.2
AIFI-MSMHSA	87.6	19.9	57.2
AIFI-RepBN	87.9	19.9	57.1