1. Introduction
Agarwood (Aquilaria spp.), a valuable tree species indigenous to Southeast Asia, produces a resin highly esteemed as the “gold of plants,” occupying a unique and indispensable role in medicine, culture, and the economy. Agarwood resin forms when the tree undergoes natural injury or microbial infection, which induces the secretion of aromatic resin as part of the tree’s self-repair mechanism [1,2]. This distinctive biological phenomenon renders agarwood an exceptionally rare natural resource. In traditional medicine, agarwood is recognized for its therapeutic properties, including the promotion of qi circulation, analgesic effects, warming of the middle burner to alleviate vomiting, and the regulation of respiration to relieve asthma, and it continues to be extensively incorporated into contemporary Chinese medicinal formulations [3]. Culturally, agarwood holds significant value in religious rituals and among scholars and literati, embodying a rich historical and cultural heritage spanning millennia [4]. Economically, high-quality agarwood commands prices reaching several hundred dollars per gram, exceeding the value of gold, and has fostered the development of a comprehensive industrial supply chain [5]. As illustrated in Figure 1, the value of agarwood spans multiple dimensions, from medicinal tea and cultural incense to high-value accessories such as bracelets carved from the precious wood itself.
Nevertheless, the sustainable utilization of this precious resource faces substantial challenges. The International Union for Conservation of Nature (IUCN) has classified wild agarwood populations as vulnerable, citing overexploitation as a primary cause of their near depletion [4]. In response, artificial cultivation has emerged as the only viable strategy to sustain industry growth. However, intensive cultivation practices are impeded by significant threats from pests and diseases [6]. Notably, foliar herbivorous pests, which directly impair photosynthesis and hinder tree growth, represent a critical constraint on the healthy development of the agarwood industry.
The advancement of pest monitoring technology is closely linked to the progression of smart agriculture and can be broadly divided into three developmental phases. Initially, pest monitoring was conducted exclusively through manual field inspections, a method that was inefficient and labor-intensive, relied heavily on the subjective expertise of individual inspectors, and often produced considerable delays and variability. In extensive plantations, a comprehensive inspection cycle could span several weeks, increasing the risk of missing critical intervention windows for pest and disease management and, consequently, of irreversible economic damage. To address these shortcomings, researchers adopted traditional machine learning techniques, typically pairing handcrafted features such as color, texture, and shape with classifiers like Support Vector Machines (SVMs) and random forests for pest identification [7,8,9,10]. While this strategy introduced a degree of automation, its effectiveness was constrained by the quality of feature engineering: the inherent morphological variability of pests, complex background interference, and fluctuating lighting conditions in practical environments significantly limited the generalizability and robustness of these methods.
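To make this classical pipeline concrete, the sketch below pairs two typical handcrafted descriptors (an RGB color histogram and a uniform local binary pattern texture histogram) with an SVM classifier. It is an illustrative reconstruction of the era’s approach, not the pipeline of any cited study; the feature choices, bin counts, and kernel are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from skimage.feature import local_binary_pattern

def handcrafted_features(img_rgb: np.ndarray) -> np.ndarray:
    """Color histogram + LBP texture descriptor for one labeled image crop."""
    # 8x8x8 RGB histogram: coarse color composition of the crop
    color_hist, _ = np.histogramdd(
        img_rgb.reshape(-1, 3), bins=(8, 8, 8), range=((0, 256),) * 3)
    # Uniform LBP with 8 neighbors at radius 1: local texture statistics
    gray = img_rgb.mean(axis=2)
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10))
    feat = np.concatenate([color_hist.ravel(), lbp_hist]).astype(np.float64)
    return feat / (feat.sum() + 1e-8)  # normalize so crops of any size compare

# X = np.stack([handcrafted_features(crop) for crop in labeled_crops])
# clf = SVC(kernel="rbf").fit(X, y)   # y: pest class labels per crop
```

Every stage of such a pipeline rests on manual design decisions (descriptor choice, bin counts, kernel), which is precisely the fragility described above.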
In recent years, deep learning-based detection models, particularly those built on convolutional neural network (CNN) architectures, have achieved significant advances in agricultural pest detection owing to their robust end-to-end feature learning capabilities. Notable examples include Faster R-CNN [11], SSD [12], and the YOLO series [13,14,15,16,17]. Yang et al. proposed the Maize-YOLO model [18], an enhanced version of YOLOv7 that substantially reduces computational complexity while improving the accuracy of maize pest detection by integrating the CSPResNeXt-50 module with the VOV-GSCSP module. Similarly, Wang and colleagues developed the RGC-YOLO model [19], which replaces conventional convolutional layers with the GhostConv structure and incorporates a hybrid attention mechanism, facilitating efficient multi-scale recognition of rice diseases and pests. Furthermore, Liu et al. introduced YOLO-Wheat for wheat pest detection [20]; Guan et al. devised a multi-scale pest detection approach termed GC-Faster R-CNN by integrating hybrid attention mechanisms [21]; Yu et al. presented LP-YOLO [22], a lightweight pest detection method; and Tang et al. proposed SP-YOLO for multi-scale pest detection in beet fields [23]. Collectively, these studies underscore the substantial application potential and continuous innovation of CNN-based models in agricultural pest detection. Nevertheless, CNN models exhibit inherent limitations: their advancements have predominantly targeted the efficiency of convolutional operations and the enhancement of local feature extraction, without fundamentally addressing the constraints imposed by CNNs’ limited local receptive fields. This intrinsic property places a natural bottleneck on a model’s ability to comprehend the global context of an image.
To transcend these locality restrictions, researchers have increasingly incorporated the Transformer architecture into computer vision tasks [24]. The core self-attention mechanism of Transformers computes interactions directly across all regions of an image, enabling effective modeling of global contextual information. In agricultural pest detection, Transformers have shown promise in capturing complex scene relationships and global context, yielding significant improvements in detection performance [25,26,27]. Nonetheless, this enhanced global modeling capability comes at the cost of computational complexity that grows quadratically with input size, substantially elevating resource demands and posing significant challenges for deployment in agricultural field settings, where high-resolution images must be processed efficiently. In this context, state space models (SSMs) [28,29], and more specifically the selective state space model exemplified by Mamba [30], have garnered considerable interest as a novel approach to sequence modeling, owing to their distinctive advantages in handling long sequences. By leveraging a selective state mechanism, SSMs can capture global contextual information akin to Transformers while maintaining linear computational complexity, which substantially improves efficiency without sacrificing performance. This makes SSMs particularly suitable for agricultural pest detection: in complex field images, understanding the global scene context (such as leaf distribution, the presence of shadows, or pest infestation patterns) is crucial for accurately locating small and blurry targets, and SSMs provide an efficient mechanism for this global understanding, which standard Transformers struggle to deliver at acceptable computational cost and which CNNs cannot achieve due to their local receptive fields. Conceptually, SSMs can be seen as a bridge between CNNs and Transformers: they retain efficient recurrent sequence processing (akin to the inductive bias of CNNs’ local processing) while achieving a global receptive field and data-dependent feature selection comparable to Transformer self-attention.
Table 1 provides a comparative summary of these three architectural paradigms.
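To make the SSM column of this comparison concrete, the following is a minimal, didactic version of the selective-scan recurrence in the spirit of Mamba [30]. It shows only the discretized state update (zero-order hold for the state matrix, a first-order approximation for the input projection), not the hardware-aware parallel kernel; all array names are illustrative.

```python
import numpy as np

def selective_scan(x, log_A, B, C, delta):
    """Didactic 1-D selective state space scan (not the optimized Mamba kernel).

    x:      (L,)   input sequence
    log_A:  (N,)   state-transition parameters (kept negative for stability)
    B, C:   (L, N) input-dependent projections; this data dependence is the
                   "selective" part that lets the model keep or forget inputs
    delta:  (L,)   input-dependent step sizes from the discretization
    """
    L, N = B.shape
    h = np.zeros(N)                          # fixed-size state summarizing the past
    y = np.empty(L)
    for t in range(L):                       # a single pass: linear in L
        A_bar = np.exp(delta[t] * log_A)     # zero-order-hold discretization of A
        B_bar = delta[t] * B[t]              # first-order discretization of B
        h = A_bar * h + B_bar * x[t]         # recurrent state update
        y[t] = C[t] @ h                      # read out the hidden state
    return y
```

Because the past is compressed into a fixed-size state updated once per element, the cost grows linearly with sequence length, in contrast to the pairwise interactions of self-attention.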
Consequently, SSMs emerge as a promising solution to the inherent trade-off between the locality constraints of CNNs and the computational demands of Transformers. Notably, SSM-based models have already been employed in agricultural pest classification; for instance, Wang et al. introduced InsectMamba [31], the first application of an SSM framework to pest classification, enabling the model to extract comprehensive visual features for accurate pest identification.
Despite significant progress in agarwood pest detection, critical research gaps persist. While state space models have exhibited exceptional proficiency in sequence modeling, their use has predominantly been confined to classification tasks; their potential in dense object detection remains underexplored, particularly in agricultural contexts that demand precise localization and multi-scale recognition, such as the complex environments of agarwood pest detection. To address these challenges, this study introduces the Adaptive State Space Convolution Fusion Network (ASCNet), a detection framework that synergistically combines the complementary advantages of convolutional operations and state space modeling. This integration is realized through three principal contributions:
A dual-path adaptive fusion mechanism, implemented in the Adaptive State Space Convolution Fusion Block (ASBlock). This module processes features from the state-space and convolutional pathways in parallel, employing a gating mechanism to dynamically adjust their contributions (a sketch of this fusion pattern follows the list).
Spatial shuffle downsampling, implemented via the Grouped Spatial Shuffle Downsampling (GSD) module. By replacing strided convolutional downsampling with pixel rearrangement and grouped convolution, this module reduces feature resolution while preserving fine-grained spatial details (see the sketch after the list).
For small object optimization, a loss function incorporating the Normalized Wasserstein Distance (NWD) metric. This method models bounding boxes as Gaussian distributions to enhance robustness in detecting minute pests (the formulation is given after the list).
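The gated dual-path fusion of the first contribution can be pictured with the following PyTorch sketch. It illustrates the fusion pattern only: the state-space branch is replaced here by a 1 × 1 convolution placeholder, and the layer choices are assumptions rather than ASBlock’s actual internals.

```python
import torch
import torch.nn as nn

class DualPathGatedFusion(nn.Module):
    """Illustrative dual-path fusion with a learned gate (not the exact ASBlock)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv_branch = nn.Conv2d(channels, channels, 3, padding=1)  # local-detail path
        self.ssm_branch = nn.Conv2d(channels, channels, 1)  # placeholder for the SSM path
        self.gate = nn.Sequential(          # per-channel gate from pooled statistics
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local_feat = self.conv_branch(x)    # fine local texture
        global_feat = self.ssm_branch(x)    # stands in for long-range context
        g = self.gate(x)                    # values in (0, 1), broadcast over H and W
        return g * global_feat + (1.0 - g) * local_feat  # input-dependent weighting
```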
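The pixel-rearrangement downsampling of the second contribution can likewise be sketched with PyTorch’s PixelUnshuffle: a 2 × 2 spatial neighbourhood is folded into the channel axis, so no pixel is discarded before a grouped convolution compresses the result. The group count and layer sizes are illustrative assumptions, not the GSD specification.

```python
import torch.nn as nn

class ShuffleDownsample(nn.Module):
    """Lossless spatial-to-channel downsampling followed by grouped channel mixing."""

    def __init__(self, in_ch: int, out_ch: int, groups: int = 4):
        super().__init__()
        # (C, H, W) -> (4C, H/2, W/2): every pixel survives, relocated to channels
        self.unshuffle = nn.PixelUnshuffle(2)
        # Interleave channels so each conv group sees several spatial offsets
        self.shuffle = nn.ChannelShuffle(groups)
        # Grouped 1x1 conv compresses channels cheaply; out_ch must be divisible by groups
        self.mix = nn.Conv2d(4 * in_ch, out_ch, kernel_size=1, groups=groups)

    def forward(self, x):
        return self.mix(self.shuffle(self.unshuffle(x)))
```

Unlike strided convolution, which subsamples the spatial grid, this rearrangement keeps every input position explicitly available to subsequent layers.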
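For the third contribution, the NWD metric follows the standard normalized Gaussian Wasserstein distance formulation for tiny-object detection, stated below; the normalizing constant C and the weighting against the IoU term (the nwd_ratio examined in Section 4) are implementation choices.

```latex
% A box (c_x, c_y, w, h) is modeled as a 2-D Gaussian:
\mathcal{N}(\mu, \Sigma), \qquad
\mu = \begin{pmatrix} c_x \\ c_y \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} w^2/4 & 0 \\ 0 & h^2/4 \end{pmatrix}.

% For two such Gaussians, the squared second-order Wasserstein distance
% reduces to a Euclidean distance over the box parameters:
W_2^2(\mathcal{N}_a, \mathcal{N}_b) =
  \bigl\| (c_{x_a},\, c_{y_a},\, \tfrac{w_a}{2},\, \tfrac{h_a}{2})^{\top}
        - (c_{x_b},\, c_{y_b},\, \tfrac{w_b}{2},\, \tfrac{h_b}{2})^{\top} \bigr\|_2^2 .

% Exponential normalization maps the distance to a similarity in (0, 1],
% giving a loss term for box regression:
\mathrm{NWD}(\mathcal{N}_a, \mathcal{N}_b) =
  \exp\!\left( - \frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C} \right),
\qquad
\mathcal{L}_{\mathrm{NWD}} = 1 - \mathrm{NWD}.
```

Unlike IoU, which drops to zero as soon as two tiny boxes stop overlapping, this similarity decays smoothly with parameter distance, so the regression signal stays informative for minute pests.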
Extensive experiments conducted on the newly constructed agarwood pest dataset demonstrate that ASCNet achieves outstanding performance while maintaining practical efficiency, thereby providing a reliable solution for intelligent pest monitoring in the cultivation of high-value timber tree species.
4. Discussion
This study demonstrates that the proposed ASCNet model achieves outstanding performance in pest detection on Aquilaria sinensis leaves, outperforming a range of mainstream CNN- and Transformer-based detectors. Its superior performance arises from the innovative integration of state space models with convolutional neural networks, which mitigates both the limited receptive fields of CNNs and the quadratic complexity of Transformers. The ASBlock’s dual-path adaptive fusion mechanism dynamically balances global context and local features, a capability essential in complex agricultural settings where pests may be morphologically diverse, occluded, or embedded in cluttered backgrounds. ASCNet-s attains a recall of 87.8%, confirming the state-space pathway’s efficacy in capturing long-range dependencies and minimizing missed detections. The Grouped Spatial Shuffle Downsampling (GSD) module substantially reduces information loss during downsampling, preserving the fine details critical for tiny pest detection; ASCNet-s leads all compared methods on multi-scale metrics, underscoring its robustness. Incorporating the Normalized Wasserstein Distance (NWD) into the loss function further enhances small object detection: NWD models bounding boxes as Gaussian distributions, offering a geometrically more stable similarity measure than IoU, which is sensitive to minor positional shifts. Ablation studies on the NWD weight coefficient confirm that nwd_ratio = 0.3 optimally boosts small object detection without compromising overall accuracy.
Despite these advances, limitations remain: the dataset is from a single geographic region, potentially limiting generalization to other Aquilaria cultivation areas with different pests or imaging conditions. Future work should include cross-regional validation and more diverse pest categories to improve robustness. Moreover, while ASCNet balances accuracy and computational cost well, further optimization is required for deployment on resource-constrained edge devices in agriculture. Techniques like model quantization, pruning, and distillation are promising avenues for future research.
5. Conclusions
This study addresses the critical challenge of detecting tiny pests on agarwood leaves, where conventional deep learning models often struggle to balance accuracy and efficiency. The Adaptive State Space Convolution Fusion Network (ASCNet) is introduced, a novel framework that enhances feature representation by integrating the long-range dependency modeling of state-space models with the local feature extraction strengths of convolutional networks. This integration is realized through the Adaptive State Space Convolution Fusion Block (ASBlock), which adaptively fuses global context with local details. Additional contributions include the Grouped Spatial Shuffle Downsampling (GSD) module, which preserves spatial information during resolution reduction, and a Normalized Wasserstein Distance (NWD) loss that improves localization robustness for small objects. Evaluated on the novel Agarwood Pest Dataset, ASCNet demonstrates distinct advantages over mainstream detectors. Compared to the CNN-based YOLOv8-s, ASCNet-s achieves a superior balance, improving mAP@50:95 by 5.7 points (71.2 ± 0.3% vs. 65.5%) with a moderate increase in parameters. Against the Transformer-based RT-DETR-r50, it shows a 9.5-point lead in mAP@50:95 while using just over a third of the computational cost (46.7 vs. 131 GFLOPs). These results confirm that ASCNet successfully navigates the inherent trade-off between the limited receptive fields of CNNs and the high computational complexity of Transformers, establishing a new state of the art for this specific task.
ASCNet provides a robust and efficient solution for intelligent pest monitoring in agarwood cultivation, establishing a versatile paradigm adaptable to other agricultural vision tasks involving small object detection in complex environments. Compared to existing approaches, its primary advantages lie in its balanced design: it mitigates the limited receptive fields of CNNs and the high computational complexity of Transformers through efficient state-space modeling, while the GSD module and NWD loss specifically enhance performance for tiny pests. However, several limitations should be noted. The model’s validation is currently based on a dataset from a specific geographical region, which may affect its generalizability to other cultivation environments with different pest species or imaging conditions. Furthermore, while designed for efficiency, its deployment on resource-constrained edge devices may require further optimization, such as model quantization or pruning, to achieve real-time performance in all field scenarios. Future work will therefore focus on three key directions to address these limitations and advance practical application: (1) enhancing cross-environment generalization through domain adaptation techniques; (2) exploring multimodal data fusion incorporating environmental or spectral information for richer contextual understanding; and (3) optimizing the model architecture and inference pipeline for real-time deployment on edge devices in smart agriculture systems.