Applied Sciences
  • Article
  • Open Access

6 November 2025

Research on Target Detection and Counting Algorithms for Swarming Termites in Agricultural and Forestry Disaster Early Warning

School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450046, China
* Author to whom correspondence should be addressed.

Abstract

The accurate monitoring of termite swarming—a key indicator of dispersal and population growth—is essential for early warning systems that mitigate infestation risks in agricultural and forestry environments. Automated detection and counting systems have become a viable alternative to labor-intensive and time-consuming manual inspection methods. However, detecting and counting such small and fast-moving targets as swarming termites poses a significant challenge. This study proposes the YOLOv11-ST algorithm and a novel counting algorithm to address this challenge. By incorporating the Fourier-domain parameter decomposition and dynamic modulation mechanism of the FDConv module, along with the LRSA attention mechanism that enhances local feature interaction, the feature extraction capability for swarming termites is improved, enabling more accurate detection. The SPPF-DW module was designed to replace the original network’s SPPF module, enhancing the feature capture capability for small targets. In comparative evaluations with other baseline models, YOLOv11-ST demonstrated superior performance, achieving a Recall of 87.32% and a mAP50 of 93.21%. This represents an improvement of 2.1% and 2.02%, respectively, over the original YOLOv11. The proposed counting algorithm achieved an average counting accuracy of 91.2%. These research findings offer both theoretical and technical support for the development of a detection and counting system for swarming termites.

1. Introduction

Termites are recognized as one of the world’s five major pests and are highly destructive insects [1], inflicting substantial damage across various sectors, including agriculture and forestry. Annual global economic losses attributable to termite activity are estimated to range between 15 and 40 billion US dollars [2]. Accurate detection and counting of termites are essential for assessing local termite species composition and infestation severity, which are critical for developing effective prevention and control strategies. Since termite detection requires specialized expertise and is labor-intensive, automated methods present a promising alternative [3]. Traditional inspection methods predominantly rely on manual examination, which is not only time-consuming and resource-intensive but also susceptible to subjective bias. In real-time monitoring videos, the rapid movement of termites frequently results in motion blur. Moreover, during collective behaviors, individuals often overlap with one another or are partially occluded by environmental debris, posing significant challenges for accurate detection and individual counting. The applicability of computer vision in termite detection primarily arises from its capacity to address the subjectivity and inherent limitations of conventional approaches. By leveraging deep learning-based image and pattern recognition algorithms, this technology enables highly accurate detection of individual termites, even under challenging field conditions. This transformation shifts the detection process from experience-dependent qualitative assessments to intelligent diagnosis that is both quantifiable and reproducible, thereby substantially improving the sensitivity of early detection, the objectivity of evaluation outcomes, and the efficiency of large-scale monitoring.
To address the aforementioned challenges, this study constructed a swarm-flying termite dataset to mitigate the lack of annotated data. Building upon the original SPPF module, we designed the SPPF-DW module and incorporated the LRSA and FDConv modules into YOLOv11, thereby developing the YOLOv11-ST model for improved recognition of swarm-flying termites. Furthermore, a multi-feature fusion-based feature matching and counting algorithm was proposed to alleviate redundant counting across adjacent frames.

2. Related Work

Object detection, as a fundamental technology in the field of computer vision [4], serves as a critical foundation and prerequisite for achieving accurate object counting. In the context of detecting swarming termites, the performance of the object detection algorithm directly influences the accuracy of subsequent quantity estimation, thereby affecting the assessment of termite activity patterns and the development of control strategies. The detection model must exhibit high accuracy and strong real-time capability. The YOLO family of detectors [5] is particularly well-suited to such tasks due to its compact architecture and computational efficiency, which align with the essential requirements for real-time object counting [6]. However, the high-speed descending motion of swarming termites within trapping devices presents significant challenges to the image acquisition system. During high-speed motion, termites exhibit noticeable motion blur on image sensors, often accompanied by visual artifacts such as trailing effects and partial occlusion. Moreover, the trapping device’s internal lighting conditions are complex and variable, and the contrast between individual termites and the background is often low. These factors collectively increase the difficulty of accurate detection. To ensure reliable input for subsequent counting algorithms, it is essential to develop a detection model specifically designed for high-speed moving targets. The model must be capable of robust feature extraction under motion blur and for small objects, while also demonstrating strong adaptability to varying illumination and background interference. Only a detection model fulfilling these criteria can provide the accurate and dependable data that underpin subsequent termite population statistics. Recent years have witnessed significant advancements in this research domain. Huang et al. [3] developed a termite classification system utilizing MobileNetV2, which was evaluated against the performance of three human termite experts. Their model processed 8000 images of termite specimens 65 times faster than human experts while maintaining comparable classification accuracy. Zhong et al. [7] proposed an approach combining YOLO-based object detection for coarse counting with global-feature-based refinement for precise counting. Implemented on a Raspberry Pi platform, their insect identification and counting system achieved an average counting accuracy of 92.50%. Saradopoulos et al. [8] designed an edge-computing insect-monitoring system based on the ESP32 microcontroller, employing quantized deep learning models to attain over 95% accuracy in insect counting. This system provides a highly efficient, low-cost, and low-power solution for smart urban pest management. Dai et al. [9] incorporated the Swin Transformer (SwinTR) and Transformer (C3TR) modules into the YOLOv5m architecture, achieving a pest detection accuracy of 95.7% in agricultural settings.
Accurately counting high-speed moving objects remains a considerable challenge within the field of object counting, particularly in the context of monitoring swarming termites using trap-based devices. Owing to the small size and rapid descent of swarming termites, image capture via macro photography introduces several technical complications: firstly, high-speed motion results in pronounced motion blur effects, significantly obscuring the morphological features of individual termites; secondly, owing to the limited frame rates of conventional cameras (typically 30–60 fps), a single termite may appear in only one or two consecutive video frames. This extremely brief presence renders conventional Multi-Object Tracking algorithms [10] largely ineffective, as they generally require a target to be visible across multiple consecutive frames (typically ≥3) to establish a reliable motion trajectory. In the field of object counting, Li et al. [11] introduced a cross-line zoning strategy combined with DeepSORT to assign unique identifiers to wheat ears. By integrating geographic coordinate mapping and confidence-based filtering, their method effectively eliminated redundant counts across frames, achieving an average accuracy exceeding 97.5%. These results underscore the ongoing challenge of designing counting algorithms capable of handling high-speed, transient targets—such as swarming termites—without compromising detection accuracy or introducing duplicate counts. This remains a critical research issue both in termite monitoring and in broader applications involving fast-moving objects. Further advancing the field, Wang et al. [12] proposed a pest recognition and counting method utilizing deep learning and data reorganization. Their improved Faster R-CNN model, termed MPest-RCNN, demonstrated enhanced capability in accurately identifying and counting pests across varying densities and scales, outperforming existing approaches. In a similar vein, Saradopoulos et al. [8] developed an insect monitoring system based on edge computing using an ESP32 microcontroller. By incorporating quantized deep learning models, the system achieved over 95% counting accuracy while operating at ultra-low cost and power consumption, offering an efficient solution for intelligent pest management in urban environments.

3. Materials and Methods

This section describes the materials and methods used in the study, including the dataset, evaluation metrics, experimental parameters, the detailed architecture and components of the proposed YOLOv11-ST model, and the swarming termite counting algorithm.

3.1. Dataset Construction

The dataset of swarming termites was collected in the Nuozhadu area of Pu’er City, Yunnan Province, China. It consists of Macrotermes and Odontotermes. The humid climate and abundant vegetation in Yunnan make it an active region for termites such as Macrotermes and Odontotermes. To study their swarming behavior and develop automated counting algorithms, a dataset containing videos and annotations of swarming termites was constructed.
The dataset was built by capturing field videos of swarming termite activity using a macro camera (TW330-U2M-V1.0, Shenzhen, China), which outputs a resolution of 1920 × 1080 at 30 fps in MJPEG format. Videos were recorded inside termite trapping devices during peak swarming periods (dusk and night in the rainy season). Keyframes were extracted at a rate of one frame per second to generate JPEG image sequences with a resolution of 1920 × 1080. Blurred, duplicate, or termite-free frames were removed. Manual annotation was performed using the LabelImg tool, with each image saved in YOLO format as a corresponding TXT file. Figure 1 shows an example of the dataset labeling.
Figure 1. Dataset labeling.

3.2. The Indicators of Evaluation

The experiments involved a comparative analysis of the improved network model against other models under the same experimental conditions. Evaluation was primarily conducted using four metrics—precision, recall, mAP@0.5, and mAP@0.5–0.95 (%)—to assess missed detections and false alarms.
Precision: Reflects the accuracy of the model’s predictions, as defined by the formula below.
\mathrm{Precision} = \frac{TP}{TP + FP}
Recall: The model’s recall rate for targets is computed using the formula provided.
\mathrm{Recall} = \frac{TP}{TP + FN}
mAP: This metric represents the mean average precision across all object categories. Specifically, mAP@0.5 denotes the mean average precision computed at an Intersection over Union (IoU) threshold of 0.5. The calculation formula is provided in Equation (3).
AP = \int_{0}^{1} p(r)\, dr; \quad mAP = \frac{1}{c} \sum_{i=1}^{c} AP_i
where c is the total number of classes in the image.
For the counting experiments, the error E, error rate ER, average error rate AER, accuracy rate AR, and average accuracy rate AAR are defined as follows:
E = \left| \mathrm{Prediction} - \mathrm{Truth} \right|
ER = \frac{E}{\mathrm{Truth}}
AER = \frac{1}{N} \sum_{i=1}^{N} ER_i
AR = 1 - ER
AAR = 1 - AER
where Prediction denotes the model-predicted count of swarming termites in each video, Truth represents the actual number of flying termites in each video, N refers to the total number of videos, and ER_i denotes the counting error rate for the i-th video.
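As a concrete illustration, these counting metrics can be computed from per-video counts as in the following sketch; the example counts are hypothetical and not data from this study.

```python
# Minimal sketch of the counting metrics defined above.
# The per-video counts below are hypothetical, not data from this study.
predictions = [48, 52, 39]   # model-predicted termite counts per video
truths = [50, 50, 42]        # manually counted ground truth per video

errors = [abs(p - t) for p, t in zip(predictions, truths)]    # E per video
error_rates = [e / t for e, t in zip(errors, truths)]         # ER per video
aer = sum(error_rates) / len(error_rates)                     # AER
accuracy_rates = [1 - er for er in error_rates]               # AR per video
aar = 1 - aer                                                 # AAR

print(f"AER = {aer:.3f}, AAR = {aar:.3f}")
```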

3.3. Experimental Parameter Configuration

All experiments conducted in this study, including comparative experiments on attention mechanisms, ablation studies of the YOLOv11-ST model, performance comparisons between YOLOv11-ST and other state-of-the-art models, as well as swarming termite counting experiments, were carried out on the same hardware setup. Detailed hardware specifications and experimental environments are provided in Table 1.
Table 1. Experimental environment.
In this study, several parameters were configured for the experiments. The training was conducted with a learning rate of 0.01, 300 epochs, a batch size of 16, and an input image size of 640 × 640. A comprehensive summary of the training parameters is provided in Table 2.
Table 2. Parameter setting.
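For reference, the training configuration in Table 2 corresponds to a call such as the sketch below, assuming an Ultralytics-style training interface; the dataset YAML path and the base model configuration file are placeholders rather than values given in the paper.

```python
# Sketch of a training run using the parameters listed in Table 2.
# "termite.yaml" and the base model configuration are placeholders; the paper
# does not state the exact training script or model YAML it used.
from ultralytics import YOLO

model = YOLO("yolo11n.yaml")      # base YOLOv11 config; a custom YOLOv11-ST yaml would replace this
model.train(
    data="termite.yaml",          # dataset config: train/val image paths and class names
    epochs=300,                   # Table 2: 300 training epochs
    batch=16,                     # Table 2: batch size 16
    imgsz=640,                    # Table 2: 640 x 640 input size
    lr0=0.01,                     # Table 2: initial learning rate 0.01
)
```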

3.4. YOLOv11-ST

Based on the YOLOv11 detection model, this study developed a novel swarming termite detection model named YOLOv11-ST. While inheriting the characteristics of the original architecture, the proposed model significantly improves the recognition accuracy of swarming termites. As illustrated in Figure 2, three targeted optimizations were implemented: (1) An LRSA module was incorporated at the 13th layer of the YOLOv11 backbone to reduce interference from redundant information and enhance the model’s focus on the overall features of swarming termites. (2) The original YOLOv11 feature fusion network was reconstructed using an FDConv module to minimize precision loss. (3) The original SPPF module was replaced with our improved SPPF-DW module, which strengthens the network’s ability to retain information, reduces the loss of small target features, and thereby improves overall detection accuracy.
Figure 2. YOLOv11-ST network structure diagram.

3.4.1. Network Modeling for YOLOv11

Released in September 2024, YOLOv11 represents the latest iteration of the YOLO series, designed to enhance both the efficiency and accuracy of real-time object detection. It also extends support to a variety of other computer vision tasks, including instance segmentation, image classification, and pose estimation. While retaining the classic YOLO architecture, YOLOv11 introduces two key innovative modules: the C3k2 module and the C2PSA module. The C3k2 module enhances the traditional Cross-Stage Partial Network (CSPNet) by serially stacking multiple C3k or Bottleneck modules, thereby strengthening feature extraction capability. It offers configurable parameters (e.g., when c3k = False, it reduces to a C2F structure), achieving a balance between computational efficiency and depth of feature fusion. The C2PSA module integrates the CSP structure with an attention mechanism—Pyramid Squeeze Attention (PSA)—leveraging multi-scale convolutions (3 × 3, 5 × 5, 7 × 7) and a channel weighting mechanism to dynamically enhance critical features. This design reduces computational cost by 50% while significantly improving detection robustness under challenging conditions such as occlusion and complex scenes. Through the novel C3k2 and C2PSA modules, YOLOv11 achieves breakthroughs in precision, inference speed, and model lightweighting, establishing itself as a benchmark in state-of-the-art real-time visual perception systems. Figure 3 shows the YOLOv11 network structure diagram.
Figure 3. YOLOv11 network structure diagram.

3.4.2. Backbone Architecture Optimization

Conventional convolution (Conv) is inherently limited in its ability to adapt to the dynamic nature of local features in target objects. Generating more complex feature maps typically requires a greater number of convolutional kernels, increasing computational burden. Furthermore, the receptive field of the convolutional outputs maintains a fixed rectangular shape. Due to the cumulative effect of stacked convolutional layers, this receptive field expands progressively, often incorporating irrelevant background information into the output transformation and introducing noise during the training process of shift prediction. Given the highly complex characteristics of swarming termites, traditional convolution-based methods exhibit limited effectiveness in feature representation. To address these limitations, we introduce the FDConv and LRSA modules, which are designed to enhance the model’s ability to capture discriminative features of swarming termites that are otherwise poorly represented by standard convolutional approaches.

3.4.3. Frequency Dynamic Convolution

In the detection of swarming termites, common challenges include occluded or small-sized termites. Existing dynamic convolution methods employ multiple parallel weights and attention mechanisms to achieve adaptive weight selection. However, the frequency responses of these weights are highly similar, resulting in parameter redundancy and limited adaptability. FDConv [13] enhances frequency diversity under a fixed parameter budget through Fourier-domain parameter decomposition and a dynamic modulation mechanism.
Integrating the FDConv module into YOLOv11 primarily aims to improve the model’s efficiency and performance, especially when processing large-scale feature maps. The core operations of FDConv are performed in the frequency domain. Frequency-domain operations (such as FFT or DCT) are inherently global: each point in the frequency-domain representation of the input feature map contains information contributed by all points in the original spatial domain. Performing element-wise multiplication in the frequency domain is equivalent to applying a global convolution in the spatial domain—with a kernel size equal to that of the input feature map. This significantly expands the effective receptive field of the convolutional operation.
As a result, FDConv can more effectively capture long-range dependencies and global contextual information in feature maps.
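To make this equivalence concrete, the following minimal sketch filters a feature map in the frequency domain. It only illustrates the global-receptive-field property that FDConv exploits and is not the authors’ FDConv implementation; the tensor shapes and spectral weights are arbitrary.

```python
# Simplified sketch of filtering a feature map in the frequency domain.
# Element-wise multiplication after the FFT behaves like a convolution whose
# kernel spans the whole feature map, i.e. a global receptive field. This only
# illustrates the principle and is not the authors' FDConv implementation.
import torch

def frequency_domain_filter(feat: torch.Tensor, spectral_weight: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) feature map; spectral_weight: (C, H, W//2 + 1) complex weights."""
    spec = torch.fft.rfft2(feat, norm="ortho")                       # spatial -> frequency domain
    spec = spec * spectral_weight                                    # per-frequency modulation
    return torch.fft.irfft2(spec, s=feat.shape[-2:], norm="ortho")   # frequency -> spatial domain

feat = torch.randn(1, 8, 32, 32)
weight = torch.randn(8, 32, 17, dtype=torch.complex64)               # arbitrary spectral weights
out = frequency_domain_filter(feat, weight)
print(out.shape)                                                     # torch.Size([1, 8, 32, 32])
```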

3.4.4. Local-Region Self-Attention

The LRSA [14] (Local Region Self-Attention) mechanism is designed to enhance local feature interaction and representation, enabling more effective information exchange between adjacent regions. This improves the understanding of local details, making it particularly suitable for the recognition of termite targets. The YOLOv11 model primarily relies on convolutional operations, which have a limited receptive field and struggle to model global context, such as occluded objects. In contrast, LRSA utilizes overlapping patches to preserve more boundary information, making it more suitable than standard convolution for processing pixel-level dense targets. For instance, in the case of swarming termites, a standard YOLO model may miss detections, whereas LRSA’s fine-grained local modeling enhances feature discriminability. By reinforcing edge and texture features of small targets through overlapping patch attention and leveraging self-attention to correlate similar objects within local regions (e.g., multiple densely clustered small targets), LRSA helps reduce missed detections. It aids in recovering clear edges and textures, which is crucial for detecting fine details such as the morphology of termites. The structure of the module is shown in Figure 4.
Figure 4. Structural diagram of the GSConv module.
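The following minimal sketch shows self-attention restricted to local windows, which conveys the core idea of region-limited attention. It uses non-overlapping windows for brevity, whereas LRSA uses overlapping patches, and it is not the CATANet/LRSA implementation.

```python
# Minimal sketch of self-attention restricted to local windows. For brevity the
# windows here are non-overlapping, whereas LRSA uses overlapping patches; this
# is not the CATANet/LRSA implementation, only an illustration of the idea.
import torch

def local_window_attention(x: torch.Tensor, window: int = 8) -> torch.Tensor:
    """x: (B, C, H, W) with H and W divisible by `window`."""
    b, c, h, w = x.shape
    # partition into windows of `window` x `window` tokens: (num_windows, tokens, channels)
    tokens = (x.unfold(2, window, window).unfold(3, window, window)
                .permute(0, 2, 3, 4, 5, 1)
                .reshape(-1, window * window, c))
    attn = torch.softmax(tokens @ tokens.transpose(1, 2) / c ** 0.5, dim=-1)
    out = attn @ tokens                                   # attention only inside each window
    out = (out.reshape(b, h // window, w // window, window, window, c)
              .permute(0, 5, 1, 3, 2, 4)
              .reshape(b, c, h, w))
    return out

x = torch.randn(1, 16, 64, 64)
print(local_window_attention(x).shape)                    # torch.Size([1, 16, 64, 64])
```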

3.4.5. SPPF-DW Module

For the task of swarming termite detection, we designed the SPPF-DW module based on the original SPPF module. The core of SPPF is multi-scale max pooling. Pooling operations are inherently a form of downsampling that actively discards a significant amount of detailed information. Although SPPF preserves some information by concatenating features from pooling layers at different scales, after multiple pooling steps, the features of small targets may become extremely weak or even vanish in deeper feature maps. The YOLO series detectors prioritize detection efficiency and perform multiple downsampling operations in the Backbone. As a result, small targets may occupy only a few pixels on low-resolution feature maps, making them difficult to detect effectively.
To address the limitations of the SPPF module, we replaced the max pooling operations in the SPPF module with dilated convolutions. Dilated convolutions introduce gaps (holes) into the standard convolution kernel, thereby expanding the receptive field without losing detailed information. This approach maintains multi-scale receptive fields while avoiding the loss of fine-grained details. Compared to pooling operations, dilated convolutions better preserve edge and texture information. Figure 5 illustrates dilated convolution.
Figure 5. Dilated convolution.
SPPF-DW employs parallel convolutions to capture receptive fields, differing from the serial pooling structure of the original SPPF module. The improved SPPF-DW module exhibits a stronger capability to retain information, prevents the loss of small target features, and maintains spatial resolution to ensure sufficient representation of small targets. Figure 6 illustrates the structure of the SPPF-DW module.
Figure 6. Structural diagram of the SPPF-DW module.
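A minimal sketch of such a block is given below; the dilation rates, depthwise grouping, and channel widths are our assumptions for illustration and may differ from the exact SPPF-DW configuration.

```python
# Sketch of an SPPF-like block that replaces serial max pooling with parallel
# dilated depthwise convolutions. The dilation rates and channel widths are
# assumptions for illustration and may differ from the actual SPPF-DW module.
import torch
import torch.nn as nn

class SPPFDWSketch(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d,
                      groups=channels, bias=False)        # depthwise conv, enlarged receptive field
            for d in dilations
        ])
        self.fuse = nn.Conv2d(channels * (len(dilations) + 1), channels, kernel_size=1)

    def forward(self, x):
        outs = [x] + [branch(x) for branch in self.branches]   # parallel branches, full resolution kept
        return self.fuse(torch.cat(outs, dim=1))               # concatenate and fuse, no downsampling

x = torch.randn(1, 64, 40, 40)
print(SPPFDWSketch(64)(x).shape)                                # torch.Size([1, 64, 40, 40])
```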

3.5. Swarming Termites Counting

Real-time counting of swarming termites typically relies on video streams. Currently, most object counting methods are still based on single-frame image detection results, where counts are accumulated through frame-by-frame detection. However, this approach faces two critical challenges in video sequences: first, how to accurately count the total number of swarming termites passing through a trapping device over a period of time; second, how to avoid repeated counting of the same termite across consecutive frames, especially when targets briefly disappear and reappear. For example, if a termite is detected in frame A but subsequently exhibits decreased detection confidence in frame B due to motion blur or partial occlusion, the algorithm may misinterpret it as a new target, leading to counting errors. To address the issue of repeated counting of the same target across adjacent frames, a new algorithm is required to deduplicate counts between the first and second frames. The actual count should be the total number of targets in the first frame plus the number of newly appeared targets in the second frame.
To achieve this, the study proposes a deduplication method based on multi-feature fusion and feature matching. First, a feature registry is generated to store the features of termites detected in the previous frame. Subsequently, keypoints are extracted from the detected termite targets, followed by edge extraction to compute their Hu moments. Finally, the flight trajectories of the termites are calculated. The termite targets identified in the previous frame are registered with unique biological fingerprints, derived from the three aforementioned features, and stored in the feature registry. When termites are detected in the next frame, the extracted features are matched against those in the registry. The three distinct biological features are assigned different weights for fusion, enabling accurate feature matching and ultimately achieving the goal of deduplication. Figure 7 illustrates the termite counting process.
Figure 7. Termites counting.
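The per-frame counting logic can be sketched as follows. Here, extract_features and match_score stand in for the keypoint, Hu-moment, and trajectory features and their weighted fusion described in Sections 3.5.1–3.5.4, and the matching threshold is a hypothetical value rather than one reported in the paper.

```python
# Sketch of the deduplicated counting loop. extract_features() and match_score()
# stand in for the keypoint / Hu-moment / trajectory features and their weighted
# fusion described in Sections 3.5.1-3.5.4; MATCH_THRESHOLD is a hypothetical value.
MATCH_THRESHOLD = 0.7

def count_swarming_termites(frames, detector, extract_features, match_score):
    registry = []                                         # features of termites seen in the previous frame
    total = 0
    for frame in frames:
        detections = detector(frame)                      # per-frame YOLOv11-ST detections
        current = [extract_features(frame, det) for det in detections]
        new_targets = 0
        for feat in current:
            scores = [match_score(feat, prev) for prev in registry]
            if not scores or max(scores) < MATCH_THRESHOLD:
                new_targets += 1                          # unmatched detection -> newly appeared termite
        total += new_targets                              # count only newly appeared targets
        registry = current                                # registry now holds this frame's features
    return total
```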

3.5.1. Termite Keypoint Feature Extraction

Termite keypoint feature extraction is primarily used for identifying and matching feature points in termite images. The process begins by constructing a Gaussian scale space to simulate the morphological characteristics of termites at different observation distances, effectively replicating the blurring effects at various scales for multi-scale feature detection. A linear Gaussian convolution kernel is employed to generate a multi-scale image pyramid, where each octave consists of five interval scale layers to ensure comprehensive coverage of the termite’s overall body structure. The termite image is convolved with Gaussian kernels at different scales to produce the scale space:
L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)
where G(x, y, σ) is the Gaussian kernel at scale σ, I(x, y) represents the original termite image, L(x, y, σ) is the resulting scale space representation.
During the keypoint detection phase, the Difference of Gaussian (DoG) operator is utilized for extremum point searching. The DoG operator approximates the normalized Laplacian of Gaussian and exhibits high sensitivity to scale variations, enabling efficient localization of extremum points. Keypoints are detected by computing differences between adjacent scaled Gaussian images:
D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)
Stable feature points are localized through neighborhood comparisons within the three-dimensional scale space, with specific optimization for detecting typical morphological features such as termite head structures and wing vein bifurcations. An edge response threshold is set to suppress interference from body surface textures. During the feature description phase, the algorithm computes a gradient orientation histogram within a 16 × 16 pixel neighborhood around each keypoint. This histogram is divided into 8 bins to assign a dominant orientation to the keypoint, ensuring rotation invariance. Within the keypoint neighborhood, the gradient magnitude m(x, y) and orientation θ(x, y) are calculated as follows:
m(x, y) = \sqrt{\left(L(x+1, y) - L(x-1, y)\right)^2 + \left(L(x, y+1) - L(x, y-1)\right)^2}
\theta(x, y) = \arctan\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right)
The 360-degree orientation space is divided into 8 primary direction intervals, and a 4 × 4 sub-region partitioning strategy is adopted. This process ultimately generates a 128-dimensional feature vector with biological semantics. This descriptor can precisely characterize the morphological features of termites and effectively overcomes the feature distortion issues caused by variations in insect body posture that commonly occur in traditional measurement methods. Figure 8 illustrates termite keypoint feature extraction.
Figure 8. Termite keypoint feature extraction.
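In practice, this pipeline (Gaussian scale space, DoG extrema, orientation histograms, 128-dimensional descriptors) corresponds closely to SIFT; a minimal OpenCV-based sketch is shown below, where the image path is a placeholder and the octave-layer setting is our reading of the five interval layers mentioned above.

```python
# Minimal sketch of scale-space keypoint extraction and 128-dimensional
# descriptors using OpenCV's SIFT, which follows the Gaussian-pyramid / DoG /
# orientation-histogram steps described above. The image path is a placeholder,
# and nOctaveLayers=5 is our reading of the "five interval scale layers" above.
import cv2

image = cv2.imread("termite_crop.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create(nOctaveLayers=5)
keypoints, descriptors = sift.detectAndCompute(image, None)
print(len(keypoints), descriptors.shape)      # N keypoints, (N, 128) descriptors
```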

3.5.2. Termite Edge Extraction and Hu Moment Calculation

Edge extraction and Hu moment calculation for termites constitute critical steps in the deduplication process of swarming termite detection. The Canny operator, as a classic multi-stage edge detection algorithm, demonstrates unique advantages in processing images of swarming termites. Through gradient calculation and multi-threshold processing, it effectively extracts biological features such as body structure and wing morphology of termites. The process begins with Gaussian filtering for noise reduction. Specifically tailored to address common noise interference in microscopic termite images, an appropriately sized Gaussian kernel is applied for smoothing, which eliminates random noise while preserving edge details.
G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}
Gradient calculation and orientation determination are performed using the Sobel operator to compute horizontal and vertical gradients. Leveraging the unique textural characteristics of the termite’s body surface and wing morphology, the algorithm accurately identifies exoskeleton edges. The gradient orientation is determined using the arctan2 function.
M = \sqrt{(G_x * I)^2 + (G_y * I)^2}, \quad \theta = \arctan2\left(G_y * I,\; G_x * I\right)
Non-Maximum Suppression (NMS) is applied to eliminate non-extreme values that are unlikely to represent true edges. In the image gradient magnitude matrix, a larger element value within the 8-connected neighborhood indicates a higher gradient magnitude at that point. By combining the gradient direction at the detection point, approximate edge information can be localized. This process preserves local gradient maximum points while suppressing non-maximum values. Dual thresholds are employed to filter the binarized image. By selecting appropriate high and low thresholds, an edge map closely resembling the true edges of the image can be obtained. An initial edge map is generated based on the high threshold, which contains minimal false edges. However, due to the elevated threshold value, the resulting edges may be discontinuous. To address this issue, a second low threshold is utilized. In the high-threshold edge map, edge linking is performed to form continuous contours. When the algorithm reaches an endpoint of a contour, it searches the 8-connected neighborhood points for pixels that meet the low threshold criterion. Based on these pixels, new edges are progressively collected until the entire edge structure becomes continuous and closed.
Hu moments are mathematical features computed from image contours that possess rotation, scale, and translation invariance. For insects with distinctive morphological characteristics like swarming termites, Hu moments can effectively capture key information of their shape contours. Although the visual appearance of the same termite’s contour may vary under different poses or viewing angles, its Hu moment values remain relatively stable. The calculation of Hu moments is based on the second and third-order central moments of an image, generating seven invariant moments through nonlinear combinations. Their core advantage lies in maintaining relative stability when the target undergoes translation, rotation, or scale changes. The specific implementation involves four key steps: First, grayscale conversion and binarization are applied to the termite sample image to eliminate background interference. Next, spatial moments and central moments of the image are computed. Then, normalization is performed to remove scale effects. Finally, the seven Hu invariant moments are generated through a combination.
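A minimal OpenCV sketch of these two steps is given below; the Gaussian kernel size, Canny thresholds, and image path are illustrative choices rather than the values used in this study.

```python
# Sketch of Canny edge extraction followed by Hu-moment computation. The
# Gaussian kernel size, Canny thresholds, and image path are illustrative
# choices, not values reported in this study.
import cv2

gray = cv2.imread("termite_crop.jpg", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (5, 5), sigmaX=1.4)          # noise suppression
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)     # dual-threshold edge map

moments = cv2.moments(edges, binaryImage=True)                # spatial and central moments
hu = cv2.HuMoments(moments).flatten()                         # seven invariant moments
print(hu)
```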

3.5.3. Termite Movement Trajectory Computation

In the analysis of swarming termite movement trajectories, precise localization and focus on the starting and ending points of trajectories constitute a critical foundational step. By calculating the movement trajectories of termites across consecutive image sequences, we can extract and quantify the displacement and directional changes of individuals in space. The core value of this trajectory computation lies in providing key references for efficient and accurate feature matching. Specifically, when a termite appears in two temporally adjacent frames, the core computational task involves: based on the localization information (centroid coordinates) of the termite in both frames, accurately calculating the Euclidean distance between the termite’s position in the previous frame and its position in the next frame, while simultaneously determining the deflection angle (or turning angle) of its movement direction relative to a reference direction (such as the horizontal axis or the movement direction in the previous frame). These two core parameters—distance and deflection angle—constitute a fundamental feature vector that characterizes the short-term movement state of the termite. They play a decisive role in achieving stable and reliable cross-frame matching of individual termites in complex scenarios (such as when multiple termites are present or under background interference). Figure 9 illustrates termite movement trajectory computation.
Figure 9. Termite movement trajectory computation.
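Both parameters follow directly from the centroid coordinates in adjacent frames, as in this minimal sketch; the centroid values are hypothetical.

```python
# Sketch of the two short-term motion features: the Euclidean displacement
# between the centroids of a termite in adjacent frames, and the deflection
# angle relative to the horizontal axis. The centroid values are hypothetical.
import math

prev_centroid = (312.0, 440.0)     # (x, y) in the previous frame
curr_centroid = (330.0, 471.0)     # (x, y) in the current frame

dx = curr_centroid[0] - prev_centroid[0]
dy = curr_centroid[1] - prev_centroid[1]
distance = math.hypot(dx, dy)                      # Euclidean displacement
deflection = math.degrees(math.atan2(dy, dx))      # angle with respect to the horizontal axis

print(f"displacement = {distance:.1f} px, deflection = {deflection:.1f} deg")
```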

3.5.4. Weight Allocation Method

To allocate weights to feature points, Hu moments, and motion trajectories in the task of deduplicating swarming termite counts, we employ an adaptive weighting strategy that enables the system to automatically adjust feature weights across different scenarios. The reliability score of each feature must first be calculated, then normalized to obtain the final weights, as expressed in the formula:
w_i = \frac{R_i}{\sum_j R_j}
(1) Keypoint Weight (R1)
M_rate: the ratio of feature points in the current frame that match those in the previous frame (number of matched points/total number of points). When occlusion or overlapping of termites occurs, the weight assigned to feature points should be reduced.
S_point: the ratio of the number of feature points in the current frame to the historical average value.
Reliability score equation:
R_1 = \alpha M_{rate} + (1 - \alpha) S_{point}
(2) Hu Moment Weight (R2)
C_edge: edge closure ratio; a lower ratio indicates more fragmented edges, resulting in reduced reliability.
S_Hu: Hu moment similarity, measured by the cosine similarity between the Hu moments of the current frame and the historical average Hu moments.
Reliability score equation:
R_2 = \beta C_{edge} + (1 - \beta) S_{Hu}
(3) Motion Trajectory Weight (R3)
T_smooth: trajectory smoothness, the deviation between the current displacement and the historical average displacement:
T_{smooth} = \frac{1}{1 + \left| \mathrm{current\_displacement} - \mathrm{avg\_displacement} \right|}
L_track: trajectory length, the number of consecutively tracked frames (logarithmic growth; longer trajectories are more reliable).
Reliability score equation:
R_3 = \gamma T_{smooth} + (1 - \gamma) L_{track}
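Putting the three reliability scores together, a minimal sketch of the adaptive weighting is shown below; the values of α, β, γ and the input statistics are hypothetical, as the paper does not report them.

```python
# Sketch of the adaptive weight allocation: the per-feature reliability scores
# R1-R3 are normalized into fusion weights. The values of alpha, beta, gamma and
# the input statistics below are hypothetical, as the paper does not report them.
import math

alpha, beta, gamma = 0.6, 0.5, 0.7

def reliability_scores(m_rate, s_point, c_edge, s_hu, displacement, avg_displacement, track_len):
    r1 = alpha * m_rate + (1 - alpha) * s_point                    # keypoint reliability R1
    r2 = beta * c_edge + (1 - beta) * s_hu                         # Hu-moment reliability R2
    t_smooth = 1.0 / (1.0 + abs(displacement - avg_displacement))  # trajectory smoothness
    l_track = math.log1p(track_len)                                # logarithmic trajectory-length term
    r3 = gamma * t_smooth + (1 - gamma) * l_track                  # trajectory reliability R3
    return r1, r2, r3

scores = reliability_scores(m_rate=0.8, s_point=0.9, c_edge=0.7, s_hu=0.85,
                            displacement=34.0, avg_displacement=30.0, track_len=2)
weights = [r / sum(scores) for r in scores]                        # w_i = R_i / sum(R)
print([round(w, 3) for w in weights])
```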

4. Results

4.1. Comparative Experiments of Attention Modules

To evaluate the performance of the LRSA attention mechanism integrated into the backbone network, this study selects six commonly used attention mechanism modules: the Global Attention Mechanism [15] (GAM) module, Squeeze-and-Excitation [16] (SE) module, Bi-directional Routing Attention [17] (Biformer) module, Convolutional Block Attention Module [18] (CBAM), Coordinate Attention [19] (CA), and Efficient Channel Attention [20] (ECA). During the experimental design phase, the LRSA module in the YOLOv11-ST network architecture was replaced with each of the six attention mechanisms. Under identical experimental settings, the incorporation of different attention modules resulted in varying degrees of improvement in detection accuracy. The performance comparison is presented in Table 3.
Table 3. Comparative experiments of multiple attention modules.
Experimental results demonstrate that the model integrated with the LRSA module achieves an mAP@0.5 of 92.41%, outperforming the GAM, SE, Biformer, CBAM, CA, and ECA modules by 0.52%, 0.77%, 0.65%, 0.38%, 1.16%, and 0.54%, respectively. In terms of mAP@0.5–0.95, it surpassed these modules by 0.5%, 0.98%, 0.62%, 0.41%, 1.19%, and 0.58%, respectively. The LRSA module reaches this accuracy of 92.41% by performing self-attention computation within local regions. Its advantage lies in forgoing unnecessary global computation in order to mine the most critical local information in depth, thereby modeling the details and context of small targets efficiently, precisely, and in a content-adaptive manner.

4.2. Ablation Experiments

To validate the performance gain of each improvement component in the YOLOv11 model, we compared the precision, recall, and mAP@0.5 metrics for every modification. The results are presented in Table 4.
Table 4. The results of the ablation experiment.
(1) The first group represents the baseline YOLOv11 results, serving as the comparison benchmark for the following seven experimental groups. Its precision, recall, and mAP@0.5 were 90.16%, 85.22%, and 91.19%, respectively.
(2) The second, third, and fourth groups were experiments with one modification added sequentially. Adding the FDConv module achieved precision, recall, and mAP@0.5 of 90.43%, 86.13%, and 91.87%, respectively. Incorporating the LRSA module resulted in precision, recall, and mAP@0.5 of 91.32%, 86.74%, and 92.41%, respectively. Integrating the SPPF-DW module yielded precision, recall, and mAP@0.5 of 91.89%, 86.76%, and 92.34%, respectively.
(3) The fifth, sixth, and seventh groups involved adding two modifications simultaneously: FDConv and LRSA, FDConv and SPPF-DW, and LRSA and SPPF-DW, respectively. The combination of FDConv and LRSA modules achieved precision, recall, and mAP@0.5 of 92.07%, 87.03%, and 92.85%, respectively. The FDConv and SPPF-DW combination resulted in precision, recall, and mAP@0.5 of 92.44%, 86.96%, and 92.61%, respectively. The LRSA and SPPF-DW combination yielded precision, recall, and mAP@0.5 of 92.68%, 87.09%, and 92.94%, respectively.
(4) The eighth group demonstrates the results with all improvements integrated. Compared to the baseline model, precision increased by 2.82%, recall improved by 2.1%, and mAP@0.5 was enhanced by 2.02%.

4.3. Comparative Analysis of Detection Performance Among Different Models

To comprehensively evaluate the performance of the YOLOv11-ST model, this study selects YOLOv8 [21], YOLOv10 [22], Faster RCNN [23], YOLOv8-MEB [24], YOLOv11 [25], and Insect-YOLO [26] as comparative models. As shown in Table 5, in terms of the mAP@0.5 metric, YOLOv11-ST achieved improvements of 4.1% compared to YOLOv8, 2.64% compared to YOLOv10, 2.02% compared to YOLOv11, 5.87% compared to Faster RCNN, 2.76% compared to YOLOv8-MEB, and 2.16% compared to Insect-YOLO. The YOLOv11-ST model also demonstrated outstanding performance in the mAP@0.5–0.95 metric, with improvements of 3.25% over YOLOv8, 3.75% over YOLOv10, 1.8% over YOLOv11, 8.99% over Faster RCNN, 4.48% over YOLOv8-MEB, and 2.44% over Insect-YOLO.
To more intuitively demonstrate termite detection in images, the experimental results are shown in Figure 10. The results indicate that the improved YOLOv11-ST outperforms all baseline models in terms of Precision, Recall, and mAP@0.5, with a particularly notable improvement of 2.02% in mAP@0.5 (compared to YOLOv11). This enhancement can be attributed to the FDConv module’s frequency-dynamic convolution, which improves the capture of termite texture details, while the LRSA and SPPF-DW modules strengthen feature interactions within local regions, thereby enhancing the recognition capability for small termites. Table 5 presents the evaluation results for each model.
Figure 10. Object detection models’ performance comparison.
Table 5. The results of each model evaluation index.

4.4. Swarming Termite Counting Experiment

To validate the feasibility of the swarming termite counting method proposed in this paper, we conducted comparative experiments using three widely used object tracking algorithms, namely Sort [27], DeepSort [28], and ByteTrack [29], on the same set of 50 swarming termite videos, based on the improved YOLOv11 model presented in this study. The results indicate that duplicate counting frequently occurs with conventional counting algorithms, primarily due to the rapid flight speed of termites, which causes target ID loss. In contrast, our algorithm leverages extracted features as unique biological fingerprints for deduplication, achieving an average improvement of 10.6% compared to Sort, 6.8% compared to DeepSort, and 5% compared to ByteTrack. The performance of different counting algorithms is summarized in Table 6.
Table 6. Result of the swarming termites counting experiment.
A comparative analysis of swarming termite counting was conducted using Sort, DeepSort, ByteTrack, and the proposed method on 50 swarming termite videos. As shown in Figure 11, linear regression analysis revealed that although all four algorithms demonstrated a certain correlation between machine counts and manual counts, our method exhibited a superior correlation (R² = 0.9632) compared to the other three algorithms. This result underscores that our method provides an exceptional fit for accurate termite counting.
Figure 11. Different models compared with manual counting.

5. Deployment of YOLOv11-ST

Currently, YOLOv11-ST has been deployed at the swarm-flying termite monitoring station in the Nuozhadu area, Pu’er City, Yunnan Province, China. Figure 12 shows the termite swarming detection station.
Figure 12. Termite swarming detection station.

6. Discussion

To address challenges such as duplicate counting and small target recognition in swarming termite detection and counting, an improved model named YOLOv11-ST, based on YOLOv11, is proposed. This model integrates frequency dynamic convolution and an attention mechanism for local feature interaction to enhance its focus on critical feature regions. To improve adaptability to target morphological diversity, an SPPF-DW module was designed to replace the original SPPF module, aiming to mitigate feature loss for small targets and strengthen the detection performance for swarming termites. On the swarming termite dataset, YOLOv11-ST achieves higher detection accuracy than YOLOv11, with Precision improved by 2.82% and mAP@0.5 increased by 2.02%. Compared to Insect-YOLO, a model specifically designed for insect detection, our approach exhibits superior adaptability to termite-specific characteristics, with Precision improved by 2.38% and mAP@0.5 increased by 2.16%. Furthermore, YOLOv11-ST consistently achieves performance improvements over other baseline models across multiple evaluation metrics. Subsequently, a novel counting algorithm was proposed to overcome the limitation of conventional target tracking methods in accurately counting swarming termites. Experimental results demonstrate that the proposed approach achieves a higher AAR (%) compared to commonly used tracking algorithms.
This study provides a practical solution for swarming termite detection and counting, laying a solid foundation for early warning of termite infestation and intelligent control. From the perspective of biological monitoring, the success of this model enables more frequent and cost-effective non-invasive censuses. The high-precision detection capability facilitates early warning of termite infestations, significantly enhancing monitoring efficiency while reducing labor costs. Furthermore, the approach achieves standardization and digitalization of the detection process, effectively eliminating subjective biases inherent in manual inspections. In future work, we will further explore the adaptability of this model to various termite species.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W.; software, Y.W.; validation, Y.W.; formal analysis, Y.W.; investigation, Y.W.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W., H.W. and T.C.; visualization, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Science and Technology Department of Henan Province Key Technology R&D Program (252102210030) and Key Scientific Research Project of Higher Education Institutions in Henan Province (25A520006).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We thank all those individuals who assisted us for their invaluable support, encouragement, and guidance throughout the research process.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

References

  1. Oi, F. A review of the evolution of termite control: A continuum of alternatives to termiticides in the United States with emphasis on efficacy testing requirements for product registration. Insects 2022, 13, 50. [Google Scholar] [CrossRef] [PubMed]
  2. Wu, D.; Seibold, S.; Ellwood, M.D.F.; Chu, C. Differential effects of vegetation and climate on termite diversity and damage. J. Appl. Ecol. 2022, 59, 2922–2935. [Google Scholar] [CrossRef]
  3. Huang, J.H.; Liu, Y.T.; Ni, H.C.; Ni, H.C.; Chen, B.-Y.; Huang, S.-Y.; Tsai, H.-K.; Li, H.-F. Termite pest identification method based on deep convolution neural networks. J. Econ. Entomol. 2021, 114, 2452–2459. [Google Scholar] [CrossRef] [PubMed]
  4. Cho, J.; Choi, J.; Qiao, M.; Ji, C.W. Automatic identification of whiteflies, aphids and thrips in greenhouse based on image analysis. Int. J. Math. Comput. Simul. 2007, 346, 244. [Google Scholar]
  5. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  6. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  7. Zhong, Y.; Gao, J.; Lei, Q.; Zhou, Y. A vision-based counting and recognition system for flying insects in intelligent agriculture. Sensors 2018, 18, 1489. [Google Scholar] [CrossRef] [PubMed]
  8. Saradopoulos, I.; Potamitis, I.; Ntalampiras, S.; Konstantaras, A.I.; Antonidakis, E.N. Edge computing for vision-based, urban-insects traps in the context of smart cities. Sensors 2022, 22, 2006. [Google Scholar] [CrossRef] [PubMed]
  9. Dai, M.; Dorjoy, M.M.H.; Miao, H.; Zhang, S. A new pest detection method based on improved YOLOv5m. Insects 2023, 14, 54. [Google Scholar] [CrossRef] [PubMed]
  10. Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Zhao, X.; Kim, T.-K. Multiple object tracking: A literature review. Artif. Intell. 2021, 293, 103448. [Google Scholar] [CrossRef]
  11. Li, Z.; Zhu, Y.; Sui, S.; Zhao, Y.; Liu, P.; Li, X. Real-time detection and counting of wheat ears base on improved YOLOv7. Comput. Electron. Agric. 2024, 218, 108670. [Google Scholar] [CrossRef]
  12. Wang, T.; Zhao, L.; Li, B.; Liu, X.; Xu, W.; Li, J. Recognition and counting of typical apple pests based on deep learning. Ecol. Inform. 2022, 68, 101556. [Google Scholar] [CrossRef]
  13. Chen, L.; Gu, L.; Li, L.; Yan, C.; Fu, Y. Frequency Dynamic Convolution for Dense Image Prediction. arXiv 2025, arXiv:2503.18783. [Google Scholar] [CrossRef]
  14. Liu, X.; Liu, J.; Tang, J.; Wu, G. CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution. arXiv 2025, arXiv:2503.06896. [Google Scholar] [CrossRef]
  15. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar] [CrossRef]
  16. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. arXiv 2017, arXiv:1709.01507. [Google Scholar]
  17. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 10323–10333. [Google Scholar]
  18. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  19. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  20. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  21. Varghese, R.; Sambath, M. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar]
  22. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  23. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  24. Wang, H.; Guo, X.; Zhang, S.; Li, G.; Zhao, Q.; Wang, Z. Detection and recognition of foreign objects in Pu-erh Sun-dried green tea using an improved YOLOv8 based on deep learning. PLoS ONE 2025, 20, e0312112. [Google Scholar] [CrossRef] [PubMed]
  25. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  26. Wang, N.; Fu, S.; Rao, Q.; Zhang, G.; Ding, M. Insect-YOLO: A new method of crop insect detection. Comput. Electron. Agric. 2025, 232, 110085. [Google Scholar] [CrossRef]
  27. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
  28. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  29. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
