1. Introduction
The expected population increase from 8.2 billion in 2024 to a peak population of 10.3 billion in the mid-2080s [
1] drives an increased demand for food, which, in turn, results in the need for crop expansion. Nevertheless, crop production is affected by various factors such as crop pests and pollinator decline. According to [
2], insect pests cause an estimated crop production loss of 18–20% worldwide. On the other hand, the decline of pollinators is an increasingly alarming problem for farmers. Pollinators support the seed production of 87 of the principal food crops globally, emphasizing the critical need for pollinator conservation, especially since approximately 40% of pollinators, such as bees and butterflies, are at risk of extinction [
3]. Thus, effective insect management plays a crucial role in reducing crop losses, helping to maintain healthy and productive harvests. By adequately managing insect populations, farmers can mitigate the damage caused by pests while simultaneously promoting the vital role of pollinators. This balanced approach not only reduces crop losses but also enhances crop yield quality, thereby increasing its market value [
4]. Effective management is essential in both pest control and the conservation of pollinators, which are integral to the reproduction of many crops, ensuring long-term agricultural productivity and environmental health [
5].
Traditionally, insect management has relied on human resources. However, this approach is not a practical solution for large-scale crops, as it requires a high level of expertise and is an arduous task that demands considerable time. Therefore, automated methods are essential for addressing these limitations [
6].
Machine learning (ML) has been one of the main tools for automating this task. Traditional ML techniques, such as support vector machines and k-nearest neighbors, have been effectively applied across a wide variety of areas, such as crop yield prediction [
7], breast cancer diagnosis [
8], solar radiation forecasting [
9], protein function prediction [
10], among others. However, these methods depend on manually engineered features, whose design often requires domain expertise and whose extraction can be time-consuming [
11]. Furthermore, these manually crafted features often fail to capture wide-ranging, large-scale variations in object shape, such as the vast diversity of insect species, shapes, and sizes [
12], which consequently diminishes the performance of ML techniques.
To overcome these limitations of traditional ML, convolutional neural networks (CNNs) have proven to be a powerful alternative. Unlike traditional ML approaches, CNNs can automatically learn hierarchical feature representations from raw image data, eliminating the need for manual feature extraction [
11]. This ability makes CNNs particularly suitable for tasks involving complex visual patterns, such as insect detection, where variations in species, shapes and orientations pose significant challenges. CNN-based models have demonstrated remarkable success in various computer vision applications such as image classification [
13,
14], object detection [
15,
16] and image segmentation [
17,
18], enabling more accurate and robust object recognition. Nonetheless, insect detection remains particularly challenging because insects are small, often occupying only a tiny fraction of the pixels in an image, which limits the visual information available for CNNs to extract meaningful features.
This has motivated extensive research on improving CNN-based models for small object detection, particularly in the context of insect recognition, which remains an active research area. Next, we provide an overview of related work, highlighting key advancements in this field.
Related Work
As previously noted, both traditional ML techniques and CNNs have been applied to automate pest detection tasks. In [
19], a comparative study between these approaches is presented for automated pest detection and identification on greenhouse tomato and pepper crops. They show that CNNs offer superior performance, achieving higher accuracy than traditional ML methods and effectively distinguishing between visually similar pest classes.
However, the images used in this study were captured under controlled environmental conditions, specifically within a cultivation chamber. As a result, the model shows limited effectiveness when applied to real-world field scenarios. This performance gap arises from the fact that, in field conditions, models must cope with additional challenges such as diverse backgrounds, varying lighting conditions, occlusion, and substantial changes in the apparent characteristics of pests caused by differences in viewing angle and distance during image acquisition [
20,
21]. Studies such as [
20,
22] report that models trained on images acquired in controlled environments tend to perform poorly when tested under actual field conditions. To address this limitation, ref. [
22] recommends prioritizing the inclusion of images captured under real-world conditions in the training dataset.
To overcome this limitation, efforts have focused on the collection of datasets composed of images captured directly in the field. For instance, ref. [
23] introduces a dataset specifically designed to reflect real-world conditions, containing 750,000 images of 102 insect classes. Such datasets are essential for improving the generalization capabilities of deep learning models and ensuring their applicability in practical agricultural scenarios.
Regarding insect detection in real-world scenarios, ref. [
24] proposed an object detection algorithm based on YOLOv3 for the early detection of tomato diseases and pests, in which dilated convolutions replace standard convolution layers in the network backbone, improving its ability to detect small objects.
Wang et al. [
25] constructed a three-scale CNN with channel and spatial attention mechanisms and showed that the proposed method achieves both high speed and high accuracy in crop pest detection, outperforming architectures such as VGG16.
Recently, transformer-based architectures have gained significant attention in the deep learning community. Originally introduced for natural language processing tasks [
26], the transformer architecture has since been successfully adapted for various computer vision applications. One such adaptation is Vision Transformer (ViT) [
27]. Building on this approach, ref. [
28] proposed a multi-scale and multi-factor ViT attention model for pest detection and classification over the IP102 [
23] dataset. Their results demonstrate that transformer-based models can improve both classification and detection performance.
An interesting approach is [
29], where motion information is used to enhance the visibility of insects in time-lapse RGB images. The preprocessed images are fed into a CNN. The proposed method improves the performance of object detection models such as YOLO and Faster R-CNN. Additionally, the authors provide a dataset containing 107,383 annotated time-lapse images, including 9423 labeled insect instances. Several other insect monitoring systems have been deployed using traps equipped with cameras, such as [
30,
31]. These systems are designed to attract and capture insects for close-range imaging, often resulting in high-resolution data suitable for classification tasks. However, they typically rely on different lures or attractants [
32], which may introduce sampling bias and alter insect behavior. Furthermore, the majority of trapping methods are invasive, which results in rare insect species being killed [
29]. A time-lapse setup avoids interference with natural insect behavior and supports broader spatial coverage.
The use of motion has proven effective for enhancing object visibility and improving model performance in object detection tasks. Nevertheless, the integration of motion information remains relatively underexplored, particularly in challenging environments. In addition to the approach proposed by [
29], other studies have also leveraged motion cues to distinguish objects from complex backgrounds. For instance, ref. [
33] utilized motion information from video sequences by applying background subtraction and optical flow to detect freely moving fish in underwater environments. The outputs of a Gaussian Mixture Model (GMM) and optical flow were combined with raw video frames containing texture and shape information. This combined input was then fed into an R-CNN model, which achieved state-of-the-art performance in the fish detection task.
Another example is presented in [
34], where a novel approach is presented that integrates a CNN model, specifically a Single-Shot Detector with a MobileNetV2 backbone, with motion cues. In this method, intermediate feature maps from the CNN are used to identify regions of motion within each frame. By focusing only on moving objects, the approach effectively reduces false-positive detections and improves computational efficiency by lowering the average processing time.
One of the major drawbacks of methods such as [
29,
33] is their reliance on preprocessing steps to incorporate motion information. For instance, ref. [
33] employs a GMM to separate foreground objects from the background, a computationally intensive process that requires multiple frames to construct a reliable background model. Similarly, ref. [
29] requires a motion-enhanced image to be generated from three consecutive frames before being passed to the detection network. This preprocessing step not only adds complexity but also depends on frame differencing, which is highly sensitive to lighting variations commonly encountered in real-field scenarios. These limitations highlight the need for more efficient and robust approaches to motion integration in object detection.
Feature maps extracted by convolutional layers are distortion- and translation-invariant, allowing a CNN to hierarchically refine increasingly intricate patterns [
35]. Several works have operated on feature maps to enhance predictions; for example, ref. [
36] proposes accelerating image classification through feature map similarity, while ref. [
37] uses differences between feature maps to extract geometric patterns in medical images. On the other hand, ref. [
38] combines feature maps from different layers to offer scale-robust object detection.
To address challenges mentioned above, we propose the motion module, a lightweight component designed to enhance object detection performance by integrating motion information directly at the feature map level within the YOLOv8 backbone. This approach eliminates the need for any preprocessing and introduces only minimal computational overhead.
The main contributions of this work are as follows:
We introduce the motion module, a lightweight and efficient component that boosts YOLOv8 performance by incorporating motion cues directly into the network backbone. By injecting motion information at an early stage of the feature extraction pipeline, the module enables the network to learn richer representations without incurring significant computational overhead. A key aspect of our design is the prioritization of a balanced trade-off between performance and computational cost. The module maintains a low parameter count, ensuring that the improved accuracy does not come at the expense of efficiency.
Our method avoids any external preprocessing and operates with only two consecutive raw frames as input, allowing the network to extract motion cues internally without relying on handcrafted features. This design choice enables a more streamlined and efficient pipeline that is easier to deploy. Processing only two frames at a time also reduces memory usage. This lightweight integration of motion information makes the approach well-suited for resource-constrained scenarios, such as embedded systems or mobile platforms.
We demonstrate that the proposed approach is robust to lighting variations, making it well-suited for real-world field conditions. This robustness allows the model to maintain high detection accuracy under challenging scenarios such as illumination changes, shadows and glare. As a result, the method can be reliably deployed in dynamic outdoor settings where lighting conditions are often unpredictable and uncontrollable, such as agricultural fields.
2. Materials and Methods
2.1. Dataset
To evaluate the performance of the proposed model, we conducted experiments using the publicly available dataset introduced in [
29]. This dataset serves as a benchmark for assessing object detection models targeting small insects in real-world environments.
This dataset consists of time-lapse images captured at 30-s intervals. The camera system was used to monitor three different plant species:
Trifolium pratense (red clover),
Cakile maritima (sea rocket), and
Malva sylvestris (common mallow). While the primary focus of the monitoring system was to capture honeybees visiting the plants, the dataset also includes occasional appearances of other animals such as beetles, butterflies, and spiders. This dataset provides realistic scenarios with natural lighting, vegetation and background variability, making it suitable for evaluating detection models in outdoor environments. All images in the dataset are in high definition, with a resolution of
pixels (width × height), which allows small insects to be visible. Additionally, the distribution of bounding box sizes was analyzed through boxplots of normalized dimensions.
Figure 1 shows that the normalized bounding box width reaches up to approximately 0.100 of the image width, while the height reaches up to approximately 0.175 of the image height. This suggests that most bees occupy a relatively small portion of the image, which highlights the importance of sufficient resolution and sharpness for reliable detection.
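As a minimal sketch of how such a distribution can be computed, the snippet below reads YOLO-format label files (class, x-center, y-center, width, height, all normalized to the image size) and plots boxplots of the normalized widths and heights; the directory layout is an illustrative assumption.
```python
from pathlib import Path

import matplotlib.pyplot as plt

widths, heights = [], []
for label_file in Path("dataset/labels").glob("*.txt"):     # hypothetical label directory
    for line in label_file.read_text().splitlines():
        parts = line.split()
        if len(parts) == 5:                                  # class, x_center, y_center, width, height
            _, _, _, w, h = map(float, parts)
            widths.append(w)
            heights.append(h)

# Boxplots of normalized bounding-box width and height (cf. Figure 1)
plt.boxplot([widths, heights], labels=["width", "height"])
plt.ylabel("Normalized size")
plt.show()
```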
During image acquisition, the camera remained stationary during individual recordings but was relocated periodically throughout the monitoring period to ensure coverage of different flowering plants from both lateral and overhead perspectives.
The authors in [
29] provide predefined train, validation and test splits of the dataset. However, the training and validation sets do not contain consecutive time-lapse images captured at 30-s intervals. To address this limitation, we constructed a modified version of the dataset by selecting samples exclusively from the original test split, which preserves temporal continuity between frames. This modified dataset ensures a consistent 30-s interval between images, allowing us to leverage motion information more effectively. Summary statistics of the original test split are presented in
Table 1.
As mentioned previously, we constructed a modified version of the dataset by selecting samples exclusively from the original test split. However, as shown in
Table 1, a significant proportion of these images do not contain any insects. To address this issue, we filtered the dataset by selecting only the images that included at least one insect for each camera system. Then, to balance the dataset, we randomly sampled an equal number of images without insects (background images) from the corresponding camera system, ensuring a balanced distribution for training and evaluation. Additionally, we performed data cleaning on this modified dataset by removing false-positive annotations and out-of-focus images. As previously noted, some images contained insects other than honeybees; however, due to the small proportion of these instances, such images were excluded from the dataset. This refinement allowed us to focus exclusively on the detection of honeybees. The statistics of this modified dataset are shown in
Table 2.
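The filtering and balancing procedure can be sketched as follows; the file layout and the way the camera system is inferred from file names are illustrative assumptions rather than the exact implementation.
```python
import random
from collections import defaultdict
from pathlib import Path

random.seed(0)
groups = defaultdict(lambda: {"insect": [], "background": []})

for image in Path("original_test/images").glob("*.jpg"):        # hypothetical layout
    camera_id = image.stem.split("_")[0]                        # assumed naming convention
    label = Path("original_test/labels") / f"{image.stem}.txt"
    has_insect = label.exists() and label.read_text().strip() != ""
    groups[camera_id]["insect" if has_insect else "background"].append(image)

selected = []
for camera_id, split in groups.items():
    insect_images = split["insect"]
    # Sample as many background images as insect images for this camera system.
    n_background = min(len(insect_images), len(split["background"]))
    selected.extend(insect_images + random.sample(split["background"], n_background))
```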
As indicated in
Table 2, the resulting dataset comprises a total of 9676 images, evenly distributed between samples containing honeybees and background-only images. To facilitate model development and ensure a fair evaluation, we partitioned the dataset into training, validation, and test sets following a 0.8/0.1/0.1 ratio, respectively. This split was performed in a stratified manner across camera systems to preserve the diversity of scenes and plant species in each subset. Some instances from this dataset are shown in
Figure 2.
To further evaluate the generalization capability of our model across different insect types, we created a synthetic dataset by overlaying pest images onto raspberry plant images. This synthetic dataset was necessary due to the scarcity of publicly available datasets featuring insects in motion under field conditions. By generating this dataset, we are able to simulate a wide range of insect appearances, poses, and positions, which allows us to more thoroughly evaluate the generalization capability of our approach.
For the synthetic dataset generation, we first captured high-resolution images of raspberry plants in agricultural fields located in Jocotepec, Jalisco, Mexico. To simulate the presence of multiple pest species, we then artificially overlaid pest images onto these plant background images. The pests included in this synthetic dataset are nine common species known to affect raspberry crops:
Tetranychus urticae (red spider),
Polyphagotarsonemus latus (broad mite),
Coccoidea (coccidae),
Curculionidae (weevil),
Cotinis mutabilis (figeater),
Rhagoletis spp.,
Lampronia corticella (raspberry moth),
Aphididae (aphid) and
Frankliniella spp. (thrips). According to [
39], it is uncommon for multiple competing pest species to coexist on the same individual plant. Therefore, each synthetic image includes only one pest type overlaid per plant background, aligning with real-world biological patterns. For each generated image, we simulate a corresponding previous frame by shifting the position of the insects present in the scene.
For each pest class, a total of 1000 main frames were generated, each with the insect placed at a particular position. Then, for each of these frames, a corresponding previous frame was generated by shifting the position of each insect within the scene. This process resulted in 2000 images per class: 1000 main frames for pest detection and 1000 images used to provide temporal context, enabling the integration of motion information into the detection pipeline.
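A minimal sketch of how a main/previous frame pair can be synthesized is shown below; the library choice, file names, and shift ranges are illustrative assumptions rather than the exact generation pipeline.
```python
import random

from PIL import Image

def compose_frame(background: Image.Image, pest: Image.Image, x: int, y: int) -> Image.Image:
    """Paste a pest crop (with alpha channel) onto a copy of the background at (x, y)."""
    frame = background.copy()
    frame.paste(pest, (x, y), pest)
    return frame

background = Image.open("raspberry_background.jpg").convert("RGB")   # hypothetical file
pest = Image.open("aphid.png").convert("RGBA")                       # hypothetical file

x, y = random.randint(0, 800), random.randint(0, 600)
main_frame = compose_frame(background, pest, x, y)

# The "previous" frame places the same pest at a slightly shifted position.
dx, dy = random.randint(-20, 20), random.randint(-20, 20)
previous_frame = compose_frame(background, pest, x + dx, y + dy)
```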
As with the honeybee dataset, we created training, validation, and test splits using a 0.8/0.1/0.1 ratio, respectively. The splits were performed in a stratified manner to preserve the distribution of images for each insect class across all splits. All images were cropped to a size of
pixels. Example images from this synthetic dataset are shown in
Figure 3.
As with the honeybee dataset, we present the distribution of bounding box sizes through boxplots of normalized bounding box dimensions for all insect classes, as illustrated in
Figure 4. As shown, the bounding boxes for pests are relatively small, with both dimensions concentrated around low normalized values. This reflects the small visual size of the pests within the images, which complicates the detection process. Furthermore, this challenge highlights the importance of incorporating motion cues for improving detection performance.
2.2. Method
Insects in natural environments can be small, camouflaged, or partially occluded, making them difficult to detect. To enhance detection performance in real-world conditions, we extend the conventional still-image object detection framework by incorporating an adjacent frame as input. This design allows the network to extract motion cues that are not available in a single image. These cues are particularly useful for detecting insects under the aforementioned challenging conditions, where they may be indistinguishable from the background in static images. By providing dynamic context, the model can better localize and classify moving insects, leading to improved performance under field conditions.
These motion cues allow the network to identify subtle motion patterns that help distinguish insects from the background.
In this work, our method is based on the YOLOv8 architecture, which is illustrated in
Figure 5. We begin by introducing the proposed motion module, followed by a detailed description of its training procedure. The overall structure of the motion module is depicted in
Figure 6.
This module can be inserted after any layer in the YOLOv8 backbone, prior to the SPPF block. It takes as input two feature maps: the output of layer N at time t and the output of the same layer at time t−1, denoted in Figure 6 as $F_N^{t}$ and $F_N^{t-1}$, respectively.
To encode motion information, the module computes the difference between these consecutive feature maps and applies the absolute value operation, $|F_N^{t} - F_N^{t-1}|$; this absolute difference is then normalized using batch normalization. We empirically evaluated both alternatives (using the raw residual and using its absolute value) and found that the latter consistently led to better detection performance. The use of the absolute value highlights the magnitude of changes regardless of direction, allowing the network to focus on relevant dynamic information without being sensitive to sign.
Next, the original feature map is concatenated with the normalized difference along the channel dimension. Assuming each input feature map has C channels, the concatenated feature map will have 2C channels. To integrate the motion information and restore the channel count to C, a convolution is applied. This ensures compatibility with the subsequent layer in the network.
This process is summarized in Equation (1):
$$M\left(F_N^{t}, F_N^{t-1}\right) = \mathrm{Conv}\left(\mathrm{Concat}\left(F_N^{t},\; \mathrm{BN}\left(\left|F_N^{t} - F_N^{t-1}\right|\right)\right)\right) \tag{1}$$
where:
$M(\cdot,\cdot)$ is the motion module.
$F_N^{t}$ is the feature map up to layer N for the image at time t.
$F_N^{t-1}$ is the feature map up to layer N for the image at time t−1.
$\mathrm{Conv}(\cdot)$ is a convolution.
$\mathrm{Concat}(\cdot,\cdot)$ is the concatenation along the channel dimension.
$\mathrm{BN}(\cdot)$ is the BatchNorm2D operation.
Assuming that the dimensions of $F_N^{t}$ and $F_N^{t-1}$ are $C \times H \times W$, the output of the motion module, denoted as $M(F_N^{t}, F_N^{t-1})$, will also have dimensions $C \times H \times W$. This is ensured by the convolution, which reduces the concatenated feature map back to the original number of channels. As a result, the output of the module matches the expected input dimensions of layer N+1.
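To make the computation in Equation (1) concrete, the following is a minimal PyTorch sketch of the motion module; the 1×1 kernel size of the channel-reducing convolution and the class and variable names are assumptions for illustration, not the exact implementation.
```python
import torch
import torch.nn as nn

class MotionModule(nn.Module):
    """Sketch of Equation (1): fuse a feature map with its normalized absolute temporal difference."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)                               # normalizes |F_t - F_{t-1}|
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)   # 2C -> C (assumed 1x1 conv)

    def forward(self, f_t: torch.Tensor, f_t_prev: torch.Tensor) -> torch.Tensor:
        diff = self.bn(torch.abs(f_t - f_t_prev))        # normalized absolute difference
        fused = torch.cat([f_t, diff], dim=1)            # concatenation along the channel dimension
        return self.reduce(fused)                        # restore the original channel count C

# Example: feature maps with C = 128 channels and 80 x 80 spatial resolution
f_t = torch.randn(1, 128, 80, 80)
f_t_prev = torch.randn(1, 128, 80, 80)
out = MotionModule(128)(f_t, f_t_prev)
print(out.shape)  # torch.Size([1, 128, 80, 80]), matching the input expected by the next backbone layer
```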
Having described the structure and functionality of the motion module, we now present the training procedure used to optimize its parameters within the YOLOv8 network.
We begin with a YOLOv8 model pretrained on the COCO dataset, and fine-tune it for 100 epochs using one of the insect datasets presented in the previous section.
After completing this initial training phase, we transform the model into a Siamese architecture; that is, the model consists of two identical branches with shared weights that process two input frames in parallel (the main frame and the previous frame), as illustrated in Figure 7. Specifically, the inputs are the image at time t and the image at time t−1, denoted as $I_t$ and $I_{t-1}$, respectively, in Figure 7.
In Figure 7, each block in the network backbone before the SPPF block is labeled as $B_n$, where n denotes the block index. The motion module can be inserted after any of these blocks; the example shown illustrates one such placement.
Both input images, $I_t$ and $I_{t-1}$, are processed in parallel through identical network branches up to block $B_n$, where n is the index of the block preceding the placement of the motion module. At this point, the motion module integrates the motion information derived from the two branches into the feature map corresponding to $I_t$, effectively ending the Siamese structure. From this point onward, the original YOLOv8 pipeline resumes, using the enhanced features as input.
If the neck of the network receives input from a block preceding the motion module (i.e., a stage still within the Siamese structure), it only utilizes the output from the branch processing $I_t$, ignoring the features from the $I_{t-1}$ branch. This ensures consistency with the original single-frame processing design of YOLOv8.
All blocks in the trained YOLOv8 model, except for the motion module, are frozen during the second training phase. As illustrated in
Figure 7, the blocks highlighted in blue indicate layers whose weights remain unchanged throughout this phase. Only the parameters of the motion module are updated, since its weights are initialized randomly. Freezing the rest of the network prevents the large gradients expected during the initial updates of the motion module from adversely affecting the pretrained weights, thereby preserving the knowledge previously learned by the base YOLOv8 model. During this second training phase, we train the model for an additional 100 epochs, updating only the parameters of the motion module. This selective fine-tuning allows the module to effectively learn and encode motion information while preserving the knowledge already acquired by the rest of the network.
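The freezing scheme of this second phase can be sketched in PyTorch as follows; the SiameseYOLOStub class and its attribute names are placeholders standing in for the actual Siamese YOLOv8 model, and the optimizer settings follow the hyperparameters reported in Section 2.3.
```python
import torch
import torch.nn as nn

class SiameseYOLOStub(nn.Module):
    """Placeholder for the Siamese YOLOv8 with an inserted motion module (illustrative names only)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 128, kernel_size=3, padding=1)   # stands in for the pretrained blocks
        self.motion_module = nn.Conv2d(256, 128, kernel_size=1)       # stands in for the motion module

model = SiameseYOLOStub()

# Freeze every pretrained parameter, then unfreeze only the randomly initialized motion module.
for param in model.parameters():
    param.requires_grad = False
for param in model.motion_module.parameters():
    param.requires_grad = True

# Only the trainable (motion module) parameters are updated by the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.01,
    momentum=0.937,
)
```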
In the following section, we compare the results obtained on the modified dataset using our method with those obtained by the method presented in [
29]. To provide context for this comparison, we first offer a brief explanation of the referenced method.
In [
29], the authors incorporate a preprocessing step that enhances images with motion information prior to feeding them into the neural network. To generate a motion-enhanced image, three consecutive frames are used. These RGB frames are first converted to grayscale and then blurred using a Gaussian kernel. The resulting grayscale, blurred frame at time index k is denoted as $g_k$. To enhance frame k, the adjacent frames $g_{k-1}$ and $g_{k+1}$ are also considered. Two frame differences are then computed: the difference between $g_k$ and $g_{k-1}$, denoted as $D_{k-1}(x, y)$, and the difference between $g_{k+1}$ and $g_k$, denoted as $D_{k+1}(x, y)$, where $(x, y)$ are the pixel coordinates. These computations are formalized in Equations (2) and (3), respectively:
$$D_{k-1}(x, y) = g_k(x, y) - g_{k-1}(x, y) \tag{2}$$
$$D_{k+1}(x, y) = g_{k+1}(x, y) - g_k(x, y) \tag{3}$$
Then, using this information, a motion likelihood, $L_k(x, y)$, is created by summing the absolute values of $D_{k-1}(x, y)$ and $D_{k+1}(x, y)$, as shown in Equation (4):
$$L_k(x, y) = \left|D_{k-1}(x, y)\right| + \left|D_{k+1}(x, y)\right| \tag{4}$$
Next, the RGB channels of the original color image at time k are modified to generate the motion-enhanced image, denoted as M. In this enhanced image, the blue channel ($M_B$) contains a combination of the original blue ($B_k$) and red ($R_k$) channels, as shown in Equation (5). The original red channel ($R_k$) is replaced with the motion likelihood ($L_k$), as defined in Equation (6). Finally, the green channel remains unaltered; the original green channel ($G_k$) is directly copied into the enhanced image ($M_G$), as shown in Equation (7).
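As a rough illustration of this preprocessing pipeline, the sketch below implements the grayscale blurring, the motion likelihood of Equation (4), and the channel remapping with NumPy and OpenCV; the 15×15 Gaussian kernel and the simple averaging used for the blue/red combination of Equation (5) are placeholder assumptions rather than the exact values used in [29].
```python
import cv2
import numpy as np

def blurred_gray(frame_bgr: np.ndarray) -> np.ndarray:
    """Convert a BGR frame to grayscale and blur it (kernel size is an assumed placeholder)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(gray, (15, 15), 0).astype(np.float32)

def motion_enhanced(prev_bgr: np.ndarray, curr_bgr: np.ndarray, next_bgr: np.ndarray) -> np.ndarray:
    """Build a motion-enhanced image from three consecutive frames, following Equations (2)-(7)."""
    g_prev, g_curr, g_next = map(blurred_gray, (prev_bgr, curr_bgr, next_bgr))
    likelihood = np.abs(g_curr - g_prev) + np.abs(g_next - g_curr)      # Equations (2)-(4)
    likelihood = np.clip(likelihood, 0, 255).astype(np.uint8)

    b, g, r = cv2.split(curr_bgr)
    blue = ((b.astype(np.uint16) + r.astype(np.uint16)) // 2).astype(np.uint8)  # assumed blue/red combination
    # Green channel is copied unchanged; the red channel is replaced by the motion likelihood.
    return cv2.merge([blue, g, likelihood])
```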
An example of this process applied to an image at time
k containing a honeybee is shown in
Figure 8 and
Figure 9.
Figure 8 shows the original RGB image alongside its corresponding motion likelihood, where the movement of the honeybee-as well as some flower motion caused by wind-can be observed.
Figure 9 presents the resulting motion enhanced image, in which the honeybee is highlighted in red.
2.3. Experimental Setup
To promote faster inference and reduce the computational load, all images were resized to pixels. The following hyperparameters were used in our experiments: an initial learning rate of 0.01, a learning rate momentum of 0.937, and a batch size of 32. Training was conducted using the Stochastic Gradient Descent (SGD) optimizer.
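For reference, the first fine-tuning phase with the hyperparameters above could be configured roughly as in the following sketch using the Ultralytics API; the dataset YAML path and the value passed to imgsz are illustrative placeholders, since the exact resized resolution is not reproduced here.
```python
from ultralytics import YOLO

# Start from COCO-pretrained weights (use "yolov8s.pt" for the larger variant).
model = YOLO("yolov8n.pt")

model.train(
    data="honeybees.yaml",   # hypothetical dataset configuration file
    epochs=100,
    batch=32,
    optimizer="SGD",
    lr0=0.01,                # initial learning rate
    momentum=0.937,
    imgsz=640,               # placeholder image size
)
```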
The hardware configuration used for all experiments consisted of an Intel Core i7 (13th generation) CPU (Intel, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4060 GPU (NVIDIA, Santa Clara, CA, USA) with 8 GB of VRAM.
The software environment was set up using Python 3.8, PyTorch 2.4.1, and CUDA 12.6, running on Windows 11.
2.4. Evaluation Metrics
To evaluate the performance of the models, we use a set of standard object detection metrics: Precision, Recall, F1 Score, and mean Average Precision (mAP). These metrics provide a detailed assessment of the detection capabilities of the models across different configurations. In addition, we report the number of GFLOPs (Giga Floating Point Operations) to quantify the computational cost associated with each model. FLOPs measure the total number of floating-point operations required to process a single input. One GFLOP corresponds to one billion ($10^9$) FLOPs. GFLOPs thus offer a convenient estimate of the inference complexity of the model.
To better understand the evaluation process, we briefly describe the formulation of each metric used:
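Denoting true positives, false positives, and false negatives as TP, FP, and FN, the standard formulations of these metrics are sketched below; mAP50 averages the per-class average precision $AP_c$ computed at an IoU threshold of 0.5 over the N classes.
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{mAP50} = \frac{1}{N} \sum_{c=1}^{N} AP_c$$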
3. Experimental Results and Discussion
Given the objective of achieving a lightweight and efficient model, we restrict our evaluation to the YOLOv8n and YOLOv8s variants, which offer a favorable balance between performance and model size. The reported results correspond to a single training run. We observed that performance was consistent across different training sessions due to two main factors. First, the base YOLO network was always initialized with the same pretrained weights obtained from the dedicated insect dataset. Second, the newly introduced motion module was initialized using a fixed random seed. These conditions ensure consistency across training runs.
In our experiments, we first utilized a YOLOv8 model pretrained on the modified honeybees dataset and tested the insertion of the motion module after different blocks of the network backbone, from
to
in
Figure 7. Each configuration was independently trained and evaluated to determine the most effective placement of the motion module. The performance of each configuration is summarized in
Table 3 for YOLOv8n and in
Table 4 for YOLOv8s, where we present the results on the validation set across different block placements. For a clear comparison, these results are also visualized in
Figure 10 for YOLOv8n and
Figure 11 for YOLOv8s, using a bar chart that highlights the impact of the motion module on the performance of the model at each stage.
Table 3 shows that, for YOLOv8n, incorporating the motion module can increase Precision by up to 1.34% with respect to the baseline when the module is placed after block
in the network backbone. For Recall, the greatest improvement-up to 7.83%-is observed when the module is inserted after block
, which also yields the highest gain in F1 Score, with an increase of 4.33%. As for mAP50, the optimal improvement of 5.11% is achieved when the module is placed after block
. Based on these results, we select the configuration with the highest mAP50 improvement, as it also provides a favorable trade-off between computational overhead and detection performance, requiring only an additional 0.81 GFLOPs compared to the baseline model.
We now extend this analysis to the YOLOv8s variant to assess whether similar trends hold, and to determine the most effective placement of the motion module under increased model capacity. As shown in
Table 4, the greatest improvement in Precision, a 2.49% increase compared to the baseline model, is achieved when the motion module is inserted after block
in the network backbone. This configuration also yields the highest gain in F1 Score, with an increase of 2.91%. For Recall, the largest improvement, as in YOLOv8n, occurs when the module is placed after block
, resulting in a 5.38% increase. Regarding mAP50, the best performance is again obtained when the module is placed after block
, leading to a 2.34% improvement. Following the same criterion used with YOLOv8n, we select the configuration with the highest mAP50 increase (placement after block
), as it provides a favorable trade-off between performance gain and computational overhead. In this case, the model incurs an additional cost of only 3.01 GFLOPs compared to the baseline.
The results above indicate that the inclusion of motion information contributes to improved detection performance. By enhancing object visibility, the motion module enables the model to better localize objects, leading to consistent gains across key evaluation metrics. Notably, greater improvements are observed for YOLOv8n, suggesting that the motion module is particularly beneficial for lightweight models, where limited representational capacity can be compensated by integrating motion-based cues.
In both cases, the optimal placement for the motion module appears to lie between blocks and in the network backbone.
As can be seen from
Table 3 and
Table 4, for both YOLOv8n and YOLOv8s, inserting the motion module at deeper layers leads to an increase in the total number of parameters in the model. This is expected, since feature maps in later stages of the backbone have a higher number of channels, causing the
convolution in the motion module to require more parameters. Similarly, the GFLOPs of the model also increase, not only due to the larger parameter count but also because the network needs to process two frames at deeper stages, further raising the computational and memory demands.
To further evaluate the effectiveness of the proposed approach, we compare each configuration of the YOLOv8 variants with their corresponding counterparts enhanced using the method proposed in [
29], evaluated on the test set. As previously mentioned, for both YOLOv8n and YOLOv8s, the motion module is placed after block
in the network backbone, as this configuration yields a favorable trade-off between performance and computational cost. These comparisons, shown in
Table 5, provide insight into the relative effectiveness of both motion integration strategies.
Table 5 shows that, for both our method and the approach proposed in [
29], the inclusion of motion information enables the models to surpass the baseline performance, particularly in terms of Recall, F1 Score, and mAP50. Our proposed method achieves comparable results to [
29] in Recall and F1 Score. Although the method from [
29] achieves a higher Precision and mAP50, we will demonstrate that our approach is more robust to lighting variations-an essential factor for real world deployment, where illumination conditions can change due to shadows, cloud cover, and other environmental factors. While our model introduces a slightly higher number of GFLOPs than the method proposed in [
29], it does not require any preprocessing, as motion information is integrated directly at the feature map level, allowing raw images to be used as input and contributes to greater robustness against lighting variations. In addition, our approach operates with only two frames, in contrast to [
29], which relies on three.
As mentioned earlier, we conducted a study to evaluate the impact of using the absolute value of the feature map difference versus the raw residual. The results show that using the absolute value leads to a 0.24% increase in F1 Score and a 0.17% improvement in mAP50 for YOLOv8n, and a 0.12% increase in F1 Score and a 0.64% improvement in mAP50 for YOLOv8s, compared to using the raw residual. These findings suggest that emphasizing the magnitude of motion changes, regardless of direction, helps the network extract more discriminative features.
Table 6 shows that although using raw residuals results in slightly higher precision, applying the absolute value leads to improvements in recall, F1 Score and mAP50. In our approach, we prioritize higher recall to reduce false negatives, as failing to detect a relevant insect can be critical in agricultural monitoring.
To evaluate the robustness of our approach to lighting variations, we tested both our method and the approach from [
29] on the original test set after applying modifications to each input frame. Specifically, we randomly adjusted the HSV (Hue, Saturation, and Value) components to simulate different illumination conditions, such as changes in brightness and color tone. The results of this evaluation are presented in
Table 7. This allows us to assess the robustness of each approach in more challenging, real-world-like conditions.
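A minimal sketch of such an HSV perturbation is given below; the gain values and the use of OpenCV are illustrative assumptions rather than the exact perturbation applied in our experiments.
```python
import random

import cv2
import numpy as np

def random_hsv_jitter(image_bgr: np.ndarray,
                      h_gain: float = 0.015,
                      s_gain: float = 0.7,
                      v_gain: float = 0.4) -> np.ndarray:
    """Randomly scale hue, saturation, and value to simulate changes in brightness and color tone."""
    scale = 1 + np.array([random.uniform(-1, 1) for _ in range(3)]) * np.array([h_gain, s_gain, v_gain])
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] * scale[0]) % 180              # OpenCV represents hue in [0, 180)
    hsv[..., 1] = np.clip(hsv[..., 1] * scale[1], 0, 255)
    hsv[..., 2] = np.clip(hsv[..., 2] * scale[2], 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```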
Table 7 shows that both YOLOv8n and YOLOv8s models enhanced with our motion module outperform their counterparts using the method proposed in [
29]. These results demonstrate that our method is more robust to lighting variations, as motion integration is performed at the feature map level. In contrast, ref. [
29] relies on frame differencing, a technique that is highly sensitive to changes in illumination. This sensitivity is reflected in the significant performance drop observed for models using [
29] under varying lighting conditions.
Finally, we present qualitative results for both YOLOv8n and YOLOv8s models enhanced with the motion module.
Figure 12 shows predictions from the YOLOv8n variant, while
Figure 13 shows results from YOLOv8s. These examples illustrate how the integration of the motion module, by leveraging motion-based feature information, enhances the ability of the model to localize and detect objects in challenging scenarios, such as those involving small objects.
As shown in
Figure 8, some image pairs contain visible motion in the background, particularly in plant parts such as leaves or flowers, likely caused by wind. Despite this, the model achieves accurate detections. This can be attributed to the fact that motion is computed at the feature map level rather than from raw pixel intensities. By performing temporal differencing on feature maps, the network can focus on meaningful patterns while suppressing irrelevant changes. Subsequent layers in the network further refine these feature maps, retaining only representations that are useful for insect detection.
After demonstrating that the motion module improves honeybee detection performance, and exhibits greater robustness to lighting variations than the method proposed in [
29], which relies on frame differencing, we now evaluate the proposed method on the synthetic pest dataset to assess its performance on a broader range of insect types. This evaluation complements the previous analysis on the honeybee dataset and provides insights into the ability of the model to generalize across different insect classes. Below, we present the quantitative results and comparisons with the baseline YOLOv8 models, highlighting the effectiveness of the motion module in this new setting.
Based on the findings from the honeybee dataset experiments, where the insertion of the motion module after block in the network backbone yielded a favorable trade-off between accuracy and computational cost, we place the motion module in this position for all experiments conducted on the synthetic pest dataset. This placement aims to preserve a consistent configuration while allowing the network to integrate motion features at a sufficiently abstract level.
The results in
Table 8 demonstrate the effectiveness of integrating the motion module into the YOLOv8n backbone for pest detection. While the baseline YOLOv8n achieves a slightly higher precision, the model with the motion module outperforms the baseline across all other key metrics: recall increases by 4.66%, F1 Score improves by 2.66%, and mAP50 rises by 3.49%. The higher recall and F1 Score indicate that the motion-enhanced model is more sensitive and effective at detecting pests, including those that may be partially occluded or visually similar to background elements. These improvements are particularly meaningful for pest detection, since failing to detect an insect (i.e., a false negative) can be more critical than an occasional false positive. The increased performance is achieved with only a modest increase in computational cost (an additional 0.81 GFLOPs), making the motion module a practical addition even for resource-constrained environments.
We now extend the analysis to YOLOv8s; these results are shown in
Table 9. When integrating the motion module into the YOLOv8s backbone for pest detection, we observe improvements across all metrics. Compared to the baseline, the model with the motion module increases precision by 1.16%, recall by 1.98%, F1 Score by 1.72 percentage points, and mAP50 by 3.28 points. These results demonstrate that, even for a larger model, the incorporation of motion information remains beneficial. Despite a moderate increase in computational cost (3.03 GFLOPs), the performance improvements justify the inclusion of the module.
Table 10 compares the F1 Score and mAP50 between the baseline YOLOv8 and the YOLOv8 with the motion module integrated after block
, for both the YOLOv8n and YOLOv8s variants. Results are reported for each pest class individually and show that the inclusion of the motion module yields improvements in nearly all pest categories, particularly for smaller insects such as the broad mite, suggesting that motion cues are especially beneficial when visual features alone are insufficient. In summary, the per-class analysis confirms the general effectiveness of the motion module across a variety of pest types.
Finally, we include qualitative results on the synthetic pest dataset.
Figure 14 and
Figure 15 show sample predictions from the YOLOv8n and YOLOv8s variants, respectively. The visualization shows that motion integration enhances detection accuracy in scenarios where appearance alone may be insufficient, such as small or camouflaged insects.
4. Conclusions
This work introduces the motion module, a lightweight component designed to enhance the detection performance of YOLOv8 models by integrating motion information directly at the feature map level. The module operates on the feature maps of two consecutive frames, allowing the network to exploit temporal differences and extract motion cues that are not available in static images. This design choice provides dynamic context that helps the network distinguish small insects from the background, especially under real-world conditions where appearance-based cues alone may be insufficient. Additionally, operating at the feature map level helps reduce the impact of lighting variations compared to methods that rely on frame differencing, such as [
29].
Experimental results demonstrate that the motion module can significantly boost performance across the evaluation metrics. Specifically, mAP50 increases by up to 5.11% for YOLOv8n and 2.34% for YOLOv8s; Recall improves by up to 7.83% and 5.38%, respectively; Precision sees gains of up to 1.34% and 2.49%; and F1 Score increases by up to 4.33% for YOLOv8n and 2.91% for YOLOv8s. These improvements are achieved when the motion module is inserted between blocks and for YOLOv8n, and between blocks and for YOLOv8s in the network backbone, resulting in a modest increase in computational cost of at most 1.67 GFLOPs for YOLOv8n and 3.96 GFLOPs for YOLOv8s.
The results indicate that the motion module is particularly beneficial for lightweight models, such as YOLOv8n, where the integration of motion cues can effectively compensate for limited representational capacity. Moreover, quantitative evaluations under simulated lighting variations (via HSV modifications) reveal that our approach, specifically for YOLOv8n, maintains robust performance with only minor degradation under challenging illumination conditions. This robustness makes the module practical and reliable for real-world applications, where lighting can vary due to shadows, cloud cover, and other environmental factors.
Additionally, a synthetic dataset of common pests known to affect raspberry plants was introduced. Results obtained on this synthetic dataset show that the proposed motion module is not limited to honeybee detection but can generalize effectively to other insect classes. The inclusion of the motion module consistently improved key detection metrics such as F1 Score and mAP50 for both the YOLOv8n and YOLOv8s variants, compared to their corresponding baseline models. Specifically, for YOLOv8n, recall increases by 4.66%, F1 Score improves by 2.66%, and mAP50 increases by 3.49%. For YOLOv8s, precision improves by 1.16%, recall by 1.98%, F1 Score by 1.72 percentage points, and mAP50 by 3.28 points. This highlights the potential of the proposed approach to enhance object detection performance in tasks involving insects in motion.
Future work will explore whether the motion module is object-agnostic, that is, whether it can generalize across detection tasks and architectures without retraining the motion module, offering consistent performance improvements regardless of the type of object being detected. Similarly, we intend to experiment with other feature maps and preprocessing techniques, such as temporal gradients and optical flow.