1. Introduction
The expected population increase from 8.2 billion in 2024 to a peak population of 10.3 billion in the mid-2080s [
1] drives an increased demand for food, which, in turn, results in the need for crop expansion. Nevertheless, crop production is affected by various factors such as crop pests and pollinator decline. According to [
2], insect pests cause an estimated crop production loss of 18–20% worldwide. On the other hand, the decline of pollinators is an increasingly alarming problem for farmers. Pollinators support the seed production of 87 of the principal food crops globally, emphasizing the critical need for pollinator conservation, especially since approximately 40% of pollinators, such as bees and butterflies, are at risk of extinction [
3]. Thus, effective insect management plays a crucial role in reducing crop losses, helping to maintain healthy and productive harvests. By adequately managing insect populations, farmers can mitigate the damage caused by pests while simultaneously promoting the vital role of pollinators. This balanced approach not only reduces crop losses but also enhances crop yield quality, thereby increasing its market value [
4]. Effective management is essential in both pest control and the conservation of pollinators, which are integral to the reproduction of many crops, ensuring long-term agricultural productivity and environmental health [
5].
Traditionally, insect management has relied on human resources. However, this approach is not a practical solution for large-scale crops, as it requires a high level of expertise and is an arduous task that demands considerable time. Therefore, automated methods are essential for addressing these limitations [
6].
Machine learning (ML) has been one of the main tools for automating this task. Traditional ML techniques, such as support vector machines and k-nearest neighbors, have been effectively applied across a wide variety of areas, such as crop yield prediction [
7], breast cancer diagnosis [
8], solar radiation forecasting [
9], protein function prediction [
10], among others. However, these methods depend on manually engineered features, whose design often requires domain expertise and whose extraction can be time-consuming [
11]. Furthermore, these manually crafted features often fail to capture wide-ranging, large-scale variations in object shape, such as the vast diversity of insect species, shapes, and sizes [
12], which consequently diminishes the performance of ML techniques.
To overcome these limitations of traditional ML, convolutional neural networks (CNNs) have proven to be a powerful alternative. Unlike traditional ML approaches, CNNs can automatically learn hierarchical feature representations from raw image data, eliminating the need for manual feature extraction [
11]. This ability makes CNNs particularly suitable for tasks involving complex visual patterns, such as insect detection, where variations in species, shapes and orientations pose significant challenges. CNN-based models have demonstrated remarkable success in various computer vision applications such as image classification [
13,
14], object detection [
15,
16] and image segmentation [
17,
18], enabling more accurate and robust object recognition. Nonetheless, insect detection remains particularly challenging because insects are small, often occupying only a tiny fraction of the pixels in an image, which limits the visual information available for CNNs to extract meaningful features.
This has motivated extensive research on improving CNN-based models for small object detection, particularly in the context of insect recognition, which remains an active research area. Next, we provide an overview of related work, highlighting key advancements in this field.
Related Work
As previously noted, both traditional ML techniques and CNNs have been applied to automate pest detection tasks. In [
19], a comparative study between these approaches is presented for automated pest detection and identification on greenhouse tomato and pepper crops. They show that CNNs offer superior performance, achieving higher accuracy than traditional ML methods and effectively distinguishing between visually similar pest classes.
However, the images used in this study were captured under controlled environmental conditions, specifically within a cultivation chamber. As a result, the model shows limited effectiveness when applied to real-world field scenarios. This performance gap arises from the fact that, in field conditions, models must cope with additional challenges such as diverse backgrounds, varying lighting conditions, occlusion, and substantial changes in the apparent characteristics of pests caused by differences in viewing angle and distance during image acquisition [
20,
21]. Studies such as [
20,
22] report that models trained on images acquired in controlled environments tend to perform poorly when tested under actual field conditions. To address this limitation, ref. [
22] recommends prioritizing the inclusion of images captured under real-world conditions in the training dataset.
To overcome this limitation, efforts have focused on the collection of datasets composed of images captured directly in the field. For instance, ref. [
23] introduces a dataset specifically designed to reflect real-world conditions, containing 750,000 images of 102 insect classes. Such datasets are essential for improving the generalization capabilities of deep learning models and ensuring their applicability in practical agricultural scenarios.
Regarding insect detection in real-world scenarios, ref. [
24] proposed an object detection algorithm based on YOLOv3 for the early detection of tomato diseases and pests, in which dilated convolutions replace standard convolution layers in the network backbone, improving its ability to detect small objects.
Wang et al. [
25] constructed a three-scale CNN with channel and spatial attention mechanisms and showed that the proposed method achieves both high speed and high accuracy in crop pest detection, outperforming architectures such as VGG16.
Recently, transformer-based architectures have gained significant attention in the deep learning community. Originally introduced for natural language processing tasks [
26], the transformer architecture has since been successfully adapted for various computer vision applications. One such adaptation is Vision Transformer (ViT) [
27]. Building on this approach, ref. [
28] proposed a multi-scale and multi-factor ViT attention model for pest detection and classification over the IP102 [
23] dataset. Their results demonstrate that transformer-based models can improve both classification and detection performance.
An interesting approach is [
29], where motion information is used to enhance the visibility of insects in time-lapse RGB images. The preprocessed images are fed into a CNN. The proposed method improves the performance of object detection models such as YOLO and Faster R-CNN. Additionally, the authors provide a dataset containing 107,383 annotated time-lapse images, including 9423 labeled insect instances. Several other insect monitoring systems have been deployed using traps equipped with cameras, such as [
30,
31]. These systems are designed to attract and capture insects for close-range imaging, often resulting in high-resolution data suitable for classification tasks. However, they typically rely on different lures or attractants [
32], which may introduce sampling bias and alter insect behavior. Furthermore, the majority of trapping methods are invasive, which results in rare insect species being killed [
29]. A time-lapse setup avoids interference with natural insect behavior and supports broader spatial coverage.
The use of motion has proven effective for enhancing object visibility and improving model performance in object detection tasks. Nevertheless, the integration of motion information remains relatively underexplored, particularly in challenging environments. In addition to the approach proposed by [
29], other studies have also leveraged motion cues to distinguish objects from complex backgrounds. For instance, ref. [
33] utilized motion information from video sequences by applying background subtraction and optical flow to detect freely moving fish in underwater environments. The outputs of a Gaussian Mixture Model (GMM) and optical flow were combined with raw video frames containing texture and shape information. This combined input was then fed into an R-CNN model, which achieved state-of-the-art performance in the fish detection task.
Another example is presented in [
34], where a novel approach is presented that integrates a CNN model, specifically a Single-Shot Detector with a MobileNetV2 backbone, with motion cues. In this method, intermediate feature maps from the CNN are used to identify regions of motion within each frame. By focusing only on moving objects, the approach effectively reduces false-positive detections and improves computational efficiency by lowering the average processing time.
One of the major drawbacks of methods such as [
29,
33] is their reliance on preprocessing steps to incorporate motion information. For instance, ref. [
33] employs a GMM to separate foreground objects from the background, a computationally intensive process that requires multiple frames to construct a reliable background model. Similarly, ref. [
29] requires a motion-enhanced image to be generated from three consecutive frames before being passed to the detection network. This preprocessing step not only adds complexity but also depends on frame differencing, which is highly sensitive to lighting variations commonly encountered in real-field scenarios. These limitations highlight the need for more efficient and robust approaches to motion integration in object detection.
Feature maps extracted by convolutional layers are distortion- and translation-invariant, allowing a CNN to hierarchically refine increasingly intricate patterns [
35]. Several works have operated on feature maps to enhance predictions; for example, ref. [
36] proposes accelerating image classification through feature map similarity, while ref. [
37] uses differences between feature maps to extract geometric patterns in medical images. On the other hand, ref. [
38] combines feature maps from different layers to offer scale-robust object detection.
To address challenges mentioned above, we propose the motion module, a lightweight component designed to enhance object detection performance by integrating motion information directly at the feature map level within the YOLOv8 backbone. This approach eliminates the need for any preprocessing and introduces only minimal computational overhead.
The main contributions of this work are as follows:
We introduce the motion module, a lightweight and efficient component that boosts YOLOv8 performance by incorporating motion cues directly into the network backbone. By injecting motion information at an early stage of the feature extraction pipeline, the module enables the network to learn richer representations without incurring significant computational overhead. A key aspect of our design is the prioritization of a balanced trade-off between performance and computational cost. The module maintains a low parameter count, ensuring that the improved accuracy does not come at the expense of efficiency.
Our method avoids any external preprocessing and operates with only two consecutive raw frames as input, allowing the network to extract motion cues internally without relying on handcrafted features. This design choice enables a more streamlined and efficient pipeline that is easier to deploy. Processing only two frames at a time also reduces memory usage. This lightweight integration of motion information makes the approach well-suited for resource-constrained scenarios, such as embedded systems or mobile platforms.
We demonstrate that the proposed approach is robust to lighting variations, making it well-suited for real-world field conditions. This robustness allows the model to maintain high detection accuracy under challenging scenarios such as illumination changes, shadows and glare. As a result, the method can be reliably deployed in dynamic outdoor settings where lighting conditions are often unpredictable and uncontrollable, such as agricultural fields.
2. Materials and Methods
2.1. Dataset
To evaluate the performance of the proposed model, we conducted experiments using the publicly available dataset introduced in [
29]. This dataset serves as a benchmark for assessing object detection models targeting small insects in real-world environments.
This dataset consists of time-lapse images captured at 30-s intervals. The camera system was used to monitor three different plant species:
Trifolium pratense (red clover),
Cakile maritima (sea rocket), and
Malva sylvestris (common mallow). While the primary focus of the monitoring system was to capture honeybees visiting the plants, the dataset also includes occasional appearances of other animals such as beetles, butterflies, and spiders. This dataset provides realistic scenarios with natural lighting, vegetation and background variability, making it suitable for evaluating detection models in outdoor environments. All images in the dataset are in high definition, with a resolution of
pixels (width × height), which allows small insects to be visible. Additionally, the distribution of bounding box sizes was analyzed through boxplots of normalized dimensions.
Figure 1 shows that the normalized bounding box width reaches up to approximately 0.100 of the image width, while the height reaches up to approximately 0.175 of the image height. This suggests that most bees occupy a relatively small portion of the image, which highlights the importance of sufficient resolution and sharpness for reliable detection.
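As a minimal sketch of how such a distribution can be computed, the snippet below reads YOLO-format label files (class, x-center, y-center, width, height, all normalized to the image size) and plots boxplots of the normalized widths and heights; the directory layout is an illustrative assumption.
```python
from pathlib import Path

import matplotlib.pyplot as plt

widths, heights = [], []
for label_file in Path("dataset/labels").glob("*.txt"):     # hypothetical label directory
    for line in label_file.read_text().splitlines():
        parts = line.split()
        if len(parts) == 5:                                  # class, x_center, y_center, width, height
            _, _, _, w, h = map(float, parts)
            widths.append(w)
            heights.append(h)

# Boxplots of normalized bounding-box width and height (cf. Figure 1)
plt.boxplot([widths, heights], labels=["width", "height"])
plt.ylabel("Normalized size")
plt.show()
```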
During image acquisition, the camera remained stationary during individual recordings but was relocated periodically throughout the monitoring period to ensure coverage of different flowering plants from both lateral and overhead perspectives.
The authors in [
29] provide predefined train, validation and test splits of the dataset. However, the training and validation sets do not contain consecutive time-lapse images captured at 30-s intervals. To address this limitation, we constructed a modified version of the dataset by selecting samples exclusively from the original test split, which preserves temporal continuity between frames. This modified dataset ensures a consistent 30-s interval between images, allowing us to leverage motion information more effectively. Summary statistics of the original test split are presented in
Table 1.
As mentioned previously, we constructed a modified version of the dataset by selecting samples exclusively from the original test split. However, as shown in
Table 1, a significant proportion of these images do not contain any insects. To address this issue, we filtered the dataset by selecting only the images that included at least one insect for each camera system. Then, to balance the dataset, we randomly sampled an equal number of images without insects (background images) from the corresponding camera system, ensuring a balanced distribution for training and evaluation. Additionally, we performed data cleaning on this modified dataset by removing false-positive annotations and out-of-focus images. As previously noted, some images contained insects other than honeybees; however, due to the small proportion of these instances, such images were excluded from the dataset. This refinement allowed us to focus exclusively on the detection of honeybees. The statistics of this modified dataset are shown in
Table 2.
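The filtering and balancing procedure can be sketched as follows; the file layout and the way the camera system is inferred from file names are illustrative assumptions rather than the exact implementation.
```python
import random
from collections import defaultdict
from pathlib import Path

random.seed(0)
groups = defaultdict(lambda: {"insect": [], "background": []})

for image in Path("original_test/images").glob("*.jpg"):        # hypothetical layout
    camera_id = image.stem.split("_")[0]                        # assumed naming convention
    label = Path("original_test/labels") / f"{image.stem}.txt"
    has_insect = label.exists() and label.read_text().strip() != ""
    groups[camera_id]["insect" if has_insect else "background"].append(image)

selected = []
for camera_id, split in groups.items():
    insect_images = split["insect"]
    # Sample as many background images as insect images for this camera system.
    n_background = min(len(insect_images), len(split["background"]))
    selected.extend(insect_images + random.sample(split["background"], n_background))
```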
As indicated in
Table 2, the resulting dataset comprises a total of 9676 images, evenly distributed between samples containing honeybees and background-only images. To facilitate model development and ensure a fair evaluation, we partitioned the dataset into training, validation, and test sets following a 0.8/0.1/0.1 ratio, respectively. This split was performed in a stratified manner across camera systems to preserve the diversity of scenes and plant species in each subset. Some instances from this dataset are shown in
Figure 2.
To further evaluate the generalization capability of our model across different insect types, we created a synthetic dataset by overlaying pest images onto raspberry plant images. This synthetic dataset was necessary due to the scarcity of publicly available datasets featuring insects in motion under field conditions. By generating this dataset, we are able to simulate a wide range of insect appearances, poses, and positions, which allows us to more thoroughly evaluate the generalization capability of our approach.
For the synthetic dataset generation, we first captured high-resolution images of raspberry plants in agricultural fields located in Jocotepec, Jalisco, Mexico. To simulate the presence of multiple pest species, we then artificially overlaid pest images onto these plant background images. The pests included in this synthetic dataset are nine common species known to affect raspberry crops:
Tetranychus urticae (red spider),
Polyphagotarsonemus latus (broad mite),
Coccoidea (coccidae),
Curculionidae (weevil),
Cotinis mutabilis (figeater),
Rhagoletis spp.,
Lampronia corticella (raspberry moth),
Aphididae (aphid) and
Frankliniella spp. (thrips). According to [
39], it is uncommon for multiple competing pest species to coexist on the same individual plant. Therefore, each synthetic image includes only one pest type overlaid per plant background, aligning with real-world biological patterns. For each generated image, we simulate a corresponding previous frame by shifting the position of the insects present in the scene.
For each pest class, a total of 1000 main frames were generated, each with the insect placed at a particular position. Then, for each of these frames, a corresponding previous frame was generated by shifting the position of each insect within the scene. This process resulted in 2000 images per class: 1000 main frames for pest detection and 1000 images used to provide temporal context, enabling the integration of motion information into the detection pipeline.
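A minimal sketch of how a main/previous frame pair can be synthesized is shown below; the library choice, file names, and shift ranges are illustrative assumptions rather than the exact generation pipeline.
```python
import random

from PIL import Image

def compose_frame(background: Image.Image, pest: Image.Image, x: int, y: int) -> Image.Image:
    """Paste a pest crop (with alpha channel) onto a copy of the background at (x, y)."""
    frame = background.copy()
    frame.paste(pest, (x, y), pest)
    return frame

background = Image.open("raspberry_background.jpg").convert("RGB")   # hypothetical file
pest = Image.open("aphid.png").convert("RGBA")                       # hypothetical file

x, y = random.randint(0, 800), random.randint(0, 600)
main_frame = compose_frame(background, pest, x, y)

# The "previous" frame places the same pest at a slightly shifted position.
dx, dy = random.randint(-20, 20), random.randint(-20, 20)
previous_frame = compose_frame(background, pest, x + dx, y + dy)
```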
As with the honeybee dataset, we created training, validation, and test splits using a 0.8/0.1/0.1 ratio, respectively. The splits were performed in a stratified manner to preserve the distribution of images for each insect class across all splits. All images were cropped to a size of
pixels. Example images from this synthetic dataset are shown in
Figure 3.
As with the honeybee dataset, we present the distribution of bounding box sizes through boxplots of normalized bounding box dimensions for all insect classes, as illustrated in
Figure 4. As shown, the bounding boxes for pests are relatively small, with both dimensions concentrated around low normalized values. This reflects the small visual size of the pests within the images, which complicates the detection process. Furthermore, this challenge highlights the importance of incorporating motion cues for improving detection performance.
2.2. Method
Insects in natural environments can be small, camouflaged, or partially occluded, making them difficult to detect. To enhance detection performance in real-world conditions, we extend the conventional still-image object detection framework by incorporating an adjacent frame as input. This design allows the network to extract motion cues that are not available in a single image. These cues are particularly useful for detecting insects under the aforementioned challenging conditions, where they may be indistinguishable from the background in static images. By providing dynamic context, the model can better localize and classify moving insects, leading to improved performance under field conditions.
These motion cues allow the network to identify subtle motion patterns that help distinguish insects from the background.
In this work, our method is based on the YOLOv8 architecture, which is illustrated in
Figure 5. We begin by introducing the proposed motion module, followed by a detailed description of its training procedure. The overall structure of the motion module is depicted in
Figure 6.
This module can be inserted after any layer in the YOLOv8 backbone, prior to the SPPF block. It takes as input two feature maps: the output of layer N at time t and the output of the same layer at time t−1, denoted in Figure 6 as $F_N^{t}$ and $F_N^{t-1}$, respectively.
To encode motion information, the module computes the difference between these consecutive feature maps and applies the absolute value operation, $|F_N^{t} - F_N^{t-1}|$; this absolute difference is then normalized using batch normalization. We empirically evaluated both alternatives (using the raw residual and using its absolute value) and found that the latter consistently led to better detection performance. The use of the absolute value highlights the magnitude of changes regardless of direction, allowing the network to focus on relevant dynamic information without being sensitive to sign.
Next, the original feature map is concatenated with the normalized difference along the channel dimension. Assuming each input feature map has C channels, the concatenated feature map will have 2C channels. To integrate the motion information and restore the channel count to C, a convolution is applied. This ensures compatibility with the subsequent layer in the network.
This process is summarized in Equation (1):
$$M\left(F_N^{t}, F_N^{t-1}\right) = \mathrm{Conv}\left(\mathrm{Concat}\left(F_N^{t},\; \mathrm{BN}\left(\left|F_N^{t} - F_N^{t-1}\right|\right)\right)\right) \tag{1}$$
where:
$M(\cdot,\cdot)$ is the motion module.
$F_N^{t}$ is the feature map up to layer N for the image at time t.
$F_N^{t-1}$ is the feature map up to layer N for the image at time t−1.
$\mathrm{Conv}(\cdot)$ is a convolution.
$\mathrm{Concat}(\cdot,\cdot)$ is the concatenation along the channel dimension.
$\mathrm{BN}(\cdot)$ is the BatchNorm2D operation.
Assuming that the dimensions of $F_N^{t}$ and $F_N^{t-1}$ are $C \times H \times W$, the output of the motion module, denoted as $M(F_N^{t}, F_N^{t-1})$, will also have dimensions $C \times H \times W$. This is ensured by the convolution, which reduces the concatenated feature map back to the original number of channels. As a result, the output of the module matches the expected input dimensions of layer N+1.
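To make the computation in Equation (1) concrete, the following is a minimal PyTorch sketch of the motion module; the 1×1 kernel size of the channel-reducing convolution and the class and variable names are assumptions for illustration, not the exact implementation.
```python
import torch
import torch.nn as nn

class MotionModule(nn.Module):
    """Sketch of Equation (1): fuse a feature map with its normalized absolute temporal difference."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)                               # normalizes |F_t - F_{t-1}|
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)   # 2C -> C (assumed 1x1 conv)

    def forward(self, f_t: torch.Tensor, f_t_prev: torch.Tensor) -> torch.Tensor:
        diff = self.bn(torch.abs(f_t - f_t_prev))        # normalized absolute difference
        fused = torch.cat([f_t, diff], dim=1)            # concatenation along the channel dimension
        return self.reduce(fused)                        # restore the original channel count C

# Example: feature maps with C = 128 channels and 80 x 80 spatial resolution
f_t = torch.randn(1, 128, 80, 80)
f_t_prev = torch.randn(1, 128, 80, 80)
out = MotionModule(128)(f_t, f_t_prev)
print(out.shape)  # torch.Size([1, 128, 80, 80]), matching the input expected by the next backbone layer
```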
Having described the structure and functionality of the motion module, we now present the training procedure used to optimize its parameters within the YOLOv8 network.
We begin with a YOLOv8 model pretrained on the COCO dataset, and fine-tune it for 100 epochs using one of the insect datasets presented in the previous section.
After completing this initial training phase, we transform the model into a Siamese architecture; that is, the model consists of two identical branches with shared weights that process two input frames in parallel (the main frame and the previous frame), as illustrated in Figure 7. Specifically, the inputs are the image at time t and the image at time t−1, denoted as $I_t$ and $I_{t-1}$, respectively, in Figure 7.
In Figure 7, each block in the network backbone before the SPPF block is labeled as $B_n$, where n denotes the block index. The motion module can be inserted after any of these blocks; the example shown illustrates one such placement.
Both input images, $I_t$ and $I_{t-1}$, are processed in parallel through identical network branches up to block $B_n$, where n is the index of the block preceding the placement of the motion module. At this point, the motion module integrates the motion information derived from the two branches into the feature map corresponding to $I_t$, effectively ending the Siamese structure. From this point onward, the original YOLOv8 pipeline resumes, using the enhanced features as input.
If the neck of the network receives input from a block preceding the motion module (i.e., a stage still within the Siamese structure), it only utilizes the output from the branch processing $I_t$, ignoring the features from the $I_{t-1}$ branch. This ensures consistency with the original single-frame processing design of YOLOv8.
All blocks in the trained YOLOv8 model, except for the motion module, are frozen during the second training phase. As illustrated in
Figure 7, the blocks highlighted in blue indicate layers whose weights remain unchanged throughout this phase. Only the parameters of the motion module are updated, since its weights are initialized randomly. Freezing the rest of the network prevents the large gradients expected during the initial updates of the motion module from adversely affecting the pretrained weights, thereby preserving the knowledge previously learned by the base YOLOv8 model. During this second training phase, we train the model for an additional 100 epochs, updating only the parameters of the motion module. This selective fine-tuning allows the module to effectively learn and encode motion information while preserving the knowledge already acquired by the rest of the network.
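The freezing scheme of this second phase can be sketched in PyTorch as follows; the SiameseYOLOStub class and its attribute names are placeholders standing in for the actual Siamese YOLOv8 model, and the optimizer settings follow the hyperparameters reported in Section 2.3.
```python
import torch
import torch.nn as nn

class SiameseYOLOStub(nn.Module):
    """Placeholder for the Siamese YOLOv8 with an inserted motion module (illustrative names only)."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 128, kernel_size=3, padding=1)   # stands in for the pretrained blocks
        self.motion_module = nn.Conv2d(256, 128, kernel_size=1)       # stands in for the motion module

model = SiameseYOLOStub()

# Freeze every pretrained parameter, then unfreeze only the randomly initialized motion module.
for param in model.parameters():
    param.requires_grad = False
for param in model.motion_module.parameters():
    param.requires_grad = True

# Only the trainable (motion module) parameters are updated by the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.01,
    momentum=0.937,
)
```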
In the following section, we compare the results obtained on the modified dataset using our method with those obtained by the method presented in [
29]. To provide context for this comparison, we first offer a brief explanation of the referenced method.
In [
29], the authors incorporate a preprocessing step that enhances images with motion information prior to feeding them into the neural network. To generate a motion-enhanced image, three consecutive frames are used. These RGB frames are first converted to grayscale and then blurred using a Gaussian kernel. The resulting grayscale, blurred frame at time index k is denoted as $g_k$. To enhance frame k, the adjacent frames $g_{k-1}$ and $g_{k+1}$ are also considered. Two frame differences are then computed: the difference between $g_k$ and $g_{k-1}$, denoted as $D_{k-1}(x, y)$, and the difference between $g_{k+1}$ and $g_k$, denoted as $D_{k+1}(x, y)$, where $(x, y)$ are the pixel coordinates. These computations are formalized in Equations (2) and (3), respectively:
$$D_{k-1}(x, y) = g_k(x, y) - g_{k-1}(x, y) \tag{2}$$
$$D_{k+1}(x, y) = g_{k+1}(x, y) - g_k(x, y) \tag{3}$$
Then, using this information, a motion likelihood, $L_k(x, y)$, is created by summing the absolute values of $D_{k-1}(x, y)$ and $D_{k+1}(x, y)$, as shown in Equation (4):
$$L_k(x, y) = \left|D_{k-1}(x, y)\right| + \left|D_{k+1}(x, y)\right| \tag{4}$$
Next, the RGB channels of the original color image at time k are modified to generate the motion-enhanced image, denoted as M. In this enhanced image, the blue channel ($M_B$) contains a combination of the original blue ($B_k$) and red ($R_k$) channels, as shown in Equation (5). The original red channel ($R_k$) is replaced with the motion likelihood ($L_k$), as defined in Equation (6). Finally, the green channel remains unaltered; the original green channel ($G_k$) is directly copied into the enhanced image ($M_G$), as shown in Equation (7).
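As a rough illustration of this preprocessing pipeline, the sketch below implements the grayscale blurring, the motion likelihood of Equation (4), and the channel remapping with NumPy and OpenCV; the 15×15 Gaussian kernel and the simple averaging used for the blue/red combination of Equation (5) are placeholder assumptions rather than the exact values used in [29].
```python
import cv2
import numpy as np

def blurred_gray(frame_bgr: np.ndarray) -> np.ndarray:
    """Convert a BGR frame to grayscale and blur it (kernel size is an assumed placeholder)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(gray, (15, 15), 0).astype(np.float32)

def motion_enhanced(prev_bgr: np.ndarray, curr_bgr: np.ndarray, next_bgr: np.ndarray) -> np.ndarray:
    """Build a motion-enhanced image from three consecutive frames, following Equations (2)-(7)."""
    g_prev, g_curr, g_next = map(blurred_gray, (prev_bgr, curr_bgr, next_bgr))
    likelihood = np.abs(g_curr - g_prev) + np.abs(g_next - g_curr)      # Equations (2)-(4)
    likelihood = np.clip(likelihood, 0, 255).astype(np.uint8)

    b, g, r = cv2.split(curr_bgr)
    blue = ((b.astype(np.uint16) + r.astype(np.uint16)) // 2).astype(np.uint8)  # assumed blue/red combination
    # Green channel is copied unchanged; the red channel is replaced by the motion likelihood.
    return cv2.merge([blue, g, likelihood])
```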
An example of this process applied to an image at time
k containing a honeybee is shown in
Figure 8 and
Figure 9.
Figure 8 shows the original RGB image alongside its corresponding motion likelihood, where the movement of the honeybee-as well as some flower motion caused by wind-can be observed.
Figure 9 presents the resulting motion enhanced image, in which the honeybee is highlighted in red.
2.3. Experimental Setup
To promote faster inference and reduce the computational load, all images were resized to pixels. The following hyperparameters were used in our experiments: an initial learning rate of 0.01, a learning rate momentum of 0.937, and a batch size of 32. Training was conducted using the Stochastic Gradient Descent (SGD) optimizer.
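For reference, the first fine-tuning phase with the hyperparameters above could be configured roughly as in the following sketch using the Ultralytics API; the dataset YAML path and the value passed to imgsz are illustrative placeholders, since the exact resized resolution is not reproduced here.
```python
from ultralytics import YOLO

# Start from COCO-pretrained weights (use "yolov8s.pt" for the larger variant).
model = YOLO("yolov8n.pt")

model.train(
    data="honeybees.yaml",   # hypothetical dataset configuration file
    epochs=100,
    batch=32,
    optimizer="SGD",
    lr0=0.01,                # initial learning rate
    momentum=0.937,
    imgsz=640,               # placeholder image size
)
```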
The hardware configuration used for all experiments consisted of an Intel Core i7 (13th generation) CPU (Intel, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4060 GPU (NVIDIA, Santa Clara, CA, USA) with 8 GB of VRAM.
The software environment was set up using Python 3.8, PyTorch 2.4.1, and CUDA 12.6, running on Windows 11.
2.4. Evaluation Metrics
To evaluate the performance of the models, we use a set of standard object detection metrics: Precision, Recall, F1 Score, and mean Average Precision (mAP). These metrics provide a detailed assessment of the detection capabilities of the models across different configurations. In addition, we report the number of GFLOPs (Giga Floating Point Operations) to quantify the computational cost associated with each model. FLOPs measure the total number of floating-point operations required to process a single input. One GFLOP corresponds to one billion ($10^9$) FLOPs. GFLOPs thus offer a convenient estimate of the inference complexity of the model.
To better understand the evaluation process, we briefly describe the formulation of each metric used:
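Denoting true positives, false positives, and false negatives as TP, FP, and FN, the standard formulations of these metrics are sketched below; mAP50 averages the per-class average precision $AP_c$ computed at an IoU threshold of 0.5 over the N classes.
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{mAP50} = \frac{1}{N} \sum_{c=1}^{N} AP_c$$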
3. Experimental Results and Discussion
Given the objective of achieving a lightweight and efficient model, we restrict our evaluation to the YOLOv8n and YOLOv8s variants, which offer a favorable balance between performance and model size. The reported results correspond to a single training run. We observed that performance was consistent across different training sessions due to two main factors. First, the base YOLO network was always initialized with the same pretrained weights obtained from the dedicated insect dataset. Second, the newly introduced motion module was initialized using a fixed random seed. These conditions ensure consistency across training runs.
In our experiments, we first utilized a YOLOv8 model pretrained on the modified honeybees dataset and tested the insertion of the motion module after different blocks of the network backbone, from
to
in
Figure 7. Each configuration was independently trained and evaluated to determine the most effective placement of the motion module. The performance of each configuration is summarized in
Table 3 for YOLOv8n and in
Table 4 for YOLOv8s, where we present the results on the validation set across different block placements. For a clear comparison, these results are also visualized in
Figure 10 for YOLOv8n and
Figure 11 for YOLOv8s, using a bar chart that highlights the impact of the motion module on the performance of the model at each stage.
Table 3 shows that, for YOLOv8n, incorporating the motion module can increase Precision by up to 1.34% with respect to the baseline when the module is placed after block
in the network backbone. For Recall, the greatest improvement-up to 7.83%-is observed when the module is inserted after block
, which also yields the highest gain in F1 Score, with an increase of 4.33%. As for mAP50, the optimal improvement of 5.11% is achieved when the module is placed after block
. Based on these results, we select the configuration with the highest mAP50 improvement, as it also provides a favorable trade-off between computational overhead and detection performance, requiring only an additional 0.81 GFLOPs compared to the baseline model.
We now extend this analysis to the YOLOv8s variant to assess whether similar trends hold, and to determine the most effective placement of the motion module under increased model capacity. As shown in
Table 4, the greatest improvement in Precision, a 2.49% increase compared to the baseline model, is achieved when the motion module is inserted after block
in the network backbone. This configuration also yields the highest gain in F1 Score, with an increase of 2.91%. For Recall, the largest improvement, as in YOLOv8n, occurs when the module is placed after block
, resulting in a 5.38% increase. Regarding mAP50, the best performance is again obtained when the module is placed after block
, leading to a 2.34% improvement. Following the same criterion used with YOLOv8n, we select the configuration with the highest mAP50 increase (placement after block
), as it provides a favorable trade-off between performance gain and computational overhead. In this case, the model incurs an additional cost of only 3.01 GFLOPs compared to the baseline.
The results above indicate that the inclusion of motion information contributes to improved detection performance. By enhancing object visibility, the motion module enables the model to better localize objects, leading to consistent gains across key evaluation metrics. Notably, greater improvements are observed for YOLOv8n, suggesting that the motion module is particularly beneficial for lightweight models, where limited representational capacity can be compensated by integrating motion-based cues.
In both cases, the optimal placement for the motion module appears to lie between blocks and in the network backbone.
As can be seen from
Table 3 and
Table 4, for both YOLOv8n and YOLOv8s, inserting the motion module at deeper layers leads to an increase in the total number of parameters in the model. This is expected, since feature maps in later stages of the backbone have a higher number of channels, causing the
convolution in the motion module to require more parameters. Similarly, the GFLOPs of the model also increase, not only due to the larger parameter count but also because the network needs to process two frames at deeper stages, further raising the computational and memory demands.
To further evaluate the effectiveness of the proposed approach, we compare each configuration of the YOLOv8 variants with their corresponding counterparts enhanced using the method proposed in [
29], evaluated on the test set. As previously mentioned, for both YOLOv8n and YOLOv8s, the motion module is placed after block
in the network backbone, as this configuration yields a favorable trade-off between performance and computational cost. These comparisons, shown in
Table 5, provide insight into the relative effectiveness of both motion integration strategies.
Table 5 shows that, for both our method and the approach proposed in [
29], the inclusion of motion information enables the models to surpass the baseline performance, particularly in terms of Recall, F1 Score, and mAP50. Our proposed method achieves comparable results to [
29] in Recall and F1 Score. Although the method from [
29] achieves a higher Precision and mAP50, we will demonstrate that our approach is more robust to lighting variations-an essential factor for real world deployment, where illumination conditions can change due to shadows, cloud cover, and other environmental factors. While our model introduces a slightly higher number of GFLOPs than the method proposed in [
29], it does not require any preprocessing, as motion information is integrated directly at the feature map level, allowing raw images to be used as input and contributes to greater robustness against lighting variations. In addition, our approach operates with only two frames, in contrast to [
29], which relies on three.
As mentioned earlier, we conducted a study to evaluate the impact of using the absolute value of the feature map difference versus the raw residual. The results show that using the absolute value leads to a 0.24% increase in F1 Score and a 0.17% improvement in mAP50 for YOLOv8n, and a 0.12% increase in F1 Score and a 0.64% improvement in mAP50 for YOLOv8s, compared to using the raw residual. These findings suggest that emphasizing the magnitude of motion changes, regardless of direction, helps the network extract more discriminative features.
Table 6 shows that although using raw residuals results in slightly higher precision, applying the absolute value leads to improvements in recall, F1 Score and mAP50. In our approach, we prioritize higher recall to reduce false negatives, as failing to detect a relevant insect can be critical in agricultural monitoring.
To evaluate the robustness of our approach to lighting variations, we tested both our method and the approach from [
29] on the original test set after applying modifications to each input frame. Specifically, we randomly adjusted the HSV (Hue, Saturation, and Value) components to simulate different illumination conditions, such as changes in brightness and color tone. The results of this evaluation are presented in
Table 7. This allows us to assess the robustness of each approach in more challenging, real-world-like conditions.
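A minimal sketch of such an HSV perturbation is given below; the gain values and the use of OpenCV are illustrative assumptions rather than the exact perturbation applied in our experiments.
```python
import random

import cv2
import numpy as np

def random_hsv_jitter(image_bgr: np.ndarray,
                      h_gain: float = 0.015,
                      s_gain: float = 0.7,
                      v_gain: float = 0.4) -> np.ndarray:
    """Randomly scale hue, saturation, and value to simulate changes in brightness and color tone."""
    scale = 1 + np.array([random.uniform(-1, 1) for _ in range(3)]) * np.array([h_gain, s_gain, v_gain])
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] * scale[0]) % 180              # OpenCV represents hue in [0, 180)
    hsv[..., 1] = np.clip(hsv[..., 1] * scale[1], 0, 255)
    hsv[..., 2] = np.clip(hsv[..., 2] * scale[2], 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```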
Table 7 shows that both YOLOv8n and YOLOv8s models enhanced with our motion module outperform their counterparts using the method proposed in [
29]. These results demonstrate that our method is more robust to lighting variations, as motion integration is performed at the feature map level. In contrast, ref. [
29] relies on frame differencing, a technique that is highly sensitive to changes in illumination. This sensitivity is reflected in the significant performance drop observed for models using [
29] under varying lighting conditions.
Finally, we present qualitative results for both YOLOv8n and YOLOv8s models enhanced with the motion module.
Figure 12 shows predictions from the YOLOv8n variant, while
Figure 13 shows results from YOLOv8s. These examples illustrate how the integration of the motion module, by leveraging motion-based feature information, enhances the ability of the model to localize and detect objects in challenging scenarios, such as those involving small objects.
As shown in
Figure 8, some image pairs contain visible motion in the background, particularly in plant parts such as leaves or flowers, likely caused by wind. Despite this, the model achieves accurate detections. This can be attributed to the fact that motion is computed at the feature map level rather than from raw pixel intensities. By performing temporal differencing on feature maps, the network can focus on meaningful patterns while suppressing irrelevant changes. Subsequent layers in the network further refine these feature maps, retaining only representations that are useful for insect detection.
After demonstrating that the motion module improves honeybee detection performance, and exhibits greater robustness to lighting variations than the method proposed in [
29], which relies on frame differencing, we now evaluate the proposed method on the synthetic pest dataset to assess its performance on a broader range of insect types. This evaluation complements the previous analysis on the honeybee dataset and provides insights into the ability of the model to generalize across different insect classes. Below, we present the quantitative results and comparisons with the baseline YOLOv8 models, highlighting the effectiveness of the motion module in this new setting.
Based on the findings from the honeybee dataset experiments, where the insertion of the motion module after block in the network backbone yielded a favorable trade-off between accuracy and computational cost, we place the motion module in this position for all experiments conducted on the synthetic pest dataset. This placement aims to preserve a consistent configuration while allowing the network to integrate motion features at a sufficiently abstract level.
The results in
Table 8 demonstrate the effectiveness of integrating the motion module into the YOLOv8n backbone for pest detection. While the baseline YOLOv8n achieves a slightly higher precision, the model with the motion module outperforms the baseline across all other key metrics: recall increases by 4.66%, F1 Score improves by 2.66%, and mAP50 rises by 3.49%. The higher recall and F1 Score indicate that the motion-enhanced model is more sensitive and effective at detecting pests, including those that may be partially occluded or visually similar to background elements. These improvements are particularly meaningful for pest detection, since failing to detect an insect (i.e., a false negative) can be more critical than an occasional false positive. The increased performance is achieved with only a modest increase in computational cost (an additional 0.81 GFLOPs), making the motion module a practical addition even for resource-constrained environments.
We now extend the analysis to YOLOv8s; these results are shown in
Table 9. When integrating the motion module into the YOLOv8s backbone for pest detection, we observe improvements across all metrics. Compared to the baseline, the model with the motion module increases precision by 1.16%, recall by 1.98%, F1 Score by 1.72 percentage points, and mAP50 by 3.28 points. These results demonstrate that, even for a larger model, the incorporation of motion information remains beneficial. Despite a moderate increase in computational cost (3.03 GFLOPs), the performance improvements justify the inclusion of the module.
Table 10 compares the F1 Score and mAP50 between the baseline YOLOv8 and the YOLOv8 with the motion module integrated after block
, for both the YOLOv8n and YOLOv8s variants. Results are reported for each pest class individually and show that the inclusion of the motion module yields improvements in nearly all pest categories, particularly for smaller insects such as the broad mite, suggesting that motion cues are especially beneficial when visual features alone are insufficient. In summary, the per-class analysis confirms the general effectiveness of the motion module across a variety of pest types.
Finally, we include qualitative results on the synthetic pest dataset.
Figure 14 and
Figure 15 show sample predictions from the YOLOv8n and YOLOv8s variants, respectively. The visualization shows that motion integration enhances detection accuracy in scenarios where appearance alone may be insufficient, such as small or camouflaged insects.
4. Conclusions
This work introduces the motion module, a lightweight component designed to enhance the detection performance of YOLOv8 models by integrating motion information directly at the feature map level. The module operates on the feature maps of two consecutive frames, allowing the network to exploit temporal differences and extract motion cues that are not available in static images. This design choice provides dynamic context that helps the network distinguish small insects from the background, especially under real-world conditions where appearance-based cues alone may be insufficient. Additionally, operating at the feature map level helps reduce the impact of lighting variations compared to methods that rely on frame differencing, such as [
29].
Experimental results demonstrate that the motion module can significantly boost performance across the evaluation metrics. Specifically, mAP50 increases by up to 5.11% for YOLOv8n and 2.34% for YOLOv8s; Recall improves by up to 7.83% and 5.38%, respectively; Precision sees gains of up to 1.34% and 2.49%; and F1 Score increases by up to 4.33% for YOLOv8n and 2.91% for YOLOv8s. These improvements are achieved when the motion module is inserted between blocks and for YOLOv8n, and between blocks and for YOLOv8s in the network backbone, resulting in a modest increase in computational cost of at most 1.67 GFLOPs for YOLOv8n and 3.96 GFLOPs for YOLOv8s.
The results indicate that the motion module is particularly beneficial for lightweight models, such as YOLOv8n, where the integration of motion cues can effectively compensate for limited representational capacity. Moreover, quantitative evaluations under simulated lighting variations (via HSV modifications) reveal that our approach, specifically for YOLOv8n, maintains robust performance with only minor degradation under challenging illumination conditions. This robustness makes the module practical and reliable for real-world applications, where lighting can vary due to shadows, cloud cover, and other environmental factors.
Additionally, a synthetic dataset of common pests known to affect raspberry plants was introduced. Results obtained on this synthetic dataset show that the proposed motion module is not limited to honeybee detection but can generalize effectively to other insect classes. The inclusion of the motion module consistently improved key detection metrics such as F1 Score and mAP50 for both the YOLOv8n and YOLOv8s variants, compared to their corresponding baseline models. Specifically, for YOLOv8n, recall increases by 4.66%, F1 Score improves by 2.66%, and mAP50 increases by 3.49%. For YOLOv8s, precision improves by 1.16%, recall by 1.98%, F1 Score by 1.72 percentage points, and mAP50 by 3.28 points. This highlights the potential of the proposed approach to enhance object detection performance in tasks involving insects in motion.
Future work will explore whether the motion module is object-agnostic, that is, whether it can generalize across detection tasks and architectures without retraining the motion module, offering consistent performance improvements regardless of the type of object being detected. Similarly, we intend to experiment with other feature maps and preprocessing techniques, such as temporal gradients and optical flow.