1. Introduction
The olive tree (
Olea europaea) is economically vital, especially in the Mediterranean [
1]. In Spain, it covers 2.78 million hectares, representing 70% of the EU’s production. However, pests like the olive fruit fly
Bactrocera oleae (Rossi) (Diptera: Tephritidae), considered the most damaging pest of olive production worldwide, cause significant economic losses by damaging fruit and oil quality and can reduce crop yield by up to 100% when infestations are not controlled [
2,
3,
4]. Integrated Pest Management (IPM) has been mandatory in the EU, as established in the European Directive 2009/128/EC on the sustainable use of pesticides and reinforced by the European Green Deal, which aims to reduce pesticide dependency and promote sustainable agricultural practices [
5]. To this end, one of the key fundamentals on which to build an IPM program is the availability of the economic injury level and economic threshold for intervention, which usually depend upon the monitoring of the insect pest populations, together with the selection of the most suitable control method [
6,
7]. The olive fruit fly IPM programs are based on adult trapping, which is considered the most effective and widely used method for monitoring the olive fruit fly population dynamics, to obtain reliable data for correct decision-making [
7,
8]. The development of increasingly precise, rapid, and cost-effective methods that require minimal field technician intervention is one of the great challenges of IPM, with technological innovations as key tools to overcome the limitations of traditional monitoring methods [
9,
10,
11]. Thus, the integration of advanced technologies such as computer vision and machine learning into the detection and counting process emerges as a promising solution for olive fruit fly monitoring and management [
9,
12,
13,
14,
15]. Computer vision enables machines to understand and process visual data, facilitating tasks such as object detection, and it continues to advance with the development of new algorithms [
16]. However, the creation of reliable and accurate automated recognition systems largely depends on the availability of high-quality, representative datasets [
9,
15,
17]. A major challenge is the absence of public and comprehensive olive fruit fly datasets, which limits the development of accurate and robust models for olive fruit fly detection [
9,
17].
Regarding recent advances in automatic fruit fly identification, several studies have explored the application of deep learning and object-detection architectures to improve trap-based monitoring accuracy. In [
18], a YOLO-based framework was implemented for the detection and counting of
B. oleae, demonstrating the feasibility of real-time processing in image-based monitoring systems. The work in [
19] proposed an improved YOLOv5m configuration focused on small insect pest detection, with enhanced robustness under challenging image conditions. In [
20], a cascaded deep-learning approach was applied to greenhouse pest detection using sticky-trap images, highlighting the relevance of computer-vision-based solutions in integrated monitoring programs. Additionally, ref. [
21] evaluated YOLOv5-based detectors for multi-species pest identification. Although these contributions represent meaningful progress, most of them were developed under controlled image-acquisition conditions, relied on limited datasets, or were not evaluated on low-cost embedded platforms, which reinforces the need for a heterogeneous and deployable dataset such as the one presented in this study.
Recent years have seen considerable progress in the development of automatic recognition systems for insect detection, driven by the application of machine learning and deep learning techniques. In particular, models based on architectures like YOLOv5 have demonstrated effectiveness in identifying and classifying several species [
15,
19,
22,
23].
However, some of the main limitations lie in the quality and diversity of the datasets used. Most datasets have a weak structure, containing a limited number of images and lacking variability in factors such as brightness, shadows, and environmental conditions. In addition, many datasets focus exclusively on the segmentation of the target insect, removing the context of the trap and excluding other environmental elements, such as other insects or particles, that could cause confusion in a real-world scenario. Therefore, it is crucial to develop datasets that account for environmental variability and the coexistence of similar species in order to enhance the accuracy and robustness of detection models.
However, the development of reliable deep learning models for olive fruit fly detection has also been limited by the lack of publicly available, sufficiently large, and heterogeneous datasets. Previous studies have typically relied on a very small number of images and restrictive acquisition conditions, which constrains model generalization and increases the risk of overfitting [
17,
18,
19]. Beyond dataset size, key characteristics such as environmental diversity, illumination variability, background complexity, annotation quality, and domain representativeness are essential to ensure robust learning and transferability to real-world scenarios. Therefore, the present work not only expanded the dataset volume, but also increased variability in backgrounds, lighting conditions, camera devices, and specimen states, aiming to provide a more realistic and operationally relevant dataset for field deployment.
The present study aimed to create a high-quality dataset, validated not only through modelling but also by expert assessment, for automated olive fruit fly population recognition. Therefore, one of the main objectives was to develop and characterize a dataset that captures the complexity of the dipteran detection problem in real scenarios by focusing on three main aspects:
Collection of olive fruit fly images incorporating variations in environmental and growing conditions to build a robust dataset.
Development of a comprehensive labelling protocol, leveraging the expertise of entomologists specialized in this dipteran, to ensure the creation of accurate and reliable models.
Validation of the dataset through modelling, statistical analysis, and subsequent evaluation.
Several studies have applied YOLOv5 or similar convolutional neural networks to pest detection in agricultural environments [
9,
14,
18,
19,
20,
21,
23,
24]. However, most of these approaches were evaluated exclusively under controlled laboratory conditions and did not assess model performance on resource-limited or embedded devices. In contrast, this work contributes by: (i) developing a heterogeneous dataset that combines field and laboratory images of
B. oleae under variable lighting and background conditions; (ii) analyzing the effect of training configurations (YOLOv5s and YOLOv5m, 150 and 300 epochs) on recall and F1-score across confidence thresholds; and (iii) validating real-time performance on a Raspberry Pi device. These aspects enhance the practical applicability of deep learning for pest monitoring and support its integration into Integrated Pest Management (IPM) systems.
2. Materials and Methods
The methodology adopted in this work involved a series of steps (
Figure 1). The collection of images from chromotropic traps (yellow sticky panels; Econex S.L., Murcia, Spain) targeting the olive fruit fly was the starting point for building the dataset. Following this, manual labeling was carried out to annotate each instance of the fly in the images. Data augmentation techniques were also applied to artificially enhance the size and diversity of the training dataset. The labeling of the images was a key input for the subsequent training of the convolutional neural network (CNN) models.
Lastly, during the inference stage, the trained model was used to identify and make predictions or decisions based on new input images that it had not encountered during training.
2.1. Creation of the Dataset
2.1.1. Creation of the Field Dataset
For this study, a total of 659 images were collected over two years (2022–2023) in a real field environment (
Figure 2). These collected images are part of the FruitFlyNet-ii project (
http://fruitflynet-ii.aua.gr) (accessed on 10 October 2024), in which an image dataset was created. These images were taken in different scenarios, weather conditions, and environments, capturing natural variations in the fly’s appearance. Data augmentation techniques were applied to the original 659 images to enhance the dataset and improve model generalization, resulting in an expanded dataset comprising 1083 images.
2.1.2. Creation of Laboratory Dataset
To complement the field-acquired images, an additional dataset comprising 370 images was generated under controlled laboratory conditions. This additional dataset was required to overcome the limited and seasonal availability of field images and to ensure sufficient variability for robust model training. The controlled laboratory setup allowed for the incorporation of diverse illumination conditions, backgrounds, camera angles, and specimen states, resulting in a more balanced, representative, and generalizable dataset (
Figure 3).
To replicate field trapping conditions, a laboratory setup was designed to simulate chromotropic traps involving olive fruit fly and other insect species. Daily photographs were captured using a custom support structure that emulated the appearance and conditions of chromotropic traps (
Figure 4). These images were also augmented using data augmentation techniques, resulting in a total of 1337 images.
To approximate the visual characteristics of field images, such as shadows, reflections, and contrast, various light configurations and transparent support materials were evaluated. These included rigid and flexible transparent plastics, as well as commercial plastic wraps, to reproduce the optical properties of the sticky trap surface. Several real-world visual artefacts were intentionally incorporated during laboratory acquisition to guarantee variability in image appearance, including heterogeneous illumination conditions, shadows, surface reflections caused by sunlight, and partial occlusions. Additionally, physical trap contamination was simulated using naturally adhered non-target insects and olive leaves, reproducing common sources of visual noise observed in commercial monitoring traps. Image acquisition was performed using an 8-megapixel Raspberry Pi camera (resolution: 3280 × 2464 pixels; Raspberry Pi Foundation, Cambridge, UK), which was selected based on a comparative evaluation of multiple imaging devices. This camera offered an optimal balance of image quality, focal length suitability, physical dimensions, and cost-effectiveness for the specific imaging requirements of olive fruit fly monitoring (
Figure 5).
2.1.3. Data Augmentation
To improve model generalization and robustness under diverse environmental conditions, several data augmentation techniques were applied to both the field and laboratory datasets. The selected transformations included horizontal flipping, vertical flipping, and 180° image rotation, implemented and automated using Python’s Pillow package (version 10.2.0). These geometric operations were chosen because they realistically emulate variations in trap orientation, camera angle, and insect positioning that may naturally occur in field monitoring scenarios, while preserving the biological integrity and visual characteristics of B. oleae specimens.
The augmentation process substantially increased the total volume and variability of the training data. Specifically, the field dataset expanded from 659 to 1083 images, and the laboratory dataset increased from 370 to 1337 images, resulting in a combined total of 2420 images. This increase, more than doubling the original data, enhanced model exposure to diverse visual patterns, helping to reduce overfitting and improve convergence stability during training.
Although an ablation analysis isolating the contribution of each transformation was not conducted, comparative experiments between non-augmented and augmented datasets demonstrated consistent improvements in model performance. Models trained on augmented data exhibited higher recall and F1-scores, confirming that these augmentations positively impacted the detector’s capacity to identify olive fruit flies across variable imaging conditions.
The field images used in this study originate from the FruitFlyNet-II dataset; however, their number was limited because the activity of B. oleae is seasonal and because a larger dataset was required to support robust model training. For this reason, additional images were generated under controlled laboratory conditions to reproduce the visual characteristics of chromotropic traps and increase dataset diversity. Although different camera types with higher pixel resolutions were preliminarily tested, the final acquisition pipeline employed the same 8-megapixel Raspberry Pi camera used in FruitFlyNet-II, ensuring comparable resolution and optical properties. Consequently, the dataset developed in this study can be considered an extended and diversified version of FruitFlyNet-II, incorporating a broader range of lighting conditions, backgrounds and sample variability while maintaining consistent acquisition parameters.
2.2. Image Labelling
For image annotation, bounding boxes were employed to indicate the position of B. oleae specimens on the trap surfaces. Although multiple labeling tools and platforms are available, the open-source tool LabelImg was used in this study. Annotations were initially saved in the PASCAL VOC (XML) format due to occasional instability of LabelImg when using the YOLO format.
Afterwards, custom Python scripts were implemented to convert the XML annotations into YOLO-compatible TXT format and generate composite images that overlaid the bounding boxes on top of the original images for viewing.
This post-tagging visualization step allowed for efficient inspection of the annotated data using standard image viewers, eliminating the need to reopen the files within the tagging tool (
Figure 6 and
Figure 7). This visual check was essential to ensure the quality of the annotations and to facilitate the subsequent model training phases. Annotation data were stored in formats compatible with object detection frameworks (e.g., COCO, PASCAL VOC, YOLO), allowing for flexibility in model selection.
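A minimal sketch of the XML-to-YOLO conversion step described above (assuming the standard PASCAL VOC layout and a single B. oleae class; the function name and class mapping are illustrative, not the project’s actual script):

```python
import xml.etree.ElementTree as ET


def voc_to_yolo(xml_path: str, class_ids: dict) -> list[str]:
    """Convert one PASCAL VOC annotation file to YOLO txt lines."""
    root = ET.parse(xml_path).getroot()
    w = float(root.findtext("size/width"))
    h = float(root.findtext("size/height"))
    lines = []
    for obj in root.iter("object"):
        cls = class_ids[obj.findtext("name")]
        b = obj.find("bndbox")
        xmin, ymin = float(b.findtext("xmin")), float(b.findtext("ymin"))
        xmax, ymax = float(b.findtext("xmax")), float(b.findtext("ymax"))
        # YOLO format: class x_center y_center width height, all normalized
        xc = (xmin + xmax) / 2 / w
        yc = (ymin + ymax) / 2 / h
        bw = (xmax - xmin) / w
        bh = (ymax - ymin) / h
        lines.append(f"{cls} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    return lines
```

Each returned line can be written to a .txt file with the same stem as the image, which is the layout YOLOv5 expects.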
To ensure accuracy and consistency, all annotations were performed by expert annotators. This rigorous approach was essential to develop a reliable and robust training dataset, which ultimately contributed to improved model performance in olive fruit fly detection and localization.
To confirm the reliability of the manual annotations, a small set of representative images covering different lighting conditions, backgrounds and insect positions was re-annotated by the same expert entomologist at a later time. Bounding-box agreement between the original and repeated labels was quantified using the Intersection over Union (IoU), defined as the ratio between the overlapping area and the total union area of both boxes (
Figure 8).
Across the inspected images, IoU values ranged from 0.63 to 0.79, with an average IoU of 0.71, indicating a consistent and reliable labeling process despite the intrinsic difficulty of annotating small insects on reflective sticky surfaces. This quality-control procedure supports the robustness of the ground-truth annotations used for model training.
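The IoU metric used for this annotation-agreement check can be computed directly from corner coordinates; a minimal sketch (the function name is illustrative):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ixmin = max(box_a[0], box_b[0])
    iymin = max(box_a[1], box_b[1])
    ixmax = min(box_a[2], box_b[2])
    iymax = min(box_a[3], box_b[3])
    # Clamp to zero so non-overlapping boxes yield zero intersection
    iw = max(0.0, ixmax - ixmin)
    ih = max(0.0, iymax - iymin)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

An IoU of 1.0 means the original and repeated boxes coincide exactly; values around 0.7, as observed here, indicate substantial overlap with minor boundary disagreement.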
2.3. Training of Models
2.3.1. Model Training with Different Datasets
The model training process was a critical step in validating the image dataset collected for olive fruit fly detection. To this end, three training models were developed using three distinct datasets: one consisting of field-acquired images, another composed of laboratory-acquired images, and a third combining both sources to form a global dataset. To ensure proper model development, the dataset was partitioned into two subsets: 80% of the images were used for training, while the remaining 20% were reserved for validation (
Figure 9). Model training was conducted using the YOLOv5 architecture, a family of object detection models developed by Ultralytics and recognized for offering a strong balance between inference speed and detection accuracy. The YOLOv5s, YOLOv5m, and YOLOv5l models represent different sizes and levels of complexity within the YOLOv5 family, each optimized for specific applications. In this study, we evaluated the candidate models by training with an image size of 640 pixels for 300 epochs (an epoch being one full pass of the model over the entire training dataset). Training was performed on a GPU server, specifically Google Colab, a cloud-based service that provides access to Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) for model development.
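The 80/20 partition described above can be sketched as follows (a hypothetical helper, not the project’s actual script; it assumes YOLO-style .txt labels stored alongside the .jpg images and uses a fixed seed for reproducibility):

```python
import random
import shutil
from pathlib import Path


def split_dataset(image_dir: str, out_dir: str,
                  train_frac: float = 0.8, seed: int = 42) -> dict:
    """Randomly split image/label pairs into train and val folders (80/20)."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * train_frac)
    splits = {"train": images[:n_train], "val": images[n_train:]}
    for split, files in splits.items():
        for sub in ("images", "labels"):
            (Path(out_dir) / sub / split).mkdir(parents=True, exist_ok=True)
        for img in files:
            shutil.copy(img, Path(out_dir) / "images" / split / img.name)
            label = img.with_suffix(".txt")  # YOLO label with matching stem
            if label.exists():
                shutil.copy(label, Path(out_dir) / "labels" / split / label.name)
    return {k: len(v) for k, v in splits.items()}
```

The resulting images/ and labels/ directory layout matches what the customdata.yaml file points YOLOv5 at.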
After preparing the necessary scripts in the Google Colab terminal for the model training process, we proceeded with training by entering the specific command tailored to the required model size:
!python train.py --batch 8 --epochs 300 --data customdata.yaml --weights yolov5s.pt --cache
This initiates the training process for the different versions evaluated (YOLOv5s and YOLOv5m) and epochs (150 and 300) (
Figure 10). The files generated after training, best.pt and last.pt, contain the weights required by YOLOv5 to perform the subsequent inference process: best.pt stores the best-performing weights obtained during training, while last.pt stores the weights from the final training epoch.
The inclusion of simple geometric augmentations (horizontal and vertical flips and 180° rotations) improved the stability of the training process and enhanced model generalization, particularly under variable illumination and background conditions. These augmentations contributed to the high recall and F1-scores achieved by all configurations.
Additionally, to ensure full reproducibility, the experimental environment and hyperparameter configuration are detailed below. All training procedures were performed using the official Ultralytics YOLOv5 repository in Google Colab. The experimental setup consisted of Python 3.10.12, PyTorch 2.5.0+cu121 and CUDA 12.1, running on an NVIDIA Tesla T4 GPU with 15 GB of VRAM. Training was executed with the YOLOv5 default configuration using an input resolution of 640 × 640 pixels and a batch size of 8. The hyperparameters corresponded to the official YOLOv5 default values, including: learning rate = 0.01, momentum = 0.937, weight decay = 0.0005, warm-up = 3 epochs and default augmentation parameters (hsv_h = 0.015, hsv_s = 0.7, hsv_v = 0.4, translate = 0.1, scale = 0.5, fliplr = 0.5, mosaic = 1.0).
Models were trained using the following command:
python train.py --batch 8 --epochs {150 or 300} --data customdata.yaml --weights yolov5{model}.pt --cache
Unless otherwise indicated, all other hyperparameters were used as provided by the YOLOv5 default configuration. The custom dataset was defined in the customdata.yaml file and included a single target class. Additionally, geometric augmentation was applied offline by generating horizontally flipped, vertically flipped and 180° rotated versions of each image prior to training to increase data diversity and improve generalization.
2.3.2. Optimization of the Model with Versions of YOLOv5s and YOLOv5m
Building on the results obtained from the models trained with different datasets, additional models were developed to optimize performance. The objective was to determine the most suitable model size (small or medium) for effectively detecting the olive fruit fly while also ensuring compatibility with the computational constraints of the Raspberry Pi when using the global dataset (comprising both field and laboratory images). This evaluation included versions of YOLOv5s and YOLOv5m, each trained over 150 and 300 epochs.
The YOLOv5s model represents the smaller and more lightweight variant within the YOLOv5 architecture. It was trained for 300 epochs to promote effective convergence and robust learning of the dataset. Additionally, a version trained with 150 epochs was evaluated to determine whether acceptable performance could be achieved with reduced computational resources and training time. Similarly, the larger YOLOv5m model was trained for both 150 and 300 epochs, enabling a comparative assessment of its performance and potential advantages in terms of model robustness. In general, computer vision processes involve training a machine learning model on a dataset composed of pre-classified and labeled images. During this training phase, the model learns to recognize distinctive patterns and features associated with each image category (
Figure 11).
To ensure a meaningful and fair comparison between models of different sizes (YOLOv5s vs. YOLOv5m) and training durations (150 vs. 300 epochs), all configurations were trained under identical conditions using the same global dataset and data augmentation settings. Model performance was evaluated using standard object-detection metrics, including precision, recall, F1-score, and mean Average Precision (mAP0.5), as well as confusion-based error analysis (false positives and false negatives). In addition, training time and computational cost were considered as secondary criteria because one of the objectives was to select a model suitable for deployment on a resource-constrained embedded device (Raspberry Pi).
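For reference, the point metrics used in this comparison derive from true-positive (TP), false-positive (FP), and false-negative (FN) counts at a fixed IoU and confidence threshold; a minimal sketch (the function name is illustrative):

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Sweeping the confidence threshold and recomputing these counts yields the recall and F1-score curves reported for each configuration.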
2.4. Inference Process in Model Validation with Different Datasets
Model validation was conducted to evaluate the performance and accuracy of the trained models using an independent dataset comprising images not included in the training phase. This validation set allowed for an objective assessment of the model’s ability to generalize to previously unseen data. Validation involved comparing the model’s predictions with the ground truth annotations and calculating standard performance metrics, such as precision and recall, based on this comparison. The primary objective of the validation process was to confirm the model’s robustness and its applicability to real-world scenarios. All validation images were stored in JPEG format, accompanied by their corresponding annotations in TXT files compatible with YOLO, allowing for integration in digital processing (
Figure 12).
2.5. Inference Process in the Validation of Models with Versions of YOLOv5s and YOLOv5m
The optimization of the YOLOv5 models was further evaluated through the inference process. This step consisted of evaluating the model’s ability to detect and classify specimens of B. oleae using an image dataset that had not been used during training or validation. As in the validation phase, model predictions were compared to ground truth annotations, and key performance metrics, such as precision, recall, and confidence scores, were calculated to quantify accuracy. Inference was carried out within the Google Colab environment using the following script:
!python detect.py --weights /content/globalmodelM150.pt --conf 0.50 --source /content/(image for inference)
In this way, this script enabled the model to process input images and perform object detection, identifying and localizing instances of the olive fruit fly.
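When detect.py is additionally run with the --save-txt flag, it writes one YOLO-format .txt file per processed image; per-trap fly counts can then be obtained by tallying lines of the target class. The following is a hypothetical post-processing helper illustrating that step, not part of the original pipeline:

```python
from pathlib import Path


def count_detections(labels_dir: str, target_class: int = 0) -> dict:
    """Count detections of one class per image from YOLO-format .txt files."""
    counts = {}
    for txt in sorted(Path(labels_dir).glob("*.txt")):
        with open(txt) as f:
            # Each non-empty line is "class xc yc w h [conf]"
            n = sum(1 for line in f
                    if line.split() and int(line.split()[0]) == target_class)
        counts[txt.stem] = n
    return counts
```

Such counts per trap image are the quantity ultimately needed for population monitoring and threshold-based IPM decisions.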
4. Discussion
The development of tools based on artificial intelligence (AI), specifically deep learning techniques, is transforming the landscape of IPM [
9,
15,
25,
26,
27], producing a great social and environmental impact and allowing researchers to address challenges related to pest control. Achieving success in creating accurate automated recognition systems for olive fruit fly detection is closely linked to the creation and existence of a robust, high-quality dataset [
15,
17]. The combination of field and laboratory images has proven to be essential for training models to operate efficiently under real-world conditions. Recent studies report that global models trained with combined datasets have achieved accuracies above 90%, validating their applicability [
15,
17,
21]. The effectiveness of machine learning models developed for olive fruit fly detection highlights the critical importance of dataset quality and robustness in ensuring reliable performance. In this context, models trained exclusively on field or laboratory images exhibit limitations when applied across different environments. These constraints motivated the integration of both image types into a unified dataset, enhancing the overall robustness and adaptability of the resulting models. The global model, trained with field and laboratory images, achieved performance, accuracy, and precision above 90%, supporting its effective operation in real field conditions. These values are consistent with previous deep-learning studies on agricultural pest detection, where reported accuracies typically range between 85% and 95% under controlled or semi-controlled imaging conditions [
18,
19,
20,
24]. This alignment reinforces the technical soundness of the proposed approach and confirms that the use of heterogeneous training data contributes positively to model robustness and real-world generalization.
Beyond reporting performance values, it is important to contextualize how the methodological approach of this study differs from previous research. Unlike most insect detection studies based solely on controlled laboratory datasets or synthetic image augmentation [
18,
19,
20], our dataset integrates real field images with laboratory-generated samples that emulate operational acquisition conditions. This hybrid strategy addresses one of the main limitations identified in prior studies, where restricted scenario variability can lead to domain shift and decreased generalization when deployed in non-controlled environments [
14,
17,
18]. Therefore, this work contributes to improving dataset representativeness, which is recognized as a key determinant for robust model generalization.
Similar potential has been reported for other pests [
9,
20], with the successful use of similar models for insect detection in greenhouses emphasizing that dataset quality is the fundamental pillar for model effectiveness in different contexts. Optimization of the machine learning architecture with YOLOv5 has emerged as a highly efficient solution for pest detection. From a methodological perspective, the selection of YOLOv5 aligns with trends in agricultural computer-vision systems, where one-stage detectors have demonstrated better operational feasibility than two-stage approaches (e.g., Faster R-CNN) in resource-constrained settings [
14,
19,
20]. Although alternative architectures such as SSD or YOLOv8 could potentially be explored, a multi-model benchmarking analysis exceeded the scope of this study, which focused on dataset creation and embedded deployment feasibility.
Comparisons performed between light (YOLOv5s) and heavy (YOLOv5m) versions trained with different configurations of epochs demonstrated that the s version, optimized for low-cost platforms such as Raspberry Pi, offers the best ratio between accuracy, robustness and computational efficiency [
14,
21], which is crucial for practical field applications where hardware resources are limited [
28]. Under low-power edge-computing constraints, YOLOv5 provides a favourable trade-off between accuracy, inference efficiency and implementation simplicity compared with heavier or less robust alternatives [
14,
21,
23]. For this reason, YOLOv5 was selected as the most balanced option for real-world deployment on resource-limited hardware. Furthermore, when considering long-term autonomous deployment on embedded IoT devices, it is essential to prevent progressive error accumulation and performance drift, especially in unattended sensing environments. Recent research highlights the need for systematic error monitoring strategies in long-duration edge–AI systems to ensure long-term operational reliability and data integrity, which aligns with the goals of the proposed trap-based monitoring solution [
29]. This study shows that incorporating advanced AI techniques and optimizing the model training process unlocks the considerable potential of machine learning for the accurate detection of olive fruit flies. Additionally, advances in techniques such as transfer learning and convolutional neural networks have extended the applicability of these models to pests of other crops, such as tomato or rice [
24,
30]. These advances not only offer a promising tool for olive growing but also highlight the importance of continued efforts toward scalable and accessible technological solutions. In this regard, the development of a unified global dataset and the implementation of lightweight deep learning architectures emerge as key strategies for integrated pest management within the framework of sustainable agriculture.
Comparison and Interpretation of Results
The achieved recall and F1-scores (up to ≈0.99) confirm that YOLOv5-based models can provide robust detection of
B. oleae under heterogeneous imaging conditions. The small variability observed among repetitions (standard deviation < 0.02) suggests consistent model behavior and satisfactory statistical confidence. Similar deep-learning approaches have been reported for pest recognition in general reviews [
9,
14,
23], and several YOLOv5-based studies have been proposed for different crops and species, such as rice or tomato pests [
19,
20,
24]. In comparison, the proposed system achieved comparable or higher performance while using a smaller but more diverse dataset, and it was the only one validated on a low-cost embedded platform [
24]. This indicates that dataset heterogeneity has a stronger influence on model generalization than dataset size alone.
5. Conclusions
This study developed a heterogeneous dataset and evaluated YOLOv5-based detection models with the goal of enabling an electronic trap for monitoring B. oleae. The best configuration (YOLOv5s trained for 300 epochs) achieved an F1-score of approximately 0.99 and could run inference on a Raspberry Pi with latency compatible with intermittent image capture strategies. These results demonstrate the technical feasibility of an embedded, low-cost detection module for IPM applications.
The dataset and training pipeline were designed for transparency and reproducibility, providing a benchmark for the next stages of trap development, including hardware integration, field validation, and continuous model improvement using images acquired from deployed traps.
Based on these promising results, future work will focus on expanding the dataset, testing more advanced detection architectures, and completing the design and deployment of an intelligent electronic trap for automated population monitoring and decision support within IPM frameworks.
6. Future Work
Future work will focus on expanding the dataset with additional field samples collected throughout multiple seasons and geographic locations, allowing for a broader representation of environmental variability and insect population dynamics. Further improvements will include the evaluation of more advanced deep learning architectures, lightweight model compression strategies, and on-device optimisation techniques to enhance inference efficiency on embedded platforms.
In parallel, the next development stage will focus on the complete design, engineering, and deployment of an intelligent electronic trap integrating the trained detection model with low-power image acquisition, wireless communication capabilities, and autonomous energy management. A functional prototype will be tested under real operational field conditions to evaluate durability, reliability, maintenance requirements, and long-term monitoring performance. Ultimately, the objective is to achieve a fully operational, scalable, and autonomous early-warning system that can be incorporated into Integrated Pest Management (IPM) frameworks to support data-driven decision-making in olive production systems.