1. Introduction
The growing population of the white-tailed deer (
Odocoileus virginianus, hereafter deer) across the United States poses a serious challenge for farmers, who bear significant economic losses from deer depredation of corn, soybeans, cotton, and wheat. For instance, Mississippi has the highest deer density in the nation, and a survey of its row crop producers revealed that 17,830 total acres of farmland were affected by deer damage, resulting in a staggering annual economic loss of
$4.6 million [
1]. The issue occurs nationwide. One study estimated the combined loss from wildlife damage across corn, soybeans, wheat, and cotton at
$592.6 million [
2], and another identified white-tailed deer as the primary wildlife species responsible for most of this damage across the U.S. [
3].
People have explored a variety of control methods to reduce wildlife-related crop damage, such as fencing, hunting, trapping, and repellents. Each of these approaches, however, presents significant limitations in terms of cost, scalability, and long-term sustainability. Fencing is one of the most widely used deterrents, and high-tensile or electrified fences can be effective in reducing deer damage. However, the costs are prohibitive for large-scale agricultural operations, averaging
$13,000 per mile of fencing, and ongoing maintenance is required to repair damage from storms, fallen trees, or determined animals [
4]. Moreover, persistent deer often learn to breach or circumvent barriers, reducing effectiveness over time. Hunting and culling programs represent another strategy. Regulated hunting seasons can provide some localized relief, but they are typically seasonal and sometimes low-intensity, decreasing their effectiveness in reducing deer damage. A recent study on hunting behavior in the U.S. projected that the number of hunters is declining by 1.5% to 2.2% per year [
5]. This further calls into question the long-term efficacy of hunting as a strategy for mitigating deer intrusion on agricultural lands. More intensive damage control programs may reduce local populations more significantly; yet, they often face public opposition, animal welfare concerns, and ecological debates about altering wildlife populations. These programs also require permits from relevant wildlife management agencies, limiting farmers’ autonomy and flexibility [
6].
Repellents have also been widely studied, particularly for deer [
7]. While chemical or odor-based repellents can reduce browsing pressure, they typically degrade quickly after exposure to rain, wind, or sun and therefore require frequent reapplication. Overall, these approaches are largely reactive rather than proactive, addressing the problem only after damage has occurred. They are often unsustainable in the long term and may cause unnecessary stress, injury, or mortality to wildlife, raising ethical as well as ecological concerns [
8].
Sustainable wildlife management that actively prevents deer from entering farms represents a foundational solution to these challenges [
9,
10]. Future systems can leverage advanced computer vision and deep learning techniques, most notably You Only Look Once (YOLO) models [
11], to achieve highly accurate and real-time identification of deer presence and movement [
12]. Detailed outputs, such as the location of deer within an image or enhanced segmentation masks, can then trigger tailored, dynamic deterrents, including species-specific sound or light emissions, or autonomous maneuvers by Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs) [
13]. A common design pattern in emerging research involves a hierarchical process in which initial motion detection, often through Passive Infrared (PIR) sensors, activates more sophisticated vision-based models for animal detection. For example, Ref. [
14] integrated a Convolutional Neural Network (CNN) to identify animals, which then activated non-lethal deterrents such as ultrasonic sounds, flashing lights, or sprinklers. Similarly, Ref. [
15] employed a PIR sensor to trigger a Region-based Convolutional Neural Network (R-CNN), which in turn activated species-specific ultrasonic frequencies to mitigate habituation. More advanced systems have also been developed for other species. For instance, Ref. [
16] designed a wild boar deterrent system that uses YOLOv5n [
17] to confirm presence before escalating deterrents from ultrasonic sound to predator scent (wolf urine) and vocalizations.
Machine vision serves as the foundation for these emerging deterrence systems, acting as the “eyes” for real-time deer detection. Among the available approaches, YOLO models have been preferred over two-stage detectors (e.g., Faster R-CNN or EfficientDet), due to their balance of accuracy and speed, making them well-suited for field deployment [
18,
19]. For instance, Ref. [
18] demonstrated the superiority of YOLOv10 over YOLOv5 for wildlife detection on an NVIDIA Jetson Nano, achieving an AP@0.5 of 0.934. Similarly, Ref. [
20] implemented a YOLOv8m-based surveillance system on a Raspberry Pi, showing its feasibility on low-power edge devices. In a more recent application, Ref. [
21] deployed a TensorRT-optimized YOLOv5 model on an NVIDIA Jetson Orin Nano for a UAV-based deterrence system, achieving an inference time of only 0.025 s per frame. Collectively, these studies affirm that the YOLO family of models has become the de facto standard for practical, high-performance wildlife detection systems on edge hardware.
Despite recent advancements, most deer and wildlife detection studies continue to rely on private datasets or general-purpose object detection datasets. For instance, Ref. [
22] applied YOLOv5 to the PASCAL VOC dataset, while others used aggregated datasets from Roboflow Universe that include multiple animal classes [
18,
20]. In deer-specific research, custom datasets have been used for detecting Sika deer from camera traps with YOLOv8n [
12] and for identifying various deer species from UAV imagery with YOLOv8-seg [
23]. However, these datasets are often limited to simplified scenarios focused on targeted deer populations and are generally not publicly available, making reproducibility and broad benchmarking difficult (see
Table 1). This lack of open-source domain-specific data creates a reproducibility gap, making it difficult to benchmark progress or validate whether models trained in isolation can generalize to new environments.
Furthermore, the existing literature has primarily focused on evaluating individual YOLO iterations in isolation, often neglecting systematic comparison of emerging architectures. Crucially, performance is frequently reported on high-end desktop GPUs (e.g., NVIDIA RTX series), which obscures the practical limitations of deploying these models on resource-constrained edge devices. Therefore, a critical gap exists between the theoretical capability of modern architectures and their verified feasibility in the field.
To address these gaps, we present a publicly available dataset of 3095 annotated deer images with bounding-box labels, derived from the Idaho Cameratraps project, representing challenging real-world scenarios. Using this dataset, we establish a comprehensive benchmark of YOLOv8–YOLOv11 models, evaluating performance not only on a high-end NVIDIA RTX 5090 GPU but also on resource-constrained platforms, including the CPU-based Raspberry Pi 5 and the GPU-accelerated NVIDIA Jetson AGX Xavier. This rigorous approach provides a validated reference for the deployment of autonomous wildlife detection systems in real-world agricultural and conservation scenarios.
The key contributions of this work are as follows:
An open-source dataset of 3095 annotated deer images with bounding-box labels, covering diverse environmental conditions and lighting scenarios.
A systematic evaluation and benchmarking of four YOLO architectures (v8–v11), encompassing 12 model variants for deer detection.
Inference benchmarking of the 12 YOLO model variants on edge devices, including the CPU-based Raspberry Pi 5 and the GPU-accelerated NVIDIA Jetson AGX Xavier.
This paper is organized as follows.
Section 2 describes the data acquisition methods, model selection, training, evaluation, and inference metrics.
Section 3 presents the model evaluation results, along with the limitations of the study.
Section 4 interprets the results and provides a broad discussion regarding the implications and limitations of this study. Finally,
Section 5 provides concluding remarks and potential directions for future work.
2. Materials and Methods
This section describes the data acquisition process, the YOLO models employed, the training procedure, and the model evaluation strategy, followed by inference experiments conducted on edge devices.
2.1. Data Acquisition
To enable a comprehensive comparison, we curated two deer image datasets. We first selected a dataset from Roboflow Universe containing annotated images of deer, which will be referred to as the “Roboflow dataset”. The Roboflow dataset [
30] consists of 2339 images of different deer species captured in diverse environments, mostly under favorable lighting conditions and with minimal motion, as is common in manually photographed images (
Figure 1). The dataset was divided into a training set of 2043 images and a validation set of 296 images. However, this type of dataset does not fully capture the challenges of real-world field conditions, such as low-light environments, motion blur, partial occlusion, and varying camera trap settings.
To better represent these scenarios, we curated a second dataset from the Idaho Cameratraps project, shared by the Idaho Department of Fish and Game via the LILA BC online repository [
31]. The Cameratraps project contains over 1.5 million camera trap images collected from video sequences across multiple regions. While the original dataset provides only sequence-level labels without bounding-box annotations, we filtered approximately 7000 images containing deer (confidence score > 0.9) and manually annotated 3095 images with bounding boxes using the Computer Vision Annotation Tool (CVAT) [
32]. This “Cameratrap dataset” was then divided into a training set with 2578 images and a validation set with 517 images (
Figure 1). For both datasets, the training set was used exclusively for training, while the validation set was held out solely for evaluation throughout the study.
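The confidence-based filtering and train/validation split described above can be sketched as follows. This is a minimal illustration, not the actual curation script: the record schema (`image` and `confidence` keys) and the helper `filter_and_split` are assumptions, and the validation fraction mirrors the 2578/517 division (≈16.7%).

```python
import random

def filter_and_split(records, conf_threshold=0.9, val_fraction=0.167, seed=42):
    """Keep images whose deer confidence exceeds the threshold, then split.

    `records` is assumed to be a list of dicts with 'image' and 'confidence'
    keys; the actual LILA BC / Idaho Cameratraps metadata schema may differ.
    """
    keep = [r["image"] for r in records if r["confidence"] > conf_threshold]
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(keep)
    n_val = round(len(keep) * val_fraction)
    return keep[n_val:], keep[:n_val]  # (training set, validation set)
```

A fixed random seed keeps the split reproducible, so the validation images stay isolated from training across repeated runs.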
The two datasets differ significantly in complexity and realism. The Roboflow dataset largely consists of clean, manually photographed images of deer, mostly under good lighting and with limited motion or occlusion, making it more suitable for baseline model training. In contrast, the Cameratrap dataset captures deer under challenging real-world conditions, including varying backgrounds, night/low-light settings, occlusion, and camouflage. As shown in
Figure 1, the Cameratrap dataset therefore provides a more realistic benchmark for evaluating model robustness in field deployments.
2.2. Deep Learning Models
The YOLO framework made a groundbreaking contribution to computer vision, particularly in object detection, by introducing a single-pass approach that framed detection as a regression problem. This innovation enabled real-time inference, which was lacking in other state-of-the-art models at the time [
33]. Since then, the YOLO family has evolved rapidly, producing numerous variants that build on the original design with architectural and performance enhancements [
11].
In this study, we focus on four recent versions, YOLOv8, YOLOv9, YOLOv10, and YOLOv11, for comparative benchmarking. The study is further constrained to the three most lightweight variants of each model family: nano (n), small (s), and medium (m). Notably, for YOLOv9, the tiny (t) variant was used because its parameter count is similar to that of the nano-scale variants. YOLOv8 introduced notable improvements, including an anchor-free detection method for enhanced accuracy, architectural refinements in the backbone and detection head to better capture objects of varying scales, and a decoupled head design that further improved precision [
34,
35]. We conclude our comparisons with YOLOv11. While alternative architectures, such as YOLO-NAS (Neural Architecture Search) or the latest transformer-hybrid variants (e.g., YOLOv12), offer promising capabilities, they represent divergent evolutionary branches or a gradual shift towards attention-based vision transformer architectures [
11,
36,
37]. This study specifically bounds its scope to the continuous CNN-based lineage of YOLOv8 through YOLOv11 to provide a controlled comparative analysis.
2.2.1. YOLOv8
YOLOv8, developed by Ultralytics, represents a state-of-the-art object detection model that introduced several architectural improvements over its predecessor, YOLOv5 [
38]. One of its key innovations is anchor-free estimation, which accelerates post-processing, particularly Non-Maximum Suppression (NMS), for selecting prediction boxes [
39]. Additionally, the model employs the CSPDarknet backbone, which reduces computational complexity while maintaining accuracy by splitting feature maps and incorporating the Sigmoid Linear Unit (SiLU) activation function [
38]. This design enhances gradient flow and feature reuse, resulting in smaller, more efficient models well-suited for edge devices. Furthermore, YOLOv8 integrates a PANet neck, which improves multi-scale feature learning by augmenting the traditional Feature Pyramid Network (FPN) with a bottom-up path alongside the top-down pathway [
40]. Another improvement is the C2f (Cross Stage Partial network with two fusion blocks) module, which captures complex patterns more effectively, further boosting model accuracy [
38,
41]. Ultralytics provides YOLOv8 in five sizes (“n”, “s”, “m”, “l”, and “x”), offering flexibility depending on accuracy and resource constraints. The smallest model, YOLOv8n, has 3.2 million parameters, while the largest, YOLOv8x, has approximately 68.2 million parameters.
2.2.2. YOLOv9
YOLOv9 architecture marks a significant improvement over earlier YOLO versions in terms of speed and accuracy of object detection. A key challenge it addressed is the information bottleneck problem in deep neural networks, where gradients diminish or useful information is lost as they propagate through successive layers of large models. To overcome this, YOLOv9 introduced Programmable Gradient Information (PGI), which ensures more reliable gradient flow and enhances the network’s ability to learn from complex image features. As shown in
Figure 2, PGI is implemented through a reversible auxiliary branch that updates the main branch by generating useful gradients through a supervision mechanism. Moreover, YOLOv9 introduced the Generalized Efficient Layer Aggregation Network (GELAN), an advancement over the original Efficient Layer Aggregation Network (ELAN) architecture. Unlike ELAN, which relied solely on convolutional blocks, GELAN can flexibly incorporate different computational blocks, improving both efficiency and generalization. Together, the integration of PGI and GELAN enables YOLOv9 to achieve robust gradient flow, efficient computation, and superior detection performance [
42,
43].
2.2.3. YOLOv10
YOLOv10 introduced several architectural innovations aimed at improving both efficiency and accuracy, while addressing key limitations in earlier YOLO versions [
45]. One major issue identified in prior models was the reliance on NMS during post-processing, which added computational overhead and delayed inference. To overcome this, YOLOv10 adopted NMS-free training by incorporating a one-to-one prediction head alongside the traditional one-to-many head during training, allowing the network to optimize using both. However, at inference time, only the one-to-one head is used, which eliminates the need for NMS and enables faster, end-to-end deployment. Another notable improvement is the simplification of the classification head. The authors observed that the regression head plays a more critical role in YOLO performance, so the classification head was streamlined to reduce computational cost without sacrificing accuracy.
YOLOv10 also redesigned its convolutional blocks to reduce redundancy. Traditional YOLO architectures used 3 × 3 convolutions with a stride of 2 to simultaneously downsample spatial dimensions (from H × W to H/2 × W/2) and expand channels (from C to 2C). In contrast, YOLOv10 decouples these operations: a pointwise convolution first handles channel transformations, followed by a depthwise convolution for spatial downsampling. This design significantly reduces computational overhead. To further enhance accuracy, YOLOv10 integrates partial self-attention and employs improved large-kernel convolutions, particularly for lightweight models. Together, these modifications result in a more efficient and accurate object detection framework suitable for real-time applications.
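As a back-of-the-envelope illustration of this saving, the multiply-accumulate (MAC) counts of the fused and decoupled designs can be compared directly. The feature-map size and channel count below are illustrative placeholders, not YOLOv10's actual layer dimensions.

```python
def conv_macs_standard(h, w, c, k=3):
    """Fused 3x3 stride-2 conv: H x W x C -> H/2 x W/2 x 2C in one step."""
    return (h // 2) * (w // 2) * (2 * c) * (k * k * c)

def conv_macs_decoupled(h, w, c, k=3):
    """Pointwise channel expansion followed by depthwise spatial downsampling."""
    pointwise = h * w * (2 * c) * c                       # 1x1 conv: C -> 2C at H x W
    depthwise = (h // 2) * (w // 2) * (2 * c) * (k * k)   # one 3x3 filter per channel
    return pointwise + depthwise

# Example with illustrative mid-network sizes
h, w, c = 80, 80, 256
print(conv_macs_standard(h, w, c) / conv_macs_decoupled(h, w, c))  # ratio ≈ 2.2
```

For large channel counts the pointwise term dominates, so the decoupled design needs less than half the MACs of the fused convolution while preserving the same channel expansion.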
2.2.4. YOLOv11
YOLOv11 represents the most recent advancement in the YOLO family, developed by Ultralytics as a more sophisticated yet efficient model. It extends the versatility of YOLO by supporting multiple computer vision tasks, including object detection, instance segmentation, and pose estimation, while maintaining strong performance relative to its size [
46]. Built upon the YOLOv8 architecture, YOLOv11 introduces two key innovations. The C2PSA (Cross Stage Partial with Spatial Attention) block enables the model to focus on the most relevant regions of an image, thereby improving its ability to detect small and occluded objects. Meanwhile, the C3k2 (Cross Stage Partial with 3 Convolutions and 2 Kernels) block reduces computational complexity while preserving rich feature representations, making the model more efficient without compromising accuracy.
Ultralytics maintains the YOLO family of models and provides open-source implementations of the versions it primarily developed, such as YOLOv8 and YOLOv11. In practice, YOLO models present a tradeoff between model size and accuracy. Although they are widely applicable across various computer vision tasks, our study focuses specifically on their object detection capabilities. To capture the balance between performance and computational efficiency, we selected the “n”, “s”, and “m” variants of each model. Among these, YOLOv8m is the largest, with 25.84 million parameters, while YOLOv9t is the smallest, with only 1.94 million parameters (see
Table 2). Each version incorporates different techniques for gradient flow, optimization, and image processing. By evaluating performance across a wide range of architectural variations and model sizes, we provide deeper insights that support informed decision-making when selecting YOLO models for deer detection tasks.
2.3. Training
The YOLO models were trained on an NVIDIA GeForce RTX 5090 GPU with 32 GB of VRAM, a high-end platform well-suited for deep learning workloads. The training was conducted on a Linux machine with CUDA version 12.9 (
Table 3). While the default training configurations provided by Ultralytics are effective, adjustments were made to better leverage the available hardware. In particular, the batch size and number of workers were tuned for different models based on their parameter size. Larger batch sizes enable faster training by improving GPU utilization [
47], but they are constrained by memory usage, especially for larger models. Therefore, careful experimentation was conducted to identify the optimal values for batch size and the number of workers, given the model size and available VRAM. All models were trained using Common Objects in Context (COCO) pretrained weights for a total of 100 epochs with an image size of 640, an initial learning rate of 0.01, and the Stochastic Gradient Descent (SGD) optimizer with standard augmentation techniques (
Table 4).
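A training run with these settings via the Ultralytics Python API might look like the following configuration sketch. `deer.yaml` is a placeholder dataset config, and the `batch`/`workers` values stand in for the per-model tuned settings; this is an illustration, not the exact script used in this study.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")    # COCO-pretrained weights
model.train(
    data="deer.yaml",          # placeholder dataset config (train/val paths)
    epochs=100,
    imgsz=640,
    lr0=0.01,
    optimizer="SGD",
    batch=64,                  # tuned per model size and available VRAM
    workers=8,                 # tuned alongside batch size
)
```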
2.4. Testing and Performance Evaluation
In this study, the model needs to correctly identify the presence of deer in an image and localize them with bounding boxes. In addition, we need to assess the computational efficiency of the models and the feasibility of real-time deployment. There are several standard metrics to evaluate the performance of models.
Intersection over Union (IoU): IoU is a widely adopted metric used to evaluate the correctness of detections produced by an object detection model. The model is trained to predict bounding boxes around objects in an image, and IoU measures how closely these predictions match the ground-truth bounding boxes [
48].
Precision (P): Precision quantifies the proportion of detections that are correct, i.e., true positives relative to all predicted positives. In object detection, a detection is considered a true positive if both the class label and the localization (IoU above the chosen threshold) are correct.
Recall (R): Recall measures the proportion of ground-truth objects correctly detected, i.e., true positives relative to all actual objects. Missed detections or those failing the IoU threshold contribute to false negatives.
F1 Score: The F1-score is the harmonic mean of precision and recall, providing a single balanced measure of detection accuracy at a given confidence threshold.
AP (Average Precision): AP summarizes the tradeoff between precision and recall across confidence thresholds by computing the area under the precision–recall curve. Higher AP values indicate better overall detection performance [
48].
Mean Average Precision (mAP): mAP is the mean of AP across all predicted classes. In this study, it reduces to the AP of the single “Deer” class. Commonly reported values include mAP@0.5 (IoU threshold of 0.5) and mAP@0.5:0.95 (averaged across thresholds from 0.5 to 0.95 with step size 0.05).
Inference Time: Inference time is the average time required for a model/computational graph (e.g., ONNX model) to execute a forward pass on the hardware accelerator (CPU/GPU). It reflects the raw computational capability of the device and is reported as the batch-amortized time per image.
Processing Time (System Overhead): Processing time includes the time required for additional CPU-bound operations to prepare the data and interpret the results. This encompasses pre-processing (disk I/O, resizing, and normalization) and post-processing (decoding raw tensors, coordinate scaling, and confidence thresholding) time. Visualization is excluded from this metric to isolate detection performance. Thus, total time is reported in this study as the sum of inference time and system overhead.
Frames Per Second (FPS): FPS is derived strictly from inference time to benchmark the maximum computational throughput of the edge devices. This helps to isolate the accelerator’s performance from system-level bottlenecks related to processing time (system overhead).
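As a concrete reference, the box-level metrics above can be sketched as follows, assuming axis-aligned boxes in (x1, y1, x2, y2) pixel coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

At an IoU threshold of 0.5, a prediction with the correct class and IoU ≥ 0.5 against a ground-truth box counts as a true positive; the resulting counts of true positives, false positives, and false negatives then yield precision, recall, and F1.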
In this study, we use IoU, precision, recall, F1-score, and AP@0.5 to measure detection accuracy, while inference time and processing time are used to evaluate computational capability [
48]. To ensure consistency in performance reporting, a standardized evaluation procedure was followed for all timing measurements. Each inference session was preceded by a pre-load phase, consisting of 3 iterations over 8 sample images to stabilize hardware clock frequencies and initialize the computational graph within system memory. Following this initialization, inference was executed sequentially across the entire validation dataset (517 images). The reported values represent the arithmetic mean derived from this full-set evaluation. This approach ensures that the performance metrics reflect sustained hardware capability across diverse environmental inputs rather than isolated peak performance.
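The warm-up and averaging procedure can be sketched as follows; `run_inference` is a hypothetical stand-in for the per-image ONNX Runtime session call, not a function from any specific library.

```python
import time

def benchmark(run_inference, images, warmup_images, warmup_iters=3):
    """Return (mean inference time per image, FPS) after a warm-up phase."""
    for _ in range(warmup_iters):       # pre-load phase: stabilize clock
        for img in warmup_images:       # frequencies, initialize the graph
            run_inference(img)
    start = time.perf_counter()
    for img in images:                  # sequential pass over the full
        run_inference(img)              # validation set
    mean_t = (time.perf_counter() - start) / len(images)
    return mean_t, 1.0 / mean_t         # FPS derived strictly from inference time
```

Deriving FPS from the timed loop alone excludes pre- and post-processing, matching the separation of inference time and system overhead described above.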
2.5. Inference on Edge Devices
The YOLO models were trained on high-performance workstations. However, for applications such as deer detection, real-time identification and localization are essential for system effectiveness. One approach is to outsource inference to a cloud-based server, where live video frames are continuously transmitted and processed, with detection results returned to the client [
49]. However, such solutions often face challenges related to bandwidth requirements, latency, and network security. Edge computing provides an alternative by enabling local image or video processing, thereby reducing latency and ensuring rapid response. Performing all computation on the local device also creates an independent detection and tracking system, which is particularly advantageous in remote agricultural settings where reliable cloud connectivity cannot be guaranteed [
50]. A wide range of off-the-shelf edge devices are available for such applications, including single-board computers (SBCs) such as the Raspberry Pi, NVIDIA Jetson platforms, USB accelerators, and mobile devices with embedded CPUs and GPUs [
51]. We evaluate inference performance on two representative devices: a CPU-based Raspberry Pi and a GPU-accelerated NVIDIA Jetson platform (
Figure 3).
2.5.1. Raspberry Pi 5
Raspberry Pi is a series of small single-board computers that offer low-cost, small-sized hardware for computing. They are often credit-card-sized but deliver excellent computing power for their size. In this study, we test the performance of trained models on Raspberry Pi 5, which is the newest model of the Raspberry Pi series (
Figure 3). It features a quad-core 64-bit Arm Cortex-A76 CPU. The Raspberry Pi is widely used in applications such as real-time image or video processing since it supports a dedicated Camera Serial Interface (CSI). Moreover, it can operate on low power, making it suitable for remote deployment as an independent computer capable of acquiring real-time data from the environment, such as the presence of deer in our case. For this study, a device with 16 GB of system memory running Ubuntu 22.04 was used.
2.5.2. Jetson AGX Xavier
NVIDIA Jetson AGX Xavier is a powerful edge device designed explicitly to address the rigorous demands of robotics and edge AI applications. It is centered around a highly integrated System-on-Chip (SoC) that combines a Volta GPU, ARM-based CPUs, and specialized hardware accelerators within a unified memory framework. The GPU incorporates 512 CUDA cores and 64 tensor cores, capable of delivering up to 11 TFLOPS (FP16) or 22 TOPS (INT8). It operates at a maximum frequency of 1.37 GHz with dynamic scaling, enabling fine-grained control of power and performance [
52]. The tensor cores provide significant efficiency for AI workloads, as they are optimized for matrix multiply–accumulate (MMA) operations, which underpin most deep learning algorithms. Moreover, the AGX Xavier integrates dedicated accelerators such as Deep Learning Accelerators (DLA) and Programmable Vision Accelerators (PVA), providing hardware-level support for diverse workloads in autonomous and embedded systems. In addition, this device comes with a compact form factor of 4.2 × 4.2 × 4 inches (
Figure 3 (right)). Together, these features make the Jetson AGX Xavier a versatile platform for real-time robotics and edge AI deployment.
2.5.3. Model Deployment and Optimization
While powerful hardware provides the foundation for high-speed computing, the performance of AI models is equally dependent on their software implementations [
53]. To fully exploit the capabilities of the edge devices, specialized software frameworks and optimizations are required. Ultralytics provides YOLO models with optimizations for accuracy and speed; however, the default PyTorch 2.7.1 [
54] exports are not always the most efficient for deployment on storage- and performance-constrained devices. To address this, several open-source model optimization methods enable faster inference without compromising accuracy. Common export and deployment formats include ONNX, OpenVINO, TorchScript, TensorFlow Lite, NCNN, and TensorRT [
55]. Each format supports different types of AI models while incorporating optimizations to improve inference speed on specific hardware. In many cases, the creators of these model frameworks also provide dedicated runtime platforms to ensure efficient execution across devices.
In this study, we evaluated the YOLO models on both devices by converting them to ONNX format. ONNX provides a framework-agnostic representation of deep learning models as computation graphs, enabling portability across platforms [
56]. Inference was performed using ONNX Runtime, a lightweight, high-performance engine that reduces deployment overhead and supports hardware-specific optimizations via execution providers (e.g., ARM CPUs or CUDA GPUs). By utilizing a unified deployment pipeline, we eliminate software-specific optimization as a confounding variable, ensuring that observed performance discrepancies are primarily attributable to hardware architecture and model design. While specialized compilers, such as NVIDIA TensorRT for Jetson devices or TFLite for the Raspberry Pi, can unlock significant performance gains, they often introduce platform-specific conversion artifacts [
57,
58,
59]. This study prioritizes a reproducible, portable baseline to benchmark the intrinsic capabilities of each device under a common software stack. The specific inference and evaluation settings used across both devices for inference on ONNX models are summarized in
Table 5. Minor platform-specific variations are expected despite identical evaluation settings.
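As an illustration of the shared ONNX Runtime pipeline, a minimal letterbox pre-processing step and session call might look like the sketch below. The input tensor name `"images"` and the gray (114) letterbox fill follow common Ultralytics export conventions but are assumptions that should be verified against the exported graph; the nearest-neighbor resize keeps the sketch dependency-free, whereas a real pipeline would typically use cv2.resize with bilinear interpolation.

```python
import numpy as np

def preprocess(img, size=640):
    """Letterbox an HxWx3 uint8 frame to the square model input."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    nh, nw = round(h * scale), round(w * scale)
    # nearest-neighbor resize via index lookup (illustrative only)
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs]
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)  # gray letterbox fill
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    # HWC uint8 -> NCHW float32 in [0, 1]
    return canvas.transpose(2, 0, 1)[None].astype(np.float32) / 255.0

# Inference with ONNX Runtime (tensor name assumed from Ultralytics export):
# import onnxruntime as ort
# sess = ort.InferenceSession("yolov8n.onnx", providers=["CPUExecutionProvider"])
# outputs = sess.run(None, {"images": preprocess(frame)})
```

Swapping `CPUExecutionProvider` for `CUDAExecutionProvider` is how the same model graph targets the Jetson's GPU, keeping the rest of the pipeline identical across devices.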
4. Discussion
4.1. Assessment of the Domain Gap
One of the primary outcomes of this study is the empirical quantification of the domain gap between controlled detection environments and the complex conditions encountered in real-world wildlife monitoring. While contemporary YOLO architectures frequently report near-perfect precision on curated benchmark datasets (e.g., COCO or cleaned Roboflow subsets), our results demonstrate a substantial degradation in performance under field conditions characterized by camouflage, occlusion, and heterogeneous illumination.
Specifically, models that perform at or near the state of the art in benchmark settings declined to approximately 0.74 AP when evaluated on the Cameratrap dataset, corresponding to an approximate 25% reduction in detection performance. This finding underscores the limitations of benchmark-centric evaluation and highlights the importance of domain-specific data in ecological monitoring applications. Our results suggest that architectural advances alone are insufficient to ensure reliable deployment performance and instead emphasize the need for data-centric strategies and evaluation protocols tailored to operational environments. Consequently, this work provides a realistic performance baseline and informs future efforts toward robust, field-ready computer vision systems for agricultural and wildlife monitoring.
4.2. Choices of CPU or GPU Edge Devices
The choice between CPU- and GPU-based edge devices depends largely on the requirements of the target application. CPU platforms, such as the Raspberry Pi, are low-cost, energy-efficient, and suitable for lightweight tasks where occasional or event-triggered detections are sufficient, for example, confirming the presence of a stationary animal or supporting long-term monitoring in power-constrained environments. By contrast, GPU-accelerated platforms, such as the NVIDIA Jetson AGX Xavier, are better suited for real-time applications that demand high-frame-rate video analysis, continuous tracking, and rapid decision-making. The Jetson provides the computational capacity to run larger models at interactive speeds, making it ideal for dynamic scenarios where animals are moving, and system responsiveness is critical. In practice, the Raspberry Pi offers accessibility and low-power operation, while the Jetson delivers the performance necessary for advanced, real-time wildlife detection and deterrence. The decision between the two thus reflects a trade-off between efficiency, cost, and the level of intelligence required at the edge.
4.3. Robustness to Other Species
In this work, we curated the Cameratraps dataset, which covers a diverse and challenging range of environmental conditions (see
Figure 1). However, the ability of the models to generalize to other deer species, or to avoid misclassifying non-deer animals as deer, has not yet been evaluated. For deployment in a real-world automatic deer deterrence system, robustness against such incorrect classifications is essential. Out-of-distribution (OOD) detection methods could be applied to mitigate these risks [
60]. Achieving this will require additional studies, including the development of a multi-species dataset to rigorously test cross-species generalization and misclassification robustness.
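To make the idea of post-hoc rejection concrete, the sketch below shows one way such a gate might sit between a detector and a deterrence trigger. This is a hypothetical illustration, not part of the evaluated system: the threshold values and the `ood_score` field (a placeholder for whatever score an OOD method such as those in [60] would produce) are assumptions.

```python
# Minimal sketch of a post-hoc rejection gate for a deer detector.
# Detections below a calibrated confidence threshold, or flagged by an
# auxiliary OOD score, are suppressed rather than passed to the deterrent.
# Both thresholds are hypothetical operating points, not values from this study.

CONF_THRESHOLD = 0.60   # hypothetical minimum detection confidence
OOD_THRESHOLD = 0.50    # hypothetical OOD cutoff (higher = more in-distribution)

def filter_detections(detections):
    """Keep only detections that are both confident and in-distribution.

    Each detection is a dict with keys: 'label', 'conf', 'ood_score'.
    """
    kept = []
    for det in detections:
        if det["conf"] < CONF_THRESHOLD:
            continue  # too uncertain: likely background or an unfamiliar animal
        if det["ood_score"] < OOD_THRESHOLD:
            continue  # input looks unlike the training distribution
        kept.append(det)
    return kept

detections = [
    {"label": "deer", "conf": 0.91, "ood_score": 0.85},  # confident, in-distribution
    {"label": "deer", "conf": 0.55, "ood_score": 0.80},  # fails the confidence gate
    {"label": "deer", "conf": 0.88, "ood_score": 0.30},  # flagged as OOD (e.g., elk)
]
print([d["conf"] for d in filter_detections(detections)])  # [0.91]
```

In a deployed deterrence system, only detections passing both gates would trigger an actuation, trading a small loss in recall for robustness against false activations on non-target species.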
4.4. Dataset Limitation and Scope
While the Cameratraps dataset presents a critical benchmark for deer detection in challenging environmental conditions, the scope and generalization of the presented findings are subject to some constraints. First, the dataset is characterized by the specific vegetation and terrain of the intermountain West (Idaho), making it most suitable as a testbed for regional deployments. Second, the modest scale of 3095 images does not allow large models to be trained entirely from scratch without risking overfitting. However, this scale effectively simulates the data-constrained scenarios typical of wildlife monitoring and demonstrates the effectiveness of transfer learning from pre-trained YOLO weights. Finally, because the current annotations aggregate all detections into a single “Deer” class, the taxonomic scope is limited. This unified class, however, establishes a robust baseline for target-species detection and cleanly isolates detection capability. It will be fruitful to extend this dataset in the future to explicitly characterize interactions with non-target species, such as elk or livestock, which are often encountered in the same geography.
4.5. Real-World Deployment
Our evaluation on edge devices showed strong detection accuracy across platforms, but real-world deployment also requires speed, efficiency, and reliability. The NVIDIA Jetson AGX Xavier offered a favorable balance of performance and accuracy, supporting real-time operation. In contrast, smaller low-power devices such as the Raspberry Pi 5 performed poorly with larger models, whose high compute demands (GFLOPs) quickly overwhelmed the CPU. The study further indicates that simply converting PyTorch models to ONNX and running them with ONNX Runtime is insufficient for achieving real-time performance on the Raspberry Pi. However, specialized optimization frameworks and model compression techniques can significantly shift these boundaries. For instance, TensorRT on Jetson devices applies hardware-specific optimizations that push throughput toward the hardware’s upper bound. Similarly, on the Raspberry Pi 5, lightweight frameworks such as NCNN or TFLite, particularly when combined with INT8 quantization, have been shown to improve frame rates [
57]. These measures are essential to minimize resource usage while maintaining acceptable detection performance in practical field settings. The benchmarking in this study serves as a critical reference point for researchers to determine whether a model requires such intensive optimization for a specific deployment setting.
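To illustrate what INT8 quantization does in principle, the following is a minimal sketch of symmetric post-training weight quantization, the kind of compression that frameworks such as TFLite apply to shrink models for CPU-only devices like the Raspberry Pi 5. Real toolchains also calibrate activation ranges and use per-channel scales; this simplified sketch covers a single weight vector with one scale factor.

```python
# Minimal sketch of symmetric INT8 post-training quantization.
# Floats are mapped to int8 with a single scale factor chosen so that the
# largest-magnitude weight maps near the int8 limit (127).

def quantize_int8(weights):
    """Map float weights to int8 with a single symmetric scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [qi * scale for qi in q]

weights = [0.52, -1.27, 0.003, 0.98]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)

# Round-trip error stays within half a quantization step (scale / 2),
# which is the accuracy cost traded for 4x smaller weights than float32.
max_err = max(abs(w - r) for w, r in zip(weights, recovered))
assert max_err <= scale / 2 + 1e-9
```

The 4x reduction in weight size (int8 vs. float32), together with faster integer arithmetic on ARM CPUs, is the mechanism behind the frame-rate gains reported for quantized models in [57].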
Furthermore, a limitation of the current performance evaluation is its reliance on steady-state average metrics recorded in a controlled environment. While the results provide a consistent baseline for performance comparison, they may not account for the performance variance caused by thermal throttling or system interrupts encountered in unpredictable field conditions. Future extensions should involve long-term operational stability tests that report latency variance (e.g., standard deviation and worst case) over extended deployment periods to better characterize the deployment reliability of these models. Subsequent studies should also incorporate physical power metering to profile the energy consumption per inference across edge devices, a critical constraint for battery-powered remote agricultural deployments.
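The kind of long-run profiling proposed above can be sketched as follows: instead of reporting a single steady-state average, per-inference latency is recorded over an extended run so that mean, standard deviation, and worst case are all visible, making throttling-induced variance measurable. The `run_inference` function here is a stand-in for the real detector call, which is an assumption of this sketch.

```python
# Minimal sketch of long-run latency profiling on an edge device.
# Recording every per-inference latency (rather than one average) exposes
# tail behavior caused by thermal throttling or system interrupts.

import statistics
import time

def run_inference():
    # Placeholder workload; replace with the actual model invocation.
    sum(i * i for i in range(10_000))

def profile(n_runs=200):
    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_inference()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "std_ms": statistics.stdev(latencies_ms),
        "worst_ms": max(latencies_ms),
    }

stats = profile()
print(f"{stats['mean_ms']:.2f} ms ± {stats['std_ms']:.2f} ms "
      f"(worst {stats['worst_ms']:.2f} ms)")
```

Reporting the mean together with its spread and worst case, rather than the mean alone, is what allows a deployment decision (e.g., "sustains 25 FPS even under throttling") to be made with confidence.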
5. Conclusions
In this paper, we introduced an open-source dataset of 3095 annotated deer images with bounding-box labels, capturing diverse environmental conditions and lighting scenarios. We benchmarked 12 model variants from four recent YOLO architectures (v8, v9, v10, and v11) and evaluated their viability on two representative edge devices: the CPU-based Raspberry Pi 5 and the GPU-powered NVIDIA Jetson AGX Xavier. Our results quantify a critical domain gap, demonstrating that while modern architectures are theoretically capable of high precision, their real-world feasibility varies significantly across hardware. The larger m-series models achieved the highest accuracy on high-end hardware, with AP@0.5 scores exceeding 0.94. However, when run through a standardized, framework-agnostic ONNX Runtime, their computational demands made them unsuitable for real-time deployment on CPU-only devices such as the Raspberry Pi 5, where throughput dropped below 1 FPS. In contrast, the Jetson AGX Xavier provided an optimal balance, sustaining real-time processing speeds above 25 FPS while maintaining high detection accuracy (AP@0.5 > 0.85). These findings demonstrate that GPU-accelerated hardware is a prerequisite for real-time wildlife tracking at the edge when relying on universal deployment stacks.
Overall, this study provides clear, actionable guidance for the design of effective autonomous deer detection systems that can be deployed on edge devices. Since fast and accurate deer detection is fundamental to advanced monitoring, tracking, and deterrence applications, this study plays a foundational role in the development of such systems. Future work will focus on hardware-specific optimization techniques and lightweight frameworks, such as TensorFlow Lite, to further improve performance on constrained devices. Additionally, we plan to expand the dataset by collecting and annotating images that capture a wider range of challenging conditions, including adverse weather (e.g., snow, heavy rain, fog), diverse agricultural landscapes (e.g., cornfields, soybean fields), and a greater variety of deer species and behaviors.