1. Introduction
Camera traps and other camera-based sensing systems are now widely used for environmental monitoring, biodiversity assessment, and wildlife management [1,2]. Tools such as MegaDetector [3] and Wildlife Insights [4] are designed to help scientists analyze trail camera images for research and can perform complex classification tasks that differentiate between multiple species. These analyses are typically run weekly or monthly in batch mode and do not require rapid inference [5,6]. While such latency is acceptable in many ecological studies, there is a growing set of operational contexts where the decision window can be minutes rather than days.
Human activities are reshaping natural ecosystems, resulting in habitat loss and imposing significant pressures on large carnivores [7]. In addition, large-scale ecological disturbances such as droughts and wildfires driven by climate change are transforming landscapes and altering wildlife distributions, further exacerbating human–wildlife interactions [8,9]. When large mammals enter human-used spaces (e.g., polar bears in towns), when wildlife approaches transportation corridors or other infrastructure (e.g., elephants near railway lines), or when rare or high-stakes events occur in insufficiently protected areas (e.g., pumas attacking livestock), near-real-time detection, on the order of a few seconds, is essential for enacting intervening measures [10].
Field deployments requiring real-time or near-real-time response face several challenges, the most important being the need for a fast yet accurate classification algorithm. Additionally, deployments often operate in remote locations with intermittent connectivity, constrained budgets, minimal power, and limited compute [11]. This requires edge computing approaches that run inference locally rather than relying on cloud processing. Furthermore, the data itself presents difficulties: large fractions of motion-triggered captures may be empty, and animal-containing images can be low resolution and hard to interpret automatically, especially in low-light situations [12,13]. Nighttime trail camera imagery is often monochrome/infrared (IR) and frequently affected by motion blur, illumination artifacts, and partial-body captures; in contrast, daytime color images are typically sharper and contain richer information. Practical systems must therefore work across both night and day conditions while suppressing empty triggers to avoid excessive false alarms. Labeled data can also be limited in deployment-specific settings, motivating approaches that reduce relabeling burden while maintaining reliability [14].
There is no single state-of-the-art vision-sensing AI model that performs efficiently and reliably under all of the above-mentioned constraints. While frontier multimodal models are highly capable, they are typically accessed as externally hosted, proprietary services, which introduces ongoing cost, connectivity requirements, and variable end-to-end latency, complicating guarantees of seconds-scale response in remote deployments [15,16,17,18]. Even when running locally, a model often has to do two different jobs at once: (i) ignore the many empty and nuisance triggers from wind, rain, etc., and (ii) make a high-confidence, target-species decision when an animal is present. Edge-friendly tools such as YOLO [19,20] are excellent at quickly finding animal-like image regions but are unreliable at confirming a specific target species. This problem is exacerbated when the animal is distant, partially visible, or captured in infrared at night. These gaps are especially important in near-real-time workflows, where repeated false alarms quickly undermine trust and usability.
Beyond the YOLO family, other lightweight single-shot detector families such as SSD [21] and EfficientDet [22] are widely used as embedded-friendly options for object detection under constrained compute budgets. In our low-contrast night-IR setting, we found that such single-shot detectors can be brittle when the target animal occupies only a small fraction of the image, as contrast loss and background clutter make localization unstable. This motivates our two-stage design, which first detects candidate wildlife events and then applies a more selective classification step. In principle, neural-operator approaches (e.g., FNO-style models [23,24]) could further reduce the classifier’s sensitivity to image resolution while remaining computationally efficient. However, in our experiments EfficientNet already provided satisfactory accuracy for binary classification even when the cropped region was small and pixelated. Separately, a common strategy in wildlife monitoring is to exploit temporal context (bursts or video) using CNN → RNN pipelines (e.g., LSTM/GRU variants) to stabilize predictions across images [25,26]. While temporal aggregation can improve robustness when reliable sequences are available, it is more time-consuming than inference on single images. Our deployment requires rapid on-site decisions within a short interaction window (often only a few seconds, e.g., ∼4 s) under severe night-IR degradations. Buffering multiple images for temporal modeling increases end-to-end latency and system complexity on embedded hardware, both of which we wanted to avoid. We therefore focus on an event-level, two-stage still-image cascade that is robust without assuming temporal continuity.
To address these gaps, we present a detect–classify cascade, a two-stage AI-enabled vision sensor that forms the core of our near-real-time environmental monitoring system. The system itself consists of an event-driven camera paired with an edge-computing unit that performs on-device inference and triggers user-defined responses. Related smart camera-trap systems demonstrate on-device inference and integrated field prototypes [27,28]; here, we focus on seconds-scale, species-specific identification for high-stakes events. We demonstrate our approach on a sample dataset of camera trap images, distinguishing pumas from other wildlife. We report end-to-end trigger-to-action latency and false-trigger behavior across both daytime color and nighttime infrared imagery. Finally, we include Grad-CAM visualizations [29] to explain classifier decisions, which builds trust in the system and guides model refinement for terrain- and species-specific applications.
Our primary innovation is a two-stage pipeline: a permissive first-stage detector paired with a second-stage classifier trained via a curriculum-learning approach. This design fills a niche required to meet our constraints of challenging nighttime infrared conditions and seconds-scale inference on edge hardware. The result is reliable, on-device identification while the animal is still on-site, enabling real-time notifications and optional audio/light deterrents. We also incorporate classifier explainability to improve robustness. By inspecting which features drive decisions and where classification fails, we can systematically identify failure modes and feed those examples back into our data collection and training pipeline, improving performance over time.
Our contributions can be summarized as follows:
A novel two-stage edge-deployable pipeline that uses a Stage 1 detector and a staged transfer-learning curriculum for Stage 2 species confirmation;
An interpretability workflow using Grad-CAM visualizations to surface failure modes and edge cases (e.g., false triggers, partial-body/night-IR errors) and guide iterative model refinement for field deployments;
A deployable, offline vision sensor that integrates motion-triggered acquisition, on-device inference, and user-defined actions on low-cost edge hardware designed for challenging field imagery;
An openly released, re-trainable implementation (code, weights, labeled datasets, and a hardware bill of materials) designed for field use.
The system is designed with flexibility in mind so that it can be extended to other species. While our quantitative evaluation emphasizes pumas as a high-stakes management case study, we also include an illustrative ringtail deployment obtained by retraining only Stage 2 with a few hundred labeled images, demonstrating that the same pipeline can be adapted to new target species with modest additional data.
The remainder of the paper is organized as follows:
Section 2 describes the system architecture and two-stage methodology;
Section 3 details the case study and evaluation protocol;
Section 4 presents quantitative results and ablations;
Section 5 discusses implications, limitations, and failure modes; and
Section 6 summarizes the key conclusions.
3. Case Study
3.1. Study Context and Dataset
The species chosen for this demonstration was the puma. Pumas are wide-ranging large carnivores with important ecological roles [37,38] and are also implicated in management challenges near the wildland–urban interface and around domestic animals. Management actions such as lethal removal can have complex outcomes, motivating scalable nonlethal approaches and improved monitoring tools [39,40]. Selective sensing can enable targeted downstream actions, including activation of well-established deterrent modalities, once reliable low-false-alarm triggering is available [35,36]. The scope of this study is limited to the sensing-and-classification capability only.
Study region: The original training and validation dataset was assembled by the Large Mammal Monitoring Project [
2,
41]. The project investigates the effect of climate change, drought, and wildfire on large mammal numbers and behavior by collecting trail camera imagery in the Pajarito Plateau area of New Mexico. Because most labeled training and validation images come from this single region, there is potential geographic concentration bias (e.g., habitat- or site-specific background cues) that could reduce performance when deployed in substantially different environments. Dozens of trail cameras have been deployed, capturing thousands of images of pumas and other wildlife. The region includes rugged canyons and mesas and supports diverse wildlife, including deer, elk, bears, bobcats, coyotes, and pumas. The resulting imagery contains animals approaching from multiple angles and distances under varying lighting conditions. Additionally, data were collected using multiple trail camera brands with different imaging sensors and brightness profiles, improving robustness to cross-camera domain shifts. We further augmented this dataset with images collected from our own field deployments in New Mexico and California to better handle edge cases.
Imaging conditions: Our dataset spans both nighttime infrared imagery and daytime color imagery. Nighttime IR images are often lower quality and more prone to motion blur and illumination artifacts, whereas daytime color images typically contain richer information and sharper images. Including both modalities supports around-the-clock monitoring and improves practical deployability. Training, validation and test datasets each contain images across the entire spectrum of operating conditions.
Dataset and Labels: For training and validation, we draw from a dataset that consists of 1103 puma images and 1693 no-puma images, divided into training and validation subsets. For test data we use an independent set of 479 puma and 955 no-puma images. Importantly, this test set was collected from deployments in California and is geographically disjoint from the New Mexico training/validation data; it includes different background vegetation/terrain and camera hardware that does not fully overlap with the training set, providing a direct cross-site (and partially cross-camera) domain-shift evaluation of generalization for pumas. Because camera-trap deployments are dominated by non-target triggers, the resulting dataset is moderately imbalanced toward the no-puma class. We therefore use a class-stratified split for training/validation and verify that both daytime and nighttime conditions are represented across train/validation/test. This imbalance motivates reporting balanced accuracy in addition to accuracy and error rates. We manually verified that the range of imaging conditions is captured in the sets. We formulated a binary classification task: puma vs. no-puma. The no-puma class includes other wildlife species (e.g., foxes, coyotes, bobcats, bears, deer, elk, skunks) as well as empty or nuisance-trigger images (e.g., wind-driven vegetation, precipitation, insects, and other non-animal motion). Labels were assigned at the image level based on visual review: an image was labeled puma if any part of a puma was visibly present (including partial-body views), and labeled no-puma otherwise. This binary labeling scheme is also straightforward to adapt to other target species, because it only requires image-level labels for a single target class versus a pooled no-target class rather than exhaustive multi-species annotation.
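As a rough illustration of the class-stratified split mentioned above, the following sketch shuffles each class separately and holds out the same fraction of each for validation. The function name `stratified_split`, the seed, and in particular the validation fraction of 0.2 are assumptions made for the example, not values taken from this study.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction, seed=0):
    """Split sample indices into train/validation, class by class.

    Each class contributes (approximately) the same fraction of its
    samples to the validation set, so the class balance is preserved.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Class counts from the text; val_fraction=0.2 is an assumed placeholder.
labels = ["puma"] * 1103 + ["no-puma"] * 1693
train_idx, val_idx = stratified_split(labels, val_fraction=0.2)
print(len(train_idx), len(val_idx))
```

Stratifying per class keeps the moderate puma/no-puma imbalance the same in both subsets, which makes validation metrics easier to compare with training behavior.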
3.2. Prototype Evaluation
We evaluate the system for near-real-time monitoring in terms of its speed and reliability. We first conduct an offline ablation study of the two-stage sensing-and-classification methodology developed here. Each variant of the cascade is tested in isolation using the same training and test data, so that their performance can be compared directly. We also validate the entire pipeline in field deployment mode to confirm not only that the two-stage classification works accurately within seconds but also that the end-to-end pipeline functions as intended.
Baselines and ablations: We use an independent test dataset previously unseen by the classifier, with 479 puma images and 955 no-puma images to perform our ablation study. This held-out set is geographically disjoint from training/validation (California vs. New Mexico) and therefore serves as our primary quantitative check on cross-site generalization for puma detection. We perform the following ablations:
Stage 1 only (YOLO label proxy)—detector-only baseline that predicts puma if the pretrained YOLO detector reports a cat detection; otherwise no-puma.
Stage 2 only (full image)—single-stage baseline that applies the curriculum-learned binary EfficientNet classifier directly to the uncropped original image.
Two-stage (animal-only filter)—ablation in which Stage 1 detections are filtered to a restricted “animal” label set before cropping; Stage 2 classifies the resulting ROIs.
Two-stage (proposed; permissive Stage 1)—the proposed detector → classifier cascade, where Stage 1 is used permissively for localization/cropping and empty-trigger suppression, and Stage 2 performs puma confirmation on the ROIs and gates downstream actions.
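The decision logic of the four variants above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the `Detection` type, the `detect` and `classify` callables, the 0.5 decision threshold, and the contents of `ANIMAL_LABELS` are all hypothetical stand-ins for the actual Stage 1 detector and Stage 2 classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) region of interest


@dataclass
class Detection:
    label: str  # detector's semantic label, e.g. "cat" or "dog"
    box: Box


# Hypothetical interfaces: Stage 1 returns detections for an image;
# Stage 2 returns P(puma) for the full image (box=None) or for a crop.
Detector = Callable[[object], List[Detection]]
Classifier = Callable[[object, Optional[Box]], float]

# Illustrative label set for the animal-only filter (assumption).
ANIMAL_LABELS = frozenset({"cat", "dog", "bear", "horse", "sheep", "cow"})


def stage1_only(image, detect: Detector) -> bool:
    """Detector-only proxy: puma if any detection is labeled 'cat'."""
    return any(d.label == "cat" for d in detect(image))


def stage2_only(image, classify: Classifier, threshold: float = 0.5) -> bool:
    """Classifier-only baseline on the uncropped image."""
    return classify(image, None) >= threshold


def two_stage(image, detect: Detector, classify: Classifier,
              threshold: float = 0.5, allowed_labels=None) -> bool:
    """Detector -> classifier cascade.

    allowed_labels=None is the permissive variant: every detection is
    cropped and passed to Stage 2 regardless of its label. Passing a
    label set reproduces the animal-only filter ablation.
    """
    detections = detect(image)
    if allowed_labels is not None:
        detections = [d for d in detections if d.label in allowed_labels]
    # No detections means an empty trigger: nothing reaches Stage 2.
    return any(classify(image, d.box) >= threshold for d in detections)


# Stub models: the detector localizes the animal but mislabels it.
def demo_detect(image):
    return [Detection(label="bench", box=(10, 20, 110, 140))]


def demo_classify(image, box):
    return 0.9 if box is not None else 0.2


print(two_stage("frame", demo_detect, demo_classify))
print(two_stage("frame", demo_detect, demo_classify,
                allowed_labels=ANIMAL_LABELS))
```

With these stubs the permissive cascade still confirms the target, while the animal-only filter discards the mislabeled detection before Stage 2 ever sees it, which is the failure mode the ablation is designed to expose.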
3.3. Metrics
The following questions guide our evaluation of the system:
How accurately does the pipeline distinguish target vs. non-target across nighttime infrared and daytime color conditions?
How effectively does the two-stage cascade suppress false triggers (empty images and non-target wildlife) compared to simpler baselines?
What is the end-to-end latency from trigger to action command, and how does it decompose by stage?
We record True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) classifications. Because raw FP/FN counts depend on dataset size and class balance, we report them for operational context and pair them with normalized rates and standard metrics. Offline performance is quantified using precision (P), recall (R), F1 score (F1), accuracy (Acc), balanced accuracy (BalAcc), false positive rate (FPR), and false negative rate (FNR), where

$$
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2PR}{P + R},
$$

$$
\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{BalAcc} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right),
$$

$$
\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{FNR} = \frac{FN}{FN + TP}.
$$

We additionally compute P, R, and F1 for the positive class (puma) in the confusion matrices reported in Equations (4)–(7). Operational performance is characterized by (i) end-to-end latency and its stage breakdown (Section 4.4) and (ii) qualitative field demonstrations of the integrated workflow (trigger → transfer → inference → action; Figure 3).
End-to-end latency is measured as the elapsed time from camera trigger (image timestamp at upload) to issuance of an action command by the edge device (e.g., notification event and/or initiation of audio playback). Latency values were computed over 100 triggered events collected during field operation across both networking configurations.
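The offline metrics listed above follow from the four confusion-matrix counts using their standard definitions. The sketch below is illustrative only; the function name `classification_metrics` is an assumption, and the example counts are those reported for the proposed two-stage workflow in Section 4.1.4.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary metrics with puma as the positive class.

    Assumes every denominator is non-zero, i.e., both classes are
    present and at least one positive prediction was made.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity
    specificity = tn / (tn + fp)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "balanced_accuracy": (recall + specificity) / 2,
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }


# Counts for the proposed two-stage workflow (Section 4.1.4).
proposed = classification_metrics(tp=467, tn=947, fp=8, fn=12)
for name, value in proposed.items():
    print(f"{name}: {value:.3f}")
# precision 0.983, recall 0.975, f1 0.979,
# accuracy 0.986, balanced_accuracy 0.983
```

Balanced accuracy averages sensitivity and specificity, so it is not inflated by the larger no-puma class in the way plain accuracy can be.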
4. Results
4.1. Offline Performance Evaluation (Ablation and Baselines)
We begin with an offline ablation study to isolate the role of each stage in the two-stage pipeline. The goal is straightforward: to show what breaks if we skip parts of the system. These comparisons motivate the full two-stage cascade and explain why seemingly sensible shortcuts can lead to more missed events or more false alarms, both of which lead to different undesirable consequences and lowered trust.
4.1.1. Stage 1 Only (YOLO Label Proxy)
We evaluate a detector-only baseline using YOLO, a common choice for fast edge deployment. Because puma is not a dedicated detector class in our YOLO model, we use a proxy rule that flags puma whenever YOLO reports cat (Equation (4)). Under this setting, the model correctly identifies 160/479 puma events and correctly rejects 948/955 no-puma events. This corresponds to a precision of 0.958, recall of 0.334, and F1 of 0.495 (puma treated as the positive class). Furthermore, even when the detector localizes the animal, its predicted class label is unreliable under field conditions: motion-triggered wildlife imagery (often infrared, blurred, and partially occluded) is frequently assigned to visually adjacent or even unrelated categories. The confusion matrix for the Stage 1 only (YOLO label proxy) baseline is:

$$
\begin{array}{l|cc}
 & \text{Predicted puma} & \text{Predicted no-puma} \\ \hline
\text{Actual puma} & 160 & 319 \\
\text{Actual no-puma} & 7 & 948
\end{array}
\tag{4}
$$
4.1.2. Stage 2 Only (Full-Image Curriculum-Learned Classifier)
We also evaluate a classifier-only baseline (EfficientNet on the full image) and show that, without detector-based localization and empty-image filtering, it produces more false alarms in cluttered, motion-triggered imagery. Equation (5) shows that applying the classifier directly to full images yields relatively high sensitivity to pumas (FN = 33), but at the cost of many false positives (FP = 171): the model correctly identifies 446/479 puma events (recall = 0.931), but incorrectly flags 171/955 no-puma events as puma (precision = 0.723). The confusion matrix for the Stage 2 only (full image) baseline is:

$$
\begin{array}{l|cc}
 & \text{Predicted puma} & \text{Predicted no-puma} \\ \hline
\text{Actual puma} & 446 & 33 \\
\text{Actual no-puma} & 171 & 784
\end{array}
\tag{5}
$$
4.1.3. Two-Stage with an Animal-Only Filter
A natural idea is to reduce clutter by filtering Stage 1 detections to “animal” labels before passing ROIs to Stage 2. Equation (
6) shows that this strategy produces substantially more missed pumas than full-image classification (FN = 89 vs. 33). Only 390/479
puma events are correctly identified (recall = 0.814), while 89
puma events are filtered or rejected as
no-puma before Stage 2 can identify them. The reason is that the detector’s semantic labels are less reliable than its ability to localize an object: many true pumas are localized when all classes are enabled but are assigned non-animal (or otherwise incorrect) labels and therefore get removed by the label filter. Once those images are excluded, Stage 2 never has the opportunity to correct the detector’s label error. The Confusion matrix for the two-stage (animal-only filter) ablation is:
4.1.4. Proposed Two-Stage Workflow
Equation (7) shows the proposed design, which keeps Stage 1 permissive for localization and empty-trigger suppression and then uses Stage 2 for the puma decision on cropped ROIs. This configuration achieves both low false alarms (FP = 8) and low missed events (FN = 12), improving substantially over the individual stages. The confusion matrix in Equation (7) summarizes this balance: 467/479 puma events are correctly detected while only 8/955 no-puma events are incorrectly flagged. The reduction in false alarms is driven by Stage 1 suppressing empty or nuisance triggers and restricting Stage 2 to animal-containing ROIs, which removes much of the background clutter that confuses a full-image classifier. The few remaining misses are dominated by cases where Stage 1 fails to localize the animal (e.g., low contrast with foliage in nighttime infrared imagery), highlighting that localization quality is the primary limiting factor once the two-stage cascade is used. The confusion matrix for the proposed two-stage workflow is:

$$
\begin{array}{l|cc}
 & \text{Predicted puma} & \text{Predicted no-puma} \\ \hline
\text{Actual puma} & 467 & 12 \\
\text{Actual no-puma} & 8 & 947
\end{array}
\tag{7}
$$
4.2. Sensitivity to Stage 1 and Stage 2 Model Choice
We next evaluate whether substituting newer model families in Stage 1 (object detection) and Stage 2 (binary classification) changes offline performance, or whether the primary driver is the two-stage pipeline itself.
4.2.1. Stage 1 Only (YOLO Label Proxy): YOLOv8 vs. YOLO26
We compare our current Stage 1-only baseline (YOLOv8 [30] label proxy; Equation (4)) against the newer YOLO26 [42] under the same proxy rule (flag puma whenever the detector reports cat). Table 2 summarizes results. Performance is similar, consistent with the interpretation that Stage 1 is primarily used to cast a wide net and provide candidate ROIs for Stage 2. Latency varies slightly across images (with the number of predicted bounding boxes), so we report average latency; under this measure, YOLO26 is slightly faster than YOLOv8.
4.2.2. Stage 2 Only (Full-Image Classifier): EfficientNet vs. ConvNeXt-Tiny
We also compare the Stage 2-only baseline reported in Equation (
5) (EfficientNet [
31] applied to the full image) against a newer ConvNet backbone (ConvNeXt-Tiny [
43]) trained with the same curriculum and evaluation protocol.
Table 3 shows similar full-image performance. This supports the interpretation that Stage 2 is most effective when applied to ROIs provided by Stage 1, where it can suppress false positives that arise in cluttered full-image imagery. We also report average inference latency (in seconds) on our Raspberry Pi deployment. Unlike Stage 1 detection, Stage 2 latency is fairly consistent across crops because each ROI is resized to a fixed input resolution. EfficientNet is slightly faster in our setup, consistent with its lower computational footprint, while ConvNeXt-Tiny was not specifically designed for edge deployment.
Across both stages, substituting newer architectures yields similar offline performance under the same evaluation protocol. The limited sensitivity to architecture choice is consistent with our motivation for using the two stage approach: once Stage 1 reliably proposes candidate ROIs, the remaining errors are driven more by challenging imagery (motion blur, low contrast, partial views) and the ROI-vs-full-image distinction than by the specific detector or backbone family. Because both YOLOv8/YOLO26 and EfficientNet/ConvNeXt are strong pretrained models, differences are expected to be small when the training data and protocol are held constant.
In addition, on our hardware, the latency of the newer models is similar to that of the older models. Although YOLO26 is faster than YOLOv8 in our measurements by ≈0.1 s, our current end-to-end bottleneck is not model inference. The dominant contributor to total turnaround is transferring each image from the camera to the Raspberry Pi via FTP, which accounts for most of the ∼4 s cycle time. Consequently, further reducing model latency would have limited impact on end-to-end responsiveness. Given the small differences and the importance of reproducibility in long-running deployments, we retain our original YOLOv8 and EfficientNet choices for the remainder of this study, as they are well-tested and stable within our training and inference pipeline.
4.3. Field Deployment and Operational Workflow Validation
We have deployed the system continuously in the field since May 2025 to validate end-to-end robustness under realistic triggering conditions (wind, precipitation, insects, and non-target wildlife) and to confirm that the sensing-to-action loop runs without external internet access. Deployments were conducted at two sites and covered both supported networking configurations: one using an existing WiFi network and one in standalone mode with a hotspot. In both modes, cameras upload motion-triggered images via FTP to the Raspberry Pi, and the downstream inference and action logic is identical. In the field, the trigger stream is dominated by empty or weather-disturbed images, with non-target animals (e.g., skunks, ringtails, raccoons, and foxes) far more common than puma visits. On nights with calm weather, the system ingests on the order of ∼100 motion-triggered events, whereas windy, rainy, or snowy conditions can generate ∼1000 events due to nuisance motion. These conditions provide a realistic stress test of false-trigger suppression and alert gating under heavy background activity. Across ongoing deployments comprising tens of thousands of motion-triggered events, we quantified the field false-positive rate at approximately 0.8%, from 98 false puma detections out of 12,436 triggers, where a false positive is defined as a non-puma event that nonetheless produces a puma alert/action. Some false puma detections were caused by animals such as foxes posed in ways that resemble a puma; others resulted from ROIs of inanimate objects that resembled a puma.
Because puma events are rare at our sites, these deployments are primarily informative for operational robustness and false-trigger behavior. Field operation revealed systematic edge cases (e.g., animals that resemble pumas at certain angles) that are underrepresented in curated offline splits. We used these field observations to refine labeling guidelines for ambiguous triggers and to prioritize targeted data collection (hard negatives and rare positive contexts), strengthening the training/validation coverage and the representativeness of held-out evaluation data used for the offline results reported in Section 4.1. Across continuous operation, we observed a small number of site-network infrastructure interruptions (two power outages and four temporary WiFi outages), all of which resolved without manual intervention as the Raspberry Pi rebooted and services resumed automatically, providing practical evidence of robustness to common field failures.
4.3.1. End-to-End Workflow Demonstrations
We validated the integrated workflow (trigger → transfer → inference → action) using qualitative field demonstrations in which an audio output was triggered following target detection.
Figure 3 shows an example puma encounter from the site-network deployment (19 May 2025) where a caterwauling call was played to get a response from the puma without making it a nuisance or scaring it away. This encounter demonstrates Stage 1 localization (bounding boxes), Stage 2 classification, and triggering of an audio output. We present
Figure 3 as an operational demonstration of closed-loop performance. All field demonstrations were conducted on private property using standard non-invasive monitoring practices; no animals were handled, baited, or physically contacted.
4.3.2. Illustrative Multi-Species Operation (Ringtail Case Study)
To demonstrate that the same sensing-to-action pipeline can be adapted beyond pumas, we retrained the Stage 2 classifier to identify ringtails using 653 ringtail images. We deployed the system for monitoring ringtails at a residential water feature. Over approximately one month, the system recorded more than 30 ringtail visits and produced 311 usable images. The ringtail-trained Stage 2 binary classifier correctly identified 258 images (approximately 83% image-level accuracy) in this deployment. As expected, accuracy was lower than for pumas since ringtails are fast, small, and the Stage 1 object detector often fails to detect them. When ringtail detections occurred, an audio output was optionally triggered as a demonstration of actuation capability and rapid verification. In some instances the animal left the scene shortly after audio playback; these observations are anecdotal and are included to illustrate closed-loop operation in a different species context. The corresponding video is provided at
https://vimeo.com/1120196742 (accessed on 17 February 2026).
4.4. End-to-End Latency
We measured end-to-end latency as the elapsed time from camera trigger (image timestamp at upload) to issuance of an action command by the edge device (e.g., notification event and/or initiation of audio playback). This metric captures the practical responsiveness of the system for near-real-time monitoring workflows.
In the current implementation, end-to-end latency is approximately 4 s, averaged over 100 triggered events collected during field operation across both networking configurations. Note that motion triggering is performed by the camera’s built-in PIR sensor. The total is dominated by image transfer plus Stage 1 detection/cropping (approximately 3 s), followed by Stage 2 classification (approximately 1 s); other overheads are negligible. In an earlier field demonstration recorded on 19 May 2025 (Figure 3), the end-to-end latency was approximately 8 s; subsequent software optimization (model caching and runtime initialization changes) reduced latency to the current value.
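One way to obtain a per-stage latency breakdown like the one above is to wrap each on-device stage in a timer and average over events. The sketch below is a hypothetical illustration, not the instrumentation used in this study: the `StageTimer` class and the stage names are assumptions, and it only times work done on the edge device, whereas the reported end-to-end figure starts from the camera's image timestamp.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean


class StageTimer:
    """Collects wall-clock durations per named pipeline stage."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the demo is deterministic
        self._samples = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            self._samples[name].append(self._clock() - start)

    def summary(self):
        """Mean duration per stage, plus the sum of those means.

        The 'total' equals the mean per-event latency only when every
        stage is recorded exactly once per event.
        """
        per_stage = {name: mean(vals) for name, vals in self._samples.items()}
        per_stage["total"] = sum(per_stage.values())
        return per_stage


# Demo with a fake clock that replays fixed timestamps (seconds).
fake_times = iter([0.0, 3.0, 3.0, 4.0])
timer = StageTimer(clock=lambda: next(fake_times))

with timer.stage("transfer_and_stage1"):
    pass  # image transfer + detection/cropping would run here
with timer.stage("stage2"):
    pass  # ROI classification would run here

print(timer.summary())
# {'transfer_and_stage1': 3.0, 'stage2': 1.0, 'total': 4.0}
```

In a real deployment the default `time.perf_counter` clock would be used, and the summary would be taken over many triggered events rather than a single synthetic one.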
5. Discussion
5.1. Interpreting Results of the Ablation Study
Key results of the ablation study described in Section 3.2 are summarized in Table 4, which reports both operational FP/FN counts and standard classification metrics (precision, recall, F1, accuracy, balanced accuracy) for each ablation variant.
The Stage 1-only baseline reveals that off-the-shelf detectors are unreliable for species-specific decisions: using the cat proxy produces few false alarms (FP = 7) but misses many true pumas (FN = 319), yielding high precision (0.958) but very low recall (0.334) and a low F1 (0.495), while accuracy (0.773) is dominated by correct no-puma decisions (balanced accuracy = 0.663). This misclassification is not solely a proxy-label artifact; qualitative review shows true pumas frequently assigned to adjacent or unrelated categories (e.g., dog, sheep, horse), and similar label instability occurs even for detector-supported species (e.g., bears) despite correct localization. We attribute this behavior to the detector’s coarse 1000-class taxonomy and training bias toward clear daytime imagery, motivating the use of Stage 1 as a trigger/localizer with species identification deferred to a downstream classifier. From an operational standpoint, Stage 1 is also inexpensive to run on-device (average latency ≈ 0.85 s per image on our Raspberry Pi), but its role must be that of an ROI detector rather than a species-level classifier.
The Stage 2-only baseline (full-image classifier) exhibits the opposite failure mode: it substantially increases false alarms (FP = 171) because motion-triggered field imagery often contains complex backgrounds (vegetation, shadows, precipitation, insects) and, at night, monochrome infrared artifacts and motion blur. This yields high recall (0.931) but lower precision (0.723), so F1 = 0.814; BalAcc = 0.876 remains strong because sensitivity is high, but the reduced specificity implied by 171 FP can be operationally problematic for high-volume deployments. In these conditions, the classifier can confuse background texture or illumination patterns for puma-like features. In addition, without localization the target may occupy only a small fraction of the image or appear as a partial-body capture, which contributes to missed positive events (FN = 33). Together, these effects make full-image classification less reliable for unattended, high-volume triggering in the field. Stage 2-only inference is fast on-device (average latency ≈ 0.55 s per image on our Raspberry Pi), but the absence of localization makes this speed insufficient to offset the operational cost of the elevated false-positive rate.
The “animal-only filter” variant demonstrates why using detector semantic labels as a pre-filter is risky. Although filtering to “animal” labels reduces nuisance ROIs, it also discards many true pumas before Stage 2 is ever applied, increasing missed events.
Finally, the proposed permissive two-stage workflow yields the best overall balance (FP = 8, FN = 12). It is the only variant that simultaneously achieves high precision (0.983) and high recall (0.975), producing the best F1 (0.979) along with accuracy = 0.986 and BalAcc = 0.983. This division of labor reduces false alarms by removing empty or background-dominated images from the classifier’s input distribution, and it reduces missed pumas by concentrating the classifier on localized animal evidence rather than weak, spatially diluted cues in the full image. In operational terms, FP and FN translate directly into workload and risk: the proposed pipeline keeps the “false alert” burden low (8 FP out of 955 no-puma, i.e., high specificity) while preserving sensitivity (only 12 FN out of 479 puma), which is why both accuracy and balanced accuracy are high. The remaining misses are dominated by cases where Stage 1 fails to localize the animal (e.g., low contrast with foliage in nighttime infrared imagery), indicating that localization quality is the primary limiting factor once ROI-based confirmation is employed. In terms of runtime, both two-stage variants (animal-only filter and the proposed permissive Stage 1) have similar end-to-end latency on our Raspberry Pi (average ≈1.5 s per image), which is approximately the Stage 1 cost plus the Stage 2 cost, plus a small handoff/overhead between stages; this is consistent with the Stage 1-only and Stage 2-only latencies reported in Table 2 and Table 3.
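The reported metrics can be reproduced from the FP and FN counts alone. The following minimal sketch assumes that all three configurations were scored on the same 479 puma and 955 no-puma images; the `Counts` class and `metrics` function are illustrative names, not part of the deployed software.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    fp: int            # false alarms (no-puma images flagged as puma)
    fn: int            # missed pumas
    n_pos: int = 479   # puma images in the labeled set
    n_neg: int = 955   # no-puma images in the labeled set


def metrics(c: Counts) -> dict:
    """Derive event-level metrics from error counts and class totals."""
    tp = c.n_pos - c.fn
    tn = c.n_neg - c.fp
    precision = tp / (tp + c.fp)
    recall = tp / c.n_pos            # sensitivity
    specificity = tn / c.n_neg
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (c.n_pos + c.n_neg),
        "balanced_accuracy": (recall + specificity) / 2,
    }


configs = {
    "stage1_only": Counts(fp=7, fn=319),
    "stage2_only": Counts(fp=171, fn=33),
    "two_stage": Counts(fp=8, fn=12),
}
for name, counts in configs.items():
    print(name, {k: round(v, 3) for k, v in metrics(counts).items()})
```

Because balanced accuracy averages sensitivity and specificity, it exposes the Stage 1-only weakness (0.663) that plain accuracy (0.773) partly hides under class imbalance.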
5.2. Visual Analysis of Illustrative Examples
To complement the quantitative ablation results in Table 4, Figure 4 provides representative examples illustrating the strengths and failure modes of the three configurations (Stage 1 only, Stage 2 only, and the proposed two-stage cascade).
In Figure 4 (row 1), the target is partially obscured against a nighttime monochrome background; Stage 1 localizes the animal but assigns an incorrect label (e.g., “elephant”), consistent with detector label instability under IR illumination, blur, and occlusion. Similar confusion occurs even for native YOLO species (e.g., bears), where localization is often correct but the assigned class drifts to other animals or even non-animal categories, suggesting label confusion is the dominant failure mode rather than the lack of a puma class. Accordingly, Stage 1 is best used as a trigger/localizer, and Stage 2 benefits from cropping: full-image inference can predict no-puma when the animal is small and background-dominated, while the Stage 1 crop enables a confident puma prediction by reducing background confounds.
In the second row, a daytime color image captures a puma in open view with a relatively simple background. Stage 2 (full-image classifier) correctly identifies the target in this easier setting. Stage 1 again localizes the animal but assigns a non-target detector label (e.g., “dog”), reinforcing that Stage 1 should be treated primarily as a localization and candidate-generation stage rather than a species decision rule. The two-stage pipeline correctly classifies the ROI with high confidence, matching the full-image classifier on this straightforward example while retaining the robustness benefits observed in Table 4.
The third row illustrates a challenging non-target false positive: a fox captured at an angle and posture that produces puma-like contours (notably head/shoulder silhouette and tail/torso proportions). All three approaches are misled in this example: Stage 1 assigns a proxy label (e.g., “cat”), Stage 2 applied to the full image predicts puma, and the two-stage pipeline assigns a high puma probability despite operating on the localized ROI. This case demonstrates that false positives are not limited to empty images or nuisance motion; visually similar non-target species and rare poses can also drive errors. The operational impact of this misidentification is likely small: the goal is typically to eliminate a high percentage of false positives, and false positives of this type appear to be rare. In Section 5.3, Grad-CAM visualizations help interpret why the classifier attends to plausible morphological cues in such examples, even when the final decision is incorrect.
5.3. Explainability and Error Analysis
Figure 5 summarizes the dominant failure modes observed in offline evaluation and field operation. Most false negatives arise in Stage 1, where the YOLO detector fails to produce a usable detection for cropping, typically under poor illumination (nighttime IR with low contrast or blur) or when the puma’s appearance blends into the forest floor, logs, or tree textures. In these cases, Stage 2 is never invoked because no ROI is generated. In contrast, the relatively rare false positives are primarily Stage 2 errors: the EfficientNet classifier can be fooled by non-target animals such as foxes and bobcats when viewed at certain angles that resemble puma silhouettes. These false positives are uncommon in practice and have limited operational impact compared to missed detections. Importantly, increasing Stage 1 detector sensitivity improves recall and mitigates many Stage 1 false negatives with little penalty, because additional ROIs passed to Stage 2 are usually rejected by the puma-confirmation classifier. Furthermore, targeted augmentation of the Stage 2 training set with these edge-case poses is a straightforward way to reduce the remaining false positives. This behavior is consistent with field deployments, where some nights produce over 1000 triggered images yet Stage 2 false positives remain infrequent (see Section 4.3).
We analyzed both model-level and system-level behavior to understand where errors occur in practical deployments. Explainability of the classifier’s decisions serves two purposes: it builds trust in the system and identifies actionable strategies to improve performance. To better understand which image regions drive the Stage 2 binary classifier decisions (puma vs. no-puma), we generated Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations [29]. Grad-CAM uses gradients of the target class score with respect to the final convolutional feature maps to produce a coarse heatmap that highlights regions most influential to the prediction. Figure 6 shows representative heatmaps for multiple species and provides a useful sanity check on what the classifier is using after cropping.
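For reference, the standard Grad-CAM computation can be written as below. The notation follows the original Grad-CAM formulation rather than anything defined in this article: $A^k$ is the $k$-th feature map of the final convolutional layer, $y^c$ is the pre-activation score for the target class $c$ (here, puma), and $Z$ is the number of spatial positions in a feature map.

```latex
% Channel weights: global-average-pooled gradients of the class score
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A^{k}_{ij}}

% Heatmap: ReLU of the weighted sum of feature maps
L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^{c} \, A^{k} \right)
```

The ReLU keeps only features with a positive influence on the class score, which is why the resulting heatmaps highlight evidence for the prediction rather than against it. The map has the spatial resolution of the final convolutional layer and is upsampled to the crop size for display, hence its coarse appearance.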
Across many correctly classified puma images (two shown for illustration), the classifier consistently places high importance on anatomically informative cues such as the long tail, head/ear region, torso or shoulder contour, and paws. The heatmaps also help interpret common failure cases. False positives can be explained by overlapping silhouettes and poses (e.g., foxes or coyotes in side-profile with long tails and puma-like silhouettes, and occasional bobcat examples when tail cues are absent or the viewing angle emphasizes head/torso features). These qualitative patterns align with the quantitative ablation results.
5.4. Operational Robustness
The device user interface exposes the detector confidence threshold (Stage 1 sensitivity), allowing users to trade off missed detections versus additional candidate ROIs. Increasing sensitivity can reduce missed pumas by accepting weaker detections in low-contrast nighttime imagery; in our experience, this does not substantially increase false alarms because Stage 2 provides strong binary puma confirmation on the cropped regions. This supports a practical deployment strategy: keep Stage 1 permissive to preserve recall, and rely on Stage 2 to maintain specificity.
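The permissive strategy can be sketched as a small decision function. This is an illustrative sketch rather than the deployed code: `Detection`, `cascade_decision`, both threshold defaults, and the `classify_crop` callable (standing in for the Stage 2 classifier) are hypothetical. The optional `allowed_labels` argument mimics the animal-only filter variant for comparison.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Set, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    box: Box
    label: str         # detector class name; ignored in permissive mode
    confidence: float  # detector score in [0, 1]


def cascade_decision(
    detections: Sequence[Detection],
    classify_crop: Callable[[Box], float],
    det_threshold: float = 0.1,               # hypothetical Stage 1 sensitivity
    puma_threshold: float = 0.5,              # hypothetical Stage 2 cut-off
    allowed_labels: Optional[Set[str]] = None,
) -> Tuple[bool, float]:
    """Return (alert, best puma probability) for one image."""
    # Stage 1: keep every box above the confidence threshold. In permissive
    # mode (allowed_labels is None) the semantic label is not consulted.
    rois = [
        d.box
        for d in detections
        if d.confidence >= det_threshold
        and (allowed_labels is None or d.label in allowed_labels)
    ]
    if not rois:
        return False, 0.0  # no ROI, so Stage 2 is never invoked
    # Stage 2: score each crop and alert on the strongest puma evidence.
    best = max(classify_crop(box) for box in rois)
    return best >= puma_threshold, best
```

With a weak, mislabeled detection such as `Detection((0, 0, 10, 10), "bench", 0.2)`, the permissive call still forwards the crop to Stage 2, whereas passing `allowed_labels={"cat", "dog"}` drops it before confirmation. This is the mechanism by which label-based pre-filtering can discard true pumas.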
The deployment software was designed for unattended operation. Corrupted or partially uploaded image files are detected and skipped (logged and ignored) rather than causing pipeline failure. In field operation, the system resumed automatically after brief power or connectivity interruptions once service was restored. We also tested multi-camera operation with up to three cameras connected concurrently; images are processed sequentially via a queue, which keeps peak resource use bounded and is expected to introduce limited backlog in typical wildlife monitoring scenarios where target events are sparse in time.
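A minimal sketch of this queue-and-skip behavior is shown below. The text does not specify how corrupted files are detected, so the marker check here is an assumption: it treats a file as complete only if it begins with the JPEG start-of-image marker and ends with the end-of-image marker. The names `looks_complete_jpeg`, `drain`, and `process` are hypothetical.

```python
import logging
import queue
from pathlib import Path
from typing import Callable, List

log = logging.getLogger("pipeline")


def looks_complete_jpeg(data: bytes) -> bool:
    """Heuristic: a complete JPEG starts with SOI (FF D8) and ends with EOI (FF D9)."""
    return len(data) >= 4 and data.startswith(b"\xff\xd8") and data.endswith(b"\xff\xd9")


def drain(pending: "queue.Queue[Path]", process: Callable[[Path], None]) -> List[Path]:
    """Process queued images one at a time; log and skip unreadable or partial files."""
    skipped: List[Path] = []
    while True:
        try:
            path = pending.get_nowait()
        except queue.Empty:
            break  # queue is empty; the caller can poll again later
        try:
            data = path.read_bytes()
            if not looks_complete_jpeg(data):
                raise ValueError("incomplete or corrupted JPEG")
            process(path)
        except (OSError, ValueError) as exc:
            # A bad file must not stop the pipeline: record it and move on.
            log.warning("skipping %s: %s", path, exc)
            skipped.append(path)
    return skipped
```

Sequential draining means only one image is in memory at a time regardless of how many cameras feed the queue. The trade-off is that a burst of triggers is worked through in arrival order rather than in parallel.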
5.5. Limitations and Future Work
Despite these promising results, a few limitations warrant consideration. First, our held-out evaluation provides evidence of generalization for pumas across a meaningful domain shift: training/validation in New Mexico and an independent test in California with different backgrounds and camera hardware. Our training data also span 11 camera brands, introducing substantial variation in infrared appearance and image statistics across devices. Moreover, because Stage 2 operates on cropped animal detections, the classifier is encouraged to learn animal appearance cues rather than site-specific vegetation or scenery. Consistent with pumas having similar visual features across regions, we observed strong performance on the California test set. However, performance may still vary under more extreme shifts in habitat, infrared characteristics, or camera configurations, and a key practical limitation is that many training examples contain clearly visible animals under relatively favorable illumination. Improving robustness to difficult field conditions such as low contrast, partial views, motion blur, or challenging lighting is an important direction for future work.
Second, our multi-species evidence is currently limited to an illustrative ringtail case study, which is intended to demonstrate rapid re-targeting of the same pipeline by retraining only Stage 2 with modest labeled data, rather than to claim universal multi-species performance. A broader, protocol-driven study spanning multiple regions and additional target species, with standardized train/test splits across sites and camera models, is beyond the scope of this study and is left for future work.
Third, our field validation involved a limited number of puma encounters and a stream that is strongly class-imbalanced, as is typical for camera-trap deployments where non-target triggers dominate. We attempt to compensate for this limitation by reporting metrics that are informative under imbalance (e.g., balanced accuracy and operational false-positive/false-negative rates) rather than relying on accuracy alone.
On the hardware side, although the Raspberry Pi implementation performed reliably in our tests, power interruptions, severe weather, or hardware failures could challenge long-term autonomous deployment. A current field deployment by collaborators from UC Davis is using the same on-device pipeline and is detecting pumas at performance levels consistent with the accuracy reported here, providing an initial external validation under real operating conditions. As a next step, we will couple detection with adaptive responses and evaluate different actions (e.g., deterrent sounds) and their effectiveness in reducing predation events. If end-to-end latency needs to be reduced in the field while testing deterrence efficacy, we will focus on the largest driver, the FTP transfer time, rather than on the models used for inference. Transfer timing is largely driven by the specific camera hardware, and other FTP-enabled camera models may achieve lower transfer latency.
6. Conclusions
We presented a deployable, offline vision sensor for near-real-time environmental monitoring that runs on low-cost edge hardware and is designed for challenging field imagery. The core contribution is a practical two-stage pipeline that separates broad localization and empty-image suppression (Stage 1) from target confirmation on ROIs (Stage 2), improving reliability in motion-triggered deployments where most images are empty or non-target. Quantitatively, our strongest evidence of generalization is that puma performance is measured on an independent, geographically disjoint California test set relative to New Mexico training/validation data, indicating robustness to background and camera domain shift.
On a labeled puma vs. no-puma dataset (479 puma, 955 no-puma), the proposed two-stage configuration achieved high event-level performance and outperformed single-stage alternatives by reducing false positives caused by complex backgrounds and nuisance triggers. Grad-CAM visualizations further support that, once properly localized, the classifier attends to anatomically meaningful cues for the target decision.
In field deployments operating since May 2025 across two networking modes (site WiFi and standalone hotspot), the system achieves approximately 4 s end-to-end latency from trigger to action command, enabling time-sensitive monitoring use cases in which stakeholders can be notified and events verified while the animal is still present. Field examples demonstrate integrated closed-loop operation with optional actuation outputs.
The system is intended to be reusable across species and sites via retraining of the Stage 2 binary classifier and configurable deployment settings. This was demonstrated by successfully training on a limited number of ringtail images. Ongoing and future work will focus on improving nighttime localization in low-contrast conditions, reducing power demand through camera configurations that avoid continuous capture while retaining event-driven upload, and conducting controlled, protocol-driven evaluations of targeted intervention policies (e.g., audio/light) to quantify behavioral outcomes. More broadly, these observations motivate data-efficient improvement strategies, including model-in-the-loop inspection and edge-enabled feedback loops in which uncertain events are flagged for human review and incorporated into periodic retraining as the system encounters new environments and species.