SOCRATES: A Stereo Camera Trap for Monitoring of Biodiversity

The development and application of modern technology is an essential basis for the efficient monitoring of species in natural habitats and landscapes, both to trace the development of ecosystems, species communities, and populations and to analyze the reasons for changes. For estimating animal abundance using methods such as camera trap distance sampling, spatial information about natural habitats in terms of 3D (three-dimensional) measurements is crucial. Additionally, 3D information improves the accuracy of animal detection using camera trapping. This study presents a novel approach to 3D camera trapping featuring highly optimized hardware and software. This approach employs stereo vision to infer 3D information about natural habitats and is designated as StereO CameRA Trap for monitoring of biodivErSity (SOCRATES). A comprehensive evaluation of SOCRATES shows not only a $3.23\%$ improvement in animal detection (bounding box $\text{mAP}_{75}$) but also its superior applicability for estimating animal abundance using camera trap distance sampling. The software and documentation of SOCRATES are provided at https://github.com/timmh/socrates


Introduction
The utilization of modern technology is commonplace in large-scale commercial, civil, and strategic projects. But terrestrial ecology, in particular, is less well equipped. The potential of modern sensors, remote sensing, automated laboratory procedures, and complex data processing is hardly utilized in this field. This absence is one important reason why there is no long-term large-scale automated monitoring of biodiversity (as established for climate research).

AMMODs Framework
To foster the adoption of modern technologies for the development of automated, reliable, and verifiable biodiversity monitoring, a network of Automated Multisensor stations for Monitoring of species Diversity (AMMODs) is proposed by Wägele et al. (2022) to pave the way for a new generation of biodiversity assessment stations. The AMMODs network approach combines cutting-edge technologies with biodiversity informatics and expert systems to conserve expert knowledge. The sensors employed in AMMODs range from traps for DNA barcoding, camera traps for visual monitoring, and bioacoustics to plant volatile compound detectors. The general concept, the hardware and software components, and the setup of AMMODs are described by Wägele et al. (2022). Within the AMMOD project, SOCRATES is specifically concerned with the visual monitoring of wildlife. SOCRATES derives and utilizes depth information as a third dimension in addition to the regular two dimensions (that is, the horizontal and vertical pixel coordinates) of conventional camera trap images.

SOCRATES Contributions
• Detection and localization of animals in camera trap images are often unreliable. The additional depth information provided by SOCRATES improves the accuracy and reliability of visual animal detection and animal localization within the monitored natural habitat.
• Abundance estimation, using methods such as camera trap distance sampling (CTDS), is traditionally performed using a combination of commercial camera trap hardware and very laborious manual workflows. SOCRATES instead provides depth information in a fully automated way using stereo vision.
• Reproducibility and accessibility for practitioners. This study takes the practitioner's perspective and provides detailed setup and operational instructions.

Related Work
Related work is reported from two perspectives. First, recent progress in visual object detection in images and video clips is reported. Second, an overview of approaches to estimate the density and abundance of unmarked animal populations using camera traps is given.

Visual Animal Detection
State-of-the-art approaches to visual animal detection utilize 2D deep learning object detection methods. Given a 2D color or grayscale image, these methods learn to predict a set of bounding boxes, e.g., given by the 2D locations of the upper-left and bottom-right corners of an axis-aligned rectangle fully enclosing the animal. Object detection methods might be extended to perform instance segmentation, where not only bounding boxes are predicted, but also whether each pixel in the image belongs to the respective object (binary masks). Deep learning object detection and instance segmentation models usually consist of two parts. The backbone takes in the original image and produces a hierarchy of feature maps that encode higher- and higher-level information about the image. These feature maps are then used by the object detection or instance segmentation model to predict the bounding boxes and binary masks themselves. A general issue of deep learning models is that their training requires immense amounts of annotated training data. Annotated data is raw data associated with corresponding labels, which can be of different modalities (e.g., object classes occurring in the image, bounding boxes around objects of interest, pixelwise masks of such objects, etc.). These labels must often be created manually and are therefore costly to obtain. This requirement of large annotated training datasets is slightly relaxed by transfer learning. In transfer learning, the backbone is first pre-trained to perform some task involving a very large training dataset, e.g., performing image classification on ImageNet (Deng et al., 2009). Visual concepts learned by such backbones have been shown to be generally useful and not just applicable to the pre-training task (Zeiler and Fergus, 2014). The backbone is then fine-tuned on the target task, which usually involves a much smaller dataset.

Abundance Estimation
There exist a number of methods to estimate the density and abundance of unmarked animal populations using camera traps, e.g., the random encounter model (REM) (Rowcliffe et al., 2008), the random encounter and staying time model (REST) (Nakashima et al., 2018), the time-to-event model (TTE), space-to-event model (STE), instantaneous estimator (IS) (Moeller et al., 2018) and camera trap distance sampling (Howe et al., 2017). All of these require an estimation of the effective area surveyed by the camera trap. This area is not simply given by the optical constraints of the camera. Instead, it is influenced by factors such as environmental occlusion and the range of the passive infrared sensor, which may not perform consistently at all locations within the camera's viewshed. The effective area surveyed is statistically inferred by using the distances of the observed animals. Although there are approaches that estimate these distances (semi-)automatically (Johanns et al., 2022), they either require laborious capture of reference material or might not generalize to extreme scenarios such as very close-up scenes within 3 m of the camera (Johanns et al., 2022; Auda, 2022).

The SOCRATES Stereovision Sensor Platform
The SOCRATES camera trap system comprises a cost- and power-efficient stereovision sensor platform as well as state-of-the-art animal detection software based on a deep learning software architecture. The experimental evaluation of SOCRATES utilizes a representative RGB-D dataset generated by SOCRATES in the wildlife park Plittersdorf located in Bonn, Germany, exhibiting fallow deer.
The central goal of SOCRATES is to infer depth information. Computer stereovision is the well-established approach to image-based depth estimation. By comparing information about an observed scene from two differing camera perspectives, depth information can be derived. Typically, both cameras in a stereo setup are displaced horizontally from one another, yielding scene observations in terms of a left image and a right image. Computer stereo vision can be seen as the technical analogue to human stereopsis, that is, human perception of depth and three-dimensional structure by combining visual information from two eyes.

Depth by Stereovision
Depth is derived by stereo vision in two steps:
1. Stereo Matching. Given a stereo image pair (i.e., left and right image) of an observed scene, a stereo matching model infers the so-called disparity map. The disparity map $D$ represents the position difference $d$ of every observed scene point between the left and right image, when viewed in the right image. The disparity is inversely proportional to the distance from the stereo setup to the observed scene point depicted in the corresponding image points.
2. Depth Computation. Once the disparity map is derived, the disparity values $d$ are used in combination with two parameters of the stereovision setup to compute the absolute depth value for each observed scene point. These parameters are (1) the baseline $b$ (physical distance between left and right cameras), an extrinsic parameter, and (2) the common focal length $f$ of both cameras (distance between the lens and camera sensor), an intrinsic parameter. The absolute depth value $z$ for an observed scene point is then simply given by $z = \frac{b \cdot f}{d}$. Thus, the lower the disparity, the larger the depth of the observed scene point.

Figure caption (hardware model): The important components are covered in the following text and are highlighted here, i.e., cameras (red outline), baseline rail (blue outline), control unit (green outline), battery (violet outline), infrared illumination (turquoise outline), passive IR sensor (orange outline). Details such as power supply, wiring and screws are omitted. Some parts of the model are obtained from external sources.
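The depth computation step can be illustrated with a minimal sketch. It assumes that the disparity and the focal length are both expressed in pixels of the rectified images and that the baseline is given in metres. The function names and the example values (a baseline of 0.5 m and a focal length of 1280 px) are hypothetical and are not taken from the SOCRATES software.

```python
def depth_from_disparity(disparity_px, baseline_m, focal_px):
    """Return the depth z = b * f / d in metres for a single disparity value.

    Assumption: non-positive disparities are treated as invalid matches and
    mapped to an infinite depth.
    """
    if disparity_px <= 0:
        return float("inf")
    return baseline_m * focal_px / disparity_px


def depth_map(disparity_map, baseline_m, focal_px):
    """Apply depth_from_disparity to every entry of a 2D list of disparities."""
    return [
        [depth_from_disparity(d, baseline_m, focal_px) for d in row]
        for row in disparity_map
    ]


if __name__ == "__main__":
    # Hypothetical setup: baseline 0.5 m, focal length 1280 px.
    print(depth_map([[64.0, 128.0]], baseline_m=0.5, focal_px=1280.0))
```

Halving the disparity doubles the computed depth, which mirrors the inverse proportionality stated in step 1.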

Hardware design of SOCRATES
The SOCRATES stereovision platform is optimized for:
1. operability (a) at day and night time as well as (b) for a wide range of animal-camera distances,
2. effective and efficient power supply,
3. hardware and construction costs,
4. weather resistance.
The following section covers the technical implementation in detail and explains how we address these design goals. We first address the stereo camera design (cameras and baseline, design goals 1. and 3.). The raw data produced by the cameras is processed and stored by the control unit (design goals 2. and 3.). Weather resistance (design goal 4.) is provided by the case. Infrared motion detection and illumination facilitate energy efficiency (design goal 2.) and operability at night time (design goal 1. (a)). We additionally describe in detail the power supply, how we obtain animal-camera distances using stereo correspondence, and how the captured data may be transferred using different connectivity options.
Cameras and Baseline: A pair of Raspberry Pi High Quality Cameras is chosen for their cost-effectiveness and the high sensitivity of their Sony IMX477 sensor (Sony Semiconductor Solutions Corporation). Interchangeable lenses allow adaptation to specific scenarios (i.e., shorter focal lengths for close-up scenes, longer focal lengths for more distant objects). Removal of the infrared filter allows sufficient exposure at night using artificial infrared illumination. The cameras have an additional Bayer filter above the sensor, which is usually responsible for filtering different wavelengths to create a color image. We leave this filter intact to avoid the risk of damaging the sensor itself. As near-infrared illumination (either from the environment or the illuminator) illuminates all color bands, we do not try to recover any color information and instead average all bands to obtain a grayscale image.
The cameras are mounted on a 77.5 cm long U-shaped aluminum rail with holes drilled at regular intervals to allow configuration of different baseline distances between both cameras. Both cameras are connected through 50 cm long ribbon cables to the two MIPI CSI-2 interfaces of an NVIDIA Jetson Nano Developer Kit.
Both design aspects, i.e., the interchangeable high quality lenses as well as the configurable baseline construction, allow for adaptation to specific scenarios, e.g., free fields, feeding places, animal crossings, green bridges, etc., where animals are observable at different distances.
Control: We use an NVIDIA Jetson Nano Developer Kit as the central control and storage unit. It is responsible for taking motion detection signals from the PIR sensor, turning on the power to the IR illuminator, and capturing, encoding, and archiving image material from the cameras. We decided on the Jetson Nano for the following reasons: (1) unlike most single-board computers, it provides two MIPI CSI-2 interfaces for the two cameras, (2) it provides a powerful GPU that can be used for encoding video efficiently, and (3) it supports a power-efficient hibernation mode (SC7) from which it can be activated through its general-purpose input/output (GPIO) pins by adjusting the Linux device tree accordingly. The raw RGB video material is encoded on the Jetson Nano's GPU by synchronizing the left and right image streams, concatenating them horizontally, and compressing the resulting video of resolution 2 × 1920 × 1080 using the HEVC video codec (Sullivan et al., 2012). In our experiments, this results in a bitrate of roughly 6.7 Mbit/s. The Jetson Nano uses a 128 GB microSDXC card for persistent storage.
Case: To make SOCRATES as weather-resistant as possible, most components are placed inside a single weather-proof case. The case is made of 0.8 cm thick birch plywood and is 80 cm wide, 11.6 cm high and 20 cm deep. We decided on a very wide case to be able to adapt the baseline of the stereo camera to different configurations. The front of the case is shielded by a piece of acrylic glass. In the bottom, we add a 4 cm diameter hole for ventilation, which is covered from the inside with an insect screen.
The battery is mounted via Velcro strip onto a hatch in the bottom of the case to allow quick replacement. We add two further holes for the wiring of the IR illuminator and motion detector, respectively, both of which are sealed using silicone. The top of the case is sealed using a silicone strip and secured by screws, which can be loosened to take it off for maintenance. All exposed wooden parts are further treated with marine varnish for weather resistance.
Motion Detection: Like most camera traps, we use a pyroelectric infrared (PIR) sensor for detecting motion and thereby triggering capture. We choose an HC-SR501 PIR sensor due to its compatibility with the 3.3 V GPIO pins of the Jetson Nano. We initially mounted the PIR sensor inside the case, just behind the acrylic glass. However, we found that this severely impaired the ability of the sensor to detect any kind of motion outside the case. This is because acrylic glass is opaque at wavelengths around 10 µm (Altuglas International, 2000), the range in which bodies at the temperatures of most animals emit most of their thermal radiation. Therefore, we mount the PIR sensor in a separate 3D printed weatherproof casing below the main case.
Illumination: We employ a simple 12 W, 850 nm infrared illuminator to ensure properly exposed images at night without disturbing most animals. The illuminator has a weatherproof case and is mounted on the bottom of the main case. The 12 V power supply is switched by a Jetson Nano GPIO pin using an IRLZ44NPBF MOSFET.
Power supply: All components are powered by a lithium ion polymer battery because of the high power density of this battery type. We employ a battery with a theoretical capacity of 236.8 W h (16 000 mA h at 14.8 V). A generic 4S balancer circuit board provides over-discharge protection. The variable voltage of the battery is then regulated to 5 V for the Jetson Nano and 12 V for the infrared illuminator by Mean Well SCW20A-05 and SCW12A-12 converters, respectively.
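The stated battery capacity follows directly from the charge and the nominal voltage. The following minimal sketch reproduces this arithmetic and derives a rough upper bound on how long the battery could drive the 12 W illuminator alone. This bound is an illustration only: it assumes lossless voltage conversion and ignores all other loads, so it is not a measured battery life of SOCRATES. The function names are hypothetical.

```python
def battery_energy_wh(capacity_mah, nominal_voltage_v):
    """Convert a battery charge in mA h at a nominal voltage into energy in W h."""
    return capacity_mah / 1000.0 * nominal_voltage_v


def runtime_hours(energy_wh, load_w, converter_efficiency=1.0):
    """Idealized runtime for a constant load.

    Assumption: converter_efficiency=1.0 means lossless conversion, which a
    real DC-DC converter does not achieve.
    """
    return energy_wh * converter_efficiency / load_w


if __name__ == "__main__":
    energy = battery_energy_wh(16000, 14.8)  # values stated in the text
    print(f"theoretical capacity: {energy:.1f} W h")
    print(f"illuminator-only upper bound: {runtime_hours(energy, 12.0):.1f} h")
```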
Stereo correspondence: The central goal of SOCRATES is to infer depth information through stereo vision. In the natural world, as well as in computer vision, this is achieved by solving the stereo correspondence problem. To solve the stereo correspondence problem efficiently, the left and right images must be rectified. To obtain an accurate rectification, the intrinsic (internal camera parameters) and extrinsic (rotation and translation between the cameras) parameters have to be obtained by a calibration procedure. For the calibration of the intrinsic parameters, a calibration object (e.g., a checkerboard pattern printed on cardboard) has to be captured by the camera(s) to be able to associate 3D points in the scene with 2D points in the resulting image. To obtain the extrinsic parameters, eight or more point correspondences between the images of both cameras must be established (Longuet-Higgins, 1981). We perform both intrinsic and extrinsic calibration using Kalibr (Maye et al., 2013) with a grid of 4 × 3 AprilTags (Olson, 2011) mounted on a wooden board as the calibration target. During the setup of SOCRATES, the calibration target is manually moved through the scene such that it covers as much as possible of each camera's field of view. After SOCRATES is assembled and calibrated, calibration does not have to be repeated when deployed to different locations, as the calibration does not depend on a specific location but only on the camera configuration. Given the intrinsic and extrinsic parameters, we rectify the images of both cameras and compute the disparity of each pixel using the stereo matching method of Li et al. (2022).
Connectivity: SOCRATES may transmit the recorded data via three different means: wired Ethernet cable, wireless LAN (Edimax EW-7811UN) or cellular connection (Huawei E3372H). If no base station is available, we use the cellular connection to manually download the captured data. Otherwise, we connect via wireless LAN and the CoAP protocol (Bormann et al., 2012) to the AMMOD Basestation (Wägele et al., 2022; Sixdenier et al., 2022), which in turn uploads the captured data to the AMMOD Portal (cf. section 3.5).

Data
We deployed SOCRATES in the Tierpark Plittersdorf in order to evaluate the hardware and software. Details about this deployment may be found in section 3.1. During this time, SOCRATES made 221 true positive observations. Exemplary samples are visualized in figure 3. Each observation results in an HEVC encoded video with 30 frames per second and a length of 25 seconds. For our experiments, we sample two sets of still images from these videos.
For our camera trap distance sampling study (cf. section 2.4), we sample still images from the videos at a rate of $2\,\mathrm{s}^{-1}$, resulting in a total of 2871 images.
For the instance segmentation task (cf. section 2.3), we want our dataset to consist of diverse scene configurations (animal positions and poses, lighting conditions, etc.). To obtain such diverse samples, sampling at regular intervals is not sufficient, because deer sometimes stand still for long periods of time and move quickly through the scene at other times. Therefore, we employ an approach based on background modeling using Gaussian mixture models (KaewTraKulPong and Bowden, 2002). We accumulate the ratio of foreground pixels (that is, the ratio of pixels occupied by moving objects) over the video frames until a threshold of 10% is reached, at which point a still image is sampled. This way, we sample more often if there is more movement in the video, and less often if there is less movement. We then annotated a total of 546 instances in 187 of the still images sampled this way with instance masks using the interactive annotation tool proposed by Sofiiuk et al. (2021). Figure 4 visualizes one such annotated image. On average, we needed roughly 3.5 minutes per instance, resulting in a total annotation effort of roughly 32 hours. Out of the total 546, we use 395 instances for training and validation (via 10-fold cross-validation), and reserve 151 instances as a test dataset, such that images from a single video are only ever contained in one dataset. The test dataset is not used in this work but is instead reserved for future work. We publish both the raw data (Haucke and Steinhage, 2022b) and the instance segmentation dataset (Haucke and Steinhage, 2022a).
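The motion-adaptive sampling described above can be sketched as follows. The sketch assumes that the per-frame foreground ratios have already been computed by a background model (for instance a Gaussian mixture model) and are passed in as plain numbers. Resetting the accumulator to zero after each sampled frame is an assumption of this sketch, and the function name is hypothetical.

```python
def select_frames(foreground_ratios, threshold=0.10):
    """Return indices of frames to sample.

    foreground_ratios: per-frame fraction of pixels classified as foreground.
    A frame is sampled whenever the accumulated ratio reaches the threshold.
    Assumption: the accumulator is reset to zero after each sampled frame.
    """
    selected = []
    accumulated = 0.0
    for index, ratio in enumerate(foreground_ratios):
        accumulated += ratio
        if accumulated >= threshold:
            selected.append(index)
            accumulated = 0.0
    return selected


if __name__ == "__main__":
    slow_motion = [0.03] * 8          # little movement per frame
    fast_motion = [0.5, 0.5, 0.0]     # large movement in the first two frames
    print(select_frames(slow_motion))
    print(select_frames(fast_motion))
```

With little movement several frames pass before one is sampled, whereas with strong movement consecutive frames are sampled, which matches the intended behavior.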

Depth-aware Instance Segmentation
We frame the problem of detecting and localizing animals as an instance segmentation problem, with the goal of generating a bounding box and a binary mask for each animal instance. Compared to animal presence-absence classification, this approach allows both counting the exact number of animals present and inferring the distance between animal and camera by applying the binary mask to the depth images obtained using stereo vision. Moreover, the depth images themselves contain useful information for differentiating multiple individual animals from each other and from the background. Instance segmentation is usually performed by first forwarding a color image with red, green and blue (RGB) channels through the backbone, which may be a convolutional neural network (CNN) or a vision transformer (Dosovitskiy et al., 2020). The backbone then generates a hierarchy of feature maps that encode higher- and higher-level information about the image. These feature maps are then used by an instance segmentation model to predict the bounding boxes and binary masks themselves. It is not obvious how to use the depth images obtained from SOCRATES in this framework. Compared to datasets like ShapeNet (Chang et al., 2015), we only have information from (effectively) a single perspective. We therefore argue that it is wise to treat the depth information as an additional channel in the two-dimensional image instead of working on point clouds or voxel grids, which increase the computational and memory requirements while largely foregoing the significant improvements being made in the area of 2D instance segmentation.
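Applying a binary instance mask to a depth image to obtain an animal-camera distance can be sketched as follows. Reducing the masked depth values to a single distance with the median is an assumption of this sketch, chosen for robustness against outliers, and is not necessarily the reduction used in this study. Images are represented as nested lists to keep the example self-contained, and the function name is hypothetical.

```python
import math
from statistics import median


def animal_distance(depth_image, mask):
    """Return the median depth (in metres) of all valid pixels inside the mask.

    depth_image: 2D list of depth values in metres.
    mask: 2D list of the same shape, truthy where the pixel belongs to the animal.
    Assumption: non-positive or non-finite depths mark invalid stereo matches.
    Returns None if the mask contains no valid depth value.
    """
    values = [
        depth
        for depth_row, mask_row in zip(depth_image, mask)
        for depth, inside in zip(depth_row, mask_row)
        if inside and depth > 0 and math.isfinite(depth)
    ]
    return median(values) if values else None


if __name__ == "__main__":
    depth = [[4.0, 4.2, 9.0],
             [4.1, 0.0, 9.5]]
    mask = [[1, 1, 0],
            [1, 1, 0]]
    print(animal_distance(depth, mask))
```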
Although most backbones are pre-trained on color images without depth information, a recent work proposes Omnivore, a vision transformer backbone trained on color and depth information. In our experiments, we use Omnivore as our backbone of choice. As instance segmentation models, we use either the convolution-based Cascade Mask R-CNN (Cai and Vasconcelos, 2018) or the transformer-based Mask2Former (Cheng et al., 2022). Evaluating both is motivated by the observation that vision transformers require more training data to perform well, compared to CNNs (Dosovitskiy et al., 2020; Hassani et al., 2021). Accordingly, Mask2Former performs very poorly when trained on small datasets such as the Plittersdorf instance segmentation dataset, while outperforming Cascade Mask R-CNN on larger datasets such as Cityscapes (Cordts et al., 2016). The resulting model architecture is visualized in figure 5. To demonstrate that depth information is not only beneficial on the Plittersdorf instance segmentation task, we also evaluate improvements on the Cityscapes instance segmentation dataset. We choose Cityscapes because it is one of the few datasets that provide both depth information through stereo vision and a large number of instance segmentation annotations.
We implement our instance segmentation pipeline using mmdetection (Chen et al., 2019) and largely keep the default hyperparameters of the mmdetection model implementations. We use the AdamW optimizer (Loshchilov and Hutter, 2017) with a global learning rate of $5N \cdot 10^{-5}$ with batch size $N$ and a weight decay of 0.05. We set $N = 2$ for Mask2Former and $N = 6$ for Cascade Mask R-CNN due to memory constraints.

Figure 5: RGB-D Instance Segmentation using Omnivore and Cascade Mask R-CNN (Cai and Vasconcelos, 2018) or Mask2Former (Cheng et al., 2022). In Omnivore, the grayscale and depth images are first split into 2D patches, linearly embedded and added together. The resulting embeddings are then passed through a transformer encoder which generates hierarchical feature maps. These feature maps are then used by Cascade Mask R-CNN or Mask2Former to perform instance segmentation.
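The batch-size-dependent learning rate can be made concrete with a small sketch. It builds an optimizer dictionary in the style commonly used in mmdetection configuration files. The helper name is hypothetical, and the exact place where such a dictionary is inserted into a configuration depends on the mmdetection version, which is not specified here.

```python
def adamw_config(batch_size, lr_per_sample=5e-5, weight_decay=0.05):
    """Build an AdamW optimizer dictionary with a learning rate of 5N * 10^-5.

    batch_size: N, the number of images per optimization step.
    Assumption: the dictionary layout follows the config-dict style of
    mmdetection; this helper is not part of the library itself.
    """
    return dict(
        type="AdamW",
        lr=lr_per_sample * batch_size,
        weight_decay=weight_decay,
    )


if __name__ == "__main__":
    print("Mask2Former (N = 2):", adamw_config(2))
    print("Cascade Mask R-CNN (N = 6):", adamw_config(6))
```

For the two batch sizes stated above, this yields learning rates of about $1 \cdot 10^{-4}$ and $3 \cdot 10^{-4}$, respectively.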

Camera Trap Distance Sampling Study
It is now possible to combine the instance masks generated by the animal detection model (cf. section 2.3) with the depth images obtained by stereo vision (cf. section 2.1.1) to obtain the distances required for the camera trap distance sampling (CTDS, Howe et al. (2017)) abundance estimation method. To be able to use all observations without leaking information from the training dataset of our instance segmentation model, in this study we do not use the instance masks but instead the bounding boxes of MegaDetector (Beery et al., 2019) together with the sampling approach by Haucke et al. (2022). To show the viability of this approach, we perform an exemplary estimation of detection probability. We use 7 equally spaced distance intervals from 3 m to 11 m. As SOCRATES is mounted on a tree just outside the enclosure and at a height of 1.9 m, 3 m is the minimal distance where deer are certain to be visible. We do not re-scale the minimum distance, as deer might be present closer to the camera but outside the field of view. We use the Distance for Windows software (version 7.4, Thomas et al. (2010)) and model the detection function using a uniform key function with a single cosine adjustment term.
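The binning of measured distances into 7 equally spaced intervals from 3 m to 11 m can be sketched as follows. Counting a distance that lies exactly on the upper limit in the last interval and discarding distances outside the range are assumptions of this sketch. The function names and the example distances are hypothetical, and the sketch does not reproduce the model fitting performed by the Distance software.

```python
def interval_edges(lower, upper, n_intervals):
    """Return the n_intervals + 1 edges of equally spaced distance intervals."""
    span = upper - lower
    return [lower + span * i / n_intervals for i in range(n_intervals + 1)]


def bin_distances(distances, lower=3.0, upper=11.0, n_intervals=7):
    """Count distances per interval; values outside [lower, upper] are dropped.

    Assumption: a distance equal to the upper limit falls into the last interval.
    """
    counts = [0] * n_intervals
    width = (upper - lower) / n_intervals
    for distance in distances:
        if distance < lower or distance > upper:
            continue
        index = min(int((distance - lower) / width), n_intervals - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    print([round(edge, 2) for edge in interval_edges(3.0, 11.0, 7)])
    # Hypothetical distances in metres, for illustration only.
    print(bin_distances([2.5, 3.0, 3.5, 10.9, 11.0, 12.0]))
```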

SOCRATES
We operated SOCRATES in the Tierpark Plittersdorf, Bonn, Germany, from February 9th to July 8th 2022, or 149 days. The Tierpark Plittersdorf houses exclusively European fallow deer (Dama dama) and Sika deer (Cervus nippon). The camera was mounted on the side of a tree using a lashing strap and not moved during the entire duration. During this time, SOCRATES experienced temperatures from −4°C to 38°C and storms with wind speeds of $87\,\mathrm{km\,h^{-1}}$ without issues. SOCRATES was without power, or its software was disabled for maintenance, on 46 days, resulting in a total number of 103 observation days. During this time, SOCRATES recorded 1089 observations. Out of these, 221 showed visible animals. This indicates a false positive rate of roughly 80%, which is in line with prior work concerned with commercial camera traps (Newey et al., 2015). False triggers are primarily induced by (1) animals in the field of view of the PIR sensor but outside of the fields of view of the cameras, (2) infrared illumination by the sun during daytime, or (3) artificial light sources such as flood lights on nearby buildings. We manually removed all false positive observations from our dataset. Although this could easily be automated, e.g., by using the MegaDetector (Beery et al., 2019), we wanted to ensure that there are no persons in the final dataset and therefore screened the entire dataset manually. We compare SOCRATES with the widely used commercially produced Reconyx HP2XC in Table 1. SOCRATES is significantly larger to support large baselines, while having significantly shorter battery life and slightly higher component costs. The infrared illuminator of SOCRATES operates at a slightly shorter wavelength, which might be visible to some animals. The infrared illuminator should therefore be replaced with a longer-wavelength version in the future.
At the same time, SOCRATES not only provides depth information through stereo vision, but also allows recording video at high resolutions and frame rates for long durations only limited by available storage space.
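The deployment figures reported above (149 days of deployment, 103 observation days, and a false positive rate of roughly 80%) can be re-derived from the stated dates and counts. The following minimal sketch does so with the Python standard library. The function name and the dictionary keys are hypothetical.

```python
from datetime import date


def deployment_summary(start, end, downtime_days, observations, true_positives):
    """Derive deployment length, observation days and false positive rate."""
    total_days = (end - start).days
    false_positives = observations - true_positives
    return {
        "total_days": total_days,
        "observation_days": total_days - downtime_days,
        "false_positive_rate": false_positives / observations,
    }


if __name__ == "__main__":
    summary = deployment_summary(
        start=date(2022, 2, 9),
        end=date(2022, 7, 8),
        downtime_days=46,
        observations=1089,
        true_positives=221,
    )
    print(summary)
```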
We demonstrate that the stereo capabilities of SOCRATES facilitate improved visual animal detection (cf. section 3.2) and accurate abundance estimation using camera trap distance sampling (cf. section 3.4). Depth information is also essential for obtaining absolute animal sizes in photogrammetry, which is traditionally performed using laser rangefinders (Shrader et al., 2006). Furthermore, depth information has been shown to improve the accuracy of animal tracking over 2D-only approaches (Klasen and Steinhage, 2022a,b). SOCRATES cannot compete with commercially available camera traps in cost or battery life, but this is not our goal. Apart from the methodological improvements described above, SOCRATES fulfills three high-level goals:
1. It demonstrates that stereo camera traps are viable and worthwhile. We hope to convince commercial camera trap manufacturers to support stereo camera setups using off-the-shelf hardware.
2. It facilitates the verification of monocular approaches. For example, abundance estimation using camera trap distance sampling might be performed twice, once using monocular approaches (Johanns et al., 2022) and once using SOCRATES. Both raw animal distances and the resulting animal densities might then be compared.
3. It allows generating training data for monocular depth estimation methods such as Godard et al. (2019) and Ranftl et al. (2020, 2021). These approaches have been largely focused on human-centric scenes such as indoor and street scenes with relatively simple geometry, which are highly unlike natural scenes such as forests. Gathering training data from natural scenes might help these methods to generalize better to such scenes and thus allow monocular camera traps to more accurately estimate depth information in the future.
Figure 3 shows some exemplary pairs of near infrared images and corresponding depth maps inferred by the method of Li et al. (2022).
As can be seen, the depth maps generally represent the scene well and clearly highlight the boundaries of the deer. To evaluate the depth maps quantitatively, we employ the temporal quality metric proposed in (Vandewalle and Varekamp, 2014), which is defined as:

Stereo Correspondence
where $N_T$ is the number of frames in the input video, $N_P$ is the number of pixels in a single frame, $D(x, y, n)$ is the scalar disparity at some pixel $(x, y)$ at time $n$, and $(m_x, m_y)$ is the optical flow from frame $n$ to frame $n-1$, calculated using the method of Farnebäck (2003). Using the stereo matching method of Li et al. (2022), we obtain $E_t = 0.4439$, which is on par with the temporal error of the ground truth disparity in (Vandewalle and Varekamp, 2014). As can be seen, the temporal error is low for the vast majority of observations. As with regular camera traps, at night time, some regions in the field of view might be insufficiently lit and therefore underexposed in the resulting images. In these regions, insufficient image information is available to perform successful stereo correspondence, leading to the outliers with poor temporal error apparent in figure 6. One such outlier case is shown in figure 7. Still, the depth of the well-lit area is correctly inferred.

Visual Animal Detection
We use the COCO (Lin et al., 2014) metrics to evaluate our instance segmentation models. Each metric is obtained by performing 10-fold cross-validation after the last training epoch. Cross-validation is especially important in this setting, as it reduces the impact of a single lucky train-test split on this small dataset. Table 2 summarizes the results on the Plittersdorf instance segmentation task. The summarizing metrics for bounding boxes ($\text{AP}^{\text{bbox}}$) and segmentation ($\text{AP}^{\text{segm}}$) show that incorporating depth information results in an overall performance improvement. Interestingly, for low IoU thresholds ($\text{AP}^{\text{bbox}}_{50}$, $\text{AP}^{\text{segm}}_{50}$), depth information seems to have the opposite effect. In other words, pure grayscale images perform better for roughly localizing an animal, whereas grayscale and depth information together are better for localizing animals very accurately ($\text{AP}^{\text{bbox}}_{75}$, $\text{AP}^{\text{segm}}_{75}$). This is especially interesting because the ground truth labeling is performed using exclusively the grayscale image. Intuitively, one could therefore argue that the grayscale information is most important for matching the ground truth very precisely. Here we see the opposite effect. As the depth error of stereo correspondence grows quadratically with the true distance, the resulting depth maps become less useful at larger distances. This is reflected in the lower performance on small instances ($\text{AP}^{\text{segm}}_{s}$), which are typically farther away than medium ($\text{AP}^{\text{segm}}_{m}$) or large instances ($\text{AP}^{\text{segm}}_{l}$). We tried to ease the dependence on depth information for these faraway instances by clipping the depth values to different maximum distances or by randomly dropping the depth information altogether during training (Srivastava et al., 2014). However, this did not result in meaningful improvements.

Table 3: Instance segmentation results on the Cityscapes validation set. The depth-aware Omnivore-L variant clearly improves on the non-depth-aware variant in all metrics. The metrics of the Swin-L backbone are obtained using the original implementation (Cheng et al., 2022). $\text{AP}^{\text{segm}}$ and $\text{AP}^{\text{segm}}_{50}$ are Cityscapes metrics, the rest are COCO metrics.
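The quadratic growth of the stereo depth error with distance, noted in the discussion of small instances, follows from the standard pinhole stereo relation $Z = f b / d$: a disparity error $\delta d$ translates into a depth error of roughly $Z^2 \, \delta d / (f b)$. The following minimal Python sketch illustrates this. The focal length, baseline, and disparity error are hypothetical values chosen for illustration and are not the SOCRATES calibration.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo model: depth Z = f * b / d."""
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, focal_px, baseline_m, disparity_error_px):
    """First-order depth error: |dZ| ~= Z^2 / (f * b) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Hypothetical values for illustration only (not the SOCRATES calibration).
FOCAL_PX = 1000.0
BASELINE_M = 0.1
DISPARITY_ERROR_PX = 0.5

for depth_m in (2.0, 4.0, 8.0, 16.0):
    err = depth_error(depth_m, FOCAL_PX, BASELINE_M, DISPARITY_ERROR_PX)
    print(f"depth {depth_m:5.1f} m -> approx. depth error {err:.3f} m")
```

Under this model, doubling the distance quadruples the expected depth error for the same disparity error, which is consistent with depth maps being less informative for small, faraway instances.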

Depth-aware Instance Segmentation on Cityscapes
To show that the positive effect of depth information on instance segmentation accuracy is not limited to settings with grayscale images, a single object class, and a fixed camera such as SOCRATES, we additionally evaluate our instance segmentation approach on the Cityscapes instance segmentation task (Cordts et al., 2016). This dataset is composed of color and stereo depth images of urban street scenes, captured by cameras in a moving car. It features several object classes, such as person, car, or bus, annotated with instance labels. The Cityscapes dataset is also much larger, with 3475 annotated images in its train and validation sets. As can be seen in table 3, the depth information has an overall even greater positive impact than in the Plittersdorf task (cf. section 3.2). This is likely due to two reasons: (1) Mask2Former (Cheng et al., 2022) is better able to make use of the depth-aware feature hierarchies produced by Omnivore, and (2) the larger training dataset might help compensate for the lower number of depth images during pre-training.

Figure 8: CTDS detection probability. We transform our distance measurements into seven intervals (visualized in blue) from which the detection probability (visualized in red) is derived using CTDS.

Figure 8 depicts the detection probability obtained by CTDS using the parameters specified in section 2.4. Note that the estimated probability density approximates the measurements well, starting from a distance of 3 m. Due to the way SOCRATES is mounted, deer closer than 3 m may not be visible, which is why we exclude these short distances from our estimation (cf. section 2.4). Compared to competing approaches (Johanns et al., 2022), distance estimation for abundance estimation of unmarked animal populations is straightforward with SOCRATES. Figure 9 visualizes the respective tradeoffs.
The presented proof-of-concept for modelling detection probability in camera trap distance sampling with SOCRATES stereo camera devices demonstrates the usefulness of the approach and its potential for improving the efficacy of future wildlife surveys. The reduced cost of data processing, the improvement in animal detection, and the potential for application in integrated mono- and stereo-camera trap surveys pave the way for an end-to-end solution in computational wildlife monitoring. The proposed approach is not limited to the conditions of our study, but is widely applicable across habitats, species, and regions. For future field surveys, we recommend that multiple SOCRATES devices be used, along with a random or systematic study design, to estimate wildlife density and the associated variance reliably. SOCRATES can also be paired with traditional, monocular camera traps to better quantify the error of, and to improve, monocular distance estimation methods such as the one proposed by Johanns et al. (2022).
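The binning step behind figure 8 can be sketched as follows. This is a minimal sketch, assuming equal-width intervals between the 3 m left-truncation distance and a hypothetical maximum distance; the actual interval edges and model parameters are those specified in section 2.4, and the distances in the example are made up. Fitting the detection function itself is left to dedicated distance sampling software.

```python
def bin_distances(distances_m, min_m=3.0, max_m=17.0, n_bins=7):
    """Count distance measurements per equal-width interval.

    Measurements below min_m (left truncation) or above max_m are dropped.
    max_m and the equal-width layout are assumptions for illustration.
    """
    width = (max_m - min_m) / n_bins
    edges = [min_m + i * width for i in range(n_bins + 1)]
    counts = [0] * n_bins
    for d in distances_m:
        if d < min_m or d > max_m:
            continue  # outside the truncated range
        # The right-most edge belongs to the last interval.
        index = min(int((d - min_m) / width), n_bins - 1)
        counts[index] += 1
    return edges, counts


# Made-up distances in metres, purely for demonstration.
example = [1.5, 3.2, 4.8, 5.1, 6.0, 7.7, 9.3, 12.4, 16.9, 20.0]
edges, counts = bin_distances(example)
print(edges)
print(counts)
```

The per-interval counts correspond to the blue histogram in figure 8; a detection function fitted to them would correspond to the red curve.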

AMMOD Portal Case Study
A central goal of the AMMOD project is to automatically collect all observed data in a central repository (the AMMOD Portal, https://data.ammod.de), which will eventually be accessible to biologists and the general public. For SOCRATES, we ensure this by uploading the captured raw data via CoAP (Bormann et al., 2012) to the AMMOD Basestation (Wägele et al., 2022; Sixdenier et al., 2022), if one is available at the current location, or directly to the AMMOD Portal otherwise. The AMMOD Basestation schedules and prioritizes data transfer from the different sensors according to the energy available from energy harvesting. Once the raw data is uploaded to the AMMOD Portal, a server runs the instance segmentation (cf. section 2.3) and distance estimation (cf. section 2.4) workflows. To increase throughput and energy efficiency, the server is equipped with an NVIDIA GPU to accelerate neural network inference. Both workflows are packaged as Docker images to simplify dependency management and updates. The resulting instance masks and distances are then uploaded back to the AMMOD Portal, where they are available for further analysis by biologists. The entire data flow is fully automated and visualized in figure 10.
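The upload routing described above amounts to a simple fallback decision. The following is a minimal sketch: the function name, the `Target` type, and the Basestation address are hypothetical and only illustrate the fallback logic, not the actual SOCRATES software or its CoAP client.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Target:
    name: str
    uri: str


# The Portal URL is taken from the text; its use as an upload endpoint and
# the Basestation address below are hypothetical placeholders.
PORTAL = Target("AMMOD Portal", "https://data.ammod.de")


def select_upload_target(basestation: Optional[Target]) -> Target:
    """Prefer the local Basestation if one is available, else the Portal."""
    return basestation if basestation is not None else PORTAL


local = Target("AMMOD Basestation", "coap://basestation.example/upload")
print(select_upload_target(local).name)   # AMMOD Basestation
print(select_upload_target(None).name)    # AMMOD Portal
```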

Conclusion
We propose a StereO CameRA Trap for monitoring of biodivErSity (SOCRATES), a novel camera trap prototype that uses stereo vision to infer the 3D structure of the captured scene. SOCRATES enables the following contributions:
• Detection and localization of animals: SOCRATES provides depth information that improves the accurate localization of animals in an instance segmentation setting, e.g. by $3.23\%$ in bounding box $\text{mAP}_{75}$. We obtain these results by performing 10-fold cross-validation. Similar results on the Cityscapes instance segmentation task show that this effect is not limited to grayscale images, a single object category, or fixed cameras such as SOCRATES.
• Abundance estimation is facilitated by the automatic distance measurements of SOCRATES. We perform a proof-of-concept camera trap distance sampling study and successfully model detection probability in a wildlife enclosure. Future work might use SOCRATES to perform automatic abundance estimation in the wild and compare the results with competing approaches.
• Reproducibility and accessibility for practitioners are enabled by openly providing our raw and labeled data, code, detailed instructions, best practices, and 3D CAD models at https://github.com/timmh/socrates. We hope to pave the way for the eventual adoption of stereo camera traps by commercial manufacturers.