1. Introduction
Ensuring animal welfare is a key responsibility of any animal-keeping institution [1,2]. Animal welfare is defined as the collective physical, mental, and emotional state of an individual animal [2] and should be guaranteed 24 h a day, seven days a week, ideally from birth to death [3]. Examining animal welfare requires reliable, reproducible, and repeated assessment of welfare indicators [4]. In zoos, this is typically achieved by observing behavior and by measuring physiological and physical indicators. Physiological indicators include, for example, adrenal hormones, glucocorticoid metabolites, or biochemical and hematological parameters. Physical parameters include coat or body condition scoring, gait parameters, or pedal and dental health [5,6,7,8,9].
Typically, behavioral observations in zoos are carried out using traditional methods through direct observation, either by keepers or by biologists manually scoring behavior [9]. Depending on the observed species and the specific research question, different activities (e.g., walking, standing, lying, feeding, social as well as abnormal behaviors) are within the scope of the observation. Of particular importance, however, is recording the animal's position in the enclosure over time. Analyzing the spatio-temporal changes in enclosure usage gives insight into an individual's activity and inactivity patterns, proximity and distance to conspecifics, and preferences in area usage. Observing an animal therefore requires spotting it, identifying it, and locating its position on the enclosure map. Since manual pinpoint localization is not possible, enclosures are typically divided into suitable segments depending on the structure of the enclosure and the position of the observer [10]. The animals' positions are manually assigned to the respective segment, limiting the maximum accuracy of the location to the size of the chosen segments.
As manual observations are very time-consuming, they are usually carried out for only a few hours per day, severely limiting their conclusiveness [7,9]. This leads to a selective assessment of the animal's behavior, as a few hours of observation do not allow general assertions [9,11]. Human observers are prone to error, especially in long-duration observations, and may observe only a small group of animals or only single individuals, depending on the method chosen. In addition, for some species, extensive training is needed to recognize individual animals, as many species lack distinct visual features. Moreover, the problem of subjectivity remains, since the behavioral measurement is highly dependent on the perceptual abilities of the observer and always leaves room for error [12]. It can be concluded that manually performed long-term studies of the behavior of individual animals are associated with extremely high effort and costs, and still do not enable continuous monitoring.
An alternative to traditional manual observation methods is the use of a video-based monitoring system, which overcomes the aforementioned limitations and allows insight into behavior on a 24/7 time scale. To automate the whole manual observation process, such a monitoring framework must perform the same processing stages: (a) the animals must be detected in the raw video data, and (b) the identity of each individual must be determined. In the third stage, (c) depending on the method, different information about the individual behavior can be assessed. In addition, for the present zoo setting, the framework needs to cope with additional difficulties compared to a laboratory setting. It must be able to monitor animals in large enclosures with low camera resolutions and varying light conditions. As the positioning of the cameras needs to be adapted to the specific enclosure requirements, the viewing angle on the animals may vary. Therefore, detection and identification methods must be pose-invariant and robust to occlusions of parts of the animals. Additionally, the framework should be applicable to different species; hence, relying on species-specific features such as unique coat patterns [13,14,15] for the identification of individuals is not ideal.
Very few state-of-the-art approaches provide a solution for automating the whole manual observation process. Table 1 provides an overview of current video-based behavior monitoring frameworks, analyzed with respect to the aforementioned specific challenges of the zoo setting. The frameworks that come closest to solving the problem under discussion are Blyzer [16], idTracker [17], and GroupTracker [18]. Blyzer is designed to detect animals of one species and outputs trajectories for further analysis. The image quality requirements for the camera are modest, yet the camera must be positioned to provide a top-down viewing angle. This specific positioning requirement as well as the inability to identify individuals severely limit the potential of this approach. The frameworks idTracker and GroupTracker are the only ones able to identify individuals for trajectory analysis. However, the limitation remains that these approaches only work in a laboratory setting: only animals that remain visible in the same pose and show a high contrast to the background can be monitored. In summary, despite the great potential provided by recent deep learning developments, only a few frameworks exist that automate every step of the monitoring process, none of which solves the specific challenges of the presented zoo setting. Our work aims to close this gap in research.
To the best of our knowledge, we propose the first automated video-based framework for behavior monitoring of individual animals in a zoo setting. It is based on state-of-the-art deep learning algorithms and constitutes a step towards a non-invasive, fully automated animal observation system. Our framework takes raw videos of the animals in their enclosure as input and outputs individual trajectories as well as basic statistics on the animals' behavior. The framework consists of four main stages. First, the animals are located in the video (object detection, stage 1) and the identity of each animal is determined (classification, stage 2). Then, the positions of the animals are transformed from the camera plane to a map of the enclosure (coordinate transformation, stage 3) for a meaningful interpretation. In the last step, the individual trajectories are analyzed (trajectory analysis, stage 4). Finally, we present a graphical user interface that allows biologists and animal keepers easy access to the data and the statistics. A schematic overview of the proposed framework is depicted in Figure 1. Since no comparable framework exists [25], we compare the performance to previous manual observation methods.
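To make the data flow between these four stages concrete, the following minimal Python sketch chains them for a stream of video frames. All function and variable names (detector, classifier, homography, analyzer) are illustrative placeholders under our assumptions, not the actual implementation.

```python
# Hypothetical sketch of the four-stage pipeline described above.
# All names are illustrative placeholders, not the actual implementation.
from dataclasses import dataclass, field

@dataclass
class Detection:
    frame_idx: int
    box: tuple                       # (x_min, y_min, x_max, y_max) in image pixels
    identity: str = ""               # filled in by the classification stage
    map_xy: tuple = field(default=())  # filled in by the coordinate transformation stage

def run_pipeline(frames, detector, classifier, homography, analyzer):
    trajectories = {}                                  # identity -> list of map positions
    for idx, frame in enumerate(frames):
        for box in detector(frame):                    # (1) object detection
            det = Detection(idx, box)
            det.identity = classifier(frame, box)      # (2) individual classification
            det.map_xy = homography(box)               # (3) camera plane -> enclosure map
            trajectories.setdefault(det.identity, []).append(det.map_xy)
    return analyzer(trajectories)                      # (4) trajectory analysis / statistics
```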
We evaluate the proposed framework on polar bears (Ursus maritimus). This species is particularly challenging, as individuals lack prominent distinct visual features. A limitation of this approach is that our study includes only two individual animals, which means that the classification problem is limited to two classes. However, polar bears are kept in small numbers at each zoo, making our approach representative of other institutions. Monitoring the animal welfare of polar bears is of particular concern, as they are prone to abnormal behaviors under human care [6,26]. Skovlund et al. [27] analyzed 46 publications to identify and validate animal-welfare-based indicators for polar bears. Individual activity and inactivity patterns, monitored over time and interpreted in context with husbandry and environmental conditions, are identified as promising indicators for polar bears and are recommended for further research [27]. The framework we propose allows the first continuous monitoring of these parameters.
In summary, our contribution is a video-based framework explicitly designed to monitor individual animals in a zoo setting. For that, we use state-of-the-art deep learning models. We evaluate this framework on a newly created dataset of polar bears. Finally, we provide this extensively annotated dataset consisting of 4450 images, including a suitable method for aggregating annotations made by any number of experts.
2. Dataset
For the purposes of implementing and evaluating the proposed framework, a dataset consisting of 4450 images showing polar bears under human care was collected. Note that while the detection of polar bears could simply exploit a pre-trained model, we still need to collect data to perform the identification of individuals. The images were taken at the polar bear enclosure at Nuremberg Zoo, which is home to two mature animals (Vera, female adult, and Nanuq, male adult). An example image including both animals is shown in Figure 2.
2.1. Data Collection
The polar bear exhibit at Nuremberg Zoo consists of two indoor and two outdoor enclosures used to keep the polar bears seasonally separate (typically from August to February). However, the enclosures can be set up to allow the polar bears to share the outdoor areas during the mating season (March to June) or until intraspecific aggression occurs. Three video cameras continuously monitor both outdoor enclosures. They are aligned so that the visitor areas are not recorded, resulting in unrecorded areas where the animals' behavior cannot be evaluated. The cameras acquire videos at a frame rate of 12.5 frames per second and a fixed resolution. For the aim of this project, a period of five days of data (27 April–1 May 2020) was selected. During this period, the polar bears shared both enclosures and thus might both be present in a single image. A total of 4450 frames were randomly selected and stored for further labeling. Three biologists annotated all images to provide labels of high quality by assigning labeled bounding boxes to the animals visible in the picture.
2.2. Accordance Metric for Multiple Annotators
Aggregating labels from multiple experts requires a suitable metric for annotation quality assessment. The most commonly used evaluation metric for bounding box annotations is the
Intersection over Union (
IoU) [
28]. However, it can only be used to compare two annotated areas, e.g., a network prediction and a ground truth label. Literature provides some modified versions of the IoU metric for different purposes (e.g., [
29] or [
30]), none of which are applicable for our labeling setting with several competing biologists. Therefore, we propose a modified IoU-based accordance metric for competitive bounding box labeling of more than one expert with unique classes:
Consider $K$ experts and $M$ unique classes (e.g., animal identities). Every annotator $k \in \{1,\dots,K\}$ creates a bounding box $B_{k,m}$ for each class $m \in \{1,\dots,M\}$. If one class is not present in the image or the annotator does not find it, we consider $B_{k,m}$ an empty box. Pairwise comparison of two annotations of the same class $m$ by two annotators $k$ and $l$ is provided using the IoU metric:
$$\mathrm{IoU}(B_{k,m}, B_{l,m}) = \frac{|B_{k,m} \cap B_{l,m}|}{|B_{k,m} \cup B_{l,m}|}.$$
Based on this, we can calculate the accordance rate $R$ for each dataset instance. For each class $m$, we calculate the respective pairwise IoU of two annotations and divide the sum by the number of all pairwise comparisons ($M$ classes and $K(K-1)/2$ pairwise comparisons between the different observers) for normalization:
$$R = \frac{2}{M\,K(K-1)} \sum_{m=1}^{M} \sum_{k=1}^{K-1} \sum_{l=k+1}^{K} \mathrm{IoU}(B_{k,m}, B_{l,m}).$$
$R \in [0,1]$ is calculated for each instance, where $R = 1$ is a perfect score, meaning that all bounding boxes for each class align perfectly. In the case of $R = 0$, either the bounding boxes do not overlap, or the labels of overlapping boxes do not match.
Compared to the original IoU, the proposed accordance metric is suitable for situations in which multiple annotators compete for ground truth. Instances with $R$ below a chosen threshold should be discussed collaboratively.
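As an illustration, a minimal Python sketch of this accordance computation could look as follows, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples and an empty box is represented by None. Treating agreement on an absent class as full accordance is an assumption made here for the sketch; the exact handling of empty boxes may differ in the original implementation.

```python
from itertools import combinations

def iou(a, b):
    """IoU of two boxes (x_min, y_min, x_max, y_max). None denotes an empty box;
    two empty boxes are treated as full agreement (an assumption for this sketch)."""
    if a is None and b is None:
        return 1.0
    if a is None or b is None:
        return 0.0
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def accordance_rate(annotations, classes):
    """annotations: list of K dicts, one per annotator, mapping class name -> box or None."""
    pairs = list(combinations(annotations, 2))          # K(K-1)/2 annotator pairs
    total = sum(iou(a.get(m), b.get(m)) for m in classes for a, b in pairs)
    return total / (len(classes) * len(pairs))          # normalize by M * K(K-1)/2
```

For our setting with three annotators and two identities, `classes` would be {"Vera", "Nanuq"} and `annotations` a list of three such dictionaries per image.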
2.3. Labeling Process
Annotation data were acquired in a two-step process performed by three trained biologists. In the first step, they labeled each image in a competitive process (using the EXACT labeling tool [31]), where each expert created bounding box annotations for each animal, including a label for its identity, without knowing the annotations made by the other experts. The global accordance metric was computed after the first labeling round. In the second stage, those instances with an accordance rate below the chosen threshold were collaboratively discussed. In the case of agreement, the labels were changed. After this process was finished, a high overall accordance rate was achieved, showing a very high consistency in the labeled dataset (see Figure 3).
2.4. Dataset Statistics
As the 4450 images were randomly selected from the video data, only 2099 instances show one or more animals. For most algorithms, empty images do not affect the training process, but they may still be of value depending on the method used; hence, the provided dataset also contains these images. Of the non-empty images, 167 show both animals and 1932 show only one. In total, 2266 bounding boxes are provided, 1082 for the male and 1184 for the female bear. We provide the data, including the label information, under a public license (see Data Availability Statement).
5. Discussion
The proposed framework’s performance was evaluated in six experiments. Experiments 1 to 4 were designed to investigate the ability of the object detection and classification stages to find polar bears and identify individuals. Experiments 5 and 6 assessed the quality of the coordinate mapping from the camera plane to the map of the enclosure. The results of all six experiments will be discussed in the following.
5.1. Object Detection and Classification
Experiment 1 assessed the performance of YOLO on the task of finding the class polar bear in the images. The F1 scores at the evaluated IoU thresholds indicate an acceptable performance regarding this project's scope. The mean IoU over all folds corresponds to a mean positional deviation in the centimeter range compared to ground truth (experiment 5), which is within a reasonable range for the aim of this project. Only for the strictest IoU threshold, which lies above the mean IoU, is the F1 score significantly lower.
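For reference, the sketch below shows one common way to obtain such detection F1 scores: predicted boxes are greedily matched one-to-one to ground-truth boxes at a given IoU threshold, and precision, recall, and F1 are derived from the match counts. This is a generic evaluation sketch (the `iou_fn` argument could be the `iou` helper from the example in Section 2.2), not the project's exact evaluation code.

```python
def detection_f1(preds, gts, iou_thr, iou_fn):
    """Greedy one-to-one matching of predicted boxes to ground-truth boxes at an IoU threshold."""
    unmatched = list(gts)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou_fn(p, g), default=None)
        if best is not None and iou_fn(p, best) >= iou_thr:
            tp += 1
            unmatched.remove(best)
    fp, fn = len(preds) - tp, len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```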
In experiment 2, different state-of-the-art classification models were tested. The resulting F1 scores ranged from the lowest value, achieved by ResNet101, to the highest, achieved by both ResNet18 and MobileNetV2. Since the presented framework can be used to evaluate large periods of video data, the inference time also needs to be considered. In this respect, ResNet18 showed the best performance, which is why this model was chosen for the final version of the framework. The models evaluated in this experiment showed the best performance in the first two folds in almost every run (see Table 4). A possible explanation is that the data show no special features on the first two days. On the third day, the male animal stays comparatively often in areas very far away from the camera, while bushes often occlude the female; both animals are also recorded standing on this day. On day four, the male is again more often obscured by bushes. On day five, the image quality is negatively influenced by strong sunlight, and both animals swim more often in the water area of the enclosure on this day. The described peculiarities explain the decrease in performance for folds three to five because, on these days, situations occur that the models did not see during training. The fact that the performance decrease for these folds remains within an acceptable range shows the ability of the framework and the individual models to generalize and deal well with unseen particularities.
Experiments 3 and 4 analyzed the combined performance of the detection and classification stages. The precision for both experiments was >0.90 and thus very high. The framework consisting of YOLO and ResNet18 achieved a precision of 92%, meaning that of all predicted instances, only 8% are incorrect. These are cases where either a polar bear is found but the wrong identity is assigned, or an object from the background is incorrectly identified as a polar bear. In both cases, the outlier can be corrected by applying simple filtering methods, since there is no spatio-temporal proximity to another detection of the same class. The influence of these erroneous instances on the overall performance of the framework is thus not problematic for the scope of this project. Note that the precision increases for a higher IoU threshold: the more precisely the animal is located during object detection, the better it can be classified.
The framework consisting of YOLO and ResNet18 achieves a recall of 83.2%. Thus, about 17% of all existing animal instances are not found. At a frame rate of 12.5 frames per second, the information about where the animals are located is missing only on 2–3 frames per second on average. This is not problematic and can easily be corrected by simple interpolation. The recall is about 5 to 6% higher for experiment 3 compared to experiment 4, meaning that using YOLO alone results in about 5–6% fewer animals being found.
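The sketch below illustrates both corrections on a per-frame position sequence: isolated detections without spatio-temporal support are dropped, and the resulting short gaps are filled by linear interpolation. The distance threshold and the representation of missing frames as None are assumptions for illustration, not the framework's actual parameters.

```python
import numpy as np

def drop_isolated(positions, max_jump=5.0):
    """Remove detections whose distance to both temporal neighbors exceeds max_jump
    (an illustrative threshold in meters), i.e. outliers without spatio-temporal support."""
    out = list(positions)
    for i in range(1, len(positions) - 1):
        p, prev, nxt = positions[i], positions[i - 1], positions[i + 1]
        if p is None or prev is None or nxt is None:
            continue
        if (np.hypot(p[0] - prev[0], p[1] - prev[1]) > max_jump and
                np.hypot(p[0] - nxt[0], p[1] - nxt[1]) > max_jump):
            out[i] = None
    return out

def interpolate_gaps(positions):
    """Fill missing per-frame (x, y) positions (None) by linear interpolation between detections."""
    xs = np.array([p[0] if p is not None else np.nan for p in positions], dtype=float)
    ys = np.array([p[1] if p is not None else np.nan for p in positions], dtype=float)
    idx = np.arange(len(positions))
    valid = ~np.isnan(xs)
    xs[~valid] = np.interp(idx[~valid], idx[valid], xs[valid])
    ys[~valid] = np.interp(idx[~valid], idx[valid], ys[valid])
    return list(zip(xs, ys))
```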
The results show that by combining YOLO (for object detection) and ResNet18 (for classification) in experiment 3, as well as by training YOLO to solve both tasks simultaneously in experiment 4, we achieved F1 scores of more than 80%. However, YOLO alone performs about 3 to 4% worse. The results also show that the male individual, Nanuq, is detected slightly less accurately, because difficult instances occur more frequently for him than for Vera: he often lies in a sandpit distant from the camera, and he is more often occluded or standing on his feet. Some examples of these difficulties are depicted in Figure 6.
Even though the difference in performance between the two approaches is rather small, using the two-step method still has its merits. One reason is that the framework is designed for easier adaptation to new zoos: if another institution wants to use the framework, the labeling effort is reduced, because the classifier is easier to train than the object detector, which can be used pre-trained as it is. Another argument for the two-stage approach is that YOLO does not use the full resolution of the image due to its optimization for fast computation times, whereas classifiers such as ResNet can use the full image resolution. This approach is also more reasonable for applying the framework in cases where more than two animals share an enclosure and classification thus becomes more complex. It can be concluded that the performance evaluation in experiments 1–4 shows that the first two stages of the framework can effectively detect and identify individual animals.
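A hedged sketch of this two-stage idea follows: the classifier receives a full-resolution crop of each detected box. The use of torchvision's ResNet-18, the input size, and the class ordering are illustrative assumptions, not the exact training configuration used here.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet18

# Illustrative two-stage setup: detector runs on a downscaled frame,
# while the classifier sees a full-resolution crop of each detection.
classifier = resnet18(num_classes=2)   # two identities; weights assumed to be trained beforehand
classifier.eval()
to_tensor = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor()])

def identify(frame, boxes):
    """frame: full-resolution HxWx3 uint8 array; boxes: (x_min, y_min, x_max, y_max) from the detector."""
    identities = []
    for x0, y0, x1, y1 in boxes:
        crop = frame[int(y0):int(y1), int(x0):int(x1)]        # full-resolution crop of the detection
        with torch.no_grad():
            logits = classifier(to_tensor(crop).unsqueeze(0))
        identities.append(["Vera", "Nanuq"][int(logits.argmax())])  # illustrative class order
    return identities
```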
5.2. Coordinate Mapping
The main aim of the experiments on coordinate mapping was to assess the quality of localizing the animals within their enclosure. This stage aims to achieve the smallest possible deviation of the predicted position from the actual position, which we define as the center of the polar bear's body. Due to the small number of cameras available and their limited viewing angles, the exact body center cannot be determined in every enclosure area. However, this deviation must always be considered in relation to the animal's size. As male polar bears reach a length of 2.00–2.50 m from nose tip to tail tip [39], deviations within this order of magnitude do not significantly influence the quality of the coordinate mapping with regard to the overall objective of monitoring animal behavior. Additionally, we need to compare the framework's positioning accuracy with previous manual observation methods. Since pinpoint localization is not possible with manual observation, enclosures are typically divided into either equidistant grids or suitable segments (depending on the specific conditions), and the animals' positions are then manually assigned to the respective area. When manually observing the polar bears at Nuremberg Zoo, the enclosure was thus divided into 34 segments; the mean segment width therefore limits the precision of the manual localization.
The object of investigation in experiment 5 was how the bounding box predicted by the framework deviates from the ground truth. As we define the position of the animal to be the center of the lower edge of the bounding box, this deviation also shows in the transformed coordinates. The resulting mean deviation in the centimeter range shows that this error is well within the polar bears' body dimensions. It is also significantly more precise than the previously applied manual observation methods, whose precision is limited by the width of the enclosure segments. These results show that the deviation introduced by the object detection stage does not significantly affect the overall performance of the framework with respect to the biological research questions.
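The position convention and the subsequent mapping can be summarized in a short sketch: the bottom-center of the bounding box is projected onto the enclosure map with a camera-specific 3x3 homography matrix H, which we assume has been estimated beforehand (e.g., from marked reference points, for instance with OpenCV's cv2.findHomography).

```python
import numpy as np

def box_to_map(box, H):
    """Project a detection onto the enclosure map.
    box: (x_min, y_min, x_max, y_max) in image pixels; H: 3x3 image-to-map homography."""
    x_min, y_min, x_max, y_max = box
    foot = np.array([(x_min + x_max) / 2.0, y_max, 1.0])  # center of the lower box edge (homogeneous)
    mapped = H @ foot
    return mapped[:2] / mapped[2]                          # de-homogenize to map coordinates
```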
In experiment 6, we investigated the systematic error induced by the coordinate mapping via the homography matrices. Compared to a reference measurement obtained with two laser distance meters, the mean deviation of the mapped points again lies within the dimensions of the animal. Figure 5 shows that the error can be considered a constant offset within the respective segments of the enclosure and thus does not significantly contribute to the calculation of the total distance. When comparing the length of the ground truth trajectory to the length of the framework's prediction, the difference was marginal for both enclosures. Thus, the deviation of the framework's output from the actual position is almost negligible for the calculation of the distance traveled. The error is only relevant when considering the probability distribution of the animal's position in the enclosure. The induced offset depends on the enclosure area, as the homography matrices are more precise in closer proximity and frontal plan view. Still, the enclosure usage can be analyzed more precisely than with the segment-based precision achieved by manual observation techniques.
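For completeness, the distance traveled discussed here reduces to summing Euclidean step lengths along the mapped trajectory, as in the following minimal sketch; a constant offset per enclosure segment largely cancels out in these frame-to-frame differences.

```python
import numpy as np

def distance_traveled(trajectory):
    """Total path length of a trajectory given as (x, y) map positions in meters."""
    pts = np.asarray(trajectory, dtype=float)
    return float(np.linalg.norm(np.diff(pts, axis=0), axis=1).sum())
```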
A limitation introduced by this approach is that the topology of the enclosure is not incorporated into the trajectory calculation, as there were not enough cameras to provide depth information.
In summary, the experiments show that no significant errors are introduced by this approach to coordinate mapping. The deviations are within a reasonable range with respect to the animal size. Furthermore, significantly more precise trajectories can be achieved than with previous manual observations. The possibility of determining the distances traveled by the animals is an insight into behavior that manual observation methods cannot provide. Thus, this method is suitable for effectively tracking the position of observed animals, even with a limited number of available cameras.
6. Conclusions
Measuring animal behavior is an important method in animal welfare research, especially when combined with physical and physiological parameters [7,9,27]. We propose a deep learning framework for non-invasive behavior monitoring of individual animals under human care. We provide a tool to indicate the spatio-temporal usage of an individual's habitat area, which allows the analysis of individual activity and inactivity patterns as well as locomotion distances. These parameters are measured reliably, objectively, and repeatedly with a reproducible method. Therefore, the well-known limitations of animal behavior observation by human observers [7,9,25] concerning time restrictions and observer bias are overcome by our framework. Our experiments on polar bears show that the presented framework improves on the current manual observation methods in all aspects. We allow biologists and animal caretakers to overcome time-consuming observation and thus to expand their datasets to a 24/7 time scale. This detailed insight into an animal's daily routine is an important step towards ensuring animal welfare on a 24/7 time scale from birth to death [3].
Even if only basic behavior categories are analyzed, the data collected by the framework are of great use. For example, the effect of certain enclosure changes or management measures that aim to increase activity could be investigated with our continuous monitoring framework. Additionally, an analysis of individual activity and inactivity patterns throughout the year is an important part of analyzing behavior in relation to environmental influences such as temperature, day length, weather, or visitor numbers. Furthermore, physiological parameters such as the stress hormone cortisol, which can be measured retrospectively over weeks in hair [40], can be used in combination with behavioral data to better interpret the animal's condition. Thus, this framework provides another essential part of the matrix available to analyze behavior precisely and objectively. As a next step, defining activity-inactivity ratios or walking distances characteristic of a specific individual on a seasonal time base will make the framework useful as an early-warning system for animal keepers if unexpected changes in daily values appear. Thus, it also represents a suitable tool for evaluating welfare and enhances the interpretation of physiological data. Future work should investigate the transferability of this framework to a broad range of other individuals and animal species within different terrains. In particular, future studies might consider transferring our trained object detector model to other zoos to analyze the performance on other polar bears, which requires re-training the identification stage and thus labeling a new dataset. Although the framework itself is species- and enclosure-independent, the general performance will be influenced by the boundary conditions of the specific situation, including camera angle and resolution, species, and enclosure size. This should be an object of future investigation.