1. Introduction
Long-term monitoring of wildlife plays an important role in biodiversity conservation research [1]. Camera traps have been widely used in wildlife monitoring due to their non-invasive nature and low cost [2]. Camera trap data are used in many studies [3,4] for animal behavior identification [3] and abundance estimation [4]. These studies depend on identifying the species pictured in camera trap images, so more accurate recognition provides a better starting point for such analyses.
Wildlife identification research based on deep learning has attracted increasing attention due to its ability to automatically extract wildlife-related information and to process large numbers of images effectively [5]. Gomez Villa et al. [6] selected 26 animals from the Snapshot Serengeti dataset and evaluated the potential of deep convolutional neural network frameworks such as AlexNet, VGGNet, GoogLeNet, and ResNet for the species identification task. The top-1 recognition accuracy was 35.4% when the training dataset was unbalanced and contained empty images, and 88.9% when the dataset was balanced and the images contained only foreground animals. Zualkernan et al. [7] proposed an edge-side wildlife recognition architecture using the Internet of Things (IoT) and the Xception model to recognize wildlife images captured by camera traps and transmit the recognition results in real time to a remote mobile application. Furthermore, a comparison of the accuracy of VGG16, ResNet50, and self-trained networks in recognizing animal species such as snakes, lizards, and toads in camera-captured images demonstrated that both ResNet50 and VGG16 trained using transfer learning outperform the self-trained model [8].
To enhance the performance of wildlife recognition, Xie et al. [9] proposed an integrated SE-ResNeXt model based on a multi-scale animal feature extraction module and a vision attention module, which enhanced the feature extraction capability of the model and improved the recognition accuracy on a self-constructed wildlife dataset from 88.1% to 92.5%. Yang et al. [10] improved the accuracy of YOLOv5s from 72.6% to 89.4% by introducing a channel attention mechanism and a self-attention mechanism. Zhang et al. [11] designed a deep joint adaptation network for wildlife image recognition, which improved the generalization ability of the model in open scenarios and increased the recognition accuracy of 11 animal species from 54.6% to 58.2%.
In addition to modifying the network structure and training strategy, some studies have improved model recognition performance through data enhancement methods. Ahmed et al. [12] used camera-captured images with noisy labels, in which some of the correct labels in the training set were replaced with wrong labels, to classify animals and improved recognition accuracy by selecting the largest prediction from multiple trained networks. Zhong et al. [13] proposed a data enhancement strategy that integrates image synthesis and regional background suppression to improve the performance of wildlife recognition and combined it with a model compression strategy to provide a lightweight recognition model that enables real-time monitoring on edge devices. Tan et al. [14] evaluated the YOLOv5, Cascade R-CNN, and FCOS models using daytime and nighttime camera trap data, demonstrating that models trained jointly on day and night data can improve the accuracy of animal classification compared to models trained only on nighttime data.
Currently, most wildlife recognition methods use only camera trap image data for classification. However, limited by factors such as the shooting angle, animal pose, background environment, and lighting conditions, some animals are difficult to distinguish from camera trap images alone. In animal identification tasks using citizen science images, contextual information such as the climate, date, and location that accompanies the acquisition of citizen science imagery is used to identify the wildlife. Terry et al. [15] developed a multi-input neural network model that fuses contextual metadata and images to identify ladybird species in the British Isles, UK, demonstrating that deep learning models can effectively use contextual information to improve the top-1 accuracy of multi-input models from 48.2% to 57.3%. de Lutio et al. [16] utilized the spatial, temporal, and ecological contexts attached to most plant species’ observations to construct a digital taxonomist that improved accuracy from 73.48% to 79.12% compared to a model trained using only images. Mou et al. [17] used animals’ visual features, for example, the color of a bird’s feathers or an animal’s fur, to improve the recognition accuracy of a contrastive language–image pre-trained (CLIP) model on multiple animal datasets. Camera traps in national parks are capable of monitoring wildlife continuously for long periods of time with reduced human intervention and can provide complete information on animal rhythms. However, the use of animal rhythm information to aid wildlife recognition in camera trap images in national park scenes has not yet been explored.
Along with the massive camera trap image collection process, we obtain a large amount of temporal metadata, including the date and time. These temporal metadata can reflect the activity rhythms of animals [15,18,19]. The quantity of data collected on various days throughout the year fluctuates due to variations in animal activity levels. Specifically, as shown in Figure 1, the number of camera trap images capturing kudus is greatest in August, whereas the highest number of camera trap images featuring blesboks is observed in October. Furthermore, the amount of data acquired during daylight hours varies due to distinct circadian rhythms. Based on the animal activity patterns fitted to the images captured by the camera traps, springboks were found to be most active from 06:00 h to 07:00 h, while kudus were most active from 17:00 h to 18:00 h.
To investigate whether fusing temporal information can improve wildlife recognition performance, we designed a neural network that combines wildlife image features and temporal features, named Temporal-SE-ResNet50. First, we utilized the ResNet50 model and introduced SE attention to extract wildlife image features. Then, we obtained the temporal metadata from the wildlife images. After applying cyclical encoding, which uses sine–cosine mapping to handle periodic data such as the date and time, the temporal features were extracted by a residual multilayer perceptron (MLP) network. Finally, the wildlife image features and temporal features were fused by a dynamic MLP module to obtain the final recognition results.
Our contribution includes the following three parts:
We utilized the temporal metadata of camera trap images to aid wildlife recognition and found that extracting temporal features after cyclically encoding the date and time, respectively, can effectively improve the accuracy of wildlife recognition, providing a new way to use animal domain knowledge, such as animal rhythms, to improve wildlife image recognition;
We proposed a wildlife recognition framework called Temporal-SE-ResNet50 that fuses image features and temporal features, which uses an SE-ResNet50 network to extract image features, a residual MLP network to extract temporal features, and then uses a dynamic MLP module to fuse the above features together;
We conducted extensive experiments on three national park camera trap datasets and demonstrated that our method is effective in improving wildlife recognition performance.
The remainder of this paper is organized as follows:
Section 2 describes the framework used for fusing image features and temporal features, including SE-ResNet50 for extracting image features, a residual MLP network for extracting temporal features, and a dynamic MLP module for fusing image features and temporal features.
Section 3 describes the data sources, experimental setup, and evaluation metrics.
Section 4 presents the experimental results.
Section 5 discusses the results and future research directions.
Section 6 presents the conclusions.
2. Methods
In this section, we introduce Temporal-SE-ResNet50, a wildlife recognition framework that fuses image and temporal information. As shown in Figure 2, the overall framework of Temporal-SE-ResNet50 consists of four stages. In the image feature extraction stage, we construct SE-ResNet50 to extract wildlife features from images more effectively than ResNet50 (see Section 2.1); in the temporal metadata acquisition stage, we obtain the temporal metadata from each camera trap image; in the temporal feature extraction stage, we obtain the corresponding temporal features of the image through the residual MLP network (see Section 2.2); and in the image feature and temporal feature fusion stage, after obtaining the image features and temporal features, we use the dynamic MLP module to fuse the two and obtain an enhanced image representation (see Section 2.3). In the end, we obtain the recognition results for different species of wildlife.
2.1. Camera Trap Image Feature Extraction
Wildlife images obtained by camera traps in natural scenes are usually affected by lighting conditions, animal behaviors, shooting angles, backgrounds, etc., making recognition challenging. Therefore, we designed the SE-ResNet50 model based on ResNet [21] and the squeeze-and-excitation network (SENet) [22] to extract wildlife image features. The structure of the SE-ResNet50 model is shown in Figure 3.
ResNet has shown excellent performance in numerous previous wildlife image recognition studies [6,8]. Considering computational cost and network complexity, we chose the 50-layer ResNet as the base network. ResNet50 starts with a regular convolutional layer for initial feature extraction from the input images, followed by four residual blocks. As shown in Figure 3, each residual block consists of multiple stacked BottleNeck blocks, where each BottleNeck block typically incorporates multiple convolutional layers that help the model extract different animal features from the input data; e.g., the shallow residual blocks extract simple features such as contours, and the deep residual blocks extract detailed features such as tails and hairs. To maintain the integrity of the information and to address the problem of gradient degradation during model training, skip connections are used in every BottleNeck block. These connections allow input features to be added directly to the output, ensuring that key details are preserved during the learning process. Subsequently, global average pooling converts the feature map into a fixed-length representation, summarizing the important features over the entire spatial dimension. Finally, the resulting fixed-length vector is passed to the fully connected layer, which performs the specific classification task.
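To make the residual structure concrete, the following is a minimal PyTorch-style sketch of a BottleNeck block with a skip connection. The layer widths, normalization, and downsampling rule follow the standard ResNet design and are assumptions for illustration rather than the exact configuration of our implementation.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Minimal ResNet BottleNeck block: 1x1 -> 3x3 -> 1x1 convolutions plus a skip connection."""
    expansion = 4

    def __init__(self, in_channels: int, width: int, stride: int = 1):
        super().__init__()
        out_channels = width * self.expansion
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, width, 1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, stride=stride, padding=1, bias=False), nn.BatchNorm2d(width), nn.ReLU(inplace=True),
            nn.Conv2d(width, out_channels, 1, bias=False), nn.BatchNorm2d(out_channels),
        )
        # Project the input when its shape does not match the block output.
        self.shortcut = (
            nn.Sequential(nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                          nn.BatchNorm2d(out_channels))
            if stride != 1 or in_channels != out_channels else nn.Identity()
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Skip connection: add the (possibly projected) input to the convolutional output.
        return self.relu(self.conv(x) + self.shortcut(x))
```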
Recently, a large number of studies [23,24] have demonstrated that adding an attention mechanism can effectively enhance the ability of convolutional neural networks to extract key features. Given that camera trap images usually have complex backgrounds and varied lighting conditions, in order to further enable the model to focus on key animal regions and ignore irrelevant information, we introduce the SE attention module. This module uses global pooling to compress the global spatial information and then learns the importance of each channel in the channel dimension. As shown in Figure 4, it is divided into three operations: squeeze, excitation, and reweight.
The squeeze operation compresses the input feature map $U \in \mathbb{R}^{H \times W \times C}$ along the spatial dimensions using global average pooling and obtains the feature map $z \in \mathbb{R}^{1 \times 1 \times C}$, which represents the global distribution of the responses on the feature channels:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$

The excitation operation explicitly models the correlation between the feature channels using two fully connected layers. To reduce the computational effort, the first fully connected layer compresses the channel dimension C by a ratio of r, and the second fully connected layer then restores it to C. The channel weight vector $s$ is then obtained using Sigmoid activation:

$$s = F_{ex}(z, W) = \sigma\big(W_2 \, \delta(W_1 z)\big)$$

where $\sigma$ denotes the Sigmoid activation function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, $\delta$ denotes the ReLU activation function, and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$.

The reweight operation obtains the final feature map $\tilde{X}$ by multiplying the channel weights with the original input feature map:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$$
We introduced the SE attention module into each residual structure and constructed the SE-ResNet50 model, which pays more attention to key wildlife features and suppresses other unimportant features.
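The following PyTorch-style sketch illustrates the squeeze, excitation, and reweight operations described above. The reduction ratio r = 16 is a common default and is an assumption here, not necessarily the value used in our experiments.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block: squeeze -> excitation -> reweight."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # compress channels by ratio r
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # restore to C channels
            nn.Sigmoid(),                                # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        z = self.pool(x).view(b, c)          # (B, C): global channel descriptor
        s = self.fc(z).view(b, c, 1, 1)      # (B, C, 1, 1): learned channel weights
        return x * s                         # reweight: scale each channel of the input
```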
2.2. Temporal Feature Extraction
Since dates have a cyclical nature, the end of one year and the start of the next year should be close to each other. Therefore, we use sine–cosine mapping [16] to encode the date metadata captured by the camera trap as (d1, d2) according to Equations (5) and (6). With this cyclical encoding, December 31st and January 1st are mapped near each other.

$$d_1 = \sin\left(\frac{2\pi d}{365}\right) \quad (5)$$

$$d_2 = \cos\left(\frac{2\pi d}{365}\right) \quad (6)$$

where d denotes the day of the year and 365 represents the total number of days in a year, or 366 in a leap year.
Correspondingly, we also use the sine–cosine mapping to encode the time metadata captured by the camera trap as (t1, t2) according to Equations (7) and (8). With this cyclical encoding, 23:59 and 0:00 are also mapped near each other.

$$t_1 = \sin\left(\frac{2\pi t}{1440}\right) \quad (7)$$

$$t_2 = \cos\left(\frac{2\pi t}{1440}\right) \quad (8)$$

where t denotes the minute of the day and 1440 represents the total number of minutes in a day.
Finally, the date and time metadata are combined to form our temporal information (d1, d2, t1, t2).
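As a concrete illustration, the Python sketch below (with a hypothetical encode_temporal helper) applies the sine–cosine mapping of Equations (5)–(8) to a capture timestamp; the leap-year handling is an assumption.

```python
import math
from datetime import datetime

def encode_temporal(timestamp: datetime) -> tuple[float, float, float, float]:
    """Cyclical (sine-cosine) encoding of the capture date and time, following Equations (5)-(8)."""
    year = timestamp.year
    days_in_year = 366 if (year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)) else 365
    day = timestamp.timetuple().tm_yday               # day of year, 1..365/366
    minute = timestamp.hour * 60 + timestamp.minute   # minute of day, 0..1439

    d1 = math.sin(2 * math.pi * day / days_in_year)
    d2 = math.cos(2 * math.pi * day / days_in_year)
    t1 = math.sin(2 * math.pi * minute / 1440)
    t2 = math.cos(2 * math.pi * minute / 1440)
    return d1, d2, t1, t2

# December 31st 23:59 and January 1st 0:00 map to nearby points on the unit circle.
print(encode_temporal(datetime(2022, 12, 31, 23, 59)))
print(encode_temporal(datetime(2023, 1, 1, 0, 0)))
```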
Inspired by PriorsNet [25], we extract temporal features with a residual MLP network. As shown in Figure 5, it is a fully connected neural network that consists of two fully connected layers and four fully connected residual layers. Each fully connected residual layer contains two fully connected layers, two ReLU activation functions, and a Dropout layer. By feeding the temporal information into this network, we obtain the feature mapping of the temporal information. The dimension of the temporal features is set to 256, which achieves the best performance, as described by Tang et al. [26].
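A minimal PyTorch-style sketch of the residual MLP described above is given below; the dropout rate and the exact placement of the activations inside each residual layer are assumptions.

```python
import torch.nn as nn

class ResidualFC(nn.Module):
    """One fully connected residual layer: two FC layers, two ReLUs, and a Dropout layer."""
    def __init__(self, dim: int = 256, dropout: float = 0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(dim, dim), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.block(x)   # residual connection over the two FC layers

class TemporalMLP(nn.Module):
    """Residual MLP mapping (d1, d2, t1, t2) to a 256-dimensional temporal feature."""
    def __init__(self, in_dim: int = 4, dim: int = 256, num_res_layers: int = 4):
        super().__init__()
        self.input_fc = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(inplace=True))
        self.res_layers = nn.Sequential(*[ResidualFC(dim) for _ in range(num_res_layers)])
        self.output_fc = nn.Linear(dim, dim)

    def forward(self, t):
        return self.output_fc(self.res_layers(self.input_fc(t)))
```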
2.3. Image Feature and Temporal Feature Fusion
After the image features and temporal features are obtained, we use the dynamic MLP module [27] to fuse the wildlife image features and temporal features; the overall structure of the fusion is shown in Figure 6. Since the dimensionality of the image features is much higher than that of the temporal features, for better feature fusion, we first perform a dimensionality reduction operation; i.e., we reduce the dimensionality of the image features to 256. This reduction step exists only in Temporal-ResNet50 and Temporal-SE-ResNet50.
The image features extracted by the convolutional neural network have a three-dimensional shape (H, W, C), where H and W denote the height and width of the feature map, respectively, and C denotes the number of channels, whereas the temporal features are one-dimensional. In a simple splicing scheme, the temporal features are mapped to the same dimension as the image features, the temporal and image features are concatenated along the channel direction, and feature fusion is then performed by a convolutional neural network. Accordingly, in a dynamic MLP module, the image features and temporal features are first concatenated channel-wise and then fed into the subsequent MLP block to generate updated image features and temporal features, respectively. Guided by the temporal features, projection parameters are dynamically generated to adaptively improve the representation of the image features. Finally, after the MLP block, we obtain the fused features.
After stacking two dynamic MLP modules, we obtain the fused wildlife features, whose dimensions are then expanded back to the original image feature dimension. Through a skip connection, the fused features further enhance the representation of the original image features. Finally, the fully connected layer maps the learned features to the output categories to obtain the final recognition result.
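The sketch below gives a simplified PyTorch-style view of one fusion step in the spirit of the dynamic MLP module [27]: the concatenated image and temporal features generate sample-specific projection parameters that are applied to the image feature, followed by a skip connection. The hidden size and the exact parameter-generation scheme are assumptions and differ from the original module's full design.

```python
import torch
import torch.nn as nn

class DynamicMLPFusion(nn.Module):
    """Simplified sketch of one dynamic-MLP-style fusion step (illustrative, not the original module)."""
    def __init__(self, dim: int = 256, hidden: int = 64):
        super().__init__()
        self.hidden = hidden
        self.img_proj = nn.Linear(dim, hidden)
        # Concatenated image + temporal features -> weights of a (hidden x hidden) projection.
        self.weight_gen = nn.Linear(dim * 2, hidden * hidden)
        self.out_proj = nn.Linear(hidden, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat: torch.Tensor, temp_feat: torch.Tensor) -> torch.Tensor:
        b = img_feat.size(0)
        x = self.img_proj(img_feat)                                    # (B, hidden)
        w = self.weight_gen(torch.cat([img_feat, temp_feat], dim=1))   # dynamically generated parameters
        w = w.view(b, self.hidden, self.hidden)
        x = torch.bmm(x.unsqueeze(1), w).squeeze(1)                    # sample-specific projection of image features
        return self.norm(img_feat + self.out_proj(x))                  # skip connection back to the image feature
```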
5. Discussion
In this study, we present a novel method that fuses image and temporal metadata for recognizing wildlife in camera trap images. Our experimental results on different camera trap datasets demonstrate that leveraging temporal metadata can improve overall wildlife recognition performance, which is similar to the findings of Terry et al. [15] and de Lutio et al. [16], who utilized contextual data to enhance the recognition performance of citizen science images.
In recent years, a large number of wildlife recognition studies based on camera trap images have achieved great accomplishments [6,7,8]. These achievements are attributed to large amounts of labeled data. To further improve wildlife recognition performance, the first step is to expand the dataset. However, the process of data annotation is time-consuming and labor-intensive. To solve this problem without expanding the dataset, we consider utilizing information from existing images, such as the observation times of camera traps. The frequency and distribution of random animal encounters in camera traps can be used not only to estimate animal population density [18,19], but also to reflect the activity patterns of animal species [41]. Unlike previous wildlife recognition studies that were based merely on camera trap images, we exploit the observation time of the camera trap images to help interpret the images.
In addition to wildlife recognition using camera traps, species recognition using citizen science images is also a hot research topic. Many studies utilize geographic and contextual metadata to aid species recognition from citizen science images. Chu et al. [42] used geolocation for fine-grained species identification, which improved the top-1 accuracy on iNaturalist from 70.1% to 79.0%. Mac Aodha et al. [25] proposed a method to estimate the probability of species occurrence at a given location using the geographic location and time as a priori knowledge. However, studies utilizing such information in camera trap data are rare. Consistent with previous studies on citizen science images that utilize geographic and contextual metadata, our findings indicate that fusing the temporal information enhanced the baseline accuracy by 0.44%. Unlike these studies, we only considered temporal information and did not utilize geographic data. For the temporal metadata, we further considered date and time separately, where the date corresponds to an animal's seasonal rhythms and the time corresponds to its circadian rhythms. Our experimental results show that cyclically encoding the date and time separately, followed by feature extraction and fusion with the image features, can better leverage the temporal metadata. As for geographic metadata, such data are not always easily accessible, and because the infrared cameras are deployed within a single national park with very little geographic variation, the potential of integrating geo-metadata requires further research.
Attention mechanisms have been widely used in the field of animal detection and recognition owing to their plug-and-play nature and effectiveness. Xie et al. [39] introduced SE attention into YOLOv5 to improve the detection of large mammal species from a UAV viewpoint. Zhang et al. [43] introduced a coordinate attention (CA) module into YOLOv5s to suppress non-critical information on the faces of sheep and recognize sheep identity in real time. Given the varied lighting conditions and backgrounds present in camera trap images, to better extract wildlife features, we compared the performance of the model after adding different attention modules and found that the addition of the SE attention mechanism further improves wildlife recognition by 0.09%.
In general, our proposed method for wildlife recognition has significant advantages, improving on the baseline model by 0.53% on the Camdeboo dataset without additional image data. In addition, on the Snapshot Serengeti and Snapshot Mountain Zebra datasets, our method improved by 0.25% and 0.43%, respectively, compared to the baseline model. These findings have important implications for exploiting the potential of animal rhythms in wildlife identification based on camera trap images. Moreover, the temporal metadata are already collected with the camera trap images and do not add an additional collection burden. However, temporal information is less useful for animals with insufficient training images. Factors affecting wildlife recognition include, in addition to the number of training images, life habits (e.g., the presence of one or several individuals of the same species in a single image as a result of group or solitary living), motion poses (e.g., running causes blurring in the image), and the timing of the shot (e.g., the image contains only a portion of the animal's body); determining the minimum amount of training data required for each animal is therefore relatively complex and deserves in-depth investigation in future work.
In the future, we will focus on two aspects of work. On the one hand, to further explore the potential of fusing temporal information, we will determine the minimum number of images of each wildlife species needed for training. On the other hand, considering that animals living in urban areas can change their activity patterns due to artificial light, food availability, and human activities, we will collect camera trap image datasets covering different environments and contexts to investigate the robustness of the recognition model that fuses temporal metadata under different activity patterns.