Multiobject Tracking of Wildlife in Videos Using Few-Shot Learning

Simple Summary: Video recordings enable scientists to estimate species' presence, richness, abundance, demography, and activity. The increasing popularity of camera traps has led to a growing interest in developing approaches to process images more efficiently. Advanced artificial intelligence systems can automatically find and identify the species captured in the wild, but they are hampered by their dependence on large samples. However, many species, such as endangered species, rarely occur, and only a few samples are available. Building on recent advances in deep learning and few-shot learning, we developed a multiobject-tracking approach for wildlife based on a tracking-by-detection paradigm to improve multiobject-tracking performance. We hope that it will benefit ecology and wildlife biology by speeding up the process of multiobject tracking in the wild.

Abstract: Camera trapping and video recording are now ubiquitous in the study of animal ecology. These technologies hold great potential for wildlife tracking, but are limited by current learning approaches and are hampered by their dependence on large samples. Most species of wildlife are rarely captured by camera traps, and thus only a few samples are available for processing and subsequent identification. These drawbacks can be overcome in multiobject tracking by combining wildlife detection and tracking with few-shot learning. This work proposes a multiobject-tracking approach based on a tracking-by-detection paradigm for wildlife to improve detection and tracking performance. We used few-shot object detection to localize objects in camera-trap and direct video recordings, augmented with synthetically generated image composites subject to spatial constraints. In addition, we introduced a trajectory-reconstruction module for better association.
It could alleviate a few-shot object detector's missed and false detections and optimize target identification between consecutive frames. Our approach produced a fully automated pipeline for detecting and tracking wildlife from video records. The experimental results met theoretical expectations according to various evaluation metrics and revealed the future potential of camera traps for wildlife detection and tracking in behavior and conservation studies.


Introduction
Biodiversity is an essential component and a key element in maintaining the stability of ecosystems. In the face of the current sharp decline in global biodiversity, it is urgent to take adequate measures for prevention and protection. Monitoring and conserving the wildlife that determines biodiversity patterns is a cornerstone of ecology, biogeography, and conservation biology. Therefore, monitoring animal habits and activity patterns during the rewilding training process is essential. Driven by advances in cheap sensors and computer-vision technologies for detecting and tracking wildlife, biodiversity research is rapidly transforming into a data-rich discipline. Video data have become indispensable in the retrospective analysis and monitoring of endangered animal species' presence and behaviors. However, large-scale research is hindered by the time and resources needed to process large amounts of data manually.
Recent technological advances in computer vision have led to wildlife scientists realizing the potential of automated computational methods to monitor wildlife. This ongoing revolution is facilitated by cost-effective, high-throughput wildlife-tracking methods that generate massive high-resolution images across scales relevant to the ecological context in which animals perceive, interact with, and respond to their environment. While applying existing tools is tempting, many potential pitfalls must be considered to ensure the responsible use of these approaches. For example, a large amount of data is required to train these deep-learning models accurately. However, because many species rarely occur, only a few samples are available; thus, the performance is typically low.
Few-shot learning aims to develop the ability to learn and generalize autonomously from a small number of samples. It can rapidly generalize to new tasks containing only a few samples with supervised information. Multiple recent publications have discussed this approach [1][2][3][4][5]. Generally, research on multiobject tracking mainly focuses on how to improve the real-time performance of multiobject tracking [6,7], how to better model the appearance information of the target [8][9][10][11], and how to associate targets efficiently [12][13][14][15]. Multiobject-tracking methods usually follow the tracking-by-detection paradigm. In [7], this method was called separate detection and embedding (SDE). This means that the MOT system was broken down into two steps: (1) locating the target in single video frames; and (2) associating detected targets with existing trajectories. Another multiobject-tracking learning paradigm, JDE, was also proposed. JDE jointly learned the detector and embedding model in a single deep network; in other words, the JDE method used a single network to output both the detection results and the corresponding appearance embeddings of the detected boxes, whereas the SDE method used two separate networks to accomplish these two tasks. JDE was closer to real-time performance, but its tracking accuracy was slightly worse than that of SDE. The performance of a few-shot object detector is not as good as that of YOLO [16][17][18][19], Faster R-CNN [20], and other general object detectors [21,22], so missed detections occur in the object detection of each frame, which significantly affects the multiobject-tracking task. Therefore, to ensure the performance of a multiobject-tracking model driven by a small amount of data, in addition to selecting the SDE paradigm, we also proposed a trajectory-reconstruction module in the data-association part to further optimize the tracking accuracy, as shown in Figure 1.
We aimed to obtain a few-shot multiobject-tracking model based on few-shot learning. In this framework, we used a few-shot object detector as the detector and a classification network trained based on the few-shot method as the feature extractor. In addition, we also designed a trajectory-reconstruction module to optimize the tracking result.
Research on multiobject tracking under the tracking-by-detection paradigm typically addresses two aspects: (1) detecting targets more accurately in complex environments; and (2) dealing with long-term and short-term occlusion problems and associating targets more accurately. Some previous works [23][24][25] showed that a multiobject-tracking approach could achieve state-of-the-art performance when used together with a robust object detector. They used Kalman filtering to predict and update trajectories [23] and proposed an extension [24] that, in addition to the motion features above, also considered the appearance features of the target. Feichtenhofer et al. introduced correlation features representing object co-occurrences across time to aid the ConvNet during tracking. Moreover, they linked the frame-level detections based on across-frame tracks to produce high-accuracy detections at the video level [25].
The primary purpose of data association is to match multiple targets between frames, including the appearance of new targets, the disappearance of old targets, and the identity matching of targets between consecutive frames. Many approaches formulated the data-association process as various optimization problems [12,13]. The former mapped the maximum a posteriori (MAP) data-association problem to cost-flow networks with nonoverlapping constraints on trajectories; a min-cost flow algorithm found the optimal data association in the network. The latter argued that re-identification by appearance alone was not enough, and that long-distance object reappearance was also worthy of attention; they proposed a graph-based formulation that linked and clustered person hypotheses over time by solving an instance of a minimum-cost lifted multicut problem. Some works, such as [26,27], emphasized improving the features used in data association. They proposed dual matching attention networks with spatial and temporal attention mechanisms [26]. The spatial attention network generated dual spatial attention maps based on the cross-similarity between each location of an image pair, making the model more focused on matching common regions between images. The temporal attention module adaptively allocated different levels of attention to separate samples in the tracklet to suppress noisy observations. To obtain a higher precision, they also developed a new training method with ranking loss and regression loss [27]. The network considered the appearance and the corresponding temporal frames for data association.
Conceptually, tracking technologies using computer vision permit high-resolution snapshots of the movement of multiple animals and can track nontagged individuals, but they are less cost-effective, are usually limited to specific scenarios, and make individual identification challenging. In contrast, here we provide a fully automated computational approach to tracking tasks for wildlife by combining few-shot learning with multiobject tracking to detect, track, and recognize species in nature. It could represent a step change in our use of extensive video data from the wild to speed up the procedure for ethologists to analyze biodiversity for research and conservation in the wildlife sciences. This approach represents an automated pipeline for recognizing and tracking species in the wild. Our main contributions can be summarized as follows:
• We combined few-shot learning with a multiobject-tracking task. To the best of our knowledge, this is the first time an automated multiobject-tracking framework based on few-shot learning has been proposed.

• Our approach effectively merged the richness of deep neural network representations with few-shot learning, which paves the way for robust detection and tracking of wildlife and can adapt to unknown scenarios through data augmentation.
• A trajectory-reconstruction module was proposed to compensate for the shortcomings of the few-shot object-detection algorithm in multiobject-tracking tasks, especially in monitoring wildlife.

Architecture Overview
While camera traps have become essential for wildlife monitoring, they generate enormous amounts of data. The fundamental goal of using intelligent frameworks in wildlife monitoring is the automated analysis of behaviors, interactions, and dynamics, both individual and group. For example, sampling the quantity of species' complex interactions for network analysis is a significant methodological challenge. Early approaches require capturing subjects and are labor-intensive. Their application may be location-specific, and the recorded data typically lack contextual visual information. In this work, we instead sought to learn the unconstrained dynamics and be sensitive to the presence of various locations and groups. The aim was to propose a cost-effective wildlife-tracking approach that generated massive high-resolution video records across scales relevant to the ecological context in which animals perceive, interact with, and respond to their environment. Figure 2 shows the overall design of the proposed MOT framework, called Few-MOT, which followed the tracking-by-detection paradigm, but without requiring large amounts of training data. An input video frame first underwent a forward pass through a few-shot object detector and a few-shot feature extractor to obtain motion and appearance information. Finally, we followed [24] and made improvements to solve the association problem in a few-shot setting. The upgrades included two parts: (1) a three-stage matching process including cascade matching, central-point matching, and IoU matching; and (2) a trajectory-reconstruction module to compensate for few-shot object detection.
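The per-frame flow just described (detect, embed, associate, update) can be sketched as follows. This is an illustrative skeleton only: the component names, the `Track` fields, and the callable interfaces are our assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    box: tuple       # (x, y, w, h) bounding box from the last matched frame
    feature: list    # appearance embedding from the few-shot extractor
    missed: int = 0  # consecutive frames without a matched detection

def step(tracks, frame, detector, extractor, associate, next_id):
    """One Few-MOT iteration: detect objects, extract appearance
    embeddings, run the (three-stage) association, then update matched
    tracks and spawn new tracks for unmatched detections."""
    detections = detector(frame)                       # [(box, cls, score), ...]
    features = [extractor(frame, box) for box, _, _ in detections]
    matches, new_dets = associate(tracks, detections, features)
    for t_idx, d_idx in matches:                       # refresh matched tracks
        tracks[t_idx].box = detections[d_idx][0]
        tracks[t_idx].feature = features[d_idx]
        tracks[t_idx].missed = 0
    for d_idx in new_dets:                             # start new tracks
        tracks.append(Track(next_id, detections[d_idx][0], features[d_idx]))
        next_id += 1
    return tracks, next_id
```

In the actual framework, `detector` and `extractor` would be the few-shot networks and `associate` the matching module described below.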

Figure 2. The architecture of our proposed few-shot tracker framework: Few-MOT. It consisted of a detection process and a tracking process. The detection process followed a few-shot object detector that directly regressed the objectness score (def), bounding box location (x, y, w, h), and classification score (cls). The tracking process included a few-shot feature-extraction network (Extractor), a matching module, and a trajectory-reconstruction module. The extractor was responsible for extracting the features of each object clip. The matching module then performed the association of targets between frames, and if targets met the reconstruction criteria, their tracks were constructed by the trajectory-reconstruction module. The details of this module will be explained in the methods section.

Few-Shot Detection Module
Most object-detection approaches rely on extensive training samples. This requirement substantially limits their scalability to the open-ended accommodation of novel classes with limited labeled training data. In general, the detection branch of a multiobject tracker adopts the state of the art in object detection. Given the extreme scarcity of endangered-animal scenes, we had very few samples available. This paper addresses these problems by offering few-shot object detection with spatial constraints to localize objects in our multiobject-tracking framework. Few-shot object detection requires only k training samples per class, and under this premise its performance is better than that of a general detector.
First, note that in few-shot learning, we defined classes with a large number of samples as the base classes, and their counterparts as the novel classes. In this paper, the novel classes are the endangered animal classes. Our proposed few-shot object-detection method allowed for few-shot learning in different scenarios with spatial dependencies while adapting to a dynamically changing environment during the detection process. It exploited a set of objects and environments that were processed, composed, and affected by each other simultaneously, instead of being recognized individually. Considering the geographical correlation between species and environmental factors, we thus imposed spatial constraints during data augmentation. The images were first separated into foreground and background using the pretrained saliency network U2-Net [28]. Then, the pretrained image-inpainting network CR-Fill [29] repaired the missing parts. Finally, the separated foregrounds and backgrounds were blended and combined into new samples. We used a perceptual hashing algorithm to impose spatial constraints and reject combinations that did not correspond to the actual situation. For example, an event with zero probability, such as a giant panda in the sky, would be misleading for training the object-detection model. After the above constrained data expansion, the samples were learned from each other. The training of the few-shot object-detection task was performed based on a feature-reweighting method [30].
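The compositing step can be sketched as follows. The U2-Net saliency mask and the CR-Fill-inpainted background are assumed to be precomputed; the network calls themselves are omitted, so this only shows the blending of a segmented foreground onto a new background.

```python
import numpy as np

def composite(foreground, mask, background):
    """Paste a segmented animal (pixels where mask == 1) onto an
    inpainted background. `foreground` and `background` are HxWx3
    uint8 arrays of equal size; `mask` is an HxW binary array."""
    mask3 = np.repeat(mask[..., None], 3, axis=2).astype(bool)
    out = background.copy()          # keep the background image intact
    out[mask3] = foreground[mask3]   # overwrite only the animal pixels
    return out
```

In the full pipeline, each candidate composite would then be screened by the pHash-based spatial constraint before being added to the training set.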
The perceptual hash algorithm pHash reduced the image to its low-frequency content using the discrete cosine transform (DCT) and then matched similar images by calculating the Hamming distance between their hashes. This analysis can be extended toward a graphical representation (Figure 3).
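A minimal pHash sketch consistent with this description. We assume grayscale inputs and use a crude nearest-neighbour downsample in place of a proper image resize; the 8x8 low-frequency block and the median threshold follow common pHash practice rather than the paper's exact recipe.

```python
import numpy as np
from scipy.fftpack import dct

def phash(gray, hash_size=8, highfreq_factor=4):
    """Perceptual hash: downsample, take the 2-D DCT, keep the top-left
    low-frequency block, and threshold at its median to get 64 bits."""
    size = hash_size * highfreq_factor            # e.g. a 32x32 working image
    h, w = gray.shape
    ys = np.arange(size) * h // size              # crude nearest-neighbour
    xs = np.arange(size) * w // size              # resize, to stay lightweight
    small = gray[np.ix_(ys, xs)].astype(float)
    coeffs = dct(dct(small, axis=0, norm='ortho'), axis=1, norm='ortho')
    low = coeffs[:hash_size, :hash_size]          # low-frequency block
    return (low > np.median(low)).flatten()       # 64-bit fingerprint

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(h1 != h2))
```

Two composites whose hashes fall within a small Hamming distance of an implausible reference layout would then be rejected by the spatial constraint.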


Learning More Robust Appearance Embedding Based on Few-Shot Learning
There is an appearance metric-learning problem in a multiobject-tracking task, and the aim is to learn an embedding space where instances of the same identity are close while instances of different identities are far apart. The metric-learning problem is often defined as a re-identification (Re-ID) task in multiobject tracking, mainly aimed at a single category; i.e., pedestrians or vehicles. For example, person re-identification aims at searching for persons across multiple nonoverlapping cameras. The Re-ID task in this approach shares similar insights with person Re-ID: when presented with an animal-of-interest (query) in video records, animal Re-ID tells whether this animal has been observed in another place (or at another time). In particular, we tracked multiple classes rather than a single class, and each class had very little training data. Thus, we trained the embedding-learning process on a few-shot classification task.
Typically, few-shot classification approaches include optimization-based, model-based, and metric-based methods. Since our goal was not classification itself but to train a feature learner through the classification task and map its features to the target, the features had to describe both the category and changes in behavior; thus, directly using a few-shot classification network for training was not applicable. We used elastic-distortion data augmentation to make the learned features robust to pose variation. Elastic distortion changed the posture of the target, allowing changes in behavior to be focused on and adapted to our eventual tracking task. Because the target was moving and the pose of the same target was constantly changing in the video stream, this variation affected the recognition rate of the target identity during the tracking process.
Firstly, an affine transformation of the image was performed, and a random displacement field was generated for each pixel of the image. Then, we convolved the random displacement field with a Gaussian kernel N(0, δ) and multiplied it by a control factor α, where δ controlled the smoothness of the displacement and α controlled the strength of the deformation. We set δ to 0.07 and α to 5. The experimental results suggested that these parameter values enriched the target pose without distorting the image. Figure 4 shows a partial example of the processed image.
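A minimal sketch of this augmentation using `scipy.ndimage`, assuming a single-channel image. The δ (here `sigma`) and α (`alpha`) parameters follow the description above; note that smoothing conventions differ between implementations, so this is an illustration rather than the paper's exact code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_distort(image, alpha=5.0, sigma=0.07, seed=None):
    """Elastic distortion: draw a random per-pixel displacement field,
    smooth it with a Gaussian of width sigma, scale it by alpha, and
    resample the image at the displaced coordinates."""
    rng = np.random.default_rng(seed)
    shape = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    y, x = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing='ij')
    coords = np.array([y + dy, x + dx])           # displaced sampling grid
    return map_coordinates(image, coords, order=1, mode='reflect')
```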

Figure 4. Example comparison of the EAOD dataset after elastic distortion. Each target was appropriately deformed without distorting the image. In this way, the diversity of target poses was enriched.
We imitated the approach used in [31] in our training process, using self-supervision and regularization techniques to learn generic representations suitable for few-shot tasks. Firstly, we used a pretext task called rotation to construct the self-supervised task on the base classes. In the self-supervised task, the input image was rotated by r degrees, with r ∈ C_R = {0°, 90°, 180°, 270°}. The secondary purpose of the model was to predict the amount of rotation applied to the image. An auxiliary loss was added to the standard classification loss in the image-classification setting to learn the generic representation. The loss function of the first stage is given by:

L_first = L_class + L_rot,

where L_rot denotes the self-supervision loss, and L_class denotes the classification loss. Secondly, fine-tuning with a manifold mixup was conducted on the base classes and endangered classes for a few more epochs. The manifold mixup provided a practical way to flatten a given class of data representations into a compact region. For input data x and x′ with corresponding feature representations at layer l given by f_θ^l(x) and f_θ^l(x′), the mixed representation Mix_λ(f_θ^l(x), f_θ^l(x′)) = λ·f_θ^l(x) + (1 − λ)·f_θ^l(x′) was classified against the correspondingly mixed labels, and this mixup classification loss served as the loss function of the fine-tuning stage.
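The rotation pretext task and the manifold-mixup interpolation described above can be sketched in a few lines; the actual training would feed these rotated copies and mixed features through the classifier, which is omitted here.

```python
import numpy as np

def rotation_batch(image):
    """Pretext task: the four rotated copies of an image, with labels
    0..3 encoding rotations of 0, 90, 180, and 270 degrees."""
    return [(np.rot90(image, k), k) for k in range(4)]

def manifold_mixup(h_a, h_b, y_a, y_b, lam):
    """Mix_lam: interpolate layer-l feature representations and their
    one-hot labels with coefficient lam."""
    return lam * h_a + (1 - lam) * h_b, lam * y_a + (1 - lam) * y_b
```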

Association Module
Considering that current association modules were all designed for the conventional multiobject-tracking task and had not been applied to multiobject tracking in a few-shot setting, it was inevitable that they had some shortcomings. To fit the Few-MOT module to the MOT-EA dataset, we made some improvements to the DeepSORT association algorithm.

Three-Stage Matching
In addition to cascade matching and IoU matching, we added central-point matching, which helped to alleviate detection boxes and tracks being mismatched due to an excessive intersection ratio. The IoU matrix iou(j,i) was calculated as the intersection-over-union (IoU) distance between every detection and track pair:

iou(j,i) = Area(track_j ∩ dec_i) / Area(track_j ∪ dec_i),

where Area(track_j) is the area of track_j, and Area(dec_i) represents the area of dec_i. The central-point matrix center(j,i) was calculated as the central-point distance between every detection and track pair:

center(j,i) = ||center(track_j) − center(dec_i)||_2,

where center(track_j) and center(dec_i) are the central points of the track and detection, respectively. Figure 5 illustrates the difference between central-point matching and IoU matching.
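A direct implementation of the two distance terms, for boxes given in (x1, y1, x2, y2) corner form:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def center_dist(box_a, box_b):
    """Euclidean distance between the two box centres."""
    ca = np.array([(box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2])
    cb = np.array([(box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2])
    return float(np.linalg.norm(ca - cb))
```

Stacking these values over all track/detection pairs yields the iou(j,i) and center(j,i) matrices used in matching.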

During the experiment, we found that if we only used cascade matching and central-point matching in the matching stage, it did help to reduce ID switching, but at the same time, it was accompanied by an increase in missed targets. Thus, we combined IoU matching with central-point matching and designed the following trajectory-reconstruction module to alleviate this problem. On the MOT-EA dataset, we measured the above matching strategies using the two indicators FN and FP, and found that three-stage matching was the best matching strategy. A further discussion of the ablation experiment reveals more details.
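Each of the three stages can reuse a single gated assignment primitive: run the Hungarian algorithm on that stage's cost matrix, accept pairs below a gate threshold, and pass the leftover tracks and detections to the next stage. The gate values and ID conventions here are placeholders, not the paper's tuned settings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_stage(cost, gate, track_ids, det_ids):
    """Hungarian assignment on one cost matrix; pairs whose cost exceeds
    `gate` are rejected so they can be retried in the next stage."""
    if cost.size == 0:
        return [], list(track_ids), list(det_ids)
    rows, cols = linear_sum_assignment(cost)
    matches, used_t, used_d = [], set(), set()
    for r, c in zip(rows, cols):
        if cost[r, c] <= gate:
            matches.append((track_ids[r], det_ids[c]))
            used_t.add(r)
            used_d.add(c)
    left_t = [t for i, t in enumerate(track_ids) if i not in used_t]
    left_d = [d for i, d in enumerate(det_ids) if i not in used_d]
    return matches, left_t, left_d
```

Three-stage matching then chains this primitive over the cascade (appearance/motion) cost, the central-point cost, and the IoU cost in turn.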

Trajectory-Reconstruction Module
We found an excessive number of missed detections in the tracking process described in the previous section, which damaged the tracking effect. In addition, the performance of the few-shot detector was not as good as that of YOLO, Faster R-CNN, and other general object detectors, so targets were occasionally lost in the video stream. However, according to [32], the tracking accuracy of multiple objects can be written as:

MOTA = 1 − (FN + FP + IDSW)/GT,

where FN is false negatives (the sum of missed detections in the entire video), FP is false positives (the sum of false detections in the entire video), IDSW is the ID switch count (the total number of ID switches), and GT is the number of ground-truth objects. The object-detection accuracy significantly affected the tracking accuracy, so we designed a trajectory-reconstruction module to deal with the above problems. This module compensated for the shortcomings of the few-shot detector. First, we specified the central region, as shown in Figure 6 below. Then, if a track failed to match any detection box in frame T, we checked the central-point position of that track in frame T−1. If the central point of the bounding box in frame T−1 was located in the central area, we reconstructed the track from frame T−1 to frame T under the present conditions. We allowed the reconstruction of at most five consecutive frames, because the object's position usually changed only slightly over five consecutive frames, so the box from frame T−1 could still locate the object in the subsequent four frames.
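The reconstruction rule can be sketched as a small predicate. The central region's margin is an assumption for illustration (the paper defines the region in Figure 6; we take an inner box leaving 25% on each side), while the five-frame cap follows the text.

```python
def should_reconstruct(prev_center, frame_size, margin=0.25,
                       reconstructed=0, max_frames=5):
    """Decide whether a track unmatched in frame T may be reconstructed
    from its frame T-1 box: its T-1 centre must lie in the central
    region, and at most `max_frames` consecutive reconstructions are
    allowed."""
    w, h = frame_size
    cx, cy = prev_center
    inside = (margin * w <= cx <= (1 - margin) * w and
              margin * h <= cy <= (1 - margin) * h)
    return inside and reconstructed < max_frames
```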

Implementation Details
This framework was written in Python with PyTorch support. First, when training the feature extractor of Few-MOT, we converted the EAOD private object-detection dataset into an image-classification dataset for training. WRN-28-10 [33] was used as the backbone, and the elastic-distortion data-augmentation strategy enhanced the feature robustness of animals in various poses. Then, in the design of the trajectory-reconstruction module, we found through several experiments that when the allowable reconstruction threshold was set to less than 5, there were too many missed trajectories. When the setting was greater than 5, there were too many false trajectories, which reduced the tracking effect. Therefore, we set the threshold for the maximum number of frames allowed to be continuously reconstructed to 5.

Datasets and Evaluation Metrics
1. Datasets: Currently, there is no multiobject-tracking dataset for endangered animals, so we created the MOT-EA multiobject-tracking dataset in the format of MOT-16 [34].
The dataset included five endangered species: brown-eared pheasant, crested ibis, giant panda, golden snub-nosed monkey, and tiger. Each video was 10 to 20 seconds in length. Details are shown in Table 1 below.


2. Evaluation Metrics: Following the benchmarks, we evaluated our work using the metrics of [32]. MOTA and IDF1 are considered the two most important among all metrics. MOTA measures the accuracy of multiobject tracking; it mostly considers the matching errors of objects in the tracking process. Based on FP, FN, and IDs, MOTA gives a very intuitive measure of tracker performance that is independent of the precision with which object locations are estimated. IDF1 comprehensively considers the ID precision and the ID recall, and weights the ID information more heavily than MOTA does; however, IDF1 cannot reflect ID switches. These metrics are shown in Equations (10) and (11) below:

MOTA = 1 − (FN + FP + IDSW)/GT, (10)

IDF1 = 2 × IDTP/(2 × IDTP + IDFP + IDFN), (11)

where IDTP, IDFP, and IDFN are the identity-level true positives, false positives, and false negatives, respectively. A robust tracking system should show good scores for both MOTA and IDF1.
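Both metrics can be computed directly from the accumulated counts; this sketch uses the standard MOTA and IDF1 definitions from the MOT benchmarks.

```python
def mota(fn, fp, idsw, gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, over the whole video."""
    return 1.0 - (fn + fp + idsw) / gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN): the F1 score over
    identity-level true/false positives and false negatives."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```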

Experimental Results
Here, we evaluated our system on the MOT-EA dataset. Table 2 shows the tracking performance of our framework on the five endangered categories. Furthermore, we compared the same few-shot object detector with multiple trackers, as shown in the first four rows of Table 3; the general detector YOLOv4 was used for comparison in row 5 of Table 3. The per-category performance of the five methods in Table 3 on the MOT-EA dataset is given in Appendix A, Tables A1-A5. The results showed that our framework outperformed many previous approaches with small data samples: both the MOTA and IDF1 scores led on MOT-EA. We attribute this to two factors: the general detector could not achieve a good detection effect with a small amount of data, which significantly affected tracking, and the tracker we designed was more suitable for this scenario, being more robust to the various morphological changes of animals and better targeted at the insufficient learning caused by small samples.

Two example trajectories of two tigers produced by the Few-MOT model are shown in Figure 7 below. Our model made it possible to track the targets and plot their movements, recording the basic trajectories of the endangered animals within the monitoring area. The trajectories could also be used to analyze the areas where the targets were active and whether and how different targets interacted. In addition, the tracking processes of a giant panda and a golden snub-nosed monkey are shown in Figures 8 and 9, respectively; the targets were continuously located throughout and maintained unique identity IDs.
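The trajectory-based analyses mentioned above (activity areas, per-target movement records) reduce to simple aggregation over the tracker's per-frame output. A minimal sketch, assuming the tracker emits `(frame, track_id, x, y)` center points (an assumed output format, not the paper's exact interface):

```python
from collections import defaultdict

def activity_areas(records):
    """Group per-frame centers by track ID and return each target's
    axis-aligned activity area as (x_min, y_min, x_max, y_max)."""
    tracks = defaultdict(list)
    for frame, tid, x, y in records:
        tracks[tid].append((x, y))
    areas = {}
    for tid, pts in tracks.items():
        xs = [p[0] for p in pts]
        ys = [p[1] for p in pts]
        areas[tid] = (min(xs), min(ys), max(xs), max(ys))
    return areas
```

The same per-ID grouping also yields the movement plots: each `tracks[tid]` list, ordered by frame, is the polyline drawn in the trajectory figures.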

Ablation Study and Discussion
Here, we discuss the impact of three components: the three-stage matching, the elastic-distortion data-augmentation strategy, and the trajectory-reconstruction module. First, we performed ablation experiments on the MOT-EA dataset for the matching module. The two-stage variant comprised cascade matching and central matching; the three-stage variant added IoU matching. As shown in Table 4, the three-stage matching improved the handling of false and missed detections.

Table 5 shows the impacts of the elastic-distortion data-augmentation strategy and the trajectory-reconstruction module. The baseline model (row 1 in Table 5) consisted of a few-shot detector and an unmodified tracker. The other experiments in Table 5 shared the same few-shot detector, differing only in the feature learner's training process and the tracker's association module. The results indicated that the feature stability brought by the elastic-distortion augmentation slightly improved the MOTA score, but the more significant gain stemmed from the trajectory-reconstruction module, which handled both false and missed targets well during tracking; according to Equation (10), this led to a significant improvement in MOTA.

Figure 10 shows a small segment of the trajectory-reconstruction module's behavior during tracking: the target lost in the 30th frame was reconstructed, making the target's trajectory more complete.
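The three-stage association evaluated above can be sketched as a pipeline that matches the leftovers of each stage with the next: appearance-based cascade matching first, then center-distance matching, then IoU matching. The cost functions and thresholds below are placeholders, and greedy matching stands in for the assignment solver (a real tracker would typically use Hungarian assignment); this is a sketch of the staging, not the paper's implementation:

```python
def greedy_match(tracks, dets, cost_fn, threshold):
    """Greedily pair tracks and detections (hashable IDs) whose cost is
    below threshold; return (pairs, unmatched_tracks, unmatched_dets)."""
    pairs, used_dets = [], set()
    for t in tracks:
        best, best_cost = None, threshold
        for d in dets:
            if d in used_dets:
                continue
            c = cost_fn(t, d)
            if c < best_cost:
                best, best_cost = d, c
        if best is not None:
            used_dets.add(best)
            pairs.append((t, best))
    matched = {t for t, _ in pairs}
    return (pairs,
            [t for t in tracks if t not in matched],
            [d for d in dets if d not in used_dets])

def three_stage_match(tracks, dets, appearance_cost, center_cost, iou_cost,
                      thr=0.5):
    """Run the three stages in order; each stage only sees what the
    previous stage left unmatched."""
    matches = []
    for cost_fn in (appearance_cost, center_cost, iou_cost):
        stage_pairs, tracks, dets = greedy_match(tracks, dets, cost_fn, thr)
        matches += stage_pairs
    return matches, tracks, dets
```

Detections that survive all three stages spawn new tracks, and tracks that survive all three are handed to the reconstruction module described earlier.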

Discussion
So-called "big data" approaches are not limited to technical fields, because the combination of large-scale data collection and processing techniques can be applied to various scientific questions. Meanwhile, it has never been more critical to keep track of biodiversity than over the past decade, as losses and declines have accelerated with ongoing development. However, multiobject tracking is complicated, with experts relying on human interaction and specialized equipment. Cheap camera sensors have become essential for capturing wildlife and their movements; they generate enormous amounts of data and have become a prominent research tool for studying nature. Machine- and deep-learning methods hold promise as efficient tools to scale local studies to a global understanding of the animal world [38]. However, detecting and tracking the target animals remains challenging, essentially because the data obtained from wild species are too sparse.
Our deep-learning approach detected and tracked the target animals and produced spatiotemporal tracks that follow multiple objects, using few-shot learning to alleviate the challenges of instance imbalance and insufficient samples. This study demonstrated how combining tracking methods, deep learning, and few-shot learning can serve as a research tool for studying wild animals. Turning now to its limitations, we note that our approach relied heavily on detecting the animals' prominent body parts, and easily failed to track infant animals.

Conclusions
In this work, we introduced Few-MOT for wildlife, embedding uncertainty into the design of a multiobject-tracking model by combining the richness of deep neural networks with few-shot learning, leading to correctable and robust models. The approach provides a fully automated pipeline that integrates few-shot learning with deep neural networks. Instead of a discriminative model, a spatial-constraints model was created. Furthermore, a trajectory-reconstruction module was proposed to compensate for the shortcomings of few-shot object detection. Our model demonstrated the efficacy of few-shot architectures for a biological application: the automated recognition and tracking of wildlife. Unlike older, data-rich automation methods, our method was based entirely on deep learning with few shots. It also improved on previous deep-learning methods by combining few-shot learning with a multiobject-tracking task, and it provided a rich set of examples by incorporating contextual details of the environment, which can improve few-shot learning efficiency, especially in wildlife detection and tracking.
The data explosion that has come with the widespread use of camera traps poses challenges while simultaneously providing opportunities for wildlife monitoring and conservation [39]. Tracking animals is essential in animal-welfare research, especially when combined with physical and physiological parameters [40][41][42]. It is also challenging to curate datasets large enough to train tracking models. We proposed a deep-learning framework named Few-MOT to track endangered animals based on few-shot learning and a tracking-by-detection paradigm. It can record the daily movements of the tracked target, marking areas of frequent activity and other information for further analysis. The framework offers few-shot object detection with spatial constraints to localize objects and a trajectory-reconstruction module for better association. The experimental results showed that our method performed better on the few-shot multiobject-tracking task. Our new dataset opens up many opportunities for further research on multiobject tracking. There were some limitations to our study, notably that the detector could report a nonexistent target in the wrong place when the surroundings were extremely similar to the target. Future work should investigate how multiple variables, such as the features of the training dataset and different network architectures, affect performance. Furthermore, a key driver in the advancement of intelligent video systems for wildlife conservation will be the increasing availability of datasets covering sufficient species, and open-source datasets should also be released in the future.
Author Contributions: Methodology, J.F. and X.X.; investigation, J.F.; data curation, X.X.; validation, X.X.; writing-original draft preparation, X.X.; writing-review and editing, J.F. All authors have read and agreed to the published version of the manuscript.

Funding: The work was supported by the National Natural Science Foundation of China (41971365) and the Chongqing Research Program of Basic Science and Frontier Technology (cstc2019jcyj-msxmX0131).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest: The authors declare no conflict of interest.