Procapra Przewalskii Tracking Autonomous Unmanned Aerial Vehicle Based on Improved Long and Short-Term Memory Kalman Filters

This paper presents an autonomous unmanned-aerial-vehicle (UAV) tracking system based on an improved long and short-term memory (LSTM) Kalman filter (KF) model. The system can estimate the three-dimensional (3D) attitude and precisely track the target object without manual intervention. Specifically, the YOLOX algorithm is employed to track and recognize the target object, which is then combined with the improved KF model for precise tracking and recognition. In the LSTM-KF model, three different LSTM networks (f, Q, and R) are adopted to model a nonlinear transfer function so that the model can learn rich and dynamic Kalman components from the data. The experimental results show that the improved LSTM-KF model exhibits higher recognition accuracy than the standard LSTM and the stand-alone KF model. This verifies the robustness, effectiveness, and reliability of the autonomous UAV tracking system based on the improved LSTM-KF model in object recognition, tracking, and 3D attitude estimation.


Introduction
Procapra przewalskii is an endangered ungulate endemic to the Qinghai-Tibet Plateau. Its type specimen was collected by Nikolai M. Przewalski in the Ordos Plateau of Inner Mongolia in 1875 [1]. It is listed as a national key protected wild animal [9][10][11]. Meanwhile, deep convolutional neural networks (CNNs) have been applied extensively to animal identification without pre-specifying any features [12] but have seen limited use in monitoring the activities of farm animals [13]. The two-stream network proposed by [14] is one of the tracking models used to track moving objects. Its multiple layers and the optical flow of convolutional networks enable capturing the relevant information about an object in each frame and tracking the object's movement across frames. Shortly after the proposal of two-stream networks, long-term recurrent convolutional networks (LRCNs) were developed. LRCNs [15] generally comprise several CNNs, namely Inception modules, ResNet, VGG, and Xception, enabling the extraction of spatial and temporal features.
LRCN has been the most widely applied tracking model thanks to its reasonable architecture for object tracking. Generic object tracking using regression networks (GOTURN) [16] is another lightweight network, achieving 100 frames per second (fps) in object tracking. GOTURN was initially trained on datasets filled with generic objects. The regions of interest (ROIs) on the frames are taken as input data for the trained network, making it possible to continuously predict the location of the target. The SlowFast network [17], on the other hand, tracks objects using two streams of frames, namely slow and fast pathways. Many other algorithms can be introduced for animal monitoring, including, but not limited to, simple online and real-time tracking (SORT), the Hungarian algorithm (HA), the Munkres variant of the Hungarian assignment algorithm (MVHAA), the spatial-aware temporal response filter (STRF), and the channel and spatial reliability discriminative correlation filter (CSRDCF) [18][19][20][21].
These algorithms have been combined with object-detection models, such as Faster R-CNN, FCN, SSD, VGG, and YOLO, to detect and track animals in images using their geometric features in continuous frames [13]. To provide a reliable and efficient method to monitor behavioral activity in cows, [22] presented a tracking system embedded with ultrawideband technology. Similarly, [23] employed a computer vision module to analyze and detect positive and negative social interactions in feeding behavior among cows. The system was implemented and tested on seven dairy cows in the feeding area, realized localization with a mean error of 0.39 m and a standard deviation of 0.62 m, and achieved a detection accuracy of 93.2% for social interactions. However, the real-time locating system (RTLS) exhibits poor accuracy in identifying individual cows if they are in close body contact.
The CNN-based algorithms above present limited applications in monitoring agricultural animals because they do not pre-specify any features of the target. In this case, the YOLOX model, a lightweight network with an anchor-free head, is applied in this paper; it is equipped with several high-performance detectors and converges faster, so animals with specified features can be monitored. LRCN is the most widely used tracking model due to its reasonable structure for target tracking. However, it fails to effectively monitor the pose behavior of animals or account for variations in measurement noise. Therefore, the LSTM-KF model, with good robustness, reliability, and validity in target tracking and 3D pose estimation, is proposed in this paper; it can alleviate the effect of measurement noise and refine the measurement results.

Learning-Based KF Architecture
In the current work, machine learning and KF models are combined for temporal regularization. Existing approaches can be classified into those that learn static parameters of the KF and those that actively regress the parameters during filtering. The noise covariance matrices (NCMs) were optimized statically by [24] to replace the manual fine-tuning of noise parameters in robotic navigation. Additionally, a coordinate ascent algorithm was employed, and each element of the NCM was optimized. However, this approach is only applicable to noisy but time-invariant systems. Unlike the dynamic model adopted in this study, it cannot account for a change in measurement noise, which therefore lowers the accuracy of the state estimates.
Reference [8] learned the underlying state transition function that controls the dynamics of a hidden process state. However, only the state-space equations of the KF are used, instead of the prediction and update scheme that performs well under linear state transitions and additive Gaussian noise [6]. Neural network models were trained that jointly learn to propagate the state, incorporate measurement updates, and react to control inputs. In addition, the covariances were kept constant during the entire estimation. This approach can estimate the state better than a distinct prediction and update model, especially when large-scale training data are unavailable, as demonstrated in the experiment section of the present study.
The dynamic regression of KF parameters was put forward by [7], who adopted support vector regression (SVR) to estimate a linear state transition function jointly with the predicted NCM. The SVR-based system can deal with time-variant systems and outperforms manually tuned KF models in object tracking. In contrast to the model adopted here, however, its measurement noise covariances (MNCs) are kept constant and its transition function is modeled as a matrix multiplication. Consequently, it can only estimate linear motion models, while the model employed in the present study can estimate nonlinear transition functions based on all previous state observations.
Reference [25] focused on integrating a one-shot estimation as a measurement into a KF model, which required a prediction of the MNC. They demonstrated that the integrated model exhibited superior performance by comparing it with two other models. In contrast, the model designed in the present work treats the measurement update as a black-box system and automatically estimates the MNC, so that it can be combined with current one-shot estimators.
Previous work has extensively investigated temporal regularization for pose estimation, and priority attention has been given to works that focus on implicit regularization schemes and that explicitly use a learning-based KF structure to infer temporal coherence. In contrast to other models, the proposed model introduces the LSTM-KF, which mitigates the modeler-induced influence of specifying motion and noise models a priori, while allowing rich models that are extremely difficult to write down explicitly to be learned from data. An extensive series of experiments reveals that the LSTM-KF outperforms both the stand-alone KF and LSTM in terms of temporal regularization.

Overall Technical Architecture
The videos in the research area were acquired using Prometheus 230 intelligent UAVs (Chengdu Bobei Technology Co., Ltd., Chengdu, China), as shown in Figure 1. An Intel RealSense D435i stereo camera was selected for the system to acquire sensing and depth data because its features include light weight, wide field of view (FoV), high depth accuracy, and good stability. Furthermore, a powerful graphics processing unit (GPU) was employed for the embedded systems, and the NVIDIA Jetson AGX Xavier embedded platform was selected to process the deep-learning-based algorithms. A flight controller (Pixhawk 4 (PX4)) was deployed and communicated with the MAVROS package, which was connected to the planner node.
As illustrated in Figure 2, the designed system consists of (1) a perception module, (2) an object-tracking algorithm, (3) a UAV maneuver, and (4) a ground station visualization module. In brief, the UAV system first perceives the red-green-blue (RGB) images and the depth data, and the drone recognizes the Procapra przewalskii with YOLOX (a deep-learning-based detector). Next, the 2D bounding boxes are fused with the depth measurements to estimate the 3D pose of the Procapra przewalskii. Finally, the improved LSTM-KF model proposed here is integrated to assist in predicting the motion of the Procapra przewalskii. A visualization user interface is also included.

Dataset Establishment and Training
Perceiving an object in the 3D world is essential for detecting and tracking it. A deep-learning-based detector is employed here to generate the related 2D information and perform 3D stereo reconstruction. This is very challenging because the object may move fast (e.g., running), the training data are scarce, the detection accuracy is limited, and the position of the object changes continuously. As mentioned above, the YOLOX algorithm is selected as the baseline model to detect and track the Procapra przewalskii more accurately. Its structural framework is illustrated in Figure 3. Referring to YOLOv3 and Darknet53, the YOLOX model adopts the structural architecture and spatial pyramid pooling (SPP) layer of the latter. In addition, the model is equipped with several high-performance detectors.
In August 2022, we went to the research area (Qinghai Lake) (Figure 4a) to acquire the UAV data and verify the actual flight. There were 40 flights in five days, and each flight lasted about half an hour. The average flight height was about 100 m, and the aerial photography coverage area reached 9744 km². Figure 4b shows the flight landing sites. With the captured videos, an object-tracking database was established to identify the moving Procapra przewalskii, match them in different frames, and track their motions. A total of 6 video sequence databases, composed of 3 training databases and 3 test databases, were marked. The data were divided into a training set, a test set, and a verification set at a ratio of 3:2:1 (the Supplementary Data Set is available at https://pan.baidu.com/s/1vEYdFFTKUE9Z9cC67lCH_Q?pwd=56vx). There were three major motions for Procapra przewalskii, namely standing, walking, and running, as displayed in Figure 5a-c (male), Figure 5d-f (female), and Figure 5g-i (young). The database was trained based on the YOLOX model by adjusting the weight ratio, confidence threshold, intersection over union (IoU) threshold of NMS, and activation function. In this way, a stable and accurate model was obtained.
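The 3:2:1 division described above can be sketched as follows. The sequence IDs, the fixed seed, and the use of Python's random module are illustrative assumptions, not details from the paper:

```python
import random

def split_sequences(sequences, ratio=(3, 2, 1), seed=0):
    """Split a list of video sequences into training, test, and
    verification subsets at the 3:2:1 ratio used in the paper."""
    rng = random.Random(seed)
    seqs = sequences[:]
    rng.shuffle(seqs)
    total = sum(ratio)
    n_train = len(seqs) * ratio[0] // total
    n_test = len(seqs) * ratio[1] // total
    train = seqs[:n_train]
    test = seqs[n_train:n_train + n_test]
    val = seqs[n_train + n_test:]
    return train, test, val

# Example with 6 hypothetical sequence IDs
train, test, val = split_sequences([f"seq_{i}" for i in range(6)])
print(len(train), len(test), len(val))  # 3 2 1
```

Splitting at the sequence level (rather than the frame level) avoids leaking near-identical frames between the training and test sets.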

3D Pose Estimation
Herein, the predicted bounding box was saved as S_ROI, and the 3D pose of the object was recovered and dynamically tracked from the object coordinates on the 2D frame based on the depth information obtained from the stereo camera. In addition, an interior rectangle S_i was first generated by contracting S_ROI with a scaling factor θ, as computed in Equations (1) and (2). In the equations below, S_i is the interior rectangle; θ refers to the scaling factor that adjusts the size of the bounding box; c_x and c_y are the center coordinates of the bounding box; and w and h represent the width and height of the bounding box, respectively.
S_i, as displayed in Figure 6b, serves as the ROI to obtain the depth information. The unfilled pixels are filtered out from the depth image captured by the stereo camera, and the remaining depth data in S_i are averaged as S, which is taken as the distance between the observer and the target object. Then, with the bounding-box coordinates, a coordinate transformation was performed to obtain the relative attitude of the camera and the global attitude in the world frame. Frame transformation was carried out according to Equations (3) and (4) below. In these equations, u and v are the pixel coordinates of S_t; K is the intrinsic matrix of the local camera; X_Ci is the object pose vector in the camera frame; and X_Wi refers to the object pose vector in the world frame. Specifically, the transformation matrix can be calculated using Equations (5) and (6).
where r_ij is an element of the observer pose rotation matrix; o_x, o_y, and o_z denote the position of the observer (UAV) relative to the world frame; and T_B^C and T_W^B are the pose transformation matrices. Rotation of the coordinate system is usually represented by a rotation matrix or a quaternion.
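The pipeline of Equations (1)-(6) (contract the box, average the valid depths, back-project with the intrinsic matrix, and transform into the world frame) can be sketched as follows. The zero-as-unfilled depth convention and the single 4x4 camera-to-world transform T_WC are assumptions for illustration:

```python
import numpy as np

def contract_box(cx, cy, w, h, theta=0.5):
    """Shrink the predicted bounding box S_ROI about its center by a
    scaling factor theta to obtain the interior rectangle S_i
    (a sketch of Equations (1)-(2); the exact form may differ)."""
    return cx, cy, w * theta, h * theta

def mean_depth(depth_img, cx, cy, w, h):
    """Average the valid depth pixels inside S_i, filtering out
    unfilled pixels (assumed here to be reported as 0)."""
    x0, x1 = int(cx - w / 2), int(cx + w / 2)
    y0, y1 = int(cy - h / 2), int(cy + h / 2)
    roi = depth_img[y0:y1, x0:x1]
    valid = roi[roi > 0]
    return float(valid.mean())

def pixel_to_world(u, v, depth, K, T_WC):
    """Back-project pixel (u, v) at the averaged depth with the camera
    intrinsic matrix K, then move the resulting camera-frame point
    into the world frame with a 4x4 camera-to-world transform T_WC
    (a sketch of Equations (3)-(6))."""
    xyz_cam = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))
    return (T_WC @ np.append(xyz_cam, 1.0))[:3]
```

In practice T_WC would be composed from the UAV's body-frame extrinsics and the flight controller's pose estimate, as the transformation matrices in Equations (5) and (6) describe.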

Tracking Based on the Improved LSTM-KF Model
In this study, the YOLOX algorithm is employed because it balances speed and accuracy. Dynamic states of the target Procapra przewalskii and the quadrotor reduce the robustness of the pose estimation described in Section 3.3. The target Procapra przewalskii cannot always be captured in the FoV during surveillance, as false positive or false negative results may occur. In addition, partial or full occlusion might occur, though not often. To address these issues, the KF model is utilized to enhance tracking, but it requires the specification of a motion model and a measurement model in advance, which increases the burden on the modeler.

Model Structure and Prediction Steps
As introduced above, an improved LSTM-KF model, a temporal regularization model for attitude estimators, is proposed in the current study. Its main idea is to use KFs without specifying a linear transition function or fixed process and measurement covariance matrices Q and R.
The standard LSTM network (Figure 7) comprises memory units, forget gates (f_t), input gates (i_t), and output gates (O_t). Some information of the cell state C_{t−1} is retained in the current cell state C_t, and the amount of retained information is determined by f_t, as given in Equation (7). Meanwhile, i_t and O_t can be calculated with Equations (8) and (9), respectively.
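The gate computations of Equations (7)-(9) can be sketched as one step of a standard LSTM cell. The stacked weight layout and the tanh candidate state are conventional assumptions, not details taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell, matching Equations (7)-(9):
    f_t gates how much of C_{t-1} is retained, i_t gates the candidate
    state, and O_t gates the output. W stacks the four gate weight
    matrices over the concatenated [h_{t-1}, x_t] input."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    n = h_prev.size
    f_t = sigmoid(z[0:n])              # forget gate, Eq. (7)
    i_t = sigmoid(z[n:2 * n])          # input gate, Eq. (8)
    o_t = sigmoid(z[2 * n:3 * n])      # output gate, Eq. (9)
    g_t = np.tanh(z[3 * n:4 * n])      # candidate cell state
    c_t = f_t * c_prev + i_t * g_t     # retained plus new information
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```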
Sensors 2023, 23, x FOR PEER REVIEW 9 of 19
Here, O_t of the standard LSTM network is modified as follows. In Equation (10), {X_0, X_1, ..., X_{t−1}} and {Y_t, Y_{t+1}, ..., Y_{t+n}} are the input and output of the LSTM network, respectively; {W_0(t), W_t, ..., W_(t−1)(t)} and {W_0(t+1), W_t(t+1), ..., W_(t−1)(t+1)} represent the direct weights of the input and output, respectively; C refers to the current state of the LSTM network; and V is the coefficient.
The KF is an optimal state estimator under the assumptions of linearity and Gaussian noise. Specifically, if the state and the measurement are expressed as y_t and z_t, respectively, the hypothetical model here can be expressed as Equations (11) and (12).
Because the incoming measurements are noisy estimates of the underlying states and H = I in Equation (11), Equations (11) and (12) can be modified to Equations (13) and (14), respectively, which are the basic models of the LSTM-KF. In the equations below, z_t denotes the measurement; W_t is the weight at moment t; Q_t and R_t are the process and measurement noise covariance matrices, respectively; and f is a nonlinear transfer function.
The prediction step can be defined by Equations (15) and (16).
where f is modeled by an LSTM module, F is the Jacobian matrix of f with respect to ŷ_{t−1}, and Q̂_t is the output of the second LSTM module. Thus, the update steps are specified in Equations (17)-(19).
where R̂_t is the output of the third LSTM module and ẑ_t refers to the observed measurement at time t. Next, these LSTM modules are described in detail.
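Under the H = I assumption above, the prediction and update cycle of Equations (15)-(19) can be sketched as follows, with f, its Jacobian F, and the covariances Q̂_t and R̂_t passed in as stand-ins for the outputs of the three LSTM modules:

```python
import numpy as np

def lstm_kf_step(y_prev, P_prev, z_t, f, F, Q_t, R_t):
    """One LSTM-KF prediction/update cycle (Equations (15)-(19)).
    In the paper, f, Q_t, and R_t come from the LSTM_f, LSTM_Q, and
    LSTM_R modules; here they are arguments so the filter logic can
    be shown in isolation. H = I because the incoming measurements
    are noisy estimates of the state itself."""
    # Prediction step, Eqs. (15)-(16)
    y_pred = f(y_prev)                   # nonlinear transition
    P_pred = F @ P_prev @ F.T + Q_t      # F: Jacobian of f at y_prev
    # Update step, Eqs. (17)-(19), with H = I
    K = P_pred @ np.linalg.inv(P_pred + R_t)
    y_t = y_pred + K @ (z_t - y_pred)
    P_t = (np.eye(len(y_t)) - K) @ P_pred
    return y_t, P_t
```

Note how the gain K weighs the prediction against the measurement: a larger learned R̂_t shrinks K and makes the filter trust the LSTM_f prediction more.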

Architecture and Loss Function
In this paper, LSTM_f, LSTM_Q, and LSTM_R denote the three LSTM modules that produce f, Q̂_t, and R̂_t, respectively. LSTM_f is composed of three stacked LSTM layers (1024 hidden cells each) and three fully connected (FC) layers (with 1024, 1024, and 48 hidden cells). The standard LSTM is built like LSTM_f, but O_t of LSTM_f is modified to connect to i_t. In addition, ReLU nonlinearity is applied to all FC-layer activations except the last, and each LSTM layer is followed by a dropout layer with a retention probability of 0.7. LSTM_Q and LSTM_R follow a single-layer framework with 256 hidden units and an output of 48 hidden units. Meanwhile, O_t connects with i_t to avoid overfitting when training on the Procapra przewalskii video time sequences. Figure 8 shows each module of the LSTM-KF model and Figure 9 displays an overview of the system.

At each t, taking ŷ_{t−1} as an input, LSTM_f generates the intermediate state prediction ŷ_t without depending on the current measurement; LSTM_Q takes ŷ_t and Q̂_t as its input and output, respectively, and estimates the process covariance; and taking z_t and R̂_t as its input and output, respectively, LSTM_R estimates only the measurement covariance. Finally, ŷ_t and z_t, along with these covariance estimates, are fed into a standard KF (Equations (16)-(19)), ultimately yielding the updated prediction ŷ_t. Moreover, Q and R are restricted to diagonal and positive definite form via the outputs of the LSTM_Q and LSTM_R modules.
Preliminarily, the standard Euclidean loss summation is applied throughout the entire process, but the LSTM_f module then fails to learn a reasonable mapping. Therefore, a term is introduced into the loss function to enhance the gradient flow to the LSTM_f module in the current study. Equation (20) expresses the specific loss function.
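The augmented loss can be sketched as below; the relative weighting lam of the gradient-flow term is an assumed hyperparameter, not a value from the paper:

```python
import numpy as np

def lstm_kf_loss(y_true, y_filt, y_pred_f, lam=0.8):
    """Sketch of the loss in Equation (20): a Euclidean term on the
    filtered estimates plus a term on the raw LSTM_f predictions that
    strengthens the gradient flowing directly into LSTM_f. The
    weighting lam is an illustrative assumption."""
    filt_term = sum(np.linalg.norm(yt - yf) for yt, yf in zip(y_true, y_filt))
    f_term = sum(np.linalg.norm(yt - yp) for yt, yp in zip(y_true, y_pred_f))
    return filt_term + lam * f_term
```

Without the second term, the gradient reaching LSTM_f is attenuated by the Kalman gain, which is why the module fails to learn a reasonable mapping under the plain Euclidean loss.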

Optimization of Parameters
All parameters θ in the loss function are optimized to minimize the loss given by Equation (20) with respect to all free parameters in the model applied in this work, namely the concatenation of all weight matrices and biases from the three LSTM modules, each a combination of LSTM layers and linear layers (Figure 8).
The LSTM-KF model can be trained end to end, and the gradient can be obtained through the backpropagation-through-time algorithm [26]. All the computations and states in the model are represented by a single dataflow graph, so communication between the sub-computations can be made explicit, which is conducive to the parallel execution of independent computations to obtain the gradient as quickly as possible.
In addition, the Adam-based optimizer [27] is introduced for the training iterations to update the gradient, ensuring high stability and predictability of the model. Meanwhile, the update rules are extended from those based on the L2 norm to those based on the Lp norm. A large p results in numerical instability of these variants. However, in the special case of p → ∞, the algorithm becomes simple and stable, as calculated with Equation (21).
With the Lp norm, the step size at time t is supposed to be inversely proportional to v_t^(1/p), and then Equation (21) can be modified to Equation (22).
It should be noted that the attenuation term is equivalently parameterized here as β2^p instead of β2. If p → ∞ and v_t = lim_{p→∞} (v_t)^(1/p) are defined, Equations (23)-(26) can be obtained. This corresponds to a very simple recursive equation (Equation (27)).
The initial value is v_0 = 0. Note that, conveniently, the initialization bias does not need to be corrected. The improved Adam-based optimizer is simpler than the original and makes gradient updating easier.
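The resulting p → ∞ update (the AdaMax-style rule of Equations (21)-(27)) can be sketched as follows; the default step sizes are the conventional ones and are assumptions here:

```python
import numpy as np

def adamax_update(theta, g, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One step of the p -> infinity limit of the Lp-norm Adam variant
    described above. u_t follows the simple recursion of Equation (27),
    u_t = max(beta2 * u_{t-1}, |g_t|), with u_0 = 0, and needs no
    initialization-bias correction."""
    m = beta1 * m + (1 - beta1) * g           # first-moment estimate
    u = np.maximum(beta2 * u, np.abs(g))      # infinity-norm recursion
    theta = theta - (alpha / (1 - beta1**t)) * m / u
    return theta, m, u
```

Because u_t is a running maximum rather than an exponentially weighted square, the per-parameter step size never suffers from the numerical instability of large finite p.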

Results
The performance of the trained model was assessed on a Jetson AGX Xavier onboard computer, where the coupled detection head of the original YOLO model was replaced with a decoupled head, which greatly accelerated training convergence. As mentioned, the robustness of the proposed model was observed on streaming video with various techniques for quantitative analysis. Lastly, several intensive flight tests were performed on a self-assembled quadrotor platform to evaluate the overall performance.

Detection Effect of the YOLOX Model
The YOLOX model outputs a predictive bounding box that classifies detected objects and marks their locations, which plays a key role in subsequent pose estimation by the UAV. The model training lasted for 1000 iterations, by which point the loss no longer decreased. Because the accuracy of the model is affected by the neural network resolution, the YOLOX model was trained with different resolutions (i.e., 416 × 416, 512 × 512, 640 × 640) to find the best performance. These input resolutions are compared in Table 1. Furthermore, the surveillance performed by UAVs relies on real-time perception, which focuses on object detection and tracking. In this case, detection speed and accuracy have to be balanced to ensure consistent detection and tracking, in which delay is negligible and accuracy is high enough. Thus, the YOLOX and YOLOv4 models at the same network resolution were compared to examine their accuracy and speed.
After training, the model's performance in detecting the target Procapra przewalskii was evaluated on real-time videos captured by an Intel RealSense D435i stereo camera. The trained model proved robust under various environments and exhibited few false positives and negatives. Procapra przewalskii tracking was then assessed after the validity of the model was assured.

Tracking Performance on Target
In this study, the LSTM-KF model was employed to track and identify the behaviors and gestures in the video sequence dataset of Procapra przewalskii. Six object-tracking sequences were generated from the Procapra przewalskii dataset, and the 6-DOF ground-truth attitude was available. The LSTM-KF model was trained with a learning rate of 2 × 10^−5, decaying by 0.95 from the second epoch. During training, gradients were propagated over 100 time steps using truncated backpropagation through time.
For a single-layer LSTM with 16 hidden units, however, the batch size is set to 2 and the learning rate to 5 × 10^−4. After the model is trained for 120 epochs, the gradient is again propagated over 10 time steps in the same way. The same applies to the standard LSTM method evaluated in this work.
The tracking algorithm is evaluated on successive depth-frame sequences by tracking the 3D pose of a 3D CAD model. All methods are compared here to obtain a target-tracking method superior to the existing ones. Table 2 displays the tracking and recognition results under this scenario.

Verification of Field Tracking Flight
To integrate perception with reaction and evaluate the surveillance system, four sites (the red points in Figure 4b) were selected to verify the tracking effect of the UAV on Procapra przewalskii. The parameter settings for the flight test are shown in Table 3.

Due to the complexity of the algorithm and the uncertainty of the research site, simulation verification and parameter adjustment of the algorithm are necessary before the actual flight.
Additionally, the estimated dynamic position is compared with the ground truth of the tracked Procapra przewalskii. Figures 12 and 13 demonstrate that the system can track the poses of Procapra przewalskii in 3D space. Despite jittering and occasional drift, the Procapra przewalskii can be relocated accurately after several frames. In addition, the figures show that the error is basically within 0.5 m on all axes of the world frame.
In addition, the root-mean-square error (RMSE) and mean absolute error (MAE) (defined in Equations (28) and (29), respectively) were calculated, as presented in Table 4. Here, RMSE indicates the degree of prediction error generated by the model, with larger errors receiving heavier weights. MAE reflects the error between the predicted and actual values, and a smaller MAE corresponds to better model performance. y_i and ŷ_i in Equations (28) and (29) are the true value and the predicted value, respectively. Furthermore, during monitoring, the distance between the UAV and the target Procapra przewalskii changed constantly, so the accuracy at different distances required further analysis. As shown in Table 5, the performance of the proposed method remained essentially stable regardless of the distance between the UAV and the target Procapra przewalskii.
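For concreteness, Equations (28) and (29) correspond to the usual RMSE and MAE definitions, sketched here in plain Python (a generic implementation, not the paper's code):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error, Eq. (28): sqrt(mean((y_i - ŷ_i)^2))."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error, Eq. (29): mean(|y_i - ŷ_i|)."""
    n = len(y_true)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / n
```

The squaring inside RMSE is what gives large individual errors heavier weight, as noted above.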

Discussion
It should be highlighted that if the input resolution is too large, the best possible mAP increases, but the training and detection speed are negatively affected. Therefore, a higher input resolution was not trained in this work, because the speed and accuracy currently obtained at 640 × 640 are acceptable, with an mAP of 88.67% at an intersection-over-union threshold of 0.50 (AP50). Unfortunately, the FPS is not as high as at a resolution of 512 × 512. Meanwhile, the mAP of the YOLOX model was found to be lower than that of the YOLOv4 model, but its FPS was higher. Since FPS matters more for real-time prediction, the YOLOX model with a higher FPS was selected to balance accuracy and speed. Table 2 clearly shows that the motion models that do not exploit the training data, i.e., Kalman Vel, Kalman Acc, and EMA, yield no meaningful improvement in translation and rotation estimation. However, the improved LSTM-KF model proposed in this paper performs better in predicting the target position, achieving a mean error of 0.82 mm, a 61.26% improvement over the original estimate; it also exhibits a lower average error than the results in [28], which used the KF algorithm alone for target tracking. In addition, the LSTM-KF model greatly improves on the original measurements, outperforming the standard LSTM by an average of 14% across all actions compared with the state-of-the-art method. In contrast, the standard LSTM method estimates position and rotation with errors so large that they fail to meet the requirements.
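As a point of reference for the Kalman Vel baseline in Table 2, a minimal constant-velocity Kalman filter for a single position coordinate might look like the sketch below (a generic textbook formulation with assumed noise parameters, not the paper's implementation):

```python
import numpy as np

def kalman_vel(measurements, dt=1.0, q=1e-4, r=1e-2):
    """Constant-velocity Kalman filter: state x = [position, velocity].
    No learning is involved; the motion and noise models are fixed a priori,
    which is why such baselines cannot adapt to the training data."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity transition
    H = np.array([[1.0, 0.0]])              # only position is observed
    Q = q * np.eye(2)                       # process-noise covariance
    R = np.array([[r]])                     # measurement-noise covariance
    x = np.zeros(2)
    P = np.eye(2)
    estimates = []
    for z in measurements:
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # update
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (np.array([z]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        estimates.append(x[0])
    return estimates
```

The LSTM-KF replaces these hand-specified F, Q, and R components with quantities learned by the three LSTM networks (f, Q, and R), which is the source of its advantage over such fixed motion models.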
As shown in Tables 3 and 4, the 3D object-pose-estimation systems of other scholars [29,30] focus on objects in static states; the model applied in this paper estimates the dynamic position of Procapra przewalskii in real time and therefore exhibits higher errors. However, it possesses better robustness and is also more accurate than the model in [28], which uses the KF algorithm alone for 3D pose estimation of the target. In addition, it reduces the dependence on motion and noise models specified a priori by the modeler. Compared with the model in [31], which applies an improved spatio-temporal context algorithm to 3D pose estimation of dynamic targets, the proposed model exhibits higher accuracy, faster network convergence, and less sensitivity to measurement noise. Overall, the proposed LSTM-KF performs better in pose estimation than either the KF algorithm alone or the improved spatio-temporal context algorithm. If the target animal, Procapra przewalskii, moves suddenly, redundant overshoot periods follow, slightly affecting the overall performance of the model. Nevertheless, the results further prove that the proposed model can be applied to autonomous UAV monitoring systems in a real-time and manipulable manner.

Conclusions
In this paper, a deep-learning-based model was employed to build an autonomous UAV tracking system to help monitor Procapra przewalskii. The LSTM-KF model is proposed and applied to track the target, and the YOLOX model is employed to identify it. The two are combined to estimate the pose of Procapra przewalskii in 3D images, thus improving object-tracking performance. In addition, the three standard LSTM networks modeled in the LSTM-KF model are optimized by modifying and connecting the computation of Q_t. During the training iterations of the LSTM-KF model, the Adam optimizer is improved by extending the L2 criterion-based update rule to an Lp criterion-based rule. The results show that the improved LSTM-KF model achieves the best result in predicting the target position, a mean error of 0.82 mm, a 61.26% improvement over the original estimate, significantly improving the accuracy of the measurement results. In addition, the YOLOX model exhibits an mAP of 93.25% and an FPS of 13.56 on images with 640 × 640 resolution, which are higher than those of the YOLOv4 model. Overall, the proposed improved LSTM-KF model is robust, valid, and reliable for animal tracking, recognition, and pose estimation.
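The extension of Adam from an L2- to an Lp-based update rule mentioned above can be sketched as follows: the second-moment accumulator uses |g|^p instead of g^2, and the denominator takes the p-th root. This is a generic single-parameter sketch under assumed hyperparameters, not the authors' exact optimizer; p = 2 recovers standard Adam.

```python
def adam_lp_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, p=2.0, eps=1e-8):
    """One Adam step with the L2 update rule generalized to Lp:
    v accumulates |g|**p and the denominator is the p-th root of v-hat."""
    m = b1 * m + (1 - b1) * g                 # first moment (unchanged)
    v = b2 * v + (1 - b2) * abs(g) ** p       # p-th-power "second" moment
    m_hat = m / (1 - b1 ** t)                 # bias corrections
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat ** (1.0 / p) + eps)
    return w, m, v
```

With p = 2 and a first gradient of 1.0 from w = 0, the first step moves w by approximately −lr, matching standard Adam's behavior.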
However, this work has several limitations for Procapra przewalskii tracking. For example, when UAVs track dense herds of Procapra przewalskii, accuracy decreases when individuals occlude each other. In future research, we will attempt to solve this problem using depth-sorting-based algorithms. In addition, a visual servo controller could be designed to control the UAV to explore the environment or avoid obstacles using only one camera, ensuring that the tracked object always remains in the camera's field of view.