Transformer-Based Maneuvering Target Tracking

When tracking maneuvering targets, recurrent neural networks (RNNs), especially long short-term memory (LSTM) networks, are widely applied to sequentially capture the motion states of targets from observations. However, LSTMs can only extract features of trajectories stepwise; thus, their modeling of maneuvering motion lacks globality. Meanwhile, trajectory datasets are often generated within a large, but fixed distance range. Therefore, the uncertainty of the initial position of targets increases the complexity of network training, and the fixed distance range reduces the generalization of the network to trajectories outside the dataset. In this study, we propose a transformer-based network (TBN) that consists of an encoder part (transformer layers) and a decoder part (one-dimensional convolutional layers), to track maneuvering targets. Assisted by the attention mechanism of the transformer network, the TBN can capture the long short-term dependencies of target states from a global perspective. Moreover, we propose a center–max normalization to reduce the complexity of TBN training and improve its generalization. The experimental results show that our proposed methods outperform the LSTM-based tracking network.


Introduction
With the rapid development of the electronic information industry, target tracking technology has been increasingly used in the military and civilian fields. The target tracking task aims to estimate the state of the target based on data measured by sensors. It can be classified into maneuvering and non-maneuvering target tracking, where "maneuvering" refers to the case in which the target suddenly changes its motion state. For the tracking of maneuvering targets, the interactive multi-model (IMM) algorithm, which uses multiple models to fit complex motion states, is considered [1]. Therefore, many tracking algorithms proposed subsequently were based on the IMM [2][3][4]. However, IMM-based algorithms are associated with the mismatch problem between the set of models and the target motion states. Furthermore, when the motion state of the target changes, a specific number of observations must be accumulated, resulting in the model estimation delay problem [5].
The development of deep neural networks, especially recurrent neural networks (RNNs) with memory ability, provides novel ideas to solve the problems of IMM-based algorithms [6][7][8][9]. The RNN [10] and long short-term memory (LSTM) networks [11] can estimate the state from the observation at each time step [6,12]. Nevertheless, the LSTM and RNN can only process the input sequence sequentially, resulting in long-distance memory fading problems [9]. Thus, the LSTM and RNN may reduce the correlation between trajectory points at different locations, which subjectively influences the modeling of maneuvering states. In addition, trajectory datasets are usually collected in a fixed-range coordinate system and preprocessed with min-max normalization [8,13,14]. However, the same maneuvering state in the dataset may correspond to trajectories with different initial positions, which increases the complexity of network learning. Moreover, the fixed distance range reduces the generalization of the network.
In this study, to accurately model and estimate the states of maneuvering targets, we propose a transformer-based network (TBN). Specifically, our proposed network applies the transformer network as an encoder to extract global features of the observation sequence. Simultaneously, 1D convolutional networks are applied as a decoder to estimate the state sequence from the features. Compared with the LSTM network, which processes observations sequentially, the TBN associates the observations at all positions and applies an attention mechanism to model their dependencies [15]. Thus, the features of the observations can be represented independently without regard to their position in the sequence [16][17][18]. Therefore, the TBN has better feature representation and global memory ability than LSTM [18]. Moreover, a learnable positional embedding is added to the input of the TBN to explore the temporal features of the observation sequence. Finally, a novel center-max normalization is applied by the TBN to improve generalization. Compared with the min-max normalization, our proposed center-max normalization transforms the trajectories from a fixed to a relative coordinate system with the initial observation point as the origin. The experimental results demonstrate that center-max normalization considerably increases the generalization of the TBN to trajectories with different distance ranges. Furthermore, center-max normalization also promotes the tracking performance of the TBN by reducing the complexity of trajectory learning.

Problem Formulation
Based on the previous research on maneuvering target tracking [4,7,8,19], we mainly considered point targets tracked by radar in the X-Y plane. Meanwhile, the problem of target birth and death was not considered in this study. Therefore, we assumed that z k is the observation vector and x k is the state vector at the kth time step. Specifically, x k = c x,k , c y,k , v x,k , v y,k denotes the coordinates and corresponding velocities in the twodimensional scene, and z k = [θ k , d k ] denotes the azimuth and distance of the radar observation.
We intend to build a maneuvering target tracking model based on a deep neural network. The input to the model is the observation sequence z 1:K = {z 1 , z 2 , . . . , z K }, and the output is the estimated state sequencex 1:K = {x 1 ,x 2 , . . . ,x K }, where K is the total number of time steps. Given that target tracking is a regression problem, we used the root-mean-squared error (RMSE) between the normalized ground-truth sequence x * 1K = x * 1 , x * 2 , . . . , x * K and the estimated sequencex * 1:K = x * 1 ,x * 2 , . . . ,x * K as the loss function [9] to evaluate the model: In practice, obtaining a sufficient number of trajectories is difficult. Thus, we simulated segmented trajectories based on the state-space model (SSM) [20].
The SSM defines the state transition equation and observation equation as: where F is the transition matrix and n k is the transition noise. h is the nonlinear observation, and u k is the observed noise. In this study, two motion states were considered: constant velocity (CV) and constant turn (CT), as mentioned in [8]. The transition matrix of CV and CT is defined as: where w is the turn rate of the maneuvering target and τ is the sampling interval of the observations. According to [21], the transition noise where α k ∼ N 0, σ 2 a is the Gaussian noise caused by the maneuvering acceleration with zero mean and standard deviation σ a .
For radar tracking, Z k is defined as: where σ θ is the standard deviation of the azimuth and σ d is the standard deviation of the distance.

Proposed Model
In this section, we discuss the components of the TBN in detail. In Section 3.1, we introduce a trajectory normalization method named center-max normalization to improve generalization. In Section 3.2, the structure of the TBN is presented. In Section 3.3, we summarize the overall process of applying the TBN for maneuvering target tracking.

Center-Max Normalization
A trajectory of a maneuvering target is exhibited in Figure 1. The left of Figure 1 shows an observation sequence, which contains the distance and azimuth. To eliminate the dimensional difference between the observations, z 1K in the polar coordinates are converted toz 1K in the X-Y plane coordinates: Figure 1c shows a trajectory in the X-Y plane coordinates. The distance range and initial position of the targets may vary extensively; thus, we propose a center-max normalization mechanism to improve the generalization of the model and reduce the training complexity, as shown in Figure 2. This can be formulated as follows: wherez * k is the normalized observation at the kth time step,z 1 is the initial value ofz 1:K , and D max denotes the maximum distance that the targets can move within K time steps. In Equation (8), the observation sequence is normalized to [−1, 1] by dividing by D max . Subtractingz 1 , the observation sequencez 1:K is represented in a relative coordinate system withz 1 as the origin. Benefiting from center-max normalization, the TBN only needs to focus on learning different maneuvers of the target without considering the influence of the initial position. Therefore, the tracking of maneuvering targets by the TBN is not limited by the detection range. Correspondingly, the ground-truth state sequence x 1:K is normalized as follows: where x * 1:K is the normalized state sequence, c x is the centering vector corresponding to x * 1:K , z x,1 ,z y,1 is the position component ofz 1 , and V max is the maximum speed of the simulation targets.

Proposed Network
In sequence modeling tasks, the LSTM network sequentially extracts features. However, the transformer network uses the self-attention mechanism to process input data in parallel, which can capture both local and global dependencies. Therefore, we innovatively introduced it to the target tracking task to comprehensively capture the internal law of target maneuvering. Our proposed TBN consists of positional encoding, N-stacked transformer encoder layers, and one convolutional decoder layer. Each transformer encoder layer contains multi-head self-attention, a feedforward fully connected network, and two residual connections after each of the previous blocks. For intuitive understanding, the entire architecture of the TBN is shown in Figure 3. . Architecture of the TBN. Input data are normalized observation sequencesz * 1:K .z * 1:K is first mapped to s * 1:K , whose dimension is E × K by positional encoding. The encoder consists of N-stacked multi-head self-attention and fully connected feedforward layers, which aim at extracting the features of s * 1:K . The decoder maps E-dimensional feature vectors to the normalized state sequencex * 1:K by two 1D convolutional layers.

Positional Encoding
In natural language processing tasks, the transformer network adds positional encoding to the input tokens to represent their relative or absolute positions in the sequence [15]. However, in this study, the input to the TBN is numeric. Therefore, the learnable positional encoding mentioned [22] is added to the input of the network as follows: where F is the sine function, w i and ϕ i are learnable parameters that mapz * k to an Edimensional representation space, and s * 1:K [i] is the encoding result of the ith subspace.

Multi-Head Self-Attention
Self-attention is the core of the TBN. First, the input encoding sequence s * 1:K ∈ R E×K of self-attention is linearly mapped into the sequences "query" (Q), "keys" (K), and "values" (V) as follows: where W Q , W K , and W V ∈ R E×E are learnable matrices. Furthermore, Q, K, and V are split into M subsequences along dimension E, and M attention heads are obtained by the interaction of the elements at any two positions in each subsequence: where Q m , K m , V m ∈ R E m ×K and E m = E M . Finally, M attention heads are concatenated to compose the multi-head self-attention: Thus, the network is allowed to capture more information from different representation subspaces at different positions.

Feedforward Layer
After the multi-head self-attention, a feedforward layer consisting of two fully connected layers is used to linearly transform each position of s attention .
In the decoder part, two 1D convolutional layers are used to output the final trajectory estimationx 1:K , and the parameters of the network are trained by minimizing Equation (1) using the mini-batch gradient descent.

Maneuvering Target Tracking Based on the TBN
When the well-trained TBN is applied to track a complete trajectory, all observations are first segmented with window length K = 10 and step size P = 5. These segmented observation sequences are then normalized sequentially and passed to the TBN to estimate the corresponding state sequence set x 1+rP 1:K * , r ∈ (0, . . . , R − 1) , where x 1+rP 1:K * denotes the normalized state sequence output at time step (1 + rP) and R is the number of sequences. Subsequently, x 1+rP 1:K * needs to be denormalized as follows: where c r x is the centering vector of the rth state sequence. In addition, adjacent state sequencesx 1+rP 1:K andx 1+(r+1)P 1:K are merged together. Letx 1+rP 1:2K−P denote the merge result of two above-mentioned state sequences, whose length is 2K − P. Thus, the overlapped regions ofx 1+rP 1:2K−P are calculated as follows: x 1+rP P:K = 0.5 x 1+rP P:K +x Finally, all state sequences in the set x 1+rP 1:K , r ∈ (0, . . . , R − 1) are merged in turn to obtain complete state estimates. Figure 4 illustrates the overall tracking process, including the observations segmentation, center-max normalization processing, network estimation, denormalization processing, and segmented state sequences concatenation. Figure 4. Structure of the transformer-based maneuvering target tracking. The observation sequences of targets are firstly segmented into subsequences z 1:K of length K with step size P. After that, z 1:K are converted toz * 1:K by center-max normalization. Then, the TBN infers the normalized trajectorŷ x * 1:K fromz * 1:K . In addition,z * 1:K are de-normalized tox 1:K . Finally, the overlapped region ofx 1:K is averaged and concatenated to obtain the estimation of the complete state sequences.

Experiments and Results
In this section, we list the parameters of the trajectory dataset and the TBN. Several experiments were designed to test the tracking performance of our proposed model.

Implementation Details
Dataset: We generated 300,000 trajectories based on the SSM as a dataset. The parameters of the trajectory dataset are listed in Table 1. In addition, we assumed normalization parameters: D max = 3 km, V max = 300 m/s, and targets were observed every 1 s.
Hyper-parameters: Our network consists of four encoder layers, with eight attention heads. The dimension of E was 512. The output dimensions of the 1D convolutional layer in the decoder were 64 and 4, respectively. The model was trained using the Adam optimizer [23] with β 1 = 0.9, β 2 = 0.98, and ε = 10 −9 . The learning rate was linear warmedup for the first 10 epochs and decayed subsequently based on the dynamic adjustment strategy mentioned in [15]. We trained 300 epochs with a batch size of 64 on a single NVIDIA TITAN Xp GPU.
Baseline: We compared the TBN+center-max normalization (TBN+CM) model with the IMM algorithm [19] and the LSTM+min-max normalization (LSTM+MM) tracking model [8]. As a comparison, we also built the LSTM+center-max normalization (LSTM+CM) model. The LSTM network consisted of four hidden layers with a dimension of 128, as mentioned in [8]. The same dataset was used to train the above networks.

Results
We first compared the performances of the LSTM+MM, LSTM+CM, and TBN+CM based on a test set containing 20,000 segmented trajectories. The tracking results are listed in Table 2. In Table 2, the position and velocity RMSEs of the LSTM+CM are smaller than that of the LSTM+MM, which proves that our proposed center-max normalization improved the tracking capability of the network by reducing the complexity of trajectory learning. At the same time, the TBN+CM achieved the smallest position and velocity RMSE. Thus, it can be concluded that the TBN yields better performance than LSTM when tracking segmented trajectories. We then simulated a target with the initial states of [2 km, 2 km, 50 m/s, 0 m/s] and steering rates equal to 0 • and conducted Monte Carlo simulations to generate a 60-step trajectory named A1. The target maneuvers had turn rates equal to −1 • and 3 • at the 10th step and the 40th steps, respectively. In addition, the standard deviations of acceleration, azimuth, and distance noise were set to 5 m/s 2 , 0.2 • , and 5 m. We evaluated the TBN+CM, LSTM+MM, LSTM+CM, and IMM algorithms on trajectory A1. The tracking results are listed in Table 3 and Figure 5. Among the listed figures, Figure 5a shows how well the algorithms tracked the target. Figure 5b,c show the pointwise RMSE of trajectory A1. Furthermore, the average RMSEs of trajectory A1 are listed in Table 3. In Table 3, the RMSEs of the LSTM+CM are smaller than those of the LSTM+MM, which proves that our proposed center-max normalization improved the tracking capability of the network by reducing the complexity of trajectory learning. At the same time, the bolded results in Table 3 indicate that TBN+CM had the smallest tracking error. The experiments above demonstrated the superiority of the TBN+CM in tracking maneuvering targets.  In addition, the initial position of trajectory A1 was moved to [12 km, 12 km] and [15 km, 15 km] to obtain trajectories A2 and A3. We conducted generalization experiments on trajectories A2 and A3, as listed in Table 4. The bolded results in Table 4 demonstrated that our proposed TBN+CM can generalize to tracking trajectories beyond the preset distance. However, the LSTM+MM led to tracking failure due to its fixed normalization mechanism.

Conclusions
In this study, we employed the attention mechanism of the transformer network to extract a comprehensive tracking of trajectories and finally developed a novel network named the TBN for radar target tracking missions. Furthermore, our proposed center-max normalization improved the generalization of the network by processing observations in a relative coordinate system. It can be seen from the experimental results that, when tracking maneuvering targets, our proposed TBN model obtained lower RMSEs of position and velocity than the LSTM model, and the TBN model can still work normally when the observation sequence is missing; however, the LSTM model will not be available. Therefore, our algorithm outperformed existing LSTM-based tracking networks and traditional algorithms.