Online Multiple Athlete Tracking with Pose-Based Long-Term Temporal Dependencies

This paper addresses the Multi-Athlete Tracking (MAT) problem, which plays a crucial role in sports video analysis. MAT poses specific challenges, e.g., athletes share a high similarity in appearance and frequently occlude each other, rendering existing approaches ineffective for this task. To address this problem, we propose a novel online multiple athlete tracking approach which makes use of long-term temporal pose dynamics to better distinguish different athletes. Firstly, we design a Pose-based Triple Stream Network (PTSN) based on Long Short-Term Memory (LSTM) networks, capable of modeling long-term temporal pose dynamics of athletes, including pose-based appearance, motion and athletes' interaction clues. Secondly, we propose a multi-state online matching algorithm based on bipartite graph matching and the similarity scores produced by PTSN. It is robust to noisy detections and occlusions thanks to reliable transitions between multiple detection states. We evaluate our method on the APIDIS, NCAA Basketball and VolleyTrack databases, and the experiment results demonstrate its effectiveness.


Introduction
In recent years, sports video analysis has received increasing attention in academia and industry due to its scientific challenges and promising applications. It covers various application scenarios and research directions, including automatic game commentary, tactical analysis, player statistics, etc. Among these directions, athlete tracking is fundamental and critical for sports video analysis.
In the literature, several early attempts [1][2][3] focus on Multi-Athlete Tracking (MAT) in volleyball, basketball and soccer game videos. They generally apply additional preprocessing to simplify tracking based on extra clues within the sports fields, e.g., site boundaries and static cameras. Xing et al. [3] build an observation model to classify the playfield and non-playfield regions with color information, providing convenience for athlete localization. Gomez et al. [1] make use of static cameras to separate foreground and background easily, thus producing clean athlete targets for tracking. However, these methods are probably not suitable for complex scenes in the wild, which contain variations in background and illumination.
Actually, most efforts are made to track multiple athletes [4,5] following the typical Multi-Object Tracking (MOT) framework, which employs the tracking-by-detection paradigm, i.e., associating the prepared detections in terms of different objects at each frame. The MOT methods can be roughly categorized into two branches, i.e., online matching based and offline association based, according to whether using subsequent frames.
Offline association approaches often show better tracking performance because additional information is employed; however, they always require higher computational cost, while online matching approaches have advantages in processing speed and applicability. Popular benchmarks, e.g., PETS, UA-DETRAC, and MOTChallenge, pay more attention to pedestrians and vehicles in surveillance scenes, which have stable appearances and motions.
Indeed, in real sports scenes, there exist some specific difficulties: (1) athletes share a high similarity in appearance (dressing, figure, etc.), and they frequently occlude each other; (2) athletes often undergo abrupt position changes and complex actions. These facts make the existing MOT methods [6][7][8][9], especially those focusing on appearance and simple motion clues, lose efficacy. When playing games, athletes usually exhibit their own specific pose dynamics, which are distinct from each other within a period of time. For example, when the setter is passing the ball, the spiker is waiting to attack. This brings us the idea that long-term temporal pose dynamics may help to distinguish different athletes. Based on the considerations above, we propose a novel multiple athlete tracking approach to this issue, following the popular online tracking-by-detection paradigm.
Specifically, we first design a Pose-based Triple Stream Network (PTSN) based on Long Short-Term Memory (LSTM) networks, capable of modeling long-term temporal pose dynamics of athletes. Moreover, to capture more subtle differences between athletes, we enrich the pose dynamic into three clues, i.e., pose-based appearance, motion and interactions of athletes, which are modeled by three network streams in PTSN. Given the history tracklet and current detection, PTSN could generate robust affinity between them according to the degree of dependency. Second, we design a multi-state online matching algorithm based on bipartite graph matching. It uses the affinities produced by PTSN to associate the athlete detections frame by frame and finally accomplish the tracking. More importantly, in the online association, we define multiple detection states and build reliable transitions, boosting the association robustness to noisy detections and occlusion.
In summary, the main contributions of this paper are highlighted as follows: • We propose a Pose-based Triple Stream Network (PTSN) based on Long Short-Term Memory (LSTM) networks, capable of modeling long-term temporal pose dynamics of athletes and generating robust association affinities. • We design a multi-state online matching algorithm based on multiple detection states and reliable transitions with the association affinities, improving the robustness to noisy detections and occlusion. • We evaluate our method by comparing it with recently proposed advanced multi-object trackers on the APIDIS, NCAA Basketball and VolleyTrack databases, and the experiment results demonstrate the effectiveness of our method.
A preliminary version of this work is presented in [10]. Compared with the conference version, the extensions include: (1) more details of the proposed approach and related works are explained; (2) more experiments on APIDIS and NCAA databases are conducted and more results are displayed and discussed.
The rest of this paper is organized as follows. In Section 2, related works on athlete tracking and multiple object tracking are discussed. Section 3 describes our MAT approach in detail, including the Pose-based Triple Stream Network and the multi-state online matching algorithm. In Section 4, databases, experimental results and analysis are presented. Finally, Section 5 draws a conclusion for the paper with perspectives.

Athlete Tracking
In the past few decades, although a few works focus on single athlete tracking [11], most studies address the Multi-Athlete Tracking problem. Some early attempts make use of specific clues in sports fields (e.g., site boundaries) to facilitate athlete tracking. Xing et al. [3] introduce a progressive observation modeling strategy to classify the playfield and non-playfield regions with color information. Based on the classification results, tracking is achieved by a Bayesian inference approach. Mauthner [2] and Gomez [1] use particle filters to predict positions and velocities of players in beach volleyball games. They separate foreground and background to make athlete modeling easier, but background cues that are potentially useful for improving the stability and accuracy of trackers may be lost. Moreover, those methods may lack practicability in more complex scenes due to variations in background and illumination. Meanwhile, more studies treat MAT as an MOT problem and process it in the tracking-by-detection framework. Liu et al. [4] track players in basketball and hockey game videos from the perspective of tactics analysis. They try to predict all the possible moving directions of players, but this may incur failure due to the infinite possibilities. Shitrit et al. [5] propose a Multi-Commodity Network Flow approach to track multiple players in basketball and soccer games. They exploit both appearance and position clues to prevent identity switches, and report promising results.

Multi-Object Tracking
There exist a number of approaches attempting to address the MOT problem, and they can be roughly divided into traditional ones and deep learning based ones. Further, traditional methods can be classified into online matching based [12][13][14][15] and offline association based [16][17][18][19][20][21]. The offline association approaches take advantage of global information and prove more robust to complex scenes. They cast the task of MOT as a graphical model solving problem, where the graphs are built from a set of detections or tracklets. Many globally optimal techniques, e.g., k-shortest paths in DP_NMS [18] and Linear Programming in ELP [20], are used to solve the graphical models. Recently, more robust pairwise affinities based on strong appearance cues, e.g., sparse appearance models and integral channel feature appearance, have proved effective for association. This leads online matching methods, e.g., MHT_DAM [9] and LINF1 [22], to achieve state-of-the-art performance.
More recently, Yu et al. [23] took advantage of high-performance detection and the representative features of CNNs, and achieved significantly better results on MOTChallenge [24,25] in both online and offline modes; Leal-Taixé et al. [26] defined a new CNN-based structure for people appearance representation to build effective relations between detections. Sadeghian et al. [6] presented a structured Recurrent Neural Network (RNN) based architecture that reasons jointly over multiple cues within a temporal window. The deep models used in those methods largely enhance the tracklet affinity and improve the performance of both online matching and offline association. However, most existing MOT methods pay more attention to pedestrian tracking in surveillance, with little investigation of multiple athlete tracking in sports videos.

Human Pose Estimation
In the literature, many studies [27][28][29] explore the pose clue for person tracking because of its rich semantic information, taking full advantage of the human pose estimation (HPE) task [30][31][32][33]. Recent human pose estimation methods can be classified into bottom-up and top-down ones. Bottom-up approaches [34,35] first detect individual body joints and then group them into people, while top-down ones [32,33] first detect people's bounding boxes and then predict joint locations within each region. Top-down approaches do not need any joint grouping thanks to the detectors; correspondingly, bottom-up approaches avoid this reliance on a detector. Following the Tracking-By-Detection (TBD) paradigm, some methods [28,29] use the top-down technique to produce the human pose. However, they only use the joint features, which do not generalize well for representing persons. In our work, we enrich the pose dynamic into multiple pose-based clues, which improves the representation ability.

Multi-Athlete Tracking Approach
As mentioned in Section 1, MAT can be treated as a special case of MOT, which aims at estimating the successive states (e.g., position and size, or bounding box) of concerned objects (e.g., pedestrians and vehicles) in video frames. Correspondingly, we follow the commonly used Tracking-By-Detection (TBD) paradigm to solve the MAT problem, i.e., given athlete detections (produced by a Faster R-CNN [36] detector) at each frame, we aim at matching their identities across frames to generate a set of athlete trajectories over time.
Our framework, as shown in Figure 1, consists of the Pose-based Triple Stream Networks (PTSN) and a multi-state online matching algorithm. PTSN is responsible for calculating the similarity scores between the history tracklets and the candidate detection in the current frame, where the scores come from three network streams that model three pose clues, i.e., pose-based appearance, motions and athletes' interactions. Based on the similarity scores, the multi-state online matching algorithm generates final trajectories for athletes by tracklet/detection matching and state transitions. Details of our proposed PTSN and the matching algorithm are described in the following.
The three streams of PTSN output the similarity scores φ_PA(τ_i, b_j), φ_PM(τ_i, b_j) and φ_PI(τ_i, b_j), respectively, which are further fused into the final score φ(τ_i, b_j) connecting the history tracklet τ_i and the current detection b_j. The details of the three streams of PTSN will be explained in the following three subsections.
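The paper does not specify the fusion rule for the three stream scores; a minimal sketch, assuming a simple weighted average (the function name and the uniform weights are placeholders, not values from the paper):

```python
# Hypothetical fusion of the three PTSN stream scores into the final
# affinity phi(tau_i, b_j). A weighted average is one of the simplest
# choices; the actual fusion used by the authors may differ.
def fuse_affinity(phi_pa, phi_pm, phi_pi, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine pose-based appearance, motion and interaction scores."""
    w_pa, w_pm, w_pi = weights
    return w_pa * phi_pa + w_pm * phi_pm + w_pi * phi_pi
```

With uniform weights this reduces to the arithmetic mean of the three stream outputs.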
To encode the long-term dependencies of sequences, we use LSTM as the main structure in our networks. Unlike popular graph-based tracking methods [14,16,37], whose similarity scores are only calculated in the previous frame, our method could capture the long-term dependencies of targets by inferring from the observation sequences of variable length.

Pose-Based Appearance Stream (PAS)
As introduced above, athletes share very similar appearances when playing games. To improve the discrimination among athletes, besides the appearance cues, we employ pose and position information. The Pose-based Appearance Stream (PAS) is then designed, as shown in Figure 2, which models the long-term evolution of pose and appearance. It takes a tracklet and a detection as input and estimates their affinity, i.e., it determines whether the candidate detection box contains the same athlete as the tracklet. A tracklet is the sequence of bounding boxes of a tracked athlete's trajectory at timesteps 1, ..., t, i.e., τ_i = (b_i^1, b_i^2, ..., b_i^t), and the candidate bounding box b_j^{t+1} is a detection at time t + 1.
Specifically, we concatenate the tracklet and the detection to construct a new detection sequence, and extract pose-based appearance features that are fed into the LSTM network to model the long-term evolutions. A softmax classifier on top of the LSTM then casts affinity estimation as a binary classification problem, estimating how confident we are that the evolutions come from the same athlete. At each frame, the pose and appearance features of the detected athlete are extracted by the feature extractor, where the pose coordinates are mapped into the original frames, thereby including the athlete positions. We use the output of the last layer of the Hourglass Networks [32] as the pose feature φ_p. For the athlete in a bounding box, we use the Hourglass Networks model pre-trained on the COCO dataset to generate the x and y coordinates of 16 pose joints, leading to an H-dimensional pose feature. We start with the initial weights of ResNet-101 pre-trained on ImageNet and add a new fully connected layer on top of the structure to extract an H-dimensional appearance feature φ_a. Given the detection sequence, the extracted pose-based appearance feature sequence is φ_pa = (φ_pa^1, ..., φ_pa^t, φ_pa^{t+1}), where φ_pa^{t+1} is the feature of the candidate detection. The LSTM takes the feature sequence as input and models the long-term dependencies of pose-based appearance, carried by a hidden representation h_pa. h_pa is then fed into a softmax classification layer (a fully-connected layer plus a softmax layer) to output the score φ_PA(τ_i, b_j), i.e., the tracklet-detection affinity.
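As a rough illustration of how an LSTM followed by a softmax head can turn a per-frame feature sequence into a tracklet-detection affinity, here is a minimal NumPy sketch. The class name, dimensions, and random weights are illustrative assumptions, not the authors' trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMAffinity:
    """Toy single-layer LSTM + softmax head over the final hidden state.
    Weights are random stand-ins; a real PAS would be trained with
    cross-entropy loss on positive/negative tracklet-detection pairs."""

    def __init__(self, in_dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Fused gate weights: input, forget, cell, output gates stacked.
        self.W = rng.standard_normal((4 * hidden, in_dim + hidden)) * 0.1
        self.b = np.zeros(4 * hidden)
        self.W_cls = rng.standard_normal((2, hidden)) * 0.1  # softmax head
        self.hidden = hidden

    def forward(self, seq):
        H = self.hidden
        h = np.zeros(H)
        c = np.zeros(H)
        for x in seq:  # step through the sequence, accumulating state
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, g, o = z[:H], z[H:2 * H], z[2 * H:3 * H], z[3 * H:]
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        logits = self.W_cls @ h
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return p[1]  # probability that detection matches the tracklet
```

The final hidden state plays the role of h_pa in the text; feeding it through the two-way softmax yields a score in [0, 1] usable as φ_PA(τ_i, b_j).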

Pose-Based Motion Stream (PMS)
Motion information is another critical clue for distinguishing athletes; for instance, the velocity variations of athletes usually differ significantly. We design the Pose-based Motion Stream (PMS), illustrated in Figure 3, which estimates the tracklet-detection affinity based on athlete motion dynamics. Instead of representing motion by the velocity of the center of an athlete's bounding box, PMS computes the velocity of each joint, delivering a more comprehensive representation of the athlete.
PMS consists of a pose detector, a motion detector, and an LSTM network followed by a softmax classification layer. The pose detector is the same as the one in PAS. For each image region in the tracklet, the 16 pose joints of an athlete, i.e., their x, y coordinates in the original image, are extracted. Based on the pose coordinates, we first calculate the joint velocities and transform them into an embedding space. Let V_{ik}^{pm(t)} denote the velocity of the kth joint of the ith athlete at the tth frame. It can be calculated as follows:

V_{ik}^{pm(t)} = (X_{ik}^{t} − X_{ik}^{t−1}, Y_{ik}^{t} − Y_{ik}^{t−1}),

where (X_{ik}^{t}, Y_{ik}^{t}) are the 2D coordinates of the ith athlete's kth joint at the tth frame.
To improve the abstraction ability of PMS, we transform the joint velocities V_{ik}^{pm(t)} into an embedding space of higher dimension by a fully-connected layer. Given the input tracklet τ_i and the candidate detection b_j, after pose/motion extraction we obtain the velocity feature sequence φ_pm = (φ_pm^1, ..., φ_pm^t, φ_pm^{t+1}), where φ_pm^t is an H-dimensional feature vector at frame t. Then φ_pm passes through an LSTM layer, whose hidden state h_m carries the dynamics of the pose-based motion information. We take the final time-step of h_m as the global representation and feed it into a softmax classification layer with cross-entropy loss to predict the tracklet-detection affinity score φ_PM(τ_i, b_j).
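The velocity computation above can be sketched directly; the linear embedding weights `W` below are a hypothetical stand-in for the fully-connected layer the paper trains:

```python
import numpy as np

def joint_velocities(joints):
    """Frame-to-frame displacement of each joint.
    joints: (T, 16, 2) array of per-frame (x, y) joint coordinates.
    Returns a (T-1, 16, 2) array of velocities V_{ik}^{pm(t)}."""
    return joints[1:] - joints[:-1]

def embed_velocities(vel, W):
    """Map flattened joint velocities into the embedding space via a
    linear layer. vel: (T-1, 16, 2); W: (H, 32) assumed weight matrix."""
    flat = vel.reshape(vel.shape[0], -1)  # (T-1, 32)
    return flat @ W.T                     # (T-1, H)
```

Each frame contributes a 32-dimensional flattened velocity vector (16 joints × 2 coordinates), which the linear layer projects to H dimensions before the LSTM.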

Pose-Based Interaction Stream (PIS)
Both pose-based appearance and motion clues represent individual athletes; however, they lose the contextual structure, which is also important discriminative information. Therefore, we propose the Pose-based Interaction Stream (PIS), as shown in Figure 4, which captures the interactions between a specific athlete and the ones around him/her. Since the number of neighbors can vary, in order to obtain an input of fixed size, we model the neighborhood of each target as a fixed-size grid and design the Interaction Grid (IG). In the IG, each athlete is represented by six main joint positions: head, left wrist, right wrist, left ankle, right ankle and the mean of all joint positions. Let (IG_i^1, IG_i^2, ..., IG_i^t) be the interaction grids of the ith athlete at timesteps 1, ..., t. The pose joint positions are pooled into the grid, and IG_i^t can be defined as:

IG_i^t(m, n) = Σ_{j ∈ N_i} Σ_{(x, y) ∈ P_j} 1_mn[x, y],

where 1_mn[x, y] is an indicator function checking whether the athlete's joint at (x, y) falls in the (m, n) cell of the grid, N_i is the set of neighbors of athlete i with |N_i| = 3, and P_j is the set of joints of neighbor j. After building the interaction grids, we use a fully-connected layer that takes the interaction grids as input and produces an H-dimensional output for each timestep. Similar to PAS and PMS, we also map the interaction grid of candidate detection j at timestep t + 1 to an H-dimensional feature vector in the same way, forming the interaction feature sequence φ_pi = (φ_pi^1, ..., φ_pi^t, φ_pi^{t+1}). The feature sequence is fed into an LSTM layer to model the contextual structure, which is carried in its hidden state h_i. The final step of h_i is passed to a softmax classification layer with cross-entropy loss to predict the tracklet-detection affinity score φ_PI(τ_i, b_j).

Figure 4. The architecture of the Pose-based Interaction Stream (PIS). A pose detector is applied to obtain pose information of an athlete from the previous t frames. The interaction grids of this athlete are calculated over his/her 3 closest neighbors at each frame. Then an LSTM encodes interaction information for this athlete and compares it with the candidate boxes b_j generated by the detector at timestep t + 1. Finally, the LSTM outputs a similarity score φ_PI(τ_i, b_j) indicating the probability that a candidate box contains the same athlete.
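The grid pooling step can be illustrated concretely. The neighborhood window size and grid resolution below are hypothetical parameters (the paper does not state them); the indicator logic mirrors 1_mn[x, y]:

```python
import numpy as np

def interaction_grid(center, neighbor_joints, window=200.0, G=4):
    """Pool neighbours' joint positions into a fixed G x G grid covering
    a square window of half-width `window` around the target athlete.
    center: (x, y) of the target; neighbor_joints: list of (J, 2) arrays,
    one per neighbour (|N_i| = 3 in the paper). Returns (G, G) counts."""
    grid = np.zeros((G, G))
    cell = 2 * window / G
    cx, cy = center
    for joints in neighbor_joints:
        for x, y in joints:
            m = int((x - (cx - window)) // cell)
            n = int((y - (cy - window)) // cell)
            if 0 <= m < G and 0 <= n < G:  # indicator 1_mn[x, y]
                grid[m, n] += 1
    return grid
```

Joints outside the window are simply ignored, so athletes far from the target contribute nothing to its interaction representation.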

Multi-State Online Matching Algorithm
Based on the final fusion similarity score between a tracklet and detection, we design a multi-state online matching algorithm, shown in Algorithm 1, for accomplishing the final tracking in an online mode.
The state transition diagram of our tracker is shown in Figure 5. First of all, the sets of bounding boxes of each frame {B_0, B_1, B_2, ..., B_{T−1}} are filtered via a Non-Maximum Suppression (NMS) operation. High-score bounding boxes are passed to the next step (operation a1), whereas low-score bounding boxes are sent to the dead-tracklet container C_die, terminating their life cycles (operation a2). Each high-score bounding box b_j is fed into PTSN together with each tracked tracklet τ_i in the active tracklet container C_active, which consists of the tracklets {τ_1, τ_2, ..., τ_n} from previous frames, to calculate the tracklet-detection affinity. If the affinity of a detection exceeds the threshold σ_PTSN, it updates the old τ_i in C_active (operation a3) by appending b_j to τ_i. Then a bounding-box propagation operation is applied to the lost and active containers to predict the next bounding box of τ_i in the subsequent frame, according to the velocity of τ_i (operation a4). If the prediction cannot catch the next detection, both b_j and the old τ_i are sent to the next step (operation a5), marking τ_i as a missing tracklet. The remaining detections are compared with the tracklets in C_lost for target recovery; this is done for every b_j. If they match successfully (affinity score over σ_PTSN), the tracklets are sent back to C_active (operation a6), and the bounding-box propagation operation again predicts the next bounding box of τ_i in the following frame according to the velocity of τ_i (operation a4). If they fail to match and the waiting time exceeds the hyper-parameter δ_waiting, they are sent to C_die, ending their life cycles (operation a7). After that, each remaining bounding box forms a new tracklet and waits for a match in C_lost.
When the bounding boxes of the last frame B_{T−1} have been processed, all tracklets in C_active that are longer than λ_min are copied to C_final as the output of our tracker (operation a8). The detailed steps are given in Algorithm 1.

Algorithm 1 Multi-State Online Matching Algorithm.
Inputs: per-frame detection sets {B_0, B_1, ..., B_{T−1}} after NMS; thresholds σ_PTSN, δ_waiting, λ_min.
Output: the final tracklet set C_final.
1: initialize C_active, C_lost, C_die, C_final as empty
2: for t = 0 to T − 1 do
3:   for τ_i ∈ C_active do
4:     b_best ← arg max_{b_j ∈ B_t} PTSN(τ_i, b_j)
5:     if PTSN(τ_i, b_best) ≥ σ_PTSN then
6:       add b_best to τ_i and remove b_best from B_t          ▷ a3
7:     else
8:       move τ_i to C_lost                                    ▷ a5
9:     end if
10:  end for
11:  for τ_i ∈ C_lost do
12:    b_best ← arg max_{b_j ∈ B_t} PTSN(τ_i, b_j)
13:    if PTSN(τ_i, b_best) ≥ σ_PTSN then
14:      add b_best to τ_i; remove b_best from B_t and move τ_i to C_active  ▷ a6
15:    else if time_waiting(τ_i) ≥ δ_waiting then
16:      move τ_i to C_die                                     ▷ a7
17:    end if
18:  end for
19:  for b_j ∈ B_t do
20:    start a new tracklet with b_j and insert it into C_lost
21:  end for
22: end for
23: for τ_i ∈ C_active do
24:   if len(τ_i) ≥ λ_min then
25:     add τ_i to C_final                                     ▷ a8
26:   end if
27: end for
Here C_active is a pool storing tracklets that have been tracked so far; C_lost is a pool storing tracklets that were tracked but lost; C_die is a pool storing tracklets judged to be illegal; C_final is a pool storing the legal output tracklets. The transfer actions {a1, a2, ..., a8} between the containers are explained in the text above and annotated in the algorithm.
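The container logic of the matching algorithm can be sketched compactly in Python. This is a simplification, not the authors' implementation: `affinity` stands in for PTSN, greedy best-first matching replaces bipartite matching, and the NMS filtering and velocity-based box propagation steps are omitted:

```python
def track(frames, affinity, sigma=0.5, delta_waiting=10, lambda_min=5):
    """frames: iterable of per-frame detection lists.
    affinity(tau, box) -> score in [0, 1], standing in for PTSN."""
    active, lost, die = [], [], []
    for boxes in frames:
        boxes = list(boxes)
        for pool in (active, lost):
            for tau in list(pool):
                scored = [(affinity(tau, b), b) for b in boxes]
                best = max(scored, key=lambda p: p[0], default=(0.0, None))
                if best[1] is not None and best[0] >= sigma:
                    tau["boxes"].append(best[1])       # a3: update tracklet
                    tau["wait"] = 0
                    boxes.remove(best[1])
                    if pool is lost:                   # a6: recover to active
                        pool.remove(tau); active.append(tau)
                else:
                    tau["wait"] += 1
                    if pool is active:                 # a5: mark as lost
                        pool.remove(tau); lost.append(tau)
                    elif tau["wait"] >= delta_waiting: # a7: terminate
                        pool.remove(tau); die.append(tau)
        for b in boxes:  # unmatched detections start new tracklets in C_lost
            lost.append({"boxes": [b], "wait": 0})
    # a8: keep only tracklets long enough to be considered legal
    return [t for t in active if len(t["boxes"]) >= lambda_min]
```

A proper implementation would replace the greedy inner loop with a Hungarian-style bipartite assignment over all tracklet-detection pairs, as the paper describes.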

Experiment
To evaluate the proposed method, we conducted extensive experiments on the APIDIS, NCAA Basketball and VolleyTrack datasets. The databases, implementation details, evaluation indexes, and results are described as follows.

Databases
The public benchmarks for MAT in sports videos are very limited compared to those for general MOT. In this study, we used the databases in [28], i.e., APIDIS, NCAA and VolleyTrack, which are newly collected or improved ones.
The APIDIS dataset contains 13 sequences of basketball games, each of which belongs to a round and lasts 10-20 s. It is collected from the original 15-min video on camera-6 (side view) (APIDIS, http://www.apidis.org/_Dataset/). As illustrated in Figure 6 (top row), the frames in APIDIS have difficult illumination conditions in background. The dataset totally contains 5764 frames of a resolution of 1600 × 1200 recorded at 22 fps. At each frame, seven persons (two referees and two five-player teams) on the court are annotated. In the experiments, seven sequences were used for training and the remaining ones for testing.
The NCAA Basketball dataset consists of NCAA basketball games from YouTube videos and was originally used for team activity recognition. Figure 6 (middle row) shows some example frames. To evaluate MAT methods, Kong et al. [28] manually annotated parts of the data with bounding boxes for each player. Concretely, it contains four rounds of a game video, each of which lasts about 300 frames, at 30 fps. The sequences total 1179 frames, at a resolution of 640 × 480. Following [28], the dataset alone is too small to train deep networks. Considering its similarity with APIDIS, we combined the APIDIS training set and two NCAA Basketball sequences for training, and the remaining two sequences were used for testing.
The VolleyTrack dataset contains 18 video sequences of world-class volleyball games collected from YouTube, as shown in Figure 6 (bottom row). Each video is captured by a camera placed at the end line of the court, and there exist variations in background and illumination. The video sequences, corresponding to game rounds, last from 8 to 12 s. The dataset contains 5406 frames, at 30 fps, at a resolution of 1920 × 1080. The bounding boxes of players are manually annotated at each frame. Due to the symmetric setting on both sides of the court, for each video the dataset only considers the athletes in the half of the court nearer to the camera. Among the 18 sequences, 50% of the data were used for training and the others for testing.
We used the Faster R-CNN detector trained on each dataset to produce detections for all the evaluations.

Implementation Details
In our experiments, we set H (the size of the input vectors to the LSTM) to 32 for all three streams, but the source of the input vectors differed across streams; for PAS, the 64-dimensional input vector is built from the pose feature φ_p and the appearance feature φ_a. As is well known, it is hard to train a deep model on imbalanced datasets. Many works [38,39] deal with the class imbalance issue and have achieved much progress. In our work, we train PTSN as a 0/1 classification problem, where a negative pair belongs to class 0 and a positive pair belongs to class 1. We used a resampling strategy to avoid imbalanced classes. Specifically, for each pair, there existed one positive sample and many negative ones. Retaining one of the detections in the positive pair, the negative examples were constructed by replacing the other detection with another random athlete. We ran our model on a machine equipped with two Intel(R) Xeon E5-2620 v2 CPUs (12-core, 2.6 GHz), 16 GB RAM, and a 1080Ti GPU.
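The resampling strategy can be sketched as follows; the function name and data layout are illustrative assumptions, but the balancing idea matches the text (one negative per positive, built by swapping in a random other athlete's detection):

```python
import random

def build_pairs(tracklets_by_id, seed=0):
    """Build balanced 0/1 training pairs for the affinity classifier.
    tracklets_by_id: {athlete_id: list of detections, in frame order}.
    Returns [((tracklet, detection), label), ...] with equal class counts."""
    rng = random.Random(seed)
    ids = list(tracklets_by_id)
    pairs = []
    for aid in ids:
        dets = tracklets_by_id[aid]
        if len(dets) < 2 or len(ids) < 2:
            continue
        tracklet, pos_det = dets[:-1], dets[-1]
        pairs.append(((tracklet, pos_det), 1))         # positive pair
        other = rng.choice([i for i in ids if i != aid])
        neg_det = rng.choice(tracklets_by_id[other])   # swapped detection
        pairs.append(((tracklet, neg_det), 0))         # negative pair
    return pairs
```

Because one negative is drawn per positive, each mini-batch sampled from this list is class-balanced by construction.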

Evaluation Indexes
To evaluate the performance of multiple athlete tracking algorithms, we used metrics widely adopted in MOT [25]. Among them, Multiple Object Tracking Accuracy (MOTA) and Multiple Object Tracking Precision (MOTP) are the two most popular. According to [40], MOTA gives a very intuitive measure of the tracker's performance at detecting objects and keeping their trajectories, while MOTP shows the ability of a tracker to estimate precise object positions. In addition, we used several further indicators to measure the quality of the method. Mostly Tracked targets (MT) is the ratio of ground-truth trajectories that are covered by track predictions for at least 80% of their respective life span; Mostly Lost targets (ML) is the ratio of ground-truth trajectories that are covered by a track hypothesis for at most 20% of their respective life span; FP is the total number of false positives and FN is the total number of false negatives (missed targets). IDS is the total number of identity switches [41].
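MOTA combines the error counts above into a single score; following the standard CLEAR MOT definition, it is one minus the sum of misses, false positives and identity switches over the number of ground-truth detections across all frames:

```python
def mota(fn, fp, ids, num_gt):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDS) / total ground-truth
    detections summed over all frames. Can be negative when errors
    exceed the number of ground-truth objects."""
    return 1.0 - (fn + fp + ids) / num_gt
```

For example, 10 misses, 5 false positives and 1 identity switch against 100 ground-truth detections give a MOTA of 0.84.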

Results Analysis
We first explored the contributions of different components of PTSN on the VolleyTrack dataset, as shown in Table 1. It can be seen that combining all three streams led to higher performance than using only parts of them, indicating the effectiveness of the pose-based appearance, motion and interaction clues. Compared to the counterpart that only uses appearance (AS), incorporating pose information, i.e., the Pose-based Appearance Stream (PAS), gained about 8% in terms of MOTA, demonstrating the key role of the pose cue. The variant with the motion stream (PMS) achieved better performance than PAS and AS on many metrics. Besides, the interaction clues had a better ability to reduce ID switches, evidenced by PIS vs. PMS and PAS. In Tables 2-4, we compare our method with other state-of-the-art MOT methods on the APIDIS, NCAA Basketball and VolleyTrack datasets, respectively. It can be observed that the proposed approach achieved better scores than the other methods, including MHT_DAM [9], CEM [42], ELP [20], Siamese CNN [26] and MDPNN16 [6], on some metrics such as IDF1 and FP. This indicates the effectiveness of PTSN and the multi-state matching algorithm for multi-athlete tracking in sports videos. Note that some methods, e.g., CEM [42], ELP [20] and Siamese CNN [26], worked in offline association mode and made use of much more context information; our method could obtain comparable results with those offline association methods. More importantly, thanks to online matching, our method had the advantage in processing speed (higher FPS). Figure 7 illustrates some qualitative results, containing both success and failure examples. By using long-term dependencies of pose-based clues, our method could largely recover the target after an occlusion. When detections were missing due to occlusion or approaching athletes, our method could track the ones with the same identity, e.g., the athletes in green dashed circles.
Meanwhile, we can also notice that our method tended to fail in some difficult situations, e.g., the athletes in red dashed circles dressed very similarly.

Conclusions
In this paper, we propose a novel online multiple athlete tracking approach for sports videos. It makes use of long-term temporal pose dynamics to better associate athlete targets. First, we propose the Pose-based Triple Stream Network, which models the pose dynamics via three clues, i.e., pose-based appearance, motion and interactions among athletes. Second, we design a multi-state online matching algorithm based on bipartite graph matching. Owing to the multiple detection states and reliable transitions, it is robust to noisy detections and occlusions. The proposed method is evaluated on the APIDIS, NCAA Basketball and VolleyTrack datasets by comparing it to other popular MOT methods, and the experiment results clearly demonstrate its advantages for this task. The proposed PTSN and online matching algorithm are separate modules, which limits the practicability in MAT/MOT. In the future, we aim at building an integrated pipeline in which both feature extraction and data association can be jointly learned.