A Deep Learning and Computer Vision Based Multi-Player Tracker for Squash

Featured Application: Autonomous multi-player tracking for kinematic evaluation of elite squash players.

Abstract: Sports pose a unique challenge for high-speed, unobtrusive, uninterrupted motion tracking due to speed of movement and player occlusion, especially in the fast and competitive sport of squash. The objective of this study is to use video tracking techniques to quantify kinematics in elite-level squash. With the increasing availability and quality of elite tournament matches filmed for entertainment purposes, a new methodology of multi-player tracking for squash that only requires broadcast video as an input is proposed. This paper introduces and evaluates a markerless motion capture technique using an autonomous deep learning based human pose estimation algorithm and computer vision to detect and identify players. Inverse perspective mapping is utilized to convert pixel coordinates to court coordinates, and the distance traveled, court position, 'T' dominance, and average speeds of elite squash players are determined. The method was validated against results from a previous study using manual tracking, where the proposed method (filtered coordinates) displayed an average absolute percent error relative to the manual approach of 3.73% in total distance traveled, and 3.52% and 1.26% in average speeds < 9 m/s with and without speeds < 1 m/s, respectively. The method has proven to be the most effective in collecting kinematic data of elite squash players in a timely manner, with no special camera setup and limited manual intervention.


Introduction
Quantitative analysis of human movement has long been an interest within sports biomechanics for its ability to determine performance and strategy [1], as well as its application in rehabilitation to identify injury risk factors [2] and facilitate recovery [3]. The demand for motion analysis to capture more complex environments in sport is pushing for the development of faster, more autonomous, and sophisticated techniques. Biomechanical analysis in applications such as training and competition requires the following unique criteria: provide accurate kinematic data, deliver information in a timely manner, and remove factors which restrict or influence the subject's natural movement [4].
The most widespread and common techniques for kinematic data capture have historically been manual notation on prerecorded videos and marker-based technologies. However, manual notation is time-consuming and marker-based systems can restrict natural movement. In the previous manual tracking study by our research group [24], player positions were determined from the court view provided by the main camera. Using video analysis software, Dartfish Team Pro version 8 (2015), markers were manually placed on each foot for every eligible frame to determine player position in the video coordinate system. Ten reference points on the court were recorded to determine a coordinate transformation converting the video image coordinate system to the coordinate system of the plane of the court [24].
The contribution of the current paper is to advance the development of accurate and reliable markerless motion tracking for squash by removing the need for a special setup, reducing processing times, and limiting the user intervention required by previous squash studies. The proposed methodology improves on the previous work done by our research group [24] by replacing the time-consuming and laborious task of player tracking with autonomous deep learning based human pose estimation to detect individuals in the frame and computer vision to identify the players. Removing the need for specific equipment and limiting significant user intervention increases the number of eligible matches we can analyze in a timely manner. Matches that are filmed by the PSA, or filmed similarly, are available to be analyzed by our methodology. This study outlines and validates our proposed method against the results of the previous study completed by Buote et al. [24], which quantified the players' distance traveled, position relative to the T, and average velocities. This is the first study to apply deep learning and computer vision motion tracking techniques to study elite squash players in competition.

Materials and Methods
The method was tested on a quarterfinal match of the 2013 Canary Wharf Classic Tournament obtained from the PSA video collection. The broadcast video had a frame rate of 25 fps and a resolution of 720 × 576. The match, which consisted of five games, was between professional players El Shorbagy (dark grey shirt) and Mustonen (white shirt), PSA-ranked 5th and 53rd, respectively, at the time of play. Automated tracking was performed only on the frames of the match that had been previously manually tracked [24]. This allowed direct quantitative comparison to validate the automated tracking's effectiveness. The procedure involves three main steps, as described below.

Preprocessing
The full match broadcast is split into five games, each as a separate file. Each of the subsequent methods is applied to each game. The games are played back, and a user manually identifies when the game is in play. Only video frames involving gameplay are kept so that analysis does not include moments when players are walking around between rallies, for example. The games are then filtered to only include the main camera angle. Throughout the broadcast, the camera angle is changed to give different views of the court. Only the main camera angle is used because it contains all of the court reference points used in the coordinate transformation. The program starts by generating a histogram for the first frame of video and adds this to a list of reference histograms (no references exist at this point, so the first frame is always the first reference). The histograms measure the frequency of color intensities, with one histogram per color channel, where the bins are brightness levels and the y-axis measures the number of pixels within each brightness range. The program then iterates through the rest of the video, generating a histogram for each frame and comparing it to the list of references. If the histogram is similar to one of the references within a defined threshold, the frame is grouped with those frames. If the frame is not similar to any of the references, it is added as a new reference. The result is separate video clips that each contain a separate camera angle. The longest video will be the main camera angle, as this is the angle used most often and is what is used for player tracking. A timestamp is also added in the top left corner of the frame in this step. An example pre-processed frame is shown in Figure 1.
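The histogram-grouping logic above can be sketched in Python. This is a minimal single-channel illustration under assumptions, not the authors' implementation: the paper builds one histogram per color channel, and the bin count and similarity threshold here are illustrative values.

```python
def channel_histogram(pixels, bins=16):
    """Normalized histogram of brightness values (0-255) for one color channel."""
    hist = [0] * bins
    for value in pixels:
        hist[value * bins // 256] += 1  # map 0-255 brightness into a bin
    total = len(pixels) or 1
    return [count / total for count in hist]

def similarity(hist_a, hist_b):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))

def assign_camera_angle(frame_hist, references, threshold=0.9):
    """Match a frame's histogram to an existing reference, or register a new one."""
    for index, ref in enumerate(references):
        if similarity(frame_hist, ref) >= threshold:
            return index           # frame belongs to a known camera angle
    references.append(frame_hist)  # first frame seen from a new angle
    return len(references) - 1
```

Grouping frames by the returned index, and taking the largest group, yields the main camera angle described in the text.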

Player Tracking and Identification
A general-purpose multi-person pose estimation neural network [25] was used for player tracking. The network consists of a two-branch multi-stage convolutional neural network with five initial convolutional layers followed by seven convolutional layers for each joint and limb. It outputs heatmaps for each joint and limb where further code detects peaks in the heatmaps to locate a maximum of 17 key joints and the limbs that connect them as shown in Figure 2. The neural network was trained on the MPII human multi-person dataset [26] and the COCO 2016 keypoints challenge dataset [27]. A tracking confidence value is also provided for each player that was tracked.
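The peak-detection step on the network's heatmaps can be illustrated with a simple local-maximum search. This is a schematic sketch over a plain 2-D array, not the code used in the paper, and the 0.5 threshold is an assumed value.

```python
def heatmap_peaks(heatmap, threshold=0.5):
    """Return (row, col, score) local maxima above threshold in a 2-D heatmap.

    A cell is a peak if its score is >= all of its (up to 8) neighbors;
    out-of-bounds neighbors are simply not compared.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c)
            ]
            if all(score >= n for n in neighbours):
                peaks.append((r, c, score))
    return peaks
```

In the full pipeline each joint has its own heatmap, so this search would run once per joint channel, keeping at most one peak per tracked person.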
First, the video is cropped to include only the court, so as to avoid tracking spectators that might be visible surrounding it. Before tracking begins, the user is asked to draw a box around each player, and a reference histogram of each player's torso is generated for later identification. The tracking process then begins. In each frame, depending on the game scenario, both, one, or neither of the players will be tracked by the neural network. Examples of difficult tracking scenarios, such as unnatural limb positioning and player occlusion, are shown in Figure 2.

A minimum tracking confidence value (2.0) and a minimum number of joints (10) are used to remove tracking errors. Frames with values below the minimum thresholds are removed from the analysis. In the remaining frames, a histogram of each tracked player's torso is generated and compared to the references generated earlier to identify the players. The coordinates of each player's joints are recorded in a spreadsheet.
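The thresholding and identification logic can be sketched as follows. The 2.0 confidence and 10-joint minimums come from the paper; `identify_player` and its histogram-intersection similarity are illustrative assumptions standing in for the histogram comparison described above.

```python
MIN_CONFIDENCE = 2.0  # minimum tracking confidence, per the paper
MIN_JOINTS = 10       # minimum number of located joints, per the paper

def keep_detection(confidence, joints):
    """Discard tracked figures with low confidence or too few located joints.

    joints: list of (x, y) pixel coordinates, with None for undetected joints.
    """
    visible = [j for j in joints if j is not None]
    return confidence >= MIN_CONFIDENCE and len(visible) >= MIN_JOINTS

def identify_player(torso_hist, reference_hists):
    """Return the index of the reference player whose torso histogram is closest."""
    def intersection(a, b):  # assumed similarity measure
        return sum(min(x, y) for x, y in zip(a, b))
    scores = [intersection(torso_hist, ref) for ref in reference_hists]
    return scores.index(max(scores))
```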
Timestamps associated with each frame are read by cropping the top left corner. Before tracking, a set of pictures of the numbers 0-9 is provided in the same font and on the same black background as they appear in the video. Using Python's computer vision library, OpenCV, contours are detected to separate individual numbers, and a machine learning based classification algorithm, k-Nearest Neighbors (kNN), is used to match the detected numbers to the provided set. This process repeats for each frame until the video is completed.
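The kNN digit matching reduces to a nearest-template vote. The paper uses OpenCV's kNN implementation; below is a schematic pure-Python stand-in, where each template pairs a feature vector (e.g., a flattened digit image) with its digit label.

```python
def knn_classify(features, templates, k=3):
    """Label a digit's feature vector by majority vote among the k nearest templates.

    templates: list of (feature_vector, digit_label) pairs.
    """
    def dist(a, b):
        # squared Euclidean distance between feature vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(templates, key=lambda t: dist(features, t[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)
```

In the actual pipeline, each contour found in the cropped timestamp region would be resized to a fixed size, flattened, and passed through a classifier like this one.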

Postprocessing
To convert the screen-space coordinates to court-space coordinates, a rotation matrix and translation vector are generated using reference points of the court as they appear in the video, and their corresponding coordinates in the plane of the court (known based on standard court dimensions). The method to generate the equations was the same as used by Buote et al. [24]. The reference points are shown in Figure 3.
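Only the application of the resulting transformation is sketched below, assuming the combined rotation and translation is expressed as a 3×3 perspective (homography) matrix H, as is standard for inverse perspective mapping. Estimating H from the ten court reference points (e.g., with a least-squares homography fit) is omitted here.

```python
def to_court(H, px, py):
    """Map a pixel coordinate (px, py) to court-plane coordinates using a
    3x3 homography H, applied in homogeneous coordinates."""
    x = H[0][0] * px + H[0][1] * py + H[0][2]
    y = H[1][0] * px + H[1][1] * py + H[1][2]
    w = H[2][0] * px + H[2][1] * py + H[2][2]
    return x / w, y / w  # perspective divide back to 2-D
```

The perspective divide by w is what accounts for the court appearing wider at the back wall than at the front wall in the camera image.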
Once court coordinates are obtained, further analysis to attain player data is completed. For positional data, the court floor coordinate system's origin is placed at the T, with x increasing to the right and y increasing toward the front wall. The x- and y-coordinates of each player's left and right feet are averaged for further analysis. Total distance is determined by calculating the change in coordinates between consecutive frames using the Euclidean norm. A player's average radial distance from the T is calculated by taking the Euclidean norm of their coordinates in each frame [29]. The percentage of time a player spends to the left of the T is determined by the number of coordinates with negative x-values divided by the total coordinates. The percentage of time spent behind the T is similarly calculated as the number of coordinates with negative y-values divided by the total coordinates.
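The positional statistics described above reduce to a few lines over the per-frame court coordinates. A minimal sketch (a hypothetical helper, not the authors' code), with the origin at the T:

```python
import math

def positional_stats(coords):
    """coords: list of (x, y) averaged foot positions in court space, origin at the T.

    Returns (total_distance, mean_radius_from_T, fraction_left_of_T,
    fraction_behind_T), following the definitions in the text.
    """
    total_distance = sum(
        math.hypot(x2 - x1, y2 - y1)              # Euclidean step between frames
        for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )
    mean_radius = sum(math.hypot(x, y) for x, y in coords) / len(coords)
    left = sum(1 for x, _ in coords if x < 0) / len(coords)    # x < 0: left of T
    behind = sum(1 for _, y in coords if y < 0) / len(coords)  # y < 0: behind T
    return total_distance, mean_radius, left, behind
```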
Velocity components were calculated individually as the change in the x-coordinate and the y-coordinate divided by the time between consecutive frames. The time difference is determined by subtracting the smaller timestamp from the larger timestamp associated with the frames. Average speed over the entirety of a game was determined as the sum of the magnitudes of the velocity components divided by the total number of velocity components.
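The velocity calculation can be sketched similarly, using per-frame timestamps so that gaps left by discarded frames are handled through the elapsed time rather than an assumed frame interval. The `min_speed` parameter is an added convenience mirroring the paper's optional exclusion of speeds below 1 m/s.

```python
import math

def average_speed(samples, min_speed=None):
    """samples: list of (timestamp_seconds, x, y) in court space.

    Speed between consecutive frames is displacement over elapsed time;
    the average speed is the mean of those magnitudes. min_speed optionally
    drops near-stationary frames (e.g., < 1 m/s).
    """
    speeds = []
    for (t1, x1, y1), (t2, x2, y2) in zip(samples, samples[1:]):
        dt = t2 - t1
        if dt <= 0:
            continue  # skip duplicated or out-of-order timestamps
        speeds.append(math.hypot(x2 - x1, y2 - y1) / dt)
    if min_speed is not None:
        speeds = [s for s in speeds if s >= min_speed]
    return sum(speeds) / len(speeds) if speeds else 0.0
```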
To validate player tracking, coordinates were graphed against the coordinates from the manual tracking method reported by Buote et al. [24] and quantified using the coefficient of determination (R²). Percent error was calculated for all player data collected as:

% error = |proposed statistic − reference statistic| / reference statistic × 100,

where the reference statistic is the statistic reported by Buote et al. [24]. A 5th order moving average filter, calculated as:

x̄_i = (x_{i−2} + x_{i−1} + x_i + x_{i+1} + x_{i+2}) / 5,

was applied to the court coordinates prior to analysis to smooth minor fluctuations in foot detection. Further investigation of the unfiltered total distance over 50 consecutive frames (2 s) in each game showed a large discrepancy in total distance traveled compared with the manually tracked data [24] (Table 1) due to characteristic differences in how tracking was managed. For manual tracking, observers were instructed to locate the feet and minimize coordinate changes between frames to speed up the annotating process and to produce relatively smooth motion tracking. Our proposed method does not retain historic data on the previous frame analyzed, which results in variation in foot detection between frames: from frame to frame, the foot node can be located higher up the ankle or lower down on the floor. Especially for y-coordinates, this makes a small difference during court conversion because of the length of the squash court compared to its width. These minor variations can accumulate, causing large differences in summations over time, and the final values can depend on the number of frames collected. Thus, a filter was applied to the player coordinates to account for variability in foot tracking.
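Both the smoothing filter and the percent-error measure are simple to express in code. Note the handling of the first and last two samples (a shrinking window) is an assumption, as the paper does not specify the filter's edge behavior.

```python
def moving_average(values, order=5):
    """Centered moving average of the given order; the window shrinks at the ends."""
    half = order // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(i - half, 0): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def percent_error(measured, reference):
    """Absolute percent error against the manually tracked reference statistic."""
    return abs(measured - reference) / abs(reference) * 100.0
```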

Results
The match spanned five games and 41.4 min, with 22.4 min (55.3%) of active match play. An average of 76.5% of active match play was analyzed after removing frames from camera angles other than the main court view. A summary of game length, % of active game play, and % analyzed is reported in Table 2. The match was recorded at 25 fps, and manual tracking included frames during active game play taken from the main court camera view [24]. Further analysis by Buote et al. [24] was done only between consecutive frames and did not interpolate across breaks longer than 1/25 s.
To validate the proposed method, player detection and identification were performed on the same frames. Frames were discarded by the proposed method if a player was not identified, usually because of player occlusion or an unnatural pose (e.g., Figure 2). Table 3 presents the number and percentage of frames retained by our methodology compared to the manual tracking. Unfiltered and filtered coordinates are plotted against the manual tracking coordinates in Figure 5 with the coefficient of determination. Table 4 outlines the R² values per game for both players. Table 5 compares both players' positional statistics obtained from unfiltered and filtered coordinates with the Buote et al. [24] results for the same match. These parameters are compared to the results of Buote et al. [24] in Table 6.
Velocity statistics, including the average speed data of both players calculated from unfiltered and filtered coordinates and from Buote et al. [24], are presented in Table 7. Similar to the positional data, average speeds were compared to Buote et al. [24] and the differences quantified in Table 8.
The average differences and percent error of the player data collected from the filtered coordinates are summarized in Table 9. Considering that the error in position estimation is recommended to be less than the natural sway of the body's center of gravity (between 15 and 20 cm) during an observed movement, the average difference for positional data of 17.6 cm (Table 9) is acceptable but can be improved [30].

Figure 5. Manually tracked coordinates plotted against the unfiltered (orange) and filtered (green) coordinates from the proposed tracking method; red circles highlight significant areas where filtering improved tracking. Coordinates were taken over the first 2000 frames, from 50:09.04 to 54:47.16 (global timestamp of the broadcast match).

Table 9. Summary of average absolute percent error comparing player data collected from filtered coordinates (smoothed using a 5th order moving average filter) with the results of Buote et al. from the 2013 Canary Wharf match [24].

Discussion
This study applies deep learning and computer vision processes to evaluate the kinematics of elite squash players for the first time. The method was validated by comparison with previous results from a manual tracking study [24]. Our method presents many advantages over prior data acquisition methods. The ability to analyze any match filmed by the PSA, or any suitable match filmed from a similar angle, with no special camera setup or wearable markers that could impede player movement, significantly increases the number of elite matches eligible for analysis.
A notable advancement in the present study is the speed of player tracking, which has been considerably accelerated to 0.3 s per frame. Player statistics are rapidly generated using Python code for easy computation. Thus, an ideal full match analysis takes approximately 3 h, including tracking and analysis, and the majority of the process is autonomous. Presently, manual intervention is only required during pre-processing to identify active play. Broadcasts of professional squash matches do not have a definitive visual or auditory indicator of when a rally begins or ends, unlike other racquet sports. Based on our preliminary investigation, strategies that could address this in the future include detecting when the scoreboard changes and noting a change of camera angle or a pan away from the court (though these indicators are not completely instantaneous).
For the match analyzed, active play accounted for slightly more than half the total time of each game (55.9% on average). This supports the interpretation of squash as a sport demanding short, high intensity bursts rather than endurance at constant intensity [31]. Other camera angles, such as the side wall and close-up secondary cameras, do not display both players and are typically used for repetitive shots, usually drop shots or backhands down the wall from the left back corner. However, the movement of the players was cyclical between the T and the corner and deemed to be relatively equal, providing valid results for comparison and aligning with previous studies [17,20,21,24]. Current work is being done to implement autonomous collection of other frame angles of the match. Future work can establish court conversion matrices for other camera angles, using the inverse perspective mapping method applied to the main camera angle, so that the full length of active match play can be analyzed [32,33].
Frames analyzed by the manual tracking method were used as input to the proposed tracking method, which detected and identified both players in an average of 83.82% of the input frames per game. Frames where the system was unable to successfully detect players were due to player occlusion or an unnatural pose, as mentioned in the methods section. A global timestamp was assigned and detected in each frame to account for the difference in time when calculating velocity across missing frames. Comparison of time-dependent results such as player velocities provides support for the effectiveness of this approach.
Court conversions were determined using reference points in the frame specific to the court and camera angle. The equations have been noted to be more accurate in predicting a player's position near the T than near the top corners, likely because the reference points are concentrated in the center (service lines) rather than the corners [24]. The raw position coordinates were smoothed using a 5th order moving average filter due to the variation in foot detection described in the methods section. The calculated R² values displayed slightly higher accuracy for the unfiltered x coordinates compared to the filtered x coordinates (0.990 and 0.988, respectively) (Table 4). Further, the filtered y coordinates were noted to be considerably higher than the unfiltered y coordinates (0.971 and 0.966, respectively) (Table 4). This indicates that the accuracy of the system depends on the margin of error of the y coordinates. This is due to the dimensions of a squash court, where the length is greater than the width. Because of the camera angle perspective, the video image produces a court that is compressed lengthwise and wider at the bottom (back wall) than at the top (front wall), causing y coordinates to have a larger margin for error during detection.
The filtered coordinates displayed more reliability for cumulative statistics such as total distance, with an average percent error of 3.73% compared to 19.85% for the unfiltered coordinates (Table 6). The variation in foot detection with the proposed method resulted in larger changes in coordinate position between frames than the manual tracking method, which resulted in consistently higher values for total distance traveled. Filtering removed the problematic fluctuations, resulting in total distance traveled values closer to the manually measured values. This is especially evident in Game 3, where both players have the lowest total distance percent error, 0.43% (El Shorbagy) and 6.98% (Mustonen) (Table 6), compared to the rest of the games. Game 3 also has the lowest percentage of frames collected at 76.50%, as opposed to the average of 85.65% (excluding Game 3), supporting the need to filter the raw coordinates. As in previous studies, it appears players travel distances similar to their opponent's in each individual game, and distance traveled can be correlated to the length of the game [20].
Vučković et al. [21] suggested that the dominance of a rally can be indicated by the time spent near the T, except for closely contested games. This is in agreement with our results as the winner and higher ranked player of the match, El Shorbagy (1.49 m for unfiltered coordinates, 1.57 m for filtered coordinates, and 1.71 m according to Buote et al.) maintained a smaller average radius to the T than Mustonen (1.71 m for unfiltered coordinates, 1.80 m for filtered coordinates, and 1.93 m according to [24]). This is reflective of common squash tactics where skilled players play accurate shots to force their opponent to leave the T area, while less skilled players play a greater number of shots closer to the center of the court [21,24].
Players spent an average of 53.7% (unfiltered and filtered coordinates) of the time on the left side of the T, which concurs with the finding of 56.5% from Buote et al. [24]. Since the left side wall camera view was not analyzed, the true percentages are expected to be higher. This aligns with Vučković et al. [34], who recorded an average of 64.6% of shots coming from the left side of the court over 10 matches played at the men's World Team Championship in 2003. As both players were right-handed, a higher percentage of time spent on the left (backhand) side was expected, since at the elite level a common tactic is to play to the opponent's backhand, which is considered weaker and more difficult [24]. An overwhelming 86.4% (unfiltered and filtered coordinates) of the time was spent behind the T, agreeing with the manual tracking average of 89.7% from Buote et al. [24]. This is similar to the findings of Vučković et al., who found 74.5% of shots coming from behind the T in the same 10 matches recorded at the men's World Team Championship in 2003 [34]. The tendency to favor the left side and to situate oneself behind the T typically occurs when a player returns to center to anticipate the next shot. The lower percentages calculated using the proposed method compared to the reference are likely because most frames missing due to player occlusion occur near the T during the return to the ideal position.
The average speeds calculated from the filtered coordinates (overall 1.90 m/s) are much closer to Buote et al.'s results (1.85 m/s) [24] than those from the unfiltered coordinates (2.23 m/s). This supports the need for filtering of coordinates and is, once again, likely due to the variation in foot detection, which inflates distance traveled and in turn the reported speeds between consecutive frames. The results of the filtered coordinates align with previous studies: Buote et al. [24] recorded a maximum average speed of 2.04 m/s over 5 matches of elite players from 2012-2014, Hughes and Franks [17] recorded a maximum mean speed of 1.98 m/s, and the maximum average speed here was 1.99 m/s using filtered coordinates. As the average walking speed is around 1.4 m/s and the walk-to-run transition has been noted to occur below 2 m/s, our overall average speed of 1.90 m/s reflects the idea that squash comprises shifts between walking and running [35,36,37,38].
Removing speeds below 1 m/s is argued by Buote et al. [24] to provide a more realistic idea of how fast players move to return shots. Speeds under 1 m/s occur primarily when a player is at center court waiting for their opponent to play a shot, during the pause for accuracy and power before a player makes their shot, and when players change direction. With this selection, our results show that players moved at an average speed of 2.44 m/s for 70.2% of active match play and spent only 29.8% of the time moving at less than 1 m/s. This reflects Buote et al.'s analysis of 5 matches [24] mentioned above, which found a mean player speed of 2.52 m/s (excluding speeds below 1 m/s) over 69.6% of active match play. These speeds represent the incredible level of conditioning and endurance elite squash players must possess to compete.
A limitation of this study is the inability to analyze the entirety of active match play (83.32% analyzed on average, Table 3). Another constraint is the assumption that players slide horizontally across the plane of the court when converting video coordinates into court coordinates, meaning that vertical movement of a player due to jumping is considered as distance traveled. In addition, the conversion does not take into account any lens warping. Our future work will focus on continuing to develop the reliability of this method, add the analysis of additional camera angles, refine the model to reduce/handle missing frames, and to gather data on recent PSA matches. Further research opportunities include analysis of upper body and arm kinematics.

Conclusions
With the increasing availability of and access to broadcast elite squash matches, our study utilizes recent advancements in human pose estimation and computer vision to quantify squash kinematics and tactics using video analysis. This method offers the ability to analyze any PSA match, or any match filmed similarly and suitably, giving access to a large collection of elite player data to be analyzed for the first time. It is also entirely autonomous apart from selecting active match play.
Our results support previously identified elite squash tactics and strategy in former studies. This methodology has proven to be accurate and reliable in comparison to results of a manual tracking method [24]. It is also the most effective in collecting kinematic data with no special camera setup, limited manual intervention, and has a clear advantage in its ability to provide analysis in a timely manner.