The Eye in the Sky—A Method to Obtain On-Field Locations of Australian Rules Football Athletes

Abstract: The ability to overcome an opposition in team sports is reliant upon an understanding of the tactical behaviour of the opposing team members. Recent research is limited to a performance analyst's own playing team members, as the required opposing team athletes' geolocation (GPS) data are unavailable. However, in professional Australian rules Football (AF), animations of athlete GPS data from all teams are commercially available. The purpose of this technical study was to obtain the on-field location of AF athletes from animations of the 2019 Australian Football League season to enable the examination of the tactical behaviour of any team. The pre-trained object detection model YOLOv4 was fine-tuned to detect players


Introduction
Performance analysts in professional sports teams are increasingly required to analyse large sets of athlete data to derive insights that result in a competitive advantage over the opposition [1]. The in-depth analysis of team sports is on the rise due to advancements in sensor technology and computing power [2][3][4]. The first step of all analyses is the accurate tracking of players on the field. This can be achieved with sensor data such as GPS, LPS, or IMUs, or with video [3]. The notion of tracking data refers to spatiotemporal data describing ball and/or player positions during a sports event [5]. The use of easy-to-access video data to derive 2D spatiotemporal data is increasingly popular. For this purpose, computer vision methods have been applied in multiple sports, with the majority of applications in soccer and basketball [2]. Using solely video data, researchers and sport professionals aim to better understand the tactical behaviour and interactions of a team or individual [5,6] or to support decision-making pertaining to performance and injury risk [4].
For meaningful outcomes, it is important to consider the demands and constraints of specific team sports [4]. Australian rules Football (AF) provides challenges that cannot be found in more frequently investigated team sports like soccer and basketball: in AF, 36 players are on a field that is not consistent in size across different stadiums, which is a unique constraint of the sport. The dimensions of fields used within the professional Australian Football League (AFL) vary from 175 m in length and 145 m in width (University of Tasmania Stadium) to 155 m by 136 m (Sydney Cricket Ground). The average length and width of AFL grounds are 163.6 ± 5.9 m and 132.1 ± 6.9 m, respectively [4]. To provide player location information, all players are equipped with a commercial GPS unit, but professional teams can only access the GPS data of their own team, while limited or no information on any of the opposition AF teams is accessible. As a result, current state-of-the-art tactical analysis cannot be performed easily, since the location of the opposition AF team is unknown. Despite the ubiquity of GPS data, the analysis of opposition collective behaviour in AF is currently restricted to conventional 2D video analysis, a manual and time-consuming process. Computer vision technology therefore presents an opportunity to overcome this issue in AF [7][8][9].
The major challenge of vision-based methods is their dependency on the environment: they are susceptible to frequent changes in athlete velocities and occlusions in congested play, changes in field lighting, and similarities in the appearances of teammates [2,3]. Further, the unique challenge of varying field sizes, in combination with the large field size requiring multiple cameras, makes the use of conventional player tracking methods impossible [4]. Varying pre-processing steps are therefore necessary to successfully track athletes [10,11]. Recommended techniques include the following: (1) the removal of shadows to combat changes in lighting conditions [12]; (2) the use of multiple camera set-ups to ensure all athletes are in the field of view during filming [13]; (3) the use of pre-trained object detection models [14][15][16]; and (4) jersey number recognition models for the detection and identification of individual athletes [17].
Two computer vision-based methods regularly used to obtain on-field athlete locations are detection embedded tracking and tracking by detection [11]. The detection of athletes is part of the tracking pipeline in detection embedded tracking [7,18,19], which is considered a costly manual method when compared with modern deep learning implementations [8,20]. The tracking process begins by extracting the playing field area using a combination of basic computer vision techniques, such as background subtraction, Canny edge detection, and contour extraction [7,8,21], to ensure that the subsequent feature extraction of the tracking subjects (usually colour, shape, and trajectory features) is free from variations in the playing field appearance and noise from spectators and advertisement banners. One example of this approach is the tracking of athletes in soccer across video frames using Haar-like features [19], defined as differences in the summed pixel intensities of various rectangular regions across the tracking subjects [22]. A more recent example built upon this approach uses particle filters as the feature extraction method, which considers differences in pixel intensities between smaller regions in comparison to Haar-like features [23]. Blob detection [24,25], Otsu detectors [12], and motion vectors [26] are also commonly used feature extraction methods in detection embedded tracking. Athletes are tracked by associating similar features across frames, such as edge detection [12], three-dimensional topographic features [25], and Efficient Convolution Operator (ECO) tracking algorithms [27,28].
Tracking by detection differs from the aforementioned approaches in that it first detects the athletes in the input image prior to passing detections to stand-alone object trackers [14,15]. This approach results in improved accuracy as the athlete appearances and locations are known prior to the application of the tracking algorithm. However, the accuracy of the stand-alone tracking methods is heavily dependent on the accuracy of the detector, meaning it is extremely important to use an object detector that has been optimised for the desired task to obtain the best tracking results. Deep learning techniques based on convolutional neural networks have shown promising results in object detection [29], and pre-trained person detectors are a popular form of object detector as they avoid the need to train from scratch. For example, Histograms of Oriented Gradients-based person detectors [30], the Faster Region-based Convolutional Neural Network (Faster R-CNN, a popular state-of-the-art object detector architecture [31]), and a faster state-of-the-art object detector known as You Only Look Once (YOLO) [32,33], all trained to detect persons in images and videos, have been used to detect athletes in sporting contexts. The detections were subsequently passed into designated tracking algorithms, such as support vector machines, Long Short-Term Memory neural networks [34], and Simple Online and Real-time Tracking with a Deep Association Metric (DeepSORT) [35], to track the athletes across video frames [9,15,36].
Previous progress in tracking athlete movements using computer vision-based methods has examined sports where playing field dimensions remain constant across competition arenas. Further, the playing field boundaries in previous research are all rectangular in shape, such as the playing fields and courts encountered in soccer, basketball, and squash [25,37-40], serving to greatly simplify the technical processes required to determine the field-relative position of detected athletes [4]. The majority of existing work has also used stationary cameras, which minimise, and in some cases eliminate, issues related to a shifting background, appearance distortions, and camera motion that arise from operator pan, tilt, and zoom functions [10]. These challenges are amplified in AF due to the permissible differences in field shapes and sizes across stadiums [41], and the use of multiple, manually operated pan, tilt, and zoom cameras in which the entire playing field will seldom be in full field of view. Frequent occlusions of athletes are also a common feature of video footage due to the full-contact nature of the sport. These limitations severely impact the performance of the aforementioned tracking methods, serving to inhibit the application of athlete detection and tracking methods in AF [4]. One study used a custom person detector and team classifier for detection and then tracked the athletes across frames of broadcast video with a combination of Kalman filters and energy minimisation techniques [42]. This approach struggled to overcome the changes in lighting conditions and frequent occlusions of athletes.
Athlete tracking data, from body-worn Global Positioning System (GPS) to Local Positioning System (LPS) devices, overcome the aforementioned challenges plaguing video footage of AF matches [43]. However, the raw athlete tracking data from opposition teams are unavailable to professional AF teams, meaning that an alternative method is needed to obtain this information. Unique to professional AF is the commercial availability of animations of athlete GPS data from all professional AF matches, which include opposition teams, by Champion Data, the official statistics provider for the Australian Football League. Athletes are represented as circles from a birds-eye view of a playing field (Figure 1, top right), which simplifies the athlete tracking task because it removes the issues of lighting variations, changes in athlete appearances, ambiguous differences between teammates, occluded areas of the playing field, and camera distortion. Consequently, the animated athlete tracking data provide a unique opportunity for the application of modern tracking-by-detection techniques.
The aim of this technical study is to develop a tracking-by-detection technique to obtain the field-relative positions of AF athletes using player animations based on GPS signals. We will further establish pixel-to-Cartesian coordinate transformation coefficients unique to each stadium. This novel application of tracking by detection enables tactical analyses of opposition collective team behaviour and the development of interactive play sketching tools in AF.

Materials and Methods
Two sources of data obtained from a single professional AF team from the 2019 Australian Football League season, comprising 22 matches, were used for this study. A total of three matches were excluded from analysis due to errors in the raw GPS data. A further two matches were excluded due to errors in the visualisations. Another two matches included only three of the four available quarters due to visualisation errors in the final quarter. Data from a total of 15 full matches and the first three quarters of an additional two matches were included. An overview of the full workflow of this study is displayed in Figure 1.
The first data source comprised the visual representation of athlete GPS tracking data, which was overlaid onto an image of the playing field it was originally collected from (i.e., the actual ground the match was played on) (Figure 1). A visual animation of the GPS data is produced by animating the output from the GPS sensors (Catapult S5 units) during match play and was provided to the industry research partner, a professional AF team, by the official statistics provider of the Australian Football League, Champion Data. These third-party vendor athlete tracking data animations are commercially available to all professional AF teams. The second source of data was raw GPS data from the GPS sensors worn by a single team of athletes (n = 37) drawn from all matches in the 2019 AF playing season. This study was approved by the Ethics committee from the University of Western Australia (2020/ET000197).

Athlete Detection
Two-minute samples of the athlete tracking data animations, selected as the first two minutes of match play from a total match time of two hours, were used to train a state-of-the-art multiple-object detection ANN [44] to detect the athletes of a single team. For this purpose, athlete player animations were manually labelled using an online labelling tool, supervise.ly (accessed on 9 May 2024) [45]. A total of 3476 images resulted in 60,612 labelled athlete examples. This dataset was synthetically enlarged via conventional computer vision cropping and flipping methods [46], resulting in a total of 41,712 images containing 704,987 labelled examples. The data were split into 85% (35,483 images) for training and 15% (6229 images) for testing. The images were resized to 416 × 416 pixels and used to fine-tune a YOLOv4 object detection model (backbone: CSPDarknet53, neck: PANet, head: YOLO Head) that was pre-trained on the MS-COCO dataset and is publicly available through the Darknet framework [44]. Training took place over 74,000 iterations with a batch size of 64 [33], an initial learning rate of 0.001, a momentum of 0.949, and a decay of 0.0005. The mean average precision (mAP), precision, recall, and F1-score were reported as standard measures of model accuracy with an Intersection over Union (IoU) threshold of 0.5. The trained model was used to detect athletes in animations across the entire two hours of a match, and the centre of the detected bounding box was used to define an athlete's position in pixel coordinates.
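The evaluation quantities named in this section can be illustrated with a minimal Python sketch. The `Box` layout and all function names are hypothetical and chosen only for illustration; they are not taken from the Darknet framework or from the authors' code. The sketch shows how an IoU value is computed for a predicted and a labelled box, how the centre of a bounding box yields a pixel position, and how the F1-score follows from precision and recall.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def area(box: Box) -> float:
    """Area of a box; degenerate boxes have zero area."""
    return max(0.0, box.x_max - box.x_min) * max(0.0, box.y_max - box.y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two boxes."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Box, label: Box, threshold: float = 0.5) -> bool:
    """A detection counts as correct when its IoU reaches the threshold."""
    return iou(pred, label) >= threshold


def centre(box: Box) -> Tuple[float, float]:
    """Box centre, used here as the athlete position in pixel coordinates."""
    return ((box.x_min + box.x_max) / 2, (box.y_min + box.y_max) / 2)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

As a consistency check, `f1_score(0.95, 0.97)` rounds to 0.96, which agrees with the precision, recall, and F1-score reported in the Results.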

Athlete Tracking
A pre-trained Tesseract Optical Character Recognition (OCR) model was initially employed to identify the athlete player numbers present in each detection [47]. However, upon visual inspection, it was clear that the outputs of the OCR model were prone to misidentification. Erroneous outputs were saved, corrected, and labelled, and comprised multiple image samples of each number from 1 through 50 (i.e., the expected range of player numbers allocated to AF athletes). Data were split into 80% (121,230 samples) for training and 20% (30,331 samples) for testing. The fully corrected labels were used to train a custom CNN to identify athlete player numbers in the detections (Figure 2), since CNNs have shown their applicability to text recognition in images [48]. After performing a grid search, the convolution kernels were set to a size of 3 × 3 and the pooling kernels were set to a size of 2 × 2. Each layer utilised a rectified linear unit activation, with the exception of the final classification layer, which used a softmax activation. The CNN was trained using five-fold cross-validation over 10 epochs with a batch size of 32, a learning rate of 0.01, and a momentum of 0.9. A categorical cross-entropy loss function was optimised during training using stochastic gradient descent. Training accuracy and loss were analysed during training, while the accuracy of the trained model was evaluated on the test set. The trained CNN number reader was used to identify the athlete player numbers present in each of the detections, with an imposed condition that each number can only occur once per team to reflect the actual use case.
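The text does not state how the once-per-team condition was enforced. One simple possibility, shown here purely as an assumption, is a greedy assignment over the softmax outputs of the number reader: the most confident (detection, number) pairs are accepted first, and any pair that would reuse a detection or a number is skipped. The function name and the input layout are hypothetical.

```python
from typing import Dict


def assign_unique_numbers(probs: Dict[int, Dict[int, float]]) -> Dict[int, int]:
    """Greedily assign one player number per detection without repeats.

    probs maps a detection id to {player number: softmax probability}.
    This is an illustrative heuristic, not the authors' documented method,
    and it is not guaranteed to maximise the total probability.
    """
    # Flatten to (probability, detection, number) and sort by confidence.
    candidates = sorted(
        ((p, det, num) for det, dist in probs.items() for num, p in dist.items()),
        key=lambda c: c[0],
        reverse=True,
    )
    assigned: Dict[int, int] = {}
    used_numbers = set()
    for _, det, num in candidates:
        if det in assigned or num in used_numbers:
            continue  # detection already labelled, or number already taken
        assigned[det] = num
        used_numbers.add(num)
    return assigned
```

For example, if two detections both favour number 7 but one does so with higher confidence, the less confident detection falls back to its next most likely number. A globally optimal alternative would treat the problem as a linear assignment problem.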

Conversion to the Field
To transform the athlete tracks in the animations from the image coordinate system to a field-relative Cartesian coordinate system, the raw GPS data and the equivalent track of athletes in the animations were used.

GPS Data
The start and end times of each quarter were recorded for each match and used to extract the GPS information during match play. The GPS data were converted from Earth-centred coordinates, in longitude and latitude, to field-relative Cartesian coordinates, where the origin of the field-relative coordinate system was located at the centre of the field, the X-axis was aligned from field-goal to field-goal, and the Y-axis was aligned orthogonal to the X-axis such that the positive direction is away from the team benches (Figure 1, bottom left). The longitudinal and latitudinal coordinates of the centre of all competition fields $(L_F, \theta_F)$ were recorded using Google Earth [49]. Equations (1) and (2) were used to convert the longitudinal and latitudinal coordinates of the athletes $(L_{Ath}, \theta_{Ath})$ to field-centred Cartesian coordinates $(X_{Ath}, Y_{Ath})$.
In these equations, Arc represents the arc distance of one degree over the Earth's surface, determined from R, the radius of the Earth in metres; the Earth is assumed to be a uniform sphere with a constant radius of 6,378,137 m [50]. The bearing ($\psi$) between the field's centre $(L_F, \theta_F)$ and the position on the field's boundary that corresponds with the maximum of the y-coordinate $(L_{F_{max}}, \theta_{F_{max}})$ was determined for each field using Equation (4). The bearing was then used to align the field-centred Cartesian coordinates $(X_{GPS}, Y_{GPS})$ with the local coordinate system. The field-relative GPS outputs were down-sampled to 1 Hz.
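A minimal Python sketch of this kind of conversion is given here for orientation. It is one plausible reading under stated assumptions, not the authors' exact equations: an equirectangular approximation with a cosine-of-latitude correction on the east-west offset, the standard initial great-circle bearing formula, and a rotation that maps the bearing direction onto the positive Y-axis. The sign convention of the X-axis and all function names are hypothetical.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_378_137.0  # spherical Earth radius stated in the text
ARC_M_PER_DEG = 2 * math.pi * EARTH_RADIUS_M / 360  # arc length of one degree


def to_field_centred(lon: float, lat: float,
                     lon_f: float, lat_f: float) -> Tuple[float, float]:
    """East and north offsets (m) from the field centre.

    Assumption: equirectangular approximation, adequate over a few
    hundred metres; the cos(latitude) term is not confirmed by the text.
    """
    east = (lon - lon_f) * ARC_M_PER_DEG * math.cos(math.radians(lat_f))
    north = (lat - lat_f) * ARC_M_PER_DEG
    return east, north


def bearing(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Initial great-circle bearing from point 1 to point 2.

    Returned in radians, measured clockwise from north.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlon) * math.cos(phi2)
    b = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.atan2(a, b)


def rotate_to_field(east: float, north: float, psi: float) -> Tuple[float, float]:
    """Rotate so the direction with bearing psi becomes the +Y axis."""
    x = east * math.cos(psi) - north * math.sin(psi)
    y = east * math.sin(psi) + north * math.cos(psi)
    return x, y
```

With a bearing of zero the field axes coincide with east and north; for any other bearing the same rigid rotation is applied to every athlete position, so distances between athletes are preserved.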

Animation Data
The athlete tracks in the animations were down-sampled to 1 Hz for ease of handling prior to a conditional filtering data cleaning process to correct for errors in the detection/tracking. (1) The position of any athlete whose movement was greater than a pre-defined threshold of 55 pixels per frame was replaced with a missing value. This threshold value was determined in initial pilot testing and equates to a speed of 8.2 m/s, which is categorised as a high-intensity sprint that reportedly occurs 22 ± 9 times per match [51]. (2) To avoid large jumps in an athlete's movements in the instance where the position was missed across multiple consecutive frames, the subsequent detected position was also replaced with a missing value if the athlete's movement exceeded the pre-defined threshold. (3) An athlete's position was also removed if the athlete was tracked fewer than five times in the ten frames following a missing value. Linear interpolation was applied to minimise the number of missed positions and was only applied in instances of five or fewer consecutive missing values. The athlete tracks and field-relative GPS outputs were temporally aligned for each quarter.
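The following Python sketch illustrates part of this cleaning process under simplifying assumptions. It covers only rule (1) and the gap-limited linear interpolation; rules (2) and (3) are omitted for brevity. A track is represented as a list of pixel positions at 1 Hz with `None` for a missing value, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point = Optional[Tuple[float, float]]


def drop_jumps(track: List[Point], max_step_px: float = 55.0) -> List[Point]:
    """Rule (1): replace a position with None when the step from the
    previous retained frame exceeds the threshold (pixels per frame)."""
    cleaned: List[Point] = []
    for cur in track:
        prev = cleaned[-1] if cleaned else None
        if cur is not None and prev is not None and math.dist(prev, cur) > max_step_px:
            cleaned.append(None)
        else:
            cleaned.append(cur)
    return cleaned


def interpolate_gaps(track: List[Point], max_gap: int = 5) -> List[Point]:
    """Linearly fill runs of at most max_gap missing values that are
    bounded by valid positions on both sides; longer runs stay missing."""
    filled = list(track)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        gap = i - start
        if start == 0 or i == n or gap > max_gap:
            continue  # leading, trailing, or overly long gap
        (x0, y0), (x1, y1) = filled[start - 1], filled[i]
        for k in range(gap):
            t = (k + 1) / (gap + 1)
            filled[start + k] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return filled
```

Applied in sequence, `interpolate_gaps(drop_jumps(track))` removes an isolated jump and then bridges the resulting one-frame gap, while a gap of six or more frames is left as missing values.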
An optimisation problem was established to determine the optimal scaling ($m_{x,y}$) and translation ($c_{x,y}$) coefficients to transform the pixel coordinate outputs $(u, v)$ to field-relative Cartesian coordinates $(X_u, Y_v)$. This step is necessary for every stadium given that the nine standard home stadiums used by AF teams nationally are of varying size. Initial tests revealed a linear relationship defined by the following:

$$X_u = m_x u + c_x \tag{10}$$

and

$$Y_v = m_y v + c_y \tag{11}$$

The linear equations were optimised using the Levenberg-Marquardt algorithm through a non-linear least squares method [52]. Axis-specific scaling and translation coefficients were determined using separate optimisation problems because the scaling and translation are different for the field's X and Y axes. The optimisation was undertaken on a quarter-by-quarter basis to account for variations in temporal alignments between the athlete detections and the field-relative GPS. The Euclidean distance $d(p, q)$ between the transformed object detector output tracks $p = (X_u, Y_v)$ and the GPS Cartesian coordinates $q = (X_{GPS}, Y_{GPS})$ was determined as an accuracy measure if an athlete was present in both the GPS and tracking data. The Euclidean distance was defined as

$$d(p, q) = \sqrt{(X_u - X_{GPS})^2 + (Y_v - Y_{GPS})^2}$$

evaluated for each of the n frames, where n is the total number of frames tracked.
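Because the pixel-to-field relationship is linear, the per-axis coefficients can be illustrated with a closed-form ordinary least-squares fit. This Python sketch deliberately substitutes the closed-form solution for the Levenberg-Marquardt solver used in the study; for a linear model both minimise the same sum of squared residuals. The function names and the sample values in the usage note are hypothetical.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple


def fit_axis(pixels: Sequence[float], metres: Sequence[float]) -> Tuple[float, float]:
    """Least-squares scale m and translation c such that metres ≈ m * pixels + c.

    Fitted separately for each axis, mirroring the axis-specific
    optimisation described in the text.
    """
    n = len(pixels)
    mean_u = sum(pixels) / n
    mean_x = sum(metres) / n
    s_uu = sum((u - mean_u) ** 2 for u in pixels)
    s_ux = sum((u - mean_u) * (x - mean_x) for u, x in zip(pixels, metres))
    m = s_ux / s_uu
    c = mean_x - m * mean_u
    return m, c


def euclidean_errors(tracks: Sequence[Tuple[float, float]],
                     gps: Sequence[Tuple[float, float]]) -> List[float]:
    """Per-frame Euclidean distance (m) between transformed tracks and GPS."""
    return [math.dist(p, q) for p, q in zip(tracks, gps)]


def summarise(errors: Sequence[float]) -> float:
    """Median error, the summary statistic reported in the Results."""
    return median(errors)
```

As a usage example with made-up values, `fit_axis([0, 100, 200], [-80, -65, -50])` returns a scale of 0.15 m per pixel and a translation of -80 m; the same call with `v` coordinates and Y positions gives the second axis.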

Results
The athlete detector achieved an mAP of 0.94, a precision of 0.95, a recall of 0.97, and an F1-score of 0.96. The custom CNN trained to read the two-digit player numbers achieved an average accuracy of 0.98 ± $2 \times 10^{-3}$ on the test set across the five folds of training (Figure 3). Each stadium had a unique pixel-to-Cartesian coordinate transformation equation (see Equations (10) and (11)) due to the non-standardised dimensions of an AF field. The stadium-specific scaling and translation coefficients are presented in Table 1. The median Euclidean distance between the GPS Cartesian coordinates and the transformed pixel coordinates across the entire season was 2.63 m, with lower and upper quartile values of 1.58 m and 4.04 m, respectively (Figure 4). The distribution of the Euclidean distance for round 20 had a greater spread than all other matches. Further investigations revealed an issue with the number reader implementation in this match: two athletes with similar playing numbers were repeatedly misidentified.

Discussion
The aims of this research were to develop a tracking-by-detection technique to obtain the field-relative positions of AF athletes based on commercially available GPS-based player animations and to establish unique pixel-to-Cartesian coordinate transformation equations for each AF stadium. Due to the vast ground sizes in AF, standard optical tracking methods cannot be applied [3]. Athlete tracking systems have therefore been largely confined to GPS data that are not available for the opposing team. Hence, the novel method using animated GPS data presented in this research is a valuable first step towards analysing the tactical behaviour of both playing teams.
The high accuracy of the custom athlete detector (mAP 0.94, precision 0.95, recall 0.97, F1-score 0.96) was comparable to previous successful attempts at similar tasks [3,9,14,17,53]. These results support the position that fine-tuning a multiple object detector is sufficient for detecting AF athletes in animations. Due to the data volume and time requirements to train a fully customised multiple object detector from scratch, the most appropriate approach was to utilise a pre-trained object detector and fine-tune the model on our custom dataset [16]. As such, the training time was reduced while also achieving favourable results with a reduced data volume. Future work in this area may compare different multiple object detectors, e.g., [54][55][56][57], to improve detection accuracy and reduce the time taken for inference. Additionally, the model developed in this study may be used to generate a larger athlete detection dataset to enable a multiple object detector to be trained from scratch as a means for comparison. The dataset should also be expanded to include multiple teams to increase the applicability of the athlete detector method presented here.
The use of a customised two-digit number reader for identifying the player numbers of each tracked athlete, similar to the approach taken by Yoon and colleagues (2019) [17], was substantively different to previous tracking-by-detection methods used in sports [9,15,36] and achieved a high accuracy of 0.98. Although the combination of multiple, separate deep learning architectures in the athlete tracking pipeline is not an efficient process, the good performance allowed for the determination of stadium-specific pixel-to-Cartesian coordinate transformation coefficients, which can be used in future research. Previously implemented pre-trained tracking models [15,35] were not suitable for implementation in the current study due to the uniqueness of the dataset (i.e., athletes represented as dots with playing numbers) in comparison to the data used to develop open-source tracking methods (i.e., real-world images of humans). The application of pre-trained tracking models should be explored by using the current method to generate the required data for training custom tracking models specific to AF. In doing so, future work may develop alternative and more streamlined tracking-by-detection methods.
The stadium-grouped coefficients of the pixel-to-Cartesian coordinate transformation equations (Table 1) demonstrated low variability, thereby establishing a stadium-specific method for transforming pixel coordinates to field-relative Cartesian coordinates. The slight differences in the scaling and translation coefficients between stadiums demonstrated the robustness of the approach in accounting for the varying field sizes used in AF [41].
The median positional error of the current approach (2.63 m) is high compared with the reported error of commonly used GPS and LPS devices (0.96 ± 0.49 m and 0.23 ± 0.07 m, respectively [43]). However, it was observed that the positional differences between the transformed pixel coordinates and the GPS Cartesian coordinates were systematic in nature (i.e., the magnitude and direction of the difference were consistent for each detected athlete). These observations suggest that the Euclidean distance between the transformed pixel coordinates and GPS Cartesian coordinates can be reduced by fine-tuning the transformation equations. Additionally, it is evident that the novel approach resulted in significant outliers (Figure 4) that were found to be attributable to detection method errors and the subsequent misidentification of athletes. This error may be mitigated by adopting more sophisticated post-processing and filtering protocols, or through the use of custom tracking models. The applied conditional filter removed large outliers but at the same time introduced large gaps without any information. Custom tracking models or filters such as a Kalman filter could be used to minimise large detection gaps and therefore, by extension, the misidentification of athletes [35]. The present work was also affected by the unequal representation of matches played at each stadium, a consequence of the AF playing season draw, in which the industry research partner did not play at every AF stadium over the course of the 2019 season. This limitation could be addressed by using data from multiple teams and seasons in future work, which is, however, challenging due to the limited availability of GPS data from different professional AF teams. Therefore, the proposed method offers the opportunity to create a larger dataset that can be used in the future to train more sophisticated and streamlined machine learning models for player detection and tracking based on unique AF animations.
This research is the first step towards an automated tool for the determination of the on-field position of players of both teams in AF. This information will allow sport professionals to better understand the tactical behaviour and interactions of a team or individual [5,6] and support decision-making pertaining to performance and injury risk [4].

Conclusions
This study introduced a novel method to obtain the on-field location of AF athletes with high accuracy from commercially available animations of athlete GPS data, circumventing the pitfalls of video data. The ability to obtain the on-field location of athletes in this manner unlocks the potential of recent analytic advances in the study of collective team behaviour, a research stream currently hampered by the unavailability of opposition team athlete tracking data in AF. The method may easily be extended to obtain the on-field locations of opposition team athletes and to analyse opposition team strategies. Athlete tracking data of this type may also be used to develop interactive play sketching tools in AF, which have recently been realised in the context of basketball and soccer [58,59].
Future work should expand on these methods across multiple areas. First, the total volume of data should be increased by including multiple teams from the competition. Second, variations in the proposed CNN architectures should be explored to realise a real-time pipeline. Finally, matches played at all stadiums should be included to ensure that the transformation equations developed are applicable for any given competition.
[Figure 1: workflow diagram with four panels: dataset, athlete detection, athlete tracking, and conversion to the field.]

Figure 1. Overview of the workflow used in this study. Raw GPS data is not available for the opposition team. All steps using this information are highlighted by red boxes. GPS animations are commercially available for both teams and all steps in the workflow using this data are highlighted by green boxes. The numbers provided in brackets indicate the data set size used for each step.

Figure 2. Architecture details of the custom athlete number convolutional neural network. The first box displays the input layer, followed by five convolutional layers of different sizes described by the numbers. After flattening the data, the convolutional layers are followed by three dense layers of different sizes. The blue and red shapes display the data flow through the network.

Figure 3. Boxplot of the accuracy measure distribution of the custom number reader model across the five-fold cross-validation training protocol.

Figure 4. Boxplot of the distribution of the 95% confidence interval Euclidean distance between the transformed pixel coordinates and GPS field-relative coordinates.

Table 1. Optimisation problem coefficient results grouped by stadium. Pix-uv denotes the pixel coordinates; $m_{Coeff}$ and $c_{Coeff}$ are the scaling and translation coefficients, respectively.