Temporal Pattern Attention for Multivariate Time Series of Tennis Strokes Classification

Human Action Recognition is a challenging task used in many applications. It interacts with many aspects of Computer Vision, Machine Learning, Deep Learning and Image Processing in order to understand human behaviours as well as identify them. It makes a significant contribution to sport analysis, by indicating players’ performance level and training evaluation. The main purpose of this study is to investigate how the content of three-dimensional data influences the classification accuracy of four basic tennis strokes: forehand, backhand, volley forehand, and volley backhand. The entire player’s silhouette, alone and in combination with a tennis racket, was taken into consideration as input to the classifier. Three-dimensional data were recorded using the motion capture system (Vicon Oxford, UK). The Plug-in Gait model consisting of 39 retro-reflective markers was used for the player’s body acquisition. A seven-marker model was created for capturing the tennis racket. The racket is represented in the form of a rigid body; therefore, all points associated with it changed their coordinates simultaneously. The Attention Temporal Graph Convolutional Network was applied to these data. The highest accuracy, up to 93%, was achieved for the data of the whole player’s silhouette together with a tennis racket. The obtained results indicated that for dynamic movements, such as tennis strokes, it is necessary to analyze the position of the whole body of the player as well as the racket position.


Introduction
Computer Vision is an interdisciplinary field of study that aims to derive meaningful information from various types of data. Applying artificial intelligence to digital images, skeleton, depth, videos, point cloud, audio, acceleration, signals or motion capture data allows one to perform actions or make decisions as well as further recommendations. The purpose of Human Action Recognition (HAR) is to understand human behaviours and identify them [1,2]. It specifies a set of a person's moves performed in time in order to complete a task. Occasionally, additional objects, such as a tennis racket or a golf club, are involved in performing the actions. Depending on the complexity of the movements and their duration, sequences of different lengths are taken into consideration, from a single frame to a whole video stream. HAR is a challenging task used in numerous applications. It interacts with many aspects of Computer Vision, Machine Learning, Deep Learning and Image Processing [3]. It utilizes detection of a person or objects in the image, video as well as sensor data, the location of the action in time and space, and the recognition of the action. This approach usually involves detecting features such as those extracted from 3D silhouettes, skeletal joint and body part locations, local spatio-temporal features, local occupancy patterns and finally 3D scene flow [2]. That is why it makes a significant contribution to sport analysis. Detection of athletes and recognition of their actions or teams' activities plays a pivotal role in indicating the players' performance level and training evaluation or analyzing sport statistics [3].
Decision Tree (DT), Random Forest (RF) and k-Nearest Neighbor (kNN) [15]. The Pan Tompkins algorithm for the classification of shots using time warping was presented in [16]. Many studies focused on tennis stroke recognition based on video data. This approach involved extracting features from videos and applying a classifier to the whole set [17]. THETIS is a very well-known dataset consisting of twelve tennis moves captured by Microsoft Kinect in the form of video and ONI files [18]. The video-based action recognition of backhand (two-handed, one-handed, slice, and volley), forehand (flat, open stance, slice, and volley), serve (flat, kick, and slice) as well as smash was performed using the 3-layered LSTM network in [19,20]. In [17], these twelve moves were classified using SVM and linear-chain Conditional Random Fields (CRF). The five-layer deep historical LSTM network described in [21] was applied to similar moves using the following datasets: THETIS and HMDB51. Six tennis strokes from the THETIS dataset were recognised by the LSTM network in [22]. Serve, hit as well as non-hit were recognized by the Kernelised Linear Discriminant Analysis (KLDA) in [23]. Transductive transfer learning for an annotation of video sequences was applied. The changes in the tennis ball were also taken into consideration. The basic tennis strokes, forehand and backhand, from a video were analyzed in [24-26] using the SVM classifier. In [27], tennis serves, forehand and backhand were recognised using two classifiers: SVM with the radial basis function kernel and k-Nearest Neighbour (kNN). A wireless inertial measurement unit sensor together with a system consisting of eight video cameras was used for capturing the data.
Studies concerning HAR were also performed using motion capture data recorded via optical systems. Forehand and backhand strokes with and without ball contact as well as no-shots were recognized by ST-GCN based on images generated from three-dimensional motion data in [28]. Graph Convolutional Networks (GCNs) were an obvious choice due to the fact that the parts of the image correlated with the human topology. In this study, the influence of input fuzzification on the obtained accuracy was examined. The results showed that this approach increased recognition ability. An extension of the above research was the recognition of individual tennis stroke phases, i.e., forehand preparation, forehand shot with racket swinging, backhand preparation, backhand shot with racket swinging and no-shots, which was presented in [29]. Three classifiers with and without fuzzification were taken into account: SVM, MLP, and ST-GCN. In addition, the influence of the extensions and generalizations of the Choquet integral on the aggregation of results obtained by individual classifiers was verified. The results indicated that this method increased the efficiency of recognizing tennis moves. Another approach to tennis movement recognition, including its phases, was presented in [30]. For the purpose of the classification, the Attention Temporal Graph Convolutional Network (A3T-GCN) was applied both with and without input fuzzification. The results showed that this classifier might be considered as one of the most appropriate methods for tennis classification.
The contribution of this paper is the application of the A3T-GCN classifier to tennis stroke recognition based on three-dimensional coordinate data obtained from the optical motion capture system. Forehand, backhand and volley strokes were taken into consideration. The main purpose of this study is to look into how the content of three-dimensional data influences classification accuracy, precision, recall, and F1 score. Both the coordinates associated with the player's silhouette and the position of the racket were analyzed, which to the authors' knowledge is a novel approach. The A3T-GCN was chosen due to the attention model, which both stores information about the player's model and determines the predicted player position.
The rest of this paper is organized as follows. Section 2 explores the material and methods as well as introduces the Attention Temporal Graph Convolutional Network. Section 3 presents results of the state-of-the-art action recognition methods with the proposed classifier and 3D motion capture data. Section 4 discusses the proposed method, and finally Section 5 concludes the study.

Participants
In this study, seven male and three female tennis players took part (age 23.7 ± 4.58, height 1.77 ± 0.13 m, weight 71.65 ± 10.68 kg). Only one of them was left-handed, while the others were right-handed. They all signed the consent for the study.

Data Acquisition
Each participant was prepared for the experiment. First, they had a 15-min warm-up. Second, thirty-nine retroreflective markers, specified in the Plug-in Gait model, were attached to their body. Finally, all the required measurements were gathered for the purpose of creating a new model as well as preparing its calibration in the motion capture system. Furthermore, seven markers were also attached to the tennis racket, according to the following scheme: one to the top of the racket head, two on both sides of the racket, one to the bottom of the racket head and one to the bottom of the racket handle. Such an arrangement reflects the racket shape and captures its movements.
For the purpose of acquisition, an eight-camera optical Vicon motion capture system, installed in an indoor room, was used with the Nexus software. The cameras were mounted two on each wall at the same level. The whole scheme of the camera arrangement is presented in Figure 1. Before movement acquisition, the system was calibrated. The maximal calibration error did not exceed 0.045 pixels. The frequency of capturing was set to 100 Hz. Each participant performed forehand, two-handed backhand and volley strokes. Forehand and backhand strokes were performed while running and avoiding a bollard placed on the floor. Due to this, the strokes were more natural than hitting the ball from a standing position. At first, ten forehand strokes without a ball were performed, followed by ten backhand strokes without a ball. Next, these exercises were repeated with a ball. Finally, the participant performed ten volley forehand and ten volley backhand strokes in front of the tennis net. Tennis balls were thrown from the right and the left side of the net. Standing parallel to the net, the player made a short movement with the racket in front of him/her, causing the ball to bounce and fall. The participant hit a ball which was caught by a special net. The forehand tennis stroke was made with the dominant hand. The racket was placed on the dominant side; then, it was directed towards the ball. After the racket made contact with the ball, the racket was directed to the opposite arm of the player in a swinging motion. While performing a two-handed backhand stroke, the racket was held with a continental grip. It was placed on the opposite side to the dominant one. After the racket made contact with the ball, it was directed to the dominant side. In the case of a one-handed backhand, the racket was held with the dominant hand. These two types of strokes are presented in Figure 2.
It is worth indicating that forehand and forehand volley are very similar moves in a certain part of the movement. The same goes for backhand strokes. Each performed stroke has been verified by a specialist. All failed strokes were rejected. Due to the fact that professional tennis players participated in the study, the well-performed strokes were repetitive.

Data Post-Processing
The Vicon Nexus software was used for post-processing of all obtained recordings. This task involved the following steps: marker labelling, gap filling using interpolation methods implemented in the Vicon Nexus software (Pattern Fill and Rigid Body Fill), data cleaning, and applying the Plug-in Gait model. The last step applied only to the model representing the human body. Additionally, a new model, consisting of all markers attached to the racket, was generated. The data prepared in this way were saved to c3d files.
All gathered recordings were verified by a professional tennis coach. As a result, the following numbers of tennis moves were obtained: backhand: 212, forehand: 197, forehand volley: 180, backhand volley: 180.
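To make the gap-filling step more concrete, the sketch below fills missing samples of a single marker coordinate by plain linear interpolation. This is a stand-in chosen for illustration only: it is neither Pattern Fill nor Rigid Body Fill from Vicon Nexus, and the function name `fill_gaps_linear` and the use of `None` for a missing sample are assumptions.

```python
def fill_gaps_linear(values):
    """Fill None entries that lie between two known samples by linear interpolation.

    Gaps at the very start or end of the trajectory are left untouched,
    because there is no sample on one side to interpolate from.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled


# Hypothetical x-coordinate of one marker with a two-frame gap.
TRAJECTORY = fill_gaps_linear([0.0, None, None, 3.0])
```

In practice each of the three coordinates of a marker would be processed in the same way, frame by frame at the 100 Hz capture rate.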

Attention Temporal Graph Convolutional Network
The idea of the A3T-GCN was taken from the work [31], where a similar structure was used to predict traffic volume in selected cities. The basic modification of this network consists of transforming the element responsible for the prediction into a classifier. Additionally, the elements responsible for the separation of spatial and temporal features have also been adapted. In the original approach, the Gated Recurrent Unit (GRU) network was applied. Because the extensive structure of the GRU network is inadequate to the problem analyzed in this work, a Bidirectional RNN (often also called BiRNN) was used instead. It is schematically shown in Figure 3. Moreover, the original prediction was based on a Context Vector, while in this study an additional Multilayer Perceptron was added at the output of the classifier. The whole network structure used in this study is presented in Figure 4.

Spatial Features
Usually, skeleton data studies are based on images or video as input, so the data are processed by typical Convolutional Neural Networks (CNN). In this study, points in three-dimensional space were used as input data, so the proposed classifier was based on Graph Convolutional Networks (GCNs). The connections between the nodes of the graph G were presented in the form of the adjacency matrix O. The entire feature matrix has been marked with the I variable. To process graph nodes, the GCN network uses a Fourier filter to determine the spatial relation between features. This relationship is characterized by Equation (1), which defines a multilayer GCN model:
$$F^{(n+1)} = \sigma\left(\tilde{T}^{-1/2}\,\tilde{O}\,\tilde{T}^{-1/2}\,F^{(n)}\,\Theta^{(n)}\right) \tag{1}$$
where $n$ is the index of the hidden layer, $\tilde{O} = O + I_N$ is the adjacency matrix $O$ with added self-connections, $I_N$ describes the identity matrix, $\tilde{T}$ is the degree matrix with $\tilde{T}_{ii} = \sum_j \tilde{O}_{ij}$, $F^{(n)}$ defines the output of the $n$th layer, $\Theta^{(n)}$ is a matrix which contains all parameters of the $n$th layer and $\sigma(\cdot)$ represents the sigmoidal function for a nonlinear model [32].
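The normalization that appears in Equation (1) can be sketched in a few lines of standard-library Python. The function name `normalized_adjacency` and the three-node chain are illustrative assumptions; the chain is not the Plug-in Gait marker topology.

```python
import math


def normalized_adjacency(num_nodes, edges):
    """Return T~^(-1/2) (O + I_N) T~^(-1/2) as a list of rows."""
    # O~ starts as the symmetric adjacency matrix O of an undirected graph.
    o_tilde = [[0.0] * num_nodes for _ in range(num_nodes)]
    for a, b in edges:
        o_tilde[a][b] = 1.0
        o_tilde[b][a] = 1.0
    # Add self-connections: O~ = O + I_N.
    for i in range(num_nodes):
        o_tilde[i][i] = 1.0
    # Diagonal of the degree matrix: T~_ii = sum_j O~_ij.
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in o_tilde]
    return [
        [inv_sqrt[i] * o_tilde[i][j] * inv_sqrt[j] for j in range(num_nodes)]
        for i in range(num_nodes)
    ]


# Hypothetical chain of three markers: 0 - 1 - 2.
O_HAT = normalized_adjacency(3, [(0, 1), (1, 2)])
```

Because of the self-connections every node has a degree of at least one, so the inverse square root is always defined.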
In this study, the GCN network consists of three layers. This structure can be described by Equation (2),
where $\hat{O} = \tilde{T}^{-1/2}\,\tilde{O}\,\tilde{T}^{-1/2}$ indicates the preliminary step, $\Psi_0 \in \mathbb{R}^{P \times F}$ denotes the weight matrix between the input and hidden layer, $P$ defines the size of the feature matrix, while $F$ is a value related to the number of hidden units, $\Psi_1, \Psi_2 \in \mathbb{R}^{F \times Z}$ define the weight matrices from the hidden to the output layer, $f(I, O) \in \mathbb{R}^{N \times Z}$ denotes the output of length $Z$, and ReLU() is the Rectified Linear Unit, commonly used as a neuron activation function [32].
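A minimal sketch of a three-layer graph convolution in the spirit of Equations (1) and (2) is given below, assuming ReLU on the inner layers and a sigmoid on the last one. The helper names, the toy sizes and the hidden-to-hidden shape F x F for the second weight matrix are assumptions made here so that the matrix products are conformable; they are not taken from the study.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(matrix):
    return [[max(0.0, v) for v in row] for row in matrix]


def sigmoid(matrix):
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matrix]


def gcn_forward(o_hat, features, weights):
    """Apply one graph convolution per weight matrix.

    Inner layers use ReLU; the last layer uses the sigmoid (an assumption).
    """
    hidden = features
    for index, weight in enumerate(weights):
        hidden = matmul(matmul(o_hat, hidden), weight)
        is_last = index == len(weights) - 1
        hidden = sigmoid(hidden) if is_last else relu(hidden)
    return hidden


# Toy example: N = 2 connected nodes, P = 3 coordinates, F = 2, Z = 4.
O_HAT = [[0.5, 0.5], [0.5, 0.5]]  # normalized adjacency with self-connections
FEATURES = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
PSI = [
    [[0.1] * 2 for _ in range(3)],  # Psi_0: P x F
    [[0.1] * 2 for _ in range(2)],  # Psi_1: F x F (assumed shape)
    [[0.1] * 4 for _ in range(2)],  # Psi_2: F x Z
]
OUTPUT = gcn_forward(O_HAT, FEATURES, PSI)
```

The output has one row per graph node and Z columns, matching the stated shape of $f(I, O)$.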

Temporal Features
To indicate temporal features, which are the key elements in recognizing the analyzed types of tennis strokes, a Bidirectional Recurrent Neural Network was used. BiRNNs were applied to obtain the information about the player at time t. To gather this kind of data, the information about previous features (at times n − 1, n − 2, ..., n − $n_f$, where $n_f$ denotes the maximum number of frames over all c3d files) was taken into consideration. If an analyzed file had fewer frames, the missing values were set to 0. The structure of the whole temporal feature element can be expressed by Equations (3)-(6) [31], where $ugc_t$ denotes the update gate, whose role is to control the quantity of information from the previous moment, $rgc_t$ indicates the reset gate, which is responsible for neglecting the state information from the previous moment, $mc_t$ describes the memory content stored at the current moment and $h_t$ defines the output value at the current moment. $W_u$, $W_r$ and $W_c$ represent the weights in the training process for the update gate layer, reset gate layer and output layer, respectively.
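The two ideas in this paragraph, zero padding to the longest recording and a gated recurrent update, can be sketched as follows. The gate arithmetic follows the usual GRU-style form associated with [31]; the omission of bias terms, the padding at the end of the sequence and all function names are assumptions. A bidirectional variant would run the same step over the reversed sequence as well.

```python
import math


def pad_sequences(sequences, fill=0.0):
    """Pad every sequence with zero frames up to the longest length n_f."""
    n_f = max(len(seq) for seq in sequences)
    width = len(sequences[0][0])
    return [
        list(seq) + [[fill] * width for _ in range(n_f - len(seq))]
        for seq in sequences
    ]


def _sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def _affine(weight, vector):
    """Matrix-vector product; bias terms are omitted in this sketch."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def gated_step(x_t, h_prev, w_u, w_r, w_c):
    """One gated update producing h_t from the input x_t and the state h_{t-1}."""
    joined = x_t + h_prev
    update = [_sigmoid(v) for v in _affine(w_u, joined)]  # ugc_t
    reset = [_sigmoid(v) for v in _affine(w_r, joined)]   # rgc_t
    candidate_in = x_t + [r * h for r, h in zip(reset, h_prev)]
    memory = [math.tanh(v) for v in _affine(w_c, candidate_in)]  # mc_t
    return [u * h + (1.0 - u) * m for u, h, m in zip(update, h_prev, memory)]


# Two hypothetical recordings of different length, three values per frame.
PADDED = pad_sequences([[[1.0, 2.0, 3.0]], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]])
```

With all weights set to zero the update gate equals 0.5 and the memory content is 0, so the step simply halves the previous state, which is a convenient sanity check.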

Attention Model
The attention model is commonly defined as an encoder-decoder. It is widely used in applications such as traffic forecasting [31], image labeling [33], recommendation systems [34] or document classification [35]. Based on [36], the most general division of this kind of model distinguishes hard and soft attention. In this study, the soft one was applied. The first role of the attention model is to store information about the player's model. The second is to determine the context vector, which is responsible for determining the predicted tennis player positions. The applied attention model consists of the following steps.

1. First, the successive hidden states $u_k$ ($k = 1, ..., m$) of the time series $I_k$ ($k = 1, 2, ..., m$), where $m$ is the number of frames in the series, are determined using the BiRNN network. As a result, the set of $u_k$ states is defined.

2. Second, a context vector ($C_v$) is determined. In particular, the value of the position change is determined on the basis of two hidden layers. Their features are given by Equation (7), where $\psi^{(1)}$ and $b^{(1)}$ denote the weight and bias of the first layer and $\psi^{(2)}$ and $b^{(2)}$ are the corresponding features of the second layer. $H$ is a matrix with hidden layer values. To determine the values of $\psi^{(1)}$, $\psi^{(2)}$, the Softmax function (8) is used.

3. Finally, $C_v$ is defined by Equation (9).

The final classification is performed by a two-layer perceptron. This neural structure consists of one element in the first layer and four (related to the four recognised strokes) in the second one. The Softmax function is used to activate the neurons in the first layer, while the second layer is activated by a linear one.
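The steps above can be sketched as a small soft-attention routine: each hidden state is scored by a two-layer map, the scores are normalized with Softmax, and the context vector is the weighted sum of the hidden states. This follows the usual soft-attention form associated with [31]; the names `scores` and `alphas`, the single scalar output of the second layer and the toy numbers are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(hidden_states, psi1, b1, psi2, b2):
    """Return the attention weights and the context vector C_v."""
    scores = []
    for u_k in hidden_states:
        # First layer: psi1 * u_k + b1 (one row of psi1 per hidden unit).
        layer1 = [sum(w * v for w, v in zip(row, u_k)) + b for row, b in zip(psi1, b1)]
        # Second layer: a single scalar score per frame.
        scores.append(sum(w * v for w, v in zip(psi2, layer1)) + b2)
    alphas = softmax(scores)
    size = len(hidden_states[0])
    c_v = [sum(a * u_k[d] for a, u_k in zip(alphas, hidden_states)) for d in range(size)]
    return alphas, c_v


# Two hypothetical hidden states of size 2; the score favours the second frame.
ALPHAS, C_V = context_vector([[1.0, 0.0], [3.0, 2.0]], [[1.0, 0.0]], [0.0], [1.0], 0.0)
```

If all weights and biases are zero, every frame receives the same weight and the context vector reduces to the mean of the hidden states.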

Experiment
In this study, the forehand, backhand, volley forehand and volley backhand strokes were recognized. The whole tennis movement dataset consisted of 212 backhand, 197 forehand, 180 volley forehand, and 180 volley backhand strokes. It represented the player's silhouette together with a tennis racket. Two types of experiments were performed. The first one concerned the whole set of data, while the other concerned only the player's silhouette, obtained by removing the coordinates of the tennis racket subject.
A series of experiments was carried out, taking into account the random division of data into the training, validation and test sets: 60%, 20% and 20%, respectively. The data were chosen from every type of stroke in the above-mentioned proportions. For each division, 20 tests were carried out independently.
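The two input configurations can be illustrated with a short sketch. It assumes that each frame is stored as a list of 46 (x, y, z) tuples with the 39 body markers first and the 7 racket markers last; this ordering and the function name are assumptions, not a description of the c3d layout used in the study.

```python
BODY_MARKERS = 39    # Plug-in Gait markers on the player
RACKET_MARKERS = 7   # markers attached to the racket


def silhouette_only(frames):
    """Drop the racket markers from every frame (assumes they are stored last)."""
    return [frame[:BODY_MARKERS] for frame in frames]


# One hypothetical recording with two frames of 46 markers each.
FRAME = [(0.0, 0.0, 0.0)] * (BODY_MARKERS + RACKET_MARKERS)
BODY_ONLY = silhouette_only([FRAME, FRAME])
```

Under this layout, the classifier input has 46 x 3 values per frame in the first experiment and 39 x 3 in the second.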
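A per-class random division in the stated 60/20/20 proportions might look like the sketch below. The class counts are those reported above; the function name, the seed handling and the rounding of the per-class sizes are assumptions.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validation, test) index lists, split separately per class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


COUNTS = {"backhand": 212, "forehand": 197, "volley forehand": 180, "volley backhand": 180}
LABELS = [name for name, count in COUNTS.items() for _ in range(count)]
TRAIN, VAL, TEST = stratified_split(LABELS)
```

Repeating the call with 20 different seeds would give 20 independent divisions of the kind described above.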

Results
Grouped results are presented in Tables 1-5. Selected parameters of the learning process showing the correctness of the model are shown in Figure 5. The loss value $L_{CE}$ was calculated on the basis of the Sparse Categorical Cross-Entropy defined by Equation (10):
$$L_{CE} = -\sum_{i=1}^{n} T_i \log(p_i) \tag{10}$$
for $n$ classes, where $T_i$ is the ground truth and $p_i$ is the Softmax probability for the $i$th class. The assessment of the classifier quality was based on several standard measures, such as: Accuracy (11), Precision (12), Recall (13) and F1 score (14).
where TP denotes the true positive fraction, FP the false positive fraction, and FN the false negative fraction. Although Precision, Recall, and F1 are usually presented for binary classification, their definitions extend simply to multiple classes. In this case, Precision, e.g., for backhand, is defined as the number of correctly classified backhand strokes out of all strokes classified as backhand. The Recall for backhand is the number of correctly predicted backhand strokes out of all input backhand strokes.
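The per-class definitions above can be sketched from a confusion matrix whose rows are the true classes and whose columns are the predicted ones. The function names are assumptions, and the small matrix is made-up illustrative data, not a result of the study.

```python
import math


def sparse_categorical_cross_entropy(true_class, probabilities):
    """Loss for one sample: minus the log of the probability of the true class."""
    return -math.log(probabilities[true_class])


def per_class_metrics(confusion, k):
    """Precision, Recall and F1 for class k (rows: true class, columns: predicted)."""
    tp = confusion[k][k]
    fp = sum(row[k] for row in confusion) - tp  # predicted as k, but another class
    fn = sum(confusion[k]) - tp                 # class k, but predicted as another
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(confusion):
    """Share of all samples that lie on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)


# Made-up two-class confusion matrix, used only to exercise the functions.
TOY = [[8, 2], [1, 9]]
PRECISION, RECALL, F1 = per_class_metrics(TOY, 0)
```

For a sparse label the cross-entropy sum keeps only the term of the true class, which is why the loss function needs just one probability.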
For the accuracy results obtained for the two types of data, strokes without and with a tennis racket (Table 1), the T-Test was calculated, for which t = −8.2753 was obtained, for α = 0.05. The obtained result allowed us to state that it cannot be concluded that there is a difference between the means. In order to check the correctness of the developed model, Leave-One-Out Cross-Validation (LOOCV) was performed (Table 6). This is a computationally expensive procedure; however, it allows us to obtain clear and unbiased information about the model. Using LOOCV, the root mean squared error (RMSE) for $n$ tests was determined:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where $n$ denotes the number of tests, $y_i$ the true value, and $\hat{y}_i$ the predicted value.
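The LOOCV procedure and the RMSE can be sketched as follows. Whether a single recording or a single participant is held out in each round is not stated here, so holding out one sample per round is an assumption, as are the function names and the toy values.

```python
import math


def rmse(true_values, predicted_values):
    """Root mean squared error over n paired values."""
    n = len(true_values)
    squared = sum((y - y_hat) ** 2 for y, y_hat in zip(true_values, predicted_values))
    return math.sqrt(squared / n)


def leave_one_out(n_samples):
    """Yield (training indices, held-out index), one pair per sample."""
    for held_out in range(n_samples):
        yield [i for i in range(n_samples) if i != held_out], held_out


SPLITS = list(leave_one_out(3))
ERROR = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

The model is retrained once per held-out sample, which is what makes the procedure computationally expensive.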

Discussion
In this study, the recognition of tennis forehand, backhand, volley forehand and volley backhand was performed based on data gathered in c3d files in the form of three-dimensional coordinates. Both the player's silhouette indicated by 39 markers and the silhouette together with a tennis racket represented by 7 additional markers were analyzed.
As can be observed in Tables 1 and 2 and Figure 6, the mean accuracy depends on the captured data. It is higher for the experiments with the whole tennis player's silhouette together with a tennis racket than for the experiment involving only the body. It can be concluded that the arrangement and trajectory of a tennis racket play an extremely important role in the correct classification.
Furthermore, in the case of Precision (see Table 3), the obtained mean results are higher for the combination of the tennis player's body model with a tennis racket for all analyzed strokes. The same dependence applies to Recall results (Table 4) and the F1 score (Table 5).
It should be noted that the standard deviation, amounting to a few percent, is low for all the obtained results, which shows the stability of the applied classifier. In Table 7, the state-of-the-art studies related to tennis stroke recognition are presented. They were performed using various types of data obtained from different sources, such as sensors, video or motion capture systems. Most research in this field was carried out on the well-known THETIS database. Both the data in the form of video and the images obtained from the Kinect motion capture system were the source for recognizing tennis movements. Broadcast video involving real matches or tournaments with top tennis players was also often taken into consideration. Various types of neural network approaches were used for these purposes. It is worth stressing that graph neural networks (ST-GCN and A3T-GCN) were used to recognize basic tennis strokes based on data obtained from an optical motion capture system. This classifier was chosen due to the characteristics of the recorded data. The applied human model, represented by 39 markers attached to the body in fixed locations, is transformed into a graph, which reflects the topology of the human silhouette. This approach allowed us to obtain a high accuracy. Analyzing the results presented in the study [28], it can be seen that using the A3T-GCN classifier allows for better recognition of tennis strokes than the ST-GCN one, despite the different types of input data. In the previous study, described in [28], an accuracy of 68.9% was obtained with 60% of the data belonging to the training set, which corresponds to the settings in this study. The forehand and backhand classification was performed using images containing the subjects of the tennis player together with the racket. They were generated based on three-dimensional data by the Vicon Nexus software.
Based on this kind of input data, a simplified model of both the tennis player and the racket was created. The tennis racket was represented only by two points referring to its head and handle. The mean accuracy obtained in this study is higher for both the analyzed silhouette and its combination with a tennis racket compared to the results obtained in the work [28]. In [30], images obtained from three-dimensional data were analysed. The classification of forehand, divided into the preparation and hit phases, and backhand, also divided into the preparation and hit phases, as well as no-hit was performed using the Attention Temporal Graph Convolutional Network. The achieved accuracy, as the mean of the two phases, did not exceed 80% for the forehand stroke and 77% for the backhand. The mean accuracy results obtained in this paper are higher for the analyzed strokes with and without a tennis racket. The tennis movement recognition presented in this paper was performed based on three-dimensional data in the form of coordinates of markers placed on the player's body and a tennis racket. As can be seen in Table 7, this approach is unique. The results obtained in this paper suggest that the type of input data affects the accuracy of tennis stroke classification. Whole-body data stored in the form of three-dimensional coordinates allow better results to be achieved than images obtained from three-dimensional data. In addition, the inclusion of a tennis racket in the input data improves the classification quality of these strokes.

Conclusions
The aim of this study was to verify the impact of adding a tennis racket to the input data of the whole player's silhouette on the final classification of four main tennis strokes. For the purpose of this study, the A3T-GCN classifier was applied. The tennis moves were represented in the form of three-dimensional motion data. The described approach gave satisfactory results. Forehand, backhand, volley forehand and volley backhand were taken into consideration. Compared with the authors' previous study using the ST-GCN network as a classifier [28], the accuracy obtained for the recognition of two basic tennis movements (forehand and backhand) has been improved. As the results showed, adding a tennis racket to the data presenting the whole body silhouette significantly improved the classification quality. While comparing two graph networks, i.e., A3T-GCN and ST-GCN, the results obtained in this study clearly indicated that the A3T-GCN structure might be considered as the most suitable for motion data expressed in the form of three-dimensional coordinates. Further work will focus on the possibilities of using other methods of computational intelligence, in particular deep learning methods, and investigating their impact on the efficiency of movement classification in sport. Moreover, research on the possibility of aggregating classification methods may be of interest in further studies in this field.