A Super-Bagging Method for Volleyball Action Recognition Using Wearable Sensors

Abstract: Access to performance data during matches and training sessions is important for coaches and players. Although many video tagging systems can provide such access, they require manual effort. Data from Inertial Measurement Units (IMUs) could be used to automatically tag video recordings in terms of players’ actions. However, the data gathered during volleyball sessions are generally very imbalanced, since for an individual player most time intervals can be classified as “non-actions” rather than “actions”. This makes automatic annotation of video recordings of volleyball matches a challenging machine-learning problem. To address this problem, we evaluated balanced and imbalanced learning methods together with our newly proposed ‘super-bagging’ method for volleyball action modelling. All methods are evaluated using six classifiers and four sensors (i.e., accelerometer, magnetometer, gyroscope and barometer). We demonstrate that imbalanced learning provides a better unweighted average recall (UAR = 83.99%) than balanced learning for the non-dominant hand using a naive Bayes classifier, while balanced learning provides better performance (UAR = 84.18%) than imbalanced learning for the dominant hand using a tree bagger classifier. Our super-bagging method provides the best UAR (84.19%). The super-bagging method also provides a better averaged UAR than the balanced and imbalanced methods in 8 out of 10 cases, demonstrating its potential for IMU sensor data. One potential application of these novel models is fatigue and stamina estimation, e.g., by keeping track of how many actions a player performs and when they are performed.


Introduction
Top performance in sports depends on training programs designed by team staff, with a regime of physical, technical, tactical and perceptual-cognitive exercises. Depending on how athletes perform, exercises are adapted, or the program could be redesigned. State of the art data science methods have led to groundbreaking changes. Data come from sources such as position and motion of athletes in basketball [1], and baseball and football match statistics [2].
Furthermore, new hardware platforms have appeared, such as LED displays integrated into a sports court [3] and custom tangible sports interfaces [4], which offer possibilities for hybrid training with a mix of technological and non-technological elements [3]. This has led to novel kinds of
1. Proposal of a novel ensemble method (i.e., the super-bagging method) and its demonstration for volleyball action modelling,
2. Demonstration of the role of dominant and non-dominant hand for volleyball action modelling using super-bagging method,
4. Evaluation of all four IMU sensors separately and in combination for volleyball action modelling using different learning methods (i.e., balanced learning, imbalanced learning and super-bagging methods).

Related Work
Many approaches have been proposed for human activity recognition. They fall into two main categories: wearable sensor-based and vision-based. Vision-based methods employ cameras to detect and recognize activities using computer vision techniques. For example, Zivkovic et al. propose a robust player segmentation algorithm, extract novel features from video frames, and report classification results for different classes of tennis strokes using a Hidden Markov Model [30].
Wearable sensor-based methods collect input signals from sensors mounted on the human body, such as accelerometers and gyroscopes. For example, Liu et al. [31] identified temporal patterns among actions and used those patterns to represent activities for the purpose of automatic recognition. Kautz et al. [32] presented an automatic monitoring system for beach volleyball based on wearable sensor devices placed on the wrist of the dominant hand of players. Beach volleyball serve recognition from a wrist-worn gyroscope is proposed in Cuspinera et al. [33]. Jarit et al. [34] showed that the grip strength of non-dominant and dominant hands is almost the same for college baseball players.
Inertial Measurement Units (IMUs) [11,12] have been used to detect activities in different sports. In soccer, Schuldhaus et al. use a custom-made system comprising sensors and memory to collect data on the lower extremities of soccer players and classify shots and passes [35]. The use of wearable devices is not limited to sports: Wang et al. [36] use wearable sensors to form a wireless body area network that senses various physiological parameters of the human body, while others [37,38] have devised ways to make the process energy efficient and secure. In tennis, Pei et al. use the JY-61 sensor to acquire motion information, and detect the stroke type (forehand, backhand or serve) from acceleration and angular velocity data [10]. Similarly, Kos et al. placed a miniature wearable IMU device on the player's forearm to classify common types of tennis strokes [39].
For volleyball in particular, Vales-Alonso et al. developed a Smart Coaching Assistant for professional volleyball players that supports exercise quality control by analyzing repetitions of the same action using dynamic programming [8]. Bagautdinov et al. use a neural network approach to detect individual activities and infer joint team activities in the context of volleyball games [9]. Wang et al. assessed the skill of volleyball spikers, classifying players into three levels (elite, sub-elite and amateur) with a Support Vector Machine (SVM) classifier [13].
In summary, multiple studies use IMU sensors and computer vision for sports-related events. However, one limitation of a computer vision approach is that it does not work well in the volleyball setting when player positions change and when the view of a player is occluded by another player. Hence, IMU sensors are a good fit for volleyball settings. It is also noted that, while there are quite a few studies on volleyball action modelling, most of them consider the dominant hand only, and the role of the non-dominant hand is less explored in sports-related activities.

Our Approach
This paper extends the ideas presented in our previous work [27,28,29]. Figure 1 shows the overall system architecture. This paper focuses on step 2 of the proposed system; however, to give the reader the full picture, this section summarizes all the steps of the proposed approach. Data were collected for 9 female volleyball players who wore IMUs on both wrists and were encouraged to play naturally during their routine training session, i.e., step (0) in Figure 1. The hardware used in this study consists of Xsens MTw Awinda IMU sensors [11] (https://www.xsens.com/products/mtw-awinda, last accessed May 2020) and two video cameras. The video streams are synchronized with the IMU sensor data streams for further processing.

Data Annotation
To obtain the ground truth for machine-learning model training, the video recording was annotated using the Elan software (see Figure 2) [40]. Three annotators annotated the video. Since the volleyball actions performed by players are quite distinct, there is no ambiguity in terms of inter-annotator agreement. The quality of the annotation is evaluated by a majority vote, i.e., by checking whether all annotators have annotated the same action or whether an annotator might have missed or mislabeled an action. As a result, there were 1453 s of data for the action case and 24,412 s for the non-action case. Table 1 shows the amount of data (in seconds) for each player. This data set is made available to the research community upon request. The annotators annotated the type of volleyball action (underhand serve, overhead pass, serve, forearm pass, one hand pass, smash and underhand pass); any other activity, such as walking, is considered to be a non-action. Table 1 also details the number of volleyball actions performed by each player.

Auto-tagging System Prototype
The proposed system performs classification in two stages, i.e., step (2) and step (3). In step (2), binary classification (detection of the start and end times of an action) is performed at frame level using supervised machine learning to identify whether a player is performing an action or not [29]. After the start and end times of an action have been detected, the type of volleyball action performed by the player is classified in step (3) (Figure 1) using supervised machine-learning algorithms. Once the action type is identified, its information along with the timestamp is stored in a repository for indexing purposes. Information related to the video, the players and the actions performed by the players is indexed and stored as documents in tables or cores in the Solr search platform [41]. An example of a 'Smash' indexed by Solr is shown below:

    "id":"25_06_Player_1_action_2",
    "player_id":["25_06_Player_1"],
    "action_name":["Smash"],
    "timestamp":["00:02:15"],
    "_version_":1638860511128846336

An interactive system was developed to give players and coaches access to performance data by automatically supplementing video recordings of training sessions and matches. The interactive system is a web application: the server side is written using the asp.net MVC framework and the front-end is developed using HTML5/Javascript. Figure 3 shows a screenshot of the front-end of the developed system. The player list and actions list are dynamically populated by querying the repository. The viewer can filter the actions by player and action type (e.g., overhead pass by player 3). Once a particular action item is clicked or tapped, the video automatically jumps to the time interval where the action is being performed. Currently, the developed system lets a user filter the types of action performed by each player; the details of the interactive system are described in [27,28].
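The move from frame-level decisions to start and end times in step (2) can be illustrated with a minimal sketch. It assumes that consecutive frames labelled as action are simply merged into one interval, and it derives a 0.25 s hop from the 0.5 s frame length and 50% overlap used in this study; the function names `frames_to_intervals` and `to_timestamp` are hypothetical and not part of the described system.

```python
from typing import List, Tuple

FRAME_LENGTH_S = 0.5  # frame length used in this study
HOP_S = 0.25          # assumption: 50% overlap gives a 0.25 s hop


def frames_to_intervals(labels: List[int]) -> List[Tuple[float, float]]:
    """Merge runs of consecutive action frames (label 1) into (start, end) times in seconds."""
    intervals = []
    run_start = None
    for i, label in enumerate(labels):
        if label == 1 and run_start is None:
            run_start = i  # a new run of action frames begins
        elif label != 1 and run_start is not None:
            # the run ended at frame i - 1; its end time includes the full frame length
            intervals.append((run_start * HOP_S, (i - 1) * HOP_S + FRAME_LENGTH_S))
            run_start = None
    if run_start is not None:  # the sequence ended while an action was ongoing
        intervals.append((run_start * HOP_S, (len(labels) - 1) * HOP_S + FRAME_LENGTH_S))
    return intervals


def to_timestamp(seconds: float) -> str:
    """Format a time offset as HH:MM:SS, the format of the indexed 'timestamp' field."""
    total = int(seconds)
    return f"{total // 3600:02d}:{(total % 3600) // 60:02d}:{total % 60:02d}"


if __name__ == "__main__":
    frame_labels = [0, 1, 1, 1, 0, 0, 1, 0]  # made-up frame-level predictions
    print(frames_to_intervals(frame_labels))
    print(to_timestamp(135))
```

In practice, very short runs or small gaps between runs might need smoothing before the intervals are passed on to step (3); that choice is not specified here.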

Super-Bagging Method
This section describes the super-bagging method for training a classifier on imbalanced data. We call the method super-bagging because, like bagging methods, an ensemble is trained on multiple subsets of the data; in contrast to regular bagging methods, however, our method does not take random subsets of the data but builds on a balanced (undersampled) data set and the unbalanced (fully sampled) data set. Given a standard training set $D$ of size $n$ (i.e., $n$ observations), super-bagging generates two new training sets: $D_1$ with size $n$ and $D_2$ with size $n'$. All observations are repeated in $D_1$, so $D = D_1$, whereas $D_2$ is a subset of $D_1$. Let the $n$ observations be distributed over two classes $m_1$ and $m_2$, with $t_1$ and $t_2$ observations respectively, so that $n = t_1 + t_2$ and $t_2 > t_1$. Then let $n' = t_1 + t_2'$, where $t_2' = t_1$ (i.e., $n' = 2t_1$). This results in two training sets, one of which is the full imbalanced training set ($D_1$) and the other a balanced training set ($D_2$). Two machine-learning models are trained, one on $D_1$ and one on $D_2$. The outputs of the two models are fused using a decision (late) fusion method, i.e., an instance is labelled as a non-action event only in case of unanimity. When sensors are fused, the number of votes required to label an instance as non-action is determined by a grid search. The architecture of the algorithm is shown in Figure 4.
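The construction described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the implementation used in this study: the names `balanced_subset`, `fuse`, `super_bagging` and `train_nearest_mean` are hypothetical, the toy nearest-mean classifier only stands in for the six classifiers evaluated later, and the undersampling is a simple random draw over the whole set rather than per player.

```python
import random
from typing import Callable, List, Sequence, Tuple

ACTION, NON_ACTION = 1, 0
Sample = Tuple[Sequence[float], int]      # (feature vector, label)
Model = Callable[[Sequence[float]], int]  # maps a feature vector to a label


def balanced_subset(d1: List[Sample], seed: int = 0) -> List[Sample]:
    """Build D2: every minority-class sample plus an equally sized random draw from the majority class."""
    actions = [s for s in d1 if s[1] == ACTION]
    non_actions = [s for s in d1 if s[1] == NON_ACTION]
    minority, majority = sorted([actions, non_actions], key=len)
    rng = random.Random(seed)
    return minority + rng.sample(majority, len(minority))


def fuse(pred_full: int, pred_balanced: int) -> int:
    """Late fusion: label as non-action only if both models say non-action."""
    if pred_full == NON_ACTION and pred_balanced == NON_ACTION:
        return NON_ACTION
    return ACTION


def super_bagging(d: List[Sample], train: Callable[[List[Sample]], Model], seed: int = 0) -> Model:
    """Train one model on the full set D1 = D and one on the balanced subset D2, then fuse them."""
    d1 = list(d)
    d2 = balanced_subset(d1, seed)
    model_full, model_balanced = train(d1), train(d2)
    return lambda x: fuse(model_full(x), model_balanced(x))


def train_nearest_mean(data: List[Sample]) -> Model:
    """Toy stand-in classifier (assumption): predict the class whose feature mean is closest."""
    means = {}
    for label in (ACTION, NON_ACTION):
        rows = [x for x, y in data if y == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x: Sequence[float]) -> int:
        dist = {lab: sum((a - b) ** 2 for a, b in zip(x, m)) for lab, m in means.items()}
        return min(dist, key=dist.get)

    return predict


if __name__ == "__main__":
    # Made-up one-dimensional data: 3 action frames, 9 non-action frames.
    data = [([5.0], ACTION)] * 3 + [([0.0], NON_ACTION)] * 9
    model = super_bagging(data, train_nearest_mean)
    print(model([4.0]), model([0.5]))
```

Because non-action requires unanimity, the fused model can only increase the number of frames labelled as action relative to either single model, which favours recall on the minority class.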

Experimentation
This section describes the training of machine-learning models using the balanced, imbalanced and super-bagging methods for the recognition of action and non-action events.

The Data Set
We evaluated the super-bagging method using the data set [28] which we collected with the aim of developing volleyball action recognition components to be used in interactive digital-physical volleyball exercise applications [42]. This data set was collected during a volleyball training session as described in Section 3. The data set is highly imbalanced: around 94% of the data belong to the non-action class. Hence, different machine-learning approaches need to be explored in this setting.

Feature Extraction
We extracted time-domain features by applying six basic statistical functionals (mean, standard deviation, median, mode, skewness and kurtosis) over frames (i.e., time windows) of 0.5 seconds with 50% overlap between consecutive frames (step (1) of Figure 1). The gyroscope, magnetometer and accelerometer data are three-dimensional, which yields 6 × 3 features per frame for each of these sensors; the barometer data are one-dimensional, which yields 6 × 1 features per frame.
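The framing and the six functionals can be sketched as follows. Several details are assumptions made only for this illustration: the sampling rate (40 Hz in the example) is hypothetical, skewness and kurtosis use the plain moment ratios, and the mode is taken over values rounded to two decimals, since the mode of raw floating-point samples is rarely informative. The function names are hypothetical as well.

```python
import statistics
from typing import List, Sequence, Tuple


def six_functionals(x: Sequence[float], mode_decimals: int = 2) -> List[float]:
    """Mean, standard deviation, median, mode, skewness and kurtosis of one axis of one frame."""
    n = len(x)
    mean = sum(x) / n
    centred = [v - mean for v in x]
    m2 = sum(c ** 2 for c in centred) / n  # second central moment
    m3 = sum(c ** 3 for c in centred) / n  # third central moment
    m4 = sum(c ** 4 for c in centred) / n  # fourth central moment
    std = statistics.stdev(x)              # sample standard deviation
    median = statistics.median(x)
    mode = statistics.mode([round(v, mode_decimals) for v in x])  # assumption: mode of rounded values
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurtosis = m4 / m2 ** 2 if m2 > 0 else 0.0
    return [mean, std, median, mode, skewness, kurtosis]


def frame_features(samples: Sequence[Tuple[float, ...]], rate_hz: float,
                   frame_s: float = 0.5, overlap: float = 0.5) -> List[List[float]]:
    """Slide a window over multi-axis samples and return 6 features per axis for every frame."""
    frame_len = int(round(frame_s * rate_hz))
    hop = max(1, int(round(frame_len * (1 - overlap))))
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        row: List[float] = []
        for axis in zip(*frame):  # iterate over x, y, z (or the single barometer channel)
            row.extend(six_functionals(axis))
        features.append(row)
    return features


if __name__ == "__main__":
    # One second of made-up three-axis data at a hypothetical 40 Hz sampling rate.
    accel = [(float(i), 0.0, 1.0) for i in range(40)]
    feats = frame_features(accel, rate_hz=40)
    print(len(feats), len(feats[0]))  # number of frames, features per frame
```

A three-axis sensor yields 18 values per frame and the barometer yields 6, in line with the 6 × 3 and 6 × 1 feature counts given above.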

Classification Methods
The classification experiments were performed using six different methods:
• Decision trees (DT), with a leaf size of 10, where the leaf size is optimized through a grid search within a range of 1 to 20.
• Nearest neighbor (KNN), with K = 5, where K is optimized through a grid search within a range of 1 to 10.
• Linear discriminant analysis (LDA).
• Tree Bagger (TB), with 50 trees and a leaf size of 10, where the leaf size is optimized through a grid search within a range of 1 to 20.
• Naive Bayes (NB), with a kernel distribution assumption, selected through a grid search over kernel smoothing density estimate, multinomial distribution, multivariate multinomial distribution and normal distribution.
• Support vector machines (SVM), with a linear kernel (selected by trying different kernel functions, i.e., linear, Gaussian, RBF and polynomial), a box constraint of 0.5 (selected through a grid search between 0.1 and 1.0) and the sequential minimal optimization solver (selected by trying different solvers, i.e., iterative single data algorithm, L1 soft-margin minimization by quadratic programming and sequential minimal optimization).
The classification methods are implemented in MATLAB (http://uk.mathworks.com/products/matlab/, December 2018) using the Statistics and Machine Learning Toolbox. The maximum ranges of the classifier hyper-parameters (such as K = 10) were set by trial and error. A leave-one-subject-out (LOSO) cross-validation setting was adopted, in which the training data do not contain any information about the validation subject. To assess the classification results, we used the Unweighted Average Recall (UAR) instead of overall accuracy, as the data set is highly imbalanced. The UAR is the arithmetic mean of the recall of both classes.
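The evaluation protocol can be illustrated with a short, library-free sketch. It is not the MATLAB code used in this study; the function names are hypothetical and the labels in the example are made up to mirror a roughly 94% non-action share.

```python
from typing import Hashable, Iterator, List, Sequence, Tuple


def unweighted_average_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Arithmetic mean of the per-class recall, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Overall fraction of correctly labelled frames."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def leave_one_subject_out(subject_ids: Sequence[Hashable]) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held-out subject, training indices, validation indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        valid = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, valid


if __name__ == "__main__":
    # A classifier that always predicts non-action (0) on 94 non-action and 6 action frames.
    y_true = [0] * 94 + [1] * 6
    y_pred = [0] * 100
    print(accuracy(y_true, y_pred))                   # high, despite missing every action
    print(unweighted_average_recall(y_true, y_pred))  # chance level for two classes

    for subject, train_idx, valid_idx in leave_one_subject_out(["p1", "p1", "p2", "p3", "p3"]):
        print(subject, train_idx, valid_idx)
```

The example shows why accuracy is misleading here: a model that never detects an action still scores 0.94 accuracy, while its UAR is 0.5.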

Experiments
The action frames for the eight players totalled 5812, while there were 97,648 non-action frames; the data set is therefore clearly imbalanced. To evaluate the performance of the IMU sensors, we trained machine-learning models using balanced as well as imbalanced data sets for the recognition of action and non-action frames. We used different classifiers and evaluated their effectiveness in handling balanced and imbalanced IMU sensor data sets for volleyball action recognition, as some classifiers, such as NB and KNN, are less affected by class imbalance. We conducted three main experiments:
• Experiment 1 ($M_{D_1}$): training is performed on the imbalanced data set (i.e., $D_1$) in terms of actions and non-actions, and validation is performed on the imbalanced data set (i.e., $D_1$) in a leave-one-subject-out setting. The prior probabilities of the classifiers are set according to the class distribution.
• Experiment 2 ($M_{D_2}$): training is performed on the balanced data set (i.e., $D_2$) in terms of actions and non-actions, where the same number of non-action events (selected randomly) and action events is used for each player. The validation is performed on the imbalanced data set (i.e., $D_1$) in a leave-one-subject-out setting. The prior probabilities of the classifiers are set to be equal for both classes, as in this setting the class distribution is uniform.
• Experiment 3 ($M_{D_1} + M_{D_2}$): training is performed using the super-bagging method and validation is performed on the imbalanced data set in a leave-one-subject-out setting.
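As a consistency check, the frame counts quoted at the start of this section can be related to the annotated durations and the framing parameters. The worked arithmetic below assumes that frames are counted at a fixed hop derived from the 0.5 s frame length and 50% overlap, and it ignores edge effects at the start and end of recordings.

```latex
\begin{align*}
\text{hop} &= 0.5\,\text{s} \times (1 - 0.5) = 0.25\,\text{s}
  \;\Rightarrow\; 4 \text{ frames per second},\\
\text{action frames} &= 1453\,\text{s} \times 4\,\text{s}^{-1} = 5812,\\
\text{non-action frames} &= 24{,}412\,\text{s} \times 4\,\text{s}^{-1} = 97{,}648,\\
\text{non-action share} &= \frac{97{,}648}{5812 + 97{,}648} = \frac{97{,}648}{103{,}460} \approx 94.4\%.
\end{align*}
```

Under this assumption, the counts agree with the annotated durations of 1453 s and 24,412 s and with the statement that around 94% of the data belong to the non-action class.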

Results and Discussions
This section describes the results of machine-learning models for action and non-action events and demonstrates the discriminative power of different IMU sensors placed on the dominant and non-dominant hand.

Experiment 1 ($M_{D_1}$): Imbalanced Learning Method
The UAR of the dominant hand and non-dominant hand for all sensors is shown in Tables 2 and 3, respectively. These results indicate that the non-dominant hand (83.99%) provides a better UAR than the dominant hand (79.83%), with NB being the best classifier for action detection. The results indicate that the accelerometer provides the best averaged UAR of 69.83% for the dominant hand and 73.24% for the non-dominant hand. The averaged UAR also indicates that the accelerometer (74.14%) and magnetometer (73.52%) provide a better UAR on the non-dominant hand than on the dominant hand. The average UAR of the fusion results indicates that the non-dominant hand provides better results (74.42%) than the dominant hand (70.81%).

Experiment 2 ($M_{D_2}$): Balanced Learning Method
The UAR of the dominant hand and non-dominant hand for all sensors is shown in Tables 4 and 5, respectively. These results indicate that the dominant hand (84.18%) provides a better UAR than the non-dominant hand (82.16%), with TB being the best classifier for action detection. The results indicate that the accelerometer provides the best averaged UAR of 82.29% for the dominant hand and 78.26% for the non-dominant hand. The averaged UAR also indicates that all sensors provide a better UAR on the dominant hand than on the non-dominant hand. The average UAR of the fusion results indicates that the dominant hand provides better results (81.00%) than the non-dominant hand (78.26%).

Experiment 3 ($M_{D_1} + M_{D_2}$): Super-Bagging Method
The UAR of the dominant hand and non-dominant hand for all sensors is shown in Tables 6 and 7, respectively. These results indicate that the dominant hand (84.19%) provides a better UAR than the non-dominant hand (82.93%), with TB being the best classifier for action detection. The results indicate that the accelerometer provides the best averaged UAR of 82.43% for the dominant hand and 80.91% for the non-dominant hand. The averaged UAR also indicates that all sensors provide a better UAR on the dominant hand than on the non-dominant hand. The average UAR of the fusion results indicates that the dominant hand provides better results (81.91%) than the non-dominant hand (80.08%).

Discussion
The results reported above indicate that Exp 3 (i.e., super-bagging) improved the UAR and provided the best average UAR of 81.19% and 80.08% for the dominant and non-dominant hand, respectively. We also note that the best UAR was obtained using the TB classifier. TB with super-bagging improves the UAR of fusion for the non-dominant hand from 79.59% to 80.11%, but results in a slight decrease in UAR for the dominant hand from 83.10% to 82.87%.
Imbalanced learning provides a better UAR (83.99%) than balanced learning for the non-dominant hand using a Naive Bayes classifier, as Naive Bayes does not assume a balanced class distribution. Balanced learning provides a better UAR (84.18%) than imbalanced learning for the dominant hand using the tree bagger classifier. A possible reason is that the dominant hand requires less information for action modelling than the non-dominant hand (i.e., the movements of the non-dominant hand do not vary a lot while performing a volleyball action). The super-bagging method provides the best UAR of 84.19%. To give further insight into the results, the confusion matrix of the best results is shown in Figure 5. However, imbalanced learning (83.99% with NB) is more accurate in capturing the non-dominant hand information than balanced learning (82.16% with tree bagger) and the super-bagging method (82.93% with NB). For further insight, the average UAR is reported in Table 8. Table 8 shows that the super-bagging method provides a better averaged UAR than the balanced and imbalanced methods in 8 out of 10 cases. The previous study [43] provided interesting results regarding the role of the non-dominant hand in volleyball action and non-action classification. However, that study used an imbalanced learning method, which suggests that the non-dominant hand provides more accurate results than the dominant hand. The current study uses both balanced and imbalanced learning as well as our newly proposed super-bagging method, and suggests that the dominant hand provides more accurate results than the non-dominant hand for balanced learning and the super-bagging method.
Balanced learning also provides a higher average UAR (81.00%) than imbalanced learning (70.81%) for the dominant hand, a difference that is even more marked with the super-bagging method, which reaches a UAR of 81.19%. This indicates that super-bagging can capture more information than the balanced and imbalanced learning methods. However, further research is needed to confirm these results on different data sets and on other applied machine-learning problems, such as emotion recognition and recognition of the type of volleyball action.
The previous work detailed in Section 2 focuses on a small number of sensors, whereas this study evaluates four sensors. It is also noted that, while there are quite a few studies on volleyball action modelling, most of them consider the dominant hand only, and the role of the non-dominant hand is less explored in sports-related activities. This study demonstrates the role of both dominant and non-dominant hand movements. The proposed novel method (i.e., the super-bagging method) is a fusion of the imbalanced and balanced learning methods, which uses the full data set (no missing information) for training and avoids the 'curse of imbalanced data set' using only two classifiers in an ensemble. A potential application of the proposed models is fatigue and stamina estimation [8], where players and trainers are only interested in the number of actions performed, regardless of their type.

Conclusions
This article demonstrated the relevance of balanced (undersampling), imbalanced (full sampling) and super-bagging methods for volleyball action modelling. Machine-learning models operating on IMU sensor data provided a UAR of up to 84.19%, which is well above the chance level of 50%. The undersampling method provided more accurate results than the full sampling method, a difference that is more marked with our super-bagging method. The undersampling method provided better results than the full sampling method for the dominant hand, whereas the full sampling method provided better results than the undersampling method for the non-dominant hand. The super-bagging method provides a better averaged UAR than the balanced and imbalanced methods in 8 out of 10 cases across sensors, which demonstrates its potential for IMU sensor data. The difference is small, but this is the first test of the super-bagging method, and it encourages further exploration of the method on different machine-learning problems, including adjusting the weights of both classifiers in the super-bagging ensemble and exploring score fusion methods. In the future, we aim to extend this research by incorporating frequency-domain features such as the spectrogram, and to apply the super-bagging method to multi-class problems in order to evaluate its generalizability.