Gravity Control-Based Data Augmentation Technique for Improving VR User Activity Recognition

The neural-network-based human activity recognition (HAR) technique is being increasingly used for activity recognition in virtual reality (VR) users. The major issue with such techniques is the collection of large-scale training datasets, which are key to deriving a robust recognition model. However, collecting large-scale data is a costly and time-consuming process. Furthermore, increasing the number of activities to be classified requires an even larger training dataset. Since a sparse dataset can provide only limited features to the recognition model, it can cause problems such as overfitting and suboptimal results. In this paper, we present a data augmentation technique named gravity control-based augmentation (GCDA) to alleviate the sparse data problem by generating new training data from the existing data, increasing the number of samples while preserving the properties of the original data. The core concept of GCDA is two-fold: (1) decomposing the acceleration data obtained from the inertial measurement unit (IMU) into zero-gravity acceleration and gravitational acceleration, and augmenting them separately, and (2) exploiting gravity as a directional feature and controlling it to augment training datasets. Through comparative evaluations, we validated that applying GCDA to the training datasets yielded a larger improvement in classification accuracy (96.39%) than typical data augmentation methods (92.29%) and than no augmentation at all (85.21%).


Introduction
Sensor-based human activity recognition (HAR) aims to recognize the actions and goals of the users from a series of observations by the sensors, which are attached to the users. Today, HAR is being increasingly used as a non-verbal interaction methodology for virtual reality (VR) users because it can provide a more realistic and immersive experience compared to the controller-button-based interaction methodologies. For example, users will be more immersed in the game when they can throw a weapon with an actual throwing action instead of simply clicking a controller button.
HAR in VR applications is often achieved by using machine learning methods because of their robustness [1,2]. However, collecting large-scale, labeled datasets, which are key to deriving robust recognition models, has been limited because the collection process is often costly. Such a sparse data problem is aggravated as the number of activities to be classified increases. For example, in commercial action fighting VR games such as Blade & Sorcery, shielding front can block a stabbing from the front, and shielding downward can block a stabbing from below, but not vice versa; classifying these as different labels requires a sufficient number of training data for both shielding front and shielding downward.
On the other hand, in VR games, the most widely used type of VR application, there are activities that have a similar motion structure but different motion directions, and these activities often result in different consequences. For example, in commercial action fighting VR games such as Blade & Sorcery, shielding front can block a stabbing from the front, but not a stabbing from below. Therefore, it is necessary to classify them into different labels to provide users with appropriate responses and results. However, this increases the number of activities to be distinguished by recognition models and aggravates the sparse data problem.
Sparse data problems can be alleviated by using data augmentation methods, which artificially create new training data based on the existing training data. Because the overall structure of the data is preserved, the generated data have properties similar to those of the existing training data. For time-series data, a number of data augmentation methods exist [3,4]. However, those methods do not take into account the characteristics of the acceleration data obtained from the sensors. For example, the scaling method, one of the most widely used augmentation methods, scales the training data up or down. When this is applied to the raw controller accelerations, not only the acceleration derived from the motion but also the acceleration of gravity is scaled. This contradicts the fact that the magnitude of gravitational acceleration is constant, which means that physically incorrect data have been generated.
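To make this distortion concrete, the following minimal numpy sketch applies naive scaling to a synthetic acceleration signal that contains a constant gravity component; the signal and scale factor are illustrative, not our actual recordings:

```python
import numpy as np

G = 9.81  # gravitational acceleration magnitude (m/s^2)

# Synthetic controller acceleration: a motion burst plus constant gravity
# along -Z (assumes the controller stays level, for illustration only).
t = np.linspace(0.0, 0.5, 125)
motion = np.stack([np.sin(8 * np.pi * t), np.zeros_like(t), np.zeros_like(t)], axis=1)
gravity = np.tile([0.0, 0.0, -G], (len(t), 1))
a_c = motion + gravity

# Naive scaling augmentation applied to the raw signal.
scaled = 1.2 * a_c

# The gravity component is no longer physically plausible: its magnitude
# has become 1.2 * 9.81 m/s^2 instead of the constant 9.81 m/s^2.
print(np.linalg.norm(scaled[0]))  # ~11.77, no longer 9.81
```

Decomposing out the gravity component before scaling, as GCDA does, avoids exactly this artifact.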
In this paper, we propose a data augmentation technique named gravity control-based data augmentation (GCDA). The goal of GCDA is to increase the number of training data without performing expensive actual motion collection. The underlying idea of GCDA is to decompose the input acceleration data into the acceleration of the actual activity (zero-gravity acceleration) and the acceleration of gravity (gravitational acceleration), and to augment them separately. Furthermore, focusing on activities that have similar motion structures but different motion directions, we used them to augment each other by changing the direction information. We then conducted comparative evaluations to validate whether the proposed technique provides better results than previous augmentation methods.

Related Work
Human activity recognition (HAR) has long been studied in both the academic and commercial fields as a method of future interaction. There are roughly two mainstreams in the field of HAR: camera-based and sensor-based methods. Camera-based methods generally use video images [5][6][7][8] or depth camera images [9][10][11]. Although camera-based approaches have the advantage of being able to utilize a variety of information captured by the camera, they often suffer from problems of self-occlusion and a limited camera workspace. On the other hand, sensor-based methods generally rely on 3D kinematic information, such as angular and linear parameters, measured from wearable or handheld devices [12][13][14][15][16][17]. The major advantage of these methods is that 3D kinematic information can be obtained continuously without suffering from the self-occlusion problem. However, such sensors are often inaccessible to general users and suffer from calibration errors.
Our study builds on the field of sensor-based HAR. In this field, activity recognition has generally been achieved by performing signal processing of the acceleration data [18,19]. Chen et al. [20] used surface electromyography (sEMG) and 2D accelerometers to achieve hand motion recognition. For hand motion recognition, several studies [21,22] have tried to measure hand motion data by using a CyberGlove and calibrating the data. However, a CyberGlove is expensive for general users and shows low accuracy when measuring large curvatures of the fingers. Jeong et al. [23] presented an inexpensive finger-gesture recognition glove that recognizes sign language. However, it revealed some problems in terms of reaction speed and stability. Wang and Neff [24] presented a glove calibration method that can recognize hand motion in real time. Wolf et al. [25] developed a wearable device called BioSleeve and inferred arm and hand gestures based on the measured data from the Electromyogram (EMG) and Inertial Measurement Unit (IMU) sensors. Georgi et al. [26] suggested an activity recognition framework based on arm, hand, and finger movements. To this end, they fused the signals of an IMU and EMG attached to the user's wrist and forearm, respectively. Huang et al. [27] developed a hand-gesture recognition wristband by combining EMG and IMU sensors. Calella et al. [28] developed a 3D gesture recognition device called HandMagic based on an IMU sensor. Alavi et al. [29] used 10 sensors attached to the body to capture human motion and classified six different gestures. Zadeh et al. [30] developed a biosensor-based hand-finger gesture recognition method. Mummadi et al. [31] introduced a novel sensing glove design that detects its wearer's hand postures and motion at the fingertips. Diliberti et al. [16] used customized motion-capturing gloves to obtain gestural datasets and implemented a neural network for real-time gesture recognition.
All of these studies showed acceptable recognition performance, but they have the disadvantages that the sensor devices are hardly accessible to the general public and that the recognition performance is largely dependent on the performance of the sensor. On the other hand, many commercial VR applications generally allow users to interact with virtual worlds through the VR controller because the controller is widely distributed as a basic VR kit. Therefore, we aim to recognize VR users' activity based on the 3D kinematic information, measured by the HTC VIVE controller.
There are many ways to recognize what activity is taking place based on human activity data such as images or acceleration. Spatio-temporal-matching-based methods [32][33][34][35] recognize human activity based on where the motion is (spatial information) and how the motion moves (temporal information). These methods are easy to implement, but their recognition performance is not good enough. Hidden Markov model (HMM)-based methods [13,[36][37][38][39] exploit the concept of finite-state automata, in which each state transition arc has an associated probability value. From the initial state, the state moves toward one of the output symbols based on the obtained human activity data and the state transition probabilities. Each output symbol represents a specific activity. Although HMM-based methods have shown satisfactory performance in terms of activity recognition, they usually have a high computational cost for calculating a large number of state probability densities and parameters. Recently, neural-network-based methods [16,[40][41][42][43][44] have become popular because they have strong classification performance, a fast response time, and enable non-linear classification. Since human activity data are often a form of sequential data, Recurrent Neural Network (RNN)-based models have been studied for HAR tasks [45][46][47][48]. However, RNN-based models often suffer from slow learning speed and high resource consumption. On the other hand, Convolutional Neural Network (CNN)-based models have shown fast learning, low resource consumption, and high classification accuracy. Due to this effectiveness, existing deep learning methods are mainly based on the CNN architecture [49][50][51][52]. Therefore, we use a CNN architecture to construct a neural network model for HAR. Each dimension of the 3D kinematic data is treated as one channel of an image, and the convolution and pooling are performed separately.
The major issue of neural-network-based HAR is collecting large-scale training datasets, which are key to deriving a robust recognition model. Since collecting actual data is a costly and time-consuming process, data augmentation techniques are often exploited to increase the amount of training data [53,54]. However, augmenting time-series data has not received much attention. Cui et al. [3] and Le Guennec et al. [55] used window slicing and time-warping techniques to augment time-series data. Forestier et al. [56] proposed Dynamic Time Warping (DTW), which averages a set of time-series data and uses the average time-series as a new synthetic example. Um et al. [4] applied many well-known data augmentation methods used for image data to wearable sensor data and investigated which methods improve classification performance. Camps et al. [57] augmented time-series data with random data shifting and rotating. Fawaz et al. [58] showed how the overfitting problem with small time-series datasets can be mitigated using a recent data augmentation technique based on DTW and a weighted version of the DTW Barycentric Averaging algorithm. Rashid et al. [59] presented a data augmentation framework for generating synthetic time-series training data for an RNN-based deep learning network to accurately and reliably recognize equipment activities. Wang et al. [60][61][62] used jittering and cropping augmentation to increase the performance of automatic protective behavior detection. Gao et al. [63] presented several data augmentation techniques specifically designed for time-series data in both the time domain and the frequency domain. Studies on various data augmentation methods for time-series data can be found in the survey paper published by Wen et al. [64].
In most cases, the augmentation of time-series data simply exploits combinations of existing augmentation techniques used in image data augmentation, such as jittering, scaling, and warping.
In this paper, we present a data augmentation method specifically designed for time-series data obtained from a wearable IMU sensor. In IMU sensor-based HAR, gravity is often considered unnecessary noise. Therefore, many studies filtered out the gravitational acceleration from the measured sensor data to obtain zero-gravity sensor data [65,66]. However, gravitational acceleration can be an important indicator. Kim et al. [67] found that gravity can be used as a directional feature that increases the recognition accuracy of some activities. Inspired by this work, we used gravitational acceleration as a directional feature and controlled it to augment activity datasets.

Data Collection
In VR applications, the HAR technique can be used to provide non-verbal interactions to VR users. To enable this, application designers first define a number of activities to be used for interactions. Then, they build an activity recognition model and acquire a training dataset. The training dataset is often collected from the designers themselves or from a few people, because collecting data from a variety of real users is costly. After this small amount of training data is collected, the data are augmented to avoid the problems caused by sparse training data and are used to train the recognition model.
In this paper, we simulate this scenario to validate our augmentation technique. Since the author defines the activities and has high expertise in performing them, he can be regarded as the designer in the scenario. Therefore, the training dataset (TR), which is used to train the neural network, was collected from the author. For the role of the general VR users in the scenario, we hired several participants and collected their activity data. When they perform a certain activity, the trained model must recognize the activity and provide an appropriate response to it. Therefore, a test dataset (TS), which is used to test the performance of our neural network, was collected from the participants.

Participants
We acquired data from five participants including an author (two females and three males, age: 22.8 ± 1.7 years, and height: 171.2 ± 10.3 cm), all of whom had experienced VR games. Participants were undergraduate or graduate students with normal or corrected-to-normal vision, to minimize the effect of the physical condition and age of participants on the validation.

Activities
We focused on activities performed by moving the VR controller because the controller is the most widely used interaction device. We also focused on activities that are frequently observed in games, since games are not only the most prevalent type of VR application but also often require real-time activity recognition to provide real-time responses to user interactions. Therefore, we designed the activities based on the motions in the commercial VR game Blade & Sorcery. First, we defined three types of reference activity: Slash, Thrust, and Guard. In games, these activities can be carried out in a variety of directions, and the consequences of an activity may vary depending on the direction. For example, a frontal guard activity can block a stabbing from the front, but not a stabbing from below. This means that it is necessary to treat the reference activities as different activities when they are performed in different directions. Therefore, we defined four sub-activities for each reference activity and aimed to classify 12 sub-activities through the recognition model. Sub-activities derived from the same reference activity have similar motion structures but different motion directions. The reference and sub-activities are shown in Figure 1. We defined four types of sub-activity for each of the three reference activities, and thus 12 types of sub-activity in total. In each picture, the green circle represents the start position of the activity and the red arrow connected to the circle represents the motion direction. At the bottom right of each picture, the X-, Y-, and Z-axes of the controller device space are illustrated. As shown in the figure, the orientations of the controller required to perform sub-activities derived from the same reference activity differ from each other.

Procedure
On arrival, each participant was given a brief explanation of the data collection procedure. Subsequently, each participant was given a 10-min training session to familiarize themselves with each sub-activity. After the training session, each participant was asked to perform 60 repetitions of each of the 12 sub-activities, with the order of sub-activities provided randomly. Participants were asked to perform only with the right hand, since all activities are designed to be performed with the right hand. Before starting each repetition, participants were asked to place the controller in the start position and wait. When the experimenter signaled, participants began to perform the activity. Participants were asked to keep pressing the trigger button during the activity and to release the button when the activity was over. To avoid participant fatigue, a 5-min break was provided for every 60 repetitions completed. In total, each participant took about 115 min to complete the data collection procedure. As a result, we collected 300 activity data for each sub-activity, and thus 3600 activity data in total. The collected data were divided into two groups: TR and TS. We constructed TR from the data collected from the author; therefore, TR has 720 activity data. We constructed TS from the data collected from the participants; therefore, TS has 2880 activity data.

Data Sensing
While the participant repeatedly conducted each sub-activity, we saved activity data from 0.3 s before the moment the participant pressed the trigger button to the moment the participant released the trigger button. This 0.3 s margin, which was exploited for real-time recognition, will be discussed in Section 5.2.
We stored three kinds of information: the three-dimensional acceleration of the VR controller in the controller device space (a_c), the orientation of the VR controller in the world space (q_c), and the orientation of the head-mounted display (HMD) in the world space (q_h). These data can be measured by the Vive controllers, which have 24 infrared and IMU sensors, and by the SteamVR tracking system, which tracks the controller and the HMD with millimeter-level accuracy at an update rate ranging from 250 Hz to 1 kHz. The measured data are stored in circular buffers using the OpenVR API, which automatically calibrates the IMU and orientation data at rates of 250 Hz and 90 Hz, respectively. The OpenVR API is developed by Valve Corporation, headquartered in Bellevue, Washington, USA. To match the data frequency between acceleration and orientation, the orientation data were linearly interpolated and resampled at 250 Hz. Examples of the collected acceleration a_c for each sub-activity are shown in Figure 2; in all graphs, the x-axis represents time, the y-axis represents acceleration (m/s²), and the red, green, and blue curves represent the X, Y, and Z signals from the accelerometer, respectively.

Outlier Removal
After the data acquisition, we performed an outlier removal process for both TR and TS. In TR, there were 60 activity data for each sub-activity. For each sub-activity dataset, we removed the 10 activity data that showed either extreme acceleration or extreme orientation. This was done manually by the authors. As a result, the number of data in TR was reduced from 720 to 600.
In TS, we also removed outliers: for each participant, we removed 10 of the 60 data for each sub-activity. Unlike the manual outlier removal process used for TR, we employed an automatic outlier removal process based on the statistics of TR. For example, to determine the outliers among the Forward slash activity data in TS, we tested whether the controller accelerations (a_c) of each datum satisfied the following criteria:
• The average a_c of the activity data is within the range [µ_TR_ac − σ_TR_ac, µ_TR_ac + σ_TR_ac].
• The minimum a_c of the activity data is within the range [µ_min(TR_ac) − σ_min(TR_ac), µ_min(TR_ac) + σ_min(TR_ac)].
• The maximum a_c of the activity data is within the range [µ_max(TR_ac) − σ_max(TR_ac), µ_max(TR_ac) + σ_max(TR_ac)].
Here, µ_TR_ac and σ_TR_ac denote the average and standard deviation of a_c over all Forward slash activity data in TR, and µ_min(TR_ac), σ_min(TR_ac), µ_max(TR_ac), and σ_max(TR_ac) are defined analogously over the per-datum minima and maxima. Unless all criteria are satisfied, the activity data are considered an outlier and removed from TS. After that, we applied the same criteria to the controller orientation (q_c) instead of the acceleration (a_c); this removes data that include extreme orientations. As the last step, we manually removed outliers that were not detected by the automatic removal process until the number of Forward slash activity data became 50 for each participant. As this outlier removal process was conducted for each sub-activity dataset, the total number of activity data in TS was reduced from 2880 to 2400.
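The automatic criteria can be sketched as follows; the function below is an illustrative reconstruction with assumed per-axis statistics, not the exact code used in our pipeline:

```python
import numpy as np

def is_outlier(sample, tr_samples):
    """Return True if `sample` fails any of the three acceptance criteria.

    `sample` is an (N, 3) acceleration sequence; `tr_samples` is a list of
    (N_i, 3) sequences of the same sub-activity from TR. Statistics are
    computed per axis. A sketch of the criteria, not the exact code.
    """
    tr_means = np.array([s.mean(axis=0) for s in tr_samples])
    tr_mins = np.array([s.min(axis=0) for s in tr_samples])
    tr_maxs = np.array([s.max(axis=0) for s in tr_samples])

    for stat, pool in ((sample.mean(axis=0), tr_means),
                       (sample.min(axis=0), tr_mins),
                       (sample.max(axis=0), tr_maxs)):
        mu, sigma = pool.mean(axis=0), pool.std(axis=0)
        # Outside [mu - sigma, mu + sigma] on any axis -> outlier.
        if np.any(stat < mu - sigma) or np.any(stat > mu + sigma):
            return True
    return False
```

The same function can be reused with orientation sequences in place of accelerations for the orientation-based pass.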

Gravity Control-Based Data Augmentation
The goal of GCDA is to increase the amount of training data (TR) without performing actual motion collection. Recall that the underlying idea of GCDA is to decompose the a_c of activity data into a zero-gravity acceleration (â_c) and a gravitational acceleration in the controller device space (a_g), and to augment each of them separately. a_g can be obtained from the gravitational acceleration in the world space (g) and the controller orientation (q_c). Then, â_c can be computed by subtracting a_g from a_c. Note that â_c is the key component that defines an activity, regardless of the direction in which the user performs it. For example, the â_c signals obtained from the Forward slash and Downward slash activities are similar to each other because both are derived from the same reference activity, Slash. Figure 3 shows the similar â_c signals of the four sub-activities derived from Slash. Figure 4 shows the five steps of GCDA: gravity decomposition, data augmentation, gravity rotation, gravity switching, and gravity addition. In the gravity decomposition step, a_c is decomposed into â_c and a_g.
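The gravity decomposition step can be sketched as follows; the quaternion convention (w, x, y, z) and the world-space gravity vector g = (0, 0, −9.81) are assumptions for illustration:

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    u = np.array([x, y, z])
    return v + 2.0 * np.cross(u, np.cross(u, v) + w * v)

def decompose_gravity(a_c, q_c, g_world=np.array([0.0, 0.0, -9.81])):
    """Split raw controller acceleration into (zero-gravity, gravity) parts.

    a_c: (N, 3) raw acceleration in controller device space.
    q_c: (N, 4) controller orientations (device -> world rotation).
    The conjugate of q_c maps the world-space gravity vector into the
    device frame; a sketch of the decomposition, not the exact code.
    """
    a_hat, a_g = np.empty_like(a_c), np.empty_like(a_c)
    for i, (a, q) in enumerate(zip(a_c, q_c)):
        q_conj = np.array([q[0], -q[1], -q[2], -q[3]])  # world -> device
        a_g[i] = quat_rotate(q_conj, g_world)
        a_hat[i] = a - a_g[i]  # zero-gravity acceleration
    return a_hat, a_g
```

With the identity orientation, the sketch simply splits off the constant gravity vector; any controller tilt redistributes a_g across the device axes.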
In the data augmentation step, typical data augmentation methods are applied to â_c. Since there are various types of augmentation methods, we performed comparative tests to investigate which method improves classification performance when applied to TR. For comparison, we first chose seven typical data augmentation methods for time-series data [4]: jittering, scaling, rotation, permutation, magnitude-warping, time-warping, and cropping. Each augmentation method was applied to TR, and the recognition model was trained with each of the augmented TRs. Then, we measured their classification accuracy. As a baseline, we additionally trained the recognition model with the non-augmented TR and measured its classification accuracy. For a fair comparison, we optimized the augmentation parameters of each method for our datasets through a number of internal tests. As shown in Table 1, five of the augmentation methods, all except permutation and cropping, increase classification accuracy by at least 2% when applied to TR. The use of combinations of various data augmentation methods sometimes shows better augmentation performance than the use of a single method. Therefore, we additionally compared two more methods in which TR is augmented by different combinations of typical augmentation methods: (1) a combination of all seven typical augmentation methods (typical-all), and (2) a combination of only the typical augmentation methods that actually improve classification accuracy (typical-improve). In typical-improve, we combined the jittering, scaling, rotation, magnitude-warping, and time-warping methods because they increase classification accuracy when applied to TR. Since typical-improve improved the classification accuracy the most (7.08%), we decided to use it in the data augmentation step.
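A sketch of three of the methods combined in typical-improve (jittering, scaling, and time-warping), applied to the zero-gravity component â_c; the noise parameters below are illustrative placeholders, not the optimized values from our internal tests:

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, sigma=0.05):
    """Jittering: add Gaussian noise to every sample."""
    return x + rng.normal(0.0, sigma, x.shape)

def scale(x, sigma=0.1):
    """Scaling: multiply each axis by a random factor near 1.

    Safe here because it is applied to the zero-gravity part only,
    so the constant gravity magnitude is never distorted.
    """
    return x * rng.normal(1.0, sigma, (1, x.shape[1]))

def time_warp(x, knots=4, sigma=0.2):
    """Time-warping: resample along a randomly warped time axis."""
    n = len(x)
    warp = np.interp(np.arange(n),
                     np.linspace(0, n - 1, knots),
                     np.linspace(0, n - 1, knots) * rng.normal(1.0, sigma, knots))
    warp = np.clip(warp, 0, n - 1)
    return np.stack([np.interp(warp, np.arange(n), x[:, d])
                     for d in range(x.shape[1])], axis=1)
```

Each transform preserves the (N, 3) shape, so augmented sequences feed the same windowing pipeline as the originals.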
In the gravity rotation step, we used the fact that the measured gravitational acceleration can be slightly different for each trial, even for the same activity. Therefore, a_g is slightly rotated about the forward component of q_h to generate new data. In our internal tests, rotation ranges within [−10°, 10°] generally improved the inference performance of the neural network.
On the other hand, a large rotation can be used to augment another sub-activity that has a similar motion structure but a different motion direction. To avoid confusion with gravity rotation, we named this step gravity switching. In the gravity switching step, a_g is rotated by a large angle about the forward component of q_h. The amount of rotation is determined by the designer, based on the activity design. For example, rotating the a_g of Forward slash by 180° about q_h can be used as TR for Reverse slash. Since we defined four sub-activities for each reference activity, gravity switching can increase TR by a factor of four. Finally, in the gravity addition step, the augmented a_g is added back to the augmented â_c to reconstruct a complete acceleration signal a_c.
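The gravity rotation, gravity switching, and gravity addition steps can be sketched with a Rodrigues rotation of a_g about the HMD forward axis; the function names are ours, and the choice of angle (small for gravity rotation, large for gravity switching) is left to the designer:

```python
import numpy as np

def rotate_about_axis(v, axis, angle_deg):
    """Rodrigues rotation of 3-vector v about a unit axis."""
    axis = np.asarray(axis, float) / np.linalg.norm(axis)
    th = np.radians(angle_deg)
    return (v * np.cos(th)
            + np.cross(axis, v) * np.sin(th)
            + axis * np.dot(axis, v) * (1.0 - np.cos(th)))

def gravity_switch(a_hat, a_g, hmd_forward, angle_deg):
    """Rotate the gravity component about the HMD forward axis and
    recombine (gravity addition). Small angles (within +/-10 deg)
    correspond to gravity rotation; large angles (e.g. 180 deg)
    switch the sample toward a sub-activity with a different
    motion direction."""
    a_g_rot = np.array([rotate_about_axis(g, hmd_forward, angle_deg)
                        for g in a_g])
    return a_hat + a_g_rot  # augmented a_c
```

Because the rotation changes only a_g, the magnitude of the gravity component stays at 9.81 m/s² while its direction encodes the new sub-activity.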

Activity Recognition
In this section, we present our activity recognition model. Our recognition model is built upon a neural network model since it has not only strong classification performance but also a fast computation time. As a network architecture, we chose to use a convolutional neural network (CNN), which is competent in capturing the local connections and the translational invariance of sensory data [45].

Neural Network Model
We employed one-dimensional (1D) convolutions [17,68]. The convolution and pooling are performed along the time axis of the 3D acceleration data obtained from the accelerometer. To achieve real-time classification, our neural network is designed in a simple form, with only two 1D CNN layers and one fully connected layer with dropout [69]. Each CNN layer passes its output to a batch normalization layer [70] and a ReLU non-linear activation layer [71]. Our CNN model consists of 13,852 parameters, occupies 55.408 KB of memory, and requires a total of 145.5 kFLOPs for each inference. Figure 5 shows the architecture: batch normalization and ReLU activation layers are placed at the end of each convolution layer, and one fully connected layer with dropout is placed at the end of the network for classification. The red, green, and blue represent the X, Y, and Z signals from the accelerometer, respectively.
A CNN layer consists of sets of filters. Each CNN layer takes the output of the previous layer as input. Then, it convolves the input by moving the filters horizontally and learns features of the input. Through this, the neural network can understand the user's activity from low-level to high-level features.
Since the number of sub-activities we want to recognize is 12, the fully connected layer consists of 12 neurons. The value of each neuron indicates the probability that the given acceleration is a sub-activity corresponding to the neuron. For example, if the acceleration data of Forward slash are given as input, and the corresponding sub-activity of the first neuron is Forward slash, then the first neuron will have the largest value compared to other neurons. The value of each neuron is computed by conducting a weighted sum of all input values from the previous layer, giving us a probability distribution of what activity the user is performing.

Input Activity Data
Neural networks generally take fixed-size data as input. Activity data therefore cannot be used directly as input to the neural network, since the length of activity data varies from person to person and from activity to activity. To address this, we exploited the window slicing approach [3]. When the user's activity starts, a 3 × 125 sized window begins to convey the three-dimensional acceleration of the VR controller (a_c) to the neural network. Since the hop size of the window is set to 3 × 25, two consecutive windows share 3 × 100 data with each other. The window size was determined by taking into account both strong feature extraction and minimal response delay. If the window is too small, the network cannot extract the proper features of the activity, which significantly decreases its inference performance. Conversely, if the window is too large, the time interval between consecutive inputs increases, which increases the response time of the neural network.
As the acceleration is obtained at a rate of 250 Hz from the accelerometer, our neural network effectively takes 0.5 s of activity data as input, and two consecutive inputs share 0.4 s of activity data. Unfortunately, the use of a 0.5 s input inevitably causes a 0.5 s delay at the initial inference, since the initial window has to wait until it is filled with acceleration data. This initial delay is a critical problem for real-time VR applications. To address this issue, we used the 0.3 s of acceleration before the user started the activity to fill the initial window. In our internal tests, this successfully reduced the delay of the initial inference without decreasing the inference accuracy. We speculate that the acceleration during the 0.3 s before the user starts an activity contains classifiable features, such as the activity preparation movement.
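The windowing scheme, including the 0.3 s pre-activity margin, can be sketched as follows; sample counts assume the 250 Hz rate, and `make_windows` is our illustrative name:

```python
import numpy as np

PRE_SAMPLES = 75  # 0.3 s of pre-activity data at 250 Hz
WIN = 125         # 0.5 s window (3 x 125 input)
HOP = 25          # 0.1 s hop -> consecutive windows share 100 samples

def make_windows(pre_buffer, activity):
    """Produce fixed-size (125, 3) network inputs from a variable-length
    activity, prepending 0.3 s of pre-activity acceleration so the first
    inference needs only 0.2 s of new data instead of a full 0.5 s."""
    stream = np.concatenate([pre_buffer[-PRE_SAMPLES:], activity])
    return [stream[i:i + WIN] for i in range(0, len(stream) - WIN + 1, HOP)]
```

In a live system the pre-activity buffer would come from the circular buffer that continuously stores controller acceleration.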

Evaluation
In this section, we evaluate whether the application of GCDA to activity data increases the informative power of the training data, which helps to train the activity recognition model.

Comparison between Data Augmentation Methods
In this evaluation, we aimed to show whether GCDA leads to a larger improvement in classification accuracy than typical augmentation methods. To this end, we defined three test cases: (1) TR is augmented with GCDA (GCDA); (2) TR is augmented with a combination of jittering, scaling, rotation, magnitude-warping, and time-warping (typical-improve); (3) TR is not augmented (baseline). Additionally, we analyzed the contribution of each augmentation step of GCDA to the improvement that GCDA brings. Therefore, we defined three more test cases with different versions of GCDA: (4) GCDA without the gravity rotation and gravity switching steps (GCDA-augment); (5) GCDA without the data augmentation and gravity switching steps (GCDA-rotation); and (6) GCDA without the data augmentation and gravity rotation steps (GCDA-switching). In total, we compared six test cases.
To investigate whether GCDA yields greater improvements for smaller TRs, we defined 13 test conditions by restricting the fraction of the available TR: 2%, 6%, 10%, 14%, 18%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100%. Using repeated random subsampling cross-validation [72], each test case was evaluated under the 13 test conditions. The evaluation result is illustrated in Figure 6. In terms of average accuracy over all fractions, GCDA shows higher accuracy (96.39%) than typical-improve (92.29%) and the baseline (85.21%). From this, we can conclude that GCDA leads to a larger improvement in classification accuracy than the typical augmentation methods. The average accuracies over all fractions of GCDA-augment, GCDA-rotation, and GCDA-switching are 92.7%, 86.28%, and 95.05%, respectively. It appears that gravity switching, data augmentation, and gravity rotation contribute to the improvement of classification accuracy, in that order.
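The evaluation protocol can be sketched as follows; `train_fn` and `eval_fn` are hypothetical stand-ins for training the recognition model and measuring its accuracy on the fixed TS:

```python
import numpy as np

rng = np.random.default_rng(42)

def subsample_accuracy(train_fn, eval_fn, tr, fraction, repeats=5):
    """Repeated random subsampling cross-validation: draw `fraction`
    of TR at random without replacement, train, evaluate on the fixed
    test set, and average the accuracy over `repeats` runs."""
    n = max(1, int(len(tr) * fraction))
    accs = []
    for _ in range(repeats):
        idx = rng.choice(len(tr), size=n, replace=False)
        model = train_fn([tr[i] for i in idx])
        accs.append(eval_fn(model))
    return float(np.mean(accs))
```

Running this helper over the 13 fractions for each of the six test cases reproduces the structure of the comparison, with accuracies averaged across the repeated subsamples.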
The improvement brought by GCDA is most noticeable when the fraction of the available TR is 2%: GCDA shows 13.7% higher accuracy than typical-improve and 26.45% higher accuracy than baseline. Furthermore, the accuracies of GCDA are not significantly different from each other once the fraction of the available TR exceeds 6%, whereas typical-improve shows the same behavior only after the fraction exceeds 30%. This means that GCDA brings greater improvements for smaller training sets.

Visualization of the Gravity Augmentation Result
When an augmentation technique is applied to the training datasets, the data are altered while the class labels are preserved. If the data change too much, the correlation between the data and the labels will no longer be valid. To examine whether the gravity rotation and gravity switching steps change the data too much and invalidate this correlation, we exploited t-distributed stochastic neighbor embedding (t-SNE) [73]. First, we trained our neural network with TR. Then, TR was augmented with the gravity rotation and gravity switching steps and fed into the neural network. Next, we extracted the feature vectors from the latent space located right before the fully connected layer. Lastly, these 512-dimensional feature vectors were projected into 2-dimensional vectors and plotted, as shown in Figure 7. The result demonstrates that the augmented data show a distribution similar to that of the original data. Therefore, the gravity rotation and gravity switching steps successfully augment the training data while preserving the correlation between the data and the labels.
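The projection step above can be sketched with scikit-learn's t-SNE. The feature arrays below are random stand-ins for the 512-dimensional latent vectors extracted before the fully connected layer; all variable names and parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-ins for the 512-d latent features of the original and
# augmented training windows (random data for illustration only).
original_feats = np.random.randn(100, 512)
augmented_feats = original_feats + 0.05 * np.random.randn(100, 512)

# Embed both groups jointly so they share one 2-D space.
feats = np.vstack([original_feats, augmented_feats])
emb = TSNE(n_components=2, perplexity=30,
           init="pca", random_state=0).fit_transform(feats)

orig_2d, aug_2d = emb[:100], emb[100:]  # plot each group with its own marker
```

Embedding the two groups in a single t-SNE run is essential here: separate runs would produce incomparable coordinate systems.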

Figure 7. Latent space visualization of original TR (original) and augmented TR (augmented). To generate augmented, gravity rotation and gravity switching are applied to TR.

Conclusions
In this paper, we proposed a novel augmentation technique, GCDA. GCDA focuses on VR user activity data in which the activities have similar motion structures but different motion directions. Given the acceleration data obtained from the VR controller, GCDA first decomposes the acceleration into zero-gravity acceleration and gravitational acceleration. It then augments each component with a different technique: the zero-gravity acceleration is augmented by a combination of typical augmentation techniques widely used in previous studies, while the gravitational acceleration is augmented by the two steps of gravity rotation and gravity switching. Lastly, the two augmented accelerations are merged and used for training. The evaluations validated that applying GCDA to the training datasets increases the classification accuracy of the activity recognition model more than applying typical augmentation techniques.
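The decomposition and merge steps of this pipeline can be sketched as follows. A first-order low-pass filter is one common way to isolate the slowly varying gravitational component from raw IMU acceleration; the filter constant `alpha` and the function names are assumptions of this sketch, not the exact method of the paper.

```python
import numpy as np

def decompose_gravity(acc, alpha=0.98):
    """Split raw acceleration (T, 3) into a gravitational component
    (low-pass filtered) and the remaining zero-gravity acceleration."""
    gravity = np.empty_like(acc)
    g = acc[0]
    for t, a in enumerate(acc):
        g = alpha * g + (1.0 - alpha) * a  # first-order low-pass filter
        gravity[t] = g
    zero_g = acc - gravity
    return zero_g, gravity

# Synthetic window: noise plus a constant gravity offset on the z-axis.
acc = np.random.randn(128, 3) + np.array([0.0, 0.0, 9.81])
zero_g, gravity = decompose_gravity(acc)

# After augmenting each component separately, the two streams are
# merged back by addition: merged = aug_zero_g + aug_gravity
```

By construction the two components sum exactly back to the raw signal, so merging the separately augmented streams mirrors the original composition of the data.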
GCDA can be a solution to the sparse data problem, which is one of the major issues of neural-network-based HAR. In our tests with GCDA, activities performed by participants were successfully classified into 12 sub-activities with more than 96% accuracy, even though the recognition model was trained on sparse data collected from only one person. One drawback of our method is that the gravity switching step can only be used when sub-activities are correlated through their reference-activity. However, we envision that the hierarchy between sub-activity and reference-activity will become more common in future VR applications, since they will require more types of interaction methods. For example, if a future VR application uses the user's upward clap and downward clap as different interaction methods, GCDA will help to increase the interaction recognition performance.
Another drawback is that gravity switching requires prior knowledge of how much each sub-activity is rotated from the reference-activity. To resolve this problem, our future research will aim to automatically infer the required amount of rotation for each sub-activity.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available because they are intended to be used as internal data for certain companies.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: