A Study on Sensor System Latency in VR Motion Sickness

Abstract: One of the most frequent technical factors affecting Virtual Reality (VR) performance and causing motion sickness is system latency. In this paper, we adopted predictive algorithms (i.e., Dead Reckoning, Kalman Filtering, and deep-learning-based methods) to reduce the system latency.


Introduction
In an ever-changing world seeking regular technological advancement, Virtual Reality has proven to play an important role in scientific progress across many sectors of our lives. Virtual Reality (VR) refers to a computer-simulated environment that produces a realistic and intuitive 3D environment that users can experience through a head-mounted display (HMD). However, the assessment of VR has brought up a well-known issue: many users suffer from motion sickness (MS), also known as simulation sickness or visually induced motion sickness (VIMS). Motion sickness can be described as a mal-adaptation syndrome that occurs on exposure to real and/or apparent motion [1]. The signs and symptoms of motion sickness include nausea, excessive salivation, cold sweating, and drowsiness [2]. MS is categorized by actual motion on land, sea, and air [1]. The cause of motion sickness is described by multiple theories, i.e., the sensory conflict theory [3], evolutionary/poison theory, postural instability theory, eye movement theory, and rest-frame hypothesis theory [4]. These theories argue for different causal factors describing the effects of mal-adaptation while sensing real or perceived motion.
The most common causal theory of MS, the Sensory Conflict Theory, was first proposed in 1975 [1]. According to the Sensory Conflict Theory, motion sickness arises when the visual and vestibular systems generate conflicting sensory information. In this paper, the technical factors related to motion sickness in VR and Augmented Reality (AR) devices are discussed (see Section 2). In Section 3, we discuss the common predictive algorithms for predictive tracking.
Later, the most critical factor, latency, is considered for the simulation. In order to reduce the system delay or latency, we first investigate the performance of Dead Reckoning and Kalman Filtering. Following that, we deploy the deep-learning-based approaches: LSTM, Bidirectional LSTM, and Convolutional LSTM (see Section 4). Finally, we conclude this study by analyzing and comparing the error performance of these statistical and deep-learning-based methods in Section 5.

Technical Factors Related to the Motion Sickness in Virtual Environment
In this section, a few technical factors involved in motion sickness in a virtual environment are discussed, such as field of view, latency, sensors, rendering, and display.

Motion Sickness (MS) Incidence, Symptoms, and Theories
The word "Cybersickness" does not refer to a disease or a pathological state, but rather to a typical physiological response to an abnormal stimulus [22]. Cybersickness incidence depends on the stimulus, for instance its frequency and duration, and on the user measurement criteria. The sickness symptoms have previously been grouped into three common categories of effects: nausea, oculomotor symptoms, and disorientation [23].
Miller and Graybiel determined that 90% to 96% of participants would experience stomach symptoms when they reached the maximum number of head movements during a rotation protocol [1]. Healthy static observers might feel significant discomfort caused by motion stimuli and a moving visual field, whereas this does not occur in individuals without vestibular function.
According to the Oxford Dictionary of Psychology, passive movement can cause a mismatch between the information related to orientation and the actual movement, which are supplied via the visual and vestibular systems. As per the sensory conflict theory, it is this mismatch that induces feelings of nausea [24]. In 2005, Johnson explained that there was a higher chance of experiencing simulator sickness (SS) when the perceived motion was not correlated with the forces transmitted by the user's vestibular system [1].
Otherwise, if the real-life movements agree with the visual perception, the risk of experiencing SS is reduced [25]. According to the Evolutionary/Poison theory, conflicting information received from our senses implies that something is wrong with our perceptual and motor systems. Human bodies have evolved mechanisms that help to protect an individual by limiting the unsettling physiological effects created by consumed poisons [3].
Once humans predict what is or is not stationary, they will combine the information received by the visual and inertial system to support their following perceptions or actions [25]. As per Rest Frame (RF) theory, the most effective way to solve SS is by helping people find or create clear and non-conflicting rest frames to reconstruct their spatial perception [1].

Field of View
In optical devices and sensors, the field of view (FOV) describes the angle over which the device can receive electromagnetic radiation. Rather than a single focal point, the FOV covers an area. For VR, a larger FOV is preferable to obtain an immersive, life-like experience. Similarly, a wider FOV provides better sensor coverage or accessibility for other optical devices. Field of view is one of the critical factors contributing to cybersickness, which can be categorized into two issues [26].

Latency
Latency, or motion-to-photon latency, is one of the most significant and critical properties of VR and AR systems [27]. Latency describes the time between the user performing a motion and the display showing the correct content for the captured motion. More precisely, system latency is the time it takes for a real-world event to be sensed, processed, and displayed to the user. The approximate range of latency is tens to hundreds of milliseconds (ms) [28]. In VR systems, latency has been shown to confound pointing and object motion tasks, catching tasks, and ball bouncing tasks.
Latency is highly perceptible to humans. In VR, the user only sees the virtual content, and any misalignment is perceived through muscle motion and the vestibular system. The user is not able to sense up to 20 ms of lag in the system [28]. With that in mind, the commonly used system for overlaying virtual content onto the real world in the augmented reality space is the optical see-through (OST) display [29].
Since there is no lag in the real world, the latency of such AR systems becomes visually apparent as misregistration between the virtual content and the real world [28]. It is difficult to determine the minimum perceivable latency, as there is very little research evidence on these topics. Moss and Muth differentiated latency from other system variables. According to them, there was no increase in motion sickness symptoms when the amount of extra HMD system latency was varied [27]. Obtaining low latency remains complex.

Sensors
Most mobile AR and VR devices combine cameras and inertial measurement units (IMUs) for motion estimation. The IMU plays an important role in tracking position and orientation with low end-to-end latency. AR and VR systems primarily run their tracking cameras at 30 Hz, which means a new image must be read out and passed on for processing every 33 ms, assuming the exposure is set at 20 ms. Thus, the first step is to assign a timestamp to each sensor measurement so that the processing can happen later on [28]. While the cameras operate at 30-60 Hz, IMUs run at a much higher rate (from 100 Hz to even 1000 Hz) [28].
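As a concrete illustration of the timestamping step, the sketch below (our own illustration, not code from the paper; the function name `nearest_imu_sample` and the millisecond timestamps are hypothetical) pairs a low-rate camera frame with the closest high-rate IMU sample:

```python
import bisect

def nearest_imu_sample(imu_timestamps, frame_timestamp):
    """Return the IMU timestamp closest to a camera frame timestamp.

    imu_timestamps must be sorted ascending (IMUs deliver samples in order).
    """
    i = bisect.bisect_left(imu_timestamps, frame_timestamp)
    candidates = imu_timestamps[max(0, i - 1):i + 1]
    return min(candidates, key=lambda t: abs(t - frame_timestamp))

# IMU at 1000 Hz (1 ms period), camera at ~30 Hz (~33 ms period), times in ms
imu_ts = list(range(0, 100))                 # 0, 1, 2, ..., 99 ms
frame_ts = 33.3                              # a camera frame's timestamp
print(nearest_imu_sample(imu_ts, frame_ts))  # -> 33
```

Once each measurement carries a timestamp, a later fusion stage can align camera frames with the IMU samples recorded around the same instant.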

Tracking
In tracking, the HMD tracks the user's head movement and updates the rendered scene according to the orientation and location. Three rotational movements, pitch, yaw, and roll, are tracked by rotational tracking. Rotational tracking is performed by IMUs (comprising accelerometers, gyroscopes, and magnetometers). In addition, three translational movements, forward/back, up/down, and left/right, are tracked by positional tracking; together, these are known as six degrees of freedom (6DOF) [30].
Usually, positional tracking is much more complicated than rotational tracking [31]. Depending on how the AR or VR device has moved and rotated, the tracker absorbs the sensor data to calculate a 6DOF motion estimate (i.e., the pose). Here, 6DOF means that the body is free both to translate (forward/back, up/down, left/right) and to rotate (pitch, yaw, roll).
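A minimal sketch of the rotational part of tracking, under our own assumptions (gyroscope rates in rad/s, quaternions as (w, x, y, z) tuples; the function names are hypothetical), integrates an angular-velocity reading into the orientation quaternion:

```python
import math

def quat_multiply(q, r):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def integrate_gyro(q, omega, dt):
    """Rotate orientation q by gyro rate omega = (wx, wy, wz) in rad/s over dt seconds."""
    wx, wy, wz = omega
    theta = math.sqrt(wx*wx + wy*wy + wz*wz) * dt   # total rotation angle
    if theta < 1e-12:
        return q                                     # no measurable rotation
    ax, ay, az = wx*dt/theta, wy*dt/theta, wz*dt/theta  # unit rotation axis
    half = theta / 2.0
    dq = (math.cos(half), ax*math.sin(half), ay*math.sin(half), az*math.sin(half))
    return quat_multiply(q, dq)

# Yaw at pi/2 rad/s for one second, starting from the identity orientation
q = integrate_gyro((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, math.pi / 2), 1.0)
```

Positional tracking then adds the three translations on top of this orientation estimate, typically by fusing the camera observations with accelerometer data.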

Rendering
A 2D image is generated at the rendering stage, and the frame buffer is then sent to the display. The renderer requires several inputs to generate that image, such as the 3D content. Because of this, it is usual for rendering to happen asynchronously from tracking. When generating a new frame, the latest camera pose estimated by the tracker is used [28].

Display
The display's processing is critical for AR and VR since it contributes a significant amount of additional latency, which is also highly visible. During the buffer phase, the data is sent to the display in pixel-by-pixel and line-by-line mode [30]. Each pixel of the display system stores the three primary colors: red, green, and blue. Thus, the data of a frame looks like RGBRGBRGBRGB... [28].
That means the frame is arranged as scanlines, where each scanline is a row of pixels and each pixel has RGB color components. Organic light-emitting diode (OLED) and liquid crystal display (LCD) panels do not usually store received pixel data in an intermediate/temporary buffer, but they do perform some extra processing, such as re-scaling [28]. In [32], predictive displays were found to be successful in overcoming the AR device system latency issue by providing the user with immediate visual feedback.

Common Prediction Algorithm for Predictive Tracking
A dedicated latency tester can measure the motion-to-photon latency of a VR device [28]. The desire to perform predictive tracking comes from this latency. By utilizing predictive tracking, it is possible to reduce the latency [33]. Latency can come from multiple sources, such as sensing delays, processing delays, transmission delays, data smoothing, rendering delays, and frame rate delays [31]. All AR/VR devices have some minimal delay. To counter this, "predictive tracking", alongside other methods (e.g., time-warping), helps to reduce the apparent latency [31].
One of the most prominent uses of predictive tracking is to decrease the evident "motion-to-photon" latency, i.e., the time between a movement and the instant the movement is portrayed on the actual display [28]. Although there is a delay between a movement and the instant the movement information is presented on the display, the perceived latency can be reduced by using an estimated future orientation and position as the information for refreshing the display.
Predictive tracking is chosen in AR devices because the viewer's view of the natural world moves rapidly relative to the augmented reality overlay [33]. A classic example is a graphical overlay displayed over a physical object that the user watches through an AR headset. The overlay must stay "locked" to the object even when the user pivots their head, to ensure that it feels like part of the real world. Even though the object might be perceived with a camera, time is needed for the camera to capture the frame.
The processor then figures out the position of the object in the frame and the overlay's new area, and a graphics chip renders it. The overlay's movement relative to the natural world can be decreased by using predictive tracking [31]. For example, for head-tracking, knowing how quickly a human head can turn and the typical rotation speeds can help improve the tracking model [31]. Next, some regularly used prediction algorithms are discussed.

Statistical Methods of Prediction
In this section, statistical methods of prediction are discussed: the Dead Reckoning algorithm, the Alpha-beta-gamma algorithm, and Kalman Filtering.

Dead Reckoning
Suppose the position as well as the velocity are known for a given instant. In that case, the new position is predicted by presuming that the last known velocity and position are correct and that the velocity remains constant. One crucial issue with this algorithm is that it requires the velocity to be constant. However, as the velocity does not stay constant in most cases, the subsequent predictions become inaccurate [31].
Dead Reckoning-based prediction is applied here to predict the future value. From Dead Reckoning, we applied a polynomial function for the prediction, the reason being that the data were polynomial. The basic equation of a cubic function or, more generally, a third-degree polynomial function is as follows:

y = ax^3 + bx^2 + cx + d,

where y is the dependent variable and a, b, c, d are regression coefficients. For the prediction, the values of four points from the dataset are taken. Then, the next point's value is predicted using the polynomial cubic function. Next, the quadratic function and the linear function are used for the prediction. The equations of the quadratic and linear functions can be expressed as follows:

y = ax^2 + bx + c,
y = ax + b.
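The polynomial extrapolation step described above can be sketched as follows (a minimal illustration of ours, not the paper's implementation; the function name `dead_reckoning_predict` and the sample values are hypothetical). A polynomial of the chosen degree is fitted to the last four samples and evaluated one step ahead:

```python
import numpy as np

def dead_reckoning_predict(values, degree):
    """Fit a polynomial of the given degree to the samples (taken at steps
    0..n-1) and extrapolate its value one step ahead."""
    x = np.arange(len(values))
    coeffs = np.polyfit(x, values, degree)
    return float(np.polyval(coeffs, len(values)))

# Four known quaternion-component samples; predict the fifth position
samples = [0.10, 0.12, 0.15, 0.19]
cubic = dead_reckoning_predict(samples, 3)
quadratic = dead_reckoning_predict(samples, 2)
linear = dead_reckoning_predict(samples, 1)
```

With a degree-3 fit on exactly four points, the polynomial interpolates the samples exactly, so the cubic prediction is the most sensitive to noise; the quadratic and linear fits smooth the samples instead.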

Alpha-Beta-Gamma
The Alpha-beta-gamma (ABG) predictor is closely related to the Kalman predictor, though with a less complicated mathematical analysis. ABG endeavors to continuously estimate velocity and acceleration so that they can be used for prediction. Since the estimates incorporate the actual data, they reduce the noise [31]. This is done by configuring three parameters (alpha, beta, and gamma) that give the ability to emphasize responsiveness rather than noise reduction [30].
Predictive tracking is a commonly used and helpful technique to reduce apparent latency. Implementations range from straightforward to sophisticated and require some thought as well as investigation. For today's VR and AR systems, however, predictive tracking is essential to achieve low latency tracking [33].

Kalman Filtering
As with Dead Reckoning, the Kalman Filtering algorithm estimates unknown variables based on measurements taken over time. It is used to decrease the sensor noise in systems for which a mathematical model of the system's operation exists [34]. It is an optimal estimator algorithm that computationally combines data from a sensor and a motion model. It operates as a continuous predict-update cycle, handling predicted values that contain random error, uncertainty, or variation [34].
The Kalman filter converges to the value more quickly than conventional prediction algorithms (e.g., Dead Reckoning and the Alpha-beta-gamma predictor). Prediction with Kalman Filtering starts with an initial state estimate, which contains a certain amount of error. During the iterative process, the Kalman filter quickly narrows the predicted values down to somewhere close to the actual values. To predict efficiently and accurately, sufficient data is preferred: with enough data, the uncertainties are small, and the predicted value from the Kalman filter will be close to the actual value [34].
Let us consider X as the state matrix (X_i, X_j), P as the process covariance matrix, μ as the control variable matrix, W as the predicted state noise matrix, Q as the process noise covariance matrix, I as the identity matrix, K as the Kalman Gain, R as the measurement noise covariance matrix (measurement error), and Y as the measurement of the state. A and B are the matrices multiplied with the state matrix to obtain the object's current position. H is a matrix that converts the format of one matrix into the format of the other.
We consider the initial state as X_0 and P_0, the previous state as X_{k-1}, and the predicted new state as X_kp. Thus, the predicted new state X_kp and its covariance P_kp can be written as follows:

X_kp = A X_{k-1} + B μ + W,
P_kp = A P_{k-1} A^T + Q.

Next, the measured position of the object that needs to be tracked is converted into the proper format and returned as a vector. A measurement noise or measurement uncertainty might need to be added, since the measurement cannot always be taken at face value. Such noise can be defined in a matrix and added to the measurement, which results in an updated measurement. Then, the measurement is folded into the predicted state, and, from that, we come up with the Kalman gain K. The Kalman gain decides how much of the estimate needs to be corrected. Y_k is the measured input and is used to calculate the noise in the measurement and the variation in the estimate.
Next, the Kalman gain is calculated, and the new state X_k of the object that needs to be tracked is predicted:

K = P_kp H^T (H P_kp H^T + R)^{-1},
X_k = X_kp + K (Y_k - H X_kp).
Next, the process error, or the predicted error P_k after applying the Kalman gain, is calculated as follows:

P_k = (I - K H) P_kp.

At this stage, the calculated current state becomes the previous state and goes through the iteration process for the next prediction. In each Kalman filter iteration, we update the output: the new position X_k and the new predicted error P_k.
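The predict-update cycle above can be sketched in one dimension (a simplified illustration of ours, not the paper's implementation: A = H = 1, no control input, and the noise covariances q and r are hypothetical values):

```python
def kalman_1d(measurements, q=1e-4, r=0.05 ** 2, x0=0.0, p0=1.0):
    """Minimal 1D Kalman filter for a constant-state model.

    q: process noise covariance (Q), r: measurement noise covariance (R).
    Returns one filtered estimate per measurement.
    """
    x, p = x0, p0
    estimates = []
    for y in measurements:
        # Predict: X_kp = A X_{k-1};  P_kp = A P_{k-1} A^T + Q  (A = 1 here)
        x_p, p_p = x, p + q
        # Update: K = P_kp / (P_kp + R);  X_k = X_kp + K (Y - X_kp);  P_k = (1 - K) P_kp
        k = p_p / (p_p + r)
        x = x_p + k * (y - x_p)
        p = (1 - k) * p_p
        estimates.append(x)
    return estimates

# Noisy readings around a true value of 1.0; the estimate converges toward it
noisy = [0.9, 1.1, 1.05, 0.95, 1.0, 1.02]
filtered = kalman_1d(noisy)
```

As q shrinks relative to r, the filter trusts its model more and smooths the measurements more aggressively, which is the same responsiveness-versus-noise trade-off the ABG parameters expose.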

Machine Learning-Based Approach
Machine learning has proven to be extremely powerful in the fields of digital image processing, enhancement, and super-resolution. Moreover, machine learning models have been commonly used in production systems for computational photography and video understanding. However, integrating and implementing machine learning models in mobile graphics pipelines is difficult due to the limited computational resources and the tight power and latency constraints of mobile Virtual Reality (VR) systems. For example, suppose we can predict the head and body motion of the user in the immediate/near future.
In that case, it is possible to perform predictive pre-rendering on the edge device/computer and thus stream the expected view to the HMD [35]. To construct our sequential learning model, we use the recurrent neural network (RNN) architecture. In this section, we use an LSTM-based sequence-to-sequence predictive model that performs well in sequence-to-sequence prediction problems [36]. The predictive model creates a series from the user's previous viewpoint positions and predicts future viewpoint positions in the same form [37].

Recurrent Neural Networks (RNN)
Recurrent Neural Networks (RNNs) are used to learn sequences in data. RNNs can map sequences into a single output or a series of outputs with ease, and they can handle sequences of varying lengths (e.g., VR data) [38]. VR data can be treated in the RNN model in the same way as time series data [39]. An RNN adds a looping mechanism to a Feed Forward Neural Network (FFNN) that allows information to flow from one step to the next. The hidden state (memory) representing the previous inputs carries the knowledge from the input data [40].
RNNs can be used for sequential data modeling [41]. To generate output h_0, the network takes input x_0, which is fed along with the next input x_1 to generate the second output h_1. Here, the output h_1 depends on the current input x_1 and the previous output h_0. The structure of the RNN helps discover the dependencies between the input samples and recalls the meaning behind the sequences during training [16].

Long Short-Term Memory (LSTM)
Long short-term memory (LSTM) is able to manage long-term dependencies between input samples. Unlike plain RNNs, it can selectively forget or recall data and add fresh information without fully changing the existing information [42]. An LSTM network consists of cells called memory blocks. Within every cell, there are four neural network layers. To go to the next cell, each cell passes on two states: the cell state C_{t-1} and the hidden layer state h_{t-1} at the previous time step. x_t represents the input at the current time step. Both the cell state C and the hidden layer state h (also the output state) change after passing through an LSTM cell, forming a new cell state C_t and a new hidden layer state h_t.
These cells are responsible for remembering essential information, which can be manipulated through gating mechanisms. There are three gates in an LSTM: the forget gate f_t, input gate i_t, and output gate o_t. Following the LSTM structure from [41,43], we calculate the forget gate f_t, input gate i_t, output gate o_t, current cell memory s_t, and current cell output h_t as follows:

f_t = σ(W_f [h_{t-1}, x_t] + b_f),
i_t = σ(W_i [h_{t-1}, x_t] + b_i),
o_t = σ(W_o [h_{t-1}, x_t] + b_o),
s_t = f_t ⊙ s_{t-1} + i_t ⊙ tanh(W_s [h_{t-1}, x_t] + b_s),
h_t = o_t ⊙ tanh(s_t).

Here, s_{t-1} is the previous cell memory, s_t is the current cell memory, and h_t is the current cell output. The activation functions are the sigmoid (σ) and tanh. W and b are the weight vectors for all the gates. The update mechanism of the cell state represents the core of the LSTM: the forget gate discards part of the information in the cell state C_{t-1} at the previous time step (t - 1), and a new state C_t is obtained by inserting part of the information through the input gate. The output gate controls and updates a new hidden layer state h_t [15]. For motion prediction using LSTM, the mean square error (MSE) is considered as the loss function:

MSE = (1 / N_train) Σ_{S_train} Σ_{t=1}^{L} (y_t - ŷ_t)²,

where N_train is the total number of time steps of all trajectories on the training set S_train, and L represents the total length of each corresponding trajectory. y_t represents the predicted output, and ŷ_t represents the actual output.
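The gate equations above can be sketched as a single LSTM cell step in NumPy (our own illustration with randomly initialized weights, not the trained model from the paper; the 4-dimensional quaternion input and 8-dimensional hidden state are hypothetical sizes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, W, b):
    """One LSTM cell step. Each W[g] maps the concatenation [h_prev; x_t] to a
    gate pre-activation; b[g] is the matching bias vector."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])                        # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])                        # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])                        # output gate
    s_t = f_t * s_prev + i_t * np.tanh(W["s"] @ z + b["s"])   # new cell memory
    h_t = o_t * np.tanh(s_t)                                  # new cell output
    return h_t, s_t

# Hypothetical sizes: 4-dim quaternion input, 8-dim hidden state
rng = np.random.default_rng(0)
n_in, n_h = 4, 8
W = {g: rng.normal(0.0, 0.1, (n_h, n_h + n_in)) for g in "fios"}
b = {g: np.zeros(n_h) for g in "fios"}
h, s = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(0.0, 1.0, (5, n_in)):   # a 5-step input sequence
    h, s = lstm_step(x_t, h, s, W, b)
```

In a trained predictor, the final hidden state h would feed a dense layer that outputs the next quaternion sample.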

Bidirectional Long Short-Term Memory (BLSTM)
A unidirectional LSTM preserves information from the past only, because it can access only the past input. A Bidirectional LSTM runs over the inputs in two directions, one from the past to the future and the other from the future to the past. What distinguishes the bidirectional technique from the unidirectional one is that, in the LSTM that runs backwards, the network preserves information from the future. By combining the two hidden states, the network can retain information from both the past and the future at any point in time.
The bidirectional recurrent neural network (BRNN) was first proposed by M. Schuster [44]. In several areas, such as phoneme classification [45], speech recognition [46], and emotion classification [47], bidirectional networks outperform unidirectional networks [48]. When applied to time-series data, information is passed both in the normal temporal order and backward in time. The BRNN has two hidden layers, each of which is connected to the input and the output. The first layer has recurrent connections from previous time steps, while in the second, the direction of the recurrent connections is flipped.
That is how these two layers are differentiated, with one transferring activation backward along the series [49]. After unfolding over time, a BRNN can be trained using normal backpropagation. However, motion is a temporal-dynamic process influenced by various factors, such as acceleration, velocity, and direction. To learn these types of dependencies within single/multiple windows, time- and context-sensitive neural networks are proposed here. Since LSTMs and BLSTMs learn time-independent dependencies, they capture the relationship between measurements within a window and the rest of the measurements in the same window [49,50].
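The bidirectional combination of a forward and a backward pass can be sketched with a simple tanh recurrent step standing in for the LSTM cell (our own simplification; the function names and dimensions are hypothetical):

```python
import numpy as np

def rnn_pass(xs, W_h, W_x, b):
    """Run a simple tanh recurrent step over a sequence; return all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b)
        states.append(h)
    return states

def bidirectional(xs, W_h, W_x, b):
    """Concatenate forward-in-time and backward-in-time hidden states per step,
    so each output carries context from both the past and the future."""
    fwd = rnn_pass(xs, W_h, W_x, b)
    bwd = rnn_pass(xs[::-1], W_h, W_x, b)[::-1]   # re-reverse to align per step
    return [np.concatenate([f, w]) for f, w in zip(fwd, bwd)]

rng = np.random.default_rng(1)
n_in, n_h = 4, 6
W_h = rng.normal(0.0, 0.1, (n_h, n_h))
W_x = rng.normal(0.0, 0.1, (n_h, n_in))
b = np.zeros(n_h)
seq = list(rng.normal(0.0, 1.0, (5, n_in)))
out = bidirectional(seq, W_h, W_x, b)   # 5 steps, each of dimension 2 * n_h
```

A real BLSTM uses separate weight sets for the forward and backward layers; sharing them here only keeps the sketch short.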

Convolutional LSTM
The Convolutional LSTM architecture combines Convolutional Neural Network (CNN) layers for feature extraction on input data with an LSTM to support sequence prediction [51]. CNNs are a type of feed-forward neural network that performs well in image and natural language processing [52]. Convolutional LSTMs were developed for visual time series prediction problems, as well as for creating textual descriptions from image sequences (e.g., videos). The Convolutional LSTM architecture is both spatially and temporally deep and has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs.
The network's learning efficiency is increased by local perception and weight sharing, which eventually reduce the number of parameters [53]. The convolution layer and the pooling layer are the two key components of a CNN [52]. LSTM is commonly used in time series as it expands along the time sequence [54]. The Convolutional LSTM process of training and prediction is shown in Figure 1. The prediction performance is evaluated with the root mean square error (RMSE) and the mean absolute error (MAE):

RMSE = sqrt((1 / N_test) Σ_t (y_t - ŷ_t)²),
MAE = (1 / N_test) Σ_t |y_t - ŷ_t|.

Here, y_t, ŷ_t, and N_test represent the original value at period t, the predicted value at period t, and the total number of time steps of all trajectories on the test set S_test, respectively. The predicted value is compared to the observed value using these measures. The smaller the value of these metrics, the better the prediction efficiency.
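The two evaluation metrics are straightforward to compute; the sketch below is a direct transcription of the formulas (the sample values are hypothetical):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over all predicted time steps."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error over all predicted time steps."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

actual    = [0.10, 0.12, 0.15, 0.19]
predicted = [0.11, 0.12, 0.14, 0.21]
print(rmse(actual, predicted), mae(actual, predicted))
```

RMSE penalizes large deviations more heavily than MAE, which is why the two metrics can rank models differently on the same predictions.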

Experimental Analysis
In this section, the experimental setup is explained, and the simulation results are presented and discussed. The experimental setup describes the building blocks of this study, and the result analysis describes and compares the results from the statistical and machine-learning-based methods. Figure 2 shows the experimental setup of the proposed study, which started with the data generation/collection. Two types of data were collected from the VR devices, i.e., IMU data and camera sensor data. The IMU data was collected at high frequency, and the camera sensor data was collected at low frequency. Since the camera sensor data lead to the latency problem, these were used to illustrate the effectiveness of the proposed method in the experiment.

Experimental Setup
The dataset contains the time (ns) and the four quaternion components (w, x, y, z) of the sensor. The camera data had a frequency of 20 Hz, and the sensor data had a frequency of 200 Hz. After collecting the data, both the statistical and machine learning-based approaches were applied for prediction. After the prediction, the error properties of both approaches were measured. Next, the deployment scenario and the considered baseline schemes are described. Finally, the performance of the proposed approach was evaluated, and some insights from the results are discussed in Section 4.2.

Result Analysis
The experimental results from the statistical and machine learning-based approaches are discussed. The results from the statistical methods include Dead Reckoning and Kalman Filtering. The results from the ML-based approaches include LSTM, Bidirectional LSTM, and Convolutional LSTM.

Experimental Results for Statistical Methods-Based Prediction
Cross-correlation: Cross-correlation measures the similarity of one signal and a time-delayed version of another signal as a function of the time-lag applied to them [55], which can be formulated as follows:

(x ⋆ y)(τ) = Σ_t x(t) y(t + τ).

Here, x(t) and y(t) are functions of time, and τ is the time delay, which can be negative, zero, or positive. Cross-correlation reaches its maximum when the two signals are most similar to each other. Since the frequency of the IMU sensor data is higher than that of the camera sensor data, the data was re-sampled before computing the cross-correlation. First, the cross-correlation of the two data streams was computed, and the point of maximum cross-correlation was identified. The peak was found at lag 0, which indicates that the camera sensor data and the IMU sensor data were mostly aligned at that point.
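The lag-finding step can be sketched as follows (our own illustration, not the paper's code; the synthetic sinusoid and the 5-sample shift are hypothetical stand-ins for the re-sampled camera and IMU streams):

```python
import numpy as np

def best_lag(x, y):
    """Return the lag (in samples) at which the cross-correlation of x and y peaks.

    Both signals are assumed to be sampled at the same rate, i.e., the
    higher-frequency stream has already been re-sampled/decimated."""
    x = np.asarray(x, dtype=float) - np.mean(x)   # remove DC offset
    y = np.asarray(y, dtype=float) - np.mean(y)
    corr = np.correlate(x, y, mode="full")        # all lags from -(len(y)-1) to len(x)-1
    return int(np.argmax(corr) - (len(y) - 1))

t = np.linspace(0, 1, 200)
imu = np.sin(2 * np.pi * 3 * t)
camera = np.roll(imu, 5)            # same signal delayed by 5 samples
print(best_lag(camera, imu))        # -> 5
```

A peak at lag 0, as found in the experiment, means the two streams line up without any residual time shift after re-sampling.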

Dead Reckoning algorithm:
In the Dead Reckoning algorithm, the cubic, quadratic and linear functions are considered to predict the next value and fit the curve. The algorithm first takes the first four data points from the dataset to predict the next value at the corresponding next position using the cubic, quadratic, and linear functions. The algorithm predicts the next value after every four points.
From Figure 3a, we can see the predicted subsequent data at position 5 for the cubic, quadratic, and linear functions. The quadratic function was the best fit compared to the other functions. The data predicted by the linear function were close to the actual data, as were the data predicted by the quadratic function. However, the data predicted by the cubic function were not as close to the actual data as the quadratic function's predicted data points. Next, the first 7 data points from the camera dataset were considered to observe the performance. From Figure 3b, we can see that the algorithm predicted the next value and fitted the curve; after every 4 points, the algorithm predicted the next value. From the figure, it can be concluded that the quadratic and cubic functions were suitable for curve fitting.
However, when all of the data from the camera dataset was considered, the quadratic function outperformed the cubic function for curve fitting. Furthermore, the linear function was not the best-fitted curve for the actual data; it performed poorly while fitting the curve compared to the other two functions. Figure 4 represents the predicted values of the whole dataset for all three functions mentioned earlier. Next, the cubic, quadratic, and linear functions are compared over the total length of the camera dataset. Figure 5a shows how far the actual values are from the values predicted by the quadratic and cubic functions. From the figure, we can see that the quadratic function outperformed the other two functions in predicting the data.
By comparing the previous data with the actual data and the predicted data, we can analyze how much the actual data needs to be refined. In this case, it can be seen that the quadratic function was the preferred one, while the cubic function also performed closer to the quadratic function than the linear function did. The differences of the predicted cubic, quadratic, and linear function data from the original data are plotted in Figure 5b.
Kalman filter: The Kalman filter was applied to the camera dataset to predict the future value. Figure 6a shows that the algorithm predicted values close to the actual values and fitted them smoothly. We compared the cubic, quadratic, and linear function methods with the Kalman Filtering technique; the Kalman filter quickly predicted the next position, and the predicted value was close to the actual value. In Figure 6b, the prediction was made for the whole dataset, and the curve was fitted correctly compared to the cubic, quadratic, and linear function methods. Thus, from the results above, it can be concluded that the Kalman filter's prediction was more accurate for future position prediction.

Experimental Results for Machine Learning-Based Prediction
To prove the effectiveness of LSTM, Bidirectional LSTM, and Convolutional LSTM, the same dataset and operating environment were used. The system used to conduct the experiments was configured with an Intel i5-4700H 2.6 GHz CPU, 16 GB of RAM, a 500 GB hard disk, and the Windows 10 operating system. The dataset was split into 70% for training and 30% for testing. The training was executed for 50 epochs. Two hidden dense layers were used in the LSTM, one dense layer in the Bidirectional LSTM, and one hidden dense layer in the Convolutional LSTM. The optimizer was Adam, and the root mean square error (RMSE) loss function was used to train the model. We found that the LSTM, Bidirectional LSTM, and Convolutional LSTM-based predictions outperformed the Kalman Filtering and Dead Reckoning-based predictions. The details of the comparative analysis are provided in Section 4.2.3. Table 1 shows the error comparison of the cubic, quadratic, and linear function methods with the no-prediction data (the difference in the original data from the current position to the previous position). The no-prediction data points were compared to the predicted cubic, quadratic, and linear data points. Table 1 indicates that the prediction using the quadratic function had a lower error than the no-prediction error. However, for the standard deviation (SD), the quadratic and linear functions had a similar error. The quadratic function had smaller errors than the cubic and linear functions.

Comparison of Error
In Table 2, the predicted output of the Kalman filter is compared with our actual data points. Table 2 shows that the actual data points had a higher rate of errors than the Kalman filter prediction errors. Thus, the Kalman filter is suitable for prediction, as illustrated in Figure 5: the predicted data points were close to the actual data points and had significantly fewer errors than the actual data points.
In Table 3, the training and testing RMSE and MAE values from the LSTM, Bidirectional LSTM, and Convolutional LSTM models are compared. The accuracy of the VR sensor prediction presented in Table 3 demonstrates that the proposed Convolutional LSTM and Bidirectional LSTM models incurred the smallest RMSE, and the Convolutional LSTM incurred the smallest MAE in most of the sessions. The effectiveness of the proposed deep-learning-based techniques for predicting VR motion positions is presented by comparing their error properties. We found that the LSTM-based models performed well in every session of the motion prediction, with small RMSE and MAE. The results depict that, among the three approaches, the Convolutional LSTM (CNN-LSTM) outperformed the others: it had the smallest MAE and RMSE (approximately 0.00) among the three prediction models and high prediction accuracy.

Conclusions and Future Work
This paper investigated the latency causing MS in AR and VR systems. We adopted common prediction algorithms to predict future values in order to reduce the system delay, resulting in reduced latency. The Dead Reckoning and Kalman Filtering techniques were studied as common prediction algorithms for predicting future data. We also adopted deep-learning-based methods with three types of LSTM models that could learn general head motion patterns of VR sensors and predict future viewing directions and locations based on previous traces.
The system performed well on a real motion trace dataset, with low MAE and RMSE. The error was compared with the no-prediction values, and we found that the predicted values had a smaller error than no prediction. With the Dead Reckoning algorithm, the quadratic function had smaller errors than the cubic and linear functions; however, the curve fitted by the quadratic function was not entirely accurate.
The prediction error of the Kalman Filtering results was relatively smaller than that of the Dead Reckoning algorithm. Among the deployed deep-learning-based techniques, the Convolutional LSTM and Bidirectional LSTM were more effective at learning temporal features from the dataset. The prediction output can suffer if the number of time lags in the historical data is insufficient. However, the model still has certain shortcomings. For instance, we only considered a small amount of data for training and testing, since the experiment only had a limited amount of real-world data from VR sensors. Potential future research would focus on applying our method to a large, real-world dataset from VR devices.
Since the proposed method showed minimal errors with a high prediction rate, based on the experimental results, we conclude that our prediction model can significantly reduce system latency by predicting future values, which could ultimately help to reduce MS in AR and VR environments.