Augmented Reality Assisted Assembly Training Oriented Dynamic Gesture Recognition and Prediction

Abstract: Augmented reality assisted assembly training (ARAAT) is an effective and affordable technique for labor training in the automobile and electronics industries. In general, most ARAAT tasks are conducted by real-time hand operations. In this paper, we propose a dynamic gesture recognition and prediction algorithm that aims to evaluate the standard and achievement of the hand operations for a given task in ARAAT. We consider that a given task can be decomposed into a series of hand operations, and each hand operation further into several continuous actions. Each action is then related to a standard gesture based on the practical assembly task, so that the standard and achievement of the actions included in the operations can be identified and predicted from the sequences of gestures instead of from the performance throughout the whole task. Based on practical industrial assembly, we specified five typical tasks, three typical operations, and six standard actions. We used Zernike moments combined with histogram of oriented gradient and linear interpolation motion trajectories to represent 2D static and 3D dynamic features of standard gestures, respectively, and chose the directional pulse-coupled neural network as the classifier to recognize the gestures. In addition, we defined an action unit to reduce the dimensions of the features and the computational cost. During gesture recognition, we optimized the gesture boundaries iteratively by calculating the score probability density distribution to reduce the interference of invalid gestures and improve precision. The proposed algorithm was evaluated on four datasets; the experimental results show that it increases recognition accuracy and reduces computational cost.


Introduction
Industrial assembly is performed by grouping individual parts and fitting them together to create finished commodities with great added value. Thus, assembly is an important step connecting the manufacturing processes and the business processes. In assembly, training is significant for technicians to improve their skills. Effective assembly training can increase the efficiency and quality of assembly tasks to achieve more value. Therefore, many businesses and researchers have paid attention to assembly training [1]. In traditional assembly training, trainees need repeated practice to improve assembly skills, which leads to high resource consumption. Furthermore, it is not easy to evaluate the standard and achievement of operations during traditional assembly training. Nowadays, these problems can be addressed by augmented reality (AR) technology.
AR is a novel human-computer interaction (HCI) technique. AR enables users to experience a world in which virtual objects and real objects coexist, and to interact with them in real time. In the past two decades, AR applications have been a trending research topic in many areas, such as education, entertainment, medicine, and industry [2,3]. Volkswagen intended to use AR to compare calculated crash test imagery with the actual case [4]. Fuchs et al. [5] developed an optical see-through augmentation for laparoscopic surgery which could simulate the view of the laparoscopes from small incisions. Pokémon GO [6] enabled users to capture and battle with different virtual Pokémon in a real environment by mobile phone. Liarokapis et al. [7] employed a screen-based AR to assist engineering education based on the Construct3D tool. AR is capable of assisting assembly training because of its low time costs and high effectiveness, allowing trainees to conduct real-time assembly training tasks at any place or time with minimum cost [8][9][10][11]. Furthermore, in an augmented environment, trainees can analyze their behaviors and achievements according to virtual information and real feedback. From AR interaction, trainees can obtain intuitive data to standardize their training operations.
Currently, there are no commercial products for AR assisted assembly training (ARAAT), so many research works have focused on it. Early ARAAT studies mainly converged on marker-based tools or gloves. Wang et al. [12] established an AR assembly workspace which enabled trainees to assemble virtual objects with real marked tools. Valentini [13] developed a system which allowed trainees to assemble virtual components using a glove with sensors. These methods can be accurate and intuitive, but come with high device costs. Recently, benefiting from the rapid development of computer vision, researchers have increasingly focused on vision-based bare-hand ARAAT, which is natural and intuitive and relies only on low-cost vision cameras [14][15][16]. Most ARAAT tasks are conducted by real-time hand operations: trainees need to use their real hands to operate virtual workpieces. Thus, precise gesture recognition plays an important role in bare-hand ARAAT, and also in evaluating the standard and achievement of training tasks. Lee et al. [17] applied hand orientation estimation and collision recognition from trainees' hands to virtual substances. They proposed a hand interaction technique that ensured a seamless experience for trainees in the AR environment. Nevertheless, the precision of Lee's method depended on the range between the trainees' hands and the stereo cameras; calculation errors arose when only one finger was used, so its application was limited. Wang [18] proposed a Restricted Coulomb Energy network to segment hands for AR empty-hand assembly. Virtual objects were controlled by two fingertips in the experiment to simulate assembly tasks. Since the fingertip tracking algorithms were implemented in 2D space without depth information, the results had lower recognition accuracy. Most current studies on bare-hand ARAAT have achieved low recognition accuracy. Hence, more effort has been made to raise recognition precision.
Choi [19] developed a hand-based AR mobile phone interface by executing the "grasping" and "releasing" gestures with virtual substances. The interface provided natural interaction by benefiting from hand detection, palm pose estimation, and finger gesture recognition. Figueiredo et al. [20] evaluated interactions on tabletop applications with virtual objects by hand tracking and gesture recognition. During the interaction, they applied the "grasping" and "releasing" gestures and used the Kinect device for hand tracking. These studies have increased the recognition rate through various image processing methods, but the interactions are confined to a few types of gestures. Even for the up-to-date AR device HoloLens (1st Generation) [21], which is broadly used in numerous AR applications, the operation gestures are limited to only two: "pointing" and "blooming". Limited types of interaction gestures are not only inadequate for practical industrial assembly tasks but also give trainees an unnatural experience. In ARAAT, providing a realistic and natural experience in performing assembly tasks is as significant an issue as precise gesture recognition [22][23][24]. Aside from inadequate gestures, a long response time also brings an unnatural interaction experience in ARAAT. Thus, many studies have focused on early recognition by predicting or estimating gestures to reduce the response time and make the process of assembly operations appear natural. Zhu et al. [25] proposed a progressive filtering approach to predict ongoing human tasks to ensure a natural and friendly interaction. Du et al. [26] predicted gestures using improved particle filters to accomplish the tasks of welding, painting, and stamping. With the help of additional physical properties of 3D virtual objects, Imbert et al. [27] found a more natural approach to doing assembly tasks, and the results showed that trainees could perform assembly tasks easily.
A further problem is that current studies on bare-hand ARAAT mostly focus on single gesture recognition rather than whole assembly task evaluation. Compared with single gesture recognition, whole-task evaluation can analyze the trainees' operations overall and is more helpful for improving the standard and achievement of their operations. However, whole-task evaluation remains a challenge because a whole assembly task contains many different gestures that are difficult to distinguish. Directly evaluating the performance of a whole ARAAT task is a complicated process because only limited types of gestures for ARAAT can be recognized and the recognition accuracy is not high.
Based on the related ARAAT studies mentioned thus far, ARAAT has the following areas of improvement: (1) lack of the whole complex assembly task evaluation, (2) limitation of interaction gestures, (3) low recognition accuracy, and (4) unnatural interaction experiences resulting from long response time. With the aim of resolving these problems, in this paper, we develop an ARAAT system. The flowchart of ARAAT is shown in Figure 1. Trainees choose the ARAAT tasks and conduct the corresponding operations. The AR device (HoloLens) records trainees' gesture videos during the tasks and the multimodal features are extracted from the videos. After classification, these multimodal features are used for gesture segmentation and optimization. After recognition with the optimal gesture boundaries, the gesture results will be used to evaluate the standard and achievement of hand operations in ARAAT tasks. In ARAAT, we have made the following contributions: (1) Building a model for the whole complex assembly task evaluation. We decompose an ARAAT task into a series of hand operations. Each hand operation is further decomposed into several continuous actions. Each action can be considered as an identifiable gesture. Using the classification and sequences of gestures, we can easily distinguish actions and predict operations to evaluate the performance of ARAAT tasks. (2) Increasing the types of interaction gestures. We generalize three typical operations and six standard actions based on practical industrial assembly tasks. (3) Improving the recognition accuracy. For evaluating the standard and achievement of hand operations in ARAAT tasks, an algorithm for gesture recognition is proposed in this paper to improve recognition accuracy and efficiency. The ARAAT task is recorded into an input video by an AR device. 
To ensure precise interactions for trainees working with virtual workpieces using their real hands (empty hands or with assembly kits), virtual workpieces must be matched correctly to hands or tools according to spatial-temporal consistency. Based on this spatial-temporal consistency, we use Zernike moments combined with histogram of oriented gradient and linear interpolation motion trajectories to simultaneously represent 2D static and 3D dynamic features, respectively. The directional pulse-coupled neural network is chosen as the classifier to recognize gestures. To reduce the computational cost, we define an action unit to reduce the dimensions of the features. The score probability density distribution is defined and applied to optimize gesture boundaries iteratively, decreasing the interference of invalid gestures during gesture recognition. (4) Decreasing the response time. We propose an action and operation prediction method based on the standard operation order. The prediction method can recognize actions and operations early to reduce the response time and ensure a natural experience in ARAAT.
The subsequent sections of this paper are organized as follows: Section 2 describes the modeling for ARAAT; Section 3 presents the action categories, action recognition, and operation prediction; Section 4 details the experimental results compared with other algorithms on a homemade dataset and the experimental analysis; finally, Section 5 provides a short conclusion and suggestions for future research.
Figure 1. The framework of AR assisted assembly training (ARAAT) operation recognition. HoloLens records the trainee's gesture videos of the corresponding ARAAT task operations, and multimodal features are extracted from the gesture videos. After classification, the features are used for gesture segmentation and optimization. Through recognition with the optimal gesture boundaries, the gesture results are used to evaluate the standard and achievement of hand operations in ARAAT tasks.

Modeling for Augmented Reality Assisted Assembly Training
ARAAT tasks are mainly conducted by hand operations. Directly evaluating the performance of a whole ARAAT task is a complicated process, so we evaluate the performance of ARAAT tasks according to the standard and achievement of hand operations. For this purpose, we consider that a task can be decomposed into a series of hand operations, each of which can be decomposed into several continuous actions. Each action is related to a standard gesture based on the practical assembly task. The model of ARAAT conducted based on this decomposition is illustrated in Figure 2. Let T be a given assembly task and V be the recorded input video corresponding to T. V can be expressed as a series of digital image frames, that is, V = {f_1, f_2, …, f_t, …}, where f_t is the t-th image frame. For the convenience of formalizing T and the related definitions, we define a concatenation operator ⊕, where "f ⊕ g" means that f occurs just after g. This gives the following definitions:
Definition 1. Let O_i be the i-th operation of task T, and N be the number of operations in T. Then, T = O_1 ⊕ O_2 ⊕ … ⊕ O_N.
Definition 2. Let A_i,j be the j-th action of operation O_i, and M_i be the number of actions in O_i. Then, O_i = A_i,1 ⊕ A_i,2 ⊕ … ⊕ A_i,M_i, where f^h_i,j and f^e_i,j are the head frame and the end frame of A_i,j, respectively. We consider that the end frame of an action is the head frame of the next action, that is, f^e_i,j = f^h_i,j+1.
To evaluate the performance of the task T in ARAAT, it is necessary to clarify which operations O_i are conducted for the task T. Operation recognition is the identification of the corresponding sequences of actions {A_i,j}. To distinguish every action A_i,j for each O_i ∈ T, we need to label every frame f_t of the input video V and recognize the gestures included in the action A_i,j. This dynamic gesture recognition allows the performance of tasks in ARAAT to be evaluated.
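The decomposition above (task into operations, operations into continuous actions, actions into frame spans with each action ending where the next begins) can be sketched as a small data model. The class and function names below are illustrative, not from the paper:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    label: str      # one of the standard gestures, e.g. "Grasping"
    head: int       # index of the head frame f^h_i,j in V
    end: int        # index of the end frame f^e_i,j in V

@dataclass
class Operation:
    actions: List[Action] = field(default_factory=list)

@dataclass
class Task:
    operations: List[Operation] = field(default_factory=list)

def is_continuous(task: Task) -> bool:
    """Check the constraint f^e_i,j = f^h_i,j+1: each action ends
    exactly where the next one begins, so the concatenation of
    actions covers the video without gaps."""
    actions = [a for op in task.operations for a in op.actions]
    return all(a.end == b.head for a, b in zip(actions, actions[1:]))

# a toy "inserting"-like operation decomposed into continuous actions
op = Operation([Action("Grasping", 0, 80), Action("Moving", 80, 150),
                Action("Releasing", 150, 210)])
print(is_continuous(Task([op])))  # True
```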

Dynamic Gesture Recognition in Augmented Reality Assisted Assembly Training
The difficulty of dynamic gesture recognition in ARAAT lies in simultaneously segmenting and labeling the gestures of the actions in an operation. One exhaustive method of dynamic gesture recognition is to label all frames in the search space, but this is time-consuming when dealing with long gestures. Therefore, a more efficient dynamic recognition algorithm is proposed in this section, consisting of three parts: action categories, action recognition, and operation prediction.

Action Categories
According to the American Society of Mechanical Engineers (ASME) standard operations [28], there are five typical types of assembly tasks in practical industrial assembly: "matching", "conjugating", "joining", "fastening", and "meshing", as shown in Figure 3. The essential operations for conducting these assembly tasks can be categorized into "inserting", "fastening", and "screwing", presented in Figure 4. Based on these typical tasks and operations, we generalize six basic actions [29]:
• "Rotating": trainees can change the orientation of objects;
• "Moving": the movement of an object or a tool;
• "Grasping": trainees can pick up an object or a tool;
• "Releasing": trainees can put down an object or a tool;
• "Pointing": the selection of an option or a virtual workpiece in the AR environment;
• "Scaling": trainees can resize objects.
The actions are demonstrated in Figure 5. In ARAAT, trainees usually complete assembly tasks with their hands and several virtual tools, similar to practical industrial assembly. Thus, the above six standard actions are also the basis of ARAAT. To evaluate the standard and achievement of assembly tasks, we first need to recognize trainees' gestures for the basic actions.
Different from single action recognition, meaningless transition actions between important actions are unavoidable in practical assembly operations. The transition actions carry no meaning and may negatively impact the output of gesture recognition. To remove the disturbance of transition actions, a "Null" action is added to the standard basic action set, that is, A = {Pointing, Moving, Grasping, Releasing, Scaling, Rotating, Null}.

Action Recognition
Action recognition mainly involves the following parts: feature extraction, gesture classification, and boundary segmentation. The specific processing steps are outlined below.

Feature Extraction
The actions in ARAAT are all dynamic gestures with movements. It is difficult to distinguish gestures with only 2D static features, so both static and dynamic features are extracted simultaneously to provide greater accuracy of action recognition. The static features are mainly the 2D representation characteristics of gestures, and the dynamic features are the trajectories of hand motions.
(1) Static Features. In this paper, Zernike moments [30] and the histogram of oriented gradient (HOG) are used to extract static features [31]. Zernike moments were first proposed by Frits Zernike in 1934 to uniquely describe functions on the unit disk and were later extended to describe images for feature extraction. Zernike moments are shift-, scale-, and rotation-invariant and are often used as descriptors for gestures. Therefore, differently sized and shaped artifacts will not greatly influence the gesture recognition results. However, Zernike moments cannot achieve good results in texture recognition. HOG is a feature descriptor for object detection in static images that counts occurrences of gradient orientations in localized portions of an image. Thus, HOG is used to compensate for the recognition of local texture features in images.
For ∀ f_t ∈ A_i,j ⊂ V, i = 1, 2, …, N, j = 1, 2, …, M_i, t = 1, 2, …, the static feature matrix from f_1 to f_t can be defined as S_t = (s_1, s_2, …, s_t), where s_t = (Z_nm,t, V_HOG,t, f_Type,t) is the static feature vector of frame f_t; Z_nm,t, V_HOG,t, and f_Type,t are the Zernike moments, the HOG feature vector, and the gesture type, respectively.
Before calculating Zernike moments, each frame must first be normalized. Let p_t(x_t, y_t) be a pixel of frame f_t, where m_f and n_f are the length and width of f_t, 0 ≤ x ≤ m_f, and 0 ≤ y ≤ n_f. We then conduct a mapping transformation of f_t to a normalized polar image f'_t and define p'_t(x'_t, y'_t) as the pixel of f'_t, where −n_f/2 ≤ x', y' ≤ n_f/2. The Zernike moments of order n with repetition m for gestures are calculated as
Z_nm,t = ((n + 1)/π) Σ_{ρ_t} Σ_{θ_t} f'_t(ρ_t, θ_t) R_nm,t(ρ_t) e^{−jmθ_t},
where n and m are nonnegative integers, |m| ≤ n, and n − |m| is even; R_nm,t is a radial polynomial and R_nm,t(ρ_t) e^{−jmθ_t} is the complex conjugate of the Zernike basis function; ρ_t is the polar radius and θ_t is the polar angle. We define the gesture type f_Type,t as the number of fingers used in each gesture, that is,
f_Type,t = 1, if performing gestures only using the index finger; 2, if performing gestures using both the thumb and index finger; 5, if performing gestures using all fingers. (10)
(2) Dynamic Features. In addition to the static representation of gestures, the dynamic motions of gestures also need to be acquired.
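As a hedged illustration of the static descriptor, the NumPy sketch below computes a Zernike moment of a square image mapped onto the unit disk, using the standard radial polynomial; the sampling and normalization details are assumptions and may differ from the paper's implementation. The rotation invariance of the magnitude |Z_nm|, which motivates using Zernike moments as gesture descriptors, is checked at the end:

```python
import numpy as np
from math import factorial

def zernike_moment(img, n, m):
    """Zernike moment Z_nm of a square grayscale image mapped onto the
    unit disk; pixels outside the disk are ignored."""
    N = img.shape[0]
    ys, xs = np.mgrid[0:N, 0:N]
    # map pixel coordinates into [-1, 1] x [-1, 1]
    x = (2 * xs - N + 1) / (N - 1)
    y = (2 * ys - N + 1) / (N - 1)
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    mask = rho <= 1.0
    # radial polynomial R_nm(rho)
    R = np.zeros_like(rho)
    for s in range((n - abs(m)) // 2 + 1):
        c = ((-1) ** s * factorial(n - s)
             / (factorial(s)
                * factorial((n + abs(m)) // 2 - s)
                * factorial((n - abs(m)) // 2 - s)))
        R += c * rho ** (n - 2 * s)
    # Z_nm = (n+1)/pi * sum over the disk of f * R_nm * exp(-j m theta)
    Vstar = R * np.exp(-1j * m * theta)
    dA = (2 / (N - 1)) ** 2  # area of one pixel cell in disk coordinates
    return (n + 1) / np.pi * np.sum(img[mask] * Vstar[mask]) * dA

# rotation invariance: |Z_nm| is unchanged when the image is rotated
img = np.zeros((64, 64)); img[20:40, 25:35] = 1.0
z = zernike_moment(img, 4, 2)
zr = zernike_moment(np.rot90(img), 4, 2)  # 90-degree rotation
print(abs(abs(z) - abs(zr)) < 1e-6)  # True
```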
We define p(x^C_t, y^C_t) as the centroid of the hand in each frame f_t, so the motion trajectories up to frame t are X_t = (x^C_1, x^C_2, …, x^C_t) and Y_t = (y^C_1, y^C_2, …, y^C_t). Considering that the range and speed of movements vary from person to person, a mean operation is conducted for the shift invariance and robustness of the features: we obtain two new vectors X'_t = X_t − X̄_t and Y'_t = Y_t − Ȳ_t, where X̄_t and Ȳ_t are the mean values of X_t and Y_t. The dynamic features can then be presented by D_t = (X'_t, Y'_t).
(3) Feature Matrix. In summary, for ∀ f_t ∈ A_i,j ⊂ V, i = 1, 2, …, N, j = 1, 2, …, M_i, t = 2, 3, …, the feature matrix F_t representing gestures from f_1 to f_t is expressed as F_t = (S_t, D_t), combining the static feature matrix S_t and the dynamic features D_t.
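The mean operation on the centroid trajectories can be sketched as follows; the function name and data layout are illustrative:

```python
import numpy as np

def dynamic_features(centroids):
    """Mean-centred centroid trajectory, a shift-invariant dynamic
    feature of the hand motion up to frame t.
    centroids: list of (x, y) hand centroids, one per frame."""
    xy = np.asarray(centroids, dtype=float)
    X, Y = xy[:, 0], xy[:, 1]
    # subtracting the mean removes the absolute hand position, so the
    # same motion performed anywhere in the frame gives the same feature
    return X - X.mean(), Y - Y.mean()

# the same rightward motion performed at two different screen positions
a = [(10, 50), (20, 50), (30, 50)]
b = [(110, 250), (120, 250), (130, 250)]
Xa, Ya = dynamic_features(a)
Xb, Yb = dynamic_features(b)
print(np.allclose(Xa, Xb) and np.allclose(Ya, Yb))  # True
```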

Gesture Classification
We use the feature matrix F_t as the input of the classifier C to recognize gestures. In other words, for ∀ f_t ∈ A_i,j ⊂ V, i = 1, 2, …, N, j = 1, 2, …, M_i, t = 2, 3, …, classifier C can be seen as a mapping: by calculating the feature matrix F_t, we obtain the recognition result C(f_t, F_t) ∈ A. The classifier C can thus be defined as C: (f_t, F_t) → a ∈ A. In this paper, the directional pulse-coupled neural network (DPCNN) is chosen as the classifier. The DPCNN classifies and recognizes dynamic gestures by template matching and is often applied in real-time applications, as verified in our previous work [32]. The DPCNN selects the neuron firing directions through different excitations to reduce the computational complexity and time of the traditional PCNN, and improves recognition accuracy through the choice of reasonable firing directions. For each gesture in the standard basic action set A, that is, {Pointing, Moving, Grasping, Releasing, Scaling, Rotating, Null}, we construct a single-gesture video dataset that is used as a training template to train the parameters of the DPCNN classifier. We input the feature matrix F_t of each single gesture into the DPCNN and then train the classifier.
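The DPCNN itself is beyond a short sketch, but the classification interface C(f_t, F_t) ∈ A can be illustrated with a simple nearest-template stand-in. This is not the paper's DPCNN; the class name, the distance-based scoring, and the feature dimensions are all assumptions:

```python
import numpy as np

ACTIONS = ["Pointing", "Moving", "Grasping", "Releasing",
           "Scaling", "Rotating", "Null"]

class TemplateClassifier:
    """Illustrative stand-in for the classifier C: it scores a feature
    vector against one stored template per action and returns the
    best-matching class, mirroring the mapping C(f_t, F_t) -> a in A."""
    def __init__(self, templates):
        self.templates = templates  # {action label: template vector}

    def scores(self, F):
        # higher score = smaller Euclidean distance to the template
        return {a: -np.linalg.norm(F - tpl)
                for a, tpl in self.templates.items()}

    def __call__(self, F):
        s = self.scores(F)
        return max(s, key=s.get)

rng = np.random.default_rng(0)
templates = {a: rng.normal(size=8) for a in ACTIONS}
clf = TemplateClassifier(templates)
# a feature vector near the "Moving" template is classified as "Moving"
F = templates["Moving"] + 0.01
print(clf(F))  # Moving
```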
To increase the efficiency of gesture recognition, an action unit is introduced into the process of feature classification. We define L as the length of the action unit. An L that is too small will lead to poor computational efficiency, while an L that is too large will increase computational complexity. To determine the value of L, we collected assembly gesture video data online and conducted an experimental statistical analysis. By trial and error, we determined that L = 20 is the best length for the action unit. Because of the action unit, the dimension of the features is reduced to 20 frames instead of all frames. For ∀ f_t ∈ A_i,j ⊂ V, i = 1, 2, …, N, j = 1, 2, …, M_i, t = 20, 21, …, f_t and the 19 frames before f_t construct an action unit denoted by U_t, that is, U_t = f_{t−19} ⊕ f_{t−18} ⊕ … ⊕ f_t. The multiple features of the action unit U_t are then obtained by the feature extractor. Based on the features, the classifier assigns each frame in the action unit U_t to the appropriate action category and gives scores to each action category, as shown in Figure 6a. This reduces the computational cost and time, as well as the impact of video length on complexity, allowing even long operation videos to be processed quickly and efficiently. The scores then produce a unit probability distribution p_U(A) over the action classes, which suggests the probability of actions assigned to each class. As the action unit moves forward along the frames, the probability distribution p(A) on the whole video is produced, as illustrated in Figure 6b.
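The sliding action unit and its probability distributions can be sketched as follows, assuming per-frame classifier scores are available and using a softmax over summed scores to form p_U(A); both of these choices are illustrative, not taken from the paper:

```python
import numpy as np

L = 20  # action unit length, determined empirically in the paper

def unit_distribution(frame_scores):
    """Turn per-frame scores for one action unit (shape L x n_classes)
    into a unit probability distribution p_U(A) via a softmax over the
    summed scores."""
    s = np.asarray(frame_scores).sum(axis=0)
    e = np.exp(s - s.max())
    return e / e.sum()

def video_distribution(all_scores):
    """Slide the action unit along the video: for t = L-1, L, ...,
    frames [t-L+1, t] form U_t, yielding p(A) over the whole video."""
    T = len(all_scores)
    return np.stack([unit_distribution(all_scores[t - L + 1:t + 1])
                     for t in range(L - 1, T)])

# toy scores for a 60-frame video with 3 classes; class 1 dominates
scores = np.tile([0.0, 1.0, 0.2], (60, 1))
p = video_distribution(scores)
print(p.shape)             # (41, 3)
print(int(p[0].argmax()))  # 1
```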

Boundary Segmentation
As mentioned in Section 2, the operations of a task are constructed by actions, which are performed by gestures. For accurate recognition of gestures, it is necessary to first define the boundary of each action. In this section, we use density distribution optimization to segment the action boundary.
Let p(A | f_t) be the distribution of action class A on frame f_t ∈ V. We define the candidate boundary frames F_Candidate (as seen in Figure 7). Then, we optimize the segmentation on the log density distribution log p(A_i,j) by tuning (f^h_i,j, f^e_i,j) over the candidate boundary frames F_Candidate, that is,
(f^h_i,j, f^e_i,j) = argmax_{F_Candidate} Σ_{t=h}^{e} log p(A_i,j | f_t). (17)
Equation (17) is solved by dynamic programming and the optimal boundaries f^h_i,j ⊕ … ⊕ f^e_i,j are obtained. We propose a simplified algorithm that can search for the optimal segment boundaries in real time. The procedure of dynamic programming is shown in Algorithm 1.
After each boundary optimization, we need to reduce the influence of invalid actions. In a real assembly operation, each action usually lasts more than 60 frames. Thus, any pair of f^h_i,j and f^e_i,j spanning fewer than 10 frames in an operation is most likely an invalid action and is removed. When the pair of f^h_i,j and f^e_i,j spans more than 10 frames but fewer than 60 frames and is recognized as not being the transition action "Null", the pair is merged into the former segment. Then, we solve (17) again. The procedure is repeated until the neighboring f^h_i,j and f^e_i,j are appropriate and stable.
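The invalid-segment rules described above can be sketched as a post-processing pass; the segment representation and thresholds follow the text, while the function name is illustrative:

```python
def clean_segments(segments):
    """Post-process (label, head, end) segments after each boundary
    optimization: segments shorter than 10 frames are invalid and
    removed; non-"Null" segments of 10-59 frames are merged into the
    previous segment."""
    out = []
    for label, head, end in segments:
        length = end - head
        if length < 10:
            continue                                  # invalid action, drop
        if length < 60 and label != "Null" and out:
            prev_label, prev_head, _ = out[-1]
            out[-1] = (prev_label, prev_head, end)    # merge into former
        else:
            out.append((label, head, end))
    return out

segs = [("Grasping", 0, 95), ("Rotating", 95, 100),   # 5 frames: dropped
        ("Moving", 100, 140),                         # 40 frames: merged
        ("Releasing", 140, 230)]
print(clean_segments(segs))
# [('Grasping', 0, 140), ('Releasing', 140, 230)]
```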
The flowchart of boundary segmentation is shown in Figure 8 and the algorithm is given below:
(1) Optimize the boundaries by solving (17); go to (2).
(2) Remove any segment shorter than 10 frames; merge any non-"Null" segment of between 10 and 60 frames into the former segment; go to (3).
(3) If the segment boundaries are stable, stop; otherwise, go to (1).

Action and Operation Prediction
The purpose of ARAAT is to improve trainees' standard by evaluating the performance of their hand operations. Thus, the prediction of actions and operations is as necessary as action recognition. The prediction algorithm can identify trainees' assembly operations early by recognizing actions and judging the current action order, allowing the standard order and achievement of the following actions to be evaluated.
In ARAAT, trainees are required to use their hands or various virtual equipment to complete different assembly training tasks. As mentioned in Section 3.1, "inserting", "fastening", and "screwing" are typical practical assembly operations. To conduct these training operations, trainees need to hold some equipment as assistance, such as a wrench or screwdriver. By moving the workpieces with their hands, trainees can carry out the "inserting" operation; by rotating the wrench, trainees can perform the "fastening" operation; by rotating the screwdriver, trainees can perform the "screwing" operation. In ARAAT, trainees also need to complete the assembly training through controlling the virtual workpiece by their hands or equipment.
Each assembly operation in ARAAT contains several actions in a standard order, summarized as follows (refer to Figure 9).
Figure 9. The action orders of the three typical operations "inserting", "fastening", and "screwing" in ARAAT.
Inserting: grasping (the workpiece) → moving → rotating/scaling → releasing (the workpiece), that is,
Inserting = grasping ⊕ moving ⊕ rotating/scaling ⊕ releasing. (18)
Fastening: grasping (the wrench) → moving → moving around → releasing (the wrench), that is,
Fastening = grasping ⊕ moving ⊕ moving around ⊕ releasing. (19)
Screwing: grasping (the screwdriver) → moving → rotating → releasing (the screwdriver), that is,
Screwing = grasping ⊕ moving ⊕ rotating ⊕ releasing. (20)
We can evaluate the performance of actions and operations by recognizing the gestures and predicting the action order that the trainee is performing. Furthermore, when the trainee carries out an operation in ARAAT, the virtual object needs to give the corresponding reaction. If there is a delay in the reaction, the operation will not be smooth and the interaction will be unnatural. To avoid inconsistencies with the operation, it is necessary to predict the current and next actions and give the computer enough time to make real-time feedback reactions. Through the recognition of frames in the action unit, we can predict uncompleted actions and reduce the response time, providing trainees with a smooth and natural human-machine interaction experience in ARAAT.
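Early operation prediction from a partially observed action sequence can be sketched as prefix matching against the standard orders (18)-(20); the dictionary representation and action spellings are illustrative:

```python
OPERATIONS = {
    "Inserting": ["Grasping", "Moving", "Rotating/Scaling", "Releasing"],
    "Fastening": ["Grasping", "Moving", "MovingAround", "Releasing"],
    "Screwing":  ["Grasping", "Moving", "Rotating", "Releasing"],
}

def predict_operations(observed):
    """Return the operations whose standard action order starts with the
    actions recognized so far, enabling early recognition before the
    operation finishes."""
    return [op for op, order in OPERATIONS.items()
            if order[:len(observed)] == observed]

# after "Grasping -> Moving" all three operations are still possible;
# once a plain "Rotating" is observed, only "Screwing" matches
print(predict_operations(["Grasping", "Moving"]))
print(predict_operations(["Grasping", "Moving", "Rotating"]))
```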
Based on the above discussion, the framework of action and operation prediction is given in Figure 10.

Experimental Design and Datasets
The experiments are divided into two sections to evaluate the proposed algorithm. The first section validates the recognition accuracy and efficiency of the proposed dynamic gesture recognition algorithm. We conduct the experiments on four datasets, comparing with other algorithms, in four parts: frame recognition, action recognition, action boundary segmentation, and the effect of image resolution on recognition. The second section validates the effectiveness of the proposed algorithm for operation recognition and prediction and the reliability of the proposed ARAAT system. We invited participants to perform real-time ARAAT tasks on a HoloLens device to evaluate the naturalness of interactions in ARAAT.
In the first section, the proposed algorithm for dynamic gesture recognition is evaluated on two public datasets (Sheffield Kinect Gesture (SKIG) dataset [33] and Sebastien Marcel Dynamic Hand Posture Dataset [34]) and two homemade datasets (Assembly Gesture Video Dataset and HoloLens ARAAT Dataset).
The SKIG Dataset was captured by a Kinect device that included an RGB camera and a depth camera. The Assembly Gesture Video Dataset contains 437 assembly operation gesture video sequences collected online from different uploaders on YouTube, demonstrated in Figure 13. The dataset has 6 standard assembly actions: "Pointing", "Moving", "Grasping", "Releasing", "Scaling", "Rotating", and the transition action "Null". These 6 standard actions were generalized according to the ASME standard operations [28]. The resolution of each video is 640 × 480. The HoloLens ARAAT Dataset is constructed from various action videos captured by the HoloLens RGB camera. The videos of ARAAT tasks used in the experiments were captured by HoloLens at a resolution of 1280 × 720 with 30 frames per second (fps). The dataset is divided into two parts: the single action dataset and the ARAAT task dataset. The single action dataset contains 6 types of actions: "Pointing", "Moving", "Grasping", "Releasing", "Scaling", and "Rotating". Videos were recorded of 30 participants, who conducted the actions of real assembly operations based on the ASME standard operations. Each participant performed the 6 actions 10 times each, which created 60 videos per participant and 1800 videos in total. Each video has approximately 100-150 frames. The single action dataset is split into training and testing sets. The ARAAT task dataset was also recorded by the 30 participants and is used as the testing set in the boundary segmentation experiments. Each participant performed ARAAT tasks containing the 6 types of actions in any order at will, 10 times, producing 300 testing task videos.

Public Datasets
The results on the SKIG Dataset and the Sebastien Marcel Dynamic Hand Posture Dataset are presented in Tables 1 and 2, respectively. On the SKIG Dataset, the proposed algorithm was compared with RGGP + RGB-D [33], 4DCOV [35], Depth Context [36], HOG + LBP [37], and DLEH2 (DLE + HOG2) [38] and achieved state-of-the-art accuracy. The proposed algorithm is more attentive to the spatial-temporal consistency of gestures during feature learning, which is not reflected in HOG + LBP or DLEH2. The comparison results confirm that spatial-temporal consistency plays an important role in gesture recognition. On the Sebastien Marcel Dynamic Hand Posture Dataset, the proposed algorithm was compared with Discriminative Canonical Correlation Analysis (DCCA) [39], Tensor Canonical Correlation Analysis (TCCA) [40], Product Manifolds (PM) [41], Genetic Programming (GP) [42], Tangent Bundles (TB) [43], and the 3D Covariance spatio-temporal descriptor (Cov3D) [44]. As shown in Table 2, the proposed algorithm outperforms the others by at least 3% in recognition accuracy. Among the algorithms in Table 2, Cov3D produced a result second only to the proposed algorithm because it learned the spatial-temporal features of the gestures. The remaining results shown in Tables 1 and 2 are not as good as those of the proposed algorithm because dynamic spatial features alone cannot fully represent the gestures. This further proves the importance of spatial-temporal consistency information in gesture recognition.

Homemade Datasets
We also evaluated the proposed algorithm against the others on the homemade datasets. Five-fold cross-validation was used in the experiments, dividing the data into 80% for training and 20% for testing each time. Tables 3 and 4 show the accuracy of the proposed algorithm on the Assembly Gesture Video Dataset and the HoloLens ARAAT Dataset, respectively. Tables 5 and 6 present the recognition results and processing time on the homemade datasets compared with SSBoW [45], DSBoW [46], DTBoW [47,48], and DFW [49]. The recognition rates of the proposed algorithm for each action were all over 90%. Outstanding performances of over 96% were achieved for several actions, such as "Moving" and "Null". The average accuracy over all actions was 93.1% and 93.3% for the Assembly Gesture Video Dataset and the HoloLens ARAAT Dataset, respectively. The results show that the proposed algorithm obtains the highest recognition accuracy of all the algorithms in the tables. In particular, for "Pointing" and "Scaling", the proposed algorithm outperforms all the others by a margin that showcases its advantage in dynamic gesture recognition. Among all the algorithms, SSBoW performed the worst on all action recognitions; this is because SSBoW could not clearly classify each frame of the input video into the correct type, which led to an overall low recognition rate. Frame-to-frame recognition results are presented in Figures 14 and 15. Within an action unit, the proposed algorithm considerably outperformed the other approaches with the highest accuracy, while SSBoW produced the lowest results. Figures 14 and 15 explain why SSBoW could not produce good accuracy for each action. To evaluate the performance of the proposed algorithm, we also conducted boundary segmentation experiments on the HoloLens ARAAT dataset. We labeled the ground truth boundary segmentation of the videos in advance. Because the action "Null" is a meaningless transition action, we excluded it from the ground truth.
We developed a measure to evaluate the segmentation accuracy (SA), expressed as

SA = |F_result ∩ F_groundtruth| / |F_groundtruth|,

where F_result denotes the frames of the segmentation result and F_groundtruth denotes the segmentation frames of the ground truth.

Figure 16 demonstrates the accuracy of each action's segmentation on the HoloLens ARAAT Dataset. DFW provided the best recognition results compared to SSBoW, DSBoW, and DTBoW, but it is still outperformed by the proposed algorithm. This is because DFW is incapable of segmenting each action explicitly, so incorrectly segmented frames interfere with its recognition accuracy. Through the optimal boundary search, the proposed algorithm can precisely separate each action within a long input video. The segmentation results also explain why the proposed algorithm outperforms the other approaches in action recognition accuracy.

Figure 16. Accuracy of action boundary segmentation on the HoloLens ARAAT dataset.

Figure 17 presents the accuracy at different resolutions to further support the effectiveness of the proposed algorithm. When the resolution was at least 160 × 120, the accuracy was over 90%. However, the accuracy was under 90% at 80 × 60 and only 64% at 40 × 30. Although accuracy drops at these low resolutions, the proposed algorithm remains reliable in most situations because such low-resolution inputs are rarely used in current applications.
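A frame-level reading of SA can be sketched as below; this is an illustrative assumption (SA taken as the fraction of ground-truth frames covered by the segmentation result), not the authors' implementation:

```python
def segmentation_accuracy(result_frames, gt_frames):
    """Frame-level segmentation accuracy: the share of ground-truth
    frames that the predicted segment also covers. Frames are
    represented simply as integer indices."""
    result, gt = set(result_frames), set(gt_frames)
    return len(result & gt) / len(gt)

# A predicted segment covering frames 12..29 against a ground-truth
# segment covering frames 10..29 recovers 18 of 20 frames.
sa = segmentation_accuracy(range(12, 30), range(10, 30))
```

Under this reading, a late-starting or early-ending predicted boundary directly lowers SA in proportion to the missed frames.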

Result of Operation Recognition and Prediction
In the ARAAT system, the action sequence is important for evaluating the standard and achievement of trainees' operations. Thus, recognition and prediction of the operations are as necessary as action recognition.
To validate the reliability of the proposed algorithm, 30 participants took part in performing real-time ARAAT tasks. Half of the participants were beginners in assembly operations, and the other half had basic assembly knowledge but little practical experience. Participants completed assembly training tasks in an application on HoloLens written in C#. The demonstration scenario is shown in Figure 18. Participants could assemble the workpieces in each task in an arbitrary order depending on their own needs. During assembly, participants chose whether to use their bare hands or various virtual tools, such as a wrench or a screwdriver, to complete the tasks, as illustrated in Figure 18. The tasks were mainly completed with six actions: "Pointing", "Moving", "Grasping", "Releasing", "Scaling", and "Rotating", and three operations: "Inserting", "Fastening", and "Equipping".

To evaluate the sequence recognition accuracy and the prediction efficiency, we developed two measures, the sequence accuracy (A_sequence) and the degree of early recognition (DER), defined respectively as

A_sequence = (Σ_{i,j} E(Â_{i,j} = A_{i,j})) / N,

DER = N_re / N_total,

where Â_{i,j} is the ground truth of A_{i,j}; E(Â_{i,j} = A_{i,j}) outputs 1 if Â_{i,j} = A_{i,j} holds and 0 otherwise; N is the total number of actions in the sequence; N_re is the number of frames elapsed when the operation is recognized; and N_total is the total number of frames of the whole operation. A lower DER means that operations are recognized earlier, and a higher DER means that they are recognized later. The experimental results shown in Table 7 indicate that the proposed algorithm can recognize operations after observing only about 40% of their frames while maintaining a high recognition rate of 93.5%. This means the prediction gives the machine enough time to react and provides trainees with a smooth and friendly human-machine interaction in ARAAT.
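The two measures can be illustrated with a short sketch; the function names and the toy label sequence are assumptions for illustration only:

```python
def sequence_accuracy(predicted, ground_truth):
    """A_sequence: the fraction of actions in a sequence whose predicted
    label matches the ground truth (the indicator E contributes 1 on a
    match and 0 otherwise)."""
    assert len(predicted) == len(ground_truth)
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(ground_truth)

def degree_of_early_recognition(n_re, n_total):
    """DER = N_re / N_total: the fraction of an operation's frames that
    had elapsed when the operation was recognized; lower is earlier."""
    return n_re / n_total

# One mismatch in a four-action sequence; recognition after 40 of 100 frames.
pred = ["Pointing", "Moving", "Grasping", "Releasing"]
gt   = ["Pointing", "Moving", "Grasping", "Rotating"]
acc = sequence_accuracy(pred, gt)            # 0.75
der = degree_of_early_recognition(40, 100)   # 0.4
```

A DER of 0.4 corresponds to the reported case of recognizing an operation after only 40% of its frames have elapsed.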

Conclusions
ARAAT is an effective and affordable technique for labor training in the automobile and electronics industries. In this paper, we developed an ARAAT system that transforms the complicated ARAAT task evaluation into a gesture recognition problem and proposed a gesture recognition and prediction algorithm. We built a complicated ARAAT task model in which a task is decomposed into a series of hand operations and each hand operation is further decomposed into several continuous actions corresponding to gestures. We defined five typical tasks, three typical operations, and six standard actions based on practical assembly work; defined an action unit to reduce the feature dimensions during recognition; and iteratively computed a score probability density distribution to optimize the gesture boundaries and reduce interference from invalid gestures. Furthermore, we simultaneously extracted 2D static and 3D dynamic features of the standard gestures to improve the gesture recognition precision and proposed an action and operation prediction method for a short response delay and natural interaction. The proposed algorithm was evaluated on two public datasets and two homemade assembly datasets, achieving a recognition rate of 93.5% while recognizing operations after only about 40% of their frames had elapsed. The experimental results showed that the proposed algorithm increases recognition accuracy and reduces the computational cost, which helps ensure reliability in ARAAT task evaluation and improves the experience of human-machine interaction. Although the procedures of ARAAT are relatively static and predictable, it remains a challenge to handle different assembly difficulties, various products, and the rapid updating of assembly skills. Therefore, in the future, we will pay more attention to research on assembly operations adapted to different assembly difficulties and various products, and try to build a new ARAAT system that can provide guidelines to trainees on updated assembly skills.
Institutional Review Board Statement: Ethical review and approval were waived for this study due to the nature of the data collected, which do not involve any personal information that could lead to the later identification of the individual participants. The participant in Figure 18 is the first author.