Hand-Guiding Gesture-Based Telemanipulation with the Gesture Mode Classification and State Estimation Using Wearable IMU Sensors

Abstract: This study proposes a telemanipulation framework with two wearable IMU sensors that requires no human skeletal kinematics. First, the states (intensity and direction) of spatial hand-guiding gestures are separately estimated through the proposed state estimator, and the states are then combined with the gesture's mode (linear, angular, and via) obtained with the bi-directional LSTM-based mode classifier. The spatial pose of the 6-DOF manipulator's end-effector (EEF) can be controlled by combining the spatial linear and angular motions based on the integration of the gesture's mode and state. To validate the significance of the proposed method, teleoperation of the EEF to designated target poses was conducted in the motion-capture space. As a result, it was confirmed that the mode could be classified with 84.5% accuracy in real time, even during the operator's dynamic movement; the direction could be estimated with an error of less than 1 degree; and the intensity could be successfully estimated with the gesture speed estimator and finely tuned with the scaling factor. Finally, it was confirmed that a subject could place the EEF within an average of 83 mm and 2.56 degrees of the target pose with fewer than ten consecutive hand-guiding gestures and visual inspection in the first trial.


Introduction
Recently, as the accelerating spread of smart factories has gradually increased the demand for more sophisticated manipulation, remote control and teleoperation technology based on user-friendly and flexible gesture recognition has begun to attract attention [1]. For gesture-based remote control of a manipulator, the central research interest of this study, sufficiently high gesture recognition accuracy is essential as the underlying element technology for precise and accurate control. The sensors for hand-gesture-based remote control of manipulators can be categorized into vision, depth camera, EMG, and IMU sensors, among others.
First, we examine research cases using vision sensors, such as RGB cameras and Kinect. In the study by C. Nuzzi et al. [2], data from 5 classes collected through an RGB camera were classified with an accuracy of 92.6% using the R-CNN algorithm. However, they reported limitations such as light reflection, difficulty extracting the boundary between background and hand, and the limited working area of the camera FOV. To overcome these limitations of RGB cameras, W. Fang et al. [3] proposed a recognition method for 37 hand gestures using a CNN and a DCGAN, collecting the gestures under various environmental conditions with artificial light sources. D. Jiang et al. [4] classified 24 hand gestures collected with a Kinect with an accuracy of 93.63% through a CNN. However, like RGB sensors, the Kinect also suffers from a limited FOV and low-illuminance conditions [5].
Vision sensor-based gesture recognition methods are frequently employed for remote control of manipulators. One study [6] utilized Kinect V2 and OpenPose [7] to develop a real-time human-robot interaction framework for robot teaching through hand gestures, incorporating a background invariant robust hand gesture detector. The researchers employed a pre-trained state-of-the-art convolutional neural network (CNN), Inception V3, alongside the OpenSign dataset [8] to classify 10 hand gestures. With 98.9% accuracy in hand gesture recognition, they demonstrated gesture-based telemanipulation using an RGB camera. However, this approach requires users to memorize perceivable gestures for the robot, and the vision sensor's depth range constrains its capabilities. In addition, the system has only been tested indoors and may struggle in bright light due to the resulting contrast in RGB images. In cases where Kinect's skeletal information is used, researchers have successfully controlled the speed and steering of a mobile robot [9] and the position of a 5-axis manipulator [10]. Nevertheless, performance and usability issues often arise in research using vision sensors, such as limited field of view (FOV), light reflection, occlusion, and illumination. As a result, this approach is considered nearly infeasible in industrial settings, where operators must stand and perform gestures while facing the monitor screen.
EMG sensors' limited controllable degrees of freedom (DOFs) and dependency on human kinematic models make them unsuitable for telemanipulation applications when used alone. Vogel [11] combined sEMG with Vicon motion-capture camera systems to record EMG signals from the wrist and pose information to remotely control the DLR LWR-III manipulator and train machine learning models. Furthermore, to minimize occlusion effects in gesture-based telemanipulation using only Kinect, an approach that combined hand posture recognition based on sEMG-derived biofeedback information was introduced [12]. In a study employing IMU and EMG sensors [13], six static hand motions were recognized and used to control a robot arm by mapping each motion to the corresponding robot arm movement.
In a study of IMU-based gesture recognition [14], six hand gestures were recognized with an average accuracy of 81.6%, and telemanipulation was achieved only with predefined motions mapped to each gesture. In a study using the operator's skeletal kinematic model [15], omnidirectional manipulation was achieved by estimating the hand motion trajectory, even though challenges persisted in the uncertainties of pose estimation and in differentiating between unintentional and intentional motions. However, in most cases of IMU-based motion recognition, if the operator's initial body alignment, determined just after sensor calibration, does not hold, the accuracy of dynamic gesture recognition drops drastically. In a study [16] on human motion tracking using a set of wearable IMU sensors, the authors used not a body-fixed reference frame but an earth-fixed frame for calculating the joint positions between body segments, following the reference method from the biomechanics domain [17]. Thus, in this case, the time-variant body-heading direction does not matter, because the joint angle of the human body, an essential feature of the recognition model, is less affected by changes in the body-heading orientation. To apply this method, the segment axes should be determined segment by segment through predefined joint movements, such as pronation-supination for the upper-limb joints [18] and flexion-extension for the lower-limb joints [19]. Moreover, the relation of the segments to the global reference frame should be identified after estimating the relative pose of each sensor to its segment. Then, the joint position can be calculated from the two connecting segments. This method becomes time-consuming and inconvenient as the number of joints of interest increases.
There has been another approach to securing consistent reference inertial measurement frames in IMU sensor-based human motion analysis of lower-limb [20] and upper-limb [21] motion, using the simple stand-stooping sensor calibration gesture introduced in a study [22]. Regarding the body-fixed frame, these studies addressed the critical issue of variations in the subject's body alignment, which can affect the accuracy and consistency of motion analysis and gesture recognition using wearable IMU sensors. The researchers emphasized the importance of updating the body-fixed frame according to the subject's heading direction and body alignment changes. By doing so, the proposed method can effectively account for the time-varying nature of the body-fixed frame and maintain accurate gesture recognition, even when the wearer's body alignment undergoes significant changes. They also demonstrated the efficacy of the proposed method through experimental evaluations, which showed that the recognition performance remains robust, even in the presence of substantial body alignment changes. The results suggest that the approach effectively addresses the challenges associated with body-fixed frame variations and can improve the utility of wearable inertial sensor-based systems for gesture recognition and related applications.
Thus, considering the method of the floating body-fixed frame [20,21], this study proposes a new hand-guiding gesture-based telemanipulation approach that addresses the limitations identified in previous hand-guiding gesture-based telemanipulation studies. As a result, this study makes the following research contributions:

1.
This study proposes a novel spatial pose telemanipulation method for a 6-DOF manipulator that requires no human skeletal kinematic model and only two IMU sensors, based on an unprecedented combination of the gesture's mode and states.

2.
Consistent hand-guiding gesture mode classification and state estimation were successfully achieved by integrating the floating body-fixed frame method into the proposed telemanipulation method, even during the operator's dynamic movements.
The rest of this paper is organized as follows. In Section 2, we describe the problem definition. In Section 3, we present details of the proposed hand-guiding gesture-based manipulator remote-control method. Then, we describe the experimental validation results in Section 4. In Section 5, we conclude the paper.

Problem Definition
As discussed earlier, studies on hand gesture-based telemanipulation using IMU sensors can be categorized into two approaches:

1.
Gesture recognition model-based method: this approach enables robot arm control through specific hand gestures mapped to corresponding motions. However, it is limited in its capacity to control the robot arm in all directions.

2.
Skeletal kinematic model-based method: this approach allows for omnidirectional control of the robot arm by replicating the hand movements of a human worker. Despite its versatility, it faces challenges in implementing pure linear or angular motion and in differentiating between intended and unintended control motions. Moreover, accurately discerning the operator's motion intent without external sensor-based error feedback remains unattainable.
In this study, as illustrated in Figure 1, a single hand-mounted IMU sensor is employed to estimate hand gesture states, including intensity and direction. Concurrently, hand gesture modes are classified using a bi-directional LSTM. The omnidirectional control of a spatial manipulator with six or more degrees of freedom is achieved by combining the corresponding mode and state. Moreover, by utilizing a real-time pose-tracking controller, the operator can update the target pose at desired moments, even while the manipulator moves towards the target pose, creating a natural trajectory up to the final target pose. Figure 1 illustrates the framework of the hand-guiding gesture-based teleoperation strategy. The operator fastens wearable IMU sensors to their pelvis and hand. The pelvis-mounted sensor is essential for updating the orientation of {FBf}, a virtually created body reference frame, in response to changes in the operator's heading direction. A hand-mounted IMU sensor is necessary to classify the operator's hand gesture mode and estimate its states. If the operator keeps the initial alignment with {FBf} constant, there is no need for a pelvis-mounted IMU sensor. The hand-mounted IMU sensor's output is expressed with respect to the {FBf} frame and has an update rate of 100 Hz. A bi-directional LSTM model is employed to classify the three gesture modes; the model has a time horizon of 400 ms and classifies the gesture mode every 0.01 s. Ultimately, by combining the mode and state of the hand-guiding gesture, the target pose is generated with three modes: intentional gesture, linear gesture for pure translational motion, and angular gesture for pure rotational motion of the EEF. These three modes enable the executions of the manipulator's spatial pose control described below.

Method
As mentioned earlier, the gesture's mode classification and state estimation should be combined to realize the spatial pose telemanipulation of a 6-DOF manipulator without a human skeletal kinematic model. In addition, the floating body-fixed frame method should also be considered to realize consistent gesture mode classification, even during the operator's dynamic movement. Thus, this section is organized as follows: (Section 3.1) Floating body-fixed frame, (Section 3.2) Bi-directional LSTM-based hand-guiding gesture classification, and (Section 3.3) Hand-guiding gesture's state estimation.

Floating Body-Fixed Frame
This section presents a protocol for generating an {FBf} that updates according to the operator's body heading direction. Figure 2 illustrates the creation of a {Bf} through a stand-stooping calibration gesture. Given that human body activity is explained by the body motion planes (frontal plane, sagittal plane, transverse plane), it is assumed that the initial {Bf} aligns with the body motion planes. By calculating the orientation difference between the IMU sensors with arbitrary attachment postures and the ideal {Bf} frame through the calibration gesture, the IMU sensor's orientation is corrected to match the posture of {Bf}. For detailed {Bf} generation methods, please refer to the paper [6] describing the FGCD algorithm. The operator's body heading direction aligns with the positive x-axis of the initially created {Bf}, and the z-axis of {Bf} remains vertically upward with respect to the inertial frame. Subsequently, using the pelvic IMU's orientation, {Bf} is continuously updated to align the operator's current body heading direction with the x-axis of {Bf}. This concept allows gesture mode recognition and state estimation with respect to the initially created {Bf}, even when the body-heading direction varies dynamically. Equation (1) represents the formula for generating {FBf}.
G denotes the inertial frame, {FBf} denotes the floating body-fixed frame, and {Bf} denotes the body-fixed frame. Sf,pelvic denotes the sensor-fixed frame of the pelvic IMU, and Sf,hand denotes the sensor-fixed frame of the hand-mounted IMU. The subscripts "stand" and "stoop" denote the operator's posture when the orientation of the corresponding sensor is recorded. "Sc" represents a calibrated sensor-fixed frame. "Cj" is the new sensor-fixed frame of each sensor at the moment {FBf} is initially created. Equation (2) converts the IMU sensor's orientation, initially expressed in the inertial frame, into an expression for frame {FBf}. To transform the acceleration and angular velocity outputs from the sensor-fixed frame to frame {FBf}, Equations (3) and (4) are utilized. Unlike orientation, acceleration and angular velocity are outputs with respect to the sensor-fixed frame of the sensor itself. Consequently, the constant transformation defined in the final step of calibration, which transitions from {Sf} to {Sc}, is not required for these outputs.
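The floating body-fixed frame update described above can be sketched in numpy as follows: the heading (yaw) of the pelvic IMU orientation is extracted, {FBf} is rebuilt as a pure rotation about the vertical axis, and sensor outputs are re-expressed in {FBf}. This is a minimal sketch under the assumption that orientations are available as rotation matrices from each frame to the inertial frame; the function names are illustrative, not from the paper.

```python
import numpy as np

def rot_z(psi):
    """Rotation matrix for a rotation of psi radians about the global z-axis."""
    c, s = np.cos(psi), np.sin(psi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def yaw_about_z(R):
    """Heading angle of a frame: project its x-axis onto the horizontal plane."""
    x_axis = R[:, 0]
    return np.arctan2(x_axis[1], x_axis[0])

def update_floating_body_frame(R_G_pelvis):
    """Return R_G_FBf: x-axis tracks the operator's current heading,
    z-axis stays vertically upward with respect to the inertial frame."""
    return rot_z(yaw_about_z(R_G_pelvis))

def express_in_fbf(R_G_FBf, v_G):
    """Re-express a vector given in the inertial frame {G} into {FBf}."""
    return R_G_FBf.T @ v_G
```

Because only the heading is taken from the pelvic IMU, pelvis pitch and roll during walking do not tilt {FBf}, which is what keeps the classifier features consistent during dynamic movement.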

Bi-Directional LSTM-Based Hand-Guiding Gesture Classification
Figure 3 illustrates that the linear gesture, representing pure translational motion, and the angular gesture, corresponding to pure spatial rotational motion, each consist of an intentional motion and a return motion. Upon visual inspection, it becomes evident that the intentional motions share a similar shape across executions, which is also observable in the return motions. Moreover, it is noteworthy that the considerable intensity difference between the intentional and return motions of the hand-guiding gestures depicted in Figure 3 inevitably degrades recognition accuracy and model generalization performance. Thus, in our preceding study [21], RNN-based models, which perform better at context understanding, were considered in order to realize an accurate gesture recognition model despite these significant intensity differences. Among them, the bi-directional LSTM model showed the best performance in recognizing the hand-guiding gestures in Figure 3. That is the reason why the bi-directional LSTM, which incorporates both past and future information at each time step, was chosen as the motion recognition model for this study.
Figure 4 presents the framework for recognizing hand gesture modes using a bi-directional LSTM. First, 9-dimensional data, consisting of orientation, acceleration, and angular velocity expressed with respect to {FBf}, are chosen as features. The selected data are extracted using the sliding-window method, and the extracted windows are used as inputs for the bi-directional LSTM model. Following the results of the preceding study [21], the size of the sliding window was set to 9 × 40. As the intensity of the hand-guiding gesture varies among operators, the data extracted through the sliding window undergo a normalization process. In addition, Table 1 presents detailed computational cost- and hyperparameter-related information of the model.
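The sliding-window feature extraction and per-window normalization described above can be sketched as follows. This is a minimal numpy sketch assuming a 9 × T feature stream (orientation, acceleration, angular velocity w.r.t. {FBf}) sampled at 100 Hz, so a 40-sample window spans 400 ms; the function names and the z-score normalization choice are illustrative.

```python
import numpy as np

WINDOW = 40   # 400 ms at a 100 Hz sensor update rate
FEATURES = 9  # orientation (3) + acceleration (3) + angular velocity (3) w.r.t. {FBf}

def extract_windows(stream, window=WINDOW, stride=1):
    """Slide a (FEATURES x window) window over a (FEATURES x T) feature stream."""
    T = stream.shape[1]
    return [stream[:, i:i + window] for i in range(0, T - window + 1, stride)]

def normalize_window(w, eps=1e-8):
    """Per-window z-score normalization so that intensity differences between
    operators do not dominate the classifier input."""
    return (w - w.mean(axis=1, keepdims=True)) / (w.std(axis=1, keepdims=True) + eps)
```

Each normalized 9 × 40 window would then be fed to the bi-directional LSTM classifier (e.g., a Keras `Bidirectional(LSTM(...))` stack), which outputs one tentative gesture mode per step.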

Hand-guiding Gesture's State Estimation
After obtaining the transformation matrix from t − 1 to t of the sensor-fixed frame of the hand-mounted IMU through Equation (5), the resultant transformation matrix is converted into an axis-angle representation through Equation (6) to avoid the singularity issue of the ZYX Euler angle convention. Through Equation (7), the instantaneous rotation axis expressed w.r.t. {FBf} of the hand-mounted IMU, i.e., the direction of the angular gesture state, {FBf}ω_{Sf,hand,t}, can be obtained, and the intensity, θ_h, can also be obtained.
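The axis-angle conversion of Equations (5) and (6) can be sketched as follows: the relative rotation of the hand sensor frame between two time steps is computed, and the standard SO(3) log map recovers a unit axis (gesture direction) and an angle (gesture intensity). This is a minimal sketch with illustrative function names, not the paper's exact implementation.

```python
import numpy as np

def relative_rotation(R_prev, R_curr):
    """Rotation of the hand sensor frame from t-1 to t (Equation (5)-style)."""
    return R_prev.T @ R_curr

def axis_angle(R, eps=1e-9):
    """Convert an SO(3) matrix to (unit axis, angle), avoiding the singularity
    issues of the ZYX Euler angle convention."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < eps:
        return np.array([0.0, 0.0, 1.0]), 0.0  # arbitrary axis for a zero rotation
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return axis, theta
```

Note that this formula degrades near θ = π (sin θ → 0); a production implementation would add the usual special case for rotations close to a half turn.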
r_{i,j} represents each component of the SO(3) matrix obtained through Equation (5). To generate a command for the pure rotational motion of the manipulator's EEF, {FBf} is replaced with {B}, the base frame of the manipulator, as in Equation (8), and the unit screw axis S can be defined according to Def. 1. The parameter λ_a ∈ R is a scale factor for user control of the estimated gesture intensity, and Equation (10) is used to obtain the target orientation of the manipulator's EEF. After calculating the average velocity by integrating the acceleration output from the hand-mounted IMU using Equation (11), the normalized result of the average velocity is determined as the direction of the corresponding gesture through Equation (12). The parameter n denotes the quantity of data collected during the intentional motion segment of the respective linear gesture.
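The linear gesture state estimation of Equations (11) and (12) can be sketched as follows: the acceleration samples of the intentional motion segment, expressed in {FBf}, are integrated to a velocity history, whose average gives the gesture intensity (norm) and direction (normalized value). The function name and the rectangular integration rule are illustrative assumptions.

```python
import numpy as np

def linear_gesture_state(acc_fbf, dt=0.01):
    """Estimate a linear gesture's direction and intensity from an (n x 3) array
    of hand acceleration samples expressed in {FBf} over the intentional motion
    segment (dt = 0.01 s for a 100 Hz sensor)."""
    vel = np.cumsum(acc_fbf, axis=0) * dt       # velocity by rectangular integration
    v_avg = vel.mean(axis=0)                    # average velocity over the segment
    intensity = np.linalg.norm(v_avg)
    direction = v_avg / intensity if intensity > 0 else np.zeros(3)
    return direction, intensity
```

Restricting the integration to the intentional motion segment matters here: including the return motion would largely cancel the average velocity and corrupt both direction and intensity.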
As in the case of the angular gesture described above, to generate a command for the pure translational motion of the manipulator's EEF, {FBf} is replaced with {B}, the manipulator's base frame (Equation (13)), and the unit screw axis S is defined according to Def. 1, as in Equation (14). Moreover, a scale factor is defined for adjusting the estimated gesture's intensity, and Equation (15) is used to obtain the target position of the manipulator's EEF.
Definition 1 (Screw axis). For a given reference frame, a screw axis S of a joint can be written as S = (ω, v), where either (i) ∥ω∥ = 1 or (ii) ω = 0 and ∥v∥ = 1. If (i) holds, then v = −ω × q + hω, where q is a point on the axis of the screw, and h is the pitch of the screw (h = 0 for a pure rotation about the screw axis). If (ii) holds, then the pitch of the screw is infinite, and the twist is a translation along the axis defined by v. Although we use the pair (ω, v) for both a normalized screw axis S (where one of ∥ω∥ or ∥v∥ must be unity) and a general twist V (where there are no constraints on ω and v), the meaning should be clear from the context [23].
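Definition 1 translates directly into code: a screw axis is assembled either from a unit rotation axis through a point (case (i)) or from a unit translation direction (case (ii)). This is a minimal sketch with an illustrative function signature.

```python
import numpy as np

def screw_axis(omega_hat=None, q=None, h=0.0, v_dir=None):
    """Build a unit screw axis S = (omega, v) per Definition 1.
    Case (i):  unit rotation axis omega_hat through point q with pitch h
               -> v = -omega_hat x q + h * omega_hat.
    Case (ii): pure translation along unit direction v_dir -> omega = 0."""
    if omega_hat is not None:
        v = -np.cross(omega_hat, q) + h * omega_hat
        return np.concatenate([omega_hat, v])
    return np.concatenate([np.zeros(3), v_dir])
```

For the angular gesture, case (i) with h = 0 yields the pure-rotation command about the estimated axis; for the linear gesture, case (ii) with the estimated direction yields the pure-translation command.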

Experiments
This section examines the performance of the gesture mode classifier and state estimator. Then, experimental studies teleoperating the UR5e manipulator to target poses with the developed hand-guiding gesture-based telemanipulation method, combining the mode and state, were conducted in the motion-capture space.

Gesture Recognition
After subjects wore two XSENS MTw wearable IMU sensors, the datasets were collected for three predefined hand-guiding gesture modes. As shown in Figure 5, all subjects moved freely within the motion-capture space and repeatedly performed hand-guiding gestures in all directions during dataset acquisitions. Table 2 presents detailed training, validation, and test dataset acquisition information.
The detailed performances of the bi-directional LSTM model in the model-training step are presented in Table 3, showcasing the classification accuracies achieved. Table 4 also shows the results of the real-time telemanipulation experiments, specifically focusing on hand-guiding gesture recognition using the trained model. The corresponding confusion matrices for both the model training and the real-time experiments can be found in Figure 6. Moreover, according to these confusion matrices, it is noteworthy that a noticeable number of misclassifications inevitably exists, which might result in jerky manipulator motion due to undesired switching into incorrect gesture modes. To address this issue, as depicted in Figure 7, the latest twenty tentative modes are saved in the tentative mode history repository, and the most frequent mode is selected as the current gesture mode. In other words, by incorporating a 0.2 s sliding window, the jerky motion of the manipulator caused by misclassifications can be effectively mitigated. However, it should also be noted that this implementation introduces a minimum time delay of 0.2 s for the desired mode switching.
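The majority-vote filter over the tentative mode history can be sketched as follows: the latest twenty tentative classifications (0.2 s at 100 Hz) are kept in a bounded buffer, and the most frequent entry is returned as the current mode. The class name is illustrative.

```python
from collections import Counter, deque

class ModeFilter:
    """Majority vote over the latest 20 tentative mode classifications
    (a 0.2 s window at 100 Hz) to suppress jerky mode switching caused
    by occasional misclassifications."""

    def __init__(self, size=20):
        self.history = deque(maxlen=size)  # oldest entries drop out automatically

    def update(self, tentative_mode):
        """Record one tentative classification and return the filtered mode."""
        self.history.append(tentative_mode)
        return Counter(self.history).most_common(1)[0][0]
```

A single misclassified sample cannot flip the output; the trade-off, as noted above, is that a genuine mode switch is confirmed only after the new mode dominates the buffer, i.e., after up to 0.2 s.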

Gesture State Estimation
In this study, a testbench, as shown in Figure 8, was prepared to verify the operator's control intention estimation performance through the combination of mode recognition and state estimation described in Sections 3.2 and 3.3. The details of the test bench and measurement information are as follows.

1.
Testbench: Within the testbench, 6 Prime 13 cameras, 2 retro-reflective marker sets (hand, manipulator's EEF), and a workstation with MOTIVE 2.1.1 (NaturalPoint Inc., Corvallis, OR, USA) software are installed. The pose of the marker-fixed frame is measured with respect to the Optitrack-fixed frame {Of}, defined by an L-shaped calibration square (CS-200) located within the motion-capture volume.

2.
Hand trajectory: One wireless IMU sensor each is attached with a strap to the back of the subject's pelvis and to the right hand, and a marker set is attached to the right hand to measure its position with respect to {Of}.

3. Gesture state: All outputs of the wireless IMU sensor are converted to be expressed with respect to {FBf}, and the converted features are output as the gesture mode and state through the models described in Sections 3.2 and 3.3. Note that the transformation between {FBf} and {Of} defined by the calibration square cannot be accurately identified. However, the z-axis of both frames is [0 0 1], and the authors aligned the body-heading direction with the L-square-heading direction as closely as possible during the calibration gesture.

Figure 9 presents the measurement results for the linear gesture's hand trajectory, estimated direction, and estimated intensities. Examining the results projected onto the X-Y plane reveals that the estimated direction of the linear gesture follows the direction of the hand trajectory. The hand trajectory was measured with respect to the {Of} frame, while the gesture direction was estimated with respect to the {FBf} frame, i.e., in different reference frames. After transforming the estimated direction into the {Of} frame, its origin was set to the starting point of the hand trajectory for improved readability. The estimated intensity is multiplied by a scaling factor ranging from 0.1 to 0.4 to determine the target position dots. Thus, it was experimentally validated that the position of the manipulator's end-effector can be controlled according to the direction and intensity intended by the user.

Figure 10 illustrates the hand trajectory, hand orientation, estimated rotational axis, and axis of pure rotational motion of the EEF for the angular gesture. In contrast to the linear gesture, direction estimation for the angular gesture requires the hand-mounted IMU's orientation, as described in Equations (5)-(7). Consequently, the pose of the hand-mounted marker, expressed with respect to the {Of} frame at the start and end points of the intentional motion section of the angular gesture, is displayed.
In an angular gesture experiment with more than five subjects, it was observed that the hand sometimes performed the angular gesture while slightly bent toward the forearm. This can introduce an angular error between the intended and estimated rotation axes, so the two directions need to be compared. Therefore, the estimated rotational axis and the axis of pure rotational motion of the EEF are shown together in Figure 10, and the angle between the two axes was confirmed to be 0.57 degrees. These results confirm that the user can command the pure rotational motion of the EEF according to the intended direction and intensity. The synthesis of linear and angular motions is discussed in Section 4.3.
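How a classified mode and its estimated state map to an EEF pose command can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: `direction` is the unit vector from the state estimator, `intensity` its estimated gesture speed, `scale` the scaling factor (0.1-0.4 in the experiments), and the angular case applies Rodrigues' rotation formula about the estimated axis.

```python
import math

def linear_target(position, direction, intensity, scale=0.2):
    """Translate the EEF target along the estimated gesture direction."""
    return [p + scale * intensity * d for p, d in zip(position, direction)]

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about unit `axis`."""
    x, y, z = axis
    c, s, t = math.cos(angle), math.sin(angle), 1.0 - math.cos(angle)
    return [
        [t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]

def angular_target(orientation, axis, intensity, scale=0.2):
    """Rotate the EEF target orientation about the estimated rotation axis."""
    rot = rodrigues(axis, scale * intensity)
    # new orientation = rot @ orientation (3x3 matrix product)
    return [[sum(rot[i][k] * orientation[k][j] for k in range(3))
             for j in range(3)] for i in range(3)]

# Example: a linear gesture of intensity 0.5 along +x with scale 0.2
# moves the target position by 0.1 m in x.
target = linear_target([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], 0.5)
```

In this sketch, the scaling factor plays the same role as in the experiments: it maps the estimated gesture intensity to the displacement (translational or rotational) of the commanded target pose.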

Validation of Hand-Guiding Gesture-Based Telemanipulation
Mathematics 2023, 11, x FOR PEER REVIEW

Figure 10. Angular gesture results: hand trajectory and orientation (at the start and end poses) measured by six Optitrack Prime 13 cameras and a reflective marker set, and the intended gesture's direction and intensity estimated by the proposed method; the angle between the IMU-based axis and the Optitrack-based axis is 0.57 degrees.

Figure 11 illustrates the results of telemanipulation moving the UR5e manipulator's EEF from the home pose to the goal pose, as depicted in the framework described in Figure 1. The body-heading direction was allowed to move freely without needing to be fixed. The experimental environment is presented in Figure 8. To track the EEF pose, a reflective marker set was attached, and the frames shown at the starting and ending points of the EEF trajectory in Figure 11 correspond to this marker set. Figure 11a displays the control results using the position trajectory controller of ROS for the linear gesture case, demonstrating that a target pose is generated for each gesture. Due to the characteristics of the trajectory controller, a new target pose cannot be assigned until the EEF reaches the current target point. Over three experiments, the EEF could be moved to within approximately 83 mm of the goal pose on average. This 83 mm was significantly influenced by the unavoidable distance gap caused by part interference between the marker set attached to the EEF and the marker set corresponding to the goal pose.

In Figure 11b,c, the target pose can be updated in real time based on the pose-tracking controller of ROS. Figure 11b presents the results of linear-gesture-only telemanipulation, while Figure 11c shows the results of telemanipulation combining both linear and angular gestures. Notably, in the case of Figure 11c, both the position and orientation were controlled simultaneously, starting from different goal and initial EEF poses. It was confirmed that the angle difference between the final EEF pose and the goal pose was within 2.56 degrees. Additionally, in all experiments, it was verified that the EEF could be moved to the desired goal pose with fewer than ten hand-guiding gestures.

Figure 11. Results of hand-guiding gesture-based telemanipulation without constraint on the operator's fixed body-heading direction: (a) pure translational motion with the "position trajectory control" ROS controller, (b) pure translational motion with the "position grouped control" ROS controller, (c) combined pure translational and rotational motion-based control with the "position grouped control" ROS controller.
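The behavioral difference between the two controller styles used above can be sketched schematically: a trajectory-style controller cannot accept a new goal until the current one is reached, while a tracking-style controller always adopts the latest target. The sketch below is an abstract simulation under assumed names (`TrajectoryStyle`, `TrackingStyle`); it does not use the actual ROS controller interfaces.

```python
class TrajectoryStyle:
    """Accepts a new goal only once the current goal has been reached."""

    def __init__(self):
        self.goal = None

    def set_goal(self, goal, reached):
        # A new target pose cannot be assigned until the EEF
        # reaches the current target point.
        if self.goal is None or reached:
            self.goal = goal
            return True
        return False

class TrackingStyle:
    """Always tracks the most recently commanded target pose."""

    def __init__(self):
        self.goal = None

    def set_goal(self, goal, reached=False):
        # The target pose can be updated in real time.
        self.goal = goal
        return True

traj, track = TrajectoryStyle(), TrackingStyle()
a = traj.set_goal("pose1", reached=False)   # accepted (no active goal)
b = traj.set_goal("pose2", reached=False)   # rejected: pose1 not yet reached
track.set_goal("pose1")
track.set_goal("pose2")                     # both accepted in real time
```

This is why the tracking-style control in Figure 11b,c allows the target pose to be refined continuously during a gesture, whereas in Figure 11a each gesture produces one discrete goal.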


Results and Discussion
This study presents a complete framework for teleoperating a 6-DOF manipulator using two wearable IMU sensors, without relying on a complex human skeletal kinematic model. To validate the performance of the proposed method, teleoperation experiments using a UR5e manipulator for spatial pose control were successfully conducted in the motion-capture space. As a result, the following experimental results were confirmed:
1. Utilizing the floating-body-fixed frame, the hand-guiding gesture mode can be classified with 84.5% accuracy in real time, even during the operator's dynamic movement.

2. The spatial direction of hand-guiding gestures could be estimated with an error of less than 1 degree.
3. The intensity of hand-guiding gestures could be successfully estimated with the gesture speed estimator and finely tuned with the scaling factor.
4. Finally, a subject could place the EEF within an average of 83 mm and 2.56 degrees of the target pose with fewer than ten consecutive hand-guiding gestures and visual inspection on the first trial.
The main contribution of this study is the unprecedented combination of the gesture's mode and states to realize control of both the spatial direction and the displacement with a minimum number of wearable IMU sensors. First, the intensity and direction of hand-guiding gestures were separately estimated through the proposed gesture state estimator and combined with the gesture mode obtained using the bi-directional LSTM-based gesture mode classifier. Based on the integrated gesture mode and state, the manipulator's EEF can be successfully controlled by combining spatial translational and rotational motions. To show the significance of our method, Table 5 compares the method proposed in this study with the most recent studies in terms of recognition accuracy and controllability. While the recognition accuracy of our study falls short of recent research, we introduced an additional mode-selection process for the hand-guiding gesture, which resolves manipulator malfunction caused by misrecognition of the gesture mode. Moreover, our study not only recognizes the mode of the hand-guiding gesture but also estimates its state (direction and intensity), allowing remote control of the manipulator in the direction desired by the user, whereas recent studies can only control the robot in a fixed direction with a predetermined displacement. Therefore, this method is the first to control spatial linear or angular motion up to displacement with only a single hand-mounted and a single pelvis-mounted IMU sensor.