1. Introduction
Despite the increasing prevalence and sophistication of myoelectric prosthetic limb-replacing devices, their acceptance by amputees has seen comparatively little improvement in the course of their development [1]. In their current state, the responsiveness and utility of these devices are inadequate to meet patient requirements for everyday use, as well as to justify the monetary, temporal, and emotional costs associated with amputation recovery [2]. Purely cosmetic prostheses or body-powered devices using hooks in place of replica hands continue to be favored among patients by virtue of their simplicity, regardless of their lower potential range of functionality [2].
Accommodating the needs of upper-body amputees poses a uniquely complex challenge, with the dexterity of an organic hand being considerably more difficult to emulate than the load-bearing functions of the leg and foot [3]. Isolated surveys of upper- and lower-body amputees reflect this; among the former demographic, prosthesis abandonment rates frequently exceed 50 percent [1], while the abandonment of lower-body appliances seldom exceeds 20 percent [4]. Concurrently, the poor control fidelity of fine motions and device activation is a leading source of patient frustration [2]. Each action of a myoelectric device is triggered by a specific set of muscle contractions, and determining the correct action requires modeling many different parameters. Together, these parameters make up the state space and the action space. Developing such a precise mathematical model is a great challenge in itself. As a consequence, the field is increasingly turning to data-driven machine learning (ML), especially approaches that do not require any prior knowledge about the system.
Table 1 summarizes major prosthesis types according to the location of amputation, their corresponding abandonment rates, and follow-up periods after amputation. Aggregated data from patient studies demonstrate the need for a shift in design priorities towards making these devices more user-oriented. The data further demonstrate that, regardless of the application or characteristics of the prosthetic, upper-limb prosthetics are generally abandoned more frequently.
Most machine learning algorithms belong to one of six categories: supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, transduction, and learning to learn. These techniques have been used to train a metasurface-based imager to recognize, monitor, and measure human movements [5]. The utility of supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning has been studied across a wide range of healthcare settings, including health monitoring, mental health, disease management, medical diagnosis, assistive healthcare, physical activity, substance use, and dietary management [6].
A more recent and popular class of machine learning algorithms for controlling highly non-linear, high-dimensional complex systems is reinforcement learning (RL). RL involves algorithms that learn a policy on how to behave based on an observation of the environment, where the actions of the machine agent impact the environment, resulting in feedback. The agent then incorporates the feedback into the algorithm to guide future decisions [7]. RL, a form of approximate dynamic programming, has shown promise in “tuning” robotic knee prostheses [8]. Tuning is the process through which a robotic prosthesis’s control parameters are personalized to the individual user. Reinforcement learning has also been utilized to gather data from an able-bodied subject wearing a robotic limb and transfer that data to a user of a robotic knee prosthesis as a means of improving the tuning process [9].
To address more complicated problems, for example, robotic manipulation of objects, a combination of deep learning with RL, known as “deep reinforcement learning,” has been studied in [10]. Leading strategies for solving the existing problems in prosthetic hand actuation implement AI for shared control of the prosthetic, leaving intention to the user and assigning fine control of strength and coordination to the AI [11]. In related work, artificial neural networks and support vector machines have been tested in the optimization of certain implants used in hip replacement procedures [12].
Most of the common approaches to actuating a prosthetic hand use electromyography (EMG) signals, where electrical signals generated by muscle contractions are captured and filtered to remove noise and retain meaningful signal content. Once a meaningful signal is found, the next step is to extract useful features such as variance, waveform length, and zero crossings. Then, based on the extracted features and using algorithms [13], prosthetic hand movement is generated [14]. For example, researchers have used multichannel surface EMG knit band sensors and wearable smart band textile-based EMG sensors to design their respective prosthetic hands [15,16,17]. In [18], the authors have demonstrated several strategies for electrode placement and discussed different signal processing techniques. However, EMG signal-based prosthetic hand actuation in the real world often suffers from multiple reported challenges, for example, skin irritation [19] due to the adhesive gel used inside the electrode, low battery life, lack of comfort [20], and the demands of low power consumption and high speed of operation [21]. Despite such limitations, EMG-based methods are widely used due to their low cost, non-invasiveness, and direct correlation with muscle activity [15].
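The three time-domain features named above (variance, waveform length, and zero crossings) can be illustrated with a minimal sketch. It assumes a single already-filtered EMG window held in a NumPy array; the function name `emg_features` and the optional amplitude `threshold` used to ignore noise-level crossings are hypothetical choices for illustration, not taken from the cited works.

```python
import numpy as np


def emg_features(window, threshold=0.0):
    """Return variance, waveform length, and zero-crossing count for one window.

    `window` is assumed to be a 1-D sequence of filtered EMG samples.
    `threshold` (an illustrative parameter) discards sign changes whose
    sample-to-sample jump is too small to be meaningful.
    """
    x = np.asarray(window, dtype=float)
    if x.ndim != 1 or x.size < 2:
        raise ValueError("window must be 1-D with at least two samples")

    step = np.diff(x)  # sample-to-sample differences

    # Sample variance of the window (unbiased estimate).
    variance = float(np.var(x, ddof=1))

    # Waveform length: cumulative absolute change along the signal.
    waveform_length = float(np.sum(np.abs(step)))

    # Zero crossings: consecutive samples with opposite signs and a
    # jump larger than the threshold.
    sign_change = x[:-1] * x[1:] < 0
    zero_crossings = int(np.sum(sign_change & (np.abs(step) > threshold)))

    return {
        "variance": variance,
        "waveform_length": waveform_length,
        "zero_crossings": zero_crossings,
    }
```

In practice such features would typically be computed per channel over sliding windows and then passed to whatever algorithm maps features to hand movement; that downstream step is outside the scope of this sketch.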
With respect to myoelectrically controlled prostheses, RL techniques have been used to improve a user’s control [22]. However, these previous works improve the ease of use of prosthetics but do not address physical therapy or training before a patient receives their prosthetic. Fitting a prosthetic is a very time-consuming and expensive process; without proper exercise, muscles and nerves can decay further after surgery. Training with the prosthetic and adjusting it are also difficult. Combining these methods with augmented or virtual reality (AR or VR) simulations allows patients to train ahead of the fitting process, keeping their muscles and nerves in manageable condition and making the transition to their new prosthetic easier. In addition to training, AI can be used to modify the prosthetic structure itself, making it more personal and accessible to each user. An AR/VR simulation has the added advantage of allowing an EMG prosthetic to be tested before its construction, avoiding unnecessary expense for patients who may turn out to dislike a more advanced prosthetic. Patients who have tested their prosthesis in a virtual environment before use have reported less frustration, more enjoyment of their prosthetics, and drastic improvement in prosthetic control and accuracy over those who do not train in AR or VR [23].
This research seeks to help fill this gap by creating a comprehensive framework for the use of RL in a VR scene with human input to provide a gentle and engaging approach to prosthesis rehabilitation, where some complex movements may be automated using an RL agent. An initial demonstration of this framework is completed by training an RL agent to perform a grasping motion. A VR scene is then created to simulate a simpler rehabilitation environment by allowing the user to still control the movement of the prosthetic using an inertial sensor in an EMG armband, while the grasping motion is automated. This framework is envisioned to be further generalized beyond the grasping motion to more complex movements that lead to patient frustration when rehabilitating with EMG-driven prosthetics. For the remainder of this paper, the Unity scene in which the RL agent is trained to perform the grasping motion is referred to as the static VR scene, and the Unity scene in which the RL agent is deployed with human input using the EMG armband is referred to as the dynamic VR scene.
The remainder of this paper is structured as follows. Section 2 provides relevant background information on the creation of augmented reality scenes through Unity, prosthesis control through EMG armbands, and RL using Unity’s machine learning (ML) agents. Section 3 presents the methodology used to create the training environment for the ML agent in the static VR scene and the dynamic VR scene that demonstrates the trained agent. Section 4 covers the results of the ML-Agent training process, particularly lessons learned in creating reward and observation functions to effectively train a reinforcement learning model for usage in an augmented reality scene. Section 5 discusses these results and the limitations of the framework. Finally, Section 6 concludes the paper and discusses possible avenues for future work.
3. Methodology
3.1. Prosthesis Modeling Process
To simulate realistic prosthetic control in a VR environment, an arm is created using Blender, a software package predominantly used for modeling organic objects in three dimensions. The modeling is performed in three stages: first, the ‘armature’ or bone structure is constructed; second, the ‘mesh’ or skin is created; and finally, the two are integrated into a fully rigged and exportable prosthetic model, as shown in Figure 1.
A critical factor in designing the armature is ensuring that the natural degrees of freedom (DoF) of a human hand are accurately represented. Of particular importance is the grasping motion, which consists of 28 DoF. A further breakdown of these DoFs is presented in Section 3.2, but their effects on the modeling process are discussed here. In Blender, each joint is added to the skeleton to increase both visual realism and simulation accuracy. Thus, the challenge is to capture sufficient biomechanical accuracy to simulate realistic grasping without artificially constraining the model.
The armature is built to anatomically represent a human arm, with particular attention given to the complexity of the hand. Each finger is modeled as a hierarchical system of joints, allowing realistic flexion of the fingers. A root joint is placed at the wrist, from which the forearm, palm, and fingers are attached. Establishing this hierarchy is essential to provide visual clarity and proper functionality. Testing is performed via an animation to ensure that the kinematics are correctly established so as not to inhibit anatomically accurate human grasping.
The mesh is modeled to conform tightly to the underlying skeleton, ensuring that deformations caused by joint rotations are smoothly distributed across the surface. Special care is taken around areas of high mobility, such as knuckles and wrists, where poorly designed geometry could lead to visual deformation. The mesh is then bound to the armature using Blender’s weight-painting system, which defines how much influence each joint exerts over nearby vertices. This step is done to achieve realistic deformations during motion. A chronological representation of the modeling process is shown in
Figure 1.
3.2. Reinforcement Learning Framework
The ability to consistently and reliably grasp objects is paramount to the utility of a prosthetic device, be it physical or simulated, and fundamental to human dexterity [36]. Devices incapable of reliable prehension quickly become functionally redundant, necessitating that this simple gesture be finely tuned. Assuming all other joints in the hand and arm are fixed except for the fingers and thumb, the simulated hand may still be regarded as a 28-degree-of-freedom (DoF) system. Each finger segment can commonly rotate left and right (yaw) and up and down (pitch) along a limited range of angles (almost negligible). By considering the grasping motion, the total DoFs in the hand collapse from a 28 to a 14 DoF system (3 DoF for each finger and 2 DoF for the thumb), as yaw rotations are not useful in the gesture. Because this gesture is simple and reduces the number of degrees of freedom, the following methodology is developed to control a prosthetic performing a grasping motion. It is envisioned that this methodology can then be generalized in future work to perform more complex motions using similar observation, action, and reward functions as those proposed in this research.
As the contact dynamics that allow a prosthetic to grasp an object follow a non-linear relationship, the use of Unity’s ML-Agents is advantageous. The use of ML-Agents also allows greater variability for individualized prosthetic designs and various motions that would be difficult to code formally. A minimal environment is created in the Unity editor to train the RL agent, given the simplicity of the task. A snapshot of one of these environments is shown in Figure 2. The figure displays the beginning position of the arm prior to training the ML-Agent. This position is assumed at the beginning of every episode during training of the RL agent. The hand is also initialized at the same position at the start of the dynamic VR scene. The final position, once the agent is trained and deployed in the dynamic VR scene, is shown in Section 4.2.
Apart from the modeled prosthetic and cylinder, the collision spaces for objects are also shown by the green outline. The colliders allow for the processing of contact between each finger segment and the cylinder. A collision is processed using kinematic rigid-body physics, where the entire arm and cylinder are considered to be a rigid body. Rigid-body kinematics are chosen over kinetics, as the processing of forces between the prosthetic hand and cylinder is found to cause displacement of the prosthetic even when all positional and rotational axes are frozen at the device’s base. By only considering kinematic physics, a challenge is created by fingers now having the ability to pass through the cylinder, allowing for a greater possibility that a motion besides grasping is found to be optimal by the reinforcement learning model. This issue is addressed in the design of the observation and action spaces.
In line with the previously mentioned 14 DoFs used to perform the grasping motion, an action is assigned to each joint, yielding the following action space: $A = \{a_1, a_2, \ldots, a_{14}\}$, where each action is normalized between the bounds of −1 and 1 using a hyperbolic tangent activation function. Each action controls the speed and direction of rotation for each individual joint in the prosthetic hand. The displacement angle of each joint ($\Delta\theta_i$) for a set time interval ($\Delta t$) is calculated through Equation (8).

$$\Delta\theta_i = a_i \, \omega_{\max} \, \Delta t \tag{8}$$

$\omega_{\max}$ represents the maximum angular velocity at which each joint can turn, which is then scaled by $a_i$ to determine the final magnitude and direction of this velocity. A value of 90°/s for $\omega_{\max}$ is found to work well for training the ML-Agent.
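The per-joint displacement described around Equation (8) can be sketched as follows. This is an illustrative sketch rather than the Unity implementation: it assumes the maximum angular velocity of 90 is expressed in degrees per second and uses the 0.02 s timestep stated later in the paper; the names `squash_action` and `angle_step` are hypothetical.

```python
import math

MAX_ANGULAR_VELOCITY = 90.0  # assumed to be in degrees per second
TIMESTEP = 0.02              # seconds, i.e., a 50 Hz update rate


def squash_action(raw_output):
    """Map an unbounded network output into [-1, 1] with a hyperbolic tangent."""
    return math.tanh(raw_output)


def angle_step(action, omega_max=MAX_ANGULAR_VELOCITY, dt=TIMESTEP):
    """Return the angular displacement of one joint over one timestep.

    The sign of `action` sets the rotation direction and its magnitude
    scales the maximum angular velocity.
    """
    if not -1.0 <= action <= 1.0:
        raise ValueError("action must lie in [-1, 1]")
    return action * omega_max * dt


def step_all_joints(angles, actions):
    """Apply one action per joint (14 in the grasping setup) and return new angles."""
    if len(angles) != len(actions):
        raise ValueError("one action is required per joint")
    return [theta + angle_step(a) for theta, a in zip(angles, actions)]
```

Under these assumptions a fully saturated action moves a joint by 1.8 degrees per timestep, so a joint would need 50 consecutive saturated steps (one second) to sweep 90 degrees.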
To inform the ML-Agent on the current state of its environment, two observations are provided with respect to each joint in the hand, where the total set of joints can be denoted as $J$. First, the ML-Agent is informed of the z-rotation of each finger segment. These observations are taken to bridge the gap between the current state of the prosthetic and the reward function. Tallying the z-rotation of each joint as $\theta_j$ allows the ML-Agent to more easily understand which rotational displacement, according to the action it takes, yields a higher reward. Information regarding the distance of each joint to the cylinder ($d_j$) is also provided to the ML-Agent. This observation is especially useful in the case of the dynamic VR scene (discussed in the next section), where multiple cylinders are present and the ML-Agent, along with human input, has to choose the best plausible motion to grasp the nearest cylinder. Given these details, the total observation vector ($O$) can adequately be described as $O = \{\theta_j, d_j \mid j \in J\}$.
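The two per-joint observations described above (z-rotation and distance to the cylinder) can be assembled into a flat vector as sketched below. The interleaved ordering and the function name `build_observation` are assumptions made for illustration; the source does not state how the values are ordered.

```python
def build_observation(z_rotations, distances):
    """Interleave per-joint z-rotation and joint-to-cylinder distance.

    Both inputs are assumed to be ordered identically over the joint set,
    so 14 joints yield a 28-element observation vector.
    """
    if len(z_rotations) != len(distances):
        raise ValueError("each joint needs one rotation and one distance")
    observation = []
    for theta, dist in zip(z_rotations, distances):
        observation.extend([float(theta), float(dist)])
    return observation
```

Any readings from additional sensors would be supplied to the agent separately and are not part of this sketch.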
Along with the colliders, two ray perception sensors are placed orthogonally to each finger. These ray perception sensors allow the agent to perceive its environment through raycasting, in which a series of rays is projected from an object at a given length and angle. These rays can collide with objects carrying a specific tag name within the VR scene. For this case, all of the ray perception sensors are set to alert the ML-Agent when a ray interacts with a cylinder collider. Of the two sensors on each finger, one is placed near the end of the finger and one near the base. These placements are strategically chosen to inform the agent whether the base and end joints are approaching the cylinder from a physically correct angle. Without this guidance, a clockwise rotation could cause the fingers to contact the cylinder with their backs, which is not biomechanically feasible. An example of these ray perception sensor placements on the pinky finger is shown in Figure 3.
Minimal hyperparameter tuning is completed in the training of the proposed RL model, as greater importance is placed on the design of the action and observation spaces and the reward function for model convergence than on hyperparameter optimization. For reproducibility, some notable hyperparameters, their purpose, and the chosen values for this study are shown below. For a greater explanation of RL hyperparameters and tuning procedures, please consult [37]. The chosen value follows each description.
learning rate: Determines the step size for updating the policy and value networks (a high value may lead to instability; a low value leads to slow convergence). Value: $3 \times 10^{-4}$ with linear decay
epsilon: Regulates the bounds of clipping ($\epsilon$) for the objective function previously described in Section 2.2 and given by Equation (3). Value: 0.2
gamma: The discount factor, which sets the importance the model places on long-term rewards relative to immediate gains. Value: 0.99
beta: Scales the entropy regularization of the policy. The higher the value, the more the RL model is encouraged to explore a diverse set of actions instead of exploiting learned policies. Value: 0.005
batch size: The number of training samples that are processed during each gradient update. Value: 2048
epochs: The number of times collected data from the RL agent’s interaction with its environment is used to tune its policy and value parameters. Value: 3
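The listed values can be collected in one place, together with a sketch of what a linearly decaying learning rate might look like. This is an assumption-laden illustration: the dictionary key names are chosen for readability, and `linear_decay_lr` assumes the rate decays to zero at a hypothetical `max_steps`, which the source does not specify.

```python
# Values as reported in the text; key names are illustrative.
PPO_HYPERPARAMETERS = {
    "learning_rate": 3e-4,  # initial step size, decayed linearly
    "epsilon": 0.2,         # clipping bound of the PPO objective
    "gamma": 0.99,          # discount factor
    "beta": 0.005,          # entropy regularization strength
    "batch_size": 2048,     # samples per gradient update
    "num_epoch": 3,         # passes over the collected data
}


def linear_decay_lr(step, max_steps, initial_lr=PPO_HYPERPARAMETERS["learning_rate"]):
    """Return the learning rate at `step`, decaying linearly to zero at `max_steps`.

    `max_steps` is a hypothetical training budget used only for illustration.
    """
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    progress = min(max(step / max_steps, 0.0), 1.0)  # clamp to [0, 1]
    return initial_lr * (1.0 - progress)
```

A linear schedule of this kind takes large steps early, when the policy is far from a good solution, and progressively smaller steps as training approaches its budget.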
Two reward function approaches are investigated to analyze which scheme leads to a grasping motion that is close to reality. The first reward function takes a simple approach in which contact between each finger segment and the cylinder is rewarded. It is designed to encourage both immediate and sustained contact between the joints of the hand and the item that the virtual prosthetic is grasping. This is accomplished by giving a larger immediate reward for first contact between each joint and the cylinder, notifying the agent that it is exploiting a favorable set of actions. Extended contact is then supplemented by a continuous reward at every timestep, with the intent that the agent recognizes its current action as the most favorable. A smaller value is given to the continuous reward to ensure that the cumulative reward does not explode outside of normalized bounds, which would hinder model convergence. Additionally, a higher reward value is assigned to contact with joints that play a greater role in performing a grasping action, where each finger’s base and the thumb joints are deemed to have the highest importance. The values for these rewards are shown in Table 2.
Apart from rewarding immediate and persistent contact, the release of contact is included in the reward function as a punishment. If a collider from a finger segment is found to have had previous contact, but that contact is later released, a release penalty of −0.25 is applied. The total logic for this reward function can be formally expressed by Equation (9), where $w_j^{I}$ and $w_j^{C}$ represent the immediate and continuous reward weights for each joint, $j$, in the total set of available joint types, $J$, as coordinated with the values shown in Table 2.

$$R_1(t) = \sum_{j \in J} \left[ w_j^{I} \, I_j(t) + w_j^{C} \, C_j(t) + w^{P} \, P_j(t) \right] \tag{9}$$

In total, $J$ accounts for 14 joints: 3 for each finger (base, middle, and end) and 2 for the thumb (base and end). $I_j(t)$, $C_j(t)$, and $P_j(t)$ are Boolean operators for each joint, which are continually updated at every time step, $t$. These operators represent immediate contact, continuous contact, and release of contact, respectively. Finally, $w^{P}$ represents the penalty weight of −0.25 for the release of contact.
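The Boolean logic of the first reward function can be sketched for a single timestep as follows. This is a minimal sketch under stated assumptions: the per-joint weights passed in are placeholders (the Table 2 values are not reproduced here), and the three Boolean flags are derived from the contact state at the previous and current timestep, which is one plausible reading of immediate contact, continuous contact, and release.

```python
RELEASE_PENALTY = -0.25  # release-of-contact penalty weight stated in the text


def contact_reward(prev_contact, curr_contact, first_weights, sustained_weights,
                   release_penalty=RELEASE_PENALTY):
    """Return the contact-based reward for one timestep.

    `prev_contact` and `curr_contact` map joint names to booleans.
    `first_weights` and `sustained_weights` map joint names to the
    (placeholder) immediate and continuous reward weights.
    """
    total = 0.0
    for joint, touching in curr_contact.items():
        was_touching = prev_contact.get(joint, False)
        immediate = touching and not was_touching   # contact just began
        continuous = touching and was_touching      # contact is held
        released = was_touching and not touching    # contact was lost
        total += (first_weights[joint] * immediate
                  + sustained_weights[joint] * continuous
                  + release_penalty * released)
    return total
```

Because each term is gated by a Boolean, the reward changes only at contact events, which is the sparsity examined later in the Discussion.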
The second reward function takes a more continuous approach. Rather than relying on Boolean operators at each timestep for the addition or subtraction of discrete reward weights, the reward is defined as a continuous function of the distance between each joint and its nearest point on the cylinder’s collider mesh. The distance improvement ($\Delta d_j$) for each joint $j$ is calculated at each timestep as shown in Equation (10). The total adjustment in reward at each timestep is expressed by Equation (11) as a summation of the distance improvements for all joints.

$$\Delta d_j(t) = d_j(t-1) - d_j(t) \tag{10}$$

$$R_2(t) = \sum_{j \in J} \Delta d_j(t) \tag{11}$$

This reward function allows for the continuous addition of reward and penalty weights. If a joint moves closer to the cylinder, it is provided a positive reward in proportion to its distance improvement. In contrast, if the joint moves farther away, the reward weight becomes negative and acts as a penalty for the ML-Agent. For consistency, the second reward function uses the same set of joints, $J$, as the first reward function.
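The distance-based reward can be sketched in a few lines. The sketch assumes the distance improvement of a joint is its previous distance minus its current distance and that improvements are summed without any extra scaling factor; the function names are hypothetical.

```python
def distance_improvement(prev_distance, curr_distance):
    """Positive when the joint moved closer to the cylinder, negative otherwise."""
    return prev_distance - curr_distance


def distance_reward(prev_distances, curr_distances):
    """Sum the distance improvements of all joints for one timestep."""
    if len(prev_distances) != len(curr_distances):
        raise ValueError("distance lists must cover the same joints")
    return sum(distance_improvement(p, c)
               for p, c in zip(prev_distances, curr_distances))
```

Unlike a contact flag, this quantity changes with every small joint movement, so the agent receives a graded signal at each timestep even before any contact occurs.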
3.3. Reinforcement Learning Evaluation Methods
Each of these reward functions is a function of the environment’s state and action at time $t$ ($s_t$ and $a_t$). The empirical mean reward, as calculated through Equation (13) for a set number of discretized timesteps, serves as an important cornerstone in evaluating model convergence and success.

$$\bar{R} = \frac{1}{T} \sum_{t=1}^{T} R(s_t, a_t) \tag{13}$$

For each of the reward functions, a reward is calculated at a rate of 50 Hz, thereby providing a timestep length of 0.02 s.
In addition to convergence of the empirical mean reward, a minimization of the network’s total loss (or error in predictions) is also desired. In the case of the PPO algorithm, there are two separate loss functions to be minimized: policy loss and value loss. The policy loss for the PPO algorithm that is used in this study allows for guidance of the agent’s policy network to enact more fruitful actions. Further description of the policy loss function to be minimized is provided in
Section 2.2 where the loss function is shown by Equation (
3). The value loss represents the error in the network’s estimation of the future reward to be provided to the agent, given that it enacts a specific action in a given state. The value loss is measured by the mean squared error (MSE) function, $L_V = \frac{1}{T} \sum_{t=1}^{T} \left( R_t - \hat{R}_t \right)^2$, which is averaged over $T$ timesteps, where $R_t$ represents the actual reward for completing that action and $\hat{R}_t$ represents the predicted reward.
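The two scalar quantities tracked during evaluation, the empirical mean reward and the MSE-based value loss, can be computed as in the following sketch. It works on plain Python lists of per-timestep values and is meant only to make the definitions concrete; the function names are hypothetical, and the actual training framework computes these internally.

```python
def empirical_mean_reward(rewards):
    """Average the per-timestep rewards over the collected timesteps."""
    if not rewards:
        raise ValueError("at least one reward is required")
    return sum(rewards) / len(rewards)


def value_loss(actual_rewards, predicted_rewards):
    """Mean squared error between actual and predicted rewards."""
    if len(actual_rewards) != len(predicted_rewards) or not actual_rewards:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2
                      for a, p in zip(actual_rewards, predicted_rewards)]
    return sum(squared_errors) / len(squared_errors)
```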
The success of the RL model using each of the specified reward functions is evaluated using the following criteria:
1. The reinforcement learning model reaches an empirical mean reward that almost surely converges to its maximum possible value as time approaches infinity, given that the model reaches an optimal policy $\pi^*$: $\bar{R} \to R_{\max}$ as $t \to \infty$.
2. Minimization (or a pattern of minimization) of both the policy and value losses, ensuring that the agent’s strategy is being guided in a positive direction and the network’s estimation of future reward given a possible state is accurate, respectively.
3. Once $\pi^*$ is achieved, the policy’s entropy is minimized, meaning that the model is less inclined to explore its action space and is confident that the action it has converged on produces the maximum reward.
4. Qualitatively, a grasping motion in the correct rotational direction appears to have been successfully completed in both the static and dynamic VR scenes. This would imply that the chosen reward function not only results in model convergence but also produces the desired action that the RL model is designed to complete.
To ensure that the RL model converges upon an optimal policy, quantitative convergence standards are placed upon criteria 1–3. For each episode (5000 timesteps of 0.02 s in length), the relative deviation of that episode’s value is measured against the previous episode. Because values attained in consecutive episodes may be similar before training has truly settled, the relative deviation of each episode is also compared against the previous ten episodes so that convergence is not declared too early. For cumulative reward, entropy, and loss, the relative deviation threshold set to ensure convergence and an optimal policy is 0.01%. The timesteps at which convergence occurs are then compared to the graphical representation of each of these metrics across time.
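The convergence check described above can be sketched as follows. The sketch assumes that "compared against the previous ten episodes" means the current episode must lie within the 0.01% relative deviation of each of the ten preceding episodes (which includes the immediately preceding one); comparing against their mean would be an equally plausible reading. Function names are hypothetical.

```python
TOLERANCE = 1e-4  # 0.01% expressed as a fraction
WINDOW = 10       # number of preceding episodes to compare against


def relative_deviation(value, reference):
    """Absolute deviation of `value` from `reference`, relative to `reference`."""
    if reference == 0.0:
        return 0.0 if value == 0.0 else float("inf")
    return abs(value - reference) / abs(reference)


def has_converged(history, tolerance=TOLERANCE, window=WINDOW):
    """Check whether the latest episode value is stable against the prior window.

    `history` holds one metric value per episode (e.g., cumulative reward,
    entropy, or loss), oldest first.
    """
    if len(history) < window + 1:
        return False  # not enough episodes to judge
    current = history[-1]
    previous = history[-window - 1:-1]
    return all(relative_deviation(current, past) <= tolerance
               for past in previous)
```

Requiring agreement with the whole window guards against declaring convergence on a short plateau where two neighboring episodes happen to be nearly identical.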
3.4. Dynamic Virtual Reality Scene
A dynamic VR scene is constructed as a proof of concept to demonstrate the applicability of machine learning for prosthetic control for rehabilitation. This is achieved through the integration of four technologies: Unity’s ML-Agents, the Thalmic Labs Myo Armband, MRTK2, and the Meta Quest 3 headset.
Unity’s established role in simulation research, combined with the flexibility of its ML-Agents package, made it an ideal platform for hosting the dynamic scene. Integration of ML-Agents with the modeled prosthetic, discussed in
Section 3.1, is achieved through modification of the reward function (
Section 3.2). The previously trained ML-Agent is employed in the dynamic VR scene through the use of an .onnx file. The user can observe the dynamic VR scene through the Meta Quest 3 headset and control the position and rotation of the entire virtual prosthesis using an EMG armband. This is accomplished via the armband’s IMU sensor discussed in Section 2.4. Once the user position and rotation data are correctly displayed within Unity, the final step is to represent this within a VR environment.
Four of the experimental setup structures containing a platform and cylinder are placed in random Z-X coordinate locations within the dynamic VR scene. The scene is visualized in
Figure 4.
The placement of multiple cylinders within this scene presents a new challenge for the trained ML-Agent, as it now must actuate a grasping motion on the closest cylinder while still observing information for all cylinders. The user is given the objective to move the prosthetic to a cylinder of their choice, allowing enough distance for the prosthetic to grasp it. The inherent goal of this scene in the context of a rehabilitation situation is to allow the user to become familiar with controlling their prosthetic through the shoulder/elbow joints, while the hand actuation movement is automated for them.
Measurements from the inertial sensor in the EMG armband are propagated onto the virtual scene to emulate physical reality. This is most clearly shown in Figure 5a, where the arm is placed at the approximate position where the user’s physical arm would be. Physically, the Myo EMG armband is placed slightly above the user’s elbow. The emulator, where actions informed by inertial sensor readings originate, sits in a similar position. The placement is shown in Figure 5b, where the outlined sphere is the origin of movement for the prosthetic. These measures are taken to ensure a good match between the VR scene and the user’s movement, making the application more intuitive to use.
To display the Unity environment with the incorporated RL model, prosthetic model, and associated components, Microsoft’s Mixed Reality Toolkit 2 (MRTK2) is utilized. MRTK2 provides the foundation for integrating spatial interaction and device-specific inputs, enabling the creation of an interactive VR scene directly within Unity. The hand tracking capabilities that are available through MRTK2 are not found to be necessary, as the IMU sensor within the EMG armband allows for adequate control of the prosthetic. The resulting VR scene is deployed to the Meta Quest 3 headset, with Meta’s Meta-Link software (Version 78.1033) serving as the interface for connecting the Unity scene to the headset.
The connectivity between the EMG armband, pretrained ML-Agent, and Meta Quest 3 was initially explored for AR applications as well, given their advantages in immersiveness for rehabilitation applications [38]. Launching the AR application using solely the ML-Agent was found to be successful; however, the inclusion of the Myo EMG armband was found to limit the application space to VR only. This limitation is imposed by the EMG sensor due to its inability to connect directly to the Meta Quest 3. Therefore, the application cannot be built and launched remotely, away from the computing device hosting the Unity project. The limitation on this aspect of the study is further discussed in Section 5. Apart from exploring more compatible EMG armband devices, a more immersive environment may be created to increase patient engagement.
5. Discussion
A key point of discussion from the different training results of RF1 and RF2 is the characteristics of a reward function that result in successful training. The key failure of RF1 is the discrete approach it takes, where the overall reward is dependent on Boolean values. Boolean values (0s and 1s) provide a finite and sparse set of signals to be sent to the gradient estimator of the policy’s objective function. These signals only inform the agent of its success or failure, omitting how close the agent is to performing the desired behavior. This adversely affects the agent’s training, as there is no usable direction for improvement when performing gradient descent, only binary feedback. As shown by the policy loss in Figure 6, after a long duration of training, the agent is still completely uncertain of a strategy to guide its empirical mean reward in a positive direction. RF2 provides a continuous reward that is entirely a function of distance, with a near-infinite set of signals based on the distance between the joint and the cylinder. The continuity of these signals allows the policy’s objective function to clearly map out its search space, where it is aware of a direction that would improve the overall reward it is given. This highlights the importance of designing reward functions to provide continuous, informative signals that guide incremental improvement rather than sparse, binary feedback.
Once trained, the ML-Agent is shown to perform well in environments that are different from the one in which it was trained. Once the user enters the dynamic VR scene, the ML-Agent does not perform the grasping motion until the user moves toward an object that is tagged for grasping (such as the cylinder). This reflects positively on the policy that the ML-Agent uses, as the agent has learned to grasp specific objects rather than simply performing the grasping motion unprompted. The success of the ML-Agent in performing the grasping motion with human control of the arm position in the global space demonstrates the expected generalization from static training to dynamic deployment. This idea can further be extrapolated to human-in-the-loop training of the ML-Agent in the future.
Previous endeavors into the integration of augmented reality with myo-electric prosthetics have included trials of both amputees and non-amputees as a principal method of data collection. A lack of testability on recovering patients represents a limitation of the framework proposed in this study; regardless of the ML-Agent’s standard of performance, its long-term effectiveness cannot be concluded without examination of its synergy with a human user. User experience depends significantly on subjective variables that isolated tests of the system cannot evaluate. Likewise, the aspects of the system that are most problematic for users cannot be identified as is. Important considerations, such as whether the visual feedback provided by an augmented reality framework alleviates phantom limb pain in amputees or if prior training with a simulated prosthetic translates to higher proficiency with a physical device, remain purely speculative. Future iterations of the system proposed in this study would benefit most from trials involving human participants, with the degree of skill transfer and their subjective levels of satisfaction being evaluated individually. Central nervous system integration of a framework similar to that demonstrated in [
41] would additionally inform subsequent iterations of the proposed system by providing empirical indicators of user affinity. The data collected from such an approach would be a more accurate predictor of patient rejection or acceptance, ideally making later investments into myoelectric prosthetic technology more worthwhile for healthcare providers and patients alike.
Considering the limited availability of the Myo armband, future iterations of the proposed framework would additionally benefit from trials with alternative open-source electromyography platforms. Potentially compatible tools include Arduino’s electromyographic sensors and signal processors [
42], the FREEEMG surface probes manufactured by BTS Bioengineering [
43], and OpenBCI’s Cyton bio-sensing board [
44] or Galea headset, among other implements. Practical tests of different platforms are needed to find the best balance between cost and performance and to maximize the accessibility of this system for users. Furthermore, it would be beneficial to develop software capable of transmitting data directly over a network to the target headset application.
6. Conclusions and Future Work
This study has demonstrated the feasibility of integrating reinforcement learning with static and interactive dynamic VR scenes to train and use a prosthetic to perform a grasping gesture. The overall framework combines Unity ML-Agents, MRTK, and an IMU sensor from Thalmic Labs’ EMG armband. Two reward function approaches are considered: one using discrete Boolean statements to provide a reward upon contact and the other using a continuous distance-based reward. Of these two approaches, the continuous distance-based reward allowed the model to converge, whereas the Boolean-based reward did not. It is concluded that the agent trained on the Boolean-based reward function failed because of the weak learning signal provided to the policy gradient. A reward that is contingent only on ‘True’ and ‘False’ does not give the agent’s policy estimator an adequate direction of improvement, as supported by the results of RF1’s policy loss over time.
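The contrast between the two reward approaches can be illustrated with a minimal Python sketch. It is not the reward code used in this study: the function names, the contact radius, the linear shaping, and the sample approach path are all assumptions chosen for illustration. It only shows why a contact-only reward is uninformative until contact occurs, while a distance-based reward changes at every step of the approach.

```python
import math

# Hypothetical contact threshold (scene units); not a value from the study.
CONTACT_RADIUS = 0.05


def boolean_reward(hand_pos, target_pos, contact_radius=CONTACT_RADIUS):
    """Sparse reward: 1.0 only when the hand is within the contact radius."""
    return 1.0 if math.dist(hand_pos, target_pos) <= contact_radius else 0.0


def distance_reward(hand_pos, target_pos, max_distance=1.0):
    """Dense reward: rises linearly from 0.0 to 1.0 as the hand nears the target.

    The linear form and max_distance are illustrative assumptions.
    """
    distance = math.dist(hand_pos, target_pos)
    return max(0.0, 1.0 - distance / max_distance)


# Assumed straight-line approach of the hand toward a target at the origin.
target = (0.0, 0.0, 0.0)
approach = [(0.8, 0.0, 0.0), (0.5, 0.0, 0.0), (0.2, 0.0, 0.0), (0.0, 0.0, 0.0)]

sparse = [boolean_reward(p, target) for p in approach]
dense = [distance_reward(p, target) for p in approach]

print("Boolean reward :", sparse)
print("Distance reward:", dense)
```

In this sketch the Boolean reward stays at zero for every step before contact, so steps that bring the hand closer are indistinguishable from steps that do not, whereas the distance-based reward increases monotonically along the approach.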
The foundation laid by this research may be generalized to various applications. By employing a static VR environment, the ML-Agent is trained under controlled, consequence-free conditions, thereby reducing resource demands and the risks associated with mistraining. The dynamic VR scene serves as the final application that the ML-Agent is trained to complete. In many scenarios, the dynamic scene may be a physical application. Such applications extend beyond prosthetics and include training a robot to perform a movement automatically, industrial assembly operations, and many other automated movement tasks. In the case of prosthetic actuation, connecting reinforcement learning to a VR environment using MRTK may allow for a simpler rehabilitation process in which the user adjusts to their EMG-controlled prosthetic while a trained ML-Agent assists them in performing some movements.
This study provides a foundation upon which multiple avenues of future research may be pursued. The presented work provides a methodology for actuating a hand gesture. A similar framework may then be used to allow the arm to perform different gestures with modifications to the environment and reward function. To fully generalize an ML-Agent’s ability to perform these gestures, transfer learning may prove useful. An agent can use the knowledge gained from training on a simple hand gesture (such as grasping) to learn a more complex gesture in less time than training from scratch would require. To properly train the agent to perform different gestures, VR-scene design must be taken into account to create a suitable environment. The RL architecture may also need to be modified to account for transfer mechanisms across heterogeneous action and observation spaces when VR interactions differ between gestures.
Another, less obvious, direction for future work is incorporating uncertainty quantification (UQ) into the ML-Agent’s training process. This consideration becomes increasingly important for verification if a similar approach is undertaken and the trained agent is subsequently deployed on a physical system where the agent makes consequential decisions. Formal UQ is particularly important for verifying the consistency of the RL agent prior to deployment in a clinical setting. Under the Unity ML-Agents framework, UQ can be performed using deep ensembles or Monte Carlo dropout with minimal changes to the architecture of the deep reinforcement learning agent: deep ensembles require no architectural change, and Monte Carlo dropout requires only the addition of dropout layers. Because the PPO algorithm used for the ML-Agent relies on deep neural networks, both of these UQ methods are relevant.
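As a rough illustration of how the two UQ methods produce an uncertainty estimate, the following Python sketch applies Monte Carlo dropout and a deep ensemble to a toy policy network. It does not use the Unity ML-Agents API, and the network, its random untrained weights, the dropout rate, the sample counts, and the observation vector are all hypothetical. In both cases the spread of the predicted action over repeated forward passes, or over ensemble members, serves as the uncertainty measure.

```python
import random
import statistics


def make_policy(n_in, n_hidden, rng):
    """Hypothetical one-hidden-layer policy with random (untrained) weights."""
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.gauss(0.0, 1.0) for _ in range(n_hidden)]
    return w1, w2


def forward(policy, obs, rng=None, drop_p=0.0):
    """Return a scalar action; applies inverted dropout to hidden units if enabled."""
    w1, w2 = policy
    action = 0.0
    for row, w_out in zip(w1, w2):
        h = max(0.0, sum(w * x for w, x in zip(row, obs)))  # ReLU activation
        if rng is not None and drop_p > 0.0:
            if rng.random() < drop_p:
                h = 0.0                # unit dropped for this pass
            else:
                h /= 1.0 - drop_p      # rescale the surviving unit
        action += w_out * h
    return action


def mc_dropout(policy, obs, n_samples, drop_p, rng):
    """Monte Carlo dropout: keep dropout active at inference and sample repeatedly."""
    samples = [forward(policy, obs, rng, drop_p) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


def deep_ensemble(policies, obs):
    """Deep ensemble: compare deterministic outputs of independently initialized models."""
    samples = [forward(p, obs) for p in policies]
    return statistics.mean(samples), statistics.stdev(samples)


rng = random.Random(0)
obs = [0.2, -0.1, 0.4]  # assumed observation vector

policy = make_policy(3, 16, rng)
mc_mean, mc_std = mc_dropout(policy, obs, n_samples=100, drop_p=0.2, rng=rng)

ensemble = [make_policy(3, 16, rng) for _ in range(5)]
en_mean, en_std = deep_ensemble(ensemble, obs)

print("MC dropout: mean action and spread:", mc_mean, mc_std)
print("Ensemble  : mean action and spread:", en_mean, en_std)
```

In practice each ensemble member would be a separately trained agent rather than a randomly initialized one, and a large spread for a given observation would flag a state in which the agent's action should not be trusted without further checks.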