Reinforcement Learning-Driven Prosthetic Hand Actuation in a Virtual Environment Using Unity ML-Agents

Done, Christian; Palmer, Jaden; Oakey, Kayson; Gupta, Atulan; Thiros, Constantine; Franklin, Janet; Schoen, Marco P.

doi:10.3390/virtualworlds4040053


by Christian Done *, Jaden Palmer, Kayson Oakey, Atulan Gupta, Constantine Thiros, Janet Franklin and Marco P. Schoen

Department of Mechanical and Measurement & Control Engineering, Idaho State University, Pocatello, ID 83209, USA

* Author to whom correspondence should be addressed.

Virtual Worlds 2025, 4(4), 53; https://doi.org/10.3390/virtualworlds4040053

Submission received: 16 September 2025 / Revised: 27 October 2025 / Accepted: 29 October 2025 / Published: 6 November 2025


Abstract

Modern myoelectric prostheses remain difficult to control, particularly during rehabilitation, leading to high abandonment rates in favor of static devices. This highlights the need for advanced controllers that can automate some motions. This study presents an end-to-end framework coupling deep reinforcement learning with augmented reality (AR) for prosthetic actuation. A 14-degree-of-freedom hand was modeled in Blender and deployed in Unity. Two reinforcement learning agents were trained with distinct reward functions for a grasping task: (i) a discrete, Boolean reward with contact penalties and (ii) a continuous distance-based reward between joints and the target object. Each agent trained for 3 × 10⁷ timesteps at 50 Hz. The Boolean reward function performed poorly by entropy and convergence metrics, while the continuous reward function achieved success. The trained agent using the continuous reward was integrated into a dynamic AR scene, where a user controlled the prosthesis via a myoelectric armband while the grasping motion was actuated automatically. This framework demonstrates potential for assisting patients by automating certain movements to reduce initial control difficulty and improve rehabilitation outcomes.

Keywords:

augmented reality; mixed reality; reinforcement learning; Unity ML-Agents; prosthetic actuation

1. Introduction

Despite the increasing prevalence and sophistication of myoelectric prosthetic limb-replacing devices, their acceptance by amputees has seen comparatively little improvement in the course of their development [1]. In their current state, the responsiveness and utility of these devices are inadequate to meet patient requirements for everyday use, as well as to justify the monetary, temporal, and emotional costs associated with amputation recovery [2]. Purely cosmetic prostheses or body-powered devices using hooks in place of replica hands continue to be favored among patients by virtue of their simplicity, regardless of their lower potential range of functionality [2].

Accommodating the needs of upper-body amputees poses a uniquely complex challenge, with the dexterity of an organic hand being considerably more difficult to emulate than the load-bearing functions of the leg and foot [3]. Isolated surveys of upper- and lower-body amputees reflect this; among the former demographic, prosthesis abandonment rates frequently exceed 50 percent [1], while the abandonment of lower-body appliances seldom exceeds 20 percent [4]. Concurrently, the poor control fidelity of fine motions and device activation is a leading source of patient frustration [2]. Each action of a myoelectric device is triggered by a specific set of muscle contractions, and determining the correct action requires modeling many different parameters. These parameters are collectively called the state space and the action space. Developing such a precise mathematical model is a great challenge in itself. As a consequence, the world is increasingly leaning towards data-driven machine learning (ML), especially approaches that do not require any prior knowledge about the system.

Table 1 summarizes major prosthesis types according to the location of amputation, their corresponding abandonment rates, and follow-up periods after amputation. Aggregated data from patient studies demonstrates the need for a shift in design priorities towards making these devices more user-oriented. The data further demonstrates that despite the application or characteristics of the prosthetic, upper-limb prosthetics are generally abandoned more frequently.

Most machine learning algorithms belong to one of six categories: supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, transduction, and learning to learn. These techniques have been used to train a metasurface-based imager to recognize, monitor, and measure human movements [5]. The utility of supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning has been studied across a wide range of healthcare settings, including health monitoring, mental health, disease management, medical diagnosis, assistive healthcare, physical activity, substance use, and dietary management [6].

A more recent and popular form of machine learning algorithm in controlling highly non-linear and high-dimensional complex systems is reinforcement learning (RL). RL involves algorithms that learn a policy on how to behave based on an observation of the environment, where the actions of the machine agent impact the environment, resulting in feedback. The agent then incorporates the feedback into the algorithm to guide future decisions [7]. RL, a form of approximate dynamic programming, has shown promise in “tuning” robotic knee prostheses [8]. Tuning is the process through which a robotic prosthesis’s control parameters are personalized to the individual user. Reinforcement learning has also been utilized to gather data from an able-bodied subject wearing a robotic limb and transfer that data to a user of a robotic knee prosthesis as a means of improving the tuning process [9].

To address more complicated problems, for example, robotic manipulation of objects, a combination of deep learning with RL, known as “deep reinforcement learning,” has been studied in [10]. Leading strategies for solving the existing problems in prosthetic hand actuation deal with the implementation of AI for shared control of the prosthetic, leaving intention to the user, and AI for fine control of strength and coordination [11]. In a similar kind of research work, artificial neural networks and support vector machines have been tested in the optimization of certain implants used in hip replacement procedures [12].

Most of the common approaches to actuating a prosthetic hand use electromyography (EMG) signals, where electrical signals generated by muscle contractions are captured and filtered to remove noise and retain meaningful signal content. Once a meaningful signal is found, the next step is to extract useful features such as variance, waveform length, and zero crossings. Then, based on the extracted features and using algorithms [13], prosthetic hand movement is generated [14]. For example, researchers have used multichannel surface EMG knit band sensors and wearable smart band textile-based EMG sensors to design their respective prosthetic hands [15,16,17]. In [18], the authors demonstrated several strategies for electrode placement and discussed different signal processing techniques. However, EMG signal-based prosthetic hand actuation in the real world often suffers from multiple reported challenges, including skin irritation [19] caused by the adhesive gel used inside the electrode, low battery life, lack of comfort [20], and the competing demands of low power consumption and high speed of operation [21]. Despite such limitations, EMG-based methods are widely used due to their low cost, non-invasiveness, and direct correlation with muscle activity [15].
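To make the feature extraction step concrete, the following minimal Python sketch computes the three features named above (variance, waveform length, and zero crossings) for a single window of EMG samples. It is an illustration only and is not the processing pipeline of the cited works: the function name `emg_features`, the `zc_threshold` parameter, and the example window are assumptions introduced here.

```python
import numpy as np

def emg_features(window, zc_threshold=0.0):
    """Return (variance, waveform length, zero crossings) for one EMG window.

    Hypothetical helper for illustration; not taken from the cited works.
    """
    x = np.asarray(window, dtype=float)
    # Variance of the samples in the window (population variance).
    variance = float(np.var(x))
    # Waveform length: summed absolute change between consecutive samples.
    step = np.diff(x)
    waveform_length = float(np.sum(np.abs(step)))
    # Zero crossings: consecutive samples with opposite sign whose amplitude
    # step exceeds a small threshold (the threshold is an assumed noise guard).
    sign_change = x[:-1] * x[1:] < 0
    zero_crossings = int(np.sum(sign_change & (np.abs(step) > zc_threshold)))
    return variance, waveform_length, zero_crossings

# Synthetic five-sample window, used only to exercise the function.
example_window = [1.0, -1.0, 2.0, -2.0, 0.5]
var, wl, zc = emg_features(example_window)
```

In practice, such features would be computed over sliding windows for each EMG channel before being passed to a classifier or controller.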

With respect to myoelectrically controlled prostheses, RL techniques have been used to improve a user’s control [22]. However, all the previous works improve the ease of use of prosthetics, but do not solve issues with physical therapy or training before a patient receives their prosthetic. Fitting of a prosthetic is a very time-consuming and expensive process; without proper exercise, muscles and nerves can decay further after surgery. Making training and adjustments to the prosthetic is a difficult process. Combining these methods with augmented or virtual reality (AR or VR) simulations allows for patients to train ahead of the fitting process, keeping their muscles and nerves in manageable condition, and making the transition to their new prosthetic easier. In addition to training, AI can be used for modification of the prosthetic structure itself, making it more personal and accessible to each user. An AR/VR simulation has the added advantage of testing out an EMG prosthetic before its construction, saving money on patients who may have a distaste for a more advanced prosthetic. Patients who have tested their prosthesis in a virtual environment before use have reported less frustration, more enjoyment of their prosthetics, and drastic improvement in prosthetic control and accuracy over those who do not train in AR or VR [23].

This research seeks to help fill this gap by creating a comprehensive framework for the use of RL in a VR scene with human input to provide a gentle and engaging approach to prosthesis rehabilitation, where some complex movements may be automated using an RL agent. An initial demonstration of this framework is completed by training an RL agent to perform a grasping motion. A VR scene is then created to simulate a simpler rehabilitation environment by allowing the user to still control the movement of the prosthetic using an inertial sensor in an EMG armband, while the grasping motion is automated. This framework is envisioned to be further generalized beyond the grasping motion to more complex movements that lead to patient frustration when rehabilitating with EMG-driven prosthetics. For the remainder of this paper, the Unity scene in which the RL agent is trained to perform the grasping motion is the static VR scene, and the Unity scene that the RL agent is deployed in with human input using the EMG armband is referred to as the dynamic VR scene.

The remainder of this paper will be structured as follows. Section 2 will provide relevant background information on the creation of augmented reality scenes through Unity, prosthesis control through EMG armbands, and RL using Unity’s machine learning (ML) agents. Section 3 presents the methodology conducted to create the training environment for the ML agent in the static VR scene and the dynamic VR scene to demonstrate the trained agent. Section 4 covers the results of the ML Agent training process, particularly lessons learned in creating reward and observation functions to effectively train a reinforcement learning model for usage in an augmented reality scene. Finally, Section 6 concludes the paper and discusses possible avenues for future work.

2. Background

2.1. Augmented Reality Scene Creation Using Unity

To generate both static and dynamic AR scenes, the Unity game engine is used. Unity provides a platform for developers to create 2D and 3D video games, virtual and augmented reality applications, and various complex visuals. Unity provides a complex array of functionalities that allow the user to achieve the highest level of visual and game-logic (using C# programming) quality that they desire.

By employing various open-source assets such as the Mixed Reality Toolkit (MRTK), any type of virtual immersion application can be created in Unity [24]. VR allows for full virtual immersion by closing out any visuals of physical space and completely placing the user in a digital environment. AR is taken to overlay pieces of the digital environment or various virtual objects into the physical space that the user is experiencing. For the following study, version 2020.3.42f1 of Unity is used with version 2 of MRTK to allow the user to perceive virtual objects while navigating their physical space, with interaction driven by the EMG armband.

2.2. Reinforcement Learning Through Unity ML Agents

Controlling a system using conventional control algorithms requires an accurate mathematical representation of the system. However, determining an exact mathematical model for a complex, nonlinear system is often a tedious job. For example, a proportional–integral–derivative (PID) controller stabilizes simple systems effectively, and a Markov decision process (MDP) optimizes future estimations [25,26], but both require a known dynamic and linear error model.

Reinforcement learning (RL) is a special kind of machine learning (semi-supervised learning), which has gained much interest among the automatic control community in recent years [27]. RL actively explores using stochastic policies and learns through continuous interaction with its surrounding environment. Due to this, RL does not suffer from short-term noise [28]. RL models are created to produce the most optimal set of actions at each time step ($a_{t}$) based on observations ($o_{t}$) from its state ($s_{t}$), which entails a complete description of its environment. Here, $o_{t}$ and $s_{t}$ are generally provided to the RL model in vector or matrix form. In this study, $a_{t}$ is taken from a continuous action space where the model has an infinite number of actions within a specified range of real numbers. The model is then given numerical feedback through a reward $r_{t}$ to inform it of the utility of the actions it has taken. RL can solve infinite state space problems by optimizing a policy utilizing a reward/penalty concept without having any prior knowledge about the system. A policy gradient method estimates the gradient of the current policy and plugs it into a stochastic gradient ascent algorithm. The policy serves as the controller to find the optimal action for the subsequent timestep to best influence the system. The gradient estimator aims to find the direction of steepest ascent, informing the RL agent of how to maximize its reward. A commonly used gradient estimator is given in Equation (1). Here, $π_{θ}$ is a stochastic policy, $\hat{A}_{t}(s_{t}, a_{t})$ is an estimator of the advantage function (how good an action is, given the state) at timestep $t$, $\hat{E}_{t}$ is the expected reward return when using $π_{θ}$, and $\hat{g}$ is the policy gradient estimator.

$$\hat{g} = \hat{E}_{t} [\nabla_{θ} \log π_{θ}(a_{t} | s_{t}) \hat{A}_{t}(s_{t}, a_{t})]$$

(1)

A popular form of RL is the Proximal Policy Optimization (PPO) algorithm, which is a policy gradient method that balances performance and stability via a clipped surrogate objective function. The motivation for this approach is to eliminate the complexity of the trust region policy optimization (TRPO) method while maintaining its data efficiency and reliability [29]. PPO reuses each batch of data for multiple epochs and therefore learns from fewer samples. It is an on-policy approach that uses only first-order optimization, which makes it simpler to implement. PPO works well in both continuous and discrete action spaces and is well-suited for complex environments such as robotics and games [30,31,32].

PPO improves upon traditional policy gradient methods by integrating a clipped surrogate objective, which ensures that updates to the policy do not deviate too much from the previous policy. The gradient estimator for the PPO RL algorithm is shown in Equation (2).

$$\hat{g}_{PPO} = \nabla_{θ} L^{CLIP}(θ)$$

(2)

where $L^{CLIP}(θ)$ is the clipped surrogate objective function used in PPO, shown in Equation (3). Here, $r_{t}(θ) = π_{θ}(a_{t} | s_{t}) / π_{θ_{old}}(a_{t} | s_{t})$ denotes the probability ratio between the updated and previous policies [29], not the reward $r_{t}$ introduced earlier. By bounding the size of each policy update, this objective function serves a purpose similar to the learning rate in conventional machine learning and deep learning algorithms. The unclipped term, $r_{t}(θ) \hat{A}_{t}$, weights the advantage of performing a set of actions at timestep $t$ by the probability ratio. The clipped term forces the ratio to remain within the bounds $[1 - ϵ, 1 + ϵ]$, thereby limiting how strongly a positive or negative advantage $\hat{A}_{t}$ can move the policy, within the range set by the tunable hyperparameter $ϵ$. The addition of this clipped objective function allows the RL agent’s policy to still improve quickly while preventing sharp deviations in policy parameters that could inhibit model convergence.

$$L^{CLIP}(θ) = \hat{E}_{t} [\min(r_{t}(θ) \hat{A}_{t}, \text{clip}(r_{t}(θ), 1 - ϵ, 1 + ϵ) \hat{A}_{t})]$$

(3)
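As an illustration of Equation (3), the short NumPy sketch below evaluates the clipped surrogate objective for a small batch, taking $r_{t}(θ)$ to be the probability ratio between the new and old policies as in the PPO formulation [29]. The function name `clipped_surrogate` and the sample ratios and advantages are assumptions chosen for the example; the expectation $\hat{E}_{t}$ is approximated by a batch mean.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Batch estimate of the PPO clipped surrogate objective, Equation (3).

    ratio:     probability ratios r_t(theta), one per sample
    advantage: advantage estimates A_t, one per sample
    epsilon:   clipping range hyperparameter
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the element-wise minimum keeps the more pessimistic estimate.
    return float(np.mean(np.minimum(unclipped, clipped)))

# Illustrative values only: one ratio above the bound, one below, one inside.
ratios = [1.5, 0.5, 1.0]
advantages = [1.0, -1.0, 2.0]
objective = clipped_surrogate(ratios, advantages)
```

For the first sample the ratio of 1.5 is clipped to 1.2, so a large policy change cannot claim more than 1.2 times the advantage; for the second, the negative advantage is paired with the clipped ratio of 0.8 because that gives the lower (more pessimistic) value.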

2.3. Unity Machine Learning Agents

The environment in which an RL agent is trained is of similar, if not greater, importance to the RL algorithm itself. Many simulation platforms exist; however, they either only support a single environment or only allow for domain-specific applications. Understanding this, Unity developed the first general platform to train and test reinforcement learning algorithms in 2017: Unity Machine Learning Agents (ML-Agents) [33]. ML-Agents allows the user to create their own custom environment using the Unity game engine. By allowing for a flexible environment creation and testing platform, the user can generate an environment to the exact level of realism desired. The Unity engine also contains the PhysX physics engine, allowing the user to apply physics principles to the objects in the environment and the agent’s interactions with them [34].

The versatility and capabilities of the Unity engine make it the perfect platform for training and testing RL agents. By allowing the user to generate their environments in a similar fashion to game design, an environment can be created with rich auditory and sensory data. The inclusion of a well-supported physics engine increases the physical complexity of the training platform, allowing the agent to interact with the environment in complex manners. Several agents may also be trained at the same time to create a multi-agent framework. As interactions with agents may be simulated, multiple agents that inherit from the same policy can be trained in parallel as well. Parallel processing between various agents allows for faster optimization of the policy gradient, ensuring faster convergence in the agent achieving its maximum reward.

2.4. EMG Armband for Prosthesis Control

Precise manipulation of the modeled prosthetic within Unity is essential to the integrity of the simulation. For this application, Thalmic Labs’ Myo Armband is used for its noninvasive electromyography (EMG) and inertial sensors. Resources for interfacing with the armband may be found in the following GitHub repository: https://github.com/mark-toma/MyoMex (accessed on 27 August 2025). The Myo Armband features a nine-axis inertial measurement unit (IMU), allowing for real-time tracking of a user’s position within Unity. Integration of the Myo Armband into Unity is performed using community-developed packages, such as Microsoft’s Mixed Reality Toolkit 2 (MRTK2), which allows for synchronization between the user’s arm and the modeled prosthetic within Unity. While focus is placed on IMU spatial and orientation tracking, the Myo Armband’s eight EMG sensors offer further potential for integrating EMG data with Unity’s ML-Agents package to aid a user’s training experience. The IMU provides orientation estimates in the form of a unit quaternion, shown by Equation (4).

$$q = q_{0} + q_{1} i + q_{2} j + q_{3} k, \quad \| q \| = 1,$$

(4)

where $q_{0}$ is the scalar component and $(q_{1}, q_{2}, q_{3})$ form the vector component. To represent the orientation of a user’s arm in Unity, the quaternion is applied to vectors in $\mathbb{R}^{3}$ through the rotation operator given by Equation (5).

$$v' = q v q^{*},$$

(5)

where $v$ is treated as a pure quaternion with zero scalar part and $q^{*}$ denotes the conjugate of $q$. This operation does not affect the magnitude of $v$ and corresponds to a rotation of the vector by an angle $θ$ about a unit axis $u$, where the quaternion can be expressed through Equation (6).

$$q = \cos(\frac{θ}{2}) + u \sin(\frac{θ}{2}).$$

(6)

The expanded form of the operator is shown in Equation (7).

$$v' = (q_{0}^{2} - \| \mathbf{q} \|^{2}) v + 2 (\mathbf{q} \cdot v) \mathbf{q} + 2 q_{0} (\mathbf{q} \times v),$$

(7)

with $\mathbf{q} = (q_{1}, q_{2}, q_{3})$ denoting the vector part of $q$. This formulation allows Unity to directly apply IMU quaternion data to the prosthetic model, ensuring proper orientation tracking. The mathematical foundation for this method is presented in [35].
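The rotation in Equations (6) and (7) can be checked numerically. The sketch below builds a unit quaternion from an axis and angle and applies the expanded rotation operator to a vector. It is a standalone NumPy illustration, not the Unity or Myo implementation; the function names and the scalar-first storage order `[q0, q1, q2, q3]` are assumptions made for this example.

```python
import numpy as np

def quat_from_axis_angle(axis, theta):
    """Unit quaternion [q0, q1, q2, q3] for a rotation of theta about axis (Eq. 6)."""
    u = np.asarray(axis, dtype=float)
    u = u / np.linalg.norm(u)  # ensure the axis is a unit vector
    return np.concatenate(([np.cos(theta / 2.0)], np.sin(theta / 2.0) * u))

def rotate(q, v):
    """Rotate vector v by unit quaternion q using the expanded form (Eq. 7)."""
    q0, qv = q[0], np.asarray(q[1:], dtype=float)  # scalar and vector parts
    v = np.asarray(v, dtype=float)
    return ((q0**2 - qv @ qv) * v
            + 2.0 * (qv @ v) * qv
            + 2.0 * q0 * np.cross(qv, v))

# Rotating the x-axis by 90 degrees about the z-axis should give the y-axis.
q = quat_from_axis_angle([0.0, 0.0, 1.0], np.pi / 2.0)
v_rot = rotate(q, [1.0, 0.0, 0.0])
```

The length of the rotated vector equals the length of the input, consistent with the statement that the operator does not affect the magnitude of $v$.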

3. Methodology

3.1. Prosthesis Modeling Process

To simulate realistic prosthetic control in a VR environment, an arm is created using Blender, a software predominantly used for modeling organic objects in three dimensions. The modeling is performed in stages; first, the construction of the ‘armature’ or bone structure is accomplished, second, the creation of the ‘mesh’ or skin, and finally, the integration of the two into a fully rigged and exportable prosthetic model is done, as shown in Figure 1.

A critical factor in designing the armature is ensuring that the natural degrees of freedom (DoF) of a human hand are accurately represented. Of particular importance is the grasping motion, for which the fingers and thumb are regarded as a 28-DoF system. A further breakdown of these DoFs is presented in Section 3.2, but their effects on the modeling process are discussed here. In Blender, each joint is added to the skeleton to increase both visual realism and simulation accuracy. Thus, the challenge is to capture sufficient biomechanical accuracy to simulate realistic grasping without artificially constraining the model.

The armature is built to anatomically represent a human arm, with particular attention given to the complexity of the hand. Each finger is modeled as a hierarchical system of joints, allowing realistic flexion of the fingers. A root joint is placed at the wrist, from which the forearm, palm, and fingers are attached. Establishing this hierarchy is essential to provide visual clarity and proper functionality. Testing is performed via an animation to ensure that the kinematics are correctly established so as not to inhibit anatomically accurate human grasping.

The mesh is modeled to conform tightly to the underlying skeleton, ensuring that deformations caused by joint rotations are smoothly distributed across the surface. Special care is taken around areas of high mobility, such as knuckles and wrists, where poorly designed geometry could lead to visual deformation. The mesh is then bound to the armature using Blender’s weight-painting system, which defines how much influence each joint exerts over nearby vertices. This step is done to achieve realistic deformations during motion. A chronological representation of the modeling process is shown in Figure 1.

3.2. Reinforcement Learning Framework

The ability to consistently and reliably grasp objects is paramount to the utility of a prosthetic device, be it physical or simulated, and fundamental to human dexterity [36]. Devices incapable of reliable prehension quickly become functionally redundant, necessitating that this simple gesture be finely tuned. Assuming all other joints in the hand and arm are fixed except for the fingers and thumb, the simulated hand may still be regarded as a 28-degree-of-freedom (DoF) system. Each finger segment can typically rotate left and right (pitch) and up and down (yaw) along a limited range of angles (almost negligible). By considering only the grasping motion, the total DoFs in the hand collapse from 28 to 14 (3 DoF for each finger and 2 DoF for the thumb), as yaw rotations are not useful in the gesture. Because this gesture is simple and reduces the number of degrees of freedom, the following methodology is developed to control a prosthetic performing a grasping motion. It is envisioned that this methodology can then be generalized in future work to perform more complex motions using similar observation, action, and reward functions as those proposed in this research.

As the contact dynamics that allow a prosthetic to grasp an object are non-linear, the use of Unity’s ML-Agents is advantageous. The use of ML-Agents also allows greater variability for individualized prosthetic designs and various motions that would be difficult to code formally. A minimal environment is created in the Unity editor to train the RL agent, given the simplicity of the task. A snapshot of one of these environments is shown in Figure 2. The figure displays the beginning position of the arm prior to training the ML-Agent. This position is assumed at the beginning of every episode during training of the RL agent. The hand is also initialized at the same position at the start of the dynamic VR scene. The final position, once the agent is trained and deployed in the dynamic VR scene, is shown in Section 4.2.

Apart from the modeled prosthetic and cylinder, the collision spaces for objects are also shown by the green outline. The colliders allow for the processing of contact between each finger segment and the cylinder. A collision is processed using kinematic rigid-body physics, where the entire arm and cylinder are considered to be a rigid body. Rigid-body kinematics are chosen over kinetics, as the processing of forces between the prosthetic hand and cylinder is found to cause displacement of the prosthetic even when all positional and rotational axes are frozen at the device’s base. By only considering kinematic physics, a challenge is created by fingers now having the ability to pass through the cylinder, allowing for a greater possibility that a motion besides grasping is found to be optimal by the reinforcement learning model. This issue is addressed in the design of the observation and action spaces.

In line with the previously mentioned 14 DoFs used to perform the grasping motion, an action is assigned to each joint, yielding the action space $A_{t} = [a_{t,1}, a_{t,2}, \dots, a_{t,14}] \in [-1, 1]^{14}$, where each action is normalized between the bounds of −1 and 1 using a hyperbolic tangent activation function. Each action controls the speed and direction of rotation of an individual joint in the prosthetic hand. The displacement angle of each joint ($Δθ_{i}$) over a set time interval ($ΔT$) is calculated through Equation (8). Here, $ω$ represents the maximum angular velocity at which each joint can turn, which is scaled by $a_{t,i}$ to determine the final magnitude and direction of this velocity. A value of 90°/s for $ω$ is found to work well for training the ML-Agent.

$$Δθ_{i} = a_{t,i} ω ΔT$$

(8)
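As a worked illustration of Equation (8), the sketch below converts a 14-element action vector into per-joint angle increments. The value of 90°/s for $ω$ comes from the text; the time interval of 1/50 s is an assumption based on the 50 Hz rate stated in the abstract, and the function name `joint_angle_step` is hypothetical.

```python
import numpy as np

OMEGA_DEG_PER_S = 90.0  # maximum joint angular velocity, from the text
DELTA_T = 1.0 / 50.0    # assumed step duration, based on the stated 50 Hz rate

def joint_angle_step(actions, omega=OMEGA_DEG_PER_S, dt=DELTA_T):
    """Per-joint angle increment in degrees for one step, Equation (8)."""
    # Actions are expected in [-1, 1]; clip defensively in case they are not.
    a = np.clip(np.asarray(actions, dtype=float), -1.0, 1.0)
    return a * omega * dt

# Illustrative action vector: joint 1 at full speed, joint 2 at half speed
# in the opposite direction, all other joints at rest.
actions = np.zeros(14)
actions[0] = 1.0
actions[1] = -0.5
delta_theta = joint_angle_step(actions)
```

Under these assumptions, an action of 1 moves a joint by 1.8° per step, so a joint driven at full speed would sweep 90° in 50 steps.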

To inform the ML-Agent of the current state of its environment, two observations are provided with respect to each joint in the hand, where the total set of joints can be denoted as $J = \{j_{1}, j_{2}, \dots, j_{14}\}$. First, the ML-Agent is informed of the z-rotation of each finger segment. These observations are taken to bridge the gap between the current state of the prosthetic and the reward function. Tallying the z-rotation of each joint as $θ_{i}$ allows the ML-Agent to more easily understand which rotational displacement, according to the action it takes, yields a higher reward. Information regarding the distance of each joint to the cylinder ($d_{i}$) is also provided to the ML-Agent. This observation is especially useful in the case of the dynamic VR scene (discussed in the next section), where multiple cylinders are present and the ML-Agent, along with human input, has to choose the best plausible motion to grasp the nearest cylinder. Given these details, the total observation vector ($O_{t}$) can be described as $O_{t} = [θ_{1}, θ_{2}, \dots, θ_{14}, d_{1}, d_{2}, \dots, d_{14}] \in \mathbb{R}^{28}$.
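The layout of the observation vector can be sketched as follows. This is a NumPy illustration of the ordering described above (14 joint z-rotations followed by 14 joint-to-cylinder distances), not the Unity ML-Agents sensor code; the function name `build_observation`, the units, and the sample values are assumptions.

```python
import numpy as np

NUM_JOINTS = 14

def build_observation(joint_z_rotations, joint_distances):
    """Concatenate joint rotations and distances into the 28-element O_t."""
    theta = np.asarray(joint_z_rotations, dtype=float)
    d = np.asarray(joint_distances, dtype=float)
    if theta.shape != (NUM_JOINTS,) or d.shape != (NUM_JOINTS,):
        raise ValueError("expected one rotation and one distance per joint")
    # Order matches O_t = [theta_1..theta_14, d_1..d_14].
    return np.concatenate([theta, d])

# Placeholder readings: rotations in degrees, distances in scene units.
joint_z_rotations = np.linspace(0.0, 65.0, NUM_JOINTS)
joint_distances = np.full(NUM_JOINTS, 0.05)
observation = build_observation(joint_z_rotations, joint_distances)
```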

Along with the colliders, two ray perception sensors are placed orthogonally to each finger. These ray perception sensors allow the agent to perceive its environment through raycasting, in which a series of rays is projected from an object at a given length and angle. These rays are given the ability to collide with objects of a specific tag name within the MR scene. For this case, all of the ray perception sensors are set to alert the ML-Agent when a ray interacts with a cylinder collider. Of the two rays perpendicular to each finger, one is placed near the end of the finger and one near the base of the finger. These placements are strategically chosen to inform the agent whether the base and end joints are approaching the cylinder from a physically correct angle. Without this guidance, a clockwise rotation could cause the fingers to contact the cylinder with their backs, which is not biomechanically feasible. An example of these ray perception sensor placements on the pinky finger is shown in Figure 3.

Minimal hyperparameter tuning is completed in the training of the proposed RL model, as greater importance is placed on the design of the action and observation spaces and the reward function for model convergence than on hyperparameter optimization. For reproducibility, some notable hyperparameters, their purpose, and the chosen values for this study are shown below. For a more detailed explanation of RL hyperparameters and tuning procedures, please consult [37]. The chosen value is given at the end of each entry.

learning rate—Determines the step size for updating the policy and value networks (a high value may lead to instability; a low value leads to slow convergence): 3 × 10⁻⁴ with linear decay
epsilon—Regulates the bounds of clipping ( $[1 - ϵ, 1 + ϵ]$ ) for the objective function previously described in Section 2.2 and given by Equation (3): 0.2
gamma—The discount factor assisting the model in deciding the importance of long-term rewards or immediate gains: 0.99
beta—The strength of the entropy regularization applied to the policy. The higher the value, the more the RL model is encouraged to explore a diverse set of actions instead of exploiting learned policies: 0.005
batch size—The number of training samples which are processed during gradient ascent: 2048
epochs—The number of times collected data from the RL-agent’s interaction with its environment is used to tune its policy and value parameters: 3
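For convenience, the values listed above can be gathered in one place, together with a sketch of what a linear learning-rate decay looks like. The dictionary keys and the `linear_decay` function are illustrative assumptions and do not reproduce the authors' configuration file; the total of 3 × 10⁷ steps is taken from the abstract, and the exact schedule applied by the training library may differ in detail.

```python
# Hyperparameter values as listed in the text; key names are illustrative.
PPO_HYPERPARAMETERS = {
    "learning_rate": 3e-4,  # initial value, decayed linearly
    "epsilon": 0.2,         # clipping range in Equation (3)
    "gamma": 0.99,          # discount factor
    "beta": 0.005,          # entropy regularization strength
    "batch_size": 2048,
    "num_epoch": 3,
}

MAX_STEPS = 30_000_000  # 3 x 10^7 training timesteps, from the abstract

def linear_decay(initial, step, max_steps=MAX_STEPS):
    """Assumed linear schedule: initial value at step 0, zero at max_steps."""
    fraction_left = max(0.0, 1.0 - step / max_steps)
    return initial * fraction_left

lr_start = linear_decay(PPO_HYPERPARAMETERS["learning_rate"], 0)
lr_halfway = linear_decay(PPO_HYPERPARAMETERS["learning_rate"], MAX_STEPS // 2)
lr_end = linear_decay(PPO_HYPERPARAMETERS["learning_rate"], MAX_STEPS)
```

Under this assumed schedule, the step size halfway through training is half of its initial value, which gradually reduces the size of policy updates as training progresses.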

Two reward function approaches are investigated to analyze which scheme leads to a grasping motion that is close to reality. The first reward function takes a simple approach where contact between each finger segment and the cylinder is rewarded. It is created to encourage both immediate and sustained contact between the joints in the hand and the item that the virtual prosthetic is grasping. This is completed by placing a larger immediate reward on first contact between each joint and the cylinder, notifying the agent that it is exploiting a favorable set of actions. Extended contact is then supplemented by a continuous reward at every timestep, with the intent that the agent recognizes that its current set of actions is favorable. A smaller value for the continuous reward is given to ensure that the cumulative reward value does not explode outside of normalized bounds, which would hinder model convergence. Additionally, a higher reward value is assigned to contact by joints that play a greater role in performing a grasping action, where each finger’s base and the thumb joints are deemed to have the highest importance. The values for these rewards are shown in Table 2.

Apart from rewarding immediate and persistent contact, the release of contact is included in the reward function as a punishment. If a collider from a finger segment is found to have had previous contact, but that contact is later released, a release penalty of −0.25 is included. The total logic for this reward function can be formally expressed by Equation (9), where $α_{j}$ and $β_{j}$ represent the reward weights for each joint, $j$, in the total set of available joint types, $J$, as coordinated with the values shown in Table 2. In total, $J$ accounts for 14 joints: 3 for each finger (base, middle, and end) and 2 for the thumb (base and end). $I_{j}^{t}$, $C_{j}^{t}$, and $R_{j}^{t}$ are Boolean operators for each joint, which are continually updated at every time step, $t$. These operators represent immediate contact, continuous contact, and release of contact, respectively. Finally, $γ$ represents the penalty weight of −0.25 for the release of contact.

$$r_{t} = \sum_{j \in J} \left[ α_{j} I_{j}^{t} + β_{j} C_{j}^{t} Δt \right] + γ \sum_{j \in J} R_{j}^{t} \tag{9}$$
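As a minimal sketch of how Equation (9) could be evaluated at each timestep, the following Python fragment tracks the contact state of each joint and accumulates the three reward terms. The weights in `ALPHA` and `BETA` are hypothetical placeholders, not the values in Table 2, and the names `JointContact` and `boolean_reward` are illustrative only. The sketch assumes that the immediate-contact indicator fires only on a joint’s first contact within an episode, which is one possible reading of the description above.

```python
from dataclasses import dataclass

# Hypothetical weights: placeholders only, NOT the Table 2 values.
ALPHA = {"base": 1.0, "middle": 0.5, "end": 0.5}    # first-contact reward
BETA = {"base": 0.10, "middle": 0.05, "end": 0.05}  # sustained-contact reward
GAMMA = -0.25  # release penalty (value stated in the text)
DT = 0.02      # timestep length in seconds (50 Hz)


@dataclass
class JointContact:
    joint_type: str            # "base", "middle", or "end"
    touching: bool = False     # contact state at the previous timestep
    has_touched: bool = False  # True once the first-contact reward was paid


def boolean_reward(joints, touching_now):
    """Evaluate Equation (9) for one timestep and update each joint's state."""
    reward = 0.0
    for joint, now in zip(joints, touching_now):
        immediate = now and not joint.has_touched  # I_j^t (assumed: first contact only)
        continuous = now                           # C_j^t
        released = joint.touching and not now      # R_j^t
        reward += ALPHA[joint.joint_type] * immediate
        reward += BETA[joint.joint_type] * continuous * DT
        reward += GAMMA * released
        joint.has_touched = joint.has_touched or now
        joint.touching = now
    return reward
```

With these placeholder weights, a joint that makes contact, holds it, and then lets go receives a large reward once, a small reward while contact persists, and the −0.25 penalty on release.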

The second reward function takes a more continuous approach. Rather than relying on Boolean operators at each timestep for the addition or subtraction of discrete reward weights, the reward is defined as a continuous function of the distance between each joint and its nearest point on the cylinder’s collider mesh. The distance improvement ($Δd_{j}^{t}$) for each joint $j \in J$ is simply calculated throughout each timestep as shown in Equation (10). The total adjustment in reward weights throughout each timestep is expressed by Equation (11) as a summation of each distance improvement for all joints. The following reward function allows for the continuous addition of reward and penalty weights. If a joint moves closer to the cylinder, it is provided a positive reward in proportion to its distance improvement. In contrast, if the joint moves farther away, the reward weight becomes negative and acts as a penalty for the ML-Agent. For consistency, the second reward function used the same set of joints, $J$, as the first reward function.

$$Δd_{j}^{t} = d_{j}^{t-1} - d_{j}^{t} \tag{10}$$

$$r_{t} = \sum_{j \in J} Δd_{j}^{t} \tag{11}$$
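Equations (10) and (11) can be illustrated with a short Python sketch. The study measures distance to the nearest point on the cylinder’s collider mesh inside Unity; here, as a stated assumption, that query is replaced by the analytic distance to an idealized capped cylinder centered at the origin with its axis along y. The function names are hypothetical.

```python
import math


def distance_to_cylinder(point, radius, half_height):
    """Distance from a 3D point to a solid capped cylinder (axis along y,
    centered at the origin). Assumed stand-in for a collider-mesh query."""
    x, y, z = point
    radial = math.hypot(x, z) - radius  # > 0 when outside the curved wall
    axial = abs(y) - half_height        # > 0 when beyond an end cap
    return math.hypot(max(radial, 0.0), max(axial, 0.0))


def distance_improvement_reward(prev_distances, curr_distances):
    """Equation (11): sum over joints of d_j^{t-1} - d_j^t (Equation (10))."""
    return sum(prev - curr for prev, curr in zip(prev_distances, curr_distances))
```

One property follows directly from Equation (10): summed over an episode, the per-step improvements telescope, so the cumulative reward for a joint equals its initial distance minus its final distance, regardless of the path taken in between.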

3.3. Reinforcement Learning Evaluation Methods

Each of these reward functions is a function of the environment’s state and action at time $t$ ($S_{t}$ and $A_{t}$). The empirical mean reward, calculated through Equation (12) over a set number of discretized timesteps, $T$, serves as an important cornerstone in evaluating model convergence and success. For each of the reward functions, a reward is calculated at a rate of 50 Hz, giving a timestep length of 0.02 s.

$${\bar{R}}_{T} = \frac{1}{T} \sum_{t = 0}^{T - 1} r (S_{t}, A_{t}) \tag{12}$$

In addition to convergence of the empirical mean reward, a minimization of the network’s total loss (or error in predictions) is also desired. In the case of the PPO algorithm, there are two separate loss functions to be minimized: policy loss and value loss. The policy loss for the PPO algorithm used in this study guides the agent’s policy network toward more fruitful actions. Further description of the policy loss function is provided in Section 2.2, where the loss function is shown by Equation (3). The value loss reflects the accuracy of the network’s estimate of the future reward to be provided to the agent, given that it takes a specific action in a given state. The value loss is measured by the mean squared error (MSE) function shown in Equation (13), which is averaged over $t$ timesteps, where $Y_{i}$ represents the actual reward for completing that action and ${\hat{Y}}_{i}$ represents the predicted reward.

$$L_{value} = \frac{1}{t} \sum_{i = 1}^{t} (Y_{i} - {\hat{Y}}_{i})^{2} \tag{13}$$
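The two monitoring quantities defined in this section, the empirical mean reward and the mean squared error used for the value loss, reduce to a few lines of Python. This is a sketch of the definitions only, with hypothetical function names; it is not the loss computation performed internally by ML-Agents.

```python
def empirical_mean_reward(rewards):
    """Average of the per-timestep rewards r(S_t, A_t) over T timesteps."""
    return sum(rewards) / len(rewards)


def value_loss_mse(actual, predicted):
    """Mean squared error between actual and predicted rewards."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)
```

Squaring the error keeps over-predictions and under-predictions from cancelling each other, so the loss is zero only when every prediction matches its target.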

The success of the RL model using each of the specified reward functions is evaluated using the following criteria:

The reinforcement learning model reaches an empirical mean reward that almost surely converges to its maximum possible value, $ρ^{*}$, as time approaches infinity, given that the model reaches an optimal policy $π^{\infty}$: ${\bar{R}}_{T} \overset{a.s.}{\to} ρ^{*} (π^{\infty})$ as $T \to \infty$.
Minimization (or a pattern of minimization) of both the policy and value losses, ensuring that the agent’s strategy is being guided in a positive direction and the network’s estimation of future reward given a possible state is accurate, respectively.
Once $π^{\infty}$ is achieved, the policy’s entropy is minimized, meaning that the model is less inclined to explore its action space, and is confident that the action it has converged on produces the maximum reward.
Qualitatively, a grasping motion in the correct rotational direction appears to have been successfully completed in both the static and dynamic mixed reality scenes. This would imply that the chosen reward function not only results in model convergence but also produces the desired action that the RL model is designed to complete.

To ensure that the RL model converges upon an optimal policy, quantitative convergence standards are placed upon criteria 1–3. For each episode (5000 timesteps of 0.02 s in length), the relative deviation of that episode’s value is measured against the previous episode. Because values for consecutive episodes may be similar before training has actually settled, the relative deviation of each episode is also compared against the previous ten episodes so that convergence is not declared too early. For cumulative reward, entropy, and loss, the relative deviation threshold used to declare convergence and an optimal policy is 0.01%. The timesteps at which convergence occurs are compared to the graphical representation of each of these metrics over time.
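The convergence test described above can be sketched as follows. The text does not state whether an episode is compared against each of the previous ten episodes or against an aggregate of them; this sketch assumes the stricter reading, in which the relative deviation must stay within 0.01% of every one of the previous ten episodes. The function names are hypothetical.

```python
def relative_deviation(current, reference):
    """|current - reference| / |reference|, treated as infinite when the
    reference is zero and the current value is not."""
    if reference == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - reference) / abs(reference)


def first_converged_episode(values, tolerance=1e-4, window=10):
    """Return the index of the first episode whose value lies within
    `tolerance` (0.01% = 1e-4) of each of the previous `window` episodes,
    or None if no episode qualifies."""
    for i in range(window, len(values)):
        previous = values[i - window:i]
        if all(relative_deviation(values[i], ref) <= tolerance for ref in previous):
            return i
    return None
```

Multiplying the returned episode index by the episode length of 5000 timesteps gives the timestep at which the metric is considered converged under this reading.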

3.4. Dynamic Virtual Reality Scene

A dynamic VR scene is constructed as a proof of concept to demonstrate the applicability of machine learning for prosthetic control for rehabilitation. This is achieved through the integration of four technologies: Unity’s ML-Agents, the Thalmic Labs Myo Armband, MRTK2, and the Meta Quest 3 headset.

Unity’s established role in simulation research, combined with the flexibility of its ML-Agents package, made it an ideal platform for hosting the dynamic scene. Integration of ML-Agents with the modeled prosthetic, discussed in Section 3.1, is achieved through modification of the reward function (Section 3.2). The previously trained ML-Agent is employed in the dynamic VR scene through the use of an .onnx file. The user can observe the dynamic VR scene through the Meta Quest 3 headset and control the position and rotation of the entire virtual prosthesis using an EMG armband. This is accomplished via the armband’s IMU discussed in Section 2.4. Once the user position and rotation data are correctly displayed within Unity, the final step is to represent this within a VR environment.

Four of the experimental setup structures containing a platform and cylinder are placed in random Z-X coordinate locations within the dynamic VR scene. The scene is visualized in Figure 4.

The placement of multiple cylinders within this scene presents a new challenge for the trained ML-Agent, as it now must actuate a grasping motion on the closest cylinder while still observing information for all cylinders. The user is given the objective to move the prosthetic to a cylinder of their choice, allowing enough distance for the prosthetic to grasp it. The inherent goal of this scene in the context of a rehabilitation situation is to allow the user to become familiar with controlling their prosthetic through the shoulder/elbow joints, while the hand actuation movement is automated for them.

Measurements from the inertial sensor in the EMG armband are propagated to the virtual scene to emulate physical reality. This is most clearly shown in Figure 5a, where the virtual arm is placed at the approximate position of the user’s physical arm. Physically, the Myo EMG armband is worn slightly above the user’s elbow. The emulator, where actions informed by the inertial sensor readings originate, is placed in a similar position in the scene. This placement is shown in Figure 5b, where the outlined sphere is the origin of movement for the prosthetic. These measures ensure a close match between the VR scene and the user’s movement, minimizing the intuition needed to use the application.

To display the Unity environment with the incorporated RL model, prosthetic model, and associated components, Microsoft’s Mixed Reality Toolkit 2 (MRTK2) is utilized. MRTK2 provides the foundation for integrating spatial interaction and device-specific inputs, enabling the creation of an interactive VR scene directly within Unity. The hand tracking capabilities that are available through MRTK2 are not found to be necessary, as the IMU sensor within the EMG armband allows for adequate control of the prosthetic. The resulting VR scene is deployed to the Meta Quest 3 headset, with Meta’s Meta-Link software (Version 78.1033) serving as the interface for connecting the Unity scene to the headset.

The connectivity between the EMG armband, the pretrained ML-Agent, and the Meta Quest 3 was also explored for AR applications, given their advantages in immersiveness for rehabilitation [38]. Launching the AR application using solely the ML-Agent was successful; however, including the Myo EMG armband limited the application space to VR only. This limitation arises because the EMG armband cannot connect directly to the Meta Quest 3. Therefore, the application cannot be built and launched remotely, away from the computing device that hosts the Unity project. This limitation is further discussed in Section 5. Apart from exploring more compatible EMG armband devices, a more immersive environment may be created to increase patient engagement.

4. Results

4.1. ML-Agent Training

The following section addresses the success of both ML-Agents using their respective reward functions in accordance with the first three criteria for success described in Section 3.3. Both agents are trained for 3 × 10⁷ timesteps, where each timestep has a length of 0.02 s, as this amount of training was found to lead to the best convergence without unnecessary computational cost. This number of timesteps is found to be sufficient for the model to converge fully to an optimal policy. In accordance with the first three criteria, the quantitative results used to analyze the performance of each agent are the empirical mean reward, value loss, policy loss, and policy entropy over the training period.

For the remainder of this section, the ML-Agent trained using the reward function given by Equation (9) is referred to as Reward Function 1 (RF1), and the agent trained using Equation (11) is referred to as Reward Function 2 (RF2). The total number of training timesteps corresponded to a training period of approximately 24.3 h using RF1 and 30.2 h using RF2 on a 16-core 13th Gen Intel Core i7-13700KF processor. The longer training time for RF2 is most likely a result of the distance calculations between the prosthetic and cylinder meshes that are performed to derive the reward at each timestep.
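To put the training budget in perspective, the figures reported above can be combined in a short calculation. The sketch assumes that the 5000-timestep episode length used for the convergence analysis also applies during training and that the wall-clock times cover all 3 × 10⁷ timesteps; the variable and function names are illustrative.

```python
TOTAL_STEPS = 30_000_000  # 3 x 10^7 training timesteps
DT = 0.02                 # seconds of simulated time per timestep (50 Hz)
EPISODE_STEPS = 5_000     # timesteps per episode (assumed to hold in training)

simulated_hours = TOTAL_STEPS * DT / 3600
episodes = TOTAL_STEPS // EPISODE_STEPS


def steps_per_second(wall_clock_hours):
    """Average training throughput for a given wall-clock duration."""
    return TOTAL_STEPS / (wall_clock_hours * 3600)


rf1_rate = steps_per_second(24.3)  # RF1 training time from the text
rf2_rate = steps_per_second(30.2)  # RF2 training time from the text
slowdown = 30.2 / 24.3             # relative cost of RF2 versus RF1
```

Under these assumptions, the run covers roughly 167 h of simulated time across 6000 episodes, processed at about 343 timesteps per second for RF1 and about 276 for RF2, so RF2 takes roughly 24% longer in wall-clock terms.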

The resulting training metrics for the ML-Agent using RF1 are shown in Figure 6. The empirical mean reward over time makes it clear that the agent did not converge on an optimal policy. This is further supported by the policy loss and entropy over time. The policy loss fluctuates between approximately 0.03 and 0.15, with no significant pattern of minimization. The increase in policy entropy over time indicates continued exploration of the action space, which suggests that the agent is not confident that any of the actions it has taken so far will lead to greater rewards than those it has already experienced. The value loss is minimized, but this only indicates that the neural network estimating future rewards has become accurate in its predictions. Minimization of the value loss does not mean that the optimal policy has been attained, which is the ultimate goal of training a reinforcement learning agent to perform the task at hand. Qualitatively, the ML-Agent trained using RF1 did not recognize that the grasping motion leads to the maximum reward. Given this performance, the ML-Agent using RF1 was not deployed in the dynamic VR scene, and quantitative measurement of its convergence was found to be unnecessary.

The results from training an ML-Agent to perform a grasping motion using RF2 are shown in Figure 7. Here, the reward converges to an approximate value of 0.70. The policy loss also shows a pattern of minimization, beginning at a value of 0.70 and decreasing to a final value of approximately 0.35. Similar to the results from RF1, the value loss is minimized exponentially, but with much greater certainty and less overall noise. In contrast to RF1, the policy entropy in RF2 decreases from approximately 1.42 to 0.8 over time. This decrease signifies that the agent is confident that the actions it has already explored will yield a high reward. Therefore, the training results for RF2 shown in Figure 7 satisfy the first three criteria for a successful RL agent proposed in Section 3.3.

Quantitatively, RF2 is found to converge at different points depending on the metric analyzed. The empirical mean reward meets the convergence criteria after approximately 2.11 × 10⁷ timesteps, and the value loss converges after approximately 1.62 × 10⁷ timesteps. The policy entropy converges much later, after approximately 2.95 × 10⁷ timesteps, demonstrating that the ML-Agent is still inclined to explore even as it approaches the maximum empirical mean reward that it can achieve. Each of these convergence points is marked by the black line in Figure 7. Training the model for 3 × 10⁷ timesteps is found to be sufficient to allow the policy entropy to settle, ensuring that the RL agent is more inclined to enact its current policy than to explore a new one. The policy loss is not found to converge; however, this is a common phenomenon in the literature, as the RL model is exploring and learning various policies rather than a single one, leading to large fluctuations [39]. Simply observing a decrease in the policy loss is found to be sufficient.

4.2. ML-Agent Performance in Dynamic VR Scene

As discussed in Section 3.3, the fourth criterion of success is qualitatively assessed by testing whether a grasping motion is actually performed in the dynamic VR scene. While the first three criteria were analyzed outside of the dynamic VR space, with particular attention given to the agents’ training metrics, the fourth criterion focuses on whether the learned policy transfers accurately to the VR space and generalizes well.

Deployment of the agent trained with RF2 into the VR space resulted in the desired grasping motion when the user placed the prosthetic at the target cylinder. Once the user moves toward a cylinder, the agent performs the grasping motion, and all cylinders are grasped independent of position or rotation. The agent constantly attempts to actuate a grasp from the moment of scene startup. This corresponds to the agent observing multiple cylinders and trying to minimize the distance between each finger joint and a cylinder through the rotational actions available to it. A grasping motion is successfully achieved by the agent once the user guides it to the target cylinder using the IMU sensor in the EMG armband. Once the arm is moved to an appropriate position for grasping, the grasp motion is nearly instantaneous, as the RL agent is trained to maximize its reward by decreasing the distance between each finger segment and the cylinder as quickly as possible. The success of this grasping motion in the VR environment is visualized in Figure 8.

RF2 is tested within the VR environment because it successfully identifies the completion of the grasping motion as the optimal policy. By contrast, RF1 was not deployed into the VR environment because its agent did not identify the grasping motion as the optimal policy.

A key limitation of this framework that must be discussed is the connectivity between the Thalmic Labs Myo Armband and the Meta Quest 3 headset. Generally, AR and VR applications that are deployed from Unity to the Meta Quest 3 are serialized in the form of a .apk file. This file allows for easy transferability and continuous usage of the application in the headset, independent of the computer on which the application was created. However, EMG and IMU signals produced by the Thalmic Labs Myo Armband require processing through the Myo Connect application, which is only available for desktop devices [40]. This prohibits the dynamic VR application from being launched remotely on a Meta Quest 3 through an .apk file. Instead, the headset must be connected to the desktop device with the Unity scene of the VR application. Further investigation of alternative EMG armband and IMU sensor solutions that have serializable signal processing should be conducted for greater transferability of this framework.

5. Discussion

A key point of discussion from the different training results of RF1 and RF2 is the characteristics of a reward function that result in successful training. The key failure of RF1 is the discrete approach it takes, where the overall reward is dependent on Boolean values. Boolean values (0s and 1s) provide a finite and sparse set of signals to the gradient estimator of the policy’s objective function. These signals only inform the agent of its success or failure, omitting how close the agent is to performing the desired behavior. This adversely affects the agent’s training, as there is no usable direction for improvement when performing gradient descent, only binary feedback. As shown by the policy loss in Figure 6, after a long duration of training, the agent is still uncertain of a strategy to guide its empirical mean reward in a positive direction. RF2 provides a continuous reward that is entirely a function of distance, with a near-infinite set of signals based on the distance between each joint and the cylinder. The continuity of these signals allows the policy’s objective function to clearly map out its search space, where it is aware of a direction that would improve the overall reward it is given. This highlights the importance of designing reward functions to provide continuous, informative signals that guide incremental improvement rather than sparse, binary feedback.

Once trained, the ML-Agent is shown to perform well in environments that are different from the one in which it was trained. Once the user enters the dynamic VR scene, the ML-Agent does not immediately perform the grasping motion unless the user moves toward an object that is tagged for grasping (such as the cylinder). This reflects positively on the policy that the ML-Agent uses, as the agent has learned to grasp specific objects rather than simply performing the grasping motion unprompted. The success of the ML-Agent in performing the grasping motion with human control of the arm position in the global space demonstrates the expected generalization from static training to dynamic deployment. This idea can be extended to human-in-the-loop training of the ML-Agent in the future.

Previous endeavors into the integration of augmented reality with myo-electric prosthetics have included trials of both amputees and non-amputees as a principal method of data collection. A lack of testability on recovering patients represents a limitation of the framework proposed in this study; regardless of the ML-Agent’s standard of performance, its long-term effectiveness cannot be concluded without examination of its synergy with a human user. User experience depends significantly on subjective variables that isolated tests of the system cannot evaluate. Likewise, the aspects of the system that are most problematic for users cannot be identified as is. Important considerations, such as whether the visual feedback provided by an augmented reality framework alleviates phantom limb pain in amputees or if prior training with a simulated prosthetic translates to higher proficiency with a physical device, remain purely speculative. Future iterations of the system proposed in this study would benefit most from trials involving human participants, with the degree of skill transfer and their subjective levels of satisfaction being evaluated individually. Central nervous system integration of a framework similar to that demonstrated in [41] would additionally inform subsequent reiterations of the proposed system by providing empirical indicators of user affinity. The data collected from such an approach would be a more accurate predictor of patient rejection or acceptance, ideally making later investments into myoelectric prosthetic technology more worthwhile for healthcare providers and patients alike.

Considering the limited availability of the Myo armband, future iterations of the proposed framework would also benefit from trials with alternative open-source electromyography platforms. Potentially compatible tools include Arduino’s electromyographic sensors and signal processors [42], the FREEEMG surface probes manufactured by BTS Bioengineering [43], and OpenBCI’s Cyton bio-sensing board [44] or Galea headset, among others. Practical tests of different platforms are needed to achieve the best balance between cost and performance as well as to maximize the accessibility of this system for users. Furthermore, it would be beneficial to develop software capable of transmitting data directly over a network to the target headset application.

6. Conclusions and Future Work

This study has demonstrated the feasibility of integrating reinforcement learning with static and interactive dynamic VR scenes to train and use a prosthetic to perform a grasping gesture. The overall framework is successfully created by combining Unity ML-Agents, MRTK, and an IMU sensor from Thalmic Labs’ EMG armband. Two reward function approaches are considered: one using discrete Boolean statements to provide a reward upon contact and the other using a continuous distance-based reward. Of these two approaches, the continuous distance-based reward is found to be successful in allowing the model to converge, whereas the Boolean-based reward failed. It is concluded that the agent trained on a Boolean-based reward function failed due to the weak signal provided to the policy gradient. A reward that is contingent on ‘True’ and ‘False’ does not provide an adequate direction of improvement to the agent’s policy estimator, as supported by the results of RF1’s policy loss over time.

The foundation of the presented research may be generalized to various applications. By employing a static VR environment, the ML-Agent is trained under controlled, consequence-free conditions, thereby reducing resource demands and risks associated with mistraining. The dynamic VR scene serves as the final application that the ML-Agent is trained to complete. In many scenarios, the dynamic scene may be a physical application. Such applications can extend well beyond prosthetics, including the training of a robot to automatically perform a movement, industrial assembly operations, and many other automated movement tasks. In the case of prosthetic actuation, the connection of reinforcement learning to a VR environment using MRTK may allow for a simpler rehabilitation process in which the user adjusts to their EMG-controlled prosthetic while a trained ML-Agent assists them in performing some movements.

This study provides a foundation upon which multiple avenues of future research may be pursued. The presented work provides the methodology upon which the actuation of a hand gesture is completed. A similar framework may then be used to allow the arm to perform different gestures with modifications to the environment and reward function. To fully generalize an ML-Agent’s ability in performing these gestures, transfer learning may prove useful. An agent can use the knowledge from training to complete a simple hand gesture (such as grasping) to then learn to perform a more complex gesture in a reduced amount of time, which it could not have otherwise learned from scratch. To properly train the agent to perform different gestures, VR-scene design must be taken into account to create the optimal environment. The RL architecture may also need to be modified to account for transfer mechanisms across heterogeneous action and observation spaces when VR interactions differ for various gestures.

Another, less obvious, direction for future work is incorporating uncertainty quantification (UQ) into the ML-Agents’ training process. This becomes increasingly important for verification if a similar approach is undertaken and the trained agent is subsequently deployed on a physical system where the agent makes consequential decisions. Formal UQ is particularly important for verifying the consistency of the RL agent prior to deployment in a clinical setting. Under the Unity ML-Agents framework, UQ can be performed using deep ensemble or Monte Carlo dropout methods with minimal changes to the architecture of the deep reinforcement learning agent: deep ensembles require no architectural change, and Monte Carlo dropout only requires the addition of dropout layers. Given the deep network underlying the PPO algorithm used for the ML-Agent, both of these UQ methods are relevant.
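As an illustration of the deep ensemble idea, the sketch below summarizes the disagreement between several independently trained policies that are queried with the same observation. It is framework-agnostic and does not use any ML-Agents API; the function names, the example action vectors, and the disagreement threshold are all hypothetical.

```python
from statistics import mean, pstdev


def ensemble_uncertainty(member_actions):
    """Given one action vector per ensemble member (all for the same
    observation), return the per-dimension mean and standard deviation."""
    per_dimension = list(zip(*member_actions))
    means = [mean(values) for values in per_dimension]
    spreads = [pstdev(values) for values in per_dimension]
    return means, spreads


def is_uncertain(spreads, threshold=0.2):
    """Flag an observation when any action dimension disagrees by more
    than `threshold` (hypothetical value) across the ensemble."""
    return max(spreads) > threshold
```

A large spread in any joint command indicates that the ensemble members disagree about the appropriate action, which could be used to withhold automated actuation or request user confirmation before a physical device moves.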

Author Contributions

Conceptualization, C.D. and J.P.; methodology, C.D. and J.P.; software, C.D., J.P. and K.O.; investigation, C.D. and J.P.; resources, M.P.S.; data curation, C.D. and J.P.; writing—original draft preparation, C.D., J.P., K.O., A.G., C.T. and J.F.; writing—review and editing, J.P. and M.P.S.; supervision, M.P.S.; project administration, M.P.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data is available upon request through the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AR	Augmented Reality
AI	Artificial Intelligence
DoF	Degrees of Freedom
EMG	Electromyography
IMU	Inertial Measurement Unit
MRTK	Mixed Reality Toolkit
ML	Machine Learning
ML-Agent	Unity Machine Learning Agent
MSE	Mean Squared Error
PID	Proportional–Integral–Derivative
PPO	Proximal Policy Optimization
RL	Reinforcement Learning
TRPO	Trust Region Policy Optimization
UQ	Uncertainty Quantification
VR	Virtual Reality

References

Salminger, S.; Stino, H.; Pichler, L.H.; Gstoettner, C.; Sturma, A.; Mayer, J.A.; Szivak, M.; Aszmann, O.C. Current rates of prosthetic usage in upper-limb amputees—Have innovations had an impact on device acceptance? Disabil. Rehabil. 2022, 44, 3708–3713. [Google Scholar] [CrossRef] [PubMed]
Biddiss, E.A.; Chau, T.T. Upper limb prosthesis use and abandonment: A survey of the last 25 years. Prosthetics Orthot. Int. 2007, 31, 236–257. [Google Scholar] [CrossRef] [PubMed]
Bhaskaranand, K.; Bhat, A.K.; Acharya, K.N. Prosthetic rehabilitation in traumatic upper limb amputees (an Indian perspective). Arch. Orthop. Trauma Surg. 2003, 123, 363–366. [Google Scholar] [CrossRef] [PubMed]
Budinski, S.; Manojlović, V.; Knežević, A. Predictive factors for successful prosthetic rehabilitation after vascular transtibial amputation. Acta Clin. Croat. 2021, 60, 657. [Google Scholar] [CrossRef]
Li, L.; Ruan, H.; Liu, C.; Li, Y.; Shuang, Y.; Alù, A.; Qiu, C.W.; Cui, T.J. Machine-learning reprogrammable metasurface imager. Nat. Commun. 2019, 10, 1082. [Google Scholar] [CrossRef]
Oyebode, O.; Fowles, J.; Steeves, D.; Orji, R. Machine learning techniques in adaptive and personalized systems for health and wellness. Int. J. Hum. Comput. Interact. 2023, 39, 1938–1962. [Google Scholar] [CrossRef]
Nasteski, V. An overview of the supervised machine learning methods. Horizons B 2017, 4, 56. [Google Scholar] [CrossRef]
Wen, Y.; Si, J.; Brandt, A.; Gao, X.; Huang, H.H. Online reinforcement learning control for the personalization of a robotic knee prosthesis. IEEE Trans. Cybern. 2019, 50, 2346–2356. [Google Scholar] [CrossRef]
Gao, X.; Si, J.; Wen, Y.; Li, M.; Huang, H.H. Knowledge-guided reinforcement learning control for robotic lower limb prosthesis. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 754–760. [Google Scholar]
Mohammed, M.Q.; Chung, K.L.; Chyi, C.S. Review of deep reinforcement learning-based object grasping: Techniques, open challenges, and recommendations. IEEE Access 2020, 8, 178450–178481. [Google Scholar] [CrossRef]
Gao, Z.; Tang, R.; Chen, L.; Huang, Q.; He, J. Continuous shared control in prosthetic hand grasp tasks by Deep Deterministic Policy Gradient with Hindsight Experience Replay. Int. J. Adv. Robot. Syst. 2020, 17, 1729881420936851. [Google Scholar] [CrossRef]
Cilla, M.; Borgiani, E.; Martínez, J.; Duda, G.N.; Checa, S. Machine learning techniques for the optimization of joint replacements: Application to a short-stem hip implant. PLoS ONE 2017, 12, e0183755. [Google Scholar] [CrossRef]
Parajuli, N.; Sreenivasan, N.; Bifulco, P.; Cesarelli, M.; Savino, S.; Niola, V.; Esposito, D.; Hamilton, T.J.; Naik, G.R.; Gunawardana, U.; et al. Real-time EMG based pattern recognition control for hand prostheses: A review on existing methods, challenges and future implementation. Sensors 2019, 19, 4596.
Joshi, D.; Atreya, S.; Arora, A.; Anand, S. Trends in EMG based prosthetic hand development: A review. Indian J. Biomech. Spec. Issue 2009, 228–232.
Lee, S.; Kim, M.O.; Kang, T.; Park, J.; Choi, Y. Knit band sensor for myoelectric control of surface EMG-based prosthetic hand. IEEE Sens. J. 2018, 18, 8578–8586.
Yadav, D.; Veer, K. Recent trends and challenges of surface electromyography in prosthetic applications. Biomed. Eng. Lett. 2023, 13, 353–373.
Abdikenov, B.; Zholtayev, D.; Suleimenov, K.; Assan, N.; Ozhikenov, K.; Ozhikenova, A.; Nadirov, N.; Kapsalyamov, A. Emerging Frontiers in Robotic Upper-Limb Prostheses: Mechanisms, Materials, Tactile Sensors and Machine Learning-Based EMG Control. Sensors 2025, 25, 3892.
Salminger, S.; Roche, A.; Sturma, A.; Mayer, J.; Aszmann, O. Hand transplantation versus hand prosthetics: Pros and cons. Curr. Surg. Rep. 2016, 4, 8.
Marozas, V.; Petrenas, A.; Daukantas, S.; Lukosevicius, A. A comparison of conductive textile-based and silver/silver chloride gel electrodes in exercise electrocardiogram recordings. J. Electrocardiol. 2011, 44, 189–194.
Došen, S.; Cipriani, C.; Kostić, M.; Controzzi, M.; Carrozza, M.C.; Popović, D.B. Cognitive vision system for control of dexterous prosthetic hands: Experimental evaluation. J. Neuroeng. Rehabil. 2010, 7, 42.
Ryait, H.S.; Arora, A.; Agarwal, R. Study of issues in the development of surface EMG controlled human hand. J. Mater. Sci. Mater. Med. 2009, 20, 107–114.
Edwards, A.L.; Dawson, M.R.; Hebert, J.S.; Sutton, R.S.; Chan, K.M.; Pilarski, P.M. Adaptive switching in practice: Improving myoelectric prosthesis performance through reinforcement learning. Proc. MEC 2014, 14, 18–22.
Boschmann, A.; Neuhaus, D.; Vogt, S.; Kaltschmidt, C.; Platzner, M.; Dosen, S. Immersive augmented reality system for the training of pattern classification control with a myoelectric prosthesis. J. Neuroeng. Rehabil. 2021, 18, 25.
Microsoft. Mixed Reality Toolkit (MRTK). 2022. Available online: https://learn.microsoft.com/en-us/windows/mixed-reality/mrtk-unity/mrtk2/ (accessed on 27 August 2025).
Raj, R.; Ramakrishna, R.; Sivanandan, K.S. A real time surface electromyography signal driven prosthetic hand model using PID controlled DC motor. Biomed. Eng. Lett. 2016, 6, 276–286.
García-Ortíz, J.V.; Mora, M.C.; Cerdá-Boluda, J. Modeling the Dynamics of Prosthetic Fingers for the Development of Predictive Control Algorithms. Mathematics 2024, 12, 3236.
Gupta, A.; Schoen, M.P. Analysis of Simulated Autonomous Wheelchair Driving using GA-PID and RL based Controllers. In Proceedings of the 2025 Intermountain Engineering, Technology and Computing (IETC), Orem, UT, USA, 9–10 May 2025; pp. 1–6.
Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 2016, 17, 1–40.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
Taheri, H.; Hosseini, S.R.; Nekoui, M.A. Deep reinforcement learning with enhanced ppo for safe mobile robot navigation. arXiv 2024, arXiv:2405.16266.
Guo, Z.; Fu, H.; Wu, J.; Han, W.; Huang, W.; Zheng, W.; Li, T. Dynamic Task Planning for Multi-Arm Apple-Harvesting Robots Using LSTM-PPO Reinforcement Learning Algorithm. Agriculture 2025, 15, 588.
Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624.
Juliani, A.; Berges, V.P.; Teng, E.; Cohen, A.; Harper, J.; Elion, C.; Goy, C.; Gao, Y.; Henry, H.; Mattar, M.; et al. Unity: A General Platform for Intelligent Agents. arXiv 2020, arXiv:1809.02627.
Kaup, M.; Wolff, C.; Hwang, H.; Mayer, J.; Bruni, E. A review of nine physics engines for reinforcement learning research. arXiv 2024, arXiv:2407.08590.
Jia, Y.B. Quaternions and rotations. Com S 2008, 477, 15.
Liu, Y.; Jiang, L.; Liu, H.; Ming, D. A systematic analysis of hand movement functionality: Qualitative classification and quantitative investigation of hand grasp behavior. Front. Neurorobot. 2021, 15, 658075.
Eimer, T.; Lindauer, M.; Raileanu, R. Hyperparameters in reinforcement learning and how to tune them. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 9104–9149.
Lim, G.; Youn, H.; Kim, H.; Jeong, H.; Cho, J.; Lee, S.; Pak, C.; Kwon, S. Impact of mixed reality-based rehabilitation on muscle activity in lower-limb amputees: An EMG analysis. IEEE Access 2024, 12, 106415–106431.
Nota, C.; Thomas, P.S. Is the policy gradient a gradient? arXiv 2019, arXiv:1906.07073.
Jaman, G.G.; Schoen, M.P. Convolutional neural networks for time series data processing applicable to sEMG controlled hand prosthesis. Tech. Mech.-Eur. J. Eng. Mech. 2024, 44, 47–60.
Kim, H.; Miyakoshi, M.; Kim, Y.; Stapornchaisit, S.; Yoshimura, N.; Koike, Y. Electroencephalography reflects user satisfaction in controlling robot hand through electromyographic signals. Sensors 2022, 23, 277.
Wu, H.; Dyson, M.; Nazarpour, K. Arduino-based myoelectric control: Towards longitudinal study of prosthesis use. Sensors 2021, 21, 763.
Fedorová, L.; Rajt’úková, V.; Tóth, T.; Živčák, J. EMG system application in muscle parametrization of the upper extremities. In Proceedings of the 2014 IEEE 12th International Symposium on Applied Machine Intelligence and Informatics (SAMI), Herl’any, Slovakia, 23–25 January 2014; pp. 85–89.
Cardona-Álvarez, Y.N.; Álvarez-Meza, A.M.; Cárdenas-Peña, D.A.; Castaño-Duque, G.A.; Castellanos-Dominguez, G. A novel OpenBCI framework for EEG-based neurophysiological experiments. Sensors 2023, 23, 3763.

Figure 1. Progression of prosthetic modeling in Blender: (left) Armature, (middle) Mesh, and (right) Final rigged model.

Figure 2. Environment used to train the Unity ML-Agent to perform a grasping motion.

Figure 3. Two ray perception sensors (in white) shown at a perpendicular angle to the pinky finger.

Figure 4. Dynamic VR scene as visualized in the Unity game engine.

Figure 5. Dynamic VR scene visual descriptions.

Figure 6. Reinforcement learning agent training result metrics across timesteps using RF1.

Figure 7. Reinforcement learning agent training result metrics across timesteps using RF2.

Figure 8. Arm grasping cylinder in dynamic scene.

Table 1. Comparison of prosthesis types with respect to amputation type, abandonment rate, and follow-up period.

Prosthesis Type	Amputation Type	Abandonment Rate (Year Reported)	Follow-Up Period (Years)
Myoelectric	Upper-limb	47% as of 2020 [1]	4–24 [1]
Body-powered	Upper-limb	30% as of 2004 [2]	2–21 [2]
Cosmetic	Upper-limb	29% as of 2004 [2]	5–21 [2]
Lower-limb	Transtibial	>11–22% as of 2021 [4]	1 [4]
Lower-limb	Transfemoral	<11–22% as of 2021 [4]	1 [4]

Table 2. Reward amount according to joint and contact type.

Joint Type	Immediate Reward	Continuous Reward (per 0.2 s)
Base	0.015	0.0005
Middle	0.010	0.0004
End	0.005	0.0003
Thumb	0.020	0.0007
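
The values in Table 2 can be read as a small lookup table. The following is a minimal sketch, not the authors' implementation: it assumes the immediate reward is paid once when a joint first makes contact and the continuous reward is paid once per completed 0.2 s of sustained contact, with 0.2 s taken as 10 steps at the 50 Hz rate given in the abstract. The names `CONTACT_REWARDS` and `contact_reward` are hypothetical.

```python
# Reward values copied from Table 2: (immediate, continuous per 0.2 s).
CONTACT_REWARDS = {
    "base": (0.015, 0.0005),
    "middle": (0.010, 0.0004),
    "end": (0.005, 0.0003),
    "thumb": (0.020, 0.0007),
}

STEP_RATE_HZ = 50          # step rate stated in the abstract
CONTINUOUS_PERIOD_S = 0.2  # interval from the Table 2 header
STEPS_PER_PERIOD = round(STEP_RATE_HZ * CONTINUOUS_PERIOD_S)  # 10 steps


def contact_reward(joint_type: str, contact_steps: int) -> float:
    """Hypothetical helper: total reward for one joint held in contact.

    Assumption: the immediate reward is paid once on first contact and the
    continuous reward once per completed 0.2 s of sustained contact.
    """
    if contact_steps < 0:
        raise ValueError("contact_steps must be non-negative")
    if contact_steps == 0:
        return 0.0  # no contact, no reward
    immediate, continuous = CONTACT_REWARDS[joint_type]
    completed_periods = contact_steps // STEPS_PER_PERIOD
    return immediate + completed_periods * continuous


if __name__ == "__main__":
    # Thumb held against the object for 1 s (50 steps): 0.020 + 5 * 0.0007
    print(f"{contact_reward('thumb', 50):.4f}")
```

Under these assumptions, thumb contact earns the most per event and per interval, and the continuous term stays small next to the immediate term unless contact is held for several seconds.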


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
