Article

Deep Reinforcement Learning-Based Enhancement of Robotic Arm Target-Reaching Performance

by Ldet Honelign 1,†,‡, Yoseph Abebe 1,‡, Abera Tullu 2,‡ and Sunghun Jung 2,*
1 Department of Electromechanical Engineering, Addis Ababa Science and Technology University, Kilnto 16417, Ethiopia
2 Faculty of Smart Vehicle System Engineering, Chosun University, Dong-gu, Gwangju 61452, Republic of Korea
* Author to whom correspondence should be addressed.
† Current address: Department of Aerospace Engineering, Chosun University, Seoseok-dong, Gwangju 61452, Republic of Korea.
‡ These authors contributed equally to this work.
Actuators 2025, 14(4), 165; https://doi.org/10.3390/act14040165
Submission received: 25 January 2025 / Revised: 12 March 2025 / Accepted: 21 March 2025 / Published: 26 March 2025
(This article belongs to the Special Issue From Theory to Practice: Incremental Nonlinear Control)

Abstract

This work investigates the implementation of the Deep Deterministic Policy Gradient (DDPG) algorithm to enhance the target-reaching capability of the seven degree-of-freedom (7-DoF) Franka Panda robotic arm. A simulated environment is established by employing OpenAI Gym, PyBullet, and Panda Gym. After 100,000 training time steps, the DDPG algorithm attains a success rate of 100% and an average reward of −1.8. The actor loss and critic loss values are 0.0846 and 0.00486, respectively, indicating improved decision-making and accurate value function estimation. The simulation results demonstrate the efficiency of DDPG in improving robotic arm performance, highlighting its potential for improving robotic arm manipulation.

1. Introduction

Enhancing the precision of robotic arm manipulation remains a core research area in the pursuit of full robot autonomy in sectors such as industrial manufacturing and assembly. Given the maturity of image recognition and vision systems, the next logical step in this field is to achieve complete autonomy of robotic manipulators through the use of machine learning (ML), artificial neural networks (ANNs), and artificial intelligence (AI) as a whole [1,2]. Numerous attempts have been made to create intelligent robots that can take on tasks and execute them accordingly [3,4,5,6,7,8,9,10].
Although the development of a system with intelligence close to that of humans is still a long way off, robots that can perform specialized autonomous activities such as intelligent facial emotion recognition [11], flying in natural and man-made environments [12], driving a vehicle [13], swimming [14], carrying boxes and materials over different terrains [15], and picking up and placing objects [16,17] have already been realized.
However, some challenges must be overcome in order to achieve this goal. For instance, the mapping complexity from Cartesian space to the joint space of a robotic arm increases with the number of joints and linkages possessed by the manipulator. This is problematic, as the tasks assigned to a robotic arm are in Cartesian space, whereas the commands (velocity or torque) are in joint space [18,19]. Therefore, if full autonomy of robotic manipulators is the objective, then resolving the target-reaching problem represents one of the most crucial factors that must be addressed.
As described in [20,21], reinforcement learning is a type of machine learning that aims to maximize the outcome of a given system through a dynamic and autonomous trial-and-error approach. It shares a similar objective with human intelligence, which is characterized by the ability to perceive and retain information as knowledge to be used for environment-adaptive behavior. Trial-and-error search and delayed rewards are fundamental aspects of reinforcement learning, which allow the learning strategy to interact with the environment by performing actions and discovering rewards [22]. Through this approach, software agents and machines can automatically select the most effective course of action in a given circumstance, resulting in improved performance. Reinforcement learning offers a framework and set of tools for designing sophisticated and complex robotic behaviors [23,24]. Conversely, the challenges posed by robotic problems provide motivation for and validation of advances in reinforcement learning. Multiple previous works on the implementation of reinforcement learning in robotics demonstrate this fact [25,26,27,28,29].
In addition to the above-mentioned studies, several different approaches to robotic manipulation have been investigated, including model-based reinforcement learning [30], supervised learning-based approaches [31], and classical control-based techniques. Although these techniques have proven effective in various settings, they often come with limitations in terms of flexibility, sample efficiency, or generalization to unseen tasks. In contrast, the proposed method makes use of reinforcement learning to improve robotic manipulators’ capacity for self-adaptation to changing surroundings. Instead of depending only on preset models or rules, reinforcement learning allows robots to learn from their experiences and improve their performance. This paper further examines the performance of our approach compared to existing reinforcement learning methods in terms of efficiency, stability, and real-world applicability.
Recent advances in RL-based robotic manipulation have focused on improving sample efficiency, stability, and real-world generalization. The Soft Actor–Critic (SAC) algorithm is a model-free deep reinforcement learning algorithm capable of handling continuous action spaces, and has shown strong adaptability in domains such as energy-efficient building control, where it achieved a 9.7% reduction in energy costs while maintaining thermal comfort [32]. Similarly, Twin Delayed DDPG (TD3) has been improved through prioritized experience replay and transfer learning, significantly improving learning efficiency in mobile robot path planning, with success rates increased by 16.6% and training times reduced by 23.5% [33].
Although model-based approaches such as PILCO [34] offer superior sample efficiency, they often struggle with high-dimensional control problems such as robotic manipulation. In contrast, policy-gradient methods such as PPO and DDPG remain widely used thanks to their direct interaction-based learning capabilities. However, challenges such as hyperparameter sensitivity and the exploration–exploitation tradeoff persist and remain critical areas of research.
The key contributions of this work are as follows:
  • Implementation of the deep deterministic policy gradient algorithm in a simulated environment to improve the target-reaching performance of a robotic arm.
  • Integration of reinforcement learning with OpenAI Gym, PyBullet, and Panda Gym to create a flexible simulation framework.
  • Comparison of the DDPG (off-policy) and Proximal Policy Optimization (PPO) (on-policy) algorithms to assess their efficiency, stability, and performance in robotic manipulation tasks.
  • Evaluation of the performance and training behavior of the robotic agent using the DDPG and PPO algorithms in the context of a simulated target-reaching task.
This paper evaluates both algorithms based on their learning efficiency and applicability in the context of robotic manipulation simulations.

2. Modeling of Robotic Arm

Direct Kinematic Model of Robotic Arm

The rotation matrices in the Denavit–Hartenberg (DH) coordinate frame represent rotations about the X and Z axes. The rotation matrices for these axes are respectively

$$R_x(\delta_i) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & C_{\delta_i} & -S_{\delta_i} \\ 0 & S_{\delta_i} & C_{\delta_i} \end{bmatrix}$$

and

$$R_z(\phi_i) = \begin{bmatrix} C_{\phi_i} & -S_{\phi_i} & 0 \\ S_{\phi_i} & C_{\phi_i} & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

The homogeneous transformation matrix $T_{i-1}^{i}$, which accounts for both rotation and translation, is

$$T_{i-1}^{i} = \begin{bmatrix} R_{i-1}^{i} & p_{i-1}^{i} \\ 0 & 1 \end{bmatrix} = \mathrm{Rot}_z(\phi_i)\cdot \mathrm{Trans}_z(d_i)\cdot \mathrm{Trans}_x(a_i)\cdot \mathrm{Rot}_x(\delta_i)$$

$$T_{i-1}^{i} = \begin{bmatrix} C_{\phi_i} & -S_{\phi_i} C_{\delta_i} & S_{\phi_i} S_{\delta_i} & a_i C_{\phi_i} \\ S_{\phi_i} & C_{\phi_i} C_{\delta_i} & -C_{\phi_i} S_{\delta_i} & a_i S_{\phi_i} \\ 0 & S_{\delta_i} & C_{\delta_i} & d_i \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
where:
  • The rotation matrix ( R i 1 i ) represents the orientation of the i-th frame relative to the ( i 1 ) -th frame.
  • P i 1 i represents the center of the link frame with components ( P x , P y , and P z ).

Denavit–Hartenberg (DH) Axis Representation

The four DH parameters that describe the translation and rotation relationship between two consecutive coordinate frames are as follows:
  • d: distance from the previous frame to the current frame, measured along the previous Z-axis,
  • ϕ: angle between the X-axis of the previous frame and the X-axis of the current frame, measured about the previous Z-axis,
  • a: distance between the Z-axes of the previous and current frames, measured along the common normal (the current X-axis),
  • δ: twist angle between the Z-axis of the previous frame and the Z-axis of the current frame, measured about the current X-axis.
The Denavit–Hartenberg (DH) parameters of the Franka Panda robot shown in Figure 1 are provided in Table 1. From these DH parameters and the general homogeneous transformation matrix above, the individual joint transformations of the Franka Panda robot are derived as shown below:

$$T_0^1 = \begin{bmatrix} C_{\phi_1} & -S_{\phi_1} & 0 & 0 \\ S_{\phi_1} & C_{\phi_1} & 0 & 0 \\ 0 & 0 & 1 & 0.333 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad T_1^2 = \begin{bmatrix} C_{\phi_2} & 0 & -S_{\phi_2} & 0 \\ S_{\phi_2} & 0 & C_{\phi_2} & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$T_2^3 = \begin{bmatrix} C_{\phi_3} & 0 & S_{\phi_3} & 0 \\ S_{\phi_3} & 0 & -C_{\phi_3} & 0 \\ 0 & 1 & 0 & 0.316 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad T_3^4 = \begin{bmatrix} C_{\phi_4} & 0 & S_{\phi_4} & 0.0825\,C_{\phi_4} \\ S_{\phi_4} & 0 & -C_{\phi_4} & 0.0825\,S_{\phi_4} \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$T_4^5 = \begin{bmatrix} C_{\phi_5} & 0 & -S_{\phi_5} & -0.0825\,C_{\phi_5} \\ S_{\phi_5} & 0 & C_{\phi_5} & -0.0825\,S_{\phi_5} \\ 0 & -1 & 0 & 0.384 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad T_5^6 = \begin{bmatrix} C_{\phi_6} & 0 & S_{\phi_6} & 0 \\ S_{\phi_6} & 0 & -C_{\phi_6} & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$T_6^7 = \begin{bmatrix} C_{\phi_7} & 0 & S_{\phi_7} & 0.088\,C_{\phi_7} \\ S_{\phi_7} & 0 & -C_{\phi_7} & 0.088\,S_{\phi_7} \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

The overall transformation from the base frame to the end-effector frame is the product

$$T_0^7 = T_0^1 \cdot T_1^2 \cdot T_2^3 \cdot T_3^4 \cdot T_4^5 \cdot T_5^6 \cdot T_6^7 = \begin{bmatrix} r_{11} & r_{12} & r_{13} & p_x \\ r_{21} & r_{22} & r_{23} & p_y \\ r_{31} & r_{32} & r_{33} & p_z \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$
The orientation and position of the end effector are respectively provided by
$$\begin{aligned}
r_{11} &= S_{\phi_1} S_{\phi_3} C_{\phi_2} + C_{\phi_1} C_{\phi_3} C_{\phi_4} \\
r_{12} &= S_{\phi_1} C_{\phi_2} C_{\phi_3} C_{\phi_4} - S_{\phi_3} C_{\phi_1} \\
r_{13} &= S_{\phi_2} C_{\phi_1} \\
r_{21} &= S_{\phi_1} C_{\phi_3} C_{\phi_4} + S_{\phi_3} C_{\phi_1} C_{\phi_2} \\
r_{22} &= S_{\phi_1} S_{\phi_3} C_{\phi_2} + C_{\phi_1} C_{\phi_3} C_{\phi_4} \\
r_{23} &= S_{\phi_2} S_{\phi_1} \\
r_{31} &= S_{\phi_2} C_{\phi_3} C_{\phi_4} \\
r_{32} &= S_{\phi_2} S_{\phi_3} C_{\phi_4} \\
r_{33} &= C_{\phi_2}
\end{aligned}$$
and
$$\begin{aligned}
p_x &= 0.107\, S_{\phi_2} C_{\phi_1} + 0.088\, C_{\phi_1} + 0.384\, \big(S_{\phi_1} S_{\phi_3} C_{\phi_2} + C_{\phi_1} C_{\phi_3} C_{\phi_4}\big) + 0.316\, C_{\phi_1} C_{\phi_2} + 0.333\, C_{\phi_1} \\
p_y &= 0.107\, S_{\phi_2} S_{\phi_1} + 0.088\, S_{\phi_1} + 0.384\, \big(S_{\phi_1} C_{\phi_3} C_{\phi_4} + S_{\phi_3} C_{\phi_1} C_{\phi_2}\big) + 0.316\, S_{\phi_1} C_{\phi_2} + 0.333\, S_{\phi_1} \\
p_z &= 0.107\, C_{\phi_2} + 0.384\, S_{\phi_2} C_{\phi_3} C_{\phi_4} + 0.316\, S_{\phi_2} + 0.333\, S_{\phi_2}.
\end{aligned}$$
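To make the kinematic chain above concrete, the following is a minimal NumPy sketch of the classic DH transform and its composition for the parameters listed in Table 1. It follows the classic DH convention implied by the matrix form above (the official Franka documentation uses a modified DH convention), and the joint angles in the example are arbitrary illustrative values, so it should be read as an illustration rather than a validated kinematic model.

```python
import numpy as np

def dh_transform(phi, d, a, delta):
    """Classic Denavit-Hartenberg homogeneous transform T_{i-1}^{i}."""
    cp, sp = np.cos(phi), np.sin(phi)
    cd, sd = np.cos(delta), np.sin(delta)
    return np.array([
        [cp, -sp * cd,  sp * sd, a * cp],
        [sp,  cp * cd, -cp * sd, a * sp],
        [0.0,      sd,       cd,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])

# (a, d, delta) rows taken from Table 1; delta converted to radians.
DH_ROWS = [
    (0.0,     0.333,  0.0),
    (0.0,     0.0,   -np.pi / 2),
    (0.0,     0.316,  np.pi / 2),
    (0.0825,  0.0,    np.pi / 2),
    (-0.0825, 0.384, -np.pi / 2),
    (0.0,     0.0,    np.pi / 2),
    (0.088,   0.0,    np.pi / 2),
    (0.0,     0.107,  0.0),       # flange row (fixed, phi = 0)
]

def forward_kinematics(joint_angles):
    """Chain the per-joint transforms to obtain the end-effector pose."""
    phis = list(joint_angles) + [0.0]          # flange has no actuated angle
    T = np.eye(4)
    for phi, (a, d, delta) in zip(phis, DH_ROWS):
        T = T @ dh_transform(phi, d, a, delta)
    return T

# Example: end-effector position at an arbitrary (illustrative) configuration.
T = forward_kinematics([0.0, -0.3, 0.0, -1.8, 0.0, 1.5, 0.8])
print("End-effector position:", T[:3, 3])
```

Chaining the per-joint matrices in this way reproduces the composite transform whose last column gives the end-effector position used as the target-reaching objective.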

3. Deep Reinforcement Learning Algorithm Design

3.1. Policy Gradient Algorithm

The policy gradient theorem concerns the objective function $O(\varrho)$, the expected return obtained by following the parameterized policy $\pi_\varrho$. This objective can be written as a sum over all states $s$, weighted by the stationary state distribution $d^\pi(s)$, and over all actions $a$ of the action–value function $Q^\pi(s,a)$ multiplied by the policy $\pi_\varrho(a|s)$:
$$O(\varrho) = \sum_{s} d^\pi(s) \sum_{a} Q^\pi(s,a)\, \pi_\varrho(a|s)$$
where:
  • Objective function ($O(\varrho)$): the expected cumulative reward obtained by following the policy $\pi_\varrho$ in the given environment. The objective is optimized by adjusting the parameter $\varrho$ to maximize the expected cumulative reward.
  • Stationary state distribution ($d^\pi(s)$): the probability of being in a particular state $s$ under the policy $\pi_\varrho$; mathematically,
    $$d^\pi(s) = \lim_{t \to \infty} P(s_t = s \mid s_0, \pi_\varrho),$$
    i.e., the probability that $s_t = s$ when starting from $s_0$ and following policy $\pi_\varrho$ for $t$ time steps.
  • Action–value function ($Q^\pi(s,a)$): the expected cumulative reward obtained by taking action $a$ in state $s$ and following policy $\pi_\varrho$ thereafter.
  • Policy function ($\pi_\varrho(a|s)$): the probability of taking action $a$ in state $s$ under the policy parameterized by $\varrho$.
The policy gradient theorem provides a formula for the gradient of this objective with respect to the policy parameters. The formula involves the stationary distribution of the underlying Markov chain and is given by
$$\nabla_\varrho O(\varrho) \propto \sum_{s \in S} d^\pi(s) \sum_{a \in A} Q^\pi(s,a)\, \nabla_\varrho \pi_\varrho(a|s),$$
where $Q^\pi(s,a)$ is the state–action value function of the policy $\pi_\varrho$.

3.1.1. Derivation of the Policy Gradient Theorem

The derivation starts from the state-value function written as an expectation over actions and repeatedly applies the derivative product rule.
Step 1: Express the gradient of the state-value function:
$$\nabla_\varrho V^\pi(s) = \nabla_\varrho \sum_{a \in A} \pi_\varrho(a|s)\, Q^\pi(s,a).$$
Step 2: Move the derivative operator inside the summation:
$$\nabla_\varrho V^\pi(s) = \sum_{a \in A} \nabla_\varrho \big(\pi_\varrho(a|s)\, Q^\pi(s,a)\big).$$
Step 3: Apply the product rule to differentiate the product of $\pi_\varrho(a|s)$ and $Q^\pi(s,a)$ with respect to $\varrho$:
$$\nabla_\varrho V^\pi(s) = \sum_{a \in A} \big[\nabla_\varrho \pi_\varrho(a|s)\, Q^\pi(s,a) + \pi_\varrho(a|s)\, \nabla_\varrho Q^\pi(s,a)\big].$$
This expression is the starting point for the recursive expansion below.
We can expand $Q^\pi(s,a)$ in terms of the value of the future state by summing over all possible next states $s'$ and rewards $r$:
$$\nabla_\varrho V^\pi(s) = \sum_{a \in A} \Big[\nabla_\varrho \pi_\varrho(a|s)\, Q^\pi(s,a) + \pi_\varrho(a|s)\, \nabla_\varrho \sum_{s', r} P(s', r \mid s, a)\big(r + V^\pi(s')\big)\Big].$$
Because $P(s', r \mid s, a)$ and $r$ are not functions of $\varrho$, the derivative operator $\nabla_\varrho$ can be moved inside the summation over $s'$ and $r$ without affecting these terms:
$$\nabla_\varrho V^\pi(s) = \sum_{a \in A} \Big[\nabla_\varrho \pi_\varrho(a|s)\, Q^\pi(s,a) + \pi_\varrho(a|s) \sum_{s', r} P(s', r \mid s, a)\, \nabla_\varrho V^\pi(s')\Big].$$
Since the reward no longer appears in the remaining term, the summation over $r$ can be absorbed into the transition probability $P(s' \mid s, a) = \sum_{r} P(s', r \mid s, a)$, which is the probability of transitioning to state $s'$ from the state–action pair $(s, a)$:
$$\nabla_\varrho V^\pi(s) = \sum_{a \in A} \Big[\nabla_\varrho \pi_\varrho(a|s)\, Q^\pi(s,a) + \pi_\varrho(a|s) \sum_{s'} P(s' \mid s, a)\, \nabla_\varrho V^\pi(s')\Big].$$
Consider the following visitation sequence, and denote the probability of moving from state $s$ to state $x$ under policy $\pi_\varrho$ after $k$ steps by $\rho^\pi(s \to x, k)$:
$$s \xrightarrow{a \sim \pi_\varrho(\cdot|s)} s' \xrightarrow{a' \sim \pi_\varrho(\cdot|s')} s'' \xrightarrow{a'' \sim \pi_\varrho(\cdot|s'')} \cdots$$
  • For $k = 0$: $\rho^\pi(s \to s, k = 0) = 1$.
  • For $k = 1$, we consider every action that might be taken and add up the probabilities of reaching the desired state:
    $$\rho^\pi(s \to s', k = 1) = \sum_{a} \pi_\varrho(a|s)\, P(s' \mid s, a).$$
  • The goal is to move from $s$ to $x$ after $k + 1$ steps by following $\pi_\varrho$. The agent can first move from $s$ to an intermediate state $s'$ ($s' \in S$) after $k$ steps and then proceed to the final state $x$ in the last step. This allows the visitation probability to be updated recursively as
    $$\rho^\pi(s \to x, k + 1) = \sum_{s'} \rho^\pi(s \to s', k)\, \rho^\pi(s' \to x, 1).$$
After establishing the probability $\rho^\pi(s \to x, k)$ of transitioning from state $s$ to state $x$ after $k$ steps, the next step is to derive a recursive formulation for $\nabla_\varrho V^\pi(s)$.
To accomplish this, a function $\phi(s)$ is introduced and defined as
$$\phi(s) = \sum_{a \in A} \nabla_\varrho \pi_\varrho(a|s)\, Q^\pi(s,a),$$
where $\phi(s)$ is the sum of the policy gradients $\nabla_\varrho \pi_\varrho(a|s)$ weighted by the corresponding action–value function $Q^\pi(s,a)$.
The gradient can then be simplified as
$$\begin{aligned}
\nabla_\varrho V^\pi(s) &= \phi(s) + \sum_{a \in A} \pi_\varrho(a|s) \sum_{s'} P(s' \mid s, a)\, \nabla_\varrho V^\pi(s') \\
&= \phi(s) + \sum_{s'} \sum_{a \in A} \pi_\varrho(a|s)\, P(s' \mid s, a)\, \nabla_\varrho V^\pi(s') \\
&= \phi(s) + \sum_{s'} \rho^\pi(s \to s', 1)\, \nabla_\varrho V^\pi(s') \\
&= \phi(s) + \sum_{s'} \rho^\pi(s \to s', 1)\Big[\phi(s') + \sum_{s''} \rho^\pi(s' \to s'', 1)\, \nabla_\varrho V^\pi(s'')\Big].
\end{aligned}$$
Considering $s'$ as the midpoint of the path $s \to s''$,
$$= \phi(s) + \sum_{s'} \rho^\pi(s \to s', 1)\, \phi(s') + \sum_{s''} \rho^\pi(s \to s'', 2)\, \nabla_\varrho V^\pi(s'').$$
The expression for $\nabla_\varrho V^\pi(s)$ can be unrolled repeatedly in the same way:
$$= \phi(s) + \sum_{s'} \rho^\pi(s \to s', 1)\, \phi(s') + \sum_{s''} \rho^\pi(s \to s'', 2)\, \phi(s'') + \sum_{s'''} \rho^\pi(s \to s''', 3)\, \nabla_\varrho V^\pi(s''') + \cdots$$
$$\nabla_\varrho V^\pi(s) = \sum_{x \in S} \sum_{k=0}^{\infty} \rho^\pi(s \to x, k)\, \phi(x).$$
The derivative of the Q-value function, $\nabla_\varrho Q^\pi(s,a)$, has now been eliminated. Plugging this result into the objective function $O(\varrho)$ and starting from the initial state $s_0$,
$$\nabla_\varrho O(\varrho) = \nabla_\varrho V^\pi(s_0) = \sum_{s} \sum_{k=0}^{\infty} \rho^\pi(s_0 \to s, k)\, \phi(s).$$
Letting $\eta(s) = \sum_{k=0}^{\infty} \rho^\pi(s_0 \to s, k)$,
$$\nabla_\varrho O(\varrho) = \sum_{s} \eta(s)\, \phi(s).$$
We can normalize $\eta(s)$, $s \in S$, into a probability distribution:
$$\nabla_\varrho O(\varrho) = \Big(\sum_{s} \eta(s)\Big) \sum_{s} \frac{\eta(s)}{\sum_{s'} \eta(s')}\, \phi(s).$$
Because $\sum_{s} \eta(s)$ is a constant, the gradient of the objective function is proportional to the normalized $\eta(s)$ weighted by $\phi(s)$:
$$\nabla_\varrho O(\varrho) \propto \sum_{s} \frac{\eta(s)}{\sum_{s'} \eta(s')}\, \phi(s),$$
where $d^\pi(s) = \eta(s) / \sum_{s'} \eta(s')$ is the stationary distribution.
In the episodic case, the proportionality constant $\sum_{s} \eta(s)$ is the average length of an episode, while in the continuing case it is 1 [36]. Expanding $\phi(s)$ and using the identity $\nabla_\varrho \pi_\varrho(a|s) = \pi_\varrho(a|s)\, \nabla_\varrho \ln \pi_\varrho(a|s)$,
$$\begin{aligned}
\nabla_\varrho O(\varrho) &\propto \sum_{s \in S} d^\pi(s) \sum_{a \in A} Q^\pi(s,a)\, \nabla_\varrho \pi_\varrho(a|s) \\
&= \sum_{s \in S} d^\pi(s) \sum_{a \in A} \pi_\varrho(a|s)\, Q^\pi(s,a)\, \frac{\nabla_\varrho \pi_\varrho(a|s)}{\pi_\varrho(a|s)} \\
&= \mathbb{E}_\pi\big[\,Q^\pi(s,a)\, \nabla_\varrho \ln \pi_\varrho(a|s)\,\big],
\end{aligned}$$
where $\mathbb{E}_\pi$ denotes $\mathbb{E}_{s \sim d^\pi,\, a \sim \pi_\varrho}$, i.e., the expectation when both the state and action distributions follow the policy $\pi_\varrho$ (on-policy).
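The final expectation is what on-policy policy-gradient methods estimate from sampled trajectories. The following is a minimal PyTorch sketch of such an update, assuming a small hypothetical Gaussian policy network and a batch of placeholder transitions; in practice, $Q^\pi(s,a)$ would be supplied by sampled returns or a learned critic rather than the random values used here.

```python
import torch
import torch.nn as nn

# Minimal sketch of the on-policy gradient E_pi[Q(s,a) * grad log pi(a|s)].
# The policy is a tiny hypothetical Gaussian network; the batch is random data.
obs_dim, act_dim = 6, 3
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))
optimizer = torch.optim.Adam(list(policy.parameters()) + [log_std], lr=1e-3)

def update(states, actions, q_values):
    """One policy-gradient step: ascend E[Q(s,a) * log pi(a|s)]."""
    mean = policy(states)
    dist = torch.distributions.Normal(mean, log_std.exp())
    log_prob = dist.log_prob(actions).sum(dim=-1)      # log pi(a|s)
    loss = -(q_values.detach() * log_prob).mean()      # negative for gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy batch of transitions (random placeholders).
states = torch.randn(32, obs_dim)
actions = torch.randn(32, act_dim)
q_values = torch.randn(32)
update(states, actions, q_values)
```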

3.1.2. Off-Policy Gradient Algorithms

Because DDPG is an off-policy gradient algorithm, we first discuss this type of policy gradient algorithm in more detail.
The behavior policy used to collect samples is known and is denoted by $\alpha(a|s)$. The objective function sums the reward over the state distribution defined by this behavior policy:
$$O(\varrho) = \sum_{s \in S} d^\alpha(s) \sum_{a \in A} Q^\pi(s,a)\, \pi_\varrho(a|s) = \mathbb{E}_{s \sim d^\alpha}\Big[\sum_{a \in A} Q^\pi(s,a)\, \pi_\varrho(a|s)\Big],$$
where $d^\alpha(s)$ is the stationary distribution of the behavior policy $\alpha$, defined as $d^\alpha(s) = \lim_{t \to \infty} P(S_t = s \mid S_0, \alpha)$, and $Q^\pi$ is the action–value function estimated with respect to the target policy $\pi$. Given that the training observations are sampled as $a \sim \alpha(a|s)$, we can rewrite the gradient as
$$\nabla_\varrho O(\varrho) = \nabla_\varrho\, \mathbb{E}_{s \sim d^\alpha}\Big[\sum_{a \in A} Q^\pi(s,a)\, \pi_\varrho(a|s)\Big].$$
Using the derivative product rule,
$$\nabla_\varrho O(\varrho) = \mathbb{E}_{s \sim d^\alpha}\Big[\sum_{a \in A} \big(Q^\pi(s,a)\, \nabla_\varrho \pi_\varrho(a|s) + \pi_\varrho(a|s)\, \nabla_\varrho Q^\pi(s,a)\big)\Big].$$
Ignoring the second term $\pi_\varrho(a|s)\, \nabla_\varrho Q^\pi(s,a)$ and keeping only the first,
$$\begin{aligned}
\nabla_\varrho O(\varrho) &\approx \mathbb{E}_{s \sim d^\alpha}\Big[\sum_{a \in A} Q^\pi(s,a)\, \nabla_\varrho \pi_\varrho(a|s)\Big] \\
&= \mathbb{E}_{s \sim d^\alpha}\Big[\sum_{a \in A} \alpha(a|s)\, \frac{\pi_\varrho(a|s)}{\alpha(a|s)}\, Q^\pi(s,a)\, \frac{\nabla_\varrho \pi_\varrho(a|s)}{\pi_\varrho(a|s)}\Big] \\
&= \mathbb{E}_{\alpha}\Big[\frac{\pi_\varrho(a|s)}{\alpha(a|s)}\, Q^\pi(s,a)\, \nabla_\varrho \ln \pi_\varrho(a|s)\Big],
\end{aligned}$$
where $\pi_\varrho(a|s)/\alpha(a|s)$ is the importance weight. Because $Q^\pi$ is a function of the target policy, and consequently of the policy parameter $\varrho$, the derivative $\nabla_\varrho Q^\pi(s,a)$ should in principle also be computed via the product rule. In practice, however, it is challenging to compute $\nabla_\varrho Q^\pi(s,a)$ directly. Fortunately, policy improvement can still be guaranteed by approximating the gradient and ignoring the gradient of $Q^\pi$, which still achieves eventual convergence to a local optimum.
In summary, when applying the policy gradient in the off-policy setting, it is adjusted by a weighted sum in which the weight is the ratio of the target policy to the behavior policy, $\pi_\varrho(a|s)/\alpha(a|s)$ [36].
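As a toy illustration of this adjustment, the sketch below weights each logged sample by the ratio $\pi_\varrho(a|s)/\alpha(a|s)$ before applying the score-function gradient. The log-probabilities and Q-values are placeholder tensors standing in for the outputs of the target policy, the behavior policy, and a critic.

```python
import torch

# Sketch of the off-policy correction: weight each sample by pi(a|s) / alpha(a|s).
# log_pi and log_alpha would come from the target and behavior policies; here
# they are placeholder tensors for a batch of logged transitions.
log_pi = torch.randn(32, requires_grad=True)     # log pi_target(a|s) (differentiable)
log_alpha = torch.randn(32)                      # log alpha(a|s) from the behavior policy
q_values = torch.randn(32)                       # Q^pi(s, a) estimates

importance_weight = (log_pi - log_alpha).exp().detach()   # pi / alpha, treated as a constant
surrogate = -(importance_weight * q_values * log_pi).mean()
surrogate.backward()   # gradient follows E_alpha[(pi/alpha) * Q * grad log pi]
```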

3.2. Deterministic Policy Gradient (DPG)

The policy function π ( · | s ) is typically represented as a probability distribution over actions (A) based on the current state, making it inherently stochastic. However, in the case of the DPG, the policy is modeled as a deterministic decision, denoted as a = μ ( s ) . Instead of selecting actions probabilistically, the DPG directly maps states to specific actions without uncertainty. Let
  • $\rho_0(s)$ be the initial distribution over states,
  • $\rho^\mu(s \to s', k)$ be the visitation probability density at state $s'$ when starting from state $s$ and following policy $\mu$ for $k$ steps, and
  • $\rho^\mu(s')$ be the discounted state distribution, defined as
    $$\rho^\mu(s') = \int_{S} \sum_{k=1}^{\infty} \gamma^{k-1} \rho_0(s)\, \rho^\mu(s \to s', k)\, \mathrm{d}s.$$
Then, the objective function to be optimized is
$$O(\varrho) = \int_{S} \rho^\mu(s)\, Q\big(s, \mu_\varrho(s)\big)\, \mathrm{d}s.$$
According to the chain rule, we first take the gradient of $Q$ with respect to the action $a$ and then the gradient of the deterministic policy function $\mu_\varrho$ with respect to $\varrho$:
$$\nabla_\varrho O(\varrho) = \int_{S} \rho^\mu(s)\, \nabla_a Q^\mu(s,a)\, \nabla_\varrho \mu_\varrho(s)\big|_{a = \mu_\varrho(s)}\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^\mu}\big[\nabla_a Q^\mu(s,a)\, \nabla_\varrho \mu_\varrho(s)\big|_{a = \mu_\varrho(s)}\big].$$
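In practice, this gradient is obtained simply by backpropagating through the critic evaluated at $a = \mu_\varrho(s)$: the chain rule automatically combines $\nabla_a Q^\mu(s,a)$ with $\nabla_\varrho \mu_\varrho(s)$. The following PyTorch sketch illustrates this with small hypothetical actor and critic networks and a random placeholder batch.

```python
import torch
import torch.nn as nn

# Sketch of the deterministic policy gradient: backpropagating through the critic
# Q(s, mu(s)) realizes grad_a Q * grad_rho mu automatically via the chain rule.
obs_dim, act_dim = 6, 3
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(32, obs_dim)                      # placeholder batch
actions = actor(states)                                # a = mu_rho(s)
q_values = critic(torch.cat([states, actions], dim=-1))
actor_loss = -q_values.mean()                          # ascend E[Q(s, mu_rho(s))]

actor_opt.zero_grad()
actor_loss.backward()                                  # chain rule: dQ/da * dmu/drho
actor_opt.step()
```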

3.3. Deep Deterministic Policy Gradient (DDPG)

By combining DQN and DPG, DDPG leverages the power of deep neural networks to handle high-dimensional state spaces and complex action spaces, making it suitable for a wide range of reinforcement learning tasks. The original DQN works in discrete action spaces; DDPG extends it to continuous action spaces within the actor–critic framework while learning a deterministic policy. To improve exploration, an exploration policy $\mu'$ is constructed by adding noise $\mathcal{N}$ to the deterministic policy:
$$\mu'(s) = \mu_\varrho(s) + \mathcal{N}.$$
Moreover, the DDPG algorithm integrates a technique known as soft updates, also called conservative policy iteration, to update the parameters of the target networks of both the actor and the critic. This methodology uses a small parameter, denoted τ, that is much smaller than 1 (τ ≪ 1).
The soft update equation is formulated as follows:
$$\varrho' \leftarrow \tau \varrho + (1 - \tau)\, \varrho'.$$
Unlike the approach employed in DQN, in which the target network remains static for a fixed period and is then copied all at once, this guarantees that the target network values are altered gradually over time.
As shown in Algorithm 1, the DDPG algorithm follows an actor–critic framework in which the critic network estimates the action–value function while the actor network learns a deterministic policy. The training process begins by initializing the networks and the replay buffer. During each episode, actions are selected using the current policy with added exploration noise, and the resulting transitions are stored in the replay buffer. The critic is updated by minimizing the loss function, and the actor is optimized using the policy gradient. Furthermore, the target networks of both the actor and the critic are updated using soft updates to ensure stability in learning [37]. A minimal sketch of one such update step is provided after Algorithm 1.
Algorithm 1: Deep Deterministic Policy Gradient [37]
Actuators 14 00165 i001
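The following is a minimal PyTorch sketch of one DDPG update step consistent with the structure of Algorithm 1 (critic regression toward the bootstrapped target, actor ascent on $Q(s, \mu(s))$, and soft target updates). The network sizes, optimizer settings, and the replay batch are illustrative placeholders rather than the exact configuration used in this work.

```python
import copy
import torch
import torch.nn as nn

# Minimal sketch of one DDPG update step: critic regression, actor ascent,
# and soft target updates. All quantities below are illustrative placeholders.
obs_dim, act_dim, gamma, tau = 6, 3, 0.95, 0.005
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)   # target networks
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, done):
    # Critic update: regress Q(s,a) toward r + gamma * Q'(s', mu'(s')).
    with torch.no_grad():
        q_next = critic_t(torch.cat([s_next, actor_t(s_next)], dim=-1))
        target = r + gamma * (1.0 - done) * q_next
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: maximize Q(s, mu(s)) via the deterministic policy gradient.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target updates: theta' <- tau * theta + (1 - tau) * theta'.
    for net, net_t in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)

# Toy replay-buffer batch (placeholders).
batch = (torch.randn(32, obs_dim), torch.randn(32, act_dim),
         torch.randn(32, 1), torch.randn(32, obs_dim), torch.zeros(32, 1))
ddpg_update(*batch)
```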

4. Results and Discussion

4.1. Simulation Environment and Training Setup

The deep reinforcement learning agent was trained in Python using a Jupyter Notebook on Ubuntu Linux 20.04. The training process took approximately 5.415 h on a system with an Intel graphics card, 8 GB of RAM, and a 1.9 GHz processor.
To simulate the target-reaching task of the Franka Panda robotic arm, we utilized the PandaReach-v2 environment from the panda_gym library, which is built on PyBullet. This physics-based environment enables realistic interactions for reinforcement learning.
To train the Deep Deterministic Policy Gradient (DDPG) algorithm, we used Stable Baselines3, an optimized deep reinforcement learning library. The training process spanned 100,000 time steps, allowing the robotic agent to learn an optimal policy for precise target reaching. The integration of OpenAI Gym, PyBullet, and Panda Gym provided an efficient and flexible simulation framework for evaluating reinforcement learning techniques in robotic manipulation. A sketch of this setup is shown below.
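The following sketch outlines this setup using Stable Baselines3 and panda-gym. It assumes panda-gym v2 with the old Gym API and a compatible Stable Baselines3 release; exact keyword arguments (in particular the HER replay-buffer options, which Table 2 lists only as "rb kwargs") may differ between library versions and are shown here with typical values.

```python
import gym
import numpy as np
import panda_gym  # registers the PandaReach-v2 environment
from stable_baselines3 import DDPG, HerReplayBuffer
from stable_baselines3.common.noise import NormalActionNoise

# Sketch of the training setup described above; hyperparameters follow Table 3.
env = gym.make("PandaReach-v2")

n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = DDPG(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    # Assumed HER options; the paper does not list the exact replay_buffer_kwargs.
    replay_buffer_kwargs=dict(n_sampled_goal=4, goal_selection_strategy="future"),
    gamma=0.95,
    tau=0.005,
    batch_size=2048,
    buffer_size=100_000,
    learning_rate=1e-3,
    action_noise=action_noise,
    verbose=1,
    tensorboard_log="./ddpg_panda_reach_logs/",
)

model.learn(total_timesteps=100_000)
model.save("ddpg_panda_reach")
```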

4.2. Hyperparameter Selection and Initial Search

The parameters listed in Table 2 were selected for this work.

4.2.1. Batch Size Comparison

Upon completion of the training process, it was observed that there was a minimal disparity in the success rate (Figure 2) and the cumulative reward (Figure 3) between the different batch sizes. Furthermore, the decrease in critic loss values (Figure 4) and actor loss values (Figure 5) indicated an improvement in the ability of both the actor and critic networks to approximate the optimal policy and value functions. Although the success rate and cumulative reward were similar, the enhanced convergence demonstrated by the lower losses with a batch size of 2048 suggests a more efficient learning process and a potentially higher quality of the policies learned within this setting. However, as Figure 6 shows, the training speed varied widely with batch size: 512 was the fastest, followed by 1024, with 2048 the slowest. Despite its slower training speed, a batch size of 2048 was used because it achieved better stability and convergence. Larger batch sizes generally reduce gradient variance, leading to a smoother optimization process and improved overall model quality.

4.2.2. Learning Rate Comparison

After the training process, the observations shown in Figure 7 and Figure 8 revealed a minimal disparity in the cumulative reward and success rate achieved with the two learning rates. However, the learning rate of 2 × 10⁻⁴ showed a slightly better success rate than the learning rate of 1 × 10⁻³. In contrast, when using the learning rate of 1 × 10⁻³, a notable decrease in both the actor loss (Figure 9) and critic loss (Figure 10) was observed, indicating improved policy and value estimation. Despite comparable cumulative rewards and success rates, the reduced losses with the learning rate of 1 × 10⁻³ signified enhanced convergence and a potentially more efficient learning process, suggesting that the agent may have acquired higher-quality policies.

4.3. Selection of Optimal Hyperparameters and Extended Training of DDPG Agent

After comparing the hyperparameters and performing a detailed analysis of the associated results, we selected the hyperparameters with the most promising performance, as shown in Table 3. Following this, an extended training phase of 100,000 time steps was initiated, as summarized in Table 4. This extended training phase served as the fundamental training stage, which is elaborated in the following section.

4.3.1. Improvement in Cumulative Reward and Success Rate

The mean episode reward (Figure 11) improved from −49.2 in the first loop to −1.8 in the last loop, while the success rate (Figure 12) increased from 0 in the first loop to 1 in the last loop.

4.3.2. Frames per Second (FPS)

The training speed (Figure 13) decreased from 18 frames per second (FPS) in the first loop to 5 FPS in the last loop.

4.3.3. Improvement in Actor and Critic Losses

The actor loss (Figure 14) decreased from 0.625 in the first loop to 0.0846 in the last loop, while the critic loss (Figure 15) decreased from 0.401 in the first loop to 0.00486 in the last loop.

4.4. Comparing DDPG and PPO: Off-Policy vs. On-Policy Reinforcement Learning Algorithms

We trained the Proximal Policy Optimization (PPO) on-policy reinforcement learning algorithm and compared its performance with that of the Deep Deterministic Policy Gradient (DDPG) off-policy algorithm. The results are shown in Figure 16 and Figure 17. As shown in Figure 16, the cumulative reward achieved by DDPG was −1.8, while the cumulative reward obtained by PPO was −50. This comparison indicates that DDPG performs better than PPO in this particular scenario in terms of cumulative reward. A sketch of the PPO baseline setup is provided below.
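A corresponding sketch of the PPO baseline, trained for the same 100,000-step budget with Stable Baselines3, is shown below. Default PPO hyperparameters are assumed, since the paper does not list the PPO settings.

```python
import gym
import panda_gym  # registers the PandaReach-v2 environment
from stable_baselines3 import PPO

# Sketch of the on-policy PPO baseline trained for the same budget as DDPG.
env = gym.make("PandaReach-v2")
ppo_model = PPO("MultiInputPolicy", env, verbose=1)
ppo_model.learn(total_timesteps=100_000)
ppo_model.save("ppo_panda_reach")
```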

5. Conclusions

In this study, we applied the Deep Deterministic Policy Gradient (DDPG) algorithm to train a robotic arm manipulator, specifically the Franka Panda robotic arm, for a target-reaching task. The objective of this task was to enable the robotic arm to accurately reach a designated target position. The DDPG algorithm was chosen because of its effectiveness in continuous control tasks and its ability to learn policies in high-dimensional action spaces. Using a combination of deep neural networks and an actor–critic architecture, the DDPG algorithm was able to approximate the optimal policy for the robotic arm. When comparing the performance of the PPO and DDPG algorithms after training for 100,000 time steps, PPO achieved a mean episode reward of −50, indicating that the agent struggled to achieve positive rewards on average. Despite training at a relatively fast speed of 561 FPS, these results suggest that the PPO algorithm faced challenges in finding successful strategies for the given task.
In contrast, the DDPG algorithm demonstrated superior performance, with a mean episode reward of −1.8. It achieved a success rate of 1, indicating consistent success in achieving the desired outcomes. Despite a slower training speed of 5 FPS, the DDPG algorithm showed the ability to learn and improve its policy effectively over time. Based on these results, DDPG outperformed PPO in terms of cumulative reward and success rate in the given scenario.
From the above results, it can be concluded that PPO performs worse than DDPG mainly because of the way each algorithm learns. PPO is an on-policy method, which means that it learns only from its most recent interactions and discards past experiences. Although this promotes stability, it makes learning less sample-efficient because the agent constantly needs new data to improve. In contrast, DDPG is an off-policy algorithm, meaning that it stores past experiences in a replay buffer and continues to learn from them. This makes training more sample-efficient, which is especially useful for complex tasks such as robotic arm control.
Another key difference is how the two algorithms update their policies. PPO uses a clipping mechanism to prevent large policy changes, which helps with stability but can slow down learning, especially in high-dimensional action spaces. Meanwhile, DDPG uses deterministic updates, allowing it to fine-tune actions more precisely.
Although DDPG had a much slower training speed, it ultimately learned better strategies, leading to a higher success rate and better rewards compared to PPO.

Author Contributions

The authors’ contributions in this manuscript are stated as follows: conceptualization, L.H. and Y.A.; methodology, L.H.; software, L.H.; validation, A.T. and S.J.; formal analysis, L.H.; investigation, A.T., S.J.; resources, L.H.; data curation, L.H.; writing—original draft preparation, L.H.; writing—review and editing, Y.A., A.T., and S.J.; visualization, A.T. and L.H.; supervision, Y.A. and S.J.; project administration, S.J.; funding acquisition, S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by a research fund from Chosun University, 2024 under Grant K208419006.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All of the required data in this work are available from the authors and can be provided upon request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study, in the collection, analysis, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

References

  1. Mohsen, S.; Behrooz, A.; Roza, D. Artificial intelligence, machine learning and deep learning in advanced robotics, a review. Cogn. Robot. 2023, 3, 54–70. [Google Scholar]
  2. Santos, A.A.; Schreurs, C.; da Silva, A.F.; Pereira, F.; Felgueiras, C.; Lopes, A.M.; Machado, J. Integration of Artificial Vision and Image Processing into a Pick and Place Collaborative Robotic System. J. Intell. Robot. Syst. 2024, 110, 159. [Google Scholar] [CrossRef]
  3. Xinle, Y.; Minghe, S.; Lingling, S. Adaptive and intelligent control of a dual-arm space robot for target manipulation during the post-capture phase. Aerosp. Sci. Technol. 2023, 142, 108688. [Google Scholar]
  4. Abayasiri, R.A.M.; Jayasekara, A.G.B.P.; Gopura, R.A.R.C.; Kazuo, K. Intelligent Object Manipulation for a Wheelchair-Mounted Robotic Arm. J. Robot. 2024, 2024, 1516737. [Google Scholar] [CrossRef]
  5. Yi, J.; Zhang, H.; Mao, J.; Chen, Y.; Zhong, H.; Wang, Y. Review on the COVID-19 pandemic prevention and control system based on AI. Eng. Appl. Artif. Intell. 2022, 114, 105184. [Google Scholar] [CrossRef] [PubMed]
  6. Tang, Q.; Liang, J.; Zhu, F. A comparative review on multi-modal sensors fusion based on deep learning. Signal Process. 2023, 213, 109165. [Google Scholar] [CrossRef]
  7. Doewes, R.I.; Purnama, S.K.; Nuryadin, I.; Kurdhi, N.A. Human AI: Social robot decision-making using emotional AI and neuroscience. In Emotional AI and Human-AI Interactions in Social Networking; Garg, M., Koundal, D., Eds.; Academic Press: Cambridge, MA, USA, 2024; pp. 255–286. [Google Scholar] [CrossRef]
  8. Martin, J.G.; Muros, F.J.; Maestre, J.M.; Camacho, E.F. Multi-robot task allocation clustering based on game theory. Robot. Auton. Syst. 2023, 161, 104314. [Google Scholar]
  9. Nguyen, M.N.T.; Ba, D.X. A neural flexible PID controller for task-space control of robotic manipulators. Front. Robot. AI 2023, 9, 975850. [Google Scholar]
  10. Laurenzi, A.; Antonucci, D.; Tsagarakis, N.G.; Muratore, L. The XBot2 real-time middleware for robotics. Robot. Auton. Syst. 2023, 163, 104379. [Google Scholar] [CrossRef]
  11. Zhang, L.; Jiang, M.; Farid, D.; Hossain, M.A. Intelligent Facial Emotion Recognition and Semantic-Based Topic Detection for a Humanoid Robot. Expert Syst. Appl. 2013, 40, 5160–5168. [Google Scholar]
  12. Floreano, D.; Wood, R.J. Science, Technology, and the Future of Small Autonomous Drones. Nature 2015, 521, 460–466. [Google Scholar] [CrossRef] [PubMed]
  13. Chen, T.D.; Kockelman, K.M.; Hanna, J.P. Operations of a Shared, Autonomous, Electric Vehicle Fleet: Implications of Vehicle & Charging Infrastructure Decisions. Transp. Res. Part Policy Pract. 2016, 94, 243–254. [Google Scholar]
  14. Chen, Z.; Jia, X.; Riedel, A.; Zhang, M. A Bio-Inspired Swimming Robot. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; p. 2564. [Google Scholar]
  15. Ohmura, Y.; Kuniyoshi, Y. Humanoid Robot Which Can Lift a 30kg Box by Whole Body Contact and Tactile Feedback. In Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA, USA, 29 October–2 November 2007; pp. 1136–1141. [Google Scholar]
  16. Kappassov, Z.; Corrales, J.-A.; Perdereau, V. Tactile Sensing in Dexterous Robot Hands. Robot. Auton. Syst. 2015, 74, 195–220. [Google Scholar] [CrossRef]
  17. Gomes, N.; Martins, F.; Lima, J.; Wörtche, H. Deep Reinforcement Learning Applied to a Robotic Pick-and-Place Application; Springer: Berlin/Heidelberg, Germany, 2021; pp. 251–265. [Google Scholar] [CrossRef]
  18. Aryslan, M.; Yevgeniy, L.; Troy, H.; Richard, P. A deep reinforcement-learning approach for inverse kinematics solution of a high degree of freedom robotic manipulator. Robotics 2022, 11, 44. [Google Scholar]
  19. Serhat, O.; Enver, T.; Erkan, Z. Adaptive Cartesian space control of robotic manipulators: A concurrent learning based approach. J. Frankl. Inst. 2024, 361, 106701. [Google Scholar]
  20. Ladosz, P.; Weng, L.; Kim, M.; Oh, H. Exploration in deep reinforcement learning: A survey. Inf. Fusion 2022, 85, 1–22. [Google Scholar] [CrossRef]
  21. AlMahamid, F.; Grolinger, K. Reinforcement learning algorithms: An overview and classification. In Proceedings of the 2021 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Virtual, 12–17 September 2021; pp. 1–7. [Google Scholar]
  22. Thrun, S.; Littman, M.L. Reinforcement Learning: An Introduction. AI Mag. 2000, 21, 103. [Google Scholar]
  23. Smruti, A. Deep reinforcement learning for robotic manipulation-the state of the art. arXiv 2017, arXiv:1701.08878. Available online: https://api.semanticscholar.org/CorpusID:9103068 (accessed on 21 February 2024).
  24. Jens, K.; Bagnell, J.A.; Jan, P. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 2013, 32, 1238–1274. [Google Scholar]
  25. Tianci, G. Optimizing robotic arm control using deep Q-learning and artificial neural networks through demonstration-based methodologies: A case study of dynamic and static conditions. Robot. Auton. Syst. 2024, 181, 104771. [Google Scholar]
  26. Andrea, F.; Elisa, T.; Nicola, C.; Stefano, G. Robotic Arm Control and Task Training through Deep Reinforcement Learning. arXiv 2020, arXiv:2005.02632v1. [Google Scholar]
  27. Shianifar, J.; Schukat, M.; Mason, K. Optimizing Deep Reinforcement Learning for Adaptive Robotic Arm Control. arXiv 2024, arXiv:2407.02503v1. [Google Scholar]
  28. Roman, P.; Jakub, K.; Radomil, M.; Martin, J. Deep-Reinforcement-Learning-Based Motion Planning for a Wide Range of Robotic Structures. Computation 2024, 12, 116. [Google Scholar] [CrossRef]
  29. Wanqing, X.; Yuqian, L.; Weiliang, X.; Xun, X. Deep reinforcement learning based proactive dynamic obstacle avoidance for safe human-robot collaboration. Manuf. Lett. 2024, 41, 1246–1256. [Google Scholar]
  30. Chua, K.; Calandra, R.; McAllister, R.; Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  31. Levine, S.; Pastor, P.; Krizhevsky, A.; Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 2018, 37, 421–436. [Google Scholar]
  32. Kathirgamanathan, A.; Mangina, E.; Finn, D.P. Development of a Soft Actor Critic deep reinforcement learning approach for harnessing energy flexibility in a Large Office building. Energy AI 2021, 5, 100101. [Google Scholar] [CrossRef]
  33. Li, P.; Chen, D.; Wang, Y.; Zhang, L.; Zhao, S. Path planning of mobile robot based on improved TD3 algorithm in dynamic environment. Heliyon 2024, 10, e32167. [Google Scholar] [CrossRef]
  34. Salas-Pilco, S.; Xiao, K.; Hu, X. Correction: Salas-Pilco et al. Artificial Intelligence and Learning Analytics in Teacher Education: A Systematic Review. Educ. Sci. 2023, 13, 897. [Google Scholar] [CrossRef]
  35. Franka Emika Documentation. Control Parameters Documentation. 2024. Available online: https://frankaemika.github.io/docs/control_parameters.html (accessed on 21 February 2024).
  36. Weng, L. Policy Gradient Algorithms. Lil’Log. 2018. Available online: https://lilianweng.github.io/posts/2018-04-08-policy-gradient (accessed on 21 February 2024).
  37. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Tassa, Y.; Silver, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar] [CrossRef]
Figure 1. Denavit–Hartenberg axis representation of the Franka Panda robot [35].
Actuators 14 00165 g001
Figure 2. Success rate for different batch sizes (512, 1024, and 2048).
Actuators 14 00165 g002
Figure 3. Cumulative mean reward for different batch sizes (512, 1024, and 2048).
Actuators 14 00165 g003
Figure 4. Critic loss for different batch sizes (512, 1024, and 2048).
Actuators 14 00165 g004
Figure 5. Actor loss for different batch sizes (512, 1024, and 2048).
Actuators 14 00165 g005
Figure 6. Frames per second for different batch sizes (512, 1024, and 2048).
Actuators 14 00165 g006
Figure 7. Success rate for different learning rates (2 × 10⁻⁴ and 1 × 10⁻³).
Actuators 14 00165 g007
Figure 8. Cumulative mean reward for different learning rates (2 × 10⁻⁴ and 1 × 10⁻³).
Actuators 14 00165 g008
Figure 9. Actor loss for different learning rates (2 × 10⁻⁴ and 1 × 10⁻³).
Actuators 14 00165 g009
Figure 10. Critic loss for different learning rates (2 × 10⁻⁴ and 1 × 10⁻³).
Actuators 14 00165 g010
Figure 11. Improved cumulative mean reward.
Actuators 14 00165 g011
Figure 12. Improved success rate.
Actuators 14 00165 g012
Figure 13. Training speed in frames per second (FPS).
Actuators 14 00165 g013
Figure 14. Improved actor loss.
Actuators 14 00165 g014
Figure 15. Improved critic loss.
Actuators 14 00165 g015
Figure 16. Cumulative mean reward.
Actuators 14 00165 g016
Figure 17. Training speed.
Actuators 14 00165 g017
Table 1. Denavit–Hartenberg parameters of the Franka Panda robot.

Joint | a (m) | d (m) | δ (deg) | ϕ (rad)
1 | 0 | 0.333 | 0 | ϕ1
2 | 0 | 0 | −90 | ϕ2
3 | 0 | 0.316 | 90 | ϕ3
4 | 0.0825 | 0 | 90 | ϕ4
5 | −0.0825 | 0.384 | −90 | ϕ5
6 | 0 | 0 | 90 | ϕ6
7 | 0.088 | 0 | 90 | ϕ7
Flange | 0 | 0.107 | 0 | 0
Table 2. Parameters and hyperparameters.

Parameter | Value
Policy | MultiInputPolicy
Replay buffer class | HerReplayBuffer
Verbose | 1
Gamma | 0.95
Tau (τ) | 0.005
Batch size | 512, 1024, 2048
Buffer size | 100,000
Replay buffer kwargs | rb kwargs
Learning rate | 1 × 10⁻³, 2 × 10⁻⁴
Action noise | Normal action noise
Policy kwargs | Policy kwargs
Tensorboard log | Log path
Table 3. Selected parameters and hyperparameters.

Parameter | Value
Policy | MultiInputPolicy
Replay buffer class | HerReplayBuffer
Verbose | 1
Gamma | 0.95
Tau (τ) | 0.005
Batch size | 2048
Buffer size | 100,000
Replay buffer kwargs | rb kwargs
Learning rate | 1 × 10⁻³
Action noise | Normal action noise
Policy kwargs | Policy kwargs
Table 4. Training metrics at time steps 200 and 100,000.

Metric | At 200 time steps | At 100,000 time steps
rollout/ Episode length | 50 | –
rollout/ Episode mean reward | −49.2 | −1.8
rollout/ Success rate | 0 | 1
time/ Episodes | 4 | 2000
time/ FPS | 18 | 5
time/ Time elapsed (s) | 10 | 19,505
time/ Total time steps | 200 | 100,000
train/ Actor loss | 0.625 | 0.0846
train/ Critic loss | 0.401 | 0.00486
train/ Learning rate | 0.001 | 0.001
train/ Number of updates | 50 | 99,850
