Article

Deep Reinforcement Learning Control of a Hexapod Robot

Taesoo Kim, Minjun Choi, Seunguk Choi, Taeuan Yoon and Dongil Choi
1 Department of Mechanical System Engineering, Myongji University, Yongin 17058, Republic of Korea
2 Econet Center, Neuromeka, Seoul 04782, Republic of Korea
3 CLAVIL, H Businesspark, Seoul 05836, Republic of Korea
* Author to whom correspondence should be addressed.
Actuators 2026, 15(1), 33; https://doi.org/10.3390/act15010033
Submission received: 22 October 2025 / Revised: 23 December 2025 / Accepted: 24 December 2025 / Published: 5 January 2026
(This article belongs to the Section Actuators for Robotics)

Abstract

Recent advances in legged robotics have highlighted deep reinforcement learning (DRL)-based controllers for their robust adaptability to diverse, unstructured environments. While position-based DRL controllers achieve high tracking accuracy, they offer limited disturbance rejection, which degrades walking stability; torque-based DRL controllers can mitigate this issue but typically require extensive time and trial-and-error to converge. To address these challenges, we propose a Real-Time Motion Generator (RTMG). At each time step, RTMG kinematically synthesizes end-effector trajectories from target translational and angular velocities (yaw rate) and step length, then maps them to joint angles via inverse kinematics to produce imitation data. The RL agent uses this imitation data as a torque bias, which is gradually annealed during training to enable fully autonomous behavior. We further combine the RTMG-generated imitation data with a decaying action priors scheme to ensure both initial stability and motion diversity. The proposed training pipeline, implemented in NVIDIA Isaac Gym with Proximal Policy Optimization (PPO), reliably converges to the target gait pattern. The trained controller is TensorRT-optimized and runs at 50 Hz on a Jetson Nano; relative to a position-based baseline, torso oscillation is reduced by 24.88% in simulation and 21.24% on hardware, demonstrating the effectiveness of the approach.

1. Introduction

In deep reinforcement learning (DRL) for legged locomotion, the action space has traditionally relied on joint-position commands [1,2,3]. Position-based policies are simple to implement and exhibit strong trajectory-tracking performance; however, because they operate through high-level kinematic commands, the underlying actuator dynamics—such as torque limits, friction, drive delays, and compliance—are not explicitly modeled. As a result, position-controlled policies often generalize poorly when exposed to terrain irregularities, unmodeled contacts, or external disturbances [4,5,6]. These limitations are especially evident in legged systems walking over irregular footholds or transitioning between different ground-contact conditions, where robust stabilization is required at each step. This highlights the need for more expressive low-level control strategies that can directly exploit actuator dynamics.
To address these issues, torque-commanded DRL policies have been proposed [1], providing more robust interaction with the environment by leveraging low-level dynamics. However, training torque-based agents remains challenging: the absence of structured prior knowledge frequently leads to unstable early-stage exploration and extremely low sample efficiency, as described in [7]. In many cases, the robot collapses immediately due to unbounded torque outputs, preventing the collection of meaningful training data. This challenge is particularly relevant for hexapod robots, whose three-point support patterns provide reliable stability during normal walking but do not prevent unstable torques from hindering early learning. In this work, we mitigate these issues through imitation learning, allowing torque-based policies to gain initial stability and improved sample efficiency.
Imitation-learning approaches such as DeepMimic [8] stabilize training by providing reference trajectories that the agent is encouraged to imitate. A prominent extension of this concept to torque-based DRL is Decaying Action Priors (DecAP) [9], which supplies a precomputed set of imitation motions that act as a gradually diminishing torque bias. Although DecAP effectively mitigates early instability, its motion library is fixed and cannot adapt to changes in command input during training.
This limitation motivates our use of a Real-Time Motion Generator (RTMG), which synthesizes task-space and joint-space reference trajectories online at every time step. Unlike offline datasets, RTMG produces command-consistent trajectories that immediately reflect changes in user inputs such as walking speed, heading direction, or step length. These reference motions are incorporated into the training pipeline both as imitation targets and as PD-based torque biases. Their continuous, command-conditioned nature enables stable early-stage learning without restricting the agent to a finite set of predefined gaits.
However, the proposed framework also exhibits several limitations. The RTMG-generated reference motions rely on heuristic design choices and do not guarantee globally optimal gait patterns; furthermore, the method prescribes fixed swing–stance structures that may limit applicability to highly dynamic or non-periodic behaviors. As the torque bias decays during training, the fully learned policy must operate without structured guidance, which may reduce performance in tasks requiring long-horizon coordination or broader motion repertoires. These limitations highlight the need for future work that integrates real-time motion generation with more principled model-based priors or adaptive motion libraries.
In summary, this work presents a torque-based locomotion framework in which real-time, command-conditioned kinematic guidance is used to stabilize deep reinforcement learning and enable practical sim-to-real deployment. By generating imitation references online rather than relying on fixed offline motion libraries, the proposed Real-Time Motion Generator (RTMG) provides structured yet flexible guidance during early training while remaining fully compatible with reinforcement learning. Combined with a decaying torque-bias mechanism, this approach allows torque-based policies to converge reliably from scratch and to adapt continuously to changing user commands. The effectiveness of the framework is validated through both simulation and real-world experiments on a hexapod robot, demonstrating improved stability over a position-based baseline under terrain-induced disturbances.

2. Hexapod Robot Model

The hexapod robot used in this work has six legs, each consisting of three revolute joints, for a total of 18 degrees of freedom (DoF). The legs are symmetrically arranged along the sides of the robot body, and the kinematic structure is shown in Figure 1 and Figure 2.
Figure 1 shows the external appearance of the physical hexapod robot used in this work, and its overall leg arrangement and structural characteristics can be observed.
Figure 2 illustrates the configuration and rotation directions of the three joints in a single leg, and the joints correspond, in order, to the base, shoulder, and elbow.
The robot specifications are listed in Table 1.
The six legs are arranged laterally along the body. During locomotion, at least three legs remain in contact with the ground at all times, forming a support polygon that enables statically stable walking [10,11]. As a result, the robot provides high static stability even on unstructured terrain and exhibits robust locomotion against external disturbances as described in [12].
Additionally, to deploy the policy on real hardware, the robot relies solely on its onboard sensing: joint encoders that are integrated into each actuator and a single IMU module mounted at the center of the robot’s torso. No external tracking systems or additional environment-mounted sensors are used. Accordingly, the observation vector is composed of joint angles, joint velocities, the robot’s angular velocity measured by the IMU, and the gravity direction vector derived from the IMU’s quaternion output. By designing a policy that depends only on these built-in sensors, we greatly simplify the sensing requirements and enhance the feasibility of real-world deployment.

3. Real-Time Motion Generator

To secure stability during early reinforcement learning and to enable responsiveness to user commands, we introduce a Real-Time Motion Generator (RTMG) that synthesizes task-space and joint-space imitation data at every control step. In contrast to imitation-learning schemes that rely on static offline trajectories [8,9,13], RTMG dynamically generates references from the commanded translational and angular velocities. This ensures that the imitation targets remain consistent with the current user command while providing structured guidance that stabilizes torque exploration from the beginning of training.

3.1. Motivation and Advantages of RTMG

Traditional imitation-learning approaches rely on static offline datasets and therefore cannot adapt to changes in commanded velocities, gait parameters, or foothold conditions. In contrast, the proposed Real-Time Motion Generator (RTMG) synthesizes task-space and joint-space trajectories online at every control step, ensuring that the reference motions remain consistent with the current user command.
A key benefit of this real-time generation is its ability to produce diverse and smoothly varying behaviors. Whereas DecAP is inherently limited by the finite content of its offline motion library, RTMG continuously modulates stance timing, swing geometry, and foot trajectories according to the commanded velocities. Even small variations in user input yield distinct trajectories, effectively expanding the behavioral repertoire beyond what is possible with fixed datasets.
The resulting real-time modulation provides a significant advantage for training torque-based locomotion policies, which must exhibit smooth transitions, responsive stepping, and stable motion under a wide range of command conditions. The main advantages of RTMG are summarized as follows:
  • Immediate responsiveness: Foot trajectories and joint angles are recomputed every control cycle based on the latest commanded velocity, enabling smooth adaptation to continuously varying inputs.
  • Stability during early training: RTMG-generated joint references act as imitation targets and PD-based torque biases, preventing unstable torque spikes before the policy is fully trained.
  • Enhanced motion diversity: The continuous, command-conditioned nature of RTMG naturally broadens the space of attainable behaviors without requiring manual construction of large motion libraries.
  • Efficient parallel generation: All computations are vectorized and applied to thousands of agents (4096 in our implementation), allowing large-scale training in Isaac Gym.
As illustrated in Figure 3, the RTMG converts the commanded translational and angular velocities into stance–swing foot trajectories and corresponding joint-space references. These references are then used in two complementary ways during training: (1) as imitation targets for the imitation reward, and (2) as PD-based torque biases within the DecAP framework. By providing command-consistent and kinematically grounded reference motions at each control step, the RTMG significantly stabilizes the early stage of torque-based reinforcement learning while maintaining responsiveness to rapidly changing user inputs.

3.2. Task-Space Foot Trajectory Generation

At each simulation step, the gait cycle of each leg is divided into stance and swing phases. RTMG computes the desired foot position $p_i(t)$ of each leg $i$ as follows.
(1) Stance phase
During stance, the foot remains approximately in contact with the ground while the body moves according to the commanded linear velocity $v_{\mathrm{lin}} = [v_x, v_y]$ and yaw rate $\omega_z$. The foot displacement is the negative of the body motion and consists of both translational and rotational components.
For a leg located at $(x_i, y_i)$, the yaw-induced velocity is
$\Delta p_{\mathrm{rot}} = [\, -y_i\,\omega_z,\; x_i\,\omega_z,\; 0 \,].$
Thus, the stance-phase trajectory becomes
$p_i^{\mathrm{stance}}(t) = p_{i,0} - \left(v_{\mathrm{lin}} + \Delta p_{\mathrm{rot}}\right) t.$
In our implementation, this displacement is further centered about the midpoint of the stance duration to maintain symmetric step placement. This formulation corresponds directly to the vectorized implementation used for 4096 agents.
(2) Swing phase
When a leg enters swing, the objective is to move it smoothly from the previous stance endpoint to the next. A cubic Bézier curve is used:
$p_i^{\mathrm{swing}}(t) = (1-t)^3 P_0 + 3(1-t)^2 t\, P_1 + 3(1-t)\, t^2 P_2 + t^3 P_3,$
where $P_0$ and $P_3$ denote symmetric start/end points of the swing, and $P_1$, $P_2$ are elevated by a swing height $h$ (0.07 m in our implementation). This generates a natural foot-lifting motion with sufficient ground clearance.
The stance and swing computations correspond to the Python implementation used during training, ensuring consistency between the algorithmic description and the large-scale GPU-parallel execution.
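As a concrete illustration, the following NumPy sketch mirrors the stance and swing computations above; the function and variable names (stance_foot, swing_foot, leg_xy, the phase variable s) are illustrative rather than the authors' implementation, which is vectorized over 4096 Isaac Gym agents.

```python
# Minimal sketch of the RTMG stance/swing foot-trajectory computation described
# above. Names and the phase convention are assumptions for illustration.
import numpy as np

def stance_foot(p0, leg_xy, v_lin, omega_z, s, T_stance):
    """Stance-phase foot position at phase s in [0, 1].

    p0      : (3,) nominal foothold in the body frame
    leg_xy  : (2,) leg-base position (x_i, y_i) in the body frame
    v_lin   : (2,) commanded linear velocity [v_x, v_y]
    omega_z : commanded yaw rate
    """
    # Yaw-induced velocity at the leg position: omega x r = [-y*wz, x*wz, 0]
    dp_rot = np.array([-leg_xy[1] * omega_z, leg_xy[0] * omega_z, 0.0])
    v_body = np.array([v_lin[0], v_lin[1], 0.0]) + dp_rot
    # Foot moves opposite to the body; centered about the stance midpoint
    t = (s - 0.5) * T_stance
    return p0 - v_body * t

def swing_foot(P0, P3, s, h=0.07):
    """Swing-phase foot position on a cubic Bezier curve, phase s in [0, 1]."""
    P1 = P0 + np.array([0.0, 0.0, h])   # elevated control points give
    P2 = P3 + np.array([0.0, 0.0, h])   # ground clearance during swing
    return ((1 - s) ** 3 * P0 + 3 * (1 - s) ** 2 * s * P1
            + 3 * (1 - s) * s ** 2 * P2 + s ** 3 * P3)
```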

3.3. Coordinate Frames and Transformations

All task-space trajectories generated by RTMG are expressed in the body frame of the robot, whose origin is located at the center of the main chassis and whose axes follow the conventional forward–left–up convention. This frame is convenient for incorporating commanded velocities, as both the translational velocity $v_{\mathrm{lin}}$ and yaw rate $\omega_z$ are defined in the body frame.
However, inverse kinematics (IK) for each leg is computed in that leg’s local kinematic frame, whose origin is located at the leg’s shoulder joint. To convert a body-frame foot position $p_i^{\mathrm{body}}$ into the corresponding leg-base frame, we apply a fixed transformation:
$p_i^{\mathrm{leg}} = R_i\, p_i^{\mathrm{body}} - t_i,$
where $t_i$ and $R_i$ denote the translation and rotation from the body frame to the $i$-th leg base. These transforms are constant and obtained directly from the robot’s URDF model.
The resulting vector $p_i^{\mathrm{leg}} = [x, y, z]$ is then fed into the geometric IK equations described in the next subsection. This ensures that stance and swing trajectories—generated in a unified body-centric frame—are accurately mapped into each leg’s local joint coordinate system before computing joint angles.
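A minimal sketch of this body-to-leg mapping, assuming $R_i$ and $t_i$ are the fixed per-leg rotation and translation read from the URDF:

```python
# Body-frame to leg-base-frame mapping of Section 3.3 (illustrative names).
import numpy as np

def body_to_leg(p_body, R_i, t_i):
    """Map a body-frame foot target (3,) into the i-th leg-base frame."""
    return R_i @ p_body - t_i
```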

3.4. Joint-Space Reference Generation via Inverse Kinematics

For each task-space foot target, the 3-DoF joint-space angles $[\theta_1, \theta_2, \theta_3]$ are computed using standard geometric inverse kinematics. The task-space position of the foot is expressed in the local leg frame as $p = [x, y, z]$, where
  • $x$, $y$: horizontal coordinates of the foot relative to the leg base frame;
  • $z$: vertical coordinate (positive upward) relative to the leg base frame.
The leg geometry consists of three links:
  • $l_{bs}$: horizontal offset between the hip yaw joint and the hip pitch joint (“Base-to-shoulder” link);
  • $l_{se}$: length of the upper leg (“Shoulder-to-elbow” link);
  • $l_{ef}$: length of the lower leg (“Elbow-to-foot” link).
Using these definitions, the IK equations are:
$\theta_1 = \operatorname{arctan2}(y, x),$
$L = \sqrt{x^2 + y^2},$
$H_F = \sqrt{(L - l_{bs})^2 + z^2},$
$A_1 = \operatorname{arctan2}(L - l_{bs},\, z),$
$A_2 = \arccos\!\left(\frac{l_{ef}^2 - l_{se}^2 - H_F^2}{-2\, l_{se}\, H_F}\right),$
$\theta_2 = \frac{\pi}{2} - (A_1 + A_2),$
$\theta_3 = \frac{\pi}{2} - \arccos\!\left(\frac{H_F^2 - l_{ef}^2 - l_{se}^2}{-2\, l_{se}\, l_{ef}}\right).$
These joint-angle references serve two roles during training:
  • They provide the imitation reward that encourages physically consistent stepping motions.
  • They are converted into PD-based torque biases within the DecAP framework, stabilizing exploration during early training.
The geometric IK maintains numerical stability across the robot’s entire stepping workspace because all intermediate quantities remain well-defined within joint limits, and no singular configurations arise during typical swing and stance trajectories.
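For illustration, a minimal Python sketch of the geometric IK above is given below, using the link lengths from Table 1 converted to meters; the function name leg_ik and the sign conventions are assumptions, and the arccos arguments are written in the equivalent law-of-cosines form.

```python
import numpy as np

L_BS, L_SE, L_EF = 0.040, 0.100, 0.145   # link lengths from Table 1 (m)

def leg_ik(p_leg):
    """Return (theta1, theta2, theta3) for a foot target in the leg-base frame."""
    x, y, z = p_leg
    theta1 = np.arctan2(y, x)          # hip yaw
    L = np.hypot(x, y)                 # horizontal reach from the hip yaw axis
    HF = np.hypot(L - L_BS, z)         # shoulder-to-foot distance
    A1 = np.arctan2(L - L_BS, z)
    # Equivalent law-of-cosines forms of A2 and the elbow angle
    A2 = np.arccos((L_SE**2 + HF**2 - L_EF**2) / (2 * L_SE * HF))
    theta2 = np.pi / 2 - (A1 + A2)
    elbow = np.arccos((L_SE**2 + L_EF**2 - HF**2) / (2 * L_SE * L_EF))
    theta3 = np.pi / 2 - elbow
    return theta1, theta2, theta3
```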

3.5. Torque Bias in the DecAP Framework

To stabilize early exploration and accelerate convergence of the torque-based policy, we use a PD controller to convert the RTMG-generated joint references into an intuitive action bias. Following the formulation of Decaying Action Priors (DecAP), the bias term is computed at each time step t as
$\beta_t = K_p\left(\hat{q}_t - q_t\right) + K_d\left(\dot{\hat{q}}_t - \dot{q}_t\right),$
where $\hat{q}_t$ and $\dot{\hat{q}}_t$ are the reference joint position and velocity obtained from the inverse-kinematics output of RTMG, and $q_t$, $\dot{q}_t$ are the measured joint states. The proportional and derivative gains are fixed to
$K_p = 50, \quad K_d = 0.5.$
This PD term acts as an approximate torque that imitates the commanded motion and provides a physically meaningful initialization for torque exploration. The final torque sent to the motors is blended with the policy torque according to the DecAP decay rule:
$\tau_t = a_t + \gamma^{t/k}\,\beta_t, \qquad a_t = \pi_\tau(s_t),$
where $a_t$ is the action predicted by the policy network, $\beta_t$ is the PD-based bias term, and $\gamma^{t/k}$ is a gradually decreasing decay factor that determines how strongly the bias affects the torque output during training. In our implementation, we follow the original DecAP formulation with
$\gamma = 0.99, \quad k = 100,$
yielding the decay schedule
$\gamma^{t/k} = 0.99^{\,t/100}.$
This schedule ensures that early in training the agent receives a strong, stabilizing torque bias based on the RTMG-generated imitation motion, while the influence of the bias gradually diminishes, allowing the learned policy to take full autonomous control as training progresses.
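The bias computation and blending rule can be summarized in a short sketch; the gains and decay constants follow the values stated above, while the tensor names and shapes are illustrative.

```python
import torch

KP, KD = 50.0, 0.5       # PD gains from Section 3.5
GAMMA, K = 0.99, 100     # DecAP decay constants

def decap_torque(policy_action, q_ref, qd_ref, q, qd, step):
    """Blend the policy torque with a decaying PD bias toward the RTMG reference."""
    beta = KP * (q_ref - q) + KD * (qd_ref - qd)   # PD bias toward imitation motion
    decay = GAMMA ** (step / K)                    # 0.99 ** (t / 100)
    return policy_action + decay * beta
```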

4. Reinforcement Learning Strategy

This section describes the reinforcement-learning framework used to train the torque-control policy. Training was conducted in the NVIDIA Isaac Gym Preview 4.0 environment [14], which enables massively parallel GPU simulation. A total of 4096 agents were simulated in parallel, significantly improving sample efficiency. We adopt Proximal Policy Optimization (PPO) [15], an actor–critic algorithm well suited for continuous-control tasks. PPO stabilizes learning by constraining policy updates through a clipped surrogate objective, and Generalized Advantage Estimation (GAE) [16] is employed to obtain low-variance, temporally smoothed estimates of the advantage function. Concretely, GAE computes the advantage as an exponentially weighted sum of multi-step temporal-difference errors controlled by the parameter λ , which provides a tunable trade-off between bias and variance and enables more reliable credit assignment over long horizons.
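As a reference for the advantage computation, a minimal single-environment GAE sketch is shown below, using the γ and λ values from Table 4; this is the standard formulation rather than the authors' training code.

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a rollout of length T.

    rewards, dones : length-T arrays; values : length T + 1 (bootstrap value appended).
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_value = 0.0 if dones[t] else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]        # TD error
        last = delta + gamma * lam * (0.0 if dones[t] else last)   # recursive GAE
        adv[t] = last
    return adv
```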
To improve sim-to-real transfer, domain randomization [17,18] was applied during training by introducing uncertainty in both actuator-related and environment-related parameters. Actuator parameters, including motor strength scaling, friction, armature, and damping, were randomized around nominal values identified through response-matching between simulation and the real robot. In contrast, environment and rigid-body parameters, such as contact properties and base inertial properties, were randomized using heuristic but conservative ranges to reflect unmodeled variability and manufacturing tolerances. This mixed strategy balances physical fidelity for actuator dynamics with broader robustness to environmental uncertainty.
Figure 4 illustrates the overall reinforcement-learning architecture adopted in this work.

4.1. Observation Space, Network Architecture, and Hyperparameters

The observation vector consists of 65 dimensions (Table 2), including robot state information, commanded velocities, and the torque command issued by the policy at the current control step (after clipping). Including the current action in the observation provides the policy with a self-conditioning signal, allowing it to learn smoother torque profiles and avoid abrupt changes between consecutive control steps.
The action vector consists of an 18-dimensional joint-torque command, corresponding to the 18 actuated joints of the hexapod robot (Table 3). Each action directly specifies the desired torque applied at a joint, enabling the policy to exploit low-level actuator dynamics and compliance during contact-rich locomotion.
The policy and value networks are implemented as multilayer perceptrons (MLPs). Both networks receive the 65-dimensional observation vector as input and process it through three hidden layers of sizes 512, 256, and 128 using ELU activations. The policy network outputs an 18-dimensional torque command corresponding to the 18 actuated joints, while the critic network produces a scalar value estimate. This architecture balances expressive capacity with training stability and follows common design choices in torque-based locomotion RL.
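A plain PyTorch restatement of the stated actor and critic architectures (65-dimensional observation, 512-256-128 ELU hidden layers, 18-dimensional torque output) is sketched below; it reflects only the dimensions given above, not the training repository.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(512, 256, 128)):
    layers, last = [], in_dim
    for h in hidden:
        layers += [nn.Linear(last, h), nn.ELU()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)

actor = mlp(65, 18)    # 18-dimensional joint-torque command
critic = mlp(65, 1)    # scalar state-value estimate
```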
The hyperparameters used during training are summarized in Table 4. A higher learning rate is used during Stage 1 to accelerate initial convergence, while a lower learning rate is employed during Stage 2 for more stable fine-tuning on rough terrain.

4.2. Reward Functions

The reward terms used in Stage 1 and Stage 2 differ in their structure. Stage 1 includes imitation rewards supplied by the Real-Time Motion Generator (RTMG), which accelerates early learning and stabilizes the torque-based actions. In Stage 2, all imitation-related terms are removed, and the policy is refined using only tracking and regularization terms. This separation prevents overfitting to RTMG trajectories and enables adaptation to irregular terrain. Figure 5 and Figure 6 report the per-episode average return rather than the cumulative return.
  • Stage 1 Reward (Flat Terrain).
The reward terms for Stage 1 are summarized in Table 5. These terms include velocity tracking, imitation, and standard regularization components.
  • Stage 2 Reward (Rough Terrain).
In Stage 2, imitation terms are removed and the policy is refined using only tracking-oriented and regularization rewards. This encourages the policy to deviate from the nominal gait when necessary to handle irregular terrain. The reward terms used in Stage 2 are summarized in Table 6.
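For illustration, the exponential-kernel tracking terms shared by both stages (see Table 5 and Table 6) can be computed as in the sketch below; tensor names are assumptions, and only two of the listed terms are shown.

```python
import torch

def tracking_rewards(lin_vel, lin_vel_cmd, yaw_rate, yaw_rate_cmd):
    """Velocity-tracking terms (weights 1.0 each, scales from Table 5)."""
    lin_err = torch.sum((lin_vel - lin_vel_cmd) ** 2, dim=-1)
    r_lin = torch.exp(-lin_err / 0.02)
    r_ang = torch.exp(-(yaw_rate - yaw_rate_cmd) ** 2 / 0.04)
    return 1.0 * r_lin + 1.0 * r_ang
```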

4.3. Curriculum Learning Structure

Training follows a two-stage curriculum. In Stage 1, the policy is trained on flat terrain using Decaying Action Priors (DecAP) and imitation rewards generated by the RTMG. The PD-based torque bias accelerates convergence and suppresses unstable torque spikes early in training. In Stage 2, DecAP is disabled and all imitation rewards are removed. The policy learned in Stage 1 serves as initialization, and training continues on irregular terrain to improve robustness. The reward curves for both stages are shown in Figure 5 and Figure 6, demonstrating rapid initial improvement followed by stable fine-tuning on rough terrain.
As shown in Figure 5, the reward in Stage 1 rises rapidly during early training and then gradually decreases after reaching its peak. This behavior reflects the intended role of the Real-Time Motion Generator (RTMG) and the Decaying Action Priors (DecAP) framework. In the early phase of training, RTMG provides strong, command-consistent kinematic guidance that constrains torque exploration to physically meaningful regions, resulting in stable learning and high imitation-related rewards. As the decay factor progressively reduces the influence of the torque bias, the policy increasingly assumes responsibility for generating joint torques autonomously. Consequently, the learned behaviors are no longer required to closely match the RTMG reference trajectories, leading to a controlled reduction in imitation reward without indicating performance degradation.
Importantly, this transition highlights the primary contribution of the proposed framework: RTMG is not designed to produce optimal gait patterns, but rather to serve as a heuristic, real-time guidance prior that stabilizes torque-based reinforcement learning during its most vulnerable early stages. By gradually removing this guidance, the policy is encouraged to adapt beyond the prescribed kinematic structure while retaining the robustness acquired through guided exploration. This trade-off between structured guidance and autonomous learning underpins the practical effectiveness of the proposed approach for sim-to-real deployment.

Gait Definition in Stage 1

The RTMG generates a periodic tripod gait, in which three legs are in stance while the other three legs swing, and the stance/swing sets alternate every half-cycle. This tripod support pattern provides quasi-static stability during walking and is used as the nominal gait structure for generating command-conditioned reference trajectories during Stage 1 training.

4.4. Comparison with DecAP

Although DecAP provides a useful stabilizing torque prior in the early phase of training, it fundamentally relies on a fixed offline motion dataset. In contrast, our framework incorporates RTMG, which generates state- and command-conditioned trajectories online. Because DecAP is designed around reproducing a small set of predefined motions while RTMG produces continuously varying trajectories, the two frameworks differ substantially in both purpose and operational structure. As a result, a direct experimental comparison under identical conditions would not fully isolate their conceptual differences. Instead, the present work focuses on evaluating the behavioral adaptability and motion diversity enabled specifically by real-time trajectory generation.

5. Hardware Configuration

This section describes the hardware configuration used to deploy the trained reinforcement learning policy in real-world settings. The proposed system is centered on a low-power embedded platform, NVIDIA Jetson Nano, and the model was optimized to enable real-time execution of the torque-based control policy.
The robot frame was fabricated by 3D printing, and a 3-cell 11.1 V lithium-polymer battery with a capacity of 5000 mAh was used to sustain operation for over 30 min.
The implemented hardware is shown in Figure 7.

5.1. System Overview

As shown in Figure 8, the overall system is organized so that the Jetson Nano executes control computations and centrally manages motor control and sensor data. The control loop runs at 50 Hz, enabling real-time locomotion control, and all computations are performed on board the Jetson Nano.
The robot uses a total of 18 Dynamixel motors, with three motors per leg. They are connected in a daisy chain via RS485 communication, and control commands are sent directly from the Jetson Nano through a USB-to-RS485 converter (U2D2, ROBOTIS, Seoul, Republic of Korea). Feedback from the motors (position, velocity, etc.) is received synchronously.
We used the EBIMU-9DOFV5-R3 sensor (E2BOX, Hanam-si, Gyeonggi-do, Republic of Korea) to measure the robot’s orientation. The sensor communicates with the Jetson Nano via USB and outputs, in real time, the body-frame quaternion and 3-axis angular velocity (rad/s). From the quaternion, we compute the projected-gravity vector by rotating the world-frame gravity direction $[0, 0, -1]$ into the robot’s body frame. This projected-gravity representation provides a continuous and singularity-free encoding of base orientation, which is used in the observation vector along with the raw angular-velocity measurements. Although the IMU also provides Euler angles internally, they are not used for learning because they exhibit discontinuities and are redundant given the quaternion-based gravity projection.
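A minimal sketch of this projected-gravity computation is given below; the (w, x, y, z) quaternion ordering is an assumption about the sensor output.

```python
import numpy as np

def projected_gravity(quat_wxyz):
    """Rotate the world-frame gravity direction [0, 0, -1] into the body frame."""
    w, x, y, z = quat_wxyz
    # Rotation matrix of the body frame expressed in the world frame
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    g_world = np.array([0.0, 0.0, -1.0])
    return R.T @ g_world   # gravity direction seen from the body frame
```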
The IMU data stream is sampled at 50 Hz to match the control-loop frequency of the Jetson Nano. Internally, the sensor operates at a higher sampling rate with a built-in low-pass filter set to a cutoff frequency of 37.5 Hz, ensuring stable and noise-reduced measurements before they are consumed by the reinforcement learning policy.

5.2. Network Optimization

The trained policy was exported from PyTorch (version 1.12.0) [19] and optimized with TensorRT to float16 precision for real-time execution on the Jetson Nano using NVIDIA JetPack. The inference latency was measured at approximately 0.006 s, which is sufficient to operate the robot at 50 Hz.
The robot software running on the Jetson Nano is implemented in a mixed C++/Python architecture. Motor control is handled in C++ (compiled with the JetPack toolchain), whereas policy inference and sensor integration are performed in Python (version 3.8). The system is configured to start automatically at boot.
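A hedged sketch of one possible export path consistent with the description above (PyTorch to ONNX, then a float16 TensorRT engine built on the Jetson) is shown below; the file names, stand-in network, and build options are assumptions rather than the authors' deployment scripts.

```python
import torch
import torch.nn as nn

# Stand-in for the trained policy (65-dim observation -> 18-dim torque command).
policy = nn.Sequential(nn.Linear(65, 512), nn.ELU(),
                       nn.Linear(512, 256), nn.ELU(),
                       nn.Linear(256, 128), nn.ELU(),
                       nn.Linear(128, 18))
policy.eval()

dummy_obs = torch.zeros(1, 65)
torch.onnx.export(policy, dummy_obs, "policy.onnx",
                  input_names=["obs"], output_names=["torque"])

# On the Jetson (JetPack), a float16 TensorRT engine can then be built with, e.g.:
#   trtexec --onnx=policy.onnx --fp16 --saveEngine=policy_fp16.trt
```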

6. Experiments and Evaluation

This section presents experiments conducted to evaluate the proposed reinforcement-learning-based torque control policy in both simulation and real hardware environments. The performance of the proposed torque-based controller is compared with that of a position-based baseline that tracks reference motions using a PD controller.
The purpose of the experiments in this section is not to demonstrate optimal locomotion performance, but to evaluate whether the proposed real-time guidance framework enables stable and robust torque-based control under contact-induced disturbances. In particular, the experiments focus on torso stability during stair locomotion, where abrupt ground-height changes and unmodeled contacts frequently destabilize position-based controllers.

6.1. Experimental Environment

The simulation experiments were conducted in NVIDIA Isaac Gym Preview 4.0. The virtual environment was configured as a stair-shaped terrain (Figure 9) with a step height of 60 mm and a tread depth of 300 mm. In this environment, both a position-based controller and the proposed torque-based RL controller were commanded to descend the stairs at a forward velocity of 0.1 m/s.
For the position-based baseline, we implemented a joint-space PD controller that tracks the reference joint trajectories. The same PD gains were applied to all 18 actuated joints, with proportional and derivative gains set to $P = 50$ and $D = 0.5$.
These values were selected to provide stable tracking while avoiding excessive overshoot during contact transitions on uneven terrain. The same gains were used for all joints and were kept fixed throughout the experiments to ensure a consistent baseline for comparison. During walking, we measured the base angular velocity around the roll and pitch axes and evaluated performance by comparing their RMS values across the two controllers.
The hardware experiments were conducted using the 3D-printed hexapod robot shown in Figure 7, and, as in simulation, we compare a position-based controller with the proposed torque-based RL controller. The experimental terrain is the stair configuration in Figure 9, with a step height of 36 mm and a tread depth of 265 mm. In this setting, the robot was commanded to ascend the stairs at 0.15 m/s, and, as in the simulation experiments, performance was evaluated by comparing the RMS errors of the roll and pitch angular velocities.
Torso angular velocity was selected as the primary evaluation metric because it directly reflects the robot’s ability to attenuate contact-induced disturbances and maintain balance during stair traversal.
  • Note on command velocity differences.
The commanded forward velocities used in simulation (0.1 m/s) and in real hardware experiments (0.15 m/s) differ primarily due to the intrinsic discrepancies between the simulated and physical environments. In the simulation experiments, the stair geometry is steeper, with a 60 mm step height and a 300 mm tread, which produces larger instantaneous disturbances during foot–step transitions. Using a slightly lower commanded velocity allows clearer observation of the controller’s response under these stronger perturbations.
In contrast, the physical platform operates on a stair configuration with a smaller step height (36 mm) and shorter tread (265 mm), resulting in milder terrain-induced disturbances. Moreover, the real robot inherently exhibits actuator friction, compliance, and sensor latency, all of which naturally attenuate high-frequency impacts that would otherwise arise in a perfectly rigid simulation.
Because the magnitude and structure of disturbances differ between the two environments, the commanded velocities are adjusted accordingly so that both experiments operate within comparable dynamic regimes. This difference reflects the physical characteristics of each environment rather than a difference in controller capability.

6.2. Simulation Results

The simulation results are shown in Figure 10. Here, the blue curve represents the position-based controller, and the orange curve represents the proposed torque-based controller. Overall, the orange curve exhibits smaller oscillations.
Presenting the overall RMS error as a bar chart makes the difference more apparent, as shown in Figure 11. As evident from the RMS error plot, the torque-based control significantly reduces torso oscillation while descending the stairs compared with the position-based control. To quantify torso oscillation, we computed the 2-norm of the RMS errors along the x- and y-axes; with the position-based controller the value was 0.46253, whereas with the RL-based torque controller it was 0.34747. This indicates that the RL-based torque control is approximately 24.88% more stable than the position-based control.
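For reference, the reported oscillation metric can be reproduced as in the following sketch; the array names are illustrative.

```python
import numpy as np

def torso_oscillation(roll_rate, pitch_rate):
    """2-norm of the per-axis RMS angular velocities over a run."""
    rms_x = np.sqrt(np.mean(np.square(roll_rate)))
    rms_y = np.sqrt(np.mean(np.square(pitch_rate)))
    return np.hypot(rms_x, rms_y)

# Simulation values reported above: (0.46253 - 0.34747) / 0.46253 ≈ 0.2488 (24.88%)
```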
These results indicate that the torque-based policy effectively exploits joint-level compliance to absorb contact discontinuities, whereas the position-based controller amplifies disturbances by enforcing strict trajectory tracking.

6.3. Real-World Experiment Results

Similar trends were observed in the real-world experiments, as shown in Figure 12. As in the simulation, the blue curve represents the position-based controller, while the orange curve represents the proposed torque-based RL controller. Overall, the torque-based controller exhibits smaller oscillations. In contrast, the position-based controller shows large spikes in angular velocity at locations corresponding to step discontinuities, whereas the torque-based controller maintains stable walking behavior.
The difference is also evident in the RMS errors shown in Figure 13. For the position-based controller, the RMS error is slightly lower than that of the torque-based controller along the x-axis but considerably higher along the y-axis, indicating increased torso oscillation in the lateral direction. To quantify this effect, we computed the 2-norm of the RMS errors along the x- and y-axes; the resulting values were 0.12589 for the position-based controller and 0.09915 for the torque-based RL controller. This corresponds to an approximate 21.24% reduction in overall torso oscillation achieved by the proposed torque-based policy.
Figure 14 presents a sequence of snapshots comparing the real-world stair-climbing performance of the position-based controller (top row) and the proposed torque-based controller (bottom row). In the position-control case, the robot repeatedly fails to overcome the step. Because the PD-based position controller strictly tracks predefined joint trajectories, the legs do not yield when unexpected contact occurs. As the swing legs strike the step edge, the controller continues to command the nominal trajectory, causing the leg to drag along the obstacle instead of adapting to it. This leads to repeated partial climbs, loss of support in the stance legs, and eventual failure to ascend the terrain.
In contrast, the torque-based controller exhibits stable climbing behavior under the same discontinuity. Since the torque policy regulates joint effort rather than enforcing a fixed trajectory, the legs naturally comply when encountering the step edge. This compliance allows the swing legs to press against the obstacle and slide over it without destabilizing the body, while the stance legs adapt their supporting torques to maintain balance under sudden height changes. Notably, this behavior emerges even though all imitation-related guidance is removed during the final training stage, indicating that the learned policy has internalized the benefits of compliant, torque-level interaction.

7. Conclusions

In this work, we investigated how real-time, command-conditioned kinematic guidance can stabilize torque-based deep reinforcement learning and enable practical sim-to-real locomotion control on a hexapod robot.
Looking forward, several research directions naturally emerge from this study. First, while the present work primarily addressed periodic walking under moderate terrain variations, a key objective is to extend the controller toward a broader range of command-responsive behaviors. For a hexapod robot, this includes more precise trajectory following, smoother directional changes, adaptation to irregular footholds, and seamless transitions between walking patterns while maintaining stable three-point support. Achieving these capabilities will require the motion generator and the reinforcement-learning policy to incorporate more context-aware decision making and to handle variations in ground-contact geometry without compromising stability.
Second, although the current controller exhibits robustness on stair-like terrain, a more systematic evaluation of disturbance rejection is essential. Future work will investigate robustness against unexpected pushes, slips, and surface compliance changes, and will incorporate learning mechanisms that enable real-time adaptation to unknown terrain features. This includes exploring meta-learning or adaptive-domain-randomization strategies that continuously refine the policy based on real-world experience.
Third, we aim to generalize the proposed real-time imitation-learning framework to other legged platforms. Hexapods offer inherent static stability, but extending the method to quadrupeds or bipeds would require improved coordination strategies and more expressive torque policies. This direction may benefit from incorporating model-based components, such as learned dynamics priors or differentiable centroidal models, into the training pipeline.
Finally, from a systems perspective, we plan to investigate lightweight neural architectures and optimized deployment pipelines that can operate at higher control frequencies on resource-constrained hardware. Reducing model size, latency, and power consumption will be crucial for applying torque-based DRL controllers to portable or low-cost robotic platforms.
Overall, this study highlights the potential of real-time imitation-guided torque control as a practical prior for stabilizing learning-based locomotion.
Although integrating the original offline DecAP motion library into our RTMG-based torque-control pipeline lies outside the scope of this work, combining real-time trajectory generation with structured offline priors remains an intriguing direction for future research. A unified framework may leverage the advantages of both approaches and enable more systematic comparisons in subsequent studies.
In addition, recent studies have explored formal stability guarantees for neural-network-based controllers, particularly in the context of two-leg balancing and safety-critical locomotion. While such certification is beyond the scope of our present hexapod-focused study, incorporating stability-oriented analyses into torque-based DRL frameworks represents a promising future direction that may further strengthen the reliability of real-time imitation-driven controllers.
Supporting videos in Supplementary Materials demonstrating the sim-to-real deployment are available online.

Supplementary Materials

The following supporting videos are available online: Video S1: Sim-to-Real Deployment: https://youtu.be/oHufCEQsX60 (accessed on 23 December 2025). Video S2: Sim-to-Real Deployment: https://youtu.be/1PAzpE7zTjU (accessed on 23 December 2025).

Author Contributions

Conceptualization, T.K. and D.C.; methodology, T.K.; software, T.K. and M.C.; validation, M.C. and T.Y.; formal analysis, T.K.; investigation, D.C.; resources, T.K.; data curation, T.K. and M.C.; writing—original draft preparation, T.K.; writing—review and editing, D.C.; visualization, S.C.; supervision, D.C.; project administration, D.C.; funding acquisition, D.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2025-23525689).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Minjun Choi was employed by the company Neuromeka. Author Seunguk Choi was employed by the company CLAVIL. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Chen, S.; Zhang, B.; Mueller, M.W.; Rai, A.; Sreenath, K. Learning torque control for quadrupedal locomotion. In Proceedings of the 2023 IEEE-RAS 22nd International Conference on Humanoid Robots (Humanoids), Austin, TX, USA, 12–14 December 2023; pp. 1–8. [Google Scholar]
  2. Margolis, G.B.; Agrawal, P. Walk these ways: Tuning robot control for generalization with multiplicity of behavior. In Proceedings of the Conference on Robot Learning, Atlanta, GA, USA, 6–9 November 2023; pp. 22–31. [Google Scholar]
  3. Haarnoja, T.; Ha, S.; Zhou, A.; Tan, J.; Tucker, G.; Levine, S. Learning to walk via deep reinforcement learning. arXiv 2018, arXiv:1812.11103. [Google Scholar]
  4. Kumar, A.; Fu, Z.; Pathak, D.; Malik, J. RMA: Rapid motor adaptation for legged robots. arXiv 2021, arXiv:2107.04034. [Google Scholar] [CrossRef]
  5. Hwangbo, J.; Lee, J.; Dosovitskiy, A.; Bellicoso, D.; Tsounis, V.; Koltun, V.; Hutter, M. Learning agile and dynamic motor skills for legged robots. Sci. Robot. 2019, 4, eaau5872. [Google Scholar] [CrossRef]
  6. Jakobi, N.; Husbands, P.; Harvey, I. Noise and the reality gap: The use of simulation in evolutionary robotics. In Proceedings of the European Conference on Artificial Life, Granada, Spain, 4–6 June 1995; pp. 704–720. [Google Scholar]
  7. Peng, X.B.; Van De Panne, M. Learning locomotion skills using deep RL: Does the choice of action space matter? In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Los Angeles, CA, USA, 28–30 July 2017; pp. 1–13. [Google Scholar]
  8. Peng, X.B.; Abbeel, P.; Levine, S.; Van De Panne, M. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph. 2018, 37, 1–14. [Google Scholar] [CrossRef]
  9. Sood, S.; Sun, G.; Li, P.; Sartoretti, G. DecAP: Decaying action priors for accelerated learning of torque-based legged locomotion policies. arXiv 2023, arXiv:2303.00000. [Google Scholar]
  10. Bachega, R.P.; Neves, G.P.D.; Campo, A.B.; Angelico, B.A. Flexibility in hexapod robots: Exploring mobility of the body. IEEE Access 2023, 11, 110454–110471. [Google Scholar] [CrossRef]
  11. Isvara, Y.; Rachmatullah, S.; Mutijarsa, K.; Prabakti, D.E.; Pragitatama, W. Terrain adaptation gait algorithm in a hexapod walking robot. In Proceedings of the 2014 13th International Conference on Control, Automation, Robotics & Vision (ICARCV), Singapore, 10–12 December 2014; pp. 1735–1739. [Google Scholar]
  12. Qiu, Z.; Wei, W.; Liu, X. Adaptive gait generation for hexapod robots based on reinforcement learning and hierarchical framework. Actuators 2023, 12, 75. [Google Scholar] [CrossRef]
  13. Peng, X.B.; Coumans, E.; Zhang, T.; Lee, T.W.; Tan, J.; Levine, S. Learning agile robotic locomotion skills by imitating animals. arXiv 2020, arXiv:2004.00784. [Google Scholar] [CrossRef]
  14. Makoviychuk, V.; Wawrzyniak, L.; Guo, Y.; Lu, M.; Storey, K.; Macklin, M.; Hoeller, D.; Rudin, N.; Allshire, A.; Handa, A.; et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv 2021, arXiv:2108.10470. [Google Scholar] [CrossRef]
  15. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  16. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv 2015, arXiv:1506.02438. [Google Scholar]
  17. Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain randomization for transferring deep neural networks from simulation to the real world. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 23–30. [Google Scholar]
  18. Chebotar, Y.; Handa, A.; Makoviychuk, V.; Macklin, M.; Issac, J.; Ratliff, N.; Fox, D. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In Proceedings of the 2019 IEEE International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8973–8979. [Google Scholar]
  19. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
Figure 1. Hexapod 3D model and overall leg arrangement.
Figure 2. Kinematic structure and joint rotation directions for a single leg (base, shoulder, elbow).
Figure 3. Overall architecture of the Real-Time Motion Generator (RTMG).
Figure 4. Overall reinforcement-learning architecture.
Figure 5. Reward curve during Stage 1 (flat terrain).
Figure 6. Reward curve during Stage 2 (rough terrain).
Figure 7. Hexapod robot.
Figure 8. Hardware architecture of the hexapod robot.
Figure 9. Simulation snapshot.
Figure 10. Hexapod robot’s roll and pitch angular velocities in the simulation environment.
Figure 11. RMS error of angular velocities in the simulation environment.
Figure 12. Hexapod robot’s roll and pitch angular velocities in the real-world environment.
Figure 13. RMS error of angular velocities in the real-world environment.
Figure 14. Position control experiment snapshot.
Table 1. Robot hardware specifications.
Specification | Value
Base-to-shoulder link length | 40 mm
Shoulder-to-elbow link length | 100 mm
Elbow-to-foot link length | 145 mm
Body length (x) | 400 mm
Body width (y) | 300 mm
Body height (z) | 185 mm
Weight | 3.6 kg
Table 2. Observation space.
Observation | Dimension
Base Angular Velocity | 3
Projected Gravity | 3
User Commands | 3
Joint Positions | 18
Joint Velocities | 18
Action | 18
Table 3. Action space.
Action | Dimension
Joint Torques | 18
Table 4. Hyperparameters for training.
Parameter | Stage 1 (Flat) | Stage 2 (Rough)
Optimizer | Adam | Adam
Learning Rate | 1 × 10⁻³ | 1 × 10⁻⁴
Policy Updates | 2000 | 1000
Discount Factor | 0.99 | 0.99
GAE λ | 0.95 | 0.95
Table 5. Reward Terms, Equations, and Weights (Stage 1: Flat Terrain).
Term | Equation | Weight
Linear velocity tracking | $e^{-\|v - v_{\mathrm{cmd}}\|^2 / 0.02}$ | 1.0
Angular velocity tracking | $e^{-(\omega - \omega_{\mathrm{cmd}})^2 / 0.04}$ | 1.0
Orientation imitation | $e^{-\|\theta - \theta_{\mathrm{imit}}\|^2 / 0.1}$ | 1.5
Height imitation | $e^{-(z - z_{\mathrm{imit}})^2 / 0.025}$ | 1.5
Position imitation | $e^{-\|p - p_{\mathrm{imit}}\|^2 / 0.025}$ | 1.5
Base orientation | $\|g_{\mathrm{proj},xy}\|^2$ | −2.5
Collision | $\mathbb{1}_{\mathrm{collision}}$ | −1.0
Action smoothing | $\|a_t - a_{t-1}\|^2$ | −0.1
Action smoothing (2nd order) | $\|a_t - 2a_{t-1} + a_{t-2}\|^2$ | −0.05
Vertical velocity | $v_z^2$ | −5.0
Base angular velocity | $\|\omega_{\mathrm{base}}\|^2$ | −0.05
Torque | $\|\tau\|^2$ | −0.01
Joint motion | $\|\ddot{q}\|^2 + \|\dot{q}\|^2$ | −2.0 × 10⁻⁷
Foot slip | $\|v_{\mathrm{foot}}\|\,\mathbb{1}_{\mathrm{contact}}$ | −0.4
Table 6. Reward Terms, Equations, and Weights (Stage 2: Rough Terrain).
Term | Equation | Weight
Linear velocity tracking | $e^{-\|v - v_{\mathrm{cmd}}\|^2 / 0.02}$ | 1.0
Angular velocity tracking | $e^{-(\omega - \omega_{\mathrm{cmd}})^2 / 0.04}$ | 1.0
Base orientation | $\|g_{\mathrm{proj},xy}\|^2$ | −2.5
Collision | $\mathbb{1}_{\mathrm{collision}}$ | −1.0
Action smoothing | $\|a_t - a_{t-1}\|^2$ | −0.1
Action smoothing (2nd order) | $\|a_t - 2a_{t-1} + a_{t-2}\|^2$ | −0.05
Vertical velocity | $v_z^2$ | −5.0
Base angular velocity | $\|\omega_{\mathrm{base}}\|^2$ | −0.05
Torque | $\|\tau\|^2$ | −0.01
Joint motion | $\|\ddot{q}\|^2 + \|\dot{q}\|^2$ | −2.0 × 10⁻⁷
Foot slip | $\|v_{\mathrm{foot}}\|\,\mathbb{1}_{\mathrm{contact}}$ | −0.4
