A Motion Control Strategy for a Blind Hexapod Robot Based on Reinforcement Learning and Central Pattern Generator

Wang, Lei; Li, Ruiwen; Wang, Xiaoxiao; Gao, Weidong; Chen, Yiyang

doi:10.3390/sym17071058

Open AccessArticle

A Motion Control Strategy for a Blind Hexapod Robot Based on Reinforcement Learning and Central Pattern Generator

by

Lei Wang

¹,

Ruiwen Li

²,

Xiaoxiao Wang

²,

Weidong Gao

^3,* and

Yiyang Chen

^4,*

¹

School of Automation, Wuxi University, Wuxi 214105, China

²

School of Automation, Nanjing University of Information Science and Technology, Nanjing 210044, China

³

School of Electronic Information Engineering, Wuxi University, Wuxi 214105, China

⁴

School of Mechanical and Electrical Engineering, Soochow University, Suzhou 215137, China

^*

Authors to whom correspondence should be addressed.

Symmetry 2025, 17(7), 1058; https://doi.org/10.3390/sym17071058

Submission received: 9 June 2025 / Revised: 27 June 2025 / Accepted: 2 July 2025 / Published: 4 July 2025

(This article belongs to the Section Engineering and Materials)

Download

Browse Figures

Versions Notes

Abstract

Hexapod robots that use external sensors to sense the environment are susceptible to factors such as light intensity or foggy weather. This effect leads to a drastic decrease in the motility of the hexapod robot. This paper proposes a motion control strategy for a blind hexapod robot. The hexapod robot is symmetrical and its environmental sensing capability is obtained by collecting proprioceptive signals from internal sensors, allowing it to pass through rugged terrain without the need for external sensors. The motion gait of the hexapod robot is generated by a central pattern generator (CPG) network constructed by Hopf oscillators. This gait is a periodic gait controlled by specific parameters given in advance. A policy network is trained in the target terrain using deep reinforcement learning (DRL). The trained policy network is able to fine-tune specific parameters by acquiring information about the current terrain. Thus, an adaptive gait is obtained. The experimental results show that the adaptive gait enables the hexapod robot to stably traverse various complex terrains.

Keywords:

hexapod robot; proprioception; CPG; DRL; adaptive gain

1. Introduction

As robots become increasingly integrated into human life, mobile robots are widely used to replace humans in performing tasks in challenging environments [1,2,3]. Legged robots demonstrate superior traversability in challenging environments compared to traditional wheeled and tracked robots [4,5,6,7]. Among legged robots, hexapods are favored for their static stability and structural simplicity, and their symmetrical structure produces an excellent biological gait.

The current environmental perception capability of hexapod robots is mainly obtained by acquiring 2D or 3D images using external sensors such as visual sensors or LiDAR. Reference [8] proposed a robot–terrain interference model based on visual sensors, joint torque, and inertial measurement unit (IMU) information, which significantly improved the off-road capability of legged robots. The authors of [9] designed a hexapod robot called “Weaver”, which uses terrain perception based on stereo vision and proprioception for adaptive control, and visual–inertial odometry for waypoint-based autonomous navigation. However, for hexapod robots that frequently perform tasks such as field transportation and reconnaissance, low visibility and adverse weather conditions—such as fog or rain—significantly impair the performance of external sensors.

To address the above problems, motion control strategies based on proprioception can be used. Zayev extended a two-stage architecture for hexapod robots that utilizes only proprioceptive signals for locomotion over complex terrain [10]. Expert policies (physical motion controllers) modeled by artificial neural networks are trained independently in these separate scenarios using deep reinforcement learning, and then autonomously reused in the inference process using recursive neural network terrain classifiers conditioned on state history to give adaptive gaits appropriate to the current terrain. Bachega designed a hexapod robot for autonomous locomotion in unstructured terrain that relies entirely on proprioception to adapt its body posture. The robot incorporates a whole-body kinematic model that allows for both open- and closed-chain control, demonstrating excellent ability to traverse irregular terrain [11]. Inspired by the fact that animals rely on proprioception and vestibular feedback to adapt their gait capabilities, Ruscelli proposed a modular framework for autonomous locomotion that relies on force sensing and inertial information, and robots using this framework are able to pass through rows of unevenly spaced obstacles [12].

This paper proposes a motion control strategy for a blind hexapod robot. The hexapod robot under this strategy uses only internal sensors (encoders and IMU) to sense spatial variations in the joints and body, thus indirectly sensing the ruggedness of the surrounding terrain. The control methods for the hexapod robot are based on the bio-inspired Central Pattern Generator (CPG) method, which allows the hexapod robot to move its gait without any balance loss [13,14]. In addition, deep reinforcement learning (DRL)-based approaches have proven effective in addressing the challenges of continuous motion space in robot control [15,16,17]. In order to enable a blind hexapod robot to walk stably on rugged terrain, a motion control strategy is proposed in this paper. In this framework, the robot requires only its own joint and pose information, and it trains a strategy network—based on DRL—to control the CPG network and generate adaptive gaits. The resulting strategy network achieves higher rewards and converges faster than networks trained with traditional reinforcement learning.

2. CPG Control Network

The CPG control method is a rhythmic motion control approach that emulates the self-sustaining oscillations of biological lower-level neural circuits, generating periodic gaits without relying on sensory feedback [18]. Unlike traditional control methods, complex CPG models can autonomously select gait patterns based on simple external inputs. This biological inspiration is evident in vertebrates, where electrical stimulation of the midbrain locomotor region (MLR) activates locomotion via CPG circuits. Consequently, robotic implementations of CPG-based control must incorporate gait planning and modulation mechanisms, as varying stimulation intensities can trigger automatic gait transitions. Various types of CPG models have been used for robotics, such as connection models, vector mapping, and coupled oscillator systems, all of which incorporate the numerical integration of a set of coupled differential equations that allow for a variety of gaits to be realized by adjusting the coupling between the oscillators.

2.1. The Mathematical Model of a Single CPG

The critical aspect of CPG modeling lies in designing a CPG network that possesses asymptotically stable limit cycle characteristics while generating consistent oscillatory outputs. To achieve smooth, efficient, and rapid gait transitions, it is essential to select an appropriate CPG model. In this context, the Hopf oscillator demonstrates remarkable advantages due to its structurally stable harmonic limit cycle properties: its limit cycle morphology remains independent of frequency parameters, and the rising/falling edge frequencies of the oscillatory waveform can be independently adjusted.

Characterized by its concise mathematical formulation, this oscillator features completely decoupled parameters (amplitude, phase, and frequency) that maintain one-to-one correspondence without mutual interference. Moreover, the Hopf oscillator exhibits stable limit cycle behavior, enables direct modulation of amplitude and frequency, and demonstrates high computational efficiency, making it particularly suitable for gait planning in multi-legged robots. Building upon these advantages, this study adopts the Hopf oscillator as the fundamental oscillatory unit for the CPG network, with its mathematical representation presented as follows:

\begin{matrix} \{\begin{matrix} \dot{x} = α (μ - x^{2} - y^{2}) \dot{x} - ω y \\ \dot{y} = α (μ - x^{2} - y^{2}) \dot{y} + ω x \end{matrix} \end{matrix}

(1)

where x and y are state output variables;

μ

is a positive number that determines the amplitude of the oscillator’s output signal;

α

is a positive number used to control the speed at which the oscillator converges to the limit ring; and

ω

is the frequency of the oscillator.

To enable gait variation in the hexapod robot, we modify the base equations by introducing an additional term that dynamically adjusts the temporal ratio between swing phase and stance phase during locomotion. This modification takes the following mathematical form:

\begin{matrix} \{\begin{matrix} ω & = \frac{ω_{s t}}{e^{- α y} + 1} + \frac{ω_{s w}}{e^{α y} + 1} \\ ω_{s t} & = \frac{1 - β}{β} ω_{s w} \end{matrix} \end{matrix}

(2)

In the formula,

ω_{s w}

represents the oscillation phase frequency;

ω_{s t}

represents the support phase frequency;

β

is the load factor; and a determines the rate of change of

ω

between

ω_{s w}

and

ω_{s t}

.

2.2. CPG Control Network Structure for a Hexapod Robot

This paper presents a gait generation method based on symmetric coupled cell theory, where oscillator coupling is employed to achieve desired locomotion patterns. The approach establishes two key theoretical foundations: (1) the derivation of existence conditions for symmetric periodic solutions in coupled dynamical systems, with the network structure being designable independently of individual cell dynamics; and (2) a network design methodology that relies exclusively on algebraic parameters, enabling both simplified implementation and straightforward scalability to more complex network configurations.

In a mutually coupled network, multiple CPGs exhibit bidirectional coupling relationships that operate in both temporal and spatial domains. The coupling dynamics between any two oscillators can be described by Equation (3), where

ϕ_{i}^{j}

represents the phase difference of neuron oscillator j relative to neuron oscillator i, and

Δ

denotes the coupling term between the two oscillators, expressed as follows:

\{Δ_{i}^{j} = [\begin{matrix} \cos ϕ_{i}^{j} & - \sin ϕ_{i}^{j} \\ \sin ϕ_{i}^{j} & \cos ϕ_{i}^{j} \end{matrix}] [\begin{matrix} u_{j} \\ v_{j} \end{matrix}]

(3)

In a CPG network, coupled oscillators are obtained by summing with other oscillators, and the neuronal oscillator i can be represented as follows:

\{{\dot{X}}_{i} = [\begin{matrix} {\dot{u}}_{i} \\ {\dot{ν}}_{i} \end{matrix}] = [\begin{matrix} σ (R^{2} - u_{i}^{2} - v_{i}^{2}) u_{i} - 2 π ω_{k} v_{i} \\ σ (R^{2} - u_{i}^{2} - v_{i}^{2}) v_{i} + 2 π ω_{k} u_{i} \end{matrix}] + λ \sum_{j} Δ_{i}^{j}

(4)

\begin{matrix} \{\begin{matrix} ω_{i k} = \{\begin{matrix} \frac{N}{120 (1 - ε)} (ν_{i k} \geq 0) \\ \frac{N}{120 ε} (ν_{i k} < 0) \end{matrix} \end{matrix} \end{matrix}

(5)

where

λ

represents the coupling strength between oscillators. Given target parameters

(\begin{matrix} ϕ_{1}^{2} & ϕ_{2}^{3} & ϕ_{3}^{4} & ϕ_{4}^{5} & ϕ_{5}^{6} & ϕ_{6}^{1} & ε N \end{matrix})

specifying the desired phase differences, average joint velocity, and duty cycle, the outputs of all six oscillators will automatically converge to the target waveform. While the same network structure can accommodate different gaits through parameter adjustments, abrupt changes may lead to waveform irregularities, including axis-biased oscillations and resulting in highly disordered gaits that compromise robot stability. To address this, we implement a smooth transition mechanism where parameters gradually evolve from current to target values during gait transitions, as described by the following equation:

\begin{matrix} \{\begin{matrix} ϕ_{i}^{j +} = ϕ_{i}^{j +} + \frac{ϕ_{i}^{j -} - ϕ_{i}^{j +}}{e^{κ (t - t_{0})}} \end{matrix} \end{matrix}

(6)

\begin{matrix} \{\begin{matrix} ε^{+} = ε^{+} + \frac{ε^{-} - ε^{+}}{e^{x (t - t_{0})}} \end{matrix} \end{matrix}

(7)

Here,

ϕ_{i}^{j +}

and

ε^{+}

denote the target phase difference and duty cycle, while

ϕ_{i}^{j -}

and

ε^{-}

represent their current values. The parameter

t_{0}

indicates the command transition period, t measures elapsed time since

t_{0}

, and K serves as a transition time adjustment factor. Notably, when transitioning between gaits with significantly different duty cycles (e.g., from tripod to quadruped gait where duty cycle increases substantially), K should be set to a smaller value to ensure smooth parameter evolution and maintain system stability during the gait transition.

In a mutually coupled network, multiple CPGs contain both coupling relationships in the time domain and the spatial domain. For the 18 degrees of freedom of the hexapod robot to realize forward motion, this paper adopts 18 CPG units to control the hip, knee and ankle joints of the six legs, respectively. The 18 oscillators can be categorized into inter-leg coupling and intra-leg coupling to construct a hierarchical CPG control network. As shown in Figure 1, each hip, knee and ankle joint of the robot is represented by one oscillator, and the left and right joint structures are symmetric to each other.

The control network features a hierarchical coupling architecture: (1) inter-foot coupling forms a fully symmetric network topology connecting all six hip joints through bidirectional coupling, and (2) intra-foot coupling establishes unidirectional connections between the hip, knee, and ankle joints within each limb. In this configuration, the CPG output serves as joint angle control signals, where the hip joint references are directly generated by the six-oscillator inter-foot network, while the corresponding knee and ankle joint signals are derived through their unidirectional coupling relationships with the ipsilateral hip joint.

2.3. Tripod Gait

CPG networks can produce distinct rhythmic signals for various gaits through parameter modulation. To maintain locomotion stability and prevent the aforementioned instability issues, careful gait planning is essential. For tripod gait implementation, the hexapod’s legs are optimally divided into two alternating support groups: the left support triangle (legs 0, 3, 4) and the right support triangle (legs 1, 2, 5). This symmetrical load distribution strategy effectively balances the robot’s weight across multiple legs, reducing individual leg loading while enhancing both movement efficiency and dynamic stability during locomotion.

As illustrated in Figure 2, these two leg groups alternate between stance and swing phases to propel the robot forward, maintaining a constant ratio of three legs in stance phase while the other three execute swing phase at any instant. The inter-group phase difference is set to

ϕ

with a 0.5 duty cycle, yielding gait parameters

(ϕ_{1}^{2}, ϕ_{2}^{3}, ϕ_{3}^{4}, ϕ_{4}^{5}, ϕ_{5}^{6}, ϕ_{6}^{1}, ε)

=

(- π, 0, π, 0, - π, π, 0.5)

. Within each leg, unidirectional coupling propagates signals from hip to knee, and subsequently to ankle joints.The CPG-generated joint trajectories for the tripod gait are shown in Figure 3, where the x-axis is time and the y-axis is the joint trajectory signals generated by each CPG unit, which demonstrates variation in each CPG signal over time. Figure 4, Figure 5 and Figure 6 show the joint motion trajectory signals captured by a hexapod robot moving in the real world, where the x-axis is time and the y-axis is the joint trajectory signals of the hip, knee and ankle joints of each leg, which basically coincides with the CPG signals.

3. Proprioceptive Movement

This chapter describes the problem of proprioceptive locomotion of a six-legged robot. It introduces two parts: knowledge about proprioception, and proprioception of a hexapod robot.

3.1. Proprioception

Proprioception refers to the awareness of one’s body position and movement in three-dimensional space. It enables the body to detect changes in limb movement, recognize spatial orientation, and sense stimuli such as pressure, touch, or stretching [19].

Unlike exteroception, proprioception relies on internal sensors to obtain sensory information from within the body. These internal sensors are called proprioceptors, and they send information to the body about muscle length, tension, and joint angles. The proprioceptors send signals to the brain through the nervous system. For example, a person can put food in his or her mouth even when his or her eyes are closed, as well as feel his or her surroundings with his or her hands and feet and waddle forward. Simple or complex movements such as walking, running, and playing basketball require proprioception [20]. The common motor faculty proprioceptors are the muscularis oculi, Golgi tendon apparatus, annulus laminaris, and Ruffini’s vesicles, which sense the speed and direction of joint motion and joint position and transmit them to the brain through the conduction pathway. The proprioceptive conduction pathways of the trunk and limbs are shown in Figure 7.

3.2. The Proprioception of a Hexapod Robot

The hexapod robot’s proprioceptive feedback is acquired through integrated joint sensors, with each of its six legs featuring three actuated degrees of freedom (hip, knee, and ankle joints), resulting in a total of 18 independently controlled joints. To fully capture the system’s kinematic state, each joint is equipped with an angular position sensor that provides real-time measurements of joint angles and angular velocities, enabling comprehensive motion monitoring and control across all 18 degrees of freedom.

The hexapod robot employs an integrated sensor suite comprising both inertial measurement units and joint sensors to acquire comprehensive proprioceptive feedback. This multimodal sensory system captures: (1) trunk kinematics (pose

P o s e

, linear velocity

V_{b o d y}

, and angular velocity

ω_{b o d y}

) and (2) joint kinematics (angle

θ_{j o i n t}

and angular velocity

ω_{j o i n t}

for each actuator). All sensory data is aggregated into a unified state representation:

\begin{matrix} S = [\begin{matrix} P o s e, & V_{b o d y}, & ω_{b o d y}, & θ_{j o i n t}, & ω_{j o i n t} \end{matrix}] \end{matrix}

(8)

which serves as the basis for real-time state estimation, environmental awareness, and subsequent motion planning.

4. Reinforcement Learning Algorithm

Deep reinforcement learning is the combination of reinforcement learning and deep learning [21,22]. At present, mainstream online policy-based algorithms such as Proximal Policy Optimization (PPO) and Asynchronous Advantage Actor-Critic (A3C) require a large number of samples at each gradient step, resulting in relatively low sampling efficiency [23,24]. Moreover, although offline policy algorithms such as Deep Deterministic Policy Gradient (DDPG) have higher sampling efficiency than PPO, they are sensitive to hyperparameters and have poor convergence [25]. To obtain the optimal policy, it is necessary to choose an appropriate deep reinforcement learning algorithm to thoroughly explore the environment, rather than only prioritizing actions with the highest rewards, which can lead the algorithm to fall into local optima.

The Soft Actor-Critic (SAC) algorithm is a maximum entropy-based reinforcement learning algorithm proposed by Haarnoja et al. [26]. By introducing a maximum entropy regularization term in the objective function, the algorithm is encouraged to maintain exploratory behavior during the learning process, thereby improving the robustness and sampling efficiency of the algorithm [27]. The working principle is that repeatedly selecting the same action leads to a decrease in entropy, which prompts the agent to consider not only the maximization of rewards but also the diversity of policies during the learning process, thus expanding the exploration range of the algorithm. Moreover, the introduction of this exploratory behavior helps to avoid the algorithm falling into local optima, further enhancing the convergence and stability of the algorithm.

Within the Actor-Critic architecture, the SAC algorithm implements two distinct neural network ensembles: a value network for state–action pair evaluation and a policy network for action generation. The value network is trained through temporal difference (TD) error minimization, enabling precise estimation of long-term expected returns for given state–action pairs. Concurrently, the policy network optimizes its parameters to maximize the expected cumulative return, ensuring the selected actions under various states yield optimal long-term performance. This dual-network approach facilitates stable policy improvement while maintaining efficient value estimation throughout the learning process.

The policy network processes normalized proprioceptive signals as input and generates action values within the bounded interval

[- 1, 1]

through Gaussian sampling. These continuous outputs are subsequently mapped in real time to discrete task assignments and destination partition identifiers. Within the SAC framework, both the value network and target value network share identical architectures while serving distinct purposes: the primary value network estimates state–action value functions, while the target network provides stable learning benchmarks through soft parameter updates to ensure training stability. This dual-network configuration enables effective policy evaluation while maintaining consistent temporal difference targets throughout the learning process.

\begin{matrix} θ^{'} \leftarrow τ θ + (1 - τ) θ^{'} \end{matrix}

(9)

In the formula:

θ

represents the target network parameters;

θ^{'}

represents the online network parameters;

τ

represents the soft update parameter.

Compared to conventional on-policy reinforcement learning algorithms (e.g., PPO and A3C), the Soft Actor-Critic (SAC) algorithm demonstrates superior performance in continuous action space environments due to its maximum entropy framework. SAC uniquely optimizes the trade-off between reward maximization and policy entropy, explicitly balancing exploration and exploitation through its entropy-regularized objective function. For hexapod robot control system design, we formulate the policy optimization problem under maximum entropy reinforcement learning as follows:

π^{*} = \arg \max_{π} E_{\partial - π} [\sum_{t = 0}^{r} γ^{t} (R_{t} + α H (π (\cdot | S_{t})))]

(10)

In the formula:

E_{\partial - π} [\cdot]

represents the state value function based on policy

π

in MDP; ∂ represents the trajectory,

\partial = (S_{0}, A_{0}, S_{1}, A_{1}, \dots, S_{t}, A_{t})

;

R_{t}

represents the expected reward at time t;

γ^{t}

represents the discount factor at time t;

S_{t}

represents the state at time t;

H (π (\cdot ∣ S_{t}))

represents the definition of policy entropy;

α

is the temperature coefficient used to balance expected rewards and policy entropy, which can be automatically adjusted through the following formula:

J (α) = E_{A_{t} - π} [- α \ln π (A_{t} ∣ S_{t}) - α H_{0}]

(11)

In the formula:

H_{0}

represents the initial entropy of the policy, and

A_{t}

represents the action taken at time t.

Additionally, the definition of policy entropy is described as follows:

\begin{matrix} H (π (\cdot ∣ S_{l})) = - \sum_{a} π (\cdot ∣ S_{l}) \ln π (\cdot ∣ S_{l}) \end{matrix}

(12)

Under the framework of the maximum entropy mechanism, the soft state–action values and soft state values can be redefined as:

\{\begin{matrix} Q (S_{t}, A_{t}) = E_{\partial - π} [\sum_{t = 0}^{T} γ^{t} (R_{t} + α H (π (\cdot | S_{t})))] \\ V (S_{t}) = E_{\partial \sim π} [Q (S_{t}, A_{t}) - α \ln π (\cdot | S_{t})] \end{matrix}

(13)

The formulation introduces two key value functions: (1) the soft Q-function

Q (S_{t}, A_{t})

, which estimates the expected cumulative reward (including entropy regularization) when taking action

A_{t}

in state

S_{t}

at timestep t; and (2) the soft state-value function

V (S_{t})

, representing the expected return under the current policy’s action distribution for state

S_{t}

, where both functions incorporate policy entropy to maintain the exploration-exploitation balance characteristic of maximum entropy reinforcement learning.

Through mathematical derivation incorporating the Kullback–Leibler (KL) divergence measure, we obtain the closed-form optimal policy solution, expressed as follows:

π_{B} = \underset{π^{*}}{\arg m i n} D_{KL} \{π^{*} (\cdot | S_{t}) ∥ \exp (\frac{1}{α} (Q (S_{t}, \cdot) - V (S_{t})))\}

(14)

In the formula,

D_{K L}

is the Kullback–Leibler divergence calculation formula, and

π_{B}

is the closed-form solution for the optimal policy.

5. Strategy Network Training

This chapter systematically presents the development process of the hexapod robot control system through three key components: The hardware platform configuration and training environment setup, the detailed physical modeling and dynamic simulation of the hexapod robot, and the implementation and training of the parameterized policy network for locomotion control.

5.1. Training Platforms

Deep reinforcement learning typically exhibits significantly higher sample complexity compared to supervised and unsupervised learning approaches, primarily due to its inherent trial-and-error training paradigm that necessitates extensive environmental interactions [28]. This characteristic becomes particularly problematic in real-world applications where environmental interactions are either cost-prohibitive (e.g., robotic hardware wear) or pose substantial risks (e.g., autonomous vehicle training), thereby limiting the practical deployment of deep reinforcement learning algorithms in many physical systems.

Virtual environment training offers an effective solution to mitigate the risks associated with real-world robot training, where unexpected situations may lead to equipment damage, safety hazards, or other adverse consequences. By leveraging simulation platforms, robots can safely explore diverse scenarios while developing robust control strategies that ensure reliable real-world deployment. In this study, we employ MuJoCo to establish a virtual training environment for hexapod robots. As a high-fidelity physics engine based on Newtonian mechanics, MuJoCo excels at simulating multi-joint rigid-body dynamics, contact forces, and complex interactions [29]. The platform enables comprehensive modeling of the hexapod robot through XML configuration files, which specify geometric properties, inertial characteristics, joint mechanics, and terrain parameters, including surface friction and collision properties, thereby providing a physically accurate simulation environment for reinforcement learning and motion planning applications.

The MuJoCo simulation environment provides comprehensive control capabilities for robotic training, enabling dynamic parameter adjustment of environmental conditions, flexible scenario configuration, and accelerated training through fast-forward and replay functionalities, significantly improving the efficiency of reinforcement learning debugging and optimization. Crucially, MuJoCo facilitates complete observability of all training data, including full state transitions, action sequences, and reward signals, which serve as essential inputs for policy optimization and value function learning. The platform automatically logs granular training metrics at each timestep for post-hoc performance analysis. The three training terrains in this paper are shown in Figure 8. Terrain 1 is low grassland; Terrain 2 is rocky land; Terrain 3 is tall grassland.

5.2. Parametric Policy Network Training

The locomotion control of the hexapod robot is formulated as a Markov Decision Process (MDP), satisfying the fundamental requirements for reinforcement learning applications. In this study, we employ the Soft Actor-Critic (SAC) algorithm to train a parameterized motion policy network, which is particularly suitable for this continuous control task due to its maximum entropy framework. As an off-policy deep reinforcement learning algorithm, SAC incorporates entropy maximization into its objective function, achieving an optimal balance between exploration and exploitation [30]. This approach demonstrates superior robustness and sample efficiency compared to conventional methods, while providing stable convergence properties for continuous action space control problems [31].

As illustrated in Figure 9, the SAC algorithm employs a dual-network architecture: The Actor implements a four-layer policy network where the input layer processes state space S, followed by two 256-unit hidden layers, and finally an output layer generating parameters mean

π_{φ}^{μ}

and standard deviation

π_{φ}^{σ}

for a Gaussian action distribution through separate linear transformations; The Critic maintains four value networks (two online/two target) featuring identical architecture: a concatenated

(S, A)

input layer, dual 256-unit hidden layers, and a linear output layer estimating Q-values. This symmetric design ensures stable policy evaluation while accommodating continuous action spaces through its probabilistic action selection mechanism.

The design of the reward function is task- and environment-dependent. For training hexapod robot locomotion on uneven terrain, we define the total reward

r_{t}

at time step t as a composite of: a positive reward component

r_{v}

for forward velocity maintenance, and a penalty term

r_{θ}

for excessive body orientation deviations. The complete reward function takes the following form:

\begin{matrix} r_{t} = λ_{1} r_{v} - λ_{2} r_{θ} \end{matrix}

(15)

The velocity reward component

r_{v}

incentivizes forward locomotion by proportionally rewarding the robot’s speed v relative to a target velocity

v_{t a r}

. This design choice prevents the suboptimal gaits (e.g., unstable hopping motions) that often emerge when using pure distance-based rewards, which can substantially prolong training duration. The velocity reward is formulated as

\begin{matrix} r_{v} = \frac{1}{| v - v_{t a r} | + 1} - \frac{1}{v_{t a r} + 1} \end{matrix}

(16)

where the function

r_{v}

monotonically increases as v approaches

v_{t a r}

, creating a smooth optimization landscape for policy learning.

To maintain proper heading alignment during locomotion, we introduce an orientation penalty term

r_{θ}

based on the robot’s attitude quaternion. Using the scalar component

q_{w}

of the orientation quaternion, the penalty is formulated as

\begin{matrix} r_{θ} = \sqrt{2 \arccos q_{w}} \end{matrix}

(17)

The structure of the trained strategy network is as follows: the number of input nodes is 46, the number of nodes in both hidden layers is 256, and at the end the number of nodes in the output layer is 2. Table 1 shows the parameters set during training. The final training process is shown in Figure 10. It takes about 7–8 h to train a strategy network model.

The algorithm’s iterative procedure executes the following sequence at each timestep:

Collect a batch of samples ${s_{t}, a_{t}, r_{t}, s_{t + 1}}$ using a random policy $π$ ;
Update the Q-function based on the samples: $θ_{i} \leftarrow θ_{i} - λ {\hat{\nabla}}_{θ_{i}} L_{θ} (θ_{i})$ for $i \in {1, 2}$ ;
Update the policy parameters: $φ \leftarrow φ - λ {\hat{\nabla}}_{φ} J_{π} (φ)$ ;
Update the temperature coefficient: $α_{0} \leftarrow α_{0} - λ {\hat{\nabla}}_{α_{0}} J (φ)$ ;
Update the policy network parameters: $θ_{i}^{'} \leftarrow τ θ_{i} - (1 - τ) θ_{i}^{'}$ for $i \in {1, 2}$ .

5.3. Training Result

This paper compares the performance of traditional SAC algorithms, Azayev proposed a motion training method and a secondary architecture for a hexapod robot. Azayev proposes a motion method for a blind hexapod robot that uses proprioceptive signals such as foot-ground signals, joint angle signals, and IMU signals to perceive the environmental information, and uses the PPO algorithm to train a policy network to control the robot motion [10]. The training results are shown in Figure 11, where the secondary architecture has a high rate of decreasing rewards for training at the beginning, which is not as good as the traditional SAC algorithm. However, after 20k training rounds, the reward of the secondary architecture is higher than SAC and reaches the highest reward after 245k rounds of training, whereas SAC only temporarily exceeds the secondary architecture around 70k training rounds and does not reach convergence, and finally reaches the highest reward and equals the reward of the secondary architecture around 350k training rounds. Azayev’s proposed method is slower to train and rewards similar values to the other two methods at around 330k rounds. The experimental results show that the secondary architecture proposed in this paper significantly improves the training speed of the hexapod robot in the training platform.

This study employs a hierarchical training architecture for the hexapod robot, with the learning performance across three distinct terrains illustrated in Figure 12. The training curves demonstrate stable convergence for all terrain types, though the convergence rates vary according to terrain complexity. These results validate that our proposed method successfully achieves robust locomotion control for the hexapod robot traversing diverse rugged terrains, as evidenced by the consistent policy convergence regardless of environmental difficulty. The empirical findings confirm the method’s effectiveness in meeting the control requirements for adaptive walking in unstructured environments.

6. Policy Network Performance Analysis

The performance of the policy network is tested by deploying the trained policy network into a real hexapod robot. This chapter first introduces the physical structure of the hexapod robot, then discusses the deployment process of the model, and finally analyzes the performance of the policy network after deployment.

6.1. Physical Model of a Hexapod Robot

The physical structure was developed based on JetHexa, an open-source hexapod robot platform manufactured by Hiwonder (Shenzhen Huaner Technology Co., Ltd., Shenzhen, China) that utilizes the ROS framework. The reference platform integrates several key hardware components: an NVIDIA (Santa Clara, CA, USA) Jetson Nano GPU-based embedded system running Ubuntu 18.04 LTS on its ARM processor, proprietary HX-35H servomotors with 35 kg·cm torque output, a 2D LiDAR sensor featuring 4000–9000 Hz sampling frequency and 16 m detection range, and an RGB camera. This configuration enables advanced robotic capabilities including real-time motion control, simultaneous localization and mapping (SLAM), and human–robot interaction. The simulation model faithfully replicates these specifications while accounting for the dynamic characteristics of each component.

Figure 13 illustrates the dimensional specifications of the hexapod robot, with an overall length of 397 mm, width of 426 mm, and height of 238 mm. The coordinate system is defined with the robot’s forward direction as the positive X-axis, while the positive Y-axis is oriented 90° counterclockwise from the X-axis, and the positive Z-axis points vertically upward. The origin is established at the geometric center of the robot’s torso, with all measurements in millimeters. The spatial coordinates of each leg’s base position relative to this reference frame are quantitatively presented in Table 2.

The inertial measurement system employs an MPU6050 module featuring integrated 3-axis MEMS gyroscopes and accelerometers, coupled with a configurable Digital Motion Processor (DMP). This IMU architecture supports I2C interface connectivity with auxiliary digital sensors, enabling real-time output of 6/9-axis motion data including rotation matrices, quaternions, and Euler angles through sensor fusion algorithms. As illustrated in Figure 14, the hexapod robot’s proprioceptive sensing network comprises this IMU module along with 18 joint angle feedback sensors, forming a comprehensive state estimation system for locomotion control.

6.2. Policy Network Performance Analysis

As described in the previous section, the experimental hexapod robot is built on an embedded system powered by an NVIDIA Jetson Nano GPU. The system has a high-performance deep learning inference library, TensorRT, which accelerates the inference process of the model by converting the trained model into an efficient inference engine. In this paper, TensorRT is used to deploy a policy network model. Firstly, the trained policy network model is exported to onnx file in pytorch library, then optimized to TensorRT Engine by using TensorRT, and finally inference is performed by TensorRT Engine. Through inference, the hexapod robot can output the motion trajectory of each joint in real time according to the current proprioceptive signals. The following is the performance of the hexapod robot in each terrain after deploying the policy network model.

A critical objective in hexapod robot training is maintaining consistent locomotion velocity across varying terrains. To evaluate this capability, we set a uniform target speed of 0.1 m/s during training for all three terrain types, as demonstrated in Figure 15. On all three terrains, the robot starts from a stationary position, thus the initial speed is 0 m/s. Terrain 1 is relatively flat, and the hexapod robot in this terrain only needs about 1.2 s to accelerate to the target speed of 0.4 m/s. Terrain 2 is more rugged, and the hexapod robot needs 1.9 s to accelerate to the target speed. Terrain 3 is rugged and has the resistance of grass, which results in the longest acceleration time, requiring 2.7 s to reach the target speed. After accelerating to the target speed, the hexapod robots on all three terrains oscillate near the target speed, which is consistent with the movement patterns of hexapod robots on rugged terrains. The experimental results show that this control architecture allows the hexapod robot to maintain a stable walking speed on various unstructured terrains.

The dynamic stability of the hexapod robot is quantified through three key metrics: vertical acceleration, roll angular acceleration, and pitch angular acceleration. As shown in Figure 16, the vertical acceleration data across all three terrains oscillate around

0 m / s^{2}

, with amplitudes predominantly confined to the range of

- 0.2 m / s^{2}

to

0.6 m / s^{2}

. Quantitative analysis reveals terrain-specific stability characteristics: Terrain 1 exhibits a standard deviation of

0.148 m / s^{2}

with three instances exceeding

0.5 m / s^{2}

per 10-second interval; Terrain 2 demonstrates improved stability with a

0.120 m / s^{2}

standard deviation and only two exceedances; while Terrain 3 shows intermediate performance with a

0.139 m / s^{2}

standard deviation. These results collectively indicate that the robot maintains consistent vertical acceleration patterns with minimal variation across different terrain types, confirming the effectiveness of our control approach in maintaining stable locomotion.

Figure 17 presents the roll angular acceleration profiles of the hexapod robot across three terrain types. The measurements demonstrate consistent oscillatory patterns centered around

0 rad / s^{2}

, with peak amplitudes ranging from

- 0.8 rad / s^{2}

to

0.9 rad / s^{2}

throughout the motion. Quantitative stability analysis reveals terrain-dependent variations: standard deviations of

0.206 rad / s^{2}

(Terrain 1),

0.191 rad / s^{2}

(Terrain 2), and

0.133 rad / s^{2}

(Terrain 3). These results indicate that while all terrains maintain angular accelerations within stable bounds, Terrain 3 exhibits the most stable roll dynamics, followed by Terrain 2 and then Terrain 1. The consistently small magnitude of angular accelerations across all test conditions confirms the effectiveness of the control system in maintaining rotational stability during locomotion.

Figure 18 displays the pitch angular acceleration characteristics of the hexapod robot across the three terrain conditions. The measured values oscillate symmetrically around

0 rad / s^{2}

with amplitudes consistently bounded between

- 0.7 rad / s^{2}

and

0.8 rad / s^{2}

. Quantitative analysis shows standard deviations of

0.231 rad / s^{2}

(Terrain 1),

0.214 rad / s^{2}

(Terrain 2), and

0.172 rad / s^{2}

(Terrain 3), as detailed in Table 3. Notably, all terrains exhibit similar oscillation patterns in both amplitude and frequency, with Terrain 3 demonstrating the most stable pitch dynamics. The consistently small magnitude of angular accelerations (below 0.8 rad/s²) indicates excellent pitch stability during locomotion, further validating the robustness of the control system across varying terrain conditions.

Overall, the vertical acceleration, the angular acceleration of the traverse angle and the angular velocity of the pitch angle of the hexapod robot in these three terrains are stable with stable frequencies and small standard deviations of the amplitudes and magnitudes, which are more stable. The experimental results show that the control architecture enables the hexapod robot to maintain a stable gait in a variety of unstructured terrains.

We employ contour mapping to visualize the hexapod robot’s locomotion trajectory, with the maximum yaw deviation within a defined 100 ×

200 {cm}^{2}

area serving as a quantitative measure of straight-line walking capability. Figure 19 illustrates the movement path in Terrain 1, where the purple curve traces the robot’s center-of-mass trajectory, while the color gradient (red to blue) represents terrain elevation variations corresponding to the reference colorbar. Quantitative analysis reveals a maximum yaw deviation of

9.38 cm

in this terrain—the smallest observed among the three test environments—demonstrating superior straight-line walking performance compared to more challenging terrains. This precision in path following, combined with the minimal lateral deviation, confirms the effectiveness of our control architecture in maintaining directional stability on moderately uneven surfaces.

The motion path of the hexapod robot in Terrain 2 is shown in Figure 20, the terrain is a gravel road full of stones, with relatively large ups and downs, and its maximum yaw distance is

28.36 cm

. The motion path of the hexapod robot in Terrain 3 is shown in Figure 21, the terrain is full of grasses whose height is greater than the height of the hexapod robot, but because the grasses are flexible objects, the hexapod robot can easily pass through the grasses, the grasses are hidden in the figure, only the information on the terrain is retained, and the maximum yaw distance is

28.56 cm

.

7. Conclusions

In order to overcome the vulnerability of hexapod robots to weather and light through using external sensors to sense the field environment, it is feasible to use only proprioceptive signals obtained from internal sensors to guide the motion of the hexapod robot so that the hexapod robot can walk stably even in rugged terrain. In this paper, we establish three training environments for a hexapod robot in rugged terrain via Mujoco, construct a motion control network for a symmetric hexapod robot using CPG, and propose a parameterization strategy for training this network using the SAC algorithm. The strategy evaluates the current environment and outputs the next action by sensing the proprioceptive information of the hexapod robot itself, so that the hexapod robot can make an adaptive motion according to the current environment, and realizes the control requirements for the blind hexapod robot to walk in a variety of complex terrains.The trained strategy network is faster to train and more rewarding than traditional methods. It exhibits good locomotion performance in the target environment and is able to traverse the target terrain.

When the hexapod robot moves in harsh environments such as these, the vision-based motion strategy can be combined or switched to the proprioceptive-based motion strategy proposed in this paper to improve the traversal ability of the hexapod robot. In future work, a layer of terrain classification network can be trained which recognizes the current terrain and selects the best strategy to enable the robot to adapt to the combined terrain, thus improving the traversal ability of the robot in the combined terrain in the field. In addition, the use of shared control can take full advantage of human-robot collaboration, which can better train the strategy network and improve the robot’s locomotion ability in the field.

Author Contributions

Conceptualization, L.W., W.G. and Y.C.; methodology, L.W. and Y.C.; software, R.L. and X.W.; validation, L.W. and Y.C.; formal analysis, L.W., W.G. and Y.C.; investigation, R.L.; resources, L.W., W.G. and Y.C.; data curation, R.L.; writing—original draft preparation, L.W.; writing—review and editing, W.G. and Y.C.; visualization, R.L.; supervision, W.G. and Y.C.; project administration, W.G. and Y.C.; funding acquisition, W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Basic Research Program of Jiangsu (Grants No BK20240313) and Wuxi Young Scientific and Technological Talent Support.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Dupeyroux, J.; Serres, J.R.; Viollet, S. AntBot: A six-legged walking robot able to home like desert ants in outdoor environments. Sci. Robot. 2019, 4, eaau0307. [Google Scholar] [CrossRef] [PubMed]
Gao, Y.; Wei, W.; Wang, X.; Li, Y.; Yu, Q. Feasibility, planning and control of ground-wall transition for a suctorial hexapod robot. Appl. Intell. 2021, 51, 5506–5524. [Google Scholar] [CrossRef]
Buchanan, R.; Wellhausen, L.; Bjelonic, M.; Bandyopadhyay, T.; Kottege, N.; Hutter, M. Perceptive whole-body planning for multilegged robots in confined spaces. J. Field Robot. 2021, 38, 68–84. [Google Scholar] [CrossRef]
Song, X.; Zhang, X.; Meng, X.; Chen, C.; Huang, D. Gait optimization of step climbing for a hexapod robot. J. Field Robot. 2022, 39, 55–68. [Google Scholar] [CrossRef]
Homchanthanakul, J.; Manoonpong, P. Continuous online adaptation of bioinspired adaptive neuroendocrine control for autonomous walking robots. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 1833–1845. [Google Scholar] [CrossRef]
Thor, M.; Manoonpong, P. Versatile modular neural locomotion control with fast learning. Nat. Mach. Intell. 2022, 4, 169–179. [Google Scholar] [CrossRef]
Manoonpong, P.; Patanè, L.; Xiong, X.; Brodoline, I.; Dupeyroux, J.; Viollet, S.; Arena, P.; Serres, J.R. Insect-inspired robots: Bridging biological and artificial systems. Sensors 2021, 21, 7609. [Google Scholar] [CrossRef]
Mao, L.; Gao, F.; Tian, Y.; Zhao, Y. Novel method for preventing shin-collisions in six-legged robots by utilising a robot–terrain interference model. Mech. Mach. Theory 2020, 151, 103897. [Google Scholar] [CrossRef]
Bjelonic, M.; Kottege, N.; Homberger, T.; Borges, P.; Beckerle, P.; Chli, M. Weaver: Hexapod Robot for Autonomous Navigation on Unstructured Terrain. J. Field Robot. 2018, 35, 1063–1079. [Google Scholar] [CrossRef]
Azayev, T.; Zimmerman, K. Blind Hexapod Locomotion in Complex Terrain with Gait Adaptation Using Deep Reinforcement Learning and Classification. J. Intell. Robot. Syst. 2020, 99, 659–671. [Google Scholar] [CrossRef]
Bachega, R.P.; Neves, G.P.D.; Campo, A.B.; Angelico, B.A. Flexibility in Hexapod Robots: Exploring Mobility of the Body. IEEE Access 2023, 11, 110454–110471. [Google Scholar] [CrossRef]
Ruscelli, F.; Sartoretti, G.; Nan, J.; Feng, Z.; Travers, M.; Choset, H. Proprioceptive-Inertial Autonomous Locomotion for Articulated Robots. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 3436–3441. [Google Scholar]
Ma, F.; Yan, W.; Chen, L.; Cui, R. CPG-based Motion Planning of Hybrid Underwater Hexapod Robot for Wall Climbing and Transition. IEEE Robot. Autom. Lett. 2022, 7, 12299–12306. [Google Scholar] [CrossRef]
Helal, K.; Albadin, A.; Albitar, C.; Alsaba, M. Workspace trajectory generation with smooth gait transition using CPG-based locomotion control for hexapod robot. Heliyon 2024, 10, e31847. [Google Scholar] [CrossRef]
Chen, Y.; Zhou, Y. Machine learning based decision making for time varying systems: Parameter estimation and performance optimization. Knowl.-Based Syst. 2020, 190, 105479. [Google Scholar] [CrossRef]
Wang, L.; Li, R.; Huangfu, Z.; Feng, Y.; Chen, Y. A soft actor-critic approach for a blind walking hexapod robot with obstacle avoidance. Actuators 2023, 12, 393. [Google Scholar] [CrossRef]
Cheng, C.; Zhang, H.; Sun, Y.; Tao, H.; Chen, Y. A cross-platform deep reinforcement learning model for autonomous navigation without global information in different scenes. Control Eng. Pract. 2024, 150, 105991. [Google Scholar] [CrossRef]
Ren, J.; Gosgnach, S. Localization of rhythm generating components of the mammalian locomotor central pattern generator. Neuroscience 2023, 513, 28–37. [Google Scholar] [CrossRef]
Wang, C.; Gao, J.; Deng, Z.; Zhang, Y.; Zheng, C.; Liu, X.; Sperandio, I.; Chen, J. Extracurricular sports activities modify the proprioceptive map in children aged 5–8 years. Sci. Rep. 2022, 12, 9338. [Google Scholar] [CrossRef]
Qu, X.; Hu, X.; Zhao, J.; Zhao, Z. The roles of lower-limb joint proprioception in postural control during gait. Appl. Ergon. 2022, 99, 103635. [Google Scholar] [CrossRef]
Wu, P.; Tian, E.; Tao, H.; Chen, Y. Data-driven spiking neural networks for intelligent fault detection in vehicle lithium-ion battery systems. Eng. Appl. Artif. Intell. 2025, 141, 109756. [Google Scholar] [CrossRef]
Wang, G.; Li, Z.; Weng, G.; Chen, Y. An overview of industrial image segmentation using deep learning models. Intell. Robot. 2025, 5, 143–180. [Google Scholar] [CrossRef]
Boudlal, A.; Khafaji, A.; Elabbadi, J. Entropy adjustment by interpolation for exploration in Proximal Policy Optimization (PPO). Eng. Appl. Artif. Intell. 2024, 133, 108401. [Google Scholar] [CrossRef]
Cho, Y.i.; Kim, B.; Yoon, H.C.; Woo, J.H. Locating algorithm of steel stock area with asynchronous advantage actor-critic reinforcement learning. J. Comput. Des. Eng. 2024, 11, 230–246. [Google Scholar] [CrossRef]
Sumiea, E.H.; Abdulkadir, S.J.; Alhussian, H.S.; Al-Selwi, S.M.; Alqushaibi, A.; Ragab, M.G.; Fati, S.M. Deep deterministic policy gradient algorithm: A systematic review. Heliyon 2024, 10, e30697. [Google Scholar] [CrossRef] [PubMed]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
Hou, Y.; Han, G.; Zhang, F.; Lin, C.; Peng, J.; Liu, L. Distributional soft actor-critic-based multi-auv cooperative pursuit for maritime security protection. IEEE Trans. Intell. Transp. Syst. 2023, 25, 6049–6060. [Google Scholar] [CrossRef]
Kadokawa, Y.; Kodera, T.; Tsurumine, Y.; Nishimura, S.; Matsubara, T. Robust iterative value conversion: Deep reinforcement learning for neurochip-driven edge robots. Robot. Auton. Syst. 2024, 181, 104782. [Google Scholar] [CrossRef]
Todorov, E.; Erez, T.; Tassa, Y. Mujoco: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 5026–5033. [Google Scholar]
Zang, W.; Song, D. Energy-saving profile optimization for underwater glider sampling: The soft actor critic method. Measurement 2023, 217, 113008. [Google Scholar] [CrossRef]
Banerjee, C.; Chen, Z.; Noman, N. Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples With On-Policy Experiences. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 3121–3129. [Google Scholar] [CrossRef]

Figure 1. Hierarchical CPG control network for hexapod robots.

Figure 2. Diagram of tripod gait leg positions.

Figure 3. CPG signals for each joint in a tripod gait.

Figure 4. Hip joint CPG signals for a hexapod robot in tripod gait.

Figure 5. Knee joint CPG signals for a hexapod robot in tripod gait.

Figure 6. CPG signals for ankle joints in a hexapod robot’s tripod gait.

Figure 7. The Conduction pathway of proprioception from the trunk and limbs.

Figure 8. From left to right are the physical models of Terrain 1, Terrain 2, and Terrain 3.

Figure 9. Neural network architecture.

Figure 10. SAC algorithm based strategy network training process.

Figure 11. Comparison of training results.

Figure 12. Training results for three terrains.

Figure 13. The conduction pathways of somatosensory information from the trunk and limbs.

Figure 14. Sensor networks for hexapod robots.

Figure 15. Walking speed of a hexapod robot.

Figure 16. The vertical acceleration of the hexapod robot during motion.

Figure 17. The roll angular acceleration of the hexapod robot during motion.

Figure 18. The pitch angular acceleration of the hexapod robot during motion.

Figure 19. Illustration of hexapod robot walking on Terrain 1.

Figure 20. Illustration of hexapod robot walking on Terrain 2.

Figure 21. Illustration of Hexapod robot walking on Terrain 3.

Table 1. Training parameter.

Learning Rate	Batch Size	Buffer Sie	Gamma	Tau
0.0003	256	1,000,000	0.99	0.85

Table 2. Coordinates of each leg.

	Left Leg	Right Leg
Front Leg	(93.6, 50.805, 0)	(93.6, −50.805, 0)
Middle Leg	(0, 73.575, 0)	(0, −73.575, 0)
Rear Leg	(−93.6, 50.805, 0)	(−93.6, −50.805, 0)

Table 3. Standard deviation of stability metrics for the hexapod robot.

Standard Deviation	Terrain 1	Terrain 2	Terrain 3
Vertical Acceleration ( $m / s^{2}$ )	2.97	2.39	2.78
Oscillations around roll ( $rad / s^{2}$ )	0.0206	0.0191	0.0133
Oscillations around pitch ( $rad / s^{2}$ )	0.0231	0.0214	0.0172

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, L.; Li, R.; Wang, X.; Gao, W.; Chen, Y. A Motion Control Strategy for a Blind Hexapod Robot Based on Reinforcement Learning and Central Pattern Generator. Symmetry 2025, 17, 1058. https://doi.org/10.3390/sym17071058

AMA Style

Wang L, Li R, Wang X, Gao W, Chen Y. A Motion Control Strategy for a Blind Hexapod Robot Based on Reinforcement Learning and Central Pattern Generator. Symmetry. 2025; 17(7):1058. https://doi.org/10.3390/sym17071058

Chicago/Turabian Style

Wang, Lei, Ruiwen Li, Xiaoxiao Wang, Weidong Gao, and Yiyang Chen. 2025. "A Motion Control Strategy for a Blind Hexapod Robot Based on Reinforcement Learning and Central Pattern Generator" Symmetry 17, no. 7: 1058. https://doi.org/10.3390/sym17071058

APA Style

Wang, L., Li, R., Wang, X., Gao, W., & Chen, Y. (2025). A Motion Control Strategy for a Blind Hexapod Robot Based on Reinforcement Learning and Central Pattern Generator. Symmetry, 17(7), 1058. https://doi.org/10.3390/sym17071058

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Motion Control Strategy for a Blind Hexapod Robot Based on Reinforcement Learning and Central Pattern Generator

Abstract

1. Introduction

2. CPG Control Network

2.1. The Mathematical Model of a Single CPG

2.2. CPG Control Network Structure for a Hexapod Robot

2.3. Tripod Gait

3. Proprioceptive Movement

3.1. Proprioception

3.2. The Proprioception of a Hexapod Robot

4. Reinforcement Learning Algorithm

5. Strategy Network Training

5.1. Training Platforms

5.2. Parametric Policy Network Training

5.3. Training Result

6. Policy Network Performance Analysis

6.1. Physical Model of a Hexapod Robot

6.2. Policy Network Performance Analysis

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI