1. Introduction
Hyper-redundant manipulators (HRMs) possess far more degrees of freedom (DOFs) than traditional robots, in some designs effectively infinitely many, which gives them superior flexibility and bending capabilities and allows highly adaptive movement resembling that of snakes or tentacles. Their unique structure enables novel locomotion patterns and the ability to autonomously navigate around obstacles. Unlike traditional robots, which often rely on application-specific control strategies, these manipulators can be controlled using more generalized approaches. Additionally, their design provides increased resilience to mechanical failures, making them robust in complex environments. These attributes facilitate obstacle navigation, operation in confined spaces, and application in specialized fields such as aerospace [1,2], nuclear energy [3,4], post-disaster search and rescue [5,6], and minimally invasive medical procedures [7,8,9]. However, traditional control methods based on precise modeling are difficult to implement because of the manipulators’ high-dimensional configuration space and the use of elastic materials with complex, nonlinear, and strongly coupled physical properties.
Traditional controllers for hyper-redundant manipulators generally rely on precise modeling and can be divided into two main categories: analytical methods and numerical methods [10]. Previous studies [11,12,13,14] have used analytical methods to compute inverse kinematic solutions for hyper-redundant manipulators by introducing additional configuration-based or task-related constraints to reduce redundancy. However, these methods are highly dependent on specific configurations and become computationally intensive as redundancy increases. In contrast, numerical methods based on techniques such as the pseudo-inverse [15] or the extended Jacobian inverse [16] have been used to solve the inverse kinematics problem for redundant manipulators. Popular methods in this field include gradient projection [17] and cyclic coordinate descent [18]. While these methods can be effective to some extent, they still suffer from issues such as joint oscillations, local minima, and joint singularities [2,19,20,21,22,23,24,25,26,27,28,29]. The inverse kinematics of highly articulated robots is inherently complex because it is a nonlinear, configuration-dependent problem that often lacks a unique solution. This challenge is especially pronounced in hyper-redundant robots with numerous degrees of freedom. In our study, we tackle this issue with deep reinforcement learning algorithms, which enable neural networks to learn a representation of the inverse kinematics through self-learning, thereby handling the complexities of this problem.
The manipulability optimization problem has been reformulated as a constrained quadratic programming problem, utilizing dynamic neural networks for real-time optimization without matrix inversion. Ref. [30] proposed a distributed scheme using a generalized recurrent neural network to enhance the manipulability across multiple manipulators within a distributed network. Ref. [31] further developed this concept by formulating the problem in game-theoretic terms and employing a game-theoretic recurrent neural network to optimize manipulability for collaborative manipulators. However, these approaches tend to become trapped in local optima and rely on complex equation transformations and hand-designed neural networks, which restricts their generality.
Advances in artificial intelligence have also led to the development of new methods for obstacle avoidance in redundant manipulators. The integration of deep neural networks with reinforcement learning (RL) algorithms has sparked renewed research interest in recent years [32]. Compared with conventional hyper-redundant controllers, deep reinforcement learning approaches learn a control policy through environmental interaction and trial and error, and the trained policy can be evaluated quickly at run time. This data-driven methodology can therefore outperform analytical and numerical methods in computational efficiency and execution speed during inference. RL has been particularly influential in the path planning of these manipulators [19,20,21,22,33,34]. For instance, ref. [35] used RL to train an agent to separately plan end-effector motion and self-motion, combining this approach with the gradient projection method. However, this method requires the agent to learn the entire planning process without fully leveraging the inherent properties of self-motion, which leads to substantial computational difficulties. OpenAI [36,37] demonstrated high-level dexterous control on physical five-fingered hand manipulators, but their work did not extend to continuum control. Ref. [38] provides valuable insights into the application of RL to the challenges of motion control in hyper-redundant robots.
The contributions of this paper are as follows.
We initially utilized the TD3 algorithm [39] (an efficient model-free reinforcement learning method) combined with the hindsight experience replay mechanism to learn multi-target end-effector approach tasks in obstacle-free scenarios under sparse reward conditions.
We introduced the concept of dense connectivity and designed policy networks and value function networks based on densely connected networks to enhance the algorithm’s ability to handle high-dimensional state and action spaces.
We analyzed the advantages of using hindsight-based algorithms as baseline algorithms for end-effector target approach tasks with random initial states and random targets.
We implemented a periodic reset scheme with gradually increasing reset periods for training, enabling the learning of continuous multi-target end-effector approach tasks in obstacle-free scenarios under sparse reward conditions.
A comparison of the proposed reinforcement learning method with traditional approaches indicated that the robustness advantage primarily stems from the high degrees of freedom inherent to hyper-redundant manipulators, whereas the advantage of the reinforcement learning algorithm lies in real-time performance, owing to its reduced computational cost during inference. In terms of complexity, traditional methods may be perceived as more straightforward because of their established frameworks, while the complexity of reinforcement learning during training is often difficult to quantify. Furthermore, limitations of the proposed reinforcement learning method still exist, particularly concerning the interpretability of reinforcement learning and its implications for safety in real-world applications.
Our method is particularly suited for practical applications in aerospace scenarios, such as satellite maintenance and retrieval. The unique capabilities of hyper-redundant robots enable them to navigate complex environments and perform precise maneuvers, making our approach highly applicable to the specific challenges of aerospace operations.
This paper is structured as follows.
Section 2 presents the theoretical foundations and background knowledge that underpin our research. 
Section 3 delineates our proposed methodology, providing a comprehensive exposition of its key components and operational principles. 
Section 4 presents the empirical validation of our approach, encompassing both experiments that demonstrate its efficacy and superiority, as well as ablation studies that elucidate the impact and necessity of our methodology. 
Section 5 concludes this paper with a concise summary of our findings and a discussion of promising directions for future research.
   3. Method
  3.1. Training Algorithm
For the convenience of transferring our method to more challenging environments in the future, we used the sparse reward function mentioned in the previous section. With this reward function, the training of the control policy is result-oriented and the reward represents the true goal. However, compared with a shaped dense reward, random exploration of the hyper-redundant manipulator rarely yields enough informative rewards, so the back-propagated gradient is submerged in random noise and the control policy is difficult to learn effectively.
Experience replay techniques use the experience collected by previous policies to train the current policy. Hindsight experience replay (HER) [41] allows the agent to learn from failed experiences by replaying each episode with a different goal, e.g., one of the goals that was achieved in the episode rather than the original goal. HER can be combined with any off-policy RL algorithm, greatly improving sample efficiency and making learning possible even in sparse reward settings. In [41], the authors combined HER with the DDPG algorithm [42]; however, DDPG tends to overestimate the value function.
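The relabeling idea behind HER can be illustrated with a minimal sketch. It is not the implementation used in this paper: the reward convention (0 on success, -1 otherwise), the "future" goal-selection strategy, the number of relabeled goals `k`, and the dictionary keys of each step are all assumptions made for illustration; only the 0.02 m tolerance is taken from the control accuracy reported later.

```python
import math
import random

def sparse_reward(achieved, goal, tol=0.02):
    """Assumed sparse reward: 0.0 if the end-effector is within `tol` of the goal, else -1.0."""
    return 0.0 if math.dist(achieved, goal) <= tol else -1.0

def her_relabel(episode, k=4, rng=random):
    """Return original plus hindsight transitions for one episode.

    `episode` is a list of dicts with the (hypothetical) keys:
    state, action, next_state, achieved (position reached after the action), goal.
    Each transition is (state, goal, action, reward, next_state).
    """
    transitions = []
    for t, step in enumerate(episode):
        # Original transition, rewarded against the original goal.
        transitions.append((step["state"], step["goal"], step["action"],
                            sparse_reward(step["achieved"], step["goal"]),
                            step["next_state"]))
        # Hindsight transitions: treat a position achieved at step >= t as the goal.
        for _ in range(k):
            future = rng.randrange(t, len(episode))
            new_goal = episode[future]["achieved"]
            transitions.append((step["state"], new_goal, step["action"],
                                sparse_reward(step["achieved"], new_goal),
                                step["next_state"]))
    return transitions
```

Even when the original goal is never reached, some relabeled transitions carry a success reward, which is what gives the off-policy learner a useful signal under sparse rewards.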
This paper combined the TD3 algorithm and HER to train the multi-target approaching task of the hyper-redundant manipulator. Multi-target approaching refers to a generalized control strategy that enables a robot to reach any target within its continuous workspace using a single trained policy. In the context of our hyper-redundant manipulator, this means that once trained, the same control strategy can guide the end-effector to accurately approach any designated point in the workspace without requiring separate training for different target locations. This approach emphasizes the versatility and generalization capability of the control policy. The overall procedure is summarized in Algorithm 1.
This algorithm combines twin delayed deep deterministic policy gradient (TD3) and hindsight experience replay (HER) to address the multi-target approach task for hyper-redundant manipulators. It begins by initializing the critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$, the actor network $\pi_\phi$, and their corresponding target networks, along with a replay buffer $\mathcal{B}$. During each episode, a target $g$ is sampled, and the manipulator resets to the initial state $s_0$. At each timestep, an action $a_t$ is selected with exploration noise, and the resulting state transition is stored in the replay buffer. To improve sample efficiency, HER is applied by sampling additional hindsight goals $G$ from the episode. For each hindsight goal $g' \in G$, the reward is recalculated and the corresponding transition tuple is stored in the buffer as follows: $(s_t \Vert g', a_t, r', s_{t+1} \Vert g')$ in $\mathcal{B}$. The critic networks are updated using the temporal difference error, and the actor network $\pi_\phi$ is periodically updated using the policy gradient $\nabla_\phi J(\phi)$. Target networks are updated with soft updates to maintain stability.
The remaining symbols in Algorithm 1 are training hyperparameters that need to be manually specified. Specifically, $\sigma$ represents the standard deviation of the exploration noise added to actions during training, while $c$ defines the clipping bounds for this noise to ensure stability. The discount factor $\gamma$ is used to calculate the target value by balancing immediate and future rewards. The parameter $d$ determines the interval at which the actor network is updated, ensuring that updates are not performed too frequently. Finally, $\tau$ is the soft update parameter that controls the rate at which the target networks are updated to slowly track the learned networks, maintaining training stability. These hyperparameters are critical for properly tuning the training process.
        
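| Algorithm 1 TD3 with HER for a hyper-redundant manipulator multi-target approach |
| 1: Initialize critic networks $Q_{\theta_1}$ and $Q_{\theta_2}$ and actor network $\pi_\phi$ with random parameters $\theta_1$, $\theta_2$, and $\phi$ |
| 2: Initialize target networks $\theta'_1 \leftarrow \theta_1$, $\theta'_2 \leftarrow \theta_2$, and $\phi' \leftarrow \phi$ |
| 3: Initialize replay buffer $\mathcal{B}$ |
| 4: for episode $= 1$ to $M$ do |
| 5:    Sample a target $g$ and reset the manipulator to $s_0$ |
| 6:    for $t = 0$ to $T - 1$ do |
| 7:      Select action $a_t$ with exploration noise: $a_t = \pi_\phi(s_t \Vert g) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma)$ |
| 8:      Observe a new state $s_{t+1}$ and reward $r_t$ |
| 9:      Store transition tuple $(s_t \Vert g, a_t, r_t, s_{t+1} \Vert g)$ in $\mathcal{B}$ |
| 10:    end for |
| 11:    for $t = 0$ to $T - 1$ do |
| 12:      Sample a set of additional goals $G$ from current episode |
| 13:      for each $g' \in G$ do |
| 14:        Observe new reward $r' = r(s_t, a_t, g')$ |
| 15:        Store transition tuple $(s_t \Vert g', a_t, r', s_{t+1} \Vert g')$ in $\mathcal{B}$ |
| 16:      end for |
| 17:      Sample mini-batch of $N$ transitions $(s \Vert g, a, r, s' \Vert g)$ from replay buffer $\mathcal{B}$; $\tilde{a} \leftarrow \pi_{\phi'}(s' \Vert g) + \epsilon$, $\epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)$; $y \leftarrow r + \gamma \min_{i=1,2} Q_{\theta'_i}(s' \Vert g, \tilde{a})$ |
| 18:      Update critics $\theta_i \leftarrow \arg\min_{\theta_i} N^{-1} \sum \left(y - Q_{\theta_i}(s \Vert g, a)\right)^2$ |
| 19:      if $t \bmod d$ then |
| 20:        Update $\phi$ according to the sampled policy gradient: $\nabla_\phi J(\phi) = N^{-1} \sum \nabla_a Q_{\theta_1}(s \Vert g, a) \big\rvert_{a = \pi_\phi(s \Vert g)} \nabla_\phi \pi_\phi(s \Vert g)$ |
| 21:        Update target networks: |
| 22:        $\theta'_i \leftarrow \tau \theta_i + (1 - \tau) \theta'_i$ |
| 23:        $\phi' \leftarrow \tau \phi + (1 - \tau) \phi'$ |
| 24:      end if |
| 25:    end for |
| 26: end for |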
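The three numerical ingredients of the update step in Algorithm 1 (the critic target, the clipped noise on the target action, and the soft update of the target networks) can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than the authors' code: the function names, the default values of `gamma`, `sigma`, `c`, and `tau`, and the normalized action bounds of ±1 are all hypothetical.

```python
import numpy as np

def smoothed_target_action(action, sigma=0.2, c=0.5, low=-1.0, high=1.0, rng=None):
    """Add Gaussian noise clipped to [-c, c], then clip to the (assumed) action bounds."""
    rng = np.random.default_rng() if rng is None else rng
    noise = np.clip(rng.normal(0.0, sigma, size=np.shape(action)), -c, c)
    return np.clip(action + noise, low, high)

def td3_target(reward, q1_next, q2_next, gamma=0.98):
    """Clipped double-Q target: y = r + gamma * min(Q1', Q2')."""
    return reward + gamma * np.minimum(q1_next, q2_next)

def soft_update(target_params, params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(params, target_params)]
```

Taking the minimum of the two target critics is what counteracts the value overestimation attributed to DDPG above, while the small `tau` keeps the targets changing slowly between updates.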
  3.2. Network Structure
The critic and actor networks in reinforcement learning algorithms are generally simple two-layer fully connected networks. However, for hyper-redundant manipulators, shallow neural networks may have insufficient representation ability because of the relatively high number of degrees of freedom. This paper attempts to improve the control precision and learning efficiency of the algorithm by increasing the complexity of the network.
This paper uses a five-layer densely connected network based on DenseNet [43] and a simplified version called SimpleDenseConnect. SimpleDenseConnect keeps only the connections between each layer and the feature layer, so that the input features are reused in every layer and the representation ability of the network is enhanced. The network structures are shown in Figure 4.
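The difference between the three connectivity patterns can be sketched with a small NumPy forward pass. The layer widths, the ReLU activation, and the reading of SimpleDenseConnect as "concatenate the raw input features to every layer's input" are assumptions for illustration; the actual architectures are those shown in Figure 4.

```python
import numpy as np

def init_layers(in_dim, hidden, n_layers, mode, rng):
    """Create weight matrices whose input width depends on the connectivity mode.

    mode = "mlp":    layer i sees only the previous layer's output
    mode = "simple": layer i sees the previous output plus the raw input features
    mode = "dense":  layer i sees the raw input plus all earlier layer outputs
    """
    weights = []
    for i in range(n_layers):
        if i == 0:
            fan_in = in_dim
        elif mode == "mlp":
            fan_in = hidden
        elif mode == "simple":
            fan_in = hidden + in_dim
        else:  # "dense"
            fan_in = in_dim + i * hidden
        weights.append(rng.normal(0.0, 0.1, size=(fan_in, hidden)))
    return weights

def forward(x, weights, mode):
    """Forward pass with ReLU activations; returns the last hidden features."""
    feats = [x]  # raw input followed by each layer's output
    for i, w in enumerate(weights):
        if i == 0 or mode == "mlp":
            inp = feats[-1]
        elif mode == "simple":
            inp = np.concatenate([feats[-1], x], axis=-1)
        else:
            inp = np.concatenate(feats, axis=-1)
        feats.append(np.maximum(inp @ w, 0.0))
    return feats[-1]
```

In the dense variant the input width grows with depth, whereas in the simplified variant it stays constant after the first layer, which is why the latter is cheaper while still reusing the manipulator features everywhere.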
  3.3. Target Sampling
In the training of multi-target approaching tasks of the hyper-redundant manipulator, target points were sampled by drawing joint angles in the joint space and mapping them to coordinates in the workspace. Although this ensures that the sampled targets are always located in the workspace, due to the coupling between joints, joint angles near zero in the joint space are more likely to be mapped to the same coordinates in the workspace, resulting in both an uneven distribution of the target samples and insufficient generalization ability for points with low sampling probability in the workspace.
In practice, the target points of the manipulator are directly given in the workspace, with an implicit assumption that the target points are uniformly distributed in the workspace. In this paper, we introduced a U-shaped target sampling technique to make the sampled target points closer to the uniform distribution in the workspace, which is not only conducive to the overall generalization of the control policy in the workspace, but also conforms to the practical work of the manipulator.
The joint space sampling distribution was optimized because previous observations indicated that uniform sampling in joint space does not yield an even mapping in Cartesian space. This uneven mapping causes the policy to overfit to specific regions of Cartesian space. In this study, we compared the variations in Cartesian space sample distributions when transitioning from a normal distribution in joint space to a U-shaped distribution constructed through an inverse function. The results demonstrate that, as the joint space distribution approaches a U shape, the sample distribution in Cartesian space becomes more uniform. Additionally, this paper conducted multi-target training based on different sampling methods, ultimately evaluating the performance under random samples in Cartesian space corresponding to the U-shaped sampling in joint space.
Figure 5 depicts the distribution of target points mapped to the workspace when sampled with different distributions in the joint space, where Normal Distribution I and Normal Distribution II have the same mean with different standard deviations. Specifically, the sampling method for Normal Distribution I is given by 
, while Normal Distribution II uses 
. The uniform distribution sampling is represented as 
, and the U-shaped distribution sampling is defined as 
, where 
 with 
 and 
. As shown in 
Figure 5d, the target points will be more uniform in the workspace by adopting the joint space U-shaped target sampling.
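One way to construct a U-shaped joint-angle distribution through an inverse function is inverse-transform sampling from an arcsine density, sketched below alongside uniform sampling. The arcsine form, the joint limit `theta_max`, the link length, and the planar chain used for the forward kinematics are assumptions for illustration; they are not the distributions or the 3D manipulator model used to produce Figure 5.

```python
import math
import random

def sample_uniform(theta_max, rng):
    """Uniform joint angle on [-theta_max, theta_max]."""
    return rng.uniform(-theta_max, theta_max)

def sample_u_shaped(theta_max, rng):
    """Inverse-transform sample from an arcsine (U-shaped) density on [-theta_max, theta_max]."""
    u = rng.random()
    return -theta_max * math.cos(math.pi * u)

def planar_tip(joint_angles, link_len=0.1):
    """Forward kinematics of a hypothetical planar serial chain; returns the tip (x, y)."""
    x = y = heading = 0.0
    for q in joint_angles:
        heading += q
        x += link_len * math.cos(heading)
        y += link_len * math.sin(heading)
    return x, y

def mean_sq_tip_radius(sampler, n_joints=12, theta_max=math.pi / 6, n=2000, seed=0):
    """Average squared base-to-tip distance over `n` sampled configurations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x, y = planar_tip([sampler(theta_max, rng) for _ in range(n_joints)])
        total += x * x + y * y
    return total / n
```

The U-shaped sampler places more probability on large joint angles, so sampled targets are pulled away from the nearly straight configurations that dominate when angles cluster near zero. The mean squared tip distance is only a crude proxy for this effect, not the uniformity comparison shown in Figure 5.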
   3.4. Variable-Reset Cycle
In the training process of the multi-target approaching tasks mentioned earlier, each episode starts from the same initial state, that is, the initial joint angles of the hyper-redundant manipulator are all set to 0 degrees, as shown in Figure 6a. However, we want the hyper-redundant manipulator to reach the targets from any initial state so that it can track static interpolated targets or moving targets, as shown in Figure 6b.
For common manipulators and mobile robots, it is easy to randomly sample a conflict-free and reasonable initial state because they have low redundancy or the task mainly concerns their position rather than their configuration. For hyper-redundant manipulators, however, it is difficult to sample an initial state because of their high state dimension and hyper-redundancy. As shown in Figure 7, sampling directly from the joint space leads to a series of problems such as model penetration in the initial state, knotting, collision, and configurations that are inconsistent with reality. Therefore, how to sample a reasonable and effective initial state is crucial for the continuous multi-target approaching tasks of the hyper-redundant manipulator.
This paper proposes a training technique called variable-reset cycle, which initializes the manipulator at the beginning of each episode. This technique is inspired by HER and the understanding that the manipulator should avoid attempting excessive meaningless large-angle configurations.
During training, the cycle of resetting the hyper-redundant manipulator joint angles gradually increased. Inspired by the concept of curriculum learning, this paper introduced a variable-reset cycle training mechanism that allows for a linear transition from a reset-based training approach to a non-reset training approach. Under this mechanism, the joint reset cycle progressively enlarges, following the formula . This approach provides a curriculum of increasingly complex randomly initialized states throughout the training process, thereby accelerating the agent’s learning of more complex multi-target control strategies under completely non-reset conditions.
The periodic resetting of joint angles to a neutral position (0 degrees) at predetermined intervals constrains the hyper-redundant manipulator to a relatively stable and simplified configuration, which can expedite convergence in the learning process. This approach offers the advantage of reducing the initial complexity of the state space. Conversely, training the control policy without resetting allows the hyper-redundant manipulator to initialize with a diverse range of complex joint configurations. This strategy enables the control policy to explore a broader spectrum of states, including less frequent ones, potentially leading to enhanced generalization capabilities.
The variable-reset cycle methodology synthesizes these two approaches by progressively increasing the interval between resets during the training phase. Variable-reset cycles can be seen as a form of implicit curriculum, as the initial joint angles of the hyper-redundant manipulator shift from a fixed 0 degrees to any angle that is feasible in theory. In this way, the training process is provided with a curriculum of random initial states of increasing complexity; thus, the policy can eventually learn continuous multi-target approaching or trajectory tracking.
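The scheduling logic of a growing reset period can be sketched as follows. The linear schedule, the parameter `growth_every`, and the function names are hypothetical; the sketch only illustrates how a reset period that increases with the episode index moves training from resetting every episode toward almost never resetting.

```python
def reset_period(episode, growth_every=500, t_start=1):
    """Hypothetical linear schedule: the reset period grows by one every `growth_every` episodes."""
    return t_start + episode // growth_every

def training_reset_flags(n_episodes, growth_every=500, t_start=1):
    """Return one boolean per episode: True if the joints are reset to zero before it starts."""
    flags, since_reset = [], 0
    for ep in range(n_episodes):
        # Reset at the very first episode, or once the current period has elapsed.
        if ep == 0 or since_reset >= reset_period(ep, growth_every, t_start):
            flags.append(True)
            since_reset = 0
        else:
            # The episode starts from the configuration left by the previous one.
            flags.append(False)
        since_reset += 1
    return flags
```

With a large `growth_every` the schedule behaves like per-episode resetting for a long time, and with a small one it quickly approaches the non-reset regime, so this single parameter controls how fast the implicit curriculum advances.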
  4. Experiments
  4.1. Experimental Design
Firstly, we validated the practical performance of the convergent policy trained with the variable-reset cycle training method based on the DenseNet architecture for multi-target approaching tasks under the reset conditions of the hyper-redundant manipulator. This validation also included a comparative experiment with proximal policy optimization (PPO), a leading on-policy reinforcement learning algorithm.
Secondly, we conducted ablation experiments to verify the necessity of the proposed algorithm design. Specifically, we evaluated a simplified version of DenseNet and compared its performance with DenseNet and a two-layer MLP across three control precision levels: 0.02 m, 0.015 m, and 0.01 m. Furthermore, to assess the practical performance improvements achieved through variable-reset cycles, experiments were conducted to evaluate the policies trained under different reset schemes on continuous multi-target approaching tasks, where the hyper-redundant manipulator operates without being reset.
  4.2. Hardware Configuration
The experiments in this study were conducted using a 2.8 GHz quad-core Intel Core i7 CPU and a GeForce RTX 2080 GPU. Notably, the inference process during training was specifically configured to entirely run on the CPU without incurring significant time costs. This highlights the efficiency of our algorithm and its potential applicability in real-world scenarios, where high-performance computational resources may not be readily available.
  4.3. Results
To verify the performance of the control method described above, this paper tested the 12-link, 24-DOF hyper-redundant manipulator on continuous multi-target approaching tasks in the workspace with a control accuracy of 0.02 m.
Specifically, under the condition of the manipulator being reset, we provided two circular interpolation trajectory points on the  and  planes, as well as an irregular arch trajectory on the  plane, serving as a random sequence of targets. The manipulator, guided by the learned strategy, sequentially approaches these targets, allowing us to assess the completion of its tracking performance for the target points in the  and  planes.
The experimental results are shown in 
Figure 8. In the experiments of following circular trajectories on the 
 and 
 planes, the circle is divided into 18 equal length parts by interpolation points, that is, for every tracking circle, the end-effector trajectory will leave 18 trajectories. The policy can follow a maximum of 70 continuous trajectories of following circular trajectories on the 
 plane and more than 600 continuous trajectories, at most, on the 
 plane. The difference between these two tasks lies in the fact that the circular trajectory on the 
 plane is tangent to the boundary surface of the hyper-redundant manipulator’s workspace, resulting in a singular effect, where even a small error can lead to failure to achieve the target and a poor performance of the trajectory following. The results show that, by following the policy trained based on the proposed method, the 24-DOF hyper-redundant manipulator can effectively achieve continuous multi-target approaching tasks with non-reset in a certain period.
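The circular target sequences described above can be generated with a short helper. The circle center, radius, and choice of coordinate plane below are placeholders, since the actual values used in the experiments are not given here; only the division into 18 equally spaced interpolation points follows the text.

```python
import math

def circle_waypoints(center, radius, axes=(0, 1), n_points=18):
    """Return `n_points` equally spaced 3D targets on a circle.

    The circle lies in the coordinate plane spanned by the two indices in `axes`
    (0 = x, 1 = y, 2 = z); the remaining coordinate stays at its `center` value.
    """
    points = []
    for k in range(n_points):
        angle = 2.0 * math.pi * k / n_points
        p = list(center)
        p[axes[0]] += radius * math.cos(angle)
        p[axes[1]] += radius * math.sin(angle)
        points.append(tuple(p))
    return points
```

Feeding these points to the policy one after another, without resetting the manipulator in between, reproduces the structure of the continuous tracking task: each reached waypoint becomes the initial state for the next target.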
To provide a more comprehensive evaluation, we conducted an experimental comparison with PPO, a widely recognized on-policy reinforcement learning algorithm. The results, presented in Figure 9, offer a side-by-side analysis under identical conditions for the same task. In this comparison, the proposed algorithm shows superior convergence speed and overall task success. These findings support the effectiveness of the variable-reset cycle training method in improving the policy performance for multi-target approaching tasks.
To provide concrete evidence of the performance improvements, we evaluated the completion rates of the hyper-redundant manipulator for randomly generated target point sequences of fixed lengths under different reset mechanisms. Table 3 summarizes the completion rates for these sequences within the workspace. In the table, “Completed numbers” represents the length of the random target sequence, “Variable $T$” denotes the policy trained with the variable-reset cycle, “$T = 1$” indicates that the manipulator is reset after each tracking task cycle, and “$T = \infty$” denotes the completion rates under non-reset conditions. For each setting, five random sequences of target points with fixed lengths were generated using different random seeds. Using the proposed variable-reset cycle method, a success rate of 98.32% was achieved, representing a 134% improvement compared to the baseline under non-reset conditions ($T = \infty$).
  4.4. Ablation Study
  4.4.1. Network Structure
For the multi-target approaching tasks of the 12-link, 24-DOF hyper-redundant manipulator, this paper compared the training effects of three network structures, DenseNet, SimpleDenseConnect, and MLP, at control accuracies of 0.02 m, 0.015 m, and 0.01 m, as shown in Figure 10. For convenience of comparison, the curves were filtered by a sliding window with a size of 10.
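A sliding-window filter of this kind can be written in a few lines. Whether the window used for Figure 10 is trailing or centered is not stated, so the trailing form below is an assumption.

```python
def sliding_mean(values, window=10):
    """Trailing moving average; the first entries average over the samples seen so far."""
    smoothed, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the sample that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed
```

Applied to a per-episode success-rate curve, this suppresses episode-to-episode noise while keeping the output the same length as the input.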
At a control accuracy of 0.02 m, all three networks converged reliably and performed satisfactorily, so the effect of dense connection was not obvious. At a control accuracy of 0.015 m, DenseNet and SimpleDenseConnect significantly improved the convergence speed and performance compared to MLP, and they were able to achieve a success rate of nearly 100% after convergence. At a control accuracy of 0.01 m, dense connection gave a slight improvement over MLP, but none of the three networks was able to learn an effective policy under this condition.
It can be seen that a dense connection partly improved the control accuracy. The main advantage of this structure is that the features of the manipulator are directly connected to each subsequent layer, which makes full use of the features. However, the performance improvement brought by connection between the hidden layers was not obvious.
  4.4.2. Variable-Reset Cycle
This section evaluates the performance improvement introduced by variable-reset cycles. Policies trained with different reset cycles were tested on continuous multi-target approaching tasks. During these tests, the hyper-redundant manipulator was not reset.
First, for each episode reset (
T = 1), we conducted experiments similar to the last section. The effect of following planar circular trajectories is shown in 
Figure 11. The densely packed points on the trajectory indicate that the end-effector of the manipulator stayed at that position until the episode terminated by reaching the maximum step. The policy can better follow the circular interpolation trajectory on the 
 plane, but, for the interpolation trajectory on the 
plane, the policy quickly lost control. The experimental results show that the policy trained by resetting at each episode had a certain generalization ability to random initial states, but this ability was relatively weak.
Next, for the non-reset setting ($T = \infty$), we conducted continuous multi-target approaching tasks for the planar hyper-redundant manipulator with 2 to 12 DOFs. The training curve is shown in Figure 12. It can be seen that the training effect decreased rapidly as redundancy increased, and it was difficult to learn effective policies at 12 DOFs. The reason for this was that, as the DOFs and redundancy of the manipulator increased, the joints were more likely to accumulate deformation. When the joint deformation accumulated to a certain extent, it was difficult for the policy network and the value network to fit these sparse states in the joint space, so uncertain actions were output, which further deepened the accumulation of joint deformation. At the same time, the reward became sparser, and the training entered a vicious cycle.
Finally, for a fixed-cycle reset (here, 
T = 10), the continuous multi-target approaching tasks were trained without resetting for a planar manipulator with 10 DOF and 3D hyper-redundant manipulators with 4-to-20 DOFs. The training curve is shown in 
Figure 13. 
Figure 13a shows that, compared to never resetting, the fixed-cycle reset had a certain effect on the continuous multi-target approaching of low-redundancy manipulators. However, 
Figure 13b indicates that, for higher-DOF hyper-redundant manipulators, which accumulate errors more easily, the fixed-cycle reset could not solve the problem of continuous multi-target approaching.