A Novel Actor—Critic Motor Reinforcement Learning for Continuum Soft Robots

: Reinforcement learning (RL) is explored for motor control of a novel pneumatic-driven soft robot modeled after continuum media with a varying density. This model complies with closed-form Lagrangian dynamics, which fulﬁlls the fundamental structural property of passivity, among others. Then, the question arises of how to synthesize a passivity-based RL model to control the unknown continuum soft robot dynamics to exploit its input–output energy properties advantageously throughout a reward-based neural network controller. Thus, we propose a continuous-time Actor–Critic scheme for tracking tasks of the continuum 3D soft robot subject to Lipschitz disturbances. A reward-based temporal difference leads to learning with a novel discontinuous adaptive mechanism of Critic neural weights. Finally, the reward and integral of the Bellman error approximation reinforce the adaptive mechanism of Actor neural weights. Closed-loop stability is guaranteed in the sense of Lyapunov, which leads to local exponential convergence of tracking errors based on integral sliding modes. Notably, it is assumed that dynamics are unknown, yet the control is continuous and robust. A representative simulation study shows the effectiveness of our proposal for tracking tasks.


Introduction
The essential feature of RL is that it provides a reinforcement signal based on the value function evaluation to compute the present action aiming to learn a task.Such a brief statement deploys a powerful motor learning paradigm [1] from control application through the Actor-Critic scheme, well substantiated by adaptive dynamic programming [2], and optimal control [3] schemes.Recently, model-based (regressor) adaptive and model-free (neural networks) controls have been proposed to deal with uncertainty [4][5][6].
Conventional RL schemes typically require state and action space exploration, demanding massive trials and data to tune and test the system in broad operational conditions.It makes conventional RL an option for some software applications, but are risky for hardware systems.Then, it has been claimed that further research is needed to introduce explicit stability bounds and clear implementation procedures to implement RL schemes for a physical system, such as a robot.However, a distinct feature of the RL massive literature is lack of stability conditions [7], with few exceptions.Then, translational research is required to yield a novel RL scheme with stability analysis, particularly for highly nonlinear systems such as robots, but stability is definitively a requirement for uncertain and disturbed complex deformable (soft body) robots.
We are interested in the particular class of pneumatic-driven continuum soft robots.For these robots, implementing a conventional RL is prone to failure when attempting to operate it away from its narrow operational margin, thus requiring novel RL designs equipped with stability analysis.The stability analysis does not represent a nuisance, unnecessary design requirement, or an elegant, irrelevant usage of mathematics; by contrast, stability is tantamount to guaranteeing the operation of the systems under certain conditions.Thus, stability analysis is required for any novel motor control scheme for nonlinear uncertain models, such as continuum soft robots.Given that RL aims at computing the present action that yields the desired state of a system at a given cost, computational architectures have been studied to explore to state-action space, which has led to software architectures rather than RL's stability conditions.Thus, large batches of trials, even millions, are regularly carried out for a particular system until it learns the task and a particular set of discrete admissible controls.However, when such a system is a physical entity, like a robot, there is no room for such trials, but it must operate within stability margins.The trial-and-error mechanistic approach of RL tuning is forbidden for robots since fatal failure may arise.Unfortunately, stability requirements for RL applications for robots have not permeated into the computational RL community; better phrased, why does the immense majority of RL literature lack stability?It is worrisome that this fact is not a worry, but rather the norm, in the practice of RL for robots [7], including for deformable (soft) robots [8].Although the early approaches of RL dynamical systems are used, the principal developments in recent years have improved the algorithms on a finite set of states and controls.
Nonetheless, RL has an enormous advantageous prevalence over "conventional" control in the sense that RL deals with two metrics (performance index and tracking error) while conventional control deals with only one metric (tracking error).RL evaluates performance to issue a reinforced signal to the action (control), which seems very interesting for novel systems, such as the continuum soft robot subject to varying density and a non-constant center of mass [9].Then, the question arises of how to design a sound (stability-based) RL scheme considering its deformation, which is the essential feature that distinguishes it against conventional rigid-body robots.
In this paper, we entertain the explicit need for rigorous stability of a model-free RL scheme for continuum soft robots [9], where reward evaluates continuous deformation coordinates.Our proposed scheme is similar to adaptive neurocontrol away from optimallike control.However, it differs from the former because a second neural network (the Critic or Critic NN) aims to enforce the asymptotic stability of the temporal difference error equation in continuous time.Additionally, a so-called Actor NN aims at inverse dynamics compensation.This scheme is called the Actor-Critic Learning of Motor Control.Unlike traditional Actor-Critic schemes [10,11], novel adaptive mechanisms are introduced for neural weights that allows tightly and complex intertwined nonlinear Actor-Critic neural architectures to emerge, substantiated by the closed-loop stability analysis for tracking tasks of the uncertain continuum 3D soft robot subject to Lipschitz disturbances.

Contribution and Organization
The contribution amounts to a novel RL scheme for a novel continuum soft robot [9], using a particular actuation topology [12].RL's relevant role evaluation performance is exploited to reinforce the neurocontroller based on the approximation of Bellman's temporal difference.It represents the accumulated reward-based value function, approximated by the Critic NN using a novel adaptive mechanism of weights with nonlinear neural activations, while the Actor NN compensates approximately for inverse dynamics.A chatterless integral sliding mode is introduced to guarantee error tracking [13].Overall, tracking with performance evaluation is guaranteed, assuming no knowledge of complex dynamics subject to disturbances, yet with a smooth control action.
This manuscript is organized as follows.Section 2 introduces the preliminaries and problem statement.Section 3 presents the RL design, with stability analysis in the Appendix.Simulations are presented in Section 4 for a 3D continuum soft robot motion tracking an aggressive trajectory, with discussions of the overall scheme presented in Section 5. Finally, concluding remarks are given in Section 6, addressing some advantages and concerns of the proposed scheme.

Preliminaries and Problem Statement
Interestingly, there exist hundreds of papers indexed in the academic metasearch engine Scopus under the keywords RL and soft robot addressing a variety of RL implementations to non-rigid-body systems; however, only 24 papers deal precisely with soft-robots; only two mention stability [14,15]; yet, none included any formal stability analysis.Though this amazing body of literature is rapidly changing and will hopefully soon be address stability, this situation speaks for itself about the significant worldwide effort to exploit the powerful characteristics of RL to soft robots.However, it also shows that the solid foundations of RL are taken for granted for novel applications; for example, it meets the theoretical assumptions of the original approach, but is without any rigor to study the subtleties on how to include the differences of each novel system with stability.Not only that, but stability analysis also paves the way to substantiate a specific design within the specificities of each new system; that is, stability analysis tailors the design according to what a particular system is.The result is an efficient RL custom design for the system under study, not a general design for a particular system, which typically leads to conservative and inefficient RL designs.Overall, this brief literature assessment shows the tendency of RL implementations to a large class of systems, including deformable-body robots.This approach limits and weakens its effectiveness since, in practice, each new system differs in many aspects from the ideal one considered initially.Impressive empirical results of the literature empower RL as a viable option worth studying; however, its prime will be highlighted, arguably, when synthesized through stability analysis to deliver an asymptotically stable RL approach for a specific system.

On Continuum Soft Robots
Soft robots can be classified by their actuation into four types [16]: • Fluidic Elastomer soft robots (FESRs): This type of soft robots has pneumatic/hydraulic chambers embedded into their bodies, which induce body deformation when pressurized.It can generate movements such as bending, elongation, torsion, and a combination of these movements.For example, the STIFF-FLOP comprises a series of identical elastomeric soft actuators with internal pneumatic chambers to unlock three-dimensional movement and a central chamber for stiffness variation via granular interference phenomena [17].• Cable-driven soft robots (CDSRs): The robot has external or internal cables that generate deformation by tension variation.However, the type of movement and workspace depend on the number and position of cables, which means that there are more control inputs and rigid elements where cables pivot.Furthermore, the exerted force of this type of actuator depends directly on cables tension and not on stiffness.An example is depicted in [18], where a four-cable-driven soft arm is presented.• Shape-memory polymer soft robots (SMPSRs): This type encompasses robots composed of polymers with a thermally induced effect, which allows them to go from an initial state to a deformed state.However, SMPSRs do not produce high strain and are usually applied when small deformations are required.• Dielectric/electroactive polymer soft robots (D/EPSRs): This type of robots are based on deformation phenomena in response to electricity.However, due to their high voltage amplification, their doped elastomer is the most disadvantageous and risky option.
By comparing the actuation mechanisms of these types of soft robots, it is recognized that FESRs have the best relationship between applied force and deformation, given that applied energy (either pneumatic or hydraulic) continuously deforms the elastomer, translating into viscoelastic forces of continuum media, i.e., a change in the distance between each pair of particles in the material.On the other hand, there are three types of morphologies for soft robots that show continuous deformation [19]: • Cylindrical morphology: The robot's body is shaped like a cylinder of elastomeric material, with pressure inputs (chambers) radially distributed along an internal radius.When a chamber is pressurized, the body presents a controlled curvature along the extensible center of the robot.Usually, this morphology is built using inextensible braided threads to mitigate radial and circumferential deformations so that the robot's configuration can be approximated with a minimum set of linearly independent variables principally used as control inputs actuated by pneumatic chambers.

•
Ribbed morphology: The robot is composed of three elastomer-based layers.The top and bottom layers have internal ribbed-like structures with multiple rectangular channels connected to fluid transmission lines, whereas the middle layer is a flexible but inextensible restriction.In an active state, where fluid pressurizes a group of chambers, bending is produced.An example is presented in [20] with a soft arm of six ribbed-like segments designed as a manipulation system.• Pleated morphology: Consists of discrete sections (plates) of elastomeric materials evenly distributed and separated by gaps.At the bottom part, a high-stiffness silicon layer is used to work as an inextensible restriction.Additionally, the top part has hollow cavities (in each plate) connected to a central chamber.When it gets pressurized, each plate experiences balloon-like deformations translated into bending of the highstiffness silicon layer along the direction of the layer with lower stiffness.An example is presented in [21], where a soft manipulator has six segments with cylindrical cavities, and a pleated-shaped soft gripper is used for grasping purposes.
Among these morphologies, the cylindrical morphology is the one that allows to approximate robot's deformation through a finite number of variables due to the radial distribution of their pressure entries while simultaneously allowing relatively easy characterization of their geometric variables.Additionally, four types of movement can be distinguished [22]: axial (elongation and retraction of length; see Figure 1b), bending (see Figure 1c), circumferential (translated as torsion; see Figure 1d) and radial (expansion and contraction of cross-sectional area; see Figure 1e).Notice that these movements are achieved by imposing different deformation restrictions.Thus, we consider in this paper the soft robot defined as a cylindrical-shaped soft body composed of elastomeric material moving from continuous controlled body deformation.Moreover, we refer to a continuum soft robot as a soft robot with continuous infinitesimal deformation of the distance between their particles, and it must not be confused with what was referred to as a continuum robot 20 years ago [23].

Deformation Coordinates
The proposed soft robot has circumferential and radial restrictions so that it can only have axial and bending deformations, i.e., increasing (or decreasing) its length l and performing flexion in the direction of an azimuth angle φ (notice that this is achieved when two or more internal chambers are activated).Given these constraints, we consider constant cross-sectional geometry, which gives rise to defining a curved central axis that passes through the body's geometric center, known as the backbone (see Figure 2a) [24].By definition, soft robots have variable curvature along the backbone, i.e., for each cross section of the body, there exists a different curvature.To generate low-cost computation modeling, a constant-curvature approach is used, assuming a single s-curve parameterized by an arc (see Figure 2a), which resembles the body as a segment, hence giving a single curvature κ along the segment, which may vary along time.Therefore, a vector of deformation coordinates q e is defined as The constant curvature parameterizes the backbone of a soft robot by a radius of curvature r k = 1 κ and a curvature angle θ = κl.On the other hand, the constant-curvature approach has the advantage of enabling an additional space named Actuation space (AS), defined by the l vector which contains all length variables l 1 , l 2 , . . ., l n corresponding to the n-actuation elements of the robot.Hence, two direct kinematic mappings arise:

•
From actuation space l to configuration space q e (AS → CS), related to the actuation mechanism, which in this case is the length of chambers.It is also known as specific mapping.

•
From configuration space q e to operational space x (CS → OS), better known as direct kinematics [25].
(a) Deformation coordinates q e for a constant-curvature segment s (in red the backbone).

Kinematics
Two frames can describe zeroth-order direct kinematics of a constant-curvature soft robot: the inertial one Σ 1 at the base of the backbone and the distal reference Σ 2 at end-effector's frame, which can be seen in Figure 2b.Consider deformation coordinates q e and the following homogeneous transformations: where e is the distal position point obtained by e = (r k (1 p of a particle p be within the soft body with respect to an inertial reference frame Σ 0 as an analog to movement transformation of a rigid body, i.e., where d is the body's position, and r p is the relative position of point p with respect to local reference frame Σ 1 ; see Figure 3 (for the sake of simplicity, Equation ( 3) is going to be used without superscripts, i.e., d p = d + Rr p ).
It is important to remark that for rigid bodies R ṙp = 0, the particles position with respect to the inertial reference frame Σ 1 stays constant during transformation.However, this does not occur for soft robots because the distance between particles is variable due to elastic deformation of the body such that R ṙp = 0. Additionally, Ṙ can be expressed in terms of angular velocity ω, using equivalence Ṙ = [ω (0) ×]R = R[ω (1) ×], so that a particle's velocity in a soft body is declared as: and in inertial and local coordinates, respectively.Now, vector r p can be calculated as a function of deformation coordinates q e expressed in toroidal coordinates system c = (r ψ µ) T , where r and ψ are the radius and angle which allow it to be positioned in any point of a cross-sectional area within the soft robot; and µ ∈ [0 1] is the variable which parameterizes an arc length segment (note that µ = 1 is the distal point).Therefore, r p is composed by r p (q e , c) = d s/1 (q e , µ) + R s 1 (q e , µ) where d s/1 gives the spatial position of a point s in the backbone, and r p/s positions any point along the s-cross section.From (7), the relative velocity ṙp of a particle can be obtained by taking its partial derivative with respect to q e : ṙp (q e , qe , c) = ∂r p (q e , c) where J v p is the deformation Jacobian given by with , and

Dynamics
Consider the integral Lagrangian model based on the D'Alembert-Lagrange equation to describes deformation of a cylindrical-shaped pneumatic soft robot of constant-curvature in an inertial base [9]: with q = l φ κ T ∈ R 3 being the three-dimension vector of generalized coordinates, H(q) ∈ R 3×3 the inertia matrix, C(q, q) ∈ R 3×3 Coriolis matrix, g(q) ∈ R 3 the gravity vector, and τ v ∈ R 3 the generalized viscoelastic force vector which is assumed to be separated and oversimplified in a pure linear viscous friction term and an elastic restorative one of the form τ v = −D v q + τ e with positive semi-definite viscous matrix gain D v = (d q 1 , d q 2 , d q 3 ) T and generalized elastic forces τ e .The Lagrangian dynamic model (10) has the following properties [9]: • Symmetry and definite positiveness of inertia matrix: H(q) = H T (q), H(q) > 0, ∀q.• Skew symmetry of Coriolis matrix: • Passivity: Elastic forces τ e are obtained via elastic function proposed in [26] and Castigliano's theorem [27], as

Affine Actuation
By considering a pneumatic soft robot with c embedded cylindrical-shaped pneumatic chambers, air injection produces a controlled force field p = (p 1 p 2 . . .p c ) T ∈ R c which causes coupled deformation among all chambers inside the body.Thus, a mapping from c pressure vectors to the n generalized force coordinates can be defined as where B(q) ∈ R n×c is the input matrix as a lineal operator given by [9]: In this work, a three-DoF soft robot is considered, with three identical prismaticshaped chambers with transverse area A c evenly positioned such that all centroids of the chambers' area are placed along the circumference described by a radius r m , see Figure 4b.Thus, the input matrix B(•) ∈ R 3×3 is full rank and is rewritten as [12] where A c (r ex , w) = π 3 (r ex − w) 2 , r m (r ex , w) = 2 π (r ex − w) sin( π 3 ).

Open-Loop Error Equation
Adding and subtracting the functional Y r = H(q) qr + C(q, q) qr + D v qr + g(q) to (10), we have the open-loop error equation where the extended velocity error coordinate is for qr , the continuous nominal reference to be defined.System (15) has mainly been used for control design for many Lagrangian systems, even in neuro-control applications.In the latter case, [28] proposes an adaptive neurocontroller with an underlying integral sliding mode to enforce robust error tracking, which does not require training nor any knowledge of the robot with a smooth control actions.

Nominal Reference Design to Induce Integral Sliding Modes
Let the nominal reference be defined as where ∆q = q − q d is the position error; q d and qd are the desired position and velocity, respectively; and α and K i are positive feedback gains, S q = S − S d , with S = ∆ q + α∆q, S d = S(t 0 )e −κt .Substituting ( 17) into ( 16), an extended velocity error coordinates is obtained as

Control Design
Based on the seminal work of [13] for rigid robots and extended for soft robot (10) in [9], now, in this paper, given the non-intuitive free-form deformation of the continuum soft robot, we wonder how to consider the body deformation in the control design to guarantee S → 0 with performance evaluation.More precisely, we are interested in designing τ for unknown (10), considering how deformation coordinates perform, additionally to guarantee tracking error convergence.

Problem Statement
We are interested in how to control soft robots using reinforcement learning tools, explicitly using the Actor-Critic scheme.Necessarily, it implies introducing an additional metric for task-performance evaluation along with error convergence.Thus, the following problem arises: "Design a learning mechanism that guarantees simultaneous error convergence and task performance of a closed-loop pneumatic-driven soft robot through a control law that evaluates online learning units for a model-free scheme."

Actor-Critic Learning of Motor Control
Now, we proceed to explain in detail the proposed Actor-Critic architecture as well as the main result called Reinforced Neurocontroller; see Figure 5.

TD Error
Critic NN Critic Adaptation ∆q, ∆ q x d , ẋd Actor NN Actor Adaptation The proposed Actor-Critic scheme where each colored block corresponds to a specific role in the scheme.Notice that it keeps the conventional neurocontroller architecture approximating Ŷr .

Reward-Based Value Function and Temporal Difference Error
Consider the following continuous value function R that depends only on the taskdependent scalar reward r ∈ R [29]; where ψ is the time constant for discounting future rewards with t ≤ m ≤ ∞.Differentiating (19), one obtains which can be rewritten in terms of an error δ, as the so-called temporal difference error for continuous time [29], This consistency equation is used to approximate value function (19) based only on reward information.
Remark 1.Notice that in the proposed Actor-Critic scheme, the reward and the temporal difference error implement the learning mechanism in the Critic NN that approximates the value function R, which in turn represents the reinforcement signal used to improve the adaptation mechanism of the Actor NN that approximates Y r by Ŷr .

Critic NN
Given that value function from Equation ( 19) is smooth, it can be approximated by a neural network with finite constant weights where a small c is the neural approximation error.Then, there exists a neural network with adaptive weights Ŵc ∈ R c that approximates (22) as follows where is the sigmoid bipolar activation function, with V c ∈ R 3×c representing fixed weights and ζ c ∈ R c being the input vector.It means that the learning has been made.Now, using (23) instead of ( 22) by the equivalence principle, the temporal difference error ( 21) can be written as Notice that δ = 0 per se (it is in the neural error domain); thus, the problem becomes in designing the adaptation ˙Ŵ c such that δ → 0, which translates into an appropriate approximation of ( 19) by (23).Now, we have the following result.

Proposition 1. Consider the following adaptation law
and which comes from the temporal difference error δ.Then, selecting K w and K large enough, the convergence of the temporal difference error is achieved such that δ → 0.

Actor NN
Let (14), according to the NN approximation property, be the continuous nonlinear function; it can be represented as Y r = W T a Z a (•) + a , with W a ∈ R a×3 finite constant weights, Z a (•) ∈ R a being the input basis, and a being the reconstruction error.Then, the following neural network and adaptation law are proposed where Ŵa ∈ R a×3 is the matrix of adaptive weights, Z a (•) = σ a (V T a ζ a ) ∈ R a is the bipolar sigmoid activation function, with V a ∈ R 4×a being the matrix of fixed weights and ζ a ∈ R 4 the input vector, and Γ a ∈ R a×a is positive definite gain.

Passivity-Based Reinforced Neurocontroller
Let the control signal be defined as with Then, the following main results are in order.
Theorem 1.Consider the soft robot dynamics (10) in a closed-loop with the control signal (29) and adaptation laws (25) and (28).Thus, from Proposition 1, for high enough K i and K d gains, exponential convergence of the tracking errors is guaranteed via integral sliding modes with smooth control signals and without knowledge of the soft robot dynamics.

Numerical Simulations 4.1. The Simulator and Parameters
The soft robot aims to track a tornado-like trajectory at a distal point described by a desired pose X d ; see Figure 6.Notice that X d was mapped into generalized coordinates q d via first-order inverse kinematics.Simulations were carried out in Matlab-Simulink 2021b with solver ode23tb running at an adaptive step sampling for a tolerance of 1 × 10 −3 .Table 1 shows the dynamic parameters and desired trajectory of the soft robot, with the Young's modulus value being based on experiments [30] using Ecoflex 00-30™.Initial bending β 0 was calculated through initial length and curvature as β 0 = l 0 × κ 0 2 .

Neural Network Architectures
The Critic NN has only one hidden layer, with input vector ζ c = (−1 ∆q), where weights connect the input layer to the hidden layer with constants V c ; initial adaptive weights Ŵc (t 0 ) are tuned between 0 and 1.Similarly, the Actor neural network has one hidden layer with input vector ζ a = (1 S r ), where the weights connecting the input layer and the hidden layer V a are fixed.

Reward Design
Given that the reward function, r, encodes critical aspects to motivate the fulfillment of the task, and the soft robots are typically equipped with low resolution (due to embedded sensor technology being in progress), it is worth considering weight position over velocity errors.Then, let the reward function be where P = diag(9, 9, 9), Q = diag(1, 1, 1).

Feedback Control and Adaptation Gains
The final value of feedback and adaptation gains follow the simulator's theory specifications and numerical performance.
The values for feedback gains were K d = diag(1000 200 200), α = diag(20 20 20), K i = diag(0.050.05 0.05), and κ = 30; and for the adaptation gains, they were K = 50, K w = 7, and Γ a = I 10×10 × 4000.It aims to promote larger reward recollection from smaller position errors, so evaluation-based reinforcement is influenced to take action even by smaller position errors.

Results
From Figure 6, we notice that despite the fact that initial conditions are selected away from the desired trajectory, the end-effector tracks the desired trajectory with a short transient.This can be seen in Figure 7, where tracking error converges exponentially in about t = 0.3 s. Figure 8a shows the bounded extended velocity error S r that shapes the invariant of stability, which gives rise to sliding mode at S q = 0; see Figure 8b around t ≥ 0.3 s.The control signals are quite smooth; see Figure 9a.Notice that τ 1 corresponds to the length coordinate l where its magnitude is much greater than τ 2 and τ 3 due to the higher energy that is required to deform the material for elongation tasks.Figure 9c,d shows the integral and temporal difference errors converging quickly, implying an accurate approximation of the value function by the Critic NN.Finally, reward behavior is shown in Figure 10, since it depends on the evaluation of tracking errors (30); since it has a larger value at initial conditions, then at the beginning, it reaches its higher value.Once the end effector reaches the desired trajectory, the reward ceases signaling that motor learning is achieved.

Comparative Results vs. Classical PID Controller
In order to compare how our proposal performs against others, simulations were also carried out for the very well-known controller, such as classical model-free PID regulator, under the same desired trajectories and initial conditions.The results are shown in Figure 11.Generalized position and velocity errors are shown in Figures 11a and 11b, respectively.Notice that trajectories remain bounded after a short initial response.In contrast, our scheme converges to the origin; see Figure 7. Figure 11c,d shows the control signals and the demanded pressures.Notice that for initial time t < 0.1, the system has a high demand of control effort, translating into a high-pressure demand, which can be detrimental in practice.In contrast, our scheme ameliorates this effect and achieves convergence of tracking errors with smooth control; see Figures 7 and 9a,b.The Actor and Critic neural networks interplay to guarantee tracking, taking into consideration the soft robot performance.The Critic NN approximates the value function by enforcing convergence of the integral and the temporal difference error.In contrast, the Actor NN approximates the nonlinear dynamics using online reinforcement from the Critic NN throughout its weights and the reward.Reward design is fundamental to yield information about the robot task performance.Given the subtleties of the soft robot, reward design is based on the weighted sum of the position and velocity tracking errors.However, the reward can be designed otherwise, since its interpretation depends on what promotes learning.Adaptation of Actor-Critic's weights is proposed based on Lyapunov stability, in contrast to other works that use variants of the gradient descent method.

On Simulation Study
It is considered to be a Lagrangian soft robot assuming constant cross-section geometry along the backbone, even after being exposed to exogenous and endogenous forces.In practice, this is achieved by manufacturing inextensible threads that braid the soft robot [17] to restrict deformation when pressurized.However, this constraint may not be enforced for negative absolute or relative pressure or when a vacuum emerges.Further research is needed from the material science community to considered these cases.

Advantages, Disadvantages, and Limitations
The advantages of this method are as follows: the proposed AC scheme is model-free, i.e., no information on the soft robot dynamics is needed to implement the scheme.In addi-tion, neither pre-training nor initialization is required in the learning process.Surprisingly, Actor-Critic neural network topologies are of low dimension for such difficult and complex nonlinear soft robot dynamics, with only one hidden layer of neurons.We address the difficult and complex soft robot continuum dynamics based on density variations that yield a varying center of mass and a varying inertia tensor.On the other hand, as a limitation, the model has a kinematic singularity at κ = 0, which is common with other soft robot models, as well as the assumption of constant cross area, which is hard to enforce in practice.Then, for practical implementations of this RL scheme, we surmise that the challenge also includes the soft robot hardware, actuation, and sensory system.
Other modeling domains, such as FEM, have been used as an alternative to study the approximate mechanical properties of soft robots [31,32], with quite some success in designing and manufacturing deformable bodies.Certainly, FEM is needed to analyze structurally optimal designs and comparisons to its continuum domain.

Conclusions
Contributions from many research fields have enriched soft robot knowledge, giving rise to novel paradigms on modeling, control, and design.In this note, a novel RL controller for a class of continuum soft robot model is addressed, contributing to the state-of-the-art in learning how to yield 3D trajectories, taking into account online evaluation of deformation performance.The stability-guaranteed, model-free neurocontroller oversees the convergence of the tracking error and reward recollection while exploiting its structural properties, including passivity.The key contribution comes from novel designs of nonlinear adaptive weights of the Critic NN and Actor NN, the former of which uses discontinuous terms and nonlinear activations functions to enforce convergence of TD errors δ.In contrast, the latter is influenced by the reinforcement signals given by the reward and temporal difference error.The soft robot under consideration is of the class of continuum body deformation driven by internal pneumatic chambers.This soft robot model has a Lagrangian structure; henceforth, our RL scheme is not limited to soft robots, but it can be implemented in systems characterized by Lagrangian dynamics.Real-time experimental testing has major importance in pursuing real compliance with the theory based on the axioms and the assumptions.Ongoing effort is occuring in the development of such a platform with special care on non-invasive measurement system and low-level pneumatic control instrumentation [9].
Taking its time derivative of (A7) along the flow (derivative) of S r = S q + K i sgn(S q ), we obtain Vs q =S T q ( Ṡr − K i sgn(S q )) ≤ − K i S q + S q η ≤ − (K i − η) S q . (A8) We can always choose K i > η to enforce a sliding mode condition at S q = 0, guaranteeing the local exponential convergence of tracking errors, i.e., ∆q, ∆ q → 0 as t → ∞ [13].

Figure 1 .
Figure 1.Characteristic deformations of a cylindrical-shaped soft robot.
Direct kinematics for a constant-curvature soft robot.

Σ 0 Σ 1 (
a) Particle position on the solid body.Position inside a soft body with respect to a referential frame associated with the body.

Figure 3 .
Figure 3. Position of a particle in a continuum soft body.Particle's velocity ḋp is easily obtained by the time derivative of its position, Equation (3), resulting in ḋp = ḋ + Ṙr p + R ṙp .(4) The structure of the robot with each pneumatic chamber represented by a distinct color.(b)Cross section showing chambers distribution, the radius of chambers centroid r m , external radium r ex , and wall width w.

Figure 6 .
Figure 6.Tracking of 3D soft robot desired (red dotted line) trajectories X d (t), from initial pose (green dot) shown in grey with a black backbone to a final pose shown in orange, with red backbone.
Convergence of velocity tracking errors.

Figure 7 .
Figure 7. Position and velocity tracking errors exhibit smooth convergence, with a short transient of about 300 ms.Notice that such a performance for such aggressive trajectories, since Coriolis (centrifugal and centripetal) forces increases as time goes by.

Figure 8 .
Figure8.Performance of extended velocity error S r and invariant manifolds S q = 0 suffers from the increment of Coriolis forces; nonetheless, sliding modes are enforced.
Control signals τ are smooth, though increasing for length coordinate in accordance to the task.
Pressure of internal control chambers related as P = B −1 τ.
Convergence of the neural approximation of temporal difference error, implies the learning mechanism is in place after a short initial period.

Figure 9 .Figure 10 .
Figure 9. (a) Control signals and (b) pressure behavior, in accordance to the convergence of (c) integral temporal difference error and (d) temporal difference error.

Figure 11 .
Figure 11.[Classical PID controller].Results obtained using PID: (a) position errors remains bounded, similarly, (b) the velocity errors also remain bounded.(c) control signals exhibit high demand and produces pressures (d).

1 .
On the Actor-Critic Architecture with Adaptive Neural Weights

Table 1 .
Soft robot parameters and the tornado-like desired trajectory.