Deep Reinforcement Learning for Soft, Flexible Robots: Brief Review with Impending Challenges

Abstract: The increasing trend of studying the innate softness of robotic structures and amalgamating it with the benefits of the extensive developments in the field of embodied intelligence has led to the sprouting of a relatively new yet rewarding sphere of technology in intelligent soft robotics. The fusion of deep reinforcement learning algorithms with soft bio-inspired structures points towards a fruitful prospect of designing completely self-sufficient agents that are capable of learning from observations collected from their environment. For soft robotic structures possessing countless degrees of freedom, it is often not convenient to formulate the mathematical models necessary for training a deep reinforcement learning (DRL) agent; deploying current imitation learning algorithms on soft robotic systems has instead provided competent results. This review article posits an overview of various such algorithms, along with instances of them being applied to real-world scenarios and yielding frontier results. Brief descriptions highlight the various pristine branches of DRL research in soft robotics.


Soft Robotics: A New Surge in Robotics
The past decade has seen engineering and biology coming together [1][2][3][4][5], leading to the emergence of a relatively new field of research: Soft Robotics (SoRo). SoRo enhances the physical capabilities of robotic structures, amplifying their flexibility, rigidity and strength and hence accelerating their performance. Biological organisms use their soft structure to good advantage to maneuver in complex environments, motivating the exploitation of such physical attributes to perform tasks that demand robust interaction with uncertain environments [6]. SoRo with three-dimensional bio-inspired structures [7] is capable of self-regulated homeostasis, resulting in robotic actuators that have the potential to mimic biomimetic motions with inexpensive actuation [5,[8][9][10]. These developments in robotic hardware present an opportunity for coupling with imitation learning techniques, exploiting the special properties of these materials to clinch precision and accuracy. Various underlying physical properties, including body shape, elasticity, viscosity, softness and density, enable such unconventional structures and morphologies in robotic systems with embodied intelligence. Developing such techniques would certainly lead to the fabrication of robots that can safely interact with their environment. SoRo also presents future prospects of being melded with tissue engineering, giving rise to composite systems that could find vast applications in the medical domain [11].
Soft robots are fabricated from materials that are easily deformable and possess the pliability and rheological characteristics of biological tissue. This fragment of the bio-inspired class of machines represents an interdisciplinary paradigm in engineering, capable of aiding human assistance in varied domains of research. There are applications wherein SoRo has accelerated performance and expanded potential: these robots have shown promise from wearables for prosthetics to replacing human labor in industries involving large-scale manipulation and autonomous navigation.

Deep Learning for Controls in Robotics
There has been a clear shift towards the utilization of deep learning techniques for creating autonomous systems. Deep learning [12] approaches have shown benefits when combined with reinforcement learning (RL) in the past decade and are known to produce state-of-the-art results in various diverse fields [13]. Several pioneering algorithms in this domain have handled tasks that were difficult for former methods. The need for completely autonomous, intelligent robotic systems has led to heavy dependence on deep RL to solve complex real-world problems without any prior information about the environment. These continuously evolving systems let the agent learn through a sequence of multiple time steps, gradually moving towards an optimal solution.
Robotics tasks can be broken down into two different fragments, namely perception [14,15] and control. The task of perception extracts the necessary information about the environment from sensory inputs, deriving desired target quantities or properties. In learning a control policy, by contrast, the agent actively interacts with the environment, trying to achieve optimal behavior based on the rewards received.
The problem of soft robotic control goes one step further than perception due to the following factors:
• Data distribution: In deep RL for perception, the observations are independent and identically distributed, while in controls they are accumulated in an online manner; because of their sequential nature, each observation is correlated with the previous ones [16].
• Supervision signal: Complete supervision is provided in perception in the form of ground-truth labels, while in controls only sparse rewards are available.
• Data collection: Datasets can be collected offline in perception, but controls require online collection. This affects the data we can collect, since the agent needs to execute actions in the real world, which is not a primitive task.

Deep Learning in SoRo
The task of controlling bio-inspired robots requires additional effort owing to the heavy demand for training data, expensive interactions between a soft robot and its environment, a large action-space dimension, and the persistently varying structure of the robot due to bending, twisting, warping, other deformations, and variations in chemical composition. These challenges can be restated as a single design problem: build adaptive, evolving models that learn from previous experience and are capable of handling prodigiously sized datasets.
The implementation of deep learning techniques for solving compound control problems [17][18][19] in soft robotics has been one of the hottest topics of research, and various algorithms have surpassed the accuracy and precision of earlier approaches. The last decade has seen growing reliance on soft robotics (and/or bio-robotics) for control-related tasks, and applying DRL techniques to these soft robotic systems has become a focal point of ongoing research; some examples are depicted in Figure 1. The amalgamation of these budding fields presents the potential for building smarter control systems [11] that can handle objects of varying shapes [23], adapt to continuously diverging environments, and perform substantial tasks with soft robots. Hence, in this paper, we focus on applying DRL and imitation learning techniques to soft robots for the task of robotic control.

Forthcoming Challenges
Artificial intelligence is the development of machines capable of making independent decisions that normally require human aid. The next big step towards learning control policies for robotic applications is imitation learning, in which the agent learns to perform a task by mimicking the actions of an expert, gradually improving its performance over time as a deep reinforcement learning agent. Even so, it is hard to design the reward/loss function in these cases because of the dimension of the action space, which reflects the wide variety of motions possible in soft robots. These approaches are valuable for humanoid robots or manipulators with high degrees of freedom, where it is feasible to demonstrate desired behaviors thanks to the magnified flexibility and tensile strength. Manipulation tasks, especially those involving soft robots, integrate effectively with such imitation learning algorithms, giving rise to agents able to imitate an expert's actions [3,24,25]. Various factors, including the large dimension of the action space, the varying composition and structure of the soft robot, and environmental alterations, present challenges that require careful study and deep research. Attempts have likewise been made to reduce the amendments required when transferring a model trained in simulation to one that functions effectively in the real world. These challenges not only appear as hindrances to achieving complete self-sufficiency but also act as strong candidates for the future of artificial intelligence and robotics research. In this paper, we list various DRL and imitation learning algorithms applied to solve real-world problems, besides mentioning challenges that prevail and could become centers of upcoming research.
This review article comprises various sections that focus on applying deep reinforcement learning and imitation learning algorithms to manipulation and navigation tasks utilizing soft, flexible robots. The beginning sections give a basic overview of reinforcement learning (Section 2.2) and deep RL (Section 3.1), followed by descriptive explanations of the application of deep RL to navigation (Section 3.3) and manipulation (Section 3.4), mainly in SoRo environments. The succeeding section (Section 4) discusses behavioural cloning, followed by inverse RL and generative adversarial imitation learning applied to solve real-world problems. The paper incorporates separate sections on the problems faced when transferring learnt policies from simulation to the real world and possible solutions to avoid such a reality gap (Section 3.5), which gives way to a section (Section 3.6) on the various simulation software packages available for soft robots. We include a section at the end (Section 5) on the challenges of such technologies and budding areas of global interest that can be future frontiers of DRL research for soft robotic systems.

Introduction
Soft robots intend to solve non-trivial tasks that generally require adaptive capabilities for interacting with constantly varying environments. Controlling such soft-bodied agents with the aid of reinforcement learning involves making machines that can execute and identify optimal behavior in terms of a certain reward (or loss) function. Reinforcement learning can be expressed as a procedure in which, at each state s, the agent performs an action a and receives a response in the form of a reward from the environment; this reward indicates the goodness of the previous state-action pair, and the process continues until the agent has learned a policy well enough. The process involves both exploration, which refers to trying different ways to achieve a particular task, and exploitation, which is the method of utilizing the information gained so far to receive the largest reward possible at the given state.
Robotics tasks can be modeled as Markov Decision Processes (MDPs), consisting of a 5-tuple: (i) S, the set of states; (ii) A, the set of actions; (iii) P, the transition dynamics; (iv) R, the set of rewards; and (v) γ, the discount factor. Episodic MDPs have a terminal state which, once reached, ends the learning episode; an episodic MDP with time horizon T ends after T time steps regardless of whether the goal has been reached. In robotic control problems, the information about the environment is gathered through sensors and is often insufficient to decide on an action; such MDPs are called Partially Observable MDPs. These are countered either by stacking the observations up to the current time step before processing or by using a recurrent neural network (RNN). In any RL task, we intend to maximize the expected discounted return, that is, the weighted sum of rewards received by the agent [26]. For this purpose, there are two types of policies: stochastic (π(a|s)), where actions are drawn from a probability distribution, and deterministic (µ(s)), where a specific action is selected for every state. The value function V^π(s) denotes the expected return starting from state s and following policy π.
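As a concrete illustration of these definitions, the following minimal Python sketch rolls out a toy episodic MDP and Monte-Carlo-estimates V^π(s) under a stochastic policy; the two-state dynamics, rewards and termination rule are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the MDP quantities defined above for a toy 2-state MDP.
import random

GAMMA = 0.9          # discount factor gamma
STATES = [0, 1]      # S
ACTIONS = [0, 1]     # A

def step(s, a):
    """Toy transition dynamics P and reward R (illustrative assumptions)."""
    s_next = (s + a) % 2
    r = 1.0 if s_next == 1 else 0.0
    done = random.random() < 0.1   # episodic: random termination
    return s_next, r, done

def rollout_return(policy, s=0, horizon=100):
    """Discounted return G = sum_t gamma^t * r_t for one episode."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):       # time horizon T
        a = policy(s)
        s, r, done = step(s, a)
        g += discount * r
        discount *= GAMMA
        if done:
            break
    return g

# Monte-Carlo estimate of V^pi(0) under a stochastic policy pi(a|s)
stochastic_pi = lambda s: random.choice(ACTIONS)
v_estimate = sum(rollout_return(stochastic_pi) for _ in range(1000)) / 1000
print(f"V^pi(0) ~= {v_estimate:.3f}")
```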
Reinforcement learning approaches can be broadly classified as model-based and model-free. A framework in which the agent learns a model of the environment's dynamics from the rewards and observations, and then plans optimal actions with that model, is a model-based technique; in these methods, we make use of supervised learning algorithms to minimize a cost function based on what the agent observes from the environment. In model-free approaches, it is not necessary to learn a model to predict the optimal actions: in methods like actor-critic and policy-based methods (described in detail in the next sections), as well as value-based methods, we can simply estimate the optimal Q-values for any particular action at a state, from which it is trivial to choose the policy with the highest Q-values. Model-based methods, being more sample-efficient, are beneficial when dealing with SoRo due to the difficulty and cost of a soft robot's interaction with the environment.

Reinforcement Learning Algorithms
This section provides an overview of major RL algorithms that have been extended by using deep learning frameworks.

• Value-based Methods: These methods estimate the value of being in a given state, from which the control policy is determined. The sequential value estimation is done using the Bellman equations (the Bellman expectation equation and the Bellman optimality equation). Value-based RL algorithms include State-Action-Reward-State-Action (SARSA) and Q-Learning, which differ in their targets, that is, the target value towards which Q-values are recursively updated by a step size at each time step (contrasted in the minimal tabular sketch that follows this list). SARSA is an on-policy method that updates its value estimates towards the current behavior policy, while Q-Learning, being an off-policy method, updates its value estimates towards a target optimal policy. Q-Learning can expound various complex-looking problems, but computational constraints act as stumbling blocks to utilizing it. Detailed explanations can be found in recent works like Dayan [27], Kulkarni et al. [28], Barreto et al. [29], and Zhang et al. [30].

• Policy-based Methods: In contrast to Value-based methods, Policy-based methods directly update the policy without maintaining value estimates. Some key differences between Value-based and Policy-based methods are listed in Table 1. They are slightly better than value-based methods in terms of convergence, solving problems with continuous high-dimensional action spaces, and learning stochastic policies. They come in two broad flavors, gradient-based and gradient-free [31,32] parameter estimation; we focus on gradient-based methods, where gradient descent is the optimization algorithm of choice. Here, we optimize the objective function:

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a|s) f_{π_θ}(s, a)], (1)

wherein f_{π_θ}(·) is the score function [33] for the policy π_θ. Using Equation (1), we can comment on the performance of the model with respect to the task in hand. One RL algorithm of this type is REINFORCE [34], which simply plugs in the sample return G_t as the score function:

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a_t|s_t) G_t]. (2)

A baseline term b(s) is subtracted from the sample return to reduce the variance of the estimate, which updates the equation in the following manner:

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a_t|s_t) (G_t − b(s_t))]. (3)

When a Q-value function is used, the score function can follow either the stochastic policy gradient (Equation (4)) or the deterministic policy gradient (Equation (5)) [35], given by:

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a|s) Q^{π_θ}(s, a)] (4)

and

∇_θ J(θ) = E[∇_θ µ_θ(s) ∇_a Q^µ(s, a)|_{a=µ_θ(s)}]. (5)

It is observed that this family certainly overpowers value-based methods in terms of computational time and space limitations; still, it cannot be trivially extended to tasks involving interaction with continuously evolving environments that require the agent to be adaptive.
In practice, it is not always suitable to follow the stochastic policy gradient because of safety issues and hardware restrictions, as it requires integration over both the state and action spaces; the action space of a soft robot, which can sustain movements in all possible directions and angles, is particularly large. The deterministic policy gradient, wherein integration is done over the state space only, is therefore often preferred.

• Actor-Critic Methods: These algorithms keep an explicit representation of both the policy and the state-value estimates. The score function is obtained by replacing the return G_t in Equation (3) of the policy-based methods with Q^{π_θ}(s_t, a_t) and the baseline b(s) with V^{π_θ}(s_t), which results in the following equation:

∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a_t|s_t) (Q^{π_θ}(s_t, a_t) − V^{π_θ}(s_t))]. (6)

The advantage function A(s, a) is given by:

A(s, a) = Q(s, a) − V(s). (7)

Actor-critic methods can thus be described as the intersection of policy-based and value-based methods, combining the iterative learning machinery of both.

• Integrating Planning and Learning: There exist methods wherein the agent learns a model of the environment from its own experience and can then collect imaginary roll-outs [36]. Such methods have been upgraded by using them alongside DRL methods [37,38]. They are essential for extending RL techniques to soft robotic systems, as the droves of degrees of freedom make interaction with the environment expensive and hence limit the training data available.
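To make the distinction between the two value-based targets concrete, the following is a minimal tabular sketch in Python; the hyper-parameters and helper names are illustrative assumptions, not taken from the works cited above.

```python
# Minimal tabular sketch contrasting the SARSA (on-policy) and Q-learning
# (off-policy) targets discussed in the Value-based Methods item above.
import random
from collections import defaultdict

ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1
Q = defaultdict(float)                      # Q[(state, action)]

def epsilon_greedy(s, actions):
    """Behavior policy used to collect experience."""
    if random.random() < EPS:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_target(r, s_next, a_next):
    # on-policy: bootstrap from the action the current policy actually takes
    return r + GAMMA * Q[(s_next, a_next)]

def q_learning_target(r, s_next, actions):
    # off-policy: bootstrap from the greedy (target-optimal) action
    return r + GAMMA * max(Q[(s_next, a)] for a in actions)

def update(s, a, target):
    # recursive update of Q-values by a step size at each time step
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```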

Introduction
The benefits in terms of physical and mechanical properties allow a wide range of actions with soft robots. Several sectors have seen extensive application of soft robots:
Bio-medical: Soft robots have found enormous application in the domain of bio-medicine, including soft robotic tools for surgery, diagnosis, drug delivery, wearable medical devices and prostheses, and active simulators that copy the working of human tissues for training and biomechanical research. Their durability and flexibility make them apt for maneuvering in close, delicate areas where a possible human error could cause heavy damage. Special properties such as being completely water-soluble or ingestible make them candidates for effective delivery agents.
Manipulation: Another application domain of soft robots is autonomous picking, sorting, distributing, classifying and grasping in various workplaces including warehouses, factories and industries.
Mobile Robotics: Diverse domain-specific mobile robots have been employed for countless purposes. Robots that walk, climb, crawl or jump, with structures inspired by animals that display special movement capabilities, find applications in inspection, monitoring, maintenance and cleaning tasks. Recent work in the field of swarm technology has greatly enhanced the performance of mobile robots possessing such flexible, adaptive structures.
Edible Robotics: The considerable developments in edible robotic materials and 3D printing have led to a sharp rise in the ease of prototyping soft robotic structures that are ingestible and water-soluble. Such biodegradable robotic equipment could relieve the damage incurred through the interaction of these machines with the environment, which contributes to pollution (especially the damage done to water bodies). This unique type of robot is generally composed of edible polymers suitable for use in the medical and food industries.
Origami Mechanics: Origami, a concept that has been in use for hundreds of years, has been employed to enhance the physical strength of soft robotic systems [25,39]. These robots can take on varied sizes, shapes and appearances and are proficient in lifting up to 1000 times their own weight, finding intensive application in the various industries that require lifting of heavy material.
Apart from these, soft robots have found applications including motor control in machines, assistive robots, military reconnaissance, natural disaster relief, pipe inspection and more.
The vast domain of applications of soft robotics has made its study alongside DRL-based methods necessary. Soft robots can perform compound tasks because of their special mechanical capabilities and can incorporate self-adaptive, evolving models that learn from interactions with the environment. Table 2 shows various domains wherein soft robots are utilized; the DRL techniques involved are discussed in detail in the sections that follow. Table 2. SoRo applied to achieve state-of-the-art results alongside sub-domains where its utilization with deep reinforcement learning (DRL) and imitation learning techniques presently occurs. Pictures adapted with permission from [40,41]. Copyright 2014, Mary Ann Liebert, Inc., publishers. Copyright 2017, American Association for the Advancement of Science.

Domain of Application | Basic Applications | Methods in Which DRL/Imitation Learning Algorithms Can Be Incorporated
Biomedical [42] | Equipment for surgeries, endoscopy, laparoscopy, etc.; prosthetics for the impaired |

Deep Reinforcement Learning Algorithms

Neural networks can approximate optimal value functions in reinforcement learning algorithms and hence have been extensively applied to predict control policies in robotics. Systems involving soft robots generally pose challenges for policy optimization due to their large action and state spaces; incorporating neural networks into models alongside adaptive reinforcement learning techniques enhances performance. The last decade has seen a sharp rise in the usage of DRL methods for a variety of tasks that make use of the bio-inspired structures of soft robots. The following are the common DRL algorithms in practice for solving such control problems:

• Deep Q-Network (DQN) [45]: In this approach the optimal Q-function Q*(s, a) is approximated using a deep neural network (generally a CNN) with weights θ, as in other value-based algorithms. The target and error are given by the equations:

y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; θ⁻) (8)

and

L(θ) = E[(y_t − Q(s_t, a_t; θ))²]. (9)

The weights are recursively updated by the equation:

θ ← θ − α ∇_θ L(θ), (10)

moving in the direction of decreasing gradient with a rate equal to the learning rate. DQNs are capable of solving problems involving high-dimensional state spaces but are restricted to discrete, low-dimensional action spaces. Manipulation methods using soft robots whose structure is inspired by biological creatures can interact with the environment with additional flexibility and adaptation to changing situations. The training architecture is given in Figure 2.
Figure 2. Training architecture of a Deep Q-Network (DQN) agent: the SoRo agent is the component that performs action a at a state s, and the environment is the set of constantly varying states that the agent tries to influence through its chosen actions. Picture adapted with permission from [46]. Copyright 2018, American Association for the Advancement of Science.
Two main methods are employed in DQN to stabilize learning:
- Target Network: The target network, with weights θ⁻, has the same architecture as the Q-network. During learning, only the weights θ of the Q-network are updated; they are periodically copied to θ⁻, and the target is computed from the output of the θ⁻ network [18].
- Experience Replay: The collected data, in the form of state-action pairs with their rewards, are not utilized directly but are stored in a replay memory. During training, samples are drawn from this memory to serve as mini-batches for learning. The learning task then follows the usual steps of gradient descent to reduce the loss between the learned Q-network and the target Q-network.
These two methods are used to stabilize the learning in DQN by reducing the correlation between estimated and target Q-values, and between consecutive observations respectively. Advanced techniques for stabilizing and creating efficient models include Double DQN [47] and Dueling DQN [48].
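A minimal sketch of these two stabilizers, assuming PyTorch and illustrative network sizes, hyper-parameters and a tensor-based replay layout (none of which are specified in the paper):

```python
# Hedged sketch of the two DQN stabilizers described above: experience
# replay and a periodically copied target network (theta^-).
import copy
import random
from collections import deque
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)            # theta^-, a frozen copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                # replay memory
GAMMA = 0.99

def train_step(batch_size=32):
    # sampling from memory breaks correlation between consecutive observations;
    # each stored transition is assumed to be a tuple of tensors
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.stack, zip(*batch))
    with torch.no_grad():                    # target computed from theta^-
        y = r + GAMMA * (1 - done) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)      # Eq. (9)-style error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():                           # periodic copy theta -> theta^-
    target_net.load_state_dict(q_net.state_dict())
```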
• Deep Deterministic Policy Gradients (DDPG) [18]: This is a modification of DQN that borrows techniques from actor-critic methods to model problems with continuous, high-dimensional action spaces. The training procedure of a DDPG agent is depicted in Figure 3. The stochastic (Equation (11)) and deterministic (Equation (12)) policy gradients are given by the equations:

∇_θ J(θ) = E_π[∇_θ log π_θ(a|s) Q(s, a; θ^Q)] (11)

and

∇_θ J(θ) = E[∇_θ µ_θ(s) ∇_a Q(s, a; θ^Q)|_{a=µ_θ(s)}]. (12)

The difference from DQN lies in how the Q-value depends on the action: DQN outputs one value per discrete action, whereas DDPG takes the action as an input to the Q-network θ^Q. This method remains one of the premier algorithms in the field of DRL applied to systems utilizing soft robots.
• Normalized Advantage Function (NAF) [49]: This functions similarly to DDPG in that it enables Q-learning in continuous, high-dimensional action spaces through deep learning. In NAF, the Q-function Q(s, a) is represented so that its maximum can easily be determined analytically during learning. The difference between NAF and DQN lies in the network output: the last linear layer outputs θ^V, θ^µ and θ^L, where θ^µ and θ^L predict the advantage term necessary for the learning technique. Like DQN, it makes use of a target network and experience replay to ensure minimal correlation in observations collected over time. The advantage term in NAF is given by:

A(s, a; θ^A) = −(1/2) (a − µ(s; θ^µ))ᵀ P(s; θ^L) (a − µ(s; θ^µ)), (13)

wherein

P(s; θ^L) = L(s; θ^L) L(s; θ^L)ᵀ (14)

is a state-dependent positive-definite matrix built from the lower-triangular network output L(s; θ^L). An asynchronous NAF approach was introduced in the work by Gu et al. [50].

• Asynchronous Advantage Actor-Critic (A3C) [51]: In asynchronous DRL approaches, several actor-learners are utilized to collect observations in parallel, each storing gradients for its own observations that are then used to update the weights of a shared network. A3C, the most common algorithm of this type, always maintains a policy representation π(a|s; θ_π) and a value estimate V(s; θ_V), making use of a score function in the form of an advantage function computed from the observations provided by the actor-learners. Each actor-learner collects roll-outs of observations of its local environment for up to T steps, accumulating gradients from the samples in the roll-outs. The approximation of the advantage function used in this approach (computed in the minimal sketch below) is given by the equation:

A(s_t, a_t) = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_V) − V(s_t; θ_V). (15)

The network parameters θ_π and θ_V are updated repeatedly according to the equations:

dθ_π ← dθ_π + ∇_{θ_π} log π(a_t|s_t; θ_π) A(s_t, a_t) (16)

and

dθ_V ← dθ_V + ∂(R_t − V(s_t; θ_V))² / ∂θ_V. (17)

The training architecture is shown in Figure 4.
This approach does not require learning-stabilization techniques like memory replay, as the parameters are updated simultaneously rather than sequentially, eliminating the correlation between consecutive samples. Furthermore, the multiple actor-learners involved in this method tend to explore a wider view of the environment, helping to learn an optimal policy. A3C has proven to be a stepping stone for DRL research, efficient in providing state-of-the-art results with reduced time and space complexity and a wide range of problem-solving capabilities.
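The k-step advantage estimate of Equation (15), as each actor-learner would compute it from a roll-out, can be sketched as follows; the function assumes the value estimates V(s_i; θ_V) along the roll-out and the bootstrap value V(s_k) are already available.

```python
# Sketch of the k-step advantage estimate of Equation (15), computed
# backwards over one actor-learner roll-out of at most T steps.
GAMMA = 0.99

def advantages(rewards, values, bootstrap_value):
    """rewards[i], values[i] for i = 0..k-1; bootstrap_value = V(s_k)."""
    advs, ret = [], bootstrap_value
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + GAMMA * ret          # discounted k-step return
        advs.append(ret - v)           # A(s_i, a_i) = return - V(s_i)
        # gradients for theta_pi and theta_V (Eqs. (16)-(17)) would be
        # accumulated from these advantages by each actor-learner
    return list(reversed(advs))
```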
• Advantage Actor-Critic (A2C) [52,53]: Asynchronous methods do not necessarily lead to better performance. It has been shown in various papers that a synchronous version of the A3C algorithm provides comparable results, wherein each actor-learner finishes collecting observations, after which the gradients are averaged and a single update is made.
• Guided Policy Search (GPS) [54]: This approach collects samples using the current policy, generating a training trajectory at each iteration that is used to update the current policy via supervised learning. The change is bounded by adding it as a regularization term to the cost function, preventing sudden policy changes that lead to instabilities.
• Trust Region Policy Optimization (TRPO) [55]: Schulman et al. [55] proposed an algorithm for optimizing large nonlinear policies with guaranteed improvement. The discounted cost function for an infinite-horizon MDP is obtained by replacing the reward function with a cost function, giving the equation:

η(π) = E_{s_0, a_0, ...}[Σ_{t=0}^∞ γ^t c(s_t)]. (18)

The same replacement applied to the Q- and state-value functions gives Equation (19):

Q_π(s_t, a_t) = E[Σ_{l=0}^∞ γ^l c(s_{t+l})],  V_π(s_t) = E_{a_t}[Q_π(s_t, a_t)], (19)

resulting in the advantage function:

A_π(s, a) = Q_π(s, a) − V_π(s). (20)

Optimizing the surrogate objective built on Equation (19) yields an update rule for the policy of the form:

π_{i+1} = argmin_π [L_{π_i}(π) + C · D_KL^max(π_i, π)], (21)

where L_{π_i} is a local approximation of η and C a penalty coefficient. Kakade and Langford [56] mathematically proved that updates of this form improve performance. The algorithm requires advanced optimization techniques, using the conjugate gradient method followed by a line search [55].
• Proximal Policy Optimization (PPO) [57]: These methods solve a soft-constraint version of the trust-region optimization problem using standard stochastic gradient descent (its clipped surrogate objective is written out below). Due to its simplicity and effectiveness in solving control problems, it has become a default choice for policy optimization at OpenAI.
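The soft-constraint objective PPO optimizes is not spelled out above; its standard clipped-surrogate form from Schulman et al. [57] reads:

L^CLIP(θ) = E_t[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t)],  with  r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t),

where Â_t is an advantage estimate and ε a small clipping constant. Clipping keeps the probability ratio r_t(θ) near 1, playing the role of TRPO's trust region without requiring a constrained optimizer.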
• Actor-Critic Kronecker-Factored Trust-Region (ACKTR) [53]: This uses a form of trust region gradient descent algorithm for an actor-critic with curvature estimated using a Kronecker-Factored approximation.
These methods have shown prospect when combined with the innumerable physical capabilities of the soft structure of bio-inspired robots [3,24], and are a topic of interest. Some of these algorithms applied successfully to SoRo are listed in Table 2.
Figure 3. Training architecture of a Deep Deterministic Policy Gradients (DDPG) agent: the actor is the SoRo agent that performs action a at a state s; the state s describes the current situation of the agent, acting as the input; the environment is the set of constantly varying states that the agent tries to influence through its chosen actions.

Deep Reinforcement Learning Mechanisms
Mechanisms have been proposed that can enhance the learning procedure while solving control problems involving soft robots with DRL algorithms. These mechanisms act orthogonally to the underlying algorithm, catalysing the task at hand, and ensure that obtaining near-optimal actions for each state in soft robotics remains computationally tractable. Commonly used mechanisms include:

• Auxiliary Tasks [58][59][60][61][62]: Using other supervised and unsupervised machine learning methods, such as regressing depth images from color images or detecting loop closures, alongside the main algorithm to extract information from sparse supervision signals in the environment.
• Prioritized Experience Replay [63]: Prioritizing replayed memories according to their error.
• Hindsight Experience Replay [13]: Relabeling the rewards of the collected observations to make effective use of failure trajectories, which, together with binary/sparse labels, speeds up off-policy methods (a minimal relabeling sketch follows this list).
• Curriculum Learning [64][65][66]: Exposing the agent to progressively more sophisticated environments, helping it learn to solve complex tasks.
• Curiosity-Driven Exploration [67]: Incorporating internal rewards besides the external ones collected from observations.
• Asymmetric Self-Play for Exploration [68]: The interplay between two forms of the same learner generates curricula, thereby driving exploration.
• Noise in Parameter Space for Exploration [69,70]: Inserting additional noise to aid exploration during the learning procedure.
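As an illustration of the hindsight relabeling idea mentioned above, the following minimal Python sketch rewrites a failed trajectory against the goal that was actually achieved; the tuple layout and the binary reward rule are illustrative assumptions.

```python
# Hedged sketch of hindsight relabeling as in Hindsight Experience
# Replay [13]: a failed trajectory is reused by pretending the state
# actually reached at the end was the intended goal.
def her_relabel(trajectory):
    """trajectory: list of (state, action, goal, achieved) tuples."""
    relabeled = []
    final_achieved = trajectory[-1][3]      # what the agent really reached
    for state, action, goal, achieved in trajectory:
        # sparse binary reward against the substituted ("hindsight") goal
        reward = 1.0 if achieved == final_achieved else 0.0
        relabeled.append((state, action, final_achieved, reward))
    return relabeled                        # stored for off-policy replay
```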
The rising demand for autonomous systems with enhanced physical capabilities in terms of flexibility, strength and rigidity has led to growth in DRL developments that can be applied to soft robots. This brings about a need for incorporating mechanisms such as those mentioned in this section into models, strengthening the impact of DRL algorithms coupled with soft robots.

Deep Reinforcement Learning for Soft Robotics Navigation
Deep RL approaches have turned out to be an aid for navigation tasks, helping to generate trajectories by learning from observations taken from the environment in the form of state-action pairs. Like other types of robots, soft robots require autonomous navigation capabilities that can be coupled with their mechanical properties, allowing them to execute onerous-looking tasks with ease. Soft robots used for investigation, maintenance or monitoring purposes at various workplaces have climbing or crawling capabilities that require self-sufficient path-planning potential. Completely reliable and independent movement is necessary for systems that perform tasks requiring continuous interaction with the environment. One example of DRL being utilized by a mobile soft robot to navigate between two points within a room is depicted in Figure 5. Like other DRL problems, the navigation problem is modeled as an MDP: sensor readings (LIDAR scans and depth images from an onboard camera) form the input, and the output is a trajectory (a policy in the form of actions to be taken in each state) that completes the task of reaching the goal in the given span of time.
Experiments have been carried out in this growing field of research, as stated below:

• Zhu et al. [72] gave an A3C system the first-person view alongside the target image to conceive the goal of reaching the destination point, with the aid of universal function approximators. The network used for learning is a ResNet [73] trained using a simulator [74] that creates a realistic environment consisting of various rooms, each a different scene for scene-specific layers [72].
• Zhang et al. [30] implemented a deep successor-representation formulation for predicting Q-value functions that learns representations transferable between navigation tasks. Successor feature representations [28,29] break the learning into two fragments: learning task-specific reward functions, and learning task-specific features alongside their evolution for getting the task in hand done. This method takes motivation from other DRL algorithms that make use of optimal function approximation to relate and reuse the information gained from previous tasks when solving the current one [30]. It has been observed to work effectively when transferring current policies to different goal positions, to varied or scaled reward functions, and to new, unseen environments.

Both these methods solve the navigation problem for autonomous robots whose inputs are RGB images of the environment, either by providing the target image [72] or by transferring information gained through previous processes [30,75]. Such models are trained via asynchronous DDPG on a varied set of real environments and simulations of them.

• Mirowski et al. [58] made use of a DRL mechanism with additional supervision signals available in the environment (especially loop-closure and depth-prediction losses), allowing the robot to move freely between a varying start and end point.
• Chen et al. [76] proposed a solution for compound problems involving dynamic environments (essentially moving obstacles), such as navigating a path with pedestrians. A hardware setup demonstrates the proposed algorithm, in which LIDAR sensor readings are used to predict the properties associated with pedestrians (like speed, position and radius) that contribute to forming the reward/loss function. Long et al. [77] make use of PPO to conduct a multi-agent obstacle-avoidance task.
• Simultaneous Localization and Mapping (SLAM) [78] has been at the heart of recent robotic advancements, with a steady stream of papers in this field over the last decade. SLAM systems that make use of DRL methodologies, partially or completely, have been shown to produce some of the best results in such localization and navigation tasks.

Deep Reinforcement Learning for Soft Robotics Manipulation
DRL techniques have been applied to soft robotic manipulation tasks like picking, dropping and reaching [18,51,55,90]. The enhanced rigidity, flexibility, strength and adaptation capability of soft robots over hard robots [91,92] has extensive applications in the manipulation field, and combining them with DRL has been observed to give precise and satisfactory results. The arrival of soft robotics technologies and their blend with deep learning has made them a crucial part of manipulation tasks. One example of DRL being utilized in a manipulator with soft end-effectors to pick objects of varied sizes and shapes is depicted in Figure 6.
Recent advancements in grasping capabilities aided by vision-based systems have certainly provoked the growth of models employing artificially intelligent robots. Various techniques for robotic grasp detection [93] and for delicate control of soft robots when grasping objects of varying shape, size and composition [94] have evolved, making way for deep learning and DRL algorithms integrated with the benefits of the robust yet limber structure of soft robots. These models take as input a 3-channel image of the scene along with the object to be picked, passing it through a deep convolutional neural network (CNN) that outputs a grasp prediction. This prediction is the input to a control system that maneuvers the end-effector to pick the object. With the developments in the domain of soft robotics, we require learning algorithms that can solve complex manipulation tasks while respecting the constraints enforced by the physical structure of such bio-inspired robots [3,24]. DRL technologies have certainly enhanced the performance of such agents.
In the past few years, we have witnessed a drastic increase in research focus on DRL techniques for soft robots, as listed below:
• End-to-end visuomotor training: Visual features are concatenated with the robot configuration, and the required torques are predicted by passing this concatenated input to linear layers at the end of the network. These experiments were carried out with a PR2 robot for various tasks like screwing a cap onto a bottle and inserting a block into a shape-sorting cube. Despite giving desirable results, this is not widely used in real-world applications as it requires complete observability of the state space, which is difficult to obtain in real life.
• Finn et al. [95] and Tzeng et al. [96]: Made use of DRL techniques for predicting optimal control policies by learning state representations.
• Fu et al. [97]: Introduced the use of a dynamics network for creating a one-shot model for manipulation problems. There have also been advancements in the area of manipulation using multi-fingered bio-inspired hands, both model-based [98,99] and model-free [100].
• Riedmiller et al. [59]: Gave a new algorithm that enhanced the learning procedure in terms of both time complexity and accuracy. It showed that sparse rewards let the model attain an optimal policy faster than hand-shaped binary rewards, which lead to policies without the desired trajectories for the end-effector. For this, additional policies (referred to as intentions) are learned for auxiliary tasks whose reward signals are easily attainable via basic sensors [101]. Besides this, a scheduling policy is further learned for scheduling the intention policies. Such a system achieves better results than a plain DRL algorithm on a lifting task that otherwise took about 10 h to learn from scratch.
Soft robotics has turned out to be an emerging field, especially for manipulation tasks, proving superior in terms of accuracy and efficiency to human or hard-robotic counterparts [102]. Involving soft robots in place of humans has further led to a drastic dip in the chances of industrial disasters due to human error (a result of their environment-adaptation property), and they have proven valuable in working environments that are unsuitable for humans.

Difference between Simulation and Real World
Collecting training samples while solving control problems in soft robotics is not as easy as it is in perception for similar systems. Collection of real-world datasets (state-action pairs) is a costly operation due to the high dimensionality of the control spaces and the lack of a central source of control data for every environment. This inflicts various challenges when bringing models trained in simulation to real-world scenarios. Even though various simulation software packages exist for soft robotics, designed for manipulation tasks like picking, dropping and placing as well as navigation tasks like walking, jumping and crawling, challenges remain that hinder the solution of control tasks with these flexible robots.
Soft robots with bio-inspired designs that make use of DRL techniques like the ones listed in the previous sections are known to yield satisfactory results, but they still face obstacles that hinder their performance when tested on real-world problems after being trained in simulation. Such a gap often shows up as disparities in visual data [30] or laser readings [75]. The following section provides an overview:
• Domain Adaptation: This translates data from a source domain to a destination domain. Suppose visual data comes from two sources, X (a simulator) and Y (onboard sensors): the problem arises when we train the model on X and test it on Y, wherein we observe a considerable performance difference between the two. The domain confusion loss, first proposed in Tzeng et al. [103], learns a representation that is stable under changes in the domain; the limitation of this approach lies in the fact that it requires the source and destination domain information before the training step, which is challenging. This reality-gap problem is a genuine problem faced with systems involving soft robots, because of the constant variations in the position of an end-effector with numerous degrees of freedom. Hence, there is a need for a system that is invariant to changes in the perspective (transform) with which the agent observes key points in the environment. Domain adaptation is a basic yet effective approach that is widely utilized to address the loss of accuracy caused by variations between simulated and real-world environments.

This problem can be solved if we have a mapping function from one domain to the other. This can be done by employing a deep generative model called a Generative Adversarial Network, commonly known as a GAN [104][105][106]. GANs have two basic components, a generator and a discriminator: the job of the generator is to produce samples mapped from the source domain to the destination domain, while that of the discriminator is to differentiate between true and false (generated) samples.
-CycleGAN [85]: First proposed in Zhu et al. [85], this works on the principle that it is feasible to learn a mapping from the input domain to the output domain simply by adding a cycle-consistency loss term as a regularizer to the original GAN loss, ensuring the mapping is reversible. It is a combination of two normal GANs, so two separate generators G: X → Y and F: Y → X and two discriminators D_Y and D_X are trained according to the equations:

L_GAN(G, D_Y, X, Y) = E_y[log D_Y(y)] + E_x[log(1 − D_Y(G(x)))] (22)

and

L_GAN(F, D_X, Y, X) = E_x[log D_X(x)] + E_y[log(1 − D_X(F(y)))]. (23)

After incorporating the cycle-consistency loss term L_cyc(G, F) = E_x[||F(G(x)) − x||₁] + E_y[||G(F(y)) − y||₁] for the two GANs, the loss for the complete weight-updating step becomes L(G, F, D_X, D_Y) = L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + λ L_cyc(G, F). Hence, the final optimization problem turns out to be:

G*, F* = argmin_{G,F} argmax_{D_X,D_Y} L(G, F, D_X, D_Y). (24)
This is known to produce desirable results for scenes where comparisons/relations can be drawn between both domains, but it occasionally fails in complex environments.

- CyCADA [107]: The problems CycleGAN faced were resolved in CyCADA, first introduced in Hoffman et al. [107], by making use of a semantic consistency loss that allows complex environments to be mapped. It trains a model to move from a source domain containing semantic labels, helping map the domain images from X to those in Y. Assuming f_X is a model pre-trained on the source domain, the semantic consistency terms used when mapping with the generators are given by:

L_sem(G_{X→Y}) = CE(f_X(X), f_X(G_{X→Y}(X))) (25)

and

L_sem(G_{Y→X}) = CE(f_X(Y), f_X(G_{Y→X}(Y))), (26)

where f_X is pre-trained by minimizing CE(S_X, f_X(X)), the cross-entropy loss between the data points predicted by the model and the true labels S_X; its predictions then act as labels enforcing semantic consistency in Equations (25) and (26).
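The cycle-consistency term shared by CycleGAN and CyCADA is compact enough to sketch directly; a minimal PyTorch version, assuming the two generators G: X → Y and F: Y → X are defined elsewhere and that λ weights the cycle term as in Equation (24):

```python
# Minimal sketch of the cycle-consistency regularizer of Equation (24);
# G and F are the two generators (assumed defined elsewhere).
import torch

def cycle_consistency_loss(G, F, x, y, lam=10.0):
    # ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1, as in CycleGAN [85]
    loss_x = torch.mean(torch.abs(F(G(x)) - x))
    loss_y = torch.mean(torch.abs(G(F(y)) - y))
    return lam * (loss_x + loss_y)
```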
Deep learning frameworks like GANs [104][105][106], VAEs [108] and disentangled representations [109,110] have the potential to aid the control of soft robots. These developing frameworks have widened the perspective of DRL for robotic controls, and the combination of two nascent fields of technology, soft robotics and deep generative models, acts as a stepping stone to major technological advancement in the coming decades.
• Domain Adaptation for Visual DRL Policies: In such adaptation techniques, we transform the policy from a source domain to the destination domain.
Bousmalis et al. [111] proposed a new way to solve problems of reality gap in policies trained on simulations and applying them in real life scenarios.
There have been recent developments aimed at new training techniques that avoid such a gap in efficiency between simulation and real-world testing, besides advancements in the simulation environments that can be created virtually. Tobin et al. [112] randomized the lighting conditions, viewing angles and textures of objects to ensure the agent is exposed to disparities in the factors of variation (an illustrative sketch follows below). The works by Peng et al. [113] and Rusu et al. [114] further focus on such training methods. The recent VR Goggles work [115] separates policy learning from its adaptation in order to minimize the transfer steps required for moving from simulation to the real world. A new optimization objective comprising an additional shift-loss regularization term was further deployed on such a model, borrowing motivation from the artistic style transfer proposed by Ruder et al. [116]. Work in the domain of scene adaptation includes indoor scene adaptation, where no semantic loss is added (a VR Goggles [115] model tested on the Gazebo [117] simulator using a TurtleBot3 Waffle), and outdoor scene adaptation, where such a semantic loss term is added to the original loss function. Outdoor scene adaptation involves collecting real-world data through a dataset like RobotCar [118], tested on a simulator (like CARLA [119]); the network is trained using the DeepLab model [120] after adding the semantic loss term. Such a model turns out to be applicable to situations where the simulation fails to accurately represent the real agent [121].
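A minimal sketch of the domain-randomization recipe of Tobin et al. [112], drawing fresh visual parameters before every simulated episode; all parameter names, ranges and the simulator call are illustrative assumptions, not taken from the cited work.

```python
# Illustrative domain-randomization sketch: each training episode draws
# random visual parameters so the learned policy never overfits one
# particular rendering of the simulated scene.
import random

def sample_visual_domain():
    return {
        "light_intensity": random.uniform(0.3, 1.5),     # lighting conditions
        "light_azimuth_deg": random.uniform(0.0, 360.0),
        "camera_fov_deg": random.uniform(45.0, 70.0),    # viewing angle
        "object_texture_id": random.randrange(1000),     # texture swap
    }

# before each simulated episode (hypothetical simulator API):
# sim.reset(**sample_visual_domain())
```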
With the rise in the number of simulation software packages (and simulation methods [122]) available publicly, and the growth of the hardware industry for soft robots, including 3D-printed bio-robots [123][124][125][126] designed for specific tasks alongside special flexible electronics for such systems [127], there is scope for improvement in the development of real-world soft robots with practical applications, making this a hot topic for research in the upcoming years.

Simulation Platforms
There are platforms available for simulating DRL agents before testing in real-world applications; a selection is given in Table 3. This readily available software contributes to the upcoming research in the field of controls. The fact that a typical DRL agent requires a great deal of training before it can even be tested in a real environment makes the presence of special-purpose simulation tools important.
For the task of controls, less simulation software is available for soft robots [11] than for hard ones. The fact that soft robotics is a relatively new field might be the reason that there is scarcely any simulation software for manipulation tasks using a soft robot.
SOFA is an efficient framework for the physical simulation of soft robots of different shapes, sizes and materials that has the potential to boost research in this budding field; it allows the user to model, simulate and control a soft robot. The Soft Robotics Toolkit [133] is a plugin that aids the simulation of soft robots using the SOFA framework [134]. Other platforms capable of modeling and simulating soft robotic agents are the V-REP simulator [128], Actin simulation by Energid, and MATLAB (Simscape modeling) [135] (Figure 7).

Imitation Learning for Soft Robotic Actuators
Training a DRL agent to perform control tasks has drawbacks, especially for soft robots: high training time, the need for data that requires robots to interact with the real world (which is computationally expensive), and the impracticality of obtaining a well-formulated reward function. Imitation learning is a unique approach to performing a control task, especially one involving bio-inspired robots, without requiring a reward function [136]. In such cases the problem is ill-posed and it is inconvenient to formulate one, so an expert is used whose actions are mimicked by the agent [16]. Imitation learning is desirable in situations where an expert is present and the high degrees of freedom of the action space lead to enormous computation time, the necessity of a large training set, and difficulty in specifying a reward function that describes the problem. An overview of the training procedure for an imitation learning agent is shown in Figure 8.

Figure 8. Overview of the training procedure for an imitation learning agent: demonstrations from the expert provide new data, which is aggregated with all previous data and used for supervised learning.

The use of imitation learning for solving manipulation problems like picking and dropping [139,140], where we can exploit the benefits of soft robots over hard ones, has become essential. Control tasks in such situations generally have cost functions that are tough to compute, owing to the high dimensionality of the action space caused by the flexibility the soft structure of the robot introduces into the motion of the actuator/end-effector. Under such circumstances, the study of imitation learning becomes a topic of significance. It requires an expert agent, which in manipulation with soft robotics is a person who performs the same tasks the robot is required to copy. Manipulation with soft robots and imitation learning algorithms for performing the control task in hand therefore go hand in hand and complement each other.
The most primitive imitation learning algorithm treats the problem as supervised learning. However, simply applying the usual steps of supervised learning to tasks involving the formulation of a control policy does not work well; changes must be made owing to the differences between common supervised learning problems and control problems. The following section provides an overview of those variations:

• Errors, Independent or Compound: The basic assumption of a common supervised learning task, that the actions of the soft robotic agent do not affect the environment in any way, is violated in imitation control tasks. The presupposition that the collected data samples (observations) are independent and identically distributed is not valid for imitation learning tasks, causing errors to propagate so that even minor errors make the system unstable.
• Time-step Decisions, Single or Multiple: Supervised learning models generally ignore the dependence between consecutive decisions, unlike robotic control approaches. The goal of imitation learning is also different from simply mimicking the expert's actions: at times, a hidden objective is missed by the agent while simply copying the actions of the expert. These hidden objectives might be tasks like avoiding collisions with obstacles, increasing the chances of completing a specific task, or minimizing the effort of the mobile robot.
In the next section, we describe three of the main imitation learning algorithms that have been applied to real life scenarios effectively.

• Behaviour Cloning: This is one of the most basic imitation learning approaches, in which we train a supervised learning agent on the expert's demonstrations, mapping input states to the actions performed. Data AGGregation (DAGGer) [141] is a well-known algorithm of this type that solves the problem of propagation of errors in a sequential system (a minimal sketch of its loop follows the examples below). At each iteration, the updated (current) policy is applied, and the observations recorded are labeled by the expert policy; the data collected is concatenated with the data already available, and the training procedure is applied to the aggregate. This technique has been readily utilized in diverse domains due to its simplicity and effectiveness.
Bojarski et al. [142] trained a navigation control model on data collected from 72 h of actual driving by the expert agent, learning a mapping function from states (image pixels) to actions (steering commands). Similarly, Tai et al. [143] and Giusti et al. [144] developed imitation learning applications for real-life robotic control. Advanced readings include Codevilla et al. [145] and Dosovitskiy et al. [119].
Imitation learning is effective in manipulation problems, as in the examples below:
- Duan et al. [146] proposed one-shot imitation learning to formulate the low-dimensional state-to-action mapping, using behavioral cloning to reduce the differences between the agent's and the expert's actions. They used this method to make a robotic arm stack various blocks the way the expert does, observing the relative positions of the blocks at each time step. The performance achieved after incorporating additional features like temporal dropout and convolutions was similar to that of DAGGer.
- Finn et al. [147] and Yu et al. [60] modified the existing Model-Agnostic Meta-Learning (MAML) [148], a general algorithm that trains a model on varied tasks, making it capable of solving a new, unseen task when assigned. For each task T_i, the weights θ are updated using a method quite similar to the common gradient descent algorithm, given by the equation:

θ'_i = θ − α ∇_θ L_{T_i}(f_θ), (27)

wherein α is the step size for gradient descent. The learning is done to achieve the objective function given by:

min_θ Σ_{T_i} L_{T_i}(f_{θ'_i}), (28)

which leads to the gradient descent step given by:

θ ← θ − β ∇_θ Σ_{T_i} L_{T_i}(f_{θ'_i}), (29)

wherein β represents the meta step size.
- While Duan et al. [146] and Finn et al. [147] propose ways to train a model that works on newer sets of samples, the method of Yu et al. [60] described earlier is effective in the case of domain-shift problems. Eitel et al. [149] came up with a new approach: a model that takes over-segmented RGB-D images as inputs and gives actions as outputs for segregating objects in an environment.
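A minimal DAGGer loop, following the description above; `expert`, `run_policy` and `fit` (any supervised learner) are assumed to be provided rather than taken from the cited works.

```python
# Minimal sketch of the DAGGer [141] loop: roll out the CURRENT policy,
# have the expert relabel the visited states, aggregate, and retrain.
def dagger(expert, run_policy, fit, n_iterations=10):
    dataset = []                                  # aggregated (s, a*) pairs
    policy = lambda s: expert(s)                  # bootstrap from the expert
    for _ in range(n_iterations):
        states = run_policy(policy)               # states visited by current policy
        dataset += [(s, expert(s)) for s in states]   # expert labels them
        policy = fit(dataset)                     # supervised fit on ALL data so far
    return policy
```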
• Inverse Reinforcement Learning: This method aims to formulate the utility (reward) function under which the desired behavior is nearly optimal. An IRL algorithm called Maximum Entropy IRL [150] uses an objective function of the form:

θ* = argmax_θ Σ_{τ∈D} log P(τ|θ), with P(τ|θ) ∝ exp(R_θ(τ)), (30)

where D is the set of expert demonstrations and R_θ(τ) the learned reward of trajectory τ. For a robot following a set of constraints [151][152][153], it is difficult to formulate an accurate reward function, yet the desired actions are easy to demonstrate. Soft robotic systems generally have constraint issues due to the composition and configuration of the different materials of their actuators, eliminating a certain part of the action space (or sometimes the state space), hence motivating the involvement of IRL techniques in such situations. These algorithms let a soft robotic system perform the task demonstrated by the human expert, exploiting the flexibility and adaptability of the soft robot. The motivation for this technique in soft robots comes from the fact that their pliant movements make it difficult to formulate a cost function, leading to reliance on such models in systems requiring resilience and robustness. Maximum Entropy IRL [154] has been used alongside deep convolutional networks to learn the multiplex representations in problems involving a soft robot copying the actions of a human expert for control tasks.

• Generative Adversarial Imitation Learning: Even though IRL algorithms are effective, they require large sets of data and long training times. Hence, an alternative was proposed by Ho and Ermon [155], who gave Generative Adversarial Imitation Learning (GAIL), built on the Generative Adversarial Network (GAN) [104]. Such generative models are essential when working with soft robotic systems, which require larger training sets because of the wide variety of state-action pairs possible in such cases; GANs are complex deep networks able to learn complex representations in the latent space.
Like other GANs, GAIL consists of two independently trained fragments: a generator (or decoder) that generates state-action pairs close to those of the expert, and a discriminator that learns to distinguish samples created by the generator from real samples. The objective function of such a model is given by the equation:

min_π max_D E_π[log D(s, a)] + E_{π_E}[log(1 − D(s, a))] − λ H(π), (31)

where π_E is the expert policy and H(π) an entropy regularizer. Extensions of GAIL have been proposed in recent works including Baram et al. [156] and Wang et al. [157]. GAIL has solved imitation learning problems in navigation (Li et al. [158] and Tai et al. [159] applied it to finding socially compliant policies) as well as manipulation (Stadie et al. [160] used GAIL to mimic an expert's actions through domain-agnostic representations). This presents an opportunity for GAIL to be applied to systems involving soft actuators, given their composite structure and this unique learning technique.
Imitation learning for soft robotics, being a relatively new field of research, has not yet been explored to its fullest. It is effective in industrial applications, capable of replacing its human counterpart owing to its precision, reliability, and efficiency. An expert agent's actions can be mimicked by an autonomous soft robotic agent, and these algorithms sidestep the difficult formulation of an appropriate cost function caused by the high action-space dimensionality of soft robots. Such techniques form an amalgam that can copy the expert as well as learn on its own via exploration, depending on the situation at hand, and hence should be a focus of future deep learning developments in soft robots. DRL alongside imitation learning has been applied to numerous scenarios and has been observed to provide satisfactory results, as shown in Table 4.

Future Scope
Deep learning has the potential to solve control problems such as manipulation in soft robots. Below, we list the stepping stones on the path of using deep reinforcement approaches to solve such tasks, which may be topics of great interest in the near future:
• Sample Efficiency: Collecting observations for training by making agents interact with their environments takes considerable effort and resources, especially for soft robotic systems, owing to the large number of actions possible at each state. The biomimetic motions [10,173] of flexible bio-inspired actuators motivate further research into efficient systems that can gather experience without this expense (a minimal illustration of experience reuse follows this list).
• Generalization between tasks: A fully trained model performs well on the tasks it was trained on but poorly on new tasks and situations. For soft robotic systems required to perform a varied set of correlated tasks, methods are needed that can transfer the learning from one training procedure to the tasks encountered at test time. There is therefore a need for completely autonomous systems that consume the fewest resources for training yet remain diverse in application. This challenge is of key significance for soft robots because of the heavy expense of letting them interact with the environment and the variability of their own structure and composition, which raises additional adaptation concerns.
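As one minimal illustration of the sample-efficiency point, off-policy methods reuse each expensive real-world interaction many times through an experience-replay buffer. The sketch below is our own illustrative construction in plain Python, not a mechanism from the cited works.

```python
import random
from collections import deque

class ReplayBuffer:
    """A minimal experience-replay buffer: each real interaction is stored
    once and reused across many gradient updates, which matters when
    rolling out a physical soft robot is slow and expensive."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, act, reward, next_obs, done):
        """Store one transition collected from the environment."""
        self.buffer.append((obs, act, reward, next_obs, done))

    def sample(self, batch_size: int):
        """Draw a random minibatch of past transitions for an update."""
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(list, zip(*batch)))
```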
Despite these challenges in control problems for soft robots, some topics are gaining the attention of DRL researchers because of their scope for future development. Two of them are:

• Unifying Reinforcement Learning and Imitation Learning: There have been several developments [175][176][177][178] aiming to combine the two paradigms and reap the benefits of both, wherein the agent learns from the actions of an expert while also interacting with and collecting experience from the environment itself. Learning from an expert's actions (generally a person, for soft robotic manipulation problems) can lead to suboptimal solutions, while training a reinforcement learning agent from scratch with deep neural networks can be expensive owing to the high-dimensional action space of soft robots. Current research in this domain focuses on models in which a soft robotic agent first learns from expert demonstrations and then, as time progresses, moves to DRL-based exploration, interacting with the continuously evolving environment to collect observations. In the near future, we could witness completely self-determining soft robotic systems that have the best of both worlds: learning from the expert in the beginning, yet equipped with the capability to learn on their own when necessary, thereby fully exploiting their amalgamated mechanical structure.
• Meta-Learning: The methods proposed by Finn et al. [148] and by Nichol and Schulman [179] can find parameters from relatively few data samples and produce good results on new tasks they have not been trained on. Such developments can be stepping stones toward robust, near-universal policies, and could be a milestone when it comes to combining deep learning technologies with soft robotics. It is generally hard to obtain a large dataset for a soft robotic system because of the expense of letting it interact with its environment, and soft robotic systems are generally harder to deal with than rigid ones; such learning procedures could therefore help soft robots perform satisfactorily from little data (a sketch of the Reptile-style update of Nichol and Schulman follows).
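For the meta-learning bullet above, here is a minimal sketch of the Reptile update of Nichol and Schulman [179] on a toy family of 1-D regression tasks: adapt to a sampled task with a few ordinary SGD steps, then move the meta-parameters a fraction of the way toward the adapted parameters. The toy task family, loss, and hyperparameters are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_on_task(theta, x, y, lr=0.1, steps=5):
    """Run a few plain SGD steps on one task's squared loss
    L(theta) = mean((theta*x - y)^2)."""
    for _ in range(steps):
        theta = theta - lr * np.mean(2.0 * x * (theta * x - y))
    return theta

theta = 0.0      # meta-parameters shared across tasks
epsilon = 0.1    # Reptile meta step size

for _ in range(1000):
    w = rng.uniform(-2.0, 2.0)                 # sample a task: y = w * x
    x = rng.uniform(-1.0, 1.0, size=20)
    theta_task = sgd_on_task(theta, x, w * x)  # adapt to the sampled task
    # Reptile update: move the meta-parameters toward the adapted ones.
    theta += epsilon * (theta_task - theta)
```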
Control of soft robots with enhanced flexibility and strength has become one of the premier domains for robotics researchers, and numerous DRL and imitation learning algorithms have been proposed for such systems. Recent works show substantial scope for further development, including directions that could branch out into separate areas of soft robotics research themselves. These challenges have opened new doors for artificially intelligent algorithms that will be a trending topic of discussion and debate in the coming decades; combining deep learning frameworks with soft robotic systems and extracting the benefits of both is seen as a key area of future development. For Table 4, image sources: MIT News/YouTube, NPG Press/YouTube, National Science Foundation (Credit: Siddharth Sanan, Carnegie Mellon University), Quality Point Tech/YouTube.

Conclusions
This paper presents an overview of deep reinforcement learning and imitation learning algorithms applied to problems involving the control of soft robots, where they have been observed to give state-of-the-art results in their domains of application, especially manipulation, where soft robots are extensively utilized. We have described the learning paradigms of various techniques, followed by instances of their application to real-life robotic control problems. Despite the growth of research in this field over the last decade, there remain challenges in the control of soft robots (it being a relatively new field within robotics) that need concentrated attention. Soft robotics is a constantly growing academic domain that exploits mechanical structure through the integration of materials, structures, and software; combined with the strengths of imitation learning and other DRL mechanisms, it can yield systems capable of assisting or replacing humans in many disciplines. We have listed the stepping stones toward soft robots that are completely autonomous and self-adapting yet physically strong.
In a nutshell, the question that gathers everyone's attention remains how the incorporation of DRL and imitation learning approaches can accelerate the already promising performance of soft robotic systems and unveil a plethora of possibilities for creating altogether self-sufficient systems in the near future.

Conflicts of Interest:
The authors declare no conflict of interest.