Real–Sim–Real Transfer for Real-World Robot Control Policy Learning with Deep Reinforcement Learning

Abstract: Compared to traditional data-driven learning methods, recently developed deep reinforcement learning (DRL) approaches can be employed to train robot agents to obtain control policies with appealing performance. However, learning control policies for real-world robots through DRL is costly and cumbersome. A promising alternative is to train policies in simulated environments and transfer the learned policies to real-world scenarios. Unfortunately, due to the reality gap between simulated and real-world environments, the policies learned in simulated environments often do not generalize well to the real world. Bridging the reality gap is still a challenging problem. In this paper, we propose a novel real–sim–real (RSR) transfer method that includes a real-to-sim training phase and a sim-to-real inference phase. In the real-to-sim training phase, a task-relevant simulated environment is constructed based on semantic information of the real-world scenario and coordinate transformation, and then a policy is trained with the DRL method in the built simulated environment. In the sim-to-real inference phase, the learned policy is directly applied to control the robot in real-world scenarios without any real-world data. Experimental results in two different robot control tasks show that the proposed RSR method can train skill policies with high generalization performance and significantly low training costs.


Introduction
Over the past decades, robots have been gradually applied in various fields, with the expectation of completing more control tasks for human beings. Traditional programming methods can perform certain tasks under the assumption that environments are known and structured [1]. However, robots often encounter working scenarios that are complicated and unpredictable in the real world. As a result, significant research attention has been given to data-driven learning methods [2,3], which avoid some of the challenges of analytic formulations and endow the learned policies with generalization capability. Recently, deep reinforcement learning (DRL) [4], which combines the reinforcement learning (RL) [5] method with deep neural networks, has achieved great success in areas such as video games [6] and the board game Go [7]. Inspired by this, many works apply DRL algorithms to train robots to obtain control policies in unstructured environments, with appealing performance [8]. However, DRL methods typically require huge amounts of training samples and large-scale random exploration, which causes mechanical wear and tear on robot hardware. As collecting training data on real-world robots is costly, potentially unsafe, and time-consuming, learning control policies for real-world robots can be difficult and tedious.
One promising method is to train control policies in simulated environments, where data generation is safe, convenient, and involves a lower cost, and then to transfer the learned policies to the real world. However, it is laborious to construct simulated environments similar to real-world working scenarios, especially with high fidelity. Consequently, the policies trained in simulated environments usually cannot directly work well in the real world due to the reality gap (discrepancies between simulated and real-world environments). Although many approaches have been proposed to bridge the reality gap, such as domain randomization (DR) [9] and domain adaptation (DA) [10], it remains a challenging problem.
To train control policies for real-world robots with high generalization capability, and to greatly reduce the training cost, in this work we propose a real-sim-real (RSR) transfer method that includes a real-to-sim training phase and a sim-to-real inference phase. In the real-to-sim training phase, a simplified task-relevant simulated environment is automatically constructed based on the semantic information of the real-world scenario and coordinate transformation. The control policies are trained with the DRL method in the built environment. In the sim-to-real inference phase, the trained policies are directly transferred to the real-world scenarios. Experimental results show that the proposed RSR method can train control policies for real-world robots with promising generalization performance and significantly low training costs.
In summary, the main contributions of this paper are listed as follows: (1) We present a new learning paradigm to train control policies for real-world robots with the DRL method. The learning pipeline is divided into a real-to-sim training phase and a sim-to-real inference phase, which trains robot control policy with a higher generalization capability and lower costs. (2) The proposed method automatically constructs a task-relevant simulated environment for policy learning based on semantic information of real-world working scenarios and coordinate transformation, which avoids the challenging problem of manually creating the simulated environments with high fidelity, endowing the policy learning process with high efficiency. (3) The proposed method directly employs the trained policy in real-world scenarios without any real-world training data or fine-tuning.
The rest of this document is organized as follows. In Section 2, previous research in this field is summarized. Section 3 describes the details of the proposed method. Section 4 shows the experiments and results. Finally, Section 5 presents the conclusions.

Robot Control Policy Learning
Data-driven learning algorithms are widely employed to train control policies, which can be classified into supervised learning methods, reinforcement learning methods, and recently developed deep reinforcement learning methods. The supervised learning methods take the state-action pairs of demonstration data as the training samples to learn mapping relationships between the states and actions [11], which have been successfully deployed in manipulation tasks [12,13], driving [14], and navigation [15]. However, generally, the policies learned with supervised learning methods are deeply influenced by the quality of demonstrations. Typically collecting high-quality demonstration data, especially in the field of robots, is not a trivial task [16,17].
Reinforcement learning methods train the robot agents to obtain optimal policies through trial and error [18–22]. Reinforcement learning has led to many successes in the domain of robot control when a low-dimensional state space or action space is available [23,24]. However, reinforcement learning shows limited success in continuous cases.
Deep reinforcement learning methods, which combine reinforcement learning with deep neural networks, have shown great potential in addressing high-dimensional and continuous state-action spaces for robot control policy learning. Deep Q-network (DQN) [6] has been implemented to train a reaching skill in 2D space for three-link robots [25,26]. In addition, deep deterministic policy gradient (DDPG) [27], trust region policy optimization (TRPO) [28], asynchronous advantage actor-critic (A3C) [29], generalized advantage estimation (GAE) [30], and proximal policy optimization (PPO) [31] have also been implemented in certain simulated robot control tasks, such as stacking blocks, hopping, or walking. Guided policy search (GPS) [32] is one of the few methods that can directly train control policies on real-world robots. Although deep reinforcement learning methods show high potential for control policy learning, the large amounts of training data they require are notoriously hard to collect, so acquiring manipulation skill policies for real-world robots through DRL is time-consuming and cumbersome.

Sim-to-Real Transfer
The simulated environment is an appealing alternative to real-world scenarios for policy learning. However, the reality gap introduces new challenges that must be solved before the trained policies can be applied effectively in real-world scenarios. A number of recent works have explored different strategies for policy learning in the context of robot control.
One natural way is to make simulated environments closely match the real world by using high-quality rendering. Some researchers create visually realistic simulated environments for 3-link robot reaching skill learning [25,26], or for 7-DOF robot arm grasping skill learning [33], hoping that the trained policies exhibit behavior in the real world similar to that in simulation, but these approaches show limited success. Others [34,35] use simulated depth images, which abstract away appearance properties of objects, and then employ the learned policy in the real world with a calibrated fixed depth camera. Unfortunately, a simulated environment rarely models the real world perfectly, and deploying policies trained in an imperfect simulation model can yield poor real-world performance. Unlike these approaches, our method allows the use of low-quality renderers that are not carefully matched to real-world scenarios, which is beneficial for low-cost policy learning.
Other works explore using domain adaptation to bridge the reality gap. Domain adaptation allows a learning model trained with data from a simulation domain to generalize to a real-world domain [36]. Stein et al. utilize CycleGAN to convert each synthetic image to a realistic-style one [37]. Cutler et al. use simulation data as a prior to train control policies, which decreases the number of real-world samples required [38]. Transferring synthetic images from the simulator to adapted images similar to real-world ones is also adopted [39]. Applying progressive networks [40], which share features between simulated environments and the real world, enables the learning of a manipulation policy. Domain adaptation is an important tool for addressing the reality gap, but in contrast to these approaches, ours requires no additional training on real-world data.
Several works have successfully explored the idea of domain randomization to bridge the reality gap. Policies learned in a simulator with varied 3D scenes and textures can be applied successfully to real-world quadrotor flight [41]. Similarly, randomizing the texture of objects, lighting conditions, and camera positions in the simulated environments during training is proposed [9], with the aim that models learned in simulation generalize to real-world scenarios with no additional training. Another approach manually adjusts the simulated environment to roughly match the appearance and dynamics of the laboratory setup, and then relies on domain randomization of only the camera position and orientation [42]. As the dynamics of a simulated robot may differ from its real-world counterpart, domain randomization is also explored in dynamics [43,44]. Other works also explore combining domain randomization with domain adaptation [45] for policy training on reaching tasks.
Unlike these approaches of manually designing a simulated model, which are grueling and time-consuming, the proposed RSR transfer method can automatically construct a task-relevant simulated environment based on semantic images of real-world scenes and coordinate transformation, which guarantees that the constructed simulated environment resembles its real-world counterpart with respect to policy training. In addition, our approach does not require any real-world training and attempts to directly apply policies learned in simulation to a real-world robot, without the burden of requiring human interactions during the training process.

Method
The background of classic reinforcement learning is a Markov decision process (MDP), defined by a tuple (S, A, π, r, P, γ), where S is a state set, A is an action set, π : S × A → R is a stochastic policy, r : S × A → R is a reward function, P : S × A × S → R is the transition dynamics, and γ ∈ (0, 1) is a discount factor. When the agent interacts with the environment using policy π, a trajectory τ : {s_0, a_0, r_0, s_1, a_1, r_1, · · · , s_T, a_T, r_T} is rolled out, where T is the length of τ. The discounted accumulated reward R(τ) can be written as:

$$R(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t,$$

where a_t ∼ π(a_t|s_t), s_{t+1} ∼ P(s_{t+1}|s_t, a_t), and "|" is the symbol for conditional probability. The value function V^π(s_t), state-action value function Q^π(s_t, a_t), and advantage function A^π(s_t, a_t) are defined as:

$$V^{\pi}(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{l=0}^{T-t} \gamma^{l} r_{t+l}\right],$$

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{T-t} \gamma^{l} r_{t+l}\right],$$

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t).$$

The policy is optimized by maximizing the accumulated rewards R(τ).
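As a small illustration of the discounted accumulated reward R(τ), the following sketch computes the sum of γ^t r_t over one trajectory, together with the reward-to-go from every time step. The function names and the sample rewards are assumptions made for this example and are not part of the proposed method.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one trajectory of rewards."""
    total = 0.0
    # Accumulate from the last step backwards: total_t = r_t + gamma * total_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def rewards_to_go(rewards, gamma):
    """Discounted sum of future rewards starting at every time step t."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Hypothetical three-step trajectory with unit rewards.
sample_rewards = [1.0, 1.0, 1.0]
print(discounted_return(sample_rewards, gamma=0.5))
print(rewards_to_go(sample_rewards, gamma=0.5))
```

The first entry of the reward-to-go list equals R(τ), since it is the discounted sum starting at t = 0.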
Most common robot control tasks, such as manipulation or navigation, require a robot to reach a desired state from an initial state. As a result, in this paper, we focus on learning control policies for different forms of reaching tasks in relatively complicated environments with obstacles, which remain challenging and are stepping stones to more complex tasks. Figure 1 shows the learning pipeline of our proposed method, which includes a real-to-sim training phase and a sim-to-real inference phase, as shown in Figure 1a,b, respectively. For conciseness, we mainly take the manipulation task as an example to illustrate our method. In the real-to-sim training phase, first, a semantic image is segmented from an RGB image of a real-world working scenario. Coordinate transformation maps each pixel position from the image coordinate system to the robot coordinate system. Then, a task-relevant simulated environment is generated based on the semantic information and coordinate transformation. Finally, a control policy is learned with the DRL method in the constructed simulated environment. In the sim-to-real inference phase, simulated-like synthetic images are generated based on the semantic images of the real world. The trained policy takes the synthetic images as input and outputs actions that directly control the real-world robot to perform the desired task.
Figure 1b. Sim-to-real inference phase. The trained policy takes simulated-like synthetic images as input and outputs actions to control real-world robots via ROS (Robot Operating System).

Generating a Simulated Environment
An RGB image I_rgb and a depth image I_dpt of a robot working scenario are captured from an RGB-D camera. As shown in Figure 1a, a semantic image I_sem is segmented from I_rgb using a fully convolutional network (FCN) [46]. To conveniently and effectively construct a task-relevant simulated environment for policy learning, the semantic image I_sem is simplified as shown in Figure 2. The target object is simplified to its geometric center. The robot (for the navigation task) or gripper (for the manipulation task) is simplified to a solid ball, the center and diameter of which are calculated from the contour of its semantic pixel region. The obstacles are completely preserved, and other irrelevant objects that cannot be obstacles are ignored. Given the depth image I_dpt, our method transforms each pixel position [u_i, v_i]^T of the RGB image from the image coordinate system to the robot coordinate system with the following coordinate transformation equation:

$$p_{ri} = R \left( z_{ci} \, M_{in}^{-1} \, [u_i, v_i, 1]^{T} \right) + T,$$

where i indexes the pixels, p_ri is the transformed position in the robot coordinate system, M_in is the camera intrinsic parameter matrix, z_ci is the depth value at pixel position [u_i, v_i]^T, and R and T are the calibrated rotation matrix and translation vector from the camera coordinate system to the robot coordinate system. As a result, we obtain the pose information of the corresponding objects in the robot coordinate system. Consequently, we get a task-relevant simulated environment, which is an abstraction of the real-world scene that keeps the information needed to train the desired control policy.
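The coordinate transformation described above can be sketched as follows. This is a minimal example that assumes a zero-skew pinhole camera, so that multiplying by the inverse of M_in reduces to a closed form in the focal lengths (fx, fy) and principal point (cx, cy). The function name, the intrinsics, and the calibration values are hypothetical and chosen only for illustration.

```python
def pixel_to_robot(u, v, z_c, intrinsics, R, T):
    """Map pixel (u, v) with depth z_c to a 3D point in the robot frame.

    intrinsics: (fx, fy, cx, cy) of an assumed zero-skew pinhole camera.
    R: 3x3 rotation matrix (nested lists), camera frame -> robot frame.
    T: translation vector of length 3, camera frame -> robot frame.
    """
    fx, fy, cx, cy = intrinsics
    # Back-projection p_c = z_c * M_in^{-1} [u, v, 1]^T in closed form.
    p_c = [(u - cx) * z_c / fx, (v - cy) * z_c / fy, z_c]
    # Rigid transform p_r = R p_c + T.
    return [sum(R[row][k] * p_c[k] for k in range(3)) + T[row] for row in range(3)]


# Hypothetical calibration: identity rotation and a 0.1 m offset along x.
K = (500.0, 500.0, 320.0, 240.0)
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T_offset = [0.1, 0.0, 0.0]
print(pixel_to_robot(420, 240, 2.0, K, R_identity, T_offset))
```

In practice R and T would come from a hand-eye calibration, and z_c would be read from the depth image at the same pixel.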
In the policy training period, the solid red ball corresponding to the robot (or gripper) represents the virtual agent, the solid green circle corresponding to the target object denotes the target position, the blue area corresponds to the obstacle area being unreachable for the virtual agent, and the black area corresponds to the background and irrelevant objects area being reachable for the virtual agent.
The initial positions of the virtual agent and target object, and the shape and number of obstacles, can be changed randomly in the simulated environment.
Figure 2. Illustration of generating a task-relevant simulated environment. The target object is simplified to the geometric center of its region in the semantic image. The robot (or gripper) is represented by a solid ball, the center and diameter of which are calculated from the contour of its semantic pixel region. The obstacles are completely preserved, and other irrelevant objects that cannot be obstacles are ignored.

Policy Network
The designed policy network is inspired by [32] and includes three convolutional layers and two fully connected layers, as shown in Figure 3. The policy network takes the simulated image I_{s,t} captured from the simulated environment at the current time step t, together with the simulated images I_{s,t−1} and I_{s,t−2} at time steps t − 1 and t − 2, as input. The simulated images are resized from the raw image size (640 × 480) to 240 × 240. The output of the network is the mean µ_θ(I_{s,t}) and variance σ_θ(I_{s,t}) of a Gaussian policy π_θ(·|I_{s,t}) = N(µ_θ(I_{s,t}), σ_θ(I_{s,t})), where θ represents the parameters of the policy neural network.
Figure 3. Architecture of the designed neural network policy. A stride of 2 and a filter size of 3 × 3 are used for all three convolutional layers. The ReLU nonlinear activation function is used throughout, with no pooling and no dropout. The input is three RGB images from the simulated environment, downsized to 240 × 240 and stacked into 9 channels, and the output is the action.
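To make the input and feature-map dimensions concrete, the sketch below stacks three RGB frames into 9 channels and traces the spatial size through three 3 × 3 convolutions with stride 2. The padding scheme is not stated in the text, so both the no-padding and one-pixel-padding variants shown here are assumptions, and the helper names are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=2, padding=0):
    """Spatial output size of one convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def stacked_channels(num_frames=3, channels_per_frame=3):
    """Channel count after stacking consecutive RGB frames."""
    return num_frames * channels_per_frame


def feature_map_sizes(input_size=240, num_layers=3, padding=0):
    """Sizes after each of the stride-2, 3x3 convolutional layers."""
    sizes = []
    size = input_size
    for _ in range(num_layers):
        size = conv_output_size(size, kernel=3, stride=2, padding=padding)
        sizes.append(size)
    return sizes


print(stacked_channels())             # channels of the stacked input
print(feature_map_sizes(padding=0))   # assumed: no padding
print(feature_map_sizes(padding=1))   # assumed: one-pixel padding
```

Under these assumptions the 240 × 240 input shrinks to 29 × 29 without padding, or to 30 × 30 with one-pixel padding, before the two fully connected layers.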

Policy Training
At time step t, the robot agent takes an action a_t according to the current state I_{s,t} and policy π_θ, receives reward r_t, and moves to the next state I_{s,t+1}. Repeating this procedure yields an episode trajectory τ : {I_{s,0}, a_0, r_0, · · · , I_{s,t}, a_t, r_t, · · · , I_{s,T}}. The reward r_t is a function of d and δ, where d = ‖x − x*‖ is the Euclidean distance between the robot (or gripper) position x and the target object center x*, and δ is the threshold that determines whether the agent has reached the target position (δ = 1 pixel). When encountering obstacles, the agent receives a reward of −10. We adopt proximal policy optimization (PPO) [31] to maximize the clipped surrogate objective L^clip(θ):

$$L^{clip}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\, A_t,\ \mathrm{clip}\left(\rho_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) A_t\right)\right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid I_{s,t})}{\pi_{\theta_{old}}(a_t \mid I_{s,t})},$$

where ρ_t(θ) is the probability ratio between the new and old policies, ε = 0.2, t specifies the time index in [0, T], π_θold denotes the old policy before the update, and A_t is the estimated advantage function, computed from the estimated value function V_ϕ(I_{s,t}) for state I_{s,t}. The parameter ϕ is updated by regression on the mean-squared error, and the policy parameter θ is updated by gradient ascent on the surrogate objective:

$$\theta \leftarrow \theta + \alpha \nabla_\theta L^{clip}(\theta),$$

where α is the learning rate.
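The clipped surrogate objective of PPO [31] can be illustrated with a short sketch. It evaluates the standard form of the objective for a batch of samples, given log-probabilities under the new and old policies and advantage estimates. The function name and the sample numbers are assumptions for illustration; an actual training loop would compute these quantities from the policy network and differentiate the result with respect to θ.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean PPO clipped surrogate objective over a batch of samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the new and old policies.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Pessimistic bound: the smaller of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Hypothetical batch: ratios of 1.5, 0.5, and 1.0 with mixed advantages.
logp_old = [0.0, 0.0, 0.0]
logp_new = [math.log(1.5), math.log(0.5), 0.0]
advantages = [1.0, -1.0, 2.0]
print(clipped_surrogate(logp_new, logp_old, advantages, eps=0.2))
```

With ε = 0.2, a ratio of 1.5 on a positive advantage contributes only 1.2, and a ratio of 0.5 on a negative advantage contributes −0.8, which is how clipping removes the incentive to move the policy far from the old one in a single update.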

Deploying the Trained Policy
The sketch of deploying the policy trained in the simulated environment to the real-world scenario is shown in Figure 1b. Similar to the policy training period, the semantic image I_sem is segmented from the captured RGB image. Our method then synthesizes low-fidelity simulated-like images I_syn from the segmented image according to the following rules: the target object is simplified to its geometric center; the robot (or gripper) is represented by a solid circle, the center and diameter of which are calculated from the contour of its semantic pixel region; the obstacles are completely preserved, and other irrelevant objects that cannot be obstacles are ignored.
Similar to the training phase, at time step t of the deployment period, the trained policy takes the synthetic image I_{syn,t}, together with the synthetic images I_{syn,t−1} and I_{syn,t−2} at time steps t − 1 and t − 2, as input and outputs actions that directly control the real-world robot. Consequently, the trained policy is not fine-tuned with real-world training data in the inference phase. The only additional step is to convert the real-world images to simulated-like synthetic images, which is efficient and inexpensive.
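The synthesis rules above can be sketched on a small label grid. This example assumes the semantic image is a 2D list of integer class labels, estimates the agent diameter from the bounding box of its region rather than from a traced contour, and uses hypothetical label values. It illustrates the rules and is not the implementation used in the experiments.

```python
BACKGROUND, AGENT, OBSTACLE, TARGET = 0, 1, 2, 3  # hypothetical label values


def region_pixels(sem, label):
    """All (row, col) positions in the label grid that carry the given label."""
    return [(r, c) for r, row in enumerate(sem) for c, v in enumerate(row) if v == label]


def centroid(pixels):
    """Mean row and column of a list of pixel positions."""
    n = len(pixels)
    return sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n


def synthesize(sem):
    """Build a simplified simulated-like label image from a semantic image."""
    h, w = len(sem), len(sem[0])
    # Everything starts as background, so irrelevant objects are dropped.
    out = [[BACKGROUND] * w for _ in range(h)]
    # Obstacles are preserved pixel for pixel.
    for r, c in region_pixels(sem, OBSTACLE):
        out[r][c] = OBSTACLE
    # The agent becomes a solid disc at its centroid.
    agent = region_pixels(sem, AGENT)
    if agent:
        cr, cc = centroid(agent)
        rows = [r for r, _ in agent]
        cols = [c for _, c in agent]
        # Assumed diameter estimate: the larger side of the bounding box.
        radius = (max(max(rows) - min(rows), max(cols) - min(cols)) + 1) / 2
        for r in range(h):
            for c in range(w):
                if (r - cr) ** 2 + (c - cc) ** 2 <= radius ** 2:
                    out[r][c] = AGENT
    # The target is reduced to the single pixel at its geometric center.
    target = region_pixels(sem, TARGET)
    if target:
        tr, tc = centroid(target)
        out[round(tr)][round(tc)] = TARGET
    return out
```

The same routine could serve both phases, since the training environment and the deployment images follow the same simplification rules.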
The fully detailed algorithm is shown in Algorithm 1.

Performance Evaluation
The performance of the learned policy is evaluated in a real-world scenario in terms of the success rate S_rate, which is the ratio of the number of trials that achieve the desired task within the allowed error δ to the total number of test trials N:

$$S_{rate} = \frac{1}{N} \sum_{i=1}^{N} h\left(d_e^{i} \le \delta\right),$$

where h(·) is an indicator function that outputs 1 when its input is True and 0 when its input is False, and d_e^i = ‖p_f − p_d‖ is the distance error, measured as the Euclidean distance between the target position p_d and the final robot (or gripper) position p_f at the end of the ith episode.
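The success rate can be computed directly from the final and target positions of each test episode. The following sketch assumes positions are given as coordinate tuples; the function name and the sample values are hypothetical.

```python
import math


def success_rate(final_positions, target_positions, delta):
    """Fraction of episodes whose final distance error is within delta."""
    successes = 0
    for p_f, p_d in zip(final_positions, target_positions):
        # Euclidean distance error at the end of the episode.
        d_e = math.dist(p_f, p_d)
        # Indicator h(d_e <= delta): count 1 for a success, 0 otherwise.
        successes += 1 if d_e <= delta else 0
    return successes / len(final_positions)


# Hypothetical results of three test episodes in 2D.
finals = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
targets = [(0.0, 0.5), (0.0, 0.0), (1.0, 1.0)]
print(success_rate(finals, targets, delta=1.0))
```

Here the distance errors are 0.5, 5.0, and 0.0, so two of the three episodes count as successes.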

Algorithm 1 RSR transfer method
Real-to-sim training phase:
1: Capture RGB image I_rgb and depth image I_dpt from the RGB-D camera.
2: Obtain semantic image I_sem based on FCN.
3: Construct a task-relevant simulated environment with I_sem and coordinate transformation.
4: Design a policy network.
5: Train the policy with PPO in the constructed simulated environment.
6: for k = 1, 2, · · · do
7:   Collect trajectory τ.
8:   Update policy parameter θ by maximizing the surrogate objective L^clip(θ) (Equation (7)).
9:   Fit value function V_ϕ (Equation (10)).
10: end for
11: Until the policy converges.
Sim-to-real inference phase:
12: for k = 1, 2, · · · , T do
13:   Segment semantic image I_sem from the captured real-world RGB image.
14:   Synthesize simulated-like images I_syn.
15:   The trained policy takes the synthetic images as input and outputs actions controlling the real-world robot.
16: end for
17: Until the target task is finished.

Experiments and Results
To evaluate the proposed RSR method, experiments were carried out on two designed tasks: a UR5 robot manipulation task and a TurtleBot navigation task, as shown in Figure 4. The first task was to learn a skill policy that controls the robot gripper to reach a target object while avoiding obstacles in 3D space. The second was to train the TurtleBot to navigate from a random starting position to a random goal position in 2D space, also without colliding with obstacles. A real-world RGB-D camera was visually calibrated to match the position and orientation of the simulated camera for each task.

Semantic Segmentation of Robot Working Scenarios
In this work, for the manipulation task, the objects situated in our robot working scenarios were classified into background, robot gripper, obstacle, toy dolphin, toy hedgehog, toy squirrel, and toy lion. For the navigation task, the objects were classified into background, robot, obstacle, and target object. We collected RGB images of the robot working scenarios from the real-world camera. To make the training data for semantic segmentation, each pixel of the RGB images was labeled with one of the above categories.
The FCN was based on VGG-16 [47], for which pre-trained weights are widely available. We generated a training set of 200 samples and a validation set of 50 samples for each task. The semantic segmentation network converged after 20 iteration steps using the SGD (stochastic gradient descent) optimization method with a batch size of 32 images. The semantic segmentation results are shown in Figure 5. We also found that the segmentation module works well in scenarios with objects overlapping each other. Moreover, we adopted a dilation and erosion technique to filter noise from each semantic image, which we found to be beneficial for obtaining the centroids of the robot (or gripper) and the target object.
Figure 5. Semantic segmentation results of robot working scenarios. The first row shows the RGB images, and the second row shows their corresponding semantic images. The left two are for the manipulation task, and the right two are for the navigation task. Best viewed in color.

Policy Learning
The task-relevant simulated environment was generated automatically with our proposed method in the MuJoCo physics engine [48] interfaced with OpenAI Gym [49], as shown in Figure 6. For the manipulation task, the action dimension is 3, which moves the gripper in 3D space. For the navigation task, the action dimension is 2, which controls the TurtleBot's navigation in 2D space. We compared our method against several baseline methods.
(1) Transfer-RGB: directly training the policy with simulated RGB images and using real-world RGB images in the inference period. (2) Transfer-Depth: training the policy with simulated depth images and using real-world depth images in the inference period. For the Transfer-RGB, Transfer-Depth, DR, and DA methods, we built a simulated environment similar to our real-world robot working scenario in the Gazebo simulator (https://gazebosim.org). The domain randomization technique follows [9] (https://github.com/neka-nat/gazebo_domain_randomization). All of the methods share the same neural network architecture. We initialized the convolutional layers of the policy networks with weights pre-trained on ImageNet. The policy networks were trained with TensorFlow (https://www.tensorflow.org) on an NVIDIA GTX 1080 GPU. Table 1 summarizes the parameters used in our experiments, including the Adam optimizer [50].
In the policy training phase, to evaluate the success rates of the trained policies for the manipulation task, we randomly chose one of the toys (toy dolphin, toy hedgehog, toy squirrel, and toy lion) as the target object and placed it at a random position in front of the UR5 robot, within its workspace.
The success rates were evaluated by performing the manipulation task 20 times with the robot gripper in different starting positions and the target object in different goal positions every 30 policy iteration steps. For the navigation task, the navigation scenario was a manually designed rectangular area with a length of 4.5 m and a width of 4.0 m. The success rates were calculated by conducting the navigation task 20 times with different initial positions and target positions, and were also evaluated every 30 policy iteration steps. To test the generalization performance of the trained policy, we randomly changed the robot (or gripper) starting position, target object position, obstacle position, and the number of obstacles in real-world scenarios.
We trained three different instances of each algorithm with different random seeds. Figure 7a,b illustrates the learning curves of our proposed method and the baseline methods applied in the manipulation task and navigation task, respectively. The solid curves correspond to the mean success rates and the shaded regions to the minimum and maximum success rates over the three trials. Table 2 summarizes the average success rates of the final trained policies. The proposed method achieves average success rates of 89% in the manipulation task and 93% in the navigation task. Compared to the DR and DA methods, the results on the two designed real-world tasks show that our proposed learning pipeline achieves better accuracy and higher generalization capability. We find that the policy trained in a simulated environment with RGB images cannot be successfully deployed in real-world scenarios, confirming that the reality gap has a significant harmful influence on policies directly transferred from simulated environments to the real world. The policy trained with depth images also demonstrates poor performance. Although the generated simulated-like synthetic images do not contain rich information like the real-world ones, the experimental results show that images rendered in low fidelity with our method provide useful information for policy learning. Another advantage of our learning paradigm over existing methods is that it is efficient and low-cost, not only because it needs no real-world data or fine-tuning for real-world policy learning, but also because the simulated environment for policy learning is constructed automatically. Figures 8 and 9 show the frames of the final trained policies deployed on the manipulation task and the navigation task in the constructed simulated environment, respectively. Our proposed method succeeds in learning these two designed tasks in simulated environments.

Conclusions
In this paper, we study the possibility of directly transferring the policies trained in simulated environments to the real world with high generalization capability and low costs. We propose a novel real-sim-real (RSR) transfer method for control policy learning in real-world robots. In the real-to-sim training phase, a task-relevant simulated environment is automatically constructed based on semantic information of real-world working scenarios and coordinate transformation, and then policies are learned in the built simulated environment with the DRL method. In the sim-to-real inference phase, the trained policy is directly employed in the real world. As real-world scenarios are usually complicated and unstructured, the DRL method shows great potential for developing skill policies that work well in such environments. The experimental results show that our proposed method can effectively and efficiently learn control policies for real-world robots using the DRL method. The policies trained with our method show high generalization capability at low training cost.
In future work, we intend to extend the training scenarios to a richer repertoire of tasks that are more common in real life. In addition, to improve the performance of the trained policies, we will incorporate more modalities, such as robot state information or haptic sensor information, for policy learning. Another direction is combining our RSR method with domain adaptation or domain randomization methods.