Variable Compliance Control for Robotic Peg-in-Hole Assembly: A Deep Reinforcement Learning Approach

Industrial robot manipulators are playing a more significant role in modern manufacturing industries. Though peg-in-hole assembly is a common industrial task that has been extensively researched, safely solving complex high precision assembly in an unstructured environment remains an open problem. Reinforcement Learning (RL) methods have been proven successful in solving manipulation tasks autonomously. However, RL is still not widely adopted on real robotic systems because working with real hardware entails additional challenges, especially when using position-controlled manipulators. The main contribution of this work is a learning-based method to solve peg-in-hole tasks with position uncertainty of the hole. We propose the use of an off-policy model-free reinforcement learning method and speed up training by using transfer learning from simulation to reality (sim2real) and domain randomization. Our proposed learning framework for position-controlled robots was extensively evaluated on contact-rich insertion tasks in a variety of environments.


Introduction
Autonomous robotic assembly is an essential component of industrial applications. Industrial robot manipulators are playing a more significant role in modern manufacturing industries with the goal of improving production efficiency and reducing costs. Though peg-in-hole assembly is a common industrial task which has been extensively researched, safely solving complex high precision assembly in an unstructured environment remains an open problem [1].
Most common industrial robots are joint position-controlled. For this type of robot, compliance control is necessary to safely attempt contact-rich tasks; otherwise, the robot is prone to generate large, unsafe assembly forces even with tiny position errors. Compliant robot assembly has been studied in two ways: passive and active methods. In the passive method, a mechanical device called a Remote Center Compliance (RCC) [2] is placed between the robot's wrist and gripper. The passive compliance provided by the RCC lets the gripper move perpendicular to the peg's axis and rotate freely so as to reduce resistance. However, the passive method does not work well with high precision assembly [3]. On the other hand, active compliance methods correct assembly error through sensor feedback. In general, these methods use force sensors to detect the external forces and moments, and design control strategies based on dynamic models of the task to minimize the contact force [4][5][6][7][8]. Some active methods mimic the human's compliance during assembly [9][10][11][12]. Nevertheless, most of these assembly methods are not practical for real applications. Model parameters need to be identified and controller gains need to be tuned; in both cases, the process is manually engineered for specific tasks, which requires considerable time, effort, and expertise. These approaches are also not robust to uncertainties and do not generalize well to variations in the environment.
To reduce human involvement and increase the robustness to uncertainties, most recent research is focused on learning the assembly skills either from human demonstrations [13] or directly from interaction with the environment [14]. The present research focuses on the latter.
Reinforcement learning (RL) methods enable agents to learn complex behaviors through interaction with the surrounding environment and by maximizing the rewards received from the environment; ideally, the agents' behavior can generalize to unseen scenarios or tasks [14]. Therefore, RL can be applied to robotic agents to learn high precision assembly skills instead of only transferring human skills to the robot program [15]. Recent studies have shown the importance of RL for robotic manipulation tasks [16][17][18], but none of these methods can be applied directly to high precision industrial applications due to the lack of fine motion control.
In [19], an RL technique was used to learn a simple peg-in-hole insertion operation. Similarly, Inuo et al. [20] proposed a robot skill acquisition approach by training a recurrent neural network to learn a peg-in-hole assembly policy. However, these approaches use a finite number of actions by discretizing the action space, which has many limitations in continuous control tasks [21], as is the case for robot control, which is continuous and high-dimensional.
Xu et al. [22] proposed learning a dual peg insertion using the Deep Deterministic Policy Gradient [23] (DDPG) algorithm with a fuzzy reward system. Similarly, Fan et al. [24] combine DDPG with Guided Policy Search (GPS) [25] to learn high-precision assembly tasks. Luo et al. [26] also use GPS to learn a peg-in-hole task on a deformable surface. Nevertheless, these methods learn policies that only control the motion trajectory and still require manual tuning of the force control gains; therefore, they do not scale well to variations of the environment.
Ren et al. [27] proposed the use of DDPG to simultaneously control position and force control gains; nevertheless, they assume geometric knowledge of the insertion task, which prevents the learned policies from being applied to different insertion tasks. To solve high precision assembly tasks, our approach focuses on learning policies that simultaneously control the robot's motion trajectory and actively tune a compliant controller to unknown geometric constraints.
Buchli et al. [28] and Abu-Dakka et al. [29] accomplished variable stiffness skill learning on robot manipulators by using an RL algorithm called Policy Improvement with Path Integrals (PI2). However, the method is formulated for torque-controlled robots. Another similar approach is to use a flexible robot so as to focus only on the motion trajectory, as in [30]; however, rigid position-controlled robots are still more widely used. Therefore, we focus on industrial robot manipulators, which are mostly position-controlled.
The main contribution of this work is a learning-based framework for robotic peg-in-hole assembly with an uncertain goal position. Our method enables a position-controlled industrial robot manipulator to safely learn contact-rich manipulation tasks by controlling the nominal trajectory and, at the same time, learning variable force control gains for each phase of the task. We build on top of our previous work [31] by adding a more robust policy representation and adapting domain transfer learning techniques (sim2real) to greatly improve training efficiency on the real robot. The effectiveness of the proposed method is shown through extensive evaluation with a real robotic system on a variety of contact-rich peg-in-hole insertion tasks. Although the effects of domain randomization have been studied [32,33], to the best of our knowledge, we are the first to study the effects of sim2real with domain randomization on contact-rich real-robot applications with position-controlled robots.

Problem statement
In the present study, we consider the peg-in-hole assembly task that requires the mating of two components. One of the components is grasped and manipulated by the robot manipulator while the second component has a fixed position, either fixed to a surface of the environment or held by a second robot manipulator. Figure 1 provides a 2D representation of the insertion tasks considered and the components assumed to be available to solve the task. The proposed method is designed for a position-controlled robot manipulator with a Force/Torque sensor at its wrist. Typically, these insertion tasks can be broadly divided into two main phases [34]: search and insertion. During the search phase, the robot aligns the peg within the clearance region of the hole. At the beginning, the peg is located at a distance from the center of the hole in a random direction. This distance from the hole is assumed to be the "positional error". During the insertion phase, the robot adjusts the orientation of the peg with respect to the hole orientation and pushes the peg to the desired position. We focus on both phases of the assembly task with the following assumptions:

• The manipulated object is already firmly grasped. However, slight changes of the object orientation within the gripper are possible during the manipulation.
• There is access to an imperfect prediction of the target end-effector pose (as shown in Figure 1) or a reference trajectory, and its degree of uncertainty.
• The manipulated object is inserted in a direction parallel to the gripper's orientation.
We consider the second assumption fair given the advances in vision recognition techniques, where the 6D pose of objects can be estimated from single RGB images [35][36][37][38][39] or RGB images with depth maps (RGB-D) [40][41][42][43]. The predictions are in many cases accurate enough to be used for robot manipulation. Moreover, this second assumption also includes the specific case of using an assembly planner [44,45], where even if the initial position of the objects is known, the inevitable error throughout the manipulation (e.g., pick-and-place, grasping, and re-grasping) makes the position/orientation of the manipulated objects uncertain at the insertion phase. A reference trajectory could similarly be obtained from demonstrations [13,[46][47][48] when a complex motion is required to achieve the insertion. The last assumption allows us to define a desired insertion force that may vary for different insertion tasks without loss of generality.

Figure 2. Our proposed framework. Based on an estimated target position for an insertion task, our system learns a control policy that defines the motion trajectory and force control parameters of an adaptive compliance controller to control an industrial robot manipulator.

System Overview
Our proposed system aims to solve assembly tasks with an uncertain goal pose. Figure 2 shows the overall system architecture. There are two control loops. The inner loop is an adaptive compliance controller; we choose a parallel position-force controller, which was proven to work well for this kind of contact-rich manipulation task [31]. The inner loop runs at a control frequency of 500 Hz, which is the maximum available in the Universal Robots e-series robotic arms (robot details at https://www.universal-robots.com/e-series/). Details of the parallel controller are provided in Section 2.2.3. The outer loop is an RL control policy running at 20 Hz that provides sub-goal positions and the parameters of the compliance controller. The outer loop's slower control frequency allows the policy to process the robot state and compute the next action to be taken by the manipulator, while the inner loop's precise high-frequency control seeks to achieve and maintain the sub-goal provided by the policy. Details of the RL algorithm and the policy architecture are provided in Section 2.2. Finally, the input to the system is an estimated target position and orientation for the insertion task.
The motion commands, x_c, sent to the adaptive compliance controller correspond to the pose of the robot's end-effector. The pose is of the form x = [p, φ], where p ∈ R^3 is the position vector and φ ∈ R^4 is the orientation vector. The orientation is described using Euler parameters (unit quaternions) denoted as φ = {η, ε}, where η ∈ R is the scalar part of the quaternion and ε ∈ R^3 is the vector part.
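For concreteness, such a pose command could be represented as in the following minimal sketch (the class name and the normalization helper are illustrative, not part of the paper's implementation):

import numpy as np
from dataclasses import dataclass, field

@dataclass
class PoseCommand:
    """End-effector pose x = [p, phi]: position p in R^3 and orientation phi
    as a unit quaternion {eta, epsilon} (scalar part eta, vector part epsilon)."""
    position: np.ndarray = field(default_factory=lambda: np.zeros(3))
    eta: float = 1.0                                                   # scalar part of the quaternion
    epsilon: np.ndarray = field(default_factory=lambda: np.zeros(3))   # vector part of the quaternion

    def normalized(self) -> "PoseCommand":
        """Return a copy with the quaternion re-normalized to unit length."""
        q = np.concatenate(([self.eta], self.epsilon))
        q = q / np.linalg.norm(q)
        return PoseCommand(self.position.copy(), q[0], q[1:])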

Reinforcement Learning Algorithm
Robotic reinforcement learning is a control problem where a robot, the agent, acts in a stochastic environment by sequentially choosing actions over a sequence of time steps, with the goal of maximizing a cumulative reward. This problem is modeled as a Markov Decision Process. The environment is described by a state s ∈ S. The agent can perform actions a ∈ A and perceives the environment through observations o ∈ O, which may or may not be equal to s. We consider an episodic interaction with a limit of T time steps per episode. The agent's goal is to find a policy π(a(t) | o(t)) that selects actions a(t) conditioned on the observations o(t) to control the dynamical system. Given stochastic dynamics p(s(t + 1) | s(t), a(t)) and a reward function r(s, a), the aim is to find a policy π* that maximizes the expected sum of future rewards R(t) = ∑_{i=t}^{∞} γ^{i−t} r(s(i), a(i)), with γ being a discount factor [14]. In this work, we use the RL algorithm called Soft Actor-Critic (SAC), one of the state-of-the-art algorithms with high sample efficiency, which makes it well suited for real robotic applications. SAC [49] is an off-policy actor-critic deep RL algorithm based on the maximum entropy framework: it maximizes the expected reward while also maximizing the policy's entropy, weighted by a temperature parameter α that controls the degree of exploration. The core idea of this method is to succeed at the task while acting as randomly as possible. Since SAC is an off-policy algorithm, it uses a replay buffer to reuse information from recent rollouts for sample-efficient training. Additionally, we use the distributed prioritized experience replay approach for further improvement [50]. Our implementation of the SAC algorithm is based on the TF2RL repository.
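For completeness, the maximum entropy objective optimized by SAC can be written in its standard form (as introduced in [49]):

J(π) = ∑_t E_{(s(t), a(t)) ∼ ρ_π} [ r(s(t), a(t)) + α H(π(· | s(t))) ]

where H denotes the policy entropy and α is the temperature parameter that trades off reward maximization against exploration.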

Multi-modal Policy Architecture
The control policy is represented using neural networks, as shown in Figure 3. The policy input is the robot state. The robot state includes the proprioception information of the manipulator and haptic information. The proprioception includes the pose error between the current robot's end-effector pose and the predicted target pose, x_e, the end-effector velocity ẋ, the desired insertion force, F_g, and the actions taken in the previous time step, a_{t−1}. The proprioception feedback is encoded with a neural network with 2 fully connected layers with ReLU activation functions to produce a 32-dimensional feature vector. For the force-torque feedback, we consider the last 12 readings from the six-axis F/T sensor, filtered using a low-pass filter, as a 12 × 6 time series. The F/T time series is fed to a temporal convolutional network (TCN) [51] to produce another 32-dimensional feature vector. The feature vectors from proprioception and haptic information are concatenated to obtain a 64-dimensional feature vector, which is then fed to two fully connected layers to predict the next action. The policy outputs the actions for a parallel position-force controller. The policy produces two types of actions, a = [a_x, a_p], where a_x = [p, φ] are position/orientation sub-goals and a_p are parameters of the parallel controller. The specific parameters controlled by a_p are described in Section 2.2.3. Our proposed method uses a common force control scheme combined with a reinforcement learning policy to learn contact-rich manipulation with a rigid position-controlled robot. For the family of contact-rich manipulation tasks that require some sort of insertion, the parallel position-force control [52] performs better and can be learned faster than an admittance control scheme when combined with an RL policy [31].
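The multi-modal encoder can be summarized with the following sketch (a minimal Keras illustration, not the exact network of the paper; the proprioception input dimension and the hidden layer sizes beyond those stated above are assumptions, and the TCN is simplified to stacked dilated convolutions without residual blocks):

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_policy_network(action_dim=24, proprio_dim=25):
    # Proprioception: pose error, end-effector velocity, desired force, previous action.
    # proprio_dim is a placeholder for the concatenated proprioception vector.
    proprio_in = layers.Input(shape=(proprio_dim,), name="proprioception")
    x = layers.Dense(64, activation="relu")(proprio_in)
    x = layers.Dense(32, activation="relu")(x)            # 32-d proprioception features

    # Force/Torque feedback: last 12 readings of the six-axis F/T sensor (12 x 6 time series).
    ft_in = layers.Input(shape=(12, 6), name="force_torque")
    y = layers.Conv1D(32, kernel_size=3, dilation_rate=1, padding="causal", activation="relu")(ft_in)
    y = layers.Conv1D(32, kernel_size=3, dilation_rate=2, padding="causal", activation="relu")(y)
    y = layers.GlobalAveragePooling1D()(y)
    y = layers.Dense(32, activation="relu")(y)            # 32-d haptic features

    # Fuse both modalities and predict the action: pose sub-goal a_x and controller parameters a_p.
    z = layers.Concatenate()([x, y])                      # 64-d fused feature vector
    z = layers.Dense(64, activation="relu")(z)
    z = layers.Dense(64, activation="relu")(z)
    action = layers.Dense(action_dim, activation="tanh", name="action")(z)  # actions in [-1, 1]
    return Model(inputs=[proprio_in, ft_in], outputs=action)

policy = build_policy_network()
policy.summary()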

Compliance Control in Task Space
The implemented parallel controller is depicted in Figure 4. A PID parallel position-force control is used with the addition of a selection matrix to define the degree of control of position and force over each direction. The control law consists of a PD action on position, a PI action on force, the selection matrix, and the policy position action, a_x, where F_e = F_g − F_ext and x_c is the pose commanded to the robot. The selection matrix S has one entry per direction, where each value corresponds to the degree of control that each controller has over the given direction. Our parallel control scheme has a total of 30 parameters: 12 from the position PD controller's gains, 12 from the force PI controller's gains, and 6 from the selection matrix S. We reduced the number of controllable parameters to prevent unstable behavior and to reduce the system's complexity. For the PD controller, only the proportional gain, K_p^x, is controllable, while the derivative gain, K_d^x, is computed from K_p^x so that the response is critically damped. The resulting controller-parameter action is a_p ∈ R^18. As a safety measure, we narrow the agent's choices for the force control parameters by imposing upper and lower limits on each parameter. Assuming we have access to some baseline gain values, P_base, we define a range of potential values for each parameter as [P_base − P_range, P_base + P_range], with the constant P_range defining the size of the range. We map the policy actions a_p from the range [−1, 1] to each parameter's range. P_base and P_range are hyperparameters of our method.
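As an illustration of these safety bounds, the sketch below maps a normalized action vector to controller parameters (a minimal example; the baseline values and ranges are placeholders, not the ones used in the experiments):

import numpy as np

def map_action_to_params(a_p, p_base, p_range):
    """Map policy actions a_p in [-1, 1] to the allowed parameter range
    [p_base - p_range, p_base + p_range] for each controller parameter."""
    a_p = np.clip(a_p, -1.0, 1.0)
    return p_base + a_p * p_range

# Placeholder baseline gains and ranges (hypothetical values for illustration only).
p_base = np.concatenate([
    np.full(6, 100.0),   # position proportional gains K_p^x
    np.full(6, 0.05),    # force controller gains
    np.full(6, 0.5),     # selection matrix entries S
])
p_range = np.concatenate([
    np.full(6, 50.0),
    np.full(6, 0.04),
    np.full(6, 0.5),
])

a_p = np.random.uniform(-1, 1, size=18)      # example policy output, a_p in R^18
params = map_action_to_params(a_p, p_base, p_range)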

Task's reward function
For all the insertion tasks considered, the same reward function was used. The reward combines the distance to the goal and the contact force, where F_g is the desired insertion force, F_ext is the contact force, and F_max is the maximum contact force allowed. L_m(·) linearly maps its argument to the interval [0, 1] such that smaller values yield higher rewards; thus, the closer to the goal and the lower the contact force, the higher the reward obtained. || · ||_{1,2} is the L1,2 norm based on [16].
κ is an additional reward defined as follows. During training, the task is considered completed if the Euclidean distance between the robot's end-effector position and the true goal position is less than 1 mm. The agent is encouraged to complete the task as fast as possible by providing an extra reward for every unused time step with respect to the maximum number of time steps per episode, T. Moreover, we impose a collision constraint, where the agent is penalized for colliding with the environment by receiving a negative reward and by having the episode terminated early. This collision constraint encourages safer exploration, as shown in our previous work [31]. We define a collision as exceeding the force limit, F_max; therefore, neither a collision detector nor geometric knowledge of the environment is necessary. Finally, each component is weighted via w; all w's are hyperparameters.
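A minimal sketch of how such a per-step reward could be computed, assuming a weighted combination of the distance and force terms described above (the exact functional form, the weights w, and the numeric values below are placeholders, not the ones used in the paper):

import numpy as np

def reward(dist_to_goal, f_ext, f_g, f_max, t, T,
           w_dist=1.0, w_force=1.0, w_done=100.0, w_collision=-50.0,
           max_dist=0.05):
    """Hypothetical per-step reward following the structure described in the text."""
    # Linear mappings to [0, 1]: smaller distance / force error -> higher reward.
    l_dist = 1.0 - np.clip(dist_to_goal / max_dist, 0.0, 1.0)
    l_force = 1.0 - np.clip(np.linalg.norm(f_ext - f_g) / f_max, 0.0, 1.0)
    r = w_dist * l_dist + w_force * l_force

    if np.linalg.norm(f_ext) > f_max:        # collision: penalize and end the episode
        return r + w_collision, True
    if dist_to_goal < 1e-3:                  # task completed (within 1 mm of the goal)
        r += w_done * (T - t) / T            # bonus proportional to the unused time steps
        return r, True
    return r, False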

Speeding up Learning
Two strategies were adopted to speed up the learning process. First, the exploitation of prior knowledge using the idea of Residual RL [53,54]. Second, we use a physics simulator to train the robot on a peg insertion task and transfer the learned policy directly to the real robot (sim2real).

Residual Reinforcement Learning
To speed up the learning of the control policy for insertion tasks that require complex manipulation, we use residual reinforcement learning [53,54]. The goal is to leverage prior knowledge during training. With the assumption of an estimated target position or a reference trajectory, we can manually define a controller that produces a reference command x_g. Said controller's signal is then combined with the policy action, a_x.
The objective is to avoid training the policy from scratch and to avoid exploring the entire parameter space. The position command sent to the robot is x_c = x'_g + x_f + a_x, where x'_g is the reference trajectory processed through the PD position controller, x_f is the response to the contact force, and a_x is the policy's position action, as shown in Figure 4. The first two terms come from the parallel controller. Therefore, the policy only needs to learn to adjust the reference trajectory to achieve the task.
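A minimal sketch of this residual composition is given below (the scaling of the normalized policy action is an assumption for illustration; it is not the paper's implementation):

import numpy as np

def residual_position_command(x_ref_pd, x_force, a_x, action_scale=0.002):
    """Combine the PD-tracked reference (x'_g), the force response (x_f), and the
    learned residual a_x. action_scale (meters) bounds how far the residual can
    shift the command per step (hypothetical value)."""
    return x_ref_pd + x_force + action_scale * np.clip(a_x, -1.0, 1.0)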

Sim2real
The proposed method works in the robot's end-effector Cartesian task-space, which makes it easier to transfer learning from simulation to the real robot, or even between robots [55]. For most insertion tasks, a simple peg insertion task was used for training in a physics simulator; we used Gazebo 9 [56]. To close the reality gap between the physics simulator and the real-world dynamics, we use domain randomization [57]. During training in the simulator, the following aspects were randomized (a sampling sketch follows the list):

• Initial/goal end-effector position. Having random initial/goal positions helps the RL algorithm find policies that generalize to a wide range of initial position conditions.
• Object surface stiffness. The RL agent also needs to learn to fine-tune the force controller parameters to get a proper response to the contact force. Therefore, randomizing the stiffness of the manipulated objects helps it find policies that adapt to different dynamic conditions.
• Uncertainty error of the goal pose prediction. On the real robot, the prediction of the target pose comes from noisy sensory information, either from a vision detection system or from prior known manipulations (grasp and re-grasp). Thus, during training in simulation, we emulate this error using a Gaussian distribution with zero mean and a standard deviation given by a maximum distance error (for position and orientation).
• Desired insertion force. For different insertion tasks, a specific contact force is necessary for the insertion to succeed. As we consider the insertion force an input to the policy, we randomize this value for each episode during training.
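The episode-level randomization can be sketched as follows (the numeric ranges are hypothetical placeholders; the actual ranges used for training are those listed in Table 1):

import numpy as np

def sample_episode_conditions(rng=np.random.default_rng()):
    """Sample one set of randomized training conditions (re-sampled each episode)."""
    return {
        # Random initial offset of the end-effector relative to the goal (meters).
        "initial_offset": rng.uniform(-0.03, 0.03, size=3),
        # Random surface stiffness of the objects in contact (N/m).
        "surface_stiffness": rng.uniform(1e3, 1e5),
        # Gaussian error added to the goal pose estimate: position (m) and orientation (rad).
        "goal_position_error": rng.normal(0.0, 0.001, size=3),
        "goal_orientation_error": rng.normal(0.0, np.deg2rad(1.0), size=3),
        # Desired insertion force for this episode (N).
        "desired_force": rng.uniform(2.0, 15.0),
    }

conditions = sample_episode_conditions()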

Experiment Setup
Experimental validation was performed in a simulated environment using the Gazebo simulator [56] version 9, as well as on real hardware using a Universal Robots UR3 e-series arm with a control frequency of up to 500 Hz. The robotic arm has a Force/Torque sensor mounted at its end-effector and a Robotiq Hand-e parallel gripper. In both environments, training of the RL agent was performed on a computer with an Intel i9-9900k CPU and an Nvidia RTX 2080 Super GPU. To control the robot agent, we use the Robot Operating System (ROS) [58] with the Universal Robot ROS Driver. The experimental environment on the real robot is shown in Figure 5.

Training
During the training phase, the agent's task is to insert a cuboid peg into a task board in the simulated environment. The agent is trained for 500,000 time steps, which on average takes about 5 hours to complete. During training, the environment is modified after each episode by randomizing one or several of the training conditions mentioned in Section 2.4.2. The range of values used for the randomization of the training conditions is shown in Table 1. The random goal position is selected from a defined set of possible insertion planes, as depicted in Figure 6.

Figure 6. Simulation environment. Overlay of randomizable goal positions.

Table 1. Randomized training conditions (e.g., initial position relative to the goal) and their value ranges.

After training in simulation, the learned policy is refined by retraining on the real robot for 3% of the simulation time steps, which takes about 20 minutes, to further account for the reality gap between the simulated physics dynamics and the real-world physics dynamics.

Evaluation
The learned policy is initially evaluated on the real robot with a 3D printed version of the cuboid peg-in-hole insertion task with the true goal pose. During the evaluation, the observations and actions were recorded. Figure 7 shows the performance of the learned policy (sim2real+retrain). The figure shows the relative position of the end-effector with respect to the goal position, the contact force, and the actions taken by the policy for each Cartesian direction normalized to the range [−1, 1], as described in Section 2.2.3. As shown in Fig. 1, the insertion direction is aligned with the y-axis of the robot's coordinate system. In Figure 7, we highlight three phases of the task: blue corresponds to the search phase in free space before contact with the surface; yellow is the search phase after initial contact with the environment; and green corresponds to the insertion phase. During the search phase, and particularly along the insertion direction (y-axis), we can clearly observe that the learned policy properly reacts to the contact with the environment by quickly adjusting the force control parameters. On top of that, during the insertion phase, the learned policy changes its strategy from just minimizing contact force to a mostly position control strategy to complete the insertion. This behavior is appropriate for this particular insertion task, as there is little resistance during the insertion phase, but it is not the desired behavior for other insertion tasks, as we discuss later in Section 3.4.1.

Figure 7. Performance of the learned policy (sim2real+retrain) on the 3D printed cuboid peg insertion task. The insertion direction is aligned with the y-axis of the robot's coordinate system. The relative distance from the robot's end-effector to the goal position and the contact force are displayed. Additionally, the 24 actions of the policy are displayed beside their corresponding axes.
Additionally, we compare the performance of the learned policy obtained as a combination of sim2real and refinement on the real robot versus just learning on the real robot or just directly transferring the learned policy from simulation (sim2real) without further training. We evaluate these policies on a 3D printed version of the cuboid peg insertion task. The policies were tested 20 times from random initial positions, assuming a perfect estimation of the goal position (true goal). Table 2 shows the results of the evaluation. All three policies have a very high success rate; however, the policy transferred directly from simulation has difficulty with the real-world physics dynamics. As expected, the policy retrained from simulation gives the best overall completion time.

Generalization
Now, to evaluate the generalization capabilities of our proposed learning framework, we use a series of environments with varying conditions.

Table 2. Comparison of learning from scratch, straightforward sim2real, and sim2real + retraining (Ours). Tests performed on a 3D printed cuboid peg insertion task assuming knowledge of the true goal position.

Varying degrees of Uncertainty error
First, the learned policies are evaluated on the 3D printed cuboid peg insertion task where there is a degree of error in the estimation of the goal position. To clearly compare the performance of the different methods with different degrees of estimation error, we added an offset in position or orientation about the x-axis of the true goal pose. Nevertheless, for completeness, we also evaluate the policies on goal poses with a random offset of translation, [−1, 1] millimeters, and orientation, [−5°, 5°], in all directions. In each case, the policies were tested 20 times from random initial positions. Results are shown in Table 3.

Table 3. Comparison of learning from scratch, straightforward sim2real, and sim2real + retraining (Ours) with different degrees of uncertainty error of the goal pose (position offsets of 1-5 mm and orientation offsets). Tests performed on a 3D printed cuboid peg insertion task.
In all cases, the policy learned in simulation with domain randomization and fine-tuned on the real robot gives the best results. It is worth noting that if the difference between the physics dynamics in simulation and the real world is too big, learning from scratch can yield better results than only transferring the policy from simulation, as can be seen when the orientation uncertainty error is large (5°), where friction with the environment makes the task much harder; such contact dynamics are difficult to simulate.

Varying Environment's Stiffness
Second, the learned policy was also evaluated in environments of different stiffness. Figure 8 shows the 3 environments considered for evaluation. High stiffness is the default environment. Medium stiffness was achieved by using a rubber band to hold the cuboid peg between the gripper fingers, adding a degree of static compliance. In addition, for the low stiffness environment, a soft foam surface was added to further decrease the stiffness. The policies were evaluated from 20 different initial positions; the results are reported in Table 4.

Varying Insertion Tasks
Finally, we evaluate the learned policy on a series of novel insertion tasks, none of which were seen during training, to assess its generalization capabilities. These insertion tasks include challenges such as adapting to a very hard surface (high stiffness), requiring a minimum insertion force to perform the insertion, and a complex peg shape for the mating of the parts. The different insertion scenarios are depicted in Figure 9.

Table 4. Success rate of the 3D printed cuboid insertion task with different degrees of contact stiffness. For each task, the learned policy was executed 20 times from random initial positions, assuming a perfect estimation of the goal position.

Table 5 shows the success rate of the learned policy on these novel tasks along with the desired insertion force set for each task. As the insertion force is defined as an input of the policy, we can define a specific desired insertion force for each task. Even though the policy was only trained on the simpler cuboid peg insertion task, mainly in simulation and briefly refined on the real robot with a 3D printed version of the same task, the learned policy achieves a high success rate on novel and complex insertion tasks.

Compared to the cuboid peg insertion task, on these novel insertion tasks the peg is more likely to get stuck during the task's search phase, as the surrounding surface near the hole is not smooth and may have crevices. These extra challenges are not present during the training phase, which reduces the capability of the learned policy to react appropriately. The insertion task of the LAN port is the most challenging for the policy due to the complex shape of the LAN cable endpoint. If just one corner of the LAN adapter gets stuck, the insertion cannot be completed even if a large force is applied.
Additionally, we also tested the policy on different insertion planes for the electric outlet and the LAN port tasks. In both cases, the success rate is similar due to training with randomized insertion planes. However, the policy is slightly better at insertions on the y-axis plane because the retraining (on the real robot) was done only about this axis.

Ablation Studies
In this section, we evaluate the individual contribution of some of the components added to the proposed learning framework. First, the inclusion of transfer learning from simulation to the real robot was evaluated. We compare the learning performance of training the agent on the real robot from scratch versus starting from a policy learned in simulation. Training from scratch was performed for 50,000 steps, while retraining from simulation lasted 15,000 steps. Figure 10 shows the learning curve for both training sessions. Learning from scratch requires at least 50,000 steps to succeed at the task most of the time. In contrast, learning from the policy pretrained in simulation achieves the same performance in under 5,000 steps. It is worth noting that the policy from simulation still requires some training to fine-tune the controller to the real-world physics dynamics, which are difficult to simulate, as can be seen from the slow start and the drops in cumulative reward.

Figure 10. Comparison between learning from scratch and learning from a policy learned in simulation: learning curves for the 3D printed cuboid peg insertion task on the real robot with random initial positions.

Policy Architecture
We evaluate the contribution of the policy architecture introduced in our method (see Section 2.2.2) by comparing it to a policy with a simple neural network (NN) with 2 fully connected layers, as used in previous work [31]. We trained both policies on the cuboid peg insertion task in simulation and compared the learning performance. Figure 11 shows the learning curves of both policy architectures for a training session of 70,000 time steps. From the figure, it is clear that with our newly proposed TCN-based policy, the agent is able to learn faster and obtain higher rewards. The TCN-based policy learns a successful policy (at about 25,000 steps) roughly 15,000 steps earlier than the simple NN-based policy (about 40,000 steps). Additionally, the TCN-based policy converges to a higher cumulative reward than the simple NN-based policy.

Policy Inputs
Finally, we evaluate the choice of inputs for the policy. We compare our proposed policy architecture with all inputs, as defined in Section 2.2.2, against two variants. First, we consider the policy without the inclusion of the previous action, a_{t−1}. Second, we consider the policy without knowledge of the desired insertion force, F_g. The training environment is the cuboid peg insertion in simulation with random initial positions and a random desired insertion force. In the case of the policy that does not have F_g as input, the cost function still accounts for the desired insertion force. Figure 12 shows the comparison of learning curves. Most notable is the poor performance of the policy that lacks knowledge of the previous action, a_{t−1}. The previous-action information is critical for the agent to converge more quickly to an optimal policy. Additionally, knowledge of F_g enables the agent to find policies that yield higher cumulative rewards, as well as to learn faster.

Discussion
We have proposed a learning framework for position-controlled robot manipulators to solve contact-rich manipulation tasks. The proposed method allows learning low-level, high-dimensional control policies on real robotic systems. The effectiveness of the learned policies is shown through an extensive experimental study. We show that the learned policies have a high success rate at performing the insertion task under the assumption of a perfect estimation of the goal position. The policy correctly learns the nominal trajectory and the appropriate force control parameters to succeed at the task. On top of that, the policy achieves a high success rate under varying conditions of the environment in terms of uncertainty of the goal position, stiffness of the environment, and novel insertion tasks.
While in this work the model-free reinforcement learning algorithm SAC was used, the proposed framework can easily be adapted to other RL algorithms. The choice of SAC is due to its sample efficiency as an off-policy algorithm. The pros and cons of using other learning algorithms would be interesting future work.
One limitation of our learning framework is the selection of the force control parameter ranges (see Section 2.2.3). Choosing a wide range of values may allow the policy to adapt to very different environments, but it also increases the difficulty of learning a task, as small variations in the action may cause undesired behaviors, as is the case during the first 20,000 to 30,000 steps of training (see Figure 11). On the other hand, a narrow range would make it easier and faster to learn the task, but it may not generalize well to different environments. Defining a range is much easier than manually finding the optimal parameters for each task, but it is still a manual process. Therefore, another interesting direction for future work would be to use demonstrations to learn a rough estimate of the optimal force parameters to further reduce training time.