Reinforcement Learning for Pick and Place Operations in Robotics: A Survey

: The ﬁeld of robotics has been rapidly developing in recent years, and the work related to training robotic agents with reinforcement learning has been a major focus of research. This survey reviews the application of reinforcement learning for pick-and-place operations, a task that a logistics robot can be trained to complete without support from a robotics engineer. To introduce this topic, we ﬁrst review the fundamentals of reinforcement learning and various methods of policy optimization, such as value iteration and policy search. Next, factors which have an impact on the pick-and-place task, such as reward shaping, imitation learning, pose estimation, and simulation environment are examined. Following the review of the fundamentals and key factors for reinforcement learning, we present an extensive review of all methods implemented by researchers in the ﬁeld to date. The strengths and weaknesses of each method from literature are discussed, and details about the contribution of each manuscript to the ﬁeld are reviewed. The concluding critical discussion of the available literature, and the summary of open problems indicates that experiment validation, model generalization, and grasp pose selection are topics that require additional research.


Introduction
Robotic arms have been used in industry for years to automate repetitive, strenuous, and complex tasks in which speed and precision are critical. The difficulty with the conventional control of robotic arms is that they require individual programming on a taskby-task basis with no margin for error. As a result, task programming requires extensive effort by a robotics engineer or tedious online manipulation by skilled robot operators with the control pendant [1][2][3].

Robotics Background
With traditional robotic control strategies, motion planning is founded on forward and inverse kinematics. The forward kinematics combine the direct measurement of all joint orientations with the lengths of linkages connected to the joints to determine the end-effector position. Similarly, inverse kinematics establish the values of joint positions required to place the end effector at a new location with a new orientation. The Jacobian of the forward kinematics and the inverse Jacobian for the inverse kinematics are used to determine the velocity of the end effector at different points over the motion path [4,5].
Typically, an engineer does not directly develop the code to apply the forward and inverse kinematics to complete each robotic task. Instead, graphical or text-based programming environments implement interpreters and compilers to translate basic commands or position controls to the robot in its native language [3,6]. The programming environment is determined by the interfaces provided by the supplier and the choice between online and offline programming [7]. Online programming is traditionally used in industry to allow skilled operators to teach basic motion trajectories and operations to the robot using teach-pendant control interfaces or manual robotic arm positioning. There are many difficulties associated with online programming, such as robot downtime, range in pendant interfaces, and lack of dynamic control to handle changes in the environment. Online programming is only usable for basic tasks with high repeatability. Reprogramming is required for any minor variances in the workspace, limiting the applicability for this type of programming [7,8].
Offline programming is the alternative control technique. This approach involves skilled programmers interacting with graphical program environments and/or text-based programming languages to develop control strategies for the robotic arm. Offline programming has the advantages of reducing downtime (and the associated downtime cost) and allowing the implementation of advanced control strategies [3,7,8]. The disadvantages of offline programming include the requirement for programmers to have experience with generic or controller specific languages, and/or the graphical model based control environment selected by the manufacturer [3,7].
Due to the difficulties with both the online and offline approaches, robotic implementation in small and medium-sized labor-dependent enterprises in industry has been slow. The earliest forms of electric robotic arms driven with forward and inverse kinematics were developed in the early 1970s [9]. Nearly 50 years since this development, only 33.96% of small and medium firms have implemented any form of robotics, a value which is 35.64% less than large firms (statistic taken in 2015) [10]. Automation improves productivity and reduces labor costs [10], which puts small and medium-sized enterprises at a significant disadvantage. The barrier for small companies to incorporate robotics is the high cost of implementing and reprogramming robots and the inability of robots to handle environments that change dynamically [11].
To reduce difficulty with the implementation of robotics in factory settings, reinforcement learning (RL) is becoming a popular alternative to task-specific programming. The goal of RL in robotics is to enable the agent (the robotic arm) to complete basic operations without direct command programming, and to handle changes in the task without requiring reprogramming. The logistics action of pick-and-place is an obvious example of a task which a robotic agent should be able to perform with RL.
To complete pick-and-place, a robotic arm must select a target object from a cluttered or isolated location and place the object at a specified position with a particular orientation ( Figure 1) This generic task is applicable for various sorting applications which traditionally require human labor. Of the papers published on the topic of pick-and-place with RL (Section 8), implementations differ drastically. To date, no single approach has been selected as the industry standard.
Robotics 2021, 10, x FOR PEER REVIEW 2 of 28 environment is determined by the interfaces provided by the supplier and the choice between online and offline programming [7]. Online programming is traditionally used in industry to allow skilled operators to teach basic motion trajectories and operations to the robot using teach-pendant control interfaces or manual robotic arm positioning. There are many difficulties associated with online programming, such as robot downtime, range in pendant interfaces, and lack of dynamic control to handle changes in the environment. Online programming is only usable for basic tasks with high repeatability. Reprogramming is required for any minor variances in the workspace, limiting the applicability for this type of programming [7,8].
Offline programming is the alternative control technique. This approach involves skilled programmers interacting with graphical program environments and/or text-based programming languages to develop control strategies for the robotic arm. Offline programming has the advantages of reducing downtime (and the associated downtime cost) and allowing the implementation of advanced control strategies [3,7,8]. The disadvantages of offline programming include the requirement for programmers to have experience with generic or controller specific languages, and/or the graphical model based control environment selected by the manufacturer [3,7].
Due to the difficulties with both the online and offline approaches, robotic implementation in small and medium-sized labor-dependent enterprises in industry has been slow. The earliest forms of electric robotic arms driven with forward and inverse kinematics were developed in the early 1970s [9]. Nearly 50 years since this development, only 33.96% of small and medium firms have implemented any form of robotics, a value which is 35.64% less than large firms (statistic taken in 2015) [10]. Automation improves productivity and reduces labor costs [10], which puts small and medium-sized enterprises at a significant disadvantage. The barrier for small companies to incorporate robotics is the high cost of implementing and reprogramming robots and the inability of robots to handle environments that change dynamically [11].
To reduce difficulty with the implementation of robotics in factory settings, reinforcement learning (RL) is becoming a popular alternative to task-specific programming. The goal of RL in robotics is to enable the agent (the robotic arm) to complete basic operations without direct command programming, and to handle changes in the task without requiring reprogramming. The logistics action of pick-and-place is an obvious example of a task which a robotic agent should be able to perform with RL.
To complete pick-and-place, a robotic arm must select a target object from a cluttered or isolated location and place the object at a specified position with a particular orientation ( Figure 1) This generic task is applicable for various sorting applications which traditionally require human labor. Of the papers published on the topic of pick-and-place with RL (Section 8), implementations differ drastically. To date, no single approach has been selected as the industry standard.

Related Work
A broad-scale generalized strategy for implementing pick-and-place with RL in robotics is needed before this technique can be implemented at scale. Mohammed and Chua [12], Liu et al. [13], and Tai et al. [14] have all written review papers focused on RL in robotics; however, these papers have a broad range of focus in terms of robotic agents used and the task completed.

Scope of Review
This survey provides a holistic review of RL in robotics as it relates to the pick-andplace task. The fundamental formulation of RL is reviewed, and the elements unique to the robotics application are analyzed in detail. A comprehensive list of papers published on the topic are presented, and a critical analysis on the state of research is conducted. The section breakdown is as follows.
Section 2 reviews the fundamental framework for RL, as it can be found in the Markov decision process. After introducing the framework, this section examines the aspects of the robotics application that makes the problem difficult to solve. Section 3 discusses the applicable value function and policy search approaches to policy optimization. The mathematical formulation for each approach is presented and explained. Section 4 investigates the intuition behind reward shaping and highlight some critical techniques developed in the past 30 years. Section 5 explains inverse reinforcement learning and its value in determining complex underlying reward functions. Section 6 explores the intuition behind pose estimation, discussing both gripper and target object pose. Section 7 assesses the choices for simulation environment and reviews how this choice affects the end testing accuracy for robotic pick-and-place. Section 8 performs an extensive audit of the available publications related to RL pick-and-place, with the goal of providing direction for individuals working on this problem in the future. Section 9 includes a brief discussion on open problems.

RL Formulation
The goal for completing a pick-and-place operation without task-specific programming is to allow the agent to independently perform actions in the environment, then learn the optimal strategy for its future tasks based on its experiences [15]. The learning process is analogous to a child learning to walk. A child must first attempt the motions of moving limbs, rolling over, and crawling, before it can complete the target action of walking. The child's behavior is driven by rewards from its environment after the completion of each task. Positive affirmation from the child's parents is an example of an environmental reward which would drive the walking behavior.

Markov Decision Process
The Markov decision process (MDP) is the underlying formulation for learning with these reinforcement iterations [15]. The MDP encourages specific behaviors and discourages others by feeding rewards or punishments to the agent based on the agent's action in the environment. The agent selects an action based on its policy function for available actions a ∈ A(s) in each state s ∈ S. In the deterministic case, the same action a is always taken in state s, and the policy is denoted as π(s). For an agent operating in a stochastic environment, the policy defines the probabilities of selecting each possible action given a state. The policy for the stochastic case is denoted as π(a|s) [15]. The primary goal for RL is to find the optimal policy π * (a|s). The optimal policy can be found through the implementation of the value function approach [16] or directly through a policy search (both discussed further in Section 3) [17].
The goal of the policy function is to optimize the expected return G, the (weighted) sum of all discounted returns before the final time step. At time step t, the expected return is expressed as [15].
where T is the total time step, R k is the reward for each step, and γ is the discounting parameter, a number between 0 and 1, which discounts future rewards. Figure 2 depicts the standard MDP for a robotic arm.
where is the total time step, is the reward for each step, and is the discounting parameter, a number between 0 and 1, which discounts future rewards. Figure 2 depicts the standard MDP for a robotic arm. Again, the intuition behind the MDP can be understood with the analogy of a child learning to crawl. The policy function is analogous to the child's thought process before attempting an action. The child's policy is dependent on the reward it receives for completing the action and is the most significant factor which drives learning.

RL for Pick-and-Place in Robotics
RL formulated around the basic MDP appears to be intuitive and straightforward; however, robotic training has specific nuances. The primary impediments include the high dimensionality, continuous action space, and extensive training time.
Most robotic arms implement 6 degrees of freedom as this is the minimum freedom required to reach any position in space at any angle [18]. The freedom of movement makes the task complex, as all pick-and-place actions require the control of the torque and velocity of each joint of the agent. Additionally, the robot operates in a complex environment with a continuous operational space. Rather than having discrete state choices, the agent in this problem has infinitely many states to explore. The method of limiting the number of actions and states to a tractable set becomes a choice that affects the training time, test accuracy, and repeatability [19].
Many RL algorithms have been developed for the application with simple agents and environments. OpenAI (https://gym.openai.com/, accessed on 15 August 2021) has attempted to standardize the benchmarking environments for RL by creating open access simulations for testing control strategies. Cartpole is an example of a classic control problem found on OpenAI, in which the agent can make a command (left or right) and receive a low dimensional feedback signal (position and angular velocity) [20]. Compared to this simple benchmark, robotic arm manipulation is foundationally more difficult. The high dimensionality of robotic arms drives the need for complex reward functions to promote smooth motions and minimize training time [21]. Even with complex reward functions, Again, the intuition behind the MDP can be understood with the analogy of a child learning to crawl. The policy function is analogous to the child's thought process before attempting an action. The child's policy is dependent on the reward it receives for completing the action and is the most significant factor which drives learning.

RL for Pick-and-Place in Robotics
RL formulated around the basic MDP appears to be intuitive and straightforward; however, robotic training has specific nuances. The primary impediments include the high dimensionality, continuous action space, and extensive training time.
Most robotic arms implement 6 degrees of freedom as this is the minimum freedom required to reach any position in space at any angle [18]. The freedom of movement makes the task complex, as all pick-and-place actions require the control of the torque and velocity of each joint of the agent. Additionally, the robot operates in a complex environment with a continuous operational space. Rather than having discrete state choices, the agent in this problem has infinitely many states to explore. The method of limiting the number of actions and states to a tractable set becomes a choice that affects the training time, test accuracy, and repeatability [19].
Many RL algorithms have been developed for the application with simple agents and environments. OpenAI (https://gym.openai.com/, accessed on 15 August 2021) has attempted to standardize the benchmarking environments for RL by creating open access simulations for testing control strategies. Cartpole is an example of a classic control problem found on OpenAI, in which the agent can make a command (left or right) and receive a low dimensional feedback signal (position and angular velocity) [20]. Compared to this simple benchmark, robotic arm manipulation is foundationally more difficult. The high dimensionality of robotic arms drives the need for complex reward functions to promote smooth motions and minimize training time [21]. Even with complex reward functions, the type of object picked, and the pose (orientation and position) of the target object affects the probability of task completion [22].
Due to the high dimensionality and continuous action space, the agent requires extensive training to learn the optimal policy. Online training is costly because of the required human supervision, robotic wear, danger of damaging the robot, and lack of available training robots. The costs associated with online training drives the need for simulation training [16]; however, simulation training often has a low test accuracy. The fundamental problem with simulation training is that the knowledge trained in the simulation must be transferred to the real world. Any inaccuracies in modeling physical parameters such as shapes and weights of objects handled, lighting, or friction coefficients will cause test accuracies to be lower than training accuracies [23].

Policy Optimization
Finding the optimal policy for a RL problem involves value iteration or policy search. With both policy optimization techniques, the solution process changes depending on if the approach is model-based or model-free. Additionally, the exploration-exploitation tradeoff significantly influences the speed of solution convergence.
The model-based technique implements an understanding of the environment through prior learning or through state-space search to create an approximation of the transition probabilities between states given actions p(s |s, a)) (also commonly formulated as T(s |a, s)). The model-free (direct) design has no explicit representation of the system dynamics. Instead, the model-free technique learns the value function directly from the rewards received during training [15,24,25]. For RL in robotics, most approaches are modelfree, since creating a perfect model of all transition probabilities in a continuous space is intractable.
Finding the policy for the agent always involves a tradeoff between searching random actions to find the optimal policy and greedily selecting the optimal action to improve rewards. In RL, ε is the variable used to represent the probability of not taking the rewardmaximizing action during exploration [15]. To ensure that the whole action space is explored, the agent traditionally implements some form of on-or off-policy learning. Onpolicy learning improves the current policy marginally by selecting the optimal action for the majority (1 − ε) of actions and selecting a random action with a probability (ε) for the rest. Off-policy learning applies two policies; the target policy, which is the policy being learned about, and the behavior policy, which is the policy that drives behavior for learning [15].
In the following sections, the value function and policy search approaches are explained, and different search techniques inside each methodology are reviewed [25].

Value Function Approach
The value function approach is based on dynamic programming (DP). DP is a strategy of breaking up a complex problem into a sequence of successive sub-problems which can be iteratively solved. Implementing DP for RL requires an understanding of value functions.

Value Functions
Value functions are founded on the understanding that each state has an associated value dependent on the reward achieved in that state and that state's potential to be a step in achieving future rewards with the agent's current policy. A reward is often only received in a specific goal state, so positions proximal to the goal state have the highest value. The value function is found by breaking down the problem into successive sub-problems following the sequence of states. The Bellman equations in Equations (2) and (3) are established to solve for action-value function q π , and state value function υ π respectively [15].
In stochastic reinforcement learning, randomness in the system implies that taking an action a from state s does not always result in the same final state s . The transition dynamics (or transition probability) of the environment p(s , r|s, a) is used to represent the probability of reaching state s and receiving a reward r given that action a is taken in state s [15]. Using this formulation, the action-value function q π (s, a) given in Equation (2) can be understood as being equal to the summation of the transition probabilities of reaching all possible states (s ) multiplied by the value (r + γυ π (s )) of each possible state (s ).
The state-value function given in Equation (3) can be understood similarly. The primary difference between the two formulations is that the state-value function (Equation (3)) incorporates action selection. The probability of selecting action a is dependent on the stochastic policy π(a|s). Based on this understanding, the value of a given state is equal to the summation over all actions of the probability of selecting a particular action multiplied by the state value function q π (s, a).
In reinforcement learning, state-and action-value functions are implemented for finding the optimal policy with Bellman's optimality equations. The optimal state value function υ * and optimal action-value function q * are found by maximizing the probability of choosing the actions which lead to states yielding values larger or equal to those from all other policies [15].
For finite discrete actions and states, lookup tables can be used to pick appropriate actions for each state to form the optimal policy. For pick-and-place operations in robotics, determining Bellman's optimality equations is difficult because the space of all states is continuous. Typically the Monte Carlo (MC) or temporal difference (TD) approaches [16] are used. Both approaches are founded on dynamic programming (DP).

Dynamic
Programming DP is a model-based strategy that creates internal models for transition probabilities and rewards to calculate the value function. A primary assumption of DP is that the models perfectly represent reality, which is unrealistic for continuous spaces. Even though it cannot be directly applied for RL in robotics, DP is the framework for understanding the MC and TD techniques [15]. DP consists of two fundamental approaches: policy iteration and value iteration.
Policy iteration is a DP approach that consists of policy evaluation and policy improvement subtasks [26]. Policy evaluation is used to update the value of each state based on the current policy, transition probabilities, and the value of the following states [15].
Policy evaluation is completed by calculating the υ k (s) (Equation (6)) for all states with an arbitrarily initiated policy. After an infinite number of exploratory steps ( k → ∞ ), the transition probabilities between states given actions p(s , r|s, a) and the state-values υ k (s) can be determined. Policy improvement then greedily selects the best action that can be taken in each state based on these newly found transition probabilities and state values [15,16]. Policy improvement is solved by implementing q π (s, a) as shown [15] π (s) where argmax over actions involves taking the action which maximizes the value for each state. The new choice of action indicates the agent's policy in the next round of policy iteration. The process of policy iteration in DP consists of iterating between policy evaluation and policy improvement until the policy stops changing. At this point, the policy is assumed to be optimal [26].
Value iteration is an alternative to policy iteration, which can be used effectively by combining policy evaluation and improvement into one step. Value iteration is formulated as [15] υ k+1 (s) s ,r p s , r s, a r + γυ k s (8) with this formulation, the policy evaluation step is stopped after one iteration, and the policy for each state can be updated dynamically.

Model-Free Techniques
Techniques that are not model based can incorporate the DP strategy by using sample transitions and rewards to learn approximate value functions. Learning the value functions without complete probability distributions for all transitions requires exploration to ensure all states are visited, and non-greedy policies are investigated.
The Monte Carlo (MC) averaging returns approach is a model-free technique that requires no prior knowledge. The model only requires experience through interaction with the environment. MC completes various actions in each state, and the average returns for each action are used as the expected action reward. After many iterations, the solution will converge to the actual underlying value for each action [16].
MC approaches implement a combined on-and off-policy technique. The state values for MC are determined during off-policy environmental interaction where sample trajectories are explored. After the state values are determined, the policy is found during the on-policy policy improvement step of policy iteration. During this step, the greedy policy is found by taking the value-maximizing actions for the values learned during off-policy exploration [15].
The Temporal Difference (TD) approach is equivalent to an incremental implementation of MC. TD updates the estimates of the state-value function based on individual experiences without waiting for the outcome of the entire sample trajectory as opposed to MC. TD uses the difference between the old estimate of the value function and the experienced reward and new state to update the value function after one step, as shown in [15,25,26]: Here, α is the learning rate, a real number between 0 and 1 (usually 0.1 or smaller). This form of TD is referred to as the TD(0) algorithm.
State Action State Reward (SARSA) is a TD model which is used to update the actionvalue function (Q(s, a)) [16]. SARSA implements on-policy TD control, implying that the action value is updated after each exploration step [15,25].
Operating on-policy means that the action space is only explored due to the ε − greedy aspect of on-policy search [15]. Due to its lack of off-policy exploration, SARSA does not have the same convergence guarantees as to the MC approach.
Q-learning is a method which is very popular in reinforcement learning research [15,25,26]. Q-learning is the off-policy equivalent of SARSA, formulated as: [15] The use of off-policy TD allows Q-learning to perform updates to the value function regardless of the action selected [25]. This choice guarantees the convergence of Q(s t , a t ) if the learning rate α is well defined [15].
Value function approaches are intuitive; however, they often struggle to converge due to the high dimensionality of robotic state space, the continuous nature of the problem, and the bootstrapped approaches used to estimate the value function quickly. Another issue with the value function approach is that random exploration of the agent can result in mechanical damage due to range and torque limitations [27].

Policy Search Approach
Policy search (PS) is an approach to policy optimization which has recently shown potential in robotics as an alternative to the value function approach [16]. PS is a valuable tool in robotics due to the scalability with dimensionality, a vital issue with the value function approach [28]. Research has shown that PS is better than value function approaches for tasks that are finite horizon and are not repeated continuously [27].
PS does not approximate a policy as the value function does, i.e., by learning values of actions, then acting to optimize rewards based on these expected values. Instead, policy search implements a parameterized policy that can select the optimal action without reviewing the value function [15]. θ, the variable representing the policy parameters, could be understood intuitively as weights of the deep neural network that influences the policy for a value function [15,27] The formulation for the linear case is [27] where Φ(x) is the basis function representing the state, and the policy parameters θ determine the choice of action in each state [27]. PS has the goal of learning the policy parameters that optimizes the performance measure J(θ), the expected cumulated discounted reward from a given state [27]. This reward is formulated as [15] J(θ) wherein v π θ is the true value function for policy π θ .
To maximize the cumulative reward, the optimal policy can be found using the policy gradient method [15], which implements gradient ascent by where µ(s) is the on-policy probability distributions for policy π.
Like the value function approach, the policy search approach can be implemented with model-free and model-based techniques [16]. The model-free technique uses online agent actions to create trajectories from which to learn. This technique has the advantage of not needing to learn an accurate forward model, which is often more challenging to learn than the policy itself. The model-based technique is more sample conservative, as the agent creates internal simulations of the dynamics of the model based on a few observations from the online model. The agent then learns the policy based on these internal simulations [27]. The online model-free technique has been implemented more in research due to its ease of use; however, model-based approaches have shown better ability to generalize to unforeseen samples [27].
Inside of policy search, several methods have been developed and implemented in robotics, including REINFORCE, Actor-Critic, Deterministic Policy Gradient, and Proximal Policy Optimization.

REINFORCE
REINFORCE is a MC based policy gradient method that uses vector samples (episodes) filled with state, action, reward triples found with the policy π θ to approximate G, the return as defined in Equation (1). The algorithm updates the policy parameters incrementally with [15] θ ← θ + α γ t G ∇ ln π θ (a t |s t , θ) where ∇ ln π θ is equal to [15] ∇ ln π θ = ∇π(a t |s t , θ) π(a t |s t , θ) The intuition behind this formulation is that the policy parameter vector θ is shifted in the direction of the steepest ascent with positive return G, to improve the policy π θ after each step.
REINFORCE is a valuable policy search method that creates a state value function and has guaranteed convergence; however, it tends to converge slowly compared to other bootstrapped methods.

Actor-Critic Methods
The one-step advantage actor-critic (A2C) is analogous to the TD methods introduced above, which do not involve offline policy improvement steps. A2C implements the notion of δ t , the difference between the expected value and actual value for a given state and state-value weighting w [15].
The difference parameter can be used each step to update the state-value weighting w and the policy parameter θ with the equation [15].
The calculation for the policy and difference parameter is completed after each incremental exploration step in the search space.
Another common actor-critic approach seen in the literature is asynchronous advantage actor-critic (A3C) [29]. A3C has the same formulation as A2C, with the slight difference of allowing multiple agents to asynchronously explore different policies in parallel. This approach reduces the need for experience replay to stabilize learning in time and improve stability.

Deterministic Policy Gradient
Another common implementation is the deterministic policy gradient (DPG) technique [30]. DPG implements Q-learning updates to determine the action value function as shown [30].
In this approach, DPG moves the policy in the direction of the gradient of Q for continuous spaces, rather than selecting the globally optimal Q value in each step [30].
Here Q w (s t , a t ) is a differentiable action value function which is used to replace the true action value function.

Proximal Policy Optimization
Proximal policy optimization (PPO) is a policy gradient technique which was designed to provide faster policy updates then A2C or DPG. PPO applies the DPG structure, but updates the policy parameter θ based on a simple surrogate objective function [31]. The policy update includes a step of minimizing the penalty term associated with the difference between the surrogate and the original function.

Summary
The methods reviewed are fundamental to the different implementations of RL in robotics. The choice in policy optimization has a significant impact on the convergence guarantees and training time required.

Reward Shaping
In RL, the allocation of rewards drives the training process because the agent aims to maximize the reward signal [32]. Reward shaping is the technique that involves modifying the reward function to help the agent learn faster [33]. An efficient reward function can accelerate the learning process by introducing background knowledge to RL agents [34].
Traditional RL applications used monolithic goals exclusively [35]. Monolithic goals seemed to behave perfectly with well-behaved domains with clear reward objectives. However, as RL problems started to become more complex in the 1990s, this approach was found to be inadequate for modeling real physical world situations (such as robotics). The difficulties in recognizing the state information, nondeterministic and noisy environment, variances in the learning trails required, timeliness of the rewards, and the requirement of achieving multiple goals can contribute to the failure of monolithic goals for these applications [35].
To address these challenges, as an improvement to a naïve monolithic single goal reward function, Mataric proposed a heterogeneous reinforcement function [35]. The method broke the rewards down into several goals representing the knowledge of the domains, then grouped these goals with progress estimators. Goal refinement enables the system to learn through immediate transitionary rewards which can speed up learning and convergence. The issue with this approach is that the agent can achieve a net positive gain in reward by running cycles rather than finding the end state [21,32].
To address this, Ng et al. [21] introduced a potential-based shaping algorithm. The algorithm incorporated a new reward function: R = R + F, where R is the traditional reward and F is the shaping reward function. The shaping reward function F is a function over states, which should be chosen or constructed [34] based on the domain knowledge between two states. For example, F could be a function of distance to the goal state, wherein F increases as the agent selects actions closer to the goal state. Ng et al. proved that the optimal policy would not change given this new reward function [21].
Adding rewards is advantageous when the rewards represent the actual effect of the actions, however, adding rewards can also cause errors in the learning process when noisy reward inputs are given. To find balance between efficiency and effectiveness in RL reward shaping, Luo et al. proposed a method called Dense2Sparse [36]. By letting the system receive rewards continuously in the first several learning episodes, the policy is expected to converge quickly despite the noisy information. The suboptimal policy learned from the dense reward can be optimized under a noise-free sparse rewarding scheme, where a positive reward is given only when the task or sub-task is completed. Jang et al. found the concept of combining sparse and dense reward schemes to be effective with their research on training a Walker robot with high dimensional continuous action and state spaces [37].
Besides using a predefined change in the reward function, reward shaping can also be performed dynamically throughout the learning process. Tenorio-González et al. realized a dynamic reward shaping technique by implementing verbal feedback from a human observer [38]. The inclusion of human intervention can help the system converge faster but requires active human participation in the learning process. Konidaris and Barto [39] developed a method to improve the reward function autonomously. Their work focused on generalizing the knowledge learned by the agents for small tasks to produce a shaping function to apply to all rewards. Their experiment showed good performance of the shaped rewards in a rod positioning scenario.
Reward shaping is a tradeoff between accuracy and speed. A monolithic goal reward can be a noise-free input, but it is sometimes not practical for complex tasks. Breaking down the task and applying rewards frequently allows the system to learn faster, but it can also distract the agent [32,40].

Imitation (Apprenticeship) Learning
Reward shaping is one of the tasks which require human intervention in RL. With only a monolithic goal, the agent may be forced to search the environment for an extended period to refine its actions. An alternative path to manual reward shaping is imitation learning [15,41,42].
Traditional robotic control systems (discussed in Section 1.1) can have imitation behavior programmed into a robot via manual robotic arm positioning. This approach has the equivalent results to implementing the teach pendant without the need for joint programming. Manual arm positioning is valid but has all the same issues as traditional control, i.e., the inability to handle changes in the environment [7,8].
As an alternative to traditional control, imitation learning can be used to learn a reward function from examples. The reward function can be used to refine the policy depending on the scenario, allowing for domain adaptation [42]. Imitation (apprenticeship) learning includes several different approaches, such as Behavior Cloning (BC) and Inverse Reinforcement Learning (IRL). Like all other approaches to RL, imitation learning has the option to implement model-based and model-free techniques [42].

Behavior Cloning
BC is the simplest form of imitation learning, as it borders supervised learning (SL) and RL. The objective of this learning technique is to learn the policy which determined the trajectory of the expert demonstration. The trajectory variable τ = (s 0 , a 0 , s 1 , a 1 , . . .) is used as the functional output from the policy [42] τ = π * (s) (23) where s is the state of some robotic manipulator. The problem resembles supervised learning, where the objective of mimicking behavior is set as minimizing the loss between the policy parameters θ of the policy π θ (s) and the optimal policy based on expert demonstrations of π * (s). BC is easy to understand and execute, but since the problem is reduced to a supervised learning approach, error may occur when reaching a state which the expert has never visited. The policy may attempt to map a new environment with the same strategy used for an older environment, which can cause an error referred to as the covariate shift [43]. Due to undefined behavior in that state, the algorithm could lead to catastrophic problems, as was shown by Stephane Ross in his work on mismatch training and test inputs on training robotics [44].
The first application of behavior cloning was ALVINN, a project by Pomerleau in 1989 [45], where a neural network was used to estimate the policy for driving an autonomous vehicle. The inputs were sensor signals from vision cameras and laser range sensors, and the outputs were turning curvatures which translated into steering angles. Supervised learning was performed by using road snapshots and turn curvature pairs to train the network. The objective suggests that it minimizes the 1-step deviation error along the expert trajectory; therefore, it does not capture the intention of expert behavior in the long term, which reduces the accuracy of long-term planning.
Due to the approach operating in the SL domain, BC is more applicable to scenarios in which massive databases of expert data are available [43]. BC has been shown to be valuable in large data domains such as autonomous vehicles [46] but has not been shown to be applicable to robotics.

Inverse Reinforcement Learning
IRL is a way to find the underlying reward function which explains the agent's behavior (the policy). IRL is based on the known parameters in an MDP (s, a, p(s |s, a), γ, π * ) [41,42]. The goal in IRL is to choose R, which makes the policy π optimal, and makes any changes in policy π as costly as possible, so that the differences between two actions in any state is significant [41].
The notation for IRL is [ where the true reward function R * is a function of features φ, and the true weights w * . In robotics, φ could be used to represent the features: smooth motion profiles and speed. Different weights w * could represent the tradeoff between these two features.
One approach for implementing IRL is with the apprenticeship learning technique implemented by Abbeel and Ng [47]. This technique defines the value of a policy as: where E s 0 ∼p(s |s,a) [V π (s 0 )] represents the expectation for value given that the agent starts in state s 0 with some transition probability for the next state of p(s | s, a) [41]. To get the final formulation shown in Equation (25), the value function for the policy is replaced with the reward function defined in Equation (24). This formulation can be used to find the policy π by minimizing the difference between the policies π E and π while maximizing the minimum differences between trajectories. The formulation for this min-max problem can be expressed as [47].
The IRL learning process involves the loop shown in Figure 3, where the policy is initially randomized. First, Equation (26) is implemented to compare the policies of the expert and the agent and improve the reward weighting w T . Next, the updated reward weighting is used to calculate the state rewards (Equation (24)). Finally, the RL algorithm is used to compute the optimal policy.
Due to the approach operating in the SL domain, BC is more applicable to scenarios in which massive databases of expert data are available [43]. BC has been shown to be valuable in large data domains such as autonomous vehicles [46] but has not been shown to be applicable to robotics.

Inverse Reinforcement Learning
IRL is a way to find the underlying reward function which explains the agent's behavior (the policy). IRL is based on the known parameters in an MDP (s, a, ( | , ), , * ) [41,42]. The goal in IRL is to choose R, which makes the policy optimal, and makes any changes in policy as costly as possible, so that the differences between two actions in any state is significant [41].
The notation for IRL is [47] where the true reward function * is a function of features , and the true weights * . In robotics, could be used to represent the features: smooth motion profiles and speed. Different weights * could represent the tradeoff between these two features.
One approach for implementing IRL is with the apprenticeship learning technique implemented by Abbeel and Ng [47]. This technique defines the value of a policy as: where ~ | , [ ( )] represents the expectation for value given that the agent starts in state with some transition probability for the next state of ( | , ) [41]. To get the final formulation shown in Equation (25), the value function for the policy is replaced with the reward function defined in Equation (24). This formulation can be used to find the policy by minimizing the difference between the policies and while maximizing the minimum differences between trajectories. The formulation for this min-max problem can be expressed as [47] ϵ The IRL learning process involves the loop shown in Figure 3, where the policy is initially randomized. First, Equation (26) is implemented to compare the policies of the expert and the agent and improve the reward weighting . Next, the updated reward weighting is used to calculate the state rewards (Equation (24)). Finally, the RL algorithm is used to compute the optimal policy. The IRL solution is often ill-posed, so instead of forcing the system to perform exactly as the optimal behavior from the demonstrations, the constraints can be relaxed to help find a suitable solution. Abbeel and Ng [47] did so by relaxing the equality of the performance with max Algorithms based on IRL can be costly due to the inner loop, which contains internal RL calculations for each step. The reward function acquired with IRL does not tell the agent how to act, thus imitating the exact motion of the expert demonstrations still requires RL [43]. That said, IRL has shown significant potential in robotics applications, as will be shown in the Section 8.

Pose Estimation for Grasp Selection
Pose estimation is a critical factor for pick and handle operations. Incorrect estimation of pose could result in poor grasp choice due to the robotic arm pose or unstable placement due to incorrect estimation of object pose. Gualtieri and Platt [19] noted the value in pose estimation when comparing the accuracy of cup placement against blocks and bottles. Blocks are symmetric in x, y, and z, bottles are symmetric in x and y, and cups are symmetric only in x. The complexity of the cup pose made the pick-and-place task more difficult, which resulted in lower overall accuracy compared to blocks and bottles (Figure 4). Pose estimation is critical, as the task should be completable regardless of the complexity of the target object.
Algorithms based on IRL can be costly due to the inner loop, which contains internal RL calculations for each step. The reward function acquired with IRL does not tell the agent how to act, thus imitating the exact motion of the expert demonstrations still requires RL [43]. That said, IRL has shown significant potential in robotics applications, as will be shown in the Section 8.

Pose Estimation for Grasp Selection
Pose estimation is a critical factor for pick and handle operations. Incorrect estimation of pose could result in poor grasp choice due to the robotic arm pose or unstable placement due to incorrect estimation of object pose. Gualtieri and Platt [19] noted the value in pose estimation when comparing the accuracy of cup placement against blocks and bottles. Blocks are symmetric in x, y, and z, bottles are symmetric in x and y, and cups are symmetric only in x. The complexity of the cup pose made the pick-and-place task more difficult, which resulted in lower overall accuracy compared to blocks and bottles ( Figure  4). Pose estimation is critical, as the task should be completable regardless of the complexity of the target object. The literature refers to two pose types, the object pose and the gripper pose (also referred to as the grasps [48]). The object pose is described by the position and orientation of the object being manipulated [49,50]. Each object pose exists inside the set of all possible poses for that object, called the pose space [50]. The gripper pose is used to describe the different gripping positions, orientations, and jaw openings that could be used to grasp a target object [23]. Each gripper pose exists inside the set of all possible grasps for an object, called the grasp space. Pose estimation can be completed using analytic (geometric) or data-driven methods.
Analytic methods fit the point cloud of sensor input to known models (often CAD). Model fitting limits the agent to detecting the pose of items which have models available for comparison [51]. The analytic method first determines the target object pose, then selects an optimal gripper grasp based on the target object pose. The analytical method is usually set up as an optimization problem. The grasp is chosen from the grasp space based The literature refers to two pose types, the object pose and the gripper pose (also referred to as the grasps [48]). The object pose is described by the position and orientation of the object being manipulated [49,50]. Each object pose exists inside the set of all possible poses for that object, called the pose space [50]. The gripper pose is used to describe the different gripping positions, orientations, and jaw openings that could be used to grasp a target object [23]. Each gripper pose exists inside the set of all possible grasps for an object, called the grasp space. Pose estimation can be completed using analytic (geometric) or data-driven methods.
Analytic methods fit the point cloud of sensor input to known models (often CAD). Model fitting limits the agent to detecting the pose of items which have models available for comparison [51]. The analytic method first determines the target object pose, then selects an optimal gripper grasp based on the target object pose. The analytical method is usually set up as an optimization problem. The grasp is chosen from the grasp space based on maximizing the values for resistance to disturbance, dexterity for further manipulation, equilibrium to reduce object forces, and stability in the case of external forces or grasp errors [52,53]. The analytic method is useful when precise placement in a particular orientation is required [23].
Data-driven methods are free from the constraint of requiring models, as this approach implements machine learning to estimate optimal gripper poses directly [23]. There are two approaches inside of the data-driven method: the target-model-based approach and the model-free approach.
The target model-based approach implements the point cloud data from the sensors to estimate a model of the target object. The target model can be used to eliminate many of the available grasp pose options. Supervised learning (SL) can be used to label objects based on the training data [23,53]. To save time on data labeling for the SL problem, simulation environments can be implemented to create a database of randomly oriented and mixed samples for training the model. The data-driven target model-based approach is optimal when accurate placement and correct target orientation is critical to receive the reward or when the target objects must be selected from a cluttered scene [23].
The model-free approach removes the model estimation step and instead focuses on optimizing the grasp pose based on the entire range of available grasps. The model-free approach is composed of discriminative and generative techniques. The discriminative technique reviews all the available grasp pose candidates and trains a convolutional neural network (NN) to select the optimal grasp based on these [19]. The generative approach recommends a grasp configuration based on the orientation of the target. This approach is fundamentally the same as object detection, where the gripper pose is selected based on the orientation of the detected object. An example of this comes from Jiang et al. [54], who implemented orientation rectangles to provide an estimate for the ideal position and orientation of the gripper head. Since multiple grasp candidates are often available, a NN is used to train the model to select the best grasp. The model-free approach is most successful when dealing with new targets and soft placement restrictions [23].
The approaches listed work well in combination with other strategies. For example, the agent could be allowed to implement the temporary placement of the object, free from clutter, to allow a better estimation of pose. This approach would be useful when accurate placement is required [19]. An alternative method would be to implement multiple viewing angles by hoisting the item in front of external cameras. Multiple viewing angles could be used to identify the object or determine if a more optimal grasp pose is available [55]. Another practical approach for estimating pose is by implementing focus. Attention focus has been successfully implemented by Gualtieri and Platt [22] via iterative zooming to regions of interest. The use of focus would limit the number of available grasp poses to those seen in the focus region.

Simulation Environment
In the sections above, training in simulation was mentioned as a requirement for some approaches, and a useful tool for others. The benefit of training in a simulated environment is that the agent can perform thousands of training iterations in a short period, without any wear effects on the online device. Online training requires the availability of the robot and human supervision for safety reasons. Examples of the difficulties with training in real life can be seen from Levine and Koltun [28], who implemented 14 robotic manipulators over two months to collect 800,000 grasp attempts. The accuracy of the grasp operation after this training was 82.5 percent, a lower accuracy than has been achieved for comparable pick-and-place actions when training was completed in simulation.
The benefits of training offline are often contrasted with the loss of accuracy when the trained model is applied in the real world. Inaccuracies of model parameters such as friction, weight, or dimension, may have massive influences during model testing. The difference between the offline and online models is referred to as the "reality gap". The primary approaches for coping with the reality gap are domain randomization and domain adaptation [23].
Domain randomization is a method of training the agent in an inconsistent environment. The goal is to force the agent to learn to ignore "noise" and changes in the background. Noise added for domain randomization could include lighting, color, positions, textures, or other features that serve to confuse the scene. The goal of domain randomization is to make the online test scene appear as a training sample with different randomizations than previously experienced [56].
Domain adaptation is an approach that implements a form of generative adversarial networks (GANs) to generalize the test data onto the target domain. The process implements data from the test domain (online) to train the model in the training domain (offline) so that training in the simulation generalizes well to online testing. This approach still requires a small amount of labeled data from the test domain; however, much less data is required than online training [23,57].
The choice of simulation environment is another factor that affects test accuracy. An assortment of packages exist (MuJouCo, ODE, and Bullet), each of which have their pros and cons. Key factors include consistency, stability, numerical accuracy of simulation, and compatibility [58]. MuJouCo performs better than other simulators in terms of simulation accuracy, energy conservation, and grasp stability. ODE came in second for nearly all of these measurements. The primary difference between the two physics engines is that ODE is open source and compatible with a variety of simulators such as Gazebo and OpenRAVE [59]. In terms of support, documentation, and ease of use, MuJouCo was ranked the worst [58].

Analysis
The above sections review the primary choices that need to be made during problem formulation. This analysis section focuses on reviewing and critically discussing the current state of research.

State of Research-Complete Pick-and-Place Task
Benchmarks for pick-and-place operations include simulation accuracy, the number of online tests required to train the model, and most importantly, the accuracy of testing. The table below reviews papers published by leading RL robotics researchers to illustrate the strengths and weaknesses of the different implementations. Success rates in each paper are difficult to compare due to the differences in the tasks. The discussion outside of the table gives clarity for each implementation and explains the results.
Fu et al. [17] completed several tasks, with comparable difficulty to pick-and-place, including inserting a peg in a hole, stacking toy blocks, placing a ring on a peg, and more. The approach involved learning the system dynamics from prior examples, then completing manipulation tasks in one shot after the goal was defined. The approach did not directly use RL for learning individual tasks by searching through the state space, but instead used RL to learn the system dynamics for robotic control. The accuracy of these tasks was high; however, the approach did not incorporate a cluttered scene.
Gualtieri et al. [19,22] published several papers on completing the pick-and-place action in a cluttered scene with an assortment of objects. Before this work, Gualtieri et al. [51] first published a manuscript on pose estimation. The pose estimation research completed in [51] was applied to the entire pick-and-place operation in [19,22]. The group completed testing in simulation and online. The accuracy listed represents the online accuracy of grasping a novel object after being trained on objects of similar type. The group found that an item picked from a clutter of objects had significantly lower accuracy than selecting an object in isolation. The results indicate that items with complex geometry (cups and bottles) are significantly more difficult to grasp compared to targets with a simple geometry (square blocks). Compared to other tasks completed in the literature, the complexity of the cluttered pick-and-place task was high.
Popov et al. [60] completed the action of precision stacking of Lego blocks using distributed deep DPG (DDPG) with a robotic arm in simulation. The blocks were initially positioned in isolation in the same orientation, which significantly reduced the task difficulty. The team compared simple single reward and composite multi-step reward functions and proved that the single reward approach had the fastest convergence. The groups asynchronous implementation of DDPG proved that distributed training with asynchronous updates can improve convergence speed. The task had an average complexity.
Mahler and Goldberg [61] applied imitation learning to complete the task of selecting novel objects from a cluttered environment and placing them aside in an organized manner. The learning approach implemented the Dex-Net 2.0 database of point clouds and grasp poses to generate training samples for learning effective Grasp Quality Convolutional Neural Networks (GQ-CNN). The policy learned appropriate NN weights based on replicating the results from the demonstrated samples. Transfer learning was effectively used to apply synthetic training to real world pick-and-place. Compared to the other examples from literature, the complexity of this cluttered pick-and-place task was high.
Sehgal et al. [62] simulated a simple, single block pick-and-place task with the use of DDPG HER whose parameters were tuned with the genetic algorithm. The main contribution from this paper was to show that the GA could be used to minimize the number of training iterations required to achieve comparable convergence to that of DDPG HER without the GA. The overall task complexity for this action was low. Zuo et al. [63] implemented deterministic GAIL to complete a simple, single block pick-and-place task. The DGAIL approach implemented a NN discriminator to develop a reward function based on examples and a policy gradient method for learning ideal actions based on the discriminator. During testing, DGAIL was compared to stochastic GAIL and DDPG to prove the relevance of this approach. The approach yielded much better results than all other methods except DDPG, a more computationally expensive approach. The overall task complexity for this implementation was low.
Chen et al. [64] completed a picking operation where the target object was in a cluttered environment. The robotic arm could rearrange the cluttered scene if the target was invisible from the camera vantage. The approach was limited due to the model-based grasp pose technique, however, it showed excellent performance in finding the target object in the cluttered scenario. The placement operation was not tested but could have been easily implemented since models for the target object were available. The action of selecting the target from a cluttered scene had an average complexity.
Xiao et al. [65] achieved a high accuracy for the pick-and-place task in a cluttered environment with the Parameterized Action Partially Observable Monte-Carlo Planning (PA-POMCP). The system approximated the utility of available actions based on the current belief of the agent about the environment. The approach allowed the agent to change perspective or maneuver items in the scene to find an accurate estimation for the location of the target object. The approach showed significant performance for known items but was not tested on unknown items in new environments. The action sequence of rearranging the scene and selecting a target from the clutter, had a high complexity.
Liu et al. [66] designed a system to complete several robotic pick-and-place tasks, including stacking, sorting, and pin-in-hole placement. A major contribution of this paper was the work around reward shaping to solve complex problems. The approach assigned rewards based on the Euclidean distance between the Objects Target Configuration (OTC) and the Objects Current Configuration (OCM). The closer the configuration of blocks, pins, etc., to the optimal configuration, the more reward allocated to the sample. Additional rewards were allocated for final task completion. Comparison between this approach and a basic linear reward approach shows a significant improvement in convergence. In addition to reward shaping, this paper compared the MAPPO learning technique to traditional PPO and Actor-Critic (A3C) techniques. MAPPO improved performance against the other techniques by directly outputting manipulation signal to the agent. The overall task complexity for the various action sequences was high.
Mohammed et al. [67] simulated a robotic agent which completed the task of organizing 10 blocks dropped into a workspace. The training technique applied was Q-learning. The block geometries were not all the same, but the shapes were simple. The accuracies for the pick-and-place operations were only tested in simulation, and not applied to an online agent. Considering the average-high complexity of the task, the accuracy of the pick-and-place and the computational efficiencies are promising.
Li et al. [68] applied a robotic arm to perform tasks including reach, push, multistep push and pick-and-place using the Augmented Curiosity-Driven Experience Replay (ACDER) approach. The pick-and-place operation was applied to a single target object which was free from clutter. The novelty of the research includes a method of exploration in which rewards are allocated for the exploration of unfamiliar states (curiosity). Additionally, this approach initiated the robot in states not stored in the replay buffer to improve exploration efficiency. Through testing, this approach was proven to be several times more sample efficient than hindsight experience replay (HER). The overall task complexity for this action was low.
Pore and Aragon-Camarasa [69] proposed an algorithm for robotic pick-and-place on a simple block placed in isolation on a surface. The solution method involved decomposing the task into approach, grasp and retract segments. After perfect accuracy was achieved for each independently trained action, a high-level choreographer (actor-critic network) was trained to learn the policy for ordering the behaviors. The team compared a DDPG + HER approach to this A3C + subsumption architecture (SA) model. The end-to-end approach with DDPG + HER resulted in a maximum of 60% accuracy after 10,000 episodes. The A3C + SA model achieved 100% accuracy after 6000 training episodes. The relative comparison shows the potential of this SA model; however, the task had a low complexity and was never tested online with a real robot.
Al-Selwi et al. [70] simulated the pick-and-place task on a simple block placed in isolation on a surface using a DDPG approach combined with HER. The approach validated what has been shown in other papers [19], that target object complexity changes task effectiveness. The work presents little novelty in terms of policy optimization. The overall task complexity in this example was low.
Marzari et al. [71] presented simulations and online testing for pick-and-place task on a simple block placed in isolation on a surface by using a DDPG HER method. By implementing task decomposition, the model was able to achieve 100% accurate performance with simulation and online testing. The major contribution of this approach compared to [69] was that behavior cloning was not used for training the approach subtask, so sample operations were not required. Online testing indicated excellent generalizability for novel objects and presented a high sample efficiency compared to all other approaches. Although the overall task complexity was low, this approach shows excellent potential for future applications.
Anca and Studley [72] completed a simulated pick-and-place operation in which a simple block was picked by manipulating a robot in 2D with a twin delayed actor-critic network. The result demonstrated operational accuracy without requiring example simulations (unlike [69]), however, showed little advantage over approaches based on DDPG HER [71]. The task complexity with this action sequence was lower than all other samples.

State of Research-Pick-and-Place Subtasks
Due to the novelty of the field, Table 1 presents the (short) exhaustive list of all robotic pick-and-place with RL literature currently available. That said, there are other publications which present contribution to the field, that do not include key elements of the action sequence. Examples include the literature on grasp selection [73,74] or precision placement [75]. To provide a comprehensive survey, key examples for the pick-and-place subtasks drawn from the literature are listed in Table 2.
Finn et al. [75] developed an IRL algorithm that could be applied to various tasks in different environments to learn the policy and reward function. A task completed in simulation included peg insertion in a hole to a depth of 0.1m. The robotic model could complete this task with high precision after training with~30 examples. Of the real robotic tasks completed, the task of placing dishes in a rack was the most like pick-and-place. After 25-30 samples of human teaching, the dish placement action could be completed with perfect accuracy. This research gives an excellent indication of the power of IRL, however, to apply this algorithm to the full pick-and-place action sequence, further extension is required to include the pose estimation step for "picking".
Kalashnikov et al. [74] developed a custom Q-learning approach focused on generalizability and scalability for unfamiliar environments. The task required the picking of random objects from a dense clutter and raising the object above the clutter. The training was completed online to incorporate all properties which were unmodeled in the physics simulation. To apply this algorithm to the full pick-and-place action sequence, the research requires additional task programming to include precision placement. Pore and Aragon-Camarasa, 2020 [69] Hierarchical RL in which the task was broken into simplified multi-step behaviors with the subsumption architecture (SA Wu et al. [76] designed a custom robot which implements a 3 finger gripper to pick objects out of a clutter of 2-30 objects. The study demonstrated that attention improved the picking action in cluttered scenes by~20% and implementing a 3-finger gripper improved object selection by~45%. To apply this technique to the full pick-and-place operation, additional task programming is required for precision placement.
Deng et al. [77] designed a custom robotic hand to be used for picking objects out of a cluttered scene with a series of fingers and a suction rod. The mode of selection allowed for a simple convolutional NN structure to be used for selecting ideal grasp locations. During testing, grasp failure occurred if 3 consecutive grasp attempts were not completed sequentially. Task simplicity and low-test accuracy limits this approach for real world application.
Beltran-Hernandez et al. [78] performed the task of picking simple geometry items (cylinder, cube, and sphere) during simulation training. Accuracy for simulation testing on random objects (duck, nut, mechanical part) was high, especially considering the major differences between the trained models and the test samples. The approach was model based, so it could be easily extended to incorporate high precision placement.
Berscheid et al. [79] completed the robotic task of manipulating and picking a variety of objects in a clutter with the application of Q-Learning. Results indicate high picking effectiveness and generalizability. Training was completed online, without the use of a simulation model. The task definition did not include placement, but the picking action included a high level of difficulty. To apply this algorithm to the full pick-and-place action sequence, additional task programming is required to include precision placement.
Kim et al. [80] completed the task of picking up unknown target objects in a highly cluttered scene by using disentanglement of image input. The goal of this approach was to reduce the computational difficulty of performing a pick action. A key contribution of this paper is showing the value in creating a low dimensional state representation of the target object with the use of autoencoders. Additionally, this paper investigated the value of implementing disentanglement to allow simple scene recognition to guide behavior. The disentanglement approaches include attention, separation of internal (robot) and external (scene) information, and separation of position and appearance in the scene.
Many other papers exist, which can be applicable for subtasks of pick-and-place, however most of them could not be easily extended to the entire operation. For more samples of individual grasping techniques for the pick action, this author recommends the review of [12].

Critical Discussion
The literature discussed above shows that significant progress has been made towards developing a RL approach which can be applied to robotics, however there are several issues with each approach. Additional research is required for validating simulations, reducing the loss in accuracy from crossing the reality gap, testing task variations, and lowering the amount of human intervention required.
Only 40% of the literature analyzed above includes results for online testing. Most of the research only included data on testing the models in simulation, wherein the dynamics are completely dependent on accurate modeling. To validate that the various methods are accurate and implementable in real life, further testing is required.
Of the manuscripts examined, several had robotic testing accuracies which were significantly lower than simulated accuracies [17,68]. Deviation between simulated and testing accuracies indicate that problems with bridging the reality gap require further investigation. As previously noted, training robotic agents online requires extensive time and equipment availability [28]. For RL to be applicable for robotic pick-and-place in industry, simulations must be developed in a way that allows for knowledge learned in simulation to be accurately transfer to the real world.
The literature includes many examples where the task involved simple pick-and-place with blocks or spheres in an isolated scene [62,63,[68][69][70][71][72]. This type of task is rare in an industrial environment, as most tasks involve a cluttered scene with specific placement requirements. Additional real and simulated testing is required to validate the accuracy of these approaches for more realistic tasks.
The methodologies which achieved robotic test accuracies high enough to be considered for application in industry, such as [61,71] required extensive human intervention. Task decomposition significantly speeds up learning and improves overall accuracy, however this method requires human involvement. The research must be extended to include methods for automatic task decomposition, or to develop strategies for task decomposition which do not require significant effort.

Open Problems
RL with robotics is a young and growing field which has many open problems to consider. The key problems include the lack of generalizability, the fostering of curiosity, and pose selection.
Problem generalization is an open topic, which relates to the issue of applying learning from the training set to a broader spectrum of problems. Ref. [78] gave an excellent example of a technique that can scale a crude model to a large test set, however, this approach still requires online testing.
Another quandary to be solved is finding a method to teach the agent to learn independently with intrinsic motivation. Sutton and Barto describe this idea as "curiosity" [15]. The goal is to enable the agent to learn tasks that may be useful in the future on its own, in order that the robotic arm could be more universally applicable.
Grasp pose selection is another open problem which significantly affects the pick-andplace task. The literature presented samples of pose selection which implemented grasp search or model recognition. Grasp search approaches proved to be reasonably effective when combined with focus, iterative zooming, or scene rearrangement [55,77], however this approach had mixed results when trained end-to-end. Model-based approaches generically performed well, however this approach was limited to tasks involving familiar target objects.

Conclusions and Future Work
This literature survey has shown that RL has potential to replace traditional robotic control. Several implementations show promising results for enabling robotic arms to perform basic pick-and-place actions consistently with high accuracy. The existing literature is optimistic; however, more research must be completed for open problems such as problem generalization, independent learning and grasp pose search.
To improve task generalizability, training could be extended to include a large sample set. None of the research reviewed in this paper presented pick-and-place actions where the targets between each action were drastically different. Generalizing the training samples could make the pick-and-place task more widely applicable. The idea is conceptually equivalent to domain randomization in which the task is randomized rather than the environment.
Self-motivated learning could be developed for the robotic agents, by rewarding "play". The agent could be taught to extend their learning by completing a self-selected activity. The explorative "play" actions could be used to optimize other tasks that are currently completed in a specific way to maximize the reward function. Intrinsically motivated rewards could allow the agent to explore new actions for future use [15].
Grasp pose selection could be optimized by combining the model-based and modelfree techniques for pose estimation. The agent could implement a large bank of models for application with familiar target objects. In circumstances where no model matches the target object with a high accuracy, the agent could instead implement a model-free approach. This combined model-free and model-based approach could be developed to replicate the human approach of identification. Humans identify objects by mapping them