1. Introduction
Many household operations, such as opening drawers and doors, require manipulating an object in a physically constrained environment. A robotic system performing such operations must be guaranteed not to damage the object or the environment. Therefore, the robot needs to adjust its hand motion during execution based on the force imposed by the environment, i.e., the constraint force. This type of manipulation is called compliant manipulation [1]. The number of distinct manipulations in a household environment is unpredictable; thus, a controller that generalizes across such manipulations is needed to realize a household robot.
This study investigates how well a policy trained with a single environment and reward using reinforcement learning (RL) generalizes to various unseen manipulations. Although the RL-based approach [2,3,4,5,6] is more robust than classical controllers [7] to the uncertainty associated with the recognition of object information, such as pose, articulation, and shape, it requires a manually designed training environment and reward specific to each manipulation. Thus, it does not scale with the number of target manipulations. This issue arises because the approach handles each manipulation independently, so the resulting policy does not generalize to unseen manipulations.
Manipulations can be classified based on a physical constraint. In a previous study [8], a manipulation group is defined as a set of manipulations that share common admissible and inadmissible directions, along which the object can and cannot move, respectively. For example, drawer opening, plate sliding, and pole pulling belong to the same group because the object’s admissible motion directions are constrained by a linear guide. If an object tries to move in an inadmissible direction, the constraint exerts a force on the object. Thus, manipulations grouped by constraint also share a common characteristic of the constraint force. Since compliant manipulation operations are executed by leveraging this force, we design a single policy that generalizes to various unseen manipulations on the basis of the characteristic of the constraint force.
We propose the constraint-aware policy, which estimates the object’s admissible direction using the constraint force. We train the policy with a single environment and reward so that it generalizes to unseen manipulations in the constraint group (Figure 1 Right). The environment and reward are designed assuming the single-system condition (Figure 1 Left), in which the robot hand and the object move in unison and can be regarded as a composite body whose internal forces, such as frictional forces, cancel out. Thus, the policy can obtain the constraint force exerted on the object. The environment is a simplification of real-world manipulations that extracts the common characteristic of the physical constraint critical to compliant manipulation operations, which is the key to generalization. The assumption is realistic in practice because it can be satisfied through execution design, such as moving the hand slowly. Under the single-system condition, the estimation error of the admissible direction decreases as the magnitude of the constraint force decreases; thus, the reward is calculated using only this magnitude.
In this study, we design the policy for manipulation groups with either a prismatic or a revolute joint, which are representative constraints in the household environment. Under a prismatic joint, the object has one-degree-of-freedom translation; under a revolute joint, it has one-degree-of-freedom rotation. In addition to generalization within a group, we investigate transferability to a different group. Specifically, we consider transferring the policy for a prismatic joint to manipulations with a revolute joint. To reuse the policy, we discuss the common and uncommon aspects of a revolute joint compared to a prismatic joint.
The common aspect: Circular motion can be considered as a series of infinitesimal linear motions.
The uncommon aspect: The hand must rotate in conjunction with the object to achieve the single-system condition.
From the common aspect, we can apply the same constraint-aware policy so that the policy estimates the admissible direction in both groups, whereas, owing to the uncommon aspect, the hand should rotate at execution only in manipulations with a revolute joint. To decide whether the hand needs to rotate or not, the type of the physical constraint, such as a prismatic or revolute joint, should be known.
To identify the constraint type, we leverage Learning-from-Observation (LfO) [9,10]. LfO provides the robot with hints for a manipulation through a multimodal one-shot human demonstration, which includes a verbal instruction and hand movement. The instruction contains semantic information that enables a robot to infer the constraint type of the manipulated object. For example, the verbal instruction “open the refrigerator door” is associated with a revolute joint. At execution, the robot selects the policy corresponding to the obtained constraint type from policies prepared in advance. In this study, we use the LfO system to determine whether the physical constraint is a prismatic or revolute joint, and thereby decide whether the hand needs to rotate.
We conducted experiments in a simulator to investigate how well the trained constraint-aware policy generalizes to various unseen household manipulations, such as drawer opening, plate sliding, pole pulling, door opening, and handle rotating. We also compared its generalization with that of the classical controller [7], which is designed for the group with a prismatic joint. The results show that, unlike the classical controller, the constraint-aware policy can be executed in various manipulations. In addition, we evaluated the performance in the real world using the policy and the LfO system, and demonstrated that the policy can be applied on a physical robot without additional training.
Toward a robot system capable of performing a wide range of manipulations, it is important to design a generalized policy for each manipulation group. Given that household manipulations can be classified based on their common constraints [8], the key to generalized policies is to design an environment and reward focusing on a common characteristic within each constraint group. This study validated the concept of the constraint-aware policy for two fundamental physical constraints: the prismatic joint and the revolute joint. We believe this study is the first step towards realizing a generalized household robot.
The contributions of this study are as follows:
We proposed a constraint-aware policy which is trained using a single environment and reward and generalized to various unseen manipulations with a common physical constraint.
We designed a simple training environment and reward function based on the constraint for the training of the constraint-aware policy.
We demonstrated that unseen compliant manipulation operations can be executed on a physical robot using the constraint-aware policy and the LfO system.
The remainder of this paper is organized as follows.
Section 2 reviews related work and states the focus of this paper.
Section 3 introduces the constraint-aware policy.
Section 4 describes the details of LfO to apply the constraint-aware policy in practice.
Section 5 presents experiments for compliant manipulation using the constraint-aware policy in the simulation and reality.
Section 6 discusses the result of our experiment and an extension of our method to hardware-level reusability and other constraints.
Section 7 concludes this paper.
2. Related Work
In this study, we focus on designing a policy that is robust to uncertainty associated with the recognition of object pose, articulation, and shape. In addition, we aim to train a policy that generalizes to various unseen manipulations with a single environment and reward using RL. The representative approaches to compliant manipulation are the planning-based approach, the classical closed-loop controller, and RL. In this section, we briefly review these approaches.
Previous research has focused on designing policies for opening drawers and doors. The pioneering work on door opening is [11], where robot motion is planned based on a known door model. In an unstructured environment, the model is unknown, and two methods can be used: geometry estimation and a closed-loop online controller that minimizes force and torque. Several studies have been conducted on geometry estimation [12,13,14,15,16,17,18,19,20], where the articulation pose is estimated from visual input and a motion trajectory can be planned from the estimation result. However, the estimation accuracy is insufficient for compliant manipulation (e.g., ∼20° estimation error in the rotation axis orientation on real-world data [18]), which causes the planning-based approach to fail. To deal with such estimation errors, other studies [21,22] have devised robot mechanisms for compliance.
Closed-loop controllers, which can deal with uncertainty in geometry estimation, have been proposed in several studies [7,23,24,25,26]. In [23], an online controller was designed on the basis of a simple strategy in which the end-effector follows the path of least force. Several studies have proposed online controllers based on this strategy [7,24,25,26]. These online controllers use the magnitude of the force, which varies with the environment, and are therefore not robust to environmental changes. To address this issue, we propose a constraint-aware policy using RL that can deal with this uncertainty. Classical controllers also require manual parameter tuning. An adaptive controller can tune the parameters for a specific manipulation [27,28,29]; however, this adaptive tuning requires real-world interaction between the robot and the environment. In our setting, in which the object is constrained by the environment, a large force is directly applied to the robot and the object under an estimation error. Thus, it is dangerous to determine the parameters through real-world interaction, and the adaptive controller is not appropriate for this study. Using a learning-based approach for compliant manipulation mitigates the parameter tuning issue.
Several studies have applied RL to train a policy for compliant manipulation [2,3,4,5,6,30]. These studies designed policies by preparing an environment and reward for only a specific manipulation; for example, they prepare a door-opening environment and use the door angle as the reward. Urakami et al. proposed DoorGym, a training environment for generalizing the door-opening policy [5]. The trained policy generalizes to doors with various doorknobs, lighting conditions, and environmental settings, but it addresses only door opening. Therefore, the trained policy cannot be applied to other manipulations with the same constraint. Several RL studies focus on designing a generalized policy for many varieties of manipulations [31,32,33,34,35]. However, this approach requires time and effort to prepare environments for all target manipulations in order to collect a large amount of data. In addition, it achieves an insufficient success rate in real-world applications and requires fine-tuning the policy for a specific manipulation. In this study, we propose a policy generalized to manipulations with a common physical constraint, using a single environment and reward based on the common characteristic among these manipulations.
3. Method
In this study, we aim to train a policy that generalizes to various compliant manipulation operations, which are required in many household manipulations. To this end, we design a single environment and reward based on the common characteristic of the physical constraint within a manipulation group. In this section, we explain our approach to learning this constraint-aware policy.
This section is organized as follows.
Section 3.1 explains the target manipulation group in this study.
Section 3.2 states assumptions for executing the constraint-aware policy.
Section 3.3 introduces the training method of the policy for the target manipulation group in Section 3.1.
Section 3.4 describes the technical details of satisfying the single-system condition, which is one of the assumptions explained in Section 3.2. These details are essential for an appropriate execution of the policy trained under the environment and reward in Section 3.3.
3.1. Target Manipulation Group
In this study, we focus on manipulation groups whose physical constraint is either one-degree-of-freedom translation (prismatic joint) or one-degree-of-freedom rotation (revolute joint). These physical constraints are representative of the household environment. In manipulations with a prismatic joint, such as drawer opening, plate sliding, and pole pulling, the object’s admissible motion directions are constrained by a linear guide. In manipulations with a revolute joint, including door opening and handle rotating, the admissible directions are constrained by a rotational axis.
Compliant manipulations of the same group share a common characteristic of the constraint force: a large force is exerted on an object when the object tries to move in an inadmissible direction. Since compliant manipulation operations can be achieved using this force, we execute various unseen manipulations within the same group with a single policy based on this characteristic of the force.
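As an illustrative sketch of the two constraint groups (not part of the original method description), the admissible direction can be written as a fixed guide axis for a prismatic joint and as the local tangent of the circular path for a revolute joint. The function names and the example geometry below are hypothetical.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def prismatic_admissible_direction(guide_axis):
    """Prismatic joint: the admissible direction is the guide axis everywhere."""
    return normalize(guide_axis)

def revolute_admissible_direction(rotation_axis, center, point):
    """Revolute joint: the admissible direction is the tangent at `point`."""
    axis = normalize(rotation_axis)
    radius = tuple(p - c for p, c in zip(point, center))
    # Drop the component along the axis so only the in-plane radius remains.
    along = sum(r * a for r, a in zip(radius, axis))
    radius = tuple(r - along * a for r, a in zip(radius, axis))
    return normalize(cross(axis, radius))

# Hypothetical drawer sliding along x, and a door hinged on the z axis.
drawer_dir = prismatic_admissible_direction((2.0, 0.0, 0.0))
door_dir = revolute_admissible_direction(
    (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (0.5, 0.0, 0.0))
```

For the revolute case the tangent changes as the object moves, which is why the admissible direction has to be re-estimated during execution.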
3.2. Assumptions
The constraint-aware policy in this study is executed on the following assumptions.
Assumption 1: Single-system condition: The robot hand and object move in unison, where the internal forces between them are canceled out.
Assumption 2: The inertial force on the manipulated object is negligible.
Assumption 3: Friction in the joint mechanism is sufficiently weak such that the manipulated object can move smoothly along the desired trajectory.
Assumption 4: The workplane of the robot hand and direction of the rotation axis are known; thus, the robot hand and manipulated object move on a known plane.
These assumptions can be fulfilled in the manipulations we focus on. Assumptions 1 and 2 can be satisfied through the design of the manipulation: Assumption 2 is satisfied by moving the manipulated object slowly, and Assumption 1 is satisfied by a grasp mechanism and an additional policy that decreases the torque exerted on the object. For more details on Assumption 1, see Section 3.4. Assumption 3 is satisfied by many household objects, as they are designed for easy handling by humans. Finally, this study focuses on objects with only one prismatic or revolute joint, which are representative of household environments; thus, regarding Assumption 4, the workplane can easily be obtained. The workplane and the direction of the rotation axis can be obtained using Learning-from-Observation (LfO), where a human provides manipulation instructions to a robot through a one-shot demonstration [9,10]. We can calculate the workplane from human hand trajectories. For more details on Assumption 4, see Section 4.
3.3. Training Design under Single-System Condition
Deep RL is employed to design the control policy, as it mitigates the requirement of manual parameter tuning and is robust to uncertainties, such as recognition error and sensor noise, unlike classical controllers [7,23,24].
To design the control policy, we model compliant manipulation as a Markov decision process and apply deep RL to train a constraint-aware policy. The Markov decision process has a state space $\mathcal{S}$, action space $\mathcal{A}$, state transition $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$, initial state distribution $\rho_0$, and reward $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. At each timestep $t$, an agent interacts with an environment with an action $a_t \in \mathcal{A}$ determined from state $s_t \in \mathcal{S}$, resulting in $s_{t+1}$ and $r_t$. The goal of RL is to find the optimal policy $\pi^{*}$ that maximizes the cumulative reward $\sum_{t=0}^{T} \gamma^{t} r_t$, where $\gamma$ is the discount factor, $r_t = r(s_t, a_t)$, and $T$ is the episode length.
In this study, the robot hand moves along a motion direction $\mathbf{d}_t$ and observes a force $\mathbf{f}_t$. We train the policy to estimate an optimal motion direction while the hand moves along the estimated direction.
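The discounted cumulative reward that the policy maximizes can be illustrated with a short sketch. The reward values and the discount factor below are made-up numbers for illustration only.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode (t = 0 .. T)."""
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Hypothetical episode in which the penalty shrinks as the direction improves.
episode_rewards = [-1.0, -0.5, -0.25]
episode_return = discounted_return(episode_rewards, gamma=0.5)
print(episode_return)
```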
3.3.1. Training Environment
The training environment is designed based on the single-system condition. This environment consists of a single composite body and a prismatic joint (Figure 2). This composite body represents the robot hand and manipulated object under the single-system condition. At each timestep, a force $\mathbf{f}_t$ exerted on the body is obtained as a result of the interaction between the body and the constraint. The constraint is represented as a constraint equation, and the force is calculated by solving the equation of motion, which includes the constraint force [36]. The single-system condition guarantees that $\mathbf{f}_t$, measured at the robot wrist, is identical to the constraint force on the body, as any internal forces between the hand and the object can be ignored.
This environmental design offers the advantage of a low simulation cost, as it is unnecessary to consider unstable factors, such as contact simulations between objects. This improves simulation speed and leads to faster training. Furthermore, the policy trained in this environment can be easily adapted to different robot hands because it is independent of the specific characteristics of the robot hand itself.
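A minimal sketch of such a simplified environment is given below, restricted to the 2D workplane. It is an assumption-laden stand-in: instead of solving the equation of motion with a constraint equation, it uses a penalty (spring) model in which the inadmissible part of the commanded displacement produces a proportional reaction force. The class name, stiffness, and step size are hypothetical.

```python
import math

class PrismaticJointEnv:
    """Composite body (hand + object) sliding on a 2D linear guide."""

    def __init__(self, guide_angle=0.0, stiffness=1000.0, step=0.001):
        self.axis = (math.cos(guide_angle), math.sin(guide_angle))
        self.stiffness = stiffness  # hypothetical penalty stiffness [N/m]
        self.step = step            # commanded displacement per timestep [m]
        self.position = 0.0         # travel along the guide [m]

    def move(self, direction):
        """Command one step along `direction`; return the constraint force."""
        norm = math.hypot(direction[0], direction[1])
        dx = self.step * direction[0] / norm
        dy = self.step * direction[1] / norm
        # Split the commanded displacement into admissible and inadmissible parts.
        along = dx * self.axis[0] + dy * self.axis[1]
        off_x = dx - along * self.axis[0]
        off_y = dy - along * self.axis[1]
        # Only the admissible part moves the body.
        self.position += along
        # The constraint pushes back against the inadmissible part.
        return (-self.stiffness * off_x, -self.stiffness * off_y)

env = PrismaticJointEnv(guide_angle=0.0)
aligned_force = env.move((1.0, 0.0))     # motion along the guide
misaligned_force = env.move((1.0, 1.0))  # motion partly against the guide
```

No contact between separate bodies is simulated here, which is the property that keeps the simulation cheap.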
3.3.2. State and Action
At timestep $t$, the state $s_t$ consists of the normalized force obtained from a sensor, $\hat{\mathbf{f}}_t$ ($\hat{\mathbf{f}}_t = \mathbf{f}_t / \|\mathbf{f}_t\|$), and the motion direction of the robot hand, $\mathbf{d}_t$. Utilizing the normalized force vector is important because the normalization makes the policy robust to changes in the magnitude of the force caused by environmental changes. Note that if the constraint force is very small, noise sources such as sensing errors and joint bending are amplified by the normalization. In this study, we assume that the constraint force is always large enough to ignore these factors. If these factors are not negligible, a force whose magnitude is smaller than a predefined threshold should be treated as zero. The action $a_t$ is defined as an operation that modifies the direction of motion. Given $\mathbf{d}_t$ and $a_t$, the motion direction is updated using the following equation:
[Equation: update of the motion direction, giving $\mathbf{d}_{t+1}$ from $\mathbf{d}_t$ and $a_t$]
When the object tries to move in the inadmissible direction, the constraint force is exerted on the object. The policy should modify the motion direction toward this force direction such that the force is reduced. As shown in Figure 3, the update of the motion direction by the optimal policy guarantees the adjustment of the force resulting from the interaction between the object and the constraint. Thus, the motion direction can be appropriately modified using the force direction. Note that the direction of the constraint force can be obtained under Assumption 3, where the friction in the joint mechanism is sufficiently weaker than the constraint force.
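The state construction and direction update can be sketched as follows, restricted to the 2D workplane. Two points are assumptions made for illustration and are not taken from the text: the action is modelled as a rotation angle applied to the current motion direction, and the deadband threshold value is arbitrary.

```python
import math

FORCE_THRESHOLD = 0.5  # hypothetical deadband [N]; smaller forces count as zero

def normalized_force(force, threshold=FORCE_THRESHOLD):
    """Unit force direction, or the zero vector when the force is below the threshold."""
    magnitude = math.hypot(force[0], force[1])
    if magnitude < threshold:
        return (0.0, 0.0)
    return (force[0] / magnitude, force[1] / magnitude)

def build_state(force, direction):
    """State = normalized force followed by the current motion direction."""
    return normalized_force(force) + tuple(direction)

def update_direction(direction, action_angle):
    """Assumed update rule: rotate the unit direction by `action_angle` radians."""
    c, s = math.cos(action_angle), math.sin(action_angle)
    x, y = direction
    return (c * x - s * y, s * x + c * y)

# The constraint pushes in -y, so the direction is turned toward -y.
state = build_state(force=(0.0, -4.0), direction=(1.0, 0.0))
new_direction = update_direction((1.0, 0.0), math.radians(-5.0))
```

Because only the force direction enters the state, scaling the force by a stiffer or softer environment leaves the state unchanged.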
3.3.3. Reward
We train the constraint-aware policy to estimate the motion direction of the robot hand. To train the optimal policy, we should set an appropriate reward function based on the constraint. Thus, we consider the case in which the motion direction is not along the constraint (Figure 3). In compliant manipulations with both prismatic and revolute joints, if the robot hand does not move along the constraint, the physical constraint exerts a constraint force on the object. This force is minimized when the motion direction is along the constraint. Thus, we propose the reward $r_t$ represented by the constraint force $\mathbf{f}_t$:
[Equation: reward $r_t$ computed from the magnitude of the constraint force $\mathbf{f}_t$]
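One simple reward with the stated property, largest when the constraint force vanishes and computed from the force magnitude only, is the negative magnitude. The sketch below uses this form as an assumption; the exact expression used for training may differ.

```python
import math

def reward(force):
    """Assumed reward: negative force magnitude, so zero is the best value."""
    return -math.hypot(*force)

r_aligned = reward((0.0, 0.0))      # motion along the constraint
r_misaligned = reward((3.0, -4.0))  # motion pressing against the constraint
```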
3.4. Technical Details of Satisfying the Single-System Condition When Applying the Policy to a Robot
The constraint-aware policy is trained and executed under the assumption of the single-system condition. To satisfy the single-system condition, the relative position and orientation between the robot hand and an object must be maintained. Two main challenges to satisfy this condition are identified: fingertip slipping and lack of contact between the robot and object.
3.4.1. Avoidance of Fingertip Slipping
A violation of the single-system condition can occur if a large impulse force causes the robot’s fingertips to slip on the manipulated object. Such a force arises mainly when the robot hand tries to move a large distance in the inadmissible direction. Thus, to prevent a large impulse force, we implement the robot control system so that the robot hand moves slowly. Moreover, fingertip slipping is likely to occur if the hand orientation remains constant while manipulating a revolute joint, where the orientation of the manipulated object changes. To avoid slipping, we change the hand orientation based on the change in the motion direction, as follows. We define $\mathbf{q}_t$ as the quaternion representing the hand orientation in the world coordinate system at time $t$; then, $\mathbf{q}_{t+1}$ can be calculated using the following equation:
$$\mathbf{q}_{t+1} = \Delta\mathbf{q}_t \, \mathbf{q}_t$$
where $\Delta\mathbf{q}_t$ represents the quaternion that rotates by the angle between $\mathbf{d}_t$ and $\mathbf{d}_{t+1}$ around the cross product of $\mathbf{d}_t$ and $\mathbf{d}_{t+1}$.
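The orientation update can be sketched in a few lines: build the quaternion that rotates the previous motion direction onto the new one (axis given by their cross product, angle given by the angle between them) and compose it with the current hand orientation. The (w, x, y, z) storage order and the left-multiplication convention for a world-frame rotation are assumptions of this sketch.

```python
import math

def quat_multiply(a, b):
    """Hamilton product a * b for quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def rotation_between(d_from, d_to, eps=1e-9):
    """Quaternion rotating unit vector d_from onto unit vector d_to."""
    axis = (d_from[1] * d_to[2] - d_from[2] * d_to[1],
            d_from[2] * d_to[0] - d_from[0] * d_to[2],
            d_from[0] * d_to[1] - d_from[1] * d_to[0])
    sin_angle = math.sqrt(sum(c * c for c in axis))
    cos_angle = sum(p * q for p, q in zip(d_from, d_to))
    if sin_angle < eps:
        # Parallel directions: no rotation (the antiparallel case is not handled).
        return (1.0, 0.0, 0.0, 0.0)
    half = 0.5 * math.atan2(sin_angle, cos_angle)
    scale = math.sin(half) / sin_angle
    return (math.cos(half), axis[0] * scale, axis[1] * scale, axis[2] * scale)

def update_hand_orientation(q_hand, d_prev, d_next):
    """Rotate the hand by the same world-frame rotation as the motion direction."""
    return quat_multiply(rotation_between(d_prev, d_next), q_hand)

# A quarter turn of the motion direction about z, starting from the identity.
q_next = update_hand_orientation(
    (1.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

In practice the direction changes only slightly per timestep, so the degenerate antiparallel case is not expected to occur.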
This strategy does not guarantee that the hand orientation changes exactly in conjunction with the orientation of the object, and it can be adopted only when the relative orientation between the hand and the manipulated object is not strictly fixed. An example is door opening with a lazy closure, which is one of the grasps [37], as shown in Figure 4. With the lazy closure, the contact regions remain constant and stable manipulation is ensured while opening the door, even though the relative orientation between the hand and the manipulated object is not strictly fixed. However, when the relative orientation between the hand and the manipulated object is strictly fixed, as in handle rotating, a more precise method of changing the hand orientation is required. Thus, we prepare an additional policy to maintain the single-system condition in this case. Further details are provided in Appendix A.
3.4.2. Guarantee of Hand–Object Contact
The manipulated object and robot hand must be in contact throughout the manipulation to maintain the single-system condition. Contact is guaranteed if a non-zero constraint force is measured by a sensor on the wrist of the robot. Thus, the contact condition is ensured by applying a constraint force at the beginning of the manipulation and maintaining it throughout the manipulation. Specifically, the constraint force fed into the policy is defined as the raw force value offset by this initially applied force.
The displacement of the hand relative to the object is classified as admissible or inadmissible. If the hand moves in an inadmissible direction, it presses against the object, and the single-system condition is maintained. If the estimated displacement falls outside the inadmissible directions between the hand and the object, the hand moves away from the object and the single-system condition is broken. In this study, we assume that the estimated displacement always lies within the inadmissible directions between the robot hand and the object.
4. Learning-from-Observation System
Compliant manipulation is executed by combining our constraint-aware policy with Learning-from-Observation (LfO), a system in which a human provides manipulation instructions to a robot through a one-shot demonstration [9,10]. In this study, the physical constraint, workplane, and initial motion direction are obtained from a human demonstration for compliant manipulation. Using this system, we can satisfy Assumption 4, i.e., the workplane can be determined by leveraging the demonstration. This section describes the details of the LfO system applied in this study.
As shown in Figure 5, the LfO system consists of two phases: the demonstration phase and the execution phase. In the demonstration phase, the LfO system obtains a sequence of tasks from a human demonstration and assigns skill parameters to each task. In the execution phase, the system decodes the skill parameters into execution commands.
In the demonstration phase, a human demonstration is encoded into a sequence of tasks using skill parameters [10]. The demonstration consists of an RGBD image sequence of a one-shot human demonstration and verbal instructions. In this study, the human demonstration is decomposed into several tasks, including grasping and compliant manipulation under a physical constraint (prismatic or revolute joint). The skill parameters of grasping and manipulation are also determined from the image sequence and verbal instructions.
For grasping, the skill parameters include the force exertion type and approach direction appropriate for the task situation [37]. A convolutional neural network (CNN)-based classifier (grasp recognizer in Figure 5) recognizes one of the four force exertion types based on the human hand image at the moment of grasp and the name of the object [38]. By using the name of the object, similar hand shapes can be recognized as different force exertion types. The approach direction is calculated from the trajectory of the human hand in the demonstration (hand trajectory calculator in Figure 5).
The physical constraint is determined from the verbal instruction (constraint recognizer in Figure 5). For example, the verbal instruction “open a fridge door” is associated with a revolute joint. For compliant manipulation of a prismatic joint, the skill parameters include the workplane normal and initial motion direction. For a revolute joint, the skill parameters include the rotation radius in addition to the workplane normal and initial motion direction. These parameters are calculated by the hand trajectory calculator. The workplane normal and rotation radius are calculated using plane fitting and circular fitting, respectively.
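As a hedged illustration of how a hand trajectory could yield these parameters, the sketch below fits a plane by singular value decomposition and then fits a circle to the in-plane points with an algebraic least-squares (Kasa) fit. The specific fitting methods, function names, and the synthetic trajectory are assumptions; the source states only that plane fitting and circular fitting are used.

```python
import numpy as np

def fit_plane_normal(points):
    """Least-squares plane through 3D points; returns (centroid, unit normal)."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    # The right singular vector with the smallest singular value is the normal.
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt[-1]

def fit_circle_radius(points, centroid, normal):
    """Algebraic (Kasa) circle fit of the points projected onto the plane."""
    points = np.asarray(points, dtype=float)
    # Build an orthonormal basis (u, v) spanning the plane.
    helper = np.array([1.0, 0.0, 0.0])
    if abs(np.dot(helper, normal)) > 0.9:
        helper = np.array([0.0, 1.0, 0.0])
    u = np.cross(normal, helper)
    u /= np.linalg.norm(u)
    v = np.cross(normal, u)
    x = (points - centroid) @ u
    y = (points - centroid) @ v
    # Solve x^2 + y^2 = 2*cx*x + 2*cy*y + c in the least-squares sense.
    A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
    b = x**2 + y**2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(np.sqrt(c + cx**2 + cy**2))

# Synthetic quarter-circle hand trajectory: radius 0.4 m in the plane z = 0.
angles = np.linspace(0.0, np.pi / 2, 20)
trajectory = np.column_stack([1.0 + 0.4 * np.cos(angles),
                              2.0 + 0.4 * np.sin(angles),
                              np.zeros_like(angles)])
centroid, normal = fit_plane_normal(trajectory)
radius = fit_circle_radius(trajectory, centroid, normal)
```

The sign of the fitted normal is arbitrary, so a real system would need an additional convention (for example, pointing toward the camera) to fix it.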
In the execution phase, the robot executes the target task sequence by first grasping an object and then manipulating it. For grasping, a contact point recognizer and a grasping policy are selected based on the force exertion type obtained in the demonstration phase [37]. The recognizer and policy are trained in advance for each force exertion type. The contact point recognizer has a simple CNN structure, where the input is the depth image of the target object and the output is the contact points to be grasped. The detected contact points are passed on to the grasping policy, and the grasp is executed.
For manipulation, a manipulation policy is selected based on the constraint obtained in the demonstration phase and then executed. In tasks involving a prismatic or revolute joint, the constraint-aware policy is applied. Note that, as described in Section 3.4, in a task with a revolute joint, the hand orientation is changed to maintain the single-system condition because the orientation of the manipulated object changes during the manipulation. Therefore, the constraint type (prismatic or revolute) must be determined prior to manipulation.
7. Conclusions
In this study, we proposed a constraint-aware policy that can be applied to various manipulations with either a prismatic or revolute joint. We designed a training environment and a reward function to train the policy based on these constraints. The experimental results showed that the single policy could be executed on three manipulations with a prismatic joint (drawer opening, plate sliding, and pole pulling), even when an estimation error in the motion direction was applied in the simulation. Unlike the classical controller, our policy achieved robust execution against environmental changes. In addition, we could execute our policy on two manipulations with a revolute joint (door opening and handle rotating). Furthermore, three manipulations, drawer opening, door opening, and handle rotating, were successfully executed on an actual robot without additional training.
Although our policy was trained in the simple environment, our policy could be executed successfully on different manipulations. Previous reinforcement learning (RL) methods specially designed the environment and reward for each target manipulation, whereas our policy was widely applicable to various assumed situations. Thus, we successfully designed a policy generalized to manipulations constrained by either a prismatic or revolute joint based on the constraint force, which is a common characteristic between such manipulations.
Toward a robot system capable of executing a wide range of manipulations, it is crucial to design a generalized policy for each manipulation group. Household manipulations can be categorized according to their physical constraints [8]. The key to generalized policies is to design an environment and reward focusing on a common characteristic within each group. This study validated the concept of a constraint-aware policy for prismatic and revolute joints, which are fundamental physical constraints. We believe this study is the first step towards realizing a generalized household robot.