2.1. POMDP Model of the Sorting Task
Here, we explain the basic POMDP model of the sorting task, initially presented in [19]. In general, a partially observable Markov decision process (POMDP) is a tuple $(S, A, T, R, Z, O, \gamma)$ [20], where the elements are the following:
S is a set of states, taken discrete and finite for classical POMDPs. Individual states are denoted by $s$.
A is a set of actions available to the robot, again discrete and finite. Actions are denoted by $a$.
$T: S \times A \times S \to [0,1]$ is a stochastic state transition function. Each function value $T(s, a, s')$ gives the probability that the next state is $s'$ after executing action $a$ in current state $s$.
$R: S \times A \to \mathbb{R}$ is a reward function, where $R(s, a)$ is the reward obtained by executing $a$ in $s$. Note that sometimes rewards may also depend on the next state; in that case, $R(s, a)$ is the expectation taken over the value of the next state. Moreover, rewards are classically assumed to be bounded.
Z is a set of observations, discrete and finite. The robot does not have access to the underlying state $s$. Instead, it observes the state through imperfect sensors, which read an observation $z \in Z$ at each step.
$O: S \times A \times Z \to [0,1]$ is a stochastic observation function, which defines how observations are seen as a function of the underlying states and actions. Specifically, $O(s', a, z)$ is the probability of observing value $z$ when reaching state $s'$ after executing action $a$.
$\gamma \in [0, 1)$ is a discount factor.
The objective in the POMDP is to maximize the expected sum of discounted rewards along the trajectory.
A central concept in POMDPs is the belief state, which summarizes information about the underlying state $s$, as gleaned from the sequence of actions and observations known so far. The robot is uncertain about $s$, so the belief state is a probability distribution over $S$, $b \in [0,1]^{|S|}$, where $|S|$ denotes the cardinality of $S$. The belief state is initially chosen equal to $b_0$ (uniform if no prior information is available), and then updated based on the actions $a$ and observations $z$ with the following formula:
$$b'(s') = \frac{O(s', a, z) \sum_{s \in S} T(s, a, s')\, b(s)}{\eta(b, a, z)} \qquad (1)$$
Here, $\eta(b, a, z)$ is a normalization factor, equal to the probability of observing $z$ when $a$ is executed in belief $b$; this can be easily computed from $O$.
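As an illustration, the following is a minimal Python sketch of the belief update in Formula (1); the array layout and the function name `belief_update` are ours, chosen for this example rather than taken from the paper.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayesian belief update of Formula (1).

    b : array of shape (|S|,), current belief over states
    T : array of shape (|S|, |A|, |S|), T[s, a, s'] = P(s' | s, a)
    O : array of shape (|S|, |A|, |Z|), O[s', a, z] = P(z | s', a)
    Returns the updated belief b' over states.
    """
    # Predict: probability of each next state s' after executing action a
    predicted = b @ T[:, a, :]
    # Correct: weight each s' by the likelihood of the received observation z
    unnormalized = O[:, a, z] * predicted
    eta = unnormalized.sum()                 # normalization factor eta(b, a, z)
    return unnormalized / eta

# Tiny usage example with 2 states, 1 action, 2 observations (made-up numbers)
T = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])   # T[s, a, s']
O = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])   # O[s', a, z]
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, z=0, T=T, O=O))
```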
Now that the general POMDP concepts are in place, we are ready to describe the sorting task. There are two main components to this task: a deterministic, fully-observable component relating to robot motion among the viewpoints, and a stochastic, partially-observable component relating to the object classes. The two components run largely in parallel, and they are connected mainly through the rewards for the class-decision actions of the robot. We first present the motion component, as it is rather simple, and then turn our attention to the more interesting, class-observation component.
The motion component is defined as follows:
Motion state $v \in \{1, \dots, K\}$, meaning simply the viewpoint of the robot. There are K such viewpoints.
Motion action $a_v \in \{1, \dots, K\}$, meaning the choice of next viewpoint.
Motion transition function:
$$T_v(v, a_v, v') = 1 \text{ if } v' = a_v, \text{ and } 0 \text{ otherwise},$$
which simply says that the robot always moves deterministically to the chosen viewpoint.
Motion states are fully observable, so we can define observations $z_v \in \{1, \dots, K\}$ and the observation function $O_v(v', a_v, z_v) = 1$ if and only if $z_v = v'$ (and 0 otherwise).
Consider now the class observation component. There are $C$ object classes $c_1, c_2, \dots, c_C$, and the robot simultaneously observes $H$ positions from the conveyor belt. Thus, the state contains the object class at each such position $j$: $(c^1, c^2, \dots, c^H)$, with each $c^j \in \{c_1, \dots, c_C\}$. Note that the subscript of $c$ indexes the class values, while the superscript indexes positions. The action for this component is a decision on the class of the object at the start of the belt, with one decision action per class $c_i$. The robot is expected to issue such an action only when it is sufficiently certain about this class; this will be controlled via the reward function, to be defined later.
To define the class transition function, we first need to give the overall action space available to the robot, which consists of all the motion and class decision actions, so that $|A| = K + C$. Note that at a given step, the robot may either move between viewpoints, or make a class decision. Then, for each belt position $j$:
$$T_c^j(s, a, c'^j) = \begin{cases} 1, & \text{if } a \text{ is a motion action and } c'^j = c^j \\ 1, & \text{if } a \text{ is a decision action, } j < H, \text{ and } c'^j = c^{j+1} \\ 1/C, & \text{if } a \text{ is a decision action and } j = H \\ 0, & \text{otherwise} \end{cases}$$
and the joint class transition function is the product of these per-position factors.
In order, the branches of this transition function have the following meaning. The first branch encodes that, if the robot moved between viewpoints (so the belt did not advance), then the classes remain the same on all H positions of the belt. The second branch, on the other hand, says that the classes move after a decision action (which automatically places the object in the right bin and advances the belt): the new class on position 1 is the old one on position 2, and so on. The third branch also applies for a decision action, and its role is to initialize the class value at the last position H. Note that in reality the class will be given by the true subsequent object, but since the POMDP transition function is time-invariant, this cannot be encoded and we use a uniform distribution over the classes instead. The fourth branch simply assigns 0 probability to the transitions not seen on the first three branches. To better understand what is going on, see Figure 1.
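The class-shift behavior of this transition function can be sketched in a few lines of Python. The list representation of the belt and the helper name `step_classes` are assumptions made for this example; sampling the incoming class uniformly mirrors the third branch above.

```python
import random

def step_classes(classes, action_is_decision, num_classes):
    """Advance the class part of the state.

    classes            : list of length H with the class index at each belt position
    action_is_decision : True for a class-decision action, False for a motion action
    num_classes        : number of object classes C
    """
    if not action_is_decision:
        # Motion action: the belt does not advance, so the classes stay the same
        return list(classes)
    # Decision action: the belt advances, positions shift toward the start,
    # and the unknown incoming object is modeled as uniformly random over classes
    return list(classes[1:]) + [random.randrange(num_classes)]

# Example: H = 3 positions, C = 4 classes
print(step_classes([2, 0, 3], action_is_decision=True, num_classes=4))
```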
The classes are of course not accurately observable, so we need to extract information about them via observations, and maintain a belief over their values. We will do this in a factored fashion, separately for each observed position on the belt.
The robot makes an observation $z^j \in \{1, \dots, C\}$ about each position $j$, where $z^j = i$ means that the object at position $j$ is seen to have class $c_i$ (which may or may not be the true class $c^j$).
Observations at each position $j$ are made according to a per-position observation function $O^j$, where $O^j(c^j, v, z^j)$ is the probability of making observation $z^j$ from the viewpoint $v$ just reached, when the underlying class of the object is $c^j$. These probabilities are application-dependent: they are given among others by the sensor properties, classification algorithm accuracy, actual viewpoint positions, etc. If a good a priori sensor model is available, it can inform the choice of $O^j$. However, the only generally applicable way of obtaining the observation function is experimental. For each viewpoint $v$, position $j$, and underlying class $c_i$, a number $n$ of independent observations are performed, and the classes observed are recorded. Then, $O^j(c_i, v, z^j)$ is computed as the ratio between the number of observations resulting in class $z^j$, and $n$. Note that our approach is thus independent of the details of the classifier, which can be chosen given the constraints in the particular application at hand. Any classifier will benefit from our approach in challenging problems where the object shape is ambiguous from some viewpoints.
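The experimental estimation of the observation probabilities amounts to frequency counting. The sketch below assumes the calibration observations have been logged as (viewpoint, position, true class, observed class) tuples; the function name and data layout are illustrative only.

```python
import numpy as np

def estimate_observation_model(samples, K, H, C):
    """Estimate O[v, j, true_class, observed_class] as empirical frequencies.

    samples : iterable of (viewpoint, position, true_class, observed_class) tuples,
              collected by repeatedly classifying objects of known class from
              each viewpoint and belt position
    """
    counts = np.zeros((K, H, C, C))
    for v, j, c_true, c_obs in samples:
        counts[v, j, c_true, c_obs] += 1
    totals = counts.sum(axis=-1, keepdims=True)      # n per (viewpoint, position, class)
    # Ratio between the number of observations of each class and n (guarding against n = 0)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
```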
The overall state signal of the POMDP is $s = (v, c^1, \dots, c^H)$, with state space $S = \{1, \dots, K\} \times \{c_1, \dots, c_C\}^H$. We have already defined the action space A, and the overall observation z is $(z_v, z^1, \dots, z^H)$. We will not explicitly define the joint transition and observation functions T and O, as the equations are overly complicated and do not really provide additional insight; nevertheless, the procedure to obtain them follows directly.
Instead, let us focus now on the belief state. There is one such belief state $b^j$ at each position $j$, which maintains the probabilities of each possible class value $c_i$ at that position. Note that $b^j(c_i)$ is the belief that $c^j$ is equal to $c_i$. Then, at any motion action, observations are performed according to $O$ and the belief state is updated per the usual Formula (1). After decision actions however, there is a special behavior:
$$b'^{j} = b^{j+1} \text{ for } j = 1, \dots, H-1, \qquad b'^{H}(c_i) = 1/C \text{ for all } i.$$
What is happening is that the old belief state at $j+1$ is moved to $j$, and the belief for the last position is initialized to be uniform, as there is no prior information about the object (if a prior is available, then it should be used here). Figure 1 also provides an example for the propagation process of the beliefs.
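The special belief propagation after a decision action is just a shift of the per-position beliefs plus a uniform (or prior) re-initialization of the last position. A minimal sketch, with an assumed (H, C) array layout:

```python
import numpy as np

def shift_beliefs(beliefs, prior=None):
    """Belief propagation after a decision action.

    beliefs : array of shape (H, C); beliefs[j] is the class distribution at position j
    prior   : optional length-C distribution for the newly arriving object
    """
    H, C = beliefs.shape
    new_beliefs = np.empty_like(beliefs)
    new_beliefs[:-1] = beliefs[1:]                    # b^j <- b^{j+1} for j < H
    new_beliefs[-1] = prior if prior is not None else np.full(C, 1.0 / C)
    return new_beliefs
```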
The overall reward function is initially defined as follows:
$$R(s, a) = \begin{cases} r_{\text{move}}, & \text{if } a \text{ is a motion action} \\ r_{\text{correct}}, & \text{if } a \text{ decides class } c_i \text{ and } c^1 = c_i \\ r_{\text{incorrect}}, & \text{if } a \text{ decides class } c_i \text{ and } c^1 \neq c_i \end{cases} \qquad (6)$$
At each motion action, a constant reward $r_{\text{move}}$ is received, which encodes the time or energy consumption required to move the robot arm. When a decision is made, a reward $r_{\text{correct}}$ is obtained if the decision was correct (the class was well identified), and the incorrect-decision penalty $r_{\text{incorrect}}$ is assigned otherwise.
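A compact sketch of this task reward follows; the reward magnitudes used as defaults here are placeholders for illustration, not values taken from the paper.

```python
def task_reward(true_class_at_position_1, action,
                r_move=-0.1, r_correct=1.0, r_incorrect=-1.0):
    """Task-based reward of (6). Reward magnitudes are illustrative placeholders.

    action is either ('move', viewpoint) or ('decide', class_index).
    """
    kind, value = action
    if kind == 'move':
        return r_move        # time/energy cost of moving the robot arm
    # Decision action: compare the decided class with the true class at position 1
    return r_correct if value == true_class_at_position_1 else r_incorrect
```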
2.2. Adding Rewards Based on the Information Gain
The reward function (6) is based only on performance in the task (correct or incorrect decisions, and a time/energy penalty). In our active perception problem, it is nevertheless essential that before taking a decision, the algorithm is sufficiently confident about the object class. Of course, the incorrect-decision penalty indirectly informs the algorithm if the class information was too ambiguous. We propose however to include more direct feedback on the quality of the information about the object class in the reward function. This is a novel contribution compared to [19].
Specifically, since in our problem the belief is a distribution over object classes (or over combinations of classes, for the multiple-position variant), we will characterize the amount of extra information provided by an action by using the information gain—or Kullback-Leibler divergence—between the current belief state $b$ and a possible future one $b'$:
$$IG(b, b') = \sum_{s \in S} b'(s) \log \frac{b'(s)}{b(s)}$$
Informally, we expect the information gain to be large when distribution $b'$ is significantly “peakier” than $b$, i.e., the object class is significantly less ambiguous in $b'$ than in $b$.
We will also need the entropy of the belief state, defined as follows:
$$\mathcal{H}(b) = -\sum_{s \in S} b(s) \log b(s)$$
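Both quantities are straightforward to compute from a belief vector; a minimal sketch (natural logarithms and a small smoothing constant are assumptions made here to avoid division by zero):

```python
import numpy as np

def information_gain(b, b_next, eps=1e-12):
    """Kullback-Leibler divergence KL(b_next || b) between two belief vectors."""
    b = np.asarray(b, dtype=float) + eps
    b_next = np.asarray(b_next, dtype=float) + eps
    return float(np.sum(b_next * np.log(b_next / b)))

def entropy(b, eps=1e-12):
    """Entropy of a belief vector."""
    b = np.asarray(b, dtype=float) + eps
    return float(-np.sum(b * np.log(b)))

# A "peakier" posterior yields a large information gain
print(information_gain([0.5, 0.5], [0.95, 0.05]), entropy([0.5, 0.5]))
```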
To understand how the information gain is exploited in our approach, we must delve into the planning module; see also Section 2.4. This module has the role of finding a good sequence of actions that maximizes the amount of reward—in our case, a sequence of viewpoints, which improves the likelihood of a proper sorting of the candidate objects, and of decisions on the classes of these objects. The planning module solves the POMDP problem using the DESPOT algorithm of [15], with an online approach that interleaves planning and plan execution stages. We will explain a few details about DESPOT, to the extent required to understand our method; the complete algorithm is rather intricate and outside the scope of this paper.
DESPOT constructs a tree of belief states, actions, and observations. Figure 2 gives an example of such a tree. Each square node is labeled by a belief state $b$, and may have a round child for each action $a$; in turn, each such action node may have a square, belief child for each observation $z$, labeled by the belief $b'$ resulting from $a$ and $z$. A tree represents many possible stochastic evolutions of the system, e.g., for a fixed sequence of two actions there are four possible belief trajectories in the tree of Figure 2: those ending in the 9th, 10th, 13th and 14th leaves at depth 2.
We will work with a reward function $\rho(b, a, b')$ that is defined on transitions between belief nodes of this tree. For the original task-based POMDP reward function R in the section above, the corresponding belief-based reward would be:
$$\rho_t(b, a, b') = \sum_{s \in S} b(s) R(s, a)$$
where the subscript t indicates this is the direct task reward. Note that in DESPOT, beliefs are approximately represented in the form of a set of particles, and belief rewards are similarly approximated based on these particles. We implicitly work with these approximate belief versions, both in the equation above and in the sequel.
We include the information gain by using a modified reward, as follows:
$$\rho(b, a, b') = \rho_t(b, a, b') + \lambda \, IG(b, b')$$
Thus, larger rewards are assigned to actions that help disambiguate better between object classes. Here, $\lambda \geq 0$ is a tuning parameter that adjusts the relative importance of the information-gain reward. Later on, we study the impact of $\lambda$ on performance.
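Putting the pieces together, the shaped belief-node reward is just the task reward plus $\lambda$ times the information gain of the transition. The sketch below reuses the `information_gain` helper from the earlier example and treats the weight `lam` as the tuning parameter; the function name is ours.

```python
def shaped_belief_reward(task_reward_value, b, b_next, lam):
    """Belief-node reward: direct task reward plus lam * information gain."""
    return task_reward_value + lam * information_gain(b, b_next)
```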
To choose which nodes to create in developing the tree, DESPOT requires upper and lower bounds on the values (long-term expected rewards) of beliefs. It computes lower and upper bounds $\underline{V}_t(b)$ and $\overline{V}_t(b)$ of the task-reward values with well-known procedures in the POMDP literature [20]. To include the information gain, we leave the original lower bounds unchanged; since information gains are always positive, the lower bounds computed for the task rewards remain valid for the new rewards. For the upper bounds, we add $\lambda$ times the entropy of $b$ as an estimate of the upper bound of any sequence of information gains:
$$\overline{V}(b) = \overline{V}_t(b) + \lambda \, \mathcal{H}(b)$$
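The corresponding bound adjustment keeps the task lower bound and inflates the upper bound by $\lambda$ times the belief entropy. A sketch under the same assumptions as above, reusing the `entropy` helper:

```python
def shaped_bounds(lower_task, upper_task, b, lam):
    """Value bounds for DESPOT with information-gain rewards.

    The task lower bound stays valid because information gains are nonnegative;
    the upper bound is increased by lam * entropy(b), an estimate of the largest
    total information gain still achievable from belief b.
    """
    return lower_task, upper_task + lam * entropy(b)
```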
2.3. Complexity Insight
A key factor dictating the complexity of the problem is the branching factor of the tree explored by the planning algorithm. To gain some more insight into this, let us examine a simple case where there are two viewpoints labeled L (for Left) and R (for Right), two classes labeled 1 and 2, and the observation function given in Table 1. Thus, if $q$ is close to 1, then from viewpoint L class 1 is seen more accurately, and from viewpoint R class 2 is seen more accurately.
Take a uniform initial belief, $b_0 = (0.5, 0.5)$. For this case, if we define the probability of a belief node in Figure 2 as the product of all observation probabilities from the root to that node, we can describe the tree explicitly. In particular, at depth $d$ we will have only nodes whose beliefs and probabilities are parameterized by an integer $k$, where $k$ decreases in steps of 2 from $d$ down to 0 when $d$ is even, or to 1 when $d$ is odd. The proof is an intricate induction, which we skip for space reasons. Instead, we plot in Figure 3 an example evolution of the node probabilities as a function of $d$ (up to 100) and of the resulting values of $k$, for a particular value of $q$. These results say that at each depth $d$, when $q$ is large (i.e., when sensing is good) there are only a few classes of nodes (values of $k$) with large probabilities: probabilities drop exponentially as $k$ decreases. This is encouraging, because results in [4] suggest that complexity is small when node probabilities are skewed in this way (results there were for a different algorithm, AEMS2 [21], but we believe this principle is generally applicable to any belief-tree exploration algorithm; see also the related concept of covering number [22,23]).
Obtaining a full analytical statement of this insight seems difficult. Instead, next we study empirically the effective branching factor of DESPOT with information-gain rewards, for a fixed value of the information-gain weight $\lambda$, and for a slightly more complicated version of the problem with 4 classes (the case of 2 classes is not informative, as the algorithm only develops very shallow trees). The branching factor is estimated by letting the algorithm run for a long time from a uniform initial belief, and dividing the number of belief nodes at depth $d+1$ by the number at depth $d$. The largest branching factor obtained in this way, across all depths $d$, is significantly smaller than the largest possible branching factor of 32, suggesting that the problem is not overly difficult to solve.
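This per-depth estimate can be computed from any recorded tree by counting belief nodes at each depth; a short sketch, assuming the planner logs the depth of every belief node it creates:

```python
from collections import Counter

def effective_branching_factors(belief_node_depths):
    """Estimate per-depth branching factors from the depths of all belief nodes:
    (#nodes at depth d+1) / (#nodes at depth d) for each depth d present."""
    counts = Counter(belief_node_depths)
    return {d: counts[d + 1] / counts[d]
            for d in sorted(counts) if counts.get(d + 1)}
```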