Implicit Shape Model Trees: Recognition of 3-D Indoor Scenes and Prediction of Object Poses for Mobile Robots

For a mobile robot, we present an approach to recognize scenes in arrangements of objects distributed over cluttered environments. Recognition is made possible by letting the robot alternately search for objects and assign found objects to scenes. Our scene model, "Implicit Shape Model (ISM) trees", allows us to solve these two tasks together. For the ISM trees, this article presents novel algorithms for recognizing scenes and predicting the poses of searched objects. We define scenes as sets of objects, where some objects are connected by 3-D spatial relations. In previous work, we recognized scenes using single ISMs. However, these ISMs were prone to false positives. To address this problem, we introduced ISM trees, a hierarchical model that includes multiple ISMs. Through the recognition algorithm it contributes, this article ultimately enables the use of ISM trees in scene recognition. We intend to enable users to generate ISM trees from object arrangements demonstrated by humans. The lack of a suitable algorithm is overcome by the introduction of an ISM tree generation algorithm. In scene recognition, it is usually assumed that image data is already available. However, this is not always the case for robots. For this reason, we combined scene recognition and object search in previous work. However, we did not provide an efficient algorithm to link the two tasks. This article introduces such an algorithm that predicts the poses of searched objects with relations. Experiments show that our overall approach enables robots to find and recognize object arrangements that cannot be perceived from a single viewpoint.


Introduction
To act autonomously in various situations, robots not only need capabilities to perceive and act but must also be provided with models of the possible states of the world. If we imagine such a robot as a household helper, it will have to master tasks such as setting, clearing, or rearranging tables. Let us imagine that such a robot looks at the table in Figure 1 and tries to determine which of these tasks is pending. More precisely, the robot must choose between four different actions, each of which contributes to the solution of one of the tasks. An autonomous robot may choose an action based on a comparison of its perceptions with its world model, i.e., its assessment of the state of the world. Which scenes are present is an elementary aspect of such a world state. Modeling scenes and comparing their models with perceptions is the topic of this article. In particular, we model scenes not by the absolute poses of the objects in them, but by the spatial relations between these objects. Such a model can be more easily reused across different environments because it models a scene regardless of where it occurs.

Scene Recognition - Problem and Approach
This article provides a solution to the problem of classifying into scenes a configuration or so-called "arrangement" of objects whose poses are given in six degrees of freedom (6-DoF). Recognizing scenes based on the objects present is an approach suggested for indoor environments by [1] and successfully investigated by [2]. The classifier we propose does not describe just a single configuration of objects, but rather a multitude of configurations that these objects can take on while still representing the same scene. Hereinafter, this multitude of configurations will be referred to as a "scene category", whereas a "scene" is a specific object configuration. For Figure 1, the classification problem we address can be paraphrased as follows: "Is the present tableware an example of the modeled scene category?", "How well does each of these objects fit our scene category model?", "Which objects on the table belong to the scene category?", and "How many objects are missing from the scene category?". Our classifier is learned from object configurations demonstrated by a human in front of a robot and perceived by the robot using 6-DoF object pose estimation.
Many scene categories require that the spatial characteristics of relations, including uncertainties, be accurately described. For example, a table setting requires that some utensils be exactly parallel to each other, whereas their positions relative to the table are less critical. To meet such requirements, we proposed single Implicit Shape Models (ISMs) as scene classifiers in [3]. Inspired by Hough voting, a classic but still popular approach (see [4], [5]), our ISMs let each detected object in a configuration vote on which scenes it might belong to, using the spatial relations in which the object participates. The fit of these votes yields a confidence level for the presence of a scene category in an object configuration. Overall, this article is not about feature extraction, but about modeling relations and their variations in 6-DoF. ISMs for scene recognition should not be considered as an alternative but as a complement to the immensely successful Convolutional Neural Nets (CNNs).

Object Search - Problem and Approach
Figure 2 shows our experimental kitchen setup as an example of the many indoor environments where objects are spatially distributed and surrounded by clutter. A robot will have to acquire several points of view before it has observed all objects in such a scene. This problem is addressed in a field called three-dimensional object search ([6]). The existing approaches often rely on informed search. This method is based on the fact that detected objects specify areas where other objects should be searched. However, these areas are predicted by individual objects rather than by entire scenes. Predicting poses utilizing individual objects can lead to ambiguities, since, e.g., a knife in a table setting would expect the plate to be beneath itself when a meal is finished, whereas it would expect the plate to be beside itself when the meal has not yet started. Using, instead, estimates for scenes to predict poses resolves this problem.
To this end, we presented 'Active Scene Recognition' (ASR) in [7] and [8], a procedure that integrates scene recognition and object search. Roughly, the procedure is as follows: The robot first detects some objects and computes which scene categories these objects may belong to. Assuming these scene estimates are valid, it then predicts where missing objects, which would also belong to these estimates, could be located. Based on these predictions, camera views are computed for the robot to check. In this article, ASR is detailed in Sec. 4. In Sec. 5.3, we evaluate ASR and this article's contributions to it on a physical robot. To distinguish between ASR and pure scene recognition, the latter is referred to as 'Passive Scene Recognition' (PSR). PSR is detailed in Sec. 3. The flow of our overall approach (see [9]) is shown in Figure 3. It consists of two phases: first the learning of scene classifiers and then the execution of Active Scene Recognition.

Relation Topology Selection - Problem
We train our scene classifiers from sensory-perceived demonstrations (see [10]), which consist of a two- to low-three-digit number of recorded object configurations. This learning task involves the problem of selecting pairs of objects in a scene to be connected by spatial relations. Which combination of relations is modeled determines the number of false positives returned by a classifier and the runtime of scene recognition. Combinations of relations are hereinafter referred to as relation topologies. Although such topologies contain only binary relations, they can represent many ternary or n-ary relations with multiple binary ones. In Sec. 3.8, we motivate and outline how we selected relation topologies in our previous work in [11].

Contributions and Differences from Previous Work
To not unnecessarily restrict this topology selection, a scene classifier should be able to represent a maximum of topologies. To this end, we first suggested 'Implicit Shape Model trees', a hierarchical scene model, in [3]. This model consists of multiple ISMs stacked upon each other, with the ISMs in it representing different portions of the same scene category. Such portions are brought together by additional ISMs in such a tree. The closer an ISM is to the root of a tree, the larger the portion it covers. However, while we outlined such a tree in [3], we did not define how scene recognition in terms of data or control flow would work with ISM trees. In Sec. 3.7, we close this gap, which prevents greater use of ISM trees, by contributing an algorithm for scene recognition with ISM trees. In [11], we defined how to select relation topologies. However, we did not describe how ISM trees are generated from such topologies. In Sec. 3.6, we contribute an algorithm for generating ISM trees, thus closing another essential gap.
As visible in Figure 3, learned ISM trees are used to perform ASR. To make ASR possible, we had to link two research problems: scene recognition and object search. To this end, we proposed a technique for predicting the poses of searched objects in [7] and reused it in [8]. However, this technique suffered from a combinatorial explosion. We close this gap, which made ASR impractical for larger scenes, by contributing a prediction algorithm in Sec. 4.2 that efficiently predicts object poses. In summary, the contributions of this work are:
1. An algorithm for generating ISM trees
2. An algorithm for scene recognition using ISM trees
3. An algorithm for predicting object poses with ISM trees

Equipment and Constraints
For the experiments with our robot MILD in Figure 1, we integrated these algorithms into ASR. The robot consists of a mobile base and a pivoting camera head. Searched objects are detected using third-party object pose estimators. Of all searched objects in our kitchen setup, only the utensils are localized using markers. ISM trees provide two parameters that set the degree to which object poses may deviate from the modeled relations without being excluded from the scene. These parameters were tested in the range of [mm] to [dm] for object positions and in the one- to two-digit [°] range for object orientations. Because ISM trees emphasize the modeling of relations, they focus on the objects in a scene. They complement work such as [12] that places emphasis on the global shape of a scene, including the walls or the floor. ASR can only search for objects that are part of demonstrated scene categories. ASR also assumes that the environment is static during object search, as opposed to approaches ([13]) that address dynamic scenes.

Scene Recognition
Scene understanding is generally understood as an image labeling problem. It can be addressed by two approaches. Either scenes are derived from detected objects and relations between them, or scenes are derived directly from image data without concepts such as objects, as by [14]. Descriptions of scenes in the form of graphs (including existing objects and relation types), which are derived by an object-based approach, are far more informative for further use, as in our work for object search, than the global labels for images that are instead derived using the "direct" approach. Work by [15] or [16] that follows the object-based approach relies on neural nets for object detection including feature extraction (e.g., by [17], [18]), which they combine with neural nets to generate scene graphs. This is enabled by datasets that include relations ([19], [20]), which have been published in recent years alongside object detection datasets ([21], [22]). These scene graph nets are very powerful but are designed with the goal of learning models of relations that focus on relation types or meanings rather than the spatial characteristics of relations. In contrast, in our work, we want to focus on accurately modeling the spatial properties of relations and their uncertainties. Yet, our model should be able to cope with small amounts of data, since we want to learn it from demonstrations of people's personal preferences concerning object configurations. Indeed, such personal data must be provided by a user, and users want to invest only limited effort in doing so.
Examples of preferences in object configurations are breakfast tables, which hardly any two people will want to have set in the same way. Yet, people will expect household robots to take their preferences into account when arranging objects. For example, [23] address personal preferences by combining a relation model for learning preferences for arranging objects on a shelf with object detection. However, while their approach can even successfully handle conflicting preferences, it can also miss subtle differences between spatial relations and is therefore too coarse for us. Classifiers explicitly designed to model relations and their uncertainties, such as the part-based models [24] from the 2000s, are a more expressive alternative. They also have low sample complexity, making them suitable for learning from demonstrations. Replacing their outdated feature extraction component with CNN-based object detectors or pose estimators (e.g., DOPE [25] or PoseCNN [26]), we obtain an object-based scene classifier that combines the power of CNNs with the expressiveness of part-based models in relation modeling. Thus, in our approach, we combine pre-trained object pose estimators with the relation modeling of part-based models.
Ignoring the outdated feature extraction of part-based models, we note that [27] already successfully used a part-based model, the Constellation Model [28], to represent scenes. Constellation Models define spatial relations using a parametric representation over Cartesian coordinates (a normal distribution), just like the Pictorial Structures Models [29] (another part-based model) do. Recently, [30]'s approach of using probability distributions over polar coordinates to define relations has proven more effective for describing practically relevant relations. Whereas such distributions are more expressive than [23]'s model, they are still too coarse for us. Moreover, they use a fixed number of parameters to represent relations. What would be most appropriate when learning from demonstrations of varying length is a relation model whose complexity grows with the number of training samples demonstrated, i.e., a nonparametric model [10]. One such flexible model is the Implicit Shape Model (ISM) of [31], [32]. Therefore, we chose ISMs as the basis for our approach. One shortcoming that ISMs have in common with Constellation and Pictorial Structures Models is that they can only represent a single type of relation topology. However, the topology that yields the best tradeoff between the number of false positives and scene recognition runtime can vary from scene to scene. To account for this, we extended the ISMs to our hierarchical ISM trees. We also want to mention scene grammars ([33]), which are similar to part-based models but motivated by formal languages. They again model relations probabilistically and use only star topologies. For these reasons, we preferred ISMs over scene grammars.

Object Pose Prediction
The search for objects in 3-D has been addressed as an active-vision ([34], [35], [36], [37]) and a manipulation problem ([38], [39], [40]). Active-vision approaches are divided into direct ([41], [42]) and indirect ([43]) search depending on the type of knowledge about potential object poses used. Indirect search uses spatial relations to predict the poses of searched objects from the known poses of other objects. Indirect search can be classified according to the type ([44]) of relations used to predict poses. [45], [46], [47] are examples of using relations that correspond to natural language concepts such as 'above' in robotics. Even though such symbolic relations provide high-quality generalization, they can only provide coarse estimates of metric object poses, which are too coarse for many object search tasks. For example, [48] successfully adapted and used Boltzmann Machines to encode symbolic relations. Representing relations metrically with probability distributions showed promising results in [49]. However, their pose predictions are derived exclusively from known locations of individual objects, which leads to ambiguities that can be avoided when using scenes instead.

Overview and ISMs as by Leibe et al. ([31], [32])
In Sec. 1, we introduced scene recognition as a black box that receives estimates for objects as input. From these, it derives as output the present instances of scene categories. The concrete process of how our approach to scene recognition works is shown in Figure 4. Firstly, external object pose estimators derive the types and poses of the present objects. Hence, a physical object configuration is transformed into a set of 'object estimates'. These estimates are passed on to each of our scene classifiers. Every scene classifier then returns estimates for the presence of instances of one scene category, including where these instances are located in 3-D space. Each scene classifier models a scene category and is learned from a recording of a human demonstration. Such a demonstration of object configurations is shown in part 2 of Figure 5. A plate and a cup are pushed from left to right, yielding two parallel trajectories. Please note that we use pre-trained object pose estimators to record object poses during demonstrations. One of the demonstrated configurations is visible in part 1 of Figure 5.
In [3], we redefined Implicit Shape Models (ISMs) so they would represent scenes (which consist of objects), instead of objects (which consist of object parts). The original ISMs were used to recognize objects in 2-D images and are similar to the Generalized Hough Transform ([50]). As [51] write, this transform is used to "locate arbitrary shapes with unknown position, size and orientation" in images. It does so by gathering evidence about the said properties of a shape from the pixel values in an image. Evidence gathering is implemented by learning a mapping from pixels to shape parameters for each shape. Using this mapping, pixels cast votes in an accumulator array for different parameter values. The parameter values of the present shapes can then be determined from the local maxima in this array.

Single ISMs as Scene Classifiers - Previous Work
The learning ([3]) of a similar mapping for our 'scene classifier ISMs' can be thought of as adding entries to a table, similar to K-Nearest Neighbors ([52]). In [3], we transformed the absolute poses in the trajectories of interrelated objects into relative poses, which we then stored in a table. Thus, the learning did not involve the optimization of parameters of a model. Instead, ISMs represented spatial relations as sequences of 6-DoF relative poses. Nevertheless, scene recognition could be done efficiently as it mainly consisted of highly parallelizable matrix operations. We did not arbitrarily decide which pairs of absolute poses to convert into relative poses. Instead, among the objects in a scene category, we set one as the reference object to which all relative poses pointed. In part 2 of Figure 5, for example, the cornflakes box is the reference. Accordingly, all relative poses in part 3 of Figure 5 (where each relation consists of all visualized arrows of one color) point from the plate and the cup to the box. Like the Generalized Hough Transform, scene recognition ([3]) with a single ISM started with a vote. Instead of letting pixels vote, the known objects cast votes starting from the place where they had been located. The voting was done by combining the estimated pose of the respective object with all those relative poses in the table of the ISM that were assigned to this object. The visualization of a vote in part 4 of Figure 5 shows at which poses the plate and the cup respectively expect the box. The votes cast are entered into the 3-D accumulator array shown in part 5 of Figure 5. Once the voting was completed, we searched this array for the most comprehensive and consistent combinations of votes from the objects in a scene category. We identified the top-rated combinations as instances of the scene category the ISM modeled. Note that it did not matter whether the reference object was present in that combination; a missing reference did not cause recognition to fail.
To avoid a combinatorial explosion during this search, we only compared votes that had fallen into the same bin of the accumulator, using a method similar to the Mean-Shift Search proposed by [32]. This procedure allowed for discarding votes from irrelevant objects, i.e., objects that did not belong to the modeled scene category. Since a single ISM only ever relates one reference object to all other objects in a scene, it can only represent a star-shaped topology of relations. This could lead to false positives in scene recognition whenever only relations between the other objects were violated. For example, in part 6 of Figure 5, we swapped the cup and the plate. Nevertheless, the ISM considers this configuration a valid instance of its scene category. Hence, star-shaped topologies and single ISMs are not sufficient to reliably recognize many scene categories.
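The table-based learning and the voting described above can be illustrated with a deliberately simplified sketch. It assumes planar poses (x, y, theta) instead of 6-DoF poses, a fixed-size position grid as accumulator, and a confidence defined as the share of modeled objects that vote into the best bin. The function names, the bin size, and the demonstration data are illustrative assumptions and not taken from [3].

```python
import math
from collections import defaultdict

def compose(a, b):
    """Apply planar pose b in the frame of planar pose a; poses are (x, y, theta)."""
    ax, ay, at = a
    bx, by, bt = b
    return (ax + math.cos(at) * bx - math.sin(at) * by,
            ay + math.sin(at) * bx + math.cos(at) * by,
            at + bt)

def inverse(a):
    """Inverse of a planar pose."""
    ax, ay, at = a
    c, s = math.cos(at), math.sin(at)
    return (-(c * ax + s * ay), s * ax - c * ay, -at)

def learn_ism(trajectories, reference):
    """Build the table: object name -> poses of the reference relative to that object."""
    table = defaultdict(list)
    for name, poses in trajectories.items():
        for obj_pose, ref_pose in zip(poses, trajectories[reference]):
            table[name].append(compose(inverse(obj_pose), ref_pose))
    return dict(table)

def recognize(table, estimates, bin_size=0.1):
    """Let every detected object vote for the reference pose and rate the best bin."""
    accumulator = defaultdict(set)
    for name, pose in estimates.items():
        for relative in table.get(name, []):
            x, y, _ = compose(pose, relative)
            key = (math.floor(x / bin_size), math.floor(y / bin_size))
            accumulator[key].add(name)  # only votes in the same bin are compared
    if not accumulator:
        return None, 0.0
    best = max(accumulator, key=lambda k: len(accumulator[k]))
    return best, len(accumulator[best]) / len(table)

# Hypothetical demonstration: all three objects are pushed to the right together.
demo = {
    "Plate": [(0.1 * t, 0.0, 0.0) for t in range(3)],
    "Cup": [(0.1 * t, 0.3, 0.0) for t in range(3)],
    "Box": [(0.1 * t + 0.05, 0.65, 0.0) for t in range(3)],
}
table = learn_ism(demo, reference="Box")
scene = {"Plate": (2.0, 1.0, 0.0), "Cup": (2.0, 1.3, 0.0), "Box": (2.05, 1.65, 0.0)}
best_bin, confidence = recognize(table, scene)
```

Because the plate and the cup vote for the same bin even when the box itself is missing from `scene`, the sketch also reflects the remark that a missing reference object does not cause recognition to fail.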

Implicit Shape Model Trees - Outline
Instead, we create a scene classifier that supports all connected relation topologies by first partitioning the given relation topology into star-shaped subtopologies, which are then assigned to separate ISMs. Based on this partitioning, we assemble the ISMs and connect them into a tree, creating a compound hierarchical model of the initial relation topology: the ISM tree. To avoid a combinatorial explosion when using an ISM tree for scene recognition, we take the precaution that only a restricted amount of data, the most comprehensive and consistent combinations of votes in each ISM, is shared between connected ISMs. Such an approach could have led to false negatives in scene recognition. However, such an effect was not observed during our experiments in Sec. 5. Before we detail this article's contributions to the ISM trees, we present the assumptions and notation used throughout the article in Sec. 3.4 and outline a technique from our previous work which partitions connected topologies into stars in Sec. 3.5. Sec. 3.6 introduces a novel algorithm for generating ISM trees from these stars. As yet another contribution, we present an algorithm for recognizing scenes with ISM trees in Sec. 3.7.

Preliminaries - Definitions for Scene Recognition
We define an object $o$ as an entity whose state $E(o, t) = (c, d, T)$ at a point in time $t$ is estimated from sensor data. The state is described by a triple consisting of a label $c$ indicating the object class, a label $d$ used to distinguish between different objects of the same class, and a transformation matrix $T \in \mathbb{R}^{4 \times 4}$ indicating the pose of the object. A scene category $S = (\{o\}, \{R\})$ consists of objects and the spatial relations $\{R\}$ between the objects. The identity of a scene category is defined by a label $z$, and each spatial relation is represented as a set of relative 6-DoF poses $\{T_{jk}\}$. In scene recognition, the fit between a configuration $\{E(o, t)\}$ of objects (a set of states) and the model of a scene category is estimated. If this fit, whose degree is indicated by a confidence level $b(I_S) \in [0, 1]$, is sufficiently good, we consider the objects as an instance $I_S$ of the scene category and locate it at a pose $T_F$. Models of scene categories are learned from trajectories demonstrated over $l$ time steps for each object included in the category. Each trajectory is a sequence $J(o) = (E(o, 1), \ldots, E(o, l))$ of estimates of the time-variant state $E(o, t)$ of an object.
When modeling a scene category with an ISM tree, pairs of trajectories are converted into spatial relations. The relations are stored in a table, as outlined in Sec. 3.2. A relation topology $\Sigma = (\{o\}, \{R\})$ describes the same thing as a scene category, but at a different level of abstraction. In a topology, relations are represented on a purely algebraic level instead of explicitly considering their spatial properties as scene categories do. We distinguish the following types of topologies: star topologies $\Sigma_\sigma$, in which a single object $o_F$ (the reference object) is connected to all other objects by one relation each; complete topologies, in which every object is connected to all other objects; and connected topologies $\Sigma_\nu$, in which each pair of objects is connected by a sequence of relations.
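The three topology types can be made concrete with a small sketch that treats a relation topology as an undirected graph. It assumes that relations are given as unordered pairs of object names; the function names and the example objects are illustrative and not part of the notation above.

```python
from collections import deque

def adjacency(objects, relations):
    """Map each object to the set of objects it shares a relation with."""
    adj = {o: set() for o in objects}
    for a, b in relations:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def is_star(objects, relations):
    """One object is related to all others, and there are no further relations."""
    adj = adjacency(objects, relations)
    n = len(objects)
    distinct = len({frozenset(r) for r in relations})
    return distinct == n - 1 and any(len(nb) == n - 1 for nb in adj.values())

def is_complete(objects, relations):
    """Every object is related to every other object."""
    adj = adjacency(objects, relations)
    return all(len(nb) == len(objects) - 1 for nb in adj.values())

def is_connected(objects, relations):
    """Each pair of objects is linked by a sequence of relations (breadth-first check)."""
    adj = adjacency(objects, relations)
    seen = {objects[0]}
    queue = deque([objects[0]])
    while queue:
        for nb in adj[queue.popleft()]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == len(objects)

# Hypothetical three-object example with the box as reference object.
objects = ["Box", "Plate", "Cup"]
star = [("Box", "Plate"), ("Box", "Cup")]
complete = star + [("Plate", "Cup")]
```

Under these definitions, every star topology and every complete topology is also connected, which is why connected topologies are the most general input considered here.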

Relation Topology Partitioning - Previous Work
An ISM tree is learned in two steps. In step one, the relations in a scene category are distributed across several ISMs. Step one is covered in this subsection and is part of our previous work ([11]). In step two, the ISMs are then combined into a hierarchical scene classifier. We refer to step two as the tree generation, which is one of the three contributions of this article. It is introduced in the next subsection. Let us assume for step one that a connected relation topology $\Sigma_\nu$ was given for a scene category $S$.
Step one distributed the relations in the scene category by partitioning this so-called input topology $\Sigma_\nu$ into a set of star-shaped subtopologies $\{\Sigma_\sigma(j)\}$. The partitioning was performed using a depth-first search that successively selected objects $o_M$ in the topology that were involved in as many relations $R$ as possible. We considered each selected object as the center of a star topology. This star topology also included the relations $\{R_M\}$ in which the center participated and the neighborhood $N(o_M)$ of the center, i.e., all objects connected to the center by the relations $\{R_M\}$.
We illustrate how this depth-first search works in Figure 6 using the scene category "Setting-Ready for Breakfast" whose connected relation topology was partitioned into five star topologies in five iterations $j \in \{1, \ldots, 5\}$. In video clip 1 ("Demonstration of object configurations for learning a scene classifier"), we provide footage from the demonstration we recorded for this scene category. The recorded dataset consists of object trajectories that are 112 time steps long. The star topology we extracted first, on the left of Figure 6, had "PlateDeep" as its center and all other objects as its neighborhood. We selected the center for the next star topology to be extracted within this neighborhood. We stored the order in which objects $o$ in the input topology $\Sigma_\nu$ would have been chosen as centers for star topologies $\Sigma_\sigma$ in a height function $h_{\{\Sigma_\sigma\}}(o)$. This order would correspond to a breadth-first search. The height function is defined for each object and will be used as a balancing criterion when generating ISM trees in the next subsection, ensuring that the height of the generated tree is minimized. By favoring objects with high degrees, the depth-first search in this subsection ensures that as few star topologies as possible are extracted. All five star topologies extracted from the input topology for "Setting-Ready for Breakfast" can be seen in the leftmost column in Figure 7. Since a depth-first search can completely search any connected graph or relation topology, and its search tree consists of the star topologies we want to extract, we can find a partitioning for any connected input topology.
Figure 7: Algorithm - How an ISM tree is generated from the star topologies shown in the leftmost column (see part 1). In each of the five depicted iterations, a star topology is selected to learn a single ISM using the object trajectories for scene category "Setting-Ready for Breakfast". Dashed boxes show the selected stars, whereas arrows link these stars to the learned ISMs.
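The partitioning can be sketched as follows. This is a minimal illustration under stated assumptions and not the procedure from [11]: the search starts at the object with the most relations, each visited center claims all relations that no earlier star has claimed, and the height function is approximated by the breadth-first distance from the first center. The object names other than "PlateDeep" and "KnifeLeft" and the example topology are hypothetical.

```python
from collections import deque

def partition_into_stars(objects, relations):
    """Split a connected topology into stars; returns (stars, height)."""
    remaining = {o: set() for o in objects}
    for a, b in relations:
        remaining[a].add(b)
        remaining[b].add(a)
    original = {o: set(nb) for o, nb in remaining.items()}
    first = max(objects, key=lambda o: len(remaining[o]))  # highest degree first
    stars = []

    def visit(center):
        neighbors = sorted(remaining[center])
        if not neighbors:
            return  # all relations of this object already belong to other stars
        for nb in neighbors:
            remaining[nb].discard(center)
        remaining[center].clear()
        stars.append((center, neighbors))
        # Depth-first descent, preferring neighbors with many unassigned relations.
        for nb in sorted(neighbors, key=lambda o: -len(remaining[o])):
            visit(nb)

    visit(first)

    # Assumed stand-in for the height function: breadth-first distance to the first center.
    height = {first: 0}
    queue = deque([first])
    while queue:
        current = queue.popleft()
        for nb in sorted(original[current]):
            if nb not in height:
                height[nb] = height[current] + 1
                queue.append(nb)
    return stars, height

# Hypothetical connected topology around "PlateDeep".
objects = ["PlateDeep", "Cup", "KnifeLeft", "Fork", "Spoon"]
relations = [("PlateDeep", "Cup"), ("PlateDeep", "KnifeLeft"), ("PlateDeep", "Fork"),
             ("KnifeLeft", "Spoon"), ("Cup", "Spoon")]
stars, height = partition_into_stars(objects, relations)
```

In this sketch every relation ends up in exactly one star, because a relation is removed from the pool as soon as one of its two objects is visited as a center.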

Contribution 1 - Generation Algorithm for ISM Trees
Having obtained a set of star topologies in step one, the task in step two is to generate an ISM tree from them. As one of the three contributions of this article, the algorithm we present here models all extracted star topologies by separate ISMs $m$, which must, however, be linked together to form a tree. Such a tree, generated from the five star topologies shown in the upper left in Figure 7, is visualized as a directed graph in Figure 8. This tree consists of a set $\{m\}$ of five connected ISMs arranged in two levels. At the top of the tree is the root ISM $m_R$, where intermediate results are merged from the four other ISMs $m \in \{m\}$ below. All results we obtain from single ISMs in the tree are hereinafter referred to as recognition results $I_m$. We use this term to distinguish between results of single ISMs and the instances $I_S$ of a scene category $S$ that result from recognizing scenes with an entire ISM tree.
Within an ISM tree, we also distinguish between real objects found at the leaves $o_L$ and placeholder objects $o_F$ found at the internal vertices of the tree. (Leaves and internal vertices are both represented as circles in Figure 8. Internal vertices are named after the scene category and connected to ISMs by green arrows. ISMs are visualized as boxes.) In Figure 8, all inputs for the ISMs at tree level 1 are leaves. Each ISM at this level models relations between real objects and a reference object $o_F$, but compared to Sec. 3.2 this reference is now a placeholder object in its own right. This placeholder object is used as an interface to pass recognition results $I_m$ of an ISM $m$ to another ISM $m' \in \{m\}$ at the next lower level in the tree for further processing. In Figure 8, the reference object "setting sub1" is used to pass results from the ISM on the lower left to the root ISM. In the root ISM, this reference object is treated as a regular object whose relations to other objects are modeled.
Step two generates ISM trees through two nested loops. An outer loop converts a star topology into a single ISM in each iteration step, while an inner loop is responsible for attaching the just generated ISM at the appropriate place in the tree. How this is done for our ongoing example on the scene category "Setting-Ready for Breakfast" can be seen in Figure 7. In this figure, the iterations of the outer loop are visualized column by column from left to right, while the inner loop traverses the star topologies in each column from top to bottom. The order in which the outer loop selects star topologies from the previously extracted set is given by the height function $h_{\{\Sigma_\sigma\}}(o)$ from the previous subsection. This function allows the star topologies $\Sigma_\sigma(j)$ to be processed in the reverse of the order in which they would have been extracted in a breadth-first search. This ensures that star topologies that could be located in the highest levels of the tree to be generated are attached as close to the root as possible. This minimizes the actual height of the tree. On the left in Figure 7, the last extracted star topology with "KnifeLeft" as its center is accordingly converted into an ISM in the first iteration of the outer loop. The conversion is done using the ISM learning technique from Sec. 3.2 ([3]). A single ISM $m$ is created from this topology and the trajectories $J(o)$ demonstrated. Such an ISM is shown in Figure 7 on the left of the lower dark green area.
Before the next iteration of the outer loop can begin, the inner loop still has to answer the question to which ISM $m'$ the newly created ISM $m$ should be connected. This connection is [...] To minimize the height of the resulting ISM tree, the inner loop starts its search for a star topology $\Sigma_\sigma(k)$ suitable for this substitution at the topologies that minimize the height function $h_{\{\Sigma_\sigma\}}(o)$ and thus would be located as close to the root as possible. In the leftmost column in Figure 7, the center "KnifeLeft" of the just selected star topology $\Sigma_\sigma(j)$ is found in the topmost topology $\Sigma_\sigma(k)$, which has "PlateDeep" as its center. In the topmost topology, "KnifeLeft" is replaced by the reference object "setting sub3" of the ISM just created (see footnote 8).
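The interplay of the two loops can be sketched as follows. This is an illustration under stated assumptions rather than the article's algorithm: ISM learning is reduced to recording which objects an ISM covers, reference objects are named by a hypothetical scheme `<scene>_sub<index>`, and every non-root center is assumed to occur in a star whose center has a smaller height. The stars, heights, and object names other than "PlateDeep" and "KnifeLeft" are made up for the example.

```python
def generate_ism_tree(stars, height, scene="setting"):
    """stars: center -> list of neighbors; height: object -> int (0 at the root).

    Returns a list of (reference_name, covered_objects) pairs, root last.
    """
    pending = {center: list(nbs) for center, nbs in stars.items()}  # work on copies
    root_center = min(pending, key=lambda c: height[c])
    # Outer loop order: stars whose centers are farthest from the root come first.
    order = sorted((c for c in pending if c != root_center),
                   key=lambda c: height[c], reverse=True)
    tree = []
    for index, center in enumerate(order):
        reference = f"{scene}_sub{index}"
        tree.append((reference, [center] + pending.pop(center)))
        # Inner loop: look for the center in the stars closest to the root first.
        for other in sorted(pending, key=lambda c: height[c]):
            if center in pending[other]:
                pending[other] = [reference if o == center else o
                                  for o in pending[other]]
                break
        else:
            raise ValueError(f"no star left to attach {center!r} to")
    tree.append((scene, [root_center] + pending.pop(root_center)))
    return tree

# Hypothetical input: three stars extracted from a table-setting topology.
stars = {"PlateDeep": ["Cup", "KnifeLeft", "Fork"],
         "KnifeLeft": ["Spoon"],
         "Cup": ["Saucer"]}
height = {"PlateDeep": 0, "Cup": 1, "KnifeLeft": 1, "Fork": 1, "Spoon": 2, "Saucer": 2}
tree = generate_ism_tree(stars, height)
```

After the call, the root entry lists the placeholder names in place of "KnifeLeft" and "Cup", which mirrors the substitution of a star center by the reference object of the ISM created for it.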

Contribution 2 - Recognition Algorithm for ISM Trees
We concretize our definition of scene recognition from Sec. 3.4 as follows for ISM trees: From an object configuration such as the table setting in part 1 of Figure 8, more specifically from the estimated states $\{E(o, t)\}$ of the objects (the object pose estimation is omitted in Figure 8 for simplicity), we want to derive instances $I_S$ of a scene category $S$, like the one shown in part 3 of Figure 8. Our algorithm for scene recognition using ISM trees is another contribution of this article and involves two steps: an evaluation step exemplified in part 2 of Figure 8 and an assembly step exemplified in part 4 of Figure 8. Both steps are detailed in this subsection (pseudocode for the evaluation and assembly steps is provided in the appendix in Algo. 2 to 4). In the evaluation step, all the single ISMs $m$ in a tree are evaluated one by one, e.g., the five ISMs in the example tree in part 2, and all their respective recognition results $I_m$ are stored for the assembly step. In the assembly step, the recognition results from different ISMs that belong to the same instance of a scene category are combined.
Footnote 8: Substitutions by reference objects in star topologies are colored green.
The evaluation step solves two problems: it defines an order in which the ISMs are evaluated, and it provides an interface for exchanging recognition results between ISMs. The actual evaluation of each ISM draws on the technique for classifying scenes with a single ISM, which stems from our previous work ([3]) and is outlined in Sec. 3.2. In an ISM tree, the ISMs cannot all be evaluated simultaneously, since some ISMs m ′ are supposed to further process the intermediate results of other ISMs m. These connections between pairs of ISMs, induced by reference objects o F , must be taken into account. For example, the evaluation of root ISM m R (visualized as a box at tree level 0 in 2 in Figure 8) cannot begin until the evaluation of all four ISMs m k with k ∈ {0, . . ., 3} at tree level 1 (the dark green area) is completed. By considering these connections, the evaluation step maximizes efficiency because each ISM is evaluated exactly once during scene recognition.
The evaluation step begins by sorting the ISMs according to their levels in the tree. This sorted list is traversed using two nested loops such that all recognition results from ISMs at tree level n can be stored before the evaluation of the ISMs at tree level n − 1 begins. In 2 in Figure 8, this equates to evaluating the ISMs line by line from bottom to top. If only real objects, i.e., only leaves o L and no internal vertices, are involved in the evaluation of the ISMs at a certain level, it is sufficient that the evaluation step distributes the different object states {E(o, t)} that describe the object configuration to the appropriate ISMs. At tree level 1 in 2, for instance, this is the case. If internal vertices are involved in an ISM, reference objects o F , more precisely their placeholder states E(o F ), should be computed beforehand.

Each ISM m that is not the root ISM may pass anywhere from zero to a multitude of recognition results to another ISM m ′ in the tree. When the ISM m ′ is evaluated, each of these results is considered as a separate input, which yields even more recognition results in this ISM. These results, in turn, are passed on to a third ISM. To avoid a combinatorial explosion in scene recognition, we implemented two strategies that mitigate this effect: Firstly, when we generate ISM trees, the height function h {Σ σ } (o) is used to minimize tree heights and thus the lengths of the chains of interdependent ISMs in a tree. Secondly, the number of placeholder states E(o F ) emanating from each ISM is limited by discarding all recognition results that have been assigned too low confidence levels b(o F ).
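The level-wise order of the evaluation step and the confidence-based pruning of placeholder states can be illustrated with the following sketch. It is a simplified stand-in for the pseudocode in the appendix, not a copy of it: the dictionary layout of `isms`, the callback `evaluate_ism` (standing in for the single-ISM recognition from Sec. 3.2), and the parameter `min_confidence` are assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[str, object]     # (object or reference-object name, pose)
Result = Tuple[object, float]  # (pose of the reference object, confidence level)


def evaluate_tree(
    isms: Dict[str, dict],
    object_states: List[State],
    evaluate_ism: Callable[[str, List[State]], List[Result]],
    min_confidence: float,
) -> Dict[str, List[Result]]:
    """Evaluate every ISM exactly once, from the deepest tree level up to the root."""
    results: Dict[str, List[Result]] = {}
    placeholders: List[State] = []
    # Deepest level first, root (level 0) last.
    for name in sorted(isms, key=lambda n: isms[n]["level"], reverse=True):
        wanted = set(isms[name]["inputs"])
        # Distribute real object states and placeholder states to this ISM.
        states = [s for s in object_states + placeholders if s[0] in wanted]
        results[name] = evaluate_ism(name, states)
        # Only sufficiently confident results are passed on as placeholder states.
        placeholders += [(name, pose) for pose, conf in results[name]
                         if conf >= min_confidence]
    return results
```

All results are kept in `results` for a later assembly step, while only the results above the threshold reach the ISMs at the next higher level.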
The evaluation step ends once it has evaluated the ISM at the root. The recognition results from the different ISMs are visualized as clouds in 4 in Figure 8. The results are connected through horizontal green arrows to those ISMs where they were computed. The task of the assembly step that now begins is to determine across ISMs which recognition results belong to the same instance I S of a scene category and to assemble such instances. As in 4, the assembly step starts at the results of the root.

Relation Topology Selection -Previous Work
While we explained how we partition relation topologies in Sec. 3.5, we did not address how to determine the relation topology to partition. Our novel generation algorithm can derive an ISM tree for any kind of connected topology, but not every topology is equally suitable for learning a classifier. Figure 9 illustrates how omitting the wrong relations can lead to recognition errors, i.e., false-positive results. We define a scene category instance I S to be a false positive if scene recognition assigns it a confidence level b(I S ) that exceeds a given threshold, whereas its underlying object configuration {o ′ } does not sufficiently match that scene category. 1 in Figure 9 visualizes a result of scene recognition for the "Setting-Ready for Breakfast" category. To generate the tree employed here, we used a star topology whose center is the green plate.
From a valid place setting (as in 1 and 3 in Figure 9), we expect that utensils such as forks, knives, and spoons are on the "correct" sides of the plate. In addition to this first set of rules, others require that forks, knives, and spoons be oriented parallel to each other. There are also rules regarding the relative distances of utensils from the edge of the table. If a star topology is used to cover the first set of rules, the other rules cannot be modeled with this topology. For this reason, the ISM tree from a star topology already used in 1 in Figure 9 returns a false positive in 2. The invalid configuration {o ′ } shown in 2 differs from the valid {o} in 1 in that several relative poses between object pairs that do not involve the plate are invalid. The ISM tree, however, does not notice these differences, which are visualized by yellow arrows in 2.¹¹ To prevent false positives, ISM trees could instead be learned from all n(n − 1)/2 spatial relations that can be defined for a set of n objects, i.e., from a complete relation topology. The fact that ISM trees from such complete topologies do not yield false positives is illustrated in 3 and 4 in Figure 9. In 3 and 4, such a tree is applied to the object configurations {o}, {o ′ } from 1, 2 in Figure 9. The result in 4 is not a false positive, as some of the ISMs in the tree recognize that some of the relations modeled by them are not fulfilled.¹² However, a disadvantage of complete topologies is the excessive number of relations that must be checked during scene recognition. In general, the cost of scene recognition with ISM trees is closely related to the number of relations represented. The fact that recognition with complete topologies is generally intractable has also been reported ([24]) for other part-based models.
The question arose of how to find a connected topology that lies between the two edge cases, the star and the complete topology. Such a topology would yield an ISM tree that combines efficiency and representational power. To identify such a relation topology as generically as possible, we used two domain-unspecific goodness measures in our previous work ([11]): the false-positive rate numFPs() in scene recognition and the average time consumption avgDur() of scene recognition. Based on these measures, we formalized the selection of relation topologies as a combinatorial optimization problem. The challenge in this selection is the exponential number 2^{n(n − 1)/2} of relation topologies that can be defined for n objects. Given the number of topologies among which to choose, we used a local search technique to develop a Relation Topology Selection procedure. Its basic idea was to iteratively adjust a relation topology by adding, removing, or exchanging relations until a topology was found that contained only those relations that were most important for recognizing a scene category. The result, a so-called optimized topology, was then used to learn an ISM tree from it.
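To give a feeling for the size of this search space, the two counts mentioned above can be evaluated directly. The helper names `num_relations` and `num_topologies` are chosen for this example only.

```python
def num_relations(n: int) -> int:
    """Number of pairwise relations in a complete topology over n objects."""
    return n * (n - 1) // 2


def num_topologies(n: int) -> int:
    """Number of relation topologies: every relation is either present or absent."""
    return 2 ** num_relations(n)


if __name__ == "__main__":
    for n in (4, 6, 8, 10):
        print(n, num_relations(n), num_topologies(n))
```

For ten objects, a complete topology already contains 45 relations, and 2^45 topologies would have to be compared, which is why an exhaustive search is out of the question and a local search is a natural choice.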

State Machine and Next-Best-View -Previous Work
In the previous section on Passive Scene Recognition (PSR), we ignored the question of under which conditions object pose estimation can obtain "object estimates" for scene recognition. Our approach to creating suitable conditions in spatially distributed and cluttered indoor environments is to have a mobile robot adopt camera views from which it can perceive searched objects. To this end, in two previous works ([7], [8]), we introduced Active Scene Recognition (ASR), an approach that connects PSR with three-dimensional object search within a decision-making system. We implemented ASR as a state machine consisting of two search modes (states), DIRECT SEARCH and INDIRECT SEARCH, that alternate. We then integrated this state machine with the MILD robot shown in Figure 1 so that ASR can decide on the presence of n scene categories in the environment visible in Figure 2.
ASR starts in DIRECT SEARCH mode, which is tasked with acquiring initial object estimates. For this purpose, we developed two strategies to identify suitable camera views and move to them. The first ("informed") strategy is based on prior knowledge about possible placements of objects, e.g., from demonstrations of scene categories. If this informed search does not yield object estimates, an uninformed strategy ([6]) is used to explore the entire environment uniformly. As soon as at least one object estimate is obtained, the direct search stops, and the INDIRECT SEARCH mode starts instead.
The other mode, INDIRECT SEARCH, consists of a loop in which three substates (Passive Scene Recognition, a technique for predicting the poses of searched objects, and 3-D object search) alternate. The loop starts in the first substate SCENE RECOGNITION, in which PSR is performed with ISM trees on the currently available object estimates. The results of SCENE RECOGNITION are instances of scene categories. Some instances may not contain all objects belonging to their category. Therefore, it is the task of the other two substates in the loop to complete such partial instances. The second substate OBJECT POSE PREDICTION uses ISM trees to predict locations of objects that would allow completion of these instances. When using ISM trees, some object poses may need to be predicted using entire sequences of spatial relations. This is prone to a combinatorial explosion: An algorithm presented in our previous work ([7]) for predicting poses suffered from such an explosion. In the next subsection, we address this problem with an efficient algorithmic solution (one of the contributions of this article).
The third substate RELATION BASED SEARCH of the loop uses predicted object poses to search for these objects in 3-D, i.e., to determine camera views that are promising for finding them. Whenever such a view has been determined, the robot moves there and tries to localize objects in 6-DoF. In [8], we formalized finding suitable camera views as a Next-Best-View (NBV) optimization problem. The algorithm with which we addressed this problem had to search for a camera view that maximized an objective function, starting from predicted object poses and the current robot pose. This objective function modeled the success probability of object localization as well as the time required to reach the view and perform localization. Our approach allowed both optimizing the views and deciding which objects to search for in them.
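The interplay of the two search modes can be summarized in a compact control-flow sketch. It is only meant to illustrate the alternation described above under simplifying assumptions: the four callbacks are hypothetical stand-ins for the components of the robot system, and the termination rule (stop when nothing can be predicted or nothing new is found) is a simplification rather than the behavior of the actual state machine.

```python
from typing import Callable, List


def active_scene_recognition(
    direct_search: Callable[[], List],
    recognize_scenes: Callable[[List], List],
    predict_poses: Callable[[List], List],
    relation_based_search: Callable[[List], List],
    max_iterations: int = 50,
) -> List:
    """Simplified ASR loop: DIRECT SEARCH once, then the INDIRECT SEARCH loop."""
    # DIRECT SEARCH: acquire initial object estimates.
    estimates = list(direct_search())
    if not estimates:
        return []
    instances: List = []
    for _ in range(max_iterations):
        # Substate SCENE RECOGNITION: PSR on all estimates gathered so far.
        instances = recognize_scenes(estimates)
        # Substate OBJECT POSE PREDICTION: hypotheses for missing objects.
        predictions = predict_poses(instances)
        if not predictions:
            break
        # Substate RELATION BASED SEARCH: look for objects at predicted poses.
        new_estimates = relation_based_search(predictions)
        if not new_estimates:
            break
        estimates = estimates + list(new_estimates)
    return instances
```

Each pass through the loop re-runs scene recognition on the enlarged set of estimates, so partial instances can be completed step by step.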

Contribution 3 -Object Pose Prediction Algorithm
Our approach to predicting the poses of searched objects with the help of ISM trees is similar to an inversion of scene recognition.Scene recognition infers from known states of objects which instances of a scene category S the states correspond to.Instead, object pose prediction infers hypotheses about the possible poses T P of the missing objects o P from a known scene category instance I S and its location T F .Since these predicted poses must be suitable for 3-D object search, both the 3-DoF positions of the missing objects and their 3-DoF orientations must be predicted.Knowledge about the expected orientation of a searched object can determine the success or failure of object localization.The poses predicted by the algorithm presented in this subsection are visualized as coordinate systems in 2 and 4 in Figure 10.ISM trees allow us to infer object poses from spatial relations R, i.e., depending on the known poses T of already found objects o.The flexibility of this approach is illustrated in Figure 10 for the scene category "Setting-Ready for Breakfast": If an incomplete instance of this scene category is rotated as between 1 and 3, the object poses predicted from them in 2 and 4 rotate with it without need for adjustments.
Predicting object poses with ISM trees consists of two steps and is the third and final contribution of this article: In step one, precomputations are performed to identify those parts of an ISM tree that provide a fast and reliable prediction.In step two, these precomputations are used to predict the poses of searched objects.Figure 11 refers to the ISM tree which models scene category "setting" and has already been used in Figure 8.Since some objects from the scene category are involved in multiple relations, several leaves o L in the tree correspond to the same object.This way, "ForkRight" is represented at both levels of the tree.To predict object poses using the leaf for "ForkRight" at tree level 1, one would have to combine spatial relations from the ISMs "setting" and "setting sub1".On level 0, a single relation in the ISM "setting" is sufficient.Since the accuracy of predicted poses depends on the number of relations used, step one precomputes the shortest sequences of ISMs between any object o in a scene category S and the root m R of the tree.All nontrivial sequences are defined as paths P(m R , o) consisting of l pairs (m k , m k+1 ) of connected ISMs (see Eq. 1).Paths end at the ISM m l+1 which contains the leaf o L appropriate for predicting the pose of object o.We compute the shortest paths using a breadth-first search that traverses trees, as in Figure 11, from top to bottom.
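The precomputation in step one can be sketched as a breadth-first search over the tree. In this illustration, an ISM tree is assumed to be a dictionary that maps each ISM name to its inputs, where an input is either a real object (a leaf) or the name of another ISM (a reference object). A path is returned as the list of ISMs from the root to the ISM containing the leaf, rather than as the pairs of connected ISMs used in the formal definition. The function name `shortest_paths` and the underscore in "setting_sub1" are choices made for this example.

```python
from collections import deque
from typing import Dict, List


def shortest_paths(tree: Dict[str, List[str]], root: str) -> Dict[str, List[str]]:
    """Map every object to the shortest ISM sequence from the root to one of its leaves."""
    paths: Dict[str, List[str]] = {}
    queue = deque([[root]])
    while queue:
        path = queue.popleft()
        for child in tree[path[-1]]:
            if child in tree:
                # Reference object of another ISM: descend one tree level.
                queue.append(path + [child])
            elif child not in paths:
                # Leaf: the first visit in breadth-first order is the shallowest one.
                paths[child] = path
    return paths


if __name__ == "__main__":
    example = {"setting": ["PlateDeep", "ForkRight", "setting_sub1"],
               "setting_sub1": ["ForkRight", "ForkLeft"]}
    print(shortest_paths(example, "setting"))
```

In the small example, "ForkRight" is reached through the root ISM alone, while "ForkLeft" requires the two ISMs "setting" and "setting_sub1".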
Step two, the actual pose prediction algorithm, derives possible poses for the searched objects from these paths and a partial scene category instance I S . This is done via three nested loops: The innermost loop¹³ computes exactly one pose estimate T P per searched object o P . Two outer loops¹⁴ call this innermost loop until a specified number n P of poses is predicted for each searched object. Figure 11 shows how the innermost loop operates on an ISM tree. First, as shown by the horizontal green arrow in 1, it passes the pose T F of instance I S to the root ISM. Starting from root ISM m R in 2, it evaluates all ISMs on the shortest path P(m R , o P ) to a suitable leaf. For "ForkLeft", such a leaf is located on the far left of tree level 1. A predicted pose is visualized in 3 as a green circle and connected by a green arrow to the leaf from which it results.
The algorithm originating from our previous work ([7]) was unable to efficiently predict object poses because it processed all the relative poses that make up a spatial relation in an ISM. Across multiple ISMs, this would lead to a combinatorial explosion: If it generated a prediction for each relative pose in a relation of an ISM m k at tree level k and passed the prediction as a possible pose of a reference object to another ISM m k+1 at level k + 1, each of these would be combined with all relative poses in a relation of ISM m k+1 . Also, the algorithm did not use shortest paths. Combinatorial explosion is avoided in the new method presented here by processing one random relative pose per relation instead of all. More precisely, the innermost loop selects one relative pose T jk from each ISM along the shortest path and inverts all such poses.¹⁵ The pose T F of the incomplete instance is multiplied by all these sampled and inverted relative poses T kj so that one of the sought pose hypotheses T P is obtained. For instance, to predict an absolute pose of "ForkLeft", the innermost loop randomly selects one relative pose each from the ISMs "setting" and "setting sub1".

¹³ Pseudocode for the innermost loop is provided in the appendix as Algo. 5.
¹⁴ Pseudocode for the outer loops is provided in the appendix as Algo. 6.
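The sampling idea can be made concrete with a small sketch that represents poses as 4x4 homogeneous transforms in plain Python. Several points are assumptions of this example and not taken from the article: the function names, the layout of `relations` (one list of relative poses per ISM on the path), the right-multiplication order, and the collapsing of the two outer loops into a single one for one searched object.

```python
import random
from typing import Dict, List

Pose = List[List[float]]  # 4x4 homogeneous transform


def matmul(a: Pose, b: Pose) -> Pose:
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t: Pose) -> Pose:
    """Inverse of a rigid transform [R | p], which is [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def predict_pose(t_f: Pose, path: List[str],
                 relations: Dict[str, List[Pose]], rng: random.Random) -> Pose:
    """Innermost loop: one sampled relative pose per ISM on the shortest path."""
    pose = t_f
    for ism in path:
        t_jk = rng.choice(relations[ism])  # one random relative pose, not all
        pose = matmul(pose, invert_rigid(t_jk))
    return pose


def predict_poses(t_f: Pose, path: List[str], relations: Dict[str, List[Pose]],
                  n_p: int, seed: int = 0) -> List[Pose]:
    """Outer loop (simplified): repeat the sampling until n_p poses exist."""
    rng = random.Random(seed)
    return [predict_pose(t_f, path, relations, rng) for _ in range(n_p)]
```

With this scheme, the cost of one prediction grows with the length of the path only, whereas combining every relative pose of one relation with every relative pose of the next would grow with the product of the relation sizes.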

Overview
We present experiments for PSR with ISM trees in Sec. 5.2 and for our approach to ASR in Sec. 5.3. Except for explicitly labeled experiments in Sec. 5.2.5, 5.3.4, and 5.3.5, our PSR and ASR approaches are evaluated exclusively on measurements from physical sensors. The input for these real-world experiments was acquired by the pivoting sensor head of our MILD mobile robot. Our approach to ASR controlled both the sensors and actuators of this physical robot shown in Figure 1. The robot operated in our experimental setup, which mimicked some aspects of a kitchen (see Figure 2). Our approaches to PSR and ASR were run on a PC with an Intel Xeon E5-1650 v3 3.50 GHz CPU and 32 GB DDR4 RAM. In Sec. 5.3.1 to 5.3.3, ISM trees for ten scene categories are used to evaluate our ASR approach. In each subsection, ASR is expected to provide estimates for all existing scenes. In addition to these ASR experiments on a physical robot, we performed an experiment in simulation in Sec. 5.3.4 to compare the time consumption of our approach to ASR with two alternatives.

Scene Category "Office"
In this experiment, we evaluate how well our scene recognition approach captures the properties of spatial relations throughout an ISM tree. We do this by investigating how changes in individual object poses affect the recognized scene category instances. The scene category used is named "Office" and consists of four objects with fiducial markers attached to them to maximize object localization accuracy: Mouse, Keyboard, LeftScreen, and RightScreen. Video clip 2 ("Demonstration of the scene category: Office") shows how we demonstrated this category. 1 in Figure 12 visualizes one of the 51 object configurations included in the dataset for this demonstration. The relative poses that make up the spatial relations of the learned ISM tree are visualized in 3 in Figure 12. The demonstration includes two relative movements between object pairs. The first relative movement involves both screens. It is intended to create a relation that consists of nearly identical relative poses, as depicted in the middle of 3. The second relative movement, between Mouse and Keyboard, is intended to create a much more variable spatial relation. The tree learned for "Office" consists of one ISM labeled "Office" and another labeled "Office sub0".

Parameters Describing Results of Scene Recognition
Scene recognition is performed on nine object configurations to analyze what impact changing the pose of an object has on the parameters in scene category instances. Since scene recognition is deterministic, each configuration is processed only once. In each configuration, an object pose differs either in its position or in its orientation from those expected by the spatial relations in the ISM tree. Tables 1 and 2 show how the ISMs "Office sub0" and "Office" quantify these differences. In both tables, color coding indicates the appropriateness of the values they contain. Green stands for results we consider excellent, yellow for good results, and red indicates problems.
Each estimated scene category instance in Figure 13 can also be represented as a set of parameters whose values can be found in the same row of the two tables.For each object, two compliance parameters express the degree to which its estimated position and orientation comply with a spatial relation in which the object is involved.Formal definitions of these compliances can be found in [9].Compliances are normalized to [0, 1], where 1 expresses a perfect match, and 0 represents a lower bound below which objects are excluded from scene category instances.A similarity measure is derived for each object by multiplying both compliances.Adding up all similarity measures in a table row yields the value of the objective function for an ISM m, the nonnormalized equivalent to its confidence level.This value describes the extent to which all of these objects, either directly involved with the ISM m or involved with another ISM m ′ to which m is related, contribute to the recognition result produced by ISM m.
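The arithmetic behind these parameters is simple enough to spell out. The sketch below computes the similarity measure of each object as the product of its two compliances and sums the similarity measures to obtain the value of the objective function. The function name `objective_value` and the compliance values in the usage example are made up for illustration and are not taken from Tables 1 and 2.

```python
from typing import Dict, Tuple


def objective_value(compliances: Dict[str, Tuple[float, float]]) -> float:
    """Sum over all objects of position compliance times orientation compliance."""
    total = 0.0
    for name, (position_c, orientation_c) in compliances.items():
        if not (0.0 <= position_c <= 1.0 and 0.0 <= orientation_c <= 1.0):
            raise ValueError(f"compliances of {name} must lie in [0, 1]")
        total += position_c * orientation_c  # similarity measure of one object
    return total


if __name__ == "__main__":
    # Hypothetical values: one perfectly matching and one displaced object.
    print(objective_value({"LeftScreen": (1.0, 1.0), "RightScreen": (0.5, 1.0)}))
```

If all compliances equal 1, the result equals the number of objects considered, which is the maximum of the objective function.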

Influence of Object Poses on Passive Scene Recognition
In the uppermost lines of Tables 1 and 2, all compliances concerning positions and orientations are close to one. Thus, the objective function reaches its maximum. The values of the objective function correspond to the number of objects that each ISM considers. In "RightScreen-half-lowered", RightScreen is moved downwards by 0.05 m, as can be seen in 1 in Figure 13. The compliances for "RightScreen-half-lowered" in Table 1 validate that ISM "Office sub0" correctly notices that RightScreen has been shifted, but not rotated. To increase the discrepancy between the positions of LeftScreen and RightScreen, LeftScreen is displaced further upwards by 0.035 m in "RightScreen-fully-lowered". Whereas the nonzero compliances in Table 1 indicate that no object in "RightScreen-half-lowered" has been excluded from the scene category instance, as we intended, this is different in "RightScreen-fully-lowered". The positional difference between the screens is sufficient to exclude one screen. However, the yellow coloring of the corresponding compliances in Table 1 makes it clear that it is suboptimal that ISM "Office sub0" excludes the less-displaced LeftScreen. ISM trees are more sensitive to displacements of reference objects of their ISMs than to those of nonreference objects.
When shifting LeftScreen forwards by 0.05 m in "LeftScreen-half-front" instead of moving RightScreen, scene recognition reveals that displacements are always considered from the perspective of the reference object.The two aforementioned phenomena, although counterintuitive, do not affect the values calculated for the overall objective function and could be systematically compensated.In configuration "RightScreen-half-right", we displace RightScreen this time.We move this screen to the right so that our experiments cover all directions in 3-D space where shifting is possible.When being moved to the right by 0.015 m, RightScreen is displaced less than in "RightScreen-half-lowered".The objective-function values in Table 1 show that ISMs can be sensitive enough to notice such slight differences.
After checking whether ISM trees detect translations of objects, the same should be done for rotations. We rotate LeftScreen by 15° in "LeftScreen-half-rotated" and by 30° in "LeftScreen-fully-rotated". Comparing the configurations in which LeftScreen is rotated with those in which RightScreen is lowered reveals that rotations are just as precisely detected as translations. After having analyzed how changing object poses affects ISM "Office sub0", the next configurations are to show that changes are treated equally in root ISM "Office". In the configuration "Mouse-half-right", Mouse is pushed 0.11 m to the right. This represents a displacement larger than those of both screens together in "RightScreen-fully-lowered". However, Mouse is not excluded from the corresponding scene category instance in 7 in Figure 13. This shows that we can control how permissive spatial relations are through the demonstrations we record. In "Mouse-half-rotated", Mouse is rotated by 15° instead of being shifted. The fact that the objective function of ISM "Office" returns the same value for "Mouse-half-rotated" and "LeftScreen-half-rotated" indicates that we can further influence whether scene recognition is permissive concerning positions or orientations. Overall, these experiments confirm that ISM trees can identify whether an object has been shifted or rotated in various directions. They can also estimate the sizes of such displacements.

Figure 13: Visualization of the scene category instances that an ISM tree for the category "Office" recognized in the physical object configurations "RightScreen-half-lowered" (1), "RightScreen-fully-lowered" (2), "LeftScreen-half-front" (3), "RightScreen-half-right" (4), "LeftScreen-half-rotated" (5), "LeftScreen-fully-rotated" (6), "Mouse-half-right" (7), and "Mouse-half-rotated" (8).

Scene Categories Demonstrated for ASR Evaluation
This subsection is devoted to the scene categories that we demonstrated to evaluate ASR and that are named in Table 3. The next subsection is then devoted to the performance of the ISM trees learned from the demonstrations we recorded for these categories. Demonstrations and the evaluation of ASR took place in the kitchen setup depicted in Figure 2 or 1 in Figure 15. Object configurations were demonstrated in areas of the setup such as the cupboard at the top of 1, the shelves on its right, and the tables. The cupboard and shelves are filled with clutter. We recorded all object poses with the camera head of our MILD robot. Markers are only used on the cutlery to compensate for reflections. The ISM trees for all scene categories in Table 3 result from topologies optimized using the Relation Topology Selection (RTS), which we outlined in Sec. 3.8. This table contains the durations (lengths) of the object trajectories, as well as the numbers of objects in the datasets of each scene category. Some of the categories are visualized in Figure 15. Whereas 1, 2, 6, 7, and 8 show object trajectories and spatial relations, 3, 4, and 5 show snapshots of demonstrations. As ISM trees are generative models, different scene categories can contain the same objects and model similar relations. This also allows searching for different scenes at the same time.
In the different areas of our setup, the objects can be arranged horizontally or vertically in 2D. However, we define scene categories that span multiple areas and thus extend into 3D. For instance, the category "Cereals-on Shelf" in 5, 6, and 8 relates parts of a table setting to the food and drinks stored on the shelves. Food, drinks, and the shelves are also part of "Drinks-on Shelf" in 2 and "Sandwich-on Shelf". The ISM trees for these scene categories contain relations of a considerable length, such as those drawn in 2. The object configurations corresponding to these scene categories are truly three-dimensional, as they extend both horizontally and vertically. A close-up view of the vertical relations in "Cereals-on Shelf" is provided in 6, whereas the horizontal ones are shown in 8. Except for the shelves, the three categories "Sandwich-Setting" in 1 and 3, "Cereals-Setting" in 4 and 7, and "Drinks-Setting" consist of the same objects as their "...-on Shelf" counterparts. The former three expect food and drinks to be located on a table, not on the shelves.

Performance of Optimized ISM Trees
Table 3 shows how the ISM trees for the scene categories from the previous subsection perform concerning the goodness measures numFPs() (given in percent) and avgDur(). One line corresponds to one category. The two measures were defined in Sec. 3.8 for the RTS. We rate their values with color coding, as in Sec. 5.2.2. Additionally, the table specifies the trajectory lengths and numbers of objects for each category and the number of relations it models. The presented values confirm that the runtime of scene recognition with a complete topology is orders of magnitude higher than that with a star. Especially for the larger categories in Table 3, complete topologies are much too inefficient for ASR. It should be noted that high runtimes do not only result from large numbers of objects, but also from long demonstration recordings. This, e.g., explains the runtime difference between "Setting-Ready for Breakfast" and "Cupboard-Filled". The table also displays the high numbers of false positives produced with star topologies, so that they are not an alternative to complete topologies. However, the number of false positives also depends on how much the objects in a category have been moved during a demonstration. If the objects are barely moved, such as for "Sandwich-Setting", a star topology is just as reliable as a complete topology. Overall, however, only optimized topologies achieve simultaneously low values for numFPs() and for avgDur().
In Figure 14, we now consider the runtime of scene recognition with ISM trees from optimized topologies individually. This plot shows average runtimes for different datasets, depending on the number of objects they contain and the length of the recording of their demonstration. Unlike Table 3, which presents results from sensor-recorded demonstrations, all object trajectories for this plot are generated in simulation. Scene recognition is performed on 600 object configurations generated for each dataset in Figure 14. The recognition runtimes for these configurations are given in seconds, and each curve stands for a specific trajectory length. All curves appear to be linear in the number of objects. The slopes of the curves appear to be determined by the trajectory lengths. Thus, the runtimes appear to scale with the product of trajectory length and number of objects. Beyond such favorable time complexity, another measurement suggests that ISM trees are suitable for object search applications: We measured a maximum runtime of 3.71 seconds for ten objects and a trajectory length of 400 samples (20 min when capturing samples every 3 seconds while a demonstration is recorded). The fact that the runtimes with optimized topologies in Table 3 remain under 2.5 seconds further emphasizes that ISM trees are suitable for the state machine we use to implement ASR.

Influence of Object Orientations on Pose Prediction
Unlike Sec. 5.2.3, in this section, we no longer detect objects from a single viewpoint. Here, we investigate how well our ASR approach can recognize scenes whose object configurations cannot be fully perceived from a single viewpoint. We evaluated ASR in our kitchen setup in six experiments. Each experiment was performed twice to account for the positioning uncertainties of our MILD robot. In these experiments, the robot was expected to recognize all existing instances of the scene categories specified in Table 4. The experiments in Sec. 5.3.1 and 5.3.2 analyze how our object pose prediction, and thus ASR, is affected when we change object poses between the previous demonstration and the execution of ASR during the experiment. Sec. 5.3.1 focuses on rotational displacements, whereas Sec. 5.3.2 addresses translational displacements.
In both subsections, we investigate how well ASR can detect all objects in scene category instances despite such displacements and whether it can detect the deviations from learned relations that result from these displacements. The first experiments (s1 e1 and s2 e1) in each subsection are performed on object configurations identical to the demonstration, whereas the second experiments (s1 e2 and s2 e2) cover either rotational or translational displacements. At the beginning of s1 e1 and s1 e2, MILD looks at the upper right corner of the shelves in 1 in Figure 16. 1 visualizes the results of one of the two executions of s1 e1 in Table 4. From there, MILD detects two searched objects. It then predicts object poses on the shelves and the table. NBV estimation minimizes the travel time for MILD by letting ASR search for another object on the shelves. Video clip 3 ("Influence of Object Orientations on Pose Prediction") shows how MILD proceeds further. In the end, instances of the categories "Drinks-on Shelf", "Cereals-on Shelf", "Sandwich-Setting", and "Setting-Ready for Breakfast" are correctly recognized.

Figure 15: Object trajectories demonstrated for the scene categories "Sandwich-Setting", "Drinks-on Shelf", "Cereals-Setting", and "Cereals-on Shelf" in our kitchen setup, as well as the spatial relations inside the ISM trees learned for these categories.

As visible in 2, all searched objects on shelves were rotated for s1 e2. This change affects the categories "Drinks-on Shelf" and "Cereals-on Shelf". That the confidence levels for the two categories fall off differently seems counterintuitive, but is because they do not contain the same number of objects. s1 e2 also shows that ASR requires only a single correctly oriented object (here, the shelves) to find all objects from the same scene category. This is because the shelves cause correctly predicted poses on the table, so ASR can ignore the more distant incorrect predictions at the lower left of 3 that result from the rotated objects. Which views the robot has just adopted and will adopt next is visualized by red and turquoise frustums. Predicted poses that are within the future view are colored blue. The large gap between correct and incorrect pose predictions illustrates the extent to which rotational changes can affect pose prediction when using long spatial relations.

Influence of Object Positions on Pose Prediction
At the beginning of s2 e1, MILD is not standing in front of the shelves, but in front of the table. As shown in 1 in Figure 17, MILD first searches this table. We have recorded in video clip 4 ("Influence of Object Positions on Pose Prediction") how MILD proceeds until all existing scene categories are recognized. The difference between s2 e1 and s2 e2 is that all objects on the table were shifted at the same time using a tray. The confidence levels of those categories in Table 4 that contain only objects on the table remain unchanged. Since all their objects are on the tray, shifting the tray does not affect the relative poses between them. This shows that ISM trees depend only on relative object poses and not directly on absolute object poses. The long relations between the objects on the table and on the shelves are affected by the shift, but only slightly. The predicted poses on the shelves move backwards, as shown in 2 in Figure 17. However, they stay close enough to the shelves, so MILD still finds all objects. Yet, even a small orientation error in an object estimate on the table causes some predicted poses on one shelf to move up one level, as shown on the right in 3. Overall, the effect of rotational deviations on the accuracy of pose prediction depends on the length of the relation used, while that of translational deviations is constant.
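The geometric reason for this difference can be illustrated with a small calculation that is not part of the article: if an object estimate is rotated by an angle θ, a pose predicted at the end of a relation of length L moves along a chord of length 2 L sin(θ/2), whereas a pure translation of the estimate shifts the prediction by the same amount regardless of L. The function name and the numbers in the usage example are hypothetical.

```python
import math


def displacement_from_rotation(length_m: float, angle_deg: float) -> float:
    """Chord length by which a predicted position moves for a given orientation error."""
    return 2.0 * length_m * math.sin(math.radians(angle_deg) / 2.0)


if __name__ == "__main__":
    # Hypothetical relation lengths in meters, each with a 5 degree orientation error.
    for length in (0.2, 2.0):
        print(length, round(displacement_from_rotation(length, 5.0), 3))
```

For a 5° error, a relation of 0.2 m shifts the prediction by roughly 0.017 m, while a relation of 2 m shifts it by roughly 0.17 m, which is consistent with predicted poses moving up one shelf level when long relations are used.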

Active Scene Recognition on a Cluttered Table
After two subsections devoted to object configurations spread across our kitchen setup, this subsection shows how our ASR approach deals with an object configuration that brings together a large number of searched objects from different overlapping scenes. Such a configuration, a cluttered table, can be seen in 1 in Figure 19. It consists of 15 objects to be searched, several of which are obscured from certain viewpoints, and seven irrelevant objects. As 2 in Figure 19 and video clip 5 ("Active Scene Recognition on a Cluttered Table") show, MILD manages to find all searched objects. The only object that is not always found in the corresponding experiment s3 e1 is the shelves, and these do not participate in s3 e1. The objects at the front of the table are easily localized, and scene recognition achieves high confidence levels, e.g., for "Setting-Ready for Breakfast". The objects in the back are more difficult to find, resulting in a lower confidence level for "Drinks-Setting". The spuriously high confidence level for "Cereals-on Shelf" results from a false positive returned by the object localization. All irrelevant objects are correctly discarded by ASR.

Comparison of Three Approaches to ASR
In this subsection, we compare the execution times of our approach to ASR with those achieved by two alternative approaches to ASR. The searched object configuration looks similar to "Setting-Ready for Breakfast". The first alternative to our approach is called "direct search only" and omits the INDIRECT SEARCH mode. Instead, its SCENE RECOGNITION substate exclusively processes object estimates acquired by the informed and uninformed strategies of the DIRECT SEARCH mode. We call the second alternative "bounding box search". This approach assumes that objects can only be located in so-called bounding boxes determined by prior knowledge and does not use INDIRECT SEARCH to predict the poses of searched objects. Such bounding boxes are visualized in 3 in Figure 18 as yellow boxes in which possible object poses are visualized as white spheres. However, "bounding box search" uses Next-Best-Views (NBVs) to sweep the bounding boxes.
Compared to the previous experiments (see 1 in Figure 19), the searched place setting is shifted and rotated (see 2-4 in Figure 18). All three ASR approaches are executed twice and successfully find all objects in the setting. Since "direct search only" and "bounding box search" take an inordinate amount of time, we ran this experiment in simulation. Both alternatives take much longer than our ASR approach: 31.13 and 15.24 minutes instead of 2.48 minutes. The informed strategy of "direct search only" is not able to find all objects. The views the strategy adopts are shown in 1 in Figure 18. MILD then uses the uninformed strategy of "direct search only", which causes a lengthy search but eventually succeeds. 2 shows the views adopted by both the informed and uninformed strategies. The views in 3 are adopted by "bounding box search" and show that this approach often moves between bounding boxes rather than searching a single one from different perspectives. This highlights the difficulty of parameterizing the estimation of NBVs. In 4, our approach adopts the same first two views as "direct search only" in 1. However, instead of continuing the search on the right of the table once some objects are found, our approach adapts to the fact that the place setting has been rotated and lets MILD search on the other side of the table.

Runtime of Object Pose Prediction
We reused the ISM trees from Sec. 5.2.5 to compute the runtime of our pose prediction algorithm for datasets with different numbers of objects n and trajectory lengths l. We averaged 10 · n · l executions of the algorithm per dataset in Figure 20. Each runtime shown corresponds to the time required to predict the poses of all objects in a category. If we disregard scaling (runtimes are given here in hundredths of a second), the analysis of the curves from Sec. 5.2.5 also applies to Figure 20. Given that the maximum runtime is 0.055 seconds for ten objects and a trajectory length of 400 samples, the time consumption of object pose prediction seems negligible compared to that of scene recognition.

Conclusions
Through its contributions, this article closes three gaps in the core of Active Scene Recognition (ASR). ASR, which was impractical without these contributions, combines scene recognition and object search, two tasks that are otherwise considered separately. Firstly, ASR enables scene recognition to analyze object configurations that cannot be perceived from a single viewpoint. Secondly, it allows object search to be guided by object configurations rather than single objects, making it more efficient. Using only single objects can lead to ambiguities, since, e.g., a knife in a table setting would expect a plate to be beneath itself when a meal is finished, while it would expect the plate to be beside itself when the meal has not yet started.
The feature extraction components of part-based models may be outdated compared to Convolutional Neural Nets (today's gold standard). However, this article aims to show that ISMs Especially when modeling relations in scenes that express personal preferences and for which only small amounts of data are available, a technique such as ISMs is suitable. This suitability stems from the fact that ISMs model relations nonparametrically in the sense of instance-based learning ([52]).
However, the fact that ISM trees model relations nonparametrically also means that they can be prone to combinatorial explosion. To avoid such effects in the recognition and prediction algorithms we contribute, we have implemented the following strategies: As proposed by [32], an accumulator array and a method similar to Mean-Shift Search are used within single ISMs during recognition to prune a significant portion of the votes. Moreover, when recognizing scenes with an ISM tree, two factors (the number of intermediate results passed from one ISM to the next and the lengths of the chains of interrelated ISMs in a tree) can cause combinatorial effects. We limit the first by passing only the best-rated intermediate results between ISMs and the second by minimizing the heights of ISM trees through our tree generation algorithm. Moreover, we resolve the combinatorial explosion that made our previous pose prediction algorithm inefficient. Instead of simply concatenating inverted spatial relations in an ISM tree, the new method samples random subsets from these relations.
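The difference between exhaustive concatenation and sampling can be sketched as follows. This is an illustration under strong simplifying assumptions, not the prediction algorithm of this article: relative poses are reduced to 1-D offsets, each relation along a chain is a plain list, and the names `combine_all` and `combine_sampled` are hypothetical. In particular, drawing one random relative pose per relation for each sample is only one possible reading of "samples random subsets".

```python
import itertools
import random


def combine_all(relations):
    """Exhaustively concatenate relative poses along a chain of relations.

    For relations with l entries each and a chain of depth d, this yields
    l ** d combined offsets, which is the combinatorial growth to be avoided.
    """
    return [sum(combo) for combo in itertools.product(*relations)]


def combine_sampled(relations, n_samples, rng):
    """Draw a fixed number of combined offsets instead of enumerating all of them.

    Each sample picks one random relative pose per relation, so the cost is
    n_samples * depth, independent of the relation sizes.
    """
    return [sum(rng.choice(rel) for rel in relations) for _ in range(n_samples)]
```

For a chain of three relations with 400 relative poses each, `combine_all` would enumerate 64 million combinations, while `combine_sampled(relations, 100, random.Random(0))` returns 100 predictions regardless of the trajectory length.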
Our evaluation of PSR in Sec. 5.2 provided evidence that any ISM of an ISM tree precisely detects when object poses deviate from modeled spatial relations. Depending on their parametrization, ISMs are more or less permissive concerning such deviations. Further experiments in [9] have also shown that ISM trees are robust against objects missing in object configurations. In Sec. 5.3, we applied ASR to object configurations that were considerably more complex than those used in our previous work. Robot localization and object pose estimation accuracy were the limiting factors for our ASR approach. Still, ASR even succeeded in recognizing scenes at different locations with the same ISM tree. This illustrates that spatial relations make ISM trees particularly reusable compared to techniques that model scenes using absolute object poses. Video clip 6 ("Recognition of scenes independent of their locations") is devoted to this major advantage of ISM trees and thus ASR.
The experiments in Sec. 5.2.5 suggest that the runtime of PSR depends linearly on the number of objects included in a scene category. An experiment we conducted for datasets including six objects indicates that this is also true for trajectory lengths: We measured a maximum recognition runtime of 4.94 seconds for a demonstration trajectory length of 1000 samples. Still, the recognition runtimes for longer trajectory lengths can exceed the requirements of ASR. To overcome this limitation, we plan to compress the relations in ISM trees by eliminating redundant relative poses with downsampling voxel grids.
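Since this compression is only planned, the following is a generic sketch of voxel grid downsampling rather than a description of an existing component. It assumes that only the positions of relative poses are considered (orientations are ignored), that one centroid per occupied voxel is kept, and that the function name `voxel_downsample` and the voxel size are hypothetical.

```python
import math
from collections import defaultdict


def voxel_downsample(points, voxel_size):
    """Replace all 3-D points that fall into the same voxel by their centroid."""
    buckets = defaultdict(list)
    for point in points:
        # Integer voxel index along each axis; floor keeps negative coordinates consistent.
        key = tuple(math.floor(coord / voxel_size) for coord in point)
        buckets[key].append(point)
    # One representative per occupied voxel, in order of first occurrence.
    return [tuple(sum(axis) / len(group) for axis in zip(*group))
            for group in buckets.values()]
```

With a voxel size of 0.1, the relative positions `(0.01, 0.01, 0.0)` and `(0.02, 0.03, 0.0)` collapse into a single representative, while `(0.5, 0.5, 0.0)` is kept separately. A larger voxel size removes more relative poses at the cost of coarser relations.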
3. As it is not the goal of this article to describe all the details of these algorithms, but to present their key ideas concisely, some variables and helper functions are defined only in [9].

Figure 1: Motivating example for scene recognition: Mobile robot looks at an object configuration (arrangement). It reasons which of its actions to apply.

Figure 2: Experimental setup mimicking a kitchen. The objects are distributed over a table, a cupboard, and some shelves. Colored dashed boxes are used to discern the searched objects from the clutter and to assign objects to exemplary scene categories.

Figure 3: Overview of the research problems (in blue) addressed by our overall approach. At the top are the inputs and outputs of our approach. Below are the two phases of our approach: Scene classifier learning and ASR execution.

Figure 4: Overview of the inputs and outputs of Passive Scene Recognition. Spheres represent increasing confidences of outputs by colors from red to green.

Figure 5: 1: Snapshot of a demonstration. 2: Demonstrated object trajectories as sequences of object estimates. The cup is always to the right of the plate, and both in front of the box. 3: Relative poses, visualized as arrows pointing to a reference, form the relations in our scene classifier. Here the classifier models the relations "cup-box" and "plate-box". 4: Objects voting on poses of the reference, using the relations from 3. 5: Accumulator filled with votes. 6: Here the cup is to the left of the plate, which contradicts the demonstration. The learned ISM does not recognize this, as it does not model the "cup-plate" relation. It outputs a false positive.

Figure 6: Algorithm - How the connected relation topology, from which we generate the ISM tree in Figure 8, is first partitioned. The partitioning includes five iterations. It starts at the leftmost graph and ends at the rightmost. In each iteration, a portion of the connected topology, colored green, is converted into a separate star-shaped topology. All stars are on the left of Figure 7.

Figure 7: Algorithm - How an ISM tree is generated from the star topologies shown in the leftmost column (see 1). In each of the five depicted iterations, a star topology is selected to learn a single ISM using the object trajectories for scene category "Setting-Ready for Breakfast". Dashed boxes show the selected stars, whereas arrows link these stars to the learned ISMs.

Figure 8: Algorithm - How Passive Scene Recognition works with an ISM tree: As soon as object poses estimated from the configuration in 1 are passed to the tree, data flows in 2 from the tree's bottom to its top, eventually yielding the scene category instance shown in 3. This instance is made up of the recognition results in 4, returned by the single ISMs.

Figure 9: Results from ISM trees for the same demonstration, but learned with different topologies. Valid results are annotated with a tick in green, invalid results by a cross in red. In 1 & 2, a star is used. In 3 & 4, a complete topology. In 1, 3, valid object configurations are processed. Instead, 2, 4 show invalid configurations.
It recursively compares from top to bottom the stored recognition results $I_m$, $I_{m'}$ according to the connections between pairs of ISMs $m$, $m'$. Such a recursion chain is started in 4 for each recognition result $I_{\mathrm{setting}}$ computed by the root ISM $m_R$. During each recursion chain, a recognition result $I_{\mathrm{setting}}$ is compared with the intermediate results $I_{\mathrm{setting\_sub}k}$ of the different ISMs $m_k$ at level 1. For a comparison to assign two recognition results $I_m$, $I_{m'}$ to the same instance, two conditions must be met: Firstly, these results must come from two ISMs $m$, $m'$ that exchanged reference objects $o_F$. Secondly, the very same reference object must have been involved in both recognition results. The second condition is satisfied if one of the reference objects in each of the two recognition results $I_m$, $I_{m'}$ has the same state $E(o_F)$.
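The two conditions can be sketched as a small top-down recursion. This is a minimal sketch under assumptions, not the assembly procedure as implemented for ISM trees: the class `RecognitionResult`, the function `assemble_instance`, and the dictionaries `results_by_ism` and `children` are hypothetical, states are plain tuples compared for exact equality, and a child ISM's reference object is assumed to appear in the parent's result under the child ISM's name.

```python
from dataclasses import dataclass, field


@dataclass
class RecognitionResult:
    ism: str                  # name of the ISM that returned this result
    reference_state: tuple    # state of the result's own reference object
    object_states: dict = field(default_factory=dict)  # states of the objects that voted


def assemble_instance(result, results_by_ism, children):
    """Collect all stored results that belong to the same instance as `result`."""
    instance = [result]
    # Condition 1: only ISMs connected in the tree have exchanged a reference object.
    for child in children.get(result.ism, []):
        passed_state = result.object_states.get(child)
        if passed_state is None:
            continue
        for candidate in results_by_ism.get(child, []):
            # Condition 2: the very same reference object, i.e. an identical state.
            if candidate.reference_state == passed_state:
                instance.extend(assemble_instance(candidate, results_by_ism, children))
    return instance
```

Starting the recursion once per result of the root ISM mirrors the recursion chains described above. In a real system, comparing states would need a tolerance or an identifier instead of exact tuple equality.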

Figure 10: Example of how scene recognition (see 1, 3) and object pose prediction (see 2, 4) can compensate for changing object poses. The results they return are equivalent, even though the localized objects are rotated by 90° between 1, 2 and 3, 4.

Figure 11: Algorithm - How the poses of searched objects are predicted with an ISM tree: The incomplete scene category instance in 1 is input and causes data to flow through the tree in 2. Unlike scene recognition, data flows from the tree's top to its bottom, while relative poses from different inverted spatial relations are combined. The resulting predictions are visible in 3.

Figure 12: 1: Snapshot of the demonstration for scene category "Office". 3: The ISM tree for this category, including the relative poses in its relations and the demonstrated poses. 2: Result of applying this tree onto the configuration "Correct-configuration".

Figure 14: Times which ISM trees (from optimized topologies) take to recognize scenes, depending on the trajectory length in their datasets.

Figure 16: Influence of object orientations on pose prediction: 1 and 2 show s1 e1 and s1 e2. Between s1 e1 and s1 e2, all objects on the shelves (right) were rotated. 1, 2: Recognized scenes and camera views MILD adopted. 3: Snapshot during the execution of s1 e2. The miniature objects correspond to predicted poses.

Figure 17: Influence of object positions on pose prediction: 1 and 2 show s2 e1 and s2 e2. Between s2 e1 and s2 e2, all objects on the table were shifted to the right. 1, 2: Recognized scenes and camera views MILD adopted. 3: Snapshot during the execution of s2 e2. The miniature objects correspond to predicted poses.

Figure 18: Comparison of three approaches to ASR: For "direct search only" (see 1, 2), "bounding box search" (see 3), and our ASR approach (see 4), we show which camera views were adopted. Additionally, 1 shows demonstrated object poses and 2-4 show recognized scene instances.

Figure 19: Active Scene Recognition on a cluttered table. 1: Snapshot of physical objects. 2: Camera views adopted and recognized scene instances.

Figure 20: Times which our pose prediction algorithm takes, depending on the number of objects and trajectory length in the datasets.

Table 1: We specify position and orientation compliances, as well as a similarity rating for all measured object poses, which ISM "Office sub0" processes. The value of the recognition result's objective function (returned by the ISM) is also provided. See Sec. 5.2.2 for definitions of the objective function, rating, and compliances.

Table 2: We specify position and orientation compliances, as well as a similarity rating for all measured object poses, which ISM "Office" processes. The value of the recognition result's objective function (returned by the ISM) is also provided. See Sec. 5.2.2 for definitions of the objective function, rating, and compliances.

Table 3: Performance of ISM trees learned for scene categories in our kitchen and optimized by Relation Topology Selection.

Table 4: Performance measures for experiments with the physical MILD robot, used to evaluate Active Scene Recognition.