
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

We introduce a novel Maximum Entropy (MaxEnt) framework that can generate 3D scenes by incorporating objects’ relevancy, hierarchical and contextual constraints in a unified model. This model is formulated by a Gibbs distribution, under the MaxEnt framework, that can be sampled to generate plausible scenes. Unlike existing approaches, which represent a given scene by a single And-Or graph, the relevancy constraint (defined as the frequency with which a given object exists in the training data) requires our approach to sample from multiple And-Or graphs, allowing variability in terms of objects’ existence across synthesized scenes. Once an And-Or graph is sampled from the ensemble, the hierarchical constraints are employed to sample the Or-nodes (style variations) and the contextual constraints are subsequently used to enforce the corresponding relations that must be satisfied by the And-nodes. To illustrate the proposed methodology, we use desk scenes that are composed of objects whose existence, styles and arrangements (position and orientation) can vary from one scene to the next. The relevancy, hierarchical and contextual constraints are extracted from a set of training scenes and utilized to generate plausible synthetic scenes that in turn satisfy these constraints. After applying the proposed framework, scenes that are plausible representations of the training examples are automatically generated.

In recent years, the need for 3D models and modeling tools has been growing due to high demand in computer games, virtual environments and animated movies. Even though many graphics software packages are on the market, these tools cannot be used by ordinary users due to their steep learning curve. Even for graphics experts, creating a large number of 3D models is a tedious and time-consuming procedure, motivating the need for automation.

Though still in its infancy, automating the generation of 3D content, either by using design guidelines or by learning from examples, has become one of the active research areas in the computer graphics community. To capture the underlying pattern of a given object/scene, state-of-the-art machine learning algorithms have been used in recent years to automatically or semi-automatically generate 3D models that encompass a variety of objects/scenes by learning optimal styles and arrangements of the constituent parts/objects. Yet, numerous challenges remain in creating a fully-automated scene generation system that can model complex scenarios. We hereby discuss our contribution towards achieving the ultimate goal of designing such a system.

In this paper, we present a novel approach that models a given scene by multiple And-Or graphs and samples them to generate plausible scenes. Using a handful of training scenes, we extract three major constraints, namely relevancy, hierarchical and contextual constraints. Each of these major constraints is represented by many sub-constraints that are extracted from each object or pair of objects in every scene. These constraints are then used to generate plausible scenes by sampling from a probability distribution with maximum entropy.

The work presented here builds on our previous work [

The organization of the paper is as follows. Section 2 presents the existing works that are related to our approach. Section 3 describes the necessary mathematical formulations required in scene generation. Here, we first describe knowledge representation of scenes with And-Or graphs, and then discuss the importance and the intuition behind using the relevancy, hierarchical and contextual constraints for scene generation. Next, we introduce the MaxEnt framework that integrates these constraints into a single, unified framework and represents the scene generation problem as sampling from a Gibbs distribution. The Gibbs distribution is chosen using a maximum entropy model selection criterion and has the capability of learning constraints from the training scenes. Then, parameter estimation of the Gibbs distribution via the feature pursuit strategy is explained. Finally, a technique to sample from the Gibbs distribution is discussed in this section and a pseudocode summarizing the above steps is presented. Section 4 presents the implementation details of the proposed approach. In Section 5, we report the results and analysis followed by a comparison of our framework with an existing approach. Finally, in Section 6, we present a summary of our accomplishments and make some concluding remarks.

Our approach benefits from some of the most recent works in the fields of computer vision and graphics. In this section, we briefly describe these works and point out their relevance to our approach.

As grammar defines the rules of composing a sentence, most objects in images can also be composed of parts that are constrained with a set of contextual and non-contextual constraints [

In these works, using a single And-Or graph in [

As stochastic grammar of images is used to model intra-class variations in images, recent works [

In [

As opposed to [

Although the above approaches [

Recently, Fisher

Our approach presented here is different from the existing works [

In this section, we present the mathematical groundwork that is necessary to formulate 3D scene generation as sampling from a Gibbs distribution under the MaxEnt model selection criterion.

Over the past decade, many computer vision based applications have used an And-Or graph as a concise knowledge representation scheme [

In our specific example, the And-Or graph represents desk scenes whose nodes are objects that constitute the scene. We can generate a large number of And-Or graphs to represent desk scenes by allowing the nodes to be sampled as either ON or OFF. This indirectly represents the relevancy of objects in the scene. As an example, we represent the desk scenes by composing a maximum of seven objects (

Assuming that the nodes in the And-Or graphs are identified for a given scene, 3D scene generation reduces to parsing the graph by first sampling the existence of each object based on their relevancy to the scene. Then, for each object with style variations (Or nodes), a style is sampled based on its probability as observed in the training examples. Finally, contextual constraints between the nodes that are turned ON are enforced. As an example, the first stage defines the existence of objects as: “The desk scene contains table, chair and computer”. The second stage defines the style of objects that are turned ON from the first stage as: “The desk scene contains a

In this paper, a single instance of every node is considered. However, the number of instances of each node can also be considered as a variable. In such cases, it can be integrated in the And-Or graph and be sampled during scene generation [

In order to represent 3D scene generation with And-Or graphs as discussed before, we define the tuple

where

Each node

where _{L}_{W}_{H}

The following constraints are used in the MaxEnt model selection criterion to sample scenes from a Gibbs distribution.

In order to allow the sampling of nodes as ON or OFF, we learn the objects’ relevancy to the scene. To incorporate this constraint in our model, we compute a relevancy term using the object existence information from the training examples. This constraint is then used to sample an And-Or graph for a given scene.

Given the existence of each object as ON or OFF, the relevancy of an object can be computed as:

where _{table}_{paper}
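The relevancy computation described above reduces to counting, for each object, the fraction of training scenes in which it is ON. A minimal sketch in Python (the function name and the scene representation as sets of object names are our own illustrative assumptions, not the paper's notation):

```python
from collections import defaultdict

def relevancy(training_scenes, objects):
    """Estimate each object's relevancy as the fraction of training
    scenes in which it appears (i.e., the node is ON)."""
    counts = defaultdict(int)
    for scene in training_scenes:  # each scene: set of object names present
        for obj in scene:
            counts[obj] += 1
    n = len(training_scenes)
    return {obj: counts[obj] / n for obj in objects}
```

An object appearing in every training scene receives relevancy 1.0, while a rarely observed one receives a value near 0, which in turn controls how often its node is sampled ON during synthesis.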

The hierarchical constraint is used to incorporate intra-class variations of object styles for scene generation, and it is represented by the Or-nodes in the graph. By counting the frequency with which a given object style is observed in the training data, we can synthesize scenes that obey this style proportion.

Using object existence information as well as the corresponding style used in a given scene, we can define the proportion of object

where the Dirac-Delta function _{computer}_{computer}

In our experiment, since we consider at most a two-category style of objects,
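The style proportion of an Or-node can be sketched as follows; among the training scenes containing a given object, we count how often each style occurs (the function name and the scene representation as object-to-style dictionaries are illustrative assumptions):

```python
def style_proportion(training_scenes, obj):
    """Among training scenes containing `obj`, return the observed
    proportion of each style (the Or-node branch probabilities)."""
    styles = [scene[obj] for scene in training_scenes if obj in scene]
    n = len(styles)
    return {s: styles.count(s) / n for s in set(styles)}
```

During synthesis, once an object is sampled ON, its style is drawn from this categorical distribution, so the synthesized scenes reproduce the observed style mix.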

Objects in a given scene are assumed to be constrained contextually by design preferences and/or physical constraints. The contextual pattern of the underlying constraint can be learned from training examples and can be used to generate plausible synthetic scenes. In our approach, we define a set of pairwise contextual sub-constraints as shown in

Histograms are extracted for each contextual sub-constraint defined in

where _{l}_{o}_{o′}_{l}_{o}_{o′}_{l}
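Each pairwise sub-constraint statistic is a normalized histogram of a relative feature (for example, the relative position along the X-axis) accumulated over the scenes in which both objects exist. A sketch, assuming a simple fixed-range binning (function name, bin layout and scene representation are our own illustrative choices):

```python
def pairwise_histogram(scenes, o, o2, feature, n_bins, lo, hi):
    """Normalized histogram of a pairwise feature (e.g. relative x
    position) over the scenes in which both objects exist."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    total = 0
    for s in scenes:
        if o in s and o2 in s:
            v = feature(s[o], s[o2])
            if lo <= v < hi:  # values outside the range are ignored
                counts[int((v - lo) / width)] += 1
                total += 1
    return [c / total for c in counts] if total else counts
```

The same routine is applied to both the training scenes and the synthesized scenes, so the two histograms can be compared directly during feature pursuit.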

In our approach, we use the MaxEnt model selection criterion to identify a probability distribution that best fits the constraints extracted from the training set.

As Jaynes stated in [

Given an unobserved true

Solving the above constrained optimization problem results in the following Gibbs distribution [

where:

Here,

Comparing the energy term ε(
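Although the displayed equation is not reproduced here, the general form of a MaxEnt solution is standard; in generic notation (which may differ from the paper's exact symbols), the Gibbs distribution and its partition function read:

```latex
p(S;\Lambda) \;=\; \frac{1}{Z(\Lambda)}\,
  \exp\!\Big(-\sum_{j=1}^{K}\big\langle \lambda_j,\, h_j(S)\big\rangle\Big),
\qquad
Z(\Lambda) \;=\; \sum_{S}\exp\!\Big(-\sum_{j=1}^{K}\big\langle \lambda_j,\, h_j(S)\big\rangle\Big)
```

where $h_j(S)$ are the sub-constraint statistics (relevancy, style-proportion and contextual histograms) extracted from scene $S$, and $\lambda_j$ are the Lagrange multipliers (parameters) to be learned so that the expected statistics under $p$ match those observed in training.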

In order to sample from the Gibbs distribution given in

The parameters of the Gibbs distribution

where

In order to learn parameters, scenes must be sampled by perturbing the objects’ relevancy (

Computing the parameters for the relevancy, hierarchical and contextual constraints simultaneously is computationally expensive. As a result, these constraints are decoupled in such a way that we first sample scenes to obey the relevancy constraints. Once the relevancy constraint is obeyed, we sample the hierarchical constraints for objects that exist in each scene. Finally, scenes are sampled to capture the contextual constraints observed from the training examples. With each type of constraint, a greedy parameter optimization approach called feature pursuit [

As discussed, we use three types of constraints (relevancy, hierarchical and contextual), each of which is represented by multiple sub-constraints, specifically, |

In the feature pursuit strategy, sub-constraints are selected one at a time from the pool of

The scene synthesis procedure is initialized by random sampling. Thereafter, a sub-constraint _{+} is selected by first computing the squared Euclidean distance and then picking the most diverging sub-constraint, as given in

where

The corresponding parameter for the sub-constraint _{+} is then learned iteratively using

The intuition behind the feature pursuit strategy is that the sub-constraint with the highest deviation between the true and the approximate distributions should be prioritized and learned in order to bring the two distributions as close as possible.
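The selection step of feature pursuit can be sketched as follows: among all candidate sub-constraints, pick the one whose synthetic statistics diverge most (in squared Euclidean distance) from the observed ones. The function name and the dictionary-of-histograms representation are illustrative assumptions:

```python
def select_subconstraint(observed, synthetic):
    """Feature pursuit selection: return the sub-constraint whose
    synthetic histogram diverges most from the observed one, together
    with its squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    distances = {k: sq_dist(observed[k], synthetic[k]) for k in observed}
    best = max(distances, key=distances.get)
    return best, distances[best]
```

The chosen sub-constraint's parameter is then tuned (while the others are held fixed) before the next selection round, which greedily shrinks the largest remaining discrepancy first.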

As more sub-constraints are selected, more parameters are tuned and the sampled scenes come to resemble the patterns observed in the training scenes.

In order to sample from the Gibbs distribution defined in _{+}

as given by

where

To give an example, assuming that we are working on contextual constraints (_{+} (from

The sampling, feature pursuit, and parameter estimation steps are applied repeatedly until the overall divergence between the two distributions' constraints, as given by
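The accept/reject step used when perturbing a scene follows the standard Metropolis rule for a Gibbs distribution $p(S)\propto e^{-E(S)}$: a proposal that lowers the energy is always accepted, and one that raises it is accepted with probability $e^{-\Delta E}$. A minimal sketch (the function names and the abstract `scene`/`energy`/`perturb` interfaces are illustrative assumptions):

```python
import math
import random

def metropolis_step(scene, energy, perturb, rng=random):
    """One Metropolis update for sampling from p(S) ~ exp(-E(S)):
    perturb the scene, then accept the proposal with probability
    min(1, exp(E(old) - E(new)))."""
    proposal = perturb(scene)
    delta = energy(proposal) - energy(scene)
    if delta <= 0 or rng.random() < math.exp(-delta):
        return proposal  # accept the proposed configuration
    return scene  # reject; keep the current configuration
```

Iterating this step drives the synthetic scene population toward the target Gibbs distribution for the current parameter values.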

Given a set of training scenes ^{F}^{P}_{1} = 0.1, _{2} = 0.1 and

In this section, we explain the implementation details for generating plausible and visually appealing synthetic scenes using the proposed approach.

In addition to the constraints mentioned earlier, we also considered criteria that help to make the synthesized scenes more plausible and visually appealing. These considerations are detailed next.

The model described thus far has no provisions for prohibiting the occurrence of intersecting objects. To remedy this shortcoming, we incorporate the intersection constraint, which uses the projection of each object’s bounding box on the XY-plane (top-view of the scenes). For every pair of objects

where _{xy}_{xy}

While intersection can be encoded as a soft constraint in the energy expression (e.g., see [
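The intersection test reduces to computing the XY overlap of two axis-aligned bounding-box footprints and normalizing it. A sketch, assuming boxes are given as dictionaries with a min corner `(x, y)` and footprint dimensions `L`, `W`, and normalization by the smaller footprint (these representational choices are our own, not necessarily the paper's exact formulation):

```python
def xy_overlap_ratio(a, b):
    """Intersection area of two axis-aligned bounding boxes projected
    onto the XY plane, normalized by the smaller footprint area."""
    w = min(a["x"] + a["L"], b["x"] + b["L"]) - max(a["x"], b["x"])
    d = min(a["y"] + a["W"], b["y"] + b["W"]) - max(a["y"], b["y"])
    inter = max(w, 0.0) * max(d, 0.0)  # zero when footprints are disjoint
    return inter / min(a["L"] * a["W"], b["L"] * b["W"])
```

Treated as a hard constraint, any proposed perturbation whose ratio exceeds the threshold is simply discarded before the Metropolis accept/reject step.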

To demonstrate the parent-child support criteria, consider a laptop placed on a table. Usually, the laptop is contained in the table, as seen from the top view (XY projection of the scene) and it is in contact with the table if viewed from the side. The contact constraint, formulated by

where _{z}_{z}_{z}_{z}

Using

In this section, parent-child support is formulated based on the assumption that child objects normally exist on top of the bounding box of their parent. Although this is a valid assumption for the training scenes that are used in our experiment, it will fail for the general case when a parent object has many supporting surfaces. As a result, this assumption needs to be relaxed by first segmenting out any supporting surfaces of a given object and evaluating the parent-child relationship on each surface. During scene generation, this will add additional contextual constraints on the height of objects (along the Z-axis). Therefore, the height of each object can also be sampled in a similar fashion as the relative position along the X- and Y-axis.
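Under the assumption stated above (a child rests on top of its parent's bounding box), the support test has two parts: the child's XY footprint lies inside the parent's, and the child's base is in contact with the parent's top face. A sketch with an illustrative box representation (min corner `(x, y, z)` plus dimensions `L`, `W`, `H`; names and tolerance are our assumptions):

```python
def is_supported(parent, child, tol=1e-3):
    """Parent-child support test: the child's footprint must lie inside
    the parent's (top view) and the child's base must touch the
    parent's top face (side view)."""
    inside = (parent["x"] <= child["x"]
              and child["x"] + child["L"] <= parent["x"] + parent["L"]
              and parent["y"] <= child["y"]
              and child["y"] + child["W"] <= parent["y"] + parent["W"])
    contact = abs(child["z"] - (parent["z"] + parent["H"])) <= tol
    return inside and contact
```

Relaxing the single-top-surface assumption, as discussed above, would amount to running this test once per segmented supporting surface rather than once per parent box.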

Unlike the intersection constraint that restricts the synthesis of intersecting objects, visual balance, which largely depends on personal preference, is implemented as a soft constraint. As a result, the visual balance constraint is incorporated on children objects by modifying the energy expression defined in

Here, _{vb} is the visual balance cost, and w_{vb} determines how much this term should influence the scene generation. In [

where ^{P}

To clarify what _{vb} is measuring, compare the scene shown in

The reason for handling visual balance as an energy term (as opposed to incorporating it into the model as an additional contextual sub-constraint) is that the visual balance constraint adds significant complexity to the feature pursuit procedure. To clarify, if for example the maximum deviating sub-constraint in feature pursuit happens to be the relative position in the X-axis between lamp and paper, either of the objects can be perturbed along the X-axis, and we can decide whether to accept or reject the proposed configuration. However, visual balance depends not just on a pair of objects, but on all children objects. Moreover, visual balance can be modified by perturbing a large combination of constraints.
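One common way to score visual balance, consistent with the intuition above, is the distance between the footprint-area-weighted centroid of the child objects and the center of the parent's surface, normalized by the parent's diagonal. This is a hedged sketch of that idea, not necessarily the paper's exact cost (the box representation and normalization are our assumptions):

```python
import math

def visual_balance_cost(parent, children):
    """Visual balance cost: distance between the footprint-area-weighted
    centroid of the children and the parent's center, normalized by the
    parent's diagonal (0 means perfectly balanced)."""
    total_area = sum(c["L"] * c["W"] for c in children)
    cx = sum((c["x"] + c["L"] / 2) * c["L"] * c["W"] for c in children) / total_area
    cy = sum((c["y"] + c["W"] / 2) * c["L"] * c["W"] for c in children) / total_area
    px = parent["x"] + parent["L"] / 2
    py = parent["y"] + parent["W"] / 2
    return math.hypot(cx - px, cy - py) / math.hypot(parent["L"], parent["W"])
```

Because the cost depends on all children jointly, it is added to the energy as a weighted soft term rather than pursued as a pairwise sub-constraint.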

Note that since all the other constraints are normalized counts, the weight _{vb} = 0.05

In order to synthesize scenes with the appropriate orientation of objects, the scene generation approach should incorporate a way to identify the front

In the proposed framework, |^{P}_{L}_{W}_{H}_{Room}_{Room}_{Room}_{ρ,x}_{ρ,y}_{ρ,x}_{ρ,y}

Once the scenes are randomly initialized, they are then perturbed and sampled to ultimately match the constraints extracted from the training examples. Depending on the type of constraint being optimized, the scene perturbation is performed as follows. If relevancy constraint is selected (
_{+} a scene & is randomly picked from the synthesized scenes ^{P}_{+} is randomly sampled as

Similarly, if hierarchical constraint is selected (
_{+}, a scene & is randomly picked from the synthesized scenes ^{P}_{+}_{k}

Finally, if contextual constraint is selected (
_{+}, a scene & is randomly picked from the synthesized scenes ^{syn*}_{+} exist. Then, one of the objects (either _{+} is perturbed. Since we have defined the intersection constraint as a hard constraint, the perturbed object’s feature is used to check if the intersection ratio with any other object in that scene is below the defined threshold. If any of the intersection ratios falls above the threshold, the perturbation is discarded and a new sample is generated. This procedure is repeated a maximum of

For child object:

For relative position in X, _{ρ,x}

For relative position in Y, _{ρ,y}

For relative orientation,

For non-child object:

For relative position in X, _{Room}

For relative position in Y, _{Room}

For relative orientation,

In this section, we present the results from the proposed MaxEnt based scene generation framework. Our training dataset contains |^{F}

Using the training examples, a set of relevancy, hierarchical and contextual constraints are extracted and used as observed constraints to sample a set of synthetic scenes. As described in Section 3.4, we decouple these constraints in such a way that we first sample scenes to obey the relevancy constraint. Once this is accomplished, we sample the hierarchical constraints for objects that exist in each scene. Finally, scenes are sampled to capture the contextual constraints observed from the training examples.

Once the relevancy constraints are extracted from the training and synthetic scenes, feature pursuit is applied by continuously sampling the existence of the objects in the synthetic scenes and updating the parameters until the divergence is minimal. The result of this procedure can be seen in

After the relevancy constraint is matched, the hierarchical constraint is optimized until the objects’ style proportion is matched for those with multiple styles; see

After both the relevancy and hierarchical constraints are satisfied, the final step is imposing contextual constraints on the synthetic scenes. Using the pairwise relations defined in

It can be seen from

To assess the performance of our proposed framework, 31 test subjects were first shown a few scenes from the training data to establish a reference for scene acceptability. Thereafter, they were presented the 50 synthesized scenes and were asked to rate each scene on a five-point scale: very bad (1), bad (2), fair (3), good (4) and very good (5). Furthermore, for better understanding of the ratings, the participants were asked to justify their rating. The ratings were then consolidated and are represented in

We would like to mention that our objective here is not so much to quantify the plausible scenes but rather to quantify the implausible ones, as we would expect less inter-subject variability in the latter than the former. This allowed us to quantify and report our approach’s tendency to generate implausible scenes.

In Section 2, we mentioned the advantage of the MaxEnt framework over the existing approaches for scene generation application. In comparison with the MaxEnt framework, the existing approaches [

Among the existing approaches, Fisher

Given a sample training scene shown in

In order to synthesize new scenes, Fisher

Thereafter, scene generation procedure is implemented in Fisher

On the other hand, the scene generation procedure in our approach is implemented as follows. First, the energy term of the Gibbs distribution defined in

In order to compare the approach proposed by Fisher

From these figures, one can see that Fisher

In this paper, we proposed a novel automatic, data-driven 3D scene generation framework based on MaxEnt model selection criterion. Unlike the existing methods, our framework incorporates relevancy, hierarchical and contextual constraints into a unified framework. By integrating the relevancy constraint into the model, our approach manages to sample from multiple, conceptually-similar And-Or graphs, allowing variability in terms of object existence.

In addition to introducing a relevancy constraint into a hierarchical and contextual MaxEnt framework, we incorporated additional constraints, namely the intersection constraint, parent-child support and visual balance criteria. As a result, the proposed approach is capable of generating plausible synthetic scenes with a wide range of variations.

In order to evaluate the plausibility of the scenes generated using the proposed framework, we gathered feedback from human graders. From this evaluation procedure, more than 70% of the scenes were rated above fair, and the average rating across all scenes also falls above fair. This evaluation confirms that the proposed framework is capable of automatically generating a reasonable number of plausible scenes.

Thereafter, a comparison of the proposed framework with the existing approaches is discussed. This comparison demonstrates that the proposed framework captures the variability of the observed scenes and generates scenes with larger variability than the existing approaches.

As a final note, although the applicability of the proposed approach is illustrated with only an exemplary desk scene in this paper, the framework can be used to synthesize any scene. Given a scene that is composed of objects/parts and constrained by relevancy, hierarchical and contextual constraints, the same process detailed in

The authors contributed equally in conceiving, designing and analyzing the approach as well as preparing this manuscript.

The authors declare no conflict of interest.

Example of And-Or graph representation for desk scenes. Each node is connected (dotted lines) to every other node, but for clarity, only a subset of such connections is shown.

Comparison of visual balance criteria (

Computation of orientation vector.

Different styles of computer models: (

Relevancy constraint optimization.

Hierarchical constraint optimization.

Set of randomly selected plausible synthesized scenes.

Human Observer Ratings of Synthesized Scenes.

Training data preparation: (a) 3D dining scene; (b) training data amplification; (c) probability density function of the training data (represented by Gaussian mixture models); (d) histogram of the training data (30×30 bins).

Scene generation proposed in Fisher

MaxEnt scene generation: (

Synthetic scene distribution comparison (

Defined Relationships.

Relationship | Formula
---|---
Relative position in X axis | x_{o} − x_{o′}
Relative position in Y axis | y_{o} − y_{o′}
Relative orientation | θ_{o} − θ_{o′}

3D Models used in Training Scenes.

Object ID | Object Name | Models (Style 1) | Models (Style 2)
---|---|---|---

1 | Table | 1 | 1 |

2 | Chair | 5 | - |

3 | Computer | 4 | 1 |

4 | Lamp | 3 | - |

5 | Phone | 1 | 1 |

6 | Book | 2 | 2 |

7 | Paper | 1 | - |

This pseudocode synthesizes 3D scenes by sampling from the Gibbs distribution. Lines 2 and 3 define the input and output of the algorithm. Line 4 initializes synthetic scenes randomly. Line 5 constrains the synthetic scenes with respect to relevancy (

1 | function ^{P}^{F} |

2 | // Input: A set of training scenes ^{F} |

3 | // Output: A set of synthetic scenes ^{P} |

4 | – ^{P} |

5 | – ^{P} = Constrain_Scenes (S^{P}, S^{F},M = R) |

6 | – ^{P} = Constrain_Scenes (S^{P}, S^{F},M = H) |

7 | – ^{P} = Constrain_Scenes (S^{P}, S^{F},M = C) |

This pseudocode synthesizes scenes that are constrained with respect to
_{+}). Lines 8–23 perturb scenes with respect to _{+} and update them using Metropolis sampling until convergence.

1 | function ^{P}^{P}^{F} |

2 | – ^{F} extract constraints M^{F} |

3 | – ^{P} extract constraints M^{P} |

4 | – |

5 | – |

6 | – |

7 | – _{+} using (10) |

8 | – |

9 | – _{+} using (10) |

10 | – ^{P*} by S^{P} |

11 | – |

12 | – ^{P*} |

13 | – _{+} and propose S^{P*} (See Section 4.2) |

14 | – ^{P*} |

15 | |

16 | ^{P}^{P*} |

17 | |

18 | ^{P}^{P*} |

19 | ^{P}^{P*} |

20 | ^{P}←M^{P*} |

21 | |

22 | ^{P*}^{P} |

23 | – _{+} |

24 | _{1} |

25 | _{2} |