Review

Advancements, Challenges, and Future Directions in Scene-Graph-Based Image Generation: A Comprehensive Review

by Chikwendu Ijeoma Amuche 1,*, Xiaoling Zhang 1, Happy Nkanta Monday 2, Grace Ugochi Nneji 2, Chiagoziem C. Ukwuoma 2,3, Okechukwu Chinedum Chikwendu 4, Yeong Hyeon Gu 5,* and Mugahed A. Al-antari 5,*

1 School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 Department of Computing, Oxford Brookes College, Chengdu University of Technology, Chengdu 610059, China
3 College of Nuclear Technology and Automation Engineering, Chengdu University of Technology, Chengdu 610059, China
4 Department of Biochemistry, Federal University of Technology Owerri, Ihiagwa, Owerri PMB 1526, Nigeria
5 Department of Artificial Intelligence and Data Science, College of AI Convergence, Daeyang AI Center, Sejong University, Seoul 05006, Republic of Korea
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(6), 1158; https://doi.org/10.3390/electronics14061158
Submission received: 5 February 2025 / Revised: 11 March 2025 / Accepted: 11 March 2025 / Published: 15 March 2025

Abstract: The generation of images from scene graphs is an important area in computer vision, where structured object relationships are used to create detailed visual representations. While recent methods, such as generative adversarial networks (GANs), transformers, and diffusion models, have improved image quality, they still face challenges, like scalability issues, difficulty in generating complex scenes, and a lack of clear evaluation standards. Despite various approaches being proposed, there is still no unified way to compare their effectiveness, making it difficult to determine the best techniques for real-world applications. This review provides a detailed assessment of scene-graph-based image generation by organizing current methods into different categories and examining their advantages and limitations. We also discuss the datasets used for training, the evaluation measures applied to assess model performance, and the key challenges that remain, such as ensuring consistency in scene structure, handling object interactions, and reducing computational costs. Finally, we outline future directions in this field, highlighting the need for more efficient, scalable, and semantically accurate models. This review serves as a useful reference for researchers and practitioners, helping them understand current trends and identify areas for further improvement in scene-graph-based image generation.

1. Introduction

In computer graphics and artificial intelligence, transforming a scene graph into an image is a complex task that has received significant interest from researchers. Unlike a poet, who may labor over words to capture a moment, generating images from scene graphs requires precision and efficiency. This paper explores the diverse techniques employed for generating images from scene graphs. Deep learning [1,2,3] has made significant progress in the domain of computer vision (CV) [4,5], excelling in elementary tasks like object identification and recognition [6,7]. There is an increasing need for more advanced visual comprehension and reasoning tasks that can examine the relationships between objects in a given scene, and this demand led to the development of the scene graph. Scene graphs were initially introduced as a data structure [8] to provide a detailed representation of the objects found in a scene and their interrelationships. A comprehensive scene graph can accurately capture the complex semantics of a scene and can decompose 2D/3D images [9,10,11] into abstract semantic elements. It does this without imposing restrictions on the categories and characteristics of objects or the connections between them. Research on scene graphs significantly advances domains such as CV, natural language processing (NLP), and tasks that combine both. Initially introduced in 2015, scene graphs were primarily used for image retrieval. Since then, there has been substantial growth in the quantity and breadth of research dedicated to scene graphs.
Although there have been significant advancements in generating images using text-based methods that combine generative adversarial networks (GANs) with recurrent neural networks [12,13], it is still challenging to generate images from complex written descriptions that involve several objects and their relationships. The information in a sentence is usually ordered linearly, with words following one another, whereas complex descriptions can frequently be better structured and comprehended using scene graphs. Recent years have witnessed significant advancements in image generation [14,15,16,17], primarily due to extensive research in fields such as diffusion and score-based generative models [18,19,20]. These techniques have facilitated the generation of authentic and diverse images [21,22,23], which can be described using various inputs, such as labels [24,25], captions [26], segmentation masks [27], sketches [28], and stroke paints [29,30]. However, these input methods frequently struggle to represent intricate connections among several objects in an image. Scene graphs are a capable and reliable way to represent objects and their relationships visually [31]. It is therefore crucial to investigate the use of scene graphs for accurately generating images of complex scenes [32]. Going from a scene graph, a structured abstraction of a scene, to a visual representation takes care and subtlety. Scene graphs supply essential information for generating images, but they do not contain the final image; the challenge is crossing this gap and generating a realistic image from the abstract scene structure.
Scene graphs are reliable when generating images for complex scenes with several objects and layouts. They offer deeper comprehension of the connections between objects and are helpful for many visual tasks. These tasks include image generation [32], 3D scene understanding [33], image retrieval [8], image/video captioning [34,35], image manipulation [36], and visual question answering (VQA) [37]. They are integrated into graph networks in VQA to enable reasoning about given queries and are used to learn visual features [38]. They let the model obtain related images more quickly and precisely express image semantics through layout without depending on unstructured text [39]. A scene graph defines a structured layout, and bounding boxes are used to transform this layout into images. However, generating vocabulary-based scene graphs can be challenging due to the complexity of object relationships and spatial reasoning. Scene-graph-to-image (SG2IM) algorithms, inspired by the pioneering technique [32], along with diffusion models and transformers, have been widely explored to address these challenges. This article discusses the key components of image-generating techniques and compares different model architectures. Figure 1 provides an overview of these models, highlighting their differences in structure and processing. There have been several recent surveys of text-based image synthesis with GANs [40,41,42]. However, a thorough review of scene-graph-based image-generating methods has not been done, to the best of our knowledge. The aim is to enable the reader to study and comprehend this research subject, which has gained much interest in the last few years. The main contributions of this survey are as follows:
Comprehensive overview. Provide a thorough summary of the background information and fundamental concepts in scene-graph-to-image generation.
Methodological classification. Group and analyze the methodologies proposed over time, highlighting their strengths and weaknesses.
Insight into challenges. Equip readers with the understanding necessary to grasp the challenges of generating images from structured scene graphs.

1.1. Scope of the Survey

This survey provides a comprehensive overview of scene graphs in the context of image generation, examining the methodologies, advancements, and challenges in this evolving field. It explores the complexities of converting structured scene representations into realistic, visually appealing images, focusing on the key difficulties current generation methods face. The paper aims to clarify scene graphs’ practical applications and significance in image generation, highlighting their growing importance in modern computer graphics and artificial intelligence. This survey addresses several critical research questions that form its foundation:
What are the primary challenges in transforming scene graph representations into photorealistic images?
Which methodologies and algorithms are most commonly used for scene-graph-to-image generation?
How has deep learning contributed to the progress of scene-graph-based image generation methods?
This survey begins with an in-depth examination of scene graphs, discussing their historical development and crucial role in the evolution of computer graphics, artificial intelligence, and computer vision. The paper provides a theoretical framework for scene-graph-to-image generation, followed by a review of various techniques developed to address the challenges of this transformation. As the number of scene-graph-to-image (SG2IM) models grows, this paper systematically evaluates the most relevant research through a structured search methodology. This study defines key search terms, such as scene graphs, scene-graph-based image generation, graph-based image synthesis, GANs for scene-graph-based generation, transformers in structured image generation, diffusion models for structured image generation, and graph convolutional networks (GCNs), and reviews credible academic sources, including prestigious conferences (IJCAI, CVPR, and ICML) and journals (Nature Machine Intelligence and JNCA). The selection criteria prioritized peer-reviewed journal articles and conference papers published between 2013 and 2024. To maintain historical continuity, seminal works predating this period were included if they played a foundational role in shaping modern methodologies. By employing this systematic approach, the survey captures the latest advancements while ensuring that fundamental theoretical contributions remain acknowledged.

1.2. Structure of the Survey

The paper is structured as follows. Section 2 summarizes related literature about various surveys and overviews within the area. Section 3 introduces the background and concept of image generation using different methods. Section 4 provides the methodological description of the scene-graph-to-image generation from other models. Next, a brief description of public datasets is introduced in Section 5 before the evaluation metrics and various contrasts of the most significant image generation methods in Section 6. Subsequently, current challenges are presented in Section 7, and a conclusion is provided in Section 8.

2. Existing Surveys

Most of the existing surveys on scene-graph-based image generation (SG2IM) either cover broad topics or focus on specific aspects of image generation, such as text-to-image, sketch-to-image, and layout-to-image approaches. As shown in Figure 2, various image generation methods have been explored, each differing in input type, adopted algorithms, and generation objectives [43,44]. For instance, ref. [43] provides a broad survey of image generation techniques, examining methods based on text, sketches, and layout inputs. It conducts a thorough evaluation of existing approaches, analyzing their algorithmic structures, strengths, and limitations. However, this survey does not focus on scene-graph-based methods, which is the central theme of our work. In contrast, ref. [44] specifically reviews GAN-based models for text-to-image synthesis, highlighting their core components, challenges, and architectural improvements. While their survey provides insights into GANs’ role in image generation, our work extends beyond GANs, incorporating transformer-based and diffusion-based scene graph image generation models, thus offering a broader methodological perspective.
A more closely related work is ref. [45], which presents a standardized framework for evaluating scene-graph- and layout-based image generation models. Their study primarily focuses on defining performance metrics and evaluation methodologies for these models. However, while their work is limited to model assessment, our survey provides a comparative analysis of different techniques, categorizing them based on their underlying architectures and how they leverage scene graph information.
Unlike these existing surveys, our work systematically examines scene-graph-based image generation from multiple perspectives, providing a comparative evaluation of GANs, transformers, and diffusion models in the context of image realism, relational consistency, and computational efficiency. We also introduce a structured discussion of benchmark datasets, evaluation metrics, and emerging challenges, making our study a comprehensive reference for future research in scene-graph-based image generation.

3. Background

The essential ideas frequently used in developing image generation pipelines are summarized in this section. Scene graphs, GCNs, and specific generative methodologies, like GANs, transformers, and diffusion models, are the three primary elements of this integrated process. Within computer graphics, scene-graph-to-image generation is a dynamic and important area, where converting organized scene descriptions into visually appealing representations has gained interest. The fundamental components of this field must be properly understood, especially the function of scene graphs in computer graphics.

3.1. Scene Graphs

Scene graphs are the foundation of scene-graph-to-image generation. They are hierarchical data structures that make it straightforward to organize and represent scenes in computer graphics. These structures resemble family trees in that the nodes represent different objects in a scene. A node might, for example, stand in for a three-dimensional object and describe its geometric form, material characteristics, and position in the scene. Scene graphs define spatial hierarchy through parent–child relationships. A scene graph is an ordered method of representing a scene that precisely defines objects (e.g., “man”, “frisbee”, and “goalpost”), their characteristics (e.g., “frisbee is white”), and the connections between objects (e.g., “man holding frisbee”). A scene graph’s essential elements are its objects, attributes, and relations. The crucial components of an image, its objects or subjects, are often indicated by bounding boxes. Any object can have one or more attributes, such as color (like “white”). Relationships can be stated as comparisons (e.g., “taller than”), descriptive verbs (e.g., “wear”), or spatial positions (e.g., “is behind”) [46,47,48,49]. In short, a scene graph is a set of visual relationship triplets of the form ⟨subject, relation, object⟩. The notion of scene graphs was initially established for image retrieval, explicitly to aid in searching images with detailed semantic elements. Figure 3 illustrates that a comprehensive scene graph includes objects and their relationships, accurately capturing the semantic representation of the scene. Scene graphs can represent both 2D and 3D [50] images by encoding them into their semantic components. This encoding does not impose any limitations on the types of objects, their properties, or relationships.
The scene graph $G$ is a tuple $(O, E)$ consisting of a set of objects $O = \{o_1, \ldots, o_n\}$, which can represent various entities, such as persons (man and referee), places (goalpost), or things (frisbee); each $o_i$ is a node of $G$. The set $E$ consists of directed edges $E \subseteq O \times R \times O$, which encode object relationships. These relationships can include geometry (man at a goalpost) and actions (man holding a frisbee) and are described as triplets $(o_i, r, o_j)$, where $o_i, o_j \in O$ and $r \in R$. Objects and relationships in a scene graph exchange features and messages, and several models improve object properties and extract phrase features using different message-passing techniques.
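For concreteness, the following minimal Python sketch illustrates one way a scene graph of this form could be represented in code; the object and relation names are illustrative and are not taken from any specific dataset or library.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """A scene graph G = (O, E): objects plus directed relation triplets."""
    objects: list[str] = field(default_factory=list)                    # O = {o_1, ..., o_n}
    triplets: list[tuple[int, str, int]] = field(default_factory=list)  # (i, r, j) over object indices

# Example graph for "man holding frisbee at goalpost"
graph = SceneGraph(
    objects=["man", "frisbee", "goalpost"],
    triplets=[
        (0, "holding", 1),   # man -- holding --> frisbee
        (0, "at", 2),        # man -- at --> goalpost
    ],
)

for i, relation, j in graph.triplets:
    print(graph.objects[i], relation, graph.objects[j])
```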

3.2. Graph Convolutional Networks

Relationships among data points in a vector space are often represented via graphs, where a graph signal $x \in \mathbb{R}^n$ assigns the value $x_i$ to node $i$. The authors of refs. [51,52] presented graph neural networks (GNNs) to handle cyclic, acyclic, directed, and undirected graphs, as well as data processing within graph domains [53,54,55]. Convolutional aggregation methods are employed in graph convolutional networks (GCNs) [56,57], a particular kind of GNN, much as in the traditional 2D convolutional layers of a convolutional neural network (CNN). A pair $(V, E)$, where $V$ denotes the nodes and $E$ the edges, defines a graph $G$ in a GCN. Just as a 2D convolution maps a spatial grid of feature vectors to a new grid with weights shared across locations, a graph convolution aggregates features over the neighbors of each node with shared weights. The input vectors $v_i$ and $v_r$ for all nodes and edges lie in $\mathbb{R}^{D_{in}}$, and the corresponding output vectors $v_i'$ and $v_r'$ lie in $\mathbb{R}^{D_{out}}$. The conventional procedure for turning scene graphs into images involves passing the scene graph through a GCN composed of several graph convolutional layers. Each layer takes the $D_{in}$-dimensional vector of every node in $V$ and edge in $E$ of the input graph and computes a new $D_{out}$-dimensional vector for each connected node and edge. The graph convolution can be understood by considering the input vectors $v_i, v_r \in \mathbb{R}^{D_{in}}$, which represent objects $o_i \in O$ and edges $(o_i, r, o_j) \in E$. The output vectors, whose dimensionality is given in Equation (1), are computed for every node and edge using three distinct transformation functions, $g_s$, $g_p$, and $g_o$, which correspond to the subject node $o_i$, the predicate (relationship) $r$, and the object node $o_j$, respectively. These three functions are integrated into a unified network by concatenating their respective input vectors $v_i$, $v_r$, and $v_j$; the combined vector is then passed through a multilayer perceptron (MLP), which generates updated feature representations for the subject $o_i$, predicate $r$, and object $o_j$. This process is formally expressed as:

$$v_i', v_r' \in \mathbb{R}^{D_{out}}, \tag{1}$$

$$v_r' = g_s(v_i) + g_p(v_r) + g_o(v_j), \tag{2}$$

where the transformed representations $v_i'$ and $v_r'$ correspond to the updated embeddings for the object node $o_i$ and the predicate $r$, respectively; the output vector for an edge is given by Equation (2). To compute an updated vector for a node, the candidate vectors produced by all edges starting at $o_i$ are gathered in a set $V_i^s$, and, similarly, the candidate vectors produced by all edges terminating at $o_i$ are gathered in a set $V_i^o$.
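The per-triplet update in Equation (2) can be sketched in PyTorch as follows; the layer sizes and the use of a single shared MLP whose output is split into the $g_s$, $g_p$, and $g_o$ parts are illustrative assumptions in the spirit of SG2IM-style graph convolutions, not the exact published architecture.

```python
import torch
import torch.nn as nn

class TripletGraphConv(nn.Module):
    """One graph-convolution step over (subject, predicate, object) triplets."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        # Shared MLP maps the concatenated triplet to three D_out-sized chunks,
        # playing the roles of g_s, g_p, and g_o in Equation (2).
        self.mlp = nn.Sequential(
            nn.Linear(3 * d_in, 3 * d_out),
            nn.ReLU(),
            nn.Linear(3 * d_out, 3 * d_out),
        )
        self.d_out = d_out

    def forward(self, obj_vecs, pred_vecs, edges):
        # obj_vecs: (num_objects, d_in), pred_vecs: (num_edges, d_in)
        # edges: (num_edges, 2) long tensor of (subject_idx, object_idx)
        s, o = edges[:, 0], edges[:, 1]
        triplet = torch.cat([obj_vecs[s], pred_vecs, obj_vecs[o]], dim=1)
        g_s, g_p, g_o = self.mlp(triplet).split(self.d_out, dim=1)

        # Updated predicate embedding, combined as in Equation (2).
        new_pred = g_s + g_p + g_o

        # Pool candidate vectors back onto nodes (average over incident edges),
        # mirroring the candidate sets V_i^s and V_i^o described above.
        new_obj = torch.zeros(obj_vecs.size(0), self.d_out)
        counts = torch.zeros(obj_vecs.size(0), 1)
        new_obj.index_add_(0, s, g_s)
        new_obj.index_add_(0, o, g_o)
        counts.index_add_(0, s, torch.ones(len(s), 1))
        counts.index_add_(0, o, torch.ones(len(o), 1))
        new_obj = new_obj / counts.clamp(min=1)
        return new_obj, new_pred
```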

3.3. Overview of Scene-Graph-Based Image Generation Approaches

To date, there have been multiple attempts to generate images from scene graphs using different models, such as those developed in refs. [58,59,60]. Methods such as GANs, diffusion models, and graph transformers have been utilized in this domain. Taken together, these efforts establish image generation from scene graphs as a distinct and significant field of study. The purpose of this section is to summarize the previous research on generating images from scene graphs and examine the various approaches utilized.

3.3.1. Image Generation from Scene Graphs and Layouts

Generating images from scene graphs is a challenging task that requires considerable effort to represent identifiable objects in complex scenes. GCNs are widely used in many ways for image synthesis from scene graphs. An early attempt in this field was an image retrieval framework that used a scene-graph-based method [8]. Expanding upon this, scene-graph-to-image (SG2IM), an innovative technique for generating images from scene graphs, was introduced [32]. The SG2IM approach tackles the difficulties conventional natural language or textual description approaches face in maintaining semantic entity information while generating images. Moreover, generating images from scene graphs is commonly classified as conditional image generation, meaning that the generation of the image depends on extra specified information. The innovative study in ref. [32] highlighted the importance of scene graphs in providing detailed information about the many objects in the foreground. Their approach utilized a GCN to propagate scene graph information along graph edges. The researchers predicted bounding boxes and segmentation masks for each object and used these predictions to generate a layout of the scene. Then, they employed a cascaded refinement network (CRN) to convert this layout into the final image [61].
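As a rough illustration of the layout step described above, the snippet below composes predicted bounding boxes and per-object masks into a coarse scene-layout tensor that a refinement network could then turn into an image; the tensor shapes and the bilinear mask-placement strategy are assumptions made for illustration, not the exact SG2IM implementation.

```python
import torch
import torch.nn.functional as F

def compose_layout(obj_embeds, boxes, masks, height=64, width=64):
    """Scatter object embeddings into a (D, H, W) layout using boxes and masks.

    obj_embeds: (N, D) per-object feature vectors from the GCN
    boxes:      (N, 4) normalized (x0, y0, x1, y1) bounding boxes
    masks:      (N, m, m) soft segmentation masks in [0, 1]
    """
    n, d = obj_embeds.shape
    layout = torch.zeros(d, height, width)
    for i in range(n):
        x0, y0, x1, y1 = (boxes[i] * torch.tensor([width, height, width, height])).long().tolist()
        w, h = max(x1 - x0, 1), max(y1 - y0, 1)
        # Resize the object's mask to its box and weight its embedding by the mask.
        mask = F.interpolate(masks[i][None, None], size=(h, w), mode="bilinear",
                             align_corners=False)[0, 0]
        layout[:, y0:y0 + h, x0:x0 + w] += obj_embeds[i][:, None, None] * mask
    return layout  # fed to a refinement network (e.g., a CRN) to produce the image
```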
Conventional approaches often encoded the scene graph into a vector representation and subsequently decoded an image from this vector. However, this process sometimes led to the loss of spatial details and challenges in handling intricate object relationships. To address these difficulties, an innovative method was introduced that generates images directly from the scene graph, improving the depiction of spatial relationships [32]. GCNs can be classified into two main types, (i) spectral and (ii) spatial GCN. The spectral GCN seeks to construct a deep learning framework with less complexity by integrating a graph estimation technique designed explicitly for classification tasks [62]. This method utilizes spectral theory to examine graphs’ structures by analyzing the graph Laplacian’s eigenvalues and eigenvectors. Conversely, the spatial GCN extends the principles of classic CNNs and propagation models [63]. It enhances traditional CNNs to effectively process graph data by converting these data into convolution operations that are aware of the underlying structure. These operations can be applied in both Euclidean and non-Euclidean settings. This modification enables spatial GCNs to operate directly on the graph nodes and their surrounding areas, thereby maintaining the intrinsic spatial connections present in the data. The authors of ref. [58] presented a new method that differs from the methodology proposed in ref. [32] in three notable aspects. Firstly, they employed external object crops as anchors to direct the process of generating images.
Additionally, a crop refining network was employed to convert layout masks into finished images, improving the visual consistency and level of detail in the created images. Finally, they implemented a crop selector, an automated system that selects the most suitable crops from a database of objects. This guarantees that the chosen elements are ideally matched for the generated scene. This approach provides improved control and precision in generating images that are more precise and suitable to the given context. A method that uses recurrent neural networks to generate images from scene graphs in an interactive manner was presented, and this technique improves the capacity to preserve image content while enabling incremental alterations over three increasingly complex stages [64]. The scene graph is expanded during each stage by adding more nodes and edges. This expansion provides the GCN with more specific information, which is used to generate a layout.
The architecture is an extension of SG2IM, which utilizes a scene layout network (SLN) to make predictions about bounding boxes. This method integrates the GCN with an adversarial image translation strategy to generate images without the requirement for supervision. However, improvement is needed, as the generated images often lack sharpness, and the described objects may not consistently match the input scene graphs. Expanding upon previous progress, the authors of ref. [65] enhanced the ref. [32] model by incorporating a scene context network. This innovation integrates a context-aware loss to advance image matching. It offers two novel metrics for evaluating the agreement of generated images with scene graphs: (i) relation score and (ii) mean opinion relation score. These developments are designed to enhance the accuracy and relevance of the generated images to their associated scene graphs. However, GCN-based approaches have limitations, including handling links between characteristics and reliably recognizing the correct relations. For example, the triplets ⟨man, right of, woman⟩ and ⟨woman, left of, man⟩ describe the same arrangement, but they usually result in dissimilar visual interpretations. To tackle these difficulties, a technique that utilizes attribute interactions by preserving the graph’s information via a canonicalization process while generating images from scene graphs was proposed [66]. This method guarantees that the spatial and relational consistency of the scene graph is preserved in the generated images, thus improving both the precision and the visual harmony of the outcomes. Scene layouts function as intermediary representations derived from scene graphs throughout image generation.
Nevertheless, a straightforward approach was proposed to generate images from such layouts without requiring humans to define scene graphs. First, specific bounding boxes and object categories are defined; then, a varied collection of images is generated from these general patterns [59]. In ref. [60], the authors improved their previous work by adjusting the loss functions and strengthening the object feature map module with object-wise attention in their framework. T. Sylvain et al. [67] presented a technique for generating images from layouts that emphasizes objects. They incorporated aspects of scene-graph-based retrieval to improve the precision of the layouts. Their approach, termed object-centric GAN (OC-GAN), integrates layout-based image generation with scene graph analogies to obtain spatial representations of objects inside a scene layout, resulting in a hybrid generation process. Nonetheless, a limitation of the scene-layout-to-image (SL2IM) generation method is that, while the generated images may initially appear realistic and adhere to the input patterns, closer examination often reveals a lack of contextual understanding and position sensitivity.
To tackle these difficulties, a context-aware feature transformation module was introduced. This module augments the position sensitivity of objects in the generated images by updating each object’s features and using a Gram matrix to capture the inter-feature correlations of the feature maps [40]. This method aims to create visuals that adhere to the layout while maintaining contextual consistency and spatial awareness. Table 1 gives a summary of these methods.
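To make the Gram-matrix step concrete, the following small sketch computes inter-channel feature correlations for an object feature map; it is a generic Gram-matrix computation of the kind used for style and context modeling, not the exact module of ref. [40].

```python
import torch

def gram_matrix(feature_map: torch.Tensor) -> torch.Tensor:
    """Inter-channel correlations of a (C, H, W) feature map."""
    c, h, w = feature_map.shape
    flat = feature_map.view(c, h * w)   # each row: one channel's responses
    return flat @ flat.t() / (h * w)    # (C, C) correlation matrix

# Example: correlations among 8 feature channels of a 16x16 object feature map
features = torch.randn(8, 16, 16)
print(gram_matrix(features).shape)  # torch.Size([8, 8])
```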

3.3.2. GANs for Image Synthesis

GANs have been widely used for scene-graph-based image generation due to their ability to generate high-fidelity images. The fundamental approach involves using graph convolutional networks (GCNs) to encode the scene graph structure and subsequently passing this information to the GAN framework. The GCN processes the nodes (objects) and edges (relationships) in the scene graph, transforming them into meaningful feature representations. These representations are then used to generate an intermediate scene layout, which acts as a structured blueprint for the final image synthesis.
In a conventional GAN configuration, a generator produces an image while a discriminator assesses the image’s validity (Figure 4). The versatility of GANs has led to significant advancements across multiple fields, including image generation [32,77,78], video prediction [79,80], texture synthesis [81,82], natural language processing [83,84], and image style transfer [85]. GANs excel at conditional image synthesis, enabling the generation of images dependent on specified inputs, such as category labels. This feature improves the clarity and precision of the generated images, as evidenced in refs. [86,87]. Furthermore, the discriminator can be engineered to not only differentiate between authentic and synthetic images but also to forecast the labels of these images [14,88,89]. GANs have played a significant role in the progress of image synthesis, generating synthetic images of superior quality compared to earlier technologies and establishing themselves as a benchmark for cutting-edge techniques [90,91,92]. This capability has created new opportunities for practical applications that depend on precise and comprehensive artificial representations of real-world objects. Although GANs provide cutting-edge results in image generation, their training process can be significantly unstable. To tackle this difficulty, the Wasserstein GAN (WGAN) [93] provides an enhancement to the conventional GAN model [94]. WGAN improves learning stability and reduces mode collapse during training using the Wasserstein distance metric, which ensures more dependable gradient behavior.
Furthermore, the stability issue is tackled by modifying the training method. This approach entails progressively refining the discriminator over a predetermined number of iterations, allowing the generator to adapt to a more stable adversary [95,96]. This leads to more uniform training dynamics. Progressive GANs and BigGAN are adept at producing high-resolution and high-fidelity images for many purposes. The progressive GAN methodically integrates additional layers into the network. This method commences with low-resolution images and progressively proceeds to greater resolutions, improving details throughout the training procedure. This helps the model understand the image’s structure on a wide scale before focusing on finer details, improving the speed and stability of the training process. BigGAN [14] was specifically designed to train efficiently on large datasets, such as ImageNet. This enables it to produce exceptionally detailed, high-resolution images. This model employs larger batch sizes and more complex structures, pushing the limits of GAN capabilities in terms of both image quality and resolution. GANs are essential to scene-graph-to-image synthesis; they rely on adversarial training and comprise two neural networks: the discriminator and the generator. The generator uses a scene graph as the input guide to produce images that closely match real ones, aiming to be indistinguishable from them.
Throughout the process, the discriminator functions as a judge, evaluating the authenticity of both real and generated images. It provides essential feedback to the generator by assessing the quality and realism of the generated images. The simultaneous training of these networks fosters a competitive environment, where the discriminator continuously improves its ability to distinguish synthetic images from real ones, while the generator adapts to create increasingly realistic outputs that can “fool” the discriminator. This adversarial learning dynamic enhances the generator’s ability to generate high-quality, photorealistic images, while ensuring that the generated images align with the given scene graph input. GANs offer several advantages, including their capacity to generate visually compelling images with high fidelity. Additionally, they can effectively learn complex image distributions, enabling them to produce diverse and realistic outputs that capture intricate details within a scene. Some notable advantages of GANs are as follows:
GANs generate images in a single forward pass, making them significantly faster compared to diffusion models, which require iterative refinement.
Unlike transformers, which require quadratic attention computations, and diffusion models, which require multiple denoising steps, GANs are computationally lightweight once trained.
The discriminator in GANs ensures high-quality, photorealistic images by pushing the generator to refine details, textures, and object boundaries.
However, GANs have limitations despite their popularity. Mode collapse occurs when GANs generate a limited number of different images, focusing on a few data distribution patterns. Hyperparameter sensitivity makes GAN training unstable, causing mode dropout or oscillation.
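The adversarial setup described in this subsection can be summarized in the following minimal training-loop sketch, where a generator conditioned on a scene-graph-derived layout competes with a discriminator. The `generator` and `discriminator` modules, their call signatures, and the `z_dim` attribute are placeholder assumptions, so this is an illustrative pattern rather than any specific published model.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, layout, real_images):
    """One adversarial update for layout-conditioned image generation."""
    batch = real_images.size(0)
    noise = torch.randn(batch, generator.z_dim)  # assumes the generator exposes z_dim

    # --- Discriminator step: push real images toward 1 and generated images toward 0 ---
    fake_images = generator(layout, noise).detach()
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real_images, layout),
                                                 torch.ones(batch, 1)) +
              F.binary_cross_entropy_with_logits(discriminator(fake_images, layout),
                                                 torch.zeros(batch, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Generator step: try to make the discriminator output 1 for generated images ---
    fake_images = generator(layout, noise)
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake_images, layout),
                                                torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```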

3.3.3. Diffusion Models for Image Synthesis

Diffusion models, as described in recent studies [18,21,30], belong to a distinct category of generative models that differ from conventional methods, like GANs. In contrast to traditional models that generate data from random noise in a single step, diffusion models function by incrementally introducing and then removing noise through multiple iterative steps. The core concept of diffusion models involves starting from an image consisting of pure random noise. The model then iteratively improves the image by applying a series of adjustments that systematically reduce noise levels. Each modification aims to eliminate noise while conserving and enhancing the fundamental characteristics of the emerging image. Diffusion models facilitate the development of intricate and lifelike images by meticulously controlling the emergence of details from randomness. This method provides a robust alternative to the one-step generation technique utilized in other generative models, presenting an enhanced approach to image generation that accurately captures intricate patterns and textures. The image generation process via diffusion proceeds through a series of stages known as diffusion timesteps. Each timestep incrementally improves the image until it matches the intended goal image, as illustrated in Figure 5. Diffusion models present a promising and innovative approach for image production in the realm of scene-graph-to-image synthesis. These models seek to generate visuals that are both lifelike and contextually consistent, and their applications are particularly valuable in this domain. They are conditioned on scene graphs, enabling them to create images that precisely depict particular scene descriptions. This conditioning guarantees that the objects in the resulting image closely follow the predefined qualities and relationships described in the scene graph.
To effectively generate scene-graph-conditioned images, diffusion models must first encode the structured information from the scene graph. This is commonly achieved using graph neural networks (GNNs) or CLIP-based embeddings. GNNs process the nodes (objects) and edges (relationships) of a scene graph, transforming them into latent feature representations that preserve spatial and relational structure. These feature vectors then serve as conditioning inputs during the image generation process. Alternatively, some models, such as SceneDiffusion [97], leverage CLIP embeddings to extract high-level semantic relationships from the scene graph, bypassing the need for explicit layout prediction while still preserving the scene structure. Once the scene graph has been encoded, its information is integrated into the denoising process at multiple stages. One common approach is latent conditioning via cross-attention, where scene graph embeddings are injected into the diffusion model through cross-attention layers. This mechanism allows the model to establish meaningful correspondences between objects and their spatial relationships, ensuring that the generated images remain structurally consistent with the input graph. Additionally, methods such as SceneGenie [72] refine diffusion-based image synthesis by incorporating explicit bounding box constraints, guiding the model to generate images with objects placed in semantically appropriate locations. Unlike GANs, which produce images in a single forward pass, diffusion models iteratively refine images by introducing scene graph relationships progressively, making them more robust to inconsistencies and enabling better control over spatial and semantic alignment.
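A highly simplified sketch of this conditioning pattern is shown below: scene-graph embeddings act as keys and values in a cross-attention layer inside the denoising network, and sampling iteratively removes noise. The network stubs, the `denoiser` signature, and the crude update rule are assumptions chosen for illustration; real systems use full U-Net or transformer denoisers and carefully derived noise schedules.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Inject scene-graph embeddings into image features via cross-attention."""
    def __init__(self, dim: int, cond_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, graph_tokens):
        # img_tokens: (B, N_pixels, dim); graph_tokens: (B, N_nodes + N_edges, cond_dim)
        attended, _ = self.attn(self.norm(img_tokens), graph_tokens, graph_tokens)
        return img_tokens + attended

@torch.no_grad()
def sample(denoiser, graph_tokens, shape, steps=50):
    """Toy reverse-diffusion loop: start from noise and repeatedly denoise."""
    x = torch.randn(shape)
    for t in reversed(range(steps)):
        t_frac = torch.full((shape[0],), t / steps)
        pred_noise = denoiser(x, t_frac, graph_tokens)  # conditioned on the scene graph
        x = x - pred_noise / steps                      # crude update; real samplers follow
                                                        # a derived noise schedule
    return x
```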
The iterative method of reducing noise in diffusion models guarantees that the images seem realistic and remain consistent with the underlying scene graph. With each timestep, the image’s coherence to the original scene is gradually enhanced, resulting in improved detail and accuracy. The ability to adjust the number of diffusion timesteps provides a parameter that may be controlled to find a balance between image quality and computing efficiency. Increasing the number of timesteps leads to improved image quality since the model has more chances to enhance and fine-tune the details of the image.
Consequently, this increases the demand for computational resources. Users can adjust the timesteps based on their particular needs and available resources, enabling the customization of diffusion models for various scenarios. Diffusion models are especially advantageous in fields necessitating the generation of intricate and contextually precise graphics, including virtual reality, simulation-based training, and sophisticated graphical content creation. The capacity to generate visually realistic images that align with predicted patterns and characteristics offers substantial advantages for developers and artists in various domains. Diffusion models exhibit a high level of robustness against noise, making them highly advantageous for generating images from scene graphs that are either incomplete or noisy. Their strong resilience makes them very suitable for managing the uncertainties encountered in real-world data. Diffusion models [96] function by reversing a controlled process that gradually converts clean data into a noisy state; this formulation was initially described in ref. [98] and then improved upon in refs. [24,99]. The ability to generate image samples of exceptional quality and variety has been aided by diffusion models [23,100,101]. Diffusion models excel at conditional image synthesis, allowing them to effectively employ different types of user input, such as texts or images, to direct the process of generating images. This adaptability is accomplished by including conditional information directly into the model in various ways. Present diffusion models commonly incorporate auxiliary classifiers [20,21,25,102] to handle this integration. The methods used vary from classifier guidance [20,21] to classifier-free guidance [98], where the model dynamically learns to either utilize or ignore additional data. Furthermore, latent diffusion models (LDMs) [26,103] are a significant improvement in the effectiveness of training diffusion models for high-resolution images. In addition, they utilize cross-attention techniques [104] to integrate conditional information when generating samples. This improves the model’s capacity to provide intricate and contextually suitable images based on the provided conditions. Including conditioning information in this way makes more exacting control over the resulting content possible. This ensures that the final images closely adhere to both the user’s expectations and the intrinsic traits of the input scene graph. The authors of [21] developed a technique for generating images based on certain conditions. They utilized diffusion models guided by classifiers, with the classifier guiding the diffusion model during sampling. This technique has been applied in different contexts. For example, a diffusion model was used alongside a low-resolution image to generate a high-resolution version [100].
Another application uses a low-quality image as a starting point and enhances image details, such as colors and textures [29]. SGDiff [70] addresses the difficulty of generating realistic images from textual descriptions and scene graphs. Conventional approaches frequently depend on manually constructed scene layouts to direct image generation. Nonetheless, these layouts may fail to adequately capture the alignment between the generated images and the original scene graphs, leading to suboptimal outputs.
To address this problem, the authors propose a novel approach that obtains scene graph embeddings that are precisely matched with their corresponding images. This procedure entails utilizing a pre-trained encoder to extract both comprehensive and specific information from scene graphs, which may accurately predict the corresponding images. The technique utilizes masked autoencoding loss and contrastive loss to optimize these embeddings. It emphasizes the importance of accurately learned embeddings for achieving precise alignment between graph and image semantics. Meticulous ablation investigations demonstrate the model’s more excellent performance over existing approaches. Unlike traditional methods that rely on intermediate layout representations, the model more accurately reproduces the structures provided by scene graphs.
Moreover, SGDiff provides functionalities for semantically manipulating images through scene graph alterations, showcasing its versatility and practical usefulness in image generation tasks. SceneGenie is another method that directs the sampling process of diffusion models during image generation. This approach employs bounding box and segmentation information as guidance during the reverse sampling phase [69]. A GNN utilizes structured text prompts, like scene graphs, to predict this information. This guidance significantly improves the regions of interest using gradients obtained from the predicted bounding box and segmentation maps, allowing for more accurate and contextually relevant image generation. Some notable advantages of diffusion models are as follows:
Diffusion models generate highly detailed and realistic images, often outperforming other generative models in quality.
They provide great flexibility in conditioning, allowing a wide range of images to be generated from the same input by adjusting the noise levels and conditioning signals.
It is also essential to recognize that diffusion models have limitations, which include the following:
A significant limitation is the high computational demand, as generating an image involves numerous diffusion steps, resulting in a slower and more resource-intensive process than alternative methods.
Although diffusion models are highly effective, effectively incorporating specific structured data, such as scene graphs, into these models can be challenging and is currently an active area of research.

3.3.4. Graph Transformers for Image Synthesis

Initially developed for natural language processing, transformers have exhibited exceptional adaptability and have been successfully applied in other domains, including computer vision. Transformers are particularly effective in scene graph image generation because they can process intricately structured data, such as scene graphs. Scene graphs provide rich descriptions of scenes using nodes representing items and edges representing interactions between them. The structure of scene graphs, which methodically arranges objects and their relationships, aligns effectively with the transformers’ capacity to represent intricate interactions among various scene parts. The ability to do semantic parsing of scene graphs is essential, as transformers can extract vital information regarding objects, their characteristics, and relationships. Thoroughly analyzing this information is crucial for precisely aligning the descriptive data of the scene graph with the process of generating the image. Once the scene graph has been parsed and understood, transformers utilize the organized data to direct the image’s generating process. They excel in generating images that closely match the given scene descriptions due to their capacity to adjust the generation process based on the parsed data. Positional encoding is a crucial element in transformers that enables the model to consider and comprehend the spatial connections between items in a scene graph. This understanding is essential in guaranteeing that the images generated correspond to the original scene’s descriptive accuracy and spatial arrangement. One of the key advantages of transformers in this domain is their ability to parse and encode structured scene information as a sequence of tokens. Instead of relying on localized feature aggregation, transformers utilize multi-head attention mechanisms to weigh relationships across the entire graph. This approach avoids the limitations of GCNs, which may struggle to propagate information effectively in dense graphs with many interconnections.
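As a rough illustration of this idea, the sketch below linearizes a scene graph into a token sequence (object and relation tokens from the graph’s triplets) and encodes it with a standard transformer encoder; the vocabulary, embedding sizes, and sequence format are assumptions chosen for clarity rather than the scheme of any particular paper.

```python
import torch
import torch.nn as nn

class SceneGraphEncoder(nn.Module):
    """Encode a linearized scene graph with self-attention."""
    def __init__(self, vocab_size: int, dim: int = 128, heads: int = 4, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(256, dim)  # learned positional encoding over token positions
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)

    def forward(self, token_ids):
        # token_ids: (B, L) sequence such as [man, holding, frisbee, man, at, goalpost]
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.embed(token_ids) + self.pos(positions)
        return self.encoder(x)  # (B, L, dim) contextualized graph tokens

# Toy usage with a hypothetical 5-word vocabulary: man=0, frisbee=1, goalpost=2, holding=3, at=4
tokens = torch.tensor([[0, 3, 1, 0, 4, 2]])
encoder = SceneGraphEncoder(vocab_size=5)
print(encoder(tokens).shape)  # torch.Size([1, 6, 128])
```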
Transformers have not only brought about a significant change in the field of natural language processing, but they have also made substantial progress in other domains, such as vision with vision transformers (ViTs) [105], reinforcement learning [106,107], and meta-learning [108]. Their remarkable capacity to scale with larger model sizes, computing resources, and data is especially prominent in the language field [109], in their utilization as ViTs, and in the case of generic autoregressive models [110]. The scalability and versatility of these models make them suitable for use as foundation models in several domains; they have the potential to replace domain-specific designs and establish new standards for performance and adaptability. Beyond language processing, transformers have shown their versatility and effectiveness through applications in autoregressive pixel prediction and image synthesis [111,112,113]. These applications have included training autoregressive models [114,115] and masked generative models [99,116] on discrete codebooks [117]. As autoregressive models, they scale to as many as 20B parameters [118]. The transformer architecture has also been successfully applied to scene-graph-to-image generation [67]. This approach encodes scene graphs using transformers, concentrating first on generating a layout from the graph. GCNs that encode the scene graph are then integrated with this architecture to condition the image generation process. To improve scene composition, ref. [67] presented novel methods, including the use of edge characteristics and the Laplacian matrix. A new class of diffusion models built on the transformer architecture, the scalable diffusion models with transformers (SDMT), extends the use of transformers in generative modeling. SDMT [119] replaces the conventional U-Net backbone with transformers that operate on latent patches of images. This approach aims to investigate and clarify the influence of architectural choices on diffusion models and establishes empirical foundations for future generative modeling studies. Significantly, the results from ref. [118] showed that the standard transformer topology can successfully substitute for the inductive bias of the U-Net, preserving the success of diffusion models. Some advantages of transformer models are as follows:
Unlike GANs, which primarily focus on local features, transformers capture global relationships using self-attention mechanisms.
Transformers process the entire scene graph at once, ensuring that spatial and semantic relationships are well preserved.
Rather than relying on fixed local aggregation, transformers use attention mechanisms to learn generalizable representations, making them more adaptable to unseen scene structures.
Transformers can maintain correct spatial positioning between related entities.
Assessing the computational efficiency and memory usage of different generative models is crucial for understanding their practical applications. The performance of these models varies significantly based on their underlying architectures, which affects their feasibility in real-time scenarios and large-scale applications.
GANs are generally the most computationally efficient among the three families of generative models. Their image synthesis process is performed in a single forward pass, making them well suited for real-time applications, such as interactive content creation and rapid image synthesis. The primary computational challenge with GANs lies in their training instability, requiring extensive hyperparameter tuning to avoid issues like mode collapse [120]. However, once trained, GANs offer the fastest inference times with moderate memory requirements, as they do not need iterative refinement steps.
Transformer-based models introduce higher computational costs due to their reliance on self-attention mechanisms, which scale quadratically with image resolution. While transformers provide superior relational reasoning and excel in structured image generation tasks, their high memory usage limits scalability. Training these models requires extensive GPU resources, and inference times are slower than GANs due to the necessity of computing long-range dependencies. Despite these limitations, transformers are advantageous when generating images that require complex object interactions and logical consistency in scene-graph-based representations.
Diffusion-based models deliver state-of-the-art image quality but are computationally the most expensive. Unlike GANs and transformers, diffusion models generate images through multiple iterative denoising steps [121], significantly increasing the inference time. The memory usage of diffusion models scales linearly with the number of diffusion steps, making them resource-intensive and less feasible for real-time applications. However, recent advancements, such as latent diffusion models (LDMs) and timestep reduction techniques, have improved computational efficiency, making diffusion models increasingly viable for practical applications. These optimizations allow diffusion models to generate high-fidelity images while reducing their computational footprint.
The choice of model architecture should be guided by the computational trade-offs involved. GANs offer the fastest synthesis but can struggle with relational complexity and diversity. Transformers provide better relational consistency but at a higher computational cost. Diffusion models ensure the highest image quality but remain the most computationally expensive. Future research should explore hybrid approaches that combine the efficiency of GANs with the structural consistency of transformers and the realism of diffusion models [119], ensuring scalability and performance optimization for large-scale and real-time applications. Table 2 presents a comparative analysis of different scene-graph-based image generation models, highlighting their structural frameworks, computational efficiency, and ability to generate diverse, high-quality images.

4. Method Comparison

This section defines the methodological explanations of image generation techniques employed in this review. Various methodologies have been established, mainly based on the foundational research of the SG2IM technique [32]. These methodologies often utilize a uniform input format (scene graphs) and concentrate on generating realistic graphics that faithfully represent the objects and their interrelations, as defined. Image generation from scene graphs requires a profound comprehension of computer vision and computer graphics, which entails converting structured scene descriptions into high-quality, varied visual representations. This review thoroughly compares different methodologies, assessing their advantages, limitations, and efficacy across diverse datasets and measures. The principal designs for SG2IM approaches can be classified into three core categories: generative adversarial networks (GANs), diffusion models, and transformer-based models.

4.1. GAN-Based Models

GANs are a popular method for generating images from scene graphs. These models use a generator–discriminator framework, where the generator creates images, and the discriminator checks how real they are. In scene-graph-to-image generation, conditional GANs (cGANs) are commonly employed. cGANs consider additional information, such as scene graphs, to guide the image generation process. The following are notable cGAN models that have advanced the field.

4.1.1. Image Generation from Scene Graphs (SG2IM)

The SG2IM methodology combines a graph convolutional network (GCN) with a conditional GAN-style architecture, conditioning image generation on the input scene graph. It utilizes advanced graphics and neural network methodologies to generate images based on scene graphs. The image-generating network $f$ requires a scene graph $G$ and a noise vector $z$ as inputs and produces an output image $I = f(G, z)$. A GCN performs computations on the scene graph $G$ and, crucially, produces embedding vectors for every object in the graph. These embeddings respect the object relationships in the scene graph. Predicting bounding boxes and segmentation masks anticipates the position and shape of each object, which aids in accurately positioning and defining them inside the image. Following this phase, the GCN processing stage generates a layout. This layout is an intermediary representation between the scene graph and the image output, in which objects are organized spatially using the predicted bounding boxes and segmentation information. The cascaded refinement network (CRN) iteratively improves the layout to generate the final image $I$. The network $f$ is trained using discriminators $D_{img}$ and $D_{obj}$ to ensure that it can recognize objects and generate realistic images. $D_{img}$ assesses the quality and authenticity of the whole image to enhance the realism of $I$. The object discriminator $D_{obj}$ assesses the authenticity of specific components within $I$ to determine the authenticity and identity of the objects. A learned embedding layer converts nodes (objects) and edges (relationships) in a scene graph into compact vectors.
The purpose of this layer is to convert the category labels on nodes and edges into informative and useful vectors, which the GCN then uses to better understand and represent the scene graph. To create an image from a layout, it is essential to describe the positions of the objects within the design accurately. The layout is then processed by a CRN composed of convolutional refinement modules. In the CRN, the scene layout is downsampled to each module’s input resolution, concatenated channel-wise with the output of the previous module, and passed through two 3 × 3 convolutional layers. A significant challenge is that, while SG2IM is effective at capturing object relationships, it struggles to maintain high-resolution detail and can be computationally intensive for larger scene graphs. Figure 6 depicts the standard procedure for creating images from scene graphs, utilizing techniques such as graph convolutional networks (GCNs).
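A minimal sketch of one such refinement module is shown below: the layout, resized to the working resolution, is concatenated channel-wise with the upsampled output of the previous module and passed through two 3 × 3 convolutions. Channel counts and the upsampling choice are illustrative assumptions rather than the exact CRN configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefinementModule(nn.Module):
    """One cascaded-refinement step: (layout, previous features) -> refined features."""
    def __init__(self, layout_channels: int, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(layout_channels + in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, layout, prev_features):
        # Double the working resolution, then resize the layout to match it.
        prev_up = F.interpolate(prev_features, scale_factor=2, mode="nearest")
        layout_rs = F.interpolate(layout, size=prev_up.shape[-2:], mode="bilinear",
                                  align_corners=False)
        return self.conv(torch.cat([layout_rs, prev_up], dim=1))

# Toy usage: a 128-channel layout refines 8x8 features into 16x16 features
module = RefinementModule(layout_channels=128, in_channels=64, out_channels=64)
layout = torch.randn(1, 128, 64, 64)
prev = torch.randn(1, 64, 8, 8)
print(module(layout, prev).shape)  # torch.Size([1, 64, 16, 16])
```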

4.1.2. PasteGAN

Reference [58] proposed PasteGAN, a technique that employs crop selection to improve control over image generation. Their contributions can be summarized as follows: (i) they devised a methodology in which scene graph items operate as crops, employing an external object-crop repository to direct image production; (ii) they established a crop refining network and an object–image fuser to enhance the visual quality of object crops in the final image-generating process; and (iii) they implemented the crop selector module of PasteGAN, which autonomously identifies the most appropriate crop for the task. PasteGAN utilizes scene graphs and object crops to generate images, primarily using an external memory tank to identify objects in input scene graphs. The training procedure of PasteGAN consists of two stages: the first stage reconstructs ground-truth images using the original crops $M_i^{ori}$, while the second stage generates images using the selected crops $M_i^{sel}$ from the external memory tank. The GCN utilizes entity contextual information from the scene graph to generate a latent vector $z$.
PasteGAN employs a crop selection technique to extract objects from scene graphs processed using the GCN, generating realistic images. A well-designed crop selector should differentiate between appropriate objects and visually similar candidates in the scene. The scene graphs for PasteGAN were generated using the pre-trained SG2IM-based GCN. PasteGAN’s crop refining network consists of two distinct components. Firstly, a crop encoder extracts the fundamental visual characteristics of objects. Secondly, an object refiner utilizes two 2D graph convolutional layers and combines several arrangements of cropped objects. The object–image fuser combines all object crops into a canvas in a latent space. Based on a cascaded refinement network, an image decoder creates an image from the latent canvas input, ensuring that the objects’ placements are respected. The discriminator stage utilizes image and object discriminators to train the image generation network adversarially, aiming to generate realistic and identifiable images. While effective at capturing object relationships, PasteGAN struggles to maintain high-resolution detail and can be computationally intensive for larger scene graphs.
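The crop-selection idea can be illustrated with the following toy sketch, which ranks candidate object crops from an external memory by cosine similarity between their visual features and the GCN-derived context embedding of the corresponding graph node; the feature extractors are placeholders, so this is only a schematic of the selection step, not PasteGAN’s exact crop selector.

```python
import torch
import torch.nn.functional as F

def select_crop(node_embedding: torch.Tensor, crop_features: torch.Tensor) -> int:
    """Pick the crop from an external memory that best matches a graph node.

    node_embedding: (D,)   context embedding of one object node from the GCN
    crop_features:  (K, D) visual features of K candidate crops in the memory tank
    """
    scores = F.cosine_similarity(crop_features, node_embedding.unsqueeze(0), dim=1)
    return int(scores.argmax())

# Toy usage with random features for one node and 5 candidate crops
node = torch.randn(256)
candidates = torch.randn(5, 256)
print("selected crop index:", select_crop(node, candidates))
```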

4.1.3. Learning Canonical Representations for Scene-Graph-to-Image Generation (WSGC)

Layout-based methods convert relation triplets such as (woman, right of, girl) into layouts and then into images. Given that scene graphs are constructed by people, errors such as inaccurate relationships are plausible, as shown in Figure 7. A limitation of previous SG2IM-based techniques is that they did not treat the triplets (woman, right of, girl) and (girl, left of, woman) as semantically equivalent. This semantic equivalence problem was addressed by an image generation method based on canonical scene graphs, which substitute for conventional scene graphs [65]. Canonical scene graphs are used to generate layouts in place of logically equivalent conventional scene graphs; they allow more compact models to be trained by sharing information across the network with fewer parameters, and canonicalization enhances the robustness of graphs and reduces the impact of noise. Scene graph canonicalization consists of two distinct processes: the canonical graph is computed by inferring transitive and converse relations, and a weighted canonical scene graph is produced using either exact weighted scene graph canonicalization (WSGC-E) or sampling-based weighted scene graph canonicalization (WSGC-S). Converting the canonical scene graph to an image then involves two steps. First, layout bounding boxes are predicted from the weighted scene graph using graph convolutional networks (GCNs). The WSGC then follows the same methodology as refs. [59,124] to generate images from layouts; the study in ref. [60] utilized novel loss functions to create images from layouts. The canonicalization process adds computational overhead, and ensuring that all relationships are captured accurately remains difficult, especially in highly detailed scenes.
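The sketch below illustrates the canonicalization idea in a simplified, unweighted form: converse relations are mapped to a single canonical direction and transitive relations are closed, so that logically equivalent graphs collapse to the same representation. The relation dictionaries are illustrative assumptions; WSGC additionally assigns weights to inferred relations.

```python
# Converse relations map to a single canonical form, so (woman, right of, girl)
# and (girl, left of, woman) yield the same canonical triplet.
CONVERSE = {"left of": "right of", "below": "above", "inside": "contains"}
TRANSITIVE = {"right of", "above"}

def canonicalize(triplets):
    canon = set()
    for s, r, o in triplets:
        if r in CONVERSE:                      # flip to the canonical direction
            s, r, o = o, CONVERSE[r], s
        canon.add((s, r, o))
    # Naive transitive closure: if (a, r, b) and (b, r, c), add (a, r, c).
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(canon):
            for (b2, r2, c) in list(canon):
                if r1 == r2 and r1 in TRANSITIVE and b == b2 and (a, r1, c) not in canon:
                    canon.add((a, r1, c))
                    changed = True
    return canon

g1 = [("woman", "right of", "girl")]
g2 = [("girl", "left of", "woman")]
print(canonicalize(g1) == canonicalize(g2))  # True: semantically equivalent graphs
```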

4.1.4. Image Generation from a Hyper-Scene Graph with Trinomial Hyperedges

Existing models represent only pairwise relations, so positional relations among three or more objects had not been addressed. Hyper-scene graphs with trinomial hyperedges address this deficiency of current scene-graph-to-image models by representing intricate positional relationships among three or more objects [73]. The model uses a layout-to-image model to generate higher-resolution images, further enhancing the quality and clarity of the generated images. It employs a generative adversarial network (GAN) setup, with a generator and discriminator optimized for hyper-scene graphs, trained with a combination of adversarial losses and a novel bounding box regression network for precise object localization. Traditional graph convolution networks were adapted and extended into a hypergraph convolution network to accommodate the complexity of hyperedges, and the approach was evaluated on the COCO-Stuff and Visual Genome datasets, showing improvements over existing methods in image quality and alignment with the input scene graphs. New metrics, such as the positional relation of three objects (PTO) and area of overlapping (AoO), are used alongside traditional metrics like the inception score to evaluate performance. While the model improves resolution and detail representation, the computational cost and complexity of training such models remain challenges.

4.1.5. Scene-Graph-to-Image Generation with Contextualized Object Layout Refinement

This work introduces a novel methodology that incrementally enhances a comprehensive layout description, thereby increasing the representation of inter-object interactions and mitigating overlap and coverage challenges. This method utilizes GCNs to analyze scene graphs and efficiently encode inter-object interactions. The COLoR model component improves object layout generation by contextualizing the position and shape of each object concerning others in the scene, ensuring that the layouts are more accurate and demonstrate extensive coverage with fewer overlaps. The strategy employs adversarial training and supervised losses to promote the creation of more realistic and diversified layouts. The layouts are improved through several iterations, each augmenting the detail and precision of object placements according to their contextual links. This strategy considerably enhances layout coverage and diminishes overlap relative to prior techniques. This results in improved precision and aesthetic quality in image production. The ultimate images generated with these enhanced layouts exhibit more authenticity and realism. Additional investigation into more effective layout refining methods and the possible incorporation of supplementary contextual information is required to improve the quality of the generated images. While augmenting the realism of generated images, the repeated refinement process can be computationally intensive, and accurate positioning continues to pose difficulties in intricate scenarios.

4.2. Transformer-Based Models

Transformer-based architectures have emerged as powerful tools for modeling complex dependencies in scene graphs. Their ability to capture long-range relationships and handle complex interactions between objects makes them particularly suitable for scene-graph-to-image generation. Unlike traditional convolutional neural networks (CNNs), which process images in a local, grid-like manner, transformers apply self-attention mechanisms to focus on both global and local object relationships within a scene.

4.2.1. Transformer-Based Image Generation from Scene Graphs

Transformer architectures avoid the instability problems commonly associated with the adversarial training techniques employed in prior models. The model first represents scene graphs using a transformer-based architecture called SGTransformer. This design utilizes multi-head attention to capture the graph's complex connections between nodes (objects) and edges (relationships). Unlike traditional transformers that use sequential positional encoding, this model incorporates the symmetric normalized Laplacian of the graph as a positional encoding, $L = I - D^{-1/2} A D^{-1/2}$. Here, $A$ is the adjacency matrix of the graph, and $D$ is the degree matrix, a diagonal matrix whose $i$-th diagonal element is the sum of the weights of the edges connected to node $i$. The term $D^{-1/2} A D^{-1/2}$ is the symmetric normalization of $A$, ensuring numerical stability in spectral-based graph processing. This approach helps preserve the graph's geometric structure and enhances the scene graph representation. The layout created from the scene graph is transformed into a sequence of discrete tokens using a vector quantized variational autoencoder (VQVAE): the layout information is passed through the VQVAE, which reduces the dimensionality of the image data and encodes them into a sequence of discrete tokens, i.e., quantized feature vectors that capture the essential information needed to reconstruct the image. The image transformer (ImT) utilizes these tokens to generate images. At this stage, the model autoregressively predicts image tokens conditioned on the layout, treating image generation as a sequence prediction problem. The ImT uses self-attention to focus on specific elements of the input sequence at different stages of generation, which is crucial for keeping the produced image coherent with the input layout. The SGTransformer predicts the layout required for image generation, and this conditioning ensures that the generated images are diverse and faithful reconstructions of the underlying scene graph. Because both components are learned in a supervised manner, training is more stable than the adversarial approaches commonly used in image tasks. Training uses several loss functions: a layout loss for the SGTransformer; reconstruction, quantization, and commitment losses for the VQVAE; and perceptual and discriminator losses to enhance the authenticity and visual quality of the generated images.
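The Laplacian positional encoding can be computed directly from the adjacency matrix, as in the following NumPy sketch; using the leading nontrivial eigenvectors of $L$ as per-node positional features is one common choice and is shown here as an illustrative assumption.

```python
import numpy as np

def laplacian_positional_encoding(A, k=2):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}; its leading
    nontrivial eigenvectors serve as structure-aware positional features."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    L = np.eye(A.shape[0]) - (d_inv_sqrt[:, None] * A) * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    return L, eigvecs[:, 1:k + 1]             # skip the trivial constant eigenvector

# Undirected adjacency matrix of a tiny 4-node scene graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
L, pe = laplacian_positional_encoding(A)
print(L.round(2))
print(pe.shape)   # (4, 2): a 2-D positional encoding per node
```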

4.2.2. Hierarchical Image Generation via Transformer-Based Sequential Patch Selection

In contrast to models that generate entire images simultaneously, ref. [66] introduced a methodology that combines parametric and non-parametric approaches. The system uses a transformer model to choose suitable image crops from an extensive collection and then generates realistic images based on scene graphs. The model first encodes the scene network to obtain semantic features advantageous for crop retrieval and position prediction. The sequential crop selection module (SCSM) improves crop compatibility by successively selecting crops using a transformer trained with contrastive learning. The progressive scene-graph-to-image module (PSGIM) guides the hierarchical generation of the final image based on the given crop. This involves utilizing a patch-guided, spatially adaptive normalizing module and gated convolutions to ensure the scene graph’s accuracy and the crop’s visual representation. The model is trained by employing a variety of loss functions, including contrastive loss for crop selection and a combination of reconstruction, perceptual, and adversarial losses for image generation. While the model shows promising outcomes, the computational complexity and the need for an extensive collection of pre-segmented object crops are potential limitations that could be addressed in future studies. Enhancements can be made to the crop selection and picture synthesis modules to generate image outputs that are both more realistic and varied.

4.3. Diffusion-Based Models

In machine learning, diffusion models are generative models that excel at producing high-quality data, particularly images. They work by progressively adding noise to training data and then learning to reverse this process to create new samples. This approach has made them a popular alternative to conventional generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs).
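As a brief illustration of the forward (noising) process that diffusion models learn to reverse, the following sketch adds Gaussian noise to an image batch under a standard linear variance schedule; the schedule values and image sizes are illustrative assumptions.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative signal retention

def q_sample(x0, t, noise=None):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    noise = torch.randn_like(x0) if noise is None else noise
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.rand(2, 3, 64, 64) * 2 - 1            # toy batch of images in [-1, 1]
xt = q_sample(x0, t=torch.tensor([10, 900]))     # lightly vs. heavily noised
print(xt.shape)                                   # torch.Size([2, 3, 64, 64])
```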

4.3.1. SceneGenie

Farshad et al. [69] presented SceneGenie, a novel method for improving text-conditioned image generation with diffusion models. Their work focuses on faithfully depicting intricate textual prompts, particularly generating the exact number of specified objects in an image. SceneGenie uses bounding box and segmentation map information during the diffusion model's sampling procedure; it requires no additional training data, instead exploiting semantic information from CLIP embeddings and geometric constraints to improve the resolution and accuracy of the generated image with respect to the scene graph. The system transforms textual prompts into structured scene graphs and enriches the nodes with CLIP embeddings to capture intricate object relationships and properties. The diffusion model is guided by bounding box coordinates and segmentation maps predicted from the scene graph; gradients are computed for each object using CLIP embeddings, and geometric constraints are enforced during image generation. The method operates at inference time without any supplementary training and applies to any diffusion model architecture, emphasizing flexibility and adaptability to existing models. SceneGenie demonstrates superior performance in generating high-quality, accurate images from scene graphs across two public benchmarks, outperforming existing methods in image quality and accuracy. The work presents qualitative and quantitative comparisons with state-of-the-art techniques, highlighting the ability to generate more representative scenes, especially when the number of object instances is specified. The method still faces challenges in generating high-quality images of complex structures, such as faces, and suffers from the high computational demands typical of diffusion models. Fine-tuning on more constrained datasets or employing off-the-shelf models for semantic segmentation and object detection could address some of these limitations.
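The following heavily simplified sketch conveys the spirit of gradient-guided sampling: at each reverse-diffusion step, the current estimate is nudged down the gradient of a constraint energy. The `denoiser` and `guidance_energy` functions are hypothetical placeholders standing in for the diffusion model and for SceneGenie's bounding-box, segmentation, and CLIP guidance terms; this is an illustration of the guidance mechanism, not the authors' implementation.

```python
import torch

def guided_denoising_step(x_t, t, denoiser, guidance_energy, scale=1.0):
    """One reverse-diffusion step with gradient guidance. `guidance_energy`
    (hypothetical) should be low when the current estimate respects the scene
    graph's boxes/segments and CLIP semantics; `denoiser` (hypothetical) is the
    unguided reverse step of any diffusion model."""
    x_t = x_t.detach().requires_grad_(True)
    energy = guidance_energy(x_t, t)            # scalar constraint violation
    grad = torch.autograd.grad(energy, x_t)[0]
    x_prev = denoiser(x_t, t)                   # unguided reverse step
    return x_prev - scale * grad                # nudge toward the constraints

# Toy usage with dummy components (illustration only).
dummy_denoiser = lambda x, t: x * 0.99
dummy_energy = lambda x, t: (x.mean() - 0.0) ** 2   # placeholder "constraint"
x = torch.randn(1, 3, 32, 32)
x = guided_denoising_step(x, t=torch.tensor([500]),
                          denoiser=dummy_denoiser, guidance_energy=dummy_energy)
print(x.shape)
```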

4.3.2. Diffusion-Based Scene-Graph-to-Image Generation with Masked Contrastive Pre-Training

The model in ref. [70], SGDiff, employs a diffusion-based architecture conditioned on scene graph embeddings. The embeddings are improved by an innovative masked contrastive pre-training method. By directly optimizing the alignment between scene graph embeddings and the corresponding images, this method resolves the limitations of previous systems that depended on intermediate scene layouts, achieving higher precision and detail in image generation. The model incorporates a scene graph encoder trained to produce embeddings that capture both local and global information about a scene graph. This is accomplished with two primary objectives. The masked autoencoding loss prioritizes local features by reconstructing masked image regions from the unmasked regions and the related scene graph embeddings; before training the diffusion model, the scene graph encoder is pre-trained by partially masking scene graph sections and verifying that the model can predict the masked sections from the visible ones, which helps it acquire robust and meaningful representations. The contrastive loss trains the encoder to distinguish images that adhere to the scene graph from those that deviate from it, improving the alignment of the global structure. A latent diffusion model then conditions on the scene graph embeddings to synthesize the final images; it operates in a reduced-dimensional latent space for efficiency and uses a U-Net-like architecture for diffusion. The encoder is first trained on graph–image pairs to optimize the alignment between its embeddings and the corresponding images by combining the two loss functions, and the diffusion model is trained separately in the latent space, conditioned on the embeddings produced by the pre-trained scene graph encoder. Although SGDiff shows notable progress, the complexity and computational demands of training and inference with diffusion models exceed those of some conventional approaches. Future research could investigate more efficient training methodologies or enhanced multi-modal data integration.
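A generic sketch of the global contrastive objective is shown below: matched scene-graph/image embedding pairs are pulled together and mismatched pairs are pushed apart with a symmetric InfoNCE-style loss. This is an illustration of the alignment idea, not SGDiff's exact formulation; the batch size, embedding dimension, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def graph_image_contrastive_loss(graph_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched (scene graph, image) pairs on the
    diagonal are pulled together, mismatched pairs pushed apart."""
    g = F.normalize(graph_emb, dim=1)
    v = F.normalize(image_emb, dim=1)
    logits = g @ v.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(g.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = graph_image_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```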

4.3.3. R3CD: Scene-Graph-to-Image Generation with Relation-Aware Compositional Contrastive Control Diffusion

Large-scale diffusion models have improved image generation, yet the entity interactions encoded in complicated scene graphs remain difficult for these models to capture. They encounter two significant obstacles: they cannot express precise and accurate interactions from abstract relation labels, and they often fail to compose complete entities, producing hazy representations of the intended scene. R3CD, a new approach, addresses these constraints [71]. This method leverages large-scale diffusion models to better learn and generate images from scene graphs. A scene graph transformer with node and edge encoding perceives local and global information from the input scene graphs, and initializing the transformer embeddings with a T5 model ensures robust graph processing and understanding. A joint contrastive loss defined over attention maps and denoising steps improves model performance and helps the diffusion model understand and generate images. This method increases the model's capacity to describe precise interactions and generate complete entities, producing more realistic and diverse visuals that match the relationships in the graph.

5. Dataset

In this section, this study presents the datasets most used in the scene-graph-to-image generation field. Figure 8 and Table 3 summarize representative datasets and their key statistics.
Visual Genome. This dataset comprises 108,077 scene graph annotated images with seven main components: objects, attributes, relationships, scene graphs, region descriptions, region graphs, and question–answer (QA) pairs [31]. Each image contains an average of 35 objects and 26 attributes (e.g., color (red) and states (sitting/standing)), and the relationships between two objects can be actions or spatial relations, e.g., jumping over, wearing, behind, and driving on. The scene graph of an entire image is constructed by combining localized graph representations of its regions. The region descriptions are natural-language sentences that describe a region of the scene. The objects, attributes, and relationships are combined through a directed graph to construct region graphs in the VG dataset. Furthermore, two types of QA pairs are associated with each image: (i) freeform QA and (ii) region-based QA. The dataset is pre-processed and then divided into training (80%), validation (10%), and test (10%) sets.
COCO-Stuff. This dataset comprises 164K complex scene images [68]. It contains 172 classes: 80 thing classes, 91 stuff classes, and 1 unlabeled class. The 91 stuff classes were curated by an expert annotator. The unlabeled class is used in two scenarios: first, when the label is not among the 171 predefined classes, and second, when the annotator cannot infer the pixel label. For the image generation task, this dataset provides 40K training images and 5K validation images with scene graphs and layouts. COCO-Stuff augments the COCO dataset with dense pixel-level annotations.
CLEVR (Compositional Language and Elementary Visual Reasoning) [125] is a procedurally generated dataset designed to assess visual reasoning capabilities in models by providing structured scenes with explicitly defined object attributes and spatial relationships. The dataset consists of 70,000 images in the training set, 15,000 in the validation set, and 15,000 in the test set. Each image is accompanied by corresponding question–answer pairs, totaling approximately 699,989 questions for training, 149,991 for validation, and 14,988 for testing. The scenes contain 3D-rendered objects characterized by four primary attributes: size (large or small), shape (cube, cylinder, or sphere), material (rubber or metal), and color (with eight possible values, including gray, blue, brown, yellow, red, green, purple, and cyan). This structured composition results in 96 unique object configurations. Unlike natural datasets, such as Visual Genome and COCO-Stuff, CLEVR offers a controlled setting that enables precise evaluation of structured scene-graph-based image generation. While it has primarily been used for visual question answering (VQA) and compositional reasoning, its well-defined relational structure makes it a potential benchmark for assessing how models handle spatial alignment and object relationships in generated images.

6. Analysis and Discussion

This section provides the experimental findings of the latest techniques evaluated on commonly used image-generating benchmarks, such as COCO-Stuff and VG. The accuracy and efficiency of each approach are assessed by comparing them using widely used metrics for measuring performance, such as the inception score (IS), Fréchet inception distance (FID), and diversity score.

6.1. Evaluation Metrics

There are multiple methods to assess the performance of authentic and generated images. This section provides an overview of widely accepted and commonly used methods for evaluating image generation models. Table 4 compares the various metrics.

6.1.1. Inception Score (IS)

The definition of the inception score is:
$IS(X;Y) = \exp\left( \mathbb{E}_{x}\left[ D_{KL}\left( p(y \mid x) \,\|\, p(y) \right) \right] \right),$
where $p(y \mid x)$ is the conditional label distribution, and $D_{KL}$ is the Kullback–Leibler divergence between two probability density functions. A high IS suggests that the model generates a meaningful image [90]. In addition to Equation (1), the IS may be defined using class label mutual information and generated samples:
$IS(X;Y) = \exp\{ I(X;Y) \},$
where $I(X;Y)$ represents the mutual information between X and Y. The inception score evaluates synthetic image quality—the greater the value, the better [90]. It classifies generated images using a pre-trained deep neural network. Its key goals are image quality and image diversity.
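A minimal NumPy implementation of Equation (1) from per-image class probabilities, assuming the probabilities have already been obtained from a pre-trained classifier, might look as follows.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """IS = exp( E_x[ KL(p(y|x) || p(y)) ] ), given per-image class
    probabilities p_yx of shape (N, num_classes)."""
    p_y = p_yx.mean(axis=0, keepdims=True)                        # marginal p(y)
    kl = (p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Toy example: 4 "images" classified over 3 classes.
p_yx = np.array([[0.90, 0.05, 0.05],
                 [0.05, 0.90, 0.05],
                 [0.05, 0.05, 0.90],
                 [0.34, 0.33, 0.33]])
print(inception_score(p_yx))   # sharper, more varied predictions -> higher IS
```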

6.1.2. Fréchet Inception Distance (FID)

The FID [126] metric quantifies the similarity between the distributions of generated images and the authentic images:
$FID = \left\| \mu_r - \mu_g \right\|^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right),$
where $\mu_r$ and $\Sigma_r$ are the mean and covariance of the feature vectors of the real images, $\mu_g$ and $\Sigma_g$ are the mean and covariance of the feature vectors of the generated images, $\mathrm{Tr}$ denotes the trace of a matrix, and $\|\cdot\|^2$ denotes the squared Euclidean distance. A lower FID indicates that the generated images are closer to the distribution of real images. The metric is sensitive to sample size, which can affect its reliability.
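Assuming Inception feature vectors have already been extracted for real and generated images, the FID can be computed as in the following sketch.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}),
    computed on feature vectors of real and generated images."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):        # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

feats_real = np.random.randn(200, 64)   # stand-ins for Inception activations
feats_gen = np.random.randn(200, 64) + 0.5
print(fid(feats_real, feats_gen))
```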

6.1.3. Diversity Score (DS)

The DS assesses the diversity of generated images by quantifying their dissimilarity. A superior score signifies increased diversity among generated outputs. This metric is essential for preventing generative models from producing similar or identical images, particularly when generating from complex scene graphs:
$DS = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} d\left( x_i, x_j \right),$
where $N$ is the number of generated images, and $d(x_i, x_j)$ is a distance metric (e.g., Euclidean distance) between image pairs.

6.1.4. Kernel Inception Distance (KID)

The KID assesses the distance between the distributions of authentic and generated images in a feature space. It uses a kernel function to capture variances in distributions, providing an unbiased estimate even with smaller sample sizes [120]. This makes it particularly useful for evaluating generative models when fewer samples are available. It is computed using the squared maximum mean discrepancy ($\mathrm{MMD}^2$) with a polynomial kernel function:
$KID = \mathrm{MMD}^2 = \frac{1}{m(m-1)} \sum_{i \neq j} k\left( \phi(x_i), \phi(x_j) \right) + \frac{1}{n(n-1)} \sum_{i \neq j} k\left( \phi(y_i), \phi(y_j) \right) - \frac{2}{mn} \sum_{i,j} k\left( \phi(x_i), \phi(y_j) \right),$
where $\phi(x_i)$ and $\phi(y_j)$ are feature representations from a pre-trained Inception model, $k(\cdot,\cdot)$ is the polynomial kernel, $x_i$ are the generated images, $y_j$ are the real images, and $m$ and $n$ are the numbers of generated and real images, respectively.
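The following sketch computes an unbiased estimate of this quantity from pre-extracted features, assuming the commonly used polynomial kernel; the kernel hyperparameters shown are illustrative defaults.

```python
import numpy as np

def kid(feats_gen, feats_real, degree=3, gamma=None, coef0=1.0):
    """Unbiased MMD^2 with the polynomial kernel k(a, b) = (gamma a.b + coef0)^degree,
    computed on pre-extracted features; gamma defaults to 1/feature_dim."""
    d = feats_gen.shape[1]
    gamma = 1.0 / d if gamma is None else gamma
    k = lambda A, B: (gamma * A @ B.T + coef0) ** degree
    m, n = len(feats_gen), len(feats_real)
    k_xx, k_yy, k_xy = k(feats_gen, feats_gen), k(feats_real, feats_real), k(feats_gen, feats_real)
    term_x = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))   # exclude i == j terms
    term_y = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(term_x + term_y - 2.0 * k_xy.mean())

feats_gen = np.random.randn(100, 64)
feats_real = np.random.randn(120, 64)
print(kid(feats_gen, feats_real))
```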

6.2. Discussion

This section provides a complete analysis and discussion of the results using existing frameworks. Results from evaluating state-of-the-art models on various datasets are emphasized. All strategies in this review are compared based on the data and metrics used to quantify performance. The results are classified based on the dataset used for experimental validation.
Visual Genome is an essential dataset for training the proposed techniques for generating images from scene graphs. The utilization of deep learning techniques enhances their efficacy. Table 5 presents the performance accuracy of each approach by measuring four metrics: IS, FID, DS, and KID. SG2IM [32] achieves the lowest scores, indicating a lack of image diversity and quality. SceneGenie [69] and R3CD [71] perform better than other models, with an IS of 20.25. This score indicates the exceptional capability to generate a wide range of unique images easily identified by the inception classifier. HIGT [66] and TBIGSG [67] exhibit satisfactory performance, although they still lag significantly behind SceneGenie. SGDiff [70] achieves a superior FID score of 16.6, which is notably lower compared to other approaches, such as PasteGAN [58] and HIGT [66]. This suggests that SGDiff generates images that closely resemble the distribution of authentic photos. SceneGenie demonstrates a robust FID score of 42.41, which complements its high IS and emphasizes its overall efficacy. For this dataset, methods such as R3CD and SceneGenie are optimal due to their focus on relational dynamics and high-resolution outputs. In contrast, PasteGAN may be more suitable for applications where fine control over individual object appearances is crucial, especially in artistic contexts.
COCO-Stuff is also an essential dataset for training the proposed techniques for generating images from scene graphs. HIGT [66] demonstrates a strong performance on COCO-Stuff, with an IS of 15.2, highlighting its robustness across diverse situations. SGDiff and TBIGSG exhibit notable scores, suggesting their efficiency in generating a wide range of images. SGDiff again demonstrates a robust performance, as seen by its FID score of 22.4. This indicates that it consistently generates high-quality images across many datasets. HIGT and TBIGSG have FIDs of 51.6 and 52.3, respectively, indicating competitive performance but not leading performance. SceneGenie performs exceptionally in both datasets, especially Visual Genome, by generating high-quality, diverse, realistic images. The high IS and KID emphasize its strength and efficiency. SGDiff demonstrates outstanding performance in terms of FID, demonstrating its strength in generating images that closely resemble the distribution of authentic images. This makes it particularly helpful for applications that demand high realism. HIGT exhibits robust diversity scores, indicating that its effectiveness when generating a wide range of images from a single input is essential.
The structural complexity of scene graphs significantly affects the performance of image generation models across datasets. Visual Genome consists of highly detailed scene graphs, with an average of 35 objects per image, each connected through multiple relationships. The complexity of these graphs creates challenges for models, as they must maintain semantic consistency across numerous entities while ensuring correct spatial alignment. GAN-based approaches, such as SG2IM and PasteGAN, often struggle with mode collapse when handling high-density relationships, leading to repetitive patterns and limited image diversity, as seen in Figure 9 and Figure 10. This occurs because the generator fails to capture the full distribution of diverse scene layouts, causing it to produce overly similar images. These models perform better in datasets with simpler relational structures.
In contrast, transformer-based methods, like HIGT and TBIGSG, leverage self-attention mechanisms to process long-range dependencies effectively. While they improve relational consistency, their high computational requirements make them less scalable for large datasets, like Visual Genome. Diffusion-based approaches, such as refs. [72,73] and SGDiff, offer an advantage in this context. By iteratively refining images over multiple denoising steps, these models progressively align object placements with scene graph constraints, resulting in improved spatial accuracy and visual realism.
COCO-Stuff, by comparison, presents a more moderate scene graph complexity, with an average of three to eight objects per image and fewer relationships. The lower relational density allows GAN-based models like PasteGAN and SG2IM to perform well, as they do not require extensive relational reasoning to generate high-quality images. The direct scene-to-layout mapping in these models is well suited to COCO-Stuff, where object placement is relatively straightforward. Transformer-based approaches still provide contextual improvements, but their computational overhead does not result in significantly better performance than diffusion-based methods on this dataset. Diffusion models, while effective, show less improvement over GANs due to the simpler scene graph structures. This suggests that for datasets with lower relational complexity, models that prioritize efficiency over deep relational modeling can perform competitively.
CLEVR, on the other hand, presents a different challenge. Unlike Visual Genome and COCO-Stuff, CLEVR consists of synthetic scenes with highly structured relationships between a small number of objects. This dataset is specifically designed for compositional reasoning, making it useful for evaluating how models handle explicit relational dependencies rather than natural scene diversity. Since the relationships in CLEVR follow a predefined set of rules, models trained on this dataset often excel at spatial alignment but may struggle with more diverse and ambiguous real-world images. GAN-based methods can generate visually convincing outputs due to the dataset’s structured nature, but they may lack the ability to generalize to more complex, unstructured scene graphs. Transformer-based models benefit from CLEVR’s structured relationships, effectively capturing relational dependencies with lower computational costs compared to datasets like Visual Genome. Diffusion models, while still applicable, offer less noticeable improvements since the dataset does not present the same level of relational complexity as Visual Genome.
These observations highlight how scene graph complexity critically impacts the effectiveness of image generation models. Datasets with high-density graphs, such as Visual Genome, benefit from diffusion-based approaches that iteratively refine relational constraints [72], while datasets with simpler structures, such as COCO-Stuff and CLEVR, allow GANs to generate visually convincing images without extensive relational modeling. However, while CLEVR provides a structured test environment, its limited use in scene-graph-to-image generation suggests that further research is needed to assess its suitability as a benchmark. Future work should explore adaptive scene graph conditioning techniques to optimize model performance based on dataset-specific graph structures, ensuring scalable and high-fidelity image synthesis across varying complexities.

Comparative Analysis of Image Generation Methods from Scene Graphs: Strengths and Limitations

Several techniques have shown unique approaches and capabilities in the developing field of image synthesis from scene graphs, each designed to address particular difficulties in converting structured representations into images. By methodically examining these approaches and pointing out their strengths and limitations, this review provides the detailed understanding necessary for progressing the subject; Table 6 summarizes the strengths and limitations of the methods. Across all methods, scene graphs are the foundation, serving as structured inputs that guide the generation of contextually aware images. This shared starting point highlights the field's emphasis on generating images that are not only realistic but also precise, adhering to predefined relationships and object characteristics. These techniques employ sophisticated deep learning frameworks built on architectures encompassing GCNs, transformers, and diffusion models. While these systems provide the foundation for understanding and parsing complex scene descriptions, they also have high computational requirements, demanding considerable resources for efficient training and inference.
SGDiff [70] directly optimizes the alignment between scene graphs and images without intermediate scene layouts, which enhances semantic accuracy. SGDiff [70] and IGHSGTH [73] focus on direct mappings from scene graphs to images, optimizing for semantic alignment and spatial accuracy; such approaches are highly valuable in applications requiring strict adherence to the input structure, such as digital storytelling and simulation environments. Alternatively, HIGT [66] assembles images patch by patch to maximize realism and detail in the final image, incorporating crop position information in the selection process and demonstrating enhanced results. PasteGAN [58] is a semi-parametric technique that uses external object crops to enhance image features; its crop selector dynamically selects object crops based on the scene context, enhancing the authenticity and diversity of the generated images. This technique mainly benefits product design and digital marketing, which require precise visual accuracy. SceneGenie [69] exhibits cutting-edge usage of diffusion models by integrating detailed bounding box and segmentation map guidance into the sampling process, setting new precision standards in applications such as medical imaging and detailed object interpretation in educational content. SGDiff and SceneGenie excel in generating high-fidelity images that are structurally and semantically aligned with the input scene graphs, making them ideal for augmented reality and precision-oriented applications. SG2IM [32] and COLoR [72] are noted for their efficiency and adaptability in real-time content generation, suitable for dynamic media production and interactive applications.
It is important to emphasize that specific methods appear to have superior performance. This difference in performance can be attributed to various factors, such as the underlying design of the architecture, the particular learning and training mechanisms used, the quality of the input data, and the extent to which these methods effectively incorporate contextual information from scene graphs. SceneGenie, which employs advanced diffusion models, is designed to handle high-resolution and complex scene descriptions effectively. The complexity exhibited by these models enables researchers to exert precise control over the image generation process, leading to the generation of images of superior quality. On the other hand, less complex architectures may face difficulties when dealing with intricate scene graphs or may not be able to capture subtle details, resulting in lower performance in generating realistic images.
Several methods utilize specialized components for specific tasks within the image generation pipeline. For instance, PasteGAN incorporates a crop refining network and an object–image fuser, which are purposefully developed to integrate external object crops seamlessly. This can enhance performance in tasks requiring precise control over objects’ appearance. The utilization of methods such as SG2IM involves the implementation of adversarial training. This technique has been shown to enhance the realism of generated images by iteratively refining them through a feedback loop, where generative and discriminative models engage in a competitive process. This method can lead to more photorealistic outputs compared to non-adversarial approaches. COLoR utilizes adversarial learning techniques to enhance the quality of object layouts in generated images. This approach aims to ensure that the generated images appear realistic and accurately capture the spatial and relational dynamics of the objects, as specified by the scene graph. Research has shown that models incorporating and leveraging contextual information from scene graphs tend to perform better. IGHSGTH, for instance, employs an innovative method to integrate multi-object relationships, thereby improving the spatial and relational accuracy of the resulting images.
In summary, the exceptional performance achieved by scene-graph-based image generation methods can be attributed to a combination of detailed architectural designs, advanced training mechanisms that effectively utilize adversarial learning and contextual data, high-quality input scene graphs, and their capacity to process and integrate intricate relational information from scene graphs precisely. The combination of these factors contributes to the generation of images that are more realistic, detailed, and contextually accurate.

7. Challenges

In recent years, the field of image generation has gained increasing attention within the deep learning community, owing to its influence on numerous applications. Yet every scientific research domain faces numerous hurdles, and image generation tasks are no exception. The difficulties in image generation can be analyzed according to the methods used to generate images; the goal here is to discuss the challenges of the various techniques used in image generation.

7.1. GANs

Generative adversarial networks (GANs) are a significant class of deep learning models frequently used for generating images. However, when applied to scene graphs, they present unique challenges that can hinder performance and output quality, as follows:
  • Training GANs can be particularly unstable when dealing with scene graphs. The generator and discriminator networks may struggle to converge, leading to inadequate performance. This instability is exacerbated when the complexity of scene graphs increases, as the generator must learn to interpret and generate diverse object interactions and spatial arrangements [126].
  • Mode collapse is a significant concern in GANs, especially in the context of scene graphs. When the generator learns to produce a limited set of images that deceive the discriminator, it may overlook the diverse range of object interactions specified in the scene graphs. This results in repetitive or homogeneous outputs that fail to capture the intended variety of scenes. In scene graph applications, mode collapse can manifest as a generator creating similar layouts or interactions across different inputs, leading to a lack of diversity in generated images [121].
  • Catastrophic forgetting occurs when learning new information negatively impacts previously acquired knowledge within the network. In scene graphs, as new object relationships are introduced during training, the model may forget how to generate scenes based on earlier learned relationships. This challenge is particularly pronounced when training on datasets with varying complexity and types of relationships.
  • Optimizing hyperparameters, such as learning rates, batch sizes, and regularization, becomes crucial in scene graphs. The impact of these parameters on image quality is significant; however, finding the right balance can be challenging due to the intricate nature of scene graph data. The need for fine-tuning increases with model complexity, making achieving stable and high-quality outputs difficult.
To stabilize training, integrating normalization techniques such as spectral normalization or batch normalization is essential [127,128,129]. These methods help balance the generator and discriminator during training but may require careful implementation to be effective with complex scene graph structures. Employing strategies such as weight accumulation [130] can help mitigate catastrophic forgetting by preserving previously learned knowledge throughout subsequent training stages [131]. Additionally, multi-generator setups or adaptive noise injection can enhance diversity in generated outputs but introduce complexities of their own. The success of GANs relies heavily on the quality and diversity of training data: inadequate representation of specific object interactions or spatial arrangements within scene graphs can lead to biased or incomplete image generation.
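As a concrete illustration of the spectral normalization mentioned above, the following PyTorch sketch wraps the layers of a small image discriminator with torch.nn.utils.spectral_norm to constrain their Lipschitz constant; the discriminator architecture itself is an illustrative assumption.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# A small image discriminator whose layers are wrapped with spectral
# normalization, a common way to stabilize adversarial training.
disc = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, 4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    nn.Flatten(),
    spectral_norm(nn.Linear(128 * 16 * 16, 1)),   # real/fake score
)

scores = disc(torch.randn(4, 3, 64, 64))
print(scores.shape)   # torch.Size([4, 1])
```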

7.2. Diffusion Models

Diffusion models have gained prominence in image generation because they produce high-quality outputs. However, when applied to scene graphs, they present unique challenges that can hinder their effectiveness and efficiency, as follows:
  • Diffusion models are inherently computationally expensive due to their iterative nature. Each image generation involves numerous forward and backward passes through the model, which slows both training and inference. This makes diffusion models challenging to deploy in real-time applications, where quick responses are necessary.
  • The training procedure for diffusion models requires meticulous noise reduction management throughout a sequence of stages. Achieving the right balance of noise levels is crucial for effective image generation, but it can be challenging. This often necessitates extensive trial and error with various configurations and parameters, leading to longer training times.
  • Like GANs, diffusion models are sensitive to hyperparameter selection, including the variance schedule of noise injected during training. Improper configuration can result in poor image quality or unsuccessful training sessions. For instance, the choice of how noise is added (linear vs. cosine schedules) can significantly impact model performance.
  • While diffusion models excel at generating high-quality images, maintaining this quality at higher resolutions or with more intricate scene graphs poses difficulties. The model may require additional iterations to produce clear and coherent visuals as input complexity increases, raising computational demands.
To mitigate some of these issues, developing adaptive algorithms that dynamically adjust noise levels based on the progress of image generation is a promising approach. Such methods aim to reduce the number of timesteps while preserving output quality [18,132], potentially accelerating training and inference. When integrating scene graphs into diffusion models, challenges arise from accurately interpreting complex relationships and interactions among objects. However, several optimization techniques have been developed to address this issue.
One widely adopted approach is quantization techniques, which involve reducing the precision of model parameters to lower memory consumption and the inference time. Techniques such as post-training quantization (PTQ) and quantization-aware training (QAT) [133] have shown promising results in reducing computational overhead without significant loss in image quality. Another effective method is diffusion distillation, where large diffusion models are compressed into lighter, more efficient models by distilling their knowledge into conditional GANs or other generative architectures. This technique significantly accelerates the inference time while preserving the high image quality characteristic of diffusion models [134]. Gradient guidance is another optimization method that allows fine-tuning of diffusion models toward specific optimization objectives, such as faster convergence or lower sampling costs, improving efficiency for real-time applications [135]. By integrating these optimization strategies, diffusion models are becoming increasingly suitable for real-time tasks, including interactive design, AI-assisted content creation, and high-speed visual synthesis [136].
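As a minimal illustration of the idea behind post-training quantization, the following sketch maps a floating-point weight tensor to int8 with a single symmetric scale; real PTQ/QAT pipelines for diffusion models are considerably more involved, so this is only a conceptual example.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor post-training quantization: map float weights to
    int8 with a single scale, illustrating PTQ's memory savings."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)     # a toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(q.nbytes / w.nbytes, np.abs(w - w_hat).max())  # 4x smaller, small error
```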

7.3. Transformer Models

Transformers have revolutionized various fields, including image generation, due to their ability to capture complex relationships within data. However, when applied to scene graphs for image generation, they face several specific challenges, as follows:
  • Transformers require substantial memory and storage because they process entire data sequences simultaneously. This limits the scalability of models with large scene graphs or high-resolution image synthesis. As scene graphs grow in size and complexity, memory requirements can exceed available resources, causing out-of-memory (OOM) issues during training or inference.
  • The self-attention mechanism in transformers considers interactions between all pairs of input and output elements, so computational complexity grows quadratically with sequence length. This can be particularly taxing in tasks involving complex scene graphs that require detailed spatial and semantic understanding; as a result, training transformer models can be time-consuming and resource-intensive.
  • Transformer models typically exhibit superior performance when trained on large datasets. However, obtaining large datasets with detailed annotations of scene graphs and equivalent images can be challenging. In scene-graph-to-image generation, this scarcity of annotated data can restrict the model’s efficacy in scenarios with incomplete information.
  • While transformers can capture intricate relationships within data, they often struggle to generalize from known data to unseen scenarios. If the training data do not adequately cover the range of possibilities present in real-world situations, the model may generate images that do not accurately depict new scene graphs or may misinterpret the spatial and semantic relations specified by these graphs. This limitation can lead to suboptimal results when generating images from novel or complex scene configurations.
  • The performance of transformer models is susceptible to hyperparameter selection, such as learning rates and attention mechanisms. Improper configuration can lead to poor image quality or ineffective training sessions. For instance, variations in attention patterns (local vs. global) can significantly impact how well the model learns from scene graph data.
Several strategies can be employed to address these challenges, such as implementing local attention mechanisms or sparse attention patterns that can reduce memory requirements by limiting the number of interactions considered during self-attention calculations. This approach helps alleviate some computational burdens while maintaining performance. Utilizing memory layers that lessen the need for complete pairwise self-attention calculations can also enhance efficiency without sacrificing output quality. Leveraging transformer models pre-trained on relevant tasks or datasets can improve generalization capabilities and reduce the labeled data required for practical training.
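As an illustration of the local attention idea, the following sketch restricts scaled dot-product self-attention to a fixed-size window around each token, reducing the number of interactions considered; the window size and token dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_self_attention(x, window=2):
    """Scaled dot-product self-attention restricted to a local window: token i
    only attends to tokens j with |i - j| <= window, limiting the interactions
    considered for long scene-graph token sequences."""
    B, N, D = x.shape
    scores = x @ x.transpose(1, 2) / D ** 0.5             # (B, N, N)
    idx = torch.arange(N)
    mask = (idx[None, :] - idx[:, None]).abs() > window   # True = disallowed pair
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ x

tokens = torch.randn(1, 12, 32)   # 12 scene-graph tokens, 32-dim embeddings
out = local_self_attention(tokens)
print(out.shape)                  # torch.Size([1, 12, 32])
```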

7.4. Future Directions

This section examines future research directions for scene-graph-to-image creation algorithms. These techniques have lately captured the interest of the computer vision community due to their capacity to generate realistic and diverse images by using the deep semantic information included within scene graphs. While existing SG2IM methods have shown substantial success, there is still significant room for improvement in the quality and diversity of the generated images, as outlined below:
  • Future research should focus on developing hybrid models that integrate GANs with diffusion models or transformers. Such combinations could leverage GANs’ ability to generate high-quality images with diffusion models’ robustness against noise, potentially leading to more stable training processes.
  • Implementing adaptive algorithms that dynamically adjust parameters during image generation can help mitigate computational costs associated with diffusion models. These algorithms could optimize noise levels based on real-time feedback, enhancing training efficiency and output quality.
  • To improve model generalization, exploring advanced data augmentation strategies is crucial. Techniques such as generative augmentation using pre-trained models can enhance dataset diversity, thus enabling models to learn from a broader range of scenarios.
  • Investigating the application of self-supervised learning and graph neural networks may provide innovative solutions for understanding complex relationships within scene graphs, leading to improved image generation capabilities.
  • Encouraging collaboration between researchers in computer vision, natural language processing, and generative modeling can foster new insights and methodologies for advancing scene-graph-based image generation.
  • The delay in image generation processes impedes real-time applications, such as interactive gaming or live digital media production. The development of fast-generation models specifically designed for real-time applications, potentially utilizing edge computing, has the potential to transform user experiences in interactive apps.
  • Although models like SceneGenie have made significant advancements, generating exact and detailed images is difficult, particularly in industries that demand extraordinary accuracy, like medical imaging. Enhancing precision could be achieved by further refining models by integrating domain-specific knowledge and training on specialized datasets. Engaging in collaborations with specialists in specific fields to customize models according to precise requirements has the potential to generate substantial enhancements.
  • Scene-graph-based image generation has the potential to be combined with other domains, like natural language processing and augmented reality. This integration can lead to more context-aware, more advanced systems that boost user interaction and engagement. Creating cohesive systems capable of directly transforming natural language descriptions into scene graphs or utilizing augmented reality to modify scene graphs in response to user interactions dynamically has the potential to unlock novel applications in educational technology and interactive media.
  • Automating digital content creation through advanced image generation models can significantly impact the marketing, design, and entertainment industries. Adapting generative models to specific content creation tasks, such as automatically generating marketing materials based on trend analysis or user engagement data, could streamline workflows and enhance content relevance and personalization.

8. Conclusions

This study offered a thorough analysis of the generation of images from scene graphs by classifying existing methodologies according to their fundamental models, such as GANs, diffusion models, and transformers. The study highlighted the importance of extensive datasets in facilitating efficient image generation and summarized frequently utilized benchmark datasets in this domain. Furthermore, critical assessment criteria were emphasized for assessing the effectiveness of diverse image generation processes, accompanied by a comparative analysis of these metrics in tabular form. Notwithstanding the progress in this domain, numerous substantial technological hurdles exist. Challenges such as mode collapse in GANs, computational expenses in diffusion models, and memory constraints in transformers impede future advancement. Confronting these problems is essential for improving the effectiveness and use of scene-graph-based image-generating methodologies. Future studies should concentrate on creating hybrid models that integrate the advantages of various approaches while mitigating their shortcomings. Investigating novel methods, including adaptive algorithms for noise management, advanced data augmentation techniques, and the utilization of synthetic data, could markedly enhance model performance. By addressing these obstacles, researchers can facilitate the development of more resilient and adaptable image production systems that proficiently leverage scene graphs.

Author Contributions

C.I.A., conceptualization, methodology, data curation, writing—original draft, writing—review and editing; X.Z., supervision, funding acquisition, writing—review and editing, conceptualization; H.N.M., validation, resources, writing—review and editing; G.U.N., formal analysis, validation, writing—review and editing; C.C.U., visualization, validation, writing—review and editing; O.C.C., software, visualization, writing—review and editing; Y.H.G., formal analysis, funding acquisition, investigation, writing—review and editing, project administration; M.A.A.-a., conceptualization, funding acquisition, investigation, methodology, project administration, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Technology Innovation Program (2410002644, Development of robotic manipulation task learning based on Foundation model to understand and reason about task situations) funded by the Ministry of Trade, Industry & Energy (MoTIE, Korea).

Data Availability Statement

Data sharing is not applicable.

Acknowledgments

This research was funded by the Natural Science Foundation of Sichuan Province, China, under grant 24NSFSC0622. This work was supported by the Technology Innovation Program (2410002644, Development of robotic manipulation task learning based on Foundation model to understand and reason about task situations) funded by the Ministry of Trade, Industry & Energy (MoTIE, Korea). This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2022-00166402 and RS-2023-00256517).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CV: Computer vision
NLP: Natural language processing
SG2IM: Scene-graph-to-image
GCNs: Graph convolutional networks
CNN: Convolutional neural network
GANs: Generative adversarial networks
VQA: Visual question answering
ViTs: Vision transformers
WSGC: Weighted scene graph canonicalization
R3CD: Relation-aware compositional contrastive control diffusion
IS: Inception score
DS: Diversity score
HIGT: Hierarchical image generation via transformer-based sequential patch selection
TBIGSG: Transformer-based image generation from scene graphs
SL2I: Scene-layout-to-image
ML: Machine learning
GNNs: Graph neural networks
MLP: Multilayer perceptron
CRN: Cascaded refinement network
SLN: Scene layout network
WGAN: Wasserstein GAN
SGDiff: Scene graph diffusion
SDMT: Scalable diffusion models with transformers
COLoR: Contextualized object layout refinement
TBIG: Transformer-based image generation
FID: Fréchet inception distance
KID: Kernel inception distance
IGHSGTH: Image generation from a hyper-scene graph with trinomial hyperedges
CLIP: Contrastive language–image pre-training

References

  1. Ren, P.; Xiao, Y.; Chang, X.; Huang, P.-Y.; Li, Z.; Gupta, B.B.; Chen, X.; Wang, X. A survey of deep active learning. ACM Comput. Surv. 2021, 54, 1–40. [Google Scholar] [CrossRef]
  2. Xu, J.; Zhao, J.; Zhou, R.; Liu, C.; Zhao, P.; Zhao, L. Predicting destinations by a deep learning based approach. IEEE Trans. Knowl. Data Eng. 2021, 33, 651–666. [Google Scholar] [CrossRef]
  3. Rossi, R.A.; Zhou, R.; Ahmed, N.K. Deep inductive graph representation learning. IEEE Trans. Knowl. Data Eng. 2018, 32, 438–452. [Google Scholar] [CrossRef]
  4. Liu, B.; Yu, L.; Che, C.; Lin, Q.; Hu, H.; Zhao, X. Integration and performance analysis of artificial intelligence and computer vision based on deep learning algorithms. arXiv 2023, arXiv:2312.12872. [Google Scholar] [CrossRef]
  5. Hassaballah, M.; Awad, A.I. Deep Learning in Computer Vision: Principles and Applications; CRC Press: Boca Raton, FL, USA, 2020. [Google Scholar]
  6. Wang, D.; Wang, J.-G.; Xu, K. Deep learning for object detection, classification and tracking in industry applications. Sensors 2021, 21, 7349. [Google Scholar] [CrossRef]
  7. Wu, H.; Liu, Q.; Liu, X. A review on deep learning approaches to image classification and object segmentation. Comput. Mater. Contin. 2019, 58, 575–597. [Google Scholar] [CrossRef]
  8. Johnson, J.; Krishna, R.; Stark, M.; Li, L.-J.; Shamma, D.; Bernstein, M.; Fei-Fei, L. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3668–3678. [Google Scholar]
  9. Zhang, C.; Cui, Z.; Chen, C.; Liu, S.; Zeng, B.; Bao, H.; Zhang, Y. Deeppanocontext: Panoramic 3d scene understanding with holistic scene context graph and relation-based optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12632–12641. [Google Scholar]
  10. Gao, C.; Li, B. Time-Conditioned Generative Modeling of object-centric representations for video decomposition and prediction. In Proceedings of the Uncertainty in Artificial Intelligence, Pittsburgh, PA, USA, 31 July–4 August 2023; pp. 613–623. [Google Scholar]
  11. Wang, R.; Wei, Z.; Li, P.; Zhang, Q.; Huang, X. Storytelling from an image stream using scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 9185–9192. [Google Scholar]
  12. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 1060–1069. [Google Scholar]
  13. Liao, W.; Hu, K.; Yang, M.Y.; Rosenhahn, B. Text to image generation with semantic-spatial aware gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18187–18196. [Google Scholar]
  14. Brock, A.; Donahue, J.; Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv 2018, arXiv:1809.11096. [Google Scholar]
  15. Crowson, K.; Biderman, S.; Kornis, D.; Stander, D.; Hallahan, E.; Castricato, L.; Raff, E. Vqgan-clip: Open domain image generation and editing with natural language guidance. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 88–105. [Google Scholar]
  16. Gafni, O.; Polyak, A.; Ashual, O.; Sheynin, S.; Parikh, D.; Taigman, Y. Make-a-scene: Scene-based text-to-image generation with human priors. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 89–106. [Google Scholar]
  17. Razavi, A.; den Oord, A.; Vinyals, O. Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2019; Volume 32. [Google Scholar]
  18. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  19. Wild, V.D.; Ghalebikesabi, S.; Sejdinovic, D.; Knoblauch, J. A rigorous link between deep ensembles and (variational) bayesian methods. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2024; Volume 36. [Google Scholar]
  20. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  21. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2021; Volume 34, pp. 8780–8794. [Google Scholar]
  22. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  23. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2022; Volume 35, pp. 36479–36494. [Google Scholar]
  24. Ho, J.; Salimans, T. Classifier-free diffusion guidance. arXiv 2022, arXiv:2207.12598. [Google Scholar]
  25. Kim, G.; Kwon, T.; Ye, J.C. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2426–2435. [Google Scholar]
  26. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  27. Couairon, G.; Verbeek, J.; Schwenk, H.; Cord, M. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv 2022, arXiv:2210.11427. [Google Scholar]
  28. Maas, A.L.; Hannun, A.Y.; Ng, A.Y. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the ICML Workshop on Deep Learning for Audio, Speech and Language Processing, Atlanta, GA, USA, 16–21 June 2013. [Google Scholar]
  29. Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J.-Y.; Ermon, S. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv 2021, arXiv:2108.01073. [Google Scholar]
  30. Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.-H. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 105. [Google Scholar] [CrossRef]
  31. Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D.A.; et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef]
  32. Johnson, J.; Gupta, A.; Fei-Fei, L. Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1219–1228. [Google Scholar]
  33. Hughes, N.; Chang, Y.; Carlone, L. Hydra: A real-time spatial perception system for 3D scene graph construction and optimization. arXiv 2022, arXiv:2201.13360. [Google Scholar]
  34. Fei, Z.; Yan, X.; Wang, S.; Tian, Q. Deecap: Dynamic early exiting for efficient image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12216–12226. [Google Scholar]
  35. Zhao, S.; Li, L.; Peng, H. Aligned visual semantic scene graph for image captioning. Displays 2022, 74, 102210. [Google Scholar] [CrossRef]
  36. Dhamo, H.; Farshad, A.; Laina, I.; Navab, N.; Hager, G.D.; Tombari, F.; Rupprecht, C. Semantic image manipulation using scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5213–5222. [Google Scholar]
  37. Rahman, T.; Chou, S.-H.; Sigal, L.; Carenini, G. An improved attention for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1653–1662. [Google Scholar]
  38. Zhang, C.; Chao, W.-L.; Xuan, D. An empirical study on leveraging scene graphs for visual question answering. arXiv 2019, arXiv:1907.12133. [Google Scholar]
  39. Schuster, S.; Krishna, R.; Chang, A.; Fei-Fei, L.; Manning, C.D. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the Fourth Workshop on Vision and Language, Lisbon, Portugal, 18 September 2015; pp. 70–80. [Google Scholar]
  40. He, S.; Liao, W.; Yang, M.Y.; Yang, Y.; Song, Y.-Z.; Rosenhahn, B.; Xiang, T. Context-aware layout to image generation with enhanced object appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15049–15058. [Google Scholar]
  41. Zhou, R.; Jiang, C.; Xu, Q. A survey on generative adversarial network-based text-to-image synthesis. Neurocomputing 2021, 451, 316–336. [Google Scholar] [CrossRef]
  42. Shamsolmoali, P.; Zareapoor, M.; Granger, E.; Zhou, H.; Wang, R.; Celebi, M.E.; Yang, J. Image synthesis with adversarial networks: A comprehensive survey and case studies. Inf. Fusion 2021, 72, 126–146. [Google Scholar] [CrossRef]
  43. Elasri, M.; Elharrouss, O.; Al-Maadeed, S.; Tairi, H. Image generation: A review. Neural Process. Lett. 2022, 54, 4609–4646. [Google Scholar] [CrossRef]
  44. PM, A.F.; Rahiman, V.A. A review of generative adversarial networks for text-to-image synthesis. In Proceedings of the International Conference on Emerging Trends in Engineering-YUKTHI2023, Kerala, India, 10–12 April 2023. [Google Scholar]
  45. Hassan, M.U.; Alaliyat, S.; Hameed, I.A. Image generation models from scene graphs and layouts: A comparative analysis. J. King Saud Univ. Inf. Sci. 2023, 35, 101543. [Google Scholar] [CrossRef]
  46. Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. Int. J. Comput. Vis. 2020, 128, 1956–1981. [Google Scholar] [CrossRef]
  47. Lu, C.; Krishna, R.; Bernstein, M.; Fei-Fei, L. Visual relationship detection with language priors. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Proceedings, Part I. Amsterdam, The Netherlands, 11–14 October 2016; Volume 14, pp. 852–869. [Google Scholar]
  48. Zhao, L.; Yuan, L.; Gong, B.; Cui, Y.; Schroff, F.; Yang, M.-H.; Adam, H.; Liu, T. Unified visual relationship detection with vision and language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6962–6973. [Google Scholar]
  49. Liao, W.; Rosenhahn, B.; Shuai, L.; Yang, M.Y. Natural language guided visual relationship detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  50. Armeni, I.; He, Z.-Y.; Gwak, J.; Zamir, A.R.; Fischer, M.; Malik, J.; Savarese, S. 3D scene graph: A structure for unified semantics, 3d space, and camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5664–5673. [Google Scholar]
  51. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef] [PubMed]
  52. Ding, M.; Guo, Y.; Huang, Z.; Lin, B.; Luo, H. GROM: A generalized routing optimization method with graph neural network and deep reinforcement learning. J. Netw. Comput. Appl. 2024, 229, 103927. [Google Scholar] [CrossRef]
  53. Michel, G.; Nikolentzos, G.; Lutzeyer, J.F.; Vazirgiannis, M. Path neural networks: Expressive and accurate graph neural networks. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 24737–24755. [Google Scholar]
  54. Talal, M.; Gerfan, S.; Qays, R.; Pamucar, D.; Delen, D.; Pedrycz, W.; Alamleh, A.; Alamoodi, A.; Zaidan, B.B.; Simic, V. A Comprehensive Systematic Review on Machine Learning Application in the 5G-RAN Architecture: Issues, Challenges, and Future Directions. J. Netw. Comput. Appl. 2024, 233, 104041. [Google Scholar] [CrossRef]
  55. Zhang, Z.; Tessone, C.J.; Liao, H. Heterogeneous graph representation learning via mutual information estimation for fraud detection. J. Netw. Comput. Appl. 2024, 234, 104046. [Google Scholar] [CrossRef]
  56. Wang, K.; Cui, Y.; Qian, Q.; Chen, Y.; Guo, C.; Shen, G. USAGE: Uncertain flow graph and spatio-temporal graph convolutional network-based saturation attack detection method. J. Netw. Comput. Appl. 2023, 219, 103722. [Google Scholar] [CrossRef]
  57. Zhang, P.; Luo, Z.; Kumar, N.; Guizani, M.; Zhang, H.; Wang, J. CE-VNE: Constraint escalation virtual network embedding algorithm assisted by graph convolutional networks. J. Netw. Comput. Appl. 2024, 221, 103736. [Google Scholar] [CrossRef]
  58. Li, Y.; Ma, T.; Bai, Y.; Duan, N.; Wei, S.; Wang, X. Pastegan: A Semi-Parametric Method to Generate Image from Scene Graph. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2019; Volume 32. [Google Scholar]
  59. Zhao, B.; Meng, L.; Yin, W.; Sigal, L. Image generation from layout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8584–8593. [Google Scholar]
  60. Zhao, B.; Yin, W.; Meng, L.; Sigal, L. Layout2image: Image generation from layout. Int. J. Comput. Vis. 2020, 128, 2418–2435. [Google Scholar] [CrossRef]
  61. Chen, Q.; Koltun, V. Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1511–1520. [Google Scholar]
  62. Henaff, M.; Bruna, J.; LeCun, Y. Deep convolutional networks on graph-structured data. arXiv 2015, arXiv:1506.05163. [Google Scholar]
  63. Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 11. [Google Scholar] [CrossRef] [PubMed]
  64. Mittal, G.; Agrawal, S.; Agarwal, A.; Mehta, S.; Marwah, T. Interactive image generation using scene graphs. arXiv 2019, arXiv:1905.03743. [Google Scholar]
  65. Tripathi, S.; Bhiwandiwalla, A.; Bastidas, A.; Tang, H. Using scene graph context to improve image generation. arXiv 2019, arXiv:1901.03762. [Google Scholar]
  66. Herzig, R.; Bar, A.; Xu, H.; Chechik, G.; Darrell, T.; Globerson, A. Learning canonical representations for scene graph to image generation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Proceedings Part XXVI. Glasgow, UK, 23–28 August 2020; Volume 16, pp. 210–227. [Google Scholar]
  67. Sylvain, T.; Zhang, P.; Bengio, Y.; Hjelm, R.D.; Sharma, S. Object-centric image generation from layouts. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 2647–2655. [Google Scholar]
  68. Caesar, H.; Uijlings, J.; Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1209–1218. [Google Scholar]
  69. Xu, X.; Xu, N. Hierarchical image generation via transformer-based sequential patch selection. In Proceedings of the AAAI Conference on Artificial Intelligence, Palo Alto, CA, USA, 22 February–1 March 2022; Volume 36, pp. 2938–2945. [Google Scholar]
  70. Sortino, R.; Palazzo, S.; Rundo, F.; Spampinato, C. Transformer-based image generation from scene graphs. Comput. Vis. Image Underst. 2023, 233, 103721. [Google Scholar] [CrossRef]
  71. Zhang, Y.; Meng, C.; Li, Z.; Chen, P.; Yang, G.; Yang, C.; Sun, L. Learning object consistency and interaction in image generation from scene graphs. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao SAR, China, 19–25 August 2023; pp. 1731–1739. [Google Scholar]
  72. Farshad, A.; Yeganeh, Y.; Chi, Y.; Shen, C.; Ommer, B.; Navab, N. Scenegenie: Scene graph guided diffusion models for image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 88–98. [Google Scholar]
  73. Yang, L.; Huang, Z.; Song, Y.; Hong, S.; Li, G.; Zhang, W.; Cui, B.; Ghanem, B.; Yang, M.-H. Diffusion-based scene graph to image generation with masked contrastive pre-training. arXiv 2022, arXiv:2211.11138. [Google Scholar]
  74. Liu, J.; Liu, Q. R3CD: Scene graph to image generation with relation-aware compositional contrastive control diffusion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 3657–3665. [Google Scholar]
  75. Ivgi, M.; Benny, Y.; Ben-David, A.; Berant, J.; Wolf, L. Scene graph to image generation with contextualized object layout refinement. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; pp. 2428–2432. [Google Scholar]
  76. Miyake, R.; Matsukawa, T.; Suzuki, E. Image generation from a hyper scene graph with trinomial hyperedges. In Proceedings of the VISIGRAPP (5: VISAPP), Lisbon, Portugal, 19–21 February 2023; pp. 185–195. [Google Scholar]
  77. Ko, M.; Cha, E.; Suh, S.; Lee, H.; Han, J.-J.; Shin, J.; Han, B. Self-supervised dense consistency regularization for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18301–18310. [Google Scholar]
  78. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  79. Skorokhodov, I.; Tulyakov, S.; Elhoseiny, M. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3626–3636. [Google Scholar]
  80. Vondrick, C.; Pirsiavash, H.; Torralba, A. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2016; Volume 29. [Google Scholar]
  81. Zhao, X.; Guo, J.; Wang, L.; Li, F.; Li, J.; Zheng, J.; Yang, B. STS-GAN: Can we synthesize solid texture with high fidelity from arbitrary 2d exemplar? In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, Macao SAR, China, 19–25 August 2023; pp. 1768–1776. [Google Scholar]
  82. Cao, T.; Kreis, K.; Fidler, S.; Sharp, N.; Yin, K. Texfusion: Synthesizing 3d textures with text-guided image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4169–4181. [Google Scholar]
  83. Li, C.; Su, Y.; Liu, W. Text-to-text generative adversarial networks. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–7. [Google Scholar]
  84. Chai, Y.; Yin, Q.; Zhang, J. Improved training of mixture-of-experts language gans. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  85. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
  86. Shen, Y.; Yang, C.; Tang, X.; Zhou, B. Interfacegan: Interpreting the disentangled face representation learned by gans. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2004–2018. [Google Scholar] [CrossRef]
  87. Gauthier, J. Conditional generative adversarial nets for convolutional face generation. In Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter Semester; Stanford University: Stanford, CA, USA, 2014; Volume 2014, p. 2. [Google Scholar]
  88. Kammoun, A.; Slama, R.; Tabia, H.; Ouni, T.; Abid, M. Generative adversarial networks for face generation: A survey. ACM Comput. Surv. 2022, 55, 1–37. [Google Scholar] [CrossRef]
  89. Odena, A.; Olah, C.; Shlens, J. Conditional image synthesis with auxiliary classifier gans. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2642–2651. [Google Scholar]
  90. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training gans. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2016; Volume 29. [Google Scholar]
  91. Mao, Q.; Lee, H.-Y.; Tseng, H.-Y.; Ma, S.; Yang, M.-H. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1429–1437. [Google Scholar]
  92. Alaluf, Y.; Patashnik, O.; Cohen-Or, D. Restyle: A residual-based stylegan encoder via iterative refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6711–6720. [Google Scholar]
  93. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 214–223. [Google Scholar]
  94. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2014; Volume 27. [Google Scholar]
  95. Metz, L.; Poole, B.; Pfau, D.; Sohl-Dickstein, J. Unrolled generative adversarial networks. arXiv 2016, arXiv:1611.02163. [Google Scholar]
  96. Wiatrak, M.; Albrecht, S.V.; Nystrom, A. Stabilizing generative adversarial networks: A survey. arXiv 2019, arXiv:1910.00927. [Google Scholar]
  97. Mishra, R.; Subramanyam, A.V. Image synthesis with graph conditioning: Clip-guided diffusion models for scene graphs. arXiv 2024, arXiv:2401.14111. [Google Scholar]
  98. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2256–2265. [Google Scholar]
  99. Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2019; Volume 32. [Google Scholar]
  100. Gu, S.; Chen, D.; Bao, J.; Wen, F.; Zhang, B.; Chen, D.; Yuan, L.; Guo, B. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10696–10706. [Google Scholar]
  101. Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; Norouzi, M. Palette: Image-to-image diffusion models. In Proceedings of the ACM SIGGRAPH 2022 Conference Proceedings, Vancouver, BC, Canada, 7–11 August 2022; pp. 1–10. [Google Scholar]
  102. Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv 2021, arXiv:2112.10741. [Google Scholar]
  103. Vahdat, A.; Kreis, K.; Kautz, J. Score-based generative modeling in latent space. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2021; Volume 34, pp. 11287–11302. [Google Scholar]
  104. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2017; Volume 30. [Google Scholar]
  105. Zhai, X.; Kolesnikov, A.; Houlsby, N.; Beyer, L. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12104–12113. [Google Scholar]
  106. Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2021; Volume 34, pp. 15084–15097. [Google Scholar]
  107. Janner, M.; Li, Q.; Levine, S. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2021; Volume 34, pp. 1273–1286. [Google Scholar]
  108. Peebles, W.; Radosavovic, I.; Brooks, T.; Efros, A.A.; Malik, J. Learning to learn with generative models of neural network checkpoints. arXiv 2022, arXiv:2209.12892. [Google Scholar]
  109. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361. [Google Scholar]
  110. Henighan, T.; Kaplan, J.; Katz, M.; Chen, M.; Hesse, C.; Jackson, J.; Jun, H.; Brown, T.B.; Dhariwal, P.; Gray, S.; et al. Scaling laws for autoregressive generative modeling. arXiv 2020, arXiv:2010.14701. [Google Scholar]
  111. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar]
  112. Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 1691–1703. [Google Scholar]
  113. Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Shazeer, N.; Ku, A.; Tran, D. Image transformer. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4055–4064. [Google Scholar]
  114. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8821–8831. [Google Scholar]
  115. Esser, P.; Rombach, R.; Ommer, B. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12873–12883. [Google Scholar]
  116. Chang, H.; Zhang, H.; Jiang, L.; Liu, C.; Freeman, W.T. Maskgit: Masked Generative Image Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11315–11325. [Google Scholar]
  117. Van Den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural Discrete Representation Learning. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2017; Volume 30. [Google Scholar]
  118. Yu, J.; Xu, Y.; Koh, J.Y.; Luong, T.; Baid, G.; Wang, Z.; Vasudevan, V.; Ku, A.; Yang, Y.; Ayan, B.K.; et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv 2022, arXiv:2206.10789. [Google Scholar]
  119. Peebles, W.; Xie, S. Scalable Diffusion Models with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4195–4205. [Google Scholar]
  120. Bińkowski, M.; Sutherland, D.J.; Arbel, M.; Gretton, A. Demystifying mmd gans. arXiv 2018, arXiv:1801.01401. [Google Scholar]
  121. Wang, Y.; Gonzalez-Garcia, A.; Berga, D.; Herranz, L.; Khan, F.S.; van de Weijer, J. Minegan: Effective Knowledge Transfer from Gans to Target Domains with Few Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9332–9341. [Google Scholar]
  122. Cao, H.; Tan, C.; Gao, Z.; Xu, Y.; Chen, G.; Heng, P.-A.; Li, S.Z. A survey on generative diffusion models. IEEE Trans. Knowl. Data Eng. 2024, 36, 2814–2830. [Google Scholar] [CrossRef]
  123. Yuan, C.; Zhao, K.; Kuruoglu, E.E.; Wang, L.; Xu, T.; Huang, W.; Zhao, D.; Cheng, H.; Rong, Y. A Survey of Graph Transformers: Architectures, Theories and Applications. arXiv 2025, arXiv:2502.16533. [Google Scholar]
  124. Sun, W.; Wu, T. Image Synthesis from Reconfigurable Layout and Style. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10531–10540. [Google Scholar]
  125. Johnson, J.; Hariharan, B.; van der Maaten, L.; Fei-Fei, L.; Zitnick, C.L.; Girshick, R. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2901–2910. [Google Scholar]
  126. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2017; Volume 30. [Google Scholar]
  127. Meshry, M.M. Neural Rendering Techniques for Photo-Realistic Image Generation and Novel View Synthesis. Ph.D. Thesis, University of Maryland, College Park, MD, USA, 2022. [Google Scholar]
  128. Wu, Y.-L.; Shuai, H.-H.; Tam, Z.-R.; Chiu, H.-Y. Gradient Normalization for Generative Adversarial Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6373–6382. [Google Scholar]
  129. Kim, J.; Choi, Y.; Uh, Y. Feature Statistics Mixing Regularization for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11294–11303. [Google Scholar]
  130. Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv 2018, arXiv:1802.05957. [Google Scholar]
  131. Zhou, F.; Cao, C. Overcoming Catastrophic Forgetting in Graph Neural Networks with Experience Replay. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 4714–4722. [Google Scholar]
  132. Ho, J.; Saharia, C.; Chan, W.; Fleet, D.J.; Norouzi, M.; Salimans, T. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 2022, 23, 2249–2281. [Google Scholar]
  133. Tang, S.; Wang, X.; Chen, H.; Guan, C.; Wu, Z.; Tang, Y.; Zhu, W. Post-training Quantization for Text-to-Image Diffusion Models with Progressive Calibration and Activation Relaxing. arXiv 2023, arXiv:2311.06322. [Google Scholar]
  134. Kang, M.; Zhang, R.; Barnes, C.; Paris, S.; Kwak, S.; Park, J.; Shechtman, E.; Zhu, J.-Y.; Park, T. Distilling Diffusion Models into Conditional Gans. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 428–447. [Google Scholar]
  135. Guo, Y.; Yuan, H.; Yang, Y.; Chen, M.; Wang, M. Gradient Guidance for Diffusion Models: An Optimization Perspective. In Advances in Neural Information Processing Systems; Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2025; Volume 37, pp. 90736–90770. [Google Scholar]
  136. Valevski, D.; Leviathan, Y.; Arar, M.; Fruchter, S. Diffusion models are real-time game engines. arXiv 2024, arXiv:2408.14837. [Google Scholar]
Figure 1. Scene-graph-based image generation models. An overview of GAN-based, transformer-based, and diffusion-based approaches, illustrating how each method processes a scene graph to generate realistic images and highlighting the key architectural differences among these techniques.
Figure 2. Image generation types based on the kind of input.
Figure 3. A scene graph representation.
Figure 4. A typical GAN architecture.
Figure 5. A typical diffusion model architecture.
Figure 6. A typical pipeline for scene-graph-to-image generation.
Figure 7. Limitations of existing methods.
Figure 8. Samples from the image generation dataset.
Figure 9. Sample image generation on Visual Genome: SG2IM [32], PASTEGAN [58], HIGT [69], SGDIFF [73].
Figure 10. Sample image generation on COCO-Stuff: SG2IM [32], PASTEGAN [58], SCENEGENIE [72], SGDIFF [73], HIGT [69], TBIGSG [70].
Table 1. Summary of image generation methods from scene graphs.

| Method | Year | Architecture | Input Type | Dataset |
|---|---|---|---|---|
| SG2IM [32] | 2018 | GCN, SLN, CRN, GAN | Scene graph | Visual Genome [31], COCO-Stuff [68] |
| IIGSG [64] | 2019 | GCN, SLN, CRN | Scene graph | Visual Genome, COCO-Stuff |
| IGLO [59] | 2019 | Layout2Im | Image layout | Visual Genome, COCO-Stuff |
| WSGC [66] | 2020 | GCN, GAN | Scene graph | Visual Genome, COCO, CLEVR |
| CALIG [40] | 2021 | GAN | Image layout | COCO-Stuff |
| PASTEGAN [58] | 2019 | GAN | Scene graph | Visual Genome, COCO-Stuff |
| HIGTBSPS [69] | 2022 | Transformer, VAE | Scene graph | Visual Genome, COCO-Stuff |
| TBIGSG [70] | 2023 | Transformer | Scene graph | Visual Genome, COCO-Stuff |
| LOCIIG [71] | 2023 | GAN | Scene graphs | Visual Genome, COCO-Stuff |
| SCENEGENIE [72] | 2023 | Diffusion model | Scene graphs | Visual Genome, COCO-Stuff |
| SGDiff [73] | 2022 | Diffusion model | Scene graphs | Visual Genome, COCO-Stuff |
| R3CD [74] | 2024 | Diffusion model | Scene graphs | Visual Genome, COCO-Stuff |
| COLoR [75] | 2021 | GCN, LRN, GAN | Scene graph | Visual Genome, COCO-Stuff |
| HSG2IM [76] | 2023 | GAN | Scene graph | Visual Genome, COCO-Stuff |
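
As Table 1 shows, most of these pipelines share a common first stage: a graph convolutional network propagates information along the scene-graph triples, and a layout module predicts a bounding box per object that the downstream decoder (CRN, GAN, transformer, or diffusion model) then renders [32]. The sketch below illustrates that graph-convolution-plus-layout step only. It is a minimal, illustrative example written for this review, assuming a PyTorch environment; the layer and variable names are ours and do not come from any of the cited implementations.

```python
# Illustrative sketch (not the SG2IM reference code): one round of message
# passing over scene-graph triples, followed by per-object bounding-box
# prediction, i.e. the "GCN -> layout" stage shared by most methods in Table 1.
import torch
import torch.nn as nn

class TripleGCNLayer(nn.Module):
    """Updates (subject, predicate, object) embeddings from each triple."""
    def __init__(self, dim: int):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(3 * dim, 3 * dim), nn.ReLU())

    def forward(self, obj_vecs, pred_vecs, edges):
        s_idx, o_idx = edges[:, 0], edges[:, 1]          # subject / object node indices
        triples = torch.cat([obj_vecs[s_idx], pred_vecs, obj_vecs[o_idx]], dim=1)
        new_s, new_p, new_o = self.message(triples).chunk(3, dim=1)
        # Average the messages flowing into each object node (as subject or object).
        agg = torch.zeros_like(obj_vecs)
        cnt = torch.zeros(obj_vecs.size(0), 1)
        agg.index_add_(0, s_idx, new_s); cnt.index_add_(0, s_idx, torch.ones(len(s_idx), 1))
        agg.index_add_(0, o_idx, new_o); cnt.index_add_(0, o_idx, torch.ones(len(o_idx), 1))
        return agg / cnt.clamp(min=1), new_p

dim, num_objects = 64, 4
obj_vecs = torch.randn(num_objects, dim)                 # one embedding per object node
edges = torch.tensor([[0, 1], [2, 3]])                   # two directed edges, one per relationship triple
pred_vecs = torch.randn(edges.size(0), dim)              # one embedding per predicate

gcn = TripleGCNLayer(dim)
obj_vecs, pred_vecs = gcn(obj_vecs, pred_vecs, edges)
box_head = nn.Linear(dim, 4)                             # layout prediction: (x, y, w, h) per object
layout_boxes = torch.sigmoid(box_head(obj_vecs))
print(layout_boxes.shape)                                # torch.Size([4, 4])
```

In SG2IM-style pipelines several such layers are stacked, and the predicted boxes together with the object embeddings are rasterized into a scene layout before being passed to the image decoder [32].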
Table 2. Differences between generative adversarial networks (GANs), diffusion models, and graph transformers in their approach to image generation from scene graphs.

| Aspect | GANs | Diffusion Models | Graph Transformers |
|---|---|---|---|
| Architecture | Generator and discriminator [44]. | Forward and reverse diffusion [73]. | Uses self-attention mechanisms [70]. |
| Training Stability | Often unstable and prone to mode collapse. | Generally stable due to the iterative refinement process. | Variable stability based on transformer depth [69]. |
| Image Quality | Generates high-quality images but can be inconsistent. | Generates detailed, high-resolution images [72]. | Captures relational data well; detailed images. |
| Computational Cost | More efficient in terms of training time and resources [44]. | Computationally expensive due to multiple denoising steps [73]. | Varies; large transformers can be resource-intensive. |
| Use Cases | Appropriate for real-time applications. | High-quality synthesis and complex scene graphs. | Complex scene graph interaction [70]. |
| Challenges | Mode collapse and limited diversity [43]. | High computational cost makes scalability difficult [122]. | Requires careful attention design to effectively capture graph relationships [123]. |
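
To make the architectural contrast in Table 2 concrete, the sketch below shows one schematic training update for a GAN-based generator and one for a DDPM-style diffusion generator [18], each conditioned on a pooled scene-graph embedding. The toy modules, dimensions, and noise schedule are our own simplifications for illustration and do not correspond to any specific method in Table 1.

```python
# Schematic contrast between an adversarial (generator/discriminator) update
# and a denoising-diffusion (noise-prediction) update, both conditioned on a
# pooled scene-graph embedding `cond`. Toy modules and shapes, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGenerator(nn.Module):
    def __init__(self, z_dim=16, cond_dim=16, out_dim=3 * 8 * 8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + cond_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=1))

class TinyDiscriminator(nn.Module):
    def __init__(self, in_dim=3 * 8 * 8, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + cond_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, x, cond):
        return self.net(torch.cat([x, cond], dim=1))

class TinyDenoiser(nn.Module):
    """Predicts the noise added to a (flattened) image at timestep t."""
    def __init__(self, in_dim=3 * 8 * 8, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim + cond_dim + 1, 64), nn.ReLU(), nn.Linear(64, in_dim))
    def forward(self, x_t, t, cond):
        t_feat = t.float().unsqueeze(1) / 1000.0
        return self.net(torch.cat([x_t, cond, t_feat], dim=1))

B = 4
cond = torch.randn(B, 16)                         # pooled scene-graph embedding per image
real = torch.randn(B, 3 * 8 * 8)                  # placeholder "real" images, flattened

# --- One GAN update (non-saturating loss): discriminator step, then generator step. ---
G, D = TinyGenerator(), TinyDiscriminator()
opt_g, opt_d = torch.optim.Adam(G.parameters(), 1e-4), torch.optim.Adam(D.parameters(), 1e-4)
z = torch.randn(B, 16)
fake = G(z, cond)
d_loss = F.softplus(-D(real, cond)).mean() + F.softplus(D(fake.detach(), cond)).mean()
opt_d.zero_grad(); d_loss.backward(); opt_d.step()
g_loss = F.softplus(-D(G(z, cond), cond)).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# --- One DDPM-style update: add noise at a random timestep, train to predict it. ---
eps_model = TinyDenoiser()
opt = torch.optim.Adam(eps_model.parameters(), 1e-4)
alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
t = torch.randint(0, 1000, (B,))
a_bar = alphas_cumprod[t].unsqueeze(1)
noise = torch.randn_like(real)
x_t = a_bar.sqrt() * real + (1 - a_bar).sqrt() * noise
diff_loss = F.mse_loss(eps_model(x_t, t, cond), noise)
opt.zero_grad(); diff_loss.backward(); opt.step()
print(f"GAN d/g loss: {d_loss.item():.3f}/{g_loss.item():.3f}, diffusion loss: {diff_loss.item():.3f}")
```

The two-player adversarial objective is what makes GAN training prone to instability and mode collapse, whereas the diffusion objective is a simple regression on noise, which is why it tends to train more stably at the cost of many denoising steps at inference time.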
Table 3. Statistics of Visual Genome, COCO-Stuff, and CLEVR.

| Dataset | Visual Genome | COCO-Stuff | CLEVR |
|---|---|---|---|
| Number of images | 108,077 | 163,957 | 100,000 |
| Training set | 62,565 | 24,972 | 70,000 |
| Validation set | 5506 | 1024 | 15,000 |
| Test set | 5088 | 2048 | 15,000 |
| Total number of classes | 178 | 172 | - |
| Things classes | - | 80 | - |
| Stuff classes | - | 91 | - |
| No. of objects in an image | 3–30 | 3–8 | - |
| Min. number of relationships between objects | 1 | 6 | - |
Table 4. Comparison of evaluation metrics for image generation from scene graphs.

| Metric | Description | Strengths | Weaknesses |
|---|---|---|---|
| KID | Measures distribution distance using kernels. | Unbiased, sensitive to differences. | Requires kernel tuning, computationally intensive. |
| FID | Compares feature distributions assuming Gaussianity. | Single scalar for image quality, effective for large datasets. | Sensitive to sample size, assumes Gaussianity. |
| DS | Evaluates diversity by measuring feature variance. | Captures output variety, prevents similar images. | May not correlate with quality; implementation varies. |
| IS | Measures quality based on classification probabilities. | Encourages high quality and diversity, simple to compute. | Classifier choice can bias results, sensitive to sample size. |
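
For reference, the sketch below shows how FID and KID are typically computed once feature vectors have been extracted from a pretrained Inception network; the helper functions and the random placeholder features are our own, but the formulas follow the standard definitions of FID [126] and KID [120].

```python
# Minimal sketch of FID and KID from two sets of Inception features (Table 4).
# In practice the features come from a pretrained Inception-v3 network; random
# vectors are used here purely to keep the example self-contained.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet inception distance between two Gaussian-fitted feature sets."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(cov_mean):          # numerical noise can introduce tiny imaginary parts
        cov_mean = cov_mean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * cov_mean))

def kid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Unbiased MMD^2 estimate with the polynomial kernel k(x, y) = (x.y/d + 1)^3."""
    d = feats_real.shape[1]
    k = lambda a, b: (a @ b.T / d + 1.0) ** 3
    m, n = len(feats_real), len(feats_fake)
    k_rr, k_ff, k_rf = k(feats_real, feats_real), k(feats_fake, feats_fake), k(feats_real, feats_fake)
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))   # off-diagonal mean over real pairs
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))   # off-diagonal mean over fake pairs
    return float(term_rr + term_ff - 2 * k_rf.mean())

rng = np.random.default_rng(0)
real, fake = rng.normal(size=(500, 64)), rng.normal(loc=0.1, size=(500, 64))
print(f"FID ~ {fid(real, fake):.2f}, KID ~ {kid(real, fake):.4f}")
```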
Table 5. Performance comparison of existing methods on Visual Genome and COCO-Stuff.

| Dataset | Method | IS ↑ | FID ↓ | DS ↓ | KID ↓ |
|---|---|---|---|---|---|
| Visual Genome | SG2IM [32] | 5.5 | - | - | - |
| | PasteGAN [58] | 6.9 | 58.53 | 0.24 | - |
| | WSGC [66] | 8.0 | - | - | - |
| | HIGT [69] | 10.8 | 63.7 | 0.59 | - |
| | IGHSGTH [76] | 9.93 | - | - | - |
| | SceneGenie [72] | 20.25 | 42.41 | - | 8.43 |
| | R3CD [74] | 18.9 | 23.4 | - | - |
| | SGDiff [73] | 9.3 | 16.6 | - | - |
| | TBIGSG [70] | 12.8 | 60.3 | - | - |
| COCO-Stuff | SG2IM | 6.7 | - | - | - |
| | PasteGAN | 9.1 | 50.94 | 0.27 | - |
| | WSGC | 5.6 | - | - | - |
| | COLoR [75] | - | 95.8 | - | - |
| | HIGT | 15.2 | 51.6 | 0.63 | - |
| | IGHSGTH | 11.89 | - | - | - |
| | SceneGenie | 9.05 | 67.51 | - | 7.86 |
| | R3CD [74] | 19.5 | 32.9 | - | - |
| | SGDiff | 11.4 | 22.4 | - | - |
| | TBIGSG | 13.7 | 52.3 | - | - |
Table 6. Comparison of image generation methods from scene graphs.

| Method | Strengths | Limitations | Applications |
|---|---|---|---|
| SG2IM [32] | Utilizes explicit structure for high-quality synthesis. | Complex training, requires detailed annotations. | Scene understanding, illustration generation. |
| PasteGAN [58] | Fine control over object appearance. | Dependent on crop quality, risk of overfitting. | Artistic image generation. |
| WSGC [66] | Robust generalization to graph complexity. | Complex setup and learning curve. | Generalized scene understanding. |
| HIGT [69] | Enhances realism with hierarchical generation. | Complexity in crop selection, computationally intensive. | High-resolution image generation. |
| IGHSGTH [76] | Advanced relationship modeling. | Increased model complexity. | Complex relational scene generation. |
| SceneGenie [72] | High-resolution imaging. | Complex inference integration. | High-fidelity image generation. |
| SGDiff [73] | Improved alignment, enhanced image quality. | Complexity in training. | Artistic applications. |
| COLoR [75] | Accurate contextualized layout generation. | Complex model architecture, high computational demand. | Context-aware image synthesis. |
| R3CD [74] | Captures local/global relational features for diverse and realistic images. | Computationally intensive, dependent on scene graph quality. | Storytelling, complex scene generation. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
