Article

IntegraPSG: Integrating LLM Guidance with Multimodal Feature Fusion for Single-Stage Panoptic Scene Graph Generation

1 College of Automation, Jiangsu University of Science and Technology, Zhenjiang 212100, China
2 Systems Science Laboratory, Jiangsu University of Science and Technology, Zhenjiang 212100, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(17), 3428; https://doi.org/10.3390/electronics14173428
Submission received: 31 July 2025 / Revised: 19 August 2025 / Accepted: 25 August 2025 / Published: 28 August 2025

Abstract

Panoptic scene graph generation (PSG) aims to simultaneously segment both foreground objects and background regions while predicting object relations for fine-grained scene modeling. Despite significant progress in panoptic scene understanding, current PSG methods face two challenging problems: relation prediction often relies solely on visual representations and is hindered by imbalanced relation category distributions. Accordingly, we propose IntegraPSG, a single-stage framework that integrates large language model (LLM) guidance with multimodal feature fusion. IntegraPSG introduces a multimodal sparse relation prediction network that efficiently integrates visual, linguistic, and depth cues to identify the subject–object pairs most likely to form relations, enhancing the screening of subject–object pairs and filtering dense candidates into sparse, effective pairs. To alleviate the long-tail distribution problem of relations, we design a language-guided multimodal relation decoder in which an LLM generates language descriptions for relation triplets that are cross-modally attended with vision pair features. This design enables more accurate relation predictions for sparse subject–object pairs and effectively improves discriminative capability for rare relations. Experimental results show that IntegraPSG achieves steady and strong performance on the PSG dataset, with R@100, mR@100, and the mean metric reaching 38.7%, 28.6%, and 30.0%, respectively, supporting the validity of the proposed method.

1. Introduction

Scene graph generation (SGG) [1,2] has emerged as a pivotal technique for bridging low-level visual perception and high-level semantic understanding. By detecting visual objects and inferring their pairwise relationships in the form of <subject, predicate, object> triplets, SGG enables structured scene interpretation that benefits a wide range of downstream applications, including image retrieval [3], visual question answering [4], image captioning [5,6], image generation [7,8], and robotic navigation [9].
Considerable progress has been made in enhancing SGG models [2,10,11,12], with many approaches leveraging graph neural networks, Transformer architectures, and external knowledge to improve relational reasoning and contextual understanding. These efforts have delivered impressive results in structured scene representation. Despite these advancements, there remain several aspects of the problem that merit further exploration. In particular, as most SGG models operate on the bounding box level, they may face difficulties in accurately delineating object boundaries and handling occlusions or overlapping instances. Moreover, the predominant focus on object–object relations may hinder the ability to capture interactions involving background or amorphous regions (e.g., <person, standing on, grass>), which are critical for a more comprehensive interpretation of complex scenes.
To enrich the expressiveness of scene graphs, a novel variant of SGG called PSG [13] has emerged. PSG extends the conventional SGG formulation by integrating panoptic segmentation, allowing both “thing” and “stuff” categories [14] to be jointly segmented and reasoned about. This enables the modeling of more diverse relationships, such as <person, running on, sand> or <sand, attached to, sea>, which may otherwise be underrepresented or overlooked. By integrating the strengths of panoptic segmentation [15] with scene graph reasoning, PSG offers a more holistic framework for scene understanding and has recently gained considerable interest within the vision community.
Several recent works have contributed valuable insights into PSG modeling. For instance, RCFrame [16] enhances relational coverage by introducing a proposal matching strategy during training and a relation-constrained segmentation mechanism at inference. CAFE [17] introduces shape-aware features such as masks and boundaries, incorporates them in three stages according to the difficulty of relation perception, and combines this with knowledge distillation to enhance fine-grained relation modeling. Meanwhile, CIM [18] focuses on the bias caused by the long-tail distribution of relations: it treats the long-tail prior as a confounding factor, removes the bias through causal intervention, and combines an uncertainty estimation module with optimized hard-sample handling to alleviate biased prediction of rare relations. These methods have significantly broadened the scope and capability of PSG models.
While these pioneering efforts have considerably advanced the field and deepened our understanding of PSG, certain nuanced challenges remain that invite continued exploration and refinement. First, the distribution of relation categories in PSG tends to be highly imbalanced, often resulting in suboptimal performance for rare but semantically rich predicates [13]. Second, current mainstream PSG approaches [13,19,20] predominantly rely on modeling visual features derived from RGB images. While effective for certain aspects, these features exhibit limited capacity for capturing complex spatial structures (e.g., occlusion, spatial distances, object stacking).
These challenges point to several promising directions for advancing PSG. A central avenue involves the integration of multimodal features beyond RGB, such as depth and geometric cues, which can provide explicit spatial structure and thereby enrich relational reasoning. Another crucial step lies in employing more selective mechanisms to filter and prioritize subject–object pairs [12,21], reducing redundancy and ensuring that reasoning focuses on the most salient interactions. Finally, harnessing external semantic guidance offers a powerful means to improve the recognition of rare and context-dependent relations, alleviating the pervasive long-tail distribution problem. In this regard, the rapid progress of LLMs [22,23] offers new opportunities for enhancing relational reasoning; LLMs possess strong contextual understanding and knowledge generalization abilities, making them well suited to guide relation inference in complex visual scenes. For instance, as illustrated in Figure 1, given a <person–sports ball> pair, conventional methods may produce a generic predicate like “playing”. While plausible, such predictions can miss subtle contextual nuances. In contrast, by leveraging semantic guidance from an LLM, the model can infer more specific and context-aware predicates—such as “about to hit” or “swinging”—thereby enriching scene interpretation and improving the recognition of rare and fine-grained relations.
In this paper, we propose IntegraPSG, a unified single-stage method for PSG that integrates LLM guidance with multimodal feature fusion. Specifically, IntegraPSG is single-stage in that panoptic segmentation and relation prediction are jointly performed within a unified network, optimized end-to-end via a single joint objective. This framework enhances spatial relation modeling capability and subject–object pairing accuracy through multimodal feature fusion, while leveraging semantic guidance from the LLM to improve relation prediction performance. IntegraPSG comprises three major parts. The first is a panoptic segmentation network, which extracts object query features and produces object masks and class labels to provide a structured scene representation that lays the foundation for subsequent relational reasoning. The second is a multimodal sparse relation prediction network, which integrates visual, depth, and language information to construct a multimodal pairing proposal matrix. This module incorporates statistical priors from the PSG dataset [13] and employs a lightweight matrix learner [21] to refine the selection of subject–object pairs, reducing redundancy while strengthening spatial relational reasoning. The third is a multimodal relation decoder. We utilize sparse features of selected subject–object pairs to form vision pair features. LLM-generated language prompts are encoded via a text encoder and are fed together with the visual features of selected subject–object pairs into a cross-modal decoder. This mechanism facilitates deep interaction between visual and linguistic modalities to predict context-aware relations. In summary, the main contributions of this paper include:
  • We propose a unified single-stage framework, IntegraPSG, to address the challenges of spatial reasoning and long-tail distribution in PSG by integrating multimodal features in the subject–object pairing stage and introducing the semantic guidance of LLM for relation prediction. Our method establishes a collaborative mechanism of “multimodal refinement–language-guided reasoning”, which provides an effective solution to these fundamental challenges.
  • We design a multimodal sparse relation prediction network that constructs and jointly optimizes visual-language and depth-spatial relation proposal matrices. The architecture enhances the screening mechanism for subject–object pairs, improving the accuracy of candidate pair selection while reducing redundant pairings.
  • We propose a language-guided multimodal relation decoder that cross-modally interacts language descriptions generated by LLM with visual features. The design substantially strengthens discriminative capability for rare relations and markedly improves prediction performance on long-tail samples.
The efficacy of IntegraPSG is rigorously evaluated through extensive experiments on the PSG dataset under the challenging SGDet task. Performance is assessed using standard metrics, including Recall@K (R@K) and mean Recall@K (mR@K). Comparative analyses against the baseline and several representative contemporary methods show that IntegraPSG achieves substantial improvements. Notably, it attains the highest R@100 and overall mean scores among the evaluated approaches, while delivering strong performance in mR@100. These results validate the effectiveness of our multimodal feature fusion and LLM guidance reasoning architecture, highlighting both its accuracy and robustness for PSG tasks.

2. Related Work

2.1. Panoptic Scene Graph Generation

Panoptic scene graph generation (PSG) [13] originates from the SGG task; it adopts a panoptic segmenter and replaces bounding boxes with pixel-level segmentation masks to achieve more detailed subject–object localization and relation modeling. Yang et al. [13] first propose the PSG task, construct the PSG dataset, and design two representative baseline models: PSGTR and PSGFormer. PSGTR [13] builds on the DETR Transformer architecture [24], treats triplet generation as a query matching problem, and generates mask-level triplets using three types of queries (subject, predicate, and object), realizing end-to-end unified prediction. PSGFormer [13] introduces a dual-decoder structure that focuses on mask-level object recognition and relational reasoning, respectively, and shares contextual information through interaction modules to improve overall accuracy. In subsequent research, ADTrans [25] enhances the discriminative properties of relation recognition through semantic subspace segmentation and visual-semantic prototype alignment, especially improving the robustness of long-tail relations. PSGAtten [26] addresses the noise interference caused by mask segmentation and constructs a hybrid attention mechanism that integrates semantic and positional attention to guide the formation of more consistent semantic connections between subject and object, improving the stability and accuracy of mask-level relational reasoning. DGTM [20] constructs relation candidate pairs based on the Transformer attention graph and combines a double-gate structure to enhance feature selectivity, achieving better results in tail relation recognition. To provide a structured overview of these diverse approaches, we summarize their key characteristics in Table 1. Different from these approaches, we propose IntegraPSG, which copes with the long-tail distribution problem and improves relation recognition in complex scenes by jointly modeling multimodal features with prior statistical pairing and facilitating cross-modal alignment with the help of a language-guided multimodal relation decoder.

2.2. Panoptic Segmentation

Panoptic segmentation is a key foundation for complex scene understanding tasks (e.g., PSG): it assigns a semantic class label to each pixel in an image and distinguishes different instances of the same category, accounting for the continuity between background regions and foreground objects and providing finer-grained object boundaries and semantic support for SGG. Early panoptic segmentation methods are mainly based on two-branch fusion architectures, such as Panoptic FPN [27] and UPSNet [28], which handle semantic and instance segmentation separately and unify them in a post-processing stage. With the rise of mask representation, methods such as MaskFormer [29] and K-Net [30] unify the segmentation task into a mask classification problem, which effectively improves the consistency of model representation. In recent years, the introduction of the Transformer architecture [24,31] has driven significant improvements in panoptic segmentation performance. Among these methods, Mask2Former [32], an integrated framework, demonstrates state-of-the-art performance on multiple segmentation tasks. We adopt it because its core mask attention mechanism enables the high-quality, detailed mask modeling our work requires, and its combination of benchmark performance and strong generalization provides robust and precise object representations, a critical underpinning for the subsequent relation modeling in our PSG task.

2.3. Large Language Model for Visual Relation Understanding

In recent years, LLMs have made significant breakthroughs in the field of natural language processing. Since the GPT series [33] led the wave of pre-trained models, models such as LLaMA [34], Claude [35], Bard [36], and Qwen [37] have been introduced, performing well in tasks such as language understanding, generation, and common-sense reasoning. LLMs are capable of acquiring rich semantic patterns and world knowledge from large-scale corpora, with excellent context modeling and generalization capabilities. Some research [22,23,38] has shown that LLMs can generate descriptive language information that fits the context of an image, which is especially suitable for improving the performance of tasks such as classification, object detection, and relation recognition. In the PSG task, effectively utilizing the contextual information and knowledge provided by LLMs is crucial, yet current exploration remains insufficient. Our approach utilizes an LLM to generate relation judgments and descriptions for PSG dataset triplets and constructs a library of language prompts to enhance rare relation understanding and improve the model’s ability to handle long-tail distributions.

3. Method

3.1. Problem Definition

We first recall that the goal of the classical SGG task is to generate a target scene graph G consisting of <subject, relation, object> triplets from an input image $I \in \mathbb{R}^{H \times W \times 3}$, and the generation of the scene graph G can be formulated as a joint probability distribution:
$$ p(G \mid I) = p(B, C \mid I)\, p(R \mid I, B, C) \tag{1} $$
As defined in Equation (1), the target scene graph $G = \{E, R\}$ consists of an entity set E and a relation set R. The entity set E includes bounding boxes $B = \{b_1, \ldots, b_n\}$ and class labels $C = \{c_1, \ldots, c_n\}$, which correspond to the n objects in the image. The relation set is denoted by $R = \{r_1, \ldots, r_n\}$. Each $b_i \in \mathbb{R}^4$ represents the bounding box coordinates, and $c_i$ and $r_i$ belong to the sets of all object and relation categories, respectively. The term $p(B, C \mid I)$ represents the detection of entity objects via an object detector, while $p(R \mid I, B, C)$ denotes the relation prediction conditioned on the detected entities. The inference process from $p(B, C \mid I)$ to $p(R \mid I, B, C)$ requires the model to construct a sparse scene graph by identifying the most probable entity pairs from all possible combinations.
While classical SGG localizes and relates only foreground objects using bounding boxes, PSG generation extends this formulation to encompass both foreground (things) and background (stuff) regions via pixel-level segmentation. This enables a more comprehensive representation of the entire scene. Accordingly, the generation of the panoptic scene graph G is formulated as:
$$ p(G \mid I) = p(M, C \mid I)\, p(R \mid I, M, C) \tag{2} $$
As defined in Equation (2), $G = \{M, C, R\}$ includes a set of binary segmentation masks $M = \{m_1, \ldots, m_n\}$, where each $m_i \in \{0, 1\}^{H \times W}$ represents the spatial extent of an object, and $C = \{c_1, \ldots, c_n\}$ denotes the associated class labels. The relation set $R = \{r_1, \ldots, r_n\}$ captures semantic interactions among segmented regions. Here, the term $p(M, C \mid I)$ models object segmentation and classification via a panoptic segmentation network, while $p(R \mid I, M, C)$ refers to relation prediction based on the segmentation-derived visual and contextual features. This formulation enables direct reasoning over fine-grained, mask-level scene elements, moving beyond the limitations of bounding box representations.

3.2. Overall Architecture

As shown in Figure 2, the proposed IntegraPSG framework consists of three core components. First, a Mask2Former-based panoptic segmentation network is employed to extract object queries and generate the corresponding object masks along with their semantic class labels. Second, a multimodal sparse relation prediction network identifies a subset of subject–object pairs with high relational confidence from the full set of candidates, effectively reducing combinatorial complexity. Finally, a multimodal relation decoder integrates visual features with language-guided prompt features to predict the semantic relation for each selected pair, thereby enhancing relational inference accuracy, particularly in challenging or ambiguous scenarios.

3.3. Panoptic Segmentation Network

In this paper, we use the state-of-the-art model Mask2Former as a panoptic segmenter, as shown in Figure 2a. The architecture of Mask2Former consists of a backbone network, a pixel decoder, and a Transformer decoder with a mask attention mechanism. It contains a set of object queries, which have the ability to handle semantic segmentation, instance segmentation, and panoptic segmentation tasks in a unified way. In this network, the entity regions are driven by object queries. These queries interact with multi-scale pixel features from the pixel encoder using the mask attention mechanism. The object queries are updated layer by layer at the Transformer decoder, which ultimately generates high-resolution masks and class predictions.
Given an input image $I \in \mathbb{R}^{H \times W \times 3}$, Mask2Former directly outputs the set of object queries $Q_e = \{q_i\}_{i=1}^{N} \in \mathbb{R}^{N \times d}$, where $q_i$ denotes the i-th entity and d is the embedding dimension. Meanwhile, Mask2Former predicts the corresponding mask and class label of each entity from the set of object queries. Specifically, the entity set is represented as $Q_e = \{m_i, c_i\}_{i=1}^{N} \in \mathbb{R}^{N \times d}$, where $m_i \in \{0, 1\}^{H \times W}$ denotes the mask of the i-th entity, $c_i \in \{1, \ldots, C_e\}$ is the corresponding entity class label, and $C_e$ denotes the number of entity categories. Since the N object queries generated by Mask2Former incorporate the location information implied by the masks and the semantic information implied by the class labels, these object queries can be encoded as the corresponding subjects and objects for subsequent relation prediction tasks.
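To make this interface concrete, the following minimal PyTorch sketch shows the tensors this stage is expected to hand to the rest of IntegraPSG (N object queries, their binary masks, and class labels). The wrapper function and its output keys are illustrative assumptions rather than the actual Mask2Former API.

```python
import torch

# A minimal sketch of the panoptic segmentation outputs used downstream.
# `segmenter` and its output keys are hypothetical, not the real Mask2Former interface.
def run_panoptic_stage(segmenter, image: torch.Tensor):
    """image: (3, H, W) RGB tensor."""
    out = segmenter(image.unsqueeze(0))            # hypothetical forward call
    queries = out["queries"].squeeze(0)            # (N, d)   object query features q_i
    mask_logits = out["mask_logits"].squeeze(0)    # (N, H, W)
    class_logits = out["class_logits"].squeeze(0)  # (N, C_e + 1), incl. a "no object" class
    masks = mask_logits.sigmoid() > 0.5            # binary masks m_i in {0,1}^(H x W)
    labels = class_logits.argmax(dim=-1)           # class label c_i per query
    return queries, masks, labels
```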

3.4. Multimodal Sparse Relation Prediction Network

As shown in Figure 2b, for the given N entity objects, our multimodal sparse relation prediction network first generates an initial N × N multimodal relation proposal matrix. Each element of this matrix is used to predict the degree of relevance of the possible relation between subject and object features. Then, the matrix learner is used to process the proposal matrix and refines it into a sparse relation matrix. Finally, high-confidence subject–object pairs are selected from the final sparse matrix.

3.4.1. Multimodal Relation Proposal Matrix Construction

To strengthen the model’s scene comprehension capability and enhance the quality of relational pairing, as shown in Figure 3, our framework incorporates two complementary branches: a visual language matrix branch that captures visual semantic information and a depth spatial matrix branch that encodes geometric constraints. These branches are fused to construct a multimodal subject–object relation proposal matrix. In this matrix, visual language features provide dominant guidance, while depth features offer supplementary spatial verification. Finally, through statistical prior filtering, the final proposal matrix is constructed.
Visual language relation proposal matrix: We generate a set of object queries $\{q_i\}_{i=1}^{N}$ containing visual language information based on the Mask2Former architecture. Each query $q_i \in \mathbb{R}^{1 \times d}$ interacts with multi-scale features output from the pixel decoder through masked attention, thus incorporating rich visual features and semantic information. Using these query features as input, subject embeddings $E_{\text{sub}}$ and object embeddings $E_{\text{obj}}$ are produced by two multilayer perceptrons (MLPs)—one for subjects and one for objects. Both MLPs share the same architecture but have independent parameters. This process is defined as in Equation (3):
$$ E_{\text{sub}} = \text{MLP}(q_i), \quad E_{\text{obj}} = \text{MLP}(q_i), \tag{3} $$
where $E_{\text{sub}}, E_{\text{obj}} \in \mathbb{R}^{N \times d}$.
Finally, we compute the cosine similarity between the subject embeddings and object embeddings. This yields the visual language relation proposal matrix $V_{\text{matrix}} \in \mathbb{R}^{N \times N}$, which reflects the strength of association between subject–object pairs in the visual language space.
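A minimal PyTorch sketch of this branch is given below: two MLPs with identical architecture but independent parameters produce the subject and object embeddings, and their pairwise cosine similarities form $V_{\text{matrix}}$. The hidden layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the visual-language branch: subject/object MLPs plus cosine similarity.
class VisualLanguageBranch(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.sub_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.obj_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, queries: torch.Tensor) -> torch.Tensor:
        # queries: (N, d) object query features from Mask2Former
        e_sub = F.normalize(self.sub_mlp(queries), dim=-1)   # (N, d) subject embeddings
        e_obj = F.normalize(self.obj_mlp(queries), dim=-1)   # (N, d) object embeddings
        return e_sub @ e_obj.t()                             # (N, N) cosine similarities
```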
Depth spatial relation proposal matrix: To address the limited capacity of the visual modality in representing fine-grained three-dimensional spatial structures, we leverage the MiDaS [39] monocular depth estimation model to generate a dense depth map $D \in \mathbb{R}^{H \times W}$ for each input RGB image. The depth map maintains the same resolution as the original input. Based on the object masks $m_i$ obtained from the Mask2Former segmentation output, we extract depth features from the aligned depth map and normalize them to construct a compact depth representation. Specifically, for each object region, we encode its spatial characteristics using a three-dimensional feature vector $f_i^{\text{depth}}$ defined as in Equation (4):
$$ f_i^{\text{depth}} = [\mu(m_i), \sigma(m_i), \min(m_i)], \tag{4} $$
where $\mu(m_i)$ is the mean of the depth values in the mask region, reflecting the overall proximity of the object region; $\sigma(m_i)$ is the standard deviation of the depth values in the mask region, indicating the flatness of the object’s surface; and $\min(m_i)$ is the minimum depth value in the mask region, capturing the closest point and assisting in determining the object’s spatial boundary.
We design spatial relation metrics, including depth difference, surface flatness consistency, and boundary proximity between objects, to compute the spatial relation representation between subject–object pairs. These metrics produce the depth spatial relation proposal matrix $D_{\text{matrix}} \in \mathbb{R}^{N \times N}$, which reflects the relative positions and structural relationships of subject–object pairs in three-dimensional space.
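The sketch below illustrates this branch under stated assumptions: each mask is summarized by the [mean, std, min] depth vector of Equation (4), and the pairwise score is taken as the negative summed absolute difference of the three statistics. The text names the metrics (depth difference, flatness consistency, boundary proximity) but not their exact formulas, so this scoring rule is an assumption.

```python
import torch

# Sketch of the depth-spatial branch: per-mask depth statistics and a simple pairwise score.
def depth_features(depth_map: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    # depth_map: (H, W); masks: (N, H, W) binary
    feats = []
    for m in masks.bool():
        vals = depth_map[m]
        if vals.numel() == 0:                          # empty-mask fallback
            vals = depth_map.reshape(-1)
        mu = vals.mean()
        sigma = ((vals - mu) ** 2).mean().sqrt()       # population std of the region
        feats.append(torch.stack([mu, sigma, vals.min()]))
    return torch.stack(feats)                          # (N, 3) per-object depth vector

def depth_spatial_matrix(feats: torch.Tensor) -> torch.Tensor:
    # Higher score = more spatially compatible pair (closer mean depth,
    # similar flatness, nearby closest points); one simple realisation.
    diff = (feats.unsqueeze(1) - feats.unsqueeze(0)).abs()   # (N, N, 3)
    return -diff.sum(dim=-1)                                 # (N, N) D_matrix
```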
Statistical prior correlation matrix: Considering that relying only on the visual, language, and depth modalities may still introduce semantically implausible subject–object pairs, we introduce a statistical pairing prior as a filtering mechanism. We count object category pairs on the PSG dataset, construct a category co-occurrence frequency matrix, and normalize it. Using the generated subject–object pairs and their categories in each image, we query this matrix to construct a statistical prior correlation matrix $P_{\text{matrix}} \in \mathbb{R}^{N \times N}$ between subject–object pairs. This matrix provides each subject–object pair with a score that indicates how plausible the relation is on the dataset. Finally, the matrix is used to eliminate erroneous semantic combinations, i.e., subject–object pairs that the visual, language, and depth cues struggle to distinguish but that are unlikely to hold any relation.
Multimodal relation proposal matrix: The visual language relation proposal matrix $V_{\text{matrix}}$ and the depth spatial relation proposal matrix $D_{\text{matrix}}$ are combined through weighted fusion to form an initial relation proposal matrix $M_{\text{initial}} \in \mathbb{R}^{N \times N}$, as defined in Equation (5):
$$ M_{\text{initial}} = \alpha \cdot V_{\text{matrix}} + \beta \cdot D_{\text{matrix}}, \tag{5} $$
where α and β are learnable weighting parameters that balance visual language and depth features in the relation proposal matrix. By learning them end-to-end, the network can automatically adjust their values to optimize multimodal fusion for the PSG task.
Subsequently, subject–object pairs in $M_{\text{initial}}$ that are semantically implausible are filtered out using the statistical prior correlation matrix $P_{\text{matrix}}$ to construct the final relation proposal matrix $M_{\text{filtered}} \in \mathbb{R}^{N \times N}$. This operation realizes the complementary enhancement of multimodal information, as defined in Equation (6):
$$ M_{\text{filtered}} = \text{Filter}(M_{\text{initial}}, P_{\text{matrix}}) \tag{6} $$
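A minimal sketch of Equations (5) and (6) is shown below, with $\alpha$ and $\beta$ as learnable scalars. The concrete Filter(·) rule used here, suppressing pairs whose normalized co-occurrence prior falls below a small threshold, is an assumption, since the exact filtering operation is not spelled out above.

```python
import torch
import torch.nn as nn

# Sketch of multimodal fusion (Eq. (5)) and prior-based filtering (Eq. (6)).
# The threshold-based filter is an assumed realisation of Filter(.).
class ProposalFusion(nn.Module):
    def __init__(self, prior_threshold: float = 1e-3):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))   # learnable fusion weights
        self.beta = nn.Parameter(torch.tensor(0.5))
        self.prior_threshold = prior_threshold

    def forward(self, v_matrix, d_matrix, p_matrix):
        # v_matrix, d_matrix, p_matrix: (N, N)
        m_initial = self.alpha * v_matrix + self.beta * d_matrix   # Eq. (5)
        implausible = p_matrix < self.prior_threshold              # prior says "no relation"
        return m_initial.masked_fill(implausible, -1e4)            # Eq. (6), filtered matrix
```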

3.4.2. Multimodal Sparse Relation Matrix Refinement

Given that the number of proposed object pairs typically far exceeds the number of valid pairs actually present in an image, the multimodal relation proposal matrix $M_{\text{filtered}}$ inevitably contains numerous semantically irrelevant subject–object pairs. To enhance both the efficiency and accuracy of subsequent relation prediction, it is essential to eliminate pairs that violate semantic coherence.
To address this challenge, we introduce a matrix learner designed to optimize the proposal matrix $M_{\text{filtered}}$. This module employs a lightweight convolutional neural network architecture, which effectively suppresses redundant noise while preserving local semantic associations through convolutional operations. Its purpose is to further filter out meaningless subject–object pairs and to identify those with potential relational significance, thereby improving the sparsity and precision of the relation proposals.
To supervise the training of the relation matrix learner, we construct a ground truth pairing relation matrix $M_{\text{gt}} \in \mathbb{R}^{N \times N}$, where N denotes the predefined number of object queries, as shown in Figure 2b. This matrix establishes a precise correspondence between subject–object instances in the ground truth scene graph and the predicted object queries, achieved via Hungarian matching. For each ground truth scene graph, valid subject–object pairs $(s_i, o_j)$ are mapped into the predicted query space, and the corresponding entries $(i, j)$ in $M_{\text{gt}}$ are set to 1. All other entries, which represent irrelevant or non-existent pairs, are set to 0. As a result, the final $M_{\text{gt}}$ forms a highly sparse binary supervision matrix.
During training, the matrix learner refines the multimodal relation proposal matrix $M_{\text{filtered}}$ under the guidance of $M_{\text{gt}}$, producing a sparser and more accurate relation matrix $M_{\text{sparse}}$. This refined matrix approximates the sparse distribution and valid pairings of the ground truth. Subsequently, vision pair features are constructed by performing a Top-k selection on $M_{\text{sparse}}$ (see Section 3.5.1). These features serve as robust inputs to the following relational reasoning stage.
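A possible realization of the matrix learner is sketched below as a two-layer convolutional network over the $N \times N$ proposal matrix; the specific depth and channel width are assumptions about what “lightweight CNN” means here.

```python
import torch
import torch.nn as nn

# Sketch of the lightweight convolutional matrix learner that refines
# M_filtered into the sparse relation matrix M_sparse (logits).
class MatrixLearner(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, m_filtered: torch.Tensor) -> torch.Tensor:
        # m_filtered: (N, N) -> pair logits of the same shape
        x = m_filtered.unsqueeze(0).unsqueeze(0)     # (1, 1, N, N)
        return self.net(x).squeeze(0).squeeze(0)     # (N, N) = M_sparse logits
```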

3.5. Multimodal Relation Decoder

In order to improve the modeling capability for complex semantics and long-tail relations in the relational reasoning stage, we propose a language-guided multimodal relation decoder, shown in Figure 2c. The decoder takes as visual input the subject–object vision pair features constructed in Figure 2b, where the features are derived by selecting the Top-k subject queries $Q_s$ and object queries $Q_o$ from the sparse relation matrix. Subsequently, the relation decoder employs a cross-attention mechanism to integrate the vision pair features with language prompt features generated by the LLM, thereby enabling precise and semantically enriched relation prediction.

3.5.1. Vision Pair Features Extraction

Top-k selection is performed on the $M_{\text{sparse}}$ obtained in Section 3.4.2 to extract the subject–object pairs, and their location indexes, that are most likely to be related. Based on this selection, the corresponding subject queries $Q_s \in \mathbb{R}^{k \times d}$ and object queries $Q_o \in \mathbb{R}^{k \times d}$ are extracted from the set of object queries $\{q_i\}_{i=1}^{N}$. Then the subject queries and object queries are concatenated along the subject/object embedding dimension to generate pair query representations $Q_{\text{pair}}$, i.e., vision pair features $F_V$, as defined in Equation (7):
$$ F_V = Q_{\text{pair}} = \text{Concat}(Q_s, Q_o) \in \mathbb{R}^{N_p \times 2d}, \tag{7} $$
where $N_p$ denotes the number of subject–object pairs, which equals the number of Top-k selected object queries. With the above concatenation, the visual features of each subject–object pair are integrated into a unified joint representation for the subsequent relation prediction task.
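The following sketch illustrates Equation (7): Top-k entries of $M_{\text{sparse}}$ give the subject and object indices, whose queries are gathered and concatenated into $F_V$.

```python
import torch

# Sketch of vision pair feature extraction (Equation (7)).
def extract_vision_pairs(m_sparse: torch.Tensor, queries: torch.Tensor, k: int = 100):
    n = m_sparse.size(0)
    scores, flat_idx = m_sparse.flatten().topk(k)               # top-k subject-object pairs
    sub_idx = torch.div(flat_idx, n, rounding_mode="floor")     # recover row (subject) index
    obj_idx = flat_idx % n                                      # recover column (object) index
    q_s, q_o = queries[sub_idx], queries[obj_idx]               # (k, d) each
    f_v = torch.cat([q_s, q_o], dim=-1)                         # (k, 2d) vision pair features
    return f_v, sub_idx, obj_idx
```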

3.5.2. Language Prompt Features Extraction

LLM provides additional language prior information for the PSG task to enhance the model’s comprehension of visual relations, especially the generalization performance on long-tail relations.
To alleviate the long-tail distribution challenge in the PSG task, we use an LLM as a relation adjudicator to judge and describe the relation for a given triplet (subject, relation, object). We design a set of structured prompt templates based on judgment instructions, shown in Figure 4a, to guide the LLM to generate a natural language description of the subject–object relation together with the reasoning behind it, as shown in Figure 4b. The prompt template consists of three critical components: first, the LLM’s role orientation and task definition; second, example guidance; and finally, the specific triplet to be judged. Importantly, this procedure is carried out only once in an offline manner to build a reusable knowledge base, thereby avoiding the overhead of repeated inference during training. Rather than training or fine-tuning, we directly exploit a pre-trained and frozen LLM in a zero-shot setting.
To construct a comprehensive language prior knowledge base, we systematically enumerate all possible subject–object category pairs as well as all relation categories on the PSG dataset. Within this offline process, we leverage a pre-trained text encoder (e.g., BERT), kept frozen, to encode the natural language description of each triplet into a high-dimensional text embedding $f_{\text{text}} \in \mathbb{R}^{1 \times D_L}$. Collectively, these embeddings form a complete set of language prompt features $F_L^{\text{all}} \in \mathbb{R}^{N_{\text{sub}}^2 \times R \times D_L}$, where $N_{\text{sub}}$ denotes the total number of object categories and R denotes the number of relation categories on the dataset. During the subsequent online training of the main network, as shown in Figure 2c, IntegraPSG retrieves these pre-computed features from the knowledge base, enabling seamless integration. Because each triplet is encoded independently, the resulting representations remain rich and specific even for sparse long-tail relations, thereby effectively mitigating the long-tail problem.
Given an input RGB image, our approach first selects the Top-k most likely subject–object pairs using the multimodal sparse relation prediction network. Subsequently, according to the category indexes of these pairs, we retrieve the corresponding language prompt features $F_L \in \mathbb{R}^{N_p \times R \times D_L}$ under the R (R = 56 on the PSG dataset) relation categories from the knowledge base. For a specific relation r between the $N_p$ subject–object pairs in the image, the language prompt features are represented as $F_L(r) \in \mathbb{R}^{N_p \times D_L}$, which capture fine-grained relational attributes—encompassing both common and rare relations—for the given subject–object pairs. Finally, $F_L(r)$ can be integrated into the relation prediction module as semantic guidance to enhance visual feature representation.
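The offline/online split can be sketched as follows: offline, each LLM-generated description is encoded by the frozen text encoder into a lookup table keyed by subject class, object class, and relation; online, features are only retrieved, so no LLM or encoder call is needed during training. The dictionary layout and the encode_fn signature are assumptions.

```python
import torch

# Sketch of offline knowledge-base construction and online retrieval of language prompts.
R_REL, D_L = 56, 768          # relation categories and text-embedding dimension (assumed BERT)

def build_knowledge_base(descriptions, encode_fn):
    """descriptions[(s_cls, o_cls, r)] -> LLM text; encode_fn(text) -> (D_L,) tensor."""
    return {key: encode_fn(text) for key, text in descriptions.items()}   # run offline, once

def retrieve_prompts(kb, sub_cls, obj_cls):
    # sub_cls, obj_cls: lists of category indices for the N_p selected pairs
    feats = []
    for s, o in zip(sub_cls, obj_cls):
        per_rel = [kb.get((s, o, r), torch.zeros(D_L)) for r in range(R_REL)]
        feats.append(torch.stack(per_rel))       # (R, D_L) prompt features for this pair
    return torch.stack(feats)                    # (N_p, R, D_L) = F_L
```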

3.5.3. Relation Prediction

To realize deeper interaction between visual and language features, we design a language-guided multimodal relation decoder. The decoder aims to enhance the language expression of rare relations and alleviate the learning bias caused by the long-tail distribution of relations. Built upon the BERT architecture [40], it introduces a cross-modal cross-attention mechanism. By providing independent language prompt features for each relation category, the mechanism is able to guide the semantic focus of visual features, thus realizing a more accurate, relation-aware visual language fusion.
As shown in Figure 5, the core design of this decoder is as follows: (1) each of the R relation categories is processed by two specialized cross-modal Transformer decoder blocks [41], and (2) the transformed features are fed into a final fully-connected layer for relation prediction.
The constructed visual pair features $F_V \in \mathbb{R}^{N_p \times 2d}$ and language prompt features $F_L \in \mathbb{R}^{N_p \times R \times D_L}$ are, respectively, projected into a unified embedding space via linear transformations, as defined in Equation (8):
$$ F_V = \text{FC}_v(F_V) \in \mathbb{R}^{N_p \times D}, \quad F_L = \text{FC}_l(F_L) \in \mathbb{R}^{N_p \times R \times D} \tag{8} $$
In the multimodal relation decoder, we adopt a cross-modal attention mechanism. The attention computation is defined as in Equation (9), using the visual pair features $F_V$ as the query and the relational language prompt features $F_L$ as the key and value. Importantly, this cross-attention is computed independently for each relation category $r \in [1, R]$:
$$ \text{Attention}(F_V, F_L(r)) = \text{softmax}\!\left(\frac{F_V W_Q (F_L(r) W_K)^T}{\sqrt{D}}\right) (F_L(r) W_V), \tag{9} $$
where $F_L(r) \in \mathbb{R}^{N_p \times D}$ denotes the language prompt features of the r-th relation category.
Since the language prompt features $F_L$ provide an independent semantic guidance vector for each relation triplet (subject, relation, object), the decoder can establish independent and parallelizable semantic fusion channels across different relation categories. This design enables parallel relation interactions while avoiding semantic interference caused by shared attention resources. Specifically, for the $N_p$ subject–object pairs in an image, we can simultaneously compute their fusion features $F_{\text{fusion}} \in \mathbb{R}^{N_p \times R \times D}$ under the R relation categories. Each element $F_{\text{fusion}}(r) \in \mathbb{R}^{N_p \times D}$ represents the visual language fusion features corresponding to the r-th relation. Subsequently, we introduce an independent linear classifier $\text{FC}_r$ for each relation category, which dimensionally compresses the fusion features to generate the prediction results $R_{\text{rel}}(r)$ for the r-th relation between subject–object pairs, as defined in Equation (10):
$$ R_{\text{rel}}(r) = \text{FC}_r(F_{\text{fusion}}(r)) \in \mathbb{R}^{N_p \times 1} \tag{10} $$
Finally, the predictions of all relation categories are concatenated to obtain the complete relation predictions $R_{\text{rel}}$, as defined in Equation (11):
$$ R_{\text{rel}} = \text{Concat}(R_{\text{rel}}(1), \ldots, R_{\text{rel}}(R)) \in \mathbb{R}^{N_p \times R} \tag{11} $$
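A simplified sketch of Equations (8)–(11) is given below. For brevity, the two cross-modal Transformer blocks per relation are collapsed into a single cross-attention step whose projections are shared across relations, while the per-relation classifiers $\text{FC}_r$ remain independent as described; the embedding dimensions (2d = 512, $D_L$ = 768, D = 256) are assumptions.

```python
import torch
import torch.nn as nn

# Simplified sketch of the language-guided relation decoder (Eqs. (8)-(11)).
class RelationDecoder(nn.Module):
    def __init__(self, d_vis=512, d_lang=768, d_model=256, n_rel=56):
        super().__init__()
        self.fc_v = nn.Linear(d_vis, d_model)             # Eq. (8) visual projection
        self.fc_l = nn.Linear(d_lang, d_model)            # Eq. (8) language projection
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.classifiers = nn.ModuleList([nn.Linear(d_model, 1) for _ in range(n_rel)])
        self.scale = d_model ** 0.5

    def forward(self, f_vis: torch.Tensor, f_lang: torch.Tensor) -> torch.Tensor:
        # f_vis: (N_p, 2d) vision pair features; f_lang: (N_p, R, D_L) language prompts
        q = self.w_q(self.fc_v(f_vis))                    # (N_p, D) queries
        logits = []
        for r, fc_r in enumerate(self.classifiers):       # independent channel per relation
            l_r = self.fc_l(f_lang[:, r])                 # (N_p, D) prompts for relation r
            attn = torch.softmax(q @ self.w_k(l_r).t() / self.scale, dim=-1)  # Eq. (9)
            fused = attn @ self.w_v(l_r)                  # (N_p, D) fusion features
            logits.append(fc_r(fused))                    # Eq. (10): (N_p, 1)
        return torch.cat(logits, dim=-1)                  # Eq. (11): (N_p, R)
```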

3.6. Model Training

Our end-to-end PSG framework is divided into three sub-tasks: entity segmentation based on the panoptic segmentation network, sparse relation matrix refinement based on the multimodal sparse relation prediction network, and relation prediction based on the multimodal relation decoder. During training, each sub-task generates the corresponding supervision signal and loss function. Unlike two-stage pipelines, all components are optimized jointly under a weighted loss, rendering the segmentation module fully differentiable and co-trainable with relation prediction. This joint optimization defines IntegraPSG as a true single-stage framework rather than a sequential cascade.
Panoptic segmentation loss: To drive the panoptic segmentation network to accurately segment object masks and predict their class labels, we adopt the loss function design from Mask2Former [32], denoted as the panoptic segmentation loss $L_{seg}$, as shown in Equation (12):
$$ L_{seg} = \lambda_{cls} L_{cls} + \lambda_{dice} L_{dice} + \lambda_{mask} L_{mask}, \tag{12} $$
where $L_{cls}$ is the classification loss, $L_{dice}$ is the dice loss, $L_{mask}$ is the mask loss, and the weight coefficients are set to $\lambda_{cls} = 2$ and $\lambda_{dice} = \lambda_{mask} = 5$.
Sparse relation matrix supervised loss: The predicted relation matrix contains a large number of meaningless subject–object pairs, so the distribution of positive and negative samples is extremely imbalanced; direct use of the traditional binary cross-entropy loss would bias IntegraPSG towards predicting an all-zero matrix. To this end, we use a positive-sample-weighted binary cross-entropy loss $L_{mpair}$ to supervise the matrix learner [21]. By scaling up the positive sample weights by a factor of p, we instruct IntegraPSG to generate the final sparse relation matrix while focusing on the valid subject–object pairs and balancing the contributions of positive and negative samples. This loss is defined as in Equation (13):
$$ L_{mpair} = -\big[\, p \cdot y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y}) \,\big], \tag{13} $$
where y and $\hat{y}$ denote the ground truth label and the predicted relation probability for each subject–object pair, respectively. The positive sample weight is $p = \frac{\text{total proposed sub--obj pairs}}{\text{ground truth sub--obj pairs}} = \frac{N_{pair}}{\sum_{i,j} M_{gt}(i,j)}$.
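A minimal sketch of Equation (13) is given below; it takes the total number of proposal pairs to be the number of matrix entries, which is an assumption about how $N_{pair}$ is counted.

```python
import torch

# Sketch of the positive-sample-weighted binary cross-entropy (Equation (13)).
def weighted_pair_bce(pred_logits: torch.Tensor, m_gt: torch.Tensor, eps: float = 1e-6):
    # pred_logits: (N, N) M_sparse logits; m_gt: (N, N) binary {0,1} float matrix
    y_hat = pred_logits.sigmoid().clamp(eps, 1 - eps)
    p = m_gt.numel() / m_gt.sum().clamp(min=1.0)         # up-weight the few positive entries
    loss = -(p * m_gt * y_hat.log() + (1 - m_gt) * (1 - y_hat).log())
    return loss.mean()
```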
Subject–object classification loss: This loss supervises the class prediction accuracy of the Top-k high-priority subject–object pairs selected by the sparse relation prediction network. Specifically, because the number of model-predicted subject–object pairs $N_p$ is usually much larger than the number of GT triplets $N_{gt}$, we first use the Hungarian matching algorithm to achieve the optimal alignment of the predicted subject–object pairs with the labels of the GT subject–object pairs. Each GT triplet is matched to a unique predicted candidate pair, ensuring accurate supervision. After matching, $L_{cls}^{sub\text{-}obj}$ is computed using the standard cross-entropy loss function, as defined in Equation (14):
$$ L_{cls}^{sub\text{-}obj} = \text{CrossEntropyLoss}(\hat{s}, s) + \text{CrossEntropyLoss}(\hat{o}, o), \tag{14} $$
where $\hat{s}$ and $\hat{o}$ are the probability distributions of subject and object classes predicted by IntegraPSG, and s and o are the ground truth class labels.
Relation classification loss: Considering the imbalanced distribution of relation categories on the PSG dataset, we use the Seesaw loss function [42]. This loss alleviates the training bias caused by the extreme imbalance of positive and negative samples by dynamically adjusting the gradient weights of different categories. The loss $L_{rel}$ is defined as in Equation (15):
$$ L_{rel} = \text{SeesawLoss}(\hat{r}, r), \tag{15} $$
where $\hat{r}$ is the predicted probability distribution over relation categories, and r is the ground truth relation label.
Overall loss function: The overall loss L integrates four components: the panoptic segmentation loss, the sparse relation matrix supervised loss, the subject–object classification loss, and the relation classification loss, as defined in Equation (16):
$$ L = \lambda_{seg} L_{seg} + \lambda_{mpair} L_{mpair} + \lambda_{cls}^{sub\text{-}obj} L_{cls}^{sub\text{-}obj} + \lambda_{rel} L_{rel}, \tag{16} $$
where $\lambda_{seg} = 1$, $\lambda_{mpair} = 5$, $\lambda_{cls}^{sub\text{-}obj} = 4$, and $\lambda_{rel} = 2$. In the multimodal relation decoder, we directly use the pre-trained LLM and text encoder, so no additional training is needed for them.

4. Experiments

4.1. Experimental Settings

Dataset: In our PSG task, we use the most challenging PSG dataset to conduct experiments, where each image is labeled with both panoptic segmentation and a scene graph. This dataset is derived from the COCO dataset [43] and contains a total of 48,749 images, of which 2177 are in the test set and 46,572 are in the training set. The annotation content covers 80 “thing” categories (e.g., people, animals, furniture) and 53 “stuff” categories (e.g., grass, wall, sky), a total of 133 object categories. Additionally, the dataset is labeled with 56 categories of relational predicates, which are detailed in Table 2. In our work, all 56 of these relationship categories are utilized for both training and evaluation.
Tasks: Existing research has shown [13] that PSG typically includes two sub-tasks: predicate classification (PredCls) and scene graph detection (SGDet). PredCls aims to take as input the GT positions and class labels of objects, focusing on predicting the relation categories between objects and generating the scene graph. This task excludes the interference of segmentation accuracy on the results and focuses on evaluating the model’s ability to understand relations, and it is therefore suitable for the performance evaluation of two-stage models. In contrast, SGDet takes the original image as input and requires the model to jointly predict the categories of objects, their locations, and the relation categories between them to generate a scene graph. This task requires the model to have both object detection and relation modeling capabilities, and it is the most comprehensive and challenging task in PSG. Our IntegraPSG is evaluated for the most challenging SGDet task.
Metrics: Recall@K (R@K) measures how many GT triplets a model successfully matches among its top K predicted triplets, reflecting the overall recall ability of the model for high-frequency relations. Following existing studies, K is set to 20, 50, and 100. Considering that the PSG dataset has a severe long-tail distribution, we further introduce mean Recall@K (mR@K), which calculates R@K separately for each relation category before averaging, to more fairly assess the model’s performance across head and tail relations. Considering the different focuses of these metrics and the resulting performance trade-offs, we also report a mean metric, obtained by averaging R@K and mR@K, to provide a comprehensive assessment of overall model performance.
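For clarity, the sketch below computes simplified versions of R@K and mR@K in which triplets are matched by labels only; the actual PSG protocol additionally requires the predicted subject and object masks to overlap their ground-truth masks.

```python
from collections import defaultdict

# Simplified R@K / mR@K sketch with label-only triplet matching.
def recall_at_k(pred_triplets, gt_triplets, k):
    """pred_triplets: ranked list of (sub_cls, rel, obj_cls); gt_triplets: list of same."""
    topk = set(pred_triplets[:k])
    hits = sum(1 for t in gt_triplets if t in topk)
    return hits / max(len(gt_triplets), 1)

def mean_recall_at_k(pred_triplets, gt_triplets, k):
    per_rel_gt = defaultdict(list)
    for t in gt_triplets:
        per_rel_gt[t[1]].append(t)                    # group GT triplets by predicate
    recalls = [recall_at_k(pred_triplets, gts, k) for gts in per_rel_gt.values()]
    return sum(recalls) / max(len(recalls), 1)        # average over relation categories
```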

4.2. Implementation Details

We use ResNet-50 and Mask2Former pre-trained on the COCO dataset to initialize the backbone, pixel decoder, and Transformer decoder. The number of object queries is set to N = 100, the embedding dimension is set to d = 256, and the number of selected Top-k subject–object pairs is set to $N_p$ = 100. For the LLM, we choose the Qwen2.5-7b-instruct model released by Alibaba, and the text encoder is the officially released pre-trained BERT model. Our model is trained using the AdamW optimizer [44] with an initial learning rate of 1 × 10^{-4} and a weight decay factor of 5 × 10^{-2}; the backbone network uses a learning rate of 1 × 10^{-5}. The model is trained for 12 epochs, and the learning rate is reduced by a factor of 0.1 at epochs 5 and 10 to achieve fine-tuning and reduce the risk of overfitting. All experiments are run on 2 NVIDIA GeForce RTX 3090 Ti GPUs, where each GPU processes 2 images for a fixed batch size of 4, and training takes about 30 h.
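The optimization setup described above can be sketched as follows; the attribute name model.backbone is an assumption about how the backbone parameters are exposed.

```python
import torch

# Sketch of the AdamW setup: lr 1e-4 (1e-5 for the backbone), weight decay 5e-2,
# and step decay by 0.1 at epochs 5 and 10.
def build_optimizer(model):
    backbone_params = list(model.backbone.parameters())       # assumed attribute name
    backbone_ids = {id(p) for p in backbone_params}
    other_params = [p for p in model.parameters() if id(p) not in backbone_ids]
    optimizer = torch.optim.AdamW(
        [{"params": other_params, "lr": 1e-4},
         {"params": backbone_params, "lr": 1e-5}],
        weight_decay=5e-2,
    )
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 10], gamma=0.1)
    return optimizer, scheduler
```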

4.3. Experimental Results

In order to validate the effectiveness and interpretability of the proposed method IntegraPSG, we conduct experiments on the PSG dataset for the SGDet task, and systematically compare it with several representative methods (IMP [45], MOTIFS [46], VCTree [47], GPSNet [48], CIM [18], PSGTR [13], PSGFormer [13], RCFrame [16], ADTrans [25], PSGAtten [26], CAFE [17], and DGTM [20]). Figure 6 and Table 3 present the performance of each method under R@K, mR@K, mean metrics, and inference time.
The experimental results demonstrate that IntegraPSG consistently achieves stable and superior performance across all evaluation metrics, significantly outperforming multiple baseline methods. In particular, compared with the single-stage method PSGTR, our approach improves mR@20, mR@50, and mR@100 by 5.7%, 5.5%, and 6.5%, respectively; compared with the two-stage method VCTree, the improvements are 12.6%, 16.1%, and 18.4% on these metrics. Notably, IntegraPSG attains the highest score in R@100 and remains highly competitive across other metrics, surpassing most methods in mR@100 as well. The overall mean reaches 30.0%, slightly exceeding the second-best method, PSGAtten (29.9%). This indicates that IntegraPSG demonstrates considerable competitiveness in low-sample relation categories and effectively mitigates the long-tail distribution problem of relations. Consequently, it captures richer contextual and scene relational information with greater precision, thereby enhancing overall performance.
Although IntegraPSG’s mR@K scores (22.3%, 26.3%, 28.6%) are moderately lower than those of ADTrans (26.4%, 29.7%, 30.0%), our method demonstrates clear advantages in R@K metrics, achieving substantial gains in R@20 (+2.7%), R@50 (+5.5%), and R@100 (+8.7%), along with a 1.4% improvement in the overall mean. This highlights IntegraPSG’s strong capability in identifying high-confidence relational triplets, particularly in scenes requiring fine-grained subject–object modeling. In contrast, ADTrans shows more balanced performance across relation frequencies, reflecting its superior generalization to long-tail predicates. These complementary strengths indicate that while IntegraPSG excels in accurately modeling complex visual relationships, there remains room for improving its robustness to infrequent relation types.
Furthermore, the noticeable performance gap between R@20 and R@100 suggests that while IntegraPSG successfully detects a substantial number of correct relations, some of these relations are not consistently ranked at the top of the confidence list. This phenomenon is likely due to the dense distribution of confidence scores among numerous candidate triplets, which constrains the precision of the highest-ranked predictions. Specifically, this dense scoring means that common, visually obvious relations and even some false positives often receive higher confidence scores than correct but infrequent relations. As a result, when the ranking is cut off at a small K, such as 20, these less frequent relations are disproportionately pushed out of the top positions. This effect directly explains the sharp decline in mR@K, a metric sensitive to all relation categories, and clarifies why the performance gap widens against models like ADTrans that better handle long-tail distributions. Nonetheless, the strong performance at R@100, supported by the competitive mR@100 score, underscores the model’s robust relational reasoning capabilities and highlights promising avenues for future refinement in modeling the prioritization of critical relationships—an aspect particularly crucial for high-precision panoramic scene graph understanding and downstream inference tasks.
To evaluate the practical utility of our method, we measured inference efficiency and computational cost. As shown in Table 3, IntegraPSG processes a single image on an NVIDIA GeForce RTX 3090 Ti in 0.183 s, using approximately 1022 MB of GPU memory. While RCFrame is slightly faster (0.155 s) on SGDet, IntegraPSG achieves a 5.4% higher average recall, highlighting its favorable trade-off between speed and performance. Despite incorporating multimodal feature fusion with LLM prompts, IntegraPSG is roughly 20% faster than the single-stage PSGTR baseline. This efficiency arises from its streamlined architecture, reasoning over a sparse set of meaningful subject–object pairs and retrieving knowledge from the language-prompt database, optimizing feature integration without excessive computational cost.

4.4. Ablation Study

In order to validate the effectiveness of the individual components, we carry out some ablation experiments for them to understand the impact that the different modules have on the IntegraPSG method.
Proposal matrix construction: To validate the advantages of multimodal proposal matrix construction, we perform detailed ablation experiments on its components and show the performance improvements in Table 4. “Vision”, “Depth”, and “Prior” represent the visual language proposal, depth spatial proposal, and statistical prior components, respectively. The analysis in Table 4 shows that the full proposal matrix construction enables IntegraPSG to achieve the best performance of 28.7% and 22.3% on R@20 and mR@20, respectively, which effectively improves the accuracy of subject–object pairing in the scene and provides higher-quality inputs for subsequent relational reasoning, thereby boosting the overall PSG performance.
Visual pair features: In the relational reasoning stage, the model needs to fuse the subject and object features of each subject–object pair as input to the relation decoder. Commonly used fusion methods include concatenation (Concat) and addition (Add). To evaluate the performance difference between the two approaches in our modeling task, we conduct ablation experiments, and the results are summarized in Table 5. The analysis of Table 5 shows that with the Concat approach, the model outperforms the Add approach in all metrics, with the most significant gains on mR@50 and mR@100 (improvements of 1.9% and 2.0%, respectively), indicating that concatenation better preserves the semantic and visual differences between subject and object, which helps the decoder capture complex relation types.
Comparative analysis of long-tail mitigation strategies: To evaluate the effectiveness of the language-guided strategy in enhancing rare relation prediction and to demonstrate its superiority over conventional long-tail mitigation methods, we conducted a comprehensive ablation study. In particular, the comparison with the loss re-weighting scheme ensures that other factors remain unchanged when evaluating the impact of language information. As shown in Table 6, the complete IntegraPSG model substantially outperforms both re-sampling and loss re-weighting, highlighting its strong capability in addressing the long-tail distribution of relations. When the language information is introduced, IntegraPSG shows different degrees of improvement in all metrics. Among them, R@20 and mR@20 are increased by 1.0% and 1.5%, respectively, and mR@50 and mR@100 are also significantly increased by 1.4% and 1.4%, respectively. This indicates that the fusion of language prompt features is a key component in boosting relation prediction performance, particularly in enhancing the recognition of rare relations.
To concretely illustrate the impact of LLM guidance, we present side-by-side visualizations of IntegraPSG predictions with and without LLM prompts (Figure 7, Figure 8 and Figure 9). These examples complement the ablation results in Table 6, demonstrating how language information enhances the prediction of rare and challenging relations.
Ablation on relation classification loss: To effectively handle the long-tail distribution of relationships in complex scenes, we systematically evaluated the impact of different relation loss functions. As shown in Table 7, while standard approaches like focal loss [49] achieve high overall recall (R@K), they perform poorly on rare relations, with mR@100 reaching only 23.0%. In sharp contrast, Seesaw loss not only maintains high overall recall but also substantially improves mean recall, elevating mR@100 to 28.6%. These results demonstrate that Seesaw loss provides a fundamentally better balance between head and tail classes, thereby justifying its adoption as a more competitive and effective loss function in our model.

4.5. Qualitative Analysis

Figure 10 and Figure 11 illustrate representative qualitative results of IntegraPSG, showing the input image, panoptic segmentation output, ground truth annotations in the lower left, and the top 20 predicted triplets in the lower right. These visualizations complement the Recall@K metrics by highlighting both the strengths and the limitations of the model in complex scenes.
In the complex street scene of Figure 10, IntegraPSG demonstrates strong generalization, correctly predicting plausible yet unannotated triplets such as <bus, driving on, road> and additional <car–road> relations (highlighted in red circles). Equally revealing, however, are the failure cases. The prediction <pavement-merged, attached to, pavement-merged> constitutes a clear logical error, exposing the absence of intrinsic constraints to prevent self-referential associations, particularly when modeling abstract “stuff” categories. Furthermore, although the relevant objects (person and road) were correctly segmented, IntegraPSG failed to recover the rare ground truth relation <person, crossing, road>, exemplifying the difficulty of recalling extremely rare predicates and directly contributing to the moderate mR@K scores. Another misclassification, <person, riding, bicycle> instead of the ground truth <person, looking at, bicycle>, underscores a semantic bias: despite riding being even less frequent on the dataset than looking at, the model still favors it. This suggests a tendency to prioritize the prototypical functional association of the <person–bicycle> pair—namely riding—over the actual visual evidence of a static posture. Collectively, these cases highlight a key challenge for IntegraPSG: leveraging the strong semantic associations learned from language without allowing them to override the realities grounded in visual perception.
In the wedding scene of Figure 11, IntegraPSG demonstrates its capacity to parse intricate social interactions, correctly identifying relations such as <person, wearing, tie>, <person, holding, knife>, and so on. A particularly instructive case arises with the prediction <knife, slicing, cake>. On the one hand, this triplet is semantically coherent and linguistically natural, highlighting the model’s ability to capture the strong association between “knife” and “slicing”. On the other hand, the absence of the ground truth relation <person, slicing, cake> exposes a notable limitation: the difficulty of distinguishing between agent (the initiator of an action) and instrument (the means by which it is enacted).
Two factors likely contribute to this shortcoming. First, visual ambiguity: the close spatial arrangement of the hand, knife, and cake complicates the assignment of their respective roles. Second, linguistic bias: IntegraPSG appears to over-rely on canonical functional associations—here, privileging the knife–slicing link—at the expense of tracing the causal chain back to the true agent. Importantly, the value of this case lies not in declaring <knife, slicing, cake> as strictly correct or incorrect, but in suggesting a potential direction for improving PSG models: moving beyond simple triplet recognition to better capture causal dependencies and entity roles.
Overall, these examples reveal the main scenarios where IntegraPSG still struggles: self-referential or logically inconsistent predictions when modeling abstract “stuff” categories, failure to recall extremely rare relations, and confusion between agents and instruments in complex action-centric scenes.

5. Discussion

Despite the significant progress achieved by IntegraPSG in PSG, several limitations remain. First, the language guidance leveraged by the cross-attention mechanism can introduce a subtle bias towards semantic plausibility, potentially at the expense of fine-grained visual evidence. Second, the model’s core architecture, while optimized for triplet recognition, does not explicitly capture causal dependencies and may generate logically inconsistent predictions, which underscores the need for more robust causal reasoning constraints. Another key consideration involves the long-tail distribution: while our method demonstrates a marked advance in recognizing rare relations, its efficacy diminishes for the most sparsely represented predicates, presenting a key frontier for generalization. In future research, we will mitigate predictive bias stemming from the imbalance between language guidance and visual information, and improve performance on the long-tail distribution of predicates. Future research could also focus on enhancing the model’s capacity for causal reasoning.

6. Conclusions

In this paper, we propose IntegraPSG, a unified single-stage method for PSG that integrates LLM guidance with multimodal feature fusion. IntegraPSG generates object queries using the powerful panoptic segmenter Mask2Former. A multimodal sparse relation prediction network then fuses visual, language, and depth information; incorporating a statistical prior and a matrix learner, it efficiently selects the most likely subject–object pairs from numerous candidates for subsequent relation prediction. To mitigate the long-tail distribution of relations, we design a language-guided multimodal relation decoder that enhances the fusion of visual pair features and language prompt features. Experiments on the PSG dataset demonstrate IntegraPSG’s superior performance compared with several representative existing methods, effectively improving relation inference accuracy and interpretability while alleviating the long-tail issue. Despite these improvements, the model remains challenged by the long-tail distribution. Future work will investigate strategies to further mitigate this issue and address the remaining technical limitations, with the aim of enhancing PSG’s generalization and enabling broader applications in comprehensive scene understanding and multimodal reasoning.
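To make the three-stage flow summarized above concrete, the following Python sketch outlines the inference path at a high level. All module names (segmenter, pair_net, relation_decoder, prompt_bank) are placeholders standing in for the components described in this paper rather than the released implementation, so the snippet is a schematic sketch under those assumptions.

```python
import torch

def integrapsg_inference_sketch(image, segmenter, pair_net, relation_decoder,
                                prompt_bank):
    """Schematic three-stage inference flow; all modules are placeholders."""
    # Stage 1: the panoptic segmenter (Mask2Former) yields object queries,
    # segmentation masks, and class labels.
    queries, masks, labels = segmenter(image)

    # Stage 2: the multimodal sparse relation prediction network scores all
    # candidate subject-object pairs (fusing visual, language, and depth cues
    # with a statistical prior) and returns only the sparse, most likely pairs.
    subj_idx, obj_idx = pair_net(queries, image)

    # Stage 3: the language-guided relation decoder cross-attends vision pair
    # features with LLM-derived prompt features and predicts relation labels.
    pair_feats = torch.cat([queries[subj_idx], queries[obj_idx]], dim=-1)
    prompts = prompt_bank.lookup(labels[subj_idx], labels[obj_idx])
    relations = relation_decoder(pair_feats, prompts)
    return masks, labels, (subj_idx, relations, obj_idx)
```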

Author Contributions

Conceptualization, Y.Z. and Q.Z.; methodology, Y.Z. and X.S.; software, Y.Z. and Q.Z.; validation, G.L. and Q.Z.; formal analysis, X.S. and G.L.; writing—original draft preparation, Y.Z. and G.L.; writing—review and editing, Q.Z. and X.S.; visualization, G.L.; supervision, Q.Z. and X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this work are publicly accessible and have been appropriately cited throughout the article. The panoptic scene graph (PSG) dataset is available at https://github.com/Jingkang50/OpenPSG (accessed on 21 July 2022). The depth maps estimated by MiDaS from the PSG dataset, together with the textual knowledge base generated by large language models, are stored and accessible at https://pan.baidu.com/s/1lu97CPMKlq9e4ShBVqt6sw?pwd=7dyw (accessed on 1 August 2025). The updated code implementation, including recent additions, is publicly available at https://github.com/zhaoshuang0220/IntegraPSG/tree/main (accessed on 1 August 2025).

Acknowledgments

The authors are grateful for the Qwen-7B model provided by Alibaba DAMO Academy, which was used to generate text inputs for our architecture. All AI-generated content was carefully verified by the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gu, J.; Zhao, H.; Lin, Z.; Li, S.; Cai, J.; Ling, M. Scene graph generation with external knowledge and image reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1969–1978. [Google Scholar]
  2. Li, H.; Zhu, G.; Zhang, L.; Jiang, Y.; Dang, Y.; Hou, H.; Shen, P.; Zhao, X.; Shah, S.A.A.; Bennamoun, M. Scene graph generation: A comprehensive survey. Neurocomputing 2024, 566, 127052. [Google Scholar] [CrossRef]
  3. Schroeder, B.; Tripathi, S. Structured query-based image retrieval using scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 178–179. [Google Scholar]
  4. Qian, T.; Chen, J.; Chen, S.; Wu, B.; Jiang, Y.G. Scene graph refinement network for visual question answering. IEEE Trans. Multimed. 2022, 25, 3950–3961. [Google Scholar] [CrossRef]
  5. Chen, S.; Jin, Q.; Wang, P.; Wu, Q. Say as you wish: Fine-grained control of image caption generation with abstract scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9962–9971. [Google Scholar]
  6. Zhong, Y.; Wang, L.; Chen, J.; Yu, D.; Li, Y. Comprehensive image captioning via scene graph decomposition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 211–229. [Google Scholar]
  7. Li, W.; Zhang, H.; Bai, Q.; Zhao, G.; Jiang, N.; Yuan, X. Ppdl: Predicate probability distribution based loss for unbiased scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19447–19456. [Google Scholar]
  8. Herzig, R.; Bar, A.; Xu, H.; Chechik, G.; Darrell, T.; Globerson, A. Learning canonical representations for scene graph to image generation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 210–227. [Google Scholar]
  9. Amiri, S.; Chandan, K.; Zhang, S. Reasoning with scene graphs for robot planning under partial observability. IEEE Robot. Autom. Lett. 2022, 7, 5560–5567. [Google Scholar] [CrossRef]
  10. Jiang, B.; Zhuang, Z.; Shivakumar, S.S.; Taylor, C.J. Enhancing Scene Graph Generation with Hierarchical Relationships and Commonsense Knowledge. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, 26 February–6 March 2025; pp. 8883–8894. [Google Scholar]
  11. Liu, H.; Yan, N.; Mortazavi, M.; Bhanu, B. Fully convolutional scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11546–11556. [Google Scholar]
  12. Wang, L.; Yuan, Z.; Chen, B. Multi-Granularity Sparse Relationship Matrix Prediction Network for End-to-End Scene Graph Generation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 105–121. [Google Scholar]
  13. Yang, J.; Ang, Y.Z.; Guo, Z.; Zhou, K.; Zhang, W.; Liu, Z. Panoptic scene graph generation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 25–27 October 2022; pp. 178–196. [Google Scholar]
  14. Caesar, H.; Uijlings, J.; Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1209–1218. [Google Scholar]
  15. Kirillov, A.; He, K.; Girshick, R.; Rother, C.; Dollár, P. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9404–9413. [Google Scholar]
  16. Yang, J.; Wang, C.; Liu, Z.; Wu, J.; Wang, D.; Yang, L.; Cao, X. Focusing on flexible masks: A novel framework for panoptic scene graph generation with relation constraints. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 4209–4218. [Google Scholar]
  17. Shi, H.; Li, L.; Xiao, J.; Zhuang, Y.; Chen, L. From easy to hard: Learning curricular shape-aware features for robust panoptic scene graph generation. Int. J. Comput. Vis. 2025, 133, 489–508. [Google Scholar] [CrossRef]
  18. Liang, S.; Zhang, L.; Xie, C.; Chen, L. Causal intervention for panoptic scene graph generation. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
  19. Xu, J.; Chen, J.; Yanai, K. Contextual associated triplet queries for panoptic scene graph generation. In Proceedings of the 5th ACM International Conference on Multimedia in Asia, Tainan, Taiwan, 6–8 December 2023; pp. 1–5. [Google Scholar]
  20. Sun, Y.; Chen, Y.; Huang, X.; Wang, Y.; Chen, S.; Yao, K.; Yang, A. DGTM: Deriving Graph from transformer with Mamba for panoptic scene graph generation. Array 2025, 26, 100394. [Google Scholar] [CrossRef]
  21. Wang, J.; Wen, Z.; Li, X.; Guo, Z.; Yang, J.; Liu, Z. Pair Then Relation: Pair-Net for Panoptic Scene Graph Generation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10452–10465. [Google Scholar] [CrossRef] [PubMed]
  22. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. ACM Trans. Intell. Syst. Technol. 2025, 16, 1–72. [Google Scholar] [CrossRef]
  23. Min, B.; Ross, H.; Sulem, E.; Veyseh, A.P.B.; Nguyen, T.H.; Sainz, O.; Agirre, E.; Heintz, I.; Roth, D. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Comput. Surv. 2023, 56, 1–40. [Google Scholar] [CrossRef]
  24. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the Ninth International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021; pp. 1–16. [Google Scholar]
  25. Li, L.; Ji, W.; Wu, Y.; Li, M.; Qin, Y.; Wei, L.; Zimmermann, R. Panoptic scene graph generation with semantics-prototype learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; pp. 3145–3153. [Google Scholar]
  26. Kuang, X.; Che, Y.; Han, H.; Liu, Y. Semantic-enhanced panoptic scene graph generation through hybrid and axial attentions. Complex Intell. Syst. 2025, 11, 110. [Google Scholar] [CrossRef]
  27. Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6399–6408. [Google Scholar]
  28. Xiong, Y.; Liao, R.; Zhao, H.; Hu, R.; Bai, M.; Yumer, E.; Urtasun, R. Upsnet: A unified panoptic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8818–8826. [Google Scholar]
  29. Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875. [Google Scholar]
  30. Zhang, W.; Pang, J.; Chen, K.; Loy, C.C. K-net: Towards unified image segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 10326–10338. [Google Scholar]
  31. Li, Z.; Wang, W.; Xie, E.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P.; Lu, T. Panoptic segformer: Delving deeper into panoptic segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1280–1289. [Google Scholar]
  32. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  33. Abdullah, M.; Madain, A.; Jararweh, Y. ChatGPT: Fundamentals, applications and social impacts. In Proceedings of the 2022 Ninth International Conference on Social Networks Analysis, Management and Security (SNAMS), Milan, Italy, 29 November–1 December 2022; pp. 1–8. [Google Scholar]
  34. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  35. Anthropic. Claude. 2023. Available online: https://claude.ai/chat (accessed on 29 July 2025).
  36. Google. Bard. 2023. Available online: https://bard.google.com/chat (accessed on 29 July 2025).
  37. Alibaba. Qwen. 2023. Available online: https://chat.qwen.ai/ (accessed on 29 July 2025).
  38. Li, L.; Xiao, J.; Chen, G.; Shao, J.; Zhuang, Y.; Chen, L. Zero-Shot Visual Relation Detection via Composite Visual Cues from Large Language Models. Adv. Neural Inf. Process. Syst. 2023, 36, 50105–50116. [Google Scholar]
  39. Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1623–1637. [Google Scholar] [CrossRef] [PubMed]
  40. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 3–5 June 2019; pp. 4171–4186. [Google Scholar]
  41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  42. Wang, J.; Zhang, W.; Zang, Y.; Cao, Y.; Pang, J.; Gong, T.; Chen, K.; Liu, Z.; Loy, C.C.; Lin, D. Seesaw loss for long-tailed instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9695–9704. [Google Scholar]
  43. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  44. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the Seventh International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019; pp. 1–18. [Google Scholar]
  45. Xu, D.; Zhu, Y.; Choy, C.B.; Fei-Fei, L. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5410–5419. [Google Scholar]
  46. Zellers, R.; Yatskar, M.; Thomson, S.; Choi, Y. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5831–5840. [Google Scholar]
  47. Tang, K.; Zhang, H.; Wu, B.; Luo, W.; Liu, W. Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6619–6628. [Google Scholar]
  48. Lin, X.; Ding, C.; Zeng, J.; Tao, D. Gps-net: Graph property sensing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3746–3753. [Google Scholar]
  49. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Los Alamitos, CA, USA, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
Figure 1. An illustration of the PSG task. (a) Input images; (b) corresponding panoptic segmentation results; (c) predictions from conventional PSG methods, which often struggle with contextual modeling and rare relation recognition, frequently defaulting to common relations (e.g., “playing”, “holding”) for representation; (d) predictions from our method. By integrating LLM to generate descriptive relation triplets (encompassing both common and rare relations), our method predicts richer interactions. For instance, the “person–sports ball” subject–object pair is predicted as “person–about to hit–sports ball”, providing a more nuanced understanding. Additionally, the common relation “holding” is replaced by the semantically specific, rare relation “swinging”, which demonstrates our model’s ability to parse the semantics of visual scenes and mitigate long-tail relation bias.
Figure 2. An illustration of the overall framework of IntegraPSG. It contains three parts: (a) The panoptic segmentation network employs a query-based Mask2Former segmenter to generate object queries, masks, and class labels. (b) The multimodal sparse relation prediction network constructs a multimodal relation proposal matrix through a pairwise module based on the object queries and the depth features and generates sparse subject–object pair features with the help of a matrix learner. (c) The multimodal relation decoder attends vision pair features to language prompt features that are retrieved from a pre-computed knowledge base. Finally, the decoder predicts the corresponding relation labels of the subject–object pairs. The snowflake and fire symbols denote frozen and tunable modules, respectively.
Figure 3. An illustration of our pairwise module. The pairwise module constructs the visual–language relation proposal matrix and the depth-spatial relation proposal matrix, then fuses and filters them to obtain the final multimodal relation proposal matrix.
Figure 4. An illustration of the prompt design and relation description results of the LLM for triplet relation judgment. (a) The prompt construction, including role setting and user content, which requires the model to judge the reasonableness of a “sub–rel–obj” triplet based on common sense, with a focus on rare relations. (b) Model responses for several typical cases (e.g., person–riding–surfboard, person–swinging–tennis racket, elephant–guiding–elephant, giraffe–sitting on–fence), showing the model’s judgment on whether a specific triplet relation exists and its reasoning.
Figure 5. An illustration of the relation decoder architecture.
Figure 6. Comparison of R@K, mR@K, and mean evaluation results for the SGDet task on the PSG dataset.
Figure 7. Visualization results of IntegraPSG on the PSG dataset, comparing predictions with and without LLM guidance (using loss re-weighting). (a) Ground truth annotation; (b) Predictions without LLM; (c) Predictions with LLM. Red circles with “✓” represent plausible but unannotated triplets.
Figure 8. Visualization results of IntegraPSG on the PSG dataset, comparing predictions with and without LLM guidance (using loss re-weighting). (a) Ground truth annotation; (b) Predictions without LLM; (c) Predictions with LLM.
Figure 9. Visualization results of IntegraPSG on the PSG dataset, comparing predictions with and without LLM guidance (using loss re-weighting). (a) Ground truth annotation; (b) Predictions without LLM; (c) Predictions with LLM. Red circles with “✓” represent plausible but unannotated triplets.
Figure 10. In the visualization, red circles with “✓” represent plausible but unannotated triplets, white circles with “✓” indicate predictions matching the ground truth, and blue circles with “×” highlight a few selected representative errors for qualitative analysis.
Figure 11. In the visualization, red circles with “✓” represent plausible but unannotated triplets, while white circles with “✓” indicate predictions matching the ground truth.
Table 1. Comparison of key characteristics of existing PSG methods.

| Method | Core Architecture | Relation Modeling | Strengths and Limitations |
|---|---|---|---|
| PSGTR [13] | Single-Decoder Transformer | Triplet Query Decoding | (+) A simple one-stage baseline: serves as the most direct and straightforward benchmark. (−) Redundant segmentation, non-trivial Re-ID. |
| PSGFormer [13] | Dual-Decoder Transformer | Object and Relation Query Matching | (+) Unbiased prediction, excels on novel relations. (−) Limited peak recall (R@K). |
| ADTrans [25] | Plug-and-Play Dataset Enhancement Framework | Annotation Debiasing via Prototype Learning | (+) Alleviates the long-tail problem via annotation correction. (−) Reduces recall (R@K) on frequent predicates. |
| PSGAtten [26] | Two-Stage Attention-Based Structure | Visual-Semantic Fusion via Hybrid Attention | (+) Achieves superior recall via enhanced visual-semantic fusion. (−) High complexity; conservative predictions. |
| DGTM [20] | Single-Stage Structure Integrating Mamba and KAN | Deriving Relations from Self-Attention via Mamba | (+) Enhances mean recall (mR@K) by leveraging attention by-products. (−) Overall recall is constrained by segmentation performance. |
| IntegraPSG | LLM-Integrated Single-Stage Framework | Synergistic Multimodal Pairing and LLM-Guided Decoding | (+) Achieves balanced recall and long-tail performance via a “refine-then-reason” synergy. (−) Under-calibrated confidence for its long-tail predictions. |
Table 2. The 56 relationship categories defined on the PSG dataset.

Relationship predicate vocabulary: over, in front of, beside, on, in, attached to, hanging from, on back of, falling off, going down, painted on, walking on, running on, crossing, standing on, lying on, sitting on, flying over, jumping over, jumping from, wearing, holding, carrying, looking at, guiding, kissing, eating, drinking, feeding, biting, catching, picking, playing with, chasing, climbing, cleaning, playing, touching, pushing, pulling, opening, cooking, talking to, throwing, slicing, driving, riding, parked on, driving on, about to hit, kicking, swinging, entering, exiting, enclosing, leaning on.
Table 3. The performance of IntegraPSG is compared with other methods on the PSG dataset using R@K, mR@K, and mean metrics to evaluate the SGDet task.

| Method | R@20 | R@50 | R@100 | mR@20 | mR@50 | mR@100 | Mean | Inference Time |
|---|---|---|---|---|---|---|---|---|
| IMP [45] | 16.5 | 18.2 | 18.6 | 6.5 | 7.1 | 7.2 | 12.4 | - |
| Motif [46] | 20.0 | 21.7 | 22.0 | 9.1 | 9.6 | 9.7 | 15.4 | - |
| VCTree [47] | 20.6 | 22.1 | 22.5 | 9.7 | 10.2 | 10.2 | 15.9 | 0.132 * |
| GPSNet [48] | 17.8 | 19.6 | 20.1 | 7.0 | 7.5 | 7.7 | 13.3 | - |
| CIM [18,26] | 20.7 | 22.1 | 23.8 | 13.4 | 15.0 | 15.3 | 18.4 | - |
| PSGTR [13] | 28.4 | 34.4 | 36.3 | 16.6 | 20.8 | 22.1 | 26.4 | 0.230 * |
| PSGFormer [13] | 18.0 | 19.6 | 20.1 | 14.8 | 17.0 | 17.6 | 17.6 | 0.175 * |
| RCFrame [16] | 29.7 | 32.4 | 33.3 | 16.5 | 17.6 | 17.9 | 24.6 | 0.155 |
| ADTrans [25] | 26.0 | 29.6 | 30.0 | 26.4 | 29.7 | 30.0 | 28.6 | - |
| PSGAtten [26] | 29.1 | 35.2 | 37.3 | 25.1 | 26.2 | 26.6 | 29.9 | - |
| CAFE [17] | 24.6 | 27.6 | 28.7 | 25.0 | 26.6 | 26.9 | 26.6 | - |
| DGTM [20] | 25.79 | 28.26 | 28.42 | 25.38 | 27.8 | 27.9 | 27.3 | - |
| IntegraPSG (Ours) | 28.7 | 35.1 | 38.7 | 22.3 | 26.3 | 28.6 | 30.0 | 0.183 |

* Data from [16].
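For completeness, the R@K and mR@K columns follow the standard SGDet protocol: R@K is the fraction of ground truth triplets recovered among the top-K predictions, mR@K averages per-predicate recalls so that rare relations count as much as frequent ones, and the mean column averages the six reported values. The sketch below illustrates the per-image computation under the simplifying assumption of exact triplet matching; the actual protocol additionally requires the predicted masks to overlap the ground truth segments and aggregates over the whole test set.

```python
from collections import defaultdict

def recall_at_k(pred_triplets, gt_triplets, k):
    """Per-image R@K with exact triplet matching (a simplification:
    the real protocol also matches predicted and ground truth masks)."""
    top_k = set(pred_triplets[:k])              # predictions sorted by confidence
    hits = sum(1 for t in gt_triplets if t in top_k)
    return hits / max(len(gt_triplets), 1)

def mean_recall_at_k(pred_triplets, gt_triplets, k):
    """Per-image mR@K: recall is computed per predicate, then averaged,
    which up-weights rare relations relative to plain R@K."""
    top_k = set(pred_triplets[:k])
    gt_count, hit_count = defaultdict(int), defaultdict(int)
    for subj, pred, obj in gt_triplets:
        gt_count[pred] += 1
        hit_count[pred] += int((subj, pred, obj) in top_k)
    recalls = [hit_count[p] / gt_count[p] for p in gt_count]
    return sum(recalls) / max(len(recalls), 1)

# Toy example with triplets encoded as (subject, predicate, object) strings.
gt = [("person", "crossing", "road"), ("bus", "driving on", "road")]
pred = [("bus", "driving on", "road"), ("person", "walking on", "road")]
print(recall_at_k(pred, gt, k=2), mean_recall_at_k(pred, gt, k=2))  # 0.5 0.5
```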
Table 4. Ablation experiment of the proposal matrix module.

| Vision | Depth | Prior | R@20 | R@50 | mR@20 | mR@50 |
|---|---|---|---|---|---|---|
| ✓ | × | × | 27.0 | 33.6 | 21.0 | 26.3 |
| ✓ | ✓ | × | 28.4 | 34.8 | 22.0 | 25.7 |
| ✓ | × | ✓ | 28.6 | 33.9 | 22.2 | 26.0 |
| ✓ | ✓ | ✓ | 28.7 | 35.1 | 22.3 | 26.3 |

“✓” indicates that the corresponding component is included, whereas “×” indicates that it is not included. All results are reported on the SGDet task.
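The components in Table 4 correspond to the pairwise module sketched in Figure 3: a visual–language relation proposal matrix, a depth-based spatial matrix, and a statistical prior are fused before the sparse subject–object pairs are selected. The snippet below is a simplified illustration of that idea; the fixed element-wise weights and the plain top-k selection are assumptions standing in for the learned matrix learner, not the exact formulation used in the paper.

```python
import torch

def fuse_proposal_matrices(vis_lang_mat, depth_mat, prior_mat, top_k=100):
    """Fuse three (N, N) relatedness matrices and keep the top-k subject-object
    pairs. The fixed weights below are illustrative; in the paper the fusion
    and filtering are performed by a learned matrix learner."""
    fused = vis_lang_mat + 0.5 * depth_mat + 0.5 * prior_mat
    fused.fill_diagonal_(float("-inf"))        # never pair a segment with itself
    k = min(top_k, fused.numel())
    idx = fused.flatten().topk(k).indices
    return idx // fused.size(1), idx % fused.size(1)   # subject, object indices

# Toy usage with 5 candidate objects and the 4 highest-scoring pairs kept.
n = 5
subj, obj = fuse_proposal_matrices(torch.rand(n, n), torch.rand(n, n),
                                   torch.rand(n, n), top_k=4)
print(list(zip(subj.tolist(), obj.tolist())))
```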
Table 5. Different ways of integrating subject and object features.

| Method | R@20 | R@50 | R@100 | mR@20 | mR@50 | mR@100 |
|---|---|---|---|---|---|---|
| Add | 27.2 | 33.4 | 36.9 | 20.5 | 24.4 | 26.6 |
| Concat | 28.7 | 35.1 | 38.7 | 22.3 | 26.3 | 28.6 |

All results are reported on the SGDet task.
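The two rows of Table 5 differ only in how a subject–object pair feature is built from the two object query embeddings before relation decoding. A minimal sketch of the two variants, assuming d-dimensional query vectors and omitting any projection layers, is given below; concatenation keeps the subject and object roles in separate dimensions, which is consistent with its higher scores.

```python
import torch

def pair_feature(subj_query, obj_query, mode="concat"):
    """Combine subject and object query embeddings into one pair feature.
    'add' keeps dimensionality d; 'concat' yields 2d and keeps the two roles
    in separate dimensions. Projection layers are omitted for brevity."""
    if mode == "add":
        return subj_query + obj_query                   # shape: (d,)
    return torch.cat([subj_query, obj_query], dim=-1)   # shape: (2d,)

subj_q, obj_q = torch.randn(256), torch.randn(256)
print(pair_feature(subj_q, obj_q, "add").shape)     # torch.Size([256])
print(pair_feature(subj_q, obj_q, "concat").shape)  # torch.Size([512])
```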
Table 6. Ablation study on the role of language information in long-tail relation modeling.

| Method | LLM | R@20 | R@50 | R@100 | mR@20 | mR@50 | mR@100 |
|---|---|---|---|---|---|---|---|
| Re-sampling | × | 26.7 | 33.7 | 37.9 | 18.0 | 21.7 | 23.4 |
| Loss re-weighting | × | 27.7 | 34.8 | 38.5 | 20.8 | 24.9 | 27.2 |
| IntegraPSG (Ours) | ✓ | 28.7 | 35.1 | 38.7 | 22.3 | 26.3 | 28.6 |

“✓” indicates that the corresponding component is included, whereas “×” indicates that it is not included. All results are reported on the SGDet task.
Table 7. Different loss functions for relation classification.

| Method | R@20 | R@50 | R@100 | mR@20 | mR@50 | mR@100 |
|---|---|---|---|---|---|---|
| Cross-Entropy Loss | 27.9 | 34.6 | 38.2 | 15.1 | 18.2 | 19.5 |
| Focal Loss [49] | 28.7 | 35.1 | 38.4 | 18.5 | 21.7 | 23.0 |
| Seesaw Loss [42] | 28.7 | 35.1 | 38.7 | 22.3 | 26.3 | 28.6 |

All results are reported on the SGDet task.
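As a point of reference for Table 7, the snippet below sketches a generic multi-class focal loss (the middle row), which down-weights well-classified examples through the (1 - p_t)^gamma factor; the seesaw loss used in our final model additionally rescales gradients between classes according to their frequencies, which is what drives the larger mR@K gains. The implementation and the gamma value are generic illustrations, not the exact configuration of our experiments.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: cross-entropy scaled by (1 - p_t)^gamma.

    logits: (num_pairs, num_relations); targets: (num_pairs,) class indices.
    Generic reference implementation with an illustrative gamma value.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return ((1.0 - pt) ** gamma * (-log_pt)).mean()

# Toy example: 4 candidate subject-object pairs over the 56 PSG relations.
logits = torch.randn(4, 56)
targets = torch.tensor([5, 21, 44, 13])
print(focal_loss(logits, targets))
```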