HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group-Activity Scene Graph Generation in Videos

Group-activity scene graph (GASG) generation is a challenging task in computer vision, aiming to anticipate and describe relationships between subjects and objects in video sequences. Traditional video scene graph generation (VidSGG) methods focus on retrospective analysis, limiting their predictive capabilities. To enrich the scene-understanding capabilities, we introduce a GASG dataset extending the JRDB dataset with nuanced annotations involving appearance, interaction, position, relationship, and situation attributes. This work also introduces an innovative approach, a Hierarchical Attention-Flow (HAtt-Flow) mechanism, rooted in flow network theory to enhance GASG performance. Flow-attention incorporates flow conservation principles, fostering competition for sources and allocation for sinks, effectively preventing the generation of trivial attention. Our proposed approach offers a unique perspective on attention mechanisms, where conventional "values" and "keys" are transformed into sources and sinks, respectively, creating a novel framework for attention-based models. Through extensive experiments, we demonstrate the effectiveness of our HAtt-Flow model and the superiority of our proposed flow-attention mechanism. This work represents a significant advancement in predictive video scene understanding, providing valuable insights and techniques for applications that require real-time relationship prediction in video data.


Introduction
Visual scene understanding is a foundational challenge in computer vision, encompassing the interpretation of complex scenes, objects, and their relationships within images and videos. This task is particularly intricate for video, where temporal dynamics and multi-modal information introduce unique complexities. Group-activity video scene graph (GAVSG) generation, which involves predicting relationships between objects in a video across multiple frames, stands at the forefront of this endeavor.
In recent years, significant progress has been made in video understanding. Techniques such as video scene graph generation (VidSGG) have allowed us to extract high-level semantic representations from video content. However, VidSGG typically operates in a static, retrospective manner, constraining its predictive capabilities. The GAVSG dataset, on the other hand, extends the scope of visual scene understanding to anticipate and describe subject-and-object relationships and their temporal evolution.
In the context of SGG, diverse methodologies have been explored, from probabilistic graphical models and AND-OR grammar approaches [21][22][23][24][25][26][27][28] to knowledge graph embeddings like VTransE [29] and UVTransE [30]. Recent endeavors delve into challenges such as the long-tailed distribution of predicates [31,32], visually irrelevant predicates [33], and precise bounding box localization [34]. As shown in Figure 1, previous methods can detect the subjects and objects in a scene; however, they fail to generate a well-defined scene graph, whereas our method learns all the nuanced relationships among the subjects and objects in the scene to produce a fine-grained scene graph. To address this limitation and enable enriched learning of group activities in a scene, we introduce the GASG dataset, which includes nuanced annotations in the form of five different attributes. This helps set a better scene graph generation benchmark than the existing datasets in this domain. In this work, we propose a novel approach for GAVSG that draws inspiration from flow network theory, introducing flow-attention. This mechanism leverages flow conservation principles in both the source and sink aspects, introducing a competitive mechanism for sources and an allocation mechanism for sinks. This innovative approach mitigates the generation of trivial attention and enhances the predictive power of GAVSG. We build upon a new perspective on attention mechanisms, rooted in flow network theory, to design our GAVSG framework. The conventional attention mechanism aggregates information from "values" based on the similarity between "queries" and "keys". By framing attention in terms of flow networks, we transform values into sources and keys into endpoints, thus creating a fresh perspective on the attention mechanism.

Contributions:
The main contributions of our work are threefold. First, we introduce a novel dataset (the dataset is available for verification at https://uark-cviu.github.io/GASG/) with nuanced attributes that aid the scene graph generation task in the group activity setting. Second, our work advances the state of the art in predictive video scene understanding by introducing flow-attention and redefining attention mechanisms via incorporating hierarchy awareness. Third, we demonstrate the effectiveness of our approach through extensive experiments and achieve state-of-the-art performance over the existing approaches.

Related Work
Group action recognition (GAR). Group action recognition has witnessed a shift towards deep learning methodologies, notably convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [35][36][37][38][39][40][41][42][43][44][45]. Attention-based models and graph convolution networks are crucial in capturing spatial-temporal relations in group activities. Transformer-based encoders, often coupled with diverse backbone networks, excel in extracting features for discerning actor interactions in multimodal data [46]. Recent innovations, such as MAC-Loss, introduce dual spatial and temporal transformers for enhanced actor interaction learning [47]. The field continues to evolve with heuristic-free approaches like those by Tamura et al., simplifying the process of social group activity recognition and member identification [48].
Video scene graph generation (VidSGG). VidSGG, initiated by Shang et al. [72], explores spatio-temporal relations in videos. Research has delved into spatio-temporal conditional bias, the domain shift between image and video scene graphs, and embodied semantic approaches using intelligent agents. Notable methods include TRACE [73], which separates relation prediction and context modeling, and embodied semantic SGG, employing reinforcement learning for path generation via intelligent agents [74,75]. Yang et al. [76] proposed a transformer-encoder-based baseline model to evaluate their panoptic video scene graph dataset, which included fusing the extracted features of the subjects in the scene.

Limitation of Prior Datasets
We present a detailed comparison of existing datasets in Table 1. However, upon further examination, it becomes apparent that the limitations of prior datasets are particularly pronounced when considering the intricate nature of group activities within visual content. Many existing datasets have primarily focused on specific types of individual actions, often overlooking the complex dynamics of group activities in real-world interactions. This narrow focus has hindered the development of models capable of addressing a diverse range of action classifications, thereby limiting their adaptability to varied real-world scenarios.
Moreover, earlier datasets have often provided sparse annotations and underscored relationships within isolated fragments of the relational graph, neglecting the complexities of broader and more intricate scenes.This sparse annotation may lead to a lack of comprehensive relationship modeling and potential biases in the developed models.
Furthermore, there exists a significant gap in the representation of scenes featuring dense crowds of people.These settings present formidable challenges related to occlusion management and require a nuanced understanding of complex interactions within such contexts.
In response to these identified limitations, we introduce the group activity scene graph (GASG) dataset. This dataset directly addresses the aforementioned issues by featuring various scenes and settings (represented as attributes in the annotations), spanning distinct scenarios and effectively differentiating it from prior datasets. The GASG dataset excels in capturing five critical features of group activities: appearance, situation, position, interaction, and relations. Additionally, it comprehensively tracks the movements and interactions of individuals and sub-groups, facilitating a profound understanding of their dynamics and activities over time. With its rich coverage of these aspects, the GASG dataset lays the groundwork for a new paradigm in understanding complex group activities within scenarios characterized by dense populations, thereby pushing the boundaries of group activity recognition.

Dataset Overview
The GASG dataset offers a diverse array of sub-group and group activities, enriching the landscape of research in group activity recognition. It encompasses 27 categories of individual actions, such as walking and talking, aligning with the categories present in the JRDB-Act dataset. Additionally, our dataset presents 11 distinct categories of sub-group activities, ranging from standing closely and chatting to complex interactions like group evolution and collaborative work. These sub-group activities meticulously capture the nuances of interpersonal dynamics within smaller groups, providing a nuanced portrayal of real-world interactions.
Furthermore, the dataset encapsulates seven categories of group activities, including walking, conversing, commuting, resting, office working, waiting, and sitting. These activities encapsulate commonplace scenarios encountered in real-world settings, serving as a robust foundation for analyzing collective behaviors and interactions within larger groups. By incorporating a comprehensive spectrum of activities, the GASG dataset empowers researchers to delve deeply into the complexities of group dynamics and advance the frontiers of group activity recognition.

Data Collection and Annotation
The GASG dataset comprises a rich collection of videos from the JRDB dataset, offering a unique perspective on sub-group and overall group activities. It provides comprehensive coverage of various activities and introduces essential tracking information.
The tracking information within the GASG dataset (as shown in Figure 2) facilitates a detailed understanding of individuals and sub-groups within each frame. This information includes the trajectory, position, and interactions of each actor. The dataset defines five key aspects for comprehensive scene understanding:

• Interaction: This aspect characterizes the dynamic interactions between subjects and objects, shedding light on how individuals and sub-groups engage with each other.

• Position: The dataset includes precise data on the location and orientation of subjects and objects, enhancing the analysis of their spatial relationships during activities.

• Appearance: Visual traits of subjects and objects are meticulously captured, allowing for detailed examinations of their attributes and characteristics.

• Relationship: Understanding the associations and connections between subjects and objects is essential for deciphering the complex interplay within group activities. This aspect provides insight into the underlying dynamics of relationships within the scenes.

• Situation: To provide environmental context, the GASG dataset offers descriptors highlighting the contextual information surrounding subjects and objects, enabling researchers to consider the broader setting in their analyses.

These five key aspects (interaction, position, appearance, relationship, and situation) form the backbone of the dataset's annotation structure, providing a holistic view of the diverse activities and interactions within the sub-group and overall group scenarios. This level of detail sets GASG apart, making it a valuable resource for research in scene comprehension, action recognition, and group activity analysis. The annotation process is detailed in Appendix B.

Dataset Statistics
The accompanying pie chart in Figure 3b delves into the complexity of attributes in our dataset, revealing that "Appearance" accounts for 33% of the total annotations (the highest), "Relationship" for 28%, "Interaction" for 17%, "Position" for 12%, and "Situation" for 10% (the least). We explore the distribution of social activity labels in Figure 3a, focusing on the sizes of social groups. The chart in the figure provides a nuanced view of social group sizes in the dataset. Specifically, 75.5%, 16.6%, 5%, and 1.2% of social groups consist of one, two, three, and four members, respectively. Interestingly, only 1% of the dataset includes groups with five or more members, with the maximum observed group size being 29 members.

Methodology
In this section, we present the methodology of our proposed HAtt-Flow approach for robust group activity recognition. We introduce three key modules, overviewed below: input preparation, hierarchical awareness induction, and the feature flow-attention mechanism.

Input preparation: We prepare input node and edge embeddings for the graph transformer layer. This module uniquely incorporates both textual and visual features, facilitating a holistic representation of the underlying data. The novel aspect here is the integration of both modalities into a unified representation, enabling the model to leverage complementary information from both sources for improved recognition accuracy.

Hierarchical awareness induction:
We propose enriching the vision and language branches through a novel hierarchy-aware attention mechanism. This module introduces hierarchical aggregation priors to guide the model in capturing complex relationships within the data. The novelty lies in integrating hierarchical information into the attention mechanism, allowing the model to capture multi-level dependencies and semantic hierarchies within group activities.
Feature flow-attention mechanism: Inspired by flow network theory, we introduce a feature flow-attention mechanism to prevent the generation of trivial attention. This module incorporates competition and allocation principles to enhance the model's ability to capture relevant features within group activities. The innovation here is the introduction of a flow-based attention mechanism, which enables the model to dynamically allocate attention based on feature importance, leading to more robust and interpretable representations of group activities.
Additionally, we present our training loss formulation tailored for the HAtt-Flow architecture, which encourages the joint learning of textual and visual features for improved group activity recognition.
We utilize pre-trained visual and textual backbones to extract the corresponding subject features in the video, v, and textual features, t. In the input preparation module, nodes are denoted h and edges e; however, in Figure 4 we denote visual nodes as v, textual nodes as t, and textual edges as t_e. We use the graph transformer layer and the graph transformer layer with edge features to extract the corresponding feature representations. The former is tailored to graphs lacking explicit edge attributes, while the latter incorporates a dedicated edge feature pipeline to integrate available edge information, maintaining abstract representations at each layer. We now detail each module in the subsequent subsections.

Input Preparation
Initially, we prepare input node and edge embeddings for the graph transformer layer. In the context of our model, text features are employed to generate both nodes and edges, whereas vision features are exclusively utilized for generating nodes. Consider a graph, G, with node features represented as text features, α_i ∈ R^{d_n×1}, for each node, i, and edge features, also derived from text, denoted as β_ij ∈ R^{d_e×1} for edges between nodes i and j. The input node features, α_i, and edge features, β_ij, undergo a linear projection to be embedded into d-dimensional hidden features, h^0_i and e^0_ij.
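The projection equations appear to have been lost during extraction; a plausible reconstruction, consistent with the parameter shapes stated in the following sentence, is:

```latex
h_i^0 = A^0 \alpha_i + a^0, \qquad e_{ij}^0 = B^0 \beta_{ij} + b^0
```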
Here, A^0 ∈ R^{d×d_n}, B^0 ∈ R^{d×d_e}, and a^0, b^0 ∈ R^d are parameters of the linear projection layers. The pre-computed node positional encodings of dimension k are linearly projected and added to the node features, yielding ĥ^0_i.
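The positional-encoding update itself is missing from the extracted text; it presumably takes the standard additive form (writing λ_i ∈ R^k for the pre-computed positional encoding of node i, a notation introduced here):

```latex
\hat{h}_i^0 = h_i^0 + C^0 \lambda_i + c^0
```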
Here, C^0 ∈ R^{d×k} and c^0 ∈ R^d. Notably, positional encodings are only added to the node features at the input layer and not during intermediate graph transformer layers. Detailed information about the graph transformer layers is presented in Appendix C.

Hierarchical Awareness Induction
We propose enriching the vision and language branches through a hierarchy-aware attention mechanism. In line with the conventional transformer architecture, we divide modality inputs into low-level video patches and text tokens. These are recursively merged based on semantic and spatial similarities, gradually forming more semantically concentrated clusters, such as video objects and text phrases. We define hierarchy aggregation priors with the following aspects:

Tendency to merge. Patches and tokens are recursively merged into higher-level clusters that are spatially and semantically similar. If two nearby video patches share similar appearances, merging them is a natural step to convey the same semantic information.
Non-splittable. Once patches or tokens are merged, they will never be split in later layers. This constraint ensures that hierarchical information aggregation never degrades, preserving the complete process of hierarchy evolution layer by layer.
We incorporate these hierarchy aggregation priors into an attention mask, C, serving as an extra inductive bias that helps the conventional attention mechanism in transformers better explore hierarchical structures adapted to each modality format: a 2D grid for videos and a 1D sequence for text. Note that C is shared among all heads and progressively updated bottom-up across transformer layers. We elaborate on the formulations of the hierarchy-aware mask, C, for each modality below.
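The defining equation of the hierarchy-aware attention appears to have been dropped in extraction. One plausible form, following the Tree-Transformer convention of modulating the attention scores elementwise with the shared mask C (an assumption, not necessarily the authors' exact formulation), is:

```latex
\mathrm{HAtt}(Q, K, V) = \left( C \odot \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) \right) V
```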

Hierarchy Induction for Language Branch
In this section, we reconsider the tree-transformer method from the perspective of the proposed hierarchy-aware attention, explaining how to impose hierarchy aggregation priors on C in three steps.
Generate neighboring attention scores. The merging tendency of adjacent word tokens is described through neighboring attention scores. Two learnable query and key matrices, W′_Q and W′_K, project any pair of adjacent word tokens, (t_i, t_{i+1}). The neighboring attention score, s_{i,i+1}, is defined as their scaled inner product, where σ_t is a hyperparameter controlling the scale of the generated scores. A softmax for each token, t_i, is employed to normalize its merging tendency with its two neighbors. For neighbor pairs (t_i, t_{i+1}), the neighboring affinity score â_{i,i+1} is the geometric mean of p_{i,i+1} and p_{i+1,i}: â_{i,i+1} = √(p_{i,i+1} · p_{i+1,i}). From a graph perspective, it describes the strength of edge e_{i,i+1} by comparing it with edges e_{i−1,i} (p_{i,i+1} vs. p_{i,i−1}) and e_{i+1,i+2} (p_{i+1,i} vs. p_{i+1,i+2}).
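The score and normalization equations referenced in this step seem to be missing from the text; a plausible reconstruction consistent with the surrounding definitions is:

```latex
s_{i,i+1} = \frac{(W'_Q t_i)^\top (W'_K t_{i+1})}{\sigma_t}, \qquad
p_{i,i+1} = \frac{\exp(s_{i,i+1})}{\exp(s_{i,i-1}) + \exp(s_{i,i+1})}, \qquad
\hat{a}_{i,i+1} = \sqrt{p_{i,i+1}\, p_{i+1,i}}
```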
Enforcing the non-splittable property. A higher neighboring affinity score indicates that two neighbor tokens are more closely bonded. To ensure that merged tokens will not be split, the layer-wise affinity scores, a^l_{i,i+1}, should increase as the network goes deeper, i.e., a^l_{i,i+1} ≥ a^{l−1}_{i,i+1} for all l. This helps to gradually generate the desired hierarchy structure. We formulate the analogous hierarchy induction for the visual branch in Appendix D.
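The update rule enforcing this monotonicity is not shown in the extracted text; the Tree-Transformer's layer-wise update, which guarantees a^l ≥ a^{l−1}, is a natural candidate (an assumption):

```latex
a^{l}_{i,i+1} = a^{l-1}_{i,i+1} + \left(1 - a^{l-1}_{i,i+1}\right) \hat{a}^{l}_{i,i+1}
```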

Feature Flow-Attention Mechanism
In the following representation, we use the corresponding nodes, h, from the respective branches of language and visual graph transformers as the queries (Q), keys (K), and values (V).
Inspired by flow network theory, the flow-attention mechanism introduces a competitive mechanism for sources and an allocation mechanism for sinks, preventing the generation of trivial attention.In a flow network framework, attention is viewed as the flow of information from sources to sinks.The results (R) act as endpoints receiving the inbound information flow, and the values (V) serve as sources providing the outgoing information flow.
Flow capacity calculation. For a scenario with n sinks and m sources, the incoming flow, I_i, for the i-th sink and the outgoing flow, O_j, for the j-th source are calculated as follows, where ϕ(·) is a non-negative function.
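The capacity equations appear to have been lost in extraction; a plausible reconstruction from the surrounding definitions (queries as sinks, keys as sources) is:

```latex
I_i = \phi(Q_i)^\top \sum_{j=1}^{m} \phi(K_j), \qquad
O_j = \phi(K_j)^\top \sum_{i=1}^{n} \phi(Q_i)
```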
Flow conservation. We establish the preservation of incoming flow capacity for each sink, maintaining the default value at 1, effectively "locking in" the information forwarded to the next layer. This conservation strategy ensures that the outgoing flow capacities of sources engage in competition, with their collective sum strictly constrained to 1. Likewise, by conserving the outgoing flow capacity for each source at the default value of 1, essentially "fixing" the information acquired from the previous layer, the conservation of incoming and outgoing flow capacities is enforced via normalizing operations. In this context, the ratio denotes element-wise division, with ϕ(K)/O dedicated to source conservation and ϕ(Q)/I assigned to sink conservation. This normalization process ensures the preservation of flow capacity for each source and sink token. The first identity concerns the outgoing flow capacity of the j-th source after the normalization by O; the second corresponds to the incoming flow capacity of the i-th sink after the normalization by ϕ(Q)/I. In both instances, the capacities equal the default value of 1.
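The conservation equations themselves are missing from the extracted text; a plausible reconstruction consistent with the capacity definitions above is:

```latex
\widehat{I}_i = \phi(Q_i)^\top \sum_{j=1}^{m} \frac{\phi(K_j)}{O_j}, \qquad
\widehat{O}_j = \phi(K_j)^\top \sum_{i=1}^{n} \frac{\phi(Q_i)}{I_i},
```
with the per-token capacity checks
```latex
\text{source-}j:\ \sum_{i=1}^{n} \phi(Q_i)^\top \frac{\phi(K_j)}{O_j} = \frac{O_j}{O_j} = 1, \qquad
\text{sink-}i:\ \sum_{j=1}^{m} \frac{\phi(Q_i)^\top}{I_i} \phi(K_j) = \frac{I_i}{I_i} = 1.
```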
The conserved incoming flow, I, and outgoing flow, O, are obtained through these normalizations. Flow-attention mechanism. We introduce the flow-attention mechanism, leveraging the competition induced via incoming flow conservation for sinks. In O, sources compete while maintaining a fixed flow-capacity sum, revealing each source's significance. I represents the sink information when each source's outgoing capacity is 1, reflecting the aggregated information allocated to each sink. The flow-attention proceeds in three stages. In the "Competition" stage, the competed values are determined by applying the Softmax function to O, followed by element-wise multiplication with V. The "Aggregation" step, denoted A, pools these competed values for each sink. Lastly, the "Allocation" phase calculates R by applying the Sigmoid function to I and element-wise multiplying it with A.
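The three-stage computation can be sketched in code. This is a minimal reading of the mechanism described above, not the authors' implementation; in particular, the choice of softplus for the non-negative function ϕ is an assumption:

```python
import numpy as np

def phi(x):
    # Non-negative feature map (softplus), giving flow capacities.
    return np.log1p(np.exp(x))

def _softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def flow_attention(Q, K, V):
    """Sketch of flow-attention: Q (n, d) sink queries, K (m, d) source
    keys, V (m, d_v) source values; returns R (n, d_v)."""
    q, k = phi(Q), phi(K)
    I = q @ k.sum(axis=0)                     # incoming flow per sink, (n,)
    O = k @ q.sum(axis=0)                     # outgoing flow per source, (m,)
    # Conserved flows after normalizing the opposite side.
    I_hat = q @ (k / O[:, None]).sum(axis=0)  # conserved incoming, (n,)
    O_hat = k @ (q / I[:, None]).sum(axis=0)  # conserved outgoing, (m,)
    V_hat = _softmax(O_hat)[:, None] * V      # competition among sources
    A = (q / I_hat[:, None]) @ (k.T @ V_hat)  # aggregation per sink
    return _sigmoid(I_hat)[:, None] * A       # allocation to sinks
```

Because the softmax over O_hat forces sources to share a fixed capacity budget, no source can receive uniformly trivial attention, which is the competition property the text emphasizes.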

Training Loss
To adapt the contrastive pretraining objective for video and text features in the HAtt-Flow architecture, the objective function can be expressed as follows: Here, v and u represent the video and text feature vectors, τ is the learnable temperature parameter, and M is the total number of video-text pairs, i.e., the total number of labels.
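The objective equation itself is missing from the extracted text. It is presumably the standard symmetric CLIP-style contrastive loss; a plausible reconstruction using the symbols defined above (with sim denoting cosine similarity, an assumption) is:

```latex
\mathcal{L} = -\frac{1}{2M} \sum_{i=1}^{M} \left[
\log \frac{\exp(\mathrm{sim}(v_i, u_i)/\tau)}{\sum_{j=1}^{M} \exp(\mathrm{sim}(v_i, u_j)/\tau)}
+ \log \frac{\exp(\mathrm{sim}(u_i, v_i)/\tau)}{\sum_{j=1}^{M} \exp(\mathrm{sim}(u_i, v_j)/\tau)}
\right]
```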

Experiment Settings
Dataset details. Our dataset adopts a division strategy from JRDB [81], where videos are segregated at the sequence level, ensuring the entirety of a video sequence is allocated to a specific split. The 54 video sequences are distributed, with 20 for training, 7 for validation, and 27 for testing. To align with the evaluation practices of analogous datasets, our evaluation is centered on keyframes sampled at one-second intervals, resulting in 1419 training samples, 404 validation samples, and 1802 test samples.
Implementation details.Our framework, implemented in PyTorch, undergoes training on a machine featuring four NVIDIA Quadro RTX 6000 GPUs.During training, we adopt a batch size of 2 and leverage the Adam Optimizer, commencing the training process with an initial learning rate set at 0.0001.
Evaluation metrics. We evaluate the model on two tasks: (1) predicate classification (PredCls) and (2) video scene graph generation (VSGG). The VSGG task aims to generate descriptive triplets for an input video. Each triplet consists of a relation, r_i, occurring between time points t_1 and t_2, connecting a subject, o_s (class category), with mask tube m_s^(t_1,t_2) and an object, o_o, with mask tube m_o^(t_1,t_2). Evaluation metrics for PredCls and VSGG adhere to scene graph generation (SGG) standards, utilizing Recall@K (R@K) and mean Recall@K (mR@K). Successful recall for a ground-truth triplet requires accurate category labels and IOU volumes between predicted and ground-truth mask tubes above 0.5. A soft recall is recorded when these criteria are met, considering the time IOU between predicted and ground-truth intervals.
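As an illustration of this matching criterion, a minimal sketch of the volume IoU between mask tubes and a standard Recall@K computation follows (the helper names are hypothetical; the paper's exact evaluation code is not given):

```python
import numpy as np

def tube_iou(pred, gt):
    """Volume IoU between two binary mask tubes of shape (T, H, W)."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 0.0

def recall_at_k(predictions, num_gt, k):
    """predictions: list of (confidence, matched_gt) pairs, where matched_gt
    is True when the triplet's labels are correct and tube IoU > 0.5.
    Returns the fraction of ground-truth triplets recovered in the top K."""
    top = sorted(predictions, key=lambda p: -p[0])[:k]
    hits = sum(1 for _, matched in top if matched)
    return hits / num_gt
```

A matched prediction would additionally be down-weighted by the temporal IoU of its interval to obtain the soft recall described above.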

Comparison with the State of the Art
We present our comparisons with state-of-the-art (SOTA) methods in Tables 2 and 3 for our dataset and the PSG dataset. In direct comparison with the methods above, the HAtt-Flow model exhibits a notable performance advantage, establishing itself as the current state of the art. This superiority is attributed to its proficiency in capturing intricate social activities among subjects across spatial and temporal dimensions. On the GASG dataset, our proposed method outperforms existing SGG methods by a significant margin on all metrics except for the R/mR@20 of the VSGG task. On the PSG dataset, it is evident that the proposed method dominated the other methods to demonstrate state-of-the-art performance.

Ablation Study
Flow-attention direction. We introduced a novel flow-attention mechanism between the hierarchical transformers handling text and vision. To explore the impact of the flow direction between these networks, we conducted experiments as detailed in Table 4. Our findings validate that optimal results are achieved when attention flows from the text to the vision transformer. Conversely, performance declines in the opposite direction, and notably when no attention flows. This observation suggests that the cross-attention mechanism enhances the model's contextual learning capacity primarily when the flow is from text to vision, because text often provides high-level semantic information and context that can guide the understanding of visual content. However, it is worth noting that further investigation is warranted to fully understand the reasons behind the decline in performance when attention flows in the opposite direction. Potential improvements could involve refining the interaction between the text and vision transformers to better leverage complementary information from both modalities.
Importance of hierarchy awareness. We incorporated hierarchy awareness into the transformer framework to bolster the model's scene graph generation capabilities. The experimental results, detailed in Table 5, affirm that hierarchical awareness optimally enhances scene graph generation. Conversely, performance declines in the absence of this design. This is likely because, when the model is aware of the hierarchy during the generation of video scene graphs, it accurately predicts all relevant nodes and their relationships (edges). However, future work could explore alternative methods for incorporating hierarchy awareness to further improve performance, such as fine-tuning the hierarchical structure or exploring different aggregation techniques.
Attribute analysis. In Table 6, we evaluate the impact of the dataset's different attributes on the model's performance in scene graph generation. Our findings affirm that including all attributes results in optimal scene graph generation. Conversely, the performance declines when we consider individual attributes one by one. We can clearly observe that performance is proportional to the attribute distribution shown in Figure 3b, i.e., the larger an attribute's share of annotations, the better the performance. This underscores the significance of leveraging all attributes, indicating that they collectively enhance the model's capacity to grasp intricate contexts, enabling accurate scene graph generation. However, further investigation into the interplay between different attributes and their impact on performance could provide valuable insights for refining the model architecture and training process.

Qualitative Analysis
To gain deeper insights into the performance of HAtt-Flow, we employed visualization to illustrate its scene graph generation predictions on our dataset. As depicted in Figure 5, our model substantially improves overall scene graph generation compared to PSGFormer. The effectiveness of our hierarchy-aware attention-flow mechanism contributes significantly to this enhancement, providing our model with superior context modeling capabilities for visual representations guided by textual inputs. We can observe that PSGFormer [80] could only detect the subjects, but not the accurate groups and their interactions. In contrast, HAtt-Flow is accurate in both graph generation and overall group activity prediction.

Conclusions
In this work, we introduced a pioneering dataset designed with nuanced attributes, specifically tailored to enhance the scene graph generation task within the context of group activities. Our contributions extend to advancing predictive video scene understanding, propelled by the introduction of flow-attention and a paradigm shift in attention mechanisms through hierarchy awareness. Via rigorous experimentation, we demonstrated the efficacy of our approach, showcasing significant improvements over previously existing methods.

Limitations
While the novel flow-attention mechanism introduced in this work draws inspiration from flow network theory and offers promising advancements, several limitations warrant consideration. Primarily, the implementation of flow-attention imposes heightened computational demands, which may pose challenges for deployment in resource-constrained settings. This computational complexity underscores the need for efficient algorithms and hardware acceleration techniques to enable practical use in real-world applications. Furthermore, the performance of the flow-attention model is greatly influenced by the quality and quantity of available training data. Variations in data quality or insufficient sample sizes can impact the model's robustness and generalization capabilities, highlighting the importance of extensive and diverse datasets for achieving optimal results. Addressing these computational and data-related limitations is crucial to maximizing the practicality and effectiveness of the proposed flow-attention mechanism. Future research endeavors should focus on developing strategies to mitigate computational demands while maintaining performance levels and exploring methods for enhancing the robustness of the model to variations in training data. By overcoming these challenges, the flow-attention mechanism can realize its full potential as a valuable tool in a wide range of real-world applications.

Figure 1.
Figure 1. Comparison of HAtt-Flow results with other scene graph generation methods. Best viewed in color and zoomed in.

Figure 2.
Figure 2. A sample video from our group activity scene graph (GASG) dataset. The top row displays keyframes featuring overlaid bounding boxes, each annotated with a unique ID for consistency. Below, the timeline tubes provide a comprehensive temporal representation of scene graph annotations for distinct attributes, including appearance, interaction, position, relationship, and situation. These annotations offer nuanced details, enhancing scene understanding and contributing to a more refined video content analysis. Best viewed in color and zoomed in.

Figure 3.
Figure 3. Statistics of the GASG dataset: number of social groups and attributes in the dataset. Best viewed in color and zoomed in.

Figure 4.
Figure 4. Overall architecture of the proposed HAtt-Flow network. The extracted visual and textual features are passed through their respective graph transformers to obtain corresponding node features. These nodes are passed through the hierarchy-aware transformer encoder models to obtain enriched features, including a feature flow-attention mechanism to enhance cross-modality learning. Finally, we use the CLIP loss to optimize the learned features. Please refer to Figure 1 for the details of levels L_0, L_1, L_2, and L_3.

Figure 5.
Figure 5. The visualization of the scene graphs generated via PSGFormer [80] and our approach. We can observe that [80] could only detect the subjects, but not the accurate groups and their interactions. In contrast, HAtt-Flow is accurate in graph generation and overall group activity prediction. Best viewed in color and zoomed in.

Table 1.
Comparison of existing datasets. GA is the group activity label; H-H, H-O, and O-O represent interactions between human and human, human and object, and object and object.

Table 2.
Comparison with SOTA methods on the GASG dataset. Bold numbers indicate the best results.

Table 3.
Comparison with SOTA methods on the PSG dataset. Bold numbers indicate the best results.

Table 4.
Ablation study for flow-attention direction. Bold numbers indicate the best results.

Table 5.
Ablation study for hierarchy awareness. Bold numbers indicate the best results.