SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering

Most TextVQA approaches integrate objects, scene texts and question words with a simple transformer encoder, which fails to capture the semantic relations between different modalities. This paper proposes a Scene Graph-based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens and the question words. This is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image. We create a guided-attention module to capture the intra-modal interplay between the language and the vision as guidance for inter-modal interactions. To explicitly teach the relations between the two modalities, we propose and integrate two attention modules, namely a scene graph-based semantic relation-aware attention and a positional relation-aware attention. We conduct extensive experiments on two benchmark datasets, Text-VQA and ST-VQA. The results show that SceneGATE outperforms existing methods because of the scene graph and its attention modules.


Introduction
Significant progress for multi-modal tasks that demand the simultaneous processing of both images and texts has been made in the past few years, and Visual Question Answering (VQA) is one of the prominent multi-modal tasks that requires answering natural language questions by inferring from the content of the given images. However, the nature of the questions and images of many existing VQA datasets is deficient in training the model to build a comprehensive understanding of human everyday scenes. For example, the collected photo-realistic images of many conventional VQA datasets exclude the texts that commonly appear in daily-life scenes, and the questions are merely designed to examine the recognition of objects and their attributes, such as colors and sizes. To overcome this limitation of existing conventional VQA datasets and to train models with better understandings of texts in realistic scenes through question answering, a new variant of VQA tasks, TextVQA [1], was recently proposed.
Images in TextVQA tasks are collected from realistic scenes that contain various formats of texts, e.g., brand names and price tags, and the questions are specifically designed to be answered by referring to the textual information in the images. Hence, in addition to recognizing objects as in the conventional VQA task, TextVQA models must also recognize the texts that are associated with the objects and capture these textual features from the images. Most current TextVQA models rely on the Optical Character Recognition (OCR) technique to directly extract the textual characters from images as OCR tokens and then integrate these OCR token features with image and question features for answer prediction, as shown in [1][2][3]. However, the use of OCR tokens as an additional sequence of inputs to image object features and question word features can hardly reveal or capture the relations between the texts and their related objects in images. Such relations are significant in answering questions that show the explicit positional or semantic relationships between the objects and the textual characters, such as the relation "on" between UMD and uniforms in the question "What university is represented on these uniforms?" in Figure 1.
To capture the relatedness between objects and OCR tokens, some recent works [4][5][6][7] proposed to implicitly represent the relationships between objects and OCR tokens through their absolute locations. However, assuming such positional proximity as "relatedness" is not reliable and could be ambiguous, because irrelevant objects that are in different categories might be located in similar positions around one OCR token. To overcome this problem, we propose an explicit representation of the relationships between the OCR tokens and their associated objects in an image with the help of a scene graph. A scene graph [8] is a graph structure that annotates the attributes of objects and the relationships between objects in an image. In this work, we propose a novel scene graph structure, specifically for TextVQA tasks, by assigning the OCR tokens as the attributes of objects to represent the affiliations of OCR tokens with their related objects. The scene graph embedding is encoded via the semantic embeddings of the objects and OCR tokens of the scene graph; thus, compared to previous works that only considered the visual features of objects, such scene graph embedding captures the relationships between objects and OCR tokens from the semantic aspect. A semantic relation-aware attention module is also applied to obtain the ultimate scene graph embedding that encodes the different semantic relationships between the objects and the OCR tokens.
Another problem that this work tries to solve is to achieve stronger interactions between the multi-modal inputs for TextVQA models and thus generate a better answer representation for the final answer prediction. Most of the previous works [2,7] directly input different modalities (i.e., image, question and OCR token features) into a multi-modal transformer; although some works [3,6] made additional interactions for some modalities beforehand, these interactions were rather weak because there was a lack of simultaneous self-attention learning within each modality. Thus, we propose a Scene Graph-Based Co-Attention Network (SceneGATE) that includes a co-attention module consisting of two attention units: a self-attention unit that boosts the intra-interactions for each modality and a guided-attention unit that uses the question features to guide the attention learning for the image and OCR token features. A positional relation-aware attention module is applied to the integrated visual features of objects and OCR tokens. Such integration enables the model to additionally learn relationships between the OCR tokens and their related objects from the positional level. The SceneGATE network operates on the two branches of scene graph features and the visual-level integrated multi-modal features in parallel, such that the relationships between objects and OCR tokens from both the semantic level and positional level are highlighted. The overall architecture of SceneGATE can be found in Figure 1. In summary, the main contributions of our work are as follows.

• To the best of our knowledge, this is the first attempt to apply a scene graph as an image representation in TextVQA. We introduce a novel scene graph generation framework for the TextVQA environment.

• We propose and integrate scene graph-based semantic relation-aware attention with positional relation-aware attention to establish a complete interaction between each question word and visual feature.

TextVQA
Look, Read, Reason and Answer (LoRRA) [1] was the first baseline model for TextVQA tasks. It simply encodes the OCR tokens by FastText embeddings and enables the use of any type of contextual attention mechanism to integrate the questions, images and OCR token features. Further works have been proposed to solve the TextVQA problem from various aspects; for example, ref. [2] proposed Multi-Modal Multi-Copy Mesh (M4C) with an enriched OCR representation to capture more properties of the OCR tokens. Such enriched OCR token features are then projected with the object regions and question features into the same joint embedding space through the multi-modal transformer encoder. As an extension of M4C, ref. [6] captured the relationships between the object regions and their related OCR tokens by encoding the OCR-related object features through softmax attention, which are the result of learning the locations of corresponding OCR tokens and the object regions. Others have proposed graph structures to better represent and encode the relationships between OCR tokens and the related object regions. For example, ref. [4] built three different graphs to represent the visual, semantic and numeric information of OCR tokens and object regions, where the nodes of the graphs are updated from locationally related nodes. Ref. [5] proposed a spatially-aware graph attention module to emphasize the importance of different types of relative spatial relationships between OCR tokens and objects. Similarly, ref. [7] also utilized the graph attention module but this was conditioned on the relationships between the OCR tokens and the objects that were revealed from the question structural patterns. Ref. [9] proposed to align the related OCR tokens with the question by using a two-stage module that aligns the question with images via the pre-trained visual grounding task and then aligns the question and OCR tokens through object labels. Different from using an off-the-shelf OCR module, as in all the previous works, a recent study [10] integrated the training of the OCR technique into the flow of an end-to-end TextVQA model to mitigate the influence of poor OCR accuracy on the final answer prediction. A comparison of the TextVQA models can be found in Table 1.

Conventional VQA
VQA [11] is a multi-modal task that requires answering natural language questions by looking at given images. The answer representation is generated from inputs that differ in nature: images formed by pixels and questions formed by semantic words. To align the image features and question features into the same joint embedding space for answer prediction, previous VQA works mainly adopt two different methods: fusion techniques [12-16] and attention mechanisms. The attention mechanism in VQA tasks ranges from the basic vanilla attention [17,18] to the recent, commonly used co-attention mechanism [19][20][21][22]. Such a co-attention mechanism aims to obtain a stronger interaction between different modalities by using the attention weights learned from individual modalities to guide the attention of each other. In this work, we use the guided-attention module for stronger intra-modal integration for OCR tokens, images and question features.
The scene graph has been applied in various visual language tasks, including image captioning [23,24], text to image generation [25] and image-text retrieval [26]. Recently, scene graphs have also been used in VQA. For example, ref. [27] processed the scene graphs and image features simultaneously through two parallel branches of recurrent memory networks to improve the model's reasoning ability over objects' relationships.
Ref. [28] proposed to use a probabilistic scene graph of images as the state machine, where questions were transformed into instructions to perform the reasoning process. Ref. [29] claimed that only partial scene graphs are effective for answer prediction and proposed a selective system to choose the most important path in a scene graph and the most probable destination node on the graph to predict the answers. Ref. [30] used a graph attention network to encode scene graph embeddings to leverage the relatedness between different objects. Ref. [31] applied the pre-training pipeline for Visual Commonsense Reasoning (VCR) tasks by incorporating object-based scene graphs in transformer layers to focus on semantically adjacent object nodes within multiple hops, regardless of relationship types. Nevertheless, our work is the first to apply a scene graph in TextVQA tasks and we propose a novel TextVQA-based scene graph structure to explicitly represent the affiliations between objects and OCR tokens.

SceneGATE-Input Representations
In this work, we propose a Scene Graph-Based Co-Attention Network for TextVQA tasks; the architecture is shown in Figure 1. We first describe the input representations, the scene graph generation and the scene graph encoding methods in Section 3. We then explain the co-attention networks for multi-modality semantic and positional relation integration in Section 4. Table 2 defines all the mathematical symbols in our description.
[Table 2. Mathematical symbols (partially recovered): self-attended visual object features, OCR features and decoder hidden states; decoder answer token at, before and after time step t, respectively.]

Input Representations
Given question words $w_1, \ldots, w_t$, we encode each word into a $d$-dimensional Bidirectional Encoder Representations from Transformers (BERT) embedding [32]. The weights are then fine-tuned during training. For objects in each image, we obtain the appearance features for a maximum of 100 objects via the Faster-RCNN model pre-trained on the Visual Genome dataset [33]. We concatenate the appearance features of each object with its corresponding bounding box coordinates to represent each object region. For the OCR tokens, we extract their appearance features from the images using the same pre-trained Faster-RCNN model. Each OCR token (Google Cloud OCR Extractor: https://cloud.google.com/products/ai/, accessed on 30 June 2023) is also encoded by 300-dimensional pre-trained FastText embeddings (FastText embedding pre-trained with subword information on Wikipedia 2017, UMBC WebBase corpus and statmt.org news dataset: https://fasttext.cc/docs/en/english-vectors.html, accessed on 30 June 2023). Following [2], we concatenate the appearance features, FastText embeddings, Pyramidal Histogram of Characters (PHOC) [34] features and the bounding box coordinates of each OCR token to obtain the $d'$-dimensional enriched OCR embedding.

Scene Graph Construction
A scene graph is a graph structure $SG = (V, E)$ that denotes the relationships between objects as well as the associated attributes of each object for an image, where objects $sg_o \in O$, attributes $sg_a \in A$ and relationships $sg_r \in R$ are set as nodes of the graph. In this work, we construct a novel scene graph structure that is specific for TextVQA tasks to represent the affiliations between OCR tokens and the associated object regions.
For each object in an image, we compare its bounding box coordinates with the other objects in the image. We have defined 11 different relation types: inside, surrounding, to the right of, to the left of, under, above, top right, bottom right, top left, bottom left and overlap. (The semantic relation types can be defined in various ways if they can be represented in different semantic categories. This will be encoded in the categorical embedding in Section 3.3.) We also compare the bounding box coordinates of each OCR token with all the objects in the image and assign this OCR token as the attribute of the object whose bounding box surrounds the OCR token's bounding box with the highest intersection over union (IoU) score. Finally, we obtain the triplet of $(sg_{o_i}, sg_r, sg_{o_j})$ for every two objects and the pair of $(sg_{o_i}, sg_{a_i})$ for each object that has attributes. We use bi-directional edges in the scene graph.
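The OCR-to-object assignment described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the `(x1, y1, x2, y2)` box format, the function names and the choice to leave an OCR token unassigned when no object box surrounds it are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def surrounds(outer, inner):
    """True if box `outer` fully contains box `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def assign_ocr_to_objects(object_boxes, ocr_boxes):
    """Map each OCR index to the surrounding object with the highest IoU.

    OCR tokens that no object box surrounds are left out of the result
    (an assumption; the text does not say how such tokens are handled).
    """
    assignment = {}
    for t, ocr in enumerate(ocr_boxes):
        best, best_iou = None, 0.0
        for o, obj in enumerate(object_boxes):
            if surrounds(obj, ocr):
                score = iou(obj, ocr)
                if score > best_iou:
                    best, best_iou = o, score
        if best is not None:
            assignment[t] = best
    return assignment


objects = [(0, 0, 100, 100), (10, 10, 50, 50)]
ocr_tokens = [(20, 20, 30, 30), (200, 200, 210, 210)]
print(assign_ocr_to_objects(objects, ocr_tokens))
```

In this toy input, both objects surround the first OCR token, but the smaller object has the higher IoU with it and therefore receives the token as its attribute.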

Scene Graph Embedding
We use different methods to encode the scene graph based on whether the (scene graph-based) semantic relation-aware (SRA) attention is applied or not. We describe the methods in detail in this section. We report our findings from the ablation studies in Section 6.2, where the SRA attention is not applied.
When SRA attention is applied, we initialize the node features of each object and its attributes with a 300-dimensional embedding. They are stacked together as the node embedding matrix $\mathrm{Matrix}_{P \times 300}$, where $P$ is the total number of objects and attribute nodes of each scene graph. We then add an extra relationship type, self, to the 11 pre-defined relationship types as mentioned in Section 3.2 in order to denote the relationship of each node with itself. In addition to the triplet of $(sg_{o_i}, sg_r, sg_{o_j})$ denoting the relationships between every two objects, we further add the relationships inside and surrounding between objects and their attributes to explicitly show and encode the existing semantic relationships between the objects and their associated OCR tokens. Hence, the object-attribute pair $(sg_{o_i}, sg_{a_i})$ now becomes a triplet of $(sg_{o_i}, sg_r, sg_{a_i})$, where $sg_r$ = surrounding, and $(sg_{a_i}, sg_r, sg_{o_j})$ with $j = i$, i.e., $(sg_{a_i}, sg_r, sg_{o_i})$, where $sg_r$ = inside. The 12 relationship types are then converted into numeric labels of 1 to 12 to build the adjacency matrix that covers all the object nodes and attribute nodes of a scene graph.
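As a rough illustration of the labeled adjacency matrix, the sketch below maps each relationship type to an integer label and fills a $P \times P$ matrix from the triplets. The specific label order and the use of 0 for "no edge" are assumptions; the text only states that the 12 types are converted into labels 1 to 12.

```python
# Hypothetical label order; the paper only states that labels run from 1 to 12.
RELATIONS = [
    "inside", "surrounding", "to the right of", "to the left of",
    "under", "above", "top right", "bottom right",
    "top left", "bottom left", "overlap", "self",
]
REL_ID = {name: i + 1 for i, name in enumerate(RELATIONS)}


def build_adjacency(num_nodes, triplets):
    """Build a num_nodes x num_nodes matrix of relation labels.

    `triplets` holds (subject_index, relation_name, object_index) entries.
    0 marks "no edge" (an assumption); the diagonal carries the `self` label.
    """
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        adj[i][i] = REL_ID["self"]
    for subj, rel, obj in triplets:
        adj[subj][obj] = REL_ID[rel]
    return adj


# Node 0: an object (e.g., a uniform); node 1: its OCR attribute (e.g., "UMD").
triplets = [(0, "surrounding", 1), (1, "inside", 0)]
adjacency = build_adjacency(2, triplets)
print(adjacency)
```

Both directions of the object-attribute edge are stored, which mirrors the bi-directional edges and the surrounding/inside triplet pair described in the text.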
When SRA attention is not applied, we encode each scene graph according to [27]. We first initialize the node embedding for all objects, relationship and attribute nodes in 300 dimensions, and then update each object node embedding with its associated attribute nodes' embeddings and all the relationship nodes' embeddings as well as their associated subject node embeddings. Specifically, for each $sg_{o_i}$, we update its embedding to a 900-dimensional embedding by concatenating two additional embeddings: (1) the average of all the related relationship embeddings, where each relationship embedding is the average of the embeddings of the relationship node and the subject node in the triplet of $(sg_{o_i}, sg_r, sg_{o_j})$ for $sg_{o_i}$; (2) the average of all the embeddings of the associated attribute nodes that are connected to $sg_{o_i}$. The updated object node embeddings are stacked into a matrix $\mathrm{Matrix}_{N \times 900}$ as the scene graph representations for each scene graph, where $N$ is the number of total object nodes in each scene graph. We also propose two approaches to initialize the node features: pre-trained word embedding and GCN-based embedding, as illustrated in Sections 3.3.1 and 3.3.2. The performance for these two approaches is compared in Section 6.2.
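The concatenation that produces the 900-dimensional object embedding can be sketched as follows. The example uses 2-dimensional toy vectors instead of 300-dimensional ones, and the zero-vector fallback for an object with no relationships or attributes is an assumption that the text does not address.

```python
def mean(vectors, dim):
    """Element-wise mean of a list of vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def object_embedding(obj_vec, relation_pairs, attribute_vecs):
    """Concatenate [object | mean relationship | mean attribute] embeddings.

    `relation_pairs` holds (relationship_vec, subject_vec) for every triplet
    the object takes part in; each pair is averaged first, then all pairs
    are averaged together, as described in the text.
    """
    dim = len(obj_vec)
    rel_vecs = [mean([rel, subj], dim) for rel, subj in relation_pairs]
    return list(obj_vec) + mean(rel_vecs, dim) + mean(attribute_vecs, dim)


obj = [1.0, 2.0]
relations = [([0.0, 2.0], [2.0, 0.0]), ([4.0, 4.0], [0.0, 0.0])]
attributes = [[3.0, 5.0]]
embedding = object_embedding(obj, relations, attributes)
print(embedding)
```

With 300-dimensional inputs the same function would return a vector of length 900, one row of the $N \times 900$ matrix.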

Pre-Trained Word Embedding
Each node is initialized by either a 300-dimensional pre-trained GloVe embedding (GloVe embedding pre-trained on the Wikipedia and Gigaword5 corpus: https://nlp.stanford.edu/projects/glove/, accessed on 30 June 2023) or FastText embedding. For words with multiple tokens, we take the average of each token's embedding as the node embedding.

GCN-Based Embedding
Graph convolutional networks (GCN) take the node embedding matrix and the adjacency matrix as inputs. They are propagated over all nodes and result in a matrix with the updated node features. We construct one graph based on all the unique categories of objects, relationships and attributes nodes across all the scene graphs of all the images in the dataset, and we propagate them over a 2-layer GCN for the updated node representations following Equation (1).
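Equation (1) is not reproduced in this text. Assuming it follows the standard GCN propagation rule of Kipf and Welling, which is consistent with the symbols $H^{(l)}$, $\tilde{A}$ and $D$ defined in the next paragraph, it would take the form below. The weight matrix $W^{(l)}$ and the activation $\sigma$ are not named in the surrounding text and are part of this assumption.

```latex
H^{(l+1)} = \sigma\left( D^{-\frac{1}{2}} \tilde{A} D^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)
```

Under this reading, a 2-layer GCN applies the rule twice, starting from $H^{(0)} = X_{GCN}$.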
$H^{(l)}$ at layer $l = 0$ is the input node feature matrix $X_{GCN} \in \mathbb{R}^{M \times M}$, where each node is represented by a one-hot encoding and $M$ is the total number of nodes. Based on the triplets of $(sg_{o_i}, sg_r, sg_{o_j})$ and the pairs of $(sg_{o_i}, sg_a)$ in the scene graphs, the objects are connected with the related relations and the associated attributes in the adjacency matrix $\tilde{A}$. We assign a weight of 1 to all the connected edges and 0 to non-edges in $\tilde{A}$. $D$ is the degree matrix computed based on $\tilde{A}$ such that $D_{ii} = \sum_j \tilde{A}_{ij}$. We train two different GCNs in terms of different node labels. In the object-GCN, we manually categorize all the object types into 60 super-classes based on the hypernym of the synset (synonym set) of each object token and use these 60 super-classes as the labels of each object node during GCN training. In the attribute-GCN, we label each attribute node with the attribute tokens' named entity recognition (NER) types. (We used the Google Cloud NLP API to identify the entity type: https://cloud.google.com/natural-language/docs/analyzing-entities, accessed on 30 June 2023.) We [...] $\mathrm{Pooling}_{\min} = \min(H$ [...]. The node features of $X'_{GCN}$ are then used as the initial node representations for object, relationship and attribute to generate the scene graph embedding of each scene graph.

SceneGATE-Co-Attention Networks
For multi-modality integration, we apply a guided attention module over the inputs and introduce two parallel branches of scene graph-based semantic relation-aware attention and positional relation-aware attention layers.

Self-Attention Module
The self-attention (SA) module consists of a multi-head attention layer and a feedforward layer with ReLU activation and dropout [35]. The input matrix $X$ is transformed by learnable weights into three matrices of the same dimensions, i.e., query, key and value. These three matrices are then fed into a multi-head attention layer for the calculation of scaled dot-product attention. We apply two separate SA modules to our two sets of inputs $X$: (1) the question features, to obtain the self-attended question representations $T$; and (2) the combination of object appearance features, the enriched OCR token features as described in Section 3.1 and the answer token features from the decoder, to obtain the attended visual-level object-OCR features $V$ and decoder hidden states.

Guided Attention Module
The guided attention (GA) module shares the same structure and hyperparameters as the SA module, but the inputs of the multi-head attention layer are the feature matrix $X$ and the transformed key and value matrices of another feature matrix $Y$. In the GA module, we use the self-attended question representations $T$ as the feature matrix $Y$ to guide the attention learning with the attended visual-level object-OCR features $V$ that function as the input feature matrix $X$. Finally, we obtain the question-guided object-OCR features $V'$ and decoder hidden states $D'$.
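The difference between the SA and GA modules comes down to where the keys and values originate. The single-head sketch below makes that explicit; it deliberately omits the learnable query/key/value projections, the multiple heads and the feedforward layer, so it is a simplified illustration under those assumptions rather than the module used in SceneGATE.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for lists of equal-width vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def self_attention(x):
    """SA: queries, keys and values all come from the same input X."""
    return attention(x, x, x)


def guided_attention(x, y):
    """GA: queries come from X; keys and values come from the guide Y."""
    return attention(x, y, y)


visual = [[1.0, 0.0], [0.0, 1.0]]   # stands in for object-OCR features V
question = [[2.0, 2.0]]             # stands in for question features T
print(guided_attention(visual, question))
```

With a single question token, every visual row can only attend to that token, so each output row equals the question vector. With several tokens, each visual feature would become a question-weighted mixture.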

Semantic Relation-Aware Attention
We use the transformer encoder [35] with 12 heads as the backbone for our semantic relation-aware (SRA) attention layer. As illustrated in Section 3.3, we annotate 12 predefined relationships $sg_r \in R$ between every two object nodes $(sg_{o_i}, sg_r, sg_{o_j})$ and 2 relationship types between objects and their attributes, $(sg_{o_i}, sg_r, sg_{a_i})$ and $(sg_{a_i}, sg_r, sg_{o_i})$, in the scene graph. We introduce a special attention mechanism whereby each head of the transformer will only attend to certain nodes of the scene graph. In other words, for the $j$-th head in the SRA attention layer, we only allow each node to attend to nodes that are connected by certain types of relationships $R_j$, where $R_j$ is a subset of the 12 relationships $R$ and contains only $\kappa$ relationship types. A bias term $\beta$ is added to the calculation of the scaled dot-product attention. When the edge between two nodes is mapped to a relationship type $sg_r \in R_j$ that the $j$-th head is supposed to attend to, $\beta = 0$ and the attention weight between the two nodes is calculated normally. When the edge is mapped to a relationship type $sg_r \notin R_j$, $\beta = -\infty$ and the attention score between the two nodes also becomes $-\infty$, so the corresponding attention weight is zero after the softmax. Since each head is supposed to attend to specific sub-information in a scene graph, the calculation of the attention is limited to a given set of nodes. To manage which relationship types, and how many of them, each head in the SRA attention layer pays attention to, we need to control the values of $\beta$ and $\kappa$. Empirically, we find that $\kappa = 3$ is the most suitable.
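The per-head bias term can be illustrated with a small sketch. Here the adjacency matrix stores one integer relation label per node pair, and each head is given a set of allowed labels $R_j$. The label values, the function names and the assumption that every head's set includes the self relation (so that each row keeps at least one finite score) are illustrative choices, not details taken from the paper.

```python
import math

NEG_INF = float("-inf")


def head_bias(adjacency, allowed_labels):
    """beta[i][j] = 0 if the relation label is in R_j for this head, else -inf."""
    return [[0.0 if label in allowed_labels else NEG_INF for label in row]
            for row in adjacency]


def masked_attention_weights(scores, bias):
    """Row-wise softmax over (scores + bias).

    Entries with a bias of -inf receive a weight of exactly 0. Each row must
    keep at least one finite entry, which holds if R_j contains `self`.
    """
    weights = []
    for s_row, b_row in zip(scores, bias):
        logits = [s + b for s, b in zip(s_row, b_row)]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Two nodes; labels: 12 = self, 2 = surrounding, 1 = inside (hypothetical).
adjacency = [[12, 2], [1, 12]]
allowed = {12, 2}                     # this head ignores `inside` edges
scores = [[0.0, 0.0], [5.0, 1.0]]     # raw scaled dot-product scores
weights = masked_attention_weights(scores, head_bias(adjacency, allowed))
print(weights)
```

In the second row, the large raw score on the `inside` edge is discarded by the bias, so the node attends only to itself under this head.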

Positional Relation-Aware Attention
Inspired by [5], we also construct a directed complete spatial graph over the object features $V'_{obj}$ and OCR token features $V'_{ocr}$ in $V'$, where each edge corresponds to one of the spatial relationship types according to their relative positions. Additional edges are also added to connect all the object nodes and OCR tokens to all the question tokens.
Similar to the SRA attention layer, the positional relation-aware (PRA) attention layer also uses the structure of the multi-modal transformer encoder with 12 heads as the backbone and permits each head to attend to different subsets of the spatial relationship types. All the heads also allow all the objects and OCR tokens to attend to the question's words.
Moreover, a causal attention mask is applied for the decoder $D'$ in the PRA attention layer. $D'^{(t)}$ is the answer token generated from the decoder at time step $t$. The attention layer can attend to all question tokens, objects and OCR tokens along with the previously decoded entries in the answer $D'^{(<t)}$, without attending to $D'^{(>t)}$, the decoding entries after time step $t$. $T$ and $V'$, obtained from the SA module and GA module, are combined as the input sequence and fed to two subsequent PRA attention layers to obtain spatially attended features $F_s$.
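The causal mask can be pictured as a boolean matrix over the joint sequence of context entries (question tokens, objects and OCR tokens) followed by decoding steps. The sketch below is one plausible layout. Two points are assumptions rather than statements from the text: context entries are not allowed to attend to any decoding step, and a decoding step may attend to its own position.

```python
def causal_mask(num_context, num_steps):
    """mask[i][j] is True when position i may attend to position j.

    Positions 0..num_context-1 are context entries; the remaining
    num_steps positions are decoding steps in temporal order.
    """
    n = num_context + num_steps
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < num_context:
                mask[i][j] = True          # everyone sees the full context
            elif i >= num_context and j <= i:
                mask[i][j] = True          # decoder sees itself and earlier steps
    return mask


for row in causal_mask(2, 2):
    print(["x" if allowed else "." for allowed in row])
```

Entries that are False would receive a bias of $-\infty$ before the softmax, in the same way as the relation-type bias in the SRA and PRA heads.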
Outputs $F_{sg}$ from the SRA attention layer are combined with $F_s$ to become the input to the multi-modal transformer encoders [35]. The combined input allows the model to attend to all input features in a pair-wise manner. The multi-word answer in each time step $t$ is decoded using the dynamic pointer network following [2].

Datasets
We evaluate our model with two widely used benchmark datasets: Text-VQA and ST-VQA. The Text-VQA dataset was proposed by [1] in 2019. Different from the conventional VQA datasets, images in the Text-VQA dataset contain texts in different formats, and the questions are specifically designed to be answered by referring to the textual information in images. The Text-VQA dataset collects 28,408 images that contain texts from the Open Images v3 dataset [36]. There are 45,336 question-image pairs in the Text-VQA dataset, which are split into 34,602, 5000 and 5734 for training, validation and testing, respectively. Each question has 10 ground truth answers, and the voting of these 10 answers is used to compute the soft accuracy score. The ST-VQA dataset [37] is a concurrent work of the Text-VQA dataset. However, different from the Text-VQA dataset, the 23,038 images in the ST-VQA dataset are collected from multiple source image datasets, including the COCO-Text [38], Visual Genome [33], VizWiz [39], ICDAR [40,41], ImageNet [42] and IIIT-STR [43] datasets, in order to reduce the effect of possible biases from a single-source image dataset. There are 17,028 images/23,446 questions for the training set, 1893 images/2628 questions for the validation set and 2971 images/4070 questions for the test set. Each question has at most 2 ground truth answers to compute the accuracy score by soft voting, similar to the VQA context. To clearly show the difference between conventional VQA datasets and the Text-VQA/ST-VQA dataset, we list some typical conventional VQA datasets and compare them in terms of various aspects in Table 3.

Implementation Details
We encode the question features, the appearance features of objects and the OCR tokens in the same dimension and the same maximum sequence length as in [5]. Each scene graph has an average of 36 object nodes and a maximum of 100 OCR nodes. The downstream SA and GA modules have a dimension of 768, with 8 attention heads and a dropout rate of 0.1. Our experiments are conducted utilizing an NVIDIA Titan RTX GPU with 24 GB RAM, a 16 Intel(R) Core(TM) i9-9900X CPU @ 3.50 GHz with 128 GB RAM and the operating system of Ubuntu 20.04.1. Our final model contains around 95 million trainable parameters and requires around 0.6 h to train one epoch. The validation accuracy converges within 40 epochs of training for most of our model variants, and our best model converges within 8 epochs on both datasets. In short, the training of our best-performing model requires around 3 GB GPU RAM and 4 h to complete. We use a batch size of 8 and follow the same settings as in [5] for other hyperparameter values. Details of all hyperparameters can be found in Appendix A.

Baseline Models
We compare the SceneGATE network with the following baselines in this work:

• LoRRA [1] encodes the OCR tokens with only FastText embedding and it has an attention mechanism to integrate all image, question and OCR token features into the same joint embedding space for answer prediction.

• M4C [2] uses enriched OCR token representations that include the appearance, semantic, character-level and spatial features of OCR tokens. The multi-modal transformer encoder is used for modality integration and iterative decoding, while a dynamic pointer network is applied for answer generation.

• Simple is not Easy (SNE) [3] has three separate vanilla attention blocks for the independent integration of object region, OCR visual-based and OCR textual-based features with the questions.

• Localization-Aware Answer Prediction (LaAP-Net) [6] integrates objects and OCR token features via an attention mechanism to obtain the OCR-related image features, which is followed by M4C using the multi-modal transformer for integration with the question features.

• The Multi-Modal Graph Neural Network (MM-GNN) [4] constructs three graphs for object regions, semantic OCR tokens and numeric OCR tokens, which all interact to learn from the related nodes.

• Spatially Aware Multi-Modal Multi-Copy Mesh (SA-M4C) [5] adopts a spatially aware self-attention module to capture and to encode 12 different types of spatial relationships between objects and OCR tokens.

• Structured Multi-Modal Attention (SMA) [7] applies a question-conditioned graph attention module to identify the potential relationships between objects and OCR tokens from the question patterns.

Performance Comparison
Different from other works that used ST-VQA to enlarge the training dataset size, we compared the performance of our model with different baselines by training only on the original Text-VQA dataset. We can see from Table 4 that our model outperformed all the baselines and yielded the state-of-the-art result of 42.37% validation accuracy and 44.02% test accuracy on the Text-VQA dataset. We ran the code provided by SA-M4C to train on the Text-VQA dataset with their default hyperparameters. Their results were only 40.71% and 42.61% for the validation and test accuracy, respectively, which were almost 2% lower than our model. Compared with M4C, SNE and LoRRA, which simply used the attention mechanism to integrate the different modalities, the models (SMA, SA-M4C and SceneGATE) that applied the graph attention module achieved better performance, indicating the importance of the explicit representation and encoding of the relationships between objects and their related OCR tokens for the TextVQA task. However, both SMA and SA-M4C considered only the visual representation of object nodes and their spatial relationships when the node embedding of OCR tokens in a graph was updated. Our model overcomes this limitation via the additional encoding of the semantic embeddings of object nodes and the semantic relationships among different objects and OCR tokens with the use of the scene graph. Our results also indicate the importance of such semantic relationship representation and the scene graph in the TextVQA task.
In addition to the accuracy rate, we used the Average Normalized Levenshtein Similarity (ANLS) score, which was proposed for the evaluation of the ST-VQA dataset [37], as an additional metric for performance on ST-VQA. The ANLS score aims to reduce the penalty incurred by OCR recognition errors. It compares the similarity between the ground truth answers and the predictions, rather than requiring an exact match as the accuracy rate does. The metric measures the normalized edit distance of converting a prediction string into a ground truth string in order to give a soft score for the prediction. If the normalized edit distance is greater than 0.5, the prediction is considered incorrect for reasons other than OCR recognition mistakes, and a score of 0 is given. Otherwise, the prediction score is 1.0 minus this normalized edit distance. A higher ANLS score indicates that the model makes more accurate predictions. The performance of our model and the baselines is compared in Table 5. Our model outperforms the baselines, achieving 41.29%, 0.525 and 0.516 for the validation accuracy, validation ANLS score and test ANLS score, respectively. Compared to the base model SA-M4C, our model achieved a 1.5% improvement in accuracy, a 0.028 improvement in the validation ANLS score and a 0.039 improvement in the test ANLS score. SA-M4C itself performed slightly worse than its base model, M4C, in terms of accuracy and achieved only around a 0.01 increase in the ANLS score over M4C on both the validation set and the test set. The larger gain of our model over its base model shows the importance of considering the semantic relations among the objects and OCR tokens. The use of a scene graph to capture the explicit semantic relationships between objects and OCR tokens makes an important contribution to our model's SOTA performance on both the Text-VQA and ST-VQA datasets.
To examine the impact of different scene graph node embedding initialization methods, we also evaluated the model with GCN and GloVe node embedding initialization on both the Text-VQA and ST-VQA validation sets. Table 6 shows that GloVe had the worst results, with only 41.33% and 40.79% accuracy on the Text-VQA and ST-VQA validation sets, respectively, while GCN increased the performance slightly. FastText resulted in the best performance, which we attribute to its better handling of the out-of-vocabulary (OOV) issue for rare OCR tokens in a scene graph.
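Returning to the ANLS metric described above, the scoring rule can be sketched in a few lines of Python. This is a minimal illustration rather than the official ST-VQA evaluation script: the function names, the lowercasing and whitespace stripping, and the choice of taking the best score over several ground truth answers are assumptions made for the example. The threshold handling follows the wording above (a normalized distance greater than 0.5 scores 0).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def anls_score(prediction: str, answers: list, threshold: float = 0.5) -> float:
    """Soft score for one prediction against its ground truth answers."""
    # Assumption: case-insensitive comparison on stripped strings.
    pred = prediction.strip().lower()
    best = 0.0
    for answer in answers:
        gt = answer.strip().lower()
        longest = max(len(pred), len(gt))
        # Normalize the edit distance by the longer string's length.
        nl = levenshtein(pred, gt) / longest if longest else 0.0
        # Above the threshold the prediction counts as wrong (score 0);
        # otherwise the score is 1 minus the normalized distance.
        score = 0.0 if nl > threshold else 1.0 - nl
        best = max(best, score)
    return best


def mean_anls(predictions: list, answer_sets: list, threshold: float = 0.5) -> float:
    """Average the per-question scores over a dataset split."""
    scores = [anls_score(p, a, threshold) for p, a in zip(predictions, answer_sets)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this sketch, a prediction with a small OCR-style slip such as "coors ligth" against the ground truth "Coors Light" still earns partial credit, whereas an unrelated string scores 0.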
Network Component. To investigate the contribution of our model's components, we first integrated all the image, question, OCR token and scene graph features via multi-modal transformer encoders as in M4C [2]. This simple approach achieved 39.13% and 37.78% accuracy on the Text-VQA and ST-VQA validation sets, respectively, as shown in Table 7. After the addition of the co-attention module for better inter- and intra-modal integration of the image, question and OCR token features, the accuracy on the two validation sets increased significantly to 41.27% and 39.57%, indicating the effectiveness of such self-attention-based guided attention. The inclusion of PRA attention layers improved the accuracy by around 0.7%, and the performance rose further to 42.37% and 41.29% on the Text-VQA and ST-VQA validation sets after adding the SRA attention layer over the scene graphs. These results support the critical roles of the co-attention, PRA and SRA attention layers in our model.
Effect of Layer Number. To determine how many layers to apply to each module of the model, we conducted experiments with different combinations of layer numbers; the results are presented in Table 8. Since two multi-modal transformer encoder (MMTE) layers worked best for the SA-M4C model [5], we started by fixing the number of final BERT layers to two and tried all combinations of values in the range [1,3] for the number of PRA attention layers and the number of SRA attention layers. We empirically observed that models with two PRA attention layers always performed better than the others (rows 4-6 vs. other rows), yielding validation accuracy of more than 41.8%. In addition, having two layers for each type of attention layer worked the best. Based on these two observations, we fixed the number of PRA layers to two and tested the model with smaller numbers of SRA attention layers and MMTE layers. Eventually, we found that two PRA attention layers, one SRA attention layer and one MMTE layer yielded the best validation accuracy, 42.37%.
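The two-stage layer search described above can be sketched as follows. This is an illustrative enumeration only: the function names and the dictionary layout are hypothetical, and the second-stage range (one or two layers for SRA and MMTE) is an assumption about what "smaller numbers" means. No accuracy values are reproduced here; they would come from training runs such as those reported in Table 8.

```python
from itertools import product


def first_stage(layer_range=(1, 2, 3), mmte_layers=2):
    """All PRA/SRA combinations with the MMTE depth held fixed."""
    return [
        {"pra": pra, "sra": sra, "mmte": mmte_layers}
        for pra, sra in product(layer_range, repeat=2)
    ]


def second_stage(pra_layers=2, max_layers=2):
    """Fix PRA and vary SRA and MMTE over 1..max_layers (assumed range)."""
    depths = range(1, max_layers + 1)
    return [
        {"pra": pra_layers, "sra": sra, "mmte": mmte}
        for sra, mmte in product(depths, repeat=2)
    ]


def select_best(results):
    """Pick the configuration with the highest validation accuracy.

    `results` is a list of (config, accuracy) pairs produced by
    whatever training and evaluation routine is in use.
    """
    return max(results, key=lambda item: item[1])[0]
```

With PRA as the outer loop, the nine first-stage configurations place the two-PRA-layer settings at positions 4 to 6, which is consistent with the row reference in the text, although the actual row ordering of Table 8 is not confirmed by this sketch.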

Quality Analysis
Figure 2 shows sample images with questions and the answers from different baseline models. The OCR tokens and their associated object regions with high attention weights are highlighted with yellow and red bounding boxes in the images. Compared with the baselines, our model generated more accurate and complete answers, with the correct corresponding OCR token regions detected in the images. For example, in the top-right image, our model correctly identified the brand of the beer with the answer "Coors Light", while M4C missed the token "Light" and LoRRA and SA-M4C gave incorrect results. Our model also showed good inference ability in addition to text-reading ability. Taking the bottom-right image as an example, to answer the question "How many items can you get for $5?", the model not only locates "$5" correctly but also understands the semantic meaning of the forward slash in the image and interprets the character before this symbol as a number. Our model provided the correct answer, while the answers of LoRRA and SA-M4C were incorrect. We present more examples and error analyses in Appendices B and C.

Conclusions
We propose SceneGATE with the use of a novel TextVQA-based scene graph by treating the OCR tokens in images as the attributes of the objects. Our SceneGATE applies semantic relation-aware attention to the scene graph and uses the guided attention mechanism to obtain the question-guided object and OCR token features, which are then fed into the graph attention module for the learning of the positional relationships between objects and OCR tokens. Our SceneGATE comprehensively learns the semantic and positional relationships between objects and texts in images and outperforms the SOTA on both the Text-VQA and ST-VQA datasets.
have 8 different classes: CONSUMER GOOD, EVENT, LOCATION, NUMBER, ORGANIZATION, PERSON, WORK OF ART and OTHER. The outputs of the second layer of the object-GCN, $H^{(l_2)}_{obj}$, and of the attribute-GCN, $H^{(l_2)}_{att}$, are then passed to a minimum pooling layer as in Equation (2) to obtain the final node feature matrix $X'_{GCN} \in \mathbb{R}^{M \times d}$.
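As a rough illustration of the pooling step mentioned above, the sketch below takes the element-wise minimum of two equally shaped node feature matrices. Equation (2) is not reproduced in this excerpt, so the element-wise reading of "minimum pooling", the function name `min_pool` and the list-of-lists representation are all assumptions made for the example.

```python
def min_pool(h_obj, h_att):
    """Element-wise minimum of two M x d matrices (lists of rows).

    Assumption: both GCN outputs share the same shape, so the pooled
    result is again an M x d node feature matrix.
    """
    if len(h_obj) != len(h_att):
        raise ValueError("matrices must have the same number of rows")
    pooled = []
    for row_obj, row_att in zip(h_obj, h_att):
        if len(row_obj) != len(row_att):
            raise ValueError("rows must have the same length")
        # Keep the smaller value at every feature position.
        pooled.append([min(a, b) for a, b in zip(row_obj, row_att)])
    return pooled
```

For two 2 x 2 inputs, the output keeps the smaller value in each position, so the result has the same shape as either input.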

Figure 2. Visualization of attention outputs from SceneGATE. Yellow bounding boxes surround the OCR tokens predicted by SceneGATE. Red bounding boxes are the object regions that are associated with the OCR tokens. The thicker the bounding box lines, the higher the attention weights are.

Figure A1. Visualization of attention outputs from SceneGATE. Yellow bounding boxes surround the OCR tokens predicted by SceneGATE. Red bounding boxes are the object regions that are associated with the OCR tokens. The thicker the bounding box lines, the higher the attention weights are.

Appendix B. Additional Qualitative Examples
Figure A1 compares some additional prediction results of SceneGATE to those of the other baselines.

Figure A2. Visualization of incorrect classification analysis. Yellow bounding boxes surround the OCR tokens predicted by SceneGATE. Red bounding boxes are the object regions that are associated with the OCR tokens. The thicker the bounding box lines, the higher the attention weights are.

Table 1. Summary of TextVQA models discussed.

Table 2. Symbols and definitions.
Symbol | Definition
) | scene graph of an image with node set V and edge set E
sg o ∈ O | object node set of each scene graph
sg a ∈ A | attribute node set of each scene graph
sg r ∈ R | relationship node set of each scene graph
P = |O ∪ A| | total number of object and attribute nodes in each scene graph
N = |O| | total number of object nodes in each scene graph
M = |O ∪ A ∪ R| | total number of nodes in each scene graph
H′ | embedding size of OCR tokens
X | generalized input feature matrix into self-attention/guided attention
Y | generalized additional input feature matrix into guided attention
T | self-attended question representation
V |

Table 4. Results on Text-VQA dataset. Acc. refers to the soft accuracy score.

Table 5. Results on ST-VQA dataset. Acc. refers to the soft accuracy score.

Table 6. Validation performance of our model obtained on different types of scene graph node embedding. Acc. refers to the soft accuracy score.

Table 7. Ablation testing results on the validation set. PRA: positional relation-aware. SRA: semantic relation-aware. Acc. refers to the soft accuracy score.

Table 8. Validation performance for different numbers of each type of attention layer used. MMTE: multi-modal transformer encoder layers, PRA: positional relation-aware attention layers, SRA: semantic relation-aware attention layers, Acc.: soft accuracy score.