Semantic Representation Using Sub-Symbolic Knowledge in Commonsense Reasoning †

† This study is an extension of our research presented at the 28th International Conference on Computational Linguistics, 8–13 December 2020. We have extended our previous study by (1) showing how to assess pre-trained models on their understanding of questions and demonstrating language-model limitations, (2) proposing a new graph representation strategy with expansion using the AMR graph and ConceptNet, and (3) showing significant performance improvements on diverse commonsense reasoning-based datasets compared with baselines.
‡ These authors contributed equally to this work.

Abstract: The commonsense question and answering (CSQA) system predicts the right answer based on a comprehensive understanding of the question. Previous research has developed models that use QA pairs, the corresponding evidence, or a knowledge graph as input. Each method executes QA tasks with representations from pre-trained language models. However, whether a pre-trained language model comprehends the question completely remains debatable. In this study, adversarial attack experiments were conducted on question understanding. We examined the restrictions on the question-reasoning process of the pre-trained language model and then demonstrated the need for models to use the logical structure of abstract meaning representation (AMR). Additionally, the experimental results demonstrated that the method performed best when the AMR graph was extended with ConceptNet. With this extension, our proposed method outperformed the baselines on diverse commonsense-reasoning QA tasks.


Introduction
Based on a clear understanding of the question and commonsense data, the commonsense question and answering (CSQA) system evaluates a question to obtain the correct answer. To predict the correct answer, a question must be comprehensively understood with commonsense knowledge. As shown in Figure 1, ferret is a key word for answering the question. Unlike machines, people capture the relationships between the predicates and arguments of a question and extract the necessary concepts from commonsense knowledge. However, a machine implicitly gathers statistics on how words co-occur in large corpora rather than obtaining a clear representation of concepts [1]. A machine ultimately capable of commonsense reasoning must understand linguistic symbols, such as semantic representations [2–5]. Furthermore, selecting the concepts of the question analytically from a large body of commonsense knowledge is necessary for precise reasoning. Recent tasks have mainly focused on answering questions given relevant documentation or context, requiring little general background knowledge. However, people use their wealth of world knowledge to answer questions. Owing to the increasing demand for evaluating machines on human-like commonsense reasoning [6,7], corresponding datasets have appeared recently. OpenBookQA [7] is a question-and-answer dataset modeled on open-book exams for assessing human comprehension of a topic. Answering the questions in this dataset requires extensive general knowledge not covered in the book. In addition, the CommonsenseQA dataset [6] is built from ConceptNet, a knowledge base that contains people's common sense. This dataset leverages multiple target concepts with the same semantic relationship to a single source concept in ConceptNet. Thus, the model must distinguish each target concept from the mentioned source.
Two main approaches have been developed to solve these tasks. The first approach to commonsense reasoning is a fine-tuning method with a pre-trained language model. This method collects evidence sentences from knowledge sources, such as Wikipedia or Open Mind Common Sense (OMCS) [8], and trains a pre-trained language model using this external commonsense knowledge. During the inference stage, the system creates an input as a concatenation of the question, the candidate answers, and the corresponding evidence retrieved from the evidence sources. The system trains on evidence sentences containing commonsense knowledge indiscriminately, using a model with numerous parameters. The second approach applies a reasoning process with a commonsense knowledge graph [9–11]. Based on the words that appear in the question, this method extracts data from ConceptNet [12] and represents them using graph encoders. The answer is predicted using the graph representation and attention data from the language models. The system supplements the insufficient representations of the language model with a commonsense knowledge graph. However, improving the performance of these approaches without understanding the question remains difficult.
To improve performance, this study proposes using the abstract meaning representation (AMR) of a question [13]. An AMR graph captures the meaning of one or more sentences within a logical framework. Because this representation lacks commonsense relationships between concepts, we expanded the graph with an additional commonsense knowledge graph based on each concept. We used the AMR symbolic framework to comprehend commonsense reasoning logically and represented the new AMR-ConceptNet graph, expanded with commonsense knowledge, as a Levi graph [14].
Our main contributions are as follows.
• We demonstrate how to assess pre-trained models on their understanding of questions and demonstrate the limitations of language models.
• We propose a new graph representation strategy expanded with an AMR graph and ConceptNet.
• Compared with the baselines, our method shows significant performance improvement on diverse commonsense reasoning-based datasets.

Abstract Meaning Representation (AMR)
Various studies in the natural language processing (NLP) field have used AMR in their models [5,15–22]. An AMR [13] is a graph that logically represents the meaning of a sentence. The AMR graph captures the structure of "who is doing what to whom" in a given sentence and represents the sentence as a single-rooted directed acyclic graph of concept nodes and the relationships between them. Because the root node is the focus of the representation, the other concepts are connected via semantic relations once the root node is fixed. Similar to a parse tree, the AMR graph is traversable and covers all words. However, AMR builds the same graph structure for two syntactically different but semantically similar sentences. The concepts in an AMR graph are events or entities, and the relationships between the concepts are drawn from the vocabulary of the PropBank frameset [23] and standard words. AMR represents semantic roles using more than 100 semantic relations (for example, negation, conjunction, and command). In PropBank form, the graph is labeled with the semantic roles ARG0–ARG4 and ARGM. The remaining concept nodes are then joined sequentially through these semantic relations.
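As a concrete illustration, the rooted structure described above can be sketched as a small directed graph. The sentence, concepts, and roles below are invented for illustration, not parser output:

```python
# Illustrative AMR graph for "The ferret chased the mouse" as a rooted
# directed acyclic graph; the concepts and roles are invented examples
# in PropBank style, not the output of an AMR parser.
amr = {
    "root": "chase-01",
    "nodes": {"chase-01", "ferret", "mouse"},
    "edges": [
        ("chase-01", "ARG0", "ferret"),  # ARG0: who is doing
        ("chase-01", "ARG1", "mouse"),   # ARG1: to whom
    ],
}

def concepts_in_role(graph, role):
    """Return the concepts attached to the root via a given semantic role."""
    return [dst for src, rel, dst in graph["edges"]
            if src == graph["root"] and rel == role]

print(concepts_in_role(amr, "ARG0"))  # ['ferret']
```

Because the root is the representation focus, traversing the role edges from it recovers the predicate-argument structure of the sentence.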

ConceptNet
ConceptNet [12] is a multilingual knowledge graph that connects the words and phrases of natural language that people use in the real world. Real-world common sense is defined in ConceptNet as nodes and directed edges indicating concepts and their relations, respectively. The relationships defined in a single lexical resource are not enough for a machine to understand the words of natural language as people use them. For example, in WordNet, dog and cat are defined as hyponyms of animal; however, dog is not connected to pet. ConceptNet is constructed by collecting data from various knowledge bases, including Wiktionary [24], WordNet [25], and DBpedia [26]. Its nodes are also defined as hierarchical URIs to avoid ambiguity; for example, the node "/c/en/read/v" can be retrieved using its part-of-speech information. Moreover, multiple relationships can exist simultaneously between two nodes. Consequently, the ambiguity between two different nodes can be handled using these multiple relations.
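The hierarchical URIs and the parallel relations between two nodes can be sketched with a few triples. The edges below are hypothetical examples, not results queried from ConceptNet:

```python
# Illustrative ConceptNet-style triples keyed by hierarchical URIs such as
# "/c/en/read/v" (language and part of speech encoded in the path).
# These edges are hypothetical examples, not ConceptNet API results.
edges = [
    ("/c/en/dog", "/r/IsA", "/c/en/animal"),
    ("/c/en/dog", "/r/IsA", "/c/en/pet"),
    ("/c/en/dog", "/r/RelatedTo", "/c/en/pet"),
    ("/c/en/cat", "/r/IsA", "/c/en/animal"),
]

def relations_between(edges, start, end):
    """Collect every relation type linking start -> end; multiple parallel
    relations between two nodes help resolve their ambiguity."""
    return [rel for s, rel, e in edges if s == start and e == end]

print(relations_between(edges, "/c/en/dog", "/c/en/pet"))
# ['/r/IsA', '/r/RelatedTo']
```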

Commonsense Reasoning
Pre-trained-model-based approaches have performed admirably in earlier studies. The initial approach (https://gist.github.com/commonsensepretraining/507aefddcd00f891c83ebf6936df15e8 (accessed on 1 May 2022), https://drive.google.com/file/d/1sGJBV38aG706EAR75F7LYwCqci9ocG9i/view (accessed on 1 May 2022)) to commonsense reasoning was based on a fine-tuning method. This approach uses only the questions and answers, which limits its reasoning capability. A retrieval module was therefore also used to supplement the reasoning ability by retrieving evidence related to the questions and answers. The second approach uses an additional encoder to embed knowledge graphs such as ConceptNet; this encoder typically uses the paths or nodes of the graph [9–11]. Lin et al. [9] and Ma et al. [11] extract graph paths using a specific search algorithm and use them as input to the encoder. Lv et al. [10] embeds the nodes with an adjacency matrix and uses graph attention to compute attention scores. Various pre-trained models used with the aforementioned methods achieved high performance, including BERT [27] and RoBERTa [28], which use bidirectional transformer encoders. They also include XLNet [29], which uses autoregressive language modeling; ALBERT [30], which uses cross-layer parameter sharing and factorized embedding parameterization; and ELECTRA [31], which is designed as a generator and discriminator pre-trained with a replaced token detection (RTD) task.

Proposed Method
Because each word in a sentence acts in a specific role, such as a predicate or an argument, the concepts of the AMR graph also carry semantics through the graph structure. Owing to these benefits of the graph structure, we used AMR graphs to extract commonsense knowledge graphs. We generated the AMR graph of the question using the model of Cai and Lam [32], which has recently demonstrated excellent performance in AMR generation tasks. Although most AMR graphs are properly generated by the model, inevitable errors in the types of relationships or concepts may occur. After obtaining an AMR graph, our method integrates it with the ConceptNet graph. In particular, if an AMR concept exists in ConceptNet, the corresponding ConceptNet node is connected to that node of the AMR graph. The proposed method then uses ConceptNet to prune the ARG0, ARG1, ARG2, ARG3, and ARG4 nodes that lack edges. Our graph representation also prunes the other nodes unrelated to the argument nodes because the argument relationships carry significant meaning. This process trains the model on ACP-ARG graphs while repeatedly identifying excessive paths. The proposed ACP-ARG graph is shown in Figure 2. A graph G = (V, E) consists of a fixed set of nodes V and relational edges E. The AMR graph is denoted as G_AMR = (V_amr, E_amr), and the subgraph of ConceptNet corresponding to the concepts linked by argument relations is denoted as G_CN^(amr,arg) = (V_cn^(amr,arg), E_cn^(amr,arg)). The ACP-ARG graph is the union of the AMR graph and these ConceptNet subgraphs, as in Equation (1):

G_ACP-ARG = G_AMR ∪ G_CN^(amr,arg) (1)
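A minimal sketch of this integration-and-pruning step, assuming a toy AMR graph and a ConceptNet lookup table (the data structures and the `build_acp_arg` helper are our own illustrative names, not the authors' implementation):

```python
def build_acp_arg(amr_nodes, amr_edges, conceptnet,
                  arg_roles=("ARG0", "ARG1", "ARG2", "ARG3", "ARG4")):
    """Sketch of Equation (1): union the AMR graph with the ConceptNet
    subgraphs of argument-linked concepts, then prune argument nodes that
    gained no ConceptNet edges. `conceptnet` maps a concept to a list of
    (relation, neighbor) pairs; all names here are illustrative."""
    arg_concepts = {dst for _, rel, dst in amr_edges if rel in arg_roles}
    nodes, edges = set(amr_nodes), set(amr_edges)
    for concept in arg_concepts:
        for rel, neighbor in conceptnet.get(concept, []):
            nodes.add(neighbor)
            edges.add((concept, rel, neighbor))
    # prune argument nodes that lack ConceptNet edges
    pruned = {c for c in arg_concepts if not conceptnet.get(c)}
    nodes -= pruned
    edges = {(s, r, d) for s, r, d in edges if s not in pruned and d not in pruned}
    return nodes, edges
```

For instance, with an AMR edge ("chase-01", "ARG1", "mouse") and no ConceptNet entry for mouse, the mouse node and its edge are pruned while argument concepts with ConceptNet matches are expanded.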

Data Setup
The SRLAttack dataset is an adversarial attack dataset built to analyze whether a pre-trained language model relies on superficial cues. Semantic role labeling (SRL) captures the relationship between the predicate and the arguments of the question. Because it determines the precise meaning of the question, the SRL analysis can be used to determine the correct answer. First, we labeled the semantic roles of each argument using a pre-trained model (https://docs.allennlp.org/models/main/models/structured_prediction/predictors/srl/ (accessed on 1 April 2021)) [33]. Then, we randomly selected a predicate in the labeled question and constructed a distractor answer from an argument based on the position of the predicate.
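The substitution step can be sketched as follows; the `srl_attack` helper and its inputs are illustrative assumptions, not the actual dataset-construction code:

```python
import random

def srl_attack(options, answer_idx, srl_arguments, seed=0):
    """Sketch of the adversarial substitution: replace one wrong candidate
    option with a randomly chosen argument of the question's predicate
    (taken from the SRL labels). The correct answer is left untouched."""
    rng = random.Random(seed)
    wrong = [i for i in range(len(options)) if i != answer_idx]
    attacked = list(options)
    attacked[rng.choice(wrong)] = rng.choice(srl_arguments)
    return attacked
```

A model that relies on surface overlap with the question is drawn toward the injected argument, whereas a model using commonsense knowledge should still pick the correct answer.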
The CommonsenseQA dataset consists of 12,102 questions with five candidate answers each, provided by Talmor et al. [6]. Because the organizer validates submissions on the official test set only once every two weeks, we split the official training set for experimental efficiency. The new training, development, and test sets contained 8500, 1221, and 1242 examples, respectively.
The OpenBookQA dataset consists of questions about elementary-level science with four candidate answers each. The questions require commonsense knowledge to solve, and the dataset includes 4957 training, 500 development, and 500 test examples [7]. Unlike CSQA, we used the official development set because the test set is publicly accessible; therefore, we could monitor the performance as needed.

Experimental Details
We trained the model on a Quadro RTX 8000 and used the same parameters as Cai and Lam [15] and Lim et al. [34]. The hyperparameters of each language model were tuned manually.

Pre-Trained Language Models
BERT [27], a bidirectional model using the transformer architecture, performs admirably in most natural language understanding tasks. BERT is pre-trained on large-scale text data with the masked language modeling (MLM) and next sentence prediction (NSP) objectives, so it effectively captures natural language context. Despite its capability to learn context, BERT cannot capture the overall meaning of a text because of its static masking rule applied to 15% of the tokens in a sentence. ELECTRA [31] proposed more efficient pre-training strategies using generator and discriminator networks, similar to a generative adversarial network (GAN).

AMR-CN Reasoning Model
For the AMR-CN reasoning baseline, we used the model of Lim et al. [34], which takes the pruned graph as input, calculates the attention score of each path using the graph transformer, and obtains the entire graph vector. The model is shown in Figure 3.

Graph Path Learning Module
Given the ACP-ARG graph from the graph integration and pruning module, the graph path learning module initializes each concept node vector as the sum of the concept embedding from GloVe [35] and an absolute position embedding. A relation encoder is first used to encode the connection between two concepts into a distributed representation so that the model can recognize the explicit paths of the ACP-ARG graph. The relation encoder finds the shortest path between two concepts and encodes the sequence into a relation vector using a gated recurrent unit (GRU) [36], as in Equation (2):

h_t = GRU(h_{t-1}, sp_t) (2)

where sp_t is the t-th relation on the shortest path between the two nodes. The final relation encoding r_ij between concepts i and j is the concatenation of the final hidden states of the forward and backward GRU networks, as in Equation (3):

r_ij = [h_T^fwd ; h_T^bwd] (3)
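The shortest-path extraction feeding the GRU can be sketched with a breadth-first search; marking reverse-direction edges with an `_inv` suffix is an assumption of this sketch, and the GRU encoder itself is omitted:

```python
from collections import deque

def shortest_relation_path(edges, start, end):
    """Breadth-first search over an undirected view of the graph, returning
    the relation sequence sp_1..sp_T that a bidirectional GRU would encode
    (the GRU itself is not shown). Traversing an edge backwards is marked
    with an "_inv" suffix, an assumption of this sketch."""
    adj = {}
    for s, rel, d in edges:
        adj.setdefault(s, []).append((rel, d))
        adj.setdefault(d, []).append((rel + "_inv", s))
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == end:
            return path
        for rel, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel]))
    return None  # no path between the concepts
```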
To inject this relation information into the concept representation, the AMR-CN reasoning model follows the idea of relative position encoding [37,38] and introduces an attention scoring method based on the concept representation and the relation representation.
To calculate the attention score, the model splits the relation vector r_ij, passed through a linear layer, into a forward relation encoding r_{i→j} and a backward relation encoding r_{j→i}, as follows:

[r_{i→j} ; r_{j→i}] = W_r r_ij (4)

where W_r is a parameter matrix. This split allows the model to consider the bidirectionality of the path. Thereafter, the model computes the attention score from the concepts and their relations, where c_i and c_j are the concept embeddings:

s_ij = ((c_i + r_{i→j}) W_Q) ((c_j + r_{j→i}) W_K)^T / sqrt(d)
     = [ (c_i W_Q)(c_j W_K)^T + (c_i W_Q)(r_{j→i} W_K)^T + (r_{i→j} W_Q)(c_j W_K)^T + (r_{i→j} W_Q)(r_{j→i} W_K)^T ] / sqrt(d) (5)

The first term in the last line of Equation (5) is the original term of the vanilla attention mechanism and captures the pure content of the concepts. The second and third terms capture the bias of the relation with respect to the source and target, respectively. The last term represents a universal relation bias. The computed attention score updates the concept embeddings while maintaining fully connected communication [15]. Therefore, the concept-relation interaction can be injected into the concept node vectors. The resulting concept representations are aggregated into the entire graph vector and fed into the transformer layer to model the interaction between the AMR and ConceptNet concept representations.
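The four-term structure of the relation-enhanced score can be checked numerically; for brevity this sketch omits the W_Q and W_K projections, which does not affect the algebraic decomposition:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_score(c_i, c_j, r_fwd, r_bwd):
    """Four-term form of the relation-enhanced attention score; the
    W_Q / W_K projections are omitted for brevity (an assumption of
    this sketch), which leaves the decomposition intact."""
    content = dot(c_i, c_j)          # vanilla content-content term
    src_bias = dot(c_i, r_bwd)       # relation bias w.r.t. the source concept
    tgt_bias = dot(r_fwd, c_j)       # relation bias w.r.t. the target concept
    universal = dot(r_fwd, r_bwd)    # universal relation bias
    return (content + src_bias + tgt_bias + universal) / math.sqrt(len(c_i))
```

The result equals dot(c_i + r_{i→j}, c_j + r_{j→i}) / sqrt(d), i.e. the compact form before expansion, so the four bias terms fall out of one scaled dot product.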
The major advantage of this relation-enhanced attention mechanism is that it provides a fully connected view of input graphs using the relation multi-head attention mechanisms. By integrating two different concept types from the AMR graph and ConceptNet into a single graph, the model globally recognizes which path has high relevance to the question during the interpretation.

Language Encoder
The language encoder is utilized to encode text input into distributed representation, which is a pre-trained language model with a large corpus. The language model uses the models described in the baseline.

Reasoning Module
The proposed method performs commonsense reasoning on the ACP-ARG graph and predicts the correct answer. The model takes two types of input, the text and graph representations, and transforms these semantic representations into distributed representations. After obtaining the representation vectors, the model concatenates the graph and language vectors, feeds them into a softmax layer, and then selects the correct answer.
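A minimal sketch of this prediction step, assuming per-candidate graph and text vectors and a hypothetical linear scoring head (the shapes and the `predict` helper are illustrative assumptions):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def predict(graph_vecs, text_vecs, w):
    """Sketch of the reasoning module's final step: score each candidate
    from the concatenated [graph; text] vector with a hypothetical linear
    head w, apply softmax, and pick the argmax candidate."""
    scores = []
    for g, t in zip(graph_vecs, text_vecs):
        concat = g + t  # list concatenation models [graph; text]
        scores.append(sum(wi * xi for wi, xi in zip(w, concat)))
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)
```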

Diverse Expansion Methods
In the experiments, we demonstrated the effect of our various expansion methods on pre-trained language models and compared their ability to answer commonsense questions in depth. To this end, we showed how the AMR graph enables pre-trained language models to understand the semantics of a question and expanded the graph with ConceptNet for comprehensive knowledge acquisition. However, expanding the AMR graph with all concepts in the knowledge graph wastes computational resources; additionally, only a few concepts require external knowledge in a proper reasoning process. We therefore conducted a study to determine which expansion method is the most effective for commonsense reasoning.
For ConceptNet, we expanded the graph based on all words in the question, producing the CN-full graph. The ConceptNet graph is denoted by G_CN = (V_cn, E_cn), and the subgraph of ConceptNet corresponding to the question tokens is denoted by G_CN^token = (V_cn^token, E_cn^token). The CN-full (CF) graph is depicted in Figure 4a and defined as G_CF = G_CN^token. The AMR-CN-Full (ACF) graph is an integrated graph in which all nodes of the AMR graph are connected to the ConceptNet graph. Additionally, we limited the ConceptNet (CN) expansion in the experiments to two methods. One method used the ConceptNet graph corresponding to all question tokens separated by spaces, as shown in Figure 4b. The graph path learning module could not use this CN graph for reasoning owing to the initial disconnection of the question tokens and the disconnection between the concept nodes. Therefore, we connected all tokens of the question to the root node to ensure that our model performed commonsense reasoning effectively. The ACF graph is identical to the graph before pruning when creating the ACP-ARG and can be expressed as follows:

G_ACF = G_AMR ∪ G_CN^amr

where the AMR graph is denoted as G_AMR = (V_amr, E_amr) and the subgraph of ConceptNet matched with the AMR concepts is denoted by G_CN^amr = (V_cn^amr, E_cn^amr).
We also evaluated ACP-ARG-mini, which is identical to ACP-ARG except for the types of argument nodes that are pruned. For ACP-ARG, we pruned from the ACF graph the ARG0, ARG1, ARG2, ARG3, and ARG4 nodes that lacked ConceptNet edges. Unlike ACP-ARG, ACP-ARG-mini prunes only the nodes lacking ARG0 and ARG1 edges, which account for more than 50% of all argument relations, as shown in Table 1. For ACP-nonARG, we expanded the graph based on the nodes unrelated to the arguments. The results of the experiments are shown in Table 2. ACP-ARG scored the highest in both the new development and test sets. The performance of the model based on the ACP-ARG graph suggests that using all the information related to the question is not always correct; it is efficient and effective to use the specific knowledge that the question requires, and the arguments of an AMR graph can provide significant evidence for retrieving the knowledge graph. The ACP-ARG-mini results show that the amount of knowledge should be considered even when using arguments from the AMR graph. The model using the ACP-nonARG graph demonstrated an inability to perform the reasoning process.

Adversarial Attack Test Using SRL
To analyze whether pre-trained language models precisely comprehend the question, we used semantic role labeling (SRL) (https://docs.allennlp.org/models/main/models/structured_prediction/predictors/srl/ (accessed on 1 April 2021)). SRL [39] labels the predicate and its arguments in a sentence. This study conducted an adversarial attack test on pre-trained language models based on SRL data. For the analysis, we replaced one candidate option, excluding the correct answer, with a randomly selected argument related to the predicate of the question in Figure 1a. The experiment assessed whether the model predicted the correct answer using commonsense knowledge or relied on the argument text of the question. We selected BERT as a representative of the pre-trained language models and tested it on CSQA. We fine-tuned the BERT model on the original training dataset for QA tasks and obtained inference results on the original and SRL-corrupted development datasets. Table 3 shows the results for each development dataset. They indicate that the performance of BERT decreases when one option other than the correct answer is substituted with an argument of the question. This decrease suggests that BERT merely relies on superficial cues from the question. Our proposed model alleviated this drop, decreasing by only 1.06% compared with the 5.78% decrease of the fine-tuned model.

Comparison on Different Language Models
Pre-trained language models based on transformer encoders have been studied since the appearance of BERT [27]. ELECTRA [31] is trained with replaced token detection using a discriminator. We experimented to determine whether our ACP-ARG graph is effective for diverse pre-trained language models. Additionally, we demonstrated that the proposed graph outperformed the graph representation method of the previous study [34]. Unlike our method, which used the model of Cai and Lam [32], Lim et al. [34] used the model of Guo et al. [16] to generate the AMR graph. Table 4 shows the comparative results based on different types of language models. The input of the language model is "[CLS]+Question+[SEP]+candidate answer". All language models that used our method outperformed both their own fine-tuned scores and the other graph reasoning scores, achieving 53.95% with BERT-based and 72.68% with ELECTRA-based models on the average score of our new test and development sets. The results suggest that the concept representations of the ACP-ARG graph positively affect CSQA and generalize to any language encoder.

To demonstrate the generalization ability on another reasoning task, we also experimented on OpenBookQA (OBQA). The OBQA dataset consists of multiple-choice questions with four candidate answers each and requires elementary-level science-fact-based reasoning. As shown in Table 6, the models with the AMR graph or ConceptNet scored 56.00% and 60.00%, respectively, whereas BERT fine-tuning scored only 47.20% and 56.40%. Additionally, the ELECTRA-based models outperformed their fine-tuning counterparts, scoring 64.20% and 82.40% on the official test set.

Strengths and Limitations
We analyzed the necessity of semantic representation for pre-trained language models using adversarial attacks and semantic role labeling. However, for the experiments in Table 3, we built our data automatically with an external model. Because the created data are not annotated by humans, they may contain errors. Additionally, our study suggested a more effective graph representation than the previous study [34], but some problems remain. One is the error propagation from the AMR construction. The other is the static graph expansion, which might lead the model to learn the same knowledge for different semantic meanings of the same words.

Conclusions and Future Works
This paper proposes a graph representation strategy utilizing AMR and ConceptNet for commonsense reasoning tasks. The expansion methods for the AMR graph and ConceptNet involved selecting the necessary concepts based on the AMR argument relations, because an AMR consists of concepts connected by specific logical rules. As a result, extending all nodes connected by argument relations showed the highest performance. However, the proposed method statically expands the same graph information for ambiguous words. Therefore, we plan to develop end-to-end commonsense inference models, such as AMR construction and dynamic AMR expansion methods that can choose different knowledge for the same word depending on the context of the question.

Conflicts of Interest:
The authors declare no conflict of interest.