ClueReader: Heterogeneous Graph Attention Network for Multi-hop Machine Reading Comprehension

Multi-hop machine reading comprehension is a challenging task in natural language processing because it requires reasoning across multiple documents. Spectral models based on graph convolutional networks have shown good inference abilities and lead to competitive results. However, the analysis and reasoning of some of these models are inconsistent with those of humans. Inspired by the concept of grandmother cells in cognitive neuroscience, we propose a heterogeneous graph attention network model named ClueReader to imitate the grandmother cell concept. The model is designed to assemble the semantic features in multi-level representations and automatically concentrate or alleviate information for reasoning through the attention mechanism. The name ClueReader is a metaphor for the pattern of the model: it regards the subjects of queries as the starting points of clues, takes the reasoning entities as bridge points, considers the latent candidate entities as grandmother cells, and the clues end up in candidate entities. The proposed model enables the visualization of the reasoning graph, making it possible to analyze the importance of edges connecting entities and the selectivity in the mention and candidate nodes, which is easier to comprehend empirically. Evaluations on the open-domain multi-hop reading dataset WikiHop and the drug-drug interaction dataset MedHop demonstrate the validity of ClueReader and show the feasibility of applying the model in the molecular biology domain.


I. INTRODUCTION
Machine reading comprehension (MRC) is one of the most attractive and long-standing tasks in natural language processing (NLP). Compared with single-paragraph MRC, multi-hop MRC is more challenging since multiple confusing answer candidates are contained in different passages [1], [2]. Models designed for this task are supposed to be able to reasonably traverse multiple passages and discover reasoning clues following given questions. For complex multi-hop MRC tasks, more understandable, reliable, and analyzable methodologies are required to improve reading performance.
A better understanding of biological brains could play a vital role in building artificial intelligent systems [3]. Previous cognitive research in reading can benefit challenging multi-hop MRC tasks. The concept of grandmother cells can be traced back to a 1969 academic lecture given by the neuroscientist Jerome Lettvin [4], and was later defined by the physiologist Horace Barlow as cells in the brain that respond specifically to a single familiar person or object. In experiments on primates, researchers discovered individual neurons that responded specifically to a particular person, image, or concept after differentiation [5]. A study of a patient with epilepsy found a neuron in the patient's anterior temporal lobe that responded specifically to the Hollywood star Jennifer Aniston [6]. Any form of stimulation related to Aniston, whether a color photograph, a close-up of her face, a cartoon portrait, or even just her name written on paper, would stimulate that neuron, and only that neuron, to produce an excited signal. As research into the concept of grandmother cells progressed, the underlying mechanism of their response became clearer. The signal output from a single grandmother cell in response to specific stimuli actually stems from the coordinated calculation of a large-scale neural network behind grandmother cells [5]. This suggests that a single neuron can respond to only one out of thousands of stimuli, which is intuitively similar to reading and inference in multi-hop MRC:
• Selectivity. The grandmother cells concept organizes the neurons in a hierarchical "sparse" coding scheme. It activates some specific neurons to respond to stimulation, similar to the manner in which we store reasoning evidence maps (neurons) in our minds during reading and recall related evidence maps to reason the answer under the constraint of a question (stimulation).
• Specificity. The concept implies that brains contain grandmother neurons that are so specialized that they are dedicated to a specific object, which is similar to a particular MRC question resulting in a specific answer among multiple reading passages and their complex reasoning evidence.
• Class character. Amazing selectivity is captured in grandmother cells. However, it results from computation by much larger networks and the collective operations of many functionally different low-level cells, similar to human multi-hop reading in which evidence is usually gathered from as many different levels as possible and the final answer is decided in some candidate endpoints.
To imitate grandmother cells in multi-hop MRC, the reading evidence is supposed to be organized as level-classified neurons, and the selections must be performed in response to specific question stimulation. In multi-hop MRC tasks, the hops between two entities can be connected as node pairs and gradually constructed into a reasoning evidence graph that takes all related entities as nodes. This reasoning evidence graph is intuitively represented as a graph structure, which can be empirically considered to contain the implicit reasoning chains from the start of the question to the answer nodes (entities) at the end. We generally recall considerable related evidence as a node, whatever form it takes (such as a paragraph, a short sentence, or a phrase), to meet the class character, and we coordinate their inter-relationships before obtaining the results.
Graph neural networks (GNNs) inspire us to posit that operating on graphs and manipulating structured knowledge can support relational reasoning [7], [8] in a sophisticated and flexible pattern, similar to the implementation of grandmother cells, regarding the cells as nodes in the graph and collecting evidence in multi-classified aspects of node representations. Further, spatial graph attention networks (GATs) perform the selectivity in the reasoning evidence graph in the manner of grandmother cells using attention mechanisms. This work has the following main contributions:
1) To construct a more reasonable graph, ClueReader draws inspiration from the concept of grandmother cells in the brain during information cognition, in which cells in the brain only output specific entities. This leads to the creation of heterogeneous graph attention networks with multiple types of nodes.
2) By taking the subject of queries as the starting point, potential reasoning entities in multiple documents as bridge points, and mention entities consistent with candidate answers as end points, the proposed ClueReader is a heuristic way of constructing MRC chains.
3) Before outputting predicted answers, ClueReader visualizes the internal state of the heterogeneous graph attention network, providing intuitive quantitative data displays for analyzing the effectiveness, rationality, and explainability of the model.
The remainder of the article is organized as follows. Section II describes the work related to multi-hop MRC, and Section III proposes the ClueReader that imitates grandmother cells for multi-hop MRC. Experimental evaluations are conducted in Section IV, and conclusions are summarized in Section V.

II. RELATED WORK
A. Sequential Reading Models for Multi-hop MRC
Sequential reading models were first used for single-passage MRC tasks, and most of them are based on recurrent neural networks (RNNs) or their variants. When the attention mechanism was introduced into NLP tasks, their performance significantly improved [9], [10], [11], [12]. In the initial benchmarks of QANGAROO [13], a dataset for multi-hop MRC, the milestone model Bi-Directional Attention Flow (BiDAF) [9] was the first to be evaluated on the multi-hop MRC task. It represented the context at different levels and used a bi-directional attention flow mechanism to obtain a query-aware context representation, which was then used for predictions.
Some studies [14], [15], [16], [17] argued that independent attention mechanisms, i.e., Bidirectional Encoder Representations from Transformers (BERT) [14]-style models, applied on sequential contexts can outperform former RNN-based approaches in various NLP downstream tasks including MRC. When the sequential approaches were applied to multi-hop MRC tasks, however, they suffered from super-long contexts: to fit the sequential input requirement, multiple passages are concatenated into one passage, which dramatically increases computation and time consumption. A long-sequence architecture, Longformer [17], overcomes the self-attention restriction, allows the length of sequences to be increased from 512 to 4,096, and then concatenates all the passages into a long sequential context for reading. Longformer modified the question answering (QA) methodology proposed in BERT [14]: the long sequential context consisted of a question, candidates, and passages, which were separated by special tags that were passed to linear layers to output the predictions, while memory remained sufficient only for sequences of up to 4,096 tokens.
Although the approaches above are effective, the authors of [18] indicate that model reasoning is not robust enough. We consider that there are still two main challenges that should be further addressed: (1) With the expansion of the problem scale and the reasoning complexity, the token-limit problem may eventually appear again. For instance, the full-wiki setting of HOTPOTQA, a dataset for diverse and explainable multi-hop question answering, requires models to predict answers from the scope of the entire WIKIPEDIA. It is difficult to imagine how such a huge search space could be handled by concatenating a large amount of text. (2) Some models simply concatenate texts into long contexts that lack logical relationships, which makes their reasoning unconvincing. Thus, approaches based on GNNs were proposed to improve the scalability and explainability in multi-hop MRC.

B. Graph Neural Networks for Multi-hop MRC
Reasoning about explicitly structured data, in particular graphs, has arisen at the intersection of deep learning and structured approaches [7]. As the representative graph methodology, Graph Convolutional Networks (GCNs) [19], [20] are widely applied in multi-hop MRC approaches. Cognitive Graph QA (CogQA) [21] was founded on the dual process theory [22], [23], and it divides the multi-hop reading process into two stages: the implicit extraction (System I) based on BERT and the explicit reasoning (System II) established in GCNs. System I extracts the answer candidates and useful next-hop entities from passages for the cognitive graph construction, then System II updates entity representations and predicts the final answer through GCN message passing. In this procedure, the selected passages are not put into the system at once. As a result, CogQA keeps its scalability in the face of the massive scope of reading materials.

Fig. 1. Our proposed ClueReader: a heterogeneous graph attention network for multi-hop MRC. The detailed explanations of S, C, and q are in the task formalization (Section III-A). S, C, and q are encoded by three independent Bi-LSTMs (Section III-B). Following the graph construction strategies in Section III-C, the outputs of the three encoders are passed to co-attention and self-attention to initialize the reasoning graph features, which is explained in Section III-D. Then the topology information and node features are passed into the GAT layer. The much larger network computation behind grandmother cells is performed in the GAT layer, and n-hop message passing is calculated in n parameter-shared layers, which are presented in Section III-D2. Finally, the grandmother cells' selectivity is combined in Section III-E, outputting the final predicted answer.
Entity-GCN [24] extracts all the text spans matching the candidates as nodes and obtains their representations from the contextualized ELMo [25] word embeddings, then passes them to the GCN module for reasoning. Based on Entity-GCN, the Bi-directional Attention Entity Graph Convolutional Network (BAG) [26] added GloVe word embeddings and two manual features, named-entity recognition and part-of-speech tags, to reflect the semantic properties of tokens. To make full use of the question's contextual information, it applies the bi-directional attention mechanism, both node2query and query2node, to obtain query-aware node representations in the reasoning graph for better predictions. Path-based GCN [27] introduces more related entities into the graph than merely the nodes matching the candidates to enhance the performance of the model. The Heterogeneous Document-Entity (HDE) model [28] introduces heterogeneous nodes into GCNs, which contain different granularity levels of information. Additionally, the Keywords-Aware Dynamic Graph Neural Network (KA-DGN) [29] was proposed and designed as a dynamic graph neural network to further tackle reading over multiple scattered text snippets. Furthermore, Zhang et al. [30] and Song et al. [31] separately proposed knowledge-aware and evidence-aware GNN reading models, which integrate dependency relations or multiple pieces of evidence from multiple paragraphs.
However, the reading process of the above-mentioned approaches is still inexplicable, especially in GNNs, which stimulated our interest in the selectivity of this procedure.

III. METHODOLOGY
We introduce the design and implementation of the proposed model, ClueReader, which is shown in Figure 1.

A. Task Formalization
A given query $q = (s, r, a^*)$ is in triple form, where $s$ is the subject entity, $r$ is the query relation (i.e., the predicate), and $a^*$ is the answer to be identified. The query can also be converted into the sequential form $q = \{q_1, q_2, ..., q_m\}$, where $m$ is the number of tokens in the query $q$. A set of candidates $C_q = \{c_1, c_2, ..., c_z\}$ and a series of supporting documents $S_q = \{s_1, s_2, ..., s_n\}$ containing the candidates are also provided, where $z$ is the number of given candidates, $n$ is the number of given supporting documents, and the subscript $q$ means that the two sets are constrained by the query $q$. Moreover, $S_q$ is provided in a random order, and without $S_q$, the query $q$ may have multiple valid answers. Our goal is to identify the single correct answer $a^* \in C_q$ by reading $S_q$.
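The formalization above can be pictured as a small data structure. The following is a minimal sketch rather than the authors' code: the `Sample` class, its field names, and `split_query` are hypothetical, and the assumption that the relation is the first whitespace-separated token of the query string is ours.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One multi-hop MRC instance (hypothetical container, not from the paper)."""
    query: str             # sequential form of q, assumed to be "relation subject ..."
    candidates: List[str]  # C_q
    supports: List[str]    # S_q, given in arbitrary order
    answer: str            # a*, expected to be one of the candidates


def split_query(query: str) -> Tuple[str, str]:
    """Split the query into (r, s), assuming r is the first token."""
    relation, _, subject = query.partition(" ")
    return relation, subject.strip()


sample = Sample(
    query="country united nations headquarters",
    candidates=["china", "france", "uk", "usa", "russia"],
    supports=[
        "The United Nations Headquarters is located in Manhattan.",
        "Manhattan is a borough of New York City, USA.",
    ],
    answer="usa",
)
relation, subject = split_query(sample.query)
```

Here `relation` plays the role of $r$ and `subject` the role of $s$; the answer $a^*$ is only available at training time.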

B. Encoding Layer
We utilize the pre-trained GloVe [32] model to initialize word embeddings, and then employ Bidirectional Long Short-Term Memory (Bi-LSTM) [33], [34] to encode sequence representations, where the subscripts $t$ and $t-1$ denote the indexes of the encoding time step; $W_i$ and $W_h$ are the weight matrices of the input and the hidden layer; $i$, $f$, $o$, $\tilde{c}$, $h$, and $c$ respectively represent the input, forget, output, content, hidden, and cell states; $x$ represents the word embedding; and $\sigma$ and $\tanh$ are the sigmoid activation and the hyperbolic tangent activation, respectively.
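For reference, a standard LSTM cell consistent with the symbols defined above can be written as follows. This is the textbook formulation and an assumption on our part, not necessarily the exact form used by the authors; the per-gate superscripts, the bias terms $b$, and the tilde on the content state are our notation.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i^{(i)} x_t + W_h^{(i)} h_{t-1} + b^{(i)}\right) \\
f_t &= \sigma\left(W_i^{(f)} x_t + W_h^{(f)} h_{t-1} + b^{(f)}\right) \\
o_t &= \sigma\left(W_i^{(o)} x_t + W_h^{(o)} h_{t-1} + b^{(o)}\right) \\
\tilde{c}_t &= \tanh\left(W_i^{(c)} x_t + W_h^{(c)} h_{t-1} + b^{(c)}\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```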
We use $\overrightarrow{h}$ and $\overleftarrow{h}$ to denote the forward-pass (i.e., left-to-right) and backward-pass (i.e., right-to-left) sequence representations encoded by the Bi-LSTM, respectively. The representation of the entire sequential context obtained from the encoding layer can then be expressed as $h = \overrightarrow{h} \,\|\, \overleftarrow{h}$, where the symbol $\|$ denotes the concatenation of $\overrightarrow{h}$ and $\overleftarrow{h}$. To encode the sequence representations of the support documents $S$, candidates $C$, and query $q$, we use three independent Bi-LSTMs. Their outputs are $H_s^i \in \mathbb{R}^{l_s^i \times d}$, $H_c^j \in \mathbb{R}^{l_c^j \times d}$, and $H_q \in \mathbb{R}^{l_q \times d}$, respectively, where $i$ and $j$ are the indexes of the documents and the candidates, $l$ is the sequence length, and $d$ is the output dimension of the representations.

C. Heterogeneous Reasoning Graph
The concept of grandmother cells reveals that the brains of monkeys, like those of humans, contain neurons that are so specialized that they appear to be dedicated to a single person, image, or concept. This amazing selectivity is uncovered in a single neuron, while it must result from computation by a much larger network [5]. We heuristically consider that this procedure in multi-hop reading can be summarized in three steps:
1) The query (or the question) locates the related neurons at a low level, which then stimulates higher-level neurons to trigger computation;
2) The higher-level neurons begin to respond to increasingly broader portions of other neurons for reasoning, and to avoid a broadcast storm, informative selectivity takes place in this step;
3) At the top level, some independent neurons are responsible for the computations that occurred in step 2. We refer to these neurons as grandmother cells and expect them to provide the appropriate results that correspond to the query.
We attempt to imitate grandmother cells in our reading procedure and make our reasoning graph as consistent as possible with the three steps mentioned above. The heterogeneous reasoning graph $G = \{V, E\}$, which is illustrated in Figure 2, simulates a heuristic chain of comprehension that starts from the subject entity in the query $q$, goes through the reasoning entities in the supporting document set $S_q$, then through the mention entities in $S_q$ that are consistent with the candidate answers, and finally ends at the candidates in the set $C_q$ (referred to as the grandmother cells).
1) Nodes Definition: To construct the graph, we define five different types of nodes, which are similar to neurons, and ten kinds of edges among the nodes [15], [24].
• Subject Nodes - Following the form of the query $q$, the subject entity $s$ is given in $q = (s, r, a^*)$. For example, the subject entity of the query sequence context "Where is the basketball team that Mike DiNunno plays for based?" is Mike DiNunno. We extract all the named entities that match $s$ from the documents and regard them as the subject nodes, which open up the reading clues and trigger further computations. The subject nodes are denoted as $V_{sub}$ and colored in gray in Figure 2.
• Reasoning Nodes - In multi-hop MRC, there are gaps between the subject entities and the candidates. To build bridges between the two and make the reasoning clues as complete as possible, we replenish those clues with the named entities and nominal phrases from the documents containing the question subjects and answer candidates. The reasoning nodes are marked as $V_{rea}$ and colored in orange in Figure 2.
• Mention Nodes - A series of candidate entities is given in $C_q$, and they may occur multiple times within the document set $S_q$. We therefore traverse the documents and extract the named entities corresponding to each candidate as mention nodes, serving as the soft endpoints of the reasoning chain. It should be noted that mention nodes participate in the semi-supervised learning process and are involved in the final answer prediction. The mention nodes are denoted as $V_{men}$ and colored in green in Figure 2.
• Support Nodes - As described by [5], we consider that multi-type representations may contribute to the reading process; thus the support documents containing the above nodes are introduced into $G$ as support nodes, which are denoted as $V_{sup}$ and colored in red in Figure 2.
• Candidate Nodes - To imitate grandmother cells, we consider candidate nodes as hard endpoints of the reasoning chain that gather relevant information from the heterogeneous reasoning graph. For the mention nodes $V^q_{men}$ of a candidate answer $c_q$, when $|V^q_{men}| \geq 1$, candidate nodes are established as grandmother cells to provide the final prediction. The candidate nodes are denoted as $V_{can}$ and colored in blue in Figure 2.
2) Edges Definition: To learn the entity relationships between different nodes, we define 10 kinds of edges between nodes in heterogeneous reasoning graphs, inspired by the literature [35], [24], [26], as shown in Table I.
3) Graph Construction: In the heterogeneous reasoning graph, the clue-reading chain can be represented by $V_{sub} \leftrightarrow V_{rea} \leftrightarrow V_{men} \leftrightarrow V_{can}$, whose edges are covered by $E_{sub2rea}$, $E_{rea2rea}$, $E_{rea2men}$, and $E_{can2men}$. $E_{edgesout}$ and $E_{rea2rea}$ give the model the ability to transfer information across documents, and the edges in $E_{sup2sub}$, $E_{sup2can}$, and $E_{sup2men}$ are responsible for supplementing the multi-angle textual information from the documents. Furthermore, $E_{can2men}$ gathers all the information of the mention nodes corresponding to the candidates and then passes their representations to the output layer to realize the imitation of grandmother cells.
Specifically, this multi-hop MRC process of clue-based reasoning starts with the subject node, connects reasoning nodes from support documents, then connects the mention nodes as soft endpoints of the clue chain, and finally connects to the candidate nodes (grandmother cells) as hard endpoints of the clue chain. For example, for the question "Which country is the location of the United Nations Headquarters?", the answer candidate set includes China, France, UK, USA, and Russia. One correct and reasonable clue chain can be represented as Location of United Nations Headquarters (subject node) ↔ Manhattan ↔ New York City ↔ New York State ↔ USA (mention node) ↔ USA (candidate node). In practice, multiple clue chains are included within the heterogeneous reasoning graph, and under the constraints of the query, the selection of soft and hard endpoints is required to output the final prediction.
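To make the edge rules concrete, the sketch below builds three of the edge sets from Table I ($E_{edgesin}$, $E_{edgesout}$, and $E_{can2men}$) for a toy set of mention nodes. It is an illustrative assumption, not the authors' implementation: the `Mention` tuple layout and the function name `build_mention_edges` are hypothetical, and entity matching is reduced to exact string equality.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Hypothetical mention record: (node_id, document_index, entity_string).
Mention = Tuple[int, int, str]


def build_mention_edges(
    mentions: List[Mention], candidates: List[str]
) -> Dict[str, Set[Tuple[int, int]]]:
    """Build undirected edge sets among mention nodes and to candidate nodes."""
    edges: Dict[str, Set[Tuple[int, int]]] = {
        "edgesin": set(), "edgesout": set(), "can2men": set(),
    }
    for (a, doc_a, ent_a), (b, doc_b, ent_b) in combinations(mentions, 2):
        if doc_a == doc_b:
            # Two mentions extracted from the same document.
            edges["edgesin"].add((a, b))
        elif ent_a == ent_b:
            # Same entity mentioned in different documents.
            edges["edgesout"].add((a, b))
    for cand_idx, cand in enumerate(candidates):
        for node_id, _, ent in mentions:
            if ent == cand:
                # Candidate node and mention node represent the same entity.
                edges["can2men"].add((cand_idx, node_id))
    return edges


mentions = [(0, 0, "usa"), (1, 0, "france"), (2, 1, "usa")]
edges = build_mention_edges(mentions, ["usa", "france", "china"])
```

In this toy case the candidate "china" has no mention, so it receives no `can2men` edge, which mirrors the condition that a candidate node is only established when at least one mention node exists.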

D. Heterogeneous Graph Attention Network for Multi-hop Reading
1) Query-aware Contextual Information: Following HDE [28], we use the co-attention and self-attention mechanisms [36] to combine the query contextual information and the documents. The same mechanisms are applied to the other semantic representations that require reasoning consistent with the query. The query-aware support documents are computed from $A^i_{qs}$, the similarity matrix between the $i$-th support document $H^i_s \in \mathbb{R}^{l^i_s \times d}$ and the query $H_q \in \mathbb{R}^{l_q \times d}$, where $d$ is the dimension of the context. The query-aware representation of the support documents, $S_{ca}$, is then computed from this matrix. To project the sequence into a fixed dimension and output the representation $N_{sup}$ of $V_{sup}$ for graph optimization, self-attention is utilized to summarize the contextual information. In addition to the query-aware support documents, co-attention and self-attention are used to generate query-aware node representations from the other sequential representations.
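For orientation, one common co-attention and self-attentive pooling formulation in the style of HDE [28] is sketched below. It should be read as an assumption about the general shape of the computation rather than the authors' exact equations; the intermediate symbols $C_q$, $C_s$, $D_s$, and $a_s$ are our own, and the Bi-LSTM output dimension is assumed to be $d$.

```latex
\begin{aligned}
A^i_{qs} &= H^i_s \,(H_q)^{\top} \in \mathbb{R}^{l^i_s \times l_q} \\
C_q &= \mathrm{softmax}\!\left((A^i_{qs})^{\top}\right) H^i_s \in \mathbb{R}^{l_q \times d} \\
C_s &= \mathrm{softmax}\!\left(A^i_{qs}\right) H_q \in \mathbb{R}^{l^i_s \times d} \\
D_s &= \mathrm{BiLSTM}\!\left(\mathrm{softmax}\!\left(A^i_{qs}\right) C_q\right) \in \mathbb{R}^{l^i_s \times d} \\
S_{ca} &= \left[\, C_s \,\|\, D_s \,\right] \in \mathbb{R}^{l^i_s \times 2d} \\
a_s &= \mathrm{softmax}\!\left(\mathrm{MLP}(S_{ca})\right) \in \mathbb{R}^{l^i_s \times 1} \\
N_{sup} &= a_s^{\top} S_{ca} \in \mathbb{R}^{1 \times 2d}
\end{aligned}
```

Under this reading, the final $1 \times 2d$ vector matches the node feature dimension used as input to the graph attention layers.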
2) Message Passing in the Heterogeneous Graph Attention Network: We present message passing in the heterogeneous graph attention network for reading over multiple relations among diverse nodes. The input of this module is a graph $G = \{V, E\}$ and node representations $N = \{n_1, n_2, ..., n_r\}$ with $n_i \in \mathbb{R}^{1 \times 2d}$, where $r$ is the number of nodes. Initially, a shared weight matrix $W_n$ is applied to $N$, and the attention coefficients and the normalized attention coefficients are then computed, where $e_{ij}$ are the attention coefficients indicating the importance of the features of the node $n_j$ to the node $n_i$, and $\alpha_{ij}$ is normalized across all structural neighbors $\mathcal{N}_i$ of the node $n_i$. The attention mechanism is responsible for selectivity with node interdependence, which enables us to show how the nodes take effect during the reasoning. Considering the 10 different types of edges defined in Section III-C2, we model the relational edges based on the vanilla GAT [37], where $n^l_i \in \mathbb{R}^{1 \times 2d}$ is the hidden state of the node $n_i$ in the $l$-th layer, all the GAT layers share parameters, $k$ indexes the $k$-th attention head following [15], [37], $R$ is the set of all types of edges in $E$, and $\alpha^{k,l}_{rij}$ are the normalized attention coefficients computed by the $k$-th attention mechanism with relation $r$, as presented in [37].
Message passing is a key component of our model. To echo the selectivity of grandmother cells, we use the attention mechanism to select (i.e., activate or deactivate) key node pairs in our reasoning graph, and we empirically regard this process as the reading reasoning in the graph.
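The selectivity described above can be illustrated with the attention coefficients of a vanilla GAT layer [37]: $e_{ij} = \mathrm{LeakyReLU}(a^{\top}[W n_i \,\|\, W n_j])$, normalized by a softmax over the neighbors of node $i$. The following is a minimal single-head, single-relation sketch in plain Python; it assumes the features have already been multiplied by the shared weight matrix, and the function names are hypothetical. The relation-specific, multi-head version used in the model is not reproduced here.

```python
import math
from typing import Dict, List


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def attention_coefficients(
    feats: List[List[float]],        # node features, assumed already transformed by W
    neighbors: Dict[int, List[int]],  # adjacency lists N_i
    a: List[float],                   # attention vector of length 2 * feature_dim
) -> Dict[int, Dict[int, float]]:
    """Return alpha[i][j], the softmax-normalized attention of node i over j in N_i."""
    alpha: Dict[int, Dict[int, float]] = {}
    for i, nbrs in neighbors.items():
        if not nbrs:
            alpha[i] = {}
            continue
        # e_ij = LeakyReLU(a . [h_i || h_j]); list addition concatenates features.
        scores = [
            leaky_relu(sum(w * x for w, x in zip(a, feats[i] + feats[j])))
            for j in nbrs
        ]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(nbrs, exps)}
    return alpha


def aggregate(feats, neighbors, alpha):
    """New state of each node: attention-weighted sum of its neighbors' features."""
    dim = len(feats[0])
    return {
        i: [sum(alpha[i][j] * feats[j][d] for j in nbrs) for d in range(dim)]
        for i, nbrs in neighbors.items()
    }


feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1, 2]}
alpha = attention_coefficients(feats, neighbors, a=[0.0, 0.0, 1.0, 0.0])
updated = aggregate(feats, neighbors, alpha)
```

Large values of `alpha[i][j]` correspond to node pairs that are "activated" during reasoning, which is the quantity visualized on the edges of the reasoning graph.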

TABLE I
THE DEFINITION OF EDGES IN THE HETEROGENEOUS GRAPH ATTENTION NETWORK ClueReader.

Edges | Definition
$E_{sup2sub}$ | If the support document $s_i$ contains the $j$-th subject node $v^j_{sub}$, an undirected edge denoted as $e^{ij}_{sup2sub}$ is established to connect the support node $v^i_{sup}$ of $s_i$ and the subject node $v^j_{sub}$.
$E_{sup2can}$ | If the support document $s_i$ contains the $j$-th candidate node $v^j_{can}$, an undirected edge denoted as $e^{ij}_{sup2can}$ is established to connect the support node $v^i_{sup}$ of $s_i$ and the candidate node $v^j_{can}$.
$E_{sup2men}$ | If the support document $s_i$ contains the $j$-th mention node $v^j_{men}$, an undirected edge denoted as $e^{ij}_{sup2men}$ is established to connect the support node $v^i_{sup}$ of $s_i$ and the mention node $v^j_{men}$.
$E_{can2men}$ | If the $j$-th mention node $v^j_{men}$ and the $i$-th candidate node $v^i_{can}$ represent the same entity, an undirected edge denoted as $e^{ij}_{can2men}$ is established to connect the two nodes.
$E_{sub2rea}$ | If the $i$-th subject node $v^i_{sub}$ and the $j$-th reasoning node $v^j_{rea}$ are extracted from the same document, an undirected edge denoted as $e^{ij}_{sub2rea}$ is established to connect the two nodes.
$E_{rea2men}$ | If the $i$-th reasoning node $v^i_{rea}$ and the $j$-th mention node $v^j_{men}$ are extracted from the same document, an undirected edge denoted as $e^{ij}_{rea2men}$ is established to connect the two nodes.
$E_{can2can}$ | All the mention nodes are fully connected using undirected edges $e^{ij}_{can2can}$.
$E_{edgesin}$ | If two mention nodes $v^i_{men}$ and $v^j_{men}$ are extracted from the same document, the two nodes are connected by $e^{ij}_{edgesin}$.
$E_{edgesout}$ | If two mention nodes $v^i_{men}$ and $v^j_{men}$ are extracted from different documents and represent the same entity, the two nodes are connected by $e^{ij}_{edgesout}$.
$E_{rea2rea}$ | If two reasoning nodes $v^i_{rea}$ and $v^j_{rea}$ are extracted from the same document or represent the same entity, the two nodes are connected by $e^{ij}_{rea2rea}$.

3) Gating Mechanism: A previous study [19] showed that GNNs suffer from the over-smoothing problem when many layers are stacked. We therefore address this issue by applying question-aware [27] and general gating mechanisms [38] to optimize the procedure.
In the question-aware gate, $H_q$ is the query representation given by a dedicated Bi-LSTM encoder to keep consistency with the dimension of the node features $N$, $j$ indicates the order of the query words, $m$ is the query length, $\sigma$ is a sigmoid function, and $\odot$ indicates element-wise multiplication. The general gating mechanism is then applied to the updated node representations.
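As a point of reference, a general gate of the kind used in Entity-GCN-style models [24] mixes the aggregated message with the previous node state. The form below is an assumption offered for illustration and is not confirmed to be the authors' exact equation; $u_i$ (the aggregated message for node $i$), $g_i$ (the gate), and $f_g$ (a learned linear map) are hypothetical symbols.

```latex
\begin{aligned}
g_i &= \sigma\!\left(f_g\!\left(\left[\, u_i \,\|\, n^l_i \,\right]\right)\right) \\
n^{l+1}_i &= g_i \odot \tanh(u_i) + (1 - g_i) \odot n^l_i
\end{aligned}
```

A gate close to zero keeps the previous state $n^l_i$ almost unchanged, which is how this kind of mechanism counteracts over-smoothing across stacked layers.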

E. Output Layer
After updating the node representations, we use two multilayer perceptrons, MLP$_{can}$ and MLP$_{men}$, to transform the node features into prediction scores. All the candidate nodes (grandmother cells) $N_{can}$ and mention nodes $N_{men}$ from $G$ are employed to output the prediction score distribution $a$, where $\max(\cdot)$ takes the maximum mention node score over MLP$_{men}$, and the two parts are then summed, balanced by a harmonic factor $\gamma$, to give the final prediction score distribution.
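The combination of the two score sources can be sketched as follows. This is a minimal illustration under stated assumptions: the scalar scores stand in for the outputs of MLP$_{can}$ and MLP$_{men}$, the function name `predict` is hypothetical, and applying $\gamma$ to the candidate (grandmother cell) term is our reading of the description of $\gamma$ as the weight of grandmother cells, not a confirmed detail.

```python
from typing import Dict, List


def predict(
    candidate_scores: Dict[str, float],      # stand-in for MLP_can output per candidate
    mention_scores: Dict[str, List[float]],  # stand-in for MLP_men outputs per candidate
    gamma: float = 1.0,                      # assumed weight on the candidate-node term
) -> str:
    """Return the candidate with the highest combined score."""
    final = {}
    for cand, can_score in candidate_scores.items():
        # max(.) keeps only the strongest mention node of this candidate.
        best_mention = max(mention_scores[cand])
        final[cand] = gamma * can_score + best_mention
    return max(final, key=final.get)


candidate_scores = {"usa": 0.2, "france": 0.9}
mention_scores = {"usa": [0.1, 1.5], "france": [0.3]}
answer_balanced = predict(candidate_scores, mention_scores, gamma=1.0)
answer_heavy = predict(candidate_scores, mention_scores, gamma=3.0)
```

With these toy numbers, raising `gamma` lets the candidate-node score dominate and changes the prediction, which illustrates why the balance between the two terms is treated as a hyperparameter.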

IV. EXPERIMENTS
We present the performance of our model on the QANGAROO [13] dataset and evaluate the performance in detail. Then, the ablation study and the visualization demonstrate the benefit of the model. Finally, a case study shows the relationship between the answers output by the model and human reading results.

A. Dataset for Experiments
QANGAROO is a multi-hop MRC dataset containing two independent datasets, WIKIHOP and MEDHOP, from the open-domain field and the molecular biology field, respectively. Both WIKIHOP and MEDHOP are divided into three subsets: the training set, the development set, and the undisclosed test set, which is used for official evaluation. The dataset sizes are shown in Table II.
WIKIHOP was created from WIKIPEDIA (as the document corpus) and WIKIDATA (as structured knowledge triples). To validate whether the dataset is consistent with the formalization of multi-hop MRC, the dataset creators asked human annotators to evaluate the samples in the WIKIHOP development and test sets. For each sample in the two sets, at least three annotators participated in the evaluation, and they were required to answer three questions:
• whether they knew the fact before
• whether the fact follows from the texts (with options follows, likely, and not follows)
• whether multiple documents are required to answer the question
All the samples in the test set were human-selected and were labeled by the majority of annotators with follows and multiple documents required. For the development set, annotators merely annotated the samples, without any selection.
The MEDHOP dataset was constructed using DRUGBANK as the structured knowledge source. The creators then extracted research paper abstracts from MEDLINE (the online medical literature search and analysis system and bibliographic database of the National Library of Medicine of the USA) as a corpus, and the aim is to predict drug-drug interactions (DDIs) after reading the texts. The purpose of applying multi-hop methods in this prediction is to find and combine individual observations that can suggest previously unobserved DDIs by inferring and reasoning over prior public knowledge in the texts rather than through costly experiments. The only query type is interacts_with. A sample given in [13] is illustrated in Figure 3(b); note that, in practice, medical proper nouns are replaced by accession numbers (e.g., DB00007, DB06825, DB00316) rather than given as the names of drugs and human proteins (e.g., Leuprolide, Triptorelin, Acetaminophen).

B. Experiments Settings
We used the NLTK [39] toolkit to tokenize the support documents and candidates, then split the query $q = \{s, r, a^*\}$ into the relation $r$ and the subject entity $s$. All the named entities matching the candidates $C_q$ were extracted as mention nodes $V_{men}$, and SPACY was used to extract the named entities and noun phrases from the texts as reasoning nodes $V_{rea}$. We concatenated GloVe [32] and n-gram character embeddings [40] to obtain 400-dimensional word embeddings, which were input to the encoder layer. Out-of-vocabulary words were represented with random vectors. The word embeddings were fixed in the WIKIHOP experiment and trainable on MEDHOP. We implemented the ClueReader model with PyTorch and PyTorch Geometric [41]. NetworkX [42] was utilized to visualize the reading graph, the weights of node pairs, and the node selections.

C. Results and Analyses
In Table III we present the performance of ClueReader on the development and test sets of WIKIHOP and MEDHOP and compare it with the performance of published models mainly based on GNNs. Our model improved on the GCN-based models: over HDE [28] on the test set, from 70.9% to 72.0%, and over Path-based GCN on the development set, from 64.5% to 66.9%. Path-based GCN using GloVe and ELMo word embeddings surpassed our model by 0.5% on the test set, which confirms that the initial representations of nodes are extremely critical [27]. However, limited by the architecture and computing resources, we did not use powerful contextual word embeddings like ELMo and BERT in our model, which can be addressed in future work. Compared to the other GNN-based models [24], [26], [31] and the sequential models [13], [43], our model achieved higher accuracy. We are the first to apply a GNN-based model to MEDHOP. Although the accuracy was 1.8% lower than that of BiDAF, we believe that a possible reason was the failure of the SPACY toolkit to extract the reasoning nodes, which means the bridge entities were incomplete.
To analyze the scalability of our model, we divided the development set into six groups according to the number of support documents and then determined the accuracy in each group. The grouped accuracy on WIKIHOP is shown in Figure 4. ClueReader achieved competitive results: 73.59% and 63.57% in the groups (1-10) and (11-20), with a total of 4,039 samples accounting for 95% of the development set. The lowest accuracy of 55.74% was for the group (41-50). However, it increased to 62.5% in the group (51-62), which shows that the scalability of our model is effective. The grouped accuracy on MEDHOP is shown in Figure 5, and the results are quite competitive. The highest and second-highest accuracies of 60.00% and 51.85% are in the (31-40) and (21-30) groups, respectively, and the lowest and second-lowest accuracies of 0% and 35.59% are in the (1-10) and (51-62) groups, respectively. In particular, the result in the (51-64) group on MEDHOP runs counter to that of the group (51-62) on WIKIHOP, which implies that we must concentrate on the difference between open-domain and molecular textual contexts. The results for the different numbers of support documents show the contribution of our model to the scalability of multi-hop MRC tasks.
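The grouping procedure described above can be reproduced with a few lines of code. The sketch below is illustrative only: the record format `(number_of_support_documents, is_correct)` and the function names are hypothetical, groups are keyed by their lower bound, and merging every count above 50 into the last group is an assumption based on the group labels reported in the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bucket(num_docs: int, width: int = 10, top_low: int = 51) -> int:
    """Lower bound of the group for a document count, e.g. 17 -> 11, 62 -> 51."""
    low = ((num_docs - 1) // width) * width + 1
    return min(low, top_low)  # everything above 50 falls into the last group


def grouped_accuracy(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Accuracy per group, keyed by the group's lower bound."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for num_docs, correct in records:
        key = bucket(num_docs)
        totals[key] += 1
        hits[key] += int(correct)
    return {k: hits[k] / totals[k] for k in sorted(totals)}


# Toy records, not results from the paper.
records = [(3, True), (8, False), (17, True), (62, True)]
accuracy_by_group = grouped_accuracy(records)
```

Reporting the group sizes alongside the accuracies is advisable, since small groups can produce extreme values such as 0% or 100%.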
As mentioned above, the WIKIHOP development set includes annotations on the consistency between facts and documents. To determine whether multiple documents are required to reason the

Model | WIKIHOP Dev | WIKIHOP Test | MEDHOP Dev | MEDHOP Test
Coref-GRU [43] | 56.0 | 59.3 | - | -
MHQA-GRN [31] | 62.8 | 65.4 | - | -
Entity-GCN [24] | 64.8 | 67.6 | - | -
HDE [28] | 68.1 | 70.9 | - | -
BAG [26] | 66.5 | 69.0 | - | -
Path-based GCN [27]

Further, we believe that authenticity can seriously impact the accuracy of our predictions. The categories associated with "may not follow the fact" achieved worse results: 71.4%, 71.4%, and 71.5% in the groups "likely follows the fact" (single document and multiple documents) and "not follows is not given", respectively. The same analysis is infeasible on the MEDHOP development set, since the document complexity and the number of documents per sample are significantly larger.

D. Ablation Study
We proposed five types of nodes in G. To analyze how they contribute to reasoning, we removed the edges with specific connections and isolated the corresponding nodes, then evaluated the performance on the subset of the WIKIHOP development set in which "not follows" was not annotated. Moreover, we tested the model without message passing in G. The ablated performance is shown in Table V. On WIKIHOP, the proposed heterogeneous graph attention network was the most effective component of ClueReader: without it, the accuracy decreased by 18.76%. After blocking the nodes by groups, we observed that the support nodes contributed 9.43% absolutely, the mention nodes

In Table VI, we present the model performance with different hyperparameters, namely the number of stacked GAT layers (the number of hops) and the weight of grandmother cells. The number of GAT layers controls how many parameter-sharing GAT layers are involved in the reasoning graph. On WIKIHOP, we obtained the highest accuracy (66.5%) when we stacked five layers; the models with three or four GAT layers performed worse (57.8% and 58.5%, respectively). With six GAT layers, the accuracy dropped by 2.3% compared with the best performance. Furthermore, as illustrated by the final prediction in Equation (21), γ balances the mention nodes and the candidate nodes (the grandmother cells); we present the model performance with different γ settings in Table VI. The best performance was obtained with γ set to 1. However, when we gave it too much weight, that is, γ = 1.5, the accuracy decreased by 7.4%, which is even worse than with γ set to 0 (59.7%). This convinces us that we should not ignore the effect of the much larger networks behind grandmother cells. We observed similar phenomena with different hyperparameter settings on MEDHOP. When the number of hops was 5 and γ was 1, the model performed best, at approximately 48.2%. We suspect that when too few GAT layers are stacked, the messages of nodes cannot pass sufficiently through the reasoning graph. When too many GAT layers are stacked, the graph over-smoothing problem leads to a drop in accuracy. We also empirically observed that models with a higher γ may lose semantic information from the context, resulting in reduced prediction accuracy. This also fits the concept of grandmother cells: before the final prediction is determined, a huge background network computation should be performed implicitly.
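To make the role of γ concrete, the following sketch shows how such a weight could trade off candidate-node (grandmother cell) scores against mention-node scores. The combination rule used here, γ times the candidate score plus the maximum mention score, is an assumption made for illustration and is not copied from Equation (21); the function names and toy scores are likewise hypothetical.

```python
def combine_scores(candidate_scores, mention_scores, gamma=1.0):
    """Combine candidate-node and mention-node scores for each candidate.

    Assumed form (illustrative only, not Equation (21) itself):
        score(c) = gamma * candidate_scores[c] + max(mention_scores[c])
    A candidate without any mention receives only the weighted candidate term.
    """
    combined = {}
    for cand, cand_score in candidate_scores.items():
        mentions = mention_scores.get(cand, [])
        best_mention = max(mentions) if mentions else 0.0
        combined[cand] = gamma * cand_score + best_mention
    return combined


def predict(candidate_scores, mention_scores, gamma=1.0):
    """Return the candidate with the highest combined score."""
    combined = combine_scores(candidate_scores, mention_scores, gamma)
    return max(combined, key=combined.get)


# Toy scores: candidate "A" is backed by a strong mention in the context,
# candidate "B" has a strong candidate-node score but weak mentions.
candidate_scores = {"A": 0.2, "B": 0.9}
mention_scores = {"A": [0.1, 1.0], "B": [0.1]}

answer_without_cells = predict(candidate_scores, mention_scores, gamma=0.0)
answer_heavy_cells = predict(candidate_scores, mention_scores, gamma=1.5)
```

Under this assumed rule, a large γ lets the candidate-node term dominate the contextual evidence carried by the mention nodes, which is one way to read the accuracy drop reported for γ = 1.5.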

E. Visualization
Compared with spectral GNN-based reading approaches, the heterogeneous reasoning graph of ClueReader is a non-spectral approach, which allows us to analyze how the nodes interact with each other in various relations and how the connections between nodes take effect. We visualize the predictions in our heterogeneous reasoning graph on WIKIHOP and MEDHOP in Figures 6 and 7, respectively. Different types of nodes are shown in different colors (subject nodes are gray, reasoning nodes are orange, mention nodes are green, candidate nodes are blue, and support nodes are red), and their edges, which reflect the selection of node pairs, are drawn with different line thicknesses. The thicker an edge, the more important the model learned it to be during training. Because the answer should be inferred not only from the edge weights but also from the output layer, which is projected from the node representations to $\mathbb{R}^{1 \times 2d}$, and from the accumulated scores of $N_{can}$ and $N_{men}$, we use the transparency of the nodes to reflect the outputs: the darker a node, the higher its value from the output layer. Because the output values differ considerably, some mention and candidate nodes are almost transparent. The weighted graph provides evidence during reading and during the analysis of drug-drug interactions (DDI). It passes messages in line with the concept of grandmother cells: not just one node becomes effective, but the cluster behind it acts synergistically. We also learn more about our model through visualization. For instance, the differences in node transparency on MEDHOP are significantly smaller than on WIKIHOP, which indicates that the drug features are not sufficiently learned, leading to the convergence of node features and making classification more difficult. This issue can be addressed in future work.
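The visual encoding described above (edge thickness from attention weights, node opacity from output values) can be sketched as a simple rescaling step. This is only an illustrative sketch: the min-max scaling, the value ranges, and the function names are assumptions, not the authors' plotting code.

```python
def min_max_scale(values, low, high):
    """Linearly rescale values to [low, high]; constant inputs map to high."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [high for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) / span * (high - low) for v in values]


def edge_widths(attention_weights, min_width=0.5, max_width=4.0):
    """Map edge attention weights to line widths (thicker means higher weight)."""
    return min_max_scale(attention_weights, min_width, max_width)


def node_alphas(output_values, min_alpha=0.1, max_alpha=1.0):
    """Map output-layer values to opacities (darker means higher output)."""
    return min_max_scale(output_values, min_alpha, max_alpha)


# Toy values, not taken from the trained model.
widths = edge_widths([0.05, 0.20, 0.75])
alphas = node_alphas([-3.0, 0.5, 4.0])
```

Since the figures compare darkness among nodes of the same type, a rescaling like `node_alphas` would be applied separately to the mention nodes and to the candidate nodes.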
To better understand the model predictions and to support further study, we generate HTML files of samples, as shown in Figure 8, and analyze whether the named entities contained in the max-score nodes make sense from the perspective of a human answering after reading. Please refer to our website (https://github.com/cluereader/cluereader.github.io) for more visualization samples in HTML files.

V. CONCLUSION
We present ClueReader, a heterogeneous graph attention network for multi-hop MRC, which is inspired by the concept of grandmother cells from cognitive neuroscience. The network contains several clue-reading paths that start from the subject of the question and end at candidate entities. We use reasoning and mention nodes to complete the process and use support nodes to add supplementary semantic information. We apply our methodology to QANGAROO, a multi-hop MRC dataset, and the official evaluation supports the effectiveness of our model in open-domain QA and in the molecular biology domain. Several potential issues could be further addressed, such as introducing intermediate supervision signals during the semi-supervised graph learning, making better use of external knowledge, and developing dedicated word embedding methodology for the medical context, all of which may improve the model performance in multi-hop MRC tasks.
Fig. 1. Our proposed ClueReader: a heterogeneous graph attention network for multi-hop MRC. The detailed explanations of S, C, and q are given in the task formalization (Section III-A). S, C, and q are encoded by three independent Bi-LSTMs (Section III-B). Following the graph construction strategies in Section III-C, the outputs of the three encoders are fed to co-attention and self-attention to initialize the reasoning graph features, as explained in Section III-D. Then the topology information and node features are passed into the GAT layer. The much larger network computation behind grandmother cells is performed in the GAT layer, and n-hop message passing is calculated in n parameter-sharing layers, which are presented in Section III-D2. Finally, the grandmother cell selectivity is combined in Section III-E, outputting the final predicted answer.

Fig. 2. Heterogeneous reasoning graph in ClueReader. Different nodes are filled with different colors, and the edges are distinguished by line type. Subject nodes are gray, reasoning nodes are orange, mention nodes are green, support nodes are red, and candidate nodes are blue. The nodes in the light yellow square are all selected as input to the two MLPs to obtain the prediction score distribution.
(a) A sample from WIKIHOP. (b) A sample from MEDHOP.

Fig. 3. Samples of WIKIHOP and MEDHOP. Subject entities, reasoning entities, mention entities, and candidate entities are shown in gray, orange, green, and blue, respectively. Each occurrence of the correct answer is marked with a surrounding square frame.

Fig. 4. Statistics of the model performance with different numbers of support documents on the WIKIHOP development set.

Fig. 6. Visualizations of reasoning graphs on the WIKIHOP development samples that are correctly answered. A thicker edge corresponds to a higher attention weight, and darker green nodes or darker blue nodes represent higher output values among the same type of nodes.

Fig. 7. Visualizations of reasoning graphs on the MEDHOP development samples that are correctly answered. A thicker edge corresponds to a higher attention weight, and darker green nodes or darker blue nodes represent higher output values among the same type of nodes.

Fig. 8. To predict it, the named entity Hampton Wick is extracted from the seventh support document, and it links to the same tokens in the zeroth support document, where the correct candidate answer also appears. The reasonable clue chain Hampton Wick War Memorial ↔ Hampton Wick # 1 ↔ Hampton Wick # 2 ↔ London Borough of Richmond upon Thames presents the procedure of our model for the multi-hop MRC task.

TABLE III
PERFORMANCE OF THE PROPOSED ClueReader IN THE DEVELOPMENT AND TEST SETS OF WIKIHOP AND MEDHOP, AND COMPARISONS WITH OTHER PUBLISHED APPROACHES ON THE LEADERBOARD.
(4) requires single document and likely follows fact; (5) not follows is not given. The performance of our model is presented in Table IV. We observe that ClueReader had the best performance, 74.9%, on the samples that follow the facts and require multiple passages. This result supports the effectiveness of the model in pure multi-hop MRC tasks. It achieved the second-best result, 74.0%, on samples that follow the facts and require a single document, which suggests that ClueReader is also effective in single-passage MRC tasks.

TABLE VI
ABLATION STUDIES OF HYPERPARAMETERS OF GAT LAYERS AND WEIGHTS OF grandmother cells IN REASONING GRAPH PREDICTIONS.

11% and the candidate nodes contributed 5.58%. Regarding the reasoning and subject nodes, we attribute their low contributions to the small numbers of such nodes in the graph. However, we observed considerably different behavior on MEDHOP. As shown in Table V, the most effective part of the model on MEDHOP is the mention nodes: when we blocked the mention nodes in the graph, the accuracy decreased significantly, by 43.28%, while graph reasoning contributed 10.53% to accuracy. Meanwhile, support nodes had a negative effect on the prediction (a decrease of 0.29%), which is diametrically opposite to their effect on the WIKIHOP development subset.