Subgraph Retrieval Augmented Generation (SG-RAG) is a zero-shot Graph RAG method we proposed in [13] for domain-specific multi-hop KGQA. In this section, we first provide an overview of the SG-RAG method. Then, we highlight the experimental setup and the main findings from our previous work [13]. Lastly, we discuss the potential improvements targeted in our extension.
4.1. Overview of SG-RAG
As Figure 1 shows, the SG-RAG method comprises three main steps: (1) subgraph retrieval, (2) textual transformation, and (3) answer generation. For a user question $q$, the SG-RAG method begins with the subgraph retrieval step, which retrieves the subgraphs necessary to answer $q$. The retrieval process is based on mapping $q$ onto a Cypher query $c_q$ (Text2Cypher mapping):

$$c_q = \mathrm{Text2Cypher}(q) \quad (1)$$

such that querying the knowledge graph $G$ using $c_q$ retrieves the set of matched and filtered subgraphs $S$ containing the information necessary to answer $q$:

$$S = \mathrm{Query}(G, c_q) \quad (2)$$

where $S = \{s_1, s_2, \ldots, s_n\}$ and each subgraph $s_i \subseteq G$. In our initial trials on Text2Cypher mapping, we observed that both Llama-3 8B and Gemini performed poorly in generating valid Cypher queries. While GPT-4 was able to generate Cypher queries, its accuracy required further improvement. Therefore, we adopted a template-based approach for Text2Cypher mapping, using manually crafted templates. For a given question $q$, we select the template corresponding to the question type and populate it with the entity mentioned in $q$. Examples of these templates are provided in Figure A1.
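To make the template-based mapping concrete, the following is a minimal Python sketch of selecting a Cypher template by question type and binding the entity extracted from the question. The question-type labels, template strings, and the `build_cypher` helper are illustrative assumptions; the actual templates used in SG-RAG are the ones shown in Figure A1.

```python
# Illustrative sketch of template-based Text2Cypher mapping.
# The question-type labels and Cypher templates below are hypothetical
# examples, not the exact templates of Figure A1.

CYPHER_TEMPLATES = {
    # 1-hop: e.g., "Who directed [Movie]?"
    "movie_to_director": (
        "MATCH (m:Movie {name: $entity})-[:DIRECTED_BY]->(d:Person) RETURN m, d"
    ),
    # 2-hop: e.g., "Which movies share a director with [Movie]?"
    "movie_to_director_to_movie": (
        "MATCH (m:Movie {name: $entity})-[:DIRECTED_BY]->(d:Person)"
        "<-[:DIRECTED_BY]-(other:Movie) RETURN m, d, other"
    ),
}

def build_cypher(question_type: str, entity: str) -> tuple[str, dict]:
    """Select the template for the question type and bind the entity
    mentioned in the question as a query parameter."""
    template = CYPHER_TEMPLATES[question_type]
    return template, {"entity": entity}

# Example usage (the graph query itself corresponds to Equation (2)):
# cypher, params = build_cypher("movie_to_director", "Inception")
```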
After retrieving the set of subgraphs $S$, the second step of SG-RAG is to transform the retrieved subgraphs into a textual representation. For each retrieved subgraph $s_i \in S$, the transformation is based on converting each directed edge $e$, with its corresponding source and destination nodes $v_{src}$ and $v_{dst}$, into a triplet, denoted by $t$, in the form "subject|relation|object". In the triplet format, "subject" and "object" refer to $v_{src}$ and $v_{dst}$, respectively, while "relation" refers to the label of $e$. During the textual transformation, the triplets that belong to the same subgraph are grouped, such that

$$T_{s_i} = \{t_1, t_2, \ldots, t_m\} \quad (3)$$

Since $T_{s_i}$ is a set, there is no pre-defined order among the triplets $t_j \in T_{s_i}$.
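As an illustration, the following Python sketch converts a subgraph into the "subject|relation|object" triplet set described above. The subgraph is assumed here to be a plain list of labeled directed edges; this edge representation is an assumption made only for the example.

```python
# Illustrative sketch of the textual transformation step.
# A subgraph is assumed to be a list of directed edges, each carrying
# its source node, relation label, and destination node.

from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str       # source node (the "subject")
    relation: str  # edge label (the "relation")
    dst: str       # destination node (the "object")

def subgraph_to_triplets(subgraph: list[Edge]) -> set[str]:
    """Convert each directed edge into a "subject|relation|object" triplet
    and group the triplets of the same subgraph into one (unordered) set."""
    return {f"{e.src}|{e.relation}|{e.dst}" for e in subgraph}

# Example usage:
# s = [Edge("Inception", "directed_by", "Christopher Nolan"),
#      Edge("Inception", "release_year", "2010")]
# subgraph_to_triplets(s)
# -> {"Inception|directed_by|Christopher Nolan", "Inception|release_year|2010"}
```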
Lastly, an answer $A$ to the question $q$ is generated using the underlying LLM as follows:

$$A = \mathrm{LLM}(I, q, T) \quad (4)$$

where $I$ refers to the instructions explaining the task and the inputs to the LLM. The prompt template represented by $I$ is shown in Figure A2.
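A minimal sketch of how the prompt in Equation (4) could be assembled is shown below. The instruction wording and the `call_llm` function are placeholders; the actual prompt template $I$ is the one given in Figure A2.

```python
# Illustrative sketch of the answer-generation step (Equation (4)).
# The instruction text and call_llm() are placeholders; the actual
# prompt template I is shown in Figure A2.

INSTRUCTIONS = (
    "You are given knowledge triplets in the form subject|relation|object. "
    "Answer the question using only the given triplets."
)

def build_prompt(question: str, triplet_sets: list[set[str]]) -> str:
    """Concatenate the instructions I, the retrieved triplets T,
    and the question q into a single prompt."""
    knowledge = "\n".join(t for s in triplet_sets for t in s)
    return f"{INSTRUCTIONS}\n\nTriplets:\n{knowledge}\n\nQuestion: {question}\nAnswer:"

# answer = call_llm(build_prompt(q, T))  # call_llm() stands in for the underlying LLM
```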
4.2. Experimental Setup and Main Results
In our previous work [13], we experimented with the MetaQA benchmark dataset proposed by Zhang et al. [23], which is a benchmark for multi-hop KGQA. It contains a knowledge graph about the movies domain (MetaQA-KG) and a set of question–answer pairs grouped into 1-hop, 2-hop, and 3-hop questions. The ground-truth answers are lists of entities.
As baseline methods to compare with SG-RAG, we employed the Direct method, in which the LLM answers from its internal knowledge without any external context, and the traditional RAG method, in which the LLM answers based on semantically retrieved textual documents.
The ground-truth answers in the MetaQA benchmark are lists of entity names. Therefore, metrics such as BLEU [24] and ROUGE [25] are not applicable, as they are sensitive to variations in the token order between the ground-truth and generated answers. Moreover, the Direct and RAG baselines generate answers either from the internal knowledge of the underlying LLM or from knowledge sources different from MetaQA-KG. As a result, the generated answers may include entities that are not present in the MetaQA ground truth. Penalizing the model for generating such entities could be unfair, as these entities may be correct but outside the scope of MetaQA. For these reasons, we evaluated the performance using the answer-matching rate (AMR) metric, inspired by the entity-matching rate metric proposed by Wen et al. [26] for evaluating dialogue systems. AMR measures the intersection between the ground-truth answer $A^{*}$ and the generated answer $A$, normalized by the number of entities in $A^{*}$:

$$\mathrm{AMR} = \frac{|A^{*} \cap A|}{|A^{*}|} \quad (5)$$

AMR is advantageous because it is insensitive to the ordering of entities within the answer. It is recall-focused, penalizing the model for missing entities from the ground-truth answer (via set intersection), but not for generating additional, potentially valid entities outside the MetaQA dataset. The denominator in Equation (5) reflects the number of entities in the ground-truth answer, ensuring that the metric properly captures recall.
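For clarity, a minimal Python sketch of the AMR computation in Equation (5) is given below, assuming both answers are available as sets of entity names.

```python
# Minimal sketch of the answer-matching rate (AMR) in Equation (5).
# Both the ground-truth and the generated answer are assumed to be
# available as sets of entity names.

def answer_matching_rate(ground_truth: set[str], generated: set[str]) -> float:
    """|ground_truth ∩ generated| / |ground_truth| (recall-oriented)."""
    if not ground_truth:
        return 0.0
    return len(ground_truth & generated) / len(ground_truth)

# Example: AMR is 0.5 if only one of two ground-truth entities is generated.
# answer_matching_rate({"Christopher Nolan", "Emma Thomas"}, {"Christopher Nolan"})  # -> 0.5
```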
To measure the statistical significance of the difference between the performance of our method and the baseline, we applied a one-tailed paired t-test following the guidelines proposed by Dror et al. [27]. As the size of our test set was large (>30), we could assume that the performance scores would follow a normal distribution [28]. Our null hypothesis ($H_0$) is that the performance of our method is at most as good as the performance of the baseline. Our alternative hypothesis ($H_1$) is that our method outperforms the baseline. In the results discussion, we use the term statistically significant when the
p-value is less than 0.05.
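As an illustration, the following sketch shows how such a one-tailed paired t-test over per-question scores could be run with SciPy; the variable names and the significance level passed as a default argument are assumptions made for the example.

```python
# Illustrative sketch of the one-tailed paired t-test used to compare
# per-question scores of SG-RAG against a baseline.

from scipy.stats import ttest_rel

def sg_rag_outperforms(sg_rag_scores, baseline_scores, alpha=0.05):
    """H0: SG-RAG is at most as good as the baseline.
    H1: SG-RAG outperforms the baseline (one-tailed, paired)."""
    result = ttest_rel(sg_rag_scores, baseline_scores, alternative="greater")
    return result.pvalue < alpha, result.pvalue

# significant, p = sg_rag_outperforms(amr_sg_rag, amr_baseline)
```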
Initially, we compared the performance of SG-RAG with the Direct and RAG methods; RAG used Wikipedia documents regarding the entities in the MetaQA dataset as external knowledge (referred to as RAG-Wiki). This experiment was conducted on a test sample of 15K question–answer pairs divided equally among 1-hop, 2-hop, and 3-hop questions. The underlying LLM was the Llama-3 8B Instruct version [29]. From the results shown in Table 1, we can observe that the performance of the Direct method was poor compared to the other methods. This shows that depending only on the internal knowledge of Llama-3 8B is insufficient for this domain-specific task. On the other hand, the traditional RAG method improved the performance for 1-hop and 2-hop questions compared to the Direct method, with the largest gains obtained using Top-10 retrieval. However, the performance increase for 3-hop questions was very small (the largest increase was obtained using Top-1), which may have been due to the limitations of semantic-based retrieval in finding the documents necessary to answer multi-hop questions. This limitation of traditional RAG is addressed in SG-RAG by utilizing Cypher queries for knowledge retrieval. Although Cypher-based retrieval narrows the search space compared to semantic-based retrieval, its ability to traverse the graph and retrieve subgraphs containing the necessary information gives SG-RAG a distinct advantage for multi-hop questions. This advantage is reflected in the statistically significant performance improvements observed for 1-hop, 2-hop, and 3-hop questions (Table 1).
One possible reason behind the lower performance of the traditional RAG method compared to SG-RAG for 1-hop questions is that the Wikipedia documents may not contain all the information required to answer a question. To address this issue, we generated textual documents based on the MetaQA-KG using the Gemini model (https://gemini.google.com/, accessed on 1 July 2024). To generate the documents, we first extracted the entity names from the questions. Then, for each entity, we extracted the node representing the targeted entity and its 1-hop neighborhood. Lastly, we converted the extracted subgraph into a set of triplets and sent it to the Gemini model to generate a paragraph describing the given triplets. In this experiment, we used the Gemini 1.5 Flash version, and the test sample consisted of 1547 1-hop questions, 1589 2-hop questions, and 1513 3-hop questions.
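The following sketch illustrates this document-generation procedure. The edge-list representation of the knowledge graph, the prompt wording, and the `generate` callable are assumptions made for the example, not the exact implementation.

```python
# Illustrative sketch of generating a textual document for one entity
# from MetaQA-KG. The graph format, prompt wording, and generate()
# callable are placeholders, not the exact implementation.

def document_for_entity(entity: str, edges: list[tuple[str, str, str]], generate) -> str:
    """edges: (source, relation, destination) tuples of the KG (assumed format)."""
    # 1. Keep the 1-hop neighborhood around the targeted entity.
    neighborhood = [(s, r, d) for (s, r, d) in edges if entity in (s, d)]
    # 2. Convert the extracted subgraph into "subject|relation|object" triplets.
    triplets = sorted({f"{s}|{r}|{d}" for (s, r, d) in neighborhood})
    # 3. Ask the LLM (Gemini in our experiments) to write a paragraph about them.
    prompt = "Write a short paragraph describing the following facts:\n" + "\n".join(triplets)
    return generate(prompt)
```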
From the results in Table 2, we can see that applying RAG to the generated documents based on the MetaQA-KG (RAG-Gen) increased the performance for 1-hop questions compared to RAG-Wiki. Moreover, we can notice that the performance of SG-RAG remained higher than that of RAG-Gen, even for 1-hop questions. The reason behind the higher performance of SG-RAG is that the knowledge sent to the LLM is in the triplet format, which helps the LLM extract information and apply reasoning to it.
Lastly, to analyze the effect of the underlying LLM, we re-applied SG-RAG and RAG-Gen with GPT-4 Turbo [30] using the same sample we had used previously. As can be seen (Table 3), SG-RAG again statistically significantly outperformed the traditional RAG method for 1-hop, 2-hop, and 3-hop questions. We can also see a general increase in the performance of all the methods compared to the results using Llama-3 8B Instruct. This increase can be attributed to the higher capabilities of the GPT-4 model compared to Llama-3 8B.
4.3. Potential Improvements
LLMs generally struggle to pay sufficient attention to knowledge that appears in the middle of the context and tend to focus mainly on knowledge at the beginning and end of the context. This problem is known as the lost-in-the-middle problem [8]. For multi-hop QA, the lost-in-the-middle problem is aggravated when the necessary pieces of information are far away from each other [9]. A possible solution to this problem is to order the knowledge in the context [10,31]. In our previous work, we did not propose an ordering mechanism for the subgraph triplets $T_{s_i}$; however, specifying the order of the triplets can reduce the effect of the lost-in-the-middle problem by supporting the LLM in reasoning about $T$ and generating a more accurate answer to $q$.
Moreover, the textual transformation step in SG-RAG converts the set of retrieved subgraphs $S$ into the set $T$, such that each subgraph $s_i$ is represented by its set of triplets $T_{s_i}$. In the triplet representation, the subgraphs in $T$ can overlap. This means that the set $T$ shared with the LLM to generate an answer to $q$ can contain repetitive triplets. These redundant triplets unnecessarily increase the size of the context shared with the LLM. During our experiments, we noticed that, on average, a considerable share of the retrieved triplets for a 2-hop question were redundant, and this share was even larger for 3-hop questions.
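To illustrate the redundancy issue, the sketch below measures the fraction of repeated triplets across overlapping per-subgraph triplet sets. It only demonstrates the problem on assumed example data; it is not the mitigation proposed in this work.

```python
# Illustration of redundant triplets across overlapping subgraphs.
# This only demonstrates the issue; it is not the proposed mitigation.

def redundancy_ratio(triplet_sets: list[set[str]]) -> float:
    """Fraction of retrieved triplets that are duplicates across subgraphs."""
    total = sum(len(s) for s in triplet_sets)
    unique = len(set().union(*triplet_sets)) if triplet_sets else 0
    return 0.0 if total == 0 else (total - unique) / total

# Two 2-hop subgraphs sharing one triplet -> 25% of the context is redundant.
# T = [{"Inception|directed_by|Christopher Nolan", "Christopher Nolan|directed|Memento"},
#      {"Inception|directed_by|Christopher Nolan", "Christopher Nolan|directed|Dunkirk"}]
# redundancy_ratio(T)  # -> 0.25
```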