ISPRS International Journal of Geo-Information
  • Article
  • Open Access

15 December 2024

Schema Retrieval for Korean Geographic Knowledge Base Question Answering Using Few-Shot Prompting

Seokyong Lee and Kiyun Yu
Department of Civil and Environmental Engineering, Seoul National University, Seoul 08826, Republic of Korea
* Author to whom correspondence should be addressed.

Abstract

Geographic Knowledge Base Question Answering (GeoKBQA) has garnered increasing attention for its ability to process complex geographic queries. This study focuses on schema retrieval, a critical step in GeoKBQA that involves extracting the relevant schema items (classes, relations, and properties) needed to generate accurate operational queries. Current GeoKBQA studies primarily rely on rule-based approaches for schema retrieval, which predefine words or descriptions for each schema item. This rule-based method has three critical limitations: (1) poor generalization to undefined schema items, (2) failure to consider the semantic meaning of user queries, and (3) an inability to adapt to languages not covered in the predefinition step. In this study, we present a schema retrieval model that uses few-shot prompting on GPT-4 Turbo to address these issues. Using the SKRE dataset, we searched for the prompt that best enables the model to handle Korean geographic questions across various generalization levels. Notably, this method outperformed fine-tuning in zero-shot scenarios, underscoring its adaptability to unseen data. To our knowledge, this is the first schema retrieval model for GeoKBQA that relies purely on a language model and is capable of processing Korean geographic questions.

1. Introduction

Recent advances in natural language processing (NLP) have been substantial. In particular, technological progress has enabled training with large-scale datasets, leading to the development of large neural network-based language models. These models have demonstrated high performance in various NLP tasks such as document summarization, sentiment analysis, and machine translation, driving many innovations [1]. One of the tasks in NLP, Question Answering (QA), involves generating or retrieving answers to user queries [2]. Research has been conducted on Knowledge Base Question Answering (KBQA), which builds QA systems by finding answers in a structured, graph-form database known as a Knowledge Base (KB) [3,4,5,6].
To implement KBQA, components such as entity linking, schema retrieval, and transducers are used [6,7]. Entity linking involves identifying entities mentioned in natural language questions and linking them to their corresponding entities in the actual KB, clarifying their identity [8]. Schema retrieval extracts schema items (classes, relations, properties) that are directly or indirectly related to the question from the KB [6]. Lastly, the transducer creates a logical structure that can query the KB based on the results of entity linking and schema retrieval [7].
While entity linking and transducers are integral to the KBQA pipeline, this study specifically focuses on schema retrieval. This is because schema retrieval directly influences the accuracy of the logical structure generated by the transducer and ensures that the operational query aligns with the KB’s structure. Errors in schema retrieval propagate to downstream components, leading to invalid queries and incorrect results. With the emergence of large-scale KBs that contain tens of thousands of schema items, such as Freebase [9] and Wikidata [10], schema retrieval has become a critical challenge. Many recent studies have prioritized schema retrieval as a core task in KBQA systems [3,4,5,6,11,12].
The schema of a KB can be defined in terms of classes, relations, and properties [6]. A class defines the category or type of entity. For example, the class “university” refers to a type that includes real-world entities like “Seoul National University”. A relation represents the connection between entities or classes. The entities “Seoul” and “Seoul National University” can be linked by the relation “LOCATED_IN”, indicating geographical inclusion. Lastly, a property describes the characteristics of a class or entity. The entity “Seoul National University” could have properties like “coord”, indicating its geographical coordinates, and “area”, showing the size of the area. Schema retrieval plays an important role in the KBQA process. It ensures that the schema items used in a generated query match the KB structure. Even a single mismatch between a schema item and the KB schema can result in an invalid query, making it impossible to retrieve an answer [13].
Recent advancements in large language models (LLMs) have significantly influenced schema retrieval research. Schema retrieval using LLMs can be implemented through two primary approaches: fine-tuning and few-shot prompting [3,4,5,6,11,12]. Fine-tuning, however, requires a large amount of training data [14]. These training data, which consist of natural language questions paired with corresponding schema items, vary widely because each KB has its own schema ontology. These differences lead to compatibility issues between datasets and create challenges in adapting fine-tuned models to new KBs or languages. Since building datasets for schema retrieval models whenever the KB or language of user questions changes consumes considerable resources and time, research has been conducted on schema retrieval models using few-shot prompting techniques, which do not require large datasets [11,12].
Few-shot prompting is a method of in-context learning where the model performs new tasks using only the context provided [1]. This involves presenting the language model with a prompt that includes instructions for the task and a few examples, enabling the model to infer how to perform the task on its own [15]. Recent research has used few-shot prompting with language models such as OpenAI’s GPT-4 Turbo and code-davinci-002 to perform schema retrieval [11,12]. Unlike fine-tuning, with simple changes in instructions and the use of few-shot examples, schema retrieval models using few-shot prompting can be applied to various KBs and languages.
However, existing studies have primarily focused on neighbor schema retrieval, which only targets the schemas immediately surrounding the explicit entities mentioned in a question [11,12]. This narrow focus can significantly increase both the complexity and the time required for querying the knowledge base when dealing with intricate questions involving multiple relational steps, known as hops. For instance, in a question like “Where is the nearest parking spot from the park, which is located near the cheapest apartments in Seoul?”, there is a four-hop relation between “Seoul” and “parking spot”. The neighbor method must traverse all entities within a four-hop range of “Seoul” to identify the relevant relations. This process is computationally expensive and time-consuming. Moreover, this method also fails to handle queries that do not explicitly mention an entity, such as “Where is the highest mountain?” [6].
To overcome these limitations, it is more effective to employ a dense retrieval approach that considers all available schema items within the KB, or to use a hybrid method that combines both dense and neighbor-based approaches. In the fine-tuning domain, this strategy has been shown to improve performance significantly [4,6]. Despite the potential of these methods, research on dense schema retrieval models utilizing few-shot prompting remains unexplored. This approach is particularly advantageous because it eliminates the need for extensive training datasets, making it more adaptable and efficient.
GeoQA is an extended domain of the QA system that is designed to respond to geographic questions [7]. Geographic questions involve geographic entities, concepts (such as specific types like buildings, cities, or states), or spatial relationships [16]. As with general QA systems, research in the GeoQA field has been conducted on GeoKBQA, which uses structured graph-form KBs to answer users’ geographic queries [7,17,18]. However, unlike KBQA studies that utilize fine-tuning or few-shot prompting with language models for schema retrieval, GeoKBQA research has traditionally relied on a rule-based approach [17,18,19]. This method involves predefining words or descriptions linked to specific schema items and matching them with the text in user questions. Such a rule-based system necessitates frequent updates whenever the KB schema changes and struggles with questions that lack predefined schema relationships, leading to poor generalization. This limitation becomes particularly problematic for Korean geographic questions, where each schema item must have predefined Korean terms. However, no studies have yet addressed the development of a rule-based model for Korean questions, which would require significant resources and time to manage all relevant expressions effectively.
This reliance on rule-based systems is further complicated by the unique characteristics of geographic questions, which require context-sensitive schema retrieval. A schema retrieval model must identify the appropriate spatial relationship based on the context of the question: when asked about the distance between Paris and Beijing, the model should retrieve the “distance” relation, while for a question concerning the distance between Canada and the USA, it should retrieve the “adjacent” relation, reflecting their shared border and geographic scale. Additionally, geographic concepts and entities often involve inherent vagueness. The term “Amazon”, for instance, may correspond to different classes in the knowledge base, such as “river”, “rainforest”, or “company”. A schema retrieval model must infer the correct context to associate the term with the appropriate class. Rule-based systems are unable to address these complexities, as they are restricted to matching predefined keywords and fail to consider the broader meaning or relationships within the question. In contrast, schema retrieval models that leverage LLMs dynamically interpret relationships and nuances, making them well suited to answering geographic questions.
To address these challenges, this study introduces a dense schema retrieval model for Korean GeoKBQA using few-shot prompting with an LLM. Dense schema retrieval, which evaluates all schema items in the KB against the query, is particularly effective for handling multi-hop queries and questions that lack explicit entity references. Few-shot prompting further enhances the adaptability of schema retrieval by enabling LLMs to infer relationships between questions and schema items directly from a given prompt, eliminating the need for predefined schema relationships or extensive training datasets. This study also compares the few-shot prompting approach with a fine-tuned Multilingual BERT (M-BERT) [14] model to evaluate their performance across different generalization levels.
The primary contributions of this work are as follows:
  • We develop a language model-based schema retrieval model for GeoKBQA: the proposed model addresses the limitations of traditional rule-based methods by dynamically inferring relationships between queries and schema items, demonstrating strong generalization capabilities.
  • We create a prompt for few-shot prompting-based schema retrieval: using the Spatial Knowledge Reasoning Engine (SKRE) dataset, we developed optimal prompts tailored to the dense schema retrieval of Korean geographic questions.
  • We adapt few-shot prompting techniques for dense schema retrieval: this study leverages few-shot prompting to handle complex, multi-hop, and entity-less queries, providing a robust alternative to fine-tuning-based methods.
This paper is organized as follows: Section 2 reviews related work, including schema retrieval in KBQA, GeoKBQA methods, and the limitations of existing schema retrieval models. Section 3 describes the methodology, including dataset construction, generalization levels, the process of finding the optimal prompt for the few-shot prompting model, and the proposed schema retrieval models. Section 4 discusses the experimental setup, including data preparation, implementation details, and evaluation metrics. In Section 5, we present the results of the prompt optimization process and a comparative performance analysis of the few-shot prompting and fine-tuning approaches. Finally, Section 6 concludes with key findings, implications, and directions for future research.

3. Methodology

As can be seen in Figure 1, the experiments in this study are organized into three main steps. First, we modify the SKRE dataset, which includes Korean geographic questions and corresponding Cypher queries, to create a schema retrieval dataset. Second, using this dataset, we identify the most suitable prompt to perform few-shot prompting in dense schema retrieval. During this process, we evaluate the generalization performance to analyze the effects of different prompts. Lastly, to validate the effectiveness of the few-shot prompting-based schema retrieval model, we fine-tune an M-BERT model on the same dataset.
Figure 1. The overall pipeline of our experiment.

3.1. Dataset Construction and Generalization Levels

To construct a test dataset capable of comparing models based on fine-tuning and few-shot prompting, we extracted schema labels from the Cypher queries contained in the SKRE dataset. We organized all schema items that constitute the KB into a dataset and used these to identify the schemas present in the queries. The processed data comprise spatially related natural language questions and their corresponding class, relation, and property schema items.
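The extraction rules are not spelled out in the paper; the following is a minimal regex-based sketch, in which the patterns, function name, and sample query are illustrative assumptions, of how class, relation, and property labels could be pulled from a Cypher string.

```python
import re

def extract_schema_items(cypher: str) -> dict:
    # Node labels such as (y:Hospital) -> class schema items
    classes = re.findall(r"\(\s*\w*\s*:\s*(\w+)", cypher)
    # Relationship types such as [:NEARBY] -> relation schema items
    relations = re.findall(r"\[\s*\w*\s*:\s*(\w+)", cypher)
    # Property keys such as {name: ...} -> property schema items
    # (node-label patterns are stripped first so labels are not counted as keys)
    stripped = re.sub(r"\(\s*\w*\s*:\s*\w+", "", cypher)
    properties = re.findall(r"(\w+)\s*:", stripped)
    return {"class": sorted(set(classes)),
            "relation": sorted(set(relations)),
            "property": sorted(set(properties))}

query = "Match (x:Subway {name: 'Seoul National University'})-[:NEARBY]->(y:Hospital) RETURN y"
print(extract_schema_items(query))
# {'class': ['Hospital', 'Subway'], 'relation': ['NEARBY'], 'property': ['name']}
```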
To obtain a precise evaluation of the schema retrieval models’ generalization performance, we adapted Gu et al.’s [3] transducer generalization levels to conduct the schema retrieval task. However, we did not consider the functions of the query since we only focused on schema items. The generalization measures used to evaluate our proposed schema retrieval models include Independent and Identically Distributed (I.I.D.), compositional, and zero-shot levels. These levels are illustrated in Figure 2, which provides examples using class schema items. While the examples focus on class schema items for clarity, the same generalization methodology is applied to relation and property schema items as well. The left side of Figure 2 shows few-shot examples or training data used for our models that associate natural language queries with their corresponding class schema items. The right side demonstrates test data categorized into the three generalization levels, with each designed to evaluate the model’s performance under different conditions.
Figure 2. Three generalization levels of test data (class schema case). As shown on the right side, the I.I.D. level uses the same schema combinations as are seen in the few-shot training data (highlighted in blue). The compositional level introduces new combinations of schema items seen during the training process (highlighted in blue and green). Lastly, the zero-shot level evaluates the model’s ability to handle completely new schema items (highlighted in red) that the model has never encountered before.
  • I.I.D. generalization: This level assesses the model’s ability to handle queries that align with the schema items and question structures seen in the few-shot examples or training data. For example, as shown on the left side of Figure 2, there is a training data point that asks “Can you tell me about apartments in the Umyeon-dong district that have a daycare center nearby?” and employs the class schema items “apartment”, “DistrictBoundaryDong”, and “daycare”. A corresponding I.I.D. level test data point could be “Find apartments near a daycare in Seocho-dong district”, which uses a similar question structure and an identical schema combination.
  • Compositional generalization: This level evaluates the model’s ability to process new combinations of schema items encountered during training. For instance, the training data in Figure 2 include the query “Can you tell me about apartments in the Umyeon-dong district that have a daycare center nearby?”, with the class schema items “apartment”, “DistrictBoundaryDong”, and “daycare”, and the query “Find the school closest to Jamwon Elementary School”, with the schema item “school”. A compositional test query might combine these schema items in a new way, asking “Can you tell me about apartments that have a school nearby in Umyeon-dong?”, which requires the use of “apartment”, “school”, and “DistrictBoundaryDong”.
  • Zero-shot generalization: This level measures the model’s ability to handle schema items it has never encountered during training. For example, if the training data contain no mention of the class schema “reputation”, a zero-shot level query like “Can you tell me about the social media reviews for Raemian apartments?” evaluates the model’s ability to infer and retrieve this unseen schema item alongside the “apartment” item.
By assessing the models across these diverse generalization scenarios, we can evaluate whether the schema retrieval models are merely memorizing training data or actively adapting to the complexities and changes in the real world.
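The exact labeling procedure for the SKRE test items is not detailed here; the sketch below illustrates the three-way distinction under the assumption that levels are assigned by comparing each test question’s schema combination with those appearing in the training or few-shot examples (function and variable names are illustrative).

```python
def generalization_level(test_schemas, train_combinations):
    # test_schemas:        set of schema items required by a test question
    # train_combinations:  list of schema-item sets, one per training/few-shot example
    seen_items = set().union(*train_combinations)
    if any(test_schemas == combo for combo in train_combinations):
        return "I.I.D."          # identical schema combination appeared before
    if test_schemas <= seen_items:
        return "compositional"   # only previously seen items, but in a new combination
    return "zero-shot"           # contains at least one unseen schema item

train = [{"apartment", "DistrictBoundaryDong", "daycare"}, {"school"}]
print(generalization_level({"apartment", "DistrictBoundaryDong", "daycare"}, train))  # I.I.D.
print(generalization_level({"apartment", "school", "DistrictBoundaryDong"}, train))   # compositional
print(generalization_level({"apartment", "reputation"}, train))                       # zero-shot
```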

3.2. Schema Retrieval Models

We utilized the constructed dataset to perform dense schema retrieval using both few-shot prompting and fine-tuning methods, as proposed in our study. For the few-shot prompting approach, we recognized that performance could vary depending on the number of instructions and examples included in the prompt. Consequently, we conducted experiments under various conditions to identify the optimal prompt configuration. For the fine-tuning approach, we employed the M-BERT model, which is capable of processing Korean questions. We enhanced computational efficiency through negative sampling. Finally, we compared the generalization performance of both models using the previously described I.I.D., compositional, and zero-shot generalization levels.

3.2.1. GPT Few-Shot Prompting-Based Model

In this study, we propose a schema retrieval model that employs few-shot prompting with the GPT-4 Turbo model provided by the OpenAI API. Specifically, we implemented a dense schema retrieval model that compares all schema items in the KB with the questions. As previously discussed, unlike the neighbor approach, the dense method can handle complex questions or those that do not include explicit entities, offering significant advantages [4,6]. By utilizing few-shot prompting for schema retrieval, this approach allows for adaptability to changes in the KB or schema without the need for retraining that is typical with fine-tuning methods. Instead, simple modifications to the prompt can accommodate these changes without altering the model’s parameters. The prompts used in the experiments include instructions and few-shot examples containing schema items that the model can select, facilitating schema retrieval.
As illustrated in Figure 3, few-shot prompting differs fundamentally from the rule-based approach commonly used in the GeoKBQA domain. It allows the GPT-4 Turbo model to dynamically interpret the semantic meaning of input questions through task descriptions and few-shot examples. This approach enables the model to flexibly retrieve schema items without the need to predefine every relationship between schema items and words or descriptions. Moreover, it is relatively independent of KB and language changes, requiring only adjustments to the schema list in the task description and corresponding examples.
Figure 3. Few-shot prompting-based dense schema retrieval models. In its actual implementation, Korean geographic questions were used.
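As a concrete illustration of the pipeline in Figure 3, the following sketch assembles a prompt from an instruction, the full schema list of one schema type, and a configurable number of few-shot examples, and sends it to GPT-4 Turbo through the OpenAI API. The instruction wording, helper names, model identifier, and parameter values are illustrative assumptions rather than the exact configuration used in the experiments; because the call handles one schema type at a time with a variable number of examples, it also previews the processing-method and example-count factors examined below.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

def build_prompt(instruction, schema_list, examples, question):
    # instruction: task description (e.g., "Select the five most relevant class schema items ...")
    # schema_list: every schema item of one type in the KB (dense retrieval)
    # examples:    few-shot (question, schema items) pairs
    lines = [instruction, "Available schema items: " + ", ".join(schema_list), ""]
    for q, items in examples:
        lines.append(f"Question: {q}\nSchema items: {', '.join(items)}")
    lines.append(f"Question: {question}\nSchema items:")
    return "\n".join(lines)

def retrieve_schema(instruction, schema_list, examples, question, n_examples=20):
    # Divided processing: call this once per schema type (class, relation, property)
    # for the same question and combine the three result lists downstream.
    prompt = build_prompt(instruction, schema_list, examples[:n_examples], question)
    response = client.chat.completions.create(
        model="gpt-4-turbo",                     # illustrative model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```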
However, the performance of models based on few-shot prompting is highly sensitive to the instructions and examples included within the prompt [28,29]. Therefore, this study uses a three-step process to explore the most effective prompts for schema retrieval tasks in Korean GeoKBQA.
1. Processing Methods
While it is common to train separate models for class, relation, and property schemas in dense schema retrieval using fine-tuning [4,6], few-shot prompting allows for the creation of prompts that can search all three types of schemas simultaneously. This approach can complete all tasks with a single API call, potentially reducing inference time. However, the increase in task complexity could lead to reduced performance; simpler tasks performed through prompts are likely to yield better performance [28,30]. We conducted experiments to compare the combined and divided task approaches. An example of the prompts used in the experiments is shown in Figure 3. The right image only represents the divided task method for class schemas. In the actual experiments, searches for class, relation, and property schemas are conducted separately for the same input question, and the results are synthesized.
2. Number of Few-Shot Examples
Few-shot examples are used to enhance the model’s understanding of schema retrieval tasks by demonstrating specific examples. In classification tasks like schema retrieval, providing a greater number of examples generally increases model accuracy; however, beyond a certain point, the rate of performance improvement significantly diminishes [11,28,29]. As the prompt lengthens, both the amount of data to be processed and the computational complexity increase, making it crucial to determine the optimal number of examples. In this study, we systematically increased the number of examples included in the prompt from 0 to 40, in increments of 10, to observe performance changes and identify the most efficient number of few-shot examples.
3. Instructions
To generate information accurately and effectively, it is necessary to provide the language model with specific instructions [15]. In the context of schema retrieval tasks, not only should the task description be included, but a list of schema items that the model can select must also be incorporated into the instructions [31]. As with the number of few-shot examples, the way the schema retrieval task is described to the model significantly affects its performance [28,29]. Therefore, this study developed three different instructions to identify the optimal instruction set and measured the resulting performance changes.
The first instruction was formulated (see Table 1) based on the classification case explored by Ouyang et al. [32]. In the test environment, since we did not predetermine how many schema items a query may require, the instruction was designed to prompt the selection of the five schema items most relevant to the query. For schema items related to relations, where the KB contained fewer schema items, the instruction was modified to prompt the selection of only two items. This adjustment was consistently applied to the second and third instructions as well.
Table 1. Instructions for GPT few-shot prompting-based schema retrieval model.
The second instruction, shown in Table 1, clearly defines the objectives and scope of the schema retrieval task [33,34] and provides a detailed description of the steps required to complete the task [35].
In the third instruction, we assigned the persona of an “information specialist” to the model to enhance its accuracy. The persona pattern encourages the model to process and present information on specific topics more consistently and accurately [31,36]. Following the approach used in the study by White et al. [37], we crafted the instructions using the “act as persona X” phrase. Additionally, drawing on the methodology from the work of Xiong et al. [12], we employed a step-by-step approach to ensure the model clearly understands the task and processes information efficiently. Notably, expressions like “carefully assess” were used to emphasize the need for meticulous review in task execution [34].
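Purely for illustration, an instruction in the spirit of the third variant, combining the persona pattern with step-by-step guidance, might read as follows. This wording is a hypothetical paraphrase; the actual instructions used in the experiments are those listed in Table 1.

```python
# Hypothetical paraphrase of an instruction-3-style prompt (persona + step-by-step);
# the exact instruction text used in the experiments is given in Table 1.
INSTRUCTION_3_STYLE = """Act as an information specialist for a geographic knowledge base.
Step 1: Carefully assess the intent of the Korean question.
Step 2: Review the list of available class schema items provided below.
Step 3: Select the five schema items most relevant to the question, ordered by relevance.
Choose only items from the provided list."""
```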

3.2.2. BERT Fine-Tuning-Based Model

In this study, in order to compare the performance of the proposed few-shot prompting-based schema retrieval model with traditional fine-tuning methods, we implemented a dense schema retrieval model based on M-BERT. M-BERT, developed by Google, is a multilingual version of the BERT model. It was pre-trained on a Wikipedia corpus encompassing 104 languages and has shown strong performance in natural language processing tasks for lower-resource languages [38]. Although there are models specifically tailored for Korean, such as KR-BERT [39] and SKTBrain’s KoBERT (https://github.com/SKTBrain/KoBERT (accessed on 12 December 2024)), they have limitations in terms of validation for multilingual tasks involving Korean queries and English schema labels. Thus, M-BERT was selected due to its potential applicability to datasets in various languages.
To fine-tune M-BERT for schema retrieval, we employed a cross-encoding approach that combines natural language questions and schema items into a single model input, as outlined by Shu et al. [6]. The model’s structure is depicted in Figure 4.
Figure 4. Fine-tuning-based dense schema retrieval model. Although the illustration shows the model for class schemas, the relation and property models have the same pipeline.
In this setup, the natural language question q and the schema item s are concatenated into a single string and input into the M-BERT tokenizer, as shown in Equation (3), transforming them into a vector-embedding matrix E. Subsequently, as detailed in Equation (4), E serves as the input to M-BERT, which outputs H. The value of H corresponding to the first token, [CLS], as expressed in Equation (5), undergoes a linear transformation to produce the logit z. Finally, as shown in Equation (6), a sigmoid classification layer takes z as its input and outputs the probability that the cross-encoded schema item is relevant (as opposed to irrelevant) to the query.
E = \text{M-BERT Tokenizer}(q; s)    (3)
H = \text{M-BERT}(E)    (4)
z = w \cdot H_{[\text{CLS}]} + b    (5)
\sigma(z) = \frac{1}{1 + e^{-z}}    (6)
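To make Equations (3)–(6) concrete, the following is a minimal sketch of the cross-encoder using Hugging Face Transformers; it assumes the bert-base-multilingual-cased checkpoint and is a simplified illustration rather than the exact implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased")
linear = torch.nn.Linear(mbert.config.hidden_size, 1)  # the weights w and bias b of Eq. (5)

def relevance_probability(question: str, schema_item: str) -> torch.Tensor:
    # Eq. (3): cross-encode the question q and the schema item s as a single input
    enc = tokenizer(question, schema_item, return_tensors="pt", truncation=True)
    # Eq. (4): M-BERT produces the contextual representation H
    H = mbert(**enc).last_hidden_state
    # Eq. (5): a linear transformation of the [CLS] representation yields the logit z
    z = linear(H[:, 0, :])
    # Eq. (6): the sigmoid of z is the probability that the schema item is relevant
    return torch.sigmoid(z)
```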
As illustrated in Figure 4 and the corresponding equations, our model does not employ the traditional multi-class classification method. In traditional multi-class classification, calculating the probability for every schema item is necessary, which can be computationally demanding in terms of both time and memory. Additionally, the denominator of the softmax function grows rapidly with the number of schema items, driving the individual probability values toward very small numbers. These challenges become severe in schema retrieval due to the large number of schema items the model must consider.
To address these issues, we adopted the approach from Shu et al. [6], which incorporates negative sampling [40] into dense schema retrieval. This method simplifies the task from multi-class to binary classification by distinguishing between correct (positive data) and incorrect (negative data) schema items based on the user’s input question.
The corresponding formula is as follows:
L = -\log \sigma(z_p) - \sum_{j=1}^{n} \log \sigma(-z_j)    (7)
In Equation (7), n denotes the number of negative samples, z_p represents the logit output by the M-BERT model for the positive sample, and z_j is the logit for the j-th negative sample. Equation (7) illustrates that instead of performing a softmax operation across all schema items, a sigmoid function is used to compute the loss for the positive sample and a small number of negative samples. This approach significantly reduces memory usage and computational time.
In this study, following the methodology outlined by Shu et al. [6], we assumed that the occurrence probability of all schema items follows a uniform distribution and randomly sampled negative data accordingly. For properties, we sampled 20 negative data points for each question. For class and relation schemas, which consist of only 22 and 5 types, respectively, we treated all schemas except the correct answer as negative in order to train each model.
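A minimal sketch of the loss in Equation (7), assuming the logits come from a cross-encoder such as the one sketched above:

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(z_pos: torch.Tensor, z_neg: torch.Tensor) -> torch.Tensor:
    # Eq. (7): one positive logit (shape (1,)) and n negative logits (shape (n,))
    return -F.logsigmoid(z_pos).sum() - F.logsigmoid(-z_neg).sum()

# For property schemas, n = 20 negatives are drawn uniformly at random per question;
# for class and relation schemas, every schema item except the answer is used as a negative.
```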

4. Experimental Setup

4.1. Dataset

For fine-tuning purposes, we processed the data such that each dataset contained only one type of schema label—class, relation, or property. This resulted in three distinct datasets. Each dataset was divided into training, validation, and test data in a 7:1:2 ratio (5607:801:1602, respectively). The test data for class and property schemas were categorized according to the generalization levels (I.I.D., compositional, zero-shot), with 534 examples allocated to each category. However, due to specific schema combinations, exact proportions could not be achieved, and a tolerance of up to 16 items (approximately 1% of the test data) was allowed. For relation-specific datasets, because the SKRE KB contains only five relation types, categorization by generalization levels was not feasible. Instead, relation datasets were constructed using random sampling in the same 7:1:2 ratio.
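As a minimal illustration, and assuming random shuffling with a fixed seed (the exact shuffling procedure is not specified in the paper), the 7:1:2 split can be sketched as follows:

```python
import random

def split_7_1_2(examples, seed=0):
    # Hypothetical 7:1:2 train/validation/test split
    # (e.g., 5607:801:1602 for the 8010 processed entries).
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * 0.7)
    n_val = int(len(items) * 0.1)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```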
Unlike fine-tuning, few-shot prompting does not require the use of extensive training and validation data. Instead, a small number of few-shot examples were randomly selected, and test data were constructed using the same proportions as those used in the fine-tuning datasets. The test datasets for class and property schemas retained the same structure as the fine-tuning datasets, while relation test datasets continued to rely on random sampling due to the limited number of relation schema types. To minimize the influence of random selection, we generated three separate datasets for each experiment during the prompt optimization process. This ensured consistency and fairness in the comparative evaluation between the few-shot prompting and fine-tuning approaches.

4.2. Implementation Details

For the fine-tuning-based models, we used PyTorch and Hugging Face Transformers. We utilized the M-BERT cased model for all class, relation, and property schema retrieval tasks. The models were trained for 3 epochs with a learning rate of 5 × 10⁻⁵ and a batch size of 64. After each epoch, the loss was measured on the validation dataset, and the model with the best performance was selected as the final model.
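A condensed sketch of this training loop is shown below, assuming the cross-encoder from Section 3.2.2 is wrapped in a single torch.nn.Module; the optimizer choice (AdamW) and the compute_loss, train_loader, and val_loader helpers are assumptions rather than details reported in the paper.

```python
import copy
import torch

def fine_tune(model, compute_loss, train_loader, val_loader, epochs=3, lr=5e-5):
    # model:         the M-BERT cross-encoder (Section 3.2.2) as a torch.nn.Module
    # compute_loss:  hypothetical helper mapping the model and one batch to the Eq. (7) loss
    # train_loader / val_loader: DataLoaders over the schema retrieval dataset (batch size 64)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            optimizer.step()
        # keep the checkpoint with the lowest validation loss after each epoch
        model.eval()
        with torch.no_grad():
            val_loss = sum(compute_loss(model, b).item() for b in val_loader)
        if val_loss < best_loss:
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```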
For the few-shot prompting-based models, we utilized GPT-4 Turbo, provided through the OpenAI API.

4.3. Evaluation Metrics

For performance evaluation, we utilized hit@k. This metric checks whether the correct schema is included within the top k predictions made by the schema model. The formula is as follows:
\text{hit@}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\text{rank}_i \le k)
where N represents the total number of test data points, and \mathbb{1}(\text{rank}_i \le k) returns 1 if the rank of the correct schema among the model’s predictions for the i-th question is within the top k, and 0 otherwise.
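A minimal sketch of this metric is given below; it assumes each test question comes with the model’s ranked predictions and a set of valid (gold) schema items, and counts a hit when any valid item appears within the top k.

```python
def hit_at_k(ranked_predictions, gold_items, k):
    # ranked_predictions: list of ranked schema-item lists, one per test question
    # gold_items:         list of sets of valid schema items, one per test question
    hits = sum(
        1 for preds, gold in zip(ranked_predictions, gold_items)
        if any(p in gold for p in preds[:k])
    )
    return hits / len(ranked_predictions)
```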
We use the hit@k metric to evaluate the performance of schema retrieval tasks because the number of correct schemas can vary depending on the question, and because there can be multiple valid schemas. For instance, let us consider the instruction “Find a hospital near Seoul National University”, which can be expressed using two different Cypher queries:
  • Match (x {uuid: 'sub_123'})-[:NEARBY]->(y:Hospital);
  • Match (x:Subway {name: 'Seoul National University'})-[:NEARBY]->(y:Hospital).
Although these two queries return the same results, the first one does not utilize the “Subway” schema item and directly searches using “uuid”, which is the unique identifier of the entity “Seoul National University.” Therefore, in this experiment, we applied the hit@k metric sequentially from hit@1 to hit@5 for evaluation. This approach allows for more flexible assessment, as even if the model initially outputs an incorrect schema, this does not affect the subsequent hit@k ranges. However, for relation schemas, since there are only five types, accuracy increases rapidly from hit@3 onward, even when selected randomly, which does not accurately reflect the model’s performance. As a result, we only conducted experiments for hit@1 and hit@2 in the relation schema category.

5. Results

5.1. Dataset Construction Results

The Cypher queries in the SKRE dataset were used to extract and label class, relation, and property schemas. The processed dataset consists of a total of 8010 entries, as shown in Figure 5.
Figure 5. Dataset construction results.

5.2. Prompt Searching Results for GPT Few-Shot Prompting-Based Model

To find the most suitable prompt for Korean GeoKBQA dense schema retrieval, we conducted a three-step process that included the processing method, the number of few-shot examples, and instructions. To minimize the effect of sampling few-shot examples, each experiment was executed three times with different examples and test data. All hit@k scores represent the average of these three experiments. Due to limitations in time and cost, we applied the method that showed good results in the previous stage to the next experiment.
1. Processing Methods
As previously explained, schema retrieval through few-shot prompting can be implemented using either a combined processing method, where class, relation, and property schemas are handled simultaneously, or a divided processing method, where separate prompts are used for each schema.
Since the combined processing method dataset included all types of schema labels, it was not possible to specify a reference schema for labeling according to the generalization metrics. Therefore, we did not label the data based on generalization levels.
The experimental results of both methods are presented in Table 2. It was observed that the divided processing method, which simplifies the task by using individual prompts for each schema type, consistently outperformed the combined method across all hit@k ranges. While it is challenging to pinpoint the exact cause of these results due to the opaque reasoning process of LLMs, some patterns suggest that the combined method struggles with maintaining task-specific instructions.
Table 2. Processing method results. The average of every schema type.
For example, in the class schema results, the combined processing model frequently selected the schema item “Road” instead of “GoodWayToWalk”, despite explicit instructions in the prompt that required the model to choose only from the provided schema items within the SKRE KB. This behavior indicates that the model may have forgotten the instructions as the prompt became longer. In the combined processing method, the model must process examples and instructions for three schema types (class, relation, and property) simultaneously, which increases the complexity of the input. This additional complexity likely caused the model to retrieve a schema item (“Road”) that was not included in the given options, reflecting a failure to adhere to the task-specific constraints.
This issue highlights a challenge in GeoQA, where resolving ambiguity in geographic questions must be achieved while strictly following schema-specific constraints. For instance, in the question, “What are the best walking paths within 1200 m of Umyeon-dong?” the model should focus on the phrase “best walking paths” to retrieve the correct schema item “GoodWayToWalk” from the provided options. However, the term “paths” may lead the model to associate the question with the schema item “Road”, especially when task-specific instructions are overlooked. The divided processing method mitigates this issue by isolating each schema type into separate prompts, thereby reducing input length and complexity. This approach allows the model to better retain the task-specific instructions and adhere to the provided constraints, minimizing errors caused by forgetting instructions and instruction ambiguity.
These findings underscore the importance of designing prompts and managing task complexity to address schema retrieval challenges in GeoQA. By reducing the cognitive load on the model and improving its adherence to task-specific constraints, the divided processing method demonstrates superior performance, particularly in tasks that require the model to resolve ambiguous terms and distinguish between overlapping spatial concepts.
2. Number of Few-Shot Examples
Using a divided method, we compared the performance differences by varying the number of few-shot examples included in the prompt. The number of examples ranged from 0 to 40, increasing by increments of 10.
Table 3 shows the average hit@1 to hit@5 results. As seen in Figure 6a, performance improvements for I.I.D. and compositional data sharply decreased after using more than 20 few-shot examples for class and property schemas. However, zero-shot data showed less variation compared to the others, indicating that the model is primarily relying on its pre-trained knowledge rather than specific patterns picked up from examples.
Table 3. Number of few-shot example results.
Figure 6. Number of few-shot example results: (a) shows the average score for class and property schemas; (b) shows the relation score.
As mentioned in Section 4.1, due to the limited number of schema types in the dataset, we did not categorize the generalization levels for the relation datasets and only used hit@1 to hit@2 results to calculate the average score. The relation schema showed similar results to class and property schemas. However, as seen in Table 3, the model’s performance when there were no examples was noticeably lower. This is due to the semantic difference between relation labels and geographic questions. For instance, the relation “TRADE” represents the edge between an apartment and its recent trade price in the SKRE KB. When a user asks for the price of an apartment, it is difficult for the model to infer that “TRADE” and “price” are related without any examples. Due to the inherent ambiguity in such geographic questions, it is crucial to improve the performance of GeoKBQA by training the model with few-shot examples that help it learn the schema of the KB and related queries.
As a result, when measuring performance across all schema types based on the number of few-shot examples, we observed that the performance improvement sharply decreased after exceeding 20 examples. While performance gradually increases as the number of few-shot examples grows, the computational complexity also rises due to the increased input token size. Therefore, we determined that using 20 few-shot examples is the most efficient approach, as this is the point where the performance gains start to diminish significantly.
3. Instructions
Using the divided processing method and 20 examples, we tested three different instructions for class, property, and relation schemas. As shown in Table 4 and Table 5, we excluded the hit@1 results for class and property schemas. This exclusion was due to the existence of multiple valid schemas for a single question, as mentioned in Section 4.3. A detailed explanation can be found in Appendix A.
Table 4. Class instructions results.
Table 5. Property instructions results.
As shown in Table 4, Table 5 and Table 6, instruction 3, which used the persona technique and provided a more detailed step-by-step explanation of the task, demonstrated a better overall performance. This instruction consistently outperformed the others across all schema types, except at the I.I.D. level for property schema, where the effect of the instruction was relatively low due to the direct alignment between few-shot examples and the question.
Table 6. Relation instructions results.
The most significant difference was observed in the hit@1 scores of the relation schema, where instruction 3 outperformed instructions 1 and 2 by 0.25 and 0.13, respectively. This was due to the semantic difference between relation labels and geographic questions, as mentioned earlier. These results suggest that the less information the model has between schema labels and questions, the greater the importance of a well-crafted instruction becomes.
Despite our efforts to address the instruction-forgetting issue identified in the processing method evaluation, the issue still persisted to some extent even when providing clearer and more detailed instructions to the model. This behavior highlights the inherent vagueness and ambiguity of geographic questions, factors which often lead to challenges in schema retrieval. For instance, in the class schema, the model extracted the undefined item “Road” instead of “GoodWayToWalk” and “ExpositionCenter” instead of “Convention”, the item intended to represent a convention center. Similarly, in the relation schema, “DISTANCE” was frequently extracted for queries involving distance calculations. However, the SKRE dataset employs Neo4j’s point.distance function to calculate the actual distance between entities, and thus a specific “DISTANCE” relation is not explicitly defined in the KB.
Nonetheless, providing more detailed instructions appeared to mitigate these issues, as reflected in the results presented in Table 4, Table 5 and Table 6. These results underscore the critical role of precise and context-aware instructions in improving schema retrieval performance, particularly when addressing ambiguities or gaps between the schema labels and the query semantics.

5.3. Comparison Results for GPT Few-Shot Prompting and BERT Fine-Tuning-Based Models

To compare the performance of the two schema retrieval models, we compared the results obtained using the prompt from the few-shot prompting model, which was determined to be the most suitable (divided processing method, 20 few-shot examples, and instruction 3), with the results obtained from the fine-tuning model (Table 7).
Table 7. Comparison results for the GPT few-shot prompting and BERT fine-tuning-based models.
In terms of the I.I.D. generalization performance for class and property schemas, the fine-tuning method showed a higher performance across most hit@k ranges. However, the difference was minimal.
For compositional generalization, the fine-tuning method demonstrated a significant 7% improvement at hit@2 for class schemas, but the difference narrowed to around 1–2% afterward. For property schemas, the few-shot prompting method initially performed best, but in later stages, the fine-tuning method showed better performance. However, the performance difference between the two models was less than 1%, indicating nearly equivalent performance.
The largest performance gap between the two models was observed in zero-shot generalization scenarios. For class schemas, the few-shot prompting method outperformed the fine-tuning method by approximately 10% at hit@2, and by nearly 15% in subsequent ranges. Similarly, for property schemas, the few-shot prompting method performed about 9% better from hit@2 onward.
For relation schemas, there were not enough schema types in the KB to compare performance based on generalization metrics. However, the fine-tuning method generally showed better performance. Since there are only five types of relation schema, the likelihood of encountering an unseen schema in the test data is very low. Therefore, most of the test data are likely I.I.D. or compositional data, indicating that the fine-tuning method performs better than the few-shot prompting method in these scenarios.
Through the experiments, we confirmed that the fine-tuning-based schema retrieval model performs better in I.I.D. and compositional scenarios, where the data are relatively similar to the training data. However, in the zero-shot generalization scenario, where the data distribution is significantly different from that of the training or few-shot example data, the few-shot prompting method outperformed the fine-tuning model by a substantial margin. This result demonstrates that in zero-shot situations, pre-acquired knowledge plays a more critical role. The advantages of the GPT-4 Turbo model, which has been trained on more data and contains a larger number of parameters, are particularly evident in this context. Although the size of GPT-4 Turbo has not been disclosed, the difference in parameter counts between the previous GPT-3 model and M-BERT is approximately 1600-fold.
In schema retrieval tasks, the model must select schema items relevant to the query from the vast search space of all schemas in the KB. The probability of finding the schema item required by the user’s query in the training data or few-shot examples is very low [3]. Therefore, in real-world applications, schema retrieval using the few-shot prompting method, which demonstrates superior zero-shot performance, is likely to be more suitable.

6. Conclusions

This study constructed the first neural-based schema retrieval model for Korean GeoKBQA. Prior studies used predefined rule-based models that had limited generalization performance on undefined schema items and were unable to account for the semantic meaning of user questions. Additionally, these models required significant time and resources to define new relationships between questions and schema items whenever the language or KB changed.
To build our model, we utilized the few-shot prompting method with the GPT-4 Turbo model, which requires a considerably smaller amount of data compared to traditional fine-tuning methods. We also adopted dense schema retrieval, which is known to perform better than neighbor-based schema retrieval. To the best of our knowledge, this is the first work to construct a few-shot prompting-based dense schema retrieval model, not only in the GeoKBQA domain but also within the broader KBQA field.
Using the SKRE dataset, which contains Korean geographic queries, we constructed a schema retrieval dataset. With this dataset, we conducted various experiments that considered the processing methods, the number of few-shot examples used, and the instructions used in order to identify the best prompt for Korean GeoKBQA schema retrieval.
To evaluate our model’s performance, we also trained an M-BERT model to perform dense schema retrieval, using the traditional fine-tuning method to process Korean geographic questions. We tested the models across I.I.D., compositional, and zero-shot generalization levels to carefully compare the performance of the two approaches. The fine-tuning-based model showed better performance at the I.I.D. and compositional levels, where the distribution was similar to the training data or few-shot examples. However, the few-shot prompting-based model performed better in the zero-shot setting, where the models had to predict schema items they had never seen before. Given the large search space and the diversity of user questions, few-shot prompting may be more suitable for practical usage.
Despite being a pioneering study in the use of neural-based schema retrieval for Korean GeoKBQA, this research has its limitations. The SKRE KB we used, which is the only geographic KB that provides corresponding Korean questions, contains fewer schema types compared to modern large-scale KBs like Freebase or Wikidata. This raises the possibility of different results being obtained on larger-scale KBs. Furthermore, in the few-shot prompting-based model used in this study, all schema types in the KB are input into the model, which increases the number of input tokens as the number of schemas grows. When using large-scale KBs with many schemas, API usage costs can become an issue. Therefore, depending on the size of the KB being used, it may be necessary to combine this method with fine-tuning. Our subsequent work aims to address these existing challenges.

Author Contributions

Seokyong Lee: conceptualization, data curation, formal analysis, methodology, validation, visualization, writing—original draft, and writing—review and editing. Kiyun Yu: conceptualization, funding acquisition, project administration, and supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2022-00143336).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In Table A1 and Table A2, the hit@1 scores were noticeably lower than the other ranges across all instructions and generalization levels. As mentioned in Section 4.3, this was due to the existence of multiple valid schemas for a single question. For class schemas, the most common issue occurred when the model chose “apartment” for the apartment name mentioned in the query, but the schema was marked as incorrect because it was not used in the query. Similarly, for property schemas, the model often predicted “name” based on the entity’s name in the query, but since the query used the entity’s unique identifier “uuid” instead of “name”, it was also marked as incorrect. This does not necessarily reflect the model’s performance but depends on which schema is used in the query.
The goal of the schema retrieval model is to identify appropriate schema items for the transducer to use when constructing a query. Therefore, even if certain schema items are not ultimately used in the query, the model should still retrieve valid schemas relevant to the query. As a result, even if the hit@1 score is not high, a strong performance at hit@k indicates that the model effectively finds appropriate schema items.
Given that the performance differences were not significant compared to the experiments on processing methods or the number of few-shot examples, we used the averages of hit@2 to hit@5 for the instruction search. This approach helped to highlight the differences more clearly.
Table A1. Class instruction results.

Level      Ins.   Hit@1    Hit@2    Hit@3    Hit@4    Hit@5
I.I.D.     1      0.5110   0.9246   0.9286   0.9567   0.9567
I.I.D.     2      0.6101   0.9571   0.9594   0.9661   0.9684
I.I.D.     3      0.4352   0.9864   0.9872   0.9934   0.9986
Comp.      1      0.8751   0.9165   0.9506   0.9567   0.9865
Comp.      2      0.5687   0.9473   0.9583   0.9583   0.9583
Comp.      3      0.6071   0.9286   0.9765   0.9867   0.9898
Zero-shot  1      0.4352   0.8778   0.8792   0.8912   0.9054
Zero-shot  2      0.6901   0.9125   0.9166   0.9264   0.9354
Zero-shot  3      0.6836   0.9036   0.9534   0.9567   0.9864
Table A2. Property instruction results.

Level      Ins.   Hit@1    Hit@2    Hit@3    Hit@4    Hit@5
I.I.D.     1      0.4962   0.9350   0.9350   0.9358   0.9358
I.I.D.     2      0.5966   0.9855   0.9954   0.9954   0.9954
I.I.D.     3      0.5612   0.9855   0.9855   0.9962   0.9984
Comp.      1      0.3799   0.8781   0.8855   0.8861   0.8872
Comp.      2      0.5364   0.9848   0.9848   0.9891   0.9891
Comp.      3      0.4248   0.9935   0.9950   0.9950   0.9972
Zero-shot  1      0.5124   0.8465   0.8470   0.8481   0.8961
Zero-shot  2      0.6145   0.8943   0.8943   0.9002   0.9013
Zero-shot  3      0.6185   0.9101   0.9205   0.9315   0.9555

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language Models Are Few-Shot Learners. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  2. Mishra, A.; Jain, S.K. A Survey on Question Answering Systems with Classification. J. King Saud. Univ. Comput. Inf. Sci. 2016, 28, 345–361. [Google Scholar] [CrossRef]
  3. Gu, Y.; Kase, S.; Vanni, M.; Sadler, B.; Liang, P.; Yan, X.; Su, Y. Beyond IID: Three Levels of Generalization for Question Answering on Knowledge Bases. In Proceedings of the The Web Conference, Ljubljana, Slovenia, 19–23 April 2021; pp. 3477–3488. [Google Scholar]
  4. Chen, S.; Liu, Q.; Yu, Z.; Lin, C.-Y.; Lou, J.-G.; Jiang, F. ReTraCk: A Flexible and Efficient Framework for Knowledge Base Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Online, 1–6 August 2021; pp. 325–336. [Google Scholar]
  5. Ye, X.; Yavuz, S.; Hashimoto, K.; Zhou, Y.; Xiong, C. RNG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 6032–6043. [Google Scholar]
  6. Shu, Y.; Yu, Z.; Li, Y.; Karlsson, B.; Ma, T.; Qu, Y.; Lin, C.Y. TIARA: Multi-Grained Retrieval for Robust Question Answering over Large Knowledge Base. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 8108–8121. [Google Scholar]
  7. Yang, J.; Jang, H.; Yu, K. Geographic Knowledge Base Question Answering over OpenStreetMap. ISPRS Int. J. Geo-Inf. 2024, 13, 10. [Google Scholar] [CrossRef]
  8. Yang, T. Developing a Transformer-Based Natural Language Entity Linking Model to Improve the Performance of GeoKBQA. Master’s Thesis, Seoul National University, Seoul, Republic of Korea, 2023; pp. 1–96. [Google Scholar]
  9. Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD’08), Vancouver, BC, Canada, 9–12 June 2008; Association for Computing Machinery: New York, NY, USA, 2008; pp. 1247–1250. [Google Scholar] [CrossRef]
  10. Vrandečić, D.; Krötzsch, M. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM 2014, 57, 78–85. [Google Scholar] [CrossRef]
  11. Li, T.; Ma, X.; Zhuang, A.; Gu, Y.; Su, Y.; Chen, W. Few-Shot In-Context Learning on Knowledge Base Question Answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 6966–6980. [Google Scholar]
  12. Xiong, G.; Bao, J.; Zhao, W. Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models. arXiv 2024, arXiv:2402.15131. [Google Scholar]
  13. Kwiatkowski, T.; Choi, E.; Artzi, Y.; Zettlemoyer, L. Scaling Semantic Parsers with On-the-Fly Ontology Matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Association for Computational Linguistics: Seattle, WA, USA, 2013; pp. 1545–1556. [Google Scholar]
  14. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
  15. Reynolds, L.; McDonell, K. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. In Proceedings of the Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (CHI EA’21), New York, NY, USA, 2–7 June 2021; Association for Computing Machinery: New York, NY, USA, 2021. Article 314. pp. 1–7. [Google Scholar]
  16. Mai, G.; Janowicz, K.; Zhu, R.; Cai, L.; Lao, N. Geographic Question Answering: Challenges, Uniqueness, Classification, and Future Directions. AGILE GISci. Ser. 2021, 2, 8. [Google Scholar] [CrossRef]
  17. Punjani, D.; Singh, K.; Both, A.; Koubarakis, M.; Angelidis, I.; Bereta, K.; Beris, T.; Bilidas, D.; Ioannidis, T.; Karalis, N.; et al. Template-Based Question Answering over Linked Geospatial Data. In Proceedings of the 12th Workshop on Geographic Information Retrieval (GIR’18), New York, NY, USA, 6 November 2018; Association for Computing Machinery: New York, NY, USA, 2018. Article 7. pp. 1–10. [Google Scholar]
  18. Hamzei, E.; Tomko, M.; Winter, S. Translating Place-Related Questions to GeoSPARQL Queries. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 902–911. [Google Scholar]
  19. Kefalidis, S.-A.; Punjani, D.; Tsalapati, E.; Plas, K.; Pollali, M.; Mitsios, M.; Tsokanaridou, M.; Koubarakis, M.; Maret, P. Benchmarking Geospatial Question Answering Engines Using the Dataset GeoQuestions1089. In Proceedings of the Semantic Web—ISWC 2023: 22nd International Semantic Web Conference, Athens, Greece, 6–10 November 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 266–284. [Google Scholar]
  20. Shi, J.; Cao, S.; Hou, L.; Li, J.; Zhang, H. TransferNet: An Effective and Transparent Framework for Multi-Hop Question Answering over Relation Graph. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Online and Punta Cana, Dominican Republic, 2021; pp. 4149–4158. [Google Scholar]
  21. Yih, W.T.; Richardson, M.; Meek, C.; Chang, M.W.; Suh, J. The Value of Semantic Parse Labeling for Knowledge Base Question Answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Berlin, Germany, 2016; Volume 2, pp. 201–206. [Google Scholar]
  22. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; McGrew, B. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
  24. Wu, L.; Petroni, F.; Josifoski, M.; Riedel, S.; Zettlemoyer, L. Scalable Zero-Shot Entity Linking with Dense Entity Retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online, 16–20 November 2020; Association for Computational Linguistics: Online, 2020; pp. 6397–6407. [Google Scholar]
  25. Chiu, J.; Shinzato, K. Cross-Encoder Data Annotation for Bi-Encoder Based Product Matching. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 161–168. [Google Scholar]
  26. Han, W.; Jiang, Y.; Ng, H.T.; Tu, K. A Survey of Unsupervised Dependency Parsing. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; International Committee on Computational Linguistics: Online and Barcelona, Spain, 2020; pp. 2522–2533. [Google Scholar]
  27. Talmor, A.; Berant, J. The Web as a Knowledge Base for Answering Complex Questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: New Orleans, LA, USA, 2018; Volume 1, pp. 641–651. [Google Scholar]
  28. Zhao, Z.; Wallace, E.; Feng, S.; Klein, D.; Singh, S. Calibrate Before Use: Improving Few-Shot Performance of Language Models. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021; pp. 12697–12706. [Google Scholar]
  29. Min, S.; Lyu, X.; Holtzman, A.; Artetxe, M.; Lewis, M.; Hajishirzi, H.; Zettlemoyer, L. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 11048–11064. [Google Scholar]
  30. Shrawgi, H.; Rath, P.; Singhal, T.; Dandapat, S. Uncovering Stereotypes in Large Language Models: A Task Complexity-Based Approach. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian’s, Malta, 7–22 March 2024; Association for Computational Linguistics: St. Julian’s, Malta, 2024; pp. 1841–1857. [Google Scholar]
  31. Sheetrit, E.; Brief, M.; Mishaeli, M.; Elisha, O. ReMatch: Retrieval Enhanced Schema Matching with LLMs. arXiv 2024, arXiv:2403.01567. [Google Scholar]
  32. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A. Training Language Models to Follow Instructions with Human Feedback. Neural. Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
  33. Zamfirescu-Pereira, J.D.; Wong, R.Y.; Hartmann, B.; Yang, Q. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI’23), Hamburg, Germany, 23–28 April 2023; Association for Computing Machinery: New York, NY, USA, 2023. Article 437. pp. 1–21. [Google Scholar] [CrossRef]
  34. Bsharat, S.M.; Myrzakhan, A.; Shen, Z. Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4. arXiv 2023, arXiv:2312.16171. [Google Scholar]
  35. Wang, L.; Xu, W.; Lan, Y.; Hu, Z.; Lan, Y.; Lee, R.K.-W.; Lim, E.-P. Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 2609–2634. [Google Scholar]
  36. Deshpande, A.; Murahari, V.; Rajpurohit, T.; Kalyan, A.; Narasimhan, K. Toxicity in ChatGPT: Analyzing Persona-Assigned Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Association for Computational Linguistics: Singapore, 2023; pp. 1236–1270. [Google Scholar]
  37. White, J.; Fu, Q.; Hays, S.; Sandborn, M.; Olea, C.; Gilbert, H.; Elnashar, A.; Spencer-Smith, J.; Schmidt, D.C. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv 2023, arXiv:2302.11382. [Google Scholar]
  38. Pires, T.; Schlinger, E.; Garrette, D. How Multilingual Is Multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Florence, Italy, 2019; pp. 4996–5001. [Google Scholar]
  39. Lee, S.; Jang, H.; Baik, Y.; Park, S.; Shin, H. KR-BERT: A Small-Scale Korean-Specific Language Model. arXiv 2020, arXiv:2008.03979. [Google Scholar]
  40. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems—Volume 2, Lake Tahoe, NV, USA, 5–8 December 2013; Curran Associates Inc.: Red Hook, NY, USA, 2013; Volume 2, pp. 3111–3119. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
