ISPRS International Journal of Geo-Information
  • Article
  • Open Access

15 December 2024

Schema Retrieval for Korean Geographic Knowledge Base Question Answering Using Few-Shot Prompting

Seokyong Lee and Kiyun Yu
Department of Civil and Environmental Engineering, Seoul National University, Seoul 08826, Republic of Korea
* Author to whom correspondence should be addressed.

Abstract

Geographic Knowledge Base Question Answering (GeoKBQA) has garnered increasing attention for its ability to process complex geographic queries. This study focuses on schema retrieval, a critical step in GeoKBQA that involves extracting the relevant schema items (classes, relations, and properties) needed to generate accurate operational queries. Current GeoKBQA studies primarily rely on rule-based approaches for schema retrieval, which predefine words or descriptions for each schema item. This rule-based method has three critical limitations: (1) poor generalization to undefined schema items, (2) failure to consider the semantic meaning of user queries, and (3) an inability to adapt to languages not covered in the predefinition step. In this study, we present a schema retrieval model that uses few-shot prompting on GPT-4 Turbo to address these issues. Using the SKRE dataset, we searched for the prompt that best enables the model to handle Korean geographic questions across various generalization levels. Notably, this method outperformed fine-tuning in zero-shot scenarios, underscoring its adaptability to unseen data. To our knowledge, this is the first schema retrieval model for GeoKBQA that relies purely on a language model and is capable of processing Korean geographic questions.

1. Introduction

Recent advances in natural language processing (NLP) have been substantial. In particular, technological progress has enabled training with large-scale datasets, leading to the development of large neural network-based language models. These models have demonstrated high performance in various NLP tasks such as document summarization, sentiment analysis, and machine translation, driving many innovations [1]. One of the tasks in NLP, Question Answering (QA), involves generating or retrieving answers to user queries [2]. Research has been conducted on Knowledge Base Question Answering (KBQA), which builds QA systems by finding answers in a structured, graph-form database known as a Knowledge Base (KB) [3,4,5,6].
To implement KBQA, components such as entity linking, schema retrieval, and transducers are used [6,7]. Entity linking involves identifying entities mentioned in natural language questions and linking them to their corresponding entities in the actual KB, clarifying their identity [8]. Schema retrieval extracts schema items (classes, relations, properties) that are directly or indirectly related to the question from the KB [6]. Lastly, the transducer creates a logical structure that can query the KB based on the results of entity linking and schema retrieval [7].
While entity linking and transducers are integral to the KBQA pipeline, this study specifically focuses on schema retrieval. This is because schema retrieval directly influences the accuracy of the logical structure generated by the transducer and ensures that the operational query aligns with the KB’s structure. Errors in schema retrieval propagate to downstream components, leading to invalid queries and incorrect results. With the emergence of large-scale KBs that contain tens of thousands of schema items, such as Freebase [9] and Wikidata [10], schema retrieval has become a critical challenge. Many recent studies have prioritized schema retrieval as a core task in KBQA systems [3,4,5,6,11,12].
The schema of a KB can be defined in terms of classes, relations, and properties [6]. A class defines the category or type of entity. For example, the class “university” refers to a type that includes real-world entities like “Seoul National University”. A relation represents the connection between entities or classes. The entities “Seoul” and “Seoul National University” can be linked by the relation “LOCATED_IN”, indicating geographical inclusion. Lastly, a property describes the characteristics of a class or entity. The entity “Seoul National University” could have properties like “coord”, indicating its geographical coordinates, and “area”, showing the size of the area. Schema retrieval plays an important role in the KBQA process. It ensures that the schema items used in a generated query match the KB structure. Even a single mismatch between a schema item and the KB schema can result in an invalid query, making it impossible to retrieve an answer [13].
Recent advancements in large language models (LLMs) have significantly influenced schema retrieval research. Schema retrieval using LLMs can be implemented through two primary approaches: fine-tuning and few-shot prompting [3,4,5,6,11,12]. Fine-tuning, however, requires a large amount of training data [14]. These training data, which consist of natural language questions paired with corresponding schema items, vary widely because each KB has its own schema ontology. These differences lead to compatibility issues between datasets and create challenges in adapting fine-tuned models to new KBs or languages. Since building datasets for schema retrieval models whenever the KB or language of user questions changes consumes considerable resources and time, research has been conducted on schema retrieval models using few-shot prompting techniques, which do not require large datasets [11,12].
Few-shot prompting is a method of in-context learning where the model performs new tasks using only the context provided [1]. This involves presenting the language model with a prompt that includes instructions for the task and a few examples, enabling the model to infer how to perform the task on its own [15]. Recent research has used few-shot prompting with language models such as OpenAI’s GPT-4 Turbo and code-davinci-002 to perform schema retrieval [11,12]. Unlike fine-tuning, with simple changes in instructions and the use of few-shot examples, schema retrieval models using few-shot prompting can be applied to various KBs and languages.
However, existing studies have primarily focused on neighbor schema retrieval, which only targets the schemas immediately surrounding the explicit entities mentioned in a question [11,12]. This narrow focus can significantly increase both the complexity and the time required for querying the knowledge base when dealing with intricate questions involving multiple relational steps, known as hops. For instance, in a question like “Where is the nearest parking spot from the park, which is located near the cheapest apartments in Seoul?”, there is a four-hop relation between “Seoul” and “parking spot”. The neighbor method must traverse all entities within a four-hop range of “Seoul” to identify the relevant relations. This process is computationally expensive and time-consuming. Moreover, this method also fails to handle queries that do not explicitly mention an entity, such as “Where is the highest mountain?” [6].
To overcome these limitations, it is more effective to employ a dense retrieval approach that considers all available schema items within the KB, or to use a hybrid method that combines both dense and neighbor-based approaches. In the fine-tuning domain, this strategy has been shown to improve performance significantly [4,6]. Despite the potential of these methods, research on dense schema retrieval models utilizing few-shot prompting remains unexplored. This approach is particularly advantageous because it eliminates the need for extensive training datasets, making it more adaptable and efficient.
GeoQA is an extended domain of the QA system that is designed to respond to geographic questions [7]. Geographic questions involve geographic entities, concepts (such as specific types like buildings, cities, or states), or spatial relationships [16]. As with general QA systems, research in the GeoQA field has been conducted on GeoKBQA, which uses structured graph-form KBs to answer users’ geographic queries [7,17,18]. However, unlike KBQA studies that utilize fine-tuning or few-shot prompting with language models for schema retrieval, GeoKBQA research has traditionally relied on a rule-based approach [17,18,19]. This method involves predefining words or descriptions linked to specific schema items and matching them with the text in user questions. Such a rule-based system necessitates frequent updates whenever the KB schema changes and struggles with questions that lack predefined schema relationships, leading to poor generalization. This limitation becomes particularly problematic for Korean geographic questions, where each schema item must have predefined Korean terms. However, no studies have yet addressed the development of a rule-based model for Korean questions, which would require significant resources and time to manage all relevant expressions effectively.
This reliance on rule-based systems is further complicated by the unique characteristics of geographic questions, which require context-sensitive schema retrieval. A schema retrieval model must identify the appropriate spatial relationship based on the context of the question: when asked about the distance between Paris and Beijing, the model should retrieve the “distance” relation, while for a question concerning the distance between Canada and the USA, it should retrieve the “adjacent” relation, reflecting their shared border and geographic scale. Additionally, geographic concepts and entities often involve inherent vagueness. The term “Amazon”, for instance, may correspond to different classes in the knowledge base, such as “river”, “rainforest”, or “company”. A schema retrieval model must infer the correct context to associate the term with the appropriate class. Rule-based systems are unable to address these complexities, as they are restricted to matching predefined keywords and fail to consider the broader meaning or relationships within the question. In contrast, schema retrieval models that leverage LLMs dynamically interpret relationships and nuances, making them well suited to answering geographic questions.
To address these challenges, this study introduces a dense schema retrieval model for Korean GeoKBQA using few-shot prompting with an LLM. Dense schema retrieval, which evaluates all schema items in the KB against the query, is particularly effective for handling multi-hop queries and questions that lack explicit entity references. Few-shot prompting further enhances the adaptability of schema retrieval by enabling LLMs to infer relationships between questions and schema items directly from a given prompt, eliminating the need for predefined schema relationships or extensive training datasets. This study also compares the few-shot prompting approach with a fine-tuned Multilingual BERT (M-BERT) [14] model to evaluate their performance across different generalization levels.
The primary contributions of this work are as follows:
  • We develop a language model-based schema retrieval model for GeoKBQA: the proposed model addresses the limitations of traditional rule-based methods by dynamically inferring relationships between queries and schema items, demonstrating strong generalization capabilities.
  • We create a prompt for few-shot prompting-based schema retrieval: using the Spatial Knowledge Reasoning Engine (SKRE) dataset, we developed optimal prompts tailored to the dense schema retrieval of Korean geographic questions.
  • We adapt few-shot prompting techniques for dense schema retrieval: this study leverages few-shot prompting to handle complex, multi-hop, and entity-less queries, providing a robust alternative to fine-tuning-based methods.
This paper is organized as follows: Section 2 reviews related work, including schema retrieval in KBQA, GeoKBQA methods, and the limitations of existing schema retrieval models. Section 3 describes the methodology, including dataset construction, generalization levels, the process of finding the optimal prompt for the few-shot prompting model, and the proposed schema retrieval models. Section 4 discusses the experimental setup, including data preparation, implementation details, and evaluation metrics. In Section 5, we present the results of the prompt optimization process and a comparative performance analysis of the few-shot prompting and fine-tuning approaches. Finally, Section 6 concludes with key findings, implications, and directions for future research.

3. Methodology

As can be seen in Figure 1, the experiments in this study are organized into three main steps. First, we modify the SKRE dataset, which includes Korean geographic questions and corresponding Cypher queries, to create a schema retrieval dataset. Second, using this dataset, we identify the most suitable prompt to perform few-shot prompting in dense schema retrieval. During this process, we evaluate the generalization performance to analyze the effects of different prompts. Lastly, to validate the effectiveness of the few-shot prompting-based schema retrieval model, we fine-tune an M-BERT model on the same dataset.
Figure 1. The overall pipeline of our experiment.

3.1. Dataset Construction and Generalization Levels

To construct a test dataset capable of comparing models based on fine-tuning and few-shot prompting, we extracted schema labels from the Cypher queries contained in the SKRE dataset. We organized all schema items that constitute the KB into a dataset and used these to identify the schemas present in the queries. The processed data comprise spatially related natural language questions and their corresponding class, relation, and property schema items.
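The extraction rules are not spelled out in the paper; the following is a minimal regex-based sketch, in which the patterns, function name, and sample query are illustrative assumptions, of how class, relation, and property labels could be pulled from a Cypher string.

```python
import re

def extract_schema_items(cypher: str) -> dict:
    # Node labels such as (y:Hospital) -> class schema items
    classes = re.findall(r"\(\s*\w*\s*:\s*(\w+)", cypher)
    # Relationship types such as [:NEARBY] -> relation schema items
    relations = re.findall(r"\[\s*\w*\s*:\s*(\w+)", cypher)
    # Property keys such as {name: ...} -> property schema items
    # (node-label patterns are stripped first so labels are not counted as keys)
    stripped = re.sub(r"\(\s*\w*\s*:\s*\w+", "", cypher)
    properties = re.findall(r"(\w+)\s*:", stripped)
    return {"class": sorted(set(classes)),
            "relation": sorted(set(relations)),
            "property": sorted(set(properties))}

query = "Match (x:Subway {name: 'Seoul National University'})-[:NEARBY]->(y:Hospital) RETURN y"
print(extract_schema_items(query))
# {'class': ['Hospital', 'Subway'], 'relation': ['NEARBY'], 'property': ['name']}
```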
To obtain a precise evaluation of the schema retrieval models’ generalization performance, we adapted Gu et al.’s [3] transducer generalization levels to conduct the schema retrieval task. However, we did not consider the functions of the query since we only focused on schema items. The generalization measures used to evaluate our proposed schema retrieval models include Independent and Identically Distributed (I.I.D.), compositional, and zero-shot levels. These levels are illustrated in Figure 2, which provides examples using class schema items. While the examples focus on class schema items for clarity, the same generalization methodology is applied to relation and property schema items as well. The left side of Figure 2 shows few-shot examples or training data used for our models that associate natural language queries with their corresponding class schema items. The right side demonstrates test data categorized into the three generalization levels, with each designed to evaluate the model’s performance under different conditions.
Figure 2. Three generalization levels of test data (class schema case). As shown on the right side, the I.I.D. level uses the same schema combinations as are seen in the few-shot training data (highlighted in blue). The compositional level introduces new combinations of schema items seen during the training process (highlighted in blue and green). Lastly, the zero-shot level evaluates the model’s ability to handle completely new schema items (highlighted in red) that the model has never encountered before.
  • I.I.D. generalization: This level assesses the model’s ability to handle queries that align with the schema items and question structures seen in the few-shot examples or training data. For example, as shown on the left side of Figure 2, there is a training data point that asks “Can you tell me about apartments in the Umyeon-dong district that have a daycare center nearby?” and employs the class schema items “apartment”, “DistrictBoundaryDong”, and “daycare”. A corresponding I.I.D. level test data point could be “Find apartments near a daycare in Seocho-dong district”, which uses a similar question structure and an identical schema combination.
  • Compositional generalization: This level evaluates the model’s ability to process new combinations of schema items encountered during training. For instance, the training data in Figure 2 include the query “Can you tell me about apartments in the Umyeon-dong district that have a daycare center nearby?”, with the class schema items “apartment”, “DistrictBoundaryDong”, and “daycare”, and the query “Find the school closest to Jamwon Elementary School”, with the schema item “school”. A compositional test query might combine these schema items in a new way, asking “Can you tell me about apartments that have a school nearby in Umyeon-dong?”, which requires the use of “apartment”, “school”, and “DistrictBoundaryDong”.
  • Zero-shot generalization: This level measures the model’s ability to handle schema items it has never encountered during training. For example, if the training data contain no mention of the class schema “reputation”, a zero-shot level query like “Can you tell me about the social media reviews for Raemian apartments?” evaluates the model’s ability to infer and retrieve this unseen schema item alongside the “apartment” item.
By assessing the models across these diverse generalization scenarios, we can evaluate whether the schema retrieval models are merely memorizing training data or actively adapting to the complexities and changes in the real world.
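The exact labeling procedure for the SKRE test items is not detailed here; the sketch below illustrates the three-way distinction under the assumption that levels are assigned by comparing each test question’s schema combination with those appearing in the training or few-shot examples (function and variable names are illustrative).

```python
def generalization_level(test_schemas, train_combinations):
    # test_schemas:        set of schema items required by a test question
    # train_combinations:  list of schema-item sets, one per training/few-shot example
    seen_items = set().union(*train_combinations)
    if any(test_schemas == combo for combo in train_combinations):
        return "I.I.D."          # identical schema combination appeared before
    if test_schemas <= seen_items:
        return "compositional"   # only previously seen items, but in a new combination
    return "zero-shot"           # contains at least one unseen schema item

train = [{"apartment", "DistrictBoundaryDong", "daycare"}, {"school"}]
print(generalization_level({"apartment", "DistrictBoundaryDong", "daycare"}, train))  # I.I.D.
print(generalization_level({"apartment", "school", "DistrictBoundaryDong"}, train))   # compositional
print(generalization_level({"apartment", "reputation"}, train))                       # zero-shot
```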

3.2. Schema Retrieval Models

We utilized the constructed dataset to perform dense schema retrieval using both few-shot prompting and fine-tuning methods, as proposed in our study. For the few-shot prompting approach, we recognized that performance could vary depending on the number of instructions and examples included in the prompt. Consequently, we conducted experiments under various conditions to identify the optimal prompt configuration. For the fine-tuning approach, we employed the M-BERT model, which is capable of processing Korean questions. We enhanced computational efficiency through negative sampling. Finally, we compared the generalization performance of both models using the previously described I.I.D., compositional, and zero-shot generalization levels.

3.2.1. GPT Few-Shot Prompting-Based Model

In this study, we propose a schema retrieval model that employs few-shot prompting with the GPT-4 Turbo model provided by the OpenAI API. Specifically, we implemented a dense schema retrieval model that compares all schema items in the KB with the questions. As previously discussed, unlike the neighbor approach, the dense method can handle complex questions or those that do not include explicit entities, offering significant advantages [4,6]. By utilizing few-shot prompting for schema retrieval, this approach allows for adaptability to changes in the KB or schema without the need for retraining that is typical with fine-tuning methods. Instead, simple modifications to the prompt can accommodate these changes without altering the model’s parameters. The prompts used in the experiments include instructions and few-shot examples containing schema items that the model can select, facilitating schema retrieval.
As illustrated in Figure 3, few-shot prompting differs fundamentally from the rule-based approach commonly used in the GeoKBQA domain. It allows the GPT-4 Turbo model to dynamically interpret the semantic meaning of input questions through task descriptions and few-shot examples. This approach enables the model to flexibly retrieve schema items without the need to predefine every relationship between schema items and words or descriptions. Moreover, it is relatively independent of KB and language changes, requiring only adjustments to the schema list in the task description and corresponding examples.
Figure 3. Few-shot prompting-based dense schema retrieval models. In its actual implementation, Korean geographic questions were used.
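As a concrete illustration of the pipeline in Figure 3, the following sketch assembles a prompt from an instruction, the full schema list of one schema type, and a configurable number of few-shot examples, and sends it to GPT-4 Turbo through the OpenAI API. The instruction wording, helper names, model identifier, and parameter values are illustrative assumptions rather than the exact configuration used in the experiments; because the call handles one schema type at a time with a variable number of examples, it also previews the processing-method and example-count factors examined below.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

def build_prompt(instruction, schema_list, examples, question):
    # instruction: task description (e.g., "Select the five most relevant class schema items ...")
    # schema_list: every schema item of one type in the KB (dense retrieval)
    # examples:    few-shot (question, schema items) pairs
    lines = [instruction, "Available schema items: " + ", ".join(schema_list), ""]
    for q, items in examples:
        lines.append(f"Question: {q}\nSchema items: {', '.join(items)}")
    lines.append(f"Question: {question}\nSchema items:")
    return "\n".join(lines)

def retrieve_schema(instruction, schema_list, examples, question, n_examples=20):
    # Divided processing: call this once per schema type (class, relation, property)
    # for the same question and combine the three result lists downstream.
    prompt = build_prompt(instruction, schema_list, examples[:n_examples], question)
    response = client.chat.completions.create(
        model="gpt-4-turbo",                     # illustrative model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```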
However, the performance of models based on few-shot prompting is highly sensitive to the instructions and examples included within the prompt [28,29]. Therefore, this study uses a three-step process to explore the most effective prompts for schema retrieval tasks in Korean GeoKBQA.
1. Processing Methods
While it is common to train separate models for class, relation, and property schemas in dense schema retrieval using fine-tuning [4,6], few-shot prompting allows for the creation of prompts that can search all three types of schemas simultaneously. This approach can complete all tasks with a single API call, potentially reducing inference time. However, the increase in task complexity could lead to reduced performance; simpler tasks performed through prompts are likely to yield better performance [28,30]. We conducted experiments to compare the combined and divided task approaches. An example of the prompts used in the experiments is shown in Figure 3. The right image only represents the divided task method for class schemas. In the actual experiments, searches for class, relation, and property schemas are conducted separately for the same input question, and the results are synthesized.
2. Number of Few-Shot Examples
Few-shot examples are used to enhance the model’s understanding of schema retrieval tasks by demonstrating specific examples. In classification tasks like schema retrieval, providing a greater number of examples generally increases model accuracy; however, beyond a certain point, the rate of performance improvement significantly diminishes [11,28,29]. As the prompt lengthens, both the amount of data to be processed and the computational complexity increase, making it crucial to determine the optimal number of examples. In this study, we systematically increased the number of examples included in the prompt from 0 to 40, in increments of 10, to observe performance changes and identify the most efficient number of few-shot examples.
3. Instructions
To generate information accurately and effectively, it is necessary to provide the language model with specific instructions [15]. In the context of schema retrieval tasks, not only should the task description be included, but a list of schema items that the model can select must also be incorporated into the instructions [31]. As with the number of few-shot examples, the way the schema retrieval task is described to the model significantly affects its performance [28,29]. Therefore, this study developed three different instructions to identify the optimal instruction set and measured the resulting performance changes.
The first instruction was formulated (see Table 1) based on the classification case explored by Ouyang et al. [32]. In the test environment, since we did not predetermine how many schema items a query may require, the instruction was designed to prompt the selection of the five schema items most relevant to the query. For schema items related to relations, where the KB contained fewer schema items, the instruction was modified to prompt the selection of only two items. This adjustment was consistently applied to the second and third instructions as well.
Table 1. Instructions for GPT few-shot prompting-based schema retrieval model.
The second instruction, shown in Table 1, clearly defines the objectives and scope of the schema retrieval task [33,34] and provides a detailed description of the steps required to complete the task [35].
In the third instruction, we assigned the persona of an “information specialist” to the model to enhance its accuracy. The persona pattern encourages the model to process and present information on specific topics more consistently and accurately [31,36]. Following the approach used in the study by White et al. [37], we crafted the instructions using the “act as persona X” phrase. Additionally, drawing on the methodology from the work of Xiong et al. [12], we employed a step-by-step approach to ensure the model clearly understands the task and processes information efficiently. Notably, expressions like “carefully assess” were used to emphasize the need for meticulous review in task execution [34].
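Purely for illustration, an instruction in the spirit of the third variant, combining the persona pattern with step-by-step guidance, might read as follows. This wording is a hypothetical paraphrase; the actual instructions used in the experiments are those listed in Table 1.

```python
# Hypothetical paraphrase of an instruction-3-style prompt (persona + step-by-step);
# the exact instruction text used in the experiments is given in Table 1.
INSTRUCTION_3_STYLE = """Act as an information specialist for a geographic knowledge base.
Step 1: Carefully assess the intent of the Korean question.
Step 2: Review the list of available class schema items provided below.
Step 3: Select the five schema items most relevant to the question, ordered by relevance.
Choose only items from the provided list."""
```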

3.2.2. BERT Fine-Tuning-Based Model

In this study, in order to compare the performance of the proposed few-shot prompting-based schema retrieval model with traditional fine-tuning methods, we implemented a dense schema retrieval model based on M-BERT. M-BERT, developed by Google, is a multilingual version of the BERT model. It was pre-trained on a Wikipedia corpus encompassing 104 languages and has shown strong performance in natural language processing tasks for lower-resource languages [38]. Although there are models specifically tailored for Korean, such as KR-BERT [39] and SKTBrain’s KoBERT (https://github.com/SKTBrain/KoBERT (accessed on 12 December 2024)), they have limitations in terms of validation for multilingual tasks involving Korean queries and English schema labels. Thus, M-BERT was selected due to its potential applicability to datasets in various languages.
To fine-tune M-BERT for schema retrieval, we employed a cross-encoding approach that combines natural language questions and schema items into a single model input, as outlined by Shu et al. [6]. The model’s structure is depicted in Figure 4.
Figure 4. Fine-tuning-based dense schema retrieval model. Although the illustration shows the model for class schemas, the relation and property models have the same pipeline.
In this setup, the natural language question q and the schema item s are concatenated into a single string and input into the M-BERT tokenizer, as shown in Equation (3), transforming them into a vector-embedding matrix E. Subsequently, as detailed in Equation (4), E serves as the input to M-BERT, which outputs H. The value of H corresponding to the first token, [CLS], as expressed in Equation (5), undergoes a linear transformation to produce the logit z. Finally, as shown in Equation (6), a sigmoid classification layer takes z as its input and outputs the probability that the cross-encoded schema item is relevant (as opposed to irrelevant) to the query.
E = \text{M-BERT Tokenizer}(q; s)    (3)
H = \text{M-BERT}(E)    (4)
z = w \cdot H_{[\text{CLS}]} + b    (5)
\sigma(z) = \frac{1}{1 + e^{-z}}    (6)
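To make Equations (3)–(6) concrete, the following is a minimal sketch of the cross-encoder using Hugging Face Transformers; it assumes the bert-base-multilingual-cased checkpoint and is a simplified illustration rather than the exact implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased")
linear = torch.nn.Linear(mbert.config.hidden_size, 1)  # the weights w and bias b of Eq. (5)

def relevance_probability(question: str, schema_item: str) -> torch.Tensor:
    # Eq. (3): cross-encode the question q and the schema item s as a single input
    enc = tokenizer(question, schema_item, return_tensors="pt", truncation=True)
    # Eq. (4): M-BERT produces the contextual representation H
    H = mbert(**enc).last_hidden_state
    # Eq. (5): a linear transformation of the [CLS] representation yields the logit z
    z = linear(H[:, 0, :])
    # Eq. (6): the sigmoid of z is the probability that the schema item is relevant
    return torch.sigmoid(z)
```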
As illustrated in Figure 4 and the corresponding equations, our model does not employ the traditional multi-class classification method. In traditional multi-class classification, calculating the probability for every schema item is necessary, which can be computationally demanding in terms of both time and memory. Additionally, the denominator of the softmax function grows rapidly with the number of schema items, driving the individual probability values toward very small numbers. These challenges become severe in schema retrieval due to the large number of schema items the model must consider.
To address these issues, we adopted the approach from Shu et al. [6], which incorporates negative sampling [40] into dense schema retrieval. This method simplifies the task from multi-class to binary classification by distinguishing between correct (positive data) and incorrect (negative data) schema items based on the user’s input question.
The corresponding formula is as follows:
L = -\log \sigma(z_p) - \sum_{j=1}^{n} \log \sigma(-z_j)    (7)
In Equation (7), n denotes the number of negative samples, z_p represents the logit output by the M-BERT model for the positive sample, and z_j is the logit for the j-th negative sample. Equation (7) illustrates that instead of performing a softmax operation across all schema items, a sigmoid function is used to compute the loss for the positive sample and a small number of negative samples. This approach significantly reduces memory usage and computational time.
In this study, following the methodology outlined by Shu et al. [6], we assumed that the occurrence probability of all schema items follows a uniform distribution and randomly sampled negative data accordingly. For properties, we sampled 20 negative data points for each question. For class and relation schemas, which consist of only 22 and 5 types, respectively, we treated all schemas except the correct answer as negative in order to train each model.
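A minimal sketch of the loss in Equation (7), assuming the logits come from a cross-encoder such as the one sketched above:

```python
import torch
import torch.nn.functional as F

def negative_sampling_loss(z_pos: torch.Tensor, z_neg: torch.Tensor) -> torch.Tensor:
    # Eq. (7): one positive logit (shape (1,)) and n negative logits (shape (n,))
    return -F.logsigmoid(z_pos).sum() - F.logsigmoid(-z_neg).sum()

# For property schemas, n = 20 negatives are drawn uniformly at random per question;
# for class and relation schemas, every schema item except the answer is used as a negative.
```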

4. Experimental Setup

4.1. Dataset

For fine-tuning purposes, we processed the data such that each dataset contained only one type of schema label—class, relation, or property. This resulted in three distinct datasets. Each dataset was divided into training, validation, and test data in a 7:1:2 ratio (5607:801:1602, respectively). The test data for class and property schemas were categorized according to the generalization levels (I.I.D., compositional, zero-shot), with 534 examples allocated to each category. However, due to specific schema combinations, exact proportions could not be achieved, and a tolerance of up to 16 items (approximately 1% of the test data) was allowed. For relation-specific datasets, because the SKRE KB contains only five relation types, categorization by generalization levels was not feasible. Instead, relation datasets were constructed using random sampling in the same 7:1:2 ratio.
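As a minimal illustration, and assuming random shuffling with a fixed seed (the exact shuffling procedure is not specified in the paper), the 7:1:2 split can be sketched as follows:

```python
import random

def split_7_1_2(examples, seed=0):
    # Hypothetical 7:1:2 train/validation/test split
    # (e.g., 5607:801:1602 for the 8010 processed entries).
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * 0.7)
    n_val = int(len(items) * 0.1)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```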
Unlike fine-tuning, few-shot prompting does not require the use of extensive training and validation data. Instead, a small number of few-shot examples were randomly selected, and test data were constructed using the same proportions as those used in the fine-tuning datasets. The test datasets for class and property schemas retained the same structure as the fine-tuning datasets, while relation test datasets continued to rely on random sampling due to the limited number of relation schema types. To minimize the influence of random selection, we generated three separate datasets for each experiment during the prompt optimization process. This ensured consistency and fairness in the comparative evaluation between the few-shot prompting and fine-tuning approaches.

4.2. Implementation Details

For the fine-tuning-based models, we used PyTorch and Hugging Face Transformers. We utilized the M-BERT cased model for all class, relation, and property schema retrieval tasks. The models were trained for 3 epochs with a learning rate of 5 × 10⁻⁵ and a batch size of 64. After each epoch, the loss was measured on the validation dataset, and the model with the best performance was selected as the final model.
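A condensed sketch of this training loop is shown below, assuming the cross-encoder from Section 3.2.2 is wrapped in a single torch.nn.Module; the optimizer choice (AdamW) and the compute_loss, train_loader, and val_loader helpers are assumptions rather than details reported in the paper.

```python
import copy
import torch

def fine_tune(model, compute_loss, train_loader, val_loader, epochs=3, lr=5e-5):
    # model:         the M-BERT cross-encoder (Section 3.2.2) as a torch.nn.Module
    # compute_loss:  hypothetical helper mapping the model and one batch to the Eq. (7) loss
    # train_loader / val_loader: DataLoaders over the schema retrieval dataset (batch size 64)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)
            loss.backward()
            optimizer.step()
        # keep the checkpoint with the lowest validation loss after each epoch
        model.eval()
        with torch.no_grad():
            val_loss = sum(compute_loss(model, b).item() for b in val_loader)
        if val_loss < best_loss:
            best_loss, best_state = val_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```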
For the few-shot prompting-based models, we utilized GPT-4 Turbo, provided through the OpenAI API.

4.3. Evaluation Metrics

For performance evaluation, we utilized hit@k. This metric checks whether the correct schema is included within the top k predictions made by the schema model. The formula is as follows:
\text{hit@}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\text{rank}_i \le k)
where N represents the total number of test data points, and \mathbb{1}(\text{rank}_i \le k) returns 1 if the rank of the correct schema among the model’s predictions for the i-th question is within the top k, and 0 otherwise.
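A minimal sketch of this metric is given below; it assumes each test question comes with the model’s ranked predictions and a set of valid (gold) schema items, and counts a hit when any valid item appears within the top k.

```python
def hit_at_k(ranked_predictions, gold_items, k):
    # ranked_predictions: list of ranked schema-item lists, one per test question
    # gold_items:         list of sets of valid schema items, one per test question
    hits = sum(
        1 for preds, gold in zip(ranked_predictions, gold_items)
        if any(p in gold for p in preds[:k])
    )
    return hits / len(ranked_predictions)
```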
We use the hit@k metric to evaluate the performance of schema retrieval tasks because the number of correct schemas can vary depending on the question, and because there can be multiple valid schemas. For instance, let us consider the instruction “Find a hospital near Seoul National University”, which can be expressed using two different Cypher queries:
  • Match (x {uuid: 'sub_123'})-[:NEARBY]->(y:Hospital);
  • Match (x:Subway {name: 'Seoul National University'})-[:NEARBY]->(y:Hospital).
Although these two queries return the same results, the first one does not utilize the “Subway” schema item and directly searches using “uuid”, which is the unique identifier of the entity “Seoul National University.” Therefore, in this experiment, we applied the hit@k metric sequentially from hit@1 to hit@5 for evaluation. This approach allows for more flexible assessment, as even if the model initially outputs an incorrect schema, this does not affect the subsequent hit@k ranges. However, for relation schemas, since there are only five types, accuracy increases rapidly from hit@3 onward, even when selected randomly, which does not accurately reflect the model’s performance. As a result, we only conducted experiments for hit@1 and hit@2 in the relation schema category.

5. Results

5.1. Dataset Construction Results

The Cypher queries in the SKRE dataset were used to extract and label class, relation, and property schemas. The processed dataset consists of a total of 8010 entries, as shown in Figure 5.
Figure 5. Dataset construction results.

5.2. Prompt Searching Results for GPT Few-Shot Prompting-Based Model

To find the most suitable prompt for Korean GeoKBQA dense schema retrieval, we conducted a three-step process that included the processing method, the number of few-shot examples, and instructions. To minimize the effect of sampling few-shot examples, each experiment was executed three times with different examples and test data. All hit@k scores represent the average of these three experiments. Due to limitations in time and cost, we applied the method that showed good results in the previous stage to the next experiment.
1. Processing Methods
As previously explained, schema retrieval through few-shot prompting can be implemented using either a combined processing method, where class, relation, and property schemas are handled simultaneously, or a divided processing method, where separate prompts are used for each schema.
Since the combined processing method dataset included all types of schema labels, it was not possible to specify a reference schema for labeling according to the generalization metrics. Therefore, we did not label the data based on generalization levels.
The experimental results of both methods are presented in Table 2. It was observed that the divided processing method, which simplifies the task by using individual prompts for each schema type, consistently outperformed the combined method across all hit@k ranges. While it is challenging to pinpoint the exact cause of these results due to the opaque reasoning process of LLMs, some patterns suggest that the combined method struggles with maintaining task-specific instructions.
Table 2. Processing method results. The average of every schema type.
For example, in the class schema results, the combined processing model frequently selected the schema item “Road” instead of “GoodWayToWalk”, despite explicit instructions in the prompt that required the model to choose only from the provided schema items within the SKRE KB. This behavior indicates that the model may have forgotten the instructions as the prompt became longer. In the combined processing method, the model must process examples and instructions for three schema types (class, relation, and property) simultaneously, which increases the complexity of the input. This additional complexity likely caused the model to retrieve a schema item (“Road”) that was not included in the given options, reflecting a failure to adhere to the task-specific constraints.
This issue highlights a challenge in GeoQA, where resolving ambiguity in geographic questions must be achieved while strictly following schema-specific constraints. For instance, in the question, “What are the best walking paths within 1200 m of Umyeon-dong?” the model should focus on the phrase “best walking paths” to retrieve the correct schema item “GoodWayToWalk” from the provided options. However, the term “paths” may lead the model to associate the question with the schema item “Road”, especially when task-specific instructions are overlooked. The divided processing method mitigates this issue by isolating each schema type into separate prompts, thereby reducing input length and complexity. This approach allows the model to better retain the task-specific instructions and adhere to the provided constraints, minimizing errors caused by forgetting instructions and instruction ambiguity.
These findings underscore the importance of designing prompts and managing task complexity to address schema retrieval challenges in GeoQA. By reducing the cognitive load on the model and improving its adherence to task-specific constraints, the divided processing method demonstrates superior performance, particularly in tasks that require the model to resolve ambiguous terms and distinguish between overlapping spatial concepts.
2. Number of Few-Shot Examples
Using a divided method, we compared the performance differences by varying the number of few-shot examples included in the prompt. The number of examples ranged from 0 to 40, increasing by increments of 10.
Table 3 shows the average hit@1 to hit@5 results. As seen in Figure 6a, performance improvements for I.I.D. and compositional data sharply decreased after using more than 20 few-shot examples for class and property schemas. However, zero-shot data showed less variation compared to the others, indicating that the model is primarily relying on its pre-trained knowledge rather than specific patterns picked up from examples.
Table 3. Number of few-shot example results.
Figure 6. Number of few-shot example results: (a) shows the average score for class and property schemas; (b) shows the relation score.
As mentioned in Section 4.1, due to the limited number of schema types in the dataset, we did not categorize the generalization levels for the relation datasets and only used hit@1 to hit@2 results to calculate the average score. The relation schema showed similar results to class and property schemas. However, as seen in Table 3, the model’s performance when there were no examples was noticeably lower. This is due to the semantic difference between relation labels and geographic questions. For instance, the relation “TRADE” represents the edge between an apartment and its recent trade price in the SKRE KB. When a user asks for the price of an apartment, it is difficult for the model to infer that “TRADE” and “price” are related without any examples. Due to the inherent ambiguity in such geographic questions, it is crucial to improve the performance of GeoKBQA by training the model with few-shot examples that help it learn the schema of the KB and related queries.
As a result, when measuring performance across all schema types based on the number of few-shot examples, we observed that the performance improvement sharply decreased after exceeding 20 examples. While performance gradually increases as the number of few-shot examples grows, the computational complexity also rises due to the increased input token size. Therefore, we determined that using 20 few-shot examples is the most efficient approach, as this is the point where the performance gains start to diminish significantly.
3. Instructions
Using the divided processing method and 20 examples, we tested three different instructions for class, property, and relation schemas. As shown in Table 4 and Table 5, we excluded the hit@1 results for class and property schemas. This exclusion was due to the existence of multiple valid schemas for a single question, as mentioned in Section 4.3. A detailed explanation can be found in Appendix A.
Table 4. Class instructions results.
Table 5. Property instructions results.
As shown in Table 4, Table 5 and Table 6, instruction 3, which used the persona technique and provided a more detailed step-by-step explanation of the task, demonstrated a better overall performance. This instruction consistently outperformed the others across all schema types, except at the I.I.D. level for property schema, where the effect of the instruction was relatively low due to the direct alignment between few-shot examples and the question.
Table 6. Relation instructions results.
The most significant difference was observed in the hit@1 scores of the relation schema, where instruction 3 outperformed instructions 1 and 2 by 0.25 and 0.13, respectively. This was due to the semantic difference between relation labels and geographic questions, as mentioned earlier. These results suggest that the less information the model has between schema labels and questions, the greater the importance of a well-crafted instruction becomes.
Despite our efforts to address the instruction-forgetting issue identified in the processing method evaluation, the issue still persisted to some extent even when providing clearer and more detailed instructions to the model. This behavior highlights the inherent vagueness and ambiguity of geographic questions, factors which often lead to challenges in schema retrieval. For instance, in the class schema, the model extracted the undefined item “Road” instead of “GoodWayToWalk” and “ExpositionCenter” instead of “Convention”, the item intended to represent a convention center. Similarly, in the relation schema, “DISTANCE” was frequently extracted for queries involving distance calculations. However, the SKRE dataset employs Neo4j’s point.distance function to calculate the actual distance between entities, and thus a specific “DISTANCE” relation is not explicitly defined in the KB.
Nonetheless, providing more detailed instructions appeared to mitigate these issues, as reflected in the results presented in Table 4, Table 5 and Table 6. These results underscore the critical role of precise and context-aware instructions in improving schema retrieval performance, particularly when addressing ambiguities or gaps between the schema labels and the query semantics.

5.3. Comparison Results for GPT Few-Shot Prompting and BERT Fine-Tuning-Based Models

To compare the performance of the two schema retrieval models, we compared the results obtained using the prompt from the few-shot prompting model, which was determined to be the most suitable (divided processing method, 20 few-shot examples, and instruction 3), with the results obtained from the fine-tuning model (Table 7).
Table 7. Comparison results for the GPT few-shot prompting and BERT fine-tuning-based models.
In terms of the I.I.D. generalization performance for class and property schemas, the fine-tuning method showed a higher performance across most hit@k ranges. However, the difference was minimal.
For compositional generalization, the fine-tuning method demonstrated a significant 7% improvement at hit@2 for class schemas, but the difference narrowed to around 1–2% afterward. For property schemas, the few-shot prompting method initially performed best, but in later stages, the fine-tuning method showed better performance. However, the performance difference between the two models was less than 1%, indicating nearly equivalent performance.
The largest performance gap between the two models was observed in zero-shot generalization scenarios. For class schemas, the few-shot prompting method outperformed the fine-tuning method by approximately 10% at hit@2, and by nearly 15% in subsequent ranges. Similarly, for property schemas, the few-shot prompting method performed about 9% better from hit@2 onward.
For relation schemas, there were not enough schema types in the KB to compare performance based on generalization metrics. However, the fine-tuning method generally showed better performance. Since there are only five types of relation schema, the likelihood of encountering an unseen schema in the test data is very low. Therefore, most of the test data are likely I.I.D. or compositional data, indicating that the fine-tuning method performs better than the few-shot prompting method in these scenarios.
Through the experiments, we confirmed that the fine-tuning-based schema retrieval model performs better in I.I.D. and compositional scenarios, where the data are relatively similar to the training data. However, in the zero-shot generalization scenario, where the data distribution is significantly different from that of the training or few-shot example data, the few-shot prompting method outperformed the fine-tuning model by a substantial margin. This result demonstrates that in zero-shot situations, pre-acquired knowledge plays a more critical role. The advantages of the GPT-4 Turbo model, which has been trained on more data and contains a larger number of parameters, are particularly evident in this context. Although the size of GPT-4 Turbo has not been disclosed, the difference in parameter counts between the previous GPT-3 model and M-BERT is approximately 1600-fold.
In schema retrieval tasks, the model must select schema items relevant to the query from the vast search space of all schemas in the KB. The probability of finding the schema item required by the user’s query in the training data or few-shot examples is very low [3]. Therefore, in real-world applications, schema retrieval using the few-shot prompting method, which demonstrates superior zero-shot performance, is likely to be more suitable.

6. Conclusions

This study constructed the first neural-based schema retrieval model for Korean GeoKBQA. Prior studies used predefined rule-based models that had limited generalization performance on undefined schema items and were unable to account for the semantic meaning of user questions. Additionally, these models required significant time and resources to define new relationships between questions and schema items whenever the language or KB changed.
To build our model, we utilized the few-shot prompting method with the GPT-4 Turbo model, which requires a considerably smaller amount of data compared to traditional fine-tuning methods. We also adopted dense schema retrieval, which is known to perform better than neighbor-based schema retrieval. To the best of our knowledge, this is the first work to construct a few-shot prompting-based dense schema retrieval model, not only in the GeoKBQA domain but also within the broader KBQA field.
Using the SKRE dataset, which contains Korean geographic queries, we constructed a schema retrieval dataset. With this dataset, we conducted various experiments that considered the processing methods, the number of few-shot examples used, and the instructions used in order to identify the best prompt for Korean GeoKBQA schema retrieval.
To evaluate our model’s performance, we also trained an M-BERT model to perform dense schema retrieval, using the traditional fine-tuning method to process Korean geographic questions. We tested the models across I.I.D., compositional, and zero-shot generalization levels to carefully compare the performance of the two approaches. The fine-tuning-based model showed better performance at the I.I.D. and compositional levels, where the distribution was similar to the training data or few-shot examples. However, the few-shot prompting-based model performed better in the zero-shot setting, where the models had to predict schema items they had never seen before. Given the large search space and the diversity of user questions, few-shot prompting may be more suitable for practical usage.
Despite being a pioneering study in the use of neural-based schema retrieval for Korean GeoKBQA, this research has its limitations. The SKRE KB we used, which is the only geographic KB that provides corresponding Korean questions, contains fewer schema types compared to modern large-scale KBs like Freebase or Wikidata. This raises the possibility of different results being obtained on larger-scale KBs. Furthermore, in the few-shot prompting-based model used in this study, all schema types in the KB are input into the model, which increases the number of input tokens as the number of schemas grows. When using large-scale KBs with many schemas, API usage costs can become an issue. Therefore, depending on the size of the KB being used, it may be necessary to combine this method with fine-tuning. Our subsequent work aims to address these existing challenges.

Author Contributions

Seokyong Lee: conceptualization, data curation, formal analysis, methodology, validation, visualization, writing—original draft, and writing—review and editing. Kiyun Yu: conceptualization, funding acquisition, project administration, and supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant RS-2022-00143336).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In Table A1 and Table A2, the hit@1 scores were noticeably lower than the other ranges across all instructions and generalization levels. As mentioned in Section 4.3, this was due to the existence of multiple valid schemas for a single question. For class schemas, the most common issue occurred when the model chose “apartment” for the apartment name mentioned in the query, but the schema was marked as incorrect because it was not used in the query. Similarly, for property schemas, the model often predicted “name” based on the entity’s name in the query, but since the query used the entity’s unique identifier “uuid” instead of “name”, it was also marked as incorrect. This does not necessarily reflect the model’s performance but depends on which schema is used in the query.
The goal of the schema retrieval model is to identify appropriate schema items for the transducer to use when constructing a query. Therefore, even if certain schema items are not ultimately used in the query, the model should still retrieve valid schemas relevant to the query. As a result, even if the hit@1 score is not high, a strong performance at hit@k indicates that the model effectively finds appropriate schema items.
Given that the performance differences were not significant compared to the experiments on processing methods or the number of few-shot examples, we used the averages of hit@2 to hit@5 for the instruction search. This approach helped to highlight the differences more clearly.
Table A1. Class instruction results.

Level      Ins.   Hit@1    Hit@2    Hit@3    Hit@4    Hit@5
I.I.D.     1      0.5110   0.9246   0.9286   0.9567   0.9567
I.I.D.     2      0.6101   0.9571   0.9594   0.9661   0.9684
I.I.D.     3      0.4352   0.9864   0.9872   0.9934   0.9986
Comp.      1      0.8751   0.9165   0.9506   0.9567   0.9865
Comp.      2      0.5687   0.9473   0.9583   0.9583   0.9583
Comp.      3      0.6071   0.9286   0.9765   0.9867   0.9898
Zero-shot  1      0.4352   0.8778   0.8792   0.8912   0.9054
Zero-shot  2      0.6901   0.9125   0.9166   0.9264   0.9354
Zero-shot  3      0.6836   0.9036   0.9534   0.9567   0.9864
Table A2. Property instruction results.

Level      Ins.   Hit@1    Hit@2    Hit@3    Hit@4    Hit@5
I.I.D.     1      0.4962   0.9350   0.9350   0.9358   0.9358
I.I.D.     2      0.5966   0.9855   0.9954   0.9954   0.9954
I.I.D.     3      0.5612   0.9855   0.9855   0.9962   0.9984
Comp.      1      0.3799   0.8781   0.8855   0.8861   0.8872
Comp.      2      0.5364   0.9848   0.9848   0.9891   0.9891
Comp.      3      0.4248   0.9935   0.9950   0.9950   0.9972
Zero-shot  1      0.5124   0.8465   0.8470   0.8481   0.8961
Zero-shot  2      0.6145   0.8943   0.8943   0.9002   0.9013
Zero-shot  3      0.6185   0.9101   0.9205   0.9315   0.9555

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language Models Are Few-Shot Learners. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  2. Mishra, A.; Jain, S.K. A Survey on Question Answering Systems with Classification. J. King Saud. Univ. Comput. Inf. Sci. 2016, 28, 345–361. [Google Scholar] [CrossRef]
  3. Gu, Y.; Kase, S.; Vanni, M.; Sadler, B.; Liang, P.; Yan, X.; Su, Y. Beyond IID: Three Levels of Generalization for Question Answering on Knowledge Bases. In Proceedings of the The Web Conference, Ljubljana, Slovenia, 19–23 April 2021; pp. 3477–3488. [Google Scholar]
  4. Chen, S.; Liu, Q.; Yu, Z.; Lin, C.-Y.; Lou, J.-G.; Jiang, F. ReTraCk: A Flexible and Efficient Framework for Knowledge Base Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Online, 1–6 August 2021; pp. 325–336. [Google Scholar]
  5. Ye, X.; Yavuz, S.; Hashimoto, K.; Zhou, Y.; Xiong, C. RNG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 6032–6043. [Google Scholar]
  6. Shu, Y.; Yu, Z.; Li, Y.; Karlsson, B.; Ma, T.; Qu, Y.; Lin, C.Y. TIARA: Multi-Grained Retrieval for Robust Question Answering over Large Knowledge Base. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 8108–8121. [Google Scholar]
  7. Yang, J.; Jang, H.; Yu, K. Geographic Knowledge Base Question Answering over OpenStreetMap. ISPRS Int. J. Geo-Inf. 2024, 13, 10. [Google Scholar] [CrossRef]
  8. Yang, T. Developing a Transformer-Based Natural Language Entity Linking Model to Improve the Performance of GeoKBQA. Master’s Thesis, Seoul National University, Seoul, Republic of Korea, 2023; pp. 1–96. [Google Scholar]
  9. Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; Taylor, J. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD’08), Vancouver, BC, Canada, 9–12 June 2008; Association for Computing Machinery: New York, NY, USA, 2008; pp. 1247–1250. [Google Scholar] [CrossRef]
  10. Vrandečić, D.; Krötzsch, M. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM 2014, 57, 78–85. [Google Scholar] [CrossRef]
  11. Li, T.; Ma, X.; Zhuang, A.; Gu, Y.; Su, Y.; Chen, W. Few-Shot In-Context Learning on Knowledge Base Question Answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 6966–6980. [Google Scholar]
  12. Xiong, G.; Bao, J.; Zhao, W. Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models. arXiv 2024, arXiv:2402.15131. [Google Scholar]
  13. Kwiatkowski, T.; Choi, E.; Artzi, Y.; Zettlemoyer, L. Scaling Semantic Parsers with On-the-Fly Ontology Matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; Association for Computational Linguistics: Seattle, WA, USA, 2013; pp. 1545–1556. [Google Scholar]
  14. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
  15. Reynolds, L.; McDonell, K. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. In Proceedings of the Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (CHI EA’21), New York, NY, USA, 2–7 June 2021; Association for Computing Machinery: New York, NY, USA, 2021. Article 314. pp. 1–7. [Google Scholar]
  16. Mai, G.; Janowicz, K.; Zhu, R.; Cai, L.; Lao, N. Geographic Question Answering: Challenges, Uniqueness, Classification, and Future Directions. AGILE GISci. Ser. 2021, 2, 8. [Google Scholar] [CrossRef]
  17. Punjani, D.; Singh, K.; Both, A.; Koubarakis, M.; Angelidis, I.; Bereta, K.; Beris, T.; Bilidas, D.; Ioannidis, T.; Karalis, N.; et al. Template-Based Question Answering over Linked Geospatial Data. In Proceedings of the 12th Workshop on Geographic Information Retrieval (GIR’18), New York, NY, USA, 6 November 2018; Association for Computing Machinery: New York, NY, USA, 2018. Article 7. pp. 1–10. [Google Scholar]
  18. Hamzei, E.; Tomko, M.; Winter, S. Translating Place-Related Questions to GeoSPARQL Queries. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 902–911. [Google Scholar]
  19. Kefalidis, S.-A.; Punjani, D.; Tsalapati, E.; Plas, K.; Pollali, M.; Mitsios, M.; Tsokanaridou, M.; Koubarakis, M.; Maret, P. Benchmarking Geospatial Question Answering Engines Using the Dataset GeoQuestions1089. In Proceedings of the Semantic Web—ISWC 2023: 22nd International Semantic Web Conference, Athens, Greece, 6–10 November 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 266–284. [Google Scholar]
  20. Shi, J.; Cao, S.; Hou, L.; Li, J.; Zhang, H. TransferNet: An Effective and Transparent Framework for Multi-Hop Question Answering over Relation Graph. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Online and Punta Cana, Dominican Republic, 2021; pp. 4149–4158. [Google Scholar]
  21. Yih, W.T.; Richardson, M.; Meek, C.; Chang, M.W.; Suh, J. The Value of Semantic Parse Labeling for Knowledge Base Question Answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Berlin, Germany, 2016; Volume 2, pp. 201–206. [Google Scholar]
  22. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; McGrew, B. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010. [Google Scholar]
  24. Wu, L.; Petroni, F.; Josifoski, M.; Riedel, S.; Zettlemoyer, L. Scalable Zero-Shot Entity Linking with Dense Entity Retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), Online, 16–20 November 2020; Association for Computational Linguistics: Online, 2020; pp. 6397–6407. [Google Scholar]
  25. Chiu, J.; Shinzato, K. Cross-Encoder Data Annotation for Bi-Encoder Based Product Matching. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 161–168. [Google Scholar]
  26. Han, W.; Jiang, Y.; Ng, H.T.; Tu, K. A Survey of Unsupervised Dependency Parsing. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; International Committee on Computational Linguistics: Online and Barcelona, Spain, 2020; pp. 2522–2533. [Google Scholar]
  27. Talmor, A.; Berant, J. The Web as a Knowledge Base for Answering Complex Questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: New Orleans, LA, USA, 2018; Volume 1, pp. 641–651. [Google Scholar]
  28. Zhao, Z.; Wallace, E.; Feng, S.; Klein, D.; Singh, S. Calibrate Before Use: Improving Few-Shot Performance of Language Models. In Proceedings of the International Conference on Machine Learning (ICML), Online, 18–24 July 2021; pp. 12697–12706. [Google Scholar]
  29. Min, S.; Lyu, X.; Holtzman, A.; Artetxe, M.; Lewis, M.; Hajishirzi, H.; Zettlemoyer, L. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 11048–11064. [Google Scholar]
  30. Shrawgi, H.; Rath, P.; Singhal, T.; Dandapat, S. Uncovering Stereotypes in Large Language Models: A Task Complexity-Based Approach. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian’s, Malta, 7–22 March 2024; Association for Computational Linguistics: St. Julian’s, Malta, 2024; pp. 1841–1857. [Google Scholar]
  31. Sheetrit, E.; Brief, M.; Mishaeli, M.; Elisha, O. ReMatch: Retrieval Enhanced Schema Matching with LLMs. arXiv 2024, arXiv:2403.01567. [Google Scholar]
  32. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A. Training Language Models to Follow Instructions with Human Feedback. Neural. Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
  33. Zamfirescu-Pereira, J.D.; Wong, R.Y.; Hartmann, B.; Yang, Q. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI’23), Hamburg, Germany, 23–28 April 2023; Association for Computing Machinery: New York, NY, USA, 2023. Article 437. pp. 1–21. [Google Scholar] [CrossRef]
  34. Bsharat, S.M.; Myrzakhan, A.; Shen, Z. Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4. arXiv 2023, arXiv:2312.16171. [Google Scholar]
  35. Wang, L.; Xu, W.; Lan, Y.; Hu, Z.; Lan, Y.; Lee, R.K.-W.; Lim, E.-P. Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 2609–2634. [Google Scholar]
  36. Deshpande, A.; Murahari, V.; Rajpurohit, T.; Kalyan, A.; Narasimhan, K. Toxicity in ChatGPT: Analyzing Persona-Assigned Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; Association for Computational Linguistics: Singapore, 2023; pp. 1236–1270. [Google Scholar]
  37. White, J.; Fu, Q.; Hays, S.; Sandborn, M.; Olea, C.; Gilbert, H.; Elnashar, A.; Spencer-Smith, J.; Schmidt, D.C. A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT. arXiv 2023, arXiv:2302.11382. [Google Scholar]
  38. Pires, T.; Schlinger, E.; Garrette, D. How Multilingual Is Multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Florence, Italy, 2019; pp. 4996–5001. [Google Scholar]
  39. Lee, S.; Jang, H.; Baik, Y.; Park, S.; Shin, H. KR-BERT: A Small-Scale Korean-Specific Language Model. arXiv 2020, arXiv:2008.03979. [Google Scholar]
  40. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems—Volume 2, Lake Tahoe, NV, USA, 5–8 December 2013; Curran Associates Inc.: Red Hook, NY, USA, 2013; Volume 2, pp. 3111–3119. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
