Proceeding Paper

Enhancing Text-to-SPARQL Generation via In-Context Learning with Example Selection Strategies †

Department of Management Information Systems, National Chung Hsing University, Taichung 40227, Taiwan
* Author to whom correspondence should be addressed.
Presented at the 7th Eurasia Conference on IoT, Communication and Engineering 2025 (ECICE 2025), Yunlin, Taiwan, 14–16 November 2025.
Eng. Proc. 2026, 134(1), 36; https://doi.org/10.3390/engproc2026134036
Published: 9 April 2026

Abstract

Large language models demonstrate strong in-context learning (ICL) capabilities, allowing them to perform diverse tasks without fine-tuning. In knowledge graph question answering (KGQA), natural language questions are translated into SPARQL queries. Existing ICL approaches mainly rely on semantic similarity, often neglecting structural features. To address this limitation, we developed a structure-aware example selection strategy that integrates both semantic and structural patterns by abstracting Resource Description Framework (RDF) triples. We compare four strategies: (1) fully random, (2) semantic similarity, (3) same-type random, and (4) same-type semantic similarity. Experiments on LC-QuAD 1.0 using FLAN-T5 show that in non-fine-tuned settings, structure-aware semantic selection achieves the best results, highlighting the importance of structural congruence, while after fine-tuning, differences between strategies converge but diversity and semantic relevance remain beneficial. These findings demonstrate the critical role of example quality in ICL and provide empirical insights for KGQA design.

1. Introduction

In natural language processing (NLP), large language models (LLMs) have shown strong in-context learning (ICL) ability, enabling them to solve tasks such as question answering, translation, and text generation without fine-tuning [1,2,3,4]. Generative Pre-trained Transformer 3 (GPT-3), for example, demonstrates that ICL can approach or even surpass fine-tuned models in multi-task settings [1].
Knowledge graph question answering (KGQA) has emerged as a key research area. The core challenge lies in converting natural language questions into structured queries (e.g., SPARQL). Unlike conventional NLP tasks that generate free-form text or categorical labels, Text-to-SPARQL requires mapping natural language semantics to precise logical structures, strictly following syntax rules and composing RDF triples. As Zahera et al. highlighted, this process demands both accurate entity–relation alignment and syntactically valid queries [5]. Without structured prompts and ICL, LLMs often produce invalid SPARQL. These difficulties underline the complexity of Text-to-SPARQL beyond standard NLP tasks.
In recent studies, ICL is applied to Text-to-SPARQL [5,6,7,8]. However, most research relies on semantic similarity to select examples, overlooking structural features. For instance, Taffa and Usbeck [9] used k-Nearest Neighbors (kNN), which often faces scalability challenges. Zahera et al. reduced cost via clustering, but their method requires an appropriate number of clusters and often struggles with cross-dataset generalization [5]. Overall, these methods heavily rely on semantic similarity, leaving structural characteristics underexplored.
To address the limitations of reliance on semantic similarity, we introduce a structure-aware example selection strategy. Specifically, we transform SPARQL queries into RDF triples and abstract them into structural patterns (details and examples are provided in Section 3.1). We design four strategies: fully random (FR), semantic similarity (SS), same-type random (STR), and same-type semantic similarity (STSS). Experiments on the LC-QuAD 1.0 dataset with the Flan-T5 model evaluate performance under zero-shot, one-shot, and few-shot settings and further compare ICL with fine-tuned models.
Our contributions are twofold: we propose a structure-aware example selection approach combined with semantic similarity, and we conduct a systematic comparison of different example selection strategies under ICL. In addition, we analyze the effects of ICL on both non-fine-tuned and fine-tuned models, providing design insights and empirical evidence for KGQA systems.

2. Related Works

ICL has been widely applied in NLP tasks without fine-tuning and is typically categorized into zero-shot, one-shot, and few-shot settings [1,10,11]. Previous research [2,12,13] shows that increasing the number of examples can improve performance. However, these gains yield diminishing returns and substantially higher computational costs. Consequently, the choice of example selection strategy becomes critical for ensuring both effectiveness and efficiency in ICL.
In Text-to-SPARQL, three representative strategies have been explored. Taffa and Usbeck applied kNN with cosine similarity to retrieve the nearest training questions [9]. Similarly, Zahera et al. employed clustering with Sentence Transformers, reducing computational cost but requiring pre-defined cluster numbers [5]. Kosten et al. compared random versus semantic similarity selection, showing that random examples sometimes outperform semantic ones due to greater diversity and reduced overfitting [8]. The results highlight the potential and the limitations of purely semantic-driven strategies, motivating our integration of structural features.

3. Methodology

Previous example selection methods mainly rely on semantic similarity, considering only the surface form of questions. To address this, we incorporate the structural characteristics of RDF triples to improve generation quality. This approach, different from clustering [5], does not require predefining the number of groups and thus provides clearer criteria and greater flexibility for optimization. In addition, random examples can sometimes mitigate overfitting from excessive semantic similarity [8], so we design two strategies within each structural type: random and semantic similarity.

3.1. Structural Classification of RDF Triples

To incorporate structural features into example selection, we adopted the structural abstraction framework of Chen et al. [14,15], simplifying SPARQL queries into RDF triples and further abstracting them into subgraph structural types.
For example, we consider the question “Which sectors’ people are part of local political parties that fall under the International Muslim Brotherhood?” (Figure 1). Its corresponding SPARQL query is expressed as three RDF triples: (?x, dbp:international, dbr:Muslim_Brotherhood), (?x, dbo:religion, ?uri), and (?x, rdf:type, dbo:PoliticalParty). In the abstraction process, dbr:Muslim_Brotherhood is treated as a named entity (E), dbp:international and dbo:religion are regarded as relation entities (R), and dbo:PoliticalParty as a class entity (C). The predicate rdf:type is likewise treated as a relation entity (R). Therefore, the triples are converted into the structural patterns (?x, R, E), (?x, R, ?uri), and (?x, R, C). After transformation, this RDF triple set contains one E, three R, and one C. Ignoring the directionality of R, the question is classified into the G-type structural category shown in Figure 2.
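The abstraction described above can be sketched in a few lines of code. This is an illustrative simplification, not the authors' implementation: the prefix-based rules (predicates map to R, `dbr:` IRIs to named entities E, capitalized `dbo:` objects to class entities C) are our assumptions for the demo.

```python
# Sketch of the E/R/C abstraction of RDF triples (Section 3.1).
# The prefix-based classification rules below are simplifying assumptions.

def abstract_term(term: str, position: str) -> str:
    """Map one RDF term to a structural symbol (E, R, C, or a variable)."""
    if term.startswith("?"):
        return term                      # variables such as ?x, ?uri are kept
    if position == "predicate":
        return "R"                       # predicates (incl. rdf:type) are relations
    if term.startswith("dbo:") and term.split(":", 1)[1][0].isupper():
        return "C"                       # class entities such as dbo:PoliticalParty
    return "E"                           # remaining IRIs are named entities

def abstract_triples(triples):
    """Convert RDF triples into structural patterns like (?x, R, E)."""
    return [
        (abstract_term(s, "subject"),
         abstract_term(p, "predicate"),
         abstract_term(o, "object"))
        for s, p, o in triples
    ]

triples = [
    ("?x", "dbp:international", "dbr:Muslim_Brotherhood"),
    ("?x", "dbo:religion", "?uri"),
    ("?x", "rdf:type", "dbo:PoliticalParty"),
]
print(abstract_triples(triples))
# [('?x', 'R', 'E'), ('?x', 'R', '?uri'), ('?x', 'R', 'C')]
```

Counting the resulting E/R/C symbols (here one E, three R, one C) and ignoring edge directionality yields the subgraph type used for classification.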

3.2. System Framework

Let the input question be denoted as $q$, whose vector representation is obtained from the final hidden state of the pre-trained Flan-T5 encoder. After flattening, a fixed-length embedding vector $e_q \in \mathbb{R}^d$ is generated, where $d$ denotes the embedding dimension. Similarly, let the set of candidate questions extracted from the training data be $C = \{c_1, c_2, \ldots, c_n\}$. Each candidate question $c_i$ undergoes the same encoding process to obtain its vector representation $e_{c_i}$.
The semantic similarity between the input question $q$ and a candidate question $c_i$ is then computed using cosine similarity as follows:
$$\mathrm{Similarity}(c_i, q) = \cos(e_q, e_{c_i})$$
Based on the similarity scores, the top-$k$ most semantically similar candidate questions are selected, along with their corresponding RDF triples, to form the set $S_k$:
$$S_k = \{(c_i, t_i) \mid c_i \in \text{top-}k \text{ by } \mathrm{Similarity}(c_i, q)\}$$
where $t_i$ denotes the RDF triples associated with candidate question $c_i$.
During the prompt construction stage, the set S k is combined with the input question q and fed into the Flan-T5 generation module to produce the target RDF triples as the final output.
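The retrieval step above reduces to ranking candidates by cosine similarity and keeping the top-$k$. The sketch below uses small NumPy vectors in place of the Flan-T5 encoder embeddings; the candidate tuples and their values are invented for illustration.

```python
# Minimal sketch of top-k candidate retrieval (Section 3.2), with NumPy
# vectors standing in for Flan-T5 encoder embeddings.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_top_k(e_q, candidates, k):
    """candidates: list of (question, triples, embedding); returns S_k."""
    scored = sorted(candidates, key=lambda c: cosine(e_q, c[2]), reverse=True)
    return [(q, t) for q, t, _ in scored[:k]]

e_q = np.array([1.0, 0.0])                 # embedding of the input question
candidates = [
    ("c1", "t1", np.array([0.9, 0.1])),    # nearly parallel to e_q
    ("c2", "t2", np.array([0.0, 1.0])),    # orthogonal, similarity 0
    ("c3", "t3", np.array([1.0, 0.2])),
]
print(select_top_k(e_q, candidates, k=2))
# [('c1', 't1'), ('c3', 't3')]
```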

3.3. Few-Shot Prompt Design

We evaluate the performance of large language models under three different ICL scenarios: zero-shot, one-shot, and few-shot, and compare these settings with fine-tuned models. The few-shot prompt format is illustrated in Figure 3. In the experimental setup, the few-shot configuration consists of three reference examples together with one target question. Each reference example consists of a reference question, selected according to the example selection strategy, and its corresponding reference RDF triples. The reference triples are represented in a unified structured format with the aid of special tokens, which are designed to help the model correctly identify triple boundaries and variable dependencies. Specifically, [SEP] is used to separate the subject, predicate, and object; [ROOT] corresponds to the variable ?uri in the LC-QuAD 1.0 dataset; [TEMP] corresponds to the variable ?x, representing an intermediate node in the query; and [END] marks the termination of each triple.
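A prompt built under this token scheme might look like the sketch below. The token roles follow the text ([SEP] between terms, [ROOT] for ?uri, [TEMP] for ?x, [END] closing each triple), but the exact concatenation order and the "Question:"/"Triples:" labels are our assumptions, not the paper's verbatim template.

```python
# Illustrative serialization of reference triples with special tokens
# (Section 3.3). The label strings and ordering are assumed for the demo.

TOKEN_MAP = {"?uri": "[ROOT]", "?x": "[TEMP]"}

def serialize_triple(s, p, o):
    """Render one triple with [SEP] separators and a closing [END]."""
    terms = [TOKEN_MAP.get(t, t) for t in (s, p, o)]
    return " [SEP] ".join(terms) + " [END]"

def build_prompt(examples, question):
    """examples: list of (ref_question, ref_triples); three in the paper's setup."""
    lines = []
    for ref_q, ref_triples in examples:
        lines.append(f"Question: {ref_q}")
        lines.append("Triples: " + " ".join(serialize_triple(*t) for t in ref_triples))
    lines.append(f"Question: {question}")
    lines.append("Triples:")          # the model completes from here
    return "\n".join(lines)

ex = [("Who leads X?", [("?uri", "dbo:leader", "dbr:X")])]
print(build_prompt(ex, "Who leads Y?"))
```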

3.4. Example Selection

To systematically investigate the roles of semantic similarity and structural features in example selection, and to validate the argument of Kosten et al. [8] that random examples may introduce useful diversity, we design and compare the following four strategies.
  • FR: Examples are randomly sampled from the entire training dataset.
  • SS: Both the input and training questions are encoded into vectors, and the most similar examples are selected based on cosine similarity.
  • STR: The target question is first assigned to a structural type by the RDF triples classifier, and examples are randomly sampled within that category.
  • STSS: The target question is assigned to a structural type, and the most semantically similar examples are selected from within the same category.
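The four strategies differ only in how the candidate pool is restricted and ranked, which can be captured in a single dispatcher. Here `similarity` and `struct_type` stand in for the cosine-similarity scorer and the RDF-triple structural classifier described earlier; the toy similarity in the demo is ours.

```python
# Sketch of the four example selection strategies (Section 3.4).
# `similarity` and `struct_type` are injected stand-ins for the real scorer
# and structural classifier.
import random

def select(strategy, question, pool, k, similarity, struct_type):
    """pool: list of candidate dicts with 'question' and 'type' keys."""
    if strategy in ("STR", "STSS"):                  # structure-aware pooling
        pool = [c for c in pool if c["type"] == struct_type(question)]
    if strategy in ("FR", "STR"):                    # random within the pool
        return random.sample(pool, min(k, len(pool)))
    ranked = sorted(pool,                            # SS / STSS: semantic ranking
                    key=lambda c: similarity(question, c["question"]),
                    reverse=True)
    return ranked[:k]

pool = [
    {"question": "q1", "type": "G"},
    {"question": "q2", "type": "A"},
    {"question": "q3", "type": "G"},
]
sim = lambda q, c: len(set(q) & set(c))    # toy similarity for the demo
picked = select("STSS", "q1", pool, k=1, similarity=sim,
                struct_type=lambda q: "G")
print(picked)
# [{'question': 'q1', 'type': 'G'}]
```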

4. Experiments

We use Flan-T5 [16,17], which enhances generalization and reasoning in ICL scenarios. Experiments are conducted on the LC-QuAD 1.0 dataset to analyze how different ICL formats and example selection strategies affect generation accuracy.
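As a rough illustration of the F1 metric reported below, the sketch computes a token-level F1 between a generated and a gold triple sequence. The paper does not spell out its tokenization or metric variant, so this is an assumed, simplified version.

```python
# Token-level F1 between predicted and gold sequences; the tokenization
# (whitespace split) and metric variant are illustrative assumptions.
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall, with multiplicity."""
    pred_toks, gold_toks = pred.split(), gold.split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

pred = "?x dbo:religion ?uri"
gold = "?x dbo:religion ?uri"
print(token_f1(pred, gold))  # 1.0 for an exact match
```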

4.1. Experimental Results Without Fine-Tuning

Table 1 shows that model performance is limited under the zero-shot setting (BLEU = 0.0529, F1 = 0.1566). In contrast, few-shot settings significantly improve results. Among the four strategies, STSS shows the best 3-shot performance (BLEU = 0.3980, F1 = 0.5120), indicating that structurally consistent and semantically relevant examples are most effective without fine-tuning. These findings suggest that the relevance and quality of examples have a greater impact on performance than merely increasing the number of examples.

4.2. Experimental Results with Fine-Tuning

Table 2 shows that overall performance improves substantially after fine-tuning. Even in the zero-shot setting, results are strong (BLEU = 0.8748, F1 = 0.9158), indicating a solid baseline capability. In the 1-shot setting, FR performs marginally better (BLEU = 0.8816, F1 = 0.9188), suggesting that example diversity can be more helpful than semantic or structural consistency in low-resource cases. With 3-shot, all strategies improve further, with SS achieving the best performance (BLEU = 0.8860, F1 = 0.9224), followed by FR (BLEU = 0.8807, F1 = 0.9198). This highlights the advantage of semantic relevance under richer conditions, while diversity from random selection still offers consistent benefits.

5. Conclusions

We analyzed the role of example selection strategies in ICL for RDF triple generation in KGQA. Beyond empirical validation, our results demonstrate that structural patterns can effectively guide large language models in structured query generation, and the relative importance of structural consistency, semantic similarity, and diversity evolves across training stages. These results enhance the theoretical understanding of ICL and offer practical guidance for building efficient, adaptive KGQA systems.

Author Contributions

E.J.-L.L.: Writing—review & editing, Conceptualization, Supervision, Methodology, Funding acquisition. Z.-T.S.: Methodology, Writing—original draft, Formal analysis, Validation, Data curation. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council (NSTC), grant number 113-2221-E-005-075.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: [https://github.com/AskNowQA/LC-QuAD, accessed on 25 March 2026].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
  2. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Chang, B.; et al. A survey on in-context learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 1107–1128.
  3. Chen, Y.; Zhong, R.; Zha, S.; Karypis, G.; He, H. Meta-learning via language model in-context tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; Volume 1, pp. 719–730.
  4. Garg, S.; Tsipras, D.; Liang, P.S.; Valiant, G. What can transformers learn in-context? A case study of simple function classes. Adv. Neural Inf. Process. Syst. 2022, 35, 30583–30598.
  5. Zahera, H.M.; Ali, M.; Sherif, M.A.; Moussallem, D.; Ngomo, A.C.N. Generating SPARQL from Natural Language Using Chain-of-Thoughts Prompting. In Proceedings of the 20th International Conference on Semantic Systems; IOS Press: Amsterdam, The Netherlands, 2024; pp. 353–368.
  6. Kovriguina, L.; Teucher, R.; Radyush, D.; Mouromtsev, D. SPARQLGEN: One-Shot Prompt-based Approach for SPARQL Query Generation. In Proceedings of the SEMANTiCS 2023 Posters & Demos, Leipzig, Germany, 20–22 September 2023.
  7. D’Abramo, J.; Zugarini, A.; Torroni, P. Dynamic few-shot learning for knowledge graph question answering. arXiv 2024, arXiv:2407.01409.
  8. Kosten, C.; Nooralahzadeh, F.; Stockinger, K. Evaluating the effectiveness of prompt engineering for knowledge graph question answering. Front. Artif. Intell. 2025, 7, 1454258.
  9. Taffa, T.A.; Usbeck, R. Leveraging LLMs in scholarly knowledge graph question answering. arXiv 2023, arXiv:2311.09841.
  10. Li, Y. A practical survey on zero-shot prompt design for in-context learning. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing; INCOMA Ltd.: Shoumen, Bulgaria, 2023; pp. 641–647.
  11. Geng, M.; Wang, S.; Dong, D.; Wang, H.; Li, G.; Jin, Z.; Mao, X.; Liao, X. Large language models are few-shot summarizers: Multi-intent comment generation via in-context learning. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1–13.
  12. Li, T.; Zhang, G.; Do, Q.D.; Yue, X.; Chen, W. Long-context LLMs struggle with long in-context learning. arXiv 2024, arXiv:2404.02060.
  13. Bertsch, A.; Ivgi, M.; Xiao, E.; Alon, U.; Berant, J.; Gormley, M.R.; Neubig, G. In-context learning with long-context models: An in-depth exploration. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; Volume 1, pp. 12119–12149.
  14. Chen, Y.H.; Lu, E.J.L.; Cheng, K.H. Enhancing SPARQL query generation for question answering with a hybrid encoder-decoder and cross-attention model. J. Web Semant. 2025, 87, 100869.
  15. Chen, Y.H.; Lu, E.J.L.; Tsao, C.N. Enhancing SPARQL query generation using multi-label text-to-text models. Data Knowl. Eng. 2026, 164, 102584.
  16. Longpre, S.; Hou, L.; Vu, T.; Webson, A.; Chung, H.W.; Tay, Y.; Zhou, D.; Le, Q.V.; Zoph, B.; Wei, J.; et al. The Flan collection: Designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2023; pp. 22631–22648.
  17. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling instruction-finetuned language models. J. Mach. Learn. Res. 2024, 25, 70.
Figure 1. Example of RDF triple structural pattern classification. (The red, blue, and green fonts represent subjects, predicates, and objects, respectively.)
Figure 2. Subgraph types in LC-QuAD 1.0 Queries.
Figure 3. Few-shot in-context learning framework.
Table 1. BLEU and F1 scores in zero-shot, one-shot, and few-shot settings without fine-tuning.
Method | Setting | BLEU Score | F1 Score
FR     | 0-shot  | 0.0529     | 0.1566
FR     | 1-shot  | 0.1219     | 0.2372
FR     | 3-shot  | 0.1496     | 0.2423
SS     | 1-shot  | 0.2382     | 0.3563
SS     | 3-shot  | 0.3777     | 0.4913
STR    | 1-shot  | 0.1346     | 0.2505
STR    | 3-shot  | 0.2227     | 0.3296
STSS   | 1-shot  | 0.2329     | 0.3497
STSS   | 3-shot  | 0.3980     | 0.5120
Table 2. BLEU and F1 scores in zero-shot, one-shot, and few-shot settings with fine-tuning. (Bold values indicate the best performance in each column.)
Method | Setting | BLEU Score | F1 Score
FR     | 0-shot  | 0.8748     | 0.9158
FR     | 1-shot  | 0.8816     | 0.9188
FR     | 3-shot  | 0.8807     | 0.9198
SS     | 1-shot  | 0.8766     | 0.9177
SS     | 3-shot  | 0.8860     | 0.9224
STR    | 1-shot  | 0.8765     | 0.9175
STR    | 3-shot  | 0.8756     | 0.9156
STSS   | 1-shot  | 0.8763     | 0.9178
STSS   | 3-shot  | 0.8783     | 0.9194
