Refining Text2Cypher on Small Language Model with Reinforcement Learning Leveraging Semantic Information
Abstract
1. Introduction
- (1) We propose a novel refinement training method for Text2Cypher that works well even on baseline SLMs (pretrained and instruction-tuned generative small language models). The method effectively bridges the semantic gap between input texts and output Cypher queries through reinforcement learning, leveraging semantic information produced during training. To the best of our knowledge, this paper is the first to apply reinforcement learning to Text2Cypher;
- (2) To reduce the semantic gap more efficiently, we propose a GRPO-based reinforcement-learning optimization strategy for Text2Cypher that effectively exploits the semantic information of key-value pairs and relationship triples extracted from output responses, computing each response's reward relative to the average over a group of sampled responses;
- (3) We propose a simple prompting approach, applied to a baseline LM, that extracts semantic information alongside Cypher query generation. The guidance embedded in the input prompts directs the model to extract supplementary information, which enhances the Text2Cypher task.
2. Background and Related Works
2.1. Related Work
2.2. Background
2.2.1. Cypher Query Language
2.2.2. Text2Cypher and Comparison with Text2SQL
2.2.3. Key-Value Extraction
2.2.4. Relevant Triple Relationship Extraction
2.2.5. Group Relative Policy Optimization (GRPO)
2.2.6. Baseline LM: Pretrained and Instruction-Tuned Generative Language Model
3. Proposed Method
3.1. Overview of the Proposed Method
- (1) Supervised fine-tuning: The baseline SLM is initially fine-tuned on paired natural language–Cypher examples, with schema context included in the input to help the model learn schema-aware query generation;
- (2) Reinforcement learning with support tasks: The fine-tuned baseline SLM is further optimized using reinforcement learning with the GRPO optimization policy. During this stage, the model also learns two auxiliary tasks, key-value pair extraction and relationship-triple extraction, which we refer to as support tasks. These tasks target the identification of core elements in Cypher queries, such as entities, attributes, and their relationships. By learning and performing these support tasks before generating the final query, the model is better guided during training, which improves both the precision and the generalizability of the generated Cypher queries.
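The second stage above relies on GRPO's group-relative scoring. The following Python fragment is a minimal illustrative sketch of that normalization step, not the authors' implementation; the reward values are made up for the example:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: each sampled response is scored relative
    to the mean and standard deviation of the rewards in its group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All responses scored identically: no relative learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Illustrative group of four sampled responses for one question,
# scored by the four-part reward described in Section 3.3.
rewards = [4, 2, -2, 0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group average receive positive advantages and are reinforced; below-average responses are penalized, without needing a separate value network.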
3.2. Supervised Fine-Tuning
3.3. Reinforcement Learning with Support Tasks
```
# Reward function applied to each predicted output during GRPO training
reward = 0  # accumulated reward value for this output

# 1. Format checking (via regular expressions)
format_result = check_format(output)
if format_result:
    reward += 1
else:
    reward -= 1

# 2. Answer checking
output_query = extract_cypher_query(output)
if output_query == ground_truth_cypher_query:
    reward += 1
else:
    reward -= 1

# 3. Key-value checking
list_key_value = extract_key_value(output)
if list_key_value >= ground_truth_key_values:  # superset check
    reward += 1
else:
    reward -= 1

# 4. Relationship-triple checking
output_triples = extract_relationship_triples(output)
if output_triples >= ground_truth_triples:  # superset check
    reward += 1
else:
    reward -= 1

return reward
```
- Format checking: This function verifies whether the output adheres to the specified guideline format. It ensures that the model extracts key-value pairs and relationship triples when generating the response;
- Answer checking: This function extracts the Cypher query from the output and compares it to the ground truth. The output query should match the ground-truth query;
- Key-value checking: This function compares the key-value pairs extracted from the output with the ground-truth labels in the dataset;
- Triple-relationship checking: Similar to key-value checking, this function extracts the relationship triples from the output and ensures that they match the relationship triples in the ground truth.
4. Experiments
4.1. Datasets and Training Procedure
- Dataset 1: This dataset was collected from publicly available sources [29] and supplemented with our own contributions. It contains 7741 instances of natural language questions paired with their corresponding Cypher queries, divided into 4934 training instances and 2807 testing instances. The dataset covers 14 different Neo4j graph database (GraphDB) schemas spanning various domains, including social networks, movies and entertainment, and business and organizational data. We organize the training and testing sets so that the schemas are disjoint: the training set includes samples from 11 schemas, while the test set contains samples from the remaining 3 schemas. We designed the split so that the training-set schemas differ from the test-set schemas not only in structure but also in domain topic. This setup allows us to evaluate the model's ability to generalize to unseen and semantically distinct schemas.
- Dataset 2: This dataset, introduced by Neo4j in [7], consists of 44,387 question–Cypher query pairs, split into 39,554 training instances and 4833 testing instances. This Neo4j Text2Cypher dataset also covers a broad range of domains, including social networks, business, and media, by combining data from various sources. We use this dataset to compare our results with other works on both the BLEU and execution-score metrics. The full 4833-sample test set was used for BLEU evaluation. Evaluating the execution score, however, requires the corresponding database to execute each query and retrieve its result. Since the databases were not fully provided with the dataset, we selected only the 1460 test samples that corresponded to Neo4j graph databases we could obtain.
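The two metrics differ in what they compare: BLEU scores the predicted query string against the reference string, while the execution score compares the result sets returned by actually running both queries. A minimal execution-score loop might look like the following sketch, where `run_query` is a hypothetical helper connected to the matching Neo4j database (not part of the paper's code):

```python
def execution_score(pairs, run_query):
    """Fraction of predicted Cypher queries whose execution result matches
    the result of the corresponding ground-truth query.

    `pairs` holds (predicted_query, ground_truth_query) tuples;
    `run_query` is a hypothetical helper that executes a Cypher query
    against the matching Neo4j database and returns its rows."""
    correct = 0
    for predicted, ground_truth in pairs:
        try:
            # Compare result sets order-insensitively.
            if sorted(map(str, run_query(predicted))) == sorted(map(str, run_query(ground_truth))):
                correct += 1
        except Exception:
            # Queries that fail to parse or execute count as incorrect.
            pass
    return correct / len(pairs)
```

This is why the execution score needs the underlying databases: a query can differ textually from the reference (hurting BLEU) yet still return the correct result.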
- Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz;
- NVIDIA GeForce RTX 4090 24 GB;
- Memory: 128 GB.
- Peak memory usage during inference: ~3 GB;
- Average response time per query: ~3 s.
4.2. Experiment Results and Discussions
5. Discussion
5.1. Contribution of Reinforcement Learning with Support Tasks
5.2. Limitations and Potentials
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Timón-Reina, S.; Rincón, M.; Martínez-Tomás, R. An Overview of Graph Databases and Their Applications in the Biomedical Domain. Database 2021, 2021, baab026. [Google Scholar] [CrossRef] [PubMed]
- Almabdy, S. Comparative Analysis of Relational and Graph Databases for Social Networks. In Proceedings of the 2018 1st International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 4–6 April 2018; pp. 1–4. [Google Scholar]
- Syed, M.H.; Huy, T.Q.B.; Chung, S.-T. Context-Aware Explainable Recommendation Based on Domain Knowledge Graph. Big Data Cogn. Comput. 2022, 6, 11. [Google Scholar] [CrossRef]
- Xu, Z.; Cruz, M.J.; Guevara, M.; Wang, T.; Deshpande, M.; Wang, X.; Li, Z. Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 10 July 2024; pp. 2905–2909. [Google Scholar]
- Kobeissi, M.; Assy, N.; Gaaloul, W.; Defude, B.; Haidar, B. An Intent-Based Natural Language Interface for Querying Process Execution Data. In Proceedings of the 2021 3rd International Conference on Process Mining (ICPM), Eindhoven, The Netherlands, 31 October 2021; pp. 152–159. [Google Scholar]
- Nie, L.; Cao, S.; Shi, J.; Sun, J.; Tian, Q.; Hou, L.; Li, J.; Zhai, J. GraphQ IR: Unifying the Semantic Parsing of Graph Query Languages with One Intermediate Representation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5848–5865. [Google Scholar]
- Ozsoy, M.G.; Messallem, L.; Besga, J.; Minneci, G. Text2Cypher: Bridging Natural Language and Graph Databases. arXiv 2024, arXiv:2412.10064. [Google Scholar]
- DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics—ACL’02, Philadelphia, PA, USA, July 2002; pp. 311–318. [Google Scholar]
- Guo, A.; Li, X.; Xiao, G.; Tan, Z.; Zhao, X. SpCQL: A Semantic Parsing Dataset for Converting Natural Language into Cypher. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17 October 2022; pp. 3973–3977. [Google Scholar]
- Mohammadjafari, A.; Maida, A.S.; Gottumukkala, R. From Natural Language to SQL: Review of LLM-Based Text-to-SQL Systems. arXiv 2024, arXiv:2410.01066. [Google Scholar]
- Liu, X.; Shen, S.; Li, B.; Ma, P.; Jiang, R.; Zhang, Y.; Fan, J.; Li, G.; Tang, N.; Luo, Y. A Survey of NL2SQL with Large Language Models: Where Are We, and Where Are We Going? arXiv 2024, arXiv:2408.05109. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
- Wu, Z. Large Language Model Based Semantic Parsing for Intelligent Database Query Engine. J. Comput. Commun. 2024, 12, 1–13. [Google Scholar] [CrossRef]
- Feng, G.; Zhu, G.; Shi, S.; Sun, Y.; Fan, Z.; Gao, S.; Hu, J. Robust NL-to-Cypher Translation for KBQA: Harnessing Large Language Model with Chain of Prompts. In Knowledge Graph and Semantic Computing: Knowledge Graph Empowers Artificial General Intelligence; Wang, H., Han, X., Liu, M., Cheng, G., Liu, Y., Zhang, N., Eds.; Communications in Computer and Information Science; Springer Nature Singapore: Singapore, 2023; Volume 1923, pp. 317–326. ISBN 978-981-99-7223-4. [Google Scholar]
- Hornsteiner, M.; Kreussel, M.; Steindl, C.; Ebner, F.; Empl, P.; Schönig, S. Real-Time Text-to-Cypher Query Generation with Large Language Models for Graph Databases. Future Internet 2024, 16, 438. [Google Scholar] [CrossRef]
- Zhong, Z.; Zhong, L.; Sun, Z.; Jin, Q.; Qin, Z.; Zhang, X. SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025. [Google Scholar]
- Tiwari, A.; Malay, S.K.R.; Yadav, V.; Hashemi, M.; Madhusudhan, S.T. Auto-Cypher: Improving LLMs on Cypher Generation via LLM-Supervised Generation-Verification Framework. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 623–640. [Google Scholar]
- Coffelt, J.P.; Kampmann, P.; Beetz, M. Implementation and Application of a Knowledge Service for AUV Mission Explainability. In Proceedings of the 2025 IEEE Underwater Technology (UT), Taipei, Taiwan, 2 March 2025; pp. 1–7. [Google Scholar]
- Ozsoy, M.G. Enhancing Text2Cypher with Schema Filtering. arXiv 2025, arXiv:2505.05118. [Google Scholar]
- Tran, Q.-B.-H.; Waheed, A.A.; Chung, S.-T. Robust Text-to-Cypher Using Combination of BERT, GraphSAGE, and Transformer (CoBGT) Model. Appl. Sci. 2024, 14, 7881. [Google Scholar] [CrossRef]
- Francis, N.; Green, A.; Guagliardo, P.; Libkin, L.; Lindaaker, T.; Marsault, V.; Plantikow, S.; Rydberg, M.; Selmer, P.; Taylor, A. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 27 May 2018; pp. 1433–1445. [Google Scholar]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
- Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models Are Zero-Shot Learners. arXiv 2021, arXiv:2109.01652. [Google Scholar]
- Yenduri, G.; Ramalingam, M.; Selvi, G.C.; Supriya, Y.; Srivastava, G.; Maddikunta, P.K.R.; Raj, G.D.; Jhaveri, R.H.; Prabadevi, B.; Wang, W.; et al. GPT (Generative Pre-Trained Transformer)—A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions. IEEE Access 2024, 12, 54608–54649. [Google Scholar] [CrossRef]
- Gemini Team; Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2023, arXiv:2312.11805. [Google Scholar]
- Enis, M.; Hopkins, M. From LLM to NMT: Advancing Low-Resource Machine Translation with Claude. arXiv 2024, arXiv:2404.13813. [Google Scholar]
- Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Statist. 1951, 22, 79–86. [Google Scholar] [CrossRef]
- Text-to-Cypher Data. Available online: https://huggingface.co/datasets/tomasonjo/text2cypher-gpt4o-clean (accessed on 23 June 2025).
- Gemma2-9B-Text2Cypher Model. Available online: https://huggingface.co/neo4j/text2cypher-gemma-2-9b-it-finetuned-2024v1 (accessed on 23 June 2025).



```
Convert text to Cypher query based on this schema:
The schema: (Schema)
The text: (Question)
```
```
Node properties:
- Business
  - address: STRING
  - location: POINT
  - city: STRING
  - name: STRING
- User
  - name: STRING
  - userId: STRING
- Review
  - date: DATE
  - text: STRING
- Category
  - name: STRING
The relationships:
(:Business)-[:IN_CATEGORY]->(:Category)
(:User)-[:WROTE]->(:Review)
(:Review)-[:REVIEWS]->(:Business)
```
```
Response in the following format:
<reasoning>
Some reasoning to get the right Cypher query
<key_value>
Values extracted from the text, which can be useful for generating the Cypher query.
Example: Helen, financial crises, ...
</key_value>
<relationship>
All relationship triples used in the Cypher query (must also appear in the schema).
Example: Suppliers -[:SUPPLIES]-> Product, (:Person)-[:ACTED_IN]->(:Movie)
</relationship>
</reasoning>
<answer>
The final Cypher query based on the reasoning, key values, and relationships
</answer>
```
```
System: (Guideline)
User: Convert text to Cypher query based on this schema:
The schema: (schema)
The text: (question)
Assistant:
```
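Assembling these templates into a chat request can be sketched as follows; `guideline`, `schema`, and `question` are placeholders filled per example, and the message-dictionary layout is our assumption of a typical chat-format API, not the authors' code:

```python
def build_messages(guideline, schema, question):
    """Fill the chat template: the guideline becomes the system turn,
    and the schema plus question form the user turn."""
    user = (
        "Convert text to Cypher query based on this schema:\n"
        f"The schema: {schema}\n"
        f"The text: {question}"
    )
    return [
        {"role": "system", "content": guideline},
        {"role": "user", "content": user},
    ]
```

The model then generates the assistant turn, which is expected to follow the tagged response format above.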
| Method (Qwen2.5-3B-Instruct) | Performance |
|---|---|
| Supervised only | 80.01% |
| Supervised + RL with support task | 85.04% |
| Method (Qwen2.5-3B-Instruct) | Performance |
|---|---|
| Supervised + RL with key-value support task | 81.69% |
| Supervised + RL with relationship support task | 84.11% |
| Model | Ours | GPT-4o | GPT-4o_mini | Gemini_1.5_flash_001 | Gemma2_9B | Llama3_1_8B_instruct |
|---|---|---|---|---|---|---|
| Parameter size | 3B | 1.8T | 8B | 8B | 9B | 8B |
| Google-BLEU score | 0.7701 | 0.8017 | 0.7973 | 0.7780 | 0.6470 | 0.5560 |
| Method | Ours | Gemma-2_9B [30] |
|---|---|---|
| Parameter size | 3B | 9B |
| Execution score | 56.23% | 45.41% |
| Question | Supervised Fine-Tuning | Further Reinforcement Learning Model |
|---|---|---|
| What are the orders that were placed on 1996-08-09? | MATCH (o:Order) WHERE o.OrderDate = datetime('1996-08-09T00:00:00Z') RETURN o.OrderID (wrong) | MATCH (o:Order) WHERE o.OrderDate = '1996-08-09' RETURN o.OrderID (correct) |
| Which items were purchased by customers residing in Mexico from the beverages category? | MATCH (c:Customer)-[:PLACED]->(o:Order)-[:CONTAINS]->(p:Product) WHERE p.CategoryID = 'Beverages' AND c.Country = 'Mexico' RETURN DISTINCT p.ProductName (wrong) | MATCH (c:Customer)-[:PLACED]->(o:Order)-[:CONTAINS]->(p:Product)-[:BELONGS_TO]->(cat:Category) WHERE c.Country = 'Mexico' AND cat.CategoryName = 'Beverages' RETURN p.ProductName (correct) |
| | Dataset 1 | Dataset 2 |
|---|---|---|
| Number of nodes in schema | 2–16 nodes | 3–28 nodes |
| Schema length | 100–1000 tokens | 100–2000 tokens |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Tran, Q.-B.-H.; Waheed, A.A.; Mudasir, S.; Chung, S.-T. Refining Text2Cypher on Small Language Model with Reinforcement Learning Leveraging Semantic Information. Appl. Sci. 2025, 15, 8206. https://doi.org/10.3390/app15158206

