Article

Refining Text2Cypher on Small Language Model with Reinforcement Learning Leveraging Semantic Information

1 Department of Intelligent Systems, Soongsil University, Seoul 06978, Republic of Korea
2 Department of AI Convergence, Soongsil University, Seoul 06978, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8206; https://doi.org/10.3390/app15158206
Submission received: 16 May 2025 / Revised: 20 July 2025 / Accepted: 21 July 2025 / Published: 23 July 2025

Abstract

Text2Cypher is a text-to-text task that converts natural language questions into Cypher queries. Recent research by Neo4j on Text2Cypher demonstrates that fine-tuning a baseline language model (a pretrained and instruction-tuned generative model) using a comprehensive Text2Cypher dataset can effectively enhance query generation performance. However, the improvement is still insufficient for effectively learning the syntax and semantics of complex natural texts, particularly when applied to unseen Cypher schema structures across diverse domains during training. To address this challenge, we propose a novel refinement training method based on baseline language models, employing reinforcement learning with Group Relative Policy Optimization (GRPO). This method leverages extracted semantic information, such as key-value properties and triple relationships from input texts during the training process. Experimental results of the proposed refinement training method applied to a small-scale baseline language model (SLM) like Qwen2.5-3B-Instruct demonstrate that it achieves competitive execution accuracy scores on unseen schemas across various domains. Furthermore, the proposed method significantly outperforms most baseline LMs with larger parameter sizes in terms of Google-BLEU and execution accuracy scores over Neo4j’s comprehensive Text2Cypher dataset, with the exception of colossal LLMs such as GPT4o, GPT4o-mini, and Gemini.

1. Introduction

With the increasing adoption of graph databases, such as Neo4j, across diverse domains like biomedical research [1], social networks [2], recommendation systems [3], and RAG [4], the demand for user-friendly Cypher query interfaces has grown significantly. While the Cypher query language provides powerful expressiveness for querying graph data, its usage remains challenging for many non-expert users due to its technical complexity. This challenge motivates the task of translating natural language into Cypher queries—commonly known as Text2Cypher.
The Text2Cypher task raises several key challenges. First, a model for the task must understand both the user’s text input and the underlying graph schema. Second, Cypher queries include flexible patterns, directional relationships, and variable-length path queries that introduce additional complexity. Finally, models trained on a specific schema often struggle to generalize to new or unseen schemas, making cross-schema generalization a critical benchmark for real-world deployment.
Early rule-based methods [5,6] attempted to bridge the semantic gap between natural language and graph queries, but these approaches lacked generalization and scalability. Recent approaches based on LLMs show promising outcomes [7]. Since LLMs are pretrained, instruction-tuned, and aligned with human preferences on massive amounts of text data, they learn various language patterns, logical forms, and contextual structures. This enables them to interpret natural language in a generalizable way, including understanding and generating graph queries across different domains. However, existing LLMs have been trained mostly on ordinary text datasets. Neo4j’s Text2Cypher [7] constructs a comprehensive Text2Cypher dataset, which contains 44,387 instances consisting of input texts, corresponding Cypher outputs, and schemas, and demonstrates through experiments over various ‘pretrained and instruction-tuned generative language models’ (hereafter called baseline LMs, as in [7]) that fine-tuning with such a comprehensive Text2Cypher dataset enhances performance.
Nonetheless, we can observe that fine-tuning alone is insufficient to capture the full syntactic and semantic diversity of Cypher queries across domains.
To address this challenge, we propose a refinement training method for Text2Cypher built on baseline SLMs (small-scale baseline LMs), using reinforcement learning with Group Relative Policy Optimization (GRPO) [8]. The method leverages semantic information, such as key-value properties and triple relationships, extracted from the input texts during the refinement process. In this way, it tunes the fine-tuned model more effectively to produce more syntactically and semantically accurate queries.
We implement our proposed refinement method on a baseline SLM, Qwen2.5-3B-Instruct. We first apply further fine-tuning with a home-grown Text2Cypher dataset and Neo4j’s Text2Cypher dataset [7]. Then, we apply reinforcement learning with support tasks. The support tasks extract semantic information, such as key-value pairs and triple relationships, following guidelines contained in the input prompts.
We evaluate the proposed method in a cross-schema setting under environments with separate training and testing schemas. Experimental results show that our proposed method achieves 85% execution accuracy on unseen schemas. By comparison with the latest performance results reported in Neo4j’s recent paper [7], the proposed method applied to Qwen2.5-3B-Instruct (3B parameter size) outperforms several fine-tuned open-weighted language models even with bigger parameters with respect to Google-BLEU [9] and execution accuracy score [10].
These findings demonstrate the effectiveness of reinforcement learning utilizing semantic information like key-value and triple relationship-based rewards in improving the robustness and generalization of baseline SLMs as well as LLMs for Cypher query generation in practical graph database applications. The contributions of this paper are summarized as follows:
(1) We propose a novel refinement training method for Text2Cypher, which works well even on baseline SLMs (pretrained and instruction-tuned generative small language models). The proposed method effectively bridges the semantic gap between input texts and output Cypher queries through reinforcement learning, leveraging semantic information produced during training. To the best of our knowledge, this paper is the first to apply reinforcement learning to Text2Cypher;
(2) To reduce the semantic gap more efficiently, we propose a GRPO-based reinforcement learning optimization strategy that effectively utilizes the semantic information of key-value pairs and triple relationships extracted from the output responses, rewarding each response relative to the average reward of the group of generated responses;
(3) We propose a simple prompting approach applied to a baseline LM to extract semantic information alongside Cypher query generation. The guidance embedded in the input prompts directs the model to extract supplementary information, which enhances the Text2Cypher task.
The structure of this paper is as follows: Section 2 explains background technologies and reviews related work on Text2Cypher. Section 3 describes our proposed method, including the training strategy. Section 4 outlines the experimental setup, evaluation results, and discussion. Finally, Section 5 concludes this paper and outlines directions for future work.

2. Background and Related Works

2.1. Related Work

Compared to research on Text2SQL (Natural Query-to-SQL Query) [11,12], studies on Text2Cypher have emerged more recently. Similar to the early stages of Text2SQL research, initial approaches to Text2Cypher primarily relied on rule-based and template-based methods [5,6]. These methods required manually crafted rules for each graph schema, making them inadequate for addressing the semantic gap between natural language and graph queries effectively.
Several approaches have attempted to address the semantic gap between a natural query and a Cypher query, as in Text2SQL. For example, the Intent-Based Natural Language Interface [5] employed a two-stage pipeline combining intent detection and named entity recognition (NER) with rule-based query construction. The first stage uses machine learning for intent detection and NER, while the second stage utilizes a rule-based component to construct the corresponding Cypher query. This approach highlights the utility of sentence semantics for rule-based conversion, derived from intent detection and NER. However, it faces limitations in handling entity linking and the logical structures of complex sentences, which can lead to incomplete or incorrect queries, especially for sentences with intricate relationships. The GraphQ IR [6] introduced an intermediate representation (IR) designed to bridge the gap between natural language queries and Cypher queries. This approach consists of a neural semantic parser (using BART [13]) to convert natural language into an intermediate representation, followed by a compiler that converts this IR into a graph query. While this method effectively addresses the semantic gap between natural and graph queries, it encounters challenges with entity disambiguation and limited support for complex logic. Additionally, since GraphQ IR is schema-specific, it struggles with cross-domain adaptability. Recent advances in LLM-based semantic parsing have significantly improved the translation of natural language into structured queries. For example, Ref. [14] proposes an intelligent database query engine that leverages large language models to understand user intents and generate executable SQL queries, demonstrating the potential of LLMs in real-world semantic parsing tasks.
Although these approaches are useful in controlled settings, such as rule-based systems and intermediate representations, they often perform best when the input follows a formal and grammatically correct structure. They tend to struggle with casual or less structured language, which is common in real-world scenarios.
In contrast, large language models (LLMs) offer a more flexible solution: since LLMs have been pretrained, instruction-tuned, and aligned with human preferences on massive amounts of text data, they have learned various language patterns, logical forms, and contextual structures. LLMs can therefore interpret natural language in a generalizable way, including understanding and generating graph queries across different domains. As a result, LLM-based approaches have become the mainstream of ‘Text2Query (SQL or Cypher)’ research.
Ref. [15] developed a methodology that leverages LLMs to bridge the gap between textual questions and structured knowledge representations. It takes a natural language question and its NER results as input, combined with different prompts, to generate a Cypher statement. Ref. [16] presents an integration of large language models into NoSQL databases and knowledge graphs to facilitate real-time translation of natural language into Cypher queries; however, it does not address improving Text2Cypher task performance.
Neo4j’s Text2Cypher [7], introduced in 2024, first constructed a comprehensive Text2Cypher dataset consisting of 44,387 instances from diverse application domains, offering a rich resource for fine-tuning LLMs on the task of converting text to Cypher queries. This study also demonstrated that further fine-tuning baseline LMs using the Text2Cypher dataset enhances performance. However, this work relies primarily on fine-tuning large instruction-tuned LLMs on a comprehensive dataset, without utilizing semantic features aligned with Cypher query structures to enhance performance.
Ref. [17] introduces SyntheT2C, a framework for generating synthetic datasets to fine-tune large language models for the Text2Cypher task. Auto-Cypher in Ref. [18] presents an automated pipeline that uses LLMs to generate and verify Cypher queries, enhancing the quality of training data for Text2Cypher tasks. In Ref. [19], a pretrained Text2Cypher LLM is adapted into an AI agent to retrieve knowledge from a graph database based on user requests. A common approach in Text2Cypher is to incorporate the database schema into prompts. However, complex schemas can introduce noise, increase hallucinations, and raise computational costs. The schema filtering proposed in Ref. [20] addresses these challenges by including only relevant schema elements, improving query generation while reducing token costs.
CoBGT at Ref. [21] combines semantic value extraction with a transformer-based model to improve performance on the Text2Cypher task. However, this approach may struggle with unseen schemas, as the underlying transformer model lacks the broad pretraining and generalization capabilities of LLMs.
None of these efforts proposes a more effective training method that would let trained LMs handle the semantic gap between natural text and Cypher queries more satisfactorily, and thereby enhance the conversion of diverse, complex queries across various domains.

2.2. Background

This section briefly introduces the underlying key background concepts and technologies relevant to this paper, such as Cypher query language, the role of key-value pairs, relationship triples in graph querying, the reinforcement learning policy optimization algorithm GRPO, and the pretrained and instruction-tuned generative language model.

2.2.1. Cypher Query Language

Cypher is the declarative query language used by Neo4j [22], one of the most widely adopted graph databases. Cypher queries operate on nodes, relationships, and their properties, reflecting the structure of a graph.
Cypher queries typically follow a MATCH–WHERE–RETURN structure. The MATCH clauses define the graph pattern to search for, WHERE applies filters to nodes or relationships based on properties, and RETURN specifies which values or structures to retrieve.
An example of a Cypher query:
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
WHERE d.name = "Frank Darabont"
RETURN m.title
This query retrieves the titles of movies directed by Frank Darabont. In the MATCH clause, the node labels “Director” and “Movie” represent the types of entities, while [:DIRECTED] defines the relationship between them. The pattern (d:Director)-[:DIRECTED]->(m:Movie) is the triple relationship. The WHERE clause filters results using the key value “Frank Darabont”. The RETURN clause specifies that the titles of the matched movies should be included in the result.
Due to the structure of Cypher queries, accurately specifying triple relationships in the MATCH clause and key values in the WHERE clause is crucial for generating precise queries that retrieve the desired information from the graph database. Utilizing semantic information for matching between a natural language query and SQL query has been exploited in Text2SQL [5,6]. But, in research on Text2Cypher so far, semantic information such as key-value and triple relationships has not been fully considered. In this paper, we show that key-value extraction and triple relationship extraction are essential tasks in converting natural language into Cypher queries. Previous research [21] has also shown that incorporating key-value and triple relationship features can significantly improve model performance in text-to-Cypher generation.

2.2.2. Text2Cypher and Comparison with Text2SQL

Text2SQL and Text2Cypher both translate natural language into structured queries, but differ in complexity and resource availability. While SQL queries typically involve selecting columns from tables and joining them based on keys, Cypher is designed for graph databases, which model data as nodes, relationships, and properties. This structure is inherently more complex than the tabular format of relational databases used by SQL. Translating natural language into Cypher often requires understanding and mapping multi-hop relationships and graph patterns, which is more challenging than mapping to the relatively straightforward joins and selections in SQL.
Text2SQL benefits from more mature, larger, and higher-quality datasets, making it easier to train and benchmark models. Text2Cypher suffers from a scarcity of high-quality, publicly available datasets. Datasets are often developed independently, making them difficult to combine and use effectively. This lack of resources makes it more difficult to train robust models for Text2Cypher.
Text2Cypher requires an understanding of graph database schema, node, and relationship types, as well as how to express complex traversals in Cypher. This demands more specialized knowledge compared to SQL, which is more widely known and used. Text2SQL requires an understanding of table schemas and relationships, but the learning curve is generally less steep.

2.2.3. Key-Value Extraction

A key value in this paper refers to a specific text span—typically a name, date, location, or number—mentioned in a natural language question that corresponds to a property value of a node (or occasionally a relationship) in the graph. These values are essential for filtering query results and are typically used in the WHERE clause of the Cypher query.
Key values appear explicitly in the user question. They usually represent identifiable facts such as a name (“Frank Darabont”), a year (1994), a place (“New York”), or numerical conditions. Extracted key values can usually be mapped to the correct property in the graph schema during query generation.
Example: 
“Can you provide a list of actors who appeared in movies directed by Frank Darabont?”
Generated Cypher:
MATCH (a:Actor)-[:ACTED_IN]->(:Movie)<-[:DIRECTED]-(d:Director)
WHERE d.name = "Frank Darabont"
RETURN DISTINCT a.name
In this example, the key value “Frank Darabont” is mapped to the name property of the Director node. It appears in the WHERE clause to restrict results to movies directed by that specific person.
Identifying key values accurately is crucial for ensuring that the Cypher query targets the intended entities. Failure to extract them properly may result in incomplete or incorrect query results.
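As a rough illustration (not the paper’s actual extractor), key values of this kind can be pulled out of a generated WHERE clause with a simple pattern match. The function name and the equality-only handling below are our own simplifying assumptions:

```python
import re

def extract_key_values(cypher_query):
    """Collect property filters of the form var.prop = "literal" or
    var.prop = number from a Cypher query.
    Hypothetical helper: handles equality comparisons only."""
    pattern = re.compile(r'(\w+)\.(\w+)\s*=\s*(?:"([^"]*)"|(\d+(?:\.\d+)?))')
    pairs = {}
    for var, prop, text_value, num_value in pattern.findall(cypher_query):
        pairs[f"{var}.{prop}"] = text_value if text_value else num_value
    return pairs

query = 'MATCH (d:Director)-[:DIRECTED]->(m:Movie) WHERE d.name = "Frank Darabont" RETURN m.title'
print(extract_key_values(query))  # {'d.name': 'Frank Darabont'}
```

In practice, the model itself emits the key-value pairs inside a dedicated tag (Section 3.3); a pattern-based check like this is only useful for verifying them against the query.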

2.2.4. Relevant Triple Relationship Extraction

Relevant triple relationship extraction involves identifying the structural graph patterns implied by the natural language input—specifically, which entities are connected and how.
A triple in graph query terms is composed of (subject node)-[:relationship]->(object node). In Cypher, this translates into patterns like (:Person)-[:WORKS_AT]->(:Company).
The extracted triple relationships guide the MATCH clause of Cypher queries by specifying the traversal path through the graph. These patterns are derived from understanding the semantic relationships expressed in the user’s question.
Example: 
“Can you provide a list of actors who appeared in movies directed by Frank Darabont?”
Generated Cypher:
MATCH (a:Actor)-[:ACTED_IN]->(:Movie)<-[:DIRECTED]-(d:Director)
WHERE d.name = "Frank Darabont"
RETURN DISTINCT a.name
Relevant triples:
(Actor)-[:ACTED_IN]->(Movie)
(Movie)<-[:DIRECTED]-(Director)
Extracting these relevant triples correctly ensures that the model builds valid and meaningful graph traversal patterns reflecting the user intent. Without this step, even if the key-value pair is correct, the query may follow the wrong path and produce irrelevant or incomplete results.
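To make the notion concrete, a linear MATCH pattern like the one above can be decomposed into direction-normalized triples. The parser below is a minimal sketch of our own (single-hop chains of labeled nodes only), not the extraction mechanism used in the paper:

```python
import re

def extract_triples(match_pattern):
    """Decompose a linear Cypher MATCH pattern into direction-normalized
    (subject, relationship, object) triples. A minimal sketch: assumes a
    chain of labeled nodes and ignores variable-length paths, multiple
    labels, and property maps."""
    token_re = re.compile(
        r"\((?:\w+)?:(\w+)\)"       # labeled node, e.g. (a:Actor) or (:Movie)
        r"|(-\[:(\w+)\]->)"         # right-pointing relationship
        r"|(<-\[:(\w+)\]-)"         # left-pointing relationship
    )
    tokens = []
    for m in token_re.finditer(match_pattern):
        if m.group(1):
            tokens.append(("node", m.group(1)))
        elif m.group(2):
            tokens.append(("rel_right", m.group(3)))
        else:
            tokens.append(("rel_left", m.group(5)))
    triples = []
    # Walk the node-rel-node chain: every odd-indexed token is a relationship.
    for i in range(1, len(tokens) - 1, 2):
        kind, rel = tokens[i]
        left, right = tokens[i - 1][1], tokens[i + 1][1]
        if kind == "rel_right":     # (left)-[:REL]->(right)
            triples.append((left, rel, right))
        else:                       # (left)<-[:REL]-(right): right is the subject
            triples.append((right, rel, left))
    return triples

pattern = "(a:Actor)-[:ACTED_IN]->(:Movie)<-[:DIRECTED]-(d:Director)"
print(extract_triples(pattern))
# [('Actor', 'ACTED_IN', 'Movie'), ('Director', 'DIRECTED', 'Movie')]
```

Note how the left-pointing arrow is flipped so that both triples read subject-first, matching the “Relevant triples” listed above.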

2.2.5. Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) [8] is a policy optimization method that introduces the concept of group-based learning into reinforcement learning (RL). It addresses some of the limitations of traditional policy optimization methods such as Proximal Policy Optimization (PPO) [23], which requires a critic model to estimate how good a model’s action or output is. GRPO optimizes model behavior based on relative rankings among responses generated for the same input. This ranking-based feedback is inherently more robust to noise and does not require training a separate reward model, which can introduce instability and additional complexity. Moreover, by comparing outputs within localized groups, GRPO effectively regularizes updates, reducing variance and preventing over-optimization on individual examples.
GRPO takes a simpler and more intuitive approach. Instead of trying to estimate absolute values of model outputs, it works by comparing multiple responses generated by the model for the same input (e.g., a user prompt). For each prompt, the model generates a set of possible outputs (responses), and each one is assigned a reward using a predefined reward function (e.g., right format, or similarity to a reference answer).
Then, GRPO calculates the advantage of each response by subtracting the average reward of all responses in that group from the reward of the individual response. This advantage reflects how much better or worse the response is compared to others in the same group. By using this relative comparison within a group, GRPO avoids needing an external value function.
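The group-relative normalization just described is simple to state in code. A minimal sketch (the names are ours, and a small epsilon guards against a zero standard deviation when all rewards in a group are equal):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and (population) standard
    deviation of its own group, as GRPO does per prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses for one prompt, scored by the reward function:
print(group_relative_advantages([4, 0, -2, 2]))
```

The best response in the group gets the largest positive advantage and the worst gets the most negative one, regardless of the absolute reward scale.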
The model is then updated using a clipped surrogate loss, which helps prevent overly large changes in the model’s behavior. Additionally, a KL-divergence penalty is added to the loss to ensure that the new model does not deviate too far from the original model. This combination allows GRPO to make safe and meaningful updates to the policy.
Overall, GRPO offers a simpler, more robust alternative to traditional reinforcement learning methods for language model fine-tuning. It improves sample efficiency, reduces training instability, and works well in applications like instruction tuning and reward-based optimization of LLMs.

2.2.6. Baseline LM: Pretrained and Instruction-Tuned Generative Language Model

Pretrained large language models (LLMs) are artificial intelligence models that have undergone an initial training phase, called pretraining, on massive and diverse text datasets before being adapted for specific tasks. During pretraining, the model learns general language patterns, grammar, semantics, and even some world knowledge by predicting missing or next words in sentences, often using self-supervised learning techniques. The broad language understanding acquired during pretraining enables effective fine-tuning for specialized domains or tasks.
FLAN [24] demonstrates that instruction-tuning a pretrained language model on a diverse set of NLP tasks using natural language instructions significantly enhances zero-shot learning abilities. Instead of only learning from raw text, the model is explicitly taught to follow instructions phrased in natural language, which improves its ability to generalize to unseen tasks. Beyond large-scale LLMs such as GPT [25], Gemini [26], and Claude [27], modern small-scale instruction-tuned generative LMs like Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct, Gemma-3-4b-it, and others influenced by FLAN’s approach are instruction-tuned and aligned with human preferences and world knowledge on vast amounts of diverse instruction–response pairs, enabling them to capture patterns and relationships between words and to understand and execute a wide range of user instructions, even for tasks not explicitly seen during training. Thus, even small-scale instruction-tuned LMs can be easily adapted to new domains or to any text-to-text generative task like ‘Text2Cypher’ without extensive retraining or architectural changes. They have demonstrated strong capabilities in language understanding and generation, semantic parsing, and few-shot or zero-shot generalization [11,12].
However, SQL and Cypher have their own syntax patterns and semantics that differ from natural language. Since even large-scale LLMs have been trained mostly on natural language text data, they have limitations in bridging the semantic gap between natural sentences and query sentences. Thus, further fine-tuning instruction-tuned LLMs with appropriate datasets has been suggested and reported to be effective in improving performance [7]. Nonetheless, such simple further fine-tuning with datasets cannot fully handle the many variants of this semantic gap.
This paper addresses this challenge. We let baseline SLM be refined by reinforcement learning with GRPO, utilizing the semantic information of key-value and triple relationships extracted during the training process. These techniques enhance the models’ ability to generate accurate and executable Cypher queries, particularly in scenarios with limited training data for specific schemas.

3. Proposed Method

The proposed method aims to enhance query correctness and improve schema generalization by incorporating auxiliary tasks during reinforcement learning—specifically, key-value pair extraction and triple-relationship construction.

3.1. Overview of the Proposed Method

The proposed method starts with an open-source small-scale baseline LM (like Qwen2.5-3B-Instruct).
The overall workflow of our proposed method follows a two-stage training pipeline:
(1) Supervised fine-tuning: The baseline SLM is initially fine-tuned on paired natural language–Cypher examples, with schema context included in the input to help the model learn schema-aware query generation;
(2) Reinforcement learning with support tasks: The fine-tuned baseline SLM is further optimized using reinforcement learning with the GRPO optimization policy. During this stage, the model also learns two auxiliary tasks—key-value pair extraction and relationship triple extraction—which we refer to as support tasks. These tasks target the identification of core elements in Cypher queries, such as entities, attributes, and their relationships. By learning and performing these support tasks before generating the final query, the model is better guided during training, which helps improve both the precision and the generalizability of the generated Cypher queries.
Figure 1 illustrates the overall process of our proposed refining training method.

3.2. Supervised Fine-Tuning

In Stage 1, we fine-tune the baseline SLM using natural language–Cypher pairs along with schema context (Figure 2).
The goal of this stage is to teach the model the basic syntax of Cypher, fundamental schema mapping, and logical query construction. The input prompt includes a natural language question along with schema context—specifically, node types, relationship types, and property names. The output is a Cypher query corresponding to the input question. The model is trained in a supervised manner using cross-entropy loss, which encourages it to maximize the probability of generating the correct sequence of query tokens. The input prompt is shown in Table 1:
The schema format can be flexible, but we encourage including information about node type, node properties, and all triple relationships presented in the graph database. An example is shown in Table 2.
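For illustration, assembling the fine-tuning input from the schema context and the question might look as follows. The exact wording and layout are those given in Table 1 and Table 2, so this f-string is only a hypothetical stand-in:

```python
def build_sft_prompt(schema, question):
    """Combine schema context and a natural language question into one
    supervised fine-tuning input (illustrative layout, not the paper's)."""
    return (
        "Schema:\n"
        f"{schema}\n\n"
        f"Question: {question}\n"
        "Cypher query:"
    )

schema = "(:Director)-[:DIRECTED]->(:Movie); Director.name; Movie.title"
print(build_sft_prompt(schema, "Which movies did Frank Darabont direct?"))
```

The target sequence paired with this prompt is the ground-truth Cypher query, and training maximizes its token probabilities under cross-entropy loss.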

3.3. Reinforcement Learning with Support Tasks

After supervised fine-tuning, we apply reinforcement learning with the GRPO optimization policy to refine the fine-tuned baseline SLM. This stage includes prompt construction, group response generation, criteria checking, objective function calculation, and model updating. This learning process is illustrated in Figure 3.
Prompt construction:
In contrast to the supervised fine-tuning stage, we combine the schema context, the input question, and a set of guidelines to construct the input, which is then fed into the fine-tuned baseline SLM. The guideline introduces the support tasks, enabling the model to extract key-value pairs and relationship triples in addition to generating a Cypher query. The guideline specifies the format that the model should follow. First, any reasoning-related terms are placed inside the “reasoning” tag. Within this tag, the model identifies and specifies the key-value pairs to extract from the input, which are useful for generating the Cypher query. These are placed inside the “key_value” tag. Next, in the “relationship” tag, the model selects the relevant relationship triples based on the schema. Finally, in the “answer” tag, the model generates the Cypher query corresponding to the input request. The guideline and the full input prompt are shown in Table 3 and Table 4, respectively.
Group response generation and criteria checking:
After the prompt is fed into the model, it generates a group of responses. The fine-tuned baseline SLM can be configured with a temperature parameter, which affects the randomness of the generated results. The temperature value ranges from 0 to 1, where a higher value increases randomness. In our experiments, we set the temperature to 0.6, allowing the SLM to generate a diverse set of responses for a single input request. For each response, we calculate a reward by applying the criteria checking.
Pseudocode for the reward function (Python-style):
def compute_reward(output, ground_truth):
    # output: predicted output from the model
    reward = 0
    # 1. Format checking (can be performed with a regular expression)
    if check_format(output):
        reward += 1
    else:
        reward -= 1
    # 2. Answer checking
    output_query = extract_cypher_query(output)
    if output_query == ground_truth.cypher_query:
        reward += 1
    else:
        reward -= 1
    # 3. Key-value checking (extracted pairs must cover the ground truth)
    output_key_values = extract_key_value(output)
    if set(ground_truth.key_values) <= set(output_key_values):
        reward += 1
    else:
        reward -= 1
    # 4. Relationship triple checking (extracted triples must cover the ground truth)
    output_triples = extract_relationship_triples(output)
    if set(ground_truth.triples) <= set(output_triples):
        reward += 1
    else:
        reward -= 1
    return reward
We implement four reward functions for criteria checking, which assign a reward score to each response. These functions include the following:
- Format checking: This function verifies whether the output adheres to the specified guideline format. It ensures that the model extracts key-value pairs and relationship triples when generating the response;
- Answer checking: This function extracts the Cypher query from the output and compares it to the ground truth. The output answer should resemble the ground truth query;
- Key-value checking: This function compares the key-value pairs extracted from the output with the ground truth labels in the dataset;
- Triple-relationship checking: Similar to key-value checking, this function extracts the relationship triples from the output and ensures that they match the relationship triples in the ground truth.
All the comparisons with ground truth are string-based.
If the result satisfies the criteria, the response receives a positive reward; otherwise, the reward is negative. The rewards from each criterion are added together as in Formula (1) to calculate the final reward.
After calculating the reward for each output, we determine the advantage value for each response using Formula (2), as in the GRPO method [8].
$r_i = c_{i1} + c_{i2} + c_{i3} + c_{i4}$ (1)
$A_i = \dfrac{r_i - \operatorname{mean}(\{r_1, r_2, \ldots, r_G\})}{\operatorname{std}(\{r_1, r_2, \ldots, r_G\})}$ (2)
$c_{ij}$: reward score of response $o_i$ on criterion $j$ ($j = 1, 2, 3, 4$; we have 4 criteria);
$r_i$: overall reward score of response $o_i$ ($i = 1, \ldots, G$; $G$ := number of output responses in the group);
$A_i$: advantage value of response $o_i$.
Then, the mode is optimized by maximizing the objective Formula (3)
J GRPO ( θ )   =   E q P ( Q ) , { o } i = 1 G π θ ed ( O q ) 1 G i = 1 G min π θ o i q π θ old ( o i q A i ,   clip π θ o i q π θ old o i q , 1 ϵ , 1 + ϵ A i β D KL π θ | | π ref
ε and β: hyperparameters;
π_θ: policy model;
π_θ_old: old policy model;
E: expectation;
D_KL: Kullback–Leibler divergence function [28];
q: the input text (prompt) drawn from the training data distribution P(Q);
π_ref: reference policy model;
{o_i}_{i=1}^G: group of responses sampled from the old policy model π_θ_old.
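Objective (3) for a single group can be sketched in plain Python (illustrative only: it assumes per-response log-probabilities and a precomputed scalar KL estimate, whereas a real implementation operates on token-level tensors):

```python
import math

def grpo_objective(logp_new, logp_old, advantages, kl, eps=0.2, beta=0.04):
    """Formula (3): clipped importance-ratio surrogate minus a KL penalty."""
    terms = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)          # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * a, clipped * a))  # pessimistic (min) bound
    return sum(terms) / len(terms) - beta * kl     # maximized during training
```

When the new and old policies coincide, the ratio is 1 and the clip is inactive; when the ratio drifts outside [1 − ε, 1 + ε], the clipped term caps the size of the policy update.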

4. Experiments

In this section, we describe the datasets, training procedures, and implementation details used to develop and evaluate our proposed method.
We adopt Qwen2.5-3B-Instruct with approximately 3 billion parameters, among publicly available baseline SLMs, as the backbone for our experiments. This model strikes a balance between computational efficiency and performance, making it suitable for scenarios with limited resources. Additionally, Qwen2.5-3B-Instruct has demonstrated strong performance in code generation tasks when compared with other baseline SLMs in recent evaluations.

4.1. Datasets and Training Procedure

To evaluate the performance of our proposed method, we use two distinct datasets:
- Dataset 1: This dataset was collected from publicly available sources [29] and supplemented with our own contributions. It contains 7741 instances of natural language questions paired with their corresponding Cypher queries. The data is divided into 4934 instances for training and 2807 instances for testing. The dataset covers 14 different Neo4j graph database (GraphDB) schemas spanning various domains, including social networks, movies and entertainment, and business and organizational data. We organize the training and testing sets so that the schemas are disjoint. The training set includes samples from 11 schemas, while the test set contains samples from the remaining 3 schemas. We designed the split such that the schemas in the training set differ not only in structure but also in domain topics compared to those in the test set. This setup allows us to evaluate the model’s ability to generalize to unseen and semantically distinct schemas.
Training dataset: The training dataset includes 11 GraphDB schemas covering diverse domains such as social networks (e.g., relationships between streamers and users), question answering on Stack Overflow, company organizational structures, and IT network management. The schema complexity varies significantly, ranging from simple graphs with 3 node labels to more complex schemas with up to 16 node labels.
Testing dataset: The test dataset consists of three GraphDB schemas related to distinct domains: movie data (e.g., actor–director relationships) and business transactions (e.g., employee–order–product–supplier relationships). These schemas include between two and six node labels, allowing us to evaluate the model’s performance on previously unseen and topically different graph structures;
- Dataset 2: This dataset, introduced by Neo4j in [7], consists of 44,387 instances of question–Cypher query pairs: 39,554 instances are used for training and 4833 for testing. This Neo4j Text2Cypher dataset also covers a broad range of domains, including social networks, business, and media, by combining data from various sources. We use this dataset to compare our results with other works on both the BLEU and execution score metrics. The full 4833-sample test set was used for BLEU evaluation. However, evaluating the execution score requires the corresponding database to execute each query and retrieve its result. Since the databases were not fully provided with the dataset, we selected only the 1460 test samples for which we could obtain the corresponding Neo4j graph databases.
Our training pipeline is organized into two stages. First, the training set is split into two parts: 90% is used for supervised fine-tuning and 10% for reinforcement learning. Supervised fine-tuning is then conducted for two epochs with a learning rate of 2 × 10−4 and a batch size of 8. After that, reinforcement learning with GRPO and the support tasks is conducted for one epoch with hyperparameters β = 0.04 and ε = 0.2; during this stage, we set the number of generated responses to 10 and the temperature to 0.6. The experiments were conducted using the following hardware setup:
  • Intel (R) Core (TM) i9-10900X CPU @ 3.70 GHz;
  • NVIDIA GeForce RTX 4090 24 GB;
  • Memory 128 GB.
The total training time was approximately 5 h, consisting of 1 h for supervised fine-tuning and 4 h for reinforcement learning. During training, memory usage peaked at around 20 GB.
Inference efficiency:
  • Peak memory usage during inference: ~3 GB;
  • Average response time per query: ~3 s.
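The 90/10 stage split described above can be sketched as follows (the random shuffle and the fixed seed are our assumptions; the paper does not state how the split was drawn):

```python
import random

def split_for_stages(train_set, sft_fraction=0.9, seed=42):
    """Split the training set: 90% for supervised fine-tuning (SFT),
    the remaining 10% for GRPO reinforcement learning."""
    data = list(train_set)
    random.Random(seed).shuffle(data)  # seeded shuffle for reproducibility
    cut = int(len(data) * sft_fraction)
    return data[:cut], data[cut:]
```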

4.2. Experiment Results and Discussions

We adopt two evaluation metrics for this experiment: execution accuracy score [10] and Google-BLEU [9] score, as in [7].
Execution accuracy score [10] measures the correctness of a generated Cypher query by executing it against the actual Neo4j database and comparing its results with those of the ground truth query. The score is 1 (the execution result matches the ground truth) or 0 (the execution result differs from the ground truth).
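The comparison step can be sketched as follows, assuming both queries have already been executed against the database (e.g., via the official Neo4j Python driver) and their records collected as lists of dictionaries; treating the results as unordered multisets is our assumption, since Cypher results are unordered unless ORDER BY is used:

```python
def execution_accuracy(pred_records, gt_records):
    """Return 1 if the generated query's result multiset matches the
    ground-truth query's result multiset, 0 otherwise."""
    def canon(records):
        # Canonicalize each record (sort keys within a row, then sort rows).
        return sorted(str(sorted(r.items())) for r in records)
    return 1 if canon(pred_records) == canon(gt_records) else 0
```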
Google-BLEU score is a metric commonly used for evaluating text generation tasks, including machine translation and text-to-SQL/Cypher generation. It measures the similarity between the generated query and a reference (ground truth) query by comparing n-grams (unigrams, bigrams, etc.). The value of the score is a real number between 0 and 1. However, BLEU only captures lexical similarity and does not verify whether the generated Cypher query is logically correct or executable.
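Google-BLEU (GLEU) takes the minimum of n-gram precision and recall over orders 1 to 4. A minimal whitespace-tokenized sketch (library implementations such as NLTK's sentence_gleu handle tokenization and edge cases more carefully):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def google_bleu(hypothesis, reference, max_n=4):
    """GLEU = min(n-gram precision, n-gram recall), n-grams pooled over 1..max_n."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_counts, ref_counts = Counter(), Counter()
    for n in range(1, max_n + 1):
        hyp_counts.update(ngrams(hyp, n))
        ref_counts.update(ngrams(ref, n))
    overlap = sum((hyp_counts & ref_counts).values())  # clipped n-gram matches
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return min(precision, recall)
```

An exact match scores 1.0, while a query sharing only scattered keywords with the reference scores close to 0, which illustrates why BLEU-style scores say nothing about executability.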
The experimental result for dataset 1 is shown in Table 5 and Table 6. Table 5 shows an ablation study comparing the performance of supervised fine-tuning alone with the proposed refining training (supervised fine-tuning and reinforcement learning). The result shows that reinforcement learning with support tasks improves execution accuracy by 5.03% over supervised fine-tuning alone. Table 6 compares single-task support settings: the triple relationship improves accuracy more than key-value extraction alone.
In Table 7, we present the results on dataset 2 (Neo4j’s Text2Cypher dataset) with respect to Google-BLEU score metrics. We compare our method (Supervised Fine-tuning + Reinforcement Learning with Support Tasks) with other models reported in [7]. The results show that GPT-4o achieves the highest score of 0.8017. However, our model also performs competitively with a score of 0.7701, surpassing even Gemma 2, a significantly larger language model with 9B parameters.
As mentioned above, it is difficult to evaluate execution accuracy on the second dataset because the Neo4j graph databases for executing its test queries were not fully provided. Therefore, we selected 1460 samples from the test dataset for this metric and evaluated Gemma-2-9b-it [30] on the same subset (since results for this subset were not reported in [7]). The experimental results in Table 8 show that our method, with an execution accuracy of 56.23%, outperforms Gemma 2 (pretrained on the Text2Cypher task introduced in [7,30]).
In Table 9, we present sample Cypher queries predicted by our proposed method.
As shown in Table 9, our model generates more accurate Cypher queries than the supervised fine-tuning baseline in both simple and complex scenarios. For instance, when filtering by date, the supervised model fails to match the expected format, resulting in an invalid query. In contrast, our model generates the correct date string format, suggesting improved handling of value-oriented conditions. Additionally, for queries requiring multi-hop reasoning, our model successfully captures the correct relational path through the schema.

5. Discussion

5.1. Contribution of Reinforcement Learning with Support Tasks

Our experiments demonstrate that reinforcement learning with support tasks—specifically key-value and triple-relationship extraction—plays a critical role in enhancing generalization to unseen schemas. Compared to purely supervised learning, our method provides explicit semantic supervision during training, which teaches the model to better understand the structural and property-level semantics of Cypher queries. This leads to more accurate and interpretable query generation across diverse domains.
Compared to prior approaches, such as Neo4j’s Text2Cypher [7] and CoBGT [21], our method introduces two key distinctions. First, while [7] relies heavily on large language models fine-tuned with comprehensive data, it lacks explicit semantic task supervision and struggles with generalization to structurally different schemas. CoBGT [21], on the other hand, introduces semantic guidance via graph processing, but it is less flexible in adapting to new domains due to its reliance on a preprocessed graph structure. In contrast, our method leverages the pretrained knowledge of language models combined with schema-aware prompt construction and response guidance, offering a more adaptable and lightweight solution for diverse and unseen graph structures.
Approaches like GraphQ IR [6] and rule-based methods [5] attempt to bridge the semantic gap through intermediate representations or template construction. However, these methods are less effective in handling informal or complex natural language compared to large language models (LLMs). Our approach leverages LLMs through prompt-based training, while enhancing semantic understanding through guided supervised and reinforcement learning. Specifically, we incorporate auxiliary support tasks into the training process, enabling the model to better capture the structural and semantic patterns required for Cypher query generation.
Research in [15,16] adopts a system-level or prompt-based approach to improve Text2Cypher performance by leveraging large language models (LLMs). However, their focus remains primarily on system architecture rather than training optimization. As these methods do not involve adapting or fine-tuning the model itself, they typically rely on powerful LLMs such as ChatGPT. This reliance limits their applicability in local or private deployment settings, where computational resources and privacy requirements constrain the use of large cloud-based models. In contrast, our approach focuses on training-time optimization of open-source LLMs using reinforcement learning guided by support tasks. This enables competitive performance even with limited computational resources, making our method more practical for private, offline, or resource-constrained environments.
One notable outcome of our study is that small language models (≤3B parameters) can perform very competitively when properly tuned during training. This finding is important because it shows that high-quality Text-to-Cypher generation is possible without relying on ultra-large models. In our experiments, the maximum memory consumption during training was approximately 20 GB. For deployment, however, the model requires only about 3 GB, which fits on many common consumer-grade GPUs, making it feasible for local, private, or resource-constrained environments.

5.2. Limitations and Potentials

We also acknowledge limitations in our current reward design. The reward function includes format verification, answer similarity, and partial correctness of extracted key-values and relationships. While effective to some degree, performance on Dataset 2 remains limited in terms of execution accuracy. This is likely due to the complexity of its schemas. As shown in Table 10, Dataset 2 includes schemas with a higher number of node and relationship types, along with longer and denser descriptions. This structural richness increases the difficulty of accurately encoding the schema context and generating correct query logic, especially for small models. This reveals a current limitation of our approach and a promising direction for future research: enhancing the model’s schema comprehension through better schema-aware prompt construction or exploring more sophisticated reward mechanisms, such as graph-structural matching, semantic equivalence evaluation, or reward signals derived from execution correctness (e.g., output results matching ground truth answers).
While our work focuses on Cypher, the support tasks we designed—like extracting key-value pairs and relationship triples—can also be applied to other graph query languages. With some adjustments, our reinforcement learning approach could be used for SPARQL, Gremlin, or GSQL. This makes it possible to develop a more general Text-to-GraphQuery system that works across different types of graph databases.

6. Conclusions

In this paper, we proposed a refining training method to enhance Cypher query generation from natural language inputs, with a focus on achieving high performance using small language models suitable for local generative AI applications. Our method incorporates reinforcement learning with GRPO optimization policy and introduces schema-guided support tasks—specifically, key-value extraction and triple-relationship extraction—to guide the learning process beyond standard supervised fine-tuning.
Through this reinforcement training with support tasks, our proposed refining training method demonstrates strong cross-schema generalization, achieving 85% execution accuracy on previously unseen graph schemas. Experimental results of an ablation study and comparisons with strong baselines show that, when appropriately refined, small language models can match or even outperform significantly larger models. These findings underscore the effectiveness of leveraging fine-grained reinforcement learning with semantic information for structured query generation, particularly in scenarios involving complex or unfamiliar schemas.
Future work may explore integrating execution feedback into the training loop. By allowing the model to learn from execution errors (e.g., syntax violations or runtime failures), we can provide a more realistic and dynamic supervision signal, helping the model iteratively improve its outputs. This could be particularly beneficial for handling syntactically complex or semantically ambiguous queries.
Additionally, further work is needed to improve performance on highly complicated or large-scale schemas. Enhancing the model’s ability to encode and reason over dense schema descriptions—perhaps through better prompt construction or schema summarization techniques—remains an important area of investigation.
While this work focuses on Cypher, the proposed approach is adaptable to other graph query languages such as SPARQL or Gremlin. Future work could explore language-agnostic query generation pipelines or support-task design specific to different graph data models.

Author Contributions

Conceptualization, Q.-B.-H.T. and S.-T.C.; methodology, Q.-B.-H.T.; software, Q.-B.-H.T.; validation, Q.-B.-H.T., S.M. and A.A.W.; formal analysis, Q.-B.-H.T., S.M. and S.-T.C.; investigation, Q.-B.-H.T.; resources, Q.-B.-H.T.; data curation, Q.-B.-H.T.; writing—original draft preparation, Q.-B.-H.T.; writing—review and editing, Q.-B.-H.T., S.-T.C. and A.A.W.; visualization, Q.-B.-H.T.; supervision, S.-T.C.; project administration, Q.-B.-H.T.; funding acquisition, S.-T.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the National Program for Excellence in SW (2024-0-00071) supervised by the IITP (Institute of Information and Communications Technology Planning and Evaluation).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in this article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Timón-Reina, S.; Rincón, M.; Martínez-Tomás, R. An Overview of Graph Databases and Their Applications in the Biomedical Domain. Database 2021, 2021, baab026.
  2. Almabdy, S. Comparative Analysis of Relational and Graph Databases for Social Networks. In Proceedings of the 2018 1st International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 4–6 April 2018; pp. 1–4.
  3. Syed, M.H.; Huy, T.Q.B.; Chung, S.-T. Context-Aware Explainable Recommendation Based on Domain Knowledge Graph. Big Data Cogn. Comput. 2022, 6, 11.
  4. Xu, Z.; Cruz, M.J.; Guevara, M.; Wang, T.; Deshpande, M.; Wang, X.; Li, Z. Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 10 July 2024; pp. 2905–2909.
  5. Kobeissi, M.; Assy, N.; Gaaloul, W.; Defude, B.; Haidar, B. An Intent-Based Natural Language Interface for Querying Process Execution Data. In Proceedings of the 2021 3rd International Conference on Process Mining (ICPM), Eindhoven, The Netherlands, 31 October 2021; pp. 152–159.
  6. Nie, L.; Cao, S.; Shi, J.; Sun, J.; Tian, Q.; Hou, L.; Li, J.; Zhai, J. GraphQ IR: Unifying the Semantic Parsing of Graph Query Languages with One Intermediate Representation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5848–5865.
  7. Ozsoy, M.G.; Messallem, L.; Besga, J.; Minneci, G. Text2Cypher: Bridging Natural Language and Graph Databases. arXiv 2024, arXiv:2412.10064.
  8. DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948.
  9. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics—ACL’02, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318.
  10. Guo, A.; Li, X.; Xiao, G.; Tan, Z.; Zhao, X. SpCQL: A Semantic Parsing Dataset for Converting Natural Language into Cypher. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, 17 October 2022; pp. 3973–3977.
  11. Mohammadjafari, A.; Maida, A.S.; Gottumukkala, R. From Natural Language to SQL: Review of LLM-Based Text-to-SQL Systems. arXiv 2024, arXiv:2410.01066.
  12. Liu, X.; Shen, S.; Li, B.; Ma, P.; Jiang, R.; Zhang, Y.; Fan, J.; Li, G.; Tang, N.; Luo, Y. A Survey of NL2SQL with Large Language Models: Where Are We, and Where Are We Going? arXiv 2024, arXiv:2408.05109.
  13. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880.
  14. Wu, Z. Large Language Model Based Semantic Parsing for Intelligent Database Query Engine. J. Comput. Commun. 2024, 12, 1–13.
  15. Feng, G.; Zhu, G.; Shi, S.; Sun, Y.; Fan, Z.; Gao, S.; Hu, J. Robust NL-to-Cypher Translation for KBQA: Harnessing Large Language Model with Chain of Prompts. In Knowledge Graph and Semantic Computing: Knowledge Graph Empowers Artificial General Intelligence; Wang, H., Han, X., Liu, M., Cheng, G., Liu, Y., Zhang, N., Eds.; Communications in Computer and Information Science; Springer Nature Singapore: Singapore, 2023; Volume 1923, pp. 317–326. ISBN 978-981-99-7223-4.
  16. Hornsteiner, M.; Kreussel, M.; Steindl, C.; Ebner, F.; Empl, P.; Schönig, S. Real-Time Text-to-Cypher Query Generation with Large Language Models for Graph Databases. Future Internet 2024, 16, 438.
  17. Zhong, Z.; Zhong, L.; Sun, Z.; Jin, Q.; Qin, Z.; Zhang, X. SyntheT2C: Generating Synthetic Data for Fine-Tuning Large Language Models on the Text2Cypher Task. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025.
  18. Tiwari, A.; Malay, S.K.R.; Yadav, V.; Hashemi, M.; Madhusudhan, S.T. Auto-Cypher: Improving LLMs on Cypher Generation via LLM-Supervised Generation-Verification Framework. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Albuquerque, NM, USA, 29 April–4 May 2025; pp. 623–640.
  19. Coffelt, J.P.; Kampmann, P.; Beetz, M. Implementation and Application of a Knowledge Service for AUV Mission Explainability. In Proceedings of the 2025 IEEE Underwater Technology (UT), Taipei, Taiwan, 2 March 2025; pp. 1–7.
  20. Ozsoy, M.G. Enhancing Text2Cypher with Schema Filtering. arXiv 2025, arXiv:2505.05118.
  21. Tran, Q.-B.-H.; Waheed, A.A.; Chung, S.-T. Robust Text-to-Cypher Using Combination of BERT, GraphSAGE, and Transformer (CoBGT) Model. Appl. Sci. 2024, 14, 7881.
  22. Francis, N.; Green, A.; Guagliardo, P.; Libkin, L.; Lindaaker, T.; Marsault, V.; Plantikow, S.; Rydberg, M.; Selmer, P.; Taylor, A. Cypher: An Evolving Query Language for Property Graphs. In Proceedings of the 2018 International Conference on Management of Data, Houston, TX, USA, 27 May 2018; pp. 1433–1445.
  23. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
  24. Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models Are Zero-Shot Learners. arXiv 2021, arXiv:2109.01652.
  25. Yenduri, G.; Ramalingam, M.; Selvi, G.C.; Supriya, Y.; Srivastava, G.; Maddikunta, P.K.R.; Raj, G.D.; Jhaveri, R.H.; Prabadevi, B.; Wang, W.; et al. GPT (Generative Pre-Trained Transformer)—A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions. IEEE Access 2024, 12, 54608–54649.
  26. Gemini Team; Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2023, arXiv:2312.11805.
  27. Enis, M.; Hopkins, M. From LLM to NMT: Advancing Low-Resource Machine Translation with Claude. arXiv 2024, arXiv:2404.13813.
  28. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Statist. 1951, 22, 79–86.
  29. Text-to-Cypher Data. Available online: https://huggingface.co/datasets/tomasonjo/text2cypher-gpt4o-clean (accessed on 23 June 2025).
  30. Gemma2-9B-Text2Cypher Model. Available online: https://huggingface.co/neo4j/text2cypher-gemma-2-9b-it-finetuned-2024v1 (accessed on 23 June 2025).
Figure 1. Overall training process.
Figure 2. Supervised fine-tuning stage.
Figure 3. Reinforcement learning stage.
Table 1. The example input prompt and output.
Convert text to Cypher query based on this schema:
The schema:
(Schema)
The text:
(Question)
Table 2. Schema example.
Node properties:
- Business
 - address: STRING
 - location: POINT
 - city: STRING
 - name: STRING
- User
 - name: STRING
 - userId: STRING
- Review
 - date: DATE
 - text: STRING
- Category
 - name: STRING
The relationships:
(:Business)-[:IN_CATEGORY]->(:Category)
(:User)-[:WROTE]->(:Review)
(:Review)-[:REVIEWS]->(:Business)
Table 3. The guideline for reinforcement stage.
Response in the following format:
<reasoning>
Some reasoning to get the right Cypher query
 <key_value>
  Value extracted from the text, which can be useful for generating Cypher query. Example: Helen, financial crises...
 </key_value>
 <relationship>
  All triple relationships are used in the Cypher query (must appear in the schema also)
  Example: Suppliers -[:SUPPLIES]-> Product, (:Person)-[:ACTED_IN]->(:Movie)
 </relationship>
</reasoning>
<answer>
The final Cypher query based on reasoning, key value, and relationship
</answer>
Table 4. The input prompt for reinforcement stage.
System:
(Guideline)
User:
Convert text to Cypher query based on this schema:
The schema:
(schema)
The text:
(question)
Assistant:
Table 5. Execution score of training methods on dataset 1.
Method (Qwen2.5-3B-Instruct) | Performance
Supervised only | 80.01%
Supervised + RL with support task | 85.04%
Table 6. Comparison of Support Tasks on Dataset 1.
Method (Qwen2.5-3B-Instruct) | Performance
Supervised + RL with key-value support task | 81.69%
Supervised + RL with relationship support task | 84.11%
Table 7. Google-BLEU score comparison of our method with well-known models.
Model | Parameter Size | Google-BLEU Score
Ours | 3B | 0.7701
GPT-4o | 1.8T | 0.8017
GPT-4o-mini | 8B | 0.7973
Gemini-1.5-flash-001 | 8B | 0.7780
Gemma2-9B | 9B | 0.64703
Llama3.1-8B-instruct | 8B | 0.5560
Table 8. Execution score comparison on dataset 2.
Method | Ours | Gemma-2-9B [30]
Parameter size | 3B | 9B
Execution score | 56.23% | 45.41%
Table 9. Sample outputs comparing the supervised fine-tuning model and the further reinforcement learning model.
Question: What are the orders that were placed on 1996-08-09?
- Supervised fine-tuning (wrong):
  MATCH (o:Order)
  WHERE o.OrderDate = datetime('1996-08-09T00:00:00Z')
  RETURN o.OrderID
- Further reinforcement learning model (correct):
  MATCH (o:Order)
  WHERE o.OrderDate = '1996-08-09'
  RETURN o.OrderID

Question: Which items were purchased by customers residing in Mexico from the beverages category?
- Supervised fine-tuning (wrong):
  MATCH (c:Customer)-[:PLACED]->(o:Order)-[:CONTAINS]->(p:Product)
  WHERE p.CategoryID = 'Beverages' AND c.Country = 'Mexico'
  RETURN DISTINCT p.ProductName
- Further reinforcement learning model (correct):
  MATCH (c:Customer)-[:PLACED]->(o:Order)-[:CONTAINS]->(p:Product)-[:BELONGS_TO]->(cat:Category)
  WHERE c.Country = 'Mexico' AND cat.CategoryName = 'Beverages'
  RETURN p.ProductName
Table 10. Comparison between dataset 1 and dataset 2 in number of nodes and schema length.
 | Dataset 1 | Dataset 2
Number of nodes in schema | 2–16 nodes | 3–28 nodes
Schema length | 100–1000 tokens | 100–2000 tokens

Share and Cite

MDPI and ACS Style

Tran, Q.-B.-H.; Waheed, A.A.; Mudasir, S.; Chung, S.-T. Refining Text2Cypher on Small Language Model with Reinforcement Learning Leveraging Semantic Information. Appl. Sci. 2025, 15, 8206. https://doi.org/10.3390/app15158206
