Article

Demo-ToT: Enhancing the Reasoning Capabilities of AI Agent via Improved Demonstrations Retrieval Strategy

National Key Laboratory of Information Systems Engineering, National University of Defense Technology, Changsha 410073, China
*
Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(11), 276; https://doi.org/10.3390/bdcc9110276
Submission received: 1 September 2025 / Revised: 16 October 2025 / Accepted: 31 October 2025 / Published: 2 November 2025
(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))

Abstract

Innovative reasoning frameworks have been proposed to enhance the reasoning capabilities of AI agents, improving their performance in various tasks. However, most existing research has focused on designing reasoning frameworks for LLMs, with limited attention to leveraging in-context learning to boost their reasoning power. This paper proposes a novel approach, Demo-ToT, which enhances the Tree-of-Thought (ToT) reasoning framework by dynamically retrieving relevant demonstrations to improve reasoning accuracy. Various demonstration retrieval strategies, including vector similarity, sparse retrieval, and string similarity, were explored to identify the most effective methods for optimizing LLM performance. Experiments conducted across multiple benchmarks and language models of varying sizes demonstrated that Demo-ToT substantially enhanced the reasoning ability of smaller LLMs, achieving performance comparable to or even surpassing that of much larger models such as GPT-4.

1. Introduction

With the development of large language models (LLMs), the field of artificial intelligence (AI) has experienced a rapid transformation. To apply AI technologies in specific domains, AI agents are considered one of the key approaches to realizing artificial general intelligence (AGI) and have been widely adopted to help people solve specific and complex tasks. When it comes to enhancing the reasoning capabilities of AI agents, most scholars have focused on improving their reasoning frameworks, including Chain-of-Thought (CoT) [1], Self-consistency [2], Tree-of-Thought (ToT) [3], Graph-of-Thought (GoT) [4], and Program-of-Thought (PoT) [5]. However, there is a gap in research on using in-context learning to enhance the reasoning abilities of AI agents.
Although the ToT framework has shown potential to enhance the reasoning ability of LLMs, it is limited by its reliance on fixed prompts [6]. The static nature of these prompts restricts the model’s ability to fully explore the diverse reasoning chains it can potentially generate. From a theoretical perspective, altering the prompts can significantly enhance the capabilities of the LLMs and influence the performance of AI agents. Among all the components in the prompt, demonstrations play a key role, guiding the LLMs by providing exemplary input-output pairs and providing strong priors in predictions. However, the question of how to select appropriate demonstrations to enhance the reasoning ability of AI agents remains a significant challenge.
To address this limitation and enhance the reasoning ability of LLMs under the ToT framework, several demonstration retrieval strategies are introduced, as illustrated in Figure 1. The compared methods include vector similarity retrieval, sparse retrieval based on BM25 [7], and string similarity retrieval using the Levenshtein distance [8]. Each method constructs demonstrations differently, offering distinct levels of contextual relevance and precision. The goal is to identify the most effective strategy for improving the reasoning performance of LLMs.
Building on the original ToT experiments (Game of 24 and Crosswords) and additional benchmarks (MMLU, BBH, and HumanEval), experiments were conducted across various LLM sizes using different demonstration retrieval strategies. Model outputs were analyzed with respect to mathematical reasoning and linguistic capabilities under each strategy. The results show that retrieval-based demonstration selection improves performance in both domains by adapting prompts to input-specific contexts. An ablation study further examines the impact of factors such as the number of demonstrations and the choice of embedding model.
The primary contributions of this paper are as follows:
1.
Bridging ICL and ToT frameworks. Despite the potential of in-context learning, its integration into structured reasoning paradigms such as ToT remains limited. We propose Demo-ToT, which dynamically retrieves relevant demonstrations at each intermediate reasoning step, enabling smaller models to perform complex reasoning. By supporting context-driven refinement of the reasoning trajectory, the approach overcomes the rigidity of fixed prompt templates.
2.
Comprehensive formulation of demonstration retrieval strategies. We unified and evaluated multiple demonstration selection paradigms under the ToT framework, including: (a) ToT–CBS, which selects representative demonstrations by clustering the support set; (b) ToT–VSR, which conducts vector-similarity-based retrieval; and (c) ToT–SR and ToT–SSR, which adopt sparse (BM25) and string-similarity (Levenshtein) retrieval, respectively. This unified formulation provided a standardized comparison among existing and extended retrieval mechanisms for structured reasoning.
3.
Retrieval–re-ranking mechanism and extensive validation. Beyond retrieval-only strategies, we propose a novel retrieval–re-ranking paradigm for demonstration selection. Our ToT + VSR + DR framework first retrieved a broad pool of candidate demonstrations and then re-ranked them using a learnable scoring model that predicted each demonstration’s expected utility to the model. The re-ranking model was trained using a pairwise ranking loss aligning predicted scores with generation-based quality. This is, to our knowledge, the first adaptation of the retrieval–re-rank architecture to structured reasoning. We further validated Demo-ToT on five reasoning benchmarks (Game of 24, Crosswords, MMLU, BBH, and HumanEval) and across multiple model scales (Qwen2.5–7B, 14B, and 32B), confirming consistent accuracy gains and reduced performance gaps with larger proprietary models.

2. Related Work

2.1. Reasoning Methods for AI Agents

The reasoning abilities of AI agents have been extensively studied, leading to the development of various structured reasoning frameworks. Among these, CoT prompting [1] has been widely adopted, which decomposes problems into intermediate reasoning steps. Self-consistency [2] further improves CoT by aggregating multiple sampled reasoning paths to reach a more reliable answer.
Another prominent method is ReAct (Reasoning–Acting) [9], which enables LLMs to interleave reasoning with interactive environment actions. ReAct has been applied to tool-augmented reasoning and reinforcement learning settings, demonstrating its effectiveness in dynamic decision-making.
Based on CoT [1], ToT [3] extends the reasoning paradigm by structuring the thought process as a search tree, allowing models to explore multiple reasoning paths before making a final decision. ToT improves the robustness of the model by enabling the back-tracking and evaluation of alternative reasoning strategies. Furthermore, AutoCoT [6] clusters unlabeled questions, generates a concise reasoning chain for one sample per cluster through zero-shot prompting, and uses these self-made demonstrations to prompt the LLMs.
Other works have explored program-aided reasoning (PaR) [10], where language models use external programming tools to verify and refine their reasoning steps. These approaches highlight ongoing efforts to improve structured reasoning in AI systems. However, existing methods often overlook the role of demonstration learning in strengthening intermediate reasoning steps, which our work addresses.

2.2. Prompt Optimization Methods

Prompt optimization has been crucial to improve the performance of LLMs in reasoning tasks. Few-shot prompting [11] enables models to generalize by providing relevant examples in-context, leading to significant improvements in reasoning-heavy tasks.
Automatic prompt optimization [12,13] extends gradient-based methods by allowing LLMs to autonomously design and refine task-specific prompts. Meanwhile, prompt tuning [14,15] and other parameter-efficient strategies update only continuous prompt embeddings or lightweight adapters, providing efficient adaptation without retraining the entire model.
Self-generated CoT prompts [16] attempt to make LLMs generate their own reasoning steps before solving a task, mitigating the dependency on human-crafted CoT examples. Another important strategy is retrieval-augmented generation (RAG) [17], which retrieves relevant documents or demonstrations to enrich the reasoning process.
Despite these advances, existing prompt optimization methods rarely address how to select the most effective demonstrations dynamically during intermediate reasoning steps. Our proposed Demo-ToT framework fills this gap by adaptively retrieving and using demonstrations at each stage of ToT reasoning.

2.3. In-Context Learning

In-context learning (ICL) [11] is a fundamental property of LLMs, where they learn from examples provided within the prompt without explicit weight updates. ICL enables LLMs to perform new tasks by conditioning on demonstrations rather than requiring task-specific fine-tuning.
A key research direction within ICL is demonstration selection: choosing the most effective examples to maximize model performance. Adaptive demonstration selection [18] explores methods to retrieve task-relevant examples dynamically rather than relying on static exemplars. Active example selection for ICL [19] further refines this idea by prioritizing informative examples that lead to better generalization.
Demonstration learning is a specialized form of ICL, where models improve reasoning capabilities by leveraging previously solved examples. Meta-in-context learning (Meta-ICL) methods [20] investigate how large language models generalize across tasks through demonstration-based meta-training, extending the original paradigm toward more adaptive and scalable reasoning. Similarly, retriever-augmented ICL [21] enhances CoT reasoning by selecting the most relevant past demonstrations based on input similarity.

3. Methodology

3.1. Tree of Thought (ToT)

The Tree-of-Thought (ToT) framework generalizes the classical CoT reasoning by enabling large language models (LLMs) to perform deliberate search over structured reasoning paths. Formally, let X denote the input space and Y the solution space. A ToT process is defined as a directed search tree G = ( S , E ) , where each node s t S is a thought state at time step t, and each edge ( s t , s t + 1 ) E represents a reasoning transition induced by a new thought τ t + 1 .
  • State representation
A state s t encodes the cumulative reasoning context up to step t as the following:
s_t = [x, \tau_{1:t}], \quad \tau_{1:t} = (\tau_1, \tau_2, \ldots, \tau_t), \quad x \in X,
where each τ i denotes a coherent linguistic or symbolic reasoning unit (thought), such as an intermediate equation or partial hypothesis.
  • Thought generation
From the current state s t , the language model M θ generates a set of candidate thoughts T t + 1 through a stochastic decoding process as follows:
T_{t+1} = \{\tau_{t+1}^{(1)}, \ldots, \tau_{t+1}^{(k)}\}, \quad \tau_{t+1}^{(i)} \sim p_\theta(\tau_{t+1} \mid s_t),
where each candidate \tau_{t+1}^{(i)} leads to a successor node s_{t+1}^{(i)} = [x, \tau_{1:t}, \tau_{t+1}^{(i)}].
  • State evaluation
For each generated node, ToT employs a value function V θ ( s t ) —implemented as a “value prompt”—to estimate its heuristic quality or probability of reaching the correct solution as the following:
v_t = V_\theta(s_t) = p_\theta(v \mid s_t), \quad v \in \{\text{sure}, \text{likely}, \text{impossible}\}.
  • Search policy
A search policy π (e.g., breadth-first or depth-first) determines which nodes to expand next according to the joint evaluation:
s_{t+1} = \pi\big(G_\theta(s_t), V_\theta(s_t)\big),
where G θ denotes the generative operator of Equation (2). A complete reasoning trajectory is thus defined as s 0 : T = ( s 0 , s 1 , , s T ) , and the final solution is decoded as y = Decode ( s T ) .
  • Limitations of ToT
Although ToT enables systematic exploration over multiple reasoning paths, it still relies on static, handcrafted prompt templates to define both propose and value prompts. Formally, the thought generation in Equation (2) depends on a fixed demonstration set D fixed as follows:
p_\theta(\tau_{t+1} \mid s_t) = p_\theta(\tau_{t+1} \mid s_t, D_{\text{fixed}}),
where D fixed is predefined and independent of the evolving state s t . This static formulation prevents the model from adapting its contextual knowledge to different reasoning trajectories, which limits the flexibility of ToT in complex or diverse problem spaces.
  • Motivation for Demo-ToT
To address this limitation, we propose Demo-ToT, an enhanced variant that dynamically retrieves and re-ranks demonstrations based on the current thought state s t as the following:
p_\theta(\tau_{t+1} \mid s_t) = p_\theta\big(\tau_{t+1} \mid s_t, Z_{\text{top-}k}(s_t)\big),
where Z_{\text{top-}k}(s_t) is an adaptively selected subset of demonstrations from the support set, retrieved according to semantic similarity and refined by the re-ranking mechanism introduced in Section 3.2.5. By integrating dynamic demonstration learning into the ToT search, Demo-ToT transforms the original static tree into an adaptive reasoning process that updates its contextual knowledge at each intermediate node.
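The search procedure described above can be summarized in a short sketch. This is a minimal, illustrative implementation of ToT-style breadth-first search with a greedy beam; `propose_fn` and `value_fn` are hypothetical stand-ins for the LLM-backed propose and value prompts (in Demo-ToT, both calls would embed retrieved demonstrations in their prompts).

```python
from typing import Callable, List, Tuple

def tot_bfs(
    x: str,
    propose_fn: Callable[[Tuple[str, ...]], List[str]],  # stand-in for G_theta
    value_fn: Callable[[Tuple[str, ...]], float],        # stand-in for V_theta
    max_steps: int = 3,
    beam_width: int = 5,
) -> Tuple[str, ...]:
    """Breadth-first ToT search keeping the top-b states at each depth."""
    frontier = [(x,)]  # the root state s_0 contains only the input x
    for _ in range(max_steps):
        candidates = []
        for state in frontier:
            for thought in propose_fn(state):
                # s_{t+1} = [x, tau_1:t, tau_{t+1}]
                candidates.append(state + (thought,))
        if not candidates:
            break
        # greedy beam: keep the states with the highest heuristic value
        candidates.sort(key=value_fn, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=value_fn)
```

For example, with a toy `propose_fn` that always offers the thoughts "a" and "b", and a `value_fn` that counts occurrences of "a", the search returns the all-"a" trajectory.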

3.2. Demo-ToT Framework

While the ToT framework enables structured reasoning by exploring multiple thought paths, its reliance on manually crafted demonstrations (see Equation (5)) restricts adaptability and generalization. To overcome this limitation, we propose Demo-ToT, which dynamically retrieves, re-ranks, and integrates demonstrations according to the evolving reasoning state s t . The overall process is illustrated in Figure 1.

3.2.1. Demonstrations Retrieval Strategy

To enable dynamic adaptation within the reasoning process, Demo-ToT replaces the static demonstration list in ToT with a retrieval-based mechanism that constructs adaptive prompts in each reasoning step. Before introducing our retrieval strategies, we first show an example of the value prompt design used in the original ToT method:
value prompt = “Evaluate if given numbers can reach 24 (sure/likely/impossible)
10 14
10 + 14 = 24
sure
11 12
11 + 12 = 23; 12 − 11 = 1; 11 * 12 = 132; 11 / 12 = 0.91
impossible
4 4 10
4 + 4 + 10 = 8 + 10 = 18; 4 * 10 − 4 = 40 − 4 = 36; (10 − 4) * 4 = 6 * 4 = 24
sure
......”
In the example above, ToT uses several examples as demonstrations (10, 14/11, 12/4, 4, 10), but these manually designed demonstrations may not be able to fully unleash the potential of the large language model. Thus, we optimized the demonstration selection method through our demonstration retrieval strategies.

3.2.2. Prompt Templates

This subsection presents the prompt templates, which are modified from [3]; our experiments show that these templates outperform the originals from ToT. We designed two types of templates: value prompts for the evaluation phase and propose prompts for the generation phase of ToT reasoning. As shown in Table 1 and Table 2, these templates introduce the placeholder <demonstrations> to replace handcrafted examples with dynamically retrieved ones.
Table 1 and Table 2 present the value prompt and propose prompt templates for the Game of 24 and Crossword tasks. In the templates (second row of each table), <demonstrations> is a placeholder that replaces the hand-crafted demonstrations in ToT; the demonstrations provide guidance on the format and logic of the current query. <input> is the current input query, and <output> is the response content of the LLM. The third row of each table gives examples of the demonstrations.
Through our pilot experiments and observations from the previous literature, the prompt’s quality is closely related to the demonstrations, especially for relatively small language models. In turn, how the demonstrations in Table 1 and Table 2 are formulated plays a key role in the final performance of ToT in challenging reasoning tasks. Thus, we will focus on discussing a series of methods for constructing the demonstration list in the prompt template.
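Assembling a prompt from such a template reduces to substituting the placeholders at reasoning time. The sketch below is illustrative: the template text is a stand-in, not the exact wording of Table 1 or Table 2.

```python
# Illustrative template; the real templates appear in Table 1 and Table 2.
VALUE_TEMPLATE = (
    "Evaluate if given numbers can reach 24 (sure/likely/impossible)\n"
    "<demonstrations>\n"
    "<input>\n"
)

def build_prompt(template: str, demonstrations: list, query: str) -> str:
    """Fill the <demonstrations> and <input> placeholders of a template."""
    demo_text = "\n".join(demonstrations)
    return template.replace("<demonstrations>", demo_text).replace("<input>", query)
```

At each reasoning step, the retrieved demonstrations are joined and inserted in place of <demonstrations>, while the current state fills <input>.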

3.2.3. Fixed Demonstrations Replacement

The objective of this study is to enable language models to dynamically retrieve the most relevant demonstrations as prompts during critical reasoning stages. To achieve this goal, fixed demonstrations are transformed into adaptive ones. As shown in Table 1 and Table 2, the original prompt format is preserved, while a variable placeholder, <demonstrations>, is introduced for prompt construction. The most straightforward method to assemble the demonstration list is to assign a fixed set of examples to each prompt template. Three approaches for generating fixed demonstrations were considered as follows:
  • Manual selection (MS). In this method, a set of demonstrations was manually written. Such manually curated examples were widely used in the early studies of LLMs [11,22].
  • Random selection (RS). In this method, for each type of prompt template, a set of demonstrations was randomly selected to fill the placeholders.
  • Clustering-based selection (CBS). This method followed the approach proposed in AutoCoT [6], which was adopted as one of the comparative baselines for evaluating Demo-ToT. Specifically, the Support set was first clustered to identify several distinct centroids, and representative demonstrations from each cluster were then selected to construct the prompt and guide the LLMs.
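The clustering-based variant can be sketched as follows. This is a toy illustration of CBS in the spirit of AutoCoT: a small k-means clusters the demonstration embeddings, and the demonstration closest to each centroid is selected. The embeddings here are stand-ins; in practice they would come from a sentence encoder.

```python
import random
from typing import List, Sequence

def dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points: List[Sequence[float]], k: int, iters: int = 20, seed: int = 0):
    """A minimal k-means over embedding vectors."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids

def clustering_based_selection(demos: List[str],
                               embeddings: List[Sequence[float]],
                               k: int) -> List[str]:
    """Pick the demonstration nearest to each cluster centroid."""
    centroids = kmeans(embeddings, k)
    selected = []
    for c in centroids:
        nearest = min(range(len(demos)), key=lambda idx: dist2(embeddings[idx], c))
        selected.append(demos[nearest])
    return selected
```

With two well-separated groups of embeddings, the selection returns one representative demonstration per group.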

3.2.4. Selecting Relevant Demonstrations

As pointed out by [18], fixed demonstrations can only provide limited clues for the test queries, and different queries may benefit from different demonstrations. We employed various demonstration retrieval strategies to dynamically select the most similar demonstrations based on the input. These selected demonstrations were then used to replace the placeholder <demonstrations>, thereby constructing prompts tailored to facilitate reasoning in LLMs. Our approach ensured that the demonstrations were dynamically adjusted in each state of the reasoning process. The demonstration retrieval strategies are as follows:
  • Vector Similarity Retrieval (VSR). Vector similarity retrieval is a dense retrieval paradigm that represents text as high-dimensional dense vectors in a continuous semantic space, typically generated by pre-trained language models. Given an input question q, we encoded it into a vector v q R d . The retrieval process selected demonstrations { d i } from the support set whose encoded vectors v d i maximized the similarity score, commonly measured by cosine similarity:
    \mathrm{sim}(v_q, v_{d_i}) = \frac{v_q \cdot v_{d_i}}{\|v_q\|\,\|v_{d_i}\|}.
    In our implementation, we used the Facebook AI Similarity Search (FAISS) toolkit [23] to efficiently perform approximate nearest neighbor search for scalability.
  • Sparse Retrieval (SR). Sparse retrieval refers to traditional information retrieval methods based on lexical matching. We adopted the BM25 ranking function [7], which scored a document d i with respect to query q as:
    \mathrm{BM25}(q, d_i) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{\mathrm{TF}(t, d_i) \cdot (k_1 + 1)}{\mathrm{TF}(t, d_i) + k_1 \cdot \left(1 - b + b \cdot \frac{|d_i|}{\mathrm{avgdl}}\right)},
    where \mathrm{IDF}(t) = \log \frac{N - n(t) + 0.5}{n(t) + 0.5}, k_1 and b are tunable parameters (we use the default values k_1 = 1.2, b = 0.75), |d_i| is the document length, and avgdl is the average document length in the collection.
  • String Similarity Retrieval (SSR). This strategy measures surface-level textual similarity using the Levenshtein distance. The normalized similarity between question q and demonstration d i was computed as:
    \mathrm{sim}_{\mathrm{lev}}(q, d_i) = 1 - \frac{\mathrm{Lev}(q, d_i)}{\max(|q|, |d_i|)}.
    We used this score to rank demonstrations by their surface-form resemblance to the input.
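The three scores above can be written out as plain-Python reference implementations. These are clarity-oriented sketches only: the paper's actual pipeline uses FAISS for vector search, the BM25S package for sparse retrieval, and python-Levenshtein for string similarity.

```python
import math

def cosine_sim(vq, vd):
    """Cosine similarity between two dense vectors (VSR)."""
    dot = sum(a * b for a, b in zip(vq, vd))
    norm = math.sqrt(sum(a * a for a in vq)) * math.sqrt(sum(b * b for b in vd))
    return dot / norm

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """BM25 score of one document for a query (SR).

    `corpus` is a list of tokenized documents; note the classic IDF used here
    can go negative for terms appearing in more than half the documents.
    """
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)
        idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5))
        tf = doc_terms.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def string_sim(q: str, d: str) -> float:
    """Normalized Levenshtein similarity (SSR)."""
    return 1 - levenshtein(q, d) / max(len(q), len(d))
```

Each function ranks support-set demonstrations against the current input; only the notion of "similar" differs across the three strategies.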

3.2.5. Demonstration Re-Ranking (DR)

To ensure that the retrieved demonstrations are optimally aligned with the current reasoning context, we introduced a demonstration re-ranking (DR) module that refined the retrieval results before they were inserted into the ToT prompt. Empirically, we observed that similarity-based retrieval rankings often failed to reflect the actual usefulness of demonstrations for large language models, which motivated a learning-based re-ranking mechanism to bridge this gap.
  • Model architecture
Let the pretrained encoder be E(\cdot) and the current query be x_q. Each demonstration x_i in the candidate pool (i = 1, \ldots, n) was first encoded as a vector representation v_i = E(x_i), and the query was encoded as v_q = E(x_q). The two vectors were concatenated and fed into a ranking head R_{\mathrm{Head}}, which output a predicted quality score \hat{s}(z_i) \in [0, 1] for each demonstration z_i as follows:
\hat{s}(z_i) = R_{\mathrm{Head}}\big(\mathrm{Concat}([v_i, v_q])\big),
where Concat ( · ) denotes vector concatenation. The R Head was implemented as a lightweight multi-layer perceptron with the ReLU activation. During training, only this head and the LoRA-adapted parameters of E ( · ) were updated. The demonstrations were ranked according to their predicted scores as follows:
\hat{r}(z_i) = \mathrm{Ranking}\big(\hat{s}(z_i) \mid \{\hat{s}(z_j)\}_{j=1}^{n}\big).
  • Empirical observation
Given a query–answer pair ( x q , y q ) , a retrieved demonstration z i guides the language model M ( · ) to generate a prediction y ^ i = M ( z i , x q ) . We used the vector similarity between y ^ i and the ground-truth y q as a proxy for its true quality as the following:
s(z_i) = \mathrm{VS}(\hat{y}_i, y_q),
and obtain the corresponding generation-based ranking as follows:
r(z_i) = \mathrm{Ranking}\big(s(z_i) \mid \{s(z_j)\}_{j=1}^{n}\big).
We empirically found that the correlation between the retrieval ranking r ret ( z i ) and the generation-based ranking r ( z i ) was low, as measured by the Spearman coefficient as the following:
\mathrm{corr}_q = \mathrm{Spearman}\big(\{r(z_i)\}_{i=1}^{n}, \{r_{\mathrm{ret}}(z_i)\}_{i=1}^{n}\big),
where the average corr q = 0.092 on a validation set indicated that simple embedding-based retrieval did not align well with the LLM utility.
  • Training objective
To make the predicted ranking r ^ ( z i ) consistent with the true ranking r ( z i ) , we optimized a pairwise margin ranking loss as follows:
\mathcal{L}_r = \sum_{\substack{1 \le i, j \le n \\ i \ne j}} \max\Big(0,\; m(i, j) - \big(\hat{s}(z_j) - \hat{s}(z_i)\big)\Big),
where the pairwise margin was defined as the following:
m(i, j) = \max\Big(0,\; \frac{r_{\mathrm{gen}}(z_i)}{3} - \frac{r_{\mathrm{gen}}(z_j)}{3}\Big),
where r gen denotes the generation-based ranking derived from Equation (10).
  • Integration into Demo-ToT
During inference, the top-k demonstrations with the highest predicted \hat{s}(z_i) were selected and inserted into the prompt template as <demonstrations>. Formally, the updated reasoning state at each ToT step became the following:
s_{t+1} = \pi\big(G_\theta(s_t, Z_{\text{top-}k}),\; V_\theta(s_t)\big),
where Z_{\text{top-}k} is the re-ranked demonstration subset. This coupling ensured that the Demo-ToT search dynamically adapted its contextual knowledge at every node of the reasoning tree, thereby transforming ToT into a fully adaptive reasoning process.
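The re-ranking step can be sketched in plain Python. The scoring function below is a stand-in for the learned R_Head, and the margin follows our reading of m(i, j) as generation-based ranks divided by three; all names are illustrative, not the paper's implementation.

```python
def pairwise_margin_loss(pred_scores, gen_ranks):
    """L_r = sum over i != j of max(0, m(i, j) - (s_hat(z_j) - s_hat(z_i))).

    `gen_ranks` are generation-based ranks (1 = best); `pred_scores` are the
    head's predicted quality scores. The margin m(i, j) grows when z_j has a
    better (lower) generation rank than z_i.
    """
    n = len(pred_scores)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            margin = max(0.0, gen_ranks[i] / 3 - gen_ranks[j] / 3)  # m(i, j)
            loss += max(0.0, margin - (pred_scores[j] - pred_scores[i]))
    return loss

def rerank_top_k(candidates, score_fn, k):
    """Keep the k candidates with the highest predicted score."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]
```

A scorer that agrees with the generation-based ranking incurs a lower loss than one that inverts it, which is exactly the signal used to train the head; at inference, `rerank_top_k` selects the demonstrations inserted into the prompt.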

3.3. Experimental Settings

3.3.1. Task

The original ToT tasks (Game of 24 and Crosswords) comprehensively evaluate reasoning abilities in mathematics, semantics, common sense, and other aspects. We therefore conducted our demonstration retrieval strategy experiments on these tasks.
The Game of 24 is a popular mathematical puzzle that challenges players to combine given numbers to obtain 24 using basic arithmetic operations (addition, subtraction, multiplication, and division). Each number can only be used once, and players must use all numbers in their calculations. The operations can be performed in any order, and parentheses can be used to indicate the order of operations. For example, given the numbers 2, 3, 4, and 5, one possible solution is 4 × (5 + 3 − 2) = 24. In the ToT framework, the problem is decomposed into sequential "thought steps," where each step generates candidate equations (e.g., proposing operations like “3 + 4 = 7” to reduce the problem to solving “7, 2, 1” for 24). The ToT method employs a breadth-first search (BFS) algorithm to systematically explore these intermediate states, using a language model to generate potential solutions and evaluate their likelihood of success. For instance, the LM might propose multiple equation candidates (e.g., “5 * 6 = 30” or “10 − 4 = 6”) and then assess their feasibility based on proximity to the target value or logical coherence. With GPT-4, ToT significantly outperforms CoT on this task, achieving a 74% success rate compared with CoT’s 4%. The task highlights ToT’s ability to enhance LM performance in mathematical problems.
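For reference, solvability of a Game of 24 instance can be checked exhaustively: repeatedly pick an ordered pair of remaining numbers, combine them with one of the four operations, and recurse; this covers every parenthesization. This brute-force checker is a sketch for verifying task instances, not part of the Demo-ToT pipeline.

```python
def solve24(nums, target=24.0, eps=1e-6):
    """Return True if the numbers can be combined with + - * / to reach target."""
    def search(xs):
        if len(xs) == 1:
            return abs(xs[0] - target) < eps
        for i in range(len(xs)):
            for j in range(len(xs)):
                if i == j:
                    continue
                rest = [xs[k] for k in range(len(xs)) if k not in (i, j)]
                a, b = xs[i], xs[j]
                # every ordered pair is tried, so a-b / b-a and a/b / b/a are covered
                results = [a + b, a - b, a * b]
                if abs(b) > eps:
                    results.append(a / b)
                if any(search(rest + [r]) for r in results):
                    return True
        return False
    return search([float(n) for n in nums])
```

For example, [2, 3, 4, 5] is solvable (4 × (5 + 3 − 2) = 24), while [1, 1, 1, 1] is not.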
Crosswords is a classic word puzzle game in which players fill in a grid with words based on given clues. The grid consists of intersecting horizontal and vertical lines that form blank squares where players write letters to form words. Each row and column has a corresponding clue that hints at the word to be filled in. The challenge lies in ensuring that the words intersect correctly, both horizontally and vertically, so that all clues are satisfied simultaneously. To achieve this, players often need to draw on a variety of knowledge domains and employ strategic reasoning to navigate the inter-dependencies between clues. For instance, clues like “Numerical prefix” (e.g., OCTA) or “Contest in ancient Greece” (AGON) necessitate not just lexical knowledge but also contextual and cultural understanding. By structuring the problem-solving process as a tree of hypotheses, the model iteratively refines solutions, checks for coherence across intersecting entries, and prioritizes the most plausible paths—mirroring the layered reasoning seen in real-world crossword puzzles. This approach addresses the limitations of linear or single-path reasoning, demonstrating the value of structured exploration in handling tasks with high interdependency and ambiguity.

3.3.2. Dataset

In most experiments involving the Game of 24, researchers typically use integers from one to nine to test the numerical reasoning and problem-solving abilities of language models. We used combinations of four numbers within this range. We collected data from https://www.4nums.com (accessed on 1 September 2025, the same as ToT), which contained 1362 combinations of four numbers. These combinations were ordered by the time it takes humans to calculate 24 using them, with higher indices corresponding to longer solution times. We split these combinations into support and test sets at a 1000:362 ratio. The support set was used to collect demonstration samples, and the remaining 362 samples formed the test set.
The support set was constructed through the following three-step procedure designed to ensure demonstration quality and semantic richness:
1.
First, we divided the task dataset into support and test splits according to the ratio described above.
2.
Second, demonstrations were then collected using the GPT-4o model. Each sample was processed under the original ToT framework, and for successfully solved cases, the intermediate input–output pairs from the proposal and value steps were extracted and included in the demonstration set.
3.
Third, the resulting demonstration set was then embedded using the BGE-base-en model, where each demonstration input was transformed into a dense vector representation. All vectors were stored and indexed as a vector database using the FAISS toolkit to support efficient similarity retrieval during reasoning.
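The data flow of the three steps above can be made concrete with a schematic sketch. The toy embedder stands in for BGE-base-en, and the brute-force index stands in for the FAISS vector database; only the pipeline shape mirrors the real construction.

```python
def split_support_test(samples, n_support):
    """Step 1: split the dataset into support and test portions."""
    return samples[:n_support], samples[n_support:]

def toy_embed(text: str, dim: int = 8):
    """Stand-in encoder: a deterministic bag-of-characters vector."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec

class BruteForceIndex:
    """Step 3 stand-in: exact nearest-neighbor search over stored vectors."""
    def __init__(self):
        self.items = []

    def add(self, vec, payload):
        self.items.append((vec, payload))

    def search(self, query_vec, k=1):
        def dist2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        ranked = sorted(self.items, key=lambda it: dist2(it[0], query_vec))
        return [payload for _, payload in ranked[:k]]
```

Step 2 (collecting solved intermediate input–output pairs from GPT-4o under ToT) is the only part that requires model calls; its outputs are what get embedded and added to the index.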
In our Crosswords task, we used a 5 × 5 crosswords grid. The language model was tasked with inferring the next word to fill in the grid. It completed this by considering the clues for each row and column, as well as the words already placed in the grid. The model must use the context provided by the existing words and the specific hints from the clues to logically deduce the correct word that fits both the current grid setup and the given clue. The original dataset was split into support and test sets at a 104:52 ratio. The support set was used to collect demonstration samples, while the remaining 52 instances formed the test set.
To further demonstrate that our work is widely applicable, we further conducted an experiment with the following datasets:
  • MMLU. The MMLU benchmark [24] evaluates large language models’ knowledge retention and reasoning across 57 academic subjects under strict zero-shot and few-shot settings, providing a comprehensive cross-domain assessment without task-specific fine-tuning.
  • BBH. Curated from the BIG-Bench repository [25], the BIG-Bench Hard subset comprises 23 capabilities-testing tasks where previous language models underperformed human baselines. These challenges emphasize complex reasoning patterns including causal inference, counterfactual analysis, and multi-hop deduction that current architectures find particularly demanding.
  • HumanEval. The HumanEval benchmark [26] evaluates the code generation and reasoning abilities of large language models through 164 curated programming tasks with predefined function signatures and verification tests, reflecting realistic software engineering requirements.
For MMLU and BBH, we used 10% of the samples as the test set. From the remaining samples, we randomly selected 1500 for collecting demonstrations. For HumanEval, half of the samples were treated as the demonstration set, and the rest were used for testing.
We systematically evaluated the performance of the Qwen2.5 language models [27] across a range of parameter scales to elucidate their capabilities in the Game of 24 and Crosswords. We employed relatively small language models, specifically those with 7B, 14B, and 32B parameters, which can be efficiently deployed on personal computers. We expected these smaller models to achieve performance comparable to state-of-the-art models, such as GPT-4, while maintaining the flexibility and accessibility of local deployment. For decoding responses, we used the nucleus sampling strategy [28], with the temperature set to 0.7 and top_p to 0.9.
For both the Game of 24 and Crosswords tasks, we calculated the task-completion accuracy. In Game of 24, a reasoning chain was considered successful when the LLM produced a correct equation that used each of the given numbers exactly once, combined with addition, subtraction, multiplication, or division, to obtain 24. In the Crosswords task, each game corresponds to ten words. We considered the reasoning successful as soon as the LLM generated the target word, regardless of whether the entire game was completed.
For each task, we employed different demonstration retrieval strategies and compared them with the standard IO prompting and the fixed demonstration strategy approach. By default, we considered the number of demonstrations to be eight. And for the VSR, SR, and SSR strategies, the demonstrations were organized in ascending order of relevance scores in the prompt. We used the BAAI/bge-base-en-v1.5 [29] for VSR and the BM25S package https://huggingface.co/blog/xhluca/bm25s (accessed on 1 September 2025) to implement the sparse search. For SSR, we used the Python-levenshtein package https://pypi.org/project/python-Levenshtein/ (accessed on 1 September 2025).
When applying the trained DR module, we first retrieved 32 demonstrations and then selected the eight scored highest by the DR module. During training of the DR module, we used the demonstration set as the training data. The LoRA rank for the encoder was 16, and the learning rate was 1 × 10⁻⁴ with 100 warm-up steps and linear decay. The DR module was fine-tuned for 2000 steps with a batch size of four.
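The retrieve-then-re-rank procedure can be sketched as below. The scoring functions are hypothetical stand-ins: `first_stage_score` represents the cheap retrieval score (e.g., vector similarity) and `dr_score` represents the trained DR re-ranker, which in the actual system is a LoRA fine-tuned encoder.

```python
def retrieve_then_rerank(query, pool, first_stage_score, dr_score,
                         n_candidates=32, k=8):
    """Stage 1: over-retrieve n_candidates demonstrations with a cheap
    first-stage score. Stage 2: re-rank them with the DR scorer and keep
    the top k, returned in ascending order of DR score (best last)."""
    candidates = sorted(pool,
                        key=lambda d: first_stage_score(query, d),
                        reverse=True)[:n_candidates]
    reranked = sorted(candidates,
                      key=lambda d: dr_score(query, d),
                      reverse=True)[:k]
    return reranked[::-1]
```

Over-retrieving in the first stage gives the re-ranker a wider candidate set than the final prompt budget, which is what allows it to discard superficially similar but unhelpful demonstrations.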

4. Results

4.1. Experimental Results for the Game of 24 Task

We selected the top eight most relevant demonstrations from the support set to construct each prompt. We then recorded the number of successfully solved instances under each strategy by testing the language model on the first 100 combinations from the test dataset. The experiment was configured with the propose method for thought generation, the value method for state evaluation, and a greedy approach to select the top five states. Each thought was generated once, evaluated five times, and the best five states were retained for subsequent steps.
We initially conducted experiments using a relatively small language model (Qwen-7B-Instruct) to verify the performance of IO, CoT, ToT (original), and several demonstration retrieval strategies we designed on the original ToT tasks (Game of 24 and Crosswords). The results are presented in Figure 2.
The experimental results demonstrate that our methods significantly enhanced the performance of relatively small language models on the Game of 24 task, especially the methods that select relevant demonstrations. Subsequently, we scaled up to more capable language models (still with far fewer parameters than GPT-4) to test the effectiveness of our methods. Figure 3 shows the performance of the larger models (Qwen2.5-14B-Instruct and Qwen2.5-32B-Instruct) on ToT’s original tasks.
As shown in Figure 2 and Figure 3, the proposed methods substantially outperformed the baseline approaches, including IO, CoT, and ToT, based on their original implementations. The ToT method failed to generalize when applied to open-source LLMs, which were less capable than commercial models. In the Game of 24 task, ToT achieved a 74% accuracy rate using GPT-4. Under the same setting, the proposed framework enabled Qwen2.5-32B-Instruct to attain comparable accuracy, demonstrating its strong applicability to relatively smaller LLMs. Moreover, adaptive selection of demonstrations yielded a notable performance improvement; for instance, ToT–VSR outperformed ToT–MS. Among the adaptive retrieval methods, the vector similarity–based approach achieved the best overall results.

4.2. Experimental Results for Crosswords

The Crosswords task presents the model with a partially filled crossword grid and a set of clues; the LLM must solve the puzzle by filling in the missing words based on the provided clues. The results are also presented in Figure 2, which shows that our methods achieved higher performance than the original ToT framework.
We evaluated the efficiency of our method by following the same sequence of steps as in the previous experiment. Initially, we conducted experiments using the Qwen2.5-7B-Instruct model. Our demonstration retrieval strategies also improved the model performance, as shown in Figure 2. Similarly, the VSR retrieval method achieved the best results. Specifically, the SSR method had a slight advantage over the SR method in the Crosswords task.
We also evaluated the performance of larger LLMs (14B and 32B) on the Crosswords task, as shown in Figure 3. The VSR-based retrieval method again achieved the best results. However, a gap remains compared to GPT-4 with ToT. Our analysis suggests that Qwen2.5-32B-Instruct still has limitations in analyzing the multiple meanings of words; in such an intricate task, the word retrieval process is highly challenging.

4.3. Experimental Results for the Other Benchmarks

Experiments were also conducted on the MMLU, BBH, and HumanEval datasets, with the performance comparison summarized in Table 3. The LLM backbone already performed strongly on these tasks due to prior knowledge obtained during pretraining. Consequently, the performance gains from the proposed method were less pronounced than those observed in Figure 2. Nevertheless, the proposed approach consistently outperformed all compared baselines.
On the MMLU task, we conducted a concise error analysis. For ToT (original), the errors fall into two categories. First, the proposal step fails by not providing correct solution paths (referred to as proposal failure); this category constitutes 28.7% of the errors. Second, the evaluation step fails by assigning high scores to wrong solution paths and low scores to correct ones (referred to as value failure); this category constitutes 7.1% of the errors on average. For the ToT–VSR method, the proposal and value failure rates are 27.9% and 5.7%, respectively; for the ToT–VSR–DR method, they are 27.4% and 5.4%. Better demonstrations thus benefit the LLMs by improving the quality of both the proposal and value steps, and the proposed ToT–VSR–DR performed best by providing the most helpful demonstrations.
On the BBH task, for the ToT (original) method, the percentages of the proposal and value failures are 39.1% and 13.5%, respectively. For the ToT–VSR method, the percentages of the proposal and value failures are 37.2% and 12.5%, respectively. For the ToT–VSR–DR method, the percentages of the proposal and value failures are 36.8% and 11.4%, respectively.
To complement the accuracy-based evaluation, we further analyzed the reasoning depth and robustness of our proposed Demo-ToT framework.
  • Reasoning depth. For ToT-based methods, the maximum reasoning depth is treated as a hyperparameter. Following the settings in the original ToT paper, we set the maximum reasoning depth to three for the Game of 24 task, four for the Crossword task, and three for MMLU, BBH, and HumanEval. These depths provided a balance between reasoning completeness and computational cost.
  • Robustness under noisy demonstrations. We conducted a dedicated experiment on the Crossword task by injecting noise into half of the demonstration set, randomly altering words in the answers. Under such noise, the accuracy of the ToT–VSR method dropped sharply from 17.2 to 10.1, as it relies directly on vector-similarity retrieval. In contrast, ToT–VSR–DR, which includes a re-ranking step, effectively filtered out noisy demonstrations, showing only a slight drop from 18.7 to 16.4. These results demonstrate that the proposed re-ranking mechanism enhances robustness against corrupted or low-quality demonstration sets.

4.4. Ablation Experiment

In addition to evaluating the main strategies, we performed systematic ablation studies to investigate other factors that may influence model performance. These included the following: (a) Number of Demonstrations. We varied the number of demonstrations provided to the model to assess whether a larger set of demonstrations improved performance. (b) Embedding Models. We tested different embedding models in the FAISS retrieval strategy to determine how the choice of embedding model affects the results.

4.4.1. The Impact of Varying Demonstration Quantities

In the previous experiments, we selected the eight most relevant demonstrations as examples (TOP-K = 8). To investigate the impact of K, we varied its value and used Qwen2.5-14B-Instruct-Int4 as the language model. We compared the accuracy of the language model on the Game of 24, as shown in Table 4, which also presents the performance of ToT–CBS for comparison. From the results, the following takeaways can be made:
  • The ToT–VSR method outperformed the ToT–CBS method under different settings of demonstration quantities.
  • Further increasing the number of demonstrations does not result in clear improvements.

4.4.2. The Impact of Different Embedding Models

In previous experiments, we selected the BGE-base-en model as the embedding model for demonstration selection. To investigate the impact of the embedding model, we now consider alternative embedding models:
  • The BERT model [30]—representing the Transformer-based encoder family widely used for text embeddings.
  • Sentence-Transformer https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (accessed on 1 September 2025)—a modern sentence-level embedding framework extended for large-scale cross-domain tasks.
  • The SimCSE model [31]—a contrastive sentence embedding approach with recent improvements in stability and domain generalization.
  • BGE-large-en, which is from the same model series [29] as BGE-base-en.
This ablation study also used the Qwen2.5-14B model as the LLM backbone. As Table 5 shows, BGE-large-en slightly outperformed the others, achieving the best performance. The results indicate that a high-quality embedding model can enhance the performance of our framework, while our method still performs well with relatively weaker embedding models.

4.4.3. The Impact of Reducing the Size of the Support Set

In the previous experiments, we used a set of 1000 samples to construct the support set for the Game of 24 task. Intuitively, the smaller the support set, the less information can be extracted, so the LLM receives less help and performs worse. To test this hypothesis, we reduced the size of the support set to 50%, 10%, 5%, and 1% of the original. This ablation study also used the Qwen2.5-14B model as the LLM backbone, and the experimental results are presented in Table 6.

4.4.4. Implementation Efficiency and Resource Consumption

Our method introduced additional retrieval and re-ranking steps beyond standard ToT-based reasoning. To quantify their implementation-level impact, we conducted a detailed micro-benchmark on a vGPU-32GB and an NVIDIA RTX 4090 GPU using PyTorch 2.3.0 and CUDA 12.1 as follows:
  • Transforming an input into its embedding vector costs 31.2 ms on average, using the BGE-base embedding model. Fetching the top-k demonstrations through the FAISS-based vector index adds only 6.7 ms per query.
  • The re-ranking stage, which refines the retrieved demonstrations, takes 91.4 ms on average for each query.
  • For the LLM inference stage, we measured throughput in terms of generated tokens per second (tps). When running the Qwen2.5-7B-Instruct model with the vLLM toolkit on a vGPU-32GB device, the inference speed reached 283.6 tps.
  • The warm-start process of the entire system cost approximately 57 s, including LLM deployment (30.4 s) and FAISS index compilation (11.6 s).
  • To achieve the above performance, the LLM occupied roughly 90% of GPU memory, while the remaining 10% was used by the embedding and re-ranking models. This configuration ensured full utilization of a single vGPU-32GB machine without exceeding its memory limit.
Across all five evaluated tasks, the average end-to-end processing time per sample is 63.5 s for ToT (original), 64.1 s for ToT–VSR, and 65.6 s for ToT–VSR–DR. The enhanced retrieval and re-ranking therefore introduce only marginal additional latency (less than 3.3% overhead relative to the baseline) while delivering substantial improvements in reasoning accuracy on tasks such as Game of 24, Crossword, BBH, and HumanEval. The added time cost is thus acceptable for real-world, reasoning-intensive applications where improved logical accuracy and stability are prioritized.

4.4.5. Comparison with GPT-4 (Turbo)

To provide a direct comparison with frontier proprietary models, we conducted additional experiments using the GPT-4 Turbo model through OpenAI’s official API. All experiments were performed on the same datasets and under the same settings as our previous evaluations, covering the Crossword, Game of 24, and MMLU tasks. For consistency, we employed a greedy decoding strategy to ensure deterministic outputs across all runs.
We evaluated four configurations: CoT, ToT, ToT–VSR, and the full ToT–VSR–DR variant. All methods shared the same prompt design and reasoning workflow described in Section 3.2.
From Table 7, two observations can be drawn. First, even with a highly capable proprietary model like GPT-4 Turbo, our proposed retrieval-enhanced reasoning framework (ToT–VSR–DR) still leads to consistent improvements across all benchmarks. This confirms that dynamic demonstration retrieval and re-ranking can enhance reasoning robustness and precision regardless of model scale. Second, GPT-4 Turbo equipped with standard CoT performs considerably worse than the relatively lightweight Qwen2.5-7B-Instruct model using our full Demo-ToT framework, which substantiates our claim that the proposed approach effectively narrows the reasoning performance gap between smaller open-source LLMs and frontier proprietary models.

4.4.6. Effect of Demonstration Order

We further examined whether the order of demonstrations in the prompt affects reasoning performance, since prompt sequencing may influence the model’s contextual understanding. For the ToT–VSR–DR method, we compared three order strategies on three representative tasks: (1) ascending order by relevance score (our default setting); (2) descending order; and (3) random order. The results are summarized in Table 8.
The results show that our default ascending-order setting achieves the best or comparable performance across all tasks, confirming that the model benefits from gradually increasing relevance among demonstrations while remaining robust to order variations.
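The three ordering strategies compared above can be sketched as follows. This is a minimal sketch; `order_demonstrations` is a hypothetical helper, not the authors' code, operating on (demonstration, relevance score) pairs.

```python
import random

def order_demonstrations(demos_with_scores, strategy="ascending"):
    """Arrange (demo, relevance_score) pairs for the prompt.
    'ascending' places the most relevant demonstration last,
    closest to the query (the paper's default setting)."""
    if strategy == "ascending":
        ranked = sorted(demos_with_scores, key=lambda p: p[1])
    elif strategy == "descending":
        ranked = sorted(demos_with_scores, key=lambda p: p[1], reverse=True)
    elif strategy == "random":
        ranked = list(demos_with_scores)
        random.shuffle(ranked)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [demo for demo, _ in ranked]
```

Under the ascending strategy, the model reads demonstrations of gradually increasing relevance before reaching the query itself.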

4.5. The Impact of Different LLMs on Task Accuracy

To verify that our choice of LLM did not unduly influence the experimental results, we conducted comparative experiments with other LLMs, additionally evaluating Llama3-8B on the Game of 24 and Crosswords tasks. The results are reported in Table 9.

4.6. Case Study

We used cases from the Game of 24 task to illustrate the advantage of our method. Case for the propose prompt: for the input “2 5 6 11”, the ToT method’s demonstrations are shown in Listing 1.
Listing 1. ToT demonstrations for the input “2 5 6 11” in the Game of 24 task.
Bdcc 09 00276 i001aBdcc 09 00276 i001b
The LLM’s response is shown in Listing 2.
Listing 2. The LLM’s response to the input “2 5 6 11” (ToT).
Bdcc 09 00276 i002
In comparison, our method’s demonstrations were selected adaptively based on the input “2 5 6 11”. The demonstrations are shown in Listing 3.
Listing 3. Adaptively selected demonstrations based on the input “2 5 6 11”.
Bdcc 09 00276 i003aBdcc 09 00276 i003bBdcc 09 00276 i003cBdcc 09 00276 i003dBdcc 09 00276 i003e
The LLM’s response is shown in Listing 4.
Listing 4. The LLM’s response to the input “2 5 6 11” (Demo-ToT).
Bdcc 09 00276 i004aBdcc 09 00276 i004b
Here, we describe the case of the value prompt. The LLM conducted the evaluation step through the value prompt, filtering out low-quality proposals and keeping the high-potential proposals. Our method’s value prompt with the demonstrations for the proposal 6 − 2 = 4 (left: 4 5 11) is shown in Listing 5.
Listing 5. Value prompt with demonstrations for evaluating the proposal “6 − 2 = 4 (left: 4 5 11)” (Demo-ToT).
Bdcc 09 00276 i005aBdcc 09 00276 i005b
Listing 6 shows the LLM’s response to the value prompt.
Listing 6. The LLM’s response to the value prompt (Demo-ToT).
Bdcc 09 00276 i006
The entire value prompt of ToT is shown in Listing 7.
Listing 7. The value prompt of ToT.
Bdcc 09 00276 i007aBdcc 09 00276 i007b
The LLM’s response to the ToT’s original value prompt is shown in Listing 8.
Listing 8. The LLM’s response to the value prompt (ToT).
Bdcc 09 00276 i008
Thus, the value result is less accurate than that under our method.
To further interpret why adaptive retrieval leads to improved reasoning performance, we conducted a detailed quantitative analysis focusing on the key influencing factors and typical error types.
  • Why adaptive retrieval succeeds. The primary reason lies in its ability to provide demonstrations that are semantically aligned with the current reasoning context rather than merely lexically or structurally similar. Lexical overlap is not the deciding factor—some retrieved examples exhibit low word-level similarity but still enhance reasoning accuracy. Structural similarity is also not dominant, since all demonstrations share similar reasoning formats. In contrast, semantic embedding closeness plays a key role: ToT–VSR consistently outperformed ToT (original), ToT–SR, and ToT–SSR, confirming that semantic retrieval captures more meaningful task-level relations. The re-ranking process in ToT–VSR–DR further refined candidate demonstrations by predicting their contribution to reasoning performance, rather than relying solely on similarity scores. This benefit-oriented selection mechanism explains the stability and robustness gains achieved by Demo-ToT.
  • Systematic error analysis. To enhance interpretability, we categorized the reasoning errors into two types: (a) proposal failures, where the reasoning process produces incorrect intermediate paths; and (b) value failures, where the evaluation step misjudges the correctness of reasoning branches. For the MMLU task, the proportions of proposal/value failures are 28.7%/7.1% for ToT (original), 27.9%/5.7% for ToT–VSR, and 27.4%/5.4% for ToT–VSR–DR. For the BBH task, the corresponding ratios are 39.1%/13.5%, 37.2%/12.5%, and 36.8%/11.4%. These results demonstrate that adaptive retrieval—particularly with re-ranking—effectively reduces both proposal and value errors, leading to more stable and interpretable reasoning outcomes.
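The "semantic embedding closeness" that drives VSR reduces, in practice, to cosine similarity between embedding vectors. The sketch below illustrates the ranking step with placeholder vectors; in the actual system, the vectors would come from the BGE embedding model, and `vsr_rank` is a hypothetical helper, not the authors' code.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def vsr_rank(query_vec, demo_vecs, k=8):
    """Indices of the k demonstrations whose embeddings are most
    similar to the query embedding, best first."""
    order = sorted(range(len(demo_vecs)),
                   key=lambda i: cosine_similarity(query_vec, demo_vecs[i]),
                   reverse=True)
    return order[:k]
```

Because cosine similarity compares directions in embedding space rather than surface strings, two demonstrations can rank as close neighbors despite low word-level overlap, which is the behavior the analysis above attributes to VSR's advantage.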

5. Conclusions

In this work, we addressed the limitations of static prompts in the ToT framework by introducing dynamic demonstration retrieval strategies to enhance the reasoning capabilities of language models. Our experiments on Game of 24 and Crosswords tasks demonstrated that adaptive retrieval of demonstrations—particularly through vector similarity (VSR)—significantly outperformed fixed or randomly selected demonstrations. The proposed strategies enabled even mid-sized models like Qwen2.5-14B to achieve performance comparable to much larger models, reducing reliance on scale while maintaining reasoning accuracy. The ablation studies further revealed that the number of demonstrations and the choice of embedding models affected performance, though the retrieval strategy itself remained the dominant factor. These findings highlight the critical role of context-aware demonstration selection in optimizing LLM reasoning, offering a pathway to improve efficiency and accessibility in complex problem-solving scenarios without requiring architectural expansion. Future work could extend this approach to broader tasks and investigate hybrid retrieval mechanisms for further gains.

Author Contributions

Conceptualization, J.L.; methodology, J.L.; software, J.L.; validation, J.L., B.R. and M.Z.; formal analysis, J.L.; investigation, J.L., B.R. and M.Z.; resources, M.Z. and H.C.; data curation, J.L.; writing—original draft preparation, J.L. and B.R.; writing—review and editing, J.L. and B.R.; visualization, J.L.; supervision, M.Z.; project administration, H.C.; funding acquisition, M.Z. and H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the National Natural Science Foundation of China (NSFC) [Grant No. 72401286].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and dataset will be finalized and made publicly available online upon acceptance of the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar] [CrossRef]
  2. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.H.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  3. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 11809–11822. [Google Scholar] [CrossRef]
  4. Besta, M.; Blach, N.; Kubicek, A.; Gerstenberger, R.; Podstawski, M.; Gianinazzi, L.; Gajda, J.; Lehmann, T.; Niewiadomski, H.; Nyczyk, P.; et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 17682–17690. [Google Scholar]
  5. Chen, W.; Ma, X.; Wang, X.; Cohen, W.W. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Trans. Mach. Learn. Res. 2023, 53, 12588. [Google Scholar] [CrossRef]
  6. Zhang, Z.; Zhang, A.; Li, M.; Smola, A. Automatic Chain of Thought Prompting in Large Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  7. Zeng, H.; Killingback, J.; Zamani, H. Scaling sparse and dense retrieval in decoder-only llms. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, Padua, Italy, 13–18 July 2025; pp. 2679–2684. [Google Scholar]
  8. Poljak, J.; Crčić, D.; Horvat, T. Comparative Analysis of Text Similarity Algorithms and Their Practical Applications in Computer Science. Elektrotehn. Vestn. 2025, 92, 151–156. Available online: https://ev.fe.uni-lj.si/3-2025/Poljak.pdf (accessed on 1 September 2025).
  9. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. React: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  10. Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. Pal: Program-aided language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 10764–10799. [Google Scholar]
  11. Basiouni, A.M.; El Rashid, M.; Shaalan, K. In-Context Learning in Large Language Models (LLMs): Mechanisms, Capabilities, and Implications for Advanced Knowledge Representation and Reasoning. IEEE Access 2025, 13, 95574–95593. [Google Scholar] [CrossRef]
  12. Pryzant, R.; Iter, D.; Li, J.; Lee, Y.T.; Zhu, C.; Zeng, M. Automatic Prompt Optimization with “Gradient Descent” and Beam Search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; pp. 7957–7968. [Google Scholar]
  13. Zhou, Y.; Zheng, B.; Chen, Q. Large Language Models as Automatic Prompt Engineers. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Toronto, Canada, 2024. [Google Scholar]
  14. Chen, Z.; Zhang, H.; Liu, X. Soft Prompt Transfer for Parameter-Efficient Fine-Tuning of Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 16 December 2024. [Google Scholar]
  15. Zhou, X.; Liang, D.; Xu, W.; Zhu, X.; Xu, Y.; Zou, Z.; Bai, X. Dynamic adapter meets prompt tuning: Parameter-efficient transfer learning for point cloud analysis. arXiv 2024, arXiv:2403.01439. [Google Scholar] [CrossRef]
  16. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; hsin Chi, E.H.; Xia, F.; Le, Q.; Zhou, D. Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar] [CrossRef]
  17. Xu, R.; Liu, H.; Nag, S.; Dai, Z.; Xie, Y.; Tang, X.; Luo, C.; Li, Y.; Ho, J.C.; Yang, C.; et al. SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2025), Albuquerque, NM, USA, 29 April–4 May 2025; Volume 1: Long Papers, pp. 11534–11550. [Google Scholar]
  18. Li, X.; Lv, K.; Yan, H.; Lin, T.; Zhu, W.; Ni, Y.; Xie, G.T.; Wang, X.; Qiu, X. Unified Demonstration Retriever for In-Context Learning. arXiv 2023, arXiv:2305.04320. [Google Scholar] [CrossRef]
  19. Zhang, Y.; Feng, S.; Tan, C. Active Example Selection for In-Context Learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 9134–9148. [Google Scholar]
  20. Li, G.; Wang, P.; Liu, J.; Guo, Y.; Ji, K.; Shang, Z.; Xu, Z. Meta In-Context Learning Makes Large Language Models Better Zero and Few-Shot Relation Extractors. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), Jeju, Republic of Korea, 3–9 August 2024; pp. 6350–6358. [Google Scholar]
  21. Li, X.; Nie, E.; Liang, S. From classification to generation: Insights into crosslingual retrieval augmented icl. arXiv 2023, arXiv:2311.06595. [Google Scholar] [CrossRef]
  22. Touvron, H.; Martin, L.; Stone, K.R.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  23. Mageirakos, V.; Wu, B.; Alonso, G. Cracking Vector Search Indexes. Proc. VLDB Endow. 2025, 18, 3951–3964. [Google Scholar] [CrossRef]
  24. Zhao, Q.; Huang, Y.; Lv, T.; Cui, L.; Wei, F.; Sun, Q.; Xin, Y.; Mao, S.; Zhang, X.; Yin, Q.; et al. MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Vienna, Austria, 27 July–1 August 2025; Association for Computational Linguistics: Toronto, ON, Canada, 2025; Volume 1: Long Papers, pp. 13371–13391. [Google Scholar] [CrossRef]
  25. Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Liu, T.; et al. A Survey on In-Context Learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, FL, USA, 12–16 November 2024. [Google Scholar]
  26. Li, D.; Murr, L. HumanEval on Latest GPT Models—2024. arXiv 2024, arXiv:2402.14852. [Google Scholar] [CrossRef]
  27. Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2. 5 technical report. arXiv 2024, arXiv:2412.15115. [Google Scholar]
  28. Kempton, T.; Burrell, S. Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models. arXiv 2025, arXiv:2503.21929. [Google Scholar] [CrossRef]
  29. Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N.; Lian, D.; Nie, J.Y. C-pack: Packed resources for general chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 641–649. [Google Scholar]
  30. Warner, B.; Chaffin, A.; Clavié, B.; Weller, O.; Hallström, O.; Taghadouini, S.; Gallagher, A.; Biswas, R.; Ladhak, F.; Aarsen, T.; et al. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. arXiv 2024, arXiv:2412.13663. [Google Scholar] [CrossRef]
  31. Xu, J.; Shao, W.; Chen, L.; Liu, L. SimCSE++: Improving Contrastive Learning for Sentence Embeddings from Two Perspectives. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 12028–12040. [Google Scholar]
Figure 1. Restructure of the prompt template using different demonstration retrieval strategies and using them as the LLM prompts. Since the prompt is too long to display in full, ‘…’ is used to indicate omitted parts.
Bdcc 09 00276 g001
Figure 2. Using different demonstration retrieval strategies on Qwen2.5-7B-Instruct language models to complete the ToT’s task. (a) The accuracy (%) achieved using various demonstration retrieval strategies on the Game of 24 task. (b) The accuracy (%) achieved using various demonstration retrieval strategies on the Crosswords task.
Bdcc 09 00276 g002
Figure 3. The performance of Qwen2.5-14B-Instruct and Qwen2.5-32B-Instruct on ToT’s original tasks.
Bdcc 09 00276 g003
Table 1. The prompt templates for the value prompt on Game of 24 and Crosswords tasks.
Game of 24 value prompt template:
<demonstrations>
Task: Evaluate if given numbers can reach 24 (sure/likely/impossible)
Instruction: Mimic the format and reasoning steps of the <demonstrations>, and generate possible future steps and the final evaluation for the following input.
Please do not generate any other text.
Input: <input>
Final result: <output>

An example of <demonstrations> for the Game of 24 value prompt:
input: 11 12
possible future steps:
11 + 12 = 23
12 − 11 = 1
11 * 12 = 132
11/12 = 0.91
final evaluation: impossible

Crosswords value prompt template:
<demonstrations>
Task: Evaluate if there exists a five-letter word of some meaning that fits the given letter constraints (sure/maybe/impossible).
Instruction: Mimic the format and reasoning steps of the <demonstrations>, and generate possible future steps and the final evaluation for the following input.
Please do not generate any other text contents.
Input: <input>
Final result: <output>

An example of <demonstrations> for the Crosswords value prompt:
Input: Incorrect; to injure: w _ o _ g
The letter constraint is: 5 letters, letter 1 is w, letter 3 is o, letter 5 is g.
Some possible words that mean “Incorrect; to injure”:
wrong (w r o n g): 5 letters, letter 1 is w, letter 3 is o, letter 5 is g. fit!
Final result: sure
Table 2. The prompt templates for the propose prompt on Game of 24 and Crosswords tasks.
Propose prompt template (Game of 24):
<demonstrations>
Instruction: Mimic the format of the above <demonstrations>, and generate possible next steps for the following input. Please do not generate any other text.
Input: <input>
Possible next steps: <output>

Propose prompt template (Crosswords):
<demonstrations>
Let's play a 5 × 5 mini crossword, where each word should have exactly 5 letters.
Instruction: Mimic the format of the above <demonstrations>. Please do not generate any other text.
Input: <input>
Possible next steps: <output>

An example of <demonstrations> for the propose prompt (Game of 24):
Input: 2 8 8 14
Possible next steps:
2 + 8 = 10 (left: 10 8 14)
2 * 8 = 16 (left: 16 8 14)
8 − 2 = 6 (left: 6 8 14)
8/2 = 4 (left: 4 8 14)
2 + 14 = 16 (left: 8 8 16)
2 * 14 = 28 (left: 8 8 28)
14/2 = 7 (left: 8 8 7)
14 − 2 = 12 (left: 8 8 12)
8 + 8 = 16 (left: 2 16 14)
8 − 8 = 0 (left: 2 0 14)
8 * 8 = 64 (left: 2 64 14)
8/8 = 1 (left: 2 1 14)

An example of <demonstrations> for the propose prompt (Crosswords):
Input: Current board:
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
_ _ _ _ _
Possible next steps:
h1. shown (high)
h2. wirra (medium)
h3. avail (high)
h4. rette (medium)
h5. treed (high)
v1. swart (high)
v2. hiver (high)
v3. orate (high)
v4. write (medium)
v5. naled (high)
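The Game of 24 propose step enumerates every legal pairwise operation on the current numbers together with the multiset that remains, exactly the pattern the demonstration above exhibits. A hedged sketch of such an enumerator (`propose_next_steps` is our name; the ordering of the leftover numbers may differ from the paper's demonstrations, and we only emit divisions that are exact):

```python
from itertools import combinations

def propose_next_steps(numbers):
    """Enumerate candidate next steps as 'a op b = c (left: ...)' strings."""
    steps = []
    for i, j in combinations(range(len(numbers)), 2):
        hi = max(numbers[i], numbers[j])
        lo = min(numbers[i], numbers[j])
        rest = [numbers[k] for k in range(len(numbers)) if k not in (i, j)]
        ops = [
            (f"{lo} + {hi}", lo + hi),   # addition
            (f"{lo} * {hi}", lo * hi),   # multiplication
            (f"{hi} - {lo}", hi - lo),   # subtraction, larger first
        ]
        if lo != 0 and hi % lo == 0:      # exact division only
            ops.append((f"{hi} / {lo}", hi // lo))
        for expr, val in ops:
            left = " ".join(str(n) for n in rest + [val])
            steps.append(f"{expr} = {val} (left: {left})")
    return steps

steps = propose_next_steps([2, 8, 8, 14])
```

Each candidate then becomes one branch of the ToT search tree and is scored by the value prompt.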
Table 3. The accuracy (%) comparison on MMLU, BBH, and HumanEval using Qwen2.5-7B-Instruct as the backbone.

Method                      MMLU          BBH           HumanEval
CoT                         62.1 ± 1.06   45.2 ± 0.85   68.8 ± 1.57
ToT (original)              64.2 ± 0.83   47.4 ± 0.93   72.6 ± 1.68
ToT + VSR (Proposed)        66.4 ± 0.86   50.3 ± 0.77   73.8 ± 1.34
ToT + VSR + DR (Proposed)   67.2 ± 0.91   51.8 ± 0.73   74.7 ± 1.41
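The VSR component retrieves the demonstrations whose embeddings lie closest to the current query under cosine similarity. A toy sketch with hand-made two-dimensional vectors (`retrieve_top_k` is our illustrative name; the actual system embeds text with a sentence encoder such as BGE rather than using fixed vectors):

```python
import numpy as np

def retrieve_top_k(query_vec, demo_vecs, demos, k=2):
    """Return the k demonstrations most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = demo_vecs / np.linalg.norm(demo_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per demonstration
    top = np.argsort(-sims)[:k]      # indices of the k highest scores
    return [demos[i] for i in top]

demos = ["demo A", "demo B", "demo C"]
vecs = np.array([[1.0, 0.0],
                 [0.9, 0.1],
                 [0.0, 1.0]])
top2 = retrieve_top_k(np.array([1.0, 0.05]), vecs, demos, k=2)
```

Here the query vector points almost along the first axis, so "demo A" and "demo B" are returned while the orthogonal "demo C" is dropped.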
Table 4. The impact of varying demonstration quantities on accuracy (%).

                  Number of Demonstrations
Strategy          1      2      4      8      16
ToT + CBS         17.7   20.5   25.1   26.2   26.3
ToT + VSR         43.2   57.9   59.3   58.7   57.3
ToT + VSR + DR    47.8   58.8   60.1   60.9   59.4
Table 5. The impact of different embedding models on the accuracy (%).

Embedding Model          Accuracy (Game of 24)
BGE-base-en              58.7
BGE-large-en             59.0
BERT                     55.3
Sentence-Transformer     56.4
SimCSE                   57.1
Table 6. The impact of reducing the number of demonstrations on the accuracy (%).

Fraction of Pool Retained   Accuracy (Game of 24)
100%                        58.7
50%                         57.3
10%                         55.4
5%                          51.9
1%                          48.6
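Shrinking the demonstration pool before retrieval can be modeled as subsampling the full set. A small sketch of that setup (`reduce_pool` and the choice of uniform random sampling are our assumptions about how the reduced pools were drawn; the paper reports only the retained fractions):

```python
import random

def reduce_pool(demonstrations, keep_rate, seed=0):
    """Keep a random keep_rate fraction of the demonstration pool (at least one)."""
    rng = random.Random(seed)                       # fixed seed for repeatability
    k = max(1, int(len(demonstrations) * keep_rate))
    return rng.sample(demonstrations, k)            # sample without replacement

pool = [f"demo_{i}" for i in range(200)]
subset = reduce_pool(pool, 0.10)   # retain 10% of the pool
```

Retrieval then runs against `subset` instead of `pool`, which is how the accuracy drop at small retained fractions in Table 6 would be measured.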
Table 7. Performance comparison of GPT-4 Turbo under different reasoning strategies.

Method            Crossword (%)   Game of 24 (%)   MMLU (%)
CoT               15.1            6.8              85.8
ToT (original)    61.2            74.7             86.3
ToT + VSR         74.1            86.3             87.1
ToT + VSR + DR    77.8            88.5             88.4
Table 8. Performance comparison under different demonstration orderings.

Method (ToT + VSR + DR)       Crossword   Game of 24   MMLU
Random order                  18.2        48.6         66.7
Descending order              18.4        49.1         66.9
Ascending order (default)     18.7        49.1         67.2
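Under the default ascending order, demonstrations are sorted from least to most similar, so the most relevant one sits adjacent to the query at the end of the prompt. A small sketch of the three orderings compared in Table 8 (`order_demos` and the (demo, similarity) pair representation are our illustrative choices):

```python
import random

def order_demos(demos_with_scores, order="ascending"):
    """Order (demonstration, similarity) pairs before prompt assembly.

    'ascending' puts the most similar demonstration last (nearest the query
    in the final prompt); 'descending' puts it first; 'random' shuffles.
    """
    if order == "random":
        shuffled = list(demos_with_scores)
        random.shuffle(shuffled)
        return [d for d, _ in shuffled]
    ranked = sorted(demos_with_scores, key=lambda p: p[1],
                    reverse=(order == "descending"))
    return [d for d, _ in ranked]

pairs = [("far", 0.2), ("near", 0.9), ("mid", 0.5)]
ordered = order_demos(pairs)   # ascending: ['far', 'mid', 'near']
```

The ordered list is then joined and prepended to the query, as in the templates of Tables 1 and 2.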
Table 9. Impact of different LLMs on the task accuracy (%).

Task                              Qwen2.5-7B   Llama3-8B
Game of 24 (ToT + VSR)            46.5         43.8
Game of 24 (ToT + VSR + DR)       49.1         46.3
Crosswords (ToT + VSR)            17.2         15.9
Crosswords (ToT + VSR + DR)       18.7         17.2

Li, J.; Ren, B.; Zhang, M.; Chen, H. Demo-ToT: Enhancing the Reasoning Capabilities of AI Agent via Improved Demonstrations Retrieval Strategy. Big Data Cogn. Comput. 2025, 9, 276. https://doi.org/10.3390/bdcc9110276