Demo-ToT: Enhancing the Reasoning Capabilities of AI Agent via Improved Demonstrations Retrieval Strategy
Abstract
1. Introduction
1. Bridging ICL and ToT frameworks. Despite the potential of in-context learning, its integration into structured reasoning paradigms such as ToT remains limited. We propose Demo-ToT, which dynamically retrieves relevant demonstrations at each intermediate reasoning step, enabling smaller models to perform complex reasoning. By supporting context-driven refinement of the reasoning trajectory, the approach overcomes the rigidity of fixed prompt templates.
2. Comprehensive formulation of demonstration retrieval strategies. We unify and evaluate multiple demonstration selection paradigms under the ToT framework, including: (a) ToT–CBS, which selects representative demonstrations by clustering the support set; (b) ToT–VSR, which conducts vector-similarity-based retrieval; and (c) ToT–SR and ToT–SSR, which adopt sparse (BM25) and string-similarity (Levenshtein) retrieval, respectively. This unified formulation provides a standardized comparison among existing and extended retrieval mechanisms for structured reasoning.
3. Retrieval–re-ranking mechanism and extensive validation. Beyond retrieval-only strategies, we propose a novel retrieval–re-ranking paradigm for demonstration selection. Our ToT + VSR + DR framework first retrieves a broad pool of candidate demonstrations and then re-ranks them with a learnable scoring model that predicts each demonstration's expected utility to the model. The re-ranking model is trained with a pairwise ranking loss that aligns predicted scores with generation-based quality. This is, to our knowledge, the first adaptation of the retrieval–re-rank architecture to structured reasoning. We further validate Demo-ToT on five reasoning benchmarks (Game of 24, Crosswords, MMLU, BBH, and HumanEval) and across multiple model scales (Qwen2.5–7B, 14B, and 32B), confirming consistent accuracy gains and reduced performance gaps with larger proprietary models.
2. Related Work
2.1. Reasoning Methods for AI Agents
2.2. Prompt Optimization Methods
2.3. In-Context Learning
3. Methodology
3.1. Tree of Thought (ToT)
- State representation
- Thought generation
- State evaluation
- Search policy
- Limitations of ToT
- Motivation for Demo-ToT
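To make the interplay of these components concrete, the following minimal sketch shows a ToT-style breadth-first search in Python. The `propose` and `value` helpers stand in for the LLM-backed thought generation and state evaluation steps; their names, and the beam-search policy shown, are illustrative assumptions rather than the paper's exact implementation.

```python
from typing import Callable

def tot_bfs(
    root: str,
    propose: Callable[[str], list[str]],
    value: Callable[[str], float],
    max_depth: int = 3,
    beam_width: int = 5,
) -> str:
    """Breadth-first search over thoughts: expand, score, keep the best beam."""
    frontier = [root]
    for _ in range(max_depth):
        # Thought generation: expand every state kept in the current beam.
        candidates = [nxt for state in frontier for nxt in propose(state)]
        if not candidates:
            break
        # State evaluation + search policy: keep the top-scoring states.
        candidates.sort(key=value, reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]  # best final state under the value function
```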
3.2. Demo-ToT Framework
3.2.1. Demonstrations Retrieval Strategy
value prompt = “Evaluate if given numbers can reach 24 (sure/likely/impossible)
10 14
10 + 14 = 24
sure
11 12
11 + 12 = 23
12 − 11 = 1
11 * 12 = 132
11/12 = 0.91
impossible
4 4 10
4 + 4 + 10 = 8 + 10 = 18
4 * 10 − 4 = 40 − 4 = 36
(10 − 4) * 4 = 6 * 4 = 24
sure
......”
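For illustration, the sketch below shows how a value prompt of this shape might be assembled at inference time: retrieved demonstrations fill the `<demonstrations>` placeholder and the current state fills `<input>`. The template text mirrors the tables in Section 3.2.2; the helper name is an assumption for this sketch.

```python
VALUE_TEMPLATE = (
    "<demonstrations>\n"
    "Task: Evaluate if given numbers can reach 24 (sure/likely/impossible)\n"
    "Instruction: Mimic the format and reasoning steps of the <demonstrations>, "
    "and generate possible future steps and the final evaluation for the "
    "following input. Please do not generate any other text.\n"
    "Input: <input>\n"
    "Final result:"
)

def build_value_prompt(demos: list[str], state: str) -> str:
    # count=1 replaces only the leading placeholder; the literal
    # "<demonstrations>" inside the instruction sentence stays verbatim.
    filled = VALUE_TEMPLATE.replace("<demonstrations>", "\n".join(demos), 1)
    return filled.replace("<input>", state)
```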
3.2.2. Prompt Templates
3.2.3. Fixed Demonstrations Replacement
- Random selection (RS). In this method, for each type of prompt template, a set of demonstrations was randomly selected to fill the placeholders.
- Clustering-based selection (CBS). This method followed the approach proposed in AutoCoT [6], which was adopted as one of the comparative baselines for evaluating Demo-ToT. Specifically, the support set was first clustered to identify several distinct centroids, and representative demonstrations from each cluster were then selected to construct the prompt and guide the LLMs.
3.2.4. Selecting Relevant Demonstrations
- Vector Similarity Retrieval (VSR). Vector similarity retrieval is a dense retrieval paradigm that represents text as high-dimensional dense vectors in a continuous semantic space, typically generated by pre-trained language models. Given an input question $q$, we encoded it into a vector $\mathbf{v}_q = E(q)$, where $E(\cdot)$ denotes the embedding model. The retrieval process selected demonstrations $d_i$ from the support set whose encoded vectors $\mathbf{v}_{d_i}$ maximized the similarity score, commonly measured by cosine similarity: $\text{sim}(q, d_i) = \frac{\mathbf{v}_q \cdot \mathbf{v}_{d_i}}{\lVert \mathbf{v}_q \rVert\,\lVert \mathbf{v}_{d_i} \rVert}$. In our implementation, we used the Facebook AI Similarity Search (FAISS) toolkit [23] to efficiently perform approximate nearest neighbor search for scalability.
- Sparse Retrieval (SR). Sparse retrieval refers to traditional information retrieval methods based on lexical matching. We adopted the BM25 ranking function [7], which scored a document $d$ with respect to a query $q$ as $\text{BM25}(q,d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1\left(1 - b + b\,\frac{|d|}{\text{avgdl}}\right)}$, where $f(t,d)$ is the frequency of term $t$ in $d$, $\text{IDF}(t)$ is its inverse document frequency, $k_1$ and $b$ are tunable parameters (we used the default values), $|d|$ is the document length, and $\text{avgdl}$ is the average document length in the collection.
- String Similarity Retrieval (SSR). This strategy measures surface-level textual similarity using the Levenshtein distance. The normalized similarity between question $q$ and demonstration $d_i$ was computed as $\text{sim}(q, d_i) = 1 - \frac{\text{Lev}(q, d_i)}{\max(|q|, |d_i|)}$, where $\text{Lev}(\cdot,\cdot)$ is the Levenshtein edit distance. We used this score to rank demonstrations by their surface-form resemblance to the input. A combined sketch of all three retrieval strategies follows this list.
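The sketch below illustrates the three retrieval strategies under stated assumptions: the support set is a small list of demonstration input strings, and the `faiss-cpu`, `sentence-transformers`, and `rank-bm25` packages are available. It is a minimal reconstruction, not the paper's exact code.

```python
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

support_set = ["4 4 10", "11 12", "2 8 8 14"]  # toy demonstration inputs

# VSR: dense vectors; unit-normalized embeddings make inner product = cosine.
encoder = SentenceTransformer("BAAI/bge-base-en")
vecs = encoder.encode(support_set, normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(np.asarray(vecs, dtype=np.float32))

def vsr_topk(q: str, k: int = 2) -> list[str]:
    qv = encoder.encode([q], normalize_embeddings=True)
    _, ids = index.search(np.asarray(qv, dtype=np.float32), k)
    return [support_set[i] for i in ids[0]]

# SR: BM25 lexical matching over whitespace-tokenized demonstrations.
bm25 = BM25Okapi([d.split() for d in support_set])

def sr_topk(q: str, k: int = 2) -> list[str]:
    scores = bm25.get_scores(q.split())
    return [support_set[i] for i in np.argsort(scores)[::-1][:k]]

# SSR: normalized Levenshtein similarity via a plain DP (no extra dependency).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ssr_topk(q: str, k: int = 2) -> list[str]:
    sims = [1 - levenshtein(q, d) / max(len(q), len(d)) for d in support_set]
    return [support_set[i] for i in np.argsort(sims)[::-1][:k]]
```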
3.2.5. Demonstration Re-Ranking (DR)
- Model architecture
- Empirical observation
- Training objective
- Integration into Demo-ToT
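As a hedged illustration of the training objective, the sketch below implements a scalar demonstration scorer trained with a pairwise ranking loss. The architecture, embedding dimension, and optimizer settings are assumptions for this sketch; only the pairwise objective (scoring the higher-quality demonstration above the lower-quality one) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DemoScorer(nn.Module):
    """Predicts a scalar utility score for a (question, demonstration) pair."""
    def __init__(self, emb_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([q_emb, d_emb], dim=-1)).squeeze(-1)

def pairwise_ranking_loss(score_pos: torch.Tensor, score_neg: torch.Tensor) -> torch.Tensor:
    # log(1 + exp(s_neg - s_pos)): push the demonstration with higher
    # generation-based quality above the lower-quality one.
    return F.softplus(score_neg - score_pos).mean()

# One toy training step; random tensors stand in for encoder outputs.
scorer = DemoScorer()
opt = torch.optim.AdamW(scorer.parameters(), lr=1e-4)
q = torch.randn(8, 768)
d_pos, d_neg = torch.randn(8, 768), torch.randn(8, 768)
opt.zero_grad()
loss = pairwise_ranking_loss(scorer(q, d_pos), scorer(q, d_neg))
loss.backward()
opt.step()
```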
3.3. Experimental Settings
3.3.1. Task
3.3.2. Dataset
1. First, we divided the task dataset into support/test splits according to the ratio described above.
2. Second, demonstrations were collected using the GPT-4o model. Each sample was processed under the original ToT framework, and for successfully solved cases, the intermediate input–output pairs from the proposal and value steps were extracted and included in the demonstration set.
3. Third, the resulting demonstration set was embedded using the BGE-base-en model, where each demonstration input was transformed into a dense vector representation. All vectors were stored and indexed as a vector database using the FAISS toolkit to support efficient similarity retrieval during reasoning. A sketch of steps 2 and 3 follows this list.
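A minimal sketch of steps 2 and 3 under stated assumptions: the `solve_with_tot` helper is hypothetical, standing in for running a sample through the original ToT framework with GPT-4o and returning the intermediate proposal/value input–output pairs for solved cases.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def build_demo_index(support_samples, solve_with_tot):
    demos = []
    for sample in support_samples:
        solved, io_pairs = solve_with_tot(sample)  # (bool, list of (input, output))
        if solved:
            # Keep proposal/value traces only for successfully solved cases.
            demos.extend(io_pairs)
    encoder = SentenceTransformer("BAAI/bge-base-en")
    vecs = encoder.encode([inp for inp, _ in demos], normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype=np.float32))
    return index, demos
```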
- MMLU. The MMLU benchmark [24] evaluates large language models’ knowledge retention and reasoning across 57 academic subjects under strict zero-shot and few-shot settings, providing a comprehensive cross-domain assessment without task-specific fine-tuning.
- BBH. Curated from the BIG-Bench repository [25], the BIG-Bench Hard subset comprises 23 challenging tasks on which earlier language models failed to match average human-rater performance. These tasks emphasize complex reasoning patterns, including causal inference, counterfactual analysis, and multi-hop deduction, that current architectures find particularly demanding.
- HumanEval. The HumanEval benchmark [26] evaluates the code generation and reasoning abilities of large language models through 164 curated programming tasks with predefined function signatures and verification tests, reflecting realistic software engineering requirements.
4. Results
4.1. Experimental Results for the Game of 24 Task
4.2. Experimental Results for Crosswords
4.3. Experimental Results for the Other Benchmarks
- Reasoning depth. For ToT-based methods, the maximum reasoning depth is treated as a hyperparameter. Following the settings in the original ToT paper, we set the maximum reasoning depth to three for the Game of 24 task, four for the Crossword task, and three for MMLU, BBH, and HumanEval. These depths provided a balance between reasoning completeness and computational cost.
- Robustness under noisy demonstrations. We conducted a dedicated experiment on the Crossword task by injecting noise into half of the demonstration set, randomly altering words in the answers (an illustrative sketch of this perturbation follows this list). Under such noise, the accuracy of ToT–VSR dropped sharply from 17.2 to 10.1, as it relies directly on vector-similarity retrieval. In contrast, ToT–VSR–DR, which includes a re-ranking step, effectively filtered out noisy demonstrations, showing only a slight drop from 18.7 to 16.4. These results demonstrate that the proposed re-ranking mechanism enhances robustness against corrupted or low-quality demonstration sets.
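The following sketch shows one plausible form of the noise-injection procedure: half of the demonstrations have a random word replaced by a random five-letter string. The exact corruption used in the experiment may differ; this only conveys the shape of the perturbation.

```python
import random
import string

def inject_noise(demos: list[str], frac: float = 0.5, seed: int = 0) -> list[str]:
    """Corrupt `frac` of the demonstrations by swapping one random word."""
    rng = random.Random(seed)
    noisy = list(demos)
    for i in rng.sample(range(len(noisy)), int(frac * len(noisy))):
        words = noisy[i].split()
        if not words:
            continue
        j = rng.randrange(len(words))
        words[j] = "".join(rng.choices(string.ascii_lowercase, k=5))
        noisy[i] = " ".join(words)
    return noisy
```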
4.4. Ablation Experiment
4.4.1. The Impact of Varying Demonstration Quantities
- The ToT–VSR method outperformed the ToT–CBS method under different settings of demonstration quantities.
- Further increasing the number of demonstrations did not yield clear additional improvements.
4.4.2. The Impact of Different Embedding Models
- The BERT model [30]—representing the Transformer-based encoder family widely used for text embeddings.
- Sentence-Transformer (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2, accessed on 1 September 2025)—a modern sentence-level embedding framework extended for large-scale cross-domain tasks.
- The SimCSE model [31]—a contrastive sentence embedding approach with recent improvements in stability and domain generalization.
- BGE-large-en, which is from the same model series [29] as BGE-base-en.
4.4.3. The Impact of Reducing the Size of the Support Set
4.4.4. Implementation Efficiency and Resource Consumption
- Transforming an input into its embedding vector costs 31.2 ms on average, using the BGE-base embedding model. Fetching the top-k demonstrations through the FAISS-based vector index adds only 6.7 ms per query.
- The re-ranking stage, which refines the retrieved demonstrations, takes 91.4 ms on average for each query.
- For the LLM inference stage, we measured throughput in terms of generated tokens per second (tps). When running the Qwen2.5-7B-Instruct model with the vLLM toolkit on a vGPU-32GB device, the inference speed reached 283.6 tps.
- The warm-start process of the entire system cost approximately 57 s, including LLM deployment (30.4 s) and FAISS index compilation (11.6 s).
- To achieve the above performance, the LLM occupied roughly 90% of GPU memory, while the remaining 10% was used by the embedding and re-ranking models. This configuration ensured full utilization of a single vGPU-32GB machine without exceeding its memory limit.
4.4.5. Comparison with GPT-4 (Turbo)
4.4.6. Effect of Demonstration Order
4.5. The Impact of Different LLMs on Task Accuracy
4.6. Case Study
Listing 1. ToT demonstrations for the input “2 5 6 11” in the Game of 24 task.
Listing 2. The LLM’s response to the input “2 5 6 11” (ToT).
Listing 3. Adaptively selected demonstrations based on the input “2 5 6 11”.
Listing 4. The LLM’s response to the input “2 5 6 11” (Demo-ToT).
Listing 5. Value prompt with demonstrations for evaluating the proposal “6 − 2 = 4 (left: 4 5 11)” (Demo-ToT).
Listing 6. The LLM’s response to the value prompt (Demo-ToT).
Listing 7. The value prompt of ToT.
Listing 8. The LLM’s response to the value prompt (ToT).
- Why adaptive retrieval succeeds. The primary reason lies in its ability to provide demonstrations that are semantically aligned with the current reasoning context rather than merely lexically or structurally similar. Lexical overlap is not the deciding factor—some retrieved examples exhibit low word-level similarity but still enhance reasoning accuracy. Structural similarity is also not dominant, since all demonstrations share similar reasoning formats. In contrast, semantic embedding closeness plays a key role: ToT–VSR consistently outperformed ToT (original), ToT–SR, and ToT–SSR, confirming that semantic retrieval captures more meaningful task-level relations. The re-ranking process in ToT–VSR–DR further refined candidate demonstrations by predicting their contribution to reasoning performance, rather than relying solely on similarity scores. This benefit-oriented selection mechanism explains the stability and robustness gains achieved by Demo-ToT.
- Systematic error analysis. To enhance interpretability, we categorized the reasoning errors into two types: (a) proposal failures, where the reasoning process produces incorrect intermediate paths; and (b) value failures, where the evaluation step misjudges the correctness of reasoning branches. For the MMLU task, the proportions of proposal/value failures are 28.7%/7.1% for ToT (original), 27.9%/5.7% for ToT–VSR, and 27.4%/5.4% for ToT–VSR–DR. For the BBH task, the corresponding ratios are 39.1%/13.5%, 37.2%/12.5%, and 36.8%/11.4%. These results demonstrate that adaptive retrieval—particularly with re-ranking—effectively reduces both proposal and value errors, leading to more stable and interpretable reasoning outcomes.
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar] [CrossRef]
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.H.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 11809–11822. [Google Scholar] [CrossRef]
- Besta, M.; Blach, N.; Kubicek, A.; Gerstenberger, R.; Podstawski, M.; Gianinazzi, L.; Gajda, J.; Lehmann, T.; Niewiadomski, H.; Nyczyk, P.; et al. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 17682–17690. [Google Scholar]
- Chen, W.; Ma, X.; Wang, X.; Cohen, W.W. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. Trans. Mach. Learn. Res. 2023, 53, 12588. [Google Scholar] [CrossRef]
- Zhang, Z.; Zhang, A.; Li, M.; Smola, A. Automatic Chain of Thought Prompting in Large Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Zeng, H.; Killingback, J.; Zamani, H. Scaling sparse and dense retrieval in decoder-only llms. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, Padua, Italy, 13–18 July 2025; pp. 2679–2684. [Google Scholar]
- Poljak, J.; Crčić, D.; Horvat, T. Comparative Analysis of Text Similarity Algorithms and Their Practical Applications in Computer Science. Elektrotehn. Vestn. 2025, 92, 151–156. Available online: https://ev.fe.uni-lj.si/3-2025/Poljak.pdf (accessed on 1 September 2025).
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. React: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. Pal: Program-aided language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 10764–10799. [Google Scholar]
- Basiouni, A.M.; El Rashid, M.; Shaalan, K. In-Context Learning in Large Language Models (LLMs): Mechanisms, Capabilities, and Implications for Advanced Knowledge Representation and Reasoning. IEEE Access 2025, 13, 95574–95593. [Google Scholar] [CrossRef]
- Pryzant, R.; Iter, D.; Li, J.; Lee, Y.T.; Zhu, C.; Zeng, M. Automatic Prompt Optimization with “Gradient Descent” and Beam Search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; pp. 7957–7968. [Google Scholar]
- Zhou, Y.; Zheng, B.; Chen, Q. Large Language Models as Automatic Prompt Engineers. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Toronto, Canada, 2024. [Google Scholar]
- Chen, Z.; Zhang, H.; Liu, X. Soft Prompt Transfer for Parameter-Efficient Fine-Tuning of Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 16 December 2024. [Google Scholar]
- Zhou, X.; Liang, D.; Xu, W.; Zhu, X.; Xu, Y.; Zou, Z.; Bai, X. Dynamic adapter meets prompt tuning: Parameter-efficient transfer learning for point cloud analysis. arXiv 2024, arXiv:2403.01439. [Google Scholar] [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.H.; Xia, F.; Le, Q.; Zhou, D. Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar] [CrossRef]
- Xu, R.; Liu, H.; Nag, S.; Dai, Z.; Xie, Y.; Tang, X.; Luo, C.; Li, Y.; Ho, J.C.; Yang, C.; et al. SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2025), Albuquerque, NM, USA, 29 April–4 May 2025; Volume 1: Long Papers, pp. 11534–11550. [Google Scholar]
- Li, X.; Lv, K.; Yan, H.; Lin, T.; Zhu, W.; Ni, Y.; Xie, G.T.; Wang, X.; Qiu, X. Unified Demonstration Retriever for In-Context Learning. arXiv 2023, arXiv:2305.04320. [Google Scholar] [CrossRef]
- Zhang, Y.; Feng, S.; Tan, C. Active Example Selection for In-Context Learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 9134–9148. [Google Scholar]
- Li, G.; Wang, P.; Liu, J.; Guo, Y.; Ji, K.; Shang, Z.; Xu, Z. Meta In-Context Learning Makes Large Language Models Better Zero and Few-Shot Relation Extractors. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), Jeju, Republic of Korea, 3–9 August 2024; pp. 6350–6358. [Google Scholar]
- Li, X.; Nie, E.; Liang, S. From classification to generation: Insights into crosslingual retrieval augmented icl. arXiv 2023, arXiv:2311.06595. [Google Scholar] [CrossRef]
- Touvron, H.; Martin, L.; Stone, K.R.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
- Mageirakos, V.; Wu, B.; Alonso, G. Cracking Vector Search Indexes. Proc. VLDB Endow. 2025, 18, 3951–3964. [Google Scholar] [CrossRef]
- Zhao, Q.; Huang, Y.; Lv, T.; Cui, L.; Wei, F.; Sun, Q.; Xin, Y.; Mao, S.; Zhang, X.; Yin, Q.; et al. MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Vienna, Austria, 27 July–1 August 2025; Association for Computational Linguistics: Toronto, ON, Canada, 2025; Volume 1: Long Papers, pp. 13371–13391. [Google Scholar] [CrossRef]
- Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Liu, T.; et al. A Survey on In-Context Learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, FL, USA, 12–16 November 2024. [Google Scholar]
- Li, D.; Murr, L. HumanEval on Latest GPT Models-2024. arXiv 2024, arXiv:2402.14852. [Google Scholar] [CrossRef]
- Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115. [Google Scholar]
- Kempton, T.; Burrell, S. Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models. arXiv 2025, arXiv:2503.21929. [Google Scholar] [CrossRef]
- Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N.; Lian, D.; Nie, J.Y. C-pack: Packed resources for general chinese embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 641–649. [Google Scholar]
- Warner, B.; Chaffin, A.; Clavié, B.; Weller, O.; Hallström, O.; Taghadouini, S.; Gallagher, A.; Biswas, R.; Ladhak, F.; Aarsen, T.; et al. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. arXiv 2024, arXiv:2412.13663. [Google Scholar] [CrossRef]
- Xu, J.; Shao, W.; Chen, L.; Liu, L. SimCSE++: Improving Contrastive Learning for Sentence Embeddings from Two Perspectives. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore, 6–10 December 2023; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 12028–12040. [Google Scholar]



| Game of 24 | Crosswords |
|---|---|
| Value prompt template: <demonstrations> Task: Evaluate if given numbers can reach 24 (sure/likely/impossible) Instruction: Mimic the format and reasoning steps of the <demonstrations>, and generate possible future steps and the final evaluation for the following input. Please do not generate any other text. Input: <input> Final result: <output> | Value prompt template: <demonstrations> Task: Evaluate if there exists a five letter word of some meaning that fit the given letter constraints (sure/maybe/impossible). Instruction: Mimic the format and reasoning steps of the <demonstrations>, and generate possible future steps and the final evaluation for the following input. Please do not generate any other text contents. Input: <input> Final result: <output> |
| An example of <demonstrations> for the value prompt: input: 11 12 possible future steps: 11 + 12 = 23 12 − 11 = 1 11 * 12 = 132 11/12 = 0.91 final evaluation: impossible | An example of <demonstrations> for the value prompt: Input: Incorrect; to injure: w _ o _ g The letter constraint is: 5 letters, letter 1 is w, letter 3 is o, letter 5 is g. Some possible words that mean “Incorrect; to injure”: wrong (w r o n g): 5 letters, letter 1 is w, letter 3 is o, letter 5 is g. fit! Final result: sure |
| Game of 24 | Crosswords |
|---|---|
| Propose prompt template: <demonstrations> Instruction: Mimic the format of the above <demonstrations>, and generate possible next steps for the following input. Please do not generate any other text contents. Input: <input> Possible next steps: <output> | Propose prompt template: <demonstrations> Let’s play a 5 × 5 mini crossword, where each word should have exactly 5 letters. Instruction: Mimic the format of the above <demonstrations>. Please do not generate any other text contents. Input: <input> Possible next steps: |
| An example of <demonstrations> for the propose prompt: Input: 2 8 8 14 Possible next steps: 2 + 8 = 10 (left: 10 8 14) 2 * 8 = 16 (left: 16 8 14) 8 − 2 = 6 (left: 6 8 14) 8/2 = 4 (left: 4 8 14) 2 + 14 = 16 (left: 8 8 16) 2 * 14 = 28 (left: 8 8 28) 14/2 = 7 (left: 8 8 7) 14 − 2 = 12 (left: 8 8 12) 8 + 8 = 16 (left: 2 16 14) 8 − 8 = 0 (left: 2 0 14) 8 * 8 = 64 (left: 2 64 14) 8/8 = 1 (left: 2 1 14) | An example of <demonstrations> for the propose prompt: Input: Current Board: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ Possible next steps: h1. shown (high) h2. wirra (medium) h3. avail (high) h4. rette (medium) h5. treed (high) v1. swart (high) v2. hiver (high) v3. orate (high) v4. write (medium) v5. naled (high) |
| Method | MMLU | BBH | HumanEval |
|---|---|---|---|
| CoT | 62.1 ± 1.06 | 45.2 ± 0.85 | 68.8 ± 1.57 |
| ToT (original) | 64.2 ± 0.83 | 47.4 ± 0.93 | 72.6 ± 1.68 |
| ToT + VSR (Proposed) | 66.4 ± 0.86 | 50.3 ± 0.77 | 73.8 ± 1.34 |
| ToT + VSR + DR (Proposed) | 67.2 ± 0.91 | 51.8 ± 0.73 | 74.7 ± 1.41 |
| Strategy | 1 Demo | 2 Demos | 4 Demos | 8 Demos | 16 Demos |
|---|---|---|---|---|---|
| ToT + CBS | 17.7 | 20.5 | 25.1 | 26.2 | 26.3 |
| ToT + VSR | 43.2 | 57.9 | 59.3 | 58.7 | 57.3 |
| ToT + VSR + DR | 47.8 | 58.8 | 60.1 | 60.9 | 59.4 |
| Embedding Models | Accuracy (Game of 24) |
|---|---|
| BGE-base-en | 58.7 |
| BGE-large-en | 59.0 |
| BERT | 55.3 |
| Sentence-Transformer | 56.4 |
| SimCSE | 57.1 |
| Support Set Size (% of Full) | Accuracy (Game of 24) |
|---|---|
| 100% | 58.7 |
| 50% | 57.3 |
| 10% | 55.4 |
| 5% | 51.9 |
| 1% | 48.6 |
| Method | Crossword (%) | Game of 24 (%) | MMLU (%) |
|---|---|---|---|
| CoT | 15.1 | 6.8 | 85.8 |
| ToT (original) | 61.2 | 74.7 | 86.3 |
| ToT + VSR | 74.1 | 86.3 | 87.1 |
| ToT + VSR + DR | 77.8 | 88.5 | 88.4 |
| Method (ToT + VSR + DR) | Crossword | Game of 24 | MMLU |
|---|---|---|---|
| Random order | 18.2 | 48.6 | 66.7 |
| Descending order | 18.4 | 49.1 | 66.9 |
| Ascending order (default) | 18.7 | 49.1 | 67.2 |
| Task | Qwen2.5-7B | Llama3-8B |
|---|---|---|
| Game of 24 (ToT + VSR) | 46.5 | 43.8 |
| Game of 24 (ToT + VSR + DR) | 49.1 | 46.3 |
| Crosswords (ToT + VSR) | 17.2 | 15.9 |
| Crosswords (ToT + VSR + DR) | 18.7 | 17.2 |