COLLT: A Multi-Task Optimization Framework for Clarification-Oriented Tool Learning in Legal Large Language Models
Abstract
1. Introduction
2. Related Work
2.1. Legal Question Answering
2.2. Legal Large Language Model
2.3. Tool Learning
2.4. Distinguishing Features and Main Contributions
- 1. We propose COLLT, the first framework to integrate legal LLMs, proactive clarification, and tool learning. It addresses ambiguous or inappropriate queries from users lacking legal expertise, significantly reducing hallucinations in legal question-answering tasks.
- 2. We design an end-to-end learning strategy that jointly trains action selection, tool invocation, and response generation, effectively leveraging the reasoning capabilities of legal LLMs.
- 3. We conduct a comprehensive evaluation across recent mainstream Chinese legal LLMs, including ablation studies and a clarification benchmark.
- 4. We introduce a legal toolbox comprising six fine-tuned tools and release the COLLT dataset, which includes tool-training data, a multi-turn instruction-tuning corpus, and 500 real-world multi-turn legal consultations.
3. Methodology
3.1. Mathematical Formulation of the COLLT Framework
3.1.1. Notation
3.1.2. Sequential Decision View and the Budget Constraint
3.1.3. Three-Task Loss as a Chain-Rule Factorization
3.1.4. Inference Procedure
Algorithm 1: COLLT inference
3.2. Concrete Instantiation
3.2.1. Action Selection
3.2.2. Legal Tool Selection
- Similar Case Retrieval (SCR): This tool retrieves historical cases that are semantically similar to the input query. To implement it, we fine-tuned a similar case retrieval model based on the Lawformer [14] architecture. During retrieval, the trained Lawformer model encodes both the query and the candidate case texts into vector representations, and a nearest-neighbor search is then run in the embedding space. The model is trained on the case-retrieval portion of the COLLT dataset (Section 3.3), which packages DISC-Law-SFT [36,37], CAIL2019 [38], and LeCaRD [39] with the leakage controls and near-duplicate filtering described in Section 4.1. Detailed training procedures for this tool are provided in Section 4.1.1.
- Legal Article Search (LAS): This tool searches a predefined legal knowledge base for the legal provisions, regulations, and judicial interpretations relevant to a user query. As with the SCR tool, we fine-tuned the pre-trained Lawformer model for the legal article search task, so that it serves as a dedicated statute search tool. Training used the article-search portion of the COLLT dataset, with samples similar to the LawBench LAP test set removed prior to fine-tuning. The legal knowledge base includes the Criminal Law of the People’s Republic of China, the Civil Code of the People’s Republic of China, the Criminal Procedure Law, the Civil Procedure Law, and authoritative judicial interpretations collected from http://www.legalai.cn/. Detailed training procedures for this tool are provided in Section 4.1.2.
- Legal Charge Prediction (LCP): This tool predicts potential legal charges from the case facts described by the user. We fine-tuned a pre-trained Lawformer model for legal charge prediction as a multi-label classifier, since cases with multiple charges are common in judicial practice. The model was trained on the charge-prediction portion of the COLLT dataset, with samples similar to the LawBench LCP test set removed prior to fine-tuning. Detailed training procedures for this tool are provided in Section 4.1.3.
- Legal Element Recognition (LER): This tool extracts legally significant elements from user queries, such as the identity of the actor, subjective intent, methods of action, and consequences. We directly adopt the ALEM [40] model proposed by Zhang et al. as the backbone of this tool. ALEM employs an interaction-based attention mechanism between case facts and legal elements, and it achieved state-of-the-art performance on this task at the time it was introduced.
- Legal Event Detection (LED): This tool identifies and classifies legal events or actions mentioned in user queries, such as “contract signing,” “tortious conduct,” or “intentional injury.” We adopt the DGGCCM [41] model proposed by Gong et al. as our legal event detection tool. At the time of its release, this model represented the state of the art for the task, with strong performance in event extraction and classification.
- Internet Search (NET): To overcome the limitations of static knowledge in pretrained language models, this tool calls the Bing API, a real-time web search engine, to retrieve up-to-date factual and contextual information.
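
The retrieval step described for the Similar Case Retrieval tool (encode query and candidates, then search for nearest neighbors in the embedding space) can be illustrated with a minimal sketch. It assumes cosine similarity as the distance measure and uses toy three-dimensional vectors in place of real encoder outputs; the function names and case ids are hypothetical, and the paper's actual index and similarity function may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_cases(query_vec, case_vecs, k=3):
    """Return the ids of the k candidate cases most similar to the query.

    case_vecs maps a (hypothetical) case id to its embedding vector.
    """
    scored = [(cosine(query_vec, vec), case_id) for case_id, vec in case_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [case_id for _, case_id in scored[:k]]


# Toy embeddings standing in for encoder outputs (assumption for illustration).
cases = {
    "case_a": [0.9, 0.1, 0.0],
    "case_b": [0.0, 1.0, 0.0],
    "case_c": [0.7, 0.3, 0.1],
}
ranking = top_k_cases([1.0, 0.0, 0.0], cases, k=2)
print(ranking)
```

A brute-force scan like this is only practical for small candidate pools; a larger case base would normally sit behind an approximate nearest-neighbor index.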
Statute-First Priority Among Tools
3.2.3. Enhanced Response
| Action | Indicative Token |
|---|---|
| Answer the user’s query directly | <DRT> |
| Clarify the user’s query | <CLR> |

| Legal Tool | Head Tag | Tail Tag |
|---|---|---|
| Similar Case Retrieval | <SCR> | </SCR> |
| Legal Article Search | <LAS> | </LAS> |
| Legal Charge Prediction | <LCP> | </LCP> |
| Legal Element Recognition | <LER> | </LER> |
| Legal Event Detection | <LED> | </LED> |
| Internet Search | <NET> | </NET> |

| Enhanced Response | Indicative Token |
|---|---|
| | <ER> |
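
To make the role of these special tokens concrete, the following sketch parses a generated string into an action, a list of tool calls, and an enhanced response. The token inventory comes from the table above; everything else is an assumption for illustration, including what appears between a head tag and a tail tag, the order of segments, and the function name `parse_generation`.

```python
import re

# Tokens taken from the table; the mapping names are hypothetical labels.
ACTION_TOKENS = {"<DRT>": "answer_directly", "<CLR>": "clarify"}
TOOL_TAGS = ["SCR", "LAS", "LCP", "LER", "LED", "NET"]

# Matches <TAG>payload</TAG>; the backreference (?P=tag) forces the tail tag
# to match the head tag.
TOOL_CALL_RE = re.compile(
    r"<(?P<tag>" + "|".join(TOOL_TAGS) + r")>(?P<payload>.*?)</(?P=tag)>",
    re.DOTALL,
)


def parse_generation(text):
    """Split a generated string into action, tool calls, and enhanced response."""
    # Pick whichever action token occurs first, if any.
    positions = {name: text.find(tok) for tok, name in ACTION_TOKENS.items() if tok in text}
    action = min(positions, key=positions.get) if positions else None
    tool_calls = [
        (m.group("tag"), m.group("payload").strip())
        for m in TOOL_CALL_RE.finditer(text)
    ]
    # Everything after <ER> is treated as the enhanced response (assumption).
    _, sep, tail = text.partition("<ER>")
    enhanced = tail.strip() if sep else None
    return {"action": action, "tool_calls": tool_calls, "enhanced_response": enhanced}


# Hypothetical model output, written only to exercise the parser.
sample = "<DRT><LAS>theft statute</LAS><LCP>took a phone from a shop</LCP><ER>Draft answer."
result = parse_generation(sample)
print(result)
```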
3.3. End-to-End Training
4. Experimental Setup
4.1. Details of Legal Tools Training
4.1.1. Similar Case Retrieval
4.1.2. Legal Article Search
4.1.3. Legal Charge Prediction
4.1.4. Other Tools
4.2. Evaluation Tasks and Metrics
- 1. Legal Charge Prediction: Predicts the legal charges associated with a case based on case facts. Evaluation is performed using the F1 score.
- 2. Legal Article Prediction: Identifies which legal article applies to a case. The F1 score is used as the evaluation metric.
- 3. Prison Term Prediction: Predicts the prison sentence for a defendant. The model’s performance is assessed using log-distance.
- 4. Argument Mining: Detects arguments within legal documents and classifies them as pro or con. Accuracy is used to evaluate performance.
- 5. Dispute Focus Identification: Identifies the central dispute in a legal case. Evaluation is based on the F1 score of key issue extraction.
- 6. Issue Topic Identification: Classifies the topic of a legal case. Single-label classification metrics, including precision, recall, and F1 score, are used.
- 7. Legal Event Detection: Identifies and extracts legal events related to a case from legal documents, including event triggers, arguments, and event types. Evaluation follows standard Named Entity Recognition (NER) metrics.
- 8. Opinion Summarization: Summarizes judicial opinions. Performance is evaluated using ROUGE-L scores, comparing model-generated summaries with human-written ones.
- 9. Case Analysis: Analyzes cases to predict outcomes and relevance to other cases. Similarity retrieval and outcome prediction are evaluated using precision, recall, and F1 scores.
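
Two of the metrics named in the list above can be sketched briefly. The block assumes micro-averaged F1 over label sets for the multi-label tasks and a plain absolute log difference for prison terms measured in months; the averaging scheme and any normalization applied to the reported log-distance scores are not stated here, so both choices are assumptions.

```python
import math


def micro_f1(gold_sets, pred_sets):
    """Micro-averaged F1 over per-example label sets."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # labels predicted and correct
        fp += len(pred - gold)   # labels predicted but wrong
        fn += len(gold - pred)   # labels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def log_distance(gold_months, pred_months):
    """Absolute difference of log(term + 1); 0 means an exact match."""
    return abs(math.log(pred_months + 1) - math.log(gold_months + 1))


# Toy data (hypothetical charges and sentence lengths).
gold = [{"theft"}, {"fraud", "forgery"}]
pred = [{"theft"}, {"fraud"}]
f1 = micro_f1(gold, pred)
dist = log_distance(11, 23)
print(f1, dist)
```

In this raw form a smaller log-distance is better, so a benchmark that reports higher-is-better scores presumably applies a further transformation that is not reproduced here.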
- 1. Answer accuracy: This measures how correctly the model’s response addresses the legal question, ensuring the provided advice aligns with applicable laws and regulations.
- 2. Legal knowledge coverage: This assesses the breadth and depth of legal knowledge the model demonstrates in its response, considering the inclusion of relevant laws, precedents, and legal principles.
- 3. Reasonableness: This criterion evaluates whether the response is practical and realistic, ensuring the advice given could be logically applied in a real-world context.
4.3. Baselines
4.3.1. General-Purpose Chinese Large Models
- ChatGLM3-6B [22]: ChatGLM3 is the third-generation open-source bilingual conversational model jointly released by Zhipu AI and Tsinghua University. It inherits the advantages of smooth dialogue and low deployment requirements from previous versions, while making significant improvements in performance and functionality.
- LLaMa3-8B [43]: LLaMa3 is a large language model developed by Meta, designed for efficient and scalable performance in various natural language processing tasks. It excels in tasks like text generation, question answering, and translation, offering strong performance across multiple languages.
- InternLM3-8B [24]: InternLM3 is an open-source language model with 8 billion parameters, developed by the Shanghai AI Laboratory for general tasks and advanced reasoning.
- Qwen2.5-7B [44]: Qwen2.5-7B is a 7-billion-parameter large language model released by Alibaba, fine-tuned for instruction following and designed for general tasks and advanced reasoning.
- Baichuan2-7B [45]: Baichuan2-7B is the second-generation open-source large language model developed by Baichuan AI, with 7 billion parameters. It is trained on 2.6 trillion tokens of high-quality Chinese and English data and supports a context window of 4096 tokens.
4.3.2. Legal-Specific Large Models
- InternLM-Law [23]: InternLM-Law, based on InternLM, integrates continued pre-training and instruction tuning using both legal and general corpora to improve legal text comprehension and generation. This approach enables the model to perform effectively in legal tasks while retaining a broad understanding of general language.
- LexiLaw [21]: LexiLaw is an open-source large language model designed specifically for the Chinese legal field, built on the ChatGLM-6B architecture. It enhances legal consultation and case analysis performance through methods including LoRA, P-tuning-v2, and full-parameter fine-tuning, utilizing a diverse range of training data, including legal Q&A, regulations, legal documents, and general domain text.
- Lawyer-LLaMA [28]: This model is trained using a combination of legal domain data and general domain data, utilizing GPT-4 Turbo to build a high-quality legal dataset. The model undergoes supervised fine-tuning to enhance its legal reasoning capabilities, allowing it to effectively apply domain knowledge and handle various legal professional issues.
- FuziMingcha [31]: FuziMingcha is a Chinese legal large language model co-developed by Shandong University, Inspur Cloud, and China University of Political Science and Law. Built upon ChatGLM, the model has been trained on extensive Chinese legal corpora, including judgment documents and statutes, as well as supervised fine-tuning datasets, such as legal Q&A and case retrieval.
- Wisdom-Interrogatory [26]: The Wisdom-Interrogatory model, developed collaboratively by Zhejiang University, Alibaba DAMO Academy, and Huayi Institute of Computing, is a large-scale legal language model designed to enhance legal accessibility and judicial efficiency.
4.4. Implementation Details
5. Results and Analysis
- RQ1: Can COLLT improve the performance of existing LLMs on legal NLP tasks?
- RQ2: Can COLLT enhance the legal Q&A ability of existing LLMs?
- RQ3: Are the legal tools effective?
- RQ4: Does proactive clarification help the model better understand the case details?
5.1. Performance on Legal NLP Tasks
5.2. Performance on Free-Form Q&A
5.3. Legal Tool Evaluation
5.4. Clarification Mechanism Evaluation
5.5. Ablation Studies
5.5.1. Tool Ablation
5.5.2. Clarification Trigger Comparison
6. Discussion
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Prompt Template for ChatGPT-o3 Scoring


Appendix B. Detailed Setup of Legal Tool Evaluation
References
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, 1–5 May 2023.
- Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 2023, 36, 68539–68551.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
- Kumar, S.; Reddy, P.K.; Reddy, V.B.; Singh, A. Similarity analysis of legal judgments. In Proceedings of the Fourth Annual ACM Bangalore Conference; Association for Computing Machinery: New York, NY, USA, 2011; pp. 1–4.
- Quaresma, P.; Rodrigues, I. A question-answering system for Portuguese juridical documents. In Proceedings of the 10th International Conference on Artificial Intelligence and Law; Association for Computing Machinery: New York, NY, USA, 2005; pp. 256–257.
- Kim, M.Y.; Xu, Y.; Goebel, R. Applying a convolutional neural network to legal question answering. In Proceedings of the JSAI International Symposium on Artificial Intelligence; Springer: Cham, Switzerland, 2015; pp. 282–294.
- Xiao, G.; Mo, J.; Chow, E.; Chen, H.; Guo, J.; Gong, Z. Multi-Task CNN for classification of Chinese legal questions. In Proceedings of the 2017 IEEE 14th International Conference on e-Business Engineering (ICEBE); IEEE: Piscataway, NJ, USA, 2017; pp. 84–90.
- Collarana, D.; Heuss, T.; Lehmann, J.; Lytra, I.; Maheshwari, G.; Nedelchev, R.; Schmidt, T.; Trivedi, P. A question answering system on regulatory documents. In Legal Knowledge and Information Systems; IOS Press: Amsterdam, The Netherlands, 2018; pp. 41–50.
- Kim, M.Y.; Rabelo, J.; Babiker, H.K.B.; Rahman, M.A.; Goebel, R. Legal information retrieval and entailment using transformer-based approaches. Rev. Socionetwork Strateg. 2024, 18, 101–121.
- Büttner, M.; Habernal, I. Answering legal questions from laymen in German civil law system. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2015–2027.
- Zhang, N.N.; Xing, Y. Questions and answers on legal texts based on BERT-BiGRU. In Proceedings of the Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2021; Volume 1828, p. 012035.
- Hoppe, C.; Pelkmann, D.; Migenda, N.; Hötte, D.; Schenck, W. Towards intelligent legal advisors for document retrieval and question-answering in German legal documents. In Proceedings of the 2021 IEEE Fourth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE); IEEE: Piscataway, NJ, USA, 2021; pp. 29–32.
- Tieu, T.T.; Chau, C.N.; Nguyen, T.S.; Nguyen, L.M. Apply BERT-based models and domain knowledge for automated legal question answering tasks at ALQAC 2021. In Proceedings of the 2021 13th International Conference on Knowledge and Systems Engineering (KSE); IEEE: Piscataway, NJ, USA, 2021; pp. 1–6.
- Xiao, C.; Hu, X.; Liu, Z.; Tu, C.; Sun, M. Lawformer: A pre-trained language model for Chinese legal long documents. AI Open 2021, 2, 79–84.
- Zhong, H.; Xiao, C.; Tu, C.; Zhang, T.; Liu, Z.; Sun, M. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 5218–5230.
- Seo, M.; Kembhavi, A.; Farhadi, A.; Hajishirzi, H. Bi-directional attention flow for machine comprehension. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017.
- Wang, S.; Yu, M.; Chang, S.; Jiang, J. A co-matching model for multi-choice reading comprehension. In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2018.
- Zhang, W.; Shen, H.; Lei, T.; Wang, Q.; Peng, D.; Wang, X. GLQA: A generation-based method for legal question answering. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN); IEEE: Piscataway, NJ, USA, 2023; pp. 1–8.
- Joshi, S. Methods in Legal Contractual Content Generation. Ph.D. Thesis, International Institute of Information Technology, Hyderabad, India, 2023.
- Gupta, P.; Jiao, C.; Yeh, Y.T.; Mehri, S.; Eskenazi, M.; Bigham, J.P. InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 505–525.
- Li, H.; Ai, Q.; Dong, Q.; Liu, Y. LexiLaw: A Scalable Legal Language Model for Comprehensive Legal Understanding. 2024. Available online: https://github.com/CSHaitao/LexiLaw (accessed on 12 February 2026).
- Team GLM; Zeng, A.; Xu, B.; Wang, B.; Zhang, C.; Yin, D.; Zhang, D.; Rojas, D.; Feng, G.; Wang, Z.; et al. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv 2024, arXiv:2406.12793.
- Fei, Z.; Zhang, S.; Shen, X.; Zhu, D.; Wang, X.; Ge, J.; Ng, V. InternLM-Law: An Open-Sourced Chinese Legal Large Language Model. In Proceedings of the 31st International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 9376–9392.
- Cai, Z.; Cao, M.; Chen, H.; Chen, K.; Chen, K.; Chen, X.; Chen, X.; Chen, Z.; Chen, Z.; Chu, P.; et al. InternLM2 Technical Report. arXiv 2024, arXiv:2403.17297.
- He, W.; Wen, J.; Zhang, L.; Cheng, H.; Qin, B.; Li, Y.; Jiang, F.; Chen, J.; Wang, B.; Yang, M. HanFei-1.0. 2023. Available online: https://github.com/siat-nlp/HanFei (accessed on 12 February 2026).
- Wu, Y.; Liu, Y.; Liu, Y.; Li, A.; Zhou, S.; Kuang, K. wisdomInterrogatory. GitHub Repository. 2024. Available online: https://github.com/zhihaiLLM/wisdomInterrogatory (accessed on 12 February 2026).
- Louis, A.; van Dijck, G.; Spanakis, G. Interpretable long-form legal question answering with retrieval-augmented large language models. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2024; Volume 38, pp. 22266–22275.
- Yao, S.; Ke, Q.; Wang, Q.; Li, K.; Hu, J. Lawyer GPT: A legal large language model with enhanced domain knowledge and reasoning capabilities. In Proceedings of the 2024 3rd International Symposium on Robotics, Artificial Intelligence and Information Engineering; Association for Computing Machinery: New York, NY, USA, 2024; pp. 108–112.
- Wiratunga, N.; Abeyratne, R.; Jayawardena, L.; Martin, K.; Massie, S.; Nkisi-Orji, I.; Weerasinghe, R.; Liret, A.; Fleisch, B. CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering. In Proceedings of the International Conference on Case-Based Reasoning; Springer: Cham, Switzerland, 2024; pp. 445–460.
- Kalra, R.; Wu, Z.; Gulley, A.; Hilliard, A.; Guan, X.; Koshiyama, A.; Treleaven, P. HyPA-RAG: A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications. In Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 237–256.
- Wu, S.; Liu, Z.; Zhang, Z.; Chen, Z.; Deng, W.; Zhang, W.; Yang, J.; Yao, Z.; Lyu, Y.; Xin, X.; et al. fuzi.mingcha. GitHub Repository. 2023. Available online: https://github.com/irlab-sdu/fuzi.mingcha (accessed on 12 February 2026).
- Shu, C.; Zhang, H. Neural programming by example. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2017; Volume 31.
- Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Dal Lago, A.; et al. Competition-level code generation with AlphaCode. Science 2022, 378, 1092–1097.
- Zhang, D.; Zou, J.; Zhu, G. Multitool Drilling Path Optimization by Multiagent Reinforcement Learning Approach. IEEE Trans. Ind. Inform. 2025, 21, 6210–6219.
- Wang, C.; Luo, W.; Dong, S.; Xuan, X.; Li, Z.; Ma, L.; Gao, S. MLLM-Tool: A multimodal large language model for tool agent learning. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: Piscataway, NJ, USA, 2025; pp. 6678–6687.
- Yue, S.; Chen, W.; Wang, S.; Li, B.; Shen, C.; Liu, S.; Zhou, Y.; Xiao, Y.; Yun, S.; Huang, X.; et al. DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services. arXiv 2023, arXiv:2309.11325.
- Yue, S.; Liu, S.; Zhou, Y.; Shen, C.; Wang, S.; Xiao, Y.; Li, B.; Song, Y.; Shen, X.; Chen, W.; et al. LawLLM: Intelligent Legal System with Legal Reasoning and Verifiable Retrieval. In Proceedings of the International Conference on Database Systems for Advanced Applications; Springer: Singapore, 2024; pp. 304–321.
- Xiao, C.; Zhong, H.; Guo, Z.; Tu, C.; Liu, Z.; Sun, M.; Zhang, T.; Han, X.; Hu, Z.; Wang, H.; et al. CAIL2019-SCM: A dataset of similar case matching in legal domain. arXiv 2019, arXiv:1911.08962.
- Ma, Y.; Shao, Y.; Wu, Y.; Liu, Y.; Zhang, R.; Zhang, M.; Ma, S. LeCaRD: A legal case retrieval dataset for Chinese law system. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval; Association for Computing Machinery: New York, NY, USA, 2021; pp. 2342–2348.
- Zhang, H.; Pan, B.; Li, R. Legal judgment elements extraction approach with law article-aware mechanism. Trans. Asian-Low-Resour. Lang. Inf. Process. 2021, 21, 1–15.
- Gong, S.; Luo, X. DGGCCM: A hybrid neural model for legal event detection. Artif. Intell. Law 2025, 33, 1109–1149.
- Fei, Z.; Shen, X.; Zhu, D.; Zhou, F.; Han, Z.; Huang, A.; Zhang, S.; Chen, K.; Yin, Z.; Shen, Z.; et al. LawBench: Benchmarking Legal Knowledge of Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 7933–7962.
- Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 herd of models. arXiv 2024, arXiv:2407.21783.
- Qwen Team. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115.
- Baichuan. Baichuan 2: Open Large-scale Language Models. arXiv 2023, arXiv:2309.10305.
- Zheng, Y.; Zhang, R.; Zhang, J.; Ye, Y.; Luo, Z.; Feng, Z.; Ma, Y. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024.
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180.
- Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; Mann, G. BloombergGPT: A large language model for finance. arXiv 2023, arXiv:2303.17564.







| Two-Annotator Verdict | Samples | Proportion | Status |
|---|---|---|---|
| 2 Yes, 0 No | 9843 | 86.80% | Agreed (accept) |
| 0 Yes, 2 No | 749 | 6.60% | Agreed (reject) |
| 1 Yes, 1 No | 748 | 6.60% | Disagreement |
| Total | 11,340 | 100% | - |
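
The proportions in the two-annotator table can be rechecked from the raw counts, and the same counts give the raw observed agreement rate. The sketch below uses only the numbers in the table; the variable names are hypothetical, and no chance-corrected statistic is computed because the table does not say how the 748 disagreements split between the annotators.

```python
# Counts copied from the two-annotator verdict table.
counts = {"both_yes": 9843, "both_no": 749, "split": 748}

total = sum(counts.values())

# Percentage share of each verdict pattern, rounded as in the table.
proportions = {name: round(100 * n / total, 2) for name, n in counts.items()}

# Observed agreement: both annotators gave the same verdict.
observed_agreement = (counts["both_yes"] + counts["both_no"]) / total

print(total, proportions, round(observed_agreement, 4))
```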
| Task | Dataset | Metric |
|---|---|---|
| Legal Charge Prediction (LCP) | CAIL2018 https://github.com/china-ai-law-challenge/cail2018 (accessed on 5 February 2026) | F1 |
| Legal Article Prediction (LAP) | CAIL2018 https://github.com/china-ai-law-challenge/cail2018 (accessed on 18 February 2026) | F1 |
| Prison Term Prediction (PTP) | CAIL2018 https://github.com/china-ai-law-challenge/cail2018 (accessed on 11 March 2026) | log-distance |
| Argument Mining (AM) | CAIL2022 https://github.com/china-ai-law-challenge (accessed on 18 February 2026) | Accuracy |
| Dispute Focus Identification (DFI) | LAIC2021 https://laic.cjbdi.com/ (accessed on 18 February 2026) | F1 |
| Issue Topic Identification (ITI) | CrimeKgAssitant https://github.com/liuhuanyong/CrimeKgAssitant (accessed on 18 February 2026) | Accuracy |
| Legal Event Detection (LED) | LEVEN https://github.com/thunlp/LEVEN (accessed on 18 February 2026) | F1 |
| Opinion Summarization (OS) | CAIL2022 https://github.com/china-ai-law-challenge (accessed on 18 February 2026) | ROUGE-L |
| Case Analysis (CA) | JEC-QA https://github.com/thunlp/jec-qa (accessed on 18 February 2026) | Accuracy |
| Method | LCP (F1) | LAP (F1) | PTP (Log-Distance) | AM (Accuracy) | DFI (F1) | ITI (Accuracy) | LED (F1) | OS (ROUGE-L) | CA (Accuracy) |
|---|---|---|---|---|---|---|---|---|---|
| General LLM | |||||||||
| ChatGLM3-6B-Zero | 0.31 | 0.52 | 0.74 | 0.32 | 0.27 | 0.21 | 0.13 | 0.34 | 0.29 |
| ChatGLM3-6B-One | 0.33 | 0.55 | 0.74 | 0.30 | 0.30 | 0.23 | 0.12 | 0.32 | 0.27 |
| COLLT-GLM | 0.49 (↑ 0.16) | 0.62 (↑ 0.07) | 0.78 (↑ 0.04) | 0.40 (↑ 0.08) | 0.36 (↑ 0.06) | 0.29 (↑ 0.06) | 0.19 (↑ 0.06) | 0.47 (↑ 0.13) | 0.37 (↑ 0.08) |
| LLaMa3-8B-Zero | 0.21 | 0.30 | 0.51 | 0.12 | 0.51 | 0.15 | 0.11 | 0.28 | 0.24 |
| LLaMa3-8B-One | 0.24 | 0.32 | 0.50 | 0.11 | 0.52 | 0.17 | 0.12 | 0.26 | 0.20 |
| COLLT-LLaMa | 0.38 (↑ 0.14) | 0.37 (↑ 0.05) | 0.62 (↑ 0.11) | 0.20 (↑ 0.08) | 0.59 (↑ 0.07) | 0.21 (↑ 0.04) | 0.18 (↑ 0.06) | 0.33 (↑ 0.05) | 0.32 (↑ 0.08) |
| InternLM3-8B-Zero | 0.39 | 0.36 | 0.67 | 0.36 | 0.21 | 0.17 | 0.09 | 0.39 | 0.27 |
| InternLM3-8B-One | 0.43 | 0.39 | 0.68 | 0.35 | 0.21 | 0.19 | 0.11 | 0.38 | 0.26 |
| COLLT-InternLM | 0.54 (↑ 0.11) | 0.45 (↑ 0.06) | 0.72 (↑ 0.04) | 0.43 (↑ 0.07) | 0.23 (↑ 0.02) | 0.23 (↑ 0.04) | 0.15 (↑ 0.04) | 0.43 (↑ 0.04) | 0.30 (↑ 0.03) |
| Qwen2.5-7B-Zero | 0.50 | 0.72 | 0.81 | 0.32 | 0.37 | 0.28 | 0.12 | 0.41 | 0.24 |
| Qwen2.5-7B-One | 0.52 | 0.73 | 0.82 | 0.31 | 0.38 | 0.28 | 0.12 | 0.42 | 0.24 |
| COLLT-Qwen | 0.55 (↑ 0.03) | 0.74 (↑ 0.01) | 0.82 (↑ 0.00) | 0.37 (↑ 0.05) | 0.42 (↑ 0.04) | 0.35 (↑ 0.07) | 0.16 (↑ 0.04) | 0.50 (↑ 0.08) | 0.27 (↑ 0.03) |
| Baichuan2-7B-Zero | 0.43 | 0.27 | 0.65 | 0.19 | 0.19 | 0.11 | 0.08 | 0.24 | 0.19 |
| Baichuan2-7B-One | 0.44 | 0.29 | 0.65 | 0.18 | 0.19 | 0.12 | 0.10 | 0.22 | 0.18 |
| COLLT-Baichuan | 0.47 (↑ 0.03) | 0.34 (↑ 0.05) | 0.69 (↑ 0.04) | 0.20 (↑ 0.01) | 0.22 (↑ 0.03) | 0.15 (↑ 0.03) | 0.12 (↑ 0.02) | 0.26 (↑ 0.02) | 0.20 (↑ 0.01) |
| Legal LLM | |||||||||
| InternLM-Law | 0.40 | 0.29 | 0.54 | 0.17 | 0.10 | 0.10 | 0.05 | 0.33 | 0.05 |
| LexiLaw | 0.35 | 0.12 | 0.64 | 0.20 | 0.03 | 0.11 | 0.09 | 0.31 | 0.17 |
| Lawyer-LLaMa | 0.30 | 0.17 | 0.47 | 0.14 | 0.11 | 0.08 | 0.04 | 0.30 | 0.07 |
| Fuzi-Mingcha | 0.50 | 0.21 | 0.66 | 0.08 | 0.17 | 0.10 | 0.10 | 0.49 | 0.10 |
| Wisdom-Interrogatory | 0.32 | 0.31 | 0.67 | 0.12 | 0.08 | 0.08 | 0.08 | 0.31 | 0.14 |
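
The gains shown with ↑ in the results table appear to be measured against the better of the zero-shot and one-shot baselines for each task. This is a reading inferred from the numbers rather than a stated definition. The sketch below rechecks it for the ChatGLM3-6B rows, with values copied from the table and a hypothetical helper name.

```python
# Task order: LCP, LAP, PTP, AM, DFI, ITI, LED, OS, CA (as in the table header).
zero_shot = [0.31, 0.52, 0.74, 0.32, 0.27, 0.21, 0.13, 0.34, 0.29]
one_shot = [0.33, 0.55, 0.74, 0.30, 0.30, 0.23, 0.12, 0.32, 0.27]
collt = [0.49, 0.62, 0.78, 0.40, 0.36, 0.29, 0.19, 0.47, 0.37]


def gains_over_best_baseline(zero, one, tuned):
    """Per-task gain of the tuned model over max(zero-shot, one-shot)."""
    return [round(t - max(z, o), 2) for z, o, t in zip(zero, one, tuned)]


gains = gains_over_best_baseline(zero_shot, one_shot, collt)
print(gains)
```

The same check can be repeated for the other backbones by substituting their three rows.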
| Legal Tool | Metric | Result |
|---|---|---|
| Similar Case Retrieval | NDCG@5 | 77.51 ± 3.11 |
| Legal Article Search | Accuracy | 80.01 ± 2.44 |
| Legal Charge Prediction | Accuracy | 86.02 ± 3.51 |
| Legal Element Recognition | Accuracy | 77.43 ± 3.84 |
| Legal Event Detection | F1 | 83.25 ± 2.74 |
| Setting | LCP | LAP | DFI | ITI | LED | OS | CA | Mean |
|---|---|---|---|---|---|---|---|---|
| COLLT (no removal) | 0.486 | 0.504 | 0.364 | 0.246 | 0.160 | 0.398 | 0.292 | - |
| − SCR | 0.479 | 0.499 | 0.349 | 0.240 | 0.159 | 0.385 | 0.262 | |
| − LAS | 0.464 | 0.446 | 0.357 | 0.241 | 0.159 | 0.385 | 0.281 | |
| − LCP | 0.418 | 0.498 | 0.357 | 0.241 | 0.159 | 0.391 | 0.279 | |
| − LER | 0.480 | 0.498 | 0.342 | 0.215 | 0.153 | 0.391 | 0.285 | |
| − LED | 0.480 | 0.498 | 0.357 | 0.233 | 0.115 | 0.386 | 0.285 | |
| − NET | 0.485 | 0.498 | 0.363 | 0.241 | 0.160 | 0.391 | 0.285 | |
| − all tools | 0.442 | 0.406 | 0.327 | 0.205 | 0.115 | 0.355 | 0.252 | |
| Setting | Trigger-F1 |
|---|---|
| (a) base-vanilla (no system prompt, no tools) | 0.070 |
| (b) base-with-clarify (clarification system prompt, no tuning) | 0.476 |
| (c) COLLT-SFT (clarification learned via instruction tuning) | 0.814 |
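
A trigger-level F1 of the kind reported above can be computed by treating each dialogue turn as a binary decision: should the model ask for clarification or not. The sketch below assumes that framing, with "clarification needed" as the positive class; the exact labeling protocol behind the reported scores is not given here, and the function name and toy labels are hypothetical.

```python
def trigger_f1(gold, pred):
    """F1 for binary clarification triggers (True = clarification issued/needed)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: five turns, three of which genuinely need clarification.
gold = [True, True, False, False, True]
pred = [True, False, False, True, True]
score = trigger_f1(gold, pred)
print(round(score, 3))
```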
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yang, K.; Sun, J.; Wang, Z.; Xu, C. COLLT: A Multi-Task Optimization Framework for Clarification-Oriented Tool Learning in Legal Large Language Models. Mathematics 2026, 14, 1891. https://doi.org/10.3390/math14111891


