Article

COLLT: A Multi-Task Optimization Framework for Clarification-Oriented Tool Learning in Legal Large Language Models

1 Aulin College, Northeast Forestry University, Harbin 150040, China
2 College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
* Authors to whom correspondence should be addressed.
Mathematics 2026, 14(11), 1891; https://doi.org/10.3390/math14111891
Submission received: 26 April 2026 / Revised: 21 May 2026 / Accepted: 25 May 2026 / Published: 29 May 2026

Abstract

Tool-augmented Large Language Models (LLMs) have demonstrated remarkable capabilities in language understanding and generation across various domains, with notable progress in legal applications. However, existing legal LLMs still face two major challenges in legal question answering: (1) a persistent occurrence of hallucinations in generated legal advice and (2) limited effectiveness in handling ambiguous legal queries. To address both issues, this article introduces COLLT (Clarification-Oriented Legal Large language models enhanced by Tool learning), a legal LLM framework designed around a tool learning mechanism oriented to clarification. We model the clarification, tooling, and response workflow as a budgeted sequential decision. All components, including action selection, tool invocation, and response generation, are jointly optimized through multi-task instruction tuning. The COLLT framework first determines whether a user query contains ambiguity. If ambiguity is detected, the model actively guides the user to clarify their intent; if the query is clear, the model automatically invokes appropriate legal tools to improve the accuracy and reliability of its response. To support this mechanism, we propose an end-to-end training strategy enabling the model to learn how to invoke tools adaptively across various scenarios. Specifically, we construct an instruction-tuning dataset with action tags, tool tags, and optimized responses, and use it to train a series of COLLT models based on five Chinese foundation models: ChatGLM-6B, LLaMA-3-8B, InternLM-3-8B, Qwen-2.5-7B, and Baichuan-2-7B. Experimental results show that COLLT significantly outperforms baselines across nine standard legal NLP tasks. In a free-form QA evaluation based on 500 real-world legal consultation queries, COLLT achieves notable improvements in answer accuracy, legal knowledge coverage, and response appropriateness. 
Further visualization analysis reveals that COLLT consistently selects appropriate tools under similar intents, and statistical analysis indicates that multi-turn clarification interactions contribute to generating higher-quality responses.

1. Introduction

In recent years, tool-augmented large language models (LLMs) have demonstrated strong capabilities in invoking external tools to enhance generation quality, yet they struggle when user queries are underspecified or ambiguous. Consider a user asking “I need help with a contract.” Existing frameworks such as ReAct [1] and Toolformer [2] will immediately invoke a contract analysis tool without first determining which aspect of the contract requires assistance, whether it is drafting, reviewing, or interpreting specific clauses. This premature tool invocation often results in irrelevant outputs, as the model lacks a mechanism to assess whether the user’s intent is sufficiently clear. Similarly, retrieval-augmented generation (RAG) systems [3] retrieve relevant legal statutes for queries like “Can I break a lease?” but fail to identify why the user wants to break the lease (e.g., landlord breach vs. tenant relocation), leading to generic advice that does not address the user’s specific situation.
As illustrated in Figure 1, two distinct failure modes motivate our framework. Subfigure (a) contrasts one-shot generation against a clarification-enhanced interaction: when the user query is underspecified, a direct response is vague and unhelpful, whereas a clarification turn before generation yields accurate and personalized answers. Subfigure (b) contrasts pure parametric-knowledge generation against tool-invoked generation: the former hallucinates, while grounding the response in external legal tools produces reliable advice. These two subfigures expose a deeper structural issue. Current tool-augmented LLMs treat “invoke tool” and “clarify intent” as a single reflexive action, lacking a decision mechanism that tells them apart. A principled framework must learn when to clarify and when to invoke, not conflate the two.
Addressing this gap requires a meta-cognitive layer that can model information sufficiency: the model must learn to distinguish between ambiguous queries (where clarification is needed) and well-specified queries (where tools can be directly invoked). This decision cannot be made by a rule-based heuristic such as query length, as ambiguity is context-dependent and requires understanding the semantic completeness of user intent. Formally, this amounts to learning a policy over an extended action space $\mathcal{A}_{\mathrm{ext}} = \{A_{\mathrm{CLR}}\} \cup \{(A_{\mathrm{DRT}}, \tau) : \tau \subseteq \mathcal{T}\}$, where $A_{\mathrm{CLR}}$ triggers clarification and $A_{\mathrm{DRT}}$ triggers tool invocation from a toolbox $\mathcal{T}$.
We choose legal question answering as a concrete testbed for this framework for three reasons. (i) The legal domain is tool-rich: tasks such as case retrieval, statute lookup, and charge prediction are naturally decomposable into tool invocations. (ii) User queries in legal consultation settings are genuinely ambiguous: users without legal training often pose vague questions like “I signed a contract, and now I have a problem”. In practice, legal professionals routinely guide users to clarify these queries before providing advice. (iii) The cost of hallucination is high in legal advice, making factual grounding through external tools critical.
Motivated by these challenges, this article introduces COLLT, a framework that integrates a learned clarification mechanism with multi-tool invocation. COLLT formalizes the clarification–tool–response workflow as a sequential decision process and jointly trains the action selection, tool invocation, and response generation components through instruction tuning. To equip COLLT with the capability to effectively select actions and invoke tools, we propose an end-to-end instruction tuning strategy. A high-quality instruction tuning dataset is developed, comprising four key components: (1) action selection tags indicating whether clarification is required; (2) legal tool selection tags specifying which tools to use; (3) results produced by the selected tools; and (4) responses augmented with the corresponding tool outputs. We train COLLT on this dataset using a multi-task objective derived from the probabilistic factorization in Equation (7), enabling the model to jointly optimize action selection, tool invocation, and response generation.
Experimental results demonstrate that COLLT outperforms existing baseline models across nine standard legal NLP tasks: legal charge prediction, legal article prediction, prison term prediction, argument mining, dispute focus identification, issue topic identification, legal event detection, opinion summarization, and case analysis. To further validate its practical utility, we conduct a free-form question answering evaluation using a dataset of 500 real-world legal consultation queries collected from online legal forums. The evaluation results show that COLLT achieves gains in answer accuracy, legal knowledge coverage, and contextual appropriateness compared to competitive baselines. In addition, a visualization-based tool selection analysis shows that COLLT consistently chooses the most relevant legal tools when handling queries with similar intents, reflecting its capacity for intent understanding and tool utilization. Quantitative analysis reveals a correlation between multi-turn clarification interactions and response quality, suggesting that the clarification mechanism contributes to generating more precise and personalized answers.

2. Related Work

2.1. Legal Question Answering

Legal question answering (LQA) is one of the most representative applications of artificial intelligence in the legal domain. It aims to enable machines to understand and answer complex legal questions, requiring legal knowledge, language comprehension, and reasoning capabilities. Current LQA methods can be broadly categorized into three types: (1) retrieval-based approaches, (2) machine reading comprehension (MRC)-based approaches, and (3) generation-based approaches.
Retrieval-based methods follow a standard pipeline: given a query, the system retrieves relevant answers from a predefined database. The core lies in designing effective similarity measures between the query and candidate answers [4,5]. Notable works include models based on convolutional neural networks [6,7], recurrent neural networks [8], and attention mechanisms [9,10]. With the advent of pre-trained language models, researchers have developed retrieval-based LQA models leveraging BERT. For example, Wang et al. [11] proposed a framework combining BERT with BiGRU; Hoppe et al. [12] built a German LQA system integrating BERT and BM25; Tieu et al. [13] injected legal knowledge into BERT’s reasoning process to enhance retrieval accuracy. These methods are reliable and verifiable, making them suitable for legal applications. However, they lack flexibility—if no candidate answers match the query, only suboptimal results can be returned.
MRC-based methods aim to answer questions based on given supporting texts. Lawformer [14], proposed by Xiao et al., was pre-trained on large-scale Chinese legal corpora using a sparse attention mechanism and achieved an F1 score of 74.28% on a legal MRC dataset. Zhong et al. [15] evaluated several general MRC models, such as BiDAF, Co-matching, and HAF, showing reasonable performance, though still behind models trained on domain-specific data. MRC approaches often rely on interaction-based attention to model the relevance between the query and text segments [16,17]. However, they tend to perform shallow analysis and lack the ability to integrate broader legal knowledge and reasoning, limiting their effectiveness in complex legal scenarios.
Generation-based methods are the most challenging. They require models to produce contextually relevant, logically coherent, and personalized answers, even when such answers do not exist in the database. Despite the difficulty, this approach aligns most closely with the ultimate goal of LQA. Zhang et al. [18] proposed a two-stage method: retrieve relevant statutes, then generate an answer based on the query and statutes. Huang et al. [19] introduced CoLMQA, which uses slot-filling to ensure logical consistency by first generating a template with slots (e.g., law names, article numbers) and then filling them using a Transformer-based key-value network. Joshi et al. [20] studied the alignment between user intent and legal articles, proposing a BERT-based contextual modeling approach to generate relevant legal provisions via keyword guidance.
Recently, large language models (LLMs) have significantly advanced generation-based LQA. With billions of parameters and training on massive text corpora, LLMs demonstrate strong capabilities in semantic understanding and text generation. Many LQA methods based on generative LLMs have emerged, which will be discussed in detail in Section 2.2.

2.2. Legal Large Language Model

In the legal field, domain-specific legal LLMs have gradually emerged. As these models evolve, researchers have explored various strategies to improve their understanding of legal language and their ability to handle legal tasks. For example, LexiLaw [21] fine-tuned ChatGLM2-6B [22] using data from judicial exams and legal Q&A, enabling more accurate instruction-following in legal queries. InternLM-Law [23], built on InternLM [24], combines continued pre-training and instruction tuning with both legal and general corpora to enhance legal text comprehension and generation. HanFei [25] leveraged a larger 60 GB legal instruction dataset to significantly improve performance in legal Q&A tasks. WisdomInterrogatory [26], though using a smaller and more specialized dataset, achieved strong results in specific legal tasks.
To improve accuracy and reduce hallucinations, prior work has incorporated external knowledge and interaction in different forms. A widely adopted direction is retrieval-augmented generation, which retrieves relevant statutes or evidence from external repositories to support the answer generation [27,28,29,30]. Another line of work explores multi-turn legal dialogue and reasoning data to enhance interactive capabilities in consultation-like settings [31].
While these advances are valuable, they do not fully address the challenges posed by vague queries from non-expert users. Existing multi-turn settings are often designed around relatively well-formed user intents, where interaction mainly serves to follow a predefined dialogue flow rather than to actively detect ambiguity, collect missing details, pinpoint unclear aspects, and guide users to clarify their true intent. As a result, such systems can remain largely reactive and may still misinterpret underspecified inputs.
Moreover, although retrieval-based methods alleviate the limitations of purely parametric knowledge, retrieval is commonly treated as an external, loosely coupled component and is not learned jointly with the model’s action selection and tool invocation behaviors. This lack of a systematic design for selecting, executing, and integrating tools limits both the specificity and efficiency of tool use.
We argue that legal LLMs and specialized tools are complementary rather than mutually exclusive. In contrast to prior work that focuses on retrieval or generic interaction separately, we propose a clarification-oriented framework that (i) explicitly decides whether clarification is needed and (ii) learns to invoke appropriate legal tools and integrate their outputs end-to-end, enabling a transition from reactive answering to proactive intent clarification and tool-assisted reasoning.

2.3. Tool Learning

Tool learning has emerged as a key direction in recent research on LLMs, aiming to equip LLMs with the ability to invoke external tools for solving complex tasks.
Early work on tool use focused primarily on structured tasks and program generation. For instance, Shu et al. [32] introduced a neural programmer model, which mapped natural language to executable programs for database operations. Li et al. [33] proposed modular neural networks, allowing language inputs to dynamically compose multiple “subprograms” to complete complex tasks. However, these approaches typically relied on large amounts of supervised data and lacked generalization capability, making them difficult to scale to open-ended tasks.
With the rapid advancement of LLMs, researchers began integrating pre-trained LLMs with external tools. One representative work is the ReAct framework [1] proposed by Yao et al., which incorporates a “reasoning–action–observation” loop into the generation process, enabling the model to make decisions based on current states and adjust its reasoning path through environment feedback. ReAct significantly improved performance on complex reasoning tasks, such as HotpotQA and multi-hop question answering, and inspired subsequent studies combining Chain-of-Thought prompting with tool invocation. Further progress was made by methods like Toolformer [2], which introduced automatically generated tool-use annotations during pretraining. This enabled the model to acquire tool-usage capabilities before fine-tuning.
Another line of research focuses on multi-tool composition and orchestration. Frameworks such as MultiTool-Agent [34,35] incorporate planning modules and tool-selection strategies, allowing the model to dynamically coordinate multiple heterogeneous tools (e.g., calculators, code interpreters, retrievers) when handling multi-task or cross-domain problems. These methods often employ prompt engineering or reinforcement learning to optimize scheduling policies, though issues of stability and error recovery remain open challenges.
Although the combination of LLMs and tool learning has shown clear benefits in general domains, its application to the legal domain remains largely unexplored. In particular, there is a lack of systematic integration of tool learning into legal QA systems to support sub-tasks such as statute retrieval, legal interpretation, or case comparison. Given the reliance of legal QA on factual reasoning and domain-specific norms, tool learning holds promise for improving the model’s ability to handle legal problems.

2.4. Distinguishing Features and Main Contributions

COLLT shares with Toolformer [2] the idea of triggering tools through generated tags, and it is also related to ReAct [1], which interleaves reasoning and actions after receiving a user query. The key difference is that Toolformer and ReAct generally start from the question of how to answer the current query, whereas COLLT first asks whether the query is sufficiently specified to be answered. This shift is central to our mechanism. Instead of treating clarification as an external prompt or a post-processing step, COLLT incorporates it into the model’s decision process. The innovation is reflected in three aspects: the action design, the joint training objective, and the bounded tool-invocation mechanism.
First, COLLT makes clarification a learnable action. The model chooses between asking the user to clarify ($A_{\mathrm{CLR}}$) and answering directly with tool support ($A_{\mathrm{DRT}}$). This design is important for legal consultation, where users often omit facts that determine the applicable rule, claim, or remedy. If an underspecified query is sent directly to a retrieval tool, the returned material may be only loosely related to the user’s actual situation and may mislead the final answer. By learning when to clarify before invoking tools, COLLT reduces this risk and gives the model a more practical interaction pattern for legal question answering. The ablation results in Section 5.5 further show that supervised training on this action boundary is more reliable than relying on prompt-level instructions alone.
Second, COLLT trains action selection, tool invocation, and response generation within one model. The three heads share the Transformer backbone $\theta$ and are optimized through the multi-task loss $\mathcal{L}_{\mathrm{action}} + \lambda \mathcal{L}_{\mathrm{tool}} + \mu \mathcal{L}_{\mathrm{resp}}$, which follows from the chain-rule factorization of the joint distribution. This provides explicit supervision not only for generating answers, but also for deciding whether to use tools and which tools should be used. In this respect, COLLT differs from Toolformer’s perplexity-filtered self-supervision and from ReAct-style in-context trajectories, where the decision boundary between clarification, tool use, and final response is not directly learned.
Third, COLLT constrains tool use with a cardinality budget $|\tau| \le 2$. This mechanism reduces the per-turn search space from $O(2^{|\mathcal{T}|})$ to $O(|\mathcal{T}|^2)$ and keeps inference manageable as the toolbox grows. More importantly, it encourages selective tool use rather than uncontrolled tool chaining, which is especially useful when the original query lacks sufficient legal detail.
Building on these distinctions, the main contributions of this work are summarized as follows:
1. We propose COLLT, the first framework to integrate legal LLMs, proactive clarification, and tool learning. It addresses ambiguous or inappropriate queries from users lacking legal expertise, significantly reducing hallucinations in legal question-answering tasks.
2. We design an end-to-end learning strategy that jointly trains action selection, tool invocation, and response generation, effectively leveraging the reasoning capabilities of legal LLMs.
3. We conduct a comprehensive evaluation across recent mainstream Chinese legal LLMs, including ablation studies and a clarification benchmark.
4. We introduce a legal toolbox comprising six fine-tuned tools and release the COLLT dataset, which includes tool-training data, a multi-turn instruction-tuning corpus, and 500 real-world multi-turn legal consultations.

3. Methodology

This section introduces COLLT. We first fix notation and derive the three-task training loss as a chain-rule factorization of the joint distribution (Section 3.1). We then describe the concrete instantiation, including the token-stream inference procedure and the end-to-end training pipeline.

3.1. Mathematical Formulation of the COLLT Framework

COLLT is trained with supervised instruction tuning, not reinforcement learning. The formalism in this subsection is used for two narrow purposes: to derive the three-task loss in Equation (10) as a chain-rule factorization of the joint distribution rather than as a heuristically weighted sum, and to expose a budget-induced reduction of the per-turn search space from $O(2^{|\mathcal{T}|})$ to $O(|\mathcal{T}|^2)$. Readers who want only the practical algorithm can skip to Algorithm 1.

3.1.1. Notation

Let $\mathcal{V}$ be a finite vocabulary, $\mathcal{Q} \subseteq \mathcal{V}^*$ the user-query space, and $\mathcal{R} \subseteq \mathcal{V}^*$ the response space. The dialogue history accumulated before the $t$-th turn is $H_t = [(q_1, r_1), \ldots, (q_{t-1}, r_{t-1})]$, and the state at turn $t$ is $s_t = (q_t, H_t)$. The two top-level actions are $\mathcal{A} = \{A_{\mathrm{DRT}}, A_{\mathrm{CLR}}\}$: $A_{\mathrm{DRT}}$ commits the model to a direct, possibly tool-augmented answer, and $A_{\mathrm{CLR}}$ commits it to asking the user for clarification. The legal toolbox is
$$\mathcal{T} = \{T_{\mathrm{SCR}}, T_{\mathrm{LAS}}, T_{\mathrm{LCP}}, T_{\mathrm{LER}}, T_{\mathrm{LED}}, T_{\mathrm{Web}}\}. \tag{1}$$
The model is parameterized by $\theta$, which covers a shared Transformer backbone and three task-specific output heads: a 2-way action head over $\mathcal{A}$, an autoregressive 6-way tool head over $\mathcal{T}$, and a next-token response head over $\mathcal{V}$. Ambiguity detection is not a separate module but a derived quantity
$$f_{\mathrm{ambig}}(q, H) = \mathbb{1}\left[\arg\max_{A \in \mathcal{A}} P(A \mid q, H; \theta) = A_{\mathrm{CLR}}\right], \tag{2}$$
so a query is treated as ambiguous exactly when the action head’s softmax places more mass on $A_{\mathrm{CLR}}$ than on $A_{\mathrm{DRT}}$.
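As a minimal sketch of how this derived quantity could be computed, assume the action head exposes two raw logits. The names `softmax` and `is_ambiguous` and the index convention for the two actions are illustrative assumptions, not part of the COLLT implementation.

```python
import math

# Hypothetical index convention for the 2-way action head (assumption).
A_DRT, A_CLR = 0, 1

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_ambiguous(action_logits):
    """True when the action head puts more mass on A_CLR than on A_DRT."""
    probs = softmax(action_logits)
    return probs[A_CLR] > probs[A_DRT]
```

Under this sketch a tie is resolved toward a direct answer, since the comparison is strict; the source does not say how ties are handled.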

3.1.2. Sequential Decision View and the Budget Constraint

The clarification–tool–response loop can be written as a finite-horizon Markov Decision Process. We state it as a Definition mostly to fix notation that the rest of the section refers back to; supervision is by instruction tuning, and we do not solve the MDP with reinforcement learning.
Definition 1 (COLLT as an MDP). We model the clarification-oriented question answering process as a finite-horizon MDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}_{\mathrm{ext}}, P_{\mathrm{trans}}, R)$. The state space is
$$\mathcal{S} = \mathcal{Q} \times \mathcal{H}, \tag{3}$$
with elements $s_t = (q_t, H_t)$. The extended action space is
$$\mathcal{A}_{\mathrm{ext}} = \{A_{\mathrm{CLR}}\} \cup \left\{(A_{\mathrm{DRT}}, \tau) : \tau \subseteq \mathcal{T},\ |\tau| \le 2\right\}, \tag{4}$$
encoding the cardinality budget $|\tau| \le 2$ directly in the action space rather than as a runtime filter. The transition $P_{\mathrm{trans}}$ is deterministic: $A_{\mathrm{CLR}}$ appends a clarification turn to $H_t$ and $(A_{\mathrm{DRT}}, \tau)$ appends a tool-enhanced response turn. The reward $R$ is the terminal response-quality signal; no discounting is needed because the process terminates after each query is resolved.
The top-level action is selected by
$$A_t^X = \arg\max_{A^X \in \{A_{\mathrm{DRT}}, A_{\mathrm{CLR}}\}} P(A^X \mid q_t, H_t; \theta), \tag{5}$$
and when $A_t^X = A_{\mathrm{DRT}}$, the tool head is queried autoregressively. Even though the MDP is descriptive rather than optimized by RL, encoding the budget in $\mathcal{A}_{\mathrm{ext}}$ pays off in the per-turn search space.
Proposition 1 (Complexity Reduction via the Budget Constraint). The budget $|\tau| \le 2$ restricts the extended action space to size
$$|\mathcal{A}_{\mathrm{ext}}| = 1 + \binom{|\mathcal{T}|}{0} + \binom{|\mathcal{T}|}{1} + \binom{|\mathcal{T}|}{2} = O(|\mathcal{T}|^2), \tag{6}$$
reducing the per-turn search space from $O(2^{|\mathcal{T}|})$ to quadratic in the toolbox size.
This is a direct counting argument. For the deployed value $|\mathcal{T}| = 6$ the budget gives $1 + 1 + 6 + 15 = 23$ distinct extended actions instead of $2^6 = 64$ tool subsets, which is what makes exhaustive search over $\mathcal{A}_{\mathrm{ext}}$ tractable at inference time.
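The count can be checked by brute-force enumeration. The sketch below treats each tool set $\tau$ as an unordered subset, as in the proposition; the short tool labels and the function name `extended_actions` are illustrative assumptions.

```python
from itertools import combinations
from math import comb

# Short labels for the six tools in the toolbox (labels are illustrative).
TOOLS = ["SCR", "LAS", "LCP", "LER", "LED", "Web"]

def extended_actions(tools, budget=2):
    """Enumerate A_ext: one clarification action plus (DRT, tau)
    for every tool subset tau with |tau| <= budget."""
    actions = [("CLR", ())]
    for k in range(budget + 1):
        for tau in combinations(tools, k):
            actions.append(("DRT", tau))
    return actions

n_budgeted = len(extended_actions(TOOLS))
n_closed_form = 1 + sum(comb(len(TOOLS), k) for k in range(3))
n_unbudgeted_subsets = 2 ** len(TOOLS)
print(n_budgeted, n_closed_form, n_unbudgeted_subsets)  # prints: 23 23 64
```

Because the tool head emits tags autoregressively, an implementation that distinguishes invocation order would count ordered pairs instead; the quadratic bound still holds in that case.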

3.1.3. Three-Task Loss as a Chain-Rule Factorization

The model assigns a joint probability over the action, the tool sequence, the parsed tool outputs, and the response. Tool outputs $g_t^{X_k}$ are deterministic functions of $T_t^{X_k}$ and $s_t$, so $P(g \mid T, s)$ is a Dirac mass that contributes zero to the log-likelihood and can be dropped. Applying the chain rule then gives
$$P(\cdot \mid s_t; \theta) = \underbrace{P(A_t^X \mid s_t; \theta)}_{\text{action head}} \cdot \underbrace{\prod_{k=1}^{K_t} P(T_t^{X_k} \mid s_t, T_t^{X_{<k}}; \theta)}_{\text{tool head}} \cdot \underbrace{P(r_t \mid s_t, \{g_t^{X_k}\}; \theta)}_{\text{response head}}, \tag{7}$$
where $s_t = (q_t, H_t)$ from Equation (3) and $K_t \in \{0, 1, 2\}$ is the number of tools invoked at turn $t$. The single-tool selection and the input-to-retrieval mapping that the tool head implements are
$$T_t^X = \arg\max_{T^X \in \mathcal{T}} P(T^X \mid q_t, H_t; \theta), \tag{8}$$
$$T_t^X : (q_t, H_t) \mapsto g_t. \tag{9}$$
All three heads share the Transformer backbone $\theta$ and differ only in their final projection layers.
We train COLLT by minimizing the negative log-likelihood of Equation (7) over the instruction-tuning corpus. Expanding the negative log gives
$$\mathcal{L}_{\mathrm{total}}(\theta) = \mathcal{L}_{\mathrm{action}}(\theta) + \lambda \mathcal{L}_{\mathrm{tool}}(\theta) + \mu \mathcal{L}_{\mathrm{resp}}(\theta), \tag{10}$$
where $\mathcal{L}_{\mathrm{action}} = -\log P(A_t^X \mid s_t; \theta)$ is the action-classification cross-entropy, $\mathcal{L}_{\mathrm{tool}} = -\sum_k \log P(T_t^{X_k} \mid s_t, T_t^{X_{<k}}; \theta)$ is the autoregressive tool-tag NLL, and $\mathcal{L}_{\mathrm{resp}} = -\sum_j \log P(r_{t,j} \mid r_{t,<j}, s_t, \{g_t^{X_k}\}; \theta)$ is the token-level response NLL. The three components are not a heuristic mixture but a direct consequence of the factorization in Equation (7). In our experiments we set $\lambda = \mu = 1$, which recovers the joint training objective
$$\mathcal{L} = -\sum_{n} \log P_{\mathrm{LM}}\left(A_t^X, T_t^{X_1}, g_t^{X_1}, T_t^{X_2}, g_t^{X_2}, r_t \mid q_t, H_t\right) \tag{11}$$
that the model is supervised on; the ( λ , μ ) sensitivity analysis is left to future work.
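The relation between the weighted loss and the joint likelihood can be illustrated numerically. The sketch below assumes we already have the probabilities that each head assigns to the gold labels of one training instance; the function names and the example numbers are hypothetical and do not come from the paper.

```python
import math

def nll(probabilities):
    """Negative log-likelihood of a sequence of gold-label probabilities."""
    return -sum(math.log(p) for p in probabilities)

def collt_loss(p_action, p_tools, p_response_tokens, lam=1.0, mu=1.0):
    """L_total = L_action + lam * L_tool + mu * L_resp for one instance.

    p_action: probability the action head gives the gold action.
    p_tools: probabilities the tool head gives each gold tool tag (0 to 2 entries).
    p_response_tokens: probabilities of each gold response token.
    """
    l_action = nll([p_action])
    l_tool = nll(p_tools)
    l_resp = nll(p_response_tokens)
    return l_action + lam * l_tool + mu * l_resp

# Illustrative numbers only (not from the paper).
p_action, p_tools, p_resp = 0.9, [0.8, 0.7], [0.6, 0.5, 0.9]
total = collt_loss(p_action, p_tools, p_resp)
joint = p_action * math.prod(p_tools) * math.prod(p_resp)
```

With `lam = mu = 1`, `total` agrees with `-math.log(joint)` up to floating-point error, which is the chain-rule identity the text relies on; other weights rescale the tool and response terms and no longer correspond to a single joint likelihood.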

3.1.4. Inference Procedure

At inference time, Algorithm 1 executes the policy $\pi_\theta$: select the top-level action and generate a clarification if needed; otherwise, stream tokens and invoke tools as their head tags appear, subject to the budget $B = 2$. The algorithm is what the model actually runs; the MDP language above names it but is not a separate optimization procedure.
Algorithm 1: COLLT inference (pseudocode provided as an image in the original article)
Per-turn time complexity is $O(|\mathcal{V}| \cdot L_{\mathrm{resp}})$ for response generation plus $O(K \cdot c_{\mathrm{tool}})$ for at most $K \le 2$ tool calls of cost $c_{\mathrm{tool}}$ each. Space complexity matches standard instruction tuning since the three heads share the backbone $\theta$ and add only projection layers. The concrete implementation, including the indicative-token format (Table 1) and the per-tool training procedures (Section 4.1), instantiates this procedure in the experimental system below.
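The control flow described in prose above can be sketched as follows. This is not a transcription of Algorithm 1: the token-stream parser is abstracted into a `next_tool_tag` callable, and every function, tool label, and toy rule below is a hypothetical stand-in for a learned component.

```python
def collt_turn(query, history, select_action, clarify, next_tool_tag,
               respond, tools, budget=2):
    """Run one turn: clarify, or answer with at most `budget` tool calls."""
    if select_action(query, history) == "CLR":
        return "CLR", clarify(query, history), []
    tool_outputs = []  # list of (tool name, parsed output)
    for _ in range(budget):
        tag = next_tool_tag(query, history, tool_outputs)
        if tag is None:  # the model emitted no further tool tag
            break
        tool_outputs.append((tag, tools[tag](query)))
    used = [name for name, _ in tool_outputs]
    return "DRT", respond(query, history, tool_outputs), used


# Toy stand-ins for the learned heads and the tools (all hypothetical).
def toy_select_action(query, history):
    return "DRT" if "Civil Code" in query else "CLR"

def toy_clarify(query, history):
    return "What kind of issue has come up with the contract?"

def toy_next_tool_tag(query, history, tool_outputs):
    wanted = ["LAS", "SCR", "Web"]  # asks for three tools; the budget allows two
    return wanted[len(tool_outputs)] if len(tool_outputs) < len(wanted) else None

def toy_respond(query, history, tool_outputs):
    return "Answer grounded in: " + ", ".join(name for name, _ in tool_outputs)

toy_tools = {name: (lambda query, n=name: f"{n} output for: {query}")
             for name in ["LAS", "SCR", "Web"]}

heads = (toy_select_action, toy_clarify, toy_next_tool_tag, toy_respond, toy_tools)
clear = collt_turn(
    "What are the conditions for contract invalidity in the Civil Code?", [], *heads)
vague = collt_turn(
    "What should I do if the contract I signed has issues?", [], *heads)
```

Here `clear` takes the direct-response branch and stops after two tool calls even though the toy tag generator would request a third, while `vague` returns a clarification question and calls no tools. Bounding the loop by `budget` is one simple way to enforce the cap; the actual system may enforce it differently.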

3.2. Concrete Instantiation

We now describe the concrete instantiation of the formulation above. The action selection mechanism (Equation (5)), tool selection procedure (Equations (1) and (8)), and training objective (Equation (11)) are realized through a token-stream parsing architecture with indicative tags. Algorithm 1 formalizes the complete policy execution; the token-stream implementation details follow below.
The inference process of COLLT is illustrated in Figure 2a,b. Upon receiving the user’s current query $q$ and the dialogue history $H$, the model selects an appropriate action based on the current conversational context. It can either choose the clarification action $A_{\mathrm{CLR}}$, which prompts the user to further specify their intent, or the direct response action $A_{\mathrm{DRT}}$, which indicates that the model is ready to generate a reply without additional clarification.
If the model selects $A_{\mathrm{CLR}}$, it leverages its learned legal knowledge to guide the user in supplementing, clarifying, or rephrasing their input, as shown in Figure 2a. This step is particularly critical in legal dialogue, where language tends to be highly technical and precise; ambiguous expressions from the user may significantly impair the model’s ability to understand and respond accurately.
If the model selects $A_{\mathrm{DRT}}$, it proceeds, as shown in Figure 2b, to invoke suitable tools from a predefined legal toolbox $\mathcal{T}$ to enhance the legal soundness and depth of the generated response. The toolbox comprises six functional tools: (1) similar case retrieval, (2) legal article search, (3) legal charge prediction, (4) legal element recognition, (5) legal event detection, and (6) internet search. These tools provide legal justifications, case-based support, or contextual knowledge relevant to the query, thereby improving the professionalism and accuracy of the response. The model is restricted to invoking at most two legal tools per inference round.
This two-tool cap is a deliberate methodological choice rather than a runtime shortcut, and three considerations motivate it. In our annotated corpus the great majority of legal queries are resolved by one or two tool calls, so a higher budget would add little supervision signal while requiring experts to label additional tool slots for every training instance for the joint objective in Equation (11) to remain well posed. Tool invocation is also not free at the token level: each tool returns hundreds to thousands of tokens of retrieved text that compete with the original query for attention. In the one-shot baseline, where base models are given the full toolbox without instruction tuning, scores regress below their zero-shot counterparts on AM, LED, PTP, and ITI, indicating that uncontrolled tool stacking can dilute the query signal rather than reinforce it. The tool ablation in Section 5.5 confirms the converse direction: each retained tool contributes mainly on the task it is designed to support, with off-target columns moving by less than one point, so two well-chosen tools recover most of the available marginal gain. The cap is encoded as the parameter $|\tau| \le 2$ in Equation (4) rather than as a hard architectural assumption, so domains that genuinely require wider tool composition can relax it without altering the rest of the framework.
Based on the parsed results returned by the tools and the semantic information of the user’s query, the model generates an enhanced legal response. This response aims to maintain contextual relevance while ensuring legal accuracy and practical utility, thereby better aligning with the user’s intent and fulfilling their legal information needs.
Subsequent sections detail the design of the action selection mechanism, the functionality of the legal tools, and the process and implementation of enhanced response generation.

3.2.1. Action Selection

At each dialogue turn, the model must select one of two possible actions based on the user’s current query $q_t$ and the dialogue history $H_t = \{q_1, q_2, \ldots, q_{t-1}\}$. The first action, denoted as $A_{\mathrm{DRT}}$, corresponds to directly responding to the query, while the second, $A_{\mathrm{CLR}}$, involves prompting the user to clarify their intent.
The model selects $A_{\mathrm{DRT}}$ when the user’s intent is clear and the query is well-structured, for example, queries such as “What are the conditions for contract invalidity in the Civil Code?” or “What is the sentencing standard for intentional injury?”. In contrast, when the query is vague, ambiguous, or lacks essential legal details, the model opts for $A_{\mathrm{CLR}}$, initiating a clarification step to guide the user in refining their input. Typical examples of such queries include “What should I do if the contract I signed has issues?” or “Can I sue someone who hit me?”.
We formalize the action selection process using Equation (5), where $A_t^X$ denotes the selected action at turn $t$, and $\theta$ represents the model parameters. The probability distribution $P(A^X \mid q_t, H_t; \theta)$ is computed by the model.
Through this design, COLLT can dynamically determine whether clarification is needed before proceeding, and make informed decisions about whether to invoke legal tools or directly generate a response. The action selection mechanism serves as the foundation for subsequent reasoning steps, including tool selection and response generation.
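The decision rule above can be sketched as picking the more probable of the two action tags. This is a minimal illustration only: the two input scores, the softmax normalization, and the function name `select_action` are assumptions made here, whereas in COLLT the distribution is produced by the instruction-tuned LLM itself.

```python
import math

ACTIONS = ("<DRT>", "<CLR>")  # direct response vs. clarification (Table 1)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_action(logits):
    """Return (action, probability) for the most probable action.

    `logits` is a hypothetical pair of scores, one per entry in ACTIONS.
    """
    probs = softmax(logits)
    best = max(range(len(ACTIONS)), key=lambda i: probs[i])
    return ACTIONS[best], probs[best]
```

A clearly specified query would correspond to a higher score for `<DRT>`, and an underspecified one to a higher score for `<CLR>`.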

3.2.2. Legal Tool Selection

To enhance the model’s response quality in legal tasks, we use the set of six specialized legal tools $\mathcal{T}$ defined in Equation (1). Each tool is specifically designed to address typical needs in legal consultation scenarios. By incorporating structured intermediate results from these tools, the model can generate more targeted and legally coherent responses. The functionalities of these tools are detailed as follows:
  • Similar Case Retrieval ($T_{SCR}$): This tool retrieves historical cases that are semantically similar to the input query. To implement this functionality, we fine-tuned a similar case retrieval model based on the Lawformer [14] architecture. During the retrieval phase, the trained Lawformer model encodes both the query and candidate case texts into vector representations, upon which a nearest-neighbor search is conducted in the embedding space. The model is trained on the case-retrieval portion of the COLLT dataset (Section 3.3), which packages DISC-Law-SFT [36,37], CAIL2019 [38] and LeCard [39] with the leakage controls and near-duplicate filtering described in Section 4.1. Detailed training procedures for this tool are provided in Section 4.1.1.
  • Legal Article Search ($T_{LAS}$): This tool searches a predefined legal knowledge base for legal provisions, regulations, and judicial interpretations relevant to the user query. As with the SCR tool, we fine-tuned the Lawformer pre-trained model specifically for the legal article search task, enabling it to function as a dedicated statute search tool. The training process used the article-search portion of the COLLT dataset, with items similar to LawBench LAP test instances removed prior to fine-tuning. The constructed legal knowledge base includes the Criminal Law of the People’s Republic of China, the Civil Code of the People’s Republic of China, the Criminal Procedure Law, the Civil Procedure Law, as well as authoritative judicial interpretations collected from http://www.legalai.cn/. Detailed training procedures for this tool are provided in Section 4.1.2.
  • Legal Charge Prediction ($T_{LCP}$): This tool predicts potential legal charges based on the case facts described by the user. We fine-tuned a pre-trained Lawformer model for the legal charge prediction task using a multi-label classification framework to accommodate the common scenario of multiple charges in judicial practice. The model was trained on the charge-prediction portion of the COLLT dataset, with items similar to LawBench LCP test instances removed prior to fine-tuning. Detailed training procedures for this tool are provided in Section 4.1.3.
  • Legal Element Recognition ($T_{LER}$): This tool extracts legally significant elements from user queries, such as the identity of the actor, subjective intent, methods of action, and consequences. We directly adopt the ALEM [40] model proposed by Zhang et al. as the backbone of this tool. ALEM employs an interaction-based attention mechanism between case facts and legal elements, and it achieved state-of-the-art performance on this task in the year it was introduced.
  • Legal Event Detection ($T_{LED}$): This tool identifies and classifies legal events or actions mentioned in user queries, such as “contract signing,” “tortious conduct,” or “intentional injury.” We adopt the DGGCCM [41] model proposed by Gong et al. as our legal event detection tool. At the time of its release, this model represented the state of the art for the task, demonstrating strong performance in event extraction and classification.
  • Internet Search ($T_{Web}$): To overcome the limitations of static knowledge in pretrained language models, this tool integrates the Bing API (a real-time web search engine) to retrieve up-to-date factual and contextual information.
These tools jointly cover a wide range of legal information needs, including normative retrieval, fact supplementation, and analogical reasoning. During inference, the model dynamically selects the most appropriate tool $T_X \in \mathcal{T}$ based on the current user query $q_t$ and the dialogue history $H_t$. The selection process follows Equation (8), where $P(T_X \mid q_t, H_t; \theta)$ denotes the probability of selecting tool $T_X$ conditioned on the current query and dialogue context, parameterized by $\theta$. This process is implemented by the legal large language model.
Once a tool is selected, it maps the input to a structured intermediate result $g_t$ as defined in Equation (9). This intermediate result $g_t$ is then fed, along with the original query, into the response generation module, enhancing the accuracy, legal rigor, and contextual relevance of the final output.
Statute-First Priority Among Tools
Chinese law is a civil-law system in which statutory provisions are primary and prior cases are at most persuasive. Treating $T_{LAS}$ and $T_{SCR}$ as interchangeable inputs to the response head risks producing answers grounded in analogous cases rather than in the controlling statute. We therefore impose a fixed invocation order over $\mathcal{T}$:
$$T_{LAS} \succ \{T_{LCP}, T_{LER}, T_{LED}\} \succ T_{SCR} \succ T_{Web}, \tag{12}$$
where $\succ$ reflects the legal authority hierarchy from statute to case law to open-web sources. The priority is implemented as a single reordering step at each turn: when multiple tools are selected, their tags are sorted according to Equation (12) before invocation, and the retrieved results $\{g_t^{X_k}\}$ are concatenated in that order as context for the response head. Statutory outputs, therefore, always precede analogous-case outputs in the generation context, giving them positional precedence without any explicit conflict-detection logic. When $T_{SCR}$ and $T_{LAS}$ are invoked together, the prompt also explicitly instructs the model to prioritize the statutory-law evidence retrieved by $T_{LAS}$. At training time, gold tag sequences are sorted by the same ordering before serving as supervision targets for the tool head in Equation (8), so the model learns to emit statutory tags before case-law tags. The mechanism adds no computational overhead beyond one sort per turn. Empirically, on the worked example in Section 3.2.3, the model emits the tool sequence $T_{LAS} \rightarrow T_{SCR}$ even when the user query is phrased in case-comparison terms, consistent with the priority in Equation (12).
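The reordering step can be sketched as a stable sort over tool tags. The numeric ranks, the function names, and the string format of the concatenated context below are illustrative assumptions; only the ordering itself is taken from Equation (12), and the tag names from Table 1.

```python
# Ranks follow the ordering in Equation (12); a lower rank is invoked first.
# The three fact-analysis tools share one rank because the ordering groups them.
PRIORITY = {"LAS": 0, "LCP": 1, "LER": 1, "LED": 1, "SCR": 2, "NET": 3}

def order_tools(selected):
    """Sort selected tool tags by legal-authority rank (stable for ties)."""
    return sorted(selected, key=lambda tag: PRIORITY[tag])

def build_context(results):
    """Concatenate tool outputs in priority order.

    `results` maps a tool tag to its parsed output string; this wrapper
    format is a hypothetical choice for the sketch.
    """
    return "\n".join(f"<{tag}>{results[tag]}</{tag}>" for tag in order_tools(results))
```

Because Python's `sorted` is stable, tools that share a rank keep the order in which the model emitted them.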

3.2.3. Enhanced Response

The model’s selection of actions and legal tools is represented by special indicative tokens in the predicted text, such as “<DRT>” for directly answering user queries and “<CLR>” for clarifying queries, as shown in Table 1.
Table 1. Indicative tokens for guiding the model’s subsequent operations.
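| Category | Operation | Head tag / indicative token | Tail tag |
|---|---|---|---|
| Action | Answer the user’s query directly | <DRT> | n/a |
| Action | Clarify the user’s query | <CLR> | n/a |
| Legal tool | Similar Case Retrieval | <SCR> | </SCR> |
| Legal tool | Legal Article Search | <LAS> | </LAS> |
| Legal tool | Legal Charge Prediction | <LCP> | </LCP> |
| Legal tool | Legal Element Recognition | <LER> | </LER> |
| Legal tool | Legal Event Detection | <LED> | </LED> |
| Legal tool | Internet Search | <NET> | </NET> |
| Enhanced response | Start of enhanced response generation | <ER> | n/a |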
Indicative tokens for legal tools are split into head and tail tags (e.g., “<SCR>” and “</SCR>” for the similar case retrieval tool). We monitor the model’s output stream in real time; once the model emits a head tag, the corresponding legal tool is immediately invoked. We then append the tool’s parsed result after the head tag and output the tail tag. To avoid excessive computational complexity, we limit the model to selecting at most two legal tools.
After predicting the tail tag for the final tool, the model proceeds to generate an enhanced response. We use the <ER> tag to indicate that the model is beginning to generate an enhanced response. The complete process for generating an enhanced response is described in Algorithm 1.
We also illustrate this process with a specific example in Figure 2c. The workflow begins with a user submitting a query (e.g., “please tell me what was stolen”). The model first outputs an indicative token <DRT>, signaling that the query is sufficiently clear and complete to proceed directly to a response. Then it generates a head tag <SCR> to invoke the first legal tool, a similar case retrieval tool, which takes the query as input and returns relevant results, followed by the tail tag of the tool. Next, the model outputs a head tag <LAS> to trigger the second legal tool. After emitting an <ER> token, the model generates a final response enhanced by the parsed results from the legal tools.
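The tag-driven workflow described in this subsection can be sketched as a small decoding loop. This is not a transcription of Algorithm 1: `next_token` stands in for the LLM's decoding step, `tools` is a hypothetical mapping from tag names to callables, and raising an error on a third tool call is an assumption about how the two-tool cap might be enforced.

```python
TOOL_TAGS = ("SCR", "LAS", "LCP", "LER", "LED", "NET")
MAX_TOOLS = 2  # tool budget stated in the text

def is_head_tag(token):
    """True for a tool head tag such as <SCR>, False for </SCR>, <DRT>, <ER>."""
    return (token.startswith("<") and not token.startswith("</")
            and token[1:-1] in TOOL_TAGS)

def run_turn(query, next_token, tools, max_steps=64):
    """Generate one turn, invoking a tool whenever a head tag is emitted.

    `next_token(text)` is a hypothetical callable returning the model's next
    token given the text so far, or None when generation ends.
    """
    text, calls = query, 0
    for _ in range(max_steps):
        token = next_token(text)
        if token is None:
            break
        text += token
        if is_head_tag(token):
            if calls >= MAX_TOOLS:
                # Assumed enforcement of the cap; the source does not specify it.
                raise RuntimeError("tool budget exceeded")
            tag = token[1:-1]
            # Append the tool's parsed result after the head tag, then the tail tag.
            text += tools[tag](query) + f"</{tag}>"
            calls += 1
    return text
```

With a scripted stand-in for the model that emits `<DRT>`, `<LAS>`, `<SCR>`, `<ER>`, and an answer, the returned string interleaves the tool outputs between their head and tail tags before the enhanced response.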

3.3. End-to-End Training

We design an end-to-end instruction fine-tuning strategy that teaches the model to select an action, choose legal tools, and generate enhanced text within a complete dialogue. We supervise the model with a combination of an action, legal tools, parsed results, and the enhanced response text, as defined by the training objective in Equation (11).
The training relies on the COLLT dataset, which we assemble and release together with this paper at https://github.com/EDwardGaming/COLLT (accessed on 22 May 2026). The dataset bundles three components: (i) tool-training data for $T_{SCR}$, $T_{LAS}$, and $T_{LCP}$, derived from public Chinese legal corpora with the leakage controls described in Section 4.1; (ii) an 11,340-sample multi-turn instruction-tuning corpus for the joint objective in Equation (11), with each sample containing the action label, the tool-tag sequence, the simulated tool outputs, and the clarification-aware response; and (iii) 500 real-world multi-turn legal consultations crawled from the Hualu (66law) legal forum, retained in their original conversational form. Component (iii) is held out from training and is used as the out-of-distribution benchmark in Section 5.4 and Section 5.5; roughly 65% of its dialogues contain at least one clarification turn produced by the responding lawyer on the forum, which gives the framework an external, model-independent reference for clarification behavior. Figure 3 shows a representative training instance from component (ii).
We constructed components (ii) and (iii) over five working days. On days one and two, three senior legal experts selected 340 high-quality seeds from DISC-Law-SFT [36,37] spanning marriage, intellectual property, and custody disputes, and manually annotated them as reference samples; each reference contains the action label ($A_{DRT}$ or $A_{CLR}$), the tool-tag sequence, the simulated tool outputs, and the final response. We then used these 340 references as few-shot exemplars to query DeepSeek-v3 (https://api-docs.deepseek.com/zh-cn/, accessed on 25 April 2026) over the remaining 11,000 DISC-Law-SFT seeds. All generations were obtained through the official DeepSeek-v3 web API, rather than a local deployment, using a fixed annotation prompt template. Each seed was queried exactly once and the API response was cached verbatim for downstream audit. While the DeepSeek calls were in flight, we crawled component (iii) in parallel from the Hualu public consultation archive, keeping only complete multi-turn threads with at least one user query and one lawyer response and discarding threads with personally identifying information.
On days three through five we audited every machine-generated sample. We built a lightweight in-house web interface on which each item is presented alongside its source seed and on which a reviewer’s only action is a single Yes/No click. Two annotators from the three-expert team independently reviewed every sample. Items receiving two Yes verdicts were accepted directly. Items receiving two No verdicts were handed to the third expert, who rewrote them rather than re-judging them, after which the rewritten item entered a single round of re-audit. Items with a split one-Yes-one-No verdict were also passed to the third expert, this time as the deciding vote rather than as a rewriter. Table 2 reports the joint distribution of the two annotators’ verdicts; the two-rater inter-annotator agreement is 93.4% and Fleiss’ kappa is 0.630, indicating substantial agreement.
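The audit routing and the two reported agreement statistics can be sketched as follows. The function names and the count-based interface are assumptions for illustration, and the counts used in any call are hypothetical rather than the values in Table 2. For two raters, an item's Fleiss agreement term is 1 when the verdicts match and 0 otherwise, so the mean observed agreement equals the raw agreement rate.

```python
def route(first_yes, second_yes):
    """Map two independent Yes/No verdicts to the next step of the audit."""
    if first_yes and second_yes:
        return "accept"
    if not first_yes and not second_yes:
        return "rewrite_then_reaudit"   # third expert rewrites the item
    return "third_expert_vote"          # split verdict: third expert decides

def agreement_and_kappa(both_yes, split, both_no):
    """Observed agreement and Fleiss' kappa for two raters, two categories."""
    n = both_yes + split + both_no
    observed = (both_yes + both_no) / n
    p_yes = (2 * both_yes + split) / (2 * n)    # pooled share of Yes verdicts
    expected = p_yes ** 2 + (1 - p_yes) ** 2    # chance agreement
    return observed, (observed - expected) / (1 - expected)
```

Because chance agreement is high when most verdicts are Yes, a raw agreement above 90% can coexist with a noticeably lower kappa.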
To curb distributional bias before annotation, we applied near-duplicate filtering with SimHash (threshold 0.85) and cosine similarity (threshold 0.9) on the DISC-Law-SFT seed pool. After annotation, we ran automated consistency and format validation on all 11,340 samples in component (ii), checking for field completeness, JSON well-formedness, and placeholder integrity.
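A minimal sketch of the near-duplicate filter is given below, using the two thresholds stated above. Several choices are assumptions made here because the text does not specify them: character-bigram features, MD5-derived 64-bit fingerprints, SimHash similarity defined as the fraction of matching bits, cosine similarity over bigram counts rather than learned embeddings, and dropping an item when either threshold is met.

```python
import hashlib
import math
from collections import Counter

def shingles(text, n=2):
    """Character n-grams; a tokenizer-free choice that also suits Chinese text."""
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

def simhash(text, bits=64):
    """SimHash fingerprint built from weighted shingle hashes."""
    weights = [0] * bits
    for gram, count in Counter(shingles(text)).items():
        h = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest()[:8], "big")
        for b in range(bits):
            weights[b] += count if (h >> b) & 1 else -count
    return sum(1 << b for b in range(bits) if weights[b] > 0)

def simhash_similarity(a, b, bits=64):
    """Fraction of matching fingerprint bits (1.0 means identical fingerprints)."""
    return 1.0 - bin(simhash(a, bits) ^ simhash(b, bits)).count("1") / bits

def cosine_similarity(a, b):
    """Cosine similarity between character-bigram count vectors."""
    va, vb = Counter(shingles(a)), Counter(shingles(b))
    dot = sum(va[g] * vb[g] for g in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def deduplicate(texts, simhash_thr=0.85, cosine_thr=0.9):
    """Keep each text unless it is a near-duplicate of one already kept."""
    kept = []
    for t in texts:
        if not any(simhash_similarity(t, k) >= simhash_thr
                   or cosine_similarity(t, k) >= cosine_thr for k in kept):
            kept.append(t)
    return kept
```

The pairwise scan is quadratic in the pool size, which is acceptable for a sketch; a production filter would typically bucket fingerprints first.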
The factual content of every training instance in component (ii) is grounded in a real legal consultation, and the audit step provides expert oversight on each generated response, but the surface form of the augmented responses is still produced by a single generator (DeepSeek-v3) and may carry residual stylistic homogeneity that human-only annotation would avoid. This is the reason we bundle component (iii) with the release: the Hualu dialogues lie entirely outside the augmented training distribution, and the clarification-trigger evaluation in Section 5.5 uses them to test whether the learned clarification behavior generalizes beyond the construction pipeline.

4. Experimental Setup

4.1. Details of Legal Tools Training

This section outlines the training details of the tools. All experiments are conducted on an NVIDIA A100 GPU using PyTorch 1.12 and CUDA 11.8.
To prevent data leakage between tool training and the LawBench evaluation suite, the training data for all tools, packaged in component (i) of the COLLT dataset (Section 3.3), are kept disjoint from the LawBench LCP, LAP, and PTP test instances. We use the article-search and charge-prediction splits sourced from DISC-Law-SFT [36,37] rather than CAIL2018, and we apply near-duplicate removal (SimHash ≥0.85 and cosine similarity ≥0.9) to drop any training item that is highly similar to a LawBench test instance before fine-tuning. We then run the complete evaluation under this leakage-controlled setting, and the results remain stable.

4.1.1. Similar Case Retrieval

To enhance the model’s semantic discrimination capability in the task of similar case retrieval, we design a joint loss function that combines contrastive loss and triplet loss within a Siamese architecture. Specifically, for contrastive loss, we construct input pairs $(x_1, x_2)$ with a binary label $y \in \{0, 1\}$ indicating whether the two cases are semantically similar. Each input is encoded into a vector representation $h_1 = f(x_1)$, $h_2 = f(x_2)$, and their Euclidean distance is computed as $D = \|h_1 - h_2\|_2$. The contrastive loss is then defined as $L_{\text{contrastive}} = y \cdot D^2 + (1 - y) \cdot \max(0, m - D)^2$, where $m > 0$ is a margin parameter encouraging dissimilar pairs to be at least $m$ units apart in the embedding space. In parallel, triplet loss is introduced to capture finer-grained semantic distinctions. Each triplet consists of an anchor case $x_a$, a positive case $x_p$, and a negative case $x_n$, with embeddings $h_a$, $h_p$, and $h_n$, respectively. The triplet loss is formulated as $L_{\text{triplet}} = \max\left(0, \|h_a - h_p\|_2^2 - \|h_a - h_n\|_2^2 + \alpha\right)$, where $\alpha > 0$ is the enforced minimum margin between positive and negative samples. The overall training objective is defined as a weighted sum of the two losses: $L_{\text{total}} = \lambda_1 \cdot L_{\text{contrastive}} + \lambda_2 \cdot L_{\text{triplet}}$, where $\lambda_1$ and $\lambda_2$ are coefficients controlling the relative contributions of each component. This combined loss function promotes the model’s ability to distinguish similar from dissimilar cases within the learned semantic embedding space.
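The joint objective can be checked numerically with a small pure-Python sketch. The default values of the margins `m` and `alpha` and of the weights `lam1` and `lam2` are placeholders chosen for illustration; the text does not report the values used in training, and a real implementation would operate on batched tensors.

```python
import math

def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(h1, h2, y, m=1.0):
    """y = 1 pulls similar pairs together; y = 0 pushes pairs at least m apart."""
    d = l2(h1, h2)
    return y * d ** 2 + (1 - y) * max(0.0, m - d) ** 2

def triplet_loss(ha, hp, hn, alpha=0.5):
    """Penalize anchors that are not closer to the positive by margin alpha."""
    return max(0.0, l2(ha, hp) ** 2 - l2(ha, hn) ** 2 + alpha)

def total_loss(pair, triplet, lam1=1.0, lam2=1.0):
    """Weighted sum; `pair` is (h1, h2, y) and `triplet` is (ha, hp, hn)."""
    return lam1 * contrastive_loss(*pair) + lam2 * triplet_loss(*triplet)
```

A dissimilar pair that already lies beyond the margin contributes zero contrastive loss, and a triplet whose negative is far enough from the anchor contributes zero triplet loss.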
The training configuration is as follows: the model is initialized with the base version of Lawformer; the optimizer is AdamW with an initial learning rate of $5 \times 10^{-5}$, using a linear decay schedule; the batch size is set to 16; training proceeds for 30 epochs, with early stopping. A positive-to-negative sample ratio of 1:3 is used to ensure sufficient exposure to negative instances.

4.1.2. Legal Article Search

We formulate the task as a standard multi-class classification problem. In this framework, the model takes a user query as input and outputs a probability distribution over a predefined set of legal articles. The most relevant article is then identified by selecting the one with the highest predicted probability. We fine-tuned the Lawformer pre-trained model for this task, where each query is encoded and mapped to one of the articles in the legal knowledge base. Supervised training signals are taken from the article-search split of the COLLT dataset (Section 3.3), with overlap to the LawBench LAP test set removed prior to fine-tuning per the leakage-control procedure at the start of Section 4.1. The model is trained using the standard cross-entropy loss function:
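For reference, one common way to write the multi-class cross-entropy for a single query is shown below. The symbols are chosen here for illustration and are not taken from the source: $C$ is the number of articles in the knowledge base, $y_c$ is the one-hot gold label, and $p_c$ is the predicted probability for article $c$.

```latex
L_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log p_c
```

Under a one-hot label this reduces to the negative log-probability assigned to the gold article.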
The training configuration is as follows: the model is initialized with the base version of Lawformer; the optimizer is AdamW with an initial learning rate of $5 \times 10^{-5}$ and a linear decay schedule; the batch size is set to 16, and training is conducted for 10 epochs with early stopping. The number of output classes corresponds to the total number of articles in the legal knowledge base. To address the long-tail distribution of article frequencies, we apply balanced sampling during training.

4.1.3. Legal Charge Prediction

We fine-tuned a Lawformer checkpoint on the charge-prediction split of the COLLT dataset (Section 3.3), with overlap to the LawBench LCP test set removed prior to fine-tuning per the leakage-control procedure at the start of Section 4.1. Lawformer, based on a Transformer architecture, is capable of capturing complex legal relationships in case texts. Because a single case can involve several charges, we treat the task as multi-label classification and train with a binary cross-entropy loss summed over charges:
$$L = -\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right],$$
where $y_i$ represents the true label for the $i$-th charge, $p_i$ is the predicted probability, and $n$ is the number of charges. This approach allows the model to predict multiple labels simultaneously and optimize each prediction effectively.
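The multi-label loss can be illustrated with a short pure-Python sketch. The clipping constant `eps` and the 0.5 decision threshold are assumptions introduced here; the text does not state how probabilities are turned into predicted charges.

```python
import math

def bce_loss(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy summed over charges for one case."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total

def predict_charges(p_pred, threshold=0.5):
    """Indices of charges whose predicted probability meets the threshold."""
    return [i for i, p in enumerate(p_pred) if p >= threshold]
```

Each charge is scored independently, which is what lets a single case receive several predicted charges at once.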
During the training process, hyperparameters were carefully tuned to ensure optimal performance for the multi-label classification task. The batch size was set to 32. The learning rate was set to $2 \times 10^{-5}$, and the Adam optimizer was used for its efficiency in handling sparse gradients and large datasets. To prevent overfitting, a dropout rate of 0.3 was applied, along with L2 regularization to control model complexity. The maximum input sequence length was set to 512 to accommodate longer case descriptions. The model was trained for five epochs, with evaluation after each epoch to monitor loss and F1 score on the validation set. Early stopping was implemented, halting training after three consecutive epochs without significant improvement in the F1 score.
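The early-stopping rule can be sketched as a patience counter over per-epoch validation F1. The parameter `min_delta`, which stands in for what counts as a "significant" improvement, is a hypothetical knob; the text does not define that threshold.

```python
def early_stop_epoch(val_f1, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training halts, or None if it never does.

    Training halts after `patience` consecutive epochs whose validation F1
    fails to exceed the best value so far by more than `min_delta`.
    """
    best, stale = float("-inf"), 0
    for epoch, f1 in enumerate(val_f1, start=1):
        if f1 > best + min_delta:
            best, stale = f1, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None
```

With a five-epoch schedule and a patience of three, the rule can only trigger if the best validation F1 is reached within the first two epochs.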

4.1.4. Other Tools

We use the ALEM [40] model proposed by Zhang et al. for legal element recognition and the DGGCCM [41] model for legal event detection. We use the Bing Search API (https://www.bing.com/, accessed on 25 April 2026) as our internet search tool.

4.2. Evaluation Tasks and Metrics

We evaluate COLLT using two types of tasks. The first type consists of traditional legal NLP tasks, which include nine sub-tasks:
1. Legal Charge Prediction: Predicts the legal charges associated with a case based on case facts. Evaluation is performed using the F1 score.
2. Legal Article Prediction: Identifies which legal article applies to a case. The F1 score is used as the evaluation metric.
3. Prison Term Prediction: Predicts the prison sentence for a defendant. The model’s performance is assessed using log-distance.
4. Argument Mining: Detects arguments within legal documents and classifies them as pro or con. Accuracy is used to evaluate performance.
5. Dispute Focus Identification: Identifies the central dispute in a legal case. Evaluation is based on the F1 score of key issue extraction.
6. Issue Topic Identification: Classifies the topic of a legal case. Single-label classification metrics, including precision, recall, and F1 score, are used.
7. Legal Event Detection: Identifies and extracts legal events related to a case from legal documents, including event triggers, arguments, and event types. Evaluation follows standard Named Entity Recognition (NER) metrics.
8. Opinion Summarization: Summarizes judicial opinions. Performance is evaluated using ROUGE-L scores, comparing model-generated summaries with human-written ones.
9. Case Analysis: Analyzes cases to predict outcomes and relevance to other cases. Similarity retrieval and outcome prediction are evaluated using precision, recall, and F1 scores.
Each sub-task utilizes curated datasets containing real-world legal texts and case documents. The tasks and their evaluation metrics follow the LawBench evaluation protocol [42], which establishes standardized metrics across Chinese legal NLP tasks to enable direct comparability with prior results. Table 3 details the datasets and evaluation metrics for each task.
Two metric choices in Table 3 warrant additional justification. For PTP we report log-distance rather than absolute error in months. The log-distance metric is the one fixed by LawBench [42], the evaluation framework whose splits and protocol we follow, so retaining it is necessary for direct comparability with prior reported numbers. The metric is also better aligned with judicial practice than absolute error because prison-term distributions are heavy-tailed and the severity of a sentencing error depends on its size relative to the true sentence rather than on its absolute magnitude. Predicting twelve months for a true six-month sentence doubles the defendant’s incarceration, an extreme judicial error, whereas predicting one hundred twenty-six months for a true one hundred twenty-month sentence is a 5% deviation that falls within the range of judicial discretion. Absolute-error scoring would penalize these two cases identically; log-distance, being monotone in the ratio $y_{\text{pred}} / y_{\text{true}}$, separates them, which matches the standard used in sentencing review.
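The two sentencing cases above can be worked through numerically. The sketch assumes log-distance takes the form $|\log y_{\text{pred}} - \log y_{\text{true}}|$; the exact normalization used by LawBench may differ, so the values are illustrative of the ordering rather than of reported scores.

```python
import math

def absolute_error(y_pred, y_true):
    """Error in months."""
    return abs(y_pred - y_true)

def log_distance(y_pred, y_true):
    """Absolute log-ratio of predicted to true term (assumed form)."""
    return abs(math.log(y_pred) - math.log(y_true))

short_case = (12, 6)     # predicted 12 months, true 6 months
long_case = (126, 120)   # predicted 126 months, true 120 months
```

Both cases have an absolute error of six months, whereas the log-distance is about 0.69 for the short sentence and about 0.05 for the long one, more than a tenfold gap.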
For OS we report ROUGE-L because it is the metric used by the LawBench OS task. We acknowledge that ROUGE-L measures longest-common-subsequence overlap rather than factual legal consistency, and that two summaries can disagree on a controlling article while sharing surface phrasing. We treat ROUGE-L as a comparability anchor with prior work and complement it with the ChatGPT-o3 scoring in Section 5.4, which is prompted to score legal correctness directly. A purpose-built factual-consistency metric for legal summarization is outside the scope of this paper and is noted as a limitation.
The second type is the free-form legal consultation task, which simulates real-world legal consultation scenarios. In this task, the model is required to answer a wide range of legal questions posed by users, providing comprehensive and legally accurate advice. These inquiries cover various branches of law, including but not limited to family law, contract disputes, criminal law, and property rights, reflecting the diverse nature of legal consultations. The task challenges the model to navigate the complexities of legal language and offer advice that is both applicable and clear for non-expert users.
For this task, we use the 500 real-world legal consultations released as component (iii) of the COLLT dataset (Section 3.3), crawled from the Hualu (66law) public consultation archive (https://www.66law.cn/). The set spans a wide variety of legal issues, from basic legal questions to more complex multi-issue situations, and is held out from training so that the evaluation here is out-of-distribution with respect to component (ii). The evaluation of the model’s performance is based on three key criteria:
1. Answer accuracy: This measures how correctly the model’s response addresses the legal question, ensuring the provided advice aligns with applicable laws and regulations.
2. Legal knowledge coverage: This assesses the breadth and depth of legal knowledge the model demonstrates in its response, considering the inclusion of relevant laws, precedents, and legal principles.
3. Reasonableness: This criterion evaluates whether the response is practical and realistic, ensuring the advice given could be logically applied in a real-world context.
These criteria are scored on a scale from 1 to 10, where a higher score reflects greater accuracy, comprehensiveness, and applicability of the legal advice provided. This evaluation approach ensures that the model is not only legally sound but also effective in communicating with users who may not have specialized legal knowledge.
Through these two types of tasks, we can comprehensively assess the performance and application potential of COLLT.

4.3. Baselines

We evaluate the performance of our proposed model against two categories of baselines. The first category includes widely used Chinese large language models designed for general-purpose applications, while the second category focuses on models specifically tailored for legal tasks.

4.3.1. General-Purpose Chinese Large Models

The first category of baseline models consists of popular Chinese large language models that are widely adopted for general use. The models selected for this category represent some of the most advanced and commonly used architectures in the Chinese language processing community. They are:
  • ChatGLM3-6B [22]: ChatGLM3 is the third-generation open-source bilingual conversational model jointly released by Zhipu AI and Tsinghua University. It inherits the advantages of smooth dialogue and low deployment requirements from previous versions, while making significant improvements in performance and functionality.
  • LLaMa3-8B [43]: LLaMa3 is a large language model developed by Meta, designed for efficient and scalable performance in various natural language processing tasks. It excels in tasks like text generation, question answering, and translation, offering strong performance across multiple languages.
  • InternLM3-8B [24]: InternLM3 is an open-source language model with 8 billion parameters, developed by the Shanghai AI Laboratory for general tasks and advanced reasoning.
  • Qwen2.5-7B [44]: Qwen2.5 is the latest large language model released by Alibaba, featuring 7 billion parameters and fine-tuned for instruction-following, designed for general tasks and advanced reasoning.
  • Baichuan2-7B [45]: Baichuan2-7B is the second generation open-source large language model developed by Baichuan AI, with 7 billion parameters. It is trained on 2.6 trillion tokens of high-quality Chinese and English data, supporting a context window of 4096 tokens.

4.3.2. Legal-Specific Large Models

The second category of baselines consists of recently proposed legal large language models that have been specifically optimized for legal tasks. These models are trained on legal texts and are equipped with knowledge of legal terminologies, case law, and other legal aspects that are essential for handling legal questions and tasks. The models in this category are:
  • InternLM-Law [23]: InternLM-Law, based on InternLM, integrates continued pre-training and instruction tuning using both legal and general corpora to improve legal text comprehension and generation. This approach enables the model to perform effectively in legal tasks while retaining a broad understanding of general language.
  • LexiLaw [21]: LexiLaw is an open-source large language model designed specifically for the Chinese legal field, built on the ChatGLM-6B architecture. It enhances legal consultation and case analysis performance through methods including LoRA, P-tuning-v2, and full-parameter fine-tuning, utilizing a diverse range of training data, including legal Q&A, regulations, legal documents, and general domain text.
  • Lawyer-LLaMA [28]: This model is trained using a combination of legal domain data and general domain data, utilizing GPT-4 Turbo to build a high-quality legal dataset. The model undergoes supervised fine-tuning to enhance its legal reasoning capabilities, allowing it to effectively apply domain knowledge and handle various legal professional issues.
  • FuziMingcha [31]: FuziMingcha is a Chinese legal large language model co-developed by Shandong University, Inspur Cloud, and China University of Political Science and Law. Built upon ChatGLM, the model has been trained on extensive Chinese legal corpora, including judgment documents and statutes, as well as supervised fine-tuning datasets, such as legal Q&A and case retrieval.
  • Wisdom-Interrogatory [26]: The Wisdom-Interrogatory model, developed collaboratively by Zhejiang University, Alibaba DAMO Academy, and Huayi Institute of Computing, is a large-scale legal language model designed to enhance legal accessibility and judicial efficiency.

4.4. Implementation Details

We trained COLLT based on each of the five Chinese foundation models mentioned above, naming the resulting versions COLLT-GLM, COLLT-LLaMa, COLLT-InternLM, COLLT-Qwen, and COLLT-Baichuan. We used the LLaMA-Factory [46] framework for model training.
For model evaluation, we employed the LawBench framework [42], a well-established and widely-used benchmark specifically designed for evaluating legal large language models. LawBench provides comprehensive evaluation metrics across a variety of legal tasks, including legal charge prediction, similar case retrieval, and legal event detection.
To reduce the effect of run-to-run variation, including sporadic hallucinations, we ran 20 independent evaluations on each task dataset. All reported results are averages over these 20 runs.

5. Results and Analysis

We answer the following research questions through extensive experiments:
  • RQ1: Can COLLT improve the performance of existing LLMs on legal NLP tasks?
  • RQ2: Can COLLT enhance the legal Q&A ability of existing LLMs?
  • RQ3: Are the legal tools effective?
  • RQ4: Does proactive clarification help the model better understand the case details?

5.1. Performance on Legal NLP Tasks

Table 4 reports COLLT’s performance against two reference points for each base model: a zero-shot baseline (“-Zero”) and a one-shot baseline (“-One”) that receives the same system prompt and tool descriptions as COLLT but a single in-context example in place of instruction tuning. The Legal LLM block reports five publicly released legal LLMs evaluated under the same protocol. Blue deltas in the COLLT rows are absolute improvements over $\max(\text{Zero}, \text{One})$.
The change from Zero to One is task-dependent rather than uniform. Of the 45 (base, task) cells, 23 move positively, 15 regress, and 7 are unchanged. Sorted by task the picture is much sharper: LCP and LAP improve on every one of the five bases; DFI, ITI, and LED improve or hold steady in 14 of 15 cells; while AM, OS, and CA regress in 13 of 15 cells (zero gains on AM, one on OS, zero on CA). The split tracks the composition of the legal toolbox. LCP, LAP, and LED are directly served by $T_{LCP}$, $T_{LAS}$, and $T_{LED}$ respectively, and DFI and ITI overlap with the legal-element output of $T_{LER}$; for these tasks, a single in-context example is enough for the base model to fire a tool that contributes evidence aligned with the task target, and the gain from the grounded output outweighs the cost of the extra context. AM, OS, and CA have no closely matching tool: argument mining and opinion summarization operate over a long input rather than over retrieved evidence, and the case-analysis benchmark requires multi-step reasoning over a short question that no tool’s parsed output answers directly. On these inputs the base model still fires tools, but the returned passages contribute hundreds of tokens unrelated to the question, and the resulting context dilution outweighs any generic benefit from the in-context example. Because the -One setting has access to the same six tools and the same system prompt as COLLT, and differs only in lacking instruction tuning, the regression cannot be attributed to weaker access to information; the root cause is that the base model has not learned which tool to fire on which input. Relative to $\max(\text{Zero}, \text{One})$, COLLT achieves a non-negative gain on every one of the 45 cells.
The size of the improvement depends on the base. ChatGLM3-6B and LLaMa3-8B gain the most (mean per-task improvements of 0.08 over max(Zero, One)). InternLM3-8B gains 0.05 on average. Qwen2.5-7B, the highest-scoring base, gains 0.04 on average, with its top columns (LAP at 0.74 and PTP at 0.82) already close to the cell ceiling. Baichuan2-7B gains 0.03 on average; its baseline rows are the weakest among the five, indicating that supervised instruction tuning does not fully compensate for limited capacity in the underlying base. The general pattern is consistent with the view that explicit supervision over tool-invocation patterns yields larger marginal returns when the base model has more headroom to gain.
The COLLT variants outperform every Legal LLM in the lower block on every column; the narrowest margin is on OS, where Fuzi-Mingcha reaches ROUGE-L 0.49 against COLLT-Qwen 0.50. Because the Legal LLM block uses parametric knowledge alone without tool access, part of the gap is attributable to tool augmentation rather than to the underlying base. The results address RQ1: COLLT improves the performance of existing LLMs on legal NLP tasks.

5.2. Performance on Free-Form Q&A

This section presents an in-depth analysis of the advantages of COLLT in the task of free-form legal consultation, with specific experimental settings outlined in Section 4.2. To assess the effectiveness of COLLT, we recruited three volunteers with relevant expertise to score the responses generated by the models based on three critical dimensions: answer accuracy, legal knowledge coverage, and the reasonableness of the responses. These three factors were chosen because they reflect key aspects of quality in legal consultation—accuracy ensures the correctness of legal advice, legal knowledge coverage guarantees the model’s ability to address diverse legal topics, and reasonableness assesses whether the model’s response is logically sound and practical.
The volunteers were tasked with scoring each response on a predefined scale, where higher scores corresponded to better performance. These scores were then aggregated to calculate the score win rate of COLLT compared to baseline models. The win rate was computed as the percentage of instances in which COLLT outperformed the baseline models in each of the three dimensions.
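The win-rate computation can be sketched as follows. This is a minimal illustration that assumes each response's three annotator scores are averaged before comparison and that ties do not count as wins; the aggregation rule, function name, and scores are assumptions made here, not details confirmed by the evaluation protocol.

```python
from statistics import mean


def win_rate(collt_scores, base_scores):
    """Percentage of queries where COLLT's mean annotator score is strictly higher.

    Each argument is a list with one entry per query; every entry holds the
    scores given by the annotators on a single dimension.
    """
    if len(collt_scores) != len(base_scores):
        raise ValueError("both systems must be scored on the same queries")
    wins = sum(mean(c) > mean(b) for c, b in zip(collt_scores, base_scores))
    return 100.0 * wins / len(collt_scores)


# Hypothetical scores: 4 queries, 3 annotators, one dimension.
collt = [[4, 5, 4], [3, 3, 4], [5, 4, 4], [2, 3, 3]]
base = [[3, 4, 4], [3, 4, 3], [4, 4, 3], [3, 3, 3]]
rate = win_rate(collt, base)
print(f"win rate: {rate:.1f}%")
```

In this toy example the second query is a tie and the fourth is a loss, so two of the four queries count as wins.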
To assess the reliability of the three-volunteer scoring, we computed Fleiss’ kappa across the three annotators on the integer scores assigned to each (model, query, dimension) triple in the free-form Q&A evaluation. The kappa values are 0.71 for answer accuracy, 0.68 for legal knowledge coverage, and 0.64 for reasonableness, all in the substantial-agreement range. We also report the same statistic on the three-way win/tie/loss verdict derived from these scores, which is the basis of Figure 4: Fleiss’ kappa on this verdict is 0.66 averaged over the three dimensions. We acknowledge that three annotators is on the lower end for an open-ended legal evaluation, and we record this as a limitation in Section 3.3; the substantial agreement reduces, but does not eliminate, the risk of idiosyncratic scoring.
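For readers who want to reproduce the agreement statistic, the following is a minimal pure-Python sketch of Fleiss’ kappa. It treats each integer score as a nominal category, which is how the standard statistic works; the function name and the toy ratings are illustrative assumptions rather than the evaluation script used in the study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings[i] is the list of category labels assigned to item i.
    Undefined (division by zero) if every rating falls in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_expected = sum(p ** 2 for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings: 4 responses, 3 annotators, scores on a two-point scale.
toy = [[1, 1, 2], [2, 2, 2], [1, 1, 1], [1, 2, 2]]
kappa = fleiss_kappa(toy)
print(f"Fleiss' kappa: {kappa:.3f}")
```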
The experimental results are shown in Figure 4, where we observe that COLLT models consistently improve the performance of the original LLMs across all three evaluation dimensions for free-form legal consultation. Specifically, COLLT models demonstrate the most substantial improvement in answer accuracy. This improvement is largely attributed to COLLT’s ability to actively clarify user queries before generating responses, which enhances the model’s understanding of user intentions. By asking clarifying questions or rephrasing ambiguous queries, COLLT ensures that the model better comprehends the underlying legal issue, thus leading to more accurate, relevant, and personalized responses.
Furthermore, COLLT improves the reasonableness of its responses by generating more contextually appropriate and logically coherent legal advice. By leveraging the legal knowledge obtained from legal tools, COLLT ensures that each response is both logically sound and applicable to the user’s specific situation, increasing the overall usability and quality of the legal consultation.
To provide a more objective evaluation of COLLT, we also employed ChatGPT-o3 to score the responses generated by the models. This method allows for a more systematic and automated assessment of response quality, reducing any potential biases introduced by human evaluators. The prompt template used to guide ChatGPT-o3 for scoring is described in Appendix A. Figure 5 illustrates the score win rates of COLLT, showing that, even under the ChatGPT-o3 scoring setup, COLLT consistently outperforms the baseline models across all three evaluation dimensions.
Through this experiment, we answer RQ2: COLLT can enhance the legal Q&A ability of existing LLMs.

5.3. Legal Tool Evaluation

This section evaluates the performance of each legal tool, with the test results shown in Table 5. For the legal tool $T_{SCR}$, we use NDCG@5 as the evaluation metric; for $T_{LAS}$, $T_{LCP}$, and $T_{LER}$, we use accuracy; and for $T_{LED}$, we use the F1 score. The evaluation of each legal tool is based on 100 test samples that we manually constructed. Appendix B presents the detailed setup of legal tool evaluation.
The results demonstrate that the performance of the legal tools is satisfactory and effectively supports the large model in generating high-quality responses. These experimental results answer RQ3: the legal tools are effective.
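The retrieval metric can be sketched as below. This is a minimal NDCG@5 implementation that assumes graded relevance labels with linear gain and a log2 rank discount, and that the ideal ordering is obtained by sorting the judged candidates; the exact gain function used for Table 5 is not stated, so these choices are assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant candidate was judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical relevance labels of the cases returned for one query, in rank order.
system_ranking = [3, 1, 2, 0, 0]
score = ndcg_at_k(system_ranking, k=5)
print(f"NDCG@5: {score:.3f}")
```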
To evaluate the robustness of COLLT in legal tool selection, we first distilled the intent from each query by removing irrelevant information, then encoded the cleaned text into semantic vectors using a BERT model fine-tuned on legal corpora. These high-dimensional vectors were projected into a two-dimensional space using t-SNE for visualization and semantic proximity comparison. We then assigned the same color to queries mapped by COLLT to the same legal tool and plotted them in a scatter diagram. The results in Figure 6 show clear clustering of same-color points, indicating that semantically similar queries are consistently mapped to the same legal tool, thus confirming COLLT’s stable and consistent selection capability across varied but similar-intent queries.

5.4. Clarification Mechanism Evaluation

This section analyzes the clarification mechanism in COLLT on the 500 real-world multi-turn legal consultations released as component (iii) of the COLLT dataset (Section 3.3). We first examine the correlation between query length and clarification triggers.
As shown in Figure 7a, shorter queries trigger the clarification mechanism more frequently. This is a natural pattern: brief queries provide less information, prompting the model to seek additional details for a more accurate response.
We further evaluated the quality of model responses as the number of clarification rounds increased, using the same three dimensions: response accuracy, legal knowledge coverage, and reasonableness. Five samples requiring three clarification rounds were randomly selected from the 500-query test set for evaluation (Figure 7b–d). The bar charts show clear improvements in response accuracy, legal knowledge coverage, and reasonableness as the number of clarification rounds grows.
The experimental results answer RQ4: proactive clarification does help the model better understand the case details.

5.5. Ablation Studies

5.5.1. Tool Ablation

Table 6 reports the effect of disabling each individual tool during both training and inference, averaged over the five COLLT-* variants from Table 4. The first row reproduces the full COLLT setting. Each subsequent row removes one tool from T , and the last row removes all six tools simultaneously. We omit PTP and AM from this comparison because their task inputs trigger almost no tool calls under the learned action policy, so removing tools does not change the predictions on those columns.
Each tool’s removal reduces accuracy primarily on the task it is designed to support. Dropping $T_{LAS}$ costs 5.8 points on LAP, dropping $T_{LCP}$ costs 6.8 points on LCP, dropping $T_{LED}$ costs 4.5 points on LED, dropping $T_{SCR}$ costs 3.0 points on CA, and dropping $T_{LER}$ costs 3.1 points on ITI. Off-target columns move by less than one point in every single-removal row. The pattern is consistent with the intended role of each tool as a localized source of grounded context: each tool contributes mainly on the column whose label space or evidence requirement it directly produces.
Removing the internet search tool $T_{Web}$ has the smallest effect (mean drop of 0.4 percentage points). This is expected since all seven evaluated tasks are static benchmarks whose answers do not depend on real-time web information. Removing all six tools simultaneously costs 5.0 percentage points on average. The all-removed setting still exceeds the Zero baselines from Table 4 by 1.7 percentage points on average. Decomposing the 6.7-point mean gain of full COLLT over the Zero baselines into these two components, tool outputs account for 5.0 points and the supervised action-selection and response-composition behavior learned during instruction tuning accounts for 1.7 points. Both components are needed for the full effect: tools supply grounded content, and the SFT-learned routing and integration logic supplies the policy that prompting alone does not recover.
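The decomposition above can be written out as a telescoping difference. The labels below are shorthand introduced here for the three mean scores (full COLLT, COLLT with all tools removed, and the Zero baseline); only the three numbers come from the text.

```latex
\underbrace{\overline{\text{COLLT}} - \overline{\text{Zero}}}_{6.7}
\;=\;
\underbrace{\overline{\text{COLLT}} - \overline{\text{COLLT}}_{\text{no tools}}}_{5.0\ \text{(tool outputs)}}
\;+\;
\underbrace{\overline{\text{COLLT}}_{\text{no tools}} - \overline{\text{Zero}}}_{1.7\ \text{(learned policy)}}
```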

5.5.2. Clarification Trigger Comparison

To isolate the contribution of supervised clarification training from prompting alone, and to address the concern that ambiguity is otherwise judged only by the model’s own predictions, we evaluate clarification triggering against an external benchmark. The benchmark is component (iii) of the COLLT dataset, the 500 multi-turn Hualu consultations introduced in Section 3.3, covering family, contract, criminal, and property disputes; approximately 65% of these dialogues already contain at least one clarification turn produced by the responding lawyer on the forum, which provides a natural model-independent reference for the clarification decision. We extended the set with an explicit binary gold label per query indicating whether the query is sufficiently ambiguous to warrant a clarification turn; the labels were produced independently of any COLLT or baseline output by the three-expert team described in Section 3.3 under the same Yes/No review interface, so the ground truth is not derived from any model’s predictions. The metric is trigger-F1, the F1 score of the binary decision “output the clarification action $A_{CLR}$ on this query”.
Table 7 reports trigger-F1 averaged over the five base models. The base-vanilla setting runs each base model with no system prompt and no tool descriptions. The base-with-clarify setting evaluates the same base models with a low-cost active-prompting strategy: a system prompt instructs the model to ask a clarifying question first whenever the user query is ambiguous or lacks necessary details; otherwise, it answers directly. The COLLT-SFT setting evaluates the COLLT models with the clarification action learned through instruction tuning.
The system prompt alone raises trigger-F1 from 0.070 to 0.476, confirming that contemporary base models do possess a latent ability to ask clarifying questions when explicitly instructed. COLLT-SFT reaches 0.814, an absolute gap of 0.338 over the prompting baseline. Inspection of the prompting-baseline outputs shows that most of the gap is driven by false positives: prompted base models tend to ask clarifying questions on unambiguous queries as well, while COLLT learns to call $A_{CLR}$ selectively on inputs that genuinely need it. Supervised exposure to the decision boundary, not merely to the existence of a clarification action, is what closes the remaining gap between prompting and instruction tuning.
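To make the effect of false positives on trigger-F1 concrete, the sketch below computes the metric for two hypothetical policies on the same gold labels: one that asks for clarification on every query and one that asks selectively. The function name and the toy labels are assumptions for illustration and do not reproduce the numbers in Table 7.

```python
def trigger_f1(gold, predicted):
    """F1 of the binary decision to emit the clarification action.

    gold[i] is True when query i is labeled as needing clarification;
    predicted[i] is True when the model chose to clarify.
    """
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels: two ambiguous queries, three clear ones.
gold = [True, True, False, False, False]
always_ask = [True, True, True, True, True]       # over-triggers on clear queries
selective = [True, True, False, False, True]      # one false positive only

f1_always = trigger_f1(gold, always_ask)
f1_selective = trigger_f1(gold, selective)
print(f"always ask: {f1_always:.3f}, selective: {f1_selective:.3f}")
```

Both policies have perfect recall here, so the difference in F1 comes entirely from precision, which mirrors the false-positive pattern described above.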

6. Discussion

We collect the principal limitations of this work before turning to the question of how far the formulation transfers beyond legal QA.
The instruction-tuning corpus combines real legal seeds drawn from DISC-Law-SFT with responses, tool trajectories, and clarification turns reconstructed by DeepSeek-v3 under expert audit. Even with the per-sample review and the substantial inter-annotator agreement reported in Table 2, surface form is still produced by a single generator and a degree of stylistic homogeneity is unavoidable. Producing all 11,340 responses entirely by hand would have required on the order of $10^4$ expert-hours and was infeasible within the project budget; this trade-off between scale and stylistic diversity remains an open problem for instruction-tuned legal LLMs. The 500-query Hualu evaluation in Section 5.5 partially mitigates the concern by testing on inputs whose stylistic distribution lies outside the augmented training corpus, but a fully human-written corpus remains a desirable next step.
The free-form human evaluation that produces Figure 4 relies on three annotators. The Fleiss’ kappa values are in the substantial range, but a larger jury would tighten the variance on the win-rate estimates and is the right next step for high-stakes legal scoring. The two-tool budget is supported by the corpus audit and by the tool ablation in Section 5.5, yet cases that combine cross-jurisdictional statutes, case law, and procedural rules may require a higher cap; we have encoded the cap as the parameter $|\tau| \leq 2$ in Equation (4) rather than as an architectural assumption so that it can be relaxed without altering the rest of the framework. We have validated COLLT only on Chinese law, and transfer to common-law jurisdictions or to other domains has not been tested empirically.
The three-task loss decomposition ($\mathcal{L}_{action} + \lambda \mathcal{L}_{tool} + \mu \mathcal{L}_{resp}$) and the budget constraint ($|\tau| \leq 2$) are stated in terms that do not presuppose any particular domain. Domains that share high tool diversity with chronically underspecified user queries, including medical QA [47], financial compliance [48], and technical support [33], admit the same MDP and the same three-task objective with only the toolbox $T$ and the instruction-tuning corpus replaced. Empirical validation of this transfer is beyond the scope of this paper and is left to future work.
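One way to read the three-task objective is as the negative log of a chain-rule factorization over the action, the tool subset, and the response. In the sketch below, $x$ denotes the dialogue context, $a$ the action, $\tau$ the invoked tool subset, and $y$ the response; the symbols $x$, $a$, and $y$ are notational assumptions made here for illustration, and the exact form of the numbered equations in the method section may differ.

```latex
p_\theta(a, \tau, y \mid x)
  = p_\theta(a \mid x)\; p_\theta(\tau \mid x, a)\; p_\theta(y \mid x, a, \tau),
  \qquad |\tau| \le 2

-\log p_\theta(a, \tau, y \mid x)
  = \underbrace{-\log p_\theta(a \mid x)}_{\mathcal{L}_{action}}
  + \underbrace{-\log p_\theta(\tau \mid x, a)}_{\mathcal{L}_{tool}}
  + \underbrace{-\log p_\theta(y \mid x, a, \tau)}_{\mathcal{L}_{resp}}
```

Under this reading, setting $\lambda = \mu = 1$ recovers the exact joint negative log-likelihood, while other values reweight the tool and response terms relative to the action term.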

7. Conclusions

In this work, we present COLLT, a tool-augmented framework whose central decision is whether to clarify an underspecified query or to invoke tools directly. We derive its three-task training objective, over action selection, tool invocation, and response generation, as a chain-rule factorization of the joint likelihood rather than a heuristically weighted sum. We further introduce a cardinality budget on the invoked tool subset that reduces the per-turn action-space complexity from $O(2^{|T|})$ to $O(|T|^2)$, making tool routing tractable as the toolbox grows.
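The effect of the cardinality budget on the action space can be checked by direct enumeration. The sketch below counts tool subsets for the six tools named in the paper, with and without a budget of two; whether the empty subset is counted as a valid tool action is an assumption made here, and the helper names are illustrative.

```python
from itertools import combinations
from math import comb

# Tool names as used in the paper; the list itself is only for illustration.
TOOLS = ["T_SCR", "T_LAS", "T_LCP", "T_LER", "T_LED", "T_Web"]


def unconstrained_count(n_tools):
    """Number of tool subsets with no budget: every subset of the toolbox."""
    return 2 ** n_tools


def budgeted_count(n_tools, budget=2):
    """Number of tool subsets of size at most `budget` (empty set included)."""
    return sum(comb(n_tools, k) for k in range(budget + 1))


def budgeted_subsets(tools, budget=2):
    """Explicit enumeration of the subsets counted by budgeted_count."""
    return [subset for k in range(budget + 1) for subset in combinations(tools, k)]


subsets = budgeted_subsets(TOOLS)
print(len(subsets), budgeted_count(len(TOOLS)), unconstrained_count(len(TOOLS)))
```

With six tools the budgeted space has 1 + 6 + 15 = 22 subsets against 64 unconstrained ones, and the gap widens quickly as tools are added because the first count grows quadratically and the second exponentially.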
We instantiate this framework as COLLT and validate it on legal question answering, a tool-rich domain in which user queries are frequently underspecified. Across five Chinese foundation-model backbones, nine legal NLP tasks, and 500 real-world consultation queries, COLLT achieves consistent improvements in answer accuracy, legal knowledge coverage, and response appropriateness; visualization analysis further indicates that the learned policy selects consistent tools for semantically similar queries.
The clarification-then-tool decision structure and the three-task loss decomposition are domain-agnostic in their formulation and should, in principle, transfer to medical, financial, and technical support question answering with only data-level modifications to the toolbox and the instruction-tuning corpus. Future work includes empirical transfer studies on these domains, sensitivity analysis of the multi-task weighting coefficients, and extensions of the budget constraint to cost-aware tool selection.

Author Contributions

Conceptualization, K.Y., J.S. and C.X.; Methodology, K.Y. and J.S.; Validation, Z.W.; Investigation, K.Y. and C.X.; Resources, J.S. and C.X.; Data curation, Z.W.; Writing—original draft, K.Y.; Writing—review & editing, K.Y. and J.S.; Visualization, K.Y.; Supervision, J.S. and Z.W.; Project administration, K.Y. and Z.W.; Funding acquisition, K.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by Undergraduate Training Programs for Innovations by NEFU grant number S202510225448.

Data Availability Statement

The instruction-tuning dataset and evaluation scripts are publicly available at https://github.com/EDwardGaming/COLLT (accessed on 22 May 2026). The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Prompt Template for ChatGPT-o3 Scoring

We have designed a dedicated prompt template to guide ChatGPT-o3 in scoring legal responses. The template is illustrated in Figure A1 and Figure A2, which respectively show the Chinese version and its English translation.
Figure A1. Prompt template used to guide ChatGPT-o3 in scoring legal responses.
Figure A2. English translation of the ChatGPT-o3 scoring prompt template.

Appendix B. Detailed Setup of Legal Tool Evaluation

To comprehensively evaluate the effectiveness of various legal tools, we constructed 100 test samples for each of five tasks: similar case retrieval ($T_{SCR}$), legal article searching ($T_{LAS}$), legal charge prediction ($T_{LCP}$), legal element recognition ($T_{LER}$), and legal event detection ($T_{LED}$), resulting in a total of 500 samples. All samples were newly created by us in collaboration with three domain experts with legal practice experience, rather than reused from public datasets. This was done to assess the tools’ generalizability and transferability in realistic usage scenarios.
We began by collecting 12,000 de-identified judicial documents from China Judgments Online (https://wenshu.court.gov.cn/) and selected candidate documents by case type (criminal, civil, and administrative) in a 5:4:1 ratio. Each task then underwent task-specific refinement. For $T_{SCR}$, only the factual section was retained to form the query text. Three experts independently retrieved and ranked similar cases; their top-10 results were merged using the Borda count method to create a gold-standard top-5 list. For $T_{LAS}$ and $T_{LCP}$, explicit labels such as article numbers and charge names were removed to prevent leakage, and experts annotated the correct statutes or charges based on current codes such as the Civil Code and Criminal Law. For $T_{LER}$, 2000 sentences containing legal elements were extracted from documents of different case types; 100 sentences were randomly selected and annotated using the BIO tagging scheme to mark entity boundaries and categories. For $T_{LED}$, full-length judgments of 2000–5000 words were chosen, and <subject, predicate, object> triples were annotated, along with optional time and location slots.
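The Borda merge of the experts' rankings can be sketched as follows. This is a minimal illustration that assumes the common scoring convention (a list of length n awards n points to its first item down to 1 for its last, and 0 to unlisted items) and a deterministic tie-break by case identifier; both conventions, the function name, and the case identifiers are assumptions, since the exact variant used is not specified.

```python
from collections import defaultdict


def borda_merge(rankings, top_k=5):
    """Merge several ranked lists into one list of the top_k highest-scoring items."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, case_id in enumerate(ranking):
            scores[case_id] += n - position  # first place earns n points
    # Sort by total score (descending), then by identifier to break ties.
    ordered = sorted(scores, key=lambda case_id: (-scores[case_id], case_id))
    return ordered[:top_k]


# Hypothetical rankings from three experts (shortened to four cases each).
expert_a = ["c1", "c2", "c3", "c4"]
expert_b = ["c2", "c1", "c4", "c3"]
expert_c = ["c1", "c3", "c2", "c5"]
merged = borda_merge([expert_a, expert_b, expert_c], top_k=3)
print(merged)
```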
To ensure annotation quality, we quantified inter-annotator agreement. For retrieval ($T_{SCR}$), we used Krippendorff’s $\alpha$ for ranked data; for classification tasks ($T_{LAS}$ and $T_{LCP}$), Fleiss’ $\kappa$; for sequence and extraction tasks ($T_{LER}$ and $T_{LED}$), span-level IAA-F1 scores. If $\alpha$ or $\kappa \geq 0.80$ (or IAA-F1 $\geq 0.85$), annotations were accepted directly. If the agreement fell between 0.60 and 0.79 (or between 0.70 and 0.84 for IAA-F1), conflicting samples were resolved via review meetings. Scores below these ranges triggered sample re-selection and re-annotation until the metrics met the required standard. Cases that could not be resolved were adjudicated by a fourth senior legal expert; if disagreement persisted, the sample was discarded and replaced with a new one of the same case type to maintain a consistent sample size of 100 per tool.
For evaluation metrics, $T_{SCR}$ used NDCG@5 to assess ranking quality; $T_{LAS}$, $T_{LCP}$, and $T_{LER}$ used accuracy; and $T_{LED}$ used event-level micro-F1. All samples and annotations underwent a second round of anonymization, with personally identifiable information removed and minor place names blurred to prevent identity exposure.
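The acceptance rule for annotations can be summarized as a small triage function. The sketch assumes the stated thresholds are lower bounds (at least 0.80 for Krippendorff’s alpha and Fleiss’ kappa, at least 0.85 for IAA-F1) and that the lower edge of the review band (0.60 or 0.70) separates review from re-annotation; the dictionary keys, function name, and return labels are hypothetical.

```python
# (accept_threshold, review_threshold) per agreement metric; values from the text,
# interpreted as lower bounds (an assumption).
THRESHOLDS = {
    "alpha": (0.80, 0.60),   # Krippendorff's alpha, retrieval
    "kappa": (0.80, 0.60),   # Fleiss' kappa, classification
    "iaa_f1": (0.85, 0.70),  # span-level IAA-F1, sequence and extraction
}


def triage(metric: str, score: float) -> str:
    """Map an agreement score to the next annotation step."""
    accept, review = THRESHOLDS[metric]
    if score >= accept:
        return "accept"
    if score >= review:
        return "review_meeting"
    return "reannotate"


for metric, score in [("kappa", 0.82), ("iaa_f1", 0.82), ("alpha", 0.50)]:
    print(metric, score, triage(metric, score))
```

The same numeric score can lead to different outcomes depending on the metric, as the 0.82 example shows for kappa versus IAA-F1.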

References

  1. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. React: Synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, 1–5 May 2023.
  2. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 2023, 36, 68539–68551.
  3. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.
  4. Kumar, S.; Reddy, P.K.; Reddy, V.B.; Singh, A. Similarity analysis of legal judgments. In Proceedings of the Fourth Annual ACM Bangalore Conference; Association for Computing Machinery: New York, NY, USA, 2011; pp. 1–4.
  5. Quaresma, P.; Rodrigues, I. A question-answering system for Portuguese juridical documents. In Proceedings of the 10th International Conference on Artificial Intelligence and Law; Association for Computing Machinery: New York, NY, USA, 2005; pp. 256–257.
  6. Kim, M.Y.; Xu, Y.; Goebel, R. Applying a convolutional neural network to legal question answering. In Proceedings of the JSAI International Symposium on Artificial Intelligence; Springer: Cham, Switzerland, 2015; pp. 282–294.
  7. Xiao, G.; Mo, J.; Chow, E.; Chen, H.; Guo, J.; Gong, Z. Multi-Task CNN for classification of Chinese legal questions. In Proceedings of the 2017 IEEE 14th International Conference on e-Business Engineering (ICEBE); IEEE: Piscataway, NJ, USA, 2017; pp. 84–90.
  8. Collarana, D.; Heuss, T.; Lehmann, J.; Lytra, I.; Maheshwari, G.; Nedelchev, R.; Schmidt, T.; Trivedi, P. A question answering system on regulatory documents. In Legal Knowledge and Information Systems; IOS Press: Amsterdam, The Netherlands, 2018; pp. 41–50.
  9. Kim, M.Y.; Rabelo, J.; Babiker, H.K.B.; Rahman, M.A.; Goebel, R. Legal information retrieval and entailment using transformer-based approaches. Rev. Socionetwork Strateg. 2024, 18, 101–121.
  10. Büttner, M.; Habernal, I. Answering legal questions from laymen in german civil law system. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2015–2027.
  11. Zhang, N.N.; Xing, Y. Questions and answers on legal texts based on BERT-BiGRU. In Proceedings of the Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2021; Volume 1828, p. 012035.
  12. Hoppe, C.; Pelkmann, D.; Migenda, N.; Hötte, D.; Schenck, W. Towards intelligent legal advisors for document retrieval and question-answering in german legal documents. In Proceedings of the 2021 IEEE Fourth International Conference on Artificial Intelligence and Knowledge Engineering (AIKE); IEEE: Piscataway, NJ, USA, 2021; pp. 29–32.
  13. Tieu, T.T.; Chau, C.N.; Nguyen, T.S.; Nguyen, L.M. Apply bert-based models and domain knowledge for automated legal question answering tasks at alqac 2021. In Proceedings of the 2021 13th International Conference on Knowledge and Systems Engineering (KSE); IEEE: Piscataway, NJ, USA, 2021; pp. 1–6.
  14. Xiao, C.; Hu, X.; Liu, Z.; Tu, C.; Sun, M. Lawformer: A pre-trained language model for Chinese legal long documents. AI Open 2021, 2, 79–84.
  15. Zhong, H.; Xiao, C.; Tu, C.; Zhang, T.; Liu, Z.; Sun, M. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 5218–5230.
  16. Seo, M.; Kembhavi, A.; Farhadi, A.; Hajishirzi, H. Bi-directional attention flow for machine comprehension. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017.
  17. Wang, S.; Yu, M.; Chang, S.; Jiang, J. A co-matching model for multi-choice reading comprehension. In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2018.
  18. Zhang, W.; Shen, H.; Lei, T.; Wang, Q.; Peng, D.; Wang, X. GLQA: A generation-based method for legal question answering. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN); IEEE: Piscataway, NJ, USA, 2023; pp. 1–8.
  19. Joshi, S. Methods in Legal Contractual Content Generation. Ph.D. Thesis, International Institute of Information Technology, Hyderabad, India, 2023.
  20. Gupta, P.; Jiao, C.; Yeh, Y.T.; Mehri, S.; Eskenazi, M.; Bigham, J.P. InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 505–525.
  21. Li, H.; Ai, Q.; Dong, Q.; Liu, Y. LexiLaw: A Scalable Legal Language Model for Comprehensive Legal Understanding. 2024. Available online: https://github.com/CSHaitao/LexiLaw (accessed on 12 February 2026).
  22. Team GLM; Zeng, A.; Xu, B.; Wang, B.; Zhang, C.; Yin, D.; Zhang, D.; Rojas, D.; Feng, G.; Wang, Z.; et al. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv 2024, arXiv:2406.12793.
  23. Fei, Z.; Zhang, S.; Shen, X.; Zhu, D.; Wang, X.; Ge, J.; Ng, V. InternLM-Law: An Open-Sourced Chinese Legal Large Language Model. In Proceedings of the 31st International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 9376–9392.
  24. Cai, Z.; Cao, M.; Chen, H.; Chen, K.; Chen, K.; Chen, X.; Chen, X.; Chen, Z.; Chen, Z.; Chu, P.; et al. InternLM2 Technical Report. arXiv 2024, arXiv:2403.17297.
  25. He, W.; Wen, J.; Zhang, L.; Cheng, H.; Qin, B.; Li, Y.; Jiang, F.; Chen, J.; Wang, B.; Yang, M. HanFei-1.0. 2023. Available online: https://github.com/siat-nlp/HanFei (accessed on 12 February 2026).
  26. Wu, Y.; Liu, Y.; Liu, Y.; Li, A.; Zhou, S.; Kuang, K. wisdomInterrogatory. GitHub Repository. 2024. Available online: https://github.com/zhihaiLLM/wisdomInterrogatory (accessed on 12 February 2026).
  27. Louis, A.; van Dijck, G.; Spanakis, G. Interpretable long-form legal question answering with retrieval-augmented large language models. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2024; Volume 38, pp. 22266–22275.
  28. Yao, S.; Ke, Q.; Wang, Q.; Li, K.; Hu, J. Lawyer GPT: A legal large language model with enhanced domain knowledge and reasoning capabilities. In Proceedings of the 2024 3rd International Symposium on Robotics, Artificial Intelligence and Information Engineering; Association for Computing Machinery: New York, NY, USA, 2024; pp. 108–112.
  29. Wiratunga, N.; Abeyratne, R.; Jayawardena, L.; Martin, K.; Massie, S.; Nkisi-Orji, I.; Weerasinghe, R.; Liret, A.; Fleisch, B. CBR-RAG: Case-based reasoning for retrieval augmented generation in LLMs for legal question answering. In Proceedings of the International Conference on Case-Based Reasoning; Springer: Cham, Switzerland, 2024; pp. 445–460.
  30. Kalra, R.; Wu, Z.; Gulley, A.; Hilliard, A.; Guan, X.; Koshiyama, A.; Treleaven, P. HyPA-RAG: A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications. In Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U); Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 237–256.
  31. Wu, S.; Liu, Z.; Zhang, Z.; Chen, Z.; Deng, W.; Zhang, W.; Yang, J.; Yao, Z.; Lyu, Y.; Xin, X.; et al. fuzi.mingcha. GitHub Repository. 2023. Available online: https://github.com/irlab-sdu/fuzi.mingcha (accessed on 12 February 2026).
  32. Shu, C.; Zhang, H. Neural programming by example. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2017; Volume 31.
  33. Li, Y.; Choi, D.; Chung, J.; Kushman, N.; Schrittwieser, J.; Leblond, R.; Eccles, T.; Keeling, J.; Gimeno, F.; Dal Lago, A.; et al. Competition-level code generation with alphacode. Science 2022, 378, 1092–1097.
  34. Zhang, D.; Zou, J.; Zhu, G. Multitool Drilling Path Optimization by Multiagent Reinforcement Learning Approach. IEEE Trans. Ind. Inform. 2025, 21, 6210–6219.
  35. Wang, C.; Luo, W.; Dong, S.; Xuan, X.; Li, Z.; Ma, L.; Gao, S. Mllm-tool: A multimodal large language model for tool agent learning. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: Piscataway, NJ, USA, 2025; pp. 6678–6687.
  36. Yue, S.; Chen, W.; Wang, S.; Li, B.; Shen, C.; Liu, S.; Zhou, Y.; Xiao, Y.; Yun, S.; Huang, X.; et al. DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services. arXiv 2023, arXiv:2309.11325.
  37. Yue, S.; Liu, S.; Zhou, Y.; Shen, C.; Wang, S.; Xiao, Y.; Li, B.; Song, Y.; Shen, X.; Chen, W.; et al. LawLLM: Intelligent Legal System with Legal Reasoning and Verifiable Retrieval. In Proceedings of the International Conference on Database Systems for Advanced Applications; Springer: Singapore, 2024; pp. 304–321.
  38. Xiao, C.; Zhong, H.; Guo, Z.; Tu, C.; Liu, Z.; Sun, M.; Zhang, T.; Han, X.; Hu, Z.; Wang, H.; et al. CAIL2019-SCM: A dataset of similar case matching in legal domain. arXiv 2019, arXiv:1911.08962.
  39. Ma, Y.; Shao, Y.; Wu, Y.; Liu, Y.; Zhang, R.; Zhang, M.; Ma, S. LeCaRD: A legal case retrieval dataset for Chinese law system. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval; Association for Computing Machinery: New York, NY, USA, 2021; pp. 2342–2348.
  40. Zhang, H.; Pan, B.; Li, R. Legal judgment elements extraction approach with law article-aware mechanism. Trans. Asian-Low-Resour. Lang. Inf. Process. 2021, 21, 1–15.
  41. Gong, S.; Luo, X. DGGCCM: A hybrid neural model for legal event detection. Artif. Intell. Law 2025, 33, 1109–1149.
  42. Fei, Z.; Shen, X.; Zhu, D.; Zhou, F.; Han, Z.; Huang, A.; Zhang, S.; Chen, K.; Yin, Z.; Shen, Z.; et al. LawBench: Benchmarking Legal Knowledge of Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 7933–7962.
  43. Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783.
  44. Qwen Team. Qwen2.5 Technical Report. arXiv 2024, arXiv:2412.15115.
  45. Baichuan. Baichuan 2: Open Large-scale Language Models. arXiv 2023, arXiv:2309.10305.
  46. Zheng, Y.; Zhang, R.; Zhang, J.; Ye, Y.; Luo, Z.; Feng, Z.; Ma, Y. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024.
  47. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180.
  48. Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; Mann, G. Bloomberggpt: A large language model for finance. arXiv 2023, arXiv:2303.17564.
Figure 1. Subfigure (a) shows that clarifying user queries leads to more accurate and personalized responses. Subfigure (b) shows that utilizing legal tools, rather than relying solely on the model’s parametric knowledge, results in more reliable legal advice.
Figure 2. Model inference process. Subfigure (a) illustrates that when the user’s query is vague, the model leverages its own parametric knowledge to generate a guiding response. Subfigure (b) shows that when the user’s query is clear, the model directly invokes legal tools to generate an enhanced response. Subfigure (c) depicts how the model selects actions, invokes legal tools, and generates the enhanced response.
Figure 3. An example of our constructed end-to-end instruction tuning data.
Figure 4. Human evaluation results comparing the effectiveness of COLLT and baseline models in the free-form legal consultation task across three dimensions: answer accuracy, legal knowledge coverage, and response reasonableness.
Figure 5. ChatGPT-o3 evaluation results comparing the effectiveness of COLLT and baseline models in the free-form legal consultation task across three dimensions: answer accuracy, legal knowledge coverage, and response reasonableness.
Figure 6. Visualization of legal-tool selection; the different colors indicate the model’s choices for each legal tool.
Figure 7. Experimental results of clarification trigger analysis in COLLT. Subfigure (a) illustrates the relationship between query length and clarification trigger frequency. Subfigures (b), (c), and (d), respectively, show score variations across three evaluation dimensions—response accuracy, legal knowledge coverage, and response plausibility—for the model’s replies to five test samples as the number of clarification triggers increases.
Table 2. Joint distribution of the two annotators’ verdicts in the dataset audit phase. The two-rater inter-annotator agreement rate is 93.4% and Fleiss’ kappa is 0.630.

| Two-Annotator Verdict | Samples | Proportion | Status |
| --- | --- | --- | --- |
| 2 Yes, 0 No | 9843 | 86.80% | Agreed (accept) |
| 0 Yes, 2 No | 749 | 6.60% | Agreed (reject) |
| 1 Yes, 1 No | 748 | 6.60% | Disagreement |
| Total | 11,340 | 100% | - |
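The agreement rate and kappa quoted in the caption of Table 2 can be checked directly from the three verdict counts. The sketch below is our own illustration, not code from the paper: it applies the standard Fleiss' kappa formula to two raters and two categories, and the function name `fleiss_kappa_two_category` is hypothetical.

```python
def fleiss_kappa_two_category(patterns):
    """Fleiss' kappa for Yes/No verdicts.

    patterns: list of (n_yes, n_no, n_items) tuples; every item is assumed
    to be rated by the same number of annotators.
    """
    n_raters = patterns[0][0] + patterns[0][1]
    n_items = sum(n for _, _, n in patterns)

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_obs = sum(
        (yes * (yes - 1) + no * (no - 1)) / pairs * n
        for yes, no, n in patterns
    ) / n_items

    # Chance agreement from the pooled category proportions.
    p_yes = sum(yes * n for yes, _, n in patterns) / (n_items * n_raters)
    p_exp = p_yes ** 2 + (1 - p_yes) ** 2

    return p_obs, (p_obs - p_exp) / (1 - p_exp)


# Verdict counts copied from Table 2.
table2 = [(2, 0, 9843), (0, 2, 749), (1, 1, 748)]
agreement, kappa = fleiss_kappa_two_category(table2)
print(f"agreement = {agreement:.1%}, kappa = {kappa:.3f}")
```

Under these assumptions the script reproduces the reported 93.4% agreement and a kappa of 0.630. The gap between the two figures reflects the heavy skew toward "Yes" verdicts, which makes chance agreement high.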
Table 3. Datasets and evaluation metrics of the tasks.

| Task | Dataset | Metric |
| --- | --- | --- |
| Legal Charge Prediction (LCP) | CAIL2018 https://github.com/china-ai-law-challenge/cail2018 (accessed on 5 February 2026) | F1 |
| Legal Article Prediction (LAP) | CAIL2018 https://github.com/china-ai-law-challenge/cail2018 (accessed on 18 February 2026) | F1 |
| Prison Term Prediction (PTP) | CAIL2018 https://github.com/china-ai-law-challenge/cail2018 (accessed on 11 March 2026) | log-distance |
| Argument Mining (AM) | CAIL2022 https://github.com/china-ai-law-challenge (accessed on 18 February 2026) | Accuracy |
| Dispute Focus Identification (DFI) | LAIC2021 https://laic.cjbdi.com/ (accessed on 18 February 2026) | F1 |
| Issue Topic Identification (ITI) | CrimeKgAssitant https://github.com/liuhuanyong/CrimeKgAssitant (accessed on 18 February 2026) | Accuracy |
| Legal Event Detection (LED) | LEVEN https://github.com/thunlp/LEVEN (accessed on 18 February 2026) | F1 |
| Opinion Summarization (OS) | CAIL2022 https://github.com/china-ai-law-challenge (accessed on 18 February 2026) | ROUGE-L |
| Case Analysis (CA) | JEC-QA https://github.com/thunlp/jec-qa (accessed on 18 February 2026) | Accuracy |
Table 4. Performance of COLLT against zero-shot and one-shot baselines on traditional legal NLP tasks. For each base model we report a zero-shot baseline (suffix “-Zero”), a one-shot baseline that receives the same system prompt and tool descriptions as COLLT but only a single in-context example (suffix “-One”), and the instruction-tuned COLLT variant. Deltas in parentheses in the COLLT rows are absolute improvements over max(Zero, One).

| Method | LCP (F1) | LAP (F1) | PTP (Log-Distance) | AM (Accuracy) | DFI (F1) | ITI (Accuracy) | LED (F1) | OS (ROUGE-L) | CA (Accuracy) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| General LLM | | | | | | | | | |
| ChatGLM3-6B-Zero | 0.31 | 0.52 | 0.74 | 0.32 | 0.27 | 0.21 | 0.13 | 0.34 | 0.29 |
| ChatGLM3-6B-One | 0.33 | 0.55 | 0.74 | 0.30 | 0.30 | 0.23 | 0.12 | 0.32 | 0.27 |
| COLLT-GLM | 0.49 (↑ 0.16) | 0.62 (↑ 0.07) | 0.78 (↑ 0.04) | 0.40 (↑ 0.08) | 0.36 (↑ 0.06) | 0.29 (↑ 0.06) | 0.19 (↑ 0.06) | 0.47 (↑ 0.13) | 0.37 (↑ 0.08) |
| LLaMa3-8B-Zero | 0.21 | 0.30 | 0.51 | 0.12 | 0.51 | 0.15 | 0.11 | 0.28 | 0.24 |
| LLaMa3-8B-One | 0.24 | 0.32 | 0.50 | 0.11 | 0.52 | 0.17 | 0.12 | 0.26 | 0.20 |
| COLLT-LLaMa | 0.38 (↑ 0.14) | 0.37 (↑ 0.05) | 0.62 (↑ 0.11) | 0.20 (↑ 0.08) | 0.59 (↑ 0.07) | 0.21 (↑ 0.04) | 0.18 (↑ 0.06) | 0.33 (↑ 0.05) | 0.32 (↑ 0.08) |
| InternLM3-8B-Zero | 0.39 | 0.36 | 0.67 | 0.36 | 0.21 | 0.17 | 0.09 | 0.39 | 0.27 |
| InternLM3-8B-One | 0.43 | 0.39 | 0.68 | 0.35 | 0.21 | 0.19 | 0.11 | 0.38 | 0.26 |
| COLLT-InternLM | 0.54 (↑ 0.11) | 0.45 (↑ 0.06) | 0.72 (↑ 0.04) | 0.43 (↑ 0.07) | 0.23 (↑ 0.02) | 0.23 (↑ 0.04) | 0.15 (↑ 0.04) | 0.43 (↑ 0.04) | 0.30 (↑ 0.03) |
| Qwen2.5-7B-Zero | 0.50 | 0.72 | 0.81 | 0.32 | 0.37 | 0.28 | 0.12 | 0.41 | 0.24 |
| Qwen2.5-7B-One | 0.52 | 0.73 | 0.82 | 0.31 | 0.38 | 0.28 | 0.12 | 0.42 | 0.24 |
| COLLT-Qwen | 0.55 (↑ 0.03) | 0.74 (↑ 0.01) | 0.82 (↑ 0.00) | 0.37 (↑ 0.05) | 0.42 (↑ 0.04) | 0.35 (↑ 0.07) | 0.16 (↑ 0.04) | 0.50 (↑ 0.08) | 0.27 (↑ 0.03) |
| Baichuan2-7B-Zero | 0.43 | 0.27 | 0.65 | 0.19 | 0.19 | 0.11 | 0.08 | 0.24 | 0.19 |
| Baichuan2-7B-One | 0.44 | 0.29 | 0.65 | 0.18 | 0.19 | 0.12 | 0.10 | 0.22 | 0.18 |
| COLLT-Baichuan | 0.47 (↑ 0.03) | 0.34 (↑ 0.05) | 0.69 (↑ 0.04) | 0.20 (↑ 0.01) | 0.22 (↑ 0.03) | 0.15 (↑ 0.03) | 0.12 (↑ 0.02) | 0.26 (↑ 0.02) | 0.20 (↑ 0.01) |
| Legal LLM | | | | | | | | | |
| InternLM-Law | 0.40 | 0.29 | 0.54 | 0.17 | 0.10 | 0.10 | 0.05 | 0.33 | 0.05 |
| LexiLaw | 0.35 | 0.12 | 0.64 | 0.20 | 0.03 | 0.11 | 0.09 | 0.31 | 0.17 |
| Lawyer-LLaMa | 0.30 | 0.17 | 0.47 | 0.14 | 0.11 | 0.08 | 0.04 | 0.30 | 0.07 |
| Fuzi-Mingcha | 0.50 | 0.21 | 0.66 | 0.08 | 0.17 | 0.10 | 0.10 | 0.49 | 0.10 |
| Wisdom-Interrogatory | 0.32 | 0.31 | 0.67 | 0.12 | 0.08 | 0.08 | 0.08 | 0.31 | 0.14 |
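As a worked illustration of how the deltas in Table 4 are defined, the short sketch below recomputes them for the ChatGLM3-6B block as the COLLT score minus the better of the two prompting baselines. The helper name `improvement_over_best_baseline` is ours and is not taken from the paper.

```python
def improvement_over_best_baseline(zero, one, collt):
    """Per-task absolute gain of COLLT over max(zero-shot, one-shot)."""
    return [round(c - max(z, o), 2) for z, o, c in zip(zero, one, collt)]


tasks = ["LCP", "LAP", "PTP", "AM", "DFI", "ITI", "LED", "OS", "CA"]

# Scores copied from the ChatGLM3-6B rows of Table 4.
glm_zero = [0.31, 0.52, 0.74, 0.32, 0.27, 0.21, 0.13, 0.34, 0.29]
glm_one = [0.33, 0.55, 0.74, 0.30, 0.30, 0.23, 0.12, 0.32, 0.27]
collt_glm = [0.49, 0.62, 0.78, 0.40, 0.36, 0.29, 0.19, 0.47, 0.37]

deltas = improvement_over_best_baseline(glm_zero, glm_one, collt_glm)
for task, delta in zip(tasks, deltas):
    print(f"{task}: +{delta:.2f}")
```

Taking the maximum of the two baselines per task, rather than fixing one baseline, makes each delta a conservative estimate of the gain from instruction tuning.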
Table 5. Evaluation results of the legal tools. Values after ± indicate the standard deviation over repeated evaluations.

| Legal Tool | Metric | Result |
| --- | --- | --- |
| Similar Case Retrieval $T_{SCR}$ | NDCG@5 | 77.51 ± 3.11 |
| Legal Article Searching $T_{LAS}$ | Accuracy | 80.01 ± 2.44 |
| Legal Charge Prediction $T_{LCP}$ | Accuracy | 86.02 ± 3.51 |
| Legal Element Recognition $T_{LER}$ | Accuracy | 77.43 ± 3.84 |
| Legal Event Detection $T_{LED}$ | F1 | 83.25 ± 2.74 |
Table 6. Tool ablation. Each row reports the mean performance across the five COLLT-* variants when the indicated tool is removed from T at both training and inference. Bold entries mark the largest drop in each column relative to the no-removal row. The Mean Δ column is the absolute drop averaged across the seven task columns, in percentage points.

| Setting | LCP | LAP | DFI | ITI | LED | OS | CA | Mean Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| COLLT (no removal) | 0.486 | 0.504 | 0.364 | 0.246 | 0.160 | 0.398 | 0.292 | - |
| − $T_{SCR}$ | 0.479 | 0.499 | 0.349 | 0.240 | 0.159 | 0.385 | 0.262 | 1.1 |
| − $T_{LAS}$ | 0.464 | 0.446 | 0.357 | 0.241 | 0.159 | 0.385 | 0.281 | 1.7 |
| − $T_{LCP}$ | 0.418 | 0.498 | 0.357 | 0.241 | 0.159 | 0.391 | 0.279 | 1.5 |
| − $T_{LER}$ | 0.480 | 0.498 | 0.342 | 0.215 | 0.153 | 0.391 | 0.285 | 1.2 |
| − $T_{LED}$ | 0.480 | 0.498 | 0.357 | 0.233 | 0.115 | 0.386 | 0.285 | 1.4 |
| − $T_{Web}$ | 0.485 | 0.498 | 0.363 | 0.241 | 0.160 | 0.391 | 0.285 | 0.4 |
| − all tools | 0.442 | 0.406 | 0.327 | 0.205 | 0.115 | 0.355 | 0.252 | 5.0 |
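The Mean Δ column of Table 6 follows from the caption's definition: the per-task drop relative to the no-removal row, averaged over the seven tasks and expressed in percentage points. A minimal sketch of that arithmetic is given below; the dictionary keys and the function name `mean_drop_pp` are our own labels, not identifiers from the paper.

```python
# Scores copied from Table 6, in the order LCP, LAP, DFI, ITI, LED, OS, CA.
no_removal = [0.486, 0.504, 0.364, 0.246, 0.160, 0.398, 0.292]
ablations = {
    "T_SCR": [0.479, 0.499, 0.349, 0.240, 0.159, 0.385, 0.262],
    "T_LAS": [0.464, 0.446, 0.357, 0.241, 0.159, 0.385, 0.281],
    "T_LCP": [0.418, 0.498, 0.357, 0.241, 0.159, 0.391, 0.279],
    "T_LER": [0.480, 0.498, 0.342, 0.215, 0.153, 0.391, 0.285],
    "T_LED": [0.480, 0.498, 0.357, 0.233, 0.115, 0.386, 0.285],
    "T_Web": [0.485, 0.498, 0.363, 0.241, 0.160, 0.391, 0.285],
    "all tools": [0.442, 0.406, 0.327, 0.205, 0.115, 0.355, 0.252],
}


def mean_drop_pp(baseline, ablated):
    """Average absolute drop across tasks, in percentage points."""
    drops = [b - a for b, a in zip(baseline, ablated)]
    return round(100 * sum(drops) / len(drops), 1)


mean_delta = {name: mean_drop_pp(no_removal, row) for name, row in ablations.items()}
for name, value in mean_delta.items():
    print(f"- {name}: {value}")
```

The recomputed values match the reported column (1.1, 1.7, 1.5, 1.2, 1.4, 0.4, and 5.0). Note that the single-tool drops sum to more than the all-tools drop, which suggests the tools overlap in the information they contribute.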
Table 7. Clarification trigger comparison on the 500-query ambiguity-labeled benchmark. Each row reports trigger-F1 averaged over the five base models in Table 4.

| Setting | Trigger-F1 |
| --- | --- |
| (a) base-vanilla (no system prompt, no tools) | 0.070 |
| (b) base-with-clarify (clarification system prompt, no tuning) | 0.476 |
| (c) COLLT-SFT (clarification learned via instruction tuning) | 0.814 |
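For readers unfamiliar with the metric in Table 7, the sketch below shows one plausible way to compute a trigger-F1. It assumes that the positive class is "the query is ambiguous and clarification should be triggered" and that F1 is the usual harmonic mean of precision and recall; the paper's exact definition may differ, and the six labels used here are invented toy data, not benchmark results.

```python
def trigger_f1(gold, pred):
    """Binary F1 where True means 'clarification should be / was triggered'."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (not from the paper): gold ambiguity annotations
# and whether a model chose to ask a clarifying question.
gold = [True, True, False, True, False, False]
pred = [True, False, False, True, True, False]

score = trigger_f1(gold, pred)
print(f"trigger-F1 = {score:.3f}")
```

Under this reading, a model that never asks for clarification scores near zero regardless of how many clear queries it handles correctly, which is consistent with the low score of the base-vanilla setting.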
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Yang, K.; Sun, J.; Wang, Z.; Xu, C. COLLT: A Multi-Task Optimization Framework for Clarification-Oriented Tool Learning in Legal Large Language Models. Mathematics 2026, 14, 1891. https://doi.org/10.3390/math14111891
