Article

Adaptive Chain-of-Thought Distillation Based on LLM Performance on Original Problems

School of Information Engineering, Chinese People’s Armed Police Force Engineering University, Xi’an 710086, China
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(22), 3646; https://doi.org/10.3390/math13223646
Submission received: 18 September 2025 / Revised: 23 October 2025 / Accepted: 27 October 2025 / Published: 14 November 2025

Abstract

The chain-of-thought (CoT) approach in large language models (LLMs) has markedly enhanced their performance on complex tasks; however, effectively distilling this capability into LLMs with smaller parameter scales remains a challenge. Studies have found that small LLMs do not always benefit from CoT distillation. Inspired by the concept of teaching students in accordance with their aptitude, we propose an adaptive chain-of-thought distillation (ACoTD) framework. The core idea is to dynamically and adaptively customize distillation data and supervision signals for student models based on their performance on the original problems. Specifically, ACoTD initially evaluates and categorizes the original problems according to the capabilities of the student model. Subsequently, for Easy- and Medium-level problems, short CoT distillation is employed as a brief lecture to reinforce knowledge and enhance training efficiency; for high-difficulty problems where the student model underperforms, detailed long CoT distillation is utilized for in-depth explanation to infuse richer reasoning logic. This differentiated distillation strategy ensures that student models grasp the material more effectively. We conducted experiments on multiple benchmark datasets. The results indicate that, compared to the baseline, our method can significantly improve the inference performance of small LLMs. Our method provides a new student-centered paradigm for knowledge distillation, demonstrating that adaptive adjustment of teaching strategies based on student feedback is an effective way to enhance small LLMs’ reasoning ability.

1. Introduction

Recently, artificial intelligence (AI) technologies represented by LLMs have achieved disruptive breakthroughs, demonstrating remarkable human-level capabilities across a wide range of Natural Language Processing (NLP) tasks—including translation, text summarization, code generation, mathematical reasoning, and open-domain dialog. The success of these models stems not only from their massive parameters and vast volumes of training data, but more importantly from the emergence of their complex reasoning capabilities. Among the various techniques aimed at enhancing the reasoning abilities of LLMs, CoT stands out as one of the most influential paradigms. Unlike the traditional “input-output” model, CoT guides the model to generate a series of intermediate reasoning steps before ultimately deriving the answer. This approach, which mimics the way humans solve problems step-by-step, has been proven to significantly improve the model’s performance on complex tasks such as arithmetic and symbolic reasoning.
However, when the same CoT prompting is applied to smaller LLMs (e.g., those with fewer than 7B parameters, or even 1B parameters), their performance is often inadequate—sometimes even worse than directly generating answers. This performance barrier primarily arises from two factors:
  • Limitations in model capacity and knowledge storage: Small-scale LLMs have inherent upper bounds in terms of memory, knowledge retention, and computational capabilities. They struggle to internalize and effectively execute the multi-step, coherent, and logically consistent reasoning process required by CoT. The generated CoT frequently contains logical gaps, factual errors, or fails to derive the correct answer altogether, becoming an ineffective hallucination chain.
  • Flaws in the distillation process: To transfer the CoT capabilities of large LLMs to smaller ones, a straightforward approach is knowledge distillation. Typically, a teacher model (a larger and stronger LLM) is used to generate reasoning chains on the training dataset of a specific task, and these chains are then used as supervision signals to fine-tune a student model (a smaller and weaker LLM). However, the data generated through this method is not always suitable for the student model, leading to inconsistent distillation performance.
Traditional distillation methods aim to minimize the difference between the output of the student model and the teacher model, and their objective function is usually defined as follows [1]:
$$\mathcal{L}_{KD} = \mathbb{E}_{(p,a)\sim D}\left[\, l\big(f_s(p),\, f_t(p)\big) \,\right],$$
where $f_s$ and $f_t$ represent the student model and the teacher model, respectively, $l$ denotes the loss function, and $D$ stands for the training dataset. This method implicitly assumes that $f_t(p)$ is the optimal supervision signal for all questions $p$. However, based on the two aforementioned factors, this assumption does not hold, resulting in low distillation efficiency.
Therefore, our goal is to find an adaptive and optimal supervision signal $\tilde{f}(p)$ for the student model, based on the student’s current ability state $S$ (i.e., its performance on the question set):
$$\mathcal{L}_{KD} = \mathbb{E}_{(p,a)\sim D}\left[\, l\big(f_s(p),\, \tilde{f}(p; S)\big) \,\right],$$
where $\tilde{f}(p; S)$ is a function determined by the student’s state $S$, and it generates CoT supervision signals with different levels of detail for questions of varying difficulty.
We argue that teaching tasks should obey the concept of teaching students in accordance with their aptitude, and instruction should be tailored to the student’s abilities—that is, data for distillation should be selectively chosen based on the characteristics of the student model. Building on this insight, we propose an ACoTD framework. The framework first establishes a problem grading mechanism for the original problems, based on the student model’s performance on the dataset. It then samples a distillation dataset from the graded problems according to the student’s performance:
  • For Easy and Medium problems: The student model is assumed to have already mastered at least part of the required knowledge to a certain extent. Thus, a brief lecture is adopted, using short CoT as fine-tuning signals.
  • For Hard problems: The student model is assumed to have a poor understanding of the relevant knowledge. Thus, “in-depth explanation” is adopted, using long CoT as fine-tuning signals.
Our experiments show that compared with standard distillation methods, our adaptive chain-of-thought distillation significantly improves the reasoning performance of small LLMs, achieving notable gains across multiple benchmarks.
The key innovations and contributions of our proposed ACoTD framework are as follows:
  • We have constructed an extensive problem dataset covering various difficulty levels and established a corresponding problem-grading mechanism;
  • We have introduced the ACoTD framework—a student-model-centric framework that effectively enhances the reasoning capabilities of small LLMs through dynamic, personalized distillation strategies;
  • We have conducted extensive experiments, verifying that ACoTD achieves satisfactory accuracy and effectiveness in reasoning-intensive tasks across multiple benchmarks.

2. Related Work

This section will review the cutting-edge work in two research areas closely related to this study:
  • Distillation for LLMs, which aims to transfer the capabilities of large LLMs to small LLMs;
  • Data Sampling for LLMs Fine-Tuning, which focuses on how to construct high-quality datasets for LLM fine-tuning.
Our work lies at the intersection of these two areas, aiming to optimize the reasoning capabilities of small LLMs through an innovative data collection and distillation strategy.

2.1. Distillation for LLMs

Knowledge distillation (KD) is a classic technique that compresses the knowledge of large, high-performance teacher models into more lightweight student models. Early research mainly focused on distillation at the output level—for instance, by minimizing the Kullback–Leibler (KL) divergence between the student model and the teacher model in terms of their predicted probability distributions (Soft Labels). In the era of LLMs, KD is widely used to develop smaller LLMs under the guidance of larger teacher models, thereby reducing the computational demands associated with large-scale models.
Currently, KD methods for LLMs can be roughly categorized into two types: black-box KD and white-box KD. The former only accesses the text generated by the teacher model, while the latter can also obtain the teacher model’s output distribution or intermediate hidden states. Most existing KD methods for LLMs fall under the category of black-box KD; for example, Lee et al. [2] generated a preference dataset and directly aligned the outputs of the distilled model with those of the full-precision model to ensure that the distilled model’s responses are as close as possible to the original model. White-box KD methods have been relatively less explored, but there are representative studies such as MiniLLM [3].
However, standard KD provides limited information due to its reliance solely on final outputs. Recent studies have explored various distillation approaches to enhance distillation effectiveness: Reasoning Distillation (Hsieh et al.) achieves this goal by training the student model to understand both the final answer and the underlying reasoning [4]; multi-teacher strategies (Tian et al.; Zhang et al.) achieve this by aggregating diverse reasoning paths, enriching distillation data, and improving generalization [5,6]; DeepSeek [7] can generate long CoT directly from the base model through reinforcement learning without Supervised Fine-Tuning (SFT), incorporates cold-start training data, and then performs SFT to further enhance model performance; Shridhar et al. [8] explored iterative distillation, which gradually improves the student model’s reasoning capabilities through multiple rounds of fine-tuning; Wang et al. [9] enhanced distillation quality by sampling multiple reasoning paths from the teacher model and using Self-Consistency to select the most consistent path as a supervision signal; Ye et al. [10] analyzed the quality of distillation data and demonstrated that a small amount of carefully designed data can effectively stimulate the model’s reasoning abilities; Xu et al. [11] obtained high-quality and diverse distillation data from unlabeled data by introducing a dual-criteria sampling rejection strategy.
Nevertheless, most of these methods adopt uniform or random strategies to sample distillation data from predefined datasets. Even when specific sampling methods are used, they are teacher-centric in nature. Essentially, this is a one-size-fits-all teaching approach that fails to tailor instruction to the individual differences in student models.

2.2. Data Sampling for LLM Fine-Tuning

The quality, diversity, and difficulty distribution of fine-tuning data have a decisive impact on the performance of the final model. LIMA (Zhou et al.) fine-tuned the 65B LLaMA model with only 1000 carefully selected high-quality instruction-response pairs (chosen for diversity and clarity), without any reinforcement learning. Despite the small size of the dataset, the resulting model performed exceptionally well: it learned to handle complex queries and even generalized to tasks not seen during training [12]. This “less is more” outcome indicates that fine-tuning only requires a small amount of demonstration data to unlock most of the model’s performance, with data quality being the most critical factor.
Early data sampling methods typically relied on manual annotation or heuristic sampling from existing datasets. For example, Li et al. [13] and Huang et al. [14] created data through manual annotation, which was mainly used for benchmarking.
However, manually creating high-quality data is extremely costly and difficult to scale. Therefore, using LLMs themselves to automatically generate synthetic data has become a mainstream approach [15]. For instance, through well-designed prompt engineering, large LLMs are guided to generate reasoning steps and answers for a large number of original questions, thereby building large-scale training corpora. Whitney et al. [16] leveraged the in-context learning capabilities of LLMs through prompt engineering to induce the generation of higher-quality data, albeit with limitations imposed by model biases; Mahene et al. [17] used existing self-alignment frameworks to automatically generate fine-tuning data, achieving good performance in SFT of small LLMs.
Recently, researchers have begun to focus on the intelligence and adaptability of data collection. Curriculum Learning proposes that models should start learning from simple samples and gradually transition to complex ones, which is similar to the human learning process. Difficulty-based sampling classifies datasets according to predefined difficulty metrics (such as question length, calculation steps, and number of concepts) and performs sampling based on this classification. However, most of these methods rely on predefined, static difficulty metrics, which are defined from the perspective of the questions themselves or the teacher model—for example, the uncertainty of the teacher model in generating answers. A key flaw is that they overlook the perspective of the student model itself: a question deemed easy by the teacher model may still be difficult for the student model to understand.

3. Methods

This section will elaborate on the proposed ACoTD framework in detail. The core idea of this framework is to be student-model-centric: it dynamically evaluates the student model’s knowledge gaps based on its actual performance on different original problems and customizes a distillation dataset for it accordingly, thereby achieving efficient and personalized capability transfer. The entire process mainly consists of two core modules: problem difficulty grading and LLM adaptive distillation. This framework is illustrated in Figure 1.

3.1. Framework

Our ACoTD is designed to address the limitations of the “one-size-fits-all” data strategy in traditional CoT distillation. Its overall workflow is a systematic and automated process, as shown in Figure 1, which mainly includes the following six stages:
  • Original Problem Collection: Collect a large number of unlabeled original problems from open-source channels such as the Internet to form the initial problem set $D_{raw}$.
  • Student Performance Diagnosis: Use the target student model to generate answers and their corresponding CoT for all problems in $D_{raw}$, thereby obtaining a diagnostic report on the model’s current capabilities.
  • Problem Difficulty Grading: Based on the quality of the answers and CoT generated by the student model, dynamically assign a difficulty level d ∈ {Easy, Medium, Hard} to each problem.
  • Adaptive Sampling: According to the difficulty level distribution obtained in the previous step, sample from $D_{raw}$ at the same proportion to construct a smaller-scale but more targeted distillation problem dataset.
  • Differentiated Data Generation: Engage two top-tier teachers—LLM-A and LLM-B. The teacher model LLM-A generates a concise, short CoT for Easy and Medium problems to achieve concise teaching and knowledge consolidation; the teacher model LLM-B generates a detailed and thorough long CoT for Hard problems to realize in-depth teaching and knowledge gap filling. Finally, the final distillation dataset is formed, which is divided into D-Short and D-Long.
  • Supervised Fine-Tuning: Perform SFT on the student model using the high-quality datasets, D-Short and D-Long.
The innovation of this framework lies in the fact that its difficulty definition is not based on the inherent attributes of the problems, but on the interactive performance between the student model and the problems, thereby “teaching students in accordance with their aptitude”.
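Viewed end to end, these six stages form a simple pipeline. The sketch below is only an illustration of how the stages hand data to one another; every callable passed in is a hypothetical placeholder for the components detailed in Sections 3.2 and 3.3, not the authors’ implementation.

```python
def acotd_pipeline(raw_problems, diagnose, grade, sample, generate_cot, sft_train):
    """High-level sketch of the six-stage ACoTD workflow.
    Each argument is a hypothetical callable standing in for one stage
    (diagnosis, grading, adaptive sampling, differentiated CoT generation, SFT)."""
    # 1-2. Student performance diagnosis on the collected problem set D_raw.
    diagnostics = [diagnose(p) for p in raw_problems]   # (answer, CoT) per problem
    # 3. Problem difficulty grading based on the student's own outputs.
    graded = grade(raw_problems, diagnostics)            # {"Easy": [...], "Medium": [...], "Hard": [...]}
    # 4. Adaptive (proportional) sampling of the distillation problem set.
    d_sample = sample(graded, k=1000)
    # 5. Differentiated data generation: short CoT for Easy/Medium, long CoT for Hard.
    d_short, d_long = generate_cot(d_sample)
    # 6. Supervised fine-tuning of the student model on D-Short and D-Long.
    return sft_train(d_short + d_long)
```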

3.2. Problem Difficulty Grading

The core of difficulty grading lies in defining what is “hard” and what is “easy” from the perspective of the student model. Based on the quality of the answer $a_i$ and CoT $C_i$ generated by the student model for the problem $p_i$, we classify the problem into three levels:
  • Easy: If and only if $a_i$ is correct, and $C_i$ is logically rigorous with accurate reasoning. This indicates that the student model has fully mastered the knowledge and reasoning capabilities required to solve such problems, and no extra attention is needed.
  • Medium: If $a_i$ is incorrect, but $C_i$ is basically correct or only has minor flaws (e.g., individual calculation errors, reasonable reasoning steps in parts). This shows that the student model has a roughly correct reasoning framework and masters part of the knowledge needed to handle this question, but has minor omissions or calculation errors in details, requiring consolidated practice.
  • Hard: If $a_i$ is incorrect and $C_i$ is completely wrong or logically confusing. This means the student model has fundamental knowledge gaps in the core concepts or reasoning processes of such problems, requiring focused teaching.
We define a scoring function $G$ that evaluates the difficulty of the question $p_i$ for the student model $M_0$, based on the answer $a_i$ and chain-of-thought $C_i$ generated by $M_0$, as well as the standard answer $a_i^*$:
$$d_i = G(a_i, C_i, a_i^*),$$
where $d_i \in \{\mathrm{Easy}, \mathrm{Medium}, \mathrm{Hard}\}$.
The judgment criteria of this function are formalized as follows:
  • Easy ($d_i = E$):
$$G(a_i, C_i, a_i^*) = E \iff (a_i = a_i^*) \wedge \mathrm{Logical}(C_i) \wedge \mathrm{Complete}(C_i),$$
  • Medium ($d_i = M$):
$$G(a_i, C_i, a_i^*) = M \iff (a_i \neq a_i^*) \wedge \big(\mathrm{PartiallyLogical}(C_i) \vee \mathrm{MinorError}(C_i)\big),$$
  • Hard ($d_i = H$):
$$G(a_i, C_i, a_i^*) = H \iff (a_i \neq a_i^*) \wedge \big(\neg\mathrm{Logical}(C_i) \vee \mathrm{Incomplete}(C_i)\big).$$
Boolean functions such as $\mathrm{Logical}(\cdot)$ and $\mathrm{Complete}(\cdot)$ are used to represent the evaluation of CoT quality.
We submit the answers and CoTs generated by the student model to another, significantly more capable LLM for evaluation. Through prompts, we guide this evaluation LLM to conduct the assessment in accordance with our criteria, and we perform manual verification by cross-referencing with reference answers to ensure the accuracy and high quality of the data. After completing the difficulty grading, we record the proportion of problems at each difficulty level as X:Y:Z (X for Easy, Y for Medium, and Z for Hard questions). The prompt used to evaluate the performance of the student model is shown in Figure 2. The problem difficulty grading process is formalized in Algorithm 1.
Algorithm 1. Difficulty grading procedure for original problems
Require: Original problem dataset Draw = {pi} ▷ Original problem pi
Require: Set of LLMs M = {M0, M1} ▷ M0: student model, M1: teacher model LLM-A
Require: Result set R = {(pi, a*i)} ▷ Each sample contains original problem pi and correct answer a*i
Require: Prompt template P0, used to evaluate the performance of the student model
Ensure: Graded problem datasets Deasy, Dmedium and Dhard
1: Initialize Deasy ← ∅, Dmedium ← ∅, Dhard ← ∅
2: for problem pi in Draw do
3:   Get M0 response (ai, Ci)  ▷ ai: predicted answer, Ci: CoT
4:   Add result R ← R ∪ {(ai, Ci)}
5: end for
6: for each sample (pi, a*i, ai, Ci) in R do
7:   Construct prompt P ← Concatenate(P0, pi, a*i, ai, Ci)
8:   Get M1 prediction Gi
9:   if Gi == Easy then
10:    Deasy ← Deasy ∪ {pi}
11:   else if Gi == Medium then
12:    Dmedium ← Dmedium ∪ {pi}
13:   else if Gi == Hard then
14:    Dhard ← Dhard ∪ {pi}
15:   end if
16: end for
17: return Graded problem datasets Deasy, Dmedium and Dhard
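For concreteness, the grading loop of Algorithm 1 can be sketched in Python as below. This is a minimal illustration rather than the authors’ released code: `student_generate`, `judge_with_llm`, and the prompt template are hypothetical stand-ins for the student model call and the LLM-based evaluator (prompt of Figure 2) described above.

```python
def grade_problems(problems, student_generate, judge_with_llm, grading_prompt):
    """Sketch of Algorithm 1: label each problem Easy/Medium/Hard based on the
    student model's own answer and CoT.
    `student_generate(question)` -> (answer, cot) and `judge_with_llm(prompt)` ->
    "Easy" | "Medium" | "Hard" are hypothetical helpers."""
    graded = {"Easy": [], "Medium": [], "Hard": []}
    for p in problems:                       # p: {"question": ..., "gold": ...}
        answer, cot = student_generate(p["question"])
        prompt = grading_prompt.format(
            question=p["question"], gold=p["gold"], answer=answer, cot=cot
        )
        graded[judge_with_llm(prompt)].append(p)
    # The Easy:Medium:Hard counts give the ratio X:Y:Z used for adaptive sampling.
    ratio = {level: len(items) for level, items in graded.items()}
    return graded, ratio
```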

3.3. LLM Adaptive Distillation

After obtaining the difficulty label for each problem, we conduct adaptive sampling of distillation problems and data generation.
Adaptive Sampling: We first take the proportion (X, Y, and Z) of samples labeled Easy, Medium, and Hard in the initial problem dataset $D_{raw}$. We then sample from the original problem dataset according to this proportion to construct the distillation problem dataset. This ensures that the distillation dataset reproduces the current capability distribution of the student model, allowing training resources to be precisely allocated to the areas where they are most needed.
This process is formalized as follows:
Let the original problem set $D_{raw}$ be classified into three subsets: $D_E$, $D_M$, and $D_H$, with their sizes being $N_E$, $N_M$, and $N_H$, respectively. Our goal is to construct a distillation problem set $D_{sample}$ of size $K$. We adopt proportional stratified sampling to ensure that the difficulty distribution of the sampled dataset is consistent with the original diagnostic distribution:
$$|D_{sample}^{E}| = K \times \frac{N_E}{N}, \qquad |D_{sample}^{M}| = K \times \frac{N_M}{N}, \qquad |D_{sample}^{H}| = K \times \frac{N_H}{N},$$
where $N = N_E + N_M + N_H$ is the total number of original questions, and $D_{sample}^{E}$, $D_{sample}^{M}$, and $D_{sample}^{H}$ denote the problem sets randomly sampled from $D_E$, $D_M$, and $D_H$, respectively. $K$ is the number of sampled problems, a customizable parameter depending on the desired size of the distillation dataset. Finally,
$$D_{sample} = D_{sample}^{E} \cup D_{sample}^{M} \cup D_{sample}^{H}.$$
This sampling strategy ensures that the distillation dataset accurately reflects the distribution of the current capability shortcomings of the student model.
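A minimal sketch of this proportional stratified sampling is given below, assuming the graded subsets from the previous step. `random.sample` draws without replacement, and assigning the rounding remainder to the Hard bucket is an illustrative choice that the paper does not specify.

```python
import random

def stratified_sample(graded, k):
    """Proportional stratified sampling over the graded subsets.
    `graded` maps "Easy"/"Medium"/"Hard" to lists of problems; `k` is the
    desired size of the distillation problem set D_sample."""
    n_total = sum(len(v) for v in graded.values())
    quotas = {level: int(k * len(v) / n_total) for level, v in graded.items()}
    # Give any rounding remainder to the Hard bucket (illustrative choice).
    quotas["Hard"] += k - sum(quotas.values())
    return {level: random.sample(graded[level], quotas[level]) for level in graded}
```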
Differentiated CoT Generation: We adopt two distinct strategies to generate supervision signals for problems of different difficulty levels:
  • For problems labeled Easy and Medium: we use the teacher model LLM-A to generate short CoT. The prompt is designed to require the model to provide concise derivations with only key steps, avoiding verbosity. This corresponds to concise teaching, aiming to efficiently consolidate the knowledge that the student model has already mastered or is close to mastering. The prompt used to generate short CoT is shown in Figure 3.
  • For problems labeled Hard: we use another teacher model LLM-B (or the same model with a different prompt) to generate long and detailed CoT. The prompt requires the model to break down core concepts, explain the basis for each derivation step in detail, and even provide analogies or examples. This corresponds to in-depth teaching, aiming to fill the knowledge gaps of the student model and establish correct reasoning patterns. The prompt used to generate long CoT is shown in Figure 4.
We define two CoT generation functions, corresponding to concise and detailed explanations, respectively:
  • Short CoT generation function:
$$T_S:\; T_S(p) \rightarrow (C_{short}, a),$$
  • Long CoT generation function:
$$T_L:\; T_L(p) \rightarrow (C_{long}, a),$$
where $C_{short}$ and $C_{long}$ represent the concise and detailed CoTs generated for the same question $p$, respectively. Finally, the construction process of the adaptive distillation dataset $D_{distill}$ can be expressed as
$$D_{distill} = \big\{(p,\, T_S(p)) \mid p \in D_{sample}^{E} \cup D_{sample}^{M}\big\} \;\cup\; \big\{(p,\, T_L(p)) \mid p \in D_{sample}^{H}\big\}.$$
After obtaining the short CoT distillation data, D-Short, and the long CoT distillation data, D-Long, we conducted data verification and supervised fine-tuning. To ensure the quality of the distillation data, we implement a set of automated verification processes (such as verifying the results generated by the teacher model using standard answers, or conducting consistency checks with another LLM) combined with a small amount of manual sampling review, so as to filter out erroneous data that may be generated by the teacher model. Finally, we perform supervised fine-tuning on the student model using the verified high-quality dataset. The adaptive distillation process is formalized in Algorithm 2.
Algorithm 2. LLM adaptive distillation procedure
Require: Graded problem datasets Deasy, Dmedium and Dhard
Require: Set of LLMs M = {M0, M1, M2} ▷ M0: student model, M1: teacher model LLM-A, M2: teacher model LLM-B
Require: Prompt template for each teacher model Pj = {P1, P2} ▷ P1: used to generate short CoT, P2: used to generate long CoT
Input: Given numbers 0 < k < min(count(Deasy), count(Dmedium), count(Dhard)), n = 0
Ensure: Distilled student model Mdistill
1: Initialize distillation problem set Dsample ← ∅, distillation dataset Ddistill ← ∅
2: for problem pi in Deasy, n < INT(k count(Deasy)/count(Dmedium)) do
3:   Dsample ← Dsample ∪ {pi}
4:   n ← n + 1
5: end for
6: n ← 0
7: for problem pi in Dmedium, n < INT(k) do
8:   Dsample ← Dsample ∪ {pi}
9:   n ← n + 1
10: end for
11: n ← 0
12: for problem pi in Dhard, n < INT(k count(Dhard)/count(Dmedium)) do
13:   Dsample ← Dsample ∪ {pi}
14:   n ← n + 1
15: end for
16: for problem pi in Dsample do
17:   if pi ∈ Deasy ∪ Dmedium then ▷ difficulty label from Algorithm 1
18:    Construct prompt P ← Concatenate(P1, pi)
19:    Get M1 response (ai, Ci)
20:    Ddistill ← Ddistill ∪ {(pi, ai, Ci)}
21:   else
22:    Construct prompt P ← Concatenate(P2, pi)
23:    Get M2 response (ai, Ci)
24:    Ddistill ← Ddistill ∪ {(pi, ai, Ci)}
25:   end if
26: end for
27: Get Mdistill by SFT of M0 with Ddistill
28: return Mdistill
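The differentiated generation step of Algorithm 2 can be sketched as follows. This is an illustrative outline only: `call_teacher_short` and `call_teacher_long` stand in for API calls to the respective teacher models (Ernie-4.5-Turbo and QwQ-32B in the paper's setup), and the prompt strings merely paraphrase the intent of Figures 3 and 4 rather than reproducing them.

```python
# Hypothetical prompt templates paraphrasing Figures 3 and 4.
SHORT_COT_PROMPT = "Solve the problem with a concise derivation, key steps only:\n{q}"
LONG_COT_PROMPT = ("Solve the problem with a detailed explanation: break down the core "
                   "concepts and justify every derivation step:\n{q}")

def build_distill_set(sample, call_teacher_short, call_teacher_long):
    """Sketch of the differentiated CoT generation in Algorithm 2.
    `sample` maps "Easy"/"Medium"/"Hard" to sampled problems; the two
    `call_teacher_*` callables wrap the teacher-model APIs and return (cot, answer)."""
    d_short, d_long = [], []
    for level, problems in sample.items():
        for p in problems:
            if level in ("Easy", "Medium"):
                cot, ans = call_teacher_short(SHORT_COT_PROMPT.format(q=p["question"]))
                d_short.append({"question": p["question"], "cot": cot, "answer": ans})
            else:
                cot, ans = call_teacher_long(LONG_COT_PROMPT.format(q=p["question"]))
                d_long.append({"question": p["question"], "cot": cot, "answer": ans})
    return d_short, d_long  # D-Short and D-Long, later verified and used for SFT
```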

4. Experiment

To comprehensively evaluate the effectiveness of the proposed ACoTd framework, we designed and conducted a series of experiments. This section will detail the experimental setup, the datasets, and the evaluation metrics used. Additionally, we conducted an in-depth analysis of the role of each core component in ACoTd through ablation studies.

4.1. Datasets

In this experiment, multiple publicly available and challenging mathematical reasoning datasets were selected to comprehensively and fairly evaluate the model’s reasoning ability on mathematical problems of different difficulty levels and types. These datasets are all sourced from public academic resources or authoritative competition platforms, ensuring the reproducibility and comparability of the experiment. Original data is divided into training data and benchmark data.

4.1.1. Training Data

Our training data was mainly sourced from
  • NuminaMath [18]: An open-source large language model series focused on mathematical problem-solving and its supporting dataset. It is the largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions.
  • Historical AIME [19]: It includes past real questions from the American Invitational Mathematics Examination (AIME). These questions are known for their high difficulty and requirement for multi-step reasoning, making them highly suitable for testing the ability of LLMs to solve complex mathematical problems.
  • GSM8K [20]: Proposed by OpenAI researchers in 2021, it contains approximately 8500 high-quality English elementary school math word problems, covering basic mathematical knowledge such as addition, subtraction, multiplication, division, fractions, decimals, and percentages. These problems typically require multi-step reasoning and are divided into a training set (7473 problems) and a test set (1319 problems).
  • OlympicArena [21]: A large-scale, interdisciplinary, and high-quality benchmark dataset, including 11,163 problems, covering 62 international Olympic competitions. By collecting difficult problems at the level of international Olympic competitions, it sets a high standard for evaluating the superintelligence of models.
  • GAOKAO [22]: a large model evaluation dataset centered on real questions from China’s National College Entrance Examination (Gaokao), covering a total of 2811 questions, including multiple-choice questions, fill-in-the-blank questions, and problem-solving questions. It is designed to evaluate the ability of large models in complex language understanding and logical reasoning tasks.

4.1.2. Benchmark Data

Our benchmark data are mainly sourced from
  • AMC23 [23]: It contains 40 questions from the American Mathematics Competition 12 (AMC12) 2022 and AMC12 2023. The original AMC12 questions are multiple-choice questions with four options; the authors revised the problem statements into a form that requires integer outputs, and questions whose statements could not be revised were excluded.
  • AIME25 [24]: It comprises challenging mathematics competition problems from the American Invitational Mathematics Examination of 2025.
  • GSM8K-TEST: It contains 1319 problems from GSM8K.
  • MATH500 [25]: It contains 500 carefully selected mathematical problems, which are divided into five difficulty levels ranging from simple to complex. This enables researchers to meticulously evaluate the model’s performance under different challenge levels.

4.2. Metrics and Baseline

4.2.1. Metrics

We use the following metrics to quantify model performance:
  • Accuracy (Acc): The correctness rate of the model’s final answers compared with the ground-truth option, serving as the primary performance evaluation metric. It is computed as
$$\mathrm{Accuracy} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\,[\hat{a}_i = a_i],$$
where
$n$ is the number of all problems;
$\hat{a}_i$ is the LLM’s answer to the $i$th problem;
$a_i$ is the ground-truth answer to the $i$th problem; and $\mathbf{1}[\cdot]$ is the indicator function, which returns 1 when the argument is true and 0 otherwise.
  • CoT Quality Score (CoTQS): To assess the quality of the reasoning process, we use Deepseek-R1 as the evaluator. Deepseek-R1 rates the CoT generated by the model on a scale of 1 to 5 across three dimensions—logical coherence, step completeness, and factual correctness—and then calculates the average score. The prompt used to generate the CoT quality score is shown in Figure 5.
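As a small illustration of the two metrics, the snippet below computes accuracy via the indicator function above and averages per-dimension CoT quality ratings; the 1-to-5 ratings themselves would come from the DeepSeek-R1 evaluator prompt (Figure 5), which is not reproduced here.

```python
def accuracy(predicted, gold):
    """Fraction of problems whose final answer matches the ground truth."""
    assert len(predicted) == len(gold)
    return sum(1 for a_hat, a in zip(predicted, gold) if a_hat == a) / len(gold)

def cot_quality_score(ratings):
    """Average of the evaluator's 1-5 ratings over logical coherence,
    step completeness, and factual correctness for one CoT."""
    return sum(ratings) / len(ratings)

# Example: accuracy([4, 7, 12], [4, 9, 12]) -> 0.666..., cot_quality_score([5, 4, 4]) -> 4.33...
```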

4.2.2. Baseline

We compare against the following three baselines: DeepSeek-R1-Distill-Qwen-7B, Qwen3-1.7B (CoT on), and Qwen3-1.7B (CoT off).
  • DeepSeek-R1-Distill-Qwen-7B: An efficient reasoning model with 7B parameters that is more lightweight, based on Qwen2.5-7B as the base model and distilled from DeepSeek-R1;
  • Qwen3-1.7B (CoT on): A lightweight open-source large model with 1.7B parameters (including 1.4B non-embedding parameters) that supports switching between thinking mode and non-thinking mode—this is the thinking mode;
  • Qwen3-1.7B (CoT off): non-thinking mode, where the output will not include a chain of thought.

4.3. Experiment Settings

4.3.1. Models

In terms of model selection, in line with our goal of enhancing the reasoning ability of small-scale LLMs, the student model should be an easily accessible open pre-trained model with a parameter count of less than 7B; since the adaptive distillation of ACoTd requires generating both long CoT and short CoT, it is preferable to select two different LLMs as teacher models—one responsible for generating long CoT and the other for short CoT—to avoid manual prompt switching.
  • Student Model: Considering the flexibility of our hardware conditions, we have selected Qwen3-1.7B as the student model and access it through a local deployment.
  • Teacher Model: We utilize Ernie-4.5-Turbo and Qwen-QwQ-32B as our teacher models, and these models can be accessed online via API. Among them, Ernie-4.5-Turbo serves as the teacher model for generating short CoT, while Qwen-QwQ-32B acts as the teacher model for generating long CoT.

4.3.2. Data

We collected a total of more than 10k problems from open-source resources (the datasets listed in Section 4.1.1) to form the original problem dataset $D_{raw}$. After difficulty grading based on the student model’s outputs (the procedure described in Section 3.2), the ratio of Easy, Medium, and Hard questions was 5.3:1:3.1. The sampled distillation problem dataset contained 1000 questions in total, with the generated long CoT training dataset having 329 CoTs and the short CoT training dataset having 671 CoTs. Among the test benchmark datasets, MATH500 included 500 problems, AIME25 had 30 problems, GSM8K-TEST contained 1319 problems, and AMC23 included 40 problems.
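A quick check shows that the reported split of the 1000 sampled problems is simply the stratified-sampling formula of Section 3.3 applied to this ratio, with small differences due to rounding:
$$5.3 + 1 + 3.1 = 9.4, \qquad |D_{sample}^{H}| \approx 1000 \times \tfrac{3.1}{9.4} \approx 330 \;(\text{reported: }329), \qquad |D_{sample}^{E}| + |D_{sample}^{M}| \approx 1000 \times \tfrac{6.3}{9.4} \approx 670 \;(\text{reported: }671).$$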

4.3.3. Training Details

Our training is based on the LLaMA-Factory [26] framework. This framework integrates various mainstream and efficient training and fine-tuning techniques, supports adaptation to major open-source models on the market, and provides multiple high-level abstract calling interfaces. It also includes functions such as multi-stage training, inference testing, benchmark evaluation, and API services. It can improve computational efficiency while minimizing memory requirements as much as possible.
In our training setup, we used the AdamW optimizer and employed training with LoRA [27] fine-tuning, setting the LoRA rank to 32, a learning rate of 1 × 10−5, a batch size of 16, 5 training epochs, and a context window of 16384 tokens. During inference, we used a temperature of 0.6, max tokens set to 8196, and a top-p value of 0.9. More hyperparameters are listed in Table 1, and shared hyperparameters with baselines are listed in Table 2. All experiments were conducted on an NVIDIA A40 48G GPU.
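For reference, the LoRA setup in Table 1 corresponds roughly to the configuration sketched below. We express it with the Hugging Face peft and transformers APIs rather than a LLaMA-Factory config file; the output path is a placeholder, and this is an illustration of the hyperparameters, not the authors’ actual training script.

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# Student model (Qwen3-1.7B in the paper), wrapped with a rank-32 LoRA adapter.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
lora_cfg = LoraConfig(r=32, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Optimization hyperparameters mirroring Table 1 (AdamW is the default optimizer).
args = TrainingArguments(
    output_dir="acotd-qwen3-1.7b",        # placeholder path
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
)
# The SFT step itself (e.g., via LLaMA-Factory) would consume the verified
# D-Short and D-Long datasets with the 16,384-token sequence length.
```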

4.3.4. Ablation Experiment Settings

To validate the contributions of key ACoTd components, we evaluate variants with certain components omitted or replaced. The changes are as follows:
  • Remove the step of sampling distillation questions based on the student model’s level, and instead randomly sample an equal amount of distillation data;
  • Fix the ratio of Easy, Medium, and Hard data to 6:3:1, without dynamically adjusting it based on student model performance;
  • Use only short chains of thought when generating CoT distillation data;
  • Use only long chains of thought when generating CoT distillation data.
The specific results will be discussed in Section 5.

5. Result Analysis

This section presents the experimental results verifying the effectiveness of our ACoTd framework, along with relevant analyses of the results, specifically including two parts: main results and ablation results.

5.1. Main Results

Table 3 presents the results of controlled experiments evaluating the performance of LLMs across multiple mathematical problem benchmark datasets. Besides the baselines, we also include, for comparison, results on the same benchmarks from the Curriculum Learning line of work. Through analysis of these results, we can clearly observe the following:
  • Significant Advantages of ACoTd: Our proposed ACoTd method achieved the best performance across all datasets compared with different baseline models. ACoTd improved the accuracy of the Qwen3-1.7B model by over 10% on all four benchmarks, with an even more impressive 20% accuracy increase on the highly challenging AIME25. From the CoTQS perspective, the quality of reasoning steps also showed significant improvement. Furthermore, when compared with DeepSeek-R1-Distill-Qwen-7B, ACoTd outperformed this larger LLM by a considerable margin. This fully demonstrates the effectiveness of the adaptive strategy based on the actual performance of the student model. Notably, while DeepSeek-R1-Distill-Qwen-7B was only more effective than Qwen3-1.7B (CoT on) on the GSM8K-TEST benchmark, it still comprehensively outperformed Qwen3-1.7B (CoT off). This is because DeepSeek-R1-Distill-Qwen-7B, with its larger parameter size, possesses better fundamental capabilities. However, once the reasoning ability of a smaller LLM is stimulated through CoT, its performance can match or even surpass larger LLMs. This indicates that CoT brings substantial performance improvements, and from another perspective, confirms that our method is superior in terms of performance among existing distillation methods.
  • Limitations of Traditional Distillation: Even though the DeepSeek-R1-Distill-Qwen-7B model (also a distilled model) can generate CoT, its performance is relatively poor. This confirms that traditional one-size-fits-all distillation strategies cannot focus on strengthening the weak points of student models.
  • Differences Between Static and Dynamic Difficulty Assessment: By comparing the common benchmark assessment used in Curriculum Learning with our work, we find that Curriculum Learning, based on static metrics, performs significantly worse. This indicates that dynamic difficulty assessment from the student model’s perspective better reflects actual learning needs than static assessment from the problem or teacher’s perspective, thereby bringing more efficient performance improvements.

5.2. Ablation Results

We conducted four ablation studies on the aforementioned benchmarks to verify the effectiveness of our method from different perspectives. The results are shown in Table 4.
Analysis of the results shows the following:
  • Core Role of the Difficulty Classification Mechanism: When the difficulty classification mechanism is removed and the method degrades to random sampling, the performance drops most significantly. This proves that accurately classifying questions based on the student model’s performance is the key to the success of ACoTd.
  • Importance of Adaptive Sampling: When a fixed sampling ratio is adopted (e.g., Easy/Medium/Hard = 6:3:1) instead of adaptive sampling based on the original distribution, the performance decreases slightly. This indicates that maintaining a difficulty distribution matching the student model’s capabilities is crucial.
  • Effectiveness of Differentiated Supervision Signals: Performance decreases when only short CoT or only long CoT is used. This demonstrates the necessity of the “teaching students in accordance with their aptitude” differentiated strategy: concise explanation (short CoT) for already mastered knowledge can improve efficiency, while detailed explanation (long CoT) for unmastered knowledge can make up for weaknesses. Using only long CoT leads to low training efficiency and may introduce redundant noise, whereas using only short CoT fails to provide sufficient reasoning details for difficult questions.
In addition, we visualized the model’s performance improvement on problems of different benchmarks in Figure 6. The results show that the performance improvement brought by ACoTd mainly comes from the significant improvement in accuracy on AIME25 and AMC23, which are mainly Hard- and Medium-difficulty questions, directly verifying that our method can effectively target the knowledge gaps of the student model.

6. Conclusions

This paper aims to address the core challenge of efficiently distilling the strong reasoning capabilities of large-scale LLMs into smaller student models. Traditional distillation methods adopt a “one-size-fits-all” strategy. This approach overlooks the significant differences in the student model’s mastery of knowledge across problems of varying difficulty, ultimately leading to low distillation efficiency and limited generalization ability.
To tackle this issue, we propose a student-centered ACoTd framework. The core innovation of ACoTd lies in the introduction of a problem difficulty classification mechanism and a dynamic difficulty-aware mechanism. First, based on the student model’s actual performance on the original problems (i.e., the quality of generated answers and CoTs), this mechanism automatically classifies problems into three levels: Easy, Medium, and Hard. Subsequently, leveraging this difficulty distribution, it adaptively samples from the original problems to construct a distillation dataset and matches differentiated supervision signals to problems of different difficulty levels: short CoTs are used for concise explanation and consolidation of knowledge for Easy and Medium problems, while long and detailed CoTs are employed for in-depth explanation and addressing knowledge gaps for Hard problems.
Through extensive experiments on multiple mathematical reasoning benchmarks, we verified the effectiveness of the ACoTd framework. The results show that compared with traditional distillation methods, ACoTd significantly improves the performance of distilled student models on complex reasoning tasks. Ablation studies further confirm that the three core components—difficulty classification, adaptive sampling, and differentiated supervision—are all key factors for enhancing model performance.
Despite the promising results achieved by ACoTd, this study still has limitations. It primarily focuses on empirical verification, and the theoretical analysis of why ACoTd works remains insufficient. Future work can strive to provide a more solid theoretical foundation for it, and we believe the approach would be more effective if integrated with reinforcement learning or rejection sampling. We believe that this student-centered, data-driven methodological paradigm holds significant importance and broad application prospects for advancing the development of efficient, lightweight reasoning models.

Author Contributions

Conceptualization, J.S. and X.C.; methodology, J.S.; software, J.S.; validation, J.S. and Z.G.; formal analysis, J.S.; investigation, J.S.; resources, X.C.; writing—original draft preparation, J.S.; writing—review and editing, X.S.; visualization, X.S.; supervision, X.C.; project administration, Z.G.; funding acquisition, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

All data uploaded to the cloud are from open-source datasets and do not involve any personal or privacy risks.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  2. Lee, J.; Park, S.; Hong, S.; Kim, M.; Chang, D.-S.; Choi, J. Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024. [Google Scholar]
  3. Gu, Y.; Dong, L.; Wei, F.; Huang, M. MiniLLM: Knowledge Distillation of Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  4. Hsieh, C.-Y.; Li, C.-L.; Yeh, C.-K.; Nakhost, H.; Fujii, Y.; Ratner, A.; Krishna, R.; Lee, C.-Y.; Pfister, T. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. arXiv 2023, arXiv:2305.02301. [Google Scholar] [CrossRef]
  5. Tian, Y.; Han, Y.; Chen, X.; Wang, W.; Chawla, N.V. Beyond Answers: Transferring Reasoning Capabilities to Smaller LLMs Using Multi-Teacher Knowledge Distillation. arXiv 2024, arXiv:2402.04616. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Liu, H.; Xiao, Y.; Amoon, M.; Zhang, D.; Wang, D.; Yang, S.; Quek, C. LLM-Enhanced Multi-Teacher Knowledge Distillation for Modality-Incomplete Emotion Recognition in Daily Healthcare. IEEE J. Biomed. Health Inform. 2024, 29, 6406–6416. [Google Scholar] [CrossRef] [PubMed]
  7. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. Deepseek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar] [CrossRef]
  8. Shridhar, K.; Stolfo, A.; Sachan, M. Distilling Reasoning Capabilities into Smaller Language Models. arXiv 2022, arXiv:2212.00193. [Google Scholar] [CrossRef]
  9. Wang, P.; Wang, Z.; Li, Z.; Gao, Y.; Yin, B.; Ren, X. SCOTT: Self-Consistent Chain-of-Thought Distillation. arXiv 2023, arXiv:2305.01879. [Google Scholar] [CrossRef]
  10. Ye, Y.; Huang, Z.; Xiao, Y.; Chern, E.; Xia, S.; Liu, P. Limo: Less is More for Reasoning. arXiv 2025, arXiv:2502.03387. [Google Scholar] [CrossRef]
  11. Xu, J.; Zhou, M.; Liu, W.; Liu, H.; Han, S.; Zhang, D. TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers’ Guidance. arXiv 2025, arXiv:2503.24198. [Google Scholar] [CrossRef]
  12. Zhou, C.; Liu, P.; Xu, P.; Iyer, S.; Sun, J.; Mao, Y.; Ma, X.; Efrat, A.; Yu, P.; Yu, L.; et al. Lima: Less is More for Alignment. arXiv 2023, arXiv:2305.11206. [Google Scholar] [CrossRef]
  13. Li, J.; Beeching, E.; Tunstall, L.; Lipkin, B.; Soletskyi, R.; Huang, S.; Rasul, K.; Yu, L.; Jiang, A.Q.; Shen, Z.; et al. Numinamath: The Largest Public Dataset in AI4Maths with 860k Pairs of Competition Math Problems and Solutions. Available online: http://faculty.bicmr.pku.edu.cn/~dongbin/Publications/numina_dataset.pdf (accessed on 15 September 2025).
  14. Huang, Z.; Wang, Z.; Xia, S.; Li, X.; Zou, H.; Xu, R.; Fan, R.Z.; Ye, L.; Chern, E.; Ye, Y.; et al. Olympicarena: Benchmarking Multi-Discipline Cognitive Reasoning for Superintelligent AI. arXiv 2024, arXiv:2406.12753. [Google Scholar] [CrossRef]
  15. Zorik, G.; Jonathan, H.; Roee, A.; Chen, E.; Idan, S. TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models. arXiv 2023, arXiv:2305.11171. [Google Scholar] [CrossRef]
  16. Whitney, C.; Jansen, E.; Laskowski, V.; Barbieri, C. Adaptive Prompt Regeneration and Dynamic Response Structuring in Large Language Models Using the Dynamic Query-Response Calibration Protocol. OSF Prepr. 2024. [Google Scholar] [CrossRef]
  17. Mahene, A.; Pereira, D.; Kowalski, V.; Novak, E.; Moretti, C.; Laurent, J. Automated Dynamic Data Generation for Safety Alignment in Large Language Models. TechRxiv 2024. [Google Scholar] [CrossRef] [PubMed]
  18. NuminaMath. Available online: https://huggingface.co/collections/AI-MO/numinamath-6697df380293bcfdbc1d978c (accessed on 15 September 2025).
  19. AIME-1983-2024. Available online: https://huggingface.co/datasets/gneubig/aime-1983-2024 (accessed on 15 September 2025).
  20. GSM8K. Available online: https://huggingface.co/datasets/openai/gsm8k (accessed on 15 September 2025).
  21. OlympicArena. Available online: https://huggingface.co/datasets/GAIR/OlympicArena (accessed on 15 September 2025).
  22. GAOKAO. Available online: https://huggingface.co/datasets/xDAN-Vision/GAOKAO_2010_to_2022 (accessed on 15 September 2025).
  23. AMC23. Available online: https://huggingface.co/datasets/zwhe99/amc23 (accessed on 15 September 2025).
  24. AIME25. Available online: https://huggingface.co/datasets/math-ai/aime25 (accessed on 15 September 2025).
  25. MATH500. Available online: https://huggingface.co/datasets/HuggingFaceH4/MATH-500 (accessed on 15 September 2025).
  26. Zheng, Y.; Zhang, R.; Zhang, J.; Ye, Y.; Luo, Z.; Feng, Z.; Ma, Y. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. arXiv 2024, arXiv:2403.13372. [Google Scholar] [CrossRef]
  27. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar] [CrossRef]
Figure 1. The framework for adaptive chain-of-thought distillation.
Figure 2. The prompt used to evaluate the performance of the student model.
Figure 3. The prompt used to generate short CoT.
Figure 4. The prompt used to generate long CoT.
Figure 5. The prompt used to generate the CoT quality score.
Figure 6. Performance improvement on problems of different benchmarks.
Table 1. Hyperparameters.
Hyperparameter | Value
Epoch | 5
Learning Rate | 1 × 10−5
Batch Size | 16
Packing | true
Scheduler Type | cosine
Warmup Rate | 0.03
Weight Decay | 0.01
Sequence Length | 16,384
LoRA Rank | 32
LoRA Alpha | 32
Dropout | 0.1
Validation Steps | 16
Checkpoint Interval | 64
Table 2. Shared hyperparameters with baselines.
Hyperparameter | Value
Temperature | 0.6
Top-p | 0.9
Max Tokens | 16,384
Table 3. Main results, comparison with baselines.
Models | AMC23 Acc | AMC23 CoTQS | AIME25 Acc | AIME25 CoTQS | MATH500 Acc | MATH500 CoTQS | GSM8K-TEST Acc | GSM8K-TEST CoTQS
Our Method | 0.8 ± 0.02 | 4.42 ± 0.131 | 0.511 ± 0.031 | 2.8 ± 0.089 | 0.91 ± 0.03 | 4.6 ± 0.021 | 0.953 ± 0.01 | 4.79 ± 0.08
DeepSeek-R1-Distill-Qwen-7B | 0.483 ± 0.06 | 2.43 ± 0.231 | 0.122 ± 0.063 | 0.9 ± 0.198 | 0.712 ± 0.038 | 3.7 ± 0.099 | 0.926 ± 0.018 | 4.52 ± 0.267
Qwen3-1.7B (CoT off) | 0.358 ± 0.58 | 3.38 ± 0.099 | 0.067 ± 0.05 | 1.41 ± 0.17 | 0.802 ± 0.009 | 4.15 ± 0.237 | 0.851 ± 0.021 | 4.16 ± 0.304
Qwen3-1.7B (CoT on) | 0.692 ± 0.023 | 3.83 ± 0.062 | 0.256 ± 0.016 | 1.71 ± 0.134 | 0.857 ± 0.052 | 4.42 ± 0.269 | 0.908 ± 0.042 | 4.58 ± 0.206
Table 4. Ablation results, comparison with baselines.
Models | AMC23 Acc | AMC23 CoTQS | AIME25 Acc | AIME25 CoTQS | MATH500 Acc | MATH500 CoTQS | GSM8K-TEST Acc | GSM8K-TEST CoTQS
Our Method | 0.8 ± 0.02 | 4.42 ± 0.131 | 0.511 ± 0.031 | 2.8 ± 0.089 | 0.91 ± 0.03 | 4.6 ± 0.021 | 0.953 ± 0.01 | 4.79 ± 0.08
Random Sampling | 0.708 ± 0.12 | 3.74 ± 0.082 | 0.255 ± 0.016 | 1.94 ± 0.062 | 0.864 ± 0.016 | 4.43 ± 0.041 | 0.923 ± 0.008 | 4.52 ± 0.074
Fixed Ratio | 0.767 ± 0.023 | 4.01 ± 0.122 | 0.411 ± 0.041 | 2.52 ± 0.057 | 0.902 ± 0.003 | 4.54 ± 0.063 | 0.949 ± 0.009 | 4.67 ± 0.081
Only Short CoT | 0.75 ± 0.02 | 3.88 ± 0.126 | 0.345 ± 0.032 | 1.98 ± 0.047 | 0.897 ± 0.11 | 4.59 ± 0.037 | 0.947 ± 0.012 | 4.77 ± 0.008
Only Long CoT | 0.758 ± 0.023 | 3.98 ± 0.109 | 0.322 ± 0.016 | 2 ± 0.047 | 0.896 ± 0.011 | 4.58 ± 0.057 | 0.945 ± 0.002 | 4.77 ± 0.005
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
