Article

Adaptive Chain-of-Thought Distillation Based on LLM Performance on Original Problems

School of Information Engineering, Chinese People’s Armed Police Force Engineering University, Xi’an 710086, China
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(22), 3646; https://doi.org/10.3390/math13223646
Submission received: 18 September 2025 / Revised: 23 October 2025 / Accepted: 27 October 2025 / Published: 14 November 2025

Abstract

The chain-of-thought (CoT) approach in large language models (LLMs) has markedly enhanced their performance on complex tasks; however, effectively distilling this capability into LLMs with smaller parameter scales remains a challenge. Studies have found that small LLMs do not always benefit from CoT distillation. Inspired by the concept of teaching students in accordance with their aptitude, we propose an adaptive chain-of-thought distillation (ACoTD) framework. The core idea is to dynamically and adaptively customize distillation data and supervision signals for student models based on their performance on the original problems. Specifically, ACoTD initially evaluates and categorizes the original problems according to the capabilities of the student model. Subsequently, for Easy- and Medium-level problems, short CoT distillation is employed as a brief lecture to reinforce knowledge and enhance training efficiency; for high-difficulty problems where the student model underperforms, detailed long CoT distillation is utilized for in-depth explanation to infuse richer reasoning logic. This differentiated distillation strategy ensures that student models grasp the material more effectively. We conducted experiments on multiple benchmark datasets. The results indicate that, compared to the baseline, our method can significantly improve the inference performance of small LLMs. Our method provides a new student-centered paradigm for knowledge distillation, demonstrating that adaptive adjustment of teaching strategies based on student feedback is an effective way to enhance small LLMs’ reasoning ability.

1. Introduction

Recently, artificial intelligence (AI) technologies represented by LLMs have achieved disruptive breakthroughs, demonstrating remarkable human-level capabilities across a wide range of Natural Language Processing (NLP) tasks—including translation, text summarization, code generation, mathematical reasoning, and open-domain dialog. The success of these models stems not only from their massive parameters and vast volumes of training data, but more importantly from the emergence of their complex reasoning capabilities. Among the various techniques aimed at enhancing the reasoning abilities of LLMs, CoT stands out as one of the most influential paradigms. Unlike the traditional “input-output” model, CoT guides the model to generate a series of intermediate reasoning steps before ultimately deriving the answer. This approach, which mimics the way humans solve problems step-by-step, has been proven to significantly improve the model’s performance on complex tasks such as arithmetic and symbolic reasoning.
However, when the same CoT prompting is applied to smaller LLMs (e.g., those with fewer than 7B parameters, or even 1B parameters), their performance is often inadequate—sometimes even worse than directly generating answers. This performance barrier primarily arises from two factors:
  • Limitations in model capacity and knowledge storage: Small-scale LLMs have inherent upper bounds in terms of memory, knowledge retention, and computational capabilities. They struggle to internalize and effectively execute the multi-step, coherent, and logically consistent reasoning process required by CoT. The generated CoT frequently contains logical gaps, factual errors, or fails to derive the correct answer altogether, becoming an ineffective hallucination chain.
  • Flaws in the distillation process: To transfer the CoT capabilities of large LLMs to smaller ones, a straightforward approach is knowledge distillation. Typically, a teacher model (a larger and stronger LLM) is used to generate reasoning chains on the training dataset of a specific task, and these chains are then used as supervision signals to fine-tune a student model (a smaller and weaker LLM). However, the data generated through this method is not always suitable for the student model, leading to inconsistent distillation performance.
Traditional distillation methods aim to minimize the difference between the output of the student model and the teacher model, and their objective function is usually defined as follows [1]:
$$\mathcal{L}_{KD} = \mathbb{E}_{(p,a)\sim D}\left[\, l\big(f_s(p),\, f_t(p)\big) \,\right],$$
where $f_s$ and $f_t$ represent the student model and the teacher model, respectively, $l$ denotes the loss function, and $D$ stands for the training dataset. This method implicitly assumes that $f_t(p)$ is the optimal supervision signal for all questions $p$. However, based on the two aforementioned factors, this assumption does not hold, resulting in low distillation efficiency.
Therefore, our goal is to find an adaptive and optimal supervision signal $\tilde{f}(p)$ for the student model, based on the student’s current ability state $S$ (i.e., its performance on the question set):
$$\mathcal{L}_{KD} = \mathbb{E}_{(p,a)\sim D}\left[\, l\big(f_s(p),\, \tilde{f}(p; S)\big) \,\right],$$
where $\tilde{f}(p; S)$ is a function determined by the student’s state $S$, and it generates CoT supervision signals with different levels of detail for questions of varying difficulty.
We argue that teaching tasks should obey the concept of teaching students in accordance with their aptitude, and instruction should be tailored to the student’s abilities—that is, data for distillation should be selectively chosen based on the characteristics of the student model. Building on this insight, we propose an ACoTD framework. The framework first establishes a problem grading mechanism for the original problems, based on the student model’s performance on the dataset. It then samples a distillation dataset from the graded problems according to the student’s performance:
  • For Easy and Medium problems: The student model is assumed to have already mastered at least part of the required knowledge to a certain extent. Thus, a brief lecture is adopted, using short CoT as fine-tuning signals.
  • For Hard problems: The student model is assumed to have a poor understanding of the relevant knowledge. Thus, “in-depth explanation” is adopted, using long CoT as fine-tuning signals.
Our experiments show that compared with standard distillation methods, our adaptive chain-of-thought distillation significantly improves the reasoning performance of small LLMs, achieving notable gains across multiple benchmarks.
The key innovations and contributions of our proposed ACoTD framework are as follows:
  • We have constructed an extensive problem dataset covering various difficulty levels and established a corresponding problem-grading mechanism;
  • We have introduced the ACoTD framework—a student-model-centric framework that effectively enhances the reasoning capabilities of small LLMs through dynamic, personalized distillation strategies;
  • We have conducted extensive experiments, verifying that ACoTD achieves satisfactory accuracy and effectiveness in reasoning-intensive tasks across multiple benchmarks.

2. Related Work

This section will review the cutting-edge work in two research areas closely related to this study:
  • Distillation for LLMs, which aims to transfer the capabilities of large LLMs to small LLMs;
  • Data Sampling for LLMs Fine-Tuning, which focuses on how to construct high-quality datasets for LLM fine-tuning.
Our work lies at the intersection of these two areas, aiming to optimize the reasoning capabilities of small LLMs through an innovative data collection and distillation strategy.

2.1. Distillation for LLMs

Knowledge distillation (KD) is a classic technique that compresses the knowledge of large, high-performance teacher models into more lightweight student models. Early research mainly focused on distillation at the output level—for instance, by minimizing the Kullback–Leibler (KL) divergence between the student model and the teacher model in terms of their predicted probability distributions (Soft Labels). In the era of LLMs, KD is widely used to develop smaller LLMs under the guidance of larger teacher models, thereby reducing the computational demands associated with large-scale models.
Currently, KD methods for LLMs can be roughly categorized into two types: black-box KD and white-box KD. The former only accesses the text generated by the teacher model, while the latter can also obtain the teacher model’s output distribution or intermediate hidden states. Most existing KD methods for LLMs fall under the category of black-box KD; for example, Lee et al. [2] generated a preference dataset and directly aligned the outputs of the distilled model with those of the full-precision model to ensure that the distilled model’s responses are as close as possible to the original model. White-box KD methods have been relatively less explored, but there are representative studies such as MiniLLM [3].
However, standard KD provides limited information due to its reliance solely on final outputs. Recent studies have explored various distillation approaches to enhance distillation effectiveness: Reasoning Distillation (Hsieh et al.) achieves this goal by training the student model to understand both the final answer and the underlying reasoning [4]; multi-teacher strategies (Tian et al.; Zhang et al.) achieve this by aggregating diverse reasoning paths, enriching distillation data, and improving generalization [5,6]; DeepSeek [7] can generate long CoT directly from the base model through reinforcement learning without Supervised Fine-Tuning (SFT), incorporates cold-start training data, and then performs SFT to further enhance model performance; Shridhar et al. [8] explored iterative distillation, which gradually improves the student model’s reasoning capabilities through multiple rounds of fine-tuning; Wang et al. [9] enhanced distillation quality by sampling multiple reasoning paths from the teacher model and using Self-Consistency to select the most consistent path as a supervision signal; Ye et al. [10] analyzed the quality of distillation data and demonstrated that a small amount of carefully designed data can effectively stimulate the model’s reasoning abilities; Xu et al. [11] obtained high-quality and diverse distillation data from unlabeled data by introducing a dual-criteria sampling rejection strategy.
Nevertheless, most of these methods adopt uniform or random strategies to sample distillation data from predefined datasets. Even when specific sampling methods are used, they are teacher-centric in nature. Essentially, this is a one-size-fits-all teaching approach that fails to tailor instruction to the individual differences in student models.

2.2. Data Sampling for LLM Fine-Tuning

The quality, diversity, and difficulty distribution of fine-tuning data have a decisive impact on the performance of the final model. LIMA (Zhou et al.) fine-tuned the 65B LLaMA model with only 1000 carefully selected high-quality instruction-response pairs (chosen for diversity and clarity), without any reinforcement learning. Despite the small size of the dataset, the resulting model performed exceptionally well: it learned to handle complex queries and even generalized to tasks not seen during training [12]. This “less is more” outcome indicates that fine-tuning only requires a small amount of demonstration data to unlock most of the model’s performance, with data quality being the most critical factor.
Early data sampling methods typically relied on manual annotation or heuristic sampling from existing datasets. For example, Li et al. [13] and Huang et al. [14] created data through manual annotation, which was mainly used for benchmarking.
However, manually creating high-quality data is extremely costly and difficult to scale. Therefore, using LLMs themselves to automatically generate synthetic data has become a mainstream approach [15]. For instance, through well-designed prompt engineering, large LLMs are guided to generate reasoning steps and answers for a large number of original questions, thereby building large-scale training corpora. Whitney et al. [16] leveraged the in-context learning capabilities of LLMs through prompt engineering to induce the generation of higher-quality data, albeit with limitations imposed by model biases; Mahene et al. [17] used existing self-alignment frameworks to automatically generate fine-tuning data, achieving good performance in SFT of small LLMs.
Recently, researchers have begun to focus on the intelligence and adaptability of data collection. Curriculum Learning proposes that models should start learning from simple samples and gradually transition to complex ones, which is similar to the human learning process. Difficulty-based sampling classifies datasets according to predefined difficulty metrics (such as question length, calculation steps, and number of concepts) and performs sampling based on this classification. However, most of these methods rely on predefined, static difficulty metrics, which are defined from the perspective of the questions themselves or the teacher model—for example, the uncertainty of the teacher model in generating answers. A key flaw is that they overlook the perspective of the student model itself: a question deemed easy by the teacher model may still be difficult for the student model to understand.

3. Methods

This section will elaborate on the proposed ACoTD framework in detail. The core idea of this framework is to be student-model-centric: it dynamically evaluates the student model’s knowledge gaps based on its actual performance on different original problems and customizes a distillation dataset for it accordingly, thereby achieving efficient and personalized capability transfer. The entire process mainly consists of two core modules: problem difficulty grading and LLM adaptive distillation. This framework is illustrated in Figure 1.

3.1. Framework

Our ACoTD is designed to address the limitations of the “one-size-fits-all” data strategy in traditional CoT distillation. Its overall workflow is a systematic and automated process, as shown in Figure 1, which mainly includes the following six stages:
  • Original Problem Collection: Collect a large number of unlabeled original problems from open-source channels such as the Internet to form the initial problem set $D_{raw}$.
  • Student Performance Diagnosis: Use the target student model to generate answers and their corresponding CoT for all problems in $D_{raw}$, thereby obtaining a diagnostic report on the model’s current capabilities.
  • Problem Difficulty Grading: Based on the quality of the answers and CoT generated by the student model, dynamically assign a difficulty level d ∈ {Easy, Medium, Hard} to each problem.
  • Adaptive Sampling: According to the difficulty level distribution obtained in the previous step, sample from $D_{raw}$ at the same proportion to construct a smaller-scale but more targeted distillation problem dataset.
  • Differentiated Data Generation: Engage two top-tier teachers—LLM-A and LLM-B. The teacher model LLM-A generates a concise, short CoT for Easy and Medium problems to achieve concise teaching and knowledge consolidation; the teacher model LLM-B generates a detailed and thorough long CoT for Hard problems to realize in-depth teaching and knowledge gap filling. Finally, the final distillation dataset is formed, which is divided into D-Short and D-Long.
  • Supervised Fine-Tuning: Perform SFT on the student model using the high-quality datasets, D-Short and D-Long.
The innovation of this framework lies in the fact that its difficulty definition is not based on the inherent attributes of the problems, but on the interactive performance between the student model and the problems, thereby “teaching students in accordance with their aptitude”.
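Viewed end to end, these six stages form a simple pipeline. The sketch below is only an illustration of how the stages hand data to one another; every callable passed in is a hypothetical placeholder for the components detailed in Sections 3.2 and 3.3, not the authors’ implementation.

```python
def acotd_pipeline(raw_problems, diagnose, grade, sample, generate_cot, sft_train):
    """High-level sketch of the six-stage ACoTD workflow.
    Each argument is a hypothetical callable standing in for one stage
    (diagnosis, grading, adaptive sampling, differentiated CoT generation, SFT)."""
    # 1-2. Student performance diagnosis on the collected problem set D_raw.
    diagnostics = [diagnose(p) for p in raw_problems]   # (answer, CoT) per problem
    # 3. Problem difficulty grading based on the student's own outputs.
    graded = grade(raw_problems, diagnostics)            # {"Easy": [...], "Medium": [...], "Hard": [...]}
    # 4. Adaptive (proportional) sampling of the distillation problem set.
    d_sample = sample(graded, k=1000)
    # 5. Differentiated data generation: short CoT for Easy/Medium, long CoT for Hard.
    d_short, d_long = generate_cot(d_sample)
    # 6. Supervised fine-tuning of the student model on D-Short and D-Long.
    return sft_train(d_short + d_long)
```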

3.2. Problem Difficulty Grading

The core of difficulty grading lies in defining what is “hard” and what is “easy” from the perspective of the student model. Based on the quality of the answer $a_i$ and CoT $C_i$ generated by the student model for the problem $p_i$, we classify the problem into three levels:
  • Easy: If and only if $a_i$ is correct, and $C_i$ is logically rigorous with accurate reasoning. This indicates that the student model has fully mastered the knowledge and reasoning capabilities required to solve such problems, and no extra attention is needed.
  • Medium: If $a_i$ is incorrect, but $C_i$ is basically correct or only has minor flaws (e.g., individual calculation errors, reasonable reasoning steps in parts). This shows that the student model has a roughly correct reasoning framework and masters part of the knowledge needed to handle this question, but has minor omissions or calculation errors in details, requiring consolidated practice.
  • Hard: If $a_i$ is incorrect and $C_i$ is completely wrong or logically confusing. This means the student model has fundamental knowledge gaps in the core concepts or reasoning processes of such problems, requiring focused teaching.
We define a scoring function $G$ that evaluates the difficulty of the question $p_i$ for the student model $M_0$, based on the answer $a_i$ and chain-of-thought $C_i$ generated by $M_0$, as well as the standard answer $a_i^*$:
$$d_i = G(a_i, C_i, a_i^*),$$
where $d_i \in \{\mathrm{Easy}, \mathrm{Medium}, \mathrm{Hard}\}$.
The judgment criteria of this function are formalized as follows:
  • Easy ($d_i = E$):
$$G(a_i, C_i, a_i^*) = E \iff (a_i = a_i^*) \wedge \mathrm{Logical}(C_i) \wedge \mathrm{Complete}(C_i),$$
  • Medium ($d_i = M$):
$$G(a_i, C_i, a_i^*) = M \iff (a_i \neq a_i^*) \wedge \big(\mathrm{PartiallyLogical}(C_i) \vee \mathrm{MinorError}(C_i)\big),$$
  • Hard ($d_i = H$):
$$G(a_i, C_i, a_i^*) = H \iff (a_i \neq a_i^*) \wedge \big(\neg\mathrm{Logical}(C_i) \vee \mathrm{Incomplete}(C_i)\big).$$
Boolean functions such as $\mathrm{Logical}(\cdot)$ and $\mathrm{Complete}(\cdot)$ are used to represent the evaluation of CoT quality.
We submit the answers and CoTs generated by the student model to another, significantly more capable LLM for evaluation. Through prompts, we guide this evaluation LLM to conduct the assessment in accordance with our criteria, and we perform manual verification by cross-referencing with reference answers to ensure the accuracy and high quality of the data. After completing the difficulty grading, we record the proportion of problems at each difficulty level as X:Y:Z (X for Easy, Y for Medium, and Z for Hard questions). The prompt used to evaluate the performance of the student model is shown in Figure 2. The problem difficulty grading process is formalized in Algorithm 1.
Algorithm 1. Difficulty grading procedure for original problems
Require: Original problem dataset Draw = {pi} ▷ Original problem pi
Require: Set of LLMs M = {M0, M1} ▷ M0: student model, M1: teacher model LLM-A
Require: Result set R = {(pi, a*i)} ▷ Each sample contains original problem pi and correct answer a*i
Require: Prompt template P0, used to evaluate the performance of the student model
Ensure: Graded problem datasets Deasy, Dmedium and Dhard
1: Initialize Deasy ← ∅, Dmedium ← ∅, Dhard ← ∅
2: for problem pi in Draw do
3:   Get M0 response (ai, Ci)  ▷ ai: predicted answer, Ci: CoT
4:   Add result R ← R ∪ {(ai, Ci)}
5: end for
6: for each sample (pi, a*i, ai, Ci) in R do
7:   Construct prompt P ← Concatenate(P0, pi, a*i, ai, Ci)
8:   Get M1 prediction Gi
9:   if Gi == Easy then
10:    Deasy ← Deasy ∪ {pi}
11:   else if Gi == Medium then
12:    Dmedium ← Dmedium ∪ {pi}
13:   else if Gi == Hard then
14:    Dhard ← Dhard ∪ {pi}
15:   end if
16: end for
17: return Graded problem datasets Deasy, Dmedium and Dhard
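For concreteness, the grading loop of Algorithm 1 can be sketched in Python as below. This is a minimal illustration rather than the authors’ released code: `student_generate`, `judge_with_llm`, and the prompt template are hypothetical stand-ins for the student model call and the LLM-based evaluator (prompt of Figure 2) described above.

```python
def grade_problems(problems, student_generate, judge_with_llm, grading_prompt):
    """Sketch of Algorithm 1: label each problem Easy/Medium/Hard based on the
    student model's own answer and CoT.
    `student_generate(question)` -> (answer, cot) and `judge_with_llm(prompt)` ->
    "Easy" | "Medium" | "Hard" are hypothetical helpers."""
    graded = {"Easy": [], "Medium": [], "Hard": []}
    for p in problems:                       # p: {"question": ..., "gold": ...}
        answer, cot = student_generate(p["question"])
        prompt = grading_prompt.format(
            question=p["question"], gold=p["gold"], answer=answer, cot=cot
        )
        graded[judge_with_llm(prompt)].append(p)
    # The Easy:Medium:Hard counts give the ratio X:Y:Z used for adaptive sampling.
    ratio = {level: len(items) for level, items in graded.items()}
    return graded, ratio
```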

3.3. LLM Adaptive Distillation

After obtaining the difficulty label for each problem, we conduct adaptive sampling of distillation problems and data generation.
Adaptive Sampling: We first take the proportion (X, Y, and Z) of samples labeled Easy, Medium, and Hard in the initial problem dataset $D_{raw}$. We then sample from the original problem dataset according to this proportion to construct the distillation problem dataset. This ensures that the distillation dataset reproduces the current capability distribution of the student model, allowing training resources to be precisely allocated to the areas where they are most needed.
This process is formalized as follows:
Let the original problem set $D_{raw}$ be classified into three subsets: $D_E$, $D_M$, and $D_H$, with their sizes being $N_E$, $N_M$, and $N_H$, respectively. Our goal is to construct a distillation problem set $D_{sample}$ of size $K$. We adopt proportional stratified sampling to ensure that the difficulty distribution of the sampled dataset is consistent with the original diagnostic distribution:
$$|D_{sample}^{E}| = K \times \frac{N_E}{N}, \qquad |D_{sample}^{M}| = K \times \frac{N_M}{N}, \qquad |D_{sample}^{H}| = K \times \frac{N_H}{N},$$
where $N = N_E + N_M + N_H$ is the total number of original questions, and $D_{sample}^{E}$, $D_{sample}^{M}$, and $D_{sample}^{H}$ denote the problem sets randomly sampled from $D_E$, $D_M$, and $D_H$, respectively. $K$ is the number of sampled problems, a customizable parameter depending on the desired size of the distillation dataset. Finally,
$$D_{sample} = D_{sample}^{E} \cup D_{sample}^{M} \cup D_{sample}^{H}.$$
This sampling strategy ensures that the distillation dataset accurately reflects the distribution of the current capability shortcomings of the student model.
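A minimal sketch of this proportional stratified sampling is given below, assuming the graded subsets from the previous step. `random.sample` draws without replacement, and assigning the rounding remainder to the Hard bucket is an illustrative choice that the paper does not specify.

```python
import random

def stratified_sample(graded, k):
    """Proportional stratified sampling over the graded subsets.
    `graded` maps "Easy"/"Medium"/"Hard" to lists of problems; `k` is the
    desired size of the distillation problem set D_sample."""
    n_total = sum(len(v) for v in graded.values())
    quotas = {level: int(k * len(v) / n_total) for level, v in graded.items()}
    # Give any rounding remainder to the Hard bucket (illustrative choice).
    quotas["Hard"] += k - sum(quotas.values())
    return {level: random.sample(graded[level], quotas[level]) for level in graded}
```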
Differentiated CoT Generation: We adopt two distinct strategies to generate supervision signals for problems of different difficulty levels:
  • For problems labeled Easy and Medium: we use the teacher model LLM-A to generate short CoT. The prompt is designed to require the model to provide concise derivations with only key steps, avoiding verbosity. This corresponds to concise teaching, aiming to efficiently consolidate the knowledge that the student model has already mastered or is close to mastering. The prompt used to generate short CoT is shown in Figure 3.
  • For problems labeled Hard: we use another teacher model LLM-B (or the same model with a different prompt) to generate long and detailed CoT. The prompt requires the model to break down core concepts, explain the basis for each derivation step in detail, and even provide analogies or examples. This corresponds to in-depth teaching, aiming to fill the knowledge gaps of the student model and establish correct reasoning patterns. The prompt used to generate long CoT is shown in Figure 4.
We define two CoT generation functions, corresponding to concise and detailed explanations, respectively:
  • Short CoT generation function:
$$T_S:\; T_S(p) \rightarrow (C_{short}, a),$$
  • Long CoT generation function:
$$T_L:\; T_L(p) \rightarrow (C_{long}, a),$$
where $C_{short}$ and $C_{long}$ represent the concise and detailed CoTs generated for the same question $p$, respectively. Finally, the construction process of the adaptive distillation dataset $D_{distill}$ can be expressed as
$$D_{distill} = \big\{(p,\, T_S(p)) \mid p \in D_{sample}^{E} \cup D_{sample}^{M}\big\} \;\cup\; \big\{(p,\, T_L(p)) \mid p \in D_{sample}^{H}\big\}.$$
After obtaining the short CoT distillation data, D-Short, and the long CoT distillation data, D-Long, we conducted data verification and supervised fine-tuning. To ensure the quality of the distillation data, we implement a set of automated verification processes (such as verifying the results generated by the teacher model using standard answers, or conducting consistency checks with another LLM) combined with a small amount of manual sampling review, so as to filter out erroneous data that may be generated by the teacher model. Finally, we perform supervised fine-tuning on the student model using the verified high-quality dataset. The adaptive distillation process is formalized in Algorithm 2.
Algorithm 2. LLM adaptive distillation procedure
Require: Graded problem datasets Deasy, Dmedium and Dhard
Require: Set of LLMs M = {M0, M1, M2} ▷ M0: student model, M1: teacher model LLM-A, M2: teacher model LLM-B
Require: Prompt template for each teacher model Pj = {P1, P2} ▷ P1: used to generate short CoT, P2: used to generate long CoT
Input: Given numbers 0 < k < min(count(Deasy), count(Dmedium), count(Dhard)), n = 0
Ensure: Distilled student model Mdistill
1: Initialize distillation problem set Dsample ← ∅, distillation dataset Ddistill ← ∅
2: for problem pi in Deasy, n < INT(k count(Deasy)/count(Dmedium)) do
3:   Dsample ← Dsample ∪ {pi}
4:   n ← n + 1
5: end for
6: n ← 0
7: for problem pi in Dmedium, n < INT(k) do
8:   Dsample ← Dsample ∪ {pi}
9:   n ← n + 1
10: end for
11: n ← 0
12: for problem pi in Dhard, n < INT(k count(Dhard)/count(Dmedium)) do
13:   Dsample ← Dsample ∪ {pi}
14:   n ← n + 1
15: end for
16: for problem pi in Dsample do
17:   if pi ∈ Deasy ∪ Dmedium then ▷ difficulty label from Algorithm 1
18:    Construct prompt P ← Concatenate(P1, pi)
19:    Get M1 response (ai, Ci)
20:    Ddistill ← Ddistill ∪ {(pi, ai, Ci)}
21:   else
22:    Construct prompt P ← Concatenate(P2, pi)
23:    Get M2 response (ai, Ci)
24:    Ddistill ← Ddistill ∪ {(pi, ai, Ci)}
25:   end if
26: end for
27: Get Mdistill by SFT of M0 with Ddistill
28: return Mdistill
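The differentiated generation step of Algorithm 2 can be sketched as follows. This is an illustrative outline only: `call_teacher_short` and `call_teacher_long` stand in for API calls to the respective teacher models (Ernie-4.5-Turbo and QwQ-32B in the paper's setup), and the prompt strings merely paraphrase the intent of Figures 3 and 4 rather than reproducing them.

```python
# Hypothetical prompt templates paraphrasing Figures 3 and 4.
SHORT_COT_PROMPT = "Solve the problem with a concise derivation, key steps only:\n{q}"
LONG_COT_PROMPT = ("Solve the problem with a detailed explanation: break down the core "
                   "concepts and justify every derivation step:\n{q}")

def build_distill_set(sample, call_teacher_short, call_teacher_long):
    """Sketch of the differentiated CoT generation in Algorithm 2.
    `sample` maps "Easy"/"Medium"/"Hard" to sampled problems; the two
    `call_teacher_*` callables wrap the teacher-model APIs and return (cot, answer)."""
    d_short, d_long = [], []
    for level, problems in sample.items():
        for p in problems:
            if level in ("Easy", "Medium"):
                cot, ans = call_teacher_short(SHORT_COT_PROMPT.format(q=p["question"]))
                d_short.append({"question": p["question"], "cot": cot, "answer": ans})
            else:
                cot, ans = call_teacher_long(LONG_COT_PROMPT.format(q=p["question"]))
                d_long.append({"question": p["question"], "cot": cot, "answer": ans})
    return d_short, d_long  # D-Short and D-Long, later verified and used for SFT
```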

4. Experiment

To comprehensively evaluate the effectiveness of the proposed ACoTd framework, we designed and conducted a series of experiments. This section will detail the experimental setup, the datasets, and the evaluation metrics used. Additionally, we conducted an in-depth analysis of the role of each core component in ACoTd through ablation studies.

4.1. Datasets

In this experiment, multiple publicly available and challenging mathematical reasoning datasets were selected to comprehensively and fairly evaluate the model’s reasoning ability on mathematical problems of different difficulty levels and types. These datasets are all sourced from public academic resources or authoritative competition platforms, ensuring the reproducibility and comparability of the experiment. Original data is divided into training data and benchmark data.

4.1.1. Training Data

Our training data was mainly sourced from
  • NuminaMath [18]: An open-source large language model series focused on mathematical problem-solving and its supporting dataset. It is the largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions.
  • Historical AIME [19]: It includes past real questions from the American Invitational Mathematics Examination (AIME). These questions are known for their high difficulty and requirement for multi-step reasoning, making them highly suitable for testing the ability of LLMs to solve complex mathematical problems.
  • GSM8K [20]: Proposed by OpenAI researchers in 2021, it contains approximately 8500 high-quality English elementary school math word problems, covering basic mathematical knowledge such as addition, subtraction, multiplication, division, fractions, decimals, and percentages. These problems typically require multi-step reasoning and are divided into a training set (7473 problems) and a test set (1319 problems).
  • OlympicArena [21]: A large-scale, interdisciplinary, and high-quality benchmark dataset, including 11,163 problems, covering 62 international Olympic competitions. By collecting difficult problems at the level of international Olympic competitions, it sets a high standard for evaluating the superintelligence of models.
  • GAOKAO [22]: a large model evaluation dataset centered on real questions from China’s National College Entrance Examination (Gaokao), covering a total of 2811 questions, including multiple-choice questions, fill-in-the-blank questions, and problem-solving questions. It is designed to evaluate the ability of large models in complex language understanding and logical reasoning tasks.

4.1.2. Benchmark Data

Our benchmark data are mainly sourced from
  • AMC23 [23]: It contains 40 questions from the American Mathematics Competition 12 (AMC12) 2022 and AMC12 2023. The original AMC12 questions are multiple-choice questions with four options; the authors revised the problem statements into a form that requires integer outputs, and questions whose statements could not be revised were excluded.
  • AIME25 [24]: It comprises challenging mathematics competition problems from the American Invitational Mathematics Examination of 2025.
  • GSM8K-TEST: It contains 1319 problems from GSM8K.
  • MATH500 [25]: It contains 500 carefully selected mathematical problems, which are divided into five difficulty levels ranging from simple to complex. This enables researchers to meticulously evaluate the model’s performance under different challenge levels.

4.2. Metrics and Baseline

4.2.1. Metrics

We use the following metrics to quantify model performance:
  • Accuracy (Acc): The correctness rate of the model’s final answers compared with the ground-truth option, serving as the primary performance evaluation metric. It is computed as
$$\mathrm{Accuracy} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\,[\hat{a}_i = a_i],$$
where
$n$ is the number of all problems;
$\hat{a}_i$ is the LLM’s answer to the $i$th problem;
$a_i$ is the ground-truth answer to the $i$th problem; and $\mathbf{1}[\cdot]$ is the indicator function, which returns 1 when the argument is true and 0 otherwise.
  • CoT Quality Score (CoTQS): To assess the quality of the reasoning process, we use Deepseek-R1 as the evaluator. Deepseek-R1 rates the CoT generated by the model on a scale of 1 to 5 across three dimensions—logical coherence, step completeness, and factual correctness—and then calculates the average score. The prompt used to generate the CoT quality score is shown in Figure 5.
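As a small illustration of the two metrics, the snippet below computes accuracy via the indicator function above and averages per-dimension CoT quality ratings; the 1-to-5 ratings themselves would come from the DeepSeek-R1 evaluator prompt (Figure 5), which is not reproduced here.

```python
def accuracy(predicted, gold):
    """Fraction of problems whose final answer matches the ground truth."""
    assert len(predicted) == len(gold)
    return sum(1 for a_hat, a in zip(predicted, gold) if a_hat == a) / len(gold)

def cot_quality_score(ratings):
    """Average of the evaluator's 1-5 ratings over logical coherence,
    step completeness, and factual correctness for one CoT."""
    return sum(ratings) / len(ratings)

# Example: accuracy([4, 7, 12], [4, 9, 12]) -> 0.666..., cot_quality_score([5, 4, 4]) -> 4.33...
```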

4.2.2. Baseline

We compare against the following three baselines: DeepSeek-R1-Distill-Qwen-7B, Qwen3-1.7B (CoT on), and Qwen3-1.7B (CoT off).
  • DeepSeek-R1-Distill-Qwen-7B: An efficient reasoning model with 7B parameters that is more lightweight, based on Qwen2.5-7B as the base model and distilled from DeepSeek-R1;
  • Qwen3-1.7B (CoT on): A lightweight open-source large model with 1.7B parameters (including 1.4B non-embedding parameters) that supports switching between thinking mode and non-thinking mode—this is the thinking mode;
  • Qwen3-1.7B (CoT off): non-thinking mode, where the output will not include a chain of thought.

4.3. Experiment Settings

4.3.1. Models

In terms of model selection, in line with our goal of enhancing the reasoning ability of small-scale LLMs, the student model should be an easily accessible open pre-trained model with a parameter count of less than 7B; since the adaptive distillation of ACoTd requires generating both long CoT and short CoT, it is preferable to select two different LLMs as teacher models—one responsible for generating long CoT and the other for short CoT—to avoid manual prompt switching.
  • Student Model: Considering the flexibility of our hardware conditions, we have selected Qwen3-1.7B as the student model and access it through a local deployment.
  • Teacher Model: We utilize Ernie-4.5-Turbo and Qwen-QwQ-32B as our teacher models, and these models can be accessed online via API. Among them, Ernie-4.5-Turbo serves as the teacher model for generating short CoT, while Qwen-QwQ-32B acts as the teacher model for generating long CoT.

4.3.2. Data

We collected a total of more than 10k problems from open-source resources (the datasets listed in Section 4.1.1) to form the original problem dataset $D_{raw}$. After difficulty grading based on the student model’s outputs (the procedure described in Section 3.2), the ratio of Easy, Medium, and Hard questions was 5.3:1:3.1. The sampled distillation problem dataset contained 1000 questions in total, with the generated long CoT training dataset having 329 CoTs and the short CoT training dataset having 671 CoTs. Among the test benchmark datasets, MATH500 included 500 problems, AIME25 had 30 problems, GSM8K-TEST contained 1319 problems, and AMC23 included 40 problems.
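A quick check shows that the reported split of the 1000 sampled problems is simply the stratified-sampling formula of Section 3.3 applied to this ratio, with small differences due to rounding:
$$5.3 + 1 + 3.1 = 9.4, \qquad |D_{sample}^{H}| \approx 1000 \times \tfrac{3.1}{9.4} \approx 330 \;(\text{reported: }329), \qquad |D_{sample}^{E}| + |D_{sample}^{M}| \approx 1000 \times \tfrac{6.3}{9.4} \approx 670 \;(\text{reported: }671).$$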

4.3.3. Training Details

Our training is based on the LLaMA-Factory [26] framework. This framework integrates various mainstream and efficient training and fine-tuning techniques, supports adaptation to major open-source models on the market, and provides multiple high-level abstract calling interfaces. It also includes functions such as multi-stage training, inference testing, benchmark evaluation, and API services. It can improve computational efficiency while minimizing memory requirements as much as possible.
In our training setup, we used the AdamW optimizer and employed training with LoRA [27] fine-tuning, setting the LoRA rank to 32, a learning rate of 1 × 10−5, a batch size of 16, 5 training epochs, and a context window of 16384 tokens. During inference, we used a temperature of 0.6, max tokens set to 8196, and a top-p value of 0.9. More hyperparameters are listed in Table 1, and shared hyperparameters with baselines are listed in Table 2. All experiments were conducted on an NVIDIA A40 48G GPU.
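For reference, the LoRA setup in Table 1 corresponds roughly to the configuration sketched below. We express it with the Hugging Face peft and transformers APIs rather than a LLaMA-Factory config file; the output path is a placeholder, and this is an illustration of the hyperparameters, not the authors’ actual training script.

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

# Student model (Qwen3-1.7B in the paper), wrapped with a rank-32 LoRA adapter.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
lora_cfg = LoraConfig(r=32, lora_alpha=32, lora_dropout=0.1, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# Optimization hyperparameters mirroring Table 1 (AdamW is the default optimizer).
args = TrainingArguments(
    output_dir="acotd-qwen3-1.7b",        # placeholder path
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
)
# The SFT step itself (e.g., via LLaMA-Factory) would consume the verified
# D-Short and D-Long datasets with the 16,384-token sequence length.
```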

4.3.4. Ablation Experiment Settings

To validate the contributions of key ACoTd components, we evaluate variants with certain components omitted or replaced. The changes are as follows:
  • Remove the step of sampling distillation questions based on the student model’s level, and instead randomly sample an equal amount of distillation data;
  • Fix the ratio of Easy, Medium, and Hard data to 6:3:1, without dynamically adjusting it based on student model performance;
  • Use only short chains of thought when generating CoT distillation data;
  • Use only long chains of thought when generating CoT distillation data.
The specific results will be discussed in Section 5.

5. Result Analysis

This section presents the experimental results verifying the effectiveness of our ACoTd framework, along with relevant analyses of the results, specifically including two parts: main results and ablation results.

5.1. Main Results

Table 3 presents the results of controlled experiments evaluating the performance of LLMs across multiple mathematical problem benchmark datasets. Besides the baselines, we also include, for comparison, results on the same benchmarks from the Curriculum Learning line of work. Through analysis of these results, we can clearly observe the following:
  • Significant Advantages of ACoTd: Our proposed ACoTd method achieved the best performance across all datasets compared with different baseline models. ACoTd improved the accuracy of the Qwen3-1.7B model by over 10% on all four benchmarks, with an even more impressive 20% accuracy increase on the highly challenging AIME25. From the CoTQS perspective, the quality of reasoning steps also showed significant improvement. Furthermore, when compared with DeepSeek-R1-Distill-Qwen-7B, ACoTd outperformed this larger LLM by a considerable margin. This fully demonstrates the effectiveness of the adaptive strategy based on the actual performance of the student model. Notably, while DeepSeek-R1-Distill-Qwen-7B was only more effective than Qwen3-1.7B (CoT on) on the GSM8K-TEST benchmark, it still comprehensively outperformed Qwen3-1.7B (CoT off). This is because DeepSeek-R1-Distill-Qwen-7B, with its larger parameter size, possesses better fundamental capabilities. However, once the reasoning ability of a smaller LLM is stimulated through CoT, its performance can match or even surpass larger LLMs. This indicates that CoT brings substantial performance improvements, and from another perspective, confirms that our method is superior in terms of performance among existing distillation methods.
  • Limitations of Traditional Distillation: Even though the DeepSeek-R1-Distill-Qwen-7B model (also a distilled model) can generate CoT, its performance is relatively poor. This confirms that traditional one-size-fits-all distillation strategies cannot focus on strengthening the weak points of student models.
  • Differences Between Static and Dynamic Difficulty Assessment: By comparing the common benchmark assessment used in Curriculum Learning with our work, we find that Curriculum Learning, based on static metrics, performs significantly worse. This indicates that dynamic difficulty assessment from the student model’s perspective better reflects actual learning needs than static assessment from the problem or teacher’s perspective, thereby bringing more efficient performance improvements.

5.2. Ablation Results

We conducted four ablation studies on the aforementioned benchmarks to verify the effectiveness of our method from different perspectives. The results are shown in Table 4.
Analysis of the results shows the following:
  • Core Role of the Difficulty Classification Mechanism: When the difficulty classification mechanism is removed and the method degrades to random sampling, the performance drops most significantly. This proves that accurately classifying questions based on the student model’s performance is the key to the success of ACoTd.
  • Importance of Adaptive Sampling: When a fixed sampling ratio is adopted (e.g., Easy/Medium/Hard = 6:3:1) instead of adaptive sampling based on the original distribution, the performance decreases slightly. This indicates that maintaining a difficulty distribution matching the student model’s capabilities is crucial.
  • Effectiveness of Differentiated Supervision Signals: Performance decreases when only short CoT or only long CoT is used. This demonstrates the necessity of the “teaching students in accordance with their aptitude” differentiated strategy: concise explanation (short CoT) for already mastered knowledge can improve efficiency, while detailed explanation (long CoT) for unmastered knowledge can make up for weaknesses. Using only long CoT leads to low training efficiency and may introduce redundant noise, whereas using only short CoT fails to provide sufficient reasoning details for difficult questions.
In addition, we visualized the model’s performance improvement on problems of different benchmarks in Figure 6. The results show that the performance improvement brought by ACoTd mainly comes from the significant improvement in accuracy on AIME25 and AMC23, which are mainly Hard- and Medium-difficulty questions, directly verifying that our method can effectively target the knowledge gaps of the student model.

6. Conclusions

This paper aims to address the core challenge of efficiently distilling the strong reasoning capabilities of large-scale LLMs into smaller student models. Traditional distillation methods adopt a “one-size-fits-all” strategy. This approach overlooks the significant differences in the student model’s mastery of knowledge across problems of varying difficulty, ultimately leading to low distillation efficiency and limited generalization ability.
To tackle this issue, we propose a student-centered ACoTd framework. The core innovation of ACoTd lies in the introduction of a problem difficulty classification mechanism and a dynamic difficulty-aware mechanism. First, based on the student model’s actual performance on the original problems (i.e., the quality of generated answers and CoTs), this mechanism automatically classifies problems into three levels: Easy, Medium, and Hard. Subsequently, leveraging this difficulty distribution, it adaptively samples from the original problems to construct a distillation dataset and matches differentiated supervision signals to problems of different difficulty levels: short CoTs are used for concise explanation and consolidation of knowledge for Easy and Medium problems, while long and detailed CoTs are employed for in-depth explanation and addressing knowledge gaps for Hard problems.
Through extensive experiments on multiple mathematical reasoning benchmarks, we verified the effectiveness of the ACoTd framework. The results show that compared with traditional distillation methods, ACoTd significantly improves the performance of distilled student models on complex reasoning tasks. Ablation studies further confirm that the three core components—difficulty classification, adaptive sampling, and differentiated supervision—are all key factors for enhancing model performance.
Despite the promising results achieved by ACoTd, this study still has limitations. It primarily focuses on empirical verification, and the theoretical analysis of why ACoTd works remains insufficient. Future work can strive to provide a more solid theoretical foundation for it, and we believe the approach would be more effective if integrated with reinforcement learning or rejection sampling. We believe that this student-centered, data-driven methodological paradigm holds significant importance and broad application prospects for advancing the development of efficient, lightweight reasoning models.

Author Contributions

Conceptualization, J.S. and X.C.; methodology, J.S.; software, J.S.; validation, J.S. and Z.G.; formal analysis, J.S.; investigation, J.S.; resources, X.C.; writing—original draft preparation, J.S.; writing—review and editing, X.S.; visualization, X.S.; supervision, X.C.; project administration, Z.G.; funding acquisition, X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

All data uploaded to the cloud are from open-source datasets and do not involve any personal or privacy risks.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  2. Lee, J.; Park, S.; Hong, S.; Kim, M.; Chang, D.-S.; Choi, J. Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024. [Google Scholar]
  3. Gu, Y.; Dong, L.; Wei, F.; Huang, M. MiniLLM: Knowledge Distillation of Large Language Models. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  4. Hsieh, C.-Y.; Li, C.-L.; Yeh, C.-K.; Nakhost, H.; Fujii, Y.; Ratner, A.; Krishna, R.; Lee, C.-Y.; Pfister, T. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. arXiv 2023, arXiv:2305.02301. [Google Scholar] [CrossRef]
  5. Tian, Y.; Han, Y.; Chen, X.; Wang, W.; Chawla, N.V. Beyond Answers: Transferring Reasoning Capabilities to Smaller LLMs Using Multi-Teacher Knowledge Distillation. arXiv 2024, arXiv:2402.04616. [Google Scholar] [CrossRef]
  6. Zhang, Y.; Liu, H.; Xiao, Y.; Amoon, M.; Zhang, D.; Wang, D.; Yang, S.; Quek, C. LLM-Enhanced Multi-Teacher Knowledge Distillation for Modality-Incomplete Emotion Recognition in Daily Healthcare. IEEE J. Biomed. Health Inform. 2024, 29, 6406–6416. [Google Scholar] [CrossRef] [PubMed]
  7. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. Deepseek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar] [CrossRef]
  8. Shridhar, K.; Stolfo, A.; Sachan, M. Distilling Reasoning Capabilities into Smaller Language Models. arXiv 2022, arXiv:2212.00193. [Google Scholar] [CrossRef]
  9. Wang, P.; Wang, Z.; Li, Z.; Gao, Y.; Yin, B.; Ren, X. SCOTT: Self-Consistent Chain-of-Thought Distillation. arXiv 2023, arXiv:2305.01879. [Google Scholar] [CrossRef]
  10. Ye, Y.; Huang, Z.; Xiao, Y.; Chern, E.; Xia, S.; Liu, P. Limo: Less is More for Reasoning. arXiv 2025, arXiv:2502.03387. [Google Scholar] [CrossRef]
  11. Xu, J.; Zhou, M.; Liu, W.; Liu, H.; Han, S.; Zhang, D. TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers’ Guidance. arXiv 2025, arXiv:2503.24198. [Google Scholar] [CrossRef]
  12. Zhou, C.; Liu, P.; Xu, P.; Iyer, S.; Sun, J.; Mao, Y.; Ma, X.; Efrat, A.; Yu, P.; Yu, L.; et al. Lima: Less is More for Alignment. arXiv 2023, arXiv:2305.11206. [Google Scholar] [CrossRef]
  13. Li, J.; Beeching, E.; Tunstall, L.; Lipkin, B.; Soletskyi, R.; Huang, S.; Rasul, K.; Yu, L.; Jiang, A.Q.; Shen, Z.; et al. Numinamath: The Largest Public Dataset in AI4Maths with 860k Pairs of Competition Math Problems and Solutions. Available online: http://faculty.bicmr.pku.edu.cn/~dongbin/Publications/numina_dataset.pdf (accessed on 15 September 2025).
  14. Huang, Z.; Wang, Z.; Xia, S.; Li, X.; Zou, H.; Xu, R.; Fan, R.Z.; Ye, L.; Chern, E.; Ye, Y.; et al. Olympicarena: Benchmarking Multi-Discipline Cognitive Reasoning for Superintelligent AI. arXiv 2024, arXiv:2406.12753. [Google Scholar] [CrossRef]
  15. Zorik, G.; Jonathan, H.; Roee, A.; Chen, E.; Idan, S. TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models. arXiv 2023, arXiv:2305.11171. [Google Scholar] [CrossRef]
  16. Whitney, C.; Jansen, E.; Laskowski, V.; Barbieri, C. Adaptive Prompt Regeneration and Dynamic Response Structuring in Large Language Models Using the Dynamic Query-Response Calibration Protocol. OSF Prepr. 2024. [Google Scholar] [CrossRef]
  17. Mahene, A.; Pereira, D.; Kowalski, V.; Novak, E.; Moretti, C.; Laurent, J. Automated Dynamic Data Generation for Safety Alignment in Large Language Models. TechRxiv 2024. [Google Scholar] [CrossRef] [PubMed]
  18. NuminaMath. Available online: https://huggingface.co/collections/AI-MO/numinamath-6697df380293bcfdbc1d978c (accessed on 15 September 2025).
  19. AIME-1983-2024. Available online: https://huggingface.co/datasets/gneubig/aime-1983-2024 (accessed on 15 September 2025).
  20. GSM8K. Available online: https://huggingface.co/datasets/openai/gsm8k (accessed on 15 September 2025).
  21. OlympicArena. Available online: https://huggingface.co/datasets/GAIR/OlympicArena (accessed on 15 September 2025).
  22. GAOKAO. Available online: https://huggingface.co/datasets/xDAN-Vision/GAOKAO_2010_to_2022 (accessed on 15 September 2025).
  23. AMC23. Available online: https://huggingface.co/datasets/zwhe99/amc23 (accessed on 15 September 2025).
  24. AIME25. Available online: https://huggingface.co/datasets/math-ai/aime25 (accessed on 15 September 2025).
  25. MATH500. Available online: https://huggingface.co/datasets/HuggingFaceH4/MATH-500 (accessed on 15 September 2025).
  26. Zheng, Y.; Zhang, R.; Zhang, J.; Ye, Y.; Luo, Z.; Feng, Z.; Ma, Y. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. arXiv 2024, arXiv:2403.13372. [Google Scholar] [CrossRef]
  27. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar] [CrossRef]
Figure 1. The framework for adaptive chain-of-thought distillation.
Figure 2. The prompt used to evaluate the performance of the student model.
Figure 3. The prompt used to generate short CoT.
Figure 4. The prompt used to generate long CoT.
Figure 5. The prompt used to generate the CoT quality score.
Figure 6. Performance improvement on problems of different benchmarks.
Table 1. Hyperparameters.
Hyperparameter | Value
Epoch | 5
Learning Rate | 1 × 10−5
Batch Size | 16
Packing | true
Scheduler Type | cosine
Warmup Rate | 0.03
Weight Decay | 0.01
Sequence Length | 16,384
LoRA Rank | 32
LoRA Alpha | 32
Dropout | 0.1
Validation Steps | 16
Checkpoint Interval | 64
Table 2. Shared hyperparameters with baselines.
Hyperparameter | Value
Temperature | 0.6
Top-p | 0.9
Max Tokens | 16,384
Table 3. Main results, comparison with baselines.
Models | AMC23 Acc | AMC23 CoTQS | AIME25 Acc | AIME25 CoTQS | MATH500 Acc | MATH500 CoTQS | GSM8K-TEST Acc | GSM8K-TEST CoTQS
Our Method | 0.8 ± 0.02 | 4.42 ± 0.131 | 0.511 ± 0.031 | 2.8 ± 0.089 | 0.91 ± 0.03 | 4.6 ± 0.021 | 0.953 ± 0.01 | 4.79 ± 0.08
DeepSeek-R1-Distill-Qwen-7B | 0.483 ± 0.06 | 2.43 ± 0.231 | 0.122 ± 0.063 | 0.9 ± 0.198 | 0.712 ± 0.038 | 3.7 ± 0.099 | 0.926 ± 0.018 | 4.52 ± 0.267
Qwen3-1.7B (CoT off) | 0.358 ± 0.58 | 3.38 ± 0.099 | 0.067 ± 0.05 | 1.41 ± 0.17 | 0.802 ± 0.009 | 4.15 ± 0.237 | 0.851 ± 0.021 | 4.16 ± 0.304
Qwen3-1.7B (CoT on) | 0.692 ± 0.023 | 3.83 ± 0.062 | 0.256 ± 0.016 | 1.71 ± 0.134 | 0.857 ± 0.052 | 4.42 ± 0.269 | 0.908 ± 0.042 | 4.58 ± 0.206
Table 4. Ablation results, comparison with baselines.
Models | AMC23 Acc | AMC23 CoTQS | AIME25 Acc | AIME25 CoTQS | MATH500 Acc | MATH500 CoTQS | GSM8K-TEST Acc | GSM8K-TEST CoTQS
Our Method | 0.8 ± 0.02 | 4.42 ± 0.131 | 0.511 ± 0.031 | 2.8 ± 0.089 | 0.91 ± 0.03 | 4.6 ± 0.021 | 0.953 ± 0.01 | 4.79 ± 0.08
Random Sampling | 0.708 ± 0.12 | 3.74 ± 0.082 | 0.255 ± 0.016 | 1.94 ± 0.062 | 0.864 ± 0.016 | 4.43 ± 0.041 | 0.923 ± 0.008 | 4.52 ± 0.074
Fixed Ratio | 0.767 ± 0.023 | 4.01 ± 0.122 | 0.411 ± 0.041 | 2.52 ± 0.057 | 0.902 ± 0.003 | 4.54 ± 0.063 | 0.949 ± 0.009 | 4.67 ± 0.081
Only Short CoT | 0.75 ± 0.02 | 3.88 ± 0.126 | 0.345 ± 0.032 | 1.98 ± 0.047 | 0.897 ± 0.11 | 4.59 ± 0.037 | 0.947 ± 0.012 | 4.77 ± 0.008
Only Long CoT | 0.758 ± 0.023 | 3.98 ± 0.109 | 0.322 ± 0.016 | 2 ± 0.047 | 0.896 ± 0.011 | 4.58 ± 0.057 | 0.945 ± 0.002 | 4.77 ± 0.005
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
