1. Introduction
Code generation has become a crucial aspect of modern software engineering, significantly enhancing productivity by automating the process of transforming natural language descriptions into executable code. Recent advancements in large language models have spurred the development of innovative code generation methods that leverage these models' ability to understand and synthesize code from various programming languages [1,2,3,4,5,6,7,8]. Such methods have proven highly effective in tasks ranging from simple syntax generation to more complex tasks involving multi-step reasoning and logic. These approaches, particularly those based on LLMs, offer a promising direction for automating the coding process, making it accessible even to developers with limited programming experience. As a result, a variety of LLM-based solutions for code generation have emerged, showing substantial improvements in task performance and expanding the scope of possible applications.
However, as shown in Figure 1, despite the impressive capabilities of LLMs in code generation, their deployment remains a significant challenge due to the enormous computational resources required by these models [9,10,11]. The sheer size of modern LLMs, such as those with hundreds of billions of parameters [12,13,14,15], presents substantial obstacles for practical use in real-world scenarios. These LLMs often demand vast amounts of memory and specialized infrastructure, making them impractical for widespread deployment, especially in environments where resource constraints are a concern. Consequently, deploying lightweight versions of these models, which retain the high-level capabilities of LLMs while being more resource-efficient, is an overlooked yet crucial area of research. Addressing this challenge involves overcoming various complexities, such as model compression, distillation, and efficient inference techniques. This issue remains largely underexplored but holds significant potential to make advanced code generation accessible and practical for diverse applications, particularly in resource-limited contexts.
The motivation behind our approach is grounded in the idea of distilling the step-by-step reasoning process from LLMs into smaller, more efficient models [16,17]. This approach is based on the concept of transferring not just the final outputs, but the intermediate reasoning steps (essentially, the cognitive chain) used by LLMs to solve complex tasks. By capturing and transferring these reasoning steps, smaller models can learn to replicate the problem-solving strategies of larger models, thus enhancing their ability to understand and generate language more effectively. The distillation pipeline involves the following: (1) the large model generates a series of intermediate reasoning steps; (2) these steps are distilled into a form that the smaller model can learn from; (3) the smaller model is trained to mimic the large model's reasoning process; and (4) it is evaluated based on its ability to generate accurate results. This approach not only improves the language understanding and generation capabilities of smaller models but also reduces the computational and memory demands of deploying LLMs. The generality of this framework makes it applicable across a range of tasks, enabling the efficient deployment of powerful reasoning capabilities while significantly lowering resource requirements.
The challenge lies in transferring the sophisticated problem-solving strategies of LLMs to smaller, more efficient models without sacrificing performance. This task is further complicated by the need to maintain the symmetry between the reasoning process (i.e., pseudocode generation) and the final code generation output. While LLMs excel at generating both the reasoning path and the final solution, smaller models must replicate this same cognitive structure, ensuring that the intermediate steps and final code are balanced and aligned. Preserving this symmetry is crucial to ensuring that the smaller models maintain the same reasoning capabilities as the larger models, while overcoming the challenges of reduced computational and memory resources. Achieving this balance remains an underexplored yet vital area of research, as it requires overcoming the complexities of model compression, distillation, and efficient inference techniques.
In this paper, we propose a novel method for distilling the code generation capabilities of large language models into smaller models through a multi-task learning framework. Our approach mirrors the reasoning process of LLMs by transferring both the intermediate thought processes, i.e., pseudocode, and the final code generation output to smaller models. This symmetry between the large and small models enables the smaller models to replicate the problem-solving strategies of their larger counterparts, preserving the balance between understanding the problem, breaking it down into pseudocode, and generating executable code. By achieving this symmetry, we ensure that the smaller models can learn the same cognitive patterns without the need for extensive computational resources. We evaluate the framework using several LLMs, including CodeLlama-7B, CodeQwen-7B, DeepSeek, and GPT-4, and demonstrate significant performance improvements on the CodeSearchNet benchmark [18]. The results show that the distillation framework not only enhances the performance of smaller models, as reflected in both automatic metrics and human evaluations, but also maintains efficiency, making it suitable for resource-constrained environments. Specifically, distilling from larger models, particularly GPT-4, leads to the most substantial improvements, achieving top accuracy scores on various tasks while substantially reducing memory usage and inference time.
The contributions of this paper are as follows:
We propose a novel distillation framework that transfers the reasoning and code generation capabilities of LLMs into smaller, more efficient models by leveraging intermediate reasoning steps such as pseudocode, enabling smaller models to replicate the problem-solving strategies of larger models.
We evaluate the proposed distillation framework using several LLMs, including CodeLlama-7B, CodeQwen-7B, DeepSeek, and GPT-4, and demonstrate significant performance improvements in code generation tasks on the CodeSearchNet benchmark, both in automatic metrics and human evaluations.
Our experiments show that distilling from larger models, especially GPT-4, yields the most substantial improvements, achieving top accuracy scores on various tasks while maintaining efficiency in terms of memory usage and inference time, making the framework suitable for deployment in resource-constrained environments.
The remainder of this paper is structured as follows.
Section 2 reviews the background and related work on neural code generation and knowledge distillation, highlighting recent advances in large language models and their limitations in deployment.
Section 3 introduces the proposed cognitive-chain distillation framework, which distills the reasoning process (such as problem decomposition and pseudocode planning) from LLMs into smaller code generation models.
Section 4 presents the experimental setup, evaluation metrics, benchmark datasets, baseline comparisons, and ablation studies to validate the effectiveness and efficiency of our approach.
Section 5 presents threats to validity in this work. Finally,
Section 6 concludes the paper and outlines potential future directions for enhancing lightweight, high-performance NL2Code systems.
2. Background and Related Work
2.1. Chain-of-Thought Planning for Code Generation
In our proposed methodology for code generation using chain-of-thought (CoT), we introduce a two-phase process that incorporates self-planning by the large language models before generating the actual code. This process is delineated as follows:
Planning Phase: This phase leverages the in-context learning capabilities of LLMs for self-planning. Here, a prompt $C$ is provided, comprising $k$ concatenated tuples $\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \ldots, \langle x_k, y_k \rangle$. In these tuples, $x$ represents the human intent, and $y$ symbolizes the plan, an organized schedule of subproblems abstracted and decomposed from the intent; the plan is formulated as $y = (y^{1}, y^{2}, \ldots, y^{m})$. During inference, the test-time intent $x_t$ is appended to the prompt, forming $C \Vert x_t$, which is then processed by the LLM $\mathcal{M}$ to generate a test-time plan $y_t$. Notably, $k$ is a relatively small number, indicating that effective self-planning can be achieved with just a few examples demonstrating planning.
Implementation phase: In this phase, the test-time plan $y_t$ is combined with the intent $x_t$ as input to the model $\mathcal{M}$, resulting in the generation of the final code $c_t$.
The entire process can be expressed through the following equation, formalizing the relationship between the phases:
$$P(c_t \mid x_t) = \sum_{y} P(c_t \mid y, x_t)\, P(y \mid x_t).$$
By assuming conditional independence and conditioning on the single plan $y_t$ produced in the planning phase, we further simplify this to $P(c_t \mid x_t) \approx P(c_t \mid y_t, x_t)\, P(y_t \mid x_t)$. This simplification underscores the efficiency and effectiveness of our proposed CoT approach in LLM-based code generation.
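To make the two-phase formulation concrete, the following minimal Python sketch chains a planning prompt and an implementation prompt; the llm_generate helper and the prompt wording are hypothetical placeholders standing in for any LLM completion call, not part of the original formulation.

def llm_generate(prompt: str) -> str:
    """Hypothetical placeholder for an LLM completion API call."""
    raise NotImplementedError

def build_planning_prompt(examples, intent):
    """Concatenate k (intent, plan) demonstrations with the test-time intent."""
    demos = "\n\n".join(f"Intent: {x}\nPlan:\n{y}" for x, y in examples)
    return f"{demos}\n\nIntent: {intent}\nPlan:\n"

def self_planning_codegen(examples, intent):
    # Planning phase: generate a plan conditioned on the prompt C and the test-time intent.
    plan = llm_generate(build_planning_prompt(examples, intent))
    # Implementation phase: generate the final code conditioned on the intent and the plan.
    return llm_generate(f"Intent: {intent}\nPlan:\n{plan}\nCode:\n")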
2.2. Chain-of-Thought Reasoning for LLMs
Chain-of-thought reasoning has emerged as a critical method for enhancing the problem-solving capabilities of large language models, particularly in complex tasks requiring multi-step reasoning [2,19,20,21,22]. Several studies have explored the power of CoT and its theoretical foundations, shedding light on its efficacy and potential mechanisms. Wei et al. explored the use of chain-of-thought prompting to enhance LLMs' reasoning abilities, showing that generating a series of intermediate reasoning steps significantly improves the performance of LLMs on arithmetic, commonsense, and symbolic reasoning tasks. Their empirical results demonstrated that CoT prompting, with only a few exemplars, achieved state-of-the-art accuracy in math word problem benchmarks, even surpassing fine-tuned models [19]. Feng et al. took a theoretical approach to examine the expressivity of LLMs with CoT in solving mathematical and decision-making problems. Their work utilized circuit complexity theory to show that LLMs, through CoT reasoning, can effectively handle decision-making problems and solve tasks like dynamic programming, justifying their power in tackling complex real-world challenges [20]. Xia et al. expanded the CoT paradigm by introducing a range of chain-of-X (CoX) methods, inspired by the sequential structure of CoT. Their survey categorizes various CoX techniques used to improve reasoning in diverse domains and tasks, offering a comprehensive resource for researchers interested in applying CoT across broader scenarios [21]. Zhang et al. introduced chain of preference optimization (CPO), a method that fine-tunes LLMs using search trees from the tree-of-thought (ToT) approach to optimize CoT reasoning paths. Their experiments showed that CPO improved LLM performance in complex problem-solving tasks like question answering, arithmetic reasoning, and fact verification, without the heavy computational burden of ToT [22].
Together, these studies highlight the evolution of CoT as a powerful tool in LLM-based reasoning tasks. From practical prompting techniques to theoretical insights and optimization strategies, these works contribute to our understanding of how LLMs process complex reasoning tasks and provide potential pathways for further improving these models.
2.3. Knowledge Distillation of LLMs
Knowledge distillation (KD) has become a crucial method for enhancing the performance of large language models, enabling the transfer of knowledge from larger, more complex models to smaller, more efficient ones [23,24,25,26,27,28,29]. Various studies have explored different approaches and applications of KD in LLMs, focusing on model compression, efficiency improvements, and the mitigation of limitations such as hallucinations. Yang et al. introduced LLM-NEO, a parameter-efficient knowledge distillation method that integrates low-rank adaptation (LoRA) with KD to improve knowledge transfer efficiency. Their approach demonstrates superior performance in compressing models like Llama 2 and Llama 3.2, offering insights into hyperparameter optimization and LoRA's robustness in KD applications [23]. Song et al. explored privacy-preserving data augmentation using KD, focusing on medical text classification. Their method combines differential privacy (DP) with KD to generate pseudo samples while maintaining privacy guarantees. Using LLMs as teachers, their approach trains student models to select private data samples with calibrated noise, ensuring privacy protection while enhancing the classification model's performance [30]. Li et al. proposed prompt distillation (POD) to improve the efficiency of LLM-based recommendation systems. Their work distills discrete prompts into continuous vectors, enhancing the efficiency of LLM training and reducing inference time. Although their method showed improvements in training, it also highlighted the need for further advancements in inference efficiency for recommendation tasks [25]. Xu et al. provided a comprehensive survey on the knowledge distillation of LLMs, emphasizing its role in transferring knowledge to open-source models like LLaMA and Mistral. The survey discusses KD mechanisms, its integration with data augmentation (DA), and the potential for the self-improvement of open-source models. This work lays the foundation for understanding how KD can bridge the gap between proprietary and open-source models while enhancing their contextual understanding and ethical alignment [31]. McDonald et al. focused on reducing hallucinations in LLMs through knowledge distillation. Their research demonstrated how transferring knowledge from high-capacity teacher models to smaller student models can significantly reduce hallucination rates while improving the accuracy of generated responses. Using the MMLU benchmark, their method showed substantial improvements in model robustness and accuracy, suggesting that KD can be an effective tool for refining LLMs in real-world applications [32]. Further enhancing distillation robustness may require addressing causal feature separation to improve out-of-distribution generalization [33], especially when confounded instruments (e.g., biased training data or prompts) affect the student model's causal inference [34]. Additionally, modeling time-varying dependencies in the distillation process through state-space counterfactual prediction [35] could help adapt to dynamic knowledge shifts between teacher and student models.
Therefore, knowledge distillation in LLMs has become a critical area of research, with approaches ranging from improving privacy, recommendation efficiency, and model robustness to reducing hallucinations. These studies highlight the diverse applications of KD and provide a pathway for further research into optimizing and transferring knowledge within LLMs, making them more efficient and reliable in various domains.
Table 1 provides a categorized summary of existing studies on chain-of-thought reasoning and knowledge distillation for large language models, highlighting their key contributions and limitations. This comparison reveals that prior work either focuses on prompting strategies or general-purpose distillation, without addressing the integration of reasoning processes into distilled models. In contrast, our work uniquely combines CoT reasoning with knowledge distillation, targeting enhanced performance in code generation tasks with smaller models.
3. Methodology
3.1. Overall Architecture
The proposed method distills the cognitive process of LLMs into smaller models for code generation tasks, specifically focusing on the transfer of intermediate reasoning steps, such as pseudocode generation. The motivation behind this approach is to improve the performance of smaller models by enabling them to mimic the problem-solving strategies of larger, more powerful models, thereby enhancing their ability to generate high-quality code. The method uses a multi-task learning framework, where the smaller model is trained to perform two tasks simultaneously: (1) generating pseudocode and (2) generating the final code. This dual learning task enables the model to learn both the reasoning path (pseudocode) and the end result (code), closely following the methodology used by larger models.
As shown in
Figure 2, the overall pipeline for this approach begins with the large model generating both the pseudocode and final code. The pseudocode, representing the logical steps taken by the large model, serves as an intermediate representation. The distillation process then trains the smaller model to replicate both the pseudocode generation and the final code generation, capturing the larger model’s reasoning process while simultaneously learning to produce the desired code outputs.
3.2. Pseudocode Extraction
To extract the reasoning process (pseudocode) from a large model during code generation, we leverage in-context learning. This method capitalizes on the model’s ability to utilize the surrounding context provided in the prompt to generate intermediate reasoning steps, such as pseudocode, before producing the final code. The underlying idea is that by conditioning the model on a well-structured prompt that encourages it to break down the problem into logical steps, the model will naturally generate the reasoning steps necessary for solving the task.
The process begins with a carefully designed prompt that not only provides the input query but also encourages the model to generate intermediate steps that clarify the approach to solving the problem. The prompt is structured as follows:
Problem:
Pseudocode:
1. [First reasoning step]
2. [Second reasoning step]
3. [Additional reasoning steps]
4. […]
Code: [Generated code]
Explanation: [Brief explanation of the generated code]
This structure guides the model to first generate intermediate reasoning steps (i.e., pseudocode) based on the given problem description. The Pseudocode section allows the model to break down the problem into smaller, logically coherent steps, which it can then use to generate the final Code section. The high-level steps in the pseudocode ensure that the model understands the flow and logic behind the task before proceeding to actual code generation.
In this prompt structure, LLMs are explicitly instructed to first generate pseudocode. This pseudocode typically includes high-level steps such as initializing variables, performing loops, and checking conditions, which are essential for solving the problem. By conditioning the model with this structure, we ensure that it generates the necessary intermediate reasoning before proceeding to the final code generation.
This approach mimics how human developers often approach programming: first by planning out the steps (pseudocode), followed by the actual implementation (code). By embedding this structure into the prompt, we harness the model’s contextual understanding, enabling it to reason more effectively and generate high-quality code based on its learned thought process.
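As a concrete illustration, the snippet below assembles the extraction prompt from the template above for a given problem description; the query_teacher call and the exact field wording are illustrative assumptions rather than the exact prompt used.

PROMPT_TEMPLATE = (
    "Problem: {problem}\n\n"
    "Pseudocode:\n"
    "1. [First reasoning step]\n"
    "2. [Second reasoning step]\n"
    "3. [Additional reasoning steps]\n\n"
    "Code:\n"
    "[Generated code]\n\n"
    "Explanation:\n"
)

def build_extraction_prompt(problem: str) -> str:
    # The bracketed hints mirror the template in Section 3.2 and steer the teacher
    # model to emit pseudocode before the final code.
    return PROMPT_TEMPLATE.format(problem=problem)

# Example usage (query_teacher is a hypothetical call to the teacher LLM):
# response = query_teacher(build_extraction_prompt("Sort a list of integers."))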
3.3. Multi-Task Learning Framework
In order to achieve the dual task of learning pseudocode generation and code generation, we formulate the multi-task learning framework as follows. Let $\mathcal{L}_{\mathrm{pseudo}}$ be the loss function for generating pseudocode, and $\mathcal{L}_{\mathrm{code}}$ be the loss function for generating the final code. The total loss for the multi-task learning objective is
$$\mathcal{L}_{\mathrm{total}} = \alpha \, \mathcal{L}_{\mathrm{pseudo}} + \beta \, \mathcal{L}_{\mathrm{code}},$$
where $\alpha$ and $\beta$ are hyperparameters that control the relative importance of the two tasks. These losses are computed based on the difference between the model's predicted outputs and the ground truth pseudocode or code. Specifically, for a given input query $x$, the model generates a pseudocode $\hat{p}$ and code $\hat{c}$, and the respective losses are computed as follows:
$$\mathcal{L}_{\mathrm{pseudo}} = \mathrm{CrossEntropy}(\hat{p}, p), \qquad \mathcal{L}_{\mathrm{code}} = \mathrm{CrossEntropy}(\hat{c}, c),$$
where $p$ and $c$ are the true pseudocode and code tokens, respectively, and $\hat{p}$ and $\hat{c}$ are the predicted tokens. By optimizing this combined objective, the model learns to simultaneously generate accurate pseudocode and the final code, mimicking the cognitive process of larger models.
This approach ensures that the smaller model not only generates correct final outputs but also understands the reasoning process involved in arriving at those outputs, which significantly improves its ability to solve complex programming tasks. Furthermore, this dual training process enables the smaller model to focus on both the intermediate reasoning and the final goal, leading to a more robust understanding of code generation as a whole.
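A minimal PyTorch-style sketch of this combined objective is given below; the tensor shapes, padding convention, and function name are illustrative assumptions rather than the exact implementation.

import torch.nn.functional as F

def multitask_distillation_loss(pseudo_logits, pseudo_targets,
                                code_logits, code_targets,
                                alpha=0.5, beta=0.5, pad_id=0):
    """L_total = alpha * L_pseudo + beta * L_code, both token-level cross-entropy."""
    # Flatten (batch, seq_len, vocab) logits against (batch, seq_len) target ids.
    l_pseudo = F.cross_entropy(
        pseudo_logits.reshape(-1, pseudo_logits.size(-1)),
        pseudo_targets.reshape(-1),
        ignore_index=pad_id,
    )
    l_code = F.cross_entropy(
        code_logits.reshape(-1, code_logits.size(-1)),
        code_targets.reshape(-1),
        ignore_index=pad_id,
    )
    return alpha * l_pseudo + beta * l_code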
3.4. Distillation of Cognitive Process
In our approach, we utilize different masks to trigger the generation of various components, [pseudocode] or [code], within the model. The mask serves as a mechanism to guide the model in generating specific parts of the solution, based on the task at hand. For instance, when the mask is applied to prompt the model for pseudocode generation, the model focuses on producing intermediate reasoning steps that break down the problem. Conversely, when the mask is applied to trigger code generation, the model generates the final executable code based on the previously generated pseudocode. This masked conditioning helps the model distinguish between generating high-level logical steps (pseudocode) and the final solution (code), ensuring that both components are learned and generated efficiently in the multi-task learning framework.
The distillation of the cognitive process is essential for transferring the problem-solving strategies from the large model to the smaller model. The larger model generates both pseudocode and the corresponding code for a given problem, serving as the “teacher” in the distillation framework. The smaller model, acting as the “student,” learns to generate both pseudocode and code by mimicking the teacher’s reasoning process.
The key idea behind this process is that intermediate steps—such as pseudocode generation—are critical for understanding how to approach and solve a coding problem. By training the smaller model to generate these intermediate steps, the model can better grasp the logical flow of a solution, leading to more accurate and efficient code generation. Through multi-task learning, the model is not only trained to generate the final code but also to follow the reasoning path, which enhances its language understanding and coding proficiency.
This distillation approach effectively captures the cognitive chain of LLMs and applies it to smaller models, making advanced code generation techniques more accessible for deployment in environments with limited computational resources. The use of pseudocode as an intermediate step ensures that the reasoning process is transferred, allowing smaller models to perform at a higher level despite their reduced size.
Algorithm 1 outlines the distillation framework for transferring the cognitive reasoning of LLMs to smaller models in code generation tasks. In each iteration, the large teacher model generates both pseudocode and final code for a given task, which serves as ground truth for training the student model. The student model is simultaneously trained to generate both pseudocode and final code, with the loss function combining the errors in pseudocode and code generation. The total loss is a weighted sum of the CrossEntropy losses for both tasks, and the student model is optimized through backpropagation. This process enables the smaller model to learn both the intermediate reasoning steps and the final code generation, effectively transferring the problem-solving capabilities of the larger model while reducing computational and memory requirements.
Algorithm 1 Distillation of cognitive process from large to small models
1: Input: Teacher model $M_T$, student model $M_S$, training data $D$, hyperparameters $\alpha$, $\beta$
2: for each task $x$ in $D$ do
3:   Step 1: Pseudocode generation
4:     $p \leftarrow M_T(x)$  {teacher generates pseudocode}
5:   Step 2: Code generation
6:     $c \leftarrow M_T(x, p)$  {teacher generates final code}
7:   Step 3: Student model output
8:     $\hat{p} \leftarrow M_S(x)$
9:     $\hat{c} \leftarrow M_S(x, \hat{p})$
10:  Step 4: Loss calculation
11:    $\mathcal{L}_{\mathrm{pseudo}} \leftarrow \mathrm{CrossEntropy}(\hat{p}, p)$
12:    $\mathcal{L}_{\mathrm{code}} \leftarrow \mathrm{CrossEntropy}(\hat{c}, c)$
13:    $\mathcal{L}_{\mathrm{total}} \leftarrow \alpha \, \mathcal{L}_{\mathrm{pseudo}} + \beta \, \mathcal{L}_{\mathrm{code}}$
14:  Step 5: Optimization
15:    Update $M_S$ by backpropagating $\mathcal{L}_{\mathrm{total}}$
16: end for
17: Output: Trained student model $M_S$
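For readers who prefer code to pseudocode, the loop below sketches one epoch of Algorithm 1 with a Hugging Face-style encoder-decoder student; the [pseudocode]/[code] prefix tokens, the teacher_generate callable, and the data format are assumptions made for illustration, not the exact training script.

def distill_epoch(teacher_generate, student, tokenizer, tasks,
                  optimizer, alpha=0.5, beta=0.5, device="cuda"):
    """One epoch of cognitive-process distillation (sketch of Algorithm 1)."""
    student.train()
    for problem in tasks:
        # Teacher produces the supervision signals: pseudocode first, then code.
        pseudo = teacher_generate(f"[pseudocode] {problem}")
        code = teacher_generate(f"[code] {problem}\n{pseudo}")

        # Student is trained on both tasks, distinguished by the mask/prefix token.
        total_loss = 0.0
        for prefix, target, weight in [("[pseudocode]", pseudo, alpha),
                                       ("[code]", code, beta)]:
            enc = tokenizer(f"{prefix} {problem}", return_tensors="pt",
                            truncation=True).to(device)
            labels = tokenizer(target, return_tensors="pt",
                               truncation=True).input_ids.to(device)
            labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
            total_loss = total_loss + weight * student(**enc, labels=labels).loss

        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()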
4. Experiment Results
4.1. Experimental Settings
All experiments were conducted on a server equipped with 8 NVIDIA Tesla V100 GPUs (32 GB memory per GPU), running a Linux-based environment with CUDA 11.2 and cuDNN 8.2. The teacher models, CodeLlama-7B and CodeQwen-7B, were pre-trained on large-scale code datasets, while the student model, CodeT5-base (220 M parameters), was selected for its efficiency in resource-constrained environments. We used the CodeSearchNet dataset, which was pre-processed and split into training, validation, and test sets with an 80–10–10 ratio.
Regarding the distillation process, the hyperparameters $\alpha$ and $\beta$ were both set to 0.5 for balanced learning between the two tasks. The Adam optimizer with a fixed learning rate was used, and the batch size was set to 16. Models were trained for 30 epochs with early stopping based on validation loss. Performance was evaluated using accuracy on the CodeSearchNet (CSN) test set, with additional human evaluation for the qualitative assessment of the generated code.
To further validate the deployability of the distilled student models in resource-constrained environments, we conducted additional runtime experiments on a consumer-grade machine equipped with a single NVIDIA RTX 3060 GPU (12 GB memory) and on a Raspberry Pi 5 with 8 GB RAM (ARM-based). On the RTX 3060, the CodeT5-base model achieved an average inference latency of 120 ms per sample, maintaining real-time responsiveness. On the Raspberry Pi, while batch inference was not feasible due to memory constraints, the model successfully executed single-sample code generation tasks within an average of 3.8 s using ONNX runtime in quantized form. These results confirm that the distilled models, especially CodeT5-base, can operate effectively on lower-end hardware, supporting our claim of suitability for deployment in resource-constrained environments.
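For reproducibility of the Raspberry Pi measurement, the steps below sketch one plausible route to quantized single-sample inference with ONNX Runtime (dynamic INT8 quantization); the file names are placeholders, and the actual export and quantization pipeline may differ in detail.

import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize an exported ONNX model to INT8 weights (paths are placeholders).
quantize_dynamic("codet5_base.onnx", "codet5_base_int8.onnx",
                 weight_type=QuantType.QInt8)

# Single-sample inference with the quantized model on CPU.
session = ort.InferenceSession("codet5_base_int8.onnx",
                               providers=["CPUExecutionProvider"])
input_names = [inp.name for inp in session.get_inputs()]
# feeds maps the input names (e.g., token ids, attention mask) to numpy arrays:
# outputs = session.run(None, feeds)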
4.2. Research Questions
In this section, we introduce the research questions, the dataset and baseline models, the evaluation details, and the experimental results.
RQ1: Effectiveness of distilling LLMs into smaller models: How does distilling the reasoning and code generation capabilities of LLMs into smaller models impact the performance of the smaller models in code generation tasks? Specifically, we investigate how different LLMs, such as CodeLlama-7B and CodeQwen-7B, perform when their reasoning processes are distilled into smaller models like CodeT5-base.
RQ2: Ablation study on the chain-of-thought process: What is the impact of disrupting the intermediate reasoning steps (e.g., pseudocode) during the distillation process? We perform an ablation study by randomly shuffling or replacing the chain-of-thought pseudocode with other random content to assess how such modifications affect the performance of the distilled smaller models.
RQ3: Human Evaluation of Distilled Model Outputs: How do human evaluators perceive the quality of code generated by the distilled smaller models? We incorporate a human evaluation to assess the generated code in terms of accuracy, readability, and overall utility, providing an empirical measure of the real-world effectiveness of the distilled models in practical scenarios.
These research questions guide our investigation into the viability, impact, and practical applications of distilling large language models’ reasoning capabilities into smaller, more deployable models for code generation.
4.3. Dataset
The dataset used for distillation contains bimodal data pairs, where each pair corresponds to a specific programming language.
Table 2 provides the statistics of the dataset, showing the number of bimodal data pairs for each language.
4.4. Evaluation Metrics
In our study, we adopt a comprehensive approach to evaluating code generation capabilities, focusing on the similarity between the generated code and the reference (oracle) code. We utilize four distinct metrics, as established in prior research [36], to conduct a thorough assessment:
Token match (TM): This metric is computed using the standard BLEU score [37], commonly used in evaluating natural language generation tasks. TM measures the n-gram overlap between the generated code and the reference. The BLEU score is defined as
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big),$$
where $p_n$ is the modified n-gram precision, $w_n$ is the weight for each n-gram (typically uniform), and BP is the brevity penalty to penalize short candidates.
Syntax match (SM): SM evaluates the similarity between the abstract syntax trees (ASTs) of the predicted code ($T_{\mathrm{pred}}$) and the reference code ($T_{\mathrm{ref}}$). We calculate the tree edit distance (TED) and normalize it:
$$\mathrm{SM} = 1 - \frac{\mathrm{TED}(T_{\mathrm{pred}}, T_{\mathrm{ref}})}{\max(|T_{\mathrm{pred}}|, |T_{\mathrm{ref}}|)},$$
where $\mathrm{TED}(\cdot, \cdot)$ denotes the tree edit distance, and $|T|$ is the size of the AST. A higher SM indicates better syntactic similarity.
Dataflow match (DM): DM assesses whether the data dependencies (i.e., variable usage and flow) are preserved. Let $E_{\mathrm{pred}}$ and $E_{\mathrm{ref}}$ denote the sets of data dependency edges (e.g., def-use chains) in the predicted and reference code. We define
$$\mathrm{DM} = \frac{|E_{\mathrm{pred}} \cap E_{\mathrm{ref}}|}{|E_{\mathrm{pred}} \cup E_{\mathrm{ref}}|},$$
which measures the Jaccard similarity between dataflow graphs.
CodeBLEU (CB): CodeBLEU [38] is a weighted combination of four aspects: n-gram match, weighted syntax match, dataflow match, and keyword match. It is computed as
$$\mathrm{CB} = \lambda_1 \, \mathrm{TM} + \lambda_2 \, \mathrm{SM} + \lambda_3 \, \mathrm{DM} + \lambda_4 \, \mathrm{KM},$$
where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are the component weights and KM is the keyword match score. In our evaluation, we follow the default weights suggested in [38].
These metrics collectively provide a multi-dimensional assessment of code quality, ranging from basic lexical similarities to complex semantic and syntactic alignments. For a more detailed explanation of these metrics, readers are encouraged to refer to [36]. This multifaceted evaluation framework allows us to rigorously assess the effectiveness of code generation algorithms, ensuring their practical applicability and reliability. Additionally, Top-1 accuracy measures the percentage of times the model's highest-ranked prediction is correct, while Top-5 accuracy measures the percentage of times that the correct answer appears within the model's top five predictions.
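To make the dataflow and CodeBLEU computations concrete, the sketch below evaluates DM as a Jaccard similarity over dependency-edge sets and combines the four component scores with uniform weights; the edge-extraction step and the weight values are assumptions, with the latter following commonly used defaults.

def dataflow_match(pred_edges: set, ref_edges: set) -> float:
    """Jaccard similarity between predicted and reference def-use edge sets."""
    if not pred_edges and not ref_edges:
        return 1.0
    return len(pred_edges & ref_edges) / len(pred_edges | ref_edges)

def codebleu(tm: float, sm: float, dm: float, km: float,
             weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted combination of token, syntax, dataflow, and keyword match scores."""
    w1, w2, w3, w4 = weights
    return w1 * tm + w2 * sm + w3 * dm + w4 * km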
4.5. Baseline Methods
In our study, we focus on several baseline models for distillation, each representing a state-of-the-art approach in code generation. These models serve as foundational elements in our distillation process, allowing us to capture and transfer their capabilities into smaller, more efficient models.
Tranx: Tranx, as detailed in [39], is a model that predicts a sequence of actions to construct an abstract syntax tree, which in turn generates the source code. It operates on an abstract syntax description language framework, a grammatical representation of ASTs. The model predicts three types of actions at each timestep to expand the tree until it is fully constructed. Given its influential role in AST construction, as evidenced by numerous follow-up studies [36], Tranx is chosen as the representative technique in our study.
CodeT5: Following the T5 architecture [40], CodeT5 [41] processes both code and text tokens. It features a unique pre-training task, NL-PL dual generation, enabling the model to generate code from text and vice versa. This dual-generation capability, along with its encoder–decoder framework, makes CodeT5 an ideal candidate for code generation tasks. We opt for the CodeT5-base version due to its demonstrated effectiveness.
NatGen: Building upon CodeT5, NatGen [42] introduces an additional pre-training task called "Code Naturalizing." This task trains the model to convert unnatural code into a form resembling human-written code, enhancing its understanding of code semantics and its ability to generate human-like code.
SPT-Code: Another advanced pre-trained model, SPT-Code [43], also follows the encoder–decoder framework. It differs from CodeT5 in its pre-training input; it includes the AST of the code, harnessing syntactic information, and uses the method names and called methods as a natural language description, thus eliminating the need for a bilingual corpus.
Each of these models brings unique strengths to the table, making them ideal candidates for our study on distilling large language model capabilities into smaller, more deployable models. Their collective attributes form the basis of our approach to enhance the efficiency and efficacy of code generation in resource-constrained environments.
To ensure a fair comparison across all models, we standardized the training and distillation procedures by using the same set of hyperparameters where applicable. Specifically, all student models were trained using the Adam optimizer with the same fixed learning rate, a batch size of 16, and a maximum of 30 epochs with early stopping based on validation loss. For the distillation process, we set the loss balancing weights $\alpha$ and $\beta$ to equally weigh the task-specific loss and the distillation loss. Each model was trained on the same training split of the CodeSearchNet dataset, and evaluations were conducted under identical conditions. For models with architectural differences (e.g., CodeT5 vs. Tranx), we retained their original layer configurations and embedding dimensions as reported in their respective papers to preserve their design integrity, while ensuring consistent training conditions across all experiments.
4.6. Large Language Models
The LLMs used for distillation in this paper are as follows:
DeepSeek: DeepSeek is a general-purpose large language model known for its versatility across a wide range of tasks, from text generation to problem-solving. It is designed to handle diverse inputs and can be fine-tuned for specific applications, showcasing impressive performance in natural language understanding and generation. We exploit DeepSeek-671B in our experiments.
GPT-4: GPT-4 is one of the most advanced general-purpose language models developed by OpenAI. It excels in complex reasoning tasks, including multi-step problem solving, creative writing, and language comprehension, setting new benchmarks for large language models in various NLP applications.
CodeLlama-7B: CodeLlama-7B is a specialized code generation model developed for software engineering tasks. With 7 billion parameters, it is designed to handle programming-related tasks such as code completion, bug fixing, and code synthesis, offering a more efficient solution compared to larger models while maintaining strong performance.
CodeQwen-7B: CodeQwen-7B is another code-focused large model with 7 billion parameters, optimized for various programming languages and development environments. It is specifically fine-tuned for generating code from natural language descriptions, offering a more compact alternative to larger models without sacrificing quality.
Our approach is designed to be model-agnostic, meaning it can be seamlessly integrated with various LLMs in a plug-and-play manner. This flexibility allows for future explorations with even more advanced LLMs, potentially unlocking new avenues in code generation and LLM applications.
4.7. Results
4.7.1. RQ1: Effectiveness of Distilling LLMs into Smaller Models
To ensure that the model is well-trained and neither underfits nor overfits, we monitored the training and validation loss across 30 epochs. As shown in
Figure 3, both losses exhibit a smooth and consistent decline, converging around epoch 26. The validation loss closely follows the training loss throughout, indicating that the model generalizes well to unseen data. These results justify our decision to use 30 epochs with early stopping based on validation loss, as the model reaches a stable state without overtraining.
As shown in
Table 3, our distillation framework shows clear improvements in smaller models across various code generation tasks. For instance, distilling from CodeQwen-7B enhances CodeT5’s Top-1 accuracy from 9.2 to 12.0, and Tranx sees a similar boost from 2.5 to 3.1. These results suggest that even smaller 7B models can effectively transfer key reasoning steps, enabling smaller models to perform better in generating code. Notably, CodeQwen-7B distillation leads to a larger increase in performance for CodeT5 compared to other small models, which suggests that CodeQwen-7B is particularly effective at transferring knowledge for more complex tasks, especially in terms of generating code output that aligns well with expected patterns.
Larger models, i.e., DeepSeek and GPT-4, provide even more significant performance gains, particularly for more complex tasks. For example, distilling from DeepSeek boosts CodeT5's Top-1 accuracy from 9.2 to 15.5, while GPT-4 distillation achieves a higher Top-1 accuracy of 16.0. This trend continues across other models, where distillation from DeepSeek and GPT-4 outperforms both CodeQwen-7B and CodeLlama-7B in Top-1 and Top-5 accuracy. Specifically, NatGen sees its Top-1 accuracy increase from 9.5 to 12.0 with DeepSeek, and to 13.0 with GPT-4, showing how the larger models' advanced reasoning capabilities lead to more accurate code generation, especially when the task complexity increases.
Therefore, our distillation framework not only provides clear performance improvements across all small models but also demonstrates that larger models offer more pronounced benefits. The GPT-4 and DeepSeek distillations consistently lead to the highest performance, with GPT-4 achieving the best Top-1 accuracy across all tasks (e.g., 16.0 for CodeT5 and 13.0 for NatGen). On the other hand, CodeQwen-7B offers solid improvements, though not as drastic as the larger models. This indicates that the distillation framework is highly versatile, capable of improving small models with distillation from both relatively smaller 7B models and larger, more powerful models, making it a scalable solution for efficient code generation.
To present the results of the ablation study on the chain-of-thought process, we introduce a random disturbance to the distilled pseudocode in our models. This aims to test the significance of maintaining the reasoning chain (i.e., pseudocode) during the distillation process. We randomly shuffle or replace the pseudocode with irrelevant content and observe how the performance changes compared to the baseline models and the best-performing GPT-4 distilled model.
4.7.2. RQ2: Ablation Study on the Chain-of-Thought Process
Table 4 presents the results of the ablation study, where we compare the original distilled models with those where the pseudocode reasoning (chain-of-thought) was randomly disturbed. As seen in the table, introducing random disturbance in the pseudocode leads to a decrease in performance across all models, as indicated by the lower Top-1 and Top-5 accuracy scores. For example, Tranx-random-distill shows a significant reduction in performance, with Top-1 accuracy dropping from 2.5 to 2.0, and Top-5 accuracy also showing a decline. Similarly, CodeT5-random-distill and SPT-Code-random-distill exhibit noticeable drops in performance.
The GPT-4 distilled models, however, consistently show the highest performance, further reinforcing the value of maintaining the chain-of-thought pseudocode in the distillation process. The comparison between the random-distilled models and their base counterparts confirms that the reasoning steps embedded in the pseudocode are crucial for achieving optimal performance in code generation tasks. Therefore, this ablation study validates the effectiveness of our distillation framework in transferring meaningful reasoning processes from LLMs to smaller ones, highlighting the importance of preserving the reasoning chain during the distillation process.
4.7.3. RQ3: Human Evaluation of Distilled Model Outputs
For the human evaluation, we selected 100 samples from the test set and focused on the top-performing models, specifically the five models from the CodeT5 series: the baseline version and four distillations from LLMs (CodeLlama-7B, CodeQwen-7B, DeepSeek, and GPT-4). These samples were assessed by five experts, each with over six years of programming experience. Each model's output was evaluated across three criteria: correctness, completeness, and fluency, with scores ranging from 1 to 5, where 1 represents "Poor" and 5 represents "Excellent." To ensure consistency, a cross-validation process was applied, and the final scores were averaged for each model.
The results from the human evaluation, as shown in
Table 5, provide further confirmation of the effectiveness of our distillation framework. When evaluating the five models across the three criteria—Correctness, Completeness, and Fluency—we observe clear improvements in the distilled models, particularly those generated from larger models, which supports the findings from the automatic evaluation metrics.
Firstly, the distillation from CodeQwen-7B shows notable improvements compared to the baseline CodeT5. For example, the Correctness score increases from 2.8 to 3.4, the Completeness score increases from 2.9 to 3.4, and the Fluency score increases from 3.2 to 3.6. These results align with the trends observed in the automatic evaluation, where the distillation from CodeQwen-7B similarly improved performance across the board. Furthermore, distillation from more powerful models like DeepSeek and GPT-4 results in even greater improvements, achieving Correctness scores of 4.1 and 4.3, respectively. These improvements in human evaluation are consistent with the gains observed in the automatic metrics, reinforcing the idea that the distillation framework can successfully transfer reasoning capabilities from larger models to smaller ones.
The highest performance in Correctness, Completeness, and Fluency is achieved by the distillation from GPT-4, with scores of 4.3, 4.2, and 4.3, respectively, which outperforms all other models in human evaluation. This confirms that our distillation framework not only improves automated evaluation metrics but also translates into tangible improvements in human-assessed qualities such as code Correctness, Completeness, and Fluency. Therefore, our results demonstrate that the distillation framework is effective in enhancing the performance of smaller models, with both automated and human evaluations confirming the benefits of distilling from larger models.
To provide a qualitative comparison, we further analyzed representative code samples generated by our distilled models and baseline models. Compared to baseline models such as CodeT5 and NatGen, the code generated by our distilled models (especially those distilled from GPT-4 and DeepSeek) exhibited more coherent logic, clearer structure, and better alignment with programming conventions. In particular, the distilled models were more likely to include meaningful variable names, complete function definitions, and properly structured control flows, whereas baseline models occasionally produced syntactically valid but semantically incomplete or overly generic code. This improvement is largely attributed to the preservation of chain-of-thought reasoning during distillation, which helps guide the model through multi-step generation processes in a more interpretable and human-like manner.
4.7.4. Prompt Examples of the Distillation Process
To illustrate how our distillation framework operates in real scenarios, we provide representative prompt examples used during training and evaluation. Below is an example of a training prompt given to the teacher model (e.g., GPT-4) and its corresponding output, which is used to supervise the student model:
Problem: Write a Python function that takes a nested list of arbitrary depth and returns a flat list containing all the elements.
Pseudocode:
1. Define a function that takes a list as input.
2. Initialize an empty result list.
3. Iterate through each element in the input list.
4. If the element is a list, recursively flatten it and extend the result list.
5. If the element is not a list, append it directly to the result list.
6. Return the result list after processing all elements.
Code: [generated Python implementation; a sketch consistent with the pseudocode is shown after this example]
Explanation:
This function uses recursion to handle arbitrarily nested lists. For each element, it checks if it is a list; if so, it calls itself recursively and flattens that sublist. Otherwise, it appends the element directly. This process ensures that the final result contains no nested structures, regardless of input depth.
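For completeness, a minimal Python implementation consistent with the pseudocode and explanation above is shown here; the original example displayed the generated code as a figure, so this listing is a faithful sketch rather than the verbatim model output.

def flatten(nested):
    """Flatten a list nested to arbitrary depth into a single flat list."""
    result = []
    for element in nested:
        if isinstance(element, list):
            # Recursively flatten sublists and extend the result list.
            result.extend(flatten(element))
        else:
            result.append(element)
    return result

# Example: flatten([1, [2, [3, 4]], 5]) returns [1, 2, 3, 4, 5]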
5. Threats to Validity
5.1. Internal Validity
One potential threat to internal validity in our study is the reliance on expert human evaluators, which may introduce subjective biases in the scoring process. While we selected five experts with over six years of programming experience to ensure a reasonable level of expertise, variations in individual judgment could influence the evaluation of criteria such as Correctness, Completeness, and Fluency. Although we employed a cross-validation process and averaged the scores to mitigate this issue, the small number of evaluators and their subjective interpretations of the evaluation criteria could still introduce inconsistency in the results. Furthermore, there is the possibility that the model outputs evaluated in the human study may not cover the full range of real-world code generation tasks, as the selected test set might not be entirely representative of the complexity encountered in different programming environments.
Another internal validity concern is the limited diversity in the test set. Although we selected a diverse range of 100 samples from the dataset, this subset might not comprehensively represent the vast spectrum of programming tasks encountered in practical applications. As the difficulty and nature of programming problems can vary greatly, the specific sample selection process might unintentionally favor certain types of code or problem domains, leading to biased performance metrics. This could affect the fairness of the evaluation, as some models may perform better on the selected test set due to its particular composition rather than a generalized ability to handle diverse real-world scenarios.
Additionally, while our framework is designed to preserve reasoning patterns through the use of intermediate supervision (e.g., pseudocode), we do not directly analyze the internal representations of student models to verify the depth of reasoning transfer. Our current conclusions are based on performance improvements and output structure. Future work may incorporate probing techniques or attention-based interpretability analyses to more rigorously evaluate whether internal cognitive-like mechanisms are being transferred during distillation.
5.2. External Validity
In terms of external validity, a major concern arises from the limited scope of our study, as it only focuses on specific LLMs for distillation (i.e., CodeLlama-7B, CodeQwen-7B, DeepSeek, and GPT-4). While these models represent advanced state-of-the-art architectures, the findings from our experiments might not directly apply to other large language models or distillation methods that were not included in this study. For example, the effectiveness of distillation from models not covered here, such as those specialized in other domains or with different architectural properties, remains unclear. Therefore, the conclusions drawn about the benefits of distilling reasoning capabilities may not generalize to other, less widely used models or novel techniques not considered in our study.
Moreover, the test set used for both automatic and human evaluations may not fully reflect the real-world programming tasks typically encountered in production environments. While our dataset is diverse in terms of programming languages and problem types, its composition might still introduce biases that affect how well the results generalize to broader use cases. For example, code generation tasks in more specialized domains or those requiring complex reasoning might differ from those represented in our test set. Thus, the results of this study may be most applicable to code generation tasks within the scope of the evaluated test set, limiting the broader applicability of our findings to all possible programming scenarios.
While our primary training and evaluation were conducted on a high-performance server, we recognize the need to verify real-world deployability. To address this, we performed additional testing on consumer-grade and ARM-based devices, which demonstrated that the distilled student models can run efficiently with limited computational resources. However, performance may vary across deployment contexts depending on device memory, optimization level (e.g., quantization), and runtime framework, which warrants further profiling in production scenarios.
5.3. Theoretical and Practical Implications
This study offers both theoretical and practical contributions to the field of code generation and LLM distillation. Theoretically, it demonstrates that preserving chain-of-thought reasoning during knowledge distillation significantly enhances the performance of smaller models, providing empirical support for integrating intermediate reasoning steps as transferable knowledge units. This finding advances our understanding of how reasoning traces from LLMs can be effectively internalized by compact models, bridging the gap between model interpretability and performance. Practically, our proposed distillation framework enables the deployment of lightweight, resource-efficient models that retain strong reasoning capabilities, making them suitable for real-world applications such as IDE integration, automated code assistants, and edge-device programming tools. Furthermore, the model-agnostic design of our framework ensures its adaptability across various LLMs and code generation tasks, promoting its utility in both academic and industrial settings.
5.4. Security Evaluation
While our primary evaluation focuses on the functional correctness of generated code, we acknowledge that security is a critical dimension often overlooked in code generation tasks. To address this concern, we conducted a preliminary security analysis using static analysis tools, including Semgrep and CodeQL, to assess potential vulnerabilities in the code generated by both baseline and distilled models.
Specifically, we analyzed 1000 code snippets generated from the test set and scanned them for common security issues, such as unsafe input handling, use of insecure APIs, and missing input validation. Our findings show that the code generated by distilled models, particularly those distilled from GPT-4 and DeepSeek, tends to produce fewer security warnings compared to their undistilled counterparts (e.g., CodeT5 and NatGen). This suggests that distilling reasoning-aware models may help capture secure programming patterns implicitly learned by the teacher LLMs.
However, vulnerabilities still occasionally emerge, especially in edge cases where the model attempts complex input/output operations or constructs SQL queries or shell commands. While the reduction in security warnings is promising, this preliminary evaluation highlights the need for incorporating explicit security-aware objectives or constraints in future distillation pipelines.
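As an illustration of this scanning setup, the snippet below runs Semgrep's registry rules over a directory of generated snippets and counts the reported findings; the directory name is a placeholder, and the exact rule configuration used in our analysis may differ.

import json
import subprocess

def count_semgrep_findings(target_dir: str) -> int:
    """Run Semgrep with its auto configuration and count reported findings."""
    completed = subprocess.run(
        ["semgrep", "scan", "--config", "auto", "--json", target_dir],
        capture_output=True, text=True, check=False,
    )
    report = json.loads(completed.stdout)
    return len(report.get("results", []))

# Example: count_semgrep_findings("generated_snippets/")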
6. Conclusions
In this work, we propose a novel distillation framework designed to transfer the reasoning and code generation capabilities of large language models into smaller, more efficient models. Our approach leverages intermediate reasoning steps, such as pseudocode, during the distillation process, enabling smaller models to replicate the problem-solving strategies of larger models. We evaluate the framework on multiple tasks using several LLMs, including CodeLlama-7B, CodeQwen-7B, DeepSeek, and GPT-4, and demonstrate significant improvements in performance across various code generation benchmarks. Our results show that the distillation framework not only improves the performance of smaller models, as evidenced by both automatic metrics and human evaluations, but also enhances code generation capabilities while maintaining efficiency in resource-constrained environments. Notably, distilling from larger models, particularly GPT-4, leads to the most substantial improvements, achieving top accuracy scores in multiple tasks.
Looking ahead, there are several promising directions for future work. First, we plan to explore the application of our distillation framework to other tasks beyond code generation, such as question answering and text summarization, to assess its broader applicability. Additionally, investigating more advanced techniques for preserving the reasoning steps during distillation, such as hierarchical or multi-modal reasoning, could further improve the quality of the distilled models. Another avenue for future work involves expanding the dataset to include a wider variety of programming languages and more complex real-world tasks to test the generalizability of the framework in diverse scenarios. Moreover, we aim to evaluate the scalability of our framework with even larger teacher models and systematically explore the trade-offs between model size, reasoning quality, and computational efficiency. Finally, integrating our approach into real-world systems and comparing its performance with existing tools, such as Kiro (https://kiro.dev/, accessed on 11 August 2025), will be an important step towards validating its practical impact and usability.