Exploring Selective Layer Freezing Strategies in Transformer Fine-Tuning: NLI Classifiers with Sub-3B Parameter Models

Hwang, Taewook; Seo, Hyein; Jung, Jeesu; Jung, Sangkeun

doi:10.3390/app151910434

Open AccessArticle

Exploring Selective Layer Freezing Strategies in Transformer Fine-Tuning: NLI Classifiers with Sub-3B Parameter Models

¹

Computer Science and Engineering, Chungnam National University, Daejeon 34134, Republic of Korea

²

EurekaAI, Daejeon 34134, Republic of Korea

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(19), 10434; https://doi.org/10.3390/app151910434

Submission received: 29 August 2025 / Revised: 23 September 2025 / Accepted: 24 September 2025 / Published: 26 September 2025

(This article belongs to the Section Computing and Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

In recent years, methods that selectively fine-tune or reduce the number of layers in large language models (LLMs) have garnered attention as an efficient alternative to traditional fine-tuning, where all layers are trained. In this study, we revisit the classical concept of layer freezing and propose a simple, effective strategy that selectively fine-tunes only a portion of transformer layers. We show that freezing the bottom 25% or 50% of layers in small-scale LLMs with sub-3 billion parameters yields significant improvements in memory efficiency and training speed while maintaining, or even surpassing, the performance of full fine-tuning and Low-Rank Adaptation (LoRA). Through experiments on Natural Language Inference (NLI) tasks using LLMs with fewer than 3 billion parameters, our approach achieves up to 50% memory savings and 30% faster training. Notably, our method does not require architectural modifications or additional parameters, making it particularly suitable for resource-constrained environments.

Keywords:

small Large Language Models (sLLM); transformer; fine-tuning; parameter-efficient training; layer freezing; Natural Language Inference (NLI); transfer learning; computational efficiency

1. Introduction

Recent advancements in Natural Language Processing (NLP) have been largely driven by the emergence of Large Language Models (LLMs) such as GPT-3 [1], PaLM [2], and LLaMA [3]. These models leverage extensive pre-training to learn a wide range of linguistic patterns and knowledge. They demonstrate high performance across various tasks through techniques like prompt engineering [4] and in-context learning [5]. As a result, LLMs have become indispensable tools in numerous NLP tasks, including translation, question answering, and document generation.

However, several limitations are associated with the training and application of LLMs: First, high-performing LLMs typically contain over 7 billion parameters, making them computationally expensive and requiring vast resources and time for both training and inference. Additionally, LLMs tend to exhibit inconsistent performance in unfamiliar domains or tasks that were not encountered during pre-training [6]. This is a chronic problem for pre-trained models, which are constrained to generating outputs based on their pre-trained knowledge. To address this fundamental issue, it is imperative to implement knowledge updates through fine-tuning processes, enhancing the model’s adaptability [7]. However, due to the immense size of modern LLMs, with parameter counts ranging from billions to hundreds of billions, even fine-tuning demands significant computational effort.

To address these computational challenges, we revisit the concept of Layer Freezing, a simple yet effective fine-tuning strategy, and introduce detailed strategies that extend this approach. Previous studies have explored freezing layers of small language models such as BERT [8] during fine-tuning, but these efforts have mainly focused on improving speed and have encountered challenges due to the complexity of freezing techniques [9,10].

In contrast, we have found that simply freezing a subset of layers can achieve better computational efficiency and superior performance compared with fine-tuning the entire model. Rather than introducing additional layers or parameters, our goal is to reduce costs and maximize training efficiency by focusing on fine-tuning only a subset of layers within the existing LLM. The discovered method has the following advantages:

Simplicity: This approach is highly straightforward and can be easily applied without the need for complex analysis or modifications to the model architecture.
Universality: This method can be widely applied across various model architectures, regardless of scale or structural complexity.
Performance Improvement: Experimental results show that this method not only improves computational efficiency but also enhances model performance compared with fine-tuning all layers.

Figure 1 illustrates a comparison between the conventional fine-tuning method that utilizes all layers and the layer selection method tested in this study. In the Fixed Freeze scenario, a predetermined ratio and location of layers are frozen before training begins. In contrast, the Adapted Freeze approach involves recording and analyzing the initial training process, selecting the appropriate layers to freeze, and then completing the remaining training.

To validate our approach, we focused on LLMs with fewer than 3 billion parameters, which can be trained on a single GPU. We conducted a comparative analysis using the Natural Language Inference (NLI) task [11], a subset of text classification problems. The NLI task was chosen as it effectively assesses a model’s fundamental language understanding and reasoning capabilities.

In our experiments, we found that freezing the bottom 25% of transformer layers during fine-tuning yields significant improvements in both computational efficiency and model performance. This simple approach reduced training memory usage by over 30% (excluding the static model parameters) compared with full-model fine-tuning, while generally improving overall performance. Additionally, we observed approximately a 20% increase in training speed. Empirical comparisons revealed that our layer-freezing strategy achieved superior performance metrics relative to Low-Rank Adaptation (LoRA) [12], while offering similar gains in training speed and memory reduction. These findings demonstrate that a conceptually simple, classical fine-tuning technique can be repurposed to meet the demands of modern LLMs while achieving competitive performance.

2. Related Work

2.1. Natural Language Inference

NLI, also known as recognizing textual entailment [13], is a fundamental task in NLP. This task involves determining the logical relationship between a premise and a hypothesis. Specifically, a model must classify this relationship as entailment (the hypothesis necessarily follows from the premise), contradiction (the hypothesis contradicts the premise), or neutral (the hypothesis may or may not be true given the premise).

NLI serves as a crucial indicator for evaluating a model’s language understanding and reasoning capabilities, including comprehension of semantics, context, and logical relationships. Models that demonstrate high performance in NLI tasks tend to excel in other language understanding tasks as well. Success in NLI challenges often indicates a level of language understanding that generalizes across various domains and linguistic tasks [14]. Consequently, NLI has established itself as a valuable benchmark for assessing language models [15].

2.2. General Fine-Tuning Approaches

The traditional fine-tuning approach involves retraining all or some of the parameters of a pre-trained model to adapt it to a new task [16]. This method adjusts the weights of the pre-trained model, utilizing various strategies such as learning rate adjustment, gradual unfreezing, and discriminative learning rates [17]. While this is useful for training task-specific models, it has limitations in terms of computational cost and resource efficiency when fine-tuning all parameters in recent large-scale models. To address these issues, techniques that reduce the number of trainable parameters, such as LoRA, or reduce memory usage, such as LLM quantization [18], are gaining attention.

2.3. Parameter-Efficient Fine-Tuning Approaches

2.3.1. Layer Freezing

Since the emergence of language models, there have been attempts to improve the efficiency of fine-tuning by freezing certain layers. Ref. [9] proposed a method that fine-tunes only the bias parameters instead of the weights in transformer-based masked language models, reducing memory usage and improving speed. Ref. [10] introduced a technique to accelerate the training process by gradually freezing layers based on their impact during training. However, both studies primarily focus on speed improvement, which often leads to performance degradation. Additionally, these studies require complex mechanisms for layer freezing.

Recent work revisits layer freezing for LLM fine-tuning with modern strategies, offering different perspectives on parameter selection and freezing mechanisms. Ref. [19] proposes Half Fine-Tuning (HFT), which randomly freezes 50% of parameters each round across all transformer layers, retaining pre-trained knowledge and mitigating catastrophic forgetting. HFT achieves comparable or better task performance with around a 30% reduction in training time by treating parameter selection as a regularization mechanism. However, HFT’s random selection approach may freeze task-critical parameters, potentially limiting its effectiveness for specific tasks requiring targeted layer adaptation. Ref. [20] introduces SAFE (Selective Adapter Freezing Early), which selectively freezes less important adapter modules early in training based on centered kernel alignment importance scores. SAFE reduces memory use by 43%, computation by 35%, and training time by 12% without degrading accuracy. While SAFE demonstrates sophisticated importance-based selection, it requires additional computational overhead for importance scoring and is limited to adapter-based fine-tuning architectures.

These selective freezing techniques demonstrate that layer- or module-level freezing can yield substantial efficiency gains. However, they differ fundamentally in their selection mechanisms: HFT employs stochastic selection for simplicity, SAFE uses importance-based selection for precision, while our approach utilizes positional selection based on transformer layer hierarchy. Unlike HFT’s random approach that may inadvertently freeze crucial upper layers needed for task-specific reasoning in NLI tasks, our bottom-up strategy systematically preserves lower-layer linguistic features while allowing upper layers to adapt to logical inference requirements. Compared with SAFE’s adapter-module focus requiring architectural modifications, our method operates directly on transformer layers without additional components, making it more suitable for resource-constrained environments with small language models where adapter overhead may be prohibitive.

2.3.2. Low-Rank Adaptation

LoRA was first introduced by Ref. [12] as a parameter-efficient fine-tuning method for large pre-trained language models. Rather than updating the full set of model parameters, LoRA injects trainable low-rank matrices into the attention layers of transformer architectures while keeping the original weights frozen. This reduces the number of trainable parameters and the overall memory footprint, enabling efficient adaptation with minimal resource requirements. LoRA maintains model expressiveness by learning task-specific directions in a low-dimensional subspace.

Recently, Ref. [21] proposes AdaLoRA, which adaptively allocates low-rank update parameters based on importance, resulting in better performance under tight parameter budgets. Ref. [22] introduces QLoRA, which combines 4-bit quantization of the base model with LoRA adapters, enabling efficient fine-tuning of very large models on a single 48 GB GPU without performance loss. More recently, Ref. [23] proposes GeoLoRA, a dynamic low-rank training method based on matrix differential equations that adaptively allocates parameter budget across layers via geometric principles. GeoLoRA achieves improved convergence guarantees and outperforms AdaLoRA in both accuracy and computational efficiency. These advancements demonstrate the continued evolution of LoRA-based adaptation techniques, improving the efficiency-accuracy trade-off for modern large language models.

In contrast, our study employs a simple freezing method that can be applied to any model while also demonstrating performance improvement. This straightforward approach stands out in that it not only reduces computational cost but also enhances model performance.

3. Layer Selection for Fine-Tuning

This study proposes a method to improve model training efficiency by fine-tuning only a subset of layers in an LLM and aims to validate it experimentally. In contrast to conventional LLM fine-tuning methods that involve training all layers, this study demonstrates that selectively freezing specific layers and fine-tuning only the remaining ones can lead to improved performance, faster learning speed, and reduced memory usage.

3.1. Fixed Freezing

The main strategy used during the model training process is to freeze specific layers of the model and fine-tune only the remaining layers. We evaluated the impact of various layer selection methods on model performance and training efficiency:

Bottom-Up Freezing: We experimented with a method that sequentially freezes the model’s bottom layers, allowing only the remaining upper layers to be fine-tuned. As the bottom layers are primarily responsible for the basic linguistic expressiveness of the language model, while the upper layers tend to learn task-specific representations [24], we hypothesized that this freezing approach would preserve fixed linguistic knowledge while enabling task-specific adaptation.
Top-Down Freezing: We tested an approach that freezes the top layers and trains only the bottom layers. This method anticipates that the bottom layers will be tuned to the task based on the fixed higher-level concepts in the frozen upper layers.
Interval Freezing: This method involves freezing layers at intervals of n, meaning that every n-th layer is frozen during training. This approach aims to allow both upper and lower layers to be appropriately adjusted simultaneously, encouraging information to be learned evenly across various layer levels.

3.2. Adapted Freezing

As an alternative to the Fixed Freezing strategy, we propose an Adaptive Freezing approach with dynamic layer selection. In this approach, we track the weight changes of each layer during training to identify layers with significant or minimal changes. Based on these changes, we automatically identify the layers that play a crucial role in performance. The Top-N layers, according to the magnitude of weight changes, are then selectively frozen before proceeding with training. The following outlines the operational sequence of this adaptive layer selection method:

Weight Change Tracking: We calculate the change in weights for each layer by comparing the layer-wise weights before training and after the first five steps of training. The magnitude of weight changes for multiple parameters within a single layer was quantified as a single scalar value by computing the L2 norm of the changes and then taking the mean across all parameters in the layer.
Top-N Layer Selection: We select the top N layers with either the largest or smallest weight changes, freeze them, and then resume training. Through this, we aimed to understand the roles that layers with large and small weight changes play in the fine-tuning.

3.3. Freezing Strategies

Through these Fixed and Adapted Freezing methods, we aim to experimentally demonstrate that fine-tuning only a subset of layers can reduce memory and computational costs compared with fine-tuning the entire model. We hypothesize that this approach can potentially improve performance or, at a minimum, maintain it without degradation.

The following are abbreviations for the freezing strategies used in this study:

ALL: Fine-tuning using all layers, used as the baseline.
LoRA: Fine-tuning using LoRA, also used as the baseline.
INT (Interval): Fine-tuning by freezing layers at regular intervals.
BOT (Bottom Up): Fine-tuning by freezing layers starting from the bottom of the model.
TOP (Top Down): Fine-tuning by freezing layers starting from the top of the model.
ADT-L (Adapted Low): Fine-tuning by freezing N layers with the smallest weight changes.
ADT-H (Adapted High): Fine-tuning by freezing N layers with the largest weight changes.

The number following each abbreviation indicates the percentage of frozen layers. For example, INT25 means 25% of the layers are frozen at regular intervals, while TOP50 means 50% of the layers are frozen starting from the top.

Figure 2 visualizes the freezing strategies utilized in this study. When 50% of the total layers are frozen, the layers are frozen and trained in the pattern shown in the figure for each strategy.

4. Experiments

The primary objective of this study is to verify whether fine-tuning only a subset of layers in LLMs can achieve sufficient training effectiveness compared with training all layers. Through our approach, we aim to explore a methodology that reduces memory and computational resource requirements while maintaining training effectiveness without performance degradation.

4.1. Models and Datasets

The experiments used decoder-only small LLMs with sub-3 billion parameters, such as Gemma-2b (Gemma) [25], Phi-2 [26], and MiniCPM-2b-128k (MiniCPM) [27], which can be trained on a single GPU. All these models are large-scale pre-trained language models whose parameters can be fine-tuned for specific tasks.

For the experiments, we used NLI tasks, which are primarily text classification problems designed to verify the models’ basic language understanding and reasoning abilities by determining logical relationships between sentences. NLI-related tasks from the GLUE [15] and SuperGLUE [28] benchmarks were used. The specific tasks are listed in Table 1.

4.2. Experimental Design

In this study, we conducted a series of experiments to evaluate the efficacy of fine-tuning strategies that selectively utilize specific layers of neural networks. Our experimental design focused on various freezing techniques, enabling a comparative analysis of performance variations resulting from each approach. The primary objective of these experiments was to conduct a comprehensive assessment of model performance, memory utilization, and training efficiency to determine the optimal freezing methodology.

Finding the Optimal Freezing Ratio: First, we conducted experiments to determine at which ratio of frozen layers the model exhibits the highest performance. To achieve this, we froze a certain proportion of the model’s layers and trained only the remaining layers. For the INT strategy, we applied freezing ratios of 25%, 33.3%, and 50%, while for the BOT, TOP, ADT-L, and ADT-H strategies, we used ratios of 25%, 50%, and 75%. Through this approach, we aimed to identify the threshold at which performance sharply declines when more than a certain proportion of layers are frozen, thereby deriving the optimal freezing ratio.

Finding the Optimal Freezing Position: Next, we analyzed which positions within the model’s layers have the most significant impact on performance when frozen. The freezing positions were categorized as BOT (Bottom Up; freezing layers sequentially starting from the lower layers), TOP (Top Down; freezing layers sequentially starting from the upper layers), and INT (Interval; freezing layers at regular intervals). By comparing the performance for each freezing position, we examined which layers play more crucial roles. Additionally, we tracked the weight changes during the training process and adaptively froze layers to understand the significance of the weight changes.

Finding the Optimal Freezing Strategy: Finally, we aimed to identify the optimal freezing strategy by comprehensively considering the performance, training speed, and memory usage of each strategy. As it is challenging to fairly and objectively quantify these diverse aspects into a single metric, we first evaluated each aspect quantitatively and then conducted a comprehensive analysis.

4.3. Hyper-Parameter Setting

For each strategy, we conducted training using five different random seeds: 42, 43, 44, 45, and 46, and measured the average performance. We fixed the batch size at 32 and max length at 128 and trained the model for a total of 100 steps for each experiment. The learning rate is

5 \times e^{- 5}

, and we used a cosine learning rate scheduler. For the LoRA implementation in our experiments, we set the rank (r) to 8 and the alpha parameter to 32. A dropout rate of

0.1

was applied, and no quantization was performed. All other hyper-parameters remained consistent across experiments.

4.4. Performance Comparison by Freezing Strategy

Figure 3 shows the average performance of the three models on the QNLI task, measured five times using different random seeds for various freezing strategies. The error bars represent the

μ \pm σ

(mean ± standard deviation). The gray bars represent the experimental results for the baseline approaches: full fine-tuning ALL and LoRA. The red dashed line indicates the mean performance when all layers are fine-tuned. The blue dashed line represents the mean performance minus one standard deviation, while the green dashed line shows the mean performance plus one standard deviation when all layers are fine-tuned.

In most cases, the INT25, BOT25, and TOP25 strategies showed superior performance compared with the ALL and LoRA strategy. Notably, the Phi-2 model achieved even higher performance, especially with the freezing strategies. The BOT strategy consistently demonstrated excellent and solid performance across all models and most tasks. Furthermore, the INT strategy generally showed lower standard deviation compared with ALL, indicating more stable learning. Contrary to expectations, the Adapted Freezing strategy did not show performance merits. Furthermore, freezing layers with either high or low weight changes did not yield significant results.

Total experimental results can be found in Appendix A.

4.5. Training Efficiency Comparison by Freezing Strategy

Figure 4 shows the average performance and training time measured by training the Gemma model on the MNLI task for 10 steps using five different random seeds. The size of the circles visually represents the proportion of layers that were frozen. From this, it can be observed that the INT, BOT, and TOP strategies learn faster than the ALL strategy. In particular, the BOT strategy generally showed very fast learning speed, comparable to the TOP and INT strategies. Moreover, when compared with LoRA, the BOT strategies showed comparable or superior learning speed. ADT-L and ADT-H, on the other hand, exhibited relatively slower speeds due to an additional fixed time of about 13 s required for selecting the initial layers to freeze. With sufficiently long training times, this overhead becomes negligible, and their speed is expected to be comparable to that of the INT and TOP strategies. While Figure 4 shows Gemma’s efficiency results to avoid redundant visualizations, we conducted identical efficiency measurements for Phi-2 and MiniCPM and observed consistent trends. Additional experimental results can be found in Appendix A.

Figure 5 shows the GPU memory usage of the INT and BOT strategies compared with ALL. Excluding the memory used by the fixed model, when comparing only the training memory, the INT strategy used approximately 7–13% less memory and the BOT strategy used approximately 12–25% less. In the case of BOT, as the freeze ratio increased, the GPU memory usage dramatically decreased by about 20% at each ratio. Additionally, BOT25 demonstrated a level of memory reduction comparable to that of LoRA. The smaller reduction in INT is presumed to be due to inefficient computation caused by a lack of optimization during CUDA operations, depending on the location of the frozen layers.

4.6. Best Freezing Strategy

Ranking Score = Mean (reversed rank): To determine the optimal strategy, we assigned scores based on the performance rankings from our experiments. In this paper, a total of 17 experiments were conducted using ALL and LoRA strategy and 15 freezing strategies. Accordingly, for each task, the strategy that achieved the highest average performance was given 16 points, while the strategy with the lowest average performance received 1 point. This scoring was performed for each of the 15 experiments (3 models × 5 tasks), and the final average score was calculated.

Figure 6 shows the average scores measured in this manner, and it can be observed that BOT25 and TOP25 achieved the highest scores. Notably, the INT strategy outperformed the ALL strategy across all ratios. This confirms that strategies INT25, INT33, INT50, BOT25, BOT50, and TOP25 can be used as alternatives to the ALL strategy. Considering memory usage and training speed, the BOT25 and BOT50 strategies are judged to be the most effective.

5. Discussion

Our experimental findings indicate that partial layer freezing, a concept traditionally used for transfer learning, remains highly relevant and effective in the context of modern LLMs. By revisiting this idea through a contemporary lens, we demonstrate that selectively freezing transformer layers can achieve a favorable trade-off between computational efficiency and task performance. This observation aligns with theoretical insights from representational learning, where lower layers typically capture general syntactic or lexical features, while higher layers encode task-specific semantics.

Enhanced Performance: Specifically, we found that freezing the bottom 25% or 50% of transformer layers during fine-tuning not only maintained high performance but also often exceeded the results of full model fine-tuning and LoRA. This approach led to a substantial reduction in memory usage, approximately 30% and 50%, respectively, without compromising model effectiveness. Notably, the training speed increased by 20–30%, which can be attributed to the reduced computational load. We posit that this phenomenon may be attributed to the model’s capacity being disproportionately large relative to the complexity of the NLI task. This aligns with observations in techniques like LoRA, where freezing the majority of the model and training only a small number of additional parameters can lead to performance improvements.

Memory Reduction: The reduction in memory usage observed with our partial fine-tuning approach is logically consistent with the decreased number of trainable parameters. However, we noticed that interval freezing strategies, where layers are frozen in a distributed pattern throughout the model, did not yield significant memory savings compared with contiguous bottom-layer freezing. This suggests that contiguous freezing of layers is more beneficial for memory optimization, though we acknowledge that the underlying mechanisms require more rigorous investigation beyond our current empirical observations.

Learning Speed Improvements: While the speed improvements did not completely match the reduction in memory usage, the observed 20–30% increase in training speed is nonetheless significant. We attribute this to the substantial computational overhead inherent in processing large language models.

Moreover, our method does not require architectural modifications or additional trainable parameters, which stands in contrast to popular PEFT methods such as LoRA. Although LoRA-based methods are effective, they introduce additional components that require careful configuration and tuning. Our findings highlight the possibility of combining the strengths of both approaches, for instance, by applying LoRA to the unfrozen layers while freezing others.

This simplicity and adaptability make our method particularly suitable for constrained environments, such as on-device fine-tuning or edge applications. Future work could explore adaptive freezing strategies guided by gradient norms or weight change statistics, potentially in conjunction with other PEFT methods.

6. Conclusions

This study revisited the concept of layer freezing and demonstrated its effectiveness for sub-3 billion parameter language models on natural language inference (NLI) tasks. By freezing specific subsets of transformer layers, particularly the bottom 25% or 50% of layers, we achieved up to 50% memory savings and 20–30% faster training while maintaining or improving performance compared with full model fine-tuning and LoRA in our experimental setting.

Our approach offers practical advantages through its simplicity: it requires no architectural modifications, additional parameters, or complex selection mechanisms. The method can be readily applied to any transformer architecture, though effectiveness is demonstrated specifically within our experimental scope of sub-3B models and NLI tasks.

While we explored adaptive freezing strategies based on weight dynamics, these approaches proved less effective than simple positional heuristics in our experiments, reinforcing that straightforward strategies can be surprisingly effective for small language models in resource-constrained environments.

Our findings provide empirical evidence for the effectiveness of positional freezing strategies within clearly defined boundaries. Future work should validate applicability across larger models and diverse NLP tasks, as we make no claims about universal effectiveness beyond our experimental scope. We recommend that practitioners validate this approach for their specific use cases rather than assuming broad generalizability.

7. Limitation

This study demonstrated the efficiency of fine-tuning only certain layers in LLMs. However, the following limitations exist:

Experiments Limited to Small LLMs: This study primarily conducted experiments on small-scale LLMs with 3 billion parameters or fewer, such as Gemma, Phi-2, and MiniCPM. This choice was due to the experimental conditions set to enable training in a single GPU environment. Therefore, further research is needed to determine whether the proposed methodology demonstrates similar performance improvements and efficiency in extremely large models (such as PaLM and LLaMA). In extremely large models, memory requirements or training patterns may differ, necessitating experiments on these models to expand the scope of this research.

Dataset Limited to NLI Tasks: This study focused on Natural Language Inference (NLI) tasks, conducting experiments only on NLI-related datasets (such as RTE and CB) from GLUE and SuperGLUE benchmarks. While NLI tasks are specialized in evaluating a model’s ability to infer logical relationships, other types of tasks (e.g., text generation, question answering, translation, etc.) may have different model characteristics and learning requirements. Therefore, further experiments on diverse tasks and datasets are necessary to assess the effectiveness of the proposed fine-tuning method across a broader range of natural language processing tasks.

Heuristic-based Strategies: Our layer selection strategies were static and preconfigured. Although we also explored adaptation based on early training behavior, this method relies on heuristics and may not generalize well across tasks or architectures. More principled approaches, such as learning-to-freeze mechanisms or data-driven selection policies, are worth investigating in future work.

Limited Mechanistic Understanding: Our analysis remains primarily empirical. While we observe the effectiveness of bottom-layer freezing, we acknowledge that a deeper mechanistic understanding of why these strategies work requires more sophisticated analysis techniques such as representational probing and gradient flow analysis.

Potential Reduced Model Plasticity: Layer freezing inherently reduces model adaptability, which may be unsuitable for tasks requiring full representational capacity. Our approach may not be optimal for scenarios where maximum model flexibility is crucial.

To address these limitations, future work should validate the generalizability and efficiency of the proposed methodology across a range of larger LLMs and diverse tasks.

Author Contributions

Conceptualization, T.H.; Methodology, T.H.; Software, T.H.; Validation, T.H. and H.S.; Investigation, T.H.; Resources, T.H. and J.J.; Data Curation, T.H. and H.S.; Writing—Original Draft Preparation, T.H.; Writing—Review and Editing, H.S. and J.J.; Visualization, H.S.; Supervision, S.J.; Project Administration, S.J.; Funding Acquisition, S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00155857), Artificial Intelligence Convergence Innovation Human Resources Development (Chungnam National University), the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2025-0055621731482092640101), and the research fund of Chungnam National University.

Data Availability Statement

The datasets used in this study include GLUE and SuperGLUE. The GLUE benchmark is publicly available at https://gluebenchmark.com/tasks (accessed on 23 September 2025). The SuperGLUE benchmark is publicly available at https://super.gluebenchmark.com/tasks (accessed on 23 September 2025). The source code used in this study will be made available on https://github.com/GGoMaAI/SelectiveLayerFreezing (accessed on 23 September 2025) upon publication.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, GPT-4o) for the purpose of English editing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

Author Sangkeun Jung was employed by the company EurekaAI. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

LLM	Large Language Model
LoRA	Low-Rank Adaptation
NLI	Natural Language Inference
NLP	Natural Language Processing
HFT	Half Fine-Tuning
INT	Interval
BOT	Bottom Up
TOP	Top Down
ADT-L	Adapted Low
ADT-H	Adapted High

Appendix A. Model Performance

This section presents an analysis of model performance, focusing on the effects of freezing methods and dataset selection. Figure A1 and Figure A2 show the performance and learning speed of the Gemma model on the RTE, CB, QNLI, and WNLI datasets. Figure A3 shows the Gemma model performance on the RTE, CB, WNLI, and MNLI datasets. Figure A4 and Figure A5 show each Phi-2 and MiniCPM model, respectively, on the same datasets.

Figure A1. Learning speed and performance of Gemma on the RTE and MNIL tasks.

Figure A2. Learning speed and performance of Gemma on the QNLI and WNLI tasks.

Figure A3. Accuracy performance of Gemma. The same colors indicate the same strategy. The gray hatched bar represents the baseline performance achieved by fine-tuning all layers of the model. The red dashed line represents the average performance of fine-tuning across all layers, the blue dashed line represents the mean performance minus one standard deviation, while the green dashed line shows the mean performance plus one standard deviation when all layers are fine-tuned.

Figure A4. Accuracy performance of Phi. The same colors indicate the same strategy. The gray hatched bar represents the baseline performance achieved by fine-tuning all layers of the model. The red dashed line represents the average performance of fine-tuning across all layers, the blue dashed line represents the mean performance minus one standard deviation, while the green dashed line shows the mean performance plus one standard deviation when all layers are fine-tuned.

Figure A5. Accuracy performance of MiniCPM. The same colors indicate the same strategy. The gray hatched bar represents the baseline performance achieved by fine-tuning all layers of the model. The red dashed line represents the average performance of fine-tuning across all layers, the blue dashed line represents the mean performance minus one standard deviation, while the green dashed line shows the mean performance plus one standard deviation when all layers are fine-tuned.

References

Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems 33, Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 11324–11436. [Google Scholar]
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
Xie, S.M.; Raghunathan, A.; Liang, P.; Ma, T. An Explanation of In-context Learning as Implicit Bayesian Inference. In Proceedings of the International Conference on Learning Representations, Virtual, 25 April 2022. [Google Scholar]
Hendrycks, D.; Liu, X.; Wallace, E.; Dziedzic, A.; Krishnan, R.; Song, D. Pretrained Transformers Improve Out-of-Distribution Robustness. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2744–2751. [Google Scholar] [CrossRef]
Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8342–8360. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Volume 1 (Long and Short Papers), pp. 4171–4186. [CrossRef]
Ben Zaken, E.; Goldberg, Y.; Ravfogel, S. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Volume 2: Short Papers, pp. 1–9. [CrossRef]
Tang, H.; Chen, J.; Zhang, W.; Guo, Z. Training Acceleration Method Based on Parameter Freezing. Electronics 2024, 13, 2140. [Google Scholar] [CrossRef]
Bowman, S.R.; Angeli, G.; Potts, C.; Manning, C.D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015. [Google Scholar]
Hu, E.J.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
Dagan, I.; Glickman, O.; Magnini, B. The pascal recognising textual entailment challenge. In Proceedings of the Machine Learning Challenges Workshop, Southampton, UK, 11–13 April 2003; Springer: Berlin/Heidelberg, Germany, 2005; pp. 177–190. [Google Scholar]
Poliak, A.; Naradowsky, J.; Haldar, A.; Rudinger, R.; Van Durme, B. Hypothesis Only Baselines in Natural Language Inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, New Orleans, LA, USA, 5–6 June 2018; pp. 180–191. [Google Scholar]
Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 31 October–4 November 2018; Linzen, T., Chrupala, G., Alishahi, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 353–355. [Google Scholar] [CrossRef]
Howard, J.; Ruder, S. Universal language model fine-tuning for text classification. arXiv 2018, arXiv:1801.06146. [Google Scholar] [CrossRef]
Peters, M.E.; Ruder, S.; Smith, N.A. To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Florence, Italy, 2 August 2019; Augenstein, I., Gella, S., Ruder, S., Kann, K., Can, B., Welbl, J., Conneau, A., Ren, X., Rei, M., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 7–14. [Google Scholar] [CrossRef]
Dettmers, T.; Lewis, M.; Belkada, Y.; Zettlemoyer, L. Gpt3.int8 (): 8-bit Matrix Multiplication for Transformers at Scale. Adv. Neural Inf. Process. Syst. 2022, 35, 30318–30332. [Google Scholar]
Hui, T.; Zhang, Z.; Wang, S.; Xu, W.; Sun, Y.; Wu, H. HFT: Half Fine-Tuning for Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 27 July–1 August 2025; Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Volume 1: Long Papers, pp. 12791–12819. [CrossRef]
Son, H.; Son, Y.; Kim, C.; Kim, Y.G. Not All Adapters Matter: Selective Adapter Freezing for Memory-Efficient Fine-Tuning of Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, Albuquerque, NM, USA, 29 April–4 May 2025; Chiruzzo, L., Ritter, A., Wang, L., Eds.; Volume 1: Long Papers, pp. 9479–9496. [CrossRef]
Zhang, Q.; Chen, M.; Bukharin, A.; He, P.; Cheng, Y.; Chen, W.; Zhao, T. Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Adv. Neural Inf. Process. Syst. 2023, 36, 10088–10115. [Google Scholar]
Schotthöfer, S.; Zangrando, E.; Ceruti, G.; Tudisco, F.; Kusch, J. GeoLoRA: Geometric integration for parameter efficient fine-tuning. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24 April 2025. [Google Scholar]
Rogers, A.; Kovaleva, O.; Rumshisky, A. A Primer in BERTology: What We Know About How BERT Works. Trans. Assoc. Comput. Linguist. 2020, 8, 842–866. [Google Scholar] [CrossRef]
Team, G.; Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M.S.; Love, J.; et al. Gemma: Open models based on gemini research and technology. arXiv 2024, arXiv:2403.08295. [Google Scholar] [CrossRef]
Javaheripi, M.; Bubeck, S.; Abdin, M.; Aneja, J.; Bubeck, S.; Mendes, C.C.T.; Chen, W.; Del Giorno, A.; Eldan, R.; Gopi, S.; et al. Phi-2: The surprising power of small language models. Microsoft Res. Blog 2023, 1, 3. [Google Scholar]
Hu, S.; Tu, Y.; Han, X.; He, C.; Cui, G.; Long, X.; Zheng, Z.; Fang, Y.; Huang, Y.; Zhao, W.; et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv 2024, arXiv:2404.06395. [Google Scholar] [CrossRef]
Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, QC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019. [Google Scholar]

Figure 1. Proposed freezing strategies. (a) is the conventional fine-tuning method, while (b,c) are the proposed strategies. In (b), predetermined layers are frozen according to preset configurations before training. In (c), after initial training, layers to freeze are selected based on the amount of weight change.

Figure 2. Proposed detailed freezing strategies, with 50% freezing for each strategy.

Figure 3. Accuracy performance on the QNLI task. The same colors indicate the same strategy. The gray hatched bar represents the baseline performance achieved by fine-tuning all layers of the model. The red dashed line represents the average performance of fine-tuning across all layers, the blue dashed line represents the mean performance minus one standard deviation, while the green dashed line shows the mean performance plus one standard deviation when all layers are fine-tuned.

Figure 4. Learning speed and CB performance of the Gemma model for different freezing strategies. The size of the semicircles represents the relative proportion of unfrozen layers used in training. The x-axis indicates the time (in seconds) taken for 10 steps of training, and the y-axis represents the performance after 100 steps of training. The TOP strategy is depicted with green circles, BOT with blue, INT with red, ADT-L with yellow, and ADT-H with orange circles.

Figure 5. GPU training memory usage for each freezing strategy. The blue color represents the memory statically used by the model, while the red color indicates the memory utilized during the training process. The numbers above each bar indicate the percentage reduction in GPU memory usage compared with the ALL strategy.

Figure 6. Overall evaluation based on ranking score by strategy. An overall evaluation was conducted by measuring the average ranking scores based on the reversed rankings assigned across all models, datasets, and experimental settings used in the experiments. The red dashed line represents the score of the ALL strategy.

Table 1. Number of samples in each dataset; the number in parentheses is the total number of data, but we used only a maximum of 3200 randomly selected data.

Dataset	Task	Number of Samples	Usage Rate
GLUE	RTE	2500	100%
GLUE	QNLI	3200 (105,000)	3.0%
GLUE	WNLI	635	100%
GLUE	MNLI	3200 (392,000)	0.8%
SuperGLUE	CB	250	100%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hwang, T.; Seo, H.; Jung, J.; Jung, S. Exploring Selective Layer Freezing Strategies in Transformer Fine-Tuning: NLI Classifiers with Sub-3B Parameter Models. Appl. Sci. 2025, 15, 10434. https://doi.org/10.3390/app151910434

AMA Style

Hwang T, Seo H, Jung J, Jung S. Exploring Selective Layer Freezing Strategies in Transformer Fine-Tuning: NLI Classifiers with Sub-3B Parameter Models. Applied Sciences. 2025; 15(19):10434. https://doi.org/10.3390/app151910434

Chicago/Turabian Style

Hwang, Taewook, Hyein Seo, Jeesu Jung, and Sangkeun Jung. 2025. "Exploring Selective Layer Freezing Strategies in Transformer Fine-Tuning: NLI Classifiers with Sub-3B Parameter Models" Applied Sciences 15, no. 19: 10434. https://doi.org/10.3390/app151910434

APA Style

Hwang, T., Seo, H., Jung, J., & Jung, S. (2025). Exploring Selective Layer Freezing Strategies in Transformer Fine-Tuning: NLI Classifiers with Sub-3B Parameter Models. Applied Sciences, 15(19), 10434. https://doi.org/10.3390/app151910434

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Exploring Selective Layer Freezing Strategies in Transformer Fine-Tuning: NLI Classifiers with Sub-3B Parameter Models

Abstract

1. Introduction

2. Related Work

2.1. Natural Language Inference

2.2. General Fine-Tuning Approaches

2.3. Parameter-Efficient Fine-Tuning Approaches

2.3.1. Layer Freezing

2.3.2. Low-Rank Adaptation

3. Layer Selection for Fine-Tuning

3.1. Fixed Freezing

3.2. Adapted Freezing

3.3. Freezing Strategies

4. Experiments

4.1. Models and Datasets

4.2. Experimental Design

4.3. Hyper-Parameter Setting

4.4. Performance Comparison by Freezing Strategy

4.5. Training Efficiency Comparison by Freezing Strategy

4.6. Best Freezing Strategy

5. Discussion

6. Conclusions

7. Limitation

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

Appendix A. Model Performance

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI