Author Contributions
Conceptualization, A.D., P.S., R.V.B. and N.S.; methodology, A.D.; software, A.D.; validation, A.D., P.S., R.V.B. and N.S.; formal analysis, A.D.; investigation, A.D.; data curation, A.D.; writing—original draft preparation, A.D.; writing—review and editing, A.D., P.S. and R.V.B.; supervision, P.S., R.V.B. and N.S.; project administration, N.S. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Act-LoRA: activation-guided layer selection for Low-Rank Adaptation.
Figure 2.
Probe placement in encoder or decoder blocks for collecting activation norms.
Figure 3.
Selective LoRA schematic diagram. (Left): Base model, (Center): LoRA adapters applied uniformly to all layers, (Right): LoRA adapters applied to selected layers guided by activation norms.
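The procedure depicted in Figures 1–3 (probe activation norms per layer, rank layers by importance, attach LoRA adapters only to the top-k) can be illustrated with a minimal sketch; the function names and toy numbers below are ours, not the paper's implementation.

```python
# Minimal sketch of activation-norm-guided layer selection, as depicted in
# Figures 1-3. Function names and numbers are illustrative, not the paper's code.

def importance_from_activation_norms(norms_per_batch):
    """Average per-layer activation norms over the probe batches."""
    n_layers = len(norms_per_batch[0])
    n_batches = len(norms_per_batch)
    return [sum(batch[i] for batch in norms_per_batch) / n_batches
            for i in range(n_layers)]

def select_top_k_layers(importance, k):
    """Return indices of the k layers with the largest importance scores."""
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])  # adapters attach to these layers only

# Toy probe data: 3 batches of activation norms for a 6-layer model.
norms = [
    [0.9, 2.1, 1.4, 3.0, 0.7, 1.1],
    [1.0, 2.0, 1.5, 2.8, 0.8, 1.2],
    [0.8, 2.2, 1.3, 3.2, 0.6, 1.0],
]
scores = importance_from_activation_norms(norms)
print(select_top_k_layers(scores, k=2))  # → [1, 3]
```

With k = 2, layers 1 and 3 (the highest mean norms) are selected; all other layers remain frozen with no adapters.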
Figure 4.
Phase-wise GPU time and memory comparison with SOTA methods. (Left): Per-step training time decomposed into forward, backward, and optimizer phases. (Right): GPU memory usage measured after forward, backward (peak), and optimizer phases.
Figure 5.
Accuracy and loss comparison with PEFT methods. (Left): Accuracy vs. trainable parameters count (log scale). (Right): Loss vs. trainable parameters count (log scale).
Figure 6.
Phase-wise GPU time and memory scaling for top-k LoRA layer adaptation with activation norm-based importance score. (Left): Per-step training time decomposed into forward (base model + adapter), backward, and optimizer phases. (Right): GPU memory usage measured after forward, backward (peak), and optimizer phases.
Figure 7.
Accuracy and loss sweep study.
Figure 8.
BERT layer importance visualization.
Figure 9.
Cumulative layer importance threshold mapping for BERT.
Figure 10.
Comparison of activation norm and gradient norm-based layer importance for DeBERTaV3-Base and Llama-3.1-8B across 10 seeds. (Top): DeBERTaV3-Base. (Bottom): Llama-3.1-8B.
Figure 11.
Comparison of layer importance for DeBERTaV3-Base and Llama-3.1-8B across GLUE tasks. (Top): DeBERTaV3-Base. (Bottom): Llama-3.1-8B.
Figure 12.
Comparison of layer importance for Q and V projections using various metrics.
Table 1.
Overview of literature review.
| Aspect | Concept | Citation |
|---|---|---|
| PEFT | IA3 | [7] |
| | Prefix Tuning | [9] |
| | Prompt Tuning | [8] |
| | Adapter Modules | [5] |
| LoRA | Low-Rank Adaptation | [6] |
| | Adaptive rank allocation (AdaLoRA) | [13,21,22] |
| | Dynamic rank slicing (DyLoRA) | [23,24] |
| | Ablation-based rank importance (ALoRA) | [25] |
| | Meta-learned rank selection (AutoLoRA) | [26] |
| | LoRA-Drop | [27] |
| | Memory-efficient fine-tuning (QLoRA, GaLore, LoRA-FA) | [28,29,30] |
| | Structural adapter variants (Delta-LoRA, DoRA, VeRA, LoRA+, X-LoRA, Dynamic LoRA, NoRA, LoHa, LoKr) | [14,31,32,33,34,35,36,37,38] |
| Layer Specialization & Internal Substructure | Uniform Information Distribution in Layers | [20,39] |
| | Layer Task Specialization | [10,11,40,41] |
| | Layer Task Contribution Patterns | [12,42,43] |
| Layer Importance Metrics | Second-order (Hessian-based) methods | [44,45] |
| | Fisher-based sensitivity | [46] |
| | Gradient/Weight/Loss-based sensitivity metrics | [13,14,39] |
| | Activation-based importance | [10,20,36,47] |
Table 2.
Evaluation metrics used to compare accuracy retention, training efficiency, inference performance, and memory footprint. All metrics are reported relative to the LoRA baseline within each model family.
| Metric | Type | Formula | Description |
|---|---|---|---|
| GLUE Score | Task Quality | GLUE | Unweighted mean of task-specific GLUE metrics. |
| ΔGLUE (%) | Task Quality | | Relative change in downstream task performance with respect to LoRA. |
| Trainable Parameters (%↓) | Training Time | | Reduction in the number of trainable parameters. |
| GPU Hours (%↓) | Training Time | | Percentage reduction in total training compute. |
| Peak Memory (%↓) | Training Time | | Percentage reduction in peak GPU memory usage during training and inference. |
| Latency (%↓) | Inference Time | | Percentage reduction in mean per-example inference latency (ms). |
| Throughput (%↑) | Inference Time | | Percentage increase in inference throughput (samples/s). |
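The relative metrics in Table 2 are plain percentage changes against the LoRA baseline within the same model family; a short sketch (helper names are ours) reproduces two of them from Table 4's raw DeBERTaV3-Base numbers.

```python
# Sketch of Table 2's relative metrics (helper names are ours). Each metric is
# a percentage change against the LoRA baseline within the same model family.

def pct_reduction(baseline, value):
    """%↓ metric: positive when `value` is lower than the LoRA baseline."""
    return 100.0 * (baseline - value) / baseline

def pct_increase(baseline, value):
    """%↑ metric: positive when `value` is higher than the LoRA baseline."""
    return 100.0 * (value - baseline) / baseline

# Raw DeBERTaV3-Base numbers from Table 4: LoRA vs. ActLoRA (k = 2).
print(round(pct_reduction(296450, 50690), 1))  # trainable params %↓ → 82.9
print(round(pct_increase(34.84, 41.53), 1))    # throughput %↑ → 19.2
```

These values match the +82.9 and +19.2 entries in Table 5's DeBERTa block.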
Table 3.
GLUE performance across all datasets for different LoRA variants on DeBERTa-V3-Base and LLaMA-3.1-8B. We report Matthews’ correlation coefficient for CoLA, Pearson’s correlation for STS-B, and accuracy for all other tasks. LoRA (r = 8), ActLoRA (r = 8), AdaLoRA (target_r = 8, init_r = 12).
| Model | Method | # Params | MNLI | QNLI | QQP | SST-2 | MRPC | RTE | CoLA | STS-B | WNLI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeBERTa-V3-Base | LoRA | 300 K | | | | | | | | | |
| | AdaLoRA | 445 K | | | | | | | | | |
| | Act-LoRA (k = 2) | 50 K | | | | | | | | | |
| | Act-LoRA (k = 6) | 150 K | | | | | | | | | |
| LLaMA-3.1-8B | LoRA | 3.4 M | | | | | | | | | |
| | AdaLoRA | 5.1 M | | | | | | | | | |
| | Act-LoRA (k = 16) | 1.7 M | | | | | | | | | |
| | Act-LoRA (k = 24) | 2.5 M | | | | | | | | | |
Table 4.
Aggregate training and inference efficiency comparison across LLaMA-3.1-8B and DeBERTaV3-Base variants. Training GPUh (GPUh), Training GPU Memory Allocation (GiB) (Mem), Time to First Token (ms) (TTFT), Trainable Params Count (Params), Samples Per Sec (Throughput), Inference Latency (ms) (Latency). LoRA (r = 8), ActLoRA (r = 8), AdaLoRA (target_r = 8, init_r = 12).
| Model | Method | GLUE | GPUh | Mem | TTFT | Params | Throughput | Latency |
|---|---|---|---|---|---|---|---|---|
| LLaMA-3.1-8B | LoRA | 77.59 | 3.100 | 15.98 | 100.23 | 3,416,576 | 10.21 | 748.28 |
| | AdaLoRA | 68.32 | 3.082 | 51.47 | 105.12 | 5,121,280 | 9.55 | 737.68 |
| | ActLoRA (k = 16) | 78.61 | 1.918 | 15.83 | 98.75 | 1,712,128 | 10.47 | 701.41 |
| | ActLoRA (k = 24) | 78.96 | 3.036 | 16.40 | 98.95 | 2,564,608 | 10.18 | 739.76 |
| DeBERTaV3-Base | LoRA | 80.42 | 6.657 | 4.87 | 35.84 | 296,450 | 34.84 | 28.02 |
| | AdaLoRA | 70.07 | 8.725 | 4.59 | 46.96 | 444,194 | 32.81 | 29.86 |
| | ActLoRA (k = 2) | 77.09 | 3.999 | 4.74 | 26.26 | 50,690 | 41.53 | 23.40 |
| | ActLoRA (k = 6) | 80.21 | 5.210 | 4.78 | 28.18 | 148,994 | 38.62 | 25.21 |
Table 5.
Trade-off analysis comparing GLUE score, relative GLUE change (ΔGLUE %), GPU hour savings (GPUh %↓), latency reduction (Latency %↓), throughput improvement (Throughput %↑), peak memory reduction (Peak Mem %↓), and trainable parameter reduction (Params %↓), all relative to LoRA. DeB results correspond to DeBERTaV3-Base on a single L4 GPU; LL results correspond to LLaMA-3.1-8B on four A100 GPUs.
| Model | Method | GLUE | ΔGLUE (%) | GPUh (%↓) | Latency (%↓) | Throughput (%↑) | Peak Mem (%↓) | Params (%↓) |
|---|---|---|---|---|---|---|---|---|
| DeB | LoRA (r = 8) | 0.801 | – | – | – | – | – | – |
| | AdaLoRA (target_r = 8, init_r = 12) | 0.670 | | | | | +5.86 | |
| | Act-LoRA (k = 6) | 0.792 | | | | | | |
| | Act-LoRA (k = 2) | 0.772 | | +40.0 | +7.8 | +19.2 | | +82.9 |
| LL | LoRA (r = 8) | 0.780 | – | – | – | – | – | – |
| | AdaLoRA (target_r = 8, init_r = 12) | 0.689 | | | | | | |
| | Act-LoRA (k = 24) | 0.779 | | | | | | |
| | Act-LoRA (k = 16) | 0.755 | | +7.7 | +5.0 | +2.75 | +0.28 | +49.9 |
Table 6.
Comparison of fine-tuning methods on SST-2 (1000 steps, 3 seeds). Accuracy (Acc) and Loss are reported as mean ± standard deviation in percentage format. GPU hours (GPUh) are derived from phase-level timing totals. Peak allocated memory (Mem) corresponds to maximum CUDA allocation (GiB) observed during training. Trainable parameters (Params) are reported as absolute count with percentage of total model parameters in parentheses. ↑: Higher is better, ↓: Lower is better.
| Method | Acc (%↑) | Loss (%↓) | GPUh | Mem | Params (%) |
|---|---|---|---|---|---|
| Full FT | | | | | |
| IA3 | | | | | |
| VeRA | | | | | |
| LoKr | | | | | |
| LoHA | | | | | |
| LoRA | | | | | |
| AdaLoRA | | | | | |
| ActLoRA (Ours) | | | | | |
Table 7.
Phase-wise GPU time and memory breakdown for top-k LoRA adaptation. Peak memory consistently occurs after the backward pass. Compute scales approximately linearly with k, while memory growth remains modest.
| Top-k | Forward (ms) | Backward (ms) | Optimizer (ms) | Total (ms) | Mem After Forward (MiB) | Mem After Backward, Peak (MiB) | Mem After Step (MiB) |
|---|---|---|---|---|---|---|---|
| 2 | 20.65 | 11.72 | 1.37 | 33.74 | 780 | 919.3 | 800 |
| 4 | 21.95 | 13.26 | 1.49 | 36.70 | 790 | 932.0 | 810 |
| 6 | 23.57 | 14.99 | 1.69 | 40.25 | 800 | 940.0 | 820 |
| 8 | 24.81 | 17.17 | 1.65 | 43.62 | 845 | 992.0 | 860 |
| 10 | 26.27 | 18.95 | 1.83 | 47.05 | 860 | 998.0 | 875 |
| 12 | 28.58 | 20.44 | 2.21 | 51.23 | 875 | 1024.0 | 890 |
Table 8.
Cumulative importance threshold analysis for BERT layer-wise activation magnitude. The table shows the minimum number of top-ranked layers required to reach each cumulative importance level.
| Cumulative Importance Threshold | Top-k Layers Required |
|---|---|
| 25% | 2 |
| 50% | 4 |
| 75% | 6 |
| 100% | 11 |
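Table 8's mapping can be computed by ranking layers by importance and accumulating normalized scores until each threshold is met. The sketch below uses a made-up importance profile, not BERT's measured activation magnitudes.

```python
# Sketch of the Table 8 analysis: count how many top-ranked layers are needed
# to reach a cumulative importance threshold. The importance profile below is
# made up, not BERT's measured activation magnitudes.

def layers_for_threshold(importance, threshold):
    """Minimum number of top-ranked layers whose normalized importance
    sums to at least `threshold` (a fraction in [0, 1])."""
    total = sum(importance)
    running = 0.0
    for k, score in enumerate(sorted(importance, reverse=True), start=1):
        running += score / total
        if running >= threshold:
            return k
    return len(importance)  # guards against floating-point shortfall at 100%

imp = [5.0, 1.0, 4.0, 0.5, 3.0, 0.5, 2.0, 0.5, 1.5, 1.0, 1.0]
for t in (0.25, 0.50, 0.75, 1.00):
    print(f"{t:.0%}: top-{layers_for_threshold(imp, t)}")
```

In the paper's BERT measurements the analogous computation yields the 2/4/6/11 progression reported in Table 8.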
Table 9.
Mean and standard deviation of pairwise Kendall's τ rank correlation across 10 seeds.
| Model | Metric | Kendall's τ (Mean ± Std) |
|---|---|---|
| DeBERTaV3-Base | Activation Norms | |
| DeBERTaV3-Base | Gradient Norms | |
| LLaMA-3.1-8B | Activation Norms | |
| LLaMA-3.1-8B | Gradient Norms | |
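The seed-stability measure behind Table 9 can be sketched as follows: compute Kendall's τ between every pair of per-seed layer-importance vectors and average. This is our own tau-a implementation (no tie handling) on toy numbers, not the paper's evaluation code.

```python
from itertools import combinations

# Sketch of the Table 9 stability analysis: mean pairwise Kendall's tau across
# seeds. This is our tau-a implementation (assumes no tied scores); the
# per-seed layer-importance vectors below are toy numbers.

def kendall_tau(x, y):
    """Kendall tau-a between two score vectors."""
    pairs = list(combinations(range(len(x)), 2))
    s = sum(1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1 for i, j in pairs)
    return s / len(pairs)

def mean_pairwise_tau(score_matrix):
    """Mean Kendall tau over all pairs of seeds (rows)."""
    taus = [kendall_tau(a, b) for a, b in combinations(score_matrix, 2)]
    return sum(taus) / len(taus)

seeds = [  # 3 seeds x 5 layers
    [0.9, 0.4, 0.7, 0.2, 0.5],
    [0.8, 0.5, 0.6, 0.1, 0.4],
    [0.7, 0.3, 0.9, 0.2, 0.5],
]
print(round(mean_pairwise_tau(seeds), 3))  # → 0.733
```

A value near 1 indicates the layer ranking is stable across seeds; in practice `scipy.stats.kendalltau` would be used for tie-aware computation.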
Table 10.
Top-k Jaccard similarity across 10 seeds.
| k | DeBERTa Act | DeBERTa Grad | LLaMA Act | LLaMA Grad |
|---|---|---|---|---|
| 2 | | | | |
| 4 | | | | |
| 6 | | | | |
| 8 | | | | |
| 10 | | | | |
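Table 10's measure is the overlap of top-k layer sets across seeds. A minimal sketch (toy scores, our helper names):

```python
from itertools import combinations

# Sketch of the Table 10 analysis: Jaccard similarity of top-k layer sets
# across seeds. The per-seed scores are toy numbers (3 seeds x 5 layers).

def top_k_set(importance, k):
    """Set of indices of the k highest-scoring layers."""
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return set(ranked[:k])

def mean_topk_jaccard(score_matrix, k):
    """Mean Jaccard similarity |A ∩ B| / |A ∪ B| over all seed pairs."""
    sets = [top_k_set(row, k) for row in score_matrix]
    sims = [len(a & b) / len(a | b) for a, b in combinations(sets, 2)]
    return sum(sims) / len(sims)

seeds = [
    [0.9, 0.4, 0.7, 0.2, 0.5],
    [0.8, 0.5, 0.6, 0.1, 0.4],
    [0.7, 0.3, 0.9, 0.2, 0.5],
]
print(mean_topk_jaccard(seeds, k=2))  # identical top-2 sets → 1.0
print(round(mean_topk_jaccard(seeds, k=3), 3))  # → 0.667
```

A similarity of 1.0 means every seed selects the same top-k layers, i.e., the activation-norm ranking is reproducible.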
Table 11.
Correlation between importance metrics and ablation performance for Query (Q) and Value (V) layers using Pearson's r, Spearman's ρ, Kendall's τ, and cosine similarity (cos).
| Layer | Metric | r | ρ | τ | cos |
|---|---|---|---|---|---|
| Query | actnorm | 0.18 | 0.65 | 0.46 | 0.16 |
| | Taylor | 0.15 | 0.32 | 0.18 | 0.14 |
| | gradnorm | 0.15 | | 0.00 | 0.14 |
| | Fisher | | 0.00 | 0.092 | |
| | weightdelta | | 0.11 | 0.031 | |
| Value | actnorm | 0.34 | | | 0.28 |
| | Taylor | | 0.067 | 0.078 | |
| | gradnorm | | | | |
| | Fisher | | | | |
| | weightdelta | | | | |