Structure-Aware Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

Abstract: With the growing scale of pre-trained language models (PLMs), full parameter fine-tuning becomes prohibitively expensive and practically infeasible. Therefore, parameter-efficient adaptation techniques for PLMs have been proposed to learn through incremental updates of pre-trained weights, such as in low-rank adaptation (LoRA). However, LoRA relies on heuristics to select the modules and layers to which it is applied, and assigns them the same rank. As a consequence, any fine-tuning that ignores the structural information between modules and layers is suboptimal. In this work, we propose structure-aware low-rank adaptation (SaLoRA), which adaptively learns the intrinsic rank of each incremental matrix by removing rank-0 components during training. We conduct comprehensive experiments using pre-trained models of different scales in both task-oriented (GLUE) and task-agnostic (Yelp and GYAFC) settings. The experimental results show that SaLoRA effectively captures the structure-aware intrinsic rank. Moreover, our method consistently outperforms LoRA without significantly compromising training efficiency.


Introduction
With the scaling of model and corpus size [1][2][3][4][5], large language models (LLMs) have demonstrated an ability for in-context learning [1,6,7] in various natural language processing (NLP) tasks, that is, learning from a few examples within the context. Although in-context learning is now the prevalent paradigm for using LLMs, fine-tuning still outperforms it in task-specific settings. In such scenarios, a task-specific model is exclusively trained on a dataset comprising input-output examples specific to the target task. However, full parameter fine-tuning, which updates and stores all the parameters for different tasks, becomes impractical when dealing with large-scale models.
In fact, LLMs with billions of parameters can be effectively fine-tuned by optimizing only a few parameters [8][9][10]. This has given rise to a branch of parameter-efficient fine-tuning (PEFT) techniques [11][12][13][14][15][16] for model tuning. These techniques optimize a small fraction of the model parameters while keeping the rest fixed, thereby significantly reducing computational and storage costs. For example, LoRA [15] introduces trainable low-rank decomposition matrices into LLMs, enabling the model to adapt to a new task while preserving the integrity of the original LLMs and retaining the acquired knowledge. Fundamentally, this approach is built upon the assumption that updates to the weights of the pre-trained language model have a low rank during adaptation to specific downstream tasks [8,9]. Thus, by reducing the rank of the incremental matrices, LoRA optimizes less than 0.5% of the additional trainable parameters. Remarkably, this optimization achieves comparable or even superior performance to that of full parameter fine-tuning. However, despite its advantages, LoRA also comes with certain limitations. One limitation lies in LoRA's reliance on heuristics to select the modules and layers to which it is applied. Though heuristics can be effective under specific circumstances, their lack of generalizability is a concern and can result in suboptimal performance, or even complete failure, when applied to new data. Another limitation is the assignment of the same rank to incremental matrices across different modules and layers, which oversimplifies the complex structural relationships and importance disparities that exist within neural networks. This phenomenon is illustrated in Figure 1. In this paper, we propose a novel approach called structure-aware low-rank adaptation (SaLoRA), which adaptively learns the intrinsic rank of each incremental matrix by removing rank-0 components. As shown in Figure 2, we
introduce a diagonal gate matrix G = diag(g_1, . . ., g_r) for each incremental matrix. The modified incremental matrix can be represented as ∆W = BGA. The incremental matrix is divided into triplets, where each triplet T_i contains the i-th column of B, the i-th gate mask of G and the i-th row of A. Here, g_i represents the binary "gate" that indicates the presence or absence of the i-th triplet. Although incorporating the active triplet count directly as a penalty term in the learning objective is infeasible, we employ a differentiable relaxation of the L_0 norm [17,18] to selectively remove non-critical triplets. The L_0 norm is equal to the number of non-zero triplets, so penalizing it encourages the model to deactivate less essential triplets. This strategy assigns a higher rank to crucial incremental matrices to capture task-specific information; conversely, less significant matrices are pruned to a lower rank, preventing overfitting. However, A and B are not orthogonal, implying potential dependence among the triplets, and removing triplets can then result in a larger deviation from the original matrix. To enhance training stability and generalization, we introduce orthogonality regularization for B and A. Furthermore, we integrate a density constraint and leverage Lagrangian relaxation [19] to control the number of valid parameters.
We conduct extensive experiments on a wide range of tasks and models to evaluate the effectiveness of SaLoRA. Specifically, we conduct experiments on the General Language Understanding Evaluation (GLUE) [20] benchmark in a task-oriented setting to assess the model's performance. In addition, we evaluate the model's performance in a task-agnostic setting by fine-tuning LLaMA-7B with a 50K cleaned instruction-following dataset [21], and then performing zero-shot task inference on two text style transfer tasks: sentiment transfer [22] and formality transfer [23]. The experimental results demonstrate that SaLoRA consistently outperforms LoRA without significantly compromising training efficiency.

Background
Transformer Architecture. The Transformer [24] is primarily constructed using two key submodules: a multi-head self-attention (MHA) layer and a fully connected feedforward (FFN) layer. The MHA is defined as follows:

MHA(Q, K, V) = Concat(head_1, . . ., head_h) W^O, head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

where Q, K, V ∈ R^{n×d} are input-embedding matrices; W^O ∈ R^{d×d} is an output projection; W_i^Q, W_i^K, W_i^V ∈ R^{d×d_k} are the query, key and value projections of head i, respectively; n is the sequence length; d is the embedding dimension; h is the number of heads; and d_k = d/h is the hidden dimension of the projection subspaces. The FFN consists of two linear transformations separated by a ReLU activation:

FFN(x) = ReLU(x W_U + b_U) W_D + b_D,

where W_U ∈ R^{d×d_m} and W_D ∈ R^{d_m×d} are the up- and down-projection matrices.

Parameter-Efficient Fine-Tuning. With the growing size of models, recent works have developed three main categories of parameter-efficient fine-tuning (PEFT) techniques. These techniques optimize a small fraction of model parameters while keeping the rest fixed, thereby significantly reducing computational and storage costs [10]. For example, addition-based methods [11][12][13][25][26] introduce additional trainable modules or parameters that are not part of the original model or process. Specification-based methods [14,27,28] designate certain parameters within the original model or process as trainable, whereas the others remain frozen. Reparameterization-based methods [15,16,29], including LoRA, reparameterize existing parameters into a parameter-efficient form by transformation. In this study, we focus on reparameterization-based methods, with particular emphasis on LoRA.
Low-Rank Adaptation. LoRA, as introduced in the work of Hu et al. [15], represents a typical example of a reparameterization-based method. In LoRA, some pre-trained weights of the LLM's dense layers are reparameterized by injecting trainable low-rank incremental matrices. This reparameterization only allows the low-rank matrices to be updated, while keeping the original pre-trained weights frozen. By reducing the rank of these matrices, LoRA effectively reduces the number of parameters during the fine-tuning process of LLMs. Consider a pre-trained weight matrix W ∈ R^{d×k}, accompanied by a low-rank incremental matrix ∆W = BA. For h = W x, the modified forward pass is as follows:

h = W x + ∆W x = W x + (α/r) B A x,

where B ∈ R^{d×r}, A ∈ R^{r×k}, with the rank r ≪ min(d, k), and α is a constant scale hyperparameter. The matrix A adopts a random zero-mean Gaussian initialization, while the matrix B is initialized as a zero matrix. Consequently, the product ∆W = BA is initially zero at the beginning of training. Let B_{*j} and A_{j*} denote the j-th column of B and the j-th row of A, respectively. Using this notation, ∆W can be expressed as ∆W = Σ_{j=1}^{r} B_{*j} A_{j*}.
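The modified forward pass can be sketched in a few lines of NumPy. The dimensions below are illustrative, and the α/r scaling is an assumption carried over from the original LoRA implementation; the initialization follows the description above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 16, 12, 4, 8            # illustrative sizes; alpha is the LoRA scale

W = rng.normal(size=(d, k))              # frozen pre-trained weight
A = rng.normal(scale=0.02, size=(r, k))  # zero-mean Gaussian initialization
B = np.zeros((d, r))                     # zero initialization, so delta_W starts at 0

x = rng.normal(size=(k,))
h = W @ x + (alpha / r) * (B @ (A @ x))  # h = Wx + (alpha/r) BAx

# At initialization the adapter contributes nothing
assert np.allclose(h, W @ x)

# delta_W decomposes into rank-one terms B_{*j} A_{j*}
delta_W = B @ A
assert np.allclose(delta_W, sum(np.outer(B[:, j], A[j, :]) for j in range(r)))
```

Note that B @ (A @ x) is evaluated right to left, so the full d × k matrix ∆W never needs to be materialized during the forward pass.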

Method
In this section, we first formalize the problem of parameter-efficient fine-tuning and then present our proposed method.

Problem Formalization
We consider the general problem of efficiently fine-tuning LLMs for specific downstream tasks. Firstly, let us introduce some notation. Consider a training corpus D = {(x_i, y_i)}_{i=1}^{N}, where N represents the number of samples; each sample consists of an input x_i and its corresponding output y_i. We use the index i to refer to the incremental matrix, i.e., ∆W_i = B_i A_i for i = 1, . . ., K, where K is the number of incremental matrices. However, LoRA's assumption of identical ranks for each incremental matrix overlooks structural relationships and the varying importance of weight matrices across different modules and layers during fine-tuning. This oversight can potentially impact overall model performance. Our objective is to determine the optimal {rank*(∆W_i)}_{i=1}^{K} on the fly. The optimization objective can be formulated as follows:

min_W L(D; W),

where W = {∆W_1, . . ., ∆W_K} represents the set of trainable parameters and L corresponds to a loss function, such as cross-entropy for classification. Note that rank(∆W_i) ∈ {0, 1, . . ., r} is an unknown parameter that needs to be optimized.

Structure-Aware Intrinsic Rank Using L 0 Norm
To find the optimal {rank*(∆W_i)}_{i=1}^{K} on the fly, with minimal computational overhead during training, we introduce a gate matrix G to define the structure-aware intrinsic rank:

∆W = BGA = Σ_{j=1}^{r} g_j B_{*j} A_{j*},

where g_j ∈ {0, 1} serves as a binary "gate", indicating the presence or absence of the j-th rank. The gate matrix G = diag(g_1, . . ., g_r) is a diagonal matrix consisting of the pruning variables. By learning the variables g_j, we can control the rank of each incremental matrix individually, rather than applying the same rank to all matrices. To deactivate non-critical rank-0 components, the ideal approach would be to apply L_0 norm regularization to the gate matrix G:

||G||_0 = Σ_{j=1}^{r} 1[g_j ≠ 0],

where r is the rank of the incremental matrices. The L_0 norm measures the number of non-zero triplets; thus, optimizing it would encourage the model to deactivate less important triplets.
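As a concrete illustration, the gated decomposition ∆W = BGA and its equivalent triplet form can be sketched in NumPy (the shapes and gate values below are arbitrary, chosen only to make the rank effect visible):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 12, 4                      # illustrative dimensions

B = rng.normal(size=(d, r))              # columns B_{*j}
A = rng.normal(size=(r, k))              # rows A_{j*}
g = np.array([1.0, 0.0, 1.0, 1.0])       # binary gates; the second triplet is pruned

# Delta W = B G A with G = diag(g_1, ..., g_r)
delta_W = B @ np.diag(g) @ A

# Equivalent triplet form: sum over g_j * B_{*j} A_{j*}
delta_W_sum = sum(g[j] * np.outer(B[:, j], A[j, :]) for j in range(r))
assert np.allclose(delta_W, delta_W_sum)

# ||G||_0 counts the open gates and upper-bounds the rank of delta_W
l0 = int(np.count_nonzero(g))
print(l0, np.linalg.matrix_rank(delta_W))
```

Closing a gate removes one rank-one term, so the rank of ∆W drops from r to the number of open gates.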
Unfortunately, the optimization objective involving ||G||_0 is computationally intractable due to its non-differentiability, making it impossible to directly incorporate it as a regularization term in the objective function. Instead, we use a stochastic relaxation approach, in which the gate variables g are treated as continuous variables distributed within the interval [0, 1]. We leverage the reparameterization trick [30,31] to ensure that g remains differentiable. Following prior studies [17,19], we adopt the Hard-Concrete (HC) distribution as a continuous surrogate for the random variables g, illustrated in Figure 3. The HC distribution applies a hard-sigmoid rectification to s, which can easily be sampled by first sampling u ∼ U(0, 1) and then computing as follows:

s = Sigmoid((log u − log(1 − u) + θ)/τ), s̄ = s(ζ − γ) + γ, g = min(1, max(0, s̄)),

where θ is the trainable parameter of the distribution and τ is the temperature. The interval (γ, ζ), with γ < 0 and ζ > 1, enables the distribution to concentrate probability mass at the edges of the support. The final outputs g are rectified into [0, 1]. By summing up the probabilities of the gates being non-zero, the L_0 norm regularization can be computed in closed form, as follows:

E[||G||_0] = Σ_{j=1}^{r} Sigmoid(θ_j − τ log(−γ/ζ)).

As g now represents the output of the parameterized HC distribution function and serves as an intermediate representation for the neural network, gradient-based optimization methods can perform gradient updates for θ = {θ_1, . . ., θ_r}. For each training batch, we sample the gate mask and then share it across the training examples within the batch to enhance sampling efficiency.
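A minimal NumPy sketch of the HC sampling step and the closed-form expected L_0 follows. The values of τ, γ and ζ below are common choices from the L_0-regularization literature, not necessarily the paper's settings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hard_concrete(theta, tau=2/3, gamma=-0.1, zeta=1.1, rng=None):
    """Sample rectified gates g in [0, 1] from the Hard-Concrete distribution."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-6, 1 - 1e-6, size=theta.shape)
    s = sigmoid((np.log(u) - np.log(1 - u) + theta) / tau)   # Concrete sample
    s_bar = s * (zeta - gamma) + gamma                        # stretch to (gamma, zeta)
    return np.clip(s_bar, 0.0, 1.0)                          # hard-sigmoid rectification

def expected_l0(theta, tau=2/3, gamma=-0.1, zeta=1.1):
    """Closed-form E[||G||_0]: sum over gates of P(g_j != 0)."""
    return float(sigmoid(theta - tau * np.log(-gamma / zeta)).sum())

theta = np.array([4.0, 4.0, -4.0, -4.0])   # two gates pushed open, two pushed shut
g = sample_hard_concrete(theta, rng=np.random.default_rng(0))
print(np.round(g, 2), round(expected_l0(theta), 2))  # roughly two gates expected open
```

Because the stretched interval (γ, ζ) extends past [0, 1], the rectification places non-trivial probability mass exactly at 0 and 1, which is what allows gates to switch fully off during training.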

Enhanced Stability Using Orthogonal Regularization
In deep networks, orthogonality plays a crucial role in preserving the norm of the original matrix during multiplication, preventing signals from vanishing or exploding [32]. However, in LoRA, B and A are not orthogonal, and their dependence can lead to larger variations when certain columns or rows are removed through L_0 regularization. This, in turn, leads to training instability and potential negative effects on generalization [16]. For this reason, we turn to orthogonal regularization, which enforces the orthogonality condition:

R(A, B) = ||A A^T − I||_F^2 + ||B^T B − I||_F^2,

where I is the identity matrix. Now, let us substitute Equations (8) and (9) into Equation (4) to derive the new training objective:

min_{W,Θ} L(D; W, Θ) + λ Σ_{i=1}^{K} E[||G_i||_0] + β Σ_{i=1}^{K} R(A_i, B_i),

where Θ = {θ_1, . . ., θ_K} represents the sets of trainable parameters, and λ and β are two constant hyperparameters.
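The orthogonality penalty can be sketched as follows. The helper below takes the smaller of the two Gram matrices, which matches penalizing the columns of a tall matrix such as B and the rows of a wide matrix such as A (an assumed but standard convention):

```python
import numpy as np

def orth_penalty(M):
    """||Gram(M) - I||_F^2, using the smaller Gram matrix:
    M^T M for a tall matrix (columns of B), M M^T for a wide one (rows of A)."""
    gram = M.T @ M if M.shape[0] > M.shape[1] else M @ M.T
    return float(np.sum((gram - np.eye(min(M.shape))) ** 2))

rng = np.random.default_rng(0)
B = rng.normal(size=(16, 4))             # a random (non-orthogonal) B
Q, _ = np.linalg.qr(B)                   # orthonormal columns incur ~zero penalty
print(orth_penalty(B), orth_penalty(Q))
```

An orthonormal basis drives the penalty to zero, so minimizing it pushes the triplets towards mutual independence, which is exactly what makes pruning individual triplets less disruptive.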

Controlled Budget Using Lagrangian Relaxation
If we rely only on Equation (10) to learn the intrinsic rank of each incremental matrix, the resulting parameter budget cannot be directly controlled. This limitation becomes problematic in many real-world applications that require a specific model size or parameter budget. To address this issue, we further introduce an additional density constraint on the expected density C(Θ) to guide the network towards achieving a specific desired budget:

C(Θ) = (Σ_{i=1}^{K} E[||G_i||_0] (d_i + k_i)) / (Σ_{i=1}^{K} #(B_i) + #(A_i)) = b,
where b represents the target density, #(x) counts the total number of parameters in matrix x, and B_i is of size d_i × r_i and A_i of size r_i × k_i. However, enforcing the density constraint poses a challenging and (not necessarily strictly) constrained optimization problem. To tackle this challenge, we leverage Lagrangian relaxation as an alternative approach, along with the corresponding min-max game:

max_{λ} min_{W,Θ} L(D; W, Θ) + λ · (C(Θ) − b)^2,

where λ ∈ R is the Lagrangian multiplier, which is jointly updated during training. The updates to λ increase the training loss unless the equality constraint is satisfied, resulting in the desired parameter budget. We optimize the Lagrangian relaxation by simultaneously performing gradient descent on (W, Θ) and projected gradient ascent (onto R_+) on λ, as demonstrated in previous works [19,33]. During the experiments, we observed that the term λ · (C(Θ) − b)^2 converged quickly. To enhance training efficiency, we only optimize (Θ, λ) between the T_start and T_end time steps. We provide a summarized algorithm in Algorithm 1.
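To make the min-max updates concrete, here is a toy NumPy sketch under an assumed simplification: each gate is treated as open with probability sigmoid(θ_j) rather than via the full Hard-Concrete closed form, and the learning rates are illustrative rather than the paper's settings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

theta = np.full(8, 2.0)        # gate logits, initially biased towards "open"
lam, b = 0.0, 0.5              # Lagrange multiplier and target density
eta_p, eta_c = 1.0, 1.0        # toy learning rates (names borrowed from Algorithm 1)

for _ in range(5):
    p = sigmoid(theta)
    C = p.mean()                                      # expected density C(Theta)
    # gradient descent on theta for the penalty lam * (C - b)^2 ...
    grad_theta = 2.0 * lam * (C - b) * p * (1.0 - p) / p.size
    theta -= eta_p * grad_theta
    # ... and projected gradient ascent on lam, restricted to R+
    lam = max(0.0, lam + eta_c * (C - b) ** 2)

assert lam > 0.0               # C > b here, so the multiplier is pushed up
assert theta.mean() < 2.0      # theta is pulled towards the target budget
```

Because the density here exceeds the target, the ascent step keeps raising λ, which in turn strengthens the pressure on θ until the constraint is met.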

Inference
During training, the gate mask g_i is a random variable drawn from the HC distribution. At inference time, we first calculate the expected value of each g_i in G. If the value of g_i is greater than 0, we retain the corresponding i-th low-rank triplet. This procedure enables us to obtain the deterministic matrices B and A.
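A sketch of this pruning step, using the common deterministic test-time gate estimate from the Hard-Concrete literature (an assumption standing in for the expected gate value described above):

```python
import numpy as np

def deterministic_gate(theta, gamma=-0.1, zeta=1.1):
    """Test-time gate estimate: stretch the mean Concrete value and rectify."""
    s = 1.0 / (1.0 + np.exp(-theta))
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)

theta = np.array([6.0, -6.0, 1.0, -1.0])     # illustrative gate logits
g = deterministic_gate(theta)
keep = g > 0                                  # retain triplet i iff its gate is > 0

# Prune the corresponding columns of B and rows of A
rng = np.random.default_rng(0)
B, A = rng.normal(size=(16, 4)), rng.normal(size=(4, 12))
B_kept, A_kept = B[:, keep], A[keep, :]
print(keep, B_kept.shape, A_kept.shape)
```

After pruning, the retained B and A are plain dense matrices, so the adapted weight can be merged into W with no inference-time overhead, just as in LoRA.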

Experiments
We evaluated the effectiveness of the proposed SaLoRA on RoBERTa [34] and LLaMA-7B in both task-oriented and task-agnostic settings.
Baselines. We compared SaLoRA with the following methods:
• Fine-tuning (FT) is the most common approach for adaptation. To establish an upper bound for the performance of our proposed method, we fine-tuned all parameters within the model.

• Adapter tuning, as proposed by Houlsby et al. [25], incorporates adapter layers between the self-attention module (and the MLP module) and the subsequent residual connection. Each adapter module consists of two fully connected layers with biases and a nonlinearity in between. This original design is referred to as Adapter_H. Recently, Pfeiffer et al. [11] introduced a more efficient approach, applying the adapter layer only after the MLP module and following a LayerNorm; we call it Adapter_P.
• Prefix-tuning (Prefix) [12] prepends a sequence of continuous task-specific activations to the input. During tuning, prefix-tuning freezes the model parameters and only backpropagates the gradient to the prefix activations.
• Prompt-tuning (Prompt) [13] is a simplified version of prefix-tuning, which prepends k additional tunable tokens per downstream task to the input text.
• LoRA, introduced by Hu et al. [15], is a state-of-the-art method for parameter-efficient fine-tuning. The original implementation applied LoRA solely to the query and value projections. However, empirical studies [16,35] have shown that extending LoRA to all weight matrices, including W_Q, W_K, W_V, W_O, W_U and W_D, can further improve its performance. Therefore, we compare our approach with this generalized LoRA configuration to maximize its effectiveness.
• AdaLoRA, proposed by Zhang et al. [16], utilizes singular value decomposition (SVD) to adaptively allocate the parameter budget among weight matrices based on their respective importance scores. However, this baseline involves computationally intensive operations, especially for large matrices; the training cost can be significant, making it less efficient for resource-constrained scenarios.

Task-Oriented Performance
Models and Datasets. We evaluated the performance of different adaptive methods on the GLUE benchmark [20] using pre-trained RoBERTa-base (125M) and RoBERTa-large (355M) [34] models from the HuggingFace Transformers library [36]. See Appendix A for additional details on the datasets we used.
Implementation Details. For running all the baselines, we utilized a publicly available implementation [37]. We evaluated the performance of LoRA, AdaLoRA and SaLoRA at r = 8. To maintain a controlled parameter budget, we set the desired budget ratio (b) to 0.50 for both SaLoRA and AdaLoRA. During training, we used the AdamW optimizer [38], along with a linear learning rate scheduler. During our experiments, we observed that using a larger learning rate (η_c) significantly improved the learning process for both the gate matrices and the Lagrange multiplier. Therefore, we set η_c to 0.01 for all conducted experiments. We fine-tuned all models using an NVIDIA A100 (40 GB) GPU. Additional details can be found in Appendix B.
Main Results. We compared SaLoRA with the baseline methods under different model scale settings, and the experimental results on the GLUE development set are presented in Table 1. We can see that SaLoRA consistently achieved better or comparable performance compared with existing approaches on all datasets. Moreover, it even outperformed the FT method. SaLoRA's superiority was particularly striking when compared with LoRA, despite both models having a similar parameter count of 1.33 M/3.54 M for the base/large model scales.
After training, SaLoRA effectively utilized only 0.5 × 1.33 M/0.5 × 3.54 M parameters, yet still attained superior performance. This observation emphasizes the effectiveness of our method in learning the intrinsic rank of incremental matrices.

Task-Agnostic Performance
Models and Datasets. We present the experiments conducted to evaluate the performance of self-instruct tuned LLaMA-7B models on instruction-following data [21]. Our objective was to assess their capability to comprehend and execute instructions for arbitrary tasks. We evaluated model performance on two text style transfer datasets: Yelp [22] and GYAFC [23]. Text style transfer refers to the task of changing the style of a sentence to a desired style while preserving the style-independent content. The prompts used in these experiments can be found in Appendix C.
Furthermore, we compared the performance of SaLoRA with dataset-specific style transfer models, including StyTrans [15], StyIns [16] and TSST [17]. In contrast to SaLoRA, these models were trained on a specific dataset. To evaluate the performance of style transfer models, we used the following metrics: (1) transfer accuracy (ACC), measured using a BERT-base [39] classifier fine-tuned on each dataset; and (2) fluency, measured by the perplexity (PPL) of the transferred sentences.

Main Results. Table 2 presents our experimental results on the Yelp and GYAFC datasets. Compared with LoRA, our method SaLoRA achieved better or comparable performance across all directions on both datasets. This demonstrates the effectiveness of our method. In the negative-to-positive transfer direction, though SaLoRA's transfer accuracy was lower than that of the dataset-specific models (e.g., StyIns achieved 92.40 compared with SaLoRA's 71), it still aligned with the human reference accuracy of 64.60. Furthermore, SaLoRA exhibited a lower perplexity (PPL) than the dataset-specific models. These results show that SaLoRA (like LoRA) aligns more closely with human writing tendencies. In the formal-to-informal transfer direction, we also observed that our transfer accuracy was lower than that of the dataset-specific models. This disparity may be attributed to the inherent bias of a large model towards generating more formal outputs, which is supported by the significant improvement SaLoRA exhibited in transfer accuracy in the opposite, informal-to-formal direction.

Analysis
The Effect of Rank r. Figure 4 illustrates the experimental results of fine-tuning RoBERTa-large across different ranks. We see that the rank r significantly influenced the model's performance: both large and small values of r led to suboptimal results. This observation emphasizes that selecting the optimal value of r through heuristic approaches is not always feasible. Notably, SaLoRA consistently improved performance across all ranks when compared with the baseline LoRA. This suggests that SaLoRA effectively captured the "intrinsic rank" of the incremental matrix.
The Effect of Sparsity b. Figure 5 shows the experimental results of fine-tuning RoBERTa-large across various levels of sparsity. Remarkably, SaLoRA consistently exhibited enhanced performance across all sparsity levels compared with the baseline. This result suggests that SaLoRA's modifications facilitated the acquisition of the "intrinsic rank" of the incremental matrix under different sparsities. It is noteworthy that SaLoRA even surpassed LoRA under low sparsity conditions (0.125), which highlights its capacity to capture and leverage parameters within a constrained budget. Consequently, SaLoRA remains effective on a limited budget, making it a versatile method with a broader range of applications.

Ablation Study. We investigated the impact of Lagrangian relaxation and orthogonal regularization in SaLoRA. Specifically, we compared SaLoRA with the following variants: (i) SaLoRA_λ=0: SaLoRA without Lagrangian relaxation; (ii) SaLoRA_β=0: SaLoRA without orthogonal regularization. These variants involved fine-tuning the RoBERTa-base model on the CoLA, STS-B and MRPC datasets. The target sparsity was set to 0.5 by default. SPS denotes the expected sparsity of the incremental matrix. From Table 3, we see that:

1. Without Lagrangian relaxation, the parameter budget was uncontrollable, being 0.37, 0.42 and 0.43 on the three datasets, respectively. These results highlight the pivotal role that Lagrangian relaxation plays in controlling the allocation of the parameter budget. It is worth noting that omitting Lagrangian relaxation may lead to slight enhancements in performance; however, given the emphasis on control over the parameter budget, this incremental enhancement can be disregarded.

2. Without orthogonal regularization, the performance of SaLoRA degenerated. These results validate that incorporating orthogonal regularization into SaLoRA ensures the independence of triplets from one another, leading to a significant enhancement in its performance.

Visualization of Four Components. We plotted the expected sparsity b, the Lagrangian multiplier λ, and the penalties ||A A^T − I||_F^2 and ||B^T B − I||_F^2 to show whether these four components were regularized by Lagrangian relaxation and orthogonal regularization, respectively. Specifically, we fine-tuned RoBERTa-base using SaLoRA on the CoLA, STS-B and MRPC datasets. The initial Lagrangian multiplier λ was 0 and the target sparsity b was 0.5. The resulting curves are shown in Figure 6.

Comparison of Training Efficiency. We analyzed the efficiency of SaLoRA in terms of memory and computation, as shown in Table 4. Specifically, we selected two scales of the RoBERTa model, RoB_base and RoB_large, and measured the peak GPU memory and training time under different batch sizes on an NVIDIA A100 (40 GB) GPU. From Table 4, we see that:
1. The GPU memory usages of both methods were remarkably similar, demonstrating that SaLoRA does not impose significant memory overhead. The reason is that, in contrast to LoRA, SaLoRA only introduces the gate matrices, for a total of r × L × M additional parameters, where r denotes the rank of the incremental matrix (set at 8), L corresponds to the number of layers within the model (12 for RoB_base and 24 for RoB_large) and M stands for the number of adapted modules in each layer (set at 6).

2. The training time of SaLoRA increased by 11% when using a batch size of 32 compared with LoRA. This suggests that the additional computational requirements introduced by SaLoRA are justified by its notable gains in performance. Moreover, SaLoRA is only active during a specific training phase (T_start to T_end) comprising 30% of the overall training time; with the remaining 70% being equivalent to LoRA, the overall impact on training time remains manageable.

The Resulting Rank Distribution. Figure 7 shows the resulting rank of each incremental matrix obtained from fine-tuning RoBERTa-base with SaLoRA. We observed that SaLoRA consistently assigned higher ranks to certain modules (W_U, W_O and W_V) and layers (4, 5, 6 and 7). This aligns with the empirical results shown in Figure 1, indicating that these modules and layers play a more important role in model performance. Hence, these findings not only validate SaLoRA's effective prioritization of critical modules and layers, but also emphasize its capacity to learn the structure-aware intrinsic rank of the incremental matrix.
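As a quick check of the gate overhead discussed in the efficiency comparison above, the counts work out as follows (r, L and M taken from the text):

```python
# Extra gate parameters introduced by SaLoRA: r x L x M per model
r, M = 8, 6                      # rank per incremental matrix, adapted modules per layer
for name, L in [("RoBERTa-base", 12), ("RoBERTa-large", 24)]:
    print(name, r * L * M)       # 576 gates for base, 1152 for large -- negligible
```

Against the 1.33 M/3.54 M trainable LoRA parameters reported earlier, a few hundred extra scalars is indeed negligible, which is consistent with the near-identical memory figures in Table 4.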

Conclusions
In this paper, we presented SaLoRA, a structure-aware low-rank adaptation method that adaptively learns the intrinsic rank of each incremental matrix. In SaLoRA, we introduced a diagonal gate matrix to adjust the rank of the incremental matrix by penalizing the L_0 norm, i.e., the count of activated gates. To enhance training stability and model generalization, we orthogonally regularized B and A. Furthermore, we integrated a density constraint and employed Lagrangian relaxation to control the number of valid ranks. In our experiments, we demonstrated that SaLoRA effectively captures the structure-aware intrinsic rank and consistently outperforms LoRA without significantly compromising training efficiency.

Figure 1. Fine-tuning performance of LoRA across different modules and layers with varying ranks on MRPC.

Figure 3. Hard-Concrete distribution with different parameters.

Figure 6. Visualization of expected sparsity b and the Lagrangian multiplier λ under Lagrangian relaxation, and ||A A^T − I||_F^2 and ||B^T B − I||_F^2 under orthogonal regularization: (a) expected sparsity b; (b) Lagrangian multiplier λ; (c) A of W_O at the first layer; and (d) B of W_O at the first layer.

Figure 7. The resulting rank of each incremental matrix obtained from fine-tuning RoBERTa-base on MRPC with SaLoRA. The initial rank is set at 8, and the target sparsity is 0.5. The x-axis is the layer index and the y-axis represents the different types of modules.
Algorithm 1: SaLoRA.
Input: Dataset D; total iterations T; target density b; hyperparameters τ, γ, ζ, β, η_p, η_c.
Output: The fine-tuned parameters {W, Θ}.
for t = 1, . . ., T do
    Sample a mini-batch from D
    if T_start ≤ t < T_end then
        Sample a gate mask set G from the HC distribution and share it across the mini-batch
        Compute the gradient of L(W, Θ, λ)
        Update (W, Θ) by gradient descent and λ by projected gradient ascent

Table 2 .
Automatic evaluation results on the Yelp and GYAFC datasets. ↑ indicates that higher values mean better performance, and vice versa.

Table 3 .
Ablation studies on Lagrangian relaxation and orthogonal regularization.

Table 4 .
Comparison of training efficiency between LoRA and SaLoRA on the MRPC dataset.

Table A1 .
Description of datasets.