1. Introduction
In recent years, vision-language models (VLMs) have made significant progress in artificial intelligence. Owing to their strong multimodal processing capabilities, they have shown great potential in tasks such as image captioning and visual question answering. As the technology has evolved, VLM parameter counts have grown exponentially; for example, GPT-3 reaches 175 billion parameters. Although the performance of such large-scale models has improved significantly, this growth has also created a dilemma of high computational cost. Traditional full fine-tuning requires updating all of a model's parameters, consuming enormous computing resources. Taking a 10-billion-parameter model as an example, full fine-tuning may require dozens of high-performance GPUs (Graphics Processing Units) running continuously for several days, along with substantial storage costs to retain model checkpoints. In resource-constrained environments, such as IoT (Internet of Things) medical monitoring systems [1], this is almost infeasible, which poses many challenges for practical applications [2].
To address the pain points of traditional full fine-tuning, parameter-efficient fine-tuning (PEFT) methods have emerged, with low-rank adaptation (LoRA) becoming one of the mainstream techniques. LoRA uses low-rank matrix factorization to introduce only a small number of trainable parameters during fine-tuning, freezing most of the original model's parameters and significantly reducing the computational resources and training time required. For example, when applying LoRA fine-tuning to a pre-trained model with 1 billion parameters, only about 10 million parameters need to be adjusted, reducing computational complexity by over 90%. This efficiency has enabled wide application across fields: Pu and Xu (2024) [3] introduced LoRA into a Transformer-based oriented object detector, achieving onboard processing of remote sensing images; Cappelletti et al. (2025) [4] used LoRA for deepfake detection with few training samples; Gianluca et al. (2025) [5] used LoRA-adapted embeddings to predict intrinsic disorder in protein sequences; Yu et al. (2021) [6] combined LoRA with a dual-blockchain architecture to construct a contract-theory-driven information system, verifying its adaptability across domains; Kowsher et al. (2024) [7] proposed a prompt-guided method for fine-tuning large language models (LLMs); Bian et al. (2024) [8] proposed LoRA-FAIR, which optimizes aggregation and initialization for LoRA fine-tuning in federated learning; Zhang et al. (2025) [9] used heterogeneous LoRA allocation to efficiently fine-tune federated foundation models in FedHello; Zhao et al. (2025) [10] proposed the LasQ scheme, which applies quantization to fine-tune the dominant singular components of LLMs; Qu et al. (2025) [11] implemented an embedded routing function within an efficient fine-tuning scheme for low-rank bottleneck parameters in vision-language models; Lin et al. (2025) [12] achieved scientific sentence recognition through staged LoRA fine-tuning; Liao et al. (2025) [13] constructed a dynamic adaptation strategy for LoRA fine-tuning to achieve task-specific, efficient optimization of large language models; Chen et al. (2025) [14] proposed CE-LoRA as a computationally efficient LoRA fine-tuning solution for language models; Mu et al. (2024) [15] applied LoRA fine-tuning to multimodal large language models to perform multimodal sentiment analysis. However, LoRA still faces challenges in real-world scenarios: Djidi et al. (2021) [16] showed that for energy-harvesting sensor nodes, the downlink delay of LoRA must be optimized through wake-up radio technology, reflecting its adaptive limitations across different deployment environments.
Although LoRA has achieved significant results in efficient parameter tuning, its core shortcomings gradually become exposed in multitask environments. Different tasks differ significantly in data distribution, task objectives, and feature representation. LoRA's fixed parameter-adjustment paradigm struggles to adapt flexibly to complex multitask requirements and cannot effectively capture complex task-specific patterns, resulting in performance degradation when switching tasks. This phenomenon is similar to what Rice et al. (1993) [17] found in research on alcohol-use therapy, that "different age groups have different responses to the same treatment plan": without a targeted adaptation mechanism, it is difficult to meet heterogeneous needs. For example, when performing image classification and object detection simultaneously, it is difficult to fully exploit the correlated features between the two, so the model fails to reach ideal performance on either task. Although Tian et al. (2024) [18] proposed the asymmetric LoRA architecture HydraLoRA to improve fine-tuning efficiency, LoRA's design fundamentally centers on single-task optimization and still cannot meet the structural demands of multitask environments.
As Table 1 shows, the core of SLoRA is orthogonal constraint optimization plus an optimized MoE structure (covering general experts, task-specific experts, and dynamic routing). Compared with the low-rank matrix decomposition and frozen pre-trained parameters of traditional LoRA, it addresses, on the one hand, the design limitations of LoRA's single-task optimization (including poor multitask adaptability and the lack of a mechanism for mitigating knowledge forgetting); on the other hand, it improves knowledge integration through a cross-expert attention mechanism and reduces interference with old knowledge through constrained solution-space initialization. Ultimately, SLoRA raises average multitask accuracy by 7.8% and retains 92.4% accuracy on old tasks (16.1% higher than LoRA), while task-switching loss is only 2.3%, 6.4 percentage points lower than LoRA, all at ultra-low computational cost (trainable parameters account for only 0.8%).
In order to address the shortcomings of traditional low-rank adaptation (LoRA) and current mixture-of-experts LoRA (MoE LoRA) fusion solutions, and to fill a key research gap in efficient parameter fine-tuning of vision-language models (VLMs) in multitask settings, three research questions are proposed:
① How can we overcome the shortcomings of traditional LoRA methods in multitask adaptability? This deficiency stems from their fixed low-rank matrix update mechanism, which hinders adaptation to the characteristics of different task types and the capture of complex inter-task patterns.
② How can we alleviate the catastrophic forgetting and knowledge fragmentation issues in the existing MoE LoRA integration framework? These issues are caused by new-task parameter updates overwriting old-task knowledge, as well as insufficient information exchange between expert modules.
③ How can we balance high performance (including multitask accuracy and cross-modal reasoning ability) against ultra-low computational consumption when efficiently tuning the parameters of large-scale vision-language models (VLMs), especially in resource-constrained environments such as IoT applications?
In order to address these research issues, this study proposes the SLoRA architecture, whose core contributions are as follows, aimed at promoting the in-depth application of visual language models (VLMs) in multitasking scenarios:
(1) In response to the two core challenges of MoE LoRA (catastrophic forgetting and knowledge fragmentation), SLoRA combines orthogonal constraint optimization with an optimized mixture-of-experts (MoE) structure to simultaneously achieve knowledge retention and improved cross-task knowledge integration.
(2) Introduce constrained solution-space initialization based on orthogonal constraint optimization, restricting the direction of parameter updates to reduce interference with existing knowledge and effectively mitigate catastrophic forgetting in multitask learning.
(3) Design an optimized mixture-of-experts (MoE) structure consisting of "general experts + task-specific experts + dynamic routing" to enhance information sharing among experts, solve the problem of inefficient cross-modal collaboration, and improve multitask adaptability.
(4) Improve performance while maintaining ultra-low computational costs (only 0.8% of trainable parameters for a 1 billion parameter model) to adapt to resource-constrained environments such as the Internet of Things.
(5) Establish a multidimensional evaluation system that covers performance, efficiency, and universality to comprehensively verify the effectiveness and stability of the SLoRA architecture.
2. Technical Background
2.1. Efficient Parameter Fine-Tuning in Multitask Scenarios (PEFT)
The parameter-efficient fine-tuning (PEFT) technique, which has attracted considerable attention in deep learning in recent years, aims to solve the problem of the high computational cost of traditional full-model fine-tuning for large-scale models. In multitask scenarios, models need to handle multiple types of tasks simultaneously. For example, in natural language processing, a model may need to perform text classification on the GLUE dataset (over 100,000 samples), sentiment analysis on the IMDb dataset (50,000 reviews), and machine translation on WMT14 English-German (4.5 million parallel sentence pairs). In computer vision, a model may need to simultaneously handle ImageNet image classification (1.2 million training samples), COCO object detection (150,000 images with 880,000 object annotations), and PASCAL VOC semantic segmentation (11,000 images with 27,000 segmentation annotations). Traditional full fine-tuning requires updating all model parameters, which is extremely expensive in multitask scenarios. Taking a cross-modal pre-trained model with 10 billion parameters as an example, a single round of full-parameter training (assuming a batch size of 128) can cost 5.2 × 10^18 FLOPs (floating-point operations), requiring 50 A100 GPUs (40 GB memory each) to run continuously for 14 days. The total computational cost exceeds $120,000, and the model parameter file stored after each round of training reaches 400 GB (single-precision floating point); saving 10 rounds of checkpoints requires 4 TB of storage. More importantly, full fine-tuning easily leads to overfitting in multitask scenarios.
In the above multitask combination, the overfitting gap (training-set accuracy minus validation-set accuracy) of the fully fine-tuned model reaches 18.3% on text classification, 22.1% on object detection, and as high as 31.7% on semantic segmentation, which has a small data volume. The reason is that the data distributions of the different tasks vary significantly (for example, the feature-space overlap between text classification and semantic segmentation is less than 15%), and updating all parameters uniformly forces the model to overfit local features of some tasks.
PEFT achieves efficient fine-tuning in multitask scenarios by adjusting only a small number of the model's parameters or introducing a small number of additional trainable parameters. Taking adapter modules as an example, in a pre-trained model with 10 billion parameters, inserting two bottleneck adapter modules (128 × 64 × 2) into each Transformer layer adds only 5.12 million adapter parameters in total, accounting for only 0.51% of the original model parameters. Under this setting, the computational load of a single training round drops to 2.8 × 10^17 FLOPs (only 5.4% of full fine-tuning), the computational cost falls to $8000 with only three A100 GPUs running for 2 days, and the storage requirement shrinks to 20 GB per round (including adapter parameters and optimizer state). The overfitting gap is also significantly reduced: to 5.2% for text classification, 7.8% for object detection, and 9.3% for semantic segmentation, thanks to the stable feature extraction provided by the frozen pre-trained parameters, which lets the adapters learn only task-specific mappings. In multitask scenarios, a model needs strong adaptability and generalization. For adaptability, PEFT's advantage can be quantified by task-switching efficiency: in the above multitask combination, the parameters adjusted when a PEFT model switches tasks amount to only 0.3% of full fine-tuning, and switching time drops from 2.3 h for full fine-tuning (reloading the model) to 45 s (updating only adapter parameters). On performance, PEFT achieved a validation-set accuracy of 85.7% on text classification (vs. 86.2% for full fine-tuning, a gap of only 0.5%) and 82.3% on sentiment analysis (vs. 81.9% for full fine-tuning), demonstrating flexible adaptation to different tasks.
In terms of generalization, PEFT enables more effective knowledge transfer. For example, when a feature extractor fine-tuned on ImageNet image classification is transferred to COCO object detection, the model's mAP (mean average precision) reaches 42.6%, 3.5 percentage points higher than transfer after full fine-tuning (39.1%); when a PEFT model fine-tuned on machine translation is transferred to cross-lingual text classification, average accuracy reaches 78.2%, significantly higher than the 72.5% of full fine-tuning transfer. This is because PEFT retains more than 99% of the common knowledge in the pre-trained model (by feature-similarity measurement, the feature-space overlap between the PEFT fine-tuned model and the original pre-trained model reaches 92.3%, versus only 68.5% for full fine-tuning). This common knowledge provides a solid foundation for cross-task transfer. This efficiency, low overfitting risk, and strong transferability make PEFT the core technology for large-scale model fine-tuning in multitask scenarios, an irreplaceable advantage especially in resource-limited industrial settings (such as edge-device deployment and real-time multitask updates).
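The bottleneck-adapter scheme described above can be made concrete with a minimal NumPy sketch. The dimensions (768 hidden size, 64 bottleneck) and the zero-initialized up-projection are illustrative assumptions for exposition, not the configuration used in the experiments:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

class BottleneckAdapter:
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a
    residual connection. Only these two small matrices would be trained;
    the frozen backbone layer around the adapter is untouched."""
    def __init__(self, d_model, d_bottleneck, rng):
        self.W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.02
        # Zero-initializing the up-projection makes the adapter start as
        # the identity map, so inserting it does not perturb the
        # pre-trained model's behavior before training.
        self.W_up = np.zeros((d_bottleneck, d_model))

    def __call__(self, h):
        return h + relu(h @ self.W_down) @ self.W_up

    def num_params(self):
        return self.W_down.size + self.W_up.size

rng = np.random.default_rng(0)
adapter = BottleneckAdapter(d_model=768, d_bottleneck=64, rng=rng)
h = rng.standard_normal((4, 768))   # a batch of hidden states
out = adapter(h)                    # identical to h at initialization
```

Counting `adapter.num_params()` against the backbone's parameter count reproduces the kind of trainable-fraction arithmetic used in the paragraph above.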
2.2. LoRA Technology Principles and Applications
2.2.1. LoRA Core Principles
Low-rank adaptation (LoRA) is a popular parameter-efficient fine-tuning method whose fundamental principle is low-rank matrix factorization. In deep learning models, especially large-scale pre-trained models such as the GPT series and BERT, the weight matrices generally have high dimensionality. These high-dimensional matrices contain numerous parameters, and updating them all during fine-tuning would incur significant computational cost.
The idea of LoRA is to attach two low-rank matrices to a weight matrix of the pre-trained model and use their product to approximate the weight update. Let W be a weight matrix of the pre-trained model; during full fine-tuning, the entire matrix W must be updated. In the LoRA approach, an additional update matrix ΔW is added and expressed as the product of two low-rank matrices A and B, that is, ΔW = A · B, where A ∈ ℝ^(d×r) and B ∈ ℝ^(r×k), and the rank r is much smaller than d and k. During training, only the low-rank matrices A and B are updated, while the pre-trained weight matrix W remains frozen. In this way, the number of trainable parameters decreases from the original d × k to (d + k) × r, substantially reducing computational cost and memory requirements.
Taking a pre-trained model with 1 billion parameters as an example, suppose one layer's weight matrix has dimensions d = 1000 and k = 1000. With full fine-tuning, the number of parameters to update in this layer is 1000 × 1000 = 1,000,000. With LoRA at rank r = 16, the number of newly added trainable parameters in this layer is only (1000 + 1000) × 16 = 32,000, a reduction of about 96.8%. This dramatic reduction allows the model to converge faster during fine-tuning while lowering the demand for computational resources, enabling effective fine-tuning of large-scale models even under limited resources.
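The arithmetic above can be checked with a minimal NumPy sketch of a LoRA-augmented linear layer, using the same notation (ΔW = A · B with A ∈ ℝ^(d×r), B ∈ ℝ^(r×k)). The initialization scheme shown (small random A, zero B) follows common LoRA practice and is an assumption here, not a detail taken from this paper:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update ΔW = A @ B."""
    def __init__(self, W, r, rng):
        d, k = W.shape
        self.W = W                                    # frozen pre-trained weight
        self.A = rng.standard_normal((d, r)) * 0.01   # trainable, small random init
        self.B = np.zeros((r, k))                     # trainable, zero init so ΔW = 0 at start

    def forward(self, x):
        # Equivalent to x @ W + x @ A @ B; the effective weight is W + ΔW.
        return x @ (self.W + self.A @ self.B)

    def trainable_params(self):
        return self.A.size + self.B.size              # (d + k) * r

rng = np.random.default_rng(0)
d = k = 1000
layer = LoRALinear(rng.standard_normal((d, k)), r=16, rng=rng)
full = d * k                      # 1,000,000 params under full fine-tuning
lora = layer.trainable_params()   # (1000 + 1000) * 16 = 32,000
reduction = 1 - lora / full       # 0.968, the 96.8% reduction from the text
```

Because B starts at zero, the layer initially computes exactly x @ W, so LoRA fine-tuning begins from the pre-trained model's behavior.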
2.2.2. Limitations of LoRA in Multitask Scenarios
Although LoRA has achieved outstanding results in efficient parameter tuning, it still has significant constraints in multitask scenarios, mainly because its inherent design struggles to fit the substantial heterogeneity between tasks. Tasks differ significantly in data distribution, core objectives, and feature representation, and these differences are not merely superficial but reflect fundamental differences in how feature spaces are constructed. In natural language processing, text classification tasks (such as topic classification on the GLUE dataset) focus on identifying discrete semantic categories, relying heavily on keyword frequency, syntactic structure, and domain-specific terminology; in contrast, sentiment analysis tasks (such as polarity discrimination on the IMDb dataset) focus on capturing subjective emotional tendencies, which depend more on affective vocabulary, rhetorical devices, and subtle contextual tone [19]. Beyond natural language processing, the differences between cross-modal tasks are even more pronounced: image classification tasks (such as ImageNet) prioritize low-level visual features such as edges, textures, and color distribution, while visual question answering (VQA) tasks require integrating high-level semantic understanding of textual questions with visual content parsing [11,14]. Quantitative analysis shows that the feature-space overlap between different tasks can be as low as 15% (for example, between text classification and semantic segmentation; see Section 2.1), indicating that a parameter-adjustment mode optimized for one task is often incompatible with another.
LoRA's fixed low-rank update mode (ΔW = A · B) further exacerbates this adaptation bottleneck. Because the rank r is predetermined (usually 16–64) and the update direction does not differ between tasks, the model can only learn a general low-rank adaptive pattern and cannot develop task-specific parameter-adjustment strategies [20]. Djidi et al. (2021) [16] confirmed, in the scenario of energy-harvesting sensor nodes, that a rigid parameter-update mechanism causes significant performance degradation when task requirements change dynamically, requiring auxiliary wake-up radio technology to compensate for its poor adaptability. Even within multitask natural language processing, Tian et al. (2024) [18] pointed out that although improved architectures such as HydraLoRA optimize fine-tuning efficiency, LoRA's core design logic still revolves around single-task optimization and lacks inherent support for knowledge differentiation and collaboration across tasks. Experimental data confirm this constraint: when LoRA is applied to a combination of text classification, sentiment analysis, and machine translation, its average task-switching loss reaches 8.7% (higher still for cross-modal tasks), and on complex inference tasks the performance gap relative to task-specific fine-tuning widens to 11.3% (Table 1, Section 3.1), a direct consequence of its inability to adapt flexibly to heterogeneous task requirements.
In short, LoRA adapts poorly to inter-task differences in multitask settings. Its parameter-adjustment scheme is relatively fixed and cannot easily flex to the characteristics of individual tasks. When performing text classification and sentiment analysis simultaneously, LoRA may fail to capture the differences between the two, hurting performance on both tasks. The root cause is that LoRA is designed primarily for single-task optimization: it assumes that the features and patterns of different tasks are sufficiently similar that simply adjusting the low-rank matrices can adapt to each. In practical multitask scenarios, this assumption often does not hold; the differences between tasks can be substantial, demanding more sophisticated parameter-adjustment strategies.
2.3. Mixture of Experts (MoE)
2.3.1. Overview of MoE Mechanism
The mixture-of-experts (MoE) mechanism is an architectural design intended to improve a model's capability and efficiency on complex tasks. Its core idea, in the spirit of "teaching students according to their aptitude" and "learning from others' strengths", is to decompose a complex task into multiple subtasks handled by specialized "expert" models. These experts can use different network structures (such as CNNs (Convolutional Neural Networks), LSTMs (Long Short-Term Memory networks), or Transformers) and show significant advantages in their specific task domains. For example, in natural language processing, a syntactic-analysis expert can reach 92.3% part-of-speech tagging accuracy, and a semantic-understanding expert can reach an F1 of 89.7% on contextual reasoning, significantly higher than the 85.1% of a single model.
The MoE mechanism consists of an expert network (a set of parallel sub-models, typically 8 to 128 experts) and a gating network that dynamically allocates experts based on input features, for example by outputting expert-assignment probabilities through a softmax function. Experiments show that the gating decision accuracy can reach 89.4%; for instance, routing accuracy reaches 92.1% for scientific paper abstracts and 87.6% for news texts.
Quantitative data clearly demonstrate MoE's advantages. On a combination of 10 NLP tasks, an eight-expert MoE model reaches an average F1 of 86.2%, 5.8% higher than a single model of the same parameter scale, with training efficiency improved by 3.2× (thanks to activating only one or two experts per input). On multimodal tasks, an MoE model integrating a visual expert (90.5% image feature-extraction accuracy) and a language expert (88.9% text-understanding F1) achieves 82.4% cross-modal reasoning accuracy, significantly higher than the 75.6% of a single-modality model. Through dynamic routing and multi-expert fusion, MoE yields significant gains on complex tasks: on composite tasks involving syntactic analysis, semantic reasoning, and emotion recognition, the MoE model's weighted average accuracy reaches 89.3% versus 81.5% for a single model, while computing cost drops by 40% (since shared features need not be recomputed).
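The gating-plus-experts pipeline described above can be sketched in a few lines of NumPy. The linear experts, the dimensions, and the top-2 sparse routing are illustrative assumptions for exposition, not the configuration used in this paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, experts, W_gate, top_k=2):
    """Sparse MoE: the gating network scores all experts with a softmax,
    but only the top-k experts are evaluated per input, which is where
    the training-efficiency gain cited above comes from."""
    probs = softmax(x @ W_gate)                  # (batch, n_experts)
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        top = np.argsort(probs[i])[-top_k:]      # indices of the top-k experts
        w = probs[i, top] / probs[i, top].sum()  # renormalize their weights
        for weight, e in zip(w, top):
            out[i] += weight * experts[e](x[i])  # weighted expert fusion
    return out, probs

rng = np.random.default_rng(0)
d, n_experts = 16, 8
# Each "expert" is a simple linear map for illustration only.
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_experts)]
experts = [lambda v, W=W: v @ W for W in Ws]
W_gate = rng.standard_normal((d, n_experts))
x = rng.standard_normal((4, d))
out, probs = moe_forward(x, experts, W_gate, top_k=2)
```

In a real MoE LoRA layer the experts would be low-rank adapters rather than dense linear maps, but the routing logic is the same.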
2.3.2. Challenges of MoE Integration into LoRA
Although integrating the MoE mechanism into LoRA (MoE LoRA) theoretically improves multitask performance, it faces two core challenges in practical applications: catastrophic forgetting and knowledge fragmentation, manifested as follows.
Catastrophic forgetting manifests as significant degradation of old-task knowledge when the model learns new tasks. Experimental data show that in sequential learning over five visual tasks (such as image classification and object detection), the MoE LoRA model's accuracy on the first task drops from an initial 89.2% to 68.5% after learning the fifth task, a relative decrease of 23.2%, far exceeding the 11.7% of the fully fine-tuned model. Deeper analysis shows that new-task updates to expert parameters overwrite old-task knowledge: for example, in object detection, adjusting the "edge feature extraction" parameters by ±0.32 causes a 19.6% drop in "texture feature" recognition accuracy for image classification. In multitask parallel scenarios the forgetting is even more severe: when training three NLP tasks simultaneously, MoE LoRA's degradation rate on earlier tasks (a 2.1% drop per training round) is 3.5 times that of a single LoRA.
Knowledge fragmentation arises when insufficient information sharing among experts disperses knowledge and limits cross-task performance. Measured by feature cosine similarity, the feature overlap between different experts in MoE LoRA is only 31.2% (versus 68.5% for the fully fine-tuned model), and the feature similarity between language and visual experts is only 22.7%, making cross-modal task accuracy 15.3% lower than the theoretical value. A typical case: in the image description generation task, the visual expert's object-recognition accuracy reached 90.1% and the language expert's text-generation BLEU (Bilingual Evaluation Understudy) reached 28.6%, yet due to information isolation the fused task's BLEU was only 21.3%, 25.5% below the ideal of full expert collaboration. Moreover, the degree of expert specialization is positively correlated with fragmentation: when the number of experts increases from 8 to 32, average multitask F1 decreases by 4.8% and feature overlap falls by 12.3%. These challenges cause significant performance fluctuation for MoE LoRA in multitask scenarios, with a performance standard deviation of 5.7 (versus 2.3 for SLoRA) across 10 task-combination tests, and particularly poor cross-domain performance, such as a 27.4% accuracy drop when switching from text classification to image segmentation.
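One generic way to realize the kind of constrained update direction that orthogonal-constraint approaches aim for is to project a candidate new-task update onto the orthogonal complement of an old-task parameter subspace, so that the update cannot move the model along directions important to old tasks. The sketch below is an illustration of that general idea under assumed dimensions, not the exact formulation used by SLoRA:

```python
import numpy as np

def project_orthogonal(update, old_basis):
    """Remove from `update` its components along the subspace spanned by
    the orthonormal columns of `old_basis`; the returned update leaves
    every old-task direction untouched, mitigating forgetting."""
    return update - old_basis @ (old_basis.T @ update)

rng = np.random.default_rng(0)
d = 64
# Orthonormal basis of a hypothetical "old task" parameter subspace,
# e.g. obtained from the principal directions of past updates.
old_basis, _ = np.linalg.qr(rng.standard_normal((d, 8)))
g = rng.standard_normal(d)            # candidate new-task update direction
g_orth = project_orthogonal(g, old_basis)
# g_orth now has (numerically) zero projection onto every old-task direction.
```

Applying `g_orth` instead of `g` trades some new-task freedom for preservation of old-task behavior, which is the balance the forgetting statistics above are measuring.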
4. Experimental Verification and Result Analysis
4.1. Experimental Setup
4.1.1. Experimental Dataset
This experiment selected three typical datasets, a commonsense reasoning suite plus the IconQA multimodal dataset, to comprehensively evaluate the multitask performance of the SLoRA architecture. The specific details are shown in Table 2:
WSC (SuperGLUE): Constructed by Wang et al., covering 273 text samples. The core task is coreference resolution: the model must use context to determine the referent of a pronoun (such as "it" or "they"). It is commonly used to evaluate the fine-grained semantic comprehension of language models [21].
CommonsenseQA: Published by Talmor et al., containing 12,102 commonsense question-answer samples with five candidate answers per question, covering everyday commonsense and scientific knowledge. The model must combine background knowledge to perform multiple-choice reasoning [22].
IconQA: Created by Lu et al., it contains 103,000 image-text bimodal samples, with image categories including natural scenes, abstract icons, and schematic diagrams. The text questions involve object recognition, scene recognition, and logical reasoning. The model must integrate visual features and text semantics to produce answers; it is a commonly used dataset for verifying cross-modal multitask performance [23].
All datasets use the official training/validation split (70% training, 30% validation), and no additional preprocessing is performed on the data, to ensure the fairness and reproducibility of the experiment.
4.1.2. Experimental Comparison Method
Representative methods in the field of efficient parameter tuning are selected as baselines for comparison with SLoRA; the core characteristics of each method are shown in the following table:
As shown in Table 3, SLoRA's trainable parameters account for 0.8%, balancing adaptability and efficiency. As a classic PEFT method, LoRA simplifies parameter updates with a fixed low-rank matrix but lacks a task-differentiated adaptation mechanism; AdaLoRA builds on LoRA by optimizing the allocation of the low-rank budget, improving adaptability on single tasks but not addressing multitask knowledge conflicts; SLoRA, which enhances multitask capability through the collaboration of its two components, reduces knowledge forgetting and improves the efficiency of knowledge sharing among experts.
4.1.3. Experimental Evaluation Indicators
The performance of the model is evaluated using the following quantitative indicators; the calculation formulas and applicable scenarios are shown in the table below:
Table 4 shows the following: accuracy and F1 measure the prediction quality of classification and question-answering tasks, where F1 is particularly suited to the imbalanced positive/negative samples in commonsense reasoning; the BLEU value evaluates the text quality of subtasks such as image description generation in IconQA, reflecting the accuracy of cross-modal semantic transformation; and task-switching loss quantifies the degree of catastrophic forgetting in multitask learning, computed as (initial accuracy − post-training accuracy) / initial accuracy × 100%.
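The task-switching-loss formula is simple enough to verify directly; the sketch below applies it to the MoE LoRA forgetting figures quoted in Section 2.3.2 (89.2% initial accuracy dropping to 68.5%):

```python
def task_switching_loss(initial_acc, post_acc):
    """Relative degradation of an old task after training on new tasks:
    (initial − post) / initial × 100%."""
    return (initial_acc - post_acc) / initial_acc * 100.0

# First-task accuracy of MoE LoRA over a five-task sequence (Section 2.3.2):
loss = task_switching_loss(89.2, 68.5)   # ≈ 23.2%, matching the quoted figure
```

Note that this is a relative measure: the raw accuracy drop is 20.7 percentage points, but dividing by the initial accuracy yields the 23.2% figure used throughout the paper.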
4.2. Experimental Results
SLoRA demonstrates excellent performance on commonsense reasoning tasks, achieving significantly higher accuracy and F1 scores than LoRA, AdaLoRA, and other comparison methods on multiple commonsense reasoning datasets, with clear quantitative advantages: on the SuperGLUE WSC dataset, its accuracy is 9.0% higher than LoRA and 3.7% higher than AdaLoRA; on the CommonsenseQA dataset, its F1 score is 7.7% higher than LoRA and 2.9% higher than AdaLoRA.
As Figure 7 shows, SLoRA's combination of orthogonal constraint optimization and an optimized MoE structure (general experts, task-specific experts, and dynamic routing) not only resolves the key problem traditional LoRA faces in commonsense reasoning tasks, namely insufficient adaptability in multitask scenarios and difficulty capturing complex semantic patterns, but also strengthens semantic feature integration through the cross-expert attention mechanism. It ultimately reaches 85% accuracy on the WSC dataset (9.0% higher than LoRA and 3.7% higher than AdaLoRA) and an F1 of 70% on the CommonsenseQA dataset (7.7% higher than LoRA and 2.9% higher than AdaLoRA), significantly enhancing the model's fine-grained semantic parsing and commonsense reasoning abilities.
As shown in
Table 5, SLoRA performs outstandingly on commonsense reasoning tasks: its accuracy on WSC reaches 85%, a 9.0% relative improvement over LoRA (78%) and 3.7% over AdaLoRA (82%); its F1 score on CommonsenseQA reaches 70%, five and two percentage points higher than LoRA and AdaLoRA, respectively, confirming its suitability for complex semantic reasoning. SLoRA also excels on the IconQA multimodal dataset, with average scores significantly above the baselines at every LoRA rank setting (r = 16, 32, 64): at rank 16, SLoRA averages 80 points versus 70 for LoRA and 75 for AdaLoRA; at rank 32, SLoRA rises to 83 points versus 72 and 78; at rank 64, SLoRA holds steady at 85 points versus 75 and 80. These results demonstrate SLoRA's strength in multimodal information fusion and cross-modal reasoning: it better understands combined image and text information and accurately answers the corresponding questions.
As shown in
Figure 8, SLoRA applies orthogonal constraints together with the optimized MoE structure (a general-domain expert group, task-specific experts, and a dynamic routing mechanism). This addresses the weaknesses of traditional LoRA and AdaLoRA on multimodal tasks, namely insufficient cross-modal information fusion and poor adaptability across rank values, while dynamic routing precisely matches inputs to visual and textual experts, strengthening cross-modal semantic transformation. SLoRA consequently leads at every LoRA rank (16, 32, 64) on the IconQA dataset, as detailed in
Table 6: at r = 16 its average score is 80 points (14.3% higher than LoRA), at r = 32 it reaches 83 points (15.3% higher), and at r = 64 it reaches 85 points (13.3% higher), all while maintaining stable performance. These results confirm its advantages in multimodal information integration and reasoning tasks.
4.3. Result Analysis
In-depth analysis of the experimental results shows that SLoRA holds clear advantages over the other methods. In multitask scenarios, its constrained solution-space initialization and optimized MoE architecture both play vital roles. The constrained initialization, through orthogonal constraints, effectively reduces interference with existing knowledge, allowing the model to retain knowledge of preceding tasks while learning new ones and thereby avoiding catastrophic forgetting. When transitioning between commonsense reasoning and multimodal tasks, SLoRA maintains stable performance on both, whereas LoRA and AdaLoRA exhibit significant performance fluctuations during task transitions.
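To illustrate the orthogonal-constraint idea behind the constrained initialization, the following sketch shows one common way to impose it: initializing an adapter factor with orthonormal rows via QR decomposition, plus a Frobenius-norm penalty that keeps the rows near-orthonormal during training. This is a hedged illustration under assumed shapes, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal_rows(r, d):
    """Initialize an r x d adapter factor with orthonormal rows via QR."""
    q, _ = np.linalg.qr(rng.normal(size=(d, r)))  # q: (d, r), orthonormal columns
    return q.T                                     # rows of the result are orthonormal

def orthogonality_penalty(A):
    """Regularizer ||A A^T - I||_F^2 penalizing deviation from row-orthonormality."""
    r = A.shape[0]
    return float(np.sum((A @ A.T - np.eye(r)) ** 2))

A = orthogonal_rows(16, 64)
# a freshly initialized factor satisfies the constraint, so the penalty is ~0
```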
The optimized MoE architecture greatly enhances the model's adaptability to diverse tasks through the collaboration of general and task-specific experts under a dynamic routing mechanism. The pretrained knowledge held by the general experts provides a solid foundation, the task-specific experts learn in a targeted way from each task's characteristics, and dynamic routing ensures that inputs are assigned to the most suitable experts. On the IconQA multimodal dataset, for questions involving object recognition in images, the task-specific experts identify objects quickly and accurately and, combined with the contextual information supplied by the general experts, produce correct answers. LoRA and AdaLoRA, lacking such a task adaptation mechanism, handle these complex multimodal tasks less well.
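To make the routing idea concrete, the following is a minimal NumPy sketch of an MoE adapter with one always-active general expert and softmax-gated task-specific experts. All class and parameter names are illustrative assumptions, not the paper's implementation; each expert here is a standard low-rank (LoRA-style) adapter with its B factor initialized to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class Expert:
    """A low-rank adapter: the update is x @ A^T @ B^T, with rank r << d."""
    def __init__(self, d, r):
        self.A = rng.normal(scale=0.02, size=(r, d))
        self.B = np.zeros((d, r))  # standard LoRA: B starts at zero
    def __call__(self, x):
        return x @ self.A.T @ self.B.T

class MoEAdapter:
    """General expert always active; task experts weighted by a learned router."""
    def __init__(self, d, r, n_task_experts):
        self.general = Expert(d, r)
        self.task_experts = [Expert(d, r) for _ in range(n_task_experts)]
        self.router = rng.normal(scale=0.02, size=(d, n_task_experts))
    def __call__(self, x):
        gates = softmax(x @ self.router)          # (batch, n_task_experts)
        out = self.general(x)                     # shared-knowledge path
        for i, e in enumerate(self.task_experts):
            out += gates[:, i:i + 1] * e(x)       # dynamic routing of task experts
        return out

x = rng.normal(size=(4, 64))
moe = MoEAdapter(d=64, r=16, n_task_experts=3)
y = moe(x)
```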
As shown in
Figure 9, SLoRA combines orthogonal-constraint optimization with a reshaped MoE architecture (general-domain expert modules, task-specific expert units, and a dynamic routing mechanism). On one hand, this addresses the difficulties that traditional LoRA and AdaLoRA face in multitask environments: weak knowledge retention, insufficient task adaptation, and poor cross-modal fusion. On the other hand, the orthogonal constraints reduce interference with existing knowledge, and dynamic routing improves task adaptability. SLoRA thus achieves significant gains on three key dimensions: knowledge retention (task switching loss of only 2.3%, 6.4 percentage points lower than LoRA), multitask adaptation (an indicator value of 2.3, far better than LoRA's corresponding 5.1), and cross-modal reasoning (accuracy of 82.4%, 8.8 percentage points higher than LoRA), demonstrating its overall advantage in multitask application scenarios.
As shown in
Table 7, SLoRA's advantages are quantified along three dimensions: knowledge retention, multitask adaptability, and cross-modal inference. Its task switching loss is only 2.3%, 6.4 percentage points lower than LoRA, and its cross-modal reasoning accuracy reaches 82.4%, 8.8 percentage points higher than LoRA, reflecting the synergy of the two core components. SLoRA is also notably stable and general: it performs consistently across different datasets and tasks, adapting to different data distributions and task requirements. For example, across subsets of the commonsense reasoning datasets and the IconQA multimodal dataset, SLoRA's accuracy and F1 scores fluctuate less than those of the other methods, demonstrating strong generalization: knowledge and skills learned on one task or dataset transfer effectively to related ones, improving the model's practicality and reliability.
4.4. Ablation Study
A detailed ablation study of the components of the SLoRA architecture reveals a consistent pattern in performance across rank settings. At rank 16, SLoRA achieves a good performance balance at low computational cost. As the rank increases, the model's expressive capacity improves, but computational cost rises correspondingly and the performance gains diminish: raising the rank from 16 to 32 increases the IconQA average score from 80 to 83 points (a three-point gain), while raising it from 32 to 64 increases the score only from 83 to 85 points (a two-point gain). In practice, rank 16 therefore offers the best trade-off between performance and efficiency; the performance and computational cost of SLoRA under each rank setting are compared in the following table.
As shown in
Figure 10, SLoRA's quantified performance-efficiency trade-off across rank settings addresses the twin risks of insufficient expressive capacity at low ranks and redundant computational cost at high ranks, and identifies the optimal configuration through direct comparison. At r = 16, SLoRA achieves the best balance between performance and efficiency (IconQA average score of 80 points at a relative computational cost of 1.0); at r = 32 and r = 64, performance improves only slightly (83 and 85 points, respectively) while computational cost rises sharply, to 1.8 and 3.2 times the baseline. This confirms the efficiency and practicality of low-rank settings in multitask scenarios.
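For intuition on why cost grows with rank, the number of trainable adapter parameters scales linearly in r, as the sketch below shows for an assumed model shape (the configuration is illustrative; the wall-clock costs reported above also include routing and attention overhead, so they grow faster than the raw parameter count):

```python
def lora_param_count(d_model, n_layers, n_adapted_mats, r):
    """Each adapted weight matrix gains two factors: A (r x d_model) and B (d_model x r)."""
    return n_layers * n_adapted_mats * 2 * d_model * r

# illustrative configuration: 24 layers, 4 adapted projections per layer, d_model = 2048
counts = {r: lora_param_count(2048, 24, 4, r) for r in (16, 32, 64)}
# parameter count doubles with each doubling of r
```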
As shown in
Table 8, the computational cost at r = 16 is only 31.25% of that at r = 64. Experimental comparison also verifies the key roles of the constrained initialization strategy and the MoE architecture in SLoRA's performance. Without the constrained initialization strategy, SLoRA suffers significant catastrophic forgetting in multitask learning: training on new tasks sharply degrades performance on previous ones. In joint training on commonsense reasoning and image classification tasks, removing the constrained initialization caused the accuracy of the commonsense reasoning task to drop from 85% to 75%. More broadly, removing either the constrained initialization strategy or the MoE architecture significantly degrades SLoRA's performance, as shown in the table below.
According to
Figure 11 and
Table 9, when the MoE architecture is removed and only the traditional LoRA structure is retained, SLoRA's adaptability in multitask scenarios drops markedly: correlations between tasks can no longer be exploited, and performance on every task falls below that of the complete architecture. When handling natural language processing and computer vision tasks simultaneously, SLoRA without the MoE architecture sees its F1 score on the NLP task drop from 70% to 65% and its accuracy on the CV task drop from 80% to 75%. These results confirm that the constrained initialization strategy and the MoE architecture are the key factors behind SLoRA's strong multitask performance, jointly improving the model's adaptability, stability, and generalization.
4.5. Detailed Statistical Analysis
As shown in
Table 10, SLoRA outperforms LoRA and AdaLoRA on the core metrics (accuracy, F1 score, and BLEU) in the commonsense reasoning domain (WSC and CommonsenseQA), the multimodal domain (IconQA), and mixed-task settings. For example, WSC accuracy improves by 9.0% and the IconQA average score by 13.3–15.3%. The standard deviation of SLoRA's performance is also smaller (in the range 0.02–0.8), far below that of the comparison methods, highlighting its stronger multitask adaptability and stability.
As shown in
Table 11, SLoRA's trainable parameters account for only 0.8% of the 1-billion- and 10-billion-parameter models. Its training time per round (3.2 h / 12.5 h), GPU requirements (8 / 16 A100s), and storage requirements (20 GB / 80 GB) are all far below those of full fine-tuning, with computational cost only about 2.5% of the full fine-tuning approach. Meanwhile, its efficiency is close to that of LoRA and AdaLoRA, achieving an effective balance between performance and computational cost.
As shown in
Table 12, SLoRA retains 91.2–94.8% of old-task accuracy across different task sequence configurations (covering NLP, CV, and cross-modal categories), with task switching losses of only 2.3–4.7%. Its parameter perturbation amplitude (±0.03 to ±0.05) and degree of feature-space conflict (0.12–0.16) are significantly lower than those of LoRA and AdaLoRA, effectively suppressing catastrophic forgetting in multitask learning and enhancing cross-task knowledge transfer.
Through
Table 13, we examine SLoRA's two core performance components: the constrained initialization strategy and the optimized MoE architecture. Removing the constrained initialization reduced commonsense reasoning accuracy by 10%, dropped the knowledge retention rate to 75.6%, and degraded overall performance by 11.8%; removing the MoE architecture reduced the multitask F1 score to 65%, a 9.1% performance degradation; removing either the cross-expert attention mechanism or the dynamic routing also lowered performance. These results confirm the key roles of the two core components and their submodules in alleviating forgetting and improving multitask adaptation.
Analysis of
Table 14 shows that SLoRA achieves its best performance-efficiency balance at rank r = 16 (IconQA average score of 80 points, relative computational cost of 1.0, and 1.9 experts activated per task on average). As the rank increases (to 32, 64, and 128), the performance gains narrow (only 3–6 additional points) while computational cost grows rapidly (to 1.8–6.7 times the baseline), demonstrating the efficiency and practical value of low-rank settings in multitask environments.
5. Conclusions
This study presented an in-depth analysis of SLoRA, an innovative architecture with unique advantages for parameter-efficient fine-tuning in multitask scenarios. It rests on two core components: a constrained solution-space initialization (based on orthogonal-constraint optimization, which reduces disturbance to existing knowledge during multitask learning, alleviates catastrophic forgetting, and provides a stable foundation for knowledge transfer between tasks) and an optimized MoE structure (introducing general and task-specific experts with a dynamic routing mechanism, which significantly improves multitask adaptability, promotes information sharing among experts, and resolves knowledge fragmentation), effectively overcoming the limitations of traditional methods. Experimentally, SLoRA performed well on both commonsense reasoning tasks and the IconQA multimodal dataset: it surpassed LoRA, AdaLoRA, and other methods in commonsense reasoning, demonstrating stronger semantic understanding and knowledge reasoning, and achieved the best results on IconQA at every LoRA rank setting, with average scores significantly above the baselines and stronger stability and generalization. The ablation study further confirmed that SLoRA strikes a good balance between performance and efficiency at rank 16, and that the constrained initialization strategy and MoE architecture are critical to its performance gains.
The specific quantitative results are as follows. ① Performance: across the 10 multitask combinations, average accuracy reaches 85.6%, a 7.8% improvement over LoRA; WSC accuracy is 85% (9.0% higher than LoRA); the CommonsenseQA F1 score is 70% (7.7% higher than LoRA); and at r = 32 the IconQA average score is 83 points (15.3% higher than LoRA). ② Forgetting mitigation: the accuracy retention rate on old tasks reaches 92.4%, and task switching loss is only 2.3%, 9.2 percentage points lower than MoE-LoRA. ③ Efficiency: only 0.8% of parameters are trainable, and a single training round takes 3.2 h (on 8 × A100), 93.3% shorter than full fine-tuning, with computational cost reduced by 99.2%. ④ Cross-modal ability: cross-modal reasoning accuracy reaches 82.4%, 8.8 percentage points higher than LoRA, and the BLEU score on the Image Description Generation task is 27.8%, approaching the level of full fine-tuning (28.5%).