RL–Fusion: The Large Language Model Fusion Method Based on Reinforcement Learning for Task Enhancing

Wang, Zijian; Li, Jiayong; Liu, Yu; Li, Xuhang; Yan, Cairong; Zhang, Yanting

doi:10.3390/app15042186

Open AccessArticle

RL–Fusion: The Large Language Model Fusion Method Based on Reinforcement Learning for Task Enhancing

by

Zijian Wang

^*

,

Jiayong Li

,

Yu Liu

,

Xuhang Li

,

Cairong Yan

and

Yanting Zhang

School of Computer Science and Technology, Donghua University, Shanghai 201620, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(4), 2186; https://doi.org/10.3390/app15042186

Submission received: 13 January 2025 / Revised: 14 February 2025 / Accepted: 17 February 2025 / Published: 18 February 2025

Download

Browse Figures

Versions Notes

Abstract

Model fusion is a technique of growing interest in the field of machine learning, which constructs a generalized model by merging the parameters of multiple independent models with different capabilities without the need to access the original training data or perform costly computations. However, during model fusion, when the number of parameters in a large language model is high, the dimension of the parameter space increases, which makes it more challenging to find the optimal combination of weights. Meanwhile, there is considerable potential for further development in sustainable optimization schemes for task-specific performance enhancement through model fusion in this area. In this paper, we propose a large-scale language model fusion approach based on task-enhanced reinforcement learning (RL–Fusion) to efficiently explore and optimize model fusion configurations. The key innovation of RL–Fusion lies in its use of reinforcement learning to guide parameter selection during model fusion, enabling a more intelligent and adaptive exploration of the parameter space. Additionally, RL–Fusion introduces a dynamic evaluation mechanism that adjusts the evaluation dataset in real-time based on feedback from SOTA models, ensuring continuous enhancement of domain-specific capabilities. RL–Fusion outperforms the baseline model by improving 1.75% in the MMLU benchmark test, 1.8% in the C-eval test, and 1.8% in the Chinese Named Entity Recognition (NER) test on the Yayi NER dataset by 16%. The results show that RL–Fusion is an effective and scalable model fusion solution that improves performance without increasing the computational cost of traditional optimization methods and has a wide range of applications in AI research and practice.

Keywords:

large language models (LLM); task enhancing; reinforcement learning (RL); model fusion

1. Introduction

Large Language Models (LLMs) are advanced AI systems, containing billions or even trillions of parameters, that leverage deep learning and Natural Language Processing (NLP) techniques to understand and generate natural language text [1]. Trained on extensive textual datasets, these models capture intricate language structures and semantic relationships, enabling a wide range of tasks such as language comprehension, text generation, knowledge-based question answering, code generation, and conversational systems [2,3]. Recent advancements in models such as GPT [4] and the LLaMA series [5] have resulted in significant performance improvements, while also addressing challenges related to security, efficiency, and interpretability. A key optimization strategy is model fusion, which combines parameters from multiple independent models to form a more versatile, general-purpose model. This approach, akin to ensemble learning, allows for knowledge transfer without requiring access to the original training data or incurring substantial computational costs. Model fusion enhances both accuracy and robustness, thereby expanding the applicability of LLMs to complex tasks.

Model fusion techniques are broadly categorized into pre-fusion and in-fusion methods [6]. Pre-fusion methods, such as Linearization Fine-tuning [7], improve model weight decoupling in the tangent space of pre-trained models. Architecture Transformation [6] standardizes models with varying architectures into a unified structure, enabling parameter-level fusion. Techniques such as GAN Cocktail [8] map models into a shared parameter space, followed by averaging and fine-tuning for domain-specific applications. Nguyen et al. [9] fused heterogeneous neural networks of varying depths using cross-layer alignment, without increasing network size or requiring access to original training data. In contrast, in-fusion methods [6] dynamically merge models during the fusion process, providing greater flexibility and adaptability, particularly for varying tasks or data samples. These methods adjust weight importance during fusion, allowing more precise control over the merging process compared to basic averaging or linear combinations. Akiba et al. [10] introduced an evolutionary algorithm to identify optimal merging combinations for various open-source models, producing new foundational models with specific capabilities. Liu et al. [11] employed Bayesian optimization to determine the optimal merging weights during pretraining, aiming to reduce resource consumption while improving performance. Zhou et al. [12] conceptualized model merging as a multi-task learning problem, deriving optimal merging coefficients through local linearization and task vector orthogonality. Subspace-based Merging Methods [6] mitigate task interference by transforming models into sparse subspaces. Yadav et al. [13] addressed parameter interference by pruning fine-grained parameters, resolving symbol conflicts, and merging consistent parameters, thus enhancing fusion efficiency. Despite their advantages, in-fusion methods continue to face challenges in cross-domain applications and in determining the optimal fusion model weights and parameters, especially for complex or heterogeneous model structures.

An important avenue in model fusion research involves enhancing domain-specific capabilities during the fusion process. This approach emphasizes tailoring models to optimize performance in specialized domains, as different tasks may benefit from distinct parameters. For instance, Aiello et al. [14] introduced the Joint Autoregressive Mixture (JAM) framework, which fuses large autoregressive text-to-image and language models into a unified system capable of generating high-quality multimodal outputs. Chen et al. [15] presented a method for fusing multimodal LLMs spanning image, audio, video, and point cloud modalities, reducing cross-modal interference through parameter decoupling and adjustment of modal fusion coefficients. Sung et al. [16] proposed guiding principles for model merging, emphasizing the importance of using models with the same pre-training starting point, preferring simpler models with consistent performance, and favoring full-model merging over partial-layer merging. Shukor et al. [17] developed a unified architecture capable of processing four modalities—image, video, audio, and speech—by converting tasks into a “sequence-to-sequence” format, enabling the use of a unified feature extractor and classifier across all modalities. While these approaches show promise, one of the key challenges in model fusion remains to select the appropriate parameters during the fusion process, as different parameter configurations may be advantageous for different tasks. As a result, developing a sustainable framework for advancing model fusion in specific domains is still an ongoing area of research.

This paper explores the challenges of searching a high-dimensional parameter space during model fusion, as well as task capability enhancement issues. A large language model fusion method based on Reinforcement Learning [18] for task enhancing (RL–Fusion) is proposed. The core innovation of RL–Fusion lies in its use of reinforcement learning to dynamically guide the selection of fusion parameters, enabling an intelligent and adaptive exploration of the parameter space to identify optimal combinations. Additionally, RL–Fusion introduces a novel dynamic evaluation mechanism that adjusts the evaluation dataset in real time based on feedback from large language model evaluations, ensuring continuous refinement and enhancement of task-specific performance. This method identifies the optimal combination of fusion parameters to maximize model performance, particularly in enhancing performance for specific tasks, while preserving the model’s original capabilities. Performance evaluation metrics on datasets such as MMLU [19], C-eval [20], and Yayi [21] demonstrate superior results compared to the base models.

The main contributions of this work are outlined as follows:

Reinforcement learning-based model fusion is employed to optimize parameter selection during fusion by rewarding the results, which allows for effective exploration of optimal parameter combinations and enhances model performance.
The evaluation feedback module of the LLM provides real-time dynamic feedback, adjusting the evaluation dataset according to the reinforcement learning results, thereby effectively enhancing the fusion model’s performance.
Effectiveness testing on multiple datasets demonstrates that the method provides modeling capabilities for task enhancement.

The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 introduces the main steps of RL–Fusion. Section 4 presents the evaluation results of the fused model on multiple datasets. Section 5 concludes the paper and outlines directions for future research.

2. Related Works

The large language model fusion method based on Reinforcement Learning [18] for task enhancing (RL–Fusion) involves weight alignment in the model fusion method and task enhancing for Model fusion.

2.1. Weight Alignment

With the rise of large language models (LLMs), model fusion techniques have garnered significant attention as a means of enhancing model performance. Early works by Goddard et al. [22] and Tang et al. [23] aimed to simplify the fusion process and establish experimental benchmarks; however, their research did not encompass the full spectrum of fusion techniques. As the field advanced, various methods were proposed to mitigate task conflicts and interference when merging models.

Singh et al. [24] performed soft alignment of neurons using optimal transport, while Tatro et al. [25] introduced a heuristic algorithm for approximating neuron alignment. Xu et al. [26] proposed a method for model alignment that operates concurrently in both the weight space and the activation space, reducing interference and conflicts during fusion. Jordan et al. [27] addressed variance collapse after interpolation in neural networks by proposing REPAIR, which readjusts pre-activation values in the interpolated network, significantly improving performance. Crisostomi et al. [28] globally optimized neuron arrangement across all layers, ensuring cyclic consistency during merging, though this method may require significant computational resources. Basic merging methods, such as parameter-weighted averages or task arithmetic operations, perform model merging using simple approaches; however, they may fail to fully capture the complex relationships between different models, resulting in poor performance, particularly when significant differences between tasks exist [6]. Weighted merging methods combine models based on their importance weights. Akiba et al. [10] employed evolutionary algorithms to efficiently search for merging coefficients, identifying optimal weights for complex fusion scenarios. Liu et al. [11] used Bayesian optimization to determine merging coefficients that minimize model entropy on unlabeled test data, suitable for large-scale datasets but computationally intensive. Zhou et al. [12] demonstrated that linear connections (LMC) exist between parameter spaces and feature maps, though finding optimal merging coefficients can be costly with many models.

Despite their advantages, these fusion methods continue to face challenges in cross-domain applications and in selecting optimal fusion model weights and parameters, particularly when dealing with complex or heterogeneous model structures. To address this issue, this paper proposes using reinforcement learning, in conjunction with large language model evaluation, to guide parameter selection during model fusion.

2.2. Task Enhancing for Model Fusion

A recent focus of model fusion research is hybrid domain fusion, which seeks to enhance domain-specific capabilities, such as generating harmless responses or facilitating engaging interactions [29]. This is particularly relevant in applications that require fine-grained behavior. Aiello et al. [14] introduced a joint autoregressive hybrid framework that fuses text and image generation models, enhancing multimodal capabilities through data-efficient instruction-tuning strategies. Chen et al. [15] optimized model performance through parameter decoupling and tuning, achieving multimodal scaling without requiring additional training. Sung et al. [16] explored merging different modal models using linear interpolation and task arithmetic to improve multitasking performance; however, they did not fully address parameter interference between modes. Shukor et al. [17] proposed a unified model capable of handling image, video, audio, and language tasks, demonstrating robust performance for multimodal fusion with small-scale parameters.

These approaches have enhanced model capabilities in specific domains through various fusion methods; however, a sustainable framework has yet to be proposed to address the impact of varying parameters on performance across different tasks during the fusion process. In this paper, we dynamically adjust the evaluation dataset based on the results of a large language model evaluation to improve the model’s performance on the named entity recognition task.

3. Method

The model fusion module employs TIES [13] and Slerp [30] to efficiently combine multiple base language models, dynamically adjusting the fusion parameters through reinforcement learning based on the characteristics of the base models. The SOTA model evaluation module assesses the performance of the fusion model on a test dataset using a Large Language Model (LLM) and provides accuracy scores as key performance indicators. Reinforcement learning optimizes the fusion parameters by balancing exploration with exploitation. Based on the current model configuration and the performance feedback obtained from the SOTA model evaluation, reinforcement learning tunes the parameters to gradually approach the optimal fusion configuration, thereby enhancing performance. The new fusion parameters resulting from the reinforcement learning adjustments are iteratively applied to the model fusion process to obtain the optimal fusion configuration. The overall architecture of the framework is presented in Figure 1.

3.1. Model Fusion

In this paper, RL–Fusion employs two model fusion methods (TIES [13] and Slerp [30]) as the basic fusion methods improved by reinforcement learning to optimize the fusion parameters for model fusion. Unlike traditional SLERP, which is limited to interpolating between two models and requires predefined interpolation factors, RL–Fusion dynamically adjusts these factors using reinforcement learning to achieve optimal fusion configurations. Similarly, while TIES focuses on resolving parameter redundancy and sign conflicts through fixed density and weight parameters, RL–Fusion enhances TIES by leveraging real-time feedback to refine density and weight settings, ensuring better alignment with task-specific objectives. By integrating reinforcement learning, RL–Fusion not only overcomes the limitations of these methods but also introduces a dynamic evaluation mechanism that continuously adapts the fusion process based on performance feedback. This approach ensures a more intelligent, adaptive, and efficient model fusion process compared to static methods like SLERP and TIES.

3.1.1. Slerp

Spherical Linear Interpolation (Slerp) [30] is a method for smoothly interpolating between two vectors, maintaining a constant rate of change and preserving the geometric properties of spherical space. First, the input vectors are normalized to unit length, ensuring they represent direction rather than magnitude. Next, the angle is computed using the dot product. When the vectors are nearly parallel, linear interpolation is applied for efficiency; when the angle is large, SLERP is used, with scale factors computed based on the interpolation factor t and the angle. These factors weight the original vectors and are then summed to obtain the interpolated vector. Slerp can be described as follows:

S l e r p (p, q, t) = \frac{\sin [(1 - t) θ] \cdot p + \sin (T θ) \cdot q}{\sin θ},

(1)

p, q are the two vectors to be interpolated; T is the interpolation factor, a value between 0 and 1 that specifies the degree of interpolation. Reinforcement learning dynamically adjusts T based on the results of the model evaluation;

θ

is the angle between the two vectors. SLERP is applied to each layer of the model to achieve overall model fusion. Reinforcement learning is used to obtain a new fusion model by adjusting the T.

3.1.2. TIES

TIES [13] aims to efficiently merge multiple task-specific models into a single multi-task model and optimizes model fusion in three steps. First, the Trimming (Trim) phase retains the top k% of parameters with large variations in each task’s parameters, while the remaining parameters are either reset to 0 or restored to the pre-trained model values to reduce interference from redundant parameters on model performance. Next, in the Elect phase, an election symbol vector is created for parameters with opposite signs across different models, and the final sign of each parameter is determined by summing the parameters across models. Finally, the Disjoint Merge phase is performed to compute the average of model parameters with the same final sign, thereby avoiding interference caused by sign inconsistency.

3.2. SOTA LLM Evaluation

In this section, we detail the LLM-based model evaluation method (the model used in this experiment is GPT-4o), which is specifically designed for evaluating model performance in the Named Entity Recognition (NER) task. The evaluation process follows a comprehensive framework that considers several key factors. The overall architecture of the SOTA LLM evaluation is shown in Figure 2.

Generate Evaluation Prompt

The LLM is employed to generate the prompts required for the assessment. The prompts will direct the model to combine the prediction results from the model

M_{o}

and the dataset

D_{o}

to calculate the contextual relevance (CR) by evaluating whether the NER results of model and dataset align and whether model and dataset are linguistically congruent. The corresponding weight parameters,

α

and

β

, are also set for both aspects in the prompt.

2.: Contextual relevance

Calculate the contextual relevance (CR) between the model output and the ground truth output. This step involves evaluating two dimensions: entity matching and linguistic consistency.

M a t c h (M_{o}, D_{o}) = \frac{|\{o_{i} \in M_{o} | \exists g_{i} \in D_{o}, o_{i} = g_{i}\}|}{|M_{o}|} .

(2)

The Match function measures how accurately the model identifies entities compared to the ground truth, based on exact type and content matching.

M_{o}

is the model prediction result,

D_{o}

is the true value of the dataset, and

o_{i}

is the type and value corresponding to each entity in the set of entities identified by the model.

g_{i}

is the type and value corresponding to each entity in the set of entities in the true value of the dataset.

L a n g u a g e (M_{o}, D_{o}) = \frac{V_{M} \cdot V_{D}}{‖V_{M}‖ ‖V_{D}‖} .

(3)

The Language function measures the linguistic consistency between the model’s output and the dataset, using a vector-based similarity measure (cosine similarity).

V_{M}

is the linguistic feature vector of the model output

M_{o}

, and

V_{D}

is the linguistic feature vector of the dataset result

D_{o}

;

‖V_{M}‖ ‖V_{D}‖

is the modulus of the vector, respectively.

C R (M_{o}, D_{o}) = α \cdot M a t c h (M_{o}, D_{o}) + β \cdot L a n g u a g e (M_{o}, D_{o}) .

(4)

α

,

β

are weight coefficients that balance the importance of entity matching and language consistency in the final assessment. These coefficients can be adjusted based on specific task requirements to reflect the relative influence of each factor.

3.: Performance Evaluation

n denotes the number of evaluation datasets. If we have n samples with scores

C R

, Then the average score

E (P)

can be expressed as:

E (P) = \frac{1}{n} \sum C R (M_{o}, D_{o}),

(5)

E (P)

is the overall performance evaluation score of the model.

3.3. Reinforcement Learning to Optimize Fusion Parameters

3.3.1. State Representation

Reinforcement learning-based strategies can effectively optimize parameter configurations during model fusion. This feedback mechanism allows RL–Fusion to adjust the strategy in real time based on environmental feedback, thus continuously improving the performance and efficiency of the fused model in a given task. The state

s_{t}

consists of model parameters

P_{t}

, performance metrics

M_{t}

, historical feedback

H_{t}

, and fusion history

F_{t}

.

P_{t}

denotes the model parameters at time t, including the weights of the base model and the fusion method, it can be described as follows:

P_{t} = [w_{t}, f_{t}] .

(6)

w_{t}

is the weight value used for model fusion at the current moment.

f_{t}

indicates the fusion method at the current moment.

M_{t}

denotes the performance metric at time t, generated by the SOTA LLM evaluation module (calculated from Equation (5)) to provide real-time feedback;

H_{t}

records the parameters and their corresponding rewards for each past model configuration. It can be described as follows:

H_{t} = \{P_{t}, r_{t}\} .

(7)

P_{t}

denotes the model parameters at time t (calculated from Equation (6)), and

r_{t}

denotes the reward value obtained after the fusion of the current parameter (calculated from Equation (7)).

F_{t}

records the evolution of the model’s parameter configurations, helping the intelligentsia to decide whether to continue to utilize previously successful configurations or to explore new ones. It can be described as follows:

F_{t} = \{P_{1}, P_{2}, \dots P_{t}\} .

(8)

The state

s_{t}

can be expressed by the following Equation:

s_{t} = [P_{t}, M_{t}, H_{t}, F_{t}] .

(9)

3.3.2. Action Definition

The action space comprises operations that directly impact the model fusion process. These actions encompass adjusting the weights of the base models to control their contribution to the final output, selecting fusion methods (TIES [13] or Slerp [30]) for model combination, and optimizing hyperparameters, such as learning rates, to refine the fusion process. Additionally, the agent balances exploration (searching for new configurations) with exploitation (optimizing existing configurations) to enhance the fusion results.

An action represents the various model fusion parameter tuning strategies. The following are the available actions and their corresponding instructions:

3.3.3. Q-Learning in RL–Fusion

Q-Learning computes the reward value by comparing the new result

E (P)

, from the SOTA LLM evaluation (calculated using Equation (5)), with the previous result stored in the Q-table (calculated using Equation (10)). Based on this reward, the algorithm proceeds to action selection and subsequently updates the Q-value accordingly. The main steps of Q-learning in RL–Fusion are shown in Figure 3.

Reward Feedback

The reward function offers feedback to the agent, directing its decision-making process. A positive reward is assigned when a new model configuration improves performance (accuracy increase), with the reward magnitude proportional to the improvement (+10 for a 10% increase in accuracy). Conversely, a negative reward is applied when performance declines (accuracy decrease), penalizing the agent for suboptimal choices (−10 for a 10% reduction in accuracy). The reward function can be formally described as follows:

r_{t} = θ \cdot ∆ P \cdot 1_{∆ P},

(10)

where

r_{t}

is the reward received after the t-th decision;

θ

is a scaling factor for the reward, which can be set to 10 to reflect a 10% accuracy gain;

∆ P

is the performance change

E (P)

new and

E (P)

old (calculated using Equation (5));

1_{∆ P}

is the indicator function when performance improves which is 1 if

∆ P

> 0 and −1 otherwise.

The reward function directs the agent’s model configuration decisions by assigning positive or negative rewards based on performance improvement or degradation. The reward magnitude correlates with the change in performance, and a model comparison-based mechanism ensures that the agent is optimizing in the correct direction.

2.: active selection

By selecting different actions (Table 1), the agent progressively approaches the optimal fusion configuration by choosing under-explored actions in conjunction with the exploration factor, based on the updated Q-value optimization parameters. The specific action selection process is described in Equation (11):

a c t i o n = \{\begin{matrix} random choice with probability ε \\ {a r g m a x}_{a} Q (state, a) with probability 1 - ε \end{matrix} .

(11)

Through

ε - g r e e d y

policy, the agent always selects the action with the highest current Q value, aiming to maximize the expected reward. With probability

ε

, a random action is selected, and with probability 1 −

ε

(Determination based on

r_{t}

Equation (10)), the action with the highest Q value is chosen. This method balances exploration and exploitation, allowing the agent to try different actions during the learning process and avoid getting stuck in local optima. The algorithm will, with a certain probability, select parameter configurations that have performed well (greedy selection), and with a certain probability, randomly explore new parameter configurations to prevent getting stuck in local optima.

3.: Q-value Update

At each time step, the agent selects an action based on the current state of the model. The Q-function is subsequently updated using the current Q-value and the observed reward, refining the estimates of different parameter configurations. Through repeated iterations, Q-learning progressively optimizes the base model parameters to maximize long-term rewards, thereby improving the performance of the final fused model. This update is governed by the following Equation (12):

\begin{matrix} Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α [r_{t + 1} + γ m a x Q (s_{t + 1}, a) - Q (s_{t}, a_{t})], \end{matrix}

(12)

where

Q (s_{t}, a_{t})

represents the Q value of the at taking the action in state st; α is the learning rate and controls the magnitude of the Q value update;

r_{t + 1}

is the immediate reward after executing the action in the current state;

γ

is the discount factor indicating the effect of the future reward on the current decision;

s_{t + 1}

is the new state after the action is executed; max

Q (s_{t + 1}, a)

represents the maximum Q value of all possible actions in the new state.

As the Q-values are updated, the agent can progressively optimize the model fusion performance by continuously adjusting the model parameters (model weights, fusion methods, etc.), ultimately converging to the optimal configuration.

3.4. Algorithmic Process

The algorithm optimizes the model fusion process using Q-learning. The base models, parameters, and Q-table are initialized. In each iteration, the models are fused using TIES/Slerp, evaluated using a SOTA LLM module, and rewards are calculated. Actions are selected, and Q-values are updated to refine the parameters. The best-performing model is tracked and returned after all iterations are completed. The flow of the algorithm is illustrated in Algorithm 1.

Algorithm 1: Model Fusion and Optimization with Q-Learning

Input:
Base models BaseModels = {

M_{1}, M_{2}, \dots, M_{n}

}
Number of epochs for training: epochs
Output: Best_Model and Best_Performance
1: Initialize model parameters and metrics

E (P)

(The base model before fusion was obtained using SOTA LLM evaluation).
2: Initialize Q-table Q for reinforcement learning.
3: Set Best_Model =

E (P)

, Best_Performance = −∞

Training and Optimization:
4: for episode = 1 to epochs:
5: Fuse BaseModels using TIES/Slerp method, adjust model weights.
6: Evaluate the fusion model using SOTA LLM evaluation module for current parameter configuration

E (P)

Calculated from Equation (5).
7:        Calculate r using Equation (10)
8:        Select action a using Equation (11)
9:        Update Q-value using Equation (12)
10:       Update Q-value for selected action a in state s.
11:        if

E (P)

> BestPerformance:
12: BestModel = current model configuration
13: BestPerformance =

E (P)

4. Experiments and Results

In order to validate the effectiveness of RL–Fusion and its performance improvement on specific tasks, this study evaluated model fusion through ablation experiments (using reinforcement learning alone, RL–Fusion, and RL–Fusion+Finetune by fine-tuning the pre-fusion pedestal model on named entity recognition data, respectively). Experiments were conducted using the MMLU [19] and C-Eval [20] datasets to validate RL–Fusion’s improvement in cross-linguistic capabilities, and the Yayi [21] dataset was used to test its performance improvement in the named entity recognition task. In addition, to compare the effectiveness of RL–Fusion with the benchmark model, we conducted comparative experiments on the above datasets. To demonstrate the advantages of RL–Fusion in enhancing the capabilities of specific tasks and its efficiency in resource utilization, we also conducted a resource consumption comparison with the traditional fine-tuning approach and a side-by-side comparative analysis with other models with the same amount of parameters on the MMLU and C-Eval datasets.

4.1. Datasets

To validate the effectiveness of the proposed RL–Fusion model fusion method, we conducted performance evaluations on the Yayi [21], MMLU [19], and C-Eval [20] datasets. Importantly, the training set used for model fusion does not overlap with the test sets from these datasets, ensuring an unbiased evaluation. The Yayi dataset is a large-scale corpus containing millions of samples, with 54% in Chinese and 46% in English. It covers data from 12 domains, including finance, society, biology, business, industrial manufacturing, chemistry, vehicles, science, medical diseases, personal life, safety, and general knowledge, spanning a wide range of scenarios. In our experiments, named entity recognition (NER) was specifically chosen as the task to evaluate the model’s performance before and after fusion, as it is a critical area where we aim to achieve performance enhancement through the RL–Fusion approach. To this end, we utilized the NER data from Yayi, which includes 28 Chinese entity types (e.g., person, geopolitics, organization, body parts, drugs) and 130 English entity types (e.g., animal, weapon, conference, book). NER was selected as the key task for performance evaluation in our experiments, and model performance was evaluated using the SOTA LLM evaluation module.

MMLU (Massive Multitask Language Understanding) is a multi-task, multi-language evaluation dataset proposed by OpenAI, designed to assess the performance of large language models across a variety of fields and tasks. This dataset aims to assess the model’s general knowledge and reasoning abilities, encompassing 57 tasks across various disciplines, including history, mathematics, computer science, economics, medicine, and law. The difficulty of the tasks range from high school to undergraduate and graduate levels, with all questions presented in a multiple-choice format with four options. Additionally, it supports multiple languages, enabling the evaluation of the model’s cross-lingual capabilities.

C-Eval is a comprehensive Chinese evaluation suite designed to assess the advanced knowledge and reasoning abilities of base models within the Chinese context. It is the first evaluation suite to comprehensively cover multiple difficulty levels (middle school, high school, university, and professional levels) in Chinese, with multiple-choice questions across 52 distinct subjects, spanning a broad spectrum of fields from the humanities to science and engineering.

4.2. Model Evaluation Metrics

The evaluation metrics for the model primarily focus on classification accuracy across various tasks. The model’s performance on these tasks is evaluated by calculating the average classification accuracy across all examples and tasks. The corresponding performance evaluation metrics include classification accuracy (which quantifies the proportion of correct predictions made by the model across all tasks) and calibration error (which measures the discrepancy between the model’s predicted confidence and the actual accuracy).

The evaluation metrics for the MMLU and C-eval dataset can be calculated using the following steps:

D

is the number of all disciplines; for each discipline

d \in D

,

Q_{d}

is the set of questions for that discipline and

|Q_{d}|

is the number of questions; for each question q

\in Q_{d}

,

A_{q}

is the correct answer and

\bar{A_{q}}

is the result predicted by the model.

For each discipline $d$ , calculate the accuracy for that discipline $A_{d}$

$A_{d} = \frac{1}{|Q_{d}|} \sum_{q \in Q_{d}} δ (\bar{A_{q}} = A_{q}),$

(13)

where $δ (\bar{A_{q}} = A_{q})$ is 1 when $\bar{A_{q}} = A_{q}$ otherwise 0.
Calculate the average accuracy across all disciplines $A$

$A = \frac{1}{|D|} \sum_{d \in D} A_{d} .$

(14)
Calculate weighted average accuracy $A_{w}$ across all disciplines:

$A_{w} = \frac{\sum_{d \in D} |Q_{d}| \cdot A_{d}}{\sum_{d \in D} |Q_{d}|}$

(15)

Evaluation metrics on NER tasks are calculated from the description in Section 3.2.

4.3. Model Parameter Setups

MergeKit [22] is an open-source model fusion toolkit that offers a comprehensive set of APIs and strategies for integrating pretrained language models. The toolkit supports a variety of fusion techniques, including, but not limited to, weighted averaging, voting mechanisms, and advanced parameter optimization strategies. With MergeKit, model fusion can be efficiently performed, and parameters can be automatically adjusted during the fusion process. This paper employs methods integrated into MergeKit, such as Slerp [30] and TIES [13], for model fusion. The GPUs employed in the experiment are two NVIDIA GeForce RTX 4090 units, and the deep learning framework used is PyTorch v2.3.2 [31].

The base model consists of llama-3-chinese-8b-instruct-v3 and llama-3-8b-Instruct with the total number of training rounds set to 50, the initial learning rate of Q-learning is set to 0.01, the discount factor is set to 0.9, the exploration rate is set to 0.8, and the rate of exploration rate is set to decay by 5% every 5 epochs. For the initial fusion method, Slerp [30] was chosen, and the initial interpolation factors were set to [[0, 0.5, 0.3, 0.7, 1], [1, 0.5, 0.7, 0.3, 0]] These factors define the interpolation ratios between the two models at different layers, controlling how much each model contributes to the final fused model at various points in the interpolation process. For example, t = 0.5 means an equal contribution from both models, while t = 0.3 and t = 0.7 represent skewed contributions. Additionally, the initial model weights in the TIES [13] method were set to [0.5, 0.5], which define the contribution of each model to the final merged model. These weights ensure a balanced integration of the models during the fusion process.

4.4. Experiments

In our experiments, we compare the new model obtained from RL–Fusion with the two base models, llama-3-chinese-8b-instruct-v3 and llama-3-8b-Instruct, on the aforementioned datasets.

4.4.1. Ablation Experiment

To validate the effectiveness of the proposed RL–Fusion approach, we conducted a series of ablation experiments to evaluate the contribution of each component to the overall performance by progressively disassembling each part of RL–Fusion. The ablation experiments used a reinforcement learning (RL)-based selection strategy. In addition, two model fusion strategies were implemented: one using only RL for parameter selection (at which point the evaluation dataset was in the top 5% of the Yayi dataset), and the other combining reinforcement learning with dynamic model evaluation feedback (dynamically adjusting the number of evaluation datasets based on NER evaluation results). To further evaluate the effectiveness of these strategies, we fine-tuned the base model on the Chinese Named Entity Recognition (NER) dataset and evaluated its performance on the MMLU and C-Eval datasets (all sub-tasks; results are shown in Figure 4 and Figure 5), as well as on the Yayi NER dataset (results are shown in Figure 6).

In order to validate the effectiveness of the proposed RL–Fusion approach, we conducted a series of ablation experiments to evaluate the contribution of each component to the overall performance by progressively decomposing the parts of RL–Fusion. The ablation experiments used a reinforcement learning (RL)-based selection strategy. In addition, two model fusion strategies were implemented: one using only RL for parameter selection with a fixed evaluation dataset, and the other combining reinforcement learning with dynamic model evaluation feedback to dynamically adjust the evaluation dataset through evaluation feedback. To further evaluate the effect of RL–Fusion on fine-tuning the model, we fine-tuned the pre-fusion base model, llama-3-chinese-8b-instruct-v3, on the Chinese Named Entity Recognition (NER) dataset and performed model fusion with RL–Fusion on both the MMLU and C-Eval datasets (across all subtasks). As an additional baseline, we also compared the performance of RL–Fusion against a randomly parameterized fusion model, where the model parameters were fused without any optimization strategy, serving as a comparison to assess the impact of RL-based adjustments. The results are shown in Figure 4 and Figure 5, and the Yayi NER dataset (results are shown in Figure 6) to evaluate its performance.

The experimental results, presented in Table 2, clearly demonstrate that the RL–Fusion method outperforms other single-parameter selection strategies across all evaluation metrics. Furthermore, after fine-tuning the base model on the Chinese NER data, applying the RL–Fusion fusion strategy led to significant performance improvements, particularly on the Yayi (calculated by Equation (5)) and C-Eval (calculated by Equation (15)) datasets. The relatively lower performance on the MMLU (calculated by Equation (15)) dataset can be attributed to the fine-tuning enhancing the model’s knowledge of NER but also causing the fine-tuned model to underperform on more generalized data. These results underscore the superiority of the RL–Fusion method in optimizing model performance for Named Entity Recognition (NER) tasks.

Based on the experimental results, RL–Fusion demonstrates notable performance improvements across the MMLU, C-Eval, and Yayi (NER) datasets. On the MMLU dataset, RL–Fusion outperforms the baseline by 2.6%, increasing from 68.2% to 70.8%, and also surpasses RL by 0.7%, rising from 70.1% to 70.8%. Compared to RL–Fusion+FineTune, RL–Fusion’s performance is slightly lower by 0.3%. On the C-Eval dataset, RL–Fusion shows a significant improvement, outperforming the baseline by 1.9%, rising from 76.7% to 78.6%, and surpassing RL by 0.7%, from 77.9% to 78.6%. However, RL–Fusion performs 1.6% worse than RL–Fusion+FineTune. On the Yayi (NER) dataset, RL–Fusion shows a marginal improvement of 0.1 over RL, increasing from 2.4 to 2.5, and outperforms the baseline by 0.4, rising from 2.1 to 2.5. However, compared to RL–Fusion+FineTune, RL–Fusion exhibits a 1.3-point decrease, with RL–Fusion+FineTune achieving 3.8.

Through this series of ablation experiments, we not only validate the effectiveness of the RL–Fusion approach but also demonstrate that the combination of RL and SOTA LLM real-time evaluation feedback for parameter selection significantly enhances model performance on Chinese NER tasks. Specifically, on the MMLU dataset, RL–Fusion shows a 0.7% improvement over RL, and on the C-Eval dataset, it improves by 0.7%; although the improvement on the Yayi (NER) dataset is smaller (only 0.1), RL–Fusion still outperforms RL–Fusion+FineTune in comparison. These findings provide compelling evidence for the future application of model fusion strategies in complex NLP tasks.

4.4.2. Comparison with the Original Model

Table 3 presents a comparison of the overall results between the RL–Fusion LLM and baseline models on the MMLU benchmark. To evaluate the model’s language processing capabilities, we selected seven sub-tasks from the MMLU benchmark: Formal Logic, Jurisprudence, International Law, Logical Fallacies, High School Government and Politics, Marketing, and Management. These sub-tasks require a significant level of information extraction and processing ability. The model must extract key information from the given text, understand the context, and conduct logical reasoning and analysis. We observe that the three LLMs exhibit varying performance across the seven MMLU tasks. On average, the RL–Fusion LLM demonstrates a 1.75% relative performance gain over the original best-performing llama-3-8b-Instruct across all seven tasks. In specific tasks, the enhancement achieved by RL–Fusion LLM is substantial, such as an increase from 81.0% to 86.8% on the International Law task. There are two possible reasons for the degradation of the performance of RL–Fusion LLM in tasks such as formal logic and management. First, the poor performance of the base model in these tasks affects the fusion results. Second, continuous pre-training of Chinese instructions and their relevance to downstream tasks on the llama-3-chinese-8b-instruct-v3 model also leads to performance degradation.

Table 4 presents a comparison of the overall results between the RL–Fusion LLM and baseline models on the C-Eval benchmark. We observe that the three LLMs exhibit varying performance across the C-Eval tasks. On average, the RL–Fusion LLM demonstrates a 1.8% relative performance gain over the original best-performing llama-3-chinese-8b-instruct-v3 across all tasks. In specific tasks, the improvement achieved by RL–Fusion LLM is substantial, with an increase from 57.4% to 60.4% on the Average (Hard) task.

Table 5 presents a comparison of the overall results between the RL–Fusion LLM and baseline models on the Named Entity Recognition (NER) data from Yayi. The evaluation results were computed using SOTA LLM evaluation. We observe that the three LLMs exhibit varying performance on the NER task. The RL–Fusion LLM demonstrates a 16% relative performance gain in Chinese NER compared to the original best-performing llama-3-chinese-8b-instruct-v3.

In summary, the results of this study unequivocally validate the effectiveness of the RL–Fusion fusion method. By integrating reinforcement learning with real-time model evaluation feedback to optimize the model fusion process, RL–Fusion not only improves model performance across diverse tasks but also provides a robust framework for enhancing the performance of large language models, particularly for tasks requiring complex reasoning and entity recognition. These findings confirm that RL–Fusion is an effective method for enhancing the capabilities of LLMs, positioning it as a valuable tool for a wide range of natural language processing applications.

4.4.3. Comparison with Fine-Tuned Models on Computational Resources

In order to assess the advantages of the method proposed in this study in terms of resource utilization, an experiment was conducted to compare it with the model fine-tuning method in terms of time, hardware resources, and model generalization capabilities. The results of the experiment are presented in Table 6.

The fine-tuning method uses Lora [32] and the amount of data involved in the experiment is 1000.

The experiment comparing RL–Fusion and fine-tuning highlights RL–Fusion’s advantages in resource efficiency and performance. RL–Fusion completes tasks in just 15 min using only CPU resources, while fine-tuning takes 60 min and requires an RTX4090 GPU. Despite the fine-tuning method being trained on a Chinese dataset, which improves its C-Eval performance (79.2%), RL–Fusion outperforms it on MMLU (70.8% vs. 68.5%) and remains competitive on C-Eval (78.6%). These results demonstrate that RL–Fusion offers a more time- and resource-efficient approach, achieving comparable or even slightly better performance in key benchmarks.

4.4.4. Comparison with the Other Model

We compare several other large models using the same parameters and dataset. To validate the effectiveness of the proposed method, we use the model combining Slerp and TIES as the baseline for comparison. Table 7 presents a comparison of RL–Fusion with Other large language models with a similar number of participants using the MMLU and C-eval datasets.

Table 7 compares RL–Fusion with several large models using the MMLU and C-Eval datasets. RL–Fusion achieves 70.8% on MMLU and 78.6% on C-Eval, outperforming models like Mixtral 8x7B (70.6%, 74.7%) and Qwen 7B (56.7%, 59.6%). Notably, RL–Fusion also surpasses the performance of Slerp (68.7%, 74.1%) and TIES (66.9%, 76.2%), demonstrating the effectiveness of the proposed fusion approach. While it lags behind Llama 3.1-8B (73.0%, 81.2%), it shows significantly better performance than smaller models such as LLaMA 2-7B (45.3%, 75.2%) and GPT-3 6.7B (43.2%, 54.4%). Overall, RL–Fusion demonstrates competitive performance with larger models while being more resource-efficient.

5. Conclusions

This study extensively explores the significance of parameter selection in the model fusion process, with a focus on developing high-performance fusion models. We propose an innovative approach, RL–Fusion, which significantly improves model performance by leveraging reinforcement learning (RL) algorithms in the fusion process, combined with real-time model evaluation feedback. By dynamically adjusting the fusion parameters and evaluating the dataset, RL–Fusion effectively navigates the parameter space to discover optimal combinations. Notably, RL–Fusion enhances model performance in tasks such as Named Entity Recognition (NER), significantly improving accuracy while preserving the base model’s original functionality.

Through a series of well-designed experiments, we validate the significant advantages of the RL–Fusion approach over a single large language model (LLM) and established benchmarks. The experimental results demonstrate that RL–Fusion outperforms traditional methods across several key metrics, owing to its intelligent optimization of parameter selection. Specifically, we employed multiple LLMs with the same structure as the base model in our experiments to evaluate RL–Fusion’s performance under different model configurations. This setup enables a comprehensive evaluation of RL–Fusion’s applicability and effectiveness across various model fusion scenarios.

However, deploying RL–Fusion in real-world applications presents several challenges. First, the computational overhead of reinforcement learning during the fusion process, particularly for high-dimensional parameter spaces, may limit its scalability for extremely large models. Second, the dynamic evaluation mechanism, while effective, requires access to high-quality, domain-specific evaluation datasets, which may not always be readily available. Third, the real-time feedback loop, though innovative, introduces additional latency, which could be a bottleneck in time-sensitive applications. Finally, the generalizability of RL–Fusion across diverse tasks and domains needs further investigation, as its performance may vary depending on the complexity and specificity of the target task.

In summary, RL–Fusion leverages the adaptive learning capabilities of RL to offer a novel and powerful solution in model fusion. Our research not only advances model fusion techniques but also offers new insights into more efficient parameter optimization and improved model performance across a range of AI applications. We anticipate that RL–Fusion will be widely adopted in future research and practice to address more complex model fusion challenges and contribute to the advancement of AI technology. Future work will focus on addressing these deployment challenges to make RL–Fusion more accessible and scalable for real-world use cases.

Author Contributions

Conceptualization, J.L.; methodology, Z.W., J.L. and C.Y.; software, Z.W. and X.L.; validation, Z.W.; formal analysis, Y.L.; investigation, Z.W. and Y.Z.; resources, J.L.; data curation, Z.W., Y.L. and X.L.; writing—original draft preparation, Z.W.; writing—review and editing, J.L., Y.L., C.Y. and Y.Z.; supervision, J.L.; project administration, J.L.; funding acquisition, J.L., C.Y. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Natural Science Foundation of China (No. 62302090, 62477006, 62272097), Shanghai Sailing Program (No. 23YF1401100).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in [MMLU, C-Eval, Yayi] at [https://doi.org/10.48550/arXiv.2009.03300, https://doi.org/10.48550/arXiv.2305.08322, https://doi.org/10.48550/arXiv.2312.14862] (accessed on 16 February 2025).

Acknowledgments

We would like to thank all the authors for their valuable contributions and collaborative efforts in making this work possible.

Conflicts of Interest

The authors declare no conflict of interest.

References

Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2024, arXiv:2303.18223. [Google Scholar] [CrossRef]
Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361. [Google Scholar] [CrossRef]
Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; Casas, D.D.L.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. arXiv 2022, arXiv:2203.15556. [Google Scholar] [CrossRef]
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
Yang, E.; Shen, L.; Guo, G.; Wang, X.; Cao, X.; Zhang, J.; Tao, D. Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities. arXiv 2024, arXiv:2408.07666. [Google Scholar] [CrossRef]
Ortiz-Jimenez, G.; Favero, A.; Frossard, P. Task Arithmetic in the Tangent Space: Improved Editing of Pre-Trained Models. arXiv 2023, arXiv:2305.12827. [Google Scholar] [CrossRef]
Avrahami, O.; Lischinski, D.; Fried, O. Gan Cocktail: Mixing GANs Without Dataset Access; Springer: Berlin/Heidelberg, Germany, 2022; Volume 13683, pp. 205–221. [Google Scholar] [CrossRef]
Nguyen, D.; Nguyen, T.; Nguyen, K.; Phung, D.; Bui, H.; Ho, N. On Cross-Layer Alignment for Model Fusion of Heterogeneous Neural Networks. arXiv 2023, arXiv:2110.15538. [Google Scholar] [CrossRef]
Akiba, T.; Shing, M.; Tang, Y.; Sun, Q.; Ha, D. Evolutionary Optimization of Model Merging Recipes. arXiv 2024, arXiv:2403.13187. [Google Scholar] [CrossRef]
Liu, D.; Wang, Z.; Wang, B.; Chen, W.; Li, C.; Tu, Z.; Chu, D.; Li, B.; Sui, D. Checkpoint Merging via Bayesian Optimization in LLM Pretraining. arXiv 2024, arXiv:2403.19390. [Google Scholar] [CrossRef]
Zhou, Z.; Yang, Y.; Yang, X.; Yan, J.; Hu, W. Going Beyond Linear Mode Connectivity: The Layerwise Linear Feature Connectivity. arXiv 2023, arXiv:2307.08286. [Google Scholar] [CrossRef]
Yadav, P.; Tam, D.; Choshen, L.; Raffel, C.; Bansal, M. TIES-Merging: Resolving Interference When Merging Models. arXiv 2023, arXiv:2306.01708. [Google Scholar] [CrossRef]
Aiello, E.; Yu, L.; Nie, Y.; Aghajanyan, A.; Oguz, B. Jointly Training Large Autoregressive Multimodal Models. arXiv 2023, arXiv:2309.15564. [Google Scholar] [CrossRef]
Chen, C.; Du, Y.; Fang, Z.; Wang, Z.; Luo, F.; Li, P.; Yan, M.; Zhang, J.; Huang, F.; Sun, M.; et al. Model Composition for Multimodal Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; Ku, L.-W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 11246–11262. [Google Scholar] [CrossRef]
Sung, Y.-L.; Li, L.; Lin, K.; Gan, Z.; Bansal, M.; Wang, L. An Empirical Study of Multimodal Model Merging. arXiv 2023, arXiv:2304.14933. [Google Scholar] [CrossRef]
Shukor, M.; Dancette, C.; Rame, A.; Cord, M. UnIVAL: Unified Model for Image, Video, Audio and Language Tasks. arXiv 2023, arXiv:2307.16184. [Google Scholar] [CrossRef]
Minsky, M. Steps toward Artificial Intelligence. In Proceedings of the IRE; IEEE: New York, NY, USA, 1961; Volume 49, pp. 8–30. [Google Scholar] [CrossRef]
Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. arXiv 2021, arXiv:2009.03300. [Google Scholar] [CrossRef]
Huang, Y.; Bai, Y.; Zhu, Z.; Zhang, J.; Zhang, J.; Su, T.; Liu, J.; Lv, C.; Zhang, Y.; Lei, J.; et al. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. arXiv 2023, arXiv:2305.08322. [Google Scholar] [CrossRef]
Luo, Y.; Kong, Q.; Xu, N.; Cao, J.; Hao, B.; Qu, B.; Chen, B.; Zhu, C.; Zhao, C.; Zhang, D.; et al. YAYI 2: Multilingual Open-Source Large Language Models. arXiv 2023, arXiv:2312.14862. [Google Scholar] [CrossRef]
Goddard, C.; Siriwardhana, S.; Ehghaghi, M.; Meyers, L.; Karpukhin, V.; Benedict, B.; McQuade, M.; Solawetz, J. Arcee’s MergeKit: A Toolkit for Merging Large Language Models. arXiv 2024, arXiv:2403.13257. [Google Scholar] [CrossRef]
Tang, A.; Shen, L.; Luo, Y.; Hu, H.; Du, B.; Tao, D. FusionBench: A Comprehensive Benchmark of Deep Model Fusion. arXiv 2024, arXiv:2406.03280. [Google Scholar] [CrossRef]
Singh, S.P.; Jaggi, M. Model Fusion via Optimal Transport. arXiv 2023, arXiv:1910.05653. [Google Scholar] [CrossRef]
Tatro, N.J.; Chen, P.-Y.; Das, P.; Melnyk, I.; Sattigeri, P.; Lai, R. Optimizing Mode Connectivity via Neuron Alignment. arXiv 2020, arXiv:2009.02439. [Google Scholar] [CrossRef]
Xu, Z.; Yuan, K.; Wang, H.; Wang, Y.; Song, M.; Song, J. Training-Free Pretrained Model Merging. arXiv 2024, arXiv:2403.01753. [Google Scholar] [CrossRef]
Jordan, K.; Sedghi, H.; Saukh, O.; Entezari, R.; Neyshabur, B. Repair: Renormalizing Permuted Activations for Interpolation Repair. arXiv 2023, arXiv:2211.08403. [Google Scholar] [CrossRef]
Crisostomi, D.; Fumero, M.; Baieri, D.; Bernard, F.; Rodolà, E. C₂M₃: Cycle-Consistent Multi-Model Merging. arXiv 2024, arXiv:2405.17897. [Google Scholar] [CrossRef]
Ramé, A.; Couairon, G.; Shukor, M.; Dancette, C.; Gaya, J.-B.; Soulier, L.; Cord, M. Rewarded soups: Towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. arXiv 2023, arXiv:2306.04488. [Google Scholar] [CrossRef]
Shoemake, K. Animating rotation with quaternion curves. In SIGGRAPH ’85: Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques; Association for Computing Machinery: New York, NY, USA, 1985; pp. 245–254. [Google Scholar] [CrossRef]
Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A. Automatic differentiation in PyTorch. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021. [Google Scholar] [CrossRef]
Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; Casas, D.D.L.; Hanna, E.B.; Bressand, F.; et al. Mixtral of Experts. arXiv 2024, arXiv:2401.04088. [Google Scholar] [CrossRef]
Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
Peng, B.; Alcaide, E.; Anthony, Q.; Albalak, A.; Arcadinho, S.; Biderman, S.; Cao, H.; Cheng, X.; Chung, M.; Grella, M.; et al. RWKV: Reinventing RNNs for the Transformer Era. arXiv 2023, arXiv:2305.13048. [Google Scholar] [CrossRef]
Almazrouei, E.; Alobeidli, H.; Alshamsi, A.; Cappelli, A.; Cojocaru, R.; Debbah, M.; Goffinet, É.; Hesslow, D.; Launay, J.; Malartic, Q.; et al. The Falcon Series of Open Language Models. arXiv 2023, arXiv:2311.16867. [Google Scholar] [CrossRef]

Figure 1. Overall architecture diagram of the framework. The framework is built around a dynamic closed-loop optimization process, beginning with model fusion, where parameters from the source Large Language Model (LLM) are integrated to create an initial fused model. This fused model is then applied to domain-specific tasks (named entity recognition) and evaluated by a SOTA LLM, which provides performance scores and rankings. The evaluation results are fed into the reinforcement learning optimization module, where fusion parameters are dynamically adjusted. The updated parameters are then fed back into the model fusion phase, creating a continuous iterative loop that progressively enhances model performance. By leveraging dynamic parameter space exploration, real-time feedback-driven optimization, and a scalable closed-loop architecture, the framework significantly improves the model’s adaptability and task performance, while offering key advantages such as high automation, strong domain adaptability, and optimized computational efficiency.

Figure 2. Details of SOTA LLM Evaluation. The framework systematically integrates model output, dataset feedback, and dynamic weighting mechanisms to provide a quantifiable and scientific evaluation path for optimizing the performance of large language models.

Figure 3. Main steps of Q-learning in RL–Fusion.

Figure 4. Results of ablation experiments on MMLU.

Figure 5. Results of ablation experiments on C-eval.

Figure 6. Results of ablation experiments on Yayi.

Table 1. Action description.

Action Category	Action Description	Parameter Range	Specific Adjustment Method
Weight adjustment	Reconfigure the weights of each base model in the fusion model to enhance or attenuate the impact of a particular model.	$w_{i} \in [0, 1], \sum w_{i} = 1$	Increase or reduce the weight of a base model, the step size is $∆ w = 0.05$ , For example, the current weight $w_{1} = 0.5, w_{2} = 0.5$ , the action will adjust $w_{1}$ to 0.55 and $w_{2}$ to 0.45.
Fusion method selection	Different model fusion methods were chosen, Slerp, TIES	{Slerp, TIES}	Choose a new fusion method. For example, switch from the current TIES to Slerp.
Learning rate adjustment	The learning rate during the model fusion is dynamically adjusted to control the pace of parameter updates.	$η \in [10^{- 5}, 10^{- 1}]$	A linear or exponential adjustment based on the current learning rate, for example, from $η$ = 0.01 was increased to $η$ = 0.02 Or reduced to $η$ = 0.005.

Table 2. Results of the ablation experiments with the RL–Fusion method.

Method	MMLU	C-Eval	Yayi(NER)
Baseline	68.2%	76.7%	2.1
RL	70.1%	77.9%	2.4
RL–Fusion	70.8%	78.6%	2.5
RL–Fusion+FineTune	69.5%	80.2%	3.8