1. Introduction
Large language model (LLM) technologies, such as GPT-4 [1] and DeepSeek-R1 [2], have developed rapidly and are being applied across many real-world domains, such as chatbots [3,4], intelligent agents [5,6], personal assistants [7,8], and knowledge-based question answering systems [9,10]. Compared to smaller-scale models, large language models demonstrate impressive emergent capabilities and exhibit superior contextual understanding, reasoning abilities, and generalization performance. These advantages give them unprecedented potential across numerous application domains. Consequently, large language models have gained widespread popularity and significant attention not only in academia but also in industry, where they are increasingly becoming a pivotal force driving innovation and transformation.
These LLM-based technologies are reshaping human–computer interaction and driving industrial intelligent transformation. In healthcare, LLMs can assist in diagnosis and drug discovery; in education, they can provide personalized learning solutions; in finance, they can be applied to intelligent investment advice and risk management. With algorithmic optimization and growing computing power, the inferential capabilities of LLMs are progressively improving, and their application scenarios continue to expand. However, the performance of LLMs depends heavily on the quality of large-scale datasets. High-quality private data, as a valuable resource, plays a significant role in achieving precise understanding and efficient application in specific domains, and it has become critical for optimizing and improving the performance of large language models. In reality, such high-quality private data is typically scattered across different enterprises or institutions, and because of legal restrictions or concerns about privacy leakage, it cannot be shared effectively, resulting in data silos. These data barriers severely hinder the further development of LLMs.
Although federated learning frameworks have made some progress, federated fine-tuning of LLMs remains at an immature stage, and current frameworks still face several limitations. First, some frameworks require manual configuration, code modifications, and manual transfer of models and data; moreover, participants cannot fine-tune LLMs across multiple machines and GPUs during federated training. Second, these frameworks require manual dataset splitting and use the same amount of data in every training round. Third, adjusting model fusion strategies requires manual code modifications, with no support for automated configuration through configuration files. Likewise, adjusting privacy protection strategies requires manual code changes, lacking simple and flexible configuration support.
We propose a novel multi-party collaborative training framework for large language models, named MPCTF, to alleviate the problems mentioned above. MPCTF provides a one-click launch mechanism with multi-node and multi-GPU training capabilities, which significantly simplifies user operations, enhances automation, and thereby further streamlines the multi-party training process. Four data partitioning strategies automatically split the client datasets during training: the fixed-size strategy, the percentage-based strategy, the maximum data volume strategy, and the total data volume and available GPU memory strategy. These strategies make it easy to adjust the amount of training data in each round and to change the dataset from round to round. Multiple model aggregation strategies, namely FedAvg, FedProx, FedAdam, and FedAdagrad, are integrated into MPCTF; these strategies and their relevant parameters can be configured automatically through the configuration file. Moreover, multiple privacy protection strategies, namely the server-side fixed clipping strategy, the server-side adaptive clipping strategy, the client-side fixed clipping strategy, and the client-side adaptive clipping strategy, are integrated into MPCTF to protect data, and they can be configured simply and flexibly within the framework. We performed extensive experiments to demonstrate the effectiveness of the proposed MPCTF, and the experimental results validate that it achieves superior performance.
The main contributions of this work are as follows. (1) We propose the MPCTF framework for multi-party collaborative training of LLMs, which effectively utilizes each participant's local data to obtain a more generalized global model; MPCTF achieved an accuracy of 65.43 in our experiments. Its one-click start mechanism launches training uniformly through a configuration file, and its multi-GPU training mechanism enables multi-machine, multi-card training within each participant, allowing flexible use of different training resources. Existing federated learning frameworks require relatively complex configuration files to start training, and some offer weak support for multi-machine, multi-card training within a participant. (2) We propose four data partitioning strategies to achieve collaborative utilization of data and broaden the ways data can be used; the four strategies achieved accuracies of 65.43, 65.35, 65.43, and 65.28 in the experiments. The data splitting methods of some existing frameworks are simple and lack multiple splitting modes. (3) We integrate multiple aggregation strategies so that the server can aggregate models by various methods; these strategies achieved accuracies of 65.43, 65.43, 63.53, and 65.50 in the experiments. Some current frameworks lack support for multiple aggregation methods. (4) We develop multiple privacy protection strategies that integrate differential privacy to protect data in different ways; these strategies achieved accuracies of 57.70, 57.77, 56.56, and 57.39 in the experiments. The privacy protection methods of some current frameworks are relatively limited, with weak support for multiple data privacy protection methods.
The rest of this paper is structured as follows. Section 2 reviews the relevant literature on federated learning and privacy protection. Section 3 details the methodology of the proposed framework in terms of the one-click launch mechanism with multi-node and multi-GPU training, the data partitioning strategy, the model aggregation strategy, and privacy protection. Section 4 presents the datasets, experimental settings, results, and analysis. Finally, Section 5 concludes the paper.
3. Methodology
In this section, we first provide an overall introduction to the proposed MPCTF. Then, we introduce in detail the key capabilities of the proposed MPCTF, which are the one-click launch mechanism with multi-node and multi-GPU training, data partitioning strategy, model aggregation strategy, and privacy protection.
3.1. Overview
In this section, we provide an overall introduction to our proposed MPCTF. The overall architecture is illustrated in Figure 1. Multi-party collaborative training of LLMs focuses on multi-party training and data protection; it not only makes the entire LLM development pipeline highly available but also further unlocks the value of private data. The proposed MPCTF enables each participant to train a local model on its local dataset, so that training can be conducted without the training data ever leaving the participant's organization. During the training process, the server receives the partial model parameters sent by each participant and performs model aggregation, yielding a global model with stronger generalization.
As illustrated on the left side of Figure 1, the multi-party collaborative training of LLMs proceeds as follows. First, once all participants and the server have joined the training framework and the preparatory work is complete, the server initiates training and the training process begins. Second, each participant trains the local large language model on its high-quality private data using local GPU computing resources. To save bandwidth and improve communication efficiency between each participant and the server, MPCTF adopts the parameter-efficient fine-tuning method LoRA [20] for training LLMs. Once the current round of training is completed, each participant sends the trained model's partial parameter weights for that round to the server. Third, when the server has received the partial parameter weights from all participants, it performs global aggregation to obtain the global model parameters, and the global model on the server is updated accordingly. Fourth, the server sends the updated model's partial parameter weights back to each participant; each participant updates its local client-side model and then begins a new round of training. Finally, the framework repeats this process until the number of training rounds reaches the predefined value, after which the final trained global model is obtained.
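The round structure described above can be sketched as follows. Note that all class and method names here (init_lora_weights, local_lora_finetune, aggregate, and so on) are hypothetical placeholders for illustration, not MPCTF's actual API.

```python
# Minimal sketch of the federated training loop described above.
# All method names are illustrative assumptions, not MPCTF's real API.

def run_federated_training(server, participants, num_rounds):
    # Step 1: the server initiates training with initial global weights.
    global_lora = server.init_lora_weights()
    for rnd in range(num_rounds):
        updates, sizes = [], []
        for p in participants:
            # Step 2: each participant trains locally on its private data.
            p.load_lora_weights(global_lora)
            local_lora, n_samples = p.local_lora_finetune()
            # Only the LoRA adapter weights travel, saving bandwidth.
            updates.append(local_lora)
            sizes.append(n_samples)
        # Step 3: the server aggregates the partial parameters.
        global_lora = server.aggregate(updates, sizes)
        # Step 4: the updated weights are broadcast in the next iteration.
    return global_lora
```

The key design point is that raw training data never appears in this loop; only the partial (LoRA) parameter weights cross organizational boundaries.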
In addition, several key training components ensure effective collaborative training across multiple participants in the proposed MPCTF framework: the one-click launch mechanism with multi-node and multi-GPU training, the data partitioning strategy, the model aggregation strategy, and privacy protection. These components and the fundamental framework together constitute a complete training system, coordinated through a unified control mechanism that automates the training workflow. During training, the parameter-efficient fine-tuning method LoRA is used to reduce communication overhead; model size is a key factor here, as larger models inherently incur greater communication costs. In addition, a larger learning rate may hinder model convergence, while a smaller learning rate may require more training rounds and time to converge. Moreover, the developed privacy protection strategies use differential privacy to protect data: a larger noise value provides stronger protection but may significantly degrade the model's performance. The following subsections describe these components in detail.
3.2. One-Click Launch Mechanism with Multi-Node and Multi-GPU Training
The one-click launch mechanism with multi-node and multi-GPU training is designed to make user operation convenient and to increase automation. It consists of two parts: one-click start of training, and multi-node, multi-GPU training. With one-click start, the central server can initiate training for all participants from a single configuration file, while each participant can still customize its own training configuration. The central server is also capable of monitoring the resource status of each participant in real time. The mechanism automatically generates training scripts for each participant and its sub-nodes, and it automates the modification of relevant configurations and the dissemination of data during collaborative training.
The multi-node and multi-GPU training mechanism is developed to conduct the actual model training task within a single participant by leveraging multiple machines and multiple GPUs, and it can be compatible with various kinds of LLMs. In the specific training process, each participant can receive the weights from the server and store them locally. Internal nodes among the participants can distribute weights, initiate training, and transmit the optimal model weights to the server upon completion of each round in the training process. The multi-node and multi-GPU training mechanism also supports participants utilizing varying quantities of GPU resources or different hardware. In addition, the designed multi-node and multi-GPU training mechanism enables participants to configure their environmental variables based on their network conditions, such as determining whether to utilize InfiniBand (IB) or employ specific network interface cards for communication.
In the multi-node and multi-GPU training mechanism, users can specify the machines and GPUs to be used through the configuration file. In other words, the multi-node and multi-GPU training mechanism of the proposed MPCTF is used to enable each participant to perform model training using the designated machines and GPUs. If the multi-node and multi-GPU training mechanism is absent, the proposed MPCTF cannot perform training on multiple machines or GPUs.
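As an illustration, a launcher of this kind might translate a participant's configuration entry into a standard distributed-training command. The configuration schema below is an assumption for the sketch; the torchrun flags and NCCL environment variables themselves are standard PyTorch/NCCL options.

```python
# Illustrative sketch: turning one participant node's config entry into a
# torchrun command for multi-node, multi-GPU training. The config keys are
# assumed for this example; torchrun flags and NCCL variables are standard.

def build_launch_command(node_cfg: dict) -> str:
    env = ""
    # Network-related environment variables (e.g. disabling InfiniBand or
    # pinning a network interface) come from the participant's config.
    if not node_cfg.get("use_infiniband", True):
        env += "NCCL_IB_DISABLE=1 "
    if node_cfg.get("network_interface"):
        env += f"NCCL_SOCKET_IFNAME={node_cfg['network_interface']} "
    return (
        f"{env}torchrun "
        f"--nnodes {node_cfg['nnodes']} "
        f"--nproc_per_node {node_cfg['gpus_per_node']} "
        f"--node_rank {node_cfg['node_rank']} "
        f"--master_addr {node_cfg['master_addr']} "
        f"--master_port {node_cfg['master_port']} "
        "train.py"
    )
```

Generating one such command per sub-node is what lets each participant use a different number of machines, GPUs, or network setups without manual script editing.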
3.3. Data Partitioning Strategy
The data partitioning strategy is proposed to split the training data. It conveniently adjusts the amount of training data used in each round, and the framework can seamlessly switch the corresponding datasets as training rounds progress. The data partitioning strategy consists of the fixed-size strategy, the percentage-based strategy, the maximum data volume strategy, and the total data volume and available GPU memory strategy.
Specifically, the fixed-size strategy sets the amount of training data for each round to a fixed size, which can be customized in the proposed MPCTF; this size value determines the amount of training data in each round. The percentage-based strategy sets the amount of training data for each round as a percentage of each participant's total data, obtained by multiplying the total dataset size by the percentage value. The maximum data volume strategy automatically sets the per-round amount based on the participant with the largest total data volume: the amount of training data in each round is the size of the largest dataset among the participants divided by the number of training rounds. The total data volume and available GPU memory strategy automatically allocates the per-round amount according to each participant's total data volume and the total available GPU memory; in other words, the amount of training data in each round is determined jointly by the available GPU memory and the total dataset size of each participant. In the proposed MPCTF, different data partitioning strategies can be flexibly customized and used during training according to actual scenario requirements.
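As a concrete illustration, the per-round data amounts under the four strategies can be sketched as below. The function names and the samples_per_gb memory calibration constant are illustrative assumptions, not MPCTF's actual implementation.

```python
# Sketch of per-round training-data amounts under the four partitioning
# strategies. Names and the samples_per_gb constant are assumptions.

def fixed_size(total: int, size: int) -> int:
    """Fixed-size strategy: a constant, user-set amount per round."""
    return min(size, total)

def percentage_based(total: int, pct: float) -> int:
    """Percentage-based strategy: a fixed fraction of the local dataset."""
    return int(total * pct)

def max_data_volume(largest_total: int, num_rounds: int) -> int:
    """Maximum data volume strategy: the largest participant's dataset
    divided evenly across the training rounds."""
    return largest_total // num_rounds

def memory_aware(total: int, free_gpu_mem_gb: float,
                 samples_per_gb: int = 512) -> int:
    """Total data volume and available GPU memory strategy: bounded by both
    the local dataset size and what fits in free GPU memory
    (samples_per_gb is an assumed calibration constant)."""
    return min(total, int(free_gpu_mem_gb * samples_per_gb))
```

For example, a participant with 10,000 samples using the percentage-based strategy at 10% would train on 1,000 samples per round, while the memory-aware strategy caps that amount by the free GPU memory instead.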
3.4. Model Aggregation Strategy
The model aggregation strategy aims to integrate the trained local models of each participant to improve the generalization of the global model. Aggregation methods such as FedAvg, FedProx, FedAdam, and FedAdagrad are integrated into the model aggregation strategy in the proposed MPCTF.
FedAvg computes a weighted average of local models, and it offers a simple yet effective aggregation approach. FedProx introduces a regularization term to mitigate local model divergence during training. FedAdam updates model parameters by incorporating momentum and adaptive learning rates, which combines the benefits of both techniques for federated optimization. FedAdagrad updates model parameters by adaptively adjusting the learning rate. If the model aggregation strategy is absent, the proposed MPCTF is unable to aggregate the models trained by the participants, and thus it cannot conduct the training for each round and cannot generate the final global model. In the proposed MPCTF, different model aggregation strategies can be selected, and the relevant parameters of each strategy can be customized through a unified configuration interface.
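A minimal sketch of the FedAvg-style weighted average on the server side is shown below. The data layout (updates as dicts mapping parameter names to lists of values) is an assumption for the example, not MPCTF's internal representation.

```python
# Sketch of FedAvg server-side aggregation: a data-size-weighted average
# of the participants' partial parameter updates. The dict-of-lists
# layout is an illustrative assumption.

def fedavg(updates, sizes):
    """updates: list of dicts {param_name: [values...]}, one per participant.
    sizes: each participant's local sample count, used as the weight."""
    total = sum(sizes)
    agg = {}
    for name in updates[0]:
        dim = len(updates[0][name])
        agg[name] = [
            sum((n / total) * u[name][i] for u, n in zip(updates, sizes))
            for i in range(dim)
        ]
    return agg
```

FedProx, FedAdam, and FedAdagrad differ from this baseline in the client regularization term or in how the server applies the averaged update, but the weighted-combination step above is the common core.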
3.5. Privacy Protection
The privacy protection strategy is developed to achieve strict protection of training data by introducing differential privacy technology. In the proposed MPCTF, the privacy protection strategy comprises the server-side fixed clipping strategy, the server-side adaptive clipping strategy, the client-side fixed clipping strategy, and the client-side adaptive clipping strategy.
In both the server-side fixed clipping strategy and the server-side adaptive clipping strategy, the server performs a unified clipping operation on all participants' updates, which reduces the communication overhead associated with clipping values. In both the client-side fixed clipping and client-side adaptive clipping strategies, the participants perform the clipping locally, which reduces the server's computational overhead to some extent. The fixed clipping strategies truncate model parameter values at a preset fixed threshold to prevent individual outliers from interfering with the global model, while the adaptive clipping strategies truncate them at a dynamically adjusted threshold. It should be noted that all privacy protection strategies employ differential privacy to protect the training data. In our subsequent experiments, the models obtained with the different privacy protection strategies perform very similarly. A larger noise value provides stronger protection but may significantly degrade model performance, and users can customize the noise value to achieve different levels of privacy protection. One potential limitation of the privacy protection strategies is that they may reduce model performance. Overall, in the proposed MPCTF, differential privacy strategies can be selected, and the relevant parameters of each strategy can be customized through a unified configuration interface.
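The clipping and noising operations can be sketched as follows. The function names, the quantile-based adaptive rule, and the noise parameterization are simplified assumptions for illustration, not MPCTF's exact differential privacy implementation.

```python
# Sketch of the clipping and noising building blocks behind the privacy
# protection strategies. The adaptive rule and noise scaling are
# simplified assumptions, not MPCTF's exact implementation.
import random

def clip_update(update, threshold):
    """Fixed clipping: scale the update so its L2 norm is at most
    `threshold` (the same rule applies server-side or client-side)."""
    norm = sum(x * x for x in update) ** 0.5
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [x * scale for x in update]

def adaptive_threshold(norms, quantile=0.5):
    """Adaptive clipping: derive the threshold from a quantile of the
    observed update norms instead of a preset constant."""
    ordered = sorted(norms)
    return ordered[int(quantile * (len(ordered) - 1))]

def add_gaussian_noise(update, noise_multiplier, threshold,
                       rng=random.Random(0)):
    """Differential privacy step: add Gaussian noise scaled to the clipping
    threshold. A larger noise_multiplier gives stronger protection but
    lower model utility."""
    sigma = noise_multiplier * threshold
    return [x + rng.gauss(0.0, sigma) for x in update]
```

In this sketch, where clip_update runs (on the server over all updates, or on each client before sending) distinguishes the server-side from the client-side strategies, and whether the threshold is a constant or comes from adaptive_threshold distinguishes fixed from adaptive clipping.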