1. Introduction
Currently, LLMs are being widely developed in a variety of fields, including voice recognition, natural language processing, online Q&A, computer vision, and recommendation systems, and the number of well-known applications that make extensive use of LLMs keeps growing [1,2]. The number of methods and systems for generating LLM models is also increasing [3,4,5,6,7,8]. Training such models requires a large amount of expensive equipment, in particular thousands of GPUs and high-speed Ethernet routers, to carry the massive data streams exchanged during the training period. Additionally, a stable scheduling system and an efficient model-parallel or data-parallel framework must be developed for training LLMs. The amount of training data is an important factor that determines the capability of an LLM. For example, GPT-3 [9] has 175 billion parameters and was pretrained on a 45 TB dataset, and PaLM [10], similarly trained on 45 TB of data, has 540 billion parameters. During LLM training, hyperparameters and accumulated gradients are distributed across the GPUs of every node, and the GPUs must collaborate closely with each other to make progress. As is well known, model parallelism and stability issues, including failures and delays, are notoriously difficult to diagnose and optimize in large-scale systems. To obtain good results with a low training time, it is necessary to carefully design the system's structure; determine the functionality of each node; design the structure of connections, data flows, and commands; and implement a system for monitoring and correcting errors.
Based on Kubernetes [11], many effective systems have been developed to generate LLMs. For example, the MegaScale training system [12] implements many components to address such problems. Issues related to the structure of the training system and the training process include the following: achieving high training efficiency and stability at scale; handling failures and stragglers; maximizing the overlap of communication and computation; achieving high network performance; reducing pipeline bubbles; improving training efficiency without compromising accuracy; and reducing the iteration time.
In MegaScale, several methods and algorithms have been implemented to address the above-mentioned issues [13,14]. These are as follows: diagnostic tools that monitor system components and events deep in the stack; identification of root causes and effective techniques to achieve fault tolerance and mitigate stragglers; sliding window attention; the parallel transformer block; the LAMB optimizer; custom techniques optimized for each parallelism strategy and the specific network topology; decreasing the number of ECMP hashing conflicts; enhanced checkpoint and recovery procedures to decrease interruptions; adjustment of the congestion control configuration and retransmit timeout parameters; heartbeat messages that contain diverse forms of information, which aid real-time anomaly detection and early warning; a set of diagnostic tests that pinpoint the nodes causing disruptions; performance-analysis software that records fine-grained CUDA events, generates system-wide heat maps, and traces timelines from a distributed perspective; a 3D parallel training visualization tool that shows the data dependencies between ranks for diagnosis; different pipeline scheduling strategies; methods to hide the extra work of all off-the-critical-path operations; and fusing the LayerNorm and GeLU kernels together [15,16].
However, Kubernetes is a general-purpose orchestration platform for containerized applications. It is worth noting that Kubernetes was not originally designed for model training (Figure 1); therefore, it requires many third-party extensions and various programs to make it suitable for training LLMs. The details are as follows:
Kubernetes cannot provide advanced scheduling. The vanilla version does not even provide a way to run a single job across many nodes, which can make training LLMs inefficient. To improve scheduling efficiency, it is necessary to develop a complex method that takes Kubernetes' massive structure into account.
Kubernetes cannot provide granular management of the hardware on its own; device plugins can offer this function, but designing or installing such plugins requires specialized engineering skills.
Kubernetes requires researchers to design or install additional Kubernetes operators by themselves to provide the ability to train LLMs. The list of additional operators includes the NVIDIA GPU Operator, the NVIDIA Network Operator, the MPI Operator [17], the Training Operator, solutions for shared filesystems and databases, etc.
Kubernetes requires researchers to deploy etcd, monitor the network, and integrate SSL; everything works more or less smoothly only as long as the developer does not tinker with it.
When researchers train LLMs on Kubernetes, the training workload must be containerized and follow cloud-native principles. According to the description of MegaScale, it is essential to deeply transform the underlying architecture of Kubernetes and, concurrently, the distributed training programs to ensure successful cluster execution of large-model workloads.
Each stage requires specialized engineering skills and expertise, contributing to the manual costs and complexity of developing LLMs. Due to the complexity of a Kubernetes-based training system, pretraining an LLM requires more time for setting up the system environment. It therefore takes a considerable number of programmers and a significant amount of time to implement numerous algorithms, which burdens LLM pretraining with high manual labor costs and low actual utilization of the large-scale GPU cluster's computing power (setup occupies approximately 30% of the entire cluster lease time), during which the GPU cluster must remain powered on. Although academic papers have reported an efficient MFU with this kind of system, a K8S-based cluster system actually incurs higher costs for training LLMs.
As shown in Figure 2, Slurm [18] and Kubernetes solve similar problems and offer comparable services. Both can be used for model training or other high-performance computing tasks. Slurm is extremely popular in the HPC industry [19]. It is implemented on over half of the Top 500 supercomputers in the world and in many research labs, universities, and large corporations developing HPC. Compared to Kubernetes, Slurm has a long history of running workloads for all kinds of institutions performing intense computations. While Slurm was not originally designed for LLM training, it has been adapted to current needs (for example, through support for GPU computing). Slurm's original structure is much closer to current machine learning requirements than that of Kubernetes.
As shown in Table 1, compared to Kubernetes, Slurm's scheduler is smart and efficient. It is designed for supercomputers and can operate at massive scale, handling tens of thousands of nodes and hundreds of jobs per second. Researchers can directly run distributed training programs on it without additional configuration. Slurm enables flexible management of hardware resources: it distinguishes CPU sockets, CPU cores, and hyperthreads, and it provides GPU sharding and network-topology-aware scheduling, so researchers can submit single or multiple tasks to any node. Slurm is highly compatible with PyTorch workflows, allowing researchers to submit tasks on each node, and it is easy to extend and integrate additional functions, for example, programming a heartbeat task to monitor the health of each node.
Therefore, this paper presents a highly efficient and simple training system implemented on a 1024-GPU cluster. It employs a dynamic hybrid sharding strategy for parallelizing model parameters and data, and a self-detecting, fault-tolerant distributed structure based on Slurm to maintain the health of the running GPU cluster. The main advantage of the proposed approach is the deliberate simplicity of the environment and scheduling for training an LLM, which significantly reduces manual labor costs and shortens the training time of the LLM system.
Section 2 of this paper describes the proposed system for the training process and its distinctive details based on Slurm and FSDP.
Section 3 presents the results of implementing the designed system.
Section 4 conducts a comparative analysis of existing systems and the proposed system.
Section 5 outlines the future research plan and its anticipated outcomes.
Section 6 provides the conclusions.
2. The Pretraining System
The proposed system structure is described in Figure 1. It consists of four layers, each showing the technical details of the hardware and software. At the bottom, Node represents the physical compute machine, communicating through the two-layer InfiniBand protocol [20]. In the second layer, NCCL [21] and the NVIDIA GPU driver (NVIDIA Inc., Santa Clara, CA, USA) are installed on each machine to perform GPU data exchange and parallel computing. The third layer contains the basic components of our proposed system, which include PyTorch 1.12 [22], Slurm 21.08.8, and Wandb 0.17.4. The top layer focuses on the development approach, including the optimized GPU distributed system and an efficient parallel strategy. In summary, our proposed system is simple and efficient, enabling rapid implementation on the GPU cluster. The details of our system are shown in Figure 3.
2.1. The Proposed GPU Distributed System
The preparation and error-handling time of the LLM training environment occupies approximately 30 percent of the whole training time and, in practice, is determined by the GPU distributed system in use. Decreasing the manual labor costs of training LLMs requires a simple and efficient GPU distributed system, which includes an error tolerance agent to monitor system components and events.
Based on the architecture of a classic Slurm cluster (Figure 2), we created a robust pretraining framework for LLMs. This framework automatically detects errors, recovers quickly from the latest checkpoint, and implements error tolerance with minimal human interaction. Additionally, it has a negligible impact on running LLM training processes.
2.1.1. Optimized Pretraining Framework
As shown in Figure 4, when a training job is submitted, a training agent process connected to Slurm is created to allocate GPU nodes and launch the appropriate slurmd for each node. A slurmd can manage only one node. After completing several initialization tasks, the slurmd starts the training task on its node together with a robust pretraining daemon that sends heartbeats to the training agent on a regular basis. The messages contained in these heartbeats are designed to support real-time anomaly detection and to give early warnings of problems. If an error status is detected in a specific pretraining task, or if no heartbeat is received from a slurmd within a defined time window, the training agent initiates the error recovery process: it suspends the running training job across all slurmds and submits a script that performs a series of self-diagnostics. These diagnostic programs are designed to be lightweight yet comprehensive and can handle most common hardware and software errors. When faulty nodes are detected, the training agent shuts them down by submitting their IDs. In response to this message, Slurm closes the faulty GPU nodes and adds the same number of healthy GPU nodes, verified by our diagnostics, back to the cluster. In addition, users can manipulate nodes by submitting scripts. When the recovery is finished, the training agent restarts training from the newest checkpoint. Our work focuses on the checkpoint and restart tasks to minimize the loss of training progress, reducing the restart time of LLM training to 10 min.
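To make this control flow concrete, the following is a minimal sketch of an agent-side monitoring loop. The function names run_diagnostics and resume_from_checkpoint are placeholder hooks, and the timeout values are illustrative; the scontrol commands are standard Slurm calls, but the sketch does not reproduce the exact procedure of the implementation described above.
```python
import subprocess
import time

HEARTBEAT_TIMEOUT = 300   # seconds without a heartbeat before a node is suspected (illustrative)
CHECK_INTERVAL = 10       # how often the agent scans the heartbeat table (illustrative)

last_heartbeat = {}       # hostname -> timestamp of the most recent heartbeat

def record_heartbeat(node: str, payload: dict) -> None:
    """Called by the receiving endpoint whenever a slurmd-side daemon reports in."""
    last_heartbeat[node] = time.time()
    # Inspection of the payload (GPU utilization, error keywords, ...) would happen here.

def drain_node(node: str) -> None:
    """Mark a faulty node as unavailable so that Slurm stops scheduling work on it."""
    subprocess.run(
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN", "Reason=agent_diagnostics_failed"],
        check=False,
    )

def monitor(job_id: int, nodes: list, run_diagnostics, resume_from_checkpoint) -> None:
    """Suspend the job, diagnose stale nodes, drain the faulty ones, and resume training."""
    while True:
        now = time.time()
        stale = [n for n in nodes if now - last_heartbeat.get(n, now) > HEARTBEAT_TIMEOUT]
        if stale:
            subprocess.run(["scontrol", "suspend", str(job_id)], check=False)
            for n in stale:
                if not run_diagnostics(n):      # placeholder: lightweight self-diagnostics
                    drain_node(n)
            resume_from_checkpoint(job_id)      # placeholder: reload newest checkpoint and restart
        time.sleep(CHECK_INTERVAL)
```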
2.1.2. Message Organization and Analysis
The heartbeat messages contain essential information about the slurmd, such as its active state, network traffic, and GPU utilization. In addition, real-time data from the pretraining process are uploaded, allowing the agent to rapidly analyze any important errors, and the standard Linux output and error logs of the training processes are included. These messages are filtered and inspected on the cloud side. Once specific bug or warning keywords are found, the agent uploads the current diagnostic data.
To improve the robustness and stability of pretraining detection, our work implements a detection system that can operate at the millisecond level. Multiple levels of detection are developed, each following different indicators; second-level detection generally aims to monitor the overall health condition and to eliminate common configuration-related disturbances of pretraining.
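As an illustration of how such messages can be screened, the sketch below filters the log tails carried by a heartbeat against a small keyword list; the message fields and the patterns are illustrative placeholders, not the actual watch list used by the system.
```python
import re

# Illustrative keyword patterns; the actual watch list is implementation-specific.
ERROR_PATTERNS = [
    re.compile(r"CUDA error", re.IGNORECASE),
    re.compile(r"NCCL (error|timeout)", re.IGNORECASE),
    re.compile(r"ECC error", re.IGNORECASE),
    re.compile(r"out of memory", re.IGNORECASE),
]

def scan_heartbeat(message: dict) -> list:
    """Return the suspicious lines found in a heartbeat's stdout/stderr payload."""
    lines = message.get("stdout_tail", []) + message.get("stderr_tail", [])
    return [line for line in lines if any(p.search(line) for p in ERROR_PATTERNS)]

# Example heartbeat with hypothetical fields.
heartbeat = {
    "node": "gpu-node-017",
    "gpu_util": [0.98] * 8,
    "net_rx_gbps": 88.4,
    "stderr_tail": ["RuntimeError: NCCL timeout while waiting for all-reduce"],
}
if scan_heartbeat(heartbeat):
    print("anomaly detected: uploading current diagnostic data")
```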
2.1.3. Diagnostic Programs and Execution
A trade-off exists between the running time and the accuracy of the self-diagnostics. A long detection period adversely affects training efficiency, while inaccurate detection raises the rate of false alarms and leads to misjudging the health state of a machine. Based on a variety of experiments and optimizations, we have implemented a series of simple analytical tests that effectively handle the various hardware and software bugs and errors encountered during the training process.
To test for potential GPU communication errors, the agent runs a detailed end-to-end test among the GPUs to check whether the network bandwidth meets the required benchmarks.
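A minimal sketch of such an end-to-end check is shown below: an all-reduce micro-benchmark, launched with torchrun (one process per GPU), that compares the measured bandwidth against a threshold. The payload size and the 100 Gbps threshold are illustrative assumptions and are not the benchmarks used by the diagnostic suite.
```python
import os
import time

import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=8 --nnodes=<N> bandwidth_probe.py
def allreduce_bandwidth_gbps(num_bytes: int = 1 << 30, iters: int = 20) -> float:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    buf = torch.empty(num_bytes // 2, dtype=torch.bfloat16, device="cuda")  # 1 GiB payload

    for _ in range(5):                 # warm-up iterations
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    # Algorithmic bandwidth: payload bits processed per second per rank.
    return num_bytes * iters * 8 / elapsed / 1e9

if __name__ == "__main__":
    bandwidth = allreduce_bandwidth_gbps()
    if dist.get_rank() == 0:
        print(f"all-reduce bandwidth: {bandwidth:.1f} Gbps")
        if bandwidth < 100:            # illustrative threshold
            raise SystemExit("bandwidth below the required benchmark")
```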
2.1.4. Optimal Checkpoint and Fast Resume
When a crashed machine is detected and identified, the agent must resume training by loading the model weights and optimizer states from the newest checkpoint. It is important to ensure that this checkpoint is close to the training state at the moment the crash occurred, in order to minimize the loss of GPU computation and training time. We achieve this by reducing the checkpoint-saving interval during pretraining while also minimizing the checkpoint latency, especially the time spent on the critical path of the pretraining process.
To perform rapid checkpointing, the agent adopts an improved two-level method. On the one hand, the slurmd of each node saves its state to the machine's memory and meanwhile continues with the pretraining process. The high NVLink bandwidth, together with improvements in PyTorch's parallelism support [23], makes it possible to reduce this task to several seconds. On the other hand, a background thread periodically sends the state from the machine's memory to a network file system (NFS [24] in our implementation, connected to our system through a 100 Gbps InfiniBand link) for centralized maintenance. This two-level method enables rapid recovery of the GPUs from training crashes.
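The sketch below illustrates the two-level idea under simplifying assumptions: a blocking GPU-to-host snapshot on the critical path, and a daemon thread that flushes the host copy to an NFS path. The path is a placeholder, synchronization with the optimizer step is omitted, and the real implementation additionally relies on PyTorch's parallelism support [23].
```python
import threading

import torch

NFS_PATH = "/mnt/nfs/checkpoints/latest.pt"   # placeholder path on the shared filesystem

def _to_cpu(obj):
    """Recursively move tensors in a (nested) state dict to host memory."""
    if torch.is_tensor(obj):
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {k: _to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [_to_cpu(v) for v in obj]
    return obj

def checkpoint(model, optimizer, step: int):
    # Level 1: fast snapshot into the machine's memory (blocks training for a few seconds).
    snapshot = {
        "step": step,
        "model": _to_cpu(model.state_dict()),
        "optimizer": _to_cpu(optimizer.state_dict()),
    }
    # Level 2: persist the snapshot to the network filesystem off the critical path.
    threading.Thread(target=torch.save, args=(snapshot, NFS_PATH), daemon=True).start()
    return snapshot      # training continues immediately after the in-memory copy
```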
2.2. Efficient Parallel Strategy
In the LLM industry, efficient pretraining on a thousand-GPU cluster is essential. As the number of model parameters and the required accuracy increase, the demand for GPUs has also exploded. Meeting these computational needs at scale without compromising model accuracy requires dataset pipeline management, optimized parallel sharding strategies, and fast iteration computation.
2.2.1. Dataset Pipeline Management
Dataset chunking and loading are generally ignored, yet these tasks cause a significant amount of GPU idle time at the start of every pretraining step, because the chunking of the dataset is performed on the CPU. Improving these tasks is important for minimizing the time taken by the pretraining process.
Dataset chunking is not on the critical path of the training workflow. The data chunk for the next training step can be prepared as a background process while the GPUs synchronize gradients at the end of every step.
After dataset chunking, data loader workers need to be created for distributed training. Normally, each GPU process runs its own data loader: the chunked data are loaded into machine memory before being forwarded to GPU memory. This mechanism makes the workers compete for read I/O bandwidth, which can easily become a training bottleneck. During pretraining, the eight GPU workers in one node belong to the same data parallel group, so the input of each iteration is essentially the same. Based on this observation, we implemented a simple and fast method: load the chunked data into shared CPU memory, with one data loader per node. After the data are read into shared memory, each GPU worker copies the chunked data from memory to its own GPU memory. This approach eliminates redundant reads and, importantly, minimizes the data loading time.
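A minimal sketch of the one-loader-per-node idea is given below: local rank 0 writes the next chunk to a file under /dev/shm (which resides in RAM), and every GPU worker on the node maps the same file and copies it to its own GPU. The path, shapes, and dtype are illustrative assumptions, and the barrier that keeps readers behind the writer is left out for brevity.
```python
import numpy as np
import torch

SHM_PATH = "/dev/shm/next_chunk.npy"   # node-local shared memory, illustrative path
CHUNK_SHAPE = (8, 2048)                # (micro_batch_size, block_size), illustrative
DTYPE = np.int32

def write_chunk(tokens: np.ndarray) -> None:
    """Data loader (local rank 0 only): stage the next chunk into shared CPU memory."""
    shm = np.lib.format.open_memmap(SHM_PATH, mode="w+", dtype=DTYPE, shape=CHUNK_SHAPE)
    shm[:] = tokens
    shm.flush()

def read_chunk(local_rank: int) -> torch.Tensor:
    """GPU worker: map the shared chunk and copy it into this worker's GPU memory."""
    shm = np.load(SHM_PATH, mmap_mode="r")
    return torch.from_numpy(np.ascontiguousarray(shm)).to(f"cuda:{local_rank}", non_blocking=True)
```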
2.2.2. Dynamic Hybrid Sharding Strategy
Efficient parallel processing of the model parameters and the dataset is paramount in the training process. As is well known, training models at this scale involves significant engineering complexity.
As shown in Figure 5, FSDP comprises three main parts: the constructor, the forward path, and the backward path. The constructor shards the model parameters, and each rank keeps only its own shard. The forward path obtains the world size of the GPU cluster and calculates the number of training iterations according to the dataset; it then performs an all-gather to collect all shards from all ranks and recover the full parameters of the FSDP unit, performs the forward computation, and discards the parameter shards it has just collected. The backward path performs an all-gather to collect all shards from all ranks and recover the full parameters of the FSDP unit, performs the backward computation, and then performs a reduce-scatter to synchronize gradients and discards the parameters.
Building on this process, FSDP offers several alternative sharding strategies, which determine how the model is sharded across GPUs and machines (a minimal configuration sketch follows the list). The strategies are as follows:
FULL_SHARD (For very large models that do not fit into a machine. The model is sharded across all GPUs on all machines. Requires a cluster with fast inter-node network).
SHARD_GRAD_OP (Slice gradient and optimizer state across workers).
NO_SHARD (Does not fragment anything; this is equivalent to DDP).
HYBRID_SHARD (For models that fit into a single machine, e.g., TinyLlama [25] 1B-3B on an 8xA100 machine. This shards the model within the machine but replicates it across machines. Useful to avoid a slow network bottleneck between machines).
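The sketch referenced above shows how one of these strategies is selected when wrapping a model with FSDP. It assumes a PyTorch 2.x installation that exposes ShardingStrategy, a launch via torchrun, and a toy model standing in for TinyLlama.
```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Launch with: torchrun --nproc_per_node=8 fsdp_strategy.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.TransformerEncoderLayer(d_model=2048, nhead=16).cuda()   # toy stand-in for the LLM

sharded_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,   # or SHARD_GRAD_OP / NO_SHARD / HYBRID_SHARD
    device_id=local_rank,
)
```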
Among the FSDP sharding strategies, HYBRID_SHARD is suitable when the LLM model fits inside a single node, but it generally cannot be used when the model must be spread over multiple nodes; using it in that situation can saturate the network bandwidth across nodes and crash the cluster, so the FULL_SHARD strategy must be introduced across all GPU nodes. At eight nodes and below, this paper employs the FULL_SHARD sharding strategy (one of the FSDP strategies), yielding results similar to those obtained with the Lightning-AI [26] cluster. However, when scaling the cluster from eight nodes to 16, 32, 64, and 128 nodes, it encounters several issues. For data parallel training with 16 or more nodes, the data communication paths become highly complex and long; for example, because of sharding across all GPUs over the two-layer InfiniBand network (Figure 3), the communication latency between the first and the last nodes reaches about 45 min, resulting in substantial time spent splitting the LLM model into chunks on the 1024-GPU cluster. In LLM pretraining engineering, network fluctuations can further lengthen the training period, and the cluster crashes when data synchronization exceeds the default timeout of 30 min.
Based on this observation, the proposed work adopts the dynamic HYBRID_SHARD sharding strategy (combining the FULL_SHARD and HYBRID_SHARD sharding strategies), which is a two-layer approach, as follows:
At the top layer, as Figure 6 shows, the FULL_SHARD sharding strategy is used within groups of multiple nodes. For example, when our cluster scales to eight nodes and the minimum number of nodes required to hold the LLM model is four, the nodes are divided into groups of four (in this work, the default group size is four nodes). Within each group, the method applies the FULL_SHARD strategy for full sharding, which reduces the FULL_SHARD data exchange to 50%.
At the bottom layer, the HYBRID_SHARD sharding strategy is applied over these groups: the model is sharded within each group and then replicated across all groups (a sketch of the corresponding process-group construction follows below). This reduces the communication volume by limiting the expensive all-gathers and reduce-scatters to within a group, improving performance for medium-sized models.
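The sketch below expresses this grouping with PyTorch process groups, assuming a recent PyTorch (2.1+) in which FSDP's HYBRID_SHARD accepts a (shard group, replicate group) pair. The group size of four 8-GPU nodes follows the example above, and the wiring may differ from the implementation actually used in this work.
```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
rank, world_size = dist.get_rank(), dist.get_world_size()

GPUS_PER_GROUP = 4 * 8   # four 8-GPU nodes form one FULL_SHARD group

# Shard groups: the 32 ranks inside each group of four nodes (every rank creates all groups).
shard_groups = [dist.new_group(range(start, start + GPUS_PER_GROUP))
                for start in range(0, world_size, GPUS_PER_GROUP)]
shard_group = shard_groups[rank // GPUS_PER_GROUP]

# Replicate groups: the ranks holding the same shard index in every group.
replicate_groups = [dist.new_group(range(i, world_size, GPUS_PER_GROUP))
                    for i in range(GPUS_PER_GROUP)]
replicate_group = replicate_groups[rank % GPUS_PER_GROUP]

def wrap(model: torch.nn.Module) -> FSDP:
    """Shard within the 4-node group, replicate across groups."""
    return FSDP(
        model.cuda(),
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
        process_group=(shard_group, replicate_group),
        device_id=int(os.environ["LOCAL_RANK"]),
    )
```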
2.2.3. The Configuration Optimization
Our work makes various modifications and incorporates simple optimizations at the configuration level of the training process to minimize the training time without compromising accuracy.
We set the maximum number of tokens for the end of the training process, which stops training once this many tokens have been processed across all GPUs. Additionally, we adopt a parallel (per-device) version of a simple formula to calculate the maximum number of iterations for gradient updating:
max_tokens_per_device = max_tokens / world_size,
tokens_per_iter = micro_batch_size × block_size,
max_iters = max_tokens_per_device / tokens_per_iter,
where max_tokens_per_device represents the token budget of each device, max_tokens represents the total number of tokens in this period of training the LLM, world_size is the number of GPUs in the cluster, tokens_per_iter is the number of tokens processed per iteration on each device, micro_batch_size is the number of sequences processed per GPU in one iteration, block_size is the length of the sequences in the training process, and max_iters is the maximum number of iterations in this training process. The computation of the gradient update schedule can thus be performed in parallel on each device, which reduces computation time.
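As a worked instance of this formula, the snippet below plugs in illustrative values: 1024 GPUs and a micro-batch size of eight as in our setup, a token budget mirroring the 300-billion-token run described in Section 3, and an assumed sequence length of 2048 tokens.
```python
max_tokens = 300_000_000_000     # total token budget across all GPUs (illustrative)
world_size = 1024                # number of GPUs in the cluster
micro_batch_size = 8             # sequences processed per GPU per iteration
block_size = 2048                # tokens per sequence (assumed)

max_tokens_per_device = max_tokens // world_size      # ~293 million tokens per GPU
tokens_per_iter = micro_batch_size * block_size       # 16,384 tokens per GPU per iteration
max_iters = max_tokens_per_device // tokens_per_iter  # 17,881 gradient updates
print(max_iters)
```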
Additionally, we set the micro-batch size to eight, the highest value that efficiently utilizes the cluster's GPU memory, pushing the GPUs to 99%+ utilization. The proposed work uses BF16 for training, which offers essentially the same prediction accuracy as the 32-bit floating-point format while significantly reducing power consumption and improving throughput at no additional time cost.
3. Results for the Proposed System Implementation
Currently, the suggested approach is implemented on the Linux operating system, version 5.15.0-94-generic-x86_64-with-glibc2.17, using Python version 3.8.11. Each executing node includes 96 CPUs and 8 GPUs (NVIDIA H100 80 GB HBM3 [27]).
Experiments were performed with the number of executing nodes set to 16, 32, 64, and 128. Because of the high cost of the H100, only the 128-node configuration was run through the whole training process; the 16-, 32-, and 64-node configurations were each tested for a period of two days (long enough for the remaining days' computation to stabilize), and these measurements were used to calculate the time of the whole training process. The log of training the LLM on the present system is available at https://wandb.ai/jnist/tinyllama?nw=nwuser1032124832 (accessed on 24 July 2025). During the training, a log file is generated at each step, recording the iteration number, step number, loss, iteration time, and remaining time.
The proposed 1024-GPU cluster is tested on TinyLlama, an open-source small language model with 1.1 billion parameters and 300 billion training tokens (as shown in Figure 7). The whole training process includes three epochs, as shown in Figure 8.
The system performance graphs based on various parameters can also be plotted.
Figure 9 illustrates the GPU utilization for different numbers of GPUs involved in training, as a function of training time.
Figure 10 illustrates the training loss and running state of the 1024-GPU Slurm system for TinyLlama. In total, two crash events occurred, similar to those reported by the original authors of TinyLlama and Lightning-AI.
5. Discussion and Future Work
Generally, using a large-scale model pretraining platform requires a certain amount of time for platform compatibility tasks, in addition to the duration of model training itself; platforms such as MegaScale may demand even more time. Remarkably, our objective is to provide an experience in which no code modifications or platform compatibility work is necessary. The team in this project took less than two days to train TinyLlama directly.
In the future, our focus will be on optimizing the cluster at a low level to boost computational efficiency. The general PyTorch framework employed for one-click training, which aims to reduce the actual training time of large models, does not fully leverage the computational power of the H100 GPU. As indicated in Table 1, the Model FLOPs Utilization (MFU) only reaches 43.8% or 38.6%. This is because TinyLlama utilizes FlashAttention-2, an algorithm optimized explicitly for A100 GPUs. Moving forward, the project team plans to integrate FlashAttention-3, which presents three essential techniques to accelerate attention on H100 GPUs (leveraging the asynchrony of the Tensor Cores and TMA to overlap overall computation and data movement through warp specialization, interleaving block-wise matmul and softmax operations, and block quantization with incoherent processing that leverages hardware support for FP8 low precision), with the expectation that MFU will exceed 60%.