Training Methods for Large Language Models: Current Approaches and Challenges
Abstract
1. Introduction
- Systematic methodological synthesis: We adopt a transparent survey methodology and organize an initial set of 58 core studies, expanded to 68 during revision, across major stages of the LLM training pipeline, supported by PRISMA-style reporting and thematic distribution.
- Unified analytical taxonomy: Beyond thematic review, we propose an original comparative framework positioning training paradigms along the axes of training efficiency and alignment/reasoning quality, offering new insight into dense scaling versus sparse and post-training optimization strategies.
- Quantitative frontier comparison: We strengthen the engineering perspective of the survey by including quantitative evidence from dense and sparse frontier model families (e.g., LLaMA-3 vs. DeepSeek MoE systems), highlighting the efficiency–capability tradeoffs shaping current research trends.
2. Survey Methodology
Research Questions
- RQ1: What are the dominant strategies in modern LLM pre-training and data curation?
- RQ2: How do Parameter-Efficient Fine-Tuning (PEFT) and alignment methods balance computational efficiency with response quality and safety?
- RQ3: To what extent do sparse Mixture-of-Experts (MoE) architectures improve the capability–efficiency tradeoff compared to dense transformer models?
3. Foundations of Pre-Training for Large Language Models
4. Fine-Tuning Strategies for Large Language Models: Supervised and Instruction Tuning
5. Advanced Fine-Tuning, Alignment, and Parameter-Efficient Adaptation in Large Language Models
5.1. Advanced Fine-Tuning and Emerging Alignment Strategies
5.2. Reinforcement Learning from Human Feedback (RLHF) and Preference Optimization
5.3. Factuality-Aware and Reasoning-Centric Alignment Methods
5.4. Parameter-Efficient Fine-Tuning (PEFT) for Scalable Adaptation
6. From Multimodal and Retrieval-Augmented Models to Reasoning-Centric Training: Sparse MoE Case Studies (e.g., DeepSeek)
Quantitative Efficiency in Sparse MoE Architectures: The DeepSeek Case Study
7. Evaluation of Large Language Models and Recent Advances
7.1. Outcome-Based vs. Process-Based Evaluation
Training-Aware Evaluation: SFT vs. RLHF Tradeoffs
8. Analytical Taxonomy: Training Efficiency vs. Alignment and Reasoning Quality
8.1. Dense Scaling vs. Sparse Efficiency
8.2. Alignment Optimization and the “Alignment Tax”
8.3. Why Does the Alignment Tax Emerge?
8.4. Mapping Paradigms to the Taxonomy
9. Discussion
10. Future Directions
11. Limitations
12. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| LLM | Large Language Model |
| NLP | Natural Language Processing |
| CLM | Causal Language Modeling |
| MLM | Masked Language Modeling |
| SFT | Supervised Fine-Tuning |
| IFT | Instruction Fine-Tuning |
| RLHF | Reinforcement Learning from Human Feedback |
| DPO | Direct Preference Optimization |
| PEFT | Parameter-Efficient Fine-Tuning |
| LoRA | Low-Rank Adaptation |
| QLoRA | Quantized Low-Rank Adaptation |
| MoE | Mixture-of-Experts |
| RAG | Retrieval-Augmented Generation |
| MMLU | Massive Multitask Language Understanding |
| PII | Personally Identifiable Information |
References
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Gisserot-Boukhlef, H.; Boizard, N.; Faysse, M.; Alves, D.M.; Malherbe, E.; Martins, A.F.T.; Hudelot, C.; Colombo, P. Should We Still Pretrain Encoders with Masked Language Modeling? arXiv 2025, arXiv:2507.00994. [Google Scholar] [CrossRef]
- Interrante-Grant, A.; Varela-Rosa, C.; Narayan, S.; Connelly, C.; Reuther, A. Scaling Performance of Large Language Model Pretraining. arXiv 2025, arXiv:2509.05258. [Google Scholar] [CrossRef]
- Penedo, G.; Malartic, Q.; Hesslow, D.; Launay, J.; Noune, H.; Pannier, B.; Cappelli, A.; Malartic, E. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. arXiv 2023, arXiv:2306.01116. [Google Scholar] [CrossRef]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020; NeurIPS: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–14 December 2021; NeurIPS: Red Hook, NY, USA, 2021; Volume 34, pp. 24963–24977. [Google Scholar]
- Solaiman, I.; Brundage, M.; Clark, J.; Askell, A.; Herbert-Voss, A.; Wu, J.; Radford, A.; Krueger, G.; Kim, J.W.; Kreps, S.; et al. Release Strategies and the Social Impacts of Language Models. arXiv 2019, arXiv:1908.09203. [Google Scholar] [CrossRef]
- Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional AI: Harmlessness from AI Feedback. arXiv 2022, arXiv:2212.08073. [Google Scholar] [CrossRef]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; NeurIPS: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008. [Google Scholar]
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar] [CrossRef]
- Tay, Y.; Dehghani, M.; Tran, V.Q.; Garcia, X.; Wei, J.; Wang, X.; Chung, H.W.; Bahri, D.; Bahri, T.; Metzler, D. UL2: Unifying Language Learning Paradigms. arXiv 2022, arXiv:2205.05131. [Google Scholar] [CrossRef]
- Kaplan, J.; McCandlish, S.; Hernandez, D.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361. [Google Scholar] [CrossRef]
- Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; de Las Casas, D.; Hendricks, L.A.; Welbl, J.; Clark, A.; et al. Training Compute-Optimal Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; NeurIPS: Red Hook, NY, USA, 2022. [Google Scholar]
- Shoeybi, M.; Patwary, M.; Puri, R.; LeGresley, P.; Casper, J.; Catanzaro, B. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv 2019, arXiv:1909.08053. [Google Scholar] [CrossRef]
- Rajbhandari, S.; Rasley, J.; Ruwase, O.; He, Y. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. arXiv 2020, arXiv:1910.02054. [Google Scholar] [CrossRef]
- Kudo, T.; Richardson, J. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 66–71. [Google Scholar] [CrossRef]
- Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Alpaca: A Strong, Replicable Instruction-Following Model. Available online: https://crfm.stanford.edu/2023/03/13/alpaca.html (accessed on 9 February 2026).
- Köpf, A.; Kilcher, Y.; von Rütte, D.; Anagnostidis, S.; Tam, Z.; Stevens, K.; Barhoum, A.; Nguyen, D.; Stanley, O.; Nagyfi, R.; et al. OpenAssistant Conversations—Democratizing Large Language Model Alignment. arXiv 2023, arXiv:2304.07327. [Google Scholar] [CrossRef]
- Ding, N.; Chen, Y.; Xu, B.; Qin, Y.; Zheng, Z.; Hu, S.; Liu, Z.; Sun, M.; Zhou, B. Enhancing Chat Language Models by Scaling High-Quality Instructional Conversations. arXiv 2023, arXiv:2305.14233. [Google Scholar] [CrossRef]
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. arXiv 2022, arXiv:2210.11416. [Google Scholar] [CrossRef]
- Wei, J.; Bosma, M.; Zhao, V.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models Are Zero-Shot Learners. arXiv 2021, arXiv:2109.01652. [Google Scholar] [CrossRef]
- Zhang, S.; Dong, L.; Li, X.; Zhang, S.; Sun, X.; Wang, S.; Li, J.; Hu, R.; Zhang, T.; Wu, F.; et al. Instruction Tuning for Large Language Models: A Survey. arXiv 2023, arXiv:2308.10792. [Google Scholar] [CrossRef]
- Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N.A.; Khashabi, D.; Hajishirzi, H. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv 2022, arXiv:2212.10560. [Google Scholar] [CrossRef]
- Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv 2022, arXiv:2204.05862. [Google Scholar] [CrossRef]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training Language Models to Follow Instructions with Human Feedback. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; NeurIPS: Red Hook, NY, USA, 2022; Volume 35, pp. 27730–27744. [Google Scholar]
- Lin, S.; Gao, L.; Oguz, B.; Xiong, W.; Lin, J.; Yih, W.T.; Chen, X. FLAME: Factuality-Aware Alignment for Large Language Models. arXiv 2024, arXiv:2405.01525. [Google Scholar] [CrossRef]
- Xu, Y.; Chakraborty, T.; Kıcıman, E.; Aryal, B.; Rodrigues, E.; Sharma, S.; Estevao, R.; Balaguer, M.A.D.; Wolk, J.; Padilha, R.; et al. RLTHF: Targeted Human Feedback for LLM Alignment. arXiv 2025, arXiv:2502.13417. [Google Scholar] [CrossRef]
- Sotiropoulos, A.; Valapu, S.T.; Lei, L.; Coleman, J.; Krishnamachari, B. Crowd-SFT: Crowdsourcing for LLM Alignment. arXiv 2025, arXiv:2506.04063. [Google Scholar] [CrossRef]
- Li, M.; Chen, L.; Chen, J.; He, S.; Huang, H.; Gu, J.; Zhou, T. Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning. arXiv 2023, arXiv:2310.11716. [Google Scholar] [CrossRef]
- Pentyala, S.K.; Wang, Z.; Bi, B.; Ramnath, K.; Mao, X.-B.; Radhakrishnan, R.; Asur, S.; Cheng, N. PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning. arXiv 2024, arXiv:2406.17923. [Google Scholar] [CrossRef]
- Chen, M.; Sun, L.; Li, T.; Sun, H.; Zhou, Y.; Zhu, C.; Wang, H.; Pan, J.Z.; Zhang, W.; Chen, H.; et al. ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning. arXiv 2025, arXiv:2503.19470. [Google Scholar] [CrossRef]
- Tang, Y.; Cohen, T.; Zhang, D.W.; Valko, M.; Munos, R. RL-Finetuning LLMs from On- and Off-Policy Data with a Single Algorithm. arXiv 2025, arXiv:2503.19612. [Google Scholar] [CrossRef]
- Ye, K.; Zhou, H.; Zhu, J.; Quinzan, F.; Shi, C. Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning. arXiv 2025, arXiv:2504.03784. [Google Scholar] [CrossRef]
- Zhou, Z.; Zhang, Q.; Kumbong, H.; Olukotun, K. LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits. arXiv 2025, arXiv:2502.08141. [Google Scholar] [CrossRef]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]
- Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C.D.; Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023; NeurIPS: Red Hook, NY, USA, 2023; Volume 36. [Google Scholar]
- Qi, H.; Dai, Z.; Huang, C. Hybrid and Unitary PEFT for Resource-Efficient Large Language Models. arXiv 2025, arXiv:2507.18076. [Google Scholar] [CrossRef]
- Han, Z.; Gao, C.; Liu, J.; Zhang, J.; Zhang, S.Q. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. arXiv 2024, arXiv:2403.14608. [Google Scholar] [CrossRef]
- Hu, C.W.; Wang, Y.; Xing, S.; Chen, C.; Feng, S.; Rossi, R.; Tu, Z. mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation. arXiv 2025, arXiv:2505.24073. [Google Scholar] [CrossRef]
- Drushchak, N.; Polyakovska, N.; Bautina, M.; Semenchenko, T.; Koscielecki, J.; Sykala, W.; Wegrzynowski, M. Multimodal Retrieval-Augmented Generation: Unified Information Processing Across Text, Image, Table, and Video Modalities. In Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025), Vienna, Austria, 1 August 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025. [Google Scholar]
- Kumar, S.; Ghosal, T.; Goyal, V.; Ekbal, A. Can Large Language Models Unlock Novel Scientific Research Ideas? arXiv 2025, arXiv:2409.06185. [Google Scholar] [CrossRef]
- Zhu, Y.; Jiang, X.; Lin, J.; Chen, H.; Liu, Z.; Wang, Y.; Zhang, M. Large Language Models for Information Retrieval: A Survey. ACM Trans. Inf. Syst. 2026, 44, 1–54. [Google Scholar] [CrossRef]
- Yue, Z.; Zhuang, H.; Bai, A.; Hui, K.; Jagerman, R.; Zeng, H.; Qin, Z.; Wang, D.; Wang, X.; Bendersky, M. Inference Scaling for Long-Context Retrieval Augmented Generation. arXiv 2025, arXiv:2410.04343. [Google Scholar] [CrossRef]
- Papageorgiou, G.; Sarlis, V.; Maragoudakis, M.; Tjortjis, C. A Multimodal Framework Embedding Retrieval-Augmented Generation with MLLMs for Eurobarometer Data. AI 2025, 6, 50. [Google Scholar] [CrossRef]
- DeepSeek-AI; Liu, A.; Feng, B.; Wang, B.; Wang, B.; Liu, B.; Zhao, C.; Dengr, C.; Ruan, C.; Dai, D.; et al. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv 2024, arXiv:2405.04434. [Google Scholar] [CrossRef]
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv 2017, arXiv:1701.06538. [Google Scholar] [CrossRef]
- DeepSeek-AI; Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; et al. DeepSeek-V3 Technical Report. arXiv 2024, arXiv:2412.19437. [Google Scholar] [CrossRef]
- DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Wang, P.; Zhu, Q.; Xu, R.; Zhang, R.; Ma, S.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025, arXiv:2501.12948. [Google Scholar] [CrossRef]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res. 2022, 23, 1–39. [Google Scholar]
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The LLaMA 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
- Wang, Y.; Ren, S.; Lin, Z.; Han, Y.; Guo, H.; Yang, Z.; Zou, D.; Feng, J.; Liu, X. Qwen2.5: A Comprehensive Series of Large Language Models. arXiv 2024, arXiv:2412.15119. [Google Scholar] [CrossRef]
- Gumaan, E. ExpertRAG: Efficient RAG with Mixture of Experts. arXiv 2025, arXiv:2504.08744. [Google Scholar] [CrossRef]
- Wen, X.; Liu, Z.; Zheng, S.; Ye, S.; Wu, Z.; Wang, Y.; Xu, Z.; Liang, X.; Li, J.; Miao, Z.; et al. Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs. arXiv 2025, arXiv:2506.14245. [Google Scholar] [CrossRef]
- Liang, M.; Huang, W.; Liu, M.; Li, H.; Li, J. Lag-Relative Sparse Attention in Long Context Training. arXiv 2025, arXiv:2506.11498. [Google Scholar] [CrossRef]
- Ni, S.; Chen, G.; Li, S.; Chen, X.; Li, S.; Wang, B.; Wang, Q.; Wang, X.; Zhang, Y.; Fan, L.; et al. A Survey on Large Language Model Benchmarks. arXiv 2025, arXiv:2508.15361. [Google Scholar] [CrossRef]
- Datta, G.; Joshi, N.; Gupta, K. Analysis of Automatic Evaluation Metric on Low-Resourced Language: BERTScore vs BLEU Score. In Speech and Computer; Lecture Notes in Computer Science (LNCS); Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar] [CrossRef]
- Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv 2023, arXiv:2302.04761. [Google Scholar] [CrossRef]
- Chatoui, H.; Ata, O. Automated Evaluation of the Virtual Assistant in BLEU and ROUGE Scores. In Proceedings of the IEEE HORA Conference, Ankara, Turkey, 11–13 June 2021; IEEE: Piscataway, NJ, USA, 2021. [Google Scholar] [CrossRef]
- Ganguli, D.; Askell, A.; Schiefer, N.; Liao, T.I.; Lukošiūtė, K.; Chen, A.; Goldie, A.; Mirhoseini, A.; Olsson, C.; Hernandez, D.; et al. Do Large Language Models Know What They Know? On Data Contamination in Benchmarks. arXiv 2023, arXiv:2302.07459. [Google Scholar] [CrossRef]
- Jindal, M.; Shrawgi, H.; Agrawal, P.; Dandapat, S. SAGE: A Generic Framework for LLM Safety Evaluation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, Suzhou, China, 4–9 November 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 11–33. [Google Scholar] [CrossRef]
- Lin, Y.; Lin, H.; Xiong, W.; Diao, S.; Liu, J.; Zhang, J.; Pan, R.; Wang, H.; Hu, W.; Zhang, H.; et al. Mitigating the Alignment Tax of RLHF. arXiv 2023, arXiv:2309.06256. [Google Scholar] [CrossRef]
- Zhang, M.; Shen, Y.; Deng, J.; Wang, Y.; Zhang, Y.; Wang, J.; Liu, S.; Dou, S.; Sha, H.; Peng, Q.; et al. LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models. arXiv 2025, arXiv:2508.05452. [Google Scholar] [CrossRef]
- Raza, S.; Raval, A.; Chatrath, V. MBIAS: Mitigating Bias in Large Language Models While Retaining Context. arXiv 2024, arXiv:2405.11290. [Google Scholar] [CrossRef]
- Mothilal, R.K.; Roy, J.; Ahmed, S.I.; Guha, S. Human-Aligned Faithfulness in Toxicity Explanations. arXiv 2025, arXiv:2506.19113. [Google Scholar] [CrossRef]
- Kelsall, J.; Tan, X.; Bergin, A.; Chen, J.; Waheed, M.; Sorell, T.; Procter, R.; Liakata, M.; Chim, J.; Chi, S. Evaluating Large Language Models in Legal Use Cases. AI Soc. 2025. [Google Scholar] [CrossRef]
- Dai, D.; Deng, C.; Zhao, C.; Xu, R.X.; Gao, H.; Chen, D.; Li, J.; Zeng, W.; Yu, X.; Wu, Y.; et al. DeepSeekMoE: Towards Ultimate Expert Specialization. arXiv 2024, arXiv:2401.06066. [Google Scholar] [CrossRef]
- BehnamGhader, P.; Adlakha, V.; Mosbach, M.; Bahdanau, D.; Chapados, N.; Reddy, S. LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders. arXiv 2024, arXiv:2404.05961. [Google Scholar] [CrossRef]



| Category | Included Studies |
|---|---|
| Foundation Pre-training Objectives | 19 |
| Supervised/Instruction Fine-Tuning (SFT/IFT) | 9 |
| Alignment and Post-training + PEFT | 16 |
| RAG/Multimodal/Reasoning-centric (Sparse MoE case studies, e.g., DeepSeek) | 16 |
| Evaluation Benchmarks and Recent Advances | 9 |
| Total | 68 |
| Dataset | Type | Size | Notes | Refs |
|---|---|---|---|---|
| Common Crawl | Web text | ∼100 TB | Web-scale coverage; very large and noisy; extensive deduplication and filtering required | [5,9] |
| Wikipedia | Encyclopedia | ∼15 GB | High-quality curated source; widely used across multilingual LLM corpora | [9] |
| BookCorpus/ OpenWebText | Books, web | ∼10–20 GB | Literature-based corpora supporting general language and world knowledge | [5,10] |
| Code datasets | Code (GitHub, StackOverflow) | 10–50 GB | Specialized corpora enabling strong performance in code generation tasks | [6] |
| Multilingual corpora | Text | Varies | Supports cross-lingual transfer and multilingual instruction-following | [9] |
| Safety-filtered corpora | Text | Varies | Incorporates PII removal, toxicity filtering, and responsible data governance | [7,8] |
| RefinedWeb-style corpora | Web text | Large-scale | Curated web datasets designed to outperform raw Common Crawl through cleaning | [4] |
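As the notes column above indicates, web-scale corpora such as Common Crawl are unusable without aggressive cleaning. A toy sketch of one such step, exact-duplicate removal by content hashing, is shown below; production pipelines (e.g., those behind RefinedWeb) additionally use fuzzy methods such as MinHash, language identification, and quality/PII filters. All function names here are illustrative, not taken from any specific pipeline.

```python
import hashlib

def normalize(doc: str) -> str:
    """Lowercase and collapse whitespace so trivially differing copies match."""
    return " ".join(doc.lower().split())

def dedup(docs):
    """Keep the first occurrence of each normalized document."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

docs = ["Hello  world", "hello world", "Different page"]
print(len(dedup(docs)))  # 2 — the first two normalize to the same text
```

Exact hashing scales linearly with corpus size but misses near-duplicates, which is why large pretraining pipelines layer fuzzy deduplication on top of it.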
| Technique | Trainable Params Reduction | VRAM Reduction (Typical) | Integration | Engineering Notes | Refs |
|---|---|---|---|---|---|
| LoRA | ∼99.9% fewer trainable weights | ∼60–80% | Easy | Low-rank adapters replace full weight updates, enabling efficient adaptation with near full fine-tuning quality in many settings. | [37] |
| Prefix Tuning | ∼98–99% reduction | ∼40–60% | Moderate | Optimizes only prefix vectors injected into attention; reduces memory/compute, but can be less plug-and-play than LoRA. | [41] |
| Adapters | ∼95–99% reduction | ∼50–70% | Easy | Adds small trainable bottleneck modules between transformer blocks while freezing backbone weights; simple modular deployment. | [41] |
| QLoRA | 65B tuning on 1 × 48 GB GPU (4-bit) | ∼85–95% | Easy | Combines 4-bit quantization with LoRA to enable large-model fine-tuning under constrained VRAM while preserving quality. | [39] |
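The parameter reductions in the table above follow directly from the structure of the low-rank update. A minimal NumPy sketch of the LoRA idea (frozen weight W plus a trainable low-rank product scaled by alpha/r) illustrates why the trainable fraction collapses; the dimensions and names here are illustrative, not drawn from any LoRA library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 4, 8

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection (zero init)

def lora_forward(x):
    """Forward pass: frozen path plus scaled low-rank update B @ A."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapted model exactly matches the base model.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameter count: r*(d_in + d_out) instead of d_in*d_out.
trainable, full = A.size + B.size, W.size
print(f"trainable fraction: {trainable / full:.3f}")  # 0.125 at this toy size
```

At LLM scale (d in the thousands, r in the single digits) the same ratio r*(d_in + d_out)/(d_in*d_out) drops to roughly 0.1%, matching the table's "~99.9% fewer trainable weights".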
| Model | Arch. | Total | Active | Tokens | Training Cost Proxy | MMLU | Refs |
|---|---|---|---|---|---|---|---|
| DeepSeek-V2-Base | MoE | 236 B | 21 B | 8.1 T | ∼1.40 M H800 GPU-h † | 78.4 | [48] |
| DeepSeek-V3-Base | MoE | 671 B | 37 B | 14.8 T | 2.79 M H800 GPU-h | 87.1 | [48,50] |
| LLaMA-3.1-405B | Dense | 405 B | 405 B | – | – | 84.4 | [53] |
| Qwen2.5-72B | Dense | 72 B | 72 B | – | – | 85.0 | [54] |
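The active-vs-total parameter gap in the table above comes from sparse routing: each token passes through only a top-k subset of experts. The sketch below shows the routing pattern in miniature, under illustrative dimensions and names; real MoE layers add load-balancing losses and batched dispatch that are omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_experts, k = 16, 8, 2

W_router = rng.standard_normal((n_experts, d))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

def softmax(z):
    z = z - z.max()  # stabilized softmax
    e = np.exp(z)
    return e / e.sum()

def moe_forward(x):
    """Route x to the top-k experts and mix their outputs by gate weight."""
    logits = W_router @ x
    top = np.argsort(logits)[-k:]   # indices of the k highest-scoring experts
    gates = softmax(logits[top])    # gate weights renormalized over the top-k
    return sum(g * (experts[i] @ x) for g, i in zip(gates, top))

x = rng.standard_normal(d)
y = moe_forward(x)
# Only k of n_experts expert matrices were touched: active fraction k/n = 0.25.
```

The same mechanism, scaled up, is how DeepSeek-V3 activates 37B of 671B parameters per token.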
| Model | Architecture | Modalities | Retrieval Mechanism | Refs |
|---|---|---|---|---|
| BioGPT | Dense Transformer | Text | None | [44] |
| Claude 3 | Dense Transformer | Text, Image | Tool-augmented retrieval | [43] |
| DeepSeek (V2/V3/R1) | Sparse MoE | Text | External/system-level RAG | [48,50,51] |
| Gemini | Multimodal Dense | Text, Image, Audio | Tool-augmented retrieval | [43] |
| GPT-4 class models | Multimodal Dense | Text, Image | Tool-augmented retrieval | [42,43] |
| LawGPT | Dense Transformer | Text | None | [44] |
| RAG-LLaMA | Dense + Native RAG | Text + Docs | Native RAG integration | [45] |
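The retrieval mechanisms listed above share a common core: score candidate passages against the query in an embedding space and prepend the best matches to the prompt. A minimal sketch of that step follows, with random vectors standing in for learned embeddings; the names and prompt layout are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
passages = ["doc A", "doc B", "doc C"]
emb = rng.standard_normal((3, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows

def retrieve(query_vec, k=2):
    """Return the k passages with highest cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = emb @ q
    top = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in top]

query = rng.standard_normal(8)
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: ..."
```

Tool-augmented retrieval (Claude, Gemini, GPT-4 class) moves this step outside the model behind an API call, while native RAG integration trains the generator jointly with the retriever.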
| Benchmark | Task Type | Metrics | Notes | Refs |
|---|---|---|---|---|
| GLUE/SuperGLUE | NLP | Accuracy, F1 | General language understanding benchmarks widely used for model comparison | [58] |
| MMLU | Knowledge/reasoning | Accuracy | Multiple-choice tasks across diverse academic and professional domains | [58] |
| BIG-bench | Mixed | Accuracy, F1, human eval | Large-scale benchmark covering broad emergent model capabilities | [58] |
| HumanEval | Code generation | Pass@k | Python (v3.8, as specified in the original benchmark release) | [6,58] |
| Domain-specific | Biomedical, Legal | Accuracy, BLEU | Specialized evaluation suites for expert-domain reliability testing | [58] |
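The Pass@k metric listed for HumanEval has a standard unbiased estimator (Chen et al. [6]): given n sampled completions of which c pass the unit tests, pass@k = 1 − C(n−c, k)/C(n, k). A direct implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples from n is correct)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 30 of which pass the tests.
print(round(pass_at_k(200, 30, 1), 4))  # 0.15 (reduces to c/n when k = 1)
```

Computing the ratio of binomial coefficients directly, rather than averaging over random k-subsets, avoids sampling variance in the reported score.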
| Method | Description | Advantages | Disadvantages | Example Models | Refs |
|---|---|---|---|---|---|
| Pre-training (CLM, MLM, Span Corruption, UL2) | Self-supervised learning on massive text corpora | Learns general language/ world knowledge; foundation for downstream tasks | Very high computational cost; large data requirements; may encode biases | GPT, BERT, LLaMA | [1,5,10,13] |
| Supervised Fine-Tuning (SFT) | Task-specific supervised training on labeled datasets | Improves task performance; better instruction-following | Limited generalization; expensive to curate large labeled datasets | FLAN-T5, Qwen | [22,23] |
| Instruction Tuning | Fine-tuning on datasets of instructions and responses | Enhances generalization to unseen tasks; improves adherence to instructions | Requires high-quality instruction datasets; computationally intensive | FLAN, Self-Instruct | [24,25] |
| Reinforcement Learning from Human Feedback (RLHF/DPO) | Optimizes model outputs using reward modeling and preference-based objectives | Aligns model behavior with human values; reduces harmful outputs | Complex pipeline; expensive; requires human labeling | ChatGPT-class systems; Claude-class systems | [26,27,38] |
| Parameter-Efficient Fine-Tuning (PEFT) | Updates only small subsets of parameters (e.g., LoRA, QLoRA) | Low memory and compute requirements; fast adaptation to new tasks | May underperform full fine-tuning; requires careful configuration | LLaMA-LoRA, QLoRA models | [37,39,41] |
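The RLHF/DPO row above can be made concrete with the DPO objective (Rafailov et al. [38]), which optimizes preferences without an explicit reward model. The sketch below takes summed token log-probabilities of the chosen (y_w) and rejected (y_l) responses under the policy and a frozen reference model; the numeric values are made-up illustrations, not from any experiment.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy prefers the chosen response more strongly than the reference
# does, the margin is positive and the loss drops below log 2 (the value at
# zero margin, where the policy and reference agree).
loss = dpo_loss(logp_w=-10.0, logp_l=-14.0, ref_logp_w=-12.0, ref_logp_l=-12.0)
assert 0.0 < loss < math.log(2)
```

Because the reference log-probabilities appear only inside the margin, the frozen reference model acts as the implicit KL anchor that full RLHF enforces with an explicit penalty, which is why DPO avoids the reward-model and PPO stages of the pipeline.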
| Paradigm | Efficiency Focus | Alignment/Reasoning Focus | Refs |
|---|---|---|---|
| Dense pre-training (GPT/LLaMA-style) | High-compute scaling | General capability foundation | [5,9] |
| PEFT methods (LoRA, adapters, prompt-tuning) | Reduced fine-tuning cost | Task adaptation with minimal parameter drift | [37,39,41] |
| RLHF/DPO preference alignment | Moderate post-training overhead | Helpfulness, safety, preference shaping; potential trade-offs | [27,38,64] |
| Sparse MoE (DeepSeek-style) | High parameter/compute efficiency | Competitive reasoning with sparse activation | [48,49,52] |
| RLVR/verifiable reward alignment | Post-training efficiency | Scalable correctness-driven reasoning-centric alignment | [56] |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Karydas, D.; Margaritis, D.; Leligou, H.C. Training Methods for Large Language Models: Current Approaches and Challenges. Technologies 2026, 14, 133. https://doi.org/10.3390/technologies14020133

