Proceeding Paper

Architectural and Methodological Advancements in Large Language Models †

1 Effyis Group, 140 BIS Rue de Rennes, 75006 Paris, France
2 Institut National des Postes et Télécommunications (INPT), Av. Allal Al Fassi, Rabat 10102, Morocco
* Author to whom correspondence should be addressed.
Presented at the 1st International Conference on Smart Management in Industrial and Logistics Engineering (SMILE 2025), Casablanca, Morocco, 16–19 April 2025.
Eng. Proc. 2025, 97(1), 8; https://doi.org/10.3390/engproc2025097008
Published: 9 June 2025

Abstract

The evolution of large language models (LLMs) has been marked by significant architectural and methodological breakthroughs that have redefined the landscape of natural language processing. This review examines the key techniques driving modern LLMs, including foundational architectures, novel training methodologies, and cutting-edge performance benchmarks. In addition to offering a performance overview, this work presents a focused and up-to-date architectural benchmark that highlights key design differences between the best-performing open-source and closed-source LLMs, providing actionable insights into their underlying components. Beyond the performance comparison, our analysis details the inherent limitations of the monolithic transformer architecture and outlines emerging strategies. By bridging open-source innovations and proprietary advancements, this review offers a balanced resource for researchers and practitioners navigating this rapidly evolving field.

1. Introduction

The pursuit of artificial intelligence (AI) systems capable of understanding and generating human language dates back to Turing’s foundational work [1]. Early approaches relied on statistical language models (SLMs) [2,3,4], which were effective for simple natural language processing (NLP) tasks but limited in their ability to capture the complexity of language. The introduction of neural language models (NLMs) [5,6,7] significantly advanced the field by using neural architectures to better represent linguistic context.
A decisive breakthrough emerged with the transformer architecture [8]. Its attention-based mechanism became fundamental for pretrained language models (PLMs) such as BERT [9] and GPT-2 [10], demonstrating superior performance across NLP benchmarks. Scaling this paradigm led to large language models (LLMs), exemplified by GPT-3 [11], which revealed emergent capabilities like zero-shot and few-shot learning [12].
This review contributes by providing not only a performance benchmark of the leading LLMs but also a detailed architectural analysis. Our benchmark is designed to be lightweight yet focused, emphasizing key design components (e.g., normalization methods, positional encoding, and activation functions) that differentiate these models. This dual perspective clarifies both empirical performance and the architectural innovations that have driven recent advances.

2. Evolution of Language Models

2.1. Statistical to Neural Transition

Statistical language models [2,3,4] initially dominated NLP tasks but were restricted by the “curse of dimensionality”. Neural language models [5,6,7] introduced distributed representations that improved language understanding.

2.2. Transformer Era

The 2017 paper “Attention Is All You Need” [8] proposed the transformer, eliminating recurrent dependencies in favor of self-attention. This led to the rapid rise of PLMs such as BERT [9] and GPT-2 [10]. Subsequent scaling gave rise to LLMs, exemplified by GPT-3 [11], enabling capabilities like few-shot and zero-shot learning [12].

2.3. Open-Source Initiatives

While closed-source models remain highly publicized, open-source equivalents increasingly rival their performance. LLaMA [13] and its variants have facilitated broader research by making the weights of capable foundation models openly accessible, and projects like Vicuna [14] illustrate how community-driven fine-tuning can achieve near-state-of-the-art performance.

3. Architectural Innovations and Techniques in Large Language Models

The architecture of LLMs serves as the foundational framework that dictates their capabilities, efficiency, and range of applications. Over the years, a multitude of architectures and techniques have been proposed, each with unique strengths and limitations. Recent advancements have ushered in an era of hyper-scaled models that leverage innovative techniques to push the boundaries of what is computationally feasible and functionally possible.
The performance metrics presented in Table 1 offer a quantitative comparison between several state-of-the-art LLMs. In particular, differences in exact match scores, pass rates, and problem-solving percentiles reflect both algorithmic innovations and architectural choices. Although performance is crucial, our focus extends to examining how differences in model components, such as normalization methods, positional encoding strategies, and activation functions, affect overall capabilities. This integrated analysis provides researchers with a clearer picture of the trade-offs between model scale, training complexity, and architectural efficiency.

3.1. Mainstream Architectures

The primary architectural paradigms used in LLMs include the following:
  • Encoder–Decoder: Pioneered by Vaswani et al. [8], this architecture is specialized for tasks that necessitate both understanding and generation of sequences. It comprises two main components: an encoder that interprets the input and a decoder that produces the output. Models such as T5 [17] and BART [18] exemplify its effectiveness (see Figure 1).
  • Causal Decoder: Predominantly employed for autoregressive sequence generation, causal decoders condition each token on its preceding tokens. The GPT models [10,11] are quintessential examples of this approach (see Figure 1).
  • Prefix Decoder: Designed for tasks requiring simultaneous encoding and decoding within a single step [19], prefix decoders blend the capabilities of causal decoders and encoder–decoder models. An example is GLM-130B [20], which combines bidirectional context encoding with autoregressive token prediction (see Figure 1).
  • Mixture-of-Experts (MoE): This approach extends the aforementioned architectures by activating only a subset of expert networks for each input token (see Figure 2 and the routing sketch after this list). Notable models leveraging MoE include the following:
    Mixtral 8x7B (December 2023): A sparse mixture-of-experts (SMoE) model with open weights released under the Apache 2.0 license that outperforms Llama 2 70B on selected benchmarks.
    DeepSeek-V3 (26 December 2024): A large MoE model with 671B parameters, where only 37B parameters are activated for each token, enabling efficient scaling and performance.
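To make the routing idea concrete, the following minimal sketch (our illustration in PyTorch, with arbitrary dimensions, rather than code taken from Mixtral or DeepSeek-V3) implements a top-K gating layer: a linear router scores all experts, each token is dispatched to its K = 2 highest-scoring feed-forward experts, and the expert outputs are combined using softmax-normalized router weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparse MoE layer: each token is processed only by the K experts
    with the highest router scores (K = 2 in Mixtral-style routing)."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.size(-1))                # flatten (batch, seq) into tokens
        scores = self.router(tokens)                      # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)          # normalize over the K selected experts
        out = torch.zeros_like(tokens)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e             # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)

moe = TopKMoE(d_model=512, d_ff=2048, n_experts=8, k=2)
y = moe(torch.randn(2, 16, 512))                          # output shape (2, 16, 512)

Production MoE layers additionally use load-balancing losses and expert-capacity limits so that tokens are spread evenly across experts and devices; these details are omitted here for brevity.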

3.2. Detailed Configuration

3.2.1. Normalization Methods

Normalization techniques play a pivotal role in stabilizing the training of large language models. Various methods have been employed:
  • LayerNorm: Computes the mean and variance for each layer’s activations; widely used in models like GPT-3 [11] and BERT [9].
  • RMSNorm: A computationally efficient alternative that rescales activations based on the root mean square [21]. The LLaMA family [13] leverages RMSNorm for its efficiency gains (see the sketch after this list).
  • DeepNorm: Proposed to stabilize extremely deep transformers by scaling residual connections [22]. GLM-130B [20] adopts DeepNorm in its architecture.
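To show how lightweight RMSNorm is compared with LayerNorm, the sketch below (ours, following the formulation in [21]; PyTorch assumed) rescales activations by their root mean square, omitting the mean subtraction and bias term of LayerNorm.

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square layer normalization [21]: no mean subtraction, no bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))    # learned gain only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # reciprocal root mean square over the feature dimension
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * inv_rms)

x = torch.randn(4, 512)
print(nn.LayerNorm(512)(x).shape, RMSNorm(512)(x).shape)   # LayerNorm also centres and adds a bias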

3.2.2. Position Embeddings

Position embeddings encode the sequential information of input tokens. Common methods include the following:
  • Absolute: Adds a learned position-specific vector to each input token [9].
  • Relative: Primarily seen in models like T5 [17], it modifies attention scores based on token distance.
  • Rotary (RoPE) and ALiBi: Rotary embeddings rotate query and key vectors by position-dependent angles so that attention scores depend on relative offsets, whereas ALiBi adds a distance-proportional bias to the attention scores. PaLM 2 [23] employs rotary position embeddings for handling long sequences (see the sketch after this list).
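As a concrete illustration, the simplified function below (ours, using the common half-split RoPE formulation rather than any cited model's exact code) rotates pairs of query/key dimensions by angles proportional to the token position, so that the dot product between a rotated query and key depends only on their relative offset.

import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings (RoPE) to x of shape (batch, seq, dim).
    Dimension pairs (i, i + dim/2) are rotated by the angle pos * base**(-2i/dim)."""
    _, seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)          # (half,)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]                                      # split feature dim in two
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)

q = torch.randn(2, 16, 64)
q_rot = rotary_embed(q)    # queries and keys are both rotated before computing attention scores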

3.2.3. Activation Functions

Activation functions introduce non-linearity into the model, significantly impacting performance:
  • GeLU: A balanced choice for both efficiency and representational power [24].
  • SwiGLU and GeGLU: Variants of Gated Linear Units (GLUs) that often yield better empirical performance; used in PaLM 2 [23] and related large-scale models [25] (see the sketch after this list).
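For concreteness, the sketch below (ours; hidden sizes are arbitrary) contrasts a standard GeLU feed-forward block with a SwiGLU variant in the spirit of [25], in which one linear projection is passed through SiLU (Swish) and gates a second projection.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GeluFFN(nn.Module):
    """Standard transformer feed-forward block with a GeLU non-linearity [24]."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.gelu(self.w_in(x)))

class SwiGLUFFN(nn.Module):
    """Gated variant: a SiLU-activated projection gates a second projection [25]."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 512)
# a d_ff of roughly 2/3 * 4 * d_model keeps the SwiGLU parameter count comparable to the GeLU block
print(GeluFFN(512, 2048)(x).shape, SwiGLUFFN(512, 1365)(x).shape)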

3.3. Attention Mechanisms

Attention mechanisms form the cornerstone of sequence modeling in modern LLMs:
  • Full Attention: The original transformer uses full self-attention, considering all token pairs in a sequence [8].
  • Sparse Attention: Reduces computation by attending only to local or patterned subsets of tokens [26,27].
  • Multi-query/Grouped-query Attention: Query heads share key and value projections (a single shared set in multi-query attention, one set per group of heads in grouped-query attention), saving memory with little loss in quality [28] (see the sketch after this list).
  • FlashAttention: Reorganizes the computation to use GPU memory more efficiently, preserving exact attention [29].
  • PagedAttention: Splits sequences into non-contiguous blocks to further optimize GPU memory utilization in LLM servers [30].
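To illustrate the memory saving behind multi-query and grouped-query attention, the sketch below (ours, with arbitrary dimensions) shares each key/value head among a group of query heads; setting n_kv_heads = 1 recovers multi-query attention [28], and n_kv_heads = n_heads recovers standard multi-head attention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Causal self-attention in which n_heads query heads share n_kv_heads key/value heads."""
    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        self.kv_proj = nn.Linear(d_model, 2 * n_kv_heads * self.d_head, bias=False)
        self.out_proj = nn.Linear(n_heads * self.d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k, v = self.kv_proj(x).chunk(2, dim=-1)
        k = k.view(b, s, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = v.view(b, s, self.n_kv_heads, self.d_head).transpose(1, 2)
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)            # reuse each K/V head within its group
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal full attention
        return self.out_proj(out.transpose(1, 2).reshape(b, s, -1))

gqa = GroupedQueryAttention(d_model=512, n_heads=8, n_kv_heads=2)
y = gqa(torch.randn(2, 16, 512))                         # output shape (2, 16, 512)

As a practical note, recent PyTorch releases can dispatch scaled_dot_product_attention to a fused FlashAttention-style kernel when one is available, which is how many practitioners benefit from the IO-aware technique of [29] without writing custom kernels.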

3.4. Training Strategies for Large Language Models

The training of LLMs requires specialized techniques to manage immense computational complexity. The key strategies are outlined below.

3.4.1. Optimization Settings

The following optimization settings are critical for stabilizing and enhancing the training of LLMs (a combined sketch follows the list):
  • Batch Training: Stability often depends on batch-size scheduling; GPT-3, for example, gradually increased its batch size from 32K tokens to 3.2M tokens over the course of training [11].
  • Learning-Rate Scheduling: A linear warm-up followed by cosine decay is common.
  • Choice of Optimizer: Adam/AdamW with (β₁, β₂) = (0.9, 0.95) and ε = 10⁻⁸ is standard; Adafactor [31] is also used to reduce memory usage.
  • Stabilization Techniques: Gradient clipping and weight decay are widely adopted [32,33].
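A combined configuration reflecting these settings might look as follows (PyTorch assumed; the tiny model, peak learning rate, and step counts are placeholders rather than values from any cited system).

import math
import torch

model = torch.nn.Linear(512, 512)                  # placeholder for an actual LLM
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)

warmup_steps, total_steps, min_ratio = 2_000, 100_000, 0.1

def lr_lambda(step: int) -> float:
    # linear warm-up followed by cosine decay to 10% of the peak learning rate
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + 0.5 * (1.0 - min_ratio) * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(3):                              # a few illustrative steps with a dummy loss
    loss = model(torch.randn(32, 512)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient clipping
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()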

3.4.2. Scalable Training Methodologies

To efficiently handle the computational demands of large-scale models (see Table 2), several scalable training methodologies have been developed:
  • 3D Parallelism: Combines data, pipeline, and tensor parallelism to distribute computational loads [34,35,36].
  • ZeRO Optimization: Minimizes memory redundancy in data-parallel training by partitioning optimizer states, gradients, and parameters [37].
  • Mixed-Precision Training: Employs 16-bit (FP16) or BF16 numerics to accelerate training while keeping master copies of the weights in higher precision [38] (see the sketch below).
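Of these three, mixed-precision training is the simplest to show compactly. The sketch below (ours; PyTorch assumed) runs the forward pass under BF16 autocast while the parameters stay in FP32; FP16 setups typically add a gradient scaler, and 3D parallelism and ZeRO require multi-device launchers such as Megatron-LM or DeepSpeed, so they are not reproduced here.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)               # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for _ in range(3):                                         # a few illustrative steps
    x = torch.randn(32, 512, device=device)
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = model(x).pow(2).mean()                      # matmuls run in BF16 where safe
    loss.backward()                                        # gradients land in the FP32 parameters
    optimizer.step()
    optimizer.zero_grad()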

4. Conclusions

In the rapidly evolving landscape of large language models (LLMs), understanding their underlying architectures, training strategies, and key technical advances is crucial. This paper has provided a concise yet comprehensive overview of recent progress in open-source LLMs, highlighting essential techniques such as mixture-of-experts, advanced attention mechanisms, normalization layers, and scalable training methodologies.
Moreover, our review underscores a critical limitation inherent to the transformer architecture—the difficulty of integrating its monolithic design with traditional neural network modules. Future research may benefit from advanced approaches to integrate external tools that enhance accuracy and enable richer fact-checking. These directions promise to mitigate current challenges in training efficiency and multimodal integration, paving the way for more adaptable and robust language models.
It is evident that the gap between open-source and closed-source models is narrowing thanks to rapid innovations in architectures and training methods. As we look to the future, it becomes increasingly clear that the democratization of these technologies will spur wider applications in both academia and industry, potentially transforming how we interact with AI-driven language systems.

Author Contributions

Conceptualization, Z.Z.; methodology, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, O.S.; supervision, O.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

The authors would like to thank the Effyis Group and the Institut National des Postes et Télécommunications (INPT) for their support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Turing, A.M. Computing Machinery and Intelligence. Mind 1950, 59, 433–460.
  2. Jelinek, F. Statistical Methods for Speech Recognition; MIT Press: Cambridge, MA, USA, 1998.
  3. Gao, J.; Lin, C. Introduction to the special issue on statistical language modeling. ACM Trans. Asian Lang. Inf. Process. 2004, 3, 87–93.
  4. Rosenfeld, R. Two decades of statistical language modeling: Where do we go from here? Proc. IEEE 2000, 88, 1270–1288.
  5. Bengio, Y.; Ducharme, R.; Vincent, P.; Jauvin, C. A neural probabilistic language model. J. Mach. Learn. Res. 2003, 3, 1137–1155.
  6. Mikolov, T.; Karafiát, M.; Burget, L.; Černocký, J.; Khudanpur, S. Recurrent neural network based language model. In Proceedings of the INTERSPEECH, Makuhari, Japan, 26–30 September 2010; pp. 1045–1048.
  7. Kombrink, S.; Mikolov, T.; Karafiát, M.; Burget, L. Recurrent neural network based language modeling in meeting recognition. In Proceedings of the INTERSPEECH, Florence, Italy, 27–31 August 2011; pp. 2877–2880.
  8. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
  9. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
  10. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
  11. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020.
  12. Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling Laws for Neural Language Models. arXiv 2020, arXiv:2001.08361.
  13. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971.
  14. Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Hon, H.-W.; Zhang, X. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality. Available online: https://lmsys.org/blog/2023-03-30-vicuna/ (accessed on 1 October 2023).
  15. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4–8 May 2021.
  16. Chen, X.; Drori, I.; Pech, H.; Barzilay, R.; Jaakkola, T. Mathematical Language Understanding Evaluation (MATH). arXiv 2021, arXiv:2103.03874.
  17. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
  18. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the ACL, Online, 5–10 July 2020; pp. 7871–7880.
  19. Zhang, A.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.T.; Li, X.; Lin, X.V.; et al. Extending Sequence-to-Sequence Models with Prefix Decoders. arXiv 2022, arXiv:2210.02143.
  20. Zeng, A.; Liu, X.; Du, Z.; Wang, Z.; Lai, H.; Ding, M.; Yang, Z.; Xu, Y.; Zheng, W.; Xia, X.; et al. GLM-130B: An Open Bilingual Pre-trained Model. arXiv 2022, arXiv:2210.02414.
  21. Zhang, B.; Sennrich, R. Root Mean Square Layer Normalization. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
  22. Wang, H.; Ma, S.; Dong, L.; Huang, S.; Zhang, D.; Wei, F. DeepNet: Scaling Transformers to 1000 Layers. arXiv 2022, arXiv:2203.00555.
  23. Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Che, Z.; et al. PaLM 2 Technical Report. arXiv 2023, arXiv:2305.10403.
  24. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (GELUs). arXiv 2016, arXiv:1606.08415.
  25. Shazeer, N. GLU Variants Improve Transformer. arXiv 2020, arXiv:2002.05202.
  26. Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontañón, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big Bird: Transformers for Longer Sequences. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020.
  27. Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating Long Sequences with Sparse Transformers. arXiv 2019, arXiv:1904.10509.
  28. Shazeer, N. Fast Transformer Decoding: One Write-Head is All You Need. arXiv 2019, arXiv:1911.02150.
  29. Dao, T.; Fu, D.Y.; Ermon, S.; Rudra, A.; Ré, C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022.
  30. vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention. Available online: https://vllm.ai/ (accessed on 1 October 2023).
  31. Shazeer, N.; Stern, M. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. In Proceedings of the ICML, Stockholm, Sweden, 10–15 July 2018; pp. 4596–4604.
  32. Pascanu, R.; Mikolov, T.; Bengio, Y. On the Difficulty of Training Recurrent Neural Networks. In Proceedings of the ICML, Atlanta, GA, USA, 16–21 June 2013; pp. 1310–1318.
  33. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101.
  34. Shoeybi, M.; Patwary, R.; Puri, R.; LeGresley, P.; Casper, J.; Catanzaro, B. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv 2019, arXiv:1909.08053.
  35. Huang, Y.; Cheng, Y.; Bapna, A.; Firat, O.; Chen, D.; Chen, M.X.; Lee, H.; Ngiam, J.; Le, Q.V.; Wu, Y.; et al. GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism. arXiv 2019, arXiv:1811.06965.
  36. Harlap, A.; Narayanan, D.; Phanishayee, A.; Seshadri, V.; Devanur, N.R.; Ganger, G.R.; Zaharia, M. PipeDream: Fast and Efficient Pipeline Parallel DNN Training. arXiv 2018, arXiv:1806.03377.
  37. Rajbhandari, S.; Rasley, J.; Ruwase, O.; He, Y. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In Proceedings of the SC20, Atlanta, GA, USA, 9–19 November 2020; pp. 1–16.
  38. Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed Precision Training. In Proceedings of the ICLR, Vancouver, BC, Canada, 30 April–3 May 2018.
Figure 1. Comparison of mainstream decoder architectures: The encoder–decoder structure focuses on bidirectional attention for input encoding and target decoding. The prefix decoder blends bidirectional encoding with autoregressive decoding for more contextualized predictions. The causal decoder uses masked attention to condition predictions solely on preceding tokens, facilitating sequential generation.
Figure 2. Illustration of the mixture-of-experts (MoE) architecture: The router dynamically selects two feed-forward networks (FFNs) per token, based on input relevance (K = 2), to optimize computational efficiency. The architecture employs add and normalize layers for token aggregation and ensures parallelism across multiple tasks.
Table 1. Comparison of performance metrics between selected LLMs. EM = exact match; pass@1 = top-1 pass rate; “–” = not reported.
Benchmark | DeepSeek-V3 | Qwen2.5-72B-Inst | Llama-3.1-405B-Inst | GPT-4 | GPT-3.5 (0-shot CoT) | Claude-3.5
MMLU [15] (EM) | 75.9 | 71.6 | 73.3 | 72.6 | – | 78.0
GPQA-Diamond (pass@1) | 59.1 | 49.0 | 51.1 | 49.9 | 77.3 | 65.0
MATH 500 [16] (EM) | 90.2 | 80.0 | 73.8 | 74.6 | 94.8 | 78.3
AIME 2024 (pass@1) | 39.2 | 23.3 | 23.3 | 9.3 | 74.4 | 16.0
Codeforces (percentile) | 51.6 | 24.8 | 25.3 | 23.6 | 86.0 | 20.3
SWE-bench (Resolved) | 42.0 | 23.8 | 23.8 | 38.8 | – | 50.8
Table 2. Key architectural differences between selected LLMs.
Model | Size | Category | Normalization | PE | Activation
DeepSeek-V3 | 671 B | Causal decoder | Pre-LayerNorm | Learned | SwiGLU
Qwen2.5-72B-Inst | 72 B | Causal decoder | RMSNorm | RoPE | SwiGLU
Llama-3.1-405B-Inst | 405 B | Causal decoder | RMSNorm | RoPE | SwiGLU
GPT-4 | 500 B | Causal decoder | Pre-LayerNorm | Learned | GeLU
GPT-3.5 | 175 B | Causal decoder | Pre-LayerNorm | Learned | GeLU
Claude-3.5 | 100 B | Causal decoder | Pre-LayerNorm | RoPE | SwiGLU
