A Survey of Large Language Models: Evolution, Architectures, Adaptation, Benchmarking, Applications, Challenges, and Societal Implications
Abstract
1. Introduction
Objectives, Scope, and Methodology
- Trace the evolution of language models, from early rule-based systems to modern Transformer-based foundation models, reviewing the key innovations and paradigm shifts along the way (see Section 2).
- Establish a taxonomy of popular LLM architectures, including encoder-only, decoder-only, encoder-decoder (sequence-to-sequence), and multimodal models, detailing their design principles, capabilities, and typical use cases (see Section 3).
- Describe the core training and adaptation methodologies, including large-scale self-supervised pre-training, task-specific fine-tuning, and adaptation techniques such as reinforcement learning from human feedback (RLHF) and parameter-efficient fine-tuning (PEFT), supporting efficient and scalable deployment (see Section 4).
- Review benchmarks and evaluation methods used to assess model performance across tasks, including reasoning, factual correctness, robustness, and linguistic understanding (see Section 5).
- Survey real-world applications of LLMs across diverse domains, including scientific discovery, software engineering, healthcare, and the emerging area of agentic AI, in which models act as reasoning engines for autonomous systems (see Section 6).
- Examine the economic implications of LLM development and deployment, including training and inference costs, infrastructure dependencies, labor market shifts, and growing inequalities in access and benefits (see Section 7).
- Highlight emerging challenges and open research questions, including hallucination, ethical risks, resource efficiency, and the broader societal impacts of LLM deployment (see Section 8).
2. Evolution of Language Modeling
2.1. Rule-Based Models (Pre–1990s)
2.2. Statistical Models (1990s–2000s)
2.3. Sequential Neural Language Models (2000s–2020s)
2.4. Transformer-Based Models (Late 2010s–Present)
3. Model Architectures
3.1. Decoder-Only Models
3.2. Encoder-Only Models
3.3. Sequence-to-Sequence Models
3.4. Multimodal Models
3.5. Mixture of Experts (MoE) Models
4. Training and Adaptation
4.1. Pre-Training
4.1.1. Fine-Tuning
4.1.2. Prompt Engineering and In-Context Learning
4.1.3. Instruction Tuning
4.1.4. Reinforcement Learning from Human Feedback (RLHF)
4.2. Parameter-Efficient Fine-Tuning
4.2.1. Adapter-Based Methods
4.2.2. Prompt-Based Methods (Soft Prompts)
4.2.3. Reparameterization-Based Methods
5. Benchmarking and Evaluation
5.1. Benchmarking
5.1.1. MMLU
5.1.2. BIG-Bench
5.1.3. SuperGLUE
5.1.4. HELM
5.2. Evaluation
5.2.1. ROUGE
- ROUGE-N: Evaluates the overlap of n-grams between generated and reference texts, counting matched n-grams (with frequency clipping). Commonly used for unigrams (ROUGE-1) and bigrams (ROUGE-2), it emphasizes recall, specifically how many reference n-grams are covered, making it suitable for summarization evaluation.
- ROUGE-L: Uses the longest common subsequence (LCS) between generated and reference texts to capture in-sequence overlap without requiring contiguous matches. It computes precision and recall over LCS length (and often their harmonic mean), with a sentence-level variant (ROUGE-Lsum) for multisentence inputs.
- ROUGE-S (Skip-Bigram): Matches word pairs in order but not necessarily adjacent, allowing more flexible overlap detection than strict n-grams. It counts skip-bigram matches to assess loosely ordered content overlap, though it remains surface-based without deeper semantic matching.
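To make these metrics concrete, the short Python sketch below computes ROUGE-1 recall (with frequency clipping) and an LCS-based ROUGE-L F1 for a toy candidate/reference pair. It is a minimal illustration of the definitions above rather than the official ROUGE toolkit: whitespace tokenization, no stemming, and no stopword handling are simplifying assumptions.

```python
from collections import Counter

def rouge_n_recall(candidate: str, reference: str, n: int = 1) -> float:
    """ROUGE-N recall: clipped n-gram matches divided by reference n-gram count."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())  # frequency clipping
    return overlap / max(sum(ref.values()), 1)

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L: precision/recall (and their harmonic mean) over the LCS length."""
    c, r = candidate.split(), reference.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]  # DP table for LCS length
    for i, cw in enumerate(c, 1):
        for j, rw in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if cw == rw else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(c)][len(r)]
    precision, recall = lcs / max(len(c), 1), lcs / max(len(r), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))  # 5 of 6 reference unigrams matched ≈ 0.83
print(rouge_l_f1(candidate, reference))           # LCS length 5, so F1 ≈ 0.83
```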
5.2.2. BLEU
6. Applications of LLMs
6.1. Software Engineering and Design
- Debugging and Refactoring: LLMs assist developers by offering bug fixes and code improvements [57].
6.2. Healthcare and Life Sciences
- EHR Data Synthesis: LLMs excel at analyzing complex, unstructured Electronic Health Records (EHRs). They can extract relevant information and generate concise patient summaries, with summarization quality that can surpass that of human experts [80].
- Diagnostic Assistance: These models can serve as powerful diagnostic aids by processing clinical notes and patient histories to suggest potential differential diagnoses in fields such as radiology [81]. Specialized models, such as Med-PaLM 2, have demonstrated expert-level performance on medical competency exams, underscoring their potential to augment clinical workflows [82].
- Literature Synthesis: LLMs extract and summarize findings from large corpora of medical papers, enabling faster insights [51].
6.3. Finance and Banking
- Investment Analysis and Strategy: LLMs perform sentiment analysis on financial news, social media, and earnings reports to identify market trends and inform investment strategies. These models process vast amounts of unstructured text data in real time to provide quantitative insights that support algorithmic trading and portfolio management [85].
- Compliance and Fraud Detection: Models analyze transactions and communications for anomalies indicative of fraud or regulatory violations [61].
- Chatbots and Virtual Assistants: Customer service is enhanced by LLMs that provide 24/7 support, reducing operational costs [62].
- Financial Reporting: LLMs generate and summarize reports, accelerating analyst workflows [63].
6.4. Manufacturing and Supply Chain
- Forecasting and Optimization: Demand prediction and supply chain optimization benefit from LLM-generated insights [64].
- Quality Control: Natural language interfaces aid in interpreting maintenance logs or sensor data [65].
- Engineering Education: LLMs provide customized support and tutoring for technical training [66].
6.5. Scientific Research and Discovery
- Hypothesis Generation and Literature Review: LLMs rapidly synthesize findings from thousands of papers [46]. For instance, tools like Elicit [47] and Semantic Scholar [86] leverage Transformer models to extract key claims, compare methodologies, and trace citations across large bodies of literature in seconds. More recent models like Google’s Gemini [30] can perform deep research by synthesizing information across multiple documents, analyzing data, and generating novel hypotheses, effectively acting as AI research assistants [87].
- Experiment Design and Analysis: Beyond understanding prior work, LLMs can support the planning and interpretation of experiments. For example, models like ChatGPT [1] and SciPIP [88] have been used to suggest experimental conditions, recommend statistical techniques, and simulate expected outcomes based on prior data [46]. In computational chemistry, LLMs have even been integrated into pipelines to optimize reaction conditions and propose novel molecular structures [48], and new frameworks are using them to make the entire automated machine learning (AutoML) process more explainable and user-friendly through natural language [89].
- Scientific Writing: LLMs assist researchers in drafting abstracts, summarizing findings, and organizing research manuscripts in line with academic standards. Tools such as PaperPal and Writefull utilize LLMs to enhance clarity, suggest citations, and correct grammar in real time. In addition, citation-aware models like SciBot [49] can automatically insert references and generate BibTeX entries based on context.
6.6. Education and Corporate Training
- Personalized Learning: LLMs dynamically tailor educational content to match a learner’s proficiency, interests, and preferred learning style. For instance, platforms like Khanmigo (by Khan Academy) use GPT-based models to deliver adaptive math explanations for students at varying levels [74].
- Virtual Tutoring: LLMs act as intelligent tutors that offer instant, 24/7 support across a wide range of topics. For example, Duolingo’s GPT-4-powered AI tutor provides personalized conversational practice in language learning, correcting errors and explaining grammar contextually [75].
6.7. Creative and Content Industries
- Writing and Journalism: LLMs like GPT-4 are used by outlets such as BuzzFeed to generate article drafts, headlines, and marketing copy [69]. These models accelerate content creation while allowing human editors to refine tone and accuracy.
- Sports Media and Entertainment: Domain-specific applications are emerging that showcase how LLMs can augment commentary, analysis, and fan engagement. Data-driven football match commentaries that combine real-time statistics with fluent narrative structures help enrich live sports coverage [90]. Similarly, natural language explanations of machine learning models of footballing actions bridge the gap between complex analytics and interpretable insights for coaches and analysts [91].
- Design and Architecture: Tools like Autodesk Forma integrate LLMs and generative models to assist with early-stage ideation and layout generation [73].
6.8. Legal and Regulatory Sectors
6.9. Autonomous AI Agents
7. Economic Implications of LLM Development and Deployment
7.1. The Foundational Costs: From Training to Deployment
- Training Costs: The initial pre-training of a foundation model is the most expensive phase, representing a significant front-loaded capital expenditure. It requires massive computational power, typically involving thousands of high-end GPUs or TPUs running continuously for weeks or months. The costs have escalated dramatically; while GPT-2 (1.5 billion parameters, 2019) cost an estimated $50,000 to train, Google’s PaLM (540 billion parameters, 2022) is estimated to have cost around $8 million, and the Megatron-Turing NLG 530B model over $11.35 million [98]. These costs are driven by the sheer scale of the model (billions or trillions of parameters) and the vast datasets (trillions of tokens) required to achieve state-of-the-art performance. This has concentrated development in industry, which produced 32 significant machine learning models in 2022 compared to just three from academia.
- Inference Costs: While training is a formidable one-time cost, inference, the process of using a trained model to generate outputs, is a persistent operational expense that can cumulatively surpass the initial training cost for widely used services. The core economic challenge is balancing the conflicting demands of latency and throughput. For example, an interactive, low-latency configuration for PaLM 540B achieves a Model FLOPS Utilization (MFU) of only 14%, while a high-throughput configuration reaches 76% MFU, roughly a five-fold difference in computational efficiency and cost. These bottlenecks stem from the massive memory footprint of the model weights and the KV cache, which can total 3 TB for a 540B-parameter model, and from the inherently sequential nature of autoregressive decoding, which limits parallelism [99]. Optimizing inference efficiency through techniques such as model quantization (e.g., using INT8 weights reduced PaLM's per-token latency by 23%), multi-query attention, and specialized hardware is therefore a critical area of research and economic concern [99]; a back-of-the-envelope estimate of these memory requirements is sketched after this list.
- Data Acquisition and Curation: While much of the data used for pre-training is scraped from the public web (e.g., Common Crawl), creating high-quality, clean, and diverse datasets is a significant undertaking. Furthermore, the data required for alignment stages like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) represents a substantial and often underestimated “hidden cost” driven by expensive, high-skill human labor. This phase requires thousands of hours of work from skilled labelers to generate demonstrations and rank model outputs to create preference datasets. These human-powered data generation efforts can add millions of dollars to the total development cost, an expense not captured in compute-based cost estimates like the $8 million figure for PaLM [98]. This human capital investment is a critical barrier to entry and a key component of a model’s total cost of ownership.
- Hardware Dependency: The development of LLMs has been largely dependent on the availability of powerful GPUs, with NVIDIA commanding a dominant market share [100]. This has created a hardware bottleneck where access to cutting-edge accelerators is a primary determinant of an organization’s ability to compete at the frontier of AI research.
- Cloud Infrastructure Dominance: Major cloud providers such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) are central to the LLM ecosystem. They provide the scalable, on-demand computing infrastructure necessary for both training and hosting LLMs. Strategic partnerships, such as Microsoft’s investment in OpenAI, highlight how cloud providers are positioning themselves as indispensable platforms for the AI economy, capturing a significant portion of the value generated by LLM applications [101].
- Scalability and Deployment Trade-offs: Organizations face a critical decision between using third-party LLM APIs (e.g., OpenAI, Anthropic) and deploying their own models (whether open-source or custom-built). Using APIs offers lower upfront costs and easier access but can lead to high long-term operational expenses and concerns over data privacy and model control. Self-hosting provides more control but requires significant investment in infrastructure and expertise. This trade-off is a central economic consideration for businesses integrating LLMs into their operations.
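To give a sense of how the memory figures discussed above arise, the sketch below estimates the serving footprint of a large decoder-only model from first principles. The architectural values (118 layers, an 18,432-dimensional hidden state, batch 256, 2048-token context) are illustrative assumptions at roughly the 540B-parameter scale, not published configuration details, and the arithmetic ignores activations and optimizer state.

```python
# Back-of-the-envelope serving-memory estimate for a large decoder-only LLM.
# All architectural numbers are illustrative assumptions, not official specs.

def weights_gb(params_billions: float, bytes_per_value: int = 2) -> float:
    """Weight memory in GB: billions of parameters times bytes per parameter."""
    return params_billions * bytes_per_value

def kv_cache_gb(n_layers: int, d_model: int, batch: int, seq_len: int,
                bytes_per_value: int = 2, kv_fraction: float = 1.0) -> float:
    """KV-cache memory in GB: 2 (K and V) x layers x hidden dim x tokens in flight.
    kv_fraction < 1 approximates multi-query / grouped-query attention."""
    values = 2 * n_layers * d_model * kv_fraction * batch * seq_len
    return values * bytes_per_value / 1e9

weights = weights_gb(540)                                  # ~1080 GB in bf16
mha = kv_cache_gb(118, 18432, batch=256, seq_len=2048)     # full multi-head cache
mqa = kv_cache_gb(118, 18432, batch=256, seq_len=2048,
                  kv_fraction=1 / 48)                      # one shared KV head of 48
print(f"weights ≈ {weights:.0f} GB, KV cache (MHA) ≈ {mha:.0f} GB, (MQA) ≈ {mqa:.0f} GB")
# With these assumed values the multi-head KV cache alone reaches the terabyte
# range, which is why multi-query attention and quantization are central to
# inference cost.
```

Under these assumptions the cache, not the weights, dominates at large batch sizes, consistent with the memory bottlenecks described above for high-throughput serving.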
7.2. Market Consolidation and Commercialization
- Market Concentration: The development of state-of-the-art LLMs—such as GPT-4, Gemini, and Claude—is currently viable only for a small group of corporations, including Google, OpenAI (partnered with Microsoft), Meta, and Anthropic, who possess the necessary capital, proprietary data, and large-scale compute infrastructure [98,103,104]. This concentration of model development capabilities has sparked growing concerns over an emerging “AI oligopoly,” in which a few firms dominate foundational AI technologies, limit open innovation, and shape the trajectory of the LLM ecosystem to serve proprietary interests [104].
- Commercialization and Access: These dominant firms primarily commercialize LLMs through usage-based APIs, which offer high performance but at costs often unaffordable for smaller businesses. In contrast, the open-source ecosystem (e.g., LLaMA, Mistral) provides alternatives, but these require in-house expertise and infrastructure [5]. Small and medium-sized enterprises (SMEs) face critical barriers—including limited budgets, talent shortages, and lack of cloud resources—that hinder their ability to adopt AI effectively [105,106,107]. As a result, there is a growing productivity gap between large firms rapidly scaling AI and smaller businesses struggling to compete [108].
- Economic Inequality and the Global AI Divide: The uneven diffusion of LLM benefits could intensify economic inequality [109]. Domestically, workers, firms, and regions with access to advanced AI tools may gain disproportionate advantages in productivity and profitability. Internationally, countries with limited access to AI development infrastructure risk falling further behind economically and technologically, exacerbating global inequalities [110].
- The Open-Source Ecosystem as a Counterbalance: In response to the market concentration driven by high development costs, the open-source community has emerged as a powerful force for democratizing AI. Pioneered by models like Meta’s LLaMA series [5,111,112] and further advanced by organizations like EleutherAI [113], the movement provides access to high-performance foundation models with permissive licenses. This enables researchers, startups, and smaller enterprises to innovate, conduct research, and build applications without being locked into proprietary API ecosystems. Open-source models foster transparency, enable security auditing, and allow for deep customization and fine-tuning on private data, capabilities often restricted by commercial vendors. While this approach significantly lowers the barrier to entry, it still requires substantial in-house computational resources and expertise to effectively deploy and maintain these models [114].
Table 8. Structural Drivers of Inequality in the LLM Economy.
Factor | Impact on Market Dynamics | Barriers for SMEs and Global South | Reference(s)
Market Consolidation | AI capabilities concentrated in a few tech giants | High entry cost excludes academia, small firms | [98,104]
API Commercialization | Usage-based pricing favors large-scale customers | Per-token cost unsustainable for startups/NGOs | [105,108]
Infrastructure Lock-in | Cloud platforms vertically integrate compute and model access | Self-hosting requires GPU access, engineering talent | [100,101]
Global Access Divide | Uneven distribution of AI benefits | Limited infrastructure, talent pipeline, and compute funding | [109,110]
7.3. Labor Market Disruption and Socioeconomic Inequality
- Wage and Skill Polarization: The integration of LLMs may exacerbate wage inequality. Workers with skills complementary to AI (e.g., prompt engineering, AI ethics, system integration) may see wage increases, while those performing tasks easily automated may face downward wage pressure [122]. This necessitates broad societal efforts focused on reskilling and upskilling the workforce to adapt to an AI-driven economy [110].
- Wage vs. Wealth Inequality: AI adoption could have opposing effects on inequality. A calibrated task-based model using UK household data suggests that while AI may reduce wage inequality by displacing some high-income workers, it could substantially increase wealth inequality. This occurs as capital owners and those whose productivity is complemented by AI capture a larger share of economic gains, highlighting a difficult trade-off for policymakers between fostering growth and managing wealth disparities [123].
- Demographic and Fiscal Pressures: The economic impacts of LLMs intersect with major demographic trends. In aging high-income economies, AI may compensate for shrinking workforces but could also reduce momentum for immigration policies that support fiscal stability [124]. As modeled by Tosun, demographic shifts directly influence public spending on education and human capital. Failure to adapt fiscal policy could amplify intergenerational pressures and undercut the public investments needed to prepare the workforce for an AI-driven economy [125].
- Population Aging: High-income economies experiencing demographic decline may increasingly rely on LLMs to sustain productivity. However, without inclusive reskilling initiatives and adaptive migration policies, these technologies risk shrinking the tax base, amplifying intergenerational fiscal pressures, and undermining public investment in education [126].
- Geographic Disparities: LLMs offer the potential to revitalize rural and underserved areas through applications like telehealth and remote education. However, this promise is contingent on equitable access to broadband infrastructure and local training, without which AI could worsen the rural-urban economic divide [127].
8. Recent Trends and Open Issues
8.1. Multimodal and Unified Architectures
8.2. Detection of LLM-Generated Content
8.3. Agentic LLMs and Tool-Augmented Reasoning
8.4. Scalability and Efficiency
8.5. Ethical Concerns, Regulation, and Societal Impact
- Authenticity and Misinformation: Distinguishing human-written from AI-generated text is becoming increasingly difficult, which threatens academic honesty, public trust in information, and the integrity of the digital ecosystem [149].
- Copyright and Data Provenance: A critical legal and ethical challenge stems from the common practice of training LLMs on vast datasets scraped from the internet, which often include copyrighted materials such as books, articles, and artwork without permission from the creators. This practice has led to numerous high-profile lawsuits from authors, artists, and news organizations who argue that it constitutes mass copyright infringement [150,151]. The central debate revolves around the doctrine of “fair use,” with technology companies arguing that training is a transformative use, while rights holders contend it devalues their intellectual property. This highlights the urgent need for transparency about where training data comes from and the development of ethical data-sourcing practices [152,153].
- Regulation and Governance: Although new rules, such as the EU AI Act [156], are beginning to address these concerns, their enforcement and auditability remain limited. This underscores the ongoing need for human oversight in the auditing process and clear, domain-specific guidelines for how these models are used within different communities.
8.6. Open Problems
- Detectability vs. Usability Trade-off. The challenge of detectability involves reliably identifying whether a piece of content was generated by an AI model. This field encompasses several technical strategies, including (a) watermarking, which embeds a secret, statistically detectable signal into the generated output [157,158]; (b) provenance tracking, which involves cryptographic methods to verify the origin of content [159,160]; and (c) model fingerprinting, which identifies the unique stylistic artifacts of a specific model [161,162]. A core open problem is that these detection approaches often degrade output fluency or introduce stylistic characteristics that can hinder creative or assistive writing [163,164]. Furthermore, the robustness of these detectors is often undermined by paraphrasing or adversarial prompt attacks, raising questions about their sustained utility [165,166]. A minimal sketch of statistical watermark detection follows this list.
- Dataset Contamination and Model Collapse. Data contamination refers to the unintentional inclusion of test data in a model’s training set, leading to inflated performance metrics and an inaccurate assessment of true generalization capabilities [167]. This problem manifests in two key ways: (a) benchmark contamination, where evaluation datasets are inadvertently scraped from the web and included in pre-training corpora, and (b) model collapse, a phenomenon where models trained on the synthetic outputs of previous models suffer a progressive loss of quality as the diversity of human-like language degrades and rare semantic patterns are lost [168]. Furthermore, paraphrased benchmarks can circumvent conventional data decontamination processes, thereby inflating performance estimates [169]. This highlights the critical need for contamination-resilient evaluation and dataset curation methodologies [170]. A toy illustration of n-gram-based decontamination, and of why paraphrasing circumvents it, also follows this list.
- Multilingual and Cross-Domain Generalization. Generalization in LLMs refers to the ability to perform effectively on new, unseen data and tasks that differ significantly from the training distribution [171]. A critical open problem is poor generalization in specific contexts, particularly in multilingual and cross-domain settings. Existing benchmarks are overwhelmingly English-centric, leaving low-resource languages and domain-specific tasks underrepresented [172,173]. When applying multilingual LLMs to long non-English contexts, performance can drop dramatically (e.g., from 96% in English to as low as 36% in Somali on multitarget retrieval tasks [172]), highlighting serious equity and inclusivity gaps.
- Long-Context Reasoning and Retrieval. Even models with extremely large context windows struggle with complex multistep reasoning across long texts. Tasks such as multi-matching and logic-based retrieval require chained reasoning and exceed existing attention and chain-of-thought capabilities unless decomposed into many steps [174,175]. Furthermore, simply increasing context length often yields diminishing returns or even performance degradation due to “hard negatives” or distracting information [176].
- Benchmark Diversity and Realism. Current benchmarks are often synthetic or English-centered. While the Needle-in-a-Haystack (NIAH) test assesses memory [177], it does not adequately measure deep comprehension or robust reasoning [178,179]. Emerging benchmarks (e.g., RULER [180], PangeaBench [181]) aim to address these gaps but are still limited in scope and cultural reach. A more comprehensive evaluation suite must cover multilingual, multimodal, and real-world reasoning challenges.
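As a concrete illustration of the watermarking strategy mentioned above, the sketch below implements a detector in the spirit of green-list watermarking schemes [157,158]: generation is biased toward a pseudorandom “green” subset of the vocabulary keyed on the previous token, and detection counts green tokens and applies a one-sided z-test. The hash-based green-list construction and the parameter values are simplified assumptions for illustration, not the exact scheme of any cited system.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary placed on the "green" list

def is_green(prev_token: str, token: str) -> bool:
    """Pseudorandom green-list membership, keyed on the preceding token (toy hash scheme)."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < GAMMA

def watermark_z_score(tokens: list) -> float:
    """One-sided z-test: did green tokens appear more often than the chance rate GAMMA?"""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    greens = sum(is_green(prev, tok) for prev, tok in pairs)
    n = len(pairs)
    return (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

# Unwatermarked text should hover near z ≈ 0; text generated with a green-list
# bias shows a large positive z (e.g., z > 4 as a detection threshold).
sample = "the model generates text one token at a time".split()
print(f"z = {watermark_z_score(sample):.2f}")
```

The fluency/detectability tension is visible even in this toy: the stronger the generation-time bias toward green tokens, the easier detection becomes, but the more the model's natural token distribution is distorted.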
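Similarly, a toy n-gram-overlap decontamination check, of the kind referenced in the dataset-contamination item above, is sketched below; it also makes clear why simple paraphrasing can defeat such filters. The 8-gram window and the overlap threshold are illustrative assumptions, not a standard drawn from the cited work.

```python
def ngram_set(text: str, n: int = 8) -> set:
    """Set of word n-grams used for surface-level overlap matching."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_item: str, training_doc: str,
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a benchmark item if a large fraction of its n-grams appear verbatim in training data."""
    item_ngrams = ngram_set(benchmark_item, n)
    if not item_ngrams:
        return False
    overlap = len(item_ngrams & ngram_set(training_doc, n)) / len(item_ngrams)
    return overlap >= threshold

benchmark = "What is the capital of France and when did it become the capital"
verbatim_leak = "trivia dump: what is the capital of france and when did it become the capital city"
paraphrase = "when was Paris made the French capital and what city holds that role today"

print(is_contaminated(benchmark, verbatim_leak))  # True: verbatim inclusion is caught
print(is_contaminated(benchmark, paraphrase))     # False: paraphrasing evades n-gram matching
```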
9. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
AI | Artificial Intelligence |
AWS | Amazon Web Services |
BERT | Bidirectional Encoder Representations from Transformers |
BIG-bench | Beyond the Imitation Game benchmark |
BLEU | Bilingual Evaluation Understudy |
CLM | Causal Language Modeling |
COPA | Choice of Plausible Alternatives |
CRF | Conditional Random Fields |
EDA | Electronic Design Automation |
EHR | Electronic Health Record |
FLOPS | Floating Point Operations per Second |
GCP | Google Cloud Platform |
GPU | Graphics Processing Unit |
GRU | Gated Recurrent Unit |
HDL | Hardware Description Language |
HELM | Holistic Evaluation of Language Models |
HMM | Hidden Markov Model |
KV | Key-Value |
LCS | Longest Common Subsequence |
LLM | Large Language Model |
LoRA | Low-Rank Adaptation |
LSTM | Long Short-Term Memory |
MFU | Model FLOPS Utilization |
MLM | Masked Language Modeling |
MMLU | Massive Multitask Language Understanding |
NER | Named Entity Recognition |
NIAH | Needle-in-a-Haystack |
NLG | Natural Language Generation |
NLP | Natural Language Processing |
NLU | Natural Language Understanding |
NTP | Next Token Prediction |
PEFT | Parameter-Efficient Fine-Tuning |
QA | Question Answering |
RLHF | Reinforcement Learning from Human Feedback |
RNN | Recurrent Neural Network |
ROUGE | Recall-Oriented Understudy for Gisting Evaluation |
RTE | Recognizing Textual Entailment |
Seq2Seq | Sequence-to-Sequence |
SME | Small and Medium-sized Enterprises |
SSL | Self-Supervised Learning |
STEM | Science, Technology, Engineering, and Mathematics |
TPU | Tensor Processing Unit |
VL | Vision-Language |
WSC | Winograd Schema Challenge |
References
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. OpenAI Technical Report 2018. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 20 June 2025).
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Advances in Neural Information Processing Systems. Volume 30. [Google Scholar]
- Weizenbaum, J. ELIZA—A computer program for the study of natural language communication between man and machine. Commun. ACM 1966, 9, 36–45. [Google Scholar] [CrossRef]
- Colby, K.M.; Weber, S.; Hilf, F.D. Artificial paranoia. Artif. Intell. 1971, 2, 1–25. [Google Scholar] [CrossRef]
- Wallace, R.S. The Anatomy of ALICE; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
- Winograd, T. Procedures as a Representation for Data in a Computer Program for Understanding natural Language; MIT Press: Cambridge, MA, USA, 1971. [Google Scholar]
- Jelinek, F. Statistical Methods for Speech Recognition; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
- Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 2002, 77, 257–286. [Google Scholar] [CrossRef]
- Lafferty, J.; McCallum, A.; Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the ICML, Williamstown, MA, USA, 28 June–1 July 2001; Volume 1, p. 3. [Google Scholar]
- Elman, J.L. Finding structure in time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar] [CrossRef]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
- Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
- Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. Palm 2 technical report. arXiv 2023, arXiv:2305.10403. [Google Scholar] [CrossRef]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
- Lu, H.; Liu, W.; Zhang, B.; Wang, B.; Dong, K.; Liu, B.; Sun, J.; Ren, T.; Li, Z.; Yang, H.; et al. Deepseek-vl: Towards real-world vision-language understanding. arXiv 2024, arXiv:2403.05525. [Google Scholar]
- Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. Gpt-4o system card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
- Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Hanna, E.B.; Bressand, F.; et al. Mixtral of experts. arXiv 2024, arXiv:2401.04088. [Google Scholar] [CrossRef]
- Team, G.; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar] [CrossRef]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 2022, 23, 1–39. [Google Scholar]
- Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652. [Google Scholar]
- Sanh, V.; Webson, A.; Raffel, C.; Bach, S.H.; Sutawika, L.; Alyafeai, Z.; Chaffin, A.; Stiegler, A.; Scao, T.L.; Raja, A.; et al. Multitask prompted training enables zero-shot task generalization. arXiv 2021, arXiv:2110.08207. [Google Scholar]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
- Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799. [Google Scholar]
- Lester, B.; Al-Rfou, R.; Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv 2021, arXiv:2104.08691. [Google Scholar] [CrossRef]
- Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190. [Google Scholar] [CrossRef]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. ICLR 2022, 1, 3. [Google Scholar]
- Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Adv. Neural Inf. Process. Syst. 2023, 36, 10088–10115. [Google Scholar]
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. arXiv 2020, arXiv:2009.03300. [Google Scholar]
- Srivastava, A.; Rastogi, A.; Rao, A.; Shoeb, A.A.M.; Abid, A.; Fisch, A.; Brown, A.R.; Santoro, A.; Gupta, A.; Garriga-Alonso, A.; et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv 2022, arXiv:2206.04615. [Google Scholar] [CrossRef]
- Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. Superglue: A stickier benchmark for general-purpose language understanding systems. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar] [CrossRef]
- Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; et al. Holistic evaluation of language models. arXiv 2022, arXiv:2211.09110. [Google Scholar] [CrossRef]
- Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
- Eger, S.; Cao, Y.; D’Souza, J.; Geiger, A.; Greisinger, C.; Gross, S.; Hou, Y.; Krenn, B.; Lauscher, A.; Li, Y.; et al. Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation. arXiv 2025, arXiv:2502.05151. [Google Scholar]
- Whitfield, S.; Hofmann, M.A. Elicit: AI literature review research assistant. Public Serv. Q. 2023, 19, 201–207. [Google Scholar] [CrossRef]
- Ramos, M.C.; Collison, C.J.; White, A.D. A review of large language models and autonomous agents in chemistry. Chem. Sci. 2025, 16, 2514–2572. [Google Scholar] [CrossRef]
- Frincu, I. In Search of the Perfect Prompt. Master’s Thesis, Aalto University, Espoo, Finland, 2023. [Google Scholar]
- Borhani, M.; Soofiani, N.M.; Ebrahimi, E.; Asadollah, S. First Report on Growth and Reproduction of Turcinoemacheilus bahaii (Esmaeili, Sayyadzadeh, Özulug, Geiger and Freyhof, 2014), in Zayandeh Roud River, Iran. Austin Environ. Sci. 2017, 2, 1014. [Google Scholar] [CrossRef]
- Nazi, Z.A.; Peng, W. Large language models in healthcare and medical domain: A review. Informatics 2024, 11, 57. [Google Scholar] [CrossRef]
- Mohammadabadi, S.M.S.; Peikani, M.B. Identification and classification of rheumatoid arthritis using artificial intelligence and machine learning. In Diagnosing Musculoskeletal Conditions Using Artifical Intelligence and Machine Learning to Aid Interpretation of Clinical Imaging; Elsevier: Amsterdam, The Netherlands, 2025; pp. 123–145. [Google Scholar]
- Zheng, Y.; Koh, H.Y.; Yang, M.; Li, L.; May, L.T.; Webb, G.I.; Pan, S.; Church, G. Large language models in drug discovery and development: From disease mechanisms to clinical trials. arXiv 2024, arXiv:2409.04481. [Google Scholar] [CrossRef]
- Mohammadabadi, S.M.S.; Seyedkhamoushi, F.; Mostafavi, M.; Peikani, M.B. Examination of AI’s role in Diagnosis, Treatment, and Patient care. In Transforming Gender-Based Healthcare with AI and Machine Learning; CRC Press: Boca Raton, FL, USA, 2024; pp. 221–238. [Google Scholar]
- Huynh, N.; Lin, B. Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications. arXiv 2025, arXiv:2503.01245. [Google Scholar]
- Wong, M.F.; Guo, S.; Hang, C.N.; Ho, S.W.; Tan, C.W. Natural language generation and understanding of big code for AI-assisted programming: A review. Entropy 2023, 25, 888. [Google Scholar] [CrossRef]
- Liu, B.; Jiang, Y.; Zhang, Y.; Niu, N.; Li, G.; Liu, H. An Empirical Study on the Potential of LLMs in Automated Software Refactoring. arXiv 2024, arXiv:2411.04444. [Google Scholar] [CrossRef]
- Zhong, R.; Du, X.; Kai, S.; Tang, Z.; Xu, S.; Zhen, H.L.; Hao, J.; Xu, Q.; Yuan, M.; Yan, J. Llm4eda: Emerging progress in large language models for electronic design automation. arXiv 2023, arXiv:2401.12224. [Google Scholar] [CrossRef]
- Mohammadabadi, S.M.S.; Entezami, M.; Moghaddam, A.K.; Orangian, M.; Nejadshamsi, S. Generative artificial intelligence for distributed learning to enhance smart grid communication. Int. J. Intell. Netw. 2024, 5, 267–274. [Google Scholar]
- Mohammadabadi, S.M.S.; Liu, Y.; Canafe, A.; Yang, L. Towards distributed learning of pmu data: A federated learning based event classification approach. In Proceedings of the 2023 IEEE Power & Energy Society General Meeting (PESGM), Orlando, FL, USA, 16–20 July 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
- Li, Y.; Wang, S.; Ding, H.; Chen, H. Large language models in finance: A survey. In Proceedings of the Fourth ACM International Conference on AI in Finance, Brooklyn, NY, USA, 27–29 November 2023; pp. 374–382. [Google Scholar]
- Peddinti, S.R.; Katragadda, S.R.; Pandey, B.K.; Tanikonda, A. Utilizing large language models for advanced service management: Potential applications and operational challenges. J. Sci. Technol. 2023, 4. [Google Scholar] [CrossRef]
- Lopez-Lira, A.; Kwon, J.; Yoon, S.; Sohn, J.y.; Choi, C. Bridging language models and financial analysis. arXiv 2025, arXiv:2503.22693. [Google Scholar]
- Sriram, A. Comparative Forecasting in Retail Supply Chains Using Machine Learning and Large Language Models. Master’s Thesis, State University of New York at Binghamton, Binghamton, NY, USA, 2025. [Google Scholar]
- Brundage, M.P.; Sharp, M.; Pavel, R. Qualifying evaluations from human operators: Integrating sensor data with natural language logs. PHME Soc. Eur. Conf. 2021, 6, 9. [Google Scholar] [CrossRef]
- Bakas, N.P.; Papadaki, M.; Vagianou, E.; Christou, I.; Chatzichristofis, S.A. Integrating llms in higher education, through interactive problem solving and tutoring: Algorithmic approach and use cases. In Proceedings of the European, Mediterranean, and Middle Eastern Conference on Information Systems, Dubai, United Arab Emirates, 11–12 December 2023; Springer: Cham, Switzerland, 2023; pp. 291–307. [Google Scholar]
- McHugh, B.; Myers, D.; Patel, A. AI Co-Counsel: An Attorney’s Guide to Using Artificial Intelligence in the Practice of Law Symposium. Akron Law Rev. 2024, 57, 3. [Google Scholar]
- Khikmatillaeva, M. Beyond Chatbots: How specialized AI tools are reducing legal workloads. FARS Int. J. Educ. Soc. Sci. Humanit. 2025, 13, 133–154. [Google Scholar]
- Cheng, S. When Journalism meets AI: Risk or opportunity? Digit. Gov. Res. Pract. 2025, 6, 1–12. [Google Scholar] [CrossRef]
- Dhariwal, P.; Jun, H.; Payne, C.; Kim, J.W.; Radford, A.; Sutskever, I. Jukebox: A generative model for music. arXiv 2020, arXiv:2005.00341. [Google Scholar] [CrossRef]
- Topirceanu, A.; Barina, G.; Udrescu, M. Musenet: Collaboration in the music artists industry. In Proceedings of the 2014 European Network Intelligence Conference, Wroclaw, Poland, 29–30 September 2014; IEEE: New York, NY, USA, 2014; pp. 89–94. [Google Scholar]
- Marcus, G.; Davis, E.; Aaronson, S. A very preliminary analysis of DALL-E 2. arXiv 2022, arXiv:2204.13807. [Google Scholar] [CrossRef]
- Lu, Y. Artificial Intelligence Applied On Today's Urban and Architectural Conceptual Design: A Competition Case Study. Ph.D. Thesis, Politecnico di Torino, Torino, Italy, 2025. [Google Scholar]
- Shetye, S. An evaluation of khanmigo, a generative ai tool, as a computer-assisted language learning app. Stud. Appl. Linguist. TESOL 2024, 24. [Google Scholar] [CrossRef]
- Vega, J.; Rodriguez, M.; Check, E.; Moran, H.; Loo, L. Duolingo evolution: From automation to artificial intelligence. In Proceedings of the IEEE Colombian Conference on Applications of Computational Intelligence, Pamplona, Colombia, 17–19 July 2024; Springer: Berlin/Heidelberg, Germany, 2025; pp. 54–71. [Google Scholar]
- Calamas, D. Student and Instructor Feedback on an AI-Assisted Grading Tool; American Society for Engineering Education: Washington, DC, USA, 2024. [Google Scholar]
- Hidalgo-Reyes, J.; Alvarez, J.; Guevara-Chavez, L.; Cruz-Netro, Z.G. Gradescope as a Tool to Improve Assessment and Feedback in Engineering. In Proceedings of the 2025 Institute for the Future of Education Conference (IFE), Monterrey, Mexico, 28–30 January 2025; IEEE: New York, NY, USA, 2025; pp. 1–7. [Google Scholar]
- Yang, H.; Yue, S.; He, Y. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv 2023, arXiv:2306.02224. [Google Scholar] [CrossRef]
- O'Brien, P.D.; Wiegand, M.E. Agent based process management: Applying intelligent agents to workflow. Knowl. Eng. Rev. 1998, 13, 161–174. [Google Scholar] [CrossRef]
- Van Veen, D.; Van Uden, C.; Blankemeier, L.; Delbrouck, J.-B.; Aali, A.; Bluethgen, C.; Pareek, A.; Polacin, M.; Reis, E.P.; Seehofnerová, A.; et al. Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts. Res. Sq. 2023. [Google Scholar] [CrossRef]
- Rao, A.; Kim, J.; Kamineni, M.; Pang, M.; Lie, W.; Dreyer, K.J.; Succi, M.D. Evaluating GPT as an adjunct for radiologic decision making: GPT-4 versus GPT-3.5 in a breast imaging pilot. J. Am. Coll. Radiol. 2023, 20, 990–997. [Google Scholar] [CrossRef]
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef]
- Benary, M.; Wang, X.D.; Schmidt, M.; Soll, D.; Hilfenhaus, G.; Nassir, M.; Sigler, C.; Knödler, M.; Keller, U.; Beule, D.; et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw. Open 2023, 6, e2343689. [Google Scholar] [CrossRef] [PubMed]
- Clusmann, J.; Kolbinger, F.R.; Muti, H.S.; Carrero, Z.I.; Eckardt, J.N.; Laleh, N.G.; Löffler, C.M.L.; Schwarzkopf, S.C.; Unger, M.; Veldhuizen, G.P.; et al. The future landscape of large language models in medicine. Commun. Med. 2023, 3, 141. [Google Scholar] [CrossRef]
- Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; Mann, G. Bloomberggpt: A large language model for finance. arXiv 2023, arXiv:2303.17564. [Google Scholar] [CrossRef]
- Völker, T.; Pfister, J.; Koopmann, T.; Hotho, A. From Chat to Publication Management: Organizing your related work using BibSonomy & LLMs. In Proceedings of the 2024 Conference on Human Information Interaction and Retrieval, Sheffield, UK, 10–14 March 2024; pp. 386–390. [Google Scholar]
- Rane, N.; Choudhary, S.; Rane, J. Gemini versus ChatGPT: Applications, performance, architecture, capabilities, and implementation. J. Appl. Artif. Intell. 2024, 5, 69–93. [Google Scholar] [CrossRef]
- Wang, W.; Gu, L.; Zhang, L.; Luo, Y.; Dai, Y.; Shen, C.; Xie, L.; Lin, B.; He, X.; Ye, J. SciPIP: An LLM-based Scientific Paper Idea Proposer. arXiv 2024, arXiv:2410.23166. [Google Scholar]
- Sırt, M.; Eyüpoğlu, C. A user-friendly and explainable framework for redesigning AutoML processes with large language models. In Proceedings of the 33rd Signal Processing and Communications Applications Conference (SIU), Sile, Istanbul, Turkiye, 25–28 June 2025; pp. 1–4. [Google Scholar] [CrossRef]
- Xia, H.; Yang, Z.; Zhao, Y.; Wang, Y.; Li, J.; Tracy, R.; Zhu, Z.; Wang, Y.F.; Chen, H.; Shen, W. Language and multimodal models in sports: A survey of datasets and applications. arXiv 2024, arXiv:2406.12252. [Google Scholar] [CrossRef]
- Rahimian, P.; Flisar, J.; Sumpter, D. Automated explanation of machine learning models of footballing actions in words. J. Sport. Anal. 2025, 11, 22150218251353089. [Google Scholar] [CrossRef]
- Cottier, B.; Rahman, R.; Fattorini, L.; Maslej, N.; Besiroglu, T.; Owen, D. The rising costs of training frontier AI models. arXiv 2024, arXiv:2405.21015. [Google Scholar] [CrossRef]
- Liu, Y.; He, H.; Han, T.; Zhang, X.; Liu, M.; Tian, J.; Zhang, Y.; Wang, J.; Gao, X.; Zhong, T.; et al. Understanding llms: A comprehensive overview from training to inference. Neurocomputing 2025, 620, 129190. [Google Scholar] [CrossRef]
- Gale, T.; Elsen, E.; Hooker, S. Do Neural Networks Really Need to Be So Big? 2020. MIT-IBM Watson AI Lab Blog. Available online: https://mitibmwatsonailab.mit.edu/research/blog/do-neural-networks-really-need-to-be-so-big/ (accessed on 1 August 2025).
- Buchholz, K. The Extreme Cost of Training AI Models. 2024. Available online: https://www.forbes.com/sites/katharinabuchholz/2024/08/23/the-extreme-cost-of-training-ai-models/ (accessed on 20 June 2025).
- Stanford Institute for Human-Centered Artificial Intelligence (HAI). AI Index Report. 2024. Available online: https://hai.stanford.edu/research/ai-index-2024 (accessed on 20 June 2025).
- OpenAI. API Pricing. 2025. Available online: https://openai.com/api/pricing/ (accessed on 20 June 2025).
- Maslej, N.; Fattorini, L.; Brynjolfsson, E.; Etchemendy, J.; Ligett, K.; Lyons, T.; Manyika, J.; Ngo, H.; Niebles, J.C.; Parli, V.; et al. Artificial intelligence index report 2023. arXiv 2023, arXiv:2310.03715. [Google Scholar] [CrossRef]
- Pope, R.; Douglas, S.; Chowdhery, A.; Devlin, J.; Bradbury, J.; Heek, J.; Xiao, K.; Agrawal, S.; Dean, J. Efficiently scaling transformer inference. Proc. Mach. Learn. Syst. 2023, 5, 606–624. [Google Scholar]
- IoT Analytics. The Leading Generative AI Companies. 2025. Available online: https://iot-analytics.com/leading-generative-ai-companies/ (accessed on 19 June 2025).
- Zhang, M.; Yuan, B.; Li, H.; Xu, K. LLM-Cloud Complete: Leveraging cloud computing for efficient large language model-based code completion. J. Artif. Intell. Gen. Sci. (JAIGS) 2024, 5, 295–326. [Google Scholar] [CrossRef]
- Vipra, J.; Korinek, A. Market concentration implications of foundation models. arXiv 2023, arXiv:2311.01550. [Google Scholar] [CrossRef]
- Bommasani, R.; Klyman, K.; Longpre, S.; Kapoor, S.; Maslej, N.; Xiong, B.; Zhang, D.; Liang, P. The 2023 Foundation Model Transparency Index. Trans. Mach. Learn. Res. 2025. [CrossRef]
- Ludwig, J.; Mullainathan, S.; Rambachan, A. Large Language Models: An Applied Econometric Framework; Technical Report; National Bureau of Economic Research: Cambridge, MA, USA, 2025. [Google Scholar]
- Castro, D. AI Can Improve U.S. Small Business Productivity. Information Technology & Innovation Foundation. 2025. Available online: https://itif.org/publications/2025/04/08/ai-can-improve-us-small-business-productivity/ (accessed on 1 August 2025).
- Newstardom Insights. The AI Accessibility Gap: Can Small Businesses Keep Up? 2024. Available online: https://newstardom.com/insights/the-ai-accessibility-gap-can-small-businesses-keep-up (accessed on 19 June 2025).
- Jain, A.; Kakade, K.S.; Vispute, S.A. The Role of Artificial Intelligence (AI) in the Transformation of Small-and Medium-Sized Businesses: Challenges and Opportunities. In Artificial Intelligence-Enabled Businesses: How to Develop Strategies for Innovation; John Wiley & Sons: Hoboken, NJ, USA, 2025; pp. 209–226. [Google Scholar]
- Korinek, A.; Vipra, J. Concentrating intelligence: Scaling and market structure in artificial intelligence. Econ. Policy 2025, 40, 225–256. [Google Scholar] [CrossRef]
- Xie, Y.; Avila, S. The social impact of generative LLM-based AI. Chin. J. Sociol. 2025, 11, 31–57. [Google Scholar] [CrossRef]
- Acemoglu, D.; Restrepo, P. Tasks, automation, and the rise in US wage inequality. Econometrica 2022, 90, 1973–2016. [Google Scholar] [CrossRef]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
- Phang, J.; Bradley, H.; Gao, L.; Castricato, L.; Biderman, S. EleutherAI: Going Beyond “Open Science” to “Science in the Open”. arXiv 2022, arXiv:2210.06413. [Google Scholar] [CrossRef]
- Sinha, S.; Lee, Y.M. Challenges with developing and deploying AI models and applications in industrial systems. Discov. Artif. Intell. 2024, 4, 55. [Google Scholar] [CrossRef]
- Fossen, F.; Sorgner, A. Mapping the future of occupations: Transformative and destructive effects of new digital technologies on jobs. Foresight 2019, 13, 10–18. [Google Scholar] [CrossRef]
- Wang, Y. The large language model (llm) paradox: Job creation and loss in the age of advanced ai. Authorea Prepr. 2023. [CrossRef]
- Durach, C.F.; Gutierrez, L. “Hello, this is your AI co-pilot”–operational implications of artificial intelligence chatbots. Int. J. Phys. Distrib. Logist. Manag. 2024, 54, 229–246. [Google Scholar] [CrossRef]
- Dillon, E.W.; Jaffe, S.; Immorlica, N.; Stanton, C.T. Shifting Work Patterns with Generative AI; Technical report; National Bureau of Economic Research: Cambridge, MA, USA, 2025. [Google Scholar]
- Wang, J.Y.; Sukiennik, N.; Li, T.; Su, W.; Hao, Q.; Xu, J.; Huang, Z.; Xu, F.; Li, Y. A Survey on Human-Centric LLMs. arXiv 2024, arXiv:2411.14491. [Google Scholar] [CrossRef]
- Niu, Q.; Liu, J.; Bi, Z.; Feng, P.; Peng, B.; Chen, K.; Li, M.; Yan, L.K.; Zhang, Y.; Yin, C.H.; et al. Large language models and cognitive science: A comprehensive review of similarities, differences, and challenges. arXiv 2024, arXiv:2409.02387. [Google Scholar] [CrossRef]
- Johnson, S.; Acemoglu, D. Power and Progress: Our Thousand-Year Struggle Over Technology and Prosperity|Winners of the 2024 Nobel Prize for Economics; Hachette: London, UK, 2023. [Google Scholar]
- Wilmers, N. Generative AI and the Future of Inequality. MIT Exploration of Generative AI, 27 March 2024. [Google Scholar] [CrossRef]
- Rockall, E.; Mendes Tavares, M.; Pizzinelli, C. AI Adoption and Inequality; International Monetary Fund: Washington, DC, USA, 2025. [Google Scholar]
- Tosun, M.S. Ageing Robots to the Rescue. Technical Report, Oxford Institute of Population Ageing Blog, 2023. Available online: https://www.ageing.ox.ac.uk/blog/ageing-robots-to-the-rescue (accessed on 20 June 2025).
- Tosun, M.S. Endogenous fiscal policy and capital market transmissions in the presence of demographic shocks. J. Econ. Dyn. Control 2008, 32, 2031–2060. [Google Scholar] [CrossRef]
- Tosun, M.S. Global Aging and Fiscal Policy with International Labor Mobility: A Political Economy Perspective; Technical Report, IZA Discussion Papers; International Monetary Fund: Washington, DC, USA, 2009. [Google Scholar]
- Weeks, W.B.; Spelhaug, J.; Weinstein, J.N.; Ferres, J.M.L. Bridging the rural-urban divide: An implementation plan for leveraging technology and artificial intelligence to improve health and economic outcomes in rural America. J. Rural. Health 2024, 40. [Google Scholar] [CrossRef]
- Woods, D.; Podhorzer, M. AI’s Impact on Income Inequality in the U.S.; Technical Report; Brookings Institution: Washington, DC, USA, 2023. [Google Scholar]
- Acemoglu, D.; Restrepo, P. Secular stagnation? The effect of aging on economic growth in the age of automation. Am. Econ. Rev. 2017, 107, 174–179. [Google Scholar] [CrossRef]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
- Shahriar, S.; Lund, B.D.; Mannuru, N.R.; Arshad, M.A.; Hayawi, K.; Bevara, R.V.K.; Mannuru, A.; Batool, L. Putting gpt-4o to the sword: A comprehensive evaluation of language, vision, speech, and multimodal proficiency. Appl. Sci. 2024, 14, 7782. [Google Scholar] [CrossRef]
- Islam, R.; Moushi, O.M. Gpt-4o: The cutting-edge advancement in multimodal llm. Authorea Prepr. 2024. [Google Scholar] [CrossRef]
- Song, S.; Li, X.; Li, S.; Zhao, S.; Yu, J.; Ma, J.; Mao, X.; Zhang, W.; Wang, M. How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model. IEEE Trans. Knowl. Data Eng. 2025, 7, 5311–5329. [Google Scholar] [CrossRef]
- Wu, J.; Yang, S.; Zhan, R.; Yuan, Y.; Chao, L.S.; Wong, D.F. A survey on LLM-generated text detection: Necessity, methods, and future directions. Comput. Linguist. 2025, 51, 275–338. [Google Scholar] [CrossRef]
- Muñoz-Ortiz, A.; Gómez-Rodríguez, C.; Vilares, D. Contrasting linguistic patterns in human and llm-generated news text. Artif. Intell. Rev. 2024, 57, 265. [Google Scholar] [CrossRef]
- Dathathri, S.; See, A.; Ghaisas, S.; Huang, P.S.; McAdam, R.; Welbl, J.; Bachani, V.; Kaskasoli, A.; Stanforth, R.; Matejovicova, T.; et al. Scalable watermarking for identifying large language model outputs. Nature 2024, 634, 818–823. [Google Scholar] [CrossRef] [PubMed]
- Sun, Y.; He, J.; Cui, L.; Lei, S.; Lu, C.T. Exploring the deceptive power of llm-generated fake news: A study of real-world detection challenges. arXiv 2024, arXiv:2403.18249. [Google Scholar]
- Goswami, A.; Kaur, G.; Tayal, S.; Verma, A.; Verma, M. Analyzing the efficacy of Deep Learning and Transformer models in classifying Human and LLM-Generated Text. In Proceedings of the 2024 8th International Conference on Computing, Communication, Control and Automation (ICCUBEA), Pune, India, 23–24 August 2024; IEEE: New York, NY, USA, 2024; pp. 1–5. [Google Scholar]
- Chan, K.H.; Ke, W.; Im, S.K. A general method for generating discrete orthogonal matrices. IEEE Access 2021, 9, 120380–120391. [Google Scholar] [CrossRef]
- Xing, Z.; Lam, C.T.; Yuan, X.; Im, S.K.; Machado, P. Mmqw: Multi-modal quantum watermarking scheme. IEEE Trans. Inf. Forensics Secur. 2024, 19, 5181–5195. [Google Scholar] [CrossRef]
- Pu, J.; Sarwar, Z.; Abdullah, S.M.; Rehman, A.; Kim, Y.; Bhattacharya, P.; Javed, M.; Viswanath, B. Deepfake text detection: Limitations and opportunities. In Proceedings of the 2023 IEEE symposium on security and privacy (SP), San Francisco, CA, USA, 21–25 May 2023; IEEE: New York, NY, USA, 2023; pp. 1613–1630. [Google Scholar]
- Wu, J.; Zhan, R.; Wong, D.; Yang, S.; Yang, X.; Yuan, Y.; Chao, L. Detectrl: Benchmarking llm-generated text detection in real-world scenarios. Adv. Neural Inf. Process. Syst. 2024, 37, 100369–100401. [Google Scholar]
- Topsakal, O.; Akinci, T.C. Creating large language model applications utilizing langchain: A primer on developing llm apps fast. In Proceedings of the International Conference on Applied Engineering and Natural Sciences, Konya, Turkey, 10–12 July 2023; Volume 1, pp. 1050–1056. [Google Scholar]
- Cao, S.; Zhang, J.; Shi, J.; Lv, X.; Yao, Z.; Tian, Q.; Li, J.; Hou, L. Probabilistic tree-of-thought reasoning for answering knowledge-intensive complex questions. arXiv 2023, arXiv:2311.13982. [Google Scholar]
- Xu, H.; Zhu, Z.; Pan, L.; Wang, Z.; Zhu, S.; Ma, D.; Cao, R.; Chen, L.; Yu, K. Reducing tool hallucination via reliability alignment. arXiv 2024, arXiv:2412.04141. [Google Scholar] [CrossRef]
- Liu, B.; Li, X.; Zhang, J.; Wang, J.; He, T.; Hong, S.; Liu, H.; Zhang, S.; Song, K.; Zhu, K.; et al. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv 2025, arXiv:2504.01990. [Google Scholar] [CrossRef]
- Bai, G.; Chai, Z.; Ling, C.; Wang, S.; Lu, J.; Zhang, N.; Shi, T.; Yu, Z.; Zhu, M.; Zhang, Y.; et al. Beyond efficiency: A systematic survey of resource-efficient large language models. arXiv 2024, arXiv:2401.00625. [Google Scholar] [CrossRef]
- Weidinger, L.; Mellor, J.; Rauh, M.; Griffin, C.; Uesato, J.; Huang, P.S.; Cheng, M.; Glaese, M.; Balle, B.; Kasirzadeh, A.; et al. Ethical and social risks of harm from language models. arXiv 2021, arXiv:2112.04359. [Google Scholar] [CrossRef]
- Epstein, Z.; Hertzmann, A.; Investigators of Human Creativity. Art and the science of generative AI. Science 2023, 380, 1110–1111. [Google Scholar] [CrossRef] [PubMed]
- Desai, D.R.; Riedl, M. Between copyright and computer science: The law and ethics of generative AI. Nw. J. Tech. Intell. Prop. 2024, 22, 55. [Google Scholar] [CrossRef]
- Opderbeck, D.W. Copyright in AI training data: A human-centered approach. Okla. L. Rev. 2023, 76, 951. [Google Scholar] [CrossRef]
- Raza, S.; Ghuge, S.; Ding, C.; Dolatabadi, E.; Pandya, D. FAIR enough: Develop and assess a FAIR-compliant dataset for large language model training? Data Intell. 2024, 6, 559–585. [Google Scholar] [CrossRef]
- Pahune, S.; Akhtar, Z.; Mandapati, V.; Siddique, K. The Importance of AI Data Governance in Large Language Models. Big Data Cogn. Comput. 2025, 9, 147. [Google Scholar] [CrossRef]
- Liu, Z. Cultural bias in large language models: A comprehensive analysis and mitigation strategies. J. Transcult. Commun. 2025, 3, 224–244. [Google Scholar] [CrossRef]
- Laakso, A. Ethical Challenges of Large Language Models-a Systematic Literature Review. Master’s Thesis, University of Helsinki, Helsinki, Finland, 2023. [Google Scholar]
- Caruana, M.M.; Borg, R.M. Regulating Artificial Intelligence in the European Union. In The EU Internal Market in the Next Decade–Quo Vadis? Brill: Leiden, The Netherlands, 2025; p. 108. [Google Scholar]
- Fairoze, J.; Garg, S.; Jha, S.; Mahloujifar, S.; Mahmoody, M.; Wang, M. Publicly-detectable watermarking for language models. arXiv 2023, arXiv:2310.18491. [Google Scholar] [CrossRef]
- Liu, A.; Pan, L.; Hu, X.; Li, S.; Wen, L.; King, I.; Yu, P.S. An unforgeable publicly verifiable watermark for large language models. arXiv 2023, arXiv:2307.16230. [Google Scholar]
- Li, L.; Bai, Y.; Cheng, M. Where am I from? Identifying origin of LLM-generated content. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 12218–12229. [Google Scholar]
- Antoun, W.; Sagot, B.; Seddah, D. From text to source: Results in detecting large language model-generated content. arXiv 2023, arXiv:2309.13322. [Google Scholar]
- Russinovich, M.; Salem, A. Hey, That’s My Model! Introducing Chain & Hash, An LLM Fingerprinting Technique. arXiv 2024, arXiv:2407.10887. [Google Scholar]
- Zeng, B.; Wang, L.; Hu, Y.; Xu, Y.; Zhou, C.; Wang, X.; Yu, Y.; Lin, Z. HuRef: Human-readable fingerprint for large language models. Adv. Neural Inf. Process. Syst. 2024, 37, 126332–126362. [Google Scholar]
- Bao, G.; Zhao, Y.; Teng, Z.; Yang, L.; Zhang, Y. Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. arXiv 2023, arXiv:2310.05130. [Google Scholar]
- Bao, G.; Rong, L.; Zhao, Y.; Zhou, Q.; Zhang, Y. Decoupling Content and Expression: Two-Dimensional Detection of AI-Generated Text. arXiv 2025, arXiv:2503.00258. [Google Scholar]
- Koike, R.; Kaneko, M.; Okazaki, N. OUTFOX: LLM-generated essay detection through in-context learning with adversarially generated examples. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 21258–21266. [Google Scholar]
- Li, Y.; Li, Q.; Cui, L.; Bi, W.; Wang, L.; Yang, L.; Shi, S.; Zhang, Y. Deepfake text detection in the wild. arXiv 2023, arXiv:2305.13242. [Google Scholar] [CrossRef]
- Min, B.; Ross, H.; Sulem, E.; Veyseh, A.P.B.; Nguyen, T.H.; Sainz, O.; Agirre, E.; Heintz, I.; Roth, D. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Comput. Surv. 2023, 56, 1–40. [Google Scholar] [CrossRef]
- Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Gal, Y.; Papernot, N.; Anderson, R. The curse of recursion: Training on generated data makes models forget. arXiv 2023, arXiv:2305.17493. [Google Scholar]
- Yang, S.; Chiang, W.L.; Zheng, L.; Gonzalez, J.E.; Stoica, I. Rethinking benchmark and contamination for language models with rephrased samples. arXiv 2023, arXiv:2311.04850. [Google Scholar] [CrossRef]
- Li, D.; Sun, R.; Huang, Y.; Zhong, M.; Jiang, B.; Han, J.; Zhang, X.; Wang, W.; Liu, H. Preference Leakage: A Contamination Problem in LLM-as-a-judge. arXiv 2025, arXiv:2502.01534. [Google Scholar] [CrossRef]
- Wang, X.; Antoniades, A.; Elazar, Y.; Amayuelas, A.; Albalak, A.; Zhang, K.; Wang, W.Y. Generalization vs Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data. arXiv 2024, arXiv:2407.14985. [Google Scholar] [CrossRef]
- Agrawal, A.; Dang, A.; Nezhad, S.B.; Pokharel, R.; Scheinberg, R. Evaluating Multilingual Long-Context Models for Retrieval and Reasoning. arXiv 2024, arXiv:2409.18006. [Google Scholar]
- Ghosh, A.; Datta, D.; Saha, S.; Agarwal, C. The Multilingual Mind: A Survey of Multilingual Reasoning in Language Models. arXiv 2025, arXiv:2502.09457. [Google Scholar]
- Xu, P.; Ping, W.; Wu, X.; McAfee, L.; Zhu, C.; Liu, Z.; Subramanian, S.; Bakhturina, E.; Shoeybi, M.; Catanzaro, B. Retrieval meets long context large language models. In Proceedings of The Twelfth International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Jin, B.; Yoon, J.; Han, J.; Arik, S.O. Long-context LLMs meet RAG: Overcoming challenges for long inputs in RAG. arXiv 2024, arXiv:2410.05983. [Google Scholar] [CrossRef]
- Villalobos, P.; Ho, A.; Sevilla, J.; Besiroglu, T.; Heim, L.; Hobbhahn, M. Position: Will we run out of data? Limits of LLM scaling based on human-generated data. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
- Nelson, E.; Kollias, G.; Das, P.; Chaudhury, S.; Dan, S. Needle in the haystack for memory based large language models. arXiv 2024, arXiv:2407.01437. [Google Scholar] [CrossRef]
- Mohammadabadi, S.M.S. From generative AI to innovative AI: An evolutionary roadmap. arXiv 2025, arXiv:2503.11419. [Google Scholar] [CrossRef]
- Dai, H.; Pechi, D.; Yang, X.; Banga, G.; Mantri, R. DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities. arXiv 2024, arXiv:2411.19360. [Google Scholar]
- Hsieh, C.P.; Sun, S.; Kriman, S.; Acharya, S.; Rekesh, D.; Jia, F.; Zhang, Y.; Ginsburg, B. RULER: What’s the Real Context Size of Your Long-Context Language Models? arXiv 2024, arXiv:2404.06654. [Google Scholar]
- Yue, X.; Song, Y.; Asai, A.; Kim, S.; de Dieu Nyandwi, J.; Khanuja, S.; Kantharuban, A.; Sutawika, L.; Ramamoorthy, S.; Neubig, G. Pangea: A fully open multilingual multimodal LLM for 39 languages. In Proceedings of The Thirteenth International Conference on Learning Representations, Vienna, Austria, 1–7 May 2024. [Google Scholar]
Timeline | Dominant Models | Key Strengths | Notable Limitations |
---|---|---|---|
Pre-1990s: Rule-Based | ELIZA [7], PARRY [8], A.L.I.C.E. [9], SHRDLU [10] | Simulates conversation via handcrafted rules; early human-computer interaction | No learning; brittle; poor generalization; no understanding; limited context |
1990s–2000s: Statistical | n-gram [11], HMM [12], CRF [13] | Data-driven; foundational for early speech/MT; robust to noise | Fixed context (n); limited long-range dependencies; no semantics |
2000s–2020s: Neural Networks | RNN [14], LSTM [15], GRU [16], Word2Vec [17], GloVe [18] | Learns distributed representations; models variable-length sequences | Sequential bottlenecks; poor parallelization; struggles with long-term context |
Late 2010s–Present: Transformers | BERT [19], GPT series [1,2,3], DeepSeek [20], T5 [21], LLaMA [5], PaLM [4] | Scalable self-attention; contextual understanding; few/zero-shot ability; handles long-range dependencies | High computational cost; hallucination; bias; interpretability challenges |
Architecture Type | Representative Models | Typical Use Cases |
---|---|---|
Encoder-Only | BERT [19], RoBERTa [22], ALBERT [23] | Text classification, NER, extractive QA, sentiment analysis |
Decoder-Only | GPT-2/3/4 [2], LLaMA [5], PaLM [24], DeepSeek-V3 [20] | Text generation, dialogue systems, in-context learning |
Encoder–Decoder (Seq2Seq) | T5 [21], BART [25] | Translation, summarization, abstractive QA, text rewriting |
Multimodal | DeepSeek-VL [26], GPT-4o [27] | Image captioning, visual question answering, cross-modal retrieval |
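To make the taxonomy in the table above concrete, the following minimal sketch (ours, not drawn from any of the surveyed works) loads one representative checkpoint per architecture family using the Hugging Face transformers library; the checkpoint names and toy inputs are illustrative assumptions only, and any compatible checkpoints could be substituted.

```python
# Minimal sketch: one representative model per architecture family via Hugging Face transformers.
# Checkpoints and inputs are illustrative; substitute any compatible models.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,  # encoder-only with a classification head (BERT-style)
    AutoModelForCausalLM,                # decoder-only, autoregressive generation (GPT-style)
    AutoModelForSeq2SeqLM,               # encoder-decoder, sequence-to-sequence (T5-style)
)

# Encoder-only: bidirectional encoding for classification-style tasks.
# Note: the classification head on a base checkpoint is randomly initialized
# and would normally be fine-tuned before use.
enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
logits = encoder(**enc_tok("The movie was great!", return_tensors="pt")).logits

# Decoder-only: next-token generation.
dec_tok = AutoTokenizer.from_pretrained("gpt2")
decoder = AutoModelForCausalLM.from_pretrained("gpt2")
out = decoder.generate(**dec_tok("Large language models are", return_tensors="pt"),
                       max_new_tokens=20)
print(dec_tok.decode(out[0], skip_special_tokens=True))

# Encoder-decoder: conditional generation such as summarization or translation.
s2s_tok = AutoTokenizer.from_pretrained("t5-small")
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
ids = s2s_tok("summarize: Large language models are trained on web-scale corpora.",
              return_tensors="pt")
summary = seq2seq.generate(**ids, max_new_tokens=30)
print(s2s_tok.decode(summary[0], skip_special_tokens=True))
```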
PEFT Method | Core Mechanism | Key Characteristics | Trainable Params |
---|---|---|---|
Adapters [35] | Injects small, trainable “adapter” modules between frozen Transformer layers. | Trade-off: Effective generalization but adds inference latency due to new modules. Requires architectural modification. | Low (∼0.1–5%) |
Prompt Tuning [36] | Prepends learnable “soft prompt” embeddings to the input sequence. | Trade-off: Highest parameter efficiency but performance can be less stable and sensitive to prompt length. No inference latency. | Very Low (<0.1%) |
Prefix-Tuning [37] | Prepends learnable prefixes to the hidden states of each Transformer layer. | More expressive and stable than prompt tuning, but slightly more complex. No added inference latency during generation. | Very Low (<0.1%) |
LoRA [38] | Freezes base model weights and injects trainable low-rank matrices to approximate weight updates. | Trade-off: Balances high downstream performance with efficiency. No inference latency as matrices can be merged. Highly effective and widely adopted. | Low (∼0.1–1%) |
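As a concrete illustration of the LoRA mechanism summarized in the table above, the sketch below wraps a single frozen linear layer with trainable low-rank factors. It is a simplified example under common default choices (rank, scaling, zero-initialized B), not the exact recipe of any surveyed model; in practice, libraries such as peft apply this to selected attention projections across the network and can merge the factors back into the base weights for latency-free inference.

```python
# Minimal LoRA sketch in PyTorch: the frozen base weight W is effectively adapted as
# W + (alpha / r) * B @ A, where only the low-rank factors A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus the scaled low-rank update.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Example: wrap a 768-dimensional projection and count trainable parameters.
layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```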
Benchmark | Focus | Dataset Size/Scope |
---|---|---|
MMLU [40] | Academic QA and reasoning across disciplines | 57 subjects, ∼15K Multiple-Choice Questions |
BIG-bench [41] | Emergent abilities and generalization (e.g., humor, ethics, logic) | 200+ tasks, community-contributed |
SuperGLUE [42] | Challenging NLU tasks (coreference, inference, etc.) | 8 tasks (e.g., RTE, WSC, COPA) |
HELM [43] | Multi-dimensional LLM evaluation (accuracy, fairness, robustness, bias) | 42 scenarios × 8 metrics × 30+ models |
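Most of these benchmarks score multiple-choice items by comparing the likelihood the model assigns to each candidate answer. The sketch below shows this protocol in a deliberately simplified form, with a made-up question and a small public checkpoint as stand-ins; real harnesses (e.g., HELM or lm-evaluation-harness) add standardized prompt templates, few-shot examples, and normalization details.

```python
# Hedged sketch of likelihood-based multiple-choice scoring (MMLU-style):
# choose the option whose tokens receive the highest average log-probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

question = "Question: Which planet is known as the Red Planet?\nAnswer:"
options = [" Venus", " Mars", " Jupiter", " Saturn"]

def option_score(prompt: str, option: str) -> float:
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token given its prefix; keep only the option tokens.
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = logprobs[torch.arange(targets.shape[0]), targets]
    option_len = full_ids.shape[1] - prompt_len
    return token_lp[-option_len:].mean().item()

scores = [option_score(question, o) for o in options]
print("Predicted answer:", options[scores.index(max(scores))].strip())
```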
Category | Sector/Use Cases | Description/Example Functions | References |
---|---|---|---|
STEM & Research | Scientific Research: hypothesis generation, experiment design, writing | LLMs like Elicit and SciBot support knowledge synthesis, planning, and scientific writing. | [1,46,47,48,49,50] |
STEM & Research | Healthcare & Life Sciences: scribing, drug discovery, literature review | LLMs generate EHR notes, simulate molecular interactions, and summarize biomedical texts. | [51,52,53,54]
STEM & Research | Software Engineering: code generation, debugging, HDL design, and privacy-aware analytics | LLMs like Copilot and CodeLLaMA assist in programming and hardware logic synthesis. | [55,56,57,58,59,60]
Enterprise & Business | Finance & Banking: fraud detection, chatbots, reporting | Analyzes transactions, powers financial assistants, and automates compliance summaries. | [61,62,63] |
Enterprise & Business | Manufacturing & Supply Chain: forecasting, log analysis, training | Forecasts demand, interprets logs, and supports engineering education via LLM-based tutors. | [64,65,66]
Enterprise & Business | Legal & Regulatory: legal search, contracts, compliance monitoring | Used in tools like CoCounsel and Harvey AI for legal reasoning and risk detection. | [67,68]
Creative & Social Domains | Creative Industries: writing, art/music, design ideation | LLMs power story generation, compose music (e.g., MuseNet), and assist with architecture sketches. | [69,70,71,72,73] |
Creative & Social Domains | Education: conversational tutoring, engagement | Supports inclusive, always-available learning environments with natural interaction. | [74,75]
Creative & Social Domains | Training: content customization, feedback, real-time assessment | Used in platforms like Khanmigo and Duolingo AI for tailored learning experiences and skill development. | [74,75,76,77]
Autonomous Systems | LLM Agents: task chaining, API interaction, digital automation | Auto-GPT and LangChain enable agents to reason, use tools, and automate workflows. | [78,79] |
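The agentic pattern referenced in the last row of the table above reduces, at its core, to a loop in which the model proposes a tool call, the runtime executes it, and the observation is appended to the context until the model returns a final answer. The sketch below is a toy, framework-free illustration of that loop; call_llm is a placeholder stub standing in for any chat-completion API, and the two tools are hypothetical examples rather than components of Auto-GPT or LangChain.

```python
# Toy tool-using agent loop (illustrative only). `call_llm` is a stub that mimics
# an LLM by returning canned JSON; replace it with a real chat-completion client.
import json

def call_llm(messages):
    # Returns a canned tool call first, then a final answer once an observation exists.
    seen_observation = any(m["role"] == "user" and m["content"].startswith("Observation:")
                           for m in messages)
    if not seen_observation:
        return json.dumps({"tool": "calculator", "input": "21 * 2"})
    return json.dumps({"final": "21 * 2 = 42."})

TOOLS = {
    "search": lambda query: f"(toy search results for: {query})",
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only, not safe in general
}

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [
        {"role": "system",
         "content": 'Reply with JSON: {"tool": <name>, "input": <str>} or {"final": <answer>}. '
                    "Available tools: search, calculator."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = json.loads(call_llm(messages))
        if "final" in reply:
            return reply["final"]
        observation = TOOLS[reply["tool"]](reply["input"])   # execute the requested tool
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Step limit reached without a final answer."

print(run_agent("What is 21 * 2?"))   # -> 21 * 2 = 42.
```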
Item | Estimated Cost/Metric |
---|---|
Training GPT-3 (175B) | ∼$4.6M USD (2020) [94] |
Training PaLM (540B) | ∼$3–12M USD (2022) [95]
Training Gemini 1.0 Ultra | ∼$192M USD (2025) [96] |
Inference Cost per 1M tokens (OpenAI API, June 2025) | $0.10 (GPT-4.1 nano input) to $8.00 (GPT-4.1 output) [97] |
Cost Category | Main Drivers | Key Technical Bottlenecks | Example Estimate |
---|---|---|---|
Training (Pre-training) | Compute, GPU/TPU clusters, massive datasets | Model size scaling, training FLOPs, hardware availability | $8M (PaLM 540B)
Inference (Deployment) | Continuous compute, energy, latency constraints | Memory bandwidth, KV cache size, parallelism limits | 29 ms/token @ 76% MFU |
Data Curation & Alignment | Human labor, annotation costs, quality control | RLHF ranking, SFT prompt generation, skilled reviewers | Millions USD |
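The training-cost rows above can be approximated with the widely used rule of thumb that dense-Transformer pre-training requires roughly 6·N·D floating-point operations for N parameters and D training tokens. The sketch below works through that arithmetic under explicitly assumed hardware throughput, utilization, and pricing; the inputs are illustrative, which is why the result differs from the published estimates quoted in the tables (those assumed earlier-generation hardware and different prices).

```python
# Back-of-envelope training-cost estimate using the ~6 * N * D FLOPs heuristic.
# All inputs are assumptions for illustration, not reported provider figures.
params = 175e9              # parameter count (GPT-3 scale)
tokens = 300e9              # training tokens
flops = 6 * params * tokens # ~3.15e23 FLOPs

peak_flops = 312e12         # assumed peak BF16 throughput per accelerator (FLOP/s)
mfu = 0.40                  # assumed model FLOPs utilization
price_per_gpu_hour = 2.0    # assumed cloud price (USD)

gpu_hours = flops / (peak_flops * mfu) / 3600
cost_musd = gpu_hours * price_per_gpu_hour / 1e6
print(f"~{gpu_hours:,.0f} GPU-hours, ~${cost_musd:.1f}M USD under these assumptions")
# -> roughly 7.0e5 GPU-hours and ~$1.4M with the numbers above.
```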
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).