Review

Large Language Models: A Structured Taxonomy and Review of Challenges, Limitations, Solutions, and Future Directions

1 Department of Industrial Engineering, Faculty of Engineering, Khatam University, Tehran 1991633357, Iran
2 Department of Computer Sciences, Science and Research Branch, Islamic Azad University, Tehran 1477893855, Iran
3 Faculty of Economic Sciences, Lucian Blaga University of Sibiu, 550324 Sibiu, Romania
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 8103; https://doi.org/10.3390/app15148103
Submission received: 6 June 2025 / Revised: 11 July 2025 / Accepted: 18 July 2025 / Published: 21 July 2025

Abstract

Large language models (LLMs), as one of the most advanced achievements in the field of natural language processing (NLP), have made significant progress in areas such as natural language understanding and generation. However, attempts to achieve the widespread use of these models have met numerous challenges, encompassing technical, social, ethical, and legal aspects. This paper provides a comprehensive review of the various challenges associated with LLMs and analyzes the key issues related to these technologies. Among the challenges discussed are model interpretability, biases in data and model outcomes, ethical concerns regarding privacy and data security, and their high computational requirements. Furthermore, the paper examines how these challenges impact the applications of LLMs in fields such as healthcare, law, media, and education, emphasizing the importance of addressing these issues in the development and deployment of these models. Additionally, solutions for improving the robustness and control of models against biases and quality issues are proposed. Finally, the paper looks at the future of LLM research and the challenges that need to be addressed for the responsible and effective use of this technology. The goal of this paper is to provide a comprehensive analysis of the challenges and issues surrounding LLMs in order to enable the optimal and ethical use of these technologies in real-world applications.

1. Introduction

Large language models (LLMs), as one of the most advanced and transformative tools of artificial intelligence (AI), have played an unparalleled role in the development of modern applications of natural language processing (NLP) [1]. These models, using deep architectures such as Transformers and self-supervised learning algorithms, have achieved remarkable capabilities in language understanding, text generation, translation, question answering, sentiment analysis, summarization, and even linguistic reasoning [2]. In recent years, models like GPT [3], BERT [4], LLaMA [5], ChatGPT [6], and PaLM [7] have not only pushed the technical boundaries of human language understanding but have also penetrated various scientific, industrial, social, and cultural domains.
To better understand the current position of LLMs, a brief look at their history is essential. Large language models are the result of decades-long evolution in language and machine learning research. The roots of this journey go back to the 1940s, when the basic concepts of artificial neural networks were introduced by McCulloch and Pitts [8]. In the following decades, statistical language models such as n-gram [9] and Hidden Markov Models (HMMs) [10] emerged, which operated based on word occurrence probabilities but were incapable of understanding contextual meanings and deep semantic relationships.
In the 2010s, the emergence of semantic embeddings like Word2Vec [11] and GloVe [12] enabled the representation of words in a continuous vector space, yet these models remained static and context-insensitive. Subsequently, models based on Recurrent Neural Networks (RNNs) [13] and Long Short-Term Memory (LSTM) [14], despite their ability to model sequential dependencies, struggled with issues such as gradient instability and limited memory. The major transformation came in 2017 with the introduction of Transformer architectures, with a design that, through the self-attention mechanism, overcame the limitations of previous models [2]. This architecture formed the foundation for the development of successful models like BERT, GPT-2, T5, and GPT-3. With the release of GPT-3 in 2020, featuring 175 billion parameters, and later GPT-4 with multimodal input capabilities, LLMs evolved from research tools into powerful engines for language generation. In parallel, models like RoBERTa, BLOOM, LLaMA, PaLM, and LaMDA emerged, aiming to optimize performance, accuracy, resource consumption, and task-specific efficiency [15].
The evolution of language models has progressed from simple statistical systems to today’s complex and multilayered neural architectures, a journey that, while opening new horizons for human–machine interaction, has also introduced significant challenges. These challenges are not solely technical; they extend to ethical, social, legal, cultural, and economic dimensions as well. In fact, as these models become more advanced and widely adopted, the need for responsibility in their design, interpretation, and use becomes increasingly critical [1]. For instance, in the healthcare domain, serious obstacles include preserving patient privacy, interpretability of outputs, and clinical trustworthiness [16]. In the legal field, hallucination (the generation of plausible but factually incorrect or fabricated information in legal references) and algorithmic biases can result in unjust decisions [17]. In education, concerns such as the weakening of critical thinking skills, dependency on machine-generated content, and reduced educational equity are prominent [18]. A summarized overview of this historical trajectory is presented in Figure 1, highlighting the key milestones in the evolution of large language models.
Likewise, in fields such as media, art, programming, agriculture, and energy, LLMs face challenges such as misinformation generation, lack of cultural context understanding, opacity, high energy consumption, and heavy infrastructure costs. Moreover, inequality in access to computing resources and trainable data has created a technological divide between developed and developing societies [19].
Given the rapidly expanding use of LLMs across all dimensions of personal and societal life, a comprehensive and interdisciplinary examination of their challenges and potential consequences has become an urgent necessity. Accordingly, this article adopts an analytical and practical approach to explore the multifaceted aspects of LLMs across domains such as healthcare, education, law, media, business, art, agriculture, politics, and science. Moreover, the current research attempts to develop a structured taxonomy and provide a comprehensive review of the challenges, limitations, solutions, and future directions in the era of LLMs. It should be noted that this study adopts a hybrid scoping and mapping review approach. Additionally, the analysis examines key academic databases, including Web of Science, Scopus, and Google Scholar. The forthcoming sections aim to explain the structure, applications, challenges, and future directions of LLMs by dissecting and analyzing the various dimensions of this technology.
In Section 2, a general and comprehensive definition of LLMs is presented. To provide the reader with an initial understanding of the nature of this technology, foundational concepts related to language models are introduced. This is followed by a detailed examination of the architectural structure of LLMs. Moreover, the types of data required to train these models are thoroughly discussed to help readers gain a deeper understanding of how LLMs learn and acquire their capabilities. Overall, this section offers a cohesive and comprehensive overview of the theoretical and technical foundations of LLMs.
Section 3 focuses on the challenges that LLMs face in various aspects of everyday life. It examines the wide-ranging applications of these models and the associated issues across diverse domains such as medicine, finance, business, industry, agriculture, energy, education, research, programming, content creation, art, law, and even tourism. At the end of this section, for each identified challenge, a practical and actionable solution is proposed, providing a strategic outlook for addressing potential problems.
In Section 4, the article delves into the more technical and specialized challenges related to the development and deployment of LLMs. This includes an analysis of issues such as fairness and bias mitigation, countering malicious attacks and potential misuse, integrating heterogeneous data, advancements in multimodal large language model research, supporting low-resource languages, ensuring continuous and stable learning in LLMs, and promoting ethical and responsible use of these technologies. Each of these topics is examined through relevant examples and practical analysis, and concrete solutions are offered to mitigate risks and enhance effectiveness in each area.
Section 5 takes a forward-looking approach, exploring the future trajectory of LLMs. This section investigates key questions, such as the following: In which domains will LLMs see the most growth? What technological advancements are necessary for their evolution? And what emerging risks may arise? It emphasizes the importance of interdisciplinary collaboration, the development of legal and regulatory frameworks, and the integration of human-centered and ethical perspectives in the continued evolution of these models. Finally, Section 6 provides a comprehensive conclusion that synthesizes the main points discussed throughout the article. This concluding section aims to present a clear and accessible picture of the current landscape, challenges, and opportunities surrounding LLMs, one that not only enhances scientific understanding but also paves the way for future research and informed decision making.

2. Definition of Large Language Models

Large language models are among the most prominent recent achievements in the field of NLP, fundamentally transforming the way humans interact with and process language. These models, utilizing deep learning techniques, especially the Transformer architecture, are capable of learning and reproducing complex patterns and structures of language from massive volumes of textual data [2]. LLMs can process unstructured data and extract semantic relationships between words, phrases, and sentences. Moreover, some of them are capable of analyzing and generating multimodal data such as text, sound, images, and even combinations thereof [20].
Unlike human language learning, which develops within a social, interactive, and experiential context, these models are built solely on statistical analysis of massive textual data [21]. In fact, an LLM is a generative mathematical model that learns from the statistical distribution of tokens (words, parts of words, or even characters) in a large corpus of human-produced texts. The primary function of large language models is to predict the next token in a textual sequence [22]. In other words, if a piece of text is given to the model, its response will be based on the statistical probability of subsequent words in the training data, not on any real understanding or knowledge of the world [23]. This statistical nature means that an LLM cannot distinguish between reality, fiction, or cultural conventions [24].
Therefore, although the outputs of the model can be highly accurate, coherent, and even seemingly knowledgeable, the core operation of an LLM remains purely mathematical and statistical. Accordingly, it is crucial for users and developers to always remember that these models possess neither “knowledge”, “belief”, “understanding”, nor “self-awareness”; they merely generate sequences of words based on statistical probability [25].
Models such as ChatGPT, LLaMA, and Falcon [26] are advanced examples from this family, developed based on the GPT (Generative Pretrained Transformer) architecture. These models are exposed to a wide range of internet texts during the pretraining phase, enabling them to learn grammar, factual knowledge, reasoning abilities, and even a degree of general knowledge. This process makes them powerful tools for generating human-like texts in various applications.

2.1. Architecture of Large Language Models

Large language models are typically built upon the Transformer architecture, a framework widely recognized as a milestone in the evolution of deep learning for natural language processing. The Transformer architecture follows a dual-structured encoder–decoder design, in which the encoder first transforms a sequence of input symbols into continuous representations, or feature vectors [27]. These continuous vectors are then passed to the decoder to initiate the process of generating the output text [28]. Figure 2 illustrates the overall Transformer architecture.
In this structure, the encoder utilizes two key operations, the self-attention mechanism and feed-forward neural layers, to extract rich feature representations from the input. These operations are reinforced through residual connections and layer normalization to maintain a stable flow of information across multiple layers [27]. Subsequently, the decoder uses these extracted vectors to generate the output text in a step-by-step, auto-regressive manner, meaning that each generated token serves as input for the next prediction step [27]. To preserve causality and prevent the model from accessing future tokens during generation, masking is applied within the decoder. This architecture not only achieves high accuracy but also supports non-sequential processing, enhancing the scalability and efficiency of Transformers when dealing with large and diverse datasets [27,28]. In contrast to earlier architectures such as RNNs and LSTM networks, Transformer models are fully based on the self-attention mechanism and do not rely on sequential computation, a key factor that has significantly improved their efficiency and scalability [2].
To gain a deeper understanding of how the two core components of the Transformer, the encoder and the decoder, function, it is helpful to begin with a detailed examination of the encoder mechanism. In the Transformer architecture, the encoder acts as a meticulous and context-aware reader. It has full access to the input text (e.g., a source sentence in machine translation) and examines all words simultaneously to determine how different parts of the input relate to each other. The encoder first applies the self-attention mechanism, which determines how much attention each word should pay to every other word in the sequence. Following this, a position-wise feed-forward neural network is applied to each position in the sequence to extract higher-level features and capture more complex patterns and meanings [2].
As the information flows through these two stages, residual connections and layer normalizations are employed to stabilize the flow and prevent the excessive alteration of intermediate representations from occurring. By stacking this structure over multiple layers, the encoder eventually generates a context-aware semantic representation for each word, embedding the meaning of the entire input sequence into each individual token. These representations serve as the foundational information that the decoder later references to generate coherent and contextually appropriate outputs [29].
Each layer of the encoder consists of two sublayers [30]:
  • A multi-head self-attention mechanism, which captures intra-sequence dependencies.
  • A fully connected feed-forward neural network, applied in a position-wise manner to each token.
Both sublayers are wrapped with residual connections and followed by layer normalization. Additionally, all sublayers and embedding layers produce outputs with a fixed dimensionality, typically set to 512.
On the other side of the architecture, the decoder shares a structure similar to that of the encoder, with one key addition: an extra sublayer dedicated to cross-attention, which allows the decoder to attend to the encoder’s output representations [31]. Functionally, the decoder acts like a writer or text generator, responsible for producing a new sentence based on the input data and previously generated outputs. During the generation process, the decoder initially restricts its view to only the tokens that have been produced so far, preventing access to future information. This is achieved through causal masking, ensuring that each position in the sequence can only attend to preceding tokens and not to those that follow. This design enforces auto-regressive generation, where the model decides which word to produce next solely based on the tokens already generated [2].
Next, much like a skilled translator who refers back to the source text to find the most accurate expression, the decoder attends to the encoder’s output with the semantic representations of the input text, enabling more precise and contextually appropriate word selection. Throughout this process, residual connections and layer normalization are used to stabilize the flow of information and support more efficient training [2]. By stacking this structure across multiple layers, the decoder becomes capable of generating sentences or paragraphs step by step, producing output that is semantically rich, grammatically correct, and contextually coherent [2,31].
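To make the attention mechanism described above more tangible, the following Python/NumPy sketch shows single-head scaled dot-product attention with an optional causal mask of the kind used in the decoder. It is a minimal illustration under simplifying assumptions (one head, no learned projection matrices), not the implementation of any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V have shape (seq_len, d_k); causal=True applies the decoder-style
    mask so that position t can only attend to positions <= t.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) attention logits
    if causal:
        mask = np.triu(np.ones_like(scores), k=1).astype(bool)
        scores = np.where(mask, -1e9, scores)  # block attention to future tokens
    weights = softmax(scores, axis=-1)         # attention distribution per position
    return weights @ V                         # context-aware token representations

# Toy usage: 4 tokens with 8-dimensional representations attend over themselves.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x, causal=True)
print(out.shape)  # (4, 8)
```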
In addition to this basic structure, some newer models use more specialized architectures:
  • Causal decoder: A structure consisting solely of the decoder part, predicting the next token based on previous ones. The GPT architecture is of this type [2].
  • Prefix decoder: In this variant, attention over the input prefix is bidirectional, so the model can use the entire prefix rather than depending strictly on past tokens, while the output is still generated auto-regressively [32].
  • Mixture-of-experts (MoE) architecture: A sparse and scalable architecture in which only a small portion of the layers (experts) are activated in each step. This structure uses a router to direct tokens to different experts and allows for model enlargement without significantly increasing computational costs [33].
Overall, the Transformer, with its flexible structure, self-attention mechanism, and capability to integrate with parallel structures like mixture-of-experts, forms the foundation of large language model architectures and enables the generation of highly accurate, fluent, and human-like texts.
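As a complement to the architectural variants listed above, the short Python sketch below illustrates the routing idea behind mixture-of-experts layers: a learned router scores the experts for each token and only the top-k experts are evaluated, so computation grows slowly as experts are added. The tiny linear "experts" and random router weights are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" is just a small linear map here; real MoE experts are feed-forward blocks.
experts = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_experts)]
router_w = rng.normal(scale=0.1, size=(d_model, n_experts))

def moe_layer(tokens):
    """Route each token to its top-k experts and mix their outputs with gate weights."""
    outputs = np.zeros_like(tokens)
    gate_logits = tokens @ router_w                              # (n_tokens, n_experts)
    for i, logits in enumerate(gate_logits):
        top = np.argsort(logits)[-top_k:]                        # indices of the k best experts
        gate = np.exp(logits[top]) / np.exp(logits[top]).sum()   # renormalized gate weights
        for weight, e in zip(gate, top):
            outputs[i] += weight * (tokens[i] @ experts[e])      # only k experts run per token
    return outputs

tokens = rng.normal(size=(5, d_model))   # 5 tokens
print(moe_layer(tokens).shape)           # (5, 8); only 2 of the 4 experts are used per token
```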

2.2. The Training Process of Large Language Models

The process of training a large language model is a complex, multi-stage sequence of data-driven, computational, and human-centric activities that ultimately yields an intelligent, responsive system that is aligned with human needs [34]. These stages are outlined sequentially and summarized in Figure 3. The training pipeline begins with the collection of textual data from diverse sources, including books, academic articles, encyclopedias, news websites, social media, and human conversations, selected to form a rich, varied, and representative corpus of natural language [2]. Given the heterogeneity of these data sources, Section 2.3 provides a detailed account of the various datasets employed throughout the training process.
However, raw collection alone is insufficient; the data must undergo cleaning and preprocessing. In this phase, duplicate content, incomplete sentences, unwanted languages, and offensive or noncompliant materials are removed. Known as data preprocessing, this step is pivotal to the final model’s quality, since noisy inputs directly lead to erroneous learning outcomes. Once the data have been prepared, the model enters the pretraining phase. Here, without requiring labeled data, the model is trained in an auto-regressive manner [35]; that is, by observing the preceding tokens $x_1, x_2, \ldots, x_{t-1}$ of an input sequence $X$, it learns to predict the next token, modeling the sequence probability according to Equation (1):

P(X) = \prod_{t=1}^{T} P\left(x_t \mid x_1, x_2, \ldots, x_{t-1}\right)    (1)
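In practice, Equation (1) is turned into a training signal by minimizing the negative log-likelihood (cross-entropy) of each next token given its prefix. The following minimal Python sketch computes that quantity for a toy sequence; the small vocabulary and the random, untrained "model" probabilities are assumptions made purely for illustration.

```python
import numpy as np

def next_token_nll(token_ids, predicted_probs):
    """Average negative log-likelihood of each token given its prefix.

    token_ids:       sequence of T integer token ids (x_1 ... x_T)
    predicted_probs: array of shape (T-1, vocab_size); row t holds the model's
                     distribution P(x_{t+1} | x_1 ... x_t)
    """
    nll = 0.0
    for t in range(len(token_ids) - 1):
        nll -= np.log(predicted_probs[t, token_ids[t + 1]] + 1e-12)
    return nll / (len(token_ids) - 1)

# Toy example with a 10-token vocabulary and random (untrained) predictions.
rng = np.random.default_rng(0)
vocab_size, tokens = 10, [3, 7, 1, 4]
probs = rng.dirichlet(np.ones(vocab_size), size=len(tokens) - 1)
print(round(next_token_nll(tokens, probs), 3))
```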
Central to this phase is the Transformer architecture, built around the self-attention mechanism. Its capacity to capture long-range dependencies and to process tokens in parallel has underpinned the success of models such as GPT-3 and PaLM. As parameter counts and dataset sizes grow, the model attains greater generalization ability, enabling it to internalize both linguistic patterns and factual knowledge.
Following pretraining, the model proceeds to fine-tuning. During fine-tuning, the pretrained weights are optimized on task-specific data, often composed of human-crafted instructions, so that the model can execute targeted functions like summarization, question answering, or translation. A complementary approach in this stage is in-context learning [*3], wherein the model adapts to a new task by simply conditioning on a few exemplars provided in the prompt, without any parameter updates.
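In-context learning, as noted above, amounts to placing a few worked examples in the prompt and letting the model infer the task without any weight updates. The snippet below is a hypothetical illustration of how such a few-shot prompt might be assembled for sentiment classification; the example sentences and the commented-out query_llm call are placeholders, not a real API.

```python
# Hypothetical few-shot prompt for a sentiment-classification task.
examples = [
    ("The service was quick and friendly.", "positive"),
    ("The product broke after two days.", "negative"),
]
new_input = "Delivery was late, but support resolved it immediately."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {new_input}\nSentiment:"

# The model simply completes the prompt; no parameters are updated.
# response = query_llm(prompt)   # placeholder for whichever LLM interface is used
print(prompt)
```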
Despite fine-tuning, models may still exhibit deviations from human preferences or ethical norms. To better align the model with human values, reinforcement learning from human feedback (RLHF) is employed [36,37]. RLHF typically involves three steps:
  • Collecting human pairwise preference labels (“better” vs. “worse”);
  • Training a reward model to predict these preferences;
  • Optimizing the language model with algorithms such as proximal policy optimization (PPO) to maximize the reward signal.
This process helps the model generate safer, more acceptable, and ethically sound responses, while also correcting errors incurred during pretraining. Finally, the trained model must undergo rigorous evaluation. This evaluation suite assesses linguistic comprehension, text generation quality, conceptual reasoning, and performance across a range of tasks. If weaknesses are detected in certain areas, the training cycle may be revisited and repeated for those components. This iterative loop of retraining and enhancement continues until the model achieves satisfactory performance, culminating in a versatile and powerful system capable of applications from text generation to user assistance.
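To make the reward-modeling step of RLHF concrete, the sketch below shows the standard pairwise (Bradley–Terry-style) objective: the reward model is trained so that the response annotators preferred receives a higher score than the rejected one. The scalar scores stand in for a reward model’s outputs; everything here is an illustrative assumption rather than any specific system’s implementation.

```python
import numpy as np

def pairwise_preference_loss(chosen_scores, rejected_scores):
    """-log sigmoid(r_chosen - r_rejected), averaged over preference pairs."""
    diff = np.asarray(chosen_scores) - np.asarray(rejected_scores)
    return float(np.mean(np.log(1.0 + np.exp(-diff))))

# Toy reward-model outputs for three human-labeled preference pairs.
chosen   = [1.8, 0.4, 2.1]   # scores of responses annotators marked "better"
rejected = [0.9, 0.7, 0.3]   # scores of responses marked "worse"
print(round(pairwise_preference_loss(chosen, rejected), 3))
# A lower loss means the reward model ranks preferred responses higher,
# providing a usable reward signal for the subsequent PPO optimization step.
```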

2.3. Large Language Model Datasets

The development of large language models is largely dependent on the diversity and quality of the datasets used at various stages such as pretraining, fine-tuning, preference assessment, and final evaluation [38]. As shown in Figure 4, these datasets are categorized into five main categories: pretraining datasets, instruction-based fine-tuning datasets, preference datasets, evaluation datasets, and traditional natural language processing datasets, which are discussed in this section.

2.3.1. Traditional Natural Language Processing Datasets

Traditional NLP datasets constitute foundational resources that researchers have developed over the years for training and evaluating models on classical language tasks. These datasets consist of meticulously structured and annotated corpora for tasks such as sentiment analysis, named entity recognition (NER), text classification, part-of-speech tagging, question answering, and machine translation. Their precise design and well-defined labels have made them standard benchmarks in numerous studies and scientific competitions, including GLUE, SuperGLUE, and SQuAD [39].
The significance of these datasets extends beyond training base models: they provide a rigorous framework for assessing a language model’s capacity to grasp concepts, semantic relationships, and linguistic accuracy [40]. Although large language models have recently been trained on vast amounts of unlabeled text, traditional NLP datasets remain indispensable for evaluation phases, targeted fine-tuning, and testing performance on specific tasks [41]. Furthermore, they play a critical role in designing controlled experiments that probe a model’s ability to learn linguistic subtleties and grammatical structures, thereby offering a scientific basis for comparative analyses across different models.

2.3.2. Pretraining Datasets

Pretraining datasets form the foundation of large language model training [42]. These datasets provide a vast amount of textual data for models to learn language structures, patterns, and general linguistic knowledge [43]. It is essential to begin by identifying the primary sources of these datasets. The largest textual corpora commonly used include web crawls (such as Common Crawl), encyclopedic sources like Wikipedia, collections of public and legal books, news articles, and occasionally, forum discussions or social media content. Each of these sources comes with its own distinctive characteristics.
After initial collection, the data undergo cleaning and filtering processes. These include removing duplicate entries, filtering out undesired languages, discarding incomplete or incoherent sentences, and eliminating explicitly offensive or legally and ethically problematic content [44]. The linguistic and topical diversity of the data is also critical. If a model is trained solely on text from a single domain or stylistic register, it tends to perform poorly when encountering texts outside that domain. Therefore, considerable effort is made to ensure coverage across a wide range of topics and contexts [38].
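As a simple illustration of the cleaning and filtering steps described above, the sketch below removes exact duplicates via hashing and drops very short or very long lines; real pipelines would add language identification, near-duplicate detection, and toxicity filtering. The thresholds and sample texts are arbitrary assumptions.

```python
import hashlib

def clean_corpus(lines, min_words=5, max_words=1000):
    """Toy preprocessing: normalize whitespace, drop too-short/too-long lines,
    and remove exact duplicates using a content hash."""
    seen, cleaned = set(), []
    for line in lines:
        text = " ".join(line.split())                # normalize whitespace
        n_words = len(text.split())
        if not (min_words <= n_words <= max_words):  # length filter
            continue
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:                           # exact-duplicate filter
            continue
        seen.add(digest)
        cleaned.append(text)
    return cleaned

raw = [
    "Large language models are trained on massive corpora.",
    "Large  language models are trained on massive corpora.",  # duplicate after normalization
    "Too short.",
]
print(clean_corpus(raw))  # only the first sentence survives
```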
Additionally, the scale of the dataset is a key factor: the larger the dataset, the more likely the model is to capture a broader spectrum of linguistic patterns. In recent examples, pretraining datasets have reached hundreds of billions or even trillions of tokens. However, it is important to note that, contrary to popular belief, data quality is just as important as quantity. High-quality and diverse data can significantly enhance model performance far more than large but unstructured and noisy datasets.
Finally, ethical and legal considerations must always be taken into account. The use of copyrighted material or content containing sensitive personal information can lead to legal challenges or violations of privacy [42]. Once the data collection process is complete, the raw datasets must undergo a critical phase known as preprocessing, during which they are examined and refined for quality, consistency, and alignment with the model’s training objectives [45]. In general, pretraining data can be categorized into two main types: general pretraining datasets and domain-specific pretraining datasets:
  • General pretraining datasets consist of large-scale texts covering a wide range of topics and languages [42]. These data play a crucial role in training large language models, helping them gain a deeper understanding of language and perform effectively across various tasks [46].
  • Domain-specific pretraining datasets are designed to cover specialized fields such as medicine or finance [47]. These datasets enhance the model’s ability to understand and generate language specific to that domain and play a key role in specialized applications [48].

2.3.3. Instruction-Based Fine-Tuning Datasets

Instruction-based fine-tuning datasets play a crucial role in aligning large language models with specific user needs. These datasets enable models to follow complex instructions more accurately and coherently, thereby improving their performance on multi-step and chain-of-thought tasks [38]. Such data are typically produced through methods like human annotation, rewriting existing examples, or using smaller models to generate initial instruction–response pairs. Moreover, combining multiple sources and ensuring linguistic diversity in these datasets greatly enhances the model’s adaptability to real-world scenarios. By leveraging instruction-based data, a model not only learns grammatical patterns but also internalizes the underlying intent behind each instruction. The quality of these datasets is so important that it can directly impact the model’s final performance in specialized applications [49].
There are several distinct types of instruction-based fine-tuning datasets, each tailored to a different objective:
  • General instruction-based fine-tuning datasets: This category comprises a diverse collection of commands and educational prompts spanning a wide range of domains. These data help models develop a deeper understanding of natural language and human directives, thereby broadening their ability to perform various tasks effectively [50].
  • Task-specific instruction classification: In this approach, datasets are organized around particular tasks such as reasoning, text generation, or code understanding. By focusing on domain-specific prompts, this method enables targeted fine-tuning that boosts model performance in specialized areas [51].
  • Instruction-based fine-tuning in specialized domains: These datasets are crafted specifically to enhance model performance in fields like medicine, law, or education. They employ the specialized terminology and procedural instructions of each domain so that the model becomes familiar with the unique structure and content requirements of that field [52].

2.3.4. Preference Datasets

Preference datasets play a pivotal role in aligning large language models with human judgment. Unlike traditional datasets, which focus solely on the correctness or incorrectness of a single response, preference datasets are built by having humans compare and select the better outputs. In this approach, multiple model responses are generated for the same input, and annotators rank them according to criteria such as clarity, accuracy, adherence to instructions, or even stylistic quality. This process steers the model toward producing responses that more closely match human standards and preferences [38].
When combined with methods like RLHF, preference datasets enable models to better understand user intent and generate more trustworthy outputs. In practice, they help models not only to provide correct answers but also to deliver responses that are more satisfying, natural, and ethically as well as functionally appropriate [53]. As a result, these datasets are essential for developing models that excel in real-world applications such as conversational agents or creative content generation.

2.3.5. Evaluation Datasets

Evaluation datasets are fundamental instruments for the objective and systematic measurement of large language models’ performance across a wide range of tasks. These datasets typically consist of a standardized set of inputs paired with reference outputs, enabling the evaluation of models on question answering, machine translation, code generation, text summarization, and logical reasoning [27]. Given that language models can produce diverse and sometimes creative responses, having clear and quantifiable metrics for assessing response quality is essential.
These evaluation datasets are often designed based on qualitative criteria (such as semantic accuracy, coherence, and contextual relevance) and quantitative metrics (such as BLEU, ROUGE, or Exact Match (EM)), enabling comparison between different models or between a model and its previous versions. Evaluation data not only assesses model performance after training but also play a critical role in the model development process, as the feedback obtained can guide architecture refinement, training method improvements, or the selection of higher-quality data during fine-tuning stages [41]. This is especially crucial in high-stakes domains like medicine or law, where the accuracy of model outputs is vital, and the use of domain-specific evaluation datasets can help ensure the reliability and trustworthiness of a system’s responses.
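The quantitative metrics mentioned above can be illustrated with two of the simplest ones: Exact Match and a token-overlap F1 of the kind used in question-answering benchmarks such as SQuAD. The normalization below is deliberately minimal and the example strings are assumptions chosen only for illustration.

```python
from collections import Counter

def normalize(text):
    return text.lower().strip().split()

def exact_match(prediction, reference):
    """1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall against the reference."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

pred, ref = "the Transformer architecture", "Transformer architecture"
print(exact_match(pred, ref))        # 0.0 - not an exact match
print(round(token_f1(pred, ref), 2)) # 0.8 - partial token overlap
```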

2.4. Features and Capabilities of Large Language Models

Large language models, leveraging Transformer architecture and self-supervised learning methods, have brought about a remarkable transformation in the field of natural language processing. By learning from vast amounts of textual data, these models have demonstrated significant performance in tasks such as summarization, translation, question answering, and text generation. Abilities such as zero-shot generalization, in-context learning, and the generation of structured and accurate responses are among their most notable features [22].
In particular, models trained with human feedback, such as InstructGPT [54], have achieved significantly better performance than previous generations in following instructions, reducing content hallucination, generating more useful responses, and minimizing toxic content. These advancements have turned LLMs into powerful tools for a wide range of NLP applications, including accurate question answering, generating texts aligned with formatted instructions, and supporting multiple languages [55].

3. The Challenges and Limitations of Large Language Models Across Various Domains of Everyday Life

Despite the remarkable advancements of large language models in the field of natural language processing, this technology still faces fundamental and multifaceted challenges that must be addressed for its responsible and secure development. On one hand, the “black-box” nature of these models and the lack of sufficient interpretability have significantly affected user trust, particularly in sensitive domains such as healthcare, law, and education. On the other hand, inherent biases in training data can lead to the reproduction and amplification of racial, gender, or socioeconomic prejudices in the outputs, an issue with particularly concerning implications in clinical and financial applications. Moreover, threats related to data leakage, privacy violations, and adversarial attacks demand the development of more rigorous legal and security frameworks for the use of LLMs. In addition, the massive computational requirements and high training costs make the adoption of these models challenging for many organizations and under-resourced countries. Unstable performance when confronted with out-of-distribution data and difficulties in generalization also remain critical technical issues. A general overview of the challenges discussed in this section is presented in Figure 5, which categorizes the limitations of LLMs across various domains of everyday life, including medicine, law, education, agriculture, and more. This classification highlights the widespread and domain-specific nature of the issues involved, emphasizing that the responsible deployment of LLMs requires a context-aware and interdisciplinary approach.

3.1. Medicine

LLMs in the medical field, as an innovative tool, can create a major transformation in the way healthcare services are provided. These models, by utilizing vast amounts of textual data and advanced natural language processing techniques, are capable of delivering precise answers and complex analyses in various medical fields. One of the main applications of these models in medicine is supporting processes such as pre-consultation, diagnosis, and disease management. For example, LLMs can automatically process information related to disease conditions, treatment options, and medical history, helping doctors make better decisions [56]. Additionally, these models can be effective in medical education for doctors and nurses, especially in teaching complex concepts and conducting simulated exercises. However, significant challenges, such as privacy issues, data security, and model interpretability, exist in this field [57]. Since LLMs typically function as “black-boxes”, explanation and transparency in their decision-making process are crucial for physicians. Moreover, the presence of racial or gender biases in training data can lead to unfair and incorrect results, which in medicine, especially in disease diagnosis, can have serious consequences. Therefore, to use these models effectively and ethically in medicine, there is a need to adhere to ethical standards and stringent regulations regarding patient data protection and the reduction in model biases [58].

3.1.1. Interpretability and Transparency

Challenges of interpretability and transparency are fundamental barriers to the adoption of LLMs in healthcare, especially where critical decisions are made based on their outputs. These models often operate as “black-boxes”, generating responses that lack justifiable explanations or credible sources [59]; this is an issue that can erode trust among physicians and healthcare professionals who emphasize transparency and explainability in decision-making processes. The problem worsens when models produce seemingly confident responses that are, in fact, based on speculation or incomplete data [60]. To address this challenge, methods such as multi-step reasoning frameworks (e.g., the model proposed by Creswell) or the use of chain-of-thought prompting to extract knowledge graphs have been suggested. Additionally, designing models that are aware of their own uncertainty levels and can report confidence scores for outputs may increase user trust [61]. On the other hand, biases in training data, including racial or gender biases, can affect the accuracy of model outputs and lead to incorrect clinical decisions. Therefore, developing models that are technically transparent and trustworthy and fair in their content is essential for the successful integration of LLMs into healthcare systems [56].
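One way to operationalize the uncertainty-aware behavior described above is to attach a confidence score to each generated answer, for example derived from the model’s own token probabilities, and to flag low-confidence outputs for human review. The sketch below is only an illustrative assumption, with arbitrary per-token probabilities and an arbitrary threshold, not a validated clinical calibration method.

```python
import numpy as np

def sequence_confidence(token_probs, threshold=0.6):
    """Geometric-mean probability of the generated tokens, plus a review flag.

    token_probs: probability the model assigned to each token it actually generated.
    """
    token_probs = np.asarray(token_probs)
    confidence = float(np.exp(np.mean(np.log(token_probs + 1e-12))))
    return confidence, confidence < threshold

# Toy per-token probabilities from a generated answer.
probs = [0.91, 0.85, 0.40, 0.78]
conf, needs_review = sequence_confidence(probs)
print(round(conf, 2), "flag for clinician review:", needs_review)
```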

3.1.2. Data Privacy and Security

One of the core challenges in validating and deploying LLMs in the domain of clinical data is the risk of leaking confidential and sensitive patient information. In certain adversarial attacks on models like GPT-2, training data containing personally identifiable information and private user conversations have been extracted word-for-word [43], and even in cases where data was seemingly anonymized, algorithms were able to re-identify patient identities [62]. These threats highlight the need for implementing solutions such as identifier pseudonymization, differential privacy protection, and continuous monitoring through data extraction attacks to assess model vulnerabilities. The use of LLMs in medical research also requires strict adherence to data security and privacy standards, as researchers deal with highly sensitive and personal data and bear a serious responsibility for protecting them [63]. Challenges include the unintentional inclusion of identifiable information in pretraining data and the ability of models to infer personal attributes from seemingly harmless data, which could lead to privacy violations. Even with anonymization, there is a risk of re-identification through hidden patterns in large-scale health data, necessitating the development of advanced algorithms to detect and prevent such threats. Moreover, constant monitoring of model outputs is essential to prevent unintentional disclosure of information. Ultimately, for the ethical use of LLMs in healthcare, relying solely on traditional privacy regulations is insufficient. New governance frameworks must be developed that anticipate emerging challenges and evaluate model performance from an ethical perspective. Active participation of patients and healthcare providers in the design and development process of these models can enhance transparency and build public trust in the use of medical data in AI-based systems [64].

3.1.3. Bias and Fairness

LLMs are typically trained on vast and diverse datasets, which often contain various biases, including racial, gender, or socioeconomic biases that can be reproduced and even amplified in the model outputs. This is especially concerning in healthcare, as it can lead to unfair treatments, misdiagnoses, and ultimately increased health disparities. For example, a study on skin cancer that primarily focuses on patients with light skin may result in a model that performs poorly in diagnosing the disease in individuals with darker skin tones, contributing to a wider gap in access to healthcare among different population groups [65]. A lack of sufficient data from minority groups also leads to poorer and more biased model performance for those populations. On the other hand, language models trained on diverse textual sources, both reliable and unreliable, may unintentionally generate inaccurate or misleading information, especially when training data include structural or historical biases. Therefore, researchers must rigorously clean and preprocess training data, identify inherent biases, and regularly validate model outputs [66]. Biases related to demographics, disease prevalence, or treatment outcomes, if unaddressed, can skew medical results away from fairness [67]. Effectively tackling these challenges requires scientific sensitivity at every stage of model development from design to final implementation. This can be achieved through ongoing audits and interdisciplinary collaboration among data scientists, ethicists, and healthcare experts. Such efforts can lead to the creation of effective guidelines and the development of bias-free models, ultimately resulting in fairer and more inclusive systems within the healthcare domain [68].

3.2. Finance

LLMs in the financial sector have emerged as an innovative and advanced tool capable of transforming traditional financial analysis methods and creating new opportunities in this field. These models, by utilizing complex algorithms and pretraining on vast datasets, possess remarkable capabilities in data analysis, market sentiment detection, and providing accurate advice [69]. One of the key applications of these models is in market sentiment analysis, where they can extract positive or negative market sentiments from financial news and social media, helping to predict market behavior. Additionally, LLMs play an important role in analyzing financial time series and forecasting market trends, as they can classify financial data and identify anomalies [70].
Moreover, these models have reasoning capabilities, which can assist them in generating investment recommendations and financial plans in a manner similar to human decision-making processes. Using these capabilities, LLMs are applied in agent-based modeling and can simulate market behavior and economic interactions. Despite the high potential of these models, challenges remain in areas such as technical limitations, generalizability, interpretability, and implementation. Furthermore, addressing performance evaluation challenges [71] is essential for the effective use of these models in financial matters.

3.2.1. Limitations and Complexity

The technical challenges of LLMs in financial analysis are primarily related to the complexity of their size and computational demands. These models, composed of millions or even trillions of parameters, require immense storage and computational power, which creates difficulties in resource-constrained environments [72]. This issue becomes more prominent when developers lack access to powerful GPUs or TPUs [73]. For example, models like FinBERT, with 110 million parameters, have a significant size that may be difficult to handle in some resource-limited scenarios [74]. Moreover, training large-scale LLMs requires substantial time and energy, which increases financial costs and energy consumption [75]. Additionally, LLMs struggle with generalizability and cannot maintain consistent performance across different financial tasks outside their training domains. These limitations are particularly significant when handling specific and unforeseen tasks in financial analysis, especially when dealing with new or unique data [69]. LLMs need continual improvement to stay aligned with new data and tasks while maintaining efficiency. Furthermore, increasing model size introduces complex challenges that require optimization strategies for more efficient and cost-effective implementation [76].

3.2.2. Generalizability

Generalizability in LLMs refers to their ability to accurately and consistently perform tasks in domains, data, or functions that differ from their initial training environments. Although these models are typically trained on vast and diverse datasets, they often fail to deliver reliable performance when applied to specific tasks or domains outside their original scope [77]. This limitation is clearly seen in applications such as coding projects or document analysis, which vary based on language, context, or domain. To enhance generalizability, the fine-tuning process must be executed with precision, utilizing diverse datasets and designing mechanisms for continuous feedback [78]. These steps help prevent overfitting to the training data and improve applicability in real-world scenarios. Nevertheless, recent research indicates that even with these measures, LLMs may not perform comparably to their training performance when exposed to unfamiliar data. This highlights a critical gap in the current capabilities of these models. In the financial domain, these challenges are particularly evident [79], as models developed for general purposes may lack the accuracy and consistency required for specialized tasks such as market forecasting, financial report analysis, or financial news classification. Therefore, the key challenge lies in developing models that not only possess broad knowledge but also demonstrate adaptability across various domains. Achieving this goal requires reforming training processes and designing innovative architectures and learning algorithms [80].

3.2.3. Interpretability and Trustworthiness

The interpretability and trustworthiness of LLMs are factors that are especially crucial in sentiment analysis tasks within the financial domain. The “black-box” nature of these models makes their decision-making processes difficult to understand, creating obstacles to user trust, particularly among investors who require transparency and logical reasoning [81]. Therefore, it is essential to develop tools that explain the internal mechanisms of these models so users can trace the logic behind the outputs. Moreover, the closed nature of many LLMs and the lack of transparency regarding their training data sources raise questions about data quality and ownership. These models are also vulnerable to adversarial attacks, increasing the need for security measures and ethical considerations [82]. Furthermore, the outputs of LLMs must align with social values and legal regulations and avoid providing harmful recommendations. As LLMs play an increasingly influential role in financial decision making, it becomes necessary to establish clear legal frameworks to define accountability and responsibility in the case of errors. The security and privacy of financial data must also be safeguarded through local infrastructure and robust protocols. Ultimately, with the growing use of LLMs in finance, an active, transparent, and ethics-driven approach is essential for safe and responsible deployment of this emerging technology [83].

3.2.4. Selection and Implementation

The challenge of selecting appropriate FinLLM models and techniques lies in balancing cost and performance. Depending on task complexity and inference costs, using general-purpose models with prompting techniques or domain-specific models may be more practical than building a new FinLLM. This requires LLMOps engineering skills [84], including optimization methods such as parameter-efficient tuning (PEFT) and the implementation of operational systems with continuous-integration (CI) and continuous-delivery (CD) pipelines. The main challenges in developing real-world financial applications are mostly non-technical [85], such as business needs, industry barriers, data privacy, accountability, ethics, and the understanding gap between finance professionals and AI experts [86]. To overcome these challenges, sharing successful FinLLM use cases in areas like robo-advisors, quantitative trading, and low-code development can be helpful. It is also recommended to focus more on generative applications in the future, such as report generation and document understanding. Accordingly, the development and deployment of LLMs in real-world financial environments require proper technical infrastructure, access to high-quality data, and security mechanisms to protect sensitive information. Effective collaboration between technical and financial stakeholders is also essential to align LLM-based solutions with actual organizational and client needs [76].
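As a minimal illustration of the parameter-efficient tuning (PEFT) idea mentioned above, the sketch below implements a LoRA-style linear layer: the large pretrained weight matrix is frozen and only two small low-rank matrices are trained. The matrix sizes, rank, and scaling factor are arbitrary assumptions; production systems would typically rely on an established PEFT library rather than this hand-rolled version.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 512, 512, 8, 16

W_frozen = rng.normal(scale=0.02, size=(d_in, d_out))  # pretrained weights (not updated)
A = rng.normal(scale=0.01, size=(d_in, rank))          # trainable low-rank factor
B = np.zeros((rank, d_out))                            # trainable; zero init keeps the start unchanged

def lora_linear(x):
    """y = x W_frozen + (alpha / rank) * x A B -- only A and B would be trained."""
    return x @ W_frozen + (alpha / rank) * (x @ A @ B)

x = rng.normal(size=(1, d_in))
print(lora_linear(x).shape)          # (1, 512)
full = W_frozen.size
lora = A.size + B.size
print(f"trainable params: {lora} vs full fine-tuning: {full}")  # roughly 3% of the full matrix
```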

3.2.5. Performance Evaluation

The primary challenge in evaluating the performance of LLMs in finance lies in the need for domain-specific financial expertise to validate model outputs in financial NLP tasks [87]. Current evaluations are mostly based on standard NLP metrics such as accuracy and F1-score, whereas complex financial tasks require human evaluation by finance experts, the use of specialized financial metrics, and feedback to better align the model with real-world needs. Advanced tasks that involve new evaluation criteria can reveal the hidden capabilities of FinLLMs and determine whether these models can act as general solvers of financial problems, especially considering cost and performance [88]. Additionally, evaluating trading strategies built with LLMs presents unique challenges. One such challenge is signal degradation due to the widespread use of LLMs, and another is the inefficiency of existing benchmarks, which were developed before the emergence of LLMs and are incompatible with the transformed financial landscape. This environmental shift is not just a gradual erosion but a fundamental transformation requiring the definition of new evaluation approaches. Therefore, developing new benchmarks that align with LLM capabilities and reflect current market realities is essential. Without this, accurate evaluation of model-generated signals will be impossible, and doubts about their effectiveness will persist. As a result, alongside the traditional challenge of signal degradation, serious attention must be given to the difficulty of evaluation caused by environmental shifts to effectively leverage LLMs in designing trading strategies [89].

3.3. Business

Recent advancements in artificial intelligence, particularly in natural language processing techniques, have created powerful tools for business applications. These tools have the ability to understand, analyze, and generate human language, enabling the extraction of valuable insights from unstructured data such as user comments on social media [90]. NLP models can simulate customer sentiments from feedback and convert long reports into useful summaries. One notable advancement in this field is the development of LLMs like ChatGPT, which are capable of generating coherent and relevant responses [91]. These models can be customized for specific applications through prompt engineering without the need for retraining or additional data. However, effective use of these techniques requires expertise in natural language processing and appropriate training data. On the other hand, LLMs, due to their very large size, are capable of simulating vast language patterns. Nevertheless, technical, strategic, and structural challenges still exist in utilizing these technologies, preventing their full application in the real world [92].

3.3.1. Bias and Instability

Despite their remarkable ability in natural language generation and text analysis, LLMs face significant challenges in commercial applications. One of the main challenges is the presence of bias in training data and the fine-tuning process. These biases include gender bias, popularity bias, recency bias, and language bias, which can lead to the presentation of inaccurate or skewed information to users. In some cases, these biases appear in the form of imbalanced repetition of specific numbers or concepts in the model’s output [93]. The main source of these errors is the biased and low-diversity data on which the model was trained. In such conditions, the generated content may show negative or unfair orientation toward specific social groups, damaging customer trust and brand credibility [90]. Another significant challenge is the inability of LLMs to deeply understand context. These models primarily operate on statistical word prediction and lack true conceptual understanding of the subject [94]. As a result, they may make errors in tasks such as programming, financial data analysis, or precise information extraction from texts. Furthermore, the models are highly sensitive to prompt design. A slight change in how a question is phrased can completely alter the model’s response. This output instability can be problematic in applications such as customer support or organizational decision making. In summary, to effectively leverage LLMs in the business environment, their technical, behavioral, and structural challenges must be carefully identified and managed [95].

3.3.2. Organizational Implementation

Although LLMs are considered powerful tools for natural language processing, their implementation in organizational environments faces challenges beyond the technical aspects. Notably, there are serious concerns regarding data privacy and information security, especially when model usage depends on third-party infrastructures. In such cases, there is a risk of unintended leakage of sensitive information, which can pose legal and reputational threats to businesses [96]. Additionally, innovations based on LLMs often emerge through the combination of existing technologies and the creation of simple user interfaces, rather than entirely new technologies. This can create unrealistic expectations among managers and users. While media hype surrounding new versions of LLMs (such as GPT-4 or future versions) has grown significantly, experience has shown that excessive expectations sometimes do not align with actual model performance. Moreover, effective utilization of LLMs in the business domain requires a comprehensive and interdisciplinary perspective, one that properly integrates this technology with business process management (BPM) tools, data governance, IT infrastructure, and organizational culture. Neglecting this alignment may lead to resistance to technology adoption and the failure of digital transformation projects [90].

3.4. Industry

LLMs have become a key component in the digital transformation of industries and play a significant role in optimizing language-based processes; these models, built on deep learning and extensive datasets, enable highly accurate and coherent understanding, generation, and analysis of human language [97]. In industrial environments, LLMs are used to automate textual tasks, rapidly process information, and support decision making, leading to a significant increase in organizational productivity. Due to their flexibility, these models can be customized to the specific needs of each business; they are capable of summarizing data, identifying linguistic patterns, and generating content without direct human intervention [98]. For example, in maritime transportation, the use of AI models to detect ships in low-visibility weather conditions such as fog or storms is expanding and can play an important role in enhancing the safety of sea travel [99]. Likewise, in intelligent transportation, combining LLMs with vehicle trajectory data can assist in analyzing behaviors such as highway lane changing, which has previously been studied using vehicle movement data [100]. LLM usage also results in time savings, cost reduction, and improved operational accuracy, helping organizations operate more intelligently and innovatively in today’s competitive environment.
However, widespread deployment of LLMs is also accompanied by challenges; the most important of these include the emergence of bias in decision making, privacy concerns [101], high computational costs, and difficulties integrating with existing infrastructures. For instance, these models may reproduce biases present in their training data, which can lead to serious ethical concerns in areas such as hiring or healthcare; moreover, the need for powerful computing resources poses a significant barrier even to smaller organizations, and the high energy consumption of these models raises sustainability concerns. Therefore, effective deployment of LLMs requires careful management of these challenges and responsible utilization of the technology’s capabilities [102].

3.4.1. Industrial Implementation

LLMs face serious technical challenges in their path to industrial adoption, which limit their precision and reliability. One of the most critical barriers is the phenomenon of “language hallucination”, a situation where the model generates seemingly credible but actually incorrect information [103]. In sensitive environments such as healthcare, finance, or industrial automation, such errors can lead to irreparable damage. Additionally, most LLMs lack transparency and explainability [104]; this means that it is unclear what logic underpins the generated output. This opacity raises doubts about using such models in domains requiring legal or technical accountability. Also, short-term memory limitations and weaknesses in maintaining long conversations make these models ineffective in applications like industrial assistants or customer service [105]. In multimodal contexts as well, most LLMs still lack the capability to process text, audio, and images simultaneously, which limits their application in areas like robotics and intelligent control [106].

3.4.2. Risks and Inequalities

The introduction of LLMs into various industries, without considering security requirements, resource constraints, and social considerations, can lead to serious consequences. One of the primary concerns is the risk of data leakage and disclosure of confidential organizational information [43], since many of these models are trained on extensive personal or industrial data. Moreover, LLMs are vulnerable to cyberattacks and adversarial inputs, which can result in harmful outputs and disruption of critical decision-making processes [107]. In addition to these concerns, deploying LLMs requires heavy computational infrastructure and high energy consumption, imposing significant financial burdens on organizations. This challenge is especially restrictive for small- and medium-sized businesses and leads to inequality in access to advanced technologies [108]. Furthermore, many of these models contain bias in their training data, which can lead to unfair decisions in areas like hiring, employee evaluation, or resource allocation. Most existing models also lack proper support for non-English languages, which further limits equitable access to the technology on a global scale [109]. Finally, the lack of precise human evaluation in many LLM projects has reduced trust in model outputs and emphasized the need for stronger validation methods [97].

3.5. Agriculture

As an emerging AI technology, LLMs hold significant potential for transforming the agricultural sector. These models are capable of intelligently supporting processes such as crop yield prediction, plant disease detection, pest management, and optimization of planting and irrigation by analyzing genetic, environmental, and textual data [110,111]. Additionally, LLMs can play an effective role in educating farmers and providing personalized, data-driven advice [112]. However, in fully leveraging these models, practitioners face challenges such as the limited availability of agricultural data, the need for domain-specific training, and difficulties in accurately understanding specialized agricultural language. Moreover, the complexity of agricultural systems and the rapid pace of climate and economic changes demand real-time decision making and up-to-date knowledge, while LLMs typically operate based on static knowledge limited to their training time. Frameworks such as RAG can partially compensate for this limitation by enabling the models to utilize live data. Overall, LLMs are powerful tools but require optimization and specialized adaptation for effective application in agriculture [113].

3.5.1. Lack of Visual Data for Training CV with LLMs

One of the fundamental challenges in employing large multimodal language models in agriculture is the lack of high-quality and sufficiently diverse visual data for effective training of computer vision (CV) models. Many agricultural applications, such as plant disease detection, crop quality assessment, or weed identification, require accurate, diverse, and well-labeled image data [114]. However, collecting such data in agriculture is costly and time-consuming, and demands domain-specific expertise, as the images must be captured and labeled under various lighting, climatic, and biological conditions. This issue is particularly pronounced in less developed regions, where infrastructure for data collection and archiving is limited [112]. As a result, CV models trained on insufficient or imbalanced data suffer in terms of accuracy, generalizability, and performance stability. To overcome this challenge, the use of multimodal language models such as DALL·E or Flamingo has been proposed for synthetic generation of visual data. These models can produce artificial images based on textual descriptions of agricultural phenomena and enhance CV model performance by increasing diversity in training datasets [110]. However, synthetic data generation also faces challenges such as fidelity to reality, quality preservation, and the need for human evaluation.
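As an illustration of the synthetic-data strategy described above, the following minimal sketch uses the open-source Hugging Face diffusers library with a public Stable Diffusion checkpoint as a stand-in for the text-to-image generators mentioned in the literature; the prompts, checkpoint name, and output handling are illustrative assumptions rather than a prescribed pipeline.

```python
# Minimal sketch: generating synthetic crop-disease images from text prompts to
# augment a computer-vision training set. Assumes the `diffusers` library and a
# public Stable Diffusion checkpoint; any generated image still requires expert
# review before it is added to a training dataset.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "close-up photo of a tomato leaf with early blight lesions, field lighting",
    "wheat ear infected with fusarium head blight, overcast sky, macro photo",
    "maize leaf showing yellowing along the midrib, drought-stressed field",
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"synthetic_sample_{i}.png")
```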

3.5.2. Reliability

One of the major and fundamental challenges in using LLMs in agriculture is the trustworthiness of the responses and recommendations generated by these models. In areas such as farm management, crop yield prediction, or fertilizer and irrigation advice, decisions based on incorrect or incomplete information can lead to significant economic or even environmental damage [115]. LLMs may, in the absence of accurate data or when faced with ambiguous queries, provide responses that appear correct but are actually false, fabricated (hallucinated), or lacking strong evidence [116]. This issue becomes even more critical when non-expert users, such as farmers, rely on these outputs as definitive guidance. To enhance trustworthiness, models must not only provide answers but also explain their reasoning or present supporting evidence. Mechanisms such as human review, prompt engineering with chain-of-thought reasoning, and the integration of external data sources (such as through RAG) should be employed. Otherwise, unquestioned reliance on LLM outputs in agricultural settings may lead to risky and unreliable decision making.
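To make these mitigation strategies concrete, the sketch below assembles a chain-of-thought prompt that asks the model to cite retrieved evidence and state its confidence; the template wording and the `ask_llm` helper are hypothetical placeholders, not part of any system described in the cited studies.

```python
# Illustrative prompt template combining chain-of-thought reasoning with an
# explicit request for supporting evidence and an uncertainty statement.
def build_advisory_prompt(question: str, reference_docs: list[str]) -> str:
    context = "\n\n".join(reference_docs)
    return (
        "You are an agricultural advisor. Use ONLY the reference material below.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Farmer's question: {question}\n\n"
        "Think step by step. For every recommendation:\n"
        "1. Quote the passage of the reference material that supports it.\n"
        "2. State your confidence (high / medium / low).\n"
        "3. If the reference material is insufficient, say so instead of guessing."
    )

# Example usage; `retrieved_docs` could come from a RAG pipeline and `ask_llm`
# from whichever LLM client is actually deployed:
# answer = ask_llm(build_advisory_prompt("When should I irrigate my maize?", retrieved_docs))
```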

3.5.3. Static Knowledge Limitation

One of the major limitations of LLMs in the agricultural domain is their reliance on static internal knowledge. These models are typically pretrained on large text datasets in a single offline phase [117], but once training is complete, their knowledge remains frozen with respect to subsequent events, technologies, and data. This characteristic poses a significant obstacle in a field like agriculture, where climatic conditions, pests, new crop varieties, government policies, or market data are constantly evolving. For example, an LLM trained in 2023 cannot provide accurate information about a recent drought or changes in subsidy policies in a specific region without access to updated data [118]. This limitation is particularly problematic in environments that require real-time decision making based on local and current data, potentially resulting in inaccurate and ineffective recommendations.

3.5.4. Ambiguity in Description

One of the less frequently addressed yet highly significant challenges in using LLMs for information extraction from agricultural texts is the ambiguity inherent in descriptive linguistic features. Specialized agricultural texts, particularly field reports or expert assessments, often use vague or implicit descriptive phrases [111]; for instance, a term like “dark spots” could refer to color, texture, or even the pattern of disease spread. Such ambiguities are interpretable by humans but are difficult for language models to accurately decode, especially in the absence of clear context or domain-specific knowledge. This issue leads to errors in the feature extraction process and consequently reduces the accuracy of LLM-based decision-support systems (DSS). Moreover, linguistic descriptions may vary greatly and exhibit polysemy [119]; for example, terms such as “poor growth”, “wilting”, or “underdevelopment” may all describe inadequate plant growth, yet their precise interpretation differs with context. This underscores that relying solely on language models, without complementary systems such as semantic classifiers, linguistic preprocessing, or human feedback, cannot ensure precise and actionable information extraction.
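One complementary mechanism of the kind mentioned above is a lightweight semantic classifier that maps vague descriptive phrases onto explicit attribute categories before any LLM-based extraction. The sketch below uses the Hugging Face zero-shot classification pipeline; the candidate labels and the example phrase are illustrative.

```python
# Minimal sketch: resolving an ambiguous field-report phrase into candidate
# attribute categories with a zero-shot classifier, so the ambiguity can be
# handled by rules or human review rather than left to the LLM alone.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

phrase = "dark spots on the lower leaves"
candidate_labels = ["leaf color change", "leaf texture change", "disease spread pattern"]

result = classifier(phrase, candidate_labels)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")
```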

3.6. Energy

LLMs are rapidly becoming key tools for managing the growing complexities of the electric power industry. With their ability to understand natural language and generate accurate responses, these models serve as effective interfaces between humans and advanced technical systems [120]. As sensor data continue to expand, and with the integration of unstable renewable sources such as solar and wind, along with the emergence of technologies like electric vehicles, hydrogen, and high-computation loads, the need for intelligent energy system management has increased. LLMs can enhance operator decision making in both routine and critical situations by reducing cognitive load and providing rapid analyses. They are particularly effective during emergencies such as storms, extreme grid fluctuations, or blackouts, where they can offer near-real-time solutions [121]. These models have also shown success in analyzing grid data, integrating textual and numerical information, generating technical codes for simulations, and responding to specialized queries. Techniques such as retrieval-augmented generation (RAG) and deep reinforcement learning within the LLM framework enable the delivery of context-aware, text-dependent solutions [122]. Despite these advantages, challenges remain, including the demand for intensive computational resources, ensuring data security, protecting privacy, and the necessity for well-defined legal frameworks. Furthermore, consistent performance under varying conditions and a lack of deep domain specialization are the other notable limitations of this technology [123]. Given the shortage of skilled personnel and the vast volume of data being generated, LLMs can contribute significantly to data analysis, decision making, rapid training of personnel, and increasing the resilience of power systems. Their ability to extract knowledge from large datasets, present it in an understandable form, and support decision-makers signals a promising future. Ultimately, the integration of LLMs with other emerging technologies outlines a new path for smarter operations, greater efficiency, and better responses to environmental and operational challenges in the power industry [120].

3.6.1. Data in Training

One of the fundamental challenges in deploying LLMs in the power industry is the lack of specialized and suitable data for training these models. Due to legal restrictions and privacy concerns, the pretraining process of LLMs is typically limited to publicly available and licensed data, which hinders the development of accurate and efficient models in the energy domain [124]. Building large, domain-specific datasets for power systems, especially considering regulations related to critical energy/electric infrastructure information (CEII), requires innovative solutions and clearly defined legal frameworks. Under current conditions, leveraging smaller, high-quality, and well-labeled datasets for model fine-tuning [125] can significantly improve performance in tasks such as power flow analysis or the prevention of unsafe output generation. Additionally, using in-context learning with limited but high-quality examples (few-shot learning) offers an effective approach to mitigating the scarcity of training data. On the other hand, much of the data in power systems are generated in the form of long, non-textual time series, which necessitates the design of customized embedding algorithms that are compatible with LLM architectures [126]. Another key challenge is the limitation of the context window size, which can hinder the model’s ability to understand long-term dependencies between critical signals in the power grid. Overall, overcoming these challenges could firmly establish LLMs as powerful tools for smart grid development and the advanced analysis of complex power system data.
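The few-shot, in-context learning approach mentioned above can be as simple as prepending a handful of worked examples to each query. The sketch below is a minimal illustration; the example problems and the `call_llm` helper are hypothetical and stand in for curated, expert-validated demonstrations.

```python
# Illustrative few-shot prompt for a power-system question-answering task.
FEW_SHOT_EXAMPLES = [
    {
        "question": "A 10 km overhead line has an impedance of 0.05 + j0.4 ohm/km. "
                    "What is its total series reactance?",
        "answer": "Total reactance = 0.4 ohm/km x 10 km = 4 ohm.",
    },
    {
        "question": "A bus voltage reads 0.93 p.u. with an allowed band of 0.95-1.05 p.u. "
                    "Is it within limits?",
        "answer": "No. 0.93 p.u. is below the lower limit of 0.95 p.u., so the band is violated.",
    },
]

def build_few_shot_prompt(new_question: str) -> str:
    parts = ["Answer the power-system question. Show the calculation briefly."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Q: {ex['question']}\nA: {ex['answer']}")
    parts.append(f"Q: {new_question}\nA:")
    return "\n\n".join(parts)

# response = call_llm(build_few_shot_prompt(
#     "A feeder carries 120 A on a conductor rated for 150 A. What is the loading in percent?"))
```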

3.6.2. Safety Considerations

The absence of clearly defined safety frameworks for deploying LLMs in the power industry is a key concern when applying this technology in critical infrastructure. The outputs of these models are inherently probabilistic, and there is no guarantee of complete accuracy, whereas operations in power systems require high precision and full compliance with strict safety standards such as voltage limits [47]. One observed issue in testing was the generation of different responses with minor variations in input commands, which can lead to incorrect or even dangerous outcomes. This is particularly concerning in engineering domains where safety is critical. Moreover, LLMs lack the capability to provide uncertainty estimates, which are highly important for safe and precise decision making in the power industry. Additionally, the absence of custom safety guards can hinder the execution of critical tasks such as wildfire spread prediction or accurate analysis based solely on visual data [22]. These challenges indicate that high technical performance alone is not sufficient for the safe use of LLMs; instead, a comprehensive framework is needed to manage risk, ensure privacy, and prevent the generation of unsafe outputs [127]. Examples such as the NIST Risk Management Framework can serve as valuable guides for the safe deployment of LLMs in sensitive infrastructures like the power industry.
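A custom safety guard of the kind referred to above can be as simple as a deterministic validation layer that checks every model-proposed action against hard engineering limits before it reaches the control system. The following sketch is illustrative only; the limits, data structure, and escalation path are assumptions, not a standardized framework.

```python
# Minimal sketch of a deterministic safety guard for LLM-proposed setpoints.
from dataclasses import dataclass

@dataclass
class VoltageLimits:
    v_min_pu: float = 0.95  # lower bound, per unit (illustrative)
    v_max_pu: float = 1.05  # upper bound, per unit (illustrative)

def setpoint_is_safe(bus_voltage_pu: float, limits: VoltageLimits = VoltageLimits()) -> bool:
    """Accept an LLM-proposed operating point only if it respects the hard limits."""
    return limits.v_min_pu <= bus_voltage_pu <= limits.v_max_pu

proposed = 1.07  # value suggested by the model, in per unit
if not setpoint_is_safe(proposed):
    print("Rejected: proposal violates voltage limits; escalate to a human operator.")
```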

3.6.3. Physical Modeling

Despite significant advancements in natural language processing, LLMs still face serious limitations when it comes to aligning with the physical principles governing power systems. The processes of energy generation and consumption are shaped by physical laws and behavioral factors, such as Maxwell’s equations, machine dynamics, and human behavior, that require specialized, physics-based tools for accurate modeling [128]. The use of LLMs for tasks like electricity price forecasting or designing demand response policies is highly challenging due to the complexity and dependence on multiple factors, including grid load, consumer decisions, and market regulations. While increasing data volume may help improve forecasts related to renewable energy production or consumer behavior analysis, sufficient accuracy under unexpected conditions still cannot be guaranteed [129]. Moreover, the black-box nature of LLMs hinders the interpretability of their decisions, which poses a risk in critical infrastructures such as the power grid, where the transparency and traceability of decisions are essential. Therefore, physics-based and domain-specific tools remain indispensable for electrical engineers [130]. Nonetheless, LLMs can serve as complementary assistants by summarizing analyses, proposing initial decisions, and integrating with engineering tools to enhance overall system efficiency.

3.6.4. Security Threats

While LLMs offer new opportunities in power and energy systems, they also introduce serious threats in the realms of cybersecurity and privacy. Even in LLMs deployed locally within organizations, vulnerabilities such as backdoor access, unauthorized privilege escalation, or extraction of sensitive training data may occur [131]. This risk becomes especially pronounced when proprietary data from companies operating in the power systems sector are used to train these models, increasing the likelihood of unintended information disclosure. Using online LLMs for critical tasks like price forecasting also makes them attractive targets for cyberattacks [132]. In addition, the expert prompts designed to interact with these models may contain confidential information or trade secrets, which pose serious risks if accessed by malicious actors. Privacy concerns are also amplified as human-related data usage increases. To counter these threats, developing standard protocols for data anonymization and sanitization before training is absolutely essential [133]. However, in cases where personal or group-specific data are highly context-dependent, effective anonymization becomes a significant challenge. Therefore, ensuring data security and confidentiality must be considered an integral part of the design and implementation of LLMs in the energy sector.
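Anonymization and sanitization protocols of the kind called for above would, at a minimum, mask obvious identifiers before any utility text is used for fine-tuning. The sketch below is deliberately simplistic and uses illustrative patterns (the meter-identifier format is hypothetical); production systems would require far more robust, audited de-identification.

```python
# Minimal sketch of a pre-training sanitization pass that masks obvious identifiers.
import re

PATTERNS = {
    "METER_ID": re.compile(r"\bMTR-\d{6,}\b"),          # hypothetical meter ID format
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def sanitize(text: str) -> str:
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text

print(sanitize("Customer j.doe@example.com (meter MTR-0012345, phone +40 721 000 111) reported an outage."))
# -> Customer [EMAIL] (meter [METER_ID], phone [PHONE]) reported an outage.
```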

3.7. Education

LLMs in the education sector have created a significant transformation and contributed to improving learning and teaching processes. These models are capable of analyzing educational data, providing lesson recommendations, and even answering students’ questions in various academic fields [134]. LLMs can serve as writing or reading assistants, helping students to improve their writing and reading skills. Moreover, these models are highly effective in personalized education, addressing the specific needs of each student and assisting teachers in tailoring their lessons to individual learners. LLMs also play a crucial role in automating assessments [18] and educational analytics, reducing time and human error. However, there are challenges that need attention, including technological and security issues in implementing LLMs in education, which require adequate infrastructure and data security considerations. Additionally, challenges related to quality, educational equity, and access to technology, particularly for those without access to advanced technologies [135], remain important issues. Furthermore, privacy, data, and transparency challenges in LLM-based education require further regulations and clarifications to ensure the protection of users’ and students’ rights.

3.7.1. Security

The use of LLMs in education faces numerous technological and security challenges. These models rely on advanced artificial intelligence technologies that are highly complex, and in the case of technical failures, delivering high-quality educational services becomes difficult. Additionally, the quality of the data used to train LLMs plays a crucial role in improving their performance, and the process of converting educational data into usable forms requires high accuracy in collection and processing [130]. Moreover, LLMEdu faces challenges such as speech recognition, natural language processing, intelligent content generation, and multimodal models, which demand ongoing attention from researchers to integrate new technologies in this field. On the other hand, with increasing intelligence levels of LLMs, security concerns such as the emergence of cognitive biases in responses have risen. Studies have shown that LLMs provide biased responses when working with gender-biased data, highlighting the need for precise data curation [136]. Furthermore, the lack of embedded ethical and social values in LLMs increases the risk of misuse for harmful or criminal purposes and necessitates comprehensive regulations for controlling the use of these models. Finally, ethical concerns about replacing human activities with AI are also discussed in education; although AI can empower learning, it cannot replace human roles such as fostering critical thinking and providing social support [34].

3.7.2. Quality and Dependency

Although LLMs offer broad opportunities for transforming education, they also present serious challenges regarding education quality. If these technologies fail to deliver high-quality and effective services, their acceptance among teachers and students may decline [27]. Educational institutions must also balance technological innovation with maintaining education quality to avoid overemphasis on technology at the expense of content quality. Additionally, a major concern in using LLMs is the risk of students becoming overly dependent on these tools, which could reduce their ability for independent learning and creativity. Cases like using ChatGPT for completing assignments are examples of this challenge [137]. These situations highlight the need to teach students time management and balanced technology use. In terms of access, technical infrastructure and technology availability remain significant barriers, especially in underserved regions. Beyond infrastructure issues, fear of being replaced by AI and cognitive resistance in some communities have led to the rejection of AI technologies in education. This underscores the necessity of enhancing digital literacy and offering targeted training for teachers and students. Ultimately, it should be noted that unequal use of LLM tools can widen the educational gap among students and challenge educational equity [18].

3.7.3. Privacy

The use of LLMs in education requires the collection and analysis of large volumes of personal data from students and teachers, which raises serious privacy concerns. LLMs need accurate information about users’ learning habits, performance, and personal characteristics to provide personalized education, making strict adherence to privacy regulations like GDPR and FERPA necessary [138]. Additionally, students and their parents must be clearly informed about what data are collected and how they are used, and obtaining informed consent is essential to ensure ethical data-management practices. Furthermore, the risk of data breaches and cyberattacks is a major concern when using LLMs in education, requiring educational institutions to enforce stringent security measures to protect user information [139]. Parents and teachers should also play an active role in raising children’s awareness of data security and how to prevent privacy risks associated with these tools. Moreover, in the process of collecting and processing learning data, it must be ensured that the data are properly protected and safeguarded from any misuse or leakage. The future of LLM-based education will require simultaneously preserving innovation and technological advancement while ensuring the security and privacy of data [18].

3.8. Research

Scientific research today is a key driver of progress in various fields, including technology, medicine, energy, and social sciences. With significant advancements in technology and new tools, research methods have also evolved, and the use of LLMs is one such innovation that has helped researchers perform their research processes more quickly and accurately [140]. These models, with their ability to process large and complex datasets, allow researchers to effectively analyze texts, extract information, and engage in data processing and scientific simulations. In interdisciplinary research, LLMs can combine information from different fields and generate new insights that were previously unattainable. Additionally, these models play an important role in facilitating scientific decision making, forecasting trends, and conducting complex simulations [141]. However, the use of these models also comes with challenges. Particularly, the need for high-quality training data, significant computational capacity, and issues related to ethics and privacy are some of the concerns that must be addressed when using these models in scientific research. Moreover, issues regarding model biases and the lack of transparency in decision-making processes can undermine trust in these tools within the scientific community [1]. Nonetheless, given the high potential of LLMs in improving and accelerating scientific research, these technologies can be widely applied across many domains and have a profound impact on the research and scientific world.

3.8.1. Specialized Understanding

One of the major challenges in applying LLMs in the research domain is their inability to understand highly specialized and detailed concepts. Many of these models are trained on data from general domains like chemistry, while some research tasks require deep understanding of more specific topics such as Suzuki coupling. In such cases, general and frequently repeated data may occupy the model’s parameter space and crowd out low-frequency specialized knowledge [142]. Additionally, scientific data often fall outside the model’s training distribution. For example, during testing, models may encounter molecules or proteins with structures unknown to them [124]. This distribution shift leads to reduced accuracy and performance. One proposed approach to address this issue is the use of invariant learning, which can provide a theoretical foundation for analyzing out-of-distribution data. Moreover, the use of domain-specific knowledge graphs, especially automatically constructed ones, is considered a suitable method for guiding the learning process and preventing the loss of specialized knowledge. This combination of methods can help improve the accuracy and generalizability of LLMs in scientific domains [141].

3.8.2. Trustworthiness

A serious challenge for LLMs in research is the generation of content that appears logical but is factually incorrect, a phenomenon known as “hallucination” [143]. This issue is particularly dangerous in sensitive and critical fields such as medicine, biotechnology, and chemistry, where scientific decisions might be based on incorrect information. In such contexts, response accuracy is vital, and even minor errors can lead to irreversible consequences [141]. One recommended method to reduce this issue is the use of retrieval-augmented generation (RAG), which provides the model with up-to-date, credible, and relevant information to help produce more accurate outputs. However, most RAG research in science has focused on text and knowledge retrieval, while scientific data are often multimodal and heterogeneous. To enhance model trustworthiness, integrating textual data with other sources such as images, graphs, chemical compounds, and biological sequences is essential. Multimodal RAG can help the model gain a more accurate understanding of problems and produce more reliable content. This is considered a key priority in the safe and practical development of LLMs in science [144].
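The retrieval-augmented generation pattern recommended above can be illustrated with a minimal text-only sketch: reference passages are embedded, the most similar ones are retrieved for a query, and the prompt is grounded in them. The corpus snippets, model name, and `ask_llm` helper are illustrative assumptions; a multimodal variant would add image, structure, or sequence encoders on top of the same retrieval logic.

```python
# Minimal text-only RAG sketch using sentence-transformers for retrieval.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "Suzuki coupling forms C-C bonds between boronic acids and organohalides with a Pd catalyst.",
    "Lyophilization removes water from a frozen sample by sublimation under vacuum.",
    "CRISPR-Cas9 introduces targeted double-strand breaks guided by a single-guide RNA.",
]
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    query_emb = encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    return [corpus[hit["corpus_id"]] for hit in hits]

query = "Which catalyst is typically used in Suzuki coupling?"
context = "\n".join(retrieve(query))
prompt = (
    "Answer strictly from the context below and cite the sentence you used.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
# answer = ask_llm(prompt)  # hypothetical LLM call
```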

3.8.3. Linguistic Diversity

One of the fundamental challenges of LLMs in research is the limited language scope and data resources. Many of these models are primarily trained on English texts and have limited ability to understand and generate content in other languages. This leads to the exclusion, during model training, of a significant portion of scientific knowledge published in languages other than English, such as Persian or Spanish [18]. As a result, the conceptual accuracy and coverage of the models across different cultures, languages, and scientific systems are reduced. On the other hand, most scientific LLMs focus only on certain platforms or domains such as Roblox or biochemistry, while other major platforms where users generate scientific content are overlooked [145]. Models trained solely on specific or single-source data tend to perform poorly in generalizing to other fields. To increase conceptual coverage and multilingual performance, it is essential to diversify model training sources both linguistically and topically [142]. In addition, utilizing diverse social and research platforms for data collection can help improve model accuracy and reliability in real-world applications. This approach lays the groundwork for developing more inclusive, trustworthy, and culturally and scientifically contextualized models [146].

3.9. Programming

In recent years, LLMs have made significant advancements in various programming domains, including code completion and generating code from natural language descriptions [147]. Specifically, models like Codex, which are used in GitHub Copilot as an assistant for automatic code generation, have played a crucial role in facilitating programming processes. Despite these successes, challenges such as the lack of public access to powerful models and the limited use of these models outside of resource-rich companies still exist. These limitations hinder research and development in this field for organizations with fewer resources [148,149]. Some language models, like GPT-Neo and GPT-J, are publicly available and can generate suitable code, but there are still issues with training larger models that cannot comprehensively address all programming needs. Moreover, models specifically trained on source code, such as CodeParrot, can help improve the performance of these systems. These models, with their ability to simulate various programming languages, particularly in cases where code and natural language need to be combined, perform better [150]. However, further research is still needed to improve the efficiency of these models and provide broader public access to them for use in various projects.

3.9.1. Diversity of Programming Languages

LLMs perform better in languages like Python, but when it comes to other programming languages such as C++, Java, Go, or non-English languages, their accuracy significantly decreases. Studies have shown that model perplexity in these languages is higher, indicating a weakness in the model’s ability to understand and generate relevant code [147]. One reason for this issue is the unbalanced distribution of training data in the datasets used. For example, the PolyCoder model performed very well in C, but was unable to surpass Codex in C++. This weakness may be attributed to the higher complexity of C++ and the model’s limited context window. Additionally, some models are trained only on code, whereas models like Codex are trained on a combination of code and natural texts, including data from Stack Exchange and GitHub. This combination helps improve performance across different languages [149]. Moreover, increasing the number of model parameters does not always lead to performance improvement, especially in languages for which sufficient training data are not available. Studies have shown that larger models such as GPT-Neo and GPT-J, despite their size, have reached a performance saturation point in some languages. As a result, the use of mixed datasets, linguistic balance in training, and attention to the specific features of each programming language are key factors in improving LLM performance in programming [150].
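The perplexity comparisons cited above can be reproduced in miniature by scoring short snippets with any causal language model; the sketch below uses a small public checkpoint purely for illustration (a code-specific model and much larger samples would be needed for meaningful conclusions).

```python
# Minimal sketch: comparing a causal LM's perplexity on code snippets in two languages.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in for a code LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

snippets = {
    "python": "def add(a, b):\n    return a + b\n",
    "cpp": "int add(int a, int b) { return a + b; }\n",
}

for lang, code in snippets.items():
    inputs = tokenizer(code, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    print(f"{lang}: perplexity = {math.exp(loss.item()):.1f}")
```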

3.9.2. Code Generation and Refinement

LLMs have demonstrated impressive abilities in generating code and suggesting fixes, but these abilities are not consistent in all situations and come with challenges [151]. Studies have shown that LLMs perform well in solving basic and beginner-level programming problems, such as fixing buggy code or completing simple functions [152]. However, when faced with more complex tasks, the quality of outputs drops and the accuracy of responses is not guaranteed. In some cases, code generated by the model may contain conceptual or syntactical errors. One major challenge is the strong dependency of output quality on prompt design. Experiments have shown that combined requests (e.g., simultaneously asking for error explanation and correction) reduce accuracy, while separating these requests can improve model performance [153]. Zero-shot and few-shot prompting also yield noticeably different output quality. Temperature selection likewise plays an important role in the final output; lower temperatures generally produce more accurate responses on single-attempt metrics such as Pass@1, while higher temperatures are more effective for generating diverse outputs. If the temperature is too high, it may result in irrelevant or low-quality results. Therefore, generating and editing high-quality code with LLMs requires careful parameter tuning, proper prompt design, and repeated evaluations [154].
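The Pass@1 metric mentioned above is commonly computed with the unbiased pass@k estimator: n samples are drawn at a fixed temperature, c of them pass the unit tests, and the estimator gives the probability that at least one of k drawn samples is correct. The sketch below implements this estimator; the sample counts in the example are illustrative.

```python
# Unbiased pass@k estimator for code-generation evaluation.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = total samples, c = samples passing the tests, k = attempts allowed."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g., 200 samples generated at temperature 0.8, of which 37 pass the tests:
print(f"pass@1  = {pass_at_k(200, 37, 1):.3f}")   # ~0.185
print(f"pass@10 = {pass_at_k(200, 37, 10):.3f}")
```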

3.9.3. Dependency

With the increased use of LLMs in educational environments, especially in computer science education, concerns have emerged about students becoming overly dependent on these tools. Studies have shown that many students have positive experiences working with LLMs and find these models comprehensible, helpful, and reliable [155]. However, this dependency may reduce their ability to independently solve programming problems. Some studies have indicated that using LLMs may be associated with lower student grades. Additionally, students sometimes struggle to craft effective prompts to obtain correct answers, leading to confusion or mental fatigue [156]. Instructors also express concerns about declining learning quality, academic dishonesty, and reduced motivation for logical analysis. On the other hand, using models like ChatGPT as a replacement for teaching assistants or as development tools has benefits, but can also make learning superficial and tool-dependent [157]. In some cases, students copy code generated by the model without deep understanding. This not only weakens concept learning but also fosters overreliance on the models. As a result, the educational use of LLMs must be accompanied by carefully designed instructional strategies, critical thinking training, and continuous monitoring to prevent these models from becoming obstacles in the learning process [18].

3.10. Media

Social media has played a prominent role in the way information is transmitted and consumed, with platforms such as X, TikTok, and Instagram becoming central to global communications. These platforms, by democratizing access to information, have enabled real-time news sharing [158]. However, this openness has led to the rapid spread of fake news, which can harm social cohesion, political stability, and public health. Fake news, often created to mislead people, contributes to the polarization of society and disrupts democratic processes [159]. Unlike traditional media, which are editor-controlled, social media platforms rely more on algorithms and user interactions, which amplify marginal and misleading content. These challenges necessitate the development of high-performance automatic fake news detection systems [160]. LLMs such as GPT-4 and PaLM, due to their deep understanding of text and ability to identify complex semantic patterns, have proven effective in detecting fake news. These models not only have the ability to analyze linguistic content, but they can also recognize complex relationships between entities and topics in news, thereby helping to address the challenges of misinformation.

3.10.1. Instability and Bias

LLMs are recognized as powerful tools for analyzing media bias, but studies have shown that these models themselves exhibit cognitive bias [161]. One of the main challenges is whether these biases manifest uniformly across different topics. Research has revealed that the tested model tends to favor left-leaning perspectives. For more precise analysis, datasets with labeled topics as well as those with hidden topics were examined. Hidden topics were generated through clustering based on content indicators. Subsequently, the relationships between actual labels and the labels predicted by the model were analyzed. Results showed that, in distribution graphs, the model more often classified left-leaning articles as centrist than right-leaning ones. This indicates the model’s tendency to moderate leftist views and identify them as neutral [162]. Additionally, analysis of various topics showed that this bias is not consistent across all subjects. In some topics, the model even displayed a right-leaning bias, contrary to its overall tendency. This inconsistency shows that the model’s orientation toward political content can depend on multiple factors. Therefore, bias in LLMs not only exists but also behaves variably at the topic level [163].

3.10.2. Lack of Reliable Sources

One of the fundamental challenges of LLMs in the media domain is the generation of content that appears coherent but lacks credible sources. The content these models produce typically has a fluent structure but lacks reliable informational backing [164]. In many cases, sentences generated by LLMs or generative search engines are presented without precise citations. Even when a source is cited, it often does not support the stated claim or has only a superficial connection. This occurs while users are increasingly trusting the outputs of these tools. Studies have shown that a high percentage of the sentences generated by these models are devoid of any citation. This can facilitate the widespread dissemination of misinformation [159,162]. Moreover, the absence of an internal mechanism for assessing factuality in the models makes validation even more difficult. As a result, content may be published with confident tone and seemingly credible appearance even when incorrect. In the media, this especially increases the risk of misleading audiences and reinforcing false beliefs [165].

3.10.3. Authoritative Tone

Another feature of LLMs is their ability to generate text with a confident and assertive tone even when the provided content is incorrect or unreliable. This confident tone makes model outputs appear credible and trustworthy to the audience, even if they contain misinformation [166]. Chatbots typically do not reflect doubt or uncertainty in their tone, which causes users to perceive them as informed and authoritative sources. This issue becomes more serious when models express opinions in domains such as politics, health, or law. The fluent writing style and coherent structure of responses further amplify this problem. Even audiences with high media literacy may be psychologically influenced by the fluency of the text and misjudge its accuracy. Research has shown that text fluency significantly impacts its believability. In fast-moving media environments, this can lead to the rapid spread of false claims. Therefore, confident tone and fluent prose, rather than being signs of accuracy, have become misleading factors in themselves [167].

3.10.4. Unreliable Sources

One of the fundamental challenges of LLMs in the media domain is the generation of content without backing from credible sources. While the content produced by these models is often fluent and well-structured linguistically, it is not necessarily reliable in terms of information. In many cases, the sentences generated by LLMs or generative search engines lack precise references [168]. Even when a source is cited, it either does not fully support the claim or has only a superficial and indirect connection to it. Meanwhile, users are increasingly placing trust in the outputs of these tools. Studies have shown that a high percentage of sentences generated by these models lack any valid citations, which can contribute to the widespread dissemination of misinformation [167]. The absence of internal mechanisms for assessing factual accuracy within the model architecture has made the validation process more difficult. As a result, inaccurate content may be published with a confident tone and an appearance of credibility. This issue is particularly concerning in the media domain, as it increases the risk of misleading audiences and reinforcing false beliefs [169].

3.11. Law

LLMs like ChatGPT can perform certain law-related tasks, but their capabilities are limited and basic. These models can analyze laws in specific and straightforward cases, such as determining whether an action aligns with existing regulations [170]. For example, if a question about a legal violation in a particular situation is asked, an LLM can logically provide a response based on the existing laws. However, these models cannot make binding decisions with the depth and accuracy of a judge, as their responses do not involve a deep interpretation or complex analysis of legal details and contexts [171]. LLMs can only interact with laws syntactically and do not possess a true semantic understanding of the laws. Furthermore, applying the law in more complex processes, such as evaluating evidence, reviewing case history, and interpreting laws in diverse situations, requires human understanding, which these models lack [172,173]. Therefore, these models cannot be relied upon for judicial decision making or enforcing legal judgments, and their use in such areas is fundamentally limited.

3.11.1. Privacy

Legal cases typically involve highly sensitive personal information, such as individuals’ identities, financial statuses, and medical records. Using these data to train LLMs can pose the risk of unintended disclosure during content generation. Such data leakage is a serious threat to personal privacy [171]. Therefore, privacy protection must be a top priority in the design and training process of these models. Model outputs must under no circumstances contain personal information. In addition, the research and development team must design a precise mechanism for data processing and output review. Such a system can minimize the risk of data exposure [172]. These measures also ensure security and compliance with regulations when using LLMs in the legal domain. Ignoring this issue can pose serious challenges to the use of these models, as legal data are not public in nature and their disclosure has severe legal consequences. Thus, the safe use of language models in this field requires full control over input and output data [174].

3.11.2. Accountability

Currently, the scope of legal responsibility when using LLMs for legal consultation and decision making is not clearly defined. Although developers typically emphasize limitations and potential risks when releasing their models and try to mitigate legal problems during training, unintended consequences can still occur [175]. If a model provides analysis that leads to unfavorable outcomes, questions arise: Who is responsible? Is it the developer, the user, or possibly the model itself? At present, there is no consensus on whether users should be accountable for decisions made based on model outputs [173]. This highlights the need for more serious policymaking discussions. There is also a need to develop a comprehensive and transparent legal framework. Such a framework could facilitate safer and more accountable use of LLMs in legal processes. Until these ambiguities are resolved, the use of language models in law will carry significant legal and ethical risks.

3.11.3. Bias Against Judges

LLMs tend to exhibit inferential biases and inherent inclinations to generate certain outputs more frequently than others. In the task of “identifying the majority opinion author” at the U.S. Supreme Court (SCOTUS) level, each tested LLM showed different inferential tendencies; some leaned more toward well-known justices, while others exhibited less explainable behavior [175]. For example, Llama 2 disproportionately suggested Justice Story, an influential figure known for the Amistad decision. In contrast, PaLM 2 leaned more toward Justice McLean, known for his dissent in the controversial Dred Scott ruling [176]. Overall, all reviewed models tended to overestimate the frequency of certain judges’ names. This was evident in the distribution plots showing higher density above the y = x line. Such biases indicate that, when inferential biases in training data do not match real-world information distributions, the model is likely to make systematic errors. This raises the danger of a “legal monoculture” [177], meaning that, instead of accurately representing the real diversity of legal systems, models may merely echo information about a limited number of prominent judges or rulings.

3.11.4. Hallucination

According to findings, the more complex the legal task, the higher the rate of hallucination in model responses. Models perform relatively better on lower-complexity tasks, such as the basic task of “case existence recognition”, where part of the success is due to the model’s general tendency to answer “yes” [178]. However, once models are asked questions about the issuing court, case reference, or opinion author, they encounter difficulties. In mid-complexity tasks such as quoting directly or identifying case outcomes, hallucination rates increase dramatically. This increase is not limited to particular evaluation metrics; errors persist even in simple yes/no tasks like “affirm or reverse ruling” [176]. In high-complexity tasks like recognizing doctrinal alignment between two cases, models perform close to random guessing. The hallucination rate in this task is reported at around 0.5, sometimes worse than a random choice. This shows that models lack deep and structured legal knowledge. In tasks like identifying the main legal question or final verdict, hallucination rates even reach 59% and 63%, respectively [175]. These high error levels are especially concerning when users expect LLMs to assist with real legal analysis. Therefore, these findings indicate that LLMs are still not reliable for many practical legal tasks.

3.12. Tourism

LLMs play a significant role in transforming the tourism industry. With their ability to understand natural language and generate intelligent responses, these models enhance interaction between tourists and digital systems. One of the key applications of LLMs in tourism is the development of intelligent chatbots that provide 24/7 support for tourist inquiries [175], including hotel reservations, ticket booking, and introducing local attractions. Models such as GPT can offer personalized recommendations based on each user’s linguistic, cultural, and emotional preferences. This feature gives tourists a greater sense of being understood and increases their satisfaction [177]. Additionally, LLMs can analyze user feedback and reviews on social media to identify strengths and weaknesses in tourism services and help improve their quality [178]. In crisis situations such as natural disasters, these models can also understand and categorize emergency messages and assist rescue teams in responding more quickly. Moreover, due to the multilingual capabilities of LLMs, language barriers in international travel are reduced, making it easier for tourists to communicate with locals and access services. The integration of LLMs with technologies such as augmented reality or recommender systems makes the travel experience more interactive, safer, and more enjoyable. Ultimately, LLMs are powerful tools for making the tourism industry smarter and more human-centered in the digital age [160].

3.12.1. Bias

Due to being trained on vast textual datasets, LLMs are inherently prone to reproducing gender and ethnic biases. Various studies have shown that such biases are observable in models like GPT-3.5, GPT-4, PaLM-2, and LLaMA. For instance, ChatGPT has generated job recommendations requiring less experience for women and produced resumes with immigration-related markers for Asian or Hispanic users [176]. In the healthcare domain, some models have provided more optimistic treatment predictions for white patients compared to racial minorities. These biases can lead to unequal access to services, reduced user satisfaction, and, ultimately, a negative impact on the travel experience [177]. If LLMs provide biased or unfair information while acting as travel assistants, they may influence user decisions and create discriminatory recommendations. This issue becomes particularly critical when users from diverse racial and gender backgrounds expect neutral and fair treatment. Interestingly, evidence shows that increasing model size does not necessarily reduce bias. Therefore, identifying and mitigating such biases is a serious necessity to ensure the ethical and equitable use of LLMs in the tourism industry [179].

3.12.2. Dependence on Training Data

One of the main challenges in using LLMs in the tourism industry is dealing with rapid changes and the constant need to update information. Travel-related data pertaining to topics such as flight statuses, the condition of tourist attractions, weather changes, travel regulations, and health and safety updates are constantly evolving [177]. These changes, especially in the dynamic and fast-paced world of tourism, can greatly influence travelers’ decision making. LLMs, which are typically trained on static datasets, are not able to respond quickly to such changes and may provide outdated or inaccurate information [180]. For example, if an LLM lacks updated information about a closed attraction or a change in a country’s entry regulations, it might mislead a traveler and negatively impact their experience. This issue becomes even more critical in crisis situations such as pandemics or political–economic shifts in different countries. Therefore, a real-time update system for LLMs is essential so they can stay aligned with moment-to-moment changes and provide accurate information to users. One possible solution to this problem is the use of real-time data and automated synchronization systems that allow LLMs to be continuously updated. This not only enhances the accuracy of information but also improves the performance and reliability of these models in changing conditions. Without such continuous updates, LLMs may deliver poor results in the face of rapid environmental and informational shifts, leading to a decline in user trust and satisfaction [181].

3.12.3. Gaining Trust

One of the fundamental challenges in leveraging LLMs in the tourism industry is gaining users’ trust in AI-generated recommendations. Unlike many other application domains, the travel experience is deeply personal, cultural, and multilayered, influenced by factors such as cultural backgrounds, emotional needs, personal values, and temporal and spatial conditions [182]. In such a context, users expect responses that are not only accurate but also aligned with their specific preferences, cultural beliefs, and individual circumstances. However, the reality is that many LLMs, lacking genuine understanding of human and cultural context, tend to produce generic, mechanical, or irrelevant recommendations that at best seem useful but emotionless, and at worst, misleading. This issue becomes more serious when responses are incomplete, incorrect, or biased, as such shortcomings significantly undermine user trust in the model’s credibility. Recent studies have also emphasized that it is not only the content of information that matters, but also the manner of delivery: linguistic structure, transparency, tone, and communicative clarity all play a crucial role in shaping user perceptions of trust. If responses lack transparency, source traceability, or empathetic capacity, users are likely to disengage from the system. Therefore, developing LLMs that can offer interactive, transparent, and culturally sensitive experiences has become a critical requirement for the success of recommender systems in tourism [183]. Without this, even the most advanced models may end up confusing users or causing distrust and dissatisfaction, rather than enhancing the travel experience.

3.12.4. Privacy

The use of LLMs in the tourism industry requires the collection and analysis of users’ behavioral, preferential, and sometimes personal data; however, this very reliance on individual data raises serious privacy concerns. Information such as preferred destinations, search history, booking patterns, cultural preferences, or even users’ past activities may contain sensitive data that, if used without proper control or explicit consent, could result in privacy violations and a loss of trust [184]. This concern becomes even more complex in international contexts where data protection standards vary. To maintain user trust, it is essential to comply with strict data protection frameworks such as the GDPR and to provide transparency regarding how information is collected, stored, and used. Alongside legal considerations, the ethics of artificial intelligence plays a vital role. Numerous studies have shown that LLMs, due to their statistical and data-driven nature, can unintentionally generate content that carries gender, racial, or cultural stereotypes and biases [185,186]. Such content not only harms the travel experience but can also damage the reputation of tourism brands. Therefore, it is necessary to develop systems designed around ethical principles including informed consent, the removal of sensitive data, prevention of bias reproduction, and the delivery of fair and balanced responses. Additionally, incorporating human oversight and continuous monitoring of outputs can serve as a safeguard against potential harms. Only by implementing such measures can the capabilities of LLMs in tourism be harnessed in a responsible, secure, and equitable manner [128].

3.13. Art

In recent years, LLMs have demonstrated remarkable capabilities in content generation across domains such as text, image, and music, often producing outputs that are indistinguishable from those created by humans [187]. However, this ability to generate content without genuine conceptual understanding reveals a fundamental paradox: these models can create content without necessarily comprehending its meaning. In the realm of art, this gap between production and understanding becomes even more pronounced, as artistic critique demands a deep grasp of aesthetic theory, and of the historical, social, and cultural context of the work. Unlike humans, whose creativity is grounded in understanding [188], LLMs generate text purely based on statistical patterns. Recent studies have attempted to examine this paradox through tests resembling a Turing Test for art, which assess whether audiences can tell whether a piece of art criticism was written by a human or by an AI, and through tasks related to the Theory of Mind. In this context, the framework of Noël Carroll, who argues that art criticism must involve value judgment, particularly judgment based on the concept of “success value”, has been used as a primary lens of analysis. Success value refers to a work’s ability to fulfill its conceptual goals, rather than simply eliciting an emotional response from the audience. From this perspective, evaluating LLMs’ performance in art analysis should go beyond textual fluency [189] and instead focus on depth of meaning, contextual grounding, and originality of ideas. This approach not only enables a fairer assessment of LLMs in the field of art, but also deepens our understanding of the limits of artificial creativity and the true capacities of these models [190].

3.13.1. Generation Without Understanding

One of the fundamental challenges is the Generative AI Paradox, a situation in which LLMs are capable of producing highly fluent, persuasive, and even technically impressive content without truly understanding the meaning of what they generate [191]. Unlike humans, whose creativity is grounded in deep understanding, lived experience, and contextual knowledge, LLMs rely solely on statistical patterns and linguistic similarities to generate output. This gap between generation and comprehension becomes especially evident in fields like art [192], where art criticism depends not only on linguistic skill but also on the interpretation of aesthetic concepts and an understanding of the historical, social, and cultural context of a work. Although LLMs can produce texts that resemble art criticism, in the absence of theoretical understanding, these critiques lack conceptual depth and analytical value. This paradox represents one of the most significant challenges in applying language models to domains that rely on genuine human understanding and reflection [188].

3.13.2. Creativity

One of the fundamental challenges of LLMs in the realm of art is the absence of intention and intrinsic motivation in the creative process. Unlike humans, who create art driven by passion, personal motivation, or a need for inner expression [193], LLMs merely respond to input without any goal or desire to create. According to psychological theories, intrinsic motivation is a key factor in initiating creativity, an activity that is rewarding in itself and is not performed for external rewards. However, LLMs do not experience pleasure, have no goals, and possess no rationale behind their choices. They simply generate the most statistically probable linguistic sequence [192], without knowing why they are doing so or for what purpose. This lack of intention means that their output, while structurally acceptable, often lacks depth, honesty, or artistic authenticity. A genuine work of art is typically a response to human experience, pain, ideas, or vision. In contrast, LLMs merely reflect the data they have been trained on, without ever having experienced anything or having something to express. This creates a significant divide between human art and machine-generated output [194].

3.13.3. Bias

One of the key challenges of LLMs in artistic production is the emergence of algorithmic biases. If these models are trained primarily on data reflecting the views, styles, or values of a specific group, they will ultimately generate content that reproduces that particular perspective [195], while neglecting other cultural or artistic viewpoints. Such biases can manifest in the form of stereotypical representations of characters, preference for certain artistic styles, or the omission of symbols and concepts belonging to other cultures. This issue not only affects cultural equity in artistic production but also leads to the erasure or distortion of diverse artistic experiences. In such cases, rather than expanding the boundaries of creativity, the model’s output may reinforce narrow and one-sided narratives [196]. To reduce this type of bias, it is essential that the training process of the models be designed with a diverse, balanced, and culturally sensitive approach, and that tools be developed to detect and mitigate bias in their outputs.

3.13.4. Copyright

The use of LLMs in generating artistic works raises one of the most serious legal and ethical challenges in the realm of intellectual property and authorship. Since the outputs of these models often appear structurally and aesthetically acceptable, and at times even creative, a fundamental question emerges: Who owns the work? Current copyright laws are generally based on the principle of a “human creator” and do not directly grant protection to works produced by non-human entities, including artificial intelligence [197]. Within this legal framework, two main conditions must be met for protection: first, the existence of a certain level of originality in the work; second, the reflection of the author’s personality, intention, and creativity. While LLM-generated outputs may superficially satisfy the first condition, they fail to meet the second, as these models lack consciousness, intent, and intrinsic motivation. As a result, such works can only be considered eligible for legal protection if a human has played an active role in the creative process, for example by designing targeted prompts or curating the outputs. However, the absence of clear criteria for determining the “human contribution” in the creation of such works leads to ambiguity in identifying authorship, the emergence of legal disputes, and the risk of misattributing creativity [198]. This uncertainty represents a major obstacle to the legal, commercial, and cultural recognition of LLM-generated art.

3.13.5. Lack of Creative Dynamism

LLMs like GPT-4 have fundamentally transformed natural language processing and now serve as the backbone of many cutting-edge AI applications. However, their development and deployment come with complex technical challenges [198]. On one hand, their massive architecture and numerous parameters require extensive computational resources such as powerful GPUs, large memory capacities, and high processing power, which makes them difficult to adopt for small organizations and resource-constrained countries. On the other hand, models based on the Transformer architecture, due to their sole reliance on self-attention mechanisms, face limitations in maintaining long-term memory, understanding chains of reasoning, and processing multimodal inputs (such as text, images, and audio simultaneously) [197]. Furthermore, achieving transfer learning and domain adaptation requires costly processes like fine-tuning and continual learning, which can affect the stability of model performance. These models also exhibit inconsistent behavior when faced with out-of-distribution (OOD) data, and their reliability under real-world conditions remains an open challenge. Taken together, these factors indicate that the effective utilization of LLMs requires more optimized architectures, compression algorithms, and adaptable designs capable of handling diverse and dynamic data.
At the close of this section, Table 1 offers a consolidated snapshot of the principal challenges and recommended solutions for deploying LLMs across diverse domains. The table lets readers quickly see which obstacles each industry faces and how targeted technical or strategic measures can address them.

4. Technical Challenges and Considerations of LLMs

LLMs such as GPT-4 have brought about a fundamental transformation in natural language processing and now form the core infrastructure of many modern artificial intelligence applications. However, the development and implementation of these models come with complex technical challenges [199]. On one hand, their massive structure and numerous parameters require extensive computational resources such as powerful GPUs, large memory, and high processing power, making their use difficult for small organizations and resource-limited countries. On the other hand, Transformer-based models, due to their sole reliance on the self-attention mechanism, face limitations in maintaining long-term memory, understanding chains of reasoning, and analyzing multimodal inputs (such as text, images, and audio simultaneously) [200]. Moreover, transfer learning and adaptation to specialized data in various fields require costly processes like fine-tuning and continual learning, which affect the model’s performance stability. Additionally, these models experience performance fluctuations when confronted with out-of-distribution data [201] and continue to pose reliability challenges in real-world conditions. The combination of these factors has made the effective use of LLMs reliant on more optimized architectures, compression algorithms, and the design of more adaptable structures in response to diverse and dynamic data. A general overview of these challenges is illustrated in Figure 6.

4.1. Fairness

If issues such as bias and fairness are not adequately addressed, they can have significant social consequences [202], including the generation of biased language and its negative impacts on certain social groups. Bias can enter LLMs from various sources, one of the main ones being bias in the training data. This bias occurs when the datasets used to train the models contain prejudices related to race, gender, religion, or socioeconomic status [203]; in such cases, the models inherit these biases and may even amplify them. Additionally, inadequate or incorrect representation of certain groups in training data can lead to the generation of language that does not accurately reflect the perspectives of these groups, producing incorrect or skewed outputs for them. Language model developers must have mechanisms in place to ensure balanced and fair representation of diverse perspectives in the datasets, as otherwise, the model may become biased against underrepresented groups [204]. If the training data contain stereotypes, the models will reproduce these stereotypes and contribute to the perpetuation of biases. Achieving fairness among different social groups in the training and development of LLMs is a difficult yet essential challenge that must be taken seriously.
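As a concrete illustration of how such output skew can be surfaced in practice, the following minimal Python sketch probes a generative model with demographic-swapped prompt templates and scores the completions with an off-the-shelf sentiment classifier. The GPT-2 checkpoint, the prompt template, the group terms, and the use of sentiment as a proxy metric are illustrative assumptions, not a standardized fairness audit.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in for the model under audit
sentiment = pipeline("sentiment-analysis")             # rough proxy for output skew, not a fairness oracle

TEMPLATE = "The {group} applicant was described by the hiring manager as"
GROUPS = ["male", "female", "older", "younger"]

scores = {}
for group in GROUPS:
    completion = generator(TEMPLATE.format(group=group),
                           max_new_tokens=30, do_sample=False)[0]["generated_text"]
    result = sentiment(completion, truncation=True)[0]
    # Signed score: positive sentiment counts as +, negative as -.
    scores[group] = result["score"] if result["label"] == "POSITIVE" else -result["score"]

# A large spread between groups flags a disparity worth a deeper, properly designed audit.
print(scores, "spread:", round(max(scores.values()) - min(scores.values()), 3))
```

Such lightweight probes cannot prove fairness, but they offer a quick first signal of whether a model systematically treats demographic variants of the same prompt differently.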

4.2. Countering Malicious Attacks

LLMs are vulnerable to adversarial attacks [205], such that even small but targeted disruptions in the input data can lead to misinterpretations by the model. Addressing this challenge is essential to ensure the reliability and trustworthiness of LLMs, and achieving stable performance in the face of such threats requires reducing vulnerability to adversarial manipulations. To combat these threats, a range of methods can be used, including input preprocessing and transformation, adversarial training, robust optimization techniques, adversarial sample detection, defensive distillation, model ensembles, adaptive training against adversarial attacks, attack transferability analysis, data augmentation considering attack scenarios, guaranteed robustness, explainable hardening mechanisms, and specific evaluation metrics for adversarial attacks [206]. With input preprocessing and transformation, techniques like noise cleaning or data-based defense transformations are applied to reduce adversarial disturbances. In adversarial training, models are trained with manipulated samples to become resistant to attacks; meanwhile, in robust optimization, training objective functions are modified so that the model performs more stably against disruptions.
Adversarial sample detection is carried out using methods such as input reconstruction, uncertainty measurement, and anomaly detection. Techniques like defensive distillation and model ensembles, by combining the outputs of multiple models, reduce the impact of adversarial attacks, especially when diverse models are integrated. In adaptive training, adversarial samples are dynamically generated during training to enhance the model’s resistance, and attack transferability analysis provides insights for designing more general defenses. In data augmentation considering attack scenarios, data are developed in accordance with threat scenarios, while guaranteed robustness provides formal guarantees to specify the model’s performance against certain attacks. Using explainable hardening mechanisms also allows for examining how the model counters attacks; finally, employing specific evaluation metrics for adversarial attacks enables the comparison and assessment of different methods and models [207]. These techniques, rooted in traditional machine learning, require localization and adaptation to the unique features of LLMs, and their development calls for further research to design defenses tailored to the structure and behavior of LLMs.
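To make one of these defenses concrete, the sketch below illustrates embedding-space adversarial training in an FGSM-like style: the gradient of the loss with respect to the input embeddings is used to craft a perturbed example, and the model is updated on a mixture of clean and perturbed losses. The DistilBERT checkpoint, the perturbation budget, and the 50/50 loss weighting are illustrative choices rather than a prescription from the cited literature.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
epsilon = 0.01  # perturbation budget in embedding space (illustrative)

def adversarial_step(texts, labels):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(labels)

    # Clean forward pass on input embeddings so gradients w.r.t. them are available.
    embeds = model.get_input_embeddings()(batch["input_ids"]).detach().requires_grad_(True)
    clean = model(inputs_embeds=embeds, attention_mask=batch["attention_mask"], labels=labels)
    grad = torch.autograd.grad(clean.loss, embeds, retain_graph=True)[0]

    # FGSM-style perturbation of the embeddings, then one update on both losses.
    adv_embeds = (embeds + epsilon * grad.sign()).detach()
    adv = model(inputs_embeds=adv_embeds, attention_mask=batch["attention_mask"], labels=labels)
    loss = 0.5 * clean.loss + 0.5 * adv.loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(adversarial_step(["a wonderful, well-acted film", "a dull and predictable plot"], [1, 0]))
```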

4.3. Integration of Heterogeneous Data

Multimodal LLMs are evolving from the sole processing of textual data to understanding and generating content based on various types of data [208]. While LLMs are currently trained primarily on vast amounts of textual data, the research trend is moving toward enabling these models to analyze and integrate visual, graphical, audio, and video data as well. This expansion has become especially critical in the present era, where the widespread growth of mobile devices and cameras results in the daily production of millions of images and videos [82]. Fully leveraging the potential of LLMs requires the seamless integration of these data types, so that the models can not only comprehend content from various media but also generate outputs that coherently and purposefully incorporate the semantic and contextual elements of all these media [209]. Achieving this goal requires advanced and technical research in various fields, as the integration of multimodal data is one of the most challenging issues in the development of next-generation LLMs.

4.4. Multimodal Large Language Model Research

Current research in the field of multimodal LLMs addresses various topics, including preprocessing and feature extraction of multimodal data, creating accurate and multi-level representations of multimedia data, and understanding spatiotemporal information in videos [210]; semantic and contextual understanding also plays a significant role in improving the performance of these models. Additionally, developing multimodal fusion architectures that can seamlessly integrate different data types such as text, image, audio, and video is one of the fundamental challenges. Pretraining and cross-modal transfer learning help improve shared representations across modalities, enabling the model to better perform downstream multimodal tasks. Furthermore, cross-modal alignment and adaptation, to precisely match elements from different media, require further research, with methods such as attention-based alignment and similarity metrics being applied in this area [211]. Creating accurate multimodal representations requires identifying and reflecting the complex relationships between different data types, with learning shared contextual vectors being one of the effective solutions. One of the major challenges is developing effective preprocessing techniques and extracting features specific to each data type, so these data can be properly integrated into multimodal model architectures [212]; understanding spatiotemporal information and identifying motion patterns in videos, along with combining spatial information, are essential abilities for generating meaningful multimodal outputs by LLMs.
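As a simplified example of the similarity-based cross-modal alignment mentioned above, the following sketch scores how well an image matches a set of candidate captions in a shared embedding space, using the publicly available CLIP checkpoint. The image path and the captions are placeholders for illustration only.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder input image
captions = ["a chest X-ray", "a landscape photograph", "a hand-written note"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Similarity-based scores between the image and each caption; the best-aligned
# caption receives the highest probability after the softmax.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.2f}  {caption}")
```

Alignment scores of this kind are one building block; full multimodal LLMs additionally need fusion architectures that let aligned representations condition text generation.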

4.5. Low-Resource Languages

The research and development of LLMs have so far been primarily limited to the English language; meanwhile, according to the Ethnologue report, there are 7168 living languages in the world, more than 3045 of which are at risk of extinction because parents teach their children dominant languages instead of their native tongue [213]. LLMs can play a significant role in preserving and promoting these languages, especially those with limited resources that require organized datasets and, in some cases, speech-to-text conversion. To ensure linguistic inclusivity, strategies such as generating artificial data through back-translation, paraphrasing, or rule-based data generation; leveraging transfer learning from pretrained models; employing multilingual models with shared parameters; and developing few-shot or zero-shot models with methods like metacognitive learning or cross-linguistic transfer are under exploration [214]. Additionally, using semi-supervised and self-supervised learning to exploit unlabeled data, designing architectures tailored to specific linguistic features, and involving language communities in dataset creation through crowdsourcing are other key measures. The development of LLMs for low-resource languages can enable textual and spoken documentation, digital archiving of texts, preservation of stories and legends, provision of language learning tools, reduction in communication barriers, creation of educational content, and support for linguistic research [215], as well as the development of technologies that allow users to interact in their native language and culture.
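The back-translation strategy mentioned above can be sketched as follows: each sentence is translated into a pivot language and back, yielding a paraphrased variant that can be added to a scarce training corpus. The English–French pivot models are used purely for illustration; a genuinely low-resource language would require pivot models chosen or trained for that language.

```python
from transformers import pipeline

to_pivot = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
from_pivot = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(sentence: str) -> str:
    """Translate to the pivot language and back to obtain a paraphrase."""
    pivot = to_pivot(sentence, max_length=256)[0]["translation_text"]
    return from_pivot(pivot, max_length=256)[0]["translation_text"]

corpus = ["The harvest festival begins when the first rains arrive."]
augmented = [back_translate(s) for s in corpus]
print(augmented)  # paraphrased variants that can enlarge the training set
```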

4.6. Continuous Learning in Language Models

For LLMs to be practical, they need to be capable of continuous learning from new data, adapting to changing contexts, and retaining prior knowledge. Achieving these goals requires research in various areas, including the development of incremental learning algorithms that allow models to learn new information without forgetting past data [216], using methods like data replay, regularization, and parameter separation. Architectures with external memory, such as attention-based interfaces, which are known as memory-augmented models, also play a role in preserving previous information. To achieve continuous learning, LLMs must prioritize new information while retaining old knowledge [217], which is made possible through the use of adaptive learning rate schedules. Additionally, task-independent representations help models learn transferable and generalizable features, enabling them to adapt to new tasks without extensive retraining [218]. Regularization methods like EWC (Elastic Weight Consolidation) and synaptic intelligence facilitate continuous learning by preserving parameter stability. Alongside these, meta-learning and few-shot learning enable models to adapt to new tasks or domains with minimal data. Fine-tuning models based on new data while leveraging pretrained representations is another effective strategy for adaptation, and hybrid models that combine episodic memory systems with continuous learning techniques can also be utilized [219].
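To ground the EWC idea mentioned above, the following toy sketch estimates a diagonal Fisher importance for each parameter on an old task and adds a quadratic penalty that discourages drifting away from the old values while learning new data. The linear model, the synthetic data, and the regularization weight are illustrative assumptions, not a production recipe for LLM-scale training.

```python
import torch
import torch.nn as nn

def diagonal_fisher(model, data, loss_fn):
    """Approximate per-parameter importance by the average squared gradient."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2 / len(data)
    return fisher

def ewc_penalty(model, fisher, old_params, lam=100.0):
    """Quadratic penalty anchoring important parameters to their old values."""
    return lam * sum((fisher[n] * (p - old_params[n]) ** 2).sum()
                     for n, p in model.named_parameters())

# Toy usage with a linear model and synthetic "old task" data.
model = nn.Linear(4, 2)
loss_fn = nn.CrossEntropyLoss()
old_task = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(5)]

fisher = diagonal_fisher(model, old_task, loss_fn)
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}

# While training on new data, the EWC term is simply added to the task loss.
x_new, y_new = torch.randn(8, 4), torch.randint(0, 2, (8,))
total_loss = loss_fn(model(x_new), y_new) + ewc_penalty(model, fisher, old_params)
total_loss.backward()
```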

4.7. Ethical Use

To address the ethical challenges that arise in using LLMs, clear guidelines and frameworks are required to direct the development, deployment, and functioning of these models [185]. These guidelines must be designed in collaboration with language researchers, technology experts, application developers, and policymakers. Most importantly, the adoption and commitment to these ethical principles by researchers and organizations are essential to ensure the responsible development and use of LLMs [220]. Responsible AI application requires the integration of principles such as fairness, explainability, transparency, accountability, and privacy protection at all stages of the language model development cycle and its applications. On the other hand, since LLMs may contribute to the dissemination of misinformation, harmful content, or hate speech, implementing content management strategies is an integral part of using these models [221]. These systems must be continuously upgraded and monitored to prevent the generation of undesirable content. Additionally, conducting audits and ongoing assessments to identify model biases, ensure compliance with regulations, and analyze social impacts is necessary.

4.8. Human Interaction

Real-world applications of LLMs, unlike traditional software, require special attention and consideration. Accurate identification and documentation of real use cases are highly important because, to effectively address specific issues, models must be fine-tuned and optimized for those applications [222]. This requires a deep understanding of the goals, challenges, and requirements of various domains. Designing intuitive and user-friendly interfaces is also crucial to facilitate interaction between humans and models, ensuring accessibility and ease of use for diverse user groups by adhering to user-centered design principles. “Human-in-the-loop” methodologies play a key role, as human feedback helps improve model performance and continuously refine outputs [47]. Additionally, providing accessibility and inclusivity mechanisms, such as support for multiple languages, is essential to enhance the efficiency and widespread use of these models.
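A minimal human-in-the-loop sketch is shown below: model outputs are presented to a reviewer, ratings are logged, and the resulting records can later be used for fine-tuning or prompt revision. The rating scale, the log format, and the `generate` callable are assumptions made purely for illustration.

```python
import json

def review_loop(prompts, generate, log_path="feedback.jsonl"):
    """Show each model answer to a human reviewer and log the rating."""
    with open(log_path, "a", encoding="utf-8") as log:
        for prompt in prompts:
            answer = generate(prompt)
            print(f"\nPROMPT: {prompt}\nMODEL:  {answer}")
            rating = int(input("Rate the answer from 1 (poor) to 5 (excellent): "))
            log.write(json.dumps({"prompt": prompt, "answer": answer, "rating": rating}) + "\n")

# Trivial stand-in generator; low-rated records in feedback.jsonl can later be
# reviewed, corrected, and reused for fine-tuning or prompt revision.
review_loop(["Summarize the side effects of aspirin."], lambda p: "(model output here)")
```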
Before closing this section, Table 2 provides a concise overview of the cross-cutting technical challenges encountered when building and operating LLMs, together with effective mitigation strategies to enhance fairness, robustness, and efficiency. It helps readers grasp, at a glance, how underlying issues relate to their corresponding remedies.

5. Future Directions

In this section, we explore the future outlook of LLMs from three dimensions: technical, ethical, and legal. The technical dimension addresses multimodal integration, the ethical dimension focuses on environmental consequences and energy optimization, and the legal dimension discusses user privacy protection.

5.1. Technical Dimension

With the rapid advancement of LLMs, technical concerns such as scalability, computational power, multimodal data integration, and efficiency in real-world environments have become central issues for the future of this technology. Designing models that can reduce resource consumption while maintaining accuracy and performance is one of the key priorities for future research. Additionally, the ability to understand and combine diverse inputs such as text, image, and audio has introduced new challenges in semantic alignment, multimodal architectures, and strategies for fine-tuning models. Future studies in this domain should focus on developing optimized, lightweight, and domain-specific solutions that align practical needs with technical constraints [223].
One of the most prominent manifestations of these technical complexities is the need for the optimal integration of multimodal inputs and their simultaneous processing within advanced architectural frameworks. With the expansion of LLMs, the integration of multimodal data, including text, image, audio, and video, has become one of the most transformative directions for future research [224]. In fields such as digital health [225], clinical data analysis, medical education, and virtual consultation, the ability of models to interpret and simultaneously process such diverse inputs can unlock a new level of intelligence and decision support. For example, in healthcare settings, combining a patient’s textual medical history with medical images (e.g., MRI or CT scans) and even audio data (e.g., breathing sounds or patient conversations) can significantly enhance diagnostic accuracy and the quality of treatment recommendations.
Recent developments such as Flamingo and GIT show that multimodal Transformer architectures can effectively model the interaction between natural language and visual inputs. Moreover, models like EmbodiedGPT, which leverage vision–language integration, are capable of being deployed in operational and clinical environments. However, challenges such as semantic alignment between heterogeneous data sources (text–image–audio), the scarcity of domain-specific labeled datasets in the medical field, low interpretability, and high hardware requirements for training and deployment remain significant barriers. Future research should focus on designing lightweight architectures tailored to domain-specific data, developing multimodal fine-tuning methods [226], and building standardized, high-quality datasets for evaluating performance in real-world scenarios. Furthermore, enhancing the transparency of model decision making and developing interpretability tools for clinical professionals should be among the ethical and practical priorities in advancing this field.
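As a schematic example of such multimodal integration, the sketch below fuses precomputed text and image embeddings by simple concatenation before a small prediction head. The embedding dimensions, the three-class triage setup, and the random inputs are illustrative assumptions and are not drawn from any of the cited systems.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Concatenate modality embeddings and map them to task logits."""
    def __init__(self, text_dim=768, image_dim=512, num_classes=3):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_emb, image_emb):
        fused = torch.cat([text_emb, image_emb], dim=-1)  # simple concatenation fusion
        return self.classifier(fused)

head = LateFusionHead()
logits = head(torch.randn(4, 768), torch.randn(4, 512))  # a batch of 4 hypothetical patients
print(logits.shape)  # torch.Size([4, 3])
```

Late fusion of this kind is the simplest option; cross-attention architectures such as those cited above integrate modalities earlier and more deeply, at a higher computational cost.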

5.2. Ethical Dimension

With the growth of LLMs, ethical considerations have moved beyond data biases to include environmental responsibility as well. The high energy consumption of these models has significant environmental consequences that may exacerbate technological and biological inequalities. Ethically grounded LLM development must therefore focus on designing architectures that are aligned with ethical principles and sustainability from the outset. Transparency, accountability, and the reduction of negative impacts will be key principles along this path [227].
One of the clearest examples of how technical challenges intersect with ethical priorities is the issue of energy optimization during the training and deployment of large LLMs. Given that one of the major challenges facing LLMs across all domains is their high computational cost and substantial energy use, achieving energy sustainability has become a serious concern for both researchers and policymakers [228]. This concern is especially critical as LLMs are increasingly applied in sensitive domains such as healthcare and clinical data analysis, where the environmental consequences of training and inference processes, such as excessive carbon emissions, high electricity consumption, and water usage, raise pressing ethical and practical questions. Training a single advanced model can result in the emission of hundreds of tons of carbon and may demand infrastructure that is simply not available in many medical systems, particularly in resource-limited settings.
To address this, future research should prioritize the development of lightweight and energy-efficient architectures that can significantly reduce power consumption without compromising model performance. Promising solutions in this area include quantization, model compression, domain-specific fine-tuning, and the adoption of high-efficiency hardware. Additionally, optimizing energy use during the inference phase, particularly in real-time applications, deserves special attention. These technological strategies, when combined with broader environmental policies such as the use of clean energy sources and the implementation of green frameworks in data centers [229], can support a more responsible and sustainable path forward for the deployment of LLMs. As artificial intelligence continues to advance, the need to balance computational capabilities with environmental responsibility must be treated as a core principle in the design, implementation, and regulation of this transformative technology.
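As one concrete instance of the compression techniques listed above, the sketch below applies post-training dynamic quantization to a Transformer classifier, storing the weights of its linear layers as 8-bit integers. The DistilBERT checkpoint is used only as an example, and the actual energy savings would depend on the deployment hardware and workload.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Convert all Linear layers to dynamically quantized int8 modules; no retraining
# is required, and activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

print(type(quantized.classifier))  # now a dynamically quantized Linear module
```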

5.3. Legal Dimension

With the expansion of LLMs across various domains, the legal implications of using personal data, content generation, and automated decision making have gained increasing importance. These models are often trained on datasets that may contain sensitive information or data protected by intellectual property rights, which raises significant challenges in areas such as data ownership, digital consent, and compliance with existing regulations (such as GDPR, HIPAA, and others). Therefore, future research should focus on developing intelligent, transparent, and enforceable legal frameworks that protect users without hindering innovation. Establishing international legal standards, identifying the role of accountability in automated decisions, and designing mechanisms to control access to and use of data are essential steps toward responsible progress in this field [230].
A central legal concern in the development of language models is the structured protection of personal data across all phases of training, inference, and storage. As LLMs are increasingly deployed in sectors such as education, healthcare, business, and media, ensuring the privacy and security of user data has become a fundamental challenge. These models are typically trained on vast corpora of real-world data, which may contain identifiable information, private conversations, digital records, or user-generated content. Even where the data have been anonymized, there remains a risk of re-identification through subtle language patterns or by correlating information from multiple sources.
In response to these risks, future research and regulatory strategies should aim to integrate privacy-preserving mechanisms directly into model design [184]. Prominent approaches include federated learning methods, which allow models to be trained without transferring raw data to centralized servers. Similarly, learning with differential privacy can help reduce information leakage by introducing statistical noise into model updates or outputs. Furthermore, developing algorithms that enhance decision-making transparency, creating legally binding frameworks for data access, and ensuring that users provide informed digital consent are essential measures for responsible model governance. The principle of privacy-by-design, in which privacy protections are embedded from the earliest stages of system development rather than added later, should serve as a foundational philosophy in this domain. Finally, the establishment of international privacy standards tailored to LLMs will be crucial for the long-term, responsible, and sustainable advancement of this technology [231].
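The following toy sketch illustrates the core idea behind differentially private training: gradients are clipped and perturbed with Gaussian noise before each update. For simplicity it clips the batch gradient rather than per-example gradients, so it is only an approximation of DP-SGD; a real deployment would rely on a dedicated library with a privacy accountant to track the cumulative privacy budget. The clipping norm and noise multiplier below are illustrative.

```python
import torch
import torch.nn as nn

clip_norm = 1.0         # maximum gradient norm (illustrative)
noise_multiplier = 1.1  # noise scale relative to the clipping norm (illustrative)

model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def private_step(x, y):
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    # Clip the gradient norm, then add calibrated Gaussian noise before updating.
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    for p in model.parameters():
        p.grad += torch.randn_like(p.grad) * noise_multiplier * clip_norm
    optimizer.step()

private_step(torch.randn(32, 16), torch.randint(0, 2, (32,)))
```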

6. Conclusions

This study aimed to provide a structured and comprehensive review of the challenges associated with LLMs, addressing the technical, ethical, social, and legal dimensions of this rapidly evolving technology across various domains. The findings indicate that LLMs, as one of the most advanced achievements in language-focused artificial intelligence, have played a transformative role in sectors such as healthcare, education, media, law, agriculture, industry, energy, and scientific research. However, the rapid expansion and widespread adoption of this technology have introduced complex challenges that cannot be effectively managed without a multidimensional and socially responsible approach.
Key obstacles include the opaque nature of model decision making, data-driven biases, privacy and security risks, the high demand for computational infrastructure, and the technological divide between countries. These issues, especially in high-stakes domains such as medicine and law, can result in flawed decisions, discriminatory outcomes, or the erosion of public trust.
Nonetheless, the vast potential of this technology must not be overlooked. LLMs are capable of revolutionizing tasks such as question answering, information summarization, natural language understanding, personalized recommendations, and even multimodal interactions. These capabilities have the potential to enhance human productivity, accelerate research processes, and improve service quality across various fields.
In light of the discussions presented in the Future Outlook Section, the sustainable and responsible development of LLMs requires long-term planning, investment in interdisciplinary research, and the design of frameworks that address both technical demands and ethical, legal, and social considerations. Future efforts must focus on building models that are more interpretable, transparent, energy-efficient, and aligned with local and human-centered needs. Moreover, continuous assessment of real-world implications and the active involvement of stakeholders including policymakers, developers, end users, and regulatory bodies are essential to ensure responsible integration.
Overall, this article emphasizes that, if properly guided, LLMs can become one of the most impactful tools for advancing human progress. However, if their ethical and societal dimensions are ignored, they may pose significant risks. The future of this technology depends on the decisions we make today: decisions that must distinguish between mere productivity and productivity that is grounded in justice, transparency, and accountability.

Author Contributions

Conceptualization, P.P., F.R., C.T. and S.G.; Methodology, P.P., F.R., C.T. and S.G.; Validation, P.P., F.R., C.T. and S.G.; Formal Analysis, P.P., F.R., C.T. and S.G.; Investigation, P.P., F.R., C.T. and S.G.; Resources, P.P., F.R., C.T. and S.G.; Data Curation, P.P., F.R. and C.T.; Writing Original Draft Preparation, P.P., F.R. and S.G.; Writing Review and Editing, P.P., F.R., C.T. and S.G.; Visualization, P.P., F.R. and S.G.; Supervision, P.P. and C.T.; Project Administration, P.P., F.R., C.T. and S.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the anonymous reviewers and the editor-in-chief for their constructive comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
  2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  3. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  4. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  5. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  6. Biswas, S.S. Role of chat gpt in public health. Ann. Biomed. Eng. 2023, 51, 868–869. [Google Scholar] [CrossRef] [PubMed]
  7. Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. Palm-e: An embodied multimodal language model. arXiv 2023, arXiv:2303.03378. [Google Scholar]
  8. Raiaan, M.A.K.; Mukta, S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A review on large language models: Architectures, applications, taxonomies, open issues and challenges. IEEE Access 2024, 12, 26839–26874. [Google Scholar] [CrossRef]
  9. Cavnar, W.B.; Trenkle, J.M. N-gram-based text categorization. In Proceedings of the SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, USA, 11–13 April 1994; Volume 161175, p. 14. [Google Scholar]
  10. Blunsom, P. Hidden Markov Models. Lect. Notes 2004, 15, 48. [Google Scholar]
  11. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar] [CrossRef]
  12. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  13. Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; Khudanpur, S. Recurrent neural network based language model. In Proceedings of the Interspeech, Chiba, Japan, 26–30 September 2010; Volume 2, No. 3. pp. 1045–1048. [Google Scholar]
  14. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef] [PubMed]
  15. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  16. Kosmyna, N.; Hauptmann, E.; Yuan, Y.T.; Situ, J.; Liao, X.-H.; Beresnitzky, A.V.; Braunstein, I.; Maes, P. Your brain on chatgpt: Accumulation of cognitive debt when using an ai assistant for essay writing task. arXiv 2025, arXiv:2506.08872. [Google Scholar] [CrossRef]
  17. Magesh, V.; Surani, F.; Dahl, M.; Suzgun, M.; Manning, C.D.; Ho, D.E. Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. J. Empir. Leg. Stud. 2025, 22, 216–242. [Google Scholar]
  18. Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
  19. Weidinger, L.; Uesato, J.; Rauh, M.; Griffin, C.; Huang, P.-S.; Mellor, J.; Glaese, A.; Cheng, M.; Balle, B.; Kasirzadeh, A.; et al. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul, Republic of Korea, 21–24 June 2022; pp. 214–229. [Google Scholar]
  20. Li, Z.; Shi, Y.; Liu, Z.; Yang, F.; Liu, N.; Du, M. Quantifying multilingual performance of large language models across languages. arXiv 2024, arXiv:2404.11553. [Google Scholar] [CrossRef]
  21. Wu, L.; Zheng, Z.; Qiu, Z.; Wang, H.; Gu, H.; Shen, T.; Qin, C.; Zhu, C.; Zhu, H.; Liu, Q.; et al. A survey on large language models for recommendation. World Wide Web 2024, 27, 60. [Google Scholar] [CrossRef]
  22. Patil, R.; Gudivada, V. A review of current trends, techniques, and challenges in large language models (llms). Appl. Sci. 2024, 14, 2074. [Google Scholar] [CrossRef]
  23. He, H.; Su, W.J. A Law of Next-Token Prediction in Large Language Models. arXiv 2024, arXiv:2408.13442. [Google Scholar] [CrossRef]
  24. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual, 3–10 March 2021; pp. 610–623. [Google Scholar]
  25. Zhu, Y.; Du, S.; Li, B.; Luo, Y.; Tang, N. Are Large Language Models Good Statisticians? arXiv 2024, arXiv:2406.07815. [Google Scholar] [CrossRef]
  26. Almazrouei, E.; Alobeidli, H.; Alshamsi, A.; Cappelli, A.; Cojocaru, R.; Debbah, M.; Goffinet, É.; Hesslow, D.; Launay, J.; Malartic, Q.; et al. The falcon series of open language models. arXiv 2023, arXiv:2311.16867. [Google Scholar] [CrossRef]
  27. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A comprehensive overview of large language models. ACM Trans. Intell. Syst. Technol. 2023. [Google Scholar] [CrossRef]
  28. Xu, W.; Hu, W.; Wu, F.; Sengamedu, S. DeTiME: Diffusion-enhanced topic modeling using encoder-decoder based LLM. arXiv 2023, arXiv:2310.15296. [Google Scholar]
  29. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  30. Mo, Y.; Qin, H.; Dong, Y.; Zhu, Z.; Li, Z. Large language model (llm) ai text generation detection based on transformer deep learning algorithm. arXiv 2024, arXiv:2405.06652. [Google Scholar]
  31. Singh, S.; Mahmood, A. The NLP cookbook: Modern recipes for transformer based deep learning architectures. IEEE Access 2021, 9, 68675–68702. [Google Scholar] [CrossRef]
  32. Li, T.; El Mesbahi, Y.; Kobyzev, I.; Rashid, A.; Mahmud, A.; Anchuri, N.; Hajimolahoseini, H.; Liu, Y.; Rezagholizadeh, M. A short study on compressing decoder-based language models. arXiv 2021, arXiv:2110.08460. [Google Scholar] [CrossRef]
  33. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
  34. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 39. [Google Scholar] [CrossRef]
  35. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 11324–11436. [Google Scholar]
  36. Wu, Z.; Qiu, L.; Ross, A.; Akyürek, E.; Chen, B.; Wang, B.; Kim, N.; Andreas, J.; Kim, Y. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 16–21 June 2024; Volume 1: Long Papers, pp. 1819–1862. [Google Scholar]
  37. Christiano, P.F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  38. Liu, Y.; Cao, J.; Liu, C.; Ding, K.; Jin, L. Datasets for large language models: A comprehensive survey. arXiv 2024, arXiv:2402.18041. [Google Scholar] [CrossRef]
  39. Lhoest, Q.; del Moral, A.V.; Jernite, Y.; Thakur, A.; von Platen, P.; Patil, S.; Chaumond, J.; Drame, M.; Plu, J.; Tunstall, L.; et al. Datasets: A community library for natural language processing. arXiv 2021, arXiv:2109.02846. [Google Scholar] [CrossRef]
  40. Song, Y.; Cui, C.; Khanuja, S.; Liu, P.; Faisal, F.; Ostapenko, A.; Indra Winata, G.; Fikri Aji, A.; Cahyawijaya, S.; Svetkov, Y.; et al. GlobalBench: A benchmark for global progress in natural language processing. arXiv 2023, arXiv:2305.14716. [Google Scholar] [CrossRef]
  41. Kazemi, M.; Dikkala, N.; Anand, A.; Devic, P.; Dasgupta, I.; Liu, F.; Fatemi, B.; Awasthi, P.; Gollapudi, S.; Guo, D.; et al. Remi: A dataset for reasoning with multiple images. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 37, pp. 60088–60109. [Google Scholar]
  42. Wang, H.; Li, J.; Wu, H.; Hovy, E.; Sun, Y. Pre-trained language models and their applications. Engineering 2023, 25, 51–65. [Google Scholar] [CrossRef]
  43. Carlini, N.; Tramer, F.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Lee, K.; Roberts, A.; Brown, T.; Song, D.; Erlingsson, Ú.; et al. Extracting training data from large language models. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Virtual, 11–13 August 2021; pp. 2633–2650. [Google Scholar]
  44. Qiu, X.; Sun, T.; Xu, Y.; Shao, Y.; Dai, N.; Huang, X. Pre-trained models for natural language processing: A survey. Sci. China Technol. Sci. 2020, 63, 1872–1897. [Google Scholar] [CrossRef]
  45. Que, H.; Liu, J.; Zhang, G.; Zhang, C.; Qu, X.; Ma, Y.; Duan, F.; Bai, Z.; Wang, J.; Zhang, Y.; et al. D-cpt law: Domain-specific continual pre-training scaling law for large language models. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 37, pp. 90318–90354. [Google Scholar]
  46. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. (HEALTH) 2021, 3, 2. [Google Scholar] [CrossRef]
  47. Hadi, M.U.; Al Tashi, Q.; Qureshi, R.; Shah, A.; Muneer, A.; Irfan, M.; Zafar, A.; Shaikh, M.B.; Akhtar, N.; Hassan, S.Z.; et al. Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Prepr. 2023, 1, 1–26. [Google Scholar]
  48. Lee, K.; Ippolito, D.; Nystrom, A.; Zhang, C.; Eck, D.; Callison-Burch, C.; Carlini, N. Deduplicating training data makes language models better. arXiv 2021, arXiv:2107.06499. [Google Scholar]
  49. Zhang, S.; Dong, L.; Li, X.; Zhang, S.; Sun, X.; Wang, S.; Li, J.; Hu, R.; Zhang, T.; Wu, F.; et al. Instruction tuning for large language models: A survey. arXiv 2023, arXiv:2308.10792. [Google Scholar]
  50. Honovich, O.; Scialom, T.; Levy, O.; Schick, T. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv 2022, arXiv:2212.09689. [Google Scholar] [CrossRef]
  51. Ahmad, W.U.; Ficek, A.; Samadi, M.; Huang, J.; Noroozi, V.; Majumdar, S.; Ginsburg, B. OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs. arXiv 2025, arXiv:2504.04030. [Google Scholar]
  52. Zhang, X.; Tian, C.; Yang, X.; Chen, L.; Li, Z.; Petzold, L.R. Alpacare: Instruction-tuned large language models for medical application. arXiv 2023, arXiv:2310.14558. [Google Scholar]
  53. Cui, G.; Yuan, L.; Ding, N.; Yao, G.; He, B.; Zhu, W.; Ni, Y.; Xie, G.; Xie, R.; Lin, Y.; et al. Ultrafeedback: Boosting language models with high-quality feedback. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  54. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 27730–27744. [Google Scholar]
  55. Chang, T.A.; Bergen, B.K. Language model behavior: A comprehensive survey. Comput. Linguist. 2024, 50, 293–350. [Google Scholar] [CrossRef]
  56. Yang, R.; Tan, T.F.; Lu, W.; Thirunavukarasu, A.J.; Ting, D.S.W.; Liu, N. Large language models in health care: Development, applications, and challenges. Health Care Sci. 2023, 2, 255–263. [Google Scholar] [CrossRef] [PubMed]
  57. Adeniran, A.A.; Onebunne, A.P.; William, P. Explainable AI (XAI) in healthcare: Enhancing trust and transparency in critical decision-making. World J. Adv. Res. Rev. 2024, 23, 2647–2658. [Google Scholar] [CrossRef]
  58. Das, B.C.; Amini, M.H.; Wu, Y. Security and privacy challenges of large language models: A survey. ACM Comput. Surv. 2025, 57, 152. [Google Scholar] [CrossRef]
  59. Huang, J.; Chang, K.C.C. Towards reasoning in large language models: A survey. arXiv 2022, arXiv:2212.10403. [Google Scholar]
  60. Lehman, E.; Jain, S.; Pichotta, K.; Goldberg, Y.; Wallace, B.C. Does BERT pretrained on clinical notes reveal sensitive data? arXiv 2021, arXiv:2104.07762. [Google Scholar] [CrossRef]
  61. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should i trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  62. Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940. [Google Scholar] [CrossRef] [PubMed]
  63. Esmaeilzadeh, P. Challenges and strategies for wide-scale artificial intelligence (AI) deployment in healthcare practices: A perspective for healthcare organizations. Artif. Intell. Med. 2024, 151, 102861. [Google Scholar] [CrossRef] [PubMed]
  64. Nazi, Z.A.; Peng, W. Large language models in healthcare and medical domain: A review. Informatics 2024, 11, 57. [Google Scholar] [CrossRef]
  65. Daneshjou, R.; Vodrahalli, K.; Novoa, R.A.; Jenkins, M.; Liang, W.; Rotemberg, V.; Ko, J.; Swetter, S.M.; Bailey, E.E.; Gevaert, O.; et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci. Adv. 2022, 8, eabq6147. [Google Scholar] [CrossRef] [PubMed]
  66. Hasanzadeh, F.; Josephson, C.B.; Waters, G.; Adedinsewo, D.; Azizi, Z.; White, J.A. Bias recognition and mitigation strategies in artificial intelligence healthcare applications. NPJ Digit. Med. 2025, 8, 154. [Google Scholar] [CrossRef] [PubMed]
  67. Omar, M.; Sorin, V.; Agbareia, R.; Apakama, D.U.; Soroush, A.; Sakhuja, A.; Freeman, R.; Horowitz, C.R.; Richardson, L.D.; Nadkarni, G.N.; et al. Evaluating and addressing demographic disparities in medical large language models: A systematic review. Int. J. Equity Health 2025, 24, 57. [Google Scholar] [CrossRef] [PubMed]
  68. Omiye, J.A.; Gui, H.; Rezaei, S.J.; Zou, J.; Daneshjou, R. Large language models in medicine: The potentials and pitfalls: A narrative review. Ann. Intern. Med. 2024, 177, 210–220. [Google Scholar] [CrossRef] [PubMed]
  69. Nie, Y.; Kong, Y.; Dong, X.; Mulvey, J.M.; Poor, H.V.; Wen, Q.; Zohren, S. A survey of large language models for financial applications: Progress, prospects and challenges. arXiv 2024, arXiv:2406.11903. [Google Scholar] [CrossRef]
  70. Rane, N.; Choudhary, S.; Rane, J. Explainable Artificial Intelligence (XAI) approaches for transparency and accountability in financial decision-making. SSRN Electron. J. 2023. [Google Scholar] [CrossRef]
  71. Lu, G.; Guo, X.; Zhang, R.; Zhu, W.; Liu, J. BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs. arXiv 2025, arXiv:2505.19457. [Google Scholar]
  72. Shoeybi, M.; Patwary, M.; Puri, R.; LeGresley, P.; Casper, J.; Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv 2019, arXiv:1909.08053. [Google Scholar]
  73. Narayanan, D.; Shoeybi, M.; Casper, J.; LeGresley, P.; Patwary, M.; Korthikanti, V.A.; Vainbrand, D.; Kashinkunti, P.; Bernauer, J.; Catanzaro, B.; et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, MO, USA, 14–19 November 2021; pp. 1–15. [Google Scholar]
  74. Araci, D. Finbert: Financial sentiment analysis with pre-trained language models. arXiv 2019, arXiv:1908.10063. [Google Scholar]
  75. Strubell, E.; Ganesh, A.; McCallum, A. Energy and policy considerations for modern deep learning research. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, No. 09. pp. 13693–13696. [Google Scholar]
  76. Li, Y.; Wang, S.; Ding, H.; Chen, H. Large language models in finance: A survey. In Proceedings of the Fourth ACM International Conference on AI in Finance, New York, NY, USA, 27–29 November 2023; pp. 374–382. [Google Scholar]
  77. Chu, Z.; Guo, H.; Zhou, X.; Wang, Y.; Yu, F.; Chen, H.; Xu, W.; Lu, X.; Cui, Q.; Li, L.; et al. Data-centric financial large language models. arXiv 2023, arXiv:2310.17784. [Google Scholar] [CrossRef]
  78. Phogat, K.S.; Puranam, S.A.; Dasaratha, S.; Harsha, C.; Ramakrishna, S. Fine-tuning Smaller Language Models for Question Answering over Financial Documents. arXiv 2024, arXiv:2408.12337. [Google Scholar] [CrossRef]
  79. Qian, L.; Zhou, W.; Wang, Y.; Peng, X.; Yi, H.; Zhao, Y.; Huang, J.; Xie, Q.; Nie, J.-Y. Fino1: On the transferability of reasoning enhanced llms to finance. arXiv 2025, arXiv:2502.08127. [Google Scholar]
  80. Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; Mann, G. BloombergGPT: A large language model for finance. arXiv 2023, arXiv:2303.17564. [Google Scholar] [CrossRef]
  81. Wang, Y.; Wang, Y.; Liu, Y.; Bao, R.; Harimoto, K.; Sun, X. Proxy Tuning for Financial Sentiment Analysis: Overcoming Data Scarcity and Computational Barriers. In Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal), Abu Dhabi, United Arab Emirates, 19–20 January 2025; pp. 169–174. [Google Scholar]
  82. Wu, J.; Gan, W.; Chen, Z.; Wan, S.; Yu, P.S. Multimodal large language models: A survey. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; IEEE: New York, NY, USA, 2023; pp. 2247–2256. [Google Scholar]
  83. Mirishli, S. Regulating Ai In Financial Services: Legal Frameworks and Compliance Challenges. arXiv 2025, arXiv:2503.14541. [Google Scholar] [CrossRef]
  84. Rao, V.; Sun, Y.; Kumar, M.; Mutneja, T.; Mukherjee, A.; Yang, H. LLMs Meet Finance: Fine-Tuning Foundation Models for the Open FinLLM Leaderboard. arXiv 2025, arXiv:2504.13125. [Google Scholar]
  85. Tavasoli, A.; Sharbaf, M.; Madani, S.M. Responsible Innovation: A Strategic Framework for Financial LLM Integration. arXiv 2025, arXiv:2504.02165. [Google Scholar] [CrossRef]
  86. Huang, C.; Nourian, A.; Griest, K. Hidden technical debts for fair machine learning in financial services. arXiv 2021, arXiv:2103.10510. [Google Scholar] [CrossRef]
  87. Liu, C.; Arulappan, A.; Naha, R.; Mahanti, A.; Kamruzzaman, J.; Ra, I.H. Large language models and sentiment analysis in financial markets: A review, datasets and case study. IEEE Access 2024, 12, 134041–134061. [Google Scholar] [CrossRef]
  88. Abdelsamie, M.; Wang, H. Comparative analysis of LLM-based market prediction and human expertise with sentiment analysis and machine learning integration. In Proceedings of the 2024 7th International Conference on Data Science and Information Technology (DSIT), Nanjing, China, 20–22 December 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  89. Zaremba, A.; Demir, E. ChatGPT: Unlocking the future of NLP in finance. Mod. Financ. 2023, 1, 93–98. [Google Scholar] [CrossRef]
  90. Vidgof, M.; Bachhofner, S.; Mendling, J. Large language models for business process management: Opportunities and challenges. In Proceedings of the International Conference on Business Process Management, Utrecht, The Netherlands, 11–15 September 2023; Springer Nature: Cham, Switzerland, 2023; pp. 107–123. [Google Scholar]
  91. Fahland, D.; Fournier, F.; Limonad, L.; Skarbovsky, I.; Swevels, A.J. How well can large language models explain business processes? arXiv 2024, arXiv:2401.12846. [Google Scholar] [CrossRef]
  92. Nasseri, M.; Brandtner, P.; Zimmermann, R.; Falatouri, T.; Darbanian, F.; Obinwanne, T. Applications of large language models (LLMs) in business analytics–exemplary use cases in data preparation tasks. In Proceedings of the International Conference on Human-Computer Interaction, Copenhagen, Denmark, 23 July 2023; Springer Nature: Cham, Switzerland, 2023; pp. 182–198. [Google Scholar]
  93. Ferrara, E. Should chatgpt be biased? challenges and risks of bias in large language models. arXiv 2023, arXiv:2304.03738. [Google Scholar] [CrossRef]
  94. Shen, S.; Logeswaran, L.; Lee, M.; Lee, H.; Poria, S.; Mihalcea, R. Understanding the capabilities and limitations of large language models for cultural commonsense. arXiv 2024, arXiv:2405.04655. [Google Scholar] [CrossRef]
  95. Linkon, A.A.; Shaima, M.; Sarker, M.S.U.; Badruddowza; Nabi, N.; Rana, M.N.U.; Ghosh, S.K.; Rahman, M.A.; Esa, H.; Chowdhury, F.R. Advancements and applications of generative artificial intelligence and large language models on business management: A comprehensive review. J. Comput. Sci. Technol. Stud. 2024, 6, 225–232. [Google Scholar] [CrossRef]
  96. Teubner, T.; Flath, C.M.; Weinhardt, C.; Van Der Aalst, W.; Hinz, O. Welcome to the era of chatgpt et al. the prospects of large language models. Bus. Inf. Syst. Eng. 2023, 65, 95–101. [Google Scholar] [CrossRef]
  97. Raza, M.; Jahangir, Z.; Riaz, M.B.; Saeed, M.J.; Sattar, M.A. Industrial applications of large language models. Sci. Rep. 2025, 15, 13755. [Google Scholar] [CrossRef] [PubMed]
  98. Wang, C.; Liu, Y.-Y.; Guo, T.-Z.; Li, D.-P.; He, T.; Li, Z.; Yang, Q.-W.; Wang, H.-H.; Wen, Y.-Y. Systems engineering issues for industry applications of large language model. Appl. Soft Comput. 2024, 151, 111165. [Google Scholar]
  99. Chen, S.; Piao, L.; Zang, X.; Luo, Q.; Li, J.; Yang, J.; Rong, J. Analyzing differences of highway lane-changing behavior using vehicle trajectory data. Phys. A: Stat. Mech. Its Appl. 2023, 624, 128980. [Google Scholar] [CrossRef]
  100. Chen, X.; Wei, C.; Xin, Z.; Zhao, J.; Xian, J. Ship detection under low-visibility weather interference via an ensemble generative adversarial network. J. Mar. Sci. Eng. 2023, 11, 2065. [Google Scholar] [CrossRef]
  101. Li, Y.; Zhao, H.; Jiang, H.; Pan, Y.; Liu, Z.; Wu, Z.; Shu, P.; Tian, J.; Yang, T.; Xu, S.; et al. Large language models for manufacturing. arXiv 2024, arXiv:2410.21418. [Google Scholar] [PubMed]
  102. Chkirbene, Z.; Hamila, R.; Gouissem, A.; Devrim, U. Large Language Models (LLM) in Industry: A Survey of Applications, Challenges, and Trends. In Proceedings of the 2024 IEEE 21st International Conference on Smart Communities: Improving Quality of Life Using AI, Robotics and IoT (HONET), Doha, Qatar, 3–5 December 2024; IEEE: New York, NY, USA, 2024; pp. 229–234. [Google Scholar]
Figure 1. Timeline of key developments in the evolution of large language models from 1943 to 2023.
Figure 2. Overview of the Transformer architecture based on self-attention.
Figure 3. The training process of large language models.
Figure 4. Categorization of datasets in different stages of large language model development.
Figure 5. Categorization of the challenges and limitations of LLMs across various domains of everyday life.
Figure 6. Overview of the technical challenges faced by LLMs.
Table 1. The key challenges and proposed solutions for deploying LLMs across various domains of everyday life.

Medicine
  • Challenges: lack of transparency (“black-box” outputs); privacy and data security risks; demographic bias in training data.
  • Solutions: apply explainable-AI (XAI) methods; use federated learning to keep data on-site; rebalance datasets with smart sampling.
Finance
  • Challenges: huge model size and compute cost; weak generalization across tasks; absence of domain-specific benchmarks.
  • Solutions: distil or prune models to achieve smaller footprints; fine-tune on blended market data; create finance-specific evaluation suites with periodic human review.
Business
  • Challenges: risk of data leakage; prompt-sensitive, biased answers; misalignment with existing business processes.
  • Solutions: mask sensitive data (a minimal masking sketch follows this table); design secure, templated prompts; embed human–AI checkpoints and align with IT-governance rules.
Industry
  • Challenges: hallucinations in safety-critical settings; low explainability; high energy use; integration complexity.
  • Solutions: ground model outputs in a domain knowledge base; compress models and expose them through secure middleware APIs; monitor real-time power draw.
Agriculture
  • Challenges: sparse, costly multimodal datasets; outdated or inaccurate advice; ambiguous domain terminology.
  • Solutions: collect new data via drones and sensors; adopt semi-supervised learning for low-label regimes; build a multilingual agronomy glossary.
Energy
  • Challenges: scarcity of shareable grid data; heavy GPU/TPU requirements; safety-critical reliability; cyber-security threats.
  • Solutions: share anonymized data; deploy lightweight edge models; add active security layers; retrain continually on live telemetry.
Education
  • Challenges: infrastructure and security hurdles; biased responses; over-reliance by students.
  • Solutions: filter and log content; give teachers monitoring dashboards; teach digital literacy skills; store student data locally.
Research
  • Challenges: shallow grasp of niche concepts; hallucinated citations; multilingual gaps.
  • Solutions: assemble domain-specific corpora; flag fabricated citations automatically; integrate on-the-fly translation.
Programming
  • Challenges: uneven programming-language coverage; prompt and temperature sensitivity; errors in generated code.
  • Solutions: run automatic unit tests; use error-rating frameworks; provide an LLM-assisted interactive debugger.
Media
  • Challenges: political and topical bias; fluent but source-free text; confidently stated misinformation.
  • Solutions: insert citations automatically; run live fact checking; monitor bias with ensemble counter-models.
Law
  • Challenges: potential leakage of confidential legal texts; unclear liability for errors; bias toward well-known precedents; hallucinated reasoning.
  • Solutions: mask legal entities and restrict query scope; mandate human review; consult a vetted legal knowledge base.
Art
  • Challenges: algorithmic cultural bias; copyright and authorship ambiguity; limited creative dynamism.
  • Solutions: provide copyright-verification tools; expose cultural-style controls; support human–AI co-creation workflows.
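The masking mitigation listed for the business and law rows above can be made concrete in a few lines of code. The sketch below is illustrative only: the regular expressions, placeholder tags, and example prompt are our own assumptions rather than a method taken from the surveyed works, and a production deployment would rely on vetted PII-detection or named-entity tooling.

```python
import re

# Hypothetical patterns for a minimal "mask sensitive data before prompting" sketch.
# Order matters: the IBAN pattern must run before the phone pattern so that long
# digit runs inside account numbers are not mislabelled as phone numbers.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_sensitive(text: str) -> str:
    """Replace matched spans with typed placeholders before the prompt leaves the organization."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

if __name__ == "__main__":
    prompt = ("Draft a reply to jane.doe@example.com about invoice "
              "FR7630006000011234567890189 and call +1 212 555 0100.")
    print(mask_sensitive(prompt))
    # -> Draft a reply to [EMAIL] about invoice [IBAN] and call [PHONE].
```

The masked prompt can then be sent to an external model while the original text never leaves the organization, which is the intent behind the "mask sensitive data" and "mask legal entities" entries in the table.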
Table 2. Summary of the cross-cutting technical challenges in building and operating LLMs, together with effective mitigation strategies.

Fairness and bias mitigation
  • Challenges: training-data stereotypes; under-representation of minority groups; risk of amplifying inequity.
  • Solutions: balanced sampling (see the sampling sketch after this table); data re-writing or augmentation; continuous fairness-metric monitoring; fine-tuning on minority data.
Adversarial robustness
  • Challenges: susceptibility to crafted, malicious inputs.
  • Solutions: adversarial training; input filtering and attack detection; cryptographic signatures or watermarking of trusted inputs.
Heterogeneous data integration and multimodal processing limits
  • Challenges: aligning text, images, audio, and time series into one coherent representation.
  • Solutions: training a shared embedding space; “late-fusion” ensemble methods.
Compute and resource efficiency
  • Challenges: trillion-parameter models demand large amounts of GPU capacity, memory, and energy.
Long-term memory and reasoning
  • Challenges: limited context window; difficulty with multi-step logic.
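The balanced-sampling and fairness-monitoring entries in Table 2 can be illustrated with a small sketch. The toy corpus, group names, and parity-gap heuristic below are assumptions introduced for illustration; they do not reproduce any specific method from the cited surveys.

```python
import random
from collections import Counter

# Toy labelled corpus; "group" stands for any protected or under-represented attribute.
corpus = [
    {"text": "...", "group": "group_a", "label": 1},
    {"text": "...", "group": "group_a", "label": 0},
    {"text": "...", "group": "group_a", "label": 1},
    {"text": "...", "group": "group_b", "label": 0},
]

def balanced_sample(examples, k, seed=0):
    """Draw k examples with inverse-frequency weights so rare groups are not drowned out."""
    counts = Counter(ex["group"] for ex in examples)
    weights = [1.0 / counts[ex["group"]] for ex in examples]
    rng = random.Random(seed)
    return rng.choices(examples, weights=weights, k=k)

def positive_rate_gap(examples):
    """Crude demographic-parity check: largest gap in positive-label rates across groups."""
    rates = {}
    for group in {ex["group"] for ex in examples}:
        members = [ex for ex in examples if ex["group"] == group]
        rates[group] = sum(ex["label"] for ex in members) / len(members)
    return max(rates.values()) - min(rates.values())

sample = balanced_sample(corpus, k=100)
print(Counter(ex["group"] for ex in sample))           # groups now appear in roughly equal shares
print(f"parity gap: {positive_rate_gap(sample):.2f}")  # large gaps can be flagged for review
```

In practice the same monitoring function would run continuously over model outputs rather than over a single training sample, which is what the "continuous fairness-metric monitoring" entry refers to.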
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
