1. Introduction
Large Language Models (LLMs) have emerged as one of the most significant breakthroughs in Artificial Intelligence (AI), attracting considerable interest from researchers, developers, and policymakers. Their accelerated evolution is reshaping the way language is processed, knowledge is retrieved, and tasks are automated across various domains [1,2,3,4]. To understand the extent of academic engagement in this field, we searched for the term “Large Language Models” in the Scopus database. As shown in Figure 1, the number of publications using this term increased from 2363 in 2019 to 28,339 in 2024. Scopus has also indexed over 42,000 preprints on the topic during the same period.
This growth occurred alongside advances in generative AI, multimodal systems, and the growing visibility of Agentic AI. The term Agentic AI refers to systems that pursue goals, interact with their environment, and collaborate with other agents autonomously. Although such systems have been studied for decades, recent developments have added new capabilities. Many contemporary Agentic AI platforms, such as AutoGPT, now integrate LLMs to support reasoning, communication, and decision-making [5,6,7,8,9,10]. These models enable flexible interactions with users, allowing agents to adapt to changing tasks and environments. Public interest in Agentic AI has also grown significantly. As shown in Figure 2, global search trends for the term “Agentic AI” increased rapidly between May 2024 and April 2025. This trend signals a growing shift in how researchers and the broader public perceive Agentic AI.
In the current literature, terms such as “Agentic AI”, “Agentic LLMs”, “Agentic Systems”, “LLM-based agents”, and “AI-based Agentic systems” are often used interchangeably due to the lack of clear, standardised definitions in this emerging field. In the context of our study, we use these terms broadly to refer to Agentic AI powered by LLMs, where LLMs serve as the reasoning core and agents carry out the actions. To demonstrate how Agentic AI and LLMs operate in practice, consider the following example, depicted in the simplified workflow diagram shown in Figure 3. A user submits a prompt through the system’s user interface, such as “Book a flight for me from Dubai to London during the next month that includes at least three sunny days with a maximum price of £500”. Although the request may seem simple, resolving it involves multiple steps. The system would need to identify potential dates, check weather conditions, review flight options, access the user’s calendar, and apply the required filters. Each of these tasks requires a specific type of agent. For instance, a search agent connects to Skyscanner or Google Flights APIs to retrieve real-time ticket options. A filtering agent accesses data from services like OpenWeatherMap or Tomorrow.io to detect sunny windows and filters out the results that exceed the price limit. A booking agent checks user availability through tools such as Google Calendar. These services represent the External Services Layer in the system, where APIs, tools, and external data sources provide the necessary inputs for agent actions.
The agents interact with each other in a coordinated workflow. The search agent must retrieve routes first, and then the filtering agent applies filters to the retrieved data. Once candidate flights are selected, the booking agent ensures that none of the proposed dates conflict with user appointments. An LLM remains involved throughout the process, handling the original instructions, decomposing them into subtasks, and maintaining dialogue with the user through the orchestrator layer. If no valid flights are found, the LLM can generate clarifying questions such as, “Would you prefer fewer sunny days or a higher budget?”. This helps the system adapt dynamically and maintain alignment with the user’s intent. If a suitable flight is found, the system returns the confirmed booking to the user. This example presents a simplified scenario for illustration. Real-world Agentic AI deployments often involve a broader range of agents, more complex orchestration layers, and the potential integration of multiple LLMs and external services to handle more diverse and dynamic tasks.
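To make the orchestration pattern above more concrete, the following minimal Python sketch chains a search, filtering, and booking agent in the order described, with a clarifying-question fallback standing in for the LLM’s dialogue role. All function names, stubbed data, and airport codes are illustrative assumptions, not an implementation of any particular framework.

```python
from dataclasses import dataclass

@dataclass
class Flight:
    date: str
    price_gbp: float
    sunny_days: int  # forecast sunny days around the travel date

# --- Hypothetical agents wrapping external services (real API calls not shown) ---
def search_agent(origin: str, destination: str) -> list[Flight]:
    """Would call a flight-search API (e.g., Skyscanner); stubbed here."""
    return [Flight("2025-06-10", 430.0, 4), Flight("2025-06-18", 520.0, 5)]

def filtering_agent(flights: list[Flight], max_price: float, min_sunny: int) -> list[Flight]:
    """Applies the user's weather and budget constraints."""
    return [f for f in flights if f.price_gbp <= max_price and f.sunny_days >= min_sunny]

def booking_agent(flight: Flight, calendar_busy: set[str]) -> bool:
    """Would check the user's calendar and confirm the booking; stubbed here."""
    return flight.date not in calendar_busy

# --- Orchestrator: the LLM decomposes the request; agents execute the steps ---
def orchestrate(max_price: float, min_sunny: int, calendar_busy: set[str]) -> str:
    candidates = filtering_agent(search_agent("DXB", "LON"), max_price, min_sunny)
    for flight in candidates:
        if booking_agent(flight, calendar_busy):
            return f"Booked flight on {flight.date} for £{flight.price_gbp:.0f}"
    # No valid option: the LLM would generate a clarifying question here
    return "Would you prefer fewer sunny days or a higher budget?"

print(orchestrate(max_price=500, min_sunny=3, calendar_busy={"2025-06-18"}))
```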
While LLMs provide interpretation and planning, the execution depends on Agentic AI components that interface with external APIs, maintain local memory, and track progress toward goals. Once the instruction is issued, the system performs most of the work autonomously, with minimal human intervention. Because of this close connection between Agentic AI and LLMs, Agentic AI inherits many of the limitations already associated with LLMs, including security, privacy, trust, bias, misinformation, energy consumption, value alignment, transparency, and explainability. At the same time, it introduces new challenges of its own, such as agent-level security, privacy, and trust concerns, context management and memory persistence, goal misalignment, opaque decision-making, limited human oversight, multi-agent coordination, long-term safety, and ethical and legal accountability.
LLMs and Agentic AI have both gained significant momentum in recent years, driven by a growing number of technical innovations and commercial applications. However, a notable lack of peer-reviewed research remains at the intersection of these two areas. Studies on Agentic AI have started appearing [11,12,13,14,15,16,17,18,19,20,21,22,23]. To the best of our knowledge, no existing study brings together the conceptual, technical, and ethical aspects of LLMs and Agentic AI in a single and coherent analysis. Through this research, we aim to fill that gap by identifying how the challenges of LLMs carry over into Agentic AI systems, examining their use across different domains, and outlining open research opportunities using a forward-looking roadmap.
The core contributions of our research are as follows:
A unified examination of LLMs and Agentic AI: We provide a consolidated analysis that frames LLMs and Agentic AI not as isolated innovations but as deeply interconnected systems. This integrated perspective is not covered in the existing literature and is essential to understanding their joint capabilities, risks, and future trajectories.
Cross-domain application insights demonstrating real-world impact: We provide a cross-sectoral perspective that reveals how Agentic AI and LLMs can enhance productivity, operational efficiency, autonomy, and decision-making, enabling organisations to adopt these technologies to improve their internal processes, service delivery, and user experience.
Empirical illustrations of model limitations: We present targeted small-scale experiments using GPT-4o and DeepSeek-R1 to illustrate real-world issues such as bias and cultural inclusivity. These examples provide concrete evidence of how such limitations can propagate from LLMs into Agentic AI systems, reinforcing the urgency of addressing them at both model and system levels.
A forward-looking research landscape of challenges and research directions: We introduce a structured mapping of technical and ethical challenges along with corresponding research opportunities for both LLMs and Agentic AI. This taxonomy provides a coherent framework to establish a conceptual foundation for future research and policy development to address emerging problems in this domain. Our work emphasises the need for cross-disciplinary collaboration to address multifaceted issues such as accountability, long-term safety, and sustainable deployment.
The rest of this paper is structured as follows. Section 2 describes our research scope and approach. Section 3 reviews application domains. Section 4 and Section 5 discuss core challenges and future research needs for LLMs and Agentic AI, respectively. Figure 4 presents a high-level mapping of the research landscape, illustrating how Agentic AI and LLMs are applied in various sectors, the challenges they encounter, and the research directions that emerge from these challenges. The diagram organises the content into three main layers: application domains, technical and ethical challenges, and corresponding areas for future investigation. The challenges and open research opportunities are organised under clearly defined categories that reflect the emerging needs of Agentic AI and LLMs. For example, the increasing computational demands and memory limitations of autonomous agents highlight the need for developing more efficient and context-sensitive architectures, i.e., Memory-Aware Agents. This paper concludes with a summary of the findings and reflections on future work in Section 6.
2. Research Scope and Method
This study adopts a structured scoping review to explore the breadth of emerging developments in Agentic AI and LLMs, a field in which much of the literature is relatively recent, fast-evolving, and often not yet widely indexed in traditional academic databases. To ensure coverage of established research and emerging perspectives, we followed a four-phase methodology, as shown in Figure 5. Although not formally a systematic review, our research method was carefully planned and executed to maintain rigour and transparency in line with recognised review protocols. Particular attention was given to clearly defining the search strategy, setting inclusion and quality assessment criteria, and documenting the screening process to ensure traceability and transparency throughout the review.
The first phase involved developing a targeted search strategy using keywords such as Agentic AI, Large Language Models, Agentic AI challenges, LLM challenges, Agentic systems, and varying combinations of these and other related terms. Searches were conducted across multiple platforms, including Google Scholar, Scopus, and arXiv. Rather than limiting the scope only to peer-reviewed or high-impact journals, the strategy was designed to also capture early-stage and preprint contributions, which often provide valuable insights in emerging fields. At this phase, we applied initial inclusion criteria based on the title, abstract, and metadata to retain only English-language publications from 2020 to 2025 and eliminate irrelevant documents. This process yielded 327 documents: 142 from Google Scholar, 110 from Scopus, and 75 from arXiv.
The second phase focused on extensive scrutiny, where detailed inclusion and quality assessment criteria were applied to the full text of each document. Duplicate publications were removed using Zotero, ensuring that the dataset moving forward contained unique and relevant documents. Studies were retained if they focused on Agentic AI, LLMs, or their intersection. Sources also needed to demonstrate sufficient depth of analysis. Papers passing the inclusion criteria were further assessed for quality: we retained only those that followed a sound research process, demonstrated methodological substance, and addressed key concepts, applications, challenges, or research directions of Agentic AI and LLMs. After this detailed screening, 71 papers were retained: 28 from Google Scholar, 25 from Scopus, and 18 from arXiv. We then carried out a manual search to identify additional relevant material, including four carefully selected websites. This added 13 more sources. The final set of 84 reference materials, combining peer-reviewed articles, conference proceedings, preprints, and websites, formed the complete evidence base for conducting this review and developing the analysis presented throughout this manuscript.
Figure 6 presents the distribution of the final 84 sources by type. Journal articles account for the largest share (48%), followed by preprints (27%), conference papers (20%), and selected websites (5%). The higher proportion of journal articles may reflect researchers’ focus on peer-reviewed and technically substantive sources for in-depth thematic analysis. The relatively high number of preprints suggests that researchers are actively sharing early findings through platforms such as arXiv, which aligns with the fast-evolving nature of this area. The smaller share of conference papers may be due to the recent emergence of Agentic AI as a distinct research focus, meaning that relevant papers may not yet have appeared in proceedings or are still under review in upcoming events.
Figure 7 shows the distribution of the selected studies over time, highlighting a sharp increase in research activity related to Agentic AI and LLMs from 2023 to 2025. This trend reflects the rapid emergence of the field, as these technologies become more integrated into real-world systems.
To further analyse conceptual patterns within the reviewed literature, we manually added thematic tags to each document in Zotero by identifying relevant keywords from titles, abstracts, and, where needed, full texts. These tags reflected the core concepts of each paper and were exported for analysis using VOSviewer 1.6.20, which computed the total link strength of frequently co-occurring terms. To improve consistency and avoid fragmentation, closely related keywords such as “LLMs”, “LLM”, and “large language model” were treated as a single concept under the unified label “large language models” using a synonym mapping file. The results were then visualised in Microsoft Excel to produce a funnel-style bar chart, shown in Figure 8. As expected, terms such as Large Language Models, Agentic AI, and artificial intelligence emerged as central concepts, consistent with this paper’s scope. More notably, explainable AI, transparency, misinformation, and ethics appeared with medium link strength, reflecting increasing research interest in interpretability and trust. In contrast, keywords such as sustainability, security, and particularly NLP and safety had weaker link strength, pointing to the least explored but important areas. A further observation is the higher strength of deep learning compared to Machine Learning (ML), indicating a shift in Agentic AI and LLM research toward advanced deep neural architectures rather than traditional ML algorithms.
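To illustrate the keyword normalisation step described above, the short sketch below shows how closely related tags can be merged under a unified label before co-occurrence analysis. The tag list and mapping are invented examples; in the actual workflow, the mapping was supplied to VOSviewer as a synonym file rather than applied in code.

```python
# Hypothetical example of normalising exported Zotero tags before co-occurrence analysis
SYNONYMS = {
    "llm": "large language models",
    "llms": "large language models",
    "large language model": "large language models",
    "ml": "machine learning",
}

def normalise(tags: list[str]) -> list[str]:
    """Map each tag to its unified label (lower-cased), leaving other tags unchanged."""
    return [SYNONYMS.get(t.strip().lower(), t.strip().lower()) for t in tags]

doc_tags = ["LLMs", "Agentic AI", "large language model", "explainable AI"]
print(normalise(doc_tags))
# ['large language models', 'agentic ai', 'large language models', 'explainable ai']
```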
Key information was extracted from each source in the third phase by examining several thematic aspects. These included the relationship between Agentic AI and LLMs, applications of Agentic AI powered by LLMs, and core challenges associated with LLMs. We also identified how LLM challenges affect Agentic AI, along with issues that are specific to Agentic AI systems. Finally, we noted future research directions separately for LLMs and Agentic AI.
In the final phase, we synthesised and thematically analysed the extracted data. Applications were grouped into sectors such as education, healthcare, cybersecurity, autonomous systems, e-commerce, and customer service. Challenges were divided into two categories: those arising from LLM design and limitations, and those unique to Agentic AI. We then mapped the challenges to future research directions emerging from both categories. These insights are organised across three core sections of the paper: applications (Section 3), LLM-related challenges and future directions (Section 4), and Agentic AI-specific challenges and directions (Section 5).
3. Applications of Agentic AI Powered by LLMs
Agentic AI represents a significant evolution in the design and deployment of intelligent systems, which are capable not only of following instructions but also of pursuing goals, adapting to changing contexts, and exhibiting a degree of autonomy in decision making [24]. The framework of Agentic AI, encompassing perception, reasoning, action, learning, and collaboration, highlights a closed-loop system where continuous feedback and adaptation are fundamental. This cyclical process, significantly enabled by the cognitive capabilities of LLMs, drives the agent’s autonomy and goal-directed behaviour, fundamentally differentiating it from simpler, more traditional AI systems. By leveraging this feedback mechanism, agents can be autonomous, adaptive, and persistent, which sets them apart from conventional rule-based AI. In this section, drawing from the academic literature and real-world examples, we synthesise recent applications and early-stage deployments of Agentic AI systems across key domains, where LLMs serve as the reasoning core.
3.1. Education
Agentic AI is reshaping the learning experience by providing personalised support meticulously tailored to individual student needs. These systems are capable of understanding learners’ goals and adapting instructional strategies in real-time [25]. In contrast to conventional educational tools, Agentic AI applications enhance learning by offering personalised feedback with multiple explanations of complex concepts and encouraging learners to engage in metacognitive reflection. For example, recent deployments, such as Khan Academy’s Khanmigo [26], use GPT-4 to deliver Socratic-style tutoring, intentionally avoiding giving direct answers. Instead, it prompts students to articulate reasoning, make predictions, and self-correct. Agentic AI can incorporate external tools, such as search engines and calculators, to address limitations like knowledge cut-offs and calculation inefficiencies, thereby facilitating real-time data retrieval and the execution of complex, multi-step tasks. Recent findings demonstrate that Agentic AI can support students with diverse needs, including those with learning disabilities or language barriers, by tailoring instruction based on natural language input and behavioural cues [27]. While many implementations are still at the pilot stage, their potential for scalable, inclusive education delivery is increasingly evident. The scalable nature of Agentic AI presents a promising solution to achieve global education equity. In regions where qualified teachers are limited, Agentic AI can serve as a scalable, multilingual, and culturally adaptable educational partner to improve the student learning experience.
3.2. Healthcare
The rapid advancement of AI is transforming healthcare worldwide, positioning digital technologies as an integral part of modern medical systems. Agentic AI powered by LLMs is reshaping overall diagnostic processes and patient engagement. It is also significantly enhancing clinical efficiency and decision support by streamlining administrative burdens, automating approvals, and updating care plans in real-time. Agentic AI systems analyse vast patient records, summarise complex case histories, and provide preliminary diagnoses based on symptoms, leading to measurable improvements in operational efficiency and patient care coordination [10]. In addition, modern diagnostic workflows now rely heavily on AI for medical image analysis and predictive analytics. The growing volume of imaging data and the complexity of Electronic Health Records (EHRs) can overwhelm clinicians, but LLMs help alleviate this burden, reducing cognitive load and enhancing patient safety [28,29]. Recent advancements in Multimodal Large Language Models (MLLMs) have significantly expanded the scope of Agentic AI in healthcare, particularly in the realm of medical imaging and diagnostic reasoning. A prime example is Med-Flamingo [30], a few-shot multimodal AI agent designed to assist clinicians in diagnostic decision-making by integrating 2D medical images with textual inputs such as patient history or clinical notes. By leveraging the joint understanding of visual and linguistic modalities, Med-Flamingo demonstrates up to a 20% improvement in diagnostic accuracy compared to prior models. Such systems can engage in continuous monitoring, behaviour nudging, and proactive intervention, helping patients adhere to treatment regimens or adopt healthier habits. The conversational and contextual reasoning capabilities of LLMs allow for a more human-like interface, which can reduce barriers to care and enhance patient trust and satisfaction. Although Agentic AI offers numerous benefits, maintaining transparency, robustness, and accountability remains a pressing concern, particularly as these systems assume more autonomous roles in medical decision-making. Full-scale adoption is still a longer-term goal due to ongoing concerns about accountability and transparency in autonomous medical decisions.
3.3. Cybersecurity
As cyber threats intensify in scale, speed, and complexity, the demand for innovative adaptive defence mechanisms has driven the adoption of Agentic AI systems in cybersecurity. These systems extend beyond static threat detection by enabling autonomous decision-making and real-time responsiveness. The impact of Agentic AI on cybersecurity is multifaceted, profoundly enhancing various aspects of security operations. Firstly, in real-time threat detection and prevention, Agentic AI systems leverage their goal-driven nature and continuous learning capabilities to identify and neutralise emerging threats with unprecedented speed and accuracy. During a security incident, the ability of agents to rapidly assess the situation, prioritise risks, and execute containment and remediation actions, such as revoking credentials or patching vulnerabilities, can drastically minimise operational disruption and data exfiltration. A recent study [31] argues that AI will soon shift from chatbots to agent-driven cybersecurity, enhancing threat detection, autonomous responses, scalability, and cyber hygiene. While Agentic AI holds great promise for cybersecurity, it also introduces heightened security concerns, as existing threats persist and new vulnerabilities emerge [32]. Like earlier innovations described in [33], these systems operate beyond controlled environments, interacting with multiple platforms and external data sources, thereby broadening the attack surface. This exposure raises the risk of unauthorised access, data leaks, and system manipulation. Leading cybersecurity firms, such as ReliaQuest and CrowdStrike, are already deploying Agentic AI to automate Security Operations Centre (SOC) tasks, enhance accuracy, and accelerate response times. This technology also offers proactive vulnerability protection throughout the software development lifecycle. Although these applications are operational, many Agentic AI features for cybersecurity are still in development and require robust governance and risk mitigation strategies due to expanded attack surfaces and data exposure risks.
3.4. Autonomous Vehicles
Agentic AI constitutes a qualitative leap in AI development, distinguished by its ability to set and pursue complex goals in unpredictable environments while autonomously managing its resources, much like human agents. The trajectory toward fully Autonomous Vehicles (AVs), particularly at Society of Automotive Engineers (SAE) Level 5, mandates a paradigm shift in AI from mere automation to genuine autonomy, reasoning, and adaptive decision-making [34]. Agentic AI systems represent this critical advancement, empowering self-driving cars to perceive complex environments, plan intricate manoeuvres, execute actions, and continuously learn and adapt without constant human oversight [35]. Early AVs, often referred to as co-pilot systems, relied on human drivers to take control in difficult situations. As technology has advanced, full AVs, similar to autopilot systems, are now capable of navigating complex routes independently, without the need for human input. Central to the efficacy of Agentic AI in AVs is the pivotal role of LLMs, which serve as the cognitive engine for advanced reasoning and contextual understanding. LLMs move beyond simple text generation to enable sophisticated decision-making, allowing Agentic systems to interpret context and reason, facilitate human–vehicle interaction, support adaptive learning, and, last but not least, enhance interpretability [36]. Despite the transformative potential of Agentic AI for AVs, this technology is still at the testing or pre-commercial stage. The role of LLMs as reasoning engines marks a shift toward fully autonomous decision-making, but their deployment is accompanied by significant challenges. A primary concern is ensuring consistent reliability and predictability, particularly in safety-critical scenarios, where even minor deviations in behaviour can lead to serious consequences [37]. Concerns also extend to data privacy and security due to the extensive real-time data processing involved, as well as to the opaque nature of complex AI models, particularly LLMs, which hinders transparency and accountability.
3.5. E-Commerce
In the evolving and competitive landscape of modern retail, Agentic AI is catalysing a significant transformation in e-commerce by enabling hyper-personalisation, enhancing operational efficiency, and supporting autonomous decision-making to address increasing customer expectations and complex operational challenges. Agentic AI systems in e-commerce function as intelligent, goal-oriented agents that adapt to changing market conditions, manage complex tasks, and continuously learn to optimise business operations with minimal human intervention [38]. Moreover, in the realm of e-commerce, Agentic AI applications are proving instrumental in streamlining labour-intensive processes and elevating the customer experience. LLM-powered chatbots and AI shopping assistants provide real-time support for a broad range of customer inquiries, improving satisfaction by minimising wait times and allowing human staff to focus on more complex tasks [39]. In addition, LLMs enhance product discovery by intelligently interpreting user behaviour and preferences, enabling the presentation of highly relevant items and simplifying navigation, which contributes to a more seamless and scalable shopping experience, even as product offerings continue to grow. These features are already in commercial use on many retail platforms, marking this as one of the more mature domains for the deployment of Agentic AI. While the benefits are substantial, there are notable concerns to consider when deploying integrated agents for automation, particularly in handling vast amounts of sensitive customer data, where security and privacy risks arise. Agentic AI processes personal preferences, browsing history, purchase patterns, and sometimes even payment information, making these systems attractive targets for cybercriminals. Despite these challenges, the advantages of implementing Agentic AI far outweigh the limitations, and these concerns can be effectively addressed through strong governance frameworks, transparent model design, continuous performance monitoring, and clear human oversight.
3.6. Customer Service
The rise of Agentic AI is revolutionising customer service. It is shifting the focus from simply reacting to customer issues to powering intelligent virtual assistants that can handle large volumes of customer interactions [40]. Unlike conventional rule-based systems, Agentic AI supported by LLMs can comprehend nuanced language, adapt to user intent, and engage in natural, fluid dialogue. These systems can handle data autonomously and cut down response times, which helps to reallocate human resources strategically. Furthermore, multinational e-commerce companies utilise these systems to connect customer-facing agents with CRM platforms (e.g., Salesforce) and logistics APIs, thereby resolving common customer inquiries, such as “Where is my order?”, through contextual retrieval and response generation. Internally, the same architecture supports enterprise search, enabling employees to retrieve and summarise content such as meeting notes, sales materials, or policy documents using vector-based search tools like Pinecone or Elasticsearch. Another vital element of Agentic AI is its capacity for continuous learning. As these systems interact with customers, they gather data that helps improve their performance over time. This not only ensures the delivery of up-to-date information but also allows for the anticipation of customer needs, leading to proactive service strategies. Many global firms have already deployed these systems at scale, especially in high-volume service environments. However, there are also issues of transparency and explainability, as LLMs often operate as “black boxes”, making it difficult to audit decisions or trace errors. Despite these challenges, the benefits of deploying Agentic AI in customer service, such as improved efficiency, rapid response times, 24/7 availability, and scalable personalisation, far outweigh the drawbacks. With proper oversight, ethical safeguards, and human-in-the-loop mechanisms, these systems can deliver substantial value while minimising potential risks, making them a powerful asset in modern customer engagement strategies.
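As a minimal illustration of the vector-based enterprise search pattern mentioned above, the sketch below retrieves the document most similar to a query using simple bag-of-words vectors and cosine similarity. Production deployments would use learned embeddings and a vector database such as Pinecone or Elasticsearch; the document snippets here are invented.

```python
import numpy as np

# Minimal vector-retrieval sketch; bag-of-words vectors stand in for learned embeddings.
documents = [
    "Meeting notes: Q3 sales review and targets",
    "Policy document: data retention and privacy rules",
    "Order FAQ: where is my order and what are the delivery times",
]

vocab = sorted({w.lower().strip(":?") for d in documents for w in d.split()})

def vectorise(text: str) -> np.ndarray:
    """Turn text into a unit-norm term-count vector over the shared vocabulary."""
    words = [w.lower().strip(":?") for w in text.split()]
    v = np.array([words.count(t) for t in vocab], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

index = np.stack([vectorise(d) for d in documents])

def retrieve(query: str, k: int = 1) -> list[str]:
    scores = index @ vectorise(query)          # cosine similarity of unit vectors
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

# The retrieved passage would then be handed to the LLM to generate a grounded answer.
print(retrieve("Where is my order?"))
```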
4. LLMs: Challenges and Research Directions Impacting Agentic AI
LLMs have brought about notable advantages, but their effectiveness and safe adoption are consistently being questioned due to a range of challenges. While issues such as bias, misinformation, and data privacy have received considerable attention, new problems continue to emerge as these models evolve and expand. In this section, we examine five of the most pressing and foreseeable concerns shaping the future of LLM research.
Table 1 summarises these challenges alongside the corresponding open research opportunities. Addressing these concerns is essential to support the responsible development and broader adoption of LLMs. Importantly, many of these challenges are not limited to LLMs alone. As LLMs increasingly serve as the reasoning core within Agentic AI systems, their limitations can propagate and intensify within more autonomous multi-agent settings. This makes it critical to examine these issues not only from a model-centric perspective but also through the lens of their integration into broader Agentic architectures.
In this section, we also discuss the results of small-scale demonstrations conducted using GPT-4o and DeepSeek-R1 to observe how LLMs respond to prompts involving bias and cultural inclusivity. We used three varied prompts for each problem with both models across three different days. To simulate realistic usage by non-technical users, we accessed the models via their public web interfaces rather than APIs or plug-ins. We did not adjust any system parameters, and all responses were generated using default settings, as the web interface does not allow changes to key variables such as temperature, top-p, and max tokens. These demonstrations reflect typical use cases where users may not have expertise in prompt design, and were not formal experiments but illustrative examples to support the thematic discussion. They are also not intended to be rigorous or fully reproducible as the models evolve, but instead offer practical insight into how such issues may emerge in everyday interactions with LLMs, especially when deployed in Agentic AI settings.
4.1. Security, Privacy, and Trust Concerns: Toward Trustworthy LLMs
LLMs have the potential to be used in healthcare, education, finance, and other critical domains. Therefore, it is crucial to determine how these systems handle privacy, data security, and user trust. There is often little clarity around what happens to user data during and after interactions, or how these systems are protected from misuse. In contrast to traditional software, these models rarely provide details about how their training data was gathered, whether personal information was removed, or what measures are in place to prevent the models from storing or repeating sensitive content. Research has shown that moderately sized models can memorise and leak parts of their training data [41,42,43,44,45,46]. For instance, Carlini et al. (2020) [47] demonstrated that it is possible to extract memorised sequences, including sensitive personal data, from GPT-2-like models using targeted queries. It remains unclear whether models such as GPT-4 or DeepSeek-R1 are equally vulnerable, and this uncertainty stems from a lack of independent auditing. Without external scrutiny, it is difficult to verify claims regarding data minimisation, consent management, or the effectiveness of safety filters in production environments.
Moreover, LLMs are vulnerable to various security threats, including prompt injection, model evasion, and data poisoning. Attackers can manipulate training data or construct adversarial prompts designed to bypass safety mechanisms and elicit harmful or policy-violating outputs. While companies like OpenAI have released safety documentation (e.g., the GPT-4 System Card) acknowledging these vulnerabilities, the mitigation strategies remain preliminary and are still undergoing evaluation [48]. Regulatory frameworks are not yet fully equipped to govern these emerging threats. For instance, the European Union’s draft AI Act outlines risk-based requirements for AI systems, but it does not provide specific protocols for LLM auditing, privacy assurance, or post-deployment monitoring [49]. Similarly, the US NIST AI Risk Management Framework emphasises trust, accountability, and transparency but acknowledges that LLM-specific standards are still under development [50]. In high-risk domains, this regulatory gap raises further questions about responsibility, oversight, and user protection. These risks are especially significant when LLMs are integrated into autonomous Agentic systems, where misused or compromised outputs can affect multi-agent coordination, decision logic, or user-facing interactions without immediate human oversight.
To address these concerns, future research must prioritise the development of trustworthy LLM architectures, including the integration of differential privacy, secure multi-party computation, and red-teaming protocols tailored to language generation. Trustworthy deployment also requires transparent documentation of model behaviour, training sources, and data handling practices. Until such standards are adopted and enforced, the reliability and ethical acceptance of LLMs in critical applications will remain limited.
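As one concrete example of the directions listed above, the sketch below shows the core step of differentially private training: per-example gradient clipping followed by Gaussian noise. It is a toy illustration with assumed dimensions; real systems would rely on a dedicated library (e.g., Opacus) together with formal privacy accounting.

```python
import numpy as np

def dp_sgd_step(per_example_grads: np.ndarray, clip_norm: float,
                noise_multiplier: float, rng: np.random.Generator) -> np.ndarray:
    """Core DP-SGD update: clip each example's gradient, average, then add Gaussian noise.

    A minimal sketch only; production training uses a DP library plus a privacy
    accountant to track the cumulative (epsilon, delta) budget.
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    mean_grad = clipped.mean(axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(per_example_grads),
                       size=mean_grad.shape)
    return mean_grad + noise

rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 4))          # 8 examples, 4 parameters (toy dimensions)
print(dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```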
4.2. Misinformation, Misuse, and Bias: Toward Ethical LLMs
As the use of LLMs continues to grow across public and private sectors, concerns about misinformation, misuse, and bias have become more prominent [51,52,53,54,55,56,57,58,59,60]. These issues impact the reliability, fairness, and accountability of AI systems when they are used in information-critical domains. Misinformation refers to the generation of incorrect or fabricated content by language models. It can occur due to limitations in the model’s design or flaws in its training data, such as the inclusion of unreliable or biased information. These issues become more serious when a model presents something as factual even though it lacks a reliable foundation. Such deceptive results can pose a high risk if users follow the model’s suggestions without further verification. There is also concern about the deliberate misuse of LLMs, as they can be used to produce misleading content, generate spam, and promote political propaganda. Since these models are designed to generate open-ended replies, it is difficult to completely prevent misuse without compromising their flexibility. Currently, there are no well-defined frameworks or governance policies in place to monitor and manage these risks.
Bias is another key concern because LLMs tend to reflect the patterns and perspectives present in their training data, which can lead to responses that reinforce stereotypes. For instance, models developed in different regions, such as ChatGPT (USA) and DeepSeek (China), can produce varying responses to sensitive topics like human rights and international law. To demonstrate this limitation, we tested each model using three slightly rephrased prompts on the same topic, the treatment of religious minorities in China, to assess the reproducibility of outcomes across different instances. This experiment was conducted on three different days (23 March 2025, 30 July 2025, and 1 August 2025) using GPT-4o and DeepSeek-R1.
Figure 9, Figure 10, Figure 11, Figure 12, Figure 13 and Figure 14 illustrate this issue.
The response from both models significantly varies. The results consistently show that ChatGPT acknowledges state-imposed restrictions, surveillance, and repression, particularly against Uyghur Muslims, while DeepSeek responses repeatedly emphasise equal rights and social harmony, omitting references to state control or human rights concerns. This divergence is consistent across all prompt variations and highlights how model responses may be shaped by the sociopolitical context of their origin.
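Our demonstrations were run through the models’ public web interfaces, but readers who wish to script a comparable prompt-variation probe via APIs could follow a pattern like the sketch below. The endpoint, model identifiers, placeholder keys, and prompt wordings are assumptions for illustration only.

```python
from openai import OpenAI  # DeepSeek also exposes an OpenAI-compatible API

# Keys, base URL, and model names below are placeholders/assumptions for illustration.
MODELS = {
    "gpt-4o": (OpenAI(api_key="OPENAI_KEY"), "gpt-4o"),
    "deepseek-r1": (OpenAI(api_key="DEEPSEEK_KEY", base_url="https://api.deepseek.com"),
                    "deepseek-reasoner"),
}

PROMPTS = [  # three rephrasings of the same question (illustrative wordings)
    "How are religious minorities treated in China?",
    "Describe the situation of religious minority groups in China.",
    "What is the treatment of minority faith communities in China?",
]

def probe() -> dict[tuple[str, int], str]:
    """Send each rephrased prompt to each model and collect the raw responses."""
    results = {}
    for label, (client, model_name) in MODELS.items():
        for i, prompt in enumerate(PROMPTS):
            reply = client.chat.completions.create(
                model=model_name,
                messages=[{"role": "user", "content": prompt}],
            )
            results[(label, i)] = reply.choices[0].message.content
    return results
```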
The risks of embedded bias are also concerning in decision-making contexts. Many organisations have begun integrating LLMs into their recruitment pipelines to assist with resume screening and applicant selection. While such automation offers efficiency, studies have revealed significant racial and gender biases in these models. A study by researchers at the University of Washington found that state-of-the-art LLMs favoured names associated with white individuals over those associated with Black individuals in 85% of test cases. Names linked to Black males were never ranked higher than those associated with white males. Similarly, female-associated names were selected in only 11% of scenarios [61]. These findings raise questions about fairness, transparency, and accountability in AI-driven decision systems.
Currently, most models lack internal mechanisms to verify that generated content is accurate, unbiased, or contextually appropriate. Nor do unified regulatory standards exist to ensure the responsible development and deployment of LLMs. When embedded within Agentic architectures, these biases can propagate across decision-making workflows, leading to systemic trust issues or compounding misinformation at the system level. These gaps underscore the pressing need for ongoing research into ethical model development, encompassing techniques for mitigating bias, detecting misinformation, and preventing misuse. Progress in this area will also require interdisciplinary collaboration. The development of ethical LLMs must go beyond technical optimisation to include perspectives from law, philosophy, and social science. Only by incorporating fairness, accountability, and transparency into the design and deployment of these systems can LLMs and Agentic AI be trusted to serve the diverse needs of society in a responsible manner.
4.3. Energy Demands and Sustainability Issues: Toward Sustainable LLMs
The impressive performance of LLMs comes with a growing environmental cost. As LLMs are scaled up and deployed across a wide range of services, the energy required to support training and inference has become a serious concern [62,63]. Although much of the focus has been on the energy required to train large models, recent studies suggest that the energy used during inference may make up the largest share of a model’s total energy consumption over its lifetime [64,65]. As these models are increasingly used in everyday applications such as search engines and customer service tools, the combined energy demand of repeated inference has become a significant challenge in the pursuit of more sustainable AI systems [66]. For instance, training a model like GPT-3, which contains 175 billion parameters, is estimated to have consumed around 1287 megawatt-hours of electricity. If powered by a typical energy mix, this would have resulted in approximately 552 metric tons of carbon dioxide emissions [67]. However, once such a model is deployed at scale, the energy required to support daily interactions often surpasses the original training cost.
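As a quick sanity check, the cited training figure and emissions estimate are mutually consistent under a grid carbon intensity of roughly 0.43 kg CO2 per kWh; this "typical energy mix" value is an assumption of the calculation below, not taken from the cited source.

```python
training_energy_mwh = 1287            # reported estimate for GPT-3 training
grid_intensity_kg_per_kwh = 0.429     # assumed typical energy mix (kg CO2 per kWh)

emissions_tonnes = training_energy_mwh * 1000 * grid_intensity_kg_per_kwh / 1000
print(f"{emissions_tonnes:.0f} metric tons CO2")   # ≈ 552, matching the cited figure
```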
In response to growing concerns about energy consumption, researchers have developed various methods to enhance the efficiency of large models. Techniques such as model pruning, knowledge distillation, and quantisation are designed to reduce computational demands while maintaining acceptable levels of accuracy. Alongside these, there is increasing interest in hardware-level improvements, including the adoption of more efficient processors and accelerators, as well as software-level strategies that minimise energy waste during idle periods. While these innovations offer promise, they have not yet been widely adopted in industry. To support real progress, energy reporting and model efficiency need to move from being optional practices to standard requirements. Developers should routinely share information about energy use during training and inference, along with details about model size and hardware used. Without this level of transparency, it is difficult to evaluate or compare the environmental costs of different models in a meaningful way. More broadly, sustainable AI is not only a technical challenge; it also raises important ethical and policy questions about how technology should be used in a world facing significant environmental constraints. As LLMs become a core component of Agentic systems that operate continuously and make autonomous decisions, the overall energy footprint is expected to continue growing. This is due in part to the extra demands of coordinating multiple agents, managing long-term memory access, and supporting real-time decision-making at scale. This raises new sustainability concerns that must be considered at the system design level, not just the model level. As language models continue to scale, researchers and developers must embed sustainability into the core design, evaluation, and deployment practices of future AI systems.
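Of the efficiency techniques mentioned above, quantisation is the simplest to illustrate. The toy sketch below applies symmetric 8-bit quantisation to a small weight matrix, trading a small amount of precision for a fourfold reduction in storage; it is not a production quantisation scheme.

```python
import numpy as np

def quantise_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantisation: store int8 values plus one float scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, s = quantise_int8(w)
print("max abs error:", np.abs(w - dequantise(q, s)).max())          # small vs weight range
print("storage: 32-bit ->", w.nbytes, "bytes; 8-bit ->", q.nbytes, "bytes")
```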
4.4. Value Misalignment and Exclusion: Toward Human-Centric LLMs
As LLMs continue to advance in autonomy and decision-making, it becomes increasingly essential for them to align with human values, societal expectations, and ethical principles [68,69,70,71]. This alignment is particularly critical in high-impact domains, such as healthcare, recruitment, and law enforcement, where errors or biases can have serious consequences. Value alignment refers to the extent to which model outputs are consistent with widely accepted moral and ethical standards. However, most current alignment techniques, including reinforcement learning from human feedback, often rely on annotator pools drawn from limited cultural or ideological backgrounds. As a result, these models may reflect values that are acceptable in one region but inappropriate in another, reinforcing a narrow worldview rather than adapting to global diversity.
Closely related but distinct is the issue of inclusivity. This concerns the model’s capacity to represent and respond equitably to the full range of identities, cultures, and belief systems found in society. Despite growing awareness, many existing systems default to dominant perspectives. To explore how inclusiveness and cultural representation are handled by LLMs, we tested ChatGPT (GPT-4o) and DeepSeek-R1 using three simple prompts asking for personal names representing people of Pakistani origin: first generic names, then names from varying backgrounds, and finally names from different ethnic or cultural groups. This experiment was conducted on three different days: 29 March 2025; 30 July 2025; and 1 August 2025.
Figure 15, Figure 16, Figure 17, Figure 18, Figure 19 and Figure 20 show that across all three prompts, ChatGPT’s responses consistently defaulted to names reflecting the religious majority (Muslims), even when explicitly asked for names from different cultural or ethnic groups. In contrast, DeepSeek responses demonstrated slightly greater inclusion, occasionally incorporating names that represent religious and ethnic minorities. However, neither model achieved comprehensive coverage in terms of sociocultural diversity. The variation between their outputs illustrates how embedded assumptions can shape results even for simple, seemingly neutral queries. While Muslims form the majority population in Pakistan, the country is also home to various other major communities, such as Christians, Hindus, and Sikhs. The exclusion of non-Muslim names in the absence of specific instruction demonstrates a limitation in the model’s cultural awareness and inclusivity.
These demonstrations are not intended as a rigorous or statistically significant evaluation of model behaviour but rather as simplified prompts that illustrate how default assumptions can manifest when cultural ambiguity is not explicitly addressed. While more refined prompting could potentially produce more inclusive results, it is important to recognise that many users of LLMs are not prompt engineering experts. As these models become integrated into tools used by non-technical users, it becomes essential for them to proactively detect ambiguity and seek clarification. Placing the burden solely on users to adjust prompts may constrain inclusivity and reinforce dominant perspectives. Future development should focus on creating human-centric LLMs that support human-in-the-loop clarification, adaptive reasoning, and culturally sensitive responses by default.
Human-centric LLMs should not only make their outputs understandable but also invite users to question, refine, and override them. Such models must be designed with mechanisms that proactively identify cultural ambiguity, ask clarifying questions, and defer to human judgment when required. Without such capabilities, it would be the user’s responsibility to rectify the model results, which may lead to error-prone outcomes. When it comes to Agentic AI settings, such value misalignments may not only affect individual outputs but also the overall system. These limitations can also influence the behaviour of entire autonomous workflows if the agents rely on LLM-derived results and interpretations. Therefore, there is a need to build LLMs that can respond to social and cultural contexts with greater sensitivity. This involves bringing more diverse voices into the training and evaluation processes, improving alignment techniques to consider different ethical perspectives, and designing interactive features that allow users to question as well as adjust responses. Doing so is essential if LLMs are to be used responsibly and effectively in a globally diverse world.
4.5. Explainability and Transparency Gaps: Toward Transparent LLMs
Despite ongoing research efforts [72,73,74], transparency remains a significant concern in LLMs because these models provide little insight into how they arrive at their outputs. Turpin et al. (2023) [75] found that explanations produced by LLMs often fail to reflect the actual reasoning used by the models, as they tend to produce responses that sound convincing without revealing how the output was generated. This omission is highly concerning when LLMs are used in fields such as education, healthcare, banking, and law, where a model’s reasoning process must be transparent and available for scrutiny. In another study, Zhao et al. (2024) [76] provided a broader review of explainability techniques applied to LLMs. They discussed methods including prompt engineering, attribution analysis, and post hoc rationales. While these approaches offer some insight into a model’s behaviour, most of them fall short in consistency and reliability. In many cases, explanations vary significantly with minor changes in prompts, and there is still no standard approach for determining whether an explanation accurately represents the model’s internal reasoning. These shortcomings pose serious challenges for real-world LLM applications where transparency is tied to accountability. A language model that can justify its result convincingly but inaccurately can create false confidence among users, making it harder to detect fallacies and biased results.
The explainability issue is mainly inherited from the opaque nature of LLMs. While many machine learning models suffer from explainability limitations, the problem is particularly acute in large-scale deep neural networks. These models operate as black boxes with internal mechanisms that are difficult to inspect, interpret, and audit, which makes it challenging for users to understand how decisions are reached or to verify whether they are based on relevant and appropriate information. The lack of interpretability not only limits accountability but also increases the risk of hidden biases and logical errors propagating unchecked. Moreover, recent work by Shojaee et al. (2025) [77] demonstrates that LLMs may appear to solve complex logical problems not through true reasoning, but by statistically predicting plausible word sequences. This distinction between surface-level fluency and actual logical understanding reinforces the need for reliable explanation mechanisms, particularly when LLMs are used in tasks that demand transparency and trustworthiness. These findings align with our argument that improved interpretability is essential in high-stakes or Agentic AI contexts.
Improving transparency in LLMs requires not only better explanation techniques but also mechanisms that enable users to trace how outputs were produced and understand which factors had the most significant impact. There is also a need to develop robust standards for evaluating the quality and fidelity of explanations. Without these changes, LLMs will struggle to meet the required trust and accountability standards. Addressing these challenges by creating transparent LLMs is vital for building AI systems that are not only accurate but also transparent in their operation. As discussed further in Section 5, this challenge becomes more critical in an Agentic AI context because opaque LLM reasoning can lead to a chain of unexplained actions across multiple agents, making post hoc auditing and accountability challenging.
5. Agentic AI: Core Challenges and Research Directions
Due to their enhanced capability and autonomy, Agentic AI systems also present a new set of challenges that extend beyond those typically associated with LLMs. These systems must not only generate appropriate outputs but also take independent action, maintain long-term context, and interact with humans and other agents in dynamic environments. Identifying and addressing these challenges is essential for building reliable, safe, and ethically aligned autonomous AI models.
Table 2 provides an overview of the key challenges and the corresponding research directions.
5.1. Security, Privacy, and Trust Concerns: Toward Trustworthy Agents
Security, privacy, and trust are major concerns and barriers to the adoption of Agentic AI in information-sensitive sectors and scenarios. While the limitations of LLMs in handling secure or sensitive information are reasonably well studied [41,47,48,78], the risks become significantly more complex once these models are embedded into agents that can autonomously access external systems, trigger actions, and exchange data without direct human oversight. At present, there is no peer-reviewed literature that formally examines these concerns in the context of Agentic AI. However, several risks are foreseeable given the behaviour and architecture of current systems. Unlike traditional software or static AI models, Agentic AI systems are designed to operate independently, often across connected environments and third-party APIs. This raises important questions, including the following:
What resources are agents accessing, and when?
How do agents authenticate with external tools or services?
What kind of personal or organisational data do agents retrieve, store, or transmit?
How are agents communicating with each other, and is that communication auditable?
Can agents leak or infer sensitive information without the user being aware?
What protections exist against malicious agents or compromised toolchains?
Could agents be exploited to launch phishing campaigns, perform unauthorised surveillance, or manipulate user decisions?
Such questions are especially critical in domains where strict data governance is required. Organisations operating in these sectors are unlikely to permit autonomous agents to access internal systems or sensitive data without clearly defined security frameworks. At present, such frameworks do not exist. There are no established standards for agent authentication, access control, permission scoping, or behavioural logging. Even basic measures such as consent prompts, action previews, or audit trails are inconsistently applied or entirely missing. Without stronger privacy guarantees and well-defined security boundaries, it is unlikely that Agentic systems will be trusted in environments where data protection is a legal or ethical obligation. This calls for the development of trustworthy, secure agent architectures, including sandboxing methods, scoped credential use, runtime verification, and privacy-by-design protocols. Until such measures are in place, the deployment of Agentic AI will remain limited, particularly in sectors where the consequences of failure are severe.
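To indicate what permission scoping and audit trails could look like in practice, the sketch below wraps an agent’s tool calls in a scope check and logs every attempt, allowed or not. The scope names, log fields, and class design are hypothetical illustrations rather than an existing standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable

@dataclass
class ScopedToolRunner:
    """Runs agent tool calls only if they fall within the granted permission scopes,
    and records every attempt (including denials) in an audit trail."""
    granted_scopes: set[str]
    audit_log: list[dict] = field(default_factory=list)

    def run(self, tool_name: str, required_scope: str,
            fn: Callable[..., Any], **kwargs) -> Any:
        allowed = required_scope in self.granted_scopes
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "tool": tool_name,
            "scope": required_scope,
            "allowed": allowed,
            "args": kwargs,
        })
        if not allowed:
            raise PermissionError(f"{tool_name} requires scope '{required_scope}'")
        return fn(**kwargs)

runner = ScopedToolRunner(granted_scopes={"calendar:read"})
runner.run("read_calendar", "calendar:read", lambda user: ["meeting at 10:00"], user="alice")
try:
    runner.run("send_email", "email:send", lambda to, body: None, to="bob", body="hi")
except PermissionError as e:
    print("blocked:", e)
print(runner.audit_log[-1]["allowed"])   # False: the blocked attempt is still auditable
```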
5.2. Goal Misalignment: Toward Aligned Agents
One of the most critical challenges in Agentic AI systems is goal misalignment, where the actions taken by an agent diverge from the intended objectives set by its human users. This issue often arises when goals are under-specified, ambiguous, or interpreted differently by the agent due to its internal representations or learned priors [6,21]. Since these systems are more autonomous, even minor misalignments in goal interpretation can result in ineffective or counterproductive outcomes. When language-based instructions are processed by LLMs embedded in Agentic systems, this problem is further amplified because LLMs are sensitive to prompt phrasing and context, and small variations can lead to inconsistent and undesirable behaviours. For example, an agent that has been given the task to “minimise customer response time” may prioritise speed over accuracy, which can unintentionally degrade the overall service quality.
Addressing such issues requires creating agents that are more closely aligned with human values and intentions. This calls for robust value alignment techniques that help agents interpret and pursue goals in an ethically sound and contextually appropriate manner. Existing approaches to achieve these objectives include inverse reinforcement learning, modelling human preferences, and incorporating real-time feedback during task execution. However, there is still much to be understood about how agents can adapt to evolving goals or environments without drifting from their original objectives. Future research on aligned agents should examine how alignment can be maintained over time, particularly in dynamic multi-agent scenarios.
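The response-time example above can be made concrete with two toy objective functions: a literal reading of the instruction that optimises speed alone, and a constrained version that preserves the intended quality threshold. The policies and numbers are invented for illustration.

```python
# Two candidate agent policies, described only by their expected metrics (toy numbers).
policies = {
    "fast_but_shallow": {"response_time_s": 5, "accuracy": 0.70},
    "balanced":         {"response_time_s": 20, "accuracy": 0.93},
}

def misaligned_objective(m: dict) -> float:
    # Literal reading of "minimise customer response time": speed is all that counts.
    return -m["response_time_s"]

def aligned_objective(m: dict, min_accuracy: float = 0.9) -> float:
    # Intended goal: be fast, but never drop below an accuracy threshold.
    if m["accuracy"] < min_accuracy:
        return float("-inf")
    return -m["response_time_s"]

print(max(policies, key=lambda p: misaligned_objective(policies[p])))  # fast_but_shallow
print(max(policies, key=lambda p: aligned_objective(policies[p])))     # balanced
```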
5.3. Opaque Decision-Making: Toward Explainable Agents
As Agentic AI systems make more autonomous decisions, the ability to understand how and why those decisions were made becomes increasingly essential. Opaque or black box behaviours reduce trust and limit human oversight [79], which is particularly concerning in high-impact domains where decisions can carry significant consequences. This challenge is partly inherited from LLMs, whose internal reasoning processes are difficult to trace. However, the problem becomes more pressing in Agentic contexts, where decisions are part of multi-step plans or collaborative workflows. An opaque agent that misinterprets a user query or external signal may take a series of incorrect actions before the issue is noticed.
To address this, researchers need to create explainable agents [80] that utilise explainable autonomy techniques, allowing agents to provide human-understandable justifications for their actions. Methods include natural language rationales, saliency-based input highlighting, and structured audit trails. A key area for future work is developing explanations that are not only technically accurate but also useful for non-expert users interacting with the system in real-time.
5.4. Limited Human Oversight: Toward Collaborative Agents
Agentic AI introduces the risk of reduced human control, particularly when agents operate autonomously or without supervision [81]. In complex environments, it can be challenging for users to intervene or comprehend the system’s state, resulting in delayed error correction or unintended behaviours. Ensuring meaningful human oversight requires designing systems that support collaborative interaction. Collaborative agents should include transparent state reporting, interactive debugging tools, and mechanisms for human override. Agentic AI systems should be designed to allow control to shift between the agent and the user. This type of shared approach can strike a balance between the need for efficiency and the importance of safety.
There is an opportunity for future research to create collaborative Agentic frameworks that support shared control and can adapt to a range of use cases and user expertise levels. A key consideration is to ensure that oversight mechanisms are not overly burdensome or confusing. If they become too complicated or demanding, there is a risk that users may disengage from the system entirely, which can undermine both trust and effectiveness.
5.5. Context Management: Toward Memory-Aware Agents
Agentic AI systems often struggle to maintain context over long periods, especially in dynamic, real-time environments. This can lead to inconsistent decisions, forgotten instructions, or repeated errors, particularly when agents rely on LLMs that lack persistent memory structures [6]. Memory challenges are critical in applications that require multi-step planning and long-term learning. Without the capacity to retain and retrieve relevant information, agents are limited in their ability to adapt, learn from past interactions, and provide coherent user experiences.
Addressing this challenge requires creating Memory-Aware Agents that build on persistent memory architectures, including memory-augmented neural networks, vector databases for state retrieval, and hybrid systems that combine short-term and long-term memory components. A key challenge is balancing memory complexity with inference efficiency in resource-constrained deployments. More work is needed to ensure that memory systems remain secure, interpretable, and aligned with the agent’s goals and user expectations.
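The sketch below illustrates one such hybrid design, combining a bounded short-term buffer of recent turns with a long-term store queried by cosine similarity. The embedding function is a deterministic stub standing in for a real embedding model, and the class and method names are assumptions made for illustration.

```python
import numpy as np
from collections import deque

# Minimal sketch of a hybrid memory: a bounded short-term buffer of recent
# turns plus a long-term store queried by cosine similarity. embed() is a
# stub standing in for a real embedding model (e.g., a sentence encoder).

def embed(text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # deterministic stub
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class HybridMemory:
    def __init__(self, short_term_size: int = 5):
        self.short_term = deque(maxlen=short_term_size)       # recent context
        self.long_term: list[tuple[str, np.ndarray]] = []     # persistent store

    def remember(self, text: str) -> None:
        self.short_term.append(text)
        self.long_term.append((text, embed(text)))

    def recall(self, query: str, k: int = 2) -> list[str]:
        """Return recent turns plus the k most similar long-term memories."""
        q = embed(query)
        ranked = sorted(self.long_term, key=lambda m: float(q @ m[1]), reverse=True)
        return list(self.short_term) + [text for text, _ in ranked[:k]]

memory = HybridMemory()
memory.remember("User prefers morning flights")
memory.remember("Budget limit is £500")
print(memory.recall("What was the budget?"))
```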
5.6. Multi-Agent Coordination: Toward Cooperative Agents
Agentic AI applications can be composed of multiple agents that interact with each other. Effective coordination between these agents is crucial for achieving shared goals, preventing conflicts, and maintaining overall system stability [82]. However, coordination is not always straightforward: agents may have access to different or incomplete information, which can result in misaligned actions or unexpected behaviour. Managing coordination in multi-agent systems also raises important governance questions, such as how agents should communicate, how trust is built and maintained, and how roles and responsibilities are assigned.
Effective coordination among agents also depends on addressing broader concerns, such as fairness, redundancy, and the ability to recover from failures. As these systems continue to grow in scale and complexity, future research should explore scalable governance frameworks for coordinating large numbers of agents effectively, particularly in complex and real-time applications. Emphasis should be placed on designing robust protocols that promote cooperation while preserving the autonomy of individual agents.
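As a simple illustration of such a coordination protocol, the sketch below follows a contract-net-style round: a coordinator announces a task, each agent bids according to its own, possibly incomplete, view of its suitability, and the task is awarded to the best bid. The agent names and cost model are illustrative assumptions; failure handling and timeouts are omitted.

```python
# Minimal sketch of a contract-net-style coordination round: a coordinator
# announces a task, agents bid, and the task goes to the lowest-cost bidder.

from typing import Callable

def coordinate(task: str, agents: dict[str, Callable[[str], float]]) -> str:
    """Announce the task, collect bids (lower cost = better), award it."""
    bids = {name: bid_fn(task) for name, bid_fn in agents.items()}
    winner = min(bids, key=bids.get)
    return f"task '{task}' assigned to {winner} (bids: {bids})"

# Toy usage: each bid function encodes how well-suited an agent believes
# it is for the task given its own local information.
agents = {
    "search_agent":  lambda t: 1.0 if "find flights" in t else 5.0,
    "booking_agent": lambda t: 1.0 if "book" in t else 5.0,
}
print(coordinate("find flights Dubai to London", agents))
print(coordinate("book selected flight", agents))
```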
5.7. Long-Term Safety: Toward Safe Agents
Autonomous agents operating over extended periods pose long-term safety risks, including behavioural drift, the accumulation of minor errors, and unintended adaptation to harmful patterns. These risks are compounded in learning systems that update their strategies based on ongoing interaction with the environment [83]. Safe exploration techniques such as constraint-based learning, reward shaping, and simulation-based pretraining are being developed to allow agents to learn and adapt while minimising harm. However, there is no standard method for evaluating long-term safety across different domains, and most current approaches focus on short-term testing or static environments.
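The sketch below illustrates two of the techniques named above in their simplest form: a hard constraint filter that removes disallowed actions before the policy can choose them, and a shaped reward that penalises near-violations without forbidding them outright. The constraint, penalty weight, and action names are illustrative assumptions.

```python
# Minimal sketch of two complementary safety mechanisms: a hard constraint
# filter applied before action selection, and reward shaping that penalises
# risky behaviour during learning.

def filter_unsafe(actions: list[str], is_allowed) -> list[str]:
    """Hard constraint: unsafe actions are never exposed to the policy."""
    return [a for a in actions if is_allowed(a)]

def shaped_reward(task_reward: float, violation_cost: float,
                  penalty_weight: float = 10.0) -> float:
    """Soft constraint: discourage risky behaviour without forbidding it."""
    return task_reward - penalty_weight * violation_cost

# Toy usage for a purchasing agent with a spending limit.
candidate_actions = ["buy_ticket_450", "buy_ticket_900", "ask_user"]
allowed = lambda a: not a.endswith("_900")                  # hard budget limit
print(filter_unsafe(candidate_actions, allowed))            # drops the £900 option
print(shaped_reward(task_reward=1.0, violation_cost=0.2))   # 1.0 - 2.0 = -1.0
```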
Future work should aim to develop benchmarks, evaluation criteria, and runtime monitoring tools that can detect and mitigate long-term risks. It is also vital to define acceptable safety thresholds and fail-safe mechanisms tailored to specific use cases.
5.8. Ethical and Legal Accountability: Toward Accountable Agents
As Agentic AI systems make decisions that affect individuals and organisations, questions around responsibility, liability, and ethical compliance become important. Traditional software systems rely on transparent chains of command and decision provenance, but autonomous agents complicate these assumptions [84]. Challenges in this space include identifying who is accountable when an agent causes harm, ensuring that decision-making is auditable, and defining the legal status of agents within organisational workflows. International variations in legal frameworks and ethical norms further complicate these concerns.
Proposed solutions include the development of accountability protocols that require agents to log their actions, store decision rationales, and support external review. Legal scholars have also called for AI-specific liability models that reflect shared responsibility between developers, operators, and system owners. Future research should aim to bridge the gap between technical capabilities and regulatory expectations by building systems that are both auditable and compliant by design.
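As a minimal sketch of such an accountability protocol, the example below logs each agent action together with its rationale and chains entries by hash so that later tampering is detectable during external review. The record fields and class name are assumptions for illustration, not a proposed standard.

```python
import hashlib
import json
from datetime import datetime, timezone

# Minimal sketch of an accountability log: each entry records the acting
# agent, the action, and its rationale, and is chained to the previous
# entry's hash so that alterations are detectable during external review.

class AccountabilityLog:
    def __init__(self):
        self.entries: list[dict] = []

    def record(self, agent_id: str, action: str, rationale: str) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        body = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent_id": agent_id,
            "action": action,
            "rationale": rationale,
            "prev_hash": prev_hash,
        }
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append(body)

    def verify(self) -> bool:
        """Recompute the hash chain to check that no entry was altered."""
        prev = "genesis"
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if body["prev_hash"] != prev or recomputed != entry["hash"]:
                return False
            prev = entry["hash"]
        return True

log = AccountabilityLog()
log.record("booking_agent", "book_flight_EK003",
           "Cheapest option meeting all user constraints")
print(log.verify())  # True unless an entry was modified after the fact
```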
6. Conclusions
In this paper, we examined how LLMs and Agentic AI can be applied across various domains, highlighting the key challenges associated with them and emphasising future research opportunities. As these AI models continue to integrate into real-world systems, ranging from healthcare and education to finance and public policy, it becomes increasingly critical to understand not only what they can do but also the risks they pose. Despite advancements in LLMs and Agentic AI, issues related to bias, transparency, legal accountability, multi-agent coordination, security, privacy, trust, ethics, sustainability, opaque decision-making, human oversight, human-agent collaboration, memory architectures, global misalignment, and inclusivity remain persistent and pressing. These challenges are not specific to a particular model, but they emerge from deeper design and deployment patterns of AI algorithms. As such, the insights and research directions proposed in this study remain relevant across evolving architectures and future developments.
Through this forward-looking analysis, we offer researchers and practitioners a structured roadmap for navigating these emerging AI developments. We emphasise the importance of interdisciplinary collaboration to address these multifaceted issues and to guide the responsible and equitable development of LLM and Agentic AI technologies. This study offers an early glimpse of what lies ahead as LLMs shift from being passive tools to playing more active roles in shaping decisions and driving real-world outcomes through Agentic AI. It also provides reflections on the kind of AI future we are collectively shaping, one that demands technical advancement combined with careful attention to ethics, equity, and long-term societal impact.
In terms of limitations, our structured scoping review approach was guided by established principles that emphasise transparency, clear inclusion logic, and procedural rigour, but it was not based on a formal systematic review framework. While this approach was appropriate for capturing emerging trends across diverse and rapidly evolving sources, future reviews in this area could adopt more formalised methodologies, such as protocol-based reviews, meta-analyses, and systematic reviews, particularly as the literature matures and becomes more standardised.