Review

ChatGPT’s Expanding Horizons and Transformative Impact Across Domains: A Critical Review of Capabilities, Challenges, and Future Directions

by Taiwo Raphael Feyijimi 1,*, John Ogbeleakhu Aliu 2,*, Ayodeji Emmanuel Oke 3 and Douglas Omoregie Aghimien 4
1 School of Electrical and Computer Engineering (ECE), Engineering Education Transformations Institute (EETI), College of Engineering, University of Georgia, Athens, GA 30602, USA
2 Engineering Education Transformations Institute (EETI), College of Engineering, University of Georgia, Athens, GA 30602, USA
3 Research Group on Sustainable Infrastructure Management Plus (RG-SIM+), Department of Quantity Surveying, Federal University of Technology Akure, Akure 340110, Ondo State, Nigeria
4 Department of Civil Engineering Technology, Faculty of Engineering & the Built Environment, University of Johannesburg, Johannesburg 2028, South Africa
* Authors to whom correspondence should be addressed.
Computers 2025, 14(9), 366; https://doi.org/10.3390/computers14090366
Submission received: 3 July 2025 / Revised: 14 August 2025 / Accepted: 20 August 2025 / Published: 2 September 2025
(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

Abstract

The rapid proliferation of Chat Generative Pre-trained Transformer (ChatGPT) marks a pivotal moment in artificial intelligence, eliciting responses from academic shock to industrial awe. As these technologies advance from passive tools toward proactive, agentic systems, their transformative potential and inherent risks are magnified globally. This paper presents a comprehensive, critical review of ChatGPT’s impact across five key domains: natural language understanding (NLU), content generation, knowledge discovery, education, and engineering. While ChatGPT demonstrates profound capabilities, significant challenges remain in factual accuracy, bias, and the inherent opacity of its reasoning—a core issue termed the “Black Box Conundrum”. To analyze these evolving dynamics and the implications of this shift toward autonomous agency, this review introduces a series of conceptual frameworks, each specifically designed to illuminate the complex interactions and trade-offs within these domains: the “Specialization vs. Generalization” tension in NLU; the “Quality–Scalability–Ethics Trilemma” in content creation; the “Pedagogical Adaptation Imperative” in education; and the emergence of “Human–LLM Cognitive Symbiosis” in engineering. The analysis reveals an urgent need for proactive adaptation across sectors. Educational paradigms must shift to cultivate higher-order cognitive skills, while professional practices (including practices within the education sector) must evolve to treat AI as a cognitive partner, leveraging techniques like Retrieval-Augmented Generation (RAG) and sophisticated prompt engineering. Ultimately, this paper argues for an overarching “Ethical–Technical Co-evolution Imperative”, charting a forward-looking research agenda that intertwines technological innovation with rigorous ethical and methodological standards to ensure responsible AI development and integration. In sum, the challenges of factual accuracy, bias, and opacity are interconnected and acutely magnified by the emergence of agentic systems, demanding a unified, proactive approach to adaptation across all sectors.

1. Introduction: The ChatGPT Inflection Point in AI and Its Applications

The advent of Chat Generative Pre-trained Transformer (ChatGPT) marks a significant inflection point in the trajectory of artificial intelligence and its pervasive influence across myriad sectors. Its rapid ascent since its public release has been characterized by widespread adoption and a mixture of acclaim and apprehension, fundamentally altering landscapes in education, healthcare, customer service, software development, and beyond [1]. Hailed as a “game changer” [2] and a “disruptive” technology [3], ChatGPT, built upon the Transformer architecture and the continued evolution of large language models (LLMs) [1], has demonstrated an uncanny ability to generate human-like text and engage in complex tasks. This inflection point is now rapidly evolving, as the powerful reasoning and tool-use capabilities of these models are giving rise to agentic AI—autonomous systems that can plan, reason, and execute complex, multi-step tasks to achieve goals, representing a paradigm shift from passive tools to active participants in digital and physical environments [4].
The initial reception to ChatGPT, particularly within academic circles, has often been described as more “shock” than “awe” [3]. This reaction stems largely from its immediate and palpable disruption to established educational practices, especially concerning student assessment and academic integrity [5]. The swiftness with which students adopted the tool for tasks such as coursework generation caught many institutions off-guard. This dynamic is set to intensify with the rise of AI agents, which can automate not just the writing of an assignment but the entire research and analysis process, posing even deeper challenges to traditional assessment. This evolution fundamentally elevates the stakes across all domains, transforming previous limitations and ethical concerns from mere inconveniences into critical risk factors with real-world impact and demanding a more proactive, integrated approach to development and deployment. In contrast, industries have often focused more on the “awe” aspect, emphasizing productivity gains and novel content creation capabilities [6]—a perception now amplified by the potential of agents to automate entire workflows, not just discrete tasks [7]. This divergence in perception is context-dependent: those whose core practices are directly challenged by the technology are more likely to experience initial shock, while those who see immediate utility may be more readily impressed. This split underscores a central thesis of this paper: that a balanced, critical lens—one informed by both the “shock” of disruption and the “awe” of potential—is paramount for developing the robust ethical frameworks, innovative adaptive responses, and deeper understanding of human–AI collaboration essential for governing emerging agentic systems.
This paper critically examines ChatGPT’s multifaceted applications across five carefully selected domains: natural language understanding (NLU), content generation, knowledge discovery, engineering, and education. This selection is deliberate, designed to construct a layered analysis that moves from foundational capabilities (NLU and content generation) to an advanced, cross-cutting application (knowledge discovery), and culminates in two contrasting yet representative case studies of applied sectors: education (a human-centric, social domain) and engineering (a technical, STEM-oriented field). Together, these domains provide a comprehensive, illustrative spectrum for evaluating ChatGPT’s multi-faceted impacts, from its core mechanisms to its societal integration. Through this structured examination, the paper aims to advance theoretical understanding and propose methodological innovations. While specific examples are rooted in these areas, the primary analytical contribution lies in the development of conceptual frameworks that, by design, are highly generalizable. These frameworks offer a durable and transferable lens for understanding the opportunities and challenges in numerous other domains, from healthcare to finance, while addressing the attendant ethical considerations, research frontiers, and the implications of the emerging agentic paradigm to foster responsible and impactful global integration. The subsequent sections will dissect these specific application domains, exploring their theoretical underpinnings, recent methodological advancements, and the critical challenges that must be navigated to realize the full potential of both generative and agentic AI responsibly.

2. Advancements in Natural Language Understanding with ChatGPT: Capabilities, Innovations, and Critical Frontiers

At the heart of modern conversational AI lies natural language understanding (NLU), a specialized subfield of the broader discipline of natural language processing (NLP) [8]. While NLP encompasses the entire spectrum of enabling computers to process and communicate in human language, NLU is specifically concerned with machine reading comprehension—the ability to decipher the meaning, intent, and context behind human linguistic input [8]. This task is considered an “AI-hard” problem due to the immense complexity of human language, which is replete with nuance, ambiguity, idioms, and figurative expressions that are challenging for machines to interpret correctly [9]. NLU operates through two primary modes of analysis: syntactic analysis, which deconstructs the grammatical structure of a sentence, and semantic analysis, which processes the meaning of the words and their relationships within that structure [8]. For example, in the user query, “Book a flight to New York,” an NLU system must identify “book” as the user’s intent or action, “flight” as the object of that action, and “New York” as the destination parameter. This ability to transform unstructured language into structured, actionable information is the foundation for applications ranging from intelligent chatbots and virtual assistants to sentiment analysis tools and spam filters [9]. NLU is intrinsically linked to its counterpart, natural language generation (NLG), which is the process by which computers automatically generate content in human-like language [8]. In a typical interaction with a model like ChatGPT, NLU is employed to understand the user’s prompt, and NLG is used to formulate and deliver the textual response. The seamless integration of sophisticated NLU and NLG is what allows these models to engage in conversations that feel natural and coherent [8]. The continuous evolution of its underlying architecture and training methodologies has pushed the boundaries of what machines can comprehend and how they can interact using human language, setting the stage for more autonomous and goal-oriented AI systems.
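To make the “Book a flight to New York” example concrete, the sketch below shows the kind of structured intent frame an NLU component is expected to produce from unstructured input. The rule-based parser and slot names are illustrative assumptions for this review, not part of any cited system; production NLU uses learned models rather than hand-written patterns.

```python
import re

def parse_intent(utterance: str) -> dict:
    """Toy NLU parser: maps a travel request to a structured intent frame.

    Illustrative only; real NLU systems rely on trained models,
    not hand-written patterns like this.
    """
    match = re.search(
        r"\bbook\b.*?\bflight\b.*?\bto\s+(?P<dest>[A-Z][\w ]*)",
        utterance,
        re.IGNORECASE,
    )
    if match:
        return {
            "intent": "book_flight",          # the user's action
            "object": "flight",               # what the action applies to
            "slots": {"destination": match.group("dest").strip()},
        }
    return {"intent": "unknown", "slots": {}}

print(parse_intent("Book a flight to New York"))
# {'intent': 'book_flight', 'object': 'flight', 'slots': {'destination': 'New York'}}
```

This structured output, rather than raw text, is what downstream applications such as chatbots and virtual assistants actually consume.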

2.1. Core NLU Architecture and Functionalities

The performance of LLMs is dictated by their underlying architecture. The foundational building block for all modern, high-capability models, including the entire GPT series, is the Transformer architecture [10]. Introduced in 2017, this deep learning architecture revolutionized NLP with its use of “attention mechanisms,” which allow the model to weigh the importance of different words in the input text when processing and generating language. A key component of this architecture is the self-attention mechanism, which enables ChatGPT to engage in complex query responses, maintain contextual understanding over extended dialogues, and generate remarkably human-like text [1]. A critical process in this architecture is tokenization, which involves converting raw text into a sequence of numerical units, or “tokens,” that the model can mathematically process [11]. These tokens can be words, parts of words, or even individual characters. Modern tokenizers, such as the Byte-Pair Encoding (BPE) algorithm used by GPT models, are sophisticated systems that build a vocabulary of tokens based on their frequency in the training data [12]. The efficiency of tokenization—how many tokens are required to represent a given word or phrase—directly impacts the model’s speed and operational cost [13]. A significant challenge has been that tokenizers optimized on predominantly English-language corpora are often less efficient for other languages, requiring more tokens per word and thus increasing latency and cost for non-English users [12]. To address this, newer models like GPT-4o feature advanced tokenization with a significantly larger vocabulary (199,997 tokens for GPT-4o compared to 100,256 for GPT-4) [14]. This larger vocabulary allows for more efficient encoding of text, particularly in non-English languages, which reduces the number of tokens needed for a given input and thereby improves speed and cost-effectiveness [14,15]. However, this advancement is not without trade-offs; research has shown that these new tokenization schemes can sometimes introduce novel interpretation errors, for instance by misinterpreting certain phrases due to how the tokenizer handles new, long vocabulary entries from its statistical corpus [14]. Finally, a key architectural parameter that defines a model’s capability is its context window. This refers to the maximum amount of text, measured in tokens, that the model can process and “remember” at any one time [13]. It functions as the model’s working memory, encompassing both the user’s input (prompt) and the model’s generated output (completion). A larger context window is a significant advantage, as it allows the model to analyze long documents, maintain coherence over extended conversations, and reference a greater amount of information to ground its responses, which can help reduce hallucinations [13]. The size of the context window has expanded dramatically across the GPT lineage, from 4096 tokens in early GPT-3.5 models to 128,000 in GPT-4 Turbo and GPT-4o, and even up to 1 million tokens in the latest GPT-4.1 series [13]. Such enhancements in comprehension and processing enable agents to better interpret complex commands, maintain long-term goal coherence, and ground their actions in richer environmental understanding, thus moving beyond mere text generation to active, intelligent execution [4,16,17].
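The tokenizer comparison above can be inspected directly with OpenAI’s open-source tiktoken library; the snippet below is a minimal sketch assuming the tiktoken package is installed, and the vocabulary sizes it reports may differ slightly from the figures quoted above because the library counts special tokens.

```python
# Minimal tokenization comparison, assuming the `tiktoken` package
# (pip install tiktoken).
import tiktoken

gpt4_enc = tiktoken.get_encoding("cl100k_base")   # tokenizer family used by GPT-4
gpt4o_enc = tiktoken.get_encoding("o200k_base")   # larger-vocabulary tokenizer used by GPT-4o

sample = "Natural language understanding transforms unstructured text into structured meaning."
for name, enc in [("GPT-4 (cl100k_base)", gpt4_enc), ("GPT-4o (o200k_base)", gpt4o_enc)]:
    tokens = enc.encode(sample)
    print(f"{name}: vocab size = {enc.n_vocab}, token count = {len(tokens)}")

# Fewer tokens for the same text means lower latency and API cost,
# an effect that is most pronounced for non-English languages.
```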

2.2. Innovative NLU Techniques and Their Impact

Beyond core architectural improvements, specific techniques are being developed to enhance ChatGPT’s NLU capabilities, address its limitations, and empower it to move from passive generation to active problem-solving.
Retrieval-Augmented Generation (RAG): RAG has emerged as a powerful method to mitigate knowledge gaps by dynamically incorporating external information into the generation process [18,19]. This involves vectorizing relevant documents and providing them as context, allowing the model to access up-to-date or domain-specific information not present in its training data. A notable application is in specialized engineering domains like Building Information Modeling (BIM), where RAG enabled ChatGPT-4 to significantly improve its understanding of localized Korean BIM guidelines, boosting performance by 25.7% [19]. By grounding responses in verifiable external documents, RAG increases trustworthiness and is a foundational technique for reliable AI systems [20].
The Rise of Agentic AI: A paradigm-shifting innovation is the development of agentic frameworks. These frameworks empower LLMs to act as autonomous agents that can reason, create complex plans, decompose tasks, and utilize external tools (like code interpreters, APIs, or web browsers) to achieve multi-step goals [4]. Unlike traditional NLU, which focuses on comprehension and response, agents use NLU to interact with an environment and execute actions, representing a move from passive text generation to proactive, goal-directed behavior [16,17].
Advancements from NLP Conferences (EMNLP/ACL 2024–2025): Recent research highlights several innovative techniques [21]:
  • LLMs for Data Annotation and Cleansing: LLMs are increasingly used to automate or assist in data annotation, a traditionally labor-intensive task. For example, the Multi-News+ dataset was enhanced by using LLMs with chain-of-thought and majority voting to cleanse and classify documents, improving dataset quality for multi-document summarization tasks.
  • Factual Inconsistency Detection: Given LLMs’ propensity for hallucination, techniques to detect factual inconsistencies are crucial. Methods like FIZZ, which employ fine-grained atomic fact decomposition and alignment with source documents, offer more interpretable ways to identify inaccuracies in abstractive summaries.
  • Multimodal NLU Enhancement: Research is exploring the integration of other modalities, such as acoustic speech information, into LLM frameworks for tasks like depression detection, indicating a move toward more holistic NLU that mirrors human multimodal comprehension [22].
  • “Evil Twin” Prompts: The discovery of “evil twin” prompts, obfuscated and uninterpretable inputs that can elicit desired outputs and transfer between models, opens new avenues for understanding LLM vulnerabilities and their internal representations, posing both security risks and research opportunities. For NLU, these prompts expose the often-superficial nature of model comprehension, revealing instances where models respond to statistical patterns rather than genuine semantic understanding, thereby challenging the very definition of “understanding” in AI systems [7,23,24,25,26,27,28,29] and pushing theoretical frontiers to delineate between mere linguistic mimicry and robust, systematic generalization akin to human cognition.

2.3. Evolution and Critical Assessment of GPT Models in NLU

Despite their advancements, LLMs like ChatGPT face several critical limitations:
Benchmark Performance: This highlights the “Specialization vs. Generalization” tension inherent in NLU: while versatile, general-purpose models like ChatGPT-3.5 Turbo offer broad utility, they are often outperformed by specialized, fine-tuned models (e.g., BERT variants) on specific NLU benchmarks like GLUE, particularly in tasks such as paraphrase detection or semantic similarity [1]. Newer versions like GPT-4 show improvements, yet this performance gap often persists against highly optimized task-specific models, underscoring the ongoing challenge of achieving both broad competence and expert-level precision.
Inherent Limitations
  • Misinformation and Hallucinations: A persistent issue is the generation of plausible-sounding but incorrect or nonsensical information, often termed “hallucinations” [1]. This undermines reliability, especially in critical applications.
  • Bias: LLMs inherit biases present in their vast training datasets, which can manifest as gender, racial, geographical, or ideological skews in their outputs [1,30,31]. These biases can perpetuate harmful stereotypes and lead to unfair outcomes.
  • Transparency and Explainability: The “black box” nature of LLMs makes it difficult to understand their decision-making processes or trace the origins of errors [2,28,32,33,34]. This lack of interpretability is a major hurdle for debugging, ensuring fairness, and building trust.
  • Contextual Understanding Limits: While improved, LLMs can still struggle with deeply nuanced contextual understanding, complex linguistic structures (like center-embedding), rare words, sarcasm, or the subtleties of human emotion [1,30,31].
These risks are significantly amplified in agentic systems. An agent that hallucinates or acts on biased information can cause direct, real-world harm, making robust safety and alignment protocols paramount [35]. Addressing these limitations in NLU necessitates a multi-faceted approach, including developing explainable AI (XAI) techniques to unravel model decision-making and implementing rigorous data de-biasing strategies to ensure equitable and reliable linguistic processing. Table 1 provides a comparative overview of different ChatGPT models and their NLU/content generation characteristics, based on available information. The following subsections analyze this table in detail, including benchmark results, to provide more context for their interdisciplinary applicability.

2.3.1. Analysis of the GPT-3.5 Series: The Catalyst for Widespread Generative AI

The GPT-3.5 series, and specifically the ChatGPT application built upon it, represents a watershed moment in the history of artificial intelligence. While not the most technically advanced model family by current standards, its release marked the point at which generative AI transitioned from a niche academic pursuit to a global phenomenon. This section analyzes the model that served as this catalyst, establishing a crucial performance baseline and highlighting the capabilities and limitations that drove the development of its more powerful successors.
Architectural Profile and Training Paradigm
The ChatGPT-3.5 model is an iteration of the GPT-3 architecture, fine-tuned from the GPT-3.5 series of models that completed training in early 2022 [31]. Its foundation is a massive dataset of approximately 570 GB of text and code scraped from the internet, including sources like books, Wikipedia, articles, and code repositories such as GitHub [31]. This gives the model a broad, albeit general, base of world knowledge, which is limited by a cutoff date of January 2022 [31]. The defining innovation of ChatGPT-3.5 was not its scale but its training methodology. It was specifically optimized for dialogue using Reinforcement Learning with Human Feedback (RLHF) [31]. This process involves using human AI trainers to provide demonstrations of desired behavior and to rank different model outputs for quality. This feedback is then used to train a reward model, which in turn is used to fine-tune the LLM. The RLHF process guided the model toward generating responses that are more helpful, coherent, and aligned with human expectations for safe and useful conversation, making it feel significantly more interactive and less like a raw text-completion engine [31].
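The RLHF procedure described above is commonly formalized in the research literature as a KL-regularized reward-maximization objective; the form below is the standard one from that literature, not a formula OpenAI has disclosed for this specific model:

```latex
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\, \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]
```

Here r_φ is the reward model trained on human preference rankings, π_ref is the supervised fine-tuned policy the optimization starts from, and the coefficient β limits how far the dialogue-optimized policy may drift from its pre-trained behavior.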
The GPT-3.5 family includes several variants, with GPT-3.5 Turbo emerging as the most capable and widely used model, offering a balance of performance, speed, and cost [31]. While OpenAI has not officially disclosed its size, research papers and analysis suggest the GPT-3.5 Turbo model contains approximately 20 billion parameters [34].
Core Competencies and Limitations
GPT-3.5 demonstrated strong performance across a wide range of general-purpose NLU and content generation tasks. Its primary strengths lie in its speed and its ability to handle basic conversational AI, translation, summarization, and the generation of “boilerplate” content, such as initial drafts of emails or simple code snippets [2,31,36]. Its rapid response time made it highly suitable for many real-time, interactive applications [2,31]. However, its widespread use also brought its limitations into sharp focus. The model is known for significant issues with factual accuracy and a propensity for hallucination, a problem exacerbated by its static, pre-2022 knowledge base [30]. It struggles notably with tasks that require deep domain knowledge, complex multi-step reasoning, or an understanding of nuanced, specialized subject matter [1]. Consequently, its application in safety-critical fields like healthcare has been met with caution, as studies have noted potentially high error rates [37].
The GLUE Score and Domain-Specific Tests
To objectively measure its capabilities, GPT-3.5 has been evaluated against standardized benchmarks. One of the most fundamental is the General Language Understanding Evaluation (GLUE) benchmark. GLUE is a collection of nine distinct NLP tasks, including sentiment analysis, question answering, and textual similarity, designed to provide a single, aggregate score that reflects a model’s overall language understanding ability [10]. The benchmark was designed to favor models that can generalize and transfer learning across this diverse set of tasks, with the ultimate goal of driving research toward more robust and versatile NLU systems [47]. On the GLUE benchmark, GPT-3.5 achieved an average score of approximately 78.7%. This performance is comparable to earlier foundational models like BERT-base but falls short of more rigorously trained models like RoBERTa-large. This result positions GPT-3.5 as competent in foundational NLU but not at the state of the art. Its limitations in more specialized contexts are further highlighted by its performance on domain-specific benchmarks. For example, on the LexGLUE benchmark, which consists of legal text classification tasks, GPT-3.5 Turbo achieved a zero-shot micro-F1 score of only 49.0%, significantly underperforming fine-tuned models that scored 78.9% [48]. This indicates a clear weakness in handling the nuanced and specific language of a specialized domain like law. To fully appreciate the context of these scores, it is useful to compare the architectures and training of the models it is measured against. BERT (Bidirectional Encoder Representations from Transformers) was a revolutionary model that learned context by processing text from both left-to-right and right-to-left simultaneously [49]. RoBERTa (Robustly Optimized BERT Approach) is not a new architecture but rather a refinement of BERT that achieves superior performance through a more sophisticated training regimen [49]. The fact that GPT-3.5 lags RoBERTa-large on GLUE demonstrates that raw model scale is not the only factor in performance; optimized training methodology is equally, if not more, critical. Table 2 highlights the optimized training procedures of RoBERTa, which contribute to its superior performance on NLU benchmarks like GLUE compared to earlier models.
The model’s weakness in specialized domains was starkly illustrated by its performance on Korea’s BIM Expertise Exam. Building Information Modeling (BIM) is a complex, holistic process for creating and managing digital representations of built assets, critical to the modern architecture, engineering, and construction (AEC) industry [50]. The Korean exam is a rigorous professional test of deep, domain-specific knowledge covering BIM software, standards, processes, and applications [19]. On this exam, GPT-3.5 scored an average of 65%, failing to meet the passing threshold and scoring a full 20 percentage points lower than its successor, GPT-4 [19]. This result provides clear, quantitative evidence of GPT-3.5’s limitations when faced with tasks requiring expert-level, specialized knowledge. Ultimately, the story of GPT-3.5 is not one of raw technical dominance. Its performance was quickly surpassed. Instead, its historical significance lies in its implementation. The combination of a “good enough” base model with the user-friendly ChatGPT interface and the crucial addition of RLHF for dialogue optimization created a product that was uniquely accessible and engaging. This democratization of generative AI ignited a massive wave of public interest, research, and investment. In doing so, it also brought the model’s limitations—its inaccuracies, its static knowledge, its struggles with specialization—to the forefront of public and academic discourse. These widely publicized failures effectively created the product roadmap for the next generation of models, with each of GPT-3.5’s key weaknesses becoming a primary target for improvement in GPT-4.

2.3.2. Analysis of the GPT-4 Series: A Leap in Reasoning and Multimodality

The introduction of the GPT-4 series marked a significant generational leap, moving beyond the general-purpose conversational abilities of its predecessor to deliver substantial improvements in reasoning, accuracy, and versatility. This evolution was driven by fundamental architectural enhancements, a dramatic expansion of the model’s working memory, and the practical integration of frameworks to overcome its inherent knowledge limitations. GPT-4 was designed not just to converse, but to solve complex problems with a higher degree of reliability.
Architectural Enhancements and Expanded Context
The most significant architectural shift with GPT-4 was its development as a large-scale multimodal model [22]. Unlike GPT-3.5, which was limited to text, GPT-4 can accept both image and text inputs to produce text outputs [23]. This capability allows it to perform a much wider range of tasks, such as analyzing diagrams, summarizing information from screenshots, and answering visual questions, fundamentally expanding its utility [23]. Parallel to this was a massive expansion of the model’s context window, which functions as its effective working memory [13]. While GPT-3.5 was limited to 4096 tokens, the GPT-4 series saw a dramatic and progressive increase in this capacity. This evolution is a key factor in the model’s enhanced performance, as a larger context window enables it to process and reason over extensive documents, maintain coherence in long conversations, and utilize more information to ground its responses, thereby improving accuracy and reducing hallucinations [13]. For example, this capability allows for sophisticated tasks like analyzing multiple lengthy legal contracts to identify conflicting clauses or supplementary context [28]. Table 3 illustrates the exponential growth of the context window across the GPT-4 lineage, translating abstract token counts into more tangible equivalents.
However, it is crucial to note that a larger context window does not automatically guarantee superior performance. Research has identified a challenge known as the “lost in the middle” or “needle-in-a-haystack” problem, where models may struggle to recall or utilize information that is buried deep within a long context [28]. This indicates that the model’s ability to effectively reason over the provided context is more critical than the raw size of the window itself [28].
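A “needle-in-a-haystack” evaluation of this effect can be sketched in a few lines: a known fact is buried at varying depths inside filler text and the model is asked to retrieve it. In the sketch below, the ask_llm function is a hypothetical stand-in for any chat-completion API, and the filler and needle strings are invented for illustration.

```python
# Minimal "needle-in-a-haystack" probe for long-context recall.
# `ask_llm(prompt) -> str` is a hypothetical stand-in for a real chat API.

FILLER = "The sky was a pale shade of grey that morning. " * 2000  # long filler context
NEEDLE = "The secret project codename is BLUE HERON."

def build_context(depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + NEEDLE + " " + FILLER[cut:]

def probe(ask_llm, depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict:
    results = {}
    for d in depths:
        prompt = build_context(d) + "\n\nQuestion: What is the secret project codename?"
        answer = ask_llm(prompt)
        results[d] = "BLUE HERON" in answer  # recall succeeds if the needle is surfaced
    return results  # failures clustered at mid depths indicate "lost in the middle"
```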
Advanced Capabilities in Complex Reasoning and Nuanced Generation
The architectural upgrades in GPT-4 translated directly into superior performance, particularly on complex tasks. OpenAI described the model as “more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5,” a difference that becomes increasingly evident as task complexity rises [11]. This leap in capability is quantifiable through benchmark performance. GPT-4 exhibits human-level or even superhuman performance on a variety of professional and academic exams. In a striking example, it passed a simulated Uniform Bar Exam with a score in the top 10% of test-takers, whereas GPT-3.5 scored in the bottom 10% [23]. It also demonstrated significant gains on standard NLP benchmarks, outperforming GPT-3.5 on MMLU (86.4% vs. 70.0% for 5-shot) and the HumanEval coding benchmark (67.0% vs. 48.1% for 0-shot) [23]. In domain-specific applications, this translated to lower error rates; for instance, error rates in business and economics contexts were observed to be around 15–20% for GPT-4, a notable improvement over its predecessor [37]. In the demanding field of medicine, GPT-4 exceeded the passing score on the US Medical Licensing Exam (USMLE) by over 20 points, outperforming not only GPT-3.5 but also specialized, medically fine-tuned models like Google’s Med-PaLM [51].
Mitigating Knowledge Gaps: The Critical Role of Retrieval-Augmented Generation (RAG)
The development of GPT-4 coincided with the practical application of Retrieval-Augmented Generation (RAG), a framework that addresses the core LLM limitations of knowledge cutoffs and the lack of domain-specific information. RAG is an AI architecture that optimizes LLM output by first retrieving relevant, up-to-date information from an external, authoritative knowledge base and then providing this information to the model as context along with the user’s original query [18]. This process grounds the model’s response in factual, verifiable data without requiring the costly and time-consuming process of retraining the entire model [18].
The RAG workflow typically involves three steps (a minimal code sketch follows the list):
  • Indexing: External data from sources like document repositories, databases, or APIs is converted into numerical representations (embeddings) and stored in a specialized vector database. This creates an accessible knowledge library for the model [18].
  • Retrieval: When a user submits a query, the system converts the query into an embedding and performs a similarity search against the vector database to find and retrieve the most relevant chunks of information [18].
  • Generation: The retrieved information is then appended to the original user prompt and sent to the LLM, which uses this augmented context to generate a more accurate, relevant, and grounded response [18].
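The following sketch condenses these three steps into runnable form, assuming the sentence-transformers package for embeddings; llm_generate is a hypothetical stand-in for any specific model API, and the sample documents are invented for illustration.

```python
# Minimal RAG sketch: index -> retrieve -> generate.
# Assumes `sentence-transformers` (pip install sentence-transformers);
# `llm_generate(prompt) -> str` is a hypothetical stand-in for an LLM API.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Indexing: embed the knowledge base once and keep the vectors.
documents = [
    "Per the 2024 internal policy, all BIM deliverables use IFC 4.3.",
    "Contrast media dosing must follow the local radiology guideline v7.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def answer(query: str, llm_generate, k: int = 1) -> str:
    # 2. Retrieval: embed the query and find the nearest document chunks.
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vec                      # cosine similarity (unit-norm vectors)
    context = "\n".join(documents[i] for i in np.argsort(scores)[::-1][:k])

    # 3. Generation: prepend the retrieved context to ground the response.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm_generate(prompt)
```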
RAG provides a powerful solution to several of GPT-4’s inherent weaknesses. It allows the model to access information created after its training data cutoff, significantly reduces hallucinations by grounding responses in a trusted source, and enables it to answer questions based on proprietary or localized information (e.g., an organization’s internal policies or specific regional regulations) that was never part of its general training [18]. The practical value of this approach is clearly demonstrated in case studies. In the Korea BIM Expertise Exam, GPT-4 initially struggled with questions in the “BIM guidelines” subcategory, which required knowledge of specific Korean government policies not present in its general training data. However, when RAG was used to supply the model with the relevant policy documents, its score in that specific category improved by a remarkable 25.7%, raising its overall average score from 85% to 88.6%. This shows how RAG can be used as a surgical tool to patch precise knowledge gaps. Similarly, in healthcare, where information must be both current and compliant with local guidelines, RAG is invaluable. A study on its use for radiology contrast media consultations showed that implementing RAG with a locally deployed LLM completely eliminated hallucinations (0% hallucination rate with RAG vs. 8% without) and significantly improved the quality and relevance of its answers [52]. This approach provided the dual benefit of enhanced accuracy and the data privacy of an on-premise solution [52]. The rise of GPT-4 and the widespread application of RAG signify a crucial strategic pivot in the LLM field. It marks a move away from a monolithic, “one model knows all” philosophy toward a more pragmatic and effective hybrid architecture: a powerful core reasoning engine augmented by external, curated knowledge. This approach is an implicit acknowledgment that endlessly scaling a model’s training data is neither economically feasible nor practically effective for all knowledge-intensive tasks, particularly those that demand timeliness, domain specificity, and factual verifiability. The future of applied AI is thus less about the sheer size of a model’s internal “brain” and more about the quality, speed, and relevance of its “library card”—its ability to access and reason over external information. This shifts a significant part of the engineering challenge from model training to the curation of high-quality knowledge bases and the design of efficient retrieval systems.

2.3.3. Analysis of the GPT-4o Series: The Pursuit of Omni-Modal, Efficient Intelligence

The introduction of the GPT-4o (“o” for omni) series represents another significant evolution in OpenAI’s design philosophy. While GPT-4 focused on a raw increase in capability and reasoning power, GPT-4o and its variants prioritize efficiency, speed, and the native integration of multiple data modalities. This shift signals a maturation of the technology, moving from a pure research focus on scaling performance to a product-centric focus on user experience and economic viability.
The “Omni” Architecture: Unifying Modalities and Advancing Tokenization
The defining feature of GPT-4o is its unified end-to-end architecture. Previous systems that supported voice interaction, such as the Voice Mode in ChatGPT, relied on a pipeline of three separate models: one to transcribe audio to text, a second (GPT-3.5 or GPT-4) to process the text and generate a text response, and a third to convert that text back into audio [14]. This pipeline was slow and resulted in a loss of information, as the core text model could not perceive nuances like tone of voice, emotion, or background sounds [14]. GPT-4o replaces this cumbersome pipeline with a single, new model that was trained end-to-end across text, vision, and audio [14]. This means the same neural network processes all inputs and outputs, whether they are text, images, or sound. This unified approach allows the model to perceive and generate a much richer spectrum of communication, including laughter, singing, and expressive emotion, while also making the entire process significantly faster and more cost-effective [14]. This pursuit of efficiency is also reflected in the model’s advanced tokenization. GPT-4o employs a new tokenizer with a vocabulary of 199,997 tokens, nearly double the size of GPT-4’s [14]. A larger vocabulary allows the model to represent text more compactly, which is especially beneficial for non-English languages that were often inefficiently tokenized by previous, English-centric models [14]. For example, a piece of Chinese text that might have required twelve tokens for GPT-4 to process could be represented in as few as two tokens by GPT-4o, leading to substantial improvements in processing speed and reductions in API costs [14].
Strengths in Interactive and Real-Time Content Generation
The architectural innovations of GPT-4o translate directly into a superior interactive experience. The most dramatic improvement is in latency. By eliminating the multi-model pipeline, GPT-4o can respond to audio inputs in an average of 320 ms, with response times as low as 232 ms [14]. This is comparable to human response time in a natural conversation and represents a massive improvement over the 2.8 to 5.4 s latencies of the previous Voice Mode [14]. This real-time capability enables far more natural and fluid human-computer interaction, supporting a new class of applications. Users can engage in seamless, real-time conversations or use the model for tasks like instantly translating a menu from a photo or asking it to explain the rules of a live sports game by processing a real-time video feed [14]. Furthermore, the GPT-4o series is designed for accessibility. The flagship GPT-4o model provides GPT-4-level intelligence but is twice as fast and 50% cheaper to run via the API [14]. The even more efficient GPT-4o mini is approximately 60% cheaper than GPT-3.5 Turbo, making advanced AI capabilities economically viable for a much broader range of developers and businesses [42].
The MMLU Score
To assess its general knowledge and problem-solving abilities, models are often tested against the Massive Multitask Language Understanding (MMLU) benchmark. MMLU is a comprehensive and challenging test designed to evaluate an LLM’s capabilities across a vast array of subjects [53]. It comprises multiple-choice questions spanning 57 different tasks, covering everything from elementary mathematics and US history to highly specialized professional-level topics like computer science, law, and medicine [53]. MMLU is considered a particularly challenging benchmark because it was created specifically to address the saturation of earlier tests like GLUE by rapidly improving LLMs [53]. It evaluates models in a zero-shot or few-shot setting, meaning the model must answer questions with little to no task-specific examples provided in the prompt. This methodology tests the model’s ability to generalize its knowledge acquired during pre-training, reflecting a more robust and flexible form of intelligence [53]. To succeed at MMLU, a model requires not just factual recall but also critical thinking and the ability to reason across diverse domains [53]. For reference, human domain experts are estimated to achieve a score of around 89.8% on this benchmark [53]. The performance of the GPT-4o series on MMLU demonstrates a significant leap in general cognitive ability. The cost-efficient GPT-4o mini model achieves a score of 82% on MMLU, a dramatic improvement over the 69.8–70% scored by GPT-3.5 Turbo [42]. This substantial gain on such a broad and difficult benchmark, even for a smaller “mini” model, underscores the major advancements in general knowledge and reasoning capabilities embodied in the GPT-4o generation. The development of GPT-4o signals a strategic evolution in the AI landscape. The primary focus is no longer solely on scaling raw intelligence to achieve the highest possible benchmark score. Instead, there is a clear pivot toward optimizing the user experience and improving economic viability. The “omni” architecture is less about reaching a new peak of intelligence and more about creating a seamless, low-latency, multimodal interactive platform that feels natural and is accessible to a mass audience. This suggests a maturation of the technology from a research tool into a product platform. The engineering trade-offs have shifted from a simple balance of cost versus capability to a more complex, three-way negotiation between capability, cost, and latency. GPT-4o is the first model in this lineage where the headline innovations are dramatic improvements in the latter two areas, signaling a strategic ambition to make AI a ubiquitous, real-time utility rather than a powerful but slow and expensive back-end service.
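To make the zero-shot evaluation protocol described above concrete, the sketch below shows how an MMLU-style multiple-choice item can be scored with no worked examples in the prompt. The sample question and the ask_llm function are illustrative assumptions, not items from the actual benchmark.

```python
# Zero-shot MMLU-style scoring: the model sees only the question and choices,
# with no in-prompt examples. `ask_llm(prompt) -> str` is a hypothetical API stand-in.

QUESTION = {
    "stem": "Which data structure gives O(1) average-case lookup by key?",
    "choices": {"A": "Linked list", "B": "Hash table", "C": "Binary heap", "D": "Stack"},
    "answer": "B",
}

def score_zero_shot(ask_llm, items) -> float:
    correct = 0
    for q in items:
        options = "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
        reply = ask_llm(f"{q['stem']}\n{options}\nAnswer with a single letter.")
        correct += reply.strip().upper().startswith(q["answer"])
    return correct / len(items)   # accuracy, as reported for MMLU
```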

2.3.4. Analysis of the o1-Series: Specialization in Advanced STEM Reasoning

The o1-series represents a new and distinct branch in the evolution of OpenAI’s models, marking a deliberate paradigm shift away from the general-purpose “omni” philosophy of GPT-4o. Instead of aiming for broad, fast, and efficient performance, the o1 models are specialized reasoning engines designed to sacrifice speed for an unprecedented level of accuracy and analytical rigor. This lineage is engineered to tackle the most complex, multi-step problems in domains like advanced mathematics, science, and engineering.
A New Paradigm: “System 2” Thinking and Chain-of-Thought
The conceptual foundation of the o1-series can be understood through the psychological framework of “System 1” and “System 2” thinking. In the context of AI, Ref. [54] describes System 1 as the fast, intuitive, pattern-matching mode of operation characteristic of most LLMs, including GPT-4o. This system excels at rapid predictions and generating plausible-sounding text. Ref. [54] describes System 2, in contrast, as a slower, more deliberate, logical, and step-by-step reasoning process. While foundational LLMs are masters of System 1, they have historically lacked the depth required for true System 2 analysis [43]. The o1-series is OpenAI’s first family of models explicitly designed to emulate System 2 thinking [43]. It achieves this by allocating significantly more time and computational resources at the point of inference to “think” before providing an answer [45]. This “thinking” process manifests as the generation of a long, internal “chain-of-thought.” The model systematically breaks down a complex problem into a sequence of logical steps, explicitly laying out its reasoning process [45]. This allows it to identify and correct potential errors along the way, much like a human expert working through a difficult proof, thereby increasing the probability of arriving at a correct and verifiable solution [45]. This approach represents a strategic shift in resource allocation, moving computational emphasis from the pre-training phase to the training and inference phases to achieve gains in complex reasoning [45].
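While o1’s deliberation is internal rather than user-authored, the underlying idea can be approximated at the prompt level with explicit chain-of-thought instructions. The snippet below is a generic illustration of that prompt-level technique, not o1’s mechanism; ask_llm is again a hypothetical API stand-in and the sample problem is invented.

```python
# Explicit chain-of-thought prompting: a prompt-level approximation of the
# deliberate "System 2" behavior that o1 performs internally at inference time.
# `ask_llm(prompt) -> str` is a hypothetical stand-in for a chat-completion API.

PROBLEM = "A tank fills at 3 L/min and drains at 1 L/min. How long to reach 30 L?"

direct_prompt = f"{PROBLEM}\nAnswer with a number."

cot_prompt = (
    f"{PROBLEM}\n"
    "Reason step by step: state the net fill rate, then divide the target "
    "volume by it, checking each step before giving the final answer."
)

def compare(ask_llm) -> dict:
    # The CoT variant trades extra output tokens (slower, costlier)
    # for a higher chance of a correct, verifiable answer.
    return {"direct": ask_llm(direct_prompt), "chain_of_thought": ask_llm(cot_prompt)}
```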
Dominance in Scientific and Mathematical Domains
The o1-series is positioned as a complement to, not a replacement for, the general-purpose GPT-4o [45]. Its intended application is to solve complex problems that demand high precision and analytical depth, making it ideal for expert-level use in fields like mathematics, science, engineering, finance, and law [45].
To serve a range of needs within this specialized space, the series includes several variants:
  • o1-preview: The initial release, designed to tackle complex problems that require a combination of reasoning and broad general knowledge [45].
  • o1-mini: A faster and more cost-effective version that is highly optimized for STEM-related tasks like math and coding. It has less broad world knowledge than the preview model but excels at pure reasoning [45].
  • o1 (full): The most capable version, integrating the highest levels of reasoning and multimodality [45].
The performance of these models in scientific domains is remarkable. The o1-preview model was shown to perform at a PhD level on benchmark tests in physics, chemistry, and biology [45]. On the GPQA-diamond benchmark, a set of graduate-level science questions designed to be difficult even for human experts, the full o1 model became the first AI system to surpass the performance of recruited PhD-level human experts [45].
The AIME and Code Generation Tasks
The exceptional reasoning capabilities of the o1-series are most clearly demonstrated on benchmarks that test the absolute limits of mathematical and logical problem-solving. A prime example is the American Invitational Mathematics Examination (AIME). The AIME is not a standard academic test; it is a highly prestigious and notoriously difficult competition for the top 5% of high school mathematics students in the United States, serving as a critical qualifier for the USA Mathematical Olympiad [55]. The exam consists of 15 extremely challenging problems in algebra, combinatorics, geometry, and number theory, where the median score for these elite human participants is typically only four to six correct answers out of fifteen [45]. Because of its difficulty, AIME is considered an “unsaturated” benchmark, providing a clear and challenging runway to measure advances in AI reasoning [45]. On this benchmark, the performance of the o1-series represents a monumental leap. While the highly capable GPT-4o model could only solve 12–13% of the AIME problems, the full o1 model solved 74% on its first attempt and up to 93% when using more advanced sampling and re-ranking techniques—a score that would place it among the top 500 students in the nation [45]. The cost-efficient o1-mini also demonstrated elite performance, correctly solving 70% of the AIME problems, making it highly competitive with the full o1 model and vastly superior to GPT-4o [45]. The even newer o3-mini pushed this boundary further, achieving an 86.5% accuracy [45]. This aptitude for logical reasoning extends to code generation. On the competitive programming platform Codeforces, o1-mini achieves an Elo rating of 1650, placing it in the 86th percentile of human competitors [45], and it has been applied to generating complex finite element analysis code for geotechnical engineering simulations—a task that requires both coding proficiency and a deep understanding of the underlying engineering principles [46]. The emergence of the o1-series marks the beginning of a strategic fragmentation of the LLM landscape. The market is evolving beyond a single, one-size-fits-all model toward a portfolio of specialized, high-performance “reasoning engines.” This signals a shift in the perceived economic value of AI, moving from broad conversational ability toward the provision of verifiable, expert-level solutions in high-stakes professional domains. This specialization comes with clear trade-offs in speed and cost, but these are acceptable in fields where the cost of an error is orders of magnitude greater than the cost of computation. We are witnessing the birth of domain-optimized foundation models. The future of enterprise AI will likely involve deploying a suite of models: a fast, cheap model like GPT-4o mini for routine tasks; a powerful generalist like GPT-4o for creative and conversational uses; and expensive, slow, but highly accurate reasoning engines like the o1-series for mission-critical analysis. This complicates the AI ecosystem but enables a far more optimized allocation of computational resources based on task value and risk.

2.3.5. Synthesis and Future Outlook: Trajectories in LLM Development

The evolution of the ChatGPT lineage from the general-purpose GPT-3.5 to the specialized o1-series reveals several clear and significant trajectories in the development of large language models. This progression is not merely a linear increase in power but a story of diversification, specialization, and a maturing understanding of the trade-offs between capability, efficiency, and applicability. Synthesizing the analyses of each model generation provides a high-level perspective on the state of the field and its likely future direction.
Comparative Synthesis Across Model Generations
The journey from GPT-3.5 to the o1-series can be framed by several key dichotomies that define the strategic choices in LLM development. First is the tension between generalization and specialization. The trajectory began with the broad generalism of GPT-3.5. GPT-4 significantly enhanced general reasoning capabilities. GPT-4o maintained this high level of general intelligence while optimizing for interaction and efficiency. The o1-series marks a decisive turn toward deep specialization in reasoning, creating a distinct class of models. This demonstrates a maturing market that now demands both versatile, general-purpose tools and high-precision, specialist instruments tailored for expert-level tasks. Second is the constant negotiation of the trilemma between speed, cost, and capability. GPT-3.5 was relatively fast and affordable but less capable. GPT-4 dramatically increased capability but at a higher cost and latency. The GPT-4o family, particularly the mini variant, represented a major leap in improving the cost-and-speed-to-capability ratio, making advanced AI far more accessible. The o1-series then consciously inverts this priority, sacrificing speed and cost to achieve an unprecedented level of reasoning capability, proving that for certain high-value, high-risk applications, accuracy and reliability are paramount. Finally, the narrative told by benchmark performance has become more nuanced. A model’s performance profile across a suite of different benchmarks is a strong indicator of its intended purpose. Strong scores on foundational benchmarks like GLUE suggest solid, general NLU competence. High scores on broad, multitask benchmarks like MMLU signal extensive world knowledge and versatile problem-solving ability. Dominance on esoteric, high-difficulty benchmarks like AIME or competitive coding platforms indicates elite, specialized reasoning power. Table 4 crystallizes this difference, visually contrasting the performance profiles of a top-tier generalist and a top-tier specialist and highlighting that model strength is task-dependent.
This comparative data makes the strategic divergence clear. While o1-mini is a stronger generalist than GPT-4o mini on MMLU, the difference is incremental. On AIME, the difference is transformational. This powerfully illustrates that model selection is no longer a matter of choosing the “best” model overall, but of selecting the right model for the task at hand.
Key Trends and Implications for Application
Looking across the entire model lineage, several overarching trends emerge that will shape the future of applied AI. The first is the inexorable rise of multimodal and agentic AI. The shift from text-only models to natively multimodal systems like GPT-4o, combined with the development of powerful reasoning engines like the o1-series, is laying the groundwork for more sophisticated AI agents. These agents will be able to perceive the world through multiple senses (text, image, audio), reason about it in complex, multi-step ways, and act upon their conclusions, enabling more autonomous and capable systems [14]. The second trend is the solidification of the hybrid model of knowledge. The increasing importance and effectiveness of Retrieval-Augmented Generation (RAG) indicate that the future is not a single, omniscient model. Rather, it is a hybrid system where a powerful core reasoning engine is augmented in real time with verifiable, up-to-date, external knowledge. This has profound implications for deployment, placing a much greater emphasis on the quality of an organization’s data and the architecture of its information retrieval systems [19]. The quality of the AI’s output becomes inextricably linked to the quality of the knowledge it is fed. For users and organizations, the primary implication of these trends is the need to adopt a portfolio approach to model deployment. The decision to use a cheap and fast model like GPT-4o mini for a customer service chatbot, a powerful generalist like GPT-4o for creative content generation, or an expensive but highly accurate reasoning engine like o1 for financial risk analysis should be a deliberate, strategic choice. This requires a deeper understanding of both the task requirements and the specific strengths and weaknesses of each available model. Looking forward, the trajectory suggests a future of even greater specialization, with models being fine-tuned and optimized for specific industries and professional functions. The core research challenges will likely center on improving the reliability and verifiability of complex reasoning (e.g., mitigating issues like “fake alignment” in the o1 models [45]), making the use of massive context windows more effective and less prone to errors, and developing more robust and transparent methods for AI safety and bias mitigation. Finally, the tension between the development of powerful, closed, proprietary models by companies like OpenAI and the concurrent push from the wider community for more open, transparent, and auditable systems will continue to be a defining dynamic of the field.
Comparison to Human NLU: Studies demonstrate that the human brain’s language processing capabilities, particularly for complex syntactic structures and predictive reasoning, still surpass those of current LLMs, even when comparing against non-native human speakers [27]. ChatGPT, despite its sophistication, cannot be considered a complete theory of human language acquisition or processing due to fundamental differences in learning mechanisms and underlying “competence” [27,56].

2.4. Advancing Method and Theory in NLU Through ChatGPT

The evolution of ChatGPT catalyzes new methodological and theoretical directions in NLU:
  • Methodologically, a key frontier is the development of agentic AI workflows. These systems represent a profound shift, leveraging core NLU to interact with tools and environments to solve complex, multi-step problems [16,17].
  • Techniques like RAG are foundational, providing agents with grounded, verifiable knowledge [18,57,58,59].
  • Probing methods like “evil twin” prompts [21,60,61,62,63,64,65] and the push toward multimodality [14,15] are creating more robust and versatile models to power these agents.
A core tension in this advancement is between specialization and generalization. While foundation models like ChatGPT exhibit broad capabilities [1,31], they are often outperformed on specific tasks by fine-tuned models [1,66,67,68]. This suggests future progress lies in a sophisticated interplay between generalist models and specialized techniques. This is not merely about scaling models but designing smarter, more adaptable architectures, perfectly exemplified by agentic frameworks. In these systems, a generalist LLM acts as a central “reasoning engine” that intelligently selects and deploys specialized tools or knowledge sources (like RAG) as needed [4]. Theoretically, this points to a need for new frameworks that model the trade-offs between generalization and specialization, drawing inspiration from cognitive science theories on how humans balance broad knowledge with deep, domain-specific expertise to achieve goals [56]. For instance, insights from dual-process theories of cognition, which distinguish between intuitive, fast processing and deliberate, analytical reasoning, could inform AI architectures designed to dynamically allocate computational resources between broad, generalist understanding and precise, specialized inference.
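A skeletal version of such an agentic loop, in which a generalist model routes each step to a specialized tool, might look like the following. The tool registry, the fixed text protocol, and the ask_llm routing call are illustrative assumptions for this review rather than any particular framework’s API.

```python
# Skeletal agent loop: a generalist "reasoning engine" routes each step to a
# specialized tool (calculator, retrieval, etc.). Illustrative only;
# `ask_llm(prompt) -> str` is a hypothetical LLM API stand-in.

def calculator(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}))  # demo only; never eval untrusted input

def retrieve(query: str) -> str:
    return "stub: top document chunk for " + query      # stands in for a RAG lookup

TOOLS = {"calculator": calculator, "retrieve": retrieve}

def agent(goal: str, ask_llm, max_steps: int = 5) -> str:
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # The generalist model decides the next action in a fixed text format.
        decision = ask_llm(
            transcript + "\nReply 'TOOL <name> <input>' to act, or 'FINAL <answer>' to stop."
        )
        if decision.startswith("FINAL"):
            return decision.removeprefix("FINAL").strip()
        _, name, arg = decision.split(" ", 2)            # parse "TOOL <name> <input>"
        observation = TOOLS[name](arg)                   # deploy the specialized tool
        transcript += f"\nAction: {name}({arg})\nObservation: {observation}"
    return "Stopped: step budget exhausted."
```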

3. The New Epoch of Content Generation: Diverse Applications, Quality Assurance, and Ethical Imperatives

ChatGPT has inaugurated a new epoch in content generation, demonstrating remarkable versatility across an array of domains. This capability, however, is rapidly evolving from simple content creation to autonomous task execution, which introduces more complex challenges concerning quality assurance and ethical responsibility.

3.1. ChatGPT’s Role in Diverse Content Generation and Task Automation

The applications of ChatGPT and its underlying models are extensive and expanding from content creation to proactive task completion:
  • Technical and Scientific Content: In engineering, ChatGPT assists in drafting reports, generating software documentation, and producing code snippets [69,70]; multivocal literature reviews indicate that error rates for engineering tasks average around 20–30% for GPT-4 [37]. In medicine, it is used for generating patient reports and drafting discharge summaries [57], though reported error rates range widely, from 8% to 83% [37]. By contrast, while the model can generate novel ideas and structures for creative work, its outputs can lack the unique voice or subtle emotional depth characteristic of human authorship, necessitating significant human refinement [37].
  • Marketing and SEO Content: Marketers leverage ChatGPT for creating blog posts, ad copy, social media updates, and personalized email campaigns. It also aids in SEO by generating topic ideas and crafting meta descriptions [71].
  • Legal Content: Law firms utilize ChatGPT for drafting client correspondence, creating legal blog content, and developing marketing materials to increase efficiency [71].
  • Creative Writing: ChatGPT has shown aptitude in generating creative content such as stories, poetry, and scripts, acting as a catalyst for imaginative endeavors [14,57,72,73,74].
  • Academic Content: In academic settings, ChatGPT assists with literature reviews, drafting sections of papers, generating study materials, and creating quizzes [23,75,76,77,78,79,80].
  • Automated Task Execution with AI Agents: The next frontier lies in agentic AI, where LLMs are empowered to act as autonomous agents. These agents move beyond generating content to performing complex, multi-step tasks. For example, an agent might not just write code but also autonomously execute test suites, debug failures, and integrate the code into a larger system; or it might not just draft a marketing email but also execute the entire campaign by analyzing performance data, adjusting targeting, and optimizing content strategy in real time [4,16]. This represents a profound shift from content creator to fully autonomous task automator.

3.2. Methodologies for Quality Control, Coherence, and Accuracy

Ensuring the quality and reliability of AI outputs, whether static content or agentic actions, is paramount.
  • Human Oversight and Human-in-the-Loop: This remains the most critical control measure. Expert review is essential for content where errors have severe consequences [2,32,36]. For agentic AI, this evolves into a “human-in-the-loop” model, where humans supervise, intervene, and approve agent actions before execution to prevent errors and ensure safety [81]. However, scaling human oversight presents its own challenges, including cognitive load on reviewers and the difficulty of verifying complex, multi-step agentic reasoning processes, necessitating the development of intelligent human–AI teaming interfaces and real-time anomaly detection systems [82].
  • Prompt Engineering: The quality of output is highly dependent on the input prompt. Effective prompt engineering is a key skill for guiding both content generation and agent behavior [9,77,83,84,85,86,87].
  • Iterative Refinement: Using feedback loops to progressively refine outputs is a common practice to improve quality for both text and agent action sequences [21,28,88,89,90,91].
  • Fact-Checking and Source Verification: Due to the risk of hallucinations, rigorous fact-checking is essential [1,13,66,67,68]. For agents, this includes grounding their knowledge in real-time, verifiable data sources before they act.
  • Process and Tool-Use Validation: For AI agents, quality control must extend beyond the final output to validate the entire process. This includes verifying that the agent’s reasoning is sound and that it uses its tools (e.g., web browsers, APIs) correctly and safely [17].
  • Specialized Evaluation Metrics and Tools: Domain-specific metrics like BLURB [1,51,92] and tools like SelfCheckGPT [4] are crucial for objective assessment; a sketch of the sampling-based consistency idea behind such tools follows this list.
  • Error Rate Analysis: Systematic analysis of error rates provides insights into reliability and highlights areas needing improvement [37].
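The sketch below illustrates the sampling-based consistency idea behind tools such as SelfCheckGPT, under simplifying assumptions: sample_llm() is a hypothetical stand-in for drawing stochastic completions, and the support score is naive content-word coverage, whereas the published method uses stronger consistency measures (e.g., NLI-based scoring).

```python
# Toy SelfCheckGPT-style check: sample several answers to the same prompt and
# measure how well they support a claim; disagreement suggests hallucination.
import string

def sample_llm(prompt: str, n: int = 3) -> list[str]:
    # Stub: a real implementation would draw n stochastic completions from
    # the same model that produced the claim being checked.
    return ["Paris is the capital of France.",
            "The capital of France is Paris.",
            "Lyon is the capital of France."]

def content_words(text: str) -> set[str]:
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return {w for w in cleaned.split() if len(w) > 3}

def support(claim: str, samples: list[str]) -> float:
    """Fraction of samples whose content words cover the claim's."""
    c = content_words(claim)
    return sum(c <= content_words(s) for s in samples) / len(samples)

claim = "Paris is the capital of France."
score = support(claim, sample_llm("What is the capital of France?"))
# Disagreement among samples (one says Lyon) lowers support -> flagged.
print(f"support={score:.2f} -> " +
      ("flag for review" if score < 0.9 else "accept"))
```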

3.3. Ethical Considerations in Content and Task Automation

The power of generative AI brings forth a spectrum of ethical challenges that are amplified by the introduction of autonomy.
  • Bias: A critical challenge is inherent bias. LLMs are trained on vast corpora of text and code from the internet, data that inevitably reflects the full spectrum of human societal biases [30]. Consequently, models can learn, reproduce, and even amplify harmful stereotypes related to race, gender, religion, age, and other social categories [30]. This can manifest as discriminatory outputs in sensitive applications like job recruitment, loan approval decisions, and medical diagnostics [30]. Research has shown that even models that have undergone “alignment”—a fine-tuning process to make them appear unbiased on explicit tests—can still harbor and act upon implicit biases in more subtle ways [93]. Studies also indicate that while some sources of bias can be identified and “pruned” from a model’s neural network, the context-dependent nature of bias makes a universal fix nearly impossible. This suggests that accountability for biased outputs may need to shift from the model developers to the organizations that deploy these models in specific, real-world contexts [30]. These risks extend beyond static text: agentic actions, too, can reflect and amplify societal biases inherited from training data [14,30,57].
  • Trustworthiness and Reliability: The probabilistic nature of LLMs means their outputs are not always factually correct or reliable, posing risks if unverified information is disseminated [57,81,89,91,94,95,96].
  • Security and Misuse: The potential for misuse is significant. Agentic AI dramatically lowers the barrier for malicious activities by enabling the automation of tasks like orchestrating large-scale phishing campaigns or propagating disinformation [97,98,99].
  • Accountability and Autonomous Action: Agents capable of autonomous action raise profound ethical questions about accountability. Determining responsibility when an autonomous agent causes financial, social, or physical harm is a complex challenge for which legal and ethical frameworks are still nascent [35]. Addressing this requires interdisciplinary efforts to develop clear accountability models—perhaps through “responsible-by-design” principles or legal concepts adapted from product liability and human–machine interaction—that allocate responsibility in human–agent systems.
  • Social Norms and Cultural Sensitivity: Generated content and actions must align with diverse cultural and societal expectations to avoid offense or misinterpretation [97,98,99].
  • Ethical Data Sourcing and Privacy: Concerns persist regarding the methods used for collecting training data and the privacy of user inputs fed into ChatGPT [13,70,100,101].
  • Copyright and Authorship: The generation of content raises complex questions about intellectual property rights, originality, authorship attribution, and plagiarism, especially when outputs closely resemble training data or are presented as original work [98,102,103,104,105,106]. Legal frameworks are still evolving to address these issues [2].

3.4. Advancing Method and Theory for Responsible Content Generation and Task Automation

To navigate this new epoch responsibly, advancements in both methodology and theory are needed. A “Human–AI Symbiotic” framework is proposed, where AI systems like ChatGPT handle initial drafting, information synthesis, and repetitive aspects of content creation, while human experts focus on critical review, strategic guidance, ethical filtering, creative refinement, and ensuring contextual appropriateness [107,108]. This moves beyond simple post-editing to a more integrated collaborative process, one that is essential for both content generation and the governance of AI agents. Such a framework requires a paradigm shift, especially within academia, away from viewing generative AI as a threat that erodes essential skills; instead, it should be seen as an opportunity to foster critical thinking and analytical capabilities [107]. When properly integrated, this human–AI collaboration can become a powerful tool to prepare students for an increasingly AI-augmented workforce, where the ability to partner effectively with intelligent systems is a crucial skill [108]. Furthermore, the development of dynamic, context-aware quality control mechanisms that adapt to the content type, intended audience, and potential risks is crucial, moving beyond static checklists or generic evaluation metrics. This could involve AI systems dynamically adjusting their generation parameters based on content sensitivity and audience, or automatically flagging outputs that require higher levels of human scrutiny based on pre-defined ethical risk profiles or factual accuracy benchmarks.
The landscape of ChatGPT-driven content generation is characterized by a fundamental “Quality–Scalability–Ethics Trilemma.” There exists an inherent tension in simultaneously achieving high-quality, nuanced content, generating this content efficiently at scale, and rigorously adhering to ethical standards such as bias mitigation, truthfulness, and intellectual property respect [5,36,103,104,105,106]. This inherent tension is acutely magnified with agentic AI. Granting AI autonomy dramatically increases its scalability and potential impact, making the trade-offs with quality and ethics far more critical. Prioritizing scalability, such as rapid, fully automated content creation for mass dissemination, can lead to a degradation in content quality—resulting in generic, superficial, or inaccurate outputs—and can significantly amplify ethical risks, for example, through the widespread propagation of biased information or misinformation. Similarly, strict adherence to comprehensive ethical guidelines, including rigorous bias detection and mitigation protocols or meticulous checks for originality and factual accuracy, can inherently slow down the generation process and limit the scope of what can be feasibly automated. An autonomous system that scales rapidly without robust quality control and ethical safeguards poses unacceptable risks. Efforts to enhance quality and ethics through human review or rigorous validation often reduce the speed and scalability that make agents so powerful [37,40,81,89,94,95,96].
Navigating this trilemma requires developing techniques that co-optimize these dimensions or make their trade-offs transparent. This might involve multi-stage processes where different levels of quality control and ethical scrutiny are applied based on the risk profile of the content or action, or the development of AI-assisted tools specifically designed for ethical review. Theoretically, this calls for new models of “responsible generative efficiency” and “trustworthy autonomy” that can quantify, predict, and guide the balance between these competing factors, moving the field toward a more mature and accountable approach to AI content creation and task automation. Practically, this implies that organizations must consciously define their acceptable thresholds across these three dimensions, prioritizing certain aspects based on the sensitivity and impact of the content or task.
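A minimal sketch of such a risk-tiered, multi-stage process appears below. The tiers, checks, and stubbed review functions are assumptions for illustration; the point is only that scrutiny, including human-in-the-loop gates, escalates with the declared risk profile.

```python
# Sketch of risk-tiered quality control: the level of scrutiny applied to an
# output (or proposed agent action) scales with a pre-defined risk profile.
from enum import Enum

class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

def automated_checks(text: str) -> bool:
    return len(text.strip()) > 0        # stub: run fact/bias classifiers here

def human_review(text: str) -> bool:
    print(f"[queued for human review] {text!r}")
    return True                         # stub: block until reviewer approves

def release(text: str, risk: Risk) -> bool:
    """Apply progressively stricter gates as risk increases."""
    if not automated_checks(text):
        return False
    if risk is Risk.LOW:
        return True                     # automated checks suffice
    if risk is Risk.MEDIUM:
        return human_review(text)       # single human-in-the-loop gate
    return human_review(text) and human_review(text)  # HIGH: dual sign-off

print(release("Quarterly summary draft", Risk.MEDIUM))
```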

4. ChatGPT as a Catalyst for Knowledge Discovery: Methodologies, Scientific Inquiry, and Future Paradigms

ChatGPT and similar LLMs are increasingly being explored not just as information retrieval tools but as active catalysts in the process of knowledge discovery. This is evolving from simple assistance to the deployment of autonomous systems that can manage complex scientific workflows, offering new paradigms for generating novel insights. However, this transformative potential is accompanied by significant limitations, including concerns over reliability, security vulnerabilities, and the fundamental challenge of validating insights from opaque systems—a “Black Box Conundrum”—all of which are critically examined in the following subsections.

4.1. Methodologies for Knowledge Extraction from Unstructured Data

LLMs excel at processing and interpreting unstructured text, which constitutes a massive portion of the world’s data. This capability is being harnessed through increasingly sophisticated methodologies:
  • Information Extraction from Diverse Sources: ChatGPT can parse complex documents, such as historical seedlists or health technology assessment (HTA) documents in different languages, to extract specific data points where rule-based methods falter [38,109,110,111]; a schema-guided extraction sketch follows this list.
  • Qualitative Data Analysis Assistance: Researchers are exploring ChatGPT for assisting in qualitative analysis, such as generating initial codes or identifying potential themes [21,28,74,90,91]. However, careful prompting and validation are required, as LLMs can generate nonsensical data if not properly guided [21,28,74,90,91]. This guidance often requires expert domain knowledge to formulate precise prompts, iterative refinement of generated codes or themes, and a deep understanding of the LLM’s limitations to prevent the introduction of artificial patterns or spurious insights into the analytical process.
  • LLMs Combined with Knowledge Graphs (KGs): A promising methodology involves integrating LLMs with KGs. The GoAI method, for instance, uses an LLM to build and explore a KG of scientific literature to generate novel research ideas, providing a more structured approach than relying on the LLM alone [87,112].
  • Autonomous Knowledge Discovery with AI Agents: The next methodological leap involves deploying agentic AI to create automated knowledge discovery pipelines. These agents can be tasked with a high-level goal and then autonomously plan and execute a sequence of actions—such as searching databases, retrieving papers, extracting data, and synthesizing findings—to deliver structured knowledge with minimal human intervention [113].
  • Prompt Injection Vulnerabilities: Research into prompt injection techniques highlights how the knowledge extraction process can be manipulated, underscoring security vulnerabilities that must be addressed for reliable knowledge discovery, especially in autonomous systems [114].
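The sketch below illustrates one common schema-guided pattern consistent with these methodologies: the model is prompted for strict JSON against a declared schema, and its output is validated programmatically rather than trusted. The schema fields, prompt wording, and simulated model output are assumptions for illustration.

```python
# Sketch of schema-guided information extraction from unstructured text.
# The prompt requests strict JSON so the output can be machine-validated;
# the LLM call itself is stubbed out (the printed string simulates it).
import json

SCHEMA = {"species": "string", "collection_year": "int", "origin": "string"}

def build_extraction_prompt(document: str) -> str:
    return (f"Extract the following fields as JSON matching this schema "
            f"{json.dumps(SCHEMA)}. Use null for missing fields.\n\n"
            f"Document:\n{document}")

def validate(raw: str) -> dict:
    """Reject malformed or off-schema output instead of trusting the model."""
    data = json.loads(raw)                      # raises on malformed output
    assert set(data) == set(SCHEMA), "schema mismatch"
    return data

# Simulated model output for a historical seedlist entry:
print(validate('{"species": "Quercus robur", '
               '"collection_year": 1902, "origin": "Kew"}'))
```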

4.2. Applications in Scientific Research

ChatGPT is finding applications across the scientific research lifecycle, with agentic systems poised to integrate these functions into automated workflows:
  • Hypothesis Generation: Models like GPT-4 can generate plausible and original scientific hypotheses, sometimes outperforming human graduate students in specific contexts [44,45,110]. For instance, models could suggest novel drug targets by identifying non-obvious correlations across vast biomedical literature or propose new material compositions based on synthesized property databases.
  • Literature Review Assistance: LLMs can accelerate literature reviews by summarizing articles and identifying relevant papers and themes [104,109,111,115,116,117,118,119,120].
  • Experimental Design Support: ChatGPT can assist in outlining experimental procedures but may require expert refinement to address oversimplifications or “loose ends” [26,37,40,45,121]. These “loose ends” often involve the critical details of experimental controls, statistical rigor, feasibility assessments for real-world implementation, and adherence to ethical guidelines, all of which require human scientific judgment beyond current LLM capabilities.
  • Data Analysis and Interpretation: LLMs can assist in analyzing large volumes of text data to identify patterns and emerging themes [14,48,54,57,87,122,123].
  • Simulating Abductive Reasoning: LLMs can simulate abductive reasoning to infer plausible explanations or methodologies, thereby aiding research discovery [15,110,124,125,126,127,128,129,130].
  • Automating Research with Scientific Agents: The culmination of these capabilities is the creation of scientific agents. These are autonomous systems designed to conduct research by integrating multiple steps. For instance, a scientific agent could be tasked with a high-level research question and then autonomously search literature, formulate a hypothesis, design and execute code for a simulated experiment, analyze the results, and draft a preliminary report, dramatically accelerating the pace of discovery [131]. The “Deep Research” capabilities of OpenAI’s ChatGPT and Google’s Gemini are early examples of this pattern.
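The stub below sketches the skeleton of such a plan–execute research loop under heavy simplification: every step function is a placeholder for real search APIs, code executors, and an LLM planner, and a human checkpoint is retained before results are accepted.

```python
# Illustrative loop for a "scientific agent": gather -> hypothesize ->
# simulate -> analyze, with a human checkpoint before acceptance.
def search_literature(question):
    return [f"paper about {question}"]          # stub for a search API

def form_hypothesis(papers):
    return f"H: effect suggested by {len(papers)} paper(s)"  # stub planner

def run_simulation(hypothesis):
    return {"hypothesis": hypothesis, "p_value": 0.03}       # stub executor

def human_checkpoint(result) -> bool:
    print(f"[awaiting supervisor approval] {result}")
    return True                                 # stub: block for sign-off

def scientific_agent(question: str):
    papers = search_literature(question)
    hypothesis = form_hypothesis(papers)
    result = run_simulation(hypothesis)
    if human_checkpoint(result):                # keep a human in the loop
        return result                           # a draft report would follow
    return None

print(scientific_agent("catalyst degradation"))
```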

4.3. Critical Assessment of ChatGPT’s Role in Advancing Research

The integration of generative AI into scientific inquiry presents a duality of prospects and challenges:
  • Acceleration and Efficiency: AI has the potential to dramatically accelerate research by automating time-consuming tasks, allowing researchers to focus on higher-level conceptual work [26,39,40,85,110,121,130,131,132,133,134,135].
  • Accuracy and Reliability Concerns: The propensity for hallucinations and bias is a major concern that necessitates rigorous validation of all AI-generated outputs [13,30,130]. This risk is magnified for autonomous agents, where acting on a single hallucinated fact could derail an entire research workflow. Mitigating this requires not only robust human-in-the-loop validation points but also the integration of autonomous self-correction loops and continuous cross-validation against established scientific databases or external tools within the agent’s workflow to ensure end-to-end integrity.
  • The Indispensable Role of Human Expertise: Human expertise remains crucial for critical evaluation, contextual understanding, and ensuring methodological soundness [26,39,40,85,110,121,130,131,132,133,134,135]. As research becomes more automated, the human role shifts from task execution to high-level strategic direction and critical supervision of the AI’s process and outputs.

4.4. Advancing Method and Theory in AI-Augmented Knowledge Discovery

The use of AI in knowledge discovery is pushing methodological and theoretical boundaries:
  • Frameworks like GoAI [112] exemplify a move toward structured methodologies that combine LLMs with KGs for more transparent idea generation.
  • The concept of LLMs “simulating abductive reasoning” [125] suggests a new theoretical lens for understanding how these models contribute to scientific insight, moving beyond pattern matching toward computational reasoning.
A significant hurdle in this new paradigm is the “Black Box Conundrum.” While tools like ChatGPT can generate novel hypotheses [133], their internal processes remain opaque [5]. This opacity is especially problematic for science, which demands transparency and reproducibility. The shift toward autonomous scientific agents makes this conundrum more acute. For an agent’s discoveries to be scientifically valid, its entire decision-making process must be transparent and verifiable. An inscrutable “reasoning” process can lead to a “crisis of explanation,” undermining the principles of systematic inquiry.

Consequently, advancing true knowledge discovery with AI necessitates significant progress in explainable AI (XAI) tailored for generative models [136]. Without such advancements, AI-assisted discovery might remain a useful heuristic but will lack the demonstrable rigor required for foundational breakthroughs. This challenge calls for new theories of “computational scientific reasoning” to bridge the gap between the statistical generation of current LLMs and the logical, evidence-based reasoning that is the hallmark of the scientific method. Such theories must move beyond mere hypothesis generation to explain the emergence of truly novel scientific paradigms, modeling processes such as divergent idea generation, conceptual blending, and the computational mechanisms underlying scientific insight, especially as these reasoning capabilities become embodied in autonomous agents. This is particularly urgent for complex tasks like causal inference or the derivation of novel proofs, where the interpretability of an agent’s inferential steps is as crucial as the accuracy of its conclusions for scientific acceptance and reproducibility.

5. Revolutionizing Education and Training: ChatGPT’s Global Impact on Pedagogy, Assessment, and Equity

The integration of ChatGPT into education has been met with a mixture of excitement and trepidation, signaling a potential revolution in pedagogy, assessment, and the pursuit of educational equity. Its capabilities offer novel ways to personalize learning and support diverse learner needs, but they also pose significant challenges to traditional paradigms, particularly concerning academic integrity and the development of critical thinking. This evolution is now accelerating toward the use of more autonomous agentic AI systems, magnifying both the opportunities and the risks.

5.1. Applications in Education

ChatGPT’s versatility has led to its application across numerous facets of the educational landscape, with agentic systems representing the next frontier.
  • Personalized Learning: A primary application is facilitating personalized learning experiences. ChatGPT can adapt content, offer real-time feedback, and function as a virtual tutor available 24/7 [137,138]. For instance, ChatGPT can dynamically adjust the complexity of explanations based on a student’s prior responses, offer alternative examples tailored to their learning style, or provide targeted practice problems identified from their specific areas of difficulty.
  • Curriculum and Lesson Planning: Educators use ChatGPT to assist in designing courses, developing lesson plans, and visualizing theoretical concepts in practical settings [135,138].
  • Innovative Student Assessment: ChatGPT is being explored for generating diverse assessment items and designing tasks that promote critical thinking [137]. GenAI can also personalize assessments and feedback based on learner responses [139].
  • Teaching Aids and Interactive Tools: The technology can be harnessed to develop engaging teaching aids, virtual instructors, and interactive simulations [137].
  • Support for Diverse Learners: ChatGPT enhances accessibility for students with disabilities and multilingual learners through translation and simplification [140].
  • Autonomous Learning Companions and Agents: The next evolutionary step is the deployment of AI agents as personalized learning companions. These agents go beyond tutoring by autonomously managing a student’s long-term learning journey. They can co-design study plans, curate resources from vast digital libraries, schedule tasks, and proactively adapt strategies based on performance, transforming the learning process into a continuous, interactive dialogue [141,142]. This transformation stems from their capacity for dynamic pedagogical intervention, offering personalized feedback loops, adapting content difficulty in real time based on student mastery, and even guiding students in meta-cognitive reflection to enhance self-regulated learning.
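The following sketch illustrates, in deliberately simplified form, the adaptive loop behind such companions: difficulty rises with demonstrated mastery and eases when a student struggles. The three-answer mastery rule and difficulty tiers are assumptions; real systems use far richer learner models.

```python
# Toy adaptive-tutoring loop: difficulty tracks a running record of answers.
class AdaptiveTutor:
    def __init__(self):
        self.level = 1                  # current difficulty tier (1 = easiest)
        self.history: list[bool] = []

    def record(self, correct: bool) -> None:
        self.history.append(correct)
        recent = self.history[-3:]
        if len(recent) == 3 and all(recent):
            self.level += 1                      # mastery: raise difficulty
            self.history.clear()
        elif len(recent) == 3 and not any(recent):
            self.level = max(1, self.level - 1)  # struggling: ease off
            self.history.clear()

    def next_item(self) -> str:
        return f"practice problem at difficulty {self.level}"

tutor = AdaptiveTutor()
for outcome in [True, True, True, False]:
    tutor.record(outcome)
print(tutor.next_item())  # difficulty rose to 2 after three correct answers
```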

5.2. Impact on Critical Thinking, Academic Integrity, and Ethics

The integration of ChatGPT into education brings heightened implications for cognitive skills and ethical conduct, which are amplified by agentic systems.
  • Critical Thinking: A dichotomy exists where AI can either be used to generate thought-provoking prompts that foster analysis or, through over-reliance, erode students’ ability to think deeply [5,143]. Concerns persist that students may become cognitively passive [144]. The introduction of AI agents deepens this concern, as they could automate not just the answers but the entire process of inquiry and discovery, potentially deskilling students in research and problem-solving [145]. To counter this, educators must integrate “AI literacy” and advanced “prompt engineering” into curricula, empowering students to critically engage with AI tools as intellectual collaborators, thereby fostering higher-order skills rather than merely outsourcing tasks.
  • Academic Integrity: The risk of plagiarism with AI-generated text is a primary concern [5,143]. With agents, this evolves from verifying authorship of text to verifying authorship of complex, multi-step actions and outcomes. Strategies to uphold integrity must, therefore, shift toward assessments that are inherently human-centric, such as project-based work, oral examinations, and evaluations of the process of inquiry and problem-solving, rather than solely the final product [143]. This necessitates designing assignments that are resistant to full automation, compelling students to demonstrate unique human cognitive contributions. This paradigm shift in assessment, toward the authorship of complex, multi-step actions and outcomes, also has the potential to foster critical thinking, metacognition, and self-regulated learning, building the AI literacy and human–AI collaboration skills essential for future workforce preparedness [137,146]. However, to evaluate these processes and outcomes without bias and to mitigate academic dishonesty in an AI-mediated learning environment, both generic and context-specific assessment frameworks should be developed for clarity, consistency, and reliability. Academic integrity can thus be upheld when processes are properly scrutinized and outcomes are rigorously evaluated within well-defined assessment frameworks.
  • Ethical Challenges: Broader ethical issues include data privacy, equity, and potential biases in AI content [5]. Agentic AI introduces new dilemmas regarding student autonomy and data sovereignty. An agent managing a student’s learning collects vast amounts of sensitive performance and behavioral data, raising critical questions about consent, surveillance, and how that data is used to shape a student’s educational future [147].

5.3. Global Perspectives and Educational Equity

The adoption and impact of ChatGPT in education vary significantly across global contexts, influenced by technological infrastructure, digital literacy, cultural norms, and institutional policies:
  • Diverse International Perceptions: Studies from regions like Pakistan and Indonesia reveal mixed student perceptions, balancing the benefits of ChatGPT as an AI assistant with concerns about its impact on deep thinking and integrity [144,148].
  • Democratization vs. Digital Divide: ChatGPT has the potential to democratize education by providing widespread access to high-quality learning resources [138]. However, it also risks exacerbating the digital divide if access to technology, internet, and AI literacy are inequitably distributed [140]. The advent of powerful, resource-intensive learning agents could create a new, more profound equity gap between students who have access to personalized autonomous tutors and those who do not [149]. Addressing this necessitates proactive policy development to ensure universal access to foundational AI tools, alongside culturally sensitive AI design and curriculum integration efforts that leverage AI to amplify diverse voices and knowledge systems.
  • Cultural Context and Bias: LLMs trained on predominantly Western datasets may perpetuate cultural biases [5]. While AI can be used to decolonize curricula, this requires careful human oversight to avoid reinforcing existing biases [140].

5.4. Advancing Educational Research, Theories, and Pedagogical Models

The advent of ChatGPT and other generative AI tools necessitates a re-evaluation of educational theories and practices.
  • Revisiting Learning Theories: ChatGPT’s capabilities challenge and offer new lenses through which to view learning theories such as constructivism (where students actively construct knowledge, potentially aided by AI tools) (Li et al., 2025) and self-determination theory (exploring AI’s impact on student autonomy, competence, and relatedness) [144].
  • Transforming Assessment Paradigms: Traditional assessment methods are being questioned. There is a call for innovative assessment strategies that emphasize higher-order thinking, creativity, and authentic application of knowledge, rather than tasks easily outsourced to AI [5]. This includes exploring personalized, adaptive assessments leveraging GenAI [144].
  • Methodological Rigor in AI-in-Education Research: There is a critical need for methodological rigor in studying AI’s impact on education. Researchers must carefully define experimental treatments, establish appropriate control groups, and use valid outcome measures that genuinely reflect learning, avoiding pitfalls of earlier “media/methods” debates where technology effects were often confounded with instructional design [150].
  • Developing New Pedagogical Models: The situation calls for the development of new pedagogical models that constructively integrate AI. This involves training educators and students in AI literacy, prompt engineering skills, and the critical evaluation of AI-generated outputs, and designing learning experiences that leverage AI as a tool for enhancing human intellect and creativity, rather than replacing it [151,152].
The widespread availability of ChatGPT presents a “Pedagogical Adaptation Imperative.” The initial defensive reactions focused on mitigating threats like plagiarism are insufficient [5,143]. This “shock” to the system compels a fundamental shift toward cultivating higher-order skills like critical thinking, creativity, and metacognition [3].
This necessitates a paradigm shift where AI is not just a tool to be policed but a cognitive partner to be leveraged. The imperative is to adapt curricula to foster human–AI collaboration. This vision must now expand to include agentic AI. The future of education will likely involve a co-evolution where educators become facilitators and strategic supervisors of AI-augmented learning experiences. Their role will shift to managing classrooms of human–agent teams, guiding students in the ethical and effective use of their personalized learning companions. Theoretically, this calls for new frameworks of “AI-Integrated Pedagogy” that conceptualize learning as a synergistic process involving human learners, human educators, and AI agents. This includes exploring frameworks like “Cyborg Pedagogy,” which envisions learning as a deeply integrated process where human cognitive functions are synergistically augmented by AI, blurring the lines between human and machine intellect in pursuit of enhanced learning outcomes. It requires adapting theories like constructivism to account for knowledge co-constructed with an autonomous agent and developing new models of human–AI co-regulation to ensure technology enhances, rather than undermines, human intellect and autonomy [141]. This co-regulation involves shared goal-setting, dynamic feedback loops where both human and AI adjust their strategies, and mutual error detection and correction, fostering a truly synergistic learning environment.

6. Engineering New Frontiers with ChatGPT: Advancing Design, Optimization, and Methodological Frameworks

ChatGPT and other LLMs are rapidly permeating various engineering disciplines, offering novel tools for design, analysis, and optimization. Their ability to process natural language, generate code, and synthesize information is creating new possibilities, with the evolution toward autonomous agentic AI systems poised to redefine traditional engineering methodologies.

6.1. Applications in Engineering Disciplines

The application of ChatGPT in engineering is diverse and expanding from task assistance to workflow automation.
  • Software Engineering: LLMs are used for code generation, debugging, automated code review, and documentation, with experts reporting significant time savings [69,70,145,153,154]. LLMs can also assist in translating natural language requirements into code [155].
  • Building Information Modeling (BIM), Architecture, and Civil Engineering: ChatGPT is explored for semantic search, information retrieval, and task planning [19]. RAG has proven effective in helping ChatGPT apply localized BIM guidelines [19].
  • Mechanical, Industrial, and General Engineering Design: LLMs assist in idea generation, conceptual design, and formulating engineering optimization problems [156,157].
  • Geotechnical Engineering: ChatGPT can generate finite element analysis (FEA) code for modeling complex processes, though its effectiveness varies based on the programming library used, underscoring its role as an assistant [46].
  • Control Systems Engineering: Studies show ChatGPT can pass undergraduate control systems courses but struggles with open-ended projects requiring deep synthesis and practical judgment [158].
  • Automated Design and Analysis with Engineering Agents: The next frontier is the deployment of engineering agents. These are autonomous systems that can manage complex, multi-step engineering workflows. For example, an agent could be tasked with a high-level goal, such as designing a mechanical part, and then autonomously generate design options, use software tools to run simulations (e.g., FEA, which predicts how structures or components respond to physical forces such as stress, heat, and vibration), interpret the results, and iterate on the design until specifications are met [120]. The criticality of human oversight and validation in these applications scales with the potential impact of errors; while minor inaccuracies in ideation might be tolerable, errors in safety-critical FEA code generation demand absolute precision and rigorous validation protocols.
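The stub below sketches such a design–simulate–iterate loop for a beam-sizing task. A textbook cantilever tip-deflection formula stands in for a full FEA run, and all dimensions, loads, and limits are illustrative assumptions; as stressed above, a human engineer must still validate any accepted design.

```python
# Design-simulate-iterate sketch: thicken a steel cantilever's rectangular
# section until a tip-deflection spec is met. The closed-form formula
# delta = F * L^3 / (3 * E * I) stands in for an FEA solver.
E = 200e9       # Young's modulus of steel, Pa (illustrative)
L_BEAM = 2.0    # cantilever length, m
LOAD = 5_000.0  # tip load, N
LIMIT = 0.005   # allowable tip deflection, m

def simulate_deflection(b: float, h: float) -> float:
    """Tip deflection for a rectangular cross-section b x h (meters)."""
    I = b * h**3 / 12.0                       # second moment of area
    return LOAD * L_BEAM**3 / (3.0 * E * I)

def design_agent(b: float = 0.05, h: float = 0.05) -> tuple[float, float]:
    """Autonomously deepen the section until the deflection spec is met."""
    while simulate_deflection(b, h) > LIMIT:
        h += 0.005                            # iterate on the design variable
    return b, h

b, h = design_agent()
print(f"accepted section: {b * 1000:.0f} x {h * 1000:.0f} mm, "
      f"deflection = {simulate_deflection(b, h) * 1000:.2f} mm")
```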

6.2. Theoretical Constructs and Novel Engineering Methodologies

The integration of ChatGPT is prompting new engineering methodologies and theoretical constructs.
  • Prompt Engineering for Optimization: Effective problem formulation using ChatGPT relies heavily on sophisticated prompt engineering and sequential learning approaches [157].
  • Human–LLM Design Practices: Comparative studies are yielding insights into LLM strengths (e.g., breadth of ideation) and weaknesses (e.g., design fixation), leading to recommendations for structured design processes with human oversight [159]. This design fixation stems from their training on vast datasets of existing designs, leading them to synthesize rather than truly innovate beyond established paradigms, thus highlighting the continued indispensability of human creativity for disruptive engineering solutions. This also means that while LLMs excel at optimizing within defined constraints, they currently struggle to generate entirely novel paradigms or break free from conventional design wisdom.
  • Cognitive Impact on Design Thinking: Research is exploring how AI influences designers’ cognitive processes, such as fostering thinking divergence and fluency [156].
  • LLMs in Systems Engineering (SE): While LLMs can generate SE artifacts, there are significant risks, including tendencies toward “premature requirements definition” and “unsubstantiated numerical estimates” [160]. These risks are magnified in autonomous agentic systems where flawed assumptions could propagate through an entire automated workflow.
  • Methodologies for Agentic Workflows: The rise of engineering agents necessitates new methodologies for managing human–agent and agent–agent collaboration. This includes designing frameworks for task decomposition, tool selection, and process validation to ensure the reliability and safety of autonomous engineering systems [7].

6.3. Impact on Engineer Productivity and Future Practice

The adoption of ChatGPT in engineering has clear implications for productivity and the nature of work.
  • Productivity Gains: Studies report significant productivity increases from using LLMs for tasks like code generation and drafting [155]. The shift toward agentic AI promises to extend these gains from task assistance to end-to-end workflow automation [154]. This automation particularly targets repetitive, data-intensive, or computationally heavy tasks such as iterative design optimization, extensive code refactoring, or multi-criteria material selection, thereby freeing engineers to focus on higher-level problem definition, innovative conceptualization, and critical validation.
  • Concerns and Challenges: Concerns exist about over-dependence on AI, which could lead to skill degradation, and anxieties about job security [155]. The need for human oversight remains critical due to potential inaccuracies and biases [37].
  • Preparing Future Engineers: Engineering curricula must adapt to prepare students for workplaces where GenAI tools are prevalent. This includes teaching AI literacy, prompt engineering, and the critical evaluation of AI outputs to ensure they can effectively supervise and collaborate with AI systems [3].

6.4. Advancing Engineering Methodologies and Theoretical Frameworks

The capabilities of ChatGPT can serve as a catalyst for advancing engineering methodologies and developing new theoretical frameworks:
  • Agent-Assisted Engineering Frameworks: There is an opportunity to develop structured frameworks that explicitly integrate AI agents at various stages of the engineering design process. These frameworks would define roles, responsibilities, and interaction protocols for human engineers and their agentic counterparts.
  • Theories of AI-Robustness in Design: The identification of LLM failure modes [160] can inform new theories around “AI-robustness” to predict and mitigate risks associated with using AI in critical applications. Such theories would need to encompass not only the prediction of explicit failure modes but also the development of systems resilient to adversarial inputs, capable of graceful degradation, and equipped with uncertainty quantification for decisions in safety-critical contexts.
The application of LLMs like ChatGPT in engineering is fostering a shift toward what can be conceptualized as “Human–AI Cognitive Symbiosis.” Current evidence indicates that while LLMs can assist with a range of tasks [44,159,163], they require significant human guidance and correction, especially in complex or safety-critical situations [46,158]. Human engineers possess superior capabilities in deep contextual understanding, critical judgment, and true innovation, areas where current LLMs are limited [159].
The most effective applications arise when human engineers strategically leverage these AI systems as powerful cognitive partners. This dynamic is evolving with the advent of agentic AI. It is less about AI replacing engineers and more about augmenting their capabilities by delegating complex workflows to autonomous agents under human supervision. Consequently, engineering education and practice must evolve to cultivate “AI-collaboration literacy”—the skills required to effectively prompt, guide, validate, and ethically integrate the work of AI agents. The engineer’s role will shift from a sole problem-solver to an orchestrator of human and AI agent collaborative systems, akin to a conductor guiding an advanced musical ensemble, ensuring that AI-driven efficiency is balanced with human judgment for safety and creativity. This necessitates new theoretical models of “Human–Agent Symbiosis” in engineering, distinct from mere tool-use. Such theories would aim to elucidate how the distinct strengths of humans and AI agents can be optimally combined, where humans provide the high-level strategic direction, ethical oversight, and creative leaps, while agents autonomously execute complex, multi-step analytical and design tasks under supervision. These theories should also define principles for shared cognition and distributed responsibility in human–agent teams, elevating AI from mere “tools” to active, agentic partners in the engineering endeavor. Table 5 summarizes key applications of ChatGPT in education and engineering, highlighting benefits, challenges, and novel implications.

7. Navigating the AI Revolution: Themes, Tensions, Critical Gaps, and Future Directions

The proliferation and evolving capabilities of ChatGPT have undeniably reshaped multiple domains, yet this progress is accompanied by critical research gaps and a pressing need for a forward-looking agenda. Synthesizing findings across natural language understanding (NLU), content generation, knowledge discovery, education, and engineering reveals common themes and distinct challenges that must be addressed to harness the full potential of generative AI ethically and effectively, especially as the technology advances from passive tools to autonomous agentic AI systems.

7.1. Common Themes Across Domains

A foundational theme is the transformative capability of ChatGPT. Hailed as a “disruptive” technology, its ability to generate human-like text and engage in complex tasks is altering established practices. Newer models like GPT-4o and the o1 series incorporate multimodality and advanced reasoning, pointing toward future models with enhanced “System 2 thinking” [161]. This refers to the development of capabilities for deliberate, sequential, and logical reasoning, as opposed to solely relying on fast, intuitive pattern matching, a crucial step toward more reliable and auditable AI behavior. This opens doors to unexplored applications with significant disruptive potential, including the deployment of autonomous agentic systems capable of complex, multi-step task execution across science, education, and engineering. However, these advancements are accompanied by significant limitations. These include persistent issues with factual inaccuracy (“hallucinations”), inherent biases, the lack of transparency (“Black Box Conundrum”), and difficulties with complex reasoning. This necessitates the essential role of human oversight. The concept of Human–AI “Cognitive Symbiosis” emerges as a crucial paradigm, which is now evolving from collaboration with passive tools to frameworks of human–agent teaming and orchestration where humans guide and supervise autonomous systems [81]. Furthermore, pervasive ethical considerations are interwoven throughout all domains, including bias, reliability, misuse, data privacy, and authorship. Finally, the disruptive nature of ChatGPT necessitates a fundamental imperative for adaptation. This involves significant changes in educational pedagogy (“Pedagogical Adaptation Imperative”), engineering design processes (“Human–Agent Cognitive Symbiosis”), and the development of new methodologies for quality control across all fields. Figure 1 provides a broad overview of the contribution of this critical review.
The conceptual frameworks presented in the figure are the primary analytical output of this study, derived from a multi-stage synthesis. For each domain, a comprehensive review of current literature was conducted to identify the most persistent, foundational challenges and paradigmatic shifts. These recurring dynamics were then abstracted and framed into the distinct conceptual lenses shown, such as naming the inherent conflict in content creation the “Quality–Scalability–Ethics Trilemma” or the evolving partnership in engineering the “Human–AI Cognitive Symbiosis.” This methodological approach was chosen to move beyond a simple descriptive summary, providing a structured and transferable way to analyze the core tensions and imperatives shaping each field. Beyond these specific domains, the pervasive nature of agentic AI also necessitates critical consideration of broader societal implications, including potential shifts in global labor markets, the imperative for robust regulatory frameworks, and the geopolitical dimensions of advanced AI development.

7.2. Synthesis of Themes and Identification of Critical Research Gaps

The interplay of these themes creates inherent tensions and highlights critical research gaps, which are magnified by the prospect of agentic AI.
  • Natural Language Understanding (NLU): The “Specialization vs. Generalization Tension” persists. A fundamental gap lies in discerning genuine semantic understanding versus sophisticated pattern matching [24,25,27,56]. Future research should employ rigorous experimental designs, including controlled studies on compositional generalization and systematicity, to precisely delineate the nature of LLM comprehension beyond statistical co-occurrence [156,157,158]. This gap becomes a critical safety concern for agentic systems that must act reliably based on their understanding of commands and environmental cues. The lack of explainability hinders trust and theoretical advancement, a problem that becomes acute when an agent’s reasoning cannot be audited [7,23,28,158].
  • Content Generation: The “Quality–Scalability–Ethics Trilemma” is a core challenge [5,102]. With the rise of agentic AI, this trilemma intensifies, as the potential for autonomous systems to act unethically at scale poses a far greater risk than generating harmful text alone. New technical solutions, together with legal and practical frameworks, are urgently needed to govern the processes and actions of these agents [162,163].
  • Knowledge Discovery: The “Black Box Conundrum” hinders the validation of AI-generated insights [121,132]. When a scientific agent autonomously conducts a research workflow, the need for a transparent and reproducible “chain of reasoning” becomes paramount for scientific integrity.
  • Education: The “Pedagogical Adaptation Imperative” demands a shift in focus to skills that complement AI. A critical gap is the lack of research on how to educate students to collaborate with and critically supervise learning agents without sacrificing their own cognitive autonomy [150,151,164]. Ensuring equitable access to powerful learning agents is crucial to prevent a widening of educational disparities [165].
  • Engineering: The “Human–LLM Cognitive Symbiosis” must evolve into robust human–agent teaming. A major gap exists in developing validation techniques for agents in safety-critical applications and creating theoretical frameworks for trust and responsibility in these collaborative systems [160,166].
Binding these is the overarching “Ethical–Technical Co-evolution Imperative.” Technical advancements are inextricably linked with escalating ethical challenges [30]. This is most evident with agentic AI, where the capacity for autonomous action demands that ethics, safety, and alignment are not afterthoughts but are embedded into the core design of the system [35]. These domain-specific challenges are not isolated; rather, they are deeply interconnected. For instance, advancements in Explainable AI (XAI) [33] to address the “Black Box Conundrum” in knowledge discovery will directly contribute to enhanced trustworthiness and accountability in content generation and engineering applications. Similarly, robust data de-biasing strategies developed for NLU are fundamental to ensuring equitable outcomes across all downstream applications, including educational agents and professional tools.

7.3. Proposal of a Forward-Looking Research Agenda

To address these gaps and proactively steer the AI revolution, a concerted research agenda is proposed, with an imperative focus on the unique challenges and transformative opportunities presented by agentic AI. Critical priorities include establishing foundational safety protocols and robust governance mechanisms before widespread deployment.

7.3.1. Methodological Advancements

  • NLU: Develop benchmarks that assess “deep understanding” and robust reasoning, critical for safe agentic behavior.
  • Content and Action Generation: Design adaptive quality and ethical control frameworks that are integrated directly into an agent’s decision-making loop.
  • Knowledge Discovery: Develop and validate rigorous protocols for human supervision of AI-assisted hypothesis generation and experimentation.
  • Education: Conduct longitudinal studies on the impact of learning agents on cognitive development. Design and test AI literacy curricula focused on human–agent collaboration.
  • Engineering: Formulate comprehensive testing and validation protocols for agents used in safety-critical design tasks and implement robust human-in-the-loop control frameworks.
  • Cross-Domain Methodologies for Agentic Systems: A crucial priority is to develop standardized safety protocols, robust and intuitive human-in-the-loop control mechanisms, and secure “sandboxing” environments for testing the behavior of autonomous agents before deployment in real-world settings.
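As a minimal illustration of the sandboxing idea in the final item above, the sketch below gates an agent’s proposed tool calls through an allowlist and logs every decision for audit. The tool names and policy are assumptions; production sandboxes would also isolate execution and constrain resources.

```python
# Minimal sketch of sandboxed agent tool use: every proposed action passes an
# allowlist check before execution, and all decisions are logged for audit.
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")

ALLOWED_TOOLS = {"read_file", "web_search"}   # no write/delete/execute tools

def guarded_call(tool: str, arg: str) -> str:
    if tool not in ALLOWED_TOOLS:
        logging.warning("blocked disallowed tool: %s(%r)", tool, arg)
        raise PermissionError(f"tool '{tool}' not permitted in sandbox")
    logging.info("executing %s(%r)", tool, arg)
    return f"<result of {tool} on {arg}>"      # stub execution

print(guarded_call("web_search", "RAG evaluation benchmarks"))
try:
    guarded_call("delete_file", "/etc/passwd")
except PermissionError as err:
    print("sandbox intervened:", err)
```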

7.3.2. Theoretical Advancements

  • NLU: Formulate theories of “Explainable Generative NLU” to make agent reasoning transparent.
  • Content and Action Generation: Develop “Ethical AI Agency Frameworks” that provide a theoretical basis for guiding the responsible actions of autonomous systems.
  • Knowledge Discovery: Propose “Computational Creativity Theories” to explain how AI agents contribute to novel discovery. These theories should model processes such as divergent idea generation, conceptual blending, and the computational mechanisms underlying scientific insight, moving beyond mere hypothesis generation to explain the emergence of truly novel scientific paradigms.
  • Education: Build “AI-Augmented Learning Theories” that model how students learn effectively in partnership with AI agents, exploring frameworks like “Cyborg Pedagogy.”
  • Engineering: Conceptualize “Human–Agent Symbiotic Engineering Theories” that define principles for shared cognition and distributed responsibility in human–agent teams.
  • Theories of Trustworthy Autonomy and Governance: An overarching theoretical challenge is to develop robust theories of human–agent teaming, create computational models for agent accountability, and design governance frameworks for multi-agent ecosystems where agents interact with each other and with society [4].
This research agenda must be guided by the “Ethical–Technical Co-evolution Imperative.” This implies embedding ethical design, fairness, transparency, and safety into the core R&D lifecycle of AI systems. Methodologically, this requires new ways to test for and mitigate ethical risks. Theoretically, it calls for “Co-evolutionary AI Development Frameworks” that model the interplay between technical progress and societal impact. This involves fostering “Anticipatory Governance” models for AI, where potential future impacts of widespread agent deployment are systematically explored and proactively addressed to guide innovation toward solutions that are not only powerful but also principled, equitable, and aligned with human values [167]. Table 6 outlines critical research gaps and proposes elements of a future agenda.

7.4. Practical Implications for Method, Theory, and Practice

The synthesis of these themes and the proposed research agenda have profound implications, particularly as the field moves from passive language generation toward autonomous agentic AI.
  • Method: The identified limitations necessitate new methodological approaches. This includes developing robust validation protocols for both generated content and agentic actions, advancing techniques like Retrieval-Augmented Generation (RAG) to ground agent knowledge [18], and establishing prompt engineering as a core skill for effective human–agent interaction. Crucially, new methods are needed for designing, testing, and ensuring the safety and reliability of complex, multi-step agentic workflows [7].
  • Theory: The challenges and emergent interactions demand new theoretical frameworks. These include theories for Explainable NLU, Responsible Generative Efficiency, and AI-assisted Abductive Reasoning. In education and engineering, this means developing AI-Augmented Learning Theories and Human–Agent Symbiotic Engineering Theories. These frameworks are the essential theoretical underpinnings for building trustworthy and beneficial AI agents. Overarching this is the need for Co-evolutionary AI Development Frameworks that model the interplay between technical and ethical progress, which is paramount for guiding agentic systems [167].
  • Practice: The practical implications are vast, requiring significant adaptation. This includes revising educational pedagogy to focus on skills like critical thinking and AI literacy, training professionals in human–agent teaming (HAT) [82], implementing rigorous quality assurance for AI outputs through the development and deployment of governance frameworks, and prioritizing ethical design and bias mitigation. Successfully enacting these changes will also demand robust institutional support, significant investment in AI infrastructure and continuous professional development, and overcoming inherent organizational inertia to embrace these transformative paradigms. The shift in practice is from using AI as a tool to leveraging it as a cognitive partner; this partnership is evolving into one where humans provide strategic oversight and ethical judgment for increasingly autonomous AI agents [81].
This integrated view underscores that the future of AI is not merely a technical challenge but a complex interplay of technological advancement, ethical considerations, and societal adaptation.

8. Limitation of This Critical Review Study

This study is subject to several limitations. Primarily, as a review, its conclusions are drawn from the synthesis of existing literature, research findings, and conference proceedings rather than from original empirical research. This reliance on secondary sources means the study’s scope and depth are constrained by the quality and recency of available publications. Furthermore, while the focus on ChatGPT and similar LLMs allows for an in-depth, domain-specific analysis of their immediate impact, it inherently means the conclusions may not fully represent the nuances or challenges associated with other classes of AI models. However, many of the identified conceptual frameworks and ethical considerations are, by design, transferable, offering foundational insights applicable across broader AI paradigms. A significant constraint is the dynamic nature of the field; the insights and identified gaps, while current at the time of writing, are susceptible to becoming quickly outdated as AI capabilities evolve at an unprecedented pace. Finally, the review acknowledges the inherent difficulty of objectively assessing “deep understanding,” as distinguishing between genuine semantic comprehension and sophisticated statistical pattern matching in LLMs remains a fundamental research challenge. Future empirical and longitudinal studies are thus essential to validate and update these findings, moving beyond static reviews to continuous, adaptive analyses of AI’s societal integration.

9. Conclusions

The journey through ChatGPT’s applications reveals a consistent theme: while these tools can automate, augment, and accelerate many tasks, they are not panaceas. In NLU, the tension between generalization and specialization persists. In content generation, the pursuit of quality, scalability, and ethics forms a complex trilemma. For knowledge discovery, the “black box” nature of LLM reasoning poses a conundrum for scientific rigor. In education, the “pedagogical adaptation imperative” calls for a fundamental rethinking of teaching and learning. Similarly, in engineering, the concept of “human–LLM cognitive symbiosis” suggests a future of human–AI collaboration. All these domain-specific challenges converge and intensify with the advent of agentic AI, which embodies the next frontier of capability, complexity, and risk.
The contributions of this critical review lie in framing these observations within broader conceptual challenges. Identifying the “Specialization vs. Generalization Tension,” the “Quality–Scalability–Ethics Trilemma,” the “Black Box Conundrum,” the “Pedagogical Adaptation Imperative,” and the evolution of “Human–LLM Cognitive Symbiosis” into “Human–Agent Orchestration” provides a structured way to understand the trajectory of ChatGPT and other similar generative AI tools. This final concept underscores that as AI becomes more autonomous, human roles will increasingly shift toward strategic oversight, ethical decision-making, and the intricate management of human–agent collaborative systems across all professional and learning environments. These frameworks demonstrate that advancing method and theory is not merely a technical pursuit but one deeply intertwined with the ethical and practical challenges of steering increasingly autonomous systems, fulfilling the “Ethical–Technical Co-evolution Imperative” introduced earlier. This imperative demands that ethical considerations, safety, and alignment are not afterthoughts but are embedded into the core design and deployment lifecycle of AI.
Harnessing the benefits of AI requires an unwavering commitment to ethical development, deployment, and continuous critical evaluation. This involves addressing biases, ensuring transparency, safeguarding privacy, and proactively considering societal impacts, especially for autonomous agents. The path forward is not one of unbridled technological determinism but of thoughtful, human-centered innovation [81]. Practically, this necessitates that organizations and individuals adopt a strategic portfolio approach to AI deployment, selecting models based on task-specific requirements, risk profiles, and a clear understanding of the trade-offs between capability, cost, and latency. As AI systems continue their rapid evolution toward greater autonomy, the global community must engage in ongoing interdisciplinary dialogue. The research agenda proposed herein offers starting points for such endeavors. By embracing critical innovation and a profound sense of responsibility, it is possible to navigate the complexities of the AI revolution, mitigating its risks while leveraging its immense potential to advance knowledge, enhance human capabilities, and contribute to global well-being. The journey is complex, but with diligent inquiry and ethical stewardship, the expanding horizons of agentic AI can indeed lead to a more informed, efficient, and equitable future.
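To make the “strategic portfolio approach” above more concrete, the following minimal Python sketch routes a task to the cheapest model that satisfies its capability and latency constraints. All model names, capability scores, prices, and latency figures are invented placeholders for illustration, not vendor specifications.

```python
# Hypothetical sketch of a "portfolio" approach to model selection.
# Model names, capability scores, prices, and latencies are invented
# placeholders, not vendor specifications.
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    capability: int            # 1 (basic) to 5 (frontier reasoning)
    cost_per_1k_tokens: float  # USD, illustrative only
    median_latency_s: float

PORTFOLIO = [
    ModelProfile("small-chat", 2, 0.0005, 0.5),
    ModelProfile("general-4o-class", 4, 0.005, 1.5),
    ModelProfile("reasoning-o1-class", 5, 0.06, 8.0),
]

def select_model(required_capability: int, max_latency_s: float) -> ModelProfile:
    """Return the cheapest model meeting the task's capability and latency needs."""
    candidates = [m for m in PORTFOLIO
                  if m.capability >= required_capability
                  and m.median_latency_s <= max_latency_s]
    if not candidates:
        raise ValueError("No model satisfies these constraints; relax one of them.")
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)

# Routine summarization tolerates a small model; multi-step reasoning does not.
print(select_model(2, 2.0).name)   # -> small-chat
print(select_model(5, 10.0).name)  # -> reasoning-o1-class
```

The design choice to prefer the cheapest qualifying model reflects the capability–cost–latency trade-off discussed above; in practice, the risk profile of the task (e.g., safety-critical engineering vs. brainstorming) would add a further gating constraint.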

Author Contributions

Conceptualization, T.R.F.; methodology, T.R.F. and J.O.A.; software, T.R.F.; formal analysis, T.R.F.; investigation, T.R.F., J.O.A., A.E.O., and D.O.A.; resources, T.R.F., J.O.A., A.E.O., and D.O.A.; data curation, T.R.F.; writing—original draft preparation, T.R.F., J.O.A., A.E.O., and D.O.A.; writing—review and editing, T.R.F., J.O.A., A.E.O., and D.O.A.; visualization, T.R.F.; supervision, J.O.A.; project administration, T.R.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Akhtarshenas, A.; Dini, A.; Ayoobi, N. ChatGPT or A Silent Everywhere Helper: A Survey of Large Language Models. arXiv 2025, arXiv:2503.17403. [Google Scholar]
  2. Infosys Limited. A Perspective on ChatGPT, Its Impact and Limitations. 2023. Available online: https://www.infosys.com/techcompass/documents/perspective-chatgpt-impact-limitations.pdf (accessed on 22 May 2025).
  3. Murray, M.; Maclachlan, R.; Flockhart, G.M.; Adams, R.; Magueijo, V.; Goodfellow, M.; Liaskos, K.; Hasty, W.; Lauro, V. A ‘snapshot’ of engineering practitioners’ views of ChatGPT-informing pedagogy in higher education. Eur. J. Eng. Educ. 2025, 1–26. [Google Scholar] [CrossRef]
  4. Xi, Z.; Chen, W.; Guo, X.; He, H.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. The rise and potential of large language model based agents: A survey. arXiv 2023, arXiv:2309.07864. [Google Scholar] [CrossRef]
  5. Dempere, J.; Modugu, K.; Hesham, A.; Ramasamy, L.K. The impact of ChatGPT on higher education. Front. Educ. 2023, 8, 1206936. [Google Scholar] [CrossRef]
  6. Al Naqbi, H.; Bahroun, Z.; Ahmed, V. Enhancing work productivity through generative artificial intelligence: A comprehensive literature review. Sustainability 2024, 16, 1166. [Google Scholar] [CrossRef]
  7. Sapkota, R.; Raza, S.; Karkee, M. Comprehensive analysis of transparency and accessibility of chatgpt, deepseek, and other sota large language models. arXiv 2025, arXiv:2502.18505. [Google Scholar]
  8. Jurafsky, D.; Martin, J.H. Speech and Language Processing, 3rd ed.; Prentice Hall: Hoboken, NJ, USA, 2000. [Google Scholar]
  9. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–13. [Google Scholar]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008. Available online: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (accessed on 20 December 2024).
  11. OpenAI. GPT-4 Technical Report. 2023. Available online: https://openai.com/index/gpt-4-research/ (accessed on 22 May 2025).
  12. Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Volume 1: Long Papers, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Berlin, Germany, 2016. [Google Scholar]
  13. OpenAI. ChatGPT FAQ. 2024. Available online: https://help.openai.com/en/collections/3742473-chatgpt (accessed on 22 May 2025).
  14. OpenAI. Hello GPT-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 22 May 2025).
  15. Hariri, W. Unlocking the potential of ChatGPT: A comprehensive exploration of its applications, advantages, limitations, and future directions in natural language processing. arXiv 2023, arXiv:2304.02017. [Google Scholar]
  16. Park, J.S.; O’Brien, J.C.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1–22. [Google Scholar] [CrossRef]
  17. Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. arXiv 2023, arXiv:2303.11366. [Google Scholar] [CrossRef]
  18. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  19. Yu, Y.; Kim, S.; Lee, W.; Koo, B. Evaluating ChatGPT on Korea’s BIM Expertise Exam and improving its performance through RAG. J. Comput. Des. Eng. 2025, 12, 94–120. [Google Scholar] [CrossRef]
  20. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H. Retrieval-augmented generation for large language models: A survey. arXiv 2023, arXiv:2312.10997. [Google Scholar]
  21. Empirical Methods in Natural Language Processing. The 2024 Conference on Empirical Methods in Natural Language Processing. 2024. Available online: https://aclanthology.org/events/emnlp-2024/ (accessed on 11 June 2025).
  22. Wu, J.; Gan, W.; Chen, Z.; Wan, S.; Yu, P.S. Multimodal large language models: A survey. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Naples, Italy, 15–18 December 2023; pp. 2247–2256. [Google Scholar]
  23. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
  24. Baroni, M. Linguistic generalization and compositionality in modern artificial neural networks. Philos. Trans. R. Soc. B 2020, 375, 20190307. [Google Scholar] [CrossRef] [PubMed]
  25. Katzir, R. Why Large Language Models Are Poor Theories of Human Linguistic Cognition. A Reply to Piantadosi (2023); Tel Aviv University: Tel Aviv, Israel, 2023; Available online: https://lingbuzz.net/lingbuzz/007190 (accessed on 22 May 2025).
  26. Lake, B.M. Compositional generalization through meta sequence-to-sequence learning. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  27. Shormani, M.Q. Non-native speakers of English or ChatGPT: Who thinks better? arXiv 2024, arXiv:2412.00457. [Google Scholar]
  28. Liu, Y.; Yao, Y.; Ton, J.F.; Zhang, X.; Guo, R.; Cheng, H.; Klochkov, Y.; Taufiq, M.F.; Li, H. Trustworthy llms: A survey and guideline for evaluating large language models’ alignment. arXiv 2023, arXiv:2308.05374. [Google Scholar]
  29. Liu, Y.; Deng, G.; Xu, Z.; Li, Y.; Zheng, Y.; Zhang, Y.; Zhao, L.; Zhang, T.; Wang, K.; Liu, Y. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv 2023, arXiv:2305.13860. [Google Scholar]
  30. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, New York, NY, USA, 3–10 March 2021; pp. 610–623. [Google Scholar]
  31. OpenAI. GPT-3.5 Turbo. 2023. Available online: https://openai.com/index/gpt-3-5-turbo-fine-tuning-and-api-updates/ (accessed on 22 May 2025).
  32. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Rush, A.M. Huggingface’s transformers: State-of-the-art natural language processing. arXiv 2019, arXiv:1910.03771. Available online: https://arxiv.org/abs/1910.03771 (accessed on 20 July 2025).
  33. Mavrepis, P.; Makridis, G.; Fatouros, G.; Koukos, V.; Separdani, M.M.; Kyriazis, D. XAI for all: Can large language models simplify explainable AI? arXiv 2024, arXiv:2401.13110. [Google Scholar] [CrossRef]
  34. Zhao, H.; Chen, H.; Yang, F.; Liu, N.; Deng, H.; Cai, H.; Du, M. Explainability for large language models: A survey. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–38. [Google Scholar] [CrossRef]
  35. Weidinger, L.; Mellor, J.; Rauh, M.; Griffin, C.; Uesato, J.; Huang, P.-S.; Cheng, M.; Glaese, M.; Balle, B.; Kasirzadeh, A.; et al. An overarching risk analysis and management framework for frontier AI. arXiv 2024, arXiv:2405.02111. [Google Scholar] [CrossRef]
  36. Susskind, R.; Susskind, D. The Future of the Professions: How Technology Will Transform the Work of Human Experts; Oxford University Press: Oxford, UK, 2022. [Google Scholar]
  37. Ray, P.P. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet Things Cyber-Phys. Syst. 2023, 3, 121–154. [Google Scholar] [CrossRef]
  38. Shah, N.; Jain, S.; Lauth, J.; Mou, Y.; Bartsch, M.; Wang, Y.; Luo, Y. Can large language models reason about medical conversation? arXiv 2023, arXiv:2305.00412. [Google Scholar]
  39. Rice, S.; Crouse, S.R.; Winter, S.R.; Rice, C. The advantages and limitations of using ChatGPT to enhance technological research. Technol. Soc. 2024, 76, 102426. [Google Scholar] [CrossRef]
  40. Nguyen, M.N.; Nguyen Thanh, B.; Vo, D.T.H.; Pham Thi Thu, T.; Thai, H.; Ha Xuan, S. Evaluating the Efficacy of Generative Artificial Intelligence in Grading: Insights from Authentic Assessments in Economics. SSRN Electron. J. 2024. [Google Scholar] [CrossRef]
  41. Thelwall, M. Evaluating research quality with large language models: An analysis of ChatGPT’s effectiveness with different settings and inputs. J. Data Inf. Sci. 2025, 10, 7–25. [Google Scholar] [CrossRef]
  42. RedBlink. Llama 4 vs ChatGPT: Comprehensive AI Models Comparison 2025. Available online: https://redblink.com/llama-4-vs-chatgpt/ (accessed on 22 May 2025).
  43. OpenAI. Introducing o1: Our Next Step in AI research. 2024. Available online: https://openai.com/o1/ (accessed on 22 May 2025).
  44. OpenAI Help Center. What is the ChatGPT Model Selector? Available online: https://help.openai.com/en/articles/7864572-what-is-the-chatgpt-model-selector (accessed on 11 June 2025).
  45. OpenAI. o1-mini: Our Best Performing Model on AIME. 2024. Available online: https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/ (accessed on 4 August 2025).
  46. Kim, T.; Yun, T.S.; Suh, H.S. Can ChatGPT implement finite element models for geotechnical engineering applications? Int. J. Numer. Anal. Methods Geomech. 2025, 49, 1747–1766. [Google Scholar] [CrossRef]
  47. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; Association for Computational Linguistics: Brussels, Belgium, 2018. [Google Scholar]
  48. Chalkidis, I.; Jana, A.; Hartung, D.; Bommarito, M.; Androutsopoulos, I.; Katz, D.M.; Aletras, N. LexGLUE: A benchmark dataset for legal language understanding in English. In Long Papers, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Dublin, Ireland, 2022; Volume 1. [Google Scholar]
  49. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  50. Eastman, C.; Teicholz, P.; Sacks, R.; Liston, K. BIM Handbook: A Guide to Building Information Modeling for Owners, Managers, Designers, Engineers and Contractors; John Wiley & Sons: Hoboken, NJ, USA, 2011. [Google Scholar]
  51. Nori, H.; King, N.; McKinney, S.M.; Carignan, D.; Horvitz, E. The unreasonable effectiveness of GPT-4 in medicine. arXiv 2023, arXiv:2303.12039. [Google Scholar]
  52. Adams, L.C.; Truhn, D.; Busch, F.; Bressem, K.K. Harnessing the power of retrieval-augmented generation for radiology reporting. arXiv 2023, arXiv:2306.02731. [Google Scholar]
  53. Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  54. Kahneman, D. Thinking, Fast and Slow; Farrar, Straus and Giroux: New York, NY, USA, 2011. [Google Scholar]
  55. Mathematical Association of America (MAA). American Invitational Mathematics Examination. 2025. Available online: https://www.maa.org/math-competitions/aime (accessed on 4 August 2025).
  56. Lake, B.M.; Baroni, M. Human-like systematic generalization through a meta-learning neural network. Nature 2023, 623, 115–121. [Google Scholar] [CrossRef]
  57. Hu, Y.; Lu, Y. Rag and rau: A survey on retrieval-augmented language model in natural language processing. arXiv 2024, arXiv:2404.19543. [Google Scholar] [CrossRef]
  58. Wu, Y.; Zhao, Y.; Hu, B.; Minervini, P.; Stenetorp, P.; Riedel, S. An efficient memory-augmented transformer for knowledge-intensive nlp tasks. arXiv 2022, arXiv:2210.16773. [Google Scholar]
  59. Yu, W. Retrieval-augmented generation across heterogeneous knowledge. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; Student Research Workshop: Seattle, WA, USA, 2022; pp. 52–58. [Google Scholar]
  60. Melamed, R.; McCabe, L.H.; Wakhare, T.; Kim, Y.; Huang, H.H.; Boix-Adsera, E. Prompts have evil twins. arXiv 2023, arXiv:2311.07064. [Google Scholar]
  61. Mozes, M.A.J. Understanding and Guarding Against Natural Language Adversarial Examples. Ph.D. Thesis, University College London, London, UK, 2024. [Google Scholar]
  62. Mozes, M.; He, X.; Kleinberg, B.; Griffin, L.D. Use of llms for illicit purposes: Threats, prevention measures, and vulnerabilities. arXiv 2023, arXiv:2308.12833. [Google Scholar] [CrossRef]
  63. Oremus, W. The Clever Trick That Turns ChatGPT Into Its Evil Twin. The Washington Post. 2023. Available online: https://www.washingtonpost.com/technology/2023/02/14/chatgpt-dan-jailbreak/ (accessed on 22 May 2025).
  64. Perez, F.; Ribeiro, I. Ignore previous prompt: Attack techniques for language models. arXiv 2022, arXiv:2211.09527. [Google Scholar] [CrossRef]
  65. Xue, J.; Zheng, M.; Hua, T.; Shen, Y.; Liu, Y.; Bölöni, L.; Lou, Q. Trojllm: A black-box trojan prompt attack on large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 65665–65677. [Google Scholar]
  66. Hodge, S.D., Jr. Revolutionizing Justice: Unleashing the Power of Artificial Intelligence. SMU Sci. Technol. Law Rev. 2023, 26, 217. [Google Scholar] [CrossRef]
  67. Perlman, A. The implications of ChatGPT for legal services and society. Mich. Technol. Law Rev. 2023, 30, 1. [Google Scholar]
  68. Surden, H. ChatGPT, AI large language models, and law. Fordham Law Rev. 2023, 92, 1941. [Google Scholar]
  69. Naveed, J. Optimized Code Generation in BIM with Retrieval-Augmented LLMs. Master’s Thesis, Aalto University School of Science, Otaniemi, Finland, 2025. [Google Scholar]
  70. Neveditsin, N.; Lingras, P.; Mago, V. Clinical insights: A comprehensive review of language models in medicine. PLoS Digit. Health 2025, 4, e0000800. [Google Scholar] [CrossRef]
  71. Fisher, J. ChatGPT for Legal Marketing: 6 Ways to Unlock the Power of AI. AI-CASEpeer. May 2025. Available online: https://www.casepeer.com/blog/chatgpt-for-legal-marketing/ (accessed on 22 May 2025).
  72. Elkatmis, M. ChatGPT and Creative Writing: Experiences of Master’s Students in Enhancing. Int. J. Contemp. Educ. Res. 2024, 11, 321–336. [Google Scholar] [CrossRef]
  73. Niloy, A.C.; Akter, S.; Sultana, N.; Sultana, J.; Rahman, S.I.U. Is Chatgpt a menace for creative writing ability? An experiment. J. Comput. Assist. Learn. 2024, 40, 919–930. [Google Scholar] [CrossRef]
  74. Zhu, S.; Wang, Z.; Zhuang, Y.; Jiang, Y.; Guo, M.; Zhang, X.; Gao, Z. Exploring the impact of ChatGPT on art creation and collaboration: Benefits, challenges and ethical implications. Telemat. Inform. Rep. 2024, 14, 100138. [Google Scholar] [CrossRef]
  75. Alasadi, E.; Baiz, A.A. ChatGPT: A systematic review of published research in medical education. medRxiv 2023. [Google Scholar]
  76. Dwivedi, Y.K.; Kshetri, N.; Hughes, L.; Slade, E.L.; Jeyaraj, A.; Kar, A.K.; Baabdullah, A.M.; Koohang, A.; Raghavan, V.; Ahuja, M.; et al. Opinion Paper: “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int. J. Inf. Manag. 2023, 71, 102642. [Google Scholar] [CrossRef]
  77. Isiaku, L.; Muhammad, A.S.; Kefas, H.I.; Ukaegbu, F.C. Enhancing technological sustainability in academia: Leveraging ChatGPT for teaching, learning and evaluation. Qual. Educ. All 2024, 1, 385–416. [Google Scholar] [CrossRef]
  78. Michel-Villarreal, R.; Vilalta-Perdomo, E.; Salinas-Navarro, D.E.; Thierry-Aguilera, R.; Gerardou, F.S. Challenges and opportunities of generative AI for higher education as explained by ChatGPT. Educ. Sci. 2023, 13, 856. [Google Scholar] [CrossRef]
  79. Preiksaitis, C.; Rose, C. Opportunities, challenges, and future directions of generative artificial intelligence in medical education: Scoping review. JMIR Med. Educ. 2023, 9, e48785. [Google Scholar] [CrossRef]
  80. Wu, T.Y.; He, S.Z.; Liu, J.P.; Sun, S.Q.; Liu, K.; Han, Q.L.; Tang, Y. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA J. Autom. Sinica 2023, 10, 1122–1136. [Google Scholar] [CrossRef]
  81. Shneiderman, B. Human-Centered AI; Oxford University Press: Oxford, UK, 2022. [Google Scholar]
  82. Seeber, I.; Bittner, E.; Briggs, R.O.; de Vreede, T.; de Vreede, G.-J.; Elkins, A.; Maier, R.; Merz, A.B.; Oeste-Reiß, S.; Randrup, N.; et al. Machines as teammates: A research agenda on AI in team collaboration. Inf. Manag. 2020, 57, 103174. [Google Scholar] [CrossRef]
  83. Arvidsson, S.; Axell, J. Prompt Engineering Guidelines for LLMs in Requirements Engineering. Ph.D. Thesis, University of Technology, Gothenburg, Sweden, 2023. Available online: https://gupea.ub.gu.se/bitstream/handle/2077/77967/CSE%2023-20%20SA%20JA.pdf?sequence=1&isAllowed=y (accessed on 20 July 2025).
  84. Marvin, G.; Hellen, N.; Jjingo, D.; Nakatumba-Nabende, J. Prompt engineering in large language models. In International Conference on Data Intelligence and Cognitive Informatics; Springer Nature: Singapore, 2023; pp. 387–402. [Google Scholar]
  85. Velásquez-Henao, J.D.; Franco-Cardona, C.J.; Cadavid-Higuita, L. Prompt Engineering: A methodology for optimizing interactions with AI-Language Models in the field of engineering. Dyna 2023, 90, 9–17. [Google Scholar] [CrossRef]
  86. Zhou, Y.; Muresanu, A.I.; Han, Z.; Paster, K.; Pitis, S.; Chan, H.; Ba, J. Large language models are human-level prompt engineers. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  87. Pan, S.; Luo, L.; Wang, Y.; Chen, C.; Wang, J.; Wu, X. Unifying large language models and knowledge graphs: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2024, 14, e1518. [Google Scholar] [CrossRef]
  88. Kaushik, A.; Yadav, S.; Browne, A.; Lillis, D.; Williams, D.; Donnell, J.M.; Grant, P.; Kernan, S.C.; Sharma, S.; Mansi, A. Exploring the Impact of Generative Artificial Intelligence in Education: A Thematic Analysis. arXiv 2025, arXiv:2501.10134. [Google Scholar]
  89. Hadi, M.U.; Qureshi, R.; Shah, A.; Irfan, M.; Zafar, A.; Shaikh, M.B.; Akhtar, N.; Hassan, S.Z.; Shoman, M.; Wu, J.; et al. Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Prepr. 2023, 1, 1–26. [Google Scholar]
  90. Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 2025, 6, 101260. [Google Scholar] [CrossRef]
  91. Sivarajkumar, S.; Kelley, M.; Samolyk-Mazzanti, A.; Visweswaran, S.; Wang, Y. An empirical evaluation of prompting strategies for large language models in zero-shot clinical natural language processing: Algorithm development and validation study. JMIR Med. Inform. 2024, 12, e55318. [Google Scholar] [CrossRef]
  92. Naseem, U.; Dunn, A.G.; Khushi, M.; Kim, J. Benchmarking for biomedical natural language processing tasks with a domain specific ALBERT. BMC Bioinform. 2022, 23, 144. [Google Scholar] [CrossRef] [PubMed]
  93. Perez, E.; Huang, S.; Song, F.; Cai, T.; Ring, R.; Aslanides, J.; Glaese, A.; McAleese, N.; Irving, G. Red teaming language models with language models. arXiv 2022, arXiv:2202.03286. [Google Scholar] [CrossRef]
  94. Garousi, V. Why you shouldn’t fully trust ChatGPT: A synthesis of this AI tool’s error rates across disciplines and the software engineering lifecycle. arXiv 2025, arXiv:2504.18858. [Google Scholar]
  95. Schiller, C.A. The human factor in detecting errors of large language models: A systematic literature review and future research directions. arXiv 2024, arXiv:2403.09743. [Google Scholar] [CrossRef]
  96. Lee, H. The rise of ChatGPT: Exploring its potential in medical education. Anat. Sci. Educ. 2024, 17, 926–931. [Google Scholar] [CrossRef]
  97. Johnson, S.; Acemoglu, D. Power and Progress: Our Thousand-Year Struggle over Technology and Prosperity; Hachette UK: London, UK, 2023. [Google Scholar]
  98. OpenAI. Safety & Alignment. 2023. Available online: https://openai.com/safety/ (accessed on 22 May 2025).
  99. Veisi, O.; Bahrami, S.; Englert, R.; Müller, C. AI Ethics and Social Norms: Exploring ChatGPT’s Capabilities from What to How. arXiv 2025, arXiv:2504.18044. [Google Scholar]
  100. Daun, M.; Brings, J. How ChatGPT will change software engineering education. In Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1, Turku, Finland, 7–12 July 2023; pp. 110–116. [Google Scholar]
  101. Marques, N.; Silva, R.R.; Bernardino, J. Using chatgpt in software requirements engineering: A comprehensive review. Future Internet 2024, 16, 180. [Google Scholar] [CrossRef]
  102. Gamage, K.A.; Dehideniya, S.C.; Xu, Z.; Tang, X. ChatGPT and higher education assessments: More opportunities than concerns? J. Appl. Learn. Teach. 2023, 6, 358–369. [Google Scholar] [CrossRef]
  103. Gao, R.; Yu, D.; Gao, B.; Hua, H.; Hui, Z.; Gao, J.; Yin, C. Legal regulation of AI-assisted academic writing: Challenges, frameworks, and pathways. Front. Artif. Intell. 2025, 8, 1546064. [Google Scholar] [CrossRef]
  104. Hannigan, T.R.; McCarthy, I.P.; Spicer, A. Beware of botshit: How to manage the epistemic risks of generative chatbots. Bus. Horiz. 2024, 67, 471–486. [Google Scholar] [CrossRef]
  105. Jiang, Y.; Hao, J.; Fauss, M.; Li, C. Detecting ChatGPT-generated essays in a large-scale writing assessment: Is there a bias against non-native English speakers? Comput. Educ. 2024, 217, 105070. [Google Scholar] [CrossRef]
  106. Susnjak, T.; McIntosh, J. Academic integrity in the age of ChatGPT. Change Mag. High. Learn. 2024, 56, 21–27. [Google Scholar]
  107. Levitt, G.; Grubaugh, S. Artificial intelligence and the paradigm shift: Reshaping education to equip students for future careers. Int. J. Soc. Sci. Humanit. Invent. 2023, 10, 7931–7941. [Google Scholar] [CrossRef]
  108. U.S. Department of Education, Office of Educational Technology. Artificial Intelligence and the Future of Teaching and Learning: Insights and Recommendations. 2023. Available online: https://www.ed.gov/sites/ed/files/documents/ai-report/ai-report.pdf (accessed on 22 May 2025).
  109. Dagdelen, J.; Dunn, A.; Lee, S.; Walker, N.; Rosen, A.S.; Ceder, G.; Persson, K.A.; Jain, A. Structured information extraction from scientific text with large language models. Nat. Commun. 2024, 15, 1418. [Google Scholar] [CrossRef]
  110. Mitra, M.; de Vos, M.G.; Cortinovis, N.; Ometto, D. Generative AI for Research Data Processing: Lessons Learnt From Three Use Cases. In Proceedings of the 2024 IEEE 20th International Conference on e-Science (e-Science), Osaka, Japan, 16–20 September 2024; pp. 1–10. [Google Scholar]
  111. Yang, X.; Chen, A.; PourNejatian, N.; Shin, H.C.; Smith, K.E.; Parisien, C.; Compas, C.; Martin, C.; Flores, M.G.; Zhang, Y.; et al. Gatortron: A large clinical language model to unlock patient information from unstructured electronic health records. arXiv 2022, arXiv:2203.03540. [Google Scholar] [CrossRef]
  112. Gao, X.; Zhang, Z.; Xie, M.; Liu, T.; Fu, Y. Graph of AI Ideas: Leveraging Knowledge Graphs and LLMs for AI Research Idea Generation. arXiv 2025, arXiv:2503.08549. [Google Scholar]
  113. Bran, A.; Cox, S.R.; Schilter, P. ChemCrow: Augmenting large-language models with a tool-set for chemistry. arXiv 2024. [Google Scholar] [CrossRef]
  114. Chang, X.; Dai, G.; Di, H.; Ye, H. Breaking the Prompt Wall (I): A Real-World Case Study of Attacking ChatGPT via Lightweight Prompt Injection. arXiv 2025, arXiv:2504.16125. [Google Scholar]
  115. Albadarin, Y.; Saqr, M.; Pope, N.; Tukiainen, M. A systematic literature review of empirical research on ChatGPT in education. Discov. Educ. 2024, 3, 60. [Google Scholar] [CrossRef]
  116. Gabashvili, I.S. The impact and applications of ChatGPT: A systematic review of literature reviews. arXiv 2023, arXiv:2305.18086. [Google Scholar] [CrossRef]
  117. Haman, M.; Školník, M. Using ChatGPT for scientific literature review: A case study. IASL 2024, 1, 1–13. [Google Scholar]
  118. Imran, M.; Almusharraf, N. Analyzing the role of ChatGPT as a writing assistant at higher education level: A systematic review of the literature. Cont. Edu. Tech. 2023, 15, ep464. [Google Scholar] [CrossRef]
  119. Mostafapour, M.; Asoodar, M.; Asoodar, M. Advantages and disadvantages of using ChatGPT for academic literature review. Cogent Eng. 2024, 11, 2315147. [Google Scholar]
  120. Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv 2023, arXiv:2305.16291. [Google Scholar] [CrossRef]
  121. Dai, W.; Lin, J.; Jin, H.; Li, T.; Tsai, Y.S.; Gašević, D.; Chen, G. Can large language models provide feedback to students? A case study on ChatGPT. In Proceedings of the 2023 IEEE International Conference on Advanced Learning Technologies (ICALT), Orem, UT, USA, 10–13 July 2023; IEEE: New York, NY, USA, 2023; pp. 323–325. [Google Scholar]
  122. Haltaufderheide, J.; Ranisch, R. ChatGPT and the future of academic publishing: A perspective. Am. J. Bioeth. 2024, 24, 4–11. [Google Scholar]
  123. Garg, R.K.; Urs, V.L.; Agarwal, A.A.; Chaudhary, S.K.; Paliwal, V.; Kar, S.K. Exploring the role of ChatGPT in patient care (diagnosis and treatment) and medical research: A systematic review. Health Promot. Perspect. 2023, 13, 183. [Google Scholar] [CrossRef]
  124. Sallam, M. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare 2023, 11, 887. [Google Scholar] [CrossRef]
  125. Glickman, M.; Zhang, Y. AI and generative AI for research discovery and summarization. arXiv 2024, arXiv:2401.06795. [Google Scholar] [CrossRef]
  126. Huang, J.; Chang, K.C.C. Towards reasoning in large language models: A survey. arXiv 2022, arXiv:2212.10403. [Google Scholar]
  127. Bhagavatula, C.; Bras, R.L.; Malaviya, C.; Sakaguchi, K.; Holtzman, A.; Rashkin, H.; Downey, D.; Yih, S.W.-t.; Choi, Y. Abductive commonsense reasoning. arXiv 2019, arXiv:1908.05739. [Google Scholar]
  128. Garbuio, M.; Lin, N. Innovative idea generation in problem finding: Abductive reasoning, cognitive impediments, and the promise of artificial intelligence. J. Prod. Innov. Manag. 2021, 38, 701–725. [Google Scholar] [CrossRef]
  129. Magnani, L.; Arfini, S. Model-based abductive cognition: What thought experiments teach us. Log. J. IGPL 2024, jzae096. [Google Scholar] [CrossRef]
  130. Pareschi, R. Abductive reasoning with the GPT-4 language model: Case studies from criminal investigation, medical practice, scientific research. Sist. Intelligenti 2023, 35, 435–444. [Google Scholar]
  131. Boiko, D.A.; MacKnight, R.; Gomes, G. Emergent autonomous scientific research capabilities of large language models. arXiv 2023, arXiv:2304.05332. [Google Scholar] [CrossRef]
  132. Noy, S.; Zhang, W. Experimental evidence on the productivity effects of generative artificial intelligence. Science 2023, 381, 187–192. [Google Scholar] [CrossRef]
  133. Eymann, V.; Lachmann, T.; Czernochowski, D. When ChatGPT Writes Your Research Proposal: Scientific Creativity in the Age of Generative AI. J. Intell. 2025, 13, 55. [Google Scholar] [CrossRef]
  134. Fill, H.G.; Fettke, P.; Köpke, J. Conceptual modeling and large language models: Impressions from first experiments with ChatGPT. Enterp. Model. Inf. Syst. Archit. 2023, 18, 1–15. [Google Scholar]
  135. Li, R.; Liang, P.; Wang, Y.; Cai, Y.; Sun, W.; Li, Z. Unveiling the Role of ChatGPT in Software Development: Insights from Developer-ChatGPT Interactions on GitHub. arXiv 2025, arXiv:2505.03901. [Google Scholar]
  136. Dovesi, D.; Malandri, L.; Mercorio, F.; Mezzanzanica, M. A survey on explainable AI for Big Data. J. Big Data 2024, 11, 6. [Google Scholar] [CrossRef]
  137. Davar, N.F.; Dewan, M.A.A.; Zhang, X. AI chatbots in education: Challenges and opportunities. Information 2025, 16, 235. [Google Scholar] [CrossRef]
  138. Li, M. The impact of ChatGPT on teaching and learning in higher education: Challenges, opportunities, and future scope. In Encyclopedia of Information Science and Technology, 6th ed.; IGI Global Scientific Publishing: Hershey, PA, USA, 2025; pp. 1–20. [Google Scholar]
  139. Arslan, B.; Lehman, B.; Tenison, C.; Sparks, J.R.; López, A.A.; Gu, L.; Zapata-Rivera, D. Opportunities and challenges of using generative AI to personalize educational assessment. Front. Artif. Intell. 2024, 7, 1460651. [Google Scholar] [CrossRef]
  140. Lin, X.; Chan, R.Y.; Sharma, S.; Bista, K. (Eds.) ChatGPT and Global Higher Education: Using Artificial Intelligence in Teaching and Learning; STAR Scholars Press: Baltimore, MD, USA, 2024. [Google Scholar]
  141. Molenaar, I. Human-AI co-regulation: A new focal point for the science of learning. npj Sci. Learn. 2024, 9, 29. [Google Scholar]
  142. Salesforce. AI Agents in Education: Benefits & Use Cases. Salesforce. 23 June 2025. Available online: https://www.salesforce.com/education/artificial-intelligence/ai-agents-in-education/ (accessed on 22 May 2025).
  143. Mohammed, A. Navigating the AI Revolution: Safeguarding Academic Integrity and Ethical Considerations in the Age of Innovation. BERA. March 2025. Available online: https://www.bera.ac.uk/blog/navigating-the-ai-revolution-safeguarding-academic-integrity-and-ethical-considerations-in-the-age-of-innovation (accessed on 22 May 2025).
  144. Alghazo, R.; Fatima, G.; Malik, M.; Abdelhamid, S.E.; Jahanzaib, M.; Raza, A. Exploring ChatGPT’s Role in Higher Education: Perspectives from Pakistani University Students on Academic Integrity and Ethical Challenges. Educ. Sci. 2025, 15, 158. [Google Scholar] [CrossRef]
  145. Zawacki-Richter, O.; Marín, V.I.; Bond, M.; Gouverneur, F. Systematic review of research on artificial intelligence applications in higher education—Where are the educators? Int. J. Educ. Technol. High. Educ. 2019, 16, 39. [Google Scholar] [CrossRef]
  146. Atchley, P.; Pannell, H.; Wofford, K.; Hopkins, M.; Atchley, R.A. Human and AI collaboration in the higher education environment: Opportunities and concerns. Cogn. Res. Princ. Implic. 2024, 9, 20. [Google Scholar] [CrossRef]
  147. Prinsloo, P. Data frontiers and frontiers of power in higher education: A view of the an/archaeology of data. Teach. High. Educ. 2020, 25, 394–412. [Google Scholar] [CrossRef]
  148. Adiyono, A.; Al Matari, A.S.; Dalimarta, F.F. Analysis of Student Perceptions of the Use of ChatGPT as a Learning Media: A Case Study in Higher Education in the Era of AI-Based Education. J. Educ. Teach. 2025, 6, 306–324. [Google Scholar] [CrossRef]
  149. UNESCO. Guidance for Generative AI in Education and Research. UNESCO. 2023. Available online: https://unesdoc.unesco.org/ark:/48223/pf0000386693 (accessed on 22 May 2025).
  150. Weidlich, J.; Gašević, D. ChatGPT in education: An effect in search of a cause. PsyArXiv 2025, preprint. [Google Scholar] [CrossRef]
  151. Kasneci, E.; Seßler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
  152. Zhai, X. ChatGPT for next generation science learning. ACM Mag. Stud. 2023, 29, 42–46. [Google Scholar] [CrossRef]
  153. Belzner, L.; Gabor, T.; Wirsing, M. Large language model assisted software engineering: Prospects, challenges, and a case study. In International Conference on Bridging the Gap Between AI and Reality; Springer Nature: Cham, Switzerland, 2023; pp. 355–374. [Google Scholar]
  154. Rawat, A.S.; Fazzini, M.; George, T.; Gokulan, R.; Maddila, C.; Arrieta, A. A new era of software development: A survey on the impact of large language models. ACM Comput. Surv. 2024, 57, 1–40. [Google Scholar]
  155. Yadav, S.; Qureshi, A.M.; Kaushik, A.; Sharma, S.; Loughran, R.; Kazhuparambil, S.; Shaw, A.; Sabry, M.; St John Lynch, N.; Singh, N.; et al. From Idea to Implementation: Evaluating the Influence of Large Language Models in Software Development--An Opinion Paper. arXiv 2025, arXiv:2503.07450. [Google Scholar]
  156. Jiang, C.; Huang, R.; Shen, T. Generative AI-Enabled Conceptualization: Charting ChatGPT’s Impacts on Sustainable Service Design Thinking with Network-Based Cognitive Maps. J. Comput. Inf. Sci. Eng. 2025, 25, 021006. [Google Scholar] [CrossRef]
  157. Vu, N.G.H.; Wang, K.; Wang, G.G. Effective prompting with ChatGPT for problem formulation in engineering optimization. Eng. Optim. 2025, 1–18. [Google Scholar] [CrossRef]
  158. Puthumanaillam, G.; Ornik, M. The Lazy Student’s Dream: ChatGPT Passing an Engineering Course on Its Own. arXiv 2025, arXiv:2503.05760. [Google Scholar]
  159. Ege, D.N.; Øvrebø, H.H.; Stubberud, V.; Berg, M.F.; Elverum, C.; Steinert, M.; Vestad, H. ChatGPT as an inventor: Eliciting the strengths and weaknesses of current large language models against humans in engineering design. Artif. Intell. Eng. Des. Anal. Manuf. 2025, 39, e6. [Google Scholar] [CrossRef]
  160. Topcu, T.G.; Husain, M.; Ofsa, M.; Wach, P. Trust at Your Own Peril: A Mixed Methods Exploration of the Ability of Large Language Models to Generate Expert-Like Systems Engineering Artifacts and a Characterization of Failure Modes. Syst. Eng. 2025, 1–41. [Google Scholar] [CrossRef]
  161. Hagendorff, T. A virtue ethics-based framework for the corporate ethics of AI. AI Ethics 2024, 4, 653–666. [Google Scholar]
  162. Ballardini, R.M.; He, K.; Roos, T. AI-generated content: Authorship and inventorship in the age of artificial intelligence. In Online Distribution of Content in the EU; Edward Elgar Publishing: Cheltenham, UK, 2019; pp. 117–135. [Google Scholar]
  163. Craig, C.J. The AI-copyright challenge: Tech-neutrality, authorship, and the public interest. In Research Handbook on Intellectual Property and Artificial Intelligence; Edward Elgar Publishing: Cheltenham, UK, 2022; pp. 134–155. [Google Scholar]
  164. Reich, J. Failure to Disrupt: Why Technology Alone Can’t Transform Education; Harvard University Press: Cambridge, MA, USA, 2020. [Google Scholar]
  165. Sabzalieva, E.; Valentini, A. ChatGPT and Artificial Intelligence in Higher Education: Quick Start Guide; UNESCO: Quito, Ecuador, 2023; pp. 1–14. [Google Scholar]
  166. Miller, D. Exploring the impact of artificial intelligence language model ChatGPT on the user experience. Int. J. Technol. Innov. Manag. 2023, 3, 1–8. [Google Scholar]
  167. Floridi, L.; Nobre, C. Artificial intelligence, and the new challenges of anticipatory governance. Ethics Inf. Technol. 2024, 26, 24. [Google Scholar]
  168. Dimeli, M.; Kostas, A. The Role of ChatGPT in Education: Applications, Challenges: Insights From a Systematic Review. J. Inf. Technol. Educ. Res. 2025, 24, 2. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overview of this study’s contribution. Navigating AI’s ethical frontier and human–AI synergy: balancing technology and society.
Table 1. Comparative overview of ChatGPT models in NLU and content generation.

| Model Version | Key Architectural Features/Training Data Cutoff | Notable NLU Capabilities | Content Generation Strengths | Known Limitations | Key Benchmark Performance (Example) |
| --- | --- | --- | --- | --- | --- |
| ChatGPT-3.5/3.5-Turbo | Based on GPT-3.5, text/code (pre-2021/2023) [2,31,36] | Basic text tasks, translation, conversational AI, faster responses [2,31,36] | Dialogue, boilerplate tasks, initial drafts, summaries [2,31,36] | Accuracy issues, bias, limited by training data cutoff, struggles with highly specialized tasks [1,31] | GLUE average score ~78.7% (comparable to BERT-base, lags RoBERTa-large) [27]. Passed Korea’s BIM Expertise Exam with a 65% average [19]. Error rates in healthcare can be high [37]. |
| ChatGPT-4 | Based on GPT-4, text/code (pre-2023) [2,38,39] | Multimodal (text), high precision, improved reasoning, expanded context window [11,13,15,23] | More coherent, contextually relevant text; complex conversations; nuanced topics [11,13,15,23] | Still prone to hallucinations and bias; costlier; specific weaknesses in areas such as local guidelines without RAG [11,23,30,37,40] | Passed Korea’s BIM Expertise Exam with an 85% average (improved to 88.6% with RAG for specific categories) [19]. Lower error rates in business/economics (~15–20%) than 3.5 [37]. |
| GPT-4o/GPT-4o mini | Text/code (pre-2024) [14,15] | Multimodal (text/image/audio/video), improved contextual awareness, advanced tokenization, cost-efficiency (mini) [14,15] | Richer, more interactive responses; real-time collaboration support [14,15] | Newer models; long-term limitations still under study, but likely share core LLM challenges | GPT-4o slightly better than 3.5-turbo and 4o-mini on research quality score estimates (correlation 0.67 with human scores using title/abstract) [41]. GPT-4o mini outperforms GPT-3.5 Turbo on MMLU (82% vs. 69.8%) [42]. |
| o1-series (o1-preview, o1-mini, o1) | STEM-focused data, some general data (pre-2024/2025) [7,43] | System 2 thinking; PhD-level STEM reasoning (o1-preview); fast reasoning (o1-mini); full reasoning and multimodality (o1) [7,43] | Analytical rigor; hypothesis generation/evaluation (biology, math, engineering) [44,45] | Specialized for STEM; general capabilities relative to GPT-4o may vary | Best-performing benchmarked model (o1-mini) on AIME 2024 and 2025 [44,45]. Used for generating finite element code in geotechnical engineering [45,46]. |
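To make concrete what the “with RAG” improvements in Table 1 involve, the following minimal Python sketch illustrates the retrieve-then-generate pattern: the most relevant passages from a domain corpus are retrieved and prepended to the prompt so the model answers from grounded context. The toy corpus, the word-overlap scorer, and the `ask_llm` stub are hypothetical stand-ins; a real pipeline would use dense embeddings and an actual LLM API.

```python
# Minimal sketch of retrieval-augmented generation (RAG). The corpus,
# scorer, and ask_llm stub are hypothetical placeholders for illustration.
CORPUS = [
    "BIM execution plans define roles, LOD requirements, and exchange formats.",
    "IFC is an open schema for exchanging building information models.",
    "Clash detection compares discipline models to find geometric conflicts.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by naive word overlap with the query (stand-in for embeddings)."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda p: len(q & set(p.lower().split())), reverse=True)
    return scored[:k]

def ask_llm(prompt: str) -> str:
    return f"[LLM answer grounded in a prompt of {len(prompt)} chars]"  # stub

question = "What does a BIM execution plan specify?"
context = "\n".join(retrieve(question, CORPUS))
answer = ask_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer)
```

The grounding step is what lifted the domain-specific exam scores reported above: the model’s answer is constrained by retrieved authoritative text rather than its parametric memory alone.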
Table 2. Key training differences: BERT vs. RoBERTa [49].

| Feature | BERT | RoBERTa |
| --- | --- | --- |
| Training Task | Masked Language Model (MLM) and Next Sentence Prediction (NSP) | MLM only (NSP task removed) |
| Training Data Size | ~16 GB (BookCorpus and Wikipedia) | ~160 GB (BookCorpus, Wikipedia, CC-News, etc.) |
| Masking Strategy | Static (masking pattern fixed during pre-processing) | Dynamic (masking pattern changes across training epochs) |
| Batch Size | Smaller (e.g., 256) | Significantly larger (e.g., 8000) |
Table 3. Evolution of context window size in GPT-4 models [13].

| Model Version | Context Window Size (Tokens) | Approximate Page Equivalent |
| --- | --- | --- |
| GPT-4 (Initial Release) | 8192 | ~20 pages |
| GPT-4-32k | 32,768 | ~80 pages |
| GPT-4 Turbo/GPT-4o | 128,000 | ~300 pages |
| GPT-4.1 Series | 1,000,000 | ~2500 pages |
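The page equivalents in Table 3 can be sanity-checked with a rough rule of thumb. Assuming, as a working approximation, about 0.75 words per token and roughly 300 words per printed page (i.e., about 410 tokens per page), the figures follow:

```latex
\text{pages} \approx \frac{\text{tokens}}{410}, \qquad
\frac{8192}{410} \approx 20, \quad
\frac{32{,}768}{410} \approx 80, \quad
\frac{128{,}000}{410} \approx 312, \quad
\frac{1{,}000{,}000}{410} \approx 2439,
```

which round to the ~20, ~80, ~300, and ~2500 pages listed above. Actual token-to-word ratios vary with language, tokenizer, and content type, so these equivalents are indicative only.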
Table 4. Cross-model performance on key benchmarks (MMLU vs. AIME) [42,45].

| Benchmark | Model | Score | Interpretation |
| --- | --- | --- | --- |
| MMLU (General Knowledge) | GPT-3.5 Turbo | ~70% | Foundational general knowledge. |
| | GPT-4o mini | 82% | Strong general knowledge and reasoning. |
| | o1-mini | 85.2% | Very strong general knowledge, but not its primary strength. |
| AIME (Advanced Math Reasoning) | GPT-4o | ~13% | Lacks specialized, multi-step reasoning ability. |
| | o1-mini | 70% | Elite mathematical reasoning, competitive with top human talent. |
Table 5. Key applications of ChatGPT in education and engineering: benefits, challenges, and novel methodological/theoretical implications.

| Application Area | Specific Use Cases | Documented Benefits | Key Challenges | Novel Methodological/Theoretical Implications |
| --- | --- | --- | --- | --- |
| Education | Personalized learning, virtual tutoring [137] | Tailored content, adaptive pacing, 24/7 support, increased engagement [137] | Over-reliance, reduced critical thinking, accuracy of information, data privacy, equity of access | Development of “AI-Integrated Pedagogy”; re-evaluation of constructivist and self-determination learning theories in AI contexts. |
| | Curriculum/lesson planning [138] | Efficiency for educators, idea generation, diverse material creation [138] | Quality of AI suggestions, maintaining teacher creativity, potential for generic content [138] | Frameworks for AI-assisted curriculum design that balance efficiency with pedagogical soundness and teacher agency. |
| | Student assessment [140] | Generation of diverse quiz/exam questions, formative feedback, personalized assessment [138] | Academic integrity (plagiarism), difficulty assessing true understanding, fairness of AI-generated assessments [143] | New assessment paradigms focusing on higher-order skills, process over product; ethical guidelines for AI in assessment. |
| Engineering | Software engineering (code generation, debugging, QA) [155] | Increased developer productivity, reduced coding time, improved code quality [155] | Accuracy of generated code, over-dependence, skill degradation, security risks, bias in code [155] | “Human–LLM Cognitive Symbiosis” models for software development; AI-collaboration literacy for engineers. |
| | BIM/architecture/civil engineering (info retrieval, design visualization) [19] | Enhanced understanding of domain-specific knowledge (with RAG), task planning support [19] | Reliance on quality of RAG documents, need for domain expertise in prompt/RAG setup [19] | Methodologies for integrating LLMs with domain-specific knowledge bases (e.g., RAG) for specialized engineering tasks. |
| | Mechanical/industrial design (ideation, prototyping, optimization) [67] | Accelerated idea generation, exploration of diverse concepts, assistance in optimization problem formulation [67] | Design fixation, unnecessary complexity, misinterpretation of feedback, unsubstantiated estimates [34] | “AI-Augmented Engineering Design” frameworks; theories of “AI-robustness” in design; understanding LLM impact on cognitive design processes. |
| | Geotechnical engineering (finite element analysis code generation) [46] | Assistance in implementing numerical models, especially with high-level libraries [46] | Extensive human intervention needed for low-level programming or complex problems; requires user expertise [46] | Frameworks for human–AI collaboration in complex numerical modeling and simulation. |
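Several engineering rows in Table 5 note that the benefits hinge on domain expertise in prompt setup. As a hedged illustration only, the sketch below shows one way a structured, domain-grounded prompt might be assembled for the geotechnical finite element use case; the template fields and the `ask_llm` stub are hypothetical, not a validated protocol from the cited studies.

```python
# Hypothetical sketch of a structured, domain-grounded prompt for engineering
# code generation. Template fields and the ask_llm stub are illustrative only.
PROMPT_TEMPLATE = """You are assisting a geotechnical engineer.
Task: {task}
Governing assumptions: {assumptions}
Target library: {library}
Rules: state the governing equation for each step; flag any step you cannot
verify instead of guessing; do not invent material parameters.
"""

def ask_llm(prompt: str) -> str:
    return "[model response]"  # stand-in for a real LLM API call

prompt = PROMPT_TEMPLATE.format(
    task="Implement a 1D consolidation finite element model",
    assumptions="Terzaghi theory; saturated clay; constant coefficient of consolidation",
    library="a high-level finite element library",
)
print(ask_llm(prompt))  # output still requires expert review before any use
```

Encoding assumptions and verification rules directly in the prompt mirrors the human-in-the-loop stance of Table 5’s final column: the LLM drafts, the domain expert validates.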
Table 6. Critical research gaps and future agenda for ChatGPT research (advancing method and theory).

| Domain | Specific Identified Research Gap | Proposed Novel Research Question(s) | Potential Methodological Advancement | Potential Theoretical Advancement |
| --- | --- | --- | --- | --- |
| NLU | True semantic understanding vs. mimicry; robustness to ambiguity; explainability [7] | How can NLU models be designed to exhibit verifiable deep understanding and provide transparent reasoning for their interpretations? | Development of “Deep Understanding Benchmarks”; new XAI techniques for generative NLU. | Theories of “Explainable Generative NLU”; models of computational semantics beyond statistical co-occurrence, drawing from linguistics and cognitive science. |
| Content Generation | Ensuring factual accuracy; dynamic quality control; IP and copyright [5] | What adaptive mechanisms can ensure real-time quality and ethical compliance in AI content generation across diverse contexts? | Adaptive, context-aware QA frameworks; blockchain or other technologies for provenance tracking. | “Ethical AI Content Frameworks” (informed by law and media ethics); theories of “Responsible Generative Efficiency.” |
| Knowledge Discovery | Validating AI-generated hypotheses; moving from info extraction to insight; ethical AI in science [39] | How can LLMs be integrated into the scientific method to reliably generate and validate novel, theoretically grounded hypotheses? | Rigorous validation protocols for AI-discovered knowledge; hybrid LLM-KG-experimental methodologies. | “Computational Creativity Theories” for scientific discovery (integrating cognitive psychology and philosophy of science); models of AI-assisted abductive reasoning. |
| Education | Longitudinal impact on learning and critical thinking; AI literacy curricula; equity and bias in EdAI [5]; K-12 and special education gaps [168] | What pedagogical frameworks optimize human–AI collaboration for deep learning and critical skill development across diverse learners and contexts? | Longitudinal mixed-methods studies; co-design of AI literacy programs with educators and students; comparative studies in underrepresented educational settings. | “AI-Augmented Learning Theories” (linking to established learning sciences and cognitive psychology); frameworks for “Cyborg Pedagogy”; theories of ethical AI integration in diverse educational systems. |
| Engineering | LLMs in safety-critical tasks; understanding LLM failure modes in complex design [160]; human–LLM collaboration frameworks [21]; NL to code/design beyond software [155] | How can engineering design and optimization processes be re-theorized to effectively and safely incorporate LLM cognitive capabilities? | Protocols for LLM validation in complex simulations; frameworks for human-in-the-loop control for safety-critical engineering AI. | “Human–AI Symbiotic Engineering Design Theories” (grounded in Human-Computer Interaction [HCI] and cognitive engineering); theories of “AI-Robustness” in engineering systems. |
