Review

Interaction with LLM-Based Systems: A Structured Review and Taxonomy of Mechanisms and Autonomy

Faculty of Science, University of Split, Ruđera Boškovića 33, 21000 Split, Croatia
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(10), 5001; https://doi.org/10.3390/app16105001
Submission received: 16 April 2026 / Revised: 14 May 2026 / Accepted: 15 May 2026 / Published: 17 May 2026

Abstract

Large language models (LLMs) are increasingly integrated into interactive systems across domains such as software development, robotics, and education. As these systems evolve from simple chat interfaces to autonomous, tool-using agents, the design of human–LLM interaction becomes critical. This paper presents a structured review of interaction with LLM-based systems, focusing on how prompting and interaction design mediate system behaviour and autonomy in practice. We analysed 87 studies from 2021–2025, identifying key interaction mechanisms and application-specific challenges. Based on this synthesis, we propose a two-dimensional taxonomy that classifies systems by interaction mechanism (conversational exploration, task-oriented assistance, tool-mediated interaction, and agentic workflows) and level of autonomy (advisory systems, guided execution, delegated execution, and high-autonomy execution). The taxonomy is supported by decision rules, worked examples, and a human-centred lens, emphasizing user control, transparency, error handling, and learning. Our review highlights a shift from single-turn prompting to structured multi-step workflows and the need for evaluation that considers both process and outcomes, particularly in safety-critical settings. This work provides a framework for analysing, comparing, and informing the design of human-centred interaction with LLM-based systems.

1. Introduction

Large language models (LLMs) have rapidly evolved from standalone chat interfaces into components embedded in a wide range of user-facing systems, including productivity tools, decision-support applications, educational technologies, and autonomous, tool-using agents [1]. As a result, interaction with LLM-based systems has become a central design and evaluation concern: users actively shape model behaviour through prompts, contextual information, and oversight mechanisms that vary across settings and levels of autonomy [2].
Despite the growing body of work on prompting, human-centred AI principles, and agentic workflows, the literature remains fragmented, with many studies focusing either on prompting techniques or on system-level interaction structures in isolation [3,4]. This fragmentation leads to heterogeneous terminology and inconsistent comparisons across systems, making it difficult to synthesise evidence, identify transferable design principles, and establish clear evaluation criteria [5]. Consequently, existing surveys tend not to offer an integrated view linking interaction mechanisms with system autonomy across application contexts.
This paper presents a structured review of research on interaction with LLM-based systems, focusing on interaction mechanisms that shape model behaviour and on the level of autonomy during use. It therefore includes studies that examine prompting, conversational interaction, workflow structures, tool-mediated use, or agentic configurations when these elements affect user–system interaction. By contrast, studies primarily concerned with model training, fine-tuning, benchmark performance, dataset construction, corpus development, or low-level optimisation are outside the scope unless they explicitly analyse interaction with an LLM-based system in use. This delimitation positions interaction mechanism and autonomy as the main analytical lens because these two dimensions capture both how users engage with LLM-based systems and how responsibility for action is distributed between the user and the system.
The paper offers three main contributions: (i) a cross-domain synthesis of interaction with LLM-based systems across common interaction patterns and application domains; (ii) an analysis of recurring design and evaluation challenges related to autonomy, oversight, and human-centred use; and (iii) a taxonomy for classifying LLM-based systems by interaction mechanism and level of autonomy, supported by decision rules and worked examples.
The remainder of the paper is organised as follows. Section 2 introduces the conceptual foundations, defines key terminology, and formalises LLM prompting. Section 3 describes the review methodology. Section 4 presents a descriptive synthesis of the reviewed corpus, including applications in software development, robotics, and education. Section 5 proposes a practical taxonomy, and the final section discusses implications, gaps, limitations, and directions for future research.

2. Background

This section provides the conceptual background necessary to understand interaction with LLM-based systems. It introduces LLM characteristics relevant to this review, defines prompting as a mechanism for shaping model behaviour through compositional prompt structure, and outlines a minimal human-centred framing for interpreting how users engage with these systems.

2.1. Large Language Models

LLMs are a class of deep learning systems trained on large volumes of textual data to process, generate, translate and summarize human language [6,7]. They represent some of the most advanced and accessible technologies in natural language processing, capable not only of analysing existing text but also of generating original content in response to user prompts. Their training relies on large and heterogeneous datasets and on learning methods that do not require labelled examples [8], allowing them to acquire broad linguistic and semantic knowledge without manual annotation. These properties make LLMs suitable for interactive, conversational settings in which users express goals, constraints and feedback directly in natural language.
A key characteristic of LLMs is their ability to generalize across tasks through in-context learning, in which models infer task instructions directly from examples embedded in the prompt. This capability emerges from large-scale optimization and exposure to diverse corpora, enabling models to dynamically adapt to task structure. As a result, LLMs can successfully handle tasks for which they were not explicitly trained, distinguishing them from traditional, narrowly scoped NLP systems. At the same time, recent evaluations highlight persistent limitations including hallucinations, incorrect information and sensitivity to prompt formulation, which makes interaction design and user oversight central to reliable use [6].

2.2. Formalizing LLM Prompting

LLM prompting refers to the process of constructing and refining input queries to guide model behaviour and influence the quality of generated outputs [9]. Unlike traditional software systems, where user intentions are expressed through predefined interfaces or programming constructs, prompting relies on natural-language instructions that shape how an LLM interprets a task and structures its output. In this sense, prompting serves as a communication protocol between users and LLMs, and as a flexible mechanism for specifying tasks in natural language. It can also be viewed as a lightweight programming paradigm for intelligent systems [10].
From a conceptual standpoint, a prompt can be understood as a structured input that guides the model behaviour by constraining or enriching the information available during inference. Prompts may include high-level instructions, contextual background, task-specific details, examples of desired behaviour, or explicit requirements regarding output structure and format. Although prompting methods vary widely in practice, they all rely on a common underlying principle: LLM behaviour can be shaped through careful design of the input text provided at inference time.
To formalize this idea, we propose a compositional view of prompt structure. A prompt P(x) can be modelled as a sequence of components, each contributing a distinct function in specifying user intent:
$$P(x) = I(x) \circ C(x) \circ D(x) \circ O(x)$$
where x denotes the user-provided input, ∘ denotes the sequential concatenation of these textual components, and:
  • I(x) represents Instruction—the core directive that communicates the user’s intention;
  • C(x) represents Context—additional information such as domain constraints, background knowledge, or system-level guidance;
  • D(x) represents Data—user-provided inputs, examples or reference content relevant to the task;
  • O(x) represents Output specification—requirements regarding structure, format, style, length, or reasoning transparency.
This model does not define a single “correct” prompt format. Instead, it provides a general framework for analysing how different prompt elements interact. In practice, many prompting strategies can be interpreted as emphasizing one or more components of this structure. LLM prompting is therefore a fundamental component of interaction with LLM-based systems because it allows users to steer model behaviour, reduce errors, manage ambiguity and adapt outputs to specific contexts.
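As a minimal sketch of this compositional view, the following Python snippet assembles a prompt from the four components. The `PromptParts` class, its field names, and the blank-line separator are illustrative assumptions rather than part of the formal model; sequential concatenation is realised here as plain string joining.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptParts:
    """Hypothetical container for the four components I, C, D, O."""
    instruction: str   # I(x): core directive
    context: str       # C(x): constraints, background, system-level guidance
    data: str          # D(x): user inputs, examples, reference content
    output_spec: str   # O(x): structure, format, style, length


def compose(parts: PromptParts, sep: str = "\n\n") -> str:
    """Concatenate the components in the fixed order I, C, D, O.

    Empty components are skipped, mirroring the observation that many
    prompting strategies emphasise only some of the components.
    """
    ordered = [parts.instruction, parts.context, parts.data, parts.output_spec]
    return sep.join(p.strip() for p in ordered if p.strip())


example = PromptParts(
    instruction="Summarise the text below.",
    context="The reader is a first-year student.",
    data="Text: LLMs are trained on large corpora.",
    output_spec="Answer in at most two sentences.",
)
prompt = compose(example)
```

Keeping the components separate until the final join makes it possible to vary one of them (for instance the output specification) while holding the others fixed, which is one way to compare prompting strategies within this framework.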

2.3. Human–AI Interaction

Human–AI interaction research emphasises the design and evaluation of intelligent systems that align with human needs, expectations, and values. A central concept is human-centred AI (HCAI), which highlights that AI systems should support user control, provide trustworthy information, and remain usable for diverse users [11,12]. From this perspective, AI is expected to complement rather than replace human decision-making. Prior research indicates that combining human judgement with AI-generated suggestions can enhance decision-making, creativity, and problem-solving [13]. Accordingly, HCAI prioritises systems that users can understand, control, and rely on, supporting human oversight and promoting trust, transparency, and responsiveness to ethical concerns [14].
In the context of LLM-based systems, these principles manifest in concrete design choices such as prompt formulation, communication of model uncertainty, and the availability of controls over model behaviour. Because LLMs can generate fluent yet incorrect or misleading content, human-centred design requires mechanisms that communicate uncertainty and allow users to revise or override outputs when errors or misalignment are detected. These mechanisms are particularly important because users may overestimate the capabilities of LLM-based systems despite their statistical nature [2].
Prior human–AI interaction research emphasises the importance of preserving human control and interaction-level transparency in increasingly autonomous intelligent systems. In such settings, transparency must support users in monitoring and interpreting system behaviour and, when necessary, intervening in the interaction process [15]. Building on this perspective, this review uses the term interaction mechanisms to refer to the concrete prompting-, interface-, and workflow-level structures through which users guide, constrain, and oversee the behaviour of LLM-based systems during use.

3. Review Methodology

This study adopts a structured, concept-centric literature review approach to synthesise research on interaction with LLM-based systems. The scope is limited to how interaction mechanisms shape the behaviour and autonomy of LLM-based systems in user-facing contexts. Given the rapid growth of research in this area, the review aims to provide a focused yet representative account of relevant interaction research rather than an exhaustive census of all publications. The overall coverage period spans 1 January 2021 to 31 December 2025, reflecting the recent emergence and rapid evolution of LLM ecosystems.

3.1. Database and Search Strategy

To narrow the search while ensuring coverage of representative peer-reviewed literature, the Web of Science (WoS) Core Collection (CC) database was used as the sole source. WoS was selected for its established indexing standards, structured metadata, and broad coverage of influential journals and conference proceedings.
The review was conducted in two stages: an initial search (July 2024) and an update search (January 2026). The update search followed the same query, filters, and eligibility criteria as the initial search to ensure procedural consistency and reproducibility.
The search strategy was designed to capture studies addressing LLMs and their interaction mechanisms, as described in Table 1. Relevant keyword groups were connected using Boolean operators, and truncation (e.g., interact*) was used to cover variations in key terms.
The query with relevant keywords was applied to the Title (TI) and Abstract (AB) fields, while the Topic (TS) field was used for targeted exclusion of purely technical model-development studies. The Title/Abstract focus was chosen to increase thematic precision by prioritising studies in which interaction with LLM-based systems is a central research concern, while the Topic-level exclusion filter was used to remove large volumes of model-centric optimisation work (search query and results available at https://tinyurl.com/wosquerylink, accessed on 15 May 2026).

3.2. Screening and Selection Procedure

3.2.1. Initial Search (July 2024)

The initial search covered the period 1 January 2021–23 July 2024 and returned 1782 records. After restricting the dataset to the selected WoS research areas (Computer Science, Education & Educational Research, Robotics, Psychology, and Behavioral Sciences), the number of records was reduced to 1093. These research areas were selected to prioritise studies in which interaction with LLM-based systems is more likely to be a primary research focus. The restriction was used to improve thematic precision at the search stage by excluding records that passed the initial keyword filter but came from domains in which the search terms appear in different contexts. After applying the English-language filter, 1092 records remained.
To keep the dataset manageable while preserving representative coverage of influential, recent, and query-aligned work, a three-way stratified selection strategy was applied: 100 most cited, 100 most recent, and 100 most relevant records (WoS relevance ranking). These three subsets were selected to capture complementary signals within the rapidly evolving LLM literature:
  • The most cited records were included to reflect established and influential contributions recognised by the research community.
  • The most recent records were incorporated to mitigate citation lag and ensure coverage of emerging work in this fast-moving field.
  • The most relevant records, as ranked by WoS based on term occurrence in the title, abstract, and keywords, were used to strengthen alignment with the search query.
This sampling strategy was used as a pragmatic compromise between breadth, feasibility, and conceptual depth. It was not intended to produce an exhaustive systematic-review corpus, but to support a focused structured synthesis of influential, recent, and query-aligned studies. The strategy may therefore underrepresent studies that were neither highly cited, very recent, nor highly ranked by WoS relevance, including some emerging or domain-specific work. This trade-off was considered acceptable for the aims of the review, which were to identify recurring interaction mechanisms and autonomy patterns rather than to estimate the prevalence of all LLM-based interaction studies. After duplicate removal across the three subsets, 282 records remained for abstract screening.
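The three-way selection with duplicate removal can be sketched as follows. This is an illustrative reconstruction only: the `Record` fields, the `relevance` score (a stand-in for the WoS relevance ranking, whose internals are not specified here), and the toy data are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical bibliographic record; field names are assumptions."""
    record_id: str
    citations: int
    published: date
    relevance: float  # stand-in for a database relevance score


def select_strata(records: list[Record], k: int = 100) -> dict[str, Record]:
    """Take the top-k records per stratum and merge them by record_id."""
    strata = [
        sorted(records, key=lambda r: r.citations, reverse=True)[:k],  # most cited
        sorted(records, key=lambda r: r.published, reverse=True)[:k],  # most recent
        sorted(records, key=lambda r: r.relevance, reverse=True)[:k],  # most relevant
    ]
    merged: dict[str, Record] = {}
    for stratum in strata:
        for rec in stratum:
            merged.setdefault(rec.record_id, rec)  # duplicates collapse here
    return merged


# Toy data: six picks (2 per stratum) collapse to three unique records.
toy = [
    Record("a", 50, date(2022, 1, 1), 0.20),
    Record("b", 5, date(2024, 6, 1), 0.90),
    Record("c", 40, date(2024, 5, 1), 0.10),
    Record("d", 1, date(2021, 3, 1), 0.15),
]
selected = select_strata(toy, k=2)
```

In the toy data, record "d" is never selected because it ranks low on all three criteria, which illustrates the kind of underrepresentation acknowledged above.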
At the abstract screening stage, studies were provisionally retained when interaction with an LLM-based system appeared to be a central focus of analysis, evaluation, or design, rather than merely a background component of the overall system. Following abstract review, 88 studies were retained for full-text assessment. Full-text assessment then applied the same eligibility criteria in greater detail to confirm inclusion and exclude studies whose full content showed a primary focus on model-centric performance or technical optimisation rather than on the interaction process itself. The full-text screening resulted in 44 studies included from the initial search.

3.2.2. Update Search (January 2026)

To capture the most recent developments, an update search covering 24 July 2024–31 December 2025 was performed using the identical protocol. This search returned 7053 records, which were reduced to 4930 after applying the same Research Area filters and to 4906 after limiting to English-language publications.
The same three-way stratified procedure was applied (most cited, most recent, most relevant). After duplicate removal, 290 records remained for abstract screening. Abstract screening yielded 84 studies for full-text review, of which 43 met the inclusion criteria.

3.2.3. Final Corpus

The combined procedure resulted in a final dataset of 87 included studies (44 initial + 43 update) as presented in the PRISMA-inspired [16] flow diagram (Figure 1). Screening was conducted by a single reviewer using predefined eligibility criteria. Borderline cases were revisited to ensure consistent application of the inclusion and exclusion rules across both search stages. To assess screening reliability, a second reviewer independently screened a stratified random subset of 60 records, corresponding to approximately 10.5% of the 572 records screened at the abstract stage. Agreement was 88.3%, with Cohen’s κ = 0.73, indicating substantial agreement.
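For readers unfamiliar with the statistic, the sketch below shows how Cohen's κ is computed from a two-reviewer include/exclude table. The underlying contingency table is not given in the text, so the four counts used here are hypothetical: they were chosen only because they sum to 60 and reproduce the reported summary figures (53 of 60 agreements, κ rounding to 0.73), and they should not be read as the actual screening data.

```python
def cohens_kappa(both_in: int, only_r1: int, only_r2: int, both_out: int) -> float:
    """Cohen's kappa for two raters making a binary include/exclude decision."""
    n = both_in + only_r1 + only_r2 + both_out
    p_observed = (both_in + both_out) / n
    r1_include = (both_in + only_r1) / n   # share of records reviewer 1 included
    r2_include = (both_in + only_r2) / n   # share of records reviewer 2 included
    # Agreement expected by chance if both reviewers decided independently.
    p_expected = r1_include * r2_include + (1 - r1_include) * (1 - r2_include)
    return (p_observed - p_expected) / (1 - p_expected)


# HYPOTHETICAL counts (not from the source), consistent with 53/60 agreement.
kappa = cohens_kappa(both_in=15, only_r1=4, only_r2=3, both_out=38)
observed_agreement = (15 + 38) / 60
```

The gap between raw agreement (about 0.88) and κ (about 0.73) reflects the correction for chance agreement, which is sizeable when most screened records are excluded by both reviewers.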
To make the stratified sampling procedure more transparent, Table 2 reports how the final corpus was distributed across the retained sampling strata after duplicate removal. All three strata contributed to the final set of included studies. The initial search included a larger share of most-recent records, whereas the update search included a larger share of most-cited records, reflecting the rapid maturation of LLM-related literature between the two search stages. Because some records appeared in more than one stratum, the table reports the stratum under which each included study was retained after duplicate removal.

4. Results

This section first summarises the final reviewed corpus (N = 87) and then presents a descriptive synthesis of the interaction patterns and application contexts identified within the included studies. As shown in Figure 2, most studies were published in recent years, reflecting the rapid expansion of research on interaction with LLM-based systems. Building on the Human–AI Interaction perspective introduced in Section 2.3, the synthesis focuses on how interaction is structured across the reviewed studies and how these structures vary between domains.
Studies that addressed interaction mechanisms, prompting methods, workflow structures, or agentic architectures without a clearly bounded application context were treated as cross-domain evidence. These studies were not excluded from the synthesis; rather, they inform Section 4.1, where recurring interaction patterns are analysed across application contexts. Application-focused studies were then synthesised within the most prominent and analytically coherent domains in the reviewed corpus. Software development, robotics, and education were retained as separate domain-level sections because they formed the clearest application-focused clusters and provided sufficiently developed evidence for comparing how interaction mechanisms and autonomy operate under different task structures, risks, and user roles. Other areas, including healthcare, law, and creative industries, appeared less frequently or in more heterogeneous forms and were therefore not treated as separate domain sections. Section 4.1 therefore synthesises cross-domain interaction patterns, while Section 4.2 examines how these patterns are adapted to domain-specific constraints, risks, and evaluation priorities in the three selected domains.
Figure 3 summarises the grouping used for the descriptive synthesis by distinguishing studies that examine interaction mechanisms in multiple contexts from those situated in a primary application domain.
The thematic structure of the reviewed corpus was further examined through keyword co-occurrence analysis. Figure 4 shows the main clusters identified from standardised author keywords and indexed terms. The network places human–AI interaction in close relation to broader LLM-related and interaction-oriented concepts, including “LLM”, “agents”, “AI”, and “NLP”. Rather than presenting LLMs only as model-level technologies, the map points to a literature organised around use, mediation, supervision, and the changing role of users in interaction with increasingly capable systems. This supports the review’s focus on interaction mechanisms and autonomy as two central dimensions for analysing LLM-based systems.

4.1. Cross-Domain Interaction Patterns

Across the reviewed corpus, three recurring interaction patterns emerged: prompting strategies, workflow patterns, and agentic architectures. These patterns differ in how user intent is specified, how intermediate steps are organised, and how much of the task is delegated to the system. Read together, they also indicate a broader shift from single-turn prompt formulation toward structured, tool-supported, and increasingly autonomous forms of interaction, with corresponding implications for human oversight and control.

4.1.1. Prompting Strategies

In the reviewed work, prompting appears as the most basic and widespread interaction mechanism through which users specify intent, provide context, and constrain outputs. Across these studies, prompting functions not merely as question phrasing, but as a structured means of configuring interaction through instructions, examples, contextual grounding, and output requirements. Common strategies include zero-shot and few-shot prompting, while more elaborate forms such as chain-of-thought and self-consistency are used when tasks require intermediate reasoning or greater output stability [17,18,19,20].
More structured prompting mechanisms in the corpus are typically used to reduce ambiguity, stabilise model behaviour across repeated interactions, and improve the inspectability of outputs. Role instructions, templates, constrained output formats, retrieval-augmented prompting, and multimodal reasoning extensions all serve this broader function, even though they differ technically [21,22,23]. A recurring finding across studies is that structured formats can enhance consistency and task compliance, but they do not eliminate the need for human oversight to assess correctness and contextual appropriateness.
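To make two of these strategies concrete, the sketch below builds a few-shot prompt and applies a simplified form of self-consistency (majority voting over repeated samples). The `Q:`/`A:` layout, the function names, and the `fake_sample` stub are assumptions for illustration; a real system would call an LLM with non-zero sampling temperature, and the reasoning paths that precede each final answer are omitted here.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Sequence


def few_shot_prompt(instruction: str,
                    examples: Sequence[tuple[str, str]],
                    query: str) -> str:
    """Build a few-shot prompt; with no examples it reduces to zero-shot."""
    blocks = [instruction]
    for question, answer in examples:
        blocks.append(f"Q: {question}\nA: {answer}")
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


def self_consistent_answer(sample: Callable[[str], str],
                           prompt: str, n: int = 5) -> str:
    """Sample n answers and return the most frequent one (majority vote)."""
    votes = Counter(sample(prompt).strip() for _ in range(n))
    return votes.most_common(1)[0][0]


# Stub sampler standing in for a stochastic LLM call (hypothetical).
_canned = cycle(["4", "5", "4", "4", "3"])


def fake_sample(prompt: str) -> str:
    return next(_canned)


prompt = few_shot_prompt("Add the numbers.", [("1 + 1", "2")], "2 + 2")
answer = self_consistent_answer(fake_sample, prompt, n=5)
```

Voting stabilises the output across samples, but it does not verify the answer: a majority can still be wrong, which is why the studies continue to stress human oversight.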

4.1.2. Workflow Patterns

A second recurring pattern in the corpus is the shift from isolated prompting toward multi-step workflows. These interaction structures decompose tasks into intermediate stages, incorporate feedback, and often connect the model to external tools or representations when a single prompt–response exchange is insufficient for the user’s goal. Approaches such as prompt chaining [24], Tree-of-Thought [25], Automatic Reasoning and Tool-Use [26], Automatic Prompt Engineer [27], Active Prompting [28], Program-Aided Language Models [29], ReAct [30] and Reflexion [31] differ in implementation, but they converge on the same interactional principle: complex tasks are handled through staged reasoning, iterative refinement, and coordination between language generation and action.
Taken together, these studies show that interaction with LLM-based systems is often iterative rather than single-turn, particularly when tasks are exploratory, open-ended, or verification-sensitive. In such cases, effective interaction depends not only on prompt quality, but also on how the workflow decomposes the task, exposes intermediate states, and supports feedback or correction across turns. This marks an important transition from prompt design as isolated input crafting to interaction design as process orchestration [32].
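A minimal sketch of prompt chaining illustrates what exposing intermediate states can mean in practice. The `Model` interface, the `{input}` placeholder convention, and the `echo_model` stub are assumptions made for this example; none of them is drawn from a specific framework in the corpus.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical text-in, text-out interface


def run_chain(model: Model, steps: list[str], task: str) -> list[tuple[str, str]]:
    """Run prompt templates in sequence, feeding each output into the next.

    Every (prompt, output) pair is returned so that a user can inspect,
    and if necessary correct, any intermediate stage.
    """
    trace: list[tuple[str, str]] = []
    current = task
    for template in steps:
        prompt = template.format(input=current)
        current = model(prompt)
        trace.append((prompt, current))
    return trace


def echo_model(prompt: str) -> str:
    """Stub standing in for an LLM call; it only wraps the prompt."""
    return f"<{prompt}>"


trace = run_chain(echo_model, ["Outline: {input}", "Draft: {input}"], "report")
```

Returning the full trace rather than only the final output is the design choice that matters here: it turns the workflow into something a user can review step by step.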

4.1.3. Agentic Architectures

Agentic architectures represent the most complex interaction configurations identified in the reviewed corpus. Unlike direct prompt–response exchanges, they combine reasoning, memory, planning, and tool use in ways that allow LLM-based systems to manage extended task sequences and coordinate multiple intermediate steps. From an interaction perspective, this shifts the user’s role from specifying individual prompts toward supervising a system that maintains state, selects actions, and pursues subgoals within a broader workflow.
More advanced configurations adopt multi-agent frameworks, in which several LLM agents collaborate or negotiate to accomplish shared goals. Such systems use structured conversational protocols that allow agents to exchange messages, refine proposals and coordinate task execution. Some frameworks support internal discussions among multiple agents while presenting a unified conversational interface to the user, enabling sophisticated internal reasoning while preserving a coherent interaction surface [33]. From an interaction perspective, these multi-agent dynamics introduce new forms of mediated collaboration in which users engage with a single interface while the system orchestrates distributed reasoning, evaluation and decision-making behind the scenes.
Taken as a whole, LLM-based agents significantly expand the scope of human–LLM interaction. They introduce autonomy, planning, memory and collaboration into natural-language interfaces, enabling systems to support complex, multi-step workflows rather than isolated conversational turns. This shift marks an important transition in interaction design: users no longer interact only through direct prompt–response exchanges, but increasingly supervise coordinated agent-based systems that structure planning, tool use, and intermediate decision-making.
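The supervisory role described here can be illustrated with a deliberately small agent loop. Everything in the sketch is hypothetical: the `Policy` callable stands in for an LLM that chooses the next action, the tool registry contains a single toy tool, and the step limit is one simple control a supervising user might set.

```python
from typing import Callable, Optional

Tool = Callable[[str], str]
# A policy maps (goal, memory) to (tool_name, argument) or ("finish", answer).
Policy = Callable[[str, list[str]], tuple[str, str]]


def run_agent(policy: Policy, tools: dict[str, Tool], goal: str,
              max_steps: int = 5) -> tuple[Optional[str], list[str]]:
    """Loop: choose an action, call a tool, store the observation in memory."""
    memory: list[str] = []
    for _ in range(max_steps):          # hard bound on autonomous steps
        action, arg = policy(goal, memory)
        if action == "finish":
            return arg, memory
        if action not in tools:
            memory.append(f"error: unknown tool {action}")
            continue
        memory.append(f"{action}({arg}) -> {tools[action](arg)}")
    return None, memory                 # budget exhausted without an answer


def scripted_policy(goal: str, memory: list[str]) -> tuple[str, str]:
    """Stub policy: look the goal up once, then finish with the observation."""
    if not memory:
        return "lookup", goal
    return "finish", memory[-1]


tools = {"lookup": lambda q: q.upper()}
answer, memory = run_agent(scripted_policy, tools, "status")
```

The returned memory doubles as an audit trail, so the user supervises by reading what the system did rather than by writing each prompt.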

4.2. Application Domains

Although the interaction mechanisms outlined above recur across settings, their practical meaning changes substantially with application context. Domain constraints shape what counts as acceptable output, where verification must occur, which failures are consequential, and how user oversight is organised. The reviewed corpus makes this particularly visible in software development, robotics, and education, where prompting, workflows, and autonomy are adapted to substantially different task structures, risks, and user roles.

4.2.1. Software Development

In software development, the reviewed studies position LLMs primarily as workflow-embedded collaborators rather than as standalone code generators. The most common uses involve generating or refining software artefacts such as unit tests, requirement descriptions, alternative implementations, and code explanations, while developers remain responsible for validation and integration [34,35]. Across these studies, the value of the LLM lies less in replacing engineering judgement and more in accelerating artefact production, surfacing alternatives, and supporting routine but verification-sensitive tasks within tool-supported development environments.
Interaction with LLMs is framed as an iterative, prompt-driven collaboration in which the model produces candidate artefacts and explanations, while developers continuously review, adapt and integrate these outputs into established engineering processes [36]. Prompts thereby become a key interface element that translates informal intent into executable code, test cases or refined requirements.
Test-driven interaction patterns further formalise this translation by using intermediate specifications, such as tests, as feedback signals that iteratively guide model output and make verification an integral part of the interaction loop rather than a post hoc activity [37]. At the same time, the studies highlight the need for verification mechanisms, explicit review stages and a clear division of responsibilities that prevent over-reliance on model output and ensure that critical design decisions remain under human control.
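A test-driven interaction loop of this kind can be sketched as follows. The `Generator` interface and the `stub_generator` are assumptions standing in for an LLM call, and `exec` is used only for brevity: a real system would run candidate code in a sandbox, and a developer would still review the accepted candidate.

```python
from typing import Callable, Optional

Generator = Callable[[str, str], str]  # (task, feedback) -> candidate source


def run_tests(source: str, tests: list[tuple[tuple, object]],
              func_name: str) -> str:
    """Return "" if all tests pass, otherwise a message describing the failure."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOTE: unsafe outside a sandbox
        func = namespace[func_name]
        for args, expected in tests:
            got = func(*args)
            if got != expected:
                return f"{func_name}{args} returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return ""


def refine_until_green(generate: Generator, task: str,
                       tests: list[tuple[tuple, object]], func_name: str,
                       max_rounds: int = 3) -> Optional[str]:
    """Feed test failures back to the generator until the tests pass."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        feedback = run_tests(candidate, tests, func_name)
        if not feedback:
            return candidate  # passes tests; human review still follows
    return None


def stub_generator(task: str, feedback: str) -> str:
    """Stub: a buggy first attempt, corrected once feedback is supplied."""
    if feedback:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


result = refine_until_green(stub_generator, "write add", [((1, 2), 3)], "add")
```

Here the tests act as the intermediate specification: the failure message from the first round is exactly the feedback signal that steers the second.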

4.2.2. Robotics

In robotics, LLM-based interaction is shaped by a core challenge that is less pronounced in other domains: natural-language input must be grounded in embodied, safety-relevant action. The reviewed studies therefore treat LLMs not simply as reasoning modules, but as interaction layers that translate high-level, often informal human instructions into executable robot behaviour while also supporting dialogue, clarification, and multimodal interpretation of the environment [38]. Prompt design becomes especially important in this context, because free-form language must be made precise enough for planning and control, often through structured templates, iterative dialogue, and inspectable intermediate plans.
Within these studies, two complementary interaction patterns can be distinguished. In one, the LLM functions as a co-design partner that helps engineers formulate, test, and refine robot behaviours before deployment, turning prompts and example dialogues into reusable design artefacts [39]. In the other, the LLM is embedded directly into the control loop, enabling users to instruct and supervise robots through natural-language or multimodal interfaces [40,41,42]. The contrast matters because these two patterns imply different autonomy profiles, different oversight needs, and different consequences of failure, even though both rely on language-mediated interaction.
Together, these studies show that LLM-mediated human–robot interaction is both promising and fragile. Natural-language interfaces can broaden accessibility and improve collaboration, but only when they are supported by explicit mappings between language and action, inspectable intermediate plans, and robust mechanisms for correction or override. In this domain, interaction design is closely linked to safety design, as higher levels of autonomy increase the importance of clearly separating natural-language interpretation from safety-critical control [43].
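The separation between language interpretation and safety-critical control can be sketched as a gate that sits between an LLM-proposed plan and its execution. The action vocabulary, the `Step` structure, and the callback names below are hypothetical; they only illustrate the combination of an explicit language-to-action mapping, an inspectable plan, and a human approval step.

```python
from dataclasses import dataclass
from typing import Callable

ALLOWED_ACTIONS = {"move_to", "grasp", "release"}  # hypothetical vocabulary


@dataclass(frozen=True)
class Step:
    action: str
    target: str


def validate_plan(plan: list[Step]) -> list[str]:
    """Return readable problems; an empty list means every action is known."""
    return [
        f"step {i}: unknown action '{step.action}'"
        for i, step in enumerate(plan)
        if step.action not in ALLOWED_ACTIONS
    ]


def execute_with_gate(plan: list[Step],
                      approve: Callable[[list[Step]], bool],
                      execute_step: Callable[[Step], None]) -> bool:
    """Execute only plans that pass validation AND explicit human approval."""
    if validate_plan(plan):
        return False          # reject before anything moves
    if not approve(plan):
        return False          # human override
    for step in plan:
        execute_step(step)
    return True


executed: list[Step] = []
good_plan = [Step("move_to", "shelf"), Step("grasp", "cup")]
bad_plan = [Step("move_to", "shelf"), Step("throw", "cup")]

ok = execute_with_gate(good_plan, lambda p: True, executed.append)
rejected = execute_with_gate(bad_plan, lambda p: True, executed.append)
```

Because validation happens on the whole plan before the first step runs, a partially valid plan never reaches the robot, which matters when actions cannot simply be undone.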

4.2.3. Education

In education, the reviewed studies most often frame LLM-based systems as conversational mediators of learning rather than as simple answer generators. Their role is to extend feedback, explanation, and instructional support across ongoing interactions involving students, teachers, and educational content [44]. The LLM is treated less as a static content source and more as a flexible interface that can be embedded into courses to extend teacher presence and give students additional channels for practice and feedback. Sustained, structured critique-revision cycles have been shown to reframe interaction with LLMs from one-shot querying toward collaborative, feedback-driven engagement, encouraging users to actively evaluate and refine model outputs rather than passively accept them [45].
A recurring theme across these studies is that prompting is increasingly treated as an educational competence. Rather than assuming that useful interaction follows automatically from access to an LLM, the literature emphasises the ability of learners and teachers to specify goals, constraints, and contextual cues through iterative prompt construction [46,47,48]. This reframes prompting from a purely technical skill into a pedagogically relevant practice that affects the quality of explanation, feedback, and critical engagement.
The corpus also shows that educational interaction with LLMs raises important questions of safety, trust, and inclusion, especially for users whose vulnerabilities or support needs shape how conversational systems are perceived. Studies involving children suggest that anthropomorphic AI systems may elicit unusually high levels of trust and disclosure [49]. Other work, such as iTutor, shows how LLM-based interfaces can also support accessibility by adapting instructions step by step for older adults learning smartphone tasks [50].
Overall, these studies suggest that interaction design for LLMs in education must balance personalisation and accessibility with mechanisms that protect learners from over-disclosure, over-reliance and low-quality guidance.

4.3. Comparative Analysis Across Domains

Although the three interaction patterns identified in Section 4.1 recur across all three domains, their roles and associated risks differ substantially depending on context. Table 3 summarises how the same interaction mechanisms take on different roles: prompting and workflows support artefact generation and verification in software development, language-to-action grounding in robotics, and feedback-oriented learning in education.
The most instructive contrast concerns prompting. In software development, prompting primarily serves artefact generation and verification: developers produce candidate outputs and then review and integrate them into existing processes. In robotics, the same natural-language input must be grounded in physically executable action, which means that misinterpreted prompts can result in unsafe robot behaviour that cannot simply be undone. In education, prompting is increasingly treated not as a means of task completion but as a competence that learners must develop in its own right. The same mechanism therefore plays a different functional role in each domain and produces qualitatively different types of failure.
A similar contrast applies to autonomy. Software development tolerates the widest range, from advisory suggestions to delegated multi-step execution, because failures remain bounded to digital artefacts and are typically recoverable. In robotics, guided and delegated execution are well supported, but high-autonomy configurations are treated with caution: as autonomy increases, the gap between language interpretation and safe physical action widens, and errors can propagate in ways that are difficult to reverse. In education, the dominant range is advisory to guided execution, which appears to reflect a design principle rather than a limitation, as higher-autonomy configurations raise specific risks around over-reliance and misplaced trust, particularly for vulnerable users such as children. Consequently, oversight requirements also differ: software development relies primarily on review and testing; robotics requires approval mechanisms, interrupt controls, and safety gates; and education requires pedagogical mediation and trust calibration. These domain-level differences provide the empirical grounding for the two dimensions of the taxonomy proposed in the following section.

5. Taxonomy for Interaction with LLM-Based Systems

The preceding analysis suggests that interaction with LLM-based systems varies across domains, interfaces, and system designs, but also reveals recurring patterns in how these systems organise user interaction and system behaviour. Across the reviewed studies, differences repeatedly emerged along two questions: how the user interacts with the system, and how much independent action the system is allowed to take. These two questions therefore provide the basis for the taxonomy proposed in this section.
Accordingly, the taxonomy is organised around two dimensions. Interaction mechanism captures how the LLM is positioned within the interaction, ranging from conversational exploration and reflection to more structured, tool-mediated, and agentic forms of interaction. Level of autonomy captures how independently the system can initiate or carry out actions within a workflow, ranging from advisory support to high-autonomy configurations. Together, these dimensions provide a compact vocabulary for describing LLM-based systems and discussing differences in control and oversight.
To support practical application of the taxonomy, Figure 5 provides a decision tree for assigning a system’s position along the two dimensions. The figure complements the decision rules and boundary conditions presented in Table 4 and Table 5.

5.1. Interaction Mechanism

LLM-based systems are used in many ways, depending on how users define goals, how the interaction is structured, and how outputs are incorporated into a broader task. The interaction mechanism dimension captures the LLM’s primary role in the user–system relationship, regardless of application domain or autonomy level. It focuses on whether interaction is primarily exploratory, oriented toward completion of a defined task, mediated through tools or interfaces, or organised as a multi-step workflow. Table 4 operationalises this dimension by providing decision rules for assigning interaction mechanisms and resolving boundary cases.
Conversational exploration and reflection describes interactions in which the LLM primarily functions as a dialogue partner for exploration, sensemaking, and reflection. The interaction remains within the dialogue: users ask questions, challenge suggestions, and use the model’s responses to broaden their understanding of a problem rather than to produce a specified output with clear completion criteria. The user remains responsible for interpretation and conclusions.
Task-oriented assistance refers to settings in which users formulate relatively well-defined tasks and rely on the LLM to produce concrete outputs such as summaries, drafts, explanations, or code fragments. Here, interaction is still primarily prompt-response based, but it is organised around completing a defined task whose output can be judged for adequacy or correctness. Software development workflows exemplify this pattern: developers request test-case generation, refactoring suggestions, or requirement descriptions, which are then reviewed and integrated into existing engineering processes.
Tool-mediated interaction describes systems in which the LLM is embedded within a broader application and the interaction is primarily structured by interface elements, built-in functions, or tool calls rather than by direct prompting alone. Users mainly engage with the application, while the LLM organises content, adapts responses, or supports information retrieval in the background. In this case, the defining feature is not simply that the LLM assists with a task, but that it reshapes interaction through the surrounding interface and system functions.
Agentic workflows refer to systems in which the LLM pursues a multi-step goal by planning and iteratively executing actions across tools or subprocesses. Users specify higher-level goals and constraints, while the system decomposes tasks, invokes tools, and coordinates intermediate results, sometimes in multi-agent configurations. From the user’s perspective, interaction shifts from managing individual steps to supervising workflow execution.

5.2. Level of Autonomy

While interaction mechanism describes the qualitative shape of collaboration, level of autonomy captures the extent to which LLM-based systems independently select and execute actions and affect external artefacts, tools, or environments. In this taxonomy, autonomy is defined along a spectrum from systems that only suggest options to systems that execute sequences of actions through tools and external interfaces. This dimension is closely linked to questions of control, responsibility, safety and evaluation that recur throughout the reviewed literature. A system may involve a complex interaction mechanism while remaining low in autonomy if execution stays under explicit user control. To support consistent classification across systems, Table 5 summarises the decision rules used to distinguish autonomy levels and handle boundary cases.
Advisory systems generate suggestions, explanations, or alternative formulations, but they do not act on external artefacts or environments. Examples include conversational assistants that support reasoning, educational tools that provide formative feedback, and coding assistants that propose code which developers must review and integrate. Users remain responsible for decisions, execution and verification.
In guided execution, LLMs operate within constrained workflows or interfaces. They may fill templates, chain internal prompts, or manage intermediate representations, but still require explicit user confirmation for consequential steps. Examples include multi-stage summarisation/feedback pipelines, adaptive tutoring tools, and robotics interfaces that surface language-based plans for approval before execution.
Systems with delegated execution can invoke tools, run code, or trigger operations in other applications from natural-language instructions. LLM agents that automate workflow steps, interact with services, or coordinate sensors and actuators exemplify this level. Users set constraints and supervise outcomes, but may not approve every intermediate action. Evaluation should therefore consider output quality alongside safe tool use, robustness, and error recovery.
High-autonomy systems coordinate multiple agents that plan, negotiate, critique, and execute extended action sequences with limited direct intervention. This increases opacity and the risk of cascading failures, particularly when agents can affect external systems. These configurations therefore require strong safeguards, clear oversight, and evaluation of both process and outcomes, above all in safety-critical settings.
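The four levels described above can be read as a small decision procedure. The following Python sketch is a simplification of the prose definitions in this section and does not reproduce the decision rules in Table 5; the names `AutonomyLevel`, `SystemTraits`, and `classify_autonomy`, as well as the three boolean traits, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class AutonomyLevel(Enum):
    ADVISORY = "advisory systems"
    GUIDED = "guided execution"
    DELEGATED = "delegated execution"
    HIGH_AUTONOMY = "high-autonomy execution"


@dataclass(frozen=True)
class SystemTraits:
    # Hypothetical traits paraphrased from the prose definitions (assumptions).
    executes_steps: bool                # system carries out workflow steps, not only suggestions
    confirms_consequential_steps: bool  # user must explicitly confirm consequential steps
    coordinates_multiple_agents: bool   # multiple agents plan and execute extended sequences


def classify_autonomy(traits: SystemTraits) -> AutonomyLevel:
    """Assign an autonomy level from coarse traits (illustrative only)."""
    if not traits.executes_steps:
        # The system only suggests; the user decides, executes, and verifies.
        return AutonomyLevel.ADVISORY
    if traits.confirms_consequential_steps:
        # Execution happens, but consequential steps wait for the user.
        return AutonomyLevel.GUIDED
    if traits.coordinates_multiple_agents:
        # Extended multi-agent execution with limited direct intervention.
        return AutonomyLevel.HIGH_AUTONOMY
    # Tools are invoked without approval of every intermediate action.
    return AutonomyLevel.DELEGATED


if __name__ == "__main__":
    plan_approval_robot = SystemTraits(True, True, False)
    print(classify_autonomy(plan_approval_robot).value)
```

Real systems rarely reduce to three booleans, so a sketch of this kind is best treated as a first-pass screening aid, with boundary cases resolved against the full decision rules.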

5.3. Human-Centred AI Principles

Interaction mechanism and autonomy describe how a system works, but they do not by themselves indicate whether it supports appropriate and trustworthy use. Across the reviewed literature, recurring concerns related to user oversight, transparency of system behaviour, recovery from failure, and user adaptation pointed to a set of cross-cutting human-centred requirements. To interpret the implications of different interaction mechanism × autonomy combinations, we apply an evaluative lens based on four human-centred principles:
  • Control: Users should be able to influence what the system does and stop or reverse it when needed, especially as autonomy increases. This includes simple actions such as accepting or rejecting suggestions, as well as configuring goals and constraints, pausing execution, and reviewing actions before they affect external artefacts or environments.
  • Transparency: Users need cues that make the system’s behaviour understandable, including its current goals, intermediate steps, and sources of information. As autonomy and tool use increase, transparency should include plans, tool calls, and intermediate outputs presented in a way that supports intervention without overwhelming the user.
  • Error handling: LLM-based interaction can fail through misunderstandings, hallucinations, or misaligned inferences. Effective systems support detection and repair through clarification dialogue, iterative refinement, and structured feedback channels, enabling recovery without excessive user burden.
  • User learning: Users learn how to prompt, interpret outputs, and anticipate system behaviour over time. Interaction designs should support this learning and help users maintain calibrated trust, for example by making patterns of success and failure visible.
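As a hedged illustration of how the control, transparency, and error-handling principles above might surface in an implementation, the following Python sketch gates consequential actions behind user approval, records every proposed step, and stops the workflow on rejection or failure. It is not taken from any reviewed system; `ProposedAction`, `execute_with_oversight`, and the `approve` callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    description: str          # human-readable summary shown to the user
    consequential: bool       # True if the action affects external artefacts
    run: Callable[[], str]    # the operation itself


def execute_with_oversight(
    actions: list[ProposedAction],
    approve: Callable[[ProposedAction], bool],
    log: list[str],
) -> list[str]:
    """Run actions in order, asking for approval before consequential ones."""
    results: list[str] = []
    for action in actions:
        # Transparency: every step is recorded before anything happens.
        log.append(f"proposed: {action.description}")
        # Control: consequential steps need explicit user approval.
        if action.consequential and not approve(action):
            log.append(f"rejected: {action.description}")
            break
        # Error handling: a failure is logged and contained, not propagated.
        try:
            results.append(action.run())
            log.append(f"done: {action.description}")
        except Exception as exc:
            log.append(f"failed: {action.description} ({exc})")
            break
    return results


if __name__ == "__main__":
    audit_log: list[str] = []
    plan = [
        ProposedAction("draft summary", False, lambda: "summary text"),
        ProposedAction("send email", True, lambda: "email sent"),
    ]
    # A user who rejects every consequential action.
    outputs = execute_with_oversight(plan, approve=lambda a: False, log=audit_log)
    print(outputs)
    print(audit_log)
```

The visible log also relates to user learning, since a record of what was proposed, rejected, or failed is one way of making patterns of success and failure inspectable over time.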
These principles are not an additional taxonomy dimension, but an evaluative lens for interpreting human-centred requirements, risks, and priorities. Table 6 provides exemplary indicators for their practical evaluation.
These indicators should be adapted to the system’s interaction mechanism and autonomy level. For advisory systems, evaluation may prioritise understandability, calibrated trust, and the ability to accept or reject suggestions. For guided execution, checkpoint quality and intermediate-step visibility become more important. For delegated and high-autonomy execution, evaluation should additionally consider audit trails, interruption success, and recovery after failures.
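The adaptation described above can be expressed as a simple lookup. The sketch below assumes that priorities accumulate as autonomy increases, which is one reading of "more important" and "additionally" in the text rather than a rule stated in Table 6; the names `LEVELS`, `ADDED_PRIORITIES`, and `evaluation_priorities` are hypothetical.

```python
# Autonomy levels in increasing order, following Section 5.2.
LEVELS = ["advisory", "guided", "delegated", "high-autonomy"]

# Priorities that each level adds, paraphrased from the paragraph above.
# High-autonomy execution shares the additions listed for delegated execution.
ADDED_PRIORITIES = {
    "advisory": ["understandability", "calibrated trust", "accept/reject suggestions"],
    "guided": ["checkpoint quality", "intermediate-step visibility"],
    "delegated": ["audit trails", "interruption success", "recovery after failures"],
    "high-autonomy": [],
}


def evaluation_priorities(level: str) -> list[str]:
    """Return the cumulative evaluation priorities for an autonomy level."""
    if level not in LEVELS:
        raise ValueError(f"unknown autonomy level: {level!r}")
    priorities: list[str] = []
    # Assumption: lower-level priorities remain relevant at higher levels.
    for name in LEVELS[: LEVELS.index(level) + 1]:
        priorities.extend(ADDED_PRIORITIES[name])
    return priorities


if __name__ == "__main__":
    for level in LEVELS:
        print(level, "->", evaluation_priorities(level))
```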

5.4. Application of the Taxonomy to Representative Systems

To examine the practical applicability of the taxonomy, we conducted a preliminary applicability check using representative systems from the reviewed corpus. The aim was not to provide a full inter-rater validation, but to test whether the decision rules could be applied consistently to systems with different interaction mechanisms and autonomy levels. The classifications were first assigned by the lead author and then reviewed by the co-authors to check conceptual consistency and resolve ambiguous cases. Table 7 summarises the classification of representative systems, while the following worked examples illustrate the reasoning in more detail.

5.4.1. PromptChainer

PromptChainer [52] is a visual interface for composing multi-step prompt chains. Users create and connect nodes, edit and test each step, and run the full chain while inspecting intermediate outputs. It is best classified as tool-mediated interaction because the workflow is structured through the interface rather than free-form chat. Prompts are embedded in UI elements and executed as a user-authored chain, with explicit support for decomposition and debugging. In terms of autonomy, PromptChainer is best classified as guided execution, because the system runs chained prompts within a constrained workflow that the user authors, triggers, and inspects step by step. The most relevant human-centred considerations are transparency and error handling, supported by inspectable intermediate states and debugging features.

5.4.2. AutoGen

AutoGen [33] is an open-source framework for building LLM applications via multi-agent conversation, where role-specialised agents collaborate on tasks and may execute tools or code. It is best classified as agentic workflows in terms of interaction mechanism and as high-autonomy execution in terms of autonomy, because users typically set goals and constraints while agents decompose tasks, execute steps, and iteratively refine results with limited turn-by-turn input. The key human-centred requirements are control and error handling, as such systems require clear intervention and stop mechanisms.
The worked examples illustrate how the taxonomy can be applied in detail. Figure 6 complements this view by visualising Table 7’s systems within the proposed taxonomy according to interaction mechanism (x-axis) and level of autonomy (y-axis).
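For readers who wish to record such classifications in a machine-readable form, the two worked examples can be encoded as positions on the two dimensions. The Python sketch below is illustrative only: the class names and the ordinal values 1 to 4 are assumptions that follow the order in which the categories are introduced in Sections 5.1 and 5.2, and they are not claimed to match the axis scaling of Figure 6.

```python
from dataclasses import dataclass
from enum import IntEnum


class InteractionMechanism(IntEnum):
    CONVERSATIONAL_EXPLORATION = 1
    TASK_ORIENTED_ASSISTANCE = 2
    TOOL_MEDIATED_INTERACTION = 3
    AGENTIC_WORKFLOWS = 4


class AutonomyLevel(IntEnum):
    ADVISORY = 1
    GUIDED_EXECUTION = 2
    DELEGATED_EXECUTION = 3
    HIGH_AUTONOMY_EXECUTION = 4


@dataclass(frozen=True)
class TaxonomyPosition:
    system: str
    mechanism: InteractionMechanism
    autonomy: AutonomyLevel
    key_principles: tuple[str, ...]

    def coordinates(self) -> tuple[int, int]:
        """(x, y) position: mechanism on the x-axis, autonomy on the y-axis."""
        return (int(self.mechanism), int(self.autonomy))


# Classifications as stated in the worked examples above.
WORKED_EXAMPLES = [
    TaxonomyPosition(
        "PromptChainer",
        InteractionMechanism.TOOL_MEDIATED_INTERACTION,
        AutonomyLevel.GUIDED_EXECUTION,
        ("transparency", "error handling"),
    ),
    TaxonomyPosition(
        "AutoGen",
        InteractionMechanism.AGENTIC_WORKFLOWS,
        AutonomyLevel.HIGH_AUTONOMY_EXECUTION,
        ("control", "error handling"),
    ),
]

if __name__ == "__main__":
    for position in WORKED_EXAMPLES:
        print(position.system, position.coordinates(), position.key_principles)
```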

6. Discussion

This review indicates that interaction with LLM-based systems cannot be understood only in terms of model capability. Across the reviewed studies, outcomes depend not only on what the model can generate, but also on how interaction is structured and on how much autonomy the system is permitted to exercise [2]. That observation is the central motivation for the taxonomy proposed in the paper. By separating interaction mechanism from level of autonomy, the taxonomy offers a more precise basis for comparing systems that may appear similar at the level of model or application domain, yet differ substantially in workflow organisation, oversight requirements, and responsibility [5].
This analytical lens is appropriate because many reviewed studies differ less in model architecture than in how the model is made available to users and how much action is delegated to the system. Educational tools, coding assistants, robot interfaces, and multi-agent workflows may rely on similar LLM capabilities, but they involve different levels of user involvement, verification burden, and risk exposure. The interaction mechanism × autonomy distinction therefore provides a common basis for comparing these systems without treating them as the same type of interaction.
A clear pattern across the corpus is the shift from simple prompt-response exchange toward more structured and workflow-oriented forms of interaction, including tool-mediated and agentic configurations [33,57]. Recent work on orchestrated LLM workflows illustrates this shift by decomposing interaction into specialised stages for extraction, scoring, and task execution, making modularity and auditability part of the interaction design problem [58]. This shift matters because it changes the unit of analysis. In low-autonomy settings, evaluation may focus primarily on output quality or usefulness. In more complex systems, however, the interaction process itself becomes central: intermediate steps, tool use, planning behaviour, and opportunities for user intervention all shape whether the system is understandable, controllable, and reliable in practice. As autonomy increases, assessing final outputs alone becomes insufficient.

6.1. Design Implications

Across domains, the literature further suggests that LLMs are most effective when positioned as cognitive partners rather than fully independent problem-solvers. Stronger outcomes are typically reported when users retain responsibility for goals, interpretation, and final judgement, while the system contributes suggestions, reformulations, guidance, or bounded forms of automation [54]. The same pattern recurs in education, software development, and robotics despite differences in context and task structure. It suggests that the main value of LLM-based systems often lies in augmenting human judgement rather than replacing it. Natural language interaction is especially important in this respect, as it allows users to express constraints, preferences, examples, and refinements in a flexible and accessible form. Several studies also point to the importance of personalisation in adapting explanations or interaction style to user needs [53,59]. At the same time, software development and robotics research indicates that LLM support is most useful when embedded into existing tools and workflows rather than treated as a detached text-generation interface [60]. This is especially visible in robotics, where recent work on LLM-based autonomy emphasises the integration of language interaction with motion, navigation, manipulation, voice-based input, and deployment constraints such as latency, scalability, and privacy [61].
A related insight is that prompting increasingly functions as a control surface for interaction [62]. Through prompts, users specify intent, scope, constraints, and success criteria, thereby shaping not only the content of outputs but also the behaviour of the system within the task. In this sense, prompt design is not merely an input technique but a core mechanism of human control. The I/C/D/O decomposition introduced earlier helps make this visible by providing a compact descriptive lens for analysing how users structure interaction through language across different systems and domains.
At the same time, the literature consistently highlights reliability, controllability, and transparency as persistent challenges [63]. LLM-based systems can produce outputs that are fluent yet incorrect, incomplete, or poorly grounded, and their behaviour may shift with relatively small changes in prompt wording or context. Such behaviour creates a verification burden for users, who must often judge the quality of plausible-sounding responses under uncertainty. The issue is therefore not only model accuracy but also trust calibration. Systems that obscure uncertainty or present responses with unwarranted confidence risk encouraging over-reliance. From a design perspective, the literature underscores the importance of interfaces that expose system state, support revision, and allow users to inspect, challenge, or override outputs rather than simply accept them.
The review also suggests a mismatch between how prompting is often conceptualised in the literature and how many users actually engage with LLMs. Prompting is frequently treated as a specialised skill akin to programming or formal instruction writing, whereas many users approach these systems through ordinary conversational habits [64]. That mismatch matters because it shifts attention from user deficiency to design responsibility. Progress should not depend only on teaching users to prompt more effectively; it should also involve designing interaction structures that make effective use more accessible. Templates, structured inputs, previews of intended actions, intermediate checkpoints, and feedback mechanisms may all help reduce ambiguity and support better performance, particularly for non-expert users.
These findings also clarify the broader value of the proposed taxonomy. Interaction mechanism and autonomy are not only descriptive dimensions; they help explain why different systems demand different forms of control, transparency, recovery support, and user adaptation. A conversational assistant that provides recommendations raises different design and evaluation questions from a multi-agent system that plans and executes tasks across external tools. Treating both simply as “LLM applications” obscures important differences in oversight, failure modes, and accountability. The taxonomy therefore provides a framework for linking system configuration to human-centred requirements, while the four principles (control, transparency, error handling, and user learning) offer a practical lens for interpreting risks, support needs, and evaluation priorities across different interaction mechanism × autonomy combinations.
This point is especially important because evaluation remains uneven across the reviewed literature. Many studies report task performance, output quality, or user satisfaction, but give less attention to interaction quality and process-level properties such as inspectability, recoverability, and calibrated trust. This becomes more problematic as autonomy increases. In tool-using and agentic systems, failures may propagate across multiple steps, affect external artefacts, or remain difficult to detect once the interaction extends beyond a single exchange. Future work should evaluate not only whether such systems achieve a desired outcome, but how that outcome is produced: whether intermediate steps are visible, whether users can intervene meaningfully, whether errors are contained, and whether recovery is practical. The human-centred lens proposed in Section 5 provides a concise basis for making these requirements more explicit.

6.2. Hybrid Systems

Systems that combine multiple interaction forms should be handled by separating descriptive classification from risk interpretation. For descriptive classification, systems should be positioned according to the dominant interaction configuration that structures the primary user workflow. For human-centred evaluation, however, secondary mechanisms involving higher autonomy, consequential tool use, or reduced user confirmation should also be reported, because they may change oversight, transparency, error-handling, and accountability requirements. This distinction preserves the practical clarity of the taxonomy while acknowledging that real-world systems may combine multiple interaction forms.
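The separation between descriptive classification and risk interpretation can be made concrete with a small data structure. In the following Python sketch, a hybrid system is positioned by its dominant configuration, while secondary configurations are flagged for reporting when they involve higher autonomy, consequential tool use, or reduced user confirmation. All names (`Configuration`, `HybridSystem`, `secondary_to_report`) and the example system are hypothetical.

```python
from dataclasses import dataclass, field

# Autonomy levels in increasing order, following Section 5.2.
AUTONOMY_ORDER = ("advisory", "guided", "delegated", "high-autonomy")


@dataclass(frozen=True)
class Configuration:
    mechanism: str
    autonomy: str
    consequential_tool_use: bool = False
    user_confirmation: bool = True


@dataclass
class HybridSystem:
    name: str
    dominant: Configuration
    secondary: list[Configuration] = field(default_factory=list)

    def classification(self) -> tuple[str, str]:
        """Descriptive classification uses the dominant configuration only."""
        return (self.dominant.mechanism, self.dominant.autonomy)

    def secondary_to_report(self) -> list[Configuration]:
        """Secondary configurations that change oversight requirements."""
        base = AUTONOMY_ORDER.index(self.dominant.autonomy)
        return [
            c
            for c in self.secondary
            if AUTONOMY_ORDER.index(c.autonomy) > base
            or c.consequential_tool_use
            or (self.dominant.user_confirmation and not c.user_confirmation)
        ]


if __name__ == "__main__":
    # Hypothetical coding assistant with an optional agent mode.
    assistant = HybridSystem(
        name="example coding assistant",
        dominant=Configuration("task-oriented assistance", "advisory"),
        secondary=[
            Configuration("conversational exploration", "advisory"),
            Configuration("agentic workflows", "delegated",
                          consequential_tool_use=True, user_confirmation=False),
        ],
    )
    print(assistant.classification())
    print([c.mechanism for c in assistant.secondary_to_report()])
```

Keeping the two outputs separate mirrors the distinction drawn above: the first supports comparison across systems, and the second supports human-centred evaluation.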

6.3. Ethical Compliance and Domain Adaptation

Ethical requirements for LLM-based interaction systems are strongly domain-dependent. Recent structured reviews similarly emphasise that LLM deployment raises technical, social, ethical, and legal challenges, including interpretability, biased outputs, privacy and data security, and domain-specific risks in areas such as healthcare, law, media, and education [65]. Across domains, designers must therefore account for biased or stereotypical outputs, privacy leakage, inappropriate use of personal data, and over-reliance on fluent but unreliable responses.
However, the implications of these risks vary substantially by context. In education, especially where children or vulnerable learners are involved, interaction design should limit unnecessary disclosure, support age-appropriate explanations, and ensure that teachers or guardians retain meaningful oversight. In robotics, the main ethical concern is not only the quality of language interpretation, but also the possibility that misinterpreted instructions may lead to unsafe physical actions. LLM-mediated robot systems therefore require explicit safety constraints, interrupt mechanisms, and separation between natural-language interpretation and safety-critical control. Similar domain adaptation is needed in other high-stakes settings, such as healthcare or legal support, where privacy, accountability, and professional responsibility impose stricter requirements than in exploratory or low-risk applications. These examples reinforce the need to evaluate LLM-based systems not only by task performance, but also by whether their interaction mechanisms and autonomy levels are appropriate for the domain in which they are deployed.

6.4. Limitations

This review has several limitations. The corpus is constrained by Web of Science coverage and by the indexing conditions of that database, which may exclude relevant computer science and human–computer interaction work disseminated through repositories or venues not indexed, or not yet indexed, in WoS. The three-way stratified sampling strategy supported a focused synthesis of influential, recent, and query-aligned work, but may have underrepresented studies that fell outside all three strata, being neither highly cited, nor very recent, nor strongly ranked by WoS relevance. The findings should therefore not be interpreted as an exhaustive estimate of the prevalence of different LLM-based interaction systems, but as a structured synthesis of recurring interaction mechanisms and autonomy patterns within the selected corpus. Finally, the coding and interpretive synthesis involved judgement, particularly in borderline cases where systems combined multiple interaction patterns or autonomy characteristics. These limitations do not invalidate the review, but they do suggest that the taxonomy should be understood as a grounded analytical framework rather than an exhaustive representation of the field.
The domain-level synthesis is also necessarily selective. Although cross-domain studies were incorporated into the analysis of general interaction patterns, only software development, robotics, and education were treated as separate application domains because they formed the most coherent clusters in the reviewed corpus. As a result, the review does not provide separate domain-level syntheses for areas such as healthcare, law, or creative industries, where LLM-based interaction may involve distinct risks, regulatory constraints, and evaluation requirements. Future work should apply and test the taxonomy more explicitly in these high-stakes and domain-specific contexts.
A further limitation concerns the rapid evolution of LLM capabilities. As newer models increasingly support longer context, multimodal input, tool use, memory, planning, and autonomous action, the same system may shift toward a different interaction mechanism or autonomy level over time.

6.5. Future Research

Within these limits, the review contributes a shared vocabulary for comparing, designing, and evaluating interaction with LLM-based systems across domains. In practical terms, the taxonomy can help researchers and system designers identify a system’s dominant interaction mechanism and autonomy level, compare alternative interaction designs, select appropriate oversight strategies, and define evaluation priorities. Three directions appear particularly important for future research. First, the taxonomy should be further validated through clearer operational decision rules, inter-rater reliability testing, and application to broader corpora. Second, more longitudinal and real-world studies are needed to examine how prompting practices, user reliance, and interaction routines evolve over time. Third, higher-autonomy systems require stronger mechanisms for oversight and safety, including transparent intermediate steps, controllable tool use, and robust failure containment. More broadly, the findings suggest that meaningful comparison of LLM-based systems requires attention not only to what models can do, but also to how interaction is organised and how responsibility is distributed between user and system. Future research should also test the taxonomy in emerging domains where autonomy, oversight, and accountability requirements may differ substantially.

Author Contributions

Conceptualization, D.N., S.M. and A.G.; Methodology, D.N.; Formal Analysis, D.N.; Writing—Original Draft, D.N.; Writing—Review & Editing, A.G. and S.M.; Supervision, A.G.; Funding Acquisition, S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the project “Artificial Intelligence for Future Technical Education and Industrial Competitiveness (AITECH), IP-UNIST-48”, funded by the European Union—NextGenerationEU through the National Recovery and Resilience Plan 2021–2026. The views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analysed in this study. Data sharing is not applicable to this article.

Acknowledgments

During the preparation of this work, the authors used ChatGPT 5.4 (OpenAI) in order to improve the clarity, grammar, and language of the manuscript. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Procko, T.T.; Elvira, T.; Ochoa, O. Dawn of the dialogue: AI’s leap from lab to living room. Front. Artif. Intell. 2024, 7, 1308156. [Google Scholar] [CrossRef] [PubMed]
  2. Amershi, S.; Weld, D.; Vorvoreanu, M.; Fourney, A.; Nushi, B.; Collisson, P.; Suh, J.; Iqbal, S.; Bennett, P.N.; Inkpen, K.; et al. Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems; Association for Computing Machinery: New York, NY, USA, 2019. [Google Scholar]
  3. Khan, N.; Khan, Z.; Koubaa, A.; Khan, M.K.; Salleh, R.B. Global insights and the impact of generative AI-ChatGPT on multidisciplinary: A systematic review and bibliometric analysis. Connect. Sci. 2024, 36. [Google Scholar] [CrossRef]
  4. Al-Hasan, T.M.; Sayed, A.N.; Bensaali, F.; Himeur, Y.; Varlamis, I.; Dimitrakopoulos, G. From Traditional Recommender Systems to GPT-Based Chatbots: A Survey of Recent Developments and Future Directions. Big Data Cogn. Comput. 2024, 8, 36. [Google Scholar] [CrossRef]
  5. Theofanos, M.; Choong, Y.; Jensen, T. AI Use Taxonomy: A Human-Centered Approach; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2024.
  6. Yenduri, G.; Ramalingam, M.; Selvi, G.C.; Supriya, Y.; Srivastava, G.; Maddikunta, P.K.R.; Raj, G.D.; Jhaveri, R.H.; Prabadevi, B.; Wang, W.; et al. GPT (Generative Pre-Trained Transformer)—A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions. IEEE Access 2024, 12, 54608–54649. [Google Scholar] [CrossRef]
  7. Douglas, M.R. Large Language Models. arXiv 2023, arXiv:2307.05782. [Google Scholar]
  8. Fui-Hoon Nah, F.; Zheng, R.; Cai, J.; Siau, K.; Chen, L. Generative AI and ChatGPT: Applications, challenges, and AI-human collaboration. J. Inf. Technol. Case Appl. Res. 2023, 25, 277–304. [Google Scholar] [CrossRef]
  9. DAIR.AI. Prompt Engineering Guide. 29 May 2024. Available online: https://www.promptingguide.ai (accessed on 29 May 2024).
  10. Beurer-Kellner, L.; Fischer, M.; Vechev, M. Prompting Is Programming: A Query Language for Large Language Models. Proc. ACM Program. Lang. 2023, 7, 1946–1969. [Google Scholar] [CrossRef]
  11. National Institute of Standards and Technology—NIST. Human-Centered AI. 23 April 2024. Available online: https://www.nist.gov/programs-projects/human-centered-ai (accessed on 14 July 2024).
  12. Interaction Design Foundation—IxDF. What is Human-Centered AI (HCAI)? Available online: https://www.interaction-design.org/literature/topics/human-centered-ai (accessed on 14 July 2024).
  13. Salikutluk, V.; Koert, D.; Jaekel, F. Interacting with Large Language Models: A Case Study on AI-Aided Brainstorming for Guesstimation Problems. In HHAI 2023: Augmenting Human Intellect; IOS Press: Amsterdam, The Netherlands, 2023. [Google Scholar] [CrossRef]
  14. Capel, T.; Brereton, M. What is Human-Centered about Human-Centered AI? A Map of the Research Landscape. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems; Association for Computing Machinery: New York, NY, USA, 2023. [Google Scholar]
  15. Stephanidis, C.; Salvendy, G.; Antona, M.; Duffy, V.G.; Gao, Q.; Karwowski, W.; Konomi, S.; Nah, F.; Ntoa, S.; Rau, P.-L.P.; et al. Seven HCI Grand Challenges Revisited: Five-Year Progress. Int. J. Human–Computer Interact. 2025, 41, 11947–11995. [Google Scholar] [CrossRef]
  16. Moher, D.; Liberati, A.; Tetzlaff, J.; Altman, D.G. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. Int. J. Surg. 2010, 8, 336–341. [Google Scholar] [CrossRef]
  17. Wei, J.; Bosma, M.; Zhao, V.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models Are Zero-Shot Learners. arXiv 2021, arXiv:2109.01652. [Google Scholar]
  18. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  19. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.H.-H.; Xia, F.; Le, Q.; Zhou, D. Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar]
  20. Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.H.-H.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv 2022, arXiv:2203.11171. [Google Scholar]
  21. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv 2020, arXiv:2005.11401. [Google Scholar]
  22. Zhang, Z.; Zhang, A.; Li, M.; Zhao, H.; Karypis, G.; Smola, A.J. Multimodal Chain-of-Thought Reasoning in Language Models. arXiv 2023, arXiv:2302.00923. [Google Scholar]
  23. Vera-Amaro, G.; Rojano-Cáceres, J.R. Accessible Web Content Generation Using LLMs: An Empirical Study on Prompting Strategies and Template-Guided Remediation. IEEE Lat. Am. Trans. 2025, 23, 1230–1239. [Google Scholar] [CrossRef]
  24. Anthropic. Chain Prompts. Available online: https://docs.anthropic.com/en/docs/chain-prompts (accessed on 14 July 2024).
  25. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv 2023, arXiv:2305.10601. [Google Scholar] [CrossRef]
  26. Paranjape, B.; Lundberg, S.M.; Singh, S.; Hajishirzi, H.; Zettlemoyer, L.; Ribeiro, M.T. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv 2023, arXiv:2303.09014. [Google Scholar]
  27. Zhou, Y.; Muresanu, A.I.; Han, Z.; Paster, K.; Pitis, S.; Chan, H.; Ba, J. Large Language Models Are Human-Level Prompt Engineers. arXiv 2022, arXiv:2211.01910. [Google Scholar]
  28. Diao, S.; Wang, P.; Lin, Y.; Zhang, T. Active Prompting with Chain-of-Thought for Large Language Models. arXiv 2023, arXiv:2302.12246. [Google Scholar]
  29. Gao, L.; Madaan, A.; Zhou, S.; Alon, U.; Liu, P.; Yang, Y.; Callan, J.; Neubig, G. PAL: Program-aided Language Models. arXiv 2022, arXiv:2211.10435. [Google Scholar]
  30. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing reasoning and acting in language models. arXiv 2022, arXiv:2210.03629. [Google Scholar]
  31. Shinn, N.; Cassano, F.; Labash, B.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. arXiv 2023, arXiv:2303.11366. [Google Scholar] [CrossRef]
  32. Song, Y.; Lu, J.; Wong, R.C.-W. CoVis: Neural and LLM-Driven Multi-Turn Interactions for Conversational Text-to-Visualization Generation. VLDB J. 2025, 35, 3. [Google Scholar] [CrossRef]
  33. Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Zhang, S.; Zhu, E.; Li, B.; Jiang, L.; Zhang, X.; Wang, C. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. arXiv 2023, arXiv:2308.08155. [Google Scholar]
  34. Munley, C.; Jarmusch, A.; Chandrasekaran, S. LLM4VV: Developing LLM-driven testsuite for compiler validation. Futur. Gener. Comput. Syst.-Int. J. Escience 2024, 160, 1–13. [Google Scholar] [CrossRef]
  35. Zan, D.; Chen, B.; Zhang, F.; Lu, D.; Wu, B.; Guan, B.; Wang, Y.; Lou, J.-G. Large Language Models Meet NL2Code: A Survey. arXiv 2022, arXiv:2212.09420. [Google Scholar]
  36. Marques, N.; Silva, R.R.; Bernardino, J. Using ChatGPT in Software Requirements Engineering: A Comprehensive Review. Futur. Internet 2024, 16, 180. [Google Scholar] [CrossRef]
  37. Fakhoury, S.; Naik, A.; Sakkas, G.; Chakraborty, S.; Lahiri, S.K. LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation. IEEE Trans. Softw. Eng. 2024, 50, 2254–2268. [Google Scholar] [CrossRef]
  38. Kim, Y.; Kim, D.; Choi, J.; Park, J.; Oh, N.; Park, D. A survey on integration of large language models with intelligent robots. Intell. Serv. Robot. 2024, 17, 1091–1107. [Google Scholar] [CrossRef]
  39. Vemprala, S.H.; Bonatti, R.; Bucker, A.; Kapoor, A. ChatGPT for Robotics: Design Principles and Model Abilities. IEEE Access 2024, 12, 55682–55696. [Google Scholar] [CrossRef]
  40. Ye, Y.; You, H.; Du, J. Improved Trust in Human-Robot Collaboration With ChatGPT. IEEE Access 2023, 11, 55748–55754. [Google Scholar] [CrossRef]
  41. Zhao, X.; Li, M.; Weber, C.; Hafez, M.B.; Wermter, S. Chat with the Environment: Interactive Multimodal Perception Using Large Language Models. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023. [Google Scholar]
  42. Lai, Y.; Yuan, S.; Nassar, Y.; Fan, M.; Gopal, A.; Yorita, A.; Kubota, N.; Rätsch, M. Natural Multimodal Fusion-Based Human–Robot Interaction: Application With Voice and Deictic Posture via Large Language Model. IEEE Robot. Autom. Mag. 2025, 2–11. [Google Scholar] [CrossRef]
  43. Frering, L.; Steinbauer-Wagner, G.; Holzinger, A. Integrating Belief-Desire-Intention agents with large language models for reliable human–robot interaction and explainable Artificial Intelligence. Eng. Appl. Artif. Intell. 2024, 141, 109771. [Google Scholar] [CrossRef]
  44. Pelaez-Sanchez, I.C.; Velarde-Camaqui, D.; Glasserman-Morales, L.D. The impact of large language models on higher education: Exploring the connection between AI and Education 4.0. Front. Educ. 2024, 9, 1392091. [Google Scholar] [CrossRef]
  45. Oppenheimer, D.M.; Cash, T.N.; Pensky, A.E.C. You’ve Got AI Friend in Me: LLMs as Collaborative Learning Partners. Int. J. Artif. Intell. Educ. 2025, 35, 3896–3921. [Google Scholar] [CrossRef]
  46. Mai, D.T.T.; Da, C.V.; Hanh, N.V. The use of ChatGPT in teaching and learning: A systematic review through SWOT analysis approach. Front. Educ. 2024, 9, 1328769. [Google Scholar] [CrossRef]
  47. Hwang, G.-J.; Chen, N.-S. Editorial Position Paper: Exploring the Potential of Generative Artificial Intelligence in Education: Applications, Challenges, and Future Research Directions. Educ. Technol. Soc. 2023, 26, 1–18. [Google Scholar]
  48. Bozkurt, A. Unleashing the Potential of Generative AI, Conversational Agents and Chatbots in Educational Praxis: A Systematic Review and Bibliometric Analysis of GenAI in Education. Open Prax. 2023, 15, 261–270. [Google Scholar] [CrossRef]
  49. Kurian, N. ‘No, Alexa, no!’: Designing child-safe AI and protecting children from the risks of the ‘empathy gap’ in large language models. Learn. Media Technol. 2024, 50, 621–634. [Google Scholar] [CrossRef]
  50. Zou, R.; Ye, Z.; Ye, C. iTutor: A Generative Tutorial System for Teaching the Elders to Use Smartphone Applications. In Adjunct Proceedings of the 36th Annual ACM Symposium on User Interface Software & Technology, UIST 2023 Adjunct; Association for Computing Machinery: New York, NY, USA, 2023. [Google Scholar]
  51. Görer, B.; Aydemir, F.B. Generating Requirements Elicitation Interview Scripts with Large Language Models. In Proceedings of the 2023 IEEE 31st International Requirements Engineering Conference Workshops (REW), Hannover, Germany, 4–5 September 2023. [Google Scholar]
  52. Wu, T.; Jiang, E.; Donsbach, A.; Gray, J.; Molina, A.; Terry, M.; Cai, C.J. PromptChainer: Chaining Large Language Model Prompts through Visual Programming. In Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems, CHI 2022; Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar]
  53. Shu, Y.; Zhang, H.; Gu, H.; Zhang, P.; Lu, T.; Li, D.; Gu, N. RAH! RecSys–Assistant–Human: A Human-Centered Recommendation Framework With LLM Agents. IEEE Trans. Comput. Soc. Syst. 2024, 11, 6759–6770. [Google Scholar] [CrossRef]
  54. Lee, M.; Liang, P.; Yang, Q. CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI’ 22); Association for Computing Machinery: New York, NY, USA, 2022. [Google Scholar]
  55. Wang, X.; Huey, S.L.; Sheng, R.; Mehta, S.; Wang, F. SciDaSynth: Interactive Structured Data Extraction From Scientific Literature With Large Language Model. Campbell Syst. Rev. 2025, 21, e70073. [Google Scholar] [CrossRef]
  56. Wang, B.; Li, G.; Li, Y. Enabling Conversational Interaction with Mobile UI using Large Language Models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems; Association for Computing Machinery: New York, NY, USA, 2023. [Google Scholar]
  57. Tao, W.; Zhou, Y.; Zhang, W.; Cheng, Y. MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution. arXiv 2024, arXiv:2403.17927. [Google Scholar]
  58. Trimigno, G.; Lombardo, G.; Tomaiuolo, M.; Cagnoni, S.; Poggi, A. LLMs in Staging: An Orchestrated LLM Workflow for Structured Augmentation with Fact Scoring. Futur. Internet 2025, 17, 535. [Google Scholar] [CrossRef]
  59. Chen, J.; Liu, Z.; Huang, X.; Wu, C.; Liu, Q.; Jiang, G.; Pu, Y.; Lei, Y.; Chen, X.; Wang, X.; et al. When large language models meet personalization: Perspectives of challenges and opportunities. World Wide Web 2024, 27, 42. [Google Scholar] [CrossRef]
  60. Liu, H.; Zhu, Y.; Kato, K.; Tsukahara, A.; Kondo, I.; Aoyama, T.; Hasegawa, Y. Enhancing the LLM-Based Robot Manipulation Through Human-Robot Collaboration. IEEE Robot. Autom. Lett. 2024, 9, 6904–6911. [Google Scholar] [CrossRef]
  61. Liu, Y.; Sun, Q.; Kapadia, D.R. Integrating Large Language Models into Robotic Autonomy: A Review of Motion, Voice, and Training Pipelines. AI 2025, 6, 158. [Google Scholar] [CrossRef]
  62. Zamfirescu-Pereira, J.D.; Wong, R.; Hartmann, B.; Yang, Q. Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI 2023); Association for Computing Machinery: New York, NY, USA, 2023. [Google Scholar]
  63. Ngu, N.; Lee, N.; Shakarian, P. Diversity Measures: Domain-Independent Proxies for Failure in Language Model Queries. In Proceedings of the 2024 IEEE 18th International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, 5–7 February 2024. [Google Scholar]
  64. Bridgelall, R. Unraveling the mysteries of AI chatbots. Artif. Intell. Rev. 2024, 57, 89. [Google Scholar] [CrossRef]
  65. Peykani, P.; Ramezanlou, F.; Tanasescu, C.; Ghanidel, S. Large Language Models: A Structured Taxonomy and Review of Challenges, Limitations, Solutions, and Future Directions. Appl. Sci. 2025, 15, 8103. [Google Scholar] [CrossRef]
Figure 1. PRISMA-inspired flow diagram of the study selection process.
Figure 2. Distribution of included studies by publication year (N = 87).
Figure 3. Distribution of included studies across synthesis categories (N = 87).
Figure 4. Keyword co-occurrence network of the reviewed corpus.
Figure 5. Decision tree for classifying LLM-based systems in the proposed taxonomy.
Figure 6. Representative LLM-based systems positioned in the proposed taxonomy space. Systems shown include: Salikutluk [13], Görer [51], Fakhoury [37], Zou [50], Lee [54], Wu [52], Song [32], Wang [55], Wang [56], and Shu [53].
Table 1. Detailed search strategy and query logic applied in WoS CC database.
Query Field | Logic Operator | Field Value
Title |  | (“LLM*” OR “large language model*” OR “GPT*” OR “ChatGPT”) AND (“interact*” OR “prompt*”)
Abstract | OR | (“LLM*” OR “large language model*” OR “GPT*” OR “ChatGPT”) AND (“interact*” OR “prompt*”)
Topic | NOT | training OR finetuning OR “fine-tuning” OR benchmark* OR dataset OR corpus OR optimization
Index Date | AND | Initial search: 1 January 2021 TO 23 July 2024; Update search: 24 July 2024 TO 31 December 2025
Table 2. Distribution of included studies by retained sampling stratum after duplicate removal.
Retained Sampling Stratum | Initial Search (n) | Update Search (n) | Total Included (n) | Share of Final Corpus (%)
Most cited | 13 | 23 | 36 | 41.4
Most relevant | 9 | 11 | 20 | 23.0
Most recent | 22 | 9 | 31 | 35.6
Total | 44 | 43 | 87 | 100.0
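The shares in Table 2 follow directly from the reported counts. A minimal Python check, using only the numbers in the table (the names `STRATA` and `corpus_shares` are illustrative and do not come from the paper):

```python
# Counts copied from Table 2: stratum -> (initial search n, update search n).
STRATA = {
    "Most cited": (13, 23),
    "Most relevant": (9, 11),
    "Most recent": (22, 9),
}


def corpus_shares(strata):
    """Return {stratum: (total included, share of corpus in %, one decimal)}."""
    grand_total = sum(initial + update for initial, update in strata.values())
    return {
        name: (initial + update, round(100 * (initial + update) / grand_total, 1))
        for name, (initial, update) in strata.items()
    }


shares = corpus_shares(STRATA)
for name, (total, pct) in shares.items():
    print(f"{name}: {total} studies, {pct}% of the corpus")
```

The row totals sum to 87, which matches the corpus size stated in the abstract.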
Table 3. Interaction patterns, autonomy, and risks across domains.
 | Software Development | Robotics | Education
Role of prompting | Artefact generation and verification | Grounding language in physical action | Developing prompting as a learner competence
Workflow patterns | Test-driven iteration, failures recoverable | Human approval before execution | Iterative feedback, learning-focused
Agentic architectures | Consequences bounded to digital artefacts | Errors can propagate into physical space | Limited evidence in reviewed studies
Dominant autonomy | Advisory to Delegated execution | Guided to Delegated execution | Advisory to Guided execution
Key risk | Over-reliance on incorrect outputs | Misinterpretation can cause physical harm | Uncalibrated trust, vulnerable users
Table 4. Decision rules for assigning interaction mechanism categories.
Category | Classification Criteria | Boundary Rule | Example
Conversational exploration and reflection | The interaction is primarily used for exploration, sensemaking, explanation, or idea generation within the dialogue, without direct action outside the dialogue. | If the interaction is organised around producing a specified output with completion criteria, classify as task-oriented assistance. | [13]
Task-oriented assistance | The user has a defined task or deliverable, success is judged by completion, adequacy, or correctness of the output, and interaction remains primarily prompt-response based. | If the interaction is primarily structured by application features, interface controls, or tool/API calls, classify as tool-mediated interaction. | [50,51]
Tool-mediated interaction | The LLM is embedded in an application, and interaction is primarily structured by interface elements, built-in functions, or tool/API calls. | If the system itself plans and iteratively executes a sequence of actions toward the goal, classify as agentic workflows. | [32,52]
Agentic workflows | The system pursues a multi-step goal by planning and iteratively executing actions across tools, modules, or subprocesses. | If the system only proposes steps while the user remains responsible for carrying them out, classify as tool-mediated interaction, or as task-oriented assistance if no tools are involved. | [53]
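The boundary rules in Table 4 can be read as an ordered sequence of checks. The sketch below is one possible reading, not the authors' decision tree from Figure 5: the three yes/no features, their names, and the most-specific-first evaluation order are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class InteractionProfile:
    # Hypothetical features; the field names are illustrative, not from the paper.
    has_defined_deliverable: bool    # specified output with completion criteria
    structured_by_tools_or_ui: bool  # interface controls, built-in functions, tool/API calls
    system_plans_and_executes: bool  # system itself plans and iteratively executes actions


def classify_mechanism(p: InteractionProfile) -> str:
    """Apply the Table 4 boundary rules, most specific category first (assumed order)."""
    if p.system_plans_and_executes:
        return "Agentic workflows"
    # A system that only proposes steps falls through to the checks below.
    if p.structured_by_tools_or_ui:
        return "Tool-mediated interaction"
    if p.has_defined_deliverable:
        return "Task-oriented assistance"
    return "Conversational exploration and reflection"


# An open-ended dialogue with no deliverable and no tool use.
print(classify_mechanism(InteractionProfile(False, False, False)))
# A tool-using system that proposes steps but leaves execution to the user.
print(classify_mechanism(InteractionProfile(True, True, False)))
```

Evaluating the agentic check first keeps the last boundary rule implicit: when the system does not execute its own plan, the result depends only on whether tools are involved.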
Table 5. Decision rules for assigning level of autonomy.
Category | Classification Criteria | Boundary Rule | Example
Advisory systems | The system provides recommendations, suggestions, or explanations, but does not perform actions on external tools, artefacts, or environments. The user remains responsible for execution. | If the system guides the task through predefined stages or structured interaction steps, classify as guided execution. | [51]
Guided execution | The system structures the interaction through staged guidance or partial automation, but explicit user confirmation remains required for consequential actions. | If consequential actions do not require step-by-step user confirmation, classify as delegated execution. | [37,54]
Delegated execution | The system performs bounded actions through tools, APIs, code, or connected applications on the user’s behalf, while user oversight and intervention remain available. | If the system autonomously plans subgoals and adapts actions, classify as high-autonomy execution. | [55,56]
High-autonomy execution | The system pursues extended goals through self-directed planning and multi-step execution, potentially coordinating multiple agents or tools with minimal direct user intervention. | If autonomy remains limited to tasks without extended self-directed planning, classify as delegated execution. | [53]
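The autonomy rules in Table 5 can be sketched in the same way. This is a hedged illustration only: the four boolean features and their names are assumptions, and treating "acts, but only after confirmation" as guided execution is one interpretation of the partial-automation criterion.

```python
from dataclasses import dataclass


@dataclass
class AutonomyProfile:
    # Hypothetical features; the field names are illustrative, not from the paper.
    performs_external_actions: bool    # acts on tools, artefacts, or environments
    confirmation_required: bool        # consequential actions need explicit user confirmation
    staged_guidance: bool              # predefined stages or structured interaction steps
    plans_subgoals_autonomously: bool  # self-directed planning and adaptation


def classify_autonomy(p: AutonomyProfile) -> str:
    """Apply the Table 5 boundary rules (assumed reading)."""
    if p.performs_external_actions and not p.confirmation_required:
        if p.plans_subgoals_autonomously:
            return "High-autonomy execution"
        return "Delegated execution"
    if p.staged_guidance or p.performs_external_actions:
        return "Guided execution"
    return "Advisory systems"


# Suggestions only; the user carries out every step.
print(classify_autonomy(AutonomyProfile(False, False, False, False)))
# Bounded tool actions without step-by-step confirmation, no self-directed planning.
print(classify_autonomy(AutonomyProfile(True, False, False, False)))
```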
Table 6. HCAI indicators across interaction mechanisms and autonomy levels.
HCAI Principle | Exemplary Operational Indicators
Control | Percentage of actions requiring user confirmation; availability of pause, stop, undo, or override functions; intervention success rate
Transparency | Proportion of visible intermediate steps; availability of plans, tool calls, sources, or uncertainty cues; user-rated understandability
Error handling | Error detection rate; recovery success rate; time needed to recover from failed or incorrect outputs
User learning | Improvement in prompt quality over repeated use; reduction in repeated user corrections; change in calibrated trust or self-reported confidence
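Several of the indicators in Table 6 are simple ratios over an interaction log. The sketch below shows how the control and error-handling indicators might be computed, assuming a hypothetical per-action log schema (`ActionRecord` and all of its fields are invented for illustration; the paper does not prescribe a log format).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionRecord:
    # Hypothetical log schema; none of these field names come from the paper.
    needed_confirmation: bool
    failed: bool = False
    error_detected: bool = False
    recovery_seconds: Optional[float] = None  # None if the failure was never recovered


def hcai_indicators(log):
    """Compute a few Table 6 indicators; a value is None when it is undefined."""
    failures = [r for r in log if r.failed]
    detected = [r for r in failures if r.error_detected]
    recovered = [r for r in failures if r.recovery_seconds is not None]

    def ratio(part, whole):
        return len(part) / len(whole) if whole else None

    return {
        "confirmation_pct": 100 * sum(r.needed_confirmation for r in log) / len(log) if log else None,
        "error_detection_rate": ratio(detected, failures),
        "recovery_success_rate": ratio(recovered, failures),
        "mean_recovery_seconds": (
            sum(r.recovery_seconds for r in recovered) / len(recovered) if recovered else None
        ),
    }


example_log = [
    ActionRecord(needed_confirmation=True),
    ActionRecord(needed_confirmation=False, failed=True, error_detected=True, recovery_seconds=10.0),
    ActionRecord(needed_confirmation=True, failed=True),
    ActionRecord(needed_confirmation=False),
]
print(hcai_indicators(example_log))
```

Indicators that depend on user judgement, such as user-rated understandability or calibrated trust, would need questionnaire data and are not covered by this sketch.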
Table 7. Classification of representative systems using the proposed taxonomy.
System/Study | Interaction Mechanism | Autonomy Level | Basis for Classification
Salikutluk et al. [13] | Conversational exploration | Advisory | Brainstorming support; no external action
Görer & Aydemir [51] | Task-oriented assistance | Advisory | Generates interview-script outputs
Fakhoury et al. [37] | Task-oriented assistance | Guided execution | Test-driven code generation with iterative user review
iTutor [50] | Task-oriented assistance | Guided execution | Step-by-step tutorial guidance from UI context
CoAuthor [54] | Task-oriented assistance | Guided execution | User-controlled writing suggestions and revision
PromptChainer [52] | Tool-mediated interaction | Guided execution | Visual prompt-chain authoring and debugging
CoVis [32] | Tool-mediated interaction | Delegated execution | Generates data queries and visualizations through a system-mediated workflow
SciDaSynth [55] | Tool-mediated interaction | Delegated execution | Generates structured data tables; user validates and refines
Wang et al. [56] | Tool-mediated interaction | Delegated execution | Maps language instructions to bounded mobile UI actions
RAH [53] | Agentic workflows | High-autonomy execution | LLM-agent with learn–action–critic–reflection cycle
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nejašmić, D.; Mladenović, S.; Granić, A. Interaction with LLM-Based Systems: A Structured Review and Taxonomy of Mechanisms and Autonomy. Appl. Sci. 2026, 16, 5001. https://doi.org/10.3390/app16105001