Article

Improving the Efficiency of Collaboration Between Humans and Embodied AI Agents in 3D Virtual Environments

1 Department of Computer Science, The Graduate School, Kwangwoon University, 20, Gwangun-ro, Nowon-gu, Seoul 01897, Republic of Korea
2 School of Software, Kwangwoon University, 20, Gwangun-ro, Nowon-gu, Seoul 01897, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 1135; https://doi.org/10.3390/app16021135
Submission received: 24 December 2025 / Revised: 19 January 2026 / Accepted: 20 January 2026 / Published: 22 January 2026
(This article belongs to the Special Issue Augmented and Virtual Reality for Smart Applications)

Abstract

This study proposes a human-in-the-loop dynamic graph-based planning framework designed to elevate LLM-based Embodied Agents from simple tools to trustworthy collaborative partners. To achieve this, we address the trade-off between the structural rigidity of plan-centric approaches and the instability of reactive methods. The framework utilizes a Directed Acyclic Graph (DAG) with AND/OR nodes to ensure robustness while maintaining flexibility. Critically, the agent features an Automated Recovery Mechanism for self-correction and a Dynamic Modification Mechanism that employs Relevance Analysis to effectively translate human interventions (Switch, Add, Delete) into graph updates. Comparative experiments in Minecraft with 30 participants validated the method’s effectiveness. The proposed agent (Agent B) outperformed the reactive baseline (Agent A), reducing mission completion time by 9.3%. Notably, the agent demonstrated high instruction compliance and reduced user frustration by approximately 20%, leading to statistically higher satisfaction scores (PSSUQ). These results confirm that by ensuring planning robustness and responsiveness, the proposed framework successfully enables agents to function as trustworthy partners in complex environments.

1. Introduction

1.1. Research Background

The recent rapid advancements in Artificial Intelligence (AI), spearheaded by Large Language Models (LLMs), have extended beyond text processing to the emergence of Embodied AI Agents capable of interacting within physical or virtual worlds. Unlike traditional conversational AI, which focuses on information processing and generation, Embodied Agents perceive complex 3D environments, make decisions, and execute physical actions. In particular, open-ended virtual environments with high degrees of freedom and dynamism, such as Minecraft, have established themselves as key testbeds for validating the potential of such agents. Studies such as Voyager [1], which expands exploration through autonomous curriculum learning, and MineLand [2], which simulates multi-agent social interactions, have significantly enhanced agent autonomy.
Nevertheless, achieving “complete autonomy” in complex, unpredictable environments that mimic the real world remains a challenge. To address this, researchers initially introduced planning frameworks like LLM-Planner [3] and DEPS [4], which decompose goals into sequential sub-tasks. However, these linear approaches often exhibit structural rigidity; they struggle to adapt when dependencies change or when a specific step fails, often requiring a complete restart. Recognizing these limitations, recent studies have evolved toward Graph-based Planning to handle task complexity more effectively. A notable approach, VillagerAgent [5], models tasks as Directed Acyclic Graphs (DAGs) to manage intricate dependencies and optimize parallel execution.
Despite these structural advancements, a critical gap remains: the lack of flexible “Human-in-the-loop” interaction. Existing graph-based agents predominantly focus on automated optimization, often neglecting the necessity of real-time alignment with human intent. In critical failure scenarios or when user goals shift dynamically, human strategic judgment becomes indispensable.
Consequently, this study aims to facilitate a fundamental paradigm shift in human–AI interaction: transitioning the agent from a passive “tool” that merely executes commands to an active “partner” capable of shared agency. Rather than strictly adhering to full autonomy, we propose a framework designed to align with human intent and adapt to environmental dynamics, ultimately fostering a relationship of trust and cognitive coupling between the user and the agent.

1.2. Problem Definition and Research Objectives

Before this study, we conducted an exploratory preliminary investigation to analyze the patterns of human–AI interaction within a Minecraft environment. Our findings reveal that conventional dichotomous agent designs exhibit significant limitations in real-world collaborative scenarios.
First, plan-centric agents, while demonstrating robust structural stability in the execution of complex tasks, exhibited excessive rigidity in response to human intervention. For example, when a user provided a shortcut—such as “The materials are in the chest, use them”—the agent failed to adjust its predefined “mining plan.” Instead, it persisted in redundant actions or inefficiently re-planned its entire trajectory from scratch in response to minor environmental variables. This lack of adaptability served as a primary source of user frustration in collaborative real-time environments.
Second, reactive agents were flexible in following immediate instructions but lacked the capacity to pursue long-term objectives. Relying exclusively on momentary observations without a high-level strategic roadmap, these agents often lost task context or engaged in inconsistent behavioral loops when presented with comprehensive objectives like “Build a house.” This behavior indicates that these agents function merely as responsive tools rather than proactive partners.
These observations underscore a critical trade-off between structural stability and dynamic flexibility in existing methodologies. Consequently, this research is motivated by the imperative to develop a novel integrated framework that simultaneously harmonizes these two conflicting but essential attributes.

1.3. Research Objectives and Main Contributions

To elevate the Embodied Agent from an instruction-following executor to a collaborative partner, this study proposes a human-guided dynamic planning framework designed to seamlessly integrate human intervention into the agent’s structural reasoning. The primary objective is to achieve a synergy between structural robustness and real-time adaptability aligned with human intent. To this end, we introduce a novel collaborative mechanism that actively reconfigures the plan in response to failure scenarios or human input, while preserving the underlying structural skeleton that ensures the success of complex, long-term objectives.
The principal contributions of this paper are summarized as follows:
1. A Flexible Planning Graph Framework Using AND/OR Branching: We present a plan representation that integrates AND/OR branching nodes (capable of handling failures and alternative exploration) into a Directed Acyclic Graph (DAG). This hybrid structure ensures the logical coherence necessary for complex missions while enabling fluid decision-making regarding execution paths in dynamic environments.
2. An Adaptive Modification Mechanism via Natural Language Instructions: We developed a dynamic update mechanism that interprets human natural language input to assess its relevance to the active planning graph. This allows the system to instantaneously modify, append, or prune specific nodes without discarding the global plan, thereby maximizing the agent’s compliance and responsiveness.
3. Empirical Validation of Collaborative Efficiency: Through a comprehensive user study involving 30 participants in a Minecraft-based environment, we demonstrate that our framework significantly outperforms baselines. Specifically, the proposed agent reduced the mission completion time by an average of 97 s and substantially alleviated the cognitive workload of the users.

1.4. Structure of the Paper

The remainder of this paper is organized as follows. Section 2 reviews the existing literature on Minecraft-based embodied agents, hierarchical planning, graph-based reasoning, and human–agent collaboration. Section 3 details the proposed system architecture and the dynamically modifiable graph-based methodology, highlighting the balance between flexibility and stability. Section 4 provides a comprehensive verification of the proposed approach through both quantitative and qualitative analyses of the user study and experimental findings. Finally, Section 6 concludes the study and discusses its limitations along with prospective directions for future research.

2. Related Work

2.1. Minecraft-Based Embodied Agents

Developing autonomous agents capable of self-directing goals and executing tasks within complex, dynamic environments has remained a fundamental challenge in artificial intelligence. Recently, open-ended virtual environments such as Minecraft have emerged as pivotal testbeds for this endeavor. Minecraft provides a sandbox environment with significant physical complexity and freedom, offering an optimal setting for evaluating an agent’s long-term planning capabilities. Tasks in Minecraft—such as harvesting wood and processing planks to craft a pickaxe—require strictly sequential and procedural actions. Furthermore, the environment is uniquely suited for validating problem-solving skills through creative open-ended tasks where solutions are not predefined.
Early research mainly used Reinforcement Learning (RL) and Imitation Learning (IL). MineRL [6] introduced large-scale human demonstrations to address complex challenges with sparse rewards, such as diamond acquisition. Subsequently, MineDojo [7] integrated internet-scale knowledge—including wikis and YouTube videos—into the Minecraft ecosystem, establishing a benchmark for thousands of diverse tasks described in natural language. This changed the focus of the research from learning low-level control policies to general-purpose goal execution driven by linguistic understanding.
The advent of Large Language Models (LLMs) is fundamentally reshaping the autonomous agent paradigm. While traditional RL/IL approaches excel in low-level control or task-specific optimization, LLMs provide high-level reasoning, commonsense-based planning, and self-correction. A prominent example, Voyager [1], utilizes an LLM as the central cognitive engine of the agent to explore environments, autonomously synthesize executable code (skills), and facilitate learning. By dynamically adjusting actions based on environmental feedback and execution errors, Voyager demonstrated that agents can solve complex tasks without explicit reward functions or extensive demonstrations. Consequently, research is rapidly transitioning toward utilizing LLMs as “High-level Planners” and “Reasoners.”

2.2. Hierarchical Planning for Embodied Agents

Hierarchical Planning is a cornerstone methodology for efficiently decomposing complex problems, rooted in classical theories such as Hierarchical Task Networks (HTN) [8,9,10]. HTN recursively refines abstract high-level tasks—e.g., “prepare for travel”—into sub-tasks such as “pack passport” and “pack clothes”, until they reach the level of primitive actions executable by the system. This structure emulates human cognitive processes, enabling intuitive and efficient plan generation [11].
Recent advances have actively applied hierarchical planning to robotics and embodied agents by using common sense reasoning from LLMs [12,13]. Because generating exhaustive task details via LLMs is computationally inefficient due to the vast search space, most studies adopt a hierarchical framework. For example, Huang et al. [14] decompose a goal like “discard a drink into the trash” into abstract sub-goals (e.g., move, pick, throw), which are then mapped to low-level control policies.
Despite these advances, LLM-based hierarchical planning still faces critical limitations. First, most approaches rely on linear planning structures, which are rigid; once a sequential plan is generated, it is difficult to adapt to unexpected environmental changes or execution failures. When a failure occurs, agents often discard the entire plan and restart, incurring significant overhead. Although DEPS [4] attempted to modify plans at failure points, it still involves substantial search costs. Second, it is difficult to incorporate real-time feedback or novel knowledge. Although RT-2 [15] adjusts actions via visual feedback, it remains limited to fine-tuning immediate actions rather than dynamically reconfiguring the overall strategic structure.

2.3. Graph-Based Planning and Reasoning

To overcome the rigidity of linear planning, researchers have begun modeling task dependencies as graph structures. Tree of Thoughts (ToT) [16] enabled LLMs to explore multiple potential solution paths within a tree structure, enhancing robustness against single-path failures. Graph of Thoughts (GoT) [17] further extended this by modeling reasoning nodes as general graphs, facilitating more flexible information aggregation.
These graph-based methodologies are now being integrated into the field of embodied agents. VillagerAgent [5] proposed a system where a manager agent constructs a Directed Acyclic Graph (DAG) reflecting task priorities and distributes specific nodes to multiple agents for role assignment.
Our study distinguishes itself from these existing approaches in several key aspects. While VillagerAgent focuses on inter-agent role distribution, our framework introduces AND/OR branching nodes to handle logical complexity and alternative pathways. This allows the agent to dynamically select the optimal sub-task—such as choosing between ‘Shearing Sheep’ or ‘Killing Sheep’ to ‘Obtain Wool’—based on the current environment. Most importantly, unlike existing graph models, our structure supports real-time modifications (Switch, Add, Delete) through human natural language intervention, providing the flexibility needed for unpredictable embodied tasks.

2.4. Multi-Agent Cooperative Systems

Multi-agent cooperation represents another major trend. LLMs have enabled sophisticated social simulations; Generative Agents [18] demonstrated human-like social behaviors in virtual communities, and MineLand [2] explored large-scale survival and construction dynamics in Minecraft. Specific cooperative frameworks have also emerged. ProAgent [19] predicts the partner’s intention to provide proactive assistance, while AGENTVERSE [20] uses an explicit role division.
MindCraft [21], which serves as a foundation for this paper, emphasizes “real-time natural language communication” between agents using an “Action by Action” approach. While this offers high flexibility for dynamic adaptation, it often lacks structural stability and risks losing long-term mission context, leading to repetitive or inefficient behaviors.

2.5. Human–Agent Interaction and Collaboration

Human–AI collaboration is evolving toward a partnership that understands human intent. The rigidity and unpredictability of the planning of current LLM agents suggest that human strategic judgment is essential for complex missions.
One pillar of this collaboration is trust through Explainable AI (XAI). Reflexion [22] lets agents verbally reflect on their reasoning (Chain-of-Thought) and the causes of failure, allowing humans to diagnose and intervene. Another core aspect is dynamic plan modification. Recent studies focus on proactive interaction, where agents resolve uncertainty by asking questions [23], or through joint Meta-planning via discussion [24]. Joint Robotic Planning [25] also proposed iterative feedback loops to verify plans before execution.
Nevertheless, existing methods remain constrained by linear structures. Whenever a user intervenes to modify a plan, these systems must regenerate the entire plan from scratch, which is highly inefficient. This process breaks the continuity of the task context that should be maintained between the human and the agent, ultimately hindering the alignment of mutual intent. To address these issues, this study introduces a human-in-the-loop graph-based planning framework that synchronizes the intents of the human and the agent in real time. By managing each task as an independent unit (node) within a graph structure, the system can preserve the overall integrity of the global plan while selectively updating only the specific parts requested by the user. This approach ensures both structural stability and real-time flexibility, playing a vital role in establishing the predictability and trust necessary for a reliable partnership.

3. Materials and Methods

3.1. Structure of the Graph-Based Planning Framework

3.1.1. Mathematical Definition and Structural Characteristics of the Planning Graph

In this system, the planning graph is formalized as a Directed Acyclic Graph (DAG) $G = (V, E)$ to ensure the logical structure of task execution, where the set of nodes $V$ is partitioned into three disjoint sets based on their logical characteristics: $V = V_{act} \cup V_{and} \cup V_{or}$. Here, $V_{act}$ denotes primitive action nodes located at the leaf nodes of the graph which the agent can directly perform, while $V_{and}$ and $V_{or}$ represent composite task nodes that define the logical precedence relationships with their children $C(u)$. The edges $E \subseteq V \times V$ define the dependency relationships where an edge $(u, v) \in E$ implies that node $v$ is a prerequisite sub-task of $u$.
The core advantage of adopting this DAG structure is its inherent reusability and efficiency; unlike simple tree structures, graphs can represent many-to-many dependencies without duplication. For instance, the “Obtain Wood Logs” node can be a shared prerequisite for multiple distinct higher-level goals such as “Craft Crafting Table” and “Obtain Sticks,” allowing the graph to manage these common tasks as a single node and preventing the inefficient repetitive actions often observed in reactive agents.
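The shared-prerequisite idea can be illustrated with a minimal sketch. The `PlanGraph` class, its method names, and the string labels "act", "and", and "or" are hypothetical choices made for this illustration and are not taken from the authors' implementation; "Obtain Wood Logs" is treated as a leaf action only to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class PlanGraph:
    """Planning DAG: `kinds` maps a node to "act", "and", or "or";
    `children` maps a node u to the prerequisites v with (u, v) in E."""
    kinds: dict = field(default_factory=dict)
    children: dict = field(default_factory=dict)

    def add_node(self, name, kind):
        self.kinds[name] = kind
        self.children.setdefault(name, [])

    def add_edge(self, parent, child):
        # Reject any edge that would close a cycle, keeping the graph acyclic.
        if self._reaches(child, parent):
            raise ValueError(f"edge {parent!r} -> {child!r} would create a cycle")
        self.children[parent].append(child)

    def _reaches(self, start, target):
        # Iterative depth-first search from `start` along prerequisite edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.children[node])
        return False

    def parents(self, name):
        # All higher-level goals that list `name` as a prerequisite.
        return [p for p, kids in self.children.items() if name in kids]


graph = PlanGraph()
graph.add_node("Craft Crafting Table", "and")
graph.add_node("Obtain Sticks", "and")
graph.add_node("Obtain Wood Logs", "act")  # treated as a leaf for brevity
graph.add_edge("Craft Crafting Table", "Obtain Wood Logs")
graph.add_edge("Obtain Sticks", "Obtain Wood Logs")
shared_by = graph.parents("Obtain Wood Logs")
```

Here the single "Obtain Wood Logs" node is referenced by both goals, which is the many-to-many relationship a plain tree would have to express by duplicating the node.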

3.1.2. Execution Flexibility via AND/OR Branching

The execution state of the graph is governed by a valuation function $S : V \to \{0, 1\}$, where $S(v) = 1$ indicates that task $v$ is completed. The DAG structure secures execution flexibility by introducing AND/OR branching logic, where the relationship between a parent node and its set of child nodes is determined by its logical attribute.
For an AND node ($u \in V_{and}$), the completion status is calculated as $S(u) = \prod_{v \in C(u)} S(v)$, ensuring that all child nodes must be satisfied to achieve the upper-level goal, such as requiring “Obtain Wool,” “Craft Wood Planks,” and “Craft Crafting Table” to “Craft Bed.” Conversely, for an OR node ($w \in V_{or}$), the state is updated as $S(w) = \min\left(1, \sum_{v \in C(w)} S(v)\right)$, implying that the goal can be achieved by selectively executing only one of the alternative pathways. For instance, “Obtain Wool” can be achieved by either “Kill Sheep” or “Shear Sheep,” allowing the agent to dynamically select the optimal sub-task based on the environment through LLM reasoning.
This mathematical modeling provides the structural foundation for both the automated recovery mechanism and real-time human intervention by allowing flexible modification of the graph’s logical flow (see Figure 1).
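As a worked illustration of the valuation function, the sketch below evaluates completion recursively, reading the AND rule as a product over the children's statuses and the OR rule as a sum capped at 1. The function name `status`, the dictionary layout, and the choice to treat “Craft Wood Planks” and “Craft Crafting Table” as leaf actions are assumptions made for brevity.

```python
from math import prod


def status(node, kinds, children, done):
    """Return S(node) in {0, 1} for the given set of completed leaf actions."""
    kind = kinds[node]
    if kind == "act":
        return 1 if node in done else 0
    values = [status(child, kinds, children, done) for child in children[node]]
    if kind == "and":
        return prod(values)        # 1 only if every child is 1
    return min(1, sum(values))     # "or": 1 as soon as any child is 1


kinds = {
    "Craft Bed": "and",
    "Obtain Wool": "or",
    "Kill Sheep": "act",
    "Shear Sheep": "act",
    "Craft Wood Planks": "act",      # simplified to a leaf for this example
    "Craft Crafting Table": "act",   # simplified to a leaf for this example
}
children = {
    "Craft Bed": ["Obtain Wool", "Craft Wood Planks", "Craft Crafting Table"],
    "Obtain Wool": ["Kill Sheep", "Shear Sheep"],
}

done = {"Shear Sheep", "Craft Wood Planks"}
wool = status("Obtain Wool", kinds, children, done)
partial = status("Craft Bed", kinds, children, done)
complete = status("Craft Bed", kinds, children, done | {"Craft Crafting Table"})
```

With only “Shear Sheep” and “Craft Wood Planks” done, the OR node “Obtain Wool” is already satisfied while the AND node “Craft Bed” is not; it becomes satisfied once “Craft Crafting Table” is also completed.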

3.2. Plan Execution and Automated Recovery Mechanism

3.2.1. Initial Request Transmission and Complexity Analysis

The workflow begins when a human conveys a natural language command (e.g., “Make a bed”) to the agent. The agent, in an Idle state, performs a complexity analysis on the received request. This judgment is performed by the LLM using a prompt designed to infer task complexity.
Everyday conversation such as greetings, or commands processable in a single turn, such as “Show inventory,” are classified as Simple requests. In this case, the agent responds or acts immediately without generating a separate graph. Complex requests are tasks requiring multi-step procedural actions and dependencies, such as “Make a bed” or “Make a diamond pickaxe”. The LLM classifies these as Complex requests and enters the planning graph generation phase, as illustrated in Figure 2.

3.2.2. LLM-Based Plan Execution and Node Selection

Once the planning graph $G$ is generated, the agent determines the execution sequence through a top-down traversal starting from the root node. At each branching node, the LLM analyzes the current world state $S_t$ (inventory $I_t$ and history $H_t$) alongside the sub-graph structure to decide the subsequent action. We formulate this decision process as a context-aware optimization problem aimed at minimizing the execution cost $C(v \mid S_t)$.
In an OR branch, the agent evaluates multiple alternatives $N$ and selects the optimal path $v^*$ that approximates the minimal effort, defined as $v^* \approx \arg\min_{v \in N} C(v \mid S_t)$. The LLM utilizes the inventory context to estimate this cost; for instance, if resources are available, the cost of crafting is evaluated as significantly lower than gathering. The agent then informs the user of its rationale: “I will hunt spiders to obtain wool from string.”
In contrast, in an AND branch, where all sub-tasks are mandatory, the agent determines a sequential permutation π of tasks that satisfies logical dependencies while maintaining workflow efficiency. The agent shares a comprehensive roadmap for this sequence: “To craft a bed, I need wool, planks, and a crafting table; I will get the planks first.” This proactive reporting serves as a vital mechanism for establishing transparency and trust, allowing the human partner to anticipate the agent’s behavior. Upon reaching a leaf node, execution proceeds bottom-up, with progress updates provided at each milestone to facilitate timely human intervention.
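The two selection rules can be sketched as follows. In the described system the cost estimate comes from LLM reasoning over the world state; here `toy_cost` is a hypothetical stand-in with invented numbers, and the function names and the `depends_on` mapping are likewise assumptions for illustration. The AND ordering uses a topological sort from Python's standard `graphlib` module as one simple way to obtain a dependency-respecting sequence.

```python
from graphlib import TopologicalSorter


def choose_or_alternative(alternatives, inventory, estimate_cost):
    # OR branch: pick the alternative with the lowest estimated cost.
    return min(alternatives, key=lambda v: estimate_cost(v, inventory))


def order_and_children(tasks, depends_on):
    # AND branch: every task is mandatory; `depends_on` maps a task to the
    # sibling tasks that must be finished before it can start.
    sorter = TopologicalSorter({t: depends_on.get(t, ()) for t in tasks})
    return list(sorter.static_order())


def toy_cost(node, inventory):
    # HYPOTHETICAL numbers standing in for the LLM's cost estimate.
    if node == "Craft Wool from String":
        return 1 if inventory.get("string", 0) >= 4 else 10
    if node == "Shear Sheep":
        return 3 if inventory.get("shears", 0) >= 1 else 6
    return 5  # any other alternative, e.g. "Kill Sheep"


alternatives = ["Shear Sheep", "Kill Sheep", "Craft Wool from String"]
with_string = choose_or_alternative(alternatives, {"string": 4}, toy_cost)
empty_handed = choose_or_alternative(alternatives, {}, toy_cost)

order = order_and_children(
    ["Obtain Wool", "Craft Wood Planks", "Craft Crafting Table"],
    {"Craft Crafting Table": ["Craft Wood Planks"]},
)
```

Under these toy costs, crafting wins when string is already in the inventory, and the AND ordering always places “Craft Wood Planks” before “Craft Crafting Table”.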

3.2.3. Automated Recovery Mechanism via Backtracking

If a specific node fails during execution, the system immediately triggers an automated recovery mechanism. The agent reports the failure and simultaneously invokes the handleFailure logic, which backtracks from the failed node to the Nearest Valid Branching Point.
A valid branching point is defined as either an OR node with untried alternatives or an AND node with pending parallel tasks. For instance, if “Hunt Spider (OR Alternative 1)” fails, the agent regresses to the “Obtain Wool” parent node and selects “Kill Sheep (OR Alternative 2)” as a fallback. This mechanism overcomes the inefficiencies of conventional linear planning—which often requires complete plan regeneration—by maximizing the success of the task through localized path reconfiguration. If no valid branching point is identified (e.g., reaching the root node), the plan is considered to have failed.
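A minimal sketch of the backtracking search is given below. It is not the authors' handleFailure implementation: the function name, its arguments, and the assumption that the agent tracks its active root-to-node path as a list are all hypothetical. A node counts as a valid branching point if it still has a child that has been neither tried nor completed, which covers both cases in the definition above.

```python
def nearest_valid_branching_point(path, children, tried, done):
    """Walk upward from the failed node (the last entry of `path`).

    Returns (branching_node, next_child), or None when the walk passes the
    root without finding an untried or pending child (the plan has failed).
    """
    exhausted = set(tried) | {path[-1]}
    for node in reversed(path[:-1]):
        pending = [c for c in children[node]
                   if c not in exhausted and c not in done]
        if pending:
            return node, pending[0]
        exhausted.add(node)  # nothing left under this node; keep climbing
    return None


children = {
    "Craft Bed": ["Obtain Wool", "Craft Wood Planks", "Craft Crafting Table"],
    "Obtain Wool": ["Hunt Spider", "Kill Sheep"],
}

# "Hunt Spider" fails first: fall back to the sibling OR alternative.
first = nearest_valid_branching_point(
    ["Craft Bed", "Obtain Wool", "Hunt Spider"], children, set(), set())

# "Kill Sheep" then fails too, and the other AND children are already done:
# no valid branching point remains.
second = nearest_valid_branching_point(
    ["Craft Bed", "Obtain Wool", "Kill Sheep"], children,
    {"Hunt Spider"}, {"Craft Wood Planks", "Craft Crafting Table"})
```

Only the sub-path below the branching point is abandoned; the rest of the graph and its completed nodes are left untouched, which is what distinguishes this from regenerating a linear plan.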

3.3. Dynamic Plan Modification Based on Human Intervention

Building upon the robust execution framework, we integrated a dynamic modification feature that accommodates real-time natural language intervention. Based on the interactions of the pilot study, we categorized human interventions into Irrelevant and Relevant requests. Upon detecting an utterance, the system performs a relevance analysis by synthesizing the action history, the planning graph, and the active node status (see Figure 3).
Irrelevant Requests: Cases that do not require structural changes to the current plan or require complete cancellation of the plan.
  • Chat: Casual utterances like “The weather is nice” are responded to lightly, and the existing task continues.
  • Stop: In case of an urgent stop request like “Stop,” all actions and plans are immediately terminated, and the agent switches to an idle state.
  • New Task: Cases directing a new task unrelated to the current context, such as “Stop what you’re doing and mine wood”. This was the most frequent type in pilot tests; the system discards the current graph and re-initiates the complexity analysis and graph generation process (Section 3.3.1).
Relevant Requests: Cases where the context of the current plan is maintained but detailed paths are modified or information is updated according to specific human instructions. These are further subdivided into ‘Switch’, ‘Delete’, and ‘Add’ to dynamically transform the graph (see Section 3.3.2, Section 3.3.3 and Section 3.3.4 and Figure 4).

3.3.1. New Task Request (New Task)

A “New Task” request occurs when a human lowers the priority of the current task and presents a completely new goal. When the LLM detects this, the agent immediately stops the current graph execution and establishes a plan for the new goal from scratch. This prioritizes the latest human intent over the existing plan.

3.3.2. Switch Active Node

The “Switch” function enables the agent to immediately pivot to an alternative path already present in the graph. For example, if a command like “Don’t collect wool, get wood planks right now” is input, the LLM generates a control signal like {"type": "switch", "nodeName": "Wooden Planks"}. The system backtracks to the Lowest Common Ancestor (LCA) of the currently executing node and the target node, sets the target node as the new active node, and resumes execution. The agent announces the transition by reporting, “I will stop the wool task and get wood planks first”.
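The Switch step can be sketched as below, assuming the control signal is a JSON string with the two fields shown above and that each node's root-to-node path is available in a lookup table. The function names and the `paths` table are hypothetical; in a general DAG a node may have several such paths, and this sketch simply uses one per node.

```python
import json


def lowest_common_ancestor(path_a, path_b):
    """Deepest node shared by two root-to-node paths (None if they differ at the root)."""
    lca = None
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        lca = a
    return lca


def apply_switch(signal_text, paths, active):
    """Return (backtrack_to, new_active) for a switch control signal."""
    signal = json.loads(signal_text)
    if signal["type"] != "switch":
        raise ValueError("not a switch signal")
    target = signal["nodeName"]
    return lowest_common_ancestor(paths[active], paths[target]), target


paths = {
    "Shear Sheep": ["Craft Bed", "Obtain Wool", "Shear Sheep"],
    "Wooden Planks": ["Craft Bed", "Wooden Planks"],
}
backtrack_to, new_active = apply_switch(
    '{"type": "switch", "nodeName": "Wooden Planks"}', paths, "Shear Sheep")
```

In this example the agent climbs from “Shear Sheep” to “Craft Bed”, the deepest node shared by both paths, and resumes from “Wooden Planks”.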

3.3.3. Delete Node

The “Delete” function deactivates task nodes that have become unnecessary in the plan. An utterance like “I’ll collect the wool” implies that the task is no longer the agent’s responsibility. The LLM returns a {"type": "delete", "nodeName": "Obtain Wool"} signal, and the agent responds, “I will not collect wool”. The system isolates the node from the graph by severing the edge between the target node and its parent node. If the deletion target is on the path of the currently executing node, the system immediately backtracks to the upper branching point and re-explores alternative paths excluding the deleted path.
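A minimal sketch of the Delete step follows. The function name and data layout are hypothetical, and two simplifications are assumed: every incoming edge of the target is severed (a shared node may have more than one parent), and backtracking is modeled by truncating the active path to the node just above the deleted one.

```python
def delete_node(children, target, active_path):
    """Sever all parent -> target edges and return the path to resume from."""
    for kids in children.values():
        if target in kids:
            kids.remove(target)
    if target in active_path:
        # The agent was working inside the deleted sub-plan: back up to
        # the node directly above it.
        return active_path[:active_path.index(target)]
    return active_path


children = {
    "Craft Bed": ["Obtain Wool", "Craft Wood Planks", "Craft Crafting Table"],
    "Obtain Wool": ["Kill Sheep", "Shear Sheep"],
}
resume_path = delete_node(
    children, "Obtain Wool", ["Craft Bed", "Obtain Wool", "Shear Sheep"])
```

After the call, “Craft Bed” no longer lists “Obtain Wool” as a prerequisite, and the agent resumes from “Craft Bed” with the remaining children.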

3.3.4. Add Node

The “Add” function injects new sub-goals or external knowledge into the existing plan. Information such as “The wool is in the chest” requires new action procedures. The LLM specifies the new node and the parent node to attach it to, such as {"type": "add", "nodeName": "Bring wool from chest", "parentName": "Obtain Wool"}. The system generates and connects a new sub-graph (e.g., “Open Chest” → “Take Out”) under the designated parent node. If the new node becomes a valid alternative (OR relationship) for the current goal, the agent immediately switches the task to the newly added path and reacts, “I will take the wool out of the chest”.
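The Add step can be sketched in the same style. In the described system the sub-graph is generated by the LLM; here its steps are passed in through a hypothetical `steps` argument, and the function name and return convention are likewise assumptions for illustration.

```python
import json


def apply_add(signal_text, kinds, children, steps):
    """Attach a new sub-goal under the named parent.

    Returns True when the parent is an OR node, i.e. the new node is an
    alternative the agent could switch to immediately.
    """
    signal = json.loads(signal_text)
    if signal["type"] != "add":
        raise ValueError("not an add signal")
    name, parent = signal["nodeName"], signal["parentName"]
    if parent not in kinds:
        raise KeyError(f"unknown parent node: {parent}")
    kinds[name] = "and"              # the new sub-goal needs all of its steps
    children[name] = list(steps)
    for step in steps:
        kinds.setdefault(step, "act")
        children.setdefault(step, [])
    children[parent].append(name)
    return kinds[parent] == "or"


kinds = {"Obtain Wool": "or", "Kill Sheep": "act", "Shear Sheep": "act"}
children = {"Obtain Wool": ["Kill Sheep", "Shear Sheep"],
            "Kill Sheep": [], "Shear Sheep": []}
signal = ('{"type": "add", "nodeName": "Bring wool from chest", '
          '"parentName": "Obtain Wool"}')
switch_now = apply_add(signal, kinds, children, ["Open Chest", "Take Out"])
```

Because “Obtain Wool” is an OR node, the new “Bring wool from chest” node becomes a third alternative and the function signals that the agent may switch to it right away.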

3.4. Memory Summarization Strategy for Context Management

3.4.1. Limitations of Existing Reactive Memory

The baseline MindCraft [21] model accumulates and stores all interaction records and action execution history in their original raw state to achieve user goals. For instance, if an agent fails a specific action, the entire process—including the recognition of missing resources (e.g., “I need more wool to make a bed”) and the subsequent preliminary actions—is recorded verbatim. However, this method causes “Context Inertia” during long-term collaboration. When excessive text regarding past failed attempts or intermediate sub-goals persists in the memory, the agent often misinterprets the current context, allowing the raw weight of past tasks to overpower new directives. This leads to a “Ghost Action” phenomenon, where the agent attempts a previously prioritized action B again, even after the user has explicitly requested a new action A.

3.4.2. Task-Centric Summarization Based on Plan Graph Completion

To resolve this context confusion and support multi-session consistency, we introduce a Task-Centric Incremental Summarization mechanism. Our strategy compresses memory specifically upon the termination of a planned task graph. When a task graph triggered by a user request is successfully completed, failed, or interrupted by user intervention, the system immediately invokes the memory management module. This module summarizes only the logs corresponding to that specific graph execution—from the plan initiation to its termination. The LLM abstracts the granular execution logs into a concise state update, preserving only the essential outcomes (success/failure status), acquired items, and critical user feedback. By storing these high-level results in long-term memory and removing the verbose procedural logs from the context window, the agent maintains a clear focus on current goals. This approach effectively prevents information loss regarding the outcome while eliminating the procedural noise that causes context inertia.
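The compression step can be illustrated with a small sketch. The `Memory` class, its fields, and the `summarize` callback are hypothetical; in the described system the summary is produced by the LLM, whereas `toy_summarize` below is a trivial stand-in used only to make the example runnable.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    context: list = field(default_factory=list)    # (graph_id, raw log line)
    long_term: list = field(default_factory=list)  # one summary per finished graph

    def log(self, graph_id, text):
        self.context.append((graph_id, text))

    def close_graph(self, graph_id, outcome, summarize):
        # Called when a task graph succeeds, fails, or is interrupted.
        logs = [text for gid, text in self.context if gid == graph_id]
        self.long_term.append(summarize(outcome, logs))
        # Drop only this graph's procedural logs from the context window.
        self.context = [(gid, text) for gid, text in self.context
                        if gid != graph_id]


def toy_summarize(outcome, logs):
    # Stand-in for the LLM abstraction step.
    return f"{outcome}: {len(logs)} log entries compressed"


memory = Memory()
memory.log("bed", "I need more wool to make a bed")
memory.log("bed", "Sheared a sheep")
memory.log("door", "Collecting oak logs")
memory.close_graph("bed", "success", toy_summarize)
```

After the “bed” graph is closed, its two raw entries are replaced by a single summary in long-term memory, while the logs of the still-running “door” graph stay in the context window.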

4. Results

4.1. Experimental Design

4.1.1. Experimental Environment and System Configuration

This experiment was conducted in the 3D virtual environment of Minecraft. We established a multiplayer environment where participants and agents could access the same server world and interact in real-time. The study compared two types of agents. As the control group (Agent A), we adopted the basic agent from MindCraft [21], discussed in Section 2.4. This agent is a reactive model that determines actions solely based on turn-by-turn observations and natural language communication without an explicit planning process. In our preliminary analysis of existing open-source Minecraft-based agents, we observed that most models are designed for achieving specific predefined cases, multi-agent cooperation, or operating within strictly constrained environments. However, MindCraft stood out as the most effective framework for achieving high-quality interactions with humans in a shared environment. Consequently, we adopted the MindCraft architecture as our baseline. The experimental group (Agent B) is an agent equipped with the “Human-Interaction-based Dynamic Planning Graph Mechanism” proposed in Section 3. This model features graph-based long-term planning capabilities (Section 3.1), an automated recovery mechanism (Section 3.2.3), and dynamic modification capabilities to respond flexibly to human intervention (Section 3.3).
We recruited a total of 30 adult participants (university students and graduates) through campus bulletin board advertisements using convenience sampling. All participants had prior experience playing Minecraft. The group consisted of 16 males and 14 females, all in their 20s. Regarding cumulative playtime, 5 participants had less than 10 h, 15 had between 10 and 50 h, 2 had between 50 and 100 h, and 8 had over 100 h.

4.1.2. Experimental Procedure

This experiment followed a within-subject design. All 30 participants underwent a tutorial session to familiarize themselves with the methods for communicating with the agent and with item placement techniques. Subsequently, participants performed the same mission twice in separate sessions, once with the control group (Agent A) and once with the experimental group (Agent B). To prevent bias from learning effects or fatigue due to the order of agent usage, the presentation order of the two agents was randomized and counterbalanced between participants.
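One common way to realize a randomized, counterbalanced order is sketched below. The exact assignment procedure used in the study is not specified, so the equal split between the two orders, the fixed seed, and the function name are assumptions made for illustration.

```python
import random


def counterbalanced_orders(participant_ids, seed=0):
    """Randomly assign half of the participants to A-then-B, half to B-then-A."""
    rng = random.Random(seed)       # fixed seed keeps the assignment reproducible
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    return {pid: ("A", "B") if i < half else ("B", "A")
            for i, pid in enumerate(ids)}


orders = counterbalanced_orders(range(1, 31))
a_first = sum(1 for order in orders.values() if order == ("A", "B"))
b_first = sum(1 for order in orders.values() if order == ("B", "A"))
```

With 30 participants this yields 15 sessions starting with Agent A and 15 starting with Agent B, so order effects are balanced across the two conditions.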
Each session consisted of approximately 20 min of mission performance followed immediately by a 5-min survey. After completing missions with both agents, participants engaged in an in-depth interview with the researcher to collect qualitative data on their overall experience. Detailed items for the questionnaires and in-depth interviews can be found in Appendix B. The entire experiment took approximately one hour. This study was approved by the Institutional Review Board (IRB) of the affiliated institution. All participants provided their voluntary informed consent prior to participating and received a gift certificate worth approximately 10,000 KRW as compensation.

4.1.3. System Implementation and Models

The agent system proposed in this experiment was implemented using Large Language Models (LLMs) optimized for specific functions. For the control module that executes actual actions within the Minecraft environment, we adopted the Andy-4 model used in MindCraft [21]. In contrast, for plan generation, graph management, and relevance analysis for human intervention (Planning & Reasoning), which form the core of this study, we utilized the Gemini-2.5-Flash-Lite model. This model features long-context processing capabilities and fast inference speeds, making it suitable for dynamic plan modification requiring real-time responsiveness. Additionally, to facilitate smooth communication between Korean-speaking users and the English-based agent model, the DeepL API was used in the real-time translation module to minimize distortion caused by language barriers. The experimental environment was built on Minecraft (version 1.20.4). Statistical analyses of the experimental results were conducted using Python (version 3.11.7).

4.1.4. Experimental Scenario

Participants and agents were given a mission themed “Building a House,” which involved collecting a total of 9 specific items and placing them into item frames on a designated 3 × 3 bingo board (Figure 5). This scenario includes complex crafting, resource exploration, and item transport processes, making it suitable for comprehensively evaluating the agent’s long-term planning ability and collaboration efficiency with humans. The 9 goal items consisted of a Chest, White Bed, Crafting Table, Oak Planks, Furnace, Wooden Pickaxe, Stone Pickaxe, Wooden Axe, and Oak Door. In particular, for three key items with complex crafting processes (White Bed, Stone Pickaxe, Oak Door), participants were restricted from crafting them directly. This condition was designed to require humans to make planned requests or collaborate with the agent to successfully complete the mission.

4.2. Hypotheses and Metrics

4.2.1. Research Questions and Hypotheses

The core question of this study is: “Can the proposed human interaction-based dynamic graph planning method achieve higher efficiency and user satisfaction in human–AI collaborative tasks compared to existing reactive planning methods?”. To verify this, we established the following hypotheses:
  • H1: The proposed graph-based agent (Experimental Group) will result in shorter mission completion times compared to the control agent.
  • H2: Through the robustness of the plan and the automated recovery mechanism (Section 3.2.3), the proposed agent will significantly lower the subjective workload perceived by users (NASA-TLX), specifically “Frustration” and “Mental Demand,” compared to the control group.
  • H3: Users will evaluate collaboration with the proposed agent as more useful and satisfying than with the control agent.

4.2.2. Metrics

We collected the following quantitative and qualitative metrics to test the hypotheses:
Quantitative Metrics
  • Mission Completion Time: The total time taken to complete placing all 9 items on the bingo board.
  • Communication Efficiency: By comparing the number of utterances between the agent and the human, we quantitatively measured the communication cost and information density invested to achieve the same goal.
Qualitative Metrics
  • NASA-TLX (Task Load Index) [26]: A standard scale measuring subjective workload felt by participants across 6 dimensions (MD, PD, TD, Effort, Perf, Frus).
  • PSSUQ (Post-Study System Usability Questionnaire) [27]: A standard satisfaction scale measuring overall usability, information quality, and interface quality.
  • Collaboration Satisfaction (Custom Scale): An 8-item scale developed for this study to evaluate “Collaboration Quality,” the core objective of this study (e.g., “Did it understand the intent well?”, “Was it a competent partner?”).
  • Post-Interview: In-depth qualitative feedback on the collaboration experience with each agent.

4.3. Quantitative Analysis Results

4.3.1. Mission Completion Time

Analysis of mission completion times, shown in Figure 6 and the corresponding statistical data (Table 1), revealed that the proposed method (Agent B) demonstrated superior efficiency compared to the control group (Agent A). The average time for all participants (N = 30) was 1042.27 s (SD = 433.71) for Agent A, while Agent B recorded 944.60 s (SD = 347.42), reducing the time by approximately 97 s on average. Analysis of individual results showed that 18 participants (60%) completed the mission faster when collaborating with Agent B.
Considering potential learning effects in a within-subject design, we conducted a detailed analysis by group based on experiment order. In the group that used Agent A first and then Agent B (A First Group, n = 15), Agent B showed statistically significantly faster performance (p = 0.023). In this group, Agent B’s average time was 933.60 s, a reduction of approximately 320 s compared to Agent A (1254.27 s), with 80% (12 participants) showing higher efficiency with Agent B. This suggests that while the reactive agent (A) requires high temporal costs when the participant is unskilled, the proposed agent (B) provides efficient paths through plan-based guidance.
In contrast, in the group that used Agent B first (B First Group, n = 15), Agent A recorded an average of 830.27 s, which was faster than Agent B (955.60 s). We interpret this as a learning effect: participants learned the workflow during the first session (Agent B) and performed the mission in a skilled state during the second session (Agent A).
In particular, the proposed agent demonstrated performance stability. The control group (A) showed substantial variation in completion time depending on user proficiency, ranging from 1254 s (unskilled) to 830 s (skilled), a difference of approximately 424 s. In contrast, the proposed agent (B) showed a difference of only about 22 s between the unskilled state (955.60 s) and the skilled state (933.60 s). This suggests that the proposed method delivers consistent and predictable collaboration performance regardless of user proficiency.
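The differences quoted above follow directly from the reported group means. The short check below recomputes them; the variable names are illustrative, and only the mean values come from the text.

```python
# Mean completion times in seconds, copied from the text above.
overall = {"A": 1042.27, "B": 944.60}
a_first = {"A": 1254.27, "B": 933.60}  # Agent A in session 1, Agent B in session 2
b_first = {"A": 830.27, "B": 955.60}   # Agent B in session 1, Agent A in session 2

# Overall saving of Agent B relative to Agent A.
saving = overall["A"] - overall["B"]
saving_pct = 100 * saving / overall["A"]

# With equal group sizes (n = 15 each), the overall mean is the mean of the group means.
check_a = (a_first["A"] + b_first["A"]) / 2
check_b = (a_first["B"] + b_first["B"]) / 2

# Gap in each agent's mean time between first-session and second-session use.
spread_a = a_first["A"] - b_first["A"]
spread_b = b_first["B"] - a_first["B"]

print(f"overall saving: {saving:.2f} s ({saving_pct:.2f}%)")
print(f"Agent A gap between sessions: {spread_a:.2f} s")
print(f"Agent B gap between sessions: {spread_b:.2f} s")
```

The group means average back to the overall means for both agents, which is what a balanced 15/15 split implies.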

4.3.2. Communication Efficiency

We analyzed the utterance counts generated during the experiment to examine the communication patterns of the two agents. As summarized in Table 2, the average number of human utterances did not differ statistically between Agent A (23.41) and Agent B (22.38) (p = 0.756), implying that users did not need to expend extra effort on additional instructions or queries when collaborating with the proposed agent (B).
However, the number of AI utterances tended to decrease for Agent B (71.93) by approximately 22.7% compared to Agent A (93.03) (t(28) = 1.784, p = 0.085). Although this difference did not reach statistical significance at the 0.05 level, the effect size (Cohen’s d) was 0.412, a small-to-medium effect by conventional benchmarks, suggesting a practical reduction in communication volume. We interpret this result as follows: Agent A reported every action in a fragmented manner, increasing the text load, whereas the graph-based Agent B reported planned actions in context, increasing information density. In short, Agent B was able to complete the mission with more concise communication under the same level of human intervention.
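For readers who want to reproduce this kind of comparison, the following is a minimal sketch of a paired-samples t statistic and a Cohen's d computed with the average of the two sample standard deviations. Both the choice of d variant and the utterance counts are assumptions: the variant behind the reported value is not specified here, and the numbers below are hypothetical, not the study data.

```python
import math
import statistics

def paired_t(x, y):
    """Return (t, df) for a paired-samples t-test on the differences x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample SD of the paired differences
    return mean_d / (sd_d / math.sqrt(n)), n - 1

def cohens_d_av(x, y):
    """Cohen's d using the average of the two sample SDs (one of several conventions)."""
    sd_av = (statistics.stdev(x) + statistics.stdev(y)) / 2
    return (statistics.mean(x) - statistics.mean(y)) / sd_av

# Hypothetical AI-utterance counts for five participants (NOT the study data).
agent_a = [90, 110, 75, 100, 95]
agent_b = [70, 85, 80, 72, 78]
t, df = paired_t(agent_a, agent_b)
print(f"t({df}) = {t:.3f}, d = {cohens_d_av(agent_a, agent_b):.3f}")
```

In practice a library routine such as `scipy.stats.ttest_rel` would also return the p-value; the standard-library version is shown only to make the arithmetic explicit.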

4.4. Qualitative Analysis Results

Qualitative indicator analysis, summarized in Figure 7, revealed that the proposed method (Agent B) clearly improved collaboration quality and user experience compared to the control group (Agent A).

4.4.1. NASA-TLX Analysis

We analyzed the NASA-TLX to measure the subjective cognitive and physical workload experienced by participants. Paired t-tests showed that Agent B tended to reduce overall workload compared to Agent A (Table 3).
First, in Mental Demand (MD), Agent B (M = 2.53) scored lower than Agent A (M = 2.93) (p = 0.090). Although not statistically significant, this suggests a tendency for the user’s cognitive burden to decrease as the proposed agent autonomously establishes complex plans. Second, in Effort, Agent B (M = 2.47) was significantly lower than Agent A (M = 2.83) (t(29) = 2.164, p = 0.039), indicating that the effort required from users to achieve goals was reduced. Third, Frustration (Frus) also showed a positive trend: Agent B recorded 2.37, lower than Agent A (2.93) (p = 0.084). Although not statistically significant, this is consistent with the automated recovery mechanism (Section 3.2.3) mitigating psychological stress caused by frequent agent failures.
Overall, the NASA-TLX analysis showed that the proposed method significantly reduced user effort and tended to reduce mental demand and frustration.

4.4.2. PSSUQ Analysis

Analysis of the PSSUQ scale, which evaluates overall system usability and satisfaction, showed that the proposed method (B) received statistically significantly better evaluations (Table 4). Note that lower scores in PSSUQ indicate more positive results.
In Overall Satisfaction (PSSUQ OVERALL), Agent B (M = 2.94) received significantly better evaluations than Agent A (M = 3.43) (p = 0.035). Regarding sub-scales, System Usefulness (SYSUSE) showed that Agent B (M = 2.98) was significantly more positive than Agent A (M = 3.40) (p = 0.048), indicating that users perceived the system as useful and easy to learn despite the added complexity of graph planning. For Information Quality (INFOQUAL), Agent B (M = 2.89) was rated significantly better (i.e., lower) than Agent A (M = 3.47) (p = 0.029), which supports the view that the proactive reporting and plan sharing functions designed in Section 3.2.3 clearly conveyed the current situation to users, enhancing trust.

4.4.3. Collaboration Satisfaction Analysis (Custom Scale)

We analyzed an 8-item custom scale to assess collaboration quality. The scale’s reliability coefficient (McDonald’s Omega [28]) was 0.967, indicating high internal consistency. As shown in Table 5, the proposed method (B) recorded statistically significantly higher scores (p < 0.05) than the control group (A) across all 8 items.
Notably, for the items related to Work Focus (C1) and Reduced Burden (C2), Agent B significantly outperformed Agent A (p = 0.021 and p = 0.005, respectively), meaning the agent created an environment where users could focus on their own tasks. The Efficiency (C3) item also showed Agent B (M = 4.80) significantly higher than Agent A (M = 3.97) (p = 0.040), consistent with the quantitative time reduction results.
Crucially, in the item “The AI agent understood and followed my instructions well” (C6), which measures responsiveness to human intervention, Agent B (M = 4.57) scored significantly higher than Agent A (M = 3.50) (p = 0.004). This suggests that the dynamic modification mechanism proposed in Section 3.4 gave users a sense of efficacy that their intentions were accurately reflected. Finally, in terms of Intention to Reuse (C7) and Role Complementation (C8), Agent B received the highest evaluations, supporting its role as a competent collaborative partner beyond a simple tool.

4.4.4. Post-Interview

The post-interview results supported the quantitative and qualitative findings. Participants praised Agent B’s planning and competence, stating, “The first agent (B) seemed to understand task instructions more clearly,” and “It felt like a more competent partner because it seemed to know exactly what to do and performed it step-by-step according to a plan.” Conversely, for Agent A, limitations of the reactive approach were revealed, with comments like “A took relatively longer and made many mistakes,” and “Feedback did not effectively lead to corrections,” indicating that the reactive agent increased the burden of intervention.

4.4.5. Case Study

To qualitatively analyze the sources of the performance gap, we categorized the interaction logs into three representative behavioral limitations observed in the baseline (Agent A) but resolved in the proposed method (Agent B). Detailed verbatim transcripts are provided in Appendix C.
Case 1: Reactive Planning and User Confusion
The “Bed Making” scenario (Table A1) highlights the inefficiency of the baseline’s reactive logic. Agent A discovered requirements sequentially (Planks → Logs → Inventory Check), executing actions in a fragmented manner. This step-by-step realization not only caused high latency but also resulted in significant user confusion, as the user could not predict whether the agent was progressing correctly. In contrast, Agent B demonstrated proactive planning by decomposing the task immediately, executing the mission linearly and transparently.
Case 2: Blind Loops and State Awareness
Agent A frequently failed to track its inventory state, leading to redundant loops. As shown in the “Pickaxe Transfer” scenario (Table A2), when Agent A failed to transfer the item (“Give me the pickaxe”), it ignored the fact that it still possessed the stone pickaxe in its inventory. Instead of retrying the transfer or checking its status, it concluded it needed to craft a completely new one, triggering a redundant resource gathering loop. Conversely, Agent B accurately tracked its inventory state and skipped unnecessary sub-tasks.
Case 3: Context Inertia and Hallucination Frequency
In prolonged interactions, Agent A struggled to switch contexts due to “Context Inertia.” In the “Furnace vs. Torch” scenario (Table A3), Agent A persisted in gathering coal for “Torches” despite the user’s explicit command to “Make a Furnace.” Even after acknowledging a “Stop” command, it hallucinated constraints from the previous task (“But first I need to make planks”).
To gauge the prevalence of this issue, we conducted a random inspection of 10 extended interaction logs (5 per agent). We observed that Agent A exhibited such context-driven hallucinations in 4 out of 5 sessions (80%), whereas Agent B did so in only 1 session (20%). The severity of the errors also differed markedly: Agent A’s hallucinations were persistent (ignoring repeated user corrections), whereas Agent B’s rare error was transient and immediately resolved upon the next user input. Within the limits of this small sample, the results suggest that the Memory Summarization strategy not only reduces the frequency of hallucinations but also prevents the agent from becoming “stuck” in incorrect behavioral loops, allowing for rapid recovery.

4.5. Technical Validation

While the user study demonstrated the efficacy of the proposed collaboration framework, a critical limitation was identified during the post-analysis of the experimental data. Due to the limited complexity of the specific sub-tasks configured for the “Building a House” mission, participants rarely encountered situations necessitating the use of advanced modification functions such as ‘Switch’, ‘Add’, or ‘Delete’. Consequently, the user study alone was insufficient to fully validate the system’s performance in high-complexity environments where frequent human intervention is required. To address this gap and verify the robustness of the Intent Classification mechanism, we conducted an offline benchmark stress test.
We constructed a dataset of five distinct planning scenarios ranging from basic tasks to extreme complexity to evaluate the accuracy and inference latency of the system. The scenarios ranged from a simple ‘Wooden Pickaxe’ task (N = 12) to ‘Battle Prep’ (N = 120), which requires acquiring a full set of diamond armor, golden apples, and strength potions, and finally to a ‘Master Plan’ (N = 215). The ‘Master Plan’ represents a massive open-ended mission requiring the simultaneous completion of five end-game objectives, including placing a beacon, equipping full netherite armor, and obtaining a totem of undying. It is noteworthy that even for such an exhaustive list of objectives, the graph structure required only around 200 nodes. This suggests that N ≈ 200 represents a practical upper bound for task complexity in standard Minecraft gameplay, and that the current benchmark is therefore a reasonable basis for assessing real-world applicability.
The benchmark results, summarized in Table 6, demonstrate the robustness of the proposed framework. The Intent Classification module achieved an average accuracy of 96.0% across all test cases. While it maintained perfect accuracy for most scenarios, the ‘Master Plan’ recorded an 80.0% success rate, indicating that the LLM can generally distinguish between subtle intent differences even within complex contexts. Furthermore, regarding scalability, the system exhibited stable latency. The average latency for the simplest graph (N = 12) was 880 ms, while the most complex ‘Master Plan’ (N = 215) recorded 782 ms. The fact that inference time did not increase with the graph size indicates that the computational cost remains stable regardless of the total number of nodes. Consequently, with an accuracy of 80.0% even in the most extreme case and consistent latency, the system appears sufficiently effective and stable for real-time applications.
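The measurement procedure behind such a benchmark can be sketched as a loop that times each classification call and compares the predicted intent with a label. Everything in this sketch is an assumption for illustration: `stub_classifier` is a keyword-based stand-in for the LLM-based module, and the test utterances are hypothetical rather than taken from the actual benchmark dataset.

```python
import time

def stub_classifier(utterance: str) -> str:
    """Hypothetical keyword stand-in for the LLM-based intent classifier."""
    text = utterance.lower()
    if "instead" in text:
        return "switch"
    if "don't need" in text or "skip" in text:
        return "delete"
    if "also" in text:
        return "add"
    if "stop" in text:
        return "stop"
    return "chat"

def run_benchmark(classify, cases):
    """Return (accuracy in %, mean latency in ms) over (utterance, expected) pairs."""
    correct, total_ms = 0, 0.0
    for utterance, expected in cases:
        start = time.perf_counter()
        predicted = classify(utterance)
        total_ms += (time.perf_counter() - start) * 1000
        correct += predicted == expected
    return 100 * correct / len(cases), total_ms / len(cases)

# Hypothetical labeled cases; the real benchmark utterances are not reproduced here.
cases = [
    ("Use the iron pickaxe instead", "switch"),
    ("We don't need the golden apples", "delete"),
    ("Also brew a strength potion", "add"),
    ("Stop everything", "stop"),
    ("Nice work so far", "chat"),
]
accuracy, latency_ms = run_benchmark(stub_classifier, cases)
print(f"accuracy = {accuracy:.1f}%, mean latency = {latency_ms:.3f} ms")
```

Swapping `stub_classifier` for a function that calls the actual model would keep the same accuracy and latency bookkeeping.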

5. Discussion

The experimental results support all three hypotheses (H1, H2, and H3) established in this study. The findings demonstrate that the proposed dynamic graph-based agent (Agent B) successfully overcomes the trade-off between robustness and flexibility, a challenge previously identified in Section 2.5.
  • Verification of Efficiency and Stability (H1): Agent B not only achieved an average time reduction of 97 s, but also demonstrated performance stability, maintaining consistent execution times regardless of execution order or proficiency, unlike the reactive agent (A), which was heavily influenced by user skill.
  • Substantial Reduction in Workload (H2): NASA-TLX analysis showed the proposed method significantly reduced the Effort required for mission completion. Frustration levels also showed a decreasing trend of about 20% (p = 0.084), which we attribute to the automated recovery mechanism and plan-based execution reducing unnecessary user intervention and mental effort.
  • Improvement in Flexibility and Communication Satisfaction (H3): PSSUQ and Collaboration Satisfaction analysis showed the proposed model outperformed the control group in all items. High scores in instruction comprehension and compliance (C6) indicate that the dynamic modification mechanism in Section 3.4 accurately reflected human intent, building trust.
In conclusion, the proposed agent secured robustness and performance stability through planning and maximized flexibility and communication satisfaction through dynamic modification.

6. Conclusions

6.1. Summary and Conclusions

This study proposed a dynamic planning graph framework that accepts human intervention to resolve the trade-off between the rigidity of plan-centric approaches and the long-term instability of reactive approaches when Embodied Agents collaborate with humans in complex and dynamic 3D virtual environments like Minecraft.
The comparative experiment with 30 participants verified the multidimensional effectiveness of the proposed methodology. The results showed that the proposed agent (Agent B) improved collaboration efficiency by reducing the average mission time by 97 s compared to the reactive baseline, while maintaining consistent performance regardless of user proficiency. Furthermore, the system reduced the effort required from users and tended to lower mental demand and frustration through automated recovery capabilities. Qualitative evaluations indicated that the dynamic modification mechanism accurately reflected human intent, transforming the agent into a trustworthy active partner. Ultimately, this study highlights a fundamental paradigm shift in human–AI interaction: moving beyond the perception of agents as passive execution tools toward active, trustworthy partners capable of shared agency and dynamic collaboration.

6.2. Limitations

Although this study successfully demonstrated the validity of the proposed methodology, several limitations identified during the experiment remain important tasks for future research.
First, the utilization of dynamic functions in the user study was limited due to constraints in the experimental difficulty. Theoretically, the presence of multiple branching nodes to achieve a single goal increases the necessity for human dynamic intervention. However, the tasks designed for the user study possessed few branching nodes, meaning the agent’s initial plan was often sufficient and left little room for human modification. To address this gap regarding task complexity, we conducted the offline benchmark described in Section 4.5, which verified the system’s technical robustness against high-complexity scenarios like the ‘Master Plan’. Nevertheless, to fully evaluate the interaction dynamics, future research needs to apply these high-complexity scenarios to actual user studies. For instance, tasks such as ‘Creating a Nether Portal’, which involve diverse methodological directions and branching choices, would provide a more suitable environment to evaluate the system’s utility under conditions requiring frequent human decision-making. Furthermore, expanding comparative baselines to include Static Planners or emerging state-of-the-art approaches would provide a more rigorous assessment of the proposed dynamic framework’s relative advantages over traditional methodologies.
Second, LLM-based graph generation and modification face inherent structural limitations. To prevent cycles, disconnected nodes, and logical inconsistencies within AND/OR structures, we implemented a multi-layered framework comprising prompt constraints, programmatic verification, and a self-correction loop (up to five retries). Despite these measures, “repetitive generation failures” persisted due to the probabilistic nature of LLMs, where the model produced recurrent or new structural errors across multiple attempts. In such instances, the system was designed to bypass planning and proceed directly to execution as a fallback. However, this underscores a critical vulnerability: the current error correction mechanism relies entirely on the LLM’s reasoning, failing to provide a deterministic solution to break the cycle of invalid outputs. Consequently, future research should integrate a Deterministic Post-Processing module. Such a module would circumvent the uncertainties of generative inference by employing rule-based algorithms to programmatically remove or reconnect edges upon detecting flaws, thereby ensuring graph integrity regardless of the LLM’s stochastic behavior.
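The programmatic verification and retry loop described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names, the `(parent, child)` edge representation, and the example plans are assumptions, and AND/OR consistency checks are omitted. Cycle detection uses Kahn's topological-sort algorithm, and the retry limit of five follows the text.

```python
from collections import deque

MAX_RETRIES = 5  # the text mentions up to five self-correction retries

def find_problems(nodes, edges, root):
    """Return a list of structural problems; an empty list means the plan is usable.

    `edges` holds (parent, child) pairs, i.e. the parent requires the child.
    """
    problems = []
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for parent, child in edges:
        if parent not in children or child not in children:
            problems.append(f"edge references unknown node: {parent}->{child}")
            continue
        children[parent].append(child)
        indegree[child] += 1

    # Kahn's algorithm: if a topological order cannot cover every node, there is a cycle.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if visited != len(nodes):
        problems.append("cycle detected")

    # Every node should be reachable from the root goal.
    seen, stack = {root}, [root]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    unreachable = sorted(set(nodes) - seen)
    if unreachable:
        problems.append(f"disconnected nodes: {unreachable}")
    return problems

def generate_with_retries(generate, root):
    """Call `generate(feedback)` until the graph validates; None is the fallback signal."""
    feedback = []
    for _ in range(1 + MAX_RETRIES):
        nodes, edges = generate(feedback)
        feedback = find_problems(nodes, edges, root)
        if not feedback:
            return nodes, edges
    return None  # caller bypasses planning and executes directly

# Hypothetical generator: the first attempt contains a cycle, the second is valid.
names = ["pickaxe", "planks", "sticks"]
attempts = iter([
    (names, [("pickaxe", "planks"), ("planks", "sticks"), ("sticks", "planks")]),
    (names, [("pickaxe", "planks"), ("pickaxe", "sticks"), ("sticks", "planks")]),
])
plan = generate_with_retries(lambda feedback: next(attempts), root="pickaxe")
print(plan is not None)
```

A deterministic post-processing step of the kind proposed above would replace the `return None` fallback with rule-based edge removal or reconnection driven by the same problem list.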
Third, there are limitations in asynchronous processing for high-frequency inputs. When users input short utterances consecutively, such as “Hi”, “Mine wood”, “10 pieces”, the system sometimes recognizes them as individual commands, leading to redundant execution or failure to merge context. This stems from the lack of a buffering mechanism to handle the concurrency of real-time dialog, requiring the introduction of an intelligent queue system that predicts human utterance intent or batches utterances for processing.
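One simple form of the buffering mechanism mentioned above is a time-window batcher that merges utterances arriving in quick succession into a single command. The sketch below is an assumption-laden illustration: the class name `UtteranceBuffer`, the 2-second window, and the explicit timestamps are hypothetical, and intent prediction is not modeled.

```python
class UtteranceBuffer:
    """Merge utterances that arrive within `window` seconds of the previous one."""

    def __init__(self, window=2.0):
        self.window = window
        self._pending = []
        self._last_time = None

    def push(self, text, now):
        """Add an utterance; return a finished batch if the gap closed one, else None."""
        flushed = None
        if self._pending and now - self._last_time > self.window:
            flushed = self.flush()
        self._pending.append(text)
        self._last_time = now
        return flushed

    def flush(self):
        """Return the pending utterances as one merged command and clear the buffer."""
        merged = " ".join(self._pending)
        self._pending = []
        return merged or None

# Hypothetical message stream with arrival times in seconds.
buf = UtteranceBuffer(window=2.0)
batches = []
for text, t in [("Hi", 0.0), ("Mine wood", 0.8), ("10 pieces", 1.5), ("Make a furnace", 6.0)]:
    done = buf.push(text, now=t)
    if done:
        batches.append(done)
batches.append(buf.flush())
print(batches)
```

In a live system a timer would also need to call `flush()` once the window elapses with no further input, so that the last batch is not held indefinitely.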

6.3. Future Research Directions

The dynamic graph planning framework proposed in this study can be expanded and developed in the following directions.
First, with regard to Personalized Collaboration Reinforcement using Long-term Memory, while the current system reflects human modifications only in the graph of the specific session, future work can store human intervention patterns and preferences in long-term memory to reflect them from the graph generation stage in subsequent collaborations. This would allow the agent to become a partner that progressively evolves to match the user’s style.
Second, regarding Extension to Multi-modal Interaction, beyond current text-based instructions, research can evolve to map non-verbal signals such as user gaze, gestures, or mouse pointing to graph nodes. This would contribute to resolving instructional ambiguity like “Bring that” and increasing collaboration intuitiveness.
Third, in terms of Domain Expansion to Robotics and Multi-Agent Systems, the mechanism of graph modification through human intervention proposed in this study possesses significant potential for scalability beyond virtual sandboxes. In the field of robotics, allowing human operators to directly intervene during execution could effectively bridge the ‘Sim2Real’ gap when physical uncertainties disrupt pre-planned trajectories. Furthermore, in multi-agent systems, this framework could be extended to a human-guided coordination model. By applying the dynamic modification functions to a multi-agent context, a human supervisor could resolve inter-agent conflicts or reassign roles in real-time, thereby maintaining system coherence in complex, dynamic environments.

Author Contributions

Conceptualization, S.H. and K.H.L.; methodology, S.H. and K.H.L.; software, S.H.; validation, S.H. and K.H.L.; formal analysis, S.H.; investigation, S.H.; resources, K.H.L.; data curation, S.H.; writing—original draft preparation, S.H.; writing—review and editing, K.H.L.; visualization, S.H.; supervision, K.H.L.; project administration, K.H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of Kwangwoon University (protocol code 7001546-20250904-HR(SB)-008-11; date of approval: 4 September 2025).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

The work reported in this paper was conducted during the sabbatical year of Kwangwoon University in 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Algorithms

Algorithm A1 Dynamically Modifiable Graph-based Planning and Execution Loop
Input: User Request U, Agent State S (Inventory, History, IsExecuting)
Output: Execution Result
function HandleRequest(U, S)
    if S.IsExecuting is True then
        HandleIntervention(U, S)              ▹ See Algorithm A3
    else
        Complexity ← AnalyzeComplexity(U, S)
        if Complexity is Simple then
            ExecuteSingleAction(U)
        else                                  ▹ Complex Task
            G(V, E) ← GenerateKnowledgeGraph(U, S)   ▹ Generate Graph Plan
            Root ← G.findRootNode()
            S.IsExecuting ← True
            TraverseAndExecute(Root, G, S)
        end if
    end if
end function
function GenerateKnowledgeGraph(U, S)
    Prompt ← “Given Request U and State S, generate a DAG plan G satisfying constraints (No Cycles, Inventory First).”
    JSON ← QueryLLM(Prompt, S)
    G ← ParseAndValidate(JSON)
    return G
end function
Algorithm A2 LLM-based Context-Aware Node Prioritization
function PrioritizeTasks(Requirements, State S)
    SubGraph ← GetSubGraph(CurrentNode, Depth = k)
    Prompt ← “Given State S and Goal, prioritize tasks in Requirements.”
    PrioritizedList ← QueryLLM(Prompt, S)
    ValidateNodes(PrioritizedList, Requirements)
    return PrioritizedList
end function
Algorithm A3 Integrated Dynamic Planning and Intervention Handling
Input: User Request U, Agent State S (Inventory, History, G, IsExecuting)
Output: Execution Result
function HandleRequest(U, S)
    if S.IsExecuting is True then
        InterventionHandled ← HandleIntervention(U, S)
        if InterventionHandled is True then
            return True   ▹ Intervention processed (Switch/Add/Delete/Stop)
        end if
    end if
    Complexity ← AnalyzeComplexity(U, S)
    execute Standard Planning Process          ▹ See Algorithm A1
end function
function HandleIntervention(U, S)
    Prompt ← “Classify user intent U relative to plan S.G into {Switch, Delete, Add, Stop, New Task, Chat}.”
    JSON ← QueryLLM(Prompt, S)
    Intent ← ParseJSON(JSON)
    switch Intent.Type
        case ‘switch’
            StopActions(S)
            SwitchToNode(S, Intent.Node)
            return True
        end case
        case ‘delete’
            StopActions(S)
            DeleteNode(S.G, Intent.Node)
            if Intent.Node is Root then
                StopExecution(S)
            else
                SwitchToNode(S, S.G.Root)
            end if
            return True
        end case
        case ‘add’
            StopActions(S)
            SubGraph ← GeneratePlan(Intent.Node)
            AddNode(S.G, SubGraph, Intent.Parent)
            SwitchToNode(S, Intent.Node)
            return True
        end case
        case ‘stop’
            StopExecution(S)
            return True          ▹ Plan stopped, Agent becomes Idle
        end case
        case ‘new_task’
            StopExecution(S)
            return False      ▹ Signal to HandleRequest to start new plan
        end case
        case ‘chat’
            return False           ▹ Just chatting, keep plan running
        end case
    end switch
    return False
end function
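The intent dispatch in Algorithm A3 can be illustrated with a small runnable sketch. It is a simplification under stated assumptions: the `Plan` class, the dictionary-based intent format, and the example graph are hypothetical; the LLM classification step, `StopActions`, and sub-plan generation for added nodes are omitted; and AND/OR node types are not modeled.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Plan:
    root: str
    children: dict = field(default_factory=dict)  # node -> list of child nodes
    current: Optional[str] = None                 # node currently being worked on
    executing: bool = True

def delete_node(plan, node):
    """Remove `node` and every edge that points to it."""
    plan.children.pop(node, None)
    for kids in plan.children.values():
        if node in kids:
            kids.remove(node)

def handle_intervention(plan, intent):
    """Apply an already-classified intent; True means the intervention was consumed."""
    kind = intent["type"]
    if kind == "switch":
        plan.current = intent["node"]
        return True
    if kind == "delete":
        delete_node(plan, intent["node"])
        if intent["node"] == plan.root:
            plan.executing = False   # deleting the goal stops the plan
        else:
            plan.current = plan.root  # resume traversal from the root
        return True
    if kind == "add":
        plan.children.setdefault(intent["parent"], []).append(intent["node"])
        plan.children.setdefault(intent["node"], [])
        plan.current = intent["node"]
        return True
    if kind == "stop":
        plan.executing = False
        return True
    if kind == "new_task":
        plan.executing = False
        return False                  # caller starts a fresh plan
    return False                      # chat: leave the plan untouched

# Hypothetical plan for part of the house-building mission.
plan = Plan(
    root="house",
    children={"house": ["bed", "door"], "bed": ["wool"], "door": [], "wool": []},
    current="bed",
)
handled_delete = handle_intervention(plan, {"type": "delete", "node": "door"})
handled_add = handle_intervention(plan, {"type": "add", "node": "chest", "parent": "house"})
handled_chat = handle_intervention(plan, {"type": "chat"})
print(plan.children["house"], plan.current, plan.executing)
```

Returning a boolean mirrors the pseudocode: `True` tells the request handler that the utterance was consumed as a plan edit, while `False` lets it fall through to normal handling.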

Appendix B. Questionnaires and Interview Items

Appendix B.1. PSSUQ (Post-Study System Usability Questionnaire)

Participants responded to the following items on a 7-point Likert scale (1 = Strongly Agree, 7 = Strongly Disagree). Note: Lower scores indicate better usability.
  • Overall, I am satisfied with how easy it is to use this system.
  • It was simple to use this system.
  • I could effectively complete the tasks and scenarios using this system.
  • I was able to complete the tasks and scenarios quickly using this system.
  • I felt comfortable using this system.
  • It was easy to learn to use this system.
  • I believe I could become productive quickly using this system.
  • The system gave error messages that clearly told me how to fix problems.
  • Whenever I made a mistake using the system, I could recover easily and quickly.
  • The information (such as online help, on-screen messages, and other documentation) provided with this system was clear.
  • It was easy to find the information I needed.
  • The information was effective in helping me complete the tasks and scenarios.
  • The organization of information on the system screens was clear.
  • The interface of this system was pleasant.
  • I liked using the interface of this system.
  • This system has all the functions and capabilities I expect it to have.

Appendix B.2. NASA-TLX (Task Load Index)

Participants evaluated the workload based on the following six dimensions.
  • Mental Demand: How much mental and perceptual activity was required (e.g., thinking, deciding, calculating, remembering, looking, searching, etc.)? Was the task easy or demanding, simple or complex, exacting or forgiving?
  • Physical Demand: How much physical activity was required (e.g., pushing, pulling, turning, controlling, activating, etc.)? Was the task easy or demanding, slow or brisk, slack or strenuous, restful or laborious?
  • Temporal Demand: How much time pressure did you feel due to the rate or pace at which the tasks or task elements occurred? Was the pace slow and leisurely or rapid and frantic?
  • Effort: How hard did you have to work (mentally and physically) to accomplish your level of performance?
  • Performance: How successful do you think you were in accomplishing the goals of the task set by the experimenter (or yourself)? How satisfied were you with your performance in accomplishing these goals?
  • Frustration Level: How insecure, discouraged, irritated, stressed, and annoyed versus secure, gratified, content, relaxed, and complacent did you feel during the task?

Appendix B.3. Collaboration Satisfaction (Custom Scale)

Participants responded to the following items on a Likert scale (higher scores indicate positive evaluation).
  • B1. The AI agent collected materials on my behalf, allowing me to focus more on other tasks (e.g., exploring for other items).
  • B2. I believe the AI agent reduced my overall workload.
  • B3. Collaborating with the AI agent was more efficient than working alone.
  • B4. Overall, I was satisfied with the collaboration with the AI agent.
  • B5. Communication with the AI agent was clear and effective.
  • B6. The AI agent understood and followed my instructions well.
  • B7. I would like to work with this AI agent again in Minecraft in the future.
  • B8. I believe the AI agent complemented my role well.

Appendix B.4. Post-Experiment Interview Questions

  • You collaborated with two different AI agents today. How did the overall feeling of working with the [First Agent] differ from the [Second Agent]?
  • Which of the two agents felt more like a “competent collaboration partner (colleague)”? Conversely, which felt more like a “simple command execution tool”? Why?
  • Which agent caused more frustration or stress during the task? Specifically, at what moments did you feel that way?
  • Were there moments when the AI made mistakes or things didn’t go according to plan? If so, how did each agent cope, and how did you feel about their approach?
  • If you had to perform a complex task in Minecraft again, which of the two agents would you choose to collaborate with? What is the decisive reason for your choice?

Appendix C. Detailed Interaction Log Analysis

This appendix presents translated interaction logs illustrating the three primary behavioral limitations of the baseline agent (Agent A) compared to the successful execution of the proposed agent (Agent B).
Table A1. Scenario 1: Interaction log during the “Make a Bed” task.

Baseline (Agent A): Reactive approach causing latency
User: Make a bed and bring it to me.
Agent: I think I need to make a bed first.
Agent: I need to make planks from birch logs first!
Agent: I need to gather birch logs first!
Agent: Ah, I will make planks with the existing logs I have!
Agent: Wait, I need to convert more logs to planks!

Proposed (Agent B): Proactive planning
User: Make a bed.
Agent: To make a bed, I need wool, planks, and a crafting table. First, I need to get white wool.
Agent: I will hit a sheep to get wool.
Agent: I am collecting oak logs now.
Agent: I am crafting planks now.

Table A2. Scenario 2: Interaction log regarding inventory state awareness.

Baseline (Agent A): Blind loop despite possessing the item
User: Give me the stone pickaxe in your inventory.
Agent: Here it is! (Action executes but fails to transfer)
Agent: Looks like I have trouble giving it. I will make another one quickly.
Agent: I don’t have planks.
Agent: I need to make a wooden pickaxe first.

Proposed (Agent B): Verified state avoiding redundancy
User: Make an Oak Door.
Agent: I already have sufficient oak planks. Using the crafting table nearby.
Agent: Craft Oak Door successfully completed.
Agent: Gift Oak Door to Ryan successfully completed.

Table A3. Scenario 3: Interaction log showing response to conflicting context.

Baseline (Agent A): Hallucination due to context inertia
User: Make a furnace.
Agent: I need more coal for torches.
User: Stop making torches.
Agent: Okay, I’ll stop making torches. I’ll mine coal ore instead.
User: Stop working and come here.
Agent: Okay! But first I need to make planks!

Proposed (Agent B): Successful context switching
User: Stop mining. Build a chest now.
Agent: Understood.
Agent: Using planks to craft a chest.
Agent: Craft Chest successfully completed.

References

  1. Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An Open-Ended Embodied Agent with Large Language Models. In Proceedings of the Intrinsically-Motivated and Open-Ended Learning Workshop @ NeurIPS 2023, Paris, France, 13–15 September 2023.
  2. Yu, X.; Fu, J.; Deng, R.; Han, W. MineLand: Simulating large-scale multi-agent interactions with limited multimodal senses and physical needs. arXiv 2024, arXiv:2403.19267.
  3. Song, C.H.; Wu, J.; Washington, C.; Sadler, B.M.; Chao, W.L.; Su, Y. LLM-Planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 2998–3009.
  4. Wang, Z.; Cai, S.; Chen, G.; Liu, A.; Ma, X.; Liang, Y.; CraftJarvis, T. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 34153–34189.
  5. Dong, Y.; Zhu, X.; Pan, Z.; Zhu, L.; Yang, Y. VillagerAgent: A Graph-Based Multi-Agent Framework for Coordinating Complex Task Dependencies in Minecraft. In Findings of the Association for Computational Linguistics ACL 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 16290–16314.
  6. Guss, W.H.; Houghton, B.; Topin, N.; Wang, P.; Codel, C.; Veloso, M.; Salakhutdinov, R. MineRL: A large-scale dataset of Minecraft demonstrations. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 2442–2448.
  7. Fan, L.; Wang, G.; Jiang, Y.; Mandlekar, A.; Yang, Y.; Zhu, H.; Tang, A.; Huang, D.-A.; Zhu, Y.; Anandkumar, A. MineDojo: Building open-ended embodied agents with internet-scale knowledge. Adv. Neural Inf. Process. Syst. 2022, 35, 18343–18362.
  8. Sacerdoti, E.D. Planning in a hierarchy of abstraction spaces. Artif. Intell. 1974, 5, 115–135.
  9. Tate, A. Generating project networks. In Proceedings of the 5th International Joint Conference on Artificial Intelligence-Volume 2, Cambridge, MA, USA, 22–25 August 1977; pp. 888–893.
  10. Erol, K.; Hendler, J.; Nau, D.S. HTN planning: Complexity and expressivity. In Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence, Seattle, WA, USA, 31 July–4 August 1994; pp. 1123–1128.
  11. Nau, D.S.; Au, T.C.; Ilghami, O.; Kuter, U.; Murdock, J.W.; Wu, D.; Yaman, F. SHOP2: An HTN planning system. J. Artif. Intell. Res. 2003, 20, 379–404.
  12. Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Fu, C.; Gopalakrishnan, K.; Hausman, K.; et al. Do as I can, not as I say: Grounding language in robotic affordances. In Proceedings of the 6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand, 14–18 December 2022; pp. 287–318.
  13. Huang, W.; Xia, F.; Xiao, T.; Chan, H.; Liang, J.; Florence, P.; Zeng, A.; Tompson, J.; Mordatch, I.; Chebotar, Y.; et al. Inner Monologue: Embodied Reasoning through Planning with Language Models. In Proceedings of the 6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand, 14–18 December 2022; pp. 1769–1782.
  14. Huang, W.; Abbeel, P.; Pathak, D.; Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 9118–9147.
  15. Zitkovich, B.; Yu, T.; Xu, S.; Xu, P.; Xiao, T.; Xia, F.; Wu, J.; Wohlhart, P.; Welker, S.; Wahid, A.; et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Proceedings of the Conference on Robot Learning, PMLR, Atlanta, GA, USA, 6–9 November 2023; pp. 2165–2183.
  16. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 11809–11822.
  17. Besta, M.; Blach, N.; Kubicek, A.; Gerstenberger, R.; Podstawski, M.; Gianinazzi, L.; Gajda, J.; Lehmann, T.; Niewiadomski, H.; Nyczyk, P.; et al. Graph of thoughts: Solving elaborate problems with large language models. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17682–17690.
  18. Park, J.S.; O’Brien, J.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; pp. 1–22.
  19. Zhang, C.; Yang, K.; Hu, S.; Wang, Z.; Li, G.; Sun, Y.; Zhang, C.; Zhang, Z.; Liu, A.; Zhu, S.-C.; et al. ProAgent: Building proactive cooperative AI with large language models. arXiv 2023, arXiv:2308.11339.
  20. Chen, W.; Su, Y.; Zuo, J.; Yang, C.; Yuan, C.; Chan, C.M.; Yu, H.; Lu, Y.; Hung, Y.-H.; Qian, C.; et al. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors. In Proceedings of the ICLR, Vienna, Austria, 7–11 May 2024.
  21. White, I.; Nottingham, K.; Maniar, A.; Robinson, M.; Lillemark, H.; Maheshwari, M.; Qin, L.; Ammanabrolu, P. Collaborating action by action: A multi-agent LLM framework for embodied reasoning. arXiv 2025, arXiv:2504.17950.
  22. Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. Adv. Neural Inf. Process. Syst. 2023, 36, 8634–8652.
  23. Taioli, F.; Zorzi, E.; Franchi, G.; Castellini, A.; Farinelli, A.; Cristani, M.; Wang, Y. Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–23 October 2025; pp. 18781–18792.
  24. Liu, J.; Zhou, P.; Du, Y.; Tan, A.H.; Snoek, C.G.; Sonke, J.J.; Gavves, E. CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025.
  25. Asuzu, K.; Singh, H.; Idrissi, M. Human–robot interaction through joint robot planning with large language models. Intell. Serv. Robot. 2025, 18, 261–277.
  26. Hart, S.G.; Staveland, L.E. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Adv. Psychol. 1988, 52, 139–183.
  27. Lewis, J.R. Psychometric evaluation of the post-study system usability questionnaire: The PSSUQ. In Proceedings of the Human Factors Society Annual Meeting; Sage Publications: Los Angeles, CA, USA, 1992; Volume 36, pp. 1259–1260.
  28. Dunn, T.J.; Baguley, T.; Brunsden, V. From alpha to omega: A practical solution to the pervasive problem of internal consistency estimation. Br. J. Psychol. 2014, 105, 399–412.

Figure 1. Example of the graph generated for the ‘Craft Bed’ task, illustrating task dependencies via nodes (primitive actions) and AND/OR edges.
Figure 2. Complexity analysis flowchart that branches user requests into ‘simple execution’ and ‘graph generation’.
Figure 3. Overall flowchart analyzing human intervention based on relevance (Relevant/Irrelevant) and branching into ‘Irrelevant’, ‘Stop’, ‘New Task’, ‘Switch’, ‘Delete’, and ‘Add’.
Figure 4. Dynamic graph modification processes driven by human intervention: (a) Switch Active Node: switching to a new active node, (b) Delete Node: deleting an unnecessary node (red ‘×’ denotes the node to be removed), and (c) Add Node: adding a new node for sub-goals.
Figure 5. A screenshot of the 3 × 3 Bingo board and 9 goal items in Minecraft. Participants must place the items corresponding to the goals on the left into the frames on the right.
Figure 6. Comparison of mission completion times: (a) Overall Average Time: Agent B is faster on average; (b) Time by Order Group: completion times by execution order, indicating that Agent B performs consistently regardless of learning effects.
Figure 7. Qualitative evaluation results with consistent chart heights: (a) NASA-TLX (Workload), (b) PSSUQ (Usability), and (c) Custom Scale (Collaboration Satisfaction).
Table 1. Comparison of mission completion time and percentage of faster completion cases (% B < A) based on agent execution order.

| Scale | Mean (A) | SD (A) | Mean (B) | SD (B) | % (B < A) |
|---|---|---|---|---|---|
| Overall | 1042.267 | 433.711 | 944.6 | 347.417 | 60 |
| A First Group | 1254.267 | 456.120 | 933.6 | 368.511 | 80 |
| B First Group | 830.267 | 292.049 | 955.6 | 337.579 | 40 |

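The relative change in mean completion time can be recomputed from the Table 1 means. The short sketch below does this for each row; the dictionary keys and the helper name `relative_reduction` are illustrative choices, not names from the study. For the overall row the result is a little over 9%, in line with the reduction quoted in the abstract.

```python
# Mean completion times copied from Table 1 (units as reported in the source).
mean_a = {"overall": 1042.267, "a_first": 1254.267, "b_first": 830.267}
mean_b = {"overall": 944.6, "a_first": 933.6, "b_first": 955.6}


def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` is lower than `baseline` (negative if higher)."""
    return (baseline - proposed) / baseline * 100.0


# One value per row of Table 1; a positive value means Agent B was faster.
reductions = {group: relative_reduction(mean_a[group], mean_b[group]) for group in mean_a}

for group, pct in reductions.items():
    print(f"{group}: {pct:+.2f}%")
```

A negative value for a row means Agent B's mean time was higher than Agent A's in that order group.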
Table 2. Comparison of Utterance Counts by Agent Type.

| Scale | Mean (A) | SD (A) | Mean (B) | SD (B) | t-Value | df | p-Value | Cohen’s d |
|---|---|---|---|---|---|---|---|---|
| AI Message Count | 93.03 | 60.14 | 71.93 | 40.32 | 1.784 | 28 | 0.085 | 0.412 |
| Human Message Count | 23.41 | 14.59 | 22.38 | 14.64 | 0.314 | 28 | 0.756 | 0.071 |

Table 3. NASA-TLX Sub-scale Workload Assessment Based on Agent Used (N = 30; lower scores are more positive, except for Perf).

| Scale | Mean (A) | SD (A) | Mean (B) | SD (B) | t-Value | df | p-Value | Cohen’s d |
|---|---|---|---|---|---|---|---|---|
| Mental Demand (MD) | 2.93 | 1.14 | 2.53 | 1.25 | 1.755 | 29 | 0.090 | 0.334 |
| Physical Demand (PD) | 2.53 | 1.20 | 2.30 | 1.15 | 1.424 | 29 | 0.165 | 0.199 |
| Temporal Demand (TD) | 2.37 | 1.13 | 2.43 | 1.04 | −0.263 | 29 | 0.794 | 0.061 |
| Effort | 2.83 | 1.26 | 2.47 | 1.17 | 2.164 | 29 | 0.039 | 0.302 |
| Performance (Perf) | 3.17 | 1.21 | 3.30 | 1.24 | −0.548 | 29 | 0.588 | 0.109 |
| Frustration (Frus) | 2.93 | 1.39 | 2.37 | 1.38 | 1.788 | 29 | 0.084 | 0.410 |

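The source does not state which variant of Cohen’s d was used. The reported values appear consistent with dividing the absolute mean difference by the root mean square of the two standard deviations. The sketch below checks that assumption against the Table 3 rows; the function name `cohens_d` and the tuple layout are illustrative. Small deviations from the reported values are expected, because the inputs here are the means and SDs already rounded to two decimals.

```python
from math import sqrt

# (subscale, mean_A, sd_A, mean_B, sd_B, reported_d) copied from Table 3.
TABLE_3 = [
    ("Mental Demand", 2.93, 1.14, 2.53, 1.25, 0.334),
    ("Physical Demand", 2.53, 1.20, 2.30, 1.15, 0.199),
    ("Temporal Demand", 2.37, 1.13, 2.43, 1.04, 0.061),
    ("Effort", 2.83, 1.26, 2.47, 1.17, 0.302),
    ("Performance", 3.17, 1.21, 3.30, 1.24, 0.109),
    ("Frustration", 2.93, 1.39, 2.37, 1.38, 0.410),
]


def cohens_d(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Absolute mean difference divided by the root mean square of the two SDs.

    Assumed formula: the paper does not say which d variant it reports.
    """
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled_sd


recomputed = {name: cohens_d(ma, sa, mb, sb) for name, ma, sa, mb, sb, _ in TABLE_3}

for name, *_, reported in TABLE_3:
    print(f"{name}: recomputed {recomputed[name]:.3f}, reported {reported:.3f}")
```

Under this assumption every recomputed value falls within 0.01 of the reported one, which is the size of discrepancy that rounding of the inputs alone can produce.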
Table 4. Comparison of PSSUQ Sub-scale Average Scores and Statistical Results (N = 30; lower scores indicate better usability).

| Scale | Mean (A) | SD (A) | Mean (B) | SD (B) | t-Value | df | p-Value | Cohen’s d |
|---|---|---|---|---|---|---|---|---|
| Overall Satisfaction | 3.43 | 1.53 | 2.94 | 1.62 | 2.209 | 29 | 0.035 | 0.306 |
| System Usefulness (SYSUSE) | 3.40 | 1.57 | 2.98 | 1.76 | 2.066 | 29 | 0.048 | 0.250 |
| Information Quality (INFOQUAL) | 3.47 | 1.60 | 2.89 | 1.48 | 2.304 | 29 | 0.029 | 0.379 |
| Interface Quality (INTERQUAL) | 3.23 | 1.70 | 2.93 | 1.92 | 1.064 | 29 | 0.296 | 0.165 |

Table 5. Comparison of Scores on 8 Self-Developed Collaboration Satisfaction Items (C1–C8) (N = 30; higher scores indicate higher satisfaction).

| Item Description | Mean (A) | SD (A) | Mean (B) | SD (B) | t | df | p | d |
|---|---|---|---|---|---|---|---|---|
| C1. Work Focus | 4.23 | 1.92 | 4.97 | 1.97 | −2.451 | 29 | 0.021 | 0.376 |
| C2. Reduced Burden | 4.27 | 1.78 | 5.17 | 1.78 | −3.031 | 29 | 0.005 | 0.505 |
| C3. Efficiency | 3.97 | 2.17 | 4.80 | 2.02 | −2.154 | 29 | 0.040 | 0.397 |
| C4. Overall Satisfaction | 4.03 | 1.85 | 4.97 | 1.85 | −3.043 | 29 | 0.005 | 0.505 |
| C5. Communication | 3.80 | 1.85 | 4.53 | 2.11 | −2.083 | 29 | 0.046 | 0.370 |
| C6. Instruction Compliance | 3.50 | 1.83 | 4.57 | 1.87 | −3.087 | 29 | 0.004 | 0.576 |
| C7. Intention to Reuse | 4.10 | 1.99 | 5.03 | 2.01 | −2.603 | 29 | 0.014 | 0.467 |
| C8. Role Complementation | 4.37 | 2.03 | 5.20 | 1.81 | −2.533 | 29 | 0.017 | 0.434 |

Table 6. Benchmark results of Intent Classification Accuracy and Latency by Graph Size.

| Scenario Name | Node Count (N) | Accuracy (%) | Avg. Latency (ms) |
|---|---|---|---|
| Wooden Pickaxe | 12 | 100.0 | 880 |
| Bed Plan | 44 | 100.0 | 726 |
| Nether Portal | 60 | 100.0 | 699 |
| Battle Prep. | 120 | 100.0 | 781 |
| Master Plan | 215 | 80.0 | 782 |
| Average | - | 96.0 | 773.6 |

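The Average row of Table 6 is consistent with an unweighted mean over the five scenarios. The sketch below reproduces it from the per-scenario rows; the variable names are illustrative, and treating the row as a simple arithmetic mean is an assumption that the numbers happen to satisfy.

```python
from statistics import mean

# (scenario, node_count, accuracy_percent, avg_latency_ms) copied from Table 6.
rows = [
    ("Wooden Pickaxe", 12, 100.0, 880),
    ("Bed Plan", 44, 100.0, 726),
    ("Nether Portal", 60, 100.0, 699),
    ("Battle Prep.", 120, 100.0, 781),
    ("Master Plan", 215, 80.0, 782),
]

# Unweighted means across scenarios (each scenario counts equally).
avg_accuracy = mean(row[2] for row in rows)
avg_latency = mean(row[3] for row in rows)

print(f"Average accuracy: {avg_accuracy:.1f}%")
print(f"Average latency: {avg_latency:.1f} ms")
```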
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Han, S.; Lee, K.H. Improving the Efficiency of Collaboration Between Humans and Embodied AI Agents in 3D Virtual Environments. Appl. Sci. 2026, 16, 1135. https://doi.org/10.3390/app16021135
