Article

Adaptive Realities: Human-in-the-Loop AI for Trustworthy XR Training in Safety-Critical Domains †

by
Daniele Pretolesi
1,*,
Georg Regal
1,2,
Helmut Schrom-Feiertag
1,2 and
Manfred Tscheligi
1,2
1
Department for Artificial Intelligence and Human Interfaces, University of Salzburg, 5020 Salzburg, Austria
2
Center for Technology Experience, Austrian Institute of Technology, 1210 Vienna, Austria
*
Author to whom correspondence should be addressed.
This article is a revised and expanded version of the following papers, which were presented at international conferences. (1) Pretolesi, D.; Zechner, O.; Zafari, S.; Tscheligi, M. Human in the Loop for XR Training: Theory, Practice and Recommendations for Effective and Safe Training Environments. In Proceedings of the 2023 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (MetroXRAINE), Milano, Italy, 25–27 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 248–253. (2) Pretolesi, D.; Zechner, O.; Schrom-Feiertag, H.; Tscheligi, M. Can I Trust You? Exploring the Impact of Misleading AI Suggestions on User’s Trust. In Proceedings of the 2024 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (MetroXRAINE), St Albans, UK, 21–23 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1230–1235. (3) Pretolesi, D. Personalised training: Integrating recommender systems in XR training platforms. In Proceedings of the Mensch und Computer 2022-Workshopband. Gesellschaft für Informatik eV, Darmstadt, Germany, 4–7 September 2022; pp. 10–18420. 891. (4) Pretolesi, D.; Zechner, O.; Guirao, D.G.; Schrom-Feiertag, H.; Tscheligi, M. AI-Supported XR Training: Personalizing Medical First Responder Training. In Proceedings of the International Conference on Artificial Intelligence and Virtual Reality, Kumamoto, Japan, 21–23 July 2023; Springer: Singapore, 2023; pp. 343–356. (5) Pretolesi, D.; Gutierrez, R.; Regal, G.; Tscheligi, M. Human, I Need You: Comparison of Human-Robot Interaction Modalities for Search and Rescue Missions. In Proceedings of the 24th International Conference on Mobile and Ubiquitous Multimedia, New York, NY, USA, 2025; MUM ’25, pp. 22–32.
Multimodal Technol. Interact. 2026, 10(1), 11; https://doi.org/10.3390/mti10010011
Submission received: 15 December 2025 / Revised: 8 January 2026 / Accepted: 16 January 2026 / Published: 22 January 2026

Abstract

Extended Reality (XR) technologies have matured into powerful tools for training in high-stakes domains, from emergency response to search and rescue. Yet current systems often struggle to balance real-time AI-driven personalisation with the need for human oversight and calibrated trust. This article synthesizes the programmatic contributions of a multi-study doctoral project to advance a design-and-evaluation framework for trustworthy adaptive XR training. Across six studies, we explored (i) recommender-driven scenario adaptation based on multimodal performance and physiological signals, (ii) persuasive dashboards for trainers, (iii) architectures for AI-supported XR training in medical mass-casualty contexts, (iv) theoretical and practical integration of Human-in-the-Loop (HITL) supervision, (v) user trust and over-reliance in the face of misleading AI suggestions, and (vi) the role of interaction modality in shaping workload, explainability, and trust in human–robot collaboration. Together, these investigations show how adaptive policies, transparent explanation, and adjustable autonomy can be orchestrated into a single adaptation loop that maintains trainee engagement, improves learning outcomes, and preserves trainer agency. We conclude with design guidelines and a research agenda for extending trustworthy XR training into safety-critical environments.

1. Introduction

Realistic, repeatable, and ethically safe training is a cornerstone for preparing operators to act under pressure in high-stakes or safety-critical domains such as emergency response, medicine, law enforcement, aviation, and industrial maintenance [1,2]. Extended Reality (XR), including Augmented, Virtual, and Mixed Reality, has matured from niche applications to powerful, simulation-based environments that support scenario rehearsal, collaborative practice, and mission preparation under controlled risk [3]. In parallel, advances in Artificial Intelligence (AI) now enable XR systems to sense user state, interpret context, and adjust scenario parameters on the fly, thereby aligning the difficulty, pacing, and feedback of an exercise with the evolving needs of trainees. A key capability emerging from this convergence is automation: algorithmic agents can monitor performance and physiological variables at scale, present trends, and orchestrate stressors faster than human instructors, while preserving the immersive benefits of XR and the scalability of data-driven personalisation.
Over the past decade, the field has shifted from scripted, one-size-fits-all simulations toward data-driven personalisation grounded in multimodal sensing (e.g., interaction logs, task performance streams, body/hand kinematics, and lightweight physiological proxies) [4]. Within training platforms, recommender-style mechanisms have been proposed to suggest scenario adaptations before and during an exercise, coupled with dashboards that present, justify, and allow acceptance/rejection of those suggestions [5]. These approaches aim to deliver targeted difficulty scaling and stress management while preserving trainer agency and situational awareness.
Yet automation-centric pipelines can be brittle in high-stakes settings. Without explicit human oversight, fully automated adaptation risks overfitting to noisy signals, surfacing inappropriate challenges, or obscuring decision logic, all of which may erode user trust and undermine training effectiveness [6]. Recent work underscores the importance of Human-in-the-Loop (HITL) mechanisms that keep the trainer central to supervision and control, complemented by transparent, explainable interfaces that reveal why a suggestion is made and with what confidence [7]. At the same time, empirical evidence shows that users may still over-rely on AI suggestions even when explanations are present, highlighting the need to calibrate, not merely increase, transparency [8].
This article synthesises the programmatic contributions of a multi-study PhD project into a cohesive framework for trustworthy, adaptive XR training. We bring together (i) multimodal sensing and analytics for real-time trainee state estimation; (ii) AI-based personalisation policies for scenario adaptation; (iii) HITL controls for adjustable autonomy and human oversight; and (iv) transparency and explainability (XAI) patterns that support calibrated trust and timely intervention. Across studies, we examine the interplay between personalisation benefits, supervisory control, and trust calibration, focusing on practical constraints of safety-critical training (latency, robustness, interpretability, operator workload).
  • Challenges.
Three tensions motivate the integration agenda for adaptive XR training: (a) maintaining trust and transparency as autonomy increases, ensuring that decision rationales and system limits remain visible; (b) preserving human agency so that trainers can intervene, override, and steer adaptation without sacrificing responsiveness; and (c) handling complex, dynamic environments in which both trainee state and scenario context evolve rapidly and unpredictably [1]. Together, these challenges set the stage for coupling AI-driven personalisation with robust HITL oversight in XR.
  • Research Gap.
Prior work has shown that adaptive XR can improve learning outcomes, efficiency, and engagement, yet it typically treats autonomy, oversight, and transparency in isolation. Conversely, research on trust and explainability often abstracts away from the realities of real-time adaptation under constrained sensing and computation [9]. What is missing is an integrative account that (1) operationalises AI-based personalisation under realistic multimodal data constraints; (2) embeds HITL supervision without sacrificing responsiveness; and (3) leverages transparency/XAI to mitigate over-reliance and undertrust in safety-critical contexts.
  • Aim and Contribution.
We address this gap by advancing a design-and-evaluation framework for XR training that explicitly couples personalisation, HITL supervision, and transparency. Concretely, the following are our contributions:
C1. 
A systems perspective that unifies multimodal sensing, adaptive policies, HITL controls, and explanation surfaces for trustworthy XR training.
C2. 
Design mechanisms for adjustable autonomy and operator oversight that preserve the benefits of real-time adaptation while enabling human control.
C3. 
Personalisation strategies that account for diverse trainee profiles, tasks, and learning goals, with guidance on signal selection and fusion.
C4. 
Transparency/XAI patterns tailored to XR training that support calibrated trust, appropriate intervention, and sustained engagement.
C5. 
An empirical synthesis indicating when and how these mechanisms jointly optimise effectiveness and user trust, and where trade-offs emerge.
  • Central Research Question.
Guided by these aims, we investigate the following question:
RQ1: 
How can XR training systems incorporate AI-based personalisation and Human-in-the-Loop supervision to optimise effectiveness and user trust, particularly in high-stakes or safety-critical domains?
To operationalise this question, we articulate three sub-questions that structure the remainder of the article:
SQ1: 
How can Human-in-the-Loop mechanisms be integrated to ensure user oversight without sacrificing the benefits of automated, real-time adaptation?
SQ2: 
In what ways can personalisation enhance the effectiveness of XR simulations while accounting for diverse trainee profiles and learning goals?
SQ3: 
Which factors shape trust and sustained engagement with AI-driven XR systems, and how can system transparency and explainability mitigate over-reliance or undertrust?
  • Roadmap.
Section 2 frames the RQ and clarifies assumptions and scope. Section 3 synthesises related work on XR training, HITL supervision, and trust/explainable AI. Section 4 details the materials and methods across studies, including research design, participants, apparatus, tasks, data, adaptation policies, HITL controls, transparency mechanisms, and analyses. Section 5 presents results focusing on the synergy story that addresses the RQ and SQs. Section 6 interprets findings, situates them within the literature, and reflects on design trade-offs and implications. Section 7 concludes and outlines an extended research agenda.

2. Framing the Research Question

Purpose. Building on the introduction, this section formalises the central RQ and derives the operational constructs, design requirements, and analytical objectives that structure the remainder of the paper. Rather than repeat the background information, we specify how key ideas are instantiated and measured in our studies.

2.1. Core Constructs and Scope

  • AI-based personalisation.
Personalisation acts on steerable scenario parameters based on a multi-dimensional trainee profile. This profile includes (i) static traits such as prior domain expertise, background knowledge, and professional responsibilities (e.g., doctor vs. nurse role); (ii) dynamic states derived from task performance and interaction kinematics; and (iii) physiological baselines that account for individual variability in stress response. These considerations ensure that stressors are adjusted not just to general performance, but to the specific learning objectives assigned to a trainee’s unique professional profile (the sketch at the end of this subsection illustrates how these constructs can be represented).
  • Human-in-the-Loop supervision.
Supervision spans authoring, run-time, and review. At run-time, adjustable autonomy exposes a small set of graded modes and immediate override. Suggestions are presented with minimal, glanceable metadata (e.g., expected effect, confidence/uncertainty, time-to-apply). All interventions (accept/modify/reject) are logged to support audit and policy revision. Authoring and review rely on editable rule sets and replay tools so domain experts can refine constraints without re-implementation.
  • Trust and transparency.
Trust is treated as a calibration of reliance to competence and context. Transparency is workload-aware and layered: (i) execution cues (what the system is doing now/next), (ii) compact rationales (why a suggestion is made, with uncertainty), and (iii) progressive disclosure for deeper inspection during debrief. These elements are timed to decision points to avoid unnecessary cognitive load and to support timely human intervention.
  • Scope and assumptions.
Target tasks are safety-critical or safety-relevant, where procedural correctness, timing, and situational awareness matter. Signals are incomplete and noisy; therefore, adaptation degrades safely to defaults and honours guardrails. Data practices follow consent and minimisation, with explicit avoidance of policies that disadvantage specific trainee profiles. While our framework focuses on the logic of adaptation, we acknowledge that in domains such as firefighting or civil emergencies, photorealism is critical for the accurate identification of environmental hazards (e.g., smoke density or flame patterns). In these contexts, rendering quality is not merely an aesthetic choice but a safety-critical requirement for valid trainee assessment.
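To make these constructs concrete, the following minimal sketch shows how a trainee profile, a suggestion with glanceable metadata, and a logged intervention could be represented. The field names, types, and values are illustrative assumptions; the studies describe these constructs conceptually rather than as a published schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional

class AutonomyMode(Enum):
    """Graded autonomy modes exposed to the trainer at run-time (illustrative)."""
    MANUAL = 1       # every adaptation requires explicit approval
    SUPERVISED = 2   # low-risk changes auto-apply, high-impact ones pause for consent
    AUTONOMOUS = 3   # adaptations apply automatically but remain interruptible

@dataclass
class TraineeProfile:
    """Multi-dimensional trainee model: static traits, dynamic states, physiological baseline."""
    role: str                                # static trait, e.g. "doctor" or "nurse"
    prior_expertise_years: float             # static trait
    learning_objectives: list[str]           # goals the adaptation policy must respect
    hr_baseline_bpm: float                   # physiological baseline for stress normalisation
    current_stress: Optional[float] = None   # dynamic state in [0, 1], updated in-session
    recent_error_rate: Optional[float] = None

@dataclass
class Suggestion:
    """Glanceable metadata attached to every adaptation proposal."""
    action: str                # e.g. "inject_siren_stressor"
    expected_effect: str       # compact rationale shown to the trainer
    confidence: float          # model confidence / uncertainty cue
    time_to_apply_s: float     # how soon the change would take effect

@dataclass
class Intervention:
    """Logged trainer decision, kept for audit and policy revision."""
    suggestion: Suggestion
    decision: str              # "accept" | "modify" | "reject"
    timestamp: datetime = field(default_factory=datetime.now)
    rationale: str = ""        # optional free-text justification for debrief
```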

2.2. From RQ/SQs to Design Requirements

The RQ splits into three sub-questions that map to concrete requirements in a single trustworthy adaptation loop (Figure 1). Table 1 summarises this mapping and anticipates how each requirement is exercised in later methods and evaluations.
The working premise is that effectiveness and trust are co-optimised when robust sensing and guardrails maintain adaptation quality under uncertainty, human supervisors retain meaningful control via adjustable autonomy, and transparency is paced to support appropriate reliance without overload.

2.3. Analytical Objectives and Expected Relationships

  • SQ1 (oversight without penalty).
The objective is to test whether HITL controls reduce unsafe or pedagogically inappropriate adaptations without degrading primary task performance, and examine how autonomy level influences workload and the frequency/latency of interventions.
  • SQ2 (personalisation gains).
The objective is to estimate performance and learning gains (e.g., time, errors, retention/transfer proxies) for personalised versus non-personalised conditions; analyse moderation by trainee profile and goal alignment; and verify that guardrails prevent harmful adaptations.
  • SQ3 (calibrated trust).
The objective is to measure trust calibration and reliance patterns with/without transparency features and with/without HITL. It is expected that transparency will improve calibration, and transparency plus HITL will yield more timely, appropriate interventions than either alone.
Statistical procedures and model evaluations are detailed in Section 4.

2.4. Conceptual Framework and Boundaries

Figure 1 depicts the adaptation loop used across studies. Multimodal signals feed a trainee estimator; a constraint-aware policy proposes scenario changes; transparency reflects state and rationale; HITL can shift autonomy or intervene; and outcomes and logs drive refinement. Boundaries are pragmatic: we exclude rendering advances that do not affect adaptation/supervision, heavy instrumentation that harms ecological validity, and fully automated, non-interruptible autonomy.
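A compact procedural sketch of this loop follows. The component interfaces (estimator, policy, dashboard, logbook) are hypothetical names introduced here only to illustrate the sequence of steps; they do not correspond to a published API.

```python
def adaptation_loop(signals, estimator, policy, dashboard, logbook, guardrails):
    """One pass of the adaptation loop depicted in Figure 1 (hypothetical interfaces)."""
    # 1. Multimodal signals feed a trainee estimator (e.g., stress level plus uncertainty).
    state = estimator.estimate(signals)

    # 2. A constraint-aware policy proposes scenario changes within the safety envelope.
    proposal = policy.propose(state, guardrails)
    if proposal is None:                      # low confidence or degraded sensing
        return                                # degrade safely: keep current defaults

    # 3. Transparency: surface state and rationale at the decision point.
    dashboard.show(state=state, proposal=proposal)

    # 4. HITL: the trainer accepts, modifies, or rejects; high-impact changes pause here.
    decision = dashboard.await_decision(proposal)

    # 5. Outcomes and logs drive later refinement of the policy and the interface.
    logbook.record(state=state, proposal=proposal, decision=decision)
    if decision.accepted:
        proposal.apply()
```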
Importantly, we define this work within an ‘XR Spectrum’ approach. While immersive VR serves as the primary vehicle for high-stress validation in this doctoral programme, the underlying framework is architecturally designed to be modality-agnostic. The evidence for this generalisability across the XR continuum is provided in Study C, which utilised a Mixed Reality (MR) hybrid involving physical patient simulators, and Study F, which explicitly validated the Human-in-the-Loop mechanisms using pass-through XR visualisation. This spectrum-based design ensures that the design guidelines remain applicable as training shifts between fully virtual and augmented environments.

3. Theoretical Background

3.1. Human-in-the-Loop Systems

HITL systems mediate the boundary between automation and human agency so that advanced, AI-driven processes remain responsive, accountable, and aligned with professional judgement in safety-relevant XR training [10,11]. While automation promises efficiency and scale, Bainbridge’s “ironies of automation” remind us that higher autonomy can increase the cognitive burden on humans at precisely the moments when anomalies occur [6]. This tension motivates HITL designs that guard against both automation bias (over-reliance) and automation disuse (distrust), each linked to degraded performance and safety risks [12,13].
  • Loci of human control.
To fit XR training, HITL supervision should be explicit at three points: (i) Authoring, where domain experts specify safety envelopes, scenario pre-conditions, and permissible adaptations; (ii) Run-time, where adjustable autonomy allows trainers to approve, modify, or veto recommendations with immediate override, safe defaults, and legible state transitions; and (iii) Review, where complete logs and explanations support audit, debrief, and policy refinement. These loci map naturally to Parasuraman–Sheridan–Wickens’ stages (information acquisition, analysis, decision, and action) and enable dynamic function allocation based on workload and situational complexity [13]. Endsley’s situation awareness perspective warns that opaque automation erodes operators’ mental models of system state; HITL interfaces must therefore expose what the system is doing now, why, and what it will likely do next [14]. Sheridan’s adaptive automation concept operationalises this by modulating autonomy in response to human state and task demands [15]; Cummings’ “man + machine” paradigm similarly emphasises collaborative resilience in dynamic decision environments [16].
  • Transparency, explainability, and calibrated oversight.
Effective HITL requires explanations that are concise, contrastive, and timely [7]. Interactive, iterative involvement can improve both predictive performance and calibrated reliance [17]. Practical interventions—such as surfacing model confidence/uncertainty, previewing anticipated effects of an adaptation, and logging the rationale used—reduce over-reliance and encourage appropriate intervention when introduced at moments that avoid cognitive overload [18,19]. In XR training, these principles are instantiated through trainer-facing dashboards that show state estimates, recommended scenario changes (e.g., stressor injection/removal), expected impact, and one-click accept/modify/reject controls; AR/MR overlays can carry glanceable rationales during live supervision where appropriate [20]. Recent programme-scale platforms couple physiological monitoring with constraint-aware adaptation and strict accountability boundaries and data-governance protocols—aligning HITL practice with organisational requirements for auditability and responsibility [21,22]. Meta-analyses suggest that HITL implementations yield measurable improvements in reliability and decision quality over fully autonomous baselines [23,24,25].

3.2. Personalisation and Adaptive Learning

Personalisation in XR denotes principled, goal-aligned adjustment of pacing, difficulty, sequencing, and feedback based on an evolving trainee model. In safety-relevant domains, this often includes stress-band steering: introducing or withdrawing stressors to keep trainees within an effective performance region. Three time scales are salient: pre-session initialisation from profiles and goals; in-session real-time adaptation; and post-session updates to support longitudinal progression [26,27].
  • Mechanisms and signals.
Personalisation mechanisms span (i) rule-based scaffolding and performance-contingent branching, (ii) data-driven policies (e.g., competence or error-risk estimation), and (iii) hybrids that place domain guardrails around learned policies [5,22,28,29,30,31]. Modern XR platforms expose heterogeneous streams, interaction logs and task outcomes (robust, low latency), hand/body kinematics and trajectories, and, where feasible, lightweight physiology (HR/HRV, EDA, pupil size), that offer proxies for effort, workload, and affect [32,33,34]. Recommender-style pipelines leverage these streams to propose scenario changes for trainer approval, creating a virtuous loop in which human feedback incrementally shapes policy behaviour.
  • Cognitive and motivational grounding.
Affective computing highlights the value of responding to learners’ emotional/cognitive states [4,35]. However, XR’s sensory richness can overload users if adaptations are poorly timed or verbose. Cognitive Load Theory and multimedia learning research therefore motivate concise, well-paced interventions that respect limited attentional resources [36,37]. From a motivational standpoint, Self-Determination Theory suggests that personalisation should support autonomy, competence, and relatedness; persuasive feedback should be transparent and optional to avoid undermining agency [38,39].
  • Operational constraints.
Safety-relevant XR imposes strict latency budgets and frequent data noise/missingness. Accordingly, adaptation policies need uncertainty-aware guardrails, conservative fallbacks, and clear separation between what the system may change (within a safety envelope) and who authorises the change (adjustable autonomy). These constraints align with our focus on a trustworthy adaptation loop in which HITL supervision is a first-class run-time control layer rather than a peripheral utility.

3.3. Trust, Transparency, and Explainability

Trust is the willingness to rely on automation given expectations of competence and predictability; performance degrades with both overtrust (automation bias) and undertrust (disuse) [8,40]. In our scope, the goal is calibrated reliance: a match between reliance and actual capability/context.
  • Actionable, workload-aware transparency.
Transparency supports calibration when it is actionable (paired with concrete controls), workload-aware (timed to decision points), and layered. Useful layers include (i) execution cues (what the system is doing now/next), (ii) compact rationales with uncertainty, and (iii) progressive disclosure for deeper inspection during debrief [41,42]. Distinguishing global explanations (how the policy generally behaves) from local explanations (why this suggestion now) helps users form correct mental models while validating specific actions [7,43]. Evidence indicates that concise, context-specific explanations improve understanding and reduce inappropriate reliance, whereas excessive or poorly timed detail can depress trust or overload operators; transparency must therefore be carefully paced [36,44].
  • XR-specific considerations.
In immersive, time-pressured contexts, transparency should be glanceable and tightly coupled to HITL controls (approve, veto, adjust autonomy) to preserve situation awareness. Logging explanations together with outcomes enables ex post analysis of over-/under-reliance patterns and targeted interface/policy refinement.

3.4. Ethical Considerations and Tensions in Adaptive XR Systems

Combining AI-driven personalisation with immersive realism raises interlocking concerns about oversight, privacy, and fairness [45]. In safety-relevant training, these are instrumental, not merely normative, because misadapted scenarios can normalise unsafe behaviours.
Clear divisions of responsibility (who may change what, when, and under which conditions) and comprehensive, tamper-evident logs support contestability and auditing. HITL supervision operationalises governance requirements by ensuring that critical scenario changes are attributable and reversible [21,22].
Continuous sensing (including physiology) requires strict data minimisation, informed consent, explicit purpose limitation, and retention controls. Transparency should extend to data practices—what is captured, how it is processed, and how it informs adaptation—to sustain legitimacy and trust.
Personalisation policies must be monitored for disparate impact across profiles. Bias-aware modelling, stratified evaluation, and guardrails that prevent harmful adaptations are essential to avoid systematic disadvantage.
When multiple humans and AI components coordinate, transparency must also communicate roles, capabilities, and intent to support shared mental models and prevent accountability gaps [31,46,47].
Taken together, these foundations motivate the paper’s central stance: trustworthy adaptation in XR requires coupling AI-based personalisation with first-class HITL supervision and workload-aware transparency, under uncertainty-aware guardrails and robust governance. The remainder of the article operationalises this stance and evaluates its effects on effectiveness and calibrated trust.

4. Materials and Methods

This section integrates the full methodological programme of systems (see Table 2), instruments, procedures, and analyses, spanning (i) a recommender-driven personalisation prototype for XR training (Study A) [48], (ii) a second-generation, AI-supported training platform with trainer dashboard and smart scenario control (Study C) [49], (iii) a controlled probe of misleading AI suggestions with expert trainers (Study E) [50], and (iv) a between-subjects comparison of interaction modalities for semi-autonomous search-and-rescue (S&R) (Study F) [51]. Cross-cutting design guidance on HITL supervision and actionable transparency was derived from our theory-to-practice work [52] and our dashboard/persuasive XR investigations [53]. Where concrete specifications are required (e.g., instruments, sample size, hardware, analyses), they are reported as implemented in the corresponding study.

4.1. Study Question Mapping and Design Logic

SQ1 (HITL oversight). Trainer-facing, low-latency control surfaces with adjustable autonomy (approve/modify/reject; emergency stop), explicit state transitions, and full logging were specified and prototyped as part of the trainer dashboard; co-creation activities yielded requirements for auditability and interruptibility under tight latency constraints [49,52,53].
SQ2 (Personalisation effectiveness). A recommender loop was introduced (Study A) [48] and subsequently coupled to a wearable-driven stress estimator and smart scenario control (Study C) [49], enabling targeted difficulty/stressor modulation with trainer approval. A complementary HRI study (Study F) quantified how interface modality shapes workload, explainability, and performance—factors that condition effective personalisation [51].
SQ3 (Trust and transparency). A within-subjects probe (Study E) tested whether trainers detect and reject misleading suggestions and how suggestion accuracy relates to trust and perceived transparency [50]; the modality study (Study F) modelled trust as a function of transparency and negative effects [51].

4.2. Context and Trainee Objectives

To ground the subsequent studies, we provide a plain-language summary of the two primary training domains used in this research.
  • Medical first-responder training.
The trainee’s goal is to assess, triage, and treat victims in a disaster scene. The trainee performs physical actions like CPR or wound management on a patient simulator. The AI-driven system monitors the trainee’s stress and “injects” challenges—such as loud sirens, uncooperative bystanders, or worsening patient conditions—to test the trainee’s ability to stay focused under pressure.
  • Search-and-rescue missions.
The trainee acts as an operator for a remote-controlled robot (UGV) searching for survivors in a dangerous industrial environment. The robot is semi-autonomous and suggests specific paths or identifies potential targets; the operator must evaluate these suggestions using the robot’s camera feed and sensor data to decide the best course of action.

4.3. Platforms, Sensors, and Architecture

  • XR engines and simulation assets.
Medical first-responder (MFR) scenarios combine a high-fidelity patient simulator (ADAM®-X) with VR/MR environments; motion-capture infrastructure (e.g., OptiTrack) supports behavioural annotation and ground-truthing where deployed.
  • Wearable sensing and trainee state.
The second-generation platform fuses lightweight physiology into a stress estimator by ingesting raw wearable data, specifically ECG and electrodermal activity (EDA), via a dedicated API. To provide a concrete implementation example, these signals are processed by a random forest classifier to infer moment-to-moment stress levels in 5-s windows. These predictions and their corresponding feature importances are persisted for real-time use and longitudinal review (an illustrative code sketch of this pipeline appears at the end of this subsection).
  • Recommender and smart scenario control.
A recommender correlates simulation events with these trainee responses to propose scenario changes, such as modulating time pressure or stressors. This process is facilitated by the Centric Data Platform, which acts as a synchronisation hub to correlate physiological states with specific simulation events recorded in the XR engine, such as the introduction of a ‘Make It Rain’ environmental stressor [48,49].
  • Trainer dashboard and HITL controls.
The dashboard presents a live scene view, stress band timelines, and an actionable suggestions panel. This detailed logging architecture allows the Smart Scenario Controller to provide justified suggestions—for instance, alerting the trainer that ‘trainee stress is below the 70% threshold’—which are then displayed on the web-based dashboard for immediate human approval or override [49,52,53].
  • Data logging and governance.
Event streams (scenario events, approvals/overrides), sensor summaries, model outputs (stress levels), recommendations, and explanations are recorded in a central data platform to support after-action review and iterative model improvement under GDPR-compliant consent and retention practices.
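The following sketch illustrates how the elements described above (5-s windows of ECG/EDA, a random forest stress classifier, and a threshold rule that emits trainer-facing suggestions) could fit together in Python with scikit-learn. Only the window length, classifier type, and the 70% threshold are taken from the text; the feature set, sampling rate, training data, and function names are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

WINDOW_S = 5             # window length reported for moment-to-moment stress inference
STRESS_THRESHOLD = 0.7   # dashboard example: suggest a stressor when stress is below 70%

def window_features(ecg: np.ndarray, eda: np.ndarray, fs: int) -> np.ndarray:
    """Illustrative per-window features from raw ECG and EDA (assumed, not the project's exact set)."""
    n = WINDOW_S * fs
    feats = []
    for start in range(0, min(len(ecg), len(eda)) - n + 1, n):
        e, d = ecg[start:start + n], eda[start:start + n]
        feats.append([e.mean(), e.std(), d.mean(), d.std(), np.ptp(d)])
    return np.asarray(feats)

# Train the stress classifier on previously labelled sessions (placeholder data here).
clf = RandomForestClassifier(n_estimators=100, random_state=0)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))        # placeholder feature matrix
y_train = rng.integers(0, 2, size=200)     # placeholder labels (0 = calm, 1 = stressed)
clf.fit(X_train, y_train)
# clf.feature_importances_ could be persisted alongside predictions for longitudinal review.

def suggest_for_session(ecg, eda, fs=250):
    """Predict stress per 5-s window and emit a trainer-facing suggestion when below threshold."""
    X = window_features(ecg, eda, fs)
    stress = clf.predict_proba(X)[:, 1]    # probability of the 'stressed' class
    suggestions = []
    for i, s in enumerate(stress):
        if s < STRESS_THRESHOLD:
            suggestions.append({
                "window": i,
                "rationale": f"trainee stress ({s:.0%}) is below the {STRESS_THRESHOLD:.0%} threshold",
                "action": "inject_stressor",   # e.g. the 'Make It Rain' environmental stressor
                "requires_approval": True,     # HITL: trainer must accept or override
            })
    return suggestions
```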

4.4. Study A—Recommender Integration for XR Training (Foundational Prototype)

Aim. The aim was to integrate a recommender loop into XR training to support pre-session scenario authoring and in-session adaptation with the trainer in the loop [48].
System. A tablet/desktop interface guided trainers through scenario setup (scenario type, difficulty, environmental parameters) while surfacing data-driven suggestions learned from prior sessions; at run-time, deviations from target stress bands triggered suggestions to adjust stressors. The UI captured trainer ratings and decisions (accept/modify/reject) for subsequent learning.
Instrumentation. Recommendation quality and acceptance were planned with the ResQue framework and an adapted Technology Acceptance Model (TAM); multiple presentation variants (text vs. text+image) were specified to probe uptake and clarity.

4.5. Study C—AI-Supported XR Training for MFR

Aim. The aim was to realise a deployable platform that couples wearable-driven stress estimation with smart scenario control and a trainer dashboard, closing the sense–estimate–recommend–oversee loop [49].
Architecture. Wearables → stress estimator (random forest) → recommender with guardrails → dashboard; all outputs (predictions, recommendations, explanations) persisted; the cloud-based Trainer Support Service presents AI outputs via a responsive web app.
Apparatus. Integration with a professional mannequin (ADAM®-X) and motion-capture for realism and annotation where available.
Status. System architecture, interfaces, and deployment plan were specified to gather sufficient field data for model refinement; these components underpin Studies E and F.

4.6. Study E—Misleading Suggestions: Trust and Oversight

Aim. The aim was to assess whether expert trainers can detect and reject misleading AI suggestions and how suggestion accuracy affects trust and perceived transparency [50].
Design. Study E used a within-subjects design. To ensure high internal validity and control over the timing of misleading suggestions, the 17 expert trainers interacted with the dashboard while viewing pre-recorded MFR sessions. This ‘simulated live’ environment allowed for the precise manipulation of biosignal congruence, specifically presenting stressful trainee states alongside neutral AI arousal estimates, to probe for automation bias. In each condition, participants received two suggestions to increase stress (e.g., siren; aggressive dog). In the stressful condition, the second suggestion was deliberately incorrect (and should have been rejected); in the neutral condition, both suggestions were appropriate. IV: recording type (stressed vs. neutral). DVs: suggestion acceptance; perceived relevance/accuracy/usefulness/explainability/transparency; trust; perceived detection of biosignal manipulation.
Participants. N = 17 MFR trainers (7 female, 10 male), age 30–55 (M = 43, SD = 8.31), training experience 5–35 years (M = 15, SD = 9.17); recruited in Belgium (n = 7), Spain (n = 6), and Greece (n = 4).
Apparatus and procedure. Desktop interface with audio playback; briefing and interface familiarisation; both conditions in counterbalanced order; acceptance decisions logged; post-task questionnaires; debrief revealed the manipulation.
Measures and outcomes. Scales for trust, transparency, explainability, and perceived accuracy/usefulness; qualitative rationales. Only 8/17 accurately rejected the incorrect suggestion; the hypothesised trust decrease in the misleading condition was not observed at the group level, though those who rejected misleading items reported lower trust—evidence of potential over-reliance despite explanations.
Analysis. Descriptive and item-level analyses were complemented by qualitative coding; the expert sample size (N = 17) constrained inferential power, a limitation acknowledged in the original paper.

4.7. Study F—Interaction Modality for Semi-Autonomous UGVs (S&R)

Aim. The aim was to compare 2D, 3D, VR, and XR modalities for a semi-autonomous UGV task with suggestion-seeking autonomy, and model effects on trust, explainability, transparency, workload (NASA-TLX), task difficulty, negative effects (SSQ), and perceived performance [51].
Design. Between-subjects design; participants collaborated with a semi-autonomous UGV that presented options and uncertainty, requesting human input. Post-task questionnaires captured user experience constructs.
Participants. N = 36 (31 male, 5 female), age 19–63 (M = 37.25, SD = 12.18); professional experience bands: 0–5 (n = 9), 6–10 (n = 9), 11–15 (n = 8), >20 (n = 10). Technology Affinity (ATI) M = 4.51, SD = 1.00; no gender differences in ATI.
Apparatus. 2D and 3D conditions: desktop display; VR: Meta Quest Pro (first-person immersive); XR: Meta Quest Pro pass-through providing a 3D tabletop scene. Interaction followed the condition’s device conventions (mouse/controller). The 2D top-down view served as a baseline control for immersion and approximated the professional Command-and-Control (C2) map interfaces used in field operations. This allows for a pedagogical comparison between the high-presence affordances of immersive XR for trainees and the high-explainability requirements of supervisory roles.
Procedure. Briefing, consent, demographics, and ATI; headset acclimatisation (∼5 min) for VR/XR; task run; post-questionnaires; financial remuneration.
Measures and outcomes. One-way ANOVAs assessed modality effects. It was found that 2D yielded significantly higher Explainability than VR; XR reduced workload and perceived difficulty relative to VR; VR elicited the highest negative effects. Regression showed Transparency as a significant predictor of Trust.

4.8. Authoring, Autonomy, Guardrails, and Transparency (Cross-Cutting)

Scenario safety envelopes (bounds on stressor intensity, time pressure) and policy constraints are specified declaratively so domain experts can adjust without code changes. Under low confidence or degraded sensing, policies fall back to safe defaults [49,52].
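As an illustration, such a declarative envelope might be expressed as a simple, expert-editable structure consumed by the adaptation policy. The keys, bounds, and confidence cut-off below are hypothetical and chosen only to show the pattern of clamping proposals and falling back to safe defaults under low confidence.

```python
from typing import Optional

# Illustrative, expert-editable safety envelope (hypothetical schema and values).
SAFETY_ENVELOPE = {
    "stressor_intensity": {"min": 0.0, "max": 0.8},   # never exceed 80% of maximum intensity
    "time_pressure_factor": {"min": 0.5, "max": 1.5},
    "max_concurrent_stressors": 2,
    "min_model_confidence": 0.6,                       # below this, revert to safe defaults
}

def constrain(proposal: dict, confidence: float, envelope: dict = SAFETY_ENVELOPE) -> Optional[dict]:
    """Clamp a proposed adaptation to the envelope; return None (keep defaults) under low confidence."""
    if confidence < envelope["min_model_confidence"]:
        return None                                    # degrade safely: no change is applied
    bounds = envelope["stressor_intensity"]
    proposal["stressor_intensity"] = min(max(proposal["stressor_intensity"], bounds["min"]),
                                         bounds["max"])
    return proposal

# Example: a high-intensity proposal is clamped before it ever reaches the trainer.
clamped = constrain({"stressor_intensity": 0.95}, confidence=0.85)  # -> {"stressor_intensity": 0.8}
```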
The dashboard exposes graded autonomy modes and immediate veto/override with legible transitions; all interventions are timestamped and linked to the triggering rationale to support audit and policy revision [49,52,53].
Suggestions are delivered with concise rationales, anticipated effects, and (where applicable) confidence cues; progressive disclosure supports debrief. These patterns are critical for calibrated reliance and timely intervention [52,53].

4.9. Measures and Instruments

  • Effectiveness and performance.
Objective task metrics were collected for each scenario and modality, including completion time, error rates, and adherence to prescribed procedures. In addition, participants’ perceived performance was gathered through self-reports (Study F) to capture subjective aspects of task effectiveness [51].
  • Trust, transparency, and explainability.
To assess human–AI interaction quality, we employed validated trust scales originally developed for human–automation systems, adapted to the XR and decision-support context (Studies E, F). Perceptions of transparency and explainability were measured with instruments adapted from recommender system and XAI research, ensuring comparability with prior work. Study E further included measures of participants’ ability to detect misleading or manipulative AI suggestions, probing resilience against over-reliance [50].
  • Workload and adverse effects.
Cognitive workload was assessed using the NASA Task Load Index (NASA–TLX), a widely adopted multidimensional measure. In immersive modalities, potential side effects such as motion sickness or disorientation were captured with the Simulator Sickness Questionnaire (SSQ), applied in Study F.
  • Oversight behaviour.
To examine Human-in-the-Loop decision-making, we logged granular oversight actions: acceptance, modification, or rejection of system suggestions, response latencies, and accompanying free-text rationales (Study E). These data enabled analysis of how trainers negotiated AI input and exercised supervisory control [50].

4.10. Statistical Analyses

  • Study E.
This study used descriptive statistics for acceptance and ratings across conditions and qualitative coding of rationales. The expert sample (N = 17) precluded robust inferential testing, a limitation acknowledged in the original paper [50].
  • Study F.
This study used one-way ANOVAs with modality as a factor for trust, explainability, and transparency; Tukey HSD post hoc comparisons; and regression models with Transparency and Negative Effects predicting Trust. Results showed a significant explainability advantage for 2D over VR; transparency predicted trust; and XR minimised workload and perceived difficulty relative to VR [51].
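For transparency about the analysis pipeline, the reported tests could be reproduced along the following lines with SciPy and statsmodels; the dataframe, file name, and column names are placeholders rather than the study’s actual data.

```python
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder per-participant questionnaire data with columns:
# modality (2D/3D/VR/XR), explainability, transparency, negative_effects, trust.
df = pd.read_csv("study_f_scores.csv")  # hypothetical file name

# One-way ANOVA: effect of interaction modality on Explainability.
groups = [g["explainability"].values for _, g in df.groupby("modality")]
f_stat, p_value = stats.f_oneway(*groups)

# Tukey HSD post hoc comparisons between modalities.
tukey = pairwise_tukeyhsd(endog=df["explainability"], groups=df["modality"], alpha=0.05)
print(tukey.summary())

# Regression: Transparency and Negative Effects predicting Trust.
X = sm.add_constant(df[["transparency", "negative_effects"]])
model = sm.OLS(df["trust"], X).fit()
print(model.summary())  # reports coefficients and adjusted R-squared
```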

4.11. How the Protocol Answers SQ1–SQ3

The shared pipeline (Figure 2) enables (i) quantification of how adjustable autonomy and oversight change the appropriateness of adaptations and supervisory workload via intervention logs and latency traces (SQ1) [49,52]; (ii) evaluation of personalisation under realistic sensing/latency through recommendation uptake and performance/workload proxies, with modality-sensitive cognitive demands (SQ2) [48,49]; and (iii) modelling of trust calibration as a function of transparency and exposure to misleading suggestions, with actionable transparency designed into the dashboard and assessed across modalities (SQ3) [50,53].

5. Results

This section consolidates the outcomes achieved across the dissertation’s six appended papers (A–F) and the corresponding activities documented in the thesis. We report what was built, who was studied, how studies were run, and what was observed—organised by the sub-questions (SQs) that structure the PhD. Interpretation is reserved for Section 6.

5.1. SQ2—Personalisation in XR Training Platforms

Paper A introduced the conceptual integration of a recommender system (RS) into an XR training pipeline for on-the-fly scenario adaptation. The RS was framed to consume trainee performance and physiological indicators and to present scenario-level suggestions (e.g., add/remove stressors, adjust difficulty) to a trainer-facing interface for approval, thereby enabling personalised training without ceding control to full automation.
Building on this, Paper C delivered a second-generation implementation that couples wearable biosignal streams with a stress inference module and a trainer-facing decision layer. Concretely, biosignals are captured via a PLUX sensor system and fused with data from a high-fidelity patient simulator (ADAM-X) and the VR/MR server; key patient/manikin variables (e.g., ECG, blood pressure, CPR depth/frequency) are streamed in real time for monitoring and are persisted for debriefing. During live sessions, the dashboard presents current trainee stress and selected KPIs; after sessions, all raw/processed data are transferred to a Centric Data Platform for analytics and Smart Scenario Design. Given early data sparsity, the AI layer emphasised unsupervised methods, with random forest stress inference and time-series patterning discussed as part of the evolving stack.
At the system level, the implemented pipeline (“Smart Scenario Controller”) integrates hardware, AI services, cloud/on-prem data infrastructure, and a web-based Trainer Support Service, closing the loop between sensing, inference, recommendation, and expert oversight (see Figure 2). The trainer can accept/override suggestions; all outputs (recommendations, explanations, predictions) are logged for transparency and future reuse.

5.2. SQ1—Human-in-the-Loop (HITL) Supervision and Adjustable Autonomy

Paper D formalised how domain experts remain in control of AI-driven adaptation. Through co-creation and prototype evaluations with trainers, the work specified which recommendations require human sign-off, how confidence should be visualised, and how override policies should be enacted. The resulting closed-loop HITL workflow (Figure 3) routes trainee inputs to the XR engine, mirrors system outputs to a trainer dashboard that aggregates physiology and performance analytics, and returns trainer feedback (accept/reject/modify) to the engine in real time.
The dashboard design (Figure 4) emphasises single-screen situational awareness, including live scene video, a Training Assistant card with current stress state and available stressors, real-time ECG, colour-encoded stress-band timelines with interactive markers, and minute-updated KPIs (e.g., accuracy, response time, collaboration) to support rapid, informed decisions. This interface also hosts explainable elements that communicate the rationale behind each suggestion.

5.3. SQ3—Transparency, Trust, and Reliance Patterns Under Misleading AI

Paper E experimentally tested whether expert trainers can reject misleading AI suggestions and how such exposure modulates trust and perceived transparency. Seventeen medical first-responder trainers (three countries) completed a 2 × 2 study crossing scenario stress (stressful vs. neutral) with biosignal congruence (congruent vs. incongruent) while interacting for ≤6 min per trial with an interface presenting live arousal estimates and suggestion explanations (see also Figure 4). The majority did not reject incorrect suggestions; contrary to expectation, trust and transparency ratings remained high overall. However, those who did reject misleading suggestions tended to assign lower trust, indicating sensitivity among accurate rejectors.

5.4. Role of Multimodal Signals and Safety Guardrails (Cross-Cutting)

Across Papers A/C/D, the personalisation stack consumes behavioural logs, biosignals, and performance KPIs to generate recommendations that remain subject to trainer approval. Guardrails are implemented through explicit sign-offs, logged provenance (recommendation, explanation, acceptance state), conservative fallbacks, and real-time visualisation of confidence/rationale, ensuring calibration and accountability within high-stakes training.

5.5. Modalities, Workload, and Trust in Semi-Autonomous HRI (Paper F)

Extending beyond human-only XR training, Paper F compared four interaction modalities—2D, 3D, VR, XR—in a semi-autonomous UGV search-and-rescue task (N = 36; age M = 37.25, SD = 12.18). Post-task measures included Trust, Explainability, Transparency, NASA-TLX workload, Task Difficulty, Negative Effects, and performance (self- and robot-rated).
  • Primary outcomes.
A one-way ANOVA found a significant modality effect on Explainability, with 2D rated higher than VR (Tukey: MD = 0.95, p = 0.018); Transparency and Trust did not differ significantly across modalities. Descriptively, 3D had the highest Transparency mean. Correlation analysis showed Transparency positively correlated with Explainability and negatively with Negative Effects; Task Difficulty positively correlated with Negative Effects. Stepwise regression identified Transparency and Negative Effects as predictors of Trust; notably, Transparency exhibited a significant negative coefficient (adjusted R² = 0.116), suggesting that over-exposure to system internals can depress perceived reliability. Representative task performance means (1–7 scale) indicated competitive performance in XR (robot-rated M = 6.12) and 2D (self-rated M = 5.56). Figure 5 illustrates the four modalities tested.

5.6. Evidence Toward the Central RQ

Taken together, the results show that (i) personalisation has been designed and implemented end-to-end with explainable, trainer-supervised control (Papers A/C/D; Figure 2); (ii) HITL oversight operates in situ through a closed loop of sensing, recommendation, and explicit acceptance/override, with rich dashboard support and full logging (Paper D; Figure 3 and Figure 4); (iii) trust and transparency are nuanced: experts may still over-accept poor suggestions (Paper E), and, in HRI contexts, transparency can paradoxically depress trust if uncalibrated (Paper F; Figure 5); and (iv) modality choices shape explainability, workload, and comfort (2D improves explainability; XR tends to reduce workload/difficulty relative to VR), informing how personalisation should be adapted to different users and tasks. These findings ground the synthesis and design implications discussed in Section 6.

6. Discussion

This discussion interprets the consolidated outcomes through the lens of the central RQ and its three sub-questions, while situating them within the broader literature and integrating reflections from the research journey. A central argument advanced here is that effectiveness and calibrated trust in safety-relevant XR training are co-produced when AI-based personalisation, HITL supervision, and transparency are designed as one interaction loop rather than separable add-ons. Across the programme, sensing and trainee modelling propose changes; guardrails and transparency render these proposals legible; and supervision closes the loop with situated human judgement and audit trails. The implemented pipeline, from the Smart Scenario Controller and centric data platform to the trainer-facing dashboard, embodies this coupling in practice, with explicit accept/override points and persistent logging [49,52].

6.1. Answering the RQ: Why Effectiveness and Trust Rise Together in a Coupled Loop

Across Papers A–F, personalisation improved learning outcomes when adaptation remained bounded by pedagogical and domain constraints (Papers A, C) [48,49]. Within this envelope, high-impact changes were presented to trainers for approval, while low-risk micro-adaptations could proceed under telemetry and rollback. This is not an additive stack but a compositional mechanism: personalisation proposes; transparency explains; HITL decides; and the data layer enables refinement over time. The design resonates with models of levels of automation and situation awareness (SA) [13,14], and aligns with XAI perspectives that link calibrated reliance to intelligibility at the point of control [8,41]. Effectiveness increases because scenario adaptations better match trainee states; trust calibrates because oversight is actionable and explanations are embedded directly in the control loop. The prototypes and interfaces presented in Papers B and D illustrate these mechanisms in practice [50,53].

6.2. SQ1: Oversight Without Sacrificing Responsiveness

The HITL workflow and dashboard distribute labour in line with the “ironies of automation”: humans supervise what demands contextual interpretation and accountability; automation executes what it can do reliably under latency constraints [6]. In our implementations, the trainer dashboard mirrored XR state (scene feed, KPIs, biosignals) and channelled adaptation proposals for accept/modify/reject. Approvals triggered immediate enactment, with interventions logged for review and model refinement. This realises adjustable autonomy while sustaining SA, addressing Sub-question 1. Comparable HITL approaches in XR (e.g., [54]) similarly emphasise the importance of explainability and timely oversight.

6.3. SQ2: Personalisation Across Profiles and Goals

Personalisation was realised at two timescales: short-horizon adjustments to momentary state and long-horizon progression toward learning goals. The Smart Scenario Controller (SSC) combined biosignals (HR, HRV, EDA) with behavioural metrics, feeding a recommender engine. Early studies relied on robust features and conservative classifiers due to data sparsity, echoing findings in adaptive VR research that advocate for multi-source sensing with strong guardrails [55]. Recommendations were contextualised with rationales and confidence cues; high-uncertainty proposals were paused for trainer consent. This design reflects the broader adaptive training literature, where combining lightweight real-time intelligence with heavier offline analytics is recommended for safety-critical contexts [56]. The authoring guardrails—what may change, under which conditions—echo earlier calls for constraint-based personalisation [48].

6.4. SQ3: Trust, Transparency, and Reliance

The misleading-suggestion study (Paper E) showed that trainers sometimes accepted erroneous recommendations, demonstrating the persuasive pull of AI rationales [50]. Those who correctly rejected bad suggestions reported lower trust, underscoring that high self-reported trust is not always desirable. The modality study (Paper F) found XR reduced workload relative to VR, while 2D maximised explainability; transparency was a predictor of trust, but with a counter-intuitive negative coefficient, consistent with research on cognitive overload in transparency [42]. These findings converge with evidence from HAI that explanations can increase over-reliance if not carefully designed. They reinforce the need for contestable designs and residual human control in safety-critical training, echoing debates on automation bias and algorithm aversion.

6.5. Design Implications for Trustworthy Adaptive XR

Synthesising across the dissertation and related work, three design implications emerge:
Transparency must be embedded proximal to decision points. Rationales, confidence cues, and predicted effects should be co-located with accept/override affordances and logged for audit. This design principle is supported both by our prototypes (Papers B, C) and by the broader literature on persuasive interfaces and explainable AI in recommender systems [5,8]. By making explanations actionable rather than decorative, trainers can calibrate reliance in real time.
While our high-stress validation primarily used VR to ensure participant safety and experimental control, the underlying architecture is inherently designed for the XR spectrum. In AR/MR applications, the ‘actionable transparency’ patterns identified here (e.g., co-locating AI rationales with physical assets) are critical for maintaining the trainer’s situational awareness. The transferability of these results is anchored in the system’s ability to process multimodal inputs regardless of the rendering modality, as demonstrated by the consistent trust patterns found in the XR pass-through condition of Study F.
Levels of autonomy, state transitions, and emergency stops should be treated as interaction primitives, not hidden system parameters. In practice, this means pausing for consent on high-impact changes, while allowing micro-adaptations under rollback. Our findings, together with guidance from HITL research [52], show that adjustable autonomy preserves responsiveness while sustaining agency. This principle is echoed in the human factors literature on adaptive automation [13,14].
Adaptation must remain bounded by explicit pedagogical and safety constraints. Guardrails should specify what can change, how far, and under which conditions, and these bounds should be exposed to trainers. This ensures domain expertise anchors adaptation and mitigates brittleness under noisy sensing. Our SSC implementation and authoring tools exemplify this approach (Papers A, C) [48], aligning with adaptive VR studies advocating constrained policy spaces [55].
While persuasive elements (e.g., authority cues, social proof) can increase acceptance [38], our findings caution that overly persuasive rationales risk undermining expert judgement. The trainer dashboards thus staged explanations, coupled them with data quality indicators, and allowed overrides at the moment of enactment. This reflects wider HCI debates on algorithmic persuasion versus contestability [42].
Together, these implications extend beyond XR training to broader human–AI collaboration, showing how effectiveness and trust can be co-produced when adaptation is both intelligible and contestable.

6.6. Reflections from the Research Journey

Two inflection points shaped the programme. First, initial explorations of full automation gave way to adjustable autonomy after trainer workshops emphasised the need for consent before stressor injections and rapid veto when context shifted. Embedding accept/override at the moment of enactment proved more impactful than richer post hoc explanations. Second, the realities of field deployment (short sessions, heterogeneous sensors, privacy constraints) motivated robust signals, simple yet transparent models, and a centric data platform for asynchronous analytics. These decisions were less about algorithmic novelty than fit with organisational accountability and trainer workflow. Personally, the most sobering insight was the persuasive pull of tidy rationales: interface design alone can nudge acceptance, even of poor advice. Iteratively simplifying visual language, adding confidence and data quality flags near the accept button, and staging detail for debrief helped re-centre expert judgement without undue cognitive burden.

6.7. Methodological Reflections

Treating Papers A–F as a single programme enabled triangulation across roles (trainee, trainer), modalities (behavioural/performance logs, biosignals, immersive interfaces), and outcome dimensions (effectiveness, reliance, supervision traces). This cumulative design provided a richer evidence base than any one study alone could yield. It also presented tensions familiar in HCI and HRI research: small but expert samples versus larger convenience samples; controlled laboratory conditions versus ecological validity in the field.
Several choices were deliberate. First, the methodological sequencing, from formative prototypes (Paper A) through persuasive interface design (Paper B), physiological pipelines (Paper C), and HITL evaluations (Papers D–F), mirrors a research-through-design trajectory, where artefacts are built not only to test hypotheses but also to probe the design space. This iterative prototyping allowed findings from one stage to inform the next, creating a cumulative refinement loop rather than isolated case studies. Second, measurement instruments (e.g., NASA-TLX, SSQ, SA indices) were aligned with established HCI/HRI practice to ensure comparability, even when adapted to XR settings. Third, the use of log traces, acceptance rates, and override events offered behavioural markers of supervision quality and reliance that complemented subjective reports of trust and workload. In doing so, the programme engaged both with human factors traditions (automation bias, workload, situation awareness) and with more recent HAI work on explainability and contestability.
Nevertheless, methodological compromises were necessary. Biosignals offered promise for adaptive stress classification but remained noisy and sometimes unavailable, so fallback mechanisms and conservative approval policies were instituted. Expert trainers were few, which limited statistical power; triangulation across countries and roles partly mitigated this. Finally, XR evaluations, particularly in Papers E and F, demanded a balance between realism and experimental control, a tension widely discussed in the methodological literature on Mixed Reality studies [27].
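The following sketch illustrates the conservative fallback idea under assumed signal-quality measures and thresholds (none of which correspond to the deployed pipeline): when the biosignal window is missing or too noisy, stress-based adaptation is suppressed and only performance-based adaptation, subject to trainer approval, remains available.

```python
from typing import Optional, Sequence
import statistics

def signal_quality(samples: Optional[Sequence[float]]) -> float:
    """Crude quality index: 0 if no data, otherwise penalise high variance.
    Real pipelines would add artefact detection and sensor-specific checks."""
    if not samples or len(samples) < 2:
        return 0.0
    spread = statistics.pstdev(samples)
    return max(0.0, 1.0 - spread)  # assumes roughly normalised samples

def adaptation_source(biosignal_window, quality_threshold: float = 0.6) -> str:
    """Gate stress-based adaptation on signal quality; fall back conservatively."""
    if signal_quality(biosignal_window) >= quality_threshold:
        return "stress-adaptive (biosignals + performance)"
    return "performance-only, trainer approval required"

print(adaptation_source([0.4, 0.5, 0.45]))  # good window -> stress-adaptive
print(adaptation_source(None))              # sensor dropout -> conservative fallback
```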

6.8. Personal Reflections

Beyond methodological considerations, the PhD journey itself generated personal insights into conducting long-term, multi-project research at the intersection of XR, AI, and human factors. A first reflection concerns the value of co-design with domain experts. Trainer workshops repeatedly reshaped my assumptions: initial ideas of full automation gave way to insistence on adjustable autonomy; persuasive rationales had to be moderated after trainers pointed out their potential to mislead. These moments underscored how much design knowledge resides in practice, not in algorithms.
A second reflection is the emotional weight of researching safety-critical contexts. Medical first responders and CBRN specialists brought not only technical expertise but also lived experiences of crisis. Their emphasis on accountability and auditability constantly reminded me that technological novelty is not an end in itself; fitness to practice and organisational legitimacy matter more. The most striking personal learning was recognising how easily I, too, was persuaded by “plausible” AI rationales when testing interfaces. This humbled me as a researcher and deepened my appreciation of cognitive biases in human–AI interaction.
Finally, working across six studies reinforced the importance of iteration and resilience. Prototypes failed, biosensors produced poor data, recruitment was difficult; yet each setback became material for design adaptation and methodological growth. This reflective practice is as much a contribution as the specific results.

6.9. Limitations and Threats to Validity

As with any cumulative research, limitations constrain interpretation. External validity is bounded by the domain focus (medical first response; CBRN/S&R). Sample sizes, particularly for expert trainers (N = 17 in Paper E; N = 36 in Paper F), were necessarily modest given access constraints. Scenarios were abbreviated for feasibility, raising questions about generalisability to longer or more complex exercises. XR-specific issues such as novelty effects and cybersickness, noted in Paper F, may have shaped user experience beyond the intended variables.
Instrumentation also posed limits. Biosignals were opportunistic and often noisy; while conservative fallback strategies mitigated risk, they did not eliminate the possibility of mis-inference. Logging captured what happened more readily than why; qualitative debriefs provided partial context, but deeper sensemaking analyses of intervention rationales remain for future work. Finally, while triangulation across studies increased robustness, the heterogeneity of methods and measures complicates aggregation into single effect sizes or general laws.
Taken together, these limitations do not invalidate the findings but highlight the conditions under which they are most applicable. They also point to methodological avenues for future XR research: larger cross-national consortia to improve sample sizes, more robust sensor fusion pipelines, longitudinal field deployments, and richer qualitative capture of trainer decision rationales.

7. Conclusions

This article asked the following: How can XR training systems incorporate AI-based personalisation and Human-in-the-Loop supervision to optimise effectiveness and user trust, particularly in high-stakes or safety-critical domains? Synthesising six empirical studies and the cumulative reflections of the doctoral programme, the answer is that effectiveness and trust are co-produced when personalisation, supervision, and transparency are engineered as a single interaction loop. Multimodal sensing and trainee modelling generate adaptation proposals; transparency renders these proposals legible and predictable; HITL mechanisms make them contestable and accountable. When these elements are designed together, and bounded by pedagogical guardrails and latency constraints, adaptation improves outcomes without undermining situational awareness or human agency. This pattern is not additive but compositional: personalisation remains responsive; trainers retain meaningful oversight; and trainees develop calibrated reliance rather than blind dependence.
Contributions in brief.
First, the research consolidates a systems perspective in which multimodal sensing, AI-based personalisation, HITL supervision, and transparency form a trustworthy loop for XR training. Second, it articulates concrete design mechanisms: adjustable autonomy as an interaction primitive, guardrailed adaptation policies, and workload-aware explanations that operationalise oversight without sacrificing responsiveness. Third, it demonstrates empirically, across diverse prototypes and studies, that personalisation can improve task effectiveness when bounded by pedagogical and safety constraints and when explanations and supervisory controls allow timely human intervention. These contributions answer SQ1–SQ3 and together provide a cohesive response to the overarching RQ.

7.1. Extended Research Agenda

Looking forward, the work points to a research agenda that deepens scientific understanding and strengthens practical deployment, with emphasis on ecological validity, longitudinal impact, and reproducibility:
Longitudinal learning and transfer. Most XR studies report single-session outcomes. Future research should track retention, skill transfer, and reliance trajectories across weeks or months, clarifying how HITL oversight and explanation exposure influence long-term learning.
Adaptive autonomy scheduling. Adjustable autonomy can be formalised as a scheduling problem: when should adaptation proceed autonomously, when should the trainer be consulted, and when should proposals be deferred? Data-driven schedulers conditioned on uncertainty, workload, and risk can provide principled handover policies; a minimal sketch of such a policy is given at the end of this agenda.
Transparency under load. Explanations must be workload-aware. Comparative studies of glanceable cues, progressive disclosure, and just-in-time rationales are needed to assess how different approaches affect intervention timing, error recovery, and secondary-task performance under pressure.
Multimodal robustness and on-device inference. As XR sensing is noisy and bandwidth-constrained, research should benchmark the marginal value of combining interaction logs, kinematics, gaze, and lightweight physiology under strict latency budgets, prioritising edge or on-device inference for safety and privacy.
Guardrails as pedagogical contracts. Pedagogical and safety envelopes should be encoded as first-class, testable specifications. Verifying adaptation policies against these contracts, and learning safer bounds from override logs, would connect scenario authoring, run-time control, and post hoc audit.
Fairness and ethics of personalisation. Adaptive systems risk amplifying disparities if objectives or features correlate with background attributes. Bias-aware modelling, subgroup analyses, and counterfactual evaluation should be integral to future XR pipelines, alongside data minimisation and explicit consent practices.
Cross-domain and multi-user generalisation. The architectural pattern should be validated in additional high-stakes domains (e.g., aviation, industrial maintenance, clinical procedures). Multi-user settings, with several trainees and multiple trainers, raise unresolved questions about role assignment, consensus, and conflict resolution.
Open benchmarks and reproducibility. Progress will accelerate through shared assets: synthetic or anonymised XR logs, reference adaptation policies with guardrails, and reproducible analysis pipelines for trust calibration and oversight efficacy.
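Returning to the adaptive autonomy scheduling item above, the sketch below encodes a handover policy over three modes, conditioned on normalised uncertainty, trainer workload, and scenario risk. The mode names and thresholds are illustrative assumptions rather than validated parameters, and a data-driven scheduler would learn such boundaries from supervision logs instead of fixing them by hand.

```python
from enum import Enum

class HandoverMode(Enum):
    AUTONOMOUS = "enact adaptation automatically"
    CONSULT = "propose and wait for trainer approval"
    DEFER = "hold proposal until conditions improve"

def schedule_handover(uncertainty: float, workload: float, risk: float) -> HandoverMode:
    """Decide who acts on an adaptation proposal. High-risk or high-uncertainty
    proposals are never enacted without the trainer; if the trainer is already
    overloaded, the proposal is deferred rather than pushed. Inputs in [0, 1]."""
    if risk > 0.7 or uncertainty > 0.6:
        return HandoverMode.DEFER if workload > 0.8 else HandoverMode.CONSULT
    if uncertainty < 0.2 and risk < 0.3:
        return HandoverMode.AUTONOMOUS
    return HandoverMode.CONSULT

# Example: confident, low-risk proposal while the trainer is busy elsewhere.
print(schedule_handover(uncertainty=0.1, workload=0.9, risk=0.2).name)  # AUTONOMOUS
```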

7.2. Final Remarks

Trustworthy adaptive XR is not a technical achievement alone but a human–technology partnership. Personalisation is valuable only when it remains intelligible and contestable; supervision is effective only when it is timely and low-friction; transparency is meaningful only when it supports action, not just understanding. This dissertation’s journey shows that embedding AI-driven personalisation within a transparent, contestable, and auditable loop is key to optimising both training effectiveness and calibrated trust. By centring these principles, the work charts a path toward XR training systems that are safer, more accountable, and better aligned with the realities of high-stakes practice.

Author Contributions

Conceptualization, D.P., G.R., and H.S.-F.; methodology, D.P.; software, D.P.; validation, D.P., G.R., H.S.-F., and M.T.; formal analysis, D.P.; investigation, D.P.; resources, D.P.; data curation, D.P.; writing—original draft preparation, D.P.; writing—review and editing, D.P., G.R., H.S.-F., and M.T.; visualization, D.P.; supervision, G.R., H.S.-F., and M.T.; project administration, G.R. and H.S.-F. All authors have read and agreed to the published version of the manuscript.

Funding

Open access publication supported by the Paris Lodron University of Salzburg Publication Fund.

Institutional Review Board Statement

Ethical review and approval were not required for this study as it presents a synthesis of findings based on previously published works by the authors. The original studies cited herein ([49,50,51]) obtained the necessary ethical approvals as stated in their respective publications.

Informed Consent Statement

Not applicable, as this study involves the review of previously published data by the authors and did not involve the recruitment of new subjects.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

During the preparation of this manuscript/study, the author(s) used ChatGPT 5.1 for the purposes of syntax and grammar review. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HITL    Human-in-the-Loop
XR      Extended Reality
VR      Virtual Reality
AR      Augmented Reality
MR      Mixed Reality
CBRN    Chemical, Biological, Radiological, and Nuclear
RQ      Research Question
SQ      Sub-Question
XAI     Explainable AI
S&R     Search and Rescue

References

  1. Endsley, M.R. Situation awareness. In Handbook of Human Factors and Ergonomics; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2021; pp. 434–455. [Google Scholar]
  2. Loetscher, T.; Barrett, A.M.; Billinghurst, M.; Lange, B. Immersive medical virtual reality: Still a novelty or already a necessity? J. Neurol. Neurosurg. Psychiatry 2023, 94, 499–501. [Google Scholar] [CrossRef]
  3. Curran, V.R.; Xu, X.; Aydin, M.Y.; Meruvia-Pastor, O. Use of Extended Reality in Medical Education: An Integrative Review. Med. Sci. Educ. 2022, 33, 275–286. [Google Scholar] [CrossRef]
  4. Calvo, R.A.; D’Mello, S. Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Trans. Affect. Comput. 2010, 1, 18–37. [Google Scholar] [CrossRef]
  5. Jugovac, M.; Jannach, D. Interacting with recommenders—Overview and research directions. ACM Trans. Interact. Intell. Syst. (TiiS) 2017, 7, 1–46. [Google Scholar] [CrossRef]
  6. Bainbridge, L. Ironies of automation. In Analysis, Design and Evaluation of Man–Machine Systems; Elsevier: Amsterdam, The Netherlands, 1983; pp. 129–135. [Google Scholar]
  7. Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell. 2019, 267, 1–38. [Google Scholar] [CrossRef]
  8. Lee, J.D.; See, K.A. Trust in automation: Designing for appropriate reliance. Hum. Factors 2004, 46, 50–80. [Google Scholar] [CrossRef]
  9. Lipton, Z.C. The Mythos of Model Interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 2018, 16, 31–57. [Google Scholar] [CrossRef]
  10. Blackmore, K.L.; Smith, S.P.; Bailey, J.D.; Krynski, B. Integrating biofeedback and artificial intelligence into eXtended reality training scenarios: A systematic literature review. Simul. Gaming 2024, 55, 445–478. [Google Scholar] [CrossRef]
  11. Strauch, B. The automation-by-expertise-by-training interaction: Why automation-related accidents continue to occur in sociotechnical systems. Hum. Factors 2017, 59, 204–228. [Google Scholar] [CrossRef] [PubMed]
  12. Parasuraman, R.; Riley, V. Humans and automation: Use, misuse, disuse, abuse. Hum. Factors 1997, 39, 230–253. [Google Scholar] [CrossRef]
  13. Parasuraman, R.; Sheridan, T.B.; Wickens, C.D. A model for types and levels of human interaction with automation. IEEE Trans. Syst. Man Cybern.-Part A Syst. Humans 2000, 30, 286–297. [Google Scholar] [CrossRef]
  14. Endsley, M.R. Measurement of situation awareness in dynamic systems. Hum. Factors 1995, 37, 65–84. [Google Scholar]
  15. Sheridan, T.B. Individual differences in attributes of trust in automation: Measurement and application to system design. Front. Psychol. 2019, 10, 1117. [Google Scholar] [CrossRef]
  16. Cummings, M.M. Man versus machine or man + machine? IEEE Intell. Syst. 2014, 29, 62–69. [Google Scholar] [CrossRef]
  17. Holzinger, A.; Plass, M.; Kickmeier, M.; Holzinger, K.; Crisan, G.C.; Pintea, C.M.; Palade, V. Interactive machine learning: Experimental evidence for the human in the algorithmic loop: A case study on Ant Colony Optimization. Appl. Intell. 2019, 49, 2401–2414. [Google Scholar] [CrossRef]
  18. Tatasciore, M.; Strickland, L.; Loft, S. Transparency improves the accuracy of automation use, but automation confidence information does not. Cogn. Res. Princ. Implic. 2024, 9, 67. [Google Scholar] [CrossRef] [PubMed]
  19. Yang, X.J.; Unhelkar, V.V.; Li, K.; Shah, J.A. Evaluating effects of user experience and system transparency on trust in automation. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction; IEEE: Piscataway, NJ, USA, 2017; pp. 408–416. [Google Scholar]
  20. Xu, X.; Yu, A.; Jonker, T.R.; Todi, K.; Lu, F.; Qian, X.; Belo, J.M.E.; Wang, T.; Li, M.; Mun, A.; et al. XAIR: A Framework of Explainable AI in Augmented Reality. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems; Association for Computing Machinery: New York, NY, USA, 2023. [Google Scholar]
  21. Elish, M.C. Moral crumple zones: Cautionary tales in human-robot interaction. Engag. Sci. Technol. Soc. 2019, 4, 50–60. [Google Scholar] [CrossRef]
  22. Calhoun, G. Adaptable (not adaptive) automation: Forefront of human–automation teaming. Hum. Factors 2022, 64, 269–277. [Google Scholar] [CrossRef] [PubMed]
  23. Davis, J.L. Elevating humanism in high-stakes automation: Experts-in-the-loop and resort-to-force decision making. Aust. J. Int. Aff. 2024, 78, 200–209. [Google Scholar] [CrossRef]
  24. Holford, W.D. ‘Design-for-responsible’ algorithmic decision-making systems: A question of ethical judgement and human meaningful control. AI Ethics 2022, 2, 827–836. [Google Scholar]
  25. Green, B.; Chen, Y. The principles and limits of algorithm-in-the-loop decision making. Proc. ACM Hum.-Comput. Interact. 2019, 3, 1–24. [Google Scholar] [CrossRef]
  26. Dalgarno, B.; Lee, M.J. What are the learning affordances of 3-D virtual environments? Br. J. Educ. Technol. 2010, 41, 10–32. [Google Scholar] [CrossRef]
  27. Billinghurst, M.; Clark, A.; Lee, G. A survey of augmented reality. Found. Trends® Hum.-Comput. Interact. 2015, 8, 73–272. [Google Scholar] [CrossRef]
  28. Nanou, T.; Lekakos, G.; Fouskas, K. The effects of recommendations’ presentation on persuasion and satisfaction in a movie recommender system. Multimed. Syst. 2010, 16, 219–230. [Google Scholar]
  29. Zhang, S.; Yao, L.; Sun, A.; Tay, Y. Deep learning based recommender system: A survey and new perspectives. ACM Comput. Surv. (CSUR) 2019, 52, 1–38. [Google Scholar]
  30. Zhang, Q.; Lu, J.; Jin, Y. Artificial intelligence in recommender systems. Complex Intell. Syst. 2021, 7, 439–457. [Google Scholar] [CrossRef]
  31. Chen, J.Y.; Lakhmani, S.G.; Stowers, K.; Selkowitz, A.R.; Wright, J.L.; Barnes, M. Situation awareness-based agent transparency and human-autonomy teaming effectiveness. Theor. Issues Ergon. Sci. 2018, 19, 259–282. [Google Scholar] [CrossRef]
  32. Grassini, S.; Laumann, K.; Rasmussen Skogstad, M. The use of virtual reality alone does not promote training performance (but sense of presence does). Front. Psychol. 2020, 11, 1743. [Google Scholar] [CrossRef] [PubMed]
  33. Slater, M.; Pérez Marcos, D.; Ehrsson, H.; Sanchez-Vives, M.V. Inducing illusory ownership of a virtual body. Front. Neurosci. 2009, 3, 676. [Google Scholar] [CrossRef] [PubMed]
  34. Meehan, M.; Insko, B.; Whitton, M.; Brooks, F.P., Jr. Physiological measures of presence in stressful virtual environments. ACM Trans. Graph. 2002, 21, 645–652. [Google Scholar] [CrossRef]
  35. Picard, R.W. Affective Computing; MIT Press: Cambridge, MA, USA, 2000. [Google Scholar]
  36. Sweller, J.; Van Merrienboer, J.J.G.; Paas, F. Cognitive Architecture and Instructional Design: 20 Years Later. Educ. Psychol. Rev. 2019, 31, 261–292. [Google Scholar] [CrossRef]
  37. Mayer, R.E.; Moreno, R. Nine ways to reduce cognitive load in multimedia learning. Educ. Psychol. 2003, 38, 43–52. [Google Scholar] [CrossRef]
  38. Fogg, B.J. Persuasive technology: Using computers to change what we think and do. Ubiquity 2002, 2002, 2. [Google Scholar] [CrossRef]
  39. Ryan, R.M.; Deci, E.L. Self-determination theory and the facilitation of intrinsic motivation, social development, and well-being. Am. Psychol. 2000, 55, 68. [Google Scholar]
  40. Hoff, K.A.; Bashir, M.N. Trust in Automation. Hum. Factors J. Hum. Factors Ergon. Soc. 2015, 57, 407–434. [Google Scholar]
  41. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; García, S.; Gil-López, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  42. Buçinca, Z.; Malaya, M.B.; Gajos, K.Z. To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proc. ACM Hum.-Comput. Interact. 2021, 5, 1–21. [Google Scholar]
  43. Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608. [Google Scholar] [CrossRef]
  44. Chiou, E.K.; Lee, J.D. Trusting automation: Designing for responsivity and resilience. Hum. Factors 2023, 65, 137–165. [Google Scholar] [CrossRef] [PubMed]
  45. Binns, R. Fairness in machine learning: Lessons from political philosophy. In Proceedings of the Conference on Fairness, Accountability and Transparency, PMLR, New York, NY, USA, 23–24 February 2018; pp. 149–159. [Google Scholar]
  46. Stone, P.; Veloso, M. Multiagent systems: A survey from a machine learning perspective. Auton. Robot. 2000, 8, 345–383. [Google Scholar] [CrossRef]
  47. Goodrich, M.A.; Schultz, A.C. Human–robot interaction: A survey. Found. Trends® Hum.-Comput. Interact. 2008, 1, 203–275. [Google Scholar] [CrossRef]
  48. Pretolesi, D. Personalised training: Integrating recommender systems in XR training platforms. In Proceedings of the Mensch und Computer 2022-Workshopband; Gesellschaft für Informatik e.V.: Bonn, Germany, 2022. [Google Scholar]
  49. Pretolesi, D.; Zechner, O.; Guirao, D.G.; Schrom-Feiertag, H.; Tscheligi, M. AI-Supported XR Training: Personalizing Medical First Responder Training. In Proceedings of the International Conference on Artificial Intelligence and Virtual Reality; Springer: Berlin/Heidelberg, Germany, 2023; pp. 343–356. [Google Scholar]
  50. Pretolesi, D.; Zechner, O.; Schrom-Feiertag, H.; Tscheligi, M. Can I Trust You? Exploring the Impact of Misleading AI Suggestions on User’s Trust. In Proceedings of the 2024 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (MetroXRAINE); IEEE: Piscataway, NJ, USA, 2024; pp. 1230–1235. [Google Scholar]
  51. Pretolesi, D.; Gutierrez, R.; Regal, G.; Tscheligi, M. Human, I Need You: Comparison of Human-Robot Interaction Modalities for Search and Rescue Missions. In Proceedings of the International Conference on Mobile and Ubiquitous Multimedia; MUM ’25; Association for Computing Machinery: New York, NY, USA, 2025; pp. 22–32. [Google Scholar] [CrossRef]
  52. Pretolesi, D.; Zechner, O.; Zafari, S.; Tscheligi, M. Human in the Loop for XR Training: Theory, Practice and Recommendations for Effective and Safe Training Environments. In Proceedings of the 2023 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (MetroXRAINE); IEEE: Piscataway, NJ, USA, 2023; pp. 248–253. [Google Scholar]
  53. Pretolesi, D.; Zechner, O. Persuasive XR Training: Improving Training with AI and Dashboards. In Proceedings of the 18th International Conference on Persuasive Technology (PERSUASIVE 2023), Eindhoven, The Netherlands, 19–21 April 2023; CEUR Workshop Proceedings, 2023; pp. 19–21. [Google Scholar]
  54. Walker, M.E.; Phung, T.; Chakraborti, T.; Williams, T.J.; Szafir, D.J. Virtual, Augmented, and Mixed Reality for Human-robot Interaction: A Survey and Virtual Design Element Taxonomy. ACM Trans. Hum.-Robot Interact. 2022, 12, 1–39. [Google Scholar] [CrossRef]
  55. Zahabi, M.; Abdul Razak, A.M. Adaptive virtual reality-based training: A systematic literature review and framework. Virtual Real. 2020, 24, 725–752. [Google Scholar] [CrossRef]
  56. Gunawardana, A.; Shani, G. A Survey of Accuracy Evaluation Metrics of Recommendation Tasks. J. Mach. Learn. Res. 2009, 10, 2935–2962. [Google Scholar]
Figure 1. Integrated loop connecting sensing, estimation, constraint-aware adaptation, transparency, and HITL supervision to outcomes in safety-critical XR training.
Figure 2. Stress-adaptive XR pipeline showing sensor ingestion, stress inference, recommendation generation, trainer-in-the-loop approval and data persistence for analytics.
Figure 3. Closed-loop HITL workflow. Note that Trainer Feedback directly influences the XR system state, closing the loop by translating human oversight into immediate scenario modifications (e.g., adjusting NPC behaviour or stressors).
Figure 4. The central panel shows trainees in the virtual environment along with real-time stress monitoring and biosignals. The right panel provides AI-generated suggestions for scenario improvements and effective stressors. The left menu allows access to scenario controls, session data, and trainee performance history.
Figure 5. Different visualisation modalities used to conduct experiments in Paper F: 3D, XR tabletop, 2D top-down, and immersive VR.
Table 1. Sub-questions mapped to design requirements and their operationalisation used later in methods and evaluation.
Focus | Design Requirement | Operationalisation (Overview)
SQ1: HITL oversight | Adjustable autonomy and interruptibility | Run-time approve/modify/reject; graded autonomy modes with clear transitions; emergency stop; complete intervention logs for audit and policy updates.
SQ2: Personalisation effectiveness | Profile- and state-based adaptation with guardrails | Multimodal trainee estimator; constraint-aware policy within scenario safety envelopes; uncertainty-triggered fallbacks; tuning of pacing/difficulty/feedback to stated learning goals.
SQ3: Trust and engagement | Workload-aware transparency for calibrated reliance | Glanceable execution cues; concise rationales with uncertainty; previews of likely next actions; debrief views for sensemaking and instruction.
Cross-cutting | Low-friction authoring and maintainability | Declarative rules/guardrails/explanation templates; domain-expert-editable artefacts; automatic logging for reproducibility.
Table 2. Holistic synthesis of the research programme: architectures, modality mapping, and validation objectives.
Framework Component | Architecture and Design Integration | Modality Mapping | Evaluative Focus
Multimodal Sensing | Fusion of wearable bio-signals (ECG/EDA) with task performance and kinematics. | VR, MR (Hybrid Manikin) | Inference accuracy of stress states and signal robustness.
Scenario Adaptation | Smart Scenario Controller using recommender logic and declarative guardrails. | VR (Run-time), 2D (Authoring) | Technology acceptance (TAM) and pedagogical alignment.
HITL Oversight | Trainer Support Service providing live telemetry and adjustable autonomy controls. | 2D Desktop (Supervisory) | Oversight latency, intervention patterns, and situational awareness.
Trust and Transparency | Workload-aware XAI patterns, including rationales and uncertainty cues. | 2D, 3D, VR, XR (Pass-through) | Trust calibration, cognitive workload, and explainability.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
