3.1. Framework Components
The rapid emergence of generative artificial intelligence (AI) and extended reality (XR) holds significant potential for education [15]. Nonetheless, these technologies are frequently deployed without alignment to core principles of curriculum design and pedagogy [16]. To address this disconnect, the current study proposes a holistic, four-layer conceptual framework designed to bridge the identified gap structurally [17]. The framework proceeds sequentially: it begins with pedagogical intent (Curriculum), moves to intelligent content creation (AI), advances to immersive delivery (XR), and concludes with dynamic learner interaction and feedback [18]. This process-based structure ensures that technologically rich learning tools remain anchored to established learning goals [19].
The framework's architecture responds directly to the need for more systematic approaches in educational technology research that can adapt to fast-paced technological change without sacrificing theoretical rigor [19]. It provides a structured path from problem identification to a practically implementable model, simplifying the development process while preserving pedagogical congruence. The novelty of the framework lies in its synthesis of distinct research paradigms into an interdisciplinary pipeline:
Layer 1 (Curriculum) is based on curriculum theory and established instructional-design models.
Layer 2 (AI) integrates computer science concepts and the most recent developments in multimodal generative models.
Layer 3 (XR) incorporates human–computer interaction (HCI) and cognitive psychology.
Layer 4 (Learner Interaction) is based on learning analytics and intelligent-tutoring systems.
The framework, represented in Figure 1, explicitly connects these domains: the pedagogical objectives established in the first layer directly constrain the AI generation procedures of the second; the AI-generated content of the second layer feeds the immersive environment of the third; and the learner's interactions with the third layer supply the data that drives the analytics and adaptive processes of the fourth, creating a feedback loop that can dynamically adjust the experience.
3.1.1. Curriculum and Learning Objectives Layer
The base layer of the framework establishes the pedagogical and ethical standards within which all scenario generation is conducted. Its primary task is to convert high-level educational policy into a machine-readable format.
Input: This layer takes in formal educational standards such as the ISTE Standards for technology integration, the Next Generation Science Standards (NGSS) for science education, and the Common Core State Standards (CCSS) for English Language Arts and Mathematics [20]. The purpose of the layer is to ensure that every scenario is demonstrably aligned with relevant learning outcomes. Such grounding is essential because the curriculum must capture the essential knowledge and skills that students are expected to learn [21].
Process: Semantic Analysis and Objective Structuring
This layer uses Natural Language Processing (NLP) methods to analyze standards documents in depth, decomposing them into machine-readable parts: target concepts, required skills (marked by action verbs), context, and performance criteria [22].
Target Concepts: The main knowledge areas or concepts (e.g., biodiversity, carrying capacity, historical perspective).
Required Skills: The cognitive operations students are expected to perform, typically identified by action verbs (e.g., analyze, compare, use mathematical representations).
Context: The concrete circumstances or environments in which the skills are to be applied (e.g., ecosystems of various sizes, primary and secondary sources).
Performance Criteria: The standards against which mastery is evaluated.
One of the most important roles of this layer is to act as an ethical gatekeeper. Generative AI models can learn and amplify biases present in society and, as a result, generate inequitable content [23]. To avert this risk, the Curriculum Layer conducts a critical interpretive review, interrogating language and implied contexts for inclusivity and representativeness before directing objectives to the AI. This ethical audit is the main channel through which inequity in the generated learning scenarios can be addressed.
The resulting output is an ordered, machine-readable list of learning objectives (e.g., in JSON format). These specifications define the variables that both inform and constrain the AI Content Generation Layer, ensuring consistency and pedagogical fidelity.
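The framework does not prescribe an exact schema for these objective specifications. As a minimal sketch, assuming hypothetical field names that mirror the four components identified above, such a JSON record could be represented in a downstream engine by a serializable type like the following:

```csharp
using System;

// A hedged sketch of one possible machine-readable objective record.
// Field names are illustrative assumptions, not the framework's fixed schema.
[Serializable]
public class LearningObjective
{
    public string standardId;            // e.g., "HS-LS2-2"
    public string[] targetConcepts;      // e.g., "biodiversity", "carrying capacity"
    public string[] requiredSkills;      // action verbs, e.g., "analyze", "compare"
    public string context;               // e.g., "ecosystems of various sizes"
    public string[] performanceCriteria; // the standards against which mastery is judged
}
```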
3.1.2. Artificial Intelligence-Driven Content Generation Layer
This layer is the creative engine of the framework, converting the structured pedagogical instructions provided by the Curriculum Layer into multimodal narratives and interactive elements.
Input: The machine-readable learning objectives, which give the AI an unambiguous set of constraints and goals for generating relevant educational content.
Processes: Content generation involves multiple interrelated stages:
Semantic Mapping: The AI examines the structured objective and aligns its concepts and skills with an internal knowledge graph built from academic and curated materials. This mapping identifies the relevant principles, events, or problems that form the thematic basis of the scenario [24]. For example, an objective involving ecosystem resilience can be mapped to the concepts of keystone species, trophic cascades, and the effects of invasive species.
Multimodal Generative AI Modeling: At the core of this layer, a set of specialized generative models produces the wide range of assets required for a rich, multisensory learning experience.
Text Generation: The textual backbone of the scenario is generated by large language models such as GPT-4, Llama 3, or Claude 3. This includes the main storyline, character backgrounds, interactive conversations, instructional text, and problem sets.
Image Generation: Models such as DALL-E 3 or Midjourney are prompted to generate 2D visual components, such as environment concept art, user-interface elements, character portraits, or textures for 3D models.
Audio Generation: AI audio-synthesis models produce character voice-overs from the written text, ambient sound effects, and background music matched to the tone of the scenario.
3D Model Generation: Initial geometric meshes of objects, props, and other environmental features can be created with emerging text-to-3D models, then refined and optimized for import into a real-time engine [25,26].
Adaptive Personalization: The system applies personalization algorithms by default, using the learner profile to adjust the complexity and language of the generated content. From a single learning objective, multiple versions of a scenario can be created: one with simplified language and more explicit instructions for novice learners, and another with more advanced vocabulary and open-ended challenges for advanced learners. This ensures that the resources are appropriately difficult and accessible for a wide range of students [27].
Output: The deliverable is a set of structured scenario templates, detailed digital blueprints of the XR experience that include the full narrative script, dialog trees, asset descriptions, interactive event triggers, and the logic behind adaptive feedback loops.
The AI layer is model-agnostic and informed by the current state-of-the-art multimodal generative tools, including:
Large Language Models (LLMs): (e.g., GPT-4, Claude 3) to create text such as dialog, scripts, and in-world documents.
Text-to-Image/3D Models: (e.g., Midjourney, DALL-E 3) to produce visual assets.
Text-to-Speech/Audio Models: To generate voice-overs and ambient soundscapes.
The core technique of this layer is a systematic approach to prompt engineering, treated as a rigorous design practice that encodes pedagogical intent into a format executable by an AI [15]. This is what keeps AI-generated content pedagogically sound. The prompt structure is divided into major parts, as shown in Table 3.
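Table 3 defines the authoritative prompt structure. As a hedged illustration only, the sketch below assembles a prompt from the LearningObjective type sketched in Section 3.1.1; the part labels (role, objective, context, constraints, output format) are generic assumptions and need not match Table 3 exactly:

```csharp
using System.Text;

// A minimal sketch of systematic prompt assembly. The section labels are
// illustrative assumptions, not the exact prompt parts listed in Table 3.
public static class PromptBuilder
{
    public static string Build(LearningObjective objective)
    {
        var sb = new StringBuilder();
        sb.AppendLine("Role: You are an instructional designer generating an XR learning scenario.");
        sb.AppendLine($"Objective: Address {string.Join(", ", objective.targetConcepts)} " +
                      $"through the skills: {string.Join(", ", objective.requiredSkills)}.");
        sb.AppendLine($"Context: {objective.context}");
        sb.AppendLine("Constraints: age-appropriate language; inclusive, bias-checked content.");
        sb.AppendLine("Output format: JSON with storyline, dialogs, asset tags, and datasets.");
        return sb.ToString();
    }
}
```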
3.1.3. Extended Reality Integration Layer
The Extended Reality (XR) Integration Layer serves as the core construction node where abstract scenario templates are translated into concrete, interactive, and immersive learning environments with a heavy focus on cognitive science and human–computer interaction.
Input: The main input to this layer is a collection of structured scenario templates produced by the AI Content Generation Layer.
Processes: The integration process combines automated steps with necessary human oversight:
Integration into XR Environments: Scenario templates are integrated into a real-time 3D game engine such as Unreal Engine or Unity [28,29]. Part of this step is automated with custom scripts that read the template data (e.g., a JSON file) and construct the virtual scene: the scripts place 3D models, apply generated textures, populate dialog systems with scripted conversations, and configure the underlying logic that drives interactive components (a minimal sketch of such a script is given after this list).
Integration of Interactive Components: Developers program the interaction systems, such as physics-based object manipulation, intuitive user-interface (UI) controls for virtual tools and menus, and scripted non-player character (NPC) behaviors.
Alignment with Cognitive and Usability Principles: This design stage is critical for ensuring that the XR experience is optimally suited to the target learners, especially children. Cognitive Load Theory (CLT) provides clear design guidance aimed at keeping the learning process effective and manageable [30]:
Intrinsic Cognitive Load Management: Complex tasks outlined in the scenario are broken down into smaller, sequential steps. Environmental scaffolding, such as indicating the next action or providing step-by-step instructions, mitigates the inherent complexity of the learning content.
Reduction in Extraneous Cognitive Load: The user interface and environment are designed for simplicity and clarity to minimize extraneous cognitive load. Screen clutter is deliberately avoided, on-screen text is not duplicated when it is accompanied by narration, and interactions favor onPress over onRelease events, in line with users' natural interaction patterns.
Maximizing Germane Cognitive Load: The design deliberately facilitates deep processing to maximize germane cognitive load. Learners are guided to focus attention on the core learning activity, while supporting features encourage reflection, self-elaboration, and the application of knowledge in the immersive environment.
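As referenced in the first process above, the following is a minimal sketch of the kind of custom script that reads template data and constructs the virtual scene; the ScenarioTemplate schema, file name, and Resources paths are illustrative assumptions, not the project's actual code:

```csharp
using System;
using UnityEngine;

// Illustrative template schema; field names are assumptions.
[Serializable] public class AssetSpec { public string name; public string prefabPath; }
[Serializable] public class ScenarioTemplate { public AssetSpec[] assets; }

public class SceneBuilder : MonoBehaviour
{
    void Start()
    {
        // The template is assumed to ship as a TextAsset under Assets/Resources.
        TextAsset json = Resources.Load<TextAsset>("scenario_template");
        ScenarioTemplate template = JsonUtility.FromJson<ScenarioTemplate>(json.text);

        foreach (AssetSpec asset in template.assets)
        {
            GameObject prefab = Resources.Load<GameObject>(asset.prefabPath);
            if (prefab != null)
                Instantiate(prefab);                      // place the 3D model in the scene
            else
                Debug.LogWarning($"Missing prefab: {asset.prefabPath}"); // non-blocking
        }
    }
}
```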
Output: The resulting learning experience is comprehensive, immersive, and interactive, packaged for delivery on school-provided XR devices, including VR headsets and AR-enabled tablets.
3.1.4. Adaptive Learning and Personalization Mechanisms Layer
This final layer closes the loop, turning the XR scenario from a static, one-way experience into a dynamic, responsive, and personalized learning environment.
Input: This layer consumes rich, real-time streams of learner-interaction data: interaction data (object manipulation, paths taken), performance data (answer correctness), behavioral data (gaze tracking), and physiological data (heart rate or electrodermal activity as proxies for cognitive load).
Processes: This layer operates through two fundamental AI-driven processes:
AI-based Learning Analytics: A state-of-the-art analytics engine processes the real-time data streams. Leveraging machine learning methods, it builds a dynamic profile of the learner's current state, models changes in knowledge, identifies specific misconceptions, and tracks engagement [31].
Real-Time Adaptive Scaffolding: Based on the analytics, the system provides real-time adaptive scaffolding, a pillar of successful intelligent tutoring systems. The support is dynamic and adjusted to the learner's competencies; it may take the form of hints and prompts, difficulty modulation, or alternative narrative branches, routing a struggling student through a remedial loop or a high-achieving student into an extension activity [32].
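As a hedged, minimal illustration of how such scaffolding decisions might be encoded, the rule-based sketch below maps simple performance signals to support actions; the thresholds and categories are assumptions, not the framework's analytics model:

```csharp
// Possible support actions, mirroring the hints, difficulty modulation,
// and narrative branches described above.
public enum Scaffold { None, Hint, RemedialLoop, ExtensionActivity }

public static class ScaffoldingPolicy
{
    // correctness: proportion of correct answers so far (0..1);
    // failedAttempts: consecutive failures on the current task.
    public static Scaffold Select(float correctness, int failedAttempts)
    {
        if (correctness > 0.9f && failedAttempts == 0)
            return Scaffold.ExtensionActivity; // high achiever: extend the challenge
        if (failedAttempts >= 3)
            return Scaffold.RemedialLoop;      // struggling learner: remedial branch
        if (failedAttempts >= 1)
            return Scaffold.Hint;              // prompt before modulating difficulty
        return Scaffold.None;
    }
}
```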
Output: The main deliverable is a dynamically customized, learner-centered immersive experience. An essential secondary output is an educator feedback mechanism: a teacher-facing dashboard with actionable information on overall class progress and the most prevalent misconceptions, which supports human-centered intervention.
3.2. Framework Use Case
To illustrate the practical implementation of the conceptual framework, we further present a use case in STEM: Ecosystem Dynamics and Biodiversity. The workflow in Figure 2 illustrates how the framework systematically translates the Next Generation Science Standard HS-LS2-2 (NGSS) [33] (Layer 1) into a multimodal blueprint (Layer 2), generates a dynamic VR simulation (Layer 3), and uses real-time data to provide adaptive scaffolding (Layer 4), completing the instructional loop.
A four-layer simulation pipeline in which each layer contributes a distinct function, from curriculum encoding to adaptive feedback, constitutes an end-to-end learning framework. Table 4 summarizes each layer's role, main technical functionality, input/output, and the key technologies used to implement it.
Curriculum Layer Input: The scenario is grounded in the NGSS standard stating that students should be able to use mathematical representations to support and revise evidence-based explanations of factors affecting biodiversity and populations in ecosystems of different scales. This layer extracts the key concepts (biodiversity, carrying capacity, ecosystem resilience) and skills (data analysis, evidence-based explanation) of the standard and encodes them in machine-readable form. A Python 3.14 script was used to create a structured JSON file ("layer1_ngss.json") that decomposes the NGSS statement into discrete conceptual, procedural, and performance units.
The script encodes the main ecological concepts, including biodiversity, carrying capacity, ecosystem resilience, and trophic cascades, together with scientific practices such as data analysis, evidence-based reasoning, and modeling. By capturing this semantic representation programmatically, the curriculum layer provides a transferable, reproducible data representation that feeds directly into the AI generation process, as shown in Figure 3.
AI Layer Output: The AI creates a scenario template for a virtual forest ecosystem simulation. It generates a narrative that places the learner in the role of a field biologist investigating a declining gray wolf population; dialog with an AI research assistant; simulated longitudinal data on species populations and environmental factors, such as annual rainfall and human development encroachment, over 20 years; and a library of 3D models of the relevant flora and fauna. In this second layer, the AI model parses the Layer 1 JSON and generates a multimodal blueprint as a JSON file ("layer2_ai.json"). The blueprint defines the narrative, the 3D asset library, and the simulation parameters, including population data and environmental variables. Each scenario element, such as the wolf and the deer, is linked to a Unity-compatible path. The structured data generated in Layer 1 was interpreted using the following prompt in Claude: "Using this JSON curriculum structure, generate a virtual forest ecosystem scenario where the learner investigates declining wolf populations. Include: storyline, dataset, dialogs, environment description, and 3D asset tags."
The resulting JSON serves as the bridge between conceptual information and visualization. Its schema contains asset-library arrays, variants, and datasets, which enables Unity to read the information without manual configuration. This demonstrates a semantic hand-off between the AI layer, which offers content, and the XR layer, which implements it: concepts defined at the curriculum level are automatically translated into manipulable 3D objects in the learning environment, as shown in Figure 4.
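A hedged C# mirror of what the blueprint schema could look like on the Unity side is sketched below; the field names are assumptions consistent with the asset-library, variant, and dataset arrays described above, not the exact layer2_ai.json schema:

```csharp
using System;

// Illustrative deserialization targets for the Layer 2 blueprint.
[Serializable]
public class BlueprintAsset
{
    public string tag;         // e.g., "wolf", "deer"
    public string prefabPath;  // Unity-compatible path under Resources/
    public string[] variants;  // optional visual variants for heterogeneity
}

[Serializable]
public class EcosystemBlueprint
{
    public string storyline;
    public BlueprintAsset[] assetLibrary;
    public float[] wolfPopulation;  // simulated longitudinal data (per year)
    public float[] annualRainfall;  // environmental variable (per year)
}
```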
XR Layer Implementation: The template is realized as a high-fidelity VR experience. The student can navigate the forest and access data and a simulation tool through a virtual tablet, manipulating variables (e.g., reintroducing wolves) and running the simulation to see the predicted impact on the ecosystem.
Layer 3 translates the AI blueprint into an interactive 3D ecosystem. Immersive VR interaction was enabled in the Unity project (Unity 6.2 with XR Interaction Toolkit v3.2.2) through OpenXR support. The prefab assets were arranged in a hierarchical folder under Assets/Resources/Prefabs so that every model could be instantiated on demand through the Resources.Load() method.
The Unity environment was created by starting a new HDRP 3D project, after which the XR Plugin Management and XR Interaction Toolkit packages were installed through the Package Manager to add extended-reality capabilities. OpenXR support was then enabled for compatibility across current head-mounted displays. The default Main Camera was deleted and replaced with the XR Origin (XR Rig) prefab, which provides locomotion and controller-based interaction throughout the virtual environment. To give the ecosystem simulation a naturalistic spatial setting, a Terrain object and a Sun light source were added to represent ground topography and ambient lighting, respectively, as shown in Figure 5.
A custom C# script (DataManager.cs) reads the AI blueprint and instantiates the referenced prefabs around the learner's initial position. Objects are placed randomly within an adjustable viewing radius (e.g., 8 m) of the main camera and rotated to face the user, giving the impression of an open observation clearing within the woods. This design ensures that students encounter the relevant organisms as soon as they enter the simulation, as shown in Figure 6.
Coordinate sampling and optional terrain detection are used to lend the scene ecological authenticity: prefabs blend naturally with the ground surface and the lighting of the environment. Where asset diversity is needed, the script supports variant arrays in the JSON, allowing several visual variants to be randomized for environmental heterogeneity (e.g., different wolf or tree textures).
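The placement logic can be illustrated with the following minimal sketch; the class and member names are hypothetical stand-ins for the corresponding parts of DataManager.cs:

```csharp
using UnityEngine;

// Random placement within an adjustable radius of the viewer, rotated to
// face the user and optionally snapped to the terrain surface.
public class PrefabPlacer : MonoBehaviour
{
    public float spawnRadius = 8f; // adjustable viewing radius (e.g., 8 m)

    public GameObject Place(GameObject prefab, Transform viewer)
    {
        // Sample a random point on a disc around the viewer.
        Vector2 offset = Random.insideUnitCircle * spawnRadius;
        Vector3 pos = viewer.position + new Vector3(offset.x, 0f, offset.y);

        // Optional terrain detection: snap the object to the ground surface.
        Terrain terrain = Terrain.activeTerrain;
        if (terrain != null)
            pos.y = terrain.SampleHeight(pos) + terrain.transform.position.y;

        GameObject instance = Instantiate(prefab, pos, Quaternion.identity);

        // Rotate around the y-axis only, so the model stays upright while facing the user.
        Vector3 toViewer = viewer.position - pos;
        toViewer.y = 0f;
        if (toViewer.sqrMagnitude > 0.001f)
            instance.transform.rotation = Quaternion.LookRotation(toViewer);
        return instance;
    }
}
```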
The project uses an XR Origin (XR Rig) configured with two ray interactors for teleportation and UI manipulation. Students move through the simulation and operate the data instruments displayed on a Tablet UI, a world-space canvas. The tablet holds two sliders representing ecosystem variables, the wolf population and the deer population. The sliders' events are attached to the UpdateSimulation() method in the SimulationManager, giving the learner immediate feedback on the state of the simulation, as shown in Figure 7 and Figure 8.
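A minimal sketch of this wiring is shown below; in the project the slider events may equally be hooked up in the Unity Inspector, and the internal update of the ecosystem model is elided:

```csharp
using UnityEngine;
using UnityEngine.UI;

// Connects the tablet sliders to the simulation update, mirroring the
// UpdateSimulation() callback described above; field names are assumed.
public class SimulationManager : MonoBehaviour
{
    public Slider wolfSlider;
    public Slider deerSlider;

    void Awake()
    {
        wolfSlider.onValueChanged.AddListener(_ => UpdateSimulation());
        deerSlider.onValueChanged.AddListener(_ => UpdateSimulation());
    }

    public void UpdateSimulation()
    {
        float wolves = wolfSlider.value;
        float deer = deerSlider.value;
        // Recompute the ecosystem state from the current slider values here.
        Debug.Log($"Simulation updated: wolves={wolves}, deer={deer}");
    }
}
```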
The tablet also carries a TextMeshPro component that serves as the adaptive feedback display. The AdaptiveFeedback script interprets the current slider values and generates contextual prompts (e.g., that a stable deer population combined with declining aspen groves indicates overgrazing). This element provides the adaptive scaffolding that guides learners toward higher-level ecological interrelations, such as trophic cascades.
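The following sketch illustrates the described behavior; the thresholds and feedback strings are illustrative assumptions rather than the actual AdaptiveFeedback logic:

```csharp
using TMPro;
using UnityEngine;

// Writes contextual prompts to the tablet's TextMeshPro display based on
// the current simulation values; the thresholds below are assumptions.
public class AdaptiveFeedback : MonoBehaviour
{
    public TMP_Text feedbackDisplay; // TextMeshPro component on the tablet

    public void Refresh(float deerPopulation, float aspenCover)
    {
        if (deerPopulation > 50f && aspenCover < 0.3f)
            feedbackDisplay.text =
                "The deer population is stable, but the aspen groves are declining. " +
                "What might deer grazing be doing to the rest of the ecosystem?";
        else
            feedbackDisplay.text =
                "Adjust the sliders and observe how the ecosystem responds.";
    }
}
```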
For added realism, the prefabs were textured with hand-created or imported materials stored in Resources/Materials. Textures were assigned either through the Unity material system or programmatically via name-matching scripts. For FBX imports without textures, the geometry was cleaned in Blender to remove non-manifold polygons, the normals were recalculated, and the models were re-exported for use with Unity's rendering pipeline (see Figure 9).
Adaptive Layer in Action: The system monitors the student's hypotheses. If a student focuses only on the predator–prey relationship and fails to stabilize the ecosystem, the AI assistant gives a scaffolded prompt: "Your data indicates that, despite the stable deer population, the aspen groves are not recovering. Have you wondered what the grazing habits of the deer could be doing to the rest of the ecosystem?" This leads the student to the more complex concept of a trophic cascade.
The fourth layer completes the instructional loop by capturing learner interactions and adaptive states. The AdaptiveDataManager script copies the AI blueprint JSON at runtime, and changes to the variables (e.g., population levels, rainfall) are saved to a mutable file (adaptive_state.json) in the Unity persistent data path. This allows continuous state tracking without modifying the fixed AI blueprint, preserving reproducibility. Every slider change automatically updates this runtime JSON, which can later be reloaded to restore the simulation to a previous state. The mechanism provides an empirical trace of learner decisions and supports analysis of adaptive-feedback efficacy over time, as shown in Figure 10 and Figure 11.
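A minimal sketch of this persistence mechanism, with an assumed AdaptiveState schema, is given below:

```csharp
using System;
using System.IO;
using UnityEngine;

// Illustrative runtime-state record; fields are assumptions.
[Serializable]
public class AdaptiveState
{
    public float wolfPopulation;
    public float deerPopulation;
    public float rainfall;
}

public class AdaptiveDataManager : MonoBehaviour
{
    AdaptiveState state = new AdaptiveState();
    string StatePath => Path.Combine(Application.persistentDataPath, "adaptive_state.json");

    void Awake()
    {
        // Load any previous runtime state before the slider callbacks fire,
        // avoiding the initialization-order conflicts noted below.
        if (File.Exists(StatePath))
            state = JsonUtility.FromJson<AdaptiveState>(File.ReadAllText(StatePath));
    }

    // Called whenever a slider changes, keeping the runtime JSON current.
    public void Record(float wolves, float deer, float rain)
    {
        state.wolfPopulation = wolves;
        state.deerPopulation = deer;
        state.rainfall = rain;
        File.WriteAllText(StatePath, JsonUtility.ToJson(state, true));
    }
}
```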
Several points in the pipeline are verified through error logs and validation checks: missing prefabs or materials raise non-blocking warnings; initialization-order conflicts were resolved by loading the runtime JSONs in Awake(), before the slider callbacks fire; and geometry import warnings (e.g., self-intersecting polygons) were handled through mesh cleanup in Blender together with Unity's Optimize Mesh and Recalculate Normals options. Performance is optimized by constraining dynamic instantiation to the area around the camera and by using GPU instancing for materials.
The developed system offers a working example of how NGSS-based learning objectives can be computationally translated into immersive XR experiences. By combining AI-generated blueprints with procedural Unity instantiation, the workflow removes manual design bottlenecks and enables reproducible, data-driven ecosystem simulations. Every learner interaction is recorded at the adaptive layer, making it possible to empirically test instructional scaffolds and cognitive-load factors. To ensure reproducibility, the code (C#, Python) and asset directory layouts are clearly documented, and all runtime data output is human-readable JSON. This end-to-end transparency supports both replication of the framework's pedagogical side and validation of its technical feasibility.