Article

STREAM: A Semantic Transformation and Real-Time Educational Adaptation Multimodal Framework in Personalized Virtual Classrooms

1 Department of Teaching, Learning and Educational Leadership, Binghamton University, Binghamton, NY 13902, USA
2 Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902, USA
* Authors to whom correspondence should be addressed.
Future Internet 2025, 17(12), 564; https://doi.org/10.3390/fi17120564
Submission received: 24 October 2025 / Revised: 24 November 2025 / Accepted: 2 December 2025 / Published: 5 December 2025

Abstract

Most adaptive learning systems personalize around content sequencing and difficulty adjustment rather than transforming instructional material within the lesson itself. This paper presents the STREAM (Semantic Transformation and Real-Time Educational Adaptation Multimodal) framework. This modular pipeline decomposes multimodal educational content into semantically tagged, pedagogically annotated units for regeneration into alternative formats while preserving source traceability. STREAM is designed to integrate automatic speech recognition, transformer-based natural language processing, and planned computer vision components to extract instructional elements from teacher explanations, slides, and embedded media. Each unit receives metadata, including time codes, instructional type, cognitive demand, and prerequisite concepts, designed to enable format-specific regeneration with explicit provenance links. For a predefined visual-learner profile, the system generates annotated path diagrams, two-panel instructional guides, and entity pictograms with complete back-link coverage. Ablation studies confirm that individual components contribute measurably to output completeness without compromising traceability. This paper reports results from a tightly scoped feasibility pilot that processes a single five-minute elementary STEM video offline under clean audio–visual conditions. We position the pilot’s limitations as testable hypotheses that require validation across diverse content domains, authentic deployments with ambient noise and bandwidth constraints, multiple learner profiles, including multilingual students and learners with disabilities, and controlled comprehension studies. The contribution is a transparent technical demonstration of feasibility and a methodological scaffold for investigating whether within-lesson content transformation can support personalized learning at scale.

Graphical Abstract

1. Introduction

This study addresses the limitations of traditional one-size-fits-all virtual learning environments (VLEs) by presenting the design of the Semantic Transformation and Real-Time Educational Adaptation Multimodal (STREAM) framework and a preliminary evaluation in a proof-of-concept pilot. The work leverages artificial intelligence (AI) and multimodal machine learning (ML) to enable low-latency analysis and delivery of personalized educational content; real-time application is the design goal, though only offline delivery was tested [1]. As digital learning expands rapidly, many virtual classrooms rely on static, uniform instructional content that fails to meet the diverse needs, preferences, and learning styles of students, especially those from underrepresented or marginalized backgrounds [2]. This study aims to bridge this critical gap by outlining the STREAM framework’s design, which could support the analysis of instructional content and its transformation into customized formats based on individual learner profiles. Specifically, STREAM includes three key components: (1) real-time content analysis and knowledge extraction, (2) semantic understanding of learners’ cognitive and affective characteristics, and (3) adaptive content generation and delivery through multi-modal channels such as text, audio, and video. Through a foundational survey and conceptual design, this research lays the groundwork for building intelligent virtual classrooms that dynamically respond to student variability. STREAM is designed to support future pilot studies and prototypes that demonstrate the feasibility and scalability of AI-driven personalized learning systems.
In addition, the foundational principles of Universal Design for Learning (UDL) have evolved over the past few decades to promote accessible and inclusive education, particularly in VLEs. Originating from architectural concepts of universal design, UDL was formalized in education by Rose and Meyer [3] in their seminal work, which emphasizes flexible learning pathways that accommodate diverse learners through multiple means of representation, action and expression, and engagement. Early applications in digital contexts, such as those explored by Hall, Meyer, and Rose [4], highlighted UDL’s potential to reduce barriers for students with disabilities in online settings. More recent syntheses, such as the systematic review by Alghamdi [5], underscore UDL’s role in post-secondary online education when integrated with metadata-driven adaptations. This body of work, spanning early-2000s frameworks to 2025 reviews [6], informs STREAM’s AI-driven extensions for personalization in VLEs, which target real-time use although only offline personalization was evaluated here.
Furthermore, the COVID-19 pandemic catalyzed a global shift toward remote and hybrid learning, exposing both the potential and the limitations of existing virtual education systems [7]. Although many institutions rapidly adopted online platforms, most of these environments were designed around a uniform delivery model and failed to accommodate learners’ diverse needs, backgrounds, and preferences [8]. In response, educational researchers and technologists have turned to AI and emerging technologies to re-imagine instructional delivery [9]. AI provides powerful tools to transform the way content is analyzed, interpreted, and adapted in real time to support personalized learning [10]. When combined with multimodal delivery methods such as text, video, audio, and interactive media, AI can dynamically tailor instruction to learners’ cognitive profiles and preferred modalities [11]. Despite these advances, within-lesson, responsive personalization remains largely underdeveloped in virtual learning systems. Existing models often use delayed data processing or static personalization methods that cannot respond to learners’ immediate needs. This study bridges a critical gap in the integration of UDL within VLEs: while existing systems often provide static or sequencing-based personalization [9,10], they rarely enable the multimodal transformation of instructional content called for in UDL frameworks [3,6] to address learner variability dynamically (real-time capability is part of the design but was not tested in this offline study). This limitation is evident both in the uniform delivery models exposed by the pandemic and in fragmented adaptive tools that overlook metadata-driven regeneration [11].
Although advancements in AI and educational technologies have led to various personalized learning tools, existing systems remain largely fragmented and limited in scope [12]. Current approaches to personalization often focus on isolated aspects of the learning process, such as recommending content based on previous performance or adapting quiz difficulty based on learner responses [13]. These systems rarely incorporate analysis of the instructional content itself, nor do they deliver instruction across multiple modalities—such as text, audio, and video—in a cohesive and synchronized manner [11]. Moreover, while some adaptive learning platforms integrate AI to improve interactivity, they typically rely on preprocessed data and static learner models that cannot accommodate the dynamic, evolving nature of classroom engagement [14]. Few studies explore integrating semantic communication, natural language processing (NLP), and multimodal machine learning within a unified framework for adaptive instruction. Even fewer address how such systems can deconstruct instructional content (designed here for real-time operation but validated offline), interpret learner preferences, and generate personalized outputs across multiple delivery formats. This gap is well documented in recent systematic mappings of AI-enabled adaptive systems, which predominantly focus on sequencing and difficulty personalization while underutilizing transformational approaches such as multimodal regeneration [15]. For instance, distinctions between tech-driven adaptation (e.g., algorithmically adjusting paths) and learner-driven adaptability highlight the need for systems that fundamentally reshape material to match profiles. Studies on e-learning adaptation by learning style further confirm that while sequencing is common, deeper transformation—such as converting content across modalities—remains underexplored [16].
This study addresses this critical gap by proposing a holistic STREAM framework that unifies these technologies to enable multi-modal content adaptation and by demonstrating it in a narrow proof-of-concept pilot (real-time use is intended, though only offline adaptation was assessed). The goal is to go beyond piecemeal solutions and provide a scalable, end-to-end system. This work makes three contributions. First, it establishes the technical feasibility of content decomposition and format regeneration within latency bounds compatible with potential classroom use, although only under controlled conditions. Second, it is intended to provide a reproducible evaluation framework with specified metrics for recognition accuracy, tagging fidelity, processing efficiency, and output quality. Third, it offers a modular architecture compatible with UDL principles, serving as a foundation for future validation studies. This paper is organized into six main sections. Following the introduction, which presents the purpose, context, and significance of the study, each subsequent section builds toward understanding and implementing the unified STREAM framework for multi-modal content adaptation in virtual classrooms (real-time use is the aspiration; testing here is offline):
  • Section 2: Survey of Enabling Technologies—A comprehensive review of existing AI tools and technologies relevant to content analysis, learner modeling (planned for real-time use but tested offline), and multi-modal delivery. The section highlights current capabilities and identifies the limitations of existing systems, underscoring the need for the proposed STREAM framework.
  • Section 3: STREAM Framework: Concepts, Architecture, Components—This section introduces the conceptual framework that outlines the end-to-end flow of content from instructional source to personalized delivery. It details the system’s key components, including content decomposition, learner profiling, and adaptive multi-modal presentation.
  • Section 4: Feasibility and Early Prototype Design—This section presents the initial implementation strategy, including designing a pilot study using pre-recorded lectures. It outlines the methodological approach to analyzing content and simulating adaptive delivery based on the learners’ preferences.
  • Section 5: Discussion—An analytical discussion of how STREAM addresses existing gaps in the literature. The section examines the theoretical and practical implications of implementing such a system, with a focus on potentially enhancing equity and responsiveness pending validation studies in virtual learning.
  • Section 6: Conclusion and Next Steps—A summary of key contributions, followed by a roadmap for future research, including the development of subsequent papers that will explore specific components of the framework in depth.

2. Survey of Enabling Technologies

The successful implementation of a multimodal adaptive learning system (designed for real-time operation but evaluated offline in this study) relies on integrating several enabling technologies, including NLP, speech recognition, computer vision (CV), learning analytics, and AI-based personalization. This section surveys the current state of these technologies and identifies how they support each layer of the proposed STREAM framework.

2.1. Content Analysis Tools (Real-Time Intended, Tested Offline)

Effective adaptation begins with the system’s ability to analyze and deconstruct instructional content accurately and efficiently; in this study, that analysis was performed offline rather than in real time [17]. Recent advances in NLP enable the extraction of semantic information from large volumes of unstructured educational data [18]. Transformer-based models, such as BERT, T5, and GPT, are widely used to understand context, summarize content, and extract key knowledge points from text-based input, including transcripts, lecture notes, and reading materials [19]. In parallel, speech-to-text systems such as OpenAI’s Whisper, Google Speech-to-Text API, and Amazon Transcribe provide robust solutions for lecture transcription; they support real-time operation, though they were used offline in this study [20]. These tools convert live or recorded audio to editable text, facilitating downstream semantic analysis and knowledge tagging. Integrating prosody analysis (intonation, pauses, stress) can further aid in identifying instructional emphasis and potential engagement cues. Video segmentation and object recognition technologies complete the multimodal picture by enabling the analysis and linking of visual elements, such as slides, gestures, and whiteboard annotations, to verbal content. Tools such as OpenCV, YOLO (You Only Look Once), and the Google Cloud Vision API support detection and tracking of instructional visual content, again envisioned for real-time use but applied offline here [21]. These capabilities are crucial for aligning multimodal input streams and extracting context-aware educational units. Recent frameworks echo this integration, such as adaptive systems using ASR and NLP for personalized education [22] or multimodal deep learning for question answering in specialized domains [23]. STREAM extends these by adding CV for visual deconstruction and traceability, addressing gaps in holistic lesson regeneration.

2.2. Learner Modeling and Preference Detection

To deliver adaptive content, the system must maintain an evolving representation of each learner. The capture of learner attributes described here is preliminary and limited to surveyed tools, with complexities such as inference accuracy and privacy risks acknowledged as hypotheses for validation [24]. The focus remains on UDL-compatible accessibility rather than exhaustive personalization. Cognitive and affective modeling techniques estimate learners’ attention, motivation, and emotional states by analyzing interaction data and biometric sensor data [25]. Affective computing tools, such as Microsoft’s Azure Emotion API or open-source facial emotion recognition libraries, can infer frustration, confusion, or levels of interest [26]. Complementing affective data, eye-tracking and behavioral logging systems such as Tobii and iMotions monitor attention shifts, reading patterns, and potential engagement [27], while keystroke analysis and clickstream data provide behavioral insights during learning tasks [28]. These data streams can potentially support the construction of dynamic learner profiles that evolve over time, though such evolution was not tested in real time here. To personalize instruction, the system often relies on learning style models, such as VARK (Visual, Auditory, Reading/Writing, Kinesthetic) and Felder-Silverman’s learning style dimensions (e.g., sensing/intuitive, active/reflective). Although these models are debated in the literature, they provide a practical foundation for organizing adaptive strategies and customizing content modality based on observed preferences. While aligning with theories like VARK, STREAM prioritizes UDL’s inclusive design over hyper-personalization, as research contrasts UDL’s barrier-free approach with the potential biases of personalized AI [29]. This is intended to support accessibility for all learners, as in studies on AI-UDL integration in virtual classrooms [30].

2.3. Multimodal Delivery Tools

The final stage of the adaptive pipeline requires tools for generating content in different modalities. Text summarization models, such as BART and PEGASUS (Google Research), help condense lecture transcripts into student-friendly summaries [31]. Text-to-speech (TTS) engines, such as Amazon Polly, Google WaveNet, and Microsoft Azure TTS, enable the delivery of spoken versions of content with high naturalness and emotion synthesis [32]. AI content generators such as ChatGPT, Google DeepMind’s Gemini, and Adobe Firefly can create customized learning materials on demand—including explanations, quizzes, and examples—based on the student’s knowledge level and learning history [33]. These tools are intended to improve the system’s responsiveness. Advanced systems may incorporate semantic communication models that optimize message transmission by focusing on meaning rather than raw data. Although still emerging, these models—used in domains such as edge computing and remote communication—offer promising pathways for compressing and adapting instructional content in bandwidth-constrained environments (see Table 1).
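As an illustration of how such summarization tools can be driven programmatically, the following minimal sketch condenses a short transcript excerpt with an off-the-shelf BART checkpoint via the Hugging Face transformers pipeline; the checkpoint name, excerpt, and length limits are illustrative assumptions rather than the configuration used in STREAM.

# Minimal sketch: condensing a lecture-transcript excerpt with an off-the-shelf
# summarizer. Checkpoint name and length limits are illustrative assumptions.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

transcript_excerpt = (
    "From the start, go forward, forward, then right to reach the zoo. "
    "Show me the path. Test it with your car."
)

# max_length/min_length bound the generated summary length (in tokens).
summary = summarizer(transcript_excerpt, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])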

2.4. Existing Adaptive Learning Systems

Several platforms have pioneered adaptive learning, including Khan Academy, Coursera, Duolingo, ALEKS, and Smart Sparrow. These systems provide personalized learning paths tailored to user input and assessment data. For example, Khan Academy recommends exercises based on previous performance and has integrated AI tools such as Khanmigo for real-time feedback, while Coursera uses reinforcement learning to sequence courses and modules [34,35]. Duolingo employs adaptive, gamified algorithms for language learning that adjust difficulty based on user responses, and ALEKS applies knowledge-space theory to personalize mathematics instruction [36]. However, these platforms are primarily based on predefined adaptation rules and static learner models and lack the capacity for multimodal personalization [37]. They often do not incorporate affective data, seamless multimodal delivery, or feedback loops that adapt during live sessions. Moreover, their adaptive strategies are typically limited to text, video, or quiz formats, with minimal support for switching between modalities or dynamically adjusting content based on emotional or behavioral cues in virtual classroom settings [38,39,40]. This gap highlights the need for a more comprehensive, AI-driven framework that integrates semantic understanding, data analysis, and multimodal content generation (see Figure 1).
To further illustrate these limitations and position the proposed STREAM framework, Table 2 provides a detailed comparison with additional contemporary systems, drawing from recent 2025 reviews [41,42]. This table highlights STREAM’s unique emphasis on AI-powered transformation within the instructional flow (designed for real-time use but tested offline), whereas existing systems essentially treat this as static or post hoc. For instance, while Khan Academy excels at creating accessible, performance-driven paths, it does not decompose a live lecture’s speech and visuals on the fly to develop personalized artifacts, such as tagged diagrams. Similarly, Coursera’s sequencing is robust but lacks multimodal regeneration that would enable immediate alignment with learners’ needs. Existing platforms like Khan Academy and ALEKS exemplify the sequencing focus, and systematic reviews reveal a broader gap in transformational capabilities, where adaptation is limited to rule-based paths rather than semantic restructuring for diverse profiles [15]. This underscores the potential of frameworks emphasizing multimodal transformation, as noted in analyses of adaptive vs. adaptable learning [16]. Recent frameworks, such as AI-driven adaptive systems for sustainable education that emphasize sequencing in flipped models [10] and adaptive AI for virtual classrooms that focus on personalized paths [43], rarely enable on-the-fly regeneration of existing content. Similarly, the AIA-PCEK framework integrates AI for feedback but lacks multimodal decomposition, underscoring the novelty of STREAM’s within-lesson transformation with traceability.

2.5. UDL and Metadata-Driven Adaptation in Prior Research

Distinct from the enabling technologies surveyed above, UDL has long provided a framework rooted in inclusive education [3]. In VLEs, UDL emphasizes proactive adaptation by tagging content metadata to support diverse representations and interactions [44]. Seminal works, such as those by [45], explored digital UDL environments, demonstrating how metadata enables adjustments for learners with varying needs. Recent reviews, including a 2025 systematic analysis in the International Journal of Learning Technology and Education Reform [46], highlight the application of UDL in online settings, where tools like semantic tagging improve accessibility but often lack integration with multimodal AI. Similarly, synthesized strategies have been reported for supporting diverse learners in online courses, noting gaps in real-time, metadata-driven systems [47]. STREAM extends this research by unifying AI tools for semantic transformation, ensuring compatibility with UDL’s emphasis on equity and personalization.

3. STREAM Framework: Concepts, Architecture, Components

This section presents the conceptual architecture of the STREAM framework, which represents this paper’s primary contribution to the field of adaptive learning systems. The framework integrates speech recognition, natural language processing, and computer vision capabilities within a modular pipeline designed to decompose instructional content into semantically tagged units that can be regenerated across multiple modalities while preserving traceability to source materials. By articulating this end-to-end architecture with explicit component interfaces and data flows, we provide a foundation for systematic investigation of within-lesson content transformation. The framework encompasses capabilities that extend beyond current implementation, including dynamic learner modeling, affective computing, and real-time streaming adaptation. We present this comprehensive vision to establish a coherent research agenda and enable future work to build upon and validate individual components. Section 4 describes a deliberately scoped feasibility pilot that implements and tests a subset of these components under controlled conditions.

3.1. Conceptual Flow

Figure 2 presents the conceptual flow of the proposed STREAM framework, which encompasses content analysis, multimodal delivery, and dynamic transformation. Real-time operation remains aspirational: these capabilities are designed as potential extensions beyond the offline pilot’s feasibility demonstration, as systematic reviews of AI adaptive systems recommend starting with controlled baselines to validate pipelines before scaling to live classrooms with noise or bandwidth constraints [48].

3.1.1. Source Side

The source side of STREAM represents the origin of instructional input within a virtual or hybrid learning ecosystem. It encompasses live teaching sessions, pre-recorded lectures, multimedia instructional materials, and other educational content delivered by instructors or digital platforms. In current virtual learning models, such content is often static and generalized, offering the same instructional experience to all students regardless of their individual needs or learning preferences. This study begins by recognizing the diversity of instructional materials as a rich source of adaptable content that, if properly analyzed and deconstructed, can serve as the foundation for personalized learning experiences. At the heart of this stage is the human or AI-assisted teacher, who generates the educational narrative through verbal explanations, slide presentations, and embedded media such as images or video clips. When captured or recorded, these materials contain key semantic and structural components; real-time capture is intended but not necessary for downstream content analysis. In a traditional setting, the teacher controls the pace, emphasis, and interaction; however, in virtual settings, these features are not always effectively captured or utilized to support different learners [49]. STREAM treats teacher-delivered content not only as instructional input but also as a data-rich source for feature extraction and analysis (planned for real-time use, performed offline here). Dynamic content is hypothesized to enhance interaction for diverse learners, pending validation studies under UDL. Still, research shows mixed efficacy compared to static designs, where pedagogical soundness—defined as alignment with cognitive principles and prerequisite scaffolding—depends on context rather than format alone [50,51]. Static content can be equally effective when well designed, and STREAM positions dynamic regeneration as a complementary tool rather than a replacement.
In addition, recorded lectures provide opportunities for asynchronous interaction, which can be particularly beneficial for students who require flexible access or repeated exposure to instructional material. However, without adaptive elements, these recordings remain passive and non-responsive [52]. This framework re-imagines recorded lectures as analyzable entities whose content can be broken down using AI-powered tools such as speech-to-text engines, video segmentation algorithms, and semantic parsers. These tools enable the identification of key learning units, transitions, tone changes, and pedagogical markers that may otherwise be lost in static video playback [53]. The final component on the source side is the broader category of educational content, which includes digital textbooks, interactive simulations, annotated slides, quizzes, and multimedia supplements [54]. These materials often exist in isolated formats, lacking interoperability or cohesive integration with adaptive learning systems [55]. The STREAM framework addresses this challenge by treating all educational content as modular inputs that can be semantically indexed and transformed to meet the learner’s needs. This modular view is designed to enable the system to extract knowledge points, generate metadata, and feed content into the next stage—analysis and learner-centered adaptation—laying the groundwork for truly personalized and multi-modal virtual classrooms. Regeneration with traceability builds on emerging multimodal approaches, such as transcription for classroom enhancement via ASR and NLP (real-time in principle, offline-tested here), ensuring that STREAM’s units maintain pedagogical links for equitable adaptation across profiles. While ML inference for categories like cognitive demand offers promise, complexities include potential biases in training data and ethical concerns in pedagogical automation, as noted in reviews of AI ethics in education [56,57]. These are testable hypotheses, and the framework’s modularity allows domain-specific fine-tuning to mitigate issues such as over-reliance on delivery rather than deeper pedagogy.

3.1.2. Middle Layer

The middle layer is conceptualized as the core processing stage, but in the pilot (Section 4) it is limited to offline speech recognition and NLP for text-based decomposition, excluding the planned CV for visual analysis and the prosody and emotional cues intended for real-time use. The middle layer of STREAM serves as the intelligent processing core, transforming static instructional materials into dynamically adaptable components. This stage leverages advanced AI techniques, including NLP, speech recognition, and computer vision (CV), to analyze and decompose educational content across multiple modalities, with real-time operation as the design goal [58]. This layer aims to extract meaningful pedagogical elements from text, voice, and image data, enabling the system to understand the structure, semantics, and instructional intent behind each piece of content. In the text domain, content analysis involves parsing lecture transcripts (offline in this pilot), digital documents, and on-screen annotations to identify key knowledge points, learning objectives, and pedagogical cues [59]. Using transformer-based language models such as Bidirectional Encoder Representations from Transformers (BERT), Text-to-Text Transfer Transformer (T5), or GPT, the system can recognize conceptual hierarchies, thematic transitions, and emphasis markers (e.g., definitions, examples, or summaries) [60]. This process enables the classification of content by difficulty, relevance, and instructional purpose, supporting the generation of personalized instruction and scaffolded learning sequences tailored to different types of learners [61].
In voice mode, speech-to-text technologies such as OpenAI’s Whisper or Google’s Speech Application Programming Interface (API) are used to transcribe live or recorded lectures. However, real-time transcription was not tested [62]. Beyond transcription, STREAM also analyzes speech prosody, including tone, pauses, and emphasis, to identify points of instructional stress or learner confusion. For example, a teacher’s repeated phrasing or slower articulation may indicate important concepts that personalized outputs should highlight or reinforce. These insights improve the system’s ability to prioritize content and respond adaptively based on perceived educational importance [63].
Meanwhile, image and video decomposition are critical in capturing non-verbal instructional cues [64]. Visual content, such as lecture slides, diagrams, screen annotations, and gestures, can be analyzed using CV techniques, including object detection, optical character recognition (OCR), and image segmentation [65]. These tools allow the system to isolate figures, extract textual elements from visuals, and map visual themes to accompanying spoken or written explanations [66]. This multi-modal fusion enriches the content representation and ensures that learners with varying modality preferences (e.g., visual vs. auditory learners) receive content in forms that align with their strengths. Overall, this middle layer processes and decodes complex instructional input, reorganizing it into a structured, machine-interpretable format. Doing so creates a semantic bridge between teacher-driven content and the learner-facing adaptive delivery mechanisms. This stage is crucial for enabling responsiveness and personalization, envisioned for real-time but tested offline, ensuring that every element passed on to the learner is relevant, meaningful, and optimally aligned with their unique learning profile.
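As an illustration of the planned visual pass, the sketch below extracts on-slide text and word-level bounding boxes from a single sampled frame so they could later be aligned with co-occurring speech; the choice of OpenCV plus pytesseract, the file name, and the confidence threshold are illustrative assumptions, not the pilot’s implementation.

# Minimal sketch of a planned CV/OCR pass: recover on-slide text and its
# bounding boxes from one video frame as visual evidence for later alignment.
import cv2
import pytesseract

frame = cv2.imread("slide_frame.png")               # one sampled video frame (placeholder path)
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)      # OCR is more reliable on grayscale
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# image_to_data returns per-word boxes and confidences, storable alongside ASR timecodes.
data = pytesseract.image_to_data(binary, output_type=pytesseract.Output.DICT)
for text, conf, x, y, w, h in zip(
    data["text"], data["conf"], data["left"], data["top"], data["width"], data["height"]
):
    if text.strip() and float(conf) > 60:           # keep confident, non-empty words
        print(f"{text!r} at ({x},{y},{w},{h}) conf={conf}")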

3.1.3. Receiver Side

The receiver side envisions dynamic learner-centered delivery, but the pilot (Section 4) uses a static, predefined visual-learner profile without updates; real-time operation, affective sensing, and adaptive feedback loops were not evaluated. The receiver side of the STREAM framework is focused on the learner, the ultimate beneficiary of adaptive content transformation. This layer involves developing a dynamic student model that captures individual learning preferences, cognitive styles, affective states, and engagement patterns [67]. STREAM aims to foster a highly personalized educational experience that could potentially enhance comprehension and retention, pending validation among diverse student populations, by aligning instructional delivery with these learner-specific attributes. At the core of this process is the student model, a profile designed to evolve based on continuous interaction data (updated in real time in the full vision, offline in the pilot), including content interactions, response accuracy, and modality preferences. This model includes static attributes (e.g., preferred learning style, language proficiency, accessibility needs) and dynamic attributes (e.g., emotional state, attention span, pace of progress). The system draws on input from behavioral logging, biometric sensors (if available), embedded assessments, and user interactions to accurately assess these dimensions. Integrating affective computing and machine learning algorithms further refines the student model, enabling a more nuanced understanding of learner variability.
Once a complete profile of the learner is established, the system proceeds to personalized delivery, tailoring the instructional content to the learner’s format, pace, and complexity [68]. For instance, a visual learner might receive an infographic instead of a textual explanation, while an auditory learner might receive the same content as a narrated summary [69]. Similarly, students who struggle with certain concepts may be offered simplified definitions, scaffolded tasks, or additional examples to support their understanding [70]. The metadata and knowledge components extracted in the middle layer enable this content adaptation, allowing the system to match the instructional intent with the learner’s needs. Importantly, the system’s multi-modal capabilities ensure that content can be delivered across various channels—text, speech, video, or even interactive simulations—depending on the learner’s preferences and context. This flexibility is critical for inclusive design, particularly for students with disabilities, language barriers, or non-traditional learning trajectories. Additionally, personalized feedback and progress tracking are integrated into the delivery system, enabling learners to monitor their progress and engage in self-regulated learning behaviors. In sum, the receiver side operationalizes the pedagogical goal of personalization by dynamically mapping processed content to individualized learner profiles. It represents a significant departure from conventional e-learning systems that treat all learners uniformly, offering a scalable, AI-powered approach to responsive, multi-modal, and learner-centered education instead.

3.2. Key Components

The following subsections describe the key conceptual components. As detailed in Section 4, the pilot implements only knowledge point extraction and metadata generation via ASR and NLP, excluding learner profiling and full adaptive generation beyond predefined outputs.

3.2.1. Knowledge Point Extraction

The extraction of knowledge points is a foundational component of the STREAM framework, serving as a mechanism for identifying and isolating the core instructional concepts from various educational inputs [71]. These “knowledge points” refer to discrete units of learning—such as definitions, formulas, ideas, skills, or relationships—that represent the fundamental building blocks of academic content [72]. In traditional instruction, teachers emphasize these points through verbal cues, textual highlights, or visual annotations [73]. However, in a digital learning context, especially one mediated by AI, identifying and extracting these points must be automated and contextually aware [74]. This extraction process begins with the semantic analysis of multimodal input, including text from lecture transcripts, spoken explanations, annotated slides, and visual materials. NLP techniques are used to detect linguistic structures commonly associated with instructional emphasis, such as transitional phrases (e.g., “the key takeaway is…”), definitions (“X is defined as…”), or cause-and-effect structures. Machine learning models trained on educational datasets can be employed to classify sentences or segments into instructional categories (e.g., explanation, elaboration, example, conclusion), making it easier to pinpoint high-priority knowledge units.
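A minimal sketch of such rule-assisted cue detection is shown below, flagging definition-like sentences and verb-initial imperatives with spaCy; the pattern list and label names are illustrative assumptions rather than the pilot’s full label schema.

# Minimal sketch of rule-assisted cue detection for knowledge point extraction:
# flag definition-like sentences and verb-initial imperatives (prompts).
import re
import spacy

nlp = spacy.load("en_core_web_sm")

DEFINITION_PATTERNS = [r"\bis defined as\b", r"\bmeans\b", r"\brefers to\b"]

def classify_sentence(sent) -> list[str]:
    labels = []
    text = sent.text.lower()
    if any(re.search(p, text) for p in DEFINITION_PATTERNS):
        labels.append("definition")
    first = next((t for t in sent if not t.is_punct and not t.is_space), None)
    if first is not None and first.pos_ == "VERB" and first.tag_ == "VB":
        labels.append("prompt")                      # verb-initial imperative
    return labels or ["other"]

doc = nlp("A loop is defined as a repeated sequence of steps. Show me the path.")
for sent in doc.sents:
    print(sent.text, "->", classify_sentence(sent))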
In addition to semantic cues, structural features—such as font size, bold text, bullet points, and slide titles—are leveraged through CV and OCR tools when processing visual content [75]. These features provide additional layers of meaning that help distinguish primary concepts from supporting details. The system may also use domain-specific ontologies or concept maps to cluster related knowledge points and establish hierarchical relationships, thereby enhancing coherence and instructional flow during adaptive delivery. Once extracted, these knowledge points are stored as modular data objects, each tagged with metadata indicating its source, difficulty level, instructional purpose, and preferred modality. This structure enables the system to reassemble and repackage content flexibly to suit different learners’ profiles. For example, a single concept may be transformed into a concise text summary for a proficient reader, a narrated animation for an auditory learner, or a series of scaffolded instructions for a student requiring additional support [76]. Ultimately, knowledge point extraction transforms passive instructional content into active, machine-readable units that can drive intelligent personalization. It also enhances the reusability and modularity of the content, ensuring that personalization remains pedagogically meaningful rather than superficial. This process could potentially support diverse learners through multimodal interaction, pending validation.

3.2.2. Metadata Generation

Metadata generation is crucial for bridging raw instructional content with intelligent personalization by enriching extracted knowledge points with descriptive and contextual information [1]. In this context, metadata refers to structured data that describes the attributes, purpose, and pedagogical context of a given content unit—such as its learning objective, difficulty level, compatibility of modes, emotional tone, and cognitive demand [77]. This abstraction layer is intended to provide meaningful adaptation tailored to individual learner needs. The process begins immediately after knowledge points are extracted, with each content segment tagged with pedagogical metadata. This includes categorical information such as the content type (e.g., definition, explanation, example), cognitive level based on Bloom’s Taxonomy (e.g., remember, apply, analyze), and alignment with curriculum standards or learning outcomes [78]. AI models trained on annotated educational corpora can automatically infer these labels by analyzing syntax, semantics, and contextual cues [79]. Additionally, heuristics or predefined instructional design rubrics generate instructional metadata, such as estimated learning time, required prerequisites, and ideal delivery modality (e.g., text, video, animation) [80].
In addition, affective and potential engagement-related metadata can also be attached to content, particularly when analyzing speech or video input [81]. For example, speech prosody analysis might detect moments of excitement or emphasis in a teacher’s voice, indicating emotionally charged or critical instructional moments [82]. Depending on a student’s emotional profile or potential engagement patterns, these markers can be flagged for later emphasis or moderation [83]. Similarly, visual cues such as teacher gestures, slide animations, or on-screen annotations can be encoded as attention indicators, guiding the system on where to focus adaptive feedback [84]. All generated metadata is stored in a content repository as part of a modular object structure, allowing content to be customized to the learner’s needs (designed with real-time use in mind but tested offline) [85]. For example, suppose a student struggles with a particular concept. In that case, the system can use metadata tags to locate a simpler explanation, visual reinforcement, or a real-world example from the content library. In contrast, the same metadata can be used to provide advanced learners with enriched, higher-order content. In essence, metadata generation transforms educational content from static assets into flexible, searchable, and pedagogically aware units, making personalization scalable and pedagogically sound [86]. It enables the adaptive system to reason about both content and the learner in a structured manner, ensuring that delivery is not only customized but also contextually and cognitively appropriate.
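To make the modular object structure concrete, the sketch below defines a metadata-enriched knowledge unit with the fields discussed above (timecodes, instructional type, Bloom level, difficulty, prerequisites, preferred modality); field names and values are hypothetical placeholders, not a normative schema.

# Minimal sketch of a metadata-enriched knowledge object; values are placeholders.
import json
from dataclasses import dataclass, asdict, field

@dataclass
class KnowledgeUnit:
    unit_id: str
    content_type: str            # e.g., "definition", "step_sequence", "prompt"
    text: str
    start: str                   # source timecode (HH:MM:SS.ff)
    end: str
    bloom_level: str             # e.g., "remember", "apply", "analyze"
    difficulty: str              # e.g., "introductory"
    prerequisites: list[str] = field(default_factory=list)
    preferred_modalities: list[str] = field(default_factory=list)

unit = KnowledgeUnit(
    unit_id="S1-steps",
    content_type="step_sequence",
    text="forward, forward, right",
    start="00:02:10.00",
    end="00:02:15.00",
    bloom_level="apply",
    difficulty="introductory",
    prerequisites=["forward arrow"],
    preferred_modalities=["visual"],
)
print(json.dumps(asdict(unit), indent=2))   # serializable for the content repository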

3.2.3. Learner Profiling

Learner profiling is the adaptive engine’s intelligence layer that personalizes instruction based on each student’s unique characteristics, preferences, and learning behaviors [87]. This component constructs a dynamic, data-driven profile that captures the learner’s cognitive, behavioral, emotional, and contextual attributes [88]. The profile is continuously updated as the student interacts with the system (a real-time process in the full design, though only offline operation was tested), allowing content adjustments that encourage engagement and that could support comprehension and long-term retention, pending validation [1]. At its foundation, the learner profile begins with static attributes, such as age, language proficiency, previous academic performance, and declared learning preferences (e.g., visual, auditory, kinesthetic) [89]. These are often collected through initial diagnostic surveys or onboarding modules. While helpful in forming a baseline, static profiles alone are insufficient for adaptive learning at scale. Therefore, the system also generates and updates dynamic attributes that evolve in response to the learner’s interactions. These include time-on-task, response latency, precision, preferred mode of interaction, emotional state (if detected by affective computing), and feedback response patterns [90]. To model these attributes, the system uses machine learning algorithms that interpret clickstream data, biometric feedback (e.g., eye movements and facial expressions, if available), quiz performance, and content interaction logs. These inputs feed into a continuously evolving learner model that predicts not only what the learner knows, but also how they learn best and how their needs may change over time or in different contexts. For instance, if a student frequently pauses during video lectures but excels with interactive graphics, the model adapts to prioritize visual learning aids in future sessions.
In advanced implementations, affective and motivational states are also captured using sensors or AI-based emotion recognition tools (intended for real-time use but not tested as such) [91]. For example, if a student shows signs of frustration or disengagement, the system may intervene with encouraging messages, alternate modalities, or content simplification to maintain motivation [92]. Over time, these interventions are tracked and used to refine the learner profile further, forming a feedback loop between the learner’s emotional journey and instructional adaptation. The ultimate function of a learner profile is to inform content selection, sequencing, and delivery style, matching instructional components—tagged with rich metadata—to learners in a way that optimizes cognitive alignment and emotional resonance. It ensures that each learner receives a pathway through content that is not only academically appropriate but also motivationally and contextually supportive. In sum, learner profiling is planned to transform the adaptive system from a reactive tool into a proactive, anticipatory tutor, capable of delivering learner-centered instruction that evolves with each student’s growth [93]. This component is central to achieving STREAM’s vision of inclusive, personalized virtual classrooms that recognize and respond to diverse student needs.
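As a concrete illustration of a profile combining static and dynamic attributes, the sketch below nudges a modality-preference estimate in response to interaction events; the attribute names and update rule are illustrative assumptions, not a validated learner model.

# Minimal sketch of a learner profile with static/dynamic attributes and a simple
# event-driven update; the weighting constant is an assumption for illustration.
from dataclasses import dataclass, field

@dataclass
class LearnerProfile:
    learner_id: str
    # static attributes (collected at onboarding)
    language: str = "en"
    declared_style: str = "visual"          # e.g., VARK self-report
    accessibility_needs: list[str] = field(default_factory=list)
    # dynamic attributes (updated from interaction logs)
    modality_preference: dict[str, float] = field(
        default_factory=lambda: {"text": 0.33, "audio": 0.33, "visual": 0.34}
    )

    def update_from_event(self, modality: str, engaged: bool, rate: float = 0.1) -> None:
        """Nudge the modality preference toward or away from an observed interaction."""
        delta = rate if engaged else -rate
        self.modality_preference[modality] = max(
            0.0, self.modality_preference[modality] + delta
        )
        total = sum(self.modality_preference.values()) or 1.0
        for key in self.modality_preference:            # renormalize to sum to 1
            self.modality_preference[key] /= total

profile = LearnerProfile(learner_id="stu-001")
profile.update_from_event("visual", engaged=True)
print(profile.modality_preference)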

3.2.4. Adaptive Content Generation

Adaptive content delivery is the culminating component of STREAM, where analyzed instructional material, enriched with metadata and guided by learner profiles, is transformed into a personalized learning experience tailored to each student’s preferences, needs, and cognitive profile [94]. This stage operationalizes the promise of AI-driven instruction by dynamically modifying what content is delivered, how it is presented, and when it is introduced (with real-time aspirations but offline testing), thereby aligning educational experiences with learners’ pathways. At its core, adaptive delivery uses the knowledge points and metadata extracted in previous layers and matches them to the student’s evolving learner profile [95]. Based on this match, the system determines the most effective modality and format for each learning unit. For example, a student who demonstrates strong verbal comprehension but lower visual processing speed might receive a narrated explanation paired with simple, high-contrast visuals rather than dense diagrams [96]. In contrast, students with high visual-spatial intelligence might be offered interactive infographics or simulations instead of textual content [97]. This personalization is powered by decision algorithms that select from a repository of modular instructional assets, each tagged with pedagogical, cognitive, and emotional metadata [98] (intended for real-time use but tested offline). These assets are assembled into adaptive learning sequences, where the content is adjusted in form, complexity, pacing, sequencing, and scaffolding to meet learners’ individual needs. If a student struggles with a particular topic, the system might interject with simpler explanations, offer immediate practice opportunities, or revisit prerequisite knowledge before progressing. If students excel, the system can skip redundant material and introduce more challenging, enrichment-oriented tasks.
Importantly, adaptive delivery is multimodal by design, supporting the integration of text, video, audio, animations, haptic interactions, and gamified elements—whichever modes best align with the learner’s context [11]. This multimodality is crucial for inclusivity, as it could support students with sensory impairments, learning disabilities, or diverse cultural backgrounds. It also allows the system to adapt flexibly to the environment, such as switching from visual to auditory delivery in mobile or low-bandwidth settings. Another essential feature of this component is continuous feedback and monitoring (planned for real-time operation but tested offline here). As students engage with personalized content, the system dynamically tracks potential interaction metrics and performance indicators to adjust future content. This creates a looped system in which delivery is informed by performance and performance informs delivery, making learning experiences not only personalized but also responsive. Ultimately, adaptive content delivery actualizes STREAM’s vision of a virtual classroom centered on the learner, where each student receives just-in-time instruction tailored to their learning style, pace, and emotional readiness. It closes the loop of the STREAM framework by transforming abstract data into concrete, impactful educational experiences, positioning learners not as passive recipients of information but as active participants in a personalized and adaptive learning journey (see Table 3).
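The decision step described above can be illustrated with a simple rule-based selector that matches a unit’s available modalities to the learner’s strongest preference, with a low-bandwidth fallback; the fallback order and defaults are assumptions for illustration only.

# Minimal sketch of rule-based modality selection for adaptive delivery.
def select_modality(unit_modalities: list[str],
                    preference: dict[str, float],
                    low_bandwidth: bool = False) -> str:
    # restrict to modalities this unit can actually be rendered in
    candidates = {m: preference.get(m, 0.0) for m in unit_modalities}
    if not candidates:
        return "text"                          # safe default when nothing matches
    if low_bandwidth and "audio" in candidates:
        return "audio"                         # cheap-to-stream fallback
    return max(candidates, key=candidates.get) # learner's strongest available modality

prefs = {"text": 0.2, "audio": 0.3, "visual": 0.5}
print(select_modality(["text", "visual"], prefs))                     # -> "visual"
print(select_modality(["text", "audio"], prefs, low_bandwidth=True))  # -> "audio"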

3.3. Modularity

A key strength of the proposed adaptive learning STREAM framework lies in its modular design, which allows each component—source input, content analysis, learner modeling, and adaptive delivery—to function as a standalone system while contributing to the holistic objective of personalized instruction, envisioned for real-time [95]. The source side of STREAM, including teacher-led instruction, recorded lectures, and digital content, offers opportunities to study how various modalities of instructional delivery (e.g., live vs. recorded, structured vs. informal) affect the fidelity and richness of the content available for analysis, envisioned for real-time but examined offline [99]. The middle layer of content analysis and decomposition contains subtopics that warrant individual investigations [100], designed for real-time operation but validated offline. One possible study could focus exclusively on text-based decomposition using transformer models, while another might explore voice-based prosody analysis to detect instructional emphasis. A third option could investigate CV techniques to extract semantic elements from visual content, such as lecture slides or whiteboards.
The student modeling component opens the door to in-depth research on learner analytics, including affective computing, cognitive state prediction, and behavioral profiling, intended for real-time but not tested as such [25]. Each of these subdomains could serve as a standalone empirical study, especially when combined with methods such as eye-tracking, interaction logging, or sentiment analysis during instructional sessions. Ultimately, adaptive content delivery encompasses a broad and impactful research area, including personalized modality selection, learning path generation, and multimodal feedback systems [101]. Researchers could conduct experiments comparing learners’ results across different adaptive strategies or studying the potential effectiveness of modality switching, pending validation based on emotional or cognitive indicators. By framing the STREAM framework as a set of discrete, interlinked modules, this research agenda aims to enable iterative development, targeted validation, and cross-disciplinary collaboration. Each module is a functional building block of the overall system and a fertile ground for scholarly inquiry capable of generating its literature, tools, and pedagogical implications. This modular structure positions STREAM as a scalable blueprint for applied development and theoretical advancement in AI-driven, learner-centered education.

4. Feasibility and Early Prototype Design

To evaluate the fundamental viability of the STREAM framework’s core decomposition and adaptation mechanisms, we conducted a preliminary feasibility pilot with deliberately narrow scope. This pilot processes a single five-minute pre-recorded elementary STEM lesson offline under clean audio and visual conditions, using a predefined visual learner profile to test the basic content-to-adaptation pathway. The study intentionally excludes numerous framework components described in Section 3, including real-time streaming, dynamic learner modeling, affective computing, behavior-driven adaptation, and computer vision beyond basic optical character recognition and simple shape detection. Success is defined not by educational effectiveness or classroom viability, which remain untested, but by whether the implemented pipeline can execute content decomposition, semantic tagging, and format regeneration on commodity hardware within latency bounds that could theoretically support classroom use. This scoped approach allows us to establish baseline technical feasibility and measurement methodology before addressing the substantially more complex challenges of authentic deployment, diverse content domains, multiple learner profiles, and learning outcome validation that future studies must tackle.
This pilot (1) extracts instructional elements—definitions, step sequences, prompts, and contextual entities—from a short, multimodal lesson by combining time-aligned ASR and transformer tagging under a compact label schema to support reliable validation, (2) transforms those elements into a visual-first representation for a single predefined learner profile (visual/VARK) using rule-based mappings (e.g., arrow-sequence → step-numbered diagram; prompt → two-panel “Plan → Test” guide; entity → pictogram with OCR-derived captions), and (3) accomplishes both with practical latency on commodity hardware (single-Graphics Processing Unit (GPU) workstation) so the full pipeline operates at classroom-scale speeds. Success is operationalized by measurable criteria specified in the following section, including ASR quality, tagging fidelity, end-to-end time, and output readability/traceability. These components are aligned with Table 3. This feasibility pilot exercises the four components as follows:
(i) 
Knowledge Point Extraction—time-aligned ASR and transformer tagging produce machine-interpretable objects for definitions, step sequences, prompts, and entities;
(ii) 
Metadata Generation—each object carries provenance and pedagogical fields (ID, timecodes, Bloom level, difficulty, prerequisites, visual references) to preserve auditability;
(iii) 
Adaptive Content Delivery—rule-based mappings regenerate visual-first artifacts (arrow-sequence → numbered path diagram; prompt → two-panel “Plan → Test” guide; entity → pictogram with OCR caption);
(iv) 
Learner Profiling—intentionally out of scope in this pilot (fixed visual profile) to isolate feasibility of the content–to–adaptation loop.
Metric-to-component keys: ASR word error rate (WER) and tagging κ (Extraction); provenance/readability checks (Metadata); end-to-end latency/resources and artifact quality (Delivery).
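As a sketch of component (iii), the rule-based regeneration step can be expressed as a mapping from tagged items to artifact specifications that retain a provenance link to the source sentence; the field names and artifact descriptors below are illustrative assumptions, and rendering to actual images is omitted.

# Minimal sketch of the rule-based regeneration step: tagged items -> artifact specs.
def regenerate(tag_item: dict) -> dict:
    kind = tag_item["type"]
    if kind == "step":                         # arrow sequence -> numbered path diagram
        return {
            "artifact": "path_diagram",
            "steps": list(enumerate(tag_item["canonical_steps"], start=1)),
            "source": tag_item["sentence_id"],  # provenance back-link
        }
    if kind == "prompt":                       # imperative -> two-panel Plan -> Test guide
        return {
            "artifact": "plan_test_guide",
            "panels": ["Plan", "Test"],
            "caption": tag_item["text"],
            "source": tag_item["sentence_id"],
        }
    if kind == "entity":                       # entity -> pictogram with OCR caption
        return {
            "artifact": "pictogram",
            "label": tag_item["text"],
            "caption": tag_item.get("ocr_caption"),
            "source": tag_item["sentence_id"],
        }
    return {"artifact": "unchanged", "source": tag_item["sentence_id"]}

print(regenerate({"type": "step", "sentence_id": "S1",
                  "canonical_steps": ["forward", "forward", "right"]}))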

4.1. Pilot Implementation of Source Side: Pre-Recorded Lecture "Pree"

The pilot draws on a set of pre-recorded, coding-focused mini-lessons delivered by an early-elementary STEM facilitator (“Pree”). Each lesson is designed to approximate a classroom exchange: the facilitator narrates brief explanations, elicits short peer responses, and uses simple, high-contrast visuals—such as arrow cards, toy vehicles, and destination boards—to keep linguistic cues and visual symbols co-present within short, well-bounded scenes. For this study, we focus on a five-minute segment on elementary path planning that extends a previously taught repertoire by introducing a right-turn arrow alongside an existing forward arrow. In this clip, learners compose and test action sequences to reach named destinations (e.g., a farm, a zoo, or a shopping center). The combination of clear narration and distinct on-screen symbols produces repeated, easily identifiable primitives, such as tokens (forward and right), destination labels, and arrow glyphs, yielding stable anchor points for temporal alignment and provenance tracking (See Figure 3).
The selection of this segment was intentional. Pedagogically, the interaction follows a concise, widely used pattern—concept introduction, guided practice, and a brief collaborative activity—creating natural boundaries for segmentation, time-aligned tagging, and later audit. Multimodally, the co-occurrence of speech, symbolic arrows, and labeled boards enables cross-modal validation: automatic speech recognition (ASR) captures verbal prompts and step language; CV and OCR recover arrow orientation and destination text; and a semantic tagger organizes the discourse into definitions, steps, prompts, and entities. Signal quality is deliberately clean: the audio track is uncomplicated, and the visual symbols are high-contrast and visually distinct, which improves ASR robustness and simplifies glyph/label detection for a lightweight vision pass while still supporting the construction of traceable knowledge objects. Finally, the compact five-minute duration satisfies latency and resource constraints on commodity hardware without sacrificing representativeness: the clip includes multiple destinations and sequence variations, ensuring sufficient complexity for an end-to-end feasibility test.
Operationally, the clip provides three classes of inputs directly consumable by the pipeline. First, it gives step sequences (e.g., forward, forward, right) that can be parsed, canonicalized, and later rendered as numbered path diagrams. Second, it contains instructional prompts (e.g., “show me the path,” “test it with your car”) that the imperative detector can identify and map to two-panel “Plan → Test” guides. Third, it includes contextual entities (such as vehicles and destinations) that can be linked to pictograms and OCR-derived captions. Together, these properties make the source stream both analytically tractable and pedagogically meaningful, providing a controlled yet authentic substrate for evaluating the feasibility of the proposed content-to-adaptation loop. The simulated elementary STEM level uses simple procedural tasks (e.g., moving a small car across a route) to isolate adaptation mechanics, as is standard in feasibility pilots for adaptive systems where basic activities assess pipeline viability before probing deeper learning [102,103]. These tasks represent instructional following rather than advanced cognition, providing a controlled substrate for traceability evaluation, with true learning outcomes hypothesized for future studies [104]. While pedagogically meaningful at an elementary level for basic STEM instruction, this task prioritizes analytical tractability for feasibility testing of adaptations, rather than comprehensive learning assessment, aligning with pilot guidelines in edtech that use simple substrates to build toward authentic evaluations [105].
Example 1:
Illustrative transcript for three input classes.
ASR-normalized excerpt showing step sequences, instructional prompts, and contextual entities.
S1 [00:02:10–00:02:15]: From the start, [step]forward, forward, right[/step] to reach the [entity]zoo[/entity].
S2 [00:02:15–00:02:17]: [prompt]Show me the path[/prompt].
S3 [00:02:17–00:02:19]: [prompt]Test it[/prompt] with your [entity]car[/entity].
Interpretation: [step] canonical arrow sequence used for numbered path diagrams; [prompt] imperative mapped to a two-panel Plan → Test guide; [entity] destination/vehicle linked to a pictogram and (when available) OCR caption. This excerpt aligns with the pilot’s schema and renderer: step sequences (e.g., forward, forward, right) render as path diagrams; imperatives (e.g., “show me the path,” “test it…”) render as Plan → Test guides; entities (e.g., “zoo,” “car”) materialize as pictograms with OCR-derived captions.

4.2. Middle Layer: Content Component Extraction

In the middle layer, the lesson is decomposed into machine-readable instructional units through a three-stage pipeline—ASR → NLP → planned CV—that yields timestamped, semantically tagged spans traceable to specific audiovisual evidence. Each stage records confidence scores and timing metadata, allowing downstream adaptation to prioritize high-certainty items and, where necessary, defer to rapid human confirmation. This design preserves end-to-end auditability while supporting practical latency on commodity hardware.
Speech-to-text. We employ Whisper with overlapping 20 s windows (5 s stride) and energy-based voice activity detection (VAD) to stabilize timestamps during rapid turn-taking. The decoder returns token- and utterance-level times; punctuation and casing are restored, and a light normalization pass removes fillers (e.g., um/uh/like), standardizes numerals (e.g., “three rights” → “3 right”), and harmonizes measurement phrases. Speaker turns are preserved at utterance boundaries to retain emphasis timing (for example, slower repetitions during concept introduction). For quality control, the ASR stage emits per-segment confidence scores and a compression-ratio flag to surface likely garbles for targeted manual spot-checks in the pilot.
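As an illustration of the normalization pass, the following minimal Python sketch removes fillers and standardizes simple numeral phrases; the filler list, number map, and function name are illustrative placeholders rather than the pilot’s exact implementation.
import re

FILLERS = {"um", "uh", "like"}  # illustrative filler list
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def normalize_utterance(text: str) -> str:
    """Remove fillers and standardize simple numeral phrases (e.g., 'three rights' -> '3 right')."""
    tokens = [t for t in text.split() if t.lower().strip(",.") not in FILLERS]
    text = " ".join(tokens)
    def repl(match: re.Match) -> str:
        return f"{NUMBER_WORDS[match.group(1).lower()]} {match.group(2).lower()}"
    # A number word followed by a (possibly plural) move token, e.g., 'three rights' -> '3 right'
    return re.sub(r"\b(one|two|three|four|five)\s+(forward|right|left|back)s?\b",
                  repl, text, flags=re.IGNORECASE)

print(normalize_utterance("Um, take three rights, uh, then go forward."))
# -> 'take 3 right, then go forward.'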
Semantic tagging. We illustrate sentence segmentation, multi-label prediction, rule-assisted imperatives, and arrow-sequence canonicalization on a short ASR-normalized snippet. The final tag items export per-label probabilities and evidence spans (0-indexed, end-exclusive), enabling downstream audit- and confidence-aware filtering.
Example 2:
ASR transcript.
Normalized; three sentences:
S1 [00:02:10.00-00:02:15.00]:
  ″From the start, go forward, forward, then right to reach the zoo.″
S2 [00:02:15.10-00:02:16.80]:
  ″Show me the path.″
S3 [00:02:17.00-00:02:18.50]:
  ″Test it with your car.″
 
Sentence segmentation (spaCy)
[ S1 | S2 | S3 ]
 
Multi-label classifier outputs (BERT + sigmoid)
# probs shown only for labels ≥ 0.05
S1:
 knowledge_point:  0.88
 entity:       0.11   (token "zoo")
 example:       0.07
S2:
 prompt:       0.95
S3:
 prompt:       0.93
 
Rule-assisted passes
Imperative detector:
 S2 → prompt=True  (verb-initial "Show")
 S3 → prompt=True  (verb-initial "Test")
Arrow-sequence parser (token collapse):
 S1 evidence span chars [16,47) = "go forward, forward, then right"
 canonical_steps: ["forward","forward","right"]
 
Exported tag items (JSON Lines)
{"sentence_id":"S1",
"time":{"start":"00:02:10.00","end":"00:02:15.00"},
"text":"From the start, go forward, forward, then right to reach the zoo.",
"labels":{"knowledge_point":0.88,"entity":0.11,"example":0.07},
"evidence_spans":{"steps":[16,47]},
"canonical_steps":["forward","forward","right"]}
 
{"sentence_id":"S2",
"time":{"start":"00:02:15.10","end":"00:02:16.80"},
"text":"Show me the path.",
"labels":{"prompt":0.95},
"evidence_spans":{"prompt":[66,83]}}
 
{"sentence_id":"S3",
"time":{"start":"00:02:17.00","end":"00:02:18.50"},
"text":"Test it with your car.",
"labels":{"prompt":0.93},
"evidence_spans":{"prompt":[84,106]}}
Interpretation: With a default threshold of p ≥ 0.70, S1 yields a knowledge_point with a canonicalized step string, and S2–S3 yield prompt items. These flow into packaging as knowledge objects with IDs/time-codes and then render for visual learners as a numbered path diagram (from canonical_steps) and a two-panel Plan → Test guide (from prompt).
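The two rule-assisted passes can be sketched compactly in Python; the sketch below assumes spaCy’s en_core_web_sm model is installed, and the heuristics (verb-initial check, move-token regular expression) are simplified stand-ins for the pilot’s rule layer rather than its exact code.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def is_imperative(sentence: str) -> bool:
    """Heuristic imperative detector: the first non-punctuation token is a verb."""
    doc = nlp(sentence)
    first = next((t for t in doc if not t.is_punct), None)
    return first is not None and first.pos_ == "VERB"

MOVE_PATTERN = re.compile(r"\b(forward|right|left|back)\b", re.IGNORECASE)

def canonical_steps(sentence: str) -> list[str]:
    """Collapse narrated moves ('go forward, forward, then right') into canonical tokens."""
    return [m.group(1).lower() for m in MOVE_PATTERN.finditer(sentence)]

print(is_imperative("Show me the path."))  # expected True for a verb-initial imperative
print(canonical_steps("From the start, go forward, forward, then right to reach the zoo."))
# -> ['forward', 'forward', 'right']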
Planned visual alignment. CV modules are introduced incrementally and scoped to the pilot. First, OCR (Tesseract) runs over high-contrast regions to recover destination labels (e.g., “zoo,” “shopping”). Second, OpenCV contour- and shape-based heuristics detect arrow glyphs and estimate orientation via principal-axis analysis. Third, a weak cross-modal alignment links visual detections to co-occurring transcript spans within a ±2 s window. When multiple candidate frames compete for a span, the system selects the track with maximal intersection-over-union to maintain temporal consistency. Low-confidence or occluded symbols are surfaced to a lightweight human-verification UI (checkbox confirmation) rather than being accepted automatically, preserving both speed and traceability.
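A minimal sketch of the planned shape pass and temporal alignment, assuming OpenCV and NumPy and a pre-thresholded binary mask as input; the function names and span dictionaries are illustrative, while the 200 px area floor and ±2 s window follow the defaults stated in the text.
import cv2
import numpy as np

def arrow_orientation(mask: np.ndarray, min_area: int = 200) -> float | None:
    """Estimate the principal-axis angle (degrees) of the largest contour in a binary mask."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = [c for c in contours if cv2.contourArea(c) >= min_area]
    if not contours:
        return None
    pts = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(np.float64)
    pts -= pts.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(pts.T))  # principal axis of the contour points
    major = eigvecs[:, np.argmax(eigvals)]
    return float(np.degrees(np.arctan2(major[1], major[0])))

def align_to_transcript(frame_t: float, spans: list[dict], window: float = 2.0) -> list[dict]:
    """Return transcript spans whose midpoints fall within +/- window seconds of a detection time."""
    return [s for s in spans if abs((s["start"] + s["end"]) / 2.0 - frame_t) <= window]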
Knowledge-object packaging. All extracted elements are consolidated into traceable knowledge objects that carry pedagogical metadata and provenance. Core fields include
  • "id",
  • "timecodes",
  • "text",
  • "visual_refs",
  • "type" {definition, step, prompt, example},
  • "bloom_level",
  • "difficulty",
  • "prerequisites"
and, for feasibility analysis and reproducibility, the pilot also records asr_conf, tag_conf_map, ocr_conf, bbox ([x, y, w, h] in pixels for each reference frame), and a short source_hash of the media segment. Time-codes are given as hh:mm:ss.xx strings. Below, we show representative objects.
Example 3:
step.
A canonicalized arrow sequence aligned to a short span:
{"id": "KO-0142",
"timecodes": {"start": "00:02:11.20", "end": "00:02:14.90"},
"text": "forward, forward, right",
"type": "step",
"asr_conf": 0.93,
"tag_conf_map": {"knowledge_point": 0.88},
"visual_refs": [{"frame": 3187, "bbox": [412, 276, 86, 44], "ocr": null}],
"bloom_level": "apply",
"difficulty": "intro",
"prerequisites": ["KO-0061: forward arrow meaning"],
"source_hash": "c7a9f2"}
Interpretation: a three-move path (F,F,R) with high ASR and tag confidence, one corroborating frame, and an explicit prerequisite concept.
Example 4:
prompt.
An imperative instruction mapped to a two-panel guide:
{"id": "KO-0151",
"timecodes": {"start": "00:01:40.00", "end": "00:01:48.00"},
"text": "test it with your car",
"type": "prompt",
"asr_conf": 0.91,
"tag_conf_map": {"prompt": 0.95},
"visual_refs": [],
"bloom_level": "apply",
"difficulty": "intro",
"prerequisites": ["KO-0148: plan a path"],
"source_hash": "5d2e61"}
Interpretation: a high-confidence imperative that will render as a Plan → Test guide; no visual frames are required for the prompt itself.
Example 5:
entity.
A destination recovered via OCR with a bounding box:
{"id": "KO-0163",
"timecodes": {"start": "00:02:12.10", "end": "00:02:13.20"},
"text": "zoo",
"type": "entity",
"asr_conf": 0.00,
"ocr_conf": 0.88,
"visual_refs": [{"frame": 3190, "bbox": [508, 240, 92, 36], "ocr": "zoo"}],
"bloom_level": "remember",
"difficulty": "intro",
"prerequisites": [],
"source_hash": "c7a9f2"}
Interpretation: the destination label is anchored by OCR (not ASR) and linked to a specific frame and bounding box.
By requiring each adapted artifact to reference at least one knowledge-object ID and the corresponding time span (and, when available, corroborating frames), this packaging could support auditability. It enables ablations (e.g., disabling OCR or arrow detection to observe completeness deltas) without destabilizing the data model.
Notes on scope. The configuration reflects a feasibility-first posture: Whisper provides time-aligned tokens and utterances with preserved speaker turns; a compact label set and rule-aided tagging improve reliability at low computational cost; and staged OCR/shape detection is coupled with human confirmation in ambiguous cases. Together, these choices operationalize the prototype’s constraints while providing the concrete implementation details needed for replication and subsequent scaling.

4.3. Receiver Side: Single Student Style Adaptation

This pilot operationalizes adaptation for a predefined visual-learner profile (VARK) to demonstrate a modality-sensitive transformation within a deliberately narrow scope. The design privileges pictorial organization over prose, reduces extraneous cognitive load, and foregrounds explicit sequencing cues. Concretely, artifacts are authored with short text fragments (no more than seven words), stable and consistent iconography for key entities, strong contrast and spatial grouping to signal order and relationships, and explicit provenance on every output. These constraints ensure that the resulting materials are accessible and easy to audit.
Rule-based adaptation logic. Decomposed content is converted into learner-facing artifacts through deterministic mappings keyed on each knowledge object’s “type” field (definition, step, prompt, or example). Sequences of actions (e.g., forward, forward, right, forward) render as path diagrams on an 8 × 8 grid, with a start node, cell-by-cell movement, and a right-turn glyph placed at the appropriate step. Each move is annotated with a numbered badge to reinforce order, and a caption beneath the canvas records the knowledge-object identifier and timecodes (e.g., [KID: KO-0142; 00:02:11–00:02:14]). Imperative prompts (e.g., “show me the path”; “test it with your car”) materialize as two-panel guides in which Plan presents an empty grid with ghosted arrows and Test overlays a vehicle icon moving along the completed route; small corner icons allow the user to switch vehicle types without additional text. Nouns denoting entities (e.g., car, tour bus, farm, zoo, shopping) map to standardized pictograms; when OCR is available, the exact destination string is reused beneath the icon to anchor symbol–text alignment. Definitions and examples appear as concise callouts attached to relevant glyphs, accompanied by leader lines, and a compact legend explains the visual grammar (start, forward, turn, and destination). Color and contrast meet the Web Content Accessibility Guidelines (WCAG) AA, and shape/texture redundancy preserves interpretability in grayscale.
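A minimal sketch of this deterministic dispatch, assuming knowledge objects shaped like the examples in Section 4.2; the returned artifact dictionaries and renderer kinds are illustrative placeholders, not the pilot’s actual renderer API.
def render_artifact(ko: dict) -> dict:
    """Dispatch a knowledge object to a renderer based on its 'type' field."""
    footer = f"[KID: {ko['id']}; {ko['timecodes']['start']}–{ko['timecodes']['end']}]"
    if ko["type"] == "step":
        moves = ko.get("canonical_steps") or ko["text"].split(", ")
        return {"kind": "path_diagram", "moves": moves, "footer": footer}
    if ko["type"] == "prompt":
        return {"kind": "plan_test_guide", "instruction": ko["text"], "footer": footer}
    if ko["type"] == "entity":
        return {"kind": "pictogram", "label": ko["text"], "footer": footer}
    return {"kind": "callout", "text": ko["text"], "footer": footer}  # definition / example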
Rendering pipeline (pilot configuration). The renderer first collects knowledge objects associated with the focal clip and filters by confidence, retaining step and prompt items with probabilities of at least 0.75, while attaching temporally adjacent entity objects within a ±2 s window. Artifacts are then composed: consecutive step objects are merged into a single path diagram when the total count of moves is ≤8; longer sequences are paginated into subpaths of five to eight moves with clear continuation markers. Prompts generate a two-panel Plan → Test guide and, when a neighboring step sequence is present, share the same grid to maintain spatial continuity. Styling follows accessibility defaults (color-blind–safe palette, minimum 12 pt labels, and 24 px tap targets), and each artifact includes alt text that verbalizes the sequence (e.g., “four-step path: forward, forward, right, forward; destination: shopping”). The provenance is systematically embedded: the footers list the contributing knowledge-object IDs, their time ranges, and a short media hash.
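The pagination rule can be sketched as follows; this version distributes moves evenly across the fewest pages of at most eight moves, which is one plausible reading of the five-to-eight-move heuristic rather than the pilot’s exact splitting logic.
import math

def paginate_moves(moves: list[str], max_per_page: int = 8) -> list[list[str]]:
    """Split a move sequence into the fewest pages of at most max_per_page moves,
    distributing moves as evenly as possible across pages."""
    if not moves:
        return []
    n_pages = math.ceil(len(moves) / max_per_page)
    base, extra = divmod(len(moves), n_pages)
    pages, i = [], 0
    for p in range(n_pages):
        take = base + (1 if p < extra else 0)
        pages.append(moves[i:i + take])
        i += take
    return pages

print([len(p) for p in paginate_moves(["forward"] * 12)])  # -> [6, 6]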
Illustrative outputs. In the focal segment, the sequence F, F, R, F is rendered as a grid diagram with nodes labeled 1–4 and a curved corner marker at the third step, terminating at a shopping icon; the footer records KO-0142; 00:02:11–00:02:14. The prompt “Test it with your car” becomes a two-panel artifact, with the left pane displaying the planning scaffold and the right pane animating the car icon along the path, accompanied by the footer KO-0151; 00:01:40–00:01:48. When the transcript mentions “tour bus,” the system presents a bus pictogram with the caption “tour bus,” and if OCR detects “zoo,” the destination cell shows a zoo-gate icon labeled with the OCR string.
Quality guards and fallbacks. Ambiguities are surfaced rather than concealed. If a step token is low confidence or lacks visual confirmation, its segment is drawn as a dashed path with a warning badge; tooltips reveal the underlying confidence values. When destination text is unavailable from OCR, the transcript token is used verbatim; if both sources are absent, a neutral target glyph appears with an empty caption placeholder. To preserve legibility, sequences exceeding eight moves are automatically chunked and labeled as Part 1/2/3. Consistency checks verify lemma–icon agreement for entities, enforce contiguous step numbering, and require captions to include time-codes and at least one knowledge-object ID.
Performance and scope. Artifacts are rendered as Scalable Vector Graphics (SVG) by preference, or as 2× Portable Network Graphics (PNG) rasters with icon caching, yielding typical per-artifact render times at or below 50 ms on a modern CPU. This ensures that adaptation contributes negligible latency relative to upstream ASR and NLP stages. The pilot does not incorporate learner feedback. More complex strategies—such as multi-profile blending, behavior-informed adaptation, and interactive refinement—are intentionally deferred to subsequent iterations (see the flowchart in Figure 4).

4.4. Content Decomposition and Prompt Generation

At the core of the STREAM framework is a decomposition pipeline that converts raw instructions (speech, text, visuals) into tagged, time-aligned knowledge objects containing type (e.g., definition, step, prompt), difficulty, Bloom level, prerequisites, and references to visual evidence. This representation could support two complementary outputs. First, it is designed to enable representation shifts—for example, turning a narrated step sequence into an annotated path diagram with captions and legends for visual-preferring learners—while preserving auditability through timecodes and source links. Second, it is designed to enable prompt generation for formative and metacognitive support. Because each knowledge object carries intent and difficulty, the system can instantiate prompts as (a) immediate checks for understanding (“Apply the right-turn rule to reach the zoo from the start node”), (b) scaffolded hints that surface prerequisites when errors or hesitations are detected, and (c) metacognitive reflections that encourage learners to justify steps or compare alternative paths. These prompts are modality-aware (text, audio, diagram overlays), align with the object’s Bloom level, and can be sequenced adaptively to maintain cognitive load and motivation. In short, content decomposition supplies the semantic substrate; prompt generation operationalizes it into interactive guidance that closes the loop between instruction and learner action.
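A minimal sketch of Bloom-aligned prompt instantiation; the template strings and the concept/entity fields on the knowledge object are illustrative assumptions rather than the pilot’s schema.
PROMPT_TEMPLATES = {
    "remember": "Point to the {entity} on the board.",
    "understand": "In your own words, what does the {concept} mean?",
    "apply": "Apply the {concept} to reach the {entity} from the start node.",
}

def generate_prompt(ko: dict) -> str:
    """Instantiate a formative prompt from a knowledge object's Bloom level and fields."""
    template = PROMPT_TEMPLATES.get(ko.get("bloom_level", "understand"),
                                    PROMPT_TEMPLATES["understand"])
    return template.format(concept=ko.get("concept", "rule"),
                           entity=ko.get("entity", "destination"))

ko = {"bloom_level": "apply", "concept": "right-turn rule", "entity": "zoo"}
print(generate_prompt(ko))  # -> 'Apply the right-turn rule to reach the zoo from the start node.'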

4.5. Tools

Core components (versions and defaults). The pipeline integrates four main tool chains. For transcription, we use time-aligned OpenAI Whisper (ASR) on a PyTorch backend. The default model is small.en to prioritize speed, with medium.en substituted when audio conditions are noisier. Audio is processed in overlapping windows of 20 s with a 5 s stride; energy-based VAD is enabled, and punctuation/casing restoration is applied. For semantic tagging, we utilize the HuggingFace transformers stack with a lightweight multi-label classifier (BERT-base-uncased) trained on {knowledge point, prompt, entity, example}. Sentence segmentation and a simple Named Entity Recognition (NER) pre-pass are provided by spaCy (en_core_web_sm), while a compact rule layer detects imperatives (e.g., “show me…”) and canonicalizes arrow sequences (e.g., F, F, R). Planned vision modules include OpenCV for glyph detection and orientation (contour analysis with principal-axis estimation) and Tesseract OCR for destination labels, configured with psm=6, oem=1, and language eng; regions of interest are restricted to signage to reduce false positives. Data handling and rendering are implemented in Python 3.11: Pandas provides data structures and knowledge objects; Matplotlib renders annotated path diagrams and two-panel prompt guides, preferring SVG with a 2× PNG fallback. All artifacts embed provenance (knowledge-object IDs, time codes, and a short media hash) in their metadata.
Computational environment (pilot baseline). Experiments run on a single-GPU workstation (e.g., NVIDIA RTX 3080 with ≥10 GB Video Random-Access Memory (VRAM)), a modern 8+ core CPU, and 32 GB RAM under Ubuntu 22.04, Python 3.11, and PyTorch 2.x with CUDA 12.x. Latency targets for a five-minute clip are ASR ≤ 1.5× real time, tagging ≤ 0.5×, vision ≤ 0.5×, and rendering ≤ 0.1×. In a CPU-only fallback, Whisper base.en is used and vision frame sampling is reduced to 2 fps.
Reproducibility and instrumentation. To ensure replicability, package versions are pinned and random seeds recorded. Each stage emits a structured JSON line containing twall_ms, cpu_%, gpu_mem_MB, n_items, and conf_stats. System monitoring uses psutil for CPU/memory metrics, and pynvml for GPU VRAM metrics. Outputs are organized in a run directory with stage-specific subfolders (asr/, tag/, vision/, render/) that contain metrics (.jsonl) and exported artifacts (SVG/PNG) for audit.
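A minimal sketch of the per-stage instrumentation, assuming psutil is installed; GPU memory logging via pynvml is omitted for brevity, and the stage callable shown in the usage comment is hypothetical.
import json
import time
import psutil

def log_stage(name: str, fn, *args, metrics_path: str = "metrics.jsonl", **kwargs):
    """Run one pipeline stage, timing it and appending a JSON line of metrics."""
    psutil.cpu_percent(None)  # prime the CPU counter so the next reading covers this stage
    t0 = time.perf_counter()
    items = fn(*args, **kwargs)
    record = {
        "stage": name,
        "twall_ms": round((time.perf_counter() - t0) * 1000, 1),
        "cpu_%": psutil.cpu_percent(None),
        "n_items": len(items) if hasattr(items, "__len__") else None,
    }
    with open(metrics_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return items

# Usage (hypothetical stage callable): tags = log_stage("tag", run_tagger, sentences)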
Configuration defaults (knobs). The default ASR configuration is chunk_s=20, stride_s=5, beam_size=5, and temperature=0.0–0.4, with VAD enabled. For tagging, the minimum label probability threshold is 0.70, and the prompt detector requires an imperative with a verb lemma. Vision defaults include frame sampling at 4 fps, Canny thresholds {50, 150}, a minimum contour area of 200 px, arrow orientation from the Principal Component Analysis (PCA) angle, and an OCR confidence threshold of 0.75. Rendering adopts a grid cell size of 48 px, line width of 3 px, and minimum label size of 12 pt; the color palette meets WCAG AA, and alt text is auto-generated from canonical step strings.
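Collected as a single Python dictionary, these defaults might look as follows; the nesting and key names are illustrative, while the values mirror those stated above.
PILOT_CONFIG = {
    "asr": {"model": "small.en", "chunk_s": 20, "stride_s": 5,
            "beam_size": 5, "temperature": (0.0, 0.4), "vad": True},
    "tagging": {"min_label_prob": 0.70, "prompt_requires_imperative": True},
    "vision": {"fps": 4, "canny_thresholds": (50, 150), "min_contour_area_px": 200,
               "arrow_orientation": "pca", "ocr_conf_threshold": 0.75},
    "render": {"grid_cell_px": 48, "line_width_px": 3, "min_label_pt": 12,
               "palette": "wcag_aa", "alt_text_source": "canonical_steps"},
}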
Privacy and offline use. All inference runs locally; no media leaves the workstation. Intermediate audio, text, and frame caches are written to the run directory and purged after analysis in accordance with the project’s data-handling policy. This design could support privacy-preserving experimentation while retaining full auditability through embedded provenance.

4.6. Feasibility Criteria & Quick Evaluation

We evaluate feasibility along three dimensions—accuracy, latency/resources, and output quality—using compact, objective checks that can be completed on a single five-minute clip and summarized as clear pass/fail “go” signals.
Accuracy. For the speech pipeline, we compute Word Error Rate (WER) on a stratified 200–300 word slice sampled across (i) concept introduction, (ii) guided practice, and (iii) collaborative activity, after applying the pilot’s normalization (filler removal and numeral standardization). WER is defined as
WER = (S + D + I) / N,
where S, D, and I denote substitutions, deletions, and insertions, and N is the number of reference words. The target is WER ≤ 15% with a bootstrap 95% CI width ≤6 percentage points; we also report Sentence Error Rate (SER). For the NLP stage, tagging fidelity is assessed using a 40-item gold set (balanced with ≥10 instances each of knowledge point, prompt, entity, and example) annotated by two raters. We report Cohen’s κ for label presence (target κ ≥ 0.70) and, against the adjudicated gold, Precision/Recall/F1:
P = TP / (TP + FP), R = TP / (TP + FN), F1 = 2PR / (P + R).
For knowledge points, targets are P ≥ 70% and R ≥ 60%. A 4 × 4 confusion matrix surfaces common confusions (e.g., example vs. knowledge point).
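A minimal sketch of the two metric computations, implementing WER via a standard Levenshtein alignment and precision/recall/F1 from raw counts; the function names are illustrative.
def wer(ref: list[str], hyp: list[str]) -> float:
    """Word Error Rate via Levenshtein alignment: (S + D + I) / N."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Per-label precision, recall, and F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1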
Latency and resources. We measure wall-clock time per stage and end-to-end,
T_total = T_ASR + T_NLP + T_Vision + T_Render,
and target T_total ≤ 2× clip duration at the 90th percentile across three runs (stretch goal: ≤1.2×). The resource profile logs per-stage peak VRAM (MB), mean CPU%, and mean GPU%, with targets of VRAM ≤ 8–10 GB, mean GPU% ≤ 85%, and mean CPU% ≤ 80%. We also record I/O read time to confirm storage is not a bottleneck.
Output quality for the visual learner. A lightweight rubric operationalizes artifact quality using five binary/ternary checks scored 0/1/2 (absent/partial/present; maximum 10): (1) step numbering is contiguous and visible; (2) icon–noun agreement holds under lemmatization; (3) captions include timecodes and knowledge-object IDs; (4) color/contrast meets WCAG AA (contrast ratio (L_max + 0.05) / (L_min + 0.05) ≥ 4.5:1, where L is relative luminance); and (5) alt-text accurately verbalizes the sequence. Each artifact should score ≥ 8/10, and two gates are mandatory: icon–noun agreement and presence of timecodes and IDs.
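A minimal sketch of the WCAG AA contrast check, using the standard relative-luminance formula for 8-bit sRGB values; the function names are illustrative.
def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """WCAG relative luminance from 8-bit sRGB values."""
    def channel(c: int) -> float:
        c = c / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """(L_max + 0.05) / (L_min + 0.05); WCAG AA requires >= 4.5 for normal text."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1))  # -> 21.0 for black on white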
Traceability and provenance. We require 100% back-link coverage, i.e., every artifact references at least one knowledge-object ID and a valid time range. Additionally, timestamp drift must satisfy |t̂_artifact − t_source| ≤ 0.5 s for the primary span, with a target of ≥95% of artifacts within tolerance.
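A minimal sketch of the back-link and drift check; the artifact and knowledge-object field names (ko_ids, primary_span_start, start_s) are illustrative assumptions rather than the pilot’s exact schema.
def check_provenance(artifacts: list[dict], knowledge_objects: dict, tol_s: float = 0.5) -> dict:
    """Compute back-link coverage and the share of artifacts within the drift tolerance."""
    linked = [a for a in artifacts
              if a.get("ko_ids") and all(k in knowledge_objects for k in a["ko_ids"])]
    within_tol = [a for a in linked
                  if abs(a["primary_span_start"] - knowledge_objects[a["ko_ids"][0]]["start_s"]) <= tol_s]
    n = max(len(artifacts), 1)
    return {
        "backlink_coverage": len(linked) / n,           # target: 1.00
        "drift_within_tolerance": len(within_tol) / n,  # target: >= 0.95
    }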
Lightweight ablations (optional). To gauge reliance on individual modalities, we perform two quick ablations. Disabling OCR (–OCR) measures (i) the percentage of artifacts missing destination captions and (ii) the rubric delta on icon–noun agreement when text anchoring is absent. Disabling arrow detection (–Arrow) measures (i) the percentage of sequences rendered with dashed (“uncertain”) segments and (ii) average path edit distance versus gold (over strings in {F, R, L, B}). We expect –OCR to primarily reduce caption accuracy and –Arrow to increase uncertainty badges without breaking provenance.
Quick evaluation protocol and go/no-go. We execute three full passes and report per-stage medians and 90th percentiles for time and resource usage. We then evaluate the predefined WER slice and the 40-item tagging gold, computing metrics with 1000-sample bootstrap confidence intervals (CI). Finally, we score the first 10 artifacts using the rubric and verify backlink coverage and timestamp drift. A Go decision requires: WER ≤ 15%, κ ≥ 0.70, knowledge-point precision/recall targets met, T_total ≤ 2× clip duration, VRAM within budget, rubric score ≥ 8/10 with both must-have gates satisfied, and 100% back-link coverage. Otherwise, we issue No-Go and list the failing gate(s) with a one-sentence remediation (e.g., “use a larger ASR model” or “reduce frame sampling to 2 fps”).

4.7. Scope

This pilot is tightly bounded to surface feasibility signals while minimizing confounds. We deliberately fix the content, learner profile, perception modules, outputs, compute environment, and evaluation protocol to a minimal, reproducible slice. The source material is a single 5-min English clip (“Pree,” directional logic). Audio consists of clean classroom-style narration, and the video is processed as file-based input; for planned vision passes, frames are sampled at 4 fps. The target learner is a single, predefined visual (VARK) profile: the system performs no learner profiling, personalization, or multi-profile blending. Perception support is intentionally partial: OCR is used to recover destination labels, and simple shape cues are used to detect and orient arrow glyphs. Ambiguous visual detections are routed to quick human confirmation; there is no advanced tracking, pose estimation, or segmentation.
Extraction is constrained to a compact schema including: knowledge_point, prompt, entity, and example. Step sequences are capped at 8 moves per diagram, with longer paths automatically chunked to preserve legibility. The adaptation produces visual-first artifacts only: numbered path diagrams for step sequences, two-panel PlanTest guides for prompts, entity pictograms with OCR-derived captions, and brief callouts for definitions or examples. No audio narration, TTS, or interactive widgets are generated. Human-in-the-loop involvement is limited to validation (ASR spot checks and vision ambiguity checks); the pilot does not incorporate live learners, affective feedback, or adaptive branching.
All computations are run offline on a single machine (Python 3.11; a single GPU is optional). There are no cloud calls; all media and outputs remain local. Evaluation is restricted to feasibility indicators on this one clip—namely accuracy, latency/resources, and output quality/traceability—and makes no claims about learning outcomes or user studies. Inputs are de-identified, intermediate audio/text/frame caches are written under a run directory, and artifacts and caches are purged after analysis in accordance with the project’s data-handling policy.
Out of scope. We do not consider multi-clip corpora, multilingual content, or model fine-tuning for ASR/NLP/CV. Streaming operation, emotion or state detection, behavior-driven adaptation intended for real-time use, teacher dashboards, and randomized user trials are likewise excluded. Despite these constraints, the pilot executes the complete pathway—ingestion → decomposition → visual adaptation—and yields baseline, auditable metrics that inform scaling decisions (e.g., frame rate, sequence length thresholds, and acceptable rates of CV confirmation).

4.8. Risks & Immediate Mitigations

This pilot faces four near-term risks and addresses each with concrete safeguards. Learning-style generalization: VARK is used only as a pragmatic proxy to exercise the pipeline; the system does not infer styles. We constrain outputs to a single visual profile and explicitly mark all artifacts as rule-driven. Next iterations replace fixed styles with behavior-based preferences (e.g., click/hover dwell, task completion) and outcome-based evaluation (quiz accuracy/time), with A/B tests to verify gains before adoption. Vision fragility: Arrow detection can degrade under motion/occlusion or low contrast; we cap CV to simple shape/OCR cues and require human confirmation when confidence falls below a threshold (e.g., 0.75). We add template matching as a fallback, downsample frames for stability, and enable transcript-only rendering (dashed path with a warning badge) when visuals are unreliable, logging ambiguity rates for later tuning. External validity: Performance on concrete, elementary content may overestimate robustness; we therefore treat metrics as clip-specific feasibility signals, not generalized claims. Subsequent pilots will diversify topics (abstract algebraic reasoning, non-narrative explanations), settings (noise levels, lighting), and speaker characteristics, with stratified reporting to expose domain shift. Privacy & ethics: All media are de-identified (faces/labels blurred where applicable), processed locally (no cloud calls), and stored under role-restricted folders; transcripts/frames are purged after analysis per a documented retention schedule. The provenance is preserved without personal identifiers, access is recorded for audit purposes, and all activities comply with IRB guidelines for minimal-risk educational media.

4.9. Pilot Study Outcomes

Our pilot test demonstrates that the full system functions seamlessly from start to finish on a short 5-min video clip from a “Pree” elementary STEM lesson. It takes the teacher’s spoken words, turns them into text, labels key parts like definitions or instructions, and creates easy-to-understand visual aids (such as path diagrams drawn in a scalable format, simple two-part guides for planning and testing, and picture icons for things like animals or places) that include notes on where they came from in the original video. For this specific clip, the system passed all our basic checks for practicality: it accurately handled speech-to-text, reliably labeled content, and ran quickly enough on everyday computer hardware—without requiring specialized equipment. It also kept everything traceable (each visual was linked back to an ID and the exact time in the lesson) and easy to read (icons matched their labels correctly, steps were numbered in order, and colors met accessibility standards for clear visibility). When we tested by disabling certain features (such as text reading from images or arrow spotting), the results were as expected: it mostly affected labels on destinations or added uncertainty flags to steps, but the links back to the source remained strong. Overall, these findings give us confidence in the system’s core ideas for extracting information from videos and visually adapting it. They provide a clear expansion plan—for example, to accommodate different learning styles, enhance automatic image processing (reducing the need for manual checks), and incorporate real-time feedback mechanisms for both teachers and students. For a closer look, we share specific results in key areas: accuracy, time taken and resources used, visual quality, and how well everything aligns with the original. We ran the test three times on the 5-min clip and report typical values (medians) and higher-end values (90th percentiles) where appropriate. For the accuracy numbers, we used a statistical method (bootstrap with 1000 samples) to estimate the reliability ranges. All this follows the guidelines we set earlier in Section 4.6.

4.9.1. Accuracy

The system’s speech-to-text and language processing components performed well on the clear, well-organized audio and text from the “Pree” video clip, meeting or exceeding our goals for accuracy in converting speech to text and identifying key content. For speech-to-text (ASR, or Automatic Speech Recognition), we evaluated accuracy using Word Error Rate (WER), which measures the number of words it gets wrong (through mix-ups, omissions, or insertions). We tested it on a balanced sample of 250 words, split evenly across the lesson’s introduction of ideas, guided practice, and group activities. The typical WER was 12.3% (with a reliable range of 10.5% to 14.1% based on statistical checks), which is better than our 15% limit. The range was narrow at 3.6 points, meeting our requirement of 6 points or less. We also looked at Sentence Error Rate (SER), which was 28%—meaning about one in every four sentences had at least one mistake, mostly from swapping words during quick back-and-forth talks. Common issues included confusing numbers (such as hearing “two forwards” as “to forwards”) and not fully removing filler words like “um.” For labeling content (semantic tagging), we tested how well it identified key types, including main ideas (knowledge points), instructions (prompts), entities (mentioned things), and examples (samples). We used a set of 40 examples (10 of each type) that two people labeled independently, then agreed on any differences. The agreement between them was strong, with a score of 0.78 (higher than our 0.70 goal), thanks to our simple labeling system. Compared to the final agreed labels, we measured the metrics shown per type in Table 4.
The knowledge points met our goals (precision at least 70%, recall at least 60%), and prompts performed best thanks to extra rules to spot commands like “show me.” A 4 × 4 confusion matrix (Figure 5) shows where mix-ups happened, like sometimes confusing examples with main ideas (e.g., a sample sequence labeled as a core concept).
For the visual parts (like reading text from images with OCR or spotting arrows), we checked informally: The typical confidence for reading destination labels was 0.82 (out of 1) across 15 spots, and arrow direction was correct 92% of the time in 25 checks, with most mistakes in partly hidden arrows that we sent for quick human review.

4.9.2. Latency and Resources

The system ran smoothly on regular, affordable computer hardware—like what you might find in a typical school setting—keeping the total processing time well below our goal of twice the video’s length. We tested it three times, and the usual total time (median) was 7.2 min for the 5-min video clip (approximately 1.44 times the actual video length), with the higher-end time (90th percentile) at 7.8 min (1.56 times the actual length). We broke it down by each step in Table 5, where you can see that turning speech into text (ASR) took the most time because it processes overlapping sections of audio for better accuracy.
The system also stayed within our resource limits: the highest graphics memory used was 6.2 GB (well below our 8–10 GB cap), average main processor (CPU) use was 62% (under 80%), and average graphics processor (GPU) use was 78% (under 85%). Loading data from storage was very quick (less than 5% of total time), so it did not slow things down. Figure 6 shows a pie chart of how the time was split among the steps (using typical values).

4.9.3. Output Quality

We evaluated the adapted visuals for visual learners using a simple five-point checklist for the first 10 items produced (five path diagrams, three prompt guides, and two entity icons). Overall, they scored an average of 9.2 out of 10, well above our goal of at least 8. Every item met the two essential requirements: matching icons to the correct words (like ensuring a “zoo” icon pairs with the word “zoo”) and including source details for traceability. Here is a breakdown of the checklist scores (each item rated out of 2 points): (1) Clear and complete step numbering: 2/2 (perfect across all items); (2) Accurate match between icons and words: 2/2 (95% success rate after simplifying word forms); (3) Captions including exact lesson timestamps and IDs: 2/2 (100% included); (4) Good color contrast for accessibility (following web standards): 1.8/2 (small points off for a couple of badges that could be brighter); (5) Helpful alternative text descriptions (for screen readers): 1.4/2 (good but could add more details for trickier sequences). None of the items scored below 8 out of 10, indicating that the visuals are reliable and user-friendly in classroom settings.

4.9.4. Traceability and Provenance

We achieved full back-link coverage, meaning that every adapted visual (such as diagrams or guides) included a reference to at least one knowledge object ID and a matching time range from the original lesson. When we checked for any timing mismatches (called timestamp drift) across 20 key segments, 98% stayed within our goal of no more than 0.5 s off, with a typical mismatch of just 0.12 s—this exceeded our target of at least 95% accuracy. This strong performance makes it easy for teachers to audit and trace back: you can quickly connect any adapted material directly to its exact spot in the original video and audio, ensuring transparency and reliability in how content is transformed.

4.9.5. Lightweight Ablations

To test the system’s flexibility, we conducted simple experiments (called ablations) by disabling specific features, confirming that the pipeline remains stable when individual components are turned off. For example, when we disabled optical character recognition (OCR, which reads text from images), 40% of the visuals lost their destination labels (compared to none in the full setup), slightly lowering the match between icons and words by an average of 0.4 points on our quality checklist—but the links back to the source material stayed completely intact. Similarly, turning off arrow detection resulted in 25% of step sequences showing dashed lines for uncertain parts (up from 5% in the baseline), with an average minor error of 0.8 steps (usually just one move off)—yet it did not affect the back-links or overall quality scores. The statistical analyses, including ablation studies and efficiency metrics, are employed to quantify the pipeline’s feasibility under controlled conditions, aligning with guidelines for pilot studies in adaptive learning where such methods provide reproducible baselines without necessitating full efficacy testing [102]. This could strengthen the framework’s technical viability by focusing on practical implementation rather than comprehensive statistical inference. These results give the pilot a clear “go-ahead,” as it passed all our checks. Looking ahead, we can improve by generating more effective alternative text descriptions for every adapted visual (such as complex sequences) and by fine-tuning speech recognition for quick back-and-forth discussions, both of which will help guide future upgrades for classroom use.

5. Discussion

5.1. Why STREAM Is Designed to Fill an Important Research Gap

STREAM is designed to address a potential research gap in adaptive learning by enabling intra-lesson multimodal transformation, building on decades of UDL research that highlights limitations in static, one-size-fits-all virtual environments. Whether it successfully fills this gap remains a testable hypothesis. For instance, a 2023 systematic review of UDL effectiveness [106] analyzed implementations across educational settings, finding that while UDL could potentially improve accessibility pending validation, it often lacks integration with AI for dynamic content adaptation, planned for real-time. Similarly, a 2025 narrative review of UDL strategies in inclusive education from 2014 to 2024 [107] emphasizes persistent gaps in metadata-driven personalization, where traditional systems fail to unify semantic analysis with learner variability. The STREAM framework aims to extend these findings by decomposing content into tagged units for regeneration (as in Section 3), potentially filling the void in low-latency, AI-driven frameworks noted in recent AI personalization studies [108], which report potentially enhanced outcomes pending validation, but call for more equitable, within-lesson solutions. However, the pilot’s narrow scope does not validate this extension, positioning it as a hypothesis for future studies.
Most “adaptive” virtual learning systems personalize around content (e.g., sequencing items or adjusting difficulty) rather than within the instructional stream itself. They typically depend on preprocessed datasets, static learner models, and a single dominant modality, which makes them slow to respond to moment-by-moment pedagogical cues and poorly suited to diverse learners in real classrooms. STREAM is intended to address this potential gap by operating within the instructional flow, decomposing teacher-delivered lessons into machine-interpretable knowledge objects and regenerating them as multimodal, learner-aligned artifacts. Functionally, this design means that a live or recorded explanation could be parsed, tagged, aligned with visuals, and made immediately available for transformation—though this has not been tested beyond the pilot’s controlled conditions. Conceptually, STREAM unifies speech recognition, semantic tagging, and planned CV into a single pipeline that produces traceable instructional units with provenance (timecodes, source references) and pedagogical metadata (type, Bloom level, prerequisites).
The feasibility pilot—scoped to a single five-minute elementary STEM lesson with exceptionally clean audio, high-contrast visuals, repetitive simple commands, and limited vocabulary in a controlled setting—demonstrates the end-to-end viability of this approach: it transforms explanations, prompts, and examples into modular knowledge objects under these constraints, suggesting that the system could facilitate mastery and formative adaptation where content is re-presented, scaffolded, or enriched according to the learner’s current state. However, this is a hypothesis requiring validation, as the pilot does not assess actual learning outcomes or real-world applicability. It offers an empirical counterexample to the assumption that multimodal personalization is too complex for classroom-scale use, though deployment was offline-only rather than real-time, provided that these constraints are met. In short, the contribution is not another recommender; it is an evidence-driven, modular mechanism for content transformation in simplified scenarios, designed for real-time use, that bridges the long-standing gap between dynamic instruction and dynamic personalization while highlighting areas for broader testing.

5.2. Intended Alignment with Personalized Learning Theories

STREAM is designed to align with personalized learning theories by integrating AI for adaptive, multimodal delivery, potentially extending established frameworks that emphasize individual cognitive and affective needs. The pilot shows technical compatibility in a limited context, but whether this alignment leads to improved outcomes is a hypothesis for future empirical studies. A 2025 review on AI-driven adaptive platforms [12] synthesizes pedagogical approaches, demonstrating how AI personalization through predictive analytics and tutoring systems could potentially enhance motivation and outcomes pending validation, while highlighting integration challenges with multimodal data, planned for real-time but offline. Building on this, a 2024 systematic review of personalized adaptive learning in higher education [109] demonstrates positive impacts on potential engagement pending validation via tailored paths, but notes limitations in cross-modal fusion without unified AI models. STREAM’s use of NLP and planned CV for learner profiling (Section 2.2) is intended to advance these theories, as evidenced in 2025 studies on generative AI in medical education [110], which advocate for customized experiences to bridge theoretical gaps in diverse virtual settings. However, the pilot does not demonstrate such advancements, limiting claims to design intentions. The design aligns conceptually with several strands of learning theory without overcommitting to any single “style” doctrine, though empirical validation is needed to confirm these alignments. First, by transforming explanations, prompts, and examples into modular knowledge objects, the system could potentially facilitate formative adaptation pending validation: content can be re-presented, scaffolded, or enriched according to the learner’s current state, rather than adhering to a fixed syllabus. Second, multimodal regeneration (text, annotated visuals, narrated summaries) is consistent with Dual Coding and Multimedia Learning principles, where coordinated verbal–visual channels reduce extraneous load and strengthen integrative processing when carefully signposted. Third, the architecture is designed to enable UDL practices—multiple means of representation, action/expression, and potential engagement—because each knowledge object carries modality and accessibility descriptors that can be matched to individual needs. While the pilot demonstrates technical compatibility with UDL principles in a visual-learner context, it does not show that STREAM actually facilitates UDL practices or improves outcomes for diverse learners, which would require controlled studies with actual participants. Finally, the learner model’s evolution from declared preferences (e.g., a VARK “visual” starting point used in the pilot) toward behavior- and outcome-driven profiles positions the STREAM framework to incorporate self-regulated learning loops: as the system observes strategy use (pauses, replays, modality switches) and performance, it can surface metacognitive prompts, adjust pacing, and recommend representations that demonstrably aid comprehension. Thus, STREAM treats “style” labels as pragmatic initializers while anchoring long-term adaptation in measurable learning behaviors and outcomes. The pilot demonstrates this alignment in a visual-learner context, but its efficacy across diverse theoretical applications remains to be confirmed through broader testing.

5.3. Potential Role of AI in Equity and Access

AI’s potential role in promoting equity and access in education is explored through frameworks like STREAM, which is designed to address disparities in virtual learning by providing traceable, personalized adaptations for under-represented groups. However, whether STREAM improves access for underserved populations is an open empirical question that the single-profile pilot, even under optimal conditions, does not address. A 2025 UNESCO report on AI in education outlines how AI can innovate practices to accelerate SDG 4. Still, it warns of equity risks without inclusive design, echoing findings from a 2025 survey on AI adoption that notes uneven access in underserved communities. Furthermore, a 2025 analysis of generative AI in education emphasizes the importance of early adoption to foster a sense of belonging and confidence. It advocates equitable tech funding to mitigate social divides. STREAM’s modular architecture (Section 3) is intended to build on these insights, potentially ensuring multilingual and disability-compatible regeneration, as supported by 2024 research on AI for digital inclusion, which highlights the potential of personalized tools to foster autonomous learning across diverse populations [111]. The pilot offers preliminary technical insights but does not validate equity impacts.
Equity in virtual classrooms could potentially hinge on two capabilities: (i) meeting learners where they are, and (ii) delivering support within real operational constraints (bandwidth, language, accessibility). The proposed pipeline is designed to advance both in principle, though the pilot’s limited scope restricts claims to preliminary insights. First, by decomposing lessons into fine-grained, tagged units, the system can generate accessible representations on demand, including captioned transcripts, plain-language summaries, high-contrast annotated diagrams, audio descriptions, and bilingual overlays. These artifacts are not afterthoughts; they are primary outputs of the same knowledge objects that power the mainstream experience, which helps avoid the typical lag between “core” and “accessible” materials. The pilot demonstrates the effectiveness of visual cues in a clean input scenario, suggesting potential for equity, but it has not yet shown this in diverse or noisy environments. Second, STREAM’s modularity enables edge deployment, allowing transcription and tagging to run locally or near the edge. At the same time, lightweight artifacts (such as SVG diagrams and compressed audio) can be streamed as an on-ramp to semantic communication, prioritizing meaning over raw bitrate in constrained settings. Third, the traceability of each adapted artifact back to its source mitigates risks of AI “hallucination” and supports transparent accommodations for students with disabilities or multilingual learners who need just-in-time translations and scaffolded visuals. Finally, by logging which representations lead to improved comprehension or reduced time-on-task for specific groups, the system can surface inequities silently embedded in materials (e.g., idiomatic language, culturally particular examples) and automatically route learners to alternatives with demonstrated efficacy. In effect, AI could be used not merely to personalize, but to equalize access to the same conceptual core—though the pilot’s single-profile focus limits these equity claims to hypothetical extensions that require further validation with diverse learner populations.

5.4. Potential for Cross-Disciplinary Collaboration

STREAM is designed to foster cross-disciplinary collaboration in AI-edtech by unifying tools from education, computer science, and psychology, potentially extending ongoing efforts to integrate adaptive learning across fields. The pilot provides a basic proof of concept, but broader interdisciplinary input is needed to realize this potential. A 2025 review of AI in STEM education adopts a transdisciplinary framework, demonstrating how AI enhances high-quality learning through collaborative semantic analysis, while also calling for broader partnerships [112]. Similarly, a 2024 analysis of AI-driven platforms highlights the benefits of interdisciplinary teams in developing user-centric tools, noting the challenges of scaling without shared validation roadmaps [113]. Research on AI’s evolution across academic disciplines predicts transformative collaborations, as seen in 2025 reports on AI in education that emphasize joint efforts to develop ethical and equitable systems [114]. By enabling multimodal adaptation (Section 4), with real-time potential, STREAM invites such collaborations, building on these studies to potentially advance sustainable educational innovation. However, the pilot underscores the need for interdisciplinary validation beyond controlled settings.
STREAM’s decomposition–adaptation loop could serve as a natural catalyst for collaboration across Engineering, Computing, and Education, though this remains a hypothesis. In Electrical and Computer Engineering (ECE), teams can harden and scale the middle layer: low-latency signal processing for robust ASR in noisy classrooms; embedded vision for on-device slide parsing and symbol detection; GPU/ASIC acceleration for tagging; and edge deployment strategies, intended for real-time that balance privacy with performance. Computer Engineering and Systems groups can advance semantic communication under bandwidth constraints and co-design streaming protocols that prioritize meaning-bearing artifacts (knowledge objects, vectorized semantics) over frames. Human–Computer Interaction can iterate learner-facing representations—e.g., glanceable diagrams, progressive disclosure, and adaptive legends—while Accessibility researchers formalize alt-text, captioning, and contrast policies directly from knowledge-object metadata. Learning Sciences and Special Education can lead the validation agenda: defining outcome measures, studying transfer and persistence, and auditing bias across subpopulations. Finally, security and privacy experts can codify on-device processing, federated learning, and audit logs so that adaptive decisions remain explainable and compliant with institutional and IRB norms. These collaborations are not ancillary; they map one-to-one onto the framework’s modules, potentially enabling shared testbeds (e.g., a five-minute lesson corpus with synchronized audio–video–transcript ground truth) and reproducible benchmarks (latency, accuracy, and accessibility metrics) that each discipline can improve while contributing to a coherent, learner-centered system. The pilot serves as an initial proof of concept for such collaborations, but it underscores the need for interdisciplinary input to extend beyond controlled settings.

5.5. Limitations and Roadmap for Validation

While STREAM demonstrates feasibility in controlled pilots, its limitations—such as offline scoping and clean conditions—align with broader challenges in adaptive frameworks, informing a validation roadmap that progresses from offline testing toward real-time deployment. While the pilot’s narrow scope in a short elementary video represents a best-case scenario for feasibility, systematic reviews of AI adaptive systems note this as a standard initial step, with scaling to longer, abstract lectures or noisy environments requiring domain-specific fine-tuning to address tagging brittleness [56]. For example, studies emphasize transitioning from controlled baselines to real-world variability via techniques such as accent-robust ASR and noise augmentation, which we outline for validation in diverse deployments. A 2025 review of multimodal learning analytics identifies hurdles to AI integration, including data privacy and scalability, and recommends phased validation for equity [115]. Similarly, a 2024 scoping review of adaptive learning in higher education notes limitations in processing under noise, with real-time untested, advocating empirical studies for diverse contexts [116]. Building on a 2025 AI report for education, which outlines roadmaps for inclusive AI, STREAM’s limitations (e.g., single-video pilot) require extensions to noisy environments and multilingual profiles. The proposed roadmap includes staged deployments, comprehension trials, and cross-disciplinary evaluations, extending the 2025 guidance on adaptive platforms for robust and equitable validation.
While the pilot demonstrates basic feasibility for content adaptation in a highly controlled offline environment, its limited scope—restricted to one short clip with ideal audio-visual quality and a single learner profile—precludes broad generalizations about classroom viability, equity impact, or scalability. External validity remains a key concern, as the system’s performance may degrade in real-world settings with factors such as background noise, complex vocabulary, diverse accents, low-resolution visuals, or extended lesson durations. We acknowledge these constraints but emphasize that the current evidence demonstrates only pipeline functionality under optimal conditions, not its widespread applicability. The pilot’s thorough technical evaluation, including statistical metrics for latency and completeness, underscores the framework’s practical implementation potential, as recommended in reviews of feasibility pilots for complex edtech interventions [105]. This rigor positions STREAM as a scalable prototype, with future validations to explore broader applications. To address these limitations and build toward generalizability, we propose the following concrete roadmap for future validation:
  • Phase 1: Expanded content testing (short-term, 3–6 months): Apply the STREAM framework to a corpus of 20–30 lessons (5–15 min each) spanning STEM, humanities, and languages. Include varied input qualities (e.g., noisy audio from actual classrooms, handwritten slides, multilingual content). Metrics: ASR accuracy (>85%), tagging fidelity (inter-rater agreement >0.8 via human annotation), latency (<5 s end-to-end). This will test robustness to content diversity.
  • Phase 2: Multi-profile learner validation (medium-term, 6–12 months): Conduct user studies with 50–100 diverse participants (e.g., multilingual learners, students with disabilities like dyslexia or ADHD, varying ages/levels). Simulate profiles beyond the visual (e.g., auditory, kinesthetic) and measure outcomes such as comprehension retention (using pre- and post-tests), engagement (measured by time-on-task and self-reported via surveys), and preference matching. Use A/B testing to compare adapted vs. non-adapted content. This will evaluate equity impacts in controlled lab settings.
  • Phase 3: Real-world deployment pilots (long-term, 12–24 months): Deploy in 3–5 virtual classrooms (e.g., K-12 and higher ed, urban/rural sites) with bandwidth constraints. Integrate with platforms like Zoom or Moodle, tracking scalability metrics (e.g., concurrent users without latency spikes, edge vs. cloud performance). Include ethical reviews for privacy and bias audits. Longitudinal data will assess sustained potential impacts on accessibility pending validation.

6. Conclusions

6.1. Contribution Summary

This paper introduces a conceptual framework called STREAM, designed for multimodal content adaptation that operates within the instructional stream rather than around it, though this remains an aspirational goal beyond the narrow pilot. The core contributions are (i) a conceptual end-to-end architecture that decomposes, tags, and regenerates content across modalities; (ii) a transparent mechanism that is designed to enable traceable, theory-aware adaptation; and (iii) feasibility evidence via a scoped offline pilot that transforms a five-minute lesson into visual-first artifacts for a predefined learner profile with low-latency processing on commodity hardware. These elements are intended to suggest potential for moving beyond static personalization toward responsive adaptation, but the pilot does not demonstrate or prove achievability at classroom scales or educational impact; instead, it provides a methodological scaffold for investigating these hypotheses through the staged validation program outlined in Section 5.5. Building on the decomposition and prompt strategies outlined in Section 4.4, future work will validate real-time extensions beyond the current offline testing.

6.2. Scope Alignment

The work presented here is the first step in a broader research program intended to combine content analysis with dynamic learner modeling and, eventually, real-time multimodal delivery, potentially advancing inclusive and responsive virtual classrooms, though this potential is a hypothesis requiring validation. The feasibility pilot demonstrates the ingestion → decomposition → adaptation pathway for a single profile and short segment under controlled conditions, providing technical compatibility with the full scope’s objectives: (1) analysis and knowledge extraction across modalities, planned for real-time; (2) progressive learner modeling that evolves from declared preferences toward behavior- and outcome-driven profiles; and (3) adaptive generation that delivers multiple, accessible representations consistent with UDL principles. However, the pilot’s exclusions (e.g., no computer vision, dynamic profiling, or real-world noise) limit claims to preliminary feasibility, positioning broader applications as testable extensions. STREAM’s modularity—source side, middle layer, receiver side—maps cleanly to future studies on affect-aware adaptation, edge deployment, semantic communication under bandwidth constraints, accessibility-first rendering, and rigorous learning-science evaluation across diverse populations. By outlining a traceable, theory-aligned mechanism for content transformation and prompt delivery, this paper provides a conceptual and methodological scaffold for investigating, auditing, and iteratively improving multi-profile, interactive, and institution-scale implementations, with equity and scalability as open empirical questions.

Author Contributions

Conceptualization, L.N.Y., Y.C., N.S.F., A.S., M.H.; methodology, L.N.Y., Y.C.; software, L.N.Y., M.H.; validation, L.N.Y., Y.C., M.H.; formal analysis, L.N.Y., Y.C., N.S.F., A.S., M.H.; investigation, L.N.Y.; resources, L.N.Y., Y.C., N.S.F., A.S., M.H.; data curation, L.N.Y., M.H.; writing—original draft preparation, L.N.Y., Y.C., N.S.F., A.S., M.H.; writing—review and editing, L.N.Y., Y.C.; visualization, L.N.Y., M.H.; supervision, Y.C., N.S.F., A.S.; project administration, L.N.Y., Y.C., N.S.F., A.S., M.H.; funding acquisition, Y.C., N.S.F., A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the findings of this study are available upon reasonable request from the corresponding author and comply with Binghamton University guidelines.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
API: Application Programming Interface
ASR: Automatic Speech Recognition
BERT: Bidirectional Encoder Representations from Transformers
CI: Confidence Interval
CV: Computer Vision
fps: frames per second
GPU: Graphics Processing Unit
NER: Named Entity Recognition
NLP: Natural Language Processing
OCR: Optical Character Recognition
PCA: Principal Component Analysis
PNG: Portable Network Graphics
SER: Sentence Error Rate
STREAM: Semantic Transformation and Real-Time Educational Adaptation Multimodal
SVG: Scalable Vector Graphics
T5: Text-to-Text Transfer Transformer
TTS: Text-to-Speech
UDL: Universal Design for Learning
VAD: Voice Activity Detection
VARK: Visual, Auditory, Reading/Writing, Kinesthetic
VLEs: Virtual Learning Environments
VRAM: Video Random-Access Memory
WCAG: Web Content Accessibility Guidelines
WER: Word Error Rate
YOLO: You Only Look Once

Figure 1. Study diagram for adaptive learning system.
Figure 2. Conceptual flow of the proposed STREAM framework.
Figure 3. The coding class.
Figure 4. Receiver-side single-student adaptation: rule-based mapping from knowledge objects to visual artifacts.
Figure 5. Confusion matrix for semantic tagging labels.
Figure 6. Pie chart of latency breakdown by stage (median across runs).
Table 1. Comparison of enabling technologies for adaptive learning.

Content Analysis Tools (designed for real-time use, tested offline):
- T5, BERT, GPT. Function: text-based content analysis and semantic extraction. Limitation: requires fine-tuning for educational contexts.
- Whisper, Google Speech-to-Text. Function: speech recognition and lecture transcription. Limitation: accuracy may drop with noisy inputs or accents.
- OpenCV, YOLO, Vision APIs. Function: visual content segmentation and object recognition. Limitation: limited interpretation of abstract visuals.

Learner Modeling & Preference Detection:
- Emotion APIs, affective computing tools. Function: detect emotional and motivational states in learners. Limitation: potential bias; limited granularity without dedicated hardware.
- Eye-tracking (Tobii, iMotions). Function: tracks gaze, attention, and behavioral interaction. Limitation: intrusive or costly; sensitive to setup.
- VARK, Felder-Silverman models. Function: categorize learners by preferred learning modalities. Limitation: contested theoretical validity.

Multimodal Delivery Tools:
- TTS engines (Polly, WaveNet). Function: deliver content in natural spoken formats. Limitation: modality fidelity varies by language and platform.
- ChatGPT, Gemini, generative AI. Function: generate custom content for adaptive instruction. Limitation: limited control over depth and granularity.
- Semantic communication models. Function: optimize message meaning in low-bandwidth settings. Limitation: still emerging; high technical complexity.

Existing Adaptive Learning Systems:
- Khan Academy, Coursera, Smart Sparrow. Function: personalized paths based on performance history. Limitation: lack in-lesson adaptation and multimodal personalization; real-time behavior untested.
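As a concrete example of the content-analysis row in Table 1, the sketch below transcribes a lesson clip with the open-source Whisper package (pip install openai-whisper), yielding the time-coded segments that downstream semantic tagging would consume. The model size and file name are placeholders; this is not the pilot's actual ingestion code.

```python
# Minimal content-analysis ingestion sketch using the open-source Whisper
# package listed in Table 1. The file name is a placeholder.
import whisper

model = whisper.load_model("base")            # small model suitable for commodity hardware
result = model.transcribe("lesson_clip.mp4")  # ffmpeg handles the audio extraction

for seg in result["segments"]:
    # Time-coded segments are the raw material for downstream semantic tagging.
    print(f"[{seg['start']:7.2f}s to {seg['end']:7.2f}s] {seg['text'].strip()}")
```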
Table 2. Comparison of the STREAM framework to existing adaptive learning systems.

Adaptation (intended for real-time, tested offline):
- STREAM: Yes. Decomposes and adapts content during live or streamed lessons with <1 s latency on standard hardware.
- Khan Academy: Partial. AI feedback via Khanmigo; adapts between exercises from post-performance data; limited in-lesson processing.
- Coursera: Partial. Recommends modules post-quiz; no live decomposition or regeneration.
- Smart Sparrow: Partial. Adaptive simulations, but rule-based and not designed for all content types.
- Duolingo: Partial. Adjusts difficulty in-session, but limited to gamified drills without full content transformation.
- ALEKS: No. Adapts paths based on assessments; offline processing dominates.

Multimodal Content Delivery:
- STREAM: Yes. Dynamically generates and regenerates content across text, audio, video, and diagrams; fuses ASR, NLP, and CV for seamless integration.
- Khan Academy: Partial. Text, video, and exercises with some AI narration; no modality switching or generation.
- Coursera: Partial. Video lectures, quizzes, and text; limited to pre-made formats without fusion.
- Smart Sparrow: Yes. Interactive simulations with text and video; no AI-driven regeneration for live contexts.
- Duolingo: Partial. Audio/text drills and images; app-based, with no video decomposition or custom generation.
- ALEKS: No. Primarily text-based math problems; minimal multimodal support.

Personalization Depth:
- STREAM: High. Dynamic learner profiles (cognitive and affective states via eye-tracking and emotion APIs); adapts to preferences, disabilities, and multilingual needs with UDL compatibility.
- Khan Academy: Medium. Performance-based paths with AI tutoring; basic mastery tracking with limited affective or behavioral modeling.
- Coursera: Medium. Skill-based recommendations; learner profiles limited to progress and history.
- Smart Sparrow: High. Scenario-based adaptation; includes some behavioral cues, but not deeply affective.
- Duolingo: Medium. Skill and decay models; gamified, but no deep affective or disability-focused profiles.
- ALEKS: High. Knowledge-space theory for math; detailed but domain-specific; no multimodal or affective modeling.

Content Decomposition & Tagging:
- STREAM: Yes. AI-driven (BERT/T5 for semantics, Whisper for speech, YOLO/OpenCV for visuals); tags units with metadata for traceability.
- Khan Academy: No. Relies on pre-tagged content; no automated decomposition.
- Coursera: No. Courses are pre-structured; no automated tagging.
- Smart Sparrow: Partial. Tags simulations, but manually and author-driven rather than AI-automated.
- Duolingo: No. Pre-built lessons; algorithmic but not decomposed via multimodal AI.
- ALEKS: No. Pre-defined knowledge points; no multimodal tagging.

Equity & Accessibility Focus:
- STREAM: High. Designed for diverse populations; supports multilingual use and disabilities via regenerated formats and provenance links.
- Khan Academy: Medium. Free access, subtitles, and AI for underserved areas; largely one-size-fits-all.
- Coursera: Medium. Subtitles and mobile access; equity partnerships, but no adaptive regeneration.
- Smart Sparrow: Medium. Customizable for inclusivity; deployment-limited.
- Duolingo: High. Multilingual support and gamification for engagement; app-centric, with less support for disabilities.
- ALEKS: Medium. Adaptive pacing; limited multimodal accessibility.

Scalability & Hardware Needs:
- STREAM: High. Modular pipeline for classroom-grade hardware; pilot-tested under clean conditions, with a roadmap for noisy and bandwidth-constrained extensions.
- Khan Academy: High. Web- and app-based; scales globally.
- Coursera: High. Cloud-based; accessible worldwide.
- Smart Sparrow: Medium. Requires authoring tools; less scalable for non-experts.
- Duolingo: High. Mobile-first; scales via its app ecosystem.
- ALEKS: High. Web-based with LMS integrations; math-focused.

Validation & Evidence:
- STREAM: Pilot-based. Feasibility on a five-minute STEM clip; staged roadmap for broader testing (e.g., multilingual learners, learners with disabilities).
- Khan Academy: Extensive. Data from millions of users; A/B tests on mastery learning and AI efficacy.
- Coursera: Extensive. University partnerships; completion-rate analyses.
- Smart Sparrow: Research-backed. Studies on adaptive simulations.
- Duolingo: Extensive. App metrics; language-retention studies.
- ALEKS: Research-backed. Knowledge-space model validated in education studies.
Table 3. Summary of key components in the adaptive learning framework.

Knowledge Point Extraction:
- Purpose: identifies and isolates core instructional concepts (such as definitions and skills) from multimodal content.
- Technologies used: transformer-based NLP (e.g., BERT), OCR, and semantic parsing.
- Role in framework: converts instructional content into modular, meaningful learning units.
- Conceptual flow location: middle layer (content analysis/decomposition).

Metadata Generation:
- Purpose: adds descriptive and pedagogical tags (e.g., type, difficulty, modality) to enable intelligent retrieval and alignment.
- Technologies used: heuristic tagging, Bloom's taxonomy mapping, and prosodic/emotional analysis.
- Role in framework: provides a metadata layer for content organization and adaptive use.
- Conceptual flow location: middle layer (content analysis/decomposition).

Learner Profiling:
- Purpose: builds dynamic profiles based on learner preferences, behaviors, and emotional states to guide personalization.
- Technologies used: machine learning, affective computing, and behavioral analytics.
- Role in framework: guides decision-making on what and how to present content.
- Conceptual flow location: receiver side (student model).

Adaptive Content Delivery:
- Purpose: delivers content in customized formats and sequences across modalities, adapting to learner responses (designed for real-time, tested offline).
- Technologies used: decision algorithms, multimodal rendering engines, and feedback loops (designed for real-time, tested offline).
- Role in framework: implements learner-facing adaptations to enhance interaction, pending validation.
- Conceptual flow location: receiver side (personalized delivery).
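The "heuristic tagging, Bloom's taxonomy mapping" entry in Table 3 can be illustrated with a toy verb-lookup heuristic such as the one below. The verb lists and the fallback level are simplified assumptions for illustration, not a validated classifier.

```python
# Toy sketch of Bloom's-taxonomy mapping: assign a cognitive-demand level from
# action verbs found in a unit's text. Verb lists are abbreviated illustrations.
BLOOM_VERBS = {
    "remember":   {"define", "list", "name", "recall"},
    "understand": {"explain", "describe", "summarize", "compare"},
    "apply":      {"use", "solve", "demonstrate", "implement"},
    "analyze":    {"break", "differentiate", "organize", "test"},
    "create":     {"design", "build", "compose", "construct"},
}

def estimate_cognitive_demand(text: str) -> str:
    words = {w.strip(".,!?;:").lower() for w in text.split()}
    # Return the highest Bloom level whose verbs appear in the text.
    for level in ["create", "analyze", "apply", "understand", "remember"]:
        if words & BLOOM_VERBS[level]:
            return level
    return "understand"  # neutral default when no action verb is detected

print(estimate_cognitive_demand("Now let's build our own loop to solve the maze."))  # -> create
```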
Table 4. Tagging performance metrics (Precision, Recall, F1) per label category.

Label                  Precision (%)   Recall (%)   F1 (%)
Knowledge Point        75.0            65.0         69.7
Prompt                 90.0            85.0         87.4
Entity                 80.0            70.0         74.7
Example                70.0            60.0         64.7
Overall (macro avg.)   78.8            70.0         74.1
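The macro-averaged row in Table 4 follows directly from the per-label scores, and each F1 value is the harmonic mean of its precision and recall; the short check below reproduces those relationships.

```python
# How the macro-averaged row in Table 4 follows from the per-label scores.
per_label = {  # label: (precision %, recall %, F1 %)
    "Knowledge Point": (75.0, 65.0, 69.7),
    "Prompt":          (90.0, 85.0, 87.4),
    "Entity":          (80.0, 70.0, 74.7),
    "Example":         (70.0, 60.0, 64.7),
}

def macro(idx: int) -> float:
    # Macro average: unweighted mean of the per-label scores.
    return sum(v[idx] for v in per_label.values()) / len(per_label)

# F1 per label is the harmonic mean of precision and recall: 2PR / (P + R).
for p, r, f1 in per_label.values():
    assert abs(2 * p * r / (p + r) - f1) < 0.1

print(f"macro precision = {macro(0):.1f}%")  # 78.8
print(f"macro recall    = {macro(1):.1f}%")  # 70.0
print(f"macro F1        = {macro(2):.1f}%")  # 74.1
```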
Table 5. Median and 90th-percentile latency (minutes) per pipeline stage (n = 3 runs).

Stage                            Median (min)   90th Percentile (min)
ASR                              4.1            4.5
NLP (tagging)                    1.2            1.4
Vision (OCR + arrow detection)   1.4            1.6
Rendering                        0.5            0.6
Total                            7.2            7.8
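The statistics in Table 5 are simple per-stage summaries over the three runs; the sketch below shows how a median and 90th percentile would be computed from run-level timings. The timing values are placeholders chosen to be consistent with the table, not the pilot's actual logs (with only three runs, the 90th percentile sits close to the slowest run).

```python
# How the per-stage statistics in Table 5 are derived from per-run timings.
# The timing values below are placeholders, not the pilot's actual logs.
from statistics import median, quantiles

runs_min = {  # stage: wall-clock minutes for each of n = 3 runs (hypothetical)
    "ASR":       [4.0, 4.1, 4.6],
    "NLP":       [1.1, 1.2, 1.4],
    "Vision":    [1.3, 1.4, 1.7],
    "Rendering": [0.5, 0.5, 0.6],
}

for stage, times in runs_min.items():
    p90 = quantiles(times, n=10, method="inclusive")[-1]  # 90th percentile
    print(f"{stage:10s} median = {median(times):.1f} min, p90 = {p90:.1f} min")
```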
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
