Review

From Prediction to Stewardship: Framing Educational Data Science in the Age of Generative AI

Learning Engineering Institute, Arizona State University, 120 Cady Mall, Tempe, AZ 85281, USA
*
Author to whom correspondence should be addressed.
Information 2026, 17(6), 610; https://doi.org/10.3390/info17060610 (registering DOI)
Submission received: 15 May 2026 / Revised: 12 June 2026 / Accepted: 15 June 2026 / Published: 19 June 2026

Abstract

As generative AI expands the technical frontiers of prediction, measurement, and design, a growing tension has emerged between algorithmic fluency and institutional trust. This conceptual article offers a narrative synthesis of recent work in learning analytics, educational data science, human–AI interaction, and AI governance to propose stewardship as a necessary fourth paradigm of educational data science. Stewardship represents the professional, epistemic, and institutional work of governing judgment in an environment where analytic systems are increasingly generative and persuasive. Rather than treating stewardship as a general ethics checklist, the article positions it as the governance of epistemic and pedagogical authority: who determines what counts as evidence, interpretation, and educational action when AI systems help produce those judgments. The synthesis suggests that while GenAI can support bounded analytic tasks, evidence for systemic educational transformation remains limited and uneven. The field’s primary challenge is therefore not technical performance alone, but the governance of interpretation, validation, delegation, and action. By centering provenance, uncertainty, accountable oversight, learner agency, and institutional learning, stewardship provides an actionable framework for anchoring analytic innovation in responsible educational improvement.

1. Introduction

For more than a century, educational technologies have been animated by the hope that machines might not only scale instruction but also improve judgment about learning itself. This lineage is well established: from Pressey’s early teaching machine to later forms of computer-assisted instruction, each wave of innovation promised more adaptive, responsive, and individualized education [1,2]. Learning analytics (LA) emerged within this broader history of artificial intelligence in education (AIED), educational data mining (EDM), and data-intensive educational research, with a distinctive ambition: not merely to automate educational tasks, but to collect, analyze, and act on learner data in ways that improve teaching and learning. As several recent reviews note, LA matured during a period when digitization, online platforms, and trace data made it possible to detect patterns in learner behavior at scale, while also sharpening questions about validity, actionability, and ethics. In that sense, the arrival of large language models (LLMs) and generative AI (GenAI) did not create the field’s central tensions. It amplified them [3,4].
Much of the current discussion in the literature and media treats GenAI as a sudden rupture. Since the release of ChatGPT in late 2022, educational discourse has been saturated with claims about transformation: personalized tutoring, automated feedback, conversational analytics, synthetic learner data, multimodal dashboards, and AI-enabled intervention systems. Recent conceptual work in LA has argued that GenAI may affect every phase of the learning analytics cycle, from the identification of learners to the processing of unstructured data, to explanatory analytics, personalization, and adaptive intervention. Yan, Martinez-Maldonado, and Gašević [5], for example, position GenAI as a potential catalyst for analyzing discourse, generating synthetic data, enriching multimodal interaction data, and making analytics more interactive and interpretable. Misiejuk, López-Pernas, Kaliisa, and Saqr [4] similarly describe GenAI as opening new possibilities for the design of LA tools and for supporting teachers’ assessment and monitoring practices. Yet these same authors also caution that the evidence base remains uneven and that practical implications for real interventions are still underdeveloped.
Accordingly, this article pursues three objectives. First, it clarifies how GenAI intensifies long-standing tensions in prediction, measurement, and design. Second, it differentiates stewardship from adjacent frameworks such as responsible learning analytics, trustworthy AI, human-in-the-loop design, and Learning Engineering. Third, it translates stewardship from a broad normative stance into an operational architecture that can guide educational data science research, system design, and institutional implementation.
The present article should be read as a conceptual and narrative synthesis rather than a systematic review. Consistent with review typologies that distinguish narrative syntheses from exhaustive protocol-driven reviews [6], the argument integrates representative work from learning analytics, generative AI in education, human–AI interaction, and AI governance to identify tensions that are theoretically important for educational data science. The overarching goal is to develop a disciplined conceptual account of why generative systems require a stewardship paradigm.
This tension between expanding technical capability and constrained empirical validation reflects a deeper pattern in the development of educational data science. The field can be understood as evolving through three overlapping paradigms: prediction, measurement, and design. These paradigms do not represent discrete stages, but rather dominant orientations that continue to coexist and shape one another. Prediction focuses on identifying patterns in data to forecast outcomes such as student dropout, disengagement, or performance [7,8]. This work translates complex trace data into actionable signals, aligning with descriptive, diagnostic, predictive, and prescriptive forms of analytics [5]. While prediction enables large-scale insight into learning processes, it does not by itself determine how educational outcomes should be improved.
Measurement focuses on the validity of inferences drawn from learner data. It examines whether digital traces can serve as credible indicators of constructs such as reasoning, collaboration, self-regulation, and affect [9]. This perspective emphasizes that analytic outputs depend on theoretically grounded interpretations, and that translating learning into data necessarily involves assumptions that must be examined and validated. This concern remains central in current GenAI research, where much of the work focuses on coding, scoring, and classifying unstructured data, often with uneven validation practices [4].
Design focuses on how analytic insight is embedded within systems that shape teaching and learning. This includes dashboards, feedback systems, and intervention designs that translate data into action [10,11]. Within this paradigm, the field engages the challenge of “closing the loop”—ensuring that analytics inform practice in ways that lead to meaningful improvement.
Learning Engineering has emerged as a central development within this design paradigm. It provides an iterative, evidence-based approach that integrates learning science, human-centered design, and data-informed decision making to support continuous improvement [12,13,14]. Learning Engineering is not synonymous with analytics; rather, it is the process through which analytic insight is translated into intervention, evaluated in context, and refined over time. In this sense, if prediction helps the field anticipate and measurement helps it interpret, Learning Engineering enables it to act.
Taken together, these paradigms remain essential but insufficient in the age of LLMs. GenAI introduces new pressures across all three: prediction becomes more fluent and authoritative, measurement becomes more scalable but more vulnerable to construct slippage, and design becomes capable of producing feedback, explanations, and adaptive supports before institutions have clear criteria for judging whether those interventions are pedagogically appropriate, equitable, or safe. The resulting risk is not only error, but uncalibrated certainty: outputs that appear meaningful before their epistemic status has been established.
This shift alters the architecture of judgment in educational data science. Historically, analytics involved a layered process in which data were transformed into indicators, interpreted through theoretical frameworks, and translated into action through human decision making [10,11]. Generative systems compress this chain by producing fluent interpretations directly from data, often without making intermediate reasoning steps visible. As a result, outputs increasingly function as judgments rather than analytic inputs.
This development raises a question that prediction, measurement, and design do not fully address: how should increasingly generative, fluent, and consequential systems be governed once they begin to shape educational interpretation and action?
This article argues that the answer lies in a fourth paradigm: stewardship.
Stewardship refers to the disciplined governance of judgment in educational data science. It encompasses the commitments, practices, and institutional arrangements through which analytic outputs become educationally legitimate, uncertainty is represented, decisions remain accountable, and systems are revised when their consequences diverge from their intentions. It is not an external ethical overlay, but an organizing principle for how prediction, measurement, design, and Learning Engineering operate under conditions of generative analytics.
Recent work already points toward this need. Khosravi et al. (2023) [3] call for “GenAI analytics” that capture prompts, interaction context, and model parameters. Yan et al. (2024) [5] highlight the need to reconsider the learner in contexts where human and AI contributions are intertwined. Misiejuk et al. (2025) [4] show that while GenAI is expanding across the learning analytics cycle, validated instructional uses and evaluation standards remain underdeveloped. Together, these studies suggest that the central scarcity in the field is no longer computational capability, but disciplined judgment.
The argument advanced here is therefore not that GenAI should be resisted, but that it requires educational data science to mature. As AI expands the range of analytic outputs, the field’s contribution can no longer be defined primarily by generating those outputs, but by governing their interpretation, validation, and use. Learning Engineering remains essential as the process through which analytic insight is translated into iterative educational improvement [13]. Stewardship becomes essential as the framework that ensures such improvement remains epistemically grounded, institutionally accountable, and aligned with the purposes of education.
Stewardship does not begin from a blank slate. It extends prior work in responsible learning analytics, which established that analytics are not merely technical systems, but sociotechnical practices shaped by accountability, reasonable care, informed consent, power, transparency, and the obligation to act [15,16,17]. Many of the concerns raised here are therefore not new. Educational technology research has also long cautioned against overreliance on automated systems, the displacement of professional judgment, and the risks of treating model outputs as authoritative [18,19].
What is different in the context of GenAI is not the existence of these risks, but their amplification and transformation. LLMs produce outputs that are not only predictive or descriptive, but fluent, contextually responsive, and rhetorically persuasive. They collapse analytic pipelines into conversational interfaces and reduce the visibility of uncertainty and intermediate reasoning. Stewardship therefore does not relabel responsible AI. It asks who retains authority over educational meaning, interpretive confidence, and institutional action when AI systems increasingly participate in generating all three.

2. Why LLMs Change the Problem: Fluency, Delegation, and the Governance of Judgment

Large language models do not simply improve existing learning analytics workflows. They alter the conditions under which educational judgments are produced, interpreted, and acted upon. They do so by extending generative capabilities across multiple phases of the learning analytics cycle [4,5]. Earlier analytics systems typically produced bounded outputs: risk scores, classifications, visualizations, alerts, or dashboard indicators. Those outputs could still be misleading, reductive, or harmful, but they were usually constrained by clearer interfaces and narrower forms of interpretation [10,20]. LLMs change this by producing language itself as the analytic medium. They not only calculate but also explain, summarize, recommend, and generate justification [21,22]. In doing so, they make analytic outputs more accessible and more persuasive, even when explanation faithfulness and appropriate reliance remain unsettled [3,23,24].
This shift matters because educational judgment has always been a layered activity. Learner activity is transformed into indicators, interpreted through theories of learning, and then translated into action by teachers, designers, advisors, or institutions [10,11]. Learning analytics has never merely found insights in data; it has always constructed usable interpretations through a chain of decisions about what to capture, how to model it, what counts as meaningful, and when intervention is warranted [10,11]. Learning Engineering makes this layered structure visible by framing educational improvement as an iterative process in which theory, data, design, implementation, and revision are tightly linked rather than separated into isolated stages [13]. What LLMs do is compress and partially obscure that chain. They can move directly from traces or prompts to polished explanations and recommendations without exposing the intermediate reasoning steps needed for inspection [21,25,26]. The result is not just efficiency. It is a reconfiguration of where judgment appears to reside.
Yan, Martinez-Maldonado, and Gašević [5] provide perhaps the clearest conceptual basis for understanding this shift. They argue that GenAI may shape every phase of the learning analytics cycle, including analysis of unstructured data, synthetic data generation, multimodal enrichment, interactive and explanatory analytics, and personalization or adaptive intervention. Embedded in that argument is a profound change in what analytics can look like. Analytics are no longer limited to static metrics or visual representations; they can become conversational, responsive, context-sensitive, and seemingly interpretive. This shift helps explain why LLM-based analytics feel transformative in educational settings: they promise to close the distance between raw data and pedagogically usable recommendations [21,26,27]. From a Learning Engineering perspective, that promise matters because it suggests that analytics may enter design and improvement cycles in more immediate and generative ways, shaping not only what is known about learning but how interventions are proposed, adapted, and refined [13].
But this apparent closing of distance creates a new problem. When explanation is generated fluently, it becomes easier to mistake plausibility for validity [23,28,29,30]. In other words, the first major way LLMs change educational data science is by making analytics rhetorically stronger before they are epistemically stronger.

2.1. Fluency as Epistemic Risk

The appeal of LLMs lies partly in fluency: they can present analytic outputs as coherent and contextually responsive explanations and recommendations rather than as charts or metrics alone. In learning analytics, this capability makes them especially attractive for dashboard narration, feedback generation, descriptive analytics, and stakeholder-facing explanation. Ochoa, Huang, and Shao [27] explicitly frame this promise in terms of making learning analytics more accessible to non-experts performing LA tasks with GenAI support. Recent research points in the same direction: LLM-powered chatbots can augment LA dashboards with contextualized and conversational explanation, improving non-experts’ comprehension of outputs without relying on deep technical expertise [22,31].
However, that promise should not be romanticized because ease of interaction is not the same as calibrated reliance. Ochoa et al. [27] emphasize that these systems are not yet sufficiently reliable for independent real-world use and that domain knowledge remains essential for interpreting and checking outputs. This caveat is not incidental. It reveals a central paradox of LLM-enabled analytics: the very features that make them usable also make them easy to overtrust [19,32]. A system that explains clearly, answers instantly, and produces seemingly thoughtful interpretations can persuade users that the underlying inference is stronger than it actually is [33,34,35]. In this sense, fluency is not merely a user-experience benefit. It is an epistemic risk.
This risk becomes even clearer within qualitative coding and text analysis. Liu et al. [36] found that GPT-4 can code a broad range of educational constructs, but its performance varies by construct, prompt strategy, and context. No single prompting method consistently performs best, and the constructs that challenge human coders also tend to challenge the model. That finding undercuts any simplistic claim that LLMs solve interpretation at scale. They can accelerate coding, but they do not remove ambiguity from the phenomena being coded. They reproduce and sometimes disguise the uncertainty already inherent in the construct itself. When those codes are later summarized in fluent prose, the uncertainty may disappear from view even though it remains in the analysis. This concern is consistent with recent work showing that natural-language explanations can be plausible or self-consistent without being faithful to the processes that generated them [24].
Misiejuk et al. [4] reinforce this point at the field level. Their synthesis indicates that discourse coding, scoring, and classification dominate current empirical work, but it also notes that some studies feed GenAI outputs into LA pipelines without sufficient validation. Here again, the problem is not merely technical error. Generative systems can produce usable-looking analytic artifacts before the field has established whether those artifacts are sufficiently valid to support downstream decisions [23]. Within the field of Learning Engineering, this is especially consequential, because iterative improvement depends on the quality of the evidence entering the cycle. If fluent but weakly validated outputs are treated as sound evidence, the improvement process itself can become distorted [13,37].

2.2. Delegation Without Visibility

The second major change introduced by LLMs is delegation. Educational data science has always delegated certain analytic tasks to algorithms, but LLMs expand both the range and the subtlety of what can be delegated. Systems can now summarize forum activity, classify discourse, generate personalized narratives from dashboards, answer stakeholder questions in plain language, explain visualizations, draft intervention suggestions, and synthesize multimodal observations [5,21,38]. Some of these delegations are desirable because they reduce labor, broaden access, and allow researchers or practitioners to work with data that would otherwise be too unstructured or voluminous to analyze effectively.
Yet delegation becomes more problematic when the analytic work being delegated is not merely procedural but interpretive [39]. Once a model is asked to explain why a learner appears disengaged, summarize a group’s collaborative dynamic, or suggest what kind of support a teacher should provide, it is no longer just processing data. It is participating in pedagogical judgment. That participation may still be partial and constrained, but it is substantive. The governance question is therefore not whether delegation occurs but whether the field has adequate ways to decide which judgments may be delegated, how uncertainty should be represented, and when human oversight must remain primary [39,40].
Yan et al. [5] argue that as the lines blur between learners and GenAI tools, the LA community must better understand human–AI collaboration and trace both human and AI contributions. That claim is often read as a matter of data capture or methodological innovation. But it is also a governance claim. If AI contributes to learning, sensemaking, or interaction, then educational data science must distinguish between human performance, AI mediation, and co-produced activity. Otherwise, institutions risk building analytics on increasingly unstable assumptions about authorship, effort, and learning itself.
The issue becomes even sharper in multimodal analytics. Whitehead, Nguyen, and Järvelä [41] demonstrate how Multimodal Large Language Models (MLLMs) can make complex non-verbal data more tractable through video analysis of posture in collaborative learning. But this is precisely the kind of domain where delegation can outrun interpretability. The model may annotate multimodal signals efficiently, but deciding what those annotations mean educationally still requires human and theoretical judgment [42,43]. As feature extraction becomes more powerful, the need for stewardship at the interpretation layer increases because these interpretations feed directly into downstream decisions.
In an iterative design and improvement framework, delegation is never solely about analytic efficiency; it also determines what counts as evidence for intervention. If models are delegated interpretive authority too early, then learning environments may be redesigned on the basis of outputs whose educational meaning has not been adequately established [13]. Delegation without visibility therefore threatens not only interpretation, but the integrity of improvement itself.

2.3. From Outputs to Consequences

A third way LLMs change the problem is by shifting attention from outputs to consequences. Traditional LA debates often focused on the quality of models or the interpretability of dashboards. With LLMs, the more pressing issue is increasingly what happens when model outputs circulate in educational settings as advice, explanation, or action. A generative summary is not merely information. It can shape how a teacher interprets a student, how a student understands their own progress, how an advisor prioritizes outreach, or how an institution allocates attention. LLM outputs are therefore consequential not only because they may be correct or incorrect, but because they can reorganize human judgment around them [30,44,45].
This shift toward consequence is also why Khosravi, Viberg, Kovanović, and Ferguson [3] call for robust GenAI analytics. Their argument is not only to analyze learners using GenAI, but also to analyze interactions with GenAI systems themselves: prompts, responses, model parameters, and the emerging forms of human–AI collaboration that these systems create. The field, then, is not simply incorporating a new tool, but encountering a new mediating layer in educational action. This change requires much richer attention to provenance, context, traceability, and outcome monitoring than conventional AI-in-education approaches typically assume.
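To make the idea of capturing prompts, responses, and model parameters more concrete, the following is a minimal sketch of what a single logged interaction might look like. It is an illustration under stated assumptions rather than a schema proposed by Khosravi et al. [3]: the class name `GenAIInteractionRecord`, its field names, and the `fingerprint` helper are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


# Hypothetical record type: every field name here is an illustrative assumption,
# not a schema taken from the cited literature.
@dataclass(frozen=True)
class GenAIInteractionRecord:
    prompt: str                # what the human asked the system
    response: str              # what the system returned
    model_name: str            # which model produced the response
    model_parameters: dict     # e.g. temperature, maximum output length
    actor_role: str            # e.g. "student", "teacher", "advisor"
    context: str               # e.g. a course or task identifier
    human_reviewed: bool = False  # whether a person checked the output before use
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        """Return a stable SHA-256 hash of the content fields (timestamp excluded)."""
        payload = asdict(self)
        payload.pop("timestamp")
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()


# Example with made-up values.
record = GenAIInteractionRecord(
    prompt="Summarize this week's forum activity for group 3.",
    response="Group 3 posted less often than in earlier weeks.",
    model_name="example-model",
    model_parameters={"temperature": 0.2},
    actor_role="teacher",
    context="course-101/week-4",
)
print(record.fingerprint())
```

Excluding the timestamp from the fingerprint is a design choice made for this sketch: it lets identical prompt, response, and parameter combinations be recognized as the same content across sessions, while the timestamp still records when each interaction occurred.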

2.4. Why Stewardship Becomes Unavoidable

Fluency, delegation, and consequence collectively render stewardship unavoidable. Prediction, measurement, and design remain indispensable. But none of these, on their own, is sufficient for governing LLM-based educational systems. Prediction can estimate likelihoods, but it cannot decide which forms of uncertainty must remain visible to users. Measurement can refine constructs, but it cannot determine when a generative interpretation is too weak to enter a feedback or intervention pipeline. Design can create usable interfaces, but it cannot establish what institutional safeguards are needed when those interfaces begin generating recommendations in real time.
Stewardship becomes necessary because LLMs increase not only what analytics can do, but also how quickly weakly supported inference can become institutionalized and embedded in practice [46,47]. A model-generated code can become a dashboard category. A dashboard narrative can become a teacher’s impression. A recommendation can become an intervention norm. A conversational explanation can become a student’s understanding of their own ability. At each step, the issue is not simply whether the model worked, but whether sufficient epistemic and institutional discipline governs how those outputs are used [20,45,48].
The future of educational data science cannot be defined merely by better model performance. Under conditions of generative fluency, the field must address how uncertainty is communicated, how delegation is bounded, how provenance is documented, how iterative improvement remains evidence-based, and how institutions recognize when apparently helpful outputs begin to distort decision-making. The technical question is capability. The disciplinary question is stewardship.

3. Generative AI and Learning Analytics

Large language models have shifted the central empirical task for learning analytics from demonstrating technical possibility to examining how the field is evolving in practice. Because this article offers a conceptual synthesis rather than a systematic review, the discussion below does not claim exhaustive coverage of the GenAI learning analytics literature. Instead, it identifies three recurrent tendencies visible across representative recent work: areas of robust technical performance, limited evidence of broader pedagogical or institutional impact, and a tendency for interpretive claims to exceed the strength of available evidence [4,6].
The risk introduced by generative systems is not simply additive but multiplicative. Earlier systems required interpretation: dashboards had to be read, models had to be understood, and outputs were often partial or fragmented. LLMs reduce this friction. They generate coherent explanations, recommendations, and narratives that appear complete and authoritative. This fluency increases the likelihood that users will accept outputs without interrogation, particularly in contexts where time, expertise, or institutional support for evaluation is limited [19,30]. In this sense, generative fluency does not merely support decision-making; it may reshape the threshold at which decisions are made.
The first tendency concerns domains in which empirical work suggests consistent technical value. The second concerns the persistent gap between analytic or interface improvements and meaningful educational outcomes. The third concerns the tendency within the literature to translate promising technical results into stronger claims about pedagogical validity, objectivity, or transformation than the evidence can presently sustain. Taken together, these tendencies reinforce the central argument of this article: the key challenge is not only whether LLMs can produce useful outputs for learning analytics, but whether the field can govern how those outputs are interpreted, validated, and institutionalized [4].

3.1. Areas of Robust Technical Performance

The strongest evidence in the current literature concerns the use of GenAI to process unstructured educational data. This is not a trivial development. Much of the most educationally meaningful information in digital learning environments appears in text-rich forms that have historically been difficult to analyze at scale: discussion posts, peer feedback, reflective writing, collaborative discourse, tutoring dialogue, and other open-ended language. Misiejuk et al. [4] indicate that the dominant empirical uses of GenAI in LA are in discourse coding, scoring, and classification. In other words, the strongest current contribution of GenAI is not that it has already transformed intervention or redesign, but that it is making the measurement layer of analytics more tractable by converting text-rich data into usable analytic representations.
As discussed earlier, the study by Liu et al. [36] of GPT-4 and qualitative coding illustrates both the promise and the limitations of using GenAI to render text-rich educational data analytically tractable. Across three educational datasets, they found that GPT-4 could code a broad range of constructs with meaningful agreement with human coders, and that embeddings or carefully designed examples could improve performance for more difficult constructs. Importantly, however, their findings resist broad generalization: no single prompting or modeling strategy consistently performed best across tasks, and the constructs that proved most difficult for human coders were also the ones that most challenged the model.
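One way to see why agreement should be reported per construct rather than as a single pooled figure is to compute a chance-corrected statistic separately for each construct. The sketch below uses Cohen's kappa, a common choice for two coders. The text above does not state which agreement metric Liu et al. [36] used, so the metric, the function names, the construct labels, and the data are all assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(human_codes, model_codes):
    """Chance-corrected agreement between two coders labeling the same items."""
    if len(human_codes) != len(model_codes) or not human_codes:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(human_codes)
    # Proportion of items on which the two coders assigned the same label.
    observed = sum(h == m for h, m in zip(human_codes, model_codes)) / n
    # Agreement expected by chance, from each coder's marginal label frequencies.
    human_freq = Counter(human_codes)
    model_freq = Counter(model_codes)
    expected = sum(human_freq[c] * model_freq[c] for c in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def kappa_by_construct(records):
    """Group (construct, human_code, model_code) tuples and score each construct."""
    grouped = {}
    for construct, human, model in records:
        humans, models = grouped.setdefault(construct, ([], []))
        humans.append(human)
        models.append(model)
    return {c: cohens_kappa(h, m) for c, (h, m) in grouped.items()}


# Hypothetical codes: 1 = construct present, 0 = absent.
records = [
    ("help_seeking", 1, 1), ("help_seeking", 1, 1), ("help_seeking", 0, 0),
    ("help_seeking", 0, 1), ("help_seeking", 1, 1), ("help_seeking", 0, 0),
    ("off_task", 1, 1), ("off_task", 0, 0), ("off_task", 1, 1), ("off_task", 0, 0),
]
for construct, kappa in kappa_by_construct(records).items():
    print(f"{construct}: kappa = {kappa:.2f}")
```

In this made-up data the model agrees perfectly with the human coder on one construct and only partially on the other, a difference that a pooled score would blur.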
Similarly, Long, Luo, and Zhang [26] show that GPT-4 can assist in classroom dialogue analysis with substantial time savings and high consistency relative to expert coding. In a game-based learning context, Acosta et al. [49] apply LLMs to analyze multi-party epistemic dialogue acts in collaborative game-based learning, providing teachers with actionable insights about group dynamics and student learning. The evidence here is strong, but its strength lies in bounded augmentation, not replacement of human methodological judgment. Related methodological work outside the LA Special Issue literature points in the same direction: generative models can assist qualitative analysis at scale, but their effectiveness remains highly sensitive to prompt structure, interpretive framing, and researcher oversight, reinforcing that their value lies in augmentation rather than autonomous judgment [36,50,51].
A second area of robust technical performance appears in the use of GenAI to support descriptive and explanatory analytics for non-experts. For example, Yan et al. [22] demonstrate how VizChat, a multimodal generative chatbot, can provide contextualized and personalized explanations for LA dashboards, offering comprehensive insights from multiple sources. While this suggests that GenAI can broaden access to aspects of analytic practice, successful use still depends on domain knowledge and on disciplined interpretive habits such as checking and evaluating outputs [27].
A third area concerns multimodal learning analytics, particularly feature extraction from non-verbal data. The case study by Whitehead et al. [41] suggests that MLLMs may be leveraged to extract postural behavior from video of collaborative learning. Zhou, Suraworachet, and Cukurova [43] also demonstrate how gaze and other non-verbal behaviors in group interaction can be automatically detected and linked to differences in collaborative learning outcomes. Together, these works show meaningful development because multimodal learning analytics has long required specialized pipelines, substantial technical expertise, and considerable manual effort to process non-verbal data streams. However, system capability does not resolve the interpretive challenge. Non-verbal signals become educationally meaningful only when they are mapped to constructs through theory-informed interpretation, rather than treated as self-explanatory features [52,53]. Reliability, data quality, prompt construction, and contextual sensitivity therefore remain central concerns. Here too, the evidence points to a meaningful methodological advance, not a solved interpretive problem [41,43].

3.2. Limited Pedagogical and Institutional Effects

GenAI is rapidly expanding the field’s analytic reach, especially where researchers and practitioners need to work with language, interaction, and multimodal data that are otherwise costly or difficult to process [4]. The evidence becomes substantially weaker, however, when it moves from analytic capability to pedagogical consequence. Technical success in coding, summarization, or conversational data analysis does not automatically translate into meaningful improvements in learning, teaching, or institutional decision making.
Learning analytics has encountered this problem before. For more than a decade, dashboards have served as a dominant interface to close the loop between data and action. Yet dashboards have yielded only limited gains: they have increased awareness and access to information more reliably than they have improved academic achievement, motivation, or deep learning behaviors. For example, a review of 38 empirical studies concluded that there is no evidence that dashboards have lived up to the promise of improving academic achievement, and that most reported effects were negligible or small, with limited evidence from well-powered controlled experiments [20].
GenAI may make such interfaces more conversational, personalized, eloquent, and explanatory, as recent work on GenAI-augmented dashboards demonstrates [22]. However, this advancement does not remove the underlying problem. Unless these systems are grounded in stronger theory and evaluated for actual learning impact, the field risks repeating the same pattern with more sophisticated tools [4,20,54]. More broadly, evidence from both experimental studies and systematic reviews of GenAI in education suggests that apparent utility does not consistently translate into improved learning outcomes. The impact of generative AI is highly variable, depending heavily on instructional and task design, scaffolding, and how AI support is structured and integrated into the task [55,56].
Likewise, Misiejuk et al. [4] conclude that while students’ perceptions of GenAI are often positive and some studies report improvements in participation or task performance, evidence for actual learning outcomes remains limited. In short, the field has documented a meaningful expansion of analytic and interface capability, but not yet strong evidence of widespread pedagogical transformation. This distinction between analytic value and educational value is foundational for the stewardship argument. A model may classify discourse more efficiently, help an instructor inspect patterns more quickly, or give a user a more natural way to query data. Those are real advances. But they are not the same as demonstrating improved learning, effective self-regulation, or more defensible institutional action.

3.3. Inflation of Interpretive and Pedagogical Claims

A recurring pattern in the literature is the inflation of claims based on technically promising results. Overstatement in the field is not merely occasional; it follows a recognizable structure in which technical outputs are translated into stronger claims than the evidence supports. Coding performance is taken as evidence of educational understanding, conversational explanation as trustworthy pedagogy, automation as objectivity, and positive user response as proof of learning improvement.
One dimension of this pattern is the veneer of objectivity. LLMs generate fluent, confident, and apparently neutral language that can make weak inferences appear settled. This risk is evident in adjacent domains such as automated assessment, where scoring systems require explicit evaluation for measurement and algorithmic bias before their outputs are treated as objective [57]. More broadly, research on human–AI interaction shows that users often rely on AI outputs without sufficient interrogation, particularly when those outputs are presented clearly and confidently [33,35].
A second element is the production of artificial authority. Research on LLM-as-a-judge shows that reliability, consistency, and bias remain unresolved, indicating that fluent evaluative output should not be treated as equivalent to robust judgment [58,59,60,61]. In practice, GenAI can support analysis effectively only when human oversight, checking, and interpretive discipline remain central [27,36]. The risks heighten when these human responsibilities are minimized and system outputs acquire unwarranted authority.
A third element is the inflation of pedagogical consequence. The field often moves too quickly from a methodological result (such as improved coding, easier querying, or a more accessible interface) to claims about personalization, adaptive learning, or transformation of practice. Misiejuk et al. [4] explicitly caution that validated classroom integrations and impacts on learning outcomes remain limited, while the dashboard literature shows that a previous wave of analytics research frequently celebrated increased awareness or access without corresponding evidence of deep educational improvement [20]. Seen in this light, inflation of claims is not only a GenAI problem, but a persistent tendency within the field that GenAI risks intensifying.
A fourth element concerns dependency and cognitive outsourcing. This area should still be handled cautiously, but it has more direct empirical grounding than a purely speculative concern. Recent higher-education evidence indicates that stronger GenAI dependency may be associated with lower academic achievement through mechanisms involving false self-efficacy, while perceived teacher caring moderates part of that relationship [62]. This finding does not resolve the question of long-term cognitive outsourcing, but it does provide preliminary evidence that systems that appear highly supportive may also shift learners toward forms of dependence that weaken metacognitive monitoring or distort self-assessment. At minimum, this is an area where the field should proceed more cautiously than many current claims suggest. Most importantly, it points to the need to embed GenAI within educational designs that strengthen learner agency and active processing rather than displace them.

3.4. Implications for the Present Argument

GenAI in learning analytics supports bounded augmentation more strongly than autonomous educational judgment. It has demonstrated clear value, particularly in processing unstructured text, extending descriptive analytics to non-experts, and enabling multimodal analysis. These are meaningful advances. However, pedagogical gains remain narrower than technical progress and claims often exceed the available evidence. The work is most convincing when GenAI is positioned as augmenting analytic workflows under validated conditions, and least convincing when it is treated as a reliable, autonomous source of educational judgment.
These conditions necessitate a stewardship framework. The available evidence does not yet justify delegating educational decision making to generative systems. Rather, it reflects a stage in which models are increasingly capable, useful, and rhetorically persuasive, while norms for validation, interpretability, provenance, and accountable use remain underdeveloped [32,33,35]. The central challenge is therefore not only to advance technical capability, but to establish the evaluative, professional, and institutional disciplines required to govern how such systems are trusted and used.
These patterns also explain why GenAI changes the work of educational data science itself. If generative systems increasingly perform parts of coding, summarization, explanation, and recommendation, then disciplinary expertise shifts toward validating, contextualizing, and governing outputs whose surface form may exceed their evidentiary strength. The movement toward stewardship therefore follows directly from the empirical synthesis: the field’s next task is not only to build more capable systems, but to preserve the conditions under which analytic outputs can be treated as warranted educational knowledge.

3.5. Generative AI and the Transformation of Data Science Work

The implications of GenAI extend beyond educational judgment to the work of educational data science itself. Many of the tasks that have historically defined the field—coding qualitative data, constructing features, generating summaries, interpreting patterns, and communicating results—are increasingly automated or augmented by LLMs [21,26,27]. The compression of analytic workflows by generative systems does not eliminate the need for data science, but it changes where and how its expertise is exercised.
Earlier forms of learning analytics required visible stages of analytic work. Data had to be processed, models specified, outputs interpreted, and findings translated into actionable insight. These stages made the epistemic labor of the field legible: assumptions could be examined, uncertainty could be debated, and interpretations could be contested. Generative systems compress this pipeline. They can move from raw or semi-structured data to fluent explanations, recommendations, or narratives with minimal visibility into intermediate reasoning. As a result, parts of the analytic process that were previously sites of professional judgment risk becoming opaque or implicitly delegated.
This shift creates a tension for the field. On one hand, generative systems expand access to analytic capabilities, enabling non-experts to engage in forms of data interpretation that were previously restricted by technical expertise [27]. On the other hand, this same accessibility can obscure the distinction between generating an output and justifying an inference. When explanation becomes automated, the role of the data scientist may shift from analyst and interpreter to validator of system outputs.
This transformation has been noted more broadly in discussions of AI and data science. As generative models increasingly handle coding, feature extraction, and even aspects of analysis, the core contribution of data science shifts from producing outputs to governing their interpretation, validation, and use. In this sense, GenAI does not eliminate the need for data science, but it relocates it. The field’s value becomes less about performing analytic tasks and more about ensuring that those tasks remain epistemically sound.
Within educational data science, this shift is particularly consequential. If analytic outputs increasingly take the form of fluent explanations, recommendations, or feedback, then the risk is not only that systems may be wrong, but that their outputs may be accepted without sufficient scrutiny. The problem is therefore not only technical automation, but the displacement of interpretive responsibility.
Stewardship emerges in response to this shift. It defines the work of the field not in terms of generating analytic outputs, but in governing the conditions under which those outputs are treated as knowledge and used in practice. In the generative era, the central question for educational data science is no longer only how to produce insight, but how to maintain the integrity of insight when its production is increasingly automated.

4. Stewardship as a Paradigm for Educational Data Science

GenAI alters the conditions under which educational judgment is produced and acted upon, raising a problem that prediction, measurement, and design do not fully resolve. While these paradigms remain essential—prediction for estimating likelihoods, measurement for strengthening construct validity, and design for translation to practice—they do not address how increasingly generative, fluent, and consequential analytic systems should be governed in practice.
Stewardship is proposed as the paradigm required to address this gap. It refers to the disciplined governance of judgment in educational data science: the commitments, practices, and institutional arrangements through which analytic outputs become educationally legitimate, uncertainty is made visible, decisions remain accountable, and systems are revised when their consequences diverge from their intentions. In this sense, stewardship governs the movement from analytic possibility to educational consequence. Although related to broader work on AI governance in education, the argument here is disciplinary: stewardship is not an external checklist, but an organizing paradigm for educational data science itself [3,25,63].
This need is underscored by the current empirical pattern. Technical advances in coding, classification, descriptive analytics, and multimodal feature extraction are clear, while evidence of sustained pedagogical transformation or learning outcomes remains limited [4,27,36,41]. At the same time, claims about learning and pedagogical impact often exceed what the evidence supports. Stewardship addresses this imbalance by focusing on how analytic systems are interpreted, validated, and governed as their outputs become more fluent and persuasive.

4.1. Positioning Stewardship Against Adjacent Frameworks

The concept of stewardship is useful only if it does more than rename established governance concerns. Its distinctive contribution is to shift the unit of analysis from the ethical behavior of a model or the usability of an interface to the governance of epistemic and pedagogical authority across a sociotechnical educational system. Responsible AI asks whether a system is fair, safe, transparent, or accountable. Explainable AI asks whether users can understand or inspect system outputs [64,65]. Human-in-the-loop design asks whether human oversight is present. Learning Engineering asks how evidence-informed interventions can be designed and improved. Stewardship incorporates these concerns, but organizes them around a different question: who determines what counts as valid educational interpretation and justified educational action when generative systems increasingly participate in producing both? To make the distinction explicit, Table 1 compares prediction, measurement, design/Learning Engineering, and stewardship across their goals, assumptions, risks, and educational implications. The comparison shows that stewardship does not replace earlier paradigms; it governs the conditions under which their outputs can justifiably inform educational action. We also situate stewardship against adjacent governance-oriented frameworks to further clarify that stewardship foregrounds authority over educational meaning, warranted interpretation, and permissible action (see Table 2 for details).

4.2. Stewardship as the Governance of Judgment

The need to govern which outputs may legitimately guide educational action becomes especially urgent in the context of LLMs. At its core, stewardship begins from a simple premise: educational data science is not valuable because it produces outputs, but because it helps determine which outputs may legitimately guide educational action. Ochoa et al. [27] show that LLMs may lower the expertise barrier for users to engage in learning analytics, but successful use still depends on accountable human oversight: checking, evaluation, and domain knowledge. This finding highlights a central tension in the field: broadening access to producing and using analytic outputs also broadens the need for norms that distinguish usable assistance from unwarranted authority.
GenAI can operate across the learning analytics cycle, from unstructured data analysis and synthetic data generation to explanatory analytics and personalized intervention [5]. As these systems become more capable of explaining, summarizing, and recommending, their outputs may be more readily accepted as authoritative rather than critically evaluated. Research on human–AI interaction supports this concern. Buçinca, Malaya, and Gajos [33] show that users frequently over-rely on AI suggestions, even when they are wrong, and that explanations do not reliably reduce that overreliance. Similarly, Salvi, Ribeiro, Gallotti, and West [35] demonstrate that GPT-4 can be more persuasive than human opponents in controlled settings, particularly when responses are personalized.
The field must therefore govern not only model performance, but also the inferential pathways through which model outputs are translated into feedback, institutional decisions, student self-understanding, and teacher judgment. Stewardship thus reframes the value proposition of educational data science. In a pre-generative environment, the field could often define its contribution in terms of better prediction, better measurement, or better interfaces. In a generative environment, those are no longer sufficient markers of maturity. The central contribution must also include the capacity to calibrate uncertainty, document provenance, preserve human accountability, and ensure that educational action is not driven by outputs whose validity is weaker than their fluency suggests.

4.3. Core Commitments of a Stewardship Paradigm

A stewardship paradigm requires more than a general appeal to caution. It requires substantive commitments that orient research, design, Learning Engineering (LE) practice, and institutional governance. The five commitments described below define not only how systems should be built, but how their outputs should be interpreted, evaluated, and used in educational contexts.
Epistemic discipline. The first commitment is epistemic discipline: the insistence that fluent or useful output must not be confused with warranted inference. This commitment is foundational because much of the risk introduced by LLMs lies in their ability to make uncertain interpretations appear settled. For example, Liu et al. [36] show that GPT-4 can assist with coding a range of educational constructs, but performance depends on the construct, prompt strategy, and context; the hardest constructs for human coders also remain difficult for the model. This limitation is a reminder that models do not resolve ambiguity in the underlying phenomenon. Stewardship requires that such ambiguity remain visible rather than being rhetorically smoothed away in dashboards, summaries, or interventions.
Epistemic discipline also implies a shift in research standards. Studies should no longer move directly from technical feasibility to claims about learning, pedagogy, or educational improvement. A model that classifies discourse more efficiently or produces a more natural-language explanation has achieved something meaningful, but that achievement does not automatically justify claims about deeper understanding or improved learning. Research must therefore distinguish more carefully between technical performance, interpretive validity, and pedagogical consequence. In Learning Engineering contexts, this distinction is especially critical, because iterative improvement cycles depend on the quality of the evidence they incorporate. If weak inference is treated as established fact, those cycles risk scaling error rather than reducing it [13].
Provenance and traceability. The second commitment is provenance and traceability. As GenAI becomes embedded in analytics workflows, stakeholders must be able to understand not only what a system produces, but how that output was generated and how it can be audited—what data it draws on, what transformations were applied, what prompts or contextual inputs shaped the response, and where uncertainty enters the process. Khosravi et al. [3] emphasize the importance of capturing prompts, interaction context, and model parameters in “GenAI analytics”. The need to document such provenance information is both a methodological and governance concern. Without provenance, outputs risk functioning as persuasive but opaque artifacts that cannot be meaningfully audited, reconstructed, or contested.
This commitment has direct implications for infrastructure and research practice. Educational data science must move beyond reporting performance metrics toward documenting analytic pipelines, decision pathways, and model conditions. As systems become more composite—combining prompts, prior interactions, interface logic, and model updates—trust can no longer rest on output quality alone [66]. It must also depend on the ability to trace how an output came to matter. This aligns with broader work on AI governance in education, which increasingly foregrounds transparency, explainability, and auditability as conditions for responsible deployment [3,25,67].
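To make the provenance commitment concrete, the following is a minimal sketch of what one logged record might look like. It is illustrative only: the class name `ProvenanceRecord`, its fields, and the fingerprinting scheme are assumptions introduced here, not a schema proposed by Khosravi et al. [3] or by any cited work.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    """Hypothetical record of how one GenAI output was produced."""

    model_name: str      # model identifier as reported by the provider
    model_params: dict   # generation settings, e.g. {"temperature": 0.0}
    prompt: str          # exact prompt text sent to the model
    context_ids: tuple   # identifiers of the data or prior turns supplied
    output: str          # text returned by the model
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def fingerprint(self) -> str:
        """SHA-256 over everything except the timestamp.

        Two records with the same model, parameters, prompt, context,
        and output share a fingerprint, so an auditor can detect whether
        a stored output still matches the conditions recorded for it.
        """
        payload = asdict(self)
        payload.pop("created_at")
        canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A record of this kind could be written alongside every model call, so that an output shown on a dashboard can later be reconstructed or contested. Which fields are sufficient for a given institution is a governance decision that this sketch does not settle.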
Accountable human oversight. The third commitment is accountable human oversight. The literature consistently supports human–AI collaboration more strongly than autonomous AI judgment. Misiejuk et al. [4] establish that current classroom implementations emphasize human–AI collaboration rather than fully automated systems, and Ochoa et al. [27] clarify why: even when non-experts perform well with GenAI, successful use still requires checking outputs, evaluating results, and applying domain knowledge. These findings suggest that stewardship should not be framed as a temporary precaution before full automation becomes possible. Rather, human oversight should be treated as a constitutive feature of educational judgment [33,38].
This has implications for both design and institutional practice. Oversight must be meaningful, not symbolic. It requires clarity about what humans are responsible for interpreting, what they are expected to question, and what decisions must remain reviewable or contestable. Systems should be designed to support this interpretive role by making assumptions, uncertainty, and evidence visible. In Learning Engineering contexts, this means that iterative design cycles must preserve points at which human judgment remains nondelegable, particularly when outputs influence assessment, feedback, or high-stakes decisions.
Institutional learning. The fourth commitment is institutional learning. Stewardship cannot end at deployment. Educational systems must be designed so that institutions can continuously monitor how systems function in practice and refine them over time: identifying where outputs are effective, where performance begins to drift, where users misunderstand or misuse responses, where inequities emerge, and where unintended consequences develop. The closed-loop ambition of learning analytics has traditionally focused on feeding data back into teaching and learning [10]. A stewardship paradigm extends this loop to the institution itself. Institutions must become capable of learning from the consequences of the systems they adopt.
This shifts the focus of research and evaluation. The field must study not only model performance, but also how outputs are interpreted, how they shape practice, and how they evolve over time [67]. Many risks associated with generative systems—such as overreliance, narrowing of attention, or normalization of weak evidence—are systemic rather than technical [32,33,47]. Addressing them requires monitoring interpretive use, organizational incentives, and downstream effects on educational practice. Learning Engineering plays a critical role here by structuring iterative cycles of design and evaluation, but stewardship determines what must be monitored, when revision is required, and how institutional learning is achieved.
Protection of learner agency. The fifth commitment is the protection of learner agency. As generative systems become more adaptive, personalized, and conversational, there is a risk that learners are positioned less as active participants in knowledge construction and more as recipients of optimized support. Yan et al. [5] argue that the learning analytics community must rethink the learner in contexts where human and AI contributions increasingly blur, which is not only a methodological issue but also a normative one.
Stewardship requires that systems be evaluated not only in terms of efficiency or task completion, but in terms of their effects on self-regulation, critical reflection, and durable understanding. Emerging evidence suggests that stronger reliance on GenAI may be associated with lower academic achievement through mechanisms such as false self-efficacy [62], highlighting the need to design systems that support rather than displace learner cognition.

4.4. Operationalizing Stewardship as an Evaluation Architecture

A central challenge for stewardship is that its core principles—uncertainty, oversight, accountability, and learner agency—are already widely acknowledged, yet inconsistently enacted. Operationalizing stewardship therefore requires an evaluation architecture that makes the movement from data to judgment inspectable. At minimum, such an architecture should specify what evidence must be logged, what checks must be completed before outputs guide action, who is responsible for review, and when systems must be revised or withdrawn. In Figure 1, we provide an overview of the operational workflow showing how data, interpretation, oversight, intervention, monitoring, and institutional revision interact within a stewardship architecture.
An example of stewardship in action comes from NLP-based validation of GenAI-powered text personalization [68,69]. In this work, LLMs were prompted to adapt educational texts for readers with different assets (i.e., levels of prior knowledge, reading skill, and learning goals), and NLP-based analyses were leveraged to evaluate whether the outputs aligned with linguistic features associated with reading comprehension theory [68]. This approach illustrates stewardship at the construct-integrity layer: AI-generated personalization was not treated as valid simply because it appeared coherent or learner-specific, but was evaluated against theoretically grounded indicators such as cohesion, lexical sophistication, syntactic complexity, and readability. The findings from this line of research demonstrated how automated evaluation can support iterative refinement of GenAI systems before their outputs are treated as pedagogically appropriate.
Consider an LLM-assisted discourse-coding workflow for collaborative learning. A stewardship-oriented implementation would begin by specifying the educational construct to be coded, fixing the prompt and exclusion criteria before analysis, logging the model version and context, and preserving rejected outputs rather than treating them as invisible failures. Model-generated codes would then be compared with human-coded examples, reported with uncertainty or disagreement indicators, audited for subgroup differences, and restricted from high-stakes decisions unless reviewed by a responsible educator or researcher. If the coded outputs were later used to shape feedback or group intervention, the institution would monitor whether those actions improved learning, narrowed attention in harmful ways, or produced inequitable effects. In this example, stewardship is not a separate ethics statement. It is built into the analytic pipeline as provenance, validation, escalation, and revision.
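The validation and escalation steps in this example can be sketched in a few lines of code. The sketch below is a hedged illustration, not a procedure taken from the cited studies: it uses Cohen's kappa as one possible agreement statistic, and the function names and the 0.6 review threshold are assumptions chosen for demonstration.

```python
from collections import Counter


def cohen_kappa(human, model):
    """Cohen's kappa for two equal-length sequences of categorical codes."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(human)
    # Proportion of items on which the two coders agree.
    observed = sum(h == m for h, m in zip(human, model)) / n
    # Agreement expected by chance, from each coder's marginal frequencies.
    h_counts, m_counts = Counter(human), Counter(model)
    expected = sum(h_counts[c] * m_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both coders use one identical code
        # throughout; returning 1.0 here is a convention, not a standard.
        return 1.0
    return (observed - expected) / (1 - expected)


def audit_by_subgroup(human, model, groups):
    """Compute kappa separately for each subgroup label."""
    by_group = {}
    for h, m, g in zip(human, model, groups):
        hs, ms = by_group.setdefault(g, ([], []))
        hs.append(h)
        ms.append(m)
    return {g: cohen_kappa(hs, ms) for g, (hs, ms) in by_group.items()}


def requires_human_review(kappas, threshold=0.6):
    """Escalate when any subgroup falls below the assumed threshold."""
    return any(k < threshold for k in kappas.values())
```

In use, model-generated codes would be compared with a human-coded sample, agreement would be reported per subgroup rather than only in aggregate, and any subgroup below the threshold would route the outputs to a responsible educator or researcher before they inform action. The threshold itself should be set by the institution according to the stakes of the decision.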
These mechanisms can also be expressed as a maturity model. At an initial level, GenAI may be used only for informal analytic augmentation. At a documented level, systems maintain prompt, model, and data provenance. At a validated level, outputs are calibrated against human judgment, construct definitions, and subgroup audits. At a governed level, institutional policies define permissible uses, escalation thresholds, appeal mechanisms, and revision cycles. The point is not to make every educational use of GenAI administratively heavy. It is to match the strength of governance to the educational consequence of the judgment being delegated.
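One way to read this proportionality principle is as a simple gate that compares a system's maturity level with the level required for a given use. The sketch below encodes the four levels named above; the consequence tiers (`low`, `moderate`, `high`) and the mapping from tier to required level are hypothetical assumptions added for illustration and would need to be defined by each institution.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """The four maturity levels described in the text, in ascending order."""

    INITIAL = 1     # informal analytic augmentation only
    DOCUMENTED = 2  # prompt, model, and data provenance maintained
    VALIDATED = 3   # calibrated against human judgment, constructs, audits
    GOVERNED = 4    # policies for use, escalation, appeal, and revision


# Hypothetical mapping from the consequence of a judgment to the minimum
# maturity a system must reach before it may inform that judgment.
REQUIRED_LEVEL = {
    "low": Maturity.INITIAL,
    "moderate": Maturity.VALIDATED,
    "high": Maturity.GOVERNED,
}


def use_permitted(system_level: Maturity, consequence: str) -> bool:
    """Return True when the system meets the level required for this use."""
    return system_level >= REQUIRED_LEVEL[consequence]
```

Under these assumed tiers, a system with documented provenance could support informal exploration (`use_permitted(Maturity.DOCUMENTED, "low")` is true) but not a high-stakes decision, which keeps low-consequence uses lightweight while reserving heavier governance for consequential ones.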

4.5. Implications for Design, Practice and the Field

Taken together, these commitments redefine what counts as rigor and contribution in educational data science by shifting the field’s focus from producing analytic outputs to governing their use. Research must extend beyond demonstrating model performance to examine how outputs function within educational systems—how they are interpreted, where they are overly trusted, and whether they support meaningful learning outcomes. Design must prioritize interpretability, traceability, and contestability, ensuring that systems support human judgment rather than replace it. Institutions must develop governance structures that specify where generative systems are appropriate, how outputs are reviewed, and how consequences are monitored over time. Educational data science must engage not only with models and methods, but with the epistemic, professional, and institutional conditions under which those models become educationally consequential.
At a broader level, this shift is not a resistance to innovation, but a marker of what maturity now requires. In a field where generative models can already classify discourse, support descriptive analysis, extract multimodal features, and generate persuasive explanations, the scarcity is no longer computational capability but disciplined judgment. Educational data science must now define itself not only by what it can build, but by what it can justify, govern, revise, and, when necessary, refuse.
Stewardship therefore becomes the paradigm through which prediction, measurement, and design remain educationally credible in the age of generative analytics. The field’s next advance will not come from treating AI outputs as self-authenticating evidence of progress. It will come from building the epistemic, professional, and institutional conditions under which those outputs can serve learning without displacing the human purposes that make education worth improving in the first place [3,13].

4.6. Limitations, Implementation Barriers, and Future Directions

Several limits define the scope of this argument. First, the synthesis is conceptual and narrative rather than systematic; it is designed to clarify a theoretical problem and propose a framework, not to estimate the prevalence of findings across the full GenAI learning analytics literature. Future work should test the stewardship framework through scoping reviews, comparative case studies, and design-based research in educational institutions.
Second, the present discussion draws heavily on research from higher education and from Western or Global North contexts, which limits how directly the argument can be generalized. Learning analytics scholarship itself remains unevenly distributed across regions, languages, and institutional conditions, and scholars have warned that analytics models developed in one context may not transfer straightforwardly to another [70]. Stewardship should therefore be adapted to local policies, data infrastructures, cultural expectations, teacher roles, and learner rights rather than treated as a universal checklist.
Third, adoption is likely to be constrained by institutional capacity. Prior research on learning analytics policy and implementation shows that leadership, analytic capability, infrastructure, stakeholder trust, privacy, consent, and evaluation practices shape whether institutions can move from pilot use to sustainable adoption [71,72]. These barriers are not peripheral to stewardship. They are part of the framework itself, because a stewardship paradigm requires institutions that can document systems, train users, monitor consequences, and revise practices over time. Future research should therefore examine which governance mechanisms are feasible in different educational settings, which forms of oversight reduce overreliance, and how stewardship can support innovation without creating burdens that prevent responsible local experimentation.
Finally, the need for evidence anchoring is not unique to education. Recent work on LLM-based phishing detection shows that semantically plausible language can become an adversarial liability when systems overweight fluent surface meaning and underweight harder-to-manipulate infrastructural evidence [73,74]. Although this example comes from cybersecurity, the underlying lesson is relevant for educational data science: stewardship should require generative interpretations to be checked against independent evidence channels, construct definitions, and institutional context before those interpretations guide action.

5. Conclusions

Generative AI marks an important development in learning analytics, but its significance lies not only in new technical capabilities. It reconfigures the conditions under which educational inferences are generated, communicated, and acted upon. LLMs and related systems expand the range of data that can be processed, classified, narrated, and adapted within educational settings. At the same time, technical progress has outpaced evidentiary, institutional, and professional adaptation. This imbalance is the central reason stewardship is needed.
The central claim of this article is that educational data science should no longer define maturity only by improved prediction, more refined measurement, or more powerful design. Those paradigms remain necessary, but they are insufficient when analytic systems can generate fluent interpretations, recommendations, and interventions. Stewardship names the disciplined governance of judgment: the work of making uncertainty visible, documenting provenance, preserving human accountability, protecting learner agency, and ensuring that institutions learn from the consequences of the systems they deploy.
The practical implication is that researchers, designers, and institutions must build governance into analytic pipelines rather than add it after deployment. Future work should evaluate stewardship architectures in real educational settings, compare maturity models across institutional contexts, and examine how provenance, uncertainty reporting, construct validation, equity audits, and escalation protocols affect trust and learning. The future of educational data science will depend not only on what generative systems can produce, but on whether the field can decide, justify, and refine what those systems should be allowed to mean and do. That work will require experts who are trained to make those judgments and decisions.

Author Contributions

Conceptualization, D.S.M.; investigation, D.S.M.; resources, D.S.M.; writing—original draft preparation, D.S.M.; writing—review and editing, D.S.M. and L.H.; supervision, D.S.M.; project administration, D.S.M.; funding acquisition, D.S.M. All authors have read and agreed to the published version of the manuscript.

Funding

The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grants R305N210041 and R305T240035 to Arizona State University and Grant NSF IIS 2153481 to Rice University and Arizona State University. The opinions expressed are those of the authors and do not represent views of the Institute of Education Sciences, the U.S. Department of Education, or the National Science Foundation.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| EDM | Educational Data Mining |
| AIED | Artificial Intelligence in Education |
| GenAI | Generative AI |
| LLM | Large Language Model |
| LA | Learning Analytics |
| LE | Learning Engineering |

References

  1. Sinatra, A.M.; Rus, V.; Lawton, P.; Graesser, A.C. (Eds.) Design Recommendations for Intelligent Tutoring Systems: Volume 12—Intelligent Tutoring Systems with Generative AI; US Army Combat Capabilities Development Command-Soldier Center: Orlando, FL, USA, 2025.
  2. Skinner, B.F. Teaching machines. Science 1958, 128, 969–977.
  3. Khosravi, H.; Viberg, O.; Kovanović, V.; Ferguson, R. Generative AI and learning analytics. J. Learn. Anal. 2023, 10, 1–6.
  4. Misiejuk, K.; López-Pernas, S.; Kaliisa, R.; Saqr, M. Mapping the landscape of generative artificial intelligence in learning analytics: A systematic literature review. J. Learn. Anal. 2025, 12, 12–31.
  5. Yan, L.; Martinez-Maldonado, R.; Gašević, D. Generative Artificial Intelligence in Learning Analytics: Contextualising Opportunities and Challenges through the Learning Analytics Cycle. In Proceedings of the 14th Learning Analytics and Knowledge Conference, Kyoto, Japan, 18–22 March 2024; pp. 101–111.
  6. Grant, M.J.; Booth, A. A typology of reviews: An analysis of 14 review types and associated methodologies. Health Inf. Libr. J. 2009, 26, 91–108.
  7. Baker, R.S.; Inventado, P.S. Educational data mining and learning analytics. In Learning Analytics: From Research to Practice; Larusson, J.A., White, B., Eds.; Springer: New York, NY, USA, 2014; pp. 61–75.
  8. Baker, R.S.; Siemens, G. Learning analytics and educational data mining. In The Cambridge Handbook of the Learning Sciences, 3rd ed.; Sawyer, R.K., Ed.; Cambridge University Press: Cambridge, UK, 2022; pp. 259–278.
  9. Greller, W.; Drachsler, H. Translating learning into numbers: A generic framework for learning analytics. Educ. Technol. Soc. 2012, 15, 42–57.
  10. Clow, D. The learning analytics cycle: Closing the loop effectively. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge, Vancouver, BC, Canada, 29 April–2 May 2012; ACM: New York, NY, USA, 2012; pp. 134–138.
  11. Wise, A.F. Designing pedagogical interventions to support student use of learning analytics. In Proceedings of the 4th International Conference on Learning Analytics and Knowledge, Indianapolis, IN, USA, 24–28 March 2014; ACM: New York, NY, USA, 2014; pp. 203–211.
  12. Azad, A.K.M.; Goodell, J.; Kessler, A.; Craig, S.D.; Saliah-Hassane, H. Learning Engineering—A System Design Approach for Engineering Education. In Proceedings of the 2025 ASEE Annual Conference & Exposition, Montreal, QC, Canada, 22–25 June 2025.
  13. Baker, R.S.; Boser, U.; Snow, E. Learning engineering: A view on where the field is at, where it is going, and the research needed. Technol. Mind Behav. 2022, 3, 1–23.
  14. Goodell, J.; Kolodner, J.L. (Eds.) Learning Engineering Toolkit: Evidence-Based Practices from the Learning Sciences, Instructional Design, and Beyond; Routledge: London, UK, 2023.
  15. Pargman, T.C.; McGrath, C.; Viberg, O.; Knight, S. New vistas on responsible learning analytics: A data feminist perspective. J. Learn. Anal. 2023, 10, 133–148.
  16. Prinsloo, P.; Slade, S. An elephant in the learning analytics room: The obligation to act. In Proceedings of the Seventh International Learning Analytics & Knowledge Conference, New York, NY, USA, 13–17 March 2017; pp. 46–55.
  17. Slade, S.; Prinsloo, P. Learning analytics: Ethical issues and dilemmas. Am. Behav. Sci. 2013, 57, 1510–1529.
  18. Amershi, S.; Weld, D.; Vorvoreanu, M.; Fourney, A.; Nushi, B.; Collisson, P.; Suh, J.; Iqbal, S.; Bennett, P.N.; Inkpen, K.; et al. Guidelines for human-AI interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019.
  19. Vasconcelos, H.; Jörke, M.; Grunde-McLaughlin, M.; Gerstenberg, T.; Bernstein, M.S.; Krishna, R. Explanations can reduce overreliance on AI systems during decision-making. Proc. ACM Hum.-Comput. Interact. 2023, 7, 1–38.
  20. Kaliisa, R.; Misiejuk, K.; López-Pernas, S.; Khalil, M.; Saqr, M. Have learning analytics dashboards lived up to the hype? A systematic review of 38 empirical studies. In Proceedings of the 14th Learning Analytics and Knowledge Conference, Kyoto, Japan, 18–22 March 2024; pp. 716–726.
  21. Lekan, K.; Pardos, Z.A. AI-augmented advising: A comparative study of GPT-4 and advisor-based major recommendations. J. Learn. Anal. 2025, 12, 110–128.
  22. Yan, L.; Zhao, L.; Echeverria, V.; Jin, Y.; Alfredo, R.; Li, X.; Gašević, D.; Martinez-Maldonado, R. VizChat: Enhancing learning analytics dashboards with contextualised explanations using multimodal generative AI chatbots. In Artificial Intelligence in Education; Springer Nature: Cham, Switzerland, 2024; pp. 180–193.
  23. Madsen, A.; Chandar, S.; Reddy, S. Are Self-Explanations from Large Language Models Faithful? In Findings of the Association for Computational Linguistics: ACL 2024; Association for Computational Linguistics: Kerrville, TX, USA, 2024; pp. 295–337. Available online: https://aclanthology.org/2024.findings-acl.19/ (accessed on 17 April 2026).
  24. Parcalabescu, L.; Frank, A. On measuring faithfulness or self-consistency of natural language explanations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 6048–6089.
  25. Khosravi, H.; Shibani, A.; Jovanovic, J.; Pardos, Z.A.; Yan, L. Generative AI and learning analytics: Pushing boundaries, preserving principles. J. Learn. Anal. 2025, 12, 1–11.
  26. Long, Y.; Luo, H.; Zhang, Y. Evaluating large language models in analysing classroom dialogue. npj Sci. Learn. 2024, 9, 60.
  27. Ochoa, X.; Huang, X.; Shao, Y. Exploring the potential of generative AI to support non-experts in learning analytics practice. J. Learn. Anal. 2025, 12, 65–90.
  28. Bender, E.M.; Koller, A. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, online, 5–10 July 2020; pp. 5185–5198. Available online: https://aclanthology.org/2020.acl-main.463/ (accessed on 17 April 2026).
  29. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual, 3–10 March 2021; pp. 610–623.
  30. Si, C.; Goyal, N.; Wu, T.; Zhao, C.; Feng, S.; Daumé, H., III; Boyd-Graber, J. Large Language Models Help Humans Verify Truthfulness-Except When They Are Convincingly Wrong. In Proceedings of the NAACL 2024, Mexico City, Mexico, 16–21 June 2024; pp. 1459–1474.
  31. Zhang, T.; Zhang, M.; Low, W.Y.; Yang, X.J.; Li, B.A. Conversational explanations: Discussing explainable AI with non-AI experts. In Proceedings of the 30th International Conference on Intelligent User Interfaces, Cagliari, Italy, 24–27 March 2025; pp. 409–424.
  32. Schoeffer, J.; De-Arteaga, M.; Kuehl, N. Explanations, fairness, and appropriate reliance in human-AI decision-making. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024.
  33. Buçinca, Z.; Malaya, M.B.; Gajos, K.Z. To trust or to think: Cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proc. ACM Hum.-Comput. Interact. 2021, 5, 1–21.
  34. Klingbeil, A.; Grützner, C.; Schreck, P. Trust and reliance on AI—An experimental study on the extent and costs of overreliance on AI. Comput. Hum. Behav. 2024, 160, 108352.
  35. Salvi, F.; Horta Ribeiro, M.; Gallotti, R.; West, R. On the conversational persuasiveness of GPT-4. Nat. Hum. Behav. 2025, 9, 1645–1653.
  36. Liu, X.; Zambrano, A.F.; Baker, R.S.; Barany, A.; Ocumpaugh, J.; Zhang, J.; Pankiewicz, M.; Nasiar, N.; Wei, Z. Qualitative coding with GPT-4: Where it works better. J. Learn. Anal. 2025, 12, 169–185.
  37. Chaleshtori, F.H.; Ghosal, A.; Gill, A.; Bambroo, P.; Marasović, A. On evaluating explanation utility for human-AI decision making in NLP. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 7456–7504.
  38. Kasepalu, R.; Prieto, L.P.; Ley, T.; Chejara, P. Teacher artificial intelligence-supported pedagogical actions in collaborative learning coregulation: A wizard-of-oz study. Front. Educ. 2022, 7, 736194.
  39. Holstein, K.; McLaren, B.M.; Aleven, V. Co-designing a real-time classroom orchestration tool to support teacher-AI complementarity. J. Learn. Anal. 2019, 6, 27–52.
  40. Olsen, J.K.; Rummel, N.; Aleven, V. Designing for the co-orchestration of social transitions between individual, small-group and whole-class learning in the classroom. Int. J. Artif. Intell. Educ. 2021, 31, 24–56.
  41. Whitehead, R.; Nguyen, A.; Järvelä, S. Utilizing multimodal large language models for video analysis of posture in studying collaborative learning: A case study. J. Learn. Anal. 2025, 12, 186–200.
  42. Sellberg, C.; Sharma, A. Toward multimodal learning analytics in simulation-based collaborative learning: A design ethnography of maritime training. Int. J. Comput.-Support. Collab. Learn. 2025, 20, 201–221.
  43. Zhou, Q.; Suraworachet, W.; Cukurova, M. Detecting non-verbal speech and gaze behaviours with multimodal data and computer vision to interpret effective collaborative learning interactions. Educ. Inf. Technol. 2024, 29, 1071–1098.
  44. Kizilcec, R.F. To advance AI use in education, focus on understanding educators. Int. J. Artif. Intell. Educ. 2024, 34, 12–19.
  45. Lai, V.; Zhang, Y.; Chen, C.; Liao, Q.V.; Tan, C. Selective explanations: Leveraging human input to align explainable AI. Proc. ACM Hum.-Comput. Interact. 2023, 7, 1–35.
  46. Baker, R.S.; Hawn, A. Algorithmic bias in education. Int. J. Artif. Intell. Educ. 2022, 32, 1052–1092.
  47. Green, B.; Chen, Y. The principles and limits of algorithm-in-the-loop decision making. Proc. ACM Hum.-Comput. Interact. 2019, 3, 1–24.
  48. Selbst, A.D.; Boyd, D.; Friedler, S.A.; Venkatasubramanian, S.; Vertesi, J. Fairness and abstraction in sociotechnical systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, GA, USA, 29–31 January 2019.
  49. Acosta, H.; Lee, S.; Bae, H.; Feng, C.; Rowe, J.; Glazewski, K.; Hmelo-Silver, C.; Mott, B.; Lester, J.C. Recognizing multi-party epistemic dialogue acts during collaborative game-based learning using large language models. Int. J. Artif. Intell. Educ. 2025, 35, 677–701.
  50. de la Iglesia, D.H.; Thomas, M.B.; Fuentes, C. AI-assisted qualitative analysis at scale: Opportunities and constraints for text-rich research. Qual. Quant. 2025, 59, 2511–2534.
  51. Morris, W.; Holmes, L.; Choi, J.S.; Crossley, S. Automated scoring of constructed response items in math assessment using large language models. Int. J. Artif. Intell. Educ. 2025, 35, 559–586.
  52. Guerrero-Sosa, J.D.; Romero, F.P.; Menéndez-Domínguez, V.H.; Serrano-Guerrero, J.; Montoro-Montarroso, A.; Olivas, J.A. A comprehensive review of multimodal analysis in education. Appl. Sci. 2025, 15, 5896.
  53. Schneider, B.; Worsley, M.; Martinez-Maldonado, R. Gesture and gaze: Multimodal data in dyadic interactions. In International Handbook of Computer-Supported Collaborative Learning; Springer International Publishing: Cham, Switzerland, 2021; pp. 625–641.
  54. Alfredo, R.; Echeverria, V.; Jin, Y.; Yan, L.; Swiecki, Z.; Gašević, D.; Martinez-Maldonado, R. Human-centred learning analytics and AI in education: A systematic literature review. Comput. Educ. Artif. Intell. 2024, 6, 100215.
  55. Lee, H.Y.; Chen, P.H.; Wang, W.S.; Huang, Y.M.; Wu, T.T. Empowering ChatGPT with guidance mechanism in blended learning: Effect of self-regulated learning, higher-order thinking skills, and knowledge construction. Int. J. Educ. Technol. High. Educ. 2024, 21, 16.
  56. Deng, R.; Jiang, M.; Yu, X.; Lu, Y.; Liu, S. Does ChatGPT enhance student learning? A systematic review and meta-analysis of experimental studies. Comput. Educ. 2025, 227, 105224.
  57. Johnson, M.S.; Liu, X.; McCaffrey, D.F. Psychometric Methods to Evaluate Measurement and Algorithmic Bias in Automated Scoring. J. Educ. Meas. 2022, 59, 338–361.
  58. Chen, G.H.; Chen, S.; Liu, Z.; Jiang, F.; Wang, B. Humans or LLMs as the judge? A study on judgement bias. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 8301–8327. Available online: https://aclanthology.org/2024.emnlp-main.474/ (accessed on 17 April 2026).
  59. Gu, J.; Chen, H.; Feng, Y.; Chen, J.; Li, M.; Wu, Y. A survey on LLM-as-a-judge. arXiv 2024, arXiv:2411.15594.
  60. Tan, S.; Zhuang, S.; Montgomery, K.; Tang, W.Y.; Cuadron, A.; Wang, C.; Popa, R.A.; Stoica, I. Judgebench: A benchmark for evaluating LLM-based judges. arXiv 2024, arXiv:2410.12784.
  61. Wataoka, K.; Takahashi, T.; Ri, R. Self-preference bias in LLM-as-a-judge. arXiv 2024, arXiv:2410.21819.
  62. Sheng, Y.; Wang, C.; Chen, X. Effect of GenAI dependency on university students’ academic achievement: False self-efficacy and the moderating role of perceived teacher caring. Behav. Sci. 2025, 15, 1348.
  63. Fitsilis, P.; Damasiotis, V.; Dervenis, C.; Kyriatzis, V.; Tsoutsa, P. Effective data stewardship in higher education: Skills, competences, and the emerging role of open data stewards. arXiv 2024, arXiv:2410.20361.
  64. Angelov, P.P.; Soares, E.A.; Jiang, R.; Arnold, N.I.; Atkinson, P.M. Explainable artificial intelligence: An analytical review. WIREs Data Min. Knowl. Discov. 2021, 11, e1424.
  65. Minh, D.; Wang, H.X.; Li, Y.F.; Nguyen, T.N. Explainable artificial intelligence: A comprehensive review. Artif. Intell. Rev. 2022, 55, 3503–3568.
  66. Arnold, M.; Bellamy, R.K.E.; Hind, M.; Houde, S.; Mehta, S.; Mojsilović, A.; Nair, R.; Ramamurthy, K.N.; Olteanu, A.; Piorkowski, D.; et al. FactSheets: Increasing trust in AI services through supplier’s declarations of conformity. IBM J. Res. Dev. 2019, 63, 6:1–6:13.
  67. Singh, A.; Szajnfarber, Z. Architecting human-AI systems for effective collaboration and oversight: Making sense of human/AI-in/on/over/under/along-the-loop. Syst. Eng. 2026, 29, 337–353.
  68. Huynh, L.; McNamara, D.S. GenAI-powered text personalization: Natural language processing validation of adaptation capabilities. Appl. Sci. 2025, 15, 6791.
  69. Huynh, L.; McNamara, D.S. Evaluation of linguistic consistency of LLM-generated text personalization using natural language processing. Electronics 2026, 15, 1262.
  70. Guzmán-Valenzuela, C.; Gómez-González, C.; Rojas-Murphy Tagle, A.; Lorca-Vyhmeister, A. Learning analytics in higher education: A preponderance of analytics but very little learning? Int. J. Educ. Technol. High. Educ. 2021, 18, 23.
  71. Tsai, Y.-S.; Moreno-Marcos, P.M.; Jivet, I.; Scheffel, M.; Tammets, K.; Kollom, K.; Gašević, D. The SHEILA framework: Informing institutional strategies and policy processes of learning analytics. J. Learn. Anal. 2018, 5, 5–20.
  72. Alzahrani, A.S.; Tsai, Y.-S.; Iqbal, S.; Moreno Marcos, P.M.; Scheffel, M.; Drachsler, H.; Delgado Kloos, C.; Aljohani, N.; Gašević, D. Untangling connections between challenges in the adoption of learning analytics in higher education. Educ. Inf. Technol. 2023, 28, 4563–4595.
  73. Kompa, B.; Snoek, J.; Beam, A.L. Second opinion needed: Communicating uncertainty in medical machine learning. npj Digit. Med. 2021, 4, 4.
  74. Li, Y.; Wang, Z.; Ren, Y.; Yang, X.; Liu, Y.; Tian, Z. When semantic plausibility becomes a liability: LLM-based phishing detection from an adversarial asymmetry perspective. Cyber Secur. Appl. 2026, 4, 100126.
Figure 1. Operational workflow showing how data, interpretation, oversight, intervention, monitoring, and institutional revision interact within a stewardship architecture. Source: authors’ synthesis.
Table 1. Distinguishing prediction, measurement, design, and stewardship in educational data science. Source: authors’ contribution.
| Paradigm | Primary Goal | Dominant Assumption | Principal Risk | Educational Implication |
|---|---|---|---|---|
| Prediction | Forecast learner states, outcomes, or risks | Patterns in learner data can support timely action | Signals may be mistaken for causes or needs | Improves anticipation, but does not determine what should be done |
| Measurement | Validate inferences from learner traces | Data can represent educational constructs if theoretically grounded | Constructs may be simplified, unstable, or poorly validated | Improves interpretive credibility, but does not govern downstream action |
| Design/Learning Engineering | Translate evidence into interventions and iterative improvement | Analytics can improve learning when embedded in designed systems | Weak evidence can be scaled through well-designed tools | Improves actionability, but requires governance of what enters the cycle |
| Stewardship | Govern judgment, authority, and consequence | Generative systems require institutional rules for meaning, validation, delegation, and revision | Fluent outputs may institutionalize weak inference | Defines when AI-supported interpretation is legitimate enough to guide educational action |
Table 2. Stewardship in relation to adjacent governance-oriented frameworks. Source: authors’ contribution.
| Framework | Primary Concern | Typical Unit of Analysis | Core Question | How Stewardship Differs |
|---|---|---|---|---|
| Responsible AI/responsible learning analytics | Ethical and accountable system use | Models, data practices, and institutional responsibilities | Is the system fair, transparent, privacy-preserving, and accountable? | Makes epistemic authority and educational meaning the central governance problem |
| Trustworthy/explainable AI | Reliability, transparency, interpretability, and user trust | Technical systems and explanations | Can users understand and appropriately rely on the system? | Treats explanation as necessary but insufficient without construct validity and institutional accountability |
| Human-in-the-loop design | Human supervision and review | Decision pipelines and interfaces | Is a human positioned to supervise or override output? | Requires human review to be role-defined, consequential, and tied to escalation rules |
| Learning Engineering | Iterative improvement of learning systems | Instructional designs, interventions, and improvement cycles | Does the intervention improve learning in context? | Governs what evidence is strong enough to enter the improvement cycle |
| Stewardship | Governance of judgment under AI-mediated interpretation | Sociotechnical educational ecosystems | Who determines what AI outputs are allowed to mean and do? | Provides the organizing paradigm that links evidence, authority, oversight, and institutional revision |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

McNamara, D.S.; Huynh, L. From Prediction to Stewardship: Framing Educational Data Science in the Age of Generative AI. Information 2026, 17, 610. https://doi.org/10.3390/info17060610
