1. Introduction
Artificial Intelligence (AI) has been a recurrent catalyst for pedagogical change in universities. Across successive technological waves, it has also reconfigured core instructional paradigms. Early deployments relied mainly on mechanistic drill-and-practice routines that emulated rote learning. Subsequent generations of intelligent tutoring systems (ITSs) incorporated explicit cognitive architectures and adaptive algorithms. These systems modeled students’ knowledge states and tailored feedback accordingly. The most recent phase centers on transformer-based large language models (LLMs) that support natural language interaction, enabling dialogic learning experiences at unprecedented scale.
This article is a critical review. Its purpose is twofold. The first aim is to provide an integrated historical and technical synthesis that situates contemporary educational technologies within a research continuum spanning more than six decades. Without that context, it is easy to overlook how earlier advances, such as knowledge-space theory and cognitive modeling, inform current design practices. The second aim is to examine the wave of AI-driven products launched by major U.S. educational publishers during the 2024–2025 period. These products raise pressing questions about strategic positioning, academic integrity, and effects on learning outcomes across diverse student populations. Addressing them requires a structured analysis that goes beyond marketing narratives.
Guiding perspective: The review adopts an explicitly socio-technical perspective that integrates two lenses. The technological lens traces the algorithms and architectures that make each generation of tools possible. The educational lens evaluates how those mechanisms serve teaching and learning. Neither lens is sufficient on its own: a technically sophisticated assistant that ignores instructional theory tends to underperform, and a pedagogically sound design constrained by weak underlying mechanisms cannot scale. Treating the assistants as socio-technical systems makes their value contingent on both technical reliability and the educational roles they fulfill.
Terminology: Several near-synonyms are used loosely in industry materials. To keep the analysis precise, the following conventions are adopted throughout. AI assistant is the umbrella term for any publisher-deployed system that supports learning or instruction through AI. An intelligent tutoring system is a system with explicit domain, student, and tutoring models. A tutor is an assistant whose primary function is guided, often Socratic, instruction. A chatbot is a conversational interface, typically built on an LLM, that may or may not be grounded in vetted content. Adaptive system denotes a tool that sequences content using a learner model. Where a generic reference is intended, the term assistant is used.
To frame the investigation, a theoretical perspective that synthesizes lessons from expert systems and ITSs is adopted. Research shows that features such as immediate feedback, guided practice, and adaptivity are central to effective AI tutors. Systematic reviews further indicate that AI-driven tutoring can enhance learning when implemented under appropriate conditions. Building on this foundation, three research questions are posed:
RQ1. How have AI tutors evolved across successive technological waves, and how can the underlying mechanisms be organized into a taxonomy?
RQ2. How do publisher-developed generative assistants perform across the TRIAD dimensions when assessed using a reproducible rubric with documented evidence and reliability?
RQ3. To what extent do these tools fulfill core educational jobs (understanding complex content, generating assessment questions, and providing timely feedback) as conceptualized by the JTBD framework?
Contributions: Relative to prior surveys of AI in education, this review offers three contributions to the state of the art. (i) It consolidates six decades of educational AI into a taxonomy of algorithmic families and maps those families onto educational applications. (ii) It formalizes a transparent, reproducible evaluation framework (TRIAD coupled with JTBD) that includes anchored scoring criteria, an explicit evidence-and-confidence grading scheme, and reported inter-rater reliability, so that the procedure can be reapplied as tools evolve. (iii) It delivers the first comparative, evidence-graded assessment of the 2024–2025 cohort of publisher-built assistants, from which application scenarios and institutional recommendations are derived.
Structure
The remainder of this article proceeds as follows.
Section 2 introduces the theoretical foundations, integrating the literature on expert systems, ITSs, and responsible AI, and synthesizes the technical evolution of educational AI into a taxonomy of algorithmic families.
Section 3 is a dedicated Materials and Methods Section that describes the search protocol, inclusion and exclusion criteria, the tool-selection procedure, the evidence-and-confidence grading scheme, and the reproducible TRIAD and JTBD scoring methodologies, including inter-rater reliability.
Section 4 traces the historical evolution of AI in U.S. higher education from the 1950s through the 2020s.
Section 5 catalogs generative AI tools developed by leading academic publishers and organizes them by interface and architecture.
Section 6 presents a comparative analysis using the TRIAD framework and the JTBD lens.
Section 7 reports the scored results, the mechanism-by-application cross-analysis, and derived application scenarios, and discusses emerging technical and pedagogical trends.
Section 8 synthesizes the findings, contrasts publisher-provided tools with campus-specific solutions, and reflects on ethical and societal implications. The article concludes with evidence-based recommendations for researchers, educators, and institutional leaders.
2. Theoretical Foundations and Technical Evolution
This evaluation is grounded in the literature on expert systems, intelligent tutoring systems (ITSs), and responsible artificial intelligence (AI). Decades of ITS research emphasize that the educational effectiveness of AI tutors depends not merely on technical sophistication but on alignment with instructional theory. A recent analysis notes that key features, namely immediate feedback, guided practice, and adaptivity, are grounded in cognitive theory and yield positive learning outcomes only when implemented under the right conditions [1]. Likewise, systematic reviews of AI-driven tutoring systems report generally positive effects on learning. Those reviews also stress that the gains often require sustained interventions and careful attention to context [2].
Expert systems historically focused on knowledge representation and inference. In educational domains, rule-based tutors such as Cognitive Tutor Algebra I employ explicit domain models to deliver individualized instruction. Longitudinal trials show that such systems can produce significant improvements in student proficiency when deployed over multiple years [3]. More recent developments merge deep learning and generative modeling with human-authored content, which raises questions about transparency, fairness, and privacy. Ethical guidance from the United Nations Educational, Scientific, and Cultural Organization (UNESCO), the U.S. National Institute of Standards and Technology (NIST), and the EdSAFE AI Alliance calls for human-centered design, data minimization, accountability, and explainability [4,5,6].
2.1. Technical and Algorithmic Evolution: A Taxonomy of Mechanisms
Because the target venue emphasizes algorithms, it is useful to make the technical lineage of educational AI explicit before evaluating present-day products. The mechanisms that power educational AI fall into six families that emerged in overlapping waves.
Figure 1 arranges them along a trajectory of increasing capability and, notably, decreasing transparency.
The first family, rule-based and expert systems, encodes domain knowledge as production rules and semantic networks. Systems such as SCHOLAR and GUIDON tutored from explicit knowledge bases, and later ACT-R cognitive tutors used model tracing to compare student actions against an explicit expert model. Their behavior is fully inspectable, but authoring is labor-intensive and brittle outside well-formalized domains. The second family, adaptive and knowledge-space methods, represents a learner’s mastery as a state in a partially ordered knowledge space and sequences content toward the next learnable topic; ALEKS is the canonical example [7]. The third family, statistical and probabilistic methods, includes Bayesian knowledge tracing, educational data mining (EDM), and learning analytics, which infer latent mastery and predict outcomes from interaction logs [8,9]. The fourth family, deep learning, applies neural networks to tasks such as deep knowledge tracing, affect detection, and student modeling [10]. The fifth family, transformer-based large language models (LLMs), uses self-attention to support open-ended dialogic tutoring; models such as BERT (Bidirectional Encoder Representations from Transformers) and the Generative Pre-trained Transformer (GPT) series anchor most current products [11,12,13]. The sixth family, retrieval-augmented generation (RAG) and hybrid architectures, grounds an LLM in vetted course content to reduce hallucination and to align responses with learning objectives. This last family is the dominant design pattern among the publisher tools reviewed here, and it directly addresses the transparency deficit introduced by pure LLMs.
These families are not mutually exclusive. Modern assistants frequently combine an LLM front end with retrieval over a curated corpus and analytics for progress monitoring so that a single product may instantiate several mechanisms at once.
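The retrieval-plus-LLM combination can be sketched as follows. This is a toy illustration under stated assumptions: retrieval uses a simple bag-of-words cosine similarity rather than the embedding search a production system would likely use, the corpus `COURSE_PASSAGES` is invented, and no actual LLM is called; the sketch stops at building the grounded prompt.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercased bag-of-words representation of a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: cosine(q, tokenize(p)), reverse=True)
    return ranked[:k]


def build_grounded_prompt(question: str, passages: list[str], k: int = 2) -> str:
    """Assemble a prompt that restricts the model to retrieved course content."""
    context = "\n".join(
        f"[{i + 1}] {p}" for i, p in enumerate(retrieve(question, passages, k))
    )
    return (
        "Answer using only the numbered course passages below and cite them.\n"
        f"{context}\n"
        f"Question: {question}"
    )


# Hypothetical vetted corpus standing in for publisher content.
COURSE_PASSAGES = [
    "The derivative of a function measures its instantaneous rate of change.",
    "Mitosis is the process by which a cell divides into two identical cells.",
    "The chain rule gives the derivative of a composition of functions.",
]

question = "What is the derivative of a function?"
prompt = build_grounded_prompt(question, COURSE_PASSAGES, k=2)
```

Because the prompt carries numbered passages, a response can cite its sources, which is one way such designs expose provenance to the learner.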
Table 1 summarizes each family, its representative technique, the principal educational function it enables, and its characteristic transparency profile. This taxonomy provides the technical vocabulary used in the comparative analysis and in the mechanism-by-application cross-analysis presented in Section 7.
2.2. The TRIAD Dimensions
The TRIAD framework is a pragmatic rubric derived from responsible AI principles and educational technology evaluation. It operationalizes those principles alongside curricular relevance and usability across five dimensions. The conceptual meaning of each dimension is given below; the anchored scoring criteria used to apply it are presented in the Materials and Methods Section (Section 3).
Trust refers to the degree to which the tool protects student privacy, provides transparency and explainability, mitigates bias, and complies with legal standards such as the Family Educational Rights and Privacy Act (FERPA). International guidance for generative AI in education emphasizes a human-centered approach that protects data privacy and ensures ethical validation. Risk-management guidance from NIST incorporates trustworthiness into the design, development, use, and evaluation of AI systems [4]. The SAFE Benchmarks framework similarly emphasizes safety, accountability, fairness, and transparency in educational technology (edtech) [5]. Tools that provide clear response provenance, limit hallucinations, and allow human oversight score higher.
Relevance describes alignment with curriculum standards and the extent to which the assistant integrates with the publisher’s content and learning platforms. This dimension assesses whether the AI enhances or detracts from instructional objectives.
Impact refers to evidence of improved learning outcomes, engagement, or efficiency. When empirical studies are unavailable, impact is inferred from product claims, user uptake, and alignment with best practices such as Socratic guidance and adaptive feedback.
Adoption captures the extent of institutional and user uptake, including ease of onboarding. Adoption is influenced by self-efficacy, subjective norms, perceived enjoyment, facilitating conditions, and system accessibility [14]. Widespread usage, positive feedback, and institutional support increase the score.
Design concerns usability, accessibility, and responsiveness. Inclusive design draws on frameworks such as Universal Design for Learning (UDL), which encourages multiple means of engagement, representation, and expression to remove barriers for diverse learners [15]. Interfaces that offer intuitive interactions, support learner variability, and provide high-quality feedback score higher.
Each dimension is scored on a 1–10 scale. The scores are relative: a higher score indicates stronger performance than peers.
2.3. The Jobs-to-Be-Done Lens
The analysis also draws on the Jobs-to-Be-Done (JTBD) framework from innovation theory. JTBD posits that individuals and organizations “hire” products or services when specific circumstances arise, and that each job has functional, social, and emotional dimensions [16]. The framework focuses on the underlying job rather than on demographic characteristics. In an educational context, students hire an AI assistant to understand complex concepts, to generate practice questions, or to receive immediate feedback outside class. Instructors hire one to streamline assessment creation and to personalize instruction. This lens complements TRIAD: TRIAD measures the quality and responsibility of an assistant, whereas JTBD measures how well it fits the tasks users actually need to accomplish.
3. Materials and Methods
This section documents how the review was conducted and how the assistants were evaluated, so that the procedure can be audited and reapplied. It describes the review type and protocol; the search strategy and source selection; the criteria used to include tools; the scheme for grading evidence and confidence; the operational TRIAD and JTBD scoring procedures; the rater process and reliability analysis; and the temporal scope of the findings.
3.1. Review Type and Protocol
This work is a critical review with a structured search component. It does not propose a new algorithm; its contributions are the synthesis of educational AI, the formalization of a reproducible evaluation framework, and the evidence-based application of that framework to current products. The review followed a written protocol with four stages adapted from PRISMA: identification, screening, eligibility, and inclusion. The protocol defined the search sources, the date range, the inclusion and exclusion criteria, the tool selection rule, and the scoring procedure before data extraction began. The PRISMA structure is borrowed to make the search transparent and reproducible. The article is nonetheless positioned as a critical review with a structured scoping-style search, not as a systematic review in the strict PRISMA sense: it does not register a review protocol, does not restrict its evidence base to study-level findings, and does not perform a formal risk-of-bias meta-synthesis. Readers should therefore interpret Figure 2 as a transparent account of how the literature was assembled rather than as the basis of an aggregated quantitative synthesis.
3.2. Search Strategy and Source Selection
Peer-reviewed articles, conference proceedings, monographs, and government reports addressing AI in higher education from 1956 to 2025 were sought in five databases: Scopus, Web of Science, IEEE Xplore, the ACM Digital Library, and ERIC, supplemented by Google Scholar for citation chaining. Search strings combined a technology facet with an education facet, for example: (“intelligent tutoring” OR “adaptive learning” OR “knowledge tracing” OR “large language model” OR “generative AI”) AND (“higher education” OR university OR undergraduate). Inclusion criteria were: (i) relevance to AI mechanisms or tools used in post-secondary teaching and learning; (ii) English language; (iii) full-text availability; and (iv) publication between 1956 and mid-2025. Exclusion criteria were: off-topic records, opinion pieces without methodological or empirical content, and redundant secondary coverage of a primary source already included. Priority was given to high-impact journals and to seminal, widely cited works.
Figure 2 reports the flow of records through the four stages and the counts retained at each step.
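The faceted search string described above can be assembled programmatically, which helps keep the query identical across databases. The sketch below is illustrative: the helper names are assumptions, and each database has its own field tags and syntax that would need to be added before use.

```python
# Facet terms quoted from the example search string in the text.
TECHNOLOGY_TERMS = [
    "intelligent tutoring",
    "adaptive learning",
    "knowledge tracing",
    "large language model",
    "generative AI",
]
EDUCATION_TERMS = ["higher education", "university", "undergraduate"]


def quote(term: str) -> str:
    """Wrap multi-word phrases in quotation marks so they match exactly."""
    return f'"{term}"' if " " in term else term


def facet(terms: list[str]) -> str:
    """Join the terms of one facet with OR."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def build_query(*facets: list[str]) -> str:
    """Combine facets with AND so a record must match every facet."""
    return " AND ".join(facet(f) for f in facets)


query = build_query(TECHNOLOGY_TERMS, EDUCATION_TERMS)
```

Keeping the term lists as data makes it straightforward to log the exact query that was run against each source.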
3.3. Tool Inclusion and Selection
A tool was eligible for the comparative evaluation if it satisfied four conditions: it was (i) an AI assistant for teaching or learning, (ii) built or commissioned by a U.S. academic publisher or a closely comparable provider, (iii) generally available or in a documented pilot as of mid-2025, and (iv) documented in enough detail to support scoring on every TRIAD dimension. Applying these conditions yielded eleven assistants from eight providers (Cengage, Khan Academy, Macmillan Learning, McGraw Hill, Pearson, Wiley, Quizlet, and Chegg). Khan Academy is included as an influential reference point even though it is not a traditional textbook publisher; this choice is noted so that readers can weigh it.
3.4. Evidence and Confidence Grading
Because the available information ranges from peer-reviewed trials to vendor marketing, each tool’s evidence base was graded by source type, and a confidence level was attached to its overall evaluation. Four source types were distinguished in descending order of evidential weight: independent peer-reviewed research; institutional or governmental reports; vendor product documentation; and commercial or news claims. Confidence was rated High when independent research substantiated the central claims, Moderate when product documentation was corroborated by at least one independent source or by early empirical data, and Low when the evidence rested largely on vendor or news material.
Table 2 records the primary evidence basis and the confidence level for each tool. This grading directly distinguishes ratings anchored in independent research, such as ALEKS, from those inferred largely from marketing, such as the Wiley AI Tutor, and it should be read alongside the scores in Section 7.
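The confidence rule stated above can be written as a small decision function. This is a sketch of the rule as described in the text, not the authors' script; the function and parameter names are assumptions introduced for illustration.

```python
from enum import Enum


class Confidence(Enum):
    HIGH = "High"
    MODERATE = "Moderate"
    LOW = "Low"


def grade_confidence(
    independent_research: bool,
    independent_corroboration: bool = False,
    early_empirical_data: bool = False,
) -> Confidence:
    """Map the available evidence for a tool to a confidence level."""
    # High: independent research substantiates the central claims.
    if independent_research:
        return Confidence.HIGH
    # Moderate: product documentation is corroborated by at least one
    # independent source or by early empirical data.
    if independent_corroboration or early_empirical_data:
        return Confidence.MODERATE
    # Low: the evidence rests largely on vendor or news material.
    return Confidence.LOW
```

Checking the conditions in descending order of evidential weight means the strongest available source always determines the grade.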
3.5. TRIAD Scoring Procedure
Each TRIAD dimension was rated on a 1–10 scale using fixed anchors. Scores of 1–3 indicate limited evidence or non-compliance; 4–6 indicate moderate performance with notable gaps; 7–8 indicate strong alignment with responsible AI and instructional quality; and 9–10 indicate exemplary practice. Dimension-specific anchors were defined as follows. For Trust, the anchor was the presence of explicit privacy safeguards, explainable outputs, bias mitigation through content grounding, and human oversight; tools lacking published safeguards scored 1–3, and tools with comprehensive adherence to international AI-ethics guidance scored 9–10. For Relevance, the anchor was the depth of curricular alignment and integration with vetted content; generic support scored low, whereas deep integration with publisher content and learning management systems (LMSs) scored high. For Impact, the anchor was the strength of evidence for improved outcomes; no published evidence scored 1–3, pilot or anecdotal benefits 4–6, survey or small-scale studies 7–8, and peer-reviewed evaluations of significant learning gains 9–10. For Adoption, the anchor was the breadth of institutional and user uptake; limited pilots scored low, and sustained multi-institution use scored high. For Design, the anchor was usability, accessibility, and responsiveness; minimal interfaces scored low, and exemplary universal design scored high. As a worked example, the Cengage Student Assistant received a Trust score of 8 because it limits hallucination risk through discipline-specific grounding, emphasizes academic integrity, and provides oversight controls; its Relevance and Impact scores of 8 and 7 reflect close alignment with course content and early engagement evidence, while moderate Adoption (6) and solid Design (7) yield a total of 36.
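The scoring sheet can be represented as a small data structure that enforces the 1–10 scale and reproduces the worked example. The sketch below is illustrative; the class and function names are assumptions, and the band labels abbreviate the anchor descriptions given above.

```python
from dataclasses import asdict, dataclass

# General score bands from the rubric: (lowest score, highest score, label).
BANDS = [
    (1, 3, "limited evidence or non-compliance"),
    (4, 6, "moderate performance with notable gaps"),
    (7, 8, "strong alignment"),
    (9, 10, "exemplary practice"),
]


def band(score: int) -> str:
    """Return the rubric band for a score, rejecting out-of-range values."""
    for low, high, label in BANDS:
        if low <= score <= high:
            return label
    raise ValueError(f"score {score} is outside the 1-10 scale")


@dataclass(frozen=True)
class TriadScore:
    trust: int
    relevance: int
    impact: int
    adoption: int
    design: int

    def __post_init__(self) -> None:
        # Validate every dimension against the 1-10 scale on construction.
        for value in asdict(self).values():
            band(value)

    def total(self) -> int:
        """Sum of the five dimension scores (maximum 50)."""
        return sum(asdict(self).values())


# Scores from the worked example in the text.
cengage = TriadScore(trust=8, relevance=8, impact=7, adoption=6, design=7)
```

Here `cengage.total()` evaluates to 36, matching the total reported in the worked example.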
3.6. JTBD Scoring Procedure
For the JTBD matrix, operational criteria rated each assistant’s ability to fulfill three core jobs: understanding complex content, generating assessment questions, and providing timely feedback. A rating of High indicates explicit, dedicated features that directly accomplish the job, such as step-by-step explanations, automated question generation, or around-the-clock chat support. Moderate indicates partial or auxiliary support. Low indicates minimal or no functionality relative to the job. Ratings were assigned from documented features, usage policies, and product demonstrations, so that the High/Moderate/Low labels rest on transparent decision rules rather than impressions.
3.7. Raters, Reliability, and Reproducibility
Two researchers with expertise in educational technology and machine learning independently scored every tool on all TRIAD dimensions and JTBD jobs using the anchored criteria above. Independent ratings were recorded before any discussion. Agreement was then quantified, and remaining differences were reconciled by consensus to produce the values reported in Section 7. Because the scores are bounded ordinal ratings that cluster in a narrow band, agreement is reported with several complementary statistics rather than a single coefficient: exact percent agreement, percent agreement within one point, quadratic-weighted Cohen’s kappa, and a two-way intraclass correlation coefficient, ICC(2,1).
Table 3 reports these per dimension and overall. Across all dimensions, raters agreed within 1 point in 100% of cells and exactly in 72.7% of cells; the overall quadratic-weighted kappa was 0.86, and ICC(2,1) was 0.87, indicating strong reliability. For the Impact dimension, weighted kappa is lower (0.45) despite 100% within-one agreement. This is the well-known base-rate paradox of kappa: when ratings have little marginal variance, as they do for Impact (scores cluster tightly at 7–8), kappa is deflated even when raters agree closely. The percent-agreement and ICC values are therefore the more informative indicators for that dimension.
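The four agreement statistics can be computed with the standard definitions, as in the sketch below. It is a minimal re-implementation for two raters using only the standard library, not the script used for Table 3, and the ratings are hypothetical values chosen to cluster at 7–8 so that the deflation of kappa is visible.

```python
import math
from collections import Counter


def percent_agreement(a: list[int], b: list[int], tolerance: int = 0) -> float:
    """Share of items on which the raters differ by at most `tolerance` points."""
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)


def quadratic_weighted_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa with quadratic disagreement weights (i - j) ** 2.

    The usual normalization of the weights cancels in the ratio, so it is omitted.
    """
    n = len(a)
    observed = sum((x - y) ** 2 for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Expected weighted disagreement if the two raters were independent.
    expected = sum(
        (i - j) ** 2 * count_a[i] * count_b[j] for i in count_a for j in count_b
    ) / (n * n)
    return 1.0 - observed / expected if expected else math.nan


def icc_2_1(a: list[int], b: list[int]) -> float:
    """Two-way random-effects, absolute-agreement, single-rater ICC (two raters)."""
    rows = list(zip(a, b))
    n, k = len(rows), 2
    grand = sum(a + b) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(a) / n, sum(b) / n]
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_error = sum(
        (x - row_means[i] - col_means[j] + grand) ** 2
        for i, r in enumerate(rows)
        for j, x in enumerate(r)
    )
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical ratings that cluster tightly at 7-8.
rater_a = [7, 7, 8, 7, 8, 7, 7, 8]
rater_b = [7, 8, 8, 7, 7, 7, 8, 8]

exact = percent_agreement(rater_a, rater_b)
within_one = percent_agreement(rater_a, rater_b, tolerance=1)
kappa = quadratic_weighted_kappa(rater_a, rater_b)
icc = icc_2_1(rater_a, rater_b)
```

For these hypothetical ratings the raters never differ by more than one point (within-one agreement of 100%, exact agreement of 62.5%), yet the weighted kappa is only 0.25, because the expected disagreement is small when both raters use only two adjacent scores. ICC(2,1) computed this way is also low for this toy sample (about 0.28), so the example illustrates the deflation of kappa rather than the specific values reported in Table 3.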
Having only two expert raters is a limitation, and small panels can introduce bias even when reliability is high. Three design choices mitigate this risk. First, the anchored rubric in Section 3.5 and Section 3.6 constrains judgment to documented criteria, which reduces idiosyncratic scoring. Second, the evidence-and-confidence grading in Table 2 makes explicit where a score rests on independent research versus vendor material so that a reader can discount low-confidence ratings. Third, and most important, the framework is presented as a reusable template rather than as a definitive ranking: the rubric, anchors, and scoring sheet are specified in enough detail that other evaluators can reapply them, expand the panel, and update the scores as tools evolve. All quantitative procedures, including the reliability statistics, were implemented in Python 3.12.6. Replacing the independent rating sheet with additional raters’ scores regenerates Table 3 without further changes.
3.8. Temporal Scope
Generative AI products change rapidly, often on a monthly basis. Every finding in this article reflects a mid-2025 snapshot, and all tables carry that snapshot date. The contribution is therefore the evaluation method and the analysis of a defined cohort at a defined time, not a durable leaderboard. Readers applying the framework later should re-extract the underlying evidence before reusing any specific score.
4. Historical Evolution of AI in Higher Education
The decade-by-decade account below traces how the algorithmic families of Section 2.1 reached the classroom. Two caveats frame it. First, the decade headings are an organizing convenience, not strict boundaries: techniques develop continuously and diffuse slowly, so the same idea often spans several periods, as the entries for cognitive modeling and for learning analytics in the tables below illustrate. Second, the historical material is not offered as a direct cause of present-day publisher strategy. Its purpose is to show that today’s generative assistants recombine long-standing components, namely explicit domain models, adaptive sequencing, knowledge tracing, and dialogic interaction, now delivered through LLMs. That lineage explains why retrieval-augmented designs, which graft transparency-enhancing grounding onto otherwise opaque models, have become the preferred architecture among the publisher tools examined in Section 5. The historical record thus motivates the evaluation criteria rather than the commercial timing of any single product.
4.1. 1950s and 1960s: Foundations and Early Experiments
Modern AI research was formally inaugurated at the 1956 Dartmouth Summer Research Project, organized by pioneers who defined the goal of making machines simulate human intelligence. This workshop is widely regarded as the birth of the field [17]. In the 1960s, computer-assisted instruction (CAI) emerged. The PLATO system, created in 1960 at the University of Illinois at Urbana-Champaign, provided time-shared computer terminals through which students could access instructional materials [18]. By the early 1970s, PLATO supported 1000 simultaneous users over 1200 bps connections and fostered one of the first online communities [18].
Table 4 summarizes key milestones of the 1950s and 1960s.
4.2. 1970s: Intelligent Tutoring Emerges
The limitations of ad hoc CAI led researchers to explore AI techniques to develop more adaptive and interactive systems. In 1970, the SCHOLAR system was introduced as a pioneering intelligent tutoring system (ITS) that utilized a semantic network to store domain knowledge and engage in mixed-initiative dialogue [23]. The information-structure-oriented CAI approach, based on an information network rather than preprogrammed frames, enabled the system to generate questions, answers, and feedback on the fly. SCHOLAR demonstrated that an ITS could detect misspellings, answer students’ questions, and dynamically adapt content [23].
Another strand of research applied cognitive psychology to education. A seminal 1984 paper later reported that students tutored one-on-one achieved performance two standard deviations better than that of students receiving conventional classroom instruction, highlighting the “two sigma” problem and motivating researchers to build systems that emulate human tutors [24]. Throughout the 1970s, early ITS prototypes such as BIP and GUIDON (the latter based on medical diagnostics) experimented with expert systems to teach domain knowledge.
Table 5 lists the major developments of the 1970s in AI tutoring.
4.3. 1980s: Cognitive Models and Rule-Based Tutors
By the early 1980s, advances in cognitive psychology and artificial intelligence converged to produce more sophisticated ITSs. Researchers introduced student models that represented learners’ knowledge states and misconceptions. ACT theory laid the groundwork for cognitive tutors that model procedural skills through production rules. Early systems, such as LISP Tutor and Geometry Tutor, incorporated rule-based models and adaptive feedback. An influential 1990 overview summarized the architecture of ITSs and identified four main components: domain model, student model, tutoring model, and user interface [29]. The 1980s also saw the emergence of adaptive hypermedia and rule-based system shells such as HERACLES and MENO.
Table 6 summarizes highlights of AI and tutoring research in the 1980s.
4.4. 1990s: Commercial Adaptive Platforms
The 1990s marked a transition from research prototypes to scalable educational products. Cognitive Tutors, developed at Carnegie Mellon University, applied ACT-R production rules to secondary mathematics and were commercialized by Carnegie Learning. Empirical studies demonstrated that students using Cognitive Tutor algebra curricula achieved significant learning gains compared with conventional instruction. Another major contribution was ALEKS (Assessment and Learning in Knowledge Spaces), launched in 1996. ALEKS applies knowledge space theory to adaptively diagnose student knowledge and select the next best topic, allowing students to progress efficiently [7]. The platform has been continuously developed for over 25 years and remains widely used. Early attempts at intelligent student assistants also emerged, such as AutoTutor, which engages learners in natural language dialogue.
Table 7 summarizes key developments in AI tutoring during the 1990s.
4.5. 2000s: Learning Analytics and MOOCs
The early 2000s saw the confluence of web technologies, data mining, and online education. Researchers leveraged large datasets from learning management systems (LMSs) to identify at-risk students and personalize interventions. The term “learning analytics” gained prominence. Researchers argued that data-driven analytics could transform higher education by enabling evidence-based decision-making [9]. A survey of educational data mining and learning analytics noted that distance education generates rich, traceable data that can be used to model engagement and predict persistence [8]. Massive open online courses (MOOCs) rose to prominence in 2012, bringing scale but also challenges of attrition and engagement. During this period, companies like Knewton and Smart Sparrow introduced adaptive learning platforms for higher education.
Table 8 summarizes the major developments in AI and learning analytics during the 2000s.
4.6. 2010s: Intelligent Assistants and Deep Learning
During the 2010s, AI tutors began to incorporate natural language processing and deep learning. A case study of an online AI course introduced a virtual teaching assistant named Jill Watson in 2016; students did not realize it was an AI until the end of the semester [49]. Subsequent descriptions detailed how Jill Watson responded autonomously to introductions and FAQs and posted announcements [50]. The decade also saw advances in affective computing and multimodal analytics. Researchers developed models to detect students’ emotions from facial expressions and physiology, enabling adaptive interventions. Deep neural networks were applied to knowledge tracing and student modeling, culminating in algorithms such as Deep Knowledge Tracing.
Table 9 presents key AI developments in higher education during the 2010s.
4.7. 2020s: Generative AI and LLMs
The current decade has witnessed a surge in the development of generative AI tools. Transformer models, such as BERT and GPT-3, paved the way for large language models (LLMs) capable of generating coherent text, code, and dialogue. Public awareness of generative chatbots rose sharply when ChatGPT became widely accessible in late 2022 and early 2023 [53,54]. Government reports have noted that generative AI can write essays, create lesson plans, and personalize assignments, while also raising concerns about surveillance and algorithmic discrimination [54,55,56]. Publishers quickly incorporated LLMs into their platforms to offer chat-based tutoring, content generation, and study aids.
Section 5 examines these tools in detail.
Table 10 highlights AI developments in higher education from 2020 through 2025.
Recent research in “Computers and Education: Artificial Intelligence” highlights both the potential and the complexity of deploying generative AI tools in higher education. One study, conducted in Ghana, investigated how interacting with ChatGPT influences undergraduate cognitive skills using a mixed-methods pretest-posttest design with a control group [67]. It found that using ChatGPT significantly improved students’ critical, creative, and reflective thinking skills, illustrating the capacity of conversational models to scaffold higher-order cognition [67].
A 2024 survey of 5894 students across Swedish universities evaluated perceptions and usage of AI chatbots [68]. The survey reported high awareness and generally positive attitudes toward ChatGPT, yet noted significant differences across gender, academic level, and field of study, with female and humanities students expressing greater skepticism and concern about the role of AI [68]. These findings underscore the need for context-sensitive adoption strategies and suggest that demographic factors should inform the design and deployment of AI-driven educational tools.
5. AI-Powered Educational Tools by Academic Publishers
This section catalogs the AI-driven educational tools offered by major U.S. academic publishers as of mid-2025. Each subsection summarizes a publisher’s products, including launch dates, target markets, key features, and limitations, and a table records the salient details. Because the tools vary widely in their interfaces, underlying algorithms, motivations, and foundation models, the cohort is first organized along the two dimensions that most sharply distinguish them: interface and architecture.
5.1. Organizing Dimensions: Interface and Architecture
The eleven assistants differ along two axes. The first is the user interface and delivery modality. The most common pattern is a conversational chatbot embedded inside the publisher’s own platform: Cengage’s Student Assistant in MindTap, Macmillan’s Achieve AI Tutor, Pearson’s AI Study Tools in MyLab and Pearson+, McGraw Hill’s AI Reader in Connect and GO, and CheggMate in Chegg Study all follow this in-platform conversational model. Two tools depart from it. The Wiley AI Tutor is delivered through a consumer messaging app (WhatsApp) rather than a learning management system (LMS), and Quizlet’s Q-Chat is woven into a flashcard study workflow rather than presented as a free-form chat. A further distinct modality is the instructor-facing generator, exemplified by the Macmillan iClicker AI Question Creator, whose interface is an authoring tool rather than a student tutor. Khanmigo and ALEKS represent, respectively, a standalone conversational tutor and an adaptive assessment engine.
The second axis is the underlying architecture. ALEKS is not generative; it is an adaptive engine grounded in knowledge-space theory. Khanmigo, CheggMate, and Q-Chat are built on general-purpose LLMs (the GPT family or the OpenAI application programming interface, API). The remaining publisher tools follow the retrieval-augmented pattern, pairing an LLM with retrieval over vetted publisher content so that responses stay aligned with course material.
Table 11 summarizes this classification, which the comparative analysis in Section 6 builds upon.
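The retrieval-augmented pattern described above can be illustrated with a minimal Python sketch. It is not any publisher’s implementation: the passage store, the passage identifiers, and the function names (`retrieve`, `build_prompt`) are hypothetical, and plain word overlap stands in for the embedding-based retrieval that a production system would more likely use. The sketch stops at assembling the grounded prompt; the call to a language model is deliberately left out.

```python
import re
from collections import Counter

# Hypothetical vetted passages; a real system would index publisher content.
PASSAGES = {
    "econ-3.1": "Price elasticity of demand measures how quantity demanded "
                "responds to a change in price.",
    "econ-3.2": "A price ceiling set below the equilibrium price creates a shortage.",
    "psych-2.4": "Classical conditioning pairs a neutral stimulus with an "
                 "unconditioned stimulus.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def retrieve(question, passages, k=2):
    """Return up to k passage ids ranked by word overlap with the question.

    Word overlap is an assumed stand-in for a real retriever; passages that
    share no word with the question are dropped.
    """
    question_counts = Counter(tokenize(question))
    scored = []
    for pid, text in passages.items():
        overlap = sum((question_counts & Counter(tokenize(text))).values())
        if overlap > 0:
            scored.append((overlap, pid))
    # Highest overlap first; ties broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [pid for _, pid in scored[:k]]


def build_prompt(question, passages, k=2):
    """Assemble a prompt that restricts the model to the retrieved passages."""
    ids = retrieve(question, passages, k)
    context = "\n".join(f"[{pid}] {passages[pid]}" for pid in ids)
    return (
        "Answer using only the passages below and cite their ids.\n"
        f"{context}\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt("What does price elasticity of demand measure?", PASSAGES))
```

Because the prompt contains only retrieved, vetted passages, the model’s answer can be checked against identifiable source text, which is the property that keeps responses aligned with course material.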
5.2. Cengage
Cengage entered the generative AI space with the Student Assistant, an in-platform chatbot integrated into the MindTap learning environment. The tool is discipline-specific: the underlying model is trained on course content and pedagogy so that it offers prompts and feedback without simply giving away solutions [61]. The Student Assistant emphasizes critical thinking and academic integrity through Socratic questioning and is available 24/7 for just-in-time help [61]. A beta launch in Fall 2024 targeted four courses in Management, Organizational Behavior, Psychology, and Economics, with expansion across disciplines planned for 2025 [61].
Table 12 outlines the features and limitations of the Cengage Student Assistant.
5.3. Khan Academy
Although not a traditional publisher, Khan Academy’s Khanmigo provides an influential reference for generative AI tutoring. The GPT-4-powered assistant, launched in March 2023, converses with learners using Socratic questions and hints across various subjects, including mathematics, science, and humanities. Khanmigo can role-play historical figures to enrich engagement and integrates teacher tools for lesson planning, rubric creation, and progress summaries. Safety measures include logging all interactions with minors and using a second AI to filter inappropriate content. Access remains limited to pilot schools and paid subscribers and is primarily oriented toward K-12 learners [58].
Table 13 summarizes the highlights and limitations of Khanmigo.
5.4. Macmillan Learning
Macmillan offers two AI tools: the Achieve AI Tutor and the iClicker AI Question Creator. The Achieve AI Tutor, launched in 2023 and rolled out widely in 2024–2025, serves as an on-demand homework helper that provides step-by-step guidance through Socratic questioning. It is available in roughly 80 courses, predominantly in science, technology, engineering, and mathematics (STEM) disciplines, and must be enabled at the instructor’s discretion. Surveys indicate that the tool increases student confidence and engagement. However, the tutor avoids giving direct answers, and its scope remains limited to courses in which it has been trained [62, 63].
Table 14 compares Macmillan’s AI tools.
The iClicker AI Question Creator, launched in February 2024, generates up to 50 customized quiz or polling questions based on instructor-specified topics, difficulty, and taxonomy levels. By producing unique, non-searchable questions, it aims to enhance academic integrity and promote active learning. Instructors must review AI-generated content for accuracy; the tool currently functions best for formative assessments rather than high-stakes exams [63].
5.5. McGraw Hill
McGraw Hill’s contributions span decades. Its long-standing ALEKS platform, launched in 1996, remains a pioneering adaptive learning and assessment system based on knowledge-space theory. ALEKS diagnoses what a student knows and selects the next appropriate topic, enabling self-paced mastery. Research shows it can reduce assessment time and improve learning efficiency [7]. In 2024, McGraw Hill announced two generative AI initiatives: the AI Reader and enhancements to ALEKS. The AI Reader allows students to highlight text in eBooks and request simplified explanations or practice questions. It integrates into the Connect and GO platforms and is intended to promote active reading. Availability is currently limited to select textbook titles [64]. As part of the same announcement, McGraw Hill described future generative capabilities within ALEKS, though details remain sparse.
Table 15 describes McGraw Hill’s AI tools.
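The diagnose-then-select behavior described for ALEKS can be sketched with the basic objects of knowledge-space theory. The example below is an illustration under stated assumptions, not ALEKS’s actual algorithm: the four items and their prerequisite map are hypothetical, and deriving states from a single prerequisite relation is a simplification of the general theory. A knowledge state is a set of mastered items that contains the prerequisites of each of its members, and the outer fringe of a state is the set of items that can be added while still yielding a valid state, which is one common formalization of what a learner is ready to learn next.

```python
from itertools import combinations

# Hypothetical prerequisite map: item -> items that must be mastered first.
PREREQS = {
    "integers": set(),
    "fractions": {"integers"},
    "linear_eq": {"integers"},
    "rational_eq": {"fractions", "linear_eq"},
}


def is_state(items, prereqs):
    """A set of items is a feasible state if it contains every prerequisite
    of each item it includes."""
    return all(prereqs[item] <= items for item in items)


def knowledge_states(prereqs):
    """Enumerate all feasible states by brute force (fine for tiny domains)."""
    all_items = sorted(prereqs)
    states = set()
    for size in range(len(all_items) + 1):
        for combo in combinations(all_items, size):
            candidate = frozenset(combo)
            if is_state(candidate, prereqs):
                states.add(candidate)
    return states


def outer_fringe(state, states, prereqs):
    """Items not yet mastered whose addition still gives a feasible state."""
    state = frozenset(state)
    return sorted(
        item
        for item in prereqs
        if item not in state and (state | {item}) in states
    )


if __name__ == "__main__":
    states = knowledge_states(PREREQS)
    learner = {"integers"}
    print(outer_fringe(learner, states, PREREQS))
```

For a learner who has mastered only `integers`, the outer fringe contains `fractions` and `linear_eq` but not `rational_eq`, since the latter still has unmet prerequisites. Brute-force enumeration is used here only for clarity; it does not scale to realistic item pools.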
5.6. Pearson
Pearson launched AI study tools in 2023 and expanded them in 2024 to dozens of courses. These tools embed generative AI into e-textbooks and homework platforms, providing personalized Q&A, step-by-step problem-solving, syllabus-based study plans, interactive video assistants, and AI-generated practice problems. The AI draws exclusively on vetted Pearson content to support accuracy and includes features such as the ability to upload syllabi for custom study schedules. Pearson emphasizes responsible AI use by monitoring interactions and adjusting the tool’s tone in response to feedback. Rollout began with a handful of titles and continued to expand through 2024 [60].
Table 16 summarizes Pearson’s AI study tools.
Alongside the student-facing study tools, Pearson offers an instructor-facing counterpart, referred to here as the Pearson AI Instructor Tool. It supports educators in generating and curating assessment content, including practice questions and assignment items, drawn from vetted Pearson material and aligned to course learning outcomes. Like the study tools, it operates on the publisher’s platforms rather than as a free-standing application, and it is documented primarily through product materials, as reflected in its Low confidence rating in Table 2.
5.7. Wiley
Wiley partnered with eFlow to pilot the Wiley AI Tutor via mobile messaging platforms, including WhatsApp. Announced in 2024, the service provides on-demand micro-tutoring sessions for subjects such as physics, accounting, and statistics. Students send natural-language questions and receive step-by-step explanations and practice problems. By leveraging a familiar chat interface, the tool lowers barriers to access. However, it is early in development: only a few subjects are covered, and the messaging platform may struggle with complex mathematical notation [65].
Table 17 highlights attributes and limitations of the Wiley AI Tutor.
5.8. Quizlet
Quizlet’s Q-Chat, built on OpenAI’s API, offers an AI-driven study companion within the popular flashcard platform. Launched in March 2023, Q-Chat utilizes Quizlet’s user-generated content to quiz students, adjust difficulty based on their responses, and provide hints. Additional features, such as “Magic Notes” and “Quick Summary,” summarize or explain content using AI. The tool has been restricted to users aged 18+ during beta testing and may propagate errors when underlying flashcards contain inaccuracies [66].
Table 18 details the functionality and limitations of Quizlet’s Q-Chat.
5.9. Chegg
Chegg’s CheggMate, announced in April 2023 and launched in beta shortly thereafter, combines GPT-4 with Chegg’s proprietary database of textbook solutions, expert Q&A, and practice exam pathways. Students can submit questions or photos of problems and receive step-by-step explanations and additional practice tailored to their level. The service is exclusive to Chegg subscribers, which differentiates it from free alternatives. Chegg has been criticized in the past for enabling cheating, and there are concerns that generative AI could exacerbate misuse [59, 69].
Table 19 provides an overview of CheggMate.
6. Competition Among Publishers
To understand the strategic positioning of generative AI tools across the U.S. academic publishing landscape, Table 20 and Table 21 provide a comparative overview. These tools differ in modality (e.g., chatbot, flashcard interface, messaging app), integration depth, disciplinary focus, and monetization model. While many rely on large language models and emphasize Socratic dialogue, implementation strategies diverge significantly. Some tools are embedded within proprietary platforms tightly aligned with curriculum content, whereas others are offered as standalone or cross-platform solutions.
These comparisons reveal several key dynamics. First, publishers with established platforms (e.g., McGraw Hill’s ALEKS, Pearson’s MyLab) are leveraging AI to augment existing ecosystems, emphasizing integration and instructional alignment. By contrast, new tools such as Wiley’s WhatsApp-based tutor and Quizlet’s Q-Chat focus on accessibility and scale through consumer-facing delivery.
Second, discipline coverage varies widely, from narrowly scoped pilots (Cengage, Wiley) to tools spanning dozens of subjects (Pearson, Chegg). Most generative tools are embedded in proprietary platforms and rely on subscription models, while others offer freemium access (e.g., Quizlet) or pilot deployments (e.g., Wiley, Khan Academy).
Finally, despite converging on LLM-powered features and Socratic methods, publishers differ in the degree of oversight, transparency, and customization they provide. The competitive advantage increasingly depends not just on AI capabilities, but on curricular relevance, instructional design, and safeguards for privacy and academic integrity. To remain viable against open-access alternatives like ChatGPT, publisher-provided tools must deliver context-aware, evidence-aligned support tailored to educational outcomes.
7. Current State of the Art and Emerging Trends
Advances in machine learning are rapidly altering educational technology. Large language models with billions of parameters can generate coherent explanations, answer questions, and simulate dialogue. Recent models, such as GPT-4, Claude, and Gemini, incorporate multimodal capabilities, enabling image-based reasoning and code generation. The integration of such models into educational platforms enables on-demand tutoring, automated question generation, and interactive video assistance.
Research is also exploring hybrid systems that combine deep learning with knowledge bases to ground responses in vetted content. For instance, generative tutoring assistants may retrieve textbook passages before generating answers, reducing the risk of hallucination and aligning with course objectives. Another trend is the combination of real-time learning analytics with AI tutors; by continuously monitoring student behavior and affect, systems can adjust feedback and content accordingly. At the same time, there is growing attention to fairness, transparency, and explainability in AI models, particularly when they make recommendations that affect students’ learning trajectories.
7.1. Scored Results: TRIAD and JTBD
The two instruments measure different things and were therefore developed separately, but they describe the same eleven assistants and are most useful side by side. TRIAD reports the quality and responsibility of each assistant on five numeric dimensions, whereas JTBD reports how well each assistant fits three concrete user tasks on a categorical scale. To allow quality and task fit to be read together and to keep each score tied to its evidence base, the two instruments are combined in Table 22, whose final column repeats the confidence level from Table 2. The three core jobs, derived from the functional, social, and emotional needs of students and instructors, are: understanding complex content (students seek explanations and scaffolding); generating assessment questions (instructors and students need quiz items or practice problems that reinforce learning and support integrity); and receiving timely feedback or support (both groups need on-demand assistance outside class).
On the TRIAD dimensions, the highest totals belong to Khanmigo and McGraw Hill’s ALEKS (42 each), followed by Pearson’s AI Study Tools (41) and the Pearson AI Instructor Tool (40). These tools combine privacy safeguards, curricular relevance, and broad adoption. Early-stage pilots such as the iClicker AI Question Creator and the Wiley AI Tutor score lower (33 each), reflecting limited adoption data and a narrower scope. The confidence column tempers this ranking: ALEKS is the only assistant whose high score rests on independent peer-reviewed research, whereas the lower-confidence entries depend largely on vendor documentation and should be read as provisional.
For JTBD jobs, most assistants are well-positioned to help students understand complex content and provide timely feedback, with High ratings across nearly all tools. Fewer excel at generating assessment questions, a job concentrated in a handful of tools: the instructor-facing iClicker AI Question Creator and Pearson AI Instructor Tool, and the student-facing CheggMate, which generates additional practice problems. The pattern indicates that development effort has favored student-facing comprehension and support over assessment authoring, which marks an opportunity for innovation in instructionally aligned content generation.
7.2. Mechanism-by-Application Cross-Analysis and Scenarios
Combining the mechanism taxonomy of Section 2.1 with the educational jobs above clarifies where current technology is strong and where it is thin. Figure 3 maps the six mechanism families against six educational applications, rating the maturity of each pairing. Transformer LLMs and retrieval-augmented designs dominate conceptual explanation, assessment-item generation, and just-in-time feedback. In contrast, progress analytics and early-warning functions remain the province of statistical and deep-learning methods. Adaptive practice and mastery sequencing are still served best by knowledge-space and adaptive engines such as ALEKS. No single mechanism is strong across all applications, which is why hybrid architectures predominate among the reviewed tools.
Reading the cross-analysis together with the scored results yields three concrete application scenarios for institutions. In a gateway STEM course with high enrollment and well-defined problem domains, an adaptive engine (knowledge-space mastery) paired with a retrieval-augmented explanation assistant best fits the dominant jobs of adaptive practice and conceptual explanation; ALEKS combined with an in-platform tutor is the closest match. In a writing-intensive or discussion-based humanities course, where the dominant job is dialogic explanation and feedback rather than mastery sequencing, a guardrailed conversational tutor grounded in course readings is more appropriate, with assessment authoring handled by an instructor-facing generator. In a large blended program concerned with retention, the binding constraint is progress analytics and early warning, so a learning-analytics layer should be prioritized over a chat interface. These scenarios illustrate how the framework translates scores into design choices, and the institutional recommendations in Section 8 align with them.
8. Discussion and Implications
This comparative evaluation of publisher-built AI assistants reveals a heterogeneous landscape shaped by divergent design philosophies, integration strategies, and evidence bases. Tools that embed clear privacy guardrails and provide explainable outputs (such as Khanmigo, ALEKS, and Pearson’s AI study tools) achieve higher scores on the Trust dimension because they foreground transparency, data protection, and human oversight. By contrast, assistants built atop general-purpose models, such as Quizlet’s Q-Chat and CheggMate, offer broad functionality but fewer public assurances, leading to lower trust ratings. Curricular relevance is strongest when assistants are tightly coupled with vetted content and integrated into existing learning platforms, as exemplified by Cengage’s Student Assistant, Pearson’s MyLab, McGraw Hill’s AI Reader, and ALEKS. Cross-disciplinary services remain valuable for learners seeking supplemental support but may be misaligned with specific syllabi and learning outcomes.
Impact and adoption data indicate that long-standing adaptive systems and deeply integrated assistants have robust evidence of improved learning and widespread uptake, whereas newer pilots, such as the iClicker AI Question Creator and the Wiley AI Tutor, show limited but promising results. Usability is enhanced by intuitive interfaces and inclusive design (e.g., Khanmigo and Pearson), while paywalls or constrained interfaces (e.g., Q-Chat and CheggMate) hamper user experience. Viewed through the Jobs-to-Be-Done lens, most assistants excel at helping learners understand complex concepts and providing just-in-time feedback; comparatively few offer sophisticated question-generation capabilities, highlighting an opportunity for innovation.
Beyond the publisher landscape, campus-specific AI platforms (including Nectir and Element451) illustrate an alternative model in which institutions train assistants on their syllabi, policies, and support resources. These institutionally controlled systems score highly on relevance because they mirror local curricula and administrative processes and may better accommodate unique course designs and regulatory requirements. However, bespoke development demands significant investment, expertise, and robust data governance, which limits scalability. By leveraging economies of scale and curated content, publisher-provided assistants can reach larger audiences but may struggle to adapt to local policies. The trade-off between general-purpose and customized AI underscores the need for flexible architectures that allow institutions to combine vetted publisher content with institution-specific data under clear governance frameworks.
Overall, the assistants occupy different niches along the continuum of generative AI capabilities. Tools built on mature adaptive platforms (e.g., ALEKS) and integrated into well-established ecosystems (Pearson’s MyLab, Khan Academy) combine strong privacy safeguards with curricular alignment and broad adoption. Early-stage generative pilots (iClicker AI Question Creator, Wiley AI Tutor) address specific needs but currently lack robust evidence of learning impact and remain limited in reach. Beta systems like CheggMate and Q-Chat offer personalized study pathways, yet must strengthen transparency and compliance to build trust. Students primarily “hire” AI assistants to decode complex concepts and obtain immediate feedback, while instructors “hire” them to generate assessments and reduce administrative workload. Tools that combine explanation with context-aware question generation (Khanmigo, ALEKS, Pearson) deliver greater perceived value. The success of campus platforms such as Nectir and Element451 highlights a growing demand for institutionally governed AI that aligns with local policies and curricula. Policymakers and educators should remain vigilant about transparency, bias, and privacy. Guidance from the U.S. Department of Education calls for AI systems to be inspectable, explainable, and aligned to a vision for high-quality learning [70]. Future research should rigorously measure learning outcomes, assess user perceptions across diverse populations, and explore how generative assistants can support inclusive pedagogies.
8.1. Recommendations for Institutions
The recommendations below follow directly from the application scenarios in Section 7.2, so that guidance is tied to identified use cases rather than offered in the abstract. First, institutions should select tools by job rather than by brand. A gateway STEM course should prioritize an adaptive mastery engine plus a retrieval-augmented explanation assistant; a discussion-based humanities course should prioritize a guardrailed conversational tutor grounded in course readings, with assessment authoring delegated to an instructor-facing generator; and a retention-focused blended program should invest first in a learning analytics and early-warning layer. Second, procurement should explicitly weigh evidence and confidence: a high TRIAD total backed only by vendor documentation (Table 2) warrants a local pilot with pre-registered outcomes before scale-up, whereas an independently validated engine such as ALEKS can be adopted with greater confidence. Third, because the strongest current designs are retrieval-augmented hybrids, institutions should favor architectures that let vetted institutional content be combined with publisher content under clear data-governance terms. Fourth, contracts should require inspectability and human oversight consistent with the responsible-AI guidance cited above, and should mandate a re-evaluation cadence given the mid-2025 snapshot nature of any assessment.
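The second recommendation amounts to a small decision rule, which can be written down as a Python sketch. Several elements are assumptions made for illustration: the review does not define a numeric cut-off for a “high” TRIAD total, so `HIGH_TOTAL = 40` is hypothetical, as are the `evidence` labels and the names `ScoredAssistant` and `procurement_step`. The two sample records use the totals reported in Section 7.1 and the evidence characterizations given in this review.

```python
from dataclasses import dataclass

# Assumed cut-off for a "high" TRIAD total; the review does not define one.
HIGH_TOTAL = 40


@dataclass(frozen=True)
class ScoredAssistant:
    name: str
    triad_total: int
    evidence: str  # hypothetical labels: "independent" or "vendor"


def procurement_step(assistant, high_total=HIGH_TOTAL):
    """Map a TRIAD total and its evidence base to a procurement step."""
    if assistant.triad_total < high_total:
        # The recommendation only addresses high totals.
        return "outside this rule; evaluate case by case"
    if assistant.evidence == "independent":
        return "adopt with greater confidence"
    return "run a local pilot with pre-registered outcomes before scale-up"


# Totals from Section 7.1; evidence labels paraphrase the review's wording.
COHORT = [
    ScoredAssistant("ALEKS", 42, "independent"),
    ScoredAssistant("Pearson AI Instructor Tool", 40, "vendor"),
]

if __name__ == "__main__":
    for assistant in COHORT:
        print(f"{assistant.name}: {procurement_step(assistant)}")
```

Making the threshold an explicit parameter keeps the judgment visible: an institution that sets a different bar for a “high” total changes one value rather than the logic.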
8.2. Ethical and Societal Considerations
The ethical implications of AI deployment in education extend well beyond privacy, and a responsible evaluation must weigh several risks together. Government and non-governmental reports caution that generative systems may amplify surveillance, infringe on student privacy, or exacerbate inequities [54, 55, 56]. Algorithmic bias arises when training data reflect historical inequities, which can yield unequal support across demographic groups; the Swedish survey discussed earlier already shows uneven trust across gender and field of study [68]. Student deskilling is a distinct concern: assistants that supply fluent answers can erode the productive struggle through which durable skills form, so designs that withhold direct answers and scaffold reasoning are preferable to those optimized for convenience. The environmental impact of large models is non-trivial, since training and serving LLMs consume substantial energy and water, which argues for right-sizing models, caching, and retrieval over repeated generation where feasible. Labor displacement for instructors and teaching assistants is a further risk: tools that automate explanation, grading, and question generation can reduce demand for human instructional roles, and institutions should treat these tools as augmentation rather than replacement and involve faculty in adoption decisions. Across all of these, publishers and institutions must adopt robust safeguards, including rigorous content review, data minimization, transparency about training data and algorithms, and opt-out mechanisms that preserve student agency. Academic integrity remains a prominent concern: educators should design assessments that value reasoning over recall and should integrate AI literacy into curricula so that students use these tools responsibly. Addressing these considerations alongside technical and pedagogical factors enables generative AI to catalyze equitable, effective, and human-centered education.
8.3. Limitations
Three limitations qualify the findings. First, the panel comprised two expert raters; although reliability was high (Section 3.7) and the anchored rubric constrains judgment, a larger and more diverse panel would reduce residual bias. The framework is published as a reusable template precisely so that others can broaden the panel. Second, the evaluation is a mid-2025 snapshot of a fast-moving market, so individual scores will date even as the method persists; every table carries the snapshot date for this reason. Third, the cohort is limited to U.S. publisher-built assistants and one comparator (Khan Academy), so the conclusions should not be generalized to campus-built systems or to non-U.S. markets without re-extraction of evidence.
9. Conclusions
The integration of AI within the educational sector has progressed from basic computer-assisted instruction to the deployment of advanced generative models. These models offer the potential for scalable, personalized academic support that could transform pedagogical practice and improve learning outcomes. The history of AI in higher education reveals a recurring theme: technological breakthroughs inspire educational innovation, but meaningful improvement ultimately depends on sound pedagogy, ethical deployment, and rigorous evaluation. The decade-by-decade account underscores how early pioneers laid the conceptual foundations that later found expression in commercial platforms.
The recent entry of major U.S. publishers into the AI tutoring market signals a competitive race to integrate generative technologies into educational ecosystems. Cengage, Macmillan, McGraw Hill, Pearson, Wiley, Quizlet, and Chegg have each launched products with distinctive modalities and coverage. However, their success will hinge on delivering verifiable value beyond generic chatbots, addressing privacy and integrity concerns, and aligning with instructors’ needs. Future research should systematically evaluate learning outcomes with these tools, explore hybrid models that blend generative AI with established cognitive frameworks, and investigate how AI can support equity and accessibility in higher education.
AI assistants are becoming integral to higher education’s digital learning ecosystems. Using the TRIAD and JTBD frameworks, this analysis demonstrates that well-designed tools can enhance comprehension, streamline assessment creation, and offer continuous support. However, adoption should proceed at the “speed of trust,” emphasizing transparency, privacy, and human oversight. Publishers and institutions should collaborate with researchers and educators to collect evidence on effectiveness, address equity and bias, and design interfaces that empower rather than replace human judgment. As AI continues to mature, aligning technological innovation with pedagogical purpose will determine whether these assistants become transformative partners in teaching and learning or remain supplemental tools.