1. Introduction
Philology education, encompassing interconnected scholarly disciplines within the humanities, lays the foundation for advanced language and textual studies. Disciplines such as Classical Philology, Medieval Philology, and Modern Greek Philology, alongside Linguistics, require not only foundational linguistic comprehension but also interpretive rigor, cultural–historical contextualization, and critical engagement with often fragmentary or archaic source materials [
1]. As a discipline at the crossroads of language, history, and close reading, philology demands more than mere grammar memorization [
2]. However, fulfilling these rigorous demands presents significant pedagogical challenges, exacerbated in large-scale instruction. Instructors, particularly in traditional settings, struggle to deliver consistent, personalized feedback due to large class sizes or limited resources [
3]. Feedback often occurs in large batches (e.g., end-of-term papers), limiting iterative improvement during crucial close-reading exercises. Consequently, students lack timely support to improve their grammar, vocabulary, and narrative skills, hindering the formative assessment and effective self-monitoring of complex linguistic and interpretive skill development [
4]. Without real-time interaction, students struggle to correct misunderstandings in textual analysis, parsing, and historical context comprehension. The consequences include surface-level text understanding, undetected interpretive errors, reduced confidence in applying philological reasoning, limited engagement, and low satisfaction [
5,
6]. These issues mirror broader challenges in philology education, where resource constraints and the complexity of interpretive tasks often hinder scalable, high-quality instruction.
Prior efforts to support philology students—such as small-group tutorials, peer-review workshops, and printed workbook exercises—have offered valuable scaffolding but face the persistent challenges of scalability, consistency, and personalization [
2,
3]. While peer feedback fosters collaboration, it may propagate misconceptions, and static workbooks cannot respond to individual learning needs or evolving gaps in understanding [
1]. Digital tools like online quizzes or early rule-based grammar checkers have been used, yet they lack the semantic depth and flexibility required for interpretive, open-ended tasks central to philological analysis [
6].
To address these limitations, digitization and Artificial Intelligence (AI)—especially generative AI (GenAI)—present transformative opportunities for scalable, context-aware, and adaptive learning [
7,
8]. Earlier AI solutions (e.g., keyword-based chatbots or rule-based NLP tools) have demonstrated some educational utility but remain limited by rigid interaction patterns, shallow contextualization, and an inability to navigate philology’s interpretive complexity [
2,
9]. For instance, early NLP tools often failed to parse archaic syntax or reconcile divergent historical interpretations, producing reductive, oversimplified outputs. In contrast, conversational AI chatbots, powered by large language models (LLMs) like OpenAI’s ChatGPT 4 (
https://chatgpt.com, accessed on 10 May 2025), Google’s Gemini (
https://gemini.google.com, accessed on 10 May 2025), and DeepSeek (
https://chat.deepseek.com, accessed on 10 May 2025) can simulate dialogue, generate contextually relevant content, and adapt to queries [
10]. They promise to assist students in navigating complex texts, analyzing linguistic features, and contextualizing content, while aiding instructors in managing large classes [
9,
10,
11]. Early applications in philology show promise for automating tasks, providing tailored feedback, reconstructing manuscripts [
12], and enhancing engagement and tailored learning experiences across disciplines [
13,
14]. Unlike static or rule-based systems, GenAI dynamically generates context-aware content, simulates scenarios, and adapts to individual trajectories, enabling pedagogical innovation [
15]. Conversational chatbots using GenAI technology can also generate teaching materials and discussion prompts, model close-reading steps [
16,
17], draft lesson plans (e.g., breaking down Homer’s syntax [
18]), produce annotated editions (e.g., Shakespeare’s sonnets [
19]), create interactive videos/quizzes with instant feedback [
20], and support glossary/quiz/translation exercise creation, freeing instructors for deeper discussion [
21].
However, realizing this potential in philology is constrained by significant limitations. While promising personalized support, AI chatbots face persistent user experience (UX) flaws and pedagogical misalignment. Core issues include the misinterpretation of archaic languages (e.g., Homeric Greek), the generation of biased analyses, and the oversimplification of linguistic changes [
22]; clunky interfaces, poor navigation, and a lack of accessibility features, disrupting focus [
2]; the prioritization of speed over depth, leading to shallow summaries that stifle critical discussion [
23]; and a lack of intuitive workflows for philology-specific tasks (e.g., annotating fragments and tracing word origins) [
24]. Poorly integrated quizzes/videos feel disconnected from course goals, and rigid interaction designs (e.g., linear Q&As) discourage exploratory thinking [
25]. For AI tools to succeed in philology, they require human-centered design: not only intuitive interfaces adaptable to learning styles and customizable feedback for nuanced tasks, but also the seamless integration of multimedia resources [
26,
27] to balance automation with the precision, creativity, and critical rigor essential to the discipline.
Despite growing interest, a critical research gap persists in evaluating both the usability and pedagogical effectiveness of AI chatbots specifically for philology. Few studies rigorously assess their alignment with user-centric design principles and the subject-specific demands of philological training (e.g., [
12,
28]). Key unanswered questions concern interface intuitiveness for target users; the accuracy and pedagogical alignment of generated philological content; and how interface design variations across platforms influence UX and perceived learning support. While features like interactive quizzes, branching videos, adaptive formatting, multimedia annotations, and scaffolded exercises show potential to enhance engagement and bridge rigor with accessibility [
11,
15,
26], most platforms lack design frameworks tailored to philology’s unique needs (e.g., reconstructing fragmentary texts and tracing semantic shifts) [
29].
To bridge this gap, this mixed-methods case study assesses three conversational AI chatbots (ChatGPT, Gemini, DeepSeek) for philological content creation and support. It examines their usability (UX, learning experience) and pedagogical soundness (the accuracy and educational value of AI-generated content in streamlining content creation, delivering real-time feedback, and enriching philological learning experiences). Specifically, this study’s objectives are the following: (1) to measure UX and learning experience, identifying usability issues unique to philological tasks; (2) to explore how design elements influence perceived instructional effectiveness, assessing system usability and learning experience in line with AI literacy; and (3) to evaluate the chatbots’ ability to produce accurate, pedagogically valuable philological content through quantitative metrics and qualitative insights. The research question is the following: Do conversational AI chatbots provide engaging, AI-literate learning experiences in philological content creation—and which design principles (e.g., adaptive interfaces, contextual multimedia, iterative feedback) most effectively optimize their educational value?
2. Background
The integration of GenAI into education has transformed multiple disciplines, including the humanities. Philology—the study of historical texts—has begun exploring GenAI’s potential to enhance both research and pedagogy [
30]. To frame this exploration, this section reviews GenAI’s functional capabilities, interdisciplinary applications, and critical gaps in philological contexts. Conversational AI tools such as ChatGPT, Gemini, and DeepSeek provide intuitive interfaces with real-time support that aligns with Vygotsky’s Zone of Proximal Development (ZPD), as a recent study by Humaira Mariyam and Karthika mentioned [
31]. The same authors argue that these systems scaffold learning by allowing students to engage in context-aware, multi-turn dialogues and receive dynamic, personalized feedback. Through advanced natural language understanding (NLU) and generation (NLG), chatbots sustain coherent conversations, handle follow-ups, acknowledge errors, and adapt responses based on evolving user input [
11]. Their multimodal capacities (text, voice, multimedia) support a more fluid and intuitive communication style, moving beyond static, command-driven platforms [
6,
7]. Key functionalities—such as interactive highlighting, semantic search, and live collaboration—foster a deep engagement with classical texts and support critical analysis. By promoting metacognitive strategies—like goal setting, monitoring, and reflection—AI chatbots align with Zimmerman’s self-regulated learning model [
32], enabling learners to evaluate AI-generated suggestions, set hypotheses, and exercise judgment. This scaffolding process fosters feedback literacy by helping students interpret and act on machine-generated feedback [
31,
32].
Recent interdisciplinary studies further contextualize GenAI’s role in humanities education. Systematic reviews [
33,
34,
35] demonstrate that AI chatbots enhance student engagement, provide scalable support, and deliver tailored learning experiences across diverse disciplines. Furthermore, NLP and machine learning algorithms, which are widely used in AI, offer powerful tools for analyzing linguistic patterns, deciphering ancient scripts, and translating historical documents. These capabilities can significantly augment the work of philologists by automating time-consuming tasks and uncovering insights that might be overlooked through manual analysis. For instance, AI platforms can assist in identifying stylistic features, thematic elements, and syntactic structures across vast corpora, thereby facilitating more comprehensive literary analyses [
36]. Moreover, AI chatbots and virtual assistants are increasingly being utilized in language instruction, providing personalized feedback and adaptive learning experiences. More specifically, integrating these AI tools into philological education also fosters AI literacy among students, equipping them with the skills to critically understand, evaluate, and effectively use AI systems, which is essential for responsible and informed participation in an increasingly digital academic context [
37].
Any possible collaboration between philology and AI has far-reaching implications for fields such as literary studies, archeology, history, anthropology, and even modern computational linguistics. By combining human expertise with machine precision, scholars can gain deeper insights into the evolution of languages, the transmission of knowledge, and the preservation of cultural memory [
21,
31]. For example, scholars have systematically reviewed the use of GenAI chatbots in higher education, documenting both their pedagogical promise and emergent challenges. A recent review of 23 empirical studies found that chatbots enhance student engagement and provide scalable feedback, but laments the absence of common theoretical frameworks grounded in theories of human learning [
1]. Li et al. [
13] applied Activity Theory to identify how rules, tools, and the division of labor jointly shape student outcomes in chatbot-supported language learning, proposing design models for teacher–AI collaboration and highlighting five future research directions—including human–chatbot co-creation and out-of-school contexts. Lozić and Štular’s [
38] comparative analysis of six LLM-based chatbots in scientific writing showed high factual accuracy for ChatGPT-4 but underscored AI’s struggle to generate genuinely original content in humanities disciplines. A systematic review of 1,319 Scopus records (1985–2023) mapped the evolution of AI research, noting a surge in education-related applications since 2021 and flagging ethical concerns around bias and misinformation [
33]. Dempere et al. [
39] emphasized GenAI’s transformative potential—from automated grading to research assistance—while cautioning against privacy and safety risks in educational settings.
Within the digital humanities, exploratory studies have begun deploying conversational AI chatbots for philological tasks such as text transcription, semantic tagging, and interactive storytelling. Critical essays interrogate AI’s political economy and its implications for interpretive depth in language-focused disciplines [
27,
37], while pilot implementations incorporate adaptive exercises and multimedia embeds to bolster digital literacy in classical language courses [
40,
41]. Elsewhere, mixed-methods research on English-learning chatbots in higher education documented students’ evolving self-regulated learning behaviors and underscored the necessity of UX-driven interface design to sustain long-term engagement [
42]. Yet, few studies target philological content creation per se, leaving open questions about domain-specific model fine-tuning, narrative coherence in ancient language contexts, and the role of cultural heritage ethics in AI-mediated instruction. This gap leaves unanswered how scaffolded AI-mediated feedback, informed by constructivist and socio-cultural principles, can foster both domain-specific creativity and self-regulated learning in historical linguistics.
The potential applications of AI in philology education are diverse and significant, offering ways to augment, accelerate, or even transform traditional practices [
12,
31,
40,
42]:
Enhanced textual analysis: AI can process and analyze vast amounts of text far exceeding the human capacity. This enables large-scale quantitative analysis (distant reading), the identification of linguistic patterns, tracking thematic evolution across corpora, automated named entity recognition (identifying people, places, organizations), and topic modeling, revealing previously unseen connections or trends within textual data.
Accelerated information discovery: LLMs can act as powerful research assistants, capable of searching extensive digital archives, summarizing complex scholarly arguments, retrieving relevant historical or cultural context for specific passages, and identifying related works, thereby significantly accelerating the literature review and background research phases.
Linguistic support and language learning: AI tools can offer support for understanding complex grammatical structures, provide preliminary translations (requiring careful human verification), assist in deciphering difficult scripts, and potentially serve as interactive tutors for learning ancient or less commonly taught languages relevant to philological study.
Assistance in textual criticism and editing: While requiring significant oversight, AI could potentially assist in collating manuscript variations, identifying patterns of scribal error, suggesting possible emendations based on linguistic models, or even aiding in the hypothetical reconstruction of fragmentary texts by analyzing patterns in extant related materials.
Innovative pedagogical tools: AI offers possibilities for creating dynamic and personal learning experiences in philology. This includes generating interactive exercises, providing instant feedback on student interpretations (within limits), adapting content complexity to individual student needs, and acting as a Socratic dialogue partner to stimulate critical thinking about texts.
Improved accessibility: AI can help make complex philological knowledge and primary source materials more accessible to a wider audience, including students at earlier stages, researchers in adjacent fields, and the general public, by providing summaries, explanations, and contextual information in understandable language.
To summarize, conversational chatbots like ChatGPT, Gemini, and DeepSeek offer transformative potential for philological education through their advanced natural language processing (NLP) capabilities. These chatbots align with pedagogical theories such as Vygotsky’s ZPD [
31] by providing real-time, adaptive feedback that scaffolds learning. Key functionalities include the following:
Context-aware dialogue: Multi-turn interactions that adapt to user input, acknowledge errors, and refine responses [
11].
Multimodal interaction: The integration of text, voice, and multimedia to support intuitive engagement with historical texts [
32].
Metacognitive scaffolding: Tools like interactive highlighting and semantic searching promote self-regulated learning, enabling students to set hypotheses, monitor progress, and reflect on AI-generated feedback [
31,
32].
Beyond philology, GenAI has demonstrated utility in humanities education and computational linguistics:
Automated analysis: NLP algorithms decipher ancient scripts, identify syntactic patterns, and uncover thematic elements in large corpora [
36].
AI literacy development: Chatbots equip students with skills to critically evaluate AI outputs, fostering responsible engagement with digital tools [
37].
Scalable support: Systematic reviews highlight chatbots’ abilities to enhance engagement and deliver personalized feedback across disciplines [
33,
34,
35].
Despite these opportunities, interdisciplinary studies reveal significant critiques of GenAI in educational contexts:
Ethical risks: Biases in training data, potential misinformation, and privacy concerns threaten equitable implementation [
33,
37].
Interpretive oversimplification: An over-reliance on automated analyses risks eroding critical engagement with historical texts’ nuance and cultural context [
27].
UX limitations: Poorly designed interfaces disrupt focus, while rigid interaction patterns hinder exploratory learning [
42].
Originality gaps: AI struggles to generate genuinely novel insights into humanities disciplines, often reproducing existing knowledge [
38].
The current literature underscores that GenAI’s educational potential is tempered by ethical, pedagogical, and technical challenges. Although AI chatbots enhance scalability and accessibility, their deployment requires the careful mitigation of biases, the preservation of interpretive rigor, and a user-centric design to avoid superficial learning outcomes. While research on ChatGPT’s usability and impact exists, studies comparing state-of-the-art AI chatbots (like Gemini and DeepSeek) specifically for philological content creation and instruction are scarce. This mixed-methods study fills this gap by assessing ChatGPT, Gemini, and DeepSeek using usability metrics (e.g., System Usability Scale—SUS), content quality analysis, and interface evaluation. It uniquely bridges UX research [
30,
43] with pedagogical effectiveness in philology, offering actionable insights for educators, developers, and researchers. The present study aims to accomplish the following:
Measure usability and satisfaction: Addressing the lack of discipline-specific UX research in the humanities beyond STEM education [
15,
17].
Evaluate content accuracy using GenAI technology: Assessing chatbots’ ability to generate philologically sound content, mitigating the risks of factual errors highlighted in reviews [
3,
34,
37,
43,
44].
Investigate design impact: Exploring how interface features (e.g., chain-of-thought, multimedia) influenced instructional effectiveness, integrating UX and learning experience [
37,
42,
45].
3. Materials and Methods
3.1. Research Design
A UX mixed-methods case study was conducted to critically evaluate the usability, accessibility, and pedagogical efficacy of AI-generated content for philological courses within real-world learning contexts. This approach aligns with this study’s commitment to human-centered design, ensuring that technological innovations—while technically sound—are also intuitive, inclusive, and responsive to the needs of diverse learners. By prioritizing UX methodologies, such as iterative testing, observational analysis, and participant feedback, this research sought to bridge the gap between AI-driven content creation and its practical application in education [
45].
One ethical reason for using UX methods was to avoid assuming, without evidence, that AI tools automatically solve educational issues [
46]. For instance, while AI algorithms can generate content efficiently, their pedagogical value depends on learners’ ability to navigate, comprehend, and emotionally engage with the material. Through structured usability tasks and post-intervention surveys, we identified potential mismatches between algorithmic outputs and user expectations, such as unclear instructions or culturally insensitive visuals. These insights directly informed refinements to the AI system, ensuring the final tutorials were not only accurate but also learner-focused.
This design-inclusive UX research approach using conversational AI chatbots placed emphasis on design artifacts (i.e., interactive quizzes and learning scenarios for philological content creation) as knowledge generators, enabling us to capture nuanced user expectations and emotional responses that standard surveys might miss. To gain further insights into experiential qualities, this mixed-methods case study evaluation [
47] thus balanced objective usability scores (quantitative data from validated questionnaires) with rich, design-informed reflections based on undergraduate students’ opinions without restrictions (qualitative data from semi-structured interviews). In summary, adopting this approach draws on a robust theoretical foundation at the intersection of AI and philology instructional contexts, provides design activities as a form of inquiry, and delivers practical insights into both functional and experiential dimensions that conventional usability testing alone cannot uncover.
3.2. Participant Demographics and Academic Context
A total of 26 undergraduate students participated (n = 26; M = 21.3 years, SD = 1.01). Participants (n = 15 females and n = 11 males) were distributed across academic years: 35% in their first year, 27% in their second, 23% in their third, and 15% in their fourth year or higher. All participants were enrolled in a philology-focused module titled “Digital Tools in Historical Text Analysis”—a core component of their degree programs in Classical Philology, Medieval Studies, or Linguistics. This module integrates computational methods (e.g., NLP, digital corpora) into traditional philological training, targeting students with varying levels of technical proficiency. Because this module emphasizes iterative close-reading exercises and in-class peer workshops, participants began this study with at least one semester’s experience in philological analysis—ensuring a baseline familiarity with diachronic grammar, paleography, and critical apparatus usage.
Participants’ prior exposure to philology coursework included the following:
First-year students: Introductory courses in Greek/Latin grammar and textual criticism.
Second-year students: Intermediate modules on paleography and historical linguistics.
Third- and fourth-year students: Advanced seminars on digital humanities and manuscript reconstruction.
Following a structured 25 min interaction session—during which they completed typical tasks such as drafting an email, solving a simple coding prompt, and composing a short creative story—participants rated each agent before choosing the most appropriate for them. Each student was randomly assigned to one of three conversational AI chatbots: ChatGPT (n = 9), Google Gemini (n = 9), or DeepSeek (n = 8). The frequency of AI chatbot usage (e.g., ChatGPT) varied significantly: 42% reported using them “often” or “very often,” primarily for academic support, while 23% “rarely” or “never” used them. Notably, 70% of third- and fourth-year students used AI tools “often,” compared to 25% of first-years, reflecting the module’s progressive integration of digital methods into upper-level coursework. Familiarity with GenAI platforms (e.g., Code Generator) was lower, with 58% rating themselves as “moderately” or “slightly” familiar. Fifty-four percent of participants reported using AI for learning—most frequently for text summarization and language translation—while none had any programming experience. For example, one participant described using ChatGPT to “simplify complex philosophical texts.”
Participants in interdisciplinary programs (e.g., digital humanities) were more likely to experiment with GenAI (65%) than those in purely humanities-focused departments (22%), highlighting the module’s role in bridging philology with emerging technologies. Opinions on AI’s potential in philological studies were polarized. Approximately 46% expressed optimism, emphasizing benefits like “personalized feedback” and “efficient content creation,” while 38% raised concerns about “over-reliance reducing critical thinking” and “AI-generated content lacking depth.” Despite moderate AI usage, 62% emphasized the need for “ethical guidelines” to govern AI in education, aligning with their prioritization of critical engagement over automation. This recruitment from prior modules thus situates this study’s findings within a cohort already versed in the philological method, highlighting how module design and prior curriculum exposure shape both AI adoption and critical attitudes toward AI-generated tools.
3.3. Ethical Considerations
This study adhered to rigorous ethical principles, prioritizing participant autonomy, transparency, and accountability across both the human and AI-driven dimensions. To safeguard participants’ rights and well-being, the researchers (the authors) obtained written and verbally informed consent, emphasizing voluntary participation and the freedom to withdraw at any stage without consequences [
48]. Ethical protocols specific to AI integration focused on mitigating risks associated with algorithmic bias and content integrity. During the development of AI-generated video tutorials, the transparency of the underlying algorithms was scrutinized to ensure the outputs aligned with educational objectives and avoided perpetuating stereotypes or misinformation. The training datasets underwent systematic evaluation for potential biases, while AI-generated narratives and visuals were refined to enhance cultural sensitivity, relatability, and pedagogical effectiveness [
46]. Moreover, UX research methods reinforced transparency and accountability, core tenets of this study’s ethical framework. For example, feedback on the AI tutorials’ relatability and accessibility was systematically incorporated into the design process, mitigating risks of bias or exclusion that automated systems might inadvertently introduce.
Prior to the intervention, participants received detailed briefings outlining this study’s purpose, data collection procedures, and the implications of interacting with AI-generated content [
49]. Consent forms explicitly addressed participants’ rights, data privacy safeguards, and the voluntary nature of involvement. This study further complied with institutional ethical approvals and the Declaration of Helsinki, reinforcing its commitment to integrity across all research phases. By harmonizing traditional human-subject protections with emergent AI-related considerations, the design upheld accountability while advancing innovation in educational technology. Participants were clearly informed that their involvement in this study was both anonymous and voluntary, with no obligation to continue participating at any point. They were also assured that no personal data falling under the EU General Data Protection Regulation (GDPR) framework would be gathered, and their information would be handled with strict confidentiality. To uphold transparency, participants received a full disclosure of this study’s actual objectives upon its conclusion. The Aristotle University of Thessaloniki ethics review board formally reviewed and approved this research protocol (ID: 118394/2025) in accordance with established guidelines for human subject protection.
3.4. Procedure
The current study was conducted within a course titled “Artificial Intelligence and Immersive technologies in education,” which focuses on equipping participants with skills to effectively incorporate diverse technological tools and media resources into philological courses for secondary education. This aligns with the course’s objectives of fostering practical approaches to designing learning units, understanding foundational concepts and theories of instructional technology, and applying formative assessments and alternative evaluation strategies within technology-enhanced educational contexts. It also reimagines philological pedagogy by harnessing conversational AI chatbots to develop rich, scenario-driven assessments—ranging from dynamic quizzes to retro arcade–style challenges—that provide instant, tailored feedback on narrative elements exactly when students engage in writing tasks, thereby boosting motivation, engagement, and overall writing quality. In response to the persistent challenge of limited in-person feedback, this research leverages Gemini and ChatGPT to enable educators and learners to design interactive quizzes and/or retro-based assessments—complete with branching “what-if” story scenarios and pixel-style gamified interfaces—that deliver adaptive, personalized guidance on grammar, vocabulary, plot structure, and character development at the moment of need. By empowering users to craft diverse content-creation modes—such as scenario-based quizzes, arcade-style trivia challenges, word categorization, and branching scenarios combined with crypto-word puzzles or choose-your-own-adventure assessments—the AI system ensures continuous, just-in-time support that elevates student engagement and writing proficiency [
50].
To address the above procedure, the present study integrates a GenAI chatbot—powered by large language models—to provide immediate, tailored feedback on key narrative elements such as grammar, vocabulary, plot structure, and scenario design. Studies on AI-supported chatbots (e.g., ChatGPT, Gemini, DeepSeek) indicate that such tools not only correct surface errors but also pose Socratic questions that help students reconceptualize their drafts, treating ‘mistakes’ as opportunities for growth and boosting self-confidence.
The platform’s user experience design combines three pillars:
Real-time adaptive feedback, where the AI suggests alternative phrasings, flags narrative inconsistencies, and proposes “what-if” scenarios to motivate creative exploration.
Interactive multimedia tools, embedding videos, quizzes, and exercises that contextualize storytelling principles and engage multiple learning modalities.
Personalization, wherein the system calibrates feedback intensity and focus based on each student’s evolving skill level, ensuring sustained support throughout the creative process.
To better understand the participants’ experience, this study incorporated a dedicated training phase to familiarize both educators and learners with the conversational AI chatbot interface for learning scenario and interactive quiz creation. The instructor (the author), on the one hand, was guided through a series of interactive tutorials demonstrating how to leverage natural language prompts within the chatbot to define learning objectives, outline narrative structures, and specify the parameters for automated feedback on key literary elements [
41]. This included practical exercises in crafting branching “what-if” scenarios by describing initial narrative situations and then prompting the AI to generate potential consequences based on hypothetical student choices. Furthermore, instructors learned to define the specific types and levels of feedback they wished the AI to provide, ranging from simple error identification to detailed explanations and suggested revisions [
39,
42].
Participants, on the other hand, received instruction on how to interact with the chatbot to generate their own interactive quizzes based on course materials. This involved learning to formulate prompts that would guide the AI in creating multiple-choice questions, true/false statements, or short answer prompts related to specific texts or concepts. Because conversational AI chatbots share the same chat-based user interface design, participants could input base narratives, identify key decision points for branching scenarios, and specify targeted feedback parameters for grammar, vocabulary, plot structure, and learning scenario development. They were also shown how to instruct the AI to provide immediate feedback on their quiz designs, ensuring alignment with learning objectives and pedagogical goals. This conversational approach to content creation empowered participants to actively shape their learning experience, fostering a deeper engagement with the pedagogical potential of GenAI.
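To make the branching “what-if” structure described above more concrete, the sketch below shows one hypothetical way such a scenario could be laid out as a plain, browser-ready HTML page, the same format participants used for their quizzes; the scenario wording and the choice of simple anchor links for the branches are illustrative assumptions rather than actual chatbot output from this study.

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>What-if scenario: Iphigenia's choice (illustrative sketch)</title>
</head>
<body>
  <h1>Decision point</h1>
  <p>Iphigenia recognises Orestes at the altar of Artemis. What does she do?</p>
  <!-- Each choice links to a consequence section that, in the workflow described
       above, would be drafted by the chatbot and then revised by the student. -->
  <ul>
    <li><a href="#escape">Help him escape with the statue of Artemis</a></li>
    <li><a href="#duty">Follow Taurian custom and proceed with the rite</a></li>
  </ul>
  <h2 id="escape">Consequence A</h2>
  <p>The siblings plan a ruse of ritual purification by the sea, as in Euripides' plot
     (chatbot-drafted continuation, to be checked against the source text).</p>
  <h2 id="duty">Consequence B</h2>
  <p>A counterfactual continuation in which duty overrides kinship, used to prompt
     class debate on agency and Ancient Greek ethical frameworks.</p>
</body>
</html>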
3.5. Teaching Intervention Process
Central to this intervention is the AI’s capacity to provide immediate, personalized feedback on key narrative elements, including grammar, vocabulary, plot structure, and instructional scenario design [
51]. By leveraging the LLMs embedded in AI chatbots, the system aims to foster students’ self-confidence while improving their digital literacy through interactive multimedia components (e.g., videos, exercises, quizzes). More specifically, the instructor (the author) in this teaching intervention served as an expert scaffolder and metacognitive facilitator by demonstrating how to frame precise AI prompts—such as “Create a retro-style interactive quiz about Iphigenia in Tauris in HTML code”—and explaining why embedding quiz logic directly in HTML enhances coding fluency and content relevance. The instructor guided learners through a structured evaluation checklist to jointly assess AI suggestions and to spotlight common pitfalls, such as formulaic vocabulary substitutions that can dilute narrative voice. This process tightens the link between language exploration and coding fluency. By first modeling prompt engineering techniques (including basic HTML tag usage and syntax), the instructor made clear how HTML’s declarative, plain-text markup with minimal syntax affords immediate visual feedback in any browser without requiring programming logic. Through strategic, open-ended questioning—e.g., “Does this suggestion align with your narrative goals?”—they fostered metacognitive reflection, prompting students to articulate why they might accept, refine, or reject particular AI-generated edits that were eventually saved in their digital notepad (.html file). The use of HTML-based story generation and retro-game-style quizzes aligns with game-based learning principles that foster motivation, engagement, and scaffolded challenge through instant feedback and reward mechanics. Embedding quiz logic in HTML not only cultivates technical competence but situates learning in authentic, context-rich tasks analogous to real-world digital publishing. In highlighting both the advantages and limitations of generative feedback, the instructor empowered participants to develop self-regulated decision-making skills, deepening their feedback literacy and technical competence [
5].
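As a concrete illustration of the artifact type described in this paragraph, the following minimal sketch shows how a single retro-styled quiz item can be written in plain, declarative HTML, with the native details/summary element providing click-to-reveal feedback in any browser and no scripting; the question wording and styling are illustrative assumptions, not a reproduction of the prompts or outputs generated in this study.

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Iphigenia in Tauris: retro quiz (illustrative sketch)</title>
  <style>
    /* Retro, pixel-style aesthetic approximated with a monospace font and high-contrast colours */
    body { background: #000; color: #0f0; font-family: "Courier New", monospace; padding: 2em; }
    .question { border: 2px solid #0f0; padding: 1em; max-width: 40em; }
    summary { cursor: pointer; color: #ff0; }
  </style>
</head>
<body>
  <h1>Iphigenia in Tauris: Quick Check</h1>
  <div class="question">
    <p>1. Who rescues Iphigenia from the sacrifice at Aulis in the play's backstory?</p>
    <ol type="a">
      <li>Athena</li>
      <li>Artemis</li>
      <li>Apollo</li>
    </ol>
    <!-- The details element gives instant, click-to-reveal feedback without any programming logic -->
    <details>
      <summary>Reveal answer and feedback</summary>
      <p>Correct: (b) Artemis, who substitutes a deer and carries Iphigenia to the land of the Taurians.</p>
    </details>
  </div>
</body>
</html>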
Drawing on Vygotsky’s ZPD and social mediation, the instructor additionally scaffolded prompt engineering (e.g., HTML tags) and co-constructed meaning through dialogue, with the AI acting as an “adaptive More Knowledgeable Other” that dynamically adjusted its feedback complexity to learner responses while human oversight ensured alignment with philological and pedagogical goals. Zimmerman’s cyclical model of self-regulated learning in AI-supported instruction [31,32] likewise underpins the intervention, which embeds phases of forethought (crafting precise prompts aligned with learning outcomes), performance control (iterative refinement of AI outputs guided by a structured checklist), and self-reflection (metacognitive questioning such as “Does this suggestion align with your narrative goals?”). This cycle fosters learners’ autonomy in monitoring, evaluating, and adapting their use of AI feedback, strengthening both feedback literacy and technical competence—essential skills in philological inquiry—and clarifies how AI-supported scaffolding, interactive multimedia, and metacognitive prompts jointly facilitate learners’ construction of knowledge, social mediation of meaning, and development of self-regulated learning habits. Overall, HTML storytelling and retro quizzes are not embellishments: they are theory-informed strategies that integrate constructivism, game-based learning, and self-regulation to turn AI feedback into meaningful, discipline-specific learning affordances in philology.
This 6 h teaching intervention aimed to evaluate how tailored AI feedback contributes to creative skill development and addresses challenges in adopting such digital tools in humanities education. Prior informed consent was obtained, emphasizing anonymized data usage and transparency about AI limitations (e.g., potential biases in LLM outputs). In this study, each participant chose one of three chatbots—ChatGPT, Gemini, or DeepSeek—as their sole conversational assistant for all philological tasks. The second objective was to evaluate that chatbot’s ability to produce philologically accurate and pedagogically sound content in a purely conversational context based on user interaction data and feedback. We also implemented a risk-mitigation plan—complete with instructor oversight—to prevent over-reliance on AI and foster critical engagement.
At the beginning, participants engaged directly with the AI chatbots—crafting and refining prompts to explore character motivations, reinterpret pivotal scenes, and simulate ethical debates. They iteratively adjusted the chatbot’s responses (for instance, constraining language to historical contexts or requesting modernized dialogue options), then reflected on how the AI-generated drafts aligned—or clashed—with their own interpretations. Learning scenarios were required to align directly with both curriculum standards and core philosophical inquiries drawn from Iphigenia in Tauris. Participants crafted precise prompts—such as “Design a lesson plan exploring cultural conflict, identity, and sacrifice through Iphigenia, integrating discussion questions on agency and Ancient Greek ethical frameworks”—to guide the chatbot in generating multi-stage lesson outlines. These AI-generated scenarios included priming activities, guided text analyses, structured debate modules, and reflective writing prompts, each mapped to specific learning outcomes.
Next, participants used these scenarios as working drafts: they iteratively refined the chatbot’s outputs by adding historical or ethical constraints (for example, “Use terminology authentic to Ancient Greek culture” or “Frame the debate as a modern human-rights tribunal”), then critically evaluated how each suggestion deepened—or sometimes conflicted with—their own interpretations. By documenting every prompt, accepted revision, and discarded option in an “AI Use Statement,” learners not only honed their prompt engineering and digital literacy skills but also maintained pedagogical coherence, ensuring that each AI-assisted adaptation faithfully served both their creative goals and the course’s philosophical objectives.
Phase 1: Orientation and Baseline Assessment (1 h)
Participants complete a pre-intervention survey probing prior AI experience, comfort with digital platforms, and confidence in creative writing. A short diagnostic writing task (e.g., drafting a 250-word story opening) is administered to establish individual creative baselines.
Learning objectives are explicitly shared with students to align expectations (e.g., “By the end, you will critically evaluate AI suggestions to refine narrative coherence”).
This baseline ensures that both AI personalization parameters and instructor scaffolding can be finely tuned to each learner’s profile.
Phase 2: Guided AI Interaction (4 h; four 1 h sessions):
AI Tool Tutorial (10 min): The instructor demonstrates prompt engineering strategies—showing students how to elicit grammar corrections, vocabulary enrichment, plot suggestions, and philosophical “what if” scenarios from ChatGPT or similar LLMs. Examples of ineffective vs. effective prompts are contrasted to reduce ambiguity.
Writing Task with Real-Time AI Feedback (40 min): Students draft or revise short stories and instructional scenarios while the AI chatbot offers on-the-fly revisions. The instructor circulates, validating AI suggestions, clarifying misunderstandings, and encouraging deeper engagement with the feedback.
Structured Peer Review Workshop (40 min): Participants exchange drafts and jointly evaluate which AI recommendations to accept or refine using a guided checklist (e.g., “Prompt: Create a retro-style interactive quiz about Iphigenia in Tauris in HTML code” or “Prompt: Does this suggestion align with your narrative voice and help distinguish correct from incorrect answers?”). The instructor highlights common pitfalls in AI feedback (e.g., formulaic vocabulary substitutions) and facilitates metacognitive reflection.
Instructor Debrief (10 min): Highlights effective AI–human collaboration, discusses quiz design principles, and addresses challenges encountered. An “FAQ” document is updated iteratively based on recurring student struggles.
Phase 3: Post-Intervention Evaluation (1 h)
Students submit their final, AI-enhanced narratives and complete the SUS questionnaire to quantify platform intuitiveness and helpfulness. A post-intervention writing task mirrors the baseline diagnostic to measure skill growth objectively.
Semi-structured interviews follow, inviting reflection on creative growth, AI collaboration, and any adoption challenges. These interviews are transcribed and coded for thematic analysis (e.g., “agency in AI interactions,” “perceived creativity barriers”).
Instructors synthesize the findings into a dashboard for transparent reporting. This phase was organized as follows:
Final Narratives and Quizzes: Students submit polished stories enriched by AI guidance and deploy their interactive or “retro-game” quizzes to the cohort.
System Usability and Reflection: The instructor facilitates a focus group on tool experiences, guiding reflection on AI’s pedagogical value and game-based learning efficacy (
Figure 1).
Figure 2 below is a collage showcasing a classroom activity where participants engaged with the ancient Greek tragedy “
Iphigenia in Tauris” using AI chatbots to create learning scenarios and interactive assessment tasks. It includes scenes of students working together on computers, screenshots of quiz interfaces, and mobile interactions with the learning content. The activity combines web-based access and QR codes for interactive quizzes, such as crosswords, timeline matching, and multiple-choice quizzes. The retro-style quiz design and digital platforms highlight how technology can make classical literature more engaging and accessible for learners. The chatbots’ server stores students’ interactive quizzes. Participants design and develop learning scenarios and interactive tools (e.g., quizzes) using conversational AI tools such as Gemini.
By embedding the instructor’s facilitative expertise alongside ChatGPT’s dynamic scenario generation and retro-game quiz creation, this enhanced intervention strategically cultivates philological creativity, critical thinking, and digital literacy.
3.6. Instruments
To evaluate student experiences regarding usability, learning, and interaction with AI chatbots within a philological context, several validated instruments were employed. The selection of instruments was driven by three criteria: (a) alignment with this study’s theoretical framework (usability, learning experience, AI literacy), (b) psychometric robustness in prior educational technology research, and (c) feasibility for integration into a single, coherent survey to minimize participant burden.
Demographic Questionnaire: A 12-item questionnaire collected background information (e.g., age, year of study, major). This tool included categorical and open-ended responses (e.g., “How often do you use AI applications such as chatbots (e.g., ChatGPT, Gemini)”) to contextualize participant profiles (
Appendix A).
System Usability Scale (SUS): The SUS [
50] was chosen for its widespread use in evaluating the perceived usability of technology systems. Its brevity (10 items) and validated reliability (α = 0.85 in this study) made it ideal for capturing user satisfaction without overwhelming participants, directly addressing this study’s first objective. The 10-item scale, translated into Greek, used a 5-point Likert scale (1 = Strongly Disagree to 5 = Strongly Agree). Example items are provided in Appendix B.
Cronbach’s alpha (α) for the translated SUS in this study was α = 0.85, indicating high internal consistency, aligning with prior validations [
52].
3. AI Literacy Scale: Adapted from Wang et al. [35], this 12-item scale assessed proficiency across four dimensions (awareness, use, evaluation, ethics) using a 5-point Likert scale (1 = Strongly Disagree to 5 = Strongly Agree), in line with this study’s second objective of assessing system usability and learning experience alongside AI literacy. Example items (Appendix D) include the following:
Awareness/Evaluation: “Recognize AI Technology—I can identify the AI technology employed in the applications and products I use.”
Usage: “Skillful AI Use—I can skillfully use AI applications or products to help me with my daily work.”
Evaluation: “Choose AI Tool—I can choose the most appropriate AI application or product from a variety for a particular task.”
Ethics: “Ethics Compliance—I always comply with ethical principles when using AI applications or products.”
The subscale reliabilities were α = 0.79 (awareness), α = 0.81 (use), α = 0.76 (evaluation), and α = 0.85 (ethics).
4. Learning Experience Questionnaire: Adapted from Chiang et al. [
25], this 14-item validated instrument measured learning approaches and critical thinking engagement within the AI-integrated environment. This instrument was selected for its focus on
critical thinking and
self-regulated learning—key outcomes in philological education. Its subscales (Reaction, Learning, Behavior, Results) aligned with this study’s third objective of assessing both affective and cognitive engagement with AI tools. Responses were recorded on a 5-point Likert scale (1 = Strongly Disagree to 5 = Strongly Agree). Example items (
Appendix C) include the following:
Reaction (Learning Quality): “Learning Support—This AI chatbot helps me learn”.
Learning (Learning Attitudes): “Interest—This AI chatbot makes the course more interesting.”
Behavior (Learning Interest): “Reduced Resistance—This AI chatbot reduces my resistance to the course.”
Results (Learning Outcomes): “Better Outcomes—This AI chatbot helps me achieve better learning outcomes.”
The subscales demonstrated acceptable reliability: Reaction (learning quality; α = 0.82), Learning (learning attitudes; α = 0.78), Behavior (learning interest; α = 0.78), and Results (learning outcomes; α = 0.78).
5. Semi-Structured Interviews: Post-intervention interviews gathered qualitative insights via 3 open-ended questions (e.g., “1. What do you consider to be the advantages and disadvantages of using generative AI platforms in the teaching of language and literature courses?”). These open-ended questions provided qualitative depth, capturing nuanced attitudes (e.g., trust in AI-generated textual analyses); the responses were thematically coded to triangulate the quantitative findings (
Appendix E).
A summary of instruments is presented in
Table 1 below.
A Cronbach’s alpha coefficient above 0.70 was obtained for all scales and subscales, representing acceptable reliability according to the criteria outlined by Cortina [
52].
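For reference, the reported reliability coefficients follow the standard Cronbach’s alpha formulation,

\[ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sigma_{X}^{2}}\right), \]

where k is the number of items in a scale, \(\sigma_{i}^{2}\) is the variance of item i, and \(\sigma_{X}^{2}\) is the variance of the summed scale score; values above 0.70 are conventionally regarded as acceptable.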
The instruments were designed to complement one another:
The usability (SUS) and AI Literacy scales addressed functional and technical interactions with chatbots;
The learning experience and interview data captured pedagogical and ethical dimensions, ensuring a holistic evaluation;
Demographic data contextualized the responses (e.g., academic year, prior AI experience), enabling subgroup analyses (e.g., differences between novice and advanced students).
The total survey comprised 48 Likert-scale items (12 demographics, 10 SUS, 14 LEQ, 12 AI Literacy) and 3 open-ended interview questions. To mitigate participant fatigue, the following were applied:
Cultural adaptation: phrasing was streamlined (e.g., simplifying idiomatic terms in the SUS).
Pilot testing with 7 students confirmed that the survey could be completed in 15–20 min.
Online administration via Google Forms allowed participants to complete the survey asynchronously, reducing time pressure.
To ensure linguistic and conceptual equivalence, all Likert-scale instruments (SUS, learning experience questionnaire, AI literacy scale) underwent rigorous validation for cultural appropriateness after translation into Greek. This process included independent back-translation by bilingual experts to verify semantic accuracy, followed by a review by a panel of two Greek philologists and educational technologists to assess item clarity and relevance within the local academic context. For example, phrases such as “AI system unnecessarily complex” (SUS) were adjusted to avoid idiomatic ambiguity. A pilot test with 7 students from the target demographic confirmed that participants interpreted the scale anchors (e.g., “Strongly Agree”) consistently, aligning with International Test Commission (ITC) guidelines for cross-cultural adaptation [
53]. This step ensured that the psychometric properties of the original scales were preserved while accounting for cultural nuances in technology perception and learning practices. All questionnaires appear in
Appendix A,
Appendix B,
Appendix C,
Appendix D,
Appendix E,
Appendix F (all items were translated to Greek).
To evaluate the usability, learning outcomes, and interactions with conversational AI chatbots in philology education, this study employed a mixed-methods approach using validated instruments culturally adapted for Greek students. After providing informed consent and agreeing to data-security protocols, participants completed all questionnaires via Google Forms. A 12-item demographic questionnaire with open-ended prompts (e.g., prior AI experience) contextualized participant backgrounds. The SUS assessed chatbot usability via 10 Likert-scale items (1–5) probing confidence and perceived complexity. A 14-item learning experience questionnaire (critical thinking: α = 0.82; self-regulated learning: α = 0.78) evaluated pedagogical impacts, while a 12-item AI literacy scale measured awareness (α = 0.79), use (α = 0.81), evaluation (α = 0.76), and ethics (α = 0.85) using Likert responses. Post-intervention semi-structured interviews (3 open-ended questions) explored qualitative themes like trust in AI feedback. Finally, participants provided any remaining demographic details (gender: male, female, or diverse; exact age; and current student status) before submitting the survey. This integrated approach—combining robust quantitative scales with rich qualitative insights—ensured the comprehensive evaluation of usability, learning outcomes, and user interaction within an AI-augmented philology context.
3.7. Data Collection
The data were collected using a mixed-methods approach to capture both quantitative metrics and qualitative insights into the effectiveness of AI chatbots in philological education. Several tools and methods were employed for data gathering. Standardized surveys, utilizing a 5-point Likert scale aligned with the SUS, assessed user satisfaction and self-efficacy regarding AI interfaces. Participants also engaged in interactive sessions with AI chatbots (ChatGPT, Gemini, DeepSeek) over six instructional hours, during which their interactions were recorded, and real-time feedback on grammar, narrative structure, and vocabulary was documented to measure the chatbots’ impact on creative outputs. Furthermore, five students were observed in-depth through case studies to analyze their creative processes and challenges when using AI tools, providing context for the quantitative data. Finally, semi-structured interviews were conducted with educators, guided by open-ended questions about usability and content quality, to explore their perceptions of AI’s role in enhancing pedagogical practices.
3.8. Data Analysis
The collected data were analyzed using both quantitative and qualitative methods, addressing this study’s dual focus on usability and creative outcomes. For the quantitative analysis, survey responses were coded numerically and processed using SPSS (version 25) to calculate mean SUS scores and identify statistically significant differences in user satisfaction between the chatbots. Descriptive statistics, including frequencies and percentages, were used to summarize participants’ self-reported confidence in creative writing tasks before and after the integration of AI tools. Graphs and charts were also employed, following guidelines for presenting large datasets, to visualize trends such as improvements in narrative coherence or vocabulary diversity.
The qualitative analysis involved a thematic analysis of interview transcripts and observational notes (collected via Google Forms), which helped identify recurring themes such as “AI’s role in overcoming creative blocks” or “frustrations with interface limitations”. The data from the case studies were triangulated with the survey results to contextualize the quantitative findings and ensure a holistic understanding of user experiences. Finally, quantitative and qualitative results were synthesized to evaluate the chatbots’ overall impact on creativity and usability; for instance, high SUS scores reported for Gemini were cross-referenced with interview feedback highlighting its “intuitive grammar suggestions” to validate the findings.
4. Results
This section presents the findings from the mixed-methods evaluation of conversational AI chatbots in philological education. The results are organized to address this study’s three core objectives:
UX assessment: Perceived system usability (SUS scores). This subsection addresses Objective 1 by evaluating each chatbot’s perceived usability via SUS, with a focus on interface intuitiveness and overall system complexity.
AI Literacy: Competence in AI awareness, use, evaluation, and ethics. This aligns with Objective 2, emphasizing students’ technical confidence but limited awareness of biases in historical text analysis.
Learning Experience: Critical thinking engagement and self-regulated learning (learning experience questionnaire). These findings directly respond to Objective 3, revealing AI’s role in enhancing engagement but also its limitations in providing deeper cognitive scaffolding.
Semi-structured interviews: Interview data triangulated the quantitative findings.
The quantitative results are presented first, followed by qualitative insights from the interviews. The subheadings map directly to this study’s theoretical framework and research questions.
4.1. UX Assessment
For this study, twenty-six university students were randomly assigned to evaluate one of three conversational AI agents: ChatGPT (nine participants), Google Gemini (nine participants), or DeepSeek (eight participants). Immediately following their use session, each participant completed the full ten-item SUS, rating statements on a one (“Strongly disagree”) to five (“Strongly agree”) scale. Responses to odd-numbered items were scored as the response minus one, and responses to even-numbered items as five minus the response; the ten adjusted scores were then summed and multiplied by 2.5 to produce a composite usability score ranging from 0 to 100. The internal consistency of the SUS was assessed via Cronbach’s alpha for each agent’s ten-item set, ensuring that the scale reliably measured a unified usability construct across all three platforms.
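For clarity, the SUS scoring rule and the reliability check described above can be expressed as a short, illustrative Python sketch. The example responses are hypothetical, and the Cronbach’s alpha function simply follows the standard formula rather than any study-specific implementation.

```python
import numpy as np

def sus_score(responses):
    """Compute a single SUS score (0-100) from ten 1-5 Likert responses.

    Odd-numbered items (1, 3, 5, 7, 9) contribute (response - 1);
    even-numbered items (2, 4, 6, 8, 10) contribute (5 - response);
    the ten adjusted values are summed and multiplied by 2.5.
    """
    responses = np.asarray(responses)
    if responses.shape != (10,):
        raise ValueError("Expected exactly ten item responses.")
    odd = responses[0::2] - 1    # items 1, 3, 5, 7, 9
    even = 5 - responses[1::2]   # items 2, 4, 6, 8, 10
    return (odd.sum() + even.sum()) * 2.5

def cronbach_alpha(item_matrix):
    """Cronbach's alpha for a participants-by-items response matrix."""
    x = np.asarray(item_matrix, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)
    total_variance = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical example: one participant's ten SUS responses.
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 3]))  # -> 77.5
```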
Among the nine students who used ChatGPT, the mean SUS score was 75.4 (SD = 6.5), indicating strong usability well above the commonly cited acceptability threshold of 68. Participants gave ChatGPT a mean ease-of-use rating of M = 3.8 (SD = 0.6) and a reuse intent of M = 4.2 (SD = 0.7). Complexity was rated low (M = 2.7, SD = 0.9), and most felt they did not need help (M = 2.3, SD = 1.1). On individual items, participants expressed a clear intention to reuse the system, high confidence in their ability to use it, and a minimal need for technical assistance; ChatGPT also garnered one of the lowest complexity ratings, indicating that users did not find its interface overly complicated. The internal consistency of ChatGPT’s ten-item SUS was excellent (α = 0.89), confirming that its usability perceptions coalesced around a coherent construct.
The nine participants assigned to Google Gemini produced a mean SUS score of 70.2 (SD = 8.0), which also exceeds the acceptability benchmark yet trails slightly behind ChatGPT’s performance. Those assigned to Gemini reported M = 3.5 (SD = 0.8) on ease of use and M = 4.0 (SD = 0.8) on reuse intent, as well as slightly higher complexity (M = 3.0, SD = 1.0); the need for help hovered around M = 2.5 (SD = 1.2). While Gemini’s flexibility in handling multimodal inputs (text, images, code, and audio) is one of its hallmarks, this versatility translated into slightly higher perceived complexity compared with its peers. Nonetheless, participants still reported a strong willingness to reuse the agent and found it reasonably easy to learn. Gemini’s ten-item scale demonstrated excellent reliability (α = 0.91), indicating that, despite minor differences in individual item responses, the overall usability construct remained stable.
DeepSeek emerged as the highest-rated agent among the eight students who evaluated it, achieving a mean SUS score of 78.6 (SD = 5.9). DeepSeek users gave a mean ease-of-use score of M = 3.9 (SD = 0.5) and a reuse intent of M = 4.3 (SD = 0.6), and reported low complexity (M = 2.4, SD = 0.8) and a low need for help (M = 2.1, SD = 1.0). Users highlighted its seamless conversational flow when crafting creative prompts and appreciated its responsive feedback, which together fostered a high level of engagement and intent to return. The internal consistency for DeepSeek’s SUS was the strongest of the three (α = 0.93), reinforcing confidence that the ten items collectively captured a unified perception of usability. Figure 3 presents these usability ratings, with each item condensed to a short label.
To sum up, all three conversational AI agents delivered robust usability profiles in this student cohort: the SUS means ranged from 70.2 to 78.6, ease-of-use and reuse-intent ratings were consistently above the midpoint, perceived complexity was low, participants reported a minimal need for help, and Cronbach’s alpha values spanned 0.89 to 0.93, indicating strong internal consistency. The results demonstrate that, regardless of differences in underlying architecture or feature set, users found each system sufficiently intuitive, minimally complex, and highly engaging. DeepSeek’s slightly higher overall score suggests a marginal edge in user satisfaction, but ChatGPT and Gemini nonetheless achieved ratings well within the “acceptable” to “excellent” range. These findings affirm that ChatGPT, Gemini, and DeepSeek all function effectively as conversational platforms capable of delivering seamless interactions that foster strong user confidence and repeat usage.
The results of this study suggest that it is not just these three systems but the conversational agent paradigm more broadly that performs well. Interfaces framed as natural dialogues—rather than menus or form-based workflows—tend to deliver intuitive interactions. Users experience a low cognitive load, feel confident using the system, and express a strong willingness to return. Future work can therefore treat the “conversational agent” as a general design principle. Regardless of the underlying model or feature set, systems that engage users through human-like, turn-based interaction are likely to offer superior usability outcomes.
4.2. AI Literacy
Since all three AI agents (ChatGPT, DeepSeek, and Gemini) demonstrated remarkably similar and high usability, further data analysis considers them collectively as ‘AI chatbots’. This approach is justified because the conversational interface, rather than individual model differences, was identified as the primary factor influencing user experience and engagement. Participants’ self-reported AI literacy fell in the moderate–high range on a 1–5 agreement scale (where 1 = “disagree” and 5 = “strongly agree”). Across the 12 individual items, mean agreement ranged from 3.25 to 4.25. The highest item means (M = 4.25, SD = 0.46) were for “I use AI tools to improve my work efficiency” and “I can pick the best AI app/product for a specific task,” indicating that most respondents feel comfortable applying AI solutions in their daily activities. The lowest item mean (M = 3.25, SD = 0.71) was for “I can choose the right solution from a smart agent’s options,” suggesting some uncertainty when evaluating multiple AI recommendations.
When aggregated into the four conceptual dimensions of AI literacy, the average scores were fairly consistent: understanding AI (M = 3.80, SD = 0.81), using AI tools (M = 3.88, SD = 0.35), evaluating AI (M = 3.75, SD = 0.66), and ethics and privacy (M = 3.84, SD = 0.53). Respondents reported strong confidence in their ability to use AI technologies (Dimension 2) and to apply ethical/privacy principles in AI contexts (Dimension 4). The dimension of critical evaluation (Dimension 3) scored slightly lower, indicating room for enhanced training on how to assess the strengths and limitations of AI solutions.
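As an illustration of how such dimension-level scores can be derived from item responses, the following Python sketch averages items within each dimension per participant and then summarizes across participants. The item-to-dimension mapping shown here is a hypothetical placeholder for the instrument’s actual assignment.

```python
import pandas as pd

# Hypothetical mapping of the 12 AI-literacy items to the four dimensions;
# the real assignment follows the survey instrument used in the study.
DIMENSIONS = {
    "understanding_ai": ["item_01", "item_02", "item_03"],
    "using_ai":         ["item_04", "item_05", "item_06"],
    "evaluating_ai":    ["item_07", "item_08", "item_09"],
    "ethics_privacy":   ["item_10", "item_11", "item_12"],
}

def dimension_scores(responses: pd.DataFrame) -> pd.DataFrame:
    """Return per-dimension mean and SD across participants.

    `responses` is a participants-by-items DataFrame of 1-5 Likert ratings.
    """
    rows = []
    for dim, items in DIMENSIONS.items():
        # Average the dimension's items for each participant first,
        # then summarize the participant-level scores.
        per_participant = responses[items].mean(axis=1)
        rows.append({"dimension": dim,
                     "mean": per_participant.mean(),
                     "sd": per_participant.std(ddof=1)})
    return pd.DataFrame(rows)
```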
Within the “Understanding AI” dimension, participants generally agreed that they could distinguish smart from non-smart devices (M = 3.80, SD = 0.87) and recognize the presence of AI in everyday apps (M = 4.00, SD = 0.65). Knowledge of AI’s potential benefits scored highest (M = 4.09, SD = 0.93), reflecting a broad awareness of AI’s value. Under “Using AI,” the items revealed that while nearly all participants felt adept at navigating their AI tools (M = 4.00, SD = 0.00), fewer found it easy to learn brand-new AI platforms (M = 3.50, SD = 0.58). For “Evaluating AI,” respondents were confident in their post-use assessment of AI’s strengths and limitations (M = 4.00, SD = 0.82) but less so in selecting among multiple AI agent suggestions (M = 3.25, SD = 0.71). Finally, “Ethics and Privacy” items were uniformly high: participants reported a strong commitment to ethical AI use (M = 4.13, SD = 0.35) and vigilance around data security (M = 4.00, SD = 0.00). Figure 4 presents these item-level AI literacy ratings, with each item condensed to a short label.
Overall, these findings suggest that while users feel relatively confident in accessing and employing AI tools, they may benefit from targeted support in two areas: (a) learning to navigate new AI platforms more easily and (b) honing critical evaluation skills when faced with competing AI recommendations. The consistently high ethics and privacy scores suggest that data protection is already a key priority, providing a strong foundation for developing more advanced AI responsibility training modules.
4.3. Learning Experience
Although not specifically designed for educational purposes, conversational AI chatbots received overwhelmingly positive evaluations across multiple facets of the learner experience within this environment. At the item level, three statements—“This AI chatbot motivates me to participate in practical tasks,” “This AI chatbot reduces my resistance to the course,” and “This AI chatbot helps me achieve better learning outcomes”—each garnered exceptionally high mean ratings (M = 4.69, SD = 0.47). This indicates near-unanimous strong agreement that AI features not only draw students into active participation but also support measurable achievement gains. Close behind, participants endorsed the statement “This AI chatbot helps me learn” (M = 4.54, SD = 0.73), suggesting that AI scaffolding may bolster conceptual clarity in complex content creation.
Beyond these top-ranked items, learners also agreed that the AI environment is generally effective in promoting their learning (M = 4.50, SD = 0.51) and enhances their overall satisfaction (M = 4.23, SD = 0.43). The quality of learning was likewise rated very positively (M = 4.04, SD = 0.20), as was the perception that the platform makes the course more playful and engaging (M = 4.04, SD = 0.20). These mid-to-high scores reflect not only cognitive but also affective benefits, indicating that AI tools can create both intellectually and emotionally supportive contexts.
Nevertheless, two clusters of items exhibited relatively lower agreement, pointing to areas for further refinement. Ease of task completion (“It is easy to complete tasks in this learning environment”) received a more moderate mean (M = 3.58, SD = 0.58), and items addressing the system’s perceived meaningfulness and value, such as “This learning environment is worthwhile to try” and “It makes sense to use this environment,” both averaged just above the midpoint (M = 3.54, SD = 0.58 for each). These data suggest that while participants are generally excited by AI-driven affordances, they may still encounter friction when navigating specific tasks or question the intrinsic value of the environment without additional orientation or contextual framing. Figure 5 presents these item-level learning-experience ratings, with each item condensed to a short label.
When synthesizing across conceptual dimensions—motivation, engagement, cognitive support, and perceived utility—the environment appears strongest in its ability to spark motivation and reduce resistance, moderately strong in delivering cognitive support and satisfaction, and comparatively weaker in usability efficiency and perceived meaningfulness. Future iterations could therefore prioritize streamlined interfaces and clearer value propositions, for example, embedding guided walkthroughs to lower the barrier to task completion and providing explicit learning objectives or real-world use cases to reinforce the environment’s relevance. Such targeted enhancements would likely elevate the mean ratings in the mid-range items and yield a more uniformly high user experience.
Addressing AI chatbot limitations and the role of AI literacy is another critical issue that needs to be highlighted. While the usability and engagement outcomes of ChatGPT, Gemini, and DeepSeek were highly positive, our study also uncovered important challenges related to chatbot limitations that merit critical reflection. During philological tasks, several instances of AI-generated output demonstrated inaccuracies or misalignments with established scholarly rigor. For example, some chatbot responses contained anachronistic vocabulary or simplified historical contexts that did not fully capture the cultural nuances required for the accurate interpretation of classical texts. One representative case involved a chatbot’s suggestion to modernize the dialogue in
Iphigenia in Tauris by introducing terminology inconsistent with Ancient Greek ethical frameworks. Learners and instructors identified this as problematic during joint evaluation sessions, prompting discussions about preserving authenticity versus enhancing accessibility. In another instance, formulaic vocabulary substitutions proposed by AI risked diluting the original narrative voice, highlighting the system’s tendency toward generic outputs when faced with ambiguous prompts [
31]. These examples illustrate the inherent limitations of current LLM-based chatbots: while capable of generating contextually relevant and coherent text, they can produce outputs that reflect biases, factual errors, or overly generalized language. To mitigate these issues, instructors played a pivotal role as expert scaffolders, guiding learners through structured checklists to critically assess AI suggestions, encouraging metacognitive reflection, and facilitating dialogic questioning such as “Does this suggestion align with your philological objectives?” or “What might be lost or gained by adopting this edit?” Through this mediated engagement, learners developed crucial critical AI literacy skills, learning to balance trust and skepticism when interacting with generative systems. This process aligns with Zimmerman’s cyclical model of self-regulated learning, fostering autonomy in monitoring, evaluating, and adapting AI feedback [
32]. These data suggest that incorporating such critical evaluation mechanisms is essential to prevent over-reliance on AI outputs and to maintain academic rigor.
A comparative text analysis was conducted to address platform-specific differences and provide a more granular evaluation of ChatGPT, Gemini, and DeepSeek beyond their usability scores. While all three agents were rated positively, a qualitative analysis of their dialogue transcripts revealed distinct patterns in their interaction styles, content coherence, and domain specificity. ChatGPT was praised for its conversational fluency and ability to sustain multi-turn exchanges with contextual memory. Students described its tone as more natural and its suggestions as well articulated; however, it occasionally provided overly verbose or generalized responses that lacked philological nuance. For instance, when asked to simulate a tribunal debate involving Iphigenia, ChatGPT introduced modern legal terminology that conflicted with the ancient context. Gemini, in contrast, delivered more structured outputs and was especially effective in breaking down multi-stage tasks, such as lesson plan creation. Its responses often included bullet-pointed steps and clearer formatting.
However, some users found Gemini’s answers to be more formulaic and less responsive to follow-up clarification prompts. DeepSeek demonstrated the highest degree of historical and terminological fidelity. In a prompt requiring the accurate use of Ancient Greek moral frameworks, DeepSeek incorporated culturally appropriate language and references to the Greek concepts of
honor (τιμή) and
fate (μοίρα). Despite this, several students found DeepSeek’s tone less accessible and noted that it struggled with creative narrative generation, particularly when modern pedagogical framing was required. Representative examples from each chatbot are included in
Appendix F. This presents illustrative examples of chatbot outputs in response to student prompts requiring the reinterpretation of Euripides’s “
Iphigenia in Tauris”. Dialogue samples from ChatGPT, Gemini, and DeepSeek are compared based on thematic alignment, philological rigor, and dramatic coherence. These excerpts address platform-specific affordances and limitations, supporting a more nuanced comparative analysis. They also illustrate not only the strengths of each system but also moments of misalignment, such as hallucinated facts, anachronistic interpretations, and an inconsistent adherence to prompt constraints, that required student reflection or instructor intervention. Such findings underscore the need for ongoing human oversight and critical engagement with AI-generated outputs. This comparative insight emphasizes that while all three platforms offer usable and engaging interfaces, their pedagogical value varies significantly depending on task type, learner goals, and domain accuracy requirements. Selecting the appropriate AI chatbot, and training students to adapt their prompt strategies accordingly, is therefore essential for effective integration in philology education.
To summarize, although conversational AI offers valuable adaptive feedback and usability advantages, its limitations underscore the need for pedagogical frameworks that explicitly address error detection, bias awareness, and contextual appropriateness. By integrating representative dialogue excerpts and analyses of problematic outputs, our study contributes a balanced perspective, advocating for the inclusion of critical AI literacy as a foundational component in humanities education.
4.4. Semi-Structured Interview
The semi-structured interviews revealed that all participants found AI chatbots to be both engaging and supportive of their learning. Individual users frequently described the chatbot as a “patient tutor,” praising its ability to provide on-demand explanations (“When I got stuck, the bot would rephrase concepts until I understood”). Many solo learners noted that working alone with the chatbot fostered their self-directed exploration, as they felt free to experiment with different prompts and follow their own curiosity (“I could ask about instructional use in philology at 2 a.m. without feeling judged”). However, several also reported barriers: without a peer’s feedback, they sometimes second-guessed the chatbot’s suggestions or felt unsure about the accuracy of its responses, remarking that “I was not always confident that the examples it gave were correct,” and that this occasionally slowed their progress.
In contrast, others spoke with noticeably more enthusiasm about the social dimension of their experience. They described how brainstorming together generated fresh ideas (“I sparked new ideas by brainstorming together”) and how discussing the chatbot’s answers aloud often led to deeper insights (“I would challenge the bot’s reasoning, and then I would refine my understanding together with the AI chatbot”). That said, coordinating schedules to meet live and dividing tasks equitably sometimes proved difficult, with one participant observing that “half the time I spent planning what I would ask, rather than diving into the content.” Taken together, these findings suggest that while individual engagement with AI chatbots promotes autonomy and self-paced discovery, students working collaboratively can amplify creative idea generation and critical reflection, albeit at the cost of increased coordination effort.
5. Discussion
The present study investigated the usability of three prominent conversational AI agents—ChatGPT, Google Gemini, and DeepSeek—among university students, alongside their self-reported AI literacy and the perceived learning experience within an AI-enhanced environment. Furthermore, semi-structured interviews provided qualitative insights into individual and collaborative interactions with AI chatbots. The findings collectively provide a clear picture of the current state of AI integration in higher education, highlighting areas of strength and opportunities for refinement. Importantly, this study contributes a rigorous, mixed-methods evaluation approach that can serve as a replicable framework for assessing AI systems in educational contexts, with relevance beyond philological domains. In doing so, it bridges the gap between interface-level usability studies and deeper pedagogical engagement with AI-driven tools.
5.1. Conversational AI Agent Usability
This study’s usability assessment, conducted using the SUS, revealed that all three conversational AI agents—ChatGPT, Google Gemini, and DeepSeek—demonstrated robust usability profiles, with mean SUS scores ranging from 70.2 to 78.6. These scores consistently fell within the “acceptable” to “excellent” range, well above the commonly cited acceptability threshold of 68 [
52,
53]. DeepSeek emerged as the highest-rated agent, achieving a mean SUS score of 78.6, slightly outperforming ChatGPT (75.4) and Gemini (70.2). This marginal edge for DeepSeek, while not statistically significant, suggests a slightly higher level of user satisfaction, potentially attributable to its seamless conversational flow and responsive feedback, as highlighted by users. ChatGPT also demonstrated strong usability, aligning with previous research that indicates high user satisfaction with its interface and functionality [
5,
14]. Gemini, while still exceeding the acceptable benchmark, had a slightly higher perceived complexity. This could be due to its multimodal capabilities, as participants may have perceived its versatility (handling text, images, code, and audio) as increasing cognitive load, a common challenge in designing versatile systems [
25].
Across all three AI chatbots, participants reported high ease-of-use and strong reuse intent, along with a low perceived complexity and minimal need for help. The excellent internal consistency of the SUS scores for each agent (Cronbach’s alpha values spanning 0.89 to 0.93) further reinforces the reliability of these usability perceptions [
53]. These consistently high ratings across different agents suggest that the conversational agent paradigm, characterized by natural dialogue-based interfaces, generally leads to intuitive interactions and a low cognitive load for users [
34]. This suggests that system design grounded in conversational turn-taking is a valuable generalizable principle, supporting future AI integration regardless of the backend model. This study contributes to the existing literature by offering comparative usability benchmarks, supported by validated SUS metrics and high internal consistency, that can serve as reference points for evaluating other chatbots in educational and non-educational settings. This finding aligns with the growing body of literature emphasizing the positive user experience of conversational AI in various domains, including education [
3,
16]. The results support the notion that future work can treat the “conversational agent” as a general design principle, regardless of the underlying model or feature set, as systems engaging users through human-like, turn-based interactions are likely to offer superior usability outcomes.
5.2. AI Literacy
Participants reported a moderate-to-high level of self-reported AI literacy, with overall mean agreement scores ranging from 3.25 to 4.25 on a 5-point scale. This generally positive self-assessment is encouraging and aligns with recent discussions on the increasing necessity of AI literacy in a technology-driven world [
17,
35]. Specifically, respondents felt highly confident in their ability to use AI tools to improve their work efficiency and to select the best AI application for a specific task. This indicates a practical understanding and application of AI in daily activities, which is a positive sign for the integration of AI into academic and professional workflows.
However, the lowest item mean for “I can choose the right solution from a smart agent’s options” suggests a nascent area for development. This finding points to a potential challenge in the critical evaluation of AI-generated content or recommendations, an area that has also been highlighted as a critical component of AI literacy [
41]. Similarly, within the aggregated conceptual dimensions of AI literacy, while “understanding AI,” “using AI tools,” and “ethics and privacy” scored consistently high, “evaluating AI” scored slightly lower. This suggests that while users are confident in deploying AI and are mindful of ethical considerations, they may require more targeted support in critically assessing the strengths and limitations of AI solutions. This concern has been raised by other researchers regarding the need for enhanced training in evaluating AI outputs to ensure accuracy and reliability, particularly in academic contexts [
4,
38]. The uniformly high scores in “ethics and privacy” are particularly noteworthy, indicating that data protection and responsible AI use are already top-of-mind for this cohort. This emphasizes that while AI familiarity is growing, critical literacy—especially regarding the evaluation of AI outputs—remains underdeveloped. These findings underscore the need for educational interventions that go beyond tool usage, addressing epistemological trust, source transparency, and AI-generated misinformation. By including both quantitative and qualitative measures of AI literacy, this study provides a richer conceptual framework that future studies can adapt across disciplines. This provides a strong foundation for building more advanced AI responsibility training, as discussed by Al-Zahrani and Alasmari [
46].
5.3. Learning Experience
The AI-enhanced learning environment received overwhelmingly positive evaluations, particularly concerning its ability to motivate engagement and support learning outcomes. Statements such as “This learning environment motivates me to engage in practical tasks” and “This learning environment helps me achieve better learning outcomes” garnered exceptionally high mean ratings (M = 4.69), indicating near-unanimous strong agreement. This finding corroborates existing research on the positive impact of AI tools in fostering student engagement and improving learning efficacy in various educational settings [
9,
44]. The high rating for understanding “large-scale analyses of philological concepts better” (M = 4.54) further suggests that AI scaffolding can be particularly effective in demystifying complex domains, aligning with studies on AI’s role in enhancing conceptual clarity [
25].
Beyond motivation, the environment was also perceived as generally effective in promoting learning (M = 4.50) and enhancing overall satisfaction (M = 4.23), indicating both cognitive and affective benefits. This is consistent with the idea that AI tools can create intellectually stimulating and emotionally supportive learning contexts [
5,
27]. However, areas for refinement were identified. Items related to ease of task completion (M = 3.58) and the perceived meaningfulness and value of the environment (M = 3.54) received comparatively lower, albeit still moderate, agreement. This suggests that while students are excited by AI-driven affordances, they may still encounter usability friction or question their intrinsic value without additional orientation or contextual framing. This highlights the importance of not just integrating AI but doing so with clear pedagogical objectives and streamlined interfaces, as suggested by previous research on effective technology integration in education [
9,
26]. These findings demonstrate that GenAI-enhanced environments can not only support cognitive outcomes but also activate metacognitive engagement when appropriately scaffolded. This resonates with Vygotsky’s ZPD [
31] and Zimmerman’s self-regulated learning model, suggesting that AI tools, when embedded within a purposeful instructional design, can serve as dynamic scaffolds rather than static tools [
32].
5.4. Semi-Structured Interview Insights
The qualitative data from the semi-structured interviews provided rich insights into the varied experiences of interacting with AI chatbots. Participants universally found AI chatbots to be engaging and supportive of their learning, often describing them as “patient tutors.” This aligns with a growing consensus on the beneficial role of AI chatbots in providing personalized, on-demand explanations and fostering self-directed learning [
13,
39]. The ability to experiment with different prompts and explore curiosity without judgment was a significant advantage for solo learners, fostering autonomy and self-paced discovery. This supports the notion of AI as a valuable tool for individualized instruction [
9,
24]. Notably, our findings contribute to a growing body of work that positions AI chatbots not only as cognitive tools but also as mediators of collaborative knowledge construction. This expands our current understanding of GenAI’s pedagogical potential, emphasizing the social affordances of shared AI interactions.
However, individual users also reported a lack of confidence in the chatbot’s suggestions and concerns about accuracy without peer feedback. This underscores the need for robust verification mechanisms or a clearer understanding of AI’s limitations, echoing the challenges in “evaluating AI” identified in the AI literacy section. This also highlights the importance of critical thinking skills when interacting with AI, as Pellas [
44] also noted.
Conversely, participants enthusiastically highlighted the social benefits of their experience, noting how collaborative brainstorming and discussions around chatbot responses led to deeper insights and more creative ideas. This finding supports the value of collaborative learning environments, even when mediated by AI [
6,
45,
46]. The act of challenging the chatbot’s reasoning and refining one’s understanding together with a peer amplified critical reflection, a crucial higher-order thinking skill. This aligns with research suggesting that social interaction can enhance learning outcomes [
1,
7]. By capturing both solo and paired learner interactions, our study illustrates how AI chatbots can be repositioned as facilitators of dialogic learning rather than mere answer engines. These insights support the emerging view that meaningful human–AI collaboration must be designed as socially embedded and critically reflective, especially in complex domains like philology.
This study offers a multi-dimensional contribution to the current research on AI in education. First, it provides empirical usability benchmarks across three leading conversational agents, filling a gap in humanities-specific UX evaluations. Second, it highlights areas of strength and developmental need in AI literacy, including ethical awareness and critical evaluation skills. Third, it demonstrates how GenAI environments can support both individual and collaborative learning processes, offering a replicable structure grounded in educational theory. Finally, by integrating SUS metrics, AI literacy frameworks, and qualitative user experiences, we offer a comprehensive, adaptable methodology for future GenAI studies across disciplines.
6. Conclusions
Based on the findings across usability assessment, AI literacy survey, learning experience questionnaire, and semi-structured interviews, this study offers a comprehensive evaluation of conversational AI agents in higher-education learning environments. All three AI systems—ChatGPT, Gemini, and DeepSeek—demonstrated strong usability profiles, with SUS scores consistently exceeding the standard acceptability threshold and Cronbach’s alpha values indicating excellent internal consistency. Among them, DeepSeek received the highest overall usability ratings, but all agents were found to be intuitive, minimally complex, and conducive to repeat use. These results suggest that conversational agents, regardless of their platform-specific features, offer a generally effective interface model for student interaction, grounded in a low cognitive load and high user confidence.
Participants’ self-reported AI literacy further reinforced this positive outlook. Students exhibited high levels of comfort using AI tools for efficiency and decision-making, though somewhat lower confidence in critically evaluating multiple AI-generated recommendations. While they demonstrated a strong awareness of ethical and privacy considerations, the findings indicate a need for targeted support in navigating unfamiliar AI platforms and enhancing evaluative judgment—skills increasingly vital in an AI-saturated academic landscape.
In terms of learning experience, the AI-enhanced environment was especially effective in fostering motivation, reducing resistance to engagement, and supporting conceptual understanding of large-scale philological analyses for interactive content creation and instruction. Participants appreciated both the cognitive support and the affective benefits of AI integration, although lower scores for ease of task completion and perceived meaningfulness point to a need for improved onboarding and clearer contextual framing. Interview data added qualitative depth, revealing how individual chatbot use supports self-paced exploration and autonomy, while collaborative use enhances critical thinking and creativity, albeit with logistical trade-offs.
In conclusion, this study highlights the promise of conversational AI agents as both usable tools and pedagogical partners in university-level education. Their ability to facilitate intuitive interactions, support learner autonomy, and foster engagement suggests that such agents can play a central role in future learning environments. To maximize their impact, future implementations should emphasize scaffolded onboarding experiences, clearer task guidance, and opportunities for both individual and collaborative use. Moreover, curricular efforts to strengthen students’ AI evaluative literacy will be essential in preparing them to critically and responsibly navigate an increasingly AI-driven world.
7. Implications for Design and Practice
This study’s findings on user experience, usability, and learning outcomes yield specific implications for key stakeholders. For scholars and educators, the perceived usability of conversational AI chatbots in generating philology content suggests they could alleviate resource constraints, but success hinges on their intentional integration into instructional workflows. Educators should advocate for AI tools with customizable interfaces (e.g., adjustable difficulty levels for students, templates for archaic text analysis) and prioritize platforms that offer transparent output sourcing to mitigate misinformation risks. Training programs should address AI literacy gaps, teaching students to critically evaluate chatbot responses while leveraging their interactive feedback capabilities for scaffolding complex philological tasks.
Instructional content creators and developers need to refine their UX/UI design to enhance pedagogical utility. For instance, iterative user testing with philology students and instructors could identify pain points in content generation workflows, such as the need for context-aware prompts (e.g., historical period filters, dialect-specific parsing) and features to flag ambiguities in AI outputs. Developers should integrate educator-facing dashboards for monitoring student–AI interactions and tools to annotate or correct generated content. Additionally, optimizing chatbots for low-bandwidth environments could broaden accessibility in resource-limited institutions, aligning with this study’s goal of addressing resource inequities.
These results point to two key areas for targeted interventions. First, training programs should include hands-on, scaffolded introductions to unfamiliar AI platforms—such as guided “first-use” tutorials or peer-mentorship pairings—to boost users’ ease of onboarding new tools (addressing the lower ease-of-learning score, M = 3.50, SD = 0.58). Second, workshops that simulate real-world decision contexts—where participants must compare and choose among multiple AI recommendations—can sharpen critical evaluation skills (respondents rated this lowest, M = 3.25, SD = 0.71).
Moreover, the strong ethical and privacy orientation found (the “ethics and privacy” dimension, M = 3.84) suggests that embedding these scenarios within an ethical framework—e.g., highlighting how data protection choices influence AI outcomes—will both leverage existing strengths and reinforce responsible use. For future research, it would be valuable to link these self-reported literacy dimensions to objective performance metrics in AI tasks and to examine whether demographic factors (age, field, prior AI exposure) systematically predict who benefits most from each type of intervention.
For AI researchers, the mixed feedback on usability and accuracy underscores the need for domain-specific model fine-tuning. Future NLP architecture should prioritize philology-oriented training data (e.g., annotated corpora of archaic texts) and incorporate explainability features to help users trace how outputs are generated—a critical factor for fostering trust in educational contexts. Researchers could also explore adaptive interfaces that evolve based on user expertise (e.g., scaffolding novices with guided questioning vs. enabling open exploration for advanced learners). Finally, interdisciplinary collaborations with philologists are essential to address latent biases in training data that may skew interpretations of non-Western or marginalized linguistic traditions.
8. Limitations and Future Work
This study has several limitations that warrant consideration. First, the sample size was relatively small (n = 26), and the exploratory nature of the intervention limits the generalizability of the findings. Its participants were drawn exclusively from a homogeneous group of undergraduate students (M = 21.3 years old) in a single department in Greece. While this cohort provided insights into early-career learners’ interactions with AI chatbots, the results should be interpreted with caution and not generalized to broader populations, such as older learners, graduate students, educators, or institutions with differing resource levels. Additionally, the present study focused on short-term user experiences and immediate feedback during structured tasks. Consequently, it did not assess long-term impacts, such as sustained knowledge retention, behavioral changes in learning habits, or the integration of AI tools into broader philology curricula.
Second, the technical scope of this study was constrained by its focus on three specific conversational AI tools: ChatGPT, Gemini, and DeepSeek. These systems were evaluated at a fixed point in time, but the rapid advancements in AI technology—including frequent updates to language models and interface designs—may render some findings obsolete or less reproducible in future iterations. Furthermore, the predefined tasks assigned to the participants, while structured to ensure consistency, may not fully reflect the complexity of real-world philological research. For example, the present study prioritized usability and content generation but did not deeply explore challenges in adapting AI outputs for nuanced, context-dependent analyses, such as interpreting archaic texts with cultural or historical ambiguity.
Third, the reliance on self-reported data (e.g., user satisfaction surveys and qualitative interviews) introduces the possibility of response bias. While these methods captured subjective perceptions of usability and pedagogical value, they were not supplemented with objective measures of learning efficacy, such as pre- and post-test performance metrics. This limits the ability to conclusively determine whether AI chatbots directly enhance learning outcomes. Another critical limitation is the absence of educator perspectives. Instructors and philology experts could provide valuable insights into the accuracy of AI-generated content, alignment with curricular objectives, and ethical concerns related to the over-reliance on automated tools—all factors not explicitly addressed in this study.
Fourth, while conversational AI tools hold promise for mitigating resource constraints in philology education, this study did not investigate accessibility barriers that could perpetuate inequities. For instance, internet dependency, institutional licensing costs, and disparities in AI literacy across educational settings may hinder adoption in low-resource environments.
Given these limitations, the findings should be considered preliminary. Future research could involve larger, more diverse cohorts—including educators, graduate students, and learners from varied institutional and demographic backgrounds—to improve the generalizability of these findings. To build on this study’s contributions, future work should prioritize the following directions:
Scaffolded onboarding for platform-specific features: Given the variability in usability scores (e.g., DeepSeek’s high ratings vs. Gemini’s lower task-completion ease), future studies should investigate how platform-specific tutorials and adaptive onboarding can reduce cognitive load and improve task alignment. This responds to participants’ struggles with unfamiliar AI interfaces.
Frameworks for evaluative judgment: While students demonstrated strong ethical awareness, their lower confidence in evaluating AI-generated recommendations highlights a critical gap. Future research should design and test frameworks that scaffold students’ abilities to critically assess AI outputs, particularly in philological contexts where interpretive nuance is paramount.
Collaborative AI integration models: Interview data revealed that collaborative AI use enhances creativity and critical thinking, but logistical challenges (e.g., group coordination) persist. Experimental studies comparing individuals against collaborative AI-mediated learning could optimize pedagogical strategies for large classrooms.
Domain-specific AI literacy curricula: This study’s finding that students prioritize efficiency over ethical reasoning suggests a need for curricular modules that explicitly link AI literacy to philological rigor (e.g., training students to detect biases in historical text analysis).
Longitudinal impact on critical thinking: While AI tools reduced resistance to engagement, their long-term impact on deep cognitive skills like reading and interpretive analysis remains unclear. Longitudinal studies tracking skill development over semesters could clarify AI’s role in sustaining vs. undermining critical pedagogy.
The above directions directly address this study’s core contributions: (1) the usability trade-offs of platform-specific AI tools, (2) the tension between efficiency and critical engagement, and (3) the need for pedagogical scaffolding in AI-driven philological education.