Custom Generative Artificial Intelligence Tutors in Action: An Experimental Evaluation of Prompt Strategies in STEM Education
Abstract
1. Introduction
- RQ1: What types of questions do students formulate when interacting with a GEN-AI tutor in STEM laboratory activities?
- RQ2: How do responses generated under different prompting strategies differ in their instructional characteristics, based on the authors’ analysis?
- RQ3: How do students perceive and rank responses produced by different prompting strategies, and to what extent are these preferences consistent across different task contexts?
2. Materials and Methods
2.1. Research Design
2.2. Participants and Setting
2.3. Tutoring Tool Architecture
- Educator prompt (system message): A prompt defined by the teacher that sets the tutor’s instructional role, tone, and strategy;
- Student input: The learner’s current query or message submitted via the chat interface;
- Dialogue context (history): Previous messages in the conversation that help the tutor understand the current input;
- Retrieval layer: Optional access to materials provided by the teacher and stored in a vector database, invoked by the tutor when additional context is needed. (A minimal code sketch of how these four inputs compose into a single model call follows this list.)
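The paper describes these inputs architecturally rather than as code. As a minimal illustration, the Python sketch below shows one way the four inputs could be composed into a single chat-completion request; the function name `build_messages` and the exact message layout are illustrative assumptions, not the authors’ implementation.

```python
from typing import Dict, List, Optional

def build_messages(educator_prompt: str,
                   history: List[Dict[str, str]],
                   student_input: str,
                   retrieved_context: Optional[str] = None) -> List[Dict[str, str]]:
    """Compose the tutor's four inputs into one chat-style message list.

    Illustrative sketch only; not the study's actual implementation.
    """
    # Educator prompt: teacher-authored system message (role, tone, strategy).
    messages: List[Dict[str, str]] = [{"role": "system", "content": educator_prompt}]
    # Retrieval layer (optional): teacher materials from the vector database,
    # appended as extra system context only when additional context is needed.
    if retrieved_context:
        messages.append({"role": "system",
                         "content": "Reference material:\n" + retrieved_context})
    # Dialogue context: previous turns of the conversation.
    messages.extend(history)
    # Student input: the learner's current query closes the list.
    messages.append({"role": "user", "content": student_input})
    return messages
```

The resulting list can be handed to any chat-completion client; across the prompt configurations in Section 2.4, only the educator prompt (system message) changes.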

2.4. Prompt Configurations
2.5. Benchmark Question Set
Selection of Benchmark Questions for Evaluation
- Factual: “What does the term ‘effective voltage’ mean in the context of an AC power source?”
- Conceptual: “The NTC sensor output changes by only 0.1 V for a 20 °C temperature change. What could be wrong?”
2.6. Evaluation Procedure
- 0.00 < W < 0.20—slight agreement
- 0.20 ≤ W < 0.40—fair agreement
- 0.40 ≤ W < 0.60—moderate agreement
- 0.60 ≤ W < 0.80—substantial agreement
- 0.80 ≤ W ≤ 1.00—almost perfect agreement (a short computation sketch for W follows this list)
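Kendall’s W can be computed directly from the matrix of ranks. The sketch below is a minimal Python illustration (assuming m raters each assign untied ranks 1..n to the same n items), not the authors’ analysis script.

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's W for an (m raters x n items) matrix of untied ranks 1..n."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)                     # rank sum R_j per item
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()   # spread of the rank sums
    return 12.0 * s / (m ** 2 * (n ** 3 - n))         # W in [0, 1]

# Example: three raters ranking four responses in near-perfect agreement.
ranks = np.array([[1, 2, 3, 4],
                  [1, 2, 4, 3],
                  [1, 2, 3, 4]])
print(round(kendalls_w(ranks), 3))  # 0.911 -> "almost perfect agreement"
```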
2.7. Validity and Reliability
3. Results
3.1. Classification of Student Questions (RQ1)
- Procedural messages (n = 123; 59.1%) represented the majority of entries, focusing on instructions, task steps, or operational procedures.
- Factual messages (n = 55; 26.4%) involved requests for definitions, factual confirmation, or parameter values.
- Conceptual messages (n = 21; 10.1%) addressed underlying principles and causal mechanisms.
- Metacognitive messages (n = 9; 4.3%) reflected self-evaluation, uncertainty, or planning strategies.
3.2. Research Team Evaluation of Tutor Responses Based on Prompt Strategy (RQ2)
- Neutral responses were brief and to the point, typically providing a direct answer without much additional context or pedagogical framing.
- Template outputs followed a consistent, easily interpretable format. For factual and procedural questions, these responses were typically well organised and included common elements such as definitions, examples, or step-by-step instructions. Example: “Term—Effective voltage… Definition—The RMS value of an AC source… Example—Household mains voltage is 230 V RMS… Common mistake—Confusing RMS value with peak voltage…”
- Chain-of-thought responses emphasised reasoning and step-by-step explanations, guiding the student through a structured logical progression. This structure was particularly evident in conceptual and troubleshooting contexts. Example: “Let’s begin by understanding what alternating current (AC) means. Then we can define the effective voltage, which is calculated based on the power it delivers. Finally, we’ll see how it differs from peak voltage.”
- Persona responses adopted a conversational and supportive tone, often framed as if written by a human teacher. This made the responses more approachable, though sometimes less concise. Example: “Great question! In the lab, we often work with AC sources, so it’s important to understand the concept of effective voltage. Think of it as the value that tells you how much work the voltage can really do.”
- Few-shot responses followed the style of the examples included in the prompt. When questions were similar in form to the examples, the answers mirrored the demonstrated structure; for other question types, however, outputs sometimes deviated from the expected structure.
- Game responses introduced playful elements (e.g., roleplay, quizzes, exclamations). The format was versatile, with outputs ranging from short quiz interactions to roleplay scenarios and other gamified styles. Example: “Tesla smiles and asks: ‘Can you guess why RMS voltage matters when plugging in devices at home?’”
- Flipped responses replaced direct answers with follow-up questions. Instead of providing explanations, the tutor prompted the student to reflect or propose an answer before continuing. Example: “What do you think effective voltage means? How might it differ from the highest voltage reached in an AC waveform?”
3.3. Student Preferences Regarding Tutor Responses (RQ3)
4. Discussion
4.1. RQ1—What Students Ask a GEN-AI Tutor to Do
4.2. RQ2—How Prompting Strategies Shape Instructional Value (Authors’ View)
4.3. RQ3—How Students Ranked Responses
4.4. The Preference–Pedagogy Gap
4.5. Design Implications: Richer Prompts and Multi-Step Orchestration
4.6. Limitations and Future Work
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest



| Prompting Strategy | System Message | Function/Intended Effect |
|---|---|---|
| Neutral (Baseline) | You are a helpful tutor. | Produces standard answers without any specific strategy or structure. |
| Persona | You are an experienced teacher. Provide answers as a teacher would, focusing on clarity and correctness. | Shapes the output by assigning the model a specific role or identity. |
| Template | If the question asks for a definition or fact, respond in the format: Term–Definition–Example–Common mistake (if any). If the question asks for a procedure or method, list the steps one by one in order. | Structures the output in a fixed format to make explanations more consistent. |
| Chain-of-thought | Think step-by-step. Explain your reasoning before giving the final answer. | Makes the model explain its thinking step-by-step before giving the answer. |
| Few-Shot | Use the logic demonstrated in the following examples to answer the next question in the same style. [example 1], [example 2], …. | Makes the model follow the examples given in the prompt and produce similar output. |
| Game-based | Turn each question into a quiz. Ask the user first, then reveal the correct answer with a short explanation. [following rules] | Turns the output into a game-like format with questions, feedback, and interaction. |
| Flipped | Ask one question at a time (Socratic method) to help the student reach the answer on their own. Do not give direct answers. | Makes the model ask questions instead of answering directly, guiding the user step-by-step. |
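For concreteness, the system messages in the table above can be collected into a simple mapping and swapped per condition. The placeholder brackets ([example 1], [following rules]) are kept as in the table; this mapping is an illustrative sketch, not the study’s deployed configuration.

```python
# System messages transcribed from the table above, keyed by strategy name.
# Bracketed placeholders are left verbatim; a teacher would substitute real
# content (worked examples, game rules) before use.
PROMPT_STRATEGIES = {
    "neutral": "You are a helpful tutor.",
    "persona": ("You are an experienced teacher. Provide answers as a teacher "
                "would, focusing on clarity and correctness."),
    "template": ("If the question asks for a definition or fact, respond in the "
                 "format: Term-Definition-Example-Common mistake (if any). If "
                 "the question asks for a procedure or method, list the steps "
                 "one by one in order."),
    "chain_of_thought": ("Think step-by-step. Explain your reasoning before "
                         "giving the final answer."),
    "few_shot": ("Use the logic demonstrated in the following examples to "
                 "answer the next question in the same style. [example 1], "
                 "[example 2]"),
    "game": ("Turn each question into a quiz. Ask the user first, then reveal "
             "the correct answer with a short explanation. [following rules]"),
    "flipped": ("Ask one question at a time (Socratic method) to help the "
                "student reach the answer on their own. Do not give direct "
                "answers."),
}
```

Combined with the `build_messages` sketch from Section 2.3, switching conditions is then a one-line change, e.g. `build_messages(PROMPT_STRATEGIES["flipped"], history, question)`.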
| Category | Description | Example |
|---|---|---|
| Factual | Definitions or straightforward knowledge recall. | “What is the definition of electrical resistance?” |
| Procedural | Step-by-step guidance or instructions. | “How do I calculate the current through this resistor using Ohm’s law?” |
| Conceptual | Reasoning about underlying principles or cause–effect relationships. | “Why does increasing the resistance decrease the current in the circuit?” |
| Metacognitive | Self-evaluation, hypothesis formulation, or confidence checks. | “I believe the circuit should behave this way, but how can I check if I’m right?” |
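As a toy illustration of how hand-coded labels in these categories turn into the frequency reporting used in Section 3.1, the tally below uses a fabricated label list; the values are placeholders, not study data.

```python
from collections import Counter

# Fabricated labels standing in for hand-coded student messages (Section 3.1
# reports the real distribution: procedural 59.1%, factual 26.4%, ...).
labels = ["procedural", "factual", "procedural", "conceptual",
          "procedural", "metacognitive", "factual", "procedural"]

counts = Counter(labels)
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category}: n = {n}; {100 * n / total:.1f}%")
```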
| Category | Description | Example |
|---|---|---|
| Troubleshooting | Addressing unexpected outcomes, system faults, or technical errors, typically in hardware or code. | “The output voltage of the sensor only changes by 0.1 V when I change the temperature by 20 °C. What could be wrong?” |
| Theoretical reasoning | Aiming to understand or predict system behaviour based on underlying principles (e.g., Ohm’s law). | “Why is the RMS value of an AC voltage different from its peak value?” |
| Measurement-focused | Related to taking or interpreting measurements, expected vs. actual values, or observed data. | “How can we measure the effective voltage of a source?” |
| Task clarification | Seeking clarification about instructions, tools, or expected outcomes. | “How should I approach this task? Based on the measured data and the U(t) graph, draw a phasor diagram showing the conditions at t = 0.15 s.” |
| Code-related | Referencing programming behaviour, modifications, or including code snippets. | “Which part of the Arduino code should I change to make the robot move more slowly?” |
| Hardware/Equipment-related | Referring to components, wiring, or physical setup. | “Can I power the display with 9 V instead of 5 V?” |
| Social | Informal, conversational, or socially motivated messages. | “Thanks! What else can you do?” |
| Underspecified/Incomplete | Short, ambiguous, or context-dependent inputs. | “And now?” |
| Incorrect assumptions | Based on flawed conceptual understanding or technical misconceptions. | “If I use two thermistors in a voltage divider, will it make the sensor more sensitive?” |
| Content Category | Count (n) | Percentage (%) |
|---|---|---|
| Code-related | 66 | 24.1 |
| Underspecified/Incomplete | 59 | 21.5 |
| Measurement-focused | 35 | 12.8 |
| Social comments | 30 | 10.9 |
| Hardware/Equipment-related | 26 | 9.5 |
| Troubleshooting | 24 | 8.8 |
| Theoretical reasoning | 14 | 5.1 |
| Task clarification | 13 | 4.7 |
| Incorrect assumptions | 7 | 2.6 |
| Prompt Strategy | Average Rank |
|---|---|
| Template | 2.76 |
| Persona | 3.22 |
| Neutral | 3.29 |
| Chain-of-thought | 3.50 |
| Few-Shot | 3.57 |
| Flipped | 5.69 |
| Game | 5.93 |
| Primary Category | Best Strategy | Mean Rank | W | p |
|---|---|---|---|---|
| Factual | Template/Persona | 2.00 | 0.656 | <0.001 |
| Procedural | Template | 1.71 | 0.456 | 0.004 |
| Conceptual | Chain-of-Thought | 2.71 | 0.600 | <0.001 |
| Metacognitive | Neutral | 2.57 | 0.397 | 0.011 |
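Although the table does not state the test behind its p-values, the standard large-sample check for Kendall’s W converts it to a chi-square statistic, χ² = m(n − 1)W with n − 1 degrees of freedom. The sketch below assumes m = 7 rankers purely for illustration (the panel size is not restated here), with n = 7 strategies.

```python
from scipy.stats import chi2

def w_pvalue(w: float, m: int, n: int) -> float:
    """Large-sample p-value for Kendall's W: chi2 = m*(n-1)*W, df = n-1."""
    return chi2.sf(m * (n - 1) * w, df=n - 1)

# Illustration only: m = 7 rankers is an assumption; n = 7 strategies.
print(f"{w_pvalue(0.456, m=7, n=7):.3f}")  # lands near the reported 0.004
```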
| Question | Winner | Statistically Better than (p < 0.05) |
|---|---|---|
| Factual | Template | Game, Flipped, Few-shot |
| Procedural | Template | Game, Chain-of-thought, Few-shot |
| Conceptual | Chain-of-thought | Flipped, Game |
| Metacognitive | Neutral | Game |
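The table does not name the post-hoc procedure behind the pairwise “statistically better than” results. One common choice after a significant concordance result is pairwise Wilcoxon signed-rank tests with Holm correction, sketched below purely as an assumption rather than the authors’ procedure.

```python
from itertools import combinations

import numpy as np
from scipy.stats import wilcoxon

def pairwise_holm(ranks: np.ndarray, names: list, alpha: float = 0.05) -> None:
    """Pairwise Wilcoxon signed-rank tests on per-rater ranks, Holm-corrected.

    `ranks` is (m raters x n strategies). Hypothetical post-hoc sketch; the
    paper's exact pairwise procedure is not specified here.
    """
    pairs = list(combinations(range(ranks.shape[1]), 2))
    pvals = [wilcoxon(ranks[:, i], ranks[:, j]).pvalue for i, j in pairs]
    for step, idx in enumerate(np.argsort(pvals)):    # Holm step-down
        adjusted = (len(pvals) - step) * pvals[idx]
        if adjusted >= alpha:
            break                                     # stop at first failure
        i, j = pairs[idx]
        print(f"{names[i]} vs {names[j]}: adjusted p = {min(adjusted, 1.0):.3f}")
```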
