Search Results (122)

Search Parameters:
Keywords = rubric assessment

20 pages, 1119 KiB  
Article
Smartphone-Assisted Experimentation as a Medium of Understanding Human Biology Through Inquiry-Based Learning
by Giovanna Brita Campilongo, Giovanna Tonzar-Santos, Maria Eduarda dos Santos Verginio and Camilo Lellis-Santos
Educ. Sci. 2025, 15(8), 1005; https://doi.org/10.3390/educsci15081005 - 6 Aug 2025
Abstract
The integration of Inquiry-Based Learning (IBL) and mobile technologies can transform science education, offering experimentation opportunities to students from budget-constrained schools. This study investigates the efficacy of smartphone-assisted experimentation (SAE) within IBL to enhance pre-service science teachers’ understanding of human physiology and presents a newly developed and validated rubric for assessing their scientific skills. Students (N = 286) from a Science and Mathematics Teacher Education Program participated in a summative IBL activity (“Investigating the Human Physiology”—iHPhys) in which they designed experimental projects using smartphone applications to collect body sign data. The scoring rubric, which assesses seven criteria including hypothesis formulation, methodological design, data presentation, and conclusion writing, was validated with substantial to almost-perfect inter-rater reliability. Results reveal that students exhibited strong skills in hypothesis clarity, theoretical grounding, and experimental design, with a high degree of methodological innovation observed. However, challenges persisted in predictive reasoning and evidence-based conclusion writing. The students were strongly interested in inquiring about the cardiovascular and nervous systems. Correlational analyses suggest a positive relationship between project originality and overall academic performance. Thus, integrating SAE and IBL fosters critical scientific competencies, creativity, and epistemic cognition while democratizing access to scientific experimentation and engaging students in tech-savvy pedagogical practices. Full article
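The rubric validation reported above rests on inter-rater agreement. A minimal sketch of how substantial-to-almost-perfect agreement is typically quantified, assuming two raters assign ordinal scores per criterion; the criterion names and scores below are illustrative, not the study's data:

# Per-criterion inter-rater agreement for a scoring rubric.
# Assumes two raters scored each project on an ordinal scale (e.g., 0-3).
from sklearn.metrics import cohen_kappa_score

rater_a = {"hypothesis": [3, 2, 3, 1, 2], "methodology": [2, 2, 3, 3, 1]}
rater_b = {"hypothesis": [3, 2, 2, 1, 2], "methodology": [2, 3, 3, 3, 1]}

for criterion in rater_a:
    # Quadratically weighted kappa penalizes large disagreements more heavily,
    # which suits ordinal rubric levels. Landis-Koch benchmarks: > 0.60 is
    # substantial agreement, > 0.80 is almost perfect.
    kappa = cohen_kappa_score(rater_a[criterion], rater_b[criterion],
                              weights="quadratic")
    print(f"{criterion}: weighted kappa = {kappa:.2f}")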
(This article belongs to the Special Issue Inquiry-Based Learning and Student Engagement)

23 pages, 1192 KiB  
Article
Multi-Model Dialectical Evaluation of LLM Reasoning Chains: A Structured Framework with Dual Scoring Agents
by Catalin Anghel, Andreea Alexandra Anghel, Emilia Pecheanu, Ioan Susnea, Adina Cocu and Adrian Istrate
Informatics 2025, 12(3), 76; https://doi.org/10.3390/informatics12030076 - 1 Aug 2025
Abstract
(1) Background and objectives: Large language models (LLMs) such as GPT, Mistral, and LLaMA exhibit strong capabilities in text generation, yet assessing the quality of their reasoning—particularly in open-ended and argumentative contexts—remains a persistent challenge. This study introduces Dialectical Agent, an internally developed modular framework designed to evaluate reasoning through a structured three-stage process: opinion, counterargument, and synthesis. The framework enables transparent and comparative analysis of how different LLMs handle dialectical reasoning. (2) Methods: Each stage is executed by a single model, and final syntheses are scored via two independent LLM evaluators (LLaMA 3.1 and GPT-4o) based on a rubric with four dimensions: clarity, coherence, originality, and dialecticality. In parallel, a rule-based semantic analyzer detects rhetorical anomalies and ethical values. All outputs and metadata are stored in a Neo4j graph database for structured exploration. (3) Results: The system was applied to four open-weight models (Gemma 7B, Mistral 7B, Dolphin-Mistral, Zephyr 7B) across ten open-ended prompts on ethical, political, and technological topics. The results show consistent stylistic and semantic variation across models, with moderate inter-rater agreement. Semantic diagnostics revealed differences in value expression and rhetorical flaws not captured by rubric scores. (4) Originality: The framework is, to our knowledge, the first to integrate multi-stage reasoning, rubric-based and semantic evaluation, and graph-based storage into a single system. It enables replicable, interpretable, and multidimensional assessment of generative reasoning—supporting researchers, developers, and educators working with LLMs in high-stakes contexts. Full article
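A minimal sketch of the three-stage opinion, counterargument, and synthesis flow with rubric-scoring judges described above. The generate() function is a placeholder for any chat-completion client, and the prompts, model handles, and 1-5 scale are assumptions, not the paper's exact implementation:

# Three-stage dialectical pipeline with dual rubric scorers (sketch only).
RUBRIC = ["clarity", "coherence", "originality", "dialecticality"]

def generate(model: str, prompt: str) -> str:
    # Placeholder: wire this to whatever LLM client is available.
    raise NotImplementedError("plug in your LLM client here")

def dialectical_run(model: str, topic: str) -> dict:
    opinion = generate(model, f"State your position on: {topic}")
    counter = generate(model, f"Argue against this position:\n{opinion}")
    synthesis = generate(model, f"Synthesize the two views:\n{opinion}\n{counter}")
    return {"opinion": opinion, "counter": counter, "synthesis": synthesis}

def score_synthesis(judge: str, synthesis: str) -> dict:
    # Each judge returns one 1-5 score per rubric dimension.
    scores = {}
    for dim in RUBRIC:
        reply = generate(judge, f"Rate the {dim} of this synthesis from 1 to 5:\n{synthesis}")
        scores[dim] = int(reply.strip())
    return scores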

20 pages, 2714 KiB  
Article
Diagnosing Bias and Instability in LLM Evaluation: A Scalable Pairwise Meta-Evaluator
by Catalin Anghel, Andreea Alexandra Anghel, Emilia Pecheanu, Adina Cocu, Adrian Istrate and Constantin Adrian Andrei
Information 2025, 16(8), 652; https://doi.org/10.3390/info16080652 - 31 Jul 2025
Abstract
The evaluation of large language models (LLMs) increasingly relies on other LLMs acting as automated judges. While this approach offers scalability and efficiency, it raises serious concerns regarding evaluator reliability, positional bias, and ranking stability. This paper presents a scalable framework for diagnosing positional bias and instability in LLM-based evaluation by using controlled pairwise comparisons judged by multiple independent language models. The system supports mirrored comparisons with reversed response order, prompt injection, and surface-level perturbations (e.g., paraphrasing, lexical noise), enabling fine-grained analysis of evaluator consistency and verdict robustness. Over 3600 pairwise comparisons were conducted across five instruction-tuned open-weight models using ten open-ended prompts. The top-performing model (gemma:7b-instruct) achieved a 66.5% win rate. Evaluator agreement was uniformly high, with 100% consistency across judges, yet 48.4% of verdicts reversed under mirrored response order, indicating strong positional bias. Kendall’s Tau analysis further showed that local model rankings varied substantially across prompts, suggesting that semantic context influences evaluator judgment. All evaluation traces were stored in a graph database (Neo4j), enabling structured querying and longitudinal analysis. The proposed framework provides not only a diagnostic lens for benchmarking models but also a blueprint for fairer and more interpretable LLM-based evaluation. These findings underscore the need for structure-aware, perturbation-resilient evaluation pipelines when benchmarking LLMs. The proposed framework offers a reproducible path for diagnosing evaluator bias and ranking instability in open-ended language tasks. Future work will apply this methodology to educational assessment tasks, using rubric-based scoring and graph-based traceability to evaluate student responses in technical domains. Full article
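A minimal sketch of the two diagnostics highlighted above: the verdict-reversal rate under mirrored response order and Kendall's Tau between per-prompt rankings. The verdicts and rankings are illustrative, not the paper's data; scipy is assumed available:

# Quantifying positional bias and ranking instability.
from scipy.stats import kendalltau

# Winner for each pair in original order (A first) and mirrored order (B first).
# A position-consistent judge picks the same winner either way.
original = ["A", "A", "B", "A", "B", "B", "A", "B"]
mirrored = ["A", "B", "B", "B", "B", "A", "A", "A"]

flips = sum(o != m for o, m in zip(original, mirrored))
print(f"verdict reversal rate: {flips / len(original):.1%}")

# Per-prompt model rankings (1 = best); low tau across prompts signals that
# local rankings depend on semantic context.
rank_prompt_1 = [1, 2, 3, 4, 5]
rank_prompt_2 = [2, 1, 4, 3, 5]
tau, p = kendalltau(rank_prompt_1, rank_prompt_2)
print(f"Kendall's tau = {tau:.2f} (p = {p:.2f})")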

16 pages, 628 KiB  
Article
Beyond the Bot: A Dual-Phase Framework for Evaluating AI Chatbot Simulations in Nursing Education
by Phillip Olla, Nadine Wodwaski and Taylor Long
Nurs. Rep. 2025, 15(8), 280; https://doi.org/10.3390/nursrep15080280 - 31 Jul 2025
Abstract
Background/Objectives: The integration of AI chatbots in nursing education, particularly in simulation-based learning, is advancing rapidly. However, there is a lack of structured evaluation models, especially to assess AI-generated simulations. This article introduces the AI-Integrated Method for Simulation (AIMS), a dual-phase evaluation framework adapted from the FAITA model and designed to evaluate both prompt design and chatbot performance in the context of nursing education. Methods: This simulation-based study explored the application of an AI chatbot in an emergency planning course. The AIMS framework was developed and applied, consisting of six prompt-level domains (Phase 1) and eight performance criteria (Phase 2). These domains were selected based on current best practices in instructional design, simulation fidelity, and emerging AI evaluation literature. To assess the chatbot’s educational utility, the study employed a scoring rubric for each phase and incorporated a structured feedback loop to refine both prompt design and chatbot interaction. To demonstrate the framework’s practical application, the researchers configured an AI tool referred to in this study as “Eval-Bot v1”, built using OpenAI’s GPT-4.0, to apply Phase 1 scoring criteria to a real simulation prompt. Insights from this analysis were then used to anticipate Phase 2 performance and identify areas for improvement. Participants (three individuals)—all experienced healthcare educators and advanced practice nurses with expertise in clinical decision-making and simulation-based teaching—reviewed the prompt and Eval-Bot’s score to triangulate findings. Results: Simulated evaluations revealed clear strengths in the prompt’s alignment with course objectives and its capacity to foster interactive learning. Participants noted that the AI chatbot supported engagement and maintained appropriate pacing, particularly in scenarios involving emergency planning decision-making. However, challenges emerged in areas related to personalization and inclusivity. While the chatbot responded consistently to general queries, it struggled to adapt tone, complexity, and content to reflect diverse learner needs or cultural nuances. To support replication and refinement, a sample scoring rubric and simulation prompt template are provided. When evaluated using the Eval-Bot tool, moderate concerns were flagged regarding safety prompts and inclusive language, particularly in how the chatbot navigated sensitive decision points. These gaps were linked to predicted performance issues in Phase 2 domains such as dialog control, equity, and user reassurance. Based on these findings, revised prompt strategies were developed to improve contextual sensitivity, promote inclusivity, and strengthen ethical guidance within chatbot-led simulations. Conclusions: The AIMS evaluation framework provides a practical and replicable approach for evaluating the use of AI chatbots in simulation-based education. By offering structured criteria for both prompt design and chatbot performance, the model supports instructional designers, simulation specialists, and developers in identifying areas of strength and improvement. The findings underscore the importance of intentional design, safety monitoring, and inclusive language when integrating AI into nursing and health education. As AI tools become more embedded in learning environments, this framework offers a thoughtful starting point for ensuring they are applied ethically, effectively, and with learner diversity in mind. Full article
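A minimal sketch of a dual-phase rubric tally in the spirit of the AIMS framework described above. The abstract does not enumerate the six Phase 1 domains or eight Phase 2 criteria, so the names and 1-5 scores below are hypothetical placeholders:

# Dual-phase rubric tally (sketch only; domain names are hypothetical).
from statistics import mean

phase1_scores = {  # prompt-level domains, scored 1-5 by a reviewer or Eval-Bot
    "objective_alignment": 4, "fidelity": 3, "inclusivity": 2,
    "safety": 3, "pacing": 4, "interactivity": 5,
}
phase2_scores = {  # chatbot performance criteria, scored 1-5
    "dialog_control": 3, "equity": 2, "reassurance": 3, "accuracy": 4,
    "engagement": 5, "adaptivity": 3, "ethics": 3, "feedback_quality": 4,
}

report = {
    "phase1_mean": mean(phase1_scores.values()),
    "phase2_mean": mean(phase2_scores.values()),
    "flags": [k for k, v in {**phase1_scores, **phase2_scores}.items() if v <= 2],
}
print(report)  # with these scores, flags the low-scoring inclusivity and equity entries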

26 pages, 338 KiB  
Article
ChatGPT as a Stable and Fair Tool for Automated Essay Scoring
by Francisco García-Varela, Miguel Nussbaum, Marcelo Mendoza, Carolina Martínez-Troncoso and Zvi Bekerman
Educ. Sci. 2025, 15(8), 946; https://doi.org/10.3390/educsci15080946 - 23 Jul 2025
Abstract
The evaluation of open-ended questions is typically performed by human instructors using predefined criteria to uphold academic standards. However, manual grading presents challenges, including high costs, rater fatigue, and potential bias, prompting interest in automated essay scoring systems. While automated essay scoring tools can assess content, coherence, and grammar, discrepancies between human and automated scoring have raised concerns about their reliability as standalone evaluators. Large language models like ChatGPT offer new possibilities, but their consistency and fairness in feedback remain underexplored. This study investigates whether ChatGPT can provide stable and fair essay scoring—specifically, whether identical student responses receive consistent evaluations across multiple AI interactions using the same criteria. The study was conducted in two marketing courses at an engineering school in Chile, involving 40 students. Results showed that ChatGPT, when unprompted or using minimal guidance, produced volatile grades and shifting criteria. Incorporating the instructor’s rubric reduced this variability but did not eliminate it. Only after providing an example-rich rubric, a standardized output format, low temperature settings, and a normalization process based on decision tables did ChatGPT-4o demonstrate consistent and fair grading. Based on these findings, we developed a scalable algorithm that automatically generates normalized grading rubrics and decision tables for new questions with minimal human input, thereby extending the accessibility and reliability of automated assessment. Full article
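A minimal sketch of rubric-guided grading at low temperature followed by a decision-table normalization step, in the spirit of the workflow described above. It assumes the openai Python client (v1+); the rubric text, model name, and score bands are illustrative, not the study's materials:

# Rubric-guided essay grading with a decision-table normalization step (sketch).
from openai import OpenAI

client = OpenAI()
RUBRIC = (
    "Score the essay 0-10. Criteria: thesis clarity (0-3), use of evidence (0-4), "
    "structure (0-3). Reply with only the three numbers separated by commas."
)

def grade(essay: str) -> list[int]:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # low temperature to reduce run-to-run volatility
        messages=[{"role": "system", "content": RUBRIC},
                  {"role": "user", "content": essay}],
    )
    return [int(x) for x in resp.choices[0].message.content.split(",")]

def normalize(criterion_scores: list[int]) -> float:
    # Decision-table step: map the raw total onto the course grade scale.
    total = sum(criterion_scores)
    bands = [(9, 7.0), (7, 6.0), (5, 5.0), (0, 4.0)]  # (minimum total, grade)
    return next(band_grade for cutoff, band_grade in bands if total >= cutoff)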
(This article belongs to the Section Technology Enhanced Education)
19 pages, 1186 KiB  
Article
Synthetic Patient–Physician Conversations Simulated by Large Language Models: A Multi-Dimensional Evaluation
by Syed Ali Haider, Srinivasagam Prabha, Cesar Abraham Gomez-Cabello, Sahar Borna, Ariana Genovese, Maissa Trabilsy, Bernardo G. Collaco, Nadia G. Wood, Sanjay Bagaria, Cui Tao and Antonio Jorge Forte
Sensors 2025, 25(14), 4305; https://doi.org/10.3390/s25144305 - 10 Jul 2025
Abstract
Background: Data accessibility remains a significant barrier in healthcare AI due to privacy constraints and logistical challenges. Synthetic data, which mimics real patient information while remaining both realistic and non-identifiable, offers a promising solution. Large Language Models (LLMs) create new opportunities to generate high-fidelity clinical conversations between patients and physicians. However, the value of this synthetic data depends on careful evaluation of its realism, accuracy, and practical relevance. Objective: To assess the performance of four leading LLMs: ChatGPT 4.5, ChatGPT 4o, Claude 3.7 Sonnet, and Gemini Pro 2.5 in generating synthetic transcripts of patient–physician interactions in plastic surgery scenarios. Methods: Each model generated transcripts for ten plastic surgery scenarios. Transcripts were independently evaluated by three clinically trained raters using a seven-criterion rubric: Medical Accuracy, Realism, Persona Consistency, Fidelity, Empathy, Relevancy, and Usability. Raters were blinded to the model identity to reduce bias. Each was rated on a 5-point Likert scale, yielding 840 total evaluations. Descriptive statistics were computed, and a two-way repeated measures ANOVA was used to test for differences across models and metrics. In addition, transcripts were analyzed using automated linguistic and content-based metrics. Results: All models achieved strong performance, with mean ratings exceeding 4.5 across all criteria. Gemini 2.5 Pro received mean scores (5.00 ± 0.00) in Medical Accuracy, Realism, Persona Consistency, Relevancy, and Usability. Claude 3.7 Sonnet matched the scores in Persona Consistency and Relevancy and led in Empathy (4.96 ± 0.18). ChatGPT 4.5 also achieved perfect scores in Relevancy, with high scores in Empathy (4.93 ± 0.25) and Usability (4.96 ± 0.18). ChatGPT 4o demonstrated consistently strong but slightly lower performance across most dimensions. ANOVA revealed no statistically significant differences across models (F(3, 6) = 0.85, p = 0.52). Automated analysis showed substantial variation in transcript length, style, and content richness: Gemini 2.5 Pro generated the longest and most emotionally expressive dialogues, while ChatGPT 4o produced the shortest and most concise outputs. Conclusions: Leading LLMs can generate medically accurate, emotionally appropriate synthetic dialogues suitable for educational and research use. Despite high performance, demographic homogeneity in generated patients highlights the need for improved diversity and bias mitigation in model outputs. These findings support the cautious, context-aware integration of LLM-generated dialogues into medical training, simulation, and research. Full article
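A minimal sketch of summarizing the 5-point Likert ratings and testing for model differences with a repeated-measures ANOVA; statsmodels and pandas are assumed available. The data frame is illustrative and collapses the design to a single within-subject factor, unlike the two-way design used in the study:

# Likert rating summary and repeated-measures ANOVA (sketch, illustrative data).
import pandas as pd
from statsmodels.stats.anova import AnovaRM

ratings = pd.DataFrame({
    "rater": ["r1"] * 4 + ["r2"] * 4 + ["r3"] * 4,
    "model": ["gpt45", "gpt4o", "claude", "gemini"] * 3,
    "score": [4.9, 4.6, 4.8, 5.0, 4.8, 4.7, 4.9, 5.0, 4.9, 4.5, 4.9, 4.9],
})

print(ratings.groupby("model")["score"].agg(["mean", "std"]))
# Within-subject factor: model (each blinded rater scores every model).
print(AnovaRM(ratings, depvar="score", subject="rater", within=["model"]).fit())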
(This article belongs to the Special Issue Feature Papers in Smart Sensing and Intelligent Sensors 2025)

11 pages, 566 KiB  
Article
Reliability and Sources of Variation of Preclinical OSCEs at a Large US Osteopathic Medical School
by Martin Schmidt, Sarah Parrott and Maurice Blodgett
Int. Med. Educ. 2025, 4(3), 25; https://doi.org/10.3390/ime4030025 - 5 Jul 2025
Abstract
The objective structured clinical examination (OSCE) is a well-established tool for assessing clinical skills, providing reliability, validity, and generalizability for high-stakes examinations. Des Moines University College of Osteopathic Medicine (DMU-COM) adapted the OSCE for formative assessments in undergraduate medical education, focusing on interpersonal aspects in the primary care setting. Students are graded by standardized patients and faculty observers on interpersonal skills, history/physical examination, oral case presentation, and documentation. The purpose of this study is to establish the reliability of the DMU-COM OSCE and to identify its sources of variation, to aid medical educators in understanding the accuracy of clinical skills assessment. We examined student performance data across five OSCE domains. We assessed intra- and inter-OSCE reliability by calculating KR20 values, determined sources of variation by multivariate regression analysis, and described relationships among observed variables through factor analysis. The results indicate that the OSCE captures student performance in three dimensions with low intra-OSCE reliability but acceptable longitudinal inter-OSCE reliability. Variance analysis shows significant measurement error in rubric-graded scores but negligible error in checklist-graded portions. Physical exam scores from patients and faculty showed no correlation, indicating value in having two different observers. We conclude that a series of formative OSCEs is a valid tool for assessing clinical skills in preclinical medical students. However, the low intra-assessment reliability cautions against using a single OSCE for summative clinical skills competency assessments. Full article
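A minimal sketch of the KR20 internal-consistency coefficient used above for the checklist-graded portions, computed as KR20 = k/(k-1) * (1 - sum(p*q) / var(total)) over dichotomous items. The response matrix (students by items, 1 = item credited) is illustrative, not the study's data:

# KR-20 for dichotomous checklist items (sketch, illustrative data).
import numpy as np

responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 1],
])

k = responses.shape[1]                         # number of checklist items
p = responses.mean(axis=0)                     # proportion credited per item
q = 1 - p
total_var = responses.sum(axis=1).var(ddof=1)  # variance of students' total scores
kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(f"KR-20 = {kr20:.2f}")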

27 pages, 6138 KiB  
Article
From Mapping to Action: SmartRubrics, an AI Tool for Competency-Based Assessment in Engineering Education
by Jorge Hochstetter-Diez, Marlene Negrier-Seguel, Mauricio Diéguez-Rebolledo, Esteban Candia-Garrido and Elizabeth Vidal
Sustainability 2025, 17(13), 6098; https://doi.org/10.3390/su17136098 - 3 Jul 2025
Abstract
Competency-based assessment in engineering education is becoming increasingly critical as the profession faces rapid technological advances and the growing need for cross-cutting competencies. This paper introduces SmartRubrics, an AI-based tool designed to support the automated generation of competency-based assessment rubrics. The development of this tool is based on a systematic literature mapping study conducted between 2019 and 2024, which identified key gaps, such as the limited integration of digital tools and the under-representation of transversal skills in current assessment practices. By addressing these gaps, SmartRubrics aims to support the standardisation, accessibility, and potential enhancement of competency-based assessment practices, aligned with UNESCO’s Sustainable Development Goal 4 (SDG4). Preliminary testing of the prototype with computer science educators has provided valuable information on the effectiveness of the tool and areas for improvement. Future work includes further experimental validation in real educational settings to assess the impact of the tool on teaching and learning practices. Full article
(This article belongs to the Special Issue Sustainable Education in the Age of Artificial Intelligence (AI))

21 pages, 1471 KiB  
Article
The PIEE Cycle: A Structured Framework for Red Teaming Large Language Models in Clinical Decision-Making
by Maissa Trabilsy, Srinivasagam Prabha, Cesar A. Gomez-Cabello, Syed Ali Haider, Ariana Genovese, Sahar Borna, Nadia Wood, Narayanan Gopala, Cui Tao and Antonio J. Forte
Bioengineering 2025, 12(7), 706; https://doi.org/10.3390/bioengineering12070706 - 27 Jun 2025
Abstract
The increasing integration of large language models (LLMs) into healthcare presents significant opportunities, but also critical risks related to patient safety, accuracy, and ethical alignment. Despite these concerns, no standardized framework exists for systematically evaluating and stress testing LLM behavior in clinical decision-making. The PIEE cycle—Planning and Preparation, Information Gathering and Prompt Generation, Execution, and Evaluation—is a structured red-teaming framework developed specifically to address artificial intelligence (AI) safety risks in healthcare decision-making. PIEE enables clinicians and informatics teams to simulate adversarial prompts, including jailbreaking, social engineering, and distractor attacks, to stress-test language models in real-world clinical scenarios. Model performance is evaluated using specific metrics such as true positive and false positive rates for detecting harmful content, hallucination rates measured through adapted TruthfulQA scoring, safety and reliability assessments, bias detection via adapted BBQ benchmarks, and ethical evaluation using structured Likert-based scoring rubrics. The framework is illustrated using examples from plastic surgery, but is adaptable across specialties, and is intended for use by all medical providers, regardless of their backgrounds or familiarity with artificial intelligence. While the framework is currently conceptual and validation is ongoing, PIEE provides a practical foundation for assessing the clinical reliability and ethical robustness of LLMs in medicine. Full article
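A minimal sketch of the true positive and false positive rates the Evaluation stage of PIEE uses for harmful-content detection. The labels are illustrative: y_true marks adversarial prompts that should be refused, y_pred marks prompts the model actually refused or flagged:

# TPR/FPR for harmful-content detection during red-teaming (sketch).
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]   # 1 = adversarial prompt (should be refused)
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 0, 0]   # 1 = model refused / flagged as harmful

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

print(f"TPR (harmful prompts caught) = {tp / (tp + fn):.2f}")
print(f"FPR (benign prompts refused) = {fp / (fp + tn):.2f}")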
(This article belongs to the Special Issue New Sights of Deep Learning and Digital Model in Biomedicine)

14 pages, 877 KiB  
Article
No Learner Left Behind: How Medical Students’ Background Characteristics and Psychomotor/Visual–Spatial Abilities Correspond to Aptitude in Learning How to Perform Clinical Ultrasounds
by Samuel Ayala, Eric R. Abrams, Lawrence A. Melniker, Laura D. Melville and Gerardo C. Chiricolo
Emerg. Care Med. 2025, 2(3), 31; https://doi.org/10.3390/ecm2030031 - 25 Jun 2025
Abstract
Background/Objectives: The goal of educators is to leave no learner behind. Ultrasounds require dexterity and 3D image interpretation. They are technologically complex, and current medical residency programs lack a reliable means of assessing this ability among their trainees. This prompts consideration as to whether background characteristics or certain pre-existing skills can serve as indicators of learning aptitude for ultrasounds. The objective of this study was to determine whether these characteristics and skills are indicative of learning aptitude for ultrasounds. Methods: This prospective study was conducted with third-year medical students rotating in emergency medicine at the New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY, USA. First, students were given a pre-test survey to assess their background characteristics. Subsequently, a psychomotor task (Purdue Pegboard) and visual–spatial task (Revised Purdue Spatial Visualization Tests) were administered to the students. Lastly, an ultrasound task was given to identify the subxiphoid cardiac view. A rubric assessed ability, and proficiency was defined as a score of 75% or higher on the ultrasound task. Results: In total, 97 students were tested. An analysis of variance (ANOVA) was used to ascertain whether any background characteristics from the pre-test survey were associated with the ultrasound task score. Students’ use of cadavers to learn anatomy showed the strongest association (p = 0.02). For the psychomotor and visual–spatial tasks, linear regressions were fitted against the ultrasound task scores, yielding p-values of 0.007 and 0.008, respectively. Conclusions: Ultrasound ability is based on hand–eye coordination and spatial relationships. Increased aptitude in these abilities may forecast future success in this skill. Those who may need more assistance can have their training tailored to them and further support offered. Full article
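A minimal sketch of the two analyses reported above: a one-way ANOVA of ultrasound scores across a background characteristic, and linear regressions of the psychomotor and visual-spatial scores against the ultrasound rubric score. All values are illustrative, not the study's data; scipy is assumed available:

# One-way ANOVA and linear regression against ultrasound rubric scores (sketch).
from scipy.stats import f_oneway, linregress

# Ultrasound rubric scores grouped by a background characteristic
# (e.g., learned anatomy with cadavers: yes / no).
cadaver_yes = [82, 90, 78, 88, 95, 84]
cadaver_no = [70, 75, 80, 72, 68, 77]
f_stat, p_anova = f_oneway(cadaver_yes, cadaver_no)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")

# Psychomotor (Purdue Pegboard) score vs ultrasound rubric score.
pegboard = [12, 15, 14, 17, 11, 16]
ultrasound = [70, 82, 78, 90, 68, 85]
fit = linregress(pegboard, ultrasound)
print(f"regression: slope = {fit.slope:.2f}, p = {fit.pvalue:.3f}")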

14 pages, 912 KiB  
Article
Evaluation of Large Language Model Performance in Answering Clinical Questions on Periodontal Furcation Defect Management
by Georgios S. Chatzopoulos, Vasiliki P. Koidou, Lazaros Tsalikis and Eleftherios G. Kaklamanos
Dent. J. 2025, 13(6), 271; https://doi.org/10.3390/dj13060271 - 18 Jun 2025
Abstract
Background/Objectives: Large Language Models (LLMs) are artificial intelligence (AI) systems with the capacity to process vast amounts of text and generate human-like language, offering the potential for improved information retrieval in healthcare. This study aimed to assess and compare the evidence-based potential of answers provided by four LLMs to common clinical questions concerning the management and treatment of periodontal furcation defects. Methods: Four LLMs—ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot—were used to answer ten clinical questions related to periodontal furcation defects. The LLM-generated responses were compared against a “gold standard” derived from the European Federation of Periodontology (EFP) S3 guidelines and recent systematic reviews. Two board-certified periodontists independently evaluated the answers for comprehensiveness, scientific accuracy, clarity, and relevance using a predefined rubric and a scoring system of 0–10. Results: The study found variability in LLM performance across the evaluation criteria. Google Gemini Advanced generally achieved the highest average scores, particularly in comprehensiveness and clarity, while Google Gemini and Microsoft Copilot tended to score lower, especially in relevance. However, the Kruskal–Wallis test revealed no statistically significant differences in the overall average scores among the LLMs. Evaluator agreement and intra-evaluator reliability were high. Conclusions: While LLMs demonstrate the potential to answer clinical questions related to furcation defect management, their performance varies. The LLMs showed differing degrees of comprehensiveness, scientific accuracy, clarity, and relevance. Dental professionals should be aware of LLMs’ capabilities and limitations when seeking clinical information. Full article
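A minimal sketch of the Kruskal-Wallis comparison of 0-10 rubric scores across the four LLMs; scipy is assumed available, and the scores are illustrative, not the study's data:

# Kruskal-Wallis H test across per-question rubric scores (sketch).
from scipy.stats import kruskal

chatgpt   = [8, 7, 9, 6, 8, 7, 8, 9, 7, 8]
gemini    = [7, 6, 8, 7, 7, 8, 6, 7, 8, 7]
gemini_ad = [8, 9, 7, 8, 9, 8, 7, 8, 9, 8]
copilot   = [7, 8, 6, 7, 7, 6, 8, 7, 7, 8]

h, p = kruskal(chatgpt, gemini, gemini_ad, copilot)
print(f"H = {h:.2f}, p = {p:.3f}")  # p > 0.05 would indicate no significant difference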
(This article belongs to the Special Issue Artificial Intelligence in Oral Rehabilitation)

21 pages, 931 KiB  
Article
JorGPT: Instructor-Aided Grading of Programming Assignments with Large Language Models (LLMs)
by Jorge Cisneros-González, Natalia Gordo-Herrera, Iván Barcia-Santos and Javier Sánchez-Soriano
Future Internet 2025, 17(6), 265; https://doi.org/10.3390/fi17060265 - 18 Jun 2025
Abstract
This paper explores the application of large language models (LLMs) to automate the evaluation of programming assignments in an undergraduate “Introduction to Programming” course. This study addresses the challenges of manual grading, including time constraints and potential inconsistencies, by proposing a system that integrates several LLMs to streamline the assessment process. The system utilizes a graphic interface to process student submissions, allowing instructors to select an LLM and customize the grading rubric. A comparative analysis, using LLMs from OpenAI, Google, DeepSeek and ALIBABA to evaluate student code submissions, revealed a strong correlation between LLM-generated grades and those assigned by human instructors. Specifically, the reduced model using statistically significant variables demonstrates a high explanatory power, with an adjusted R2 of 0.9156 and a Mean Absolute Error of 0.4579, indicating that LLMs can effectively replicate human grading. The findings suggest that LLMs can automate grading when paired with human oversight, drastically reducing the instructor workload, transforming a task estimated to take more than 300 h of manual work into less than 15 min of automated processing and improving the efficiency and consistency of assessment in computer science education. Full article
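A minimal sketch of the agreement statistics reported above, comparing LLM-assigned and instructor grades via the Mean Absolute Error and adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1); scikit-learn is assumed available, and the grades are illustrative:

# Agreement between LLM-assigned and instructor grades (sketch, illustrative data).
from sklearn.metrics import mean_absolute_error, r2_score

human = [9.0, 7.5, 6.0, 8.5, 5.0, 10.0, 7.0, 4.5]
llm   = [8.5, 7.0, 6.5, 8.0, 5.5, 9.5, 7.5, 5.0]

n, p = len(human), 1            # n observations, p predictors in the fitted model
r2 = r2_score(human, llm)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"MAE = {mean_absolute_error(human, llm):.3f}")
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")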
(This article belongs to the Special Issue Generative Artificial Intelligence in Smart Societies)

24 pages, 2091 KiB  
Article
Reflections on Addressing Educational Inequalities Through the Co-Creation of a Rubric for Assessing Children’s Plurilingual and Intercultural Competence
by Janine Knight and Marta Segura
Educ. Sci. 2025, 15(6), 762; https://doi.org/10.3390/educsci15060762 - 16 Jun 2025
Abstract
Recognising linguistic diversity as a person’s characteristic is arguably central to their multilingual identity and is important as an equity issue. Different indicators suggest that students with migrant backgrounds, whose linguistic diversity is often not reflected in European education systems, tend to underperform compared to their peers without migrant backgrounds. There is a dire need, therefore, to alleviate the educational inequalities that negatively affect some of the most plurilingual students in European school systems. This can be carried out by revisiting assessment tools. Developing assessments to make children’s full linguistic and cultural repertoire visible, and what they can do with it, is one way that potential inequalities in school systems and assessment practices can be addressed so that cultural and linguistic responsiveness of assessments and practices can be improved. This paper explores the concept of discontinuities or mismatches between the assessment of plurilingual children’s linguistic practices in one primary school in Catalonia and their actual linguistic realities, including heritage languages. It asks: (1) What are the children’s linguistic profiles? (2) What mismatches and/or educational inequalities do they experience? and (3) How does the co-creation and use of a rubric assessing plurilingual and intercultural competence attempt to mitigate these mismatches and inequalities? Mismatches are identified using a context- and participant-relevant reflection tool, based on 18 reflective questions related to aspects of social justice. Results highlight that mismatches exist between children’s plurilingual and intercultural knowledge and skills compared to the school, education system, curriculum, and wider regional and European policy. These mismatches highlight two plurilingual visions for language education. The paper highlights how language assessment tools and practices can be made more culturally and linguistically fair for plurilingual children with migration backgrounds. Full article

15 pages, 216 KiB  
Article
Participatory Co-Design and Evaluation of a Novel Approach to Generative AI-Integrated Coursework Assessment in Higher Education
by Alex F. Martin, Svitlana Tubaltseva, Anja Harrison and G. James Rubin
Behav. Sci. 2025, 15(6), 808; https://doi.org/10.3390/bs15060808 - 12 Jun 2025
Abstract
Generative AI tools offer opportunities for enhancing learning and assessment, but raise concerns about equity, academic integrity, and the ability to critically engage with AI-generated content. This study explores these issues within a psychology-oriented postgraduate programme at a UK university. We co-designed and evaluated a novel AI-integrated assessment aimed at improving critical AI literacy among students and teaching staff (pre-registration: osf.io/jqpce). Students were randomly allocated to two groups: the ‘compliant’ group used AI tools to assist with writing a blog and critically reflected on the outputs, while the ‘unrestricted’ group had free rein to use AI to produce the assessment. Teaching staff, blinded to group allocation, marked the blogs using an adapted rubric. Focus groups, interviews, and workshops were conducted to assess the feasibility, acceptability, and perceived integrity of the approach. Findings suggest that, when carefully scaffolded, integrating AI into assessments can promote both technical fluency and ethical reflection. A key contribution of this study is its participatory co-design and evaluation method, which was effective and transferable, and is presented as a practical toolkit for educators. This approach supports growing calls for authentic assessment that mirrors real-world tasks, while highlighting the ongoing need to balance academic integrity with skill development. Full article
20 pages, 2451 KiB  
Article
Enhancing Efficiency and Creativity in Mechanical Drafting: A Comparative Study of General-Purpose CAD Versus Specialized Toolsets
by Simón Gutiérrez de Ravé, Eduardo Gutiérrez de Ravé and Francisco J. Jiménez-Hornero
Appl. Syst. Innov. 2025, 8(3), 74; https://doi.org/10.3390/asi8030074 - 29 May 2025
Abstract
Computer-Aided Design (CAD) plays a critical role in modern engineering education by supporting technical accuracy and fostering innovation in design. This study compares the performance of beginner CAD users employing general-purpose AutoCAD 2025 with those using the specialized AutoCAD Mechanical 2025. Fifty undergraduate mechanical engineering students, all with less than one year of CAD experience and no prior exposure to AutoCAD Mechanical, were randomly assigned to complete six mechanical drawing tasks using one of the two software environments. Efficiency was evaluated through command usage, frequency, and task completion time, while creativity was assessed using a rubric covering originality, functionality, tool proficiency, and graphical quality. Results show that AutoCAD Mechanical significantly improved workflow efficiency, reducing task execution time by approximately 50%. Creativity scores were also notably higher among users of AutoCAD Mechanical, particularly in functionality and tool usage. These gains are attributed to automation features such as parametric constraints, standard part libraries, and automated dimensioning, which lower cognitive load and support iterative design. The findings suggest that integrating specialized CAD tools into engineering curricula can enhance both technical and creative outcomes. Limitations and future research directions include longitudinal studies, diverse user populations, and exploration of student feedback and tool adaptation. Full article
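A minimal sketch of comparing task completion times between the two groups. The abstract reports a roughly 50% reduction but does not name a statistical test; an independent-samples t-test is shown here as one common choice, on illustrative timings in minutes, not the study's data:

# Comparing completion times: general-purpose AutoCAD vs AutoCAD Mechanical (sketch).
from statistics import mean
from scipy.stats import ttest_ind

autocad_general = [42, 38, 45, 40, 44, 39, 41, 43]   # general-purpose AutoCAD group
autocad_mech = [21, 19, 23, 20, 22, 18, 24, 20]      # AutoCAD Mechanical group

reduction = 1 - mean(autocad_mech) / mean(autocad_general)
t, p = ttest_ind(autocad_mech, autocad_general)
print(f"mean time reduction = {reduction:.0%}, t = {t:.2f}, p = {p:.4f}")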
