
Search Results (188)

Search Parameters:
Keywords = exam grades

15 pages, 381 KB  
Article
Assessment Validity in the Age of Generative AI: A Natural Experiment
by Håvar Brattli, Alexander Utne and Matthew Lynch
Informatics 2026, 13(4), 56; https://doi.org/10.3390/informatics13040056 - 3 Apr 2026
Viewed by 808
Abstract
Universities play a dual role as sites of learning and as institutions that certify student competence through assessment. The rapid diffusion of generative artificial intelligence (GenAI) challenges this certification function by altering the conditions under which assessment evidence is produced. When powerful AI tools are widely available, grades may increasingly reflect a combination of individual understanding and external cognitive support rather than solely independent competence. This study examines how changes in assessment format interact with GenAI availability to reshape observable performance outcomes in higher education. Using exam grade data from a compulsory undergraduate course delivered over five years (2021–2025; N = 1066), the study exploits a naturally occurring change in assessment conditions as a natural experiment. From 2021 to 2024, the course was assessed using an AI-permissive take-home examination, while in 2025 the assessment shifted to an AI-restricted, supervised in-person examination. Course content, intended learning outcomes, grading criteria, examiner continuity, and the structural design of the examination tasks remained stable across cohorts. The results reveal a pronounced shift in grade distributions coinciding with the format change. Failure rates increased sharply in 2025, mid-range grades declined, and the proportion of top grades remained largely unchanged. Statistical analysis indicates a significant association between examination period and grade outcomes (χ2(5, N = 1066) = 60.62, p < 0.001), with a small-to-moderate effect size (Cramér’s V = 0.24), driven primarily by the increase in failing grades. These findings suggest that AI-permissive and AI-restricted assessment formats may not be measurement-equivalent under conditions of widespread GenAI use. 
The results raise concerns about construct validity and the credibility of grades as signals of independent competence, while also highlighting tensions between certification credibility and assessment authenticity. Full article
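As a quick consistency check (not part of the article), the reported Cramér's V follows directly from the χ2 statistic via V = sqrt(χ2 / (N · min(r − 1, c − 1))) for an r × c contingency table. A minimal Python sketch, assuming a 6-grade × 2-period table (consistent with the reported df = 5):

```python
import math

def cramers_v(chi2: float, n: int, r: int, c: int) -> float:
    """Cramér's V effect size for an r x c contingency table."""
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))

# Reported values: chi2(5, N = 1066) = 60.62; df = 5 is consistent
# with a 6-grade x 2-period table, so min(r - 1, c - 1) = 1.
v = cramers_v(60.62, 1066, r=6, c=2)
print(round(v, 2))  # 0.24, matching the reported effect size
```

The check confirms the two reported statistics are mutually consistent.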

19 pages, 1032 KB  
Review
Assessment of Congestion in Heart Failure Using VExUS: Current Evidence, Limitations and Clinical Perspectives
by Cosmina-Georgiana Ponor, Maria-Ruxandra Cepoi, Marilena Renata Spiridon, Ionuț Tudorancea, Amelian Mădălin Bobu, Minerva Codruta Badescu, Alexandru Dan Costache, Sandu Cucută and Irina-Iuliana Costache-Enache
Life 2026, 16(3), 518; https://doi.org/10.3390/life16030518 - 20 Mar 2026
Viewed by 1787
Abstract
Background: Systemic venous congestion is a key driver of organ dysfunction in heart failure (HF), yet accurate non-invasive quantification remains challenging. Recognizing residual congestion is critical, since it predicts HF readmissions and mortality. Traditional assessments (physical exam, jugular venous pressure, inferior vena cava [IVC] size) are imprecise. The Venous Excess Ultrasound Score (VExUS) is a semi-quantitative point-of-care ultrasound (POCUS) protocol that integrates IVC diameter with Doppler flow patterns in the hepatic, portal and intrarenal veins to grade systemic venous overload. Methods: We conducted a narrative review of the literature (2018–2025) regarding the usefulness of VExUS in HF, covering congestion pathophysiology, clinical evidence (hemodynamic correlations, organ dysfunction, outcomes), potential applications, integration with lung ultrasound, echocardiography and biomarkers, limitations of its assessment and future directions. Results and Discussions: In HF, elevated right atrial pressure causes venous congestion. VExUS integrates IVC diameter with Doppler waveforms of hepatic, portal, and intrarenal veins to grade congestion. Emerging evidence shows higher VExUS grades correlate with elevated filling pressures, renal dysfunction, and worse outcomes. Its use may guide diuretic therapy, aid discharge planning, and monitor outpatient congestion, especially when combined with lung ultrasound and biomarkers. However, VExUS has limitations: it is technically demanding and operator-dependent. Importantly, large trials validating VExUS-guided management are lacking. Future directions include AI-driven automation of Doppler analysis and integration with multimodal congestion monitoring to provide a comprehensive congestion assessment. Conclusions: VExUS is a promising noninvasive tool for quantifying congestion in HF. Higher grades are associated with organ dysfunction and poor prognosis. 
Incorporating this technique into HF care may improve congestion-guided therapy, but large-scale validation is required before routine use. Full article

33 pages, 2332 KB  
Article
EvalHack: Answer-Side Prompt Injection for Probing LLM Exam-Grading Panel Stability
by Catalin Anghel, Marian Viorel Craciun, Adina Cocu, Andreea Alexandra Anghel, Antonio Stefan Balau, Adrian Istrate and Aurelian-Dumitrache Anghele
Information 2026, 17(3), 297; https://doi.org/10.3390/info17030297 - 18 Mar 2026
Viewed by 498
Abstract
Large language models are increasingly used as automated graders, yet their reliability under answer-side manipulation and their behavior in multi-model panels remain insufficiently understood. This paper introduces EvalHack, a matrix benchmark in which a fixed committee of four LLMs grades university-level machine learning exam answers under a strict integer-only contract (0–10) grounded in instructor-authored rubric artifacts. The dataset comprises 100 students answering 10 short, open-ended items (1000 answers). For each answer, the evaluation includes a clean version and two content-preserving adversarial variants that operate only on the student text: A1, a visible coercive suffix appended to the answer, and A2, a stealth variant that uses Unicode control characters (e.g., zero-width and bidirectional marks) to embed an instruction. EvalHack instruments the full grading pipeline, recording item-level member scores, the committee aggregate, within-panel disagreement, and discrepancies to human grades. Empirically, answer-side edits induce systematic score inflation and stronger top-end concentration, with edited answers clustering near the upper end of the scale. Within-panel disagreement, measured as the range between the highest and lowest member score, varies across conditions, with median Consistency Spread values of 3.0 (clean), 2.0 (A1), and 6.0 (A2). Compared to human graders, the panel is more lenient on average (MAE = 1.897; bias human − panel = −1.345). Finally, grouping items by disagreement shows that low-disagreement items exhibit smaller human-panel errors, indicating that within-panel spread can serve as a practical uncertainty signal for routing difficult answers to human review or to larger/more specialized panels. Full article
(This article belongs to the Section Artificial Intelligence)
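The paper's two headline metrics, the within-panel Consistency Spread (highest minus lowest member score) and the human-panel MAE and signed bias, are straightforward to compute. A minimal sketch (the function names and example scores are illustrative, not from the EvalHack dataset):

```python
import statistics

def consistency_spread(panel_scores):
    """Within-panel disagreement: highest minus lowest member score."""
    return max(panel_scores) - min(panel_scores)

def mae_and_bias(human_scores, panel_scores):
    """MAE and signed bias (human - panel); a negative bias means the
    panel is more lenient than the human graders on average."""
    diffs = [h - p for h, p in zip(human_scores, panel_scores)]
    mae = statistics.mean(abs(d) for d in diffs)
    bias = statistics.mean(diffs)
    return mae, bias

# A hypothetical four-member panel grading one answer on the 0-10 scale:
print(consistency_spread([7, 8, 6, 9]))  # 3
```

Under this convention, the reported bias of −1.345 (human − panel) indicates panel leniency, matching the paper's reading.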

32 pages, 2055 KB  
Article
Leveraging Transformers and LLMs for Automated Grading and Feedback Generation Using a Novel Dataset
by Asmaa G. Khalf, Emad Nabil, Wael H. Gomaa, Oussama Benrhouma and Amira M. El-Mandouh
Data 2026, 11(3), 57; https://doi.org/10.3390/data11030057 - 16 Mar 2026
Viewed by 539
Abstract
Automated Short Answer Grading (ASAG) has garnered significant attention in the field of educational technology due to its potential to improve the efficiency, scalability, and consistency of student assessments. This study introduces a novel dataset of 651 student responses from a Database Transaction course exam at Beni-Suef University, referred to as the Beni-Suef Transaction Processing (BeSTraP) dataset. The BeSTraP is specifically designed to support ASAG evaluation. To assess ASAG performance, five approaches were employed: string-based similarity, semantic similarity, a hybrid of both, fine-tuning transformer-based models, and the application of Large Language Models (LLMs). The experimental results indicated that fine-tuned transformers, particularly GPT-2, achieved the highest Pearson correlation with human scores (0.8813) on the new dataset and maintained robust performance on the Mohler benchmark (0.7834). In addition to grading, the framework integrates automated feedback generation through LLMs, further enriching the assessment process. This research contributes (i) a novel, domain-specific dataset derived from an actual university examination, (ii) a comprehensive comparison of traditional and transformer-based approaches, and (iii) evidence of the efficacy of fine-tuned models in providing accurate and scalable grading solutions. The created dataset will be publicly available for the community. Full article
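Pearson correlation against human scores is the study's main evaluation metric; for reference, a dependency-free sketch (the example data is illustrative, not from BeSTraP):

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between predicted and human scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Perfectly linear predictions correlate at 1.0:
print(round(pearson_r([1, 2, 3], [2, 4, 6]), 6))  # 1.0
```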

12 pages, 630 KB  
Article
Subconcussive Head Injuries Negatively Affect Academic Achievement in Adolescent Males
by Michael A. Carron, Lauren E. Caplick and Vincent J. Dalbo
Children 2026, 13(3), 399; https://doi.org/10.3390/children13030399 - 13 Mar 2026
Viewed by 421
Abstract
Background/Objectives: To determine the effects of a subconcussive head injury on adolescent student academic achievement assessed by grade point average (GPA). Methods: The study utilised an experimental (subconcussive head injury, n = 45) and a matched pair control group (n = 45). Data were collated at baseline (i.e., the term prior to sustaining a subconcussive head injury) and the term the subconcussive head injury occurred. Subconcussive head injuries were preliminarily assessed onsite by a registered nurse and diagnosed by a general practitioner using established protocol. The average subconcussive head injury occurred 26.93 ± 15.22 days prior to the exam period, which is when all graded assessments/examinations occurred. All participants (N = 90) were adolescent males (age: 14.04 ± 1.48 years) in grades 7–12 (grade: 8.62 ± 1.51). An independent t-test was used to test for potential between group differences at baseline. Separate dependent t-tests were used to test for the effects of a subconcussive head injury on GPA in the experimental group and the effects of time on GPA in the control group. Standardised Cohen’s d with 95% confidence intervals were used to quantify the meaningfulness of the potential between or within group differences. Results: Non-meaningful, non-significant differences were revealed for all variables between the experimental and control group at baseline. A subconcussive head injury resulted in a meaningful and significant decrease in GPA (d = −0.417, 95% CI = −0.720 to −0.110, small, p = 0.008); while a non-meaningful, non-significant increase in GPA occurred in the matched pair control group (d = 0.037, 95% CI = −0.256 to 0.329, trivial, p = 0.808). Conclusions: Our findings provide initial evidence suggesting the need for return to learn protocols to consider subconcussive head injuries. Full article
(This article belongs to the Section Global Pediatric Health)
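The within-group effect sizes can be reproduced in a few lines. The sketch below standardises by the SD of the paired difference scores, which is one common convention for dependent-samples designs; the study's exact standardisation is not stated in the abstract, and the example values are illustrative:

```python
import statistics

def cohens_d_paired(baseline, followup):
    """Cohen's d for a within-group change, standardised by the SD of
    the paired difference scores (one common convention)."""
    diffs = [b - a for a, b in zip(baseline, followup)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

# Illustrative GPA-like values (not the study's data):
print(cohens_d_paired([3, 4, 5, 4], [2, 3, 4, 4]))  # -1.5
```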

33 pages, 2576 KB  
Article
ExamQ-Gen: Instructor-in-the-Loop Generation of Self-Contained Exam Questions from Course Materials and Decision-Support Grading
by Catalin Anghel, Emilia Pecheanu, Andreea Alexandra Anghel, Marian Viorel Craciun and Adina Cocu
Computers 2026, 15(3), 177; https://doi.org/10.3390/computers15030177 - 9 Mar 2026
Viewed by 391
Abstract
Reliable evaluation of large language models (LLMs) for educational use requires benchmarks that reflect exam constraints, instructor grading practices, and the operational consequences of thresholded decisions. This paper introduces ExamQ-Gen, an instructor-in-the-loop benchmark that couples two tasks: (i) an LLM answering university-style exam questions and (ii) decision-support grading aligned with an instructor reference. Automatic grading is used for triage and feedback; in practice, ExamQ-Gen supports instructor-led exam authoring and provides grading recommendations, while the instructor issues the final grade and pass/fail decision. ExamQ-Gen is constructed from the course content by using an LLM to generate exam-style questions directly from the lecture materials, producing a course-derived question set suitable for controlled experimentation. The benchmark then instantiates contrasting exam conditions, including instructor-authored (HUMAN) versus pipeline-generated (PIPELINE) artifacts, to evaluate robustness under distribution shifts that can occur when exam questions and answers are produced through different generation workflows. Using two LLM “students” (Llama3-8B-Instruct and Mistral-7B-Instruct) and an LLM-based grader, we compare automatic grading against an instructor reference on a 1–10 score scale and at the decision level induced by the operational pass policy (pass if score ≥ 9). Accordingly, our conclusions are conditioned on the two evaluated student models. Score-level agreement is strong under HUMAN conditions but degrades substantially under PIPELINE conditions, indicating condition-dependent stability. At the pass threshold, decision errors are highly asymmetric, with false fails dominating false passes, meaning that conservative grading may appear safe while producing credit denial. 
A severity-focused analysis isolates a high-stakes failure mode—denial of instructor-perfect answers—and shows that, in the most affected PIPELINE condition, the perfect-pass miss rate reaches 0.926 (50/54), consistent with systematic conservatism rather than borderline noise. Overall, the results highlight that aggregate score agreement and accuracy are insufficient for instructor-controlled exam deployment and motivate reporting practices that combine disaggregated score agreement, threshold-based error asymmetry with uncertainty, and severity-aware diagnostics under exam-relevant condition shifts. Full article
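The decision-level asymmetry described above reduces to counting false fails and false passes under the operational pass policy (pass if score ≥ 9). A minimal sketch (function names and example scores are illustrative):

```python
def decision_errors(human_scores, panel_scores, threshold=9):
    """Count false fails (human passes, grader fails) and false passes
    under an operational pass policy of score >= threshold."""
    false_fails = sum(h >= threshold > p
                      for h, p in zip(human_scores, panel_scores))
    false_passes = sum(p >= threshold > h
                       for h, p in zip(human_scores, panel_scores))
    return false_fails, false_passes

# Sanity check on the reported perfect-pass miss rate of 50/54:
print(round(50 / 54, 3))  # 0.926
```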

13 pages, 1182 KB  
Article
In-Person vs. Virtual: A Comparative Study of Teaching Methods in Nutritional Medicine
by Benjamin Caspar Raphael Trutwin, Jantje Eilers, Hans Joachim Herrmann, Markus Friedrich Neurath, Matthias Kohl, Yurdagül Zopf and Leonie Cordelia Burgard
Nutrients 2026, 18(5), 821; https://doi.org/10.3390/nu18050821 - 3 Mar 2026
Cited by 1 | Viewed by 694
Abstract
Background/Objectives: Nutritional medicine remains underrepresented in medical education despite its relevance across specialties. Online learning offers a resource-efficient option to address this gap, yet evidence on the effectiveness and acceptability of online learning modules (OLMs) is limited. Methods: In this exploratory randomized controlled single post-test trial, medical students were assigned to either an OLM or an in-person lecture (IPL) on nutritional medicine (n = 91, no a priori sample size calculation performed). After course completion, students took a knowledge test and completed a questionnaire on their learning experience. Group differences were analyzed using permutation Welch t-tests, Wilcoxon–Mann–Whitney tests, or Fisher’s exact tests, depending on variable characteristics, with α = 0.05. Results: OLM students achieved significantly higher test scores than IPL students (mean difference: 2.4 points on a 0–40 scale), resulting in differences in grade classification (p < 0.05). OLM was further rated more favorably regarding content delivery, overall course evaluation, and exam preparation (all p < 0.05), while self-reported attention, concentration, and involvement did not differ between groups. Flexibility, time savings, and convenience were the most frequently reported advantages of OLM over IPL. Conclusions: This study suggests that OLM in nutritional medicine may be associated with higher test performance and more favorable student evaluations compared to IPL. These findings highlight the potential of online learning as a scalable, resource-efficient approach that may help address persistent gaps in nutritional medicine education. Building on this evidence, future work should examine how such modules can be optimally integrated into medical curricula to complement existing teaching structures. Full article
(This article belongs to the Section Nutritional Policies and Education for Health Promotion)
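A permutation Welch t-test, one of the analyses named above, compares the observed Welch statistic against its distribution under random relabelling of the two groups. A dependency-free sketch (the group data below is illustrative, not the trial's scores):

```python
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal variances."""
    num = statistics.mean(a) - statistics.mean(b)
    den = (statistics.variance(a) / len(a)
           + statistics.variance(b) / len(b)) ** 0.5
    return num / den

def permutation_welch(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for Welch's t."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(welch_t(pooled[:len(a)], pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction

# Clearly separated groups yield a small p-value:
print(permutation_welch([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]) < 0.05)  # True
```

In practice a library routine would be used; the sketch only shows the resampling logic behind the reported p-values.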

21 pages, 28351 KB  
Article
Development of a Radiotherapy-Induced Wound Model in Wistar Rats: Simulating Post-Radiation Skin and Soft Tissue Complications for Therapeutic Evaluation
by Stefana Avadanei-Luca, Bogdan Ionel Tamba, Irina Draga Caruntu, Simona Eliza Giusca, Andrei Daniel Timofte, Andrei Szilagyi, Ivona Costachescu, Maria Raluca Gogu, Andrei Nicolae Avadanei, Mihaela Pertea, Malek Benamor, Ionel Daniel Cojocaru, Mihai Liviu Ciofu and Viorel Scripcariu
Biomedicines 2026, 14(2), 415; https://doi.org/10.3390/biomedicines14020415 - 12 Feb 2026
Viewed by 717
Abstract
Background/Objectives: Radiotherapy can severely impair skin and soft tissue healing, particularly when high doses or subsequent surgical interventions are involved. Robust experimental platforms that replicate clinically relevant radiation-impaired wound healing remain limited. This study aims to establish a reproducible experimental model for radiation-induced cutaneous injury using contemporary clinical radiotherapy techniques. Methods: A Wistar rat model was developed using single-dose external beam irradiation delivered by clinical-grade volumetric modulated arc therapy (VMAT; 6 MV FFF), at doses of 20 Gy or 30 Gy. Animals were distributed in five distinct groups: G1—control, G2—20 Gy irradiation only, G3—20 Gy irradiation followed by excision, G4—excision only, G5—30 Gy irradiation only. Standardized full-thickness skin excision (1.5 × 1.5 cm) was performed one-week post-irradiation to simulate surgical intervention in pre-irradiated tissue. Animals were monitored for up to 42 days, through skin damage macroscopic scoring, body weight, hematological and biochemical parameters, and a qualitative histological exam. Results: Single-dose irradiation with 20 Gy induced moderate, self-limiting radiation dermatitis with complete healing. When combined with delayed excision, 20 Gy irradiation resulted in more severe and prolonged wound healing impairment, and transient systemic alterations. Excision alone produced controlled wounds with predictable healing. Exploratory observations following 30 Gy irradiation revealed severe cutaneous injury and marked systemic involvement, with a high mortality rate. Conclusions: This study establishes a foundational model for radiation-impaired wound healing using clinical-grade VMAT delivery and standardized delayed excision. The 20 Gy-based protocols provide an ethically sustainable and experimentally tractable platform for future mechanistic and therapeutic studies. Full article
(This article belongs to the Section Molecular and Translational Medicine)

13 pages, 707 KB  
Article
Does It Make Sense to Perform Prostate Magnetic Resonance Imaging in Men with Normal PSA (<4 ng/mL)?
by Pieter De Visschere, Camille Berquin, Pieter De Backer, Joris Vangeneugden, Eva Donck, Thomas Tailly, Valérie Fonteyne, Sofie Verbeke, Sigi Hendrickx, Nicolaas Lumen, Daan De Maeseneer, Geert Villeirs and Charles Van Praet
Cancers 2026, 18(3), 423; https://doi.org/10.3390/cancers18030423 - 28 Jan 2026
Viewed by 529
Abstract
Objective: We evaluate the performance and relevance of MRI to detect clinically significant prostate cancer (csPC) in men with normal PSA. Methods: Out of our database of patients referred for prostate MRI, we selected men with PSA < 4 ng/mL for whom histopathology or at least 2 years of clinical follow-up data were available as standard of reference. Subgroup analyses were performed for the patients with PSA < 3 ng/mL, <2 ng/mL, and 2–3.9 ng/mL. The reasons for prostate MRI referral despite their normal PSA level were retrieved by exploring the patients’ files. The prostate MRIs were reported according to the Prostate Imaging and Reporting Data System (PI-RADS), and the overall assessment score was registered. For evaluation of the performance, PI-RADS ≥ 3 was set as a threshold for a positive exam. The patients without PC or only International Society of Urological Pathology (ISUP) grade group 1 PC (Gleason 3+3) were considered as one category having no csPC. The performance of prostate MRI was separately evaluated for detection of ISUP ≥ 2 and for ISUP ≥ 3 csPC. Results: A total of 148 men were included, with PSA ranging from 0.42 to 3.99 ng/mL (median 2.95, IQR 1.68–3.50) and age ranging from 36 to 84 years (median 58, IQR 52–66). A total of 74 men (50.0%) had a PSA level < 3 ng/mL, 42 (28.4%) had a PSA level < 2 ng/mL, and 106 (71.6%) had a PSA level of 2–3.9 ng/mL. 
They were referred for prostate MRI for a wide variety of reasons, usually in combination, such as younger age (<60 years in 55.4%, N = 82; <50 years in 17.6%, N = 26), abnormal digital rectal examination in 31.8% of cases (N = 47), suspicious PSA dynamics in 29.7% (N = 44), positive familial history in 27.0% (N = 40), clinical signs of prostatitis in 18.2% (N = 27), suspicious findings on Transrectal Ultrasound (TRUS) in 16.9% (N = 25), hematospermia in 7.4% (N = 11), hematuria in 4.1% (N = 6), incidental hot spot in the prostate on Fluoro-Deoxy-Glucose (FDG) Positron Emission Tomography (PET)–Computed Tomography (CT) in 4.1% (N = 6), lymphadenopathies on CT in 2.7% (N = 4), or severe patient anxiety in 3.4% (N = 5). Overall, ISUP ≥ 2 PC was present in 18.9% (N = 28) of cases, and MRI detected this with a sensitivity of 92.9%, a specificity of 66.7%, and a positive predictive value of 39.4%. ISUP ≥ 3 PC was present in 9.5% (N = 14) of cases, and prostate MRI detected this with a sensitivity of 100%, a specificity of 61.2%, and a positive predictive value of 21.2%. In patients with PSA < 2 ng/mL (N = 42), no csPC was found, but MRI generated false positives in 33.3%. Conclusions: Performing prostate MRI in men with normal PSA (<4 ng/mL) seems useful if there are other reasons that increase the clinical suspicion of csPC. In about one-fifth of these patients, csPC is present and MRI has high sensitivity for its detection. Prostate MRI has, however, low positive predictive value in this patient group, and clinicians should be aware of the risk of false-positive MRI. Below a PSA level of 2 ng/mL, no csPC was found and prostate MRI generated only false positives, suggesting limited value in this subgroup. Full article
(This article belongs to the Special Issue Updates on Imaging of Common Urogenital Neoplasms—2nd Edition)
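The reported ISUP ≥ 2 figures can be back-calculated into a confusion matrix: 28 cancers and 120 non-cancers, giving 26 true positives, 2 false negatives, 40 false positives, and 80 true negatives. These counts are an inference from the rounded percentages, not stated in the abstract. A minimal check:

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and positive predictive value from
    confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    return sensitivity, specificity, ppv

# Inferred counts for ISUP >= 2 (PI-RADS >= 3 as the positive threshold):
sens, spec, ppv = diagnostic_metrics(tp=26, fn=2, fp=40, tn=80)
print(f"{sens:.1%} {spec:.1%} {ppv:.1%}")  # 92.9% 66.7% 39.4%
```

The output matches the abstract's reported sensitivity, specificity, and PPV exactly.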

19 pages, 1421 KB  
Article
Turning the Page: Pre-Class AI-Generated Podcasts Improve Student Outcomes in Ecology and Environmental Biology
by Laura Díaz and Víctor D. Carmona-Galindo
Educ. Sci. 2026, 16(1), 168; https://doi.org/10.3390/educsci16010168 - 22 Jan 2026
Cited by 1 | Viewed by 802
Abstract
In the aftermath of the COVID-19 pandemic, instructors in higher education have reported a decline in foundational reading habits, particularly in STEM courses where dense, technical texts are common. This study examines a low-barrier instructional intervention that used generative AI (GenAI) to support pre-class preparation in two upper-division biology courses. Weekly AI-generated audio overviews—“podcasts”—were paired with timed, textbook-based online quizzes. These tools were not intended to replace reading, but to scaffold engagement, reduce preparation anxiety, and promote early familiarity with course content. We analyzed student engagement, perceptions, and performance using pre/post surveys, quiz scores, and exam outcomes. Students reported that the podcasts helped manage time constraints, improved their readiness for lecture, and increased their motivation to read. Those who consistently completed the quizzes performed significantly better on closed-book, in-class exams and earned higher final course grades. Our findings suggest that GenAI tools, when integrated intentionally, can reintroduce structured learning behaviors in post-pandemic classrooms. By meeting students where they are—without compromising cognitive rigor—audio-based scaffolds may offer inclusive, scalable strategies for improving academic performance and reengaging students with scientific content in an increasingly attention-fragmented educational landscape. Full article

27 pages, 1930 KB  
Article
SteadyEval: Robust LLM Exam Graders via Adversarial Training and Distillation
by Catalin Anghel, Marian Viorel Craciun, Adina Cocu, Andreea Alexandra Anghel and Adrian Istrate
Computers 2026, 15(1), 55; https://doi.org/10.3390/computers15010055 - 14 Jan 2026
Viewed by 579
Abstract
Large language models (LLMs) are increasingly used as rubric-guided graders for short-answer exams, but their decisions can be unstable across prompts and vulnerable to answer-side prompt injection. In this paper, we study SteadyEval, a guardrailed exam-grading pipeline in which an adversarially trained LoRA filter (SteadyEval-7B-deep) preprocesses student answers to remove answer-side prompt injection, after which the original Mistral-7B-Instruct rubric-guided grader assigns the final score. We build two exam-grading pipelines on top of Mistral-7B-Instruct: a baseline pipeline that scores student answers directly, and a guardrailed pipeline in which a LoRA-based filter (SteadyEval-7B-deep) first removes injection content from the answer and a downstream grader then assigns the final score. Using two rubric-guided short-answer datasets in machine learning and computer networking, we generate grouped families of clean answers and four classes of answer-side attacks, and we evaluate the impact of these attacks on score shifts, attack success rates, stability across prompt variants, and alignment with human graders. On the pooled dataset, answer-side attacks inflate grades in the unguarded baseline by an average of about +1.2 points on a 1–10 scale, and substantially increase score dispersion across prompt variants. The guardrailed pipeline largely removes this systematic grade inflation and reduces instability for many items, especially in the machine-learning exam, while keeping mean absolute error with respect to human reference scores in a similar range to the unguarded baseline on clean answers, with a conservative shift in networking that motivates per-course calibration. Chief-panel comparisons further show that the guardrailed pipeline tracks human grading more closely on machine-learning items, but tends to under-score networking answers. 
These findings are best interpreted as a proof-of-concept guardrail and require per-course validation and calibration before operational use. Full article

18 pages, 2272 KB  
Article
Machine Learning Approaches for Early Student Performance Prediction in Programming Education
by Seifeddine Bouallegue, Aymen Omri and Salem Al-Naemi
Information 2026, 17(1), 60; https://doi.org/10.3390/info17010060 - 8 Jan 2026
Viewed by 1187
Abstract
Intelligent recommender systems are essential for identifying at-risk students and personalizing learning through tailored resources. Accurate prediction of student performance enables these systems to deliver timely interventions and data-driven support. This paper presents the application of machine learning models to predict final exam grades in a university-level programming course, leveraging multi-modal student data to improve prediction accuracy. In particular, a recent raw dataset of students enrolled in a programming course across 36 class sections from the Fall 2024 and Winter 2025 terms was initially processed. The data was collected up to one month before the final exam. From this data, a comprehensive set of features was engineered, including the student’s background, assessment grades and completion times, digital learning interactions, and engagement metrics. Building on this feature set, six machine learning prediction models were initially developed using data from the Fall 2024 term. Both training and testing were conducted on this dataset using cross-validation combined with hyperparameter tuning. The XGBoost model demonstrated strong performance, achieving an accuracy exceeding 91%. To assess the generalizability of the considered models, all models were retrained on the complete Fall 2024 dataset. They were then evaluated on an independent dataset from Winter 2025, with XGBoost achieving the highest accuracy, exceeding 84%. Feature importance analysis has revealed that the midterm grade and the average completion duration of lab assessments are the most influential predictors. This data-driven approach empowers instructors to proactively identify and support at-risk students, enabling adaptive learning environments that deliver personalized learning and timely interventions. Full article
(This article belongs to the Special Issue Human–Computer Interactions and Computer-Assisted Education)
8 pages, 4771 KB  
Article
Enhancing Pathology Education Through Special Staining Integration: A Study on Diagnostic Confidence and Practical Skill Development
by Zhiling Qu, Chengcheng Wang, Yaqi Duan, Junhong Guo, Rumeng Yang, Huiling Yu, Xi Wang and Zitian Huo
Int. Med. Educ. 2026, 5(1), 10; https://doi.org/10.3390/ime5010010 - 8 Jan 2026
Viewed by 446
Abstract
Background: Pathology education requires innovative experimental teaching approaches to enhance clinical competency. This study evaluated the integration of special staining techniques into pathology curricula to improve diagnostic confidence and practical skills. Methods: The reform involved 227 medical students, incorporating acid-fast, PAS, GMS, Congo red, and other special stains into laboratory sessions. Diagnostic confidence was surveyed, and theoretical and practical exam scores were compared with 180 students from a previous grade. Statistical analysis was performed using GraphPad Prism 7.0. Results: Practical exam scores significantly improved (86.0 ± 17.2 vs. 82.2 ± 18.9, p < 0.001), while theoretical scores remained unchanged. Diagnostic confidence strongly correlated with morphological recognition, particularly for acid-fast and fungal stains. Student feedback noted challenges such as staining artifacts. Conclusion: Integrating special staining enhances practical skills and diagnostic confidence, effectively bridging basic and clinical training. Expanding such modules is recommended to advance competency-based medical education. Full article
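The cohort comparison reported above (86.0 ± 17.2, n = 227 vs. 82.2 ± 18.9, n = 180) can be reproduced in outline. This is a hedged sketch: Welch's t statistic is assumed here, since the abstract states only that GraphPad Prism was used, not which test, and the samples below are simulated from the reported summary statistics rather than the real scores.

```python
import math
import random
from statistics import mean, stdev

# Hedged sketch of a two-cohort exam-score comparison. Samples are
# simulated from the reported means/SDs; Welch's unequal-variance
# t statistic is an assumption, not the authors' documented choice.

random.seed(0)
reform = [random.gauss(86.0, 17.2) for _ in range(227)]    # reform cohort
previous = [random.gauss(82.2, 18.9) for _ in range(180)]  # prior grade

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    va, vb = stdev(a) ** 2 / len(a), stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(reform, previous)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```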
23 pages, 5056 KB  
Article
Identifying Features of LLM-Resistant Exam Questions: Insights from Artificial Intelligence (AI)–Student Performance Comparisons
by Asen Stoyanov and Anely Nedelcheva
Sci 2025, 7(4), 183; https://doi.org/10.3390/sci7040183 - 12 Dec 2025
Viewed by 1022
Abstract
Large language models (LLMs) are rapidly being explored as tools to support learning and assessment in health science education, yet their performance across discipline-specific evaluations remains underexamined. This study evaluated the accuracy of two prominent LLMs on university-level pharmacognosy examinations and compared their performance to that of pharmacy students. Authentic exam papers comprising a range of question formats and content categories were administered to ChatGPT and DeepSeek using a structured prompting approach. Student data were anonymized and LLM responses were graded using the same marking criteria applied to student cohorts, and a Monte Carlo simulation was conducted to determine whether observed performance differences were statistically meaningful. Facility Index (FI) values were calculated to contextualize item difficulty and identify where LLM performance aligned or diverged from student outcomes. The models demonstrated variable accuracy across question types, with a stronger performance in recall-based and definition-style items and comparatively weaker outputs for applied or interpretive questions. Simulated comparisons showed that LLM performance did not uniformly exceed or fall below that of students, indicating dimension-specific strengths and constraints. These findings suggest that while LLM-resistant examination design is contingent on question structure and content, further research should refine their integration into pharmacy education. Full article
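One possible Monte Carlo comparison of the kind described above can be sketched as follows. This is an assumption-laden illustration, not the authors' procedure: the scores, cohort size, and resampling scheme (bootstrap cohorts of student marks compared against a fixed model mark) are all invented for the example.

```python
import random
from statistics import mean

# Hedged sketch of a Monte Carlo LLM-vs-student comparison: resample
# simulated student exam scores and ask how often a resampled cohort
# mean exceeds a fixed model score. All numbers are illustrative.

random.seed(1)
student_scores = [random.gauss(70, 12) for _ in range(120)]  # simulated cohort
llm_score = 74.0                                             # one model's mark

def mc_exceedance(scores, target, trials=10_000, n=30):
    """Fraction of Monte Carlo cohorts (size n, drawn with replacement)
    whose mean score beats `target`."""
    hits = 0
    for _ in range(trials):
        sample = random.choices(scores, k=n)
        hits += mean(sample) > target
    return hits / trials

p = mc_exceedance(student_scores, llm_score)
print(f"share of simulated cohorts outscoring the model: {p:.3f}")
```

Run per question category, a comparison like this would surface the dimension-specific pattern the abstract reports: the model winning on recall items while losing on applied ones.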
17 pages, 475 KB  
Article
The Relationship Between the Mathematics Anxiety and Mathematics Achievement of Middle School Students: The Moderating Effect of Working Memory
by Hongye Ma and Changan Sun
Behav. Sci. 2025, 15(11), 1566; https://doi.org/10.3390/bs15111566 - 17 Nov 2025
Viewed by 1886
Abstract
To investigate the moderating role of working memory subcomponents in the relationship between mathematics anxiety and mathematics achievement among middle school students, this study selected 92 seventh-grade students (45 boys, 47 girls) from a middle school in Suzhou City. The Mathematics Anxiety Scale was used to assess mathematics anxiety levels, while the rotation span task, operation-letter span task, and Stroop task were employed to measure visual working memory, verbal working memory, and central executive system function, respectively. Midterm mathematics exam scores served as the indicator of mathematics achievement. Data were analyzed using correlation analysis and hierarchical regression analysis. The results showed that: (1) Mathematics anxiety was significantly negatively correlated with mathematics achievement (r = −0.61, p < 0.01) and had a significant negative predictive effect on mathematics achievement (β = −0.600, p < 0.001); (2) Mathematics anxiety was significantly negatively correlated with verbal working memory (r = −0.84, p < 0.01), visual working memory (r = −0.68, p < 0.01), and the central executive system (r = −0.49, p < 0.01), and it had a significant negative predictive effect on all three; (3) Verbal working memory had a significant positive predictive effect on mathematics achievement (β = 0.481, p < 0.01); (4) Moderating effect analysis indicated that visual working memory played a significant negative moderating role in the relationship between mathematics anxiety and mathematics achievement (β = −0.226, p = 0.017), whereas the moderating effects of verbal working memory and the central executive system were not significant. The research demonstrates that working memory subcomponents play specific roles in the pathway through which mathematics anxiety affects achievement. 
The resource-dependent nature of visual working memory may exacerbate competition for cognitive resources under anxious conditions, providing empirical evidence for interventions targeting individuals with high visual working memory capacity who experience mathematics anxiety. Full article
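The moderation test at the heart of this abstract (an interaction term in a hierarchical regression) can be sketched without statistical packages. Everything below is hypothetical: the variable names, the simulated data with a built-in negative interaction, and plain least squares via the normal equations stand in for the study's instruments and software.

```python
import random

# Illustrative moderation analysis: regress an outcome on a predictor,
# a moderator, and their product term, then inspect the interaction
# coefficient. Data are simulated with a true interaction of -0.2.

random.seed(7)
n = 92  # matches the reported sample size, purely for flavor

anxiety = [random.gauss(0, 1) for _ in range(n)]
wm = [random.gauss(0, 1) for _ in range(n)]
score = [-0.6 * a + 0.4 * w - 0.2 * a * w + random.gauss(0, 0.5)
         for a, w in zip(anxiety, wm)]

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan
    elimination with partial pivoting."""
    k = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(len(X)))] for i in range(k)]
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        for r in range(k):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    return [A[i][k] / A[i][i] for i in range(k)]

X = [[1.0, a, w, a * w] for a, w in zip(anxiety, wm)]
b0, b_anx, b_wm, b_inter = ols(X, score)
print(f"interaction coefficient: {b_inter:.2f}")
```

A negative interaction coefficient here plays the role of the study's visual-working-memory moderation effect (β = −0.226): the anxiety-achievement slope steepens as the moderator increases.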