Assessing ChatGPT’s Reliability in Endodontics: Implications for AI-Enhanced Clinical Learning

Llorente de Pedro, María; Suárez, Ana; Algar, Juan; Díaz-Flores García, Víctor; Andreu-Vázquez, Cristina; Freire, Yolanda

doi:10.3390/app15105231

Open AccessArticle

Assessing ChatGPT’s Reliability in Endodontics: Implications for AI-Enhanced Clinical Learning

by

María Llorente de Pedro

¹,

Ana Suárez

²

,

Juan Algar

³

,

Víctor Díaz-Flores García

^4,*

,

Cristina Andreu-Vázquez

⁵

and

Yolanda Freire

²

¹

School for Doctoral Studies and Research, Universidad Europea de Madrid, 28670 Villaviciosa de Odón, Spain

²

Department of Preclinical Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, 28670 Villaviciosa de Odón, Spain

³

Department of Clinical Dentistry-Pregraduate Studies, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, 28670 Villaviciosa de Odón, Spain

⁴

Department of Preclinical Dentistry I, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, 28670 Villaviciosa de Odón, Spain

⁵

Department of Veterinary, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, 28670 Villaviciosa de Odón, Spain

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(10), 5231; https://doi.org/10.3390/app15105231

Submission received: 9 April 2025 / Revised: 29 April 2025 / Accepted: 6 May 2025 / Published: 8 May 2025

(This article belongs to the Special Issue Recent Developments in E-learning: Learning and Teaching AI, Learning and Teaching with AI)

Download Versions Notes

Abstract

The integration of large language models (LLMs) like ChatGPT is transforming education the health sciences. This study evaluated the applicability of ChatGPT-4 and ChatGPT-4o in endodontics, focusing on their reliability and repeatability in responding to practitioner-level questions. Thirty closed-clinical questions, based on international guidelines, were each submitted thirty times to both models, generating a total of 1800 responses. These responses were evaluated by endodontic experts using a 3-point Likert scale. ChatGPT-4 achieved a reliability score of 52.67%, while ChatGPT-4o slightly outperformed it with 55.22%. Notably, ChatGPT-4o demonstrated greater response consistency, showing superior repeatability metrics such as Gwet’s AC1 and percentage agreement. While both models show promise in supporting learning, ChatGPT-4o may provide more consistent and pedagogically coherent feedback, particularly in contexts where response dependability is essential. From an educational standpoint, the findings support ChatGPT’s potential as a complementary tool for guided study or formative assessment in dentistry. However, due to moderate reliability, unsupervised use in specialized or clinically relevant contexts is not recommended. These insights are valuable for educators and instructional designers seeking to integrate AI into digital pedagogy. Further research should examine the performance of LLMs across diverse disciplines and formats to better define their role in AI-enhanced education.

Keywords:

AI; endodontics; ChatGPT; educational reliability; instructional

1. Introduction

Artificial intelligence (AI) is a branch of computer science focused on developing machines capable of performing tasks in a human-like manner [1,2]. Within AI, large language models (LLMs) are systems developed using deep learning algorithms, designed to process natural language and generate text-based responses [3]. These LLMs have undergone significant advancements over the past decade [4], with one of the best-known examples being the Chat Generative Pre-Trained Transformer (ChatGPT, OpenAI, San Francisco, CA, USA) [5,6].

OpenAI has progressively developed various generative language models. For a better understanding of their evolution and characteristics, Table 1 summarizes the timeline of the most relevant versions, from GPT-1 to ChatGPT-4o, including the recent variants released in 2024.

According to its developers, ChatGPT-4o represents a significant evolution over GPT-4. The new model enhances overall performance while optimizing efficiency and cost, making it well suited for scientific, industrial, and commercial applications. Its advancements are particularly notable in textual reasoning, encoding, and multilingual analysis, as well as in its integration of advanced tools for processing auditory and visual information [8].

Given these capabilities, ChatGPT and other similar LLMs have rapidly emerged as influential tools in the context of online education [9].

In e-learning environments, the integration of LLMs has transformed the way students and teachers interact with content. ChatGPT, for instance, has been used to enhance student engagement [10], generate personalized learning materials [11], and support teachers through automated feedback [12] and content design [13].

Although much of the existing research has focused on the clinical and diagnostic applications of ChatGPT in healthcare [14,15,16,17,18], including dentistry [19,20,21,22,23,24,25,26], few studies have examined its effectiveness as an educational technology tool in the field of dentistry [27]. In the context of health sciences education—dental education, in particular—understanding how reliable and consistent ChatGPT is in generating accurate and repeatable responses is essential for its adoption in teaching scenarios.

In dentistry, ChatGPT-4o outperformed ChatGPT-3.5 at the Japanese National Dental Examination [19]. However, ChatGPT-4 and ChatGPT-4o have shown similar reliability in generating BI-RADS scores in medicine [28]. Meanwhile, in restorative dentistry and endodontics, researchers found that ChatGPT-4o outperformed favorably to ChatGPT-4 in student assessments, though there was no significant difference between the two versions [29]. Therefore, investigating the role of ChatGPT in educational contexts—particularly in specialized domains such as endodontics—offers valuable insights into its potential as an AI-powered educational assistant.

This aligns with the growing need to evaluate AI not merely as a technical solution, but as an integral component of digital pedagogy [30]. Understanding the reliability (accuracy) and repeatability (consistency) of ChatGPT’s responses is essential to determining its value in learning activities, whether for self-directed study, formative assessment, or content exploration.

Within the field of endodontics, several studies have evaluated the performance of different versions of ChatGPT in various contexts, including its ability to answer frequently asked questions from the patient’s perspective [31]. Its performance has also been assessed in answering dichotomous questions from the practitioner’s perspective [21], as well as in evaluating its diagnostic accuracy in endodontic assessments [19]. The results suggest that, although ChatGPT cannot replace the clinical judgment of a dentist, it may serve as a valuable tool in the field of endodontics [21], acting as a complementary source of information [31]. However, the limitations of this technology must be taken into account, making it essential to verify its capabilities before integrating it into daily clinical practices [32].

In addition, when evaluating the efficiency of ChatGPT, it is important not only to assess the accuracy of its responses but also their repeatability, defined as the variation in answers when the same prompt is used consistently [33]. This variability can result in both correct and arbitrarily incorrect answers. In this context, repeatability is just as important as reliability, as inconsistent responses may confuse users who are unable to independently verify the content provided by ChatGPT [34].

This is particularly relevant in educational settings, where students may rely heavily on AI-generated content to understand complex topics [35]. Ensuring consistency in responses can help maintain pedagogical coherence and foster trust in digital learning tools.

Therefore, assessing the repeatability of answers generated by LLMs can serve as an additional indicator of confidence in the information provided [36]. The literature reports varying degrees of repeatability in answers generated by ChatGPT-3.5 and ChatGPT-4 across different studies [26,34,37]. However, there is limited research evaluating the repeatability between ChatGPT-4 and ChatGPT-4o.

The aim of this study was to evaluate the reliability and repeatability of responses generated by ChatGPT-4 compared to ChatGPT-4o in response to endodontic questions, with the broader goal of assessing their potential application in AI-enhanced educational settings.

We hypothesized that there would be no difference between the ChatGPT-4 and ChatGPT-4o versions in terms of reliability and repeatability in generating responses in the field of endodontics.

2. Materials and Methods

2.1. Question Design

To conduct this study, two authors (A.S., Y.F.) designed 60 short questions related to endodontics, following the methodology established in earlier research [21]. The questions were based on the Position Statements of the European Society of Endodontology [38,39,40] and the AAE (American Association of Endodontics) Consensus Conference Recommended Diagnostic Terminology [41], as these sources reflect expert consensus across various key areas within the field of endodontics.

The 60 questions were independently evaluated by two experts in the field of endodontics (M.LL.DP and V.DF.G.) using a 3-point Likert scale (0 = disagree; 1 = neutral; 2 = agree). They assessed the clarity of wording, relevance to the study context, and appropriateness for inclusion. The results were recorded in an Excel spreadsheet (version 16; Microsoft, Redmond, WA, USA) and subsequently analyzed. Any disagreements between the experts were independently reviewed by a third endodontic expert (J.A.). Based on the results, 30 questions were selected to evaluate the reliability and repeatability of response generation using ChatGPT-4 and ChatGPT-4o software. (Table 2).

In the context of AI-based education, these questions were used to simulate learning prompts commonly employed on digital platforms for formative assessment or guided study.

Evaluating the consistency and accuracy of ChatGPT’s responses to these questions offers valuable insight into its effectiveness as a pedagogical support tool.

2.2. Answer Generation

To evaluate the reliability of the answers provided by ChatGPT to short endodontic questions, two authors (A.S. and Y.F.) submitted 30 questions using the prompt ‘Only short answer’ to elicit more specific responses, selecting the ‘new chat’ option each time. Each question was entered 30 times to assess the repeatability of the responses.

The 900 responses from the ChatGPT-4 version were collected in January 2024, while the 900 responses from the ChatGPT-4o version were collected in October 2024. All responses were stored in an Excel spreadsheet for analysis.

2.3. Answers Grading by Experts

Two experts (M.LL.DP and V.DF.G.) evaluated the responses generated by ChatGPT-4 and ChatGPT-4o against reference answers established in the guidelines used for the question design. A total of 1800 responses were assessed using a 3-point Likert scale (0 = incorrect; 1 = partial or incomplete; 2 = correct) (Table 3).

2.4. Statistical Analysis

For each of the 30 questions, the absolute (n) and relative (%) frequencies of responses were reported according to the assigned score (0 = incorrect, 1 = incomplete or partially correct, 2 = correct).

To evaluate the performance of ChatGPT, reliability was defined as the proportion of repeated responses that were correct for each question set. A 95% confidence interval was calculated using the Wald binomial method. The difference in reliability between ChatGPT-4 and ChatGPT-4o responses was assessed using the Chi-square test.

Repeatability was assessed through concordance analysis and weighted by ordinal categories and multiple repetitions with 95% confidence intervals. The metrics used included percent agreement, Brennan and Prediger coefficient, Conger’s generalized Cohen’s kappa, Fleiss’ kappa, Gwet’s AC, and Krippendorff’s alpha. The estimated coefficients were classified as follows: <0.0 = Poor, 0.0–0.2 = Slight, 0.2–0.4 = Fair, 0.4–0.6 = Moderate, 0.6–0.8 = Substantial, and 0.8–1.0 = Almost perfect. Differences in repeatability between ChatGPT-4 and ChatGPT-4o were analyzed by comparing the overlap of their 95% confidence intervals. All statistical analyses were performed using STATA version BE 14 (StataCorp, College Station, TX, USA), with a significance level set at 5%.

3. Results

Table 4 presents the percentages of the evaluations assigned by the experts to each question, according to the version of ChatGPT used. The percentage of correct questions varied from 0% to 100% depending on the question for each version.

The results showed 100% repeatability and accuracy in expert evaluations of responses to 7 questions generated by ChatGPT-4 (questions: 13, 24, 25, 26, 28, 29, and 30), and to 11 questions generated by ChatGPT-4o (questions: 9, 11, 13, 16, 23, 25, 26, 27, 28, 29, and 30). While none of the questions evaluated using ChatGPT-4 exhibited 100% repeatability with inaccuracy, two questions evaluated using ChatGPT-4o did (questions: 4 and 18).

ChatGPT-4 achieved an accuracy of 52.67%, with a 95% confidence interval ranging from 49.4% to 55.91%. ChatGPT-4o demonstrated a slightly higher accuracy of 55.22%, with a 95% confidence interval ranging from 51.96% to 58.44%. However, the difference between the two versions was not statistically significant (p = 0.227).

The percentage agreement across the 30 repetitions of each question was 88.5% (95% CI: 83.3–93.7%) for ChatGPT-4 and 94.6% (95% CI: 91.4–97.7%) for ChatGPT-4o. Additionally, Gwet’s agreement coefficient indicated moderate repeatability for ChatGPT-4 (AC1 = 0.729; 95% CI: 0.574–0.883) and almost perfect repeatability for ChatGPT-4o (AC1 = 0.881; 95% CI: 0.800–0.961). These results suggest that ChatGPT-4o provides more consistent responses compared to ChatGPT-4.

4. Discussion

The primary objective of this study was to assess the reliability and repeatability of ChatGPT-4 and ChatGPT-4o in generating responses to closed-clinical endodontic questions from a practitioner’s perspective. The findings partially reject the research hypothesis: While no significant differences in reliability were observed between the two models, ChatGPT-4o demonstrated a higher level of repeatability compared to ChatGPT-4.

Based on these results, it can be interpreted that although both versions of ChatGPT show limitations in terms of absolute reliability—achieving accuracy rates close to 50%—the difference in repeatability becomes especially relevant in educational contexts. The greater consistency observed in ChatGPT-4o, both in terms of percentage agreement and concordance indices, suggests that this model may offer a more dependable learning experience by reducing confusion caused by contradictory responses to identical questions. However, the variability in response quality, even for questions with well-defined clinical content, indicates that the autonomous use of these models still involves educational and clinical risks.

In terms of reliability, no statistically significant differences were observed between ChatGPT-4 and ChatGPT-4o, which aligns with previous studies reporting similar performance by both models in tasks such as radiological image interpretation [28] and assessments in restorative dentistry and endodontics [29]. Although ChatGPT-4o integrates technical improvements over ChatGPT-4 [42], the performance gap between these two models appears narrower than that observed between ChatGPT-4 and its predecessor, ChatGPT-3.5. The superiority of ChatGPT-4 over ChatGPT-3.5 has been well documented across various healthcare domains [20], including specific dental disciplines. For instance, ChatGPT-4 has outperformed its predecessor in dental licensing examinations [43], in responding to clinically relevant questions across multiple fields [22], and within specialized areas such as periodontics [44] and implantology [45]. More recent studies have also shown improved outcomes with ChatGPT-4o compared to ChatGPT-3.5 [19].

In educational contexts, the absence of significant differences in reliability indicates that both ChatGPT-4 and ChatGPT-4o may be equally viable for generating instructional content or facilitating simulated student–teacher interactions. Nonetheless, the enhanced consistency exhibited by ChatGPT-4o may prove particularly advantageous in preserving pedagogical coherence across repeated uses.

In the present study, the reliability scores obtained were 52.67% for ChatGPT-4 and 55.22% for ChatGPT-4o, which are lower than those reported by Künzle et al. [29], who observed reliability levels of 62% and 72% for ChatGPT-4 and ChatGPT-4o, respectively. These discrepancies may be attributed to differences in methodology; while the current study employed short, closed-clinical questions from the practitioner’s perspective, Künzle et al. utilized multiple-choice questions aimed at student assessment. This suggests that the question format—whether closed-clinical or multiple-choice—may significantly impact the perceived effectiveness of AI models as educational tools.

Further research is needed to evaluate ChatGPT’s performance across different question types. Evidence from other domains also indicates that the reliability of ChatGPT-4 varies by discipline, with reported accuracy rates of 71.7% in oral surgery [26] and only 25.6% in prosthodontics [37]. Similarly, Jaworski et al. [46] found varying performances of ChatGPT-4o across different areas of the Polish Final Dentistry Examination, with lower accuracy in orthodontics (52.63%) and higher in prosthodontics (80%). These findings suggest that the effectiveness of ChatGPT may depend on the specific domain of expertise.

Regarding repeatability, this study found that ChatGPT-4 exhibited a range from moderate to near-perfect agreement, while ChatGPT-4o ranged from substantial to near-perfect agreement. Thus, ChatGPT-4o demonstrated superior repeatability compared to its predecessor. Although the literature directly comparing repeatability between these versions is scarce, several studies have reported improved performance of ChatGPT-4 over ChatGPT-3.5 [34,36]. In this study, ChatGPT-4 achieved 100% repeatability on 7 questions, whereas ChatGPT-4o achieved this level on 11 questions, highlighting its greater consistency in response generation. These findings are noteworthy, as repeatability is a critical metric in assessing the performance of LLMs.

Although the literature in this area remains limited [34], comparable studies have reported similar repeatability ranges in oral surgery [26] and lower ranges in prosthodontics [37]. The observed trend of improved repeatability in newer versions—such as ChatGPT-4o compared to ChatGPT-4—highlights the importance of further exploring this metric across various fields of knowledge and question formats. Repeatability is essential for educational contexts, as it ensures that learners receive dependable feedback when exploring similar content repeatedly. Despite this improvement, the variability in reliability underscores the necessity of expert supervision to validate AI-generated content, particularly in clinical education.

As described in previous studies [37], the performance of LLMs can vary significantly when faced with highly specialized tasks. In this context, endodontics is a discipline with particular complexities that challenge the ability of generalist LLMs to provide appropriate responses: the intensive use of technical terminology, reliance on official clinical protocols, and the need for therapeutically precise answers. These characteristics may partly explain the limitations observed in terms of reliability

Although this study did not focus on directly optimizing model performance, such as through specific training, our results highlight the importance of exploring such strategies in future research. These could include prompt engineering techniques tailored to clinical tasks or supervised adaptation to specialized educational contexts.

A limitation of this study is the small number of questions evaluated (n = 30), which may restrict the wider application of the results to the entire field of endodontics. However, this methodology was selected based on previous studies [21,26,37], which used limited sets of clinical questions carefully selected and validated by experts. Additionally, this methodological design was strengthened by repeating each question 30 times per model, which generated 1800 responses individually evaluated by specialists, thus providing a solid foundation for the reliability and repeatability analyses.

Another limitation of the study could be the evaluation scale used. In this study, a quantitative evaluation based on a three-level scale was chosen, which is suitable for assessing brief responses to closed-clinical questions. However, we acknowledge that in other contexts—where open-ended questions or more complex clinical cases are presented—it would be appropriate to incorporate additional metrics such as depth, reasoning logic, or comprehensiveness of information. These could enrich the evaluation of the pedagogical capabilities of LLMs.

In addition to considerations of reliability and repeatability, it is important to highlight some deficiencies observed in the model’s behavior. First, partially correct responses were detected that, despite including some appropriate information, omitted essential elements according to the clinical guidelines used. In other cases, the model provided incorrect answers with seemingly confident phrasing, which could mislead students. Unexplained variations were also identified between responses generated from the same input, suggesting a lack of internal control in text generation.

At a more structural level, there is a risk of bias in the model stemming from its training, both in relation to culturally specific clinical practices and outdated versions of therapeutic recommendations. These shortcomings highlight the necessity for the educational use of ChatGPT to always be mediated by a tutor or expert instructor. This ensures that students do not become dependent on a system that, although useful as a support tool, cannot by itself guarantee accuracy or appropriate contextualization.

Future research on the performance of ChatGPT could focus on evaluating a larger number of questions within the same field of knowledge in order to gain a better understanding of its performance in specific contexts. Likewise, it would be interesting to analyze questions with different levels of complexity to assess ChatGPT’s performance in various clinical and educational contexts. Finally, it would be relevant to investigate how different types of prompts may influence the model’s performance, considering their potential impact on the reliability and consistency of the generated responses.

Despite these encouraging findings, the study also underscores limitations related to the reliability of responses. Consequently, the integration of artificial intelligence tools in educational contexts—particularly in clinical or technical disciplines—should be carefully guided and supervised by human educators. This oversight is essential to ensure the accuracy of information, clarity of content, and alignment with curricular objectives.

5. Conclusions

From an educational standpoint, the findings of this study indicate that while ChatGPT shows promise as a supportive learning tool in specialized domains such as endodontics, its current level of performance falls short of the standards necessary for independent, unsupervised educational use.

The reliability of ChatGPT in generating responses to endodontic questions was limited in both versions, with no significant differences observed between them. However, ChatGPT-4o demonstrated greater repeatability than ChatGPT-4, which could support its use in contexts where pedagogical consistency is required.

However, the observed deficiencies must also be considered, such as partially correct responses, omissions of key information, or the potential for bias stemming from training data. These limitations underscore the need for ChatGPT’s use to be supervised by professionals, avoiding reliance on a tool that does not always guarantee clinical accuracy or curricular alignment.

Regarding its application, ChatGPT could be integrated as a complementary resource for self-learning, formative assessment, or guided content exploration. To ensure its effective use, it is essential to design instructional activities that contextualize its responses and promote critical review by students.

Looking ahead, it is recommended that developers consider adapted versions of the models trained on validated clinical content, and that educators employ pedagogical strategies incorporating prompts specifically designed for clinical tasks.

Finally, future research should evaluate these proposals and expand the analysis of the model across different disciplines and educational formats, with the aim of determining its actual and safe role in AI-assisted education.

Author Contributions

Conceptualization, M.L.d.P., A.S., V.D.-F.G. and Y.F.; methodology, M.L.d.P., A.S., J.A., C.A.-V., V.D.-F.G. and Y.F.; formal analysis, C.A.-V.; investigation, M.L.d.P., J.A. and V.D.-F.G.; writing—original draft preparation, M.L.d.P.; writing—review and editing, A.S. and Y.F.; supervision, A.S. and Y.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data obtained in this study are available upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Deng, J.; Lin, Y. The Benefits and Challenges of ChatGPT: An Overview. Front. Comput. Intell. Syst. 2023, 2, 81–83. [Google Scholar] [CrossRef]
Shetty, S.; Gali, S.; Augustine, D.; SV, S. Artificial Intelligence Systems in Dental Shade-matching: A Systematic Review. J. Prosthodont. 2024, 33, 519–532. [Google Scholar] [CrossRef] [PubMed]
Schwartz, I.S.; Link, K.E.; Daneshjou, R.; Cortés-Penfield, N. Black Box Warning: Large Language Models and the Future of Infectious Diseases Consultation. Clin. Infect. Dis. 2024, 78, 860–866. [Google Scholar] [CrossRef] [PubMed]
Gupta, R.; Park, J.B.; Bisht, C.; Herzog, I.; Weisberger, J.; Chao, J.; Chaiyasate, K.; Lee, E.S. Expanding Cosmetic Plastic Surgery Research With ChatGPT. Aesthetic Surg. J. 2023, 43, 930–937. [Google Scholar] [CrossRef] [PubMed]
Barrington, N.M.; Gupta, N.; Musmar, B.; Doyle, D.; Panico, N.; Godbole, N.; Reardon, T.; D’Amico, R.S. A Bibliometric Analysis of the Rise of ChatGPT in Medical Research. Med. Sci. 2023, 11, 61. [Google Scholar] [CrossRef]
Deiana, G.; Dettori, M.; Arghittu, A.; Azara, A.; Gabutti, G.; Castiglia, P. Artificial Intelligence and Public Health: Evaluating ChatGPT Responses to Vaccination Myths and Misconceptions. Vaccines 2023, 11, 1217. [Google Scholar] [CrossRef]
OpenAI. Platform: Models. OpenAI Platform. 2024. Available online: https://platform.openai.com/docs/models (accessed on 10 March 2025).
OpenAI. GPT-4o: An Omni Model Integrating Text, Vision, and Audio. OpenAI Blog. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 10 March 2025).
Almogren, A.S.; Al-Rahmi, W.M.; Dahri, N.A. Exploring Factors Influencing the Acceptance of ChatGPT in Higher Education: A Smart Education Perspective. Heliyon 2024, 10, e31887. [Google Scholar] [CrossRef]
Fidan, M.; Gencel, N. Supporting the Instructional Videos With Chatbot and Peer Feedback Mechanisms in Online Learning: The Effects on Learning Performance and Intrinsic Motivation. J. Educ. Comput. Res. 2022, 60, 1716–1741. [Google Scholar] [CrossRef]
Abas, M.A.; Arumugam, S.E.; Yunus, M.M.; Rafiq, K.R.M. ChatGPT and Personalized Learning: Opportunities and Challenges in Higher Education. Int. J. Acad. Res. Bus. Soc. Sci. 2023, 13, 3936–3945. [Google Scholar] [CrossRef]
Cao, S.; Zhong, L. Exploring the Effectiveness of ChatGPT-Based Feedback Compared with Teacher Feedback and Self-Feedback: Evidence from Chinese to English Translation. arXiv 2023, arXiv:2309.01645. [Google Scholar]
Basri, W.S.; Attar, R.W.; Albagmi, S.; Alibrahim, D.; Alanezi, F.; Almutairi, S.A.; AboAlsamh, H.M.; Alsedrah, I.T.; Arif, W.M.; Alsadhan, A.A.; et al. Effectiveness of ChatGPT for Educators Professional Development: An Empirical Study with Medical Faculty. Nutr. Health, 2025; in press. [Google Scholar] [CrossRef] [PubMed]
Antaki, F.; Touma, S.; Milad, D.; El-Khoury, J.; Duval, R. Evaluating the Performance of ChatGPT in Ophthalmology. Ophthalmol. Sci. 2023, 3, 100324. [Google Scholar] [CrossRef]
Cadamuro, J.; Cabitza, F.; Debeljak, Z.; De Bruyne, S.; Frans, G.; Perez, S.M.; Ozdemir, H.; Tolios, A.; Carobene, A.; Padoan, A. Potentials and Pitfalls of ChatGPT and Natural-Language Artificial Intelligence Models for the Understanding of Laboratory Medicine Test Results. An Assessment by the European Federation of Clinical Chemistry and Laboratory Medicine (EFLM) Working Group on Artificial Intelligence (WG-AI). Clin. Chem. Lab. Med. (CCLM) 2023, 61, 1158–1166. [Google Scholar] [CrossRef] [PubMed]
Das, D.; Kumar, N.; Longjam, L.A.; Sinha, R.; Deb Roy, A.; Mondal, H.; Gupta, P. Assessing the Capability of ChatGPT in Answering First- and Second-Order Knowledge Questions on Microbiology as per Competency-Based Medical Education Curriculum. Cureus 2023, 15, e36034. [Google Scholar] [CrossRef] [PubMed]
Ge, J.; Lai, J.C. Artificial Intelligence-Based Text Generators in Hepatology: ChatGPT Is Just the Beginning. Hepatol. Commun. 2023, 7, e0097. [Google Scholar] [CrossRef] [PubMed]
Juhi, A.; Pipil, N.; Santra, S.; Mondal, S.; Behera, J.K.; Mondal, H. The Capability of ChatGPT in Predicting and Explaining Common Drug-Drug Interactions. Cureus 2023, 15, e36272. [Google Scholar] [CrossRef]
Uehara, O.; Morikawa, T.; Harada, F.; Sugiyama, N.; Matsuki, Y.; Hiraki, D.; Sakurai, H.; Kado, T.; Yoshida, K.; Murata, Y.; et al. Performance of ChatGPT-3.5 and ChatGPT-4o in the Japanese National Dental Examination. J. Dent. Educ. 2024, 89, 459–466. [Google Scholar] [CrossRef]
Jin, H.K.; Lee, H.E.; Kim, E. Performance of ChatGPT-3.5 and GPT-4 in National Licensing Examinations for Medicine, Pharmacy, Dentistry, and Nursing: A Systematic Review and Meta-Analysis. BMC Med. Educ. 2024, 24, 1013. [Google Scholar] [CrossRef]
Suárez, A.; Jiménez, J.; de Pedro, M.L.; Andreu-Vázquez, C.; García, V.D.-F.; Sánchez, M.G.; Freire, Y. Beyond the Scalpel: Assessing ChatGPT’s Potential as an Auxiliary Intelligent Virtual Assistant in Oral Surgery. Comput. Struct. Biotechnol. J. 2023, 24, 46–52. [Google Scholar] [CrossRef]
Giannakopoulos, K.; Kavadella, A.; Aaqel Salim, A.; Stamatopoulos, V.; Kaklamanos, E.G. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study. J. Med. Internet Res. 2023, 25, e51580. [Google Scholar] [CrossRef]
Alhaidry, H.M.; Fatani, B.; Alrayes, J.O.; Almana, A.M.; Alfhaed, N.K. ChatGPT in Dentistry: A Comprehensive Review. Cureus 2023, 15, e38317. [Google Scholar] [CrossRef] [PubMed]
Balel, Y. Can ChatGPT Be Used in Oral and Maxillofacial Surgery? J. Stomatol. Oral Maxillofac. Surg. 2023, 124, 101471. [Google Scholar] [CrossRef] [PubMed]
Eggmann, F.; Weiger, R.; Zitzmann, N.U.; Blatz, M.B. Implications of Large Language Models Such as ChatGPT for Dental Medicine. J. Esthet. Restor. Dent. 2023, 35, 1098–1102. [Google Scholar] [CrossRef] [PubMed]
Suárez, A.; Díaz-Flores García, V.; Algar, J.; Gómez Sánchez, M.; Llorente de Pedro, M.; Freire, Y. Unveiling the ChatGPT Phenomenon: Evaluating the Consistency and Accuracy of Endodontic Question Answers. Int. Endod. J. 2023, 57, 108–113. [Google Scholar] [CrossRef]
Kavadella, A.; Dias da Silva, M.A.; Kaklamanos, E.G.; Stamatopoulos, V.; Giannakopoulos, K. Evaluation of ChatGPT’s Real-Life Implementation in Undergraduate Dental Education: Mixed Methods Study. JMIR Med. Educ. 2024, 10, e51344. [Google Scholar] [CrossRef]
Nguyen, D.; Rao, A.; Mazumder, A.; Succi, M.D. Exploring the Accuracy of Embedded ChatGPT-4 and ChatGPT-4o in Generating BI-RADS Scores: A Pilot Study in Radiologic Clinical Support. Clin. Imaging 2025, 117, 110335. [Google Scholar] [CrossRef]
Künzle, P.; Paris, S. Performance of Large Language Artificial Intelligence Models on Solving Restorative Dentistry and Endodontics Student Assessments. Clin. Oral Investig. 2024, 28, 575. [Google Scholar] [CrossRef]
Huang, R.; Adarkwah, M.A.; Liu, M.; Hu, Y.; Zhuang, R.; Chang, T. Digital Pedagogy for Sustainable Education Transformation: Enhancing Learner-Centred Learning in the Digital Era. Front. Digit. Educ. 2024, 1, 279–294. [Google Scholar] [CrossRef]
Mohammad-Rahimi, H.; Ourang, S.A.; Pourhoseingholi, M.A.; Dianat, O.; Dummer, P.M.H.; Nosrat, A. Validity and Reliability of Artificial Intelligence Chatbots as Public Sources of Information on Endodontics. Int. Endod. J. 2024, 57, 305–314. [Google Scholar] [CrossRef]
Aminoshariae, A.; Kulild, J.; Nagendrababu, V. Artificial Intelligence in Endodontics: Current Applications and Future Directions. J. Endod. 2021, 47, 1352–1357. [Google Scholar] [CrossRef]
Franc, J.M.; Hertelendy, A.J.; Cheng, L.; Hata, R.; Verde, M. Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Disaster Triage of Simulated Patients Using the Simple Triage and Rapid Treatment (START) Protocol: Gage Repeatability and Reproducibility Study. J. Med. Internet Res. 2024, 26, e55648. [Google Scholar] [CrossRef] [PubMed]
Kochanek, K.; Skarzynski, H.; Jedrzejczak, W.W. Accuracy and Repeatability of ChatGPT Based on a Set of Multiple-Choice Questions on Objective Tests of Hearing. Cureus 2024, 16, e59857. [Google Scholar] [CrossRef] [PubMed]
Zhai, X.; Chu, X.; Chai, C.S.; Jong, M.S.Y.; Istenic, A.; Spector, M.; Liu, J.-B.; Yuan, J.; Li, Y. A Review of Artificial Intelligence (AI) in Education from 2010 to 2020. Complexity 2021, 2021, 8812542. [Google Scholar] [CrossRef]
Mykhalko, Y.O.; Filak, Y.F.; Dutkevych-Ivanska, Y.V.; Sabadosh, M.V.; Rubtsova, Y.I. From Open-Ended to Multiple-Choice: Evaluating Diagnostic Performance and Consistency of ChatGPT, Google Gemini and Claude AI. Wiadomości Lek. 2024, 77, 1852–1856. [Google Scholar] [CrossRef]
Freire, Y.; Santamaría Laorden, A.; Orejas Pérez, J.; Gómez Sánchez, M.; Díaz-Flores García, V.; Suárez, A. ChatGPT Performance in Prosthodontics: Assessment of Accuracy and Repeatability in Answer Generation. J. Prosthet. Dent. 2024, 131, 659.e1–659.e6. [Google Scholar] [CrossRef]
Segura-Egea, J.J.; Gould, K.; Hakan Şen, B.; Jonasson, P.; Cotti, E.; Mazzoni, A.; Sunay, H.; Tjäderhane, L.; Dummer, P.M.H. European Society of Endodontology Position Statement: The Use of Antibiotics in Endodontics. Int. Endod. J. 2018, 51, 20–25. [Google Scholar] [CrossRef]
Plotino, G.; Abella Sans, F.; Duggal, M.S.; Grande, N.M.; Krastl, G.; Nagendrababu, V.; Gambarini, G. European Society of Endodontology Position Statement: Surgical Extrusion, Intentional Replantation and Tooth Autotransplantation. Int. Endod. J. 2021, 54, 655–659. [Google Scholar] [CrossRef]
Krastl, G.; Weiger, R.; Filippi, A.; Van Waes, H.; Ebeleseder, K.; Ree, M.; Connert, T.; Widbiller, M.; Tjäderhane, L.; Dummer, P.M.H.; et al. European Society of Endodontology Position Statement: Endodontic Management of Traumatized Permanent Teeth. Int. Endod. J. 2021, 54, 1473–1481. [Google Scholar] [CrossRef]
Glickman, G.N. AAE Consensus Conference on Diagnostic Terminology: Background and Perspectives. J. Endod. 2009, 35, 1619–1620. [Google Scholar] [CrossRef]
Zhang, Q.; Wu, Z.; Song, J.; Luo, S.; Chai, Z. Comprehensiveness of Large Language Models in Patient Queries on Gingival and Endodontic Health. Int. Dent. J. 2025, 75, 151–157. [Google Scholar] [CrossRef]
Chau, R.C.W.; Thu, K.M.; Yu, O.Y.; Hsung, R.T.-C.; Lo, E.C.M.; Lam, W.Y.H. Performance of Generative Artificial Intelligence in Dental Licensing Examinations. Int. Dent. J. 2024, 74, 616–621. [Google Scholar] [CrossRef] [PubMed]
Sabri, H.; Saleh, M.H.A.; Hazrati, P.; Merchant, K.; Misch, J.; Kumar, P.S.; Wang, H.; Barootchi, S. Performance of Three Artificial Intelligence (AI)-based Large Language Models in Standardized Testing; Implications for AI-assisted Dental Education. J. Periodontal Res. 2025, 60, 121–133. [Google Scholar] [CrossRef] [PubMed]
Revilla-León, M.; Barmak, B.A.; Sailer, I.; Kois, J.C.; Att, W. Performance of an Artificial Intelligence–Based Chatbot (ChatGPT) Answering the European Certification in Implant Dentistry Exam. Int. J. Prosthodont. 2024, 37, 221–224. [Google Scholar] [CrossRef] [PubMed]
Jaworski, A.; Jasiński, D.; Sławińska, B.; Błecha, Z.; Jaworski, W.; Kruplewicz, M.; Jasińska, N.; Sysło, O.; Latkowska, A.; Jung, M. GPT-4o vs. Human Candidates: Performance Analysis in the Polish Final Dentistry Examination. Cureus 2024, 16, e68813. [Google Scholar] [CrossRef]

Table 1. Chronological evolution of ChatGPT versions and their main highlighted features.

Version	Release Date	Main Features [6,7]
GPT-1	2018	First pre-trained generative model, foundational for subsequent development.
GPT-2	2019	Increased generative capacity, with greater coherence in long sequences.
GPT-3	2020	Large-scale model with 175 billion parameters.
ChatGPT-3.5	2022	Conversational adaptation of GPT-3; widely distributed public version.
ChatGPT-4	14 March 23	Improvements in logical reasoning, expanded context, and complex comprehension.
ChatGPT-4o	13 May 24	Multimodal optimization (text, audio, image), with greater efficiency and speed.
ChatGPT-o1	September 24	Based on chain-of-thought prompting; recent versions include o1, o3-mini, etc.

Table 2. 30 questions included for answer generation using ChatGPT-4 and ChatGPT-4o.

Question Number	Question Description
1	What antibiotics are recommended to treat dental infections in patients who are allergic to beta-lactam antibiotics? Only short answer.
2	If an adult patient with no allergies requires oral antibiotic prophylaxis prior to root canal treatment, how many grams of amoxicillin will he/she need and how long before should it be administered? Only short answer.
3	If an adult patient is allergic to beta-lactam antibiotics and requires oral antibiotic prophylaxis prior to root canal treatment, how many grams of clindamycin will he/she need and how long before should it be administered? Only short answer.
4	If an adult patient is allergic to beta-lactam antibiotics and requires oral antibiotic prophylaxis prior to root canal treatment, how many grams of azithromycin will he/she need and how long before should it be administered? Only short answer.
5	If amoxicillin is used for a dental infection in a non-allergic patient, what would be the maintenance dose for how many days? Only short answer.
6	If amoxicillin with clavulanic acid is used for a dental infection in a non-allergic patient, what would be the maintenance dose for how many days? Only short answer.
7	If clindamycin is used for a dental infection in a patient who is allergic to beta-lactam antibiotics, what would be the maintenance dose? Only short answer.
8	If clarithromycin is used for a dental infection in a patient who is allergic to beta-lactam antibiotics, what would be the maintenance dose? Only short answer.
9	If azithromycin is used for a dental infection in a patient who is allergic to beta-lactam antibiotics, what would be the maintenance dose? Only short answer.
10	In the case of a pulp necrosis and root fracture of the middle third of the root, how should the endodontic treatment be performed? Only short answer.
11	In the case of a traumatized tooth with a fissure limited to the enamel, what treatment should be performed from an endodontic point of view? Only short answer
12	When should a root canal be performed on a permanent tooth that has been extracted and replanted in its socket? Only short answer.
13	In the case of a traumatized tooth with pulp canal obliteration and no symptoms, what treatment should be performed? Only short answer.
14	What treatment should be performed on a traumatized permanent tooth with severe displacement from its original position? Only short answer.
15	In the case of a traumatized tooth with a severe intrusion, what treatment should be performed from an endodontic point of view? Only short answer.
16	What is the diagnosis when the crown of a tooth turns progressively gray after a dental trauma? Only short answer.
17	What is the maximum recommended time of manipulation after surgical extrusion and intentional tooth replantation? Only short answer.
18	How long should be waited before moving orthodontically after transplanted and replanted or extruded teeth? Only short answer.
19	When should systemic antibiotic prophylaxis be considered for surgical extrusion and intentional reimplantation? Only short answer.
20	When should preoperative small field cone beam computed tomography (CBCT) be performed prior to surgical extrusion and intentional reimplantation? Only short answer.
21	What type of splint is recommended after surgical extrusion and intentional replantation? Only short answer.
22	What should be the duration of splinting after surgical extrusion and intentional replantation? Only short answer.
23	When should root canal treatment be performed after surgical extrusion and intentional replantation? Only short answer.
24	What would be the pulpal diagnosis for a tooth that presents with acute pain immediately after a heat stimulus and disappears when the stimulus is removed? Only short answer.
25	What would be the pulpal diagnosis for a tooth that presents with acute pain immediately after a cold stimulus and disappears when the stimulus is removed? Only short answer.
26	What would be the pulpal diagnosis for a tooth that presents pain that persists after a heat stimulus is removed? Only short answer
27	What would be the pulpal diagnosis for a tooth with persistent pain that causes sleep disturbance? Only short answer.
28	What would be the pulpal diagnosis for a tooth not responding to pulp tests and no previous treatment? Only short answer.
29	What would be the apical diagnosis for a tooth characterized by rapid onset, spontaneous pain, tooth tenderness to pressure, pus formation, and swelling of the associated tissues? Only short answer.
30	What would be the apical diagnosis for a tooth characterized by gradual onset, little or no discomfort, and the intermittent discharge of pus through an associated sinus tract? Only short answer.

Table 3. Expert evaluation for answers generated by ChatGPT-4 and ChatGPT-4o.

Expert’s Grading	Grading Description
Incorrect (0)	The answer provided is completely incorrect or unrelated to the question. It does not demonstrate an adequate understanding or knowledge of the topic.
Partially correct or incomplete (1)	The answer shows some understanding or knowledge of the topic, but there are significant errors or missing elements. Although not completely incorrect, the answer is not sufficiently correct or complete to be considered certain or adequate.
Correct (2)	The answer is completely accurate and shows a solid and precise understanding of the subject. All major components are addressed in a thorough and accurate manner.

Table 4. Sample and percentage grade for each answer to each question by version of ChatGPT.

Q	ChatGPT-4						ChatGPT-4o						p-Value
	Incorrect		Partially Correct or Incomplete		Correct		Incorrect		Partially Correct or Incomplete		Correct
	n	%	n	%	n	%	n	%	n	%	n	%
Q1	3	10.00	27	90.0	0	0.0	0	0.0	30	100.0	0	0.00	0.545
Q2	2	6.7	28	93,33	0	0.0	0	0.0	29	96.7	1	3.3	0.589
Q3	29	96.7	0	0.0	1	3.3	0	0.0	5	16.7	25	83.3	0.830
Q4	26	86.7	2	6.7	2	6.7	30	100.0	0	0.0	0	0.0	0.854
Q5	29	96.7	1	3.3	0	0.0	0	0.0	29	96.7	1	3.3	0.192
Q6	1	3.3	18	60.0	11	36.7	0	0.0	14	46.7	16	53.3	0.383
Q7	0	0.0	3	10.0	27	90.0	0	0.0	30	100.0	0	0.0	0.289
Q8	0	0.0	3	10.0	27	90.0	0	0.0	30	100.0	0	0.0	0.220
Q9	0	0.0	1	3.3	29	96.7	0	0.0	0	0.0	30	100.0	0.589
Q10	1	3.3	29	96.7	0	0.0	1	3.3	22	73.3	7	23.3	0.134
Q11	2	6.7	27	90.0	1	3.3	0	0.0	0	0.0	30	100.0	0.732
Q12	12	40.0	17	56.7	1	3.3	0	0.0	2	6.7	28	93.3	0.577
Q13	0	0.0	0	0.0	30	100.0	0	0.0	0	0.0	30	100.0	0.301
Q14	0	0.0	1	3.3	29	96.7	0	0.0	30	100.0	0	0.0	0.065
Q15	21	70.0	1	3.3	8	26.7	22	73.3	4	13.3	4	13.3	0.493
Q16	1	3.3	0	0.0	29	96.7	0	0.0	0	0.0	30	100.0	0.591
Q17	23	76.7	2	6.7	5	16.7	0	0.0	6	20.0	24	80.0	0.803
Q18	25	83.3	0	0.0	5	16.7	30	100.0	0	0.0	0	0.0	0.952
Q19	0	0.0	4	13.3	26	86.7	4	13.3	24	80.0	2	6.7	0.295
Q20	10	33.3	8	26.7	12	40.0	12	40.0	15	50.0	3	10.0	0.153
Q21	0	0.0	3	10.0	27	90.0	0	0.0	1	3.3	29	96.7	0.070
Q22	4	13.3	26	86.7	0	0.0	9	30.0	21	70.0	0	0.0	0.118
Q23	5	16.7	17	56.7	8	26.7	0	0.0	0	0.0	30	100.0	0.841
Q24	0	0.0	0	0.0	30	100.0	3	10.0	0	0.0	27	90.0	0.189
Q25	0	0.0	0	0.0	30	100.0	0	0.0	0	0.0	30	100.0	0.131
Q26	0	0.00	0	0.0	30	100.0	0	0.0	0	0.0	30	100.0	0.953
Q27	12	40.0	2	6.7	16	53.3	0	0.0	0	0.0	30	100.0	1.000
Q28	0	0.0	0	0.0	30	100.0	0	0.0	0	0.0	30	100.0	0.179
Q29	0	0.0	0	0.0	30	100.00	0	0.0	0	0.0	30	100.0	0.783
Q30	0	0.0	0	0.0	30	100.00	0	0.0	0	0.0	30	100.0	0.365

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Llorente de Pedro, M.; Suárez, A.; Algar, J.; Díaz-Flores García, V.; Andreu-Vázquez, C.; Freire, Y. Assessing ChatGPT’s Reliability in Endodontics: Implications for AI-Enhanced Clinical Learning. Appl. Sci. 2025, 15, 5231. https://doi.org/10.3390/app15105231

AMA Style

Llorente de Pedro M, Suárez A, Algar J, Díaz-Flores García V, Andreu-Vázquez C, Freire Y. Assessing ChatGPT’s Reliability in Endodontics: Implications for AI-Enhanced Clinical Learning. Applied Sciences. 2025; 15(10):5231. https://doi.org/10.3390/app15105231

Chicago/Turabian Style

Llorente de Pedro, María, Ana Suárez, Juan Algar, Víctor Díaz-Flores García, Cristina Andreu-Vázquez, and Yolanda Freire. 2025. "Assessing ChatGPT’s Reliability in Endodontics: Implications for AI-Enhanced Clinical Learning" Applied Sciences 15, no. 10: 5231. https://doi.org/10.3390/app15105231

APA Style

Llorente de Pedro, M., Suárez, A., Algar, J., Díaz-Flores García, V., Andreu-Vázquez, C., & Freire, Y. (2025). Assessing ChatGPT’s Reliability in Endodontics: Implications for AI-Enhanced Clinical Learning. Applied Sciences, 15(10), 5231. https://doi.org/10.3390/app15105231

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Assessing ChatGPT’s Reliability in Endodontics: Implications for AI-Enhanced Clinical Learning

Abstract

1. Introduction

2. Materials and Methods

2.1. Question Design

2.2. Answer Generation

2.3. Answers Grading by Experts

2.4. Statistical Analysis

3. Results

4. Discussion

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI