1. Introduction
Artificial Intelligence (AI) has gained significant attention in medical literature and healthcare [1,2]. The uses of AI are wide-ranging and can enhance the effectiveness of intricate tasks [3,4]. With machine learning, AI-powered programs can generate functional code and even diagnose complicated illnesses by analyzing medical history, lab tests, radiological images, or pathological findings [5]. While achieving genuine AI outcomes remains a distant goal, progress in mathematical modeling and computational capabilities has caused a surge in algorithm publications [6,7,8]. Opinions on the medical applications of AI span a broad spectrum, ranging from assisting clinicians in decision-making to the potential of surpassing human expertise [1,9].
Radiologists are vital for interpreting medical images as they are equipped with holistic knowledge about imaging patterns and can facilitate correlation with history, clinical and demographic information, and prior images, while also drawing upon expertise and experience accumulated over several years and across similar images from multiple acquisition systems likely subjected to different post-processing algorithms [10,11]. Based on this knowledge, they can provide differential diagnoses while considering probabilities and clinical contexts, which are crucial to the systematic and holistic approach to the diagnostic process in radiology. Identifying these patterns and linking them to specific pathoses are critical stages. Referencing the literature to confirm or refute diagnoses is vital, particularly for radiologists in training, despite the process’s time-consuming and resource-intensive nature. However, uncertainty remains regarding the adoption of AI and its utility as an adjunctive decision support system for radiologists [5,12].
Large language models (LLMs) like OpenAI’s ChatGPT, Google’s Bard (now Gemini), Microsoft’s Bing (Copilot), and Perplexity AI Inc.’s Perplexity AI have gained considerable attention due to their ability to understand textual information and produce contextually relevant responses. They are trained on vast amounts of medical literature and data that could aid in offering differential diagnoses from text-based descriptions of imaging patterns [13,14]. Sarangi et al. explored the utility of four LLMs in providing differential diagnoses based on cardiovascular and thoracic imaging patterns [14]. The utilization of freely available chatbots for generating relevant differential diagnoses in medical radiology has been partially investigated. However, there is a gap in exploring text-based descriptions that incorporate factors such as age, sex, and imaging patterns to offer pertinent differential diagnoses and management strategies in the field of oral and maxillofacial radiology.
This study aims to evaluate the effectiveness of generative AI (Microsoft Copilot) in delivering meaningful differential diagnoses and management recommendations when maxillofacial cone beam computed tomography (CBCT) studies are used to develop a provisional diagnosis and management strategy. The agreement between the radiologist and Microsoft Copilot was assessed in these two areas to explore the utility of the algorithm as a decision support tool for clinicians, who increasingly use it as a workflow asset. We hypothesized that there would be little agreement between the maxillofacial radiologist and Microsoft Copilot on differential diagnoses and management protocols for maxillofacial pathoses, including the need for biopsies, and that the latter would call for more such interventional procedures owing to a lack of training on data related to such pathoses.
3. Results
3.1. Differential Diagnoses
The 101 cases evaluated to assess the agreement on differential diagnoses between the radiologist’s report and Copilot were assigned scores based on the number of matching differential diagnoses, ranging from 0 to 3. A score of 1 was given if there was one matching diagnosis, 2 for two matches, and 3 for three or more matches. If there were no matches, the assigned agreement score was zero.
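For readers who wish to reproduce the scoring, a minimal sketch of the matching rule is given below (Python). It assumes that diagnoses have already been normalized to comparable labels; in the study, matching was performed by the evaluators rather than by software.

```python
def agreement_score(radiologist_dx, copilot_dx):
    """Count shared differential diagnoses and cap the agreement score at 3.

    Both arguments are iterables of normalized diagnosis labels
    (synonym matching is assumed to have been done upstream).
    """
    matches = len(set(radiologist_dx) & set(copilot_dx))
    return min(matches, 3)


# Example: two of the radiologist's three diagnoses appear in Copilot's list.
print(agreement_score(
    ["dentigerous cyst", "odontogenic keratocyst", "ameloblastoma"],
    ["odontogenic keratocyst", "radicular cyst", "ameloblastoma"],
))  # -> 2
```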
In twenty-five cases, there was no match between the radiologist’s and Copilot’s diagnoses, resulting in a score of 0. Sixteen cases scored 1, 13 scored 2, and 47 scored 3. Among the 16 cases that scored 1, the matching diagnosis was the principal finding in 6 cases (Table 1 and Figure 3). In the remaining cases, the ranking of the differential diagnoses generated by Copilot did not match the ranking in the radiologist’s reports.
A proportional analysis with a 95% confidence interval was performed to assess agreement between the radiologist and Copilot in formulating a list of differential diagnoses (Table 2). In 75.2% of cases (95% CI: 65.6–82.9%), there was agreement on at least one (score ≥ 1) differential diagnosis between the radiologist and Copilot. Complete agreement, with three or more matching differential diagnoses, was observed in 46.5% of cases (95% CI: 36.9–56.3%).
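As an illustration, the agreement proportion and its confidence interval can be computed as sketched below; the count of 76 follows from the 101 cases minus the 25 with no matching diagnosis. This is a minimal sketch assuming a standard binomial interval (Wilson, via statsmodels); the exact interval method used in the study may differ slightly.

```python
from statsmodels.stats.proportion import proportion_confint

n_cases = 101
n_agree = 76  # cases with at least one matching differential diagnosis (101 - 25)

proportion = n_agree / n_cases
low, high = proportion_confint(n_agree, n_cases, alpha=0.05, method="wilson")
print(f"Agreement: {proportion:.1%} (95% CI: {low:.1%}-{high:.1%})")
```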
The Jaccard similarity metric was computed to evaluate the degree of overlap between the differential diagnoses provided by the radiologist and by Copilot. The mean similarity score was 0.604 (60.4%), indicating that, on average, 60.4% of the combined set of differential diagnoses suggested by either source was shared by both.
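A minimal sketch of the per-case Jaccard computation is shown below (Python); it assumes each case’s differential diagnoses are represented as sets of normalized labels, with the mean taken across cases.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity: |A intersect B| / |A union B| (0.0 if both sets are empty)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# One (radiologist, Copilot) pair of diagnosis sets per case; labels are illustrative.
cases = [
    ({"dentigerous cyst", "odontogenic keratocyst"},
     {"odontogenic keratocyst", "ameloblastoma"}),
    # ... remaining cases ...
]
mean_similarity = sum(jaccard(r, c) for r, c in cases) / len(cases)
print(f"Mean Jaccard similarity: {mean_similarity:.3f}")
```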
Table 3 indicates the summary of baseline characteristics of the patients.
Table 4 lists the various differential diagnosis recommendations made by the radiologist and by Copilot. Overall, although Copilot detected more cases of certain lesions, several lesions identified by the radiologist were missed by Copilot. This indicates that Copilot did not capture all details noted by the radiologist, reflecting incomplete overlap between the two evaluations. The Jaccard similarity, which measures the overlap between the radiologist’s and Copilot’s selections for each lesion, varied widely, reflecting differing levels of concordance.
High Agreement (Jaccard ≥ 0.67):
Lesions/categories such as nasopalatine duct cyst, AV malformation, ossifying fibroma, dense bone island, TMJ abnormalities, and sialoliths show strong concordance between Copilot and radiologist annotations. These results suggest that both Copilot and radiologists reliably identify common and well-defined lesions.
Moderate Agreement (Jaccard 0.40–0.66):
Lesions such as odontogenic keratocyst, odontoma, osseous dysplasia, chronic mucositis in the sinuses, and foreign bodies exhibit moderate agreement. Copilot tends to detect more cases with these lesions than radiologists, resulting in partial overlap.
Low Agreement (Jaccard < 0.40):
Lesions/categories such as radicular cyst, dentigerous cyst, ameloblastoma, central giant cell granuloma, and malignancies show low concordance. Differences reflect challenges in identifying rare tumors or subtle cystic lesions, highlighting areas for Copilot’s improvement.
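The three agreement bands above correspond to simple thresholds applied to each lesion category’s Jaccard value; a minimal sketch of that binning (thresholds taken from the band labels, with boundary handling assumed) is:

```python
def agreement_band(jaccard_value):
    """Bin a per-lesion Jaccard value into the bands described above."""
    if jaccard_value >= 0.67:
        return "high"       # Jaccard >= 0.67
    if jaccard_value >= 0.40:
        return "moderate"   # Jaccard 0.40-0.66
    return "low"            # Jaccard < 0.40


print(agreement_band(0.72))  # -> "high"
print(agreement_band(0.55))  # -> "moderate"
print(agreement_band(0.25))  # -> "low"
```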
3.2. Management
A total of 101 cases were evaluated to determine the agreement in management strategies between the radiologist’s report and Copilot-generated data. Each case was assigned a score based on the level of agreement in management recommendations. The scores ranged from 0 to 3, with 0 indicating no agreement, 1 indicating one matching management strategy, 2 indicating two matches, and 3 indicating three or more matches. The radiologist’s ranking of management strategies did not match Copilot’s ranking. Most cases received a score of 2, with 40 cases in this category, followed by 25 cases with a score of 3 and 22 cases with a score of 1 (Table 5 and Figure 4). Of the 101 cases, 9 were not scored because the radiologist deemed no intervention necessary. However, Copilot provided management recommendations for all cases.
A proportional analysis with a 95% confidence interval was performed to assess agreement between the radiologist and Copilot in management protocols (Table 6). Copilot matched at least one (score ≥ 1) management recommendation provided by the radiologist in 94.6% of cases (95% CI: 87.5–98.0%). Complete agreement (score = 3) was observed in 27.2% of cases (95% CI: 18.5–37.5%).
The Jaccard similarity metric showed a mean similarity score of 0.641 (64.1%), indicating that, on average, 64.1% of the combined set of management strategies suggested by either source was shared by both.
3.3. Biopsy
Biopsy recommendations were evaluated to determine whether Copilot could accurately identify when a biopsy was needed. The radiologist recommended biopsies in 37 cases. Nine cases were not scored because no further management recommendations were deemed necessary by the radiologist. By contrast, Copilot recommended biopsies in 69 cases. The radiologist and Copilot concurred on recommending a biopsy in 33 cases (35.9%), while neither recommended one in 23 cases (25%).
Among the 101 cases, 9 were not scored as there were no management recommendations, leaving 92 cases for the biopsy recommendation analysis. Chi-square and Fisher’s exact tests were performed (Chi-square p = 0.001362; Fisher’s exact p = 0.002038). These tests confirm that Copilot’s recommendations are not independent of the radiologist’s decisions (p < 0.01).
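These tests can be approximated from the 2 × 2 contingency table implied by the counts reported in this section (33 concordant biopsy recommendations, 23 concordant non-recommendations, 32 Copilot-only recommendations, and 4 radiologist-only recommendations, i.e., 37 − 33). The sketch below uses scipy; the software and continuity-correction settings actually used in the study are assumptions.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Rows: Copilot biopsy yes/no; columns: radiologist biopsy yes/no.
table = [[33, 32],
         [4, 23]]

# No Yates continuity correction (assumption consistent with the reported p-value).
chi2, chi2_p, dof, expected = chi2_contingency(table, correction=False)
odds_ratio, fisher_p = fisher_exact(table)

print(f"Chi-square p = {chi2_p:.6f}, Fisher's exact p = {fisher_p:.6f}")
print(f"Sample odds ratio = {odds_ratio:.2f}")
```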
Out of the 92 evaluable cases, the radiologist recommended biopsy in 37, whereas Copilot recommended biopsy in 65, a higher biopsy recommendation rate. Most discordant cases were those in which Copilot recommended a biopsy and the radiologist did not (32 cases).
Diagnostic accuracy metrics for biopsy recommendations were as follows (Table 7): Sensitivity was 89.2%, indicating that Copilot correctly identified the majority of cases in which the radiologist recommended biopsy; this suggests that Copilot is good at flagging potentially necessary biopsies. Specificity was 41.8%, indicating that Copilot frequently recommended biopsy in cases where the radiologist did not, reflecting overly cautious decision-making.
The positive predictive value (PPV) was 50.8%; only about half of Copilot’s biopsy recommendations aligned with the radiologist’s judgment.
The negative predictive value (NPV) was 85.2%; when Copilot did not recommend a biopsy, it was usually in agreement with the radiologist.
The balanced accuracy was 65.5%, reflecting moderate overall performance. The high sensitivity was offset by relatively low specificity.
Cohen’s κ of 0.276 (p = 0.0014) indicated fair agreement beyond chance, consistent with Copilot’s tendency to recommend biopsy more frequently than the radiologist.
An odds ratio of 5.82 (95% CI: 1.72–25.75) indicates that the odds of Copilot recommending a biopsy were 5.82 times higher for cases the radiologist recommended for biopsy than for cases the radiologist did not, suggesting a significant association.
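A minimal sketch of how these metrics follow from the 2 × 2 counts is given below (Python). The counts are those implied by the reported results; small differences from the published values (e.g., the odds ratio estimator and its confidence interval) may reflect a different estimator or rounding.

```python
# 2 x 2 counts implied by the reported results (radiologist as reference standard).
tp = 33  # both recommended biopsy
fp = 32  # Copilot recommended, radiologist did not
fn = 4   # radiologist recommended, Copilot did not (37 - 33)
tn = 23  # neither recommended
n = tp + fp + fn + tn  # 92 evaluable cases

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa from observed vs. chance agreement.
p_observed = (tp + tn) / n
p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (p_observed - p_chance) / (1 - p_chance)

odds_ratio = (tp * tn) / (fp * fn)  # simple cross-product estimate

print(f"Sensitivity {sensitivity:.1%}, specificity {specificity:.1%}")
print(f"PPV {ppv:.1%}, NPV {npv:.1%}, balanced accuracy {balanced_accuracy:.1%}")
print(f"Cohen's kappa {kappa:.3f}, odds ratio {odds_ratio:.2f}")
```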
4. Discussion
The aim of our study was to evaluate the performance of Copilot in generating a meaningful and relevant list of differential diagnoses as well as an appropriate management protocol based on multiple factors. Copilot was successful in developing differential diagnoses in 75.2% of cases, although its ranking did not match that of the radiologist; agreement with the management protocol was observed in 94.6% of cases, again not in the desired order. The appropriate ranking of differential diagnoses and of the proposed management strategies is relevant to patient management, especially in instances where time is of the essence. In this pilot study, we attempted to evaluate broad agreement in differential diagnoses and management strategies between generative AI and human-generated data. Further studies with a larger sample size are required to test significant differences in these areas that might impact timely patient care. Copilot has been designed to offer generic differential diagnoses and management strategies, with limited consideration given to age, race, sex, history, clinical findings, and radiographic features. The top five differential diagnoses most commonly generated by Copilot were dentigerous cyst, odontogenic keratocyst, ameloblastoma, central giant cell granuloma, and radicular cyst. For example, the imaging findings provided in the radiologist’s report for a 73-year-old male patient illustrated the presence of calcifications within the lesion. The result generated by Copilot included only commonly seen cysts and benign tumors, such as dentigerous cyst, odontogenic keratocyst, and ameloblastoma, rather than entities with specific features such as calcifications. A valuable methodological improvement to this study would be the inclusion of cases with lesions displaying complex architectural features.
Some cases presented multiple isolated lesions/findings. Copilot failed to recognize and report such lesions as being of different origin. In one such case, the referring doctor sought to evaluate a lesion in the right mandible and another in the left maxillary sinus. The findings of both lesions were entered into Copilot to facilitate appropriate differential diagnosis and management generation. However, Copilot only addressed findings in the right mandible and omitted the findings in the left maxillary sinus. Another example involved a lesion in the right maxilla and a second one in the left mandible. Based on age, sex, and imaging findings, the radiologist suggested the possibility of multiple odontogenic keratocysts and recommended considering a syndromic etiology. The report generated by Copilot included dentigerous cyst, odontogenic keratocyst, ameloblastoma, and central giant cell granuloma. Copilot failed to specify whether the differential diagnosis provided was for both lesions or just one of them. Yet another case with two distinct lesions was monitored for any interval changes. Following a three-month period, it was observed that Copilot recognized both lesions as separate entities. This may be attributed to improvements or updates made to the software in the interim. This also highlights the need to include similar cases with multifocal pathoses or multiple lesion types in a future study to assess the efficacy of AI.
Generative AI has been designed to provide general information and not necessarily to customize its answers based on imaging findings. The management protocol provided by Copilot was overall sound. Yet in some instances the rationale was inconsistent, with the potential to lead to substantial ramifications [18,19,20]. Copilot suggested a biopsy in 69 of the 101 cases, often in cases where it was not required. An example is that of periapical osseous dysplasia, where biopsy is not routinely indicated but was suggested by Copilot. In instances such as this, the experience and expertise of a radiologist play a vital role, since patient history, clinical findings, vitality testing, the presence of multiple lesions with characteristic maturation patterns, and patient demographics all inform the decision. The radiologist’s familiarity with characteristic radiographic features enables them to evaluate such cases and recommend appropriate management. Copilot appeared incapable of doing so in its current version. Copilot demonstrated a tendency toward overcaution when recommending biopsies in certain cases, such as periapical osseous dysplasia. Refining this approach can help reduce the likelihood of patients undergoing unnecessary invasive procedures. Yet another example is that of a vascular lesion where, based on the radiographic features and location, the appropriate diagnosis was made by the radiologist, along with suggestions for further imaging and follow-up. Copilot was unable to include a vascular lesion in its differential diagnosis, and the suggested management protocol did not include further imaging prior to biopsy. Copilot also recommended orthodontic treatment in adolescent patients. However, the management protocols generated by Copilot for temporomandibular joint disorders were meaningful and relevant.
Limitations of the study include a relatively small sample size, use of CBCTs in the maxillofacial region only, inclusion of odontogenic and non-odontogenic pathoses, use of a single LLM model, and sourcing of the radiology reports from a single center, which may limit the generalizability of the findings. Further studies could be conducted by comparing various platforms and cases from multiple institutions.
The inconsistency in Copilot’s output highlights inherent uncertainty and emphasizes that LLMs are tools that support, but do not substitute for, a radiologist’s judgement [20]. LLMs change continuously; therefore, the outcome may vary as the tool is updated. This study indicates that Copilot is not yet ready for widespread use and requires substantial fine-tuning before it can be expected to generate relevant information that may assist the clinician in the diagnosis and management of the patient. The limitations of Copilot should be borne in mind [14,21]. Moreover, because of the limited “active memory” that results from restricted token allowances in some AI tools, loss of context could lead to hallucinations. It is therefore important to use multiple such tools and perform a comparative analysis so that such drift does not become a factor in decision-making.
This study was designed to assess agreement between the radiologist and Microsoft Copilot (MCP) based solely on the available patient history, indications for imaging, reported clinical features (if provided), and imaging findings. The objective was not to validate biopsy-based outcomes or establish definitive diagnostic pathways but to evaluate concordance in decision-making regarding differential diagnoses and subsequent treatment planning/management, including the need for histopathologic evaluation or additional complex morphologic or functional imaging. The study showed that AI is not yet ready for widespread use by clinicians as a decision-making system; rather, it has the potential to serve as an adjunctive decision support system if used appropriately, taking the limitations mentioned earlier into consideration. The findings of the study emphasize the preliminary nature of this evaluation and highlight the need for further studies to determine MCP’s true clinical utility and its impact on diagnostic workflows in oral and maxillofacial radiology.