1. Introduction
Artificial Intelligence (AI) has gained significant attention in medical literature and healthcare [1,2]. The uses of AI are wide-ranging and can enhance the effectiveness of intricate tasks [3,4]. With machine learning, AI-powered programs can generate functional code and even diagnose complicated illnesses by analyzing medical history, lab tests, radiological images, or pathological findings [5]. While achieving genuine AI outcomes remains a distant goal, progress in mathematical modeling and computational capabilities has caused a surge in algorithm publications [6,7,8]. Opinions on the medical applications of AI span a broad spectrum, ranging from assisting clinicians in decision-making to the potential of surpassing human expertise [1,9].
Radiologists are vital for interpreting medical images as they are equipped with holistic knowledge about imaging patterns and can facilitate correlation with history, clinical and demographic information, and prior images, while also drawing upon expertise and experience accumulated over several years and across similar images from multiple acquisition systems likely subjected to different post-processing algorithms [10,11]. Based on this knowledge, they can provide differential diagnoses while considering probabilities and clinical contexts, which are crucial to the systematic and holistic approach to the diagnostic process in radiology. Identifying these patterns and linking them to specific pathoses are critical stages. Referencing the literature to confirm or refute diagnoses is vital, particularly for radiologists in training, despite the process’s time-consuming and resource-intensive nature. However, uncertainty remains regarding the adoption of AI and its utility as an adjunctive decision support system for radiologists [5,12].
Large language models (LLMs) like OpenAI’s ChatGPT, Google’s Bard (now Gemini), Microsoft’s Bing (Copilot), and Perplexity AI Inc.’s Perplexity AI have gained considerable attention due to their ability to understand textual information and produce contextually relevant responses. They are trained on vast amounts of medical literature and data that could aid in offering differential diagnoses from text-based descriptions of imaging patterns [13,14]. Sarangi et al. explored the utility of four LLMs in providing differential diagnoses based on cardiovascular and thoracic imaging patterns [14]. The utilization of freely available chatbots for generating relevant differential diagnoses in medical radiology has been partially investigated. However, there is a gap in exploring text-based descriptions that incorporate factors such as age, sex, and imaging patterns to offer pertinent differential diagnoses and management strategies in the field of oral and maxillofacial radiology.
This study aims to evaluate the effectiveness of generative AI (Microsoft Copilot) in delivering meaningful differential diagnoses and management recommendations when maxillofacial cone beam computed tomography (CBCT) studies are used to develop a provisional diagnosis and management strategy. The agreement between the radiologist and Microsoft Copilot was assessed in these two areas to explore the utility of the algorithm as a decision support tool for clinicians, who increasingly use it as a workflow asset. We hypothesized that there would be little agreement between the maxillofacial radiologist and Microsoft Copilot on differential diagnoses and management protocols for maxillofacial pathoses, including the need for biopsies, and that the latter would call for more such interventional procedures owing to a lack of training on data related to such pathoses.
3. Results
3.1. Differential Diagnoses
The 101 cases evaluated to assess the agreement on differential diagnoses between the radiologist’s report and Copilot were assigned scores based on the number of matching differential diagnoses, ranging from 0 to 3. A score of 1 was given if there was one matching diagnosis, 2 for two matches, and 3 for three or more matches. If there were no matches, the assigned agreement score was zero.
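For readers who wish to reproduce the scoring, a minimal sketch of the matching rule is given below (Python). It assumes that diagnoses have already been normalized to comparable labels; in the study, matching was performed by the evaluators rather than by software.

```python
def agreement_score(radiologist_dx, copilot_dx):
    """Count shared differential diagnoses and cap the agreement score at 3.

    Both arguments are iterables of normalized diagnosis labels
    (synonym matching is assumed to have been done upstream).
    """
    matches = len(set(radiologist_dx) & set(copilot_dx))
    return min(matches, 3)


# Example: two of the radiologist's three diagnoses appear in Copilot's list.
print(agreement_score(
    ["dentigerous cyst", "odontogenic keratocyst", "ameloblastoma"],
    ["odontogenic keratocyst", "radicular cyst", "ameloblastoma"],
))  # -> 2
```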
In twenty-five cases, there was no match between the radiologist’s and Copilot’s diagnoses, resulting in a score of 0. Sixteen cases scored 1, 13 scored 2, and 47 scored 3. Among the 16 cases that scored 1, the matching diagnosis was the principal finding in 6 cases (Table 1 and Figure 3). In the remaining cases, the ranking of the differential diagnoses generated by Copilot did not match the ranking in the radiologist’s reports.
A proportional analysis with a 95% confidence interval was performed to assess agreement between the radiologist and Copilot in formulating a list of differential diagnoses (Table 2). In 75.2% of cases (95% CI: 65.6–82.9%), there was agreement on at least one (score ≥ 1) differential diagnosis between the radiologist and Copilot. Complete agreement, with three or more matching differential diagnoses, was observed in 46.5% of cases (95% CI: 36.9–56.3%).
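As an illustration, the agreement proportion and its confidence interval can be computed as sketched below; the count of 76 follows from the 101 cases minus the 25 with no matching diagnosis. This is a minimal sketch assuming a standard binomial interval (Wilson, via statsmodels); the exact interval method used in the study may differ slightly.

```python
from statsmodels.stats.proportion import proportion_confint

n_cases = 101
n_agree = 76  # cases with at least one matching differential diagnosis (101 - 25)

proportion = n_agree / n_cases
low, high = proportion_confint(n_agree, n_cases, alpha=0.05, method="wilson")
print(f"Agreement: {proportion:.1%} (95% CI: {low:.1%}-{high:.1%})")
```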
The Jaccard similarity metric was computed to evaluate the degree of overlap between the differential diagnoses provided by the radiologist and by Copilot. The mean similarity score was 0.604 (60.4%), indicating that, on average, 60.4% of the combined set of differential diagnoses suggested by either source was shared by both.
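A minimal sketch of the per-case Jaccard computation is shown below (Python); it assumes each case’s differential diagnoses are represented as sets of normalized labels, with the mean taken across cases.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity: |A intersect B| / |A union B| (0.0 if both sets are empty)."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# One (radiologist, Copilot) pair of diagnosis sets per case; labels are illustrative.
cases = [
    ({"dentigerous cyst", "odontogenic keratocyst"},
     {"odontogenic keratocyst", "ameloblastoma"}),
    # ... remaining cases ...
]
mean_similarity = sum(jaccard(r, c) for r, c in cases) / len(cases)
print(f"Mean Jaccard similarity: {mean_similarity:.3f}")
```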
Table 3 indicates the summary of baseline characteristics of the patients.
Table 4 lists the various differential diagnosis recommendations made by the radiologist and by Copilot. Overall, although Copilot detected more cases of certain lesions, several lesions identified by the radiologist were missed by Copilot. This indicates that Copilot did not capture all details noted by the radiologist, reflecting incomplete overlap between the two evaluations. The Jaccard similarity, which measures the overlap between the radiologist’s and Copilot’s selections for each lesion, varied widely, reflecting differing levels of concordance.
High Agreement (Jaccard ≥ 0.67):
Lesions/categories such as nasopalatine duct cyst, AV malformation, ossifying fibroma, dense bone island, TMJ abnormalities, and sialoliths show strong concordance between Copilot and radiologist annotations. These results suggest that both Copilot and radiologists reliably identify common and well-defined lesions.
Moderate Agreement (Jaccard 0.40–0.66):
Lesions such as odontogenic keratocyst, odontoma, osseous dysplasia, chronic mucositis in the sinuses, and foreign bodies exhibit moderate agreement. Copilot tends to detect more cases with these lesions than radiologists, resulting in partial overlap.
Low Agreement (Jaccard < 0.40):
Lesions/categories such as radicular cyst, dentigerous cyst, ameloblastoma, central giant cell granuloma, and malignancies show low concordance. Differences reflect challenges in identifying rare tumors or subtle cystic lesions, highlighting areas for Copilot’s improvement.
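The three agreement bands above correspond to simple thresholds applied to each lesion category’s Jaccard value; a minimal sketch of that binning (thresholds taken from the band labels, with boundary handling assumed) is:

```python
def agreement_band(jaccard_value):
    """Bin a per-lesion Jaccard value into the bands described above."""
    if jaccard_value >= 0.67:
        return "high"       # Jaccard >= 0.67
    if jaccard_value >= 0.40:
        return "moderate"   # Jaccard 0.40-0.66
    return "low"            # Jaccard < 0.40


print(agreement_band(0.72))  # -> "high"
print(agreement_band(0.55))  # -> "moderate"
print(agreement_band(0.25))  # -> "low"
```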
3.2. Management
A total of 101 cases were evaluated to determine the agreement in management strategies between the radiologist’s report and Copilot-generated data. Each case was assigned a score based on the level of agreement in management recommendations. The scores ranged from 0 to 3, with 0 indicating no agreement, 1 indicating one matching management strategy, 2 indicating two matches, and 3 indicating three or more matches. The radiologist’s ranking of management strategies did not match Copilot’s ranking. Most cases received a score of 2, with 40 cases in this category, followed by 25 cases with a score of 3 and 22 cases with a score of 1 (Table 5 and Figure 4). Of the 101 cases, 9 were not scored because the radiologist deemed no intervention necessary. However, Copilot provided management recommendations for all cases.
A proportional analysis with a 95% confidence interval was performed to assess agreement between the radiologist and Copilot in management protocols (Table 6). Copilot matched at least one (score ≥ 1) management recommendation provided by the radiologist in 94.6% of cases (95% CI: 87.5–98.0%). Complete agreement (score = 3) was observed in 27.2% of cases (95% CI: 18.5–37.5%).
The Jaccard similarity metric showed a mean similarity score of 0.641 (64.1%), indicating that, on average, 64.1% of the combined set of management strategies suggested by either source was shared by both.
3.3. Biopsy
Biopsy recommendations were evaluated to determine whether Copilot could accurately identify when a biopsy was needed. The radiologist recommended biopsies in 37 cases. Nine cases were not scored because no further management recommendations were deemed necessary by the radiologist. By contrast, Copilot recommended biopsies in 69 cases. The radiologist and Copilot concurred on recommending a biopsy in 33 cases (35.9%), while neither recommended one in 23 cases (25%).
Among the 101 cases, 9 were not scored as there were no management recommendations, leaving 92 cases for the biopsy recommendation analysis. Chi-square and Fisher’s exact tests were performed (Chi-square p = 0.001362; Fisher’s exact p = 0.002038). These tests confirm that Copilot’s recommendations are not independent of the radiologist’s decisions (p < 0.01).
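These tests can be approximated from the 2 × 2 contingency table implied by the counts reported in this section (33 concordant biopsy recommendations, 23 concordant non-recommendations, 32 Copilot-only recommendations, and 4 radiologist-only recommendations, i.e., 37 − 33). The sketch below uses scipy; the software and continuity-correction settings actually used in the study are assumptions.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Rows: Copilot biopsy yes/no; columns: radiologist biopsy yes/no.
table = [[33, 32],
         [4, 23]]

# No Yates continuity correction (assumption consistent with the reported p-value).
chi2, chi2_p, dof, expected = chi2_contingency(table, correction=False)
odds_ratio, fisher_p = fisher_exact(table)

print(f"Chi-square p = {chi2_p:.6f}, Fisher's exact p = {fisher_p:.6f}")
print(f"Sample odds ratio = {odds_ratio:.2f}")
```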
Out of the 92 evaluable cases, the radiologist recommended biopsy in 37, whereas Copilot recommended biopsy in 65, a higher biopsy recommendation rate. Most discordant cases were those in which Copilot recommended a biopsy and the radiologist did not (32 cases).
Diagnostic accuracy metrics for biopsy recommendations were as follows (Table 7): Sensitivity was 89.2%, indicating that Copilot correctly identified the majority of cases in which the radiologist recommended biopsy; this suggests that Copilot is good at flagging potentially necessary biopsies. Specificity was 41.8%, indicating that Copilot frequently recommended biopsy in cases where the radiologist did not, reflecting overly cautious decision-making.
The positive predictive value (PPV) was 50.8%; only about half of Copilot’s biopsy recommendations aligned with the radiologist’s judgment.
The negative predictive value (NPV) was 85.2%; when Copilot did not recommend a biopsy, it was usually in agreement with the radiologist.
The balanced accuracy was 65.5%, reflecting moderate overall performance. The high sensitivity was offset by relatively low specificity.
Cohen’s κ of 0.276 (p = 0.0014) indicated fair agreement beyond chance, consistent with Copilot’s tendency to recommend biopsy more frequently than the radiologist.
An odds ratio of 5.82 (95% CI: 1.72–25.75) indicates that the odds of Copilot recommending a biopsy were 5.82 times higher for cases the radiologist recommended for biopsy than for cases the radiologist did not, suggesting a significant association.
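A minimal sketch of how these metrics follow from the 2 × 2 counts is given below (Python). The counts are those implied by the reported results; small differences from the published values (e.g., the odds ratio estimator and its confidence interval) may reflect a different estimator or rounding.

```python
# 2 x 2 counts implied by the reported results (radiologist as reference standard).
tp = 33  # both recommended biopsy
fp = 32  # Copilot recommended, radiologist did not
fn = 4   # radiologist recommended, Copilot did not (37 - 33)
tn = 23  # neither recommended
n = tp + fp + fn + tn  # 92 evaluable cases

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa from observed vs. chance agreement.
p_observed = (tp + tn) / n
p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (p_observed - p_chance) / (1 - p_chance)

odds_ratio = (tp * tn) / (fp * fn)  # simple cross-product estimate

print(f"Sensitivity {sensitivity:.1%}, specificity {specificity:.1%}")
print(f"PPV {ppv:.1%}, NPV {npv:.1%}, balanced accuracy {balanced_accuracy:.1%}")
print(f"Cohen's kappa {kappa:.3f}, odds ratio {odds_ratio:.2f}")
```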
4. Discussion
The aim of our study was to evaluate the performance of Copilot in generating a meaningful and relevant list of differential diagnoses as well as an appropriate management protocol based on multiple factors. Copilot was successful in developing differential diagnoses in 75.2% of cases, although its ranking did not match that of the radiologist; agreement with the management protocol was observed in 94.6% of cases, again not in the desired order. The appropriate ranking of differential diagnoses and of the proposed management strategies is relevant to patient management, especially in instances where time is of the essence. In this pilot study, we attempted to evaluate broad agreement in differential diagnoses and management strategies between generative AI and human-generated data. Further studies with a larger sample size are required to test significant differences in these areas that might impact timely patient care. Copilot has been designed to offer generic differential diagnoses and management strategies, with limited consideration given to age, race, sex, history, clinical findings, and radiographic features. The top five differential diagnoses most commonly generated by Copilot were dentigerous cyst, odontogenic keratocyst, ameloblastoma, central giant cell granuloma, and radicular cyst. For example, the imaging findings provided in the radiologist’s report for a 73-year-old male patient illustrated the presence of calcifications within the lesion. The result generated by Copilot included only commonly seen cysts and benign tumors, such as dentigerous cyst, odontogenic keratocyst, and ameloblastoma, rather than entities with specific features such as calcifications. A valuable methodological improvement to this study would be the inclusion of cases with lesions displaying complex architectural features.
Some cases presented multiple isolated lesions/findings. Copilot failed to recognize and report such lesions as being of different origin. In one such case, the referring doctor sought to evaluate a lesion in the right mandible and another in the left maxillary sinus. The findings of both lesions were entered into Copilot to facilitate appropriate differential diagnosis and management generation. However, Copilot only addressed findings in the right mandible and omitted the findings in the left maxillary sinus. Another example involved a lesion in the right maxilla and a second one in the left mandible. Based on age, sex, and imaging findings, the radiologist suggested the possibility of multiple odontogenic keratocysts and recommended considering a syndromic etiology. The report generated by Copilot included dentigerous cyst, odontogenic keratocyst, ameloblastoma, and central giant cell granuloma. Copilot failed to specify whether the differential diagnosis provided was for both lesions or just one of them. Yet another case with two distinct lesions was monitored for any interval changes. Following a three-month period, it was observed that Copilot recognized both lesions as separate entities. This may be attributed to improvements or updates made to the software in the interim. This also highlights the need to include similar cases with multifocal pathoses or multiple lesion types in a future study to assess the efficacy of AI.
Generative AI has been designed to provide general information and not necessarily to customize its answers based on imaging findings. The management protocol provided by Copilot was overall sound. Yet in some instances the rationale was inconsistent, with the potential to lead to substantial ramifications [18,19,20]. Copilot suggested a biopsy in 69 of the 101 cases, often in cases where it was not required. An example is that of periapical osseous dysplasia, where biopsy is not routinely indicated but was suggested by Copilot. In instances such as this, the experience and expertise of a radiologist play a vital role, since patient history, clinical findings, vitality testing, the presence of multiple lesions with characteristic maturation patterns, and patient demographics all inform the decision. The radiologist’s familiarity with characteristic radiographic features enables them to evaluate such cases and recommend appropriate management. Copilot appeared incapable of doing so in its current version. Copilot demonstrated a tendency toward overcaution when recommending biopsies in certain cases, such as periapical osseous dysplasia. Refining this approach can help reduce the likelihood of patients undergoing unnecessary invasive procedures. Yet another example is that of a vascular lesion where, based on the radiographic features and location, the appropriate diagnosis was made by the radiologist, along with suggestions for further imaging and follow-up. Copilot was unable to include a vascular lesion in its differential diagnosis, and the suggested management protocol did not include further imaging prior to biopsy. Copilot also recommended orthodontic treatment in adolescent patients. However, the management protocols generated by Copilot for temporomandibular joint disorders were meaningful and relevant.
Limitations of the study include a relatively small sample size, use of CBCTs in the maxillofacial region only, inclusion of odontogenic and non-odontogenic pathoses, use of a single LLM model, and sourcing of the radiology reports from a single center, which may limit the generalizability of the findings. Further studies could be conducted by comparing various platforms and cases from multiple institutions.
The inconsistency in Copilot’s output highlights inherent uncertainty and emphasizes that LLMs are tools that support, but do not substitute for, a radiologist’s judgement [20]. LLMs change continuously; therefore, the outcome may vary as the tool is updated. This study indicates that Copilot is not yet ready for widespread use and requires substantial fine-tuning before it can be expected to generate relevant information that may assist the clinician in the diagnosis and management of the patient. The limitations of Copilot should be borne in mind [14,21]. Moreover, because of the limited “active memory” that results from restricted token allowances in some AI tools, loss of context could lead to hallucinations. It is therefore important to use multiple such tools and perform a comparative analysis so that such drift does not become a factor in decision-making.
This study was designed to assess agreement between the radiologist and Microsoft Copilot (MCP) based solely on the available patient history, indications for imaging, reported clinical features (if provided), and imaging findings. The objective was not to validate biopsy-based outcomes or establish definitive diagnostic pathways but to evaluate concordance in decision-making regarding differential diagnoses and subsequent treatment planning/management, including the need for histopathologic evaluation or additional complex morphologic or functional imaging. The study showed that AI is not yet ready for widespread use by clinicians as a decision-making system; rather, it has the potential to serve as an adjunctive decision support system if used appropriately, taking the limitations mentioned earlier into consideration. The findings of the study emphasize the preliminary nature of this evaluation and highlight the need for further studies to determine MCP’s true clinical utility and its impact on diagnostic workflows in oral and maxillofacial radiology.