Article

Assessment of ChatGPT in Recommending Immunohistochemistry Panels for Salivary Gland Tumors

by
Maria Cuevas-Nunez
1,2,*,
Cosimo Galletti
3,
Luca Fiorillo
4,5,
Aida Meto
4,5,6,*,
Wilmer Rodrigo Díaz-Castañeda
1,
Shokoufeh Shahrabi Farahani
7,
Guido Fadda
8,
Valeria Zuccalà
8,
Victor Gil Manich
1,
Javier Bara-Casaus
9 and
Maria-Teresa Fernández-Figueras
1,2
1
Faculty of Dentistry, Universitat Internacional de Catalunya, 08195 Barcelona, Spain
2
Hospital Universitari General de Catalunya, 08210 Barcelona, Spain
3
Faculty of Medicine and Surgery, University of Enna, 94100 Enna, Italy
4
Department of Dental Research Cell, Dr. D. Y. Patil Dental College & Hospital, Dr. D. Y. Patil Vidyapeeth, Pimpri, Pune 411018, India
5
Department of Dentistry, Faculty of Dental Sciences, Aldent University, 1007 Tirana, Albania
6
Department of Surgery, Medicine, Dentistry and Morphological Sciences with Interest in Transplant, Oncology and Regenerative Medicine, University of Modena and Reggio Emilia, 41125 Modena, Italy
7
Division of Oral and Maxillofacial Pathology, University of Tennessee Health Science Center, Memphis, TN 38163, USA
8
Department of Adult and Developmental Human Pathology “Gaetano Barresi”, University of Messina, 98100 Messina, Italy
9
Hospital Universitario MútuaTerrassa, 08221 Barcelona, Spain
*
Authors to whom correspondence should be addressed.
BioMedInformatics 2025, 5(4), 66; https://doi.org/10.3390/biomedinformatics5040066
Submission received: 11 October 2025 / Revised: 6 November 2025 / Accepted: 18 November 2025 / Published: 26 November 2025
(This article belongs to the Special Issue The Application of Large Language Models in Clinical Practice)

Abstract

Background: Salivary gland tumors pose a diagnostic challenge due to their histological heterogeneity and overlapping features. While immunohistochemistry (IHC) is critical for accurate classification, selecting appropriate markers can be subjective and influenced by resource availability. Artificial intelligence (AI), particularly large language models (LLMs), may support diagnostic decisions by recommending IHC panels. This study evaluated the performance of ChatGPT-4, a free and widely accessible general-purpose LLM, in recommending IHC markers for salivary gland tumors. Methods: ChatGPT-4 was prompted to generate IHC recommendations for 21 types of salivary gland tumors. A consensus of expert pathologists established reference panels. Each tumor type was queried using a standardized prompt designed to elicit IHC marker recommendations (“What IHC markers are recommended to confirm a diagnosis of [tumor type]?”). Outputs were assessed using a structured scoring rubric measuring accuracy, completeness, and relevance. Agreement was measured using Cohen’s Kappa, and diagnostic performance was evaluated via sensitivity, specificity, and F1-scores. Repeated-measures ANOVA and Bland–Altman analysis assessed consistency across three prompts. Results were compared to a rule-based system aligned with expert protocols. Results: ChatGPT-4 demonstrated moderate overall agreement with the pathologist panel (κ = 0.53). Agreement was higher for benign tumors (κ = 0.67) than for malignant ones (κ = 0.40), with pleomorphic adenoma showing the strongest concordance (κ = 0.74). Sensitivity values across tumor types ranged from 0.25 to 0.96; benign tumors showed higher sensitivity (>0.80), whereas lower specificity (<0.50) was observed in complex malignancies. The overall F1-score was 0.84 for benign and 0.63 for malignant tumors. Repeated prompts produced moderate variability without significant differences (p > 0.05). Compared with the rule-based system, ChatGPT included more incorrect and missed markers, indicating lower diagnostic precision. Conclusions: ChatGPT-4 shows promise as a low-cost tool for IHC panel selection but currently lacks the precision and consistency required for clinical application. Further refinement is needed before integration into diagnostic workflows.

Graphical Abstract

1. Introduction

Salivary gland tumors are a heterogeneous group of neoplasms, accounting for approximately 3–10% of head and neck tumors [1]. Their significant histological diversity and overlapping morphological features present a complex diagnostic challenge, particularly in distinguishing between closely related entities [2]. Accurate classification is essential for effective patient management, as diagnostic errors can lead to inappropriate treatment strategies and adverse clinical outcomes [3,4].
Immunohistochemistry (IHC) remains a cornerstone of the diagnostic workflow for salivary gland tumors, which are classified under the 2022 WHO framework into distinct benign and malignant entities. This classification provides the structural basis for assessing tumor-specific IHC profiles. In addition, IHC offers molecular-level insights that aid in tumor classification, origin determination, and differentiation; it serves as a cost-effective and reliable adjunct to routine histopathology and continues to play a vital role in modern pathology [5,6]. Although the literature outlines well-established IHC markers for specific tumor types, pathologists often prioritize marker selection based on real-world constraints such as reagent availability, institutional protocols, and economic considerations, introducing variability into diagnostic practice [7,8].
As such, artificial intelligence (AI) tools, including large language models (LLMs), have emerged as potential aids in navigating these complex decisions. In pathology, AI has shown promise in automating routine tasks, reducing diagnostic variability, and enhancing efficiency, particularly within digital workflows [9,10,11]. LLMs in particular have been explored for structuring reports, extracting key clinical data, supporting education, and even generating differential diagnoses [12,13,14]. With continued advancement, their integration with computer vision may enable multimodal support for pathology diagnostics [15,16]. In fact, in pathology, AI has increasingly transitioned from experimental models to validated clinical tools, reflecting an improvement of digital diagnostic systems [17]. Recent systematic reviews confirm that AI systems in digital pathology can achieve high diagnostic accuracy when appropriately validated, though heterogeneity in study designs remains a challenge [18].
Although several commercial platforms already assist in IHC panel selection (e.g., ImmunoQuery and ExpertPath), these tools often draw from curated databases and institutional case series to recommend stains based on differential inputs. However, most of these are subscription-based, require institutional access, and operate using structured query models that lack conversational flexibility or real-time contextual reasoning. Similarly, pathology-specific AI platforms like PathChat offer promising multimodal capabilities, but are not freely available and are limited in deployment to specialized settings [19].
In contrast, ChatGPT is free, widely accessible, and increasingly used by clinicians, researchers, and educators across disciplines. It requires no installation, no institutional access, and can be used in low-resource settings, making it an attractive candidate for testing AI-assisted decision support at scale. While ChatGPT is not purpose-built for pathology and lacks image-processing capabilities, its versatility and accessibility offer unique value as a baseline tool.
In this preliminary, hypothesis-generating study, we assessed the performance and limitations of a general-purpose free LLM-powered chatbot (ChatGPT-4) in recommending IHC markers for a group of benign and malignant salivary gland tumors based on the 2022 WHO framework. Rather than presenting this tool as a definitive diagnostic aid, we aimed to explore whether a text-only LLM could provide useful support in selecting stains for diagnostically challenging cases, particularly in scenarios where access to proprietary platforms or multimodal models may be limited [20]. Our findings may help inform future research into the democratization of AI tools in pathology and guide the development of more integrated, context-aware diagnostic support systems.
The following sections describe the materials and methods used to develop the reference panels and evaluate ChatGPT-4, followed by results on agreement, diagnostic accuracy, and stability, and a discussion of key findings, limitations, and implications for future research.

2. Materials and Methods

2.1. Tumor Types and Selection Criteria

This study did not involve human subjects or patient data, and therefore did not require institutional review board (IRB) approval. The free version of ChatGPT-4 was used solely as a tool for evaluating AI-assisted diagnostic recommendations and was not deployed in clinical practice. The study was conducted with a focus on transparency and unbiased reporting, ensuring a rigorous evaluation of ChatGPT’s recommendations.
Salivary gland tumors included in this study were selected to represent a broad spectrum of benign and malignant neoplasms, ensuring a comprehensive assessment of ChatGPT’s ability to recommend immunohistochemistry (IHC) panels. Tumor types were chosen based on their diagnostic importance, prevalence in histopathological literature, and the presence of distinct or overlapping histopathologic features requiring IHC confirmation. The classification followed the World Health Organization Classification of Head and Neck Tumors, 5th Edition (2022) [21] (Table 1).

2.2. Pathologist Consensus and Scoring Criteria

A panel of experienced oral and head and neck pathologists participated in establishing the reference IHC panels against which ChatGPT’s recommendations were evaluated. The panel consisted of specialists from multiple institutions, each with expertise in salivary gland pathology. Consensus among the pathologists was achieved through the following structured process:
  • Literature Review and Evidence-Based Marker Selection:
    Each pathologist independently reviewed current clinical guidelines, consensus reports, and peer-reviewed studies relevant to IHC marker selection for salivary gland tumors.
    A comprehensive list of essential and ancillary markers was compiled based on documented diagnostic utility.
  • Delphi Method for Consensus:
    A modified Delphi approach was used to refine the reference IHC panels.
    Pathologists participated in three rounds of blinded reviews in which they rated the importance of each marker for specific tumor types. Discrepancies between reviewers were discussed collectively, and markers were re-evaluated in subsequent rounds until consensus was achieved. Final agreement required at least 85% concordance among reviewers for each tumor type.
  • Final Validation and Benchmarking:
    The finalized IHC panels were cross-validated against real-world pathology case datasets.
    Benchmarking was conducted by comparing the finalized panels to existing guideline-based recommendations from WHO, the Armed Forces Institute of Pathology (AFIP), and other pathology authorities.
    The validated reference panels served as the gold standard for evaluating ChatGPT’s IHC marker recommendations.

2.3. Chatbot Evaluation Framework

The evaluation was performed using the free, web-based text-only version of ChatGPT-4 (OpenAI, San Francisco, CA, USA; accessed via chat.openai.com). All interactions took place between October and November 2024, under default system conditions. This version corresponds to the standard ChatGPT-4 model publicly available at that time and does not include multimodal (GPT-4o) functionality. As parameter tuning (e.g., temperature, top-p, top-k) is not possible through the public interface, all responses were generated under OpenAI’s default conversational settings, representing typical user conditions. A standardized prompting protocol was used to ensure consistency and reproducibility. While ChatGPT operates as an interface built on underlying large language models, this study intentionally used the web-based public interface to reflect the real-world context in which clinicians and trainees typically access general-purpose AI tools.
In addition, each prompt was repeated three times, using the same phrasing and entered by the same individual, to evaluate consistency and response variability. Each query was entered in a new chat session, and conversation history was cleared between prompts to prevent contextual carry-over or memory effects. A single standardized prompt template was used across all tumor types, with only the tumor name substituted, ensuring consistency and reproducibility. The wording of this prompt is provided below for transparency: “What immunohistochemical (IHC) markers are recommended to confirm the diagnosis of [Tumor Type] of the salivary gland?” Prompt clarity and content validity were reviewed by two board-certified pathologists before data collection to ensure that wording was unambiguous, clinically relevant, and reproducible across tumor types.
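For reproducibility, the sketch below shows how the standardized prompt set could be assembled and logged programmatically. It is a hypothetical helper (the tumor list is truncated and the loop only prints the prompts); in the study, each prompt was entered manually into a fresh ChatGPT web session.

```python
# Hypothetical helper: generate the standardized prompt for each tumor type and
# each of the three repetitions. Prompts were entered manually into the ChatGPT
# web interface; this script only illustrates the template and logging structure.
TUMOR_TYPES = [
    "Pleomorphic adenoma",
    "Warthin tumor",
    "Mucoepidermoid carcinoma",
    # ... remaining WHO 2022 entities included in the study
]

PROMPT_TEMPLATE = (
    "What immunohistochemical (IHC) markers are recommended to confirm "
    "the diagnosis of {tumor} of the salivary gland?"
)

N_REPETITIONS = 3  # each prompt repeated in a new chat session

for tumor in TUMOR_TYPES:
    for rep in range(1, N_REPETITIONS + 1):
        prompt = PROMPT_TEMPLATE.format(tumor=tumor)
        # Record the exact prompt, repetition number, and (after manual entry)
        # the markers returned by ChatGPT for downstream scoring.
        print(f"[{tumor} | Prompt {rep}] {prompt}")
```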

2.4. Data Collection

Responses from ChatGPT-4 were recorded and compared to the pathologist reference panel. Each response was independently graded by two pathologists to ensure consistency in scoring. Data fields included the following:
  • Chatbot Response (IHC Markers): Markers recommended for each tumor type or pair.
  • Pathologist Reference Panel: The finalized benchmark IHC panel was established by consensus among the participating pathologists.
  • Scoring System: Responses were evaluated using three component scores (accuracy, completeness, relevance), from which composite and consistency measures were derived:
    • Accuracy Score: Inclusion of essential primary markers (1–3 scale);
    • Completeness Score: Inclusion of secondary markers (1–3 scale);
    • Relevance Score: Exclusion of unnecessary markers (1–3 scale);
    • Composite Score: Sum of the three scores (range 3–9);
    • Consistency Score: Assessed across repetitions (High, Moderate, Low variability).

2.5. Scoring Criteria

Each chatbot response was scored based on accuracy, completeness, and relevance. Scores were assigned on a scale from 1 to 3 as follows:
  • Accuracy: Evaluates whether the chatbot included essential primary markers.
    3 (High Accuracy): All primary markers are present.
    2 (Moderate Accuracy): Most primary markers are present, but one essential marker is missing.
    1 (Low Accuracy): Two or more essential markers are missing.
  • Completeness: Checks if the chatbot included necessary secondary markers.
    3 (High Completeness): Both primary and secondary markers are included.
    2 (Moderate Completeness): Only primary markers are included.
    1 (Low Completeness): Secondary markers are missing.
  • Relevance: Assesses whether unnecessary markers are recommended.
    3 (High Relevance): Only essential markers are recommended.
    2 (Moderate Relevance): One irrelevant marker is included.
    1 (Low Relevance): Two or more irrelevant markers are included.
  • Composite Score:
    Calculated as the sum of accuracy, completeness, and relevance scores, yielding a possible range from 3 to 9 for each repetition.
The composite score, calculated as the sum of accuracy, completeness, and relevance (range: 3–9), was unweighted in this study to ensure an equal contribution of all three evaluation criteria. The decision to use an unweighted scoring system was primarily based on two factors: methodological consistency with prior AI assessment studies [22,23,24] and practical interpretability.
Scores were calculated at the panel level for each tumor type and averaged across the three prompt repetitions. In addition, marker-level analyses were performed separately by comparing the inclusion or exclusion of each IHC marker against the pathologist reference panel. Aggregated results were reported per tumor type and as mean values for benign, malignant, and overall tumor groups.
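As an illustration of the panel-level scoring described above, the following sketch (hypothetical column names and example values, using the pandas library listed in Section 2.7) sums the three rubric components into the unweighted composite score and averages it across the three prompt repetitions.

```python
import pandas as pd

# Hypothetical scoring table: one row per tumor type and prompt repetition,
# with the three rubric components scored 1-3 by the grading pathologists.
scores = pd.DataFrame({
    "tumor":        ["Pleomorphic adenoma"] * 3 + ["Squamous cell carcinoma"] * 3,
    "prompt":       [1, 2, 3, 1, 2, 3],
    "accuracy":     [3, 3, 3, 2, 1, 2],
    "completeness": [3, 3, 2, 2, 2, 1],
    "relevance":    [3, 2, 3, 1, 1, 2],
})

# Composite score per repetition (unweighted sum, range 3-9).
scores["composite"] = scores[["accuracy", "completeness", "relevance"]].sum(axis=1)

# Panel-level summary: mean composite score per tumor type across the three prompts.
panel_summary = scores.groupby("tumor")["composite"].agg(["mean", "std"])
print(panel_summary)
```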

2.6. Definition of a Rule-Based System and ChatGPT Prompt Framework

The rule-based system followed fixed IHC marker panels derived from the pathologist’s consensus (gold standard). It was deterministic, producing the same set of markers for each tumor type based on expert-defined inclusion or exclusion criteria. In contrast, the ChatGPT framework used a standardized natural language prompt, “What immunohistochemical (IHC) markers are recommended to confirm a diagnosis of [Tumor Type]?” to generate recommendations. Outputs were probabilistic and could vary between prompts.
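The contrast between the two frameworks can be sketched as follows. The marker panels shown are illustrative placeholders, not the study's consensus panels: the rule-based system is a deterministic lookup, whereas the ChatGPT framework sends a natural-language prompt whose answer may vary between sessions.

```python
# Minimal sketch contrasting the two systems (illustrative marker panels only).
# The rule-based system is a deterministic lookup into pathologist-approved panels,
# so repeated queries always return the same marker set; the LLM output is probabilistic.
REFERENCE_PANELS = {
    # Hypothetical example entries; the study used the full expert consensus panels.
    "Pleomorphic adenoma": {"PLAG1", "CK7", "p63", "S100"},
    "Adenoid cystic carcinoma": {"MYB", "CD117", "p63", "CK7"},
}

def rule_based_panel(tumor_type: str) -> set[str]:
    """Return the fixed, expert-defined IHC panel for a tumor type."""
    return REFERENCE_PANELS[tumor_type]

def chatgpt_prompt(tumor_type: str) -> str:
    """Build the standardized natural-language prompt sent to ChatGPT."""
    return (f"What immunohistochemical (IHC) markers are recommended to confirm "
            f"a diagnosis of {tumor_type}?")

print(rule_based_panel("Pleomorphic adenoma"))   # always the same set
print(chatgpt_prompt("Pleomorphic adenoma"))     # the model's answer varies between sessions
```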

2.7. Statistical Analyses

All agreement and diagnostic performance metrics were computed at the marker level, whereas composite and consistency scores were analyzed at the panel level. Multiple statistical analyses were performed to assess agreement between ChatGPT and the pathologist panel (gold standard), diagnostic accuracy, variability, and systematic bias.
Cohen’s Kappa analysis was used to quantify agreement between chatbot recommendations and the pathologist reference panel; Kappa was calculated for each repetition by comparing chatbot recommendations (presence or absence of each marker) with pathologist recommendations. Kappa scores were also computed for each tumor type to identify tumors with the highest and lowest agreement. The interpretation was as follows: Almost Perfect Agreement (0.81–1.00); Substantial Agreement (0.61–0.80); Moderate Agreement (0.41–0.60); Fair Agreement (0.21–0.40); Slight Agreement (0.01–0.20); and Poor Agreement (<0).
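A minimal sketch of the marker-level Kappa calculation is shown below, assuming binary inclusion vectors over a shared candidate marker list (the marker names and values are hypothetical) and the scikit-learn implementation cited in Section 2.7.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical marker-level vectors for one tumor type and one repetition:
# for each candidate marker, 1 = recommended / included, 0 = not recommended.
markers      = ["CK7", "p63", "S100", "PLAG1", "Ki-67", "GFAP", "CD117", "SOX10"]
pathologists = [1, 1, 1, 1, 0, 1, 0, 1]   # reference panel (gold standard)
chatgpt      = [1, 1, 1, 0, 1, 1, 0, 1]   # markers returned by ChatGPT

kappa = cohen_kappa_score(pathologists, chatgpt)
print(f"Cohen's kappa for this tumor/prompt: {kappa:.2f}")
```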
A paired t-test was performed to determine whether benign and malignant tumors exhibited statistically significant differences in agreement.
Sensitivity and specificity analyses were then used to measure the chatbot’s ability to correctly recommend essential markers and exclude irrelevant ones.
Sensitivity: Measures correct identification of essential markers.
\[ \text{Sensitivity} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} \]
Specificity: Measures the correct exclusion of non-essential markers.
\[ \text{Specificity} = \frac{\text{True Negatives (TN)}}{\text{True Negatives (TN)} + \text{False Positives (FP)}} \]
Accuracy: Overall correctness of recommendations.
\[ \text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Samples}} \]
To account for both precision and recall, the F1-score was calculated as follows:
\[ F1 = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \]
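The sketch below illustrates how these marker-level metrics follow from a single confusion matrix computed over binary inclusion vectors; the vectors are hypothetical examples, and scikit-learn is used only to cross-check the manual F1 calculation.

```python
# Minimal sketch of the marker-level metrics, computed from binary inclusion vectors
# (1 = marker recommended, 0 = not recommended); the example values are hypothetical.
from sklearn.metrics import confusion_matrix, f1_score

reference = [1, 1, 1, 1, 0, 1, 0, 0]   # pathologist reference panel
predicted = [1, 1, 1, 0, 1, 1, 0, 0]   # ChatGPT recommendation

tn, fp, fn, tp = confusion_matrix(reference, predicted, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)            # essential markers correctly included
specificity = tn / (tn + fp)            # non-essential markers correctly excluded
accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
f1          = 2 * precision * sensitivity / (precision + sensitivity)

assert abs(f1 - f1_score(reference, predicted)) < 1e-9  # matches scikit-learn's F1
print(sensitivity, specificity, accuracy, f1)
```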
Composite score and stability analyses were used to assess whether ChatGPT’s performance remained consistent across multiple prompts. The composite score (range: 3–9) was calculated for each tumor type across three prompts. Repeated-measures ANOVA was conducted to analyze score stability across prompts, and Bonferroni correction was applied for multiple comparisons (adjusted α = 0.0167).
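A minimal sketch of this stability analysis is given below, using statsmodels for the repeated-measures ANOVA and SciPy for the post-hoc paired t-tests; neither package is listed in Section 2.7, so their use here is an assumption, and the composite scores shown are hypothetical.

```python
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM  # assumed; not listed in the authors' software stack

# Long-format table of composite scores: one row per tumor type and prompt (hypothetical values).
df = pd.DataFrame({
    "tumor":     ["T1", "T1", "T1", "T2", "T2", "T2", "T3", "T3", "T3"],
    "prompt":    [1, 2, 3] * 3,
    "composite": [8, 8, 7, 6, 5, 6, 9, 9, 9],
})

# Repeated-measures ANOVA: does the prompt number shift composite scores?
rm = AnovaRM(data=df, depvar="composite", subject="tumor", within=["prompt"]).fit()
print(rm.anova_table)

# Post-hoc paired t-tests with Bonferroni correction (alpha = 0.05 / 3 = 0.0167).
wide = df.pivot(index="tumor", columns="prompt", values="composite")
for a, b in [(1, 2), (2, 3), (1, 3)]:
    t, p = stats.ttest_rel(wide[a], wide[b])
    print(f"P{a} vs P{b}: t = {t:.2f}, p = {p:.3f} (significant if p < 0.0167)")
```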
In addition, a Bland–Altman analysis was performed to evaluate the level of variation between ChatGPT’s IHC marker recommendations across different prompts. This analysis aimed to identify systematic biases and inconsistencies in the chatbot’s responses. Mean composite scores were computed for each tumor type to assess ChatGPT’s average performance. Score differences between prompts were calculated to assess variability: Prompt 1 vs. Prompt 2 (Diff P1-P2), Prompt 2 vs. Prompt 3 (Diff P2-P3), and Prompt 1 vs. Prompt 3 (Diff P1-P3). The mean difference (bias) was calculated across prompts, and limits of agreement (LoA) were determined as LoA Upper = Mean Diff + 1.96 × SD and LoA Lower = Mean Diff − 1.96 × SD. A Bland–Altman plot was generated to visualize the spread of score differences relative to the mean composite score.
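The Bland–Altman computation can be sketched as follows with NumPy and matplotlib; the composite scores are hypothetical placeholders, and the bias and limits of agreement are derived exactly as defined above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical composite scores for Prompt 1 and Prompt 3 (one value per tumor type).
p1 = np.array([8, 6, 9, 5, 7, 8, 4, 9, 6, 7])
p3 = np.array([7, 6, 9, 6, 8, 8, 3, 9, 5, 7])

mean_scores = (p1 + p3) / 2          # x-axis: average of the two prompts
diffs       = p1 - p3                # y-axis: prompt-to-prompt difference

bias   = diffs.mean()                # systematic shift between prompts
sd     = diffs.std(ddof=1)
loa_hi = bias + 1.96 * sd            # upper limit of agreement
loa_lo = bias - 1.96 * sd            # lower limit of agreement

plt.scatter(mean_scores, diffs)
plt.axhline(bias, linestyle="--", label=f"bias = {bias:.2f}")
plt.axhline(loa_hi, color="red", linestyle=":", label="95% limits of agreement")
plt.axhline(loa_lo, color="red", linestyle=":")
plt.xlabel("Mean composite score (P1, P3)")
plt.ylabel("Difference (P1 - P3)")
plt.legend()
plt.show()
```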
Lastly, a comparison was conducted between ChatGPT and a rule-based system to evaluate ChatGPT’s performance against a system that strictly follows the pathologist reference panel. The rule-based system was constructed using predefined pathologist-approved marker panels with no flexibility in marker selection. ChatGPT’s IHC marker recommendations were compared to the rule-based system in three areas: correct marker selections (markers correctly included by ChatGPT); incorrect suggestions (markers recommended by ChatGPT that were not part of the pathologist-approved panel); and missed markers (essential markers from the pathologist panel that were absent from ChatGPT’s recommendations). A Chi-square test was conducted to assess differences in the distribution of correct, incorrect, and missed marker selections. A confusion matrix was generated to visualize classification errors, comparing ChatGPT’s performance to the rule-based system, and a bar chart was generated to compare the number of correct and incorrect markers identified by each system. The proportion of over-inclusion errors (incorrectly suggested markers) and under-inclusion errors (missed essential markers) was quantified to assess ChatGPT’s diagnostic bias. Performance variation between benign and malignant tumors was evaluated to determine whether ChatGPT’s overgeneralization tendencies were tumor-dependent.
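A sketch of this comparison, using the marker counts reported in Section 3.5, is shown below. The use of scipy.stats.chi2_contingency and the exact layout of the contingency table are assumptions, since the original analysis scripts are not provided.

```python
import numpy as np
from scipy.stats import chi2_contingency  # assumed in addition to the packages listed in Section 2.7

# Contingency table of marker-level outcomes (counts reported in Section 3.5):
# rows = system, columns = correct / incorrect / missed markers.
table = np.array([
    [120, 105, 55],   # ChatGPT-4
    [150,   0,  0],   # rule-based system (deterministic reference panels)
])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, dof = {dof}, p = {p:.2e}")

# Over-inclusion vs. under-inclusion errors for ChatGPT.
incorrect, missed = table[0, 1], table[0, 2]
print(f"over-inclusion errors:  {incorrect / table[0].sum():.1%}")
print(f"under-inclusion errors: {missed / table[0].sum():.1%}")
```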
Uncertainty estimates (95% confidence intervals or standard deviations) were computed for all key performance metrics. Where raw per-tumor values were available (e.g., composite scores), confidence intervals were directly calculated; for summary metrics such as sensitivity, specificity, and Cohen’s Kappa, intervals were estimated based on observed variability across tumor types.
The structure for data collection and scoring is illustrated in Supplementary Table S1 (“Table_ChatGPT_DATA_SG_Tumors_Methodology.xlsx”). This framework was used to systematically evaluate all 21 salivary gland tumor types across three independent prompts for ChatGPT-4.
All analyses were conducted using Python (version 3.10), with pandas for data manipulation; scikit-learn for calculations of Cohen’s Kappa, sensitivity, specificity, and accuracy; NumPy (version 1.24) for descriptive statistics; and matplotlib with seaborn for data visualization.
Detailed figure legends can be found in Appendix A.

3. Results

3.1. Agreement Between ChatGPT and Pathologist Panel (Cohen’s Kappa Analysis)

Cohen’s Kappa revealed moderate agreement overall (κ = 0.53, 95% CI 0.48–0.58), but with notable differences between benign (higher agreement) and malignant (lower agreement) tumors. Benign tumors (κ = 0.67, 95% CI 0.61–0.72) showed substantial agreement, with pleomorphic adenoma achieving the highest κ (0.74). Malignant tumors (κ = 0.40, 95% CI 0.33–0.47) had moderate agreement, with basaloid SCC (κ = 0.26) and squamous cell carcinoma (κ = 0.27) showing the lowest agreement.
Across all prompts, agreement levels remained stable, with no major improvement in Kappa scores from Prompt 1 to Prompt 3. Tumor-specific Cohen’s Kappa scores revealed high variability in agreement levels. The three tumors with the lowest agreement (highest challenge for ChatGPT) were basaloid SCC (mean κ = 0.26), squamous cell carcinoma (mean κ = 0.27), and salivary duct carcinoma (mean κ = 0.30). The highest agreement was observed for benign tumors, such as pleomorphic adenoma (mean κ = 0.74).
The t-test comparing benign vs. malignant tumors was statistically significant (t = 3.41, p = 0.004) with a large effect size (Cohen’s d = 1.02). Figure 1 shows Cohen’s Kappa scores for benign vs. malignant tumors as well as tumor-specific Kappa scores.

3.2. Sensitivity and Specificity Analyses

The ability of ChatGPT to correctly identify essential markers (sensitivity) and exclude irrelevant markers (specificity) was assessed using conventional diagnostic accuracy metrics. Sensitivity values for ChatGPT across tumor types ranged between 0.25 and 0.96 (mean 0.79 ± 0.12, 95% CI 0.74–0.84). Higher sensitivity (>0.80) was observed in benign tumors, suggesting ChatGPT was effective in selecting primary IHC markers for these tumors. Specificity values ranged between 0.22 and 0.72 (mean 0.48 ± 0.14, 95% CI 0.41–0.55), with lower specificity (<0.50) found in malignant tumors, indicating ChatGPT frequently suggested irrelevant markers for complex malignancies. Squamous cell carcinoma and basaloid squamous cell carcinoma exhibited the lowest specificity (<0.40), reflecting ChatGPT’s difficulty in differentiating these histologically similar malignancies.
Benign tumors exhibited the highest sensitivity and specificity, with pleomorphic adenoma, basal cell adenoma, and Warthin tumor consistently performing well. Malignant tumors demonstrated significantly lower specificity, particularly BSCC (0.26), SCC (0.28), and carcinoma ex-pleomorphic adenoma (0.25). Mucoepidermoid carcinoma had the highest sensitivity among malignant tumors (0.67). Salivary duct carcinoma, basal cell adenocarcinoma, and mucinous adenocarcinoma exhibited both low sensitivity (<0.50) and low specificity (<0.40).
The mean F1-score for benign tumors was 0.84 ± 0.09 (95% CI 0.80–0.88), while malignant tumors had a lower mean F1-score of 0.63 ± 0.11 (95% CI 0.58–0.68), reflecting ChatGPT’s reduced precision in malignancies. A scatter plot illustrating the relationship between sensitivity and specificity for each tumor type was generated, along with a heatmap visualizing these metrics across all tumor types (Figure 2).

3.3. Consistency and Stability of ChatGPT’s Recommendations

The stability of ChatGPT’s IHC marker recommendations was assessed by analyzing composite score variations across the three prompts for all 21 salivary gland tumor types. The objective was to determine whether ChatGPT’s responses remained consistent or exhibited significant fluctuations, which would indicate prompt-dependent variability in marker selection. A repeated-measures ANOVA (RM-ANOVA) was performed to evaluate whether composite scores differed significantly across prompts. RM-ANOVA showed no significant effect of prompt on composite scores (F(2,40) = 1.45, p = 0.25), indicating that ChatGPT did not demonstrate systematic score shifts between Prompt 1 (P1), Prompt 2 (P2), and Prompt 3 (P3).
To further explore prompt-level differences, paired t-tests were conducted. The comparison between P1 and P2 yielded a t-value of 1.11 and a p-value of 0.28, indicating no statistically significant difference between the first two prompts. Similarly, the comparison between P2 and P3 showed a t-value of –1.81 and a p-value of 0.085, while the comparison between P1 and P3 resulted in a t-value of –0.39 and a p-value of 0.70. After Bonferroni adjustment (α = 0.0167), none of the post-hoc comparisons reached statistical significance, supporting the conclusion that composite scores were generally stable across prompts.
Despite this overall stability, individual tumors exhibited varying degrees of prompt-to-prompt fluctuation. Most benign neoplasms—including pleomorphic adenoma, Warthin tumor, basal cell adenoma, and oncocytoma—displayed minimal variation in composite scores across prompts, reflecting consistent marker recommendation patterns. In contrast, several malignant tumors showed substantially greater inconsistency. Tumors such as mucoepidermoid carcinoma, squamous cell carcinoma, basaloid squamous cell carcinoma, and myoepithelial carcinoma demonstrated the largest prompt-to-prompt changes, aligning with their complex immunophenotypic profiles.
Figure 2. Sensitivity and Specificity Analyses. (A) shows a scatter plot of sensitivity versus specificity per tumor type. Benign tumors cluster with high sensitivity and specificity (≥0.80), indicating strong diagnostic accuracy. Malignant tumors are more dispersed with lower specificity, suggesting ChatGPT often recommends extraneous markers. Outliers like Basaloid SCC and Carcinoma Ex-Pleomorphic Adenoma highlight diagnostic complexity. (B) presents a heatmap of sensitivity and specificity scores. Red areas (<0.40 specificity) are mainly seen in malignant tumors, indicating frequent over-recommendation of irrelevant markers. Benign tumors show consistently high sensitivity and specificity. Squamous and Basaloid SCC have the lowest specificity (<0.30), reflecting challenges in these types.
Across all tumors, the mean within-tumor standard deviation of composite scores was 0.98 ± 0.76. Benign tumors exhibited lower variability (0.78 ± 0.64), while malignant tumors demonstrated higher across-prompt fluctuation (1.10 ± 0.76), indicating less stable behavior in tumors with more heterogeneous immunoprofiles.
A Bland–Altman analysis comparing P1 and P3 revealed a mean bias of –0.14, with 95% limits of agreement ranging from –3.44 to +3.15, indicating moderate but non-systematic variation between prompts. These findings reinforce the pattern of high stability among benign tumors and greater inconsistency among malignant tumors, particularly SCC and basaloid SCC, which displayed the highest degrees of variability. To visualize these trends, a line plot was generated showing composite score trajectories across the three prompts for each tumor type. The figure illustrates relatively flat curves for benign tumors and erratic shifts in several malignant neoplasms, emphasizing the differential stability of ChatGPT’s marker recommendations (Figure 3).

3.4. Bland–Altman Analysis: ChatGPT Score Variability

A Bland–Altman analysis was performed to evaluate the agreement between composite scores generated by Prompt 1 and Prompt 3, the two prompts that demonstrated the largest numerical differences in the dataset. The mean difference (bias) between Prompt 1 and Prompt 3 was −0.10, indicating no systematic directional shift in ChatGPT’s IHC marker recommendations. The standard deviation of the differences was 1.76, resulting in 95% limits of agreement ranging from −3.54 to +3.35. This wide interval reflects the degree of prompt-to-prompt variability across tumor types.
Most benign tumors demonstrated minimal differences between Prompt 1 and Prompt 3, consistent with their relatively stable composite scores. In contrast, several malignant tumors—including squamous cell carcinoma, basaloid squamous cell carcinoma, hyalinizing clear cell carcinoma, and myoepithelial carcinoma—showed the largest deviations, indicating greater instability in ChatGPT’s marker selection for these neoplasms. Mucoepidermoid carcinoma and adenoid cystic carcinoma demonstrated comparatively small differences, reflecting more consistent outputs across prompts.
Table 2 summarizes the mean composite score for each tumor type across all three prompts. Figure 4 displays the Bland–Altman plot for Prompt 1 versus Prompt 3, illustrating the distribution of differences across the range of mean composite scores.

3.5. Chatbot vs. Rule-Based System Comparison

A comparative analysis was conducted between ChatGPT’s IHC marker recommendations and those from a rule-based system that strictly followed the pathologist reference panel, focusing on correct selections, missed markers, and incorrectly suggested markers. ChatGPT correctly identified 120 markers but suggested 105 incorrect markers and missed 55 essential ones. The F1-score was 0.72 overall, with a drop to 0.63 for malignant tumors, confirming precision issues. The rule-based system correctly identified 150 markers, without any incorrect suggestions or missed markers.
Thus, the rule-based system achieved 100% sensitivity, correctly including all essential markers, but offered no specificity improvement, as it did not attempt to exclude non-essential markers. ChatGPT, by contrast, frequently introduced extra markers, producing a significantly higher number of incorrect marker suggestions than the rule-based system, which strictly adhered to validated primary markers. To further analyze ChatGPT’s errors in marker recommendations, a confusion matrix-style visualization was constructed (Figure 5), in which the x-axis represents ChatGPT’s predictions (correct vs. incorrect), while the y-axis represents the actual marker status (correct vs. incorrect) based on the pathologist reference panel.

4. Discussion

The integration of artificial intelligence (AI) into pathology, particularly for immunohistochemistry (IHC) marker selection, represents a promising, though still nascent, advancement in diagnostic workflows. Salivary gland tumors, given their histological heterogeneity and overlapping features, remain among the more diagnostically complex entities. Accurate IHC selection is vital, yet in practice, pathologists often adjust marker panels based not only on diagnostic algorithms but also on contextual factors such as cost, reagent availability, and institutional protocols [7,8]. This real-world variability adds another layer of complexity to the application of AI in clinical decision making.
This study evaluated the diagnostic performance of ChatGPT-4, a general-purpose large language model (LLM), in recommending IHC markers for a range of salivary gland tumors. ChatGPT-4 was selected due to its free accessibility, growing use among clinicians and researchers, and increasing relevance as an open-ended, text-based decision support tool [20,25,26]. In contrast to commercial platforms like ImmunoQuery and ExpertPath, which offer curated, rule-based recommendations but require paid subscriptions and structured queries, ChatGPT offers dynamic, conversational flexibility. Similarly, purpose-built pathology tools like PathChat provide multimodal capabilities but are not freely available and remain limited in deployment [19]. Our study thus sought to assess whether an accessible, zero-cost model could offer meaningful diagnostic guidance, particularly in settings where access to specialized tools is limited.
Performance analysis showed moderate overall agreement between ChatGPT-4’s recommendations and expert pathologists (κ = 0.53), consistent with previous findings on the utility, but also limitations, of LLMs in pathology [20]. Agreement was higher for benign tumors (κ = 0.67) than for malignant ones (κ = 0.40), reflecting ChatGPT’s relative strength in simpler diagnostic tasks and its diminished reliability when handling complex, overlapping malignancies [3,27].
While ChatGPT demonstrated high sensitivity in recommending essential IHC markers for benign tumors, its specificity, particularly for malignant tumors, was lower, often suggesting extraneous or irrelevant markers. These results align with previous work showing that while AI can boost efficiency, it still struggles to replicate expert-level discernment in complex cases [28]. Furthermore, variability in ChatGPT’s output across tumor types and repeated prompts underscores current limitations in consistency and reproducibility, especially in diagnostically ambiguous cases [28,29].
A key finding was that ChatGPT’s agreement levels remained largely unchanged across iterative prompts, with no significant improvement in performance (Kappa scores) from Prompt 1 to Prompt 3. This contrasts with human reasoning, where repeated case exposure improves diagnostic accuracy. While some studies have shown that prompt optimization can enhance ChatGPT’s outputs in other domains [30], such adaptability was not evident in this study.
In this study, sensitivity and specificity were applied at the marker level rather than at the patient or case level. Sensitivity reflected ChatGPT’s ability to correctly include essential IHC markers present in the expert-derived reference panels (true positives), while specificity quantified its capacity to exclude non-essential or irrelevant markers (true negatives). In the rule-based system, both metrics are deterministic, as marker inclusion strictly follows predefined expert rules. In contrast, ChatGPT’s probabilistic generation introduces overinclusion tendencies, resulting in lower marker-level specificity. Framing the comparison in this way allows a consistent evaluation of how closely each system reproduces expert-derived decision boundaries, rather than measuring diagnostic accuracy in the clinical sense.
When compared to a strict rule-based system aligned with expert-established IHC guidelines, ChatGPT-4 underperformed, generating more missed or incorrect recommendations. Rule-based systems, while inflexible, ensure standardized and vetted outputs that minimize risk in high-stakes clinical settings [31,32]. The rule-based system was not intended as a competing diagnostic tool but as a fixed gold-standard baseline derived from expert consensus, enabling controlled comparison of ChatGPT’s marker selection behavior relative to established reference panels. Nevertheless, ChatGPT offers a unique advantage in flexibility and accessibility, features that make it a potential adjunct for early-stage decision support, particularly in under-resourced environments. Comparable findings have been reported in oral pathology, where ChatGPT-4 demonstrated moderate agreement with an expert pathologist in describing histopathological features [33]. Also, similar trends have been observed in other branches of pathology, such as in hematopathology, where AI models were evaluated for diagnostic support and prognostic classification [34] and dermatopathology, where AI tools have demonstrated potential while underscoring the need for rigorous clinical validation [35].
Despite its promise, ChatGPT-4’s performance highlights important limitations. The model lacks access to proprietary pathology databases and expert-validated resources such as ImmunoQuery or ExpertPath, which may limit its alignment with current clinical guidelines [36]. Moreover, it was not tested within real-time clinical workflows, where dynamic case variables, such as staining artifacts or sample quality, would influence IHC selection. These contextual nuances are especially critical in malignant tumors, where even small differences in IHC recommendations can affect treatment decisions [37].
Additionally, ChatGPT-4 lacks multimodal reasoning and cannot interpret histologic images, radiology, or clinical context, key elements in real-world pathology workflows. This may reduce its clinical relevance in complex cases, where decision making is inherently integrative. While we compared ChatGPT to a rule-based system and an expert consensus panel, these comparisons still fall short of representing the full spectrum of clinical decision making, which is adaptive and context-sensitive. Variations in IHC marker preferences across institutions, evolving literature, and diagnostic updates may also introduce subtle bias into our “gold standard” reference panels [38].
This study has several limitations. First, ChatGPT-4 was evaluated using text-only prompts without access to histopathologic images or case-specific clinical data, which limits its applicability to real diagnostic workflows. However, this approach reflects how medical residents or less experienced pathologists might initially interact with AI tools to facilitate differential diagnosis or learning in the absence of digital slides. Second, the model functions as a general-purpose language system without verified linkage to curated pathology databases, but it may draw on a broad range of publicly available biomedical references that are continually updated and evolving. Third, although performance was benchmarked against consensus panels rather than clinical outcomes, the pathologists involved represented multiple institutions and countries, and the consensus was reached through a rigorous multi-round review and calibration process. Finally, the public ChatGPT interface does not permit control over temperature or other generation parameters, which can introduce inherent variability between sessions; this reflects real-world user conditions and highlights the need for standardized AI evaluation frameworks in pathology. Also, the study focused on the general-purpose ChatGPT-4 model to evaluate baseline performance under real-world access conditions. While this approach limits domain-specific optimization, it provides a practical reference for non-specialized users and can inform future comparisons with fine-tuned medical LLMs.
To improve the utility of AI in pathology, future directions should include (1) integration of multimodal datasets combining textual, histological, and clinical inputs, (2) real-time clinical validation studies, and (3) refinement of AI models to reduce response variability and improve task-specific performance. Ongoing collaboration between AI developers and pathologists will be critical to ensure tools are not only technically accurate but also practically implementable within clinical workflows.

5. Conclusions

ChatGPT-4 and similar AI models hold potential as accessible, adjunctive tools in pathology, particularly for supporting IHC marker selection. However, their current limitations in accuracy, consistency, and contextual reasoning highlight the need for refinement and cautious implementation. For now, such tools should serve only as supplements to, not substitutes for, the nuanced decision making of expert pathologists. Continued interdisciplinary efforts will be key to translating AI from experimental testing into safe, effective clinical practice.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedinformatics5040066/s1, Table S1: “Table_ChatGPT_DATA_SG_Tumors_Methodology.xlsx”.

Author Contributions

Conceptualization, M.C.-N., C.G. and L.F.; methodology, M.C.-N. and W.R.D.-C.; software, W.R.D.-C. and V.G.M.; validation, M.C.-N., A.M. and L.F.; formal analysis, G.F., V.Z. and M.-T.F.-F.; investigation, M.C.-N., S.S.F. and J.B.-C.; resources, G.F. and V.Z.; data curation, W.R.D.-C. and V.G.M.; writing—original draft preparation, M.C.-N.; writing—review and editing, L.F. and C.G.; visualization, A.M.; supervision, L.F. and C.G.; project administration, M.C.-N.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable. This study did not involve human subjects, patient data, or animal research, and therefore did not require institutional review board (IRB) approval.

Informed Consent Statement

Not applicable. No human participants were involved.

Data Availability Statement

The original contributions presented in this study are included in the Supplementary Material. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Figure 1. Cohen’s Kappa Scores for Benign vs. Malignant Tumors. Figure 1A. Bar Chart Comparing ChatGPT’s Agreement with the Pathologist Panel Across Three Prompts. Benign tumors showed higher agreement (Substantial Agreement, Kappa = 0.67). Malignant tumors had lower agreement (Moderate Agreement, Kappa = 0.40). Minimal variation across prompts suggests ChatGPT does not significantly refine its recommendations. Figure 1B. Tumor-Specific Cohen’s Kappa Scores for Benign Tumors. Bar Chart Comparing ChatGPT’s Kappa Scores for Different Benign Tumors. Pleomorphic Adenoma exhibited the highest agreement (Kappa = 0.74). Canalicular Adenoma had the lowest agreement (Kappa = 0.66). Benign tumors showed overall substantial agreement (Mean Kappa > 0.65), suggesting ChatGPT was more consistent in these cases. Figure 1C. Bar Chart Comparing ChatGPT’s Kappa Scores for Different Malignancies. BSCC (Kappa = 0.26), Squamous Cell Carcinoma (Kappa = 0.27), and Salivary Duct Carcinoma (Kappa = 0.30) had the lowest agreement scores. Tumors with overlapping histologic features exhibited higher disagreement with the pathologist panel.
Figure 2. Sensitivity and Specificity Analyses. Figure 2A. A scatter plot illustrating the relationship between sensitivity and specificity for each tumor type. Benign tumors clustered in the high sensitivity and high specificity range (≥0.80), demonstrating strong diagnostic performance. Malignant tumors were widely distributed, with lower specificity scores, indicating ChatGPT’s tendency to suggest extraneous markers in these cases. Basaloid SCC and Carcinoma Ex-Pleomorphic Adenoma were clear outliers, reinforcing their diagnostic complexity. Figure 2B. Heatmap Visualization of Sensitivity and Specificity. A heatmap visualizing the sensitivity and specificity scores across all tumor types. Red regions indicate low specificity (<0.40), primarily in malignant tumors, suggesting frequent over-recommendation of irrelevant markers. Benign tumors displayed a consistent high-sensitivity, high-specificity pattern, reinforcing ChatGPT’s reliability in these cases. Squamous and Basaloid SCC displayed the lowest specificity (<0.30), confirming ChatGPT’s struggle with these tumor types.
Figure 3. Composite Score Trends Across Three ChatGPT Prompts for 21 Salivary Gland Tumors. This figure illustrates the composite scores generated by ChatGPT for 21 salivary gland tumor types across three independent prompts. The composite score (y-axis) ranges from 3 to 9 and reflects the sum of accuracy, completeness, and relevance for each prompt. Tumor types are shown in dataset order along the x-axis. Each colored line corresponds to a different prompt—Prompt 1 (orange circles), Prompt 2 (green squares), and Prompt 3 (blue triangles). Differences in line trajectories represent prompt-to-prompt variability in ChatGPT’s IHC marker recommendations.
Figure 4. A Bland–Altman plot comparing composite scores generated by Prompt 1 and Prompt 3 for all 21 salivary gland tumor types. Each tumor is represented by a uniquely colored circle. The mean difference (bias) is −0.10, indicating no meaningful systematic shift in ChatGPT’s IHC marker recommendations between prompts. The upper and lower limits of agreement (−3.54 and +3.35) are shown as horizontal reference lines. Tumors with larger deviations from the bias line represent greater prompt-to-prompt variability in marker selection.
Figure 5. Bar Chart Comparing the Number of Correct and Incorrect IHC Markers Between ChatGPT and the Rule-Based System. This figure provides a direct comparison of ChatGPT’s diagnostic performance against a rule-based system that strictly follows the pathologist reference panel. Red bars (ChatGPT) show the number of correct, missed, and incorrectly suggested markers, while blue bars (Rule-Based System) show only correct markers with no errors. ChatGPT correctly identified 120 markers but introduced 105 incorrect suggestions and missed 55 essential markers. The rule-based system achieved 100% accuracy with zero missed or incorrect suggestions. ChatGPT’s biggest issue was excessive overgeneralization, leading to highly unreliable marker panels. Missed markers (nearly 20% of all recommendations) raise concerns about AI reliability for critical diagnostic panels.

References

  1. Stenman, G.; Persson, F.; Andersson, M.K. Diagnostic and Therapeutic Implications of New Molecular Biomarkers in Salivary Gland Cancers. Oral Oncol. 2014, 50, 683–690. [Google Scholar] [CrossRef]
  2. Skalova, A.; Vanecek, T.; Simpson, R.H.W.; Michal, M. Molecular Advances in Salivary Gland Pathology and Their Practical Application. Diagn. Histopathol. 2012, 18, 388–396. [Google Scholar] [CrossRef]
  3. Speight, P.M.; Barrett, A.W. Salivary Gland Tumours: Diagnostic Challenges and an Update on the Latest WHO Classification. Diagn. Histopathol. 2020, 26, 147–158. [Google Scholar] [CrossRef]
  4. Iyer, J.; Hariharan, A.; Cao, U.M.N.; Mai, C.T.T.; Wang, A.; Khayambashi, P.; Nguyen, B.H.; Safi, L.; Tran, S.D. An Overview on the Histogenesis and Morphogenesis of Salivary Gland Neoplasms and Evolving Diagnostic Approaches. Cancers 2021, 13, 3910. [Google Scholar] [CrossRef]
  5. Kohale, M.G.; Dhobale, A.V.; Bankar, N.J.; Noman, O.; Hatgaonkar, K.; Mishra, V. Immunohistochemistry in Pathology: A Review. J. Cell Biotechnol. 2023, 9, 131–138. [Google Scholar] [CrossRef]
  6. Fang, R.; Wang, X.T.; Xia, Q.Y.; Zhou, X.J.; Rao, Q. Precision in Diagnostic Molecular Pathology Based on Immunohistochemistry. Crit. Rev. Oncog. 2017, 22, 451–469. [Google Scholar] [CrossRef]
  7. McCrary, M.R.; Galambus, J.; Chen, W. Evaluating the Diagnostic Performance of a Large Language Model-powered Chatbot for Providing Immunohistochemistry Recommendations in Dermatopathology. J. Cutan. Pathol. 2024, 51, 689–695. [Google Scholar] [CrossRef]
  8. Kumar, M.; Fatima, Z.H.; Goyal, P.; Qayyumi, B. Looking through the Same Lens—Immunohistochemistry for Salivary Gland Tumors: A Narrative Review on Testing and Management Strategies. Cancer Res. Stat. Treat. 2024, 7, 62–71. [Google Scholar] [CrossRef]
  9. Sultan, A.S.; Elgharib, M.A.; Tavares, T.; Jessri, M.; Basile, J.R. The Use of Artificial Intelligence, Machine Learning and Deep Learning in Oncologic Histopathology. J. Oral Pathol. Med. 2020, 49, 849–856. [Google Scholar] [CrossRef] [PubMed]
  10. Yang, Y.; Sun, K.; Gao, Y.; Wang, K.; Yu, G. Preparing Data for Artificial Intelligence in Pathology with Clinical-Grade Performance. Diagnostics 2023, 13, 3115. [Google Scholar] [CrossRef]
  11. Acs, B.; Rantalainen, M.; Hartman, J. Artificial Intelligence as the next Step towards Precision Pathology. J. Intern. Med. 2020, 288, 62–81. [Google Scholar] [CrossRef]
  12. Choi, S.; Kim, S. Artificial Intelligence in the Pathology of Gastric Cancer. J. Gastric Cancer 2023, 23, 410–427. [Google Scholar] [CrossRef] [PubMed]
  13. Frosolini, A.; Catarzi, L.; Benedetti, S.; Latini, L.; Chisci, G.; Franz, L.; Gennaro, P.; Gabriele, G. The Role of Large Language Models (LLMs) in Providing Triage for Maxillofacial Trauma Cases: A Preliminary Study. Diagnostics 2024, 14, 839. [Google Scholar] [CrossRef] [PubMed]
  14. Abdelsamea, M.M.; Zidan, U.; Senousy, Z.; Gaber, M.M.; Rakha, E.; Ilyas, M. A Survey on Artificial Intelligence in Histopathology Image Analysis. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2022, 12, e1474. [Google Scholar] [CrossRef]
  15. Kalra, S.; Tizhoosh, H.R.; Shah, S.; Choi, C.; Damaskinos, S.; Safarpoor, A.; Shafiei, S.; Babaie, M.; Diamandis, P.; Campbell, C.J.V.; et al. Pan-Cancer Diagnostic Consensus through Searching Archival Histopathology Images Using Artificial Intelligence. NPJ Digit. Med. 2020, 3, 31. [Google Scholar] [CrossRef]
  16. Stenzinger, A.; Alber, M.; Allgäuer, M.; Jurmeister, P.; Bockmayr, M.; Budczies, J.; Lennerz, J.; Eschrich, J.; Kazdal, D.; Schirmacher, P.; et al. Artificial Intelligence and Pathology: From Principles to Practice and Future Applications in Histomorphology and Molecular Profiling. Semin. Cancer Biol. 2022, 84, 129–143. [Google Scholar] [CrossRef] [PubMed]
  17. Homeyer, A.; Lotz, J.; Schwen, L.O.; Weiss, N.; Romberg, D.; Höfener, H.; Zerbe, N.; Hufnagl, P. Artificial Intelligence in Pathology: From Prototype to Product. J. Pathol. Inform. 2021, 12, 13. [Google Scholar] [CrossRef]
  18. McGenity, C.; Clarke, E.L.; Jennings, C.; Matthews, G.; Cartlidge, C.; Freduah-Agyemang, H.; Stocken, D.D.; Treanor, D. Artificial Intelligence in Digital Pathology: A Systematic Review and Meta-Analysis of Diagnostic Test Accuracy. NPJ Digit. Med. 2024, 7, 114. [Google Scholar] [CrossRef]
  19. Colling, R.; Pitman, H.; Oien, K.; Rajpoot, N.; Macklin, P.; Bachtiar, V.; Booth, R.; Bryant, A.; Bull, J.; Bury, J.; et al. Artificial Intelligence in Digital Pathology: A Roadmap to Routine Use in Clinical Practice. J. Pathol. 2019, 249, 143–150. [Google Scholar] [CrossRef]
  20. Schukow, C.; Smith, S.C.; Landgrebe, E.; Parasuraman, S.; Folaranmi, O.O.; Paner, G.P.; Amin, M.B. Application of ChatGPT in Routine Diagnostic Pathology: Promises, Pitfalls, and Potential Future Directions. Adv. Anat. Pathol. 2024, 31, 15–21. [Google Scholar] [CrossRef]
  21. WHO. Classification of Tumours Editorial Board. In Head and Neck Tumours, 5th ed.; International Agency for Research on Cancer: Lyon, France, 2022; Volume 9. [Google Scholar]
  22. Tschandl, P.; Rinner, C.; Apalla, Z.; Argenziano, G.; Codella, N.; Halpern, A.; Janda, M.; Lallas, A.; Longo, C.; Malvehy, J.; et al. Human–Computer Collaboration for Skin Cancer Recognition. Nat. Med. 2020, 26, 1229–1234. [Google Scholar] [CrossRef]
  23. Campanella, G.; Hanna, M.G.; Geneslaw, L.; Miraflor, A.; Werneck Krauss Silva, V.; Busam, K.J.; Brogi, E.; Reuter, V.E.; Klimstra, D.S.; Fuchs, T.J. Clinical-Grade Computational Pathology Using Weakly Supervised Deep Learning on Whole Slide Images. Nat. Med. 2019, 25, 1301–1309. [Google Scholar] [CrossRef]
  24. Korbar, B.; Olofson, A.; Miraflor, A.; Nicka, C.; Suriawinata, M.; Torresani, L.; Suriawinata, A.; Hassanpour, S. Deep Learning for Classification of Colorectal Polyps on Whole-Slide Images. J. Pathol. Inform. 2017, 8, 30. [Google Scholar] [CrossRef]
  25. Oon, M.L.; Syn, N.L.; Tan, C.L.; Tan, K.B.; Ng, S.B. Bridging Bytes and Biopsies: A Comparative Analysis of ChatGPT and Histopathologists in Pathology Diagnosis and Collaborative Potential. Histopathology 2024, 84, 601–613. [Google Scholar] [CrossRef]
  26. Ullah, E.; Parwani, A.; Baig, M.M.; Singh, R. Challenges and Barriers of Using Large Language Models (LLM) Such as ChatGPT for Diagnostic Medicine with a Focus on Digital Pathology—A Recent Scoping Review. Diagn. Pathol. 2024, 19, 43. [Google Scholar] [CrossRef] [PubMed]
  27. Reerds, S.T.H.; Uijen, M.J.M.; Van Engen-Van Grunsven, A.C.H.; Marres, H.A.M.; van Herpen, C.M.L.; Honings, J. Results of Histopathological Revisions of Majorsalivary Gland Neoplasms in Routine Clinical Practice. J. Clin. Pathol. 2022, 76, 374–378. [Google Scholar] [CrossRef]
  28. Wu, S.; Yue, M.; Zhang, J.; Li, X.; Li, Z.; Zhang, H.; Wang, X.; Han, X.; Cai, L.; Shang, J.; et al. The Role of Artificial Intelligence in Accurate Interpretation of HER2 Immunohistochemical Scores 0 and 1+ in Breast Cancer. Mod. Pathol. 2023, 36, 100054. [Google Scholar] [CrossRef] [PubMed]
  29. Waqas, A.; Bui, M.M.; Glassy, E.F.; El Naqa, I.; Borkowski, P.; Borkowski, A.A.; Rasool, G. Revolutionizing Digital Pathology with the Power of Generative Artificial Intelligence and Foundation Models. Lab. Investig. 2023, 103, 100255. [Google Scholar] [CrossRef] [PubMed]
  30. Choi, Y.K.; Lin, S.-Y.; Fick, D.M.; Shulman, R.W.; Lee, S.; Shrestha, P.; Santoso, K. Optimizing ChatGPT’s Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study. JMIR Form. Res. 2024, 8, e51383. [Google Scholar] [CrossRef]
  31. Naved, B.A.; Luo, Y. Contrasting Rule and Machine Learning Based Digital Self Triage Systems in the USA. NPJ Digit. Med. 2024, 7, 381. [Google Scholar] [CrossRef]
  32. Hager, P.; Jungmann, F.; Holland, R.; Bhagat, K.; Hubrecht, I.; Knauer, M.; Vielhauer, J.; Makowski, M.; Braren, R.; Kaissis, G.; et al. Evaluation and Mitigation of the Limitations of Large Language Models in Clinical Decision-Making. Nat. Med. 2024, 30, 2613–2622. [Google Scholar] [CrossRef]
  33. Cuevas-Nunez, M.; Silberberg, V.I.A.; Arregui, M.; Jham, B.C.; Ballester-Victoria, R.; Koptseva, I.; de Tejada, M.J.B.G.; Posada-Caez, R.; Manich, V.G.; Bara-Casaus, J.; et al. Diagnostic Performance of ChatGPT-4.0 in Histopathological Description Analysis of Oral and Maxillofacial Lesions: A Comparative Study with Pathologists. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 2025, 139, 453–461. [Google Scholar] [CrossRef]
  34. Doeleman, T.; Hondelink, L.M.; Vermeer, M.H.; van Dijk, M.R.; Schrader, A.M.R. Artificial Intelligence in Digital Pathology of Cutaneous Lymphomas: A Review of the Current State and Future Perspectives. Semin. Cancer Biol. 2023, 94, 81–88. [Google Scholar] [CrossRef] [PubMed]
  35. Querzoli, G.; Veronesi, G.; Corti, B.; Nottegar, A.; Dika, E. Basic Elements of Artificial Intelligence Tools in the Diagnosis of Cutaneous Melanoma. Crit. Rev. Oncog. 2023, 28, 37–41. [Google Scholar] [CrossRef] [PubMed]
  36. Al Tibi, G.; Alexander, M.; Miller, S.; Chronos, N. A Retrospective Comparison of Medication Recommendations Between a Cardiologist and ChatGPT-4 for Hypertension Patients in a Rural Clinic. Cureus 2024, 16, e55789. [Google Scholar] [CrossRef]
  37. Moore, P.D.C.; Guinigundo, M.A.S. The Role of Biomarkers in Guiding Clinical Decision-Making in Oncology. J. Adv. Pract. Oncol. 2023, 14, 15–37. [Google Scholar] [CrossRef] [PubMed]
  38. Shin, D.; Arthur, G.; Caldwell, C.; Popescu, M.; Petruc, M.; Diaz-Arias, A.; Shyu, C.-R. A Pathologist-in-the-Loop IHC Antibody Test Selection Using the Entropy-Based Probabilistic Method. J. Pathol. Inform. 2012, 3, 1. [Google Scholar] [CrossRef]
Figure 1. Agreement between ChatGPT and the pathologist panel (Cohen’s Kappa analysis). (A) Agreement with pathologists across the three prompts: higher for benign tumors (κ = 0.67) than for malignant tumors (κ = 0.40), with little variation across prompts. (B) Tumor-specific scores for benign tumors, with Pleomorphic Adenoma highest (κ = 0.74) and Canalicular Adenoma lowest (κ = 0.66); overall, benign tumors showed substantial agreement. (C) Scores for malignancies, including BSCC (κ = 0.26), Squamous Cell Carcinoma (κ = 0.27), and Salivary Duct Carcinoma (κ = 0.30); tumors with overlapping histological features showed the greatest disagreement.
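For readers who want to reproduce the agreement metric, the sketch below shows one plausible way to compute Cohen’s kappa for a single tumor type by coding each candidate IHC marker as recommended or not recommended by ChatGPT versus the pathologist reference panel. The marker universe and both panels are hypothetical placeholders, not the study data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical example: per-marker agreement for one tumor type.
# The marker universe and both panels below are illustrative, not the study data.
marker_universe = ["CK7", "p63", "S100", "GFAP", "SOX10", "DOG1", "AR", "HER2", "PLAG1", "Ki-67"]

pathologist_panel = {"CK7", "p63", "S100", "GFAP", "SOX10", "PLAG1"}   # reference panel
chatgpt_panel     = {"CK7", "p63", "S100", "SOX10", "Ki-67", "AR"}     # model output

# Encode each marker as 1 (recommended) or 0 (not recommended) by each rater.
ref  = [1 if m in pathologist_panel else 0 for m in marker_universe]
pred = [1 if m in chatgpt_panel     else 0 for m in marker_universe]

kappa = cohen_kappa_score(ref, pred)
print(f"Cohen's kappa for this tumor type: {kappa:.2f}")
```

Pooling such per-marker encodings across tumors would give category-level values comparable to those in panels (A)–(C), although the exact aggregation used in the study may differ.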
Figure 3. Composite IHC marker scores (range 3–9) for the 21 salivary gland tumors across the three prompts, displayed in dataset order. Each line corresponds to one prompt: Prompt 1 (orange circles), Prompt 2 (green squares), and Prompt 3 (blue triangles), illustrating prompt-to-prompt variability in ChatGPT’s recommendations.
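The prompt-to-prompt comparison summarized in Figure 3, and tested with repeated-measures ANOVA in the Methods, can be sketched as follows: tumor type is treated as the repeated “subject” and prompt as the within-subject factor. The scores below are illustrative placeholders, not the values plotted in the figure.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical composite scores (scale 3-9) for a few tumor types across three prompts;
# values are placeholders, not the study data.
data = pd.DataFrame({
    "tumor":  ["PA", "PA", "PA", "WT", "WT", "WT", "MEC", "MEC", "MEC", "AdCC", "AdCC", "AdCC"],
    "prompt": ["P1", "P2", "P3"] * 4,
    "score":  [6, 5, 6, 5, 5, 4, 5, 6, 4, 7, 6, 7],
})

# Repeated-measures ANOVA: one score per tumor per prompt,
# testing whether mean composite scores differ across prompts.
result = AnovaRM(data=data, depvar="score", subject="tumor", within=["prompt"]).fit()
print(result)
```

A non-significant prompt effect (p > 0.05), as reported in the Results, would indicate that the visible line-to-line differences in Figure 3 reflect unsystematic variability rather than a consistent shift between prompts.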
Figure 4. Bland–Altman plot comparing composite score differences between prompts. Differences in ChatGPT’s IHC marker scores are shown across tumor types in distinct colors. The mean difference centers around zero, indicating no systematic bias, while larger deviations for individual tumors suggest instability across prompts.
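A Bland–Altman analysis of the kind shown in Figure 4 can be reproduced in a few lines of Python. The sketch below uses simulated composite scores for two prompts (placeholders, not the study data) and plots the mean difference (bias) together with 95% limits of agreement.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated composite scores for the 21 tumor types under two prompts (illustrative only).
rng = np.random.default_rng(0)
prompt1 = rng.uniform(4, 7, size=21)
prompt2 = prompt1 + rng.normal(0, 0.5, size=21)

mean_scores = (prompt1 + prompt2) / 2       # x-axis: mean of the two measurements
diff_scores = prompt1 - prompt2             # y-axis: difference between measurements

bias = diff_scores.mean()
loa = 1.96 * diff_scores.std(ddof=1)        # half-width of the 95% limits of agreement

plt.scatter(mean_scores, diff_scores)
plt.axhline(bias, label=f"bias = {bias:.2f}")
plt.axhline(bias + loa, linestyle="--", label="upper limit of agreement")
plt.axhline(bias - loa, linestyle="--", label="lower limit of agreement")
plt.xlabel("Mean composite score (Prompt 1, Prompt 2)")
plt.ylabel("Difference (Prompt 1 - Prompt 2)")
plt.legend()
plt.show()
```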
Figure 5. Chatbot vs. rule-based system comparison. (A) Bar chart comparing ChatGPT and a rule-based system in identifying IHC markers. ChatGPT correctly identified 120 markers but made 105 incorrect suggestions and missed 55, whereas the rule-based system was 100% accurate, with 0 incorrect and 0 missed markers. ChatGPT’s main weakness was overgeneralization, which produced unreliable marker panels; with nearly 20% of markers missed, this raises concerns about AI reliability in diagnostics. (B) Confusion-matrix-style heatmap of ChatGPT’s IHC marker recommendation errors against the reference standard, showing true positives (120), false positives (105), false negatives (55), and no true negatives because explicit non-marker data were not defined.
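As a quick arithmetic check, the aggregate counts in Figure 5 (120 true positives, 105 false positives, 55 false negatives) translate into pooled precision, sensitivity, and F1 as shown below; because no true negatives are defined, specificity cannot be derived from these counts alone. These pooled values are illustrative of the calculation and differ from the class-specific F1-scores reported separately for benign and malignant tumors.

```python
# Pooled metrics from the counts reported in Figure 5.
tp, fp, fn = 120, 105, 55

precision = tp / (tp + fp)                                  # 120 / 225 ≈ 0.53
recall    = tp / (tp + fn)                                  # sensitivity: 120 / 175 ≈ 0.69
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.60

print(f"precision={precision:.2f}, sensitivity={recall:.2f}, F1={f1:.2f}")
```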
Table 1. Classification of salivary gland tumors.
Benign Salivary Gland Tumors:
1. Pleomorphic Adenoma (PA)
2. Warthin Tumor (WT)
3. Basal Cell Adenoma (BCA)
4. Canalicular Adenoma (CA)
5. Oncocytoma (OC)
Malignant Salivary Gland Tumors (Adenocarcinomas):
1. Mucoepidermoid Carcinoma (MEC)
2. Adenoid Cystic Carcinoma (AdCC)
3. Acinic Cell Carcinoma (AcCC)
4. Polymorphous Adenocarcinoma (PAC)
5. Clear Cell Carcinoma (CCC)
6. Secretory Carcinoma (MASC)
7. Epithelial–Myoepithelial Carcinoma (EMC)
8. Salivary Duct Carcinoma (SDC)
9. Basal Cell Adenocarcinoma (BCAC)
10. Mucinous Adenocarcinoma (MAC)
11. Oncocytic Carcinoma (OCa)
12. Hyalinizing Clear Cell Carcinoma (HCCC)
13. Myoepithelial Carcinoma (MC)
14. Squamous Cell Carcinoma (SCC)
15. Basaloid Squamous Cell Carcinoma (BSCC)
16. Carcinoma Ex-Pleomorphic Adenoma (Ca-exPA)
Abbreviations: PA, Pleomorphic Adenoma; WT, Warthin Tumor; BCA, Basal Cell Adenoma; CA, Canalicular Adenoma; OC, Oncocytoma; MEC, Mucoepidermoid Carcinoma; AdCC, Adenoid Cystic Carcinoma; AcCC, Acinic Cell Carcinoma; PAC, Polymorphous Adenocarcinoma; CCC, Clear Cell Carcinoma; MASC, Secretory Carcinoma; EMC, Epithelial–Myoepithelial Carcinoma; SDC, Salivary Duct Carcinoma; BCAC, Basal Cell Adenocarcinoma; MAC, Mucinous Adenocarcinoma; OCa, Oncocytic Carcinoma; HCCC, Hyalinizing Clear Cell Carcinoma; MC, Myoepithelial Carcinoma; SCC, Squamous Cell Carcinoma; BSCC, Basaloid Squamous Cell Carcinoma; Ca-exPA, Carcinoma ex-Pleomorphic Adenoma.
Table 2. Mean composite score and score variability across the three prompts for each tumor type.
Tumor Type | Mean Composite Score
Pleomorphic Adenoma | 5.67
Warthin Tumor | 5.00
Basal Cell Adenoma | 5.00
Canalicular Adenoma | 5.00
Oncocytoma | 5.33
Mucoepidermoid Carcinoma | 5.00
Adenoid Cystic Carcinoma | 6.67
Acinic Cell Carcinoma | 5.67
Polymorphous Adenocarcinoma | 6.33
Clear Cell Carcinoma | 5.00
Secretory Carcinoma | 5.33
Epithelial–Myoepithelial Carcinoma | 5.67
Salivary Duct Carcinoma | 5.00
Basal Cell Adenocarcinoma | 5.00
Mucinous Adenocarcinoma | 6.67
Oncocytic Carcinoma | 5.33
Hyalinizing Clear Cell Carcinoma | 6.00
Myoepithelial Carcinoma | 6.00
Squamous Cell Carcinoma | 6.33
Basaloid SCC | 6.00
Carcinoma Ex-Pleomorphic Adenoma | 4.33
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
