3. Results
Between October 2025 and January 2026, a total of 58 physicians from four tertiary healthcare centers in Turkey (Ankara Bilkent City Hospital, Ankara Etlik City Hospital, Elazığ Fethi Sekin City Hospital, and Etimesgut Şehit Sait Ertürk State Hospital) were assessed for eligibility to participate in this multi-center cross-sectional diagnostic accuracy study. Following rigorous application of predetermined inclusion and exclusion criteria, 28 physicians were excluded from participation. Among these exclusions, 12 physicians did not meet the specified inclusion criteria: 7 had less than 5 years of clinical experience following residency or fellowship completion, and 5 lacked board certification in their respective specialties. Additionally, 8 physicians were excluded due to prior involvement in artificial intelligence-related clinical research, which was deemed a potential source of bias. Four physicians had documented previous exposure to the study questions through participation in pilot testing or question development phases, and 4 physicians declined to participate after being informed of the study protocol. The final study population comprised 30 physicians who met all eligibility requirements and provided written informed consent. These participants were stratified into three specialty groups of 10 participants each: general surgery specialists, gastroenterology specialists, and emergency medicine physicians, with equal representation from each participating institution (
Figure 1).
The demographic and professional characteristics of the 30 enrolled participants demonstrated comparable baseline characteristics across specialty groups (
Table 1). The overall mean age of participants was 42.4 ± 5.8 years, ranging from 34 to 56 years. General surgery specialists had a mean age of 42.3 ± 5.8 years, gastroenterology specialists 44.1 ± 6.2 years, and emergency medicine physicians 40.8 ± 5.4 years, with no statistically significant difference between groups (one-way ANOVA, F = 0.91,
p = 0.412). The study population was predominantly male, with 19 of 30 participants (63.3%) being male. The distribution of male participants across specialty groups was as follows: general surgery 7 of 10 (70%), gastroenterology 6 of 10 (60%), and emergency medicine 6 of 10 (60%), with no significant difference in sex distribution (chi-square = 0.27,
p = 0.872).
Clinical experience following completion of specialty training demonstrated adequate expertise across all groups. The overall mean post-residency or post-fellowship clinical experience was 11.5 ± 4.4 years. General surgery specialists reported a mean of 11.2 ± 4.3 years of experience, gastroenterology specialists 12.8 ± 5.1 years, and emergency medicine physicians 10.5 ± 3.8 years (one-way ANOVA, F = 0.65, p = 0.528). Academic positions were distributed across the study population as follows: 9 participants (30%) held the position of assistant professor, 5 participants (17%) were associate professors, and 16 participants (53%) were specialist physicians without academic appointments. The distribution of academic positions did not differ significantly between specialty groups (chi-square = 2.54, p = 0.634). General surgery specialists reported performing a mean of 124 ± 45 laparoscopic cholecystectomies annually as primary surgeon, demonstrating substantial surgical experience in the management of gallbladder disease. The annual volume of acute biliary cases managed differed significantly between specialty groups: emergency medicine physicians reported the highest volume (215 ± 78 cases annually), followed by general surgery specialists (156 ± 52 cases), and gastroenterology specialists (98 ± 38 cases; one-way ANOVA, F = 4.12, p = 0.023). This finding reflects the distinct clinical roles of each specialty in the acute care continuum, with emergency medicine physicians serving as the initial point of contact for the majority of acute biliary presentations.
All 30 enrolled participants successfully completed the 30-question case-based assessment instrument in its entirety, resulting in no missing data for the primary analysis. The assessment was administered under standardized conditions with unlimited time allocation, prohibition of external references, and prohibition of artificial intelligence assistance tools. Responses were collected via a secure online platform with unique participant identifiers to ensure data integrity and participant confidentiality.
The neuro-symbolic large language model (NS-LLM) system achieved the highest overall diagnostic accuracy among all evaluated groups, correctly answering 29 of 30 questions for an accuracy rate of 96.7% (95% confidence interval [CI]: 82.8–99.9%). This performance substantially exceeded that of all three human expert groups (
Table 2,
Figure 2). Among human participants, general surgery specialists demonstrated the highest mean accuracy at 82.3% ± 6.8%, corresponding to a mean of 24.7 ± 2.0 correct responses out of 30 questions (95% CI: 75.5–89.1%). Individual performance among general surgeons ranged from 70.0% (21/30 correct) to 93.3% (28/30 correct). Gastroenterology specialists achieved a mean accuracy of 78.7% ± 7.4%, corresponding to 23.6 ± 2.2 correct responses (95% CI: 71.3–86.1%), with individual performance ranging from 66.7% (20/30) to 90.0% (27/30). Emergency medicine physicians demonstrated the lowest mean accuracy at 71.0% ± 8.2%, corresponding to 21.3 ± 2.5 correct responses (95% CI: 62.8–79.2%), with individual performance ranging from 56.7% (17/30) to 83.3% (25/30).
Statistical comparison using independent samples t-tests with Bonferroni correction for multiple comparisons revealed highly significant differences between the NS-LLM system and all human expert groups. The NS-LLM system demonstrated superior performance compared to general surgery specialists with a mean difference of 14.4 percentage points (95% CI: 8.2–20.6; t = 4.82, degrees of freedom [df] = 9, p < 0.001). The difference between the NS-LLM system and gastroenterology specialists was 18.0 percentage points (95% CI: 11.5–24.5; t = 5.73, df = 9, p < 0.001). The largest performance gap was observed between the NS-LLM system and emergency medicine physicians, with a difference of 25.7 percentage points (95% CI: 18.9–32.5; t = 7.89, df = 9, p < 0.001). One-way analysis of variance (ANOVA) confirmed significant differences in diagnostic accuracy among the three human specialty groups (F = 5.84, p = 0.008). Post hoc analysis using Tukey’s honestly significant difference test revealed that general surgery specialists significantly outperformed emergency medicine physicians (mean difference: 11.3 percentage points; 95% CI: 3.8–18.8; p = 0.003). The difference between general surgery and gastroenterology specialists did not reach statistical significance (mean difference: 3.6 percentage points; 95% CI: −3.9–11.1; p = 0.412), nor did the difference between gastroenterology specialists and emergency medicine physicians (mean difference: 7.7 percentage points; 95% CI: 0.2–15.2; p = 0.057).
Receiver operating characteristic (ROC) curve analysis was performed to assess the discriminative ability of each participant group in distinguishing correct from incorrect clinical decisions (
Figure 3). The NS-LLM system demonstrated excellent discriminative ability with an area under the ROC curve (AUC) of 0.983 (95% CI: 0.951–0.998), indicating near-perfect classification performance approaching the theoretical maximum of 1.0. This AUC value corresponds to a sensitivity of 96.7% and specificity of 100% at the optimal operating point. Among human expert groups, general surgery specialists achieved an AUC of 0.856 (95% CI: 0.784–0.928), indicating good discriminative ability. Gastroenterology specialists demonstrated an AUC of 0.821 (95% CI: 0.742–0.900), while emergency medicine physicians showed the lowest discriminative ability with an AUC of 0.764 (95% CI: 0.678–0.850), which nonetheless exceeded the threshold for acceptable discrimination (AUC > 0.7). Pairwise comparison of AUC values using the DeLong method confirmed statistically significant differences between the NS-LLM system and each human group: NS-LLM versus general surgery (z = 3.42,
p < 0.001), NS-LLM versus gastroenterology (z = 4.18,
p < 0.001), and NS-LLM versus emergency medicine (z = 5.67,
p < 0.001).
Analysis of diagnostic accuracy stratified by clinical domain revealed differential patterns of performance across the three predefined categories: diagnosis (n = 10 questions), treatment (n = 10 questions), and complications/prognosis (n = 10 questions) (
Table 3,
Figure 4). The NS-LLM system achieved perfect accuracy of 100% (10/10 correct responses) in the diagnostic domain, correctly identifying all clinical diagnoses including acute calculous cholecystitis, acute acalculous cholecystitis, complicated cholecystitis variants (gangrenous, emphysematous, perforated), chronic cholecystitis, biliary colic, gallbladder polyps, adenomyomatosis, Mirizzi syndrome, and gallbladder carcinoma. Similarly, the system achieved 100% accuracy (10/10) in the complications and prognosis domain, correctly predicting clinical outcomes and identifying potential complications across all presented scenarios. In the treatment domain, the NS-LLM system achieved 90% accuracy (9/10), with a single incorrect response that will be discussed in detail in the error analysis section.
Among human participants, performance in the diagnostic domain varied by specialty in a pattern consistent with clinical expertise. Gastroenterology specialists achieved the highest mean diagnostic accuracy at 82.0% ± 7.5% (8.2 ± 0.75 correct responses), reflecting their subspecialty focus on hepatobiliary pathophysiology and differential diagnosis. General surgery specialists achieved a mean diagnostic accuracy of 78.0% ± 8.2% (7.8 ± 0.82 correct responses), while emergency medicine physicians demonstrated 74.0% ± 9.1% (7.4 ± 0.91 correct responses). The difference between gastroenterology and emergency medicine in the diagnostic domain approached but did not reach statistical significance (mean difference: 8.0 percentage points; t = 2.01, df = 18, p = 0.059).
In the treatment domain, general surgery specialists demonstrated superior performance consistent with their primary clinical role in surgical management of acute cholecystitis. General surgeons achieved a mean treatment accuracy of 88.0% ± 6.3% (8.8 ± 0.63 correct responses), significantly exceeding gastroenterology specialists at 76.0% ± 8.4% (7.6 ± 0.84 correct responses; mean difference: 12.0 percentage points; t = 3.61, df = 18, p = 0.002) and emergency medicine physicians at 70.0% ± 7.8% (7.0 ± 0.78 correct responses; mean difference: 18.0 percentage points; t = 5.68, df = 18, p < 0.001). The difference between gastroenterology and emergency medicine in treatment questions was also statistically significant (mean difference: 6.0 percentage points; t = 2.09, df = 18, p = 0.042). This pattern reflects the central role of general surgeons in determining surgical timing, approach, and technique for cholecystectomy, as well as their familiarity with the TG18 surgical management flowcharts.
Performance in the complications and prognosis domain demonstrated the following accuracy rates: general surgery specialists 81.0% ± 7.1% (8.1 ± 0.71 correct responses), gastroenterology specialists 78.0% ± 6.8% (7.8 ± 0.68 correct responses), and emergency medicine physicians 69.0% ± 8.5% (6.9 ± 0.85 correct responses). The difference between general surgery and emergency medicine in this domain was statistically significant (mean difference: 12.0 percentage points; t = 3.43, df = 18, p = 0.003). Questions in this domain addressed topics including post-cholecystectomy syndrome, bile duct injury classification and management, biliary fistula, retained common bile duct stones, and long-term outcomes following various treatment approaches. The NS-LLM system maintained consistent superiority over all human groups across each clinical domain, with statistically significant pairwise differences for all comparisons (all p < 0.01).
Detailed analysis of diagnostic accuracy across eight predefined disease categories revealed nuanced patterns of performance that differed between the NS-LLM system and human expert groups (
Table 4,
Figure 5). The NS-LLM system achieved perfect accuracy (100%) in six of eight disease categories: acute calculous cholecystitis (6/6 correct), acute acalculous cholecystitis (2/2 correct), complicated cholecystitis including gangrenous, emphysematous, and perforated variants (5/5 correct), chronic cholecystitis and biliary colic (4/4 correct), gallbladder polyps and adenomyomatosis (3/3 correct), and Mirizzi syndrome (2/2 correct). The system achieved 87.5% accuracy in both gallbladder carcinoma (3/4 correct, with one incorrect response) and post-cholecystectomy complications (3/4 correct, with one incorrect response).
For acute calculous cholecystitis, the most prevalent clinical scenario represented by 6 questions in the assessment instrument, human participant performance, demonstrated the following mean accuracies: general surgery specialists 85.0% ± 5.8% (5.1 ± 0.35 correct responses), gastroenterology specialists 80.0% ± 6.3% (4.8 ± 0.38 correct responses), and emergency medicine physicians 73.3% ± 8.2% (4.4 ± 0.49 correct responses). The performance differential between general surgery and emergency medicine in this category was statistically significant (t = 3.51, df = 18, p = 0.002). Acute acalculous cholecystitis, a less common but clinically challenging entity represented by 2 questions, demonstrated mean accuracies of 80.0% ± 10.5% for general surgery, 85.0% ± 7.1% for gastroenterology (the highest among human groups for this category), and 70.0% ± 12.2% for emergency medicine. The higher performance of gastroenterologists in this category may reflect their greater familiarity with the pathophysiology and risk factors for acalculous cholecystitis, including critical illness, total parenteral nutrition, and immunosuppression.
Complicated cholecystitis cases, encompassing gangrenous, emphysematous, and perforated variants across 5 questions, yielded mean accuracies of 88.0% ± 5.4% for general surgery, 82.0% ± 6.9% for gastroenterology, and 72.0% ± 9.3% for emergency medicine. General surgeons demonstrated particular proficiency in recognizing imaging findings suggestive of complicated cholecystitis (such as intramural gas, pericholecystic abscess, and gallbladder wall discontinuity) and in selecting appropriate urgent surgical or interventional management. Chronic cholecystitis and biliary colic questions (n = 4) demonstrated accuracies of 85.0% ± 7.1% for general surgery, 77.5% ± 8.5% for gastroenterology, and 75.0% ± 6.5% for emergency medicine, with less pronounced inter-specialty differences reflecting the more straightforward clinical presentations in this category.
Questions addressing gallbladder polyps and adenomyomatosis (n = 3) demonstrated accuracies of 76.7% ± 11.5% for general surgery, 80.0% ± 8.2% for gastroenterology, and 66.7% ± 12.5% for emergency medicine. Gastroenterologists demonstrated superior performance in this category, likely reflecting their role in endoscopic ultrasound evaluation and surveillance decision-making for gallbladder polyps. Mirizzi syndrome, a rare but diagnostically challenging condition represented by 2 questions, showed accuracies of 80.0% ± 14.1% for general surgery, 75.0% ± 7.1% for gastroenterology, and 65.0% ± 14.1% for emergency medicine. The wider standard deviations in this category reflect substantial inter-individual variability in recognizing this uncommon condition, which is often misdiagnosed preoperatively.
Gallbladder carcinoma questions (n = 4) demonstrated mean accuracies of 80.0% ± 8.2% for general surgery, 77.5% ± 9.6% for gastroenterology, and 70.0% ± 8.2% for emergency medicine. Questions in this category addressed topics including recognition of incidental gallbladder carcinoma, staging considerations, surgical management principles, and prognostic factors. Post-cholecystectomy complication questions (n = 4) yielded accuracies of 82.5% ± 6.5% for general surgery, 75.0% ± 7.1% for gastroenterology, and 72.5% ± 9.6% for emergency medicine. General surgeons demonstrated particular proficiency in questions involving bile duct injury classification (Strasberg classification), management algorithms for post-operative bile leak, and indications for endoscopic versus surgical intervention. A notable observation was the consistent performance of the NS-LLM system across both common and rare conditions, contrasting with human experts who demonstrated greater performance variability in less frequently encountered clinical scenarios such as Mirizzi syndrome and acute acalculous cholecystitis.
The performance characteristics of the NS-LLM system, including confidence score calibration and inter-agent agreement metrics, are presented in
Table 5. The system architecture, illustrated in
Figure 6, employs a multi-agent framework with two constituent large language models (Gemini 3.0 Flash and GPT-5.2) coordinated through the LangGraph orchestration platform, with integration of the TG18 symbolic knowledge base for guideline-adherent clinical reasoning. The mean confidence score generated by the NS-LLM system across all 30 questions was 0.924 ± 0.048 on a scale from 0 to 1, indicating high overall system certainty. Confidence scores for correctly answered questions (n = 29) were significantly higher at 0.931 ± 0.042 compared to the confidence score for the single incorrectly answered question (0.724; independent samples
t-test,
t = 4.97, df = 28,
p = 0.031). This finding suggests that the system’s confidence calibration may serve as a clinically useful indicator of response reliability, with lower confidence scores potentially flagging responses warranting additional scrutiny or human oversight.
Agent-specific analysis revealed comparable performance between the two constituent large language models. Agent 1 (Gemini 3.0 Flash) demonstrated a mean confidence score of 0.918 ± 0.052 across all questions, with scores of 0.926 ± 0.046 for correct responses and 0.682 for the incorrect response (t = 5.28, df = 28, p = 0.028). Agent 2 (GPT-5.2) achieved a mean confidence score of 0.931 ± 0.045 overall, with scores of 0.938 ± 0.039 for correct responses and 0.766 for the incorrect response (t = 4.40, df = 28, p = 0.042). Both agents demonstrated appropriate confidence calibration, with significantly lower confidence on the incorrectly answered question. Inter-agent agreement was high, with full concordance (identical selected answers from both agents) observed in 27 of 30 questions (90.0%). Among the 29 correctly answered questions, full inter-agent agreement was achieved in 27 cases (93.1%), indicating that consensus between architecturally distinct models was strongly predictive of correct responses. Partial agreement (agents selected different answers) was observed in 3 questions (10.0%), including the single incorrectly answered question where the consensus arbitration module selected the incorrect option despite conflicting agent recommendations.
Reproducibility of the NS-LLM system was assessed through three independent evaluation runs with the temperature parameter set to 0.1 to minimize stochastic variation. Cohen’s kappa coefficient for inter-run agreement was 0.967, indicating near-perfect reproducibility approaching complete consistency. The system produced identical answers across all three runs for 29 of 30 questions (96.7%), with minor variation observed in confidence scores (mean inter-run confidence score standard deviation: 0.012). The symbolic constraint checking mechanism within the consensus arbitration module successfully identified and rejected two candidate responses during the evaluation that violated TG18 diagnostic criteria. Specifically, one candidate response incorrectly classified a case as Grade I cholecystitis despite the presence of white blood cell count exceeding 18,000/μL (a Grade II criterion), and another candidate response recommended immediate cholecystectomy for a patient meeting Grade III criteria with hemodynamic instability. Both violations were detected through automated comparison against the encoded TG18 severity grading rules, demonstrating the value of explicit guideline encoding in preventing clinically implausible outputs.
Error Analysis
A comprehensive analysis of incorrect responses was performed to identify patterns of error and potential areas for improvement in both the NS-LLM system and human expert groups. The NS-LLM system produced a single incorrect response out of 30 questions (error rate: 3.3%), while human participants collectively produced 172 incorrect responses out of 900 total responses (error rate: 19.1%). The distribution of errors among human groups was as follows: emergency medicine physicians 87 errors (error rate: 29.0%), gastroenterology specialists 64 errors (error rate: 21.3%), and general surgery specialists 53 errors (error rate: 17.7%).
The single error committed by the NS-LLM system occurred in Question 17, which presented a 58-year-old male patient with Grade II acute cholecystitis (defined by symptom duration exceeding 72 h and white blood cell count of 19,200/μL) and borderline surgical risk (Charlson Comorbidity Index of 4 and American Society of Anesthesiologists Physical Status classification of III). The clinical vignette described a patient with controlled type 2 diabetes mellitus, compensated heart failure with preserved ejection fraction, and chronic kidney disease stage 3a. The NS-LLM system recommended early laparoscopic cholecystectomy within 72 h (Option B), whereas the gold standard answer established by the expert panel was initial conservative management with percutaneous cholecystostomy followed by delayed cholecystectomy after clinical optimization (Option C). Analysis of the system’s reasoning trace revealed that both agents correctly identified the Grade II classification and recognized the elevated surgical risk, but the consensus arbitration module weighted the potential benefits of early definitive surgery more heavily than the risks of operating on a physiologically compromised patient. The confidence score for this response was 0.724, substantially lower than the mean confidence for correct responses (0.931), suggesting that the system’s uncertainty calibration appropriately reflected the clinical complexity of the case.
Among human participants, error analysis revealed distinct patterns across specialty groups and clinical domains. In the diagnostic domain, the most common errors involved misclassification of cholecystitis severity grade (28 errors, 16.3% of all human errors) and failure to recognize complicated cholecystitis variants (19 errors, 11.0%). Specifically, 14 errors involved undergrading of Grade II cholecystitis as Grade I due to incomplete application of TG18 criteria, particularly the 72 h symptom duration threshold and the white blood cell count threshold of 18,000/μL. Additionally, 9 errors involved failure to recognize imaging findings consistent with emphysematous cholecystitis (intramural gas) or gangrenous cholecystitis (irregular gallbladder wall thickening with areas of discontinuity). Emergency medicine physicians committed the highest proportion of severity grading errors (15 of 28, 53.6%), consistent with their lower familiarity with the nuanced TG18 grading system compared to specialists with primary hepatobiliary focus.
In the treatment domain, the most frequent errors involved inappropriate surgical timing recommendations (31 errors, 18.0% of all human errors) and incorrect selection of drainage approach (22 errors, 12.8%). Surgical timing errors predominantly involved recommending early cholecystectomy for high-risk Grade III patients who required initial organ support and drainage (18 errors) or recommending delayed surgery for low-risk Grade I patients who would benefit from early intervention (13 errors). Drainage approach errors included inappropriate selection of percutaneous cholecystostomy versus endoscopic transpapillary gallbladder drainage (12 errors) and failure to recommend drainage in patients meeting Grade III criteria with prohibitive surgical risk (10 errors). General surgery specialists committed the fewest treatment errors (12 of 53 total errors, 22.6%), while emergency medicine physicians committed the most (38 of 87 total errors, 43.7%).
Errors in the complications and prognosis domain primarily involved misidentification of bile duct injury severity according to the Strasberg classification (17 errors, 9.9%) and incorrect prognostic assessment for patients with gallbladder carcinoma (15 errors, 8.7%). Bile duct injury classification errors frequently involved confusion between Strasberg Type A injuries (cystic duct leak or leak from minor hepatic radicals) and Type D injuries (lateral injury to extrahepatic bile duct), which have substantially different management implications. Gallbladder carcinoma prognostic errors often involved overestimation of survival for advanced-stage tumors or underestimation of the significance of incidental gallbladder carcinoma discovered at cholecystectomy.
Analysis of errors by disease category revealed that questions involving rare conditions demonstrated higher error rates among human participants. Mirizzi syndrome questions had an overall human error rate of 26.7% (16 errors among 60 responses), substantially higher than the error rate for acute calculous cholecystitis questions (17.8%, 32 errors among 180 responses). Similarly, acute acalculous cholecystitis questions demonstrated an error rate of 21.7% (13 errors among 60 responses), reflecting the diagnostic challenge posed by this condition in patients without typical risk factors. In contrast, the NS-LLM system maintained perfect accuracy on these challenging rare conditions, demonstrating the advantage of systematic guideline application over pattern recognition based on clinical experience with common presentations.
Question-level analysis identified three questions with particularly high human error rates exceeding 40%. Question 8, addressing the management of acute acalculous cholecystitis in a critically ill intensive care unit patient, had an error rate of 46.7% (14/30 incorrect), with the most common error being recommendation of early cholecystectomy rather than percutaneous cholecystostomy in a patient too unstable for surgery. Question 14, involving recognition of Mirizzi syndrome Type II (cholecystobiliary fistula) on magnetic resonance cholangiopancreatography, had an error rate of 43.3% (13/30 incorrect), with errors primarily involving failure to recognize the fistulous communication or misclassification of the Csendes type. Question 23, addressing appropriate surveillance intervals for a 12 mm gallbladder polyp with concerning features, had an error rate of 40.0% (12/30 incorrect), with errors involving both over-recommendation of immediate cholecystectomy and under-recommendation of surveillance. The NS-LLM system answered all three of these high-difficulty questions correctly, with confidence scores of 0.912, 0.887, and 0.941 respectively.
Comparative analysis of error patterns between the NS-LLM system and human experts revealed fundamental differences in failure modes. Human errors frequently resulted from incomplete information integration (failure to consider all relevant clinical parameters), anchoring bias (premature diagnostic closure based on salient features), and knowledge gaps regarding rare conditions or updated guideline recommendations. In contrast, the single NS-LLM system error resulted from a value judgment in balancing competing risks (early surgery versus delayed surgery in a borderline-risk patient) rather than from factual error or guideline misapplication. This distinction suggests that neuro-symbolic systems may be particularly valuable for ensuring complete guideline application and reducing errors of omission, while human oversight remains important for clinical scenarios involving nuanced risk-benefit assessments that fall outside explicitly encoded algorithmic pathways.
Inter-rater reliability among human participants within each specialty group was assessed using Fleiss’ kappa coefficient to quantify the degree of agreement beyond that expected by chance. General surgery specialists demonstrated moderate agreement with a Fleiss’ kappa of 0.58 (95% CI: 0.49–0.67), indicating that approximately 58% of the agreement observed exceeded chance-level agreement. Gastroenterology specialists showed fair to moderate agreement with a kappa of 0.52 (95% CI: 0.43–0.61), while emergency medicine physicians demonstrated the lowest inter-rater agreement with a kappa of 0.44 (95% CI: 0.35–0.53), falling in the fair agreement range. These findings indicate substantial variability in clinical decision-making even among board-certified specialists from tertiary care centers, highlighting the challenge of achieving consistent guideline application in complex clinical scenarios.
Analysis of agreement patterns by question characteristics revealed that questions with the highest inter-rater agreement (kappa > 0.70) predominantly involved straightforward acute calculous cholecystitis scenarios with unambiguous TG18 criteria fulfillment, clear imaging findings, and low-surgical-risk patients. These questions typically presented classic clinical presentations with fever, right upper quadrant pain, positive Murphy’s sign, elevated inflammatory markers, and characteristic ultrasonographic findings. Conversely, questions with the lowest inter-rater agreement (kappa < 0.40) involved complex cases requiring integration of multiple clinical parameters. These challenging questions included cases with borderline severity grading (white blood cell count near 18,000/μL or symptom duration near 72 h), competing diagnostic considerations (acute cholecystitis versus acute cholangitis versus acute pancreatitis with biliary etiology), patients with significant comorbidities affecting surgical risk assessment, and rare conditions requiring recognition of subtle clinical or imaging features. The NS-LLM system’s consistent performance across both high-agreement and low-agreement questions suggests that neuro-symbolic approaches may reduce the variability inherent in complex clinical decision-making.
Assessment completion time was recorded for all participants to evaluate the efficiency of clinical decision-making. Human participants required a mean time of 47.3 ± 12.8 min (range: 28–78 min) to complete the 30-question assessment instrument, corresponding to approximately 1.58 min per question. Analysis by specialty group revealed that general surgery specialists demonstrated the fastest mean completion time at 42.1 ± 10.5 min (1.40 min per question), followed by gastroenterology specialists at 48.7 ± 13.2 min (1.62 min per question), and emergency medicine physicians at 51.2 ± 14.1 min (1.71 min per question). The difference in completion time between groups did not reach statistical significance (one-way ANOVA, F = 1.62, p = 0.218), although the trend toward faster completion by general surgeons may reflect their greater familiarity with the clinical scenarios and decision algorithms specific to cholecystectomy candidacy assessment.
The NS-LLM system processed all 30 questions in a mean time of 4.2 ± 0.3 min across three independent evaluation runs (range: 3.8–4.6 min), corresponding to approximately 8.4 s per question. This represents an approximately 11.3-fold reduction in assessment completion time compared to the mean human performance and a 18.6-fold reduction compared to the slowest human participant. The processing time included all stages of the multi-agent pipeline: clinical feature extraction from the vignette text, semantic mapping to the symbolic knowledge base, parallel agent inference with Gemini 3.0 Flash and GPT-5.2, knowledge graph traversal for TG18 criteria verification, and consensus arbitration with final answer selection. The substantial efficiency advantage of the NS-LLM system, combined with its superior guideline-concordant performance, suggests potential applications in time-critical clinical environments where rapid guideline-adherent decision support could enhance patient care.
4. Discussion
This study demonstrates that a neuro-symbolic large language model system integrating TG18 guidelines as its symbolic knowledge base achieves superior performance in standardized guideline-concordant assessment of acute cholecystitis management compared to human expert physicians across three clinical specialties. The NS-LLM system attained 96.7% overall accuracy, substantially outperforming general surgeons (82.3%), gastroenterologists (78.7%), and emergency medicine physicians (71.0%). These findings align with the emerging paradigm of neuro-symbolic artificial intelligence in healthcare, which has been increasingly recognized as a promising approach to overcome the fundamental limitations of purely neural methods [
13].
The superior performance of our neuro-symbolic system is consistent with recent systematic reviews comparing AI diagnostic capabilities with physicians. Takita et al. conducted a comprehensive meta-analysis demonstrating that generative AI models can achieve diagnostic performance comparable to or better than that of physicians in certain clinical domains [
14]. Similarly, a pooled analysis of 30 studies found that LLM primary diagnosis accuracy ranged from 25% to 97.8%, with the best-performing models approaching expert-level performance on specific tasks [
15]. However, our results extend these findings by demonstrating that the integration of symbolic reasoning with neural networks can achieve even higher accuracy rates than standalone LLMs in structured clinical decision-making scenarios.
The neuro-symbolic architecture employed in this study addresses several critical limitations of conventional LLMs identified in the recent literature. Lu et al. demonstrated that neuro-symbolic methods using Logical Neural Networks can develop explainable models for diagnosis prediction that integrate domain-specific knowledge through logical rules with learnable weights and thresholds, achieving superior performance over traditional machine learning approaches [
9]. Furthermore, Prenosil et al. showed that connecting LLMs with rule-based expert systems through semantic integration platforms enables traceable and deterministic clinical reasoning, which is essential for auditable healthcare applications [
10]. Our system’s incorporation of TG18 diagnostic criteria as structured symbolic knowledge exemplifies this approach, enabling explicit rule-based verification of neural network outputs against established clinical guidelines.
A notable benefit of our neuro-symbolic methodology is its ability to reduce the risk of hallucinations, an ongoing concern in medical artificial intelligence applications. Recent research has shown that leading medical LLMs demonstrate hallucination rates between 15% and 40% when performing clinical tasks [
16]. Omar et al. reported that LLMs can elaborate on fabricated clinical details in 50–82% of cases when exposed to adversarial prompts, highlighting their vulnerability to generating misleading medical content [
8]. The symbolic constraint checking mechanism in our system addresses this limitation by validating each candidate answer against hard constraints encoded in the TG18 knowledge base, rejecting responses that violate established diagnostic criteria. This dual verification approach substantially reduces the risk of clinically implausible outputs. It is essential to maintain a critical and balanced perspective regarding the clinical implications of these findings. Current AI models, including advanced neuro-symbolic architectures, operate as sophisticated pattern-matching and rule-application systems that lack genuine clinical understanding, empathic reasoning, and the ability to navigate the inherent uncertainty of individual patient encounters. The observed performance advantage likely reflects the system’s unfailing application of encoded guideline criteria—a task at which computational systems inherently excel—rather than a fundamental superiority in clinical reasoning. Clinicians should approach AI-generated recommendations with the same critical appraisal they would apply to any clinical decision support tool, recognizing both the capabilities and the fundamental limitations of these systems.
The multi-agent architecture of our system aligns with the emerging trend of agentic AI in healthcare. Recent research has highlighted that agentic artificial intelligence systems characterized by autonomy, adaptability, and scalability are transforming healthcare applications by enhancing diagnostics, clinical decision support, and treatment planning. In healthcare settings, multi-agent AI systems have been shown to improve collaboration among specialized AI components, with each agent optimized for specific tasks such as diagnosis, monitoring, and clinical workflow support [
17]. In particular, multi-agent AI architectures in healthcare have been shown to facilitate more precise collaboration among specialized components, with each agent optimized for specific tasks across clinical workflows [
17]. Our system employs parallel agent deployment using two distinct LLMs (Gemini 3.0 Flash and GPT-5.2), providing redundancy that reduces single-point-of-failure errors and enables cross-validation of reasoning pathways. The consensus arbitration module integrating confidence-weighted voting with symbolic constraint checking represents a novel approach to resolving inter-agent disagreements while maintaining clinical guideline adherence.
The observed variation in performance across clinical specialties reflects the differential training and clinical focus of each discipline. General surgeons demonstrated the highest accuracy in treatment decisions (88.0%), consistent with their primary role in surgical management of acute cholecystitis. Gastroenterologists excelled in diagnostic questions (82.0%), reflecting their expertise in hepatobiliary pathophysiology and differential diagnosis. Emergency medicine physicians showed comparable overall performance despite having the broadest scope of practice, suggesting effective application of clinical decision-making frameworks in acute presentations. These findings are consistent with Gaber et al., who demonstrated that LLM workflows in clinical decision support vary in effectiveness across different clinical domains and specialties [
5,
6]. A methodological consideration is the inherent asymmetry between the NS-LLM system, which was explicitly optimized to follow TG18 through its symbolic knowledge base, and human participants, who were not provided with specific TG18 training prior to the study. This design reflects real-world clinical practice conditions but limits the interpretation of performance differences to guideline adherence rather than global clinical superiority. The observed performance gap may partly reflect differences in systematic guideline application rather than fundamental differences in clinical competence. It should be noted that the gold standard was established by a panel of three experts utilizing TG18 as the reference framework—the same guidelines encoded in the NS-LLM system’s symbolic knowledge base. This creates an inherent methodological alignment between the AI system and the evaluation criteria, potentially favoring the system’s performance. The observed accuracy differences should therefore be interpreted as reflecting superior guideline adherence in a structured assessment rather than comprehensive clinical superiority encompassing the full spectrum of physician competencies. Future studies comparing AI performance against physicians who have received immediate pre-study guideline refreshment would help disentangle guideline adherence from inherent diagnostic reasoning ability.
The clinical implications of our findings are significant, especially in resource-constrained environments where access to specialist expertise may be limited. Emerging evidence suggests that artificial intelligence tools can enhance clinical decision support, diagnostics, and workflow efficiency when incorporated into standard care procedures, thereby potentially improving patient outcomes and alleviating clinician workload in practical settings [
18]. Our neuro-symbolic system’s consistent application of TG18 guidelines addresses the documented variability in guideline adherence (45–78%) across different healthcare settings [
3]. The system’s ability to provide structured reasoning traces with explicit TG18 rule references enhances transparency and may facilitate clinician trust, which has been identified as a critical factor for AI adoption in healthcare [
19]. Furthermore, the system’s performance on complex cases involving multi-organ dysfunction and special conditions demonstrates its potential utility in challenging clinical scenarios where cognitive load and time pressure may compromise human decision-making.
Importantly, the recent literature suggests that the optimal model for AI integration in clinical practice may involve augmentation rather than replacement of physician expertise. A randomized clinical trial found that the availability of a large language model to physicians as a diagnostic aid did not significantly improve their diagnostic reasoning compared with the use of conventional resources, while the large language model evaluated alone demonstrated higher performance than both physician groups [
20]. This finding, known as “automation neglect” by some researchers, indicates that physicians may not effectively utilize AI recommendations even when available [
21]. The structured output format of our neuro-symbolic system, which includes explicit confidence scores, clinical reasoning chains, and TG18 rule references, is designed to promote more effective human–AI collaboration by providing actionable and interpretable decision support.
The integration of the TG18 knowledge base as a structured symbolic component signifies a noteworthy progression beyond retrieval-augmented generation methodologies. Although retrieval-augmented LLMs have exhibited potential in clinical applications, a recent study assessing knowledge graph-enhanced LLMs for diagnostic support demonstrated enhanced clinician confidence and reasoning support, especially in cases that are rare or characterized by uncertainty. Our system extends this concept by encoding the complete TG18 diagnostic criteria, severity grading system, and management algorithms as interconnected nodes with explicit logical relationships, enabling rule-based inference that mirrors established diagnostic reasoning patterns. This approach ensures consistent guideline application across all queries while maintaining the flexibility to handle complex clinical scenarios involving multiple concurrent conditions.
Several limitations of this study warrant consideration. First, the evaluation was conducted using case-based multiple-choice questions, which may not fully capture the complexity of real-world clinical decision-making involving incomplete information, evolving presentations, and patient-specific factors. McCoy et al. demonstrated that LLMs show markedly lower performance on script concordance testing, which measures how new information adjusts diagnostic judgments under uncertainty, compared to standard medical multiple-choice benchmarks [
22]. Second, the study population was limited to Turkish tertiary care centers, potentially limiting generalizability to other healthcare systems and cultural contexts. The exclusion of physicians with prior AI research experience, while intended to minimize performance bias, may have introduced selection bias by excluding a subpopulation of technologically engaged clinicians whose clinical reasoning patterns may differ from physicians without such experience. As AI tools become increasingly prevalent in clinical practice, the representativeness of study populations excluding AI-experienced physicians may diminish, and future studies should consider including such physicians as a separate comparison group. Third, the single-pass evaluation methodology does not capture the iterative nature of clinical reasoning, where diagnoses are refined based on response to treatment and additional investigations. Third, the sample size of 30 case-based questions, while generating 900 total human responses, may be insufficient to draw definitive conclusions and increases the risk of Type II error for subgroup analyses involving rare disease categories with limited question representation (
Supplementary Table S1). Post hoc power analysis indicated that with 10 participants per group and an alpha level of 0.05, the study had approximately 80% power to detect a mean accuracy difference of 12 percentage points between groups (effect size d = 0.85), suggesting adequate power for the primary overall comparison but limited power for detecting smaller differences in subgroup analyses. Additional practical limitations warrant acknowledgment. First, the current system relies on structured, unambiguous clinical input and lacks the capability to process ambiguous or equivocal imaging reports that are frequently encountered in clinical practice, such as reports stating ‘findings suggestive of but not diagnostic for acute cholecystitis.’ The system’s semantic mapping module is designed for structured entity extraction rather than probabilistic interpretation of diagnostic uncertainty. Second, the weight allocation for patients’ comorbidities within the symbolic knowledge base follows predefined scoring systems (CCI and ASA-PS) and does not account for nuanced clinical interactions between multiple comorbid conditions that may influence surgical risk beyond standardized indices. For example, the synergistic risk of combined hepatic and renal dysfunction may exceed the additive risk captured by these scoring systems. Third, the system lacks a mechanism to integrate patient preferences, values, and goals of care into its treatment recommendations, which is a fundamental component of shared decision-making in contemporary medical practice and an ethical imperative in surgical planning. From a technical standpoint, the NS-LLM system requires access to cloud-based API endpoints for both constituent large language models (Gemini 3.0 Flash via Google Cloud and GPT-5.2 via OpenAI API), necessitating stable internet connectivity and incurring per-query computational costs. The LangGraph orchestration framework requires Python 3.10 or higher with compatible dependency libraries. While the system does not require local high-performance GPU infrastructure (as neural inference is handled by cloud-based APIs), this reliance on third-party cloud services introduces dependencies on service availability, network latency, and potential API deprecation or pricing changes. The symbolic knowledge base traversal and Monte Carlo dropout-based uncertainty quantification impose additional computational overhead that may limit real-time applicability in high-throughput clinical settings. These infrastructure dependencies may restrict deployment in resource-limited healthcare environments with inadequate computational infrastructure or unreliable network connectivity. Finally, the neuro-symbolic system’s performance is inherently constrained by the completeness and currency of its symbolic knowledge base, requiring periodic updates to incorporate evolving clinical guidelines.
Future research should focus on prospective validation of the neuro-symbolic system in real-world clinical settings, evaluation of its impact on patient outcomes, and assessment of clinician acceptance and workflow integration. The development of multi-agent collaborative systems incorporating diverse clinical perspectives, as proposed by the AI Agent Hospital concept, represents a promising direction for comprehensive clinical decision support [
23]. Current research in medical artificial intelligence is increasingly directed toward multimodal fusion integrating clinical text, imaging, and genomic data, enhancement of clinical interpretability through explainable AI methodologies, and improvement of cross-scenario generalization capabilities that enable robust performance across diverse clinical settings and patient populations [
24,
25]. These research directions align with the neuro-symbolic approach employed in the present study, which enhances interpretability through explicit symbolic reasoning while maintaining the pattern recognition capabilities of neural networks. Future iterations of the NS-LLM system may benefit from incorporating multimodal input processing to simultaneously analyze clinical text, laboratory data, and imaging findings within a unified reasoning. In the current study design, clinical vignettes were presented as structured text input, requiring manual curation of clinical scenarios. Real-world clinical deployment of the NS-LLM system would necessitate the development of interoperability pipelines compatible with hospital electronic health record (EHR) systems (e.g., HL7 FHIR-based data exchange) and integration with imaging report systems (PACS/RIS) for automated extraction of radiology findings. The requirement for manual data preprocessing represents a significant barrier to seamless clinical workflow integration. While the present study utilized hypothetical clinical vignettes without real patient data, future clinical deployment of the NS-LLM system would necessitate comprehensive data protection measures. These should include end-to-end data encryption (e.g., AES-256) for data in transit and at rest, compliance with international privacy regulations including the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR), implementation of de-identification and pseudonymization protocols for clinical data processing, role-based access control mechanisms, comprehensive audit logging of all system interactions with patient data, and institutional data governance frameworks with regular security assessments. The current prototype operates through cloud-based API endpoints (Google Cloud and OpenAI API), which introduces additional data sovereignty and third-party processing considerations that must be addressed through Business Associate Agreements (BAAs) and Data Processing Agreements (DPAs) in compliance with applicable regulations. The absence of these safeguards in the current prototype represents a limitation that must be addressed prior to any clinical implementation