Automated Tumor and Node Staging from Esophageal Cancer Endoscopic Ultrasound Reports: A Benchmark of Advanced Reasoning Models with Prompt Engineering and Cross-Lingual Evaluation
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset
2.2. TNM Staging
2.3. LLM and Testing Process
2.4. Influencing Factors
2.4.1. Prompt Strategy
2.4.2. Language Environment
2.5. Statistical Analysis
3. Results
3.1. Characteristics of the EUS Reports
3.2. Overall Performance
3.3. Analysis of Influencing Factors
3.4. Pairwise McNemar Test Comparison Across Scenarios
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| API | Application Programming Interface |
| EUS | Endoscopic Ultrasound |
| QWK | Quadratic Weighted Kappa |
| LLM(s) | Large Language Model(s) |
| MIT | Massachusetts Institute of Technology |
| MoE | Mixture-of-Experts |
| OR | Odds Ratio |
| R | The R statistical computing environment |
| R/L | Right/Left |
| RL | Reinforcement Learning |
| SD | Standard Deviation |
| SFT | Supervised Fine-Tuning |
| TNM | Tumor–Node–Metastasis |
| U/M/Lo | Upper/Middle/Lower |
| UICC | Union for International Cancer Control |
| RAG | Retrieval-Augmented Generation |
| 95%CI | 95% Confidence Interval |
References
- Sallam, M.; Al-Mahzoum, K.; Sallam, M.; Mijwil, M.M. DeepSeek: Is it the End of Generative AI Monopoly or the Mark of the Impending Doomsday? Mesopotamian J. Big Data 2025, 2025, 26–34. [Google Scholar] [CrossRef]
- Ajani, J.A.; D’Amico, T.A.; Bentrem, D.J.; Cooke, D.; Corvera, C.; Das, P.; Enzinger, P.C.; Enzler, T.; Farjah, F.; Gerdes, H.; et al. Esophageal and Esophagogastric Junction Cancers, Version 2.2023, NCCN Clinical Practice Guidelines in Oncology. J. Natl. Compr. Cancer Netw. 2023, 21, 393–422. [Google Scholar] [CrossRef]
- Choi, H.S.; Song, J.Y.; Shin, K.H.; Chang, J.H.; Jang, B.S. Developing prompts from large language model for extracting clinical information from pathology and ultrasound reports in breast cancer. Radiat. Oncol. J. 2023, 41, 209–216. [Google Scholar] [CrossRef]
- Krill, T.; Baliss, M.; Roark, R.; Sydor, M.; Samuel, R.; Zaibaq, J.; Guturu, P.; Parupudi, S. Accuracy of endoscopic ultrasound in esophageal cancer staging. J. Thorac. Dis. 2019, 11, S1602–S1609. [Google Scholar] [CrossRef]
- Liu, C.Q.; Ma, Y.L.; Qin, Q.; Wang, P.H.; Luo, Y.; Xu, P.F.; Cui, Y. Epidemiology of esophageal cancer in 2020 and projections to 2030 and 2040. Thorac. Cancer 2023, 14, 3–11. [Google Scholar] [CrossRef]
- Maity, S.; Saikia, M.J. Large Language Models in Healthcare and Medical Applications: A Review. Bioengineering 2025, 12, 631. [Google Scholar] [CrossRef]
- Matsuo, H.; Nishio, M.; Matsunaga, T.; Fujimoto, K.; Murakami, T. Exploring multilingual large language models for enhanced TNM classification of radiology report in lung cancer staging. Cancers 2024, 16, 3621. [Google Scholar] [CrossRef]
- Rojas-Carabali, W.; Agrawal, R.; Gutierrez-Sinisterra, L.; Baxter, S.L.; Cifuentes-González, C.; Wei, Y.C.; Abisheganaden, J.; Kannapiran, P.; Wong, S.; Lee, B. Natural Language Processing in medicine and ophthalmology: A review for the 21st-century clinician. Asia-Pac. J. Ophthalmol. 2024, 13, 100084. [Google Scholar] [CrossRef]
- Sallam, M. The Utility of ChatGPT as an Example of Large Language Models in Healthcare Education, Research and Practice: Systematic Review on the Future Perspectives and Potential Limitations. medRxiv 2023. medRxiv:2023.02.19.23286155. [Google Scholar] [CrossRef]
- Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940. [Google Scholar] [CrossRef]
- Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F. Qwen technical report. arXiv 2023, arXiv:2309.16609. [Google Scholar] [CrossRef]
- Chen, D.; Alnassar, S.A.; Avison, K.E.; Huang, R.S.; Raman, S. Large Language Model Applications for Health Information Extraction in Oncology: Scoping Review. JMIR Cancer 2025, 11, e65984. [Google Scholar] [CrossRef] [PubMed]
- Gencer, G.; Gencer, K. Large Language Models in Healthcare: A Bibliometric Analysis and Examination of Research Trends. J. Multidiscip. Healthc. 2025, 18, 223–238. [Google Scholar] [CrossRef] [PubMed]
- He, L.J.; Shan, H.B.; Luo, G.Y.; Li, Y.; Zhang, R.; Gao, X.Y.; Wang, G.B.; Lin, S.Y.; Xu, G.L.; Li, J.J. Endoscopic ultrasonography for staging of T1a and T1b esophageal squamous cell carcinoma. World J. Gastroenterol. 2014, 20, 1340–1347. [Google Scholar] [CrossRef] [PubMed]
- Liu, F.; Zhou, H.; Gu, B.; Zou, X.; Huang, J.; Wu, J.; Li, Y.; Chen, S.S.; Hua, Y.; Zhou, P.; et al. Application of large language models in medicine. Nat. Rev. Bioeng. 2025, 3, 445–464. [Google Scholar] [CrossRef]
- Bhayana, R.; Nanda, B.; Dehkharghanian, T.; Deng, Y.; Bhambra, N.; Elias, G.; Datta, D.; Kambadakone, A.; Shwaartz, C.G.; Moulton, C.-A. Large language models for automated synoptic reports and resectability categorization in pancreatic cancer. Radiology 2024, 311, e233117. [Google Scholar] [CrossRef]
- Chen, K.; Hou, X.; Li, X.; Xu, W.; Yi, H. Structured Report Generation for Breast Cancer Imaging Based on Large Language Modeling: A Comparative Analysis of GPT-4 and DeepSeek. Acad. Radiol. 2025, 32, 5693–5702. [Google Scholar] [CrossRef]
- Huang, J.; Yang, D.M.; Rong, R.; Nezafati, K.; Treager, C.; Chi, Z.; Wang, S.; Cheng, X.; Guo, Y.; Klesse, L.J. A critical assessment of using ChatGPT for extracting structured data from clinical notes. npj Digit. Med. 2024, 7, 106. [Google Scholar] [CrossRef]
- Lee, J.E.; Park, K.-S.; Kim, Y.-H.; Song, H.-C.; Park, B.; Jeong, Y.J. Lung cancer staging using chest CT and FDG PET/CT free-text reports: Comparison among three ChatGPT large language models and six human readers of varying experience. Am. J. Roentgenol. 2024, 223, e2431696. [Google Scholar] [CrossRef]
- Mondillo, G.; Colosimo, S.; Perrotta, A.; Frattolillo, V.; Masino, M. Comparative evaluation of advanced AI reasoning models in pediatric clinical decision support: ChatGPT O1 vs. DeepSeek-R1. medRxiv 2025. medRxiv:2025.01.27.25321169. [Google Scholar] [CrossRef]
- Nakamura, Y.; Kikuchi, T.; Yamagishi, Y.; Hanaoka, S.; Nakao, T.; Miki, S.; Yoshikawa, T.; Abe, O. ChatGPT for automating lung cancer staging: Feasibility study on open radiology report dataset. medRxiv 2023. medRxiv:2023.12.11.23299107. [Google Scholar] [CrossRef]
- Etaiwi, W.; Alhijawi, B. Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance. arXiv 2025, arXiv:2506.18501. [Google Scholar] [CrossRef]
- Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
- Jin, I.; Tangsrivimol, J.A.; Darzi, E.; Hassan Virk, H.U.; Wang, Z.; Egger, J.; Hacking, S.; Glicksberg, B.S.; Strauss, M.; Krittanawong, C. DeepSeek vs. ChatGPT: Prospects and challenges. Front. Artif. Intell. 2025, 8, 1576992. [Google Scholar] [CrossRef]
- Mudrik, A.; Nadkarni, G.N.; Efros, O.; Soffer, S.; Klang, E. Prompt Engineering in Large Language Models for Patient Education: A Systematic Review. medRxiv 2025. medRxiv:2025.03.28.25324834. [Google Scholar] [CrossRef]
- Sallam, M.; Alasfoor, I.M.; Khalid, S.W.; Al-Mulla, R.I.; Al-Farajat, A.; Mijwil, M.M.; Zahrawi, R.; Sallam, M.; Egger, J.; Al-Adwan, A.S. Chinese generative AI models (DeepSeek and Qwen) rival ChatGPT-4 in ophthalmology queries with excellent performance in Arabic and English. Narra J. 2025, 5, e2371. [Google Scholar] [CrossRef] [PubMed]
- He, Z.; Zhao, L.; Li, G.; Wang, J.; Cai, S.; Tu, P.; Chen, J.; Wu, J.; Zhang, J.; Chen, R. Comparative performance evaluation of large language models in answering esophageal cancer-related questions: A multi-model assessment study. Front. Digit. Health 2025, 7, 1670510. [Google Scholar] [CrossRef]
- Ishida, K.; Murakami, R.; Yamanoi, K.; Hamada, K.; Hasebe, K.; Sakurai, A.; Miyamoto, T.; Mizuno, R.; Taki, M.; Yamaguchi, K. Real-world application of large language models for automated TNM staging using unstructured gynecologic oncology reports. npj Precis. Oncol. 2025, 9, 366. [Google Scholar] [CrossRef]
- Kim, J.-S.; Baek, S.-J.; Ryu, H.S.; Choo, J.M.; Cho, E.; Kwak, J.-M.; Kim, J. Using large language models for clinical staging of colorectal cancer from imaging reports: A pilot study. Ann. Surg. Treat. Res. 2025, 109, 318. [Google Scholar] [CrossRef]
- Papale, A.; Flattau, R.; Vithlani, N.; Mahajan, D.; Ziemba, Y.; Zavadsky, T.; Carvino, A.; King, D.; Nadella, S. Large Language Model-Based Entity Extraction Reliably Classifies Pancreatic Cysts and Reveals Predictors of Malignancy: A Cross-Sectional and Retrospective Cohort Study. medRxiv 2025. medRxiv:2025.07.15.25331413. [Google Scholar]
- Yao, Y.; Cen, X.; Gan, L.; Jiang, J.; Wang, M.; Xu, Y.; Yuan, J. Automated Esophageal Cancer Staging From Free-Text Radiology Reports: Large Language Model Evaluation Study. JMIR Med. Inform. 2025, 13, e75556. [Google Scholar] [CrossRef] [PubMed]
- Luo, P.-W.; Liu, J.-W.; Xie, X.; Jiang, J.-W.; Huo, X.-Y.; Chen, Z.-L.; Huang, Z.-C.; Jiang, S.-Q.; Li, M.-Q. DeepSeek vs ChatGPT: A comparison study of their performance in answering prostate cancer radiotherapy questions in multiple languages. Am. J. Clin. Exp. Urol. 2025, 13, 176. [Google Scholar] [CrossRef]
- Keloth, V.K.; Selek, S.; Chen, Q.; Gilman, C.; Fu, S.; Dang, Y.; Chen, X.; Hu, X.; Zhou, Y.; He, H. Social determinants of health extraction from clinical notes across institutions using large language models. npj Digit. Med. 2025, 8, 287. [Google Scholar] [CrossRef]
- Lai, V.D.; Ngo, N.T.; Veyseh, A.P.B.; Man, H.; Dernoncourt, F.; Bui, T.; Nguyen, T.H. Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv 2023, arXiv:2304.05613. [Google Scholar] [CrossRef]
- Cao, B.; Cai, D.; Zhang, Z.; Zou, Y.; Lam, W. On the worst prompt performance of large language models. Adv. Neural Inf. Process. Syst. 2024, 37, 69022–69042. [Google Scholar]
- Jin, Y.; Chandra, M.; Verma, G.; Hu, Y.; Choudhury, M.D.; Kumar, S. Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries. In Proceedings of the ACM Web Conference 2024, Singapore, 13–17 May 2024. [Google Scholar]
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar] [PubMed]
- Barrie, C.; Palmer, A.; Spirling, A. Replication for Language Models: Problems, Principles, and Best Practice for Political Science. 2024. Available online: https://arthurspirling.org/documents/BarriePalmerSpirling_TrustMeBro.pdf (accessed on 5 December 2025).
- Chen, X.; Gao, C.; Chen, C.; Zhang, G.; Liu, Y. An empirical study on challenges for LLM application developers. ACM Trans. Softw. Eng. Methodol. 2025, 34, 205. [Google Scholar] [CrossRef]
- Gartlehner, G.; Kugley, S.; Crotty, K.; Viswanathan, M.; Dobrescu, A.; Nussbaumer-Streit, B.; Booth, G.; Treadwell, J.R.; Han, J.M.; Wagner, J. Artificial Intelligence–Assisted Data Extraction With a Large Language Model: A Study Within Reviews. Ann. Intern. Med. 2025, 34, 205–211. [Google Scholar] [CrossRef]
- Cohen, J. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 1968, 70, 213. [Google Scholar] [CrossRef]
- McHugh, M.L. Interrater reliability: The kappa statistic. Biochem. Med. 2012, 22, 276–282. [Google Scholar] [CrossRef]
- Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
- Busch, F.; Prucker, P.; Komenda, A.; Ziegelmayer, S.; Makowski, M.R.; Bressem, K.K.; Adams, L.C. Multilingual feasibility of GPT-4o for automated Voice-to-Text CT and MRI report transcription. Eur. J. Radiol. 2025, 182, 111827. [Google Scholar] [CrossRef]
- Daitch, Z.E.; Heller, S.J. Endoscopic ultrasonography in esophageal carcinoma: A narrative review. Ann. Esophagus 2023, 6, 145–153. [Google Scholar] [CrossRef]
- Saadany, S.E.; Mayah, W.; Kalla, F.E.; Atta, T. Endoscopic ultrasound staging of upper gastrointestinal malignancies. Asian Pac. J. Cancer Prev. 2016, 17, 2361–2367. [Google Scholar] [PubMed]
- Kayaalp, M.E.; Prill, R.; Sezgin, E.A.; Cong, T.; Królikowska, A.; Hirschmann, M.T. DeepSeek versus ChatGPT: Multimodal artificial intelligence revolutionizing scientific discovery. From language editing to autonomous content generation-Redefining innovation in research and practice. Knee Surg. Sports Traumatol. Arthrosc. Off. J. ESSKA 2025, 33, 1553–1556. [Google Scholar] [CrossRef] [PubMed]





Related studies applying LLMs to cancer staging and clinical information extraction, compared with the present study.

| Author | Cancer Type and Modality | LLMs Investigated | Task and Focus | Statistical Metrics | Statistical Tests | Limitations/Significance |
|---|---|---|---|---|---|---|
| Choi et al. (2023) [3] | Breast Cancer (Pathology and Ultrasound) | GPT-3.5-turbo | Extracting clinical factors, focusing on prompt development efficiency (time/cost). | Accuracy, Time consumption, Cost analysis | Descriptive comparison (Manual vs. LLM) | Validates the efficiency of LLM prompting for basic extraction but lacks analysis of complex reasoning or logic consistency. |
| Matsuo et al. (2024) [7] | Lung Cancer (Chest CT) | GPT-3.5-turbo | Multilingual TNM classification, evaluating the impact of providing staging definitions in the prompt. | Accuracy | Generalized Linear Mixed Model (GLMM), Odds Ratio (OR) | Validates that external definition injection improves accuracy, but performance remains lower in non-English contexts than in English. |
| Bhayana et al. (2024) [16] | Pancreatic Cancer (CT Reports) | GPT-4, GPT-3.5 | Resectability Categorization, focus on prompt logic | F1-score, Precision, Recall, Accuracy | McNemar’s Test, Wilcoxon signed-rank test | Validates that Chain-of-Thought prompting improves categorization logic. |
| Lee et al. (2024) [19] | Lung Cancer (Multimodality Radiology) | GPT-3.5, GPT-4, GPT-4o | Automated TNM staging, benchmarking LLM performance against human radiologists of varying experience | Overall Staging Accuracy, Error Rate | McNemar’s Test | Establishes that LLMs can match physicians in TNM staging. |
| Nakamura et al. (2023) [21] | Lung Cancer (CT Reports) | GPT-4, GPT-3.5 Turbo | Feasibility of automated staging, identifying failures in numerical reasoning | Accuracy | Descriptive analysis of error types | Validates staging feasibility but exposes frequent errors in handling numerical thresholds and anatomical details due to insufficient reasoning capabilities. |
| He et al. (2025) [27] | Esophageal Cancer (QA Tasks) | DeepSeek-R1, Gemini 2.5, ChatGPT-5, Grok-4 | Evaluating accuracy and completeness of esophageal cancer Q&A. | Accuracy score, Completeness score (Likert scale) | Friedman test, Wilcoxon signed-rank, Bonferroni correction | Highlights trade-offs: Gemini excelled in accuracy while ChatGPT led in completeness; supports a tiered model selection strategy for clinical consulting. |
| Ishida et al. (2025) [28] | Gynecologic Cancer (Pathology Reports) | Gemini 1.5 Pro, Qwen2.5-72B | Automated TNM staging via zero-shot prompting without model fine-tuning. | Accuracy, Error Rate | Comparison with manual registry error rates | Validates that simple prompting outperforms manual entry (>99% accuracy). |
| Kim et al. (2025) [29] | Colorectal Cancer (Imaging Reports) | GPT-4 | Clinical Staging Extraction: Extracting TNM stages and lesion locations from unstructured reports using specific prompts. | Accuracy | Chi-square test | Demonstrates high accuracy in English-only contexts; notably, LLMs outperformed human data managers in extracting precise lesion locations. |
| Papale et al. (2025) [30] | Pancreatic Cysts (Radiology Reports) | GPT-4 | Comparing “Open-Prompt” vs. “Entity-Extraction” for risk classification. | F1-score, Precision, Recall | 95% Confidence Intervals (CI) | Validates the superiority of “Entity-Extraction” prompting over generic prompts in identifying high-risk cysts. |
| Yao et al. (2025) [31] | Esophageal Cancer (Radiology Reports) | INF-72B, Qwen2.5-72B, LLaMA3.1 | Preoperative Staging: Comparing prompting strategies (Zero-shot, CoT, Interpretable Reasoning) vs. clinicians. | Accuracy, F1-score | McNemar test, Pearson chi-square | Validates that “Interpretable Reasoning” prompting significantly enhances LLM performance, matching or surpassing clinicians in staging accuracy. |
| Luo et al. (2025) [32] | Prostate Cancer (Radiotherapy) | DeepSeek-R1, GPT-4o | Bilingual QA for patient education and clinical consultation | Physician Satisfaction (Likert Scale) | Wilcoxon signed-rank test, Student’s t-test, Mann–Whitney U-test | Validates DeepSeek’s superior performance in Chinese contexts but focuses on simple QA retrieval rather than complex staging logic. |
| Our Study | Esophageal Cancer (EUS Reports) | DeepSeek-R1, GPT-4o, Qwen3, Grok-3 | Zero-shot T/N staging (focus on intrinsic reasoning robustness) | Accuracy, Quadratic Weighted Kappa (QWK) | Cochran's Q test, McNemar's test, Odds Ratio (OR) | Shows that DeepSeek-R1's intrinsic reasoning confers robustness in unprompted and cross-lingual staging. |

Baseline characteristics of the EUS reports.

| Characteristic | Value |
|---|---|
| Age (years), mean ± SD | 61.6 ± 7.8 |
| Sex, n (%) | |
| Male | 502 (80.3) |
| Female | 123 (19.7) |
| Endoscopic T/N staging | |
| T stage (n = 625), n (%) | |
| T1a | 80 (12.8) |
| T1b | 69 (11.0) |
| T2 | 91 (14.6) |
| T3 | 285 (45.6) |
| T4a | 39 (6.2) |
| T4b | 61 (9.8) |
| N stage (n = 579), n (%) | |
| N0 | 122 (21.1) |
| N1 | 147 (25.4) |
| N2 | 201 (34.7) |
| N3 | 109 (18.8) |

Endoscopic T and N staging definitions used as the reference standard (UICC TNM classification).

| Category | Definition |
|---|---|
| T: Primary Tumor | |
| T1a | Tumor invades the lamina propria or muscularis mucosae |
| T1b | Tumor invades the submucosa |
| T2 | Tumor invades the muscularis propria |
| T3 | Tumor invades the adventitia |
| T4a | Tumor invades the pleura, pericardium, azygos vein, diaphragm, or peritoneum |
| T4b | Tumor invades other adjacent structures, such as the aorta, vertebral body, or trachea |
| N: Regional Lymph Nodes | |
| N0 | No metastasis in the regional lymph nodes |
| N1 | Metastasis in 1–2 regional lymph nodes |
| N2 | Metastasis in 3–6 regional lymph nodes |
| N3 | Metastasis in 7 or more regional lymph nodes |
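
The N criteria above are purely numeric thresholds, and the T criteria reduce to the deepest invaded layer plus adjacent-structure involvement, so the reference mapping is deterministic once the findings have been extracted. A minimal Python sketch of that mapping follows; the function names and flags are illustrative, not taken from the study's pipeline:

```python
# Illustrative mapping from extracted EUS findings to UICC T/N categories.
# Assumes upstream extraction has already produced a normalized invasion depth
# and a regional lymph-node count; all names here are hypothetical.

T_BY_DEPTH = {
    "lamina propria": "T1a",
    "muscularis mucosae": "T1a",
    "submucosa": "T1b",
    "muscularis propria": "T2",
    "adventitia": "T3",
}

def t_stage(deepest_layer: str,
            invades_resectable_adjacent: bool = False,
            invades_unresectable_adjacent: bool = False) -> str:
    """Map the deepest invaded layer (plus adjacent-structure flags) to a T category."""
    if invades_unresectable_adjacent:   # e.g., aorta, vertebral body, trachea
        return "T4b"
    if invades_resectable_adjacent:     # e.g., pleura, pericardium, diaphragm
        return "T4a"
    return T_BY_DEPTH[deepest_layer.lower()]

def n_stage(positive_nodes: int) -> str:
    """Map a regional lymph-node count to an N category."""
    if positive_nodes == 0:
        return "N0"
    if positive_nodes <= 2:
        return "N1"
    if positive_nodes <= 6:
        return "N2"
    return "N3"

# The thresholds reproduce the table above.
assert n_stage(0) == "N0" and n_stage(2) == "N1" and n_stage(5) == "N2" and n_stage(7) == "N3"
```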

Error taxonomy and handling rules applied during model querying.

| Error Category | Specific Error Type | Description | Handling Method |
|---|---|---|---|
| Technical Errors | Network busy or call failure | No response was obtained from the model because of a network timeout, server issue, or configuration error. | Resubmit the original report and prompt, with a maximum of 5 retries; a persistent failure is recorded as a technical fault. |
| Model Output Errors | A. Generation Failure | | |
| | Blank result | The model failed to generate any valid or relevant text content. | Resubmit the original report and prompt until a non-blank, properly formatted response is obtained. |
| | Refusal or irrelevant dialogue | The model refused to process the request or generated dialogue irrelevant to the task. | Resubmit the original report and prompt until a task-relevant, properly formatted response is obtained. |
| | B. Format/Compliance Errors | | |
| | Unparsable text | The response did not adhere to the predefined "Stage/Reason" output format. | Resubmit the original report and prompt until a programmatically parsable response is obtained. |
| | Non-compliant staging result | The response gave a stage outside the predefined categories (e.g., T2a, N4) or an ambiguous stage (e.g., Tx). | Resubmit the original report and prompt until a compliant staging category is returned. |
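
These handling rules amount to a retry-and-validate loop: technical failures are retried at most five times, while non-compliant outputs are resubmitted until a parsable, in-category stage is returned. A minimal sketch under those rules, assuming a hypothetical `call_model` wrapper around the API and a simple "Stage: .../Reason: ..." reply template (the study's exact output format is not reproduced here):

```python
import re

VALID_STAGES = {
    "T": {"T1a", "T1b", "T2", "T3", "T4a", "T4b"},
    "N": {"N0", "N1", "N2", "N3"},
}
# Assumed reply template "Stage: <category> / Reason: <text>".
STAGE_RE = re.compile(r"Stage:\s*([A-Za-z0-9]+)", re.IGNORECASE)
MAX_TECH_RETRIES = 5  # per the technical-error handling rule

def stage_report(report: str, prompt: str, call_model, task: str = "T") -> str:
    """Resubmit until a parsable, compliant stage is returned.

    call_model is a hypothetical API wrapper: (prompt, report) -> reply text.
    """
    tech_failures = 0
    while True:
        try:
            reply = call_model(prompt, report)
        except (TimeoutError, ConnectionError):        # network busy / call failure
            tech_failures += 1
            if tech_failures >= MAX_TECH_RETRIES:
                raise RuntimeError("recorded as a technical fault after 5 retries")
            continue
        if not reply or not reply.strip():             # blank result
            continue
        match = STAGE_RE.search(reply)
        if match is None:                              # refusal / irrelevant / unparsable
            continue
        stage = match.group(1)
        stage = stage[0].upper() + stage[1:].lower()   # normalize, e.g. "t1A" -> "T1a"
        if stage not in VALID_STAGES[task]:            # non-compliant (e.g., T2a, N4, Tx)
            continue
        return stage
```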
Accuracy by T and N subgroup for each model.

| Staging Category | Subgroup | DeepSeek-R1 | GPT-4o | Qwen3 | Grok-3 |
|---|---|---|---|---|---|
| T-Staging | Overall Accuracy | 91.4% | 84.2% | 88.8% | 81.3% |
| | T1a (n = 80) | 90.3% | 96.6% | 85.3% | 97.5% |
| | T1b (n = 69) | 86.2% | 71.0% | 88.0% | 83.7% |
| | T2 (n = 91) | 92.6% | 66.5% | 94.2% | 84.1% |
| | T3 (n = 285) | 96.8% | 96.6% | 94.2% | 91.8% |
| | T4a (n = 39) | 76.3% | 48.7% | 70.5% | 39.7% |
| | T4b (n = 61) | 81.1% | 73.8% | 72.1% | 47.5% |
| N-Staging | Overall Accuracy | 84.2% | 65.0% | 68.4% | 51.9% |
| | N0 (n = 122) | 88.1% | 64.3% | 52.0% | 73.6% |
| | N1 (n = 147) | 75.5% | 76.7% | 54.1% | 91.0% |
| | N2 (n = 201) | 84.3% | 57.7% | 88.6% | 30.6% |
| | N3 (n = 109) | 88.5% | 69.5% | 67.7% | 13.8% |
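
Because T and N categories are ordinal, accuracy alone hides whether errors land on adjacent or distant stages; the study's QWK metric weights disagreements by their squared ordinal distance. A minimal sketch using scikit-learn (the integer encoding shown is an assumption, but any order-preserving encoding yields the same QWK):

```python
from sklearn.metrics import cohen_kappa_score

# Ordinal encoding of T categories: quadratic weighting then penalizes
# distant misclassifications (T1a vs. T4b) far more than adjacent ones.
T_ORDER = {"T1a": 0, "T1b": 1, "T2": 2, "T3": 3, "T4a": 4, "T4b": 5}

def qwk(reference: list[str], predicted: list[str]) -> float:
    """Quadratic Weighted Kappa between reference and model-predicted T stages."""
    y_true = [T_ORDER[t] for t in reference]
    y_pred = [T_ORDER[t] for t in predicted]
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")

# Toy example: an adjacent miss (T1b for T1a) is penalized less than a distant one.
near = qwk(["T1a", "T2", "T3", "T4b"], ["T1b", "T2", "T3", "T4b"])
far = qwk(["T1a", "T2", "T3", "T4b"], ["T4b", "T2", "T3", "T4b"])
assert near > far
```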
Overall staging accuracy by scenario, with Cochran's Q tests across the four models (* p < 0.05, *** p < 0.001).

| Staging Task | Scenario | DeepSeek-R1 | GPT-4o | Qwen3 | Grok-3 | All Models Avg. | Cochran's Q | p-Value |
|---|---|---|---|---|---|---|---|---|
| T-Staging (n = 625) | Chinese Without-Prompt | 93.6% (585/625) | 76.6% (479/625) | 84.3% (527/625) | 64.2% (401/625) | 79.7% (498/625) | 226.77 | <0.001 *** |
| | Chinese With-Prompt | 93.6% (585/625) | 93.4% (584/625) | 91.5% (572/625) | 92.8% (580/625) | 92.8% (580/625) | 4.97 | 0.17 |
| | English Without-Prompt | 88.3% (552/625) | 77.1% (482/625) | 86.9% (543/625) | 77.9% (487/625) | 82.6% (516/625) | 68.49 | <0.001 *** |
| | English With-Prompt | 90.1% (563/625) | 89.6% (560/625) | 92.8% (579/625) | 90.1% (563/625) | 90.7% (567/625) | 10.09 | 0.018 * |
| N-Staging (n = 579) | Chinese Without-Prompt | 87.2% (505/579) | 65.4% (379/579) | 80.5% (466/579) | 47.7% (276/579) | 70.2% (406/579) | 271.23 | <0.001 *** |
| | Chinese With-Prompt | 85.8% (497/579) | 69.9% (405/579) | 78.1% (452/579) | 56.5% (327/579) | 72.6% (420/579) | 178.56 | <0.001 *** |
| | English Without-Prompt | 79.8% (462/579) | 60.8% (352/579) | 36.8% (213/579) | 43.7% (253/579) | 55.3% (320/579) | 239.01 | <0.001 *** |
| | English With-Prompt | 83.9% (486/579) | 63.7% (369/579) | 78.2% (453/579) | 59.6% (345/579) | 71.4% (413/579) | 143.37 | <0.001 *** |
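
Cochran's Q tests whether the four models' per-report correctness rates differ within a scenario, using the same reports across all models. A minimal sketch with statsmodels, assuming a binary report-by-model correctness matrix (random placeholder data, not the study's):

```python
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q

# Rows = reports, columns = models; 1 if the predicted stage matched the
# endoscopic reference stage, 0 otherwise. Placeholder data only.
rng = np.random.default_rng(42)
correct = rng.integers(0, 2, size=(625, 4))  # 625 T-staging reports, 4 models

res = cochrans_q(correct)                    # H0: all models are equally accurate
print(f"Cochran's Q = {res.statistic:.2f}, p = {res.pvalue:.4g}")
```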
Pairwise McNemar comparisons versus DeepSeek-R1: odds ratio (95% CI) and adjusted p-value (* p < 0.05, *** p < 0.001).

| Staging Category | Comparison Pair (vs. DeepSeek-R1) | Chinese Without-Prompt | Chinese With-Prompt | English Without-Prompt | English With-Prompt |
|---|---|---|---|---|---|
| T-Staging | vs. GPT-4o | 7.84 (4.62–13.30), p < 0.001 *** | 1.05 (0.57–1.95), p > 0.99 | 3.33 (2.22–4.99), p < 0.001 *** | 1.08 (0.62–1.87), p > 0.99 |
| | vs. Qwen3 | 5.00 (2.85–8.79), p < 0.001 *** | 1.96 (1.02–3.78), p = 0.16 | 1.31 (0.81–2.11), p = 0.985 | 0.55 (0.32–0.95), p = 0.121 |
| | vs. Grok-3 | 6.47 (4.30–9.74), p < 0.001 *** | 1.00 (0.53–1.87), p > 0.99 | 4.02 (2.51–6.45), p < 0.001 *** | 1.00 (0.58–1.73), p > 0.99 |
| N-Staging | vs. GPT-4o | 4.64 (3.20–6.74), p < 0.001 *** | 3.12 (2.21–4.38), p < 0.001 *** | 2.35 (1.74–3.18), p < 0.001 *** | 3.18 (2.32–4.36), p < 0.001 *** |
| | vs. Qwen3 | 2.43 (1.52–3.89), p < 0.001 *** | 2.80 (1.78–4.41), p < 0.001 *** | 5.61 (4.20–7.48), p < 0.001 *** | 1.76 (1.17–2.66), p = 0.024 * |
| | vs. Grok-3 | 5.86 (4.29–8.00), p < 0.001 *** | 5.79 (4.05–8.27), p < 0.001 *** | 5.70 (4.14–7.83), p < 0.001 *** | 4.06 (2.91–5.65), p < 0.001 *** |
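
Each pairwise comparison conditions on the reports where the two models disagree, and the odds ratio is the ratio of the two discordant cells. A minimal sketch with statsmodels; the study's exact CI method and multiplicity correction are not reproduced, so the plain Bonferroni factor shown is an assumption:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_to_reference_model(correct_a: np.ndarray, correct_b: np.ndarray,
                               n_comparisons: int = 3):
    """McNemar test of paired accuracies plus the discordant-pair odds ratio.

    correct_a, correct_b: binary per-report correctness vectors for models A and B.
    """
    b = int(np.sum((correct_a == 1) & (correct_b == 0)))   # A right, B wrong
    c = int(np.sum((correct_a == 0) & (correct_b == 1)))   # A wrong, B right
    table = [
        [int(np.sum((correct_a == 1) & (correct_b == 1))), b],
        [c, int(np.sum((correct_a == 0) & (correct_b == 0)))],
    ]
    res = mcnemar(table, exact=False, correction=True)     # chi-square with continuity correction
    odds_ratio = b / c if c else float("inf")              # conditional OR from discordant pairs
    p_adjusted = min(1.0, res.pvalue * n_comparisons)      # assumed Bonferroni adjustment
    return odds_ratio, p_adjusted
```

Clamping the adjusted p-value at 1 is consistent with the "p > 0.99" entries reported in the table above.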
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.