Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential
Abstract
1. Introduction
2. Materials and Methods
3. Results
3.1. Performance in the 118th JNMLE
3.1.1. Overall Results
3.1.2. Accuracy Comparison
3.2. Performance in the 117th JNMLE
3.2.1. Overall Results
3.2.2. Accuracy Comparison
3.3. Performance in the 116th JNMLE
3.3.1. Overall Results
3.3.2. Accuracy Comparison
3.4. Combined Analysis of All Three Examinations
3.4.1. Integrated Overall Accuracy
3.4.2. Section-Based Performance
3.4.3. Question-Type Performance
3.4.4. Prohibited Choices
4. Discussion
5. Limitation
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| JNMLE | Japan National Medical Licensing Examination |



| | 118th JNMLE Passing Criteria | 117th JNMLE Passing Criteria | 116th JNMLE Passing Criteria |
|---|---|---|---|
| Essential Questions | Each general question is worth 1 point and each clinical practical question is worth 3 points; a total score of 160 points or more out of 200 points is required. | Each general question is worth 1 point and each clinical practical question is worth 3 points; a total score of 160 points or more out of 200 points is required. | Each general question is worth 1 point and each clinical practical question is worth 3 points; a total score of 158 points or more out of 197 points is required. Excluded questions: B6, B43, E16. E16: candidates who answer correctly will be included in the marks, and candidates who answer incorrectly will not be included in the marks. |
| Non-essential General and Clinical Questions | Each question is worth 1 point; a total score of 230 points or more out of 300 points is required. | Each question is worth 1 point; a total score of 220 points or more out of 295 points is required. Excluded questions: C15, C60, D38, D53, F42. | Each question is worth 1 point; a total score of 214 points or more out of 297 points is required. Excluded questions: A34, A71, C36, D64. |
| Prohibited Choices (Critical Questions) | 3 questions or fewer | 2 questions or fewer | 3 questions or fewer |

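
As a minimal sketch of how the passing criteria in the table above could be applied, the Python below encodes the three thresholds for each examination and checks a set of totals against them. The names `PassingCriteria`, `CRITERIA`, `essential_score`, and `passes` are illustrative assumptions and do not come from the study; only the numeric thresholds are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassingCriteria:
    essential_min: int       # minimum points required in the essential section
    essential_max: int       # maximum points after excluded questions
    non_essential_min: int   # minimum points required in the non-essential sections
    non_essential_max: int   # maximum points after excluded questions
    prohibited_max: int      # largest allowed number of prohibited choices


# Thresholds copied from the passing-criteria table, keyed by examination number.
CRITERIA = {
    118: PassingCriteria(160, 200, 230, 300, 3),
    117: PassingCriteria(160, 200, 220, 295, 2),
    116: PassingCriteria(158, 197, 214, 297, 3),
}


def essential_score(general_correct: int, clinical_correct: int) -> int:
    """Essential section: 1 point per general question, 3 per clinical question."""
    return general_correct * 1 + clinical_correct * 3


def passes(exam: int, essential_points: int, non_essential_points: int,
           prohibited_chosen: int) -> bool:
    """Return True only if all three criteria for the given exam are met."""
    c = CRITERIA[exam]
    return (
        essential_points >= c.essential_min
        and non_essential_points >= c.non_essential_min
        and prohibited_chosen <= c.prohibited_max
    )
```

For example, `passes(118, 160, 230, 3)` returns `True`, whereas the same point totals with four prohibited choices would fail, because all three conditions must hold at once.
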
| Question Section | Question Type | Number of Questions | GPT-4o Number of Correct Answers | GPT-4o Correct Answer Rate | GPT-4 Number of Correct Answers | GPT-4 Correct Answer Rate | p Value |
|---|---|---|---|---|---|---|---|
| 118 All questions | All Questions | 400 | 363 | 90.8% | 333 | 83.3% | 0.0016 |
| 118 All questions | Image-based Questions | 101 | 85 | 84.2% | 76 | 75.2% | 0.1153 |
| 118 All questions | Text-based Questions | 299 | 278 | 93.0% | 257 | 86.0% | 0.0050 |
| 118 Essential Section | All Questions | 100 | 96 | 96.0% | 89 | 89.0% | 0.0602 |
| 118 Essential Section | Image-based Questions | 10 | 8 | 80.0% | 6 | 60.0% | 0.3291 |
| 118 Essential Section | Text-based Questions | 90 | 88 | 97.8% | 83 | 92.2% | 0.0005 |
| 118 Non-essential Sections | All Questions | 300 | 267 | 89.0% | 244 | 81.3% | 0.0084 |
| 118 Non-essential Sections | Image-based Questions | 91 | 77 | 84.6% | 70 | 76.9% | 0.1879 |
| 118 Non-essential Sections | Text-based Questions | 209 | 190 | 90.9% | 174 | 83.3% | 0.0192 |
| 118 General Section | All Questions | 150 | 133 | 88.7% | 123 | 82.0% | 0.1026 |
| 118 General Section | Image-based Questions | 10 | 5 | 50.0% | 7 | 70.0% | 0.3613 |
| 118 General Section | Text-based Questions | 140 | 128 | 91.4% | 116 | 82.9% | 0.0321 |
| 118 Clinical Section | All Questions | 250 | 230 | 92.0% | 210 | 84.0% | 0.0059 |
| 118 Clinical Section | Image-based Questions | 91 | 80 | 87.9% | 69 | 75.8% | 0.0343 |
| 118 Clinical Section | Text-based Questions | 159 | 150 | 94.3% | 141 | 88.7% | 0.0701 |
| 117 All questions | All Questions | 393 | 350 | 89.1% | 312 | 79.4% | 0.0016 |
| 117 All questions | Image-based Questions | 127 | 107 | 84.3% | 83 | 65.4% | 0.0005 |
| 117 All questions | Text-based Questions | 266 | 243 | 91.4% | 229 | 86.1% | 0.0550 |
| 117 Essential Section | All Questions | 100 | 94 | 94.0% | 81 | 81.0% | 0.0054 |
| 117 Essential Section | Image-based Questions | 16 | 13 | 81.3% | 10 | 62.5% | 0.2381 |
| 117 Essential Section | Text-based Questions | 84 | 81 | 96.4% | 71 | 84.5% | 0.0085 |
| 117 Non-essential Sections | All Questions | 293 | 256 | 87.4% | 231 | 78.8% | 0.0058 |
| 117 Non-essential Sections | Image-based Questions | 111 | 94 | 84.7% | 73 | 65.8% | 0.0010 |
| 117 Non-essential Sections | Text-based Questions | 182 | 162 | 89.0% | 158 | 86.8% | 0.5201 |
| 117 General Section | All Questions | 148 | 128 | 86.6% | 119 | 80.4% | 0.0042 |
| 117 General Section | Image-based Questions | 14 | 9 | 64.3% | 7 | 50.0% | 0.4450 |
| 117 General Section | Text-based Questions | 132 | 117 | 88.6% | 110 | 83.3% | 0.2145 |
| 117 Clinical Section | All Questions | 246 | 222 | 90.2% | 193 | 78.5% | 0.0003 |
| 117 Clinical Section | Image-based Questions | 111 | 96 | 86.5% | 74 | 66.7% | 0.0004 |
| 117 Clinical Section | Text-based Questions | 134 | 126 | 94.0% | 119 | 88.8% | 0.1268 |
| 116 All questions | All Questions | 395 | 366 | 92.7% | 330 | 83.8% | 0.0016 |
| 116 All questions | Image-based Questions | 94 | 84 | 89.4% | 71 | 75.5% | 0.0126 |
| 116 All questions | Text-based Questions | 301 | 282 | 93.7% | 259 | 86.3% | 0.0027 |
| 116 Essential Section | All Questions | 98 | 94 | 95.9% | 83 | 85.6% | 0.0130 |
| 116 Essential Section | Image-based Questions | 13 | 12 | 92.3% | 11 | 85.6% | 0.5393 |
| 116 Essential Section | Text-based Questions | 85 | 82 | 96.5% | 72 | 85.7% | 0.0145 |
| 116 Non-essential Sections | All Questions | 297 | 272 | 91.6% | 247 | 83.2% | 0.0020 |
| 116 Non-essential Sections | Image-based Questions | 81 | 72 | 88.9% | 60 | 74.1% | 0.0152 |
| 116 Non-essential Sections | Text-based Questions | 216 | 200 | 92.6% | 187 | 86.6% | 0.0406 |
| 116 General Section | All Questions | 147 | 139 | 94.6% | 123 | 84.2% | 0.0042 |
| 116 General Section | Image-based Questions | 7 | 4 | 57.1% | 3 | 42.9% | 0.5929 |
| 116 General Section | Text-based Questions | 140 | 135 | 96.4% | 120 | 86.3% | 0.0027 |
| 116 Clinical Section | All Questions | 248 | 227 | 91.5% | 207 | 83.5% | 0.0066 |
| 116 Clinical Section | Image-based Questions | 87 | 80 | 92.0% | 68 | 78.2% | 0.0107 |
| 116 Clinical Section | Text-based Questions | 161 | 147 | 91.3% | 139 | 86.3% | 0.1571 |
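
The table above reports correct-answer rates and p values but does not itself name the statistical test. As a hedged illustration, the sketch below recomputes a correct-answer rate and a two-sided p value using a pooled two-proportion z-test (equivalent to an uncorrected chi-square test on a 2×2 table). The choice of test is an assumption made for this example, and the function names are hypothetical.

```python
import math


def accuracy_percent(correct: int, total: int) -> float:
    """Correct-answer rate as a percentage (unrounded)."""
    return 100.0 * correct / total


def two_proportion_p_value(correct_a: int, n_a: int,
                           correct_b: int, n_b: int) -> float:
    """Two-sided p value from a pooled two-proportion z-test.

    Assumed test for illustration only; the source table does not state
    which test produced its p values.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both groups are all-correct or all-wrong: no detectable difference.
        return 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))


# 118th JNMLE, all questions: GPT-4o 363/400 vs. GPT-4 333/400.
p = two_proportion_p_value(363, 400, 333, 400)
print(f"GPT-4o {accuracy_percent(363, 400):.2f}%, "
      f"GPT-4 {accuracy_percent(333, 400):.2f}%, p = {p:.4f}")
```

With the 118th examination totals, this gives a p value of roughly 0.0016, which is consistent with the tabulated value, although that agreement alone does not confirm the authors' method. Because both models answered the same questions, a paired test such as McNemar's would require per-question outcomes, which the table does not provide.
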

| | Number of Questions with Prohibited Choices | GPT-4o Number of Prohibited Choices Selected | GPT-4o Prohibited Choice Rate | GPT-4 Number of Prohibited Choices Selected | GPT-4 Prohibited Choice Rate |
|---|---|---|---|---|---|
| 118th JNMLE | 9 | 0 | 0.0% | 1 | 11.1% |
| 117th JNMLE | 11 | 0 | 0.0% | 2 | 18.2% |
| 116th JNMLE | 9 | 0 | 0.0% | 2 | 22.2% |
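
The prohibited-choice rate in the table above is the number of prohibited choices a model selected divided by the number of questions that contain a prohibited choice. The short Python sketch below recomputes those rates from the tabulated counts; the function name `prohibited_rate_percent` and the `COUNTS` dictionary are hypothetical and introduced only for illustration.

```python
def prohibited_rate_percent(chosen: int, total: int) -> float:
    """Share of prohibited-choice questions on which the prohibited option was selected."""
    if total <= 0:
        raise ValueError("total must be a positive number of questions")
    return 100.0 * chosen / total


# (questions with a prohibited choice, GPT-4o selections, GPT-4 selections),
# copied from the table above.
COUNTS = {
    "118th JNMLE": (9, 0, 1),
    "117th JNMLE": (11, 0, 2),
    "116th JNMLE": (9, 0, 2),
}

for exam, (total, gpt4o, gpt4) in COUNTS.items():
    print(f"{exam}: GPT-4o {prohibited_rate_percent(gpt4o, total):.1f}%, "
          f"GPT-4 {prohibited_rate_percent(gpt4, total):.1f}%")
# 118th JNMLE: GPT-4o 0.0%, GPT-4 11.1%
# 117th JNMLE: GPT-4o 0.0%, GPT-4 18.2%
# 116th JNMLE: GPT-4o 0.0%, GPT-4 22.2%
```
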
© 2026 by the authors. Published by MDPI on behalf of the Academic Society for International Medical Education. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Miyamura, M.; Fujiki, G.; Kanzaki, Y.; Tsuda, K.; Asano, H.; Morita, H.; Hoshiga, M. Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential. Int. Med. Educ. 2026, 5, 9. https://doi.org/10.3390/ime5010009

