Article

Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential

1 Department of Cardiology, Osaka Medical and Pharmaceutical University, Daigakumachi 2-7, Takatsuki 569-8686, Japan
2 Department of Cardiology, Nippon Life Hospital, Enokojima 2-1-54, Nishi-ku, Osaka 550-0006, Japan
3 Department of Cardiology, Itami City Hospital, Koyaike 1-100, Itami 664-8540, Japan
4 Department of Cardiology, Osaka Medical and Pharmaceutical University Mishima-Minami Hospital, Tamagawashinmachi 8-1, Takatsuki 569-0856, Japan
* Author to whom correspondence should be addressed.
Int. Med. Educ. 2026, 5(1), 9; https://doi.org/10.3390/ime5010009
Submission received: 4 December 2025 / Revised: 26 December 2025 / Accepted: 5 January 2026 / Published: 7 January 2026

Abstract

Background: Recent advances in artificial intelligence (AI) have produced ChatGPT-4o, a multimodal large language model (LLM) capable of processing both text and image inputs. Although ChatGPT has demonstrated usefulness in medical examinations, few studies have evaluated its image analysis performance. Methods: This study compared GPT-4o and GPT-4 using public questions from the 116th–118th Japanese National Medical Licensing Examinations (JNMLE), each consisting of 400 questions. Both models answered in Japanese using simple prompts, including screenshots for image-based questions. Accuracy was analyzed across essential, general, and clinical questions, with statistical comparisons by chi-square tests. Results: GPT-4o consistently outperformed GPT-4, achieving passing scores in all three examinations. In the 118th JNMLE, GPT-4o scored 457 points versus 425 for GPT-4. GPT-4o demonstrated higher accuracy for image-based questions in the 117th and 116th exams, though the difference in the 118th was not significant. For text-based questions, GPT-4o showed superior medical knowledge, clinical reasoning, and ethical response behavior, notably avoiding prohibited options. Conclusion: Overall, GPT-4o exceeded GPT-4 in both text and image domains, suggesting strong potential as a diagnostic aid and educational resource. Its balanced performance across modalities highlights its promise for integration into future medical education and clinical decision support.

1. Introduction

Recent advances in artificial intelligence (AI) have led to the development of sophisticated large language models (LLMs) that can process and generate human-like text. Among them, ChatGPT-4 (GPT-4), from OpenAI (San Francisco, CA, USA), has shown great progress in generating and understanding text, making it a notable milestone [1]. ChatGPT can be accessed from a standard computer with an internet connection. ChatGPT and its underlying model, the generative pre-trained transformer (GPT), were not developed specifically for medical purposes.
GPT-4 focuses primarily on text-based interactions and has limited capacity for image inputs. Although it can accept basic image inputs and perform preliminary analysis, its performance on complex image processing tasks is limited. GPT-4 has reportedly achieved passing scores on medical licensing examinations in non-English-speaking countries such as Japan [2,3,4]. However, previous studies have considered only text-based questions, not problems involving images, figures, tables, or graphs.
In contrast, ChatGPT-4o (GPT-4o) represents a major advancement through its multimodal capabilities, enabling integrated processing of text, image, audio, and video inputs. By combining multiple data modalities, GPT-4o can generate more contextually relevant responses and demonstrate improved performance in image-based reasoning tasks, suggesting potential applications in medical education and clinical support [5].
The Japanese National Medical Licensing Examination (JNMLE) is a rigorous national assessment required for obtaining a medical license in Japan. The JNMLE consists of multiple-choice questions covering basic medicine, clinical medicine, and public health, and is designed to evaluate not only factual knowledge but also clinical reasoning and decision-making skills. Importantly, the JNMLE reflects real-world clinical practice by including questions that require interpretation of medical images, assessment of disease severity, and extraction of clinically meaningful information from graphical data. However, the accuracy rate for these image-related questions was low in previous versions of ChatGPT and other LLMs [6,7].
A unique feature of the JNMLE is the inclusion of so-called “prohibited choices” (kinshi-shi), which represent diagnostic or therapeutic decisions that could seriously endanger a patient’s life. If examinees select more than the permitted number of prohibited options—typically two or three, depending on the examination year—they automatically fail the examination, regardless of their total score. This system is intended to ensure a minimum standard of patient safety and physician competency, emphasizing that clinically dangerous decisions are unacceptable.
Medical students and practicing physicians are generally not experts in artificial intelligence. In educational and clinical settings, LLMs are often used as information retrieval tools similar to search engines, supporting learning, review of past examination questions, and diagnostic reasoning. However, because AI systems may occasionally generate incorrect or misleading information, users must possess sufficient foundational medical knowledge and image interpretation skills to critically evaluate AI-generated outputs. Within this context, evaluating the reliability and safety of multimodal LLMs in examination settings that closely reflect real clinical decision-making is of particular importance.
In this study, we evaluated the performance of GPT-4 and GPT-4o using questions from the 118th, 117th, and 116th JNMLE, with a particular focus on image-based questions and the avoidance of prohibited choices as a surrogate marker of patient safety.

2. Materials and Methods

We downloaded the questions and answers for the 118th, 117th, and 116th JNMLE from the Japanese Ministry of Health, Labour and Welfare's website [8,9,10].
We assumed that the AI would be used by general medical professionals, so we neither trained the models nor supplied complex prompts to raise the rate of correct answers. Instead, we had the AI solve the problems using prompts that were as simple as possible.
For each question, we began by entering an initial prompt in Japanese: “Please solve the Japanese National Medical Licensing Examinations’ questions.” For text-based questions, in the subsequent prompt we copied and pasted the question text to obtain the answer from ChatGPT (Figure 1).
For image-based questions, in the subsequent prompt we copied and pasted the text and also captured the image portion as a screenshot, which we attached to obtain the answer (Figure 2).
Similarly, for questions involving the interpretation of tables, we copied and pasted the text and attached a screenshot of the table; for questions whose options were presented in table format, we likewise attached a screenshot of the table. To avoid hallucinations carried over from earlier context, we created a new chat for each question and let ChatGPT solve the problem. Between August and December 2024, responses to each question were obtained using ChatGPT-4 and ChatGPT-4o.
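All responses in this study were collected through the ChatGPT web interface, but the same one-question-per-fresh-chat workflow can also be expressed programmatically. The following minimal Python sketch is illustrative only and is not part of the study’s method: it shows how a text or image question might be submitted via OpenAI’s chat completions API. The file path, helper name, and exact prompt packaging are assumptions.

```python
# Hedged sketch (not the authors' procedure): submitting one JNMLE question
# per fresh conversation via the OpenAI API. Helper name and paths are
# illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_question(question_text: str, image_path: str | None = None) -> str:
    """Send one exam question in a fresh conversation (no carried-over context)."""
    content = [{"type": "text",
                "text": "Please solve the Japanese National Medical Licensing "
                        "Examinations' questions.\n" + question_text}]
    if image_path:  # image-based question: attach the screenshot, as in the study
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    # A new API call with no prior messages plays the role of a "new chat".
    resp = client.chat.completions.create(
        model="gpt-4o",  # or "gpt-4" for the comparison arm
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```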
The JNMLE consists of 400 questions worth a total of 500 points, of which 100 are essential questions and 300 are non-essential questions. The passing criteria for the 118th JNMLE are as follows:
Essential questions: Each general question is worth 1 point and each clinical practical question 3 points; a score of at least 160 out of 200 points is required. This 80% threshold is an absolute standard: regardless of other examinees’ performance, this score is mandatory for passing.
Non-essential questions: Each question is worth 1 point, and a total score of at least 230 out of 300 points is required. This is a relative standard, meaning the threshold can vary each year depending on the performance of all examinees; for this examination, 76.7% was required.
Prohibited choices: Up to three prohibited choices are allowed. An examinee who selects more than the permitted number automatically fails.
All these criteria must be met to pass the examination.
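To make the interaction of these three criteria concrete, the short Python sketch below encodes the 118th JNMLE thresholds listed above. The function and variable names are our illustrative assumptions, not terminology from the examination authority.

```python
# Minimal sketch of the 118th JNMLE passing criteria described above.
def passes_118th_jnmle(essential_points: int,
                       non_essential_points: int,
                       prohibited_chosen: int) -> bool:
    """Return True only if all three passing criteria are met simultaneously."""
    return (essential_points >= 160          # at least 160/200 (absolute standard, 80%)
            and non_essential_points >= 230  # at least 230/300 (relative standard, 76.7% this year)
            and prohibited_chosen <= 3)      # no more than three prohibited choices

# Example: GPT-4o's reported 118th JNMLE result (essential 190,
# non-essential 267, zero prohibited choices) satisfies every criterion.
print(passes_118th_jnmle(190, 267, 0))  # True
```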
We applied these criteria to examine whether GPT-4 and GPT-4o could pass the 118th JNMLE, as well as their rates of correct answers and trends in incorrect answers [11].
The 117th and 116th JNMLE have passing criteria based on the same three elements: essential questions, non-essential questions, and prohibited choices [12,13].
Although the Japanese Ministry of Health, Labour and Welfare has not released official information on the prohibited choices, Medic Media Company Limited (Tokyo, Japan) has published estimates, which we used in our investigation [14,15,16].
The 117th JNMLE contained two questions for which the images were not made public because they contained photographs of genitalia, and these two questions were excluded from this survey.
The passing criteria for each round are shown in Table 1.
The overall correct answer rate for each round, as well as the correct answer rates for essential questions, general/clinical questions, and text-based and image-based questions, were compared between GPT-4 and GPT-4o. The number of prohibited options selected was also compared. Because the number of image-based questions per round was small (approximately 100), correct answer rates were compared not only for each round but also for the three rounds combined. In addition, we divided the questions into general questions, which test straightforward knowledge, and clinical questions, which take the form of case studies, and compared the overall correct answer rates and those within each section.
The correct answer rates were statistically analyzed using the chi-square test, and a p value of less than 0.05 was considered statistically significant.
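As an illustration of this analysis, the sketch below runs the same comparison on the overall 118th JNMLE counts reported in Table 2 (GPT-4o 363/400 vs. GPT-4 333/400 correct). The use of scipy is an assumption, since the source does not name its statistical software, and the computed p value may differ slightly from the published one depending on whether a continuity correction is applied.

```python
# Hedged sketch: chi-square comparison of correct-answer rates. scipy is an
# assumed tool; counts are from Table 2 (118th JNMLE, all questions). Note
# that scipy applies Yates' continuity correction to 2x2 tables by default,
# so p may differ slightly from the published value.
from scipy.stats import chi2_contingency

def compare_accuracy(correct_a: int, correct_b: int, n: int) -> float:
    """Return the chi-square p value for two correct-answer counts out of n."""
    table = [
        [correct_a, n - correct_a],  # GPT-4o: correct, incorrect
        [correct_b, n - correct_b],  # GPT-4:  correct, incorrect
    ]
    _, p, _, _ = chi2_contingency(table)
    return p

print(f"p = {compare_accuracy(363, 333, 400):.4f}")  # p < 0.05 -> significant
```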
The JNMLE questions and ChatGPT versions used in this study are publicly accessible online; therefore, ethical approval was not required.

3. Results

3.1. Performance in the 118th JNMLE

3.1.1. Overall Results

The 118th JNMLE included 101 image-based questions, consisting of 10 essential image questions and 91 non-essential image questions. GPT-4o passed the examination with a total score of 457 points, including 190 in essential questions (general 49, clinical 141), 267 in non-essential questions, and zero prohibited choices. GPT-4 also passed, scoring 425 points (essential 181, non-essential 244), but selected one prohibited option. Detailed results for the 118th JNMLE are provided in Supplementary Tables S1 and S2.

3.1.2. Accuracy Comparison

GPT-4o demonstrated a significantly higher overall accuracy rate than GPT-4. No significant difference was observed in image-based questions. GPT-4o showed significantly higher accuracy in text-based questions. In the essential section, GPT-4o showed superior accuracy for text-based items, while image-based accuracy was comparable between models. In both general and clinical domains, GPT-4o demonstrated significantly higher overall accuracy and higher text-based accuracy, with no significant difference in image-based accuracy.

3.2. Performance in the 117th JNMLE

3.2.1. Overall Results

The 117th examination included 127 image-based questions (16 essential and 111 non-essential). GPT-4o passed with 446 total points (essential 190, non-essential 256, prohibited choices 0). GPT-4 narrowly passed with 392 points (essential 161, non-essential 231), selecting two prohibited choices. Detailed results for the 117th JNMLE are provided in Supplementary Tables S3 and S4.

3.2.2. Accuracy Comparison

GPT-4o again demonstrated significantly superior overall performance.
GPT-4o achieved significantly higher accuracy in image-based questions, with no difference in text-based accuracy. In essential questions, GPT-4o surpassed GPT-4 overall and in text-based accuracy. In non-essential questions, GPT-4o excelled overall and in image-based accuracy. In general questions, no significant differences were observed. In clinical questions, GPT-4o demonstrated significantly higher accuracy overall and in image-based questions.

3.3. Performance in the 116th JNMLE

3.3.1. Overall Results

The 116th JNMLE included 94 image-based questions (13 essential, 81 non-essential). GPT-4o passed with a total of 462 points: 190 essential points (general 46, clinical 144), 272 non-essential points, and zero prohibited choices. GPT-4 passed with 420 points (essential 173, non-essential 247) but selected two prohibited choices. Detailed results for the 116th JNMLE are provided in Supplementary Tables S5 and S6.

3.3.2. Accuracy Comparison

GPT-4o demonstrated significantly higher overall accuracy compared with GPT-4.
GPT-4o achieved significantly higher image-based accuracy, while text-based accuracy was similar between models. In essential questions, GPT-4o showed significantly higher overall and text-based accuracy. In non-essential questions, GPT-4o outperformed GPT-4 across overall, image-based, and text-based categories. In general questions, GPT-4o showed significantly higher overall and text-based accuracy. In clinical questions, GPT-4o demonstrated superior overall and image-based accuracy, with comparable text-based accuracy. These findings are summarized in Table 2.

3.4. Combined Analysis of All Three Examinations

3.4.1. Integrated Overall Accuracy

When combining all 1200 questions across the three examinations, GPT-4o demonstrated significantly higher overall accuracy than GPT-4. GPT-4o also achieved superior performance in both image-based and text-based questions (Figure 3).

3.4.2. Section-Based Performance

Essential questions: GPT-4o showed significantly higher overall and text-based accuracy; image-based accuracy did not differ significantly. Non-essential questions: GPT-4o significantly outperformed GPT-4 overall, in image-based accuracy, and in text-based accuracy.

3.4.3. Question-Type Performance

General questions: GPT-4o showed significantly higher overall and text-based accuracy; image-based accuracy was similar.
Clinical questions: GPT-4o demonstrated significantly higher accuracy overall and in both image-based and text-based categories.

3.4.4. Prohibited Choices

GPT-4 selected several prohibited choices across the three examinations, whereas GPT-4o selected none. This difference further supports the relative safety of GPT-4o’s clinical decision-making tendencies (Table 3).

4. Discussion

Both GPT-4o and GPT-4 passed each round of the JNMLE; however, GPT-4o consistently outperformed GPT-4 in overall scores and demonstrated superior performance across multiple domains. A comprehensive analysis across three consecutive examinations showed that GPT-4o achieved higher accuracy not only in text-based questions assessing basic medical knowledge but also in clinical questions requiring interpretation of laboratory data and medical images. In contrast to GPT-4, GPT-4o did not select any prohibited choices in any examination, highlighting a potential advantage in patient safety–oriented decision-making. Although GPT-4o demonstrated higher accuracy in essential and general image-based questions, these differences were not always statistically significant, likely due to the limited number of image-based questions included in each examination.
It should be noted that the number of image-based questions in each individual JNMLE was relatively limited, with only 39 essential image-based questions and 31 general image-based questions across all three examinations. To address this limitation and improve the robustness of statistical comparisons, we performed a combined analysis of image-based questions from the 118th, 117th, and 116th JNMLE. This approach allowed for a more stable evaluation of model performance in image-based reasoning while preserving the structure and content characteristics of the original examinations. GPT-4o has previously been reported to improve the accuracy of image-based questions to a level comparable with text-based questions; however, direct comparisons with earlier models and analyses focusing on prohibited choices have been limited [17].
The competencies required to pass the JNMLE extend beyond knowledge recall and include clinical reasoning, interpretation of medical images, and safe decision-making in high-risk clinical scenarios. From this perspective, GPT-4o demonstrated several advantages over GPT-4. GPT-4o showed consistently higher accuracy in text-based questions, suggesting broader and more integrated medical knowledge. In addition, its superior performance in clinical questions indicates an enhanced ability to analyze complex case scenarios and propose appropriate diagnostic and therapeutic approaches. Although differences in image-based accuracy between models were sometimes modest, the overall performance of GPT-4o suggests that it possesses at least the minimum image interpretation capability required for real-world clinical decision support.
Previous studies have reported that attaching images to GPT-4 did not necessarily improve accuracy in medical licensing examination questions, suggesting that image interpretation may not be essential for achieving passing scores in certain contexts [6]. However, the JNMLE intentionally incorporates image-based questions to assess physician competency in situations that reflect real clinical practice. The relatively lower accuracy observed in some image-based general and essential questions may be attributable to the use of basic textbook figures or conceptual graphs, which limit opportunities for advanced reasoning or statistical inference. Furthermore, as reported by Liu et al., large language model performance may be influenced by the availability of domain-specific academic publications used during training, potentially resulting in lower performance in areas with limited published data [7].
A particularly important finding of this study is the difference in prohibited choice selection between models. We analyzed four prohibited answers selected by GPT-4 and found that, in three cases, the model correctly identified the anatomical region and imaging modality but failed to recognize critical pathological findings. As a result, image information was not appropriately incorporated into clinical decision-making, leading to choices that could seriously endanger patients. In contrast, GPT-4o not only identified the relevant anatomy and modality but also recognized abnormal findings and incorporated them into its reasoning process, thereby avoiding prohibited choices. In another case, GPT-4 selected a treatment option that could endanger the patient based primarily on extensive textual information, whereas GPT-4o selected the correct option. These findings suggest that GPT-4o demonstrates improved risk-aware reasoning when integrating textual and visual information.
Our findings have important implications for both medical education and clinical practice. For medical students, GPT-4o may function as a reliable partner when preparing for the JNMLE, particularly when solving past examination questions involving medical images or graphical data. The model’s consistent avoidance of prohibited choices suggests that it may support the development of patient safety–oriented clinical reasoning, provided that students maintain sufficient foundational medical knowledge and image interpretation skills to critically evaluate AI-generated responses.
For general physicians and medical students without specialized expertise in artificial intelligence, GPT-4o may also serve as a supportive tool in routine clinical practice. By inputting patient history and attaching relevant medical images, physicians may use the model as one component of diagnostic and therapeutic decision support. Nevertheless, AI-generated outputs should not replace clinical judgment, and physicians must retain full responsibility for final medical decisions.
Importantly, the analysis of prohibited choices highlights a dimension of AI evaluation that extends beyond conventional accuracy metrics. The ability to avoid clinically dangerous decisions represents a critical requirement for any AI system intended to support medical education or clinical practice. At the same time, both GPT-4 and GPT-4o may generate confident and coherent explanations even when incorrect, which underscores the importance of human oversight. Accordingly, large language models should be regarded as supportive tools that complement, rather than substitute for, professional medical expertise and clinical responsibility.

5. Limitations

This study has several limitations.
First, the analysis was conducted entirely in Japanese and was based on the Japanese National Medical Licensing Examination (JNMLE). Therefore, the findings may not be directly generalizable to other languages, healthcare systems, or licensing examinations with different structures.
Second, the evaluation relied exclusively on publicly available past examination questions. While these questions are designed to reflect real-world clinical reasoning, they cannot fully replicate the complexity, uncertainty, and contextual richness of actual clinical practice.
Third, we intentionally used simple and standardized prompts without optimization or fine-tuning to reflect how medical students and general physicians without AI expertise are likely to interact with large language models in real-world settings. As a result, the reported performance may represent a conservative estimate rather than the maximum achievable capability of each model.
Fourth, this study compared only two models, GPT-4 and GPT-4o. Other large language models with multimodal capabilities were not evaluated, and future studies should include broader model comparisons.
Finally, outcome measures such as accuracy and avoidance of prohibited choices serve as surrogate markers of clinical reasoning and patient safety. They do not directly translate into real-world diagnostic accuracy, treatment outcomes, or clinical effectiveness.

6. Conclusions

Overall, GPT-4o demonstrated superior performance compared with GPT-4 in both text-based and image-based questions on the JNMLE. These findings suggest that GPT-4o has the potential to function as a supportive educational and clinical reasoning tool, provided that users possess sufficient medical knowledge to critically evaluate its outputs. Importantly, GPT-4o should be regarded as a complementary resource rather than a substitute for professional medical judgment.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ime5010009/s1. Table S1: 118th JNMLE GPT-4o; Table S2: 118th JNMLE GPT-4; Table S3: 117th JNMLE GPT-4o; Table S4: 117th JNMLE GPT-4; Table S5: 116th JNMLE GPT-4o; Table S6: 116th JNMLE GPT-4.

Author Contributions

Conceptualization, M.M. and G.F.; methodology, M.M.; software, M.M.; validation, M.M.; formal analysis, M.M.; investigation, M.M., H.A., K.T., Y.K., H.M. and M.H.; resources, M.M.; data curation, M.M.; writing—original draft preparation, M.M.; writing—review and editing, M.M., H.M. and M.H.; visualization, M.M.; supervision, M.H. and H.M.; project administration, M.H. and H.M.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The questions and answers for the Japanese National Medical Licensing Examination used in this study can be viewed and downloaded without restriction from the Ministry of Health, Labour and Welfare’s website (Japanese only):
https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp240424-01.html (accessed on 21 October 2025);
https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp230502-01.html (accessed on 21 October 2025);
https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp220421-01.html (accessed on 21 October 2025);
https://www.mhlw.go.jp/general/sikaku/successlist/2024/siken01/about.html (accessed on 21 October 2025);
https://www.mhlw.go.jp/general/sikaku/successlist/2023/siken01/about.html (accessed on 21 October 2025);
https://www.mhlw.go.jp/general/sikaku/successlist/2022/siken01/about.html (accessed on 21 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviation is used in this manuscript:
JNMLE: Japanese National Medical Licensing Examination

References

  1. Introducing ChatGPT. OpenAI. Available online: https://openai.com/blog/chatgpt (accessed on 21 October 2025).
  2. Tanaka, Y.; Nakata, T.; Aiga, K.; Etani, T.; Muramatsu, R.; Katagiri, S.; Kawai, H.; Higashino, F.; Enomoto, M.; Noda, M.; et al. Performance of Generative Pretrained Transformer on the National Medical Licensing Examination in Japan. PLoS Digit. Health 2024, 3, e0000433.
  3. Takagi, S.; Watari, T.; Erabi, A.; Sakaguchi, K. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study. JMIR Med. Educ. 2023, 9, e48002.
  4. Yanagita, Y.; Yokokawa, D.; Uchida, S.; Tawara, J.; Ikusaka, M. Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study. JMIR Form. Res. 2023, 7, e48023.
  5. Murad, I.A.; Khaleel, M.I.; Shakor, M.Y. Unveiling GPT-4o: Enhanced Multimodal Capabilities and Comparative Insights with ChatGPT-4. Int. J. Electron. Commun. Syst. 2024, 4, 127–136.
  6. Nakao, T.; Miki, S.; Nakamura, Y.; Kikuchi, T.; Nomura, Y.; Hanaoka, S.; Yoshikawa, T.; Abe, O. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination. JMIR Med. Educ. 2024, 10, e54393.
  7. Liu, M.; Okuhara, T.; Dai, Z.; Huang, W.; Gu, L.; Okada, H.; Furukawa, E.; Kiuchi, T. Evaluating the Effectiveness of Advanced Large Language Models in Medical Knowledge: A Comparative Study Using the Japanese National Medical Examination. Int. J. Med. Inform. 2025, 193, 105673.
  8. Available online: https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp240424-01.html (accessed on 21 October 2025).
  9. Available online: https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp230502-01.html (accessed on 21 October 2025).
  10. Available online: https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp220421-01.html (accessed on 21 October 2025).
  11. Available online: https://www.mhlw.go.jp/general/sikaku/successlist/2024/siken01/about.html (accessed on 21 October 2025).
  12. Available online: https://www.mhlw.go.jp/general/sikaku/successlist/2023/siken01/about.html (accessed on 21 October 2025).
  13. Available online: https://www.mhlw.go.jp/general/sikaku/successlist/2022/siken01/about.html (accessed on 21 October 2025).
  14. Available online: https://informa.medilink-study.com/web-informa/post41529.html/ (accessed on 21 October 2025).
  15. Available online: https://informa.medilink-study.com/web-informa/post39343.html/ (accessed on 21 October 2025).
  16. Available online: https://informa.medilink-study.com/web-informa/post36171.html/ (accessed on 21 October 2025).
  17. Miyazaki, Y.; Hata, M.; Omori, H.; Hirashima, A.; Nakagawa, Y.; Eto, M.; Takahashi, S.; Ikeda, M. Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions. JMIR Med. Educ. 2024, 10, e63129.
Figure 1. Example of a multiple-choice question, in which one selects one answer from five options, in Japanese with ChatGPT response in a single chat box. (A) is an example of actually entering a prompt in Japanese and getting the GPT answer. (B) is an English translation of the contents of (A) by the author.
Figure 2. Example of prompt input and GPT response when there is an attached image. (A) is an example of actually entering prompt in Japanese and getting the GPT answer. (B) is an English translation of the contents of (A) by the author.
Figure 3. Radar chart of all answered questions and correct answer rate for each section for GPT-4 and GPT-4o.
Table 1. The Passing Criteria for Each Round of the Japan National Medical Licensing Examination.

| Criterion | 118th JNMLE | 117th JNMLE | 116th JNMLE |
|---|---|---|---|
| Essential questions (each general question 1 point; each clinical practical question 3 points) | Total score of at least 160/200 points | Total score of at least 160/200 points | Total score of at least 158/197 points; excluded questions: B6, B43, E16 (E16 is counted for candidates who answer it correctly and excluded for candidates who answer it incorrectly) |
| Non-essential general and clinical questions (each question 1 point) | Total score of at least 230/300 points | Total score of at least 220/295 points; excluded questions: C15, C60, D38, D53, F42 | Total score of at least 214/297 points; excluded questions: A34, A71, C36, D64 |
| Prohibited choices (critical questions) | 3 questions or fewer | 2 questions or fewer | 3 questions or fewer |
Table 2. Number of Questions, Number of Correct Answers, and Correct Answer Rate for Each Round and Question Type for ChatGPT-4o and GPT-4.

| Question Section | Question Type | Number of Questions | GPT-4o Correct Answers | GPT-4o Correct Answer Rate | GPT-4 Correct Answers | GPT-4 Correct Answer Rate | p Value |
|---|---|---|---|---|---|---|---|
| 118 All Questions | All Questions | 400 | 363 | 90.8% | 333 | 83.3% | 0.0016 |
| 118 All Questions | Image-based Questions | 101 | 85 | 84.2% | 76 | 75.2% | 0.1153 |
| 118 All Questions | Text-based Questions | 299 | 278 | 93.0% | 257 | 86.0% | 0.0050 |
| 118 Essential Section | All Questions | 100 | 96 | 96.0% | 89 | 89.0% | 0.0602 |
| 118 Essential Section | Image-based Questions | 10 | 8 | 80.0% | 6 | 60.0% | 0.3291 |
| 118 Essential Section | Text-based Questions | 90 | 88 | 97.8% | 83 | 92.2% | 0.0005 |
| 118 Non-essential Sections | All Questions | 300 | 267 | 89.0% | 244 | 81.3% | 0.0084 |
| 118 Non-essential Sections | Image-based Questions | 91 | 77 | 84.6% | 70 | 76.9% | 0.1879 |
| 118 Non-essential Sections | Text-based Questions | 209 | 190 | 90.9% | 174 | 83.3% | 0.0192 |
| 118 General Section | All Questions | 150 | 133 | 88.7% | 123 | 82.0% | 0.1026 |
| 118 General Section | Image-based Questions | 10 | 5 | 50.0% | 7 | 70.0% | 0.3613 |
| 118 General Section | Text-based Questions | 140 | 128 | 91.4% | 116 | 82.9% | 0.0321 |
| 118 Clinical Section | All Questions | 250 | 230 | 92.0% | 210 | 84.0% | 0.0059 |
| 118 Clinical Section | Image-based Questions | 91 | 80 | 87.9% | 69 | 75.8% | 0.0343 |
| 118 Clinical Section | Text-based Questions | 159 | 150 | 94.3% | 141 | 88.7% | 0.0701 |
| 117 All Questions | All Questions | 393 | 350 | 89.1% | 312 | 79.4% | 0.0016 |
| 117 All Questions | Image-based Questions | 127 | 107 | 84.3% | 83 | 65.4% | 0.0005 |
| 117 All Questions | Text-based Questions | 266 | 243 | 91.4% | 229 | 86.1% | 0.0550 |
| 117 Essential Section | All Questions | 100 | 94 | 94.0% | 81 | 81.0% | 0.0054 |
| 117 Essential Section | Image-based Questions | 16 | 13 | 81.3% | 10 | 62.5% | 0.2381 |
| 117 Essential Section | Text-based Questions | 84 | 81 | 96.4% | 71 | 84.5% | 0.0085 |
| 117 Non-essential Sections | All Questions | 293 | 256 | 87.4% | 231 | 78.8% | 0.0058 |
| 117 Non-essential Sections | Image-based Questions | 111 | 94 | 84.7% | 73 | 65.8% | 0.0010 |
| 117 Non-essential Sections | Text-based Questions | 182 | 162 | 89.0% | 158 | 86.8% | 0.5201 |
| 117 General Section | All Questions | 148 | 128 | 86.6% | 119 | 80.4% | 0.0042 |
| 117 General Section | Image-based Questions | 14 | 9 | 64.3% | 7 | 50.0% | 0.4450 |
| 117 General Section | Text-based Questions | 132 | 117 | 88.6% | 110 | 83.3% | 0.2145 |
| 117 Clinical Section | All Questions | 246 | 222 | 90.2% | 193 | 78.5% | 0.0003 |
| 117 Clinical Section | Image-based Questions | 111 | 96 | 86.5% | 74 | 66.7% | 0.0004 |
| 117 Clinical Section | Text-based Questions | 134 | 126 | 94.0% | 119 | 88.8% | 0.1268 |
| 116 All Questions | All Questions | 395 | 366 | 92.7% | 330 | 83.8% | 0.0016 |
| 116 All Questions | Image-based Questions | 94 | 84 | 89.4% | 71 | 75.5% | 0.0126 |
| 116 All Questions | Text-based Questions | 301 | 282 | 93.7% | 259 | 86.3% | 0.0027 |
| 116 Essential Section | All Questions | 98 | 94 | 95.9% | 83 | 85.6% | 0.0130 |
| 116 Essential Section | Image-based Questions | 13 | 12 | 92.3% | 11 | 85.6% | 0.5393 |
| 116 Essential Section | Text-based Questions | 85 | 82 | 96.5% | 72 | 85.7% | 0.0145 |
| 116 Non-essential Sections | All Questions | 297 | 272 | 91.6% | 247 | 83.2% | 0.0020 |
| 116 Non-essential Sections | Image-based Questions | 81 | 72 | 88.9% | 60 | 74.1% | 0.0152 |
| 116 Non-essential Sections | Text-based Questions | 216 | 200 | 92.6% | 187 | 86.6% | 0.0406 |
| 116 General Section | All Questions | 147 | 139 | 94.6% | 123 | 84.2% | 0.0042 |
| 116 General Section | Image-based Questions | 7 | 4 | 57.1% | 3 | 42.9% | 0.5929 |
| 116 General Section | Text-based Questions | 140 | 135 | 96.4% | 120 | 86.3% | 0.0027 |
| 116 Clinical Section | All Questions | 248 | 227 | 91.5% | 207 | 83.5% | 0.0066 |
| 116 Clinical Section | Image-based Questions | 87 | 80 | 92.0% | 68 | 78.2% | 0.0107 |
| 116 Clinical Section | Text-based Questions | 161 | 147 | 91.3% | 139 | 86.3% | 0.1571 |
Table 3. Number of Prohibited Questions, Number of Prohibited Answers, and Prohibited Answer Rate for Each Round for ChatGPT-4o and GPT-4.

| Round | Questions with a Prohibited Option | GPT-4o: Prohibited Options Chosen | GPT-4o: Prohibited Answer Rate | GPT-4: Prohibited Options Chosen | GPT-4: Prohibited Answer Rate |
|---|---|---|---|---|---|
| 118th JNMLE | 9 | 0 | 0.0% | 1 | 11.1% |
| 117th JNMLE | 11 | 0 | 0.0% | 2 | 18.2% |
| 116th JNMLE | 9 | 0 | 0.0% | 2 | 22.2% |

Share and Cite

MDPI and ACS Style

Miyamura, M.; Fujiki, G.; Kanzaki, Y.; Tsuda, K.; Asano, H.; Morita, H.; Hoshiga, M. Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential. Int. Med. Educ. 2026, 5, 9. https://doi.org/10.3390/ime5010009

AMA Style

Miyamura M, Fujiki G, Kanzaki Y, Tsuda K, Asano H, Morita H, Hoshiga M. Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential. International Medical Education. 2026; 5(1):9. https://doi.org/10.3390/ime5010009

Chicago/Turabian Style

Miyamura, Masatoshi, Goro Fujiki, Yumiko Kanzaki, Kosuke Tsuda, Hironaka Asano, Hideaki Morita, and Masaaki Hoshiga. 2026. "Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential" International Medical Education 5, no. 1: 9. https://doi.org/10.3390/ime5010009

APA Style

Miyamura, M., Fujiki, G., Kanzaki, Y., Tsuda, K., Asano, H., Morita, H., & Hoshiga, M. (2026). Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential. International Medical Education, 5(1), 9. https://doi.org/10.3390/ime5010009
