Article

Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential

1 Department of Cardiology, Osaka Medical and Pharmaceutical University, Daigakumachi 2-7, Takatsuki 569-8686, Japan
2 Department of Cardiology, Nippon Life Hospital, Enokojima 2-1-54, Nishi-ku, Osaka 550-0006, Japan
3 Department of Cardiology, Itami City Hospital, Koyaike 1-100, Itami 664-8540, Japan
4 Department of Cardiology, Osaka Medical and Pharmaceutical University Mishima-Minami Hospital, Tamagawashinmachi 8-1, Takatsuki 569-0856, Japan
* Author to whom correspondence should be addressed.
Int. Med. Educ. 2026, 5(1), 9; https://doi.org/10.3390/ime5010009
Submission received: 4 December 2025 / Revised: 26 December 2025 / Accepted: 5 January 2026 / Published: 7 January 2026

Abstract

Background: Recent advances in artificial intelligence (AI) have produced ChatGPT-4o, a multimodal large language model (LLM) capable of processing both text and image inputs. Although ChatGPT has demonstrated usefulness in medical examinations, few studies have evaluated its image analysis performance. Methods: This study compared GPT-4o and GPT-4 using public questions from the 116th–118th Japanese National Medical Licensing Examinations (JNMLE), each consisting of 400 questions. Both models answered in Japanese using simple prompts, including screenshots for image-based questions. Accuracy was analyzed across essential, general, and clinical questions, with statistical comparisons by chi-square tests. Results: GPT-4o consistently outperformed GPT-4, achieving passing scores in all three examinations. In the 118th JNMLE, GPT-4o scored 457 points versus 425 for GPT-4. GPT-4o demonstrated higher accuracy for image-based questions in the 117th and 116th exams, though the difference in the 118th was not significant. For text-based questions, GPT-4o showed superior medical knowledge, clinical reasoning, and ethical response behavior, notably avoiding prohibited options. Conclusion: Overall, GPT-4o exceeded GPT-4 in both text and image domains, suggesting strong potential as a diagnostic aid and educational resource. Its balanced performance across modalities highlights its promise for integration into future medical education and clinical decision support.

1. Introduction

Recent advances in artificial intelligence (AI) have led to the development of sophisticated large language models (LLMs) that can process and generate human-like text. Among them, ChatGPT-4 (GPT-4), from OpenAI (San Francisco, CA, USA), has shown great progress in generating and understanding text, making it a notable milestone [1]. ChatGPT can be accessed from a standard computer with an internet connection. ChatGPT and its underlying model, the generative pre-trained transformer (GPT), were not developed specifically for medical purposes.
GPT-4 focuses primarily on text-based interactions and has limited capacity for image inputs. Although it can accept basic image inputs and perform preliminary analysis, its performance on complex image processing tasks is limited. GPT-4 has reportedly achieved passing scores on medical licensing examinations in non-English-speaking countries such as Japan [2,3,4]. However, previous studies have considered only text-based questions, not problems involving images, figures, tables, or graphs.
In contrast, ChatGPT-4o (GPT-4o) represents a major advancement through its multimodal capabilities, enabling integrated processing of text, image, audio, and video inputs. By combining multiple data modalities, GPT-4o can generate more contextually relevant responses and demonstrate improved performance in image-based reasoning tasks, suggesting potential applications in medical education and clinical support [5].
The Japanese National Medical Licensing Examination (JNMLE) is a rigorous national assessment required for obtaining a medical license in Japan. The JNMLE consists of multiple-choice questions covering basic medicine, clinical medicine, and public health, and is designed to evaluate not only factual knowledge but also clinical reasoning and decision-making skills. Importantly, the JNMLE reflects real-world clinical practice by including questions that require interpretation of medical images, assessment of disease severity, and extraction of clinically meaningful information from graphical data. However, the accuracy rate for these image-related questions was low in previous versions of ChatGPT and other LLMs [6,7].
A unique feature of the JNMLE is the inclusion of so-called “prohibited choices” (kinshi-shi), which represent diagnostic or therapeutic decisions that could seriously endanger a patient’s life. If examinees select more than the permitted number of prohibited options—typically two or three, depending on the examination year—they automatically fail the examination, regardless of their total score. This system is intended to ensure a minimum standard of patient safety and physician competency, emphasizing that clinically dangerous decisions are unacceptable.
Medical students and practicing physicians are generally not experts in artificial intelligence. In educational and clinical settings, LLMs are often used as information retrieval tools similar to search engines, supporting learning, review of past examination questions, and diagnostic reasoning. However, because AI systems may occasionally generate incorrect or misleading information, users must possess sufficient foundational medical knowledge and image interpretation skills to critically evaluate AI-generated outputs. Within this context, evaluating the reliability and safety of multimodal LLMs in examination settings that closely reflect real clinical decision-making is of particular importance.
In this study, we evaluated the performance of GPT-4 and GPT-4o using questions from the 118th, 117th, and 116th JNMLE, with a particular focus on image-based questions and the avoidance of prohibited choices as a surrogate marker of patient safety.

2. Materials and Methods

We downloaded the questions and answers for the 118th, 117th, and 116th JNMLE from the Japanese Ministry of Health, Labour and Welfare's website [8,9,10].
We assumed that the AI would be used by general medical professionals, so we neither trained the models nor supplied complex prompts to raise the rate of correct answers. Instead, we had the AI solve the problems using prompts that were as simple as possible.
For each question, we began by entering an initial prompt in Japanese: “Please solve the Japanese National Medical Licensing Examinations’ questions.” For text-based questions, in the subsequent prompt we copied and pasted the question text to obtain the answer from ChatGPT (Figure 1).
For image-based questions, in the subsequent prompt we copied and pasted the text and also captured the image portion as a screenshot, which we attached to obtain the answer (Figure 2).
Similarly, for questions involving the interpretation of tables, we copied and pasted the text and attached a screenshot of the table; for questions whose options were presented in table format, we likewise attached a screenshot of the table. To avoid hallucinations carried over from earlier context, we created a new chat for each question and let ChatGPT solve the problem. Between August and December 2024, responses to each question were obtained using ChatGPT-4 and ChatGPT-4o.
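All responses in this study were collected through the ChatGPT web interface, but the same one-question-per-fresh-chat workflow can also be expressed programmatically. The following minimal Python sketch is illustrative only and is not part of the study’s method: it shows how a text or image question might be submitted via OpenAI’s chat completions API. The file path, helper name, and exact prompt packaging are assumptions.

```python
# Hedged sketch (not the authors' procedure): submitting one JNMLE question
# per fresh conversation via the OpenAI API. Helper name and paths are
# illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_question(question_text: str, image_path: str | None = None) -> str:
    """Send one exam question in a fresh conversation (no carried-over context)."""
    content = [{"type": "text",
                "text": "Please solve the Japanese National Medical Licensing "
                        "Examinations' questions.\n" + question_text}]
    if image_path:  # image-based question: attach the screenshot, as in the study
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    # A new API call with no prior messages plays the role of a "new chat".
    resp = client.chat.completions.create(
        model="gpt-4o",  # or "gpt-4" for the comparison arm
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```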
The JNMLE consists of 400 questions worth a total of 500 points, of which 100 are essential questions and 300 are non-essential questions. The passing criteria for the 118th JNMLE are as follows:
Essential questions: Each general question is worth 1 point and each clinical practical question 3 points; a score of at least 160 out of 200 points is required. This 80% threshold is an absolute standard: regardless of other examinees’ performance, this score is mandatory for passing.
Non-essential questions: Each question is worth 1 point, and a total score of at least 230 out of 300 points is required. This is a relative standard, meaning the threshold can vary each year depending on the performance of all examinees; for this examination, 76.7% was required.
Prohibited choices: Up to three prohibited choices are allowed. An examinee who selects more than the permitted number automatically fails.
All these criteria must be met to pass the examination.
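To make the interaction of these three criteria concrete, the short Python sketch below encodes the 118th JNMLE thresholds listed above. The function and variable names are our illustrative assumptions, not terminology from the examination authority.

```python
# Minimal sketch of the 118th JNMLE passing criteria described above.
def passes_118th_jnmle(essential_points: int,
                       non_essential_points: int,
                       prohibited_chosen: int) -> bool:
    """Return True only if all three passing criteria are met simultaneously."""
    return (essential_points >= 160          # at least 160/200 (absolute standard, 80%)
            and non_essential_points >= 230  # at least 230/300 (relative standard, 76.7% this year)
            and prohibited_chosen <= 3)      # no more than three prohibited choices

# Example: GPT-4o's reported 118th JNMLE result (essential 190,
# non-essential 267, zero prohibited choices) satisfies every criterion.
print(passes_118th_jnmle(190, 267, 0))  # True
```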
We applied these criteria to examine whether GPT-4 and GPT-4o could pass the 118th JNMLE, as well as their rates of correct answers and trends in incorrect answers [11].
The 117th and 116th JNMLE have passing criteria based on the same three elements: essential questions, non-essential questions, and prohibited choices [12,13].
Although the Japanese Ministry of Health, Labour and Welfare has not released official information on the prohibited choices, Medic Media Company Limited (Tokyo, Japan) has published estimates, which we used in our investigation [14,15,16].
The 117th JNMLE contained two questions for which the images were not made public because they contained photographs of genitalia, and these two questions were excluded from this survey.
The passing criteria for each round are shown in Table 1.
The overall correct answer rate for each round, as well as the correct answer rates for essential questions, general/clinical questions, and text-based and image-based questions, were compared between GPT-4 and GPT-4o. The number of prohibited options selected was also compared. Because the number of image-based questions per round was small (approximately 100), correct answer rates were compared not only for each round but also for the three rounds combined. In addition, we divided the questions into general questions, which test straightforward knowledge, and clinical questions, which take the form of case studies, and compared the overall correct answer rates and those within each section.
The correct answer rates were statistically analyzed using the chi-square test, and a p value of less than 0.05 was considered statistically significant.
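As an illustration of this analysis, the sketch below runs the same comparison on the overall 118th JNMLE counts reported in Table 2 (GPT-4o 363/400 vs. GPT-4 333/400 correct). The use of scipy is an assumption, since the source does not name its statistical software, and the computed p value may differ slightly from the published one depending on whether a continuity correction is applied.

```python
# Hedged sketch: chi-square comparison of correct-answer rates. scipy is an
# assumed tool; counts are from Table 2 (118th JNMLE, all questions). Note
# that scipy applies Yates' continuity correction to 2x2 tables by default,
# so p may differ slightly from the published value.
from scipy.stats import chi2_contingency

def compare_accuracy(correct_a: int, correct_b: int, n: int) -> float:
    """Return the chi-square p value for two correct-answer counts out of n."""
    table = [
        [correct_a, n - correct_a],  # GPT-4o: correct, incorrect
        [correct_b, n - correct_b],  # GPT-4:  correct, incorrect
    ]
    _, p, _, _ = chi2_contingency(table)
    return p

print(f"p = {compare_accuracy(363, 333, 400):.4f}")  # p < 0.05 -> significant
```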
The JNMLE questions and ChatGPT versions used in this study are publicly accessible online; therefore, ethical approval was not required.

3. Results

3.1. Performance in the 118th JNMLE

3.1.1. Overall Results

The 118th JNMLE included 101 image-based questions, consisting of 10 essential image questions and 91 non-essential image questions. GPT-4o passed the examination with a total score of 457 points, including 190 in essential questions (general 49, clinical 141), 267 in non-essential questions, and zero prohibited choices. GPT-4 also passed, scoring 425 points (essential 181, non-essential 244), but selected one prohibited option. Detailed results for the 118th JNMLE are provided in Supplementary Tables S1 and S2.

3.1.2. Accuracy Comparison

GPT-4o demonstrated a significantly higher overall accuracy rate than GPT-4. No significant difference was observed in image-based questions. GPT-4o showed significantly higher accuracy in text-based questions. In the essential section, GPT-4o showed superior accuracy for text-based items, while image-based accuracy was comparable between models. In both general and clinical domains, GPT-4o demonstrated significantly higher overall accuracy and higher text-based accuracy, with no significant difference in image-based accuracy.

3.2. Performance in the 117th JNMLE

3.2.1. Overall Results

The 117th examination included 127 image-based questions (16 essential and 111 non-essential). GPT-4o passed with 446 total points (essential 190, non-essential 256, prohibited choices 0). GPT-4 narrowly passed with 392 points (essential 161, non-essential 231), selecting two prohibited choices. Detailed results for the 117th JNMLE are provided in Supplementary Tables S3 and S4.

3.2.2. Accuracy Comparison

GPT-4o again demonstrated significantly superior overall performance.
GPT-4o achieved significantly higher accuracy in image-based questions, with no difference in text-based accuracy. In essential questions, GPT-4o surpassed GPT-4 overall and in text-based accuracy. In non-essential questions, GPT-4o excelled overall and in image-based accuracy. In general questions, no significant differences were observed. In clinical questions, GPT-4o demonstrated significantly higher accuracy overall and in image-based questions.

3.3. Performance in the 116th JNMLE

3.3.1. Overall Results

The 116th JNMLE included 94 image-based questions (13 essential, 81 non-essential). GPT-4o passed with a total of 462 points: 190 essential points (general 46, clinical 144), 272 non-essential points, and zero prohibited choices. GPT-4 passed with 420 points (essential 173, non-essential 247) but selected two prohibited choices. Detailed results for the 116th JNMLE are provided in Supplementary Tables S5 and S6.

3.3.2. Accuracy Comparison

GPT-4o demonstrated significantly higher overall accuracy compared with GPT-4.
GPT-4o achieved significantly higher image-based accuracy, while text-based accuracy was similar between models. In essential questions, GPT-4o showed significantly higher overall and text-based accuracy. In non-essential questions, GPT-4o outperformed GPT-4 across overall, image-based, and text-based categories. In general questions, GPT-4o showed significantly higher overall and text-based accuracy. In clinical questions, GPT-4o demonstrated superior overall and image-based accuracy, with comparable text-based accuracy. These findings are summarized in Table 2.

3.4. Combined Analysis of All Three Examinations

3.4.1. Integrated Overall Accuracy

When combining all 1200 questions across the three examinations, GPT-4o demonstrated significantly higher overall accuracy than GPT-4. GPT-4o also achieved superior performance in both image-based and text-based questions (Figure 3).

3.4.2. Section-Based Performance

Essential questions: GPT-4o showed significantly higher overall and text-based accuracy; image-based accuracy did not differ significantly. Non-essential questions: GPT-4o significantly outperformed GPT-4 overall, in image-based accuracy, and in text-based accuracy.

3.4.3. Question-Type Performance

General questions: GPT-4o showed significantly higher overall and text-based accuracy; image-based accuracy was similar.
Clinical questions: GPT-4o demonstrated significantly higher accuracy overall and in both image-based and text-based categories.

3.4.4. Prohibited Choices

GPT-4 selected several prohibited choices across the three examinations, whereas GPT-4o selected none. This difference further supports the relative safety of GPT-4o’s clinical decision-making tendencies (Table 3).

4. Discussion

Both GPT-4o and GPT-4 passed each round of the JNMLE; however, GPT-4o consistently outperformed GPT-4 in overall scores and demonstrated superior performance across multiple domains. A comprehensive analysis across three consecutive examinations showed that GPT-4o achieved higher accuracy not only in text-based questions assessing basic medical knowledge but also in clinical questions requiring interpretation of laboratory data and medical images. In contrast to GPT-4, GPT-4o did not select any prohibited choices in any examination, highlighting a potential advantage in patient safety–oriented decision-making. Although GPT-4o demonstrated higher accuracy in essential and general image-based questions, these differences were not always statistically significant, likely due to the limited number of image-based questions included in each examination.
It should be noted that the number of image-based questions in each individual JNMLE was relatively limited, with only 39 essential image-based questions and 31 general image-based questions across all three examinations. To address this limitation and improve the robustness of statistical comparisons, we performed a combined analysis of image-based questions from the 118th, 117th, and 116th JNMLE. This approach allowed for a more stable evaluation of model performance in image-based reasoning while preserving the structure and content characteristics of the original examinations. GPT-4o has previously been reported to improve the accuracy of image-based questions to a level comparable with text-based questions; however, direct comparisons with earlier models and analyses focusing on prohibited choices have been limited [17].
The competencies required to pass the JNMLE extend beyond knowledge recall and include clinical reasoning, interpretation of medical images, and safe decision-making in high-risk clinical scenarios. From this perspective, GPT-4o demonstrated several advantages over GPT-4. GPT-4o showed consistently higher accuracy in text-based questions, suggesting broader and more integrated medical knowledge. In addition, its superior performance in clinical questions indicates an enhanced ability to analyze complex case scenarios and propose appropriate diagnostic and therapeutic approaches. Although differences in image-based accuracy between models were sometimes modest, the overall performance of GPT-4o suggests that it possesses at least the minimum image interpretation capability required for real-world clinical decision support.
Previous studies have reported that attaching images to GPT-4 did not necessarily improve accuracy in medical licensing examination questions, suggesting that image interpretation may not be essential for achieving passing scores in certain contexts [6]. However, the JNMLE intentionally incorporates image-based questions to assess physician competency in situations that reflect real clinical practice. The relatively lower accuracy observed in some image-based general and essential questions may be attributable to the use of basic textbook figures or conceptual graphs, which limit opportunities for advanced reasoning or statistical inference. Furthermore, as reported by Liu et al., large language model performance may be influenced by the availability of domain-specific academic publications used during training, potentially resulting in lower performance in areas with limited published data [7].
A particularly important finding of this study is the difference in prohibited choice selection between models. We analyzed four prohibited answers selected by GPT-4 and found that, in three cases, the model correctly identified the anatomical region and imaging modality but failed to recognize critical pathological findings. As a result, image information was not appropriately incorporated into clinical decision-making, leading to choices that could seriously endanger patients. In contrast, GPT-4o not only identified the relevant anatomy and modality but also recognized abnormal findings and incorporated them into its reasoning process, thereby avoiding prohibited choices. In another case, GPT-4 selected a treatment option that could endanger the patient based primarily on extensive textual information, whereas GPT-4o selected the correct option. These findings suggest that GPT-4o demonstrates improved risk-aware reasoning when integrating textual and visual information.
Our findings have important implications for both medical education and clinical practice. For medical students, GPT-4o may function as a reliable partner when preparing for the JNMLE, particularly when solving past examination questions involving medical images or graphical data. The model’s consistent avoidance of prohibited choices suggests that it may support the development of patient safety–oriented clinical reasoning, provided that students maintain sufficient foundational medical knowledge and image interpretation skills to critically evaluate AI-generated responses.
For general physicians and medical students without specialized expertise in artificial intelligence, GPT-4o may also serve as a supportive tool in routine clinical practice. By inputting patient history and attaching relevant medical images, physicians may use the model as one component of diagnostic and therapeutic decision support. Nevertheless, AI-generated outputs should not replace clinical judgment, and physicians must retain full responsibility for final medical decisions.
Importantly, the analysis of prohibited choices highlights a dimension of AI evaluation that extends beyond conventional accuracy metrics. The ability to avoid clinically dangerous decisions represents a critical requirement for any AI system intended to support medical education or clinical practice. At the same time, both GPT-4 and GPT-4o may generate confident and coherent explanations even when incorrect, which underscores the importance of human oversight. Accordingly, large language models should be regarded as supportive tools that complement, rather than substitute for, professional medical expertise and clinical responsibility.

5. Limitations

This study has several limitations.
First, the analysis was conducted entirely in Japanese and was based on the Japanese National Medical Licensing Examination (JNMLE). Therefore, the findings may not be directly generalizable to other languages, healthcare systems, or licensing examinations with different structures.
Second, the evaluation relied exclusively on publicly available past examination questions. While these questions are designed to reflect real-world clinical reasoning, they cannot fully replicate the complexity, uncertainty, and contextual richness of actual clinical practice.
Third, we intentionally used simple and standardized prompts without optimization or fine-tuning to reflect how medical students and general physicians without AI expertise are likely to interact with large language models in real-world settings. As a result, the reported performance may represent a conservative estimate rather than the maximum achievable capability of each model.
Fourth, this study compared only two models, GPT-4 and GPT-4o. Other large language models with multimodal capabilities were not evaluated, and future studies should include broader model comparisons.
Finally, outcome measures such as accuracy and avoidance of prohibited choices serve as surrogate markers of clinical reasoning and patient safety. They do not directly translate into real-world diagnostic accuracy, treatment outcomes, or clinical effectiveness.

6. Conclusions

Overall, GPT-4o demonstrated superior performance compared with GPT-4 in both text-based and image-based questions on the JNMLE. These findings suggest that GPT-4o has the potential to function as a supportive educational and clinical reasoning tool, provided that users possess sufficient medical knowledge to critically evaluate its outputs. Importantly, GPT-4o should be regarded as a complementary resource rather than a substitute for professional medical judgment.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ime5010009/s1. Table S1: 118th JNMLE GPT-4o; Table S2: 118th JNMLE GPT-4; Table S3: 117th JNMLE GPT-4o; Table S4: 117th JNMLE GPT-4; Table S5: 116th JNMLE GPT-4o; Table S6: 116th JNMLE GPT-4.

Author Contributions

Conceptualization, M.M. and G.F.; methodology, M.M.; software, M.M.; validation, M.M.; formal analysis, M.M.; investigation, M.M., H.A., K.T., Y.K., H.M. and M.H.; resources, M.M.; data curation, M.M.; writing—original draft preparation, M.M.; writing—review and editing, M.M., H.M. and M.H.; visualization, M.M.; supervision, M.H. and H.M.; project administration, M.H. and H.M.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The questions and answers for the Japanese National Medical Licensing Examination used in this study can be viewed and downloaded without restriction from the Ministry of Health, Labour and Welfare’s website (Japanese only):
https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp240424-01.html (accessed on 21 October 2025);
https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp230502-01.html (accessed on 21 October 2025);
https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp220421-01.html (accessed on 21 October 2025);
https://www.mhlw.go.jp/general/sikaku/successlist/2024/siken01/about.html (accessed on 21 October 2025);
https://www.mhlw.go.jp/general/sikaku/successlist/2023/siken01/about.html (accessed on 21 October 2025);
https://www.mhlw.go.jp/general/sikaku/successlist/2022/siken01/about.html (accessed on 21 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviation is used in this manuscript:
JNMLE: Japanese National Medical Licensing Examination

References

  1. Introducing ChatGPT. OpenAI. Available online: https://openai.com/blog/chatgpt (accessed on 21 October 2025).
  2. Tanaka, Y.; Nakata, T.; Aiga, K.; Etani, T.; Muramatsu, R.; Katagiri, S.; Kawai, H.; Higashino, F.; Enomoto, M.; Noda, M.; et al. Performance of Generative Pretrained Transformer on the National Medical Licensing Examination in Japan. PLoS Digit. Health 2024, 3, e0000433.
  3. Takagi, S.; Watari, T.; Erabi, A.; Sakaguchi, K. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study. JMIR Med. Educ. 2023, 9, e48002.
  4. Yanagita, Y.; Yokokawa, D.; Uchida, S.; Tawara, J.; Ikusaka, M. Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study. JMIR Form. Res. 2023, 7, e48023.
  5. Murad, I.A.; Khaleel, M.I.; Shakor, M.Y. Unveiling GPT-4o: Enhanced Multimodal Capabilities and Comparative Insights with ChatGPT-4. Int. J. Electron. Commun. Syst. 2024, 4, 127–136.
  6. Nakao, T.; Miki, S.; Nakamura, Y.; Kikuchi, T.; Nomura, Y.; Hanaoka, S.; Yoshikawa, T.; Abe, O. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination. JMIR Med. Educ. 2024, 10, e54393.
  7. Liu, M.; Okuhara, T.; Dai, Z.; Huang, W.; Gu, L.; Okada, H.; Furukawa, E.; Kiuchi, T. Evaluating the Effectiveness of Advanced Large Language Models in Medical Knowledge: A Comparative Study Using the Japanese National Medical Examination. Int. J. Med. Inform. 2025, 193, 105673.
  8. Available online: https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp240424-01.html (accessed on 21 October 2025).
  9. Available online: https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp230502-01.html (accessed on 21 October 2025).
  10. Available online: https://www.mhlw.go.jp/seisakunitsuite/bunya/kenkou_iryou/iryou/topics/tp220421-01.html (accessed on 21 October 2025).
  11. Available online: https://www.mhlw.go.jp/general/sikaku/successlist/2024/siken01/about.html (accessed on 21 October 2025).
  12. Available online: https://www.mhlw.go.jp/general/sikaku/successlist/2023/siken01/about.html (accessed on 21 October 2025).
  13. Available online: https://www.mhlw.go.jp/general/sikaku/successlist/2022/siken01/about.html (accessed on 21 October 2025).
  14. Available online: https://informa.medilink-study.com/web-informa/post41529.html/ (accessed on 21 October 2025).
  15. Available online: https://informa.medilink-study.com/web-informa/post39343.html/ (accessed on 21 October 2025).
  16. Available online: https://informa.medilink-study.com/web-informa/post36171.html/ (accessed on 21 October 2025).
  17. Miyazaki, Y.; Hata, M.; Omori, H.; Hirashima, A.; Nakagawa, Y.; Eto, M.; Takahashi, S.; Ikeda, M. Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions. JMIR Med. Educ. 2024, 10, e63129.
Figure 1. Example of a multiple-choice question, in which one selects one answer from five options, in Japanese with ChatGPT response in a single chat box. (A) is an example of actually entering a prompt in Japanese and getting the GPT answer. (B) is an English translation of the contents of (A) by the author.
Figure 2. Example of prompt input and GPT response when there is an attached image. (A) is an example of actually entering prompt in Japanese and getting the GPT answer. (B) is an English translation of the contents of (A) by the author.
Figure 3. Radar chart of all answered questions and correct answer rate for each section for GPT-4 and GPT-4o.
Table 1. The Passing Criteria for Each Round of the Japan National Medical Licensing Examination.

| Criterion | 118th JNMLE | 117th JNMLE | 116th JNMLE |
|---|---|---|---|
| Essential questions (each general question 1 point; each clinical practical question 3 points) | Total score of at least 160/200 points | Total score of at least 160/200 points | Total score of at least 158/197 points; excluded questions: B6, B43, E16 (E16 is counted for candidates who answer it correctly and excluded for candidates who answer it incorrectly) |
| Non-essential general and clinical questions (each question 1 point) | Total score of at least 230/300 points | Total score of at least 220/295 points; excluded questions: C15, C60, D38, D53, F42 | Total score of at least 214/297 points; excluded questions: A34, A71, C36, D64 |
| Prohibited choices (critical questions) | 3 questions or fewer | 2 questions or fewer | 3 questions or fewer |
Table 2. Number of Questions, Number of Correct Answers, and Correct Answer Rate for Each Round and Question Type for ChatGPT-4o and GPT-4.

| Question Section | Question Type | Number of Questions | GPT-4o Correct Answers | GPT-4o Correct Answer Rate | GPT-4 Correct Answers | GPT-4 Correct Answer Rate | p Value |
|---|---|---|---|---|---|---|---|
| 118 All Questions | All Questions | 400 | 363 | 90.8% | 333 | 83.3% | 0.0016 |
| 118 All Questions | Image-based Questions | 101 | 85 | 84.2% | 76 | 75.2% | 0.1153 |
| 118 All Questions | Text-based Questions | 299 | 278 | 93.0% | 257 | 86.0% | 0.0050 |
| 118 Essential Section | All Questions | 100 | 96 | 96.0% | 89 | 89.0% | 0.0602 |
| 118 Essential Section | Image-based Questions | 10 | 8 | 80.0% | 6 | 60.0% | 0.3291 |
| 118 Essential Section | Text-based Questions | 90 | 88 | 97.8% | 83 | 92.2% | 0.0005 |
| 118 Non-essential Sections | All Questions | 300 | 267 | 89.0% | 244 | 81.3% | 0.0084 |
| 118 Non-essential Sections | Image-based Questions | 91 | 77 | 84.6% | 70 | 76.9% | 0.1879 |
| 118 Non-essential Sections | Text-based Questions | 209 | 190 | 90.9% | 174 | 83.3% | 0.0192 |
| 118 General Section | All Questions | 150 | 133 | 88.7% | 123 | 82.0% | 0.1026 |
| 118 General Section | Image-based Questions | 10 | 5 | 50.0% | 7 | 70.0% | 0.3613 |
| 118 General Section | Text-based Questions | 140 | 128 | 91.4% | 116 | 82.9% | 0.0321 |
| 118 Clinical Section | All Questions | 250 | 230 | 92.0% | 210 | 84.0% | 0.0059 |
| 118 Clinical Section | Image-based Questions | 91 | 80 | 87.9% | 69 | 75.8% | 0.0343 |
| 118 Clinical Section | Text-based Questions | 159 | 150 | 94.3% | 141 | 88.7% | 0.0701 |
| 117 All Questions | All Questions | 393 | 350 | 89.1% | 312 | 79.4% | 0.0016 |
| 117 All Questions | Image-based Questions | 127 | 107 | 84.3% | 83 | 65.4% | 0.0005 |
| 117 All Questions | Text-based Questions | 266 | 243 | 91.4% | 229 | 86.1% | 0.0550 |
| 117 Essential Section | All Questions | 100 | 94 | 94.0% | 81 | 81.0% | 0.0054 |
| 117 Essential Section | Image-based Questions | 16 | 13 | 81.3% | 10 | 62.5% | 0.2381 |
| 117 Essential Section | Text-based Questions | 84 | 81 | 96.4% | 71 | 84.5% | 0.0085 |
| 117 Non-essential Sections | All Questions | 293 | 256 | 87.4% | 231 | 78.8% | 0.0058 |
| 117 Non-essential Sections | Image-based Questions | 111 | 94 | 84.7% | 73 | 65.8% | 0.0010 |
| 117 Non-essential Sections | Text-based Questions | 182 | 162 | 89.0% | 158 | 86.8% | 0.5201 |
| 117 General Section | All Questions | 148 | 128 | 86.6% | 119 | 80.4% | 0.0042 |
| 117 General Section | Image-based Questions | 14 | 9 | 64.3% | 7 | 50.0% | 0.4450 |
| 117 General Section | Text-based Questions | 132 | 117 | 88.6% | 110 | 83.3% | 0.2145 |
| 117 Clinical Section | All Questions | 246 | 222 | 90.2% | 193 | 78.5% | 0.0003 |
| 117 Clinical Section | Image-based Questions | 111 | 96 | 86.5% | 74 | 66.7% | 0.0004 |
| 117 Clinical Section | Text-based Questions | 134 | 126 | 94.0% | 119 | 88.8% | 0.1268 |
| 116 All Questions | All Questions | 395 | 366 | 92.7% | 330 | 83.8% | 0.0016 |
| 116 All Questions | Image-based Questions | 94 | 84 | 89.4% | 71 | 75.5% | 0.0126 |
| 116 All Questions | Text-based Questions | 301 | 282 | 93.7% | 259 | 86.3% | 0.0027 |
| 116 Essential Section | All Questions | 98 | 94 | 95.9% | 83 | 85.6% | 0.0130 |
| 116 Essential Section | Image-based Questions | 13 | 12 | 92.3% | 11 | 85.6% | 0.5393 |
| 116 Essential Section | Text-based Questions | 85 | 82 | 96.5% | 72 | 85.7% | 0.0145 |
| 116 Non-essential Sections | All Questions | 297 | 272 | 91.6% | 247 | 83.2% | 0.0020 |
| 116 Non-essential Sections | Image-based Questions | 81 | 72 | 88.9% | 60 | 74.1% | 0.0152 |
| 116 Non-essential Sections | Text-based Questions | 216 | 200 | 92.6% | 187 | 86.6% | 0.0406 |
| 116 General Section | All Questions | 147 | 139 | 94.6% | 123 | 84.2% | 0.0042 |
| 116 General Section | Image-based Questions | 7 | 4 | 57.1% | 3 | 42.9% | 0.5929 |
| 116 General Section | Text-based Questions | 140 | 135 | 96.4% | 120 | 86.3% | 0.0027 |
| 116 Clinical Section | All Questions | 248 | 227 | 91.5% | 207 | 83.5% | 0.0066 |
| 116 Clinical Section | Image-based Questions | 87 | 80 | 92.0% | 68 | 78.2% | 0.0107 |
| 116 Clinical Section | Text-based Questions | 161 | 147 | 91.3% | 139 | 86.3% | 0.1571 |
Table 3. Number of Prohibited Questions, Number of Prohibited Answers, and Prohibited Answer Rate for Each Round for ChatGPT-4o and GPT-4.

| Round | Questions with a Prohibited Option | GPT-4o: Prohibited Options Chosen | GPT-4o: Prohibited Answer Rate | GPT-4: Prohibited Options Chosen | GPT-4: Prohibited Answer Rate |
|---|---|---|---|---|---|
| 118th JNMLE | 9 | 0 | 0.0% | 1 | 11.1% |
| 117th JNMLE | 11 | 0 | 0.0% | 2 | 18.2% |
| 116th JNMLE | 9 | 0 | 0.0% | 2 | 22.2% |

Share and Cite

MDPI and ACS Style

Miyamura, M.; Fujiki, G.; Kanzaki, Y.; Tsuda, K.; Asano, H.; Morita, H.; Hoshiga, M. Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential. Int. Med. Educ. 2026, 5, 9. https://doi.org/10.3390/ime5010009

AMA Style

Miyamura M, Fujiki G, Kanzaki Y, Tsuda K, Asano H, Morita H, Hoshiga M. Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential. International Medical Education. 2026; 5(1):9. https://doi.org/10.3390/ime5010009

Chicago/Turabian Style

Miyamura, Masatoshi, Goro Fujiki, Yumiko Kanzaki, Kosuke Tsuda, Hironaka Asano, Hideaki Morita, and Masaaki Hoshiga. 2026. "Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential" International Medical Education 5, no. 1: 9. https://doi.org/10.3390/ime5010009

APA Style

Miyamura, M., Fujiki, G., Kanzaki, Y., Tsuda, K., Asano, H., Morita, H., & Hoshiga, M. (2026). Evaluating Chat GPT-4o’s Comparative Performance over GPT-4 in Japanese Medical Licensing Examination and Its Clinical Partnership Potential. International Medical Education, 5(1), 9. https://doi.org/10.3390/ime5010009
