What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams

Open domain question answering (OpenQA) tasks have been recently attracting more and more attention from the natural language processing (NLP) community. In this work, we present the first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, collected from professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. We implement both rule-based and popular neural methods by sequentially combining a document retriever and a machine comprehension model. Through experiments, we find that even the current best method can only achieve 36.7%, 42.0%, and 70.1% of test accuracy on the English, traditional Chinese, and simplified Chinese questions, respectively. We expect MedQA to present great challenges to existing OpenQA systems and hope that it can serve as a platform to promote much stronger OpenQA models from the NLP community in the future.


Introduction
Question answering (QA) is a fundamental task in Natural Language Processing (NLP), which requires models to answer a given question. When given the context text associated with the question, pre-trained language models such as BERT (Devlin et al. 2019), RoBERTa, and ALBERT (Lan et al. 2019) have achieved nearly saturated performance on most of the popular datasets (Rajpurkar et al. 2016; Rajpurkar, Jia, and Liang 2018; Lai et al. 2017; Yang et al. 2018; Gao et al. 2020). However, real-world scenarios for QA are usually much more complex, and one may not have a body of text already labeled as containing the answer to the question. In this scenario, models are required to find and extract information relevant to the question from large-scale text sources such as a search engine (Dunn et al. 2017) or Wikipedia (Chen et al. 2017). This type of task is generally called open-domain question answering (OpenQA); it has recently attracted much attention from the NLP community (Clark and Gardner 2018; Wang et al. 2019; Asai et al. 2020) but still remains far from being solved.
Most previous work on OpenQA focuses on datasets in which answers are spans (several consecutive tokens) that can be found based on information explicitly expressed in the provided text (Joshi et al. 2017; Chen et al. 2017; Dunn et al. 2017; Dhingra, Mazaitis, and Cohen 2017). As a more challenging task, free-form multiple-choice OpenQA datasets such as ARC and OpenBookQA (Mihaylov et al. 2018) contain a significant percentage of questions focusing on facts, events, opinions, or emotions that are expressed only implicitly in the retrieved text. To answer these questions, models need to perform logical reasoning over the information presented in the retrieved text and, in some cases, even integrate prior knowledge. Such datasets can motivate a general QA algorithm that can read any given question, find relevant evidence from a knowledge bank, and conduct logical reasoning to obtain the answer. Unfortunately, these OpenQA datasets consist of questions that require only elementary or middle school level knowledge (e.g., "Which object would let the most heat travel through?"), so even excellent models trained on them may be unable to support more sophisticated real-world scenarios.
To this end, we introduce a new OpenQA dataset, MEDQA, for solving medical problems, representing a demanding real-world scenario. Questions in this dataset are collected from medical board exams in the US, Mainland China, and Taiwan, where human doctors are evaluated on their professional knowledge and ability to make clinical decisions. Questions in these exams are varied and generally require a deep understanding of related medical concepts, learned from medical textbooks, to answer. Table 1 shows two typical examples. An OpenQA model must learn to find relevant information in the large collection of text materials we assembled from medical textbooks, reason over it, and decide on the right answer. Taking the first question in Table 1 as an example, to obtain the correct answer "Chlamydia trachomatis", the OpenQA model first needs to retrieve the relevant evidence shown in the table from a large collection of medical textbooks, read over the question body, pay attention to the key findings for this patient (no evident signs of urethritis, finding of pyuria, and a positive leukocyte esterase test), and infer the correct answer after excluding the other options.

To provide benchmarks for MEDQA, we implement several state-of-the-art methods for the OpenQA task based on the standard system design, which consists of two components: a document retriever for finding relevant evidence text, and a document reader that performs machine comprehension over the retrieved evidence. Experimental results show that even the best method, powered by large pre-trained models (Devlin et al. 2019), can only achieve 36.7%, 42.0%, and 70.1% of test accuracy on the questions collected from the US, Taiwan, and Mainland China, respectively, indicating the great challenge of this dataset. Through both quantitative and qualitative analysis, we find that the document retriever is the performance bottleneck, since in its current form it cannot conduct multi-hop reasoning over the retrieval process. Our hope is that MEDQA can serve as a platform to encourage researchers to develop a general OpenQA model that can solve complex real-world questions via abundant logical reasoning in both the retrieval and comprehension stages.

Related Work
Traditionally, QA tasks have been designed to be text-dependent (Rajpurkar et al. 2016; Lai et al. 2017; Reddy, Chen, and Manning 2019), where a model is required to comprehend a given text to answer questions. The texts relevant to the questions are specially curated by people, which is infeasible for real-world applications where annotations of relevant context are expensive to obtain. This gave rise to the open-domain QA (OpenQA) task, where models must both find and comprehend the context needed to answer the questions. As a preliminary trial, Chen et al. (2017) proposed the DrQA model, which uses a text retriever to obtain relevant documents from Wikipedia and then applies a trained reading comprehension model to extract the answer from the retrieved documents. Afterwards, researchers introduced more sophisticated models, which either aggregate all informative evidence (Clark and Gardner 2018; Wang et al. 2019) or filter out irrelevant retrieved texts (Wang et al. 2018; Das et al. 2019) to better predict the answers. Benefiting from the power of neural networks, these models have achieved remarkable progress in OpenQA.
Much of the previous OpenQA work focuses on datasets whose answers are spans from the retrieved documents (Joshi et al. 2017; Chen et al. 2017; Dunn et al. 2017). In general, most questions concern facts that are explicitly expressed in the text, offering an advantage to systems that rely mostly on surface word matching (Sun et al. 2019).
To promote more advanced reading skills, another research line has studied OpenQA tasks in a free-form multiple-choice format (Mihaylov et al. 2018). These benchmarks allow a relatively more comprehensive evaluation of different higher-level reading skills such as logical reasoning and prior knowledge integration. In particular, real-world exams such as the SAT and Gaokao are ideal sources for constructing this kind of OpenQA dataset (Mihaylov et al. 2018; Zhang et al. 2018). Our proposed dataset follows this line but differs in three aspects:

• The source of our dataset is designed to examine doctors' professional capability and thus contains a significant number of questions that require multi-hop logical reasoning, which helps push the development of reading comprehension models in this direction (Yu et al. 2020).

• Our dataset is the first publicly available large-scale multiple-choice OpenQA dataset for medical problems, where extensive prior domain-specific knowledge is expected of the model. It can thus contribute to the emerging field where a general language model needs to be combined with world knowledge.

• Our dataset is cross-lingual, covering English and simplified/traditional Chinese, which contributes to the emerging field of cross-lingual natural language understanding.
There are several related medical QA datasets, which are summarized in Table 2.

Table 2: Comparison of our dataset with existing medical QA datasets. In terms of the answer format, "retrieval" means that the answer is a snippet of the retrieval results obtained by searching relevant websites; "ranking" means that the answer is a list of candidates and the task is to rank them, with higher-ranked ones being better answers. Automatic annotation is obtained via an algorithm.

Most of these datasets are aimed at directly finding answers by searching relevant websites and selecting a snippet of the search results (Mrabet, and Ben Abacha 2020), where answer retrieval is performed by keyword matching and complex reasoning is seldom involved. BioASQ (Nentidis et al. 2020) is similar to the SQuAD dataset (Rajpurkar et al. 2016), where a span of text in the given context is used as the answer and thus no external knowledge source is needed. emrQA (Pampari et al. 2018) aims to rank a list of Electronic Medical Record (EMR) text lines to find the best line as the answer, and the ground-truth answers are obtained via an algorithm rather than human annotation. Overall, none of these related datasets have been formulated as an OpenQA problem. Last but not least, Zhang et al. (2018) and Ha and Yaneva (2019) have previously worked on the Chinese and English versions of our proposed dataset, respectively. However, the former did not release any data and the latter only released 454 questions for public use, whereas we publicize a large-scale dataset to promote more powerful deep models.

Task Formulation
The task is defined by its three components: Question: the question in text form, either a single sentence asking for a certain piece of knowledge, or a long paragraph starting with a description of the patient's condition.
Answer candidates: multiple answer options are given for each question, of which only one should be chosen as the most appropriate.
Document collection: a collection of text material extracted from a variety of sources and organized into paragraphs, which contains the knowledge and information needed to find the answers. The task is to determine the best answer to the question among the candidates, relying on these documents.
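To make the formulation concrete, one instance can be represented as a small record like the following sketch (field names and placeholder values are ours, not an official schema):

```python
# A minimal sketch of one MedQA-style instance; field names and values are illustrative.
example = {
    "question": "A patient presents with <clinical findings> ... "
                "Which of the following is the most likely diagnosis?",
    "options": {
        "A": "<candidate answer 1>",
        "B": "<candidate answer 2>",
        "C": "<candidate answer 3>",
        "D": "<candidate answer 4>",
    },
    "answer": "A",  # exactly one option is the most appropriate answer
}

# The document collection: paragraphs extracted from medical textbooks, used as the
# knowledge source from which a system must find supporting evidence.
document_collection = [
    "<paragraph 1 from a medical textbook>",
    "<paragraph 2 from a medical textbook>",
    # ...
]
```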

Data Collection
Questions and Answers We collected the questions and their associated answer candidates from the National Medical Board Examinations in the USA, Mainland China, and Taiwan. For convenience, we denote the datasets from these three sources as USMLE, MCMLE, and TWMLE, respectively. These tests assess a physician's ability to apply knowledge, concepts, and principles, and to demonstrate fundamental patient-centered skills. We include problems from both real exams and mock tests; all are freely accessible online for public usage. Details about the sources from which we collected data are described in Table A.1 of the Supplementary Material. We also provide the scripts for data collection and processing in the Supplementary Material.
We remove duplicate problems and randomly split the data based on questions, with 80% for training, 10% for development, and 10% for test. The overall statistics of the dataset are summarized in Table 3.

Table 3: Overall statistics of MEDQA. Question/option length and vocabulary/character size are calculated in tokens for English and in characters for Chinese. Vocabulary/character size is measured on the combination of questions and options. "Avg./Max. option len." represents "Average/Maximum option length".

To comply with fair use law, we shuffle the order of answer options and randomly delete one of the wrong options for each question in the USMLE and MCMLE datasets, which results in four options with one right option and three wrong options. Percentages of each option being the correct answer on the development set are summarized in Table A of the Supplementary Material.
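A minimal sketch of the option-perturbation step described above, assuming each original question comes with one correct option and several wrong ones (function and variable names are ours):

```python
import random

def perturb_options(options, answer_key, rng):
    """Randomly delete one wrong option and shuffle the rest, tracking the correct answer.

    `options` maps option letters to option texts; `answer_key` is the correct letter.
    Returns a new letter->text mapping and the new letter of the correct answer.
    """
    wrong = [k for k in options if k != answer_key]
    dropped = rng.choice(wrong)                                   # delete one wrong option
    kept = [(k, text) for k, text in options.items() if k != dropped]
    rng.shuffle(kept)                                             # shuffle the option order
    letters = "ABCDE"
    new_options = {letters[i]: text for i, (_, text) in enumerate(kept)}
    new_answer = letters[[k for k, _ in kept].index(answer_key)]
    return new_options, new_answer

# Example: a 5-option question becomes a shuffled 4-option question.
# new_opts, new_ans = perturb_options(opts, "C", random.Random(42))
```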

Document Collection
Extensive medical knowledge is needed to answer every question in our MEDQA data. To answer these questions, humans acquire the necessary knowledge from a large volume of medical textbooks over years of training. Similarly, for a machine learning model to succeed at this task, we need to grant it access to the same collection of text materials as humans have. Therefore, for USMLE, we prepared text materials from a total of 18 English medical textbooks that have been widely used by medical students and USMLE takers, whereas for MCMLE, we collected 33 simplified Chinese medical textbooks designated as the official textbooks for preparing for the medical licensing exam in Mainland China. For TWMLE, since medical students in Taiwan use the same textbooks as those in the USA for exam preparation, USMLE and TWMLE use the same document collection for solving questions. We will release the textbooks we collected under a license agreement restricting use to research only. All textbooks we collected are originally in PDF format, and we converted them into digital text via OCR. We performed some clean-up pre-processing over the converted text, such as misspelling correction, and then divided all text into paragraphs. To evaluate whether these collected text materials cover enough knowledge to answer the questions, we randomly extracted 100 questions from the development set of each of the three datasets and let two medical experts with MD degrees annotate how many of them can be answered with evidence from our prepared text materials. Table 5 summarizes the results of this evaluation. From this table, we see that our collected text materials can provide enough information for answering most of the questions in our data.

Data Analysis
Based on a preliminary analysis of our proposed data, we find that it poses unique challenges for language understanding compared with existing OpenQA datasets, as elaborated below.

Table 5: Percentage of questions for which human experts could find enough evidence in our collected text material to obtain the correct answer, annotated on 100 random samples from the development set of each dataset (USMLE: 88.0, MCMLE: 100.0, TWMLE: 87.0).
Professional Knowledge: For most existing QA datasets, the question answering process relies largely on a basic understanding of language and general world knowledge reasoning. Some works have revealed that large-scale pretrained language models carry a certain level of commonsense and symbolic reasoning capability besides their linguistic knowledge (Talmor et al. 2019; Petroni et al. 2019), which may contribute significantly to the remarkable performance of current QA models. However, answering every question in our dataset requires abundant professional domain-specific knowledge, particularly medical knowledge, which forces the model to develop a deep understanding of the extracted context.

Diversity of Questions:
The field of clinical medicine is diverse enough for questions to be asked about a wide range of topics. In general, there are two categories of questions: 1) The question asks for a single piece of knowledge, for instance "Which of the following symptoms belongs to schizophrenia?" 2) The question first describes a patient's condition and then asks for the most probable diagnosis, the most appropriate treatment, the examination needed, the mechanism of certain conditions, the possible outcome of a certain treatment, etc. Table 1 shows two typical examples of type 2. Table 6 summarizes the percentages of these two types of questions for each dataset, obtained by annotating 100 randomly selected questions from the development set. Typically, type 1 questions need one-step reasoning, while type 2 questions require multi-hop reasoning and are thus much more complicated than type 1 ones, imposing challenges not only on the reading comprehension model but also on the relevant text retrieval module. For example, in order to solve the first question in Table 1, the model needs to first extract, understand, and interpret the symptoms of the patient from a long paragraph of description, then match these symptoms against millions of medical knowledge text snippets to find the most relevant one, and finally understand the evidence sentence for answer selection.
Complex Reasoning over Multiple Evidence: Many questions in our data involve complex multi-hop reasoning over several evidence snippets. For instance, the second example is a typical question that requires multiple steps of reasoning over three pieces of evidence: from the symptoms and signs, together with evidence 1 and 2, we can infer that this patient very likely has ALS; then, from evidence 3, we know that SOD1 is the possible genetic mutation for familial ALS, which is the correct answer.
Noisy Evidence Retrieval: Retrieving relevant information from large-scale text is much more challenging than reading a short piece of text. Passages from textbooks often do not directly give answers to questions and many passages retrieved by the most widely adopted term-matching based information retrieval (IR) systems turn out to be noisy distractors and not relevant, especially for type 2 questions. For those questions involving multi-hop reasoning, models must identify all relevant information scattered in different passages, and missing any single piece of evidence would lead to failures.

Approaches
We implement both classical rule-based methods and recent state-of-the-art neural network based models.

Rule-based Methods
We first propose two rule-based methods that do not involve a training process.
Pointwise Mutual Information (PMI) This method is based on the PMI score function (Clark et al. 2016), which measures the strength of association between two n-grams x and y and is defined as:

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)},$$

where p(x, y) is the joint probability that x and y occur together in our document collection C, within a certain window of text (we use a 10-word window), and p(x) is the probability that x occurs in C. In practice, we use frequencies to estimate the probabilities. The larger this PMI score, the stronger the association between x and y. This method extracts unigrams, bigrams, trigrams, and skip-bigrams from the question q and each answer option a_i, and calculates the average PMI score over all pairs of question n-grams and answer-option n-grams. The answer option with the highest average PMI score is picked as the prediction.
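The scoring procedure can be sketched as follows; the n-gram and co-occurrence counts over 10-word windows of the collection are assumed to be precomputed, and all names are ours:

```python
import math
from itertools import product

def pmi(x, y, window_count, cooccur_count, total_windows):
    """PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), estimated from window frequencies."""
    p_xy = cooccur_count.get((x, y), 0) / total_windows
    p_x = window_count.get(x, 0) / total_windows
    p_y = window_count.get(y, 0) / total_windows
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return 0.0  # back off when either n-gram is unseen
    return math.log(p_xy / (p_x * p_y))

def option_score(question_ngrams, option_ngrams, window_count, cooccur_count, total_windows):
    """Average PMI over all (question n-gram, option n-gram) pairs."""
    pairs = list(product(question_ngrams, option_ngrams))
    if not pairs:
        return 0.0
    return sum(pmi(x, y, window_count, cooccur_count, total_windows)
               for x, y in pairs) / len(pairs)

# Prediction: the option with the highest average PMI score.
# best = max(options, key=lambda a: option_score(ngrams(q), ngrams(a), wc, cc, n_windows))
```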
Information Retrieval (IR) As a preliminary trial, we adopt a standard off-the-shelf text retrieval system built upon Apache Lucene (Elasticsearch), using inverted index lookup followed by BM25 ranking. Specifically, for each question q and each answer option a_i, we send q + a_i as a query to the search engine and record the search engine's scores for the top-N retrieved sentences. This is repeated for all options to score them all, and the option with the highest score is selected. We denote this version as IR-ES.
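A minimal sketch of the IR-ES scoring loop with the Elasticsearch Python client (8.x-style API); the index name, field name, value of N, and the use of a score sum as the aggregation are our assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
TOP_N = 25  # number of retrieved passages per query (illustrative)

def ir_es_score(question, option, index="medqa_textbooks", field="text"):
    """Send q + a_i as a BM25 query and aggregate the search engine's scores of the top-N hits."""
    resp = es.search(
        index=index,
        query={"match": {field: question + " " + option}},
        size=TOP_N,
    )
    return sum(hit["_score"] for hit in resp["hits"]["hits"])

def ir_es_predict(question, options):
    """Score every option and return the one with the highest retrieval score."""
    return max(options, key=lambda a: ir_es_score(question, a))
```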
To seek better IR performance, we enhance the above with a BM25 re-weighting mechanism inspired by Chen et al. (2017). The updated scoring function is detailed in Section B of the Supplementary Material due to page limits. Our best-performing system uses unigram counts. For English questions, we perform Snowball stemming on both documents and questions, and we use the MetaMap tool to identify and remove non-medically-related words from the questions, since they are not useful for medical evidence retrieval. Details of this step can be found in Section C of the Supplementary Material. We denote this version as IR-CUSTOM.

Neural Models
Following the DrQA system by Chen et al. (2017), this line of models consists of two components: (1) the Document Retriever module for finding relevant passages and (2) a machine comprehension model, Document Reader, for obtaining the answer by reading the small collection of passages.

Document Retriever
We use the best IR system developed in Section 4.1 to obtain the top-N ranked passages from the large-scale document collection C and concatenate them into a long sequence c. Then, for each question-option pair qa_i = q + a_i, both qa_i and c are passed to the Document Reader for reasoning and decision-making.
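In code, the retrieve-then-read interface might look like the sketch below, reusing a top-N retrieval function such as the one from the IR section; retrieving a separate context per option is our reading of the description above, and all names are ours:

```python
def build_reader_inputs(question, options, retrieve_top_n):
    """For each option, retrieve top-N passages, concatenate them into a context c,
    and pair it with qa_i = q + a_i for the Document Reader."""
    reader_inputs = []
    for option in options:
        qa_pair = question + " " + option           # qa_i = q + a_i
        passages = retrieve_top_n(qa_pair)          # list of retrieved paragraph strings
        context = " ".join(passages)                # concatenated evidence c
        reader_inputs.append((context, qa_pair))
    return reader_inputs
```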

Document Reader
We implement the following widely used document reader models:

MAX-OUT: Following Mihaylov et al. (2018), we first use the same bi-directional gated recurrent unit (BiGRU) model to encode both the context c and the question-option pair qa_i, and then perform max-pooling to obtain the final representation vectors h_c ∈ R^h and h_{qa_i}. We then use the following equation to calculate the probability score of how likely the option a_i is to answer the question q given the retrieved context c:

$$p(q, a_i \mid c) = W_1 \tanh\big(W_2\, [h_c;\; h_{qa_i};\; h_c \odot h_{qa_i};\; h_c - h_{qa_i}]\big),$$

where [·; ·] represents the concatenation operation, and W_1 ∈ R^{1×h} and W_2 ∈ R^{h×4h} are weight matrices to be learned. We compute such a probability score for each option and select the option with the highest score.

Pre-trained LMs: We form the input sequence "[CLS] c [SEP] qa_i [SEP]", where [CLS] and [SEP] are the classifier token and sentence separator of a pretrained language model, respectively. We denote the hidden state output at the first token by h ∈ R^h and obtain the unnormalized log probability p(q, a_i | c) = W h ∈ R^1, where W ∈ R^{1×h} is a weight matrix. We obtain the final prediction by applying a softmax layer over the unnormalized log probabilities of all options associated with q and picking the option with the highest probability.
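For the pretrained-LM reader, the scheme above matches the standard multiple-choice fine-tuning setup; the following sketch uses the HuggingFace Transformers multiple-choice head with bert-base-uncased as a stand-in checkpoint (the actual experiments use the models listed in the next section):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

def score_options(context, question, options, max_len=512):
    """Encode "[CLS] c [SEP] q + a_i [SEP]" for every option, project each [CLS] state to a
    scalar logit, and softmax over the options of the same question."""
    firsts = [context] * len(options)                   # context c (only the context is truncated)
    seconds = [question + " " + a for a in options]     # qa_i = q + a_i
    enc = tokenizer(firsts, seconds, truncation="only_first",
                    max_length=max_len, padding=True, return_tensors="pt")
    enc = {k: v.unsqueeze(0) for k, v in enc.items()}   # shape: (1, num_options, seq_len)
    with torch.no_grad():
        logits = model(**enc).logits                    # shape: (1, num_options)
    return torch.softmax(logits, dim=-1).squeeze(0)

# Prediction: the option index with the highest probability.
# pred = int(torch.argmax(score_options(c, q, option_texts)))
```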

Experimental Settings
Dealing with TWMLE Data: Since TWMLE uses the same document collection as USMLE for solving problems, we translate the questions in TWMLE from traditional Chinese to English via Google Translate and then use the same models as for USMLE.

MAX-OUT:
We use spaCy as the English tokenizer and HanLP as the Chinese tokenizer. We use the 200-dimensional word2vec word embeddings induced from PubMed and PMC texts (Moen and Ananiadou 2013) for English text, and the 300-dimensional Chinese fastText word embeddings (Bojanowski et al. 2017) for Chinese text. The maximum sequence length for a passage is limited to 450 tokens, while that for the question and answer pair is limited to 150 tokens.

Pre-trained LMs: Among the pre-trained language models we evaluate are clinical BERT (Alsentzer et al. 2019), bio-medical RoBERTa-Base obtained by adapting RoBERTa-Base to bio-medical scientific papers from the Semantic Scholar corpus (denoted as BIOROBERTA-BASE) (Gururangan et al. 2020), and English RoBERTa-Large (ROBERTA-LARGE). We did not try further fine-tuning BERT models on our collected textbooks since they contain only 12-15M tokens, far fewer than the typical corpus size for model pre-training (billions of tokens or more).
We set the learning rate and effective batch size (the product of batch size and gradient accumulation steps) to 2 × 10^-5 and 18 for base models, and to 1 × 10^-5 and 6 for large models. We truncate the longest sequence to 512 tokens after sentence-piece tokenization (we only truncate the context). We fine-tune the English/Chinese models for 8/16 epochs, set the warm-up steps to 1000, and keep the default values for the other hyper-parameters (Devlin et al. 2019).
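One way to express these settings with the HuggingFace Trainer API is sketched below (not necessarily the authors' training code; the per-device batch size / gradient accumulation split is our choice, only their product of 18 is stated above):

```python
from transformers import TrainingArguments

# Fine-tuning configuration for a base-size model on the English questions.
training_args = TrainingArguments(
    output_dir="medqa_usmle_base",
    learning_rate=2e-5,              # 1e-5 for large models
    per_device_train_batch_size=6,
    gradient_accumulation_steps=3,   # 6 * 3 = 18 effective batch size (6 for large models)
    num_train_epochs=8,              # 8 epochs for English models, 16 for Chinese models
    warmup_steps=1000,
)
```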

Baseline Results
Tables 7 and 8 summarize the performance of all baselines on the three datasets (Table 7: performance of baselines in accuracy (%) on the MCMLE dataset). By looking at these two tables, we observe the following:

• Our customized IR system outperforms the off-the-shelf version, and the improvement is larger on the English questions.

• For the MCMLE dataset, pretrained models significantly outperform non-pretrained models, and all neural models outperform the non-neural baselines. However, for the USMLE and TWMLE datasets, the non-pretrained neural model (MAX-OUT) cannot even surpass the IR baseline, and most surprisingly, most of the pretrained models on the USMLE dataset cannot beat the IR baseline either.
Overall, even the strongest pretrained models (BIOBERT-LARGE, ROBERTA-LARGE) cannot achieve good scores on any of the three datasets, validating the great challenge of our proposed data. Notably, we did not include human performance for comparison due to the high variance of scores among human examinees (the best medical students can earn almost full marks, while weaker ones cannot pass the exams).

Error Analysis
Quantitative Analysis: Since our approach to the proposed data involves two stages, i.e., document retrieval and reading comprehension, both stages are potential error sources. We first check the performance of the IR-based document retrieval stage by letting two medical doctors annotate whether the top 25 paragraphs retrieved by IR-CUSTOM contain enough evidence for answering the question, over 100 randomly selected samples from the development set. We use three annotation levels: full evidence (the evidence is complete enough to derive the answer), partial evidence (the evidence is useful but not complete), and no evidence (no evidence is found at all). Table 9 summarizes the results of the annotation. We also summarize the percentages of samples for which we find full, partial, or no evidence separately for type 1 and type 2 questions on the MCMLE and TWMLE datasets. From this table, we see that only the MCMLE dataset has good retrieval recall, while for the majority of samples in USMLE, we cannot find any evidence in the retrieved text. Moreover, among the samples for which full or partial evidence is found, we calculate the percentage of samples whose evidence appears in the top N (1, 5, 10, 15) retrieved paragraphs, as shown in Table 10. Overall, we have two findings: 1) The MCMLE and TWMLE datasets have a much higher percentage of type 1 questions than USMLE, which is positively correlated with their much better IR retrieval performance; 2) the poor evidence retrieval performance on the USMLE dataset should be the main cause of the extremely low performance of the baseline models shown in Table 8 (neural models are even beaten by the IR baseline).

Qualitative Analysis: Seeing such poor retrieval performance on the USMLE dataset, we wonder what the reasons could be. After taking a close look at the successful and failed samples in the retrieval stage, we can summarize one success pattern as well as two failure patterns that cover the majority of cases, described below. Recall that almost all questions in USMLE are case studies (type 2 questions), and that the IR system always returns snippets, each of which matches only a small portion of the question text.

Success Patterns:
The question asks about the most probable diagnosis, which involves only one step of reasoning (inference from symptoms/signs/findings to the diagnosis of a disease), and it is easy to narrow the possible diagnoses down to one or two based on some distinctive condition terms. In this case, it is easy for the IR system to obtain useful evidence snippets by matching those key condition terms. For example, given a case of a stressed young female patient suffering from recurrent headaches that alternately affect the right or left side of the head and are exacerbated by loud sounds or bright light, along with nausea, the IR system can identify these typical, condition-specific terms (highlighted by italic font) and retrieve suitable evidence about migraine headache.
Failure Patterns: 1) The question still asks about the most probable diagnosis; however, each of the patient's symptoms is very common and could correspond to many possible diagnoses. In this case, the IR system may return descriptions of miscellaneous diseases, each of which matches part of the patient's symptoms, signs, or other findings, yet none of the retrieved texts is relevant to the correct diagnosis. We provide one example in Section D of the Supplementary Material for illustration. 2) The question asks about the most appropriate treatment, the examination needed, the mechanism of a certain condition, etc., all of which involve two steps of reasoning. That is, we need to first derive the diagnosis based on the patient's condition and then answer the question based on the inferred diagnosis. In this case, it is highly likely that the IR system only returns evidence that enables the first step of reasoning (making the diagnosis) but not the second one. We also provide an example in Section D of the Supplementary Material. Last but not least, we performed a qualitative analysis of the translation quality from traditional Chinese to English for the TWMLE questions during the experiments and found that most of the sentences are good enough, although a few of them are not entirely fluent. However, since all medical terms in TWMLE questions are originally in English, such minor disfluency should not affect evidence retrieval and question answering.

Conclusion
We present the first open-domain multiple-choice question answering dataset for solving medical problems, MEDQA, collected from real-world professional examinations and requiring extensive and advanced domain knowledge to answer its questions. The dataset covers three languages: English, simplified Chinese, and traditional Chinese. Together with the question data, we also collect and release a large-scale corpus of medical textbooks from which reading comprehension models can obtain the knowledge necessary for answering the questions. We implement several state-of-the-art methods as baselines for this dataset by cascading two components: document retrieval and reading comprehension. Experimental results demonstrate that even the current best approach cannot achieve good performance on these data. We hope that more research efforts from the community will be devoted to this dataset so that future OpenQA models can become strong enough to solve such complex real-world problems.

A Data Collection
For the USMLE and MCMLE datasets, we scraped several websites that provide the question banks, and for the TWMLE dataset, we downloaded the examination materials in PDF format directly from the official National Examination website in Taiwan and then converted them into digital format via Optical Character Recognition (OCR).

B Information Retrieval (IR)
In this section, we describe the re-weighted BM25 scoring function, defined by the following equations:

$$\mathrm{score}(Q, D) = \sum_{i=1}^{n} \mathrm{BM25}(q_i, D) \cdot \mathrm{IDF}(q_i) \cdot \frac{f(q_i, Q) \cdot (k_Q + 1)}{f(q_i, Q) + k_Q \cdot \left(1 - b_Q + b_Q \cdot \frac{\mathit{queryLen}}{\mathit{avgQueryLen}}\right)},$$

$$\mathrm{BM25}(q_i, D) = \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_D + 1)}{f(q_i, D) + k_D \cdot \left(1 - b_D + b_D \cdot \frac{\mathit{docLen}}{\mathit{avgDocLen}}\right)},$$

where Q = {q_1, q_2, ..., q_n} is the query consisting of n query terms (each query term in our method is actually an n-gram); IDF(q_i) is the inverse document frequency of the query term q_i; f(q_i, Q) and f(q_i, D) represent the frequency of query term q_i in the query Q and in the document D, respectively; k_D, k_Q, b_D, and b_Q are hyper-parameters whose values in our experiments are summarized in Table B.13; queryLen and docLen are the lengths in tokens of the current query Q and document D, respectively; and avgQueryLen and avgDocLen are the average lengths of all queries and documents.
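A direct transcription of the two equations into Python (names are ours; the term statistics are assumed to be precomputed over the collection, and both the IDF variant and the default hyper-parameter values below are placeholders rather than the values in Table B.13):

```python
import math

def idf(term, num_docs, doc_freq):
    """A standard BM25-style inverse document frequency (the exact variant is an assumption)."""
    df = doc_freq.get(term, 0)
    return math.log(1 + (num_docs - df + 0.5) / (df + 0.5))

def bm25_term(term, doc_tf, doc_len, avg_doc_len, num_docs, doc_freq, k_d=1.2, b_d=0.75):
    """BM25(q_i, D): the document-side term weight."""
    f_d = doc_tf.get(term, 0)
    return idf(term, num_docs, doc_freq) * f_d * (k_d + 1) / (
        f_d + k_d * (1 - b_d + b_d * doc_len / avg_doc_len))

def reweighted_bm25(query_terms, query_tf, query_len, avg_query_len,
                    doc_tf, doc_len, avg_doc_len, num_docs, doc_freq,
                    k_q=1.2, b_q=0.75, k_d=1.2, b_d=0.75):
    """score(Q, D): BM25 re-weighted by a query-side BM25-style factor, summed over query terms."""
    score = 0.0
    for q_i in query_terms:
        f_q = query_tf.get(q_i, 0)
        query_factor = f_q * (k_q + 1) / (
            f_q + k_q * (1 - b_q + b_q * query_len / avg_query_len))
        score += (bm25_term(q_i, doc_tf, doc_len, avg_doc_len, num_docs, doc_freq, k_d, b_d)
                  * idf(q_i, num_docs, doc_freq) * query_factor)
    return score
```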

C MetaMap for IR
Since type 2 questions are long and contain many words that are not useful for retrieving relevant context, we adopt the MetaMap tool to filter out those unwanted words. MetaMap was developed by the National Library of Medicine (NLM) to map biomedical text to concepts in the Unified Medical Language System (UMLS). It uses a knowledge-intensive hybrid approach combining natural language processing (NLP) and computational linguistic techniques. We use it to extract and standardize medical concepts from any biomedical or clinical text. Table C.14 shows one example obtained by sending a question to the MetaMap API and parsing the returned result, where the words highlighted in bold are the identified medically-related entities; we concatenate them together as the processed question for retrieval.
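A hypothetical sketch of this filtering step; the `extract_medical_terms` callable stands in for whatever interface is used to query MetaMap (e.g., its web API or a wrapper such as pymetamap) and is an assumption rather than the authors' exact pipeline:

```python
def filter_question_for_retrieval(question, extract_medical_terms):
    """Keep only the medically-related entities MetaMap identifies in the question and
    concatenate them as the processed retrieval query, dropping all other words.

    `extract_medical_terms` is any callable that maps a text string to the list of
    medical entity strings recognized by MetaMap (interface details depend on the setup)."""
    terms = extract_medical_terms(question)
    return " ".join(terms)

# Example with a dummy extractor (for illustration only):
# query = filter_question_for_retrieval(question_text,
#                                       lambda t: ["blisters", "antibiotics", "epidermis"])
```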

Question
A 61-year-old man presents to the emergency department because he has developed blisters at multiple locations on his body. He says that the blisters appeared several days ago after a day of hiking in the mountains with his colleagues. When asked about potential triggering events, he says that he recently had an infection and was treated with antibiotics but he cannot recall the name of the drug that he took. In addition, he accidentally confused his medication with one of his wife's blood thinner pills several days before the blisters appeared. On examination, the blisters are flesh-colored, raised, and widespread on his skin but do not involve his mucosal surfaces. The blisters are tense to palpation and do not separate with rubbing. Pathology of the vesicles show that they continue under the level of the epidermis. Which of the following is the most likely cause of this patient's blistering?

D Failure Patterns of IR
In Table D.15, we show examples of each of the two failure patterns mentioned in the main text. The first example illustrates the first failure pattern, where the model fails to perform the correct one-step reasoning for disease diagnosis because the symptoms and patient history mentioned in the case description are very common (e.g., cough with sputum, increased urinary frequency) and are not disease- or condition-specific. Although the better option for this case is to examine the patient's abdomen, given his age and the lack of information about his abdominal condition, the IR system only returns miscellaneous pieces of evidence about (1) a similar case, (2) lung cancer and smoking, (3) urinary tract infection, and (4) cholesterol issues, none of which can actually answer the given question.
The second example, illustrating the second failure pattern, is a case about a mother with gestational diabetes/hypertension and her baby. In this case, the IR system can identify the medical condition of the mother and retrieve information relevant to her. However, the question is actually asking about the baby, who is only described at the very end of the question. Unfortunately, the IR system only returns evidence for the first step of reasoning (the mother's diagnosis) but not for the second step (diagnosing the baby's disease based on the mother's diagnosis).