1. Introduction
Rare-disease diagnosis remains a significant clinical problem. Although 3.5–5.9% of the global population is affected by one of more than 6000 rare diseases [1,2], each condition is encountered infrequently by individual clinicians, creating a setting in which diagnosis is often delayed and dependent on repeated specialist referrals [3,4,5]. Computational methods are relevant here not as replacements for clinical judgment but as tools that can organize large disease spaces, prioritize plausible candidates, and attach structured evidence to support specialist review at earlier stages of care.
From a computer-science perspective, rare-disease diagnosis is not merely another multiclass classification problem. The label space is large, the class distribution is severely imbalanced, and the evidence associated with each case is heterogeneous. Clinical narratives may contain long descriptive passages, structured symptom inventories may only partially capture the phenotype, diagnostic-method fields encode procedural context, and images or image-linked captions may add auxiliary information that is neither uniformly available nor trivially fused with text. These properties make flat one-shot prediction an inadequate formulation for the task. A more realistic alternative is ranked differential diagnosis, in which a system returns an ordered list of candidate diseases and is evaluated on whether the true diagnosis appears near the top of that list. This paper operationalizes that idea as full-label-space top-k ranking, where a prediction is counted as successful if the correct disease appears among the first k returned candidates. This formulation is consistent with standard ranked-retrieval evaluation practice [6] and with clinical decision-support settings in which ranked candidate lists guide downstream review [7].
Recent work in rare-disease AI has advanced along several distinct lines, though the field remains fragmented in methodology. One line of studies has focused on decision-support systems and machine learning models for rare diseases, documenting recurring weaknesses in study design such as heterogeneous cohorts, inconsistent validation procedures, limited external testing, and narrow disease coverage [8,9,10]. A second line has shifted attention toward large language models and retrieval-augmented diagnostic support: RareDxGPT demonstrated that retrieval augmentation could improve ChatGPT 3.5-based diagnosis on a small phenotype-oriented case set [11], while RareArena scaled rare-disease language-model evaluation to a much larger corpus derived from PubMed Central case reports [12]. A third line, situated in broader clinical AI, has shown that multimodal integration and explanation generation are promising but difficult to validate rigorously [13,14,15]. Collectively, these strands show progress in model capability, dataset construction, and prompting-based evaluation. However, they also reveal that benchmark design, modality integration, and explanation assessment are typically addressed in isolation rather than within a unified downstream diagnosis framework.
This fragmentation exposes the central gap addressed here. To the best of the authors’ knowledge, no existing rare-disease study provides a downstream benchmark built on a published multimodal case-report resource that simultaneously supports full-label-space disease ranking, strong lexical and supervised baselines, dense retrieval and reranking, multimodal late fusion, and quantitative evaluation of grounded explanation. Text-only prompting benchmarks address language-model performance, but they do not establish how retrieval, classification, reranking, and multimodal evidence should be combined in a reproducible disease-ranking pipeline. Conversely, data-resource papers provide the substrate for experimentation but do not define a benchmark task, split protocol, or evaluation framework for diagnostic reasoning. ZebraMap was introduced as a multimodal rare-disease knowledge map built from case reports, disease summaries, structured clinical fields, and linked images [16]. What remains unresolved is whether that resource can function as a rigorous benchmark for rare-disease diagnosis rather than only as a curated dataset, and whether grounded explanation can be evaluated on top of fixed diagnosis outputs in a reproducible and analytically useful manner.
Building on this gap, the present work investigates four research questions. First, can ZebraMap support a reproducible full-label-space benchmark for rare-disease diagnosis under leakage-aware splitting and ranking-based evaluation? Second, how do lexical retrieval, sparse supervised classification, dense biomedical retrieval, cross-encoder reranking, and caption-based multimodal late fusion compare within a common experimental framework? Third, can evidence-grounded explanation quality be measured quantitatively on top of frozen diagnosis outputs rather than only through qualitative examples? Fourth, how sensitive is explanation quality to the choice of open instruction model and to the introduction of support-case evidence? These questions map directly to the gap identified above: benchmark construction, model comparison, explanation assessment, and diagnosis-versus-explainability tradeoffs.
To answer these questions, ZebraMap is transformed into a downstream benchmark in which each patient case is treated as a diagnosis query and each disease profile is treated as a candidate label within a catalog of 1727 diseases. The proposed framework proceeds in five stages. Raw multimodal inputs are first normalized into case-level and disease-level benchmark artifacts. Cases are then partitioned using grouped splitting by source article to reduce article-level leakage. Each evaluation case is next processed through three text-based candidate-generation branches: BM25 lexical retrieval, a class-balanced TF–IDF classifier, and dense retrieval with MedCPT encoders. Their outputs are combined through reciprocal-rank fusion and refined through MedCPT cross-encoder reranking. Image-linked evidence is incorporated at a later stage through caption-based late fusion rather than through a learned joint text–image encoder, keeping the multimodal contribution explicit. Finally, a grounded explanation module generates matched findings, rationale statements, alternative considerations, and confirmatory-test suggestions from frozen ranking outputs. This design addresses the identified gap by supporting benchmark reproducibility, strong baseline comparison, explicit control over modality fusion, and quantitative evaluation of explanation quality without conflating diagnosis and explanation into a single model.
The contributions of this paper are as follows:
Benchmark formulation. ZebraMap is recast as a reproducible top-k ranking benchmark over 1727 rare diseases with grouped splitting by source article.
Hybrid diagnosis pipeline. A hybrid pipeline combining BM25, a class-balanced TF–IDF classifier, MedCPT dense retrieval, cross-encoder reranking, and caption-based late fusion attains test MRR 0.3905 and Recall@10 0.5507 on 10,895 test cases, while the class-balanced TF–IDF classifier alone reaches MRR 0.4200 and Recall@10 0.6279, so the contribution of the hybrid pipeline lies in integrating ranking with grounded explanation rather than in maximizing single-metric accuracy.
Grounded explanation on frozen predictions. An explanation stage operating only on frozen diagnosis outputs is evaluated on 256 cases with seven quantitative metrics, including citation coverage and faithfulness deletion; the explanation metrics are computational surrogates and not clinician-validated judgments.
Diagnosis–explainability tradeoff. A four-model comparison (Qwen, Mistral, Gemma, and Llama) and a support-case ablation together quantify a tradeoff between ranking accuracy and explanation richness under frozen-prediction control.
The remainder of this paper is organized as follows. Section 2 reviews the related literature on rare-disease decision support, language-model-based diagnosis, and multimodal clinical AI. Section 3 describes the benchmark construction, split protocol, diagnosis pipeline, multimodal late-fusion design, explanation framework, and evaluation metrics. Section 4 presents the benchmark results, ablations, slice analyses, and explanation-model comparison. Section 5 discusses the implications of the findings, including the role of strong sparse baselines and the observed diagnosis-versus-explainability tradeoff. Section 6 concludes the paper and outlines directions for future work.
3. Materials and Methods
3.1. Framework Overview
The benchmark framework is a sequential diagnosis-and-explanation pipeline in which disease ranking and explanation generation are treated as related but distinct stages.
Figure 1 summarizes the full workflow. Raw ZebraMap resources were first normalized into two benchmark artifacts: a case-level manifest used for query construction and a disease-level bank used for retrieval, classification, and reranking. Cases were then partitioned by source article so that case reports originating from the same publication did not leak across training and evaluation splits. Each evaluation case was transformed into a unified diagnosis query and processed by three text-based candidate-generation branches: BM25 lexical retrieval, a class-balanced term frequency–inverse document frequency (TF–IDF) classifier, and dense biomedical retrieval with MedCPT. The resulting candidate lists were merged by reciprocal-rank fusion (RRF), reranked with a MedCPT cross-encoder, and rescored by caption-mediated late fusion using image-linked text. Only after the final ranked disease list was fixed was an explanation model invoked.
This staged design served two purposes. First, it enabled sparse, dense, and hybrid rankers to be compared within a common benchmark protocol. Second, it kept explanation generation downstream from the ranking model so that explanation quality could be studied without altering diagnosis outputs. That separation was important because one of the research questions concerned grounded explanation as an evaluable object in its own right, rather than as a by-product of a generative diagnostic model.
Figure 2 expands the ranking core of the framework and shows where caption-based image evidence entered the pipeline.
3.2. Problem Formulation
Let $\mathcal{C}$ denote the set of evaluation cases and let $\mathcal{D}$ denote the disease catalog, where $|\mathcal{D}| = 1727$ in the full benchmark. Each case $c \in \mathcal{C}$ has a gold disease label $y_c \in \mathcal{D}$ and is represented by a normalized query text $q_c$ derived from the case narrative, structured symptoms, and diagnostic-method fields. Each disease $d \in \mathcal{D}$ is represented by a profile text $p_d$ derived from disease summaries and structured disease-level fields. No case content from any split (training, validation, or test) is used to build $p_d$, so the disease-side representation cannot leak gold-label evidence through the case partitioning.
The ranking model assigns each pair $(c, d)$ a score $s(c, d)$ and returns an ordered list of diseases,
$$\pi_c = \left(d_{(1)}, d_{(2)}, \ldots, d_{(|\mathcal{D}|)}\right), \qquad s(c, d_{(1)}) \ge s(c, d_{(2)}) \ge \cdots \ge s(c, d_{(|\mathcal{D}|)}), \tag{1}$$
where higher scores indicate greater diagnostic plausibility. The rank of the true disease is denoted by
$$r_c = \min\{\, j : d_{(j)} = y_c \,\}. \tag{2}$$
A top-k success event occurs when the correct disease appears among the first k entries of $\pi_c$,
$$\mathrm{hit}_k(c) = \mathbb{1}\left[\, r_c \le k \,\right], \tag{3}$$
which is consistent with ranked-retrieval evaluation practice [6]. This formulation was selected because rare-disease decision support typically benefits from a short ordered differential rather than an immediate single-label commitment [7].
The framework separated diagnostic ranking from explanation generation. Once the final ranking had been fixed, an explanation function $g$ generated a grounded explanation,
$$e_c = g\left(\pi_c^{(K)}, \hat{d}_c, E_c\right), \tag{4}$$
where $\pi_c^{(K)}$ is the top-K ranked list, $\hat{d}_c$ is the top-ranked disease, and $E_c$ is the supporting evidence bundle assembled from case snippets, alternative predictions, and structured disease information. This decomposition made it possible to analyze diagnosis quality and explanation quality separately.
3.3. Candidate Generation and Hybrid Retrieval
The candidate-generation stage balances lexical precision, sparse supervised discrimination, and semantic retrieval. Each case was represented by a single query text obtained by concatenating the normalized free-text narrative with structured symptom and diagnostic-method fields. Rare-disease case reports distribute diagnostically relevant evidence across prose descriptions and structured fields; excluding either source would reduce recall in the early ranking stages. The same query representation was used across all candidate-generation branches so that performance differences reflected the retrieval or classification mechanism rather than changes in input construction.
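The first branch used BM25 over disease-profile text [22]. BM25 was selected because it is a strong and interpretable probabilistic baseline for ranking tasks in which exact terminology overlap remains informative. In rare-disease case reports, diagnostically important strings such as gene names, syndromic descriptors, or radiologic findings are often repeated verbatim; a lexical branch therefore provided a useful lower-bound baseline and a complementary signal to dense retrieval. The BM25 index used the standard parameters $k_1$ and $b$, which were retained from the benchmark implementation. The lexical score for case $c$ and disease $d$ was
$$s_{\mathrm{BM25}}(c, d) = \sum_{t \in q_c} \mathrm{IDF}(t)\, \frac{f(t, p_d)\,(k_1 + 1)}{f(t, p_d) + k_1\left(1 - b + b\,\frac{|p_d|}{\mathrm{avgdl}}\right)}, \tag{5}$$
where $q_c$ is the case query, $p_d$ is the disease-profile text, $f(t, p_d)$ is the frequency of term $t$ in $p_d$, $|p_d|$ is the profile length in tokens, and $\mathrm{avgdl}$ is the mean profile length.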
The second branch used a sparse supervised classifier built from TF–IDF features [23]. TF–IDF was used because it reweights tokens by within-document prominence and corpus rarity, which is appropriate when case descriptions contain both common medical vocabulary and highly discriminative rare-disease cues. The classifier was implemented as an SGDClassifier with logistic loss, regularization, and class balancing. The classifier used unigram and bigram features, English stop-word filtering, sublinear term frequency scaling, a minimum document frequency of 2, a maximum vocabulary size of 50,000 terms, a regularization strength $\alpha$, a maximum of 2000 training iterations, and a convergence tolerance. This branch was retained because sparse linear models remain competitive on long-tail label spaces and often outperform more complex methods when label imbalance is severe.
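The third branch used dense retrieval with MedCPT query and article encoders [24]. Dense retrieval was selected to recover semantically related diseases even when case wording and disease summaries did not share exact vocabulary. The query encoder was ncbi/MedCPT-Query-Encoder, the document encoder was ncbi/MedCPT-Article-Encoder, the maximum input length was 512 tokens, and the retrieval batch size was 16. Dense similarity was computed with cosine similarity,
$$s_{\mathrm{dense}}(c, d) = \frac{\mathbf{h}_c^{\top} \mathbf{h}_d}{\lVert \mathbf{h}_c \rVert\, \lVert \mathbf{h}_d \rVert}, \tag{6}$$
where $\mathbf{h}_c$ is the query-encoder embedding of the case query and $\mathbf{h}_d$ is the article-encoder embedding of the disease profile.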
The three ranked lists were merged by RRF [25]. In the frozen main run, the hybrid retriever was enabled with per-source truncation of 100 candidates, final retrieval depth 100, candidate-pool depth 150, an RRF constant $\kappa$, and source weights $w_{\mathrm{BM25}}$, $w_{\mathrm{clf}}$, and $w_{\mathrm{dense}}$. The fusion score was
$$s_{\mathrm{RRF}}(c, d) = \sum_{m \in \{\mathrm{BM25},\, \mathrm{clf},\, \mathrm{dense}\}} \frac{w_m}{\kappa + r_m(c, d)}, \tag{7}$$
where $r_m(c, d)$ is the rank assigned by source $m$. The weighted RRF design was used because it preserves ranking robustness while allowing stronger signals to contribute differentially. Algorithm 1 summarizes this stage.
Algorithm 1. Hybrid Candidate Generation and Reciprocal-Rank Fusion
Input: case query $q_c$, disease catalog $\mathcal{D}$, per-source depth $n_{\mathrm{src}}$, final depth $n$.
Output: hybrid candidate list $\mathcal{H}_c$.
1. Score all diseases in $\mathcal{D}$ with BM25 over disease-profile text and retain the top $n_{\mathrm{src}}$ results.
2. Transform $q_c$ with the fitted TF–IDF vectorizer, predict class probabilities with the class-balanced linear classifier, and retain the top $n_{\mathrm{src}}$ diseases.
3. Encode $q_c$ and all disease profiles with MedCPT, compute cosine similarity, and retain the top $n_{\mathrm{src}}$ diseases.
4. Merge the three ranked lists with RRF using the source weights and the RRF constant.
5. Sort diseases by the fused RRF score and retain the top $n$ candidates.
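The fusion step in Algorithm 1 can be illustrated with a minimal Python sketch. The function name `weighted_rrf`, the equal source weights, and the constant `k=60` (the value conventionally used for reciprocal-rank fusion) are illustrative assumptions, not the frozen settings of the benchmark run.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, k=60, depth=100):
    """Fuse ranked lists with score(d) = sum_m w_m / (k + rank_m(d)).

    ranked_lists: {source_name: [disease_id, ...]} ordered best-first.
    weights:      {source_name: weight}; missing sources default to 1.0.
    k, depth:     illustrative defaults, not the benchmark's settings.
    """
    scores = defaultdict(float)
    for source, ranking in ranked_lists.items():
        w = weights.get(source, 1.0)
        for rank, disease in enumerate(ranking, start=1):
            scores[disease] += w / (k + rank)
    # Sort by fused score, highest first; break ties by identifier.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:depth]

# Toy example with three sources and four hypothetical disease identifiers.
lists = {
    "bm25": ["D1", "D2", "D3"],
    "classifier": ["D2", "D1", "D4"],
    "dense": ["D2", "D3", "D1"],
}
weights = {"bm25": 1.0, "classifier": 1.0, "dense": 1.0}
fused = weighted_rrf(lists, weights)
fused_ids = [disease for disease, _ in fused]
```

A disease that is missing from one source simply receives no contribution from that source, which is why the fused pool can be larger than any single truncated list.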
3.4. Cross-Encoder Reranking and Caption-Mediated Late Fusion
The hybrid candidate list was refined in two stages. First, a cross-encoder reranker rescored each candidate using direct case–disease interaction. Second, a caption-mediated late-fusion stage added image-linked evidence without learning a shared text–image representation. This separation was deliberate: reranking was used to improve text-based ordering among plausible disease candidates, whereas the multimodal signal was introduced conservatively as an auxiliary score so that its contribution remained transparent.
Reranking used the ncbi/MedCPT-Cross-Encoder model loaded in safetensors format. For each case, the top 50 fused candidates were rescored by the cross-encoder. This design reflects the standard retrieval-then-rerank pattern in information retrieval: a high-recall candidate generator narrows the search space, and the cross-encoder then applies a more computationally expensive interaction model to the shortlist. The reranker therefore improved local ordering while preserving tractability.
Multimodal evidence was incorporated only after reranking. The benchmark did not learn a joint image–text embedding space and did not train an image-only classifier. Instead, image information was represented through linked caption text. A disease-specific caption bank was built from training-set image captions, and each evaluation case contributed a concatenated case-caption string when such captions were available. The auxiliary image score was computed as Jaccard-style lexical overlap between these two token sets,
$$s_{\mathrm{img}}(c, d) = \frac{\lvert T(a_c) \cap T(B_d) \rvert}{\lvert T(a_c) \cup T(B_d) \rvert}, \tag{8}$$
where $a_c$ is the concatenated case-caption text, $B_d$ is the disease caption bank, and $T(\cdot)$ denotes stop-word-filtered tokenization. This choice kept the multimodal contribution explicit and auditable, which was preferred over an opaque fusion module in a benchmark whose main goal was analysis rather than end-to-end model optimization.
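A caption-overlap score of this kind can be sketched in a few lines of Python. The tokenizer, the tiny stop-word list, and the function names below are hypothetical stand-ins, since the text does not specify the benchmark's tokenization rules or stop-word list.

```python
import re

# Hypothetical, deliberately small stop-word list (an assumption).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "with", "on", "is", "to"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def caption_overlap(case_captions, disease_caption_bank):
    """Jaccard overlap between the two stop-word-filtered token sets."""
    a = tokenize(case_captions)
    b = tokenize(disease_caption_bank)
    if not a or not b:
        return 0.0  # no caption text on one side: no image evidence
    return len(a & b) / len(a | b)

score = caption_overlap(
    "MRI of the brain with cerebellar atrophy",
    "Brain MRI showing cerebellar atrophy and ventriculomegaly",
)
```

Returning 0.0 when either side has no captions is one simple way to let cases without images pass through the fusion stage unchanged; whether the benchmark handles that case the same way is not stated.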
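All branch scores were min–max normalized within each case. The final late-fusion score was
$$S(c, d) = \lambda_{\mathrm{BM25}}\,\tilde{s}_{\mathrm{BM25}} + \lambda_{\mathrm{clf}}\,\tilde{s}_{\mathrm{clf}} + \lambda_{\mathrm{dense}}\,\tilde{s}_{\mathrm{dense}} + \lambda_{\mathrm{rr}}\,\tilde{s}_{\mathrm{rr}} + \lambda_{\mathrm{img}}\,\tilde{s}_{\mathrm{img}}, \tag{9}$$
where the $\lambda$ terms are the late-fusion weights, tildes denote per-case min–max normalization, $s_{\mathrm{dense}}$ is the dense retrieval score, and $s_{\mathrm{rr}}$ is the cross-encoder score. The image-rerank stage evaluated the top 20 candidates per case. Candidate weight sets were tuned on the validation split by maximizing the composite objective $J$.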
The selected frozen weights assigned one coefficient to each signal: $\lambda_{\mathrm{BM25}}$ for BM25, $\lambda_{\mathrm{clf}}$ for the classifier, $\lambda_{\mathrm{dense}}$ for dense retrieval, $\lambda_{\mathrm{rr}}$ for reranking, and $\lambda_{\mathrm{img}}$ for the caption-based image score. We set $\lambda_{\mathrm{dense}} = 0$ because the dense signal had already entered the pipeline upstream through reciprocal-rank fusion (Equation (7)) and the MedCPT cross-encoder reranker; adding an explicit weight on the standalone dense score did not improve the composite objective $J$ on the validation split. Dense retrieval is therefore still important in candidate generation, just not in the final scoring step. Algorithm 2 summarizes this stage.
Algorithm 2. Validation-Tuned Caption-Mediated Late Fusion
Input: reranked candidate list $\mathcal{R}_c$, case-caption text $a_c$, disease caption bank $B_d$, candidate weight set $\Lambda$.
Output: final ranked list $\pi_c$.
1. For each candidate disease $d \in \mathcal{R}_c$, compute the caption-overlap score using Equation (8).
2. Min–max normalize the BM25, classifier, dense-retrieval, reranker, and image scores within case $c$.
3. For each candidate weight vector in $\Lambda$, compute validation-set rankings with Equation (9).
4. Select the weight vector that maximizes the objective in Equation (10) on the validation split.
5. Apply the selected weights to the validation and test candidates and sort diseases by the late-fusion score of Equation (9).
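The normalization and scoring steps of Algorithm 2 can be sketched as follows. The weight values are purely illustrative: they only mirror the qualitative pattern described in the text (classifier largest, dense weight zero) and are not the frozen weights of the main run; the function names are likewise hypothetical.

```python
def min_max(values):
    """Per-case min-max normalization; constant lists map to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def late_fusion(candidates, raw_scores, weights):
    """Weighted sum of normalized signals for one case.

    candidates: [disease_id, ...] for a single case.
    raw_scores: {signal_name: [score aligned with candidates]}.
    weights:    {signal_name: weight}; missing signals get weight 0.
    """
    normalized = {name: min_max(vals) for name, vals in raw_scores.items()}
    fused = []
    for i, disease in enumerate(candidates):
        total = sum(weights.get(name, 0.0) * normalized[name][i]
                    for name in normalized)
        fused.append((disease, total))
    # Highest fused score first; ties broken by identifier.
    return sorted(fused, key=lambda pair: (-pair[1], pair[0]))

candidates = ["D1", "D2", "D3"]
raw_scores = {
    "classifier": [0.2, 0.9, 0.5],
    "reranker": [3.0, 1.0, 2.0],
    "dense": [0.7, 0.6, 0.5],
    "image": [0.0, 0.1, 0.4],
}
# Illustrative weights only (assumption), with the dense term zeroed.
weights = {"classifier": 0.6, "reranker": 0.3, "dense": 0.0, "image": 0.1}
ranking = late_fusion(candidates, raw_scores, weights)
```

Weight selection would wrap this function in a loop over candidate weight vectors and keep the vector with the best validation objective; that loop is omitted here because the form of the objective is not reproduced in this section.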
3.5. Grounded Explanation Generation and Auxiliary Analyses
After the diagnostic ranker produced the final candidate list, the explanation stage generated a structured rationale for a fixed subset of cases. The explanation input contained five elements: the normalized diagnosis query, the top predicted disease, up to three alternative predictions, the disease-profile text, and a compact evidence bundle derived from the case narrative and retrieved support. Supporting case snippets were extracted by scoring case sentences against the top-disease profile and retaining the highest-overlap sentences. The prompt instructed the model to produce matched findings, a rationale for the leading diagnosis, a short discussion of alternatives, and suggested confirmatory tests. This format was chosen to keep explanations tied to observable evidence rather than to unconstrained free-form narrative.
In the frozen main run, explanation generation used Qwen/Qwen2.5-7B-Instruct with the transformers_generate backend, temperature 0.1, maximum generation length 384 tokens, checkpoint interval 8 cases, and maximum explanation subset size 256. The generation prompt was truncated to 2048 input tokens by the tokenizer. The explanation model did not alter the diagnosis ranking; it operated only on frozen predictions. This separation was necessary because the study was designed to compare explanation quality without introducing confounding changes in ranking accuracy.
Two auxiliary ablation families were evaluated. The first added support-case evidence during reranking and explanation. In that ablation, representative training-case snippets and a support score were appended to each candidate disease to test whether explicit case memory improved either ranking or explanation quality. The second ablation held the diagnostic predictions fixed and varied only the explanation model. Four open instruction-tuned generators were evaluated on the same 256-case subset under the same prompt template: Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, Gemma-2-9B-it, and Llama-3.1-8B-Instruct. Because the predictions were frozen, any changes in this comparison reflected explanation generation alone.
3.6. Dataset, Preprocessing, and Benchmark Release
The experiments used the full ZebraMap release described in the original dataset paper [16]. The raw resource comprised structured case data, disease-level information, literature metadata, and a linked image directory. The benchmark package preserved the resolved configuration, environment snapshots, intermediate candidate lists, final predictions, explanations, tables, and figures.
Preprocessing produced two benchmark artifacts. The case_manifest stored one row per case with the disease identifier, source article identifier, normalized case text, structured symptom list, diagnostic-method list, split assignment, and linked image-caption metadata. The disease_bank stored one row per disease with a consolidated profile text assembled from disease summaries and structured disease-level fields. Whitespace was normalized, lightweight stop-word filtering was applied in token-based retrieval stages, and image handling was separated from core text normalization. Cases without images were retained. Grouped splitting was performed at the source_article level to reduce article-level leakage. The disease_bank profiles use only disease-level metadata, and the disease-specific caption banks used in the late-fusion stage (Section 3.4) use only training-set image captions; no validation or test case content enters either resource. The resulting split contained 47,930 training cases, 10,321 validation cases, and 10,895 test cases, with disease counts of 1464, 1131, and 1268, respectively. The full benchmark contained 69,146 structured cases, 1727 diseases, 41,955 cases with raw images, and 51,773 cases with at least one linked indexed image, corresponding to 140,794 linked images after indexing.
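Grouped splitting by source article can be sketched as below. This is a minimal illustration, not the benchmark's splitting code: the function name `grouped_split`, the 70/15/15 article fractions, and the synthetic case records are assumptions, while the seed of 42 follows the reported experimental setup.

```python
import random

def grouped_split(cases, train_frac=0.70, val_frac=0.15, seed=42):
    """Assign whole source articles to splits so no article spans two splits.

    cases: list of dicts that each carry a 'source_article' key.
    Returns {article_id: "train" | "val" | "test"}.
    """
    articles = sorted({case["source_article"] for case in cases})
    random.Random(seed).shuffle(articles)  # deterministic given the seed
    n_train = round(len(articles) * train_frac)
    n_val = round(len(articles) * val_frac)
    split_of = {}
    for i, article in enumerate(articles):
        if i < n_train:
            split_of[article] = "train"
        elif i < n_train + n_val:
            split_of[article] = "val"
        else:
            split_of[article] = "test"
    return split_of

# Synthetic example: 60 cases drawn from 20 hypothetical articles.
cases = [{"case_id": i, "source_article": f"article_{i // 3}"} for i in range(60)]
split_of = grouped_split(cases)
articles_by_split = {"train": set(), "val": set(), "test": set()}
for case in cases:
    articles_by_split[split_of[case["source_article"]]].add(case["source_article"])
```

Because the split is decided per article, every case inherits the split of its article, and case counts per split follow from how many cases each article contributes rather than from the fractions directly.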
The present study involved only secondary analysis of a previously published dataset derived from case reports. No new patient recruitment, intervention, or collection of identifiable private data was performed.
Table 2 summarizes the benchmark release used throughout the paper.
3.7. Evaluation Metrics
Ranking quality was assessed with Recall@k, mean reciprocal rank (MRR), normalized discounted cumulative gain at k (nDCG@k), mean rank, median rank, macro recall, and weighted recall. Recall@k measures whether the correct disease appears within the first k retrieved candidates. MRR emphasizes early placement of the correct disease and is well suited to differential-diagnosis ranking because a correct disease placed at rank 1 contributes more than one placed at rank 10. nDCG@k was included because it rewards higher placement near the top of the list and is standard in information retrieval [6,26]. For a single relevant disease per case, the metric definitions were
$$\mathrm{Recall@}k = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[r_i \le k\right], \qquad \mathrm{MRR} = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{r_i}, \qquad \mathrm{nDCG@}k = \frac{1}{N}\sum_{i=1}^{N} \frac{\mathbb{1}\left[r_i \le k\right]}{\log_2(1 + r_i)},$$
where $N$ is the number of evaluation cases and $r_i$ is the rank of the correct disease for case $i$. Mean rank and median rank are computed only over the subset $\mathcal{C}_{\mathrm{found}}$ of test cases for which the true disease appeared in the retrieved candidate list; cases without a finite rank are not included in these two summaries. Recall@k is reported over the full test set. Macro recall was computed by averaging disease-level recall scores, whereas weighted recall aggregated hits over all cases.
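For a single gold disease per case, these ranking metrics reduce to simple functions of the gold rank. The sketch below is a hedged illustration: the function name is hypothetical, unretrieved cases are represented as `None`, and treating their reciprocal rank as zero is an assumption about the convention rather than a detail confirmed by the text.

```python
import math
import statistics

def ranking_metrics(ranks, k=10):
    """Compute ranking metrics from 1-based gold ranks.

    ranks: one entry per case; None means the gold disease was not retrieved.
    Recall@k, MRR, and nDCG@k average over all cases; mean and median rank
    use only the cases with a finite rank.
    """
    n = len(ranks)
    hits = [r is not None and r <= k for r in ranks]
    found = [r for r in ranks if r is not None]
    return {
        "recall_at_k": sum(hits) / n,
        # Assumption: an unretrieved gold disease contributes 0 to MRR.
        "mrr": sum(1.0 / r for r in found) / n,
        # One relevant item per case, so the ideal DCG is 1.
        "ndcg_at_k": sum(1.0 / math.log2(1 + r)
                         for r, hit in zip(ranks, hits) if hit) / n,
        "mean_rank": statistics.mean(found) if found else None,
        "median_rank": statistics.median(found) if found else None,
    }

# Four toy cases: gold at rank 1, rank 3, not retrieved, and rank 12.
metrics = ranking_metrics([1, 3, None, 12], k=10)
```

In this toy input the case at rank 12 counts toward MRR and the rank summaries but not toward Recall@10 or nDCG@10, which shows why the metrics can move in different directions between stages.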
Grounded explanation quality was evaluated with citation coverage, faithfulness deletion, clinical overlap, and four heuristic judge scores. Citation coverage measured how much explicit evidence was attached to the explanation, based on the set of supporting evidence identifiers. Faithfulness deletion measured the drop in case–disease lexical overlap after removing the snippets used to support the explanation,
$$\mathrm{FD} = \mathrm{Jac}(x, p) - \mathrm{Jac}(x \setminus S, p),$$
where $\mathrm{Jac}(\cdot, \cdot)$ denotes Jaccard lexical overlap, $x$ is the case text, $S$ is the set of supporting snippets, and $p$ is the disease-profile text. Clinical overlap compared matched findings in the generated explanation with structured symptoms and diagnostic-method fields, using the matched-finding text and the structured clinical evidence. Clarity, factual grounding, differential reasoning, and usefulness were reported on a 1–5 heuristic scale computed from explanation length, number of evidence identifiers, and number of alternatives. Because no expert-annotated explanation benchmark was available for this dataset, these scores should be interpreted as operational surrogates rather than clinician-validated judgments.
3.8. Experimental Setup
The frozen main benchmark run was executed with random seed 42. Experiments were conducted on a server equipped with one NVIDIA H100 NVL GPU (95,830 MiB visible memory). The core software environment comprised Python 3.12.9, PyTorch 2.5.1, Transformers 4.57.6, scikit-learn 1.7.1, NumPy 2.2.2, and CUDA 12.4. Data loading and preprocessing used eight parallel workers.
All diagnosis baselines and comparison stages were evaluated under the same grouped split and the same disease catalog. The lexical baseline was BM25 [22]. The sparse supervised baseline was the class-balanced TF–IDF linear classifier built as described in Section 3.3 [23]. The dense retriever and reranker both used MedCPT [24]. The final explainable system combined these components with caption-mediated late fusion and then applied the Qwen2.5-7B explanation stage. The main benchmark was reported from one frozen seeded run. Additional experiments included one support-case ablation run and four explanation-only generator comparisons with frozen diagnosis outputs. This design isolated the effect of each component while keeping the train–validation–test partition fixed across all analyses.
Table 3 consolidates the key implementation settings needed to reproduce the frozen main run from the manuscript text alone.
3.9. Statistical Analysis
The study was benchmark-oriented, reporting point estimates from one frozen split and one fixed random seed. The large held-out test set ($n = 10{,}895$) and the consistent metric ordering across stages (Table 4) lend empirical support to the robustness of the comparisons. For the proportion metrics (Recall@k), Wald 95% confidence intervals and pairwise two-sample z-tests are reported as a footnote to Table 4. The z-tests assume independence between stages, which overstates the variance because the same cases are scored by every stage, so the reported p-values bound the paired tail probabilities from above. Bootstrap confidence intervals for MRR and nDCG@10, paired tests on per-case outcomes, and repeated runs under independent seeds require access to per-case prediction arrays that are not part of the frozen package and are therefore left to follow-up work. Systems were compared by absolute differences in ranking metrics on identical evaluation cases, by stage-wise ablation, and by subgroup analysis over disease-frequency and image-availability slices. For the explanation-model comparison, diagnosis outputs were held fixed, so differences in explanation metrics reflected only the generation stage.
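The Wald interval and the two-sample z-test for proportions can be computed with the standard library alone. The sketch below plugs in the reported test-set Recall@10 values for the hybrid pipeline (0.5507) and the TF–IDF classifier (0.6279) with n = 10,895; the pooled-variance form of the z-test and the function names are assumptions, since the exact variant used for the Table 4 footnote is not stated.

```python
import math

def wald_ci(p, n, z=1.96):
    """Wald 95% confidence interval for a proportion p observed on n cases."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def two_sample_z(p1, p2, n1, n2):
    """Two-sided two-sample z-test for proportions (pooled variance).

    Treats the two samples as independent, which is the simplifying
    assumption discussed in the text.
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

n_test = 10895
ci_low, ci_high = wald_ci(0.5507, n_test)
z_stat, p_value = two_sample_z(0.6279, 0.5507, n_test, n_test)
```

A paired alternative such as McNemar's test would use per-case hit indicators for both systems, which is why it depends on the per-case prediction arrays mentioned above.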
5. Discussion
5.1. Diagnostic Performance and Task Difficulty
On the test split, the final fusion pipeline (MRR 0.3905) ranked second behind the class-balanced TF–IDF classifier (MRR 0.4200), reflecting the design choice to integrate ranking with grounded explanation rather than to maximize accuracy on a single metric. The results show that the proposed hybrid pipeline produces ranked differentials across a large and imbalanced rare-disease space. The final explainable pipeline achieved test MRR 0.3905, Recall@1 0.3078, and Recall@10 0.5507, placing the correct disease within the top 10 candidates in more than half of all test cases. The gap between Recall@1 and Recall@10 shows that a substantial fraction of cases contained enough signal for the correct diagnosis to appear near the top of the ranked list even when not placed first. That pattern is consistent with the clinical goal of ranked differential diagnosis, which aims to surface the correct disease within a manageable review list rather than force a single-label commitment [7].
The slice analysis clarifies where performance is lowest. Diseases with only 1–4 training cases reached test MRR 0.0936 and Recall@10 0.2115, whereas the 10–49 bucket reached MRR 0.4466 and Recall@10 0.6115. This contrast indicates that data sparsity in the long tail, rather than model architecture alone, is the primary limiting factor for rare-disease diagnosis in this setting. Clinical deployment would therefore require strategies targeting ultra-rare conditions, such as few-shot learning, external knowledge bases, or specialist-curated disease profiles [8].
5.2. Pipeline Component Analysis and the Role of Sparse Supervision
The stage-wise comparison clarifies which components contributed most to ranking quality. BM25 provided an interpretable lexical baseline (test MRR 0.2153, Recall@10 0.3441). Dense retrieval and cross-encoder reranking improved those values to MRR 0.2842 and 0.2937, respectively, indicating that semantic matching and pairwise interaction modeling were both useful in this disease space. However, the strongest single stage remained the class-balanced TF–IDF classifier, establishing a sparse baseline that future methods must exceed under the same split protocol.
The explainable pipeline (MRR 0.3905) fell below the classifier baseline (MRR 0.4200), reflecting a deliberate design tradeoff: separating diagnosis and explanation preserves interpretability but relinquishes the accuracy gains of a single optimized ranking stage. Future work may pursue joint optimization of ranking and explanation within a single model while maintaining this interpretive separation. The selected late-fusion weights assigned the largest coefficient to the classifier signal, a smaller weight to reranking, and a still smaller weight to caption-mediated image evidence. This weighting pattern is consistent with the observed accuracy profile: in the present configuration, the most reliable signal remained supervised sparse text matching, while dense retrieval, reranking, and image-linked evidence contributed complementary but weaker gains. Although the zero dense weight removes the dense term from the final scoring step, dense retrieval still contributes upstream through reciprocal-rank fusion in the candidate generator and through the cross-encoder reranker; the hybrid design therefore operates as a staged combination across candidate generation and reranking rather than as a balanced linear blend in the final score.
The multimodal component should be interpreted precisely. Image information entered the pipeline only through caption-based late fusion rather than through a learned shared image–text encoder. Cases with linked images achieved test MRR 0.4019 and Recall@10 0.5629, compared with MRR 0.3579 and Recall@10 0.5161 for cases without linked images. These numbers indicate that image-linked information was associated with better performance in the present pipeline, but they do not indicate that the full visual content of ZebraMap was modeled.
5.3. Grounded Explanation and Diagnosis–Explainability Tradeoffs
The explanation results add a second dimension to the benchmark. On the fixed 256-case subset, the main explanation model achieved citation coverage 0.7334, clinical overlap 0.4609, factual grounding 3.9336, and usefulness 3.8734. These values do not constitute clinician validation, but they show that explanation quality can be quantified systematically rather than reported only through selected examples. The open-model comparison showed that Qwen2.5-7B-Instruct produced the strongest grounding profile among the tested generators, whereas Llama-3.1-8B-Instruct yielded the highest clarity score.
The support-case ablation exposed a more consequential finding. When representative training-case evidence was introduced, explanation quality improved across citation coverage, clinical overlap, factual grounding, and usefulness, while diagnosis quality fell (test MRR from 0.3905 to 0.3549; Recall@10 from 0.5507 to 0.5086). Support-case evidence therefore altered ranking quality and explanation quality in opposite directions. This pattern suggests that richer contextual evidence can improve explanation grounding without necessarily improving the ranking function that produced the explanation.
One possible reason for this divergence is that support cases bring in vocabulary from training neighbours that are lexically similar but diagnostically heterogeneous. The added vocabulary is useful for the explanation stage because the generator has more concrete features to cite, but it dilutes the classifier signal in the ranking stage, since competing diseases gain surface-token overlap. Routing support evidence into the explanation context only, without injecting it into the ranking features, would test this. The interpretation remains a hypothesis pending controlled ablations.
5.4. Limitations and Implications
Several aspects of the design bound the claims of this work and define its scope. The benchmark is built from published case reports, so it inherits the biases and reporting granularity of that literature. The explanation metrics are computational surrogates, and a prospective clinician-validation study with collaborating specialists is planned as the next step. For Recall@k we report Wald 95% confidence intervals and pairwise z-tests in the footnote to Table 4; bootstrap intervals for MRR and nDCG@10, paired tests on per-case outcomes, and repeated seeded runs are left to follow-up work since they require access to per-case prediction arrays that are not part of the frozen package. The multimodal component uses caption-mediated late fusion rather than a learned vision–text encoder such as BiomedCLIP, which is the main direction for the next iteration. Evaluation is also limited to ZebraMap, and cross-benchmark transfer to RareArena [12], Orphanet-derived, and HPO-annotated corpora is deferred because the task formulations differ.
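The Wald interval and pairwise z-test mentioned above can be illustrated with a short sketch that uses the Recall@10 values and test-set size reported in this paper. It assumes that both proportions are computed over all 10,895 test cases and that the pairwise test is an unpaired, pooled two-proportion z-test; because both systems are scored on the same cases, the independence assumption is only an approximation, which is one reason paired tests are deferred. The function names are hypothetical.

```python
import math
from typing import Tuple


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wald interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n), clipped to [0, 1]."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z statistic and two-sided p-value."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z_stat = (p1 - p2) / std_err
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z_stat) / math.sqrt(2.0))
    return z_stat, p_value


N_TEST = 10895  # test cases reported in the paper

# Explainable pipeline, Recall@10 = 0.5507.
low, high = wald_interval(0.5507, N_TEST)

# Classifier baseline (0.6279) versus explainable pipeline (0.5507).
z_stat, p_value = two_proportion_z(0.6279, N_TEST, 0.5507, N_TEST)
```

The Wald form is adequate here because the proportions are far from 0 and 1 and the sample is large; for rare subgroups with few cases, a Wilson interval would be a more conservative choice.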
Despite these constraints, the work makes a concrete contribution to rare-disease diagnosis research. By combining lexical, supervised, and dense retrieval components with caption-mediated multimodal fusion and grounded explanation, the hybrid pipeline produced ranked differentials that placed the correct diagnosis within the top 10 candidates for more than half of test cases across 1727 diseases. The resulting evaluation framework, with explicit splits, reproducible artifacts, strong baselines, and grounded-explanation analysis, provides a foundation for improving computational support for the rare-disease diagnostic process.
6. Conclusions
Rare-disease diagnosis remains difficult because the evidence available at first presentation is incomplete, heterogeneous, and distributed across a long-tail disease space of thousands of conditions. This work developed and evaluated a hybrid multimodal pipeline that combines grouped article-level splitting, text-first query construction, BM25 lexical retrieval, a class-balanced TF–IDF classifier, MedCPT dense retrieval and cross-encoder reranking, caption-mediated late fusion, and a downstream grounded explanation stage operating on frozen diagnosis outputs.
Applied to 1727 rare diseases and 10,895 test cases, the final explainable pipeline achieved MRR 0.3905 (Recall@10 0.5507), placing the correct diagnosis within the top 10 candidates in more than half of cases. The class-balanced TF–IDF classifier established a supervised baseline at MRR 0.4200 (Recall@10 0.6279). The support-case variant improved citation coverage to 0.9961 and factual grounding to 4.9844 at the cost of reduced ranking quality (MRR 0.3549), indicating a diagnosis–explainability tradeoff that future work should resolve. Among tested explanation generators, Qwen2.5-7B-Instruct produced the strongest grounding profile.
These results show that a hybrid retrieval-and-classification approach, extended with caption-mediated multimodal evidence and grounded explanation, can support rare-disease differential diagnosis at scale. The framework supports direct comparison among lexical, supervised, dense, reranking, multimodal, and explanation-oriented components under one leakage-aware protocol; the pipeline outputs are intended to assist specialist review through ranked candidate differentials and grounded explanations rather than to drive autonomous clinical decision-making. Three directions follow from the reported findings: redesigning support-case integration so that richer evidence improves rather than reduces ranking quality, fine-tuning biomedical retrievers and rerankers on this disease space, and replacing heuristic explanation scores with clinician-annotated assessment to bring the pipeline closer to clinical relevance. Cross-benchmark evaluation on adapted versions of RareArena, Orphanet-derived, and HPO-annotated corpora and the addition of a learned vision–text encoder are further directions for future work.