The overall diagnostic framework is designed to closely align with the standard radiological workflow in clinical practice. The proposed two-stage framework is illustrated in Figure 1. In routine clinical settings, following image acquisition by radiology technicians, radiologists interpret chest radiographs by identifying abnormal radiological findings and generating structured reports. These findings serve as critical references for specialist physicians, such as pediatricians, who further integrate imaging evidence with clinical information to establish an accurate diagnosis and formulate an appropriate treatment plan. Motivated by this established workflow, we design a two-stage diagnostic framework that explicitly separates radiological finding recognition from disease-level diagnostic reasoning, thereby improving explainability and traceability.

In the first stage, a vision–language model (VLM) fine-tuned on pediatric radiological data interprets the input chest X-ray and identifies abnormal radiological findings. The recognized findings are then organized into structured and interpretable diagnostic evidence, which serves as the input to the second stage of the framework. In the second stage, a multimodal large language model (MLLM) performs disease diagnosis by reasoning over the evidence generated in the first stage. To compensate for the limited medical domain knowledge of general-purpose MLLMs, additional domain-specific information, including standardized descriptions of radiological findings and patient demographic attributes, is incorporated into the diagnostic process. Specifically, the structured findings, demographic information, and the chest radiograph jointly constitute the input query to the MLLM. Relevant medical knowledge is retrieved from an external database via a retrieval-augmented generation (RAG) mechanism. Finally, the diagnostic MLLM integrates multimodal information from imaging evidence, retrieved medical knowledge, and structured patient data to produce the final disease diagnosis. The following sections describe each module of the proposed framework in detail.
2.1. Vision–Language Model Domain-Specific Fine-Tuning
Radiological findings in chest radiographs are often localized within small anatomical regions and manifested through subtle, fine-grained visual patterns, such as faint opacities, mild consolidations, or slight structural asymmetries. These characteristics make precise recognition particularly challenging, especially in pediatric chest X-rays, where anatomical variations across age groups further increase diagnostic complexity. In this stage, vision–language models (VLMs) are employed to perform radiological finding recognition, leveraging their capability to jointly model visual representations and textual semantics for fine-grained medical interpretation.
Benefiting from the availability of large-scale adult multimodal chest X-ray datasets, including MIMIC-CXR [30], VinDr-CXR [31], and CheXpert [32], VLMs pretrained on these resources have acquired rich radiological representations and substantial medical domain knowledge. Through large-scale pretraining, these models are able to capture clinically meaningful visual patterns and learn robust semantic alignments between chest radiographs and radiological descriptions. Such properties make pretrained VLMs particularly suitable for transfer learning to downstream radiological tasks under limited data settings.
In this work, we adopt the knowledge-enhanced auto diagnosis (KAD) model [33] as the initialization backbone for fine-tuning. Unlike general-purpose vision–language models such as LLaVA or BLIP-2, which are pretrained mainly on natural image–text pairs without explicit medical supervision, KAD is jointly pretrained on adult chest radiographs, corresponding radiology reports, and structured medical knowledge graphs. By leveraging both visual representations and medical knowledge embeddings learned from large-scale adult chest radiography datasets, KAD provides a strong and clinically meaningful starting point for pediatric radiological adaptation. Given the limited availability of publicly accessible pediatric multimodal pretrained models, transferring knowledge from adult-domain radiology data becomes a practical and effective strategy. The semantic associations between imaging findings and disease concepts learned by KAD during adult pretraining can be effectively adapted to pediatric scenarios through domain-specific fine-tuning, thereby alleviating the cold-start problem caused by pediatric data scarcity and improving radiological finding recognition robustness.
The overall fine-tuning process, which constitutes a domain adaptation procedure, is illustrated in Figure 2. Specifically, the KAD model consists of an image encoder and a knowledge encoder (text encoder), both of which are pretrained on paired adult chest X-ray images and associated radiological reports. In our framework, the pretrained weights of these two encoders are used to initialize the VLM, which is subsequently fine-tuned using pediatric chest X-ray data. During fine-tuning, the model adapts the learned adult radiological representations to pediatric imaging characteristics while preserving clinically relevant semantic associations. This transfer learning strategy facilitates the effective migration of adult radiological knowledge to the pediatric domain, alleviates the cold-start problem, and improves radiological finding recognition performance under limited pediatric data availability.
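The initialization step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the pretrained KAD checkpoint and the new model are both represented as flat parameter dictionaries keyed by module name, and the module prefixes (`image_encoder.`, `text_encoder.`) are hypothetical.

```python
def init_from_pretrained(model_params, pretrained_params,
                         prefixes=("image_encoder.", "text_encoder.")):
    """Copy pretrained encoder weights into a fresh model's parameter dict.

    Only parameters under the given module prefixes (here: the two KAD
    encoders) are transferred; all other modules keep their fresh
    initialization and will be learned during pediatric fine-tuning.
    """
    initialized = dict(model_params)
    for name, weight in pretrained_params.items():
        if name.startswith(prefixes) and name in initialized:
            initialized[name] = weight
    return initialized

# Toy parameter dictionaries standing in for real checkpoints.
pretrained = {"image_encoder.conv1": [1.0], "text_encoder.embed": [2.0]}
fresh = {"image_encoder.conv1": [0.0], "text_encoder.embed": [0.0],
         "decoder.attn": [0.0]}
model = init_from_pretrained(fresh, pretrained)
```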
After extracting visual and textual representations, the KAD architecture employs a transformer decoder to fuse multimodal information and model the interactions between image features and semantic concepts. Specifically, let $F_v \in \mathbb{R}^{N \times d}$ denote the visual patch features extracted from the input chest X-ray image, and let $F_t \in \mathbb{R}^{C \times d}$ represent the class-specific textual embeddings corresponding to radiological findings. The image encoder and text encoder are used to map the input modalities into a shared embedding space.
Formally, given an input chest radiograph $X$, the visual features are obtained through the image encoder $\Phi_{\mathrm{img}}$ as
$$F_v = \Phi_{\mathrm{img}}(X) \in \mathbb{R}^{N \times d},$$
where $N$ denotes the number of visual patches and $d$ represents the feature dimension.
Similarly, given a predefined set of radiological finding categories $\{c_1, c_2, \dots, c_C\}$, the corresponding textual embeddings are generated by the text encoder $\Phi_{\mathrm{text}}$ as
$$F_t = \Phi_{\mathrm{text}}(\{c_1, \dots, c_C\}) \in \mathbb{R}^{C \times d},$$
where $C$ denotes the number of predefined radiological finding classes and $d$ is the embedding dimension. The resulting visual features $F_v$ and textual embeddings $F_t$ are subsequently fed into a transformer-based decoder, which performs multimodal fusion through cross-attention mechanisms to model fine-grained interactions between localized image regions and semantic radiological concepts, thereby enabling accurate radiological finding recognition.
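The shape contract of the two encoders can be illustrated with toy stand-ins. The random projections below are hypothetical placeholders for the real pretrained encoders; only the resulting shapes, $(N, d)$ for patches and $(C, d)$ for class texts, reflect the formulation above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, d = 49, 20, 256          # visual patches, finding classes, embedding dim

# Toy stand-ins for the pretrained encoders: raw backbone features are
# projected into the shared d-dimensional embedding space.
raw_patches = rng.normal(size=(N, 768))            # patch features from the image backbone
raw_texts = rng.normal(size=(C, 512))              # token features per finding class
W_img = rng.normal(size=(768, d)) / np.sqrt(768)   # image projection head
W_txt = rng.normal(size=(512, d)) / np.sqrt(512)   # text projection head

F_v = raw_patches @ W_img      # visual features, shape (N, d)
F_t = raw_texts @ W_txt        # class text embeddings, shape (C, d)
```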
Within the transformer-based decoder, a cross-attention mechanism is employed to enable each class-specific textual feature to attend to the visual feature space extracted from the input chest radiograph. Through this mechanism, each radiological finding class actively queries image regions that are most relevant to its corresponding semantic representation, allowing the model to establish fine-grained associations between localized visual evidence and predefined clinical concepts.
Formally, the attention weight distribution $\alpha_i \in \mathbb{R}^{N}$ for the $i$-th radiological finding class is computed as
$$\alpha_i = \operatorname{softmax}\!\left(\frac{(f_{t,i} W_q)(F_v W_k)^{\top}}{\sqrt{d}}\right),$$
where $W_q$ and $W_k$ denote the learnable projection matrices for the query and key in the transformer attention mechanism, respectively, $f_{t,i}$ is the textual embedding of the $i$-th class, and $d$ represents the dimensionality of the feature embeddings. The resulting attention weights highlight image regions that contribute most strongly to the prediction of the corresponding radiological finding.
By explicitly modeling the attention weight distribution between class-specific textual features and visual features, the transformer decoder is able to dynamically identify the most informative image patches that provide strong evidence for the presence of a given radiological finding. The attended visual information is then aggregated and fused into the corresponding class representation. Specifically, the fused class-aware feature $\tilde{f}_{t,i}$ is computed as
$$\tilde{f}_{t,i} = f_{t,i} + \alpha_i (F_v W_v),$$
where $W_v$ denotes the learnable value projection matrix. This residual fusion mechanism preserves the original semantic prior encoded in each class embedding while augmenting it with image-specific visual evidence. As a result, the model effectively integrates multimodal information, enabling more accurate and interpretable recognition of radiological findings.
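The class-query cross-attention with residual fusion can be sketched numerically as follows. This is a simplified single-head sketch with randomly initialized projection matrices, not the KAD decoder itself; the variable names mirror the notation of the surrounding text.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def class_query_attention(F_t, F_v, W_q, W_k, W_v):
    """Each class embedding queries the image patches.

    F_t: (C, d) class text embeddings; F_v: (N, d) patch features.
    Returns fused class-aware features (C, d) with a residual connection,
    plus the (C, N) attention map over patches.
    """
    d = F_t.shape[1]
    scores = (F_t @ W_q) @ (F_v @ W_k).T / np.sqrt(d)  # (C, N) scaled dot products
    alpha = softmax(scores, axis=-1)                   # attention over patches per class
    fused = F_t + alpha @ (F_v @ W_v)                  # residual fusion with values
    return fused, alpha

rng = np.random.default_rng(1)
C, N, d = 20, 49, 256
F_t, F_v = rng.normal(size=(C, d)), rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
fused, alpha = class_query_attention(F_t, F_v, W_q, W_k, W_v)
```

Each row of `alpha` is a probability distribution over the $N$ patches, matching the per-class attention weights in the formulation above.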
This formulation explicitly illustrates how each class token selectively aggregates image evidence that supports the presence of the corresponding radiological finding. By establishing class-aware interactions between textual representations and visual features, the model effectively associates specific radiological findings with their most relevant image regions.
To transform the high-dimensional fused class features into final classification prediction scores (i.e., logits), additional feature aggregation and projection operations are applied. Specifically, the fused class-aware features $\tilde{F}_t \in \mathbb{R}^{C \times d}$ are first aggregated via average pooling along the category dimension, yielding a compact global representation:
$$g = \frac{1}{C} \sum_{i=1}^{C} \tilde{f}_{t,i} \in \mathbb{R}^{d}.$$
Subsequently, the aggregated feature $g$ is fed into a linear classification head implemented as a multi-layer perceptron (MLP), which projects the representation from the embedding space of dimension $d$ to the target class space of dimension $C$:
$$z = \operatorname{MLP}(g) \in \mathbb{R}^{C},$$
where $z$ denotes the final prediction logits for multi-label radiological finding classification.
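The pooling-and-projection head can be sketched as follows. The two-layer MLP with ReLU and the hidden width are illustrative assumptions; the original paper only specifies a projection from dimension $d$ to $C$.

```python
import numpy as np

def classification_head(fused, W1, b1, W2, b2):
    """Average-pool fused class features, then project d -> C logits."""
    g = fused.mean(axis=0)             # pool over the category dimension, shape (d,)
    h = np.maximum(0.0, g @ W1 + b1)   # hidden MLP layer with ReLU (assumed)
    z = h @ W2 + b2                    # final logits, one per finding class
    return z

rng = np.random.default_rng(2)
C, d, hidden = 20, 256, 128
fused = rng.normal(size=(C, d))
W1, b1 = rng.normal(size=(d, hidden)) / np.sqrt(d), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, C)) / np.sqrt(hidden), np.zeros(C)
z = classification_head(fused, W1, b1, W2, b2)
```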
The radiological finding recognition task is formulated as a multi-label classification problem, in which multiple pathological findings may simultaneously exist within a single chest radiograph. Accordingly, the binary cross-entropy (BCE) loss is employed as the training objective to supervise the model optimization, as defined in Equation (7). The fine-tuning process aims to minimize this loss, which quantitatively measures the discrepancy between the predicted logits and the corresponding ground-truth labels, through gradient-based optimization.
During training, gradients are computed via backpropagation and propagated backward through the computation graph. As illustrated in Figure 2, the text encoder is kept frozen to preserve the semantic representations learned during pretraining, while the image encoder and the subsequent multimodal fusion modules are jointly optimized. Specifically, the gradient flow originates from the loss layer, propagates through the linear classification head and the cross-attention layers of the transformer decoder, and finally reaches the convolutional and projection layers of the image encoder. This selective fine-tuning strategy enables the model to adapt visual representations to pediatric-specific radiological patterns while maintaining stable and consistent textual embeddings for radiological findings. Such a design helps mitigate overfitting and preserves the semantic alignment between visual features and medical concepts learned from large-scale adult datasets. All trainable parameters are optimized using the AdamW optimizer, which combines adaptive learning rates with decoupled weight decay, thereby improving both training stability and generalization performance.
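The selective fine-tuning strategy amounts to excluding the frozen text encoder's parameters from the optimizer. A minimal, framework-agnostic sketch (parameter names are hypothetical):

```python
def split_parameters(param_names, frozen_prefixes=("text_encoder.",)):
    """Partition parameter names for selective fine-tuning.

    Parameters under the frozen prefixes (the pretrained text encoder)
    are excluded from the optimizer; everything else (image encoder,
    decoder, classification head) is trained.
    """
    trainable = [n for n in param_names if not n.startswith(frozen_prefixes)]
    frozen = [n for n in param_names if n.startswith(frozen_prefixes)]
    return trainable, frozen

names = ["image_encoder.conv1.weight", "text_encoder.embed.weight",
         "decoder.cross_attn.q_proj.weight", "head.fc.weight"]
trainable, frozen = split_parameters(names)
```

In a deep learning framework, the `trainable` set would then be handed to an AdamW optimizer, while the `frozen` parameters receive no gradient updates.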
The BCE objective of Equation (7) is defined as
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{B \cdot C} \sum_{b=1}^{B} \sum_{i=1}^{C} \left[ y_{b,i} \log \sigma(z_{b,i}) + (1 - y_{b,i}) \log\bigl(1 - \sigma(z_{b,i})\bigr) \right],$$
where $\mathcal{L}_{\mathrm{BCE}}$ denotes the binary cross-entropy loss; $B$ represents the batch size, and $C$ denotes the number of predefined radiological finding classes; $z_{b,i}$ is the predicted logit for the $i$-th class of the $b$-th sample; $y_{b,i}$ is the corresponding ground-truth label; and $\sigma(\cdot)$ denotes the sigmoid activation function.
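The multi-label BCE objective can be computed directly from logits as follows; this sketch averages over both the batch and class dimensions, and adds a small epsilon for numerical stability.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_with_logits(z, y):
    """Multi-label binary cross-entropy: z (logits) and y (labels), shape (B, C)."""
    p = sigmoid(z)
    eps = 1e-12                        # numerical floor to avoid log(0)
    losses = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return losses.mean()               # average over batch and classes

z = np.array([[4.0, -4.0], [-4.0, 4.0]])   # confident, correct logits
y = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_good = bce_with_logits(z, y)
loss_bad = bce_with_logits(-z, y)           # same magnitudes, predictions flipped
```

Confidently correct logits yield a near-zero loss, while confidently wrong logits are heavily penalized, which is the behavior the fine-tuning objective relies on.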
2.2. Domain Knowledge-Enhanced Multimodal Diagnosis
In the second stage of the proposed framework, multimodal large language models (MLLMs) are employed to act as specialist physicians and perform disease-level diagnostic reasoning tasks. At this stage, the MLLM conducts diagnostic reasoning by comprehensively referring to multimodal input information, including the chest radiograph, the recognized radiological findings obtained from Stage 1, and external medical domain knowledge retrieved from an authoritative medical knowledge vector database.
By integrating these complementary sources of information, the MLLM is able to emulate the diagnostic reasoning process of experienced clinicians, who do not rely solely on imaging evidence but also incorporate accumulated medical knowledge and patient-specific contextual information when formulating a diagnosis. Moreover, the inclusion of retrieved domain knowledge explicitly grounds the model’s reasoning in established clinical guidelines and expert consensus, which enhances diagnostic reliability and robustness while effectively mitigating the risk of hallucinated or unsupported conclusions.
The Stage 2 diagnostic module consists of two main components. The first component is responsible for integrating heterogeneous input information into a unified, structured, and clinically meaningful representation. This process ensures that visual evidence from chest radiographs, structured radiological findings recognized in Stage 1, and patient demographic information are coherently organized and jointly presented to the model in a consistent format. Such structured integration facilitates effective cross-modal interaction and provides a comprehensive diagnostic context. The second component performs disease-level diagnostic reasoning enhanced by clinical domain knowledge. In this component, the MLLM leverages relevant medical knowledge retrieved from authoritative clinical guidelines and disease descriptions to guide its reasoning process. By grounding the diagnostic inference in external domain knowledge, the model is able to generate accurate, reliable, and evidence-supported diagnostic conclusions, thereby improving clinical interpretability and reducing the risk of unsupported or hallucinated predictions.
2.2.1. Structured Input Integration
Following the recognition stage, the identified radiological findings are treated as explicit diagnostic evidence for subsequent disease inference. To address the limited medical domain knowledge of general MLLMs, particularly in the specialized context of pediatric radiology, we supplement each recognized finding with concise, standardized clinical descriptions. These descriptions are collected from authoritative medical resources and databases, such as Radiopaedia and RSNA, and are further reviewed and validated by clinical experts to ensure their accuracy and reliability.
By incorporating expert-curated textual descriptions alongside the recognized findings, the multimodal input is transformed into a structured and clinically meaningful representation. This structured integration enables the MLLM to better interpret the clinical significance of each finding, bridge knowledge gaps between visual observations and medical concepts, and establish a stronger foundation for downstream knowledge-augmented diagnostic reasoning.
Demographic information, including gender and age, is also important for clinicians to consider when diagnosing pediatric respiratory diseases, according to authoritative medical references [11]. Therefore, the demographic information extracted from the original DICOM files is taken into consideration as part of the patient information.
In summary, the structured input information for each patient sample is composed of multiple clinically relevant components. Specifically, it includes:
Gender, encoded as a categorical variable ("M" or "F").
Age, which provides essential demographic context for pediatric diagnosis.
Recognized radiological findings, automatically identified in Stage 1 and serving as the primary visual evidence.
Clinical descriptions of the recognized findings, collected from authoritative medical sources and validated by experts to provide domain knowledge support.
Together, these elements form a structured and comprehensive representation of each patient sample, enabling the multimodal large language model to reason over both demographic information and knowledge-enhanced radiological evidence during the diagnostic process.
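The assembly of these components into a single structured representation can be sketched as follows; the field layout and the sample contents are illustrative, not the exact template used in the system.

```python
def build_structured_input(gender, age, findings, descriptions):
    """Assemble the Stage-2 patient representation as structured text.

    `descriptions` maps each recognized finding to its expert-curated
    clinical description (the texts here are illustrative placeholders).
    """
    lines = [f"Gender: {gender}", f"Age: {age}", "Recognized findings:"]
    for finding in findings:
        desc = descriptions.get(finding, "no description available")
        lines.append(f"- {finding}: {desc}")
    return "\n".join(lines)

sample = build_structured_input(
    gender="M",
    age="3 years",
    findings=["consolidation"],
    descriptions={"consolidation": "Homogeneous opacification of lung parenchyma."},
)
```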
2.2.2. Retrieval-Augmented Diagnosis
The final diagnosis is performed by a multimodal large language model (MLLM), which integrates heterogeneous sources of evidence, including the structured patient information introduced in the previous section, the original chest X-ray, and external medical domain knowledge retrieved from a curated knowledge base. By jointly reasoning over visual, textual, and knowledge-based inputs, the MLLM is able to produce accurate and clinically trustworthy diagnostic results. The overall diagnostic workflow of the second stage is illustrated in the lower part of Figure 1.
Knowledge Source and Construction. In clinical practice, diagnostic decisions are primarily guided by authoritative clinical guidelines and expert consensus documents. Accordingly, we construct the external knowledge base using high-quality resources from reputable medical platforms, including Radiopaedia, the Radiological Society of North America (RSNA), Pediatric Imaging: A Pediatric Radiology Textbook, and the Pediatric Radiology Digital Library. These sources are widely recognized and routinely consulted by radiologists and clinicians in pediatric settings. From these references, we specifically extract the imaging- and radiology-related sections that describe characteristic radiographic manifestations of different pediatric respiratory diseases. These sections focus on disease-specific imaging features, typical radiological patterns, and differential diagnostic clues observable on chest X-rays.
Maintenance and Updating Strategy. To ensure long-term sustainability, the knowledge base follows a version-controlled update protocol. Newly published guidelines or consensus documents collected are periodically reviewed and incorporated after clinical verification. Each knowledge entry is associated with metadata (source, publication year, version ID), enabling traceability and reproducibility across different system versions.
Validation and Clinical Reliability. To ensure the reliability and clinical validity of the knowledge base, as well as consistency with established clinical standards, all collected content is carefully reviewed and verified by pediatric specialists.
For efficient retrieval and structured utilization by the MLLM, the curated knowledge is organized into predefined templates, with each disease guideline stored as an independent Markdown document. This structured representation facilitates accurate knowledge retrieval and seamless integration into the retrieval-augmented diagnostic process, enabling the MLLM to ground its diagnostic reasoning in authoritative clinical evidence. The textual content of each guideline is then encoded into a dense vector representation using a text encoder $\Phi_k$. For a given disease guideline $g_j$, its embedding is computed as
$$e_j = \Phi_k(g_j), \quad j = 1, \dots, N,$$
where $e_j$ denotes the encoded knowledge vector in the shared embedding space, and $N$ represents the total number of diseases included in the knowledge base. All knowledge embeddings are subsequently indexed in a vector database to support efficient similarity-based retrieval.
During the retrieval stage, the MLLM constructs a query vector $q$ by jointly considering the patient demographic information, the radiological findings recognized in Stage 1, and its internal reasoning context. The relevance between the query and each disease guideline embedding is measured using cosine similarity:
$$s_j = \frac{q \cdot e_j}{\lVert q \rVert \, \lVert e_j \rVert}, \quad j = 1, \dots, N.$$
Based on the similarity scores, the most relevant disease guideline is selected as
$$g^{*} = g_{j^{*}}, \quad j^{*} = \arg\max_{j \in \{1, \dots, N\}} s_j,$$
which is then injected into the MLLM as external knowledge to augment the final diagnostic reasoning process.
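The cosine-similarity retrieval step can be sketched as follows. The two-dimensional embeddings and disease names are toy stand-ins for the real encoded guideline documents; only the argmax-over-cosine-similarity logic reflects the method.

```python
import numpy as np

def retrieve_guideline(q, knowledge, names):
    """Return the guideline whose embedding is most cosine-similar to q.

    knowledge: (N, d) matrix of guideline embeddings; names: their labels.
    """
    sims = (knowledge @ q) / (np.linalg.norm(knowledge, axis=1) * np.linalg.norm(q))
    return names[int(np.argmax(sims))], sims

# Toy embeddings standing in for encoded Markdown guideline documents.
names = ["pneumonia", "bronchiolitis", "asthma"]
knowledge = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [0.7, 0.7]])
q = np.array([0.9, 0.1])      # query built from findings + demographics
best, sims = retrieve_guideline(q, knowledge, names)
```

In a production system, this brute-force scan would be replaced by the vector database's approximate nearest-neighbor index, but the selection criterion is the same.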
The final diagnostic prediction is produced by jointly processing the input chest radiograph $i$, the recognized radiological findings $f$, the patient demographic information $p$, the predefined system role $r$, the task specification $T$, and the retrieved clinical guideline $g^{*}$. This process can be formally expressed as
$$\hat{y} = \operatorname{MLLM}(i, f, p, r, T, g^{*}),$$
where $\hat{y}$ denotes the final diagnostic output corresponding to the predicted disease category. By integrating multimodal patient-specific evidence with retrieval-augmented authoritative medical knowledge, the MLLM is guided to perform clinically grounded reasoning, thereby improving diagnostic reliability, consistency, and interpretability in pediatric diagnosis scenarios.
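The composition of the MLLM query can be sketched as a simple request builder. The field names, file name, and prompt texts below are hypothetical; the point is that each term of $\hat{y} = \operatorname{MLLM}(i, f, p, r, T, g^{*})$ maps to one part of the request, while the MLLM API itself is abstracted away.

```python
def build_diagnosis_request(image_ref, findings, demographics, role, task, guideline):
    """Compose the multimodal query sent to the diagnostic MLLM.

    Each evidence source becomes one field of the request payload.
    """
    return {
        "system": role,                    # predefined system role r
        "task": task,                      # task specification T
        "image": image_ref,                # chest radiograph i
        "findings": findings,              # Stage-1 findings f
        "demographics": demographics,      # patient information p
        "retrieved_knowledge": guideline,  # selected guideline g*
    }

request = build_diagnosis_request(
    image_ref="cxr_0001.dcm",
    findings=["consolidation"],
    demographics={"gender": "M", "age": "3 years"},
    role="You are a pediatric radiology specialist.",
    task="Diagnose the most likely respiratory disease.",
    guideline="Pneumonia: focal consolidation with air bronchograms ...",
)
```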
For completeness and improved transparency, an algorithmic summary of the proposed diagnostic framework is provided below. Algorithm 1 formalizes the entire pipeline using the notation defined above and clarifies the interactions among feature extraction, multimodal representation learning, knowledge retrieval, and final inference of the two stages. This structured presentation facilitates both conceptual understanding and practical implementation.
Algorithm 1 Retrieval-Augmented Multimodal Diagnostic Framework

1: // Stage 1: Radiological Finding Recognition
2: $F_v \leftarrow \Phi_{\mathrm{img}}(X)$
3: $F_t \leftarrow \Phi_{\mathrm{text}}(\{c_1, \dots, c_C\})$
4: Compute cross-attention $\alpha$ between $F_t$ and $F_v$
5: Obtain fused features $\tilde{F}_t$
6: Aggregate $g \leftarrow \frac{1}{C} \sum_{i=1}^{C} \tilde{f}_{t,i}$
7: $z \leftarrow \operatorname{MLP}(g)$
8: Derive recognized findings $f$ from $z$
9: // Stage 2: Evidence-Guided Diagnosis
10: for $j = 1, \dots, N$ do
11:   $e_j \leftarrow \Phi_k(g_j)$
12: end for
13: Construct query $q$ using $f$ and $p$
14: $g^{*} \leftarrow g_{j^{*}}, \ j^{*} = \arg\max_{j} \cos(q, e_j)$
15: $\hat{y} \leftarrow \operatorname{MLLM}(i, f, p, r, T, g^{*})$
16: return $\hat{y}$