Contrastive Learning with Feature Space Interpolation for Retrieval-Based Chest X-Ray Report Generation

Rahman, Zahid Ur; Yu, Gwanghyun; Jin, Lee; Kim, Jin Young

doi:10.3390/app16010470

Open AccessArticle

Contrastive Learning with Feature Space Interpolation for Retrieval-Based Chest X-Ray Report Generation

by

Zahid Ur Rahman

¹,

Gwanghyun Yu

²,

Lee Jin

¹

and

Jin Young Kim

^1,*

¹

Department of Intelligent Electronics and Computer Engineering, Chonnam National University, 77, Yongbong-ro, Buk-gu, Gwangju 61186, Republic of Korea

²

R&BD Foundation, G5-AICT Research Center, Chonnam National University, Gwangju 61186, Republic of Korea

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(1), 470; https://doi.org/10.3390/app16010470 (registering DOI)

Submission received: 25 November 2025 / Revised: 26 December 2025 / Accepted: 30 December 2025 / Published: 1 January 2026

(This article belongs to the Topic Artificial Intelligence and Big Data in Biomedical Engineering)

Download

Browse Figures

Versions Notes

Abstract

Automated radiology report generation from chest X-rays presents a critical challenge in medical imaging. Traditional image-captioning models struggle with clinical specificity and rare pathologies. Recently, contrastive vision language learning has emerged as a robust alternative that learns joint visual–textual representations. However, applying contrastive learning (CL) to radiology remains challenging due to severe data scarcity. Prior work has employed input space augmentation, but these approaches incur computational overhead and risk distorting diagnostic features. This work presents CL with feature space interpolation for retrieval (CLFIR), a novel CL framework operating on learned embeddings. The method generates interpolated pairs in the feature embedding space by mixing original and shuffled embeddings in batches using a mixing coefficient

λ \sim U (0.85, 0.99)

. This approach increases batch diversity via synthetic samples, addressing the limitations of CL on medical data while preserving diagnostic integrity. Extensive experiments demonstrate state-of-the-art performance across critical clinical validation tasks. For report generation, CLFIR achieves BLEU-1/ROUGE/METEOR scores of 0.51/0.40/0.26 (Indiana university [IU] X-ray) and 0.45/0.34/0.22 (MIMIC-CXR). Moreover, CLFIR excels at image-to-text retrieval with R@1 scores of 4.14% (IU X-ray) and 24.3% (MIMIC-CXR) and achieves 0.65 accuracy in zero-shot classification on the CheXpert5×200 dataset, surpassing the established vision-language models.

Keywords:

chest X-rays; medical report generation; feature-space interpolation; contrastive learning; medical imaging

1. Introduction

Radiology reports are crucial for diagnosing X-ray images, interpreting disease severity, and identifying anatomical positions [1]. Analyzing these X-ray records requires substantial expertise and time from radiologists [2]. In the healthcare system, significant delays have been caused by the increasing demand for medical imaging and the global shortage of qualified radiologists [3]. Hence, generating automatic radiology reports has emerged as a promising deep learning solution to improve diagnostic efficiency and reduce clinical workload [4].

Most existing methods adapt the image captioning paradigm with traditional encoder–decoder frameworks to generate radiology reports [5,6,7,8]. However, these approaches produce results that differ from natural image descriptions. Image captioning requires all visual elements to be comprehensively described, whereas radiology diagnosis demands a selective focus on abnormal findings while integrating medical knowledge from detailed reports [9]. This approach requires models that accurately identify pathological regions and generate clinically precise descriptions of specific findings.

Recently, contrastive learning (CL) has emerged as a robust alternative to image captioning models, learning joint visual–textual representations of medical reports and X-rays [10]. Unlike traditional supervised approaches, CL applies large-scale multimodal data without extensive manual annotations. Several studies have investigated CL for radiology report generation by adapting encoders pretrained on natural images [11,12,13]. However, this domain adaptation approach faces challenges in medical domains [14]. Radiology datasets often differ from natural image datasets in content and scale. The domain gap between natural and medical images, combined with the limited medical training data, poses challenges for scaling large pretrained models to the medical domain.

To address data scarcity challenges, prior work has explored augmentation techniques to enhance CL performance [15]. Although these approaches are efficient, they present several critical limitations. These methods necessitate substantial computational overhead and complex implementation with extensive hyperparameter tuning. Input space augmentation of medical data poses significant semantic integrity risks. Unlike natural images, where geometric transformations or color adjustments preserve semantic meaning, augmenting X-ray images via rotation, cropping, or intensity changes can alter critical diagnostic features, potentially creating misleading training signals [16]. Furthermore, text augmentation using paraphrasing or synonym replacement can alter clinical meanings and compromise diagnostic accuracy.

In response to these limitations, this work proposes a novel CL framework with feature space interpolation for retrieval (CLFIR) that operates directly on learned embeddings rather than raw inputs. Feature interpolation has been extensively explored in computer vision, with previous work [17,18,19,20] demonstrating superiority over input space methods across tasks. These studies suggest that feature space transformations better preserve semantics while performing comparably with significantly lower computational cost. However, these techniques have not been thoroughly investigated in CL for radiological applications.

The proposed method operates on learned embeddings to generate an interpolated combination of samples, expanding the training signals while preserving semantic integrity. For each batch of X-ray report embedding pairs, additional mixed pairs are created by combining the original embeddings with shuffled versions from the same batch using a mixing coefficient

λ

sampled from a uniform distribution U(0.85, 0.99). This approach maintains consistent cross-modal alignment where the original image embeddings are paired with the original text embeddings, and mixed image embeddings are paired with mixed text embeddings during training. The uniform sampling ensures that interpolated samples retain 85% to 99% of their original semantic content while incorporating 1% to 15% variation from other samples, providing diverse interpolation intensities in each training batch rather than fixed mixing ratios. Therefore, CLFIR avoids the risks associated with input space modifications. The approach creates additional positive and more complex negative pair samples for CL while establishing smoother embedding manifolds that improve model generalization. The major contributions of this work are as follows:

(1) Novel Feature Space Interpolation for Contrastive Vision-Language Pretraining in Data-Scarce Medical Domains: This work introduces CLFIR, which operates directly on learned embeddings, unlike general-domain mixup or input space augmentations that risk diagnostic distortions in sensitive X-ray data.

(2) Batch Expansion and Manifold Smoothing for Enhanced Generalization: By shuffling and interpolating embeddings in mini-batches, CLFIR doubles positive/negative pairs (to 2× per update), mitigating the large-batch requirement of CL image pretraining (CLIP)-style models on small medical corpora. This approach smooths the embedding manifold (by generating intermediate samples between the existing data points), reducing overfitting and improving robustness to rare pathologies.

(3) Superior Computational Efficiency Without Sacrificing Diagnostic Integrity: The CLFIR framework eliminates the overhead of input space augmentations, making it practical for resource-constrained clinical environments while preserving semantic consistency.

(4) State-of-the-Art (SOTA) Performance Across Multiple Tasks: Extensive evaluations on the Indiana University (IU) X-ray and MIMIC-CXR datasets yield new benchmarks in report generation (BLEU-1: 0.51/0.45, ROUGE-L: 0.40/0.34, METEOR: 0.26/0.22), image-to-text retrieval (R@1: 4.14%/24.3%), and zero-shot classification (0.65 accuracy on CheXpert5×200), outperforming prior contrastive and generative methods.

2. Materials and Methods

2.1. Overview and Model Architecture

This section presents the proposed CLFIR framework with feature space mixup interpolation to generate automated reports, as depicted in Figure 1. The approach builds on the CLIP framework [21], employing a dual-encoder architecture comprising separate image and text encoders that map their respective inputs to a shared embedding space.

2.2. Problem Formulation

Let

D

represent the training dataset containing N pairs of chest X-ray images and their corresponding radiology reports, denoted as

(x_{i}, r_{i})

, where i ranges from 1 to N. The objective is to learn joint representations of medical images and textual reports that capture semantic relationships between visual pathological findings and clinical descriptions.

Following the CL paradigm, the aim is to learn two encoders: an image encoder

f_{I}

that maps images to d-dimensional embeddings, and a text encoder

f_{T}

that maps the corresponding reports to the same d-dimensional space. The encoders map input images and reports to a shared embedding space where semantically similar image–text pairs have high similarity scores.

2.3. CLIP Foundation

As discussed, the proposed approach follows the CLIP framework, which learns visual and textual representations via CL on large-scale image–text pairs. The CLIP framework optimizes the InfoNCE loss to maximize the similarity between matching image–text pairs while minimizing the similarity between non-matching pairs:

L_{C L I P} = - \frac{1}{2 N} \sum_{i = 1}^{N} [log \frac{exp (s (x_{i}, r_{i}) / τ)}{\sum_{j = 1}^{N} exp (s (x_{i}, r_{j}) / τ)} + log \frac{exp (s (x_{i}, r_{i}) / τ)}{\sum_{j = 1}^{N} exp (s (x_{j}, r_{i}) / τ)}]

(1)

where

s (x_{i}, r_{i})

represents the cosine similarity between matching image

x_{i}

and text

r_{i}

pairs, while

s (x_{i}, r_{j})

denotes the cosine similarity between non-matching pairs across the batch, and

τ

is the temperature parameter. The loss is calculated from the image to text and from the text to image to enforce symmetric CL, ensuring both modalities are optimized simultaneously in the shared embedding space. This bidirectional formulation prevents representational collapse and guarantees that semantically similar image–text pairs have high cosine similarity regardless of the query direction.

2.3.1. Image Encoder

The image encoder

f_{I}

employs a MobileNetV2 backbone to process chest X-ray images resized to 312 × 312 pixels to preserve fine-grained pathological details that are crucial for medical diagnosis. The encoder extracts visual features capturing anatomical structures and pathological findings.

2.3.2. Text Encoder

The text encoder

f_{T}

uses the bidirectional encoder representations from transformer (BERT) universal sentence encoder to process radiology reports and clinical findings. The transformer-based model processes tokenized medical text encoding semantic relationships between clinical observations, diagnostic terminology, and pathological descriptions. Both encoders output 512-dimensional L2-normalized embeddings projected to a shared representation space.

2.4. Feature Space Mixup Interpolation

The CLFIR framework operates on learned embeddings in the representation space rather than raw input data to address the challenges of limited medical data in chest X-ray analyses. This approach enhances CL performance by increasing the batch size and creating diverse negative samples, where original embeddings serve as negatives for augmented pairs and vice versa. By generating interpolated representations in the embedding space, the proposed method creates smoother decision boundaries and prevents overfitting the model to specific samples while preserving critical pathological features that are crucial for medical diagnoses. The feature space interpolation is suitable for medical applications because it maintains semantic integrity while providing rich contrastive signals that improve cross-modal alignment between chest X-rays and radiology reports.

2.4.1. Mixup Methodology

Given a batch of learned embeddings after computation by the encoders, this work performs mixup interpolation in the feature space. For image embeddings

I = {I_{1}, I_{2}, \dots, I_{B}}

and text embeddings

T = {T_{1}, T_{2}, \dots, T_{B}}

where B represents the batch size, this method generates additional pairs through conservative linear interpolation:

{\tilde{I}}_{i} = λ I_{i} + (1 - λ) I_{π (i)}

(2)

{\tilde{T}}_{i} = λ T_{i} + (1 - λ) T_{π (i)}

(3)

where

{\tilde{I}}_{i}

and

{\tilde{T}}_{i}

represent the interpolated image and text embeddings, respectively;

π (i)

denotes a random permutation of batch indices that changes across training batches, ensuring that each sample is mixed with diverse samples over time for maximum diversity. The parameter

λ

is sampled from a uniform distribution

λ \sim

U(0.85,0.99), ensuring that interpolated samples retain about 85% to 99% of their original content while incorporating only 1% to 15% from other samples.

This conservative mixup strategy maintains the pathological features that are critical for medical diagnosis while providing meaningful data interpolation. By avoiding extreme mixing ratios and preventing identical copies, the proposed approach ensures that each mixed sample contributes diversity to the training process without compromising the diagnostic integrity.

2.4.2. Negative Pairing Strategy for Cross-Modal Alignment

A crucial design choice in the proposed framework is treating original-interpolated cross-combinations as negative pairs during CL. This strategy relates to semantic correspondence preservation. Interpolated pairs

({\tilde{I}}_{i}, {\tilde{T}}_{i})

maintain semantic correspondence because image and text embeddings are derived from the same source samples, interpolated using identical mixing coefficient

λ

and permutation index

π (i)

. This method ensures that the interpolated image embedding and its paired interpolated text embedding share consistent diagnostic semantics.

In contrast, cross-combinations, such as

(I_{i}, {\tilde{T}}_{j})

, associate an original image with an interpolated text embedding derived from various source samples. This pairing breaks the semantic correspondence required for accurate clinical alignment, as the original chest X-ray is linked to text features from unrelated diagnostic descriptions. Similarly,

({\tilde{I}}_{i}, T_{j})

pairs interpolated image features with text from a different sample, creating misaligned training signals.

Therefore, this work maintains a strict positive pairing between semantically consistent samples: original pairs

(I_{i}, T_{i})

and their corresponding interpolated pairs

({\tilde{I}}_{i}, {\tilde{T}}_{i})

, and all cross-combinations

(I_{i}, {\tilde{T}}_{j})

and

({\tilde{I}}_{i}, T_{j})

, where

i \neq j

are treated as negative samples during training. The training strategy combines the original and interpolated embeddings into a single enlarged batch:

I_{c o m b i n e d} = [I, \tilde{I}] = [I_{1}, I_{2}, \dots, I_{B}, {\tilde{I}}_{1}, {\tilde{I}}_{2}, \dots, {\tilde{I}}_{B}]

(4)

T_{c o m b i n e d} = [T, \tilde{T}] = [T_{1}, T_{2}, \dots, T_{B}, {\tilde{T}}_{1}, {\tilde{T}}_{2}, \dots, {\tilde{T}}_{B}]

(5)

L_{C L I P} = L_{C L I P} (I_{c o m b i n e d}, T_{c o m b i n e d})

(6)

This formulation generates an effective batch size of

2 B

, where the original and interpolated samples serve as additional negatives for each other, enhancing the discriminative capacity of the learned representations.

2.4.3. Rationale for Conservative Interpolation Range

The selection of the mixing coefficient range

λ \sim U (0.85, 0.99)

reflects the semantic sensitivity of the medical image–text data. Unlike natural image–caption pairs, where partial semantic overlap may be acceptable, radiology reports contain precise diagnostic statements that must accurately correspond to specific visual findings. When interpolating embeddings, aggressive mixing (e.g.,

λ < 0.85

)) risks creating representations with conflicting diagnostic semantics.

By constraining

λ

to retain 85% to 99% of the original embedding content, the proposed approach ensures that interpolated samples preserve their core diagnostic semantics while introducing sufficient variation to enhance generalization. This conservative strategy is important for medical applications where training signals must maintain semantic coherence to learn reliable cross-modal correspondences. The effectiveness of this range is validated using ablation studies in Section 4.1, demonstrating superior performance compared with wider and narrower interpolation ranges.

3. Results

3.1. Datasets

This study applies two widely recognized benchmark datasets for a comprehensive evaluation of the proposed CLFIR approach: IU-Xray [22] and MIMIC-CXR [23]. Both datasets offer essential resources for training and evaluating medical image–text models.

IU X-ray Dataset: This dataset is publicly accessible and comprises 3955 anonymized radiology reports paired with 7470 posterior–anterior and lateral-view chest X-ray images. Each radiology report contains structured sections, including medical subject headings (MeSH), clinical indications, comparative analysis, detailed findings, and diagnostic impressions. For experiments, this work focuses on the findings sections for the ground-truth reference text, given its direct correspondence to observable radiological features. The dataset was carefully filtered to remove incomplete records lacking X-ray images or findings sections. This work adopts the established data partitioning strategy from prior SOTA research, implementing a 7:1:2 split for training, validation, and testing phases while ensuring no patient overlap across partitions.

MIMIC-CXR Dataset: This dataset constitutes a large-scale medical imaging repository with 377,110 chest X-ray images and 227,827 free-text radiology reports covering 65,379 unique patients. This comprehensive dataset originates from clinical studies conducted at Beth Israel Deaconess Medical Center between 2011 and 2016. Following the established preprocessing protocols, the dataset was filtered to include only frontal and lateral view images, consistent with previous research methods. Similar to the IU X-ray dataset, this work focuses on the findings sections for ground-truth references and applies identical text preprocessing procedures. This work adheres to the official dataset split provided by the original dataset curators for experimental validation, ensuring reproducible results that are comparable with the existing literature.

3.2. Implementation Details

The implementation employed a carefully designed dual-encoder architecture optimized for medical image–text CL. The MobileNetV2 image encoder was initialized using ImageNet pretrained weights, with the first convolutional layer adapted for grayscale input via weight averaging, and all layers remained trainable throughout training. The universal sentence text encoder was similarly set as fully trainable from its pretrained checkpoint. Both encoders were fully fine-tuned during training. For the permutation strategy in the feature space interpolation, a new random permutation

π (i)

was generated for each training batch, with the same permutation consistently applied to the image and text embeddings in the batch to maintain correspondence. Regarding the batch size sensitivity, smaller batch sizes negatively influenced performance due to insufficient negative samples for effective CL. Larger batch sizes (e.g., 150) improved performance on the MIMIC-CXR dataset, given its scale, whereas the performance of the smaller IU X-ray dataset (about 7000 image-report pairs) degraded with excessively large batches due to insufficient iterations per epoch. A batch size of 150 was set to balance these considerations across both datasets.

Training optimization employed the Adam optimizer with a learning rate of

1 \times 10^{- 4}

and gradient clipping (max norm

1.0

) for training stability. Training proceeded for 30 epochs for the IU X-ray dataset and 50 epochs for the MIMIC-CXR dataset with a batch size of 150, incorporating the learning rate reduction on a plateau (factor

0.5

; patience 3), and early stopping was based on the validation accuracy (patience 10) to prevent overfitting. Model selection retained the best weights based on the validation performance.

3.3. Quantitative Results

This work evaluates the proposed CLFIR approach against the established baselines using standard natural language generation metrics for medical report generation and image–text retrieval tasks.

3.3.1. IU X-Ray Results

Table 1 presents the comprehensive evaluation results on the IU X-ray dataset. The CLFIR framework achieved SOTA performance across all evaluation metrics, demonstrating the effectiveness of feature space interpolation in medical CL. The results display progression in model sophistication and performance. Early image captioning models (Show-Tell [24], Att2in [25], and AdaAtt [26]) achieved modest scores across all metrics, with BLEU-1 scores ranging from 0.24 to 0.28. Initially designed for natural image description, these approaches struggle to capture the specialized terminology and diagnostic patterns in medical reports. More recent medical-specific approaches introduced substantial improvements. The R2Gen [7] method, which introduced memory-driven transformers for radiology, achieved notable gains with a BLEU-1 score of 0.47, nearly doubling the performance of basic captioning models. This improvement underscores the importance of domain-specific architectural designs for medical text generation. Later methods (MGRRG [27], METransformer [28], ASGMDN [29], and RRGLKM [30]) achieved incremental improvements, reaching BLEU-1 scores of around 0.47 to 0.49. Recent CL approaches demonstrated competitive performance. For example, BLLM [31] and ATL-CAV [32], which also employ CL strategies, achieved BLEU-1 scores of 0.49 and 0.48, respectively. Although these approaches employ contrastive methods, they differ from CLFIR in architectural settings.

The CLFIR framework demonstrated significant advances across all metrics. The improvement to 0.51 in the BLEU-1 score represents a 4.1% relative gain over the previously best result, and BLEU-2 advances from 0.33 to 0.35, indicating better capturing of medical phrase patterns. The improvement in the METEOR score from 0.20 to 0.26 is significant, representing a 30% relative improvement. The focus of METEOR on semantic similarity and synonym recognition makes this gain valuable for medical terminology. The F1-score improved from 0.39 to 0.42, indicating enhanced clinical entity detection accuracy. In medical report generation, the F1-score measures the ability of the model to identify and correctly generate clinically relevant terms, making this 7.7% improvement clinically meaningful. The consistent improvements across complementary metrics (BLEU for n-gram precision, ROUGE for content recall, METEOR for semantic alignment, and the F1-score for clinical accuracy) demonstrate that the feature space interpolation of CLFIR enhances multiple aspects of medical report quality.

3.3.2. MIMIC-CXR Results

Table 2 presents the performance of CLFIR on the larger MIMIC-CXR dataset, which includes more diverse cases and complex pathological findings than those in the IU X-ray dataset. The results reveal distinct performance patterns between retrieval-based and generative approaches. Retrieval-based methods (CXR-RePaiR-2 [13] and CXR-RePaiR-Select [13]) reveal limited performance with sparse metric coverage. The CXR IRGEN [12] method achieved the highest scores among retrieval methods with a BLEU-1 score of 0.32 and an F1-score of 0.29.

Generative methods demonstrate considerably better performance across all metrics. The R2Gen [7] method established a strong baseline with a BLEU-1 score of 0.35 and a METEOR score of 0.14. Advanced medical-specific approaches (ASGMDN [29] and METransformer [28]) achieved incremental improvements, reaching BLEU-1 scores of 0.37 to 0.38. The current SOTA generative methods (MGGRRG [27]) displayed competitive results with a BLEU-1 score of 0.40 and an F1-score of 0.33. Among the CL approaches, ATL-CAV achieved competitive results with a BLEU-1 score of 0.38.

The CLFIR method demonstrated significant advances across all evaluation metrics. The improvement in BLEU-1 to 0.45 over the previously best score of 0.40 (MGGRRG [27]) indicates superior n-gram matching with ground-truth reports. The improvement in the METEOR score from 0.16 to 0.22 represents a 37.5% relative improvement, the most substantial gain across all metrics. This improvement is crucial for MIMIC-CXR, where the larger vocabulary and more diverse pathological descriptions require robust semantic understanding. The F1-score of 0.34 indicates that CLFIR maintains clinical entity detection accuracy while improving the overall report quality. These results validate that the feature space interpolation strategy of CLFIR scales to large, diverse medical datasets while maintaining the semantic preservation crucial for clinical applications.

3.4. Image-to-Text Retrieval

This work evaluates the performance of the model on image–text retrieval tasks to validate the effectiveness of the proposed approach. This evaluation demonstrates the quality of the learned multimodal representations and their ability to establish meaningful correspondences between radiological images and their associated reports.

This study assesses image–text retrieval performance using the standard recall@k metric (R@K), which measures the recall of the exact report in the top k retrieved reports for a given query image. This metric is relevant for clinical applications as it reflects the ability to identify the correct diagnostic report among multiple candidates. The experiments are conducted on the MIMIC-CXR and IU X-ray (testing) datasets to ensure a comprehensive evaluation and fair comparison with SOTA approaches.

As demonstrated in Table 3, CLFIR consistently outperformed SOTA approaches across both datasets on the R@K metric. The improvements are substantial on the MIMIC-CXR dataset, where the proposed approach reached notable gains of 2.7 points in R@1, 6.1 points in R@5, and 7.8 points in R@10 compared with the previously best-performing method (CXR-CLIP). On the IU X-ray dataset, the proposed method demonstrated significant improvements, particularly in R@5 and R@10, with gains of 6.94 and 10.78 points, respectively, over CXR-CLIP.

The superior performance in image-to-text retrieval confirms that the feature space mixup interpolation strategy enhances the quality of learned multimodal representations. The substantial improvements in higher k metrics (R@5 and R@10) suggest that the proposed approach enhances exact matches and the overall semantic similarity ranking between images and reports.

3.5. CheXpert5×200 Zero-Shot Classification Performance

This work presents comprehensive zero-shot classification experiments conducted on the CheXpert5×200 dataset to evaluate the cross-dataset generalizability of the proposed CLFIR framework. The validation offers a more rigorous assessment of model robustness on unseen data by focusing on the five clinically critical conditions from CheXpert5×200 (i.e., atelectasis, cardiomegaly, consolidation, edema, and pleural effusion), representing critical pathologies for automated chest radiograph interpretation. The zero-shot classification method applied pretrained vision and text encoders from MIMIC-CXR training. Classification was performed by computing the cosine similarity between the image and condition text embeddings. For each image, 512-dimensional visual features were extracted using the trained vision encoder. Then, cosine similarity scores were computed between the normalized image embedding and each of the five condition text embeddings. Classification decisions were made by assigning positive labels to conditions with similarity scores above a threshold of 0.5. The CLFIR framework achieved superior cross-dataset performance across the five CheXpert conditions, displaying strong zero-shot generalizability. In Figure 2, the proposed framework substantially outperformed established vision-language models including ConVIRT [34], GLoRIA [33], MedCLIP-ResNet, MedCLIP-ViT [35], and CXR-CLIP [11].

To provide deeper insight into the diagnostic capabilities of CLFIR, this work presents a detailed per-condition analysis conducted across the five CheXpert pathologies. Figure 3 presents the classification accuracy and area under the receiver operating characteristic curve (AUC-ROC) scores for each condition, revealing the granular performance characteristics of the proposed approach.

The CLFIR framework demonstrates consistently robust performance across all five pathologies, with classification accuracy ranging from 0.631 (pleural effusion) to 0.672 (consolidation). The per-condition AUC-ROC scores validate the discriminative capabilities of the model, exhibiting scores between 0.687 and 0.745 across conditions. Notably, CLFIR performs exceptionally well on consolidation (accuracy: 0.672, AUC: 0.745), which is clinically significant because this condition often presents subtle imaging features that challenge automated detection systems.

To contextualize these performance gains, Table 4 summarizes the critical methodological differences between CLFIR and existing medical vision-language approaches. Although prior methods rely on input space augmentation that may distort diagnostic features, CLFIR operates in the learned embedding space with a negative pairing strategy for interpolated samples. This fundamental difference in the augmentation domain and pairing strategy contributes to the superior zero-shot classification performance of CLFIR.

3.6. Qualitative Results

The qualitative evaluation offers crucial insight into the clinical applicability and semantic coherence of radiological frameworks. Although quantitative metrics measure linguistic similarity, the qualitative analysis indicates whether retrieved reports maintain clinical accuracy and appropriate medical reasoning patterns. To assess the effectiveness of the proposed approach, this work presents a comprehensive qualitative analysis conducted on the IU X-ray and MIMIC-CXR datasets. This work examines how well the retrieved reports align with ground-truth results in terms of accuracy, focusing on diagnostic completeness and terminology appropriateness. Table 5 presents examples compared with ground-truth references on both datasets, demonstrating the strong semantic learning capabilities of the proposed model in retrieving clinically relevant and diagnostically accurate reports.

4. Discussion

4.1. Design Validation via Ablation Studies

To further validate the effectiveness of CLFIR, this work presents comprehensive ablation studies examining the critical components and hyperparameters that influence performance. This work systematically analyzes (1) the effect of feature space vs. input space interpolation, (2) the influence of mixing coefficient ranges, (3) computational efficiency comparisons, and (4) the contribution of various interpolation strategies.

4.1.1. Theoretical Justification for Negative Pairing Strategy

A fundamental design in the proposed framework involves treating the combination of original and interpolated pairs to negative during training, requiring theoretical justification. In radiological diagnoses, clinical accuracy demands an exact correspondence between visual pathological findings and textual descriptions. When interpolating

{\tilde{T}}_{i} = λ T_{i} + (1 - λ) T_{π (i)}

, the resulting text embedding contains mixed clinical semantics that may include contradictory diagnostic statements. Unlike natural image–text pairs, where a partial semantic overlap might be acceptable, medical image–text alignment requires diagnostic precision.

A chest X-ray demonstrating specific pathological findings (

I_{i}

) cannot be correctly paired with a report containing mixed pathological descriptions (

{\tilde{T}}_{j}

) because this approach teaches the model to accept clinically inaccurate associations. Furthermore, treating the original interpolated cross-combinations as positive artificially increases positive samples from B to

4 B

, creating an imbalanced CL scenario with insufficient discriminative signals.

The negative pairing strategy preserves clinical diagnostic integrity while providing challenging negative samples. This approach ensures that the model learns to distinguish between the exact clinical correspondence and mixed pathological descriptions, which is crucial for reliable medical AI systems. To validate this theoretical framework, this work presents comparative experiments conducted between the negative pairing strategy and cross-modal learning approaches, as shown in Table 6.

The experimental comparison encompasses two distinct pairing strategies. Cross-model learning treats all interpolated combinations as positive pairs, creating

(I_{i}, {\tilde{T}}_{j})

and

({\tilde{I}}_{i}, T_{j})

as positive samples alongside the original pairs. In contrast, the negative pairing strategy maintains strict positive pairing between semantically consistent samples

(I_{i}, T_{i})

and

({\tilde{I}}_{i}, {\tilde{T}}_{i})

, while treating cross-combinations

(I_{i}, {\tilde{T}}_{j})

and

({\tilde{I}}_{i}, T_{j})

as negative samples. As listed in Table 6, the negative pairing strategy consistently outperformed cross-model learning across all evaluation metrics on both datasets.

4.1.2. Comparison with Alternative Configurations

This work compares CLFIR against multiple baseline configurations to demonstrate the effectiveness of feature space mixup interpolation. All models employed identical encoder architectures optimized for radiology-specific data. CLFIR is evaluated against three baseline configurations: (1) standard CL with no feature space interpolation and no input space augmentation, (2) CL with input space augmentation (random brightness, contrast, and noise adjustments), (3) standard CL with doubled batch size (300) to match the effective training sample count of the feature space interpolation, and the CLFIR approach.

During experimentation, the optimal batch size selection depended on the dataset. For smaller datasets, such as IU X-ray, excessive batch size increases can degrade accuracy due to insufficient batches per epoch. As demonstrated in Table 7, the proposed CLFIR with feature space interpolation consistently outperformed all baseline configurations, validating that the proposed strategy offers benefits beyond simple batch size increase or input space augmentation.

4.1.3. Comparison with Gaussian Noise Interpolation

To validate CLFIR performance, this work compares Gaussian noise as an alternative feature space interpolation technique. The analysis was conducted on the IU X-ray dataset due to its manageable size for comprehensive hyperparameter exploration, where both datasets exhibited similar radiological characteristics. All experiments applied a fixed random seed (seed = 42) to ensure reproducibility and fair comparison across configurations.

The mixup approach combines embeddings using

λ \sim Uniform (0.85, 0.99)

, ensuring each interpolated sample retains 85% to 99% of its original semantic content while incorporating 1% to 15% from permuted samples. For comparison, this work implements Gaussian noise interpolation by adding random noise

N (0, σ^{2})

to the feature embeddings at varying intensity levels, increasing the feature embedding samples similar to the proposed approach.

The experimental results in Table 8 demonstrate that the semantic mixup consistently outperformed random noise interpolation across all intensity levels. Although moderate Gaussian noise (

σ = 0.05

) offers improvement over the baseline, it fails to reach the performance gains of the mixup strategy, which applies semantic relationships between samples rather than random perturbations.

As illustrated in Figure 4, radiology datasets are sensitive to random perturbations, and semantic preservation via feature space mixing is critical for optimal performance. The degradation with higher noise levels (

σ = 0.1, 0.2, 0.3

) supports the hypothesis that maintaining semantic consistency is critical in medical imaging applications. The visualization reveals the instability of Gaussian noise approaches compared with the robust mixup strategy, which consistently outperformed even the optimal noise configuration.

4.1.4. Optimal Mixup Interpolation Range for CLFIR

To determine the optimal mixup interpolation range for CLFIR, this work presents experiments conducted with various uniform distributions

λ

on the IU X-ray dataset. Unlike fixed mixing ratios, the proposed approach samples

λ

from uniform distributions, providing variable mixing intensities. This work evaluates four uniform distribution ranges, namely

λ \sim

U(0.75,0.99),

λ \sim

U(0.80,0.99),

λ \sim

U(0.85,0.99), and

λ \sim

U(0.90,0.99), where

λ

represents the proportion of the original sample retained in the interpolated combination. A fixed random seed (seed = 42) was employed across all configurations to ensure a fair comparison.

As listed in Table 9, optimal performance was achieved with

λ \sim

U(0.85,0.99), which provides the best balance between preserving the original sample characteristics and introducing beneficial variation. Figure 5 visualizes these performance trends, demonstrating the superior and consistent performance of U(0.85, 0.99) across all evaluation metrics.

The range ensures that interpolated samples retain 85% to 99% of their original content while incorporating 1% to 15% from shuffled samples. As illustrated in the figure, wider ranges, such as U(0.75, 0.99), introduce excessive variation that could compromise semantic integrity. In contrast, narrower ranges, such as U(0.90, 0.99), offer more conservative mixing, which, while still effective, does not achieve the optimal balance.

This finding indicates that moderate uniform sampling preserves the semantic integrity of medical data while providing sufficient stochastic variation for improved generalization. The uniform distribution approach offers advantages over fixed mixing ratios by delivering diverse interpolation intensities in each batch, preventing the model from overfitting to a single mixing pattern.

4.2. Computational Efficiency Analysis

Figure 6 compares the training time on the MIMIC-CXR dataset between CLFIR and the traditional input space image augmentation CL approach. The CLFIR framework requires 33 min per epoch compared to 37 min for input space augmentation methods, representing a 10.8% reduction in training time per epoch. By the end of training, the total training time for CLFIR reaches about 27 h compared to 32 h for augmentation-based approaches, yielding total time savings of 3.3 h. This efficiency gain results from the computational overhead required for input space image augmentation operations. The baseline augmentation process applies random brightness adjustments (±20%), random contrast modifications (±20%), random rotations (±%15°), random cropping and resizing, and Gaussian noise additions (

σ

= 0.01) to each 312 × 312-pixel chest X-ray image during training. These transformations require significant central processing and unit processing time and memory bandwidth for each training batch, creating a computational bottleneck before graphics processing unit-based model training.

In contrast, in CLFIR, feature space interpolation performs simple linear combinations on 512-dimensional embeddings after encoding, eliminating the image preprocessing overhead. The computational savings are even more substantial when text augmentation techniques are included in the baseline because these require additional natural language processing operations, synonym lookup, and text reconstruction processes. The consistent efficiency advantage confirms that feature space interpolation scales favorably for larger datasets while maintaining superior performance, making it more practical for medical imaging research, where computational resources are often limited.

4.3. Synthesis and Limitations

The superior performance of CLFIR across multiple evaluation studies demonstrates its potential to enhance automated radiology report generation. The improvements in METEOR scores on the IU X-ray and MIMIC-CXR datasets compared to previous methods suggest an enhanced semantic understanding of medical terminology, a crucial consideration for medical AI development. Unlike general-purpose image captioning systems that prioritize comprehensive visual descriptions, the proposed approach focuses on pathologically relevant features, aligning with the diagnostic patterns of radiologists. The cross-dataset generalization capabilities (particularly the 65.5% accuracy on CheXpert5×200 zero-shot classification) and SOTA retrieval accuracy demonstrate the robustness of the proposed architecture.

The conservative mixing strategy uses

λ \sim U (0.85, 0.99)

to create interpolated embeddings from the original X-ray and report representations. This approach preserves the diagnostic integrity because interpolation occurs in the learned embedding space rather than modifying raw medical images or text. The negative pairing strategy treats original-interpolated combinations as negative samples during training, ensuring that the model learns to distinguish between exact clinical correspondences and mixed representations.

The substantial performance gaps between CLFIR and input space augmentation methods highlight fundamental differences in how medical and natural image domains respond to data augmentation. Although geometric transformations preserve semantic meaning in natural images, ablation studies confirm that similar operations on radiological images risk altering diagnostically relevant features. The consistent superiority of feature space interpolation across all evaluation metrics validates the hypothesis that learned embedding spaces deliver semantically stable transformation environments. The comparison with Gaussian noise interpolation further reveals the importance of structured semantic relationships in medical data. Random perturbations, regardless of magnitude, fail to achieve the performance gains of the mixup strategy, confirming that structured mixing with intersample relationships outperforms unstructured noise addition.

The computational efficiency demonstrated by CLFIR addresses the computational constraints often encountered in medical imaging research. The elimination of complex input space augmentation processes reduces the computational overhead and hyperparameter tuning requirements. The feature space interpolation increases training samples, mitigating batch size dependencies in CL.

Several limitations require consideration. The evaluation applies standardized benchmark datasets that may not fully represent the variability encountered in clinical practice. Patient demographics (e.g., age and body habitus), imaging acquisition parameters (exposure settings and positioning), and equipment variations across institutions could affect model performance. Although the MIMIC-CXR dataset offers some demographic diversity, a systematic evaluation across patient subgroups (pediatric vs. elderly and different exposure protocols) was not performed. Additionally, the evaluation relies on automated metrics (BLEU, ROUGE, and METEOR) that measure linguistic similarity but do not directly assess clinical correctness. Human radiologists must validate whether generated reports meet clinical standards before future clinical translation.

From a methodological perspective, the mixing coefficient range

λ \sim U (0.85, 0.99)

was empirically optimized for chest X-ray data. Although the principle of conservative interpolation for semantic preservation should generalize across medical imaging modalities, the optimal range may require modality-specific tuning for computed tomography (CT), magnetic resonance imaging (MRI), or other imaging types. Furthermore, CLFIR employs global image–text embeddings for retrieval, limiting its interpretability compared with attention-based methods that offer spatial localization of the findings.

Future work should extend the framework to other medical imaging modalities and evaluate its generalizability across healthcare systems. The authors also plan to incorporate human radiologist assessments to validate the clinical reliability and explore region-level CL approaches that could enable spatial attention visualization for improved interpretability.

5. Conclusions

This work introduces CLFIR, a CL framework with feature space interpolation designed to generate automated radiology reports. By interpolating in the learned embedding space rather than in the raw input space, CLFIR preserves the diagnostic integrity and avoids distortions introduced by traditional image augmentations. Beyond semantic preservation, the interpolation strategy generates additional training samples, particularly negative pairs, enhancing model robustness in small-scale datasets and mitigating the batch-size limitations often faced in resource-constrained medical environments. In addition, CLFIR achieves SOTA results on IU X-ray and MIMIC-CXR across multiple tasks, confirming the effectiveness of semantically grounded interpolation for improving CL in medical imaging. Although the evaluation relied on automated metrics, future work should incorporate expert radiologist assessments to validate clinical reliability. The authors also plan to extend CLFIR to other modalities, including CT and MRI, and evaluate its generalizability across healthcare systems.

Author Contributions

Z.U.R.: Conceptualization, Methodology, Software, Implementation, Validation, Formal Analysis, Investigation, Data Curation, Visualization, and Writing—Original Draft. G.Y.: Data Curation and Writing—Review and Editing. L.J.: Data Curation and Writing—Review and Editing. J.Y.K.: Supervision, Project Administration, Funding Acquisition, Writing—Review and Editing, and Conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: RS-2025-19252970). This work was partly supported by the Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (IITP-2024-RS-2022-00156287). This work was also supported by the Artificial Intelligence Industrial Convergence Cluster Development Project funded by the Ministry of Science and ICT (MSIT, Republic of Korea) and Gwangju Metropolitan City.

Institutional Review Board Statement

Ethical review and approval were waived for this study as it exclusively utilized publicly available, de-identified datasets (IU X-ray and MIMIC-CXR) that do not contain identifiable patient information.

Informed Consent Statement

Patient consent was waived for this study as all analyses were conducted on publicly available, fully de-identified chest X-ray datasets where individual patient identification is not possible.

Data Availability Statement

The datasets used in this study are publicly available. The IU X-ray dataset can be accessed at https://openi.nlm.nih.gov/ (accessed on 25 November 2025). The MIMIC-CXR dataset is available through PhysioNet at https://physionet.org/content/mimic-cxr/2.0.0/ (accessed on 25 November 2025) (requires credentialing). The CheXpert5×200 dataset used for zero-shot classification can be obtained from https://stanfordmlgroup.github.io/competitions/chexpert/ (accessed on 25 November 2025). The code for CLFIR will be made available upon publication at [https://github.com/Zah-Ram/CLFIR-ChestXray/(accessed on 25 November 2025)].

Conflicts of Interest

The authors declare no conflicts of interest.

References

European Society of Radiology. Good practice for radiological reporting. Guidelines from the European Society of Radiology (ESR). Insights Imaging 2011, 2, 93–96. [Google Scholar] [CrossRef] [PubMed]
Peng, Y.C.; Lee, W.J.; Chang, Y.C.; Chan, W.P.; Chen, S.J. Radiologist burnout: Trends in medical imaging utilization under the national health insurance system with the universal code bundling strategy in an academic tertiary medical centre. Eur. J. Radiol. 2022, 157, 110596. [Google Scholar] [CrossRef] [PubMed]
Jing, A.B.; Garg, N.; Zhang, J.; Brown, J.J. AI solutions to the radiology workforce shortage. Npj Health Syst. 2025, 2, 20. [Google Scholar] [CrossRef]
Messina, P.; Pino, P.; Parra, D.; Soto, A.; Besa, C.; Uribe, S.; Andía, M.; Tejos, C.; Prieto, C.; Capurro, D. A survey on deep learning and explainability for automatic report generation from medical images. ACM Comput. Surv. 2022, 54, 1–40. [Google Scholar] [CrossRef]
Tiwari, V.; Bapat, K.; Shrimali, K.R.; Singh, S.K.; Tiwari, B.; Jain, S.; Sharma, H.K. Automatic generation of chest X-ray medical imaging reports using lstm-cnn. In Proceedings of the International Conference on Data Science, Machine Learning and Artificial Intelligence, Hyderabad, India, 15–16 December 2021; pp. 80–85. [Google Scholar]
Pahwa, E.; Mehta, D.; Kapadia, S.; Jain, D.; Luthra, A. Medskip: Medical report generation using skip connections and integrated attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3409–3415. [Google Scholar]
Chen, Z.; Song, Y.; Chang, T.H.; Wan, X. Generating radiology reports via memory-driven transformer. arXiv 2020, arXiv:2010.16056. [Google Scholar]
Alqahtani, F.F.; Mohsan, M.M.; Alshamrani, K.; Zeb, J.; Alhamami, S.; Alqarni, D. Cnx-b2: A novel cnn-transformer approach for chest X-ray medical report generation. IEEE Access 2024, 12, 26626–26635. [Google Scholar] [CrossRef]
Beddiar, D.R.; Oussalah, M.; Seppänen, T. Automatic captioning for medical imaging (MIC): A rapid review of literature. Artif. Intell. Rev. 2023, 56, 4019–4076. [Google Scholar] [CrossRef] [PubMed]
Zhao, Z.; Liu, Y.; Wu, H.; Wang, M.; Li, Y.; Wang, S.; Teng, L.; Liu, D.; Cui, Z.; Wang, Q.; et al. CLIP in medical imaging: A survey. Med. Image Anal. 2025, 102, 103551. [Google Scholar] [CrossRef] [PubMed]
You, K.; Gu, J.; Ham, J.; Park, B.; Kim, J.; Hong, E.K.; Baek, W.; Roh, B. Cxr-clip: Toward large scale chest X-ray language-image pre-training. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; pp. 101–111. [Google Scholar]
Shentu, J.; Al Moubayed, N. Cxr-irgen: An integrated vision and language model for the generation of clinically accurate chest X-ray image-report pairs. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 5212–5221. [Google Scholar]
Endo, M.; Krishnan, R.; Krishna, V.; Ng, A.Y.; Rajpurkar, P. Retrieval-based chest X-ray report generation using a pre-trained contrastive language-image model. In Proceedings of the Machine Learning for Health, Virtual, 4 December 2021; pp. 209–219. [Google Scholar]
Chambon, P.; Bluethgen, C.; Delbrouck, J.B.; Van der Sluijs, R.; Połacin, M.; Chaves, J.M.Z.; Abraham, T.M.; Purohit, S.; Langlotz, C.P.; Chaudhari, A. Roentgen: Vision-language foundation model for chest X-ray generation. arXiv 2022, arXiv:2211.12737. [Google Scholar]
Rahman, Z.U.; Lee, J.H.; Vu, D.T.; Murtza, I.; Kim, J.Y. DuCo-Net: Dual-Contrastive Learning Network for Medical Report Retrieval Leveraging Enhanced Encoders and Augmentations. IEEE Access 2025, in press. [Google Scholar] [CrossRef]
Elgendi, M.; Nasir, M.U.; Tang, Q.; Smith, D.; Grenier, J.P.; Batte, C.; Spieler, B.; Leslie, W.D.; Menon, C.; Fletcher, R.R.; et al. The effectiveness of image augmentation in deep learning networks for detecting COVID-19: A geometric transformation perspective. Front. Med. 2021, 8, 629134. [Google Scholar] [CrossRef] [PubMed]
DeVries, T.; Taylor, G.W. Dataset augmentation in feature space. arXiv 2017, arXiv:1702.05538. [Google Scholar] [CrossRef]
Liu, B.; Wang, X.; Dixit, M.; Kwitt, R.; Vasconcelos, N. Feature space transfer for data augmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9090–9098. [Google Scholar]
Liu, Z.; Tang, Z.; Shi, X.; Zhang, A.; Li, M.; Shrivastava, A.; Wilson, A.G. Learning multimodal data augmentation in feature space. arXiv 2022, arXiv:2212.14453. [Google Scholar]
Li, P.; Li, D.; Li, W.; Gong, S.; Fu, Y.; Hospedales, T.M. A simple feature augmentation for domain generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 8886–8895. [Google Scholar]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Demner-Fushman, D.; Kohli, M.D.; Rosenman, M.B.; Shooshan, S.E.; Rodriguez, L.; Antani, S.; Thoma, G.R.; McDonald, C.J. Preparing a collection of radiology examinations for distribution and retrieval. J. Am. Med. Inform. Assoc. 2015, 23, 304–310. [Google Scholar] [CrossRef] [PubMed]
Johnson, A.E.W.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.Y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef] [PubMed]
Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7008–7024. [Google Scholar]
Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 375–383. [Google Scholar]
Liu, A.; Guo, Y.; Yong, J.H.; Xu, F. Multi-grained radiology report generation with sentence-level image-language contrastive learning. IEEE Trans. Med. Imaging 2024, 43, 2657–2669. [Google Scholar] [CrossRef] [PubMed]
Wang, Z.; Liu, L.; Wang, L.; Zhou, L. Metransformer: Radiology report generation by transformer with multiple learnable expert tokens. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11558–11567. [Google Scholar]
Xue, Y.; Tan, Y.; Tan, L.; Qin, J.; Xiang, X. Generating radiology reports via auxiliary signal guidance and a memory-driven network. Expert Syst. Appl. 2024, 237, 121260. [Google Scholar] [CrossRef]
Yang, S.; Wu, X.; Ge, S.; Zheng, Z.; Zhou, S.K.; Xiao, L. Radiology report generation with a learned knowledge base and multi-modal alignment. Med. Image Anal. 2023, 86, 102798. [Google Scholar] [CrossRef] [PubMed]
Liu, C.; Tian, Y.; Chen, W.; Song, Y.; Zhang, Y. Bootstrapping large language models for radiology report generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 18635–18643. [Google Scholar]
Mei, X.; Yang, L.; Gao, D.; Cai, X.; Han, J.; Liu, T. Adaptive medical topic learning for enhanced fine-grained cross-modal alignment in medical report generation. IEEE Trans. Multimed. 2025, in press. [Google Scholar] [CrossRef]
Huang, S.C.; Shen, L.; Lungren, M.P.; Yeung, S. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3942–3951. [Google Scholar]
Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C.P. Contrastive learning of medical visual representations from paired images and text. In Proceedings of the Machine Learning for Healthcare Conference, Durham, NC, USA, 5–6 August 2022; pp. 2–25. [Google Scholar]
Wang, Z.; Wu, Z.; Agarwal, D.; Sun, J. Medclip: Contrastive learning from unpaired medical images and text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; p. 3876. [Google Scholar]

Figure 1. Overview of the CLFIR framework. During training (left), image and text encoders process chest X-rays and full reports to generate original and interpolated embeddings using feature space mixup (

λ \sim U (0.85, 0.99)

). The model learns contrastive similarities across the original and interpolated pairs. During retrieval (right), cosine similarity is employed to match encoded test X-rays with the most relevant report from the testing set.

Figure 1. Overview of the CLFIR framework. During training (left), image and text encoders process chest X-rays and full reports to generate original and interpolated embeddings using feature space mixup (

λ \sim U (0.85, 0.99)

). The model learns contrastive similarities across the original and interpolated pairs. During retrieval (right), cosine similarity is employed to match encoded test X-rays with the most relevant report from the testing set.

Figure 2. Overview of CLFIR architecture performance versus state-of-the-art methods on the CheXpert5×200 dataset for zero-shot classification.

Figure 3. Per-condition performance analysis of CLFIR zero-shot classification on CheXpert5 pathologies. (A) Classification accuracy across five conditions. (B) AUC-ROC scores between positive and negative cases for each pathology.

Figure 4. Performance comparison by Gaussian noise intensity on the IU X-ray dataset. The curves indicate optimal performance around

σ = 0.05

, with degradation at higher noise levels.

Figure 4. Performance comparison by Gaussian noise intensity on the IU X-ray dataset. The curves indicate optimal performance around

σ = 0.05

, with degradation at higher noise levels.

Figure 5. CLFIR performance across different uniform distribution ranges. U(0.85,0.99) consistently yields the best results, highlighting its robust parameter effectiveness.

Figure 6. Comparison of training time per epoch between CLFIR and a standard contrastive learning framework using input space image augmentations. The CLFIR framework achieves faster training by eliminating computationally intensive augmentation operations.

Table 1. Performance comparison on IU X-ray dataset. Best results are highlighted in bold.

Models	BLEU-1	BLEU-2	BLEU-3	BLEU-4	ROUGE	METEOR	F1
Show-Tell [24]	0.24	0.13	0.10	0.07	0.30	0.15	-
Att2in [25]	0.24	0.13	0.11	0.09	0.30	0.16	-
AdaAtt [26]	0.28	0.20	0.15	0.12	0.31	0.16	-
R2Gen [7]	0.47	0.30	0.21	0.16	0.37	0.18	-
MGRRG [27]	0.47	0.32	0.23	0.17	-	0.19	0.37
METransformer [28]	0.48	0.32	0.22	0.17	0.38	0.19	-
ASGMDN [29]	0.48	0.32	0.23	0.17	0.39	0.20	-
RRGLKM [30]	0.49	0.31	0.23	0.17	0.39
BLLM [31]	0.49	0.32	0.23	0.18	-	0.20	0.39
ATL-CAV [32]	0.48	0.33	0.24	0.20	0.40	0.20
CLFIR	0.51	0.35	0.26	0.20	0.40	0.26	0.42

Table 2. Performance comparison on MIMIC-CXR dataset. Sign (*) represents retrieval-based results. Best results are highlighted in bold.

Models	BLEU-1	BLEU-2	BLEU-3	BLEU-4	ROUGE	METEOR	F1
CXR-RePaiR-2 * [13]	-	0.06	-	-	-	-	0.25
CXR-RePaiR-Select * [13]	-	0.05	-	-	-	-	0.27
CXR IRGEN(F) * [12]	0.32	0.16	0.10	0.06	-	-	0.29
R2Gen [7]	0.35	0.21	0.14	0.10	0.27	0.14	-
ASGMDN [29]	0.37	0.23	0.15	0.11	0.28	0.15
METransformer [28]	0.38	0.25	0.16	0.12	0.29	0.15	-
RRGLKM	0.38	0.23	0.15	0.11	0.27		-
MGGRRG [27]	0.40	0.26	0.19	0.14	0.30	0.16	0.33
ATL-CAV [32]	0.38	0.26	0.19	0.14	0.33	0.16
CLFIR	0.45	0.29	0.20	0.14	0.34	0.22	0.34

Table 3. Image-to-text retrieval performance comparison on testing data for the MIMIC-CXR and IU X-ray datasets. Results show Recall@K scores, where higher values indicate better retrieval performance. Best results are highlighted in bold.

Method	MIMIC-CXR			IU X-Ray
Method	R@1	R@5	R@10	R@1	R@5	R@10
GLoRIA [33]	7.2	20.6	30.3	1.5	4.4	6.5
CXR-CLIP (SwinT) [11]	21.6	48.9	60.2	3.6	8.3	11.5
CLFIR	24.3	55.0	68.0	4.14	15.24	22.28

Table 4. Comparison of medical vision-language contrastive learning approaches on CheXpert5×200 zero-shot classification. Best results are highlighted in bold.

Method	Augmentation	Pairing Strategy	Accuracy
ConVIRT	Input space	Standard positive	0.419
GLoRIA	Input space	Local–global alignment	0.433
MedCLIP-ResNet	Input space	Decoupled matching	0.548
MedCLIP-ViT	Input space	Decoupled matching	0.594
CXR-CLIP	Input space	Standard positive	0.628
CLFIR (Ours)	feature space	Negative for interpolated	0.655

Table 5. Qualitative comparison of CLFIR-generated reports vs. the ground-truth on the IU X-ray and MIMIC-CXR datasets. Matching colors indicate corresponding clinical findings between generated and ground-truth reports.

Dataset	Generated Report	Ground Truth
IU X-ray	Cardiome-diastinal silhouette is within normal limits of size and appearance. Lungs are expanded and clear of airspace disease. Negative for pneumothorax or pleural effusion.	The lungs are clear bilaterally. Specifically, no evidence of focal consolidation, pneumothorax, or pleural effusion. Cardio mediastinal silhouette is unremarkable.
IU X-ray	Mild hyperexpansion of the lungs. Numerous bilateral rib deformities. No focal airspace disease. Heart size is normal. No pneumothorax or effusion.	The heart size and cardiomediastinal silhouette are normal. There is hyperexpansion of the lungs. There is no focal airspace opacity, pleural effusion, or pneumothorax.
MIMIC-CXR	The lungs are clear of focal consolidation, pleural effusion or pneumothorax. The heart size is normal. The mediastinal contours are normal.	The cardiac, mediastinal and hilar contours are normal. Pulmonary vasculature is normal. Lungs are clear. No pleural effusion or pneumothorax is present.
MIMIC-CXR	Lung volumes are low. This results in crowding of the bronchovascular structures. There may be mild pulmonary vascular congestion.	Lung volumes remain low. There are innumerable bilateral scattered small pulmonary nodules. Mild pulmonary vascular congestion is stable.

Table 6. Validation of negative pairing strategy: comparison of negative pairing and cross-modal learning on the IU X-ray and MIMIC-CXR datasets. Best results are highlighted in bold.

Pairing Strategy	BLEU-1	BLEU-2	ROUGE	METEOR	F1
IU X-ray Dataset
CLFIR (Cross-Modal)	44.2	29.4	33.9	23.5	35.8
CLFIR (Negative Pairing)	51.1	35.4	40.2	26.0	42.2
MIMIC-CXR Dataset
CLFIR (Cross-Modal)	41.8	25.1	28.3	19.2	28.5
CLFIR (Negative Pairing)	45.4	29.2	34.5	22.0	34.8

Table 7. Performance comparison of different contrastive learning configurations on the IU X-ray dataset. Best results are highlighted in bold.

Model	BLEU-1	BLEU-2	ROUGE	METEOR	F1
IU X-ray Dataset
CL (Standard)	44.1	29.0	32.0	21.0	34.2
CL (Input Augmentation)	45.0	29.5	32.4	21.7	35.0
CL (2x Batch Size)	46.8	32.2	35.8	24.2	38.4
CLFIR	51.1	35.4	40.2	26.0	42.2

Table 8. Feature space interpolation comparison on the IU X-ray dataset: mixup vs. Gaussian noise. Best results are highlighted in bold.

Interpolation Method	BLEU-1	BLEU-2	ROUGE	METEOR	F1
CL (baseline)	44.0	29.0	32.0	21.0	34.0
CL ( $σ = 0.3$ )	43.7	27.3	30.1	19.0	34.4
CL ( $σ = 0.2$ )	46.2	30.5	34.1	22.0	36.2
CL ( $σ = 0.1$ )	47.4	32.6	35.9	22.9	38.0
CL ( $σ = 0.05$ )	48.0	33.8	37.0	23.7	39.5
CLFIR	51.1	35.4	40.2	26.0	42.2

Table 9. CLFIR mixup performance across various uniform interpolation ranges on the IU X-ray dataset. Best results are highlighted in bold.

Lambda Configuration	BLEU-1	BLEU-2	ROUGE	METEOR	F1
CL (baseline)	44.2	29.5	32.0	21.4	34.6
CLFIR ( $λ \sim$ U(0.75,0.99))	45.8	29.2	34.1	23.2	36.7
CLFIR ( $λ \sim$ U(0.80,0.99))	47.8	32.2	36.1	23.8	38.8
CLFIR $λ \sim$ U(0.85,0.99)	51.0	35.2	40.4	26.3	42.4
CLFIR $λ \sim$ U(0.90,0.99)	49.2	34.4	38.9	24.0	41.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Rahman, Z.U.; Yu, G.; Jin, L.; Kim, J.Y. Contrastive Learning with Feature Space Interpolation for Retrieval-Based Chest X-Ray Report Generation. Appl. Sci. 2026, 16, 470. https://doi.org/10.3390/app16010470

AMA Style

Rahman ZU, Yu G, Jin L, Kim JY. Contrastive Learning with Feature Space Interpolation for Retrieval-Based Chest X-Ray Report Generation. Applied Sciences. 2026; 16(1):470. https://doi.org/10.3390/app16010470

Chicago/Turabian Style

Rahman, Zahid Ur, Gwanghyun Yu, Lee Jin, and Jin Young Kim. 2026. "Contrastive Learning with Feature Space Interpolation for Retrieval-Based Chest X-Ray Report Generation" Applied Sciences 16, no. 1: 470. https://doi.org/10.3390/app16010470

APA Style

Rahman, Z. U., Yu, G., Jin, L., & Kim, J. Y. (2026). Contrastive Learning with Feature Space Interpolation for Retrieval-Based Chest X-Ray Report Generation. Applied Sciences, 16(1), 470. https://doi.org/10.3390/app16010470

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.

Article Menu

Contrastive Learning with Feature Space Interpolation for Retrieval-Based Chest X-Ray Report Generation

Abstract

1. Introduction

2. Materials and Methods

2.1. Overview and Model Architecture

2.2. Problem Formulation

2.3. CLIP Foundation

2.3.1. Image Encoder

2.3.2. Text Encoder

2.4. Feature Space Mixup Interpolation

2.4.1. Mixup Methodology

2.4.2. Negative Pairing Strategy for Cross-Modal Alignment

2.4.3. Rationale for Conservative Interpolation Range

3. Results

3.1. Datasets

3.2. Implementation Details

3.3. Quantitative Results

3.3.1. IU X-Ray Results

3.3.2. MIMIC-CXR Results

3.4. Image-to-Text Retrieval

3.5. CheXpert5×200 Zero-Shot Classification Performance

3.6. Qualitative Results

4. Discussion

4.1. Design Validation via Ablation Studies

4.1.1. Theoretical Justification for Negative Pairing Strategy

4.1.2. Comparison with Alternative Configurations

4.1.3. Comparison with Gaussian Noise Interpolation

4.1.4. Optimal Mixup Interpolation Range for CLFIR

4.2. Computational Efficiency Analysis

4.3. Synthesis and Limitations

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI