Article

Feasibility Study of CLIP-Based Key Slice Selection in CT Images and Performance Enhancement via Lesion- and Organ-Aware Fine-Tuning

1 Department of Radiology, School of Medicine, Jichi Medical University, 3311-1 Yakushi-ji, Shimotsuke, Tochigi 329-0498, Japan
2 Data Science Center, Jichi Medical University, 3311-1 Yakushi-ji, Shimotsuke, Tochigi 329-0498, Japan
* Author to whom correspondence should be addressed.
Bioengineering 2025, 12(10), 1093; https://doi.org/10.3390/bioengineering12101093
Submission received: 20 August 2025 / Revised: 5 October 2025 / Accepted: 8 October 2025 / Published: 10 October 2025

Abstract

Large-scale medical visual question answering (MedVQA) datasets are critical for training and deploying vision–language models (VLMs) in radiology. Ideally, such datasets should be automatically constructed from routine radiology reports and their corresponding images. However, no existing method directly links free-text findings to the most relevant 2D slices in volumetric computed tomography (CT) scans. To address this gap, a contrastive language–image pre-training (CLIP)-based key slice selection framework is proposed, which matches each sentence to its most informative CT slice via text–image similarity. Our experiments demonstrate that models pre-trained in the medical domain already achieve competitive slice retrieval accuracy and that fine-tuning them on a small dual-supervised dataset that imparts both lesion- and organ-level awareness yields further gains. In particular, fine-tuning improved the Top-1 accuracy of the original CLIP by roughly 20 percentage points, and the best-performing model (fine-tuned BiomedCLIP) achieved a Top-1 accuracy of 51.7% for lesion-aware slice retrieval, with its selections accepted by radiologists in 56.3% of cases. By automating report-to-slice alignment, the proposed method facilitates scalable, clinically realistic construction of MedVQA resources.

Graphical Abstract

1. Introduction

Recent advances in vision–language models (VLMs) have been driven by joint learning from paired image–text inputs [1,2,3,4]. Medical visual question answering (MedVQA) has become a benchmark task in this domain, requiring a model to answer questions by jointly processing a medical image and the corresponding sentence(s) [5,6]. To scale training, most studies mine figures and captions from large bibliographic databases and then employ large language models (LLMs) to synthesize visual question answering (VQA) pairs [7,8]. Although effective, these bibliography-based datasets overrepresent prototypical cases and often fail to capture the heterogeneity of real-world clinical presentations, potentially reducing the model performance in practice [7,9]. Consequently, constructing MedVQA datasets directly from routine clinical data is a critical next step and could even surpass the scale of bibliography-based collections. Although a VQA dataset has been curated from routine chest radiograph reports, no comparable resource currently exists for volumetric modalities such as computed tomography (CT) or magnetic resonance imaging (MRI) [10]. A major barrier is that clinical reports are not annotated to show which sentence corresponds to which slice, and automated tools for establishing these links remain underdeveloped, hindering the construction of large, clinically sourced MedVQA datasets. Some studies have explored VLMs that use 3D encoders that pair a full image volume with the entire report; however, these models require substantial computational resources, and many state-of-the-art (SOTA) VLMs still operate on 2D inputs [11,12,13]. Moreover, everyday radiology communication—whether in report snapshots, electronic medical record attachments, or conference presentations—continues to rely on a limited set of representative 2D slices; thus, slice-level tasks remain indispensable in clinical practice.
An automated key slice selector that links each report sentence to the 2D slice that best matches its described finding is therefore essential for building the datasets needed to advance clinically useful VLMs in radiology. The key motivation of this work is to establish such a method, bridging report sentences and CT slices to enable scalable construction of clinically realistic MedVQA datasets.
This study aims to demonstrate the utility of a slice selection framework that aligns report sentences with CT slices and to investigate whether fine-tuning with lesion- and organ-aware supervision can further enhance performance. Contrastive language–image pre-training (CLIP) jointly embeds images and text into a shared latent space and computes cosine similarity between these embeddings to enable zero-shot classification and text–image retrieval [14]. We hypothesized that the same framework could be adapted to align individual report sentences with their corresponding CT slices. Each finding sentence was encoded by the text encoder, and every axial slice in the CT volume was encoded by the image encoder; the slice with the highest cosine similarity was selected as the key image for that sentence (Figure 1). This pipeline was evaluated using the original CLIP (ViT-B/16) and two medical-domain variants, PubMedCLIP and BiomedCLIP [14,15,16]. In addition, we curated a compact training set from a limited number of clinical CT studies and fine-tuned each model using this dataset.
Our contributions are summarized as follows:
  • CLIP-based slice selector for clinical CT.
    We present the first CLIP-based key slice selector that can be used to construct large-scale image–text pairs directly from routine CT studies. Feasibility is evaluated with both automatic metrics and radiologist review.
  • Efficient dataset curation protocol.
    We describe and release a lightweight procedure that uses a small number of studies to generate balanced fine-tuning data containing both lesion-level and organ-level (negative) sentence–slice pairs.
  • Radiologist acceptance gains.
    For abnormal findings, fine-tuning consistently increases the expert “acceptance ratio” (i.e., the proportion of automatically selected slices judged correct), as summarized in Table 1.

2. Related Works

2.1. MedVQA Dataset Construction

VQA-RAD is a foundational MedVQA benchmark for radiological imaging. Expert radiologists manually authored clinical questions and provided the corresponding ground-truth answers, ensuring high annotation quality but resulting in a relatively small dataset [5]. SLAKE builds on this by adding spatial annotations, such as pixel-level masks and bounding boxes, and by including knowledge-base-driven questions to probe deeper clinical reasoning [6]. The effective training of VLMs in highly specialized medical domains requires much larger and more diverse datasets. PMC-VQA constructs a 227K-example VQA dataset, predominantly composed of radiological images, using literature extracted from PubMed Central [7]. This dataset was built by mining images and their associated captions from articles and employing ChatGPT to generate the corresponding VQA pairs. LLaVA-Med adopts a similar strategy: it first trains on 600K image–caption pairs mined from PubMed Central articles and then fine-tunes the model with 60K GPT-4-generated instruction-tuning examples [8]. However, none of these datasets provide explicit links between report sentences and specific CT slices, which is the key challenge that this study addresses.

2.2. Slice Selection

Vote-MI was proposed as a weakly supervised method for identifying representative key slices from CT volumes [17]. Although effective in highlighting diagnostically relevant slices, Vote-MI is designed for slice selection alone and does not provide a mechanism for sentence-level text–image retrieval. This limits its applicability to tasks such as MedVQA dataset construction, where fine-grained alignment between report sentences and CT slices is required. In contrast, our method explicitly addresses this gap by enabling text-driven slice retrieval and further improving alignment performance through lesion- and organ-aware fine-tuning.
Although no prior studies have applied CLIP to slice selection, related methods for extracting keyframes from sequential medical data have been employed for zero-shot surgical phase recognition [18].

2.3. Multimodal Learning

With the advent of CLIP, contrastive learning has become the predominant paradigm in multimodal representation learning [14]. CLIP projects textual and visual inputs into a shared embedding space by encoding each modality separately—text via a transformer-based text encoder and images via a vision transformer—and then aligns them through cosine similarity. This cross-modal alignment capability has led many researchers to employ CLIP image encoders as the visual backbone in VLMs. Subsequent studies extended this foundation to better support downstream tasks. For example, GLIP and RegionCLIP introduce region-level contrastive objectives that improve object detection and semantic segmentation performance [19,20]. SigLIP replaces the standard contrastive loss with a sigmoid cross-entropy formulation, substantially reducing the text encoder context length while maintaining high-efficiency multimodal learning [21]. CLIP-Adapter enhances few-shot transfer by inserting lightweight adapter modules into a frozen CLIP backbone, thereby boosting performance on downstream tasks without extensive retraining [22]. In the medical domain, PubMedCLIP fine-tunes CLIP using ROCO, a radiology-focused corpus derived from PubMed articles [15,23]. BiomedCLIP expands this approach by assembling a PMC-15M dataset from PubMed Central publications and applying it to further fine-tune CLIP [16]. These biomedical variants have achieved SOTA performance in cross-modal retrieval, zero-shot image classification, and MedVQA tasks, thereby establishing themselves as foundation models for a broad range of downstream medical AI applications [24,25,26].
In summary, while previous research has advanced MedVQA dataset construction, slice retrieval, and multimodal learning, none has proposed an automated framework that aligns report sentences with CT slices and enhances performance through lesion- and organ-aware fine-tuning. This constitutes the novelty of the present study.

3. Materials and Methods

3.1. Dataset

This study was approved by the Jichi Medical University Hospital Bioethics Committee for Clinical Research. We assembled our dataset from 137 consecutive patients with gastrointestinal cancer who underwent their first CT examination at our institution between 2021 and 2023. A board-certified radiologist (Radiologist 1) with more than ten years of experience reviewed each report and selected the appropriate CT series (e.g., optimal contrast phase) for each sentence describing an abnormality. The radiologist then identified the CT slice that best matched the sentence and annotated the lesion with a bounding box. The dataset was divided into 92 training, 23 validation, and 22 test studies (Figure 2). For the test set, the radiologist annotated the range of slices corresponding to each sentence to enable more detailed evaluation. This yielded 625 sentence–slice pairs for training, 152 for validation, and 120 for testing. The original reports were written in Japanese; after confirming that they contained no protected health information, all sentences were translated into English using gpt-4o-mini-2024-07-18 [27].
To improve performance on negative findings, we augmented the dataset with synthetic finding–image pairs containing no abnormalities. We first applied TotalSegmentator to each CT series to extract organ masks and determine which of the 15 major thoracoabdominal organs were present in each slice [28]. Using these organ labels, we generated pseudo-findings in two ways: (1) inserting organ names into fixed templates and (2) prompting gpt-4o-mini to produce natural language descriptions based on the same templates. Further details of the rule- and LLM-based prompt designs are provided in Appendix A.
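As a concrete illustration of this labeling step, a minimal sketch for deriving per-slice organ presence labels from a TotalSegmentator output is shown below; the use of nibabel, the one-mask-per-organ file layout, and the axis ordering are assumptions made for illustration rather than details of the exact pipeline.

  import numpy as np
  import nibabel as nib

  def organ_presence_per_slice(mask_path: str) -> np.ndarray:
      """Binary 'organ present in this axial slice' labels from a TotalSegmentator
      organ mask (assumed: one NIfTI file per organ, axial slices on the last axis)."""
      mask = nib.load(mask_path).get_fdata() > 0      # boolean voxel mask, shape (X, Y, Z)
      return mask.any(axis=(0, 1)).astype(np.uint8)   # shape (Z,): 1 if the organ appears in the slice

  # Example (hypothetical path): labels = organ_presence_per_slice("segmentations/liver.nii.gz")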
For lesion-positive examples, we expanded the slice range based on lesion size. Bounding boxes larger than 2000 pixels were padded by one slice above and below, whereas boxes exceeding 4000 and 6000 pixels were expanded by two and three slices, respectively. This procedure yielded a training set composed of 12,743 CT slices, including 1009 lesion-positive image pairs and 140,761 normal anatomy pairs generated from organ labels. As part of image preprocessing, we applied soft-tissue windowing to the CT values to enhance contrast in the relevant intensity ranges. Because the vision transformer (ViT) backbone requires 224 × 224 inputs, we first cropped 32 pixels from each edge of the original 512 × 512 slices to obtain 448 × 448 images and then downsampled them to 224 × 224 pixels. During training, we applied data augmentation, such as horizontal and vertical flips, translations, scaling, rotations, elastic distortions, and cutouts, to improve model robustness [29].
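For clarity, the image preprocessing described above can be sketched as follows; the exact soft-tissue window settings (center 40 HU, width 400 HU) and the interpolation mode are assumptions, as the text specifies only soft-tissue windowing, a 32-pixel edge crop, and downsampling to 224 × 224.

  import numpy as np
  from PIL import Image

  def preprocess_slice(hu_slice: np.ndarray,
                       window_center: float = 40.0,
                       window_width: float = 400.0) -> Image.Image:
      """Soft-tissue windowing, 32-pixel edge crop (512 -> 448), and resize to 224 x 224."""
      lo, hi = window_center - window_width / 2, window_center + window_width / 2
      img = np.clip(hu_slice, lo, hi)
      img = ((img - lo) / (hi - lo) * 255.0).astype(np.uint8)   # rescale the HU window to 0-255
      img = img[32:-32, 32:-32]                                 # 512 x 512 -> 448 x 448
      return Image.fromarray(img).resize((224, 224), Image.BILINEAR)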

3.2. Slice Selection Algorithm

To implement this study’s approach, we defined a slice selection algorithm that links each report sentence to the most relevant CT slice. As illustrated in Figure 1, the text encoder embeds the finding sentence, while the image encoder embeds each axial slice of the CT volume. Cosine similarities are then computed between the sentence embedding and all slice embeddings. The slice with the highest similarity score is designated as the key slice corresponding to the finding. This procedure, summarized in Algorithm 1, constitutes the inference pipeline of this research method and serves as the foundation for both evaluation and fine-tuning.
Algorithm 1 CLIP-based key slice selection.
Require: finding sentence f, CT volume V = {s_1, …, s_N}, text encoder E_text, image encoder E_img
Ensure: key slice s_k
  1: e_text ← E_text(f)
  2: for i ← 1 to N do
  3:     e_img ← E_img(s_i)
  4:     sim_i ← cos(e_text, e_img)
  5: end for
  6: k ← argmax_i sim_i
  7: return s_k
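A minimal PyTorch sketch of Algorithm 1 is given below; `model`, `preprocess`, and `tokenizer` are assumed to come from a CLIP implementation such as open_clip, and encoding all slices in a single batch is an illustrative simplification.

  import torch
  import torch.nn.functional as F

  @torch.no_grad()
  def select_key_slice(finding, slices, model, preprocess, tokenizer, device="cuda"):
      """Return the index of the CT slice most similar to the finding sentence
      (Algorithm 1), together with the full per-slice similarity profile."""
      model.eval()
      tokens = tokenizer([finding]).to(device)
      e_text = F.normalize(model.encode_text(tokens), dim=-1)          # (1, d)
      imgs = torch.stack([preprocess(s) for s in slices]).to(device)   # (N, 3, 224, 224)
      e_img = F.normalize(model.encode_image(imgs), dim=-1)            # (N, d)
      sims = (e_img @ e_text.T).squeeze(-1)                            # cosine similarities, (N,)
      return int(sims.argmax().item()), sims.cpu().numpy()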

3.3. Models and Training Details

We trained CLIP to learn the correspondence between report sentences and CT slices. For the image encoder, we used ViT-B/16 and fine-tuned the model initialized with the official OpenAI pre-trained weights. The same procedure was applied to PubMedCLIP and BiomedCLIP. We summarize the image encoders and text encoders used in each method in Table 2. For all text encoders, the context length was fixed at 77 tokens, and sentences exceeding this length were truncated. All training was conducted on a single NVIDIA RTX 6000 Ada GPU with a batch size of 64 using the AdamW optimizer [30]. The learning rate was scheduled from 5 × 10⁻⁵ to 1 × 10⁻⁶ over 20 epochs using cosine annealing. The checkpoint with the lowest validation loss was selected for evaluation. The experiments were run on Ubuntu 22.04.5 LTS with Python 3.11.9 (conda-forge), using numpy 2.1.1, torch 2.4.1+cu118, and TotalSegmentator 2.10.0.
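The optimizer and schedule above correspond to the following sketch; `train_one_epoch` and `validate` are placeholders for the standard CLIP contrastive training and validation loops, which are not reproduced here.

  import torch

  optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
  scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
      optimizer, T_max=20, eta_min=1e-6)                # cosine decay from 5e-5 to 1e-6 over 20 epochs

  best_val = float("inf")
  for epoch in range(20):
      train_one_epoch(model, train_loader, optimizer)   # placeholder training step
      val_loss = validate(model, val_loader)            # placeholder validation step
      if val_loss < best_val:                           # keep the checkpoint with the lowest validation loss
          best_val = val_loss
          torch.save(model.state_dict(), "best.pt")
      scheduler.step()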

4. Results

We evaluated the slice selection performance of six models: pre-trained CLIP, PubMedCLIP, and BiomedCLIP, along with their fine-tuned counterparts (FT).
Before reporting task-specific retrieval metrics, we first verified whether each model learned meaningful alignments during training. To this end, we measured the mean absolute error (MAE) between the predicted and ground-truth slice indices using two strategies: hard prediction (selecting the Top-1 slice) and soft prediction (using a local moving average).
In Section 4.2, we assess the retrieval accuracy of sentences describing abnormal findings from diagnostic reports. In addition, we report on a radiologist-led evaluation of slice selection results to examine clinical applicability. Section 4.3 presents the alignment fidelity between organ-related text and its corresponding CT slices. Finally, Section 4.4 provides visualizations of the slice selection results for key series, highlighting both quantitative performance and qualitative interpretability.

4.1. Training Verification via Slice-Level MAE

To validate the training process, we measured the mean absolute error (MAE) between the predicted and ground-truth slice indices across all models using the image–sentence pairs from the validation set. We compared two strategies: (i) hard prediction, which directly selects the slice with the highest similarity score, and (ii) soft prediction, which applies a moving average over ±2 neighboring slices to leverage the sequential continuity of CT volumes. As shown in Figure 3, fine-tuned models consistently achieved a lower MAE than their pre-trained counterparts, confirming that meaningful learning occurred. Moreover, the soft prediction strategy yielded fewer errors than the hard prediction strategy, motivating its adoption for all subsequent evaluations. For example, in BiomedCLIP, the MAE improved from 10.40 ± 13.58 with hard prediction to 9.21 ± 11.09 with soft prediction, corresponding to an average gain of approximately one slice in localization accuracy.
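A minimal sketch of the soft prediction strategy is shown below: the per-slice similarity profile is smoothed with a moving average over ±2 neighboring slices before the argmax is taken (the convolution-based formulation is an illustrative implementation choice).

  import numpy as np

  def soft_predict(sims: np.ndarray, radius: int = 2) -> int:
      """Smooth the similarity profile with a moving average over +/- radius slices,
      then select the slice with the highest smoothed score."""
      kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
      smoothed = np.convolve(sims, kernel, mode="same")
      return int(np.argmax(smoothed))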

4.2. Lesion Awareness

The models are evaluated using a test set of 22 studies containing 120 sentences with manually annotated ground-truth slice ranges. At inference, each sentence is encoded by the CLIP text encoder, every slice in the corresponding CT volume is processed by the image encoder, cosine similarities are computed, and the slice with the highest score is selected as the model’s key slice (Figure 1, Algorithm 1). To assess interobserver variability, Radiologist 2—independent of the ground-truth annotator—also selected the single best-matching slice for each sentence. Table 3 shows the proportion of Top-1 predictions falling within the annotated ranges and the proportion of cases in which any of the Top-5 most similar slices overlapped with those ranges. CLIP (FT) improves the Top-1 accuracy by 20 percentage points over CLIP, demonstrating effective adaptation to the medical domain. PubMedCLIP and BiomedCLIP began with higher baseline accuracies, which are further improved by fine-tuning. BiomedCLIP (FT), in particular, achieves the best performance, with a Top-1 accuracy of 51.72% and Top-5 accuracy of 64.37%. Considering that Radiologist 2 achieved a Top-1 accuracy of 78.16%, the model accuracy of 51.72% represents a relatively strong result.
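The Top-1 and Top-5 metrics can be computed as sketched below, where `gt_slices` denotes the set of slice indices inside the annotated ground-truth range (the variable names are illustrative).

  import numpy as np

  def topk_hit(sims: np.ndarray, gt_slices: set, k: int = 5) -> bool:
      """True if any of the k highest-similarity slices falls inside the annotated
      ground-truth slice range (used for Acc.@1 with k=1 and Acc.@5 with k=5)."""
      topk = np.argsort(sims)[::-1][:k]
      return any(int(i) in gt_slices for i in topk)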
As noted above, strict ground-truth ranges do not always align with radiologists’ subjective assessments. To evaluate whether the slices proposed by each method would be considered acceptable in clinical practice, we conducted a qualitative assessment in which a radiologist evaluated whether to adopt each suggested slice for a given sentence. We developed a simple web application that displayed each finding sentence alongside the slices proposed by each slice selector. For each case, the suggestions from all models were presented in a random order on the same screen, and the radiologist was blinded to the source model. Table 4 reports the adoption rates for each model’s recommendations. Although these adoption rates generally correspond to the quantitative results presented in Table 3 (Top-1 accuracy), they are uniformly slightly higher. These higher adoption rates occur when lesions deemed acceptable by the radiologist fall outside the strict ground-truth range—for example, calcifications extending across several slices or annotation covering only one of multiple hepatic cysts.

4.3. Organ Awareness

Recognizing normal anatomy is as important as detecting abnormalities, so we define organ-aware slice selection as a complementary evaluation. Using TotalSegmentator, we generated binary organ presence labels for each CT slice based on the 15 major thoracoabdominal organs [28]. For each model, we encoded the organ prompt and each CT slice, computed their cosine similarities, and then determined organ-specific thresholds using the Youden index on the validation set, which served as cutoffs for determining organ presence in each test slice [31]. To assess the impact of text granularity, we compared two input types: single-word organ names (e.g., “Heart” and “Liver”) and full template sentences (e.g., “This CT image includes the heart”). These experiments test whether word- or sentence-level prompts are more effective when querying a text encoder.
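The per-organ cutoff selection can be sketched as follows, assuming scikit-learn is available; `val_labels` and `val_similarities` denote, for one organ, the TotalSegmentator presence labels and the prompt–slice cosine similarities on the validation set.

  import numpy as np
  from sklearn.metrics import roc_curve

  def youden_threshold(y_true: np.ndarray, scores: np.ndarray) -> float:
      """Similarity cutoff maximizing the Youden index J = TPR - FPR
      (equivalently, sensitivity + specificity - 1) on the validation set."""
      fpr, tpr, thresholds = roc_curve(y_true, scores)
      return float(thresholds[np.argmax(tpr - fpr)])

  # cutoff = youden_threshold(val_labels, val_similarities)
  # present = test_similarities >= cutoff   # organ-presence decision for each test slice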
Table 5 summarizes the organ-aware evaluation, reporting both the accuracy and F1-score for each model under word and sentence prompt conditions. BiomedCLIP (FT) achieved the highest performance in both settings, with an F1-score of 0.85 for word and sentence inputs. PubMedCLIP showed similar accuracy and F1-scores across the two prompt types, while BiomedCLIP tended to perform better with sentence prompts. Fine-tuning yielded comparable gains for word and sentence inputs across all models, indicating that our approach generalized well to both single-token organ names and fully descriptive sentences. On average, fine-tuning improved the accuracy and F1-score by more than 10% for each model. These substantial improvements demonstrate enhanced normal anatomy recognition accuracy not only for the original CLIP model but also for the medical-domain pre-trained variants (PubMedCLIP and BiomedCLIP). They further suggest that fine-tuning improves slice extraction accuracy for non-abnormal findings, thereby benefiting downstream VQA generation. F1-scores for each organ are provided in Appendix B.
Additionally, as an external dataset benchmark, we evaluated organ awareness using the publicly available CT-RATE dataset, which consists of chest CT volumes with organ labels generated by TotalSegmentator [13]. We evaluated 1304 cases from the validation split, selecting the earliest series (denoted as “_a_1” in the dataset) when multiple CT series were available per patient. The results are summarized in Table 6. Overall performance was lower than that reported in Table 5, likely due to domain shifts related to imaging devices and acquisition parameters. For example, BiomedCLIP(FT) achieved an accuracy of 0.926 and an F1-score of 0.809 on CT-RATE, compared with 0.948 and 0.854, respectively, on the in-house dataset. Despite this decrease, BiomedCLIP(FT) consistently yielded the highest accuracy across all models, confirming its robustness to dataset variation.

4.4. Visualization

Figure 4 shows an example of a lesion-aware slice selection result on the test cohort. For uterine fibroids, PubMedCLIP, BiomedCLIP, and BiomedCLIP(FT) correctly identified key CT slices containing the lesion. Figure 5 shows similarity profiles between organ-related text and CT slices for a selected series. The original CLIP model exhibited numerous false positives and inconsistent alignments, whereas CLIP (FT) produced a much cleaner correspondence and improved localization. Although PubMedCLIP and BiomedCLIP also showed false positives and negatives, both models achieved high accuracy in recognizing the target organ after fine-tuning. Additional qualitative examples are presented in Appendix C.

5. Discussion

In this study, we introduced a CLIP-based key slice selector that aligns free-text radiology findings with corresponding CT slices. The approach was evaluated along two complementary axes: lesion and organ awareness. Despite training on a relatively small dataset, the fine-tuned models achieved substantial gains in lesion-aware slice retrieval, improving accuracy by 7–20 percentage points over the pre-trained baselines. The best performer, BiomedCLIP (FT), achieved 51% Top-1 accuracy for lesion localization. In the organ-aware task, all fine-tuned models achieved improvements of over 10% in both accuracy and F1-score. Notably, the performance remained strong regardless of whether organ prompts were provided as single words or full sentences, demonstrating the flexibility of the text-encoding strategy.
To further examine the robustness under external conditions, we conducted an additional organ-awareness evaluation on the publicly available CT-RATE dataset. Although all methods exhibited some degradation in performance owing to domain shifts in imaging devices and acquisition protocols, the decrease was relatively modest, and BiomedCLIP (FT) consistently achieved the highest accuracy. These findings suggest that this study’s approach is robust across datasets and can be generalized beyond the in-house cohort.
Regarding system requirements, all experiments were conducted on a single NVIDIA RTX 6000 Ada GPU with 48 GB of memory, which was sufficient to train models with a batch size of 64. Each training run with 20 epochs finished within one hour. These results indicate that the proposed method can be reproduced on a high-end single-GPU workstation without requiring distributed resources.
Beyond the dataset used in this study, the proposed slice selection framework offers practical benefits when applied to diverse imaging archives. First, automatically aligning report sentences with key slices can substantially reduce the cost of two-dimensional (2D) slice-level annotation, which is often a major bottleneck in dataset curation. Second, it enables the construction of image–text pairs directly from unlabeled CT volumes and routine reports, providing a low-cost pathway for expanding training resources for medical VLMs. These improvements highlight the broader utility of the proposed method across datasets and institutions, extending its impact beyond the present feasibility study.
The method paves the way for fully automated construction of MedVQA datasets from routine clinical archives, directly addressing the biases and scale limitations inherent in bibliographic collections. Furthermore, the radiologist’s evaluation of the slice selection results (Table 4) and visualization of similarity profiles (Figure 5) suggest that, in addition to serving as a VQA dataset generator, integrating the slice selector into clinical workflows as a recommendation tool could help reduce radiologists’ workload and enhance diagnostic accuracy.
This study has several limitations that should be acknowledged. First, as this was a feasibility study, the training and test cohorts were small and drawn from a single institution. Expanding the dataset size could uncover additional gains in retrieval accuracy and reduce the risk of overfitting to the relatively small evaluation cohorts. Second, the evaluation reports only overall lesion- and organ-aware metrics; it does not stratify the results by lesion size, anatomical region, contrast phase, or disease category. Finer-grained analysis is essential to uncover failure modes in clinically relevant subgroups. Third, the Youden index thresholds used to determine organ presence were chosen from the validation set and fixed for testing. Although this mirrors a real-world deployment scenario, alternative calibration strategies may yield better threshold stability for unseen data.
In future work, we will integrate a key slice selector into a complete MedVQA generation pipeline consisting of four main stages.
  • Segmenting diagnostic reports into individual finding units;
  • Acquiring the corresponding imaging series (including selection of contrast phases);
  • Extracting key slices using the slice selector;
  • Generating VQA pairs from each finding sentence through rule-based systems or LLMs.
While this study addresses Stage 3, the remaining steps are critical for end-to-end automation. Accordingly, we plan to develop and evaluate methods for report segmentation and image retrieval and to implement VQA generation using both deterministic templates and generative LLM prompts. As these components mature, we will iteratively expand the clinical dataset and refine the slice selector’s accuracy.
We also plan to train VLMs on the resulting automatically constructed MedVQA dataset and compare their performance with models trained on manually curated slice annotations to quantify the impact of automated slice selection on the downstream VLM capabilities.
Beyond dataset construction, architectural advances offer promising directions. Recent large-scale VLMs such as Merlin, which align 3D CT volumes with radiology reports through contrastive learning, provide useful design principles [12]. Incorporating similar volume-level representation strategies, as well as techniques from the CLIP-Driven Universal Model—which uses text embeddings as semantic labels for segmentation and detection—could extend the framework beyond slice retrieval to fine-grained localization and more efficient MedVQA dataset construction [32].

6. Conclusions

This study introduced a CLIP-based key slice selector that aligns free-text radiology findings with the corresponding CT images. Fine-tuning on a compact dual-supervised dataset substantially improved lesion-aware accuracy, with the best-performing model achieving strong results in lesion localization. In the organ-aware task, all fine-tuned models demonstrated gains in both accuracy and F1-score, regardless of whether the prompts were provided as single words or full sentences. A qualitative review indicated that the slice selector’s recommendations could reduce radiologists’ workload and enhance diagnostic confidence in routine practice. Nevertheless, the current evaluation is limited by its single-institution, small-cohort design, lack of stratification by lesion size or disease type, and reliance on fixed-threshold calibration, all of which limit generalizability. Future work will expand the dataset, incorporate stratified analyses, and integrate the selector into an end-to-end MedVQA pipeline encompassing report segmentation, series retrieval, slice extraction, and automatic question–answer generation.

Author Contributions

Conceptualization, K.Y. and T.K.; methodology, K.Y.; software, K.Y.; validation, K.Y.; formal analysis, K.Y.; investigation, K.Y. and T.K.; resources, T.K.; data curation, K.Y. and T.K.; writing—original draft preparation, K.Y.; writing—review and editing, T.K.; visualization, K.Y.; supervision, T.K.; project administration, K.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the JST Program for co-creating startup ecosystem (Grant Number JPMJSF2319), Japan, and by JSPS KAKENHI (Grant Number JP23K17234).

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki and approved by the Jichi Medical University Hospital Bioethics Committee for Clinical Research (protocol code 25-087).

Informed Consent Statement

This retrospective study was approved by the Jichi Medical University Hospital Bioethics Committee for Clinical Research, and the requirement for informed consent was waived due to the retrospective nature of this study and the use of anonymized medical imaging data that poses minimal risk to participants.

Data Availability Statement

The datasets generated and analyzed during the current study are not publicly available due to privacy and ethical restrictions related to patient medical imaging data. The source code used for the analysis is available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

    The following abbreviations are used in this manuscript:
CLIP      Contrastive Language–Image Pre-training
CT        Computed Tomography
VLM       Vision–Language Model
MedVQA    Medical Visual Question Answering
LLM       Large Language Model
SOTA      State of the Art
VQA       Visual Question Answering

Appendix A. Finding Text Generation

The following templates were designed to generate non-abnormal finding sentences using organ labels predicted by TotalSegmentator. In each template, the placeholder {{organ}} was replaced with the corresponding organ label obtained from TotalSegmentator. Note that the original templates were written in Japanese and later translated into English; therefore, their word order may differ slightly from natural English syntax.

Appendix A.1. Rule-Based

  organs = [
      "aorta",
      "colon",
      "duodenum",
      "esophagus",
      "gallbladder",
      "heart",
      "kidney",
      "liver",
      "pancreas",
      "stomach",
      "spleen",
      "thyroid_gland",
      "urinary_bladder",
      "trachea",
      "lung",
  ]
  text_templates = {
      "prefix": [
          "In the image,",
          "In this image,",
          "In this CT image,",
          "",
      ],
      "suffix": [
          "{{organ}} is visible.",
          "{{organ}} is included.",
          "{{organ}} is contained.",
          "{{organ}} etc. are visible.",
          "{{organ}} etc. are included.",
          "{{organ}} etc. are contained.",
      ],
  }
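A minimal sketch of how these templates can be expanded into pseudo-findings is given below; randomly sampling one prefix and one suffix per slice is an assumption, since the text only states that {{organ}} is replaced with the predicted organ label.

  import random

  def make_pseudo_finding(organ: str) -> str:
      """Combine a prefix and a suffix template and substitute the organ label
      predicted by TotalSegmentator (rule-based pseudo-finding generation)."""
      prefix = random.choice(text_templates["prefix"])
      suffix = random.choice(text_templates["suffix"]).replace("{{organ}}", organ)
      return f"{prefix} {suffix}".strip()

  # e.g., make_pseudo_finding("liver") might yield "In this CT image, liver is visible."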

Appendix A.2. LLM-Based

We generated 20 sentence templates for each organ. Below, we present representative examples for the liver and pancreas.
  {
      'liver': [
          'No abnormal findings are observed in the liver, and normal morphology is maintained.',
          'No abnormal findings are observed in the liver, and morphology falls within normal limits.',
          'No lesions or abnormal findings are noted in the liver.',
          'No abnormal findings are observed in the liver, and normal morphology is maintained.',
          'No abnormal findings are observed in the liver, and it remains within normal limits.',
          'No abnormal masses or tumors can be identified in the liver.',
          'No abnormal findings are observed in the liver, and its structure is normal.',
          'No abnormal findings are observed in the liver, and its normal state is preserved.',
          'No abnormal findings are observed in the liver, and its shape and size are within normal limits.',
          'No abnormal masses or fatty changes are noted in the liver.',
          'No abnormal masses or deformities are observed in the liver, and findings are within normal limits.',
          'No findings indicative of abnormality are observed in the liver.',
          'No deformities or neoplastic lesions are observed in the liver, which maintains normal morphology.',
          'No morphological abnormalities or neoplastic lesions are observed in the liver.',
          'No significant lesions or impairments are detected in the liver.',
          'No particular abnormalities are observed in the liver, which preserves normal morphology.',
          'No abnormalities in shape or echo texture are noted in the liver.',
          'No obvious lesions or abnormal findings are noted in the liver.',
          'No abnormal masses or perfusion changes are observed in the liver.',
          'No abnormal findings are observed in the liver, and its normal structure is preserved.',
      ],
      'pancreas': [
          'No abnormal findings are observed in the pancreas.',
          'No findings suggestive of abnormality are noted in the pancreas.',
          'No significant abnormalities are observed in the pancreas.',
          'No findings indicating abnormality are apparent in the pancreas.',
          'No abnormal findings are observed in the pancreas, and normal morphology is maintained.',
          'No abnormal masses or signs of inflammation are observed in the pancreas.',
          'No masses or inflammatory findings are noted in the pancreas.',
          'No abnormal findings are observed in the pancreas, and its morphology falls within normal limits.',
          'No tumors or inflammatory findings are observed in the pancreas, which retains normal morphology.',
          'No findings suggestive of abnormality are apparent in the pancreas.',
          'No structural abnormalities or lesions are observed in the pancreas.',
          'No abnormal findings are observed in the pancreas.',
          'No tumors or abnormal structural changes are observed in the pancreas.',
          'No abnormal findings are observed in the pancreas, and normal imaging appearance is obtained.',
          'No abnormal findings are observed in the pancreas, and normal morphology is confirmed.',
          'No abnormal masses or inflammation are observed in the pancreas.',
          'No abnormal findings are observed in the pancreas.',
          'No abnormal findings of any kind are observed in the pancreas.',
          'No abnormal findings are observed in the pancreas, and normal morphology is present.',
          'No findings suggestive of lesions are observed in the pancreas.',
      ],
  }

Appendix B. Detailed Organ-Aware Results

Table A1. F1-scores for each organ across slice selection methods. “Word” denotes prompts using only the organ name, and “Sentence” denotes prompts using full finding sentences. Each cell lists the Word / Sentence F1-scores.
Organ              CLIP           CLIP(FT)       PubMedCLIP     PubMedCLIP(FT)  BiomedCLIP     BiomedCLIP(FT)
aorta              0.679 / 0.660  0.960 / 0.960  0.852 / 0.836  0.966 / 0.967   0.858 / 0.754  0.943 / 0.948
colon              0.676 / 0.747  0.877 / 0.877  0.817 / 0.859  0.880 / 0.878   0.878 / 0.886  0.878 / 0.878
duodenum           0.450 / 0.522  0.761 / 0.775  0.749 / 0.762  0.804 / 0.841   0.665 / 0.724  0.798 / 0.828
esophagus          0.614 / 0.731  0.932 / 0.931  0.936 / 0.918  0.958 / 0.958   0.873 / 0.895  0.973 / 0.972
gallbladder        0.250 / 0.235  0.494 / 0.495  0.450 / 0.472  0.547 / 0.547   0.432 / 0.451  0.543 / 0.537
heart              0.554 / 0.643  0.933 / 0.941  0.842 / 0.832  0.945 / 0.943   0.664 / 0.841  0.956 / 0.955
kidney             0.684 / 0.576  0.886 / 0.887  0.767 / 0.788  0.885 / 0.884   0.733 / 0.696  0.917 / 0.917
liver              0.600 / 0.570  0.900 / 0.901  0.791 / 0.832  0.928 / 0.921   0.821 / 0.750  0.914 / 0.912
pancreas           0.588 / 0.518  0.801 / 0.807  0.731 / 0.716  0.836 / 0.841   0.730 / 0.755  0.838 / 0.824
stomach            0.402 / 0.418  0.720 / 0.737  0.650 / 0.612  0.780 / 0.767   0.699 / 0.673  0.823 / 0.827
spleen             0.325 / 0.409  0.731 / 0.744  0.619 / 0.685  0.809 / 0.811   0.665 / 0.639  0.810 / 0.800
thyroid gland      0.363 / 0.227  0.744 / 0.744  0.552 / 0.576  0.648 / 0.647   0.219 / 0.643  0.727 / 0.728
urinary bladder    0.165 / 0.210  0.491 / 0.691  0.387 / 0.386  0.739 / 0.779   0.160 / 0.575  0.753 / 0.762
trachea            0.542 / 0.514  0.900 / 0.905  0.730 / 0.714  0.881 / 0.884   0.796 / 0.767  0.945 / 0.947
lung               0.816 / 0.827  0.964 / 0.963  0.954 / 0.904  0.976 / 0.975   0.906 / 0.907  0.980 / 0.981

Appendix C. Visualization

Appendix C.1. Lesion Awareness

Figure A1. Visualization of lesion-aware slice selection for the finding “Circumferential wall thickening is observed in the antrum to pylorus of the stomach, which is consistent with findings of gastric cancer. There is evidence of extramural invasion.” The Ground Truth column shows the reference CT slice with the radiologist-annotated lesion bounding box. The other columns display each model’s key slice prediction. The white double circle symbol indicates predictions that match the ground-truth slice range.
Figure A2. Visualization of lesion-aware slice selection for the finding “Multiple enlarged lymph nodes are observed in the celiac artery region, which is considered metastatic.” In this case, no prediction fell within the ground-truth range.
Figure A3. Visualization of lesion-aware slice selection for the finding “Ring enhancement lesions are observed in both lobes of the liver, indicating the presence of hepatic metastases”.
Figure A4. Visualization of lesion-aware slice selection for the finding “Prostate enlargement is observed”.

Appendix C.2. Organ Awareness

Figure A5. Similarity profile between the prompt “Liver” and each axial CT slice. The green line denotes the ground-truth slice range. The dashed line indicates the Youden index threshold determined on the validation set.
Figure A6. Similarity profile between the prompt “Kidney” and each axial CT slice.

Appendix D. Impact of Organ-Aware Synthetic Data

To evaluate the contribution of synthetic data generated from organ labels, we conducted an ablation study in which the models were fine-tuned only on report-derived lesion sentences, without using any organ-aware synthetic data. Because the training set in this condition contained only 625 sentence–slice pairs (Figure 2), we adopted a milder fine-tuning schedule with the learning rate decayed from 1 × 10⁻⁵ to 1 × 10⁻⁶ over 20 epochs. This model was referred to as BiomedCLIP(lesion-FT). Table A2 compares the lesion-awareness accuracies of BiomedCLIP, BiomedCLIP(lesion-FT), and BiomedCLIP(FT). The results show that BiomedCLIP(lesion-FT) achieved a higher Top-1 accuracy than the original BiomedCLIP baseline (45.98% vs. 44.83%) but remained inferior to BiomedCLIP(FT), which incorporated organ-aware synthetic data (51.72%). These findings indicate that synthetic data are beneficial not only for organ-aware evaluation, but also for improving lesion-aware performance.
Table A2. Lesion-aware accuracy [%] of BiomedCLIP variants. BiomedCLIP(lesion-FT) was fine-tuned only on lesion sentences without synthetic data.
Method                    Acc.@1 ↑    Acc.@5 ↑
BiomedCLIP                44.83       60.92
BiomedCLIP(FT)            51.72       64.37
BiomedCLIP(lesion-FT)     45.98       63.22

References

  1. Li, F.; Zhang, R.; Zhang, H.; Zhang, Y.; Li, B.; Li, W.; Ma, Z.; Li, C. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models. arXiv 2024, arXiv:2407.07895. [Google Scholar]
  2. Xiao, B.; Wu, H.; Xu, W.; Dai, X.; Hu, H.; Lu, Y.; Zeng, M.; Liu, C.; Yuan, L. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 4818–4829. [Google Scholar]
  3. Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Zhong, M.; Zhang, Q.; Zhu, X.; Lu, L.; et al. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 24185–24198. [Google Scholar]
  4. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 23716–23736. [Google Scholar]
  5. Lau, J.J.; Gayen, S.; Ben Abacha, A.; Demner-Fushman, D. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 2018, 5, 180251. [Google Scholar] [CrossRef] [PubMed]
  6. Liu, B.; Zhan, L.M.; Xu, L.; Ma, L.; Yang, Y.; Wu, X.M. Slake: A Semantically-Labeled Knowledge-Enhanced Dataset For Medical Visual Question Answering. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; pp. 1650–1654. [Google Scholar] [CrossRef]
  7. Zhang, X.; Wu, C.; Zhao, Z.; Lin, W.; Zhang, Y.; Wang, Y.; Xie, W. PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering. arXiv 2024, arXiv:2305.10415. [Google Scholar]
  8. Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 28541–28564. [Google Scholar]
  9. Dong, W.; Shen, S.; Han, Y.; Tan, T.; Wu, J.; Xu, H. Generative Models in Medical Visual Question Answering: A Survey. Appl. Sci. 2025, 15, 2983. [Google Scholar] [CrossRef]
  10. Bae, S.; Kyung, D.; Ryu, J.; Cho, E.; Lee, G.; Kweon, S.; Oh, J.; Ji, L.; Chang, E.; Kim, T.; et al. EHRXQA: A multi-modal question answering dataset for electronic health records with chest x-ray images. Adv. Neural Inf. Process. Syst. 2024, 36, 3867–3880. [Google Scholar]
  11. Bai, F.; Du, Y.; Huang, T.; Meng, M.Q.H.; Zhao, B. M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models. arXiv 2024, arXiv:2404.00578. [Google Scholar]
  12. Blankemeier, L.; Cohen, J.P.; Kumar, A.; Van Veen, D.; Gardezi, S.J.S.; Paschali, M.; Chen, Z.; Delbrouck, J.B.; Reis, E.; Truyts, C.; et al. Merlin: A Vision Language Foundation Model for 3D Computed Tomography. Res. Sq. 2024, rs.3.rs-4546309. [Google Scholar] [CrossRef]
  13. Hamamci, I.E.; Er, S.; Menze, B. CT2Rep: Automated Radiology Report Generation for 3D Medical Imaging. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2024, Marrakesh, Morocco, 6–10 October 2024; Linguraru, M.G., Dou, Q., Feragen, A., Giannarou, S., Glocker, B., Lekadir, K., Schnabel, J.A., Eds.; Springer: Cham, Switzerland, 2024; pp. 476–486. [Google Scholar]
  14. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Washington, DC, USA, 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  15. Eslami, S.; Meinel, C.; de Melo, G. PubMedCLIP: How Much Does CLIP Benefit Visual Question Answering in the Medical Domain? In Findings of the Association for Computational Linguistics: EACL 2023; Vlachos, A., Augenstein, I., Eds.; Association for Computational Linguistics: Dubrovnik, Croatia, 2023; pp. 1181–1193. [Google Scholar] [CrossRef]
  16. Zhang, S.; Xu, Y.; Usuyama, N.; Xu, H.; Bagga, J.; Tinn, R.; Preston, S.; Rao, R.; Wei, M.; Valluri, N.; et al. BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv 2025, arXiv:2303.00915. [Google Scholar]
  17. Wang, Y.; Peng, J.; Dai, Y.; Jones, C.; Sair, H.; Shen, J.; Loizou, N.; Wu, J.; Hsu, W.C.; Imami, M.; et al. Enhancing vision-language models for medical imaging: Bridging the 3D gap with innovative slice selection. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 99947–99964. [Google Scholar]
  18. Yuan, K.; Srivastav, V.; Navab, N.; Padoy, N. HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition. arXiv 2025, arXiv:2405.10075. [Google Scholar]
  19. Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded Language-Image Pre-Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10965–10975. [Google Scholar]
  20. Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. RegionCLIP: Region-Based Language-Image Pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16793–16803. [Google Scholar]
  21. Zhai, X.; Mustafa, B.; Kolesnikov, A.; Beyer, L. Sigmoid Loss for Language Image Pre-Training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 11975–11986. [Google Scholar]
  22. Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; Qiao, Y. CLIP-Adapter: Better Vision-Language Models with Feature Adapters. Int. J. Comput. Vis. 2024, 132, 581–595. [Google Scholar] [CrossRef]
  23. Pelka, O.; Koitka, S.; Rückert, J.; Nensa, F.; Friedrich, C.M. Radiology Objects in COntext (ROCO): A Multimodal Image Dataset. In Proceedings of the Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, Granada, Spain, 16 September 2018; Stoyanov, D., Taylor, Z., Balocco, S., Sznitman, R., Martel, A., Maier-Hein, L., Duong, L., Zahnd, G., Demirci, S., Albarqouni, S., et al., Eds.; Springer: Cham, Switzerland, 2018; pp. 180–189. [Google Scholar]
  24. Koleilat, T.; Asgariandehkordi, H.; Rivaz, H.; Xiao, Y. MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2024, Marrakesh, Morocco, 6–10 October 2024; Linguraru, M.G., Dou, Q., Feragen, A., Giannarou, S., Glocker, B., Lekadir, K., Schnabel, J.A., Eds.; Springer: Cham, Switzerland, 2024; pp. 643–653. [Google Scholar]
  25. Zhao, T.; Gu, Y.; Yang, J.; Usuyama, N.; Lee, H.H.; Kiblawi, S.; Naumann, T.; Gao, J.; Crabtree, A.; Abel, J.; et al. A foundation model for joint segmentation, detection and recognition of biomedical objects across nine modalities. Nat. Methods 2024, 22, 166–176. [Google Scholar] [CrossRef] [PubMed]
  26. Polis, B.; Zawadzka-Fabijan, A.; Fabijan, R.; Kosińska, R.; Nowosławska, E.; Fabijan, A. Exploring BiomedCLIP’s Capabilities in Medical Image Analysis: A Focus on Scoliosis Detection and Severity Assessment. Appl. Sci. 2025, 15, 398. [Google Scholar] [CrossRef]
  27. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2024, arXiv:2303.08774. [Google Scholar]
  28. Wasserthal, J.; Breit, H.C.; Meyer, M.T.; Pradella, M.; Hinck, D.; Sauter, A.W.; Heye, T.; Boll, D.T.; Cyriac, J.; Yang, S.; et al. TotalSegmentator: Robust Segmentation of 104 Anatomic Structures in CT Images. Radiol. Artif. Intell. 2023, 5, e230024. [Google Scholar] [CrossRef] [PubMed]
  29. DeVries, T.; Taylor, G.W. Improved Regularization of Convolutional Neural Networks with Cutout. arXiv 2017, arXiv:1708.04552. [Google Scholar] [CrossRef]
  30. Loshchilov, I.; Hutter, F. Fixing weight decay regularization in adam. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  31. Youden, W.J. Index for rating diagnostic tests. Cancer 1950, 3, 32–35. [Google Scholar] [CrossRef] [PubMed]
  32. Liu, J.; Zhang, Y.; Chen, J.N.; Xiao, J.; Lu, Y.; Landman, B.A.; Yuan, Y.; Yuille, A.; Tang, Y.; Zhou, Z. CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 21095–21107. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed contrastive language–image pre-training (CLIP)-based slice selection pipeline. Each report sentence is encoded by the text encoder and each CT slice by the image encoder. Cosine similarities are computed between the sentence embedding and all slice embeddings, and the slice with the highest score is chosen as the key slice for that finding.
Figure 2. Overview of dataset composition and evaluation measures.
Figure 3. Comparison of mean absolute error (MAE) between predicted and ground-truth slice indices for each method. Results are shown for both hard prediction (Top-1 slice) and soft prediction (moving average over ±2 slices).
Figure 4. “Multiple masses that may represent fibroids are observed in the myometrium and beneath the endometrium.” The Ground Truth column shows the reference CT slice with the radiologist-annotated lesion bounding box. The other columns display each model’s key slice prediction. The white double circle symbol indicates predictions that match the ground-truth slice range.
Figure 5. Similarity profiles between the prompt “Heart” and each slice. The green line denotes the ground-truth slice range. Similarity profiles increase from left to right, and the dashed line indicates the decision threshold.
Table 1. Radiologist acceptance ratio for automatically selected slices with abnormal findings.
Model         Baseline [%]   + Fine-Tuning [%]   Δ [%]
CLIP          24.14          40.23               +16.09
PubMedCLIP    37.93          50.57               +12.64
BiomedCLIP    50.57          56.32               +5.75
Table 2. Summary of image and text encoders used in each method.
Method        Image Encoder   Text Encoder
CLIP          ViT-B/16        Transformer
PubMedCLIP    ViT-B/32        Transformer
BiomedCLIP    ViT-B/16        PubMedBERT
Table 3. Comparison of slice selection accuracy for each method. Reported values are Top-1 and Top-5 accuracy [%]. For each metric, the highest score among the automated methods (excluding Radiologist 2) is highlighted in red.
Method            Acc.@1 ↑    Acc.@5 ↑
Radiologist 2     78.16       -
CLIP              19.54       44.83
CLIP(FT)          40.23       49.43
PubMedCLIP        29.89       54.02
PubMedCLIP(FT)    42.53       59.77
BiomedCLIP        44.83       60.92
BiomedCLIP(FT)    51.72       64.37
Table 4. Radiologist acceptance rates [%] of slice selector predictions for each model.
Method            Acceptance Rate ↑
CLIP              24.14
CLIP(FT)          40.23
PubMedCLIP        37.93
PubMedCLIP(FT)    50.57
BiomedCLIP        50.57
BiomedCLIP(FT)    56.32
Table 5. Comparison of organ extraction performance. Reported metrics are accuracy (Acc.) and F1-score (F1). “Word” indicates prompts using single organ names, and “Sentence” indicates prompts using full finding sentences. The highest score in each column is highlighted in red.
Method            Word: Acc. ↑     Word: F1 ↑       Sentence: Acc. ↑   Sentence: F1 ↑
CLIP              0.715 ± 0.105    0.514 ± 0.182    0.725 ± 0.071      0.520 ± 0.192
CLIP(FT)          0.927 ± 0.038    0.806 ± 0.153    0.934 ± 0.035      0.824 ± 0.129
PubMedCLIP        0.881 ± 0.046    0.722 ± 0.164    0.882 ± 0.042      0.726 ± 0.157
PubMedCLIP(FT)    0.943 ± 0.030    0.839 ± 0.123    0.944 ± 0.031      0.843 ± 0.121
BiomedCLIP        0.832 ± 0.133    0.673 ± 0.230    0.884 ± 0.048      0.730 ± 0.126
BiomedCLIP(FT)    0.947 ± 0.033    0.853 ± 0.118    0.948 ± 0.032      0.854 ± 0.118
Table 6. Comparison of organ extraction performance on the external CT-RATE dataset. Reported metrics are accuracy (Acc.) and F1-score (F1). “Word” indicates prompts using single organ names, and “Sentence” indicates prompts using full finding sentences. The highest score in each column is highlighted in red.
Method            Word: Acc. ↑     Word: F1 ↑       Sentence: Acc. ↑   Sentence: F1 ↑
CLIP              0.726 ± 0.059    0.554 ± 0.223    0.724 ± 0.065      0.556 ± 0.226
CLIP(FT)          0.902 ± 0.050    0.770 ± 0.251    0.902 ± 0.048      0.770 ± 0.250
PubMedCLIP        0.842 ± 0.099    0.680 ± 0.237    0.861 ± 0.095      0.707 ± 0.226
PubMedCLIP(FT)    0.911 ± 0.072    0.796 ± 0.252    0.910 ± 0.073      0.795 ± 0.252
BiomedCLIP        0.749 ± 0.171    0.632 ± 0.271    0.755 ± 0.125      0.604 ± 0.226
BiomedCLIP(FT)    0.926 ± 0.059    0.809 ± 0.257    0.926 ± 0.059      0.809 ± 0.257
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
