Article

A Language Vision Model Approach for Automated Tumor Contouring in Radiation Oncology

1 Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21287, USA
2 Department of Radiation Oncology and Molecular Radiation Sciences, Johns Hopkins University, Baltimore, MD 21287, USA
3 Carina Medical LLC., Lexington, KY 40513, USA
4 Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
5 Department of Radiation Oncology, Mayo Clinic Arizona, Phoenix, AZ 85054, USA
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Bioengineering 2025, 12(8), 835; https://doi.org/10.3390/bioengineering12080835
Submission received: 7 July 2025 / Revised: 27 July 2025 / Accepted: 28 July 2025 / Published: 31 July 2025
(This article belongs to the Special Issue Novel Imaging Techniques in Radiotherapy)

Abstract

Background: Lung cancer ranks as the leading cause of cancer-related mortality worldwide. Tumor delineation, which is crucial for radiation therapy, is complex and requires expertise that is often unavailable in resource-limited settings. Artificial intelligence (AI), particularly with advances in deep learning (DL) and natural language processing (NLP), offers potential solutions but is challenged by high false positive rates. Purpose: The Oncology Contouring Copilot (OCC) system was developed to leverage oncologist expertise for precise tumor contouring using textual descriptions, aiming to increase the efficiency of oncological workflows by combining the strengths of AI with human oversight. Methods: The OCC system first identifies nodule candidates from CT scans. Employing Language Vision Models (LVMs) such as GPT-4V, it then reduces false positives using clinical descriptive texts, merging textual and visual data to automate tumor delineation and incorporate knowledge from experienced domain experts. Results: Deployment of the OCC system yielded a 35.0% reduction in the false discovery rate, a 72.4% decrease in false positives per scan, and an F1-score of 0.652 on our dataset for unbiased evaluation. Conclusions: OCC represents a significant advance in oncology care, particularly through the use of the latest LVMs, improving contouring results by (1) streamlining oncology treatment workflows by optimizing tumor delineation and reducing manual processes; (2) offering a scalable and intuitive framework to reduce false positives in radiotherapy planning using LVMs; (3) introducing novel medical language vision prompt techniques, validated with an ablation study, to minimize LVM hallucinations; and (4) conducting a comparative analysis of LVMs, highlighting their potential for addressing medical language vision challenges.

1. Introduction

Cancer is the second leading cause of death worldwide [1] and is expected to become the primary cause of global morbidity and mortality in the coming years, surpassing other diseases [2]. Common treatment modalities include surgery, radiation therapy, chemotherapy, hormone therapy, targeted therapy, and immunotherapy. Among all cancer types, lung cancer is the leading cause of cancer-related deaths worldwide [3]. It is predominantly diagnosed as non-small-cell lung cancer (NSCLC) [4], for which stereotactic body radiotherapy is often the preferred treatment for inoperable, early-stage disease. The efficacy of such therapy is contingent upon the quality of radiotherapy treatment planning, to which the accurate yet laborious localization of lung tumor targets is central [5].
In developing nations, millions are deprived of access to radiation therapy (RT), prompting a growing need for automatic oncology delineation technologies. Information and communication technologies (ICTs) hold immense promise for greater space- and time-flexible collaborative action against cancers [6]. The deployment of ICTs, encompassing social media platforms, websites, voice-over messaging, and toll-free telecommunication services, is steadily expanding in the realm of oncology services [7]. Nonetheless, the implementation of remote tumor delineation poses significant time-related challenges, emphasizing the critical need for the development of automated methods for tumor localization.
The considerable advances in deep learning have substantially redirected research efforts towards utilizing these methods for the precise localization of lung tumors, predominantly through the analysis of CT volumes [8,9,10]. However, accurately distinguishing false positives caused by pulmonary blood vessels, lung borders, and CT scan noise continues to be a significant challenge [11]. Prior studies aimed at minimizing false positives in lung nodules have largely concentrated on single-modality vision inputs [12,13,14]. These studies often overlook the invaluable textual information provided by diagnostic physicians, including radiologists and pathologists, through Electronic Health Records (EHRs). This textual data, abundant in comprehensive pathology and radiology reports, represents a crucial resource for enhancing accurate tumor delineation and reducing false positives. Although numerous LLM studies, such as Radiology-Llama2 [15], RadOnc-GPT [16], and CancerLLM [17], have addressed tumor-related textual processing, the potential for its clinical application, particularly in clinical visual tasks, remains largely underexplored [18].
Recent advancements have seamlessly integrated visual modalities with LLMs, giving rise to LVMs such as GPT-4V [19] and Claude 3, applied to visual commonsense reasoning [20], visual question answering, and multimodal dialogue systems [19,21,22]. While recent studies have shown that GPT-4V is proficient in distinguishing between various medical imaging modalities and anatomical structures, it encounters considerable difficulties when dealing with complex medical issues in detail [23] and often suffers from hallucinations when detecting small-scale objects.
Recognizing the limitations of current deep learning approaches, the emergence of powerful LVMs, and the untapped potential of textual data, our work introduces the OCC system, as shown in Figure 1. This system is engineered to integrate textual descriptions with visual data from CT scans, leveraging the sophisticated capabilities of LVMs to enhance tumor delineation accuracy. In descriptive text analysis, personalized preferences and varying levels of expertise among human evaluators can introduce biases and inconsistencies, such as differences in expression and terminology. However, the OCC system emphasizes positional information extraction to minimize false positives. Thanks to the advanced capabilities of the latest LVMs, the OCC system can still accurately interpret content and effectively match candidate nodules with the corresponding clinical descriptive texts, despite variations in terminology and phrasing. Ultimately, the OCC system enables patients to benefit from experienced domain experts who can remotely analyze CT scans along with pathology slices. By accurately delineating tumor contours based on straightforward clinical text descriptions, the system enhances the precision of radiotherapy planning. This approach not only reduces the reliance on on-site experts but also provides an innovative solution for delivering high-quality, personalized care to patients in resource-limited settings, ultimately improving treatment outcomes.

2. Related Work

Lung nodule segmentation mainly involves two steps: candidate nodule detection and false positive reduction. (i) Candidate Nodule Detection: proposed methods include the Faster Region-based Convolutional Neural Network (R-CNN) [8], 3D R-CNN [9], and Faster R-CNN architectures incorporating dual region proposal networks and a deconvolutional layer [24]. (ii) False Positive Reduction: research teams have proposed a multi-view ConvNet approach [12] and a CNN model with hand-crafted features [25]. Most recently, research has focused on enhancing 3D CNN architectures with an attentive 3D-CNN module [26], a 3D IRes2Net module [27], and a 3D cuboid attention module [28]. Notable contributions, e.g., Hooshangnejad et al. [29], have also integrated EHR information, offering a novel approach to lung nodule false positive reduction.
Prompt engineering is a method of adapting a large pre-trained model to a downstream problem with task-specific hints, and it has emerged as a crucial technique for maximizing the utility and accuracy of LLMs [30,31]. Prompts can be divided into two main categories: hard prompts and soft prompts. Hard prompts are manually crafted text prompts with discrete tokens, and soft prompts are optimizable, learnable tensors concatenated with input embeddings which can be optimized in a data-driven manner through back-propagation but lack human readability due to their non-alignment with real word embeddings [32,33]. Due to the limited amount of clinical data and the necessity for user-friendly interfaces, hard prompts are frequently employed in prompt engineering for clinical applications [30]. Recent developments in Language Vision Models (LVMs), as highlighted by studies [21,23], have garnered interest in the field of language vision prompt engineering. Despite this growing attention, the practical application of LVMs in medical issues remains relatively unexplored.

3. Methods

Our OCC system comprises two primary components, as shown in Figure 2. The first component is a candidate tumor detection model, which uses CT scans from patients to identify multiple potential tumor nodules. The second component is the false positive reduction method. Unlike conventional deep learning-based systems for false positive elimination, our approach utilizes LVMs, which take both the nodule candidates identified in the first component and the domain experts’ clinical text descriptions as inputs. By processing and understanding these descriptions, the LVMs effectively select the correct tumor nodules, thereby enhancing the accuracy of false positive removal. Ultimately, the OCC system empowers patients to achieve precise lung nodule delineation under the guidance or second opinion of remote, experienced domain experts in high-patient-volume centers, facilitating radiotherapy planning for patients.
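The two-stage flow described above can be sketched as a minimal pipeline. The function names (`detect_candidates`, `lvm_filter`) and their placeholder bodies are illustrative assumptions, not the authors' actual implementation; a real system would run Retina-UNet3D in stage one and query an LVM such as GPT-4V in stage two.

```python
# Minimal sketch of the two-stage OCC pipeline: candidate detection
# followed by LVM-based false positive reduction. All names and
# placeholder logic here are illustrative, not the real implementation.

def detect_candidates(ct_volume):
    """Stage 1: return candidate nodules from a CT volume.

    Placeholder: a real system would run Retina-UNet3D here.
    """
    return [{"id": 0, "lobe": "left upper"}, {"id": 1, "lobe": "right lower"}]

def lvm_filter(candidates, clinical_text):
    """Stage 2: keep candidates consistent with the expert's description.

    Placeholder: a real system would query an LVM such as GPT-4V with
    the slice image and the clinical text.
    """
    return [c for c in candidates if c["lobe"] in clinical_text]

def occ_pipeline(ct_volume, clinical_text):
    # The pipeline chains the two stages; only location-consistent
    # candidates survive as predicted true positives.
    return lvm_filter(detect_candidates(ct_volume), clinical_text)

result = occ_pipeline(None, "small nodule in the left upper lobe")
```

Under this toy filter, only the candidate in the lobe named by the clinical text is retained, mirroring how the system correlates candidate locations with the expert's description.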

3.1. Candidate Tumor Detection Model

3.1.1. Architecture

We deployed Retina-UNet3D [29] as our candidate tumor detection model. Retina-UNet3D combines the principles of a feature pyramid network (FPN) [34] and UNet [35], enabling the model to harness the advantages of an FPN within a 3D detection framework while building upon the proven effectiveness of UNet-3D for segmentation tasks. Moreover, Retina-UNet3D is included as a reference tumor detection model in the latest MONAI Model Zoo. The detailed architecture of the Retina-UNet3D model is presented in Supplementary Figure S1.

3.1.2. Loss Functions

We used a dual loss formed by adding the categorical cross-entropy loss Lc and the dice loss Ld. Combining them into a unified loss function harmonizes pixel-level precision with image-level fidelity: it strengthens the model's capacity to recognize intricate object details while also promoting a thorough assessment of overall segmentation quality.
$$L_c(p, y) = -\sum_i y_i \log(p_i)$$
$$L_d(y, p) = 1 - \frac{2\sum_i y_i p_i}{\sum_i y_i^2 + \sum_i p_i^2}$$
$$L = L_c + L_d$$
Lc (p,y) denotes the categorical cross-entropy loss, with pi and yi representing the predicted probability and ground-truth label for class i, respectively. Ld (y,p) is the dice loss, where pi and yi represent the predicted and ground-truth values for pixel i.
We adopted smooth L1 loss [36] and focal loss [37] for box regression heads and box classification heads separately. In smooth L1 loss, MAE is the mean absolute error between predictions and the ground truth and δ is a smoothing hyperparameter. In focal loss, y is the binary ground-truth label, p is the predicted probability for the positive class, α is the weighting factor for class imbalance, and γ controls the focus level on hard examples.
$$\mathrm{smooth}_{L1} = \begin{cases} 0.5\,(\mathrm{MAE})^2/\delta & \text{if } \mathrm{MAE} < \delta \\ \mathrm{MAE} - 0.5\,\delta & \text{otherwise} \end{cases}$$
$$FL(p, y) = \begin{cases} -\alpha (1-p)^\gamma \log(p) & \text{if } y = 1 \\ -(1-\alpha)\, p^\gamma \log(1-p) & \text{if } y = 0 \end{cases}$$
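The four losses above can be written compactly in NumPy. This is an illustrative sketch of the standard formulations (cross-entropy, squared-denominator dice loss, smooth L1 with smoothing parameter δ, and focal loss), not the authors' training code; the `eps` stabilizer and default α, γ, δ values are our assumptions.

```python
import numpy as np

# Illustrative NumPy sketch of the training losses described above;
# hyperparameter defaults (eps, delta, alpha, gamma) are assumptions.

def cross_entropy(p, y, eps=1e-9):
    # L_c = -sum_i y_i * log(p_i); eps guards against log(0)
    return -np.sum(y * np.log(p + eps))

def dice_loss(p, y):
    # L_d = 1 - 2*sum(y_i p_i) / (sum(y_i^2) + sum(p_i^2))
    return 1.0 - 2.0 * np.sum(y * p) / (np.sum(y**2) + np.sum(p**2))

def smooth_l1(pred, target, delta=1.0):
    # Quadratic for small errors, linear for large ones
    mae = np.abs(pred - target)
    return np.where(mae < delta, 0.5 * mae**2 / delta, mae - 0.5 * delta)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # y is the binary label; the (1-p)^gamma factor down-weights
    # easy examples so training focuses on hard ones
    return np.where(y == 1,
                    -alpha * (1 - p)**gamma * np.log(p),
                    -(1 - alpha) * p**gamma * np.log(1 - p))
```

A perfect dice prediction yields a loss of zero, and smooth L1 transitions from the quadratic to the linear regime at an error of δ, which keeps gradients bounded for outlier boxes.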

3.2. False Positive Reduction Model

We integrated GPT-4V as the cornerstone of our model to process visual and linguistic data concurrently. After obtaining segmentation outputs from our candidate tumor detection model, we select a random slice that includes representations of the nodule and lobe masks. This image, along with the patient's EHR, which provides clinical descriptions, is fed into GPT-4V. GPT-4V then autonomously identifies the locations of potential nodules, correlating them with definitive clinical diagnoses extracted from the EHR. In its final phase, GPT-4V generates a concise, clear textual report, helping to effectively minimize false positives.
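One way to package a slice image together with the EHR text for a chat-style vision API is sketched below. The request is only built, not sent; the model name, prompt wording, and helper name are illustrative assumptions rather than the authors' exact configuration, though the base64 `image_url` message shape follows the OpenAI vision API convention.

```python
import base64

# Sketch of packaging a CT slice (PNG bytes) plus EHR text for an LVM
# such as GPT-4V via a chat-style API. The request dict is built but
# not sent; model name and prompt wording are illustrative.

def build_lvm_request(png_bytes, ehr_text):
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "model": "gpt-4-vision-preview",
        "temperature": 0,  # deterministic outputs for reproducibility
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Clinical description: {ehr_text}\n"
                         "Which highlighted regions match this description?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }

request = build_lvm_request(b"\x89PNG...", "small nodule in the left upper lobe")
```

Setting the temperature to zero here matches the reproducibility choice described in Section 3.3.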

3.2.1. Experiment Design

In our research, we combined image and text prompts with LVMs, aligning their outputs with oncologist assessments. We identified a highly effective prompt engineering approach and delved into its potential, yielding six key insights for enhancing LVM use in medical contexts, supported by an ablation study conducted from November 2023 to March 2024. Additionally, we implemented a UNet-3D-based false positive reduction network as a comparison for our LVM-based model.

3.2.2. Medical Language Vision Prompt Methods

Single Vision Input: Employing a single image to present all spatial information relevant to the nodule and lung lobe masks outperformed the use of multiple images, thereby simplifying the visual input for the model.
Leave Time to Think: We gave the model sufficient time to process information, avoiding word limits that could truncate its reasoning. This approach led to richer, more in-depth, and more accurate responses.
Conceal Medical Intent: To enhance the accuracy of GPT-4V's outcomes, we rephrased medical prompts in generalized language, which allowed us to circumvent the AI's default restrictive responses.
A Series of Guiding Questions: We broke down complex medical queries into simpler questions, which guided the AI through a logical reasoning process and produced more consistent and correct answers.
Vision Instructions: By embedding color references in the images, we significantly improved GPT-4V's color recognition, aiding its performance in identifying objects of various colors.
Highlighting Areas of Interest: To improve the AI's ability to detect small nodal areas, we cropped out extraneous backgrounds and adjusted the image contrast, making crucial details more discernible.
An example of our medical language vision prompt methods is shown in Figure 3.
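Several of these strategies can be combined when assembling the text half of the prompt. The sketch below shows one way to chain guiding questions, conceal medical intent with generalized wording ("regions", "outlined area" instead of nodules and lobes), and leave room to reason; the question wording is our own, not the authors' exact prompt.

```python
# Illustrative assembly of a guiding-question prompt that conceals
# medical intent and invites step-by-step reasoning. The wording is
# an assumption, not the exact prompt used in the study.

GUIDING_QUESTIONS = [
    "What colored regions do you see in the image? Use the color legend.",
    "Describe where each highlighted region sits within the outlined area.",
    "Which regions match this description: {description}?",
    "Explain your reasoning step by step before giving a final answer.",
]

def build_prompt(description):
    # Generalized phrasing ("regions", "outlined area") stands in for
    # nodules and lung lobes to avoid restrictive medical responses.
    return "\n".join(q.format(description=description) for q in GUIDING_QUESTIONS)

prompt = build_prompt("a small region in the upper left part")
```

Decomposing one complex query into this ordered sequence is what steers the model through a consistent reasoning path rather than an immediate, unvalidated answer.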

3.3. Dataset and Preprocessing

In developing our candidate nodule detection model, we utilized the Lung Image Database Consortium imaging collection (LIDC-IDRI) from The Cancer Imaging Archive (TCIA) [38], renowned for its high-quality lung nodule annotations. Given the lack of transparency regarding the datasets used to train currently popular commercial LVMs, we aimed to develop and validate our false positive reduction model without bias. To this end, we utilized 31 CT datasets, comprising 10 diagnostic and 21 planning CT datasets, obtained from stereotactic body radiation therapy patients treated at Johns Hopkins Hospital. Importantly, only one diagnostic CT dataset was used for development; the others served as unseen validation data. To safeguard privacy, patient-specific details were thoroughly removed through manual annotation. Additionally, to minimize the randomness of the responses, we set the temperature parameter to zero during the experimental trials, making them fully reproducible.

3.4. Evaluation Metrics

We utilize universally recognized evaluation metrics. A true positive (TP) denotes an instance where the model correctly identifies a verified lung nodule, indicating a precise prediction. Conversely, a false positive (FP) arises when the model erroneously identifies a non-nodule as a nodule. A false negative (FN) arises when the model fails to detect an actual nodule, leading to a missed detection, while a true negative (TN) arises when the model correctly identifies an area as non-nodule, confirming an accurate negative prediction. The aggregate count of lung nodules within the dataset is denoted Nsample, and Nreject denotes the total number of requests rejected by the internal inhibition of the LVMs. We deployed the false discovery rate (FDR), reject rate, FP/scan, sensitivity (Sen), specificity (Spe), and F1-score to evaluate our model.
$$\mathrm{FDR} = \frac{FP}{TP + FP}$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
$$\mathrm{F1\text{-}score} = \frac{2 \times TP}{2 \times TP + FP + FN}$$
$$\mathrm{Reject\ Rate} = \frac{N_{reject}}{N_{sample}}$$
$$\mathrm{FP/Scan} = \frac{FP}{N_{sample}}$$
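The metric definitions above translate directly into code. This is a minimal sketch computing all six from raw counts; the example counts are arbitrary.

```python
# Compute the six evaluation metrics defined above from raw counts.

def metrics(tp, fp, tn, fn, n_reject, n_sample):
    return {
        "FDR": fp / (tp + fp),                 # false discovery rate
        "Sensitivity": tp / (tp + fn),         # recall over true nodules
        "Specificity": tn / (tn + fp),         # true negative rate
        "F1": 2 * tp / (2 * tp + fp + fn),     # harmonic balance of P and R
        "RejectRate": n_reject / n_sample,     # LVM-refused requests
        "FP/Scan": fp / n_sample,              # false positives per scan
    }

# Arbitrary example counts for illustration
m = metrics(tp=40, fp=10, tn=90, fn=10, n_reject=2, n_sample=30)
```

Note that FDR and sensitivity pull in opposite directions: thresholding more aggressively lowers FP (and hence FDR and FP/scan) but risks raising FN, which is why the text calls for balancing all six.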

4. Results

In our experiment, the training dataset comprised a single diagnostic CT scan featuring seven candidate nodules. For validation, the dataset included 9 diagnostic CT scans and 21 planning CT scans, collectively presenting 221 candidate nodules; two cases with no candidate nodules detected were excluded from the subsequent analysis. We then conducted a comprehensive evaluation of our OCC system with different false positive reduction cores, including the UNet-3D-based method and several LVM-based methods, specifically ViLT [39], Claude 3 Sonnet, and GPT-4V, all tested with the same well-designed prompt inputs. The outcomes of this analysis are presented in Table 1. The probability density function curves for FP/Scan with the various false positive reduction methods are illustrated in Supplementary Figure S2.
An optimal model ought to sustain elevated sensitivity, ensuring that true positive nodules are not erroneously classified as false negatives. Concurrently, it should endeavor to minimize the FDR and the average FP per scan, as well as the reject rate, while enhancing measures of sensitivity, specificity, and the F1 score to achieve a balanced diagnostic accuracy. Our proposed workflow, which incorporates state-of-the-art LVMs such as Claude 3 Sonnet and GPT-4V, demonstrates substantial enhancements in various metrics over traditional deep learning-based false positive reduction methods. When incorporating the ViLT model as the core component of our LVM framework, we encounter significant limitations due to the model’s inherent capabilities. Given a significant map, the model frequently hallucinates, mistakenly reclassifying true positive nodules as false positives. This error leads to an obvious decrease in performance.
Moreover, we executed an ablation study to assess the efficacy of the medical language vision prompt methods we devised. In these trials, we systematically removed individual elements from our prompt engineering strategy while keeping the remaining methods in place. Each column displays the test-set results for a given combination of medical prompt engineering strategies. The results of these tests are shown in Table 2.
The results of our ablation study shown in Table 2 indicate that the absence of any element within our prompt engineering methodology significantly affects model performance. The omission of “Single Vision Input” highlights the LVM’s limited capacity for comparative analysis across multiple image inputs. When “Time to Think” is removed, the LVM tends to bypass preliminary contemplation and leaps to conclusions without logical validation, disrupting the natural flow of reasoned thinking. The lack of “Conceal Medical Intent” often causes the LVM to internally filter out such medical inquiries, leading to an increased presence of false positive nodules that are not properly dismissed. Without “A Series of Guiding Questions,” the LVM’s logical reasoning occasionally falters. The absence of “Vision Instructions” can cause the LVM’s inherent color biases to skew its judgments. Finally, without “Highlighting Areas of Interest,” the model struggles to pinpoint the exact location within a vast background, which can impede accurate classification.

5. Case Study

We conducted a case study to assess the false positive reduction capability of different LVMs within the OCC system for nodule images. This study was designed with two goals: whether the models could correctly identify TP or FP nodules (G1), and the richness of the information inferred by the models (G2). In summary, we input various nodule images and questions into GPT-4V, Claude 3 Sonnet, and ViLT, highlighting key responses in red for correct inferences and in blue for incorrect ones.
The first case concerns a TP scenario: a small nodule in the left upper lobe. As shown in Figure 4a, GPT-4V generally gave the right answer, demonstrating its ability to identify the TP nodule (G1). However, the performance of Claude 3 Sonnet and ViLT was subpar, indicating poor performance in small nodule detection. In addition, the answers from GPT-4V and Claude 3 Sonnet are full paragraphs, while ViLT outputs only a simple sentence or even a single word (G2).
The results of the second case are shown in Figure 4b, where both GPT-4V and Claude 3 Sonnet successfully gave the correct answers for a medium FP nodule. ViLT, on the other hand, responded with the wrong answer in very few words.

6. Discussion

In our study, we introduced a novel OCC system to assist radiotherapy planning, leveraging the capabilities of LVMs. This method synergistically integrates visual data from clinical CT scans with textual information from experienced domain experts, culminating in a highly effective reduction in false positives and enhancing the precision of radiotherapy treatment. Furthermore, our approach not only capitalizes on the advanced features of LVMs but also holds the potential for widespread adoption across various clinical settings.
Our research is subject to certain limitations. Firstly, the false positive reduction and language vision prompt engineering strategies we introduced were tested exclusively on our institution's dataset, which was provided by leading radiation oncologists and included precise tumor delineations along with comprehensive clinical descriptive texts. While the results were promising, further evaluation on a more extensive and varied dataset is imperative. For larger datasets, obtaining evaluations from top radiation oncologists is challenging, and comprehensive, reliable human evaluation often requires many evaluators, making the process time-consuming, labor-intensive, and costly. In this context, LLM-as-a-Judge [40] could provide a way to obtain accurate and detailed clinical descriptive text datasets. Secondly, although the FDR was reduced in our experiment, it remains relatively high at 0.511, and the 3D DICE score for TP nodules across the experiments is 0.4164, indicating that the performance is not yet suitable for real-world clinical applications. This is because our candidate tumor detection model prioritizes sensitivity to ensure that no nodules are missed, which meets clinical needs but results in a significant number of low-quality nodule predictions. Thirdly, our false positive reduction methodology mitigates false positives predominantly by correlating the candidate nodule's location with the clinical descriptive texts; for false positive nodules situated within the target lobe, our approach therefore struggles to exclude these erroneous detections. Further exploration of varied textual information, such as tumor size and stage, for nodule contouring holds significant promise.
Additionally, our medical vision language prompt methods primarily aim to enhance visual focus to reduce hallucinations when detecting small-scale objects. Beyond small-scale object hallucinations, however, there are other types, such as positional, verbosity, and self-enhancement biases [40]. The system has so far been validated only retrospectively. Future work should focus on multi-center studies, tighter integration of clinical context, and the development of standardized benchmarks for hallucination assessment in medical imaging applications.

7. Conclusions

We introduce the OCC system, which integrates LVMs with clinical CT imaging and textual data, to reduce the need for on-site experts while ensuring precise and efficient radiation therapy. The system enhances the reliability of clinical applications by implementing novel medical language vision prompt techniques that effectively reduce hallucinatory outputs from LVMs. Moreover, we provide a scalable and intuitive framework that significantly improves diagnostic accuracy and the quality of oncology care in resource-limited environments. A detailed comparative analysis of LVMs within the OCC highlights the LVMs’ transformative potential for medical language vision problems.
This study may improve clinical practice by increasing the accessibility and reliability of automated tumor delineation, especially in resource-limited settings. In addition, it offers important directions for future academic research in multimodal data integration and safer, more effective medical AI systems.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/bioengineering12080835/s1: Figure S1: Structure of Retina-Unet3D; Figure S2: PDF curves comparing different methodologies for false positive reduction.

Author Contributions

Conceptualization, Y.L., H.H. and K.D.; methodology, Y.L. and H.H.; software, Y.L., X.F. and G.H.; validation, Y.L., H.H. and K.D.; formal analysis, Y.L. and K.D.; investigation, Y.L., H.H. and K.D.; resources, K.D.; data curation, Y.L., H.H. and K.D.; writing—original draft preparation, Y.L. and K.D; writing—review and editing, all; visualization, Y.L. and X.C.; supervision, K.D.; project administration, K.D.; funding acquisition, W.N. and K.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Cancer Institute of the National Institutes of Health, grant numbers R25CA288263 and R37CA229417.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of Johns Hopkins University (Approval Code: IRB-1, Approval Date: 18 March 2020).

Informed Consent Statement

Patient consent was waived because this was a retrospective study.

Data Availability Statement

Because the study includes patient data, the datasets generated and/or analyzed during the current study are not publicly available, owing to privacy and ethical restrictions.

Conflicts of Interest

Quan Chen and Xue Feng are co-founders of Carina Medical LLC, and Gaofeng Huang is employed by Carina Medical LLC. The other authors declare no conflicts of interest.

References

  1. Vyas, A.; Kumar, K.; Sharma, A.; Verma, D.; Bhatia, D.; Wahi, N.; Yadav, A.K. Advancing the frontier of artificial intelligence on emerging technologies to redefine cancer diagnosis and care. Comput. Biol. Med. 2025, 191, 110178. [Google Scholar] [CrossRef] [PubMed]
  2. Ferlay, J.; Soerjomataram, I.; Dikshit, R.; Eser, S.; Mathers, C.; Rebelo, M.; Parkin, D.M.; Forman, D.; Bray, F. Cancer incidence and mortality worldwide: Sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 2015, 136, E359–E386. [Google Scholar] [CrossRef] [PubMed]
  3. Leiter, A.; Veluswamy, R.R.; Wisnivesky, J.P. The global burden of lung cancer: Current status and future trends. Nat. Rev. Clin. Oncol. 2023, 20, 624–639. [Google Scholar] [CrossRef] [PubMed]
  4. Bodor, J.N.; Boumber, Y.; Borghaei, H. Biomarkers for immune checkpoint inhibition in non–small cell lung cancer (NSCLC). Cancer 2020, 126, 260–270. [Google Scholar] [CrossRef]
  5. Zhang, L.; Liu, Z.; Zhang, L.; Wu, Z.; Yu, X.; Holmes, J.; Feng, H.; Dai, H.; Li, X.; Li, Q.; et al. Generalizable and promptable artificial intelligence model to augment clinical delineation in radiation oncology. Med. Phys. 2024, 51, 2187–2199. [Google Scholar] [CrossRef]
  6. Ngwa, W.; Ngoma, T.; Zietman, A.; Mayr, N.; Elzawawy, A.; Winningham, T.A.; Balogun, O.; Enwerem-Bromson, N.; Ntizimira, C.; Olopade, O.I.; et al. Closing the cancer divide through Ubuntu: Information and communication technology-powered models for global radiation oncology. Int. J. Radiat. Oncol. 2016, 94, 440–449. [Google Scholar] [CrossRef]
  7. Addai, B.W.; Ngwa, W. COVID-19 and cancer in Africa. Science 2021, 371, 25–27. [Google Scholar] [CrossRef]
  8. Ding, J.; Li, A.; Hu, Z.; Wang, L. Accurate pulmonary nodule detection in computed tomography images using deep convolutional neural networks. In Medical Image Computing and Computer Assisted Intervention, Proceedings of the MICCAI 2017: 20th International Conference, Quebec, QC, Canada, 11–13 September 2017; Proceedings, Part III 20; Springer: Berlin/Heidelberg, Germany, 2017; pp. 559–567. [Google Scholar]
  9. Tang, H.; Kim, D.R.; Xie, X. Automated pulmonary nodule detection using 3D deep convolutional neural networks. In Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA, 4–7 April 2018; pp. 523–526. [Google Scholar]
  10. Zhu, W.; Liu, C.; Fan, W.; Xie, X. Deeplung: Deep 3d dual path nets for automated pulmonary nodule detection and classification. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 673–681. [Google Scholar]
  11. Yu, X.; He, L.; Wang, Y.; Dong, Y.; Song, Y.; Yuan, Z.; Yan, Z.; Wang, W. A deep learning approach for automatic tumor delineation in stereotactic radiotherapy for nonsmall cell lung cancer using diagnostic PET-CT and planning CT. Front. Oncol. 2023, 13, 1235461. [Google Scholar] [CrossRef]
  12. Setio, A.A.A.; Ciompi, F.; Litjens, G.; Gerke, P.; Jacobs, C.; Van Riel, S.J.; Wille, M.M.W.; Naqibullah, M.; Sánchez, C.I.; Van Ginneken, B. Pulmonary nodule detection in CT images: False positive reduction using multi-view convolutional networks. IEEE Trans. Med. Imaging 2016, 35, 1160–1169. [Google Scholar] [CrossRef]
  13. Dou, Q.; Chen, H.; Yu, L.; Qin, J.; Heng, P.-A. Multilevel contextual 3-D CNNs for false positive reduction in pulmonary nodule detection. IEEE Trans. Biomed. Eng. 2016, 64, 1558–1567. [Google Scholar] [CrossRef]
  14. Xie, Y.; Xia, Y.; Zhang, J.; Song, Y.; Feng, D.; Fulham, M.; Cai, W. Knowledge-based collaborative deep learning for benign-malignant lung nodule classification on chest CT. IEEE Trans. Med. Imaging 2018, 38, 991–1004. [Google Scholar] [CrossRef]
  15. Liu, Z.; Li, Y.; Shu, P.; Zhong, A.; Yang, L.; Ju, C.; Wu, Z.; Ma, C.; Luo, J.; Li, X.; et al. Radiology-llama2: Best-in-class large language model for radiology. arXiv 2023, arXiv:2309.06419. [Google Scholar]
  16. Liu, Z.; Wang, P.; Li, Y.; Holmes, J.; Shu, P.; Zhang, L.; Liu, C.; Liu, N.; Zhu, D.; Liu, W. Radonc-gpt: A large language model for radiation oncology. arXiv 2023, arXiv:2309.10160. [Google Scholar]
  17. Li, M.; Huang, J.; Yeung, J.; Blaes, A.; Johnson, S.; Liu, H.; Xu, H.; Zhang, R. Cancerllm: A large language model in cancer domain. arXiv 2024, arXiv:2406.10459. [Google Scholar]
  18. Luo, Y.; Hooshangnejad, H.; Ngwa, W.; Ding, K. Opportunities and challenges in lung cancer care in the era of large language models and vision language models. Transl. Lung Cancer Res. 2025, 14, 1830. [Google Scholar] [CrossRef] [PubMed]
  19. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  20. Huang, S.; Dong, L.; Wang, W.; Hao, Y.; Singhal, S.; Ma, S.; Lv, T.; Cui, L.; Mohammed, O.K.; Patra, B.; et al. Language is not all you need: Aligning perception with language models. arXiv 2023, arXiv:2302.14045. [Google Scholar]
  21. Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
  22. Wu, C.; Yin, S.; Qi, W.; Wang, X.; Tang, Z.; Duan, N. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv 2023, arXiv:2303.04671. [Google Scholar]
  23. Wu, C.; Lei, J.; Zheng, Q.; Zhao, W.; Lin, W.; Zhang, X.; Zhou, X.; Zhao, Z.; Zhang, Y.; Wang, Y.; et al. Can GPT-4V(ision) serve medical applications? Case studies on GPT-4V for multimodal medical diagnosis. arXiv 2023, arXiv:2310.09909. [Google Scholar]
  24. Xie, H.; Yang, D.; Sun, N.; Chen, Z.; Zhang, Y. Automated pulmonary nodule detection in CT images using deep convolutional neural networks. Pattern Recognit. 2019, 85, 109–119. [Google Scholar] [CrossRef]
  25. Teramoto, A.; Fujita, H.; Yamamuro, O.; Tamaki, T. Automated detection of pulmonary nodules in PET/CT images: Ensemble false-positive reduction using a convolutional neural network technique. Med. Phys. 2016, 43, 2821–2827. [Google Scholar] [CrossRef] [PubMed]
  26. Zhao, D.; Liu, Y.; Yin, H.; Wang, Z. An attentive and adaptive 3D CNN for automatic pulmonary nodule detection in CT image. Expert Syst. Appl. 2023, 211, 118672. [Google Scholar] [CrossRef]
  27. Liu, B.; Song, H.; Li, Q.; Lin, Y.; Weng, X.; Su, Z.; Yang, J. 3D ARCNN: An Asymmetric Residual CNN for False Positive Reduction in Pulmonary Nodule. IEEE Trans. NanoBioscience 2023, 23, 18–25. [Google Scholar] [CrossRef]
  28. Wang, S.; Liu, Y.; Shi, C. Controlling False-Positives in Automatic Lung Nodule Detection by Adding 3D Cuboid Attention to a Convolutional Neural Network. Biomed. Signal Process. Control 2023, 85, 104946. [Google Scholar] [CrossRef]
  29. Hooshangnejad, H.; Feng, X.; Huang, G.; Zhang, R.; Chen, Q.; Ding, K. EXACT-Net: EHR-guided lung tumor auto-segmentation for non-small cell lung cancer radiotherapy. arXiv 2024, arXiv:2402.14099. [Google Scholar]
  30. Zaghir, J.; Naguib, M.; Bjelogrlic, M.; Névéol, A.; Tannier, X.; Lovis, C. Prompt engineering paradigms for medical applications: Scoping review. J. Med. Internet Res. 2024, 26, e60501. [Google Scholar] [CrossRef]
  31. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927. [Google Scholar]
  32. Gu, J.; Han, Z.; Chen, S.; Beirami, A.; He, B.; Zhang, G.; Liao, R.; Qin, Y.; Tresp, V.; Torr, P. A systematic survey of prompt engineering on vision-language foundation models. arXiv 2023, arXiv:2307.12980. [Google Scholar]
  33. Bai, S.; Zhang, Y.; Zhou, W.; Luan, Z.; Chen, B. Soft prompt generation for domain generalization. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature Switzerland: Cham, Switzerland, 2024; pp. 434–450. [Google Scholar]
  34. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  35. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, Proceedings of MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  36. Armato, S.G., III; McLennan, G.; Bidaut, L.; McNitt-Gray, M.F.; Meyer, C.R.; Reeves, A.P.; Zhao, B.; Aberle, D.R.; Henschke, C.I.; Hoffman, E.A.; et al. The lung image database consortium (LIDC) and image database resource initiative (IDRI): A completed reference database of lung nodules on CT scans. Med. Phys. 2011, 38, 915–931. [Google Scholar] [CrossRef]
  37. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Volume 1, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  38. Tang, H.; Zhang, C.; Xie, X. Automatic pulmonary lobe segmentation using deep learning. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 8–11 April 2019; pp. 1225–1228. [Google Scholar]
  39. Kim, W.; Son, B.; Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 5583–5594. [Google Scholar]
  40. Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.P.; et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623. [Google Scholar]
Figure 1. OCC workflow: Patients first undergo diagnostic work-up, including CT scans and pathology biopsies, which are uploaded to the OCC system. Remote domain experts review these scans and compile a clinical description of the findings. This narrative, together with the original CT images, is subsequently uploaded to the OCC system. Using these combined data, the system precisely contours true positive nodules, facilitating targeted and effective radiotherapy planning.
Figure 2. Components of the OCC model: the candidate tumor detection component is fully trainable, while the LVM, which serves as the false positive reduction model, is frozen.
Figure 3. The figure illustrates several medical vision language prompt methods. To simplify analysis, the candidate masks were combined into a single image, reducing multiple vision inputs to a single vision input. To conceal medical intent, the CT chest wall background was removed. A color reference was added in the top-left corner to support vision instructions, and contrast was enhanced while the marginal background was removed to highlight areas of interest.
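For concreteness, the preprocessing steps named in the Figure 3 caption can be sketched as below. This is a minimal illustration of the described operations, not the paper's implementation; the helper name `prepare_prompt_image`, the window bounds, and the corner-marker convention are our assumptions.

```python
import numpy as np

def prepare_prompt_image(ct_slice, masks, window=(-1000, 400)):
    """Illustrative sketch of the Figure 3 prompt-image preparation.

    ct_slice : 2D array of Hounsfield units
    masks    : list of 2D boolean candidate masks
    window   : (low, high) intensity window for contrast enhancement (assumed)
    """
    # Combine per-candidate masks into one image (single vision input)
    combined = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        combined |= m

    # Enhance contrast: clip to a window and rescale to [0, 255]
    lo, hi = window
    img = np.clip(ct_slice, lo, hi)
    img = ((img - lo) / (hi - lo) * 255).astype(np.uint8)

    # Conceal medical intent: zero out everything outside the candidate regions
    img = np.where(combined, img, np.uint8(0))

    # Remove marginal background: crop to the bounding box of the candidates
    ys, xs = np.nonzero(combined)
    img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # Add a small color-reference patch in the top-left corner (assumed convention)
    rgb = np.stack([img] * 3, axis=-1)
    rgb[:4, :4] = (255, 0, 0)
    return rgb
```

The same pipeline could be applied per candidate to produce multiple vision inputs; combining the masks first is what yields the single-input variant discussed in the ablation.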
Figure 4. Cases of false positive reduction in the OCC system with different LVMs. (a) The GPT-4V model accurately identifies the small TP nodule, whereas Claude 3 Sonnet provides no response and ViLT erroneously classifies it as an FP. (b) ViLT mistakenly labels the nodule as a TP, while both GPT-4V and Claude 3 Sonnet correctly identify and eliminate the FP nodule.
Table 1. Comparison of different FP reduction methods (FDR: false discovery rate; FP: false positive; Sen: sensitivity; Spe: specificity).
| Method | FDR | Average FP/Scan | Sen | Spe | F1-Score |
|---|---|---|---|---|---|
| Candidates | 0.787 | 6.214 | - | - | - |
| Candidates + Unet-3D [35] | 0.696 | 3.357 | 0.872 | 0.460 | 0.366 |
| Candidates + ViLT [39] | 0.773 | 3.036 | 0.532 | 0.511 | 0.318 |
| Candidates + Claude 3 Sonnet | 0.556 | 1.964 | 0.936 | 0.684 | 0.603 |
| Candidates + GPT-4V (Ours) | 0.511 | 1.714 | 0.979 | 0.724 | 0.652 |
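The table's metrics are related by the standard detection definitions: precision equals 1 minus FDR, and the F1-score is the harmonic mean of precision and sensitivity (recall). As a quick sanity check on the reported numbers, a minimal sketch (the helper name is ours, not from the paper):

```python
def f1_from_fdr(fdr, sensitivity):
    """F1-score from the false discovery rate and sensitivity (recall).

    precision = 1 - FDR; F1 is the harmonic mean of precision and recall.
    """
    precision = 1.0 - fdr
    return 2 * precision * sensitivity / (precision + sensitivity)

# Reproduces the GPT-4V row of Table 1: FDR 0.511, Sen 0.979 -> F1 = 0.652
print(round(f1_from_fdr(0.511, 0.979), 3))  # 0.652
```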
Table 2. Ablation study results on impacts of medical language vision prompt methods.
| Metric | Single Vision Input | Leave Time to Think | Conceal Medical Intent | A Series of Guiding Questions | Vision Instructions | Highlighting Areas of Interest | All (Ours) |
|---|---|---|---|---|---|---|---|
| FDR | 0.615 | 0.546 | 0.667 | 0.715 | 0.639 | 0.762 | 0.511 |
| Average FP/Scan | 2.286 | 1.893 | 2.607 | 4.036 | 1.393 | 1.607 | 1.714 |
| Sen | 0.833 | 0.917 | 0.766 | 0.954 | 0.468 | 0.298 | 0.979 |
| Spe | 0.630 | 0.695 | 0.580 | 0.362 | 0.776 | 0.741 | 0.724 |
| F1-score | 0.527 | 0.607 | 0.464 | 0.439 | 0.407 | 0.265 | 0.652 |
| Reject Rate | - | - | - | 0.575 | - | - | - |
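Table 2's Reject Rate counts candidates for which the LVM returns no usable verdict, as in the no-response case shown in Figure 4. A minimal sketch of how free-text replies might be mapped to keep/remove/reject decisions; the keyword rules and function names below are illustrative assumptions, not the paper's actual parser:

```python
def parse_verdict(reply):
    """Map a free-text LVM reply to 'tp', 'fp', or 'reject' (illustrative rules)."""
    if reply is None or not reply.strip():
        return "reject"  # no response counts toward the reject rate
    text = reply.lower()
    # Check negative phrasing first: "not a nodule" also contains "nodule"
    if "not a nodule" in text or "false positive" in text:
        return "fp"
    if "nodule" in text or "true positive" in text:
        return "tp"
    return "reject"  # off-topic or evasive answer

def reject_rate(replies):
    """Fraction of candidates with no usable verdict."""
    verdicts = [parse_verdict(r) for r in replies]
    return sum(v == "reject" for v in verdicts) / len(verdicts)
```

For example, `reject_rate(["a small nodule is visible", None, "false positive", "unsure"])` evaluates to 0.5, since two of the four replies yield no verdict.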

Share and Cite

MDPI and ACS Style

Luo, Y.; Hooshangnejad, H.; Feng, X.; Huang, G.; Chen, X.; Zhang, R.; Chen, Q.; Ngwa, W.; Ding, K. A Language Vision Model Approach for Automated Tumor Contouring in Radiation Oncology. Bioengineering 2025, 12, 835. https://doi.org/10.3390/bioengineering12080835

