Article

Application of Vision-Language Models in the Automatic Recognition of Bone Tumors on Radiographs: A Retrospective Study

1 Department of Dermatology and Allergy, School of Medicine, Technical University of Munich, 80802 Munich, Germany
2 Department of Orthopedics, Trauma and Plastic Surgery, University of Leipzig, Liebigstrasse 20, 04103 Leipzig, Germany
3 Department of Orthopedics and Trauma Surgery, University Hospital of Bonn, 53127 Bonn, Germany
4 Clinic of Plastic, Hand and Aesthetic Surgery, Burn Center, BG Clinic Bergmannstrost, 06112 Halle (Saale), Germany
* Author to whom correspondence should be addressed.
AI 2025, 6(12), 327; https://doi.org/10.3390/ai6120327
Submission received: 1 November 2025 / Revised: 6 December 2025 / Accepted: 8 December 2025 / Published: 16 December 2025
(This article belongs to the Section Medical & Healthcare AI)

Abstract

Background: Vision-language models show promise in medical image interpretation, but their performance in musculoskeletal tumor diagnostics remains underexplored. Objective: To evaluate the diagnostic accuracy of six vision-language models (VLMs) on orthopedic radiographs for tumor detection, benign/malignant classification, tumor type identification, anatomical localization, and X-ray view interpretation, and to assess the impact of demographic context and self-reported certainty. Methods: We retrospectively evaluated six VLMs on 3746 expert-annotated orthopedic radiographs from the Bone Tumor X-ray Radiograph dataset. Each image was analyzed by all models with and without patient age and sex using a standardized prompting scheme across five predefined tasks. Results: Over 48,000 predictions were analyzed. Tumor detection accuracy ranged from 59.9% to 73.5%, with the Gemini Ensemble achieving the highest F1 score (0.723) and recall (0.822). Benign/malignant classification reached up to 85.2% accuracy; tumor type identification 24.6–55.7%; body region identification 97.4%; and view classification 82.8%. Demographic data improved tumor detection accuracy (+1.8%, p < 0.001) but had no significant effect on other tasks. Certainty scores were weakly correlated with correctness, with Gemini Pro highest (r = 0.089). Conclusion: VLMs show strong potential for basic musculoskeletal radiograph interpretation without task-specific training but remain less accurate than specialized deep learning models for complex classification. Limited calibration, interpretability, and contextual reasoning must be addressed before clinical use. This is the first systematic assessment of image-based diagnosis and self-assessment in LLMs using a real-world radiology dataset.


1. Introduction

The past few years have witnessed a significant surge in the application of large language models (LLMs), followed by vision-language models (VLMs), in the field of medical research [1,2,3,4]. VLMs, a category of artificial intelligence (AI) utilizing deep learning, are designed to interpret and reason about visual and textual data simultaneously. The most widely known model to date is ChatGPT, developed by OpenAI [5]. Since September 2023, ChatGPT has been augmented with vision capabilities, allowing it to analyze and interpret images alongside textual input [6]. Early studies indicate that these vision capabilities still leave room for improvement compared with text-only performance [7,8].
The intended use and clinical role for an AI approach in this context is to act as an assistive tool for radiologists and orthopedic surgeons. It could help streamline radiology workflow by pre-screening radiographs, flagging potential abnormalities, and providing preliminary interpretations, especially in settings with limited access to subspecialty expertise. Approximately 3% of all pediatric and adolescent cancer cases are attributable to malignant bone tumors [9]. Among primary bone tumors, osteosarcoma, Ewing sarcoma, and chondrosarcoma are the most prevalent [10,11]. While malignant tumors receive considerable attention, benign bone tumors are also of clinical significance, with osteochondromas being the most common, followed by giant cell tumors, osteoblastomas, and osteoid osteomas [12,13]. The rarity of these lesions poses a diagnostic challenge, as many radiologists may lack the experience needed to detect and evaluate them accurately on conventional radiographs [14]. Deep learning (DL), a branch of artificial intelligence (AI) that recognizes complex patterns in large datasets, has shown great promise in detecting such lesions [15,16,17]. More recently, there has been a rapid expansion of vision–language and foundation models in medical imaging, which aim to jointly learn from images and text and generalize across multiple radiologic tasks [18,19,20]. In parallel, robustness-oriented methods have been proposed that focus on enhancing feature representations under adverse imaging conditions. For example, Li et al. introduced IHENet, an illumination-invariant hierarchical feature enhancement network for low-light object detection that improves performance in challenging scenes. Although designed for natural images, IHENet exemplifies hierarchical feature enhancement and illumination-invariant representation learning that could inspire more robust bone tumor analysis on X-ray imaging [21].
The primary aim of this retrospective study was to evaluate the diagnostic capabilities of VLMs in detecting and classifying bone tumors on conventional radiographs, using visual information alone. Our primary hypothesis was that VLMs could achieve clinically relevant accuracy in tumor detection and classification. Model performance was assessed across five clinically relevant tasks: tumor detection, benign/malignant classification, tumor type identification, anatomical region recognition, and X-ray view classification. In addition to evaluating diagnostic accuracy, the study investigated whether self-reported certainty scores could serve as a meaningful indicator of prediction correctness. Furthermore, we explored the effect of adding basic demographic information (patient age and sex) on model performance. We hypothesized that the inclusion of demographic data would improve diagnostic accuracy, particularly for age-dependent tumor types. All models were evaluated using a standardized zero-shot prompt with a temperature of 0 to ensure reproducibility; this approach may disadvantage models that perform better with Chain-of-Thought (CoT) prompting or iterative interaction. The dataset also reflects class imbalances inherent to the source collection, which may skew performance metrics; for instance, frequent tumor types such as osteochondroma are overrepresented compared to rare lesions. Finally, we tested an ensemble strategy to determine whether combining multiple models could enhance diagnostic reliability. This comprehensive evaluation provides insights into the current potential and limitations of VLMs in musculoskeletal radiologic diagnostics.

2. Materials and Methods

2.1. Study Design

This was a retrospective study designed to evaluate the performance of existing AI models on a publicly available dataset. The goal was to conduct a feasibility and exploratory analysis of several commercially available VLMs for bone tumor diagnosis on radiographs.

2.2. Data

The data source for this study was the Bone Tumor X-ray Radiograph (BTXRD) dataset, a publicly available collection of anonymized radiographs with expert annotations, accessible at https://www.nature.com/articles/s41597-024-04311-y (accessed on 27 May 2025). The images were collected as part of routine clinical care, though specific dates and locations of initial data collection are not detailed in the source dataset.

2.3. Eligibility Criteria

All images from the BTXRD dataset were initially considered. The eligibility criteria for inclusion in our final study cohort were the presence of a single, unambiguous ground-truth label for each of the following categories: X-ray view, anatomical body region, tumor presence, tumor classification (benign/malignant), and specific tumor type.

2.4. Data Pre-Processing and Selection

Prior to model evaluation, the dataset underwent a rigorous filtering process to ensure data quality and remove ambiguity; this process effectively selected a subset of the source data. Only cases with a single, unambiguous ground-truth label for each of the following categories were retained for analysis:
  • X-ray view (e.g., frontal, lateral, or oblique)
  • Anatomical body region (upper limb, lower limb, or pelvis)
  • Tumor presence and, if a tumor was present: (a) tumor classification (benign or malignant), (b) tumor type, and (c) tumor location
This filtering step resulted in a final cohort where each case had a single, definitive ground-truth label for all evaluated characteristics, ensuring a clear and objective basis for performance assessment. Data elements were defined based on the labels provided in the original BTXRD dataset. The dataset authors state that the images were de-identified prior to public release; no further de-identification methods were applied by us. No missing data were imputed; cases with missing or ambiguous labels were excluded during the filtering step.
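To illustrate how such a selection could be implemented, a minimal pandas sketch is shown below; the file name and column names (view_label, region_label, tumor_present, tumor_class, tumor_type, tumor_location) are hypothetical placeholders rather than the actual BTXRD field names.

import pandas as pd

# Hypothetical annotation table; file and column names are illustrative only.
labels = pd.read_csv("btxrd_annotations.csv")

required = ["view_label", "region_label", "tumor_present"]
tumor_fields = ["tumor_class", "tumor_type", "tumor_location"]

# Keep only cases with a non-missing label for every required field.
mask = labels[required].notna().all(axis=1)
# For tumor-positive cases, additionally require unambiguous tumor labels.
is_tumor = labels["tumor_present"] == 1
mask &= ~is_tumor | labels[tumor_fields].notna().all(axis=1)

cohort = labels[mask]  # final single-label cohort (n = 3746 in this study)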

2.5. Ground Truth

The ground truth reference standard was defined by the expert-annotated labels provided within the BTXRD dataset. The rationale for choosing this reference standard is that it represents a curated, publicly available benchmark that has undergone peer review. The original dataset paper specifies that annotations were performed by clinical experts. Details on the qualifications and specific preparation of the annotators or the annotation tools used are provided in the source publication for the BTXRD dataset. The original dataset creators did not provide measurements of inter- and intrarater variability or methods to resolve discrepancies, which is a limitation of the source data.

2.6. Data Partitions

This study did not involve training a new model; therefore, data was not partitioned into traditional training, validation, and testing sets. The entire filtered dataset of 3746 images was used as a single test set to evaluate the performance of the pre-trained, commercially available VLMs. The intended sample size was the maximum number of high-quality, unambiguously labeled images available from the source dataset after filtering.

2.7. Models

Six commercially available vision-language models were evaluated: Google’s Gemini 2.5 Pro Preview (5 June 2025) and Gemini 2.5 Flash Preview (20 May 2025), Anthropic’s Claude Sonnet 4 (14 May 2025), and OpenAI’s GPT-4.1 (14 April 2025), GPT-4o Mini, and GPT-o3 (16 April 2025). All of these models take an image and a text prompt as input and generate structured JSON text as output. The internal architectures (intermediate layers and connections) of these commercial models are proprietary and not publicly detailed. All models were accessed via the HyprLab API gateway (https://hyprlab.io/ (accessed on 27 May 2025)).

2.8. Evaluation

To maximize determinism and reproducibility of outputs, the temperature for all model inferences was set to 0.0, with a maximum output of 1024 tokens. For each image, the models were provided with a consistent system prompt defining their role as an expert radiologist. The user prompt instructed the models to perform the five clinical tasks. Outputs were returned in a structured JSON (JavaScript Object Notation) format, which included all available class labels per category to enable automated analysis. The prompt was as follows:
You are an expert radiologist specializing in X-ray interpretation. Carefully examine the provided X-ray image and extract the following information in a structured JSON format:
  • Identify the anatomical view or projection (e.g., frontal, lateral, oblique).
  • Specify the body region depicted in the image.
  • Determine whether a tumor or abnormal growth is present.
    If present, classify the tumor (benign or malignant), specify the tumor type, and indicate its anatomical location.
  • Assign a certainty score to your analysis (from 0 [completely uncertain] to 1 [completely certain]).
Base your assessment strictly on the visual evidence in the image. Be concise, accurate, and avoid speculation. Return only the structured JSON output as specified by the schema.
To assess the impact of patient context, two experimental conditions were tested for each model on each image: (1) analysis of the X-ray image alone, and (2) analysis of the X-ray image supplemented with patient demographic information (age and sex). Whenever a model was overly specific, the specific term was mapped to the general category (e.g., “foot” was mapped to “lower limb”). The mapping rules are documented in the Appendix A.
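For illustration, a minimal request sketch is shown below. It assumes the HyprLab gateway exposes an OpenAI-compatible chat-completions endpoint and that images are passed as base64 data URLs; the base URL, model identifier, and helper structure are assumptions for illustration and are not specified here.

import base64
from openai import OpenAI

# Assumed OpenAI-compatible gateway endpoint; the exact URL is hypothetical.
client = OpenAI(base_url="https://api.hyprlab.io/v1", api_key="YOUR_KEY")

SYSTEM_PROMPT = "You are an expert radiologist specializing in X-ray interpretation. ..."
USER_PROMPT = "Carefully examine the provided X-ray image and extract ... structured JSON ..."

def analyze(image_path, model="gemini-2.5-pro", demographics=None):
    # demographics, if given, is appended as plain text (condition 2 of the study).
    text = USER_PROMPT if demographics is None else f"{USER_PROMPT}\nPatient: {demographics}"
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        temperature=0.0,   # deterministic decoding, as in the study
        max_tokens=1024,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": text},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    )
    # Expected content: a JSON object covering view, body region, tumor presence,
    # classification, tumor type, tumor location, and a certainty score.
    return resp.choices[0].message.content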

2.9. Creation of the Gemini Ensemble Model

To explore the potential of combining model strengths, a “Gemini Ensemble” model was created by integrating the predictions of Gemini 2.5 Pro and Gemini 2.5 Flash. The ensembling strategy was determined empirically by comparing multiple fusion approaches. For the primary task of tumor detection, three strategies were evaluated:
  • High-Recall (OR logic): The ensemble predicts a tumor if either model detects one.
  • High-Precision (AND logic): The ensemble predicts a tumor only if both models detect one.
  • Certainty-Weighted: The ensemble adopts the prediction of the model with the higher self-reported certainty score.
The strategy yielding the highest F1 score for tumor detection was selected for the final ensemble model. For classification tasks (tumor classification, tumor type, body region, and X-ray view), we compared four fusion strategies: (1) certainty-weighted voting, (2) majority voting with certainty-based tie-breaking, (3) Gemini Pro predictions only, and (4) Gemini Flash predictions only. For a two-model ensemble, majority voting reduces to agreement-based selection with a tie-breaker when models disagree; we used certainty scores as the tie-breaker to maintain consistency. The comparison revealed that certainty-weighted and majority voting yielded identical results, while single-model performance varied by task (Table A2).
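A minimal sketch of the detection fusion rules and the certainty-weighted label fusion is shown below; the prediction dictionaries and field names are illustrative rather than the study’s actual data structures.

def fuse_detection(pro, flash, strategy="or"):
    # pro/flash: dicts with 'tumor_present' (bool) and 'certainty' (float)
    if strategy == "or":    # high-recall: positive if either model is positive
        return pro["tumor_present"] or flash["tumor_present"]
    if strategy == "and":   # high-precision: positive only if both agree
        return pro["tumor_present"] and flash["tumor_present"]
    # certainty-weighted: adopt the more confident model's prediction
    return pro["tumor_present"] if pro["certainty"] >= flash["certainty"] else flash["tumor_present"]

def fuse_label(pro, flash, field):
    # Certainty-weighted voting for classification tasks (tumor class, type, region, view);
    # with two models this equals majority voting with a certainty tie-breaker.
    if pro[field] == flash[field]:
        return pro[field]
    return pro[field] if pro["certainty"] >= flash["certainty"] else flash[field]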

2.10. Statistical Analysis and Performance Metrics

Model performance was evaluated across the five clinical tasks using metrics of accuracy, precision, recall, specificity, and F1 score. The F1 score was considered the key metric for the tumor detection task due to its balanced assessment of precision and recall. To evaluate the impact of including demographic data, paired t-tests were performed to compare model accuracy on each task under the “with demographics” (providing age and sex of the patient) versus “without demographics” conditions. Statistical measures of uncertainty were calculated as 95% confidence intervals (CI) for the mean differences. A p-value of <0.05 was considered statistically significant. The correlation between model-reported certainty scores and actual accuracy was assessed using the Pearson correlation coefficient. A robustness analysis was performed by evaluating performance with and without demographic data. The unit of analysis was the individual radiographic image. Due to the anonymized nature of the dataset, clustering by patient identity was not possible. No specific methods for explainability or interpretability (e.g., saliency maps) were evaluated in this study, as the focus was on diagnostic output accuracy. All statistical analyses and data visualizations were conducted using R (version 4.3) with the tidyverse, ggplot2, and gt packages.
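The analysis itself was carried out in R; the following Python sketch with scipy.stats illustrates the same tests (paired t-test with 95% CI of the mean difference, and Pearson correlation between certainty and correctness) on hypothetical per-image arrays.

import numpy as np
from scipy import stats

# Hypothetical paired correctness indicators (1 = correct, 0 = incorrect) per image,
# for the same model and task with and without demographics.
acc_without = np.array([1, 0, 1, 1, 0, 1])
acc_with    = np.array([1, 1, 1, 1, 0, 1])

t, p = stats.ttest_rel(acc_with, acc_without)      # paired t-test on accuracy
diff = acc_with.mean() - acc_without.mean()
se = stats.sem(acc_with - acc_without)
ci = stats.t.interval(0.95, len(acc_with) - 1, loc=diff, scale=se)  # 95% CI of mean difference

# Calibration: Pearson correlation between self-reported certainty and correctness.
certainty = np.array([0.9, 0.8, 0.95, 0.7, 0.85, 0.9])
r, p_r = stats.pearsonr(certainty, acc_with)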

3. Results

3.1. Data Flow

A total of 3746 images from the BTXRD dataset met the inclusion criteria after filtering for unambiguous labels. Each of the seven models (six base VLMs and one Gemini Ensemble) analyzed every image under two conditions (with and without demographics), corresponding to up to 52,444 scheduled analyses; in total, over 48,000 individual model predictions were available for evaluation (per-model case counts are given in Table 1). A diagram detailing the inclusion and exclusion of cases is not provided as the entire filtered dataset was used. The demographic and clinical characteristics of the cases are as described in the original BTXRD dataset publication.

3.2. Overall Model Performance on Clinical Tasks

Model performance varied markedly across the five evaluated clinical tasks (Figure 1). Most models demonstrated high accuracy in identifying the Body Region, with Gemini 2.5 Pro and the Gemini Ensemble achieving the highest accuracy at 97.4%. Performance on X-ray View identification was also strong, led by GPT-o3 (Apr-16) at 82.8%. However, accuracy was substantially lower for more complex interpretive tasks. Identifying the correct Tumor Type proved to be the most challenging task for all models, with the highest accuracy achieved by Gemini 2.5 Flash at 55.7%. Based on a composite of all tasks, Gemini 2.5 Flash achieved the highest overall accuracy (Table 1).

3.3. Tumor Detection Performance: F1 Score, Precision, and Recall

Given the clinical importance of tumor detection, performance on this task was evaluated using precision, recall, and the F1 score. The Gemini Ensemble model achieved the highest F1 score (0.723), indicating the most effective balance between precision and recall (Table 1). This was followed closely by Gemini 2.5 Pro (F1 score = 0.719). An analysis of the components of the F1 score revealed distinct model behaviors (Figure 2, Table 2). Models like GPT-o3 (Apr-16) and Gemini 2.5 Flash operated as high-precision, low-recall systems, correctly identifying tumors when they made a positive prediction (precision of 0.930 and 0.892, respectively) but missing a large number of actual tumors (recall of 0.284 and 0.503, respectively). In contrast, the Gemini Ensemble leveraged the high recall of Gemini Pro (0.816) to create a more balanced and clinically useful detector.
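For reference, the ensemble’s tumor detection metrics reported in Table 2 follow directly from its confusion matrix counts, as the short calculation below shows.

tp, fp, tn, fn = 2536, 1391, 2152, 551        # Gemini Ensemble counts from Table 2
precision = tp / (tp + fp)                    # ≈ 0.646
recall    = tp / (tp + fn)                    # ≈ 0.822
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.723
accuracy  = (tp + tn) / (tp + fp + tn + fn)   # ≈ 0.707
specificity = tn / (tn + fp)                  # ≈ 0.607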

3.4. Impact of Demographic Information on Performance

The inclusion of patient demographic information (age and sex) had a statistically significant, albeit modest, positive impact on model performance for specific tasks (Figure 3). Overall accuracy across all tasks increased by 1.0 percentage point when demographics were provided (40.8% vs. 41.8%, p = 0.015). The most notable improvement was in Tumor Detection, where accuracy increased by 1.8 percentage points (65.8% vs. 67.6%, p < 0.001). For other tasks, including Classification, Tumor Type, Body Region, and X-ray View, the addition of demographic data did not result in a statistically significant change in performance (Table 3).

3.5. Correlation Between Certainty and Accuracy

The relationship between the models’ self-reported certainty scores and their actual accuracy was analyzed to assess model calibration (Figure 4). Gemini Pro was the only model to show a weak but positive correlation (r = 0.089 for both raw and normalized certainty), suggesting that its higher confidence was slightly indicative of a more accurate analysis. In contrast, most other models, including GPT-4.1, GPT-o3, and Claude Sonnet 4, exhibited a weak negative correlation. This indicates that for these models, a higher certainty score was not a reliable indicator of correctness and was slightly associated with a higher likelihood of error.

4. Discussion

This study represents one of the first systematic evaluations of VLMs in the context of musculoskeletal tumor diagnostics using radiographs. By analyzing multiple commercially available models across a range of clinically relevant tasks, we provide novel insights into the capabilities, limitations, and calibration of current VLMs in a real-world medical imaging scenario. Our findings highlight both promising use cases and critical challenges that must be addressed before broader clinical adoption is feasible.
All models performed well in basic image classification tasks such as anatomical region (up to 97.4%) and X-ray view identification (up to 82.8%). At their best, the VLMs achieved performance levels that were comparable to, although in many cases lower than, those of deep learning models specifically trained for these tasks [22,23]. For more complex tasks such as tumor detection, the models demonstrated overall poor performance, achieving accuracies of up to only 73.5%, which is notably lower than the >90% accuracy reported in previous studies using dedicated deep learning approaches [24,25]. It is noteworthy that prior studies were limited to a single anatomical region, while VLMs, although less accurate, offer greater flexibility for use across a wider range of anatomical sites [25,26]. Compared to a previously conducted study on the classification of different primary bone tumors from X-ray images at different anatomical locations, which yielded an accuracy of 69.7% using a multimodal deep learning model, the VLMs in our study (for example, Gemini 2.5 Flash) achieved lower performance (55.7% for tumor type classification) [27]. When considering only the distinction between benign and malignant tumors, the VLMs demonstrated performance comparable to that of musculoskeletal fellowship-trained radiologists and exceeded the performance of radiology residents [28]. Although Gemini 2.5 Flash is less powerful and less expensive than the Pro version, it outperformed the Pro version in this study [29,30]. The reasons for these performance differences should be investigated further in future studies.
The overall results indicate that VLMs are capable of delivering solid diagnostic performance, even without task-specific training. This suggests considerable potential for adaptation and improvement through fine-tuning and future advances in model architecture and multimodal integration, which could further enhance their clinical utility. A recent study by Ren et al. assessed the diagnostic performance of GPT-4 in classifying osteosarcoma on radiographs. Using a binary classification setup with 80 anonymized X-ray images, the model showed limited sensitivity for malignant lesions. Osteosarcoma was correctly identified as the top diagnosis in only 20% of cases and was at least mentioned among the possible diagnoses in 35% [31]. These results highlight the challenges of malignancy recognition for current VLMs in clinical imaging. In contrast, our study evaluated a broader range of musculoskeletal tumors across multiple anatomical regions and found higher diagnostic accuracy, particularly in tumor detection. The best-performing ensemble model achieved a recall of 0.822 and an F1 score of 0.723, indicating a more balanced and clinically useful performance. However, both studies confirm that tumor type classification remains a major limitation, and that VLMs, while improving, should not yet replace expert radiologists in high-stakes diagnostic tasks. This is further underscored by a recent reader study in which two radiologists specialized in musculoskeletal tumor imaging achieved accuracies of 84% and 83% at sensitivities of 90% and 81%, respectively, clearly outperforming both an artificial neural network and non-subspecialist readers on the same benign–malignant classification task. To place these findings in a clinical context, radiologists interpreting bone tumors on radiographs have been reported to achieve substantially higher sensitivity for distinguishing malignant from benign or infectious lesions (e.g., sensitivities around 93–98%) (PMID: 31850149), and recent task-specific deep learning models for bone tumor detection or classification can approach or even match such expert performance on selected tasks [32]. In contrast, the sensitivity and F1 scores observed for the general-purpose VLMs in our study fall clearly below such thresholds and would be insufficient for any clinical application in which missed malignant lesions are unacceptable. Taken together, our findings demonstrate that, in their current form, general-purpose vision–language models are not suitable for autonomous diagnostic interpretation in clinical practice and should be regarded as exploratory research tools only.
In addition to evaluating model accuracy, our study also assessed the calibration of VLMs by analyzing the correlation between self-reported certainty and diagnostic correctness. While Gemini Pro demonstrated a weak but positive correlation (r = 0.089), indicating that higher certainty was slightly associated with better performance, most other models showed a weak negative correlation. This suggests that many VLMs are overconfident, even when incorrect, which poses a critical concern in clinical contexts where reliability and transparency are essential. Beyond image interpretation, current research also explores whether LLMs can assess their own confidence when generating responses. This self-evaluation capability is crucial in medical applications, where overconfidence can lead to misinformed decisions. A recent study demonstrated that even state-of-the-art models such as GPT-4o show limited differentiation in confidence levels between correct and incorrect answers [33]. Self-evaluation may even help improve the accuracy of the responses [34]. These findings highlight a critical limitation of current LLMs in safety-sensitive domains like healthcare and emphasize the need for improved methods of uncertainty estimation before broader clinical integration. To our knowledge, this is the first study to evaluate image-based recognition, diagnostic accuracy, and self-assessment of large language models in a real-world clinical setting.
We also investigated whether providing demographic context (age and sex) improved diagnostic performance. While the inclusion of this information led to a statistically significant increase in tumor detection accuracy (+1.8%, p < 0.001), it had no meaningful effect on more complex classification tasks such as tumor type or malignancy status. To date, no published studies have systematically examined the influence of demographic information on the diagnostic output of VLMs. However, given that demographic factors such as age and sex can substantially affect the likelihood and presentation of specific tumors and diseases, their consideration represents an important aspect of real-world clinical decision-making.
Nevertheless, VLMs can have practical value as an aid, particularly for triage or second opinions in resource-constrained settings. For low-risk applications, such as providing differential diagnoses, even moderate accuracy and good calibration can offer meaningful benefits. In contrast, high-risk applications—such as autonomous primary diagnoses in musculoskeletal tumor imaging—require near-expert performance and robust reliability. Our results suggest that current VLMs do not yet meet these higher requirements and should therefore remain limited to complementary tasks within carefully controlled clinical workflows. Future work should focus on improving model calibration, incorporating multimodal data sources (e.g., clinical history, lab results), and validating performance in real-world, heterogeneous clinical environments.

Limitations

This study has several limitations. First, potential bias exists as the evaluation was performed on a single, curated public dataset, which may not fully represent the diversity of clinical presentations and image quality seen in practice. The generalizability of our findings is therefore not guaranteed. Second, all models were evaluated using a standardized prompt format, which limits the assessment of model flexibility and adaptability to different instruction styles. Third, we deliberately restricted clinical context to the minimal demographic variables available in the BTXRD dataset, namely age and sex. This design choice allowed us to isolate how limited demographic information modulates otherwise purely image-based VLM performance. A broader multimodal evaluation that integrates richer clinical metadata (e.g., medical history, laboratory values, follow-up imaging) lies beyond the scope of the present work and should be addressed in future studies. Fourth, the dataset consisted of clearly labelled cases with unambiguous ground-truth annotations, which may not reflect the diagnostic ambiguity frequently encountered in clinical practice. As such, the generalizability of these findings to more complex or borderline cases remains uncertain. Finally, no comparison was made to human radiologists, so it remains unclear how current VLM performance benchmarks against expert-level clinical interpretation.

5. Conclusions

This study demonstrates that vision-language models can interpret bone tumor radiographs with promising accuracy for basic tasks, even without task-specific training, but that they remain less accurate and less reliable than specialized deep learning models for complex classification. Consequently, general-purpose VLMs should at best be used as strictly assistive tools under the close supervision of expert radiologists, and only after rigorous task-specific adaptation and prospective validation; any deployment without robust safety safeguards would be inappropriate at this stage.

Author Contributions

Conceptualization, R.K. and J.R.; methodology, R.K.; software, R.K.; validation, R.K., R.M. and F.S.F.; formal analysis, R.K.; investigation, J.R. and R.K.; resources, K.W. and J.R.; data curation, R.K.; writing—original draft preparation, J.R., R.K. and R.M.; writing—review and editing, R.K., J.R., R.M., K.W., F.S.F., P.P., S.K. and S.S.; visualization, R.K.; supervision, R.K. and J.R.; project administration, R.K.; funding acquisition, J.R. All authors have read and agreed to the published version of the manuscript.

Funding

This publication was supported by the Open Access Publication Fund of the University of Bonn.

Institutional Review Board Statement

The study was approved by the local institutional review board (reference number: 2025-168-BO).

Informed Consent Statement

Not applicable as this study was a retrospective analysis of a publicly available, de-identified dataset.

Data Availability Statement

The data used in this study is publicly available in the Bone Tumor X-ray Radiograph (BTXRD) dataset, which can be accessed at https://www.nature.com/articles/s41597-024-04311-y (accessed on 27 May 2025).

Acknowledgments

During the preparation of this manuscript, the author(s) used GPT-4o (OpenAI) to enhance linguistic clarity and rectify grammatical discrepancies. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
BTXRD: Bone Tumor X-ray Radiograph
CI: Confidence Interval
DL: Deep Learning
JSON: JavaScript Object Notation
LLM: Large Language Model
VLM: Vision-Language Model

Appendix A. Mapping Dictionaries

The following dictionaries were used to map specific model outputs to standardized categories for analysis.
map_body_region = {
    'foot': 'lower limb',
    'hip bone': 'pelvis',
    'knee-joint': 'lower limb',
    'elbow-joint': 'upper limb',
    'ankle-joint': 'lower limb',
    'hip': 'pelvis'
}
map_tumor_location = {
    "hip bone": "hip bone",
    "hip joint": "hip-joint",
    "proximal femur": "femur",
    "distal femur": "femur",
    "femur": "femur",
    "tibia": "tibia",
    "distal tibia": "tibia",
    "fibula": "fibula",
    "foot": "foot",
    "heel": "foot",
    "calcaneus": "foot",
    "humerus": "humerus",
    "ulna": "ulna",
    "radius": "radius",
    "shoulder-joint": "shoulder-joint",
    "hand": "hand",
    "finger": "hand",
    "phalanx": "hand",
    "wrist-joint": "hand",
    "elbow-joint": "elbow-joint",
    "knee-joint": "knee-joint",
    "ankle-joint": "ankle-joint"
}
map_tumor_type = {
    "osteosarcoma": "osteosarcoma",
    "giant cell tumor": "giant cell tumor",
    "osteochondroma": "osteochondroma",
    "multiple osteochondromas": "multiple osteochondromas",
    "osteofibroma": "osteofibroma",
    "synovial osteochondroma": "synovial osteochondroma",
    "simple bone cyst": "simple bone cyst",
    "bone cyst": "simple bone cyst",
    "enchondroma": "other benign tumor",
    "glenoid labrum cyst": "glenoid labrum cyst",
    "other malignant tumor": "other malignant tumor",
    "other benign tumor": "other benign tumor"
}
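For clarity, a short usage sketch of these dictionaries is given below; the normalize helper is our illustration and not part of the study code.

def normalize(term, mapping):
    # Map an overly specific model output to its standardized category;
    # terms not covered by the dictionary are returned unchanged (lowercased).
    return mapping.get(term.strip().lower(), term.strip().lower())

normalize("Foot", map_body_region)               # -> 'lower limb'
normalize("proximal femur", map_tumor_location)  # -> 'femur'
normalize("bone cyst", map_tumor_type)           # -> 'simple bone cyst'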
Table A1. Distribution of tumor types in the study cohort.
Tumor Type | Classification | Count (N)
osteochondroma | benign | 745
osteosarcoma | malignant | 294
simple bone cyst | benign | 206
other benign tumor | benign | 114
giant cell tumor | benign | 93
multiple osteochondromas | benign | 91
synovial osteochondroma | benign | 48
osteofibroma | benign | 44
other malignant tumor | malignant | 43
Table A2. Comparison of ensemble fusion strategies for classification tasks. Macro F1 scores are shown for four different fusion approaches: certainty-weighted voting, majority voting with certainty tie-breaker, and single-model baselines (Gemini Pro only, Gemini Flash only). Bold values indicate the best performance per task.
Task | Certainty-Weighted | Majority Voting | Gemini Pro Only | Gemini Flash Only
Classification | 0.762 | 0.762 | 0.760 | 0.767
Tumor Type | 0.331 | 0.331 | 0.330 | 0.311
Body Region | 0.941 | 0.941 | 0.941 | 0.944
X-ray View | 0.697 | 0.697 | 0.698 | 0.640

References

  1. Zhou, H.; Liu, F.; Gu, B.; Zou, X.; Huang, J.; Wu, J.; Li, Y.; Chen, S.S.; Zhou, P.; Liu, J.; et al. A Survey of Large Language Models in Medicine: Progress, Application, and Challenge. arXiv 2023, arXiv:2311.05112. [Google Scholar]
  2. Wang, D.; Zhang, S. Large language models in medical and healthcare fields: Applications, advances, and challenges. Artif. Intell. Rev. 2024, 57, 299. [Google Scholar] [CrossRef]
  3. Jeong, D.P.; Garg, S.; Lipton, Z.C.; Oberst, M. Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? arXiv 2024, arXiv:2411.04118. [Google Scholar] [CrossRef]
  4. Hartsock, I.; Rasool, G. Vision-language models for medical report generation and visual question answering: A review. Front. Artif. Intell. 2024, 7, 1430984. [Google Scholar] [CrossRef]
  5. OpenAI. Available online: https://openai.com/ (accessed on 1 December 2024).
  6. ChatGPT Can Now See, Hear, and Speak. Available online: https://openai.com/index/chatgpt-can-now-see-hear-and-speak/ (accessed on 1 December 2024).
  7. Huppertz, M.S.; Siepmann, R.; Topp, D.; Nikoubashman, O.; Yüksel, C.; Kuhl, C.K.; Truhn, D.; Nebelung, S. Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation. Eur. Radiol. 2024, 35, 1111–1121. [Google Scholar] [CrossRef] [PubMed]
  8. Wilhelm, T.I.; Roos, J.; Kaczmarczyk, R. Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study. J. Med. Internet Res. 2023, 25, e49324. [Google Scholar] [CrossRef]
  9. Jackson, T.M.; Bittman, M.; Granowetter, L. Pediatric Malignant Bone Tumors: A Review and Update on Current Challenges, and Emerging Drug Targets. Curr. Probl. Pediatr. Adolesc. Health Care 2016, 46, 213–228. [Google Scholar] [CrossRef] [PubMed]
  10. Keil, L. Bone Tumors: Primary Bone Cancers. FP Essent. 2020, 493, 22–26. [Google Scholar] [PubMed]
  11. Ritter, J.; Bielack, S.S. Osteosarcoma. Ann. Oncol. 2010, 21, vii320–vii325. [Google Scholar] [CrossRef]
  12. Lam, Y. Bone Tumors: Benign Bone Tumors. FP Essent. 2020, 493, 11–21. [Google Scholar]
  13. Tepelenis, K.; Papathanakos, G.; Kitsouli, A.; Troupis, T.; Barbouti, A.; Vlachos, K.; Kanavaros, P.; Kitsoulis, P. Osteochondromas: An Updated Review of Epidemiology, Pathogenesis, Clinical Presentation, Radiological Features and Treatment Options. In Vivo 2021, 35, 681–691. [Google Scholar] [CrossRef]
  14. Zimbalist, T.; Rosen, R.; Peri-Hanania, K.; Caspi, Y.; Rinott, B.; Zeltser-Dekel, C.; Bercovich, E.; Eldar, Y.C.; Bagon, S. Detecting bone lesions in X-ray under diverse acquisition conditions. J. Med. Imaging 2024, 11, 024502. [Google Scholar] [CrossRef] [PubMed]
  15. Bi, W.L.; Hosny, A.; Schabath, M.B.; Giger, M.L.; Birkbak, N.J.; Mehrtash, A.; Allison, T.; Arnaout, O.; Abbosh, C.; Dunn, I.F.; et al. Artificial intelligence in cancer imaging: Clinical challenges and applications. CA Cancer J. Clin. 2019, 69, 127–157. [Google Scholar] [CrossRef]
  16. Hosny, A.; Parmar, C.; Quackenbush, J.; Schwartz, L.H.; Aerts, H.J.W.L. Artificial intelligence in radiology. Nat. Rev. Cancer 2018, 18, 500–510. [Google Scholar] [CrossRef] [PubMed]
  17. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  18. Bluethgen, C.; Chambon, P.; Delbrouck, J.B.; van der Sluijs, R.; Połacin, M.; Zambrano Chaves, J.M.; Abraham, T.M.; Purohit, S.; Langlotz, C.P.; Chaudhari, A.S. A vision-language foundation model for the generation of realistic chest X-ray images. Nat. Biomed. Eng. 2025, 9, 494–506. [Google Scholar] [CrossRef]
  19. Ryu, J.S.; Kang, H.; Chu, Y.; Yang, S. Vision-language foundation models for medical imaging: A review of current practices and innovations. Biomed. Eng. Lett. 2025, 15, 809–830. [Google Scholar] [CrossRef]
  20. Wu, J.; Wang, Y.; Zhong, Z.; Liao, W.; Trayanova, N.; Jiao, Z.; Bai, H.X. Vision-language foundation model for 3D medical imaging. npj Artif. Intell. 2025, 1, 17. [Google Scholar] [CrossRef]
  21. Li, N.; Pan, W.; Xu, B.; Liu, H.; Dai, S.; Xu, C. Ihenet: An illumination invariant hierarchical feature enhancement network for low-light object detection. Multimed. Syst. 2025, 31, 407. [Google Scholar] [CrossRef]
  22. Del Cerro, C.F.; Giménez, R.C.; Garcia-Blas, J.; Sosenko, K.; Ortega, J.M.; Desco, M.; Abella, M. Deep Learning-Based Estimation of Radiographic Position to Automatically Set Up the X-Ray Prime Factors. J. Digit. Imaging 2024, 38, 1661–1668. [Google Scholar] [CrossRef]
  23. Fang, X.; Harris, L.; Zhou, W.; Huo, D. Generalized Radiographic View Identification with Deep Learning. J. Digit. Imaging 2021, 34, 66–74. [Google Scholar] [CrossRef]
  24. Shao, J.; Lin, H.; Ding, L.; Li, B.; Xu, D.; Sun, Y.; Guan, T.; Dai, H.; Liu, R.; Deng, D.; et al. Deep learning for differentiation of osteolytic osteosarcoma and giant cell tumor around the knee joint on radiographs: A multicenter study. Insights Imaging 2024, 15, 35. [Google Scholar] [CrossRef] [PubMed]
  25. Xu, D.; Li, B.; Liu, W.; Wei, D.; Long, X.; Huang, T.; Lin, H.; Cao, K.; Zhong, S.; Shao, J.; et al. Deep learning-based detection of primary bone tumors around the knee joint on radiographs: A multicenter study. Quant. Imaging Med. Surg. 2024, 14, 5420–5433. [Google Scholar] [CrossRef] [PubMed]
  26. Breden, S.; Hinterwimmer, F.; Consalvo, S.; Neumann, J.; Knebel, C.; von Eisenhart-Rothe, R.; Burgkart, R.H.; Lenze, U. Deep Learning-Based Detection of Bone Tumors around the Knee in X-rays of Children. J. Clin. Med. 2023, 12, 5960. [Google Scholar] [CrossRef]
  27. Hinterwimmer, F.; Guenther, M.; Consalvo, S.; Neumann, J.; Gersing, A.; Woertler, K.; von Eisenhart-Rothe, R.; Burgkart, R.; Rueckert, D. Impact of metadata in multimodal classification of bone tumours. BMC Musculoskelet. Disord. 2024, 25, 822. [Google Scholar] [CrossRef] [PubMed]
  28. Von Schacky, C.E.; Wilhelm, N.J.; Schäfer, V.S.; Leonhardt, Y.; Gassert, F.G.; Foreman, S.C.; Gassert, F.T.; Jung, M.; Jungmann, P.M.; Russe, M.F.; et al. Multitask Deep Learning for Segmentation and Classification of Primary Bone Tumors on Radiographs. Radiology 2021, 301, 398–406. [Google Scholar] [CrossRef]
  29. Gemini Models|Gemini API. Google AI for Developers. Available online: https://ai.google.dev/gemini-api/docs/models (accessed on 1 December 2024).
  30. We’re Expanding Our Gemini 2.5 Family of Models. Google. Available online: https://blog.google/products/gemini/gemini-2-5-model-family-expands/ (accessed on 1 December 2024).
  31. Ren, Y.; Guo, Y.; He, Q.; Cheng, Z.; Huang, Q.; Yang, L. Exploring whether ChatGPT-4 with image analysis capabilities can diagnose osteosarcoma from X-ray images. Exp. Hematol. Oncol. 2024, 13, 71. [Google Scholar] [CrossRef]
  32. Papageorgiou, P.S.; Christodoulou, R.; Korfiatis, P.; Papagelopoulos, D.P.; Papakonstantinou, O.; Pham, N.; Woodward, A.; Papagelopoulos, P.J. Artificial Intelligence in Primary Malignant Bone Tumor Imaging: A Narrative Review. Diagnostics 2025, 15, 1714. [Google Scholar] [CrossRef]
  33. Omar, M.; Agbareia, R.; Glicksberg, B.S.; Nadkarni, G.N.; Klang, E. Benchmarking the Confidence of Large Language Models in Clinical Questions. JMIR Med. Inform. 2024, 13, e66917. [Google Scholar] [CrossRef]
  34. Ren, J.; Zhao, Y.; Vu, T.; Liu, P.J.; Lakshminarayanan, B. Self-Evaluation Improves Selective Generation in Large Language Models. arXiv 2023, arXiv:2312.09300. [Google Scholar] [CrossRef]
Figure 1. LLM Performance Comparison on the Bone Tumor X-ray Radiograph dataset (BTXRD). Accuracy of each model across five distinct clinical tasks: Tumor Detection, Classification, Tumor Type identification, Body Region identification, and X-ray View identification. The chart illustrates the significant variation in model performance, with higher accuracy on simpler recognition tasks (Body Region, X-ray View) and lower accuracy on more complex interpretive tasks (Tumor Type).
Figure 2. Comprehensive Tumor Detection Performance Metrics. A detailed breakdown of accuracy, precision, recall, and F1 score for the tumor detection task for each model. This visualization highlights the different operational characteristics of the models, such as the high-precision profile of GPT-o3 and the high-recall profile of the Gemini models.
Figure 3. Impact of Demographics on Model Performance. Boxplots showing the distribution of model-level accuracy for six performance areas, comparing the condition with demographic information provided against the condition without. Each facet includes statistical annotations derived from individual-level paired t-tests, displaying the mean difference (Δ), the 95% confidence interval (CI), and the p-value for the comparison.
Figure 4. Correlation Between Certainty Scores and Accuracy. Pearson correlation coefficient between each model’s self-reported certainty score and its actual predictive accuracy. A positive value indicates that higher certainty is associated with higher accuracy (good calibration), while a negative value indicates the opposite. The chart displays correlations for both the raw certainty scores provided by the models and a normalized version.
Table 1. Model performance summary. A comprehensive summary of model accuracy across five distinct clinical tasks. Values represent the percentage of correct predictions for each task. “Overall” accuracy is calculated as the percentage of cases where the model was correct on all five tasks simultaneously. “Avg Certainty” is the mean of the model’s self-reported confidence scores (0–1 scale).
Model | N Cases | Tumor Det. | Classification | Tumor Type | Body Region | X-ray View | Overall | Avg Certainty
Gemini 2.5 Flash | 7013 | 73.5% | 82.8% | 55.7% | 97.1% | 79.3% | 48.1% | 0.90
GPT-o3 (Apr-16) | 6925 | 65.4% | 82.2% | 44.8% | 95.7% | 82.8% | 47.3% | 0.78
GPT-4.1 | 7111 | 67.0% | 85.2% | 26.5% | 92.9% | 81.7% | 44.9% | 0.98
Gemini Ensemble | 6630 | 70.7% | 81.5% | 49.5% | 97.4% | 80.8% | 39.8% | 0.94
Gemini 2.5 Pro | 6726 | 70.5% | 81.5% | 49.6% | 97.4% | 80.9% | 39.8% | 0.94
Claude Sonnet 4 | 7112 | 60.5% | 65.6% | 31.6% | 89.8% | 75.8% | 36.2% | 0.88
GPT-4o Mini | 7112 | 59.9% | 71.7% | 24.6% | 91.4% | 69.8% | 33.0% | 0.90
Table 2. Tumor detection performance metrics. Detailed performance metrics for the binary task of tumor detection. The table includes counts for true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), along with calculated precision, recall, F1 score, accuracy, and specificity for each model. Models are ranked by F1 score.
Model | True Pos | False Pos | True Neg | False Neg | Precision | Recall | F1 Score | Accuracy | Specificity
Gemini Ensemble | 2536 | 1391 | 2152 | 551 | 0.65 | 0.82 | 0.72 | 0.71 | 0.61
Gemini 2.5 Pro | 2542 | 1411 | 2201 | 572 | 0.64 | 0.82 | 0.72 | 0.71 | 0.61
Gemini 2.5 Flash | 1673 | 202 | 3484 | 1654 | 0.89 | 0.50 | 0.64 | 0.74 | 0.95
GPT-4.1 | 1198 | 188 | 3569 | 2156 | 0.86 | 0.36 | 0.51 | 0.67 | 0.95
GPT-o3 (Apr-16) | 922 | 69 | 3610 | 2324 | 0.93 | 0.28 | 0.44 | 0.65 | 0.98
Claude Sonnet 4 | 868 | 321 | 3437 | 2486 | 0.73 | 0.26 | 0.38 | 0.61 | 0.91
GPT-4o Mini | 727 | 225 | 3533 | 2627 | 0.76 | 0.22 | 0.34 | 0.60 | 0.94
Table 3. Statistical comparison of performance with and without demographic information. Results of paired t-tests comparing model accuracy when provided with patient demographic information (age and sex) versus without. The table shows the mean accuracy for each condition, the mean difference, the 95% confidence interval (CI) of the difference, and the corresponding p-value. A p-value < 0.05 indicates a statistically significant difference. Significant differences in bold.
Clinical Task | Without Demographics | With Demographics | Mean Difference | CI Lower | CI Upper | p-Value
Tumor Detection | 65.8% | 67.6% | 1.8% | 1.0% | 2.7% | <0.001
Overall | 40.7% | 41.8% | 1.1% | 0.2% | 2.0% | 0.02
Classification | 80.9% | 80.9% | 0.0% | −1.4% | 1.4% | 0.99
Tumor Type | 41.2% | 42.6% | 1.4% | −0.4% | 3.1% | 0.12
Body Region | 94.5% | 94.4% | −0.1% | −0.5% | 0.3% | 0.60
X-ray View | 78.6% | 78.8% | 0.2% | −0.5% | 0.9% | 0.59
