An Open-Source Clinical Case Dataset for Medical Image Classification and Multimodal AI Applications
Abstract
1. Summary
2. Data Description
- data_dictionary.csv: This file describes the dataset’s contents, providing definitions for all data fields in each file.
- metadata.parquet: This file contains metadata for case report articles, including authors, year, journal, keywords (extracted from the article’s content), MeSH terms, DOI, PMCID, license, and link.
- abstracts.parquet: This file stores the abstracts of the articles.
- cases.parquet: This file contains the text of the clinical cases, along with the age of the patient (in years), and the gender (Female, Male, Transgender or Unknown).
- case_images.parquet: This file contains metadata from the raw images, such as the file name and the tag present in the EuropePMC API, captions, and text references (substrings from the clinical case text that refer to the images).
- captions_and_labels.csv: This file contains the metadata of all the images of the dataset. It includes two types of columns:
  - file information columns: they include details such as the image name, the source image (original name in EuropePMC’s API), license, size, image caption, and substrings from the case text referencing the image.
  - label columns:
    - image_type: multiclass classification column with seven possible classes (‘radiology’, ‘pathology’, ‘chart’, ‘endoscopy’, ‘medical_photograph’, ‘ophthalmic_imaging’ and ‘electrography’).
    - image_subtype: multiclass classification column with 40 possible classes (including ‘x_ray’, ‘ct’, ‘mri’, ‘ekg’, ‘immunostaining’, etc.).
    - radiology_region: multiclass classification column, including nine possible anatomical classes that are applicable only for ‘radiology’ images (‘thorax’, ‘abdomen’, ‘head’, ‘lower_limb’, etc.).
    - radiology_region_granular: multiclass classification column, including 19 possible anatomical classes that are applicable only for ‘radiology’ images (the same classes as in radiology_region, but ‘upper_limb’ and ‘lower_limb’ are replaced with more detailed classes, such as ‘hip’ and ‘knee’).
    - radiology_view: multiclass classification column, including 12 possible classes that are only applicable for ‘radiology’ images (e.g., ‘axial’, ‘frontal’, ‘transabdominal’).
    - ml_labels_for_supervised_classification: multilabel classification column (there can be more than one label per image). This column includes all the labels assigned to images by machine learning (ML) models, including the ones present in the multiclass columns.
    - gt_labels_for_semisupervised_classification: multilabel classification column (there can be more than one label per image). Labels were primarily assigned based on the image captions, except when they conflicted with the ML-generated labels. In such cases, the caption-based label was removed, unless the ML label had a confidence score above 0.95, in which case it replaced the caption-based label. Compared to the labels in ml_labels_for_supervised_classification, the ones in gt_labels_for_semisupervised_classification have a higher quality and a wider taxonomy, but they are not complete (for example, not all the CT scans have a ‘ct’ label, because some of their captions do not mention words that are specific to that class). When a label appears in both columns, the ML models predicted the same label that was extracted using the caption-based approach. This column is well-suited for semi-supervised learning, as it includes a mix of labeled and unlabeled data. Additionally, it can be used to sample images from specific classes when higher label quality is required, even if the total number of labeled examples is lower than in ml_labels_for_supervised_classification.
- Image files: This dataset contains 130,791 images in .webp format, with an average resolution of 430 × 365 pixels (this average varies depending on the image class). The image files are located in 90 folders (from ‘\PMC1\PMC10\’ to ‘\PMC9\PMC99\’), based on the first digits of the PMCID of the article they were sourced from. The file names encode relevant information about each image. For example, in PMC10000728_fmed-09-985235-g001_A_1_3.webp, ‘PMC10000728’ is the PMCID of the corresponding case report article, ‘fmed-09-985235-g001’ is the name of the file present in EuropePMC’s API, ‘A’ is the reference used in the caption to refer to this part of the original file, and ‘1’ and ‘3’ mean that the file is the first component from a compound image with three components. The names of images that were not split during post-processing end with ‘_undivided_1_1.webp’.
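The file-naming convention described above can be decoded programmatically. The following is a minimal sketch, not part of the dataset's tooling: the names `ImageName`, `parse_image_name`, and `folder_for` are hypothetical, the parser assumes that the caption reference itself contains no underscore, and the folder rule (first digit, then first two digits of the PMCID) is inferred from the folder range given above.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ImageName:
    pmcid: str         # PMCID of the case report article
    source_file: str   # file name present in EuropePMC's API
    reference: str     # caption reference, or "undivided"
    component: int     # position of this part within the compound image
    n_components: int  # number of parts in the compound image


def parse_image_name(file_name: str) -> ImageName:
    """Split a dataset image file name into its documented parts."""
    stem = PurePosixPath(file_name).stem  # drops the ".webp" suffix
    pmcid, rest = stem.split("_", 1)
    # The source file name may contain underscores, so the three trailing
    # fields are peeled off from the right-hand side.
    source_file, reference, component, n_components = rest.rsplit("_", 3)
    return ImageName(pmcid, source_file, reference, int(component), int(n_components))


def folder_for(pmcid: str) -> str:
    """Return the two-level folder implied by the leading PMCID digits (assumed rule)."""
    digits = pmcid[3:]  # strip the "PMC" prefix
    return f"PMC{digits[0]}/PMC{digits[:2]}"


info = parse_image_name("PMC10000728_fmed-09-985235-g001_A_1_3.webp")
print(info)
print(folder_for(info.pmcid))
```

Splitting from the right keeps the parser robust to underscores in the original EuropePMC file name, which is the only field whose format is not controlled by the dataset.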
3. Methods
3.1. Dataset Creation
3.2. Data Download
3.3. Data Preprocessing
3.4. Programmatic Image Labeling
3.4.1. Taxonomy Development
3.4.2. Caption-to-Label Conversion
3.4.3. Machine Learning-Based Annotation
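The rule used to combine caption-based and ML-based labels in gt_labels_for_semisupervised_classification (see Section 2) can be illustrated with a short sketch. This is not the authors' implementation. In particular, the notion of a "conflict" is an assumption made here for illustration: two labels conflict when they belong to the same mutually exclusive group, and the `EXCLUSIVE_GROUP` mapping and function names are hypothetical. Only the 0.95 threshold comes from the data description.

```python
from typing import Dict, Optional, Set

# Hypothetical grouping of mutually exclusive labels (illustration only).
EXCLUSIVE_GROUP: Dict[str, str] = {
    "ct": "radiology_subtype",
    "mri": "radiology_subtype",
    "x_ray": "radiology_subtype",
    "thorax": "radiology_region",
    "abdomen": "radiology_region",
}

CONFIDENCE_THRESHOLD = 0.95  # value stated in the data description


def conflicting_ml_label(label: str, ml_scores: Dict[str, float]) -> Optional[str]:
    """Return the most confident ML label that contradicts `label`, if any."""
    group = EXCLUSIVE_GROUP.get(label)
    if group is None or label in ml_scores:
        return None  # ungrouped label, or the ML models agree with the caption
    rivals = [ml for ml in ml_scores if EXCLUSIVE_GROUP.get(ml) == group]
    return max(rivals, key=ml_scores.__getitem__, default=None)


def reconcile(caption_labels: Set[str], ml_scores: Dict[str, float]) -> Set[str]:
    """Apply the remove-or-replace rule to one image's caption-based labels."""
    kept: Set[str] = set()
    for label in caption_labels:
        rival = conflicting_ml_label(label, ml_scores)
        if rival is None:
            kept.add(label)  # no conflict: keep the caption-based label
        elif ml_scores[rival] > CONFIDENCE_THRESHOLD:
            kept.add(rival)  # confident ML label replaces the caption-based one
        # Otherwise the caption-based label is dropped and nothing replaces it.
    return kept


print(reconcile({"ct", "thorax"}, {"mri": 0.80}))
print(reconcile({"ct", "thorax"}, {"mri": 0.99}))
```

Under these assumptions, a low-confidence disagreement leaves the image without a label for that group, which is consistent with the column being incomplete by design.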
3.5. Data Validation and Quality Control
- Random subset review: Six random samples of 50 labels each were reviewed for quality assessment—three from the gt_labels_for_semisupervised_classification column and three from the ml_labels_for_supervised_classification column. In the first group, 96%, 90%, and 96% of the labels were correct, corresponding to a mean error rate of 6% with a standard deviation of 3.46%. In the second group, 70%, 78%, and 76% of the labels were correct, resulting in a mean error rate of 25.3% and a standard deviation of 4.16%.
- Stratified manual review: To account for class imbalance, another manual review was conducted on 200 caption-based labels and 200 ML-based labels. This time, the sampling ensured a minimum number of examples per class (at least one per class for caption-based labels and two per class for ML-based labels). The review found that 90% of the caption-based labels and 70% of the ML-based labels were correct. Notably, nearly half of the errors in the ML-based labels stemmed from incorrect parent class assignments—for instance, when a skin image is misclassified as ‘immunostaining’, the root cause is often an upstream misclassification under the broader ‘pathology’ class instead of ‘medical_photograph’.
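The summary statistics of the random subset review can be reproduced from the per-sample accuracies listed above. The sketch below assumes that the reported standard deviations are sample standard deviations (n − 1 in the denominator), which is the convention that matches the published figures; the function name `error_summary` is illustrative.

```python
from statistics import mean, stdev


def error_summary(percent_correct):
    """Return the mean and sample standard deviation of the error rates (in %)."""
    errors = [100.0 - p for p in percent_correct]
    return mean(errors), stdev(errors)


caption_mean, caption_sd = error_summary([96, 90, 96])
ml_mean, ml_sd = error_summary([70, 78, 76])

print(f"caption-based: {caption_mean:.1f}% +/- {caption_sd:.2f}%")
print(f"ML-based:      {ml_mean:.1f}% +/- {ml_sd:.2f}%")
```

With only three samples of 50 labels per group, these standard deviations are rough indicators of variability rather than precise estimates.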
4. Usage Notes
4.1. Creation of Customized Subsets
- readme.txt: a short description of the subset, including its name, type, creation date, and the filters that were used during its creation.
- case_report_citations.json: file including all the metadata necessary to cite the case report articles, such as title, authors, and year.
- article_metadata.json: file including the rest of the metadata (license, keywords, and MeSH terms).
4.1.1. Text Subsets
4.1.2. Image Classification Subsets
- image_metadata.json: it includes the patient ID, file path, license, caption, and labels for each image.
- image_classification.csv: it contains one column with the file path, and another one with the corresponding class.
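Because class sizes vary substantially (see Section 4.4), it can be useful to inspect the class distribution of image_classification.csv before training. The sketch below is only an illustration: it assumes the file has a header row followed by a path column and a class column in that order, the header names and file paths in the toy sample are made up, and the inverse-frequency weighting is a common heuristic rather than something prescribed by the dataset.

```python
import csv
import io
from collections import Counter


def class_counts(csv_file):
    """Count images per class; column 0 is the file path, column 1 the class."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the (assumed) header row
    return Counter(row[1] for row in reader if len(row) >= 2)


# Toy stand-in for image_classification.csv (paths and header names are made up).
sample = io.StringIO(
    "file_path,class\n"
    "PMC1/PMC10/example_1.webp,ct\n"
    "PMC1/PMC10/example_2.webp,ct\n"
    "PMC2/PMC20/example_3.webp,ct\n"
    "PMC3/PMC30/example_4.webp,mri\n"
)

counts = class_counts(sample)
total = sum(counts.values())
# Inverse-frequency weights: rarer classes receive proportionally larger weights.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(counts)
print(weights)
```

For a real subset, the same function can be called on `open(path, newline="")` instead of the in-memory sample.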
4.1.3. Multimodal Subsets
4.2. Use of Image Classification Models
4.3. Resources for Healthcare Professionals
- Case series subsets: If ‘case_series’ is used as the dataset_type for the .create_dataset() method, then a subset with multimodal patient data is created, with the following distinctions: (a) the image organization differs from other subset types, with each patient having their own folder, and (b) the cases.csv file is very similar to the one from text subsets, except for some additional columns (including link to the case report article, number of images per patient, and path to the corresponding image folder).
- Clinical Case Hub [101]: This specialized platform is designed for medical students and healthcare professionals to explore the dataset’s contents, including clinical case texts and medical images. The hub’s user-friendly interface enables users to apply filters without requiring programming knowledge.
- Digital Twin Retriever [102]: This AI-powered tool with a chatbot interface can be used to search clinical cases and answer questions about them, helping healthcare professionals make more informed decisions. The current version operates only on textual data.
4.4. Dataset Limitations
- Differences from real-life data: As noted in Appendix B, data from case reports differs from real clinical data in several ways. The text is shorter, rare cases tend to be overrepresented, images are generally smaller and lower in resolution, and they may contain burned-in annotations such as arrows or asterisks.
- Label quality: This is a silver standard dataset, meaning that image labels were generated semi-automatically and may contain a relatively high number of errors. As a result, label quality is lower than in datasets annotated manually by domain experts.
- Data imbalance: There is substantial variation in both the number and resolution of images across classes. This imbalance reflects the nature of the original data sources.
- Data loss: Not all content from PMC case reports was included. Two major causes of data loss are: (a) exclusion of articles with non-derivative licenses, and (b) removal of images when the number of cropped subimages did not match the number of caption parts, as explained in Section 3.3.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
AI | artificial intelligence |
API | application programming interface |
CC0 | Creative Commons Zero |
CC BY | Creative Commons Attribution |
CC BY-NC | Creative Commons Attribution-NonCommercial |
CC BY-NC-SA | Creative Commons Attribution-NonCommercial-ShareAlike |
CSV | comma-separated values |
CT | computed tomography |
DOI | digital object identifier |
EEG | electroencephalogram |
EKG | electrocardiogram |
FISH | fluorescence in situ hybridization |
FMA | Foundational Model of Anatomy |
H&E | hematoxylin and eosin |
HTML | HyperText Markup Language |
IHC | immunohistochemistry |
MeSH | Medical Subject Headings |
ML | machine learning |
MRI | magnetic resonance imaging |
NCBI | National Center for Biotechnology Information |
OCT | optical coherence tomography |
PET | positron emission tomography |
PHI | protected health information |
PMC | PubMed Central |
PMCID | PubMed Central ID |
SNOMED CT | Systematized Nomenclature of Medicine Clinical Terms |
SPECT | single-photon emission computed tomography |
Appendix A. Differences Between the Updated and Initial Versions of the MultiCaRe Dataset
- Data Licenses: Data with non-derivative license types were excluded from the updated version, as derivatives based on such data cannot be redistributed. This change ensures full compliance with licensing terms, although it resulted in a slight reduction in the number of cases and images. However, if needed, these excluded data can still be downloaded using the MulticareCreator class with the parameter license_types = [] (see Section 3.1).
- Image Format: The image format was changed from .jpg to .webp to reduce the overall dataset size. This modification does not affect the resolution, quality, or visual fidelity of the images.
| Characteristic | Initial Version | Updated Version |
|---|---|---|
| update date | August 2023 | February 2025 |
| excluded license types | none | non-derivatives |
| number of cases | 96,428 | 93,816 |
| number of images | 135,596 | 130,791 |
| image classes | 138 | 150 |
| taxonomy quality | low | high |
| taxonomy structure | flat | hierarchical, with logical relations |
| documented taxonomy | no | yes |
| image labeling | caption-based | caption- and ML-based |
| file format | .jpg | .webp |
| dataset size | 8.8 GB | 2.9 GB |
| associated Python package | no | yes |
| other associated products | none | image classifiers, clinical case hub, digital twin retriever |
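To put the "slight reduction" in cases and images and the change in dataset size into perspective, the relative differences between the two versions can be computed directly from the table. This is a simple worked calculation; the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(initial, updated):
    """Relative decrease from the initial to the updated value, in percent."""
    return 100 * (initial - updated) / initial


rows = {
    "cases": (96_428, 93_816),
    "images": (135_596, 130_791),
    "dataset size (GB)": (8.8, 2.9),
}

for name, (initial, updated) in rows.items():
    print(f"{name}: {percent_reduction(initial, updated):.1f}% smaller")
```

The counts drop by roughly 3% while the dataset size drops by about two thirds, so most of the size reduction comes from the change of image format rather than from the excluded articles.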
Appendix B. Differences Between a Dataset of Case Reports and a Dataset of Medical Records
- Healthy cases: Patients with no significant clinical problems are typically excluded from case report datasets, but healthy cases are sometimes included in medical record datasets (e.g., routine checkups or screening visits).
- Rare cases: These may be overrepresented in a case report dataset, which can be beneficial because it increases the variability and diversity of the dataset. However, typical disease presentations are still well represented, since most case reports are published for educational purposes rather than because of their rarity [103].
- Text extension: Case reports typically consist of one or a few paragraphs, whereas medical records are generally much longer.
- Irrelevant information: Phrases like “patient denies any chest pain” are common in medical records. In contrast, case reports usually exclude absent symptoms, and normal test results may also be omitted.
- Text quality: Text quality is high in medical records and higher still in case reports, which undergo a rigorous review process before being published in a medical journal.
- Protected health information (PHI): Medical records are not de-identified because they are confidential documents maintained within a healthcare institution. In contrast, case reports are intended to be shared publicly, so they do not include any personal information of the patient. This ensures the protection of patient privacy, which is a fundamental ethical requirement for open clinical datasets.
- Images: Images from case reports are much smaller in file size, may contain burned-in annotations (such as arrows, circles, or asterisks), and do not contain any personal information of the patient.
| Characteristic | Case Reports | Medical Records |
|---|---|---|
| Normal Cases | Underrepresented | Proportionally Represented |
| Rare Cases | Overrepresented | Proportionally Represented |
| Text Extension | Paragraphs | Pages |
| Irrelevant Information | Absent | Present |
| Text Quality | Very High | High |
| Protected Health Information | Absent | Present |
| Images | Small Size (<200 kB) | Large Size |
References
- Luo, L.; Yu, L.; Chen, H.; Liu, Q.; Wang, X.; Xu, J.; Heng, P.A. Deep Mining External Imperfect Data for Chest X-ray Disease Screening. arXiv 2020, arXiv:2006.03796.
- Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3462–3471.
- Borgli, H.; Thambawita, V.; Smedsrud, P.H.; Hicks, S.; Jha, D.; Eskeland, S.L.; Randel, K.R.; Pogorelov, K.; Lux, M.; Nguyen, D.T.D.; et al. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Sci. Data 2020, 7, 283.
- Polat, G.; Kani, H.T.; Ergenc, I.; Ozen Alahdab, Y.; Temizel, A.; Atug, O. Improving the Computer-Aided Estimation of Ulcerative Colitis Severity According to Mayo Endoscopic Score by Using Regression-Based Deep Learning. Inflamm. Bowel Dis. 2023, 29, 1431–1439.
- Hernández-Pérez, C.; Combalia, M.; Podlipnik, S.; Codella, N.C.F.; Rotemberg, V.; Halpern, A.C.; Reiter, O.; Carrera, C.; Barreiro, A.; Helba, B.; et al. BCN20000: Dermoscopic Lesions in the Wild. Sci. Data 2024, 11, 641.
- Tschandl, P.; Rosendahl, C.; Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 2018, 5, 180161.
- Yap, M.H.; Cassidy, B.; Pappachan, J.M.; O’Shea, C.; Gillespie, D.; Reeves, N. Analysis Towards Classification of Infection and Ischaemia of Diabetic Foot Ulcers. In Proceedings of the 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), Athens, Greece, 27–30 July 2021; pp. 1–4.
- Da, Q.; Huang, X.; Li, Z.; Zuo, Y.; Zhang, C.; Liu, J.; Chen, W.; Li, J.; Xu, D.; Hu, Z.; et al. DigestPath: A benchmark dataset with challenge review for the pathological detection and segmentation of digestive-system. Med. Image Anal. 2022, 80, 102485.
- Koziarski, M.; Cyganek, B.; Niedziela, P.; Olborski, B.; Antosz, Z.; Żydak, M.; Kwolek, B.; Wąsowicz, P.; Bukała, A.; Swadźba, J.; et al. DiagSet: A dataset for prostate cancer histopathological image classification. Sci. Rep. 2024, 14, 6780.
- Maksoud, S.; Zhao, K.; Hobson, P.; Jennings, A.; Lovell, B. SOS: Selective Objective Switch for Rapid Immunofluorescence Whole Slide Image Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3861–3870.
- Phoulady, H.A.; Mouton, P.R. A New Cervical Cytology Dataset for Nucleus Detection and Image Classification (Cervix93) and Methods for Cervical Nucleus Detection. arXiv 2018, arXiv:1811.09651.
- Yang, J.; Shi, R.; Wei, D.; Liu, Z.; Zhao, L.; Ke, B.; Pfister, H.; Ni, B. MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification. Sci. Data 2023, 10, 41.
- Abacha, A.B.; Sarrouti, M.; Demner-Fushman, D.; Hasan, S.A.; Müller, H. Overview of the VQA-Med Task at ImageCLEF 2021: Visual Question Answering and Generation in the Medical Domain. In Proceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-Working Notes, Bucharest, Romania, 21–24 September 2021.
- Chen, J.; Gui, C.; Ouyang, R.; Gao, A.; Chen, S.; Chen, G.H.; Wang, X.; Zhang, R.; Cai, Z.; Ji, K.; et al. HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale. arXiv 2024, arXiv:2406.19280.
- He, X.; Zhang, Y.; Mou, L.; Xing, E.; Xie, P. PathVQA: 30000+ Questions for Medical Visual Question Answering. arXiv 2020, arXiv:2003.10286.
- Ikezogwo, W.O.; Seyfioglu, M.S.; Ghezloo, F.; Geva, D.S.C.; Mohammed, F.S.; Anand, P.K.; Krishna, R.; Shapiro, L. Quilt-1M: One Million Image-Text Pairs for Histopathology. arXiv 2023, arXiv:2306.11207.
- Jin, Q.; Chen, F.; Zhou, Y.; Xu, Z.; Cheung, J.M.; Chen, R.; Summers, R.M.; Rousseau, J.F.; Ni, P.; Landsman, M.J.; et al. Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. NPJ Digit. Med. 2024, 7, 190.
- Lau, J.J.; Gayen, S.; Ben Abacha, A.; Demner-Fushman, D. A dataset of clinically generated visual questions and answers about radiology images. Sci. Data 2018, 5, 180251.
- Lin, W.; Zhao, Z.; Zhang, X.; Wu, C.; Zhang, Y.; Wang, Y.; Xie, W. PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents. arXiv 2023, arXiv:2303.07240.
- Liu, B.; Zhan, L.M.; Xu, L.; Ma, L.; Yang, Y.; Wu, X.M. SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering. arXiv 2021, arXiv:2102.09542.
- Matos, J.; Chen, S.; Placino, S.; Li, Y.; Pardo, J.C.C.; Idan, D.; Tohyama, T.; Restrepo, D.; Nakayama, L.F.; Pascual-Leone, J.M.M.; et al. WorldMedQA-V: A multilingual, multimodal medical examination dataset for multimodal language models evaluation. arXiv 2024, arXiv:2410.12722.
- Rückert, J.; Bloch, L.; Brüngel, R.; Idrissi-Yaghir, A.; Schäfer, H.; Schmidt, C.S.; Koitka, S.; Pelka, O.; Abacha, A.B.; Herrera, A.G.S.d.; et al. ROCOv2: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset. Sci. Data 2024, 11, 688.
- Siragusa, I.; Contino, S.; Ciura, M.L.; Alicata, R.; Pirrone, R. MedPix 2.0: A Comprehensive Multimodal Biomedical Dataset for Advanced AI Applications. arXiv 2024, arXiv:2407.02994.
- Xie, Y.; Zhou, C.; Gao, L.; Wu, J.; Li, X.; Zhou, H.Y.; Liu, S.; Xing, L.; Zou, J.; Xie, C.; et al. MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine. arXiv 2024, arXiv:2408.02900.
- Nievas Offidani, M. MultiCaRe: An Open-Source Clinical Case Dataset for Medical Image Classification and Multimodal AI Applications. 2025. Available online: https://zenodo.org/records/14994046 (accessed on 29 July 2025).
- Nievas Offidani, M.A.; Delrieux, C.A. Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023). Data Brief 2024, 52, 110008.
- Nievas Offidani, M.A. MultiCaRe_Dataset. 2025. Available online: https://github.com/mauro-nievoff/MultiCaRe_Dataset (accessed on 29 July 2025).
- Belharty, N.; Azouzi, R.E.; Chafai, Y.; Mouine, N.; Benyass, A. Concomitant in situ and in transit right heart thrombi: A case report. Pan Afr. Med J. 2020, 37, 355.
- Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. arXiv 2023, arXiv:2306.00890.
- Zhang, S.; Xu, Y.; Usuyama, N.; Xu, H.; Bagga, J.; Tinn, R.; Preston, S.; Rao, R.; Wei, M.; Valluri, N.; et al. BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv 2024, arXiv:2303.00915.
- Zhang, X.; Wu, C.; Zhao, Z.; Lin, W.; Zhang, Y.; Wang, Y.; Xie, W. PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering. arXiv 2024, arXiv:2305.10415.
- Zhao, Z.; Jin, Q.; Chen, F.; Peng, T.; Yu, S. A large-scale dataset of patient summaries for retrieval-based clinical decision support systems. Sci. Data 2023, 10, 909.
- Subramanian, S.; Wang, L.L.; Mehta, S.; Bogin, B.; Zuylen, M.v.; Parasa, S.; Singh, S.; Gardner, M.; Hajishirzi, H. MedICaT: A Dataset of Medical Images, Captions, and Textual References. arXiv 2020, arXiv:2010.06000.
- Nissen, T.; Wynn, R. The clinical case report: A review of its merits and limitations. BMC Res. Notes 2014, 7, 264.
- Nievas Offidani, M. Multiversity Package. Available online: https://github.com/mauro-nievoff/MultiCaRe_Dataset/tree/main/multiversity_library/multiversity/multiversity (accessed on 29 July 2025).
- Cock, P.J.A.; Antao, T.; Chang, J.T.; Chapman, B.A.; Cox, C.J.; Dalke, A.; Friedberg, I.; Hamelryck, T.; Kauff, F.; Wilczynski, B.; et al. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25, 1422–1423.
- Reitz, K. Requests: HTTP for Humans. Available online: https://docs.python-requests.org/en/latest/ (accessed on 29 July 2025).
- Richardson, L. Beautiful Soup Documentation. Available online: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ (accessed on 29 July 2025).
- Comeau, D.C.; Wei, C.H.; Islamaj Doğan, R.; Lu, Z. PMC text mining subset in BioC: About three million full-text articles and growing. Bioinformatics 2019, 35, 3533–3535.
- Europe PMC Consortium. Europe PMC: A full-text literature database for the life sciences and platform for innovation. Nucleic Acids Res. 2015, 43, D1042–D1048.
- Bradski, G. The OpenCV Library. J. Softw. Tools 2000, 25, 120–123.
- Elbanna, K.Y.; Almutairi, B.M.; Zidan, A.T. Bilateral renal lymphangiectasia: Radiological findings by ultrasound, computed tomography, and magnetic resonance imaging. J. Clin. Imaging Sci. 2015, 5, 6.
- Structural Informatics Group. Foundational Model of Anatomy. 2019. Available online: http://sig.biostr.washington.edu/projects/fm/AboutFM.html (accessed on 29 July 2025).
- National Library of Medicine. Medical Subject Headings; U.S. National Library of Medicine: Bethesda, MD, USA, 2024.
- Radiological Society of North America. RadLex Radiology Lexicon. 2020. Available online: https://www.rsna.org/practice-tools/data-tools-and-standards/radlex-radiology-lexicon (accessed on 29 July 2025).
- National Center for Biomedical Ontology. BioPortal. 2024. Available online: https://bioportal.bioontology.org/ (accessed on 29 July 2025).
- Snomed International. SNOMED CT. 2024. Available online: https://www.snomed.org/ (accessed on 29 July 2025).
- Nievas Offidani, M.A.; Roffet, F.; Delrieux, C.A.; González Galtier, M.C.; Zarate, M.D. The Multiplex Classification Framework: Optimizing Multi-Label Classifiers Through Problem Transformation, Ontology Engineering, and Model Ensembling. Appl. Ontol. 2025.
- Nievas Offidani, M. MCR_TX Taxonomy. Available online: https://bioportal.bioontology.org/ontologies/MCR_TX (accessed on 29 July 2025).
- Nievas Offidani, M. ML_MCR_TX Taxonomy. Available online: https://bioportal.bioontology.org/ontologies/ML_MCR_TX (accessed on 29 July 2025).
- Zarfati, A.; Martucci, C.; Crocoli, A.; Serra, A.; Persano, G.; Inserra, A. Chemotherapy-induced cavitating Wilms’ tumor pulmonary metastasis: Active disease or scarring? A case report and literature review. Front. Pediatr. 2023, 11, 1083168.
- Bailey, D.D.; Montgomery, E.Y.; Garzon-Muvdi, T. Metastatic high-grade meningioma: A case report and review of risk factors for metastasis. Neuro-Oncol. Adv. 2023, 5, vdad014.
- Rouientan, H.; Gilani, A.; Sarmadian, R.; Rukerd, M.R.Z. A rare case of an immunocompetent patient with isolated pulmonary mucormycosis. IDCases 2023, 31, e01726.
- Mohanty, V.; Sharma, S.K.; Goswami, S.; Gudhage, R.; Deora, S. Idiopathic Isolated Right Ventricular Cardiomyopathy: A Rare Case Report. Avicenna J. Med. 2023, 13, 56–59.
- Idris, Z.; Ghazali, F.H.; Abdullah, J.M. Fibromyalgia and arachnoiditis presented as an acute spinal disorder. Surg. Neurol. Int. 2014, 5, 151.
- Nishiki, E.; Honda, S.; Yamano, M.; Kawasaki, T. Midventricular Obstruction With Diastolic Paradoxic Jet Flow in Transthyretin Cardiac Amyloidosis: A Case Report. Cureus 2025, 17, e77097.
- Miyamoto, K.; Fujisawa, M.; Hozumi, H.; Tsuboi, T.; Kuwashima, S.; Hirao, J.i.; Sugita, K.; Arisaka, O. Systemic inflammatory response syndrome and prolonged hypoperfusion lesions in an infant with respiratory syncytial virus encephalopathy. J. Infect. Chemother. 2013, 19, 978–982.
- Jiang, C.; Tang, W.; Yang, X.; Li, H. Immune checkpoint inhibitor-related pancreatitis: What is known and what is not. Open Med. 2023, 18, 20230713.
- Sun, Y.; Feng, L.; Huang, X.; Hu, B.; Yuan, Y. Case report: An elderly woman with recurrent syncope after pacemaker implantation. Front. Cardiovasc. Med. 2023, 10, 1117244.
- Caputo, R.; Pagliuca, M.; Pensabene, M.; Parola, S.; De Laurentiis, M. Long-term complete response with third-line PARP inhibitor after immunotherapy in a patient with triple-negative breast cancer: A case report. Front. Oncol. 2023, 13, 1214660.
- Zhou, J.; Liu, H.; Chen, J.; He, X. Case report: Echocardiographic diagnosis of cardiac involvement caused by congenital generalized lipodystrophy in an infant. Front. Pediatr. 2023, 11, 1087833.
- Derman, P.B.; Rogers-LaVanne, M.P.; Satin, A.M. Minimally Invasive Revision of Luque Plate Instrumentation: A Case Report. Cureus 2024, 16, e53120.
- Handschel, J.G.; Depprich, R.A.; Zimmermann, A.C.; Braunstein, S.; Kübler, N.R. Adenomatoid odontogenic tumor of the mandible: Review of the literature and report of a rare case. Head Face Med. 2005, 1, 3.
- Adlakha, V.K.; Chandna, P.; Gandhi, S.; Chopra, S.; Singh, N.; Singh, S. Surgical Repositioning of a Dilacerated Impacted Incisor. Int. J. Clin. Pediatr. Dent. 2011, 4, 55–58.
- Zhou, C.; Zhang, J.; Chen, Y.; Ding, X.; Chen, F.; Feng, K.; Chen, K. Cryptococcal Osteomyelitis of the Left Acetabulum: A Case Report. Curr. Med Imaging Former. Curr. Med Imaging Rev. 2023, 19, e251122211248.
- Zhang, Z.Y.; Wang, Y.W.; Zhang, W.; Zhang, B.X. Case Report: Solitary metastasis to the appendix after curative treatment of HCC. Front. Surg. 2023, 10, 1081326.
- Zhang, Y.; Dong, B.; Xue, Y.; Wang, Y.; Yan, J.; Xu, L. Case report: A case of Culler-Jones syndrome caused by a novel mutation of GLI2 gene and literature review. Front. Endocrinol. 2023, 14, 1133492.
- Bu, J.T.; Torres, D.; Robinson, A.; Malone, C.; Vera, J.C.; Daghighi, S.; Dunn-Pirio, A.; Khoromi, S.; Nowell, J.; Léger, G.C.; et al. Case report: Neuronal intranuclear inclusion disease presenting with acute encephalopathy. Front. Neurol. 2023, 14, 1184612.
- Singh, A.; Gandavaram, S.; Patel, K.; Herlekar, D. Use of a Trephine to Extract a Fractured Corail Femoral Stem During Revision Total Hip Arthroplasty: Tips From Our Case Report. Cureus 2024, 16, e52996.
- Xuan, W.; Wang, Z.; Lin, J.; Zou, L.; Xu, X.; Yang, X.; Xu, Y.; Zhang, Y.; Zheng, Q.; Xu, X.; et al. Case report: Aggressive progression of acute heart failure due to juvenile tuberculosis-associated Takayasu arteritis with aortic stenosis and thrombosis. Front. Cardiovasc. Med. 2023, 10, 1076118.
- Cupler, Z.A.; Anderson, M.T.; Stefanowicz, E.T.; Warshel, C.D. Post-infectious ankylosis of the cervical spine in an army veteran: A case report. Chiropr. Man. Ther. 2018, 26, 40.
- Pang, S.; Song, J.; Zhang, K.; Wang, J.; Zhao, H.; Wang, Y.; Li, P.; Zong, Y.; Wu, Y. Case report: Coexistence of sigmoid tumor with unusual pathological features and multiple colorectal neuroendocrine tumors with lymph node metastases. Front. Oncol. 2023, 13, 1073234.
- Sivapalan, P.; Gottlieb, M.; Christensen, M.; Clementsen, P.F. An obstructing endobronchial lipoma simulating COPD. Eur. Clin. Respir. J. 2014, 1, 25664.
- Koizumi, M.; Sata, N.; Yoshizawa, K.; Tsukahara, M.; Kurihara, K.; Yasuda, Y.; Nagai, H. Post-ERCP pancreatogastric fistula associated with an intraductal papillary-mucinous neoplasm of the pancreas—A case report and literature review. World J. Surg. Oncol. 2005, 3, 70.
- Sekino, Y.; Mochizuki, H.; Kuniyasu, H. A 49-year-old woman presenting with hepatoid adenocarcinoma of the urinary bladder: A case report. J. Med Case Rep. 2013, 7, 12.
- Bohbot, Y.; Garot, J.; Danjon, I.; Thébert, D.; Nahory, L.; Gros, P.; Salerno, F.; Garot, P. Case report: Diagnosis, management and evolution of a bulky and invasive cardiac mass complicated by complete atrioventricular block. Front. Cardiovasc. Med. 2023, 10, 1135233.
- Tan, Y.; Zheng, S. Clinicopathological characteristics and diagnosis of hepatic sinusoidal obstruction syndrome caused by Tusanqi–Case report and literature review. Open Med. 2023, 18, 20230737.
- Pailoor, K.; Kini, H.; Rau, A.R.; Kumar, Y. Cytological diagnosis of a rare case of solid pseudopapillary neoplasm of the pancreas. J. Cytol./Indian Acad. Cytol. 2010, 27, 32–34. [Google Scholar] [CrossRef] [PubMed]
- Lin, J.; Dong, L.; Yu, L.; Huang, J. Autoimmune glial fibrillary acidic protein astrocytopathy coexistent with reversible splenial lesion syndrome: A case report and literature review. Front. Neurol. 2023, 14, 1192118. [Google Scholar] [CrossRef]
- Ge, W.; Qu, Y.; Hou, T.; Zhang, J.; Li, Q.; Yang, L.; Cao, L.; Li, J.; Zhang, S. Case report: Surgical treatment of a primary giant epithelioid hemangioendothelioma of the spine with total en-bloc spondylectomy. Front. Oncol. 2023, 13, 1109643.
- Hanna, R.M.; Tran, N.T.; Patel, S.S.; Hou, J.; Jhaveri, K.D.; Parikh, R.; Selamet, U.; Ghobry, L.; Wassef, O.; Barsoum, M.; et al. Thrombotic Microangiopathy and Acute Kidney Injury Induced After Intravitreal Injection of Vascular Endothelial Growth Factor Inhibitors VEGF Blockade-Related TMA After Intravitreal Use. Front. Med. 2020, 7, 579603.
- Orchard, G.E. Gout With Associated Cutaneous AA Amyloidosis: A Case Report and Review of the Literature. Br. J. Biomed. Sci. 2023, 80, 11442.
- Yang, K.; Kruse, R.L.; Lin, W.V.; Musher, D.M. Corynebacteria as a cause of pulmonary infection: A case series and literature review. Pneumonia 2018, 10, 10.
- Jin, H.; Zhang, Y.; Zhang, W.; Wang, K. Multimodal imaging features of primary pericardial synovial sarcoma: A case report. Front. Oncol. 2023, 13, 1181778.
- Juthani, R.; Singh, A.R.; Basu, D. A case series of therapy-related leukemias: A deadly ricochet. Leuk. Res. Rep. 2023, 20, 100382.
- Olobatoke, A.O.; David, D.; Hafeez, W.; Van, T.; Saleh, H.A. Pulmonary carcinosarcoma initially presenting as invasive aspergillosis: A case report of previously unreported combination. Diagn. Pathol. 2010, 5, 11.
- Liu, S.C.; Suresh, M.; Jaber, M.; Mercado Munoz, Y.; Sarafoglou, K. Case Report: Anastrozole as a monotherapy for pre-pubertal children with non-classic congenital adrenal hyperplasia. Front. Endocrinol. 2023, 14, 1101843.
- Jin, J.; Xia, G.; Luo, Y.; Cai, Y.; Huang, Y.; Yang, Z.; Yang, Q.; Yang, B. Case report: Implantable cardioverter-defibrillator implantation with optimal medical treatment for lethal ventricular arrhythmia caused by recurrent coronary artery spasm due to tyrosine kinase inhibitors. Front. Cardiovasc. Med. 2023, 10, 1145075.
- Xiang, Y.; Li, F.; Song, Z.; Yi, Z.; Yang, C.; Xue, J.; Zhang, Y. Two pediatric patients with hemiplegic migraine presenting as acute encephalopathy: Case reports and a literature review. Front. Pediatr. 2023, 11, 1214837.
- Matsumoto, K.; Tanna, N. Maxillary protraction and vertical control utilizing skeletal anchorage for midfacial-maxillary deficiency. Dent. Press J. Orthod. 2021, 26, e2120114.
- Al-Qattan, M.M.; Almotairi, M.I. Facial cutaneous lesions of dental origin: A case series emphasizing the awareness of the entity and its medico-legal consequences. Int. J. Surg. Case Rep. 2018, 53, 75–78.
- Kumar, N.; Al Kandari, J.; Al Sabti, K.; Wani, V.B. Partial-thickness macular hole in vitreomacular traction syndrome: A case report and review of the literature. J. Med Case Rep. 2010, 4, 7.
- Koufakis, D.; Konstantopoulos, D.; Koufakis, T. An uncommon cause of progressive visual loss in a heavy smoker. Pan Afr. Med J. 2015, 22, 47.
- Makri, O.E.; Tsapardoni, F.N.; Plotas, P.; Pallikari, A.; Georgakopoulos, C.D. Intravitreal aflibercept for choroidal neovascularization secondary to angioid streaks in a non-responder to intravitreal ranibizumab. Int. Med Case Rep. J. 2018, 11, 229–231.
- Babu, K.; Murthy, V.; Akki, V.; Prabhakaran, V.; Murthy, K. Coin-shaped epithelial lesions following an acute attack of erythema multiforme minor with confocal microscopy findings. Indian J. Ophthalmol. 2010, 58, 64.
- Nievas Offidani, M.; Roffet, F. MultiCaReClassifier. 2024. Available online: https://huggingface.co/mauro-nievoff/MultiCaReClassifier (accessed on 29 July 2025).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Torchvision Maintainers and Contributors. TorchVision: PyTorch’s Computer Vision Library. 2016. Available online: https://github.com/pytorch/vision (accessed on 29 July 2025).
- Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125.
- Howard, J.; Gugger, S. Fastai: A Layered API for Deep Learning. Information 2020, 11, 108.
- González Galtier, M.C.; Nievas Offidani, M.A.; Massiris, M. Clinical Case Hub. Available online: https://clinical-cases.streamlit.app/ (accessed on 29 July 2025).
- González Galtier, M.C.; Nievas Offidani, M.A. Digital Twin Retriever. Available online: https://digital-twin-retriever.streamlit.app/ (accessed on 29 July 2025).
- Lundh, A.; Christensen, M.; Jørgensen, A. International or national publication of case reports. Dan. Med Bull. 2011, 58, A4242.
Feature | MultiCaRe [25] | PMC-OA [19] | ROCO [22] | LLaVA-Med [29] | PMC-15M [30] | PMC-VQA [31] | PMC-Patients [32] | MedICaT [33] |
---|---|---|---|---|---|---|---|---|
Images | 130K | 1.6M | 79K | 600K | 15M | 149K | – | 217K |
Image classes | 140+ | – | 20+ | 5 | 30+ | – | – | – |
Hierarchical taxonomy | ✓ | – | – | – | – | – | – | – |
Captions | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | – | ✓ |
Textual references * | ✓ | – | – | ✓ | – | – | – | ✓ |
Clinical cases | 93K | – | – | – | – | – | 167K | – |
Q&A | – | – | – | ✓ | – | ✓ | – | – |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Nievas Offidani, M.; Roffet, F.; González Galtier, M.C.; Massiris, M.; Delrieux, C. An Open-Source Clinical Case Dataset for Medical Image Classification and Multimodal AI Applications. Data 2025, 10, 123. https://doi.org/10.3390/data10080123