CBIS-DDSM-R: A Curated Radiomic Feature Dataset for Breast Cancer Classification

Sánchez-Femat, Erika; Galván-Tejada, Carlos E.; Galván-Tejada, Jorge I.; Gamboa-Rosales, Hamurabi; Luna-García, Huizilopoztli; Flores-Chaires, Luis Alberto; Saldívar-Pérez, Javier; Reveles-Martínez, Rafael; Celaya-Padilla, José M.

doi:10.3390/data10110179

Open Access Data Descriptor

by Erika Sánchez-Femat ¹, Carlos E. Galván-Tejada ¹, Jorge I. Galván-Tejada ¹, Hamurabi Gamboa-Rosales ¹, Huizilopoztli Luna-García ¹, Luis Alberto Flores-Chaires ¹, Javier Saldívar-Pérez ¹, Rafael Reveles-Martínez ¹,²,* and José M. Celaya-Padilla ¹,*

¹ Unidad de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Zacatecas 98160, Mexico

² Unidad Profesional Interdisciplinaria de Ingeniería Campus Zacatecas (UPIIZ), Instituto Politécnico Nacional, Zacatecas 98160, Mexico

* Authors to whom correspondence should be addressed.

Data 2025, 10(11), 179; https://doi.org/10.3390/data10110179

Submission received: 14 August 2025 / Revised: 21 October 2025 / Accepted: 27 October 2025 / Published: 4 November 2025

Abstract

Early and accurate breast cancer detection is critical for patient outcomes. The Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) has been instrumental for computer-aided diagnosis (CAD) systems. However, the lack of a standardized preprocessing pipeline and consistent metadata has limited its utility for reproducible quantitative imaging or radiomics. This paper introduces CBIS-DDSM-R, an open-source, radiomics-ready extension of the original dataset. It provides an automated pipeline for preprocessing mammograms and extracts a standardized set of 93 radiomics features per lesion, adhering to Image Biomarker Standardisation Initiative (IBSI) guidelines using PyRadiomics. The resulting dataset combines clinical and radiomics data into a unified format, offering a robust benchmark for developing and validating reproducible radiomics models for breast cancer characterization.

Dataset: https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY, https://github.com/helloerikaaa/cbis-ddsm-r.

Dataset License: CC BY 4.0.

Keywords:

CBIS-DDSM; breast cancer; mammography; radiomics; medical imaging dataset; lesion segmentation; deep learning

1. Summary

The CBIS-DDSM-R (Curated Breast Imaging Subset of the Digital Database for Screening Mammography - Radiomics) dataset is a powerful, enhanced version of the popular CBIS-DDSM resource, designed to specifically address the needs of radiomics research. While the original CBIS-DDSM dataset is a cornerstone for CAD systems, its format and structure are not ideal for reproducible radiomics analysis due to issues with inconsistent metadata, non-standard preprocessing, and a lack of proper spatial alignment between images and lesion masks. This has created a significant barrier, requiring extensive manual work and making it difficult to compare results across different studies.

To solve these problems, CBIS-DDSM-R provides a new, automated pipeline that meticulously downloads, preprocesses, and structures the original data. The process ensures that the dataset is fully compliant with the Image Biomarker Standardisation Initiative (IBSI) guidelines and is compatible with the widely used PyRadiomics feature extraction library. The pipeline applies a series of standardized steps to each image, including median filtering to reduce noise, breast region segmentation, and pectoral muscle suppression. These steps are essential for robust and accurate feature extraction, ensuring that the quantitative data is not corrupted by artifacts.

The final dataset contains 2437 annotated mammographic images and includes all original CBIS-DDSM clinical metadata (e.g., breast density, BI-RADS assessment, and pathology). Critically, this dataset also provides a rich set of 93 radiomics features for each lesion, extracted using a fully documented, open-source, and IBSI-compliant pipeline. While other studies have applied radiomics to CBIS-DDSM, this work is the first to provide a complete, pre-processed, and radiomics-ready benchmark dataset, eliminating the need for researchers to develop their own complex and often non-reproducible preprocessing workflows. These features capture detailed information about lesion shape, intensity, and texture, offering a quantitative layer of analysis. The entire dataset is provided in a single, well-structured CSV file, making it easy to use for machine learning and statistical modeling. By offering a standardized, reproducible, and radiomics-ready resource, CBIS-DDSM-R empowers the scientific community to develop more reliable and comparable models for breast cancer diagnosis and characterization.

2. Related Work

2.1. Datasets

The development of computer-aided diagnosis (CAD) systems for breast cancer detection has been a long-standing research priority, particularly due to the global burden of the disease and the critical role of early detection in improving patient outcomes [1,2]. Mammography is the gold standard for breast cancer screening, but its interpretation is subject to inter-observer variability [3]. To overcome this, numerous studies have investigated both handcrafted and deep learning-based approaches for the automatic detection and classification of mammographic abnormalities [4,5].

A significant challenge in this domain remains the availability of high-quality, annotated datasets that are standardized, well-documented, and suitable for reproducible experiments. The Digital Database for Screening Mammography (DDSM) was among the earliest publicly available resources [6], consisting of digitized film mammograms with pixel-level annotations of breast lesions. However, it was stored in the obsolete LJPEG format and lacked essential metadata, such as pixel spacing and DICOM headers, which limited its usability for modern quantitative imaging analysis.

To address these limitations, the Curated Breast Imaging Subset of DDSM (CBIS-DDSM) was introduced [7]. Hosted on The Cancer Imaging Archive (TCIA), CBIS-DDSM offers updated versions of DDSM studies in DICOM format, along with associated region of interest (ROI) masks and metadata. This resource has become the de facto benchmark for many breast imaging studies, enabling consistent evaluation across machine learning models. Despite its improvements, CBIS-DDSM still lacks a standardized preprocessing pipeline, and it does not natively support radiomics applications, where metadata consistency and pixel-level alignment between images and masks are essential.

2.2. Radiomics Features

Radiomics is an emerging field that aims to extract a large number of quantitative features from medical images that characterize intensity distributions, texture, shape, and wavelet properties [8,9]. Radiomics features have shown potential in breast cancer research for lesion classification, molecular subtype prediction, and risk stratification [10,11]. The clinical translation of radiomics, however, has been hindered by issues related to reproducibility and standardization. The Image Biomarker Standardisation Initiative (IBSI) has made substantial contributions by proposing standardized feature definitions and extraction guidelines [12], which have been adopted by toolkits such as PyRadiomics [13].

Several studies have applied radiomics to CBIS-DDSM, often with custom pipelines that involve manual image alignment, normalization, and spacing correction [14,15]. This fragmented approach limits reproducibility and hinders fair comparisons across models. For instance, Lei et al. [16] benchmarked different radiomics software libraries and demonstrated how inconsistent spacing or preprocessing steps could lead to significantly different feature values and model performance.

Recent years have also seen the integration of deep learning methods with CBIS-DDSM [17,18]. While these models have achieved high accuracy in tasks such as lesion classification or malignancy prediction, they often operate as “black boxes”, limiting interpretability and clinical trust. Radiomics, on the other hand, offers handcrafted, explainable features, making it more suitable for clinical adoption when properly validated [9].

Few efforts have attempted to bridge the gap between raw CBIS-DDSM and radiomics-ready datasets. Most existing pipelines are proprietary, unpublished, or only partially documented. Moreover, ROI masks in CBIS-DDSM are stored as separate image files that are not linked to the mammograms through DICOM metadata, and their spatial alignment with the DICOM images is not guaranteed without manual validation. This undermines one of the core principles of radiomics: accurate spatial correspondence between image and lesion mask.

The present work introduces CBIS-DDSM-R, a reproducible and radiomics-ready extension of CBIS-DDSM. It provides a fully automated pipeline for downloading, preprocessing, and structuring the dataset in a way that is compliant with IBSI standards and PyRadiomics requirements. All metadata, ROI masks, and preprocessed DICOM images are aligned and standardized, ensuring that researchers can conduct reproducible radiomics experiments without the need for extensive data engineering. To our knowledge, this is the first public implementation that transforms CBIS-DDSM into a complete, open-source radiomics benchmark dataset.

3. CBIS-DDSM-R Data Description

The proposed dataset is a curated, enhanced dataset derived from the publicly available CBIS-DDSM resource. CBIS-DDSM-R retains the exact same cohort of patients and mammographic images as the original dataset while providing an additional set of quantitative radiomics features computed using the PyRadiomics open-source library. This enrichment enables the integration of handcrafted quantitative image biomarkers with the conventional CBIS-DDSM clinical and imaging annotations, supporting advanced breast imaging research and radiomics-driven model development.

3.1. Dataset Composition

CBIS-DDSM-R contains 1301 unique patients and 2437 annotated mammographic images, mirroring the original CBIS-DDSM structure in both population size and image content. The dataset includes mammograms in two standard views—craniocaudal (CC) and mediolateral oblique (MLO)—for both left and right breasts. All original CBIS-DDSM metadata fields are preserved, including

Breast density: mean ± SD = 2.55 ± 0.94; range: 0–4.
Breast laterality: left, 1274 images (52.3%); right, 1163 images (47.7%).
Image view: MLO, 1297 images (53.2%); CC, 1140 images (46.8%).
Abnormality type: calcification, mass.
BI-RADS assessment: mean ± SD = 3.56 ± 1.21; range: 0–5.
Pathology: benign, malignant, benign without callback.
Subtlety score: mean ± SD = 3.49 ± 1.21; range: 1–5.
Lesion segmentation mask paths and cropped lesion images.

The distribution of categorical variables for the CBIS-DDSM-R dataset is summarized in Figure 1. The dataset preserves the same distribution patterns as the original CBIS-DDSM resource. Slightly more images correspond to the left breast (52.3%) than the right (47.7%). The mediolateral oblique (MLO) view is slightly more frequent (53.2%) than the craniocaudal (CC) view (46.8%). Calcifications constitute the majority abnormality type (61.9%), followed by masses (38.1%). In terms of pathology, malignant lesions account for 43.7% of cases, benign lesions for 35.0%, and benign findings without callback for 21.3%.
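
The proportions reported above can be recomputed from the released CSV with a few lines of code. The following is a minimal sketch using only the Python standard library. The column names (`left_or_right_breast`, `image_view`, `pathology`) and the inline sample rows are assumptions made for illustration and may differ from the headers in the actual file.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for the released CSV. Column names and rows are hypothetical.
SAMPLE_CSV = """patient_id,left_or_right_breast,image_view,pathology
P_00001,LEFT,CC,MALIGNANT
P_00001,LEFT,MLO,MALIGNANT
P_00004,RIGHT,MLO,BENIGN
P_00005,RIGHT,CC,BENIGN_WITHOUT_CALLBACK
"""


def category_shares(rows, column):
    """Return {category: percentage of rows} for one categorical column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {key: round(100.0 * n / total, 1) for key, n in counts.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
view_shares = category_shares(rows, "image_view")
pathology_shares = category_shares(rows, "pathology")
print(view_shares)
print(pathology_shares)
```

To run this against the real dataset, replace `io.StringIO(SAMPLE_CSV)` with an open file handle for the downloaded CSV and adjust the column names to match its header.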

3.2. Radiomics Feature Extraction

Radiomics features were extracted from the lesion segmentation masks provided in CBIS-DDSM using PyRadiomics (version 3.0.1b3) with default extraction parameters following the Image Biomarker Standardisation Initiative (IBSI) guidelines. This process yielded 93 radiomics descriptors per lesion, encompassing

First-order statistics (e.g., energy, entropy, percentiles, mean, and median);
Shape-based features (2D lesion shape descriptors);
Texture features derived from
– Gray Level Co-occurrence Matrix (GLCM);
– Gray Level Run Length Matrix (GLRLM);
– Gray Level Size Zone Matrix (GLSZM);
– Neighboring Gray Tone Difference Matrix (NGTDM);
– Gray Level Dependence Matrix (GLDM).
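
To make the texture matrices listed above more concrete, the sketch below builds a Gray Level Co-occurrence Matrix for a tiny, already discretized image and derives one feature (contrast) from it. This is a simplified illustration written with NumPy, not the PyRadiomics implementation: it uses a single pixel offset, zero-based gray levels, and no ROI mask, all of which are simplifying assumptions.

```python
import numpy as np


def glcm(image, levels, offset=(0, 1), symmetric=True):
    """Normalized co-occurrence matrix for one pixel offset (row, col)."""
    d_row, d_col = offset
    height, width = image.shape
    matrix = np.zeros((levels, levels), dtype=float)
    for r in range(height):
        for c in range(width):
            rr, cc = r + d_row, c + d_col
            if 0 <= rr < height and 0 <= cc < width:
                matrix[image[r, c], image[rr, cc]] += 1
    if symmetric:
        # Count each pair in both directions.
        matrix = matrix + matrix.T
    total = matrix.sum()
    return matrix / total if total > 0 else matrix


def glcm_contrast(p):
    """Contrast: sum over (i, j) of p[i, j] * (i - j)^2."""
    i, j = np.indices(p.shape)
    return float(np.sum(p * (i - j) ** 2))


tiny = np.array([[0, 0, 1],
                 [1, 1, 0]])
p = glcm(tiny, levels=2)
contrast = glcm_contrast(p)
print(p)
print(contrast)
```

Higher contrast values indicate that neighboring pixels tend to differ more in gray level, which is one way texture features quantify lesion heterogeneity.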

Examples of extracted feature distributions include

original_firstorder_10Percentile: mean ± SD = 116.17 ± 45.24 (range: 0–221);
original_firstorder_90Percentile: mean ± SD = 161.12 ± 39.01 (range: 0–255);
original_firstorder_Energy: mean ± SD = 3.89 × 10⁹ ± 9.66 × 10⁹ (range: 0–1.88 × 10¹¹);
original_firstorder_Entropy: mean ± SD = 1.36 ± 0.58 (range: 0–3.20).

Table 1 provides descriptive statistics for a representative subset of the 93 radiomics features extracted from CBIS-DDSM-R. Values are reported as the mean and standard deviation, along with the observed minimum and maximum. Features capture diverse aspects of lesion characteristics, including intensity distribution (e.g., original_firstorder_10Percentile, original_firstorder_90Percentile), heterogeneity (original_firstorder_Entropy), and overall signal magnitude (original_firstorder_Energy). A complete list of all 93 radiomics feature statistics is provided in the GitHub repository of this dataset (see Data Availability Statement).
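
As a rough illustration of what the first-order features named above measure, the following sketch computes energy, the 10th and 90th percentiles, and entropy for a list of ROI pixel values. It is a simplified NumPy approximation and not the PyRadiomics code. The bin width of 25 matches the default documented for PyRadiomics, but the value actually used for this release should be taken from the dataset's configuration file; treating it as 25 here is an assumption.

```python
import numpy as np


def first_order(roi_values, bin_width=25.0):
    """Simplified first-order descriptors for the pixels inside an ROI."""
    x = np.asarray(roi_values, dtype=float)
    energy = float(np.sum(x ** 2))
    p10, p90 = np.percentile(x, [10, 90])

    # Fixed-bin-width discretization: bin index relative to the lowest bin.
    bins = np.floor(x / bin_width) - np.floor(x.min() / bin_width)
    counts = np.bincount(bins.astype(int))
    prob = counts[counts > 0] / counts.sum()
    entropy = float(-np.sum(prob * np.log2(prob))) + 0.0  # avoid -0.0

    return {
        "Energy": energy,
        "10Percentile": float(p10),
        "90Percentile": float(p90),
        "Entropy": entropy,
    }


uniform = first_order([100, 100, 100, 100])
two_level = first_order([0, 0, 100, 100])
print(uniform)
print(two_level)
```

A perfectly uniform ROI has zero entropy, while an ROI split evenly between two gray-level bins has an entropy of one bit, which is why entropy is read as a heterogeneity measure.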

This analysis provides a tangible example of the dataset’s value, encouraging researchers to use it for developing and validating their own diagnostic models. It serves as a proof of concept for the entire dataset without delving into a full-scale modeling paper.

3.3. Data Format and Accessibility

The CBIS-DDSM-R dataset is distributed in CSV format, where each row corresponds to a single mammographic image annotation. Columns include

Original CBIS-DDSM metadata (patient ID, imaging parameters, diagnostic labels, and mask and cropped image file paths);
Radiomic feature values (float64) extracted from the annotated lesion regions.

This structure ensures full backward compatibility with CBIS-DDSM, enabling researchers to reproduce previous analyses while extending them with standardized radiomics data.
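
Because metadata and radiomics values share one table, a typical first step is to separate the two groups of columns. The sketch below does this with the standard library, relying on the `original_` prefix seen in the feature names reported above. The sample rows, the `pathology` column name, and the choice to collapse both benign categories into a single negative class are illustrative assumptions, not part of the dataset specification.

```python
import csv
import io

# Hypothetical excerpt of the unified CSV (values are made up).
SAMPLE = """patient_id,pathology,original_firstorder_Entropy,original_firstorder_Energy
P_00001,MALIGNANT,1.52,3.1e9
P_00004,BENIGN,0.98,1.2e9
P_00005,BENIGN_WITHOUT_CALLBACK,1.10,2.0e9
"""


def split_columns(fieldnames, prefix="original_"):
    """Separate metadata columns from radiomics feature columns."""
    features = [name for name in fieldnames if name.startswith(prefix)]
    metadata = [name for name in fieldnames if not name.startswith(prefix)]
    return metadata, features


reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
metadata_cols, feature_cols = split_columns(reader.fieldnames)

# Feature matrix (one row per annotation) and a binary malignancy label.
X = [[float(row[name]) for name in feature_cols] for row in rows]
y = [1 if row["pathology"] == "MALIGNANT" else 0 for row in rows]
print(feature_cols)
print(y)
```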

4. Materials and Methods

4.1. Original Data Source: CBIS-DDSM

The Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM) is a widely used, publicly available dataset designed to support the development and evaluation of computer-aided diagnosis (CAD) systems in breast imaging. It was released through The Cancer Imaging Archive (TCIA) as a curated and standardized subset of the original Digital Database for Screening Mammography (DDSM), which was initially developed by the University of South Florida and Massachusetts General Hospital [6].

The CBIS-DDSM includes over 3100 mammographic studies from 1566 patients, with pixel-level lesion annotations provided by expert radiologists [7]. Each study typically includes craniocaudal (CC) and mediolateral oblique (MLO) views of the breast, and lesions are classified into masses or calcifications. The dataset is split into distinct training and testing subsets, separately provided for each lesion type.

The lesion segmentation masks provided in CBIS-DDSM were created and validated by board-certified radiologists with expertise in breast imaging [7]. The CBIS-DDSM-R pipeline does not modify or regenerate these expert annotations; instead, it ensures proper spatial alignment and format consistency between the original expert-annotated masks and the preprocessed DICOM images. This approach maintains the clinical validity of the annotations while addressing the technical challenges of integrating them into standardized radiomics workflows.

Each annotated abnormality is accompanied by

Full-field mammograms in DICOM format;
Binary lesion segmentation masks in DICOM format;
Cropped lesion ROIs;
Clinical and pathological metadata stored in CSV format.

Table 2 summarizes the metadata fields provided for each lesion in CBIS-DDSM.

Despite its popularity, CBIS-DDSM presents several limitations in the context of radiomics workflows:

Segmentation masks are stored as separate DICOM images, not linked via DICOM metadata.
Preprocessing inconsistencies (e.g., presence of pectoral muscle or background artifacts) hinder reproducibility.
The dataset is not natively compatible with radiomics frameworks such as PyRadiomics [13] due to missing spatial references and intensity normalization.

To address these limitations and enable reproducible radiomics analysis, the CBIS-DDSM-R is introduced as an enhanced version of the CBIS-DDSM that standardizes preprocessing, retains original DICOM metadata, and provides spatially aligned lesion masks in a format optimized for automated pipelines.

4.2. Raw Data Acquisition from TCIA

The raw CBIS-DDSM dataset was retrieved from The Cancer Imaging Archive (TCIA) using a custom Python-based downloader developed for this project. The download process is driven by a manifest file (.tcia), which enumerates the SeriesInstanceUIDs corresponding to all available mammography studies. Each UID is used to query the NBIA RESTful API, specifically the endpoints getImage and getSeriesMetaData, which return a zipped folder containing the DICOM images for a given study. A total of 3102 imaging studies from 1566 unique patients were downloaded. All DICOM files were saved using their original hierarchical structure:

<patient_id>/<study_uid>/<series_uid>/*.dcm

This organization mirrors the internal structure used by the TCIA and facilitates traceability with the original metadata files. The downloader includes integrity checks that ensure existing folders contain all expected images, allowing for partial resume and reproducible results.
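
The folder layout and the resume behavior described above can be sketched as follows. This is not the project's downloader: it omits the network requests entirely and only shows how a per-series integrity check might decide which series still need to be fetched. The function names, the example UIDs, and the idea of passing an expected image count alongside each manifest entry are all hypothetical.

```python
import tempfile
from pathlib import Path


def series_dir(root, patient_id, study_uid, series_uid):
    """Target folder: <root>/<patient_id>/<study_uid>/<series_uid>."""
    return Path(root) / patient_id / study_uid / series_uid


def is_series_complete(folder, expected_images):
    """True if the folder exists and holds at least the expected .dcm files."""
    folder = Path(folder)
    if not folder.is_dir():
        return False
    return len(list(folder.glob("*.dcm"))) >= expected_images


def series_to_download(root, manifest):
    """Keep only manifest entries whose folder is missing or incomplete.

    Each entry is (patient_id, study_uid, series_uid, expected_images).
    """
    return [
        entry for entry in manifest
        if not is_series_complete(series_dir(root, *entry[:3]), entry[3])
    ]


with tempfile.TemporaryDirectory() as root:
    # Simulate one series that was fully downloaded in an earlier run.
    done = series_dir(root, "P_00001", "1.2.3", "1.2.3.1")
    done.mkdir(parents=True)
    (done / "000000.dcm").write_bytes(b"")

    manifest = [
        ("P_00001", "1.2.3", "1.2.3.1", 1),
        ("P_00002", "1.2.4", "1.2.4.1", 1),
    ]
    pending = series_to_download(root, manifest)
    print(pending)
```

Skipping completed folders in this way lets an interrupted download resume without fetching the same series twice.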

4.3. Metadata Parsing and Integration

The CBIS-DDSM provides metadata describing breast abnormalities through four CSV files, corresponding to calcifications and masses, each split into training and test subsets. These files contain both study-level descriptors (e.g., breast laterality, image view, and breast density) and lesion-specific annotations (e.g., abnormality type, pathology, and assessment).

To create a harmonized and reproducible dataset, we developed a metadata processor that performs the following steps:

It merges all CSV files into a single structured dataframe.
It standardizes the metadata fields with consistent names and format.
It normalizes file paths to match the structure of the downloaded dataset.
It resolves inconsistencies between subsets and fills missing information where applicable.

The resulting metadata file provides a clean, unified representation of the full dataset, linking each abnormality to its corresponding DICOM image, lesion ROI mask, and cropped region. This file serves as the backbone of the CBIS-DDSM-R dataset, ensuring interoperability and simplifying downstream tasks such as preprocessing, radiomics extraction, and classification.
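
The merging and standardization steps listed above can be illustrated with a small standard-library sketch. It normalizes header names to snake_case, tags each row with its lesion type and split, and fills columns that exist in only one subset with empty strings. The sample headers and values are illustrative assumptions rather than an exact copy of the CBIS-DDSM files, and the actual metadata processor may handle naming and missing values differently.

```python
import csv
import io
import re

# Hypothetical excerpts of two of the four source CSV files.
CALC_TRAIN = "patient_id,breast density,image view,calc type\nP_00005,3,CC,AMORPHOUS\n"
MASS_TEST = "patient_id,breast_density,image view,mass shape\nP_00016,4,MLO,IRREGULAR\n"


def snake_case(name):
    """Lowercase a header and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def load_subset(text, lesion_type, split):
    """Parse one CSV, standardize its headers, and tag its origin."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {snake_case(key): value.strip() for key, value in raw.items()}
        row["lesion_type"] = lesion_type
        row["split"] = split
        rows.append(row)
    return rows


def merge_subsets(subsets):
    """Concatenate subsets and give every row the same set of columns."""
    merged = [row for rows in subsets for row in rows]
    columns = sorted({key for row in merged for key in row})
    return [{col: row.get(col, "") for col in columns} for row in merged]


merged = merge_subsets([
    load_subset(CALC_TRAIN, "calcification", "train"),
    load_subset(MASS_TEST, "mass", "test"),
])
print(merged)
```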

4.4. Image Preprocessing Pipeline

To prepare the CBIS-DDSM mammograms for radiomics analysis, a reproducible and automated image preprocessing pipeline was developed to enhance lesion visibility while preserving spatial integrity and DICOM metadata. All operations were implemented in Python using the OpenCV version 4.12.0.88 and PyDICOM version 3.0.1 libraries. The pipeline was applied to the original DICOM mammograms and comprises the following sequential steps:

DICOM Pixel Conversion: Each mammographic image is loaded from its native DICOM format and converted into a grayscale NumPy array using a custom utility function. This step extracts the pixel data while maintaining spatial dimensions and intensity fidelity.
Median Filtering: To reduce high-frequency noise such as salt-and-pepper artifacts while preserving edges, a median blur filter is applied using a square $3 \times 3$ kernel. This filtering enhances the robustness of subsequent thresholding and segmentation operations.
Global Thresholding and Binarization: The filtered grayscale images are binarized using a fixed global threshold. Pixels with intensity values above a predefined threshold are retained as the foreground (typically corresponding to breast tissue), while background and low-intensity noise are suppressed. This process produces a binary mask emphasizing the primary anatomical region.
The global threshold value was empirically determined through analysis of intensity distributions across the CBIS-DDSM dataset and was validated to perform consistently across varying breast densities (BI-RADS 1–4) and both CC and MLO views. The preceding median filtering step helps normalize local intensity variations, improving threshold robustness. However, we acknowledge that extreme cases with very low contrast or unusual acquisition parameters might benefit from adaptive thresholding techniques. The connected component analysis and subsequent morphological operations provide additional robustness by selecting the largest coherent region even when binarization is imperfect. Visual inspection of a random sample of 100 processed images confirmed successful breast region isolation in all cases. Future versions of the pipeline may incorporate adaptive thresholding methods to further improve robustness across diverse imaging conditions.
Breast Region Segmentation: The binarized image was processed using connected component labeling to identify contiguous foreground regions. The largest connected component was selected under the assumption that it represented the main breast tissue. Morphological opening, using a small square $3 \times 3$ kernel, was subsequently applied to remove noise and small artifacts. The resulting mask was used to isolate the breast region from the original image.
Pectoral Muscle Suppression: In mediolateral oblique (MLO) views, the pectoral muscle commonly appears as a bright triangular region in the upper corner of the image. To remove this region, a heuristic-based algorithm was implemented:
- Two triangular masks were generated—one for the upper-left corner and one for the upper-right corner.
- The mean intensity within each triangular region was computed to determine which side corresponded to the brighter area, assumed to be the pectoral muscle.
- The selected triangular region was masked out by applying an inverted binary mask to the image, effectively suppressing the pectoral muscle.
This fully automated approach avoided manual annotation and consistently removed a high-intensity artifact known to interfere with radiomics analysis.
To validate the effectiveness of the pectoral muscle suppression algorithm, we performed visual inspection of a stratified random sample of 150 MLO images (representing approximately 11% of MLO views in the dataset). The algorithm successfully identified and suppressed the pectoral muscle in 144 cases (96%). Cases where the algorithm performed suboptimally typically involved (1) very faint or minimal pectoral muscle visibility (n = 4) or (2) unusual patient positioning with atypical pectoral muscle geometry (n = 2). Importantly, analysis of the lesion locations in these edge cases revealed that the pectoral muscle region did not overlap with any annotated lesion ROIs, meaning that the imperfect suppression had a negligible impact on the extracted radiomics features. This finding is consistent with the anatomical separation between typical lesion locations and the pectoral muscle region. Nevertheless, users of the dataset should be aware of this limitation, and future refinements of the pipeline could incorporate machine learning-based pectoral muscle segmentation for improved robustness.
Metadata Preservation and Export: The final processed images were saved in DICOM format, ensuring that all original metadata—such as image orientation, acquisition parameters, and patient identifiers—were preserved. This step maintained compatibility with radiomics tools like PyRadiomics and enabled reproducibility in downstream analyses that depend on original DICOM attributes.
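
The filtering, binarization, and breast region segmentation steps described above can be sketched in a self-contained way with NumPy. The published pipeline uses OpenCV, so this is an illustrative reimplementation and not the project's code. The threshold value of 20 is a placeholder, since the value used in the pipeline is not stated here, and the morphological opening step is omitted for brevity.

```python
from collections import deque

import numpy as np


def median_filter_3x3(image):
    """3x3 median filter with edge replication at the borders."""
    padded = np.pad(image, 1, mode="edge")
    height, width = image.shape
    windows = [padded[r:r + height, c:c + width]
               for r in range(3) for c in range(3)]
    return np.median(np.stack(windows), axis=0).astype(image.dtype)


def largest_component(binary):
    """Boolean mask of the largest 4-connected foreground region."""
    height, width = binary.shape
    labels = np.zeros((height, width), dtype=np.int32)
    best_label, best_size, current = 0, 0, 0
    for r0 in range(height):
        for c0 in range(width):
            if not binary[r0, c0] or labels[r0, c0]:
                continue
            current += 1
            labels[r0, c0] = current
            queue, size = deque([(r0, c0)]), 0
            while queue:  # breadth-first flood fill
                r, c = queue.popleft()
                size += 1
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < height and 0 <= cc < width
                            and binary[rr, cc] and not labels[rr, cc]):
                        labels[rr, cc] = current
                        queue.append((rr, cc))
            if size > best_size:
                best_label, best_size = current, size
    if best_label == 0:
        return np.zeros((height, width), dtype=bool)
    return labels == best_label


def segment_breast(image, threshold=20):
    """Median filter, global threshold, then keep the largest region."""
    smoothed = median_filter_3x3(image)
    mask = largest_component(smoothed > threshold)
    return np.where(mask, image, 0), mask


demo = np.zeros((10, 10), dtype=np.uint8)
demo[2:8, 0:6] = 200   # breast-like block touching the left edge
demo[1, 8] = 255       # isolated bright noise pixel
isolated, breast_mask = segment_breast(demo, threshold=20)
print(breast_mask.astype(int))
```

The pure-Python flood fill is slow on full-resolution mammograms. For real images, library routines such as OpenCV's `cv2.medianBlur` and `cv2.connectedComponentsWithStats` would be a more practical choice.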

Figure 2 illustrates the sequential steps of the preprocessing pipeline, which transforms a raw mammogram into a clean, standardized image suitable for radiomics analysis. Each panel in the figure shows the output of one step.

The pipeline begins with the original mammogram, which is then processed through a median filter to reduce noise. Following this, binarization and segmentation are applied to create a mask of the breast tissue. This mask is used to isolate the breast region, removing the irrelevant background. Finally, a heuristic-based method is used for pectoral muscle suppression, removing a bright, triangular artifact common in certain mammographic views. The final image is the result of all these operations, a standardized input ready for quantitative feature extraction.
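
The pectoral muscle heuristic mentioned above (compare two upper-corner triangles and mask out the brighter one) might look roughly like the following NumPy sketch. The triangle size is controlled here by a `frac` parameter, which is a hypothetical choice because the triangle dimensions used by the pipeline are not stated in this text. In line with the description in Section 4.4, such a step would be applied to MLO views.

```python
import numpy as np


def corner_triangle(shape, side, frac=0.4):
    """Boolean mask of a right triangle anchored in an upper corner.

    `frac` is the fraction of the image height and width spanned by the
    triangle legs (hypothetical parameter, not taken from the paper).
    """
    height, width = shape
    leg_h = max(1, int(height * frac))
    leg_w = max(1, int(width * frac))
    rows, cols = np.ogrid[:height, :width]
    # Horizontal distance from the chosen image edge.
    dist = cols if side == "left" else (width - 1) - cols
    return rows / leg_h + dist / leg_w < 1.0


def suppress_pectoral(image, frac=0.4):
    """Zero out the brighter of the two upper-corner triangles."""
    left = corner_triangle(image.shape, "left", frac)
    right = corner_triangle(image.shape, "right", frac)
    side = "left" if image[left].mean() >= image[right].mean() else "right"
    cleaned = image.copy()
    cleaned[left if side == "left" else right] = 0
    return cleaned, side


demo = np.full((10, 10), 50, dtype=np.uint8)
demo[corner_triangle(demo.shape, "left")] = 250  # bright "muscle" corner
cleaned, chosen_side = suppress_pectoral(demo)
print(chosen_side)
print(cleaned)
```

Because the decision rests only on mean intensity, images with a faint or atypically positioned pectoral muscle are the ones where a heuristic of this kind is most likely to pick the wrong corner or remove too little, which matches the failure cases reported in the validation above.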

4.5. Radiomics Features Extraction

To support advanced quantitative analysis and facilitate reproducibility in computational breast imaging research, we integrated radiomics feature extraction as a core component of the CBIS-DDSM-R dataset construction pipeline.

PyRadiomics requires each sample to be defined by a pair of image and mask files: the mammographic DICOM image and a corresponding binary ROI mask, both spatially aligned. These pairs must be explicitly provided to the feature extractor, typically via a CSV file specifying the paths to each image–mask pair. To satisfy this requirement, the CBIS-DDSM-R preprocessor constructs a standardized metadata file during dataset construction. This metadata file contains, for each abnormality, the absolute paths to the preprocessed DICOM image and its associated ROI mask, ensuring compatibility with PyRadiomics’ batch processing capabilities and alignment with IBSI standards.
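
A minimal sketch of how such an image–mask listing could be generated is shown below. The `Image` and `Mask` column names follow the convention of the PyRadiomics batch input file as we understand it. The `ID` column, the record keys, the root folder, and the file names are hypothetical and stand in for whatever the CBIS-DDSM-R preprocessor actually writes.

```python
import csv
import io
from pathlib import Path


def build_batch_rows(records, root):
    """Turn metadata records into rows with absolute image and mask paths."""
    root = Path(root)
    return [
        {
            "ID": rec["abnormality_id"],
            "Image": str(root / rec["image_path"]),
            "Mask": str(root / rec["mask_path"]),
        }
        for rec in records
    ]


def write_batch_csv(rows, handle):
    """Write the rows with a fixed ID, Image, Mask header."""
    writer = csv.DictWriter(handle, fieldnames=["ID", "Image", "Mask"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical metadata records with paths relative to the dataset root.
records = [
    {"abnormality_id": "P_00001_LEFT_CC_1",
     "image_path": "P_00001/full.dcm",
     "mask_path": "P_00001/mask_1.dcm"},
]
batch_rows = build_batch_rows(records, "/data/cbis-ddsm-r")

buffer = io.StringIO()
write_batch_csv(batch_rows, buffer)
csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Keeping one row per abnormality, rather than one per image, means that a mammogram with several lesions appears several times with different masks, which preserves the one-lesion-to-one-feature-vector layout of the final dataset.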

Features were extracted directly from the preprocessed DICOM images using their corresponding binary ROI masks, ensuring pixel-level alignment. The extraction pipeline was configured to include multiple feature categories from both the original and filtered image spaces, as shown in Table 3.

Before feature extraction, images were normalized and resampled to a uniform pixel spacing (when spacing metadata was available), and voxel intensities were discretized using a fixed bin width.

Each abnormality in the dataset is associated with a vector of radiomics features stored in tabular format and indexed by its corresponding DICOM identifiers and metadata. This enables seamless integration with machine learning workflows and supports benchmarking of predictive models for lesion classification, risk stratification, or decision support.

To ensure transparency and reproducibility, the radiomics extraction parameters are included in the accompanying YAML configuration file of the CBIS-DDSM-R dataset release.

5. Discussion

The widespread adoption of the CBIS-DDSM dataset has been a cornerstone for advancing computer-aided diagnosis systems in breast cancer. However, as the field has evolved to embrace quantitative imaging techniques like radiomics, the limitations of the original dataset have become increasingly apparent. The manual and often ad hoc preprocessing required to make CBIS-DDSM compatible with radiomics pipelines has been a significant barrier to reproducibility and fair comparisons across different studies. This fragmentation undermines the core principles of radiomics, where standardization is paramount for clinical translation and validation [9,12].

The work presented here directly addresses these limitations. By providing a fully automated and standardized preprocessing workflow, this effort ensures that all images and masks are properly aligned and optimized for quantitative feature extraction. This eliminates the need for manual intervention and removes a major source of inter-study variability. The integration of PyRadiomics [13], a validated and IBSI-compliant toolkit, guarantees that the extracted radiomics features are defined and computed according to established standards. This is a crucial step towards making radiomics-based breast cancer research more transparent and reproducible.

The release of CBIS-DDSM-R as an open-source resource democratizes access to a high-quality, radiomics-ready dataset. Researchers can now focus their efforts on developing and benchmarking predictive models rather than on complex and time-consuming data engineering tasks. The unified CSV format, which merges original clinical metadata with the newly extracted radiomics features, facilitates a wide range of analytical approaches, from traditional machine learning to multimodal deep learning models that can combine image-based features with clinical data.

Dataset Contribution and Impact

While CBIS-DDSM-R employs established preprocessing techniques and uses the widely adopted PyRadiomics toolkit, its contribution lies not in methodological innovation but in addressing a critical infrastructure gap in breast imaging research. Prior to this work, researchers wishing to apply radiomics to CBIS-DDSM had to develop custom, often undocumented preprocessing pipelines, leading to inconsistent results and limited reproducibility across studies [16]. Our contribution is threefold: (1) we provide the first complete, automated, and fully documented open-source pipeline that ensures IBSI compliance; (2) we release a preprocessed, radiomics-ready dataset that can be directly used for model development without requiring extensive data engineering expertise; and (3) we provide a standardized benchmark that enables fair comparison of radiomics-based models across different research groups. This type of data infrastructure work, while not methodologically novel, is essential for advancing the field and has been recognized as a critical need in medical imaging research [12].

6. Conclusions

This work introduced CBIS-DDSM-R, a standardized and reproducible extension of the CBIS-DDSM dataset tailored for radiomics research in breast imaging. The dataset is the result of a robust, open-source pipeline that automates the downloading, preprocessing, and feature extraction steps from the original CBIS-DDSM images. This work is a direct response to the community’s need for a resource that overcomes the inherent preprocessing and standardization challenges of the original dataset.

The key contributions of CBIS-DDSM-R are as follows:

Standardized Preprocessing: A consistent and documented pipeline is provided for image preprocessing, including noise reduction and pectoral muscle suppression, which is essential for reproducible radiomics.
IBSI-Compliant Radiomics: A comprehensive set of 93 radiomics features was extracted using PyRadiomics, adhering to the Image Biomarker Standardisation Initiative (IBSI) guidelines. This integrated approach makes CBIS-DDSM-R the first dataset of its kind to be purpose-built for reproducible radiomics analysis of mammograms, ensuring all data is consistently preprocessed and features are extracted with transparent, standardized parameters.
Unified Data Format: All original clinical metadata and new radiomics features are merged into a single, easy-to-use CSV file, streamlining the data access and modeling workflow for researchers.
Enhanced Reproducibility: The entire pipeline is open-source, allowing researchers to replicate the dataset creation process, fostering transparency and trust in model development.

In summary, CBIS-DDSM-R is a valuable resource that bridges the gap between raw medical imaging data and advanced quantitative analysis. It will serve as a robust benchmark for developing and validating radiomics-based models for breast cancer diagnosis and prognosis, ultimately accelerating the clinical translation of these powerful computational tools.

Author Contributions

Conceptualization, E.S.-F. and J.M.C.-P.; methodology, C.E.G.-T. and J.I.G.-T.; software, E.S.-F.; validation, J.S.-P., L.A.F.-C. and H.L.-G.; formal analysis, H.G.-R.; investigation, E.S.-F. and J.M.C.-P.; resources, R.R.-M.; data curation, R.R.-M.; writing—original draft preparation, E.S.-F.; writing—review and editing, J.M.C.-P. and H.L.-G.; visualization, E.S.-F.; supervision, C.E.G.-T. and J.I.G.-T.; project administration, L.A.F.-C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study as the work is a computational analysis of a publicly available and de-identified dataset (CBIS-DDSM), which does not contain any protected health information. No new human subjects or animals were involved in the study.

Informed Consent Statement

Not applicable. This study used a publicly available, de-identified dataset and did not involve any human subjects.

Data Availability Statement

The original CBIS-DDSM dataset is a publicly available resource hosted on The Cancer Imaging Archive (TCIA), accessed on 24 June 2025 at https://doi.org/10.7937/K9/TCIA.2016.7O02S9CY. The processed and curated CBIS-DDSM-R dataset, along with the source code for the preprocessing and feature extraction pipeline, is available on the GitHub repository at https://github.com/helloerikaaa/cbis-ddsm-r, accessed on 20 October 2025. The dataset is provided in a single CSV file, while the code allows for full reproduction of the dataset from the original source.
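
As a minimal sketch of how the single-CSV layout might be consumed, the following Python snippet reads rows with the standard `csv` module and separates radiomics columns from clinical metadata. The `original_` prefix is taken from the feature names shown in Table 1. The inline sample rows, their values, and the helper names `split_columns` and `load_rows` are illustrative assumptions and do not reproduce the contents of the released file.

```python
import csv
import io

# Illustrative stand-in for the released CSV; the real columns and values will differ.
SAMPLE_CSV = """patient_id,pathology,original_firstorder_Mean,original_glcm_Contrast
P_00001,MALIGNANT,121.5,0.98
P_00002,BENIGN,98.2,0.65
"""


def split_columns(fieldnames, prefix="original_"):
    """Partition column names into (metadata, radiomics) by name prefix."""
    radiomics = [name for name in fieldnames if name.startswith(prefix)]
    metadata = [name for name in fieldnames if not name.startswith(prefix)]
    return metadata, radiomics


def load_rows(handle):
    """Read the CSV and convert every radiomics column to float."""
    reader = csv.DictReader(handle)
    metadata_cols, radiomics_cols = split_columns(reader.fieldnames)
    rows = []
    for record in reader:
        meta = {name: record[name] for name in metadata_cols}
        features = {name: float(record[name]) for name in radiomics_cols}
        rows.append((meta, features))
    return metadata_cols, radiomics_cols, rows


metadata_cols, radiomics_cols, rows = load_rows(io.StringIO(SAMPLE_CSV))
```

To read an actual file, the `io.StringIO` object can be replaced with `open(path, newline="")`, where `path` points to the CSV obtained from the repository.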

Acknowledgments

The authors would like to thank the developers of the CBIS-DDSM dataset and The Cancer Imaging Archive for making this valuable resource publicly available. The authors would also like to acknowledge the use of the Python libraries PyRadiomics, OpenCV, and pydicom.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CBIS-DDSM	Curated Breast Imaging Subset of the Digital Database for Screening Mammography
CAD	Computer-Aided Diagnosis
DICOM	Digital Imaging and Communications in Medicine
GLCM	Gray Level Co-occurrence Matrix
GLDM	Gray Level Dependence Matrix
GLRLM	Gray Level Run Length Matrix
GLSZM	Gray Level Size Zone Matrix
IBSI	Image Biomarker Standardisation Initiative
MLO	Mediolateral Oblique
NGTDM	Neighboring Gray Tone Difference Matrix
TCIA	The Cancer Imaging Archive

Figure 1. A visual summary of the distribution of categorical variables in the CBIS-DDSM-R dataset. The figure provides the count and percentage for each variable category: (1) view position, (2) laterality, (3) type of abnormality, and (4) type of pathology.

Figure 2. A visual representation of the preprocessing pipeline. Each image shows the result after a specific step: (a) original image, (b) median filtering, (c) binarization mask, (d) breast region isolation, (e) pectoral muscle suppression, and (f) final processed image. Note: The apparent increase in breast tissue brightness in step (e) is due to display contrast adjustment after masking the pectoral muscle region; actual pixel intensities in the breast tissue remain unchanged.
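
The first steps named in the caption (median filtering, binarization, and breast region isolation) can be illustrated on a toy image. The sketch below uses only the Python standard library so that it runs anywhere. It is not the authors' implementation, which according to the Acknowledgments relies on OpenCV. The 3x3 window, the threshold of 50, and the toy pixel values are assumptions chosen for illustration, and pectoral muscle suppression is omitted.

```python
from statistics import median


def median_filter_3x3(image):
    """Apply a 3x3 median filter; border pixels use only the neighbours that exist."""
    height, width = len(image), len(image[0])
    filtered = []
    for r in range(height):
        row = []
        for c in range(width):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            ]
            row.append(median(window))
        filtered.append(row)
    return filtered


def binarize(image, threshold):
    """Return a 0/1 mask marking pixels brighter than the threshold."""
    return [[1 if value > threshold else 0 for value in row] for row in image]


def apply_mask(image, mask):
    """Zero out every pixel outside the mask (breast region isolation)."""
    return [
        [value if keep else 0 for value, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Toy 5x5 "mammogram": dark background on the left, tissue on the right,
# and a single bright noise pixel (255) inside the background.
toy = [
    [0, 0, 0, 120, 130],
    [0, 0, 0, 125, 128],
    [0, 255, 0, 122, 131],
    [0, 0, 0, 119, 127],
    [0, 0, 0, 121, 129],
]

smoothed = median_filter_3x3(toy)        # step (b): noise reduction
mask = binarize(smoothed, threshold=50)  # step (c): binarization mask
isolated = apply_mask(toy, mask)         # step (d): breast region isolation
```

Building the mask from the filtered image rather than the raw one keeps the isolated noise pixel out of the breast region, while the pixel intensities inside the mask are copied unchanged from the original image.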

Table 1. Statistical summary of a sample of the radiomics features extracted using PyRadiomics for the CBIS-DDSM-R dataset. Features follow the IBSI standard nomenclature.

Feature	Mean	Std	Min	25%	50%	Max
original_firstorder_Mean	121.5321	24.8321	45.0	105.3210	120.4587	215.3421
original_firstorder_Median	118.7643	25.1298	43.0	102.8712	118.0234	212.9832
original_firstorder_Minimum	12.4532	4.1832	0.0	10.0000	12.0000	25.0000
original_firstorder_Maximum	245.8753	10.4289	220.0	240.0000	246.0000	255.0000
original_glcm_Contrast	0.9821	0.5432	0.01	0.65	0.88	3.12
original_glcm_Correlation	0.9632	0.0234	0.85	0.95	0.96	0.99
original_glcm_Energy	0.1253	0.0421	0.02	0.10	0.12	0.28
original_glcm_Homogeneity	0.8765	0.0654	0.61	0.84	0.88	0.97
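
The columns of Table 1 correspond to standard descriptive statistics. A minimal sketch of how such a row could be computed for one feature column is given below, using only the Python standard library. It assumes the sample standard deviation (n - 1 denominator) and linearly interpolated quartiles, which are the defaults of the pandas `describe()` method; the source does not state which conventions were used. The function name `summarize` and the example values are hypothetical.

```python
from statistics import mean, quantiles, stdev


def summarize(values):
    """Return a Table 1 style summary for one feature column."""
    # method="inclusive" interpolates linearly between observed data points.
    q1, q2, _q3 = quantiles(values, n=4, method="inclusive")
    return {
        "Mean": mean(values),
        "Std": stdev(values),  # sample standard deviation (n - 1 denominator)
        "Min": min(values),
        "25%": q1,
        "50%": q2,
        "Max": max(values),
    }


# Made-up values standing in for one feature column.
example = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = summarize(example)
```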

Table 2. Metadata fields available in CBIS-DDSM lesion annotation files.

Field Name	Description
`patient_id`	Unique identifier for the patient
`breast_density`	BI-RADS breast density (1–4)
`left or right breast`	Laterality of the breast (LEFT or RIGHT)
`image view`	Mammographic view (CC or MLO)
`abnormality id`	Unique identifier for the lesion
`abnormality type`	Type of lesion (MASS or CALCIFICATION)
`mass shape`	Shape of mass (e.g., oval, round, or irregular)
`mass margins`	Margin type (e.g., circumscribed or spiculated)
`assessment`	BI-RADS assessment value (1–5)
`pathology`	Ground truth diagnosis (e.g., BENIGN or MALIGNANT)
`subtlety`	Radiologist’s confidence score (1–5)
`image file path`	Path to the full mammogram DICOM image
`cropped image file path`	Path to the cropped lesion image
`ROI mask file path`	Path to the lesion binary mask DICOM image
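
For classification experiments, the fields in Table 2 typically need to be turned into a per-lesion key and a numeric label. The sketch below shows one possible way to do this in Python. It assumes the field names listed in Table 2 are preserved as column names. It also assumes that every value beginning with BENIGN is treated as benign, since the original CBIS-DDSM annotations include a BENIGN_WITHOUT_CALLBACK value in addition to BENIGN and MALIGNANT. The helper names and the example record are hypothetical.

```python
def encode_pathology(pathology):
    """Map a pathology string to a binary label: 1 = malignant, 0 = benign."""
    value = pathology.strip().upper()
    if value == "MALIGNANT":
        return 1
    if value.startswith("BENIGN"):
        # Assumption: every benign variant is grouped into a single benign class.
        return 0
    raise ValueError(f"Unexpected pathology value: {pathology!r}")


def lesion_key(record):
    """Build a readable per-lesion key from Table 2 fields."""
    return "_".join([
        record["patient_id"],
        record["left or right breast"],
        record["image view"],
        str(record["abnormality id"]),
    ])


# Hypothetical annotation record using the field names from Table 2.
record = {
    "patient_id": "P_00001",
    "left or right breast": "LEFT",
    "image view": "CC",
    "abnormality id": "1",
    "abnormality type": "MASS",
    "pathology": "MALIGNANT",
}

label = encode_pathology(record["pathology"])
key = lesion_key(record)
```

Raising an error on unknown values, rather than defaulting to a class, makes unexpected labels visible before they reach a model.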

Table 3. Radiomics feature categories extracted from CBIS-DDSM-R using PyRadiomics.

Category	Description
First Order Statistics	Intensity histogram features
Shape (2D)	Lesion size, compactness, elongation
GLCM	Texture based on co-occurrence of gray levels
GLRLM	Texture based on consecutive run lengths of intensities
GLSZM	Texture based on size of homogeneous zones
NGTDM	Texture based on difference from neighbors
GLDM	Texture based on dependence of gray levels
Wavelet Features	Multi-scale decomposition using wavelets
Laplacian of Gaussian	Edge-enhancing filter features
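
The feature names in Table 1 follow the pattern `<image type>_<feature class>_<feature name>`, for example `original_glcm_Contrast`. Assuming every radiomics column in the dataset uses this pattern, the categories in Table 3 can be recovered from the column names alone. The sketch below is a hypothetical helper, not part of the published pipeline, and the column list is a small illustrative subset.

```python
from collections import defaultdict


def parse_feature_name(name):
    """Split '<image type>_<feature class>_<feature name>' into its three parts."""
    image_type, feature_class, feature = name.split("_", 2)
    return image_type, feature_class, feature


def group_by_class(names):
    """Group feature names by their feature class (e.g. firstorder, glcm)."""
    groups = defaultdict(list)
    for name in names:
        _image_type, feature_class, feature = parse_feature_name(name)
        groups[feature_class].append(feature)
    return dict(groups)


# Illustrative subset of column names taken from Table 1.
columns = [
    "original_firstorder_Mean",
    "original_firstorder_Median",
    "original_glcm_Contrast",
    "original_glcm_Energy",
]

groups = group_by_class(columns)
```

Grouping columns this way makes it straightforward to run ablations by feature category, such as training on first-order features only.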

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
