King Abdulaziz University Breast Cancer Mammogram Dataset (KAU-BCMD)

The current era is characterized by the rapidly increasing use of computer-aided diagnosis (CAD) systems in the medical field. These systems need a variety of datasets so that they can be developed, evaluated, and compared fairly. Physicians have indicated that breast anatomy, especially in dense breasts, and the probability of breast cancer and tumor development vary highly depending on race. Researchers have reported that breast cancer risk factors are related to culture and society. Thus, there is a strong need for a local dataset representing breast cancer in our region to help develop and evaluate automatic breast cancer CAD systems. This paper presents a public mammogram dataset called the King Abdulaziz University Breast Cancer Mammogram Dataset (KAU-BCMD), version 1. To our knowledge, KAU-BCMD is the first dataset in Saudi Arabia that deals with a large number of mammogram scans. The dataset was collected from the Sheikh Mohammed Hussein Al-Amoudi Center of Excellence in Breast Cancer at King Abdulaziz University. It contains 1416 cases. Each case has two views for both the right and left breasts, resulting in 5662 images categorized according to the breast imaging reporting and data system (BI-RADS). It also contains 205 ultrasound cases corresponding to a subset of the mammogram cases, with 405 images in total. The dataset was annotated and reviewed by three different radiologists. The dataset is promising in that it contains different imaging modalities for breast cancer with different cancer grades for Saudi women.


Summary
Breast cancer is considered a common disease and the second leading cancer among women in the world [1,2]. According to a report by the International Agency for Research on Cancer, more than 2 million women were diagnosed with breast cancer [1,3]. Moreover, the Saudi Ministry of Health reported that one out of eight women is diagnosed with breast cancer [4]. These figures signify an urgent need for a local public dataset that can be used with modern technology to build accurate computer-aided detection and diagnosis (CAD) systems that detect and classify breast cancer. Breast screening is the only way to detect breast cancer early. Therefore, it is essential for women, especially those over 40 years of age, to undergo screening periodically even if they have no symptoms [2,5,6]. Several methods are available for breast imaging, such as mammography, ultrasound (US), magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and microwave imaging [7]. Breast imaging that uses low-dose X-rays to detect cancer is known as screening mammography. Mammography is the most widely used and reliable tool for breast cancer screening, exceeding even US as a tool for breast cancer detection. Breast US is rarely used as a diagnostic method for breast cancer because it does not detect early signs of cancer, such as microcalcifications (tiny calcium deposits) [7,8].
During mammography screening, two views are recorded for each breast: the craniocaudal (CC) view, which is a top-to-bottom view, and the mediolateral oblique (MLO) view, which is a side view [9]. The breast imaging reporting and data system (BI-RADS) is a classification system for breast anomalies that was introduced in 1986 [10]. BI-RADS enables standardized breast imaging reporting [11] by providing mammography reports that include categories describing breast cancer stages. The categories are numbered from 0 to 6; the last category, for confirmed malignancy, was added to the report more recently [11,12]. BI-RADS 0 refers to an incomplete diagnosis that needs an additional image for reclassification. Table 1 shows the BI-RADS categories in detail [13,14]. The authors follow the Al-Amoudi Center of Excellence in Breast Cancer in scaling cases from 0 to 6. Researchers have an increasing need for datasets to develop, test, and evaluate automatic breast cancer CAD systems and to build diagnostic systems [15,16]. Most mammogram datasets are private, and few are publicly available for researchers to use when developing breast cancer tools. This situation has made it difficult to compare different classification methods. Researchers have also reported that breast cancer risk factors are related to culture and society [4,17-19]. Therefore, local and public mammogram datasets are needed to help researchers build automatic systems that detect and classify breast cancer in women in Saudi Arabia, especially in the early stages. Some factors affect the probability of breast cancer in Saudi Arabia to a greater or lesser degree than in other countries, such as health-related characteristics, menstrual history, obesity, and lack of exercise [8,12,20,21]. Early detection of breast cancer increases the probability of a cure to 92-96% [1,3].
This research's main contribution is a published local mammogram dataset based on BI-RADS categories, which attempts to address the limited availability of local public datasets. This was achieved by collecting, categorizing, and annotating mammogram images from a local hospital.
The main advantage of this work is that it provides a new digitized mammogram dataset for breast cancer in Saudi Arabia. Additionally, the dataset will help researchers build reliable systems for the early detection of breast cancer, thereby supporting the medical field, especially in Saudi Arabia. It will also support the medical and educational fields by providing physicians with a range of diagnosed cases. The King Abdulaziz University Breast Cancer Mammogram Dataset (KAU-BCMD) contains 1416 cases, each with two types of views for both the right and left breasts, resulting in 5662 images. The dataset was collected from 2019 to 2020 at the Sheikh Mohammed Hussein Al-Amoudi Center of Excellence in Breast Cancer at King Abdulaziz University. Information about dataset accessibility and specifications is provided in Table 2. The KAU-BCMD is a valuable tool for developing and testing decision support systems because of its size and ground truth (GT). The remainder of the paper is structured as follows. Section 2 describes some of the available mammogram datasets. Section 3 presents the dataset description. Section 4 discusses the methods used to collect and generate the dataset. Section 5 provides the discussion. Finally, Section 6 provides the conclusion and future work.

Related Work
In the following subsections, we describe the best-known public and private breast mammogram datasets. The main goal of discussing these datasets is to help researchers in the medical field and to improve the performance of CAD systems.

The Digital Database for Screening Mammography (DDSM) Dataset
The DDSM dataset was developed by the University of South Florida and published in 1999 [22]. This dataset contains mammogram images accompanied by information such as patient age, date of the screening, abnormality type, and breast density [23]. It is the largest mammogram dataset, containing 2620 cases with four views each. It is available in 43 volumes, with the images categorized as normal, malignant, and benign.

The Curated Breast Imaging Subset (CBIS-DDSM) Dataset
The CBIS-DDSM dataset is an updated version of the DDSM. Its main purpose is to update and enhance the image segmentation of the DDSM. The CBIS-DDSM updates the region of interest (ROI) annotation and evaluates specialist and segmentation methods. The dataset contains more than 1000 images and divides them into two types of abnormalities, calcification and mass, along with normal images, according to the BI-RADS categories, for training and testing any breast cancer detection model [24,25]. The dataset can no longer be found [26].

The Mammographic Image Analysis Society (MIAS) Dataset
The MIAS dataset is one of the oldest datasets. It is a private dataset from a UK research group. It includes a total of 161 cases and 322 images from malignant, benign, and normal mammograms. The dataset includes image annotations consisting of circles around the ROI [27].

Other Datasets
The MIRacle dataset [28] contains mammography images by radiologists and is used for computer learning. It contains 204 images from 196 cases. This dataset has two modes: classification and radiologist evaluation. The Magic 5 Italian dataset [29] was collected from several hospitals. It includes 967 cases, depending on pathology type. A dataset from Nijmegen, Netherlands, was published as a digital mammogram collection from the university hospital's radiology department, but it is no longer available [30]. The LLNL dataset [30] contains 197 images in two views saved in image cytometry standard (ICS) format. The dataset also contains patient information and biopsy results. A special dataset that integrates multiple datasets is the IRAM dataset [31], which contains a huge number of images. Table 3 shows a comparison between the different mammogram datasets. Approximately 25% of mammogram datasets are publicly available to the research community.

KAU-BCMD Data Description
The proposed mammography dataset was collected from the Sheikh Mohammed Hussein Al-Amoudi Center of Excellence in Breast Cancer at King Abdulaziz University in Jeddah, Saudi Arabia, from April 2019 to March 2020. The annotation took place between April and June 2020. The device used for screening was a breast imaging system from IMS Giotto, a GMM Group company. The device provides high-quality images with very low noise, i.e., a high signal-to-noise ratio (SNR) [34]. The dataset contains 1416 cases; each case includes images with two types of views (CC and MLO) for both breasts (right and left), making a total of 5662 mammogram images. The dataset was classified into six categories following the BI-RADS system (Table 1). Three different experts verified the BI-RADS categories using US scans. Majority voting was then applied to determine the final BI-RADS classifications.
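The majority-voting step can be illustrated with a minimal sketch. The case identifiers and readings below are hypothetical, and the handling of a three-way disagreement (returning no label) is an assumption, since the text does not say how such cases were resolved.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label chosen by a strict majority of readers, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


# Hypothetical BI-RADS readings from three radiologists (not taken from the dataset).
readings = {
    "case_001": [2, 2, 3],
    "case_002": [4, 4, 4],
    "case_003": [0, 3, 4],  # no two readers agree, so no majority exists
}

final = {case: majority_vote(votes) for case, votes in readings.items()}
print(final)
```

With three readers, a majority exists whenever at least two agree; the `None` result flags cases that would need some further consensus procedure.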
The largest share of our cases (48%) falls into the BI-RADS 2 category, which is benign. About a third of the cases (30%) fall into BI-RADS 3, and approximately 21% fall into BI-RADS 4 and 5, as illustrated in Figure 1. The center where the cases were collected provides screening programs for the general population, which explains why most of our data are negative. The images were saved in Digital Imaging and Communications in Medicine (DICOM) format, an international standard for transmitting, storing, and displaying medical imaging data and a popular format for mammograms. Figure 2 shows the steps of the preprocessing phase of the KAU-BCMD dataset, which will be discussed later. Figures 3-8 show examples from the proposed dataset for BI-RADS 0 to BI-RADS 5, respectively.
The images were annotated by three different radiologists, Dr. Sawsan Ashoor, Dr. Samia Alamoud, and Dr. Gawaher Al Ahadi, who are consultants at the Al-Amoudi Breast Cancer Center. The final annotation was created by applying majority voting. The center's system validated the collected images. The suspicious areas were segmented by hand-drawing.
To our knowledge, there is no published dataset for breast mammography in Saudi Arabia. Therefore, several work stages needed to be accomplished to create such a dataset. Furthermore, earlier successful attempts to construct mammographic datasets fulfilled a set of requirements for validating a mammographic dataset. The current work met these requirements, which were adopted from [35-37]. Figure 2 shows a diagram of the process of creating the dataset.

Figure 1. Number of cases in each BI-RADS category.
The dataset contains five folders divided based on BI-RADS categories and includes DICOM and JPG image formats in separate folders. In addition, the folders include tumor masks for benign and malignant cases in JPG format. The dataset also contains case information in a CSV file, as shown in Figure 9. The CSV file contains the following fields:
A. Date of the scan: the date of the mammogram screening study.
B. Patient ID: a unique number that distinguishes the records.
C. Patient age.
Figure 9. A sample of the KAU-BCMD dataset metadata that is stored in CSV file format.
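As an illustration of how such a metadata file might be consumed, the following sketch parses a small inline sample with Python's standard `csv` module. The column names (`scan_date`, `patient_id`, `age`, `birads`), the date format, and the sample rows are all assumptions made for the example; the actual header of the KAU-BCMD CSV file should be checked before use.

```python
import csv
import io
from collections import Counter

# Hypothetical sample; real column names, date format, and values may differ.
SAMPLE = """scan_date,patient_id,age,birads
2019-04-15,1001,45,2
2019-05-02,1002,52,3
2019-06-20,1003,61,2
"""


def load_metadata(handle):
    """Read metadata rows from a file-like object and convert numeric fields."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "scan_date": row["scan_date"],
            "patient_id": row["patient_id"],  # kept as text: an identifier, not a quantity
            "age": int(row["age"]),
            "birads": int(row["birads"]),
        })
    return rows


records = load_metadata(io.StringIO(SAMPLE))

# Count cases per (assumed) BI-RADS column and select patients over 40.
birads_counts = Counter(r["birads"] for r in records)
over_40 = [r["patient_id"] for r in records if r["age"] > 40]
print(birads_counts, over_40)
```

For the real file, `io.StringIO(SAMPLE)` would be replaced by an opened file handle, for example `open(path, newline="")`.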

Ethics Statement
The authors followed the Saudi executive regulations of the system of ethics for research on living creatures. The local Research Ethics Committee at King Abdulaziz University approved the publication of the dataset (1 February 2021).

Annotation of Images
Initially, all listed cases in the dataset were annotated and validated by three different radiologists: Dr. Sawsan Ashoor, Dr. Samia Alamoud, and Dr. Gawaher Al Ahadi. Figures 10-12 show examples of the image annotation from our proposed dataset. For BI-RADS 3, 4, and 5, the suspicious areas in the images were segmented by hand-drawing. Figures 13 and 14 show examples of the dataset masks for BI-RADS 3 and 4, respectively. The dataset includes ROI segmentation and bounding-box images generated by the Image Labeler app in MATLAB. This application marks ROI labels as rectangles on the tumor area for malignant cases (BI-RADS 4 and 5), as shown in Figure 15. The app then exported the labels to tables containing the coordinates x, y, width, and height, which are provided with the dataset images.
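A bounding box stored as x, y, width, and height can be turned into array indices for cropping the ROI. The sketch below assumes that x and y give the upper-left corner in 1-based pixel coordinates, as is conventional in MATLAB; this convention is not stated in the text and should be verified against the exported tables. The function name and the toy image are hypothetical.

```python
def bbox_to_slices(bbox, image_shape):
    """Convert [x, y, width, height] (assumed 1-based, upper-left corner)
    into 0-based (row_slice, col_slice), clipped to the image bounds."""
    x, y, width, height = (int(round(v)) for v in bbox)
    n_rows, n_cols = image_shape
    row_start = min(max(y - 1, 0), n_rows)
    col_start = min(max(x - 1, 0), n_cols)
    row_stop = min(max(y - 1 + height, 0), n_rows)
    col_stop = min(max(x - 1 + width, 0), n_cols)
    return slice(row_start, row_stop), slice(col_start, col_stop)


# Toy 6 x 8 "image" as nested lists; each value encodes row * 10 + column.
image = [[r * 10 + c for c in range(8)] for r in range(6)]

rows, cols = bbox_to_slices([3, 2, 4, 3], image_shape=(6, 8))
crop = [line[cols] for line in image[rows]]
print(crop)
```

The same pair of slices can index a two-dimensional NumPy array directly, e.g. `pixels[rows, cols]`, once the image has been loaded.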

Data Acquisition
The dataset includes normal, benign, and malignant cases. In addition, it contains pathology details and patients' histories, including age and previous screenings, as these may be useful for researchers' studies. BI-RADS categories were also reported, as they are considered essential information for a digital mammogram dataset. The authors provide the images in DICOM and JPG formats. We followed these steps, as shown in Figure 2:
A. Image preparation and collection.
B. Image labeling.
C. Image validation by a committee of radiologists.
D. Publication of the dataset.

US Images
The proposed dataset contains a subset of US images for 205 cases that needed more investigation after mammogram screening. The subset comprises 405 images in total, covering the left or right side of each case. The US images were obtained using the iU22 xMATRIX device. They have a size of 2816 by 3584 pixels and are stored in DICOM and JPG formats. US is most useful after a mammogram, as a mammogram scan can detect early stages efficiently while ultrasound can detect further stages. Some of the US diagnoses agreed with the mammogram diagnosis, while most of the images diagnosed by ultrasound had been classified as BI-RADS 0 in the mammogram results. The US images are raw, i.e., not annotated. Figures 16 and 17 show the detailed categorization of the US image data according to the BI-RADS system. Figure 18 shows a sample of the US images. The US data, together with the mammogram data, open the path to further investigation and classification using multimodal data to increase the accuracy of automatic classification systems.

Breast Density
Mammographic density is considered a decisive risk factor for breast cancer. The risk for women with high breast density is 4-6 times that for women with low density [38,39]. Breast density refers to the amount of fibrous and glandular tissue in a woman's breasts compared with the amount of fatty tissue. The denser the breasts are, the higher the risk of breast cancer, although the cause is not apparent.
Several methods are available for measuring breast density, but it is unclear which method is the best predictor of breast cancer risk. BI-RADS is considered the most widely used method in clinics to estimate breast density. It uses a density score. BI-RADS has several limitations: it is based on subjective visual assessment and is time-consuming [38-40].
In our dataset, breast density is estimated by the radiologist, who examines the mammogram to estimate the ratio of non-dense tissue to dense tissue and assigns a level of breast density. The breast density levels are defined using the BI-RADS reporting system. The levels of density are:
• A (0-25%): Almost entirely fatty indicates that the breasts are almost entirely composed of fat. One out of ten women has this result.
• B (25-50%): Scattered areas of fibroglandular density indicate some scattered areas of density, but most of the breast tissue is non-dense. Four out of ten women have this result.
• C (50-75%): Heterogeneously dense indicates that there are some areas of non-dense tissue but that most of the breast tissue is dense. Four out of ten women have this result.
• D (75-100%): Extremely dense indicates that nearly all breast tissue is dense. One out of ten women has this result.
In the current work, the breast density was estimated numerically according to the BI-RADS fourth edition based on percentages [40]. It was recorded as 25% for almost entirely fat, 50% for scattered fibroglandular densities, 75% for heterogeneously dense, and 100% for extremely dense. The estimation was performed manually by Prof. Sawsan Ashour (author), Dr. Samia Alamoud, and Dr. Gawaher Al Ahadi, who have more than 20 years of mammogram consulting experience.
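The mapping between a density percentage and the four levels can be sketched as a simple lookup. The ranges above share their endpoints, so the sketch assumes that each upper bound belongs to the lower category; this choice is consistent with the recorded values (25%, 50%, 75%, 100%) but is an assumption, as is the function name.

```python
import bisect

# Upper bounds of the four density levels, in percent, and their letter codes.
UPPER_BOUNDS = [25, 50, 75, 100]
CATEGORIES = ["A", "B", "C", "D"]
DESCRIPTIONS = {
    "A": "almost entirely fatty",
    "B": "scattered areas of fibroglandular density",
    "C": "heterogeneously dense",
    "D": "extremely dense",
}


def density_category(percent):
    """Map a density percentage (0-100) to a level A-D.

    Assumption: each upper bound is inclusive, so 25 -> A, 50 -> B, 75 -> C, 100 -> D.
    """
    if not 0 <= percent <= 100:
        raise ValueError(f"density percent out of range: {percent}")
    return CATEGORIES[bisect.bisect_left(UPPER_BOUNDS, percent)]


for value in (25, 50, 75, 100):
    level = density_category(value)
    print(value, level, DESCRIPTIONS[level])
```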

Discussion
The amount and quality of the data used to design machine learning-based CAD systems are directly related to the system's final accuracy. There is a lack of standard evaluation data in mammography. Most CAD algorithms are evaluated on private datasets, as most mammographic databases are not publicly available. This makes it challenging to compare the performance of different methods or to replicate prior results.
Deep learning has recently emerged as a promising solution for medical image classification, but it requires many images to learn. Most of the available mammogram datasets provide too few samples for deep learning, which is a major challenge. The current work provides a dataset that is both publicly available and large. As far as we know, it is the first such dataset to be collected and made publicly available in the region. The only drawback of the presented dataset is the imbalanced size of the different classes, as shown in Figure 1. Overall, our digital mammogram dataset can be considered the first such dataset in Saudi Arabia. In the future, we aim to increase the number of cases in the BI-RADS 3, 4, and 5 classes to make the dataset more balanced and thus more suitable for research purposes.
On the other hand, in deep learning-based CAD systems, the dataset's size can be increased using data augmentation techniques to overcome the class imbalance. This is achieved by adding noise at different levels or by applying various transformations to the images, such as different degrees of rotation and translation. Moreover, transfer learning techniques are expected to work efficiently with the current dataset size as it is. Additionally, the performance of machine learning-based CAD systems on unbalanced datasets can be measured using various performance metrics, such as sensitivity, specificity, false-positive rate, false-negative rate, geometric mean, positive likelihood ratio, diagnostic odds ratio (DOR), discriminant power (DP), and Youden's index (YI).
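The metrics listed above can all be derived from a binary confusion matrix. The following sketch uses the standard definitions, with discriminant power computed as (√3/π) times the natural logarithm of the DOR; the counts in the example are invented for illustration and do not come from the dataset.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute imbalance-aware metrics from confusion-matrix counts.

    Note: fp and fn must be non-zero, otherwise LR+ and DOR are undefined.
    """
    sensitivity = tp / (tp + fn)            # true-positive rate
    specificity = tn / (tn + fp)            # true-negative rate
    fpr = fp / (fp + tn)                    # false-positive rate
    fnr = fn / (fn + tp)                    # false-negative rate
    g_mean = math.sqrt(sensitivity * specificity)
    lr_plus = sensitivity / fpr             # positive likelihood ratio
    dor = (tp * tn) / (fp * fn)             # diagnostic odds ratio
    dp = (math.sqrt(3) / math.pi) * math.log(dor)  # discriminant power
    youden = sensitivity + specificity - 1  # Youden's index
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": fpr,
        "fnr": fnr,
        "g_mean": g_mean,
        "lr_plus": lr_plus,
        "dor": dor,
        "dp": dp,
        "youden": youden,
    }


# Hypothetical counts for an imbalanced test set (100 positives, 300 negatives).
metrics = binary_metrics(tp=80, fp=30, tn=270, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike plain accuracy, each of these quantities is built from the per-class rates, so a classifier that simply predicts the majority class scores poorly on them.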
The dataset includes a set of US images associated with 205 of the 1416 mammogram cases. The US images were captured mostly for cases classified as BI-RADS 0 on the mammogram, when the consultants could not reach a decision on the case. Although the number of US images is not large, it could be instrumental in designing a multimodal breast cancer classification system based on mammograms and US images to increase classification accuracy.
Finally, the proposed dataset satisfied most of the ideal medical image dataset criteria described in [36,37,41]. It has adequate data volume, curation, annotation, ground truth, reusability, and generalizability. Each medical imaging data object has metadata and an identifier.

Conclusions
In this research, we provide a public mammogram dataset intended as a standard breast cancer image dataset that researchers can use to develop CAD systems. The proposed work has the potential to be the first digital mammogram dataset in Saudi Arabia. Additionally, the GT is provided with related information. The dataset also contains a subset of ultrasound images corresponding to mammogram cases. The 405 ultrasound images could be combined with their corresponding mammograms to develop a multimodal breast cancer CAD system. We aim to increase the number of medical images in the dataset to help researchers working on breast cancer detection systems. We will develop a second version of the dataset by increasing the number of images to balance the classes and by improving the annotation.