Breast Invasive Ductal Carcinoma Classification on Whole Slide Images with Weakly-Supervised and Transfer Learning

Simple Summary In this study, we have trained deep learning models using transfer learning and weakly-supervised learning for the classification of breast invasive ductal carcinoma (IDC) in whole slide images (WSIs). We evaluated the models on four test sets: one biopsy (n = 522) and three surgical (n = 1129) achieving AUCs in the range 0.95 to 0.99. We have also compared the trained models to existing pre-trained models on different organs for adenocarcinoma classification and they have achieved lower AUC performances in the range 0.66 to 0.89 despite adenocarcinoma exhibiting some structural similarity to IDC. Therefore, performing fine-tuning on the breast IDC training set was beneficial for improving performance. The results demonstrate the potential use of such models to aid pathologists in clinical practice. Abstract Invasive ductal carcinoma (IDC) is the most common form of breast cancer. For the non-operative diagnosis of breast carcinoma, core needle biopsy has been widely used in recent years for the evaluation of histopathological features, as it can provide a definitive diagnosis between IDC and benign lesion (e.g., fibroadenoma), and it is cost effective. Due to its widespread use, it could potentially benefit from the use of AI-based tools to aid pathologists in their pathological diagnosis workflows. In this paper, we trained invasive ductal carcinoma (IDC) whole slide image (WSI) classification models using transfer learning and weakly-supervised learning. We evaluated the models on a core needle biopsy (n = 522) test set as well as three surgical test sets (n = 1129) obtaining ROC AUCs in the range of 0.95–0.98. The promising results demonstrate the potential of applying such models as diagnostic aid tools for pathologists in clinical practice.


Introduction
Breast cancer is one of the leading causes of global cancer incidence [1]. In 2020, there were 2,261,419 new cases (11.7% of all cancer cases) and 684,996 deaths (6.9% of all cancer related deaths) due to breast cancer. Among women, breast cancer accounts for one in four cancer cases and for one in six cancer deaths in the vast majority of countries (159 of 185 countries) [1].
Invasive ductal carcinoma (IDC) (or invasive carcinoma of no special type: ductal NST) is a heterogeneous group of tumors that fail to exhibit sufficient characteristics to achieve classification as a specific histopathological type. Microscopically, there are a wide variety of histopathological characteristics in IDCs. IDC grows in diffuse-sheets, well-defined nests, cords, or as individual (single) cells. Tubular differentiation tends to be well developed, barely detectable, or altogether absent.
Core needle biopsy is frequently used for the management of non-palpable mammogram abnormalities, as it is cost effective and provides an alternative to short-interval follow-up mammography. It is also generally favored over fine-needle aspiration biopsy (FNAB) for the non-operative diagnosis of breast carcinoma, and it could replace open breast biopsy provided that the quality assurance is acceptable [2,3]. Core needle biopsy allows the evaluation of histopathological features, making it possible to provide a definitive diagnosis of IDC and benign lesions (e.g., fibroadenoma) in over 90% of cases [4]. All these factors highlight the benefit of establishing a histopathological screening system based on core needle biopsy specimens for breast IDC patients. Glass slides of biopsy specimens can be digitised as whole slide images (WSIs) and could benefit from the application of computational histopathology algorithms to aid pathologists as part of a screening system.
In this paper, we trained a WSI breast IDC classification model using transfer learning from ImageNet and weakly-supervised learning. We have also evaluated on the test sets, without fine-tuning, models that had been previously trained on other organs for the classification of carcinomas.

Clinical Cases and Pathological Records
This is a retrospective study. A total of 2183 H and E (hematoxylin and eosin) stained histopathological specimens of human breast IDC and benign lesions-1154 core needle biopsy and 1028 surgical-were collected from the surgical pathology files of three hospitals: International University of Health and Welfare, Mita Hospital (Tokyo) and Kamachi Group Hospitals (consist of Shinkomonji and Shinkuki hospitals) (Fukuoka) after histopathological review of those specimens by surgical pathologists. The test cases were selected randomly, so the obtained ratios reflected a real clinical scenario as much as possible. All WSIs were scanned at a magnification of x20 using the same Leica Aperio AT2 scanner and were saved SVS file format with JPEG2000.
In addition, we collected 100 WSIs from The Cancer Genome Atlas (TCGA); however, only four benign cases were available.

Dataset
The pathologists excluded cases that were inappropriate or of poor scanned quality prior to this study. The diagnosis of each WSI was verified by at least two pathologists. Table 1 breaks down the distribution of dataset into training, validation, and test sets. Hospitals that provided histopathological cases were anonymised (e.g., Hospital 1-2). The training set was solely composed of WSIs of core needle biopsy specimens. The test sets were composed of WSIs of core needle biopsy or surgical specimens. The patients' pathological records were used to extract the WSIs' pathological diagnoses and to assign WSI labels. Out of the 191 WSIs with IDC, 96 WSIs were loosely annotated by pathologists. There were about seven annotations per WSI on average. We did not annotate on the carcinoma in situ areas, and some parts of the adjacent stromal area were included in the annotations in order to provide contextual information.
The rest of IDC and benign WSIs were not annotated and the training algorithm only used the WSI labels. Each WSI diagnosis was observed by at least two pathologists, with the final checking and verification performed by a senior pathologist. The senior pathologist only reviewed discordant cases between the two initial pathologists.

Deep Learning Models
We trained all the models using the partial fine-tuning approach [25]. This method consists of using the weights of an existing pre-trained model and only fine-tuning the affine parameters of the batch normalisation layers and the final classification layer. We have used the EfficientNetB1 architecture [26], as well as B3, with modified input sizes of 224 × 224 px and 512 × 512 px, starting with pre-trained weights from ImageNet. The total number of trainable parameters for EfficientNetB1 was only 63,329. The training method that we have used in this study is exactly the same as reported in a previous study [27] with the main difference being the use of a partial fine-tuning method. For completeness, we repeat the method here.
To apply the model on the WSIs for training and inference, we performed slide tiling by extracting fixed-sized tiles from tissue regions. We detected the tissue regions by performing a thresholding on a grayscale version of the WSI using Otsu's method [28], which allows the elimination of most of the white background. During inference, we performed the slide tiling in a sliding window fashion on the tissue regions, using a fixedsize stride that was half the size of the tile. During training, we initially performed random balanced sampling of tiles from the tissue regions, where we maintained an equal balance of positive and negative labelled tiles in the training batch. To do so, we placed the WSIs in a shuffled queue with oversampling of the positive labels to ensure that all the WSIs were seen at least once during each epoch, and we looped over the labels in succession (i.e., we alternated between picking a WSI with a positive label and a negative label). Once a WSI was selected, we randomly sampled batch size 2 tiles from each WSI to form a balanced batch. We then switched into hard mining of tiles. To perform the hard mining, we alternated between training and inference. During inference, the CNN was applied in a sliding window fashion on all of the tissue regions in the WSI, and we then selected the k tiles with the highest probability of being positive. If the tile is from a negative WSI, this step effectively selects the false positives. The selected tiles were placed in a training subset, and once the subset size reached N tiles, a training pass was triggered. We used k = 4, N = 256, and a batch size of 32.
A subset of WSIs with IDC were loosely annotated (n = 96) while the rest had WSIlevel labels only (n = 95). From the loosely annotated WSIs, we only sampled tiles from the annotated tissue regions. Otherwise, we freely sampled tiles from the entire tissue region.
The models were trained on WSIs at ×10 and ×20 magnifications. We used two input tile sizes: 512 × 512 px and 224 × 224 px. The strides were half the tile sizes. The WSI prediction was obtained by taking the maximum probability from all of the tiles.
We trained the models with the Adam optimisation algorithm [29] with the following parameters: beta 1 = 0.9, beta 2 = 0.999. We used a learning rate of 0.001. We applied a learning rate decay of 0.95 every 2 epochs. We used the binary cross entropy loss. We used early stopping by tracking the performance of the model on a validation set, and training was stopped automatically when there was no further improvement on the validation loss for 10 epochs. The model with the lowest validation loss was chosen as the final model.

Software and Statistical Analysis
The deep learning models were implemented and trained using TensorFlow [30]. AUCs were calculated in python using the scikit-learn package [31] and plotted using matplotlib [32]. The 95% CIs of the AUCs were estimated using the bootstrap method [33] with 1000 iterations.

A Deep Learning Model for WSI Breast IDC Classification
The purpose of this study was to train a deep learning model to classify breast IDC in WSIs. We had a total of 1154 biopsy WSIs of which we used 632 for training and 522 for testing. In addition, we used 1129 surgical WSIs obtained from three sources as part of supplementary test sets. We used a transfer learning (TL) approach based on partial finetuning [25] to train the models. Figure 1 shows an overview of our training method. We then evaluated the trained models on four tests sets: one biopsy test set and three surgical test sets. We refer to the trained models as TL <magnification> <tile size> <model size>, based on the different configurations. Training consisted of two stages: random sampling and hard mining. In the first stage (b) we randomly sampled tiles from the positive and negative WSIs, restricting the sampling from WSIs that had annotations if they were positive. In the second stage (c) We iteratively alternated between inference and training, relying only on the WSI label. During inference, the model weights were frozen, and it was applied in a sliding window fashion on each WSI. The top k tiles with the highest probabilities were then selected from each WSI. During training the selected tiles were then used to train the model.
As we had at our disposal six models [18,27,[34][35][36][37] that had been trained specifically on specimens from different organs (stomach, colon, lung, and pancreas), we evaluated those models without fine-tuning on the test sets to investigate whether morphological cancer similarities transfer across organs without additional training. Table 1 breaks down the distribution of the WSIs in each test set. For each test set, we computed the ROC AUC and log loss, and we have summarised the results in Table 2 and Figures 2 and 3. Figures 4-7 show representative heatmap prediction outputs for true positive, false positive, and false negative. Table 3 shows a confusion matrix breakdown by subtype for the false positives and true negatives using a probability threshold of 0.5. All 10 false positive WSIs were fibroadenomas. Figure 8 shows an overview of representative fibroadenoma histopathology of 10 cases (WSIs) that were falsely predicted as IDC. There were representative histopathologic changes (e.g., proliferative epithelial changes, fibrocystic epithelial canges, and stromal changes) [38] in falsely predicted fibroadenomas ( Figure 8); the proliferative findings could be the potential cause of the false positive.

Discussion
In this study, we trained deep learning models for the classification of breast IDC in surgical and biopsy WSIs. We used weakly-supervised and transfer learning. We used the partial fine-tuning approach, which is fast to train. The best model achieved AUCs in the range of 0.96-0.98.
Overall, the EfficientNetB1 model trained at magnification ×10 achieved slightly better results than ×20 on the biopsy test set. In addition, using a larger tile size of 512 × 512 px achieved slightly better results than 224 × 244 px. Despite IDC morphology having some similarities with adenocarcinoma (ADC) [39], the application of models that classify ADC on other organs did not fully generalise to IDC. They have achieved lower AUC performances in the range 0.66 to 0.89. The stomach ADC had the highest AUC 0.85-0.89 when applied to breast IDC WSIs. While the results on the TCGA test set are high, it does not provide a proper evaluation in terms of potential false positives as there were only four benign cases.
All of the false positive cases in the biopsy test set were fibroadenomas (see Table 3). Fibroadenomas exhibit a wide range of morphology [38], and it could be that the variety was not fully represented in the training set, which only had 91 cases of fibroadenomas compared to the 131 in the test set. The false positive predictions with fibroadenomas are most likely due to the enlarged spindle shaped stromal cell nuclei with pleomorphism and tubules composed of cuboidal or low columnar cells with round uniform nuclei resting on a myoepithelial cell layer. This is morphologically analogous to invading single cells, ductular structures, and cancer stroma in IDC (see Figure 8).
One source of difficulty in creating a balanced set of the fibroadenomas varieties is that the diagnostic reports did not include a detailed description of fibroadenoma histology, making a simple random partition the only option. In addition, the test set had a larger proportion of fibroadenomas compared to other benign subtypes. Therefore, in future work, it would be important to investigate the histopathological typing of fibroadenomas in order to develop better deep learning models.
While in this study we relied on performing classification from WSI histopathology, in some cases, pathologists make use of immunohistochemistry markers to further confirm a diagnosis and guide treatment decisions. In particular, markers that allow for myoepithelial differentiation are useful for distinguishing between IDC and benign proliferations such as fibroadenoma [40]. This is because IDCs lack the myoepithelial cells that normally surround benign breast glands.
According to the guideline by General Rule Committee of the Japanese Breast Cancer Society [41], the pathological diagnosis of IDC is sufficient for core needle biopsy. Therefore, the application of a deep learning model, once properly validated, in a clinical setting would help pathologists in their diagnostic workflows potentially serving as a second reader during the screening process. It could also be used to sort cases in order of priority for review by the pathologists. On the other hand, surgical specimens tend to require further subtyping of IDC, so future work could look into developing models specifically for IDC subtype classification for surgical specimens.

Conclusions
In this study, we have trained deep learning models at two magnifications, ×10 and ×20, using transfer learning and weakly supervised learning for the classification of breast IDC in WSIs. We evaluated the models on four test sets (one biopsy and three surgical) achieving AUCs in the range 0.95 to 0.99. We have also compared the trained models to existing pre-trained models on different organs for adenocarcinoma classification and they have achieved lower AUC performances in the range 0.66 to 0.89 despite adenocarcinoma exhibiting some structural similarity to IDC. Therefore, performing fine-tuning on the breast IDC training set was beneficial for improving performance. Informed Consent Statement: Informed consent to use histopathological samples and pathological diagnostic reports for research purposes had previously been obtained from all patients prior to the surgical procedures at all hospitals, and the opportunity for refusal to participate in research had been guaranteed by an opt-out manner.
Data Availability Statement: Due to specific institutional requirements governing privacy protection, datasets used in this study are not publicly available. The TCGA data were obtained from TCGA-BRCA project and is available from https://www.cancer.gov/tcga, accessed on 6 June 2019.