Nuclei-Guided Network for Breast Cancer Grading in HE-Stained Pathological Images

Breast cancer grading methods based on hematoxylin-eosin (HE) stained pathological images can be summarized into two categories. The first category is to directly extract the pathological image features for breast cancer grading. However, unlike the coarse-grained problem of breast cancer classification, breast cancer grading is a fine-grained classification problem, so general methods cannot achieve satisfactory results. The second category is to apply the three evaluation criteria of the Nottingham Grading System (NGS) separately, and then integrate the results of the three criteria to obtain the final grading result. However, NGS is only a semiquantitative evaluation method, and there may be far more image features related to breast cancer grading. In this paper, we propose a Nuclei-Guided Network (NGNet) for breast invasive ductal carcinoma (IDC) grading in pathological images. The proposed nuclei-guided attention module plays the role of nucleus attention, so as to learn more nuclei-related feature representations for breast IDC grading. In addition, the proposed nuclei-guided fusion module in the fusion process of different branches can further enable the network to focus on learning nuclei-related features. Overall, under the guidance of nuclei-related features, the entire NGNet can learn more fine-grained features for breast IDC grading. The experimental results show that the performance of the proposed method is better than that of state-of-the-art methods. In addition, we release a well-labeled dataset with 3644 pathological images for breast IDC grading. This dataset is currently the largest publicly available breast IDC grading dataset and can serve as a benchmark to facilitate a broader study of breast IDC grading.


Introduction
Breast invasive ductal carcinoma (IDC) is the most widespread type of breast cancer, making up approximately 80% of all diagnosed cases. Histological grading has direct guiding significance for the prognostic evaluation of IDC. The most popular grading scheme is the Nottingham Grading System (NGS) [1], which gives a more objective assessment than previous grading systems. NGS includes three semi-quantitative criteria: mitotic count, nucleus atypia, and tubular formation. However, in clinical practice, the burden of pathological diagnosis is very heavy, and many pathologists cannot accurately grasp NGS. This greatly weakens the guiding significance of histological grading for clinical prognosis evaluation and may even mislead the clinical judgment of prognosis. Therefore, there is an urgent need for an automatic and accurate pathological grading method.
The automatic breast cancer grading methods based on pathological images can be summarized into two categories. The first category is to use machine-learning or deep-learning methods to directly extract the features of the pathological image for breast cancer grading. However, unlike the coarse-grained problem of breast cancer classification, IDC grading is a fine-grained classification problem. Using only general methods cannot classify IDC well, because the classification boundaries among low-, intermediate-, and high-grade IDC pathological images are blurred.
The second category is to compute the three evaluation criteria of NGS separately and then integrate those results to obtain the final IDC grading result. However, NGS is only a semiquantitative evaluation method. The inherent medical motivation of NGS is to classify IDC based on the morphological and texture characteristics of the cell nucleus and the topological structure of the cell population. With the end-to-end advantage of deep learning, not only can the medical goal of emphasizing nuclei-related features be achieved, but more fine-grained feature representations of pathological images that are too abstract for pathologists to understand can also be learned.
In this paper, we propose a Nuclei-Guided Network (NGNet) for IDC grading in hematoxylin-eosin (HE) stained pathological images. Specifically, our network includes two branches. The main branch is used to extract the feature representation of the entire pathological image, and the nuclei branch is used to extract the feature representation of the nuclei image. Then, the nuclei-guided attention module between the two branches plays the role of nucleus attention in end-to-end learning, so that more nuclei-related feature representations for IDC grading can be learned. In addition, the proposed nuclei-guided fusion module in the fusion process of two branches can further enable the network to focus on learning nuclei-related features. Overall, under the guidance of nuclei-related features, the entire NGNet can learn more fine-grained features for breast IDC grading. It should be pointed out that this is different from the general attention mechanism [2][3][4] that cannot artificially emphasize the region of interest.
Experimental results show that the proposed NGNet significantly outperforms the state-of-the-art method, achieving 93.4% average classification accuracy and 0.93 AUC with our released dataset. In addition, we release a new dataset containing 3644 pathological images with different magnifications (20× and 40×) for evaluating IDC grading methods. Compared with the previous publicly available breast cancer grading dataset with only 300 images in total, the number of images in our dataset has increased by an order of magnitude. The dataset is publicly available from https://github.com/YANRUI121/Breast-cancer-grading (accessed on 1 April 2022).

Related Works
Recently, the application of deep learning has enabled breast cancer pathological image classification to achieve high performance. However, breast cancer classification is not enough for the final medical diagnosis: the classification must be refined to the level of the pathological grade of the cancer, because the gold standard of the final medical diagnosis, the choice of treatment plan, and the prediction of patient outcome are all based on the results of the pathological grade.
The classification boundaries among low-, intermediate-, and high-grade IDC pathological images are ambiguous; thus, general methods cannot classify the IDC grade well. The current IDC grading methods can be divided into two categories. The first category is to classify the features extracted directly from the pathological image. The second category is to first calculate the three evaluation criteria of NGS, (1) mitotic count [5][6][7][8], (2) nucleus atypia [9,10], and (3) tubular formation [11][12][13], and then artificially integrate these three criteria to obtain the final result. Figure 1 is a brief description of NGS. By analyzing the three evaluation criteria of NGS, we observe that nuclei-related features are very important for breast cancer pathological diagnosis. Specifically, mitotic count and nucleus atypia are concerned with the morphological and texture characteristics of the cell nucleus, whereas tubular formation is concerned with the topological structure of the cell population. Because we are primarily concerned with end-to-end breast cancer grading studies, we will only briefly introduce the related works of the first category in the following. Before the era of deep learning, research on breast cancer pathological image grading was mainly based on traditional machine-learning methods. For example, Doyle et al. 
[14] proposed a novel method to classify low- and high-grade breast cancer histopathological images by using architectural features. Naik et al. [15] classified low- and high-grade breast cancer by using a combination of low-level, high-level, and domain-specific information. They first segmented glands and nuclei; then, morphological and architectural attributes derived from the segmented glands and nuclei were used to discriminate low-grade from high-grade breast cancer. Basavanhally et al. [16] constructed a multifield-of-view classifier with robust feature selection for classifying ER+ breast cancer pathological images. Their grading system can distinguish low- vs. high-grade patients well, but fails to distinguish low- vs. intermediate- and intermediate- vs. high-grade patients well.
Deep learning has made great progress in breast cancer pathological image grading. The most representative work was proposed by Wan et al. [17]. They integrated semantic-level features extracted from a convolutional neural network (CNN), pixel-level texture features, and object-level architecture features to classify low-, intermediate-, and high-grade breast cancer pathological images. The method achieved an accuracy of 0.92 for low vs. high, 0.77 for low vs. intermediate, and 0.76 for intermediate vs. high, and an overall accuracy of 0.69 when discriminating all three grades of breast cancer pathological images. Our preliminary work, which showed that using only deep learning can achieve better grading performance, was published in BIBM2020 [18]. Compared to that work, we put forward new contributions in nuclei-guided branch fusion and further disclose one of the largest IDC grading datasets.
In the field of computer vision, there are many excellent networks based on attention mechanisms, such as SENet [19], Position and Channel Attention [20], CBAM [4], Criss-Cross Attention [21], and Self-Attention [22,23]. SENet [19], short for Squeeze-and-Excitation Networks, adaptively recalibrates channel-wise feature responses by explicitly modeling the interdependence between channels; in other words, the correlation between channels is learned. The Convolutional Block Attention Module (CBAM) [4] combines spatial and channel attention mechanisms, which can achieve better results than SENet's channel-only attention. Because CBAM is a lightweight general module, it can be integrated into any CNN architecture with negligible overhead and can be trained end-to-end together with the base CNN. The Transformer is a deep neural network based on the self-attention mechanism, which has been considered a viable alternative to convolutional and recurrent neural networks. 
In the field of computer vision, Vision Transformer (ViT), proposed by Dosovitskiy et al. [24], is a pioneering work. Following the paradigm of ViT, a series of ViT variants [25,26] have been proposed to improve performance. The complexity of ViT-like models is very high, so they need very large training datasets. Therefore, there are still few applications of ViT-like models in pathological image analysis, especially for breast cancer grading tasks that are difficult to label manually. The above-mentioned attention mechanisms are adaptively learned from the data: the network attends to the areas the algorithm judges important, but it is not possible to customize, based on prior knowledge, the areas on which the algorithm focuses attention. A more comprehensive review of attention mechanisms can be found in [27,28]. Our proposed network can focus on a specific area; this is different from the general attention mechanism, which cannot artificially emphasize a region of interest. This provides a new paradigm for embedding medical prior knowledge into algorithms.

Dataset
Deep-learning methods have an important dependence on well-labeled datasets such as the BreaKHis dataset [29], the Yan et al. dataset [30], and the BACH dataset [31]. However, due to the difficulty of the IDC grading task, there are few related works. To the best of our knowledge, only Kosmas et al. [32] have released an IDC grading dataset, containing 300 pathological images, which is insufficient for deep-learning research. In this work, we cooperated with Peking University International Hospital to release a new benchmark dataset for IDC grading. We conducted experiments on these two datasets to comprehensively verify the effectiveness of our proposed NGNet method. Next, we introduce these two datasets.

IDC Pathological Images Dataset
The dataset released by Kosmas et al. [32] includes 300 images (107 Grade1 images, 102 Grade2 images, and 91 Grade3 images). All images were acquired at 40× magnification. Although this released dataset has played a significant role in the IDC grading research, 300 images are not enough for the deep-learning method.
To meet the needs of deep-learning research, we cooperated with Peking University International Hospital to release a new IDC grading dataset. Our annotated HE-stained pathological image dataset consists of 3644 pathological images (1000 × 1000 pixels). Figure 2 is an example of the images and a summary of the dataset. We named it the PathoIDCG dataset, an abbreviation of the Pathological Image Dataset for Invasive Ductal Carcinoma Grading. The overall description of the PathoIDCG dataset is shown in Table 1. The preparation procedure used in our research is the standard paraffin process, which is widely used in routine clinical practice. The thickness of the pathological sections is 3-5 µm. Each image is labeled Grade1, Grade2, or Grade3 according to the three evaluation criteria of NGS. Image annotation was independently performed by two pathologists in strict accordance with NGS standards, and images with different annotations were reannotated.
Our dataset is mainly acquired under a 20× magnified field of view, because a 20× magnified pathological image can contain more information about the topology of the cell population. Another reason is that the commonly available 20× slides are easier to obtain, and current cell nucleus segmentation technology can also segment pathological images under 20× magnification. At the same time, we also collected pathological images at 40× magnification, because a larger magnification can better reflect the texture and morphological characteristics of individual nuclei.

Nuclei Segmentation Dataset
The dataset released by Kumar et al. [33] includes HE-stained pathological images with 21,623 annotated nucleus boundaries, and Figure 3 is an example of this dataset. Kumar et al. [33] downloaded 30 whole-slide pathological images of several organs from The Cancer Genome Atlas (TCGA) [34] and used only one WSI per patient to maximize nuclear appearance variation. In addition, these images come from 18 different hospitals, which makes the dataset sufficiently diverse. It is important to emphasize that although we only segmented the nuclei of breast cancer pathological images, our segmentation model was trained on pathological images of all seven organs: breast, liver, colon, prostate, bladder, kidney, and stomach. For the above reasons, our segmentation model is more robust and generalizable.

Methods
The key idea of NGNet is shown in Figure 4. Our method consists of two stages: in the first stage, we segmented the nucleus of each pathological image to obtain all images that only contain the nucleus region. In the second stage, two images (original pathological image and corresponding nuclei image) are input at the same time and sent to the NGNet to obtain the final classification result.

Nuclei Segmentation
We use DeepLabV3+ [35] as our nuclei segmentation network because it can better address the following challenges. In the HE-stained pathological image, some cell nuclei are very large, whereas some are very small. Moreover, under different magnifications, such as 20× and 40×, the difference in the size of the nucleus is more significant. Therefore, our network is required to be able to use multiscale image features, especially to be able to reconstruct the information of small objects. At the same time, many overlapping nuclei boundaries make nuclei segmentation more difficult, so the segmentation algorithm is required to have the ability to reconstruct nuclei boundaries.
Given a pathological image, the output of DeepLabV3+ is a nuclei segmentation mask. The backbone of the DeepLabV3+ algorithm we applied is Xception [36]. We achieved the best experimental results at 100,000 training steps. The atrous rates we used are 6, 12, and 18, and we adopt an output stride equal to 16. Here, we denote the output stride as the ratio of the input image spatial resolution to the final output resolution.
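As a quick sanity check on these settings, the effective spatial extent of an atrous (dilated) convolution is k_eff = k + (k − 1)(r − 1). The sketch below (plain Python) assumes the rates apply to 3 × 3 kernels, as in the standard DeepLabV3+ ASPP design; it is illustrative arithmetic, not the paper's code:

```python
def effective_kernel_size(k: int, rate: int) -> int:
    """Effective spatial extent of a k x k convolution with atrous rate `rate`."""
    return k + (k - 1) * (rate - 1)

def feature_map_size(input_size: int, output_stride: int) -> int:
    """Spatial size of the final feature map for a given output stride."""
    return input_size // output_stride

# ASPP branches with atrous rates 6, 12, and 18 applied to 3x3 kernels
for rate in (6, 12, 18):
    print(rate, effective_kernel_size(3, rate))   # 6 13 / 12 25 / 18 37

# With output stride 16, a 1000 x 1000 pathological image yields a 62 x 62 map
print(feature_map_size(1000, 16))   # 62
```

The growing effective kernel sizes (13, 25, 37 pixels) show how the three rates cover nuclei of very different sizes in one forward pass.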

NGNet Architecture
The overall network architecture is shown in Figure 5. The proposed NGNet has two inputs [I_main, I_nuclei]. The input to the main branch is the original pathological image I_main, and the input to the guide branch is the image I_nuclei containing only the nuclei. The relationship between the two inputs is

I_nuclei = I_main ⊗ S,

where S is the nuclei segmentation result corresponding to the original pathological image and ⊗ denotes elementwise multiplication.
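The relationship above amounts to masking the original image with its segmentation result. A minimal NumPy sketch (array names and toy shapes are ours, not the paper's code):

```python
import numpy as np

def nuclei_image(i_main: np.ndarray, s: np.ndarray) -> np.ndarray:
    """I_nuclei = I_main * S: keep pixels inside the nuclei mask, zero out the rest."""
    # s is a binary H x W mask; broadcast it over the colour channels of i_main
    return i_main * s[..., np.newaxis]

rng = np.random.default_rng(0)
i_main = rng.random((4, 4, 3))   # toy HE image patch (H, W, 3)
s = np.zeros((4, 4))             # binary nuclei segmentation mask
s[1:3, 1:3] = 1.0                # a single "nucleus" region
i_nuclei = nuclei_image(i_main, s)
```

Background pixels of `i_nuclei` are zero, while pixels inside the mask keep their original values, which is exactly the guide-branch input described above.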
The guide branch and main branch contain the same number of convolutional layers. Between the corresponding convolutional layers of the two branches, the Nuclei-Guided Attention (NGA) module transfers the nuclei-related features of the guide branch to the main branch. On top of the last convolutional layer of each branch, the feature maps F^M_main(n) and F^M_nuclei(n) are flattened into feature vectors P^M_main(n) and P^M_nuclei(n), respectively, where M denotes the number of convolutional layers in each branch and n indexes the n-th feature map. The feature vectors P^M_main(n) and P^M_nuclei(n) are then passed through the Nuclei-Guided Fusion (NGF) module to obtain the fused feature representation. Finally, the grading result is obtained through the multilayer perceptron (MLP) module.
The following is a detailed introduction to the NGA and NGF modules. The specific implementation of the NGA module can be illustrated by the example of the "Guide 21" step in NGNet, as shown in Figure 6. Given a pathological image I, F^m_main(I_main) and F^m_nuclei(I_nuclei) denote the convolutional feature maps from the m-th convolutional layer of the main branch and guide branch, respectively. In each corresponding convolutional layer, the guide branch extracting nuclei features produces a guide block F^m_guide(I_nuclei) pointing to the main branch extracting pathological image features. The entire NGNet is trained end-to-end.

Figure 6. Detailed schematic diagram of the nuclei-guided attention module in the proposed NGNet; the example comes from the "Guide 21" step.

We first perform a 1 × 1 convolution on the feature maps of the corresponding nuclei block F^m_nuclei(I_nuclei), in which the input and output dimensions are equal. The Softmax activation function is then used to generate the attention map A^m, so the values of the feature map are adjusted to between 0 and 1. Then, we perform elementwise multiplication with the feature map of the corresponding main branch F^m_main(I_main), thereby increasing the weight of the important areas of the feature map. The purpose of this is to focus on the features related to the nuclei. Specifically, we calculate the attention map A^m and guide block F^m_guide(I_nuclei) as follows:

A^m = Softmax(Conv_1×1(F^m_nuclei(I_nuclei))),
F^m_guide(I_nuclei) = A^m ⊗ F^m_main(I_main),

where Softmax(·) is the Softmax activation function, Conv_1×1(·) is a 1 × 1 convolution operation, and ⊗ denotes elementwise multiplication. 
At the end of each NGA module, an elementwise addition ⊕ is performed:

F^m_fuse(I) = F^m_main(I_main) ⊕ F^m_guide(I_nuclei),

where F^m_fuse(I) denotes the feature maps guided by nuclei-related features from the m-th convolutional layer.
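Putting the NGA pieces together, a minimal NumPy sketch follows. The 1 × 1 convolution is written as a per-pixel channel mixing, and applying the softmax over the channel axis is our assumption (the paper only states the values are mapped into (0, 1)); all names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nga(f_main, f_nuclei, w):
    """Nuclei-guided attention for one layer m.
    f_main, f_nuclei: (H, W, C) feature maps; w: (C, C) 1x1-convolution weights.
    A = Softmax(Conv_1x1(F_nuclei)); guide = A (x) F_main; fuse = F_main (+) guide."""
    # 1x1 convolution = linear map over channels at each spatial position
    a = softmax(np.einsum('hwc,cd->hwd', f_nuclei, w), axis=-1)  # values in (0, 1)
    guide = a * f_main    # elementwise multiplication with the main-branch features
    return f_main + guide # elementwise addition closing the NGA module

rng = np.random.default_rng(1)
f_main = rng.random((8, 8, 4))
f_nuclei = rng.random((8, 8, 4))
w = rng.standard_normal((4, 4))
fused = nga(f_main, f_nuclei, w)
```

Because the attention values lie in (0, 1) and the residual addition keeps F^m_main, the fused map can only amplify, never erase, main-branch activations.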
The NGF module (see Figure 5) is inspired by the self-attention mechanism, which can capture various dependencies within a sequence (e.g., short-range and long-range dependencies). The self-attention mechanism is implemented via the Query-Key-Value (QKV) model. Given a sequence and its packed matrix representations Q, K, and V, the scaled dot-product attention is given by

Att(Q, K, V) = Softmax(QK^T / √d_k) V = AV,

where d_k is the dimension of the key, and A is often called the attention matrix, which computes the similarity scores of the QK pairs. Different from standard self-attention, in which Q, K, and V come from the same input sequence, our Q_nuclei is the feature vector from the guide branch, and K_main, V_main are the feature vectors from the main branch. Therefore, the Q_nuclei K_main similarity we calculate represents the similarity between the nuclear features and the original pathological image features. The similarity score of Q_nuclei K_main is then mapped to V_main, allowing the network to pay more attention to the nuclei-related features. The Q^l_nuclei K^l_main V^l_main calculation can be performed one or more times (L); here we set L = 3. In addition, we also add a residual connection between V^l_main and Att(Q^l_nuclei, K^l_main, V^l_main) to preserve the information of the main branch. At the end of the NGF module, we obtain the fused feature representation P of the guide and main branches:

P = Att(Q^L_nuclei, K^L_main, V^L_main) ⊕ V^L_main.

To get the final classification result, P is flattened into a vector and then passed through the fully connected layer. The loss function for NGNet is defined as the cross-entropy (CE) loss:

L_CE = −Σ_z Σ_k q^z_k log(p^z_k),

where q^z_k and p^z_k indicate the ground truth and prediction probability of the z-th image for the k-th class.
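The nuclei-guided cross-attention step can be sketched in NumPy as below. How the values are threaded through the L = 3 stacked steps is our assumption (the paper does not spell that detail out), and all names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ngf_step(q_nuclei, k_main, v_main):
    """One QKV step: scaled dot-product cross-attention plus a residual connection.
    q_nuclei: (n, d) queries from the guide branch; k_main, v_main: (n, d) from the main branch."""
    d_k = k_main.shape[-1]
    a = softmax(q_nuclei @ k_main.T / np.sqrt(d_k), axis=-1)  # Q_nuclei-K_main similarity
    return a @ v_main + v_main   # residual preserves main-branch information

def ngf(q_nuclei, k_main, v_main, L=3):
    # L = 3 steps as in the paper; feeding each step's output back in as the new
    # values is our assumption, not a detail stated in the text
    out = v_main
    for _ in range(L):
        out = ngf_step(q_nuclei, k_main, out)
    return out

rng = np.random.default_rng(2)
q = rng.random((5, 8))   # 5 guide-branch feature vectors of dimension 8
k = rng.random((5, 8))
v = rng.random((5, 8))
p = ngf(q, k, v)
```

Because the queries come from the nuclei branch while keys and values come from the main branch, each attention row reweights the main-branch features by their similarity to the nuclei features, which is the fusion behaviour described above.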
It should be emphasized that our method is universal and can be easily generalized to another task that needs to emphasize a certain local area (such as a lesion) in the model. First, determine the image area of interest through prior knowledge and segment this area. Then, our algorithm framework can model this particular part of attention into the algorithm through end-to-end learning. The design of this network structure provides an end-to-end modeling methodology for custom attention.

Results and Discussion
In this section, we evaluate the performance of NGNet. We randomly selected 80% of the dataset to train and validate the model; the remaining 20% was used for testing. All experiments in this paper were run on three NVIDIA GPUs using the Keras framework with a TensorFlow backend. We mainly use the average accuracy to evaluate the performance of NGNet. Apart from the average accuracy, the classification performance of an algorithm can be further evaluated using the sensitivity, specificity, confusion matrix, and AUC. The accuracy, sensitivity, and specificity metrics are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Sensitivity = TP / (TP + FN),
Specificity = TN / (TN + FP),

where TP (TN) denotes the number of true positive (true negative) classified pathological images, and FP (FN) denotes the number of false positive (false negative) classified pathological images.
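In plain Python, the three metrics follow directly from the four counts (toy counts below, not the paper's results):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all images classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """True positive rate: correctly detected positives over all positives."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: correctly rejected negatives over all negatives."""
    return tn / (tn + fp)

# toy counts with one grade treated as the positive class
print(accuracy(90, 85, 10, 15))        # 0.875
print(round(sensitivity(90, 15), 3))   # 0.857
print(round(specificity(85, 10), 3))   # 0.895
```

For the three-class grading task, sensitivity and specificity are computed per grade in a one-vs-rest fashion and then averaged.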

Comparison of the Accuracy with Previous Methods
To verify the effectiveness of the method, we conducted comprehensive comparative experiments. For the three-class classification, our method achieved 93.4% average accuracy on the PathoIDCG dataset (see Table 2). The morphological differences between grade 1 (G1) and grade 2 (G2), as well as between grade 2 (G2) and grade 3 (G3), are very subtle, so these pairs are difficult to distinguish. This problem is reflected in our experimental results and previous studies; for this reason, previous studies have focused only on the classification of G1 and G3. We made comprehensive comparisons with previous state-of-the-art studies and the classic CNNs ResNet50 [37] and Xception [36]; the experimental results are shown in Table 2. The results show that our method achieves good classification accuracy in each category. However, only 94.1% and 93.9% accuracy are achieved on G1 vs. G2 and G2 vs. G3, respectively. Compared with the classification results of these two difficult categories, the classification accuracy of G1 vs. G3 is much better, reaching 97.8%.

Confusion Matrix and AUC
We conduct experiments on the PathoIDCG dataset to comprehensively evaluate the performance of our method. The confusion matrix of the predictions of the proposed NGNet on the test set is presented in Figure 7. Figure 8 shows a mean area under the curve (AUC) of 0.93, corresponding to per-class AUCs of 0.94, 0.91, and 0.93 based on receiver operating characteristic analysis.

As seen from the experimental results in Figures 7 and 8, the results obtained on G1 vs. G2 and G2 vs. G3 are not as good as those on G1 vs. G3. This further illustrates that the classification bottleneck lies in learning more discriminative features for similar categories.
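For reference, the per-class AUC values reported above can be computed from prediction scores and binary (one-vs-rest) labels via the rank-sum identity; a generic numpy sketch, not our evaluation script:

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) identity: the probability
    that a randomly chosen positive sample receives a higher score
    than a randomly chosen negative one.  labels: 0/1, scores: floats."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()   # correctly ranked pairs
    ties = (pos[:, None] == neg[None, :]).sum()  # ties count half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# → 0.75 (one of the four positive/negative pairs is mis-ranked)
```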

Nuclei Segmentation Results
To select a suitable method for nucleus segmentation, we compare three methods: watershed, UNet [38], and DeepLabV3+ [35]. Watershed is the most representative traditional image processing method; the implementation we used in the experiment is Fiji [39]. We also conducted experiments with the representative deep-learning methods UNet [38] and DeepLabV3+. As can be seen from Figure 9, DeepLabV3+ is suitable for our cell nucleus segmentation task and achieved satisfactory results.

We performed only a visual qualitative analysis of the segmentation results, shown in Figure 9. Because we do not have ground-truth nuclei segmentations for the PathoIDCG dataset, we did not use traditional quantitative metrics such as mean intersection over union (mIoU) to measure segmentation quality. Our segmentation network is trained on the well-annotated dataset proposed by Kumar et al. [33]; once trained, it is applied directly to segment the IDC grading dataset. Moreover, traditional metrics cannot capture the kind of segmentation we need: a slightly larger segmentation that includes some background around the nucleus border may actually be preferable, whereas a segmentation that misses a large number of nuclei is very poor.
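For example, the preference for slightly larger masks can be approximated by dilating a predicted binary mask by a pixel or two. A pure-numpy sketch follows (in practice one would use, e.g., scipy.ndimage.binary_dilation; this helper is hypothetical):

```python
import numpy as np

def dilate_mask(mask, iterations=1):
    """Grow a binary nucleus mask by one pixel per iteration using
    4-neighbour shifts, so the mask slightly overlaps the nucleus
    border.  Pure-numpy stand-in for scipy.ndimage.binary_dilation."""
    m = mask.astype(bool)
    for _ in range(iterations):
        grown = m.copy()
        grown[1:, :] |= m[:-1, :]    # neighbour below
        grown[:-1, :] |= m[1:, :]    # neighbour above
        grown[:, 1:] |= m[:, :-1]    # neighbour to the right
        grown[:, :-1] |= m[:, 1:]    # neighbour to the left
        m = grown
    return m.astype(mask.dtype)

# A single seed pixel grows into a 5-pixel cross after one iteration.
seed = np.zeros((3, 3))
seed[1, 1] = 1
grown = dilate_mask(seed)
```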

Grad-CAM Visualization
Gradient-weighted class activation mapping (Grad-CAM) is a method proposed by Selvaraju et al. [40] to produce visual explanations (heat maps) of decisions, making CNN-based methods more transparent and explainable. Grad-CAM generates a coarse localization map highlighting the image areas that are important for a prediction. The method considers only the pixels and locations that have a positive impact on the classification result, since these are the locations we care about.
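The core Grad-CAM computation is compact; a numpy sketch of the definition (in a real pipeline, the activations and gradients come from the deep-learning framework, e.g., Keras):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Core Grad-CAM: channel weights are the spatially averaged
    gradients of the class score; the map is the ReLU of the
    weighted sum of activation channels.  activations, gradients:
    (H, W, C) arrays from the last convolutional layer."""
    weights = gradients.mean(axis=(0, 1))                 # alpha_k
    cam = np.tensordot(activations, weights, axes=([2], [0]))
    cam = np.maximum(cam, 0)                              # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()                             # normalize to [0, 1] for display
    return cam

# Toy example: only channel 0 supports the class.
acts = np.ones((2, 2, 2))
grads = np.zeros((2, 2, 2))
grads[:, :, 0] = 1.0
cam = grad_cam(acts, grads)   # uniform positive map
```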
In this section, we use the Grad-CAM method to visualize the pathological image regions that support a particular classification result. We compare the Grad-CAM results of NGNet with those of VGG16, as shown in Figure 10. The results show that NGNet focuses more on nuclei-related areas and can further refine nuclei-related feature representations. As shown in the pathological images and corresponding heat maps in Figure 10, attention focuses not only on nuclei-related areas but also on gland-related nucleus areas. This is consistent with the medical knowledge underlying NGS.
Clinically, breast cancer grading is performed by pathologists using NGS, and one of its key evaluation criteria is the formation of glands (tubular formation).


Ablation Study
To evaluate the effectiveness of each component in our proposed method, we conducted an ablation study. The experimental results on the test set are shown in Table 3. The experimental hyperparameters were as follows: the loss function is categorical cross-entropy, the learning rate is 0.00002, the optimizer is RMSProp, and a total of 300 training epochs are performed.

We conducted comparative experiments on accuracy, sensitivity, specificity, and AUC. First, because our single-branch network structure is similar to VGG16, we compare the classification performance of NGNet and VGG16; NGNet achieves much better results than VGG16 alone. Then, we compare NGNet under different configurations. Even with a simple fusion of pathological images and nuclear images, that is, NGNet without the nuclei-guided attention (NGA) and nuclei-guided fusion (NGF) modules, NGNet achieves better results. The best results are obtained after adding both the NGA and NGF modules. Specifically, compared with NGNet without the NGA and NGF modules, the NGA and NGF modules bring AUC improvements of 0.01 and 0.03, respectively. Using the NGA and NGF modules at the same time, that is, our proposed NGNet, brings an AUC improvement of 0.04. These results fully demonstrate the advantages of the NGA and NGF modules in NGNet and show that each module is indispensable.
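For clarity, the loss used in these experiments, categorical cross-entropy, reduces to a one-line formula; a numpy sketch, with the reported hyperparameters collected into an assumed configuration dictionary (Keras naming is our assumption):

```python
import numpy as np

# Training configuration reported above (names assumed, Keras-style).
CONFIG = {"loss": "categorical_crossentropy",
          "optimizer": "rmsprop",
          "learning_rate": 2e-5,
          "epochs": 300}

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean categorical cross-entropy over a batch of one-hot labels:
    -sum_c y_c * log(p_c), averaged over samples; eps guards log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(-(y_true * np.log(y_pred)).sum(axis=1).mean())

# A 50/50 guess on the true class costs ln(2) ≈ 0.693.
loss = categorical_crossentropy(np.array([[1.0, 0.0, 0.0]]),
                                np.array([[0.5, 0.25, 0.25]]))
```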

Conclusions
In this paper, the proposed NGNet ensures that the network focuses on nuclei-related features, so as to learn fine-grained feature representations for breast IDC grading. Extensive experimental comparisons show that NGNet outperforms the state-of-the-art method and has the potential to assist pathologists in breast IDC grading diagnosis. In addition, we released a new dataset containing 3644 pathological images at different magnifications (20× and 40×) for evaluating breast IDC grading methods. Compared with the previous publicly available breast cancer grading dataset of only 300 images in total, our image count is an order of magnitude larger. The dataset can therefore serve as a benchmark to facilitate broader study of breast IDC grading methods.
In future work, medical knowledge embedding and semi-supervised learning are two promising directions for further improving the classification performance of breast IDC grading. In both natural and medical image analysis, deep network architectures have already been studied very thoroughly, so the room for further gains from architecture improvements alone is limited. In contrast, there are few studies on combining medical knowledge with pathological images to improve classification performance [41]; embedding medical knowledge into end-to-end network learning could further improve IDC grading. Regarding pathological image datasets for IDC grading, labeling a sufficiently large dataset is impractical because annotating pathological images is costly, yet the amount of unlabeled pathological image data in each hospital is very large [42]. If a small labeled dataset and a large unlabeled dataset could be used simultaneously, the performance of IDC grading methods might be further improved to a clinically usable level.
Author Contributions: Methodology, R.Y.; investigation, Z.L. and R.Y.; resources, J.L. and F.Z.; data curation, X.R. and F.R.; writing-original draft preparation, R.Y.; writing-review and editing, C.Z. and F.Z.; funding acquisition, X.R., F.R. and F.Z. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.