1. Introduction
Breast cancer remains one of the most prevalent cancers worldwide. In 2022, it accounted for over 2.3 million new cases, ranking second among all cancer types globally. Moreover, with about 666,000 deaths, breast cancer ranks as the fourth most frequent contributor to cancer mortality worldwide. A timely and precise diagnosis of breast cancer can prolong the lives of patients or increase their chance of survival [1]. However, the accuracy of pathological diagnosis relies not only on image texture quality but also on subjective elements like the pathologist's focus and level of experience. Consequently, the risk of misdiagnosis remains, which may negatively influence patient prognosis and clinical outcomes. In order to achieve more accurate and efficient breast cancer pathology assessments, leveraging advanced computational techniques to assist pathologists in pathological tissue analysis holds significant clinical value [2].
Convolutional neural networks (CNNs) have demonstrated outstanding performance in medical image processing tasks [3,4,5,6,7]. CNN-related breast cancer histopathological image (BCHI) classification methods have also made considerable progress [8,9,10,11,12,13,14]. Currently, CNN-based approaches for BCHI classification can be broadly divided into two categories.
The first category consists of non-end-to-end training models. Researchers first extract features from input images using conventional or newly developed CNNs, and then apply machine learning classifiers to categorize samples based on the extracted features. In this process, the CNN models are typically pre-trained on the ImageNet dataset [15]. For example, Kumar et al. [16] used a VGG-16 model pre-trained on ImageNet to extract deep features from BCHI and evaluated classification performance by combining support vector machines and random forests. The second category uses end-to-end training of a deep neural network model, which avoids the discontinuity between the feature extraction stage and the classifier found in multi-module models; it can be further divided into two subcategories. The first subcategory comprises task-specific CNN approaches that fine-tune traditional pre-trained networks for particular tasks [10,11]. For example, Shaalu and Mehra [10] evaluated commonly adopted pre-trained models, including VGG16, VGG19, and ResNet50, against fully trained counterparts on BCHI. Zhang et al. [11] incorporated global and local characteristics into six standard VGGNet and ResNet models and systematically examined their ability to classify BCHI; according to their experimental findings, ResNet50 outperforms the other traditional network models. Benhammou et al. [17] performed BCHI classification by fine-tuning an Inception-V3 network pre-trained on ImageNet. Said et al. [18] utilized a ResNet18 model pre-trained on ImageNet and fine-tuned its terminal residual modules; combined with global contrast normalization and data augmentation, this achieved significant performance advantages on the BreakHis dataset.
The second subcategory adds representative convolutional units to classical networks or to newly constructed networks. Bardou et al. [19] proposed a seven-layer BCHI classification network, with two fully connected layers placed after five convolutional stages; with data augmentation, the accuracy of the two models on the BreakHis dataset ranged from 96.15% to 98.33%. To extract image features and address the multi-class classification problem with fewer computational resources, Murtaza et al. [20] proposed a tree-based breast tumor classification model. These studies reveal that a single network with a straightforward structure can also achieve high classification accuracy for this medical task. Since the attention mechanism can quickly select the focus point and provide a more distinct feature representation, an increasing number of researchers have incorporated it into CNN models. Jiang et al. [14] proposed a lightweight CNN that combines residual structures with a squeeze-and-excitation mechanism, achieving higher performance in BCHI classification with fewer parameters. Budak et al. [21] used an FCN and a BiLSTM to detect BCHI and achieved significant results on the BreakHis dataset. Toğaçar et al. [22] combined the residual module with the convolutional block attention module and utilized the HyperColumn technique for fine-grained localization in BreakHis images. Overall, the performance of BCHI classification has improved significantly with the continuous development of deep neural network models.
In addition, Transformer-based [23] and attention-driven models for medical image classification continue to emerge, demonstrating strong capabilities in feature extraction and representation [24]. For example, ST-Double-Net [25] integrates the Swin Transformer (SwinT) [26] with a weakly supervised object localization strategy, effectively combining global and local features without requiring bounding box annotations, and significantly improves the recognition of ambiguous lesion regions. SupCon-ViT [27] combines a pre-trained vision Transformer with supervised contrastive learning, enhancing feature discriminability and improving generalization. Additionally, DecT [28] enhances color feature representation by jointly processing RGB and HED images using color deconvolution and a self-attention mechanism. Zhuang et al. [29] further combined ResNet101 with SwinT and incorporated the CBAM attention mechanism, along with focal loss, to increase sensitivity to diverse lesion types. Although these Transformer-based models achieve notable improvements in classification performance, most still face limitations, such as high computational demands and increased training costs.
Although notable advancements have been achieved in medical image analysis, several challenges remain unaddressed. For instance, in BCHI classification, most existing methods rely solely on first-order statistics while ignoring high-order statistical features, which limits the expressive capacity of deep features. In fact, CNN models equipped with high-order statistics have recently shown promising efficacy on a variety of visual tasks; representative deep high-order models include Bilinear CNN, MPN-COV, and GSoP [30,31,32,33]. In particular, Li et al. [32] developed a trainable second-order network called MPN-COV that successfully applies matrix power normalization to capture the global covariance statistics of deep features. In our previous work [34], we conducted a preliminary investigation into the role of deep second-order statistics in BCHI classification; experimental results demonstrated an average performance improvement of 4.92% over the baseline. Meanwhile, dilated convolution, which perceives global features by expanding the receptive field, has been widely employed in recent years to build more discriminative deep models. Among these works, Chen et al. [35] introduced atrous (dilated) convolution, allowing the network to extract denser features and improve the reliability of segmentation results. The receptive field block network (RFBNet) [36] optimizes receptive field design by accounting for spatial eccentricity, which enhances feature extraction in lightweight networks and has a significant effect on object detection. Nevertheless, the transfer of dilated convolution to medical image classification has not yet been effectively explored.
This study proposes an innovative high-order receptive field network (HoRFNet) focused on the efficient classification of BCHI. The core design concept of HoRFNet lies in integrating multi-branch receptive field expansion with high-order statistical modeling streams, thereby overcoming the limitations of traditional CNNs in spatial dependency modeling and feature representation. HoRFNet introduces multi-branch convolutions and dilated convolutional layers to replace the square-kernel convolutional layers in traditional backbone networks. This design breaks the limitations of fixed receptive fields in traditional CNNs by employing a multi-scale perception strategy, capturing local details while integrating global contextual information, thereby achieving a more precise understanding of complex breast cancer tumor morphologies. Moreover, the introduction of dilated convolutions effectively avoids information loss caused by downsampling, preserves high-resolution feature representation, and significantly enhances the network’s discriminative power and detail depiction capabilities. To capture the global interactions between convolutional features, HoRFNet introduces covariance pooling based on matrix power normalization (MPN) after the final convolutional layer. This design overcomes the limitations of traditional CNNs that rely solely on first-order feature aggregation, by explicitly modeling high-order statistical information to more accurately represent the structural characteristics of pathological tissues. This enables histopathological images to obtain more informative global feature representations.
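To make this multi-branch receptive field design more concrete, the following PyTorch sketch shows one possible layout of such a block: three branches with different kernel sizes, each followed by a dilated 3×3 convolution, plus a 1×1 shortcut branch and a 1×1 fusion layer. The channel split, branch composition, and default dilation rates here are illustrative assumptions, not the exact HoRFNet configuration.

```python
import torch
import torch.nn as nn

class MultiBranchRFBlock(nn.Module):
    """Illustrative RFB-style multi-branch block (hypothetical layout).

    Each branch applies a convolution of a different kernel size followed by
    a dilated 3x3 convolution; branch outputs are concatenated and fused.
    """
    def __init__(self, in_ch, out_ch, dilations=(3, 3, 5)):
        super().__init__()
        branch_ch = out_ch // 4
        self.branches = nn.ModuleList()
        for k, d in zip((1, 3, 5), dilations):
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
                # dilated 3x3 conv enlarges the receptive field without downsampling
                nn.Conv2d(branch_ch, branch_ch, kernel_size=3,
                          padding=d, dilation=d),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            ))
        # plain 1x1 shortcut branch keeps fine local detail
        self.shortcut = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.fuse = nn.Conv2d(branch_ch * 4, out_ch, kernel_size=1)

    def forward(self, x):
        feats = [b(x) for b in self.branches] + [self.shortcut(x)]
        return self.fuse(torch.cat(feats, dim=1))

# quick shape check: spatial resolution is preserved
block = MultiBranchRFBlock(256, 256)
print(block(torch.randn(1, 256, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```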
Figure 1 depicts the overall structure of HoRFNet. The key contributions of this study are summarized in the following three points.
An end-to-end high-order receptive field network (HoRFNet) is proposed for classifying breast cancer histopathological images. To the best of our knowledge, this is the first work that simultaneously incorporates second-order statistical characteristics and the receptive field block (RFB) into medical image classification.
HoRFNet replaces the third convolutional layer with a multi-branch convolution layer using various kernel sizes, followed by a dilated convolution layer. This effectively enlarges the receptive field and enhances the discriminability of the deep model. To further improve performance, covariance statistics of the deep features, computed with matrix power normalization (MPN), are used to extract more discriminative information from breast cancer histopathological images.
HoRFNet demonstrates superior performance on the open-source BreakHis dataset, achieving 99.50% and 99.23% classification accuracy at the image level and patient level, respectively, and surpassing existing approaches.
3. Experiments
This section presents the experimental validation of the proposed method and is organized into four main parts. First, we introduce the BreakHis dataset and its key characteristics. Next, we describe the evaluation metrics used to assess model performance. Then, we detail the experimental setup, including data pre-processing, model training, and parameter configuration. Finally, we report the experimental results and provide a systematic analysis and comparison of model performance, thereby validating the effectiveness and feasibility of the proposed approach.
3.1. Datasets
3.1.1. BreakHis Dataset
The BreakHis dataset is a publicly available dataset widely used for breast cancer histopathological image analysis. It contains 7909 breast cancer tissue slice images from 82 patients, captured at four magnification levels: 40×, 100×, 200×, and 400×. All images are divided into two main categories, benign and malignant, with further subdivisions into specific subtypes. The benign tumors include adenosis (A), fibroadenoma (F), tubular adenoma (TA), and phyllodes tumor (PT). The malignant tumors include ductal carcinoma (DC), lobular carcinoma (LC), mucinous carcinoma (MC), and papillary carcinoma (PC). Some typical histopathological images in BreakHis at 400× magnification are given in
Figure 4.
3.1.2. BACH Dataset
The BACH dataset, introduced by ICIAR 2018 (International Conference on Image Analysis and Recognition), is designed for breast cancer histopathological image classification. It consists of 400 high-resolution (2048 × 1536 pixels) H&E-stained images. The images are equally distributed across four tissue categories: Normal, Benign, In situ carcinoma, and Invasive carcinoma, with 100 images per class, all annotated by expert pathologists. To accommodate different levels of classification tasks, this study establishes both four-class and two-class classification settings. In the two-class classification task, the Normal and Benign categories are merged into the Benign class, while In situ carcinoma and Invasive carcinoma are combined into the Malignant class, simulating a typical clinical scenario of benign versus malignant decision making in breast cancer screening.
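As a small illustration of this regrouping, the sketch below maps the four BACH categories onto the binary screening task; the integer indices are hypothetical, and only the grouping itself follows the description above.

```python
# Map the four BACH tissue categories onto the binary screening task.
# The integer indices are illustrative; only the grouping follows the text.
FOUR_TO_TWO = {
    0: 0,  # Normal             -> Benign
    1: 0,  # Benign             -> Benign
    2: 1,  # In situ carcinoma  -> Malignant
    3: 1,  # Invasive carcinoma -> Malignant
}

def to_binary(labels):
    return [FOUR_TO_TWO[y] for y in labels]

print(to_binary([0, 1, 2, 3]))  # [0, 0, 1, 1]
```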
3.2. Evaluation Metrics
To comprehensively evaluate the classification performance of HoRFNet on the BreakHis dataset, this study introduces six commonly used evaluation metrics including image level accuracy, patient level accuracy, precision, recall, F1-score, and confusion matrix. These metrics provide a systematic assessment of model performance from multiple perspectives, including recognition capability at the image and patient levels, as well as classification stability and generalization ability.
Image level accuracy (ILA) measures the proportion of correctly classified images to the total number of images and is used to evaluate the model's recognition capability at the individual image level:
\[ \mathrm{ILA} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}}, \]
where \( N_{\mathrm{correct}} \) represents the number of correctly classified images and \( N_{\mathrm{total}} \) denotes the total number of images.
Patient level accuracy (PLA) is computed on a per-patient basis by aggregating the classification results of all associated images using a majority voting strategy. This metric reflects the model's consistency and stability in overall patient level classification:
\[ \mathrm{PLA} = \frac{P_{\mathrm{correct}}}{P_{\mathrm{total}}}, \]
where \( P_{\mathrm{correct}} \) represents the number of correctly classified patients and \( P_{\mathrm{total}} \) denotes the total number of patients.
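As a concrete illustration of the majority-voting aggregation described above, the following Python sketch computes PLA from per-image predictions grouped by patient identifier; the data layout is assumed for illustration only.

```python
from collections import Counter, defaultdict

def patient_level_accuracy(image_preds, image_labels, patient_ids):
    """Aggregate image-level predictions per patient by majority vote,
    then score each patient against its (single) ground-truth label."""
    preds_by_patient = defaultdict(list)
    label_by_patient = {}
    for pred, label, pid in zip(image_preds, image_labels, patient_ids):
        preds_by_patient[pid].append(pred)
        label_by_patient[pid] = label  # all images of a patient share one label
    correct = 0
    for pid, preds in preds_by_patient.items():
        majority = Counter(preds).most_common(1)[0][0]
        correct += int(majority == label_by_patient[pid])
    return correct / len(preds_by_patient)

# toy example: two patients, three images each
print(patient_level_accuracy(
    image_preds=[1, 1, 0, 0, 0, 1],
    image_labels=[1, 1, 1, 0, 0, 0],
    patient_ids=["p1", "p1", "p1", "p2", "p2", "p2"]))  # 1.0
```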
A confusion matrix is an intuitive tool for evaluating the performance of a classification model. It summarizes the relationship between the model’s predictions and the actual labels, illustrating how the model performs across different classes. It includes four fundamental cases: a true positive (TP) is a sample that is actually positive and correctly predicted as positive; a false positive (FP) is a sample that is actually negative but incorrectly predicted as positive; a true negative (TN) is a sample that is actually negative and correctly predicted as negative; and a false negative (FN) is a sample that is actually positive but incorrectly predicted as negative.
Precision measures the proportion of true positive samples among all samples predicted as positive, reflecting the model’s ability to reduce false positives. Recall measures the proportion of true positive samples among all actual positive samples, evaluating the model’s ability to capture positive cases. The F1-score is the harmonic mean of precision and recall, used to comprehensively evaluate the model’s ability to identify positive samples in classification tasks.
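For completeness, these three metrics follow their standard definitions in terms of the TP, FP, and FN counts introduced above:
\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \]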
3.3. Experimental Setup
In the experiments, the original dataset is randomly divided into a training set and a test set at a ratio of 7:3, and 25% of the training set is held out for cross-validation. A data augmentation strategy is applied during training to improve the generalization ability of the model. The optimizer is stochastic gradient descent (SGD) with the momentum factor set to 0.9. The initial learning rate is 0.001 and is dynamically adjusted with an inverse decay strategy over eight rounds of training. The experiments are run on a SitOnHoly system with dual NVIDIA 2080Ti GPUs (each with 11 GB of VRAM), 64 GB of RAM, and the PyTorch 2.1.0 deep learning framework.
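A minimal PyTorch sketch of this optimization setup is given below. The placeholder model, the `LambdaLR`-based implementation of the inverse decay schedule, and its gamma/power values are assumptions; only the optimizer, momentum, initial learning rate, and the eight training rounds are taken from the setup described above.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))  # placeholder model
criterion = nn.CrossEntropyLoss()

# SGD with momentum 0.9 and initial learning rate 0.001, as stated above
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Inverse-decay schedule (Caffe-style "inv" policy); gamma and power are
# illustrative values, not reported in the paper.
gamma, power = 0.1, 0.75
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1 + gamma * epoch) ** (-power))

for epoch in range(8):  # learning rate is adjusted over eight rounds of training
    # ... one pass over the training loader would go here ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())
```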
3.4. Ablation Study
3.4.1. The Impact of Newton–Schulz Iteration Count
In this experiment, we investigate the impact of the number of Newton–Schulz iterations on the classification model’s accuracy. This experiment is evaluated on histopathological images from the BreakHis dataset at 100× magnification. As shown in
Figure 5, increasing the number of iterations significantly enhances the model’s ability to represent high-order features. As the number of iterations increases from 1 to 5, classification accuracy steadily improves, reaching a peak of 98.56% at 5 iterations. This indicates that a moderate number of iterations better captures covariance information and enhances feature representation quality. However, when the number of iterations further increases to seven, although the model’s accuracy remains at a high level, the improvement tends to saturate. Beyond seven iterations, the model’s performance starts to decline. This phenomenon is likely due to the accumulation of numerical instability caused by excessive iterations, which weakens the feature’s discriminative ability. Meanwhile, a higher number of iterations significantly increases computational cost and reduces inference efficiency. Considering both performance and computational complexity, we ultimately choose five iterations as the optimal setting, ensuring high classification accuracy while effectively controlling computational costs, achieving an optimal balance between performance and efficiency.
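For reference, the sketch below shows the coupled Newton–Schulz iteration commonly used to approximate the matrix square root in covariance (MPN) pooling, with trace pre-normalization and post-compensation in the style of iSQRT-COV. The feature dimensions are illustrative, and the snippet is a sketch of the general technique rather than the exact HoRFNet implementation.

```python
import torch

def newton_schulz_sqrt(cov, num_iter=5, eps=1e-8):
    """Approximate the matrix square root of a batch of SPD covariance
    matrices with the coupled Newton-Schulz iteration.

    cov: (B, d, d) symmetric positive (semi-)definite matrices.
    """
    B, d, _ = cov.shape
    identity = torch.eye(d, device=cov.device).expand(B, d, d)
    # Pre-normalize by the trace so the iteration converges.
    trace = cov.diagonal(dim1=-2, dim2=-1).sum(-1).view(B, 1, 1) + eps
    Y = cov / trace
    Z = identity.clone()
    for _ in range(num_iter):
        T = 0.5 * (3.0 * identity - Z.bmm(Y))
        Y = Y.bmm(T)
        Z = T.bmm(Z)
    # Post-compensate the trace normalization.
    return Y * trace.sqrt()

# Covariance of conv features: X is (B, d, N) with N spatial positions.
X = torch.randn(2, 64, 14 * 14)
X = X - X.mean(dim=-1, keepdim=True)
cov = X.bmm(X.transpose(1, 2)) / X.shape[-1]
sqrt_cov = newton_schulz_sqrt(cov, num_iter=5)
print(torch.dist(sqrt_cov.bmm(sqrt_cov), cov))  # approximation error shrinks as num_iter grows
```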
3.4.2. The Impact of the Structural Configuration of the RFB Module
We conducted an ablation study on the BreakHis dataset at 100× magnification to evaluate the effect of replacing different convolutional layers within HoRFNet with the receptive field block (RFB) module. The results are shown in
Figure 6a. Specifically, we replaced the second, third, fourth, and fifth convolutional layers with RFB modules, corresponding to configurations numbered 1 through 4. The experimental results show that replacing the third convolutional layer yields the highest classification accuracy, with improvements of 2.40%, 1.60%, and 3.54% compared to replacing the second, fourth, and fifth layers, respectively. This suggests that the third convolutional layer plays a critical role in feature extraction, as it retains relatively high spatial resolution while also capturing semantic information. Integrating the multi-branch RFB module at this stage significantly enhances the model’s feature representation capability. In contrast, replacing the fifth convolutional layer results in the poorest performance, likely because the RFB module is less effective at this deeper stage due to limited benefits from receptive field expansion. Based on this analysis, we select the third convolutional layer as the optimal insertion point for the RFB module in subsequent experiments.
Building on this, we further investigate the impact of varying the number of branches within the RFB module on model performance, as shown in
Figure 6a. We incrementally increase the number of branches: “1” represents using only the first branch, “2” includes the first and second branches, and so on, up to all four branches. The results show a consistent increase in classification accuracy as the number of branches increases, with the highest performance achieved when all four branches are included. Specifically, using four branches improves accuracy by 2.40%, 1.12%, and 0.80% compared to using only one, two, and three branches, respectively. This demonstrates that the multi-branch structure effectively simulates receptive fields at multiple scales, enhancing the model’s ability to capture complex structures and diverse textures. The combination of different kernel sizes and dilation rates in each branch facilitates the extraction of key features across scales, improving the model’s discriminative capacity. Therefore, we adopt the configuration with all four branches as the optimal RFB module setting in our model design.
To further investigate the impact of dilation rate configuration on model performance, we conduct ablation experiments using five different sets of dilation rate combinations within the RFB module. The experimental results are shown in
Figure 6b. The tested configurations include (1, 1, 1), (3, 3, 3), (5, 5, 5), (1, 3, 5), and (3, 3, 5), with each triplet corresponding to the dilation rates of the three branches in the RFB module. The results show that the (3, 3, 5) configuration achieves the highest classification accuracy among all settings, outperforming (1, 1, 1), (3, 3, 3), (5, 5, 5), and (1, 3, 5) by 0.80%, 0.48%, 1.04%, and 0.96%, respectively. These findings demonstrate that a progressively increasing dilation structure is more effective for modeling multi-scale receptive fields, thereby enhancing the model's ability to capture both global and local contextual information. In contrast, fixed dilation configurations such as (3, 3, 3) and (5, 5, 5), while capable of enlarging the receptive field to some extent, are limited in feature diversity and information integration. Additionally, while the asymmetric design of (1, 3, 5) offers advantages in deeper receptive field coverage, it lacks adequate receptive field in the shallow branches, thereby compromising overall feature extraction. Based on this analysis, we select (3, 3, 5) as the optimal dilation rate configuration for the RFB module in subsequent experiments, aiming to maximize the representation and discrimination of structural information across multiple scales.
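For intuition, a k×k convolution with dilation rate d covers an effective extent of k + (k − 1)(d − 1) pixels; the small snippet below prints the effective sizes of 3×3 kernels under the tested dilation settings.

```python
def effective_kernel(k, d):
    """Effective spatial extent of a k x k convolution with dilation d."""
    return k + (k - 1) * (d - 1)

for rates in [(1, 1, 1), (3, 3, 3), (5, 5, 5), (1, 3, 5), (3, 3, 5)]:
    print(rates, [effective_kernel(3, d) for d in rates])
# e.g. (3, 3, 5) -> [7, 7, 11]: progressively larger receptive fields per branch
```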
3.4.3. The Impact of the Second-Order Feature and Receptive Field Modules
In this section, we conduct ablation studies at both image level and patient level to verify the effectiveness of each module in HoRFNet under four different magnifications. In
Figure 7, we use the backbone network as the baseline and compare the performance changes after introducing the second-order feature module (SeONet), the receptive field module (RFNet), and their combination (HoRFNet) to verify each module’s contribution to performance improvement.
In
Figure 7a, HoRFNet achieves image level recognition accuracies of 97.66%, 98.56%, 99.50%, and 97.43% at 40×, 100×, 200×, and 400× magnifications, respectively, demonstrating clear advantages over the other three models. In particular, compared with the Backbone model at the image level, the HoRFNet model has an accuracy increase of 6.35%, 7.36%, 5.63%, and 4.94% at the four magnifications, respectively.
On the BreakHis dataset, we conduct an in-depth analysis of the performance of the Backbone, SeONet, and RFNet models, and also compare the performance improvement brought by the high-order and receptive field block units to the Backbone model. In the image level evaluation results presented in
Figure 7a, both the second-order feature module (SeONet) and the receptive field module (RFNet) effectively improve classification accuracy across all magnifications, demonstrating clear improvements over the backbone baseline. For example, at 200× magnification, the backbone achieves an accuracy of 93.87%, while SeONet and RFNet improve it to 98.01% and 98.34%, respectively, demonstrating the positive contribution of both modules to image level recognition performance. The second-order features more effectively model the high-order relationships among local textures, while the receptive field module enhances the model's ability to perceive multiscale contextual information. However, at 100× and 400× magnifications, RFNet performs slightly worse than SeONet. For example, at 400×, RFNet achieves 94.50% compared to 95.05% for SeONet, which may be attributed to the distribution of fine-grained details in the images. At these magnifications, structural boundaries and local contrast become more prominent, giving second-order features an advantage in modeling local relationships. Nevertheless, RFNet still outperforms the backbone across all magnifications, indicating its effectiveness in enhancing overall feature representation. Moreover, by combining second-order statistical features with the receptive field block, the HoRFNet model obtains additional gains of 1.84%, 4.00%, 1.16%, and 2.93% over the RFNet model, and 2.17%, 2.72%, 1.49%, and 2.38% over the SeONet model.
In the patient level evaluation presented in
Figure 7b, all models exhibit stability across different magnifications, similar to the image level evaluation. For example, at 200× magnification, the backbone achieves an accuracy of 93.11%, while SeONet and RFNet increase it to 97.89% and 98.44%, respectively. HoRFNet further improves the accuracy to 99.23%. This trend is consistent across most magnifications. Notably, similar to the image level results, at 100× and 400× magnifications, RFNet slightly underperforms SeONet but still significantly outperforms the backbone. These results suggest that the receptive field module is advantageous for extracting global structural information, whereas at 100× and 400× magnifications, second-order features may play a more critical role in ensuring consistency in patient level predictions. Overall, HoRFNet achieves the best performance by combining the strengths of SeONet and RFNet. At 40×, 100×, 200×, and 400× magnifications, it achieves patient level recognition accuracies of 95.82%, 98.77%, 99.23%, and 97.94%, respectively. Therefore, the ablation study results further confirm the effectiveness of second-order statistical features, the receptive field module, and their integration strategy in BCHI classification.
3.4.4. Statistical Significance Analysis
To further validate the statistical significance of the performance improvements achieved by the proposed HoRFNet model in BCHI classification, this study conducts a
t-test comparing SeONet, RFNet, and HoRFNet against the Backbone model across four magnification levels (40×, 100×, 200×, and 400×). As shown in
Table 2, HoRFNet consistently achieves the highest mean improvement (Mean) at each magnification, with values of 7.515, 6.560, 5.710, and 4.395, respectively. The corresponding
p-values are 0.044, 0.010, 0.005, and 0.011, all of which are significantly below the 0.05 threshold, indicating that the performance gains of HoRFNet over the Backbone model are statistically significant. In contrast, although SeONet and RFNet show some performance improvements at certain magnifications, most of their
p-values exceed 0.05. Notably, RFNet yields a
p-value of 0.836 at 400× magnification, failing the significance test. These results demonstrate that HoRFNet not only outperforms the Backbone in classification accuracy but also exhibits statistically more significant and robust advantages, validating the effectiveness of its architectural design.
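The kind of comparison reported in Table 2 can be reproduced with a standard paired t-test; a minimal SciPy sketch is shown below, where the accuracy values are placeholders rather than the measurements used in this study.

```python
from scipy import stats

# Accuracies of the Backbone and HoRFNet over repeated runs at one
# magnification (placeholder numbers for illustration only).
backbone = [93.1, 93.5, 92.8, 93.3]
horfnet  = [99.2, 99.4, 99.0, 99.3]

t_stat, p_value = stats.ttest_rel(horfnet, backbone)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 -> significant improvement
```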
3.5. Experimental Results
3.5.1. Experimental Results Under Other Metrics
To more comprehensively evaluate the classification performance of HoRFNet under various magnifications, this study introduces precision, recall, and F1-score as supplementary evaluation metrics. The corresponding experimental results are presented in
Table 3, and the confusion matrices for each magnification are shown in
Figure 8.
As shown in
Table 3, HoRFNet achieves high classification performance across all magnification levels. The average precision reaches 98.47%, the average recall is 99.05%, and the average F1-score is 98.88%. Among the four magnifications, the model achieves the best performance at 200×, with precision and F1-score reaching 99.53% and 99.64%, respectively. At 100× and 200×, recall reaches 99.76%, indicating that the model demonstrates strong sensitivity in detecting malignant samples.
Figure 8 further illustrates the model’s classification performance under different magnification conditions. For example, on the 40× dataset (
Figure 8a), the model correctly classifies 180 benign and 405 malignant samples, with only 14 misclassifications. On the 100× dataset (
Figure 8b), only one malignant sample and eight benign samples are misclassified. In the 200× dataset (
Figure 8c), only three misclassifications occur, representing the lowest error rate among all magnifications. In contrast, the classification performance at 400× (
Figure 8d) is slightly lower, with 14 misclassifications primarily occurring in the prediction of malignant cases. The results indicate that HoRFNet maintains stable and strong classification performance across different magnifications, with particularly high accuracy at 100× and 200×, further confirming the model’s reliability in BCHI classification.
3.5.2. Comparison with Other Methods
To further evaluate the classification performance of HoRFNet on multi-magnification images, this study conducts a systematic comparison with representative CNN methods from recent years. The comparison includes accuracy at both the image level and the patient level, as shown in
Table 4. In particular, SeONet and RFNet are included as reference models to further assess the improvements brought by HoRFNet.
Based on
Table 4, we conclude that HoRFNet demonstrates a significant competitive advantage over previously proposed models. Specifically, on the 200× dataset, HoRFNet achieves an optimal image level accuracy reaching 99.50%, while the performance at the patient level attains 99.23%. Compared with studies [43,44,45,46,52,54], HoRFNet improves accuracy at the image level by 8.50%, 14.60%, 10.40%, 11.02%, 7.05%, and 5.70%, respectively, and at the patient level by 8.23%, 15.13%, 9.43%, 10.03%, 6.95%, and 4.93%, respectively. Additionally, SeONet yields 98.01% accuracy for image level classification and 97.89% for patient level prediction on the 200× dataset, while RFNet achieves 98.34% and 98.44%, respectively. Both models significantly outperform those in studies [43,44,45,46,49,52,53]. However, HoRFNet reaches 97.66% accuracy on the 40× dataset in terms of image level performance, which is slightly lower (by 0.33%) than BreastNet proposed by Toğaçar et al. [22]. Nevertheless, at the other magnifications, the proposed model outperforms BreastNet in recognition accuracy. Compared with the models in recent studies [53,54,55], HoRFNet shows consistently better results under all magnification settings of the BreakHis dataset. For instance, Xiao et al. [54] employ Inception-V3 combined with image segmentation and fusion strategies to enhance BCHI classification accuracy; however, its average performance is approximately 5.70% and 4.93% lower than that of HoRFNet in image level and patient level evaluations, respectively. According to the above analysis, HoRFNet has considerable advantages over network models proposed in recent years.
In addition, to comprehensively evaluate the performance of HoRFNet, we compare it with recent models based on the Transformer architecture. As shown in
Table 4, at the image level, Zhuang et al. [29] achieve accuracies of 97.50%, 96.60%, 96.30%, and 96.10% across the different magnifications, and ST-Double-Net [25] achieves corresponding accuracies of 97.47%, 96.86%, 97.25%, and 95.05%. HoRFNet consistently outperforms both Transformer-based methods at all magnification levels. In particular, at 100× and 200×, HoRFNet improves upon ResNet101+SwinT by 1.96% and 3.20%, respectively, and surpasses ST-Double-Net by 1.70% and 2.25%, respectively. Therefore, although Transformer-based architectures have recently shown impressive performance in vision tasks, HoRFNet demonstrates superior discriminative power in BCHI classification.
3.5.3. Evaluation on the BACH Dataset
The above experiments demonstrate that HoRFNet achieves high classification accuracy on the BreakHis dataset. To further validate the model’s generalization ability and robustness, we conduct additional experiments on the BACH dataset, involving both two-class and four-class classification tasks. As shown in
Table 5, HoRFNet consistently outperforms the Backbone model on the BACH dataset. In the two-class classification task, HoRFNet achieves an accuracy of 88.75%, reflecting an 8.75% improvement over the Backbone’s 80.00%. In the four-class classification task, HoRFNet attains an accuracy of 82.50%, significantly outperforming the Backbone’s 76.25%, with a margin of 6.25%. These results indicate that HoRFNet maintains stable recognition performance across different data distributions and tissue types, further validating its generalizability and robustness in BCHI classification.
3.5.4. Model Efficiency Comparison
To comprehensively evaluate the proposed model in terms of computational efficiency, resource consumption, and classification performance, we conduct a systematic comparison at 100× magnification between HoRFNet and several baseline models, including the Backbone, SeONet, RFNet, and the lightweight EfficientNet-lite. The experimental results are presented in
Table 6. Compared to the Backbone model, HoRFNet shows significant advantages across multiple dimensions. First, in terms of classification performance, HoRFNet achieves the highest accuracy of 98.56%, significantly outperforming the Backbone and other comparison models. This result demonstrates that HoRFNet possesses stronger feature representation and discriminative capabilities. In terms of training efficiency, HoRFNet completes training in just 41 min, clearly outperforming the Backbone’s 64 min. This indicates that its optimized architecture and enhanced feature extraction capability effectively reduce training time. During inference, HoRFNet achieves an average inference time of 7.80 s, which is higher than the Backbone’s 3.28 s. However, this difference primarily results from the incorporation of high-order statistical modeling and multi-branch receptive field modules. Specifically, the matrix power normalization in covariance pooling and the use of dilated convolutions enhance feature representation and global perception, but also increase the complexity of the inference path. Despite this, HoRFNet still performs much faster than EfficientNet-lite, which takes 26.40 s, highlighting its strong deployment responsiveness in practical settings.
In terms of resource consumption, HoRFNet has 101.72 M parameters, which is only about one-fourth the size of the Backbone’s 396.98 M. This enhances the model’s compactness and deployment flexibility. Regarding computational complexity, HoRFNet requires 3.57 G FLOPs, which is comparable to the Backbone’s 3.45 G FLOPs. This demonstrates that HoRFNet substantially enhances representational capacity without introducing a significant computational burden. In contrast, although EfficientNet-lite has fewer parameters (81.64 M), its computational complexity reaches 6.80G FLOPs, which results in a disadvantage in actual computational efficiency. Overall, HoRFNet maintains excellent classification performance while achieving a favorable balance between computational efficiency and model effectiveness.
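Parameter counts such as those in Table 6 can be read directly off the model, while FLOPs/MACs require a profiler; the sketch below uses a stand-in torchvision model and the third-party `thop` package, neither of which is claimed to be the tooling used in this study.

```python
import torch
import torchvision

model = torchvision.models.resnet50()  # stand-in model for illustration
num_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {num_params / 1e6:.2f} M")

# FLOPs/MACs require a profiler; the third-party `thop` package is one option.
try:
    from thop import profile
    macs, _ = profile(model, inputs=(torch.randn(1, 3, 224, 224),))
    print(f"MACs: {macs / 1e9:.2f} G")
except ImportError:
    print("install `thop` to profile MACs/FLOPs")
```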
3.5.5. Visualization Results
To more clearly illustrate the advantages of HoRFNet in BCHI classification, this study presents a set of representative examples: images that are misclassified by the backbone model but correctly identified by HoRFNet. These examples are shown in
Figure 9. Upon inspection, these histopathological slide images often contain highly complex and subtly varying texture patterns, which challenge the feature extraction capability of the backbone model and may result in misclassification. In contrast, HoRFNet improves the ability to capture fine-grained information by introducing second-order statistical features and a receptive field enhancement module, enabling stronger discriminative power when processing complex images. This structural optimization significantly enhances recognition accuracy, further demonstrating the effectiveness of HoRFNet in BCHI classification.
In addition, we provide visualizations of model attention regions under different magnification levels to further evaluate the discriminative capability of HoRFNet. As shown in
Figure 10, HoRFNet (second row) consistently exhibits more concentrated high-response regions across all magnifications, clearly outperforming the dispersed activation maps produced by the Backbone (first row). At 40× and 100×, HoRFNet effectively focuses on key tissue areas with pathological significance, whereas the Backbone shows substantial background interference and lacks precise region localization. At 200× and 400×, HoRFNet demonstrates superior sensitivity to fine-grained structures, accurately capturing edges and detailed regions, while the Backbone displays blurred boundaries and unstable attention patterns. These results indicate that HoRFNet has stronger multi-scale feature extraction capabilities, maintaining stable and accurate discriminative performance across varying resolutions.
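Attention maps of the kind shown in Figure 10 can be produced with a Grad-CAM-style procedure; the sketch below hooks the last convolutional stage of a stand-in backbone. Both the model and the target layer are assumptions made for illustration, since the paper does not specify its visualization tool.

```python
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(num_classes=2).eval()  # stand-in 2-class model
target_layer = model.layer4  # last convolutional stage (assumed target)

acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)    # stand-in input image
score = model(x)[0].max()          # score of the predicted class
model.zero_grad()
score.backward()

weights = grads["v"].mean(dim=(2, 3), keepdim=True)        # channel importance
cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
print(cam.shape)  # torch.Size([1, 1, 224, 224]): heat map to overlay on the image
```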
4. Discussion
This work proposes the HoRFNet model, which integrates high-order modeling with a multi-branch receptive field mechanism and demonstrates notable advantages in BCHI classification. Experimental results show that the model achieves leading image level and patient level recognition performance across all four magnification levels in the BreakHis dataset, confirming its strong generalization and discriminative capabilities. It is important to note that BreakHis, as a widely used public dataset, provides strong representativeness and comparability, serving as a standardized evaluation platform for various histopathological image analysis methods. However, histopathological images of breast cancer may exhibit structural differences across countries, regions, climates, and population backgrounds, such as variations in tissue morphology, cell density, or staining conditions.
Looking ahead, we will expand the scope of our study by exploring the model's adaptability on multicenter datasets collected from different countries and regions. We will also introduce data with diverse sources, staining styles, and preparation protocols to systematically evaluate the model's cross-domain robustness and generalization performance. In addition, we will consider incorporating other publicly available BCHI datasets to comprehensively assess the model's performance across varying data environments and to further advance the practical deployment of HoRFNet in real-world clinical applications.