1. Introduction
Pathological image analysis is essential in medical diagnosis, particularly in the early detection of cancers and the precise localization of lesion areas. Breast cancer ranks among the most prevalent cancers affecting women worldwide. Early and precise detection can greatly improve survival rates and reduce treatment costs [1]. Traditional diagnostic methods include clinical evaluations, imaging techniques like mammography and ultrasound, and tissue biopsies [2]. Although these methods are essential for early breast cancer screening, they have practical limitations. The complexity of lesions, small lesion sizes, and image quality variations due to uneven staining in pathological images can reduce diagnostic accuracy and increase reliance on the physician’s experience. This has led to growing interest in computer-aided diagnostic methods, which provide a more objective and accurate approach to detecting lesions, improving diagnostic precision in clinical practice [3].
Recent advancements in deep learning have greatly impacted pathological image analysis, with current methods falling into three main categories: Convolutional Neural Network (CNN)-based models, Transformer-based models, and emerging Foundation Models for pathology. CNN-based approaches, including ResNet and Inception, have demonstrated strong performance in a range of pathological image classification tasks. These models automatically extract hierarchical features from images through stacked convolutional layers, showing effectiveness in tasks like tumor detection, tissue segmentation, and cancer classification. ResNet [4], for example, introduces residual connections to mitigate the vanishing gradient problem, which allows deeper models to be trained. However, this still does not address the lack of global context or the difficulty of capturing the subtle, fine-grained lesions that are often present in pathological images. Aresta et al. [5] demonstrated CNNs’ effectiveness in breast cancer diagnosis on the BACH dataset, but their models still fall short in capturing multi-scale lesion features, which are essential for reliable diagnosis. Similarly, Jiang et al. [6] proposed an SE-ResNet module that reduces the parameter count while improving performance, but the model still struggles with fine-grained lesion detection and does not effectively combine global tissue structures with local lesion details. These limitations suggest that CNN-based methods alone are insufficient for handling the complexity and multi-dimensional characteristics of histopathological images, particularly in breast cancer diagnosis.
As deep learning continues to advance in computer vision, Transformer-based architectures have emerged as a new choice. Transformer models, originally successful in natural language processing (NLP), use self-attention mechanisms to capture long-range dependencies and effectively handle global information, showing strong potential in image classification. Vision Transformer (ViT) divides images into patches and models the relationships between them through self-attention, achieving strong performance. Similarly, the Swin Transformer [7], with its shifted-window attention mechanism, improves computational efficiency and enhances classification performance, but still faces limitations in capturing local fine details, such as lesion boundaries, which are essential in medical imaging. These approaches, while promising in some contexts, do not fully resolve the trade-off between computational efficiency and fine-grained feature extraction, making them less suitable for practical medical applications, where annotated data is scarce and computational resources are limited.
Foundation Models for pathology, trained on large-scale multi-modal datasets, have recently become a promising approach in pathological image analysis. A key feature of these models is their pre-training on vast amounts of unlabeled data, which helps them learn general visual features through self-supervised learning. The main benefit is their ability to be fine-tuned for specific tasks, reducing the reliance on labeled data. Models like CTransPath [8], despite a 10% improvement in breast cancer classification over CNNs, still require sizeable labeled datasets and significant computational resources when fully fine-tuned, making them impractical for many medical applications. Similarly, the UNI model [9], trained on over 100,000 whole slide images, has achieved strong performance across various pathology tasks. However, its success relies on massive pre-training data and computational power, and adapting it remains a challenge in clinical environments where labeled data is limited. The MS2M model [10] faces similar issues, demonstrating strong performance but still needing large-scale pre-training and extensive fine-tuning. While these models offer significant improvements through generalization, their high computational cost and need for large datasets make them less efficient than CNNs in practical medical settings, especially when labeled data is scarce.
Although deep learning methods like CNNs, Transformers, and pathology foundation models have advanced breast cancer histopathological image analysis, they still struggle with one major issue: balancing fine-grained details of local lesions with the broader, multi-scale tissue structure. In breast histopathology, a reliable diagnosis depends not only on subtle features such as lesion boundaries, micro-lesions, and texture variations [11], but also on the overall tissue organization and contextual information. While CNN-based methods excel at extracting local features, they are less effective at modeling global structures. On the other hand, Transformer-based and foundation models, though strong at capturing global semantics, may fall short in handling subtle morphological details and often require high computational costs or task-specific adjustments [12]. As a result, current methods still show limited ability in capturing discriminative structural information across scales [13], especially under staining variation, fuzzy boundaries, and small-sample conditions.
To address these challenges, we propose the UNI-Phase-Dual Network (UPDNet), a novel framework that uniquely integrates phase congruency (PC) with the UNI foundation model. The core innovation of UPDNet lies in its ability to combine fine-grained local features with global semantic context through a dual-branch feature modeling module. One branch, based on DConv, enhances local lesion details, while the other, using ATConv, captures multi-scale tissue context. The branches are adaptively fused using a learnable spatial gating mechanism, allowing UPDNet to effectively handle both local and global features without interference. This synergistic integration enables UPDNet to address the limitations of existing models, such as poor boundary detection and the inability to handle fine details in small-sample situations. Furthermore, UPDNet introduces a parameter-efficient fine-tuning (PEFT) strategy, which updates only a small number of task-specific parameters while freezing the UNI backbone. This approach reduces computational overhead and mitigates overfitting, making UPDNet highly efficient for scenarios with limited annotated data.
Our contributions are summarized as follows:
(1) We propose UPDNet, a novel framework that combines PC with the UNI foundation model. This integration enables joint modeling of fine-grained local features and global semantic information, addressing a key limitation of existing methods that struggle to combine these complementary aspects.
(2) We design a dual-branch feature refinement module to improve feature representation from two complementary aspects: one branch (DConv) focuses on fine-grained local texture, while the other (ATConv) captures multi-scale context. The branches are adaptively fused via a learnable spatial gating mechanism, reducing feature interference and improving the overall representation.
(3) We introduce a PEFT strategy, which updates only a small number of task-specific parameters while freezing the UNI backbone. This reduces computational cost and alleviates overfitting, especially in small-sample scenarios where fully fine-tuned models would struggle.
2. Methods
In this section, we describe the proposed UPDNet for breast cancer diagnosis. We begin with an overview of the network architecture. Then, we describe the phase congruency-based feature extraction module and the dual-branch feature modeling module used in UPDNet. Finally, we present the parameter-efficient fine-tuning strategy used to adapt the model.
2.1. UPDNet
As illustrated in Figure 1, UPDNet is specifically designed for breast cancer pathological image diagnosis. The overall framework consists of three key parts: (1) the UNI pre-trained backbone [9] for global semantic feature extraction; (2) the PC [14] module for complementary structural prior feature extraction; and (3) the dual-branch feature modeling module for adaptive feature refinement and fusion.
The overall process in UPDNet can be represented by the following simplified equation:
$$\hat{y} = \mathrm{Classifier}\Big(\mathrm{GatingNetwork}\big(\mathrm{Branch1}(F),\ \mathrm{Branch2}(F)\big)\Big), \qquad F = \mathrm{Fuse}\big(\mathrm{UNI}(X),\ \mathrm{PC}(X)\big)$$
where $X$ represents the input breast cancer pathological image, $F$ is the fused UNI + PC representation, Branch1 focuses on extracting fine-grained features using DConv (depthwise separable convolution), Branch2 captures long-range context using ATConv (dilated convolution), and GatingNetwork adaptively combines the outputs from both branches. To improve training efficiency and alleviate overfitting in small-sample scenarios, a PEFT strategy is adopted for lightweight adaptation.
UPDNet is built upon the UNI pre-trained foundation model [9], which provides strong global semantic representations learned through large-scale pre-training. To strengthen the model’s sensitivity to lesion boundaries and microstructures, PC is introduced as a structural prior [14]. PC is intended to stabilize structural cues, especially for tiny lesions and fuzzy boundaries.
The dual-branch feature modeling module does not extract new features, but performs adaptive refinement on the fused UNI + PC features. Branch1 and Branch2 focus on different enhancement targets: Branch1 uses DConv [15] to focus on fine-grained local features, such as small lesion details and tissue textures, with a low computational cost, while Branch2 uses ATConv [16] to capture multi-scale context by expanding the receptive field, making it capable of modeling larger tissue structures. Unlike traditional methods that simply concatenate or average feature maps from different sources, our gating mechanism dynamically assigns different weights to the features based on their relevance to regional lesion characteristics, reducing feature interference and enhancing the robustness of lesion detection, especially in complex cases like fuzzy boundaries and subtle lesions.
This adaptive feature fusion allows the model to capture local fine-grained details and global tissue context simultaneously, in contrast to methods that focus mainly on local textures or rely mainly on global structures. By combining the strengths of both approaches, UPDNet is better suited to practical challenges such as small-sample data and staining variations in breast cancer image diagnosis.
2.2. Phase Congruency
Phase congruency (PC) is introduced in UPDNet as a structure-prior module to enhance the representation of lesion boundaries and subtle morphological patterns in breast cancer pathological images. Unlike traditional intensity-based features, which can be sensitive to variations in illumination or contrast, PC is more robust to these changes because it focuses on phase alignment across scales. As shown in Figure 2, the input image is first transformed to the frequency domain using the Fast Fourier Transform (FFT) and then filtered by a bank of Log-Gabor filters at various scales [17,18]. The filtered outputs are returned to the spatial domain through the inverse FFT (IFFT), and the phase-based structural responses are subsequently computed.
Formally, given an input image $I(x)$, its response at scale $s$ and orientation $o$ can be obtained by convolving the image with the corresponding Log-Gabor filter. The resulting complex response is decomposed into an even-symmetric component $e_{s,o}(x)$ and an odd-symmetric component $o_{s,o}(x)$. Based on these two components, the local amplitude [19] at each scale and orientation is defined as:
$$A_{s,o}(x) = \sqrt{e_{s,o}(x)^2 + o_{s,o}(x)^2}$$
where $A_{s,o}(x)$ reflects the local energy magnitude of the image structure at position $x$. To aggregate structural responses across multiple scales under the same orientation, the orientation-dependent energy [20] is computed as:
$$E_o(x) = \sqrt{\Big(\sum_{s} e_{s,o}(x)\Big)^2 + \Big(\sum_{s} o_{s,o}(x)\Big)^2}$$
This phase congruency measure quantifies the degree of phase alignment across scales, capturing perceptually significant structures such as edges, corners, and lesion boundaries. When Fourier components at different scales are in phase, the corresponding location typically corresponds to stable, reliable structures critical for accurate diagnosis.
Based on the above quantities, the phase congruency map [21] is defined as:
$$PC(x) = \frac{\sum_{o} W_o(x)\,\big\lfloor E_o(x) - T \big\rfloor}{\sum_{o}\sum_{s} A_{s,o}(x) + \varepsilon}$$
where $W_o(x)$ represents the weighting factor at orientation $o$, $T$ is a noise compensation term that reduces unstable low-energy responses, $\lfloor \cdot \rfloor$ sets negative values to zero, and $\varepsilon$ is a small constant to avoid numerical instability. The numerator reflects the useful phase-consistent structural energy after noise suppression, while the denominator normalizes the response by the total local amplitude across all scales and orientations.
Unlike traditional intensity- or gradient-based features, PC is largely insensitive to changes in illumination or contrast because it relies on phase alignment rather than absolute intensity [18,19]. This makes it especially useful for pathological images, where factors like staining variation, local contrast differences, and complex tissue structures can distort intensity-based features [22]. By incorporating PC as a structural prior, the network is guided to focus on more stable morphological features, improving its sensitivity to fine lesion details and irregular tissue boundaries.
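The computation described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in UPDNet: the weighting factor is fixed to 1, the noise term `noise_t` is a constant rather than being estimated adaptively, and the filter parameters `min_wavelength`, `mult`, and `sigma_ratio` are illustrative values chosen here. Only the 6 orientations and 5 scales follow the settings reported in the implementation details.

```python
import numpy as np

def phase_congruency(img, n_scales=5, n_orient=6, min_wavelength=3.0,
                     mult=2.1, sigma_ratio=0.55, noise_t=0.0, eps=1e-8):
    """Simplified phase congruency map of a 2-D grayscale image."""
    rows, cols = img.shape
    # Frequency grids: v varies along rows, u along columns.
    v, u = np.meshgrid(np.fft.fftfreq(rows), np.fft.fftfreq(cols), indexing="ij")
    radius = np.hypot(u, v)
    radius[0, 0] = 1.0                       # avoid log(0) at the DC term
    theta = np.arctan2(-v, u)
    spectrum = np.fft.fft2(img)              # FFT of the input image
    sigma_theta = np.pi / n_orient / 1.2     # angular bandwidth (assumed)

    energy_sum = np.zeros(img.shape)
    amp_sum = np.zeros(img.shape)
    for o in range(n_orient):
        angle = o * np.pi / n_orient
        # Angular distance to the filter orientation, wrapped to [0, pi].
        dtheta = np.abs(np.arctan2(np.sin(theta - angle), np.cos(theta - angle)))
        spread = np.exp(-dtheta**2 / (2 * sigma_theta**2))
        resp_sum = np.zeros(img.shape, dtype=complex)
        for s in range(n_scales):
            f0 = 1.0 / (min_wavelength * mult**s)   # centre frequency of scale s
            radial = np.exp(-np.log(radius / f0)**2 / (2 * np.log(sigma_ratio)**2))
            radial[0, 0] = 0.0                      # remove the DC component
            # IFFT of the one-sided filter: real part = even, imag part = odd.
            resp = np.fft.ifft2(spectrum * radial * spread)
            amp_sum += np.abs(resp)                 # local amplitude A_{s,o}
            resp_sum += resp
        # Orientation energy E_o = |sum over scales|, with noise suppression.
        energy_sum += np.maximum(np.abs(resp_sum) - noise_t, 0.0)
    return energy_sum / (amp_sum + eps)
```

Because the numerator and denominator scale together, multiplying the image by a constant or adding an offset leaves the map essentially unchanged when `noise_t` is zero, which is the contrast robustness discussed above.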
2.3. Dual-Branch Feature Modeling Module
After extracting semantic features from the UNI backbone and structural priors from the PC module, UPDNet refines the fused representation with a dual-branch feature modeling module. One branch focuses on fine local textures, while the other captures multi-scale context. A learnable gating mechanism is used to adaptively balance the two branches at different spatial locations [23].
Let $F_U$ denote the semantic feature map obtained from the UNI backbone, and $P$ represent the structural prior generated by the PC module (Section 2.2). The two feature streams are first aligned and then fused to form the input to the dual-branch refinement module. The fused representation is written as:
$$F = \phi_u(F_U) \odot \gamma\big(\phi_p(P)\big) + \beta\big(\phi_p(P)\big)$$
where $\phi_u$ and $\phi_p$ denote learnable projection operators. The channel-wise modulation functions $\gamma$ and $\beta$ explicitly generate the scale and shift terms for feature adaptation, respectively. In our architecture, $\gamma$ consists of a convolution followed by a Sigmoid activation, while $\beta$ is implemented as a convolution without non-linearity, and $\odot$ indicates element-wise multiplication. This step adjusts the global semantic features by incorporating structural saliency cues, thereby establishing a strong shared feature foundation.
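The scale-and-shift fusion described above can be illustrated with a small NumPy sketch. Everything specific in it is an assumption made for illustration: the parameter names, the use of pointwise (1×1) convolutions for every operator, and the alignment step, which average-pools the PC map down to the token grid of the backbone features.

```python
import numpy as np

def conv1x1(x, weight, bias=None):
    """Pointwise convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    y = np.einsum("oc,chw->ohw", weight, x)
    return y if bias is None else y + bias[:, None, None]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_uni_pc(f_uni, pc_map, params, patch=16):
    """Modulate backbone features (C, h, w) with a PC map (h*patch, w*patch)."""
    h, w = f_uni.shape[1:]
    # Alignment (assumed): average-pool the PC map to the h x w token grid.
    p = pc_map.reshape(h, patch, w, patch).mean(axis=(1, 3))[None]
    f = conv1x1(f_uni, params["phi_u"])          # project semantic features
    p = conv1x1(p, params["phi_p"])              # project structural prior
    scale = sigmoid(conv1x1(p, params["gamma_w"], params["gamma_b"]))
    shift = conv1x1(p, params["beta_w"], params["beta_b"])
    return f * scale + shift                     # element-wise scale and shift
```

With the scale branch initialised to zero weights, the Sigmoid outputs 0.5 everywhere, so the structural prior starts as a uniform attenuation and only becomes spatially selective as the weights are learned.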
For each selected Transformer block, an adapter is inserted (see Section 2.4) with the following configuration: the down-projection maps the ViT block’s native feature dimension, which is inferred dynamically, to a configurable bottleneck dimension (default: 64), and the up-projection maps this bottleneck dimension back to the original feature dimension. A GELU activation, rather than ReLU, is applied between the down-projection and up-projection layers. The adapter parameters are trained together with the classification head (and the optional local MoE branch) using a shared Adam optimizer with a single global learning rate and weight decay.
2.3.1. DConv Branch for Fine-Grained Refinement
The DConv branch is designed to enhance subtle lesion boundaries, morphological details, and fine-grained textural patterns. To efficiently extract local features with minimal parameter overhead, we use depthwise separable convolution, which separates spatial filtering from channel transformation [24]. To maintain the original fused information while emphasizing local details, we implement a residual calibration strategy:
$$F_d = F + \alpha \odot \mathrm{DSConv}(F)$$
where $\mathrm{DSConv}(\cdot)$ refers to the depthwise separable convolution module, which includes batch normalization and nonlinear activation, and $\alpha$ is a learnable weight tensor used to highlight important local regions.
This residual calibration allows the network to selectively enhance fine-grained texture cues while preserving the original fused semantic-structural representation [4], ensuring that the DConv branch refines local details without redundantly re-extracting features.
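The parameter saving from separating spatial filtering and channel transformation can be checked with simple arithmetic. The sketch below counts weights only (no biases or normalization); the 3 × 3 kernel is an assumed value for illustration, and the 64 channels follow the stem width reported in the implementation details.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k      # one k x k spatial filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise

standard = standard_conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
print(standard, separable, round(standard / separable, 2))  # 36864 4672 7.89
```

For this configuration the separable form needs roughly one eighth of the weights of a standard convolution with the same kernel size and channel width.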
2.3.2. ATConv Branch for Multi-Scale Context Modeling
While enhancing local textures is essential for detecting micro-lesions, pathological diagnosis also depends on structural information from larger tissue areas. To model long-range structural dependencies effectively, the ATConv branch uses multi-scale dilated convolutions to increase the receptive field [25], capturing broader tissue context without increasing the number of kernel parameters.
This branch adaptively weights and fuses convolutional responses at different scales to highlight effective structural information related to lesions and suppress interference from irrelevant scales. The multi-scale structural feature aggregation can be formulated as:
$$F_a = F + \sum_{k=1}^{K} w_k \cdot \mathrm{DilConv}_{r_k}(F)$$
where $\mathrm{DilConv}_{r_k}(\cdot)$ denotes the dilated convolution operation with the $k$-th dilation rate $r_k$, $w_k$ is the learnable scale weight, and $K$ is the total number of scales used.
By adopting this approach, the ATConv branch can capture multi-scale context while maintaining computational efficiency, making it a natural complement to the DConv branch. The branch also uses a residual structure to preserve the original fused representation while strengthening the global structural context, so that the model captures both fine-grained localization and large-scale tissue understanding.
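The effect of the dilation rate on the receptive field can be illustrated in one dimension. The sketch below is a simplified NumPy illustration, not the ATConv implementation: it uses 1-D signals, zero padding, and hypothetical dilation rates (1, 2, 4), since the rates used in UPDNet are not stated here.

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """Same-length 1-D dilated convolution (cross-correlation, zero padding)."""
    k = len(kernel)
    span = (k - 1) * rate                 # distance between first and last tap
    pad = span // 2
    xp = np.pad(x, (pad, span - pad))
    return sum(kernel[j] * xp[j * rate: j * rate + len(x)] for j in range(k))

def effective_kernel(k, rate):
    """Extent covered by a k-tap kernel at a given dilation rate."""
    return k + (k - 1) * (rate - 1)

def atconv_branch(x, kernels, rates, weights):
    """Residual sum of weighted dilated responses, one term per scale."""
    agg = sum(w * dilated_conv1d(x, ker, r)
              for w, ker, r in zip(weights, kernels, rates))
    return x + agg

# A 3-tap kernel covers 3, 5 and 9 samples at rates 1, 2 and 4,
# while the number of kernel weights stays at 3 in every case.
print([effective_kernel(3, r) for r in (1, 2, 4)])  # [3, 5, 9]
```

The same three weights therefore see a progressively wider neighbourhood as the rate grows, which is why stacking several rates gives multi-scale context at a fixed per-kernel parameter count.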
2.3.3. Spatial Gating Fusion
After obtaining the fine-grained refinement feature $F_d$ and the multi-scale contextual feature $F_a$ from the fused representation $F$, UPDNet does not directly concatenate them. Instead, a learnable spatial gating mechanism is introduced to adaptively determine the relative contribution of the two branches at each spatial position. The gate weights are computed as:
$$[g_d,\ g_a] = \mathrm{Softmax}\big(G([F_d\,;\,F_a])\big)$$
where $[\cdot\,;\,\cdot]$ denotes channel-wise concatenation, $G$ is a lightweight gating function implemented by stacked convolutions, and $g_d$ and $g_a$ are normalized spatial coefficients corresponding to the DConv and ATConv branches, respectively.
The final refined representation is then obtained through gated residual fusion:
$$F_{out} = F + g_d \odot F_d + g_a \odot F_a$$
Compared with plain concatenation, this position-wise fusion strategy allows the network to emphasize texture-sensitive responses in subtle lesion regions while relying more on contextual dependency modeling in structurally complex areas [26]. Therefore, feature interference between heterogeneous branches can be alleviated, and the discriminability of the final representation can be improved.
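A minimal NumPy sketch of position-wise gated fusion is given below. It makes two simplifying assumptions that are not taken from the paper: the gating function is a single pointwise convolution rather than stacked convolutions, and the normalization of the two gate maps is a softmax over the branch axis.

```python
import numpy as np

def spatial_gate_fusion(f, f_d, f_a, gate_w, gate_b):
    """f, f_d, f_a: (C, H, W); gate_w: (2, 2C); gate_b: (2,)."""
    stacked = np.concatenate([f_d, f_a], axis=0)            # (2C, H, W)
    logits = np.einsum("oc,chw->ohw", gate_w, stacked) + gate_b[:, None, None]
    logits -= logits.max(axis=0, keepdims=True)             # numerical stability
    gates = np.exp(logits)
    gates /= gates.sum(axis=0, keepdims=True)               # softmax per pixel
    g_d, g_a = gates[0:1], gates[1:2]                       # broadcast over C
    return f + g_d * f_d + g_a * f_a, gates
```

Each spatial position receives its own pair of weights that sum to one, so a location dominated by fine texture can lean on the DConv branch while a neighbouring location can lean on the ATConv branch.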
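Finally, the fused feature $F_{out}$ is sent to the classification head for category prediction:
$$\hat{y} = \mathrm{Softmax}\big(W_c \cdot \mathrm{GAP}(F_{out}) + b_c\big)$$
where $\mathrm{GAP}(\cdot)$ represents global average pooling, and $W_c$ and $b_c$ are the classifier parameters.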
2.4. Parameter-Efficient Fine-Tuning Strategy
To improve generalization under limited annotated data, UPDNet adopts a PEFT strategy. Instead of updating all parameters of the large pre-trained UNI backbone, the method freezes the backbone and optimizes only a small number of task-specific parameters. This design reduces the optimization burden and helps alleviate overfitting in small-sample breast cancer histopathological image classification.
For each selected transformer block, a bottleneck-style Adapter is inserted to provide lightweight task adaptation [27]. The adapted feature representation is expressed as:
$$h_l' = h_l + s \cdot W_{up}\,\sigma\big(W_{down}\,h_l\big)$$
where $h_l$ denotes the feature representation of the $l$-th transformer block, $W_{down}$ and $W_{up}$ are the down-projection and up-projection matrices of the Adapter, $\sigma$ is a nonlinear activation function, and $s$ is a learnable scaling factor. With this residual bottleneck design, only a small number of additional parameters are introduced.
During optimization, the backbone is fixed, while the Adapter, dual-branch module, and classifier are jointly updated. This PEFT strategy efficiently exploits UNI’s prior knowledge, ensuring sufficient capacity for task-specific learning with high parameter efficiency and robustness.
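The residual bottleneck adapter and its parameter cost can be sketched as follows. This is an illustrative NumPy version under stated assumptions: bias terms are omitted from the forward pass, GELU uses the tanh approximation, and the parameter count assumes the standard ViT-L/16 configuration (feature dimension 1024, 24 blocks) with the bottleneck dimension of 64 reported in the implementation details.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def adapter(h, w_down, w_up, scale):
    """h: (tokens, d); w_down: (d, r); w_up: (r, d); scale: scalar."""
    return h + scale * (gelu(h @ w_down) @ w_up)

def adapter_param_count(d, r, n_blocks, bias=True):
    """Trainable adapter parameters: two projections, optional biases, one scale."""
    per_block = 2 * d * r + (d + r if bias else 0) + 1
    return per_block * n_blocks

print(adapter_param_count(d=1024, r=64, n_blocks=24))  # 3171864
```

Under these assumptions the adapters add about 3.2 million trainable parameters, on the order of 1% of a ViT-L backbone. Initialising `w_up` to zeros makes the adapter an identity mapping at the start of training, a common choice that leaves the frozen backbone's behaviour unchanged until the adapter learns something useful.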
3. Experimental Setup
In this section, we perform experiments on BRACS, BreakHis, and BACH to evaluate the effectiveness of the proposed method. The following subsections provide details on the datasets, evaluation protocol, implementation, and result analysis.
3.1. Datasets
3.1.1. BRACS Dataset
The BRACS dataset, which stands for BReAst Carcinoma Subtyping, is used for classifying breast lesions through H&E-stained histology images. It consists of 547 whole-slide images and 4539 regions of interest, all of which have been annotated by expert pathologists. The dataset is divided into seven categories (Figure 3): Normal (N), Pathological Benign (PB), Usual Ductal Hyperplasia (UDH), Flat Epithelial Atypia (FEA), Atypical Ductal Hyperplasia (ADH), Ductal Carcinoma in Situ (DCIS), and Invasive Carcinoma (IC). With its inclusion of atypical lesions and detailed subtypes, BRACS presents a challenging and clinically meaningful classification task, which is why it was chosen as the primary dataset for this study.
3.1.2. BreakHis Dataset
The BreakHis dataset is commonly used for breast histopathological image classification. It includes 7909 images from 82 patients, with benign and malignant samples taken at four magnification levels: 40×, 100×, 200×, and 400×. Beyond the basic classification of benign versus malignant, BreakHis offers eight histological subtypes: four benign (adenosis, fibroadenoma, phyllodes tumor, tubular adenoma) and four malignant (ductal carcinoma, lobular carcinoma, mucinous carcinoma, papillary carcinoma). With its use of multiple magnifications and diverse subtypes, BreakHis is a useful dataset for testing the robustness of the proposed method across different scales.
3.1.3. BACH Dataset
The BACH Dataset is a public benchmark introduced in the ICIAR 2018 Grand Challenge on Breast Cancer Histology Images. Its microscopy subset includes 400 H&E-stained breast histology images, evenly distributed across four categories: Normal (N), Benign (B), In Situ Carcinoma (IS), and Invasive Carcinoma (I). Compared to BRACS, BACH offers a more compact and balanced multi-class setting, and compared to BreakHis, it provides a cleaner image-level benchmark for classification performance. For these reasons, BACH was used as an additional dataset to validate the generalization capability of the proposed method.
3.2. Evaluation Metrics
To evaluate the proposed method on BRACS, BreakHis, and BACH, four standard metrics were used: accuracy (ACC), precision (Pre), F1-score, and AUC. These metrics measure the model’s performance from various angles, including overall accuracy, prediction reliability, the balance between precision and recall, and class separability.
The overall accuracy is defined as:
$$\mathrm{ACC} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big(\hat{y}_i = y_i\big)$$
where $N$ denotes the total number of test samples, $y_i$ is the ground-truth label of the $i$-th sample, $\hat{y}_i$ is the model’s predicted label, and $\mathbb{1}(\cdot)$ is the indicator function (equal to 1 if the prediction matches the ground truth, 0 otherwise).
For multi-class classification, AUC was computed using the one-versus-rest strategy and averaged over all categories. Precision and F1-score were also reported to provide a more comprehensive evaluation of the proposed method.
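The metrics above can be written out in a few lines of NumPy. The sketch below is a plain reference illustration rather than the evaluation code used for the reported results; in particular, the unweighted (macro) average over classes for AUC is an assumption about how the one-versus-rest scores are combined.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the ground truth."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def precision_recall_f1(y_true, y_pred, cls):
    """Precision, recall and F1-score for one class treated as positive."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return float(precision), float(recall), float(f1)

def binary_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def macro_ovr_auc(y_true, probs):
    """One-versus-rest AUC averaged over all classes."""
    y_true = np.asarray(y_true)
    return float(np.mean([binary_auc((y_true == c).astype(int), probs[:, c])
                          for c in range(probs.shape[1])]))
```

For example, with labels `[0, 0, 1, 1, 2, 2]` and predictions `[0, 1, 1, 1, 2, 2]`, the accuracy is 5/6 and class 1 has precision 2/3, recall 1, and F1-score 0.8.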
3.3. Implementation Details
All experiments were implemented using Python 3.11.5 and PyTorch 2.0.1 with CUDA 12.2, and run on a single NVIDIA RTX 3090 Ti GPU. To ensure a fair and reproducible comparison, all experiments, including those for the baseline models, were conducted on identical dataset partitions, using the official partitions provided by the dataset creators where available. The BRACS dataset contains 4539 images, with 3657 images used for training, 312 images for validation, and 570 images for testing. The BreakHis dataset consists of 7909 images, with each magnification-level subset split into training and test sets in a 7:3 ratio, giving 5536 training images and 2373 test images. The BACH dataset contains 400 images, which are randomly split into training and test sets in an 8:2 ratio, with 320 images in the training set and 80 images in the test set. All evaluated methods follow exactly the same data splits and preprocessing protocols.
All images are resized to 224 × 224 and preprocessed according to the UNI model’s official pipeline. We use the UNI (ViT-L/16) foundation model as a frozen backbone. For parameter-efficient fine-tuning, we insert adapters into all Transformer blocks, with a bottleneck dimension of 64. Only the adapters, proposed modules, and classification head are trained. The PC module uses a Log-Gabor filter bank with 6 orientations and 5 scales. The noise threshold T is estimated adaptively as in Kovesi’s method. The PC map is concatenated with UNI features and fed into the dual-branch module. The dual-branch module includes a 64-channel stem layer, a DConv branch, and an ATConv branch with multi-scale dilation. A spatial gating mechanism dynamically fuses the two branches. The final classifier uses global average pooling and a fully connected layer with a dropout rate of 0.5. We train using the Adam optimizer with a learning rate of
, batch size of 64, weight decay of
, and cosine annealing learning rate schedule. All reported results are evaluated on the test set; the validation set is used only for model selection. To augment the data, we applied random horizontal flipping and random cropping to the images. We have made our code publicly available at
https://github.com/flipped123-wq/UPDNet (accessed on 13 May 2026).
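The cosine annealing schedule mentioned in the training setup follows a simple closed form, sketched below. The base learning rate of `1e-4` and the 30-epoch horizon are placeholder values used only for illustration; they are assumptions, not the settings reported for UPDNet.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing without restarts."""
    cosine = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cosine

# Placeholder values: base_lr and total_epochs are hypothetical.
schedule = [cosine_annealing_lr(e, 30, 1e-4) for e in range(31)]
```

The rate starts at `base_lr`, passes through half of it at the midpoint, and decays smoothly to `min_lr` at the final epoch, which avoids the abrupt drops of step schedules.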
3.4. Experiment Result Analysis
To thoroughly evaluate the effectiveness of UPDNet, comparative experiments were conducted on the BRACS, BACH, and BreakHis datasets. BRACS served as the primary benchmark to assess the fine-grained classification capability, while BACH was used to evaluate generalization performance under balanced class settings. BreakHis was employed to examine UPDNet’s robustness under different magnification levels. As shown in Table 1, the method’s performance was evaluated based on Accuracy, Precision, Recall, F1-score, and AUC. The results show that the proposed method performs consistently well across all three datasets, demonstrating strong classification accuracy, prediction reliability, and class separation. These findings highlight UPDNet’s competitiveness and stability across various classification tasks and data distributions.
3.4.1. Comparison on BRACS 7-Class Dataset
The BRACS (BReAst Carcinoma Subtyping) dataset includes seven categories of breast histopathological images: N, PB, UDH, FEA, ADH, DCIS, and IC. It presents a challenge for fine-grained classification, particularly in detecting micro-lesions with unclear boundaries, often affected by staining and contrast variations. As shown in Table 2, UPDNet outperforms all other methods, achieving the highest weighted F1-score of 67.46%. It also reached 92.8% in the IC category, significantly surpassing other methods. These results suggest that UPDNet captures subtle structural details that help separate closely related lesion subtypes, which is valuable for breast cancer diagnosis.
In the comparison of methods, UPDNet clearly surpasses several existing approaches. For instance, CLAM performs well in the IC category but struggles with more complex categories like ADH and UDH. While Patch-GNN and TransMIL perform decently in some categories, UPDNet consistently shows more balanced and stable results across all classes.
Figure 4 (left) presents the confusion matrix for UPDNet on the BRACS dataset. Correctly classified samples are shown along the diagonal, while misclassifications appear off the diagonal. As seen in the figure, UPDNet performs exceptionally well in most categories, with particularly high performance in the IC category, where it correctly classifies a significant number of samples.
In addition, Figure 5 shows a bar chart comparing the classification accuracy of UPDNet with other methods on the BRACS (left) and BACH (right) datasets. As shown, UPDNet achieves the highest accuracy on the BRACS dataset, which demonstrates its strong performance in fine-grained breast cancer classification tasks.
In summary, UPDNet achieves the best overall results on the BRACS dataset among the compared methods, indicating robustness in capturing subtle structural details. Its strongest per-class performance is obtained in the IC category.
3.4.2. Comparison on BreakHis Dataset
Besides the BRACS dataset, we also assessed UPDNet’s performance on the BreakHis dataset, a widely used benchmark for breast histopathological image classification. We compared UPDNet with several state-of-the-art methods, including DenseNet, ResNet50, and other hybrid models, across four magnification levels. Table 3 presents the classification accuracy comparison at different magnification levels. UPDNet performed well at all magnifications, achieving accuracies of 99.60% at 40×, 99.35% at 100×, 99.81% at 200×, and 99.22% at 400×. These results demonstrate UPDNet’s ability to handle various image scales, making it robust to magnification changes and well-suited for real-world clinical scenarios where different magnification levels are used.
As illustrated in Figure 6, a visualization analysis is performed on the proposed UPDNet model under different magnification levels (40×, 100×, 200×, 400×). The attention heatmaps show the regions that the model focuses on in pathological images at each scale. UPDNet stably focuses on lesion regions, tissue edges, and fine-grained structures at all magnification levels, while remaining sensitive to micro-lesions and key cellular morphologies. The model also responds consistently in both low-magnification global fields of view and high-magnification detail fields of view, demonstrating multi-scale robustness that matches the requirements of breast cancer pathological diagnosis at the different imaging magnifications used in clinical practice.
3.4.3. Comparison on BACH Dataset
To further assess the generalization capability of UPDNet, we conducted experiments on the BACH dataset, a public benchmark released as part of the ICIAR 2018 Grand Challenge on Breast Cancer Histology Images.
In this experiment, we compared UPDNet with several state-of-the-art methods, including DeiT, Swin Transformer, and ResViT-GANNet, in both 2-class and 4-class classification tasks. Table 4 shows the classification accuracy for both tasks. UPDNet achieved 98.75% accuracy in the 2-class task and 97.50% in the 4-class task, outperforming all other methods. These results demonstrate UPDNet’s strong generalization ability in balanced classification scenarios.
Additionally, Figure 4 (right) shows the confusion matrix for UPDNet on the BACH dataset. The diagonal values represent correctly classified samples, while the off-diagonal values indicate misclassifications. As seen in the figure, UPDNet distinguishes the categories well in both the BRACS and BACH datasets.
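To make the reading of a confusion matrix concrete, the following minimal sketch derives overall accuracy and per-class recall from the diagonal and the row sums. The counts are hypothetical toy values chosen for illustration; they are not the values in Figure 4. The helper names are likewise assumptions.

```python
# Rows are true classes, columns are predicted classes.
# Class names follow the four BACH categories; the counts are made up.
class_names = ["Normal", "Benign", "InSitu", "Invasive"]
toy_confusion = [
    [10, 0, 0, 0],
    [1, 9, 0, 0],
    [0, 0, 10, 0],
    [0, 0, 1, 9],
]


def overall_accuracy(cm):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total if total else 0.0


def per_class_recall(cm):
    """Diagonal entry divided by its row sum, for each true class."""
    recalls = []
    for i, row in enumerate(cm):
        row_total = sum(row)
        recalls.append(row[i] / row_total if row_total else 0.0)
    return recalls


toy_accuracy = overall_accuracy(toy_confusion)
toy_recalls = per_class_recall(toy_confusion)

print(f"accuracy: {toy_accuracy:.3f}")
for name, recall in zip(class_names, toy_recalls):
    print(f"recall[{name}]: {recall:.2f}")
```

Off-diagonal entries show which pairs of classes are confused. In the toy matrix, for example, one Benign sample is predicted as Normal.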
Figure 5 (right) shows a bar chart comparing UPDNet’s classification accuracy with that of other methods on the BACH dataset. As illustrated, UPDNet achieves the highest accuracy on both datasets shown in the figure, further highlighting its strong performance in fine-grained breast cancer classification tasks.
3.4.4. Convergence Analysis
To further assess UPDNet’s optimization and training stability, Figure 7 shows the validation performance curves on the BreakHis and BACH datasets over 30 epochs. As seen in Figure 7a, the accuracy curves on BreakHis at four magnification levels (40×, 100×, 200×, and 400×) rise quickly in the early stages of training and then stabilize, indicating that the model converges rapidly while maintaining strong performance across different image scales. This aligns with BreakHis’s role in our experiments, which is to test the model’s robustness across varying magnification factors.
As shown in Figure 7b, on the BACH dataset, both the ACC and AUC curves of the 2-class and 4-class classification tasks exhibit a clear upward trend during the first several epochs and remain stable in the later stage. In particular, the AUC values quickly approach a high level and show only minor fluctuations afterward, suggesting that UPDNet has good class discrimination ability and stable optimization performance under both binary and multi-class settings. Since BACH is used in this paper to further verify the generalization ability of the proposed method under a relatively balanced image-level benchmark, the convergence behavior shown in Figure 7 further supports the effectiveness and reliability of UPDNet on this dataset.
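One simple way to quantify "rises quickly and then stabilizes" is to find the first epoch from which a validation curve stays within a small tolerance of its final value. The sketch below is an assumption about how such a check could be written, not the procedure used in this work; the curve is synthetic and the 0.5-point tolerance is arbitrary.

```python
def stabilization_epoch(curve, tolerance=0.5):
    """Return the first 1-based epoch from which every later value
    stays within `tolerance` of the final value of the curve."""
    if not curve:
        raise ValueError("curve must contain at least one value")
    final_value = curve[-1]
    epoch = len(curve)
    # Walk backwards from the last epoch until a value leaves the band.
    for index in range(len(curve) - 1, -1, -1):
        if abs(curve[index] - final_value) > tolerance:
            break
        epoch = index + 1
    return epoch


# Synthetic validation accuracy (%) per epoch, for illustration only.
synthetic_curve = [70.0, 85.0, 93.0, 97.0, 98.6, 99.0, 98.9, 99.1, 99.0]
stable_from = stabilization_epoch(synthetic_curve, tolerance=0.5)
print(f"curve is stable from epoch {stable_from}")
```

A smaller tolerance gives a stricter definition of convergence and therefore a later stabilization epoch.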
3.5. Ablation Study
We conducted an ablation study to quantify the contribution of each component of UPDNet to the model’s overall performance. Starting from the baseline model (UNI), we examined the effect of the PC module, the dual-branch feature learning structure and its individual DConv and ATConv branches, and PEFT. The study was conducted on the BRACS, BreakHis, and BACH datasets to evaluate the model’s generalization across different types of histopathological data.
Table 5 compares the performance of different model configurations across the three breast histopathology datasets to validate the contribution of each module. Taking the BRACS dataset as an example, the baseline model (UNI), which includes none of the proposed modules, achieved an accuracy of 59.30%. The full UPDNet model, incorporating all proposed modules, raised this to 68.58%, a gain of 9.28 percentage points that demonstrates the effectiveness of our architecture.
Integrating the PC module into the baseline increased the accuracy from 59.30% to 62.75%, highlighting its importance in capturing fine-grained structural details in histopathological images. Adding only DConv or only ATConv to the PC-enhanced baseline yielded 63.98% and 64.53%, respectively, whereas combining both in the dual-branch feature learning structure raised the accuracy to 66.87%. The combined gain of 4.12 percentage points over the PC-enhanced baseline exceeds the sum of the individual gains (1.23 and 1.78 points), which is consistent with the two branches playing complementary roles in local feature extraction and contextual understanding.
Finally, incorporating the PEFT module brought the overall accuracy to 68.58%. This improvement of 1.71 percentage points over the 66.87% configuration indicates its beneficial impact on model generalization and representation learning. These results confirm that each component of UPDNet contributes to its overall effectiveness.
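The increments reported for BRACS can be checked with a short calculation. The sketch below uses only the accuracies stated in this section; the configuration labels and variable names are illustrative.

```python
# BRACS accuracies (%) quoted in the ablation study.
bracs_accuracy = {
    "UNI": 59.30,
    "UNI+PC": 62.75,
    "UNI+PC+DConv": 63.98,
    "UNI+PC+ATConv": 64.53,
    "UNI+PC+DualBranch": 66.87,
    "UPDNet (full, with PEFT)": 68.58,
}


def gain(after, before):
    """Accuracy gain in percentage points between two configurations."""
    return round(bracs_accuracy[after] - bracs_accuracy[before], 2)


pc_gain = gain("UNI+PC", "UNI")
dconv_gain = gain("UNI+PC+DConv", "UNI+PC")
atconv_gain = gain("UNI+PC+ATConv", "UNI+PC")
dual_gain = gain("UNI+PC+DualBranch", "UNI+PC")
peft_gain = gain("UPDNet (full, with PEFT)", "UNI+PC+DualBranch")
total_gain = gain("UPDNet (full, with PEFT)", "UNI")

print(f"PC module:        +{pc_gain:.2f}")
print(f"DConv only:       +{dconv_gain:.2f}")
print(f"ATConv only:      +{atconv_gain:.2f}")
print(f"Dual branch:      +{dual_gain:.2f}")
print(f"PEFT:             +{peft_gain:.2f}")
print(f"Total over UNI:   +{total_gain:.2f}")

# The dual-branch gain is larger than the two single-branch gains added up.
branches_are_superadditive = dual_gain > dconv_gain + atconv_gain
print(f"dual > DConv + ATConv: {branches_are_superadditive}")
```

The three sequential gains (PC, dual branch, PEFT) add up to the total gain over the baseline, so the reported figures are internally consistent.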
4. Discussion
This work proposes UPDNet, a novel framework for breast cancer histopathology image classification. As the ablation study (Table 5) shows, integrating the PC module, dual-branch refinement, and PEFT raises the accuracy of the UNI baseline on BRACS from 59.30% to 68.58%. This improvement stems from addressing a fundamental gap in the current literature: while CNNs (e.g., ResNet, CLAM) often miss global context, foundation models (e.g., TransMIL) frequently overlook subtle morphological details. UPDNet bridges this gap. The PC module provides a contrast-invariant structural prior that is highly robust to staining variations. The dual-branch architecture then explicitly decouples feature learning: the DConv branch captures fine-grained micro-lesions, while the ATConv branch models multi-scale tissue structures.
Furthermore, we observed distinct dataset-specific behaviors. On the complex 7-class BRACS dataset, UPDNet achieves a 92.8% F1-score on IC, which we attribute to the DConv branch’s sensitivity to subtle cellular atypia. On the BreakHis dataset, the model maintains accuracy above 99.2% across magnifications (40× to 400×), which we attribute to the ATConv branch’s scale-robustness. Despite these advantages, UPDNet has limitations. The PC computation and dual-branch feature modeling increase computational overhead and inference time, limiting deployment in resource-constrained environments. Additionally, the model relies on fully annotated datasets and has not been validated on real-world clinical data, which limits its readiness to support clinical decision-making. Future work will explore data-efficient paradigms such as weakly supervised or self-supervised learning.
5. Conclusions
This paper introduces UPDNet, a novel multi-component fusion model aimed at improving the accuracy and robustness of breast cancer classification. By integrating PC, dual-branch feature refinement, and PEFT, UPDNet tackles key challenges such as fine-grained lesion detection, multi-scale feature fusion, and small-sample learning. The PC module improves local feature extraction, while the dual-branch feature refinement module combines global semantic and local detail information: the DConv branch enhances fine-grained feature extraction, and the ATConv branch strengthens multi-scale contextual modeling. Meanwhile, the PEFT strategy fine-tunes only a small number of parameters, reducing the cost of adaptation and improving the model’s generalization ability. Experimental results show that UPDNet surpasses existing methods on the BRACS, BreakHis, and BACH datasets, particularly in fine-grained lesion detection and small-sample learning. Its robustness across datasets and magnification levels, combined with an interpretable attention mechanism, makes it a promising candidate for clinical use. Overall, UPDNet offers efficient and reliable support for early breast cancer diagnosis and treatment decisions.