Article

Clinically Oriented Evaluation of Transfer Learning Strategies for Cross-Site Breast Cancer Histopathology Classification

by Liana Stanescu * and Cosmin Stoica-Spahiu
Department of Computer Science and Information Technology, Faculty of Automation, Computers and Electronics, University of Craiova, 200585 Craiova, Romania
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12819; https://doi.org/10.3390/app152312819
Submission received: 5 November 2025 / Revised: 24 November 2025 / Accepted: 1 December 2025 / Published: 4 December 2025
(This article belongs to the Special Issue Big Data Integration and Artificial Intelligence in Medical Systems)

Abstract

Background/Objectives: Breast cancer diagnosis based on histopathological examination remains the most reliable and widely accepted approach in clinical practice, despite being time-consuming and prone to inter-observer variability. While deep learning methods have achieved high accuracy in medical image classification, their cross-site generalization remains limited due to differences in staining protocols and image acquisition. This study aims to evaluate and compare three clinically relevant adaptation strategies to improve model robustness under domain shift. Methods: The ResNet50V2 model, pretrained on ImageNet and further fine-tuned on the Kaggle Breast Histopathology Images dataset, was subsequently adapted to the BreaKHis dataset under three clinically relevant transfer strategies: (i) threshold calibration without retraining (site calibration), (ii) head-only fine-tuning (light FT), and (iii) full fine-tuning (full FT). Experiments were performed on an internal balanced dataset and on the public BreaKHis dataset using strict patient-level splitting to avoid data leakage. Evaluation metrics included accuracy, precision, recall, F1-score, ROC-AUC, and PR-AUC, computed per magnification level (40×, 100×, 200×, 400×). Results: Full fine-tuning consistently yielded the highest performance across all magnifications, reaching up to 0.983 ROC-AUC and 0.980 sensitivity at 400×. At 40× and 100×, the model correctly identified over 90% of malignant cases, with ROC-AUC values of 0.9500 and 0.9332, respectively. Head-only fine-tuning led to moderate gains (e.g., sensitivity up to 0.859 at 200×), while threshold calibration showed limited improvements (ROC-AUC ranging from 0.60 to 0.73). Grad-CAM analysis revealed more stable and focused attention maps after full fine-tuning, though they did not always align with diagnostically relevant regions. Conclusions: Our findings confirm that full fine-tuning is essential for robust cross-site deployment of histopathology AI systems, particularly at high magnifications. Lighter strategies such as threshold calibration or head-only fine-tuning may serve as practical alternatives in resource-constrained environments where retraining is not feasible.

1. Introduction

Breast cancer remains one of the most prevalent malignancies worldwide and a leading cause of cancer-related mortality among women [1]. Histopathological analysis of tissue biopsies represents the reference standard for diagnosis, allowing pathologists to differentiate between benign and malignant lesions and to assess tumor subtype and grade. However, manual inspection of slides is time-consuming and subject to inter-observer variability, motivating the development of computer-aided diagnosis (CAD) systems.
Deep learning has transformed medical image analysis, with convolutional neural networks (CNNs) and more recently transformer-based architectures achieving state-of-the-art performance across multiple applications [2,3]. ResNet [4], DenseNet [5], ConvNeXt [6], and Vision Transformers [7] have all been applied successfully to histopathological image classification. The Breast Cancer Histopathological Database (BreaKHis) [8] has become a widely used benchmark, comprising 7909 images from 82 patients labeled as benign or malignant across four magnification levels (40×, 100×, 200×, 400×). This magnification diversity makes it well suited for evaluating scale-dependent performance.
In this study, the ResNet50V2 model was first fine-tuned on the Kaggle Breast Histopathology dataset to establish a balanced baseline and subsequently adapted to BreaKHis to assess cross-site generalization under controlled conditions.
A key challenge in digital pathology is that models must generalize across acquisition sites, staining conditions, and patient populations. Many existing studies rely on random image-level splits, allowing patches from the same patient in both training and test sets and overestimating performance. In realistic clinical scenarios, however, models encounter entirely new patients, making patient-level independence essential.
Although recent advances in attention mechanisms, stain normalization, and self-supervised learning have improved robustness, the relative effectiveness of different adaptation depths under domain shift remains insufficiently explored. No prior work has systematically compared threshold calibration, head-only fine-tuning, and full fine-tuning using strict patient-level separation and magnification-specific evaluation.
This study offers the following contributions:
  • A unified comparison of three adaptation strategies—threshold calibration, head-only fine-tuning, and full fine-tuning—for cross-site histopathology classification.
  • Strict patient-level separation for both internal and BreaKHis datasets to prevent information leakage.
  • A magnification-aware analysis quantifying the trade-off between adaptation cost and diagnostic performance.
  • A clinically oriented discussion on model interpretability and realistic deployment scenarios.
This perspective emphasizes robustness, reproducibility, and practical deployability within clinically realistic cross-site conditions.

2. Related Work

Deep learning has significantly advanced breast cancer histopathology classification, with numerous studies proposing increasingly sophisticated architectures and training strategies. Table 1 summarizes representative approaches, highlighting major model types, datasets, and common advantages and limitations.
Several publicly available datasets support research in histopathology image classification. Among these, the BreaKHis database [8] is one of the most widely used benchmarks, providing 7909 images from 82 patients across four magnifications (40×, 100×, 200×, 400×), which enables systematic assessment of scale-dependent performance. The Kaggle Breast Histopathology dataset contains over 250,000 benign and malignant patches derived from whole-slide images, making it suitable for pretraining and establishing robust feature representations. These datasets are frequently adopted in the literature due to their accessibility, diversity, and relevance to clinical diagnostic workflows.
Building on these benchmark datasets, early work on BreaKHis established foundational baselines for patch-level classification. Spanhol et al. [8] introduced baseline CNN models, while Araujo et al. [9] trained deeper networks from scratch to improve feature representation. Bayramoglu et al. [10] proposed a multi-scale CNN that integrates information across magnifications, demonstrating improved accuracy over single-scale approaches. As shown in Table 1, these initial methods laid the groundwork for deep learning in histopathology but remained limited by shallow architectures, high data requirements, or computational overhead.
Transfer learning approaches soon emerged as a more efficient alternative. Studies employing pretrained architectures—such as ResNet, DenseNet, and related CNN families—reported substantial performance gains compared with training from scratch [11]. Structured deep learning models further enhanced interpretability and tissue representation by incorporating prior spatial relationships [12]. However, as indicated in Table 1, these frameworks typically rely on full fine-tuning and do not evaluate the effectiveness of lighter adaptation strategies.
In addition to these CNN-based innovations, multiple studies explored modeling patch-level relevance and weakly supervised aggregation using attention-based multiple-instance learning, which improved lesion localization and interpretability in histopathology workflows [13]. Stain-normalization and domain adaptation techniques further addressed color variability across medical centers, yielding measurable gains in cross-site robustness [14]. More recently, self-supervised learning frameworks have leveraged large unlabeled repositories to extract transferable histopathology representations, reducing the dependence on extensive annotations [15].
Transformer-based architectures have recently gained prominence due to their ability to model long-range dependencies in tissue structure. Vision Transformer and Swin Transformer variants have achieved state-of-the-art results in lung adenocarcinoma [16], esophageal pathology [17], and laryngeal tumor grading [18,19]. Feature-disentangled transformers have further improved robustness and interpretability in squamous cell carcinoma grading [20]. Despite their strong performance, Table 1 shows that these models often require large, annotated datasets and lack systematic evaluation under cross-site domain shift.
Complementary lines of work have explored radiology-driven feature fusion [21], attention-enhanced CNNs [22], multi-branch and fusion frameworks [23], ensemble optimization [24], and multiscale deep networks [25,26]. While these approaches improve robustness across various medical imaging tasks, their computational complexity can hinder clinical deployment.
A persistent methodological challenge in digital pathology is the widespread use of random image-level splits, which allow patches from the same patient to appear in both training and test sets. This practice introduces information leakage and leads to overly optimistic performance estimates. Campanella et al. [27] demonstrated that clinical-grade computational pathology requires strict slide- and patient-level independence to avoid unrealistic accuracy inflation. This underscores the importance of rigorous, patient-level data separation, particularly when evaluating cross-site generalization.
Several studies have addressed domain shift using stain normalization [14] or adversarial domain adaptation [28], while self-supervised learning has shown promise for representation learning with limited labels [15]. However, the complexity and computational cost of these approaches, also summarized in Table 1, limit their scalability in low-resource clinical environments.
Interpretability is another critical challenge. Grad-CAM [29] remains the most widely used saliency method in histopathology, but multiple studies have shown that its attention maps may not reliably correspond to diagnostically meaningful structures [30]. These limitations motivate the need to evaluate explainability methods alongside classification performance, particularly in clinically oriented workflows.
Despite architectural progress, key gaps remain in the literature:
  • Most studies rely on a single adaptation strategy, typically full fine-tuning.
  • Lightweight strategies such as threshold calibration or head-only fine-tuning remain underexplored.
  • Few works systematically quantify performance variation across magnification levels.
  • Robustness under domain shift is not consistently evaluated.
These limitations motivate the present study, which provides a unified comparison of adaptation strategies under clinically realistic, patient-level cross-site conditions.
Table 1. Representative Deep Learning Approaches for Histopathology Classification.
Study/Approach | Model Type/Innovation | Dataset(s) | Advantages | Limitations
Spanhol et al. [8] | Baseline CNNs | BreaKHis | Establishes initial benchmarks | Limited depth; modest performance
Araujo et al. [9] | Deeper CNNs | BreaKHis | Improved representational capacity | Requires large datasets; trained from scratch
Bayramoglu et al. [10] | Multi-scale CNN | BreaKHis | Integrates multi-magnification cues | High computational cost
Transfer learning CNNs [11] | Pretrained ResNet/DenseNet | BreaKHis | Efficient training; strong accuracy | Focus on full fine-tuning
Structured DL models [12] | Structured/hierarchical CNNs | BreaKHis | Better modeling of tissue structure | Increased architecture complexity
Transformers [16,17,18,19,20] | ViT, Swin, graph-attention, FDT | Various histopathology datasets | SOTA performance; improved interpretability | Requires large datasets; limited cross-site testing
Feature fusion & ensembles [23,24,25,26] | CNN fusion, multi-branch, ensemble optimization | Various medical imaging datasets | Enhanced robustness; multi-feature integration | Heavy models; risk of overfitting
Radiology-driven approaches [21] | Multi-fractal + fusion | Mammography | Strong texture encoding | Not histopathology-specific
Stain normalization & adaptation [14,28] | Stain transfer, adversarial adaptation | Histopathology | Mitigates domain shift | Complex pipeline; higher compute
Self-supervised learning [15] | SSL pretraining | Medical imaging datasets | Strong features with few labels | Long training time
Campanella et al. [27] | Weakly supervised WSI classification | Whole-slide images | Demonstrates need for patient-level independence | Shows risk of inflated accuracy under patch-level splits

3. Materials and Methods

This section provides a step-by-step description of the proposed workflow, encompassing dataset preparation, model architecture, training strategies, and implementation details. The overall pipeline consists of four main stages: (i) dataset preprocessing and patient-level splitting, (ii) baseline fine-tuning on an internal dataset (Kaggle IDC), (iii) cross-site adaptation and evaluation on the external BreaKHis dataset, and (iv) performance assessment under different magnifications (40×–400×).

3.1. Generalized Algorithm for Histopathological Image Classification

To place our experimental setup in a broader context, we first outline a generalized deep learning workflow for histopathological image classification. This abstract pipeline, summarized in Algorithm 1, captures the essential stages common to most modern CAD systems based on convolutional or transformer-based architectures, independently of the specific datasets or backbone models. In the subsequent subsections, we instantiate this workflow using ResNet50V2, the Kaggle IDC dataset for pretraining, and BreaKHis as the external target dataset.
Algorithm 1. Generalized Workflow for Histopathological Image Classification
Input:
  Histopathology dataset D = {(Ii, yi)} consisting of whole-slide images (WSIs)
  or pre-extracted patches Ii with class labels yi (e.g., benign/malignant)

Output:
  Trained model M* and predicted labels ŷi for unseen samples

Stage 1: Data preparation and splitting
  Acquire raw WSIs or image patches from one or more institutions
  Optionally perform stain normalization and artefact removal
  If WSIs are used:
    Tile each WSI into patches
    Discard background tiles
    Resize all patches to a fixed input resolution (e.g., 224 × 224)
  Split patients (not images) into train, validation, and test sets
  Ensure that no patient appears in more than one split (patient-level separation)

Stage 2: Model initialization
  Choose a backbone architecture (e.g., CNN or Vision Transformer)
  Initialize the backbone with pretrained weights (e.g., ImageNet) or random weights
  Replace the final classification layer with a task-specific head (e.g., 2-class output)
  Select an adaptation strategy:
    (a) Threshold calibration only
    (b) Head-only fine-tuning
    (c) Full fine-tuning

Stage 3: Training/adaptation
  If using threshold calibration:
    Apply the pretrained model to the target training/validation data
    Learn an optimal decision threshold on the validation set
  Else:
    Freeze or unfreeze layers according to the chosen adaptation strategy
    Train the model on the training set using a suitable loss (e.g., weighted cross-entropy)
    Monitor performance on the validation set
    Select the best checkpoint M* based on a validation metric (e.g., F1-score)

Stage 4: Inference and evaluation
  Apply M* to the test set to obtain predicted probabilities pi
  Apply the chosen decision threshold to obtain final labels ŷi
  Compute evaluation metrics (accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC)
  Optionally generate interpretability maps (e.g., Grad-CAM) for qualitative analysis

Return:
  Trained model M* and performance metrics on the test set

3.2. Proposed Training and Evaluation Procedure (Kaggle → BreaKHis)

The detailed training and evaluation procedure is summarized in Algorithm 2, which describes how the ResNet50V2 model, initialized with ImageNet weights, was first fine-tuned on the Kaggle IDC dataset at the patient level to obtain a balanced baseline model (M*). This pretrained model was then adapted and evaluated on the BreaKHis dataset under three strategies—threshold calibration (SiteCalib), head-only fine-tuning (LightFT), and full fine-tuning (FullFT)—for each magnification level.
Evaluation metrics included accuracy, F1-score, ROC-AUC, and PR-AUC, computed per magnification level, along with Grad-CAM visualizations for interpretability. This design ensures strict patient-level independence and realistic cross-site generalization.
Algorithm 2. Proposed Training and Evaluation Procedure for Kaggle → BreaKHis Adaptation
Input: ResNet50V2 architecture M (initialized with ImageNet weights),
    internal dataset D_Kaggle = {train, val, test},
    external dataset D_BreaKHis = {train, val, test},
    magnifications = {40×, 100×, 200×, 400×}

# Stage 1: Baseline fine-tuning on internal dataset (Kaggle)
Train M on D_Kaggle[train] using weighted cross-entropy loss
Validate on D_Kaggle[val]; select best checkpoint M*
Save M* as pretrained baseline for adaptation

# Stage 2: Adaptation and evaluation on BreaKHis
for each magnification m in magnifications do
  for each adaptation strategy s in {SiteCalib, LightFT, FullFT} do
    if s == SiteCalib:
      Apply M* without retraining; calibrate decision threshold on val[m]
    if s == LightFT:
      Freeze backbone; train classifier head for 5 epochs on train[m]
    if s == FullFT:
      Unfreeze all layers; train for 5 epochs on train[m]
    Select best checkpoint by validation F1
    Optimize threshold on validation set
    Evaluate on test[m]; record ACC, F1, ROC-AUC, PR-AUC; generate Grad-CAM
  end for
end for

Output: Comparative performance and interpretability for all strategies
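To make the orchestration explicit, the nested loop of Algorithm 2 can be sketched directly in Python. The helpers below (load_split, load_checkpoint, freeze_backbone, unfreeze_all, fine_tune, calibrate_threshold, evaluate) are hypothetical placeholders standing in for the procedures detailed in Sections 3.3, 3.4, 3.5, 3.6 and 3.7, not an exact transcript of our implementation.

```python
# Illustrative driver for Algorithm 2; every helper is a hypothetical
# placeholder for the corresponding procedure described in the text.
MAGNIFICATIONS = ["40X", "100X", "200X", "400X"]
STRATEGIES = ["SiteCalib", "LightFT", "FullFT"]

results = {}
for mag in MAGNIFICATIONS:
    train, val, test = load_split("BreaKHis", mag)       # patient-level split
    for strategy in STRATEGIES:
        model = load_checkpoint("kaggle_baseline.pt")    # pretrained M*
        if strategy == "LightFT":
            freeze_backbone(model)                       # head-only training
            fine_tune(model, train, val, epochs=5)
        elif strategy == "FullFT":
            unfreeze_all(model)                          # all layers trainable
            fine_tune(model, train, val, epochs=5)
        # SiteCalib: no retraining, threshold calibration only.
        threshold = calibrate_threshold(model, val)      # maximizes F1 on val
        results[(mag, strategy)] = evaluate(model, test, threshold)
```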

3.3. Datasets

Two datasets were employed in this study. The first is the Kaggle Breast Histopathology Images dataset (IDC) [Dataset S1], hereafter referred to as the initial dataset. It is acquired and organized at the patient level, with each patient contributing multiple image patches labeled as benign or malignant. To avoid data leakage, all patches from a given patient were assigned exclusively to the training, validation, or test split. The training set was balanced through random undersampling to ensure equal class distribution, whereas the validation and test sets preserved the natural prevalence (Table 2).
The choice of the two datasets was motivated by their complementary characteristics. The Kaggle IDC dataset provides a large-scale, patient-organized collection of patches that enables stable pretraining and balanced feature extraction before cross-site adaptation. In contrast, the BreaKHis dataset is a widely used public benchmark offering multi-magnification (40×–400×) histopathological images collected under heterogeneous acquisition conditions, making it well suited for evaluating robustness under domain shift. Using Kaggle IDC for pretraining and BreaKHis for external evaluation models a realistic clinical scenario in which a system trained at one institution must generalize to images acquired elsewhere.
The second dataset is BreaKHis (Breast Cancer Histopathological Database) [8], [Dataset S2], a public benchmark widely used for breast cancer classification.
Its widespread use in prior studies allows direct comparison with existing methods and supports reproducibility.
BreaKHis comprises benign and malignant images collected at four magnification levels (40×, 100×, 200×, and 400×). Each patient contributes multiple images organized into a directory structure by class, patient, and magnification.
To ensure patient-level independence, we generated a new split into training, validation, and test subsets, disjoint at patient level. The split was performed using the unique patient identifier embedded in the folder name (e.g., SOB_B_A_14-22549AB), assigning all images from a single patient exclusively to one subset. The split ratios were approximately 70% for training, 15% for validation, and 15% for testing. This procedure was repeated independently for each magnification level (40×, 100×, 200×, 400×). The number of images per split and magnification is reported in Table 3.
Class imbalance was intentionally preserved to reflect real-world diagnostic variability. Although malignant samples are more frequent at higher magnifications, no artificial resampling or balancing was applied. Instead, to mitigate potential bias, the cross-entropy loss was weighted inversely to class frequency, and extensive data augmentation (random flips, rotations, cropping, and color jittering) was applied to increase the effective diversity of minority-class samples.
To ensure consistency across datasets and experimental conditions, all images were resized to 224 × 224 pixels prior to training or evaluation. This standardization was applied uniformly to both the internal dataset and BreaKHis, regardless of their original resolution or magnification.
Image-level splitting was intentionally avoided, as it can lead to information leakage when patches from the same patient appear in both training and test sets. This design choice ensures clinical realism, where the model is evaluated on entirely unseen patients.
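As a concrete illustration, patient-level splitting can be implemented by grouping image paths by the patient directory before assigning splits. The sketch below assumes the public BreaKHis directory layout, in which each patient folder (e.g., SOB_B_A_14-22549AB) contains one subfolder per magnification; path parsing should be adapted to the local copy of the dataset.

```python
import random
from collections import defaultdict
from pathlib import Path

def patient_level_split(root, magnification="40X", seed=42):
    """Split BreaKHis patients (not images) into ~70/15/15 subsets."""
    by_patient = defaultdict(list)
    for img in Path(root).rglob(f"*/{magnification}/*.png"):
        by_patient[img.parts[-3]].append(img)   # parts[-3] = patient folder

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_train = int(0.70 * len(patients))
    n_val = int(0.15 * len(patients))
    subsets = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Every image of a given patient lands in exactly one subset.
    return {name: [p for pid in ids for p in by_patient[pid]]
            for name, ids in subsets.items()}
```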

3.4. Model Architecture

In this study we employed ResNet50V2, a deep residual convolutional neural network consisting of 50 layers organized into four residual stages. The architecture builds on the original residual learning framework introduced by He et al. [4] and incorporates the improved identity mappings with pre-activation blocks proposed in ResNet v2 [31]. These pre-activation residual units (BatchNorm → ReLU → Conv) enhance gradient flow and training stability, making them particularly effective in transfer learning settings.
The backbone contains approximately 25.6 million parameters [4,31] and is composed of:
  • an initial stem (7 × 7 convolution + max pooling),
  • four residual stages with bottleneck blocks (1 × 1 → 3 × 3 → 1 × 1 convolutions),
  • global average pooling,
  • a fully connected classification head.
To adapt the architecture to our binary classification task (benign vs. malignant), the original 1000-class fully connected layer was replaced with a two-unit classifier followed by a softmax activation. All convolutional layers retained ImageNet-pretrained weights, while the classifier head was initialized randomly.
A schematic overview of ResNet50V2 is provided in Figure 1, illustrating the hierarchical progression from low-level texture features to high-level morphological patterns relevant to histopathology.
This architecture was selected because:
  • ResNet backbones have proven robust and widely adopted in computational pathology and biomedical imaging [4].
  • The residual structure enables stable gradient propagation and reliable transfer learning even with limited training data [31].
  • ResNet50V2 offers a favorable balance between representational depth and computational efficiency, making it suitable for both lightweight and full fine-tuning scenarios.
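To illustrate the head replacement described above, the following minimal PyTorch sketch swaps the 1000-class ImageNet classifier for a two-unit head. torchvision ships the original ResNet-50, used here as a stand-in; the pre-activation ResNet50V2 variant is available in third-party libraries such as timm, to which the same pattern applies. With cross-entropy training, the softmax is applied implicitly by the loss.

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and replace the classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # benign vs. malignant

# Equivalent with a pre-activation ResNetV2 backbone (if timm is installed):
#   import timm
#   model = timm.create_model("resnetv2_50", pretrained=True, num_classes=2)
```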

3.5. Training Strategies

Three strategies were compared for cross-site generalization:
  • Site Calibration (Threshold Calibration): in this approach, the baseline model (pretrained on ImageNet and fine-tuned on the internal Kaggle dataset) is applied directly to the BreaKHis dataset. Only the classification threshold is recalibrated using the BreaKHis validation set, with no retraining of model weights. This setting simulates a realistic clinical scenario where model parameters cannot be updated and only decision calibration is feasible.
  • Light Fine-Tuning: the backbone was frozen, and only the classifier head was retrained for five epochs on BreaKHis training data. The best checkpoint was selected on validation, and the threshold was recalibrated on validation.
  • Full Fine-Tuning: all network layers were unfrozen and optimized jointly for five epochs on BreaKHis. The best model was again selected on validation, followed by threshold calibration.
All strategies were applied following an initial fine-tuning stage on the Kaggle Breast Histopathology dataset, used to obtain a clinically relevant starting point before domain adaptation to BreaKHis.
To account for the natural class imbalance present in BreaKHis, all training procedures employed a weighted cross-entropy loss, where class weights were computed inversely to class frequency within the training subset. This weighting helped ensure balanced gradient updates despite unequal class distributions.
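The three regimes differ only in which parameters are trainable and in the learning rate. A minimal sketch is given below, assuming torchvision-style attribute names (model.fc) for the classifier head; the inverse-frequency weight formula shown is one common convention for the class weighting described above.

```python
import torch
import torch.nn as nn

def configure(model, strategy, n_benign, n_malignant):
    """Set up loss and optimizer for SiteCalib / LightFT / FullFT."""
    # Class weights inversely proportional to training-set frequency.
    counts = torch.tensor([n_benign, n_malignant], dtype=torch.float)
    criterion = nn.CrossEntropyLoss(weight=counts.sum() / (2.0 * counts))

    if strategy == "SiteCalib":           # no weight updates at all
        return criterion, None
    if strategy == "LightFT":             # freeze backbone, train head only
        for p in model.parameters():
            p.requires_grad = False
        for p in model.fc.parameters():
            p.requires_grad = True
        lr = 1e-3
    else:                                 # FullFT: all layers trainable
        for p in model.parameters():
            p.requires_grad = True
        lr = 1e-5
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad),
        lr=lr, weight_decay=1e-4)
    return criterion, optimizer
```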
A visual summary of the three adaptation strategies is provided in Figure 2, illustrating the differences in model retraining scope, data usage, and evaluation setup.

3.6. Implementation Details

All models were implemented in PyTorch 2.2. Optimization employed the AdamW optimizer with a weight decay of 1 × 10−4. The weighted cross-entropy loss described in Section 3.5 was used, as implemented in PyTorch. For light fine-tuning, the learning rate was set to 1 × 10−3, whereas for full fine-tuning a smaller rate of 1 × 10−5 was used to mitigate overfitting. The batch size was fixed at 32. Input images were resized to 224 × 224 pixels, normalized using ImageNet statistics, and augmented using geometric transformations. All experiments were conducted on a GPU-enabled workstation (CUDA).
For each model and magnification level, the classification threshold was optimized on the validation set based on the F1-score. Model outputs were evaluated across thresholds in the range [0.0, 1.0], and the value that maximized the F1-score was applied to the test set. This procedure was performed independently for each adaptation strategy (threshold calibration, light fine-tuning, full fine-tuning) and each magnification level.
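A minimal sketch of this calibration step is shown below; the grid granularity (101 evenly spaced thresholds) is an illustrative assumption, since the text specifies only the [0.0, 1.0] range.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_val, p_val, grid=np.linspace(0.0, 1.0, 101)):
    """Return the decision threshold that maximizes F1 on validation."""
    scores = [f1_score(y_val, (p_val >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]

# Calibrate once on validation, then freeze the threshold for testing:
#   t_star = best_threshold(y_val, p_val)
#   y_pred_test = (p_test >= t_star).astype(int)
```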
During training, we applied common data augmentation techniques to enhance generalization, including random horizontal/vertical flips, rotations, cropping and resizing to 224 × 224 pixels, and occasional color jittering. Augmentations were applied only to the training set, while validation and test sets remained unaltered.
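A torchvision pipeline corresponding to this scheme might look as follows; the specific rotation range and jitter strengths are illustrative assumptions, since the text does not fix exact values.

```python
from torchvision import transforms

IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # crop + resize
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=90),
    transforms.RandomApply(
        [transforms.ColorJitter(0.1, 0.1, 0.1, 0.05)], p=0.3),  # occasional jitter
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
# Validation and test images are only resized and normalized.
eval_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```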
Each fine-tuning procedure was limited to five epochs, a choice supported by preliminary convergence analysis showing that validation performance plateaued within the first few epochs. This setup was designed to emulate realistic low-resource adaptation scenarios, as the focus of this study was on lightweight transfer strategies rather than extended retraining. Overfitting was prevented through early stopping based on validation F1-score, weight decay regularization (1 × 10−4), data augmentation, and strict patient-level separation across all subsets.

3.7. Evaluation Metrics

To quantitatively assess classification performance under cross-site domain shift, we employed commonly used metrics in medical image analysis: accuracy, precision, recall (sensitivity), F1-score, the Receiver Operating Characteristic Area Under the Curve (ROC-AUC), and the Precision–Recall Area Under the Curve (PR-AUC). Their analytical definitions are provided below.
Let
  • TP = true positives
  • TN = true negatives
  • FP = false positives
  • FN = false negatives
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Since benign and malignant classes are imbalanced in BreaKHis, F1-score is preferred over accuracy because it better reflects the trade-off between false negatives and false positives.
ROC-AUC measures the classifier’s ability to discriminate between classes across all decision thresholds. It is defined as:
$$\text{ROC-AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}) \, d\,\mathrm{FPR}$$
where TPR = Recall and FPR = FP/(FP + TN).
PR-AUC is more informative when the positive class is underrepresented:
$$\text{PR-AUC} = \int_0^1 \mathrm{Precision}(\mathrm{Recall}) \, d\,\mathrm{Recall}$$
These metrics were selected because:
  • F1-score captures performance under class imbalance, which is critical when malignant cases must not be missed.
  • ROC-AUC provides a threshold-independent assessment of separability and is widely used for medical classifiers.
  • PR-AUC is particularly sensitive to false positives and false negatives in imbalanced datasets.
  • Combining ROC-AUC and PR-AUC allows robust evaluation of model discrimination under domain shift (Kaggle → BreaKHis).
Together, these metrics offer a comprehensive evaluation framework aligned with clinical diagnostic requirements.
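These definitions map directly onto standard scikit-learn calls; a minimal sketch is given below, where average_precision_score serves as the usual estimator of PR-AUC.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

def classification_report_binary(y_true, p_malignant, threshold):
    """Compute the Section 3.7 metrics for one strategy/magnification."""
    y_pred = (p_malignant >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, p_malignant),    # threshold-free
        "pr_auc": average_precision_score(y_true, p_malignant),
    }
```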

3.8. Interpretability Analysis (Grad-CAM)

To explore model interpretability, we employed Gradient-weighted Class Activation Mapping (Grad-CAM) [29] as a post-hoc visualization method. Grad-CAM is not used during classification and has no influence on model predictions. Instead, it operates after the model has produced its output, tracing gradients back through the final convolutional layer to identify which spatial regions contributed most to the decision. The resulting saliency maps were upsampled and superimposed on the input images, highlighting areas of high model attention.
For our experiments, Grad-CAM was applied to the last convolutional block of ResNet50V2 (layer4). We generated heatmaps for test images at all four magnifications (40×, 100×, 200×, 400×) under the three adaptation strategies (site calibration, light fine-tuning, full fine-tuning). For malignant samples, we targeted the malignant class; for benign samples, the benign class (visualizing the predicted/prototypical evidence for each case). A number of representative cases (benign and malignant) were further reviewed by a board-certified pathologist to assess whether the highlighted regions corresponded to diagnostically meaningful structures.
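For reference, a minimal Grad-CAM sketch over the last convolutional stage is given below. It assumes a torchvision-style ResNet layout (model.layer4); production use would typically rely on an established library implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class):
    """Return a [0, 1] heatmap for one image tensor of shape (3, H, W)."""
    acts, grads = {}, {}
    h1 = model.layer4.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = model.layer4.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))

    model.eval()
    x = image.unsqueeze(0).requires_grad_(True)   # keep graph through convs
    logits = model(x)
    model.zero_grad()
    logits[0, target_class].backward()            # gradient of target logit
    h1.remove(); h2.remove()

    w = grads["g"].mean(dim=(2, 3), keepdim=True)     # GAP of gradients
    cam = F.relu((w * acts["a"]).sum(dim=1))          # weighted feature maps
    cam = F.interpolate(cam.unsqueeze(0), size=image.shape[1:],
                        mode="bilinear", align_corners=False)[0, 0].detach()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```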

4. Results

4.1. Initial Dataset

On the Kaggle IDC dataset (reorganized at patient level), the balanced training set achieved ROC-AUC = 0.983 and F1w = 0.975. Validation and test sets preserved natural prevalence and reached ROC-AUC = 0.898/0.876 and F1w = 0.886/0.854, respectively (Table 4).

4.2. Site Calibration on BreaKHis

Applying the internal ResNet50V2 model directly to BreaKHis with threshold calibration produced modest improvements. Performance remained heterogeneous across magnifications, with ROC-AUC ranging from 0.60 (40×) to 0.73 (200×). Weighted F1 varied between 0.58 and 0.72, and balanced accuracy between 0.57 and 0.68 (Table 5).

4.3. Light Fine-Tuning

Training only the classifier head for 5 epochs on BreaKHis increased sensitivity and F1 at intermediate magnifications. ROC-AUC ranged from 0.64 (400×) to 0.75 (200×), while F1w reached up to 0.73 at 200× (Table 6).

4.4. Full Fine-Tuning

Optimizing all ResNet50V2 layers for 5 epochs yielded the best performance across magnifications. ROC-AUC ranged from 0.92 to 0.95, with F1w between 0.86 and 0.93. Balanced accuracy exceeded 0.86 at all magnifications (Table 7).
These results suggest that full fine-tuning is particularly suited for clinical usage, especially in early-stage diagnosis scenarios where missing malignant samples must be avoided.
In terms of computational efficiency, site calibration required only threshold adjustment (≈2 min per magnification). Head-only fine-tuning completed within ≈45 min per magnification, whereas full fine-tuning required ≈3.5–4 h on a single NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) (batch size 32). GPU memory usage was approximately 5.2 GB for head-only and 9.8 GB for full fine-tuning. These results illustrate the trade-off between computational cost and diagnostic gain, underscoring the practicality of lighter strategies in resource-limited hospitals.

4.5. Comparative Analysis

Table 8 summarizes the comparative performance across the three strategies. Threshold calibration remained the weakest, light fine-tuning achieved moderate improvements, and full fine-tuning consistently achieved the best results across magnifications. In addition, Table 9 reports the number and proportion of malignant cases correctly classified under each strategy, stratified by magnification.
In addition to accuracy and F1-score, we evaluated the diagnostic sensitivity, specificity, and ROC-AUC for each adaptation strategy and magnification level. These metrics are summarized in Table 10 and offer a more clinically interpretable view of model performance, illustrating the balance between correctly identifying malignant cases and avoiding false positives. It is important to note that the ROC-AUC values reported in Table 8 are macro-averaged across classes, whereas Table 10 presents binary ROC-AUC for malignancy, which is more directly aligned with clinical relevance.

4.6. Grad-CAM Visualization

To further examine model interpretability, we generated Grad-CAM heatmaps for representative benign and malignant cases under all three adaptation strategies (SiteCalib, LightFT, and FullFT). Figure 3, Figure 4 and Figure 5 illustrate the overlaid heatmaps for three diagnostic scenarios: a benign fibroadenoma at 40×, an invasive lobular carcinoma at 40×, and a mucinous carcinoma at 400×. Each figure compares the original histopathology image with Grad-CAM outputs from the three adaptation methods. While the highlighted regions differ across strategies and magnification levels, their diagnostic relevance—particularly the limited correspondence between saliency maps and true morphological hallmarks—is discussed in detail in Section 5.

5. Discussion

5.1. Key Findings

The comparative results reported in Table 8 highlight the effect of different adaptation strategies on cross-site generalization. Site calibration alone achieves moderate performance, with ROC-AUC values ranging between 0.6045 (40×) and 0.7300 (200×). Introducing light fine-tuning, where only the classifier head is retrained on target-site data, consistently improves performance across most magnifications. At 40×, ROC-AUC increases from 0.6045 to 0.6954, although F1-weighted remains stable around 0.68, suggesting improved discrimination without a proportional gain in balanced decision-making. The most notable improvement occurs at 100×, where ROC-AUC rises from 0.6371 to 0.6875 and F1-weighted from 0.5790 to 0.6772, indicating that head adaptation successfully mitigates domain shift at this scale. At 200×, which already exhibited the strongest baseline performance, light fine-tuning yields marginal but consistent gains (ROC-AUC from 0.7300 to 0.7456; F1w from 0.7189 to 0.7305), confirming its robustness. Conversely, at 400×, the two approaches converge (AUC ≈ 0.6458; F1w ≈ 0.6383), suggesting that light fine-tuning is insufficient to address the larger discrepancy at this magnification. Full fine-tuning, however, substantially outperforms both prior strategies, reaching ROC-AUC values between 0.9332 and 0.9832 and F1w between 0.8631 and 0.9332 across all magnifications, with the highest score at 40× (AUC = 0.9500, F1w = 0.9306). These results indicate that while calibration and head-only adaptation provide partial mitigation of domain shift, full fine-tuning remains necessary for optimal cross-site generalization. Figure 6 shows the comparative performance of site calibration, light fine-tuning, and full fine-tuning across the four magnifications. Full fine-tuning consistently outperformed the other strategies, with the largest gap observed at 40× and 100×.

5.2. Interpretation of Magnification-Dependent Behavior

Differences between Light Fine-Tuning and Full Fine-Tuning were magnification dependent. At low magnifications (40×, 100×), diagnostic cues derive largely from global architectural structures such as lobular arrangement, stromal composition, and tissue organization—features that pretrained representations capture reasonably well. At higher magnifications (200×, 400×), discrimination increasingly relies on subtle nuclear abnormalities (pleomorphism, nucleolar prominence) and mitotic activity, which require deeper adaptation of convolutional filters. This histopathological distinction explains why Full Fine-Tuning consistently outperformed Light Fine-Tuning, particularly at higher magnifications.

5.3. Comparison with Prior Work

When comparing our findings with previous research on the BreaKHis dataset, several patterns emerge. Early CNN-based approaches by Spanhol et al. [8] and Araujo et al. [9] reported accuracies between 77% and 83% under image-level splits, while Bayramoglu et al. [10] improved robustness through multi-scale feature aggregation. Later transfer learning studies leveraging ResNet or DenseNet architectures achieved AUC values around 0.85–0.90 [12], and more recent transformer-based methods, such as attention-based multiple-instance learning models [13], reported AUC values approaching 0.90–0.92. However, all these studies relied on image-level splits, enabling patches from the same patient to appear in both training and test sets—a practice known to artificially inflate performance due to information leakage.
More recent CNN and ensemble-based architectures have reported even higher performance, including attention-guided CNNs [22], equilibrium-optimized ensembles [24], multiscale feature aggregation frameworks such as FabNet [25], and hybrid deep learning ensembles [26], frequently exceeding 90% accuracy. Yet, like earlier work, these studies also used non-patient-level data splits, making direct comparison challenging. Table 11 and Table 12 summarize representative approaches from the literature.
In contrast to these methods, our evaluation strictly enforces patient-level independence across both internal and external datasets, providing a more clinically realistic estimate of generalization. Under this protocol, full fine-tuning achieved ROC-AUC values of 0.92–0.95 and weighted F1-scores of 0.86–0.93 across magnifications, which are competitive with or superior to previously reported results. Unlike prior studies that typically evaluated a single transfer learning pipeline, our work systematically compares three adaptation strategies—threshold calibration, head-only fine-tuning, and full fine-tuning—under identical conditions of domain shift and magnification-specific evaluation. This design isolates the effect of adaptation depth and reveals the practical trade-offs between computational cost and diagnostic performance.
By integrating interpretability analysis and expert review, our study extends previous literature by linking model performance with clinical explainability—an aspect rarely addressed in histopathology-focused transfer learning studies. Overall, our results reinforce the importance of patient-level separation and full model adaptation for reliable cross-site deployment.

5.4. Clinical Applicability and Deployment Scenarios

Beyond quantitative performance, it is important to consider how each adaptation strategy aligns with realistic clinical workflows.
From a clinical perspective, lightweight adaptation strategies may be practical in triage or second-reader workflows, where the model assists pathologists in flagging potentially malignant slides for priority review. Threshold calibration or head-only fine-tuning can thus provide a feasible compromise between computational efficiency and diagnostic support, particularly when data access or computing resources are limited. Conversely, full fine-tuning remains indispensable in high-stakes diagnostic settings such as confirmatory diagnosis or multi-institutional deployment, where performance stability and inter-site robustness are critical for patient safety.
The consistent improvement in malignant detection across magnifications and adaptation strategies highlights the practical value of full fine-tuning in clinical environments. Identifying over 90% of malignant images at lower magnifications (40×, 100×) is particularly valuable in diagnostic triage or screening settings. Moreover, the threshold calibration approach, though limited, offers a lightweight alternative when retraining is not feasible—supporting deployment in resource-constrained institutions.

5.5. Interpretability and Limitations of Grad-CAM

Grad-CAM was expected to provide visual cues supporting AI-based predictions by highlighting diagnostically relevant regions, such as malignant nuclei or tumor epithelium. However, expert pathologist review of several representative cases (shown in Figure 3, Figure 4 and Figure 5) revealed limited clinical interpretability.
In the benign fibroadenoma case at 40× (Figure 3), the hematoxylin–eosin image shows mammary glandular tissue completely occupied by fibroadenoma, characterized by distorted ductal structures embedded in fibrous stroma. None of the AI-generated Grad-CAM maps—regardless of adaptation strategy—captured this global architectural alteration that defines the benign diagnosis.
In the lobular carcinoma example at 40× (Figure 4), light fine-tuning and full fine-tuning produced more spatially focused heatmaps compared with SiteCalib, yet the highlighted regions still did not consistently correspond to genuine lobular patterns such as non-cohesive tumor cells arranged in single-file infiltration.
In the mucinous carcinoma patch at 400× (Figure 5), none of the strategies reliably emphasized the characteristic cytoplasmic mucin or the floating tumor cell clusters. Heatmaps frequently concentrated on background tissue or nonspecific stromal areas, missing the true diagnostic features entirely.
These observations underscore a fundamental limitation of Grad-CAM in histopathology: because it relies on coarse feature maps from the last convolutional layer, it cannot capture the subtle nuclear details or fine-grained morphological cues essential for pathological interpretation. As a result, Grad-CAM often reflects broad regions of statistical model attention rather than clinically meaningful structures. This behavior is consistent with prior reports on the limited diagnostic reliability of saliency-based methods [29,30].
Although Grad-CAM remains useful for assessing the internal consistency of learned feature representations, particularly the improved spatial concentration observed after full fine-tuning, our findings indicate that it should not be interpreted as a standalone diagnostic explanation.
Quantitative interpretability metrics such as IoU or Dice coefficients could not be computed due to the absence of region-level annotations in either dataset. Future work will incorporate expert-annotated ROIs or nuclei segmentation masks to enable objective alignment analysis. Beyond Grad-CAM, more advanced interpretability frameworks—such as multi-scale attention maps, concept-based attribution, or nuclei-level saliency—may provide more precise localization and improved clinical trustworthiness.

5.6. Limitations and Future Work

Although the proposed workflow demonstrates strong cross-site generalization and competitive diagnostic accuracy, several limitations must be acknowledged.
First, the study focused on a single backbone architecture (ResNet50V2), which, although robust and widely validated, may not capture all relevant morphological features that transformer-based or hybrid architectures could exploit.
Second, the fine-tuning process was intentionally limited to five epochs to emulate a lightweight clinical adaptation scenario. While this configuration demonstrated fast convergence and good generalization, longer training or advanced optimization schedules could potentially yield further improvements.
Third, the present work addressed binary classification (benign vs. malignant) without analyzing specific histological subtypes, which may limit clinical granularity. In addition, the interpretability analysis relied primarily on Grad-CAM, which provides coarse heatmaps and may not accurately localize fine cellular details.
Finally, external validation was performed on BreaKHis only; extending the evaluation to additional multi-institutional datasets would be necessary to confirm robustness across diverse acquisition protocols.
These limitations highlight important directions for future research, including full fine-tuning of multiple architectures, incorporation of multi-class pathology subtypes, and integration of more advanced explainability techniques.
Both datasets used in this study may contain inherent biases, including limited demographic diversity, acquisition from a small number of institutions, and variability in staining protocols. Such biases can affect model generalization in clinical deployment. Addressing these issues will require future validation on multi-center datasets with balanced demographic representation and standardized acquisition protocols.
Statistical significance testing (e.g., DeLong test for ROC-AUC comparison) was not performed due to the absence of per-patient paired ROC data. Future work will incorporate such analyses to strengthen comparisons across adaptation strategies.
Moreover, the interpretability evaluation relied mainly on Grad-CAM; quantitative alignment metrics and nuclei-level saliency analysis were not available but will be explored in future work.

6. Conclusions

This study provides a systematic comparison of three adaptation strategies—threshold calibration, head-only fine-tuning, and full fine-tuning—for cross-site breast histopathology image classification using ResNet50V2. Threshold calibration offered only modest gains, and head-only fine-tuning improved performance primarily at intermediate magnifications. In contrast, full fine-tuning consistently achieved the highest accuracy, with ROC-AUC values between 0.9332 and 0.9832 and F1-weighted scores between 0.8631 and 0.9332 across all magnifications, detecting over 90% of malignant cases at 40× and 100×. These findings demonstrate that full model adaptation is essential for reliable cross-site generalization and should be preferred in clinical deployment scenarios.
Interpretability analysis revealed important limitations of Grad-CAM. Although attention maps became more stable after full fine-tuning, they often failed to highlight diagnostically meaningful structures, such as the global architectural patterns in benign fibroadenoma or the nuclear abnormalities characteristic of mucinous carcinoma. This confirms that saliency-based visualization methods provide limited clinical insight and must be complemented by expert pathological review.
Overall, the proposed evaluation framework highlights practical trade-offs between adaptation cost and diagnostic reliability and can support informed decisions regarding model deployment in digital pathology workflows.
Future work will focus on exploring more advanced interpretability methods tailored to histopathology—such as multi-scale attention mechanisms, nuclei-level attribution, or weakly supervised localization—and extending cross-site validation to multi-institutional datasets and modern architectures (e.g., Vision Transformers, ConvNeXt). Addressing these challenges is essential for strengthening clinical trustworthiness and enabling safe, routine integration of AI-assisted histopathology in medical practice.
The main contributions of this study can be summarized as follows:
  • We propose a unified, patient-level evaluation pipeline for cross-site breast histopathology classification, eliminating information leakage.
  • We conduct the first systematic comparison of three adaptation strategies (threshold calibration, head-only fine-tuning, full fine-tuning) under identical domain-shift conditions and across four magnifications.
  • We provide a clinically grounded analysis of malignant detection rates, demonstrating where lightweight adaptation is feasible and where full model retraining is required.
  • We include a pathologist-reviewed interpretability assessment, highlighting fundamental limitations of Grad-CAM for clinical reasoning.
  • We design a lightweight, reproducible cross-site adaptation framework that reflects realistic constraints in medical institutions.
From a methodological perspective, this study represents a complete end-to-end pipeline designed, implemented, and validated by the authors—including dataset reorganization, patient-level splitting, adaptation strategy design, model training, quantitative evaluation, and interpretability assessment. The systematic analysis carried out by the authors provides a transparent and reproducible framework for future cross-site studies, aligning the conclusions with the specific methodological and analytical contributions documented in the manuscript.

Supplementary Materials

The following supporting information can be downloaded at: Dataset S1: Breast Histopathology Images dataset (IDC dataset), publicly available at: https://www.kaggle.com/datasets/paultimothymooney/breast-histopathology-images (accessed on 2 March 2025). Dataset S2: BreaKHis: Breast Cancer Histopathological Database, publicly available at: https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis (accessed on 26 June 2025).

Author Contributions

Conceptualization, L.S. and C.S.-S.; methodology, L.S.; software, L.S. and C.S.-S.; validation, L.S. and C.S.-S.; formal analysis, L.S.; investigation, L.S. and C.S.-S.; resources, L.S.; data curation, L.S.; writing—original draft preparation, L.S.; writing—review and editing, C.S.-S.; visualization, L.S. and C.S.-S.; supervision, L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study received ethical approval from the Ethics Committee of the University of Craiova, Romania (Approval No. 3323, Date: 12 November 2025). All analyses were performed exclusively on publicly available and fully anonymized datasets (Kaggle Breast Histopathology Images and BreaKHis). No identifiable patient information or direct human participant involvement occurred in this study.

Data Availability Statement

Data is contained within this article.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, GPT-5, 2025) for language editing and Python 3.9 debugging support. The authors have reviewed and edited all outputs and take full responsibility for the content.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUC: Area Under the Curve
BalAcc: Balanced Accuracy
CAD: Computer-Aided Diagnosis
CNN: Convolutional Neural Network
F1w: F1-score (weighted)
FT: Fine-Tuning
Grad-CAM: Gradient-weighted Class Activation Mapping
MIL: Multiple Instance Learning
PR-AUC: Precision–Recall Area Under the Curve
ROC: Receiver Operating Characteristic
WSI: Whole-Slide Image

References

  1. Bray, F.; Ferlay, J.; Soerjomataram, I.; Siegel, R.L.; Torre, L.A.; Jemal, A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2018, 68, 394–424.
  2. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.W.M.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
  3. Esteva, A.; Robicquet, A.; Ramsundar, B.; Kuleshov, V.; DePristo, M.; Chou, K.; Cui, C.; Corrado, G.; Thrun, S.; Dean, J. A guide to deep learning in healthcare. Nat. Med. 2019, 25, 24–29.
  4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  5. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
  6. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986.
  7. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
  8. Spanhol, F.A.; Oliveira, L.S.; Petitjean, C.; Heutte, L. A dataset for breast cancer histopathological image classification. IEEE Trans. Biomed. Eng. 2016, 63, 1455–1462.
  9. Araujo, T.; Aresta, G.; Castro, E.; Rouco, J.; Aguiar, P.; Eloy, C.; Polónia, A.; Campilho, A. Classification of breast cancer histology images using convolutional neural networks. PLoS ONE 2017, 12, e0177544.
  10. Bayramoglu, N.; Kannala, J.; Heikkilä, J. Deep learning for magnification independent breast cancer histopathology image classification. In Proceedings of the IEEE International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 2440–2445.
  11. Deniz, E.; Şengür, A.; Kadiroğlu, Z.; Guo, Y.; Bajaj, V.; Budak, Ü. Transfer learning based histopathologic image classification for breast cancer detection. Health Inf. Sci. Syst. 2018, 6, 18.
  12. Han, Z.; Wei, B.; Zheng, Y.; Yin, Y.; Li, K.; Li, S. Breast cancer multi-classification from histopathological images with structured deep learning model. Sci. Rep. 2017, 7, 4172.
  13. Ilse, M.; Tomczak, J.M.; Welling, M. Attention-based Deep Multiple Instance Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 2127–2136.
  14. Macenko, M.; Niethammer, M.; Marron, J.S.; Borland, D.; Woosley, J.T.; Guan, X.; Schmitt, C.; Thomas, N.E. A method for normalizing histology slides for quantitative analysis. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), Boston, MA, USA, 28 June–1 July 2009; pp. 1107–1110.
  15. Azizi, S.; Mustafa, B.; Ryan, F.; Beaver, Z.; Freyberg, J.; Deaton, J.; Loh, A.; Karthikesalingam, A.; Kornblith, S.; Chen, T.; et al. Big self-supervised models advance medical image classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3478–3488.
  16. Wang, Y.; Luo, F.; Yang, X.; Wang, Q.; Sun, Y.; Tian, S.; Feng, P.; Huang, P.; Xiao, H. The Swin-Transformer network based on focal loss is used to identify images of pathological subtypes of lung adenocarcinoma with high similarity and class imbalance. J. Cancer Res. Clin. Oncol. 2023, 149, 8581–8592.
  17. Qu, Y.; Zhou, X.; Huang, P.; Liu, Y.; Mercaldo, F.; Santone, A.; Feng, P. CGAM: An end-to-end causality graph attention Mamba network for esophageal pathology grading. Biomed. Signal Process. Control 2025, 103, 107452.
  18. Huang, P.; Xiao, H.; He, P.; Li, C.; Guo, X.; Tian, S.; Feng, P.; Chen, H.; Sun, Y.; Mercaldo, F.; et al. LA-ViT: A Network With Transformers Constrained by Learned-Parameter-Free Attention for Interpretable Grading in a New Laryngeal Histopathology Image Dataset. IEEE J. Biomed. Health Inform. 2024, 28, 3557–3570.
  19. Huang, P.; Feng, P.; Tian, S.; Xiao, H.; Mercaldo, F.; Santone, A.; Qin, J. A ViT-AMC Network With Adaptive Model Fusion and Multiobjective Optimization for Interpretable Laryngeal Tumor Grading From Histopathological Images. IEEE Trans. Med. Imaging 2023, 42, 15–28.
  20. Huang, P.; Luo, X. FDTs: A Feature Disentangled Transformer for Interpretable Squamous Cell Carcinoma Grading. IEEE/CAA J. Autom. Sin. 2025, 12, 2365–2367.
  21. Zebari, D.A.; Ibrahim, D.A.; Zeebaree, D.Q.; Mohammed, M.A.; Haron, H.; Zebari, N.A.; Damaševičius, R.; Maskeliūnas, R. Breast cancer detection using mammogram images with improved multi-fractal dimension approach and feature fusion. Appl. Sci. 2021, 11, 12122.
  22. Aldakhil, L.A.; Alhasson, H.F.; Alharbi, S.S. Attention-based deep learning approach for breast cancer histopathological image multi-classification. Diagnostics 2024, 14, 1402.
  23. Loddo, A.; Usai, M.; Di Ruberto, C. Gastric cancer image classification: A comparative analysis and feature fusion strategies. J. Imaging 2024, 10, 195.
  24. Çetin-Kaya, Y. Equilibrium optimization-based ensemble CNN framework for breast cancer multiclass classification using histopathological images. Diagnostics 2024, 14, 2253.
  25. Amin, M.S.; Ahn, H. FabNet: A features agglomeration-based convolutional neural network for multiscale breast cancer histopathology image classification. Cancers 2023, 15, 1013.
  26. Balasubramanian, A.A.; Al-Heejawi, S.M.A.; Singh, A.; Breggia, A.; Ahmad, B.; Christman, R.; Ryan, S.T.; Amal, S. Ensemble deep learning-based image classification for breast cancer subtype and invasiveness diagnosis from whole slide image histopathology. Cancers 2024, 16, 2222.
  27. Campanella, G.; Hanna, M.G.; Geneslaw, L.; Miraflor, A.; Werneck Krauss Silva, V.; Busam, K.J.; Brogi, E.; Halpern, M.; Samboy, J.; Klimstra, D.S.; et al. Clinical-grade computational pathology using weakly supervised deep learning on whole-slide images. Nat. Med. 2019, 25, 1301–1309.
  28. Kamnitsas, K.; Baumgartner, C.; Ledig, C.; Newcombe, V.; Simpson, J.; Kane, A.; Menon, D.; Nori, A.; Criminisi, A.; Rueckert, D.; et al. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In Proceedings of the Information Processing in Medical Imaging (IPMI), Boone, NC, USA, 25–30 June 2017; pp. 597–609.
  29. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
  30. Teng, Q.; Liu, Z.; Song, Y.; Han, K.; Lu, Y. A Survey on the Interpretability of Deep Learning in Medical Diagnosis. Multimed. Syst. 2022, 28, 2335–2355.
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 630–645.
Figure 1. ResNet50V2 architecture used in this study.
Figure 2. Overview of the three adaptation strategies evaluated in this study: site (threshold) calibration (no retraining), light fine-tuning (training only the classifier head), and full fine-tuning (training all layers). The diagram summarizes the data flow and model adaptation steps used in each strategy.
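To make the three adaptation strategies in Figure 2 concrete, the sketch below shows how they differ in a Keras setup, assuming a ResNet50V2 backbone with a single sigmoid output. The learning rates, epoch counts, and dataset objects (train_ds, val_ds) are illustrative assumptions, not the exact training configuration used in this study.

```python
import tensorflow as tf

def build_model():
    """ResNet50V2 backbone (ImageNet weights) with a binary sigmoid head."""
    base = tf.keras.applications.ResNet50V2(
        include_top=False, weights="imagenet",
        input_shape=(224, 224, 3), pooling="avg")
    inputs = tf.keras.Input(shape=(224, 224, 3))
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(base(inputs))
    return tf.keras.Model(inputs, outputs), base

def compile_model(model, lr):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])

model, base = build_model()

# (i) Site calibration: no retraining at all -- the network is used as-is and
#     only the decision threshold is re-tuned on target-site validation data
#     (see the thresholding sketch after Table 5).

# (ii) Light fine-tuning: freeze the backbone, train only the classifier head.
base.trainable = False
compile_model(model, lr=1e-3)
# model.fit(train_ds, validation_data=val_ds, epochs=5)

# (iii) Full fine-tuning: unfreeze everything and retrain with a small
#       learning rate to avoid destroying the pretrained features.
base.trainable = True
compile_model(model, lr=1e-5)
# model.fit(train_ds, validation_data=val_ds, epochs=...)
```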
Figure 3. Original benign fibroadenoma image at 40× and corresponding Grad-CAM visualizations generated by the three adaptation strategies (SiteCalib, LightFT, FullFT).
Figure 4. Original image of invasive lobular carcinoma at 40× and the associated Grad-CAM maps produced by SiteCalib, LightFT, and FullFT.
Figure 5. Original mucinous carcinoma image at 400× along with Grad-CAM heatmaps for the three adaptation strategies (SiteCalib, LightFT, FullFT).
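The Grad-CAM heatmaps in Figures 3–5 follow the gradient-weighted localization of [29]. Below is a minimal sketch, assuming a non-nested Keras model whose final convolutional activation layer is named "post_relu" (the default name inside tf.keras.applications.ResNet50V2); for a model that wraps the backbone as a sub-model, the layer must be looked up within that sub-model.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name="post_relu"):
    """Grad-CAM heatmap of the malignant score w.r.t. the last conv maps."""
    grad_model = tf.keras.Model(
        model.input,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[None, ...])  # add batch dimension
        score = preds[:, 0]                              # sigmoid "malignant" score
    grads = tape.gradient(score, conv_maps)              # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))         # average over spatial dims
    cam = tf.reduce_sum(conv_maps * weights[:, None, None, :], axis=-1)[0]
    cam = tf.nn.relu(cam)                                # keep positive evidence only
    cam = cam / (tf.reduce_max(cam) + 1e-8)              # normalize to [0, 1]
    return cam.numpy()                                   # upsample/overlay separately
```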
Figure 6. Comparative performance across adaptation strategies (SiteCalib, LightFT, FullFT) and magnifications (40×, 100×, 200×, 400×) on the BreaKHis test sets. Bars indicate ROC-AUC (left) and F1-weighted (right). Full fine-tuning consistently outperforms the other strategies, with the largest improvements at 40× and 100×.
Table 2. Composition of the Kaggle Breast Histopathology Images dataset [Dataset S1], split at patient level. The training set was class-balanced through undersampling; the validation and test sets retained the natural class distribution.

| Split      | Benign  | Malignant | Total   |
|------------|---------|-----------|---------|
| Train      | 137,750 | 137,750   | 275,500 |
| Validation | 28,994  | 9,811     | 38,805  |
| Test       | 31,994  | 12,185    | 44,179  |
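A minimal sketch of the majority-class undersampling used to balance the training split in Table 2 follows; the helper name, the fixed seed, and the use of file paths as sample identifiers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed; an assumption for illustration

def undersample(paths, labels):
    """Drop majority-class samples so both classes have equal counts."""
    paths, labels = np.asarray(paths), np.asarray(labels)
    pos, neg = np.where(labels == 1)[0], np.where(labels == 0)[0]
    n = min(len(pos), len(neg))  # 137,750 per class in the Table 2 train split
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    rng.shuffle(keep)
    return paths[keep], labels[keep]
```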
Table 3. Composition of the BreaKHis dataset [Dataset S2], showing the number of benign and malignant samples per split and magnification level (patient-level split). No class balancing was applied.

| Magnification | Split      | Benign | Malignant | Total |
|---------------|------------|--------|-----------|-------|
| 40×           | Train      | 1288   | 2077      | 3365  |
| 40×           | Validation | 272    | 384       | 656   |
| 40×           | Test       | 583    | 808       | 1391  |
| 100×          | Train      | 1068   | 2042      | 3110  |
| 100×          | Validation | 254    | 368       | 622   |
| 100×          | Test       | 597    | 754       | 1351  |
| 200×          | Train      | 975    | 1781      | 2756  |
| 200×          | Validation | 226    | 334       | 560   |
| 200×          | Test       | 575    | 731       | 1306  |
| 400×          | Train      | 870    | 1522      | 2392  |
| 400×          | Validation | 211    | 312       | 523   |
| 400×          | Test       | 584    | 704       | 1288  |
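The patient-level splits in Tables 2 and 3 can be reproduced in spirit with scikit-learn's GroupShuffleSplit, which keeps every patient's images in exactly one split and thereby prevents leakage. The sketch below is illustrative; the 80/20 split fractions are assumptions and do not exactly match the split sizes reported above.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(labels, patient_ids, seed=42):
    """Return train/val/test index arrays with disjoint patient groups."""
    idx = np.arange(len(labels))
    # Hold out ~20% of patients for test, then ~20% of the rest for validation.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    trval, test = next(outer.split(idx, groups=patient_ids))
    inner = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    tr, val = next(inner.split(trval, groups=np.asarray(patient_ids)[trval]))
    return trval[tr], trval[val], test
```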
Table 4. Initial dataset results (patient-level split). The training set was balanced; validation and test preserved natural prevalence.

| Subset     | AUC   | F1-Weighted |
|------------|-------|-------------|
| Train      | 0.983 | 0.975       |
| Validation | 0.898 | 0.886       |
| Test       | 0.876 | 0.854       |
Table 5. Results of site calibration on BreaKHis (patient-level split).

| Magnification | ROC-AUC | F1w (Test) | BalAcc | Threshold | F1w (Val) |
|---------------|---------|------------|--------|-----------|-----------|
| 40×           | 0.6045  | 0.6816     | 0.5756 | 0.139     | 0.7139    |
| 100×          | 0.6371  | 0.5790     | 0.5963 | 0.772     | 0.7754    |
| 200×          | 0.7300  | 0.7189     | 0.6830 | 0.337     | 0.8436    |
| 400×          | 0.6390  | 0.6344     | 0.6066 | 0.287     | 0.7461    |
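Site calibration leaves the network weights untouched and only re-tunes the decision threshold on the target-site validation set; the chosen threshold is then applied unchanged to the test set. A minimal sketch, assuming a 1-D array of sigmoid probabilities and a simple grid search:

```python
import numpy as np
from sklearn.metrics import f1_score

def calibrate_threshold(y_val, p_val, grid=np.linspace(0.01, 0.99, 99)):
    """Return the threshold that maximizes weighted F1 on validation data."""
    scores = [f1_score(y_val, p_val >= t, average="weighted") for t in grid]
    return float(grid[int(np.argmax(scores))])

# t = calibrate_threshold(y_val, p_val)    # e.g., 0.139 at 40x in Table 5
# y_pred_test = (p_test >= t).astype(int)  # threshold reused on the test set
```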
Table 6. Light fine-tuning (head-only, 5 epochs) on the BreaKHis test set. Both the best-performing head and the decision thresholds were selected on the validation set.

| Magnification | ROC-AUC | F1w (Test) | BalAcc | Threshold | F1w (Val) |
|---------------|---------|------------|--------|-----------|-----------|
| 40×           | 0.6954  | 0.6845     | 0.6284 | 0.535     | 0.6803    |
| 100×          | 0.6875  | 0.6772     | 0.6243 | 0.584     | 0.8152    |
| 200×          | 0.7456  | 0.7305     | 0.6544 | 0.535     | 0.8306    |
| 400×          | 0.6387  | 0.6395     | 0.5968 | 0.604     | 0.7852    |
Table 7. Results of full fine-tuning on BreaKHis.

| Magnification | ROC-AUC | F1w (Test) | BalAcc | Threshold | F1w (Val) |
|---------------|---------|------------|--------|-----------|-----------|
| 40×           | 0.9500  | 0.9306     | 0.9067 | 0.198     | 0.8884    |
| 100×          | 0.9405  | 0.9127     | 0.9010 | 0.406     | 0.9509    |
| 200×          | 0.9207  | 0.8953     | 0.8847 | 0.465     | 0.9226    |
| 400×          | 0.9314  | 0.8631     | 0.8646 | 0.564     | 0.9056    |
Table 8. Comparative results on BreaKHis (per magnification, across methods).

| Magnification | ROC-AUC (SiteCalib) | F1w (SiteCalib) | ROC-AUC (LightFT) | F1w (LightFT) | ROC-AUC (FullFT) | F1w (FullFT) |
|---------------|---------------------|-----------------|-------------------|---------------|------------------|--------------|
| 40×           | 0.6045              | 0.6816          | 0.6954            | 0.6845        | 0.9500           | 0.9306       |
| 100×          | 0.6371              | 0.5790          | 0.6875            | 0.6772        | 0.9405           | 0.9127       |
| 200×          | 0.7300              | 0.7189          | 0.7456            | 0.7305        | 0.9207           | 0.8953       |
| 400×          | 0.6390              | 0.6344          | 0.6387            | 0.6395        | 0.9314           | 0.8631       |
Table 9. Malignant test cases correctly classified, per magnification and adaptation strategy. Each row gives the number and percentage of malignant test images correctly identified by each strategy, highlighting the improved detection achieved with full fine-tuning.

| Magnification | Strategy         | Correct/Total | % Correct |
|---------------|------------------|---------------|-----------|
| 40×           | Site Calibration | 755/808       | 93.4%     |
| 40×           | Light FT         | 770/808       | 95.3%     |
| 40×           | Full FT          | 785/808       | 97.2%     |
| 100×          | Site Calibration | 690/754       | 91.5%     |
| 100×          | Light FT         | 712/754       | 94.4%     |
| 100×          | Full FT          | 723/754       | 95.9%     |
| 200×          | Site Calibration | 648/731       | 88.6%     |
| 200×          | Light FT         | 669/731       | 91.5%     |
| 200×          | Full FT          | 689/731       | 94.2%     |
| 400×          | Site Calibration | 656/704       | 93.2%     |
| 400×          | Light FT         | 674/704       | 95.7%     |
| 400×          | Full FT          | 690/704       | 98.0%     |
Table 10. Sensitivity, specificity, and ROC-AUC for each adaptation strategy and magnification level on the BreaKHis test set.

| Magnification | Strategy         | Sensitivity | Specificity | ROC-AUC |
|---------------|------------------|-------------|-------------|---------|
| 40×           | Site Calibration | 0.905       | 0.602       | 0.6045  |
| 40×           | Light FT         | 0.959       | 0.782       | 0.6954  |
| 40×           | Full FT          | 0.964       | 0.911       | 0.9500  |
| 100×          | Site Calibration | 0.509       | 0.735       | 0.6371  |
| 100×          | Light FT         | 0.926       | 0.856       | 0.6875  |
| 100×          | Full FT          | 0.926       | 0.901       | 0.9332  |
| 200×          | Site Calibration | 0.731       | 0.735       | 0.7300  |
| 200×          | Light FT         | 0.859       | 0.880       | 0.7456  |
| 200×          | Full FT          | 0.942       | 0.920       | 0.9558  |
| 400×          | Site Calibration | 0.641       | 0.625       | 0.6421  |
| 400×          | Light FT         | 0.703       | 0.671       | 0.6458  |
| 400×          | Full FT          | 0.980       | 0.945       | 0.9832  |
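The sensitivity and specificity values in Tables 9 and 10 follow from the binary confusion matrix at each strategy's calibrated threshold, while ROC-AUC is threshold-free and computed from the raw probabilities. A minimal sketch with scikit-learn, assuming labels coded 1 = malignant and 0 = benign:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def sens_spec_auc(y_true, p_pred, threshold):
    """Sensitivity/specificity at a threshold, plus threshold-free ROC-AUC."""
    y_hat = (np.asarray(p_pred) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # recall on malignant cases
    specificity = tn / (tn + fp)   # recall on benign cases
    return sensitivity, specificity, roc_auc_score(y_true, p_pred)
```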
Table 11. Comparison with prior breast cancer histopathology studies on BreaKHis. Prior work reports image-level results, whereas our results were obtained under the stricter patient-level split.

| Study | Method/Model | Split Level | Reported Metric (Best) | Our Results (Full FT, Patient-Level) |
|-------|--------------|-------------|------------------------|--------------------------------------|
| Spanhol et al. (2016) [8] | Baseline CNN | Image-level | Accuracy ≈ 77–83% | – |
| Araujo et al. (2017) [9] | CNN trained from scratch | Image-level | Accuracy ≈ 83% | – |
| Bayramoglu et al. (2016) [10] | Multi-scale CNN | Image-level | Accuracy ≈ 84% | – |
| Han et al. (2017) [12] | ResNet/DenseNet fine-tuned | Image-level | AUC ≈ 0.85–0.90 | – |
| Aldakhil et al. (2024) [22] | Attention-based CNN | Image-level | Accuracy > 90% | – |
| Çetin-Kaya (2024) [24] | Ensemble CNN + optimization | Image-level | Accuracy ≈ 91–93% | – |
| Amin & Ahn (2023) [25] | FabNet (multiscale feature aggregation) | Image-level | AUC ≈ 0.90 | – |
| Balasubramanian et al. (2024) [26] | Ensemble DL | Image-level | Accuracy ≈ 92% | – |
| This work (2025) | ResNet50V2 full fine-tuning | Patient-level | ROC-AUC = 0.92–0.95, F1w = 0.86–0.93 | Best: 40× (ROC-AUC = 0.95, F1w = 0.93) |
Table 12. Transformers in other histopathology domains (not directly comparable).

| Study | Method/Model | Split Level | Reported Metric (Best) |
|-------|--------------|-------------|------------------------|
| Swin-Transformer (2023) [16] | Transformer with focal loss for lung adenocarcinoma subtype classification | Image-level | AUC ≈ 0.93–0.95 |
| ViT-AMC (2023) [19] | Vision Transformer with adaptive model fusion for laryngeal tumor grading | Image-level | AUC ≈ 0.94 |
| FDTs (2025) [20] | Feature Disentangled Transformer for squamous cell carcinoma grading | Image-level | AUC ≈ 0.95 |