Article

Clinically Oriented Evaluation of Transfer Learning Strategies for Cross-Site Breast Cancer Histopathology Classification

by Liana Stanescu * and Cosmin Stoica-Spahiu
Department of Computer Science and Information Technology, Faculty of Automation, Computers and Electronics, University of Craiova, 200585 Craiova, Romania
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12819; https://doi.org/10.3390/app152312819
Submission received: 5 November 2025 / Revised: 24 November 2025 / Accepted: 1 December 2025 / Published: 4 December 2025
(This article belongs to the Special Issue Big Data Integration and Artificial Intelligence in Medical Systems)

Abstract

Background/Objectives: Breast cancer diagnosis based on histopathological examination remains the most reliable and widely accepted approach in clinical practice, despite being time-consuming and prone to inter-observer variability. While deep learning methods have achieved high accuracy in medical image classification, their cross-site generalization remains limited due to differences in staining protocols and image acquisition. This study aims to evaluate and compare three clinically relevant adaptation strategies to improve model robustness under domain shift. Methods: The ResNet50V2 model, pretrained on ImageNet and further fine-tuned on the Kaggle Breast Histopathology Images dataset, was subsequently adapted to the BreaKHis dataset under three clinically relevant transfer strategies: (i) threshold calibration without retraining (site calibration), (ii) head-only fine-tuning (light FT), and (iii) full fine-tuning (full FT). Experiments were performed on an internal balanced dataset and on the public BreaKHis dataset using strict patient-level splitting to avoid data leakage. Evaluation metrics included accuracy, precision, recall, F1-score, ROC-AUC, and PR-AUC, computed per magnification level (40×, 100×, 200×, 400×). Results: Full fine-tuning consistently yielded the highest performance across all magnifications, reaching up to 0.983 ROC-AUC and 0.980 sensitivity at 400×. At 40× and 100×, the model correctly identified over 90% of malignant cases, with ROC-AUC values of 0.9500 and 0.9332, respectively. Head-only fine-tuning led to moderate gains (e.g., sensitivity up to 0.859 at 200×), while threshold calibration showed limited improvements (ROC-AUC ranging from 0.60 to 0.73). Grad-CAM analysis revealed more stable and focused attention maps after full fine-tuning, though they did not always align with diagnostically relevant regions. Conclusions: Our findings confirm that full fine-tuning is essential for robust cross-site deployment of histopathology AI systems, particularly at high magnifications. Lighter strategies such as threshold calibration or head-only fine-tuning may serve as practical alternatives in resource-constrained environments where retraining is not feasible.

1. Introduction

Breast cancer remains one of the most prevalent malignancies worldwide and a leading cause of cancer-related mortality among women [1]. Histopathological analysis of tissue biopsies represents the reference standard for diagnosis, allowing pathologists to differentiate between benign and malignant lesions and to assess tumor subtype and grade. However, manual inspection of slides is time-consuming and subject to inter-observer variability, motivating the development of computer-aided diagnosis (CAD) systems.
Deep learning has transformed medical image analysis, with convolutional neural networks (CNNs) and more recently transformer-based architectures achieving state-of-the-art performance across multiple applications [2,3]. ResNet [4], DenseNet [5], ConvNeXt [6], and Vision Transformers [7] have all been applied successfully to histopathological image classification. The Breast Cancer Histopathological Database (BreaKHis) [8] has become a widely used benchmark, comprising 7909 images from 82 patients labeled as benign or malignant across four magnification levels (40×, 100×, 200×, 400×). This magnification diversity makes it well suited for evaluating scale-dependent performance.
In this study, the ResNet50V2 model was first fine-tuned on the Kaggle Breast Histopathology dataset to establish a balanced baseline and subsequently adapted to BreaKHis to assess cross-site generalization under controlled conditions.
A key challenge in digital pathology is that models must generalize across acquisition sites, staining conditions, and patient populations. Many existing studies rely on random image-level splits, allowing patches from the same patient in both training and test sets and overestimating performance. In realistic clinical scenarios, however, models encounter entirely new patients, making patient-level independence essential.
Although recent advances in attention mechanisms, stain normalization, and self-supervised learning have improved robustness, the relative effectiveness of different adaptation depths under domain shift remains insufficiently explored. No prior work has systematically compared threshold calibration, head-only fine-tuning, and full fine-tuning using strict patient-level separation and magnification-specific evaluation.
This study offers the following contributions:
  • A unified comparison of three adaptation strategies—threshold calibration, head-only fine-tuning, and full fine-tuning—for cross-site histopathology classification.
  • Strict patient-level separation for both internal and BreaKHis datasets to prevent information leakage.
  • A magnification-aware analysis quantifying the trade-off between adaptation cost and diagnostic performance.
  • A clinically oriented discussion on model interpretability and realistic deployment scenarios.
This perspective emphasizes robustness, reproducibility, and practical deployability within clinically realistic cross-site conditions.

2. Related Work

Deep learning has significantly advanced breast cancer histopathology classification, with numerous studies proposing increasingly sophisticated architectures and training strategies. Table 1 summarizes representative approaches, highlighting major model types, datasets, and common advantages and limitations.
Several publicly available datasets support research in histopathology image classification. Among these, the BreaKHis database [8] is one of the most widely used benchmarks, providing 7909 images from 82 patients across four magnifications (40×, 100×, 200×, 400×), which enables systematic assessment of scale-dependent performance. The Kaggle Breast Histopathology dataset contains over 250,000 benign and malignant patches derived from whole-slide images, making it suitable for pretraining and establishing robust feature representations. These datasets are frequently adopted in the literature due to their accessibility, diversity, and relevance to clinical diagnostic workflows.
Building on these benchmark datasets, early work on BreaKHis established foundational baselines for patch-level classification. Spanhol et al. [8] introduced baseline CNN models, while Araujo et al. [9] trained deeper networks from scratch to improve feature representation. Bayramoglu et al. [10] proposed a multi-scale CNN that integrates information across magnifications, demonstrating improved accuracy over single-scale approaches. As shown in Table 1, these initial methods laid the groundwork for deep learning in histopathology but remained limited by shallow architectures, high data requirements, or computational overhead.
Transfer learning approaches soon emerged as a more efficient alternative. Studies employing pretrained architectures—such as ResNet, DenseNet, and related CNN families—reported substantial performance gains compared with training from scratch [11]. Structured deep learning models further enhanced interpretability and tissue representation by incorporating prior spatial relationships [12]. However, as indicated in Table 1, these frameworks typically rely on full fine-tuning and do not evaluate the effectiveness of lighter adaptation strategies.
In addition to these CNN-based innovations, multiple studies explored modeling patch-level relevance and weakly supervised aggregation using attention-based multiple-instance learning, which improved lesion localization and interpretability in histopathology workflows [13]. Stain-normalization and domain adaptation techniques further addressed color variability across medical centers, yielding measurable gains in cross-site robustness [14]. More recently, self-supervised learning frameworks have leveraged large unlabeled repositories to extract transferable histopathology representations, reducing the dependence on extensive annotations [15].
Transformer-based architectures have recently gained prominence due to their ability to model long-range dependencies in tissue structure. Vision Transformer and Swin Transformer variants have achieved state-of-the-art results in lung adenocarcinoma [16], esophageal pathology [17], and laryngeal tumor grading [18,19]. Feature-disentangled transformers have further improved robustness and interpretability in squamous cell carcinoma grading [20]. Despite their strong performance, Table 1 shows that these models often require large, annotated datasets and lack systematic evaluation under cross-site domain shift.
Complementary lines of work have explored radiology-driven feature fusion [21], attention-enhanced CNNs [22], multi-branch and fusion frameworks [23], ensemble optimization [24], and multiscale deep networks [25,26]. While these approaches improve robustness across various medical imaging tasks, their computational complexity can hinder clinical deployment.
A persistent methodological challenge in digital pathology is the widespread use of random image-level splits, which allow patches from the same patient to appear in both training and test sets. This practice introduces information leakage and leads to overly optimistic performance estimates. Campanella et al. [27] demonstrated that clinical-grade computational pathology requires strict slide- and patient-level independence to avoid unrealistic accuracy inflation. This underscores the importance of rigorous, patient-level data separation, particularly when evaluating cross-site generalization.
Several studies have addressed domain shift using stain normalization [14] or adversarial domain adaptation [28], while self-supervised learning has shown promise for representation learning with limited labels [15]. However, the complexity and computational cost of these approaches, also summarized in Table 1, limit their scalability in low-resource clinical environments.
Interpretability is another critical challenge. Grad-CAM [29] remains the most widely used saliency method in histopathology, but multiple studies have shown that its attention maps may not reliably correspond to diagnostically meaningful structures [30]. These limitations motivate the need to evaluate explainability methods alongside classification performance, particularly in clinically oriented workflows.
Despite architectural progress, key gaps remain in the literature:
  • Most studies rely on a single adaptation strategy, typically full fine-tuning.
  • Lightweight strategies such as threshold calibration or head-only fine-tuning remain underexplored.
  • Few works systematically quantify performance variation across magnification levels.
  • Robustness under domain shift is not consistently evaluated.
These limitations motivate the present study, which provides a unified comparison of adaptation strategies under clinically realistic, patient-level cross-site conditions.
Table 1. Representative Deep Learning Approaches for Histopathology Classification.
Study/Approach | Model Type/Innovation | Dataset(s) | Advantages | Limitations
Spanhol et al. [8] | Baseline CNNs | BreaKHis | Establishes initial benchmarks | Limited depth; modest performance
Araujo et al. [9] | Deeper CNNs | BreaKHis | Improved representational capacity | Requires large datasets; trained from scratch
Bayramoglu et al. [10] | Multi-scale CNN | BreaKHis | Integrates multi-magnification cues | High computational cost
Transfer learning CNNs [11] | Pretrained ResNet/DenseNet | BreaKHis | Efficient training; strong accuracy | Focus on full fine-tuning
Structured DL models [12] | Structured/hierarchical CNNs | BreaKHis | Better modeling of tissue structure | Increased architecture complexity
Transformers [16,17,18,19,20] | ViT, Swin, graph-attention, FDT | Various histopathology datasets | SOTA performance; improved interpretability | Requires large datasets; limited cross-site testing
Feature fusion & ensembles [23,24,25,26] | CNN fusion, multi-branch, ensemble optimization | Various medical imaging datasets | Enhanced robustness; multi-feature integration | Heavy models; risk of overfitting
Radiology-driven approaches [21] | Multi-fractal + fusion | Mammography | Strong texture encoding | Not histopathology-specific
Stain normalization & adaptation [14,28] | Stain transfer, adversarial adaptation | Histopathology | Mitigates domain shift | Complex pipeline; higher compute
Self-supervised learning [15] | SSL pretraining | Medical imaging datasets | Strong features with few labels | Long training time
Campanella et al. [27] | Weakly supervised WSI classification | Whole-slide images | Demonstrates need for patient-level independence | Shows risk of inflated accuracy under patch-level splits

3. Materials and Methods

This section provides a step-by-step description of the proposed workflow, encompassing dataset preparation, model architecture, training strategies, and implementation details. The overall pipeline consists of four main stages: (i) dataset preprocessing and patient-level splitting, (ii) baseline fine-tuning on an internal dataset (Kaggle IDC), (iii) cross-site adaptation and evaluation on the external BreaKHis dataset, and (iv) performance assessment under different magnifications (40×–400×).

3.1. Generalized Algorithm for Histopathological Image Classification

To place our experimental setup in a broader context, we first outline a generalized deep learning workflow for histopathological image classification. This abstract pipeline, summarized in Algorithm 1, captures the essential stages common to most modern CAD systems based on convolutional or transformer-based architectures, independently of the specific datasets or backbone models. In the subsequent subsections, we instantiate this workflow using ResNet50V2, the Kaggle IDC dataset for pretraining, and BreaKHis as the external target dataset.
Algorithm 1. Generalized Workflow for Histopathological Image Classification
Input:
  Histopathology dataset D = {(Ii, yi)} consisting of whole-slide images (WSIs)
  or pre-extracted patches Ii with class labels yi (e.g., benign/malignant)

Output:
  Trained model M* and predicted labels ŷi for unseen samples

Stage 1: Data preparation and splitting
  Acquire raw WSIs or image patches from one or more institutions
  Optionally perform stain normalization and artefact removal
  If WSIs are used:
    Tile each WSI into patches
    Discard background tiles
    Resize all patches to a fixed input resolution (e.g., 224 × 224)
  Split patients (not images) into train, validation, and test sets
  Ensure that no patient appears in more than one split (patient-level separation)

Stage 2: Model initialization
  Choose a backbone architecture (e.g., CNN or Vision Transformer)
  Initialize the backbone with pretrained weights (e.g., ImageNet) or random weights
  Replace the final classification layer with a task-specific head (e.g., 2-class output)
  Select an adaptation strategy:
    (a) Threshold calibration only
    (b) Head-only fine-tuning
    (c) Full fine-tuning

Stage 3: Training/adaptation
  If using threshold calibration:
    Apply the pretrained model to the target training/validation data
    Learn an optimal decision threshold on the validation set
  Else:
    Freeze or unfreeze layers according to the chosen adaptation strategy
    Train the model on the training set using a suitable loss (e.g., weighted cross-entropy)
    Monitor performance on the validation set
    Select the best checkpoint M* based on a validation metric (e.g., F1-score)

Stage 4: Inference and evaluation
  Apply M* to the test set to obtain predicted probabilities pi
  Apply the chosen decision threshold to obtain final labels ŷi
  Compute evaluation metrics (accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC)
  Optionally generate interpretability maps (e.g., Grad-CAM) for qualitative analysis

Return:
  Trained model M* and performance metrics on the test set

3.2. Proposed Training and Evaluation Procedure (Kaggle → BreaKHis)

The detailed training and evaluation procedure is summarized in Algorithm 2, which describes how the ResNet50V2 model, initialized with ImageNet weights, was first fine-tuned on the Kaggle IDC dataset at the patient level to obtain a balanced baseline model (M*). This pretrained model was then adapted and evaluated on the BreaKHis dataset under three strategies—threshold calibration (SiteCalib), head-only fine-tuning (LightFT), and full fine-tuning (FullFT)—for each magnification level.
Evaluation metrics included accuracy, F1-score, ROC-AUC, and PR-AUC, computed per magnification level, along with Grad-CAM visualizations for interpretability. This design ensures strict patient-level independence and realistic cross-site generalization.
Algorithm 2. Proposed Training and Evaluation Procedure for Kaggle → BreaKHis Adaptation
Input: ResNet50V2 architecture M (initialized with ImageNet weights),
    internal dataset D_Kaggle = {train, val, test},
    external dataset D_BreaKHis = {train, val, test},
    magnifications = {40×, 100×, 200×, 400×}

# Stage 1: Baseline fine-tuning on internal dataset (Kaggle)
Train M on D_Kaggle[train] using weighted cross-entropy loss
Validate on D_Kaggle[val]; select best checkpoint M*
Save M* as pretrained baseline for adaptation

# Stage 2: Adaptation and evaluation on BreaKHis
for each magnification m in magnifications do
  for each adaptation strategy s in {SiteCalib, LightFT, FullFT} do
    if s == SiteCalib:
      Apply M* without retraining; calibrate decision threshold on val[m]
    if s == LightFT:
      Freeze backbone; train classifier head for 5 epochs on train[m]
    if s == FullFT:
      Unfreeze all layers; train for 5 epochs on train[m]
    Select best checkpoint by validation F1
    Optimize threshold on validation set
    Evaluate on test[m]; record ACC, F1, ROC-AUC, PR-AUC; generate Grad-CAM
  end for
end for

Output: Comparative performance and interpretability for all strategies
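To make the orchestration explicit, the nested loop of Algorithm 2 can be sketched directly in Python. The helpers below (load_split, load_checkpoint, freeze_backbone, unfreeze_all, fine_tune, calibrate_threshold, evaluate) are hypothetical placeholders standing in for the procedures detailed in Sections 3.3, 3.4, 3.5, 3.6 and 3.7, not an exact transcript of our implementation.

```python
# Illustrative driver for Algorithm 2; every helper is a hypothetical
# placeholder for the corresponding procedure described in the text.
MAGNIFICATIONS = ["40X", "100X", "200X", "400X"]
STRATEGIES = ["SiteCalib", "LightFT", "FullFT"]

results = {}
for mag in MAGNIFICATIONS:
    train, val, test = load_split("BreaKHis", mag)       # patient-level split
    for strategy in STRATEGIES:
        model = load_checkpoint("kaggle_baseline.pt")    # pretrained M*
        if strategy == "LightFT":
            freeze_backbone(model)                       # head-only training
            fine_tune(model, train, val, epochs=5)
        elif strategy == "FullFT":
            unfreeze_all(model)                          # all layers trainable
            fine_tune(model, train, val, epochs=5)
        # SiteCalib: no retraining, threshold calibration only.
        threshold = calibrate_threshold(model, val)      # maximizes F1 on val
        results[(mag, strategy)] = evaluate(model, test, threshold)
```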

3.3. Datasets

Two datasets were employed in this study. The first is the Kaggle Breast Histopathology Images dataset (IDC) [Dataset S1], hereafter referred to as the initial dataset. It is acquired and organized at the patient level, with each patient contributing multiple image patches labeled as benign or malignant. To avoid data leakage, all patches from a given patient were assigned exclusively to the training, validation, or test split. The training set was balanced through random undersampling to ensure equal class distribution, whereas the validation and test sets preserved the natural prevalence (Table 2).
The choice of the two datasets was motivated by their complementary characteristics. The Kaggle IDC dataset provides a large-scale, patient-organized collection of patches that enables stable pretraining and balanced feature extraction before cross-site adaptation. In contrast, the BreaKHis dataset is a widely used public benchmark offering multi-magnification (40×–400×) histopathological images collected under heterogeneous acquisition conditions, making it well suited for evaluating robustness under domain shift. Using Kaggle IDC for pretraining and BreaKHis for external evaluation models a realistic clinical scenario in which a system trained at one institution must generalize to images acquired elsewhere.
The second dataset is BreaKHis (Breast Cancer Histopathological Database) [8], [Dataset S2], a public benchmark widely used for breast cancer classification.
Its widespread use in prior studies allows direct comparison with existing methods and supports reproducibility.
BreaKHis comprises benign and malignant images collected at four magnification levels (40×, 100×, 200×, and 400×). Each patient contributes multiple images organized into a directory structure by class, patient, and magnification.
To ensure patient-level independence, we generated a new split into training, validation, and test subsets, disjoint at patient level. The split was performed using the unique patient identifier embedded in the folder name (e.g., SOB_B_A_14-22549AB), assigning all images from a single patient exclusively to one subset. The split ratios were approximately 70% for training, 15% for validation, and 15% for testing. This procedure was repeated independently for each magnification level (40×, 100×, 200×, 400×). The number of images per split and magnification is reported in Table 3.
Class imbalance was intentionally preserved to reflect real-world diagnostic variability. Although malignant samples are more frequent at higher magnifications, no artificial resampling or balancing was applied. Instead, to mitigate potential bias, the cross-entropy loss was weighted inversely to class frequency, and extensive data augmentation (random flips, rotations, cropping, and color jittering) was applied to increase the effective diversity of minority-class samples.
To ensure consistency across datasets and experimental conditions, all images were resized to 224 × 224 pixels prior to training or evaluation. This standardization was applied uniformly to both the internal dataset and BreaKHis, regardless of their original resolution or magnification.
Image-level splitting was intentionally avoided, as it can lead to information leakage when patches from the same patient appear in both training and test sets. This design choice ensures clinical realism, where the model is evaluated on entirely unseen patients.
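As a concrete illustration, patient-level splitting can be implemented by grouping image paths by the patient directory before assigning splits. The sketch below assumes the public BreaKHis directory layout, in which each patient folder (e.g., SOB_B_A_14-22549AB) contains one subfolder per magnification; path parsing should be adapted to the local copy of the dataset.

```python
import random
from collections import defaultdict
from pathlib import Path

def patient_level_split(root, magnification="40X", seed=42):
    """Split BreaKHis patients (not images) into ~70/15/15 subsets."""
    by_patient = defaultdict(list)
    for img in Path(root).rglob(f"*/{magnification}/*.png"):
        by_patient[img.parts[-3]].append(img)   # parts[-3] = patient folder

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_train = int(0.70 * len(patients))
    n_val = int(0.15 * len(patients))
    subsets = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    # Every image of a given patient lands in exactly one subset.
    return {name: [p for pid in ids for p in by_patient[pid]]
            for name, ids in subsets.items()}
```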

3.4. Model Architecture

In this study we employed ResNet50V2, a deep residual convolutional neural network consisting of 50 layers organized into four residual stages. The architecture builds on the original residual learning framework introduced by He et al. [4] and incorporates the improved identity mappings with pre-activation blocks proposed in ResNet v2 [31]. These pre-activation residual units (BatchNorm → ReLU → Conv) enhance gradient flow and training stability, making them particularly effective in transfer learning settings.
The backbone contains approximately 25.6 million parameters [4,31] and is composed of:
  • an initial stem (7 × 7 convolution + max pooling),
  • four residual stages with bottleneck blocks (1 × 1 → 3 × 3 → 1 × 1 convolutions),
  • global average pooling,
  • a fully connected classification head.
To adapt the architecture to our binary classification task (benign vs. malignant), the original 1000-class fully connected layer was replaced with a two-unit classifier followed by a softmax activation. All convolutional layers retained ImageNet-pretrained weights, while the classifier head was initialized randomly.
A schematic overview of ResNet50V2 is provided in Figure 1, illustrating the hierarchical progression from low-level texture features to high-level morphological patterns relevant to histopathology.
This architecture was selected because:
  • ResNet backbones have proven robust and widely adopted in computational pathology and biomedical imaging [4].
  • The residual structure enables stable gradient propagation and reliable transfer learning even with limited training data [31].
  • ResNet50V2 offers a favorable balance between representational depth and computational efficiency, making it suitable for both lightweight and full fine-tuning scenarios.
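To illustrate the head replacement described above, the following minimal PyTorch sketch swaps the 1000-class ImageNet classifier for a two-unit head. torchvision ships the original ResNet-50, used here as a stand-in; the pre-activation ResNet50V2 variant is available in third-party libraries such as timm, to which the same pattern applies. With cross-entropy training, the softmax is applied implicitly by the loss.

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and replace the classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # benign vs. malignant

# Equivalent with a pre-activation ResNetV2 backbone (if timm is installed):
#   import timm
#   model = timm.create_model("resnetv2_50", pretrained=True, num_classes=2)
```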

3.5. Training Strategies

Three strategies were compared for cross-site generalization:
  • Site Calibration (Threshold Calibration): in this approach, the baseline model (pretrained on ImageNet and fine-tuned on the internal Kaggle dataset) is applied directly to the BreaKHis dataset. Only the classification threshold is recalibrated using the BreaKHis validation set, with no retraining of model weights. This setting simulates a realistic clinical scenario where model parameters cannot be updated and only decision calibration is feasible.
  • Light Fine-Tuning: the backbone was frozen, and only the classifier head was retrained for five epochs on BreaKHis training data. The best checkpoint was selected on validation, and the threshold was recalibrated on validation.
  • Full Fine-Tuning: all network layers were unfrozen and optimized jointly for five epochs on BreaKHis. The best model was again selected on validation, followed by threshold calibration.
All strategies were applied following an initial fine-tuning stage on the Kaggle Breast Histopathology dataset, used to obtain a clinically relevant starting point before domain adaptation to BreaKHis.
To account for the natural class imbalance present in BreaKHis, all training procedures employed a weighted cross-entropy loss, where class weights were computed inversely to class frequency within the training subset. This weighting helped ensure balanced gradient updates despite unequal class distributions.
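The three regimes differ only in which parameters are trainable and in the learning rate. A minimal sketch is given below, assuming torchvision-style attribute names (model.fc) for the classifier head; the inverse-frequency weight formula shown is one common convention for the class weighting described above.

```python
import torch
import torch.nn as nn

def configure(model, strategy, n_benign, n_malignant):
    """Set up loss and optimizer for SiteCalib / LightFT / FullFT."""
    # Class weights inversely proportional to training-set frequency.
    counts = torch.tensor([n_benign, n_malignant], dtype=torch.float)
    criterion = nn.CrossEntropyLoss(weight=counts.sum() / (2.0 * counts))

    if strategy == "SiteCalib":           # no weight updates at all
        return criterion, None
    if strategy == "LightFT":             # freeze backbone, train head only
        for p in model.parameters():
            p.requires_grad = False
        for p in model.fc.parameters():
            p.requires_grad = True
        lr = 1e-3
    else:                                 # FullFT: all layers trainable
        for p in model.parameters():
            p.requires_grad = True
        lr = 1e-5
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad),
        lr=lr, weight_decay=1e-4)
    return criterion, optimizer
```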
A visual summary of the three adaptation strategies is provided in Figure 2, illustrating the differences in model retraining scope, data usage, and evaluation setup.

3.6. Implementation Details

All models were implemented in PyTorch 2.2. Optimization employed the AdamW optimizer with a weight decay of 1 × 10−4. The weighted cross-entropy loss described in Section 3.5 was used, as implemented in PyTorch. For light fine-tuning, the learning rate was set to 1 × 10−3, whereas for full fine-tuning a smaller rate of 1 × 10−5 was used to mitigate overfitting. The batch size was fixed at 32. Input images were resized to 224 × 224 pixels, normalized using ImageNet statistics, and augmented using geometric transformations. All experiments were conducted on a GPU-enabled workstation (CUDA).
For each model and magnification level, the classification threshold was optimized on the validation set based on the F1-score. Model outputs were evaluated across thresholds in the range [0.0, 1.0], and the value that maximized the F1-score was applied to the test set. This procedure was performed independently for each adaptation strategy (threshold calibration, light fine-tuning, full fine-tuning) and each magnification level.
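A minimal sketch of this calibration step is shown below; the grid granularity (101 evenly spaced thresholds) is an illustrative assumption, since the text specifies only the [0.0, 1.0] range.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_val, p_val, grid=np.linspace(0.0, 1.0, 101)):
    """Return the decision threshold that maximizes F1 on validation."""
    scores = [f1_score(y_val, (p_val >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]

# Calibrate once on validation, then freeze the threshold for testing:
#   t_star = best_threshold(y_val, p_val)
#   y_pred_test = (p_test >= t_star).astype(int)
```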
During training, we applied common data augmentation techniques to enhance generalization, including random horizontal/vertical flips, rotations, cropping and resizing to 224 × 224 pixels, and occasional color jittering. Augmentations were applied only to the training set, while validation and test sets remained unaltered.
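A torchvision pipeline corresponding to this scheme might look as follows; the specific rotation range and jitter strengths are illustrative assumptions, since the text does not fix exact values.

```python
from torchvision import transforms

IMAGENET_MEAN, IMAGENET_STD = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # crop + resize
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=90),
    transforms.RandomApply(
        [transforms.ColorJitter(0.1, 0.1, 0.1, 0.05)], p=0.3),  # occasional jitter
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
# Validation and test images are only resized and normalized.
eval_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```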
Each fine-tuning procedure was limited to five epochs, a choice supported by preliminary convergence analysis showing that validation performance plateaued within the first few epochs. This setup was designed to emulate realistic low-resource adaptation scenarios, as the focus of this study was on lightweight transfer strategies rather than extended retraining. Overfitting was prevented through early stopping based on validation F1-score, weight decay regularization (1 × 10−4), data augmentation, and strict patient-level separation across all subsets.

3.7. Evaluation Metrics

To quantitatively assess classification performance under cross-site domain shift, we employed commonly used metrics in medical image analysis: accuracy, precision, recall (sensitivity), F1-score, the Receiver Operating Characteristic Area Under the Curve (ROC-AUC), and the Precision–Recall Area Under the Curve (PR-AUC). Their analytical definitions are provided below.
Let
  • TP = true positives
  • TN = true negatives
  • FP = false positives
  • FN = false negatives
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Since benign and malignant classes are imbalanced in BreaKHis, F1-score is preferred over accuracy because it better reflects the trade-off between false negatives and false positives.
ROC-AUC measures the classifier’s ability to discriminate between classes across all decision thresholds. It is defined as:
$$\text{ROC-AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}) \, d\,\mathrm{FPR}$$
where TPR = Recall and FPR = FP/(FP + TN).
PR-AUC is more informative when the positive class is underrepresented:
$$\text{PR-AUC} = \int_0^1 \mathrm{Precision}(\mathrm{Recall}) \, d\,\mathrm{Recall}$$
These metrics were selected because:
  • F1-score captures performance under class imbalance, which is critical when malignant cases must not be missed.
  • ROC-AUC provides a threshold-independent assessment of separability and is widely used for medical classifiers.
  • PR-AUC is particularly sensitive to false positives and false negatives in imbalanced datasets.
  • Combining ROC-AUC and PR-AUC allows robust evaluation of model discrimination under domain shift (Kaggle → BreaKHis).
Together, these metrics offer a comprehensive evaluation framework aligned with clinical diagnostic requirements.
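These definitions map directly onto standard scikit-learn calls; a minimal sketch is given below, where average_precision_score serves as the usual estimator of PR-AUC.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

def classification_report_binary(y_true, p_malignant, threshold):
    """Compute the Section 3.7 metrics for one strategy/magnification."""
    y_pred = (p_malignant >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, p_malignant),    # threshold-free
        "pr_auc": average_precision_score(y_true, p_malignant),
    }
```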

3.8. Interpretability Analysis (Grad-CAM)

To explore model interpretability, we employed Gradient-weighted Class Activation Mapping (Grad-CAM) [29] as a post-hoc visualization method. Grad-CAM is not used during classification and has no influence on model predictions. Instead, it operates after the model has produced its output, tracing gradients back through the final convolutional layer to identify which spatial regions contributed most to the decision. The resulting saliency maps were upsampled and superimposed on the input images, highlighting areas of high model attention.
For our experiments, Grad-CAM was applied to the last convolutional block of ResNet50V2 (layer4). We generated heatmaps for test images at all four magnifications (40×, 100×, 200×, 400×) under the three adaptation strategies (site calibration, light fine-tuning, full fine-tuning). For malignant samples, we targeted the malignant class; for benign samples, the benign class (visualizing the predicted/prototypical evidence for each case). A number of representative cases (benign and malignant) were further reviewed by a board-certified pathologist to assess whether the highlighted regions corresponded to diagnostically meaningful structures.
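For reference, a minimal Grad-CAM sketch over the last convolutional stage is given below. It assumes a torchvision-style ResNet layout (model.layer4); production use would typically rely on an established library implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class):
    """Return a [0, 1] heatmap for one image tensor of shape (3, H, W)."""
    acts, grads = {}, {}
    h1 = model.layer4.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = model.layer4.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))

    model.eval()
    x = image.unsqueeze(0).requires_grad_(True)   # keep graph through convs
    logits = model(x)
    model.zero_grad()
    logits[0, target_class].backward()            # gradient of target logit
    h1.remove(); h2.remove()

    w = grads["g"].mean(dim=(2, 3), keepdim=True)     # GAP of gradients
    cam = F.relu((w * acts["a"]).sum(dim=1))          # weighted feature maps
    cam = F.interpolate(cam.unsqueeze(0), size=image.shape[1:],
                        mode="bilinear", align_corners=False)[0, 0].detach()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```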

4. Results

4.1. Initial Dataset

On the Kaggle IDC dataset (reorganized at patient level), the balanced training set achieved ROC-AUC = 0.983 and F1w = 0.975. Validation and test sets preserved natural prevalence and reached ROC-AUC = 0.898/0.876 and F1w = 0.886/0.854, respectively (Table 4).

4.2. Site Calibration on BreaKHis

Applying the internal ResNet50V2 model directly to BreaKHis with threshold calibration produced modest improvements. Performance remained heterogeneous across magnifications, with ROC-AUC ranging from 0.60 (40×) to 0.73 (200×). Weighted F1 varied between 0.58 and 0.72, and balanced accuracy between 0.57 and 0.68 (Table 5).

4.3. Light Fine-Tuning

Training only the classifier head for 5 epochs on BreaKHis increased sensitivity and F1 at intermediate magnifications. ROC-AUC ranged from 0.64 (400×) to 0.75 (200×), while F1w reached up to 0.73 at 200× (Table 6).

4.4. Full Fine-Tuning

Optimizing all ResNet50V2 layers for 5 epochs yielded the best performance across magnifications. ROC-AUC ranged from 0.92 to 0.95, with F1w between 0.86 and 0.93. Balanced accuracy exceeded 0.86 at all magnifications (Table 7).
These results suggest that full fine-tuning is particularly suited for clinical usage, especially in early-stage diagnosis scenarios where missing malignant samples must be avoided.
In terms of computational efficiency, site calibration required only threshold adjustment (≈2 min per magnification). Head-only fine-tuning completed within ≈45 min per magnification, whereas full fine-tuning required ≈3.5–4 h on a single NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) (batch size 32). GPU memory usage was approximately 5.2 GB for head-only and 9.8 GB for full fine-tuning. These results illustrate the trade-off between computational cost and diagnostic gain, underscoring the practicality of lighter strategies in resource-limited hospitals.

4.5. Comparative Analysis

Table 8 summarizes the comparative performance across the three strategies. Threshold calibration remained the weakest, light fine-tuning achieved moderate improvements, and full fine-tuning consistently achieved the best results across magnifications. In addition, Table 9 reports the number and proportion of malignant cases correctly classified under each strategy, stratified by magnification.
In addition to accuracy and F1-score, we evaluated the diagnostic sensitivity, specificity, and ROC-AUC for each adaptation strategy and magnification level. These metrics are summarized in Table 10 and offer a more clinically interpretable view of model performance, illustrating the balance between correctly identifying malignant cases and avoiding false positives. It is important to note that the ROC-AUC values reported in Table 8 are macro-averaged across classes, whereas Table 10 presents binary ROC-AUC for malignancy, which is more directly aligned with clinical relevance.

4.6. Grad-CAM Visualization

To further examine model interpretability, we generated Grad-CAM heatmaps for representative benign and malignant cases under all three adaptation strategies (SiteCalib, LightFT, and FullFT). Figure 3, Figure 4 and Figure 5 illustrate the overlaid heatmaps for three diagnostic scenarios: a benign fibroadenoma at 40×, an invasive lobular carcinoma at 40×, and a mucinous carcinoma at 400×. Each figure compares the original histopathology image with Grad-CAM outputs from the three adaptation methods. While the highlighted regions differ across strategies and magnification levels, their diagnostic relevance—particularly the limited correspondence between saliency maps and true morphological hallmarks—is discussed in detail in Section 5.

5. Discussion

5.1. Key Findings

The comparative results reported in Table 8 highlight the effect of different adaptation strategies on cross-site generalization. Site calibration alone achieves moderate performance, with ROC-AUC values ranging between 0.6045 (40×) and 0.7300 (200×). Introducing light fine-tuning, where only the classifier head is retrained on target-site data, consistently improves performance across most magnifications. At 40×, ROC-AUC increases from 0.6045 to 0.6954, although F1-weighted remains stable around 0.68, suggesting improved discrimination without a proportional gain in balanced decision-making. The most notable improvement occurs at 100×, where ROC-AUC rises from 0.6371 to 0.6875 and F1-weighted from 0.5790 to 0.6772, indicating that head adaptation successfully mitigates domain shift at this scale. At 200×, which already exhibited the strongest baseline performance, light fine-tuning yields marginal but consistent gains (ROC-AUC from 0.7300 to 0.7456; F1w from 0.7189 to 0.7305), confirming its robustness. Conversely, at 400×, the two approaches converge (AUC ≈ 0.6458; F1w ≈ 0.6383), suggesting that light fine-tuning is insufficient to address the larger discrepancy at this magnification. Full fine-tuning, however, substantially outperforms both prior strategies, reaching ROC-AUC values between 0.9332 and 0.9832 and F1w between 0.8631 and 0.9332 across all magnifications, with the highest score at 40× (AUC = 0.9500, F1w = 0.9306). These results indicate that while calibration and head-only adaptation provide partial mitigation of domain shift, full fine-tuning remains necessary for optimal cross-site generalization. Figure 6 shows the comparative performance of site calibration, light fine-tuning, and full fine-tuning across the four magnifications. Full fine-tuning consistently outperformed the other strategies, with the largest gap observed at 40× and 100×.

5.2. Interpretation of Magnification-Dependent Behavior

Differences between Light Fine-Tuning and Full Fine-Tuning were magnification dependent. At low magnifications (40×, 100×), diagnostic cues derive largely from global architectural structures such as lobular arrangement, stromal composition, and tissue organization—features that pretrained representations capture reasonably well. At higher magnifications (200×, 400×), discrimination increasingly relies on subtle nuclear abnormalities (pleomorphism, nucleolar prominence) and mitotic activity, which require deeper adaptation of convolutional filters. This histopathological distinction explains why Full Fine-Tuning consistently outperformed Light Fine-Tuning, particularly at higher magnifications.

5.3. Comparison with Prior Work

When comparing our findings with previous research on the BreaKHis dataset, several patterns emerge. Early CNN-based approaches by Spanhol et al. [8] and Araujo et al. [9] reported accuracies between 77% and 83% under image-level splits, while Bayramoglu et al. [10] improved robustness through multi-scale feature aggregation. Later transfer learning studies leveraging ResNet or DenseNet architectures achieved AUC values around 0.85–0.90 [12], and more recent transformer-based methods, such as attention-based multiple-instance learning models [13], reported AUC values approaching 0.90–0.92. However, all these studies relied on image-level splits, enabling patches from the same patient to appear in both training and test sets—a practice known to artificially inflate performance due to information leakage.
More recent CNN and ensemble-based architectures have reported even higher performance, including attention-guided CNNs [22], equilibrium-optimized ensembles [24], multiscale feature aggregation frameworks such as FabNet [25], and hybrid deep learning ensembles [26], frequently exceeding 90% accuracy. Yet, like earlier work, these studies also used non-patient-level data splits, making direct comparison challenging. Table 11 and Table 12 summarize representative approaches from the literature.
In contrast to these methods, our evaluation strictly enforces patient-level independence across both internal and external datasets, providing a more clinically realistic estimate of generalization. Under this protocol, full fine-tuning achieved ROC-AUC values of 0.92–0.95 and weighted F1-scores of 0.86–0.93 across magnifications, which are competitive with or superior to previously reported results. Unlike prior studies that typically evaluated a single transfer learning pipeline, our work systematically compares three adaptation strategies—threshold calibration, head-only fine-tuning, and full fine-tuning—under identical conditions of domain shift and magnification-specific evaluation. This design isolates the effect of adaptation depth and reveals the practical trade-offs between computational cost and diagnostic performance.
By integrating interpretability analysis and expert review, our study extends previous literature by linking model performance with clinical explainability—an aspect rarely addressed in histopathology-focused transfer learning studies. Overall, our results reinforce the importance of patient-level separation and full model adaptation for reliable cross-site deployment.

5.4. Clinical Applicability and Deployment Scenarios

Beyond quantitative performance, it is important to consider how each adaptation strategy aligns with realistic clinical workflows.
From a clinical perspective, lightweight adaptation strategies may be practical in triage or second-reader workflows, where the model assists pathologists in flagging potentially malignant slides for priority review. Threshold calibration or head-only fine-tuning can thus provide a feasible compromise between computational efficiency and diagnostic support, particularly when data access or computing resources are limited. Conversely, full fine-tuning remains indispensable in high-stakes diagnostic settings such as confirmatory diagnosis or multi-institutional deployment, where performance stability and inter-site robustness are critical for patient safety.
The consistent improvement in malignant detection across magnifications and adaptation strategies highlights the practical value of full fine-tuning in clinical environments. Identifying over 90% of malignant images at lower magnifications (40×, 100×) is particularly valuable in diagnostic triage or screening settings. Moreover, the threshold calibration approach, though limited, offers a lightweight alternative when retraining is not feasible—supporting deployment in resource-constrained institutions.

5.5. Interpretability and Limitations of Grad-CAM

Grad-CAM was expected to provide visual cues supporting AI-based predictions by highlighting diagnostically relevant regions, such as malignant nuclei or tumor epithelium. However, expert pathologist review of several representative cases (shown in Figure 3, Figure 4 and Figure 5) revealed limited clinical interpretability.
In the benign fibroadenoma case at 40× (Figure 3), the hematoxylin–eosin image shows mammary glandular tissue completely occupied by fibroadenoma, characterized by distorted ductal structures embedded in fibrous stroma. None of the AI-generated Grad-CAM maps—regardless of adaptation strategy—captured this global architectural alteration that defines the benign diagnosis.
In the lobular carcinoma example at 40× (Figure 4), light fine-tuning and full fine-tuning produced more spatially focused heatmaps compared with SiteCalib, yet the highlighted regions still did not consistently correspond to genuine lobular patterns such as non-cohesive tumor cells arranged in single-file infiltration.
In the mucinous carcinoma patch at 400× (Figure 5), none of the strategies reliably emphasized the characteristic cytoplasmic mucin or the floating tumor cell clusters. Heatmaps frequently concentrated on background tissue or nonspecific stromal areas, missing the true diagnostic features entirely.
These observations underscore a fundamental limitation of Grad-CAM in histopathology: because it relies on coarse feature maps from the last convolutional layer, it cannot capture the subtle nuclear details or fine-grained morphological cues essential for pathological interpretation. As a result, Grad-CAM often reflects broad regions of statistical model attention rather than clinically meaningful structures. This behavior is consistent with prior reports on the limited diagnostic reliability of saliency-based methods [29,30].
Although Grad-CAM remains useful for assessing the internal consistency of learned feature representations, particularly the improved spatial concentration observed after full fine-tuning, our findings indicate that it should not be interpreted as a standalone diagnostic explanation.
Quantitative interpretability metrics such as IoU or Dice coefficients could not be computed due to the absence of region-level annotations in either dataset. Future work will incorporate expert-annotated ROIs or nuclei segmentation masks to enable objective alignment analysis. Beyond Grad-CAM, more advanced interpretability frameworks—such as multi-scale attention maps, concept-based attribution, or nuclei-level saliency—may provide more precise localization and improved clinical trustworthiness.

5.6. Limitations and Future Work

Although the proposed workflow demonstrates strong cross-site generalization and competitive diagnostic accuracy, several limitations must be acknowledged.
First, the study focused on a single backbone architecture (ResNet50V2), which, although robust and widely validated, may not capture all relevant morphological features that transformer-based or hybrid architectures could exploit.
Second, the fine-tuning process was intentionally limited to five epochs to emulate a lightweight clinical adaptation scenario. While this configuration demonstrated fast convergence and good generalization, longer training or advanced optimization schedules could potentially yield further improvements.
Third, the present work addressed binary classification (benign vs. malignant) without analyzing specific histological subtypes, which may limit clinical granularity. In addition, the interpretability analysis relied primarily on Grad-CAM, which provides coarse heatmaps and may not accurately localize fine cellular details.
Finally, external validation was performed on BreaKHis only; extending the evaluation to additional multi-institutional datasets would be necessary to confirm robustness across diverse acquisition protocols.
These limitations highlight important directions for future research, including full fine-tuning of multiple architectures, incorporation of multi-class pathology subtypes, and integration of more advanced explainability techniques.
Both datasets used in this study may contain inherent biases, including limited demographic diversity, acquisition from a small number of institutions, and variability in staining protocols. Such biases can affect model generalization in clinical deployment. Addressing these issues will require future validation on multi-center datasets with balanced demographic representation and standardized acquisition protocols.
Statistical significance testing (e.g., DeLong test for ROC-AUC comparison) was not performed due to the absence of per-patient paired ROC data. Future work will incorporate such analyses to strengthen comparisons across adaptation strategies.
Moreover, the interpretability evaluation relied mainly on Grad-CAM; quantitative alignment metrics and nuclei-level saliency analysis were not available but will be explored in future work.

6. Conclusions

This study provides a systematic comparison of three adaptation strategies—threshold calibration, head-only fine-tuning, and full fine-tuning—for cross-site breast histopathology image classification using ResNet50V2. Threshold calibration offered only modest gains, and head-only fine-tuning improved performance primarily at intermediate magnifications. In contrast, full fine-tuning consistently achieved the highest accuracy, with ROC-AUC values between 0.9332 and 0.9832 and F1-weighted scores between 0.8631 and 0.9332 across all magnifications, detecting over 90% of malignant cases at 40× and 100×. These findings demonstrate that full model adaptation is essential for reliable cross-site generalization and should be preferred in clinical deployment scenarios.
Interpretability analysis revealed important limitations of Grad-CAM. Although attention maps became more stable after full fine-tuning, they often failed to highlight diagnostically meaningful structures, such as the global architectural patterns in benign fibroadenoma or the nuclear abnormalities characteristic of mucinous carcinoma. This confirms that saliency-based visualization methods provide limited clinical insight and must be complemented by expert pathological review.
Overall, the proposed evaluation framework highlights practical trade-offs between adaptation cost and diagnostic reliability and can support informed decisions regarding model deployment in digital pathology workflows.
Future work will focus on exploring more advanced interpretability methods tailored to histopathology—such as multi-scale attention mechanisms, nuclei-level attribution, or weakly supervised localization—and extending cross-site validation to multi-institutional datasets and modern architectures (e.g., Vision Transformers, ConvNeXt). Addressing these challenges is essential for strengthening clinical trustworthiness and enabling safe, routine integration of AI-assisted histopathology in medical practice.
The main contributions of this study can be summarized as follows:
  • We propose a unified, patient-level evaluation pipeline for cross-site breast histopathology classification, eliminating information leakage.
  • We conduct the first systematic comparison of three adaptation strategies (threshold calibration, head-only fine-tuning, full fine-tuning) under identical domain-shift conditions and across four magnifications.
  • We provide a clinically grounded analysis of malignant detection rates, demonstrating where lightweight adaptation is feasible and where full model retraining is required.
  • We include a pathologist-reviewed interpretability assessment, highlighting fundamental limitations of Grad-CAM for clinical reasoning.
  • We design a lightweight, reproducible cross-site adaptation framework that reflects realistic constraints in medical institutions.
From a methodological perspective, this study represents a complete end-to-end pipeline designed, implemented, and validated by the authors—including dataset reorganization, patient-level splitting, adaptation strategy design, model training, quantitative evaluation, and interpretability assessment. The systematic analysis carried out by the authors provides a transparent and reproducible framework for future cross-site studies, aligning the conclusions with the specific methodological and analytical contributions documented in the manuscript.

Supplementary Materials

The following supporting information can be downloaded at: Dataset S1: Breast Histopathology Images dataset (IDC dataset), publicly available at: https://www.kaggle.com/datasets/paultimothymooney/breast-histopathology-images (accessed on 2 March 2025). Dataset S2: BreaKHis: Breast Cancer Histopathological Database, publicly available at: https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis (accessed on 26 June 2025).

Author Contributions

Conceptualization, L.S. and C.S.-S.; methodology, L.S.; software, L.S. and C.S.-S.; validation, L.S. and C.S.-S.; formal analysis, L.S.; investigation, L.S. and C.S.-S.; resources, L.S.; data curation, L.S.; writing—original draft preparation, L.S.; writing—review and editing, C.S.-S.; visualization, L.S. and C.S.-S.; supervision, L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study received ethical approval from the Ethics Committee of the University of Craiova, Romania (Approval No. 3323, Date: 12 November 2025). All analyses were performed exclusively on publicly available and fully anonymized datasets (Kaggle Breast Histopathology Images and BreaKHis). No identifiable patient information or direct human participant involvement occurred in this study.

Data Availability Statement

Data is contained within this article.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, GPT-5, 2025) for language editing and Python 3.9 debugging support. The authors have reviewed and edited all outputs and take full responsibility for the content.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUC: Area Under the Curve
BalAcc: Balanced Accuracy
CAD: Computer-Aided Diagnosis
CNN: Convolutional Neural Network
F1w: F1-score (weighted)
FT: Fine-Tuning
Grad-CAM: Gradient-weighted Class Activation Mapping
MIL: Multiple Instance Learning
PR-AUC: Precision–Recall Area Under the Curve
ROC: Receiver Operating Characteristic
WSI: Whole-Slide Image

References

  1. Bray, F.; Ferlay, J.; Soerjomataram, I.; Siegel, R.L.; Torre, L.A.; Jemal, A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2018, 68, 394–424.
  2. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.W.M.; van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
  3. Esteva, A.; Robicquet, A.; Ramsundar, B.; Kuleshov, V.; DePristo, M.; Chou, K.; Cui, C.; Corrado, G.; Thrun, S.; Dean, J. A guide to deep learning in healthcare. Nat. Med. 2019, 25, 24–29.
  4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  5. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
  6. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986.
  7. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021.
  8. Spanhol, F.A.; Oliveira, L.S.; Petitjean, C.; Heutte, L. A dataset for breast cancer histopathological image classification. IEEE Trans. Biomed. Eng. 2016, 63, 1455–1462.
  9. Araujo, T.; Aresta, G.; Castro, E.; Rouco, J.; Aguiar, P.; Eloy, C.; Polónia, A.; Campilho, A. Classification of breast cancer histology images using convolutional neural networks. PLoS ONE 2017, 12, e0177544.
  10. Bayramoglu, N.; Kannala, J.; Heikkilä, J. Deep learning for magnification independent breast cancer histopathology image classification. In Proceedings of the IEEE International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 2440–2445.
  11. Deniz, E.; Şengür, A.; Kadiroğlu, Z.; Guo, Y.; Bajaj, V.; Budak, Ü. Transfer learning based histopathologic image classification for breast cancer detection. Health Inf. Sci. Syst. 2018, 6, 18.
  12. Han, Z.; Wei, B.; Zheng, Y.; Yin, Y.; Li, K.; Li, S. Breast cancer multi-classification from histopathological images with structured deep learning model. Sci. Rep. 2017, 7, 4172.
  13. Ilse, M.; Tomczak, J.M.; Welling, M. Attention-based Deep Multiple Instance Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 2127–2136.
  14. Macenko, M.; Niethammer, M.; Marron, J.S.; Borland, D.; Woosley, J.T.; Guan, X.; Schmitt, C.; Thomas, N.E. A method for normalizing histology slides for quantitative analysis. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), Boston, MA, USA, 28 June–1 July 2009; pp. 1107–1110.
  15. Azizi, S.; Mustafa, B.; Ryan, F.; Beaver, Z.; Freyberg, J.; Deaton, J.; Loh, A.; Karthikesalingam, A.; Kornblith, S.; Chen, T.; et al. Big self-supervised models advance medical image classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3478–3488.
  16. Wang, Y.; Luo, F.; Yang, X.; Wang, Q.; Sun, Y.; Tian, S.; Feng, P.; Huang, P.; Xiao, H. The Swin-Transformer network based on focal loss is used to identify images of pathological subtypes of lung adenocarcinoma with high similarity and class imbalance. J. Cancer Res. Clin. Oncol. 2023, 149, 8581–8592.
  17. Qu, Y.; Zhou, X.; Huang, P.; Liu, Y.; Mercaldo, F.; Santone, A.; Feng, P. CGAM: An end-to-end causality graph attention Mamba network for esophageal pathology grading. Biomed. Signal Process. Control 2025, 103, 107452.
  18. Huang, P.; Xiao, H.; He, P.; Li, C.; Guo, X.; Tian, S.; Feng, P.; Chen, H.; Sun, Y.; Mercaldo, F.; et al. LA-ViT: A Network With Transformers Constrained by Learned-Parameter-Free Attention for Interpretable Grading in a New Laryngeal Histopathology Image Dataset. IEEE J. Biomed. Health Inform. 2024, 28, 3557–3570.
  19. Huang, P.; Feng, P.; Tian, S.; Xiao, H.; Mercaldo, F.; Santone, A.; Qin, J. A ViT-AMC Network With Adaptive Model Fusion and Multiobjective Optimization for Interpretable Laryngeal Tumor Grading From Histopathological Images. IEEE Trans. Med. Imaging 2023, 42, 15–28.
  20. Huang, P.; Luo, X. FDTs: A Feature Disentangled Transformer for Interpretable Squamous Cell Carcinoma Grading. IEEE/CAA J. Autom. Sin. 2025, 12, 2365–2367.
  21. Zebari, D.A.; Ibrahim, D.A.; Zeebaree, D.Q.; Mohammed, M.A.; Haron, H.; Zebari, N.A.; Damaševičius, R.; Maskeliūnas, R. Breast cancer detection using mammogram images with improved multi-fractal dimension approach and feature fusion. Appl. Sci. 2021, 11, 12122.
  22. Aldakhil, L.A.; Alhasson, H.F.; Alharbi, S.S. Attention-based deep learning approach for breast cancer histopathological image multi-classification. Diagnostics 2024, 14, 1402.
  23. Loddo, A.; Usai, M.; Di Ruberto, C. Gastric cancer image classification: A comparative analysis and feature fusion strategies. J. Imaging 2024, 10, 195.
  24. Çetin-Kaya, Y. Equilibrium optimization-based ensemble CNN framework for breast cancer multiclass classification using histopathological images. Diagnostics 2024, 14, 2253.
  25. Amin, M.S.; Ahn, H. FabNet: A features agglomeration-based convolutional neural network for multiscale breast cancer histopathology image classification. Cancers 2023, 15, 1013.
  26. Balasubramanian, A.A.; Al-Heejawi, S.M.A.; Singh, A.; Breggia, A.; Ahmad, B.; Christman, R.; Ryan, S.T.; Amal, S. Ensemble deep learning-based image classification for breast cancer subtype and invasiveness diagnosis from whole slide image histopathology. Cancers 2024, 16, 2222.
  27. Campanella, G.; Hanna, M.G.; Geneslaw, L.; Miraflor, A.; Werneck Krauss Silva, V.; Busam, K.J.; Brogi, E.; Halpern, M.; Samboy, J.; Klimstra, D.S.; et al. Clinical-grade computational pathology using weakly supervised deep learning on whole-slide images. Nat. Med. 2019, 25, 1301–1309.
  28. Kamnitsas, K.; Baumgartner, C.; Ledig, C.; Newcombe, V.; Simpson, J.; Kane, A.; Menon, D.; Nori, A.; Criminisi, A.; Rueckert, D.; et al. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In Proceedings of the Information Processing in Medical Imaging (IPMI), Boone, NC, USA, 25–30 June 2017; pp. 597–609.
  29. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626.
  30. Teng, Q.; Liu, Z.; Song, Y.; Han, K.; Lu, Y. A Survey on the Interpretability of Deep Learning in Medical Diagnosis. Multimed. Syst. 2022, 28, 2335–2355.
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 630–645.
Figure 1. ResNet50V2 architecture used in this study.
Figure 2. Overview of the three adaptation strategies evaluated in this study: site (threshold) calibration (no retraining), light fine-tuning (training only the classifier head), and full fine-tuning (training all layers). The diagram summarizes the data flow and model adaptation steps used in each strategy.
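To make the three adaptation strategies in Figure 2 concrete, the sketch below shows how they differ in a Keras setup, assuming a ResNet50V2 backbone with a single sigmoid output. The learning rates, epoch counts, and dataset objects (train_ds, val_ds) are illustrative assumptions, not the exact training configuration used in this study.

```python
import tensorflow as tf

def build_model():
    """ResNet50V2 backbone (ImageNet weights) with a binary sigmoid head."""
    base = tf.keras.applications.ResNet50V2(
        include_top=False, weights="imagenet",
        input_shape=(224, 224, 3), pooling="avg")
    inputs = tf.keras.Input(shape=(224, 224, 3))
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(base(inputs))
    return tf.keras.Model(inputs, outputs), base

def compile_model(model, lr):
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC(name="auc")])

model, base = build_model()

# (i) Site calibration: no retraining at all -- the network is used as-is and
#     only the decision threshold is re-tuned on target-site validation data
#     (see the thresholding sketch after Table 5).

# (ii) Light fine-tuning: freeze the backbone, train only the classifier head.
base.trainable = False
compile_model(model, lr=1e-3)
# model.fit(train_ds, validation_data=val_ds, epochs=5)

# (iii) Full fine-tuning: unfreeze everything and retrain with a small
#       learning rate to avoid destroying the pretrained features.
base.trainable = True
compile_model(model, lr=1e-5)
# model.fit(train_ds, validation_data=val_ds, epochs=...)
```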
Figure 3. Original benign fibroadenoma image at 40× and corresponding Grad-CAM visualizations generated by the three adaptation strategies (SiteCalib, LightFT, FullFT).
Figure 4. Original image of invasive lobular carcinoma at 40× and the associated Grad-CAM maps produced by SiteCalib, LightFT, and FullFT.
Figure 5. Original mucinous carcinoma image at 400× along with Grad-CAM heatmaps for the three adaptation strategies (SiteCalib, LightFT, FullFT).
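The Grad-CAM heatmaps in Figures 3–5 follow the gradient-weighted localization of [29]. Below is a minimal sketch, assuming a non-nested Keras model whose final convolutional activation layer is named "post_relu" (the default name inside tf.keras.applications.ResNet50V2); for a model that wraps the backbone as a sub-model, the layer must be looked up within that sub-model.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name="post_relu"):
    """Grad-CAM heatmap of the malignant score w.r.t. the last conv maps."""
    grad_model = tf.keras.Model(
        model.input,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[None, ...])  # add batch dimension
        score = preds[:, 0]                              # sigmoid "malignant" score
    grads = tape.gradient(score, conv_maps)              # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))         # average over spatial dims
    cam = tf.reduce_sum(conv_maps * weights[:, None, None, :], axis=-1)[0]
    cam = tf.nn.relu(cam)                                # keep positive evidence only
    cam = cam / (tf.reduce_max(cam) + 1e-8)              # normalize to [0, 1]
    return cam.numpy()                                   # upsample/overlay separately
```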
Figure 6. Comparative performance across adaptation strategies (SiteCalib, LightFT, FullFT) and magnifications (40×, 100×, 200×, 400×) on the BreaKHis test sets. Bars indicate ROC-AUC (left) and F1-weighted (right). Full fine-tuning consistently outperforms the other strategies, with the largest improvements at 40× and 100×.
Table 2. Composition of the Kaggle Breast Histopathology Images dataset [Dataset S1], split at patient level. The training set was class-balanced through undersampling; the validation and test sets retained the natural class distribution.

| Split      | Benign  | Malignant | Total   |
|------------|---------|-----------|---------|
| Train      | 137,750 | 137,750   | 275,500 |
| Validation | 28,994  | 9,811     | 38,805  |
| Test       | 31,994  | 12,185    | 44,179  |
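A minimal sketch of the majority-class undersampling used to balance the training split in Table 2 follows; the helper name, the fixed seed, and the use of file paths as sample identifiers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed; an assumption for illustration

def undersample(paths, labels):
    """Drop majority-class samples so both classes have equal counts."""
    paths, labels = np.asarray(paths), np.asarray(labels)
    pos, neg = np.where(labels == 1)[0], np.where(labels == 0)[0]
    n = min(len(pos), len(neg))  # 137,750 per class in the Table 2 train split
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    rng.shuffle(keep)
    return paths[keep], labels[keep]
```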
Table 3. Composition of the BreaKHis dataset [Dataset S2], showing the number of benign and malignant samples per split and magnification level (patient-level split). No class balancing was applied.

| Magnification | Split      | Benign | Malignant | Total |
|---------------|------------|--------|-----------|-------|
| 40×           | Train      | 1288   | 2077      | 3365  |
| 40×           | Validation | 272    | 384       | 656   |
| 40×           | Test       | 583    | 808       | 1391  |
| 100×          | Train      | 1068   | 2042      | 3110  |
| 100×          | Validation | 254    | 368       | 622   |
| 100×          | Test       | 597    | 754       | 1351  |
| 200×          | Train      | 975    | 1781      | 2756  |
| 200×          | Validation | 226    | 334       | 560   |
| 200×          | Test       | 575    | 731       | 1306  |
| 400×          | Train      | 870    | 1522      | 2392  |
| 400×          | Validation | 211    | 312       | 523   |
| 400×          | Test       | 584    | 704       | 1288  |
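The patient-level splits in Tables 2 and 3 can be reproduced in spirit with scikit-learn's GroupShuffleSplit, which keeps every patient's images in exactly one split and thereby prevents leakage. The sketch below is illustrative; the 80/20 split fractions are assumptions and do not exactly match the split sizes reported above.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(labels, patient_ids, seed=42):
    """Return train/val/test index arrays with disjoint patient groups."""
    idx = np.arange(len(labels))
    # Hold out ~20% of patients for test, then ~20% of the rest for validation.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    trval, test = next(outer.split(idx, groups=patient_ids))
    inner = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    tr, val = next(inner.split(trval, groups=np.asarray(patient_ids)[trval]))
    return trval[tr], trval[val], test
```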
Table 4. Initial dataset results (patient-level split). The training set was balanced; validation and test preserved natural prevalence.

| Subset     | AUC   | F1-Weighted |
|------------|-------|-------------|
| Train      | 0.983 | 0.975       |
| Validation | 0.898 | 0.886       |
| Test       | 0.876 | 0.854       |
Table 5. Results of site calibration on BreaKHis (patient-level split).

| Magnification | ROC-AUC | F1w (Test) | BalAcc | Threshold | F1w (Val) |
|---------------|---------|------------|--------|-----------|-----------|
| 40×           | 0.6045  | 0.6816     | 0.5756 | 0.139     | 0.7139    |
| 100×          | 0.6371  | 0.5790     | 0.5963 | 0.772     | 0.7754    |
| 200×          | 0.7300  | 0.7189     | 0.6830 | 0.337     | 0.8436    |
| 400×          | 0.6390  | 0.6344     | 0.6066 | 0.287     | 0.7461    |
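Site calibration leaves the network weights untouched and only re-tunes the decision threshold on the target-site validation set; the chosen threshold is then applied unchanged to the test set. A minimal sketch, assuming a 1-D array of sigmoid probabilities and a simple grid search:

```python
import numpy as np
from sklearn.metrics import f1_score

def calibrate_threshold(y_val, p_val, grid=np.linspace(0.01, 0.99, 99)):
    """Return the threshold that maximizes weighted F1 on validation data."""
    scores = [f1_score(y_val, p_val >= t, average="weighted") for t in grid]
    return float(grid[int(np.argmax(scores))])

# t = calibrate_threshold(y_val, p_val)    # e.g., 0.139 at 40x in Table 5
# y_pred_test = (p_test >= t).astype(int)  # threshold reused on the test set
```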
Table 6. Light fine-tuning (head-only, 5 epochs) on the BreaKHis test set. Both the best-performing head and the decision thresholds were selected on the validation set.

| Magnification | ROC-AUC | F1w (Test) | BalAcc | Threshold | F1w (Val) |
|---------------|---------|------------|--------|-----------|-----------|
| 40×           | 0.6954  | 0.6845     | 0.6284 | 0.535     | 0.6803    |
| 100×          | 0.6875  | 0.6772     | 0.6243 | 0.584     | 0.8152    |
| 200×          | 0.7456  | 0.7305     | 0.6544 | 0.535     | 0.8306    |
| 400×          | 0.6387  | 0.6395     | 0.5968 | 0.604     | 0.7852    |
Table 7. Results of full fine-tuning on BreaKHis.

| Magnification | ROC-AUC | F1w (Test) | BalAcc | Threshold | F1w (Val) |
|---------------|---------|------------|--------|-----------|-----------|
| 40×           | 0.9500  | 0.9306     | 0.9067 | 0.198     | 0.8884    |
| 100×          | 0.9405  | 0.9127     | 0.9010 | 0.406     | 0.9509    |
| 200×          | 0.9207  | 0.8953     | 0.8847 | 0.465     | 0.9226    |
| 400×          | 0.9314  | 0.8631     | 0.8646 | 0.564     | 0.9056    |
Table 8. Comparative results on BreaKHis (per magnification, across methods).

| Magnification | ROC-AUC (SiteCalib) | F1w (SiteCalib) | ROC-AUC (LightFT) | F1w (LightFT) | ROC-AUC (FullFT) | F1w (FullFT) |
|---------------|---------------------|-----------------|-------------------|---------------|------------------|--------------|
| 40×           | 0.6045              | 0.6816          | 0.6954            | 0.6845        | 0.9500           | 0.9306       |
| 100×          | 0.6371              | 0.5790          | 0.6875            | 0.6772        | 0.9405           | 0.9127       |
| 200×          | 0.7300              | 0.7189          | 0.7456            | 0.7305        | 0.9207           | 0.8953       |
| 400×          | 0.6390              | 0.6344          | 0.6387            | 0.6395        | 0.9314           | 0.8631       |
Table 9. Malignant test cases correctly classified, per magnification and adaptation strategy. Each row gives the number and percentage of malignant test images correctly identified by each strategy, highlighting the improved detection achieved with full fine-tuning.

| Magnification | Strategy         | Correct/Total | % Correct |
|---------------|------------------|---------------|-----------|
| 40×           | Site Calibration | 755/808       | 93.4%     |
| 40×           | Light FT         | 770/808       | 95.3%     |
| 40×           | Full FT          | 785/808       | 97.2%     |
| 100×          | Site Calibration | 690/754       | 91.5%     |
| 100×          | Light FT         | 712/754       | 94.4%     |
| 100×          | Full FT          | 723/754       | 95.9%     |
| 200×          | Site Calibration | 648/731       | 88.6%     |
| 200×          | Light FT         | 669/731       | 91.5%     |
| 200×          | Full FT          | 689/731       | 94.2%     |
| 400×          | Site Calibration | 656/704       | 93.2%     |
| 400×          | Light FT         | 674/704       | 95.7%     |
| 400×          | Full FT          | 690/704       | 98.0%     |
Table 10. Sensitivity, specificity, and ROC-AUC for each adaptation strategy and magnification level on the BreaKHis test set.

| Magnification | Strategy         | Sensitivity | Specificity | ROC-AUC |
|---------------|------------------|-------------|-------------|---------|
| 40×           | Site Calibration | 0.905       | 0.602       | 0.6045  |
| 40×           | Light FT         | 0.959       | 0.782       | 0.6954  |
| 40×           | Full FT          | 0.964       | 0.911       | 0.9500  |
| 100×          | Site Calibration | 0.509       | 0.735       | 0.6371  |
| 100×          | Light FT         | 0.926       | 0.856       | 0.6875  |
| 100×          | Full FT          | 0.926       | 0.901       | 0.9332  |
| 200×          | Site Calibration | 0.731       | 0.735       | 0.7300  |
| 200×          | Light FT         | 0.859       | 0.880       | 0.7456  |
| 200×          | Full FT          | 0.942       | 0.920       | 0.9558  |
| 400×          | Site Calibration | 0.641       | 0.625       | 0.6421  |
| 400×          | Light FT         | 0.703       | 0.671       | 0.6458  |
| 400×          | Full FT          | 0.980       | 0.945       | 0.9832  |
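The sensitivity and specificity values in Tables 9 and 10 follow from the binary confusion matrix at each strategy's calibrated threshold, while ROC-AUC is threshold-free and computed from the raw probabilities. A minimal sketch with scikit-learn, assuming labels coded 1 = malignant and 0 = benign:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def sens_spec_auc(y_true, p_pred, threshold):
    """Sensitivity/specificity at a threshold, plus threshold-free ROC-AUC."""
    y_hat = (np.asarray(p_pred) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # recall on malignant cases
    specificity = tn / (tn + fp)   # recall on benign cases
    return sensitivity, specificity, roc_auc_score(y_true, p_pred)
```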
Table 11. Comparison with prior breast cancer histopathology studies on BreaKHis. Prior work reports image-level results, whereas our results were obtained under the stricter patient-level split.

| Study | Method/Model | Split Level | Reported Metric (Best) | Our Results (Full FT, Patient-Level) |
|-------|--------------|-------------|------------------------|--------------------------------------|
| Spanhol et al. (2016) [8] | Baseline CNN | Image-level | Accuracy ≈ 77–83% | – |
| Araujo et al. (2017) [9] | CNN trained from scratch | Image-level | Accuracy ≈ 83% | – |
| Bayramoglu et al. (2016) [10] | Multi-scale CNN | Image-level | Accuracy ≈ 84% | – |
| Han et al. (2017) [12] | ResNet/DenseNet fine-tuned | Image-level | AUC ≈ 0.85–0.90 | – |
| Aldakhil et al. (2024) [22] | Attention-based CNN | Image-level | Accuracy > 90% | – |
| Çetin-Kaya (2024) [24] | Ensemble CNN + optimization | Image-level | Accuracy ≈ 91–93% | – |
| Amin & Ahn (2023) [25] | FabNet (multiscale feature aggregation) | Image-level | AUC ≈ 0.90 | – |
| Balasubramanian et al. (2024) [26] | Ensemble DL | Image-level | Accuracy ≈ 92% | – |
| This work (2025) | ResNet50V2 full fine-tuning | Patient-level | ROC-AUC = 0.92–0.95, F1w = 0.86–0.93 | Best: 40× (ROC-AUC = 0.95, F1w = 0.93) |
Table 12. Transformers in other histopathology domains (not directly comparable).

| Study | Method/Model | Split Level | Reported Metric (Best) |
|-------|--------------|-------------|------------------------|
| Swin-Transformer (2023) [16] | Transformer with focal loss for lung adenocarcinoma subtype classification | Image-level | AUC ≈ 0.93–0.95 |
| ViT-AMC (2023) [19] | Vision Transformer with adaptive model fusion for laryngeal tumor grading | Image-level | AUC ≈ 0.94 |
| FDTs (2025) [20] | Feature Disentangled Transformer for squamous cell carcinoma grading | Image-level | AUC ≈ 0.95 |