1. Introduction
Persistent infection by high-risk genotypes of Human Papillomavirus (HPV), most notably HPV-16 and HPV-18, is the established cause of the vast majority of cervical carcinomas. The oncogenic mechanism is driven by the viral E6 and E7 proteins, which inactivate the tumor suppressors p53 and pRb, respectively, and thereby disrupt cell-cycle control, leading to progressive epithelial dysplasia and, ultimately, to invasive disease. Although molecular HPV-DNA testing offers high analytical sensitivity for the detection of viral genomes, it does not discriminate between transient infections and clinically relevant precancerous lesions. Histopathological assessment of tissue biopsies therefore remains the diagnostic gold standard, based on the identification of specific morphological alterations such as koilocytosis and dyskeratosis. In this setting, immunohistochemical biomarkers such as p16 are frequently employed as surrogate indicators of HPV-driven oncogenic activity, particularly in diagnostically challenging or borderline lesions, further highlighting the multi-level nature of pathological assessment [1].
Importantly, these morphological alterations do not occur as isolated events but are distributed along a cytology–histology continuum, ranging from subtle cytological abnormalities to progressively disorganized epithelial architecture in high-grade lesions. This biological continuity represents a fundamental aspect of HPV-related disease and a critical challenge for both human interpretation and computational modeling [
2]. The manual interpretation of histological slides is further limited by substantial inter-observer variability and by the inherently focal distribution of HPV cytopathic effects, which may be confined to small foci scattered within large regions of morphologically normal tissue. The digitization of glass slides into Whole Slide Images (WSI), multi-gigapixel, pyramidally structured files that reproduce the entire specimen at multiple magnifications, has opened the possibility of applying deep learning models, and in particular Convolutional Neural Networks (CNNs), to extract diagnostic features directly from pixel data [
3,
4].
Several recent studies have demonstrated the potential of deep learning for HPV-related histological analysis. Chakravarthy et al. [
5] developed CNN-based models capable of predicting molecular subtypes of cervical cancer directly from histology, showing that WSIs carry morphological signatures correlated with viral infection. A systematic review and meta-analysis by Zhu et al. [
6] reported area-under-the-curve (AUC) values above
for HPV-status prediction in oropharyngeal squamous cell carcinoma, identifying dataset heterogeneity as the principal obstacle to clinical translation. Schömig-Markiefka et al. [
7] proposed vendor-agnostic pipelines for computational pathology and emphasized that automated tissue segmentation and quality control are prerequisites for scalable and reproducible models. Within this literature, three issues remain only partially addressed. First, extreme class imbalance, arising from the numerical prevalence of histologically healthy tissue over pathological foci, biases models toward the majority class when trained with standard cross-entropy. Second, chromatic variability induced by differences in staining protocols, reagent batches, fixation durations, and scanner calibrations can be exploited by the network as a spurious predictor if not explicitly normalized [
8,
9]. Third, the use of patch-level random splits in cross-validation allows patches from the same patient to appear both in training and test sets, producing inflated performance estimates that do not reflect generalization to unseen subjects. A further, often underexplored, limitation is that most current models do not explicitly account for the biological and diagnostic continuum underlying HPV-related lesions, thereby reducing their ability to capture transitional morphological patterns that are critical in routine pathology practice.
A less frequently discussed, yet increasingly important, aspect of model design in medical imaging is the choice of the source domain for transfer learning. The de facto standard is ImageNet initialization, but Raghu et al. [10] have shown that domain-specific pretraining on medical images can yield superior performance on medical targets, even when the intermediate dataset is much smaller than ImageNet. More recent work on self-supervised pretraining on medical corpora [11] further supports the intuition that reducing the domain gap between source and target tasks is a first-order design decision.
The present work builds on these observations and focuses on a specific research question: does domain-adaptive pretraining on the SIPaKMeD cervical cytology dataset [12] provide a measurable benefit for patch-level HPV classification in H&E-stained WSIs, under a strict patient-level evaluation protocol spanning the cytology–histology continuum? The contributions can be summarized as follows. A fully automated, vendor-agnostic preprocessing chain is implemented, combining Otsu thresholding on the HSV saturation channel, Laplacian-variance blur rejection, and Macenko stain normalization. A ResNet50 backbone [13] is then pretrained on SIPaKMeD, which shares morphological primitives with the target domain, and subsequently fine-tuned on the HPV cohort under Focal Loss [14] and weighted sampling to handle class imbalance. Evaluation is performed through a three-fold Leave-One-Patient-Out (LOPO) cross-validation that strictly forbids inter-patient information leakage. The SIPaKMeD-pretrained model is directly compared against an ImageNet-pretrained baseline on the same folds, providing a controlled quantification of the benefit of domain-adaptive initialization.
2. Materials and Methods
2.1. Dataset and Experimental Setting
The study cohort was assembled within a diagnostic workflow spanning from cytological screening to histological evaluation, and comprises a total of 42 diagnostic specimens derived from 19 unique anonymized patients. For each case, cytological assessment (Pap test) was available prior to biopsy, followed by histopathological examination of Hematoxylin and Eosin (H&E)-stained tissue sections, which were subsequently digitized into Whole Slide Images (WSIs) under the supervision of an expert pathologist at a cancer institute. Regions of interest (ROIs) were annotated by expert pathologists to capture HPV-related morphological alterations across the cytology–histology continuum, ensuring biologically meaningful and diagnostically relevant training labels, as illustrated in
Figure 1. The 42-specimen total decomposes into 19 Pap cytology preparations (one per patient), 19 H&E-stained biopsy WSIs (one per patient), and 4 additional p16 immunohistochemistry (IHC) slides obtained for diagnostically borderline cases in which HPV-driven oncogenic activity required surrogate molecular confirmation. The complete mapping between specimens and patients is reported in
Table 1. When grouped at the patient level, this many-to-one structure reflects the multi-level diagnostic framework in which a single subject may contribute multiple specimens at different diagnostic scales; the LOPO evaluation protocol described below operates on this patient-level grouping, so that no specimen-to-specimen correlation within a patient can leak across training and test partitions.
The H&E WSIs were tiled into non-overlapping tissue patches of pixels extracted at the highest magnification level (Level 0) of the WSI pyramid. Of these, 6181 () are HPV-positive and () are HPV-negative, corresponding to a positive-to-negative ratio of approximately .
A subsequent comprehensive audit of the dataset metadata, performed at the patient level, yielded the verified class distribution adopted in all analyses reported in this work: 7181 HPV-positive patches (
) and
HPV-negative patches, with a negative-to-positive ratio of
. The complete per-patient breakdown is reported in
Table 2, sorted by descending positive prevalence to make the heterogeneity of the cohort apparent. Six patients contain zero HPV-positive patches; these contribute to the training partitions of all folds in which they are not held out, but cannot serve as informative held-out test patients in isolation.
Because the biological variability associated with only 19 subjects poses a concrete risk of patient-specific overfitting, the cohort is organized into two disjoint groups. Three subjects (referred to as P1, P2, and P3) are designated as primary test patients and rotate through the three LOPO folds; the remaining sixteen subjects are used exclusively to augment the training partition across all folds. The composition of the primary patients is reported in
Table 3. This design maximizes the utilization of the available patient-level variability during training while still enforcing patient-level separation at test time.
To complement the three-patient ablation rotation described above and provide an evaluation that covers every subject of the cohort, a second cross-validation scheme has been defined in which all 19 patients are partitioned into five disjoint folds. Each patient appears in exactly one test fold (four patients per fold, except fold 4 which contains three patients). The fold composition is reported in
Table 4; the corresponding evaluation protocol is described in
Section 2.4 and the results are reported in
Section 3.7.
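As a rough illustration of how 19 patients can be divided into five disjoint test folds of sizes 4, 4, 4, 4, and 3, the sketch below shuffles patient identifiers and assigns them round-robin. The identifiers, the seed, and the function name `partition_patients` are hypothetical; the fold composition actually used in this work is the one listed in Table 4, not the output of this code.

```python
import random


def partition_patients(patient_ids, n_folds=5, seed=0):
    """Split patient IDs into n_folds disjoint folds whose sizes differ by at most one."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    # Round-robin slicing: fold k receives elements k, k + n_folds, k + 2 * n_folds, ...
    return [ids[k::n_folds] for k in range(n_folds)]


# 19 hypothetical anonymized identifiers (not the identifiers of the study cohort).
patients = [f"pt{i:02d}" for i in range(1, 20)]
folds = partition_patients(patients)

for k, fold in enumerate(folds):
    print(f"fold {k}: {len(fold)} test patients")
```

Because every identifier lands in exactly one fold, each patient is tested once and never contributes to the training partition of its own fold.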
All experiments were performed on a workstation equipped with an NVIDIA RTX 4080 SUPER GPU (16 GB of dedicated memory), using Python 3.8+ with PyTorch, OpenSlide, OpenCV, NumPy, and the TIAToolbox library. No patient-identifying information was retained at any stage of the pipeline.
2.2. Preprocessing Pipeline
Given the multi-gigapixel nature of WSIs, a sequence of preprocessing operations is applied before training to isolate informative tissue regions, suppress low-quality content, and harmonize chromatic appearance across specimens. A synthetic comparison between the chosen techniques and standard alternatives in the literature is reported in
Table 5.
Each WSI is first downsampled to a thumbnail representation and converted from RGB to the HSV color space. The saturation channel is used as the basis for tissue–background separation because stained tissue exhibits high saturation, whereas the glass-slide background is close to achromatic. The Otsu algorithm [
15] is then applied to compute, in an unsupervised and scanner-agnostic way, the threshold $t^{*}$ that minimizes the intra-class variance
$$\sigma_w^2(t) = \omega_0(t)\,\sigma_0^2(t) + \omega_1(t)\,\sigma_1^2(t),$$
where $\omega_0(t)$ and $\omega_1(t)$ denote the probabilities of the two classes (background and tissue) separated by threshold $t$, and $\sigma_0^2(t)$, $\sigma_1^2(t)$ their within-class variances. The resulting binary mask is used to guide high-resolution patch extraction: for each candidate
tile, coordinates are mapped back onto the mask, and the tile is retained only if at least
of its area corresponds to foreground tissue. A qualitative example of tissue segmentation is shown in
Figure 2.
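The segmentation step can be sketched in a few lines of NumPy. The code below is a minimal illustration, assuming an 8-bit saturation channel that has already been extracted from the HSV thumbnail; it searches exhaustively for the threshold that minimizes the weighted within-class variance and then measures the foreground fraction of a tile. The function names, the toy image, and the tile geometry are hypothetical, and the minimum tissue fraction adopted in this work is not reproduced here.

```python
import numpy as np


def otsu_threshold(channel):
    """Return the 8-bit threshold t that minimizes the weighted within-class variance."""
    hist = np.bincount(channel.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_t, best_var = 0, np.inf
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()      # class probabilities
        if w0 == 0 or w1 == 0:
            continue                                  # one class is empty
        mu0 = (levels[:t] * prob[:t]).sum() / w0
        mu1 = (levels[t:] * prob[t:]).sum() / w1
        var0 = (((levels[:t] - mu0) ** 2) * prob[:t]).sum() / w0
        var1 = (((levels[t:] - mu1) ** 2) * prob[t:]).sum() / w1
        within = w0 * var0 + w1 * var1
        if within < best_var:
            best_t, best_var = t, within
    return best_t


def tissue_fraction(mask, row, col, size):
    """Fraction of foreground pixels inside a square tile of the binary mask."""
    return float(mask[row:row + size, col:col + size].mean())


# Toy saturation channel: achromatic background (left), stained tissue (right).
sat = np.full((8, 8), 10, dtype=np.uint8)
sat[:, 4:] = 200
t_star = otsu_threshold(sat)
mask = sat >= t_star
print(t_star, tissue_fraction(mask, 0, 2, 4))
```

A tile would then be kept only when `tissue_fraction` reaches the chosen minimum; in practice the same threshold is commonly obtained through a library routine rather than an explicit loop.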
Each retained patch is subsequently subjected to blur quality control. Image sharpness is quantified as the variance of the Laplacian response
$$\sigma_{L}^{2} = \mathrm{Var}\left(\nabla^{2} I\right),$$
computed on the grayscale version $I$ of the patch: this quantity is large when the image contains well-defined edges (nuclear membranes, cellular borders) and small when the patch is out of focus. Patches with a Laplacian variance below a fixed threshold
are discarded. On the reference CMU-1 slide adopted as a methodological sanity check, this filter rejected approximately 2.5% of the extracted tiles (5 out of 204) while retaining the remaining 97.5% (199 tiles) as sharp. Representative examples of accepted versus rejected patches are reported in
Figure 3.
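The sharpness criterion can be illustrated with a small NumPy sketch, assuming a 4-neighbour discrete Laplacian evaluated on interior pixels only. The function names are hypothetical and the rejection threshold is left as a required argument, since its value is a design choice of the pipeline; with OpenCV, a commonly used equivalent idiom is `cv2.Laplacian(gray, cv2.CV_64F).var()`, which differs slightly in border handling.

```python
import numpy as np


def laplacian_variance(gray):
    """Variance of the 4-neighbour discrete Laplacian over interior pixels."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def is_sharp(gray, threshold):
    """Keep the patch only if its Laplacian variance reaches the threshold."""
    return laplacian_variance(gray) >= threshold


# Toy inputs: a checkerboard (edges everywhere) and a linear ramp (no edges).
checker = (np.indices((16, 16)).sum(axis=0) % 2) * 255
ramp = np.add.outer(np.arange(16), np.arange(16))

print(laplacian_variance(checker), laplacian_variance(ramp))
```

The linear ramp has an identically zero Laplacian, so its score is zero even though its intensity varies; the measure responds to edges, not to contrast as such.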
The final preprocessing step is chromatic standardization via the Macenko stain-normalization algorithm [
8]. Unlike purely statistical approaches such as Reinhard normalization, which operates on generic channel statistics, Macenko’s method is physically motivated: pixel intensities are first mapped to the optical density (OD) domain according to the Beer–Lambert law
$$\mathrm{OD} = -\log_{10}\left(\frac{I}{I_0}\right),$$
where $I$ is the pixel intensity and $I_0$ the background illumination. Singular Value Decomposition (SVD) is then applied to identify the principal stain-vector directions corresponding to Hematoxylin and Eosin, and each patch is reprojected onto a shared reference appearance. The effect of this transformation is illustrated in
Figure 4. This stage suppresses chromatic variations arising from reagent batches, fixation conditions, and scanner calibrations [
9].
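The first two stages of this procedure (optical-density mapping and stain-vector estimation) can be sketched as follows. This is a simplified illustration, not the implementation used in the pipeline: the background intensity `i0`, the transparency cutoff `od_min`, the angular percentile `pct`, and the rule for ordering the two vectors are all assumptions, and the final reprojection onto a reference appearance is omitted.

```python
import numpy as np


def rgb_to_od(rgb, i0=255.0):
    """Beer-Lambert mapping OD = -log10(I / I0); intensities clipped to avoid log(0)."""
    rgb = np.clip(np.asarray(rgb, dtype=np.float64), 1.0, i0)
    return -np.log10(rgb / i0)


def estimate_stain_vectors(rgb, i0=255.0, od_min=0.15, pct=1.0):
    """Estimate two unit stain vectors from the extreme directions in the OD plane."""
    od = rgb_to_od(rgb, i0).reshape(-1, 3)
    od = od[(od > od_min).any(axis=1)]            # drop near-transparent pixels
    _, _, vt = np.linalg.svd(od, full_matrices=False)
    plane = vt[:2].T                              # basis of the dominant 2-D plane
    if plane[:, 0].sum() < 0:                     # make the first axis point to positive OD
        plane[:, 0] *= -1.0
    proj = od @ plane
    angles = np.arctan2(proj[:, 1], proj[:, 0])
    lo, hi = np.percentile(angles, [pct, 100.0 - pct])
    v_lo = plane @ np.array([np.cos(lo), np.sin(lo)])
    v_hi = plane @ np.array([np.cos(hi), np.sin(hi)])
    # Assumed heuristic: the vector absorbing more red light is taken as Hematoxylin.
    first, second = (v_lo, v_hi) if v_lo[0] > v_hi[0] else (v_hi, v_lo)
    return np.stack([first, second])


# Synthetic check with illustrative (assumed) stain directions.
rng = np.random.default_rng(0)
h_true = np.array([0.65, 0.70, 0.29]); h_true /= np.linalg.norm(h_true)
e_true = np.array([0.07, 0.99, 0.11]); e_true /= np.linalg.norm(e_true)
conc = rng.uniform(0.0, 1.0, size=(5000, 2))
rgb = 255.0 * 10.0 ** (-(conc @ np.stack([h_true, e_true])))
stains = estimate_stain_vectors(rgb.reshape(50, 100, 3))
print(np.round(stains, 3))
```

In a full normalization, per-pixel stain concentrations would next be obtained by least squares against these vectors and recombined with the stain matrix of a reference slide.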
2.3. Domain-Adaptive Transfer Learning and Training Protocol
The classification model is based on a ResNet50 backbone [
13], whose default classification head is replaced by a task-specific module composed of a dropout layer (
), a fully connected layer projecting the 2048-dimensional feature vector to 512 units with ReLU activation, a second dropout layer (
), and a final linear layer producing a single logit, which is mapped to the HPV-positive probability by a sigmoid activation.
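To make the dimensions of this head concrete, the sketch below reproduces its inference-time forward pass in NumPy. It is only a shape illustration under stated assumptions: the weights are random placeholders, the dropout layers are omitted because they are inactive at inference, and the function name `head_forward` is hypothetical. The model described in this work is built in PyTorch on top of ResNet50.

```python
import numpy as np


def head_forward(features, w1, b1, w2, b2):
    """Inference-time pass: FC 2048 -> 512 with ReLU, FC 512 -> 1, then sigmoid."""
    hidden = np.maximum(features @ w1 + b1, 0.0)
    logit = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
w1 = rng.normal(0.0, 0.01, size=(2048, 512))   # placeholder weights
b1 = np.zeros(512)
w2 = rng.normal(0.0, 0.01, size=(512, 1))
b2 = np.zeros(1)

# Trainable parameters of the head: 2048 * 512 + 512 + 512 * 1 + 1.
n_params = w1.size + b1.size + w2.size + b2.size

features = rng.normal(size=(4, 2048))           # four pooled backbone feature vectors
probs = head_forward(features, w1, b1, w2, b2)
print(n_params, probs.shape)
```

The head therefore adds roughly one million parameters on top of the backbone, almost all of them in the first fully connected layer.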
Rather than initializing the backbone with generic ImageNet weights, a domain-adaptive transfer-learning strategy is adopted. The ResNet50 backbone is first pretrained on a binary classification task defined on the SIPaKMeD dataset [12], a publicly available corpus of 4049 Pap-smear cytological images organized into five morphological classes. The task is cast as a binary koilocytotic versus non-koilocytotic classification: the koilocytotic class (825 images) is treated as positive, and the remaining four classes (superficial–intermediate, parabasal, metaplastic, and dyskeratotic; 3224 images in total) are grouped as negative. The rationale is that koilocytes are the cytopathic hallmark of HPV infection, and a backbone that has already learned to recognize koilocytotic features in a cytological context is expected to provide a more informative initialization for the target histological task than a backbone trained on natural images, consistent with the cytology–histology continuum underlying HPV-related disease. Pretraining is performed for 50 epochs with the AdamW optimizer [
16] and Focal Loss, and the best-validation checkpoint is retained for fine-tuning.
Fine-tuning on the target HPV cohort is then performed as an end-to-end update of the full ResNet50 network, starting from the SIPaKMeD-pretrained weights. The learning rate is set to
, an order of magnitude lower than the one used for SIPaKMeD pretraining, so as to refine the inherited representations without erasing them. The model is trained for 50 epochs with a cosine-annealing schedule using the AdamW optimizer (weight decay
). The main hyperparameters are summarized in
Table 6.
Class imbalance is addressed by two complementary mechanisms. First, the training loss is the Focal Loss [
14],
$$\mathrm{FL}(p_t) = -\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t),$$
where $p_t$ denotes the predicted probability of the ground-truth class, $\alpha_t$ is a class-balancing weight (set to
for the positive class), and $\gamma$ is the focusing parameter. The modulating factor $(1 - p_t)^{\gamma}$ down-weights the contribution of easily classified examples and drives the optimization toward the morphologically ambiguous positive patches. Second, a
WeightedRandomSampler with inverse-frequency weights is used during mini-batch construction, so that each batch contains a meaningful number of minority-class examples.
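Both mechanisms can be illustrated with a short, dependency-free sketch. The per-sample loss follows the standard binary Focal Loss, and the sampling weights are the reciprocal class counts that would be passed to a weighted sampler. The defaults `alpha=0.25` and `gamma=2.0` are the values popularized by the original Focal Loss formulation and act only as placeholders here; the settings adopted in this work are those of Table 6. Function names and the toy label vector are hypothetical.

```python
import math
import random


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one sample; p is the predicted P(y = 1), y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p                 # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha     # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def inverse_frequency_weights(labels):
    """Per-sample weight equal to 1 / (number of samples sharing that label)."""
    counts = {c: labels.count(c) for c in set(labels)}
    return [1.0 / counts[y] for y in labels]


# An easy positive contributes far less than an ambiguous one.
print(focal_loss(0.9, 1), focal_loss(0.6, 1))

# Toy cohort with 10% positives; weighted draws are roughly class-balanced.
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=weights, k=1000)
positive_fraction = sum(labels[i] for i in drawn) / len(drawn)
print(positive_fraction)
```

With inverse-frequency weights the two classes carry equal total sampling mass, so a mini-batch drawn with replacement contains positives and negatives in comparable numbers regardless of the raw prevalence.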
Online data augmentation during training comprises random resized crops (scale – to ), random horizontal and vertical flips, random rotations in , and mild color jitter (brightness , contrast , saturation , hue ). Validation and test patches are processed with a deterministic resize-and-center-crop pipeline to guarantee reproducible evaluation, and all images are normalized with ImageNet channel statistics (, ) to retain compatibility with the pretrained backbone.
2.4. Evaluation Protocol
Model evaluation is performed under a patient-level Leave-One-Patient-Out (LOPO) cross-validation scheme with three folds, one for each primary test patient (P1, P2, P3). In each fold, all patches belonging to the held-out primary patient constitute the test set, while patches from the two remaining primary patients and from all sixteen training-augmentation patients are divided between training and validation through a further patient-level split. Consequently, no patient, and therefore no specimen of the 42-case collection originating from that patient, contributes patches to more than one partition. This design eliminates inter-patient information leakage by construction and provides an estimate of generalization to subjects genuinely unseen during training.
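The fold construction described above can be sketched as follows. The primary patients P1 to P3 are taken from the text, whereas the sixteen auxiliary identifiers, the size of the validation subset, the seed, and the function names are hypothetical placeholders.

```python
import random


def lopo_folds(primary, auxiliary):
    """One fold per primary patient; all other patients form the train/validation pool."""
    folds = []
    for held_out in primary:
        pool = [p for p in primary if p != held_out] + list(auxiliary)
        folds.append({"test": [held_out], "train_val": pool})
    return folds


def split_train_val(pool, n_val, seed=0):
    """Patient-level split of the pool; n_val patients are reserved for validation."""
    ids = sorted(pool)
    random.Random(seed).shuffle(ids)
    return ids[n_val:], ids[:n_val]


def select_patches(patches, patients):
    """Keep only the patches whose patient belongs to the given partition."""
    keep = set(patients)
    return [p for p in patches if p["patient"] in keep]


primary = ["P1", "P2", "P3"]
auxiliary = [f"A{i:02d}" for i in range(1, 17)]   # 16 hypothetical identifiers
folds = lopo_folds(primary, auxiliary)

for fold in folds:
    train_ids, val_ids = split_train_val(fold["train_val"], n_val=3)
    print(fold["test"], len(train_ids), len(val_ids))
```

Since patches are routed to a partition only through their patient identifier, a patch can never appear on both sides of a split.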
For each fold, the following metrics are computed on the test set: the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), the Average Precision (AP), the best F1-score over decision thresholds in with a step of , and the corresponding sensitivity and specificity. Fold-level results are aggregated into mean and standard deviation (with degrees of freedom). To isolate the effect of domain-adaptive pretraining, the entire LOPO protocol is repeated twice under identical conditions: once with the backbone initialized from standard ImageNet weights, and once with the backbone initialized from the SIPaKMeD-pretrained checkpoint.
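A compact, dependency-free sketch of the per-fold metrics is given below. The AUC is computed through its pairwise-ranking interpretation, and the best F1 is found by scanning a threshold grid. The grid, the toy scores, the three fold-level AUC values, and the use of the sample standard deviation are illustrative assumptions, not the settings or results of this study; Average Precision is omitted for brevity.

```python
from statistics import mean, stdev


def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion(scores, labels, thr):
    """Return (tp, fp, tn, fn) when predicting positive for scores >= thr."""
    tp = sum(s >= thr and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= thr and y == 0 for s, y in zip(scores, labels))
    tn = sum(s < thr and y == 0 for s, y in zip(scores, labels))
    fn = sum(s < thr and y == 1 for s, y in zip(scores, labels))
    return tp, fp, tn, fn


def best_f1(scores, labels, thresholds):
    """Scan the grid and report the operating point with the highest F1."""
    best = {"f1": -1.0}
    for thr in thresholds:
        tp, fp, tn, fn = confusion(scores, labels, thr)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best["f1"]:
            best = {"f1": f1, "threshold": thr,
                    "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
                    "specificity": tn / (tn + fp) if tn + fp else 0.0}
    return best


scores = [0.10, 0.20, 0.30, 0.40, 0.70, 0.80]
labels = [0, 0, 0, 0, 1, 1]
grid = [round(0.05 * k, 2) for k in range(1, 20)]   # hypothetical threshold grid
result = best_f1(scores, labels, grid)
auc = roc_auc(scores, labels)

fold_aucs = [0.80, 0.85, 0.90]                       # hypothetical fold-level values
print(result, auc, mean(fold_aucs), stdev(fold_aucs))
```

Because the threshold is chosen on the same data on which F1 is reported, the resulting value is an optimistic operating point and is best read alongside the threshold-free AUC.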
To provide an evaluation that covers every subject of the cohort and to avoid any selection bias in the choice of test patients, the SIPaKMeD-pretrained model is additionally evaluated under a 5-fold patient-level cross-validation that partitions the 19 patients into five disjoint folds (
Table 4). In each fold, all patches belonging to the held-out patients constitute the test set, while patches from the remaining patients are used for training and validation through a further patient-level split. By construction, no patch and no specimen originating from a given patient appears in more than one partition; the leakage-free property of the Leave-One-Patient-Out (LOPO) principle is therefore preserved. The acronym LOPO is hereafter used in its general leakage-free sense; the specific protocols of this work are referred to as three-fold ablation (
Section 3.3 and
Section 3.4,
Table 7 and
Table 8) and 5-fold patient-level CV (
Section 3.7 and following), respectively.
In addition to the metrics enumerated above, the 5-fold analysis reports the Matthews Correlation Coefficient (MCC) at threshold
, the balanced accuracy at threshold
, the Brier score for probability calibration, the sensitivity at fixed specificity (≥90% and ≥95%), and the precision–recall curves of the SIPaKMeD-pretrained model on each test fold (
Figure 5). The aggregate mean test AUC is further accompanied by a patient-aware non-parametric bootstrap
confidence interval with
resamples, which captures within-fold sampling variability while respecting the patient-level design. Checkpoint selection follows the standard best-validation-AUC rule throughout; the noise of this criterion under low positive support, discussed in
Section 3.5, motivates more robust strategies (e.g., exponential moving average of late parameters or logit ensembling of late checkpoints), which we leave as future work.
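The additional metrics and the patient-aware resampling can be sketched as follows. The bootstrap shown here resamples whole patients with replacement and recomputes a user-supplied statistic on the pooled patches; this is one plausible reading of a patient-aware design, not necessarily the exact procedure of this work, and the number of resamples, the confidence level, the toy data, and all function names are assumptions.

```python
import math
import random


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity and specificity."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and binary label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def patient_bootstrap_ci(per_patient, statistic, n_boot=1000, level=0.95, seed=0):
    """Percentile interval obtained by resampling patients, not individual patches."""
    rng = random.Random(seed)
    ids = sorted(per_patient)
    stats = []
    for _ in range(n_boot):
        drawn = [rng.choice(ids) for _ in ids]
        probs = [p for pid in drawn for p in per_patient[pid]["probs"]]
        labels = [y for pid in drawn for y in per_patient[pid]["labels"]]
        # A statistic such as AUC would need a guard for single-class resamples.
        stats.append(statistic(probs, labels))
    stats.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = min(n_boot - 1, int((1.0 + level) / 2.0 * n_boot))
    return stats[lo_idx], stats[hi_idx]


per_patient = {
    "pt1": {"probs": [0.9, 0.2], "labels": [1, 0]},
    "pt2": {"probs": [0.6, 0.4], "labels": [1, 0]},
    "pt3": {"probs": [0.3, 0.1], "labels": [0, 0]},
}
ci = patient_bootstrap_ci(per_patient, brier_score)
print(mcc(1, 1, 3, 1), balanced_accuracy(1, 1, 3, 1), ci)
```

Resampling at the patient level keeps all patches of a subject together, so the interval reflects between-patient variability instead of treating correlated patches as independent observations.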
4. Discussion
The main empirical finding of this study is that replacing ImageNet pretraining with SIPaKMeD pretraining produces a measurable and reproducible improvement on a small-cohort HPV histopathology task. The improvement in mean test AUC from to is moderate in absolute terms, but is obtained under a strict LOPO evaluation and is accompanied by a halving of the per-fold variance, indicating that the SIPaKMeD-initialized model generalizes more uniformly across patients with very different disease prevalence and morphological characteristics. Three aspects of this result deserve explicit discussion.
The first aspect concerns the motivation for domain-adaptive pretraining. ImageNet and histopathology differ substantially in low- and mid-level image statistics, and several recent works have pointed out that the benefit of ImageNet initialization in medical imaging, although non-zero, is smaller than commonly assumed [
10,
11]. SIPaKMeD, although a cytological rather than histological dataset, shares with the target domain the morphological primitives that are most discriminative for HPV-associated alterations, namely nuclear shape, chromatin texture, and perinuclear halos. The high validation AUC (
) achieved on the binary koilocytotic versus non-koilocytotic subtask of SIPaKMeD confirms that the backbone has acquired an explicit representation of the cytopathic hallmark of HPV before being exposed to the histological target.
The second aspect is the strict enforcement of patient-level partitioning. A relevant fraction of the existing literature on patch-based classification in digital pathology reports performance estimates obtained through random, patch-level splits, which allow patches from the same patient to appear in both training and test sets. The LOPO scheme adopted here guarantees that each test fold is evaluated on tissue from a patient never seen during training, at the cost of increased per-fold variance. In this context, the reduction of variance from
to
observed when moving from ImageNet to SIPaKMeD pretraining is particularly meaningful: it suggests that domain-adaptive initialization mitigates the sensitivity of the model to inter-patient variability, which is one of the main obstacles to clinical translation identified by Zhu et al. [
6].
Beyond its technical implications, the observed improvement in generalization stability has direct translational relevance. In routine cervical pathology, diagnostic variability, particularly in borderline categories such as LSIL versus HSIL, remains a well-recognized challenge, with implications for patient management and follow-up strategies. A model that is less sensitive to inter-patient variability, as suggested by the reduced variance observed with SIPaKMeD pretraining, may contribute to more consistent decision support across heterogeneous clinical scenarios. From a biological perspective, the ability of the model to capture cytopathic features associated with HPV infection, such as koilocytic changes and nuclear atypia, supports its potential role in bridging cytology and histology within a unified analytical framework. This is particularly relevant in screening settings, where cytological and histological assessments are inherently complementary but often disconnected in digital workflows. In this context, domain-adaptive pretraining may represent a step toward more clinically aligned artificial intelligence systems, capable not only of improving classification performance but also of reflecting the underlying disease biology. Future integration with slide-level aggregation and explainability tools could further enhance clinical interpretability and facilitate adoption in diagnostic practice, in line with recent efforts to translate AI models into real-world pathology workflows [
17,
18].
The third aspect is the role of preprocessing. Macenko normalization, Otsu segmentation, and Laplacian-variance quality control are often treated as implementation details, yet they directly affect what the model is allowed to learn. The proposed chain is fully automatic, requires no per-slide manual threshold selection, and is consistent with the vendor-agnostic design advocated by Schömig-Markiefka et al. [7] and by Tellez et al. [9]. It therefore provides a reproducible entry point for multi-center extensions of this work.
Several limitations must be acknowledged. The cohort of 19 patients (assembled from 42 diagnostic specimens spanning Pap cytology, H&E biopsy, and p16 IHC) is small by deep learning standards, and the three-fold LOPO design, while rigorously leakage-free, produces wide confidence intervals simply because
. The single-center nature of the data prevents a direct quantification of the vendor-agnostic claims associated with Macenko normalization. The validation sets are severely imbalanced and contain only a few tens of positive samples, which renders the validation AUC a noisy model selection criterion and motivates more robust checkpoint selection strategies as a direction for future work. The model operates at the patch level and does not yet aggregate predictions into patient- or slide-level diagnoses, which would be the natural next step through Multiple Instance Learning (MIL) formulations [
19]. Finally, the absence of explainability mechanisms such as Grad-CAM [
20] limits the interpretability of predictions for clinical end-users.
A direct numerical comparison with recent HPV-prediction studies is not straightforward, since the reported AUC values above
are typically obtained on oropharyngeal rather than cervical tissue, on slide-level rather than patch-level labels, and on larger, multi-institutional cohorts with MIL aggregation [
6]. The patch-level AUC of
reported here, obtained under strict LOPO on a single-center H&E dataset, targets a more demanding evaluation scenario and is expected to constitute a lower bound with respect to patient-level or slide-level formulations.
Future work will focus on four main directions. First, the cohort will be expanded through multi-center collaboration, which is the primary lever to reduce the per-fold variance that currently dominates the uncertainty of the estimate. Second, the training protocol will be extended along two axes: a two-phase fine-tuning schedule, in which the backbone is initially frozen so that the classification head can adapt without perturbing the SIPaKMeD-derived representations, and a dual-checkpoint testing procedure designed to mitigate the unreliability of best-validation model selection under extreme imbalance. Third, Test-Time Augmentation over geometric transforms will be investigated as an inexpensive ensemble mechanism exploiting the rotational symmetry of histological tissue. Fourth, attention-based MIL will be integrated to aggregate patch-level scores into patient- or slide-level decisions, and explainability via Grad-CAM will be added to make the predictions visually inspectable by pathologists.
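The Test-Time Augmentation direction mentioned above can be illustrated with a minimal NumPy sketch that averages a model's output over the eight rotations and reflections of a square patch. The callable `predict_fn` stands in for any patch-level model and is hypothetical, as are the function names; this is a sketch of the general idea, not of a procedure already used in this work.

```python
import numpy as np


def dihedral_views(patch):
    """Return the 8 views of a patch: 4 rotations by 90 degrees, each with its mirror."""
    views = []
    for k in range(4):
        rotated = np.rot90(patch, k, axes=(0, 1))
        views.append(rotated)
        views.append(np.flip(rotated, axis=1))
    return views


def tta_predict(predict_fn, patch):
    """Average the prediction of predict_fn over all dihedral views of the patch."""
    return float(np.mean([predict_fn(view) for view in dihedral_views(patch)]))


# Toy "model" that depends on orientation: it reads the top-left pixel only.
patch = np.array([[0, 1],
                  [2, 3]])
print(tta_predict(lambda view: view[0, 0], patch))
```

Since histological tissue has no canonical orientation, the averaged score is invariant to how the patch happened to be rotated on the slide, at the price of eight forward passes per patch.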
A complementary set of observations follows from the analyses reported in
Section 3.7,
Section 3.8,
Section 3.9 and
Section 3.10. The mean test AUC obtained under the 5-fold patient-level cross-validation (
,
bootstrap CI
) is essentially identical to the value obtained on the controlled three-fold ablation rotation (
), providing direct empirical evidence that the SIPaKMeD pretraining benefit is robust to the choice of cross-validation design. The larger per-fold standard deviation under the 5-fold protocol (
versus
) is a faithful reflection of inter-patient heterogeneity in the full cohort, not a degradation of the model: when every patient is allowed to enter the test set, including patients whose tissue morphology departs substantially from the rest of the cohort, the fold-level performance unavoidably spans a wider range. This is, in our view, the most honest reading of generalization variability that a cohort of 19 subjects can produce under leakage-free evaluation.
The competitive performance of the classical XGBoost baseline on fold 1, which contains the patient with the highest HPV-positive prevalence (), deserves explicit comment. The CNN attains an AUC of on this fold while XGBoost reaches . We interpret this gap as a consequence of the fact that a single outlier patient with an unusually high positive support can dominate the loss landscape of a high-capacity model and produce overfitting on patient-specific tissue cues, while a low-capacity classifier on engineered features is intrinsically more conservative under the same regime. Two practical mitigations suggest themselves: an ensemble that combines the deep representation with the handcrafted descriptor, and patient-stratified mini-batch sampling at training time. Both are identified as future directions.
Regarding the domain relationship between Pap-smear cytology and H&E histology, we note that, although the two modalities differ substantially in data appearance and structure (isolated cells on sparse background versus densely packed tissue architecture), the cytopathic hallmark of HPV infection (koilocytosis: perinuclear halos, irregular nuclear membranes, hyperchromasia) is morphologically conserved across the two modalities at the sub-cellular scale. The high validation AUC () achieved by the SIPaKMeD pretraining on its binary koilocytotic-versus-non-koilocytotic task indicates that this shared substrate is effectively acquired by the backbone, and the consistency of the downstream fine-tuned model under both the three-fold and the 5-fold patient-level protocols indicates that the substrate transfers effectively to the histological target. A quantitative feature-space alignment study (e.g., Centered Kernel Alignment between SIPaKMeD-domain and biopsy-domain activations) would provide additional insight into the structure of this transfer and is identified as a future direction.
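For reference, the linear variant of Centered Kernel Alignment mentioned above can be computed in a few lines. The sketch assumes two activation matrices with one row per input and one column per feature, extracted for the same inputs; the random matrices below are placeholders and do not represent activations of the models studied here.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_samples, n_features)."""
    x = x - x.mean(axis=0, keepdims=True)      # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


rng = np.random.default_rng(0)
a = rng.normal(size=(200, 16))
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
b = 3.0 * a @ q                                  # rotated and rescaled copy of a
c = rng.normal(size=(200, 16))                   # unrelated representation

print(linear_cka(a, a), linear_cka(a, b), linear_cka(a, c))
```

The index equals one for representations that differ only by an orthogonal transform and an isotropic rescaling, and it approaches zero for unrelated features, which makes it suitable for comparing layers across the cytology and histology domains.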
On the methodological choice between supervised domain-adaptive transfer learning and self-supervised pretraining directly on the biopsy data, three considerations motivated the present approach. First, contemporary contrastive self-supervised frameworks display diminishing returns below approximately
in-domain images per ResNet-class backbone [
11], while our cohort yields
patches but only 19 unique patients, so the effective diversity is markedly lower than the patch count suggests. Second, contrastive objectives treat random pairs of crops as negatives; with a global positive prevalence of
, the negative pool is overwhelmingly dominated by visually similar healthy-tissue patches, a regime in which the embedding space can collapse around features unrelated to HPV cytopathic effects. Third, the SIPaKMeD source provides supervised labels for the canonical cytopathic hallmark of HPV (koilocytes), so the supervised signal is directly aligned with the downstream task in a way that self-supervised pretraining cannot, by construction, replicate. Self-supervised pretraining on a substantially larger multi-center biopsy corpus remains an attractive complementary direction for the future.
Finally, the patient-level aggregation case study (
Section 3.10) confirms that simple rules can convert patch-level probabilities into clinically interpretable patient-level scores, but the absence of statistical power at the patient level (3–4 test patients per fold) reinforces the need for a formal patient-level evaluation through attention-based Multiple Instance Learning on a larger multi-center cohort, which we have already identified as the primary clinically oriented future direction.
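Simple aggregation rules of this kind can be sketched as follows. The three rules shown (mean probability, fraction of patches above a threshold, and mean of the top-k probabilities), together with the threshold, the value of k, and the toy probabilities, are hypothetical examples and are not claimed to be the rules evaluated in Section 3.10.

```python
def aggregate_patient(probs, threshold=0.5, top_k=10):
    """Summarize the patch-level probabilities of one patient with three simple rules."""
    ranked = sorted(probs, reverse=True)
    top = ranked[:top_k]
    return {
        "mean": sum(probs) / len(probs),
        "positive_fraction": sum(p >= threshold for p in probs) / len(probs),
        "top_k_mean": sum(top) / len(top),
    }


# Toy patient with four patches, two of which look HPV-positive.
summary = aggregate_patient([0.1, 0.2, 0.9, 0.8], threshold=0.5, top_k=2)
print(summary)
```

The top-k rule is the most sensitive to focal disease, because a few confident patches dominate the score even when most of the tissue is morphologically normal.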
5. Conclusions
This work has presented a fully automated, end-to-end deep learning pipeline for patch-level classification of HPV-associated lesions in H&E-stained Whole Slide Images, operating across the cytology–histology continuum of HPV-related disease. The pipeline combines a vendor-agnostic preprocessing chain (Otsu segmentation on the HSV saturation channel, Laplacian-variance quality control, and Macenko stain normalization) with a ResNet50 classifier subjected to domain-adaptive transfer learning from the SIPaKMeD cytology dataset. Class imbalance is addressed through Focal Loss and a weighted random sampler, and the model is evaluated under a strict three-fold Leave-One-Patient-Out protocol over 42 diagnostic specimens grouped into 19 unique anonymized patients and
patches. The SIPaKMeD-pretrained model attains a mean test AUC-ROC of
, improving over an otherwise-identical ImageNet-initialized baseline (
) by
in mean AUC and, crucially, halving the per-fold variance. These results are obtained under no-leakage conditions and constitute a reproducible, honest baseline for HPV histopathology in small-cohort settings. The modular design of the pipeline naturally accommodates future extensions, including multi-center validation, refined fine-tuning schedules, Test-Time Augmentation, Multiple Instance Learning for patient-level aggregation, and explainability integration. These findings align with broader efforts to integrate artificial intelligence within multi-layered oncological frameworks, where computational models are expected to capture both morphological and molecular complexity [
21].
Under the complementary 5-fold patient-level cross-validation that covers every subject of the cohort (
Section 2.4 and
Section 3.7), the SIPaKMeD-pretrained model attains a mean test AUC-ROC of
with a
patient-aware bootstrap confidence interval of
, in close agreement with the controlled three-fold ablation result and consistently above the ImageNet baseline mean of
. The classical machine-learning baselines evaluated under the same protocol (
Section 3.9) confirm that handcrafted-feature classifiers remain competitive in this small-cohort regime, while the deep representation retains the structural advantages that motivate its use as the primary modelling approach. The verified class distribution of the cohort (7181 HPV-positive patches out of
, prevalence
, ratio
) is the reference for all numerical claims of this work.