1. Introduction
Breast cancer is a significant global health issue, particularly in low- and middle-income countries where limited screening resources often result in late-stage diagnoses and increased mortality rates [1]. Early detection is crucial for improving patient outcomes, emphasizing the need for advanced diagnostic methods.
Histopathological examination remains the gold standard for breast cancer diagnosis, involving microscopic tissue analysis to identify malignant changes. However, manual evaluation is time-consuming, prone to inter-observer variability, and requires specialized expertise. The advent of artificial intelligence (AI) and machine learning (ML), particularly deep learning models such as convolutional neural networks (CNNs), has introduced new possibilities for automating cancer detection, enhancing efficiency, and improving diagnostic consistency [2].
Histopathological images contain intricate patterns at multiple scales, from cellular abnormalities to broad tissue structures. Hematoxylin and eosin (H&E) staining enhances visualization, yet variations in staining protocols and image quality pose challenges. AI models must address these inconsistencies through techniques such as stain normalization and domain adaptation to ensure robust and reliable performance across diverse datasets [3].
Key challenges in histopathological image analysis include high data dimensionality, class imbalances in datasets, and the need for expert-labeled annotations. Overcoming these limitations using methods like data augmentation, transfer learning, and synthetic sample generation is essential for developing accurate and generalizable AI models [4]. Additionally, the integration of AI into clinical workflows must consider factors such as interpretability, regulatory approval, and real-world applicability.
Critical diagnostic features—such as cellular pleomorphism, nuclear-cytoplasmic ratio, and tissue invasion—are essential for cancer classification and treatment planning. AI-driven tools can assist pathologists by enhancing diagnostic precision, reducing workload, and enabling personalized medicine. Leveraging AI for histopathological image analysis holds significant promise in improving breast cancer detection and patient outcomes worldwide.
Recent developments in vision-language models (VLMs), such as CLIP, ALIGN, and their biomedical adaptations, provide a promising pathway for overcoming these limitations. By aligning visual features from histopathological slides with textual information such as diagnostic labels, morphological descriptions, or clinical notes, VLMs offer dual advantages: (i) improved robustness through multimodal learning, and (ii) enhanced interpretability by grounding predictions in natural language explanations. These capabilities align well with clinical requirements, where transparent reasoning and trustworthiness are as important as raw predictive accuracy.
In this study, we build upon our prior deep learning pipeline—based on ResNet, AlexNet, and VGG—and extend it with a VLM-based framework. This hybrid approach seeks to capture fine-grained morphological features while leveraging text embeddings to guide classification, ultimately advancing automated breast cancer diagnosis.
2. Related Works
The application of AI in breast cancer diagnosis has gained considerable attention in recent years [5], driven by advances in computational power, large-scale datasets, and sophisticated deep learning architectures. Early studies focused on traditional machine learning techniques such as support vector machines (SVMs) and random forests, which relied on handcrafted features extracted from histopathological images [2]. While these approaches showed promise, they were limited by feature selection biases and their inability to capture complex patterns effectively.
One of the critical challenges in AI-driven histopathological analysis is dataset availability and variability. Publicly available datasets, such as the Breast Cancer Histopathological Database (BACH) and The Cancer Genome Atlas (TCGA), have facilitated model development and benchmarking [6]. However, differences in image acquisition protocols, staining methods, and patient demographics introduce domain shifts that can impact model generalizability. Recent research has focused on domain adaptation techniques, such as adversarial training and self-supervised learning, to mitigate these issues [7,8].
Another key aspect of AI in breast cancer histopathology is interpretability and explainability. Deep learning models often function as “black boxes,” making it challenging to gain insights into their decision-making processes. Explainable AI (XAI) techniques, including saliency maps, Grad-CAM, and SHAP values, have been explored to provide visual and quantitative interpretations of model predictions [9].
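As an illustration of one such XAI technique, Grad-CAM reduces to a channel-weighted sum of convolutional activation maps, where each channel weight is the spatial average of the class-score gradient. The sketch below (a hypothetical `grad_cam` helper operating on precomputed activations and gradients, not a complete pipeline) shows only this weighting step:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one conv layer's activations and the gradients
    of the target class score with respect to those activations.

    activations, gradients: arrays of shape (C, H, W).
    Returns a non-negative (H, W) map normalized to [0, 1]."""
    # Global-average-pool the gradients to get one importance weight per channel.
    weights = gradients.mean(axis=(1, 2))                      # shape (C,)
    # Weighted sum of the activation maps, then ReLU to keep positive evidence.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

In a real model the activations and gradients would come from a forward and backward pass through a chosen convolutional layer; here they are treated as given arrays.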
Despite these advancements, integrating AI into clinical workflows remains a challenge due to regulatory requirements, validation in real-world settings, and the need for extensive cross-institutional collaborations. Future research must address these challenges by focusing on model robustness, fairness, and ethical considerations in AI-driven cancer diagnostics.
Before deep learning became widely adopted, researchers relied on traditional machine learning techniques to analyze histopathological images. Support Vector Machines (SVMs), for example, were commonly used to classify tissue samples based on manually extracted features like texture and morphology. Naik et al. [10] explored the use of SVMs for segmenting and classifying prostate cancer histopathology images, demonstrating promising results.
Another widely used approach was Decision Trees, often combined with ensemble methods like Random Forests to improve accuracy. Doyle et al. [11] applied Decision Tree classifiers to distinguish between different breast tissue types, achieving reasonable accuracy despite the limitations of handcrafted feature selection. However, these methods required significant domain expertise and manual preprocessing, making them less scalable for large datasets.
Histopathology image classification presents major obstacles in the quest for precise breast cancer detection [12]. The development of trustworthy diagnostic tools is hampered by these interrelated problems. A significant challenge is the intricacy and diversity of tissue architecture in histopathological images. Since models have trouble generalizing across datasets, the variety of cell shapes, tissue types, and staining methods requires intensive model training and adaptation [13]. Particularly in densely populated areas, overlapping or occluded nuclei make segmentation even more difficult, increasing the likelihood of errors and emphasizing the need for more sophisticated algorithms.
High-quality annotated data is essential for classification models, but this is frequently scarce because data annotation is a labor-intensive process [14]. Additional difficulties are brought about by inconsistent annotation standards and intra- and inter-observer variability, highlighting the necessity of standardized annotation procedures and multi-expert consensus to improve data quality and model reliability. In clinical settings with limited computing capacity, processing high-resolution whole-slide images can be difficult due to the significant computational resources required [15].
2.1. Deep Learning
The introduction of deep learning, especially CNNs [16], transformed histopathological image analysis by eliminating the need for manual feature extraction. Instead of relying on predefined features, CNNs learn hierarchical patterns directly from raw images, allowing for greater accuracy and adaptability. One of the earlier studies in this domain by Spanhol et al. [9] used CNNs to classify breast cancer histopathological images, significantly outperforming traditional machine learning models.
The emergence of deep learning, particularly CNNs, has significantly advanced the field by enabling automated feature extraction directly from raw images. Studies have demonstrated that CNN-based models can achieve high classification accuracy in distinguishing benign and malignant tissue samples [17]. Additionally, approaches such as residual networks (ResNets), attention mechanisms, and transformer-based architectures have further improved performance by refining feature representation and reducing computational costs [9].
CNNs gained popularity due to their ability to capture both fine-grained textures and high-level structural patterns, which are critical for diagnosing diseases like cancer. Unlike earlier methods that required manual tuning of features, deep learning models can automatically extract the most relevant information from histopathological slides. Additionally, techniques like transfer learning and data augmentation have made CNNs more robust, even when dealing with limited labeled datasets. In 2022, Alotaibi et al. [18] proposed an ensemble model, ViT-DeiT, for classifying breast cancer in histopathological images. This model combines Vision Transformer (ViT) and Data-Efficient Image Transformer (DeiT) architectures using a soft voting approach, which averages class probabilities to make final predictions.
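The soft-voting step used by such ensembles is simple to state: per-class probabilities from each member model are averaged before taking the argmax. A minimal sketch of this averaging (a hypothetical `soft_vote` helper, not the cited authors' implementation):

```python
import numpy as np

def soft_vote(prob_sets):
    """Soft-voting ensemble: average class probabilities from several models.

    prob_sets: list of (n_samples, n_classes) probability arrays, one per model.
    Returns (predicted class per sample, averaged probabilities)."""
    avg = np.mean(prob_sets, axis=0)       # element-wise mean across models
    return avg.argmax(axis=1), avg
```

Hard (majority) voting would instead take the argmax per model first and tally votes; soft voting retains each model's confidence, which is why it is often preferred when calibrated probabilities are available.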
The study by Srikantamurthy et al. [3] explores the use of a hybrid deep learning model combining CNN and Long Short-Term Memory (LSTM) networks for the classification of benign and malignant breast cancer subtypes from histopathological images. The hybrid approach leverages CNNs for spatial feature extraction and LSTMs for capturing sequential patterns in the data. Pre-trained ImageNet weights are utilized for transfer learning, optimizing the model’s performance while minimizing the need for extensive manual feature engineering.
2.2. Vision-Language Models
VLMs have emerged as powerful multimodal architectures that jointly learn from image–text pairs to classify cancer pathology images effectively [19]. Ferber et al. [20] showed that GPT-4V pretrained on diverse multimodal data with a few-shot prompt outperforms deep learning models in medical image classification. OpenAI’s CLIP demonstrated that large-scale contrastive learning can align vision and language representations for zero-shot classification. In medical imaging, adaptations such as MedCLIP [21], BioViL [22], and PathCLIP [23] have shown promise in pathology and radiology by linking images with descriptive text or diagnostic labels. For histopathology, VLMs allow models not only to classify tissue slides but also to generate interpretable text-based rationales, bridging the gap between computational predictions and pathologists’ reasoning [24].
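The zero-shot mechanism underlying CLIP-style models can be sketched as cosine similarity between an image embedding and one text-prompt embedding per class (e.g. "an H&E image of benign tissue" vs. "an H&E image of malignant tissue"), followed by a temperature-scaled softmax. The numpy sketch below is illustrative only; real models obtain these embeddings from trained vision and text encoders:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """CLIP-style zero-shot classification.

    image_emb: (d,) image embedding; text_embs: (n_classes, d), one
    prompt embedding per class. Returns (predicted class, probabilities)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img / temperature       # cosine similarities, scaled
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(probs.argmax()), probs
```

The temperature value 0.07 mirrors the scale commonly used in contrastive pretraining; no training happens here, which is precisely what makes the classification "zero-shot".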
One challenge in histopathology is that a whole-slide image (WSI) is gigapixel-sized, making it difficult for AI models to consume directly. To overcome this issue, WSIs are cropped into patches that serve as model input; some CLIP-based VLMs are founded on this idea [25]. However, Shi et al. [26] show that it is possible to use WSIs as input for pretrained image and text encoders with a fine-tuned multiple instance learning (MIL) decoder. Moreover, Rahaman et al. [27] used a combination of scaled WSIs and patches for their model using an adaptation of the CLIP model. The work of Zanella et al. [28] introduces a transductive inference approach to enhance zero-shot histopathological image classification using VLMs, which works in the embedding space by looking for similarities across all patches in a WSI.
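The MIL decoders mentioned above must aggregate many patch embeddings into a single slide-level representation. A common choice is attention-based pooling, sketched below in numpy; the projection `v` and scoring vector `w` are hypothetical stand-ins for parameters that would be learned in practice:

```python
import numpy as np

def attention_mil_pool(patch_feats, v, w):
    """Attention-based MIL pooling sketch: score each patch embedding,
    softmax the scores, and return the attention-weighted slide embedding.

    patch_feats: (n_patches, d); v: (d, k) projection; w: (k,) scoring vector."""
    scores = np.tanh(patch_feats @ v) @ w          # one scalar per patch
    attn = np.exp(scores - scores.max())           # stable softmax over patches
    attn /= attn.sum()
    return attn @ patch_feats, attn
```

Because the pooling is permutation-invariant over patches, the slide-level label never has to be assigned to individual patches, which is what makes MIL attractive for weakly labeled WSIs.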
Recent studies indicate that VLM-enhanced models can outperform purely visual architectures in robustness and provide explainable outputs that facilitate clinical adoption.
4. Results and Discussion
This section presents a comprehensive evaluation of classical CNN architectures and vision-language models (VLMs) on the BreakHis dataset using five-fold cross-validation. We report accuracy, precision, recall, F1-score, and AUC across four magnification levels (40×, 100×, 200×, and 400×). The results without data augmentation are summarized in Table 4, while the augmented setting is reported in Table 5. Figure 5 shows per-magnification accuracy comparisons across all models; training and validation loss trajectories are shown in Figure 6 and Figure 7.
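For reference, the threshold-based metrics reported in these tables follow the standard confusion-matrix definitions. The plain numpy sketch below (a generic helper, not the evaluation code of this study) computes them for binary labels; AUC, which requires ranking predicted scores rather than hard labels, is omitted:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = malignant)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0   # fraction of predicted positives that are real
    rec = tp / (tp + fn) if tp + fn else 0.0    # fraction of real positives that are found
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```

Recall is the clinically critical quantity here, since a false negative corresponds to a missed malignancy.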
4.1. Effect of Evaluation Protocol on Performance Estimates
A key methodological consideration when interpreting the results is the choice of evaluation protocol. All results in this study are based on 5-fold patient-level cross-validation, which ensures that images from the same patient appear exclusively in either the training or validation fold. This protocol produces performance estimates that are averaged across multiple disjoint patient partitions and are therefore more representative of true generalization ability than a single fixed split. It is well established that a single train/test split can produce estimates that are sensitive to the particular patient assignment chosen, especially on datasets of moderate size such as BreakHis (∼7900 images from 82 patients). In particular, a favorable split—one in which easier or more representative patients happen to fall in the test set—can yield performance figures that are overly optimistic and do not generalize to other partitions of the same data. This is a plausible explanation for why performance estimates reported under a single fixed split in earlier versions of this work appeared higher than those obtained here under cross-validation. Cross-validation mitigates this sensitivity by averaging over multiple partitions, and the resulting estimates should therefore be considered more reliable and more conservative.
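The patient-level constraint can be illustrated with a minimal grouping routine: every image of a patient is assigned to the same fold, so no patient's tissue appears in both training and validation. This is a simplified sketch (a hypothetical `patient_level_folds` helper, not the exact fold-assignment code used in this study):

```python
from collections import defaultdict

def patient_level_folds(image_patients, n_folds=5):
    """Assign images to folds so that all images of a patient share a fold.

    image_patients: list of patient ids, one per image.
    Returns a fold index per image."""
    by_patient = defaultdict(list)
    for idx, pid in enumerate(image_patients):
        by_patient[pid].append(idx)
    folds = [0] * len(image_patients)
    fold_sizes = [0] * n_folds
    # Place larger patients first, always onto the currently smallest fold,
    # so fold sizes stay roughly balanced despite unequal images per patient.
    for pid, idxs in sorted(by_patient.items(), key=lambda kv: -len(kv[1])):
        f = fold_sizes.index(min(fold_sizes))
        for i in idxs:
            folds[i] = f
        fold_sizes[f] += len(idxs)
    return folds
```

Libraries such as scikit-learn provide the same guarantee via `GroupKFold` with the patient id as the group key.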
4.2. Overall Performance Comparison
Across all experimental settings, CNN-based architectures consistently outperform VLM-based models. In particular, ResNet34 achieves the highest overall performance at all magnifications, reaching an accuracy of 0.9879 ± 0.006 and AUC of 0.9984 ± 0.002 at 40×, with accuracy ranging from 0.9698 to 0.9824 and AUC remaining above 0.996 across all remaining magnifications. Among the other CNN baselines, AlexNet achieves accuracy in the range ≈0.95–0.96 across magnifications, while VGG19 exhibits somewhat higher variability, with accuracy dropping to 0.9369 at 400×.
These results indicate that residual architectures provide a superior balance between representational capacity and generalization, and that CNN-based models are highly robust to changes in magnification level. The consistently low variance of ResNet34 across folds further supports the reliability of this finding.
4.3. External Evaluation
No fine-tuning was performed on BACH. Models trained on BreakHis were evaluated directly on the full BACH set without any domain adaptation, providing a zero-shot cross-dataset generalization assessment. This setting deliberately introduces domain shift—arising from differences in staining protocols, imaging conditions, scanner hardware, and tissue composition between the two datasets—to test the robustness of each model family beyond the training distribution. Results are reported in Table 6.
4.4. Statistical Significance of Model Comparisons
To assess whether the observed performance differences are statistically meaningful, pairwise comparisons between models were conducted using the Wilcoxon signed-rank test applied across the five cross-validation fold scores. The resulting p-values for accuracy, F1-score, and AUC at the representative magnification are reported in Table 7. All pairwise differences are statistically significant (p < 0.05), confirming that the ranking of models reflects genuine performance differences rather than random fold-to-fold variation. Specifically, SmolVLM-FT significantly outperforms Qwen2-VL-2B-FT (p < 0.05 for accuracy), while ResNet34 remains significantly superior to both VLM approaches (p < 0.05 for all metrics), supporting the conclusions drawn from the mean performance values in Table 4 and Table 5.
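For intuition, the fold-level comparison can be reproduced with a small exact implementation of the Wilcoxon signed-rank test, suitable for the tiny samples that arise with five folds (a self-contained sketch; the analysis in this study may equally have used a standard statistics package such as scipy):

```python
from itertools import product

def wilcoxon_exact(scores_a, scores_b):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples,
    e.g. the five per-fold scores of two models. Returns (W+, p_value).
    Zero differences are dropped; ties in |d| receive average ranks."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    n = len(diffs)
    abs_sorted = sorted(abs(d) for d in diffs)
    ranks = []
    for d in diffs:
        lo = abs_sorted.index(abs(d)) + 1            # first rank of this magnitude
        hi = lo + abs_sorted.count(abs(d)) - 1       # last rank of this magnitude
        ranks.append((lo + hi) / 2)                  # average rank for ties
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Enumerate all 2^n sign assignments to build the exact null distribution.
    dist = [sum(r for r, s in zip(ranks, signs) if s)
            for signs in product([0, 1], repeat=n)]
    mean_w = sum(ranks) / 2
    extreme = sum(1 for w in dist if abs(w - mean_w) >= abs(w_plus - mean_w))
    return w_plus, extreme / len(dist)
```

Because the null distribution is enumerated exhaustively, the p-value is exact rather than asymptotic, which matters at sample sizes this small.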
4.5. VLM Performance
Under linear probing (LP), both Qwen2-VL-2B and SmolVLM achieve moderate accuracy despite receiving no domain-specific adaptation: Qwen2-VL-2B ranges from 0.8037 to 0.8581 across magnifications, while SmolVLM ranges from 0.8293 to 0.8884. LoRA fine-tuning substantially improves performance for both models. Qwen2-VL-2B increases its accuracy from 0.8037 (LP) to 0.8736 (FT) at 200×, with AUC rising from 0.8844 to 0.9552. SmolVLM-FT achieves the strongest VLM results overall, with accuracy of 0.9453 ± 0.010 and F1-score of 0.9572 ± 0.010 at 200×; recall remains consistently high across all magnifications (0.9495–0.9806). Despite these gains, no VLM reaches the performance of any CNN baseline at any magnification.
These results suggest that general-purpose vision-language representations retain partial transferability to histopathology images even without domain adaptation, and that LoRA fine-tuning meaningfully closes the gap. The sustained high recall of SmolVLM-FT is of particular clinical relevance, given the severity of false negative predictions in cancer screening. The persistent gap between VLMs and CNNs is discussed in the context of the asymmetric adaptation setting in Section 4.6 and in the Conclusion.
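The parameter efficiency of LoRA used here comes from replacing the full weight update with a low-rank product: the frozen weight W is augmented by (α/r)·BA, and only the small matrices A and B are trained. A minimal numpy sketch of the forward pass (illustrative shapes only; it does not reflect where the adapters sit inside the models' attention layers):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """LoRA forward pass sketch: frozen weight W plus low-rank update.

    x: (d_in,) input; W: (d_out, d_in) frozen weight;
    A: (r, d_in) and B: (d_out, r) are the only trainable matrices;
    alpha / r scales the update."""
    return (W + (alpha / r) * (B @ A)) @ x
```

With r much smaller than d_in and d_out, A and B together hold r·(d_in + d_out) parameters instead of d_out·d_in, which is how trainable parameters drop from billions to a few million.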
4.6. Effect of Data Augmentation
Contrary to common expectations, data augmentation consistently leads to a slight decrease in performance across most models. For example, ResNet34 accuracy at 40× decreases from 0.9879 to 0.9815, and similar trends are observed across all magnifications and models.
This degradation can be attributed to the sensitivity of histopathology images to the specific types of distortions introduced by the augmentation pipeline. Unlike natural images, diagnostic cues in H&E-stained slides are tightly coupled to two distinct signal types: (i) color and staining patterns, which encode biochemical tissue properties and are disrupted by brightness, contrast, and color jitter transforms; and (ii) microstructural spatial arrangements, such as glandular architecture and nuclear orientation, which are sensitive to geometric transforms including rotation and resized cropping. The applied pipeline combines both transform types, and the observed degradation likely reflects their joint disruption of these domain-specific cues.
It should be noted that no dedicated augmentation ablation study was conducted in this work, and the explanations that follow are therefore preliminary interpretations informed by the overall pattern of results rather than experimentally confirmed conclusions. The discussion below is intended to motivate hypotheses for future investigation rather than to establish causal mechanisms.
While a full factorial augmentation ablation is beyond the scope of this study, the pattern of results allows some inference about which transform type is more harmful. The performance drop is consistent across both CNNs and VLMs and is observed even at low magnifications (40×), where tissue architecture dominates and color cues are less discriminative. This suggests—tentatively, pending a controlled ablation—that geometric transforms, particularly the random resized crop with scale perturbation, may be a primary driver of degradation, as they alter the spatial scale and composition of tissue structures that models have learned to rely on. The color-based transforms (brightness and contrast jitter) are comparatively mild and are unlikely to account for the full effect on their own, though this inference remains unconfirmed without isolating each transform type.
The effect is more pronounced for VLMs, where SmolVLM-FT accuracy decreases from 0.9570 to 0.9345 at 200×. This amplified sensitivity is consistent with—though not directly demonstrated to be caused by—VLMs relying more heavily on global spatial context captured through patch-level attention, which may be more susceptible to scale and composition perturbations than the local convolutional features used by CNNs.
These findings motivate a dedicated ablation study separating color-based and geometric transforms, which we identify as an important direction for future work. Stain normalization techniques—which standardize color appearance without distorting tissue structure—may also prove more suitable than generic color jitter for histopathology augmentation, and should be explored in subsequent studies.
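As a concrete example of the stain-normalization direction, a Reinhard-style statistics-matching step shifts and scales each color channel so a slide's statistics match a reference slide, standardizing appearance without geometric distortion. The sketch below is simplified to RGB for illustration (the original Reinhard method operates in Lab color space, and dedicated stain methods such as Macenko separate the H and E stain vectors):

```python
import numpy as np

def reinhard_normalize(image, target_mean, target_std):
    """Match each channel's mean and std to a target slide's statistics.

    image: (H, W, 3) float array in [0, 1];
    target_mean, target_std: (3,) per-channel reference statistics."""
    mean = image.mean(axis=(0, 1))
    std = image.std(axis=(0, 1)) + 1e-8     # avoid division by zero
    out = (image - mean) / std * target_std + target_mean
    return np.clip(out, 0.0, 1.0)
```

Unlike random color jitter, this transform is deterministic given the reference statistics, so it removes inter-slide staining variation rather than adding synthetic variation on top of it.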
4.7. Effect of Magnification
CNN-based models remain relatively stable across magnification levels, with ResNet34 maintaining accuracy above 0.969 at all four settings. VLM performance is more sensitive to magnification, with both Qwen2-VL-2B and SmolVLM showing noticeable accuracy decreases at 400× compared to lower magnifications.
This differential sensitivity likely reflects differences in feature extraction mechanisms. CNNs rely on local convolutional filters that support effective multi-scale representation learning, whereas VLMs use global attention-based vision encoders that may be less well-suited to the highly repetitive fine-grained textures characteristic of higher-magnification histopathology images.
4.8. Cross-Dataset Generalization (BreakHis → BACH)
Models trained on BreakHis were directly tested on the BACH dataset without any fine-tuning. Both models show a substantial drop in performance relative to their BreakHis cross-validation results. ResNet34 achieves an accuracy of 0.7825 and AUC of 0.7824 on BACH, while SmolVLM achieves an accuracy of 0.7215 and AUC of 0.7498. ResNet34 outperforms SmolVLM across all reported metrics. SmolVLM exhibits relatively higher recall than precision on BACH (0.7591 vs. 0.7327), a pattern not observed to the same degree on BreakHis.
These results confirm the presence of domain shift between the two datasets, most plausibly attributable to differences in staining protocols, imaging hardware, and tissue composition. The larger performance drop observed for SmolVLM suggests that CNN-based models generalize more robustly under distribution shift in this setting. The asymmetry between SmolVLM’s recall and precision on BACH suggests a tendency to over-predict the malignant class when the input distribution departs from the training domain, which would reduce overall classification reliability despite maintaining sensitivity.
4.9. Limitations and Future Directions
Three central constraints bound the conclusions of this study and should be kept in mind when interpreting all results.
First, the VLM vision encoders were kept fully frozen throughout LoRA fine-tuning. Only the language decoder was adapted; the visual feature representations therefore remain those learned from natural image–text pretraining and have not been adjusted to the histopathology domain. This is likely a meaningful source of the remaining gap between VLM and CNN performance, and means that the results reported here constitute a lower bound on what LoRA-adapted VLMs can achieve. Adapting the vision encoder—even partially—is an important direction for future work and may substantially change the comparative picture.
Second, the VLMs were evaluated exclusively in image-only mode. No textual input of any kind was provided during training or inference. As a result, the multimodal capabilities of these models—their ability to incorporate diagnostic descriptions, magnification metadata, or clinical context—were not exercised in this study. The performance figures reported for VLMs therefore reflect only a constrained subset of their potential, and should not be taken as indicative of what these models can achieve in a fully multimodal setting.
Third, all primary results are derived from a single dataset (BreakHis), which originates from one institution. The cross-dataset evaluation on BACH demonstrates that a notable performance drop occurs under distribution shift, limiting the generalizability of the conclusions. Validation on additional multi-institution datasets such as TCGA or CAMELYON is needed before broader claims about clinical applicability can be made.
Beyond these central constraints, two further limitations should be noted. The comparison is intentionally asymmetric in terms of trainable parameters: CNNs are fully fine-tuned (21.8M–143.7M parameters) while VLMs use LoRA (∼5–8M parameters), so the results reflect fully fine-tuned CNNs versus partially adapted VLMs rather than a parameter-matched evaluation. Additionally, the LoRA hyperparameters (rank r, scaling factor α, and dropout) were fixed for all experiments; a systematic ablation over these choices, and over the selection of which attention layers to adapt, may yield further VLM performance improvements.
5. Conclusions
This study presented a systematic and controlled comparison of classical convolutional neural network architectures and LoRA fine-tuned vision-language models for binary breast cancer classification on the BreakHis histopathology dataset. Experiments were conducted across four magnification levels (40×, 100×, 200×, 400×). It is important to note that the comparison is intentionally asymmetric: CNN baselines were fully fine-tuned (21.8M–143.7M trainable parameters), while VLMs were adapted via LoRA with a frozen vision encoder (∼5–8M trainable parameters). The conclusions below should therefore be interpreted as reflecting the performance of fully fine-tuned CNNs against partially adapted VLMs under practical resource constraints, rather than a parameter-matched comparison.
5.1. CNN Baselines
Among the CNN architectures evaluated, ResNet34 emerged as the strongest performer, achieving the highest overall results across all magnification levels. Its best performance was observed at 40× magnification, where it reached an accuracy of 0.9879 ± 0.006, precision of 0.9860 ± 0.007, recall of 0.9935 ± 0.005, F1-score of 0.9897 ± 0.006, and AUC of 0.9984 ± 0.002. Performance remained consistently strong across the remaining magnifications, with accuracy ranging from 0.9698 to 0.9824 and AUC values remaining above 0.996. These results confirm that residual architectures, even at moderate depth (34 layers), are highly effective for histopathology image classification when fine-tuned end-to-end from ImageNet-pretrained weights. Importantly, ResNet34 achieved this performance with only 21.8M parameters—fewer than either AlexNet (61.1M) or VGG19 (143.7M)—demonstrating that architectural design choices such as residual connections and batch normalization are more impactful than raw model size in this domain. AlexNet and VGG19 also performed competitively, with accuracy consistently above 0.93 and reaching 0.9543 and 0.9684 at their best settings, respectively, confirming the robustness of CNN-based approaches for histopathology classification even without domain-specific pretraining or data augmentation.
5.2. Vision-Language Models
For VLMs, we evaluated Qwen2-VL-2B-Instruct and SmolVLM under both Linear Probing (LP) and LoRA Fine-Tuning (FT) modes. Under Linear Probing, both models achieved moderate accuracy, with Qwen2-VL-2B ranging from 0.8037 to 0.8581 and SmolVLM ranging from 0.8293 to 0.8884, indicating that general-purpose vision-language representations retain partial transferability to microscopy images despite the significant domain gap between natural image–text pretraining data and H&E-stained histopathology slides.
LoRA fine-tuning substantially improved performance for both VLMs, validating the effectiveness of parameter-efficient adaptation for medical imaging tasks. SmolVLM fine-tuned with LoRA achieved the best VLM results overall, with its strongest performance at 200× magnification: accuracy of 0.9453 ± 0.010, precision of 0.9505 ± 0.010, recall of 0.9640 ± 0.009, F1-score of 0.9572 ± 0.010, and AUC of 0.9789 ± 0.005. At 40×, it achieved an accuracy of 0.9408 ± 0.010, while recall remained consistently high across all magnifications (0.9495–0.9682), which is of particular clinical relevance given the severe consequences of false negative predictions in cancer screening. It should be noted that these results were achieved with a frozen vision encoder and only ∼5–8M trainable parameters, compared to 21.8M–143.7M for the fully fine-tuned CNN baselines. The observed performance gap between CNNs and VLMs therefore reflects this asymmetry in adaptation depth, and should not be interpreted as evidence of general CNN superiority over VLMs. Deeper VLM adaptation—including vision encoder fine-tuning—may substantially reduce or eliminate this gap.
Qwen2-VL-2B, while benefiting from LoRA fine-tuning, remained the weakest model overall, with accuracy ranging from 0.8590 to 0.8736, suggesting that its instruction-following architecture and 2B-parameter scale may be suboptimal for purely visual discriminative tasks. The repurposing of its autoregressive language decoder for binary classification, rather than text generation, may limit the utility of its pretrained representations in this context.
5.3. Statistical Analysis
Statistical significance analysis was performed using the Wilcoxon signed-rank test applied to the five per-fold scores obtained from cross-validation for each model pair. The p-values reported in Table 7 indicate the probability, under the null hypothesis of no difference, of observing a difference at least as large as the one measured; values below 0.05 are taken as evidence that the performance gap between two models is unlikely to be due to chance variation across folds. As shown in Table 7, SmolVLM-FT significantly outperforms Qwen2-VL-2B-FT across accuracy, F1-score, and AUC at the representative magnification (p < 0.05). CNN-based models, particularly ResNet34, are significantly superior to both VLM approaches (p < 0.05 for all metrics), confirming that the observed performance differences reflect genuine model characteristics rather than fold-level variability. These results confirm that while LoRA-adapted VLMs meaningfully narrow the performance gap, CNN architectures still achieve the highest statistically verified accuracy and robustness.
5.4. Computational Efficiency
A key practical contribution of this work is the demonstration that large-scale VLMs can be adapted for histopathology classification on a single consumer GPU (NVIDIA RTX 4080, 16 GB VRAM) through LoRA fine-tuning. By reducing the number of trainable parameters from billions to under 10M, LoRA makes VLM fine-tuning accessible without high-performance computing infrastructure. SmolVLM required gradient accumulation to manage memory constraints, but achieved an effective batch size of 64 through 8 accumulation steps, with no degradation in final model quality relative to native batch training.
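The equivalence between accumulated micro-batch gradients and a single large-batch gradient can be verified on a toy least-squares model; this is an illustrative sketch (with a hypothetical `accumulated_gradient` helper) rather than the training loop used in this study:

```python
import numpy as np

def accumulated_gradient(xs, ys, w, micro_batch=8, steps=8):
    """Gradient accumulation for least-squares loss mean((x @ w - y)^2).

    Gradients of `steps` micro-batches are summed, each weighted by its
    share of the effective batch (here 8 x 8 = 64 samples), reproducing
    the gradient of one full large-batch step."""
    grad = np.zeros_like(w)
    for s in range(steps):
        xb = xs[s * micro_batch:(s + 1) * micro_batch]
        yb = ys[s * micro_batch:(s + 1) * micro_batch]
        # Contribution of this micro-batch to the full-batch MSE gradient.
        grad += 2 * xb.T @ (xb @ w - yb) / len(xs)
    return grad
```

Because the loss gradient is a sum over samples, deferring the optimizer step until all micro-batches are processed trades extra forward/backward passes for an 8× reduction in peak activation memory, with no change to the computed update.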
5.5. Limitations
Three constraints are central to interpreting the findings of this study. First, the vision encoder was kept fully frozen throughout LoRA fine-tuning, meaning that VLM visual representations were never adapted to the histopathology domain; the reported VLM results therefore constitute a lower bound on achievable performance, and conclusions about VLMs relative to CNNs should be read accordingly. Second, VLMs were evaluated in image-only mode throughout; their capacity for text-guided reasoning, incorporation of clinical context, and generation of diagnostic rationale was not tested, and the reported figures do not reflect the models’ multimodal potential. Third, primary results are drawn from a single-institution dataset (BreakHis); the observed performance drop on BACH confirms the presence of domain shift and limits the generalizability of the conclusions beyond this experimental setting.
Two additional limitations apply. The comparison is intentionally asymmetric in trainable parameters (fully fine-tuned CNNs at 21.8M–143.7M versus LoRA-adapted VLMs at ∼5–8M with a frozen encoder), so results should not be interpreted as a parameter-matched evaluation. Data augmentation also led to a modest performance decrease across all models, suggesting that domain-aware strategies such as stain normalization may be more appropriate for H&E histopathology than the generic transforms applied here.
5.6. Future Directions
Based on these findings, we identify the following directions for future research:
Domain-specific pretraining. Pretraining VLMs on large-scale histopathology image–text datasets (e.g., paired pathology reports and whole-slide images) prior to task-specific fine-tuning is expected to substantially reduce the domain gap and improve classification performance.
Multimodal classification. Future work should investigate whether providing textual context—such as magnification level, stain type, morphological descriptions, or clinical notes—alongside the image input improves VLM classification accuracy, leveraging the native text–image alignment capabilities of these models.
LoRA hyperparameter ablation. A systematic study of LoRA rank r, scaling factor α, dropout rate, and the selection of which attention layers to adapt may yield further performance improvements, particularly for Qwen2-VL-2B, which showed the largest gap relative to CNN baselines.
Multi-scale and magnification-aware architectures. Designing input pipelines that present images at multiple magnifications simultaneously, or that condition the model on magnification metadata, may improve robustness at higher magnifications where both CNN and VLM performance was more variable.
Multi-institution validation. Evaluating the proposed models on additional publicly available datasets such as BACH, TCGA, or CAMELYON would strengthen the generalizability claims and better reflect real-world deployment conditions.
Whole-slide image integration. Extending the current patch-level classification framework to whole-slide image (WSI) analysis via multiple instance learning (MIL) would bring the approach closer to clinical workflows, where pathologists examine entire slides rather than individual crops.
In summary, this work establishes a controlled baseline for comparing CNN and VLM approaches in breast cancer histopathology classification under a specific and constrained experimental setting: image-only input, a frozen VLM vision encoder, and a single-institution dataset. Within this scope, LoRA-based fine-tuning is demonstrated to be a viable and computationally efficient strategy for partially adapting large vision-language models to medical imaging tasks. While CNNs currently achieve higher performance under these conditions, the results suggest that VLMs may offer complementary strengths—particularly if the vision encoder is adapted and multimodal inputs are incorporated—and motivate further investigation into their role in computational pathology workflows.