1. Introduction
Cancer continues to be one of the leading causes of death worldwide [1,2,3], despite significant achievements in medical imaging technology and advanced therapies. Recent global statistics indicate that cancer caused approximately 10 million deaths in 2020 [2], and nearly 20 million new cases were reported in 2022 [4]. Medical imaging plays an important role in all stages of cancer care, including diagnosis, staging, treatment planning, and assessment of therapeutic response [5,6]. In modern medicine, imaging modalities such as computed tomography (CT), magnetic resonance imaging (MRI), ultrasound (US), positron emission tomography (PET), and histopathology images are widely used by oncologists to diagnose cancer [7,8]. Although techniques such as CT, MRI, PET, ultrasound, and mammography aid in cancer detection, histopathology images remain the gold standard for confirming malignancy [9,10,11]. Histopathology images, especially those stained with hematoxylin and eosin (H&E), play an essential role in cancer diagnosis, grading, and prognosis [12,13,14]. Tissue processing for histopathology preserves tissue architecture, which assists in identifying disease characteristics and provides essential insights into tumor morphology, structural abnormalities, lymphocytic infiltration, and biological behavior [12,15]. Recent technological advances have improved the quality of histopathology images, contributing to fewer disease complications and better patient outcomes. However, manual interpretation of histopathology images is prone to inter-observer variability, requires specialized expertise, and is time-consuming [14,16].
As technology has evolved, histopathological analysis has been digitized through whole-slide imaging (WSI) [9]. Advances in digital pathology have enabled large-scale histopathology image analysis and the development of computational tools, improving the efficiency of clinical workflows [17,18]. With the advancement of artificial intelligence, deep learning has contributed to digital pathology by offering the potential to improve the accuracy of cancer classification and detection. Recently, many studies have demonstrated the effectiveness and power of deep learning, particularly convolutional neural networks (CNNs), in detecting various diseases by analyzing histopathology images, including lung cancer classification [19,20], breast cancer classification [21,22], and brain tumor classification [23,24]. CNNs overcome the limitations of manual feature engineering through their capability to automatically extract hierarchical feature representations [15,25]. Both qualitative information (e.g., architectural patterns such as gland formation and stromal versus epithelial layout) and quantitative information (e.g., cell, nucleus, and mitotic counts) can be extracted from histopathology images and are essential for accurate diagnosis and monitoring of disease progression [13,14].
Despite notable advances in deep learning for histopathology image analysis, achieving high accuracy remains a key challenge for deep learning models in computational pathology. This is attributed to the heterogeneous and complex nature of cancerous tissues, the lack of large labeled datasets, and variations in image quality caused by differences in staining protocols and magnification levels [26]. Multi-pretrained deep learning models with fusion strategies have emerged as a potential solution to improve the accuracy of histopathology image classification with limited datasets. This study aims to evaluate the effectiveness of multi-pretrained deep learning models using different fusion strategies for the classification of breast cancer histopathology images and renal clear cell carcinoma (RCCC) grading.
This study uses the proposed model to systematically evaluate practical fusion design choices when integrating multiple pretrained deep learning models under single-modality constraints. The findings offer actionable guidance for selecting optimal fusion strategies in histopathology image analysis, especially in settings where multi-source data are unavailable. By systematically analyzing fusion approaches, this work aims to advance more accurate deep learning frameworks for cancer histopathology analysis. The main contributions of this study are: (1) to present a controlled comparison of early, intermediate, and late fusion while keeping the pretrained backbones and the training protocol fixed, and (2) to evaluate the fusion strategies consistently across three public histopathology classification tasks, spanning binary and multiclass settings.
2. Related Work
Recently, technological advancements have transformed the workflow of histopathology image analysis from traditional practice (manual review of glass slides) to digital workflows supported by whole-slide imaging (WSI) and deep learning [27,28]. This transformation has improved cancer diagnosis, enabling gigapixel-scale image computation, accurate classification, and automated tumor grading [29]. Deep learning offers great benefits to pathologists and patients by enhancing workflow efficiency, accelerating interpretation, and making diagnosis more reliable [28,30,31]. Many recent studies have highlighted the applications of deep learning, including cancer diagnosis, tumor grading, molecular prediction, and automation of pathological workflows. For instance, a recent study by Ma et al. found that most deep learning studies in histopathology focused on diagnosis (30.9%) and detection (24.2%) tasks and achieved high performance (AUC ≈ 96%), affirming that deep learning has become central to cancer diagnostics [32]. Moreover, deep learning can extract features from histopathology images that help infer molecular subtypes, mutations, and treatment response [33]. Furthermore, deep learning models have achieved high accuracy in the automatic grading of meningiomas [34], in predicting breast cancer [35], and in predicting genetic mutations and protein biomarkers from histopathology images for cancer types such as colorectal carcinoma and melanoma [36]. Although deep learning has achieved remarkable advancements in computational pathology, several challenges still limit its implementation in clinical practice, such as the complexity of whole-slide imaging, the lack of diverse and well-annotated datasets, model interpretability issues, technical variations across clinics, and ethical and regulatory factors [37,38,39,40]. These limitations hinder the effectiveness of unimodal or single-model approaches, which struggle to capture the rich and diverse features necessary for high classification accuracy. Recent studies [41,42,43,44,45,46,47] have shown that multimodal approaches using fusion strategies offer promising opportunities to enhance classification accuracy. By leveraging the strengths of multiple pretrained deep learning models and extracting richer features, these approaches have been effective in handling the complexity of histopathology images [48,49]. However, a multimodal approach requires an effective fusion strategy to combine the features extracted by different pretrained models to achieve better performance.
In biomedical research, multimodal fusion strategies are widely applied to exploit complementary information across modalities such as genomics, clinical data, and medical imaging [50,51]. By applying data fusion strategies, multimodal learning aims to combine the extracted features into a unified space, which can enhance predictive accuracy and thus advance precision oncology through improved diagnosis and treatment selection [52,53]. The predominant fusion strategies in multimodal deep learning are generally classified into four broad categories: early fusion (input/feature level), intermediate fusion (feature/joint level), late fusion (decision level), and hybrid fusion (a combination of multiple strategies) [52,54]. The early fusion strategy combines low-level features from multiple modalities at the initial stage, before feeding them into the deep learning model for further training [52,55]. It is straightforward and efficient, as it involves the concatenation of low-level features [44]; however, it is susceptible to noisy features that can affect the outcomes [55]. Intermediate fusion, also known as joint fusion, combines feature vectors extracted by separate modality-specific branches at mid-network layers [53,55,56]. This joint representation allows the model to capture complex, non-linear intermodal relationships [56] and helps preserve fine-grained morphological details, which are valuable in histopathology tasks [53]. The late fusion strategy averages the predictions (probabilities) of separate models [52,53,55]. Unlike the previously mentioned strategies, late fusion is limited in capturing fine-grained features, as it only aggregates the final outputs of each modality [53]. Hybrid fusion is less common; it integrates different fusion strategies, for example, combining features from intermediate fusion with the decisions of late fusion [53]. In histopathology tasks, hybrid fusion can be used to combine high-dimensional image features with low-dimensional features from other clinical sources [57].
Figure 1 presents an overview of these common fusion strategies.
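To make these design choices concrete, the following minimal PyTorch sketch illustrates the three non-hybrid strategies on a single image modality. It is an illustrative sketch, not the implementation used in this study: the two ImageNet-pretrained torchvision backbones (ResNet50 and DenseNet121), the binary class count, the layer sizes, and the reading of early fusion as input-level concatenation of two views of the same image are all assumptions made for demonstration.

```python
# Minimal sketch of early, intermediate, and late fusion on a single
# image modality. Backbones, class count, and layer sizes are illustrative.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 2  # assumed binary task

def backbone(name):
    """Return an ImageNet-pretrained feature extractor and its feature dim."""
    if name == "resnet50":
        m = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        dim = m.fc.in_features
        m.fc = nn.Identity()            # strip the ImageNet classifier head
    else:
        m = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
        dim = m.classifier.in_features
        m.classifier = nn.Identity()
    return m, dim

class EarlyFusion(nn.Module):
    """Early (input-level) fusion: concatenate two low-level views of the
    same image (e.g., differently stain-normalized versions) before one net."""
    def __init__(self):
        super().__init__()
        self.net, dim = backbone("resnet50")
        # Widen the stem to accept 6 channels (two stacked RGB views);
        # this replacement layer is freshly initialized.
        self.net.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        self.head = nn.Linear(dim, NUM_CLASSES)
    def forward(self, view_a, view_b):
        x = torch.cat([view_a, view_b], dim=1)   # input-level concatenation
        return self.head(self.net(x))

class IntermediateFusion(nn.Module):
    """Intermediate (joint) fusion: concatenate penultimate features from
    two separate backbones, then learn a joint classifier."""
    def __init__(self):
        super().__init__()
        self.a, da = backbone("resnet50")
        self.b, db = backbone("densenet121")
        self.head = nn.Sequential(nn.Linear(da + db, 512), nn.ReLU(),
                                  nn.Linear(512, NUM_CLASSES))
    def forward(self, x):
        feats = torch.cat([self.a(x), self.b(x)], dim=1)  # feature-level
        return self.head(feats)

class LateFusion(nn.Module):
    """Late (decision-level) fusion: average class probabilities of
    independently trained branches."""
    def __init__(self):
        super().__init__()
        self.a, da = backbone("resnet50")
        self.b, db = backbone("densenet121")
        self.head_a = nn.Linear(da, NUM_CLASSES)
        self.head_b = nn.Linear(db, NUM_CLASSES)
    def forward(self, x):
        pa = torch.softmax(self.head_a(self.a(x)), dim=1)
        pb = torch.softmax(self.head_b(self.b(x)), dim=1)
        return (pa + pb) / 2                     # decision-level averaging
```

A hybrid variant would combine these mechanisms, for example feeding the joint representation of `IntermediateFusion` into a classifier whose probabilities are then averaged with the per-branch predictions as in `LateFusion`.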
In 2020, Huang et al. applied different fusion strategies (early, intermediate, and late fusion) to evaluate multimodal deep learning models for automated pulmonary embolism classification by combining different sources of information, namely computed tomography pulmonary angiography (CTPA) imaging and patients' electronic medical records (EMR) [54]. The findings showed that late fusion demonstrated the best performance (AUROC = 0.947), outperforming the single modalities. In 2022, Steyaert et al. developed and compared multimodal deep learning models that combine histopathology whole-slide images and gene expression data to improve survival prediction in adult and pediatric brain tumors. ResNet50 was used to extract features from the histopathology images, and a multilayer perceptron was used to extract features from the gene expression data. The study evaluated early fusion, late fusion, and intermediate (joint) fusion. The results showed that multimodal fusion models outperformed single modalities in survival prediction, and that early fusion performed best, yielding the highest test C-index (0.836) among the fusion strategies [58].

In 2023, Cahan et al. developed and compared multimodal deep learning fusion models by fusing two sources of data, CT pulmonary angiography (CTPA) imaging and electronic health record (EHR) tabular data, for pulmonary embolism 30-day mortality prediction. The study evaluated early, intermediate, and late fusion strategies. The findings demonstrated that intermediate fusion achieved the best performance, with AUC = 0.96, sensitivity = 90%, and specificity = 94%, outperforming the imaging-only and EHR-only models [59]. In the same year, Kumar et al. developed a deep learning multimodal fusion model for lung disease classification by integrating chest X-ray images with clinical laboratory data. Different pretrained CNN backbones, including DenseNet121, DenseNet169, and ResNet50, were used to extract features from the X-ray images, while LSTM and self-attention networks were used to model the clinical data. The results showed that intermediate fusion outperformed late fusion and single modalities, achieving a higher F1-score of 94.75% [48]. Also in 2023, Zheng et al. applied transfer learning and ensemble learning to improve the classification of histopathology images. A late fusion strategy was employed to combine predictions from multiple pretrained CNN models, including VGG16, ResNet50, InceptionV3, and DenseNet201. The findings revealed that the ensemble model outperformed single pretrained models, achieving an accuracy of ≈98% and an F1-score > 0.97, indicating the effectiveness of transfer learning combined with a late fusion strategy [60].

In 2024, Chakravarthy et al. proposed a hybrid fusion strategy (intermediate + late fusion) to combine deep features from multiple CNN models (VGG16, VGG19, ResNet50, and DenseNet121) to enhance multiclass classification of breast cancer on mammography datasets. The extracted features were concatenated and passed through a fully connected classifier. The results revealed that the hybrid model achieved the highest accuracy, reaching 98.83%, outperforming late fusion and single models [61]. In 2025, J. Li developed a hybrid fusion framework to improve the early diagnosis of oral squamous cell carcinoma (OSCC) from histopathology images. The model combined deep features from a cross-attention vision transformer (CrossViT) with handcrafted texture and color features (LBP, GLCM, FCH) through hybrid feature-level fusion and classified them with an artificial neural network (ANN). The hybrid fusion model was compared with pretrained models, including ResNet50/101, VGG16/19, EfficientNetB0/B7, ViT, and CrossViT. The results showed that the proposed hybrid model achieved the highest performance, with accuracy = 99.36%/99.59%, AUC ≈ 0.999, and sensitivity and specificity ≥ 99%, outperforming both the pretrained models and ViT alone [62]. In 2025, Das et al. implemented an ensemble learning framework using multiple pretrained deep learning models to improve the accuracy of diagnosing invasive ductal carcinoma (IDC) from histopathology images. Breast histopathology images at three magnification levels (100×, 200×, and 400×) were used. Pretrained models, including ResNet50, Xception, MobileNetV2, VGG16, and VGG19, were used to extract features, and a late fusion strategy was implemented through average and weighted ensembling of the pretrained models. The results indicated that the weighted average ensemble (combining ResNet50, VGG16, and VGG19) achieved the best performance across magnification levels, with accuracy = 97.27%, AUC = 0.975, and F1-score ≈ 0.97, while ResNet50 was the best standalone model, achieving approximately 98% accuracy on 100× images; overall, the ensemble model generalized better [41]. In 2025, Asif et al. applied an intermediate fusion strategy through hierarchical concatenation to combine multi-scale deep features extracted from MobileNet (pretrained on ImageNet) and a custom multi-scale depthwise dilated residual network (MSDDR-Net). Three public brain MRI datasets (Brain Tumor MRI Dataset, Br35H, and Figshare) were used to assess the proposed model (HMDFF-Net). The results showed that HMDFF-Net achieved 99.31% accuracy, 99.28% precision, 99.25% recall, and a 99.26% F1-score on the primary multiclass MRI dataset, outperforming DenseNet, ResNet, and transformer-based models [63]. In 2025, Punarselvam et al. conducted a comprehensive evaluation of deep learning models enhanced with feature fusion strategies to improve the classification accuracy of cervical cancer using cytology and histopathology images. Different fusion strategies were evaluated, including early, intermediate, late, and hybrid (CNN + Transformer) fusion, using pretrained models including ResNet50, DenseNet121, EfficientNet-B4, and a hybrid CNN-Transformer. The findings showed the superiority of hybrid fusion, achieving accuracy = 96.2%, precision = 96.0%, recall = 96.5%, F1-score = 96.2%, and AUC = 0.981 [64]. Also in 2025, Sahu proposed a deep learning framework to improve brain tumor classification accuracy from MRI scans, based on hybrid fusion (intermediate + late fusion) of features from different pretrained models, including VGG16, InceptionV3, DenseNet121, Xception, and InceptionResNetV2, with a majority voting strategy used in the hybrid fusion. The proposed ensemble model achieved the highest performance, reaching 99.46% accuracy, 0.995 precision, 0.995 recall, and a 0.992 F1-score, alongside an AUC of 0.995 [65].
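Several of these studies rest on simple decision-level aggregation. As a concrete illustration, the short sketch below combines per-model class-probability arrays by weighted averaging (as in the Das et al. ensemble) and by majority voting (as in Sahu's framework); the model names, probabilities, and weights here are invented for demonstration only, not values from the cited studies.

```python
# Decision-level (late-fusion) ensembling over per-model class probabilities.
# The probs_* arrays are hypothetical (n_samples, n_classes) softmax outputs.
import numpy as np

probs_resnet = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
probs_vgg16  = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]])
probs_vgg19  = np.array([[0.7, 0.3], [0.5, 0.5], [0.1, 0.9]])

stack = np.stack([probs_resnet, probs_vgg16, probs_vgg19])  # (3, n, c)

# Weighted average ensemble: weights are illustrative (in practice they are
# often derived from validation accuracy); predict the argmax of the fused
# probabilities.
w = np.array([0.5, 0.3, 0.2])[:, None, None]
weighted_pred = (w * stack).sum(axis=0).argmax(axis=1)

# Majority voting: each model votes with its own argmax class, and the most
# frequent class per sample wins.
votes = stack.argmax(axis=2)                                 # (3, n)
majority_pred = np.apply_along_axis(
    lambda v: np.bincount(v, minlength=2).argmax(), 0, votes)

print(weighted_pred, majority_pred)
```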
Table 1 presents a comprehensive comparative overview of thirteen state-of-the-art studies, conducted between 2020 and 2025, focusing on fusion strategies in multimodal deep learning for medical image analysis across different diseases.
As illustrated in Table 1, the studies highlight the evolution of fusion strategies in deep learning for medical image analysis and demonstrate how the fusion stage and integration method affect model performance and computational efficiency. Earlier multimodal studies relied on simple fusion approaches (late fusion), such as Huang et al. (2020) [54], which combined the final predictions of models trained on different data sources (imaging and EMR); late fusion achieved strong discriminative power in the classification task (AUROC = 0.947) with minimal architectural complexity. In 2022, Steyaert et al. extended fusion to the early stage, combining WSI and RNA-seq data for brain tumor prognosis and showing that integrating modalities early achieved higher C-index values (0.836–0.919) than single-modal models, indicating the power of integrating morphological and molecular features in disease prediction. By 2023, studies such as Kumar et al. (2023) [48] and Cahan et al. (2023) [59] implemented intermediate fusion to emphasize representation learning across modalities, combining chest X-ray or 3D CT imaging with EMR data through latent concatenation and attention mechanisms; intermediate fusion outperformed image-only and tabular-only models, achieving AUC values of around 93–96%. In the same year, Zheng et al. (2023) [60] utilized multiple pretrained CNNs, including VGG16, VGG19, InceptionV3, Xception, ResNet50, and DenseNet201, to extract features from a single modality of breast cancer histopathology images, employing an ensemble approach to combine the final decisions of each pretrained model. That study opened the door to applying fusion strategies with multiple pretrained models on a single data modality, instead of combining different data modalities such as medical imaging and tabular clinical data. In 2024–2025, the research focus shifted to hybrid fusion, which integrates early with late fusion or intermediate with late fusion. Chakravarthy et al. (2024) [61] implemented hybrid fusion to fuse features of multiple pretrained deep learning models, achieving higher classification accuracy, reaching 98.83% on the mammography dataset. Similarly, recent studies, such as Sahu (2025) [65] and Punarselvam et al. (2025) [64], demonstrated the effectiveness of hybrid CNN–Transformer and ensemble approaches in enhancing classification accuracy across MRI and cytology datasets.
Overall, the comparison of studies between 2020 and 2025 focusing on multimodal learning and fusion strategies confirms that fusing features from different modalities or from multiple deep learning models enhances diagnostic accuracy and outperforms single-modality approaches in many tasks. In general, recent studies demonstrated the effectiveness of integrating features through intermediate and hybrid fusion with attention mechanisms, yielding remarkable improvements in diagnostic and prognostic accuracy, and showed that the selection of the fusion point, at the early, intermediate, or late stage, affects model performance. Most studies focused on combining different modalities, such as imaging with tabular data. Although the fusion of different modalities consistently improves accuracy, the challenge of limited datasets, especially multimodal datasets, remains [63,65]. To help mitigate this limitation, this work shifts toward exploring and evaluating multi-pretrained deep learning models and fusion strategies on a single modality instead of multiple modalities (imaging with tabular data). To the best of our knowledge, few studies have implemented comprehensive experiments on all fusion strategies with a single modality. This work therefore investigates multiple pretrained deep learning models with different fusion strategies on a single modality and evaluates the implemented strategies across different histopathology datasets.