Deep Learning-Based Prediction of Tumor Mutational Burden from Digital Pathology Slides: A Comprehensive Review

Ma, Dongheng; Nishikubo, Hinano; Sano, Tomoya; Yashiro, Masakazu

doi:10.3390/app16031340


by Dongheng Ma ^1,2, Hinano Nishikubo ^1,2, Tomoya Sano ^1,2 and Masakazu Yashiro ^1,2,*

^1 Department of Molecular Oncology and Therapeutics, Osaka Metropolitan University Graduate School of Medicine, 1-4-3 Asahimachi, Abeno-ku, Osaka 545-8585, Japan

^2 Cancer Center for Translational Research, Osaka Metropolitan University Graduate School of Medicine, 1-4-3 Asahimachi, Abeno-ku, Osaka 545-8585, Japan

^* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(3), 1340; https://doi.org/10.3390/app16031340

Submission received: 22 December 2025 / Revised: 23 January 2026 / Accepted: 27 January 2026 / Published: 28 January 2026

(This article belongs to the Special Issue Artificial Intelligence Applications in Healthcare and Precision Medicine, 2nd Edition)


Abstract

Tumor mutational burden (TMB) is a key pan-cancer biomarker for immunotherapy selection, but its routine assessment by whole-exome sequencing (WES) or large next-generation sequencing (NGS) panels is costly, time-consuming, and constrained by tissue and DNA quality. In parallel, advances in computational pathology have enabled deep learning models to infer molecular biomarkers directly from hematoxylin and eosin (H&E) whole-slide images (WSIs), raising the prospect of a purely digital assay for TMB. In this comprehensive review, we surveyed PubMed and Scopus (2015–2025) to identify original studies that applied deep learning directly to H&E WSIs of human solid tumors for TMB estimation. Across the 17 eligible studies, deep learning models have been applied to predict TMB from H&E WSIs in a variety of tumors, achieving moderate to good discrimination for TMB-high versus TMB-low status. Multimodal architectures tended to outperform conventional CNN-based pipelines. However, heterogeneity in TMB cut-offs, small and imbalanced cohorts, limited external validation, and the black-box nature of these models limit clinical translation.

Keywords:

tumor mutational burden; digital pathology; deep learning; artificial intelligence

1. Introduction

Tumor mutational burden (TMB) is an emerging pan-cancer biomarker that approximates the number of somatic mutations per megabase of coding DNA [1,2]. High TMB has been associated with enhanced neoantigen load and improved response to immune checkpoint inhibitors (ICIs), and a threshold of 10 mutations/Mb has been adopted by the US Food and Drug Administration as a companion diagnostic for pembrolizumab in solid tumors [3,4]. As a predictive biomarker, TMB functions independently of, yet complementarily to, PD-L1 (programmed death-ligand 1) expression and microsatellite instability (MSI) in identifying patients likely to respond to ICIs [3,5]. However, its routine clinical implementation is hampered by the logistical and financial burdens of whole-exome sequencing (WES) or large next-generation sequencing (NGS) panels [6]. These sequencing-based methods entail high costs, extended turnaround times, and stringent tissue requirements that often exceed what small diagnostic biopsies can provide, a problem compounded by the variable DNA quality of formalin-fixed paraffin-embedded (FFPE) samples [7,8].

Recently, the renaissance of computational pathology has seen deep learning achieve promising performance in predicting gene mutations [9,10,11,12,13,14,15] and immune-associated molecular biomarkers [16,17], such as MSI and PD-L1 status, directly from routine hematoxylin and eosin (H&E) slides. These developments open the possibility of a purely “digital assay” for TMB, inferred from whole-slide images (WSIs) using artificial intelligence, that could be deployed at scale without additional wet-lab testing. Such an approach could provide a rapid, tissue-sparing, and potentially cost-effective complement to sequencing-based assays, especially in settings where sequencing capacity is limited or retrospective DNA is unavailable. The contrast between conventional genomic TMB testing and an AI-based prediction workflow from histopathology is illustrated in Figure 1. However, while broader reviews [18,19,20] have documented the general rise of pathomics in predicting molecular signatures, TMB is a distinct and complex quantitative target rather than a single gene alteration.

Studies that predict TMB from WSIs are emerging, but to the best of our knowledge, no comprehensive review focuses solely on this topic. Therefore, in this comprehensive review we summarize studies that use deep learning on digitized H&E WSIs to estimate TMB. We describe the typical analytic pipeline and compare methodological choices and predictive performance across tumor types. Furthermore, we provide a structured synthesis of how these models are currently implemented in practice, highlighting recurrent limitations in TMB definition, data availability, and model interpretability that hinder clinical translation. Finally, we outline a focused research agenda for WSI-based TMB prediction, emphasizing multimodal fusion, pathology foundation models, and rigorous prospective multi-center validation for future clinical deployment.

2. Methods

We performed a comprehensive literature search to identify studies using deep learning on WSIs to predict TMB. PubMed (U.S. National Library of Medicine, Bethesda, MD, USA) and Scopus (Elsevier, Amsterdam, The Netherlands) were searched for articles published between 1 January 2015 and 1 November 2025. The search strategy combined three concept blocks—(1) TMB, (2) digital pathology/histopathology, and (3) artificial intelligence—using combinations of related phrases in titles and abstracts (e.g., “tumor mutational burden”, “whole-slide image”, “digital pathology”, “deep learning”, “multiple instance learning”, “transformer”). The full executable search strategies (database-specific syntax, field restrictions, Boolean operators, wildcard use, and the prespecified date limits) are provided in Supplementary Note S1; no language or publication-type filters were applied. We included full-length, peer-reviewed original research articles that applied deep learning methods directly to H&E WSIs of human tumors to estimate TMB, either as a binary high/low label or as a continuous value. We excluded review articles, editorials, conference abstracts, and studies in which TMB was only used as an auxiliary variable to predict other outcomes without predicting TMB from WSIs. Non-human or non-H&E studies, as well as works relying solely on handcrafted image features without deep learning, were also excluded. The database search identified 164 records (Scopus n = 120; PubMed n = 44). After duplicate removal, 123 records were screened by title and abstract, and 19 reports were sought for retrieval and assessed for eligibility by full-text review. Ultimately, 17 studies met the inclusion criteria and were included in the qualitative synthesis and descriptive quantitative analysis (PRISMA 2020 flowchart [21] shown in Figure 2). 
Screening, data extraction, and PROBAST+AI assessment (2025) [22] were conducted by one reviewer and verified by a second reviewer for all included studies and uncertain cases at both the title/abstract and full-text stages; disagreements were resolved through discussion. The key characteristics and main results of the included studies are summarized. For each included study, we extracted information on cancer type, cohort characteristics and data source, definition of TMB-high, image preprocessing and patch extraction strategy, deep learning architecture and training paradigm, supervision level, and performance metrics. Risk of bias and applicability were assessed using PROBAST+AI (domain-level assessments for each included study in Supplementary Table S1).

Because of heterogeneity in tumor types, TMB cut-offs, and validation settings across included studies, we did not perform a formal meta-analysis. Instead, we conducted a descriptive quantitative summary. For the bubble plot, the plotting unit was a tumor-specific entry (i.e., a single study could contribute multiple bubbles when multiple tumor types were evaluated). For each tumor-specific entry, we selected one “primary” AUC using pre-specified rules: we preferentially used AUCs from an independent external cohort when available; otherwise, we used the AUC from a held-out internal test set; if neither was reported, we used the cross-validation AUC. The bubble plot was generated in R (v4.5.2; R Foundation for Statistical Computing, Vienna, Austria).
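The primary-AUC selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated priority order (external cohort, then held-out internal test set, then cross-validation); the dictionary keys and the example values are hypothetical and are not taken from any included study.

```python
from typing import Optional

# Priority order taken from the text: external cohort first, then held-out
# internal test set, then cross-validation.
PRIORITY = ("external", "internal_test", "cross_validation")


def select_primary_auc(reported: dict) -> Optional[tuple]:
    """Return (source, auc) for the highest-priority AUC that was reported.

    `reported` maps a validation setting to an AUC, or to None when the
    study did not report that setting. The key names are hypothetical.
    """
    for source in PRIORITY:
        auc = reported.get(source)
        if auc is not None:
            return source, auc
    return None  # no usable AUC for this tumor-specific entry


# Illustrative values only, not extracted from the reviewed studies.
entry_a = {"external": 0.71, "internal_test": 0.85, "cross_validation": 0.88}
entry_b = {"external": None, "internal_test": None, "cross_validation": 0.79}

print(select_primary_auc(entry_a))  # ('external', 0.71)
print(select_primary_auc(entry_b))  # ('cross_validation', 0.79)
```

Note that under this rule the plotted value for a study with external validation can be lower than its best internal result, which is the intended behavior.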

3. Pipeline for WSI-Based TMB Prediction

By synthesizing the methodological pipelines reported in the studies included in this review, we derive a conceptual, multi-stage workflow for WSI-based TMB prediction (Figure 3). Although implementation details vary across tumor types, datasets, and modeling strategies, most published approaches can be mapped onto four core phases: data preprocessing, patch feature extraction, slide feature aggregation, and performance evaluation.

3.1. Data Preprocessing

Most studies are based on digitized H&E-stained slides, with The Cancer Genome Atlas (TCGA) serving as the primary data source for WES-derived TMB ground truth. WSIs are typically scanned at 20× (0.5 µm/pixel) or 40× (0.25 µm/pixel) magnification. Early works relied heavily on manual region-of-interest (ROI) annotation by pathologists to isolate tumor tissue and exclude necrosis or normal structures (e.g., Shimada et al. [23]). More recent high-throughput frameworks employ automated tissue segmentation algorithms (e.g., Otsu’s thresholding, simple tissue/background classifiers, or toolkits such as clustering-constrained attention multiple instance learning (CLAM)) to separate tissue from background. The tissue regions are then tessellated into fixed-size patches, typically 256 × 256 or 512 × 512 pixels at 20× magnification. Rigorous quality control is standard practice: most studies apply color normalization to mitigate staining variability across institutions, and discard tiles with artifacts, blur, pen marks, or insufficient cellularity (e.g., Wang et al. [24]). Tumor region enrichment is performed through manual annotation, semi-automated tumor segmentation, or pre-trained tissue classifiers that distinguish tumor epithelium from stroma, necrosis, and background. This step aims to reduce label noise by ensuring that the majority of tiles in each WSI come from tumor-rich areas that are more likely to encode TMB-related morphology.
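As a rough illustration of the tissue-segmentation and tiling step, the following sketch applies Otsu's thresholding to a toy grayscale image and keeps only tiles that contain enough tissue. It is a simplified sketch under stated assumptions: the image is a small nested list rather than a real WSI thumbnail, the tile size and the `min_tissue` fraction are arbitrary, and tissue is assumed to be darker than the background. The function names are hypothetical and do not correspond to any pipeline in the reviewed studies.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def tissue_tiles(image, tile=4, min_tissue=0.5):
    """Return (row, col) origins of tiles whose tissue fraction is high enough."""
    threshold = otsu_threshold([p for row in image for p in row])
    height, width = len(image), len(image[0])
    kept = []
    for r in range(0, height - tile + 1, tile):
        for c in range(0, width - tile + 1, tile):
            n_tissue = sum(
                1
                for rr in range(r, r + tile)
                for cc in range(c, c + tile)
                if image[rr][cc] <= threshold  # assumption: tissue is darker
            )
            if n_tissue / (tile * tile) >= min_tissue:
                kept.append((r, c))
    return kept


# Toy 8 x 8 "slide": dark tissue (100) in the top-left block, bright background (240).
image = [[100 if r < 4 and c < 4 else 240 for c in range(8)] for r in range(8)]
print(tissue_tiles(image))  # [(0, 0)]
```

In practice this logic would typically run on a downsampled thumbnail, with the retained tile coordinates mapped back to the 20× or 40× level before patch extraction.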

3.2. Patch Feature Extraction

The core of the pipeline involves encoding raw image patches into high-dimensional feature vectors. Historically, the dominant paradigm has utilized convolutional neural networks (CNNs) such as ResNet18/50 or InceptionV3, pre-trained on ImageNet and fine-tuned on histology patches (e.g., Liu et al. [25]). These models transform each patch into a feature vector that captures color, texture, nuclear atypia, glandular architecture, and microenvironmental patterns. To overcome the domain gap between natural images and histopathology, recent studies have moved toward domain-specific representations. Advanced works employ self-supervised learning (SSL) encoders (e.g., contrastive learning frameworks) trained on massive collections of unlabeled pathology tiles to generate more biologically relevant embeddings (e.g., Zheng et al. [26]). In parallel, the field is rapidly adopting next-generation architectures, including vision transformers (ViTs) and state space models, which demonstrate superior capability in capturing global contextual dependencies compared with local-receptive CNNs.

3.3. Slide Feature Aggregation

Once patch-level features are extracted, they must be aggregated to form a slide-level prediction. Simple averaging or max-pooling has largely been superseded by multiple instance learning (MIL) frameworks equipped with attention mechanisms. Attention-based MIL and graph-based aggregation have emerged as the dominant paradigms for weakly supervised WSI analysis, not only for TMB but across diverse diagnostic, prognostic, and biomarker prediction tasks [27,28,29,30,31,32,33,34,35,36,37,38,39,40]. In these models, each slide is treated as a bag of instances (patch features), and attention-based pooling allows the model to automatically assign weights to patches, highlighting “high-attention” regions—often tumor–stroma interfaces or lymphocyte-rich areas—that contribute most to the TMB status. More complex aggregation strategies include graph convolutional networks (GCNs) that model the spatial topology of the tumor microenvironment and multimodal fusion modules that integrate histology features with clinical data, mRNA expression, or text reports to enhance predictive signal (e.g., Yu et al. [41]). Cascaded or hierarchical MIL architectures have also been proposed to first identify highly informative patches, then refine slide-level predictions in a second stage.
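To make the attention-pooling idea concrete, the sketch below implements the standard attention-based MIL pooling form, in which each patch feature h_k receives a weight a_k = softmax_k(w · tanh(V h_k)) and the slide embedding is z = Σ_k a_k h_k. The parameters `V` and `w` would normally be learned; here they are fixed toy values chosen only for illustration, and the example does not reproduce the aggregator of any specific study.

```python
import math


def attention_pool(bag, V, w):
    """Attention-based MIL pooling over a bag of patch feature vectors.

    bag: list of K feature vectors h_k, each of length D
    V:   L x D projection matrix; w: length-L vector (both would be learned)
    Returns (attention weights a_k, slide-level embedding z = sum_k a_k * h_k).
    """
    scores = []
    for h in bag:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wj * uj for wj, uj in zip(w, hidden)))
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    dim = len(bag[0])
    z = [sum(a * h[d] for a, h in zip(attn, bag)) for d in range(dim)]
    return attn, z


# Three toy patch embeddings (D = 2) and toy parameters (L = 1).
bag = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
V = [[1.0, 0.0]]
w = [1.0]

attn, z = attention_pool(bag, V, w)
print([round(a, 3) for a in attn])  # the third patch receives the largest weight
print([round(v, 3) for v in z])
```

The attention weights are what studies visualize as heatmaps: patches with large a_k dominate the slide-level embedding that is passed to the TMB classifier.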

3.4. Performance Evaluation

The downstream prediction task is predominantly framed as a binary classification problem (TMB-high vs. TMB-low). The ground truth is derived either from WES or targeted NGS panels with cut-offs varying by organ type and study design (e.g., 10 mut/Mb, top 20th percentile, or tumor-specific tertiles). A minority of studies have explored regression to predict continuous TMB values, offering a more granular assessment of mutation load (e.g., Sun et al. [42]). Model performance is evaluated primarily by discrimination metrics (area under the receiver operating characteristic curve (AUC), area under the precision–recall curve (AUCPR), accuracy, and F1-score) and, less frequently, by calibration curves and decision-curve analysis. Regarding validation protocols, internal evaluation typically involves splitting cohorts into training, validation, and test sets and adopting k-fold cross-validation to reduce sampling bias. In k-fold cross-validation, each fold is held out in turn as the test set while the remaining folds are used for model development, and performance (e.g., AUC) is averaged across folds. Reported k varies across studies (commonly 5-fold, and sometimes 2-fold in smaller cohorts), reflecting a trade-off between estimation stability and available sample size. Several studies additionally perform Kaplan–Meier survival analysis to demonstrate that image-predicted TMB status can stratify overall survival or progression-free survival among patients treated with ICIs, thereby supporting the model’s potential as a prognostic companion.
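The k-fold procedure and the AUC metric described above can be sketched as follows. The AUC is computed as the probability that a randomly chosen TMB-high case scores higher than a randomly chosen TMB-low case, with ties counted as one half. The "model" is a deliberately trivial midpoint rule on a single feature, and the folds are contiguous and unstratified; both are simplifying assumptions for illustration, and all names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_auc(labels, features, k, fit, predict):
    """Hold out each fold in turn, fit on the rest, and average the fold AUCs."""
    aucs = []
    for test_idx in kfold_indices(len(labels), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        scores = [predict(model, features[i]) for i in test_idx]
        aucs.append(roc_auc([labels[i] for i in test_idx], scores))
    return sum(aucs) / len(aucs)


def fit_midpoint(xs, ys):
    """Toy model: midpoint between the class means of a single feature."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_margin(midpoint, x):
    return x - midpoint


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
features = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.35, 0.65]
print(cross_validated_auc(labels, features, 5, fit_midpoint, predict_margin))  # 1.0
```

With low TMB-high prevalence, an unstratified fold can end up with no positive cases, in which case the fold AUC is undefined; this is one reason stratified splitting is commonly preferred in small, imbalanced cohorts.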

4. Results

4.1. Risk of Bias Assessment

Figure 4 summarizes the PROBAST+AI risk-of-bias assessments, and the item-level judgments for each study are provided in Supplementary Table S1. For the Participants domain, all studies included at least one public dataset (17/17), and a minority additionally incorporated a private institutional cohort (3/17); all were rated as low risk. In the Predictors domain, one study was rated as unclear risk because it did not report whether predictor preprocessing was performed independently within the training and validation sets, while the remaining studies were judged low risk. For the Outcome domain, one study was rated as unclear risk due to insufficient details on outcome measurement in the external cohort, and all others were low risk. In contrast, the Analysis domain showed substantial concerns, with 29% rated as high risk and 47% as unclear risk, commonly due to limited sample size and lack of reporting on strategies to address class imbalance. Overall, only 24% of studies were judged at low risk of bias, indicating considerable room for improvement in bias-mitigation measures and transparent reporting practices.

4.2. Characteristics of Included Studies

The landscape of TMB prediction via deep learning is defined by a rapid methodological evolution applied primarily to TCGA cohorts, which serve as the foundational training ground for the majority of studies due to the availability of paired WSIs and WES-derived TMB. Across the 17 included articles, tumor types span lung (mainly lung adenocarcinoma, LUAD), colorectal cancer (CRC), gastric cancer (GC), endometrial cancer (EC), clear cell renal cell carcinoma (ccRCC), and smaller series in other solid tumors. Sample sizes are modest to moderate, typically ranging from around 50 to 600 patients per cohort, with a few multi-center studies incorporating external validation datasets such as CPTAC or institutional cohorts from Asia and Europe. All works rely on retrospective data. WES is the predominant assay in TCGA-based studies, whereas some institutional series use targeted gene panels of varying sizes and gene content. With respect to the TMB endpoint, the majority of studies treat TMB as a binary variable, defining TMB-high vs. TMB-low using fixed numerical thresholds (e.g., 10 mut/Mb in Niu et al. [43]) or cohort-specific quantiles (e.g., upper tertile or top 20% in Wang et al. [44]). The distributions of cancer types, endpoint definitions, validation designs, and dataset sources across the included studies are summarized in Figure 5.
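The difference between fixed-threshold and quantile-based endpoint definitions can be illustrated with a small sketch. The TMB values below are invented for illustration, and the helper names are hypothetical; the point is only that the same cohort yields different TMB-high prevalence, and therefore a different classification task, depending on how the label is constructed.

```python
def label_fixed(tmb, cutoff=10.0):
    """TMB-high if the value meets a fixed cut-off in mut/Mb."""
    return [1 if v >= cutoff else 0 for v in tmb]


def label_top_fraction(tmb, fraction=0.20):
    """Label the top `fraction` of the cohort as TMB-high (ties included)."""
    n_high = max(1, round(fraction * len(tmb)))
    cutoff = sorted(tmb, reverse=True)[n_high - 1]
    return [1 if v >= cutoff else 0 for v in tmb]


def prevalence(labels):
    return sum(labels) / len(labels)


# Invented cohort of ten TMB values (mut/Mb), for illustration only.
tmb = [1.2, 2.5, 3.1, 4.0, 4.8, 5.5, 6.9, 8.2, 9.1, 23.0]

print(prevalence(label_fixed(tmb, 10.0)))        # 0.1
print(prevalence(label_top_fraction(tmb, 0.20)))  # 0.2
print(prevalence(label_top_fraction(tmb, 1 / 3)))  # 0.3
```

In this toy cohort the case at 9.1 mut/Mb is TMB-low under the fixed cut-off but TMB-high under the top-20% rule, so AUCs obtained under the two definitions are not directly comparable.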

Imaging pipelines are uniformly based on H&E WSIs, but differ in patch size, magnification, stain normalization, and tumor-enrichment strategies, as outlined above. Overall, early investigations relied heavily on transfer learning with standard CNNs trained on patch-level labels and simple aggregation schemes, whereas more recent works employ weakly supervised learning (WSL) and MIL frameworks to leverage slide-level labels. From 2023 onwards, there is a clear trend toward integrating multimodal data—histopathology fused with clinical variables, transcriptomics, or textual reports—and adopting cutting-edge architectures such as vision transformers and state space models that better capture global context. These temporal trends in sample size, cancer type, architecture, and AUC are summarized in Figure 6.

4.3. Performance

Across the included studies, reported performance for TMB prediction from WSIs spans a broad range, with internal AUCs varying from approximately 0.64 to 0.99 (Table 1). This variability is not purely technical; it reflects a combination of (i) how strongly TMB is “expressed” in the histomorphology of a given tumor type, (ii) the choice of TMB cut-off and resulting class balance, and (iii) the complexity of the model architecture. Notably, even relatively simple CNN-based pipelines often achieve AUCs around 0.70–0.80 in multiple cohorts, suggesting that H&E slides do contain reproducible morphological correlates of mutational burden that can be captured by baseline encoders. At the same time, comparisons within the same cancer type indicate that performance tends to be higher in biologically more aggressive or hypermutated subgroups, and that more liberal, cohort-specific TMB thresholds frequently yield higher apparent discrimination than stringent ≥10 mut/Mb criteria.

4.3.1. Lung Cancer

Performance in lung cancer varies substantially, reflecting high intratumoral and inter-patient heterogeneity. Early CNN-based models reported modest internal AUC values ranging from about 0.64 to 0.77. More recent models using architectures such as VMamba with text-guided attention (Yu et al. [41]), however, have reported internal AUCs approaching 0.99 in selected LUAD cohorts. Despite these promising numbers, external validation remains a major hurdle. Studies often observe a performance drop of 0.10–0.15 in AUC when models are tested on independent cohorts or held-out tissue source sites (Sadhwani et al. [45]).

4.3.2. Gastrointestinal Cancers

CRC and GC constitute another major cluster characterized by consistently higher predictive performance. Models applied to CRC frequently achieve internal AUCs between 0.81 and 0.93 (Huang et al. [46], Shimada et al. [23]). This superior performance is largely attributed to the strong biological correlation between high TMB and MSI in these malignancies, which manifests as distinct histological features such as increased tumor-infiltrating lymphocytes (TILs), glandular disarray, and mucinous differentiation—patterns that are readily detectable by computer vision algorithms. External validation in gastrointestinal cohorts is comparatively robust, with several studies evaluating models on independent cohorts (e.g., JP-CRC or CPTAC). Nevertheless, cross-dataset performance drops are still observed; for example, Wang et al. [24] reported a decline from AUC ≈ 0.88 in TCGA to ≈0.58 in CPTAC before domain adaptation or retraining.

4.3.3. Endometrial and Renal Cancers

Endometrial cancer studies [50,54] report consistently strong discrimination, with AUCs generally in the 0.8 range for distinguishing aggressive, high-TMB subtypes. These strong results may reflect relatively distinct nuclear grade differences and architectural patterns associated with mutational burden in EC. In ccRCC, the literature presents a more mixed picture. Zheng et al. [26] demonstrated robust external validation with AUC ≈ 0.83 in CPTAC after training on TCGA. In contrast, Liu et al. [47] reported a decline from internal AUC ≈ 0.81 to ≈0.65 on external institutional data, underscoring the susceptibility of renal histopathology AI to site-specific batch effects and the challenges of domain generalization in tumors with subtler TMB-related morphology.

4.4. Cross-Study Comparability

Across deep learning studies, the apparent performance differences are not always attributable to architectural advances but can be driven by heterogeneity in endpoint construction and cohort composition. First, TMB-high is often a low-prevalence label; changing the cutoff can alter prevalence and substantially affect discrimination. For example, in the same CRC cohort with the same model, increasing the cutoff from 10 to 20 mut/Mb increased AUC from 0.729 to 0.774, illustrating sensitivity to threshold definition rather than a methodological change. Second, even under seemingly comparable settings (same cancer type, TCGA-based LUAD, and a fixed cutoff of 10 mut/Mb), reported AUCs vary widely across studies (e.g., conventional CNN vs. text-guided architectures), likely reflecting differences in sample size, training/validation design, and the degree to which proxy signals are captured. Finally, most included studies rely on internal splits, and performance often degrades markedly under truly independent external validation (e.g., AUC dropping from 0.881 to 0.577), which further limits cross-study comparability and cautions against over-interpreting high internal AUC as field-wide progress. Table 2 (Internal vs. external validation performance) summarizes these differences and illustrates the limited generalizability under independent external validation.

4.5. Technology Evolution

4.5.1. Architectural Evolution

The methodological trajectory of TMB prediction traces a clear evolution in computational sophistication. Initial efforts predominantly relied on transfer learning with CNNs (e.g., InceptionV3, ResNet50) applied to tiled patches, often aggregating scores via simple averaging or max-pooling. A paradigm shift occurred with the adoption of MIL and attention mechanisms, allowing models to automatically identify and weight TMB-relevant regions within a WSI without pixel-level annotation. The most recent wave of research (2023–2025) has embraced SSL and novel backbones. For instance, Wang et al. [54] utilized SSL-pretrained encoders to extract richer features, while Yu et al. [41] combined a VMamba state space backbone with text-guided attention, marking the entry of “foundation model” technologies that can enhance feature extraction capability and data efficiency. These models better capture long-range dependencies and contextual patterns in the tumor microenvironment, which may enable higher performance even in relatively small training cohorts.

4.5.2. Multimodal Fusion: Recent Advances

Recent “state-of-the-art” approaches increasingly adopt multimodal fusion. Li et al. [49] demonstrated that fusing histology with mRNA expression and clinical data via multimodal compact bilinear pooling boosted AUC from 0.749 (image-only) to 0.971. Similarly, Yu et al. [41] integrated clinical text reports using attention mechanisms, and Zhang et al. [51] incorporated nuclear segmentation features, demonstrating that synergistic integration of morphological, clinical, and omics data is key to achieving the reliability required for clinical deployment. Temporally, the field has transitioned through distinct phases: an early “CNN era” characterized by supervised patch-level classification; an “attention era” where MIL frameworks began to highlight interpretable regions of interest; and the current “foundation model era,” which utilizes SSL on massive pathology datasets and multimodal architectures to overcome data scarcity and annotation costs.

5. Limitations

5.1. TMB Cut-Off Heterogeneity

Although a threshold of 10 mutations per megabase (mut/Mb) has been endorsed by the U.S. Food and Drug Administration as a tissue-agnostic companion diagnostic cut-off for pembrolizumab, this value is not a universal or biology-driven definition of “TMB-high”. In individual studies, TMB cut-offs are frequently chosen in a data-driven manner based on the distribution of TMB within a cohort (e.g., tertiles or other quantiles) or by directly optimizing prognostic or predictive separation using maximally selected rank statistics, ROC-based criteria, or survival cut-point methods [55]. Ideally, TMB should be quantified from WES, whereas most clinical targeted NGS panels provide only an estimate of “panel-TMB”, whose analytical validity and clinical equivalence to WES-TMB are still under evaluation [56]. Empirically, WES-derived TMB values are often higher and more dispersed than panel-based estimates. Biological and technical factors such as tumor purity and ploidy further complicate interpretation: specimens with low tumor content may yield artificially deflated sequencing-based TMB, thereby providing a noisy and imperfect supervision signal for downstream deep learning models. As a result, TMB cut-offs are heterogeneous both across and within tumor types [57,58,59,60,61,62,63,64,65]. In several gastrointestinal and genitourinary cancers, applying the ≥10 mutations/Mb threshold leads to very low TMB-high prevalence [66]. This combination of assay- and disease-specific cut-off heterogeneity, systematic differences between WES-TMB and panel-TMB, and low TMB-high prevalence jointly increases label noise and makes it difficult to compare model performance across studies or to translate a trained predictor into real-world clinical workflows.

5.2. Data Scarcity

After restricting to specific histologic subtypes or institutional series, sample sizes are often in the low hundreds, and for rarer entities such as lung squamous cell carcinoma, the total number of patients can fall well below 100. Within these already limited datasets, the proportion of TMB-high patients—especially when using higher cut-offs such as ≥10 mutations/Mb—is frequently very low, yielding highly imbalanced classification tasks where the positive class may consist of only dozens of patients. Training high-capacity deep learning models under such small-sample, class-imbalanced conditions greatly increases the risk of overfitting and unstable performance estimates [67]. Apparent gains in cross-validation AUC on internal TCGA splits or single-center cohorts may reflect idiosyncratic cohort-specific patterns rather than robust, generalizable morphological correlates of TMB. External validation, when present, is often limited to a single institutional test set with modest size, and performance frequently degrades compared to internal results. Taken together, data scarcity, severe class imbalance, and incomplete reporting make it difficult to assess the true generalizability of current WSI-based TMB predictors and to plan prospective clinical validation.
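One commonly used way to counter class imbalance during training is inverse-frequency class weighting. The sketch below is an assumption-laden illustration rather than a method reported by the included studies: it computes weights as n / (number of classes × class count) and applies them in a weighted binary cross-entropy. The cohort composition and function names are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n / (number of classes * class count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


def weighted_log_loss(labels, probs, weights, eps=1e-12):
    """Binary cross-entropy in which each case is weighted by its class weight."""
    total, norm = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        norm += w
    return total / norm


# Hypothetical cohort: 10 TMB-high and 90 TMB-low patients.
labels = [1] * 10 + [0] * 90
weights = balanced_class_weights(labels)
print(weights[1])  # 5.0
print(round(weights[0], 3))  # 0.556

# A model that always predicts the base rate is penalized more heavily on
# the rare class once the weights are applied.
probs = [0.1] * 100
print(round(weighted_log_loss(labels, probs, weights), 3))
```

Weighting changes the optimization target but does not create information: with only a few dozen positive cases, performance estimates remain unstable regardless of the weighting scheme.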

5.3. Model Interpretability

Even when predictive performance appears promising, most deep learning models for WSI-based TMB prediction remain essentially black boxes [68]. Some studies attempt to enhance interpretability using attention heatmaps, saliency maps, or cell segmentation overlays to highlight regions or cell populations that contribute most strongly to the TMB-high vs. TMB-low decision (e.g., Zhang et al. [51]). However, these visualizations typically offer only qualitative, post hoc explanations and rarely provide mechanistic insight into which concrete histological patterns or tissue contexts are being used by the model. For pathologists and oncologists, this limited explainability reduces trust in individual predictions and complicates the integration of such models into multidisciplinary decision-making or regulatory evaluation [69]. Moreover, many of the highest-performing models still depend on substantial manual intervention. Several studies rely on expert pathologists to annotate tumor-rich regions of interest, exclude non-tumor tissue such as necrosis or stroma, or specifically target invasion fronts (e.g., Liu et al. [47]). These manual steps are time-consuming, introduce inter-observer variability, and partially undermine the promise of fully automated, high-throughput pipelines that could scale to routine practice [69]. This combination of limited interpretability and workflow dependence remains a major barrier to clinician acceptance, regulatory approval, and real-world deployment.

5.4. Bias, Confounding, and Proxy Learning

Despite several included studies reporting very high discrimination, bias and proxy learning remain major validity threats in TMB prediction. In our included literature, colorectal cancer (CRC) often shows comparatively strong internal performance (frequently ~0.81–0.93, with some cohorts reaching 0.934), which likely reflects, at least in part, the strong biological correlation between TMB-high and MSI in GI malignancies. MSI-associated histologic patterns (e.g., increased TILs, glandular disarray, and mucinous differentiation) are visually salient and may act as proxy signals that co-vary with TMB labels. Importantly, performance can differ substantially across histologic subtypes within the same cancer (e.g., CRC non-mucinous AUC 0.90 vs. mucinous AUC 0.72), further suggesting that models may exploit subtype-, grade-, immune infiltration–, or tumor-purity-related morphology that correlates with the endpoint. Therefore, high internal AUCs should not be interpreted as evidence of causal “TMB morphology,” and future work should prioritize confounder-aware analyses (e.g., stratified evaluation by MSI/subtype and site-held-out testing) to improve interpretability and trustworthiness.
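A stratified evaluation of the kind suggested above can be sketched as follows. In this invented example the model score tracks MSI status rather than TMB itself: the pooled AUC looks acceptable, but within each MSI stratum the score ranks cases worse than chance. All values are fabricated solely to illustrate the mechanism of proxy learning, and the function names are hypothetical.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_stratum(labels, scores, strata):
    """Compute the AUC separately within each stratum (e.g., MSI status)."""
    groups = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, strata):
        groups[g][0].append(y)
        groups[g][1].append(s)
    result = {}
    for g, (ys, ss) in groups.items():
        # AUC is undefined when a stratum contains a single class.
        result[g] = roc_auc(ys, ss) if 0 < sum(ys) < len(ys) else None
    return result


# Invented data: TMB-high labels, model scores, and MSI status per patient.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.75, 0.85, 0.2, 0.3, 0.1, 0.25, 0.15]
strata = ["MSI"] * 5 + ["MSS"] * 5

print(round(roc_auc(labels, scores), 2))  # 0.72
print(auc_by_stratum(labels, scores, strata))  # {'MSI': 0.25, 'MSS': 0.25}
```

A large gap between pooled and within-stratum AUC is a warning sign that the model separates the strata rather than the endpoint.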

6. Future Directions

6.1. Multimodal Fusion

Future research should move beyond unimodal histology-based prediction by developing multimodal fusion models. As demonstrated by Li et al. [49] and Yu et al. [41], integrating histological features with clinical data, radiomics, and genomic profiles significantly boosts predictive accuracy compared to using image features alone. Similar performance gains have been consistently reported in other multimodal pathology applications, where combining WSIs with other omics data enables models to selectively distill task-relevant signals from each modality [70,71]. Future frameworks should explore advanced fusion strategies, such as tensor fusion or cross-attention mechanisms, to effectively capture the complementary information between morphological phenotypes and molecular profiles, leading to more robust biomarkers for immunotherapy response [72]. Another important future direction is to combine precise cell segmentation and classification with spatially resolved multi-omic technologies to construct cell-resolved tissue atlases. By aligning local morphological patterns with locally measured molecular states, such approaches could generate pixel- or cell-level “ground truth” labels across the WSI, rather than relying on a single global TMB value derived from a small sampled region. This would transform the current slide-level weakly supervised setting into a more fine-grained supervised or semi-supervised learning problem, enabling models to learn how regional clonal architecture and microenvironmental niches contribute to TMB heterogeneity.
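As a minimal illustration of one of the fusion strategies mentioned above, the sketch below shows the outer-product form of tensor fusion: each modality embedding is extended with a constant 1, and the flattened outer product then contains the unimodal features as well as all pairwise cross-modal interactions. The embeddings and their dimensions are invented, and the sketch is not the fusion module of any reviewed study (for example, it is not compact bilinear pooling, which approximates a similar interaction space at lower dimensionality).

```python
def tensor_fusion(h_img, h_other):
    """Flattened outer product of two embeddings, each extended with a 1.

    The appended constant keeps the unimodal terms alongside the pairwise
    image-by-other-modality interaction terms.
    """
    a = [1.0] + list(h_img)
    b = [1.0] + list(h_other)
    return [x * y for x in a for y in b]


# Invented embeddings: a 2-dimensional slide feature and a 1-dimensional
# clinical or omics feature.
h_img = [0.5, 2.0]
h_clin = [3.0]

fused = tensor_fusion(h_img, h_clin)
print(fused)  # [1.0, 3.0, 0.5, 1.5, 2.0, 6.0]
```

The fused dimensionality grows as the product (D_img + 1) × (D_other + 1), which is why compact or attention-based alternatives are attractive when slide embeddings have hundreds of dimensions.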

6.2. Advanced Architectures

To address data scarcity and annotation costs, the field should pivot towards self-supervised learning (SSL) and foundation models [73]. Instead of relying on small labeled datasets, SSL allows models to learn robust feature representations from large amounts of unlabeled histopathology images. In this paradigm, large-scale pathology foundation models [74] can be pretrained once on heterogeneous, multi-center, multi-cancer, multi-scanner WSI collections using SSL objectives (for example, DINO-style vision transformers), and then lightly fine-tuned for disease- or task-specific objectives such as TMB status, MSI, or driver mutation prediction. This represents a shift from “one disease, one bespoke network” towards a “pretrain once, adapt many times” workflow, in which TMB prediction becomes one capability within a broader, general-purpose pathology model that encodes a shared morphological language. Notably, several recent pathology foundation models already report TMB classification as one of multiple downstream benchmarks, suggesting that signals related to mutational load can be implicitly captured in these generic histological representations. At the architectural level, the field is moving from standard CNNs (such as ResNet) to Vision Transformers (ViTs) and Graph Neural Networks (GNNs). These architectures are better suited to capturing long-range dependencies and the global context of the tumor microenvironment (TME), potentially revealing novel morphological patterns associated with high TMB that local-feature-based CNNs might miss [75].
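One lightweight form of the “pretrain once, adapt many times” workflow described above is a linear probe: patch embeddings from a frozen encoder are pooled per slide, and only a small logistic head is trained for the TMB-high versus TMB-low task. The Python sketch below assumes the embeddings have already been extracted; the two-dimensional toy embeddings, the mean pooling, and the hyperparameters are hypothetical simplifications, not the method of any cited foundation model.

```python
import math


def mean_pool(patch_embeddings):
    """Average patch embeddings into one slide-level vector."""
    n = len(patch_embeddings)
    return [sum(col) / n for col in zip(*patch_embeddings)]


def predict(w, b, x):
    """Logistic head: probability that the slide is TMB-high."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(slide_vectors, labels, lr=0.5, epochs=200):
    """Fit the logistic head by plain stochastic gradient descent.

    The encoder is never touched; only `w` and `b` are learned.
    """
    w, b = [0.0] * len(slide_vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(slide_vectors, labels):
            grad = predict(w, b, x) - y  # derivative of log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


# Hypothetical frozen-encoder outputs: four slides, two patches each, 2-D embeddings.
slides = [
    [[1.0, 0.1], [0.8, 0.0]],
    [[0.9, 0.2], [1.1, 0.1]],
    [[0.1, 0.9], [0.0, 1.0]],
    [[0.2, 0.8], [0.1, 1.1]],
]
labels = [1, 1, 0, 0]  # 1 = TMB-high, 0 = TMB-low (toy labels)

slide_vectors = [mean_pool(s) for s in slides]
w, b = train_linear_probe(slide_vectors, labels)
probs = [predict(w, b, x) for x in slide_vectors]
print(probs)
```

In practice, mean pooling is often replaced by an attention-based MIL aggregator, and evaluation would use held-out slides rather than the training slides shown here.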

6.3. Prospective Clinical Validation

To translate these algorithms into clinical practice, future studies must prioritize large-scale, prospective, multi-center trials over retrospective analyses. Despite an expanding body of digital pathology and pathomics studies [16,76,77] predicting key immunotherapy biomarkers such as MSI, PD-L1, and TMB from H&E whole-slide images, to our knowledge no prospective evaluation of these models has yet been reported. Most existing TMB-from-WSI models have been developed on retrospective, single-center cohorts or public resources such as TCGA, which are prone to selection bias and offer limited insight into how models behave under real-world heterogeneity in scanners, tissue processing protocols, and patient demographics. Prospective trials should verify not only the accuracy of TMB prediction but also its association with actual immunotherapy outcomes. Ideally, international consortia would establish harmonized prospective cohorts in which TMB measurement, slide digitization, and model evaluation protocols are standardized, and in which WSI-based TMB is tested in clearly defined clinical roles, for example as a triage tool before sequencing or as one component of a composite immunotherapy biomarker. Future work should also focus on correlating deep learning features with specific biological entities (e.g., tumor-infiltrating lymphocytes or nuclear segmentation features [51]). Establishing a clear link between the “digital score” and biological reality will be essential for acceptance by the pathology community. For clinical deployment, studies should report calibration, pre-specified thresholds under real-world prevalence, and clinical-utility analyses in prospective multi-center settings.
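The point about thresholds under real-world prevalence can be illustrated with a short calculation. The Python sketch below applies Bayes' rule to show how the positive and negative predictive values of a fixed operating point change with the prevalence of TMB-high cases. The sensitivity, specificity, and prevalence values are hypothetical; note that a percentile-based cutoff such as an upper tertile fixes the label prevalence near one third by construction, which may differ from the prevalence in a deployment population.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a fixed operating point at a given prevalence.

    A value is None when its denominator is zero (no predicted positives
    or no predicted negatives).
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    fp = (1.0 - specificity) * (1.0 - prevalence)
    ppv = tp / (tp + fp) if (tp + fp) > 0 else None
    npv = tn / (tn + fn) if (tn + fn) > 0 else None
    return ppv, npv


# Hypothetical operating point evaluated at two assumed prevalences.
ppv_30, npv_30 = predictive_values(0.80, 0.80, 0.30)
ppv_05, npv_05 = predictive_values(0.80, 0.80, 0.05)
print(round(ppv_30, 3), round(npv_30, 3))
print(round(ppv_05, 3), round(npv_05, 3))
```

With these assumed values the PPV falls from about 0.63 at 30% prevalence to about 0.17 at 5% prevalence, while sensitivity and specificity are unchanged, which is why a threshold chosen on an enriched development cohort may not transfer to a triage setting.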

7. Conclusions

Deep learning models applied to routine H&E WSIs have opened a new avenue for approximating TMB without additional wet-lab assays. Across diverse tumor types and methodological frameworks, these models have progressed from early proof-of-concept studies to sophisticated architectures that leverage weak supervision, self-supervised feature learning, and multimodal fusion. In colorectal, endometrial, and lung cohorts, the best internal validation results show high agreement with sequencing-derived TMB labels, and emerging data suggest that image-predicted TMB can stratify response to immune checkpoint inhibitors. At the same time, heterogeneity in TMB definitions, limited external validation, label noise in genomic ground truth, and the black-box nature of many models all constrain clinical translation. Future research must therefore focus on prospective validation, biologically grounded explainability, and seamless integration into digital pathology workflows.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16031340/s1. Supplementary Note S1: Literature searches; Supplemental Table S1: Risk-of-bias assessment of included studies.

Author Contributions

Methodology and formal analysis, H.N.; investigation and visualization, T.S.; writing—original draft preparation, D.M.; writing—review and editing and supervision, M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JST SPRING, Grant Number JPMJSP2139.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Meri-Abad, M.; Moreno-Manuel, A.; García, S.G.; Calabuig-Fariñas, S.; Pérez, R.S.; Herrero, C.C.; Jantus-Lewintre, E. Clinical and Technical Insights of Tumour Mutational Burden in Non-Small Cell Lung Cancer. Crit. Rev. Oncol. Hematol. 2023, 182, 103891.
2. Horndalsveen, H.; Haakensen, V.D.; Madebo, T.; Grønberg, B.H.; Halvorsen, T.O.; Koivunen, J.; Oselin, K.; Cicenas, S.; Helbekkmo, N.; Aanerud, M.; et al. Blood-Based Tumor Mutational Burden as a Biomarker in Unresectable Non-Small Cell Lung Cancer Treated with Chemoradiotherapy and Durvalumab. Front. Oncol. 2025, 15, 1681420.
3. Aggarwal, C.; Ben-Shachar, R.; Gao, Y.; Hyun, S.W.; Rivers, Z.; Epstein, C.; Kaneva, K.; Sangli, C.; Nimeiri, H.; Patel, J. Assessment of Tumor Mutational Burden and Outcomes in Patients with Diverse Advanced Cancers Treated with Immunotherapy. JAMA Netw. Open 2023, 6, e2311181.
4. Marcus, L.; Fashoyin-Aje, L.A.; Donoghue, M.; Yuan, M.; Rodriguez, L.; Gallagher, P.S.; Philip, R.; Ghosh, S.; Theoret, M.R.; Beaver, J.A.; et al. FDA Approval Summary: Pembrolizumab for the Treatment of Tumor Mutational Burden–High Solid Tumors. Clin. Cancer Res. 2021, 27, 4685–4689.
5. Hou, W.; Yi, C.; Zhu, H. Predictive Biomarkers of Colon Cancer Immunotherapy: Present and Future. Front. Immunol. 2022, 13, 1032314.
6. Niknafs, N.; Najjar, M.; Dennehy, C.; Stouras, I.; Anagnostou, V. Of Context, Quality, and Complexity: Fine-Combing Tumor Mutation Burden in Immunotherapy Treated Cancers. Clin. Cancer Res. 2025, 31, 2850–2863.
7. Amirault, K.; Collins, M.; Beker, L.; Mills, B.; Werner, M.; Andreas, J.; Hartman, D.; Dargert, J.; Process, V.; Cederlund, S.; et al. Fully Automated Extraction of High-Quality Total Nucleic Acids from FFPE Specimens for Comprehensive Genomic Profiling of Solid Tumors. SLAS Technol. 2025, 31, 100252.
8. Krieghoff-Henning, E.; Michaeli, T.; Boch, T.; Kirchhof, J.; Haselmann, V.; Neumaier, M.; Hofmann, W.-K.; Betge, J.; Ebert, M.; Teufel, A.; et al. Clinical Benefit of Additional Whole-Exome Sequencing over Panel Sequencing in an All-Comer Real-World Molecular Tumor Board. ESMO Open 2025, 10, 105894.
9. Rączkowska, A.; Paśnik, I.; Kukiełka, M.; Nicoś, M.; Budzinska, M.A.; Kucharczyk, T.; Szumiło, J.; Krawczyk, P.; Crosetto, N.; Szczurek, E. Deep Learning-Based Tumor Microenvironment Segmentation Is Predictive of Tumor Mutations and Patient Survival in Non-Small-Cell Lung Cancer. BMC Cancer 2022, 22, 1001.
10. Chen, Z.; Li, X.; Yang, M.; Zhang, H.; Xu, X.S. Optimization of Deep Learning Models for the Prediction of Gene Mutations Using Unsupervised Clustering. J. Pathol. Clin. Res. 2023, 9, 3–17.
11. Hu, J.; Lv, H.; Zhao, S.; Lin, C.-J.; Su, G.-H.; Shao, Z.-M. Prediction of Clinicopathological Features, Multi-Omics Events and Prognosis Based on Digital Pathology and Deep Learning in HR+/HER2− Breast Cancer. J. Thorac. Dis. 2023, 15, 2528–2543.
12. Nero, C.; Boldrini, L.; Lenkowicz, J.; Giudice, M.T.; Piermattei, A.; Inzani, F.; Pasciuto, T.; Minucci, A.; Fagotti, A.; Zannoni, G.; et al. Deep-Learning to Predict BRCA Mutation and Survival from Digital H&E Slides of Epithelial Ovarian Cancer. Int. J. Mol. Sci. 2022, 23, 11326.
13. Loeffler, C.M.L.; Ortiz Bruechle, N.; Jung, M.; Seillier, L.; Rose, M.; Laleh, N.G.; Knuechel, R.; Brinker, T.J.; Trautwein, C.; Gaisa, N.T.; et al. Artificial Intelligence–Based Detection of FGFR3 Mutational Status Directly from Routine Histology in Bladder Cancer: A Possible Preselection for Molecular Testing? Eur. Urol. Focus. 2022, 8, 472–479.
14. Puget, C.; Ganz, J.; Ostermaier, J.; Conrad, T.; Parlak, E.; Bertram, C.A.; Kiupel, M.; Breininger, K.; Aubreville, M.; Klopfleisch, R. Artificial Intelligence Can Be Trained to Predict c-KIT-11 Mutational Status of Canine Mast Cell Tumors from Hematoxylin and Eosin-Stained Histological Slides. Vet. Pathol. 2025, 62, 152–160.
15. Zanoletti, M.; Ugolini, F.; El Bachiri, L.; Pasini, V.; Laurino, M.; Logu, F.D.; Melissa, E.; Marchi, C.; Colombino, M.; Massi, D.; et al. EGFR Mutation Detection in Whole Slide Images of Non-Small Cell Lung Cancers Using a Two-Stage Deep Transfer Learning Approach. Cancer Med. 2025, 14, e71249.
16. Shamai, G.; Livne, A.; Polónia, A.; Sabo, E.; Cretu, A.; Bar-Sela, G.; Kimmel, R. Deep Learning-Based Image Analysis Predicts PD-L1 Status from H&E-Stained Histopathology Images in Breast Cancer. Nat. Commun. 2022, 13, 6753.
17. Kather, J.N.; Pearson, A.T.; Halama, N.; Jäger, D.; Krause, J.; Loosen, S.H.; Marx, A.; Boor, P.; Tacke, F.; Neumann, U.P.; et al. Deep Learning Can Predict Microsatellite Instability Directly from Histology in Gastrointestinal Cancer. Nat. Med. 2019, 25, 1054–1056.
18. Nakagaki, R.; Debsarkar, S.S.; Kawanaka, H.; Aronow, B.J.; Prasath, V.B.S. Deep Learning-Based IDH1 Gene Mutation Prediction Using Histopathological Imaging and Clinical Data. Comput. Biol. Med. 2024, 179, 108902.
19. Shi, J.; Sun, D.; Wu, K.; Jiang, Z.; Kong, X.; Wang, W.; Wu, H.; Zheng, Y. Positional Encoding-Guided Transformer-Based Multiple Instance Learning for Histopathology Whole Slide Images Classification. Comput. Methods Programs Biomed. 2025, 258, 108491.
20. Yang, D.; Miao, Y.; Liu, C.; Zhang, N.; Zhang, D.; Guo, Q.; Gao, S.; Li, L.; Wang, J.; Liang, S.; et al. Advances in Artificial Intelligence Applications in the Field of Lung Cancer. Front. Oncol. 2024, 14, 1449068.
21. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews. BMJ 2021, 372, n71.
22. Moons, K.G.M.; Damen, J.A.A.; Kaul, T.; Hooft, L.; Andaur Navarro, C.; Dhiman, P.; Beam, A.L.; Van Calster, B.; Celi, L.A.; Denaxas, S.; et al. PROBAST+AI: An Updated Quality, Risk of Bias, and Applicability Assessment Tool for Prediction Models Using Regression or Artificial Intelligence Methods. BMJ 2025, 388, e082505.
23. Shimada, Y.; Okuda, S.; Watanabe, Y.; Tajima, Y.; Nagahashi, M.; Ichikawa, H.; Nakano, M.; Sakata, J.; Takii, Y.; Kawasaki, T.; et al. Histopathological Characteristics and Artificial Intelligence for Predicting Tumor Mutational Burden-High Colorectal Cancer. J. Gastroenterol. 2021, 56, 547–559.
24. Wang, W.; Shi, W.; Nie, C.; Xing, W.; Yang, H.; Li, F.; Liu, J.; Tian, G.; Wang, B.; Yang, J. Prediction of Colorectal Cancer Microsatellite Instability and Tumor Mutational Burden from Histopathological Images Using Multiple Instance Learning. Biomed. Signal Process. Control 2025, 104, 107608.
25. Liu, Y.; Huang, K.; Yang, Y.; Wu, Y.; Gao, W. Prediction of Tumor Mutation Load in Colorectal Cancer Histopathological Images Based on Deep Learning. Front. Oncol. 2022, 12, 906888.
26. Zheng, Q.; Wang, X.; Yang, R.; Fan, J.; Yuan, J.; Liu, X.; Wang, L.; Xiao, Z.; Chen, Z. Predicting Tumor Mutation Burden and VHL Mutation from Renal Cancer Pathology Slides with Self-Supervised Deep Learning. Cancer Med. 2024, 13, e70112.
27. Yang, Y.; Liu, Z.; Huang, J.; Sun, X.; Ao, J.; Zheng, B.; Chen, W.; Shao, Z.; Hu, H.; Yang, Y.; et al. Histological Diagnosis of Unprocessed Breast Core-Needle Biopsy via Stimulated Raman Scattering Microscopy and Multi-Instance Learning. Theranostics 2023, 13, 1342–1354.
28. Huang, B.; Tian, S.; Zhan, N.; Ma, J.; Huang, Z.; Zhang, C.; Zhang, H.; Ming, F.; Liao, F.; Ji, M.; et al. Accurate Diagnosis and Prognosis Prediction of Gastric Cancer Using Deep Learning on Digital Pathological Images: A Retrospective Multicentre Study. EBioMedicine 2021, 73, 103631.
29. Ghaffari Laleh, N.; Muti, H.S.; Loeffler, C.M.L.; Echle, A.; Saldanha, O.L.; Mahmood, F.; Lu, M.Y.; Trautwein, C.; Langer, R.; Dislich, B.; et al. Benchmarking Weakly-Supervised Deep Learning Pipelines for Whole Slide Classification in Computational Pathology. Med. Image Anal. 2022, 79, 102474.
30. Feng, M.; Zhao, Y.; Chen, J.; Zhao, T.; Mei, J.; Fan, Y.; Lin, Z.; Yao, J.; Bu, H. A Deep Learning Model for Lymph Node Metastasis Prediction Based on Digital Histopathological Images of Primary Endometrial Cancer. Quant. Imaging Med. Surg. 2023, 13, 1899–1913.
31. Tampu, I.E.; Nyman, P.; Spyretos, C.; Blystad, I.; Shamikh, A.; Prochazka, G.; De Ståhl, T.D.; Sandgren, J.; Lundberg, P.; Haj-Hosseini, N. Pediatric Brain Tumor Classification Using Digital Pathology and Deep Learning: Evaluation of SOTA Methods on a Multi-center Swedish Cohort. Brain Pathol. 2025, 36, e70029.
32. Li, K.; Qian, Z.; Han, Y.; Chang, E.I.-C.; Wei, B.; Lai, M.; Liao, J.; Fan, Y.; Xu, Y. Weakly Supervised Histopathology Image Segmentation with Self-Attention. Med. Image Anal. 2023, 86, 102791.
33. Xu, H.; Wang, M.; Shi, D.; Qin, H.; Zhang, Y.; Liu, Z.; Madabhushi, A.; Gao, P.; Cong, F.; Lu, C. When Multiple Instance Learning Meets Foundation Models: Advancing Histological Whole Slide Image Analysis. Med. Image Anal. 2025, 101, 103456.
34. Fu, F.; Zhang, X.; Wang, Z.; Xie, L.; Fu, M.; Peng, J.; Wu, J.; Wang, Z.; Guan, T.; He, Y.; et al. A Pathology-Attention Multi-Instance Learning Framework for Multimodal Classification of Colorectal Lesions. Front. Pharmacol. 2025, 16, 1592950.
35. Kim, J.S.; Lee, J.H.; Yeon, Y.; An, D.; Kim, S.J.; Noh, M.-G.; Lee, S. Predicting Nottingham Grade in Breast Cancer Digital Pathology Using a Foundation Model. Breast Cancer Res. 2025, 27, 58.
36. Zeng, Q.; Klein, C.; Caruso, S.; Maille, P.; Laleh, N.G.; Sommacale, D.; Laurent, A.; Amaddeo, G.; Gentien, D.; Rapinat, A.; et al. Artificial Intelligence Predicts Immune and Inflammatory Gene Signatures Directly from Hepatocellular Carcinoma Histology. J. Hepatol. 2022, 77, 116–127.
37. Tourniaire, P.; Ilie, M.; Hofman, P.; Ayache, N.; Delingette, H. MS-CLAM: Mixed Supervision for the Classification and Localization of Tumors in Whole Slide Images. Med. Image Anal. 2023, 85, 102763.
38. Jin, S.; Xu, H.; Dong, Y.; Wang, X.; Hao, X.; Qin, F.; Wang, R.; Cong, F. Ranking Attention Multiple Instance Learning for Lymph Node Metastasis Prediction on Multicenter Cervical Cancer MRI. J. Appl. Clin. Med. Phys. 2024, 25, e14547.
39. Zhang, X.; Liu, C.; Zhu, H.; Wang, T.; Du, Z.; Ding, W. A Universal Multiple Instance Learning Framework for Whole Slide Image Analysis. Comput. Biol. Med. 2024, 178, 108714.
40. Hashimoto, N.; Hanada, H.; Miyoshi, H.; Nagaishi, M.; Sato, K.; Hontani, H.; Ohshima, K.; Takeuchi, I. Multimodal Gated Mixture of Experts Using Whole Slide Image and Flow Cytometry for Multiple Instance Learning Classification of Lymphoma. J. Pathol. Inform. 2024, 15, 100359.
41. Yu, C.; Meng, X.; Li, Y.; Zhao, Z.; Zhang, Y. TG-Mamba: Leveraging Text Guidance for Predicting Tumor Mutation Burden in Lung Cancer. Comput. Med. Imaging Graph. 2025, 124, 102626.
42. Sun, C.; Luo, T.; Liu, Z.; Ge, J.; Shao, L.; Liu, X.; Li, B.; Zhang, S.; Qiu, Q.; Wei, W.; et al. Tumor Mutation Burden–Related Histopathologic Features for Predicting Overall Survival in Gliomas Using Graph Deep Learning. Am. J. Pathol. 2023, 193, 2111–2121.
43. Niu, Y.; Wang, L.; Zhang, X.; Han, Y.; Yang, C.; Bai, H.; Huang, K.; Ren, C.; Tian, G.; Yin, S.; et al. Predicting Tumor Mutational Burden From Lung Adenocarcinoma Histopathological Images Using Deep Learning. Front. Oncol. 2022, 12, 927426.
44. Wang, L.; Jiao, Y.; Qiao, Y.; Zeng, N.; Yu, R. A Novel Approach Combined Transfer Learning and Deep Learning to Predict TMB from Histology Image. Pattern Recognit. Lett. 2020, 135, 244–248.
45. Sadhwani, A.; Chang, H.-W.; Behrooz, A.; Brown, T.; Auvigne-Flament, I.; Patel, H.; Findlater, R.; Velez, V.; Tan, F.; Tekiela, K.; et al. Comparative Analysis of Machine Learning Approaches to Classify Tumor Mutation Burden in Lung Adenocarcinoma Using Histopathology Images. Sci. Rep. 2021, 11, 16605.
46. Huang, K.; Lin, B.; Liu, J.; Liu, Y.; Li, J.; Tian, G.; Yang, J. Predicting Colorectal Cancer Tumor Mutational Burden from Histopathological Images and Clinical Information Using Multi-Modal Deep Learning. Bioinformatics 2022, 38, 5108–5115.
47. Liu, X.; Liu, Z.; Yan, Y.; Wang, K.; Wang, A.; Ye, X.; Wang, L.; Wei, W.; Li, B.; Sun, C.; et al. Development of Prognostic Biomarkers by TMB-Guided WSI Analysis: A Two-Step Approach. IEEE J. Biomed. Health Inform. 2023, 27, 1780–1789.
48. Dammak, S.; Cecchini, M.J.; Breadner, D.; Ward, A.D. Using Deep Learning to Predict Tumor Mutational Burden from Scans of H&E-Stained Multicenter Slides of Lung Squamous Cell Carcinoma. J. Med. Imag. 2023, 10, 017502.
49. Li, J.; Liu, H.; Liu, W.; Zong, P.; Huang, K.; Li, Z.; Li, H.; Xiong, T.; Tian, G.; Li, C.; et al. Predicting Gastric Cancer Tumor Mutational Burden from Histopathological Images Using Multimodal Deep Learning. Brief. Funct. Genom. 2024, 23, 228–238.
50. Wang, C.-W.; Firdi, N.P.; Lee, Y.-C.; Chu, T.-C.; Muzakky, H.; Liu, T.-C.; Lai, P.-J.; Chao, T.-K. Deep Learning for Endometrial Cancer Subtyping and Predicting Tumor Mutational Burden from Histopathological Slides. npj Precis. Oncol. 2024, 8, 287.
51. Zhang, Y.; Han, J.; Chen, H.; Hu, F.; Huang, Y.; Tian, G.; Zhong, D.; Yang, J. Deep Learning-Based Fusion of Nuclear Segmentation Features for Microsatellite Instability and Tumor Mutational Burden Prediction in Digestive Tract Cancers: A Multicenter Validation Study. Brief. Bioinform. 2025, 26, bbaf580.
52. Al-Rubaian, A.; Gunesli, G.N.; Althakfi, W.A.; Azam, A.; Snead, D.; Rajpoot, N.M.; Raza, S.E.A. CellOMaps: A Compact Representation for Robust Classification of Lung Adenocarcinoma Growth Patterns. Comput. Biol. Med. 2025, 192, 110127.
53. Wang, C.-W.; Muzakky, H.; Lee, Y.-C.; Chung, Y.-P.; Wang, Y.-C.; Yu, M.-H.; Wu, C.-H.; Chao, T.-K. Interpretable Multi-Stage Attention Network to Predict Cancer Subtype, Microsatellite Instability, TP53 Mutation and TMB of Endometrial and Colorectal Cancer. Comput. Med. Imaging Graph. 2025, 121, 102499.
54. Wang, C.-W.; Liu, T.-C.; Lai, P.-J.; Muzakky, H.; Wang, Y.-C.; Yu, M.-H.; Wu, C.-H.; Chao, T.-K. Ensemble Transformer-Based Multiple Instance Learning to Predict Pathological Subtypes and Tumor Mutational Burden from Histopathological Whole Slide Images of Endometrial and Colorectal Cancer. Med. Image Anal. 2025, 99, 103372.
55. Bendani, H.; Boumajdi, N.; Belyamani, L.; Ibrahimi, A. A Decision-Aid Model for Predicting Triple-Negative Breast Cancer ICI Response Based on Tumor Mutation Burden. BioMedInformatics 2025, 5, 9.
56. Merino, D.M.; McShane, L.M.; Fabrizio, D.; Funari, V.; Chen, S.-J.; White, J.R.; Wenz, P.; Baden, J.; Barrett, J.C.; Chaudhary, R.; et al. Establishing Guidelines to Harmonize Tumor Mutational Burden (TMB): In Silico Assessment of Variation in TMB Quantification across Diagnostic Platforms: Phase I of the Friends of Cancer Research TMB Harmonization Project. J. Immunother. Cancer 2020, 8, e000147.
57. Ruel, L.-J.; Li, Z.; Gaudreault, N.; Henry, C.; Saavedra Armero, V.; Boudreau, D.K.; Zhang, T.; Landi, M.T.; Labbé, C.; Couture, C.; et al. Tumor Mutational Burden by Whole-Genome Sequencing in Resected NSCLC of Never Smokers. Cancer Epidemiol. Biomark. Prev. 2022, 31, 2219–2227.
58. Mo, S.-F.; Cai, Z.-Z.; Kuai, W.-H.; Li, X.; Chen, Y.-T. Universal Cutoff for Tumor Mutational Burden in Predicting the Efficacy of Anti-PD-(L)1 Therapy for Advanced Cancers. Front. Cell Dev. Biol. 2023, 11, 1209243.
59. Peters, S.; Dziadziuszko, R.; Morabito, A.; Felip, E.; Gadgeel, S.M.; Cheema, P.; Cobo, M.; Andric, Z.; Barrios, C.H.; Yamaguchi, M.; et al. Atezolizumab versus Chemotherapy in Advanced or Metastatic NSCLC with High Blood-Based Tumor Mutational Burden: Primary Analysis of BFAST Cohort C Randomized Phase 3 Trial. Nat. Med. 2022, 28, 1831–1839.
60. Barnett, R.M.; Jang, A.; Lanka, S.; Fu, P.; Bucheit, L.A.; Babiker, H.; Bryce, A.; Meyer, H.M.; Choi, Y.; Moore, C.; et al. Blood-Based Tumor Mutational Burden Impacts Clinical Outcomes of Immune Checkpoint Inhibitor Treated Breast and Prostate Cancers. Commun. Med. 2024, 4, 256.
61. Fancello, L.; Gandini, S.; Pelicci, P.G.; Mazzarella, L. Tumor Mutational Burden Quantification from Targeted Gene Panels: Major Advancements and Challenges. J. Immunother. Cancer 2019, 7, 183.
62. Marques, A.; Cavaco, P.; Torre, C.; Sepodes, B.; Rocha, J. Tumor Mutational Burden in Colorectal Cancer: Implications for Treatment. Crit. Rev. Oncol. Hematol. 2024, 197, 104342.
63. Canale, M.; Urbini, M.; Petracci, E.; Angeli, D.; Tedaldi, G.; Priano, I.; Cravero, P.; Flospergher, M.; Andrikou, K.; Bennati, C.; et al. Genomic Profiling of Extensive Stage Small-Cell Lung Cancer Patients Identifies Molecular Factors Associated with Survival. Lung Cancer Targets Ther. 2025, 16, 11–23.
64. Fang, H.; Bertl, J.; Zhu, X.; Lam, T.C.; Wu, S.; Shih, D.J.H.; Wong, J.W.H. Tumour Mutational Burden Is Overestimated by Target Cancer Gene Panels. J. Natl. Cancer Cent. 2023, 3, 56–64.
65. Budak, B.; Arga, K.Y. Tumor Mutation Burden as a Cornerstone in Precision Oncology Landscapes: Effect of Panel Size and Uncertainty in Cutoffs. OMICS A J. Integr. Biol. 2024, 28, 193–203.
66. Kang, Y.-J.; O’Haire, S.; Franchini, F.; IJzerman, M.; Zalcberg, J.; Macrae, F.; Canfell, K.; Steinberg, J. A Scoping Review and Meta-Analysis on the Prevalence of Pan-Tumour Biomarkers (dMMR, MSI, High TMB) in Different Solid Tumours. Sci. Rep. 2022, 12, 20495.
67. Lee, M. Recent Advancements in Deep Learning Using Whole Slide Imaging for Cancer Prognosis. Bioengineering 2023, 10, 897.
68. Plass, M.; Kargl, M.; Kiehl, T.; Regitnig, P.; Geißler, C.; Evans, T.; Zerbe, N.; Carvalho, R.; Holzinger, A.; Müller, H. Explainability and Causability in Digital Pathology. J. Pathol. Clin. Res. 2023, 9, 251–260.
69. Pantanowitz, L.; Hanna, M.; Pantanowitz, J.; Lennerz, J.; Henricks, W.H.; Shen, P.; Quinn, B.; Bennet, S.; Rashidi, H.H. Regulatory Aspects of Artificial Intelligence and Machine Learning. Mod. Pathol. 2024, 37, 100609.
70. Yang, H.; Wang, J.; Wang, W.; Shi, S.; Liu, L.; Yao, Y.; Tian, G.; Wang, P.; Yang, J. MMsurv: A Multimodal Multi-Instance Multi-Cancer Survival Prediction Model Integrating Pathological Images, Clinical Information, and Sequencing Data. Brief. Bioinform. 2025, 26, bbaf209.
71. Chen, Z.; Chen, Y.; Sun, Y.; Tang, L.; Zhang, L.; Hu, Y.; He, M.; Li, Z.; Cheng, S.; Yuan, J.; et al. Predicting Gastric Cancer Response to Anti-HER2 Therapy or Anti-HER2 Combined Immunotherapy Based on Multi-Modal Data. Signal Transduct. Target. Ther. 2024, 9, 222.
72. Stahlschmidt, S.R.; Ulfenborg, B.; Synnergren, J. Multimodal Deep Learning for Biomedical Data Fusion: A Review. Brief. Bioinform. 2022, 23, bbab569.
73. Campanella, G.; Chen, S.; Singh, M.; Verma, R.; Muehlstedt, S.; Zeng, J.; Stock, A.; Croken, M.; Veremis, B.; Elmas, A.; et al. A Clinical Benchmark of Public Self-Supervised Pathology Foundation Models. Nat. Commun. 2025, 16, 3640.
74. Xu, H.; Usuyama, N.; Bagga, J.; Zhang, S.; Rao, R.; Naumann, T.; Wong, C.; Gero, Z.; González, J.; Gu, Y.; et al. A Whole-Slide Foundation Model for Digital Pathology from Real-World Data. Nature 2024, 630, 181–188.
75. Wu, J.; Ke, X.; Jiang, X.; Wu, H.; Kong, Y.; Shao, L. Leveraging Tumor Heterogeneity: Heterogeneous Graph Representation Learning for Cancer Survival Prediction in Whole Slide Images. Adv. Neural Inf. Process. Syst. 2024, 37, 64312–64337.
76. Jiao, F.; Shang, Z.; Lu, H.; Chen, P.; Chen, S.; Xiao, J.; Zhang, F.; Zhang, D.; Lv, C.; Han, Y. A Weakly Supervised Deep Learning Framework for Automated PD-L1 Expression Analysis in Lung Cancer. Front. Immunol. 2025, 16, 1540087.
77. Yan, F.; Da, Q.; Yi, H.; Deng, S.; Zhu, L.; Zhou, M.; Liu, Y.; Feng, M.; Wang, J.; Wang, X.; et al. Artificial Intelligence-Based Assessment of PD-L1 Expression in Diffuse Large B Cell Lymphoma. npj Precis. Oncol. 2024, 8, 76.

Figure 1. Conventional genomic TMB testing versus AI-based TMB prediction from histopathology. Tumor tissue obtained by surgical resection or biopsy is processed into FFPE blocks. In the conventional workflow (upper row), DNA is extracted from FFPE, subjected to NGS using a targeted panel or WES, and TMB is quantified genomically, typically with a turnaround time of approximately 3 weeks. In the proposed AI-augmented workflow (lower row), routine H&E slides are digitized as WSIs and a deep learning model directly predicts an image-based TMB estimate, providing rapid decision support for immunotherapy.

Figure 2. PRISMA 2020 flowchart of the study identification and selection process for the systematic review.

Figure 3. Conceptual pipeline for deep learning-based prediction of TMB from H&E WSIs. (A) Data preprocessing: tissue/ROI selection, patch tiling, and quality control. (B) Patch feature extraction: CNN/SSL/ViT encoders transform each patch into a feature embedding. (C) Slide-level aggregation: patch embeddings are pooled into a slide representation to output a TMB score or TMB-high/low prediction. (D) Evaluation: performance is tested on internal or external cohorts using standard metrics (e.g., AUC/PR, accuracy) and optional survival association analyses.

Figure 4. PROBAST risk-of-bias results summarised for the 17 papers included in this review.

Figure 5. Summary distributions of study characteristics across included articles. (a) Cancer types represented. (b) Endpoint definitions (TMB ≥ 10 mut/Mb vs. other thresholds). (c) Validation design (internal validation only vs. internal + external validation). (d) Dataset sources (public datasets only vs. public + private datasets).

Figure 6. Overview of deep learning studies predicting tumor mutational burden from H&E WSIs. Each bubble represents one tumor-specific entry extracted from an included study, positioned by publication year on the x-axis and AUC on the y-axis; the plotted AUC is selected according to the extraction rules described in Methods. Bubble color indicates cancer type, and bubble size is proportional to the number of patients included. Bubbles with a horizontal bar denote models evaluated on an independent external validation cohort, whereas bubbles without a bar indicate internal validation only; bubbles marked with an upward arrow correspond to multimodal models that integrate WSIs with additional non-image data. Abbreviations: AUC, area under the receiver operating characteristic curve; STAD, stomach adenocarcinoma; CRC, colorectal cancer; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; RCC, renal cell carcinoma; EC, endometrial carcinoma.

Table 1. Summary of deep learning studies predicting tumor mutational burden (TMB) from H&E whole-slide images.

Study	Year	Cancer Type	Cohort	Sample Size (Patients)	TMB Cutoff	TMB Labeling Assay	Architecture	Performance (AUC, Unless Otherwise Specified)
Wang et al. [44]	2020	STAD, COAD	TCGA	284 (STAD), 360 (COAD)	Upper tertile	WES	CNN	0.75 (STAD), 0.82 (COAD)
Sadhwani et al. [45]	2021	LUAD	TCGA	414	70th percentile	WES	CNN	0.77
Shimada et al. [23]	2021	CRC	TCGA, JP-CRC	278	27 mut/Mb	TCGA: WES; JP-CRC: Gene panel	CNN	0.934
Huang et al. [46]	2022	CRC	TCGA	509	20 mut/Mb	WES	CNN + MCB	0.817
Niu et al. [43]	2022	LUAD	TCGA	427	10 mut/Mb	WES	CNN	0.641
Liu et al. [25]	2022	CRC	TCGA	509	10 mut/Mb; 20 mut/Mb	WES	CNN	0.729 (cutoff 10), 0.774 (cutoff 20)
Liu et al. [47]	2023	RCC	TCGA, Private	566	TCGA: 2.413 mut/Mb; Private: 6.053 mut/Mb	TCGA: WES; Private: Gene panel	CNN + Logistic Regression	0.655
Dammak et al. [48]	2023	LUSC	TCGA	50	10 mut/Mb	WES	CNN	0.65
Li et al. [49]	2024	GC	TCGA	326	10 mut/Mb	WES	CNN + MCB	0.749 (image only), 0.971 (multimodal)
Zheng et al. [26]	2024	RCC	TCGA, CPTAC	513	0.9 mut/Mb	TCGA: WES; CPTAC: NS	SSL-ABMIL	0.83
Wang et al. [50]	2024	EC	TCGA	592	10 mut/Mb	WES	TR-MAMIL	0.82 (Aggressive EC), 0.56 (Non-aggressive EC)
Zhang et al. [51]	2025	GC, CRC	TCGA, Private	400 (GC), 387 (CRC)	Not reported	TCGA: WES; Private: NS	Fusion-DTFD-MIL	0.80 (GC), 0.76 (CRC)
Yu et al. [41]	2025	LUAD	TCGA	230	10 mut/Mb	WES	TG-Mamba (Text-Guided)	0.994
Wang et al. [24]	2025	CRC	TCGA, CPTAC	587	20 mut/Mb	TCGA: WES; CPTAC: NS	CasNet (two-stage MIL)	0.881 (internal validation), 0.577 (external validation)
Al-Rubaian et al. [52]	2025	LUAD	TCGA	372	10 mut/Mb	WES	CellOMaps representation + CNN	0.67
Wang et al. [53]	2025	EC, CRC	TCGA	594	10 mut/Mb	WES	IMAN (Multi-scale attention MIL)	0.81 (Accuracy)
Wang et al. [54]	2025	EC, CRC	TCGA	529 (EC), 594 (CRC)	10 mut/Mb	WES	ETMIL-SSLViT	0.83 (aggressive EC), 0.62 (non-aggressive EC), 0.90 (non-mucinous CRC), 0.72 (mucinous CRC)

Abbreviations: TMB, tumor mutational burden; AUC, area under the receiver operating characteristic curve; STAD, stomach adenocarcinoma; COAD, colon adenocarcinoma; CRC, colorectal cancer; GC, gastric cancer; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; RCC, renal cell carcinoma; EC, endometrial carcinoma; TCGA, The Cancer Genome Atlas; CPTAC, Clinical Proteomic Tumor Analysis Consortium; JP-CRC, Japanese colorectal cancer cohort; mut/Mb, mutations per megabase; CNN, convolutional neural network; MCB, multimodal compact bilinear pooling; SSL, self-supervised learning; MIL, multiple instance learning; ABMIL, attention-based multiple instance learning; SSL-ABMIL, self-supervised attention-based multiple instance learning; TR-MAMIL, truncated ResNet-based multilayer attention multiple instance learning; DTFD-MIL, double-tier feature distillation multiple instance learning; CasNet, cascaded network; TG-Mamba, text-guided Mamba model; IMAN, interpretable multi-stage attention network; ETMIL-SSLViT, ensemble transformer-based multiple instance learning with self-supervised vision transformer encoder; WES, Whole Exome Sequencing; NS, Not specified.

Table 2. Internal vs. external validation performance.

Study	Internal Cohort	Internal Performance (AUC)	External Cohort	External Performance (AUC)	Δ (Ext − Int)
Wang et al. [24]	TCGA-CRC	0.881	CPTAC	0.577	−0.304
Zheng et al. [26]	TCGA-RCC	0.84	CPTAC	0.83	−0.01
Liu et al. [47]	Private	0.813	TCGA-RCC	0.655	−0.158
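
The Δ column is the external AUC minus the internal AUC, so a negative value indicates a drop in discrimination on the external cohort. A minimal sketch of that arithmetic follows, using the AUC values transcribed from Table 2; the variable and function names (`table2`, `auc_delta`) are illustrative assumptions and do not come from the reviewed studies.

```python
# AUC values transcribed from Table 2: (study, internal AUC, external AUC).
# The tuple layout and all names here are illustrative, not from the source.
table2 = [
    ("Wang et al. [24]", 0.881, 0.577),
    ("Zheng et al. [26]", 0.84, 0.83),
    ("Liu et al. [47]", 0.813, 0.655),
]


def auc_delta(internal_auc: float, external_auc: float) -> float:
    """Return external minus internal AUC, rounded to three decimals."""
    return round(external_auc - internal_auc, 3)


# Map each study to its internal-to-external change in AUC.
deltas = {study: auc_delta(internal, external) for study, internal, external in table2}

for study, delta in deltas.items():
    print(f"{study}: {delta:+.3f}")
```

Rounding to three decimals matches the precision reported in the table and avoids floating-point residue in the differences.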

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ma, D.; Nishikubo, H.; Sano, T.; Yashiro, M. Deep Learning-Based Prediction of Tumor Mutational Burden from Digital Pathology Slides: A Comprehensive Review. Appl. Sci. 2026, 16, 1340. https://doi.org/10.3390/app16031340

