2. Materials and Methods
2.1. Data Sources and Description
This study utilized five publicly available, high-quality glioma datasets: (1) the University of Pennsylvania Health System Glioblastoma dataset (UPENN-GBM) [13]; (2) the preoperative diffuse glioma MRI dataset from the University of California, San Francisco (UCSF-PDGM) [14]; (3) the TCGA-GBM and (4) CPTAC-GBM datasets, both drawn from the 2021 Brain Tumor Segmentation (BraTS) Challenge [15,16]; and (5) the longitudinal GBM dataset from Bern University Hospital, Switzerland (LUMIERE) [17].
The UPENN-GBM dataset contains data from 630 patients, of whom 611 have preoperative MRI and 19 have only postoperative MRI. The UCSF-PDGM dataset comprises 501 adult patients with histopathologically confirmed diffuse gliomas, including 56 WHO grade II and 43 WHO grade III cases. In this study, only patients diagnosed with WHO grade IV GBM were retained for analysis. The TCGA-GBM and CPTAC-GBM datasets, both contributing to the 2021 BraTS Challenge, include preoperative MRI scans from 135 and 39 GBM patients, respectively. The LUMIERE dataset consists of longitudinal imaging data from 91 GBM patients. The detailed data inclusion process is shown in Figure 1.
In addition to MRI data, whole-slide images of histopathological specimens were obtained from The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA) for subsets of patients in the TCGA-GBM, CPTAC-GBM, and UPENN-GBM cohorts. Specifically, the TCGA-GBM cohort includes 860 WSIs from 389 patients, the CPTAC-GBM cohort contains 527 WSIs from 200 patients, and the UPENN-GBM cohort provides 71 WSIs from 34 patients.
For MRI preprocessing, skull stripping was performed on all scans using the Brain Extraction Tool (BET) from the FMRIB Software Library (FSL 6.0.6.5) [18]. The skull-stripped T1-weighted contrast-enhanced (T1CE) images were then registered to the SRI24 atlas. Subsequently, T1-weighted (T1WI), T2-weighted (T2WI), and FLAIR sequences were co-registered to the T1CE space using the Advanced Normalization Tools (ANTs). To facilitate robust feature extraction, all images were resampled to a common isotropic resolution, yielding volumes of a standardized size in the atlas space. Intensity normalization and standardization were applied across all scans to enhance cross-scanner and multi-center comparability.
WSIs are stored in a pyramidal format. Because the maximum magnification varies across slides, all WSIs were uniformly processed at a single common magnification level. Each WSI was first partitioned into non-overlapping patches of fixed size. Patches dominated by background regions were excluded using Otsu’s automatic thresholding method. Finally, stain normalization was applied to all retained patches using the Macenko method, which estimates and aligns the hematoxylin and eosin stain color vectors across slides. This step reduces inter-institutional and inter-slide variation in staining intensity and color appearance, which are common artifacts arising from differences in tissue processing protocols, scanner settings, and reagent batches. By minimizing these technical discrepancies, the method enhances the consistency of the morphological features learned by the model and improves its generalizability across diverse clinical datasets.
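As a minimal illustration of this preprocessing step, the sketch below filters background patches with a global Otsu threshold; the tissue-fraction cutoff and the `macenko_normalize` call are illustrative placeholders rather than the exact pipeline used in this work.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu

def filter_background_patches(patches, tissue_fraction=0.5):
    """Keep patches whose tissue content exceeds `tissue_fraction`.

    `patches` is a list of same-sized RGB arrays. A global Otsu threshold
    is estimated on the pooled grayscale intensities; pixels darker than
    the threshold are treated as tissue (H&E tissue is darker than the
    bright glass background).
    """
    gray = [rgb2gray(p) for p in patches]
    thresh = threshold_otsu(np.concatenate([g.ravel() for g in gray]))
    kept = []
    for patch, g in zip(patches, gray):
        if (g < thresh).mean() >= tissue_fraction:
            kept.append(patch)
    return kept

# Macenko stain normalization would then be applied to each retained patch,
# e.g. via a stain-normalization library; shown only as a placeholder call
# because the exact implementation is not specified in the text:
# normalized = [macenko_normalize(p, reference=ref_patch) for p in kept]
```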
2.2. Imaging–Pathology Collaborative Survival Prediction Framework
As illustrated in Figure 2, the proposed Imaging–Pathology Collaborative Survival Prediction Framework addresses a key limitation of current research, namely the predominant reliance on preoperative MRI alone for survival prediction while overlooking the rich prognostic information contained in WSIs. To bridge this gap, our framework leverages a multimodal fusion strategy to enable more accurate survival prediction for glioma patients. It consists of two core components: (1) a contrastive-learning-based imaging–pathology semantic alignment (CL-IPSA) module, and (2) an overall survival prediction module. The following subsections provide detailed descriptions of these two components, along with their associated loss functions.
2.2.1. Contrastive-Learning-Based Imaging–Pathology Semantic Alignment Module
The Contrastive-Learning-based Imaging–Pathology Semantic Alignment module employs two encoders, ResNet34 [19] and UNI2-h [20], to extract features from the two distinct modalities (MRI and WSI) of the same patient. Subsequently, a multi-layer perceptron (MLP) projects these heterogeneous modality-specific features into a shared embedding space, enabling cross-modal semantic alignment: it brings together multimodal feature representations from the same patient while separating those from different patients. The module is optimized using the InfoNCE loss function to enhance both inter-modality consistency and discriminability. The schematic diagram of CL-IPSA is shown in Figure 3.
We employ different encoding architectures for MRI and WSI based on their fundamental differences in data characteristics. MRI is a three-dimensional volumetric image that requires capturing spatial structural information. The residual connections introduced by ResNet effectively alleviate the vanishing gradient problem in deep networks, enabling the model to maintain a reasonable depth while being easier to optimize and converge. Therefore, we adopt the 3D ResNet-34, which offers strong learning capacity and stability. In contrast, WSI consists of ultra-high-resolution two-dimensional pathological images, typically processed as a large collection of image patches. UNI2-h is currently the state-of-the-art model pretrained via self-supervised learning on large-scale pathological data, efficiently extracting semantically rich patch-level features and aggregating them to obtain a global representation. This tailored design fully leverages the strengths of each architecture and lays a solid foundation for subsequent cross-modal alignment and fusion.
Specifically, each paired multimodal sample is denoted $(x_i^{\mathrm{MRI}}, X_i^{\mathrm{WSI}})$, where $x_i^{\mathrm{MRI}}$ denotes a 3D MRI volume comprising four sequences (T1, T1CE, T2, and FLAIR), and $X_i^{\mathrm{WSI}} = \{p_{i,1}, \dots, p_{i,N}\}$ represents the set of $N$ histopathology patch features pre-extracted by the UNI2-h model from the tessellated patches.
We employ two dedicated encoders to extract modality-specific semantic embeddings. A 3D ResNet34 backbone serves as the MRI encoder, whose forward pass is defined as
$$z_i^{\mathrm{MRI}} = \mathrm{GAP}\big(f_{\mathrm{ResNet}}(x_i^{\mathrm{MRI}})\big),$$
where $f_{\mathrm{ResNet}}(\cdot)$ denotes the ResNet34 feature extractor and $\mathrm{GAP}(\cdot)$ is global average pooling over the spatial dimensions.
For WSI feature extraction, each patient’s WSI is represented as a variable-length patch set $X_i^{\mathrm{WSI}} = \{p_{i,1}, \dots, p_{i,N}\}$ with $p_{i,n} \in \mathbb{R}^{C}$, where $C$ denotes the dimensionality of each patch-level embedding. Since $N$ varies across slides and patch-level embeddings exhibit substantial redundancy, an aggregation step is required to obtain a fixed-dimensional slide-level embedding.
To this end, we introduce a learnable aggregator $\mathrm{Agg}(\cdot)$, which implements a multi-head self-attention aggregation strategy.
In the MultiHeadAttentionAggregator, the variable-length patch sequences arising from heterogeneous tissue content across WSIs are handled by a padding-aware Transformer encoder. During batched training, input sequences are padded to the maximum length within the batch, and a binary key padding mask is generated to distinguish valid patches from padded positions. This mask is integrated into the self-attention computation to suppress contributions from padding tokens, ensuring that only biologically meaningful regions influence the learned representation. A learnable [CLS] token $p_{\mathrm{cls}}$ is prepended to the masked patch sequence, and the concatenated input $[p_{\mathrm{cls}}; p_{i,1}; \dots; p_{i,N}]$ is processed by the Transformer encoder:
$$z_i^{\mathrm{WSI}} = \mathrm{TransformerEncoder}\big([p_{\mathrm{cls}}; p_{i,1}; \dots; p_{i,N}]\big)_{\mathrm{cls}},$$
where the output at the [CLS] position serves as the aggregated WSI embedding. The architecture is parameterized by several key hyperparameters: the feature dimension $C$ (determined by the pre-extracted patch features; we use 1536), the number of attention heads, the depth of the Transformer encoder, the feed-forward network expansion ratio, and the dropout rate applied within each layer. These parameters are configured via command-line arguments during training and enable flexible modeling of spatial relationships among patches while preserving the hierarchical and non-uniform structure inherent in histopathological data.
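For concreteness, a minimal PyTorch sketch of such a padding-aware aggregator is shown below; the hyperparameter defaults (number of heads, depth, expansion ratio, dropout) are illustrative assumptions, since only the feature dimension C = 1536 is specified above.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionAggregator(nn.Module):
    """Aggregate a variable-length set of patch embeddings into one
    slide-level embedding via a padding-aware Transformer encoder and a
    learnable [CLS] token (hyperparameter defaults are illustrative)."""

    def __init__(self, dim=1536, num_heads=8, depth=2, ffn_ratio=4, dropout=0.1):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads,
            dim_feedforward=ffn_ratio * dim,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches, padding_mask):
        # patches: (B, N, dim) padded patch features
        # padding_mask: (B, N) bool, True marks padded (invalid) positions
        B = patches.size(0)
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, patches], dim=1)            # prepend [CLS]
        # the [CLS] position is always valid, so prepend False to the mask
        mask = torch.cat(
            [torch.zeros(B, 1, dtype=torch.bool, device=patches.device),
             padding_mask], dim=1)
        out = self.encoder(x, src_key_padding_mask=mask)
        return out[:, 0]                                # slide-level embedding
```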
To align the heterogeneous feature distributions and establish a shared semantic embedding space, lightweight projection heads are appended to both modalities:
$$h_i^{\mathrm{MRI}} = g^{\mathrm{MRI}}\big(z_i^{\mathrm{MRI}}\big), \qquad h_i^{\mathrm{WSI}} = g^{\mathrm{WSI}}\big(z_i^{\mathrm{WSI}}\big),$$
where each $g(\cdot)$ is a two-layer MLP (see ProjectionHead). The hidden dimensions are configurable (default: 256 for MRI and 512 for WSI), and the output dimensionality is fixed to a common value shared by both modalities. All embeddings are L2-normalized before similarity computation:
$$\hat{h} = h \,/\, \lVert h \rVert_{2}.$$
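A minimal sketch of such a projection head is given below; the input dimensions and the shared output dimension (set to 128 here) are assumed values, since only the hidden sizes are quoted above.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Two-layer MLP that maps a modality-specific embedding into the
    shared space; output dimension is an assumed value."""

    def __init__(self, in_dim, hidden_dim, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim))

    def forward(self, x):
        # L2-normalize so cosine similarity reduces to a dot product
        return F.normalize(self.net(x), dim=-1)

# e.g. proj_mri = ProjectionHead(in_dim=512, hidden_dim=256)
#      proj_wsi = ProjectionHead(in_dim=1536, hidden_dim=512)
```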
Within this unified embedding space, we employ a symmetric InfoNCE contrastive loss to enable end-to-end optimization.
2.2.2. Overall Survival Prediction Module
In clinical practice, newly enrolled patients typically have access only to multi-sequence MRI scans, whereas corresponding histopathological data are often unavailable. To leverage cross-modal semantic associations under this constraint, we propose a retrieval-based proxy pathology representation strategy. Specifically, for each new patient with MRI-only data, we first extract high-dimensional semantic features from their MRI using a 3D ResNet34 backbone, and use this embedding as a query vector to perform nearest-neighbor retrieval within the cross-modal semantic mapping library constructed via contrastive learning. Retrieval is performed using cosine similarity, and the WSI sample in the library that exhibits the highest semantic similarity to the query MRI is selected.
The retrieved WSI is then encoded using the UNI2-h model to obtain its slide-level histological embedding, which serves as a proxy pathology representation for the patient’s missing tissue data. A standard CNN architecture is used to extract MRI features, which are concatenated channel-wise with the retrieved WSI proxy embedding to form a unified multimodal representation. This joint representation is subsequently fed into a classification head to predict overall survival outcomes for glioma patients.
After training, all model parameters are frozen, and a single forward pass is performed over the entire paired dataset to construct the cross-modal embedding library
$$\mathcal{M} = \big\{\big(\hat{h}_k^{\mathrm{MRI}}, \hat{h}_k^{\mathrm{WSI}}\big)\big\}_{k=1}^{K},$$
where $K$ denotes the total number of paired samples. This library preserves exact modality-wise correspondences and implicitly encodes semantic distances between non-matched samples through cosine similarity.
During inference, for a new patient with MRI-only data, the proxy pathology representation is obtained via maximum-similarity matching:
$$\hat{h}_{\mathrm{proxy}}^{\mathrm{WSI}} = \hat{h}_{k^{*}}^{\mathrm{WSI}}, \qquad k^{*} = \arg\max_{k}\, \cos\big(\hat{h}_{\mathrm{query}}^{\mathrm{MRI}}, \hat{h}_{k}^{\mathrm{WSI}}\big).$$
This proxy representation can then be seamlessly integrated into overall survival prediction, effectively mitigating the limitations imposed by the absence of real WSI data.
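The following sketch illustrates library construction and proxy retrieval, assuming the trained model exposes `encode_mri` and `encode_wsi` methods that return projected, L2-normalized embeddings (hypothetical interface names).

```python
import torch

@torch.no_grad()
def build_library(model, paired_loader):
    """Single frozen forward pass over all paired samples to store the
    normalized MRI and WSI embeddings of the cross-modal library."""
    mri_bank, wsi_bank = [], []
    for mri, patches, mask in paired_loader:
        mri_bank.append(model.encode_mri(mri))
        wsi_bank.append(model.encode_wsi(patches, mask))
    return torch.cat(mri_bank), torch.cat(wsi_bank)

@torch.no_grad()
def retrieve_proxy_wsi(model, mri_query, wsi_bank):
    """Return the library WSI embedding most similar (cosine) to the query MRI."""
    q = model.encode_mri(mri_query)          # (1, d), already L2-normalized
    sims = q @ wsi_bank.T                    # cosine similarity via dot product
    return wsi_bank[sims.argmax(dim=1)]      # proxy pathology representation
```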
2.3. Symmetric InfoNCE Loss Function
In the context of multimodal representation learning for glioma diagnosis, where MRI and WSI data are paired at the patient level, the symmetric InfoNCE loss provides a principled framework for aligning embeddings across modalities. Given two batches of normalized feature embeddings $\{u_i\}_{i=1}^{N}$ (from MRI) and $\{v_i\}_{i=1}^{N}$ (from WSI), each containing $N$ samples corresponding to $N$ distinct patients, the symmetric InfoNCE loss is formulated as the average of two directional contrastive losses:
$$\mathcal{L}_{\mathrm{sym}} = \tfrac{1}{2}\big(\mathcal{L}_{\mathrm{MRI}\to\mathrm{WSI}} + \mathcal{L}_{\mathrm{WSI}\to\mathrm{MRI}}\big).$$
This symmetry ensures that the learning objective treats both modalities equally, preventing bias toward either MRI or WSI during training.
The cross-modal alignment is governed by a similarity matrix that captures pairwise affinities between embeddings from the two modalities within a batch. Let $u_i$ and $v_j$ denote the $d$-dimensional feature representations of the $i$-th MRI scan and the $j$-th whole-slide image, respectively, extracted by their respective encoders. Prior to computing similarity, both vectors are normalized to unit $\ell_2$ norm to ensure scale invariance and to make cosine similarity equivalent to the dot product. The scaled cosine similarities between all $N$ MRI samples and $N$ WSI samples in a batch are then arranged into an $N \times N$ matrix $S$, where each entry is defined as
$$S_{ij} = \frac{u_i^{\top} v_j}{\tau},$$
where $\tau$ is a temperature hyperparameter that modulates the concentration of the resulting similarity distribution. A smaller value of $\tau$ sharpens the softmax probabilities over $S$, effectively increasing the penalty on hard negative pairs and encouraging the model to learn more discriminative features. Conversely, a larger $\tau$ yields a softer probability distribution, which can improve training stability, particularly when the batch size is limited or during early training stages, by providing smoother gradients. This temperature-scaled similarity matrix serves as the foundation for the subsequent bidirectional contrastive objective.
Building upon the similarity matrix $S$ defined above, we first formulate a unidirectional contrastive loss that treats MRI embeddings as query anchors. Specifically, for each patient $i$ in a batch of size $N$, the model is tasked with identifying the corresponding WSI embedding, its positive match, among all $N$ WSI candidates, which include one true positive ($v_i$) and $N-1$ negative samples ($v_j$, $j \neq i$). This is achieved by applying a softmax function over the $i$-th row of $S$, which encodes the affinities between the $i$-th MRI anchor and every WSI in the batch. The resulting loss, denoted $\mathcal{L}_{\mathrm{MRI}\to\mathrm{WSI}}$, is the average negative log-likelihood of correctly retrieving the positive pair:
$$\mathcal{L}_{\mathrm{MRI}\to\mathrm{WSI}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N} \exp(S_{ij})}.$$
Minimizing this objective encourages the diagonal entries $S_{ii}$, representing intra-patient cross-modal pairs, to dominate their respective rows in the softmax-normalized similarity distribution. Consequently, the learned representations are pulled closer for matched MRI–WSI pairs while being pushed away from mismatched inter-patient combinations, thereby enforcing modality alignment from the perspective of radiological features.
To ensure symmetric learning and prevent modality bias, we introduce a complementary contrastive loss in the reverse direction, where WSI embeddings serve as the query anchors. In this formulation, for each patient $i$, the $i$-th WSI is used to retrieve its matching MRI scan from the full set of $N$ MRI embeddings in the batch. This corresponds to evaluating similarities along the $i$-th column of the similarity matrix $S$, since this column aggregates the affinities between all MRI samples (indexed by $j$) and the fixed WSI anchor $i$. The loss $\mathcal{L}_{\mathrm{WSI}\to\mathrm{MRI}}$ is similarly defined as the mean negative log-probability of the correct match under a column-wise softmax:
$$\mathcal{L}_{\mathrm{WSI}\to\mathrm{MRI}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N} \exp(S_{ji})}.$$
Note that while the numerator remains the same diagonal term $\exp(S_{ii})$, the denominator now sums over the $i$-th column ($\sum_{j} \exp(S_{ji})$), reflecting the reversed query role. Optimizing this loss enforces consistency when querying from the pathological modality, ensuring that the joint embedding space is not only aligned but also bidirectionally coherent. This symmetry is crucial for robust multimodal retrieval and representation learning, especially in scenarios where either modality may be used as the input query at test time.
In both cases, the numerator corresponds to the similarity of the positive pair, i.e., embeddings from the same patient, while the denominator aggregates similarities over all possible pairs in the batch, effectively treating every non-matching cross-patient pair as a negative example. This in-batch negative sampling strategy leverages the full batch as a source of contrastive information without requiring external memory banks.
Within the similarity matrix $S$, the diagonal entries $S_{ii}$ represent positive pairs, i.e., aligned MRI–WSI embeddings from the same glioma patient, whereas all off-diagonal entries $S_{ij}$ with $i \neq j$ constitute negative pairs derived from different patients. This structure can be visualized as an $N \times N$ matrix whose diagonal holds the positive-pair similarities and whose off-diagonal entries hold all inter-patient negative-pair similarities.
Mathematically, the symmetric InfoNCE loss exhibits several desirable properties. First, it is inherently symmetric with respect to the two modalities, ensuring that no implicit preference is placed on either MRI or WSI during optimization. As a contrastive objective, it explicitly maximizes the agreement between positive pairs while pushing apart negative ones, thereby encouraging the learned embeddings to capture shared semantic content across imaging domains. The temperature parameter provides fine-grained control over the concentration of the similarity distribution, influencing how aggressively the model distinguishes between correct and incorrect matches. Finally, by treating all other samples within a batch as negative examples, the loss function fully exploits available data without incurring additional computational overhead—making it highly suitable for scalable representation learning in multimodal medical imaging tasks such as glioma characterization.
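A compact implementation of this objective, equivalent to applying cross-entropy over the rows and columns of the scaled similarity matrix, can be sketched as follows.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(u, v, tau=0.07):
    """Symmetric InfoNCE over a batch of paired, L2-normalized embeddings.

    u: (N, d) MRI embeddings; v: (N, d) WSI embeddings, where row i of each
    belongs to the same patient. Row-wise and column-wise cross-entropy
    against the diagonal implements the two directional losses."""
    S = (u @ v.T) / tau                       # scaled cosine similarity matrix
    targets = torch.arange(u.size(0), device=u.device)
    loss_mri_to_wsi = F.cross_entropy(S, targets)      # softmax over rows
    loss_wsi_to_mri = F.cross_entropy(S.T, targets)    # softmax over columns
    return 0.5 * (loss_mri_to_wsi + loss_wsi_to_mri)
```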
2.4. Evaluation Metrics
Within the contrastive learning framework, we primarily adopt Recall@1 and Recall@5 as the core evaluation metrics. Recall@1 measures whether the top-ranked retrieved candidate is the ground-truth cross-modal counterpart, reflecting the model’s precision in a strict top-1 setting. Recall@5 evaluates whether the correct counterpart appears among the five highest-scoring candidates, providing a more tolerant assessment of the model’s discriminative capability.
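For clarity, Recall@K over a batch of paired embeddings can be computed as in the following sketch, which treats each MRI embedding as a query against all WSI embeddings.

```python
import torch

def recall_at_k(mri_emb, wsi_emb, k=5):
    """Fraction of MRI queries whose true WSI counterpart appears among
    the k most similar WSI embeddings (embeddings assumed L2-normalized)."""
    sims = mri_emb @ wsi_emb.T                     # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices             # top-k WSI indices per query
    targets = torch.arange(mri_emb.size(0), device=mri_emb.device).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()
```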
For the overall survival prediction task, formulated as a binary classification problem, we employ commonly used evaluation metrics, including accuracy, area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and the F1 score, to comprehensively assess model performance.
2.5. Implementation Details
We implemented our method using the PyTorch 2.5.0 framework on an NVIDIA A100 GPU with 80 GB of memory. Throughout the entire pipeline, strict separation of training and test sets was maintained to prevent any form of data leakage. Standard data augmentation strategies—including random horizontal flips, rotations, vertical flips, and random cropping—were applied during training to improve model robustness.
It should be noted that the tumor labels were obtained exclusively from publicly available datasets and were not generated by our team through either manual or automated segmentation. To better capture MRI features of the region of interest, for all MRI datasets we centered the crop on the centroid of the solid tumor and extracted a fixed-size volume around it.
During the contrastive learning phase, we applied L2 regularization via weight decay to mitigate overfitting. The model was optimized using the Adam optimizer, and a cosine annealing learning-rate scheduler gradually reduced the learning rate from its initial value over 100 epochs; the batch size was fixed at 15. We performed a grid search over candidate values and selected the InfoNCE temperature parameter $\tau = 0.07$ based on validation set performance.
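A schematic of this training configuration is shown below; the learning rate and weight-decay values are placeholders (the exact values are not reproduced here), and `model` and `train_loader` follow the hypothetical interface used in the earlier sketches.

```python
import torch

# Placeholder values: only the optimizer family, scheduler, epoch count,
# batch size, and temperature are stated in the text.
LEARNING_RATE = 1e-4      # assumed
WEIGHT_DECAY = 1e-4       # assumed (L2 regularization via weight decay)
EPOCHS = 100
TEMPERATURE = 0.07

optimizer = torch.optim.Adam(model.parameters(),
                             lr=LEARNING_RATE,
                             weight_decay=WEIGHT_DECAY)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    for mri, patches, mask in train_loader:       # batch size 15 in this work
        u = model.encode_mri(mri)
        v = model.encode_wsi(patches, mask)
        loss = symmetric_infonce(u, v, tau=TEMPERATURE)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```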
For the overall survival prediction task, we adopted the same optimizer and learning-rate schedule as in the contrastive learning stage, but increased the batch size to 20 to better accommodate the classification setting.
3. Results
3.1. Patient Characteristics
This study utilized five high-quality glioma datasets comprising imaging data from more than 1000 patients. Strict inclusion and exclusion criteria were applied to ensure the integrity and validity of the experiments. In the contrastive-learning-based imaging–pathology semantic alignment module, three datasets were employed: TCGA-GBM, CPTAC-GBM, and UPENN-GBM, serving as the training, validation, and test sets, respectively. All patients in these datasets had paired MRI and whole-slide imaging data available.
For the overall survival prediction module, three distinct datasets were used: UPENN-GBM, UCSF-PDGM, and LUMIERE. The UPENN-GBM cohort was randomly divided into training and validation subsets using a 7:3 ratio, whereas UCSF-PDGM and LUMIERE were treated as two independent external test sets. Importantly, the datasets used in the survival prediction module did not overlap with those used for contrastive learning; only MRI scans and survival time labels were available for these patients. Detailed clinical characteristics are summarized in Table 1. Notably, due to data access restrictions, the CPTAC-GBM dataset provided only MRI and WSI images without accompanying clinical information. In addition, we cross-checked unique subject identifiers across all sources used for contrastive alignment and survival prediction and confirmed no overlap.
3.2. Imaging Pathology Semantic Alignment
We developed a CL-IPSA framework to enable the preoperative fusion of radiological and histopathological information for accurate overall survival prediction. Specifically, CL-IPSA takes as input the four standard MRI sequences—T1WI, T2WI, T1CE, and FLAIR—along with the corresponding WSI from the same patient. The framework is designed to learn cross-modal semantic correspondences between macroscopic imaging phenotypes and microscopic tissue architectures by aligning their latent representations within a shared embedding space. This modality alignment strategy mitigates the inherent heterogeneity between imaging and pathology data and enhances the model’s ability to generalize across diverse clinical cohorts.
As illustrated in Figure 4, the proposed contrastive learning framework, CL-IPSA, demonstrates strong performance in the semantic alignment task between MRI and whole-slide images, two intrinsically heterogeneous modalities. To assess the effectiveness of our method, we first established a baseline using a randomly initialized (untrained) version of CL-IPSA. On the validation and test sets, this random baseline achieved Recall@1 of 0.026 and 0.056, respectively, and Recall@5 of 0.132 and 0.278.
After end-to-end training of CL-IPSA and selection of the checkpoint with the highest validation performance, we observed substantial improvements in cross-modal alignment accuracy. The trained model achieved Recall@1 scores of 0.187 and 0.329 on the validation and test sets, respectively, corresponding to approximately 7.2-fold and 5.9-fold improvements over the random baseline. Under the more permissive Recall@5 metric, the model attained scores of 0.603 on the validation set and 0.854 on the test set, representing 4.7-fold and 3.1-fold gains compared with the untrained baseline. In addition, the trained model achieved Mean Reciprocal Rank scores of 0.523 and 0.624 on the validation and test sets, respectively, corresponding to 2.9-fold and 2.1-fold improvements over the untrained model. These results demonstrate that the proposed model effectively learns semantic associations between the two modalities of the same patient through contrastive learning.
To assess the effectiveness of cross-modal alignment in CL-IPSA, we performed a t-SNE visualization of the learned MRI and actual WSI embeddings from the UPENN-GBM test set. In Figure 5, each MRI–WSI pair from the same patient is linked by a line to indicate their shared identity. Critically, the embeddings exhibit a clear pattern consistent with the objective of contrastive learning: representations from the same patient across modalities are pulled into close proximity, while embeddings from different patients, regardless of modality, are well separated in the unified embedding space. This structure demonstrates that CL-IPSA successfully minimizes intra-patient cross-modal distance while maximizing inter-patient separation, thereby learning patient-specific, modality-invariant representations. These results provide visual confirmation that our contrastive framework achieves meaningful semantic alignment even with limited paired data, supporting its potential for clinical deployment.
The results above clearly demonstrate that CL-IPSA can effectively uncover cross-modal semantic associations between MRI and WSI, achieving deep semantic alignment through a contrastive learning mechanism without requiring explicit label supervision. This performance gain not only validates the effectiveness of the proposed contrastive loss function and modality alignment strategy but also highlights the strong generalization ability and practical applicability of our framework in multimodal medical image understanding tasks.
3.3. OS Time Prediction
Building upon the CL-IPSA framework, we further constructed a cross-modal semantic mapping library. For samples with MRI-only data, we performed nearest-neighbor retrieval within this library using cosine similarity to identify the WSI sample whose pathological features are most semantically similar to the given MRI embedding. The retrieved WSI is treated as a proxy pathological representation. We then systematically evaluated the improvement in overall survival (OS) prediction for GBM patients by integrating this proxy representation into several state-of-the-art CNN architectures.
As shown in Table 2, under the unimodal setting using only MRI as input, all tested CNN models demonstrated robust discriminative capability across the validation set and two independent test sets. AUC values on the validation set consistently exceeded 0.7, while those on both test sets remained above 0.6. These results indicate that, despite the inherent complexity and heterogeneity of large-scale clinical data, deep learning models can effectively extract prognostically informative features from MRI alone, enabling overall survival (OS) prediction with meaningful generalization. Specifically, SEResNet34 achieved the best performance on the validation set (AUC = 0.753), highlighting the advantage of its channel-attention mechanism in capturing informative radiological patterns. On Test Set 1, the lightweight MobileNet yielded the highest AUC (0.664), likely due to its efficient feature compression and robustness under limited-sample distributions. On Test Set 2, the multi-task learning architecture MA-MTLN achieved the best performance (AUC = 0.663), demonstrating the effectiveness of its multi-task design.
Furthermore, as shown in Table 3, when both the original MRI data and the proxy WSI representations derived from cross-modal semantic mapping were jointly used as model inputs, all architectures exhibited consistent and substantial performance improvements across all three evaluation sets, with average AUC gains of approximately 0.08–0.10. This finding provides strong evidence for the considerable potential of multimodal learning that integrates radiological and pathological information for GBM survival prediction. Notably, even in the absence of real paired histopathology slides, the proxy representations generated via semantic alignment effectively complemented MRI by supplying missing microscopic tissue information, thereby enhancing the model’s understanding of tumor biology and prognostic cues. We also note that all reported results are based on WSI data normalized with Macenko stain normalization; when training or inference is performed without stain normalization, model performance degrades significantly. Among all compared models, SEResNet34 again demonstrated the best overall performance, achieving AUCs of 0.836, 0.736, and 0.724 on the validation set, Test Set 1, and Test Set 2, respectively, significantly outperforming all other architectures.
Additionally, Figure 6 displays the ROC curves for all evaluated models under both unimodal (MRI-only) and bimodal (MRI combined with proxy whole-slide image features) input conditions across three cohorts: the internal validation set and two external test sets. In each cohort, the ROC curves under the bimodal setting consistently lie above and to the left of their unimodal counterparts, indicating a clear improvement in model discriminative ability. The AUC increases uniformly across different architectures, with gains ranging from approximately 0.07 to 0.11, highlighting the added value of the cross-modal semantic mapping strategy. Notably, these performance improvements are maintained not only in the development cohort but also in geographically and technically distinct external datasets, underscoring the robustness and generalizability of the proposed multimodal fusion approach.
To more intuitively demonstrate the benefits of multimodal feature extraction over relying solely on MRI for GBM prognosis prediction, we plotted the average objective evaluation metrics across all models on the validation set and two independent test sets using radar charts. As shown in Figure 7, these charts provide a comprehensive overview of model performance and clearly highlight the substantial gains in predictive accuracy enabled by multimodal integration. This representation emphasizes the improved reliability and effectiveness of prognosis achieved by the multimodal strategy as a whole, rather than the performance of any individual model in isolation.
3.4. Model Visualization
To better understand which histopathological features the CL-IPSA model relies on for its predictions, we performed attention visualization on hematoxylin and eosin (H&E)-stained whole-slide images.
Figure 8 shows H&E-stained whole-slide images from two glioblastoma patients at 20× magnification, each divided into a 4 × 4 grid of 4096 × 4096 pixel patches. The top row displays the original histopathology, revealing hallmark features of glioblastoma such as high cellular density, nuclear atypia, microvascular proliferation, and necrotic zones. The bottom row presents multi-head attention heatmaps generated by the CL-IPSA model, where red indicates regions receiving high attention weights.
Biologically, these attention patterns highlight areas with pronounced pathological abnormalities, particularly dense tumor cell clusters, pseudopalisading necrosis, and regions of abnormal vascular growth, while assigning low attention to stromal or less cellular background tissue. This selective focus reflects the model’s learned sensitivity to morphological hallmarks that are intrinsically linked to tumor aggressiveness. For example, a high nuclear-to-cytoplasmic ratio signals rapid cellular proliferation, perinecrotic palisading results from hypoxia-driven tumor cell migration, and microvascular proliferation reflects active angiogenesis. By concentrating computational resources on these biologically active zones, the attention mechanism provides a meaningful summary of the most prognostically relevant tissue structures within the H&E slide, enhancing both interpretability and the model’s ability to capture tumor heterogeneity directly from histology.
Furthermore, in this study, Grad-CAM was employed to visualize the regions of MRI images that contributed most strongly to the patient prognosis prediction task. The results indicate that the model predominantly attends to tumor necrotic regions. For cases UCSF-PDGM-0137 (proxy pathology TCGA-06-0122) and UCSF-PDGM-0150 (proxy pathology TCGA-06-0138), the activation maps consistently emphasize areas of necrosis. This observation suggests that imaging characteristics of necrosis, including signal heterogeneity, poorly defined margins, and atypical contrast enhancement patterns, are important predictors of adverse clinical outcomes. These findings are consistent with established pathological evidence linking tumor necrosis to aggressive tumor behavior and poor prognosis. Moreover, they support the integration of imaging features with histopathological information to achieve more interpretable and reliable prognostic models.
4. Discussion
The CL-IPSA framework introduced in this study enables effective synergy between imaging and pathology for GBM prognostic modeling. Its value lies not only in enhancing predictive performance but also in establishing a new paradigm for multimodal medical artificial intelligence: during training, paired multimodal data are leveraged to construct a shared cross-modal semantic space; during inference, only preoperative MRI is required, with the missing pathological modality represented via a retrieval-based proxy. This strategy effectively addresses real-world clinical constraints, such as limited access to tissue samples and delays in histopathological reporting, thereby substantially improving the practicality and deployability of the model.
Methodologically, the effectiveness of CL-IPSA arises from two key innovations. First, a symmetric InfoNCE loss function enforces bidirectional alignment between MRI and WSI embeddings, ensuring that both modalities contribute equally to the shared representation space. Second, a Transformer-based attention aggregator enables efficient and learnable patch-level feature compression for WSIs, selectively preserving the most discriminative histological regions. Notably, even with a relatively small number of paired cases in the training set, the model exhibits strong generalization across large-scale unimodal datasets, indicating that contrastive learning captures biologically meaningful patterns shared across patients and institutions rather than overfitting to individual examples. To demonstrate the role of CL-IPSA in cross-modal semantic alignment, we compare the performance of patient prognosis prediction using WSI and MRI inputs with a trained WSI projection head against that with a randomly initialized projection head. The detailed results are provided in Table 3 and Table 4. The results indicate that WSI features processed through random projection primarily introduce noise and do not contribute to improved prognostic prediction, confirming that meaningful cross-modal integration relies on a learned, rather than random, WSI embedding.
Compared with conventional multimodal fusion approaches—such as feature concatenation, early or late fusion, and cross-attention architectures—which typically require complete multimodal inputs at inference time, the CL-IPSA framework addresses a critical clinical limitation: the frequent unavailability of pathology data when only imaging is accessible. By learning a cross-modal semantic mapping during training and retrieving reliable proxy pathology representations at test time, our method extends the benefits of multimodal modeling to a much broader patient population. Moreover, unlike recent approaches that synthesize virtual pathology images using generative models (e.g., GANs or diffusion models), our framework avoids the inherent risks of synthetic histology—including compromised structural realism, reduced cellular fidelity, and limited pathological plausibility—by directly leveraging semantic embeddings derived from real WSIs. This design substantially enhances the model’s reliability, stability, and clinical interpretability.
The proposed approach also offers several promising translational applications. First, it enables preoperative prognostic assessment that approaches the accuracy of combined imaging–pathology analysis, thereby supporting neurosurgeons in determining the extent of resection, selecting adjuvant therapies, and making clinical trial enrollment decisions. Second, for patients with recurrent GBM who are unable to undergo repeat biopsies, longitudinal MRI scans can be used to dynamically update proxy pathology features, providing a non-invasive means to monitor tumor evolution. Third, in resource-limited settings lacking neuropathology expertise, a cloud-based cross-modal library can offer proxy pathology consultations, helping to reduce disparities in diagnostic access and quality.
To assess the robustness of proxy feature retrieval, we evaluated the impact of varying the number of nearest neighbors k in the k-NN search. We fused the top-k proxy pathological features, weighted by similarity, with MRI features to predict survival and computed the average AUC across eight models for k = 1, 3, 5, and 10. Performance peaked at k = 1 and declined steadily as k increased, indicating that additional neighbors introduce noise rather than useful information. This trend was consistent across both internal and external test sets, suggesting that the single most similar proxy WSI captures the dominant prognostic signal. The AUC values across all datasets as a function of k are summarized in Figure 9, which illustrates the performance degradation with increasing k. Thus, using k = 1 maximizes both accuracy and stability, supporting its suitability for reliable clinical application.
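As an illustration of this ablation, a similarity-weighted top-k proxy can be formed as in the sketch below; the softmax weighting is one plausible choice and is not necessarily the exact scheme used in our experiments.

```python
import torch

def topk_weighted_proxy(mri_query_emb, wsi_bank, k=3):
    """Similarity-weighted average of the k most similar library WSI
    embeddings, used to probe robustness to the choice of k
    (in this study, k = 1 performed best)."""
    sims = mri_query_emb @ wsi_bank.T                 # (1, K) cosine similarities
    vals, idx = sims.topk(k, dim=1)
    weights = torch.softmax(vals, dim=1)              # similarity-based weights
    return (weights.unsqueeze(-1) * wsi_bank[idx]).sum(dim=1)
```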
Nevertheless, several limitations remain. First, the quality of proxy WSI representations depends critically on the diversity and coverage of paired cases in the training cohort; rare histological subtypes absent from the training data may lead to mismatched retrievals and biased predictions. Second, the current framework does not explicitly incorporate molecular biomarkers, such as MGMT promoter methylation status or IDH mutation, that are well-established determinants of GBM prognosis. Future work could integrate genomic data to build a unified “imaging–pathology–genomics” prognostic model. Third, although stain normalization mitigates batch effects in WSI preprocessing, systematic variations arising from different tissue-processing protocols (e.g., frozen vs. paraffin sections) may still introduce domain shifts, necessitating more robust domain adaptation strategies. Fourth, this study employs a binary classification framework to predict 12-month overall survival (OS), which facilitates model evaluation but oversimplifies the continuity and complexity of patient prognosis. Future work will explore more flexible modeling strategies, such as continuous survival regression or multi-level risk stratification. Fifth, our dataset for overall survival prediction based on MRI remains limited in size. Moreover, we employed a binary classification approach rather than continuous-time survival models such as Cox regression; addressing the class imbalance through resampling techniques may further enhance classification accuracy [28]. Sixth, during the construction of CL-IPSA we employed the ResNet34 model. Although this network demonstrates strong learning capability and stability in processing 3D medical images, future research should compare it with other models, such as VGG and DenseNet, to identify the optimal architecture.
Another limitation of this study is that the acquisition of surrogate WSI features relies on cross-modal semantic retrieval, whose biological equivalence has not yet been systematically validated by pathology experts. Although the contrastive learning mechanism and attention visualizations indirectly support the plausibility of semantic alignment, and the observed performance gains are consistent across multiple datasets, whether the retrieved WSIs are truly similar to the query MRI in terms of key histological characteristics—such as cellular architecture, necrotic patterns, or microvascular proliferation—remains to be further evaluated by experts. Future work will incorporate blinded assessments of retrieval results by pathologists and integrate molecular pathological biomarkers to more rigorously validate the biological basis of cross-modal mapping, thereby enhancing the interpretability and clinical credibility of the model.
While we observed optimal performance with a temperature parameter of $\tau = 0.07$ in our current experiments, this value was selected from a limited set of candidates. Given the unique characteristics of medical images, such as lower inter-class variance and domain-specific textures, the optimal temperature for contrastive learning may differ significantly from that in natural image domains. In future work, we plan to conduct a more comprehensive and systematic investigation of the temperature parameter, including broader grid searches and adaptive temperature strategies, to better tailor contrastive learning to the nuances of medical imaging data.
Although the proposed cross-modal alignment framework demonstrates significant advantages in survival prediction, the learned latent representations have not yet been systematically correlated with key molecular biomarkers, such as IDH mutation status, MGMT promoter methylation, or Ki-67 proliferation index. Therefore, it remains uncertain whether the model captures biologically meaningful signals or merely statistical associations. Future work will integrate multi-omics data to validate whether this latent space genuinely reflects the intrinsic molecular subtypes of glioblastoma, thereby advancing non-invasive imaging signatures toward interpretable, mechanism-driven tools for precision prognosis.
An emerging clinical framework that could benefit from integration with our approach is the Brain Tumor Reporting and Data System (BT-RADS) [29], which standardizes radiological reporting and risk stratification for glioma patients based on MRI features and clinical context. While our current model focuses on overall survival prediction, its learned MRI embeddings and proxy histopathological representations may capture imaging phenotypes that align with BT-RADS categories, particularly those related to tumor aggressiveness, treatment response, or progression risk. Future work should explore whether baseline multimodal embeddings derived from CL-IPSA can serve as quantitative imaging biomarkers predictive of BT-RADS classification at diagnosis or, more importantly, of BT-RADS category evolution during longitudinal follow-up. Such validation would not only enhance the clinical interpretability of our model but also support its potential use in dynamic risk assessment and standardized reporting workflows, bridging AI-driven prognostication with structured clinical decision-making systems.
Furthermore, although our framework already bridges MRI and WSI, it could be further enhanced by incorporating intraoperative optical techniques such as Optical Coherence Tomography (OCT) and Raman Spectroscopy. These methods provide real-time, mesoscale information on tissue microstructure and molecular composition, effectively filling the resolution gap between MRI and WSI. Integrating such data into the retrieval or fusion module during surgery could refine the proxy pathological signatures, yielding more accurate prognostic predictions and offering immediate biological feedback to guide resection. This multimodal approach would strengthen both clinical relevance and bioengineering impact.
Although the CL-IPSA framework achieved a notable improvement of 0.08 in AUC in this study, its clinical relevance should be interpreted in the context of specific decision thresholds. In the preoperative setting of glioblastoma, this improvement can translate into more accurate risk stratification. For patients near therapeutic decision boundaries, such as those of advanced age or with poor performance status, a model with an AUC of 0.75 may provide uncertain prognostic estimates, whereas an AUC of 0.83 allows more confident identification of poor prognosis. This improved discriminatory performance can support multidisciplinary clinical decision-making, helping clinicians balance aggressive surgical resection against palliative treatment strategies. Thus, the 0.08 gain in AUC reflects a meaningful improvement in the clinical utility and reliability of prognostic predictions rather than a purely numerical advance.
In summary, CL-IPSA represents not only a technical advance but also a significant step toward the vision of “non-invasive precision pathology.” Future efforts will focus on extending the framework to other solid tumors (e.g., breast or lung cancer), incorporating dynamic treatment response data, and conducting prospective clinical validation—to ultimately realize an AI-driven, personalized neuro-oncology care loop.